The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations. The format was developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data, such as General Feature Format (GFF), stored all of the genetic data, much of which is redundant because it is shared across genomes. With VCF, only the variations need to be stored, along with a reference genome.
The standard is currently in version 4.3, although the 1000 Genomes Project has developed its own specification for structural variations such as duplications, which are not easily accommodated in the existing schema. A set of tools is also available for editing and manipulating the files.
##fileformat=VCFv4.0
##fileDate=20110705
##reference=1000GenomesPilot-NCBI37
##phasing=partial
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=dbSNP membership, build 129">
##INFO=
##FILTER=
##FILTER=
##FORMAT=
##FORMAT=
##FORMAT=
##FORMAT=
#CHROM POS    ID        REF  ALT     QUAL FILTER INFO                              FORMAT      Sample1        Sample2        Sample3
2      4370   rs6057    G    A       29   .      NS=2;DP=13;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:52,51 1|0:48:8:51,51 1/1:43:5:.,.
2      7330   .         T    A       3    q10    NS=5;DP=12;AF=0.017               GT:GQ:DP:HQ 0|0:46:3:58,50 0|1:3:5:65,3   0/0:41:3
2      110696 rs6055    A    G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
2      130237 .         T    .       47   .      NS=2;DP=16;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:56,51 0/0:61:2
2      134567 microsat1 GTCT G,GTACT 50   PASS   NS=2;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

chr1    45796269        .       G       C
chr1    45797505        .       C       G
chr1    45798555        .       T       C
chr1    45798901        .       C       T
chr1    45805566        .       G       C
chr2    47703379        .       C       T
chr2    48010488        .       G       A
chr2    48030838        .       A       T
chr2    48032875        .       CTAT    -
chr2    48032937        .       T       C
chr2    48033273        .       TTTTTGTTTTAATTCCT       -
chr2    48033551        .       C       G
chr2    48033910        .       A       T
chr2    215632048       .       G       T
chr2    215632125       .       TT      -
chr2    215632155       .       T       C
chr2    215632192       .       G       A
chr2    215632255       .       CA      TG
chr2    215634055       .       C       T

The header begins the file and provides metadata describing the body of the file. Header lines are denoted as starting with #. Special keywords in the header are denoted with ##. Recommended keywords include fileformat, fileDate and reference.
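The distinction between ## keyword lines, the single # column line, and the body can be illustrated with a short Python sketch. This is a minimal illustration and not a full parser: the function name split_vcf is hypothetical, and the sample text is an abbreviated version of the example above that uses spaces where a real file would use tabs.

```python
def split_vcf(text):
    """Split VCF text into (meta, columns, records).

    Hypothetical helper for illustration only.
    """
    meta = {}      # keyword -> list of raw values from '##keyword=value' lines
    columns = []   # column names taken from the single '#CHROM ...' line
    records = []   # body lines, left unparsed here
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("##"):
            # Keyword line: everything before the first '=' is the keyword.
            key, _, value = line[2:].partition("=")
            meta.setdefault(key, []).append(value)
        elif line.startswith("#"):
            # Column line: drop the leading '#' and split on whitespace.
            columns = line[1:].split()
        else:
            records.append(line)
    return meta, columns, records


# Abbreviated example; real files separate columns with tabs.
example = """##fileformat=VCFv4.0
##fileDate=20110705
##reference=1000GenomesPilot-NCBI37
##phasing=partial
#CHROM POS ID REF ALT QUAL FILTER INFO
2 4370 rs6057 G A 29 . NS=2;DP=13;AF=0.5;DB;H2
"""

meta, columns, records = split_vcf(example)
print(meta["fileformat"])  # ['VCFv4.0']
print(columns[:3])         # ['CHROM', 'POS', 'ID']
print(len(records))        # 1
```

Values are stored in lists because keywords such as INFO, FILTER and FORMAT can occur on several header lines.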
The header can also contain keywords that describe, both semantically and syntactically, the fields used in the body of the file, notably INFO, FILTER, and FORMAT (see below).
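As a rough sketch of how such a descriptive header line might be read, the Python snippet below parses one line of the form ##INFO=<ID=...,Description="...">. The function name parse_structured_header is hypothetical. The sample line is an assumption: it combines the description text "dbSNP membership, build 129" with the ID, Number and Type attributes that the VCF specification defines for such lines.

```python
import re

# key=value pairs; a value is either a double-quoted string or runs up to the next comma.
_PAIR = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_structured_header(line):
    """Parse '##KIND=<key=value,...>' into (kind, attrs). Illustrative only."""
    match = re.fullmatch(r'##(\w+)=<(.*)>', line.strip())
    if match is None:
        raise ValueError("not a structured header line: %r" % line)
    kind, body = match.groups()
    # Strip surrounding quotes so Description holds the bare text.
    attrs = {key: value.strip('"') for key, value in _PAIR.findall(body)}
    return kind, attrs


# Assumed sample line, not copied from a real file.
line = '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">'
kind, attrs = parse_structured_header(line)
print(kind)                  # INFO
print(attrs["ID"])           # DB
print(attrs["Description"])  # dbSNP membership, build 129
```

The quoted alternative in the pattern is tried first, so a comma inside a description does not split the value.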
The columns of a VCF
The body of a VCF file follows the header and is tab-separated into 8 mandatory columns and an unlimited number of optional columns that may be used to record other information about the sample(s). When additional columns are used, the first optional column describes the format of the data in the columns that follow.
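The column layout can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: parse_record and the SAMPLES key are hypothetical names, the eight column names are taken from the #CHROM line of the example, and all values other than POS are left as strings.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_record(line, sample_names):
    """Parse one tab-separated body line into a dict. Illustrative only."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(FIXED_COLUMNS):
        raise ValueError("expected at least 8 tab-separated columns")
    record = dict(zip(FIXED_COLUMNS, fields))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # one entry per alternate allele
    samples = {}
    if len(fields) > 8:
        # The first optional column (FORMAT) names the colon-separated
        # sub-fields used in every sample column that follows.
        keys = fields[8].split(":")
        for name, value in zip(sample_names, fields[9:]):
            # zip stops at the shorter sequence, so dropped trailing
            # sub-fields are simply absent from the result.
            samples[name] = dict(zip(keys, value.split(":")))
    record["SAMPLES"] = samples  # hypothetical key, not part of the format
    return record


line = "\t".join([
    "2", "110696", "rs6055", "A", "G,T", "67", "PASS",
    "NS=2;DP=10;AF=0.333,0.667;AA=T;DB",
    "GT:GQ:DP:HQ", "1|2:21:6:23,27", "2|1:2:0:18,2", "2/2:35:4",
])
rec = parse_record(line, ["Sample1", "Sample2", "Sample3"])
print(rec["POS"])                 # 110696
print(rec["ALT"])                 # ['G', 'T']
print(rec["SAMPLES"]["Sample3"])  # {'GT': '2/2', 'GQ': '35', 'DP': '4'}
```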
Arbitrary keys are permitted in the INFO column, although the following sub-fields are reserved (albeit optional):
AA ancestral allele
AC allele count in genotypes, for each ALT allele, in the same order as listed
AF allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
AN total number of alleles in called genotypes
BQ RMS base quality at this position
CIGAR CIGAR string describing how to align an alternate allele to the reference allele
DB dbSNP membership
DP combined depth across samples, e.g. DP=154
END end position of the variant described in this record (for use with symbolic alleles)
H2 membership in HapMap2
H3 membership in HapMap3
MQ RMS mapping quality, e.g. MQ=52
MQ0 number of MAPQ == 0 reads covering this record
NS number of samples with data
SB strand bias at this position
SOMATIC indicates that the record is a somatic mutation, for cancer genomics
VALIDATED validated by follow-up experiment
1000G membership in 1000 Genomes
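As a rough illustration of how these keys appear in practice, the Python sketch below splits a semicolon-separated INFO string, such as those in the example rows, into a dictionary. The function name parse_info is hypothetical. The sketch assumes that keys without a value (DB, H2 and similar membership keys) are flags, and it keeps all values as strings instead of converting them to numbers.

```python
def parse_info(info):
    """Parse an INFO string such as 'NS=2;DP=13;AF=0.5;DB;H2'. Illustrative only."""
    result = {}
    if info in ("", "."):
        return result  # '.' marks a missing value
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if not sep:
            # No '=' present: treat the key as a flag, e.g. DB or H2.
            result[key] = True
        else:
            # Comma-separated values (one per ALT allele for AF) become a list.
            parts = value.split(",")
            result[key] = parts if len(parts) > 1 else parts[0]
    return result


info = parse_info("NS=2;DP=10;AF=0.333,0.667;AA=T;DB")
print(info["AF"])    # ['0.333', '0.667']
print(info["DB"])    # True
print("H2" in info)  # False
```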