Annotation Configuration
Annotation config files follow the INI file format and specify a list of
annotators and their options in the form of sections and properties.
The annotation.conf file, which can be found in the data repository, is
the default annotation config.
Each annotator is described in a single section. The name of the section
must be unique, but since it does not influence the annotation process, it
can be chosen freely. Lines starting with # are ignored.
All annotators are configured in a similar manner, using three main
properties: annotator, options and columns. The annotator property has a
single value, while the other two properties have sub-properties, which are
separated from the property name with a dot (.).
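As a rough illustration of this layout, the sketch below groups the flat properties of one section into the three parts. The function name split_properties and the shape of the returned dictionary are hypothetical and are not taken from the annotation pipeline's own code.

```python
def split_properties(section):
    """Group a section's flat properties into annotator, options and columns."""
    grouped = {"annotator": None, "options": {}, "columns": {}}
    for key, value in section.items():
        if key == "annotator":
            grouped["annotator"] = value
            continue
        # Split on the first dot only: "columns.raw" -> ("columns", "raw").
        prefix, _, sub_property = key.partition(".")
        if prefix in ("options", "columns") and sub_property:
            grouped[prefix][sub_property] = value
    return grouped


# Properties as they might appear in a single section (no options set here).
section = {
    "annotator": "score_annotator.NPScoreAnnotator",
    "columns.raw": "cadd_raw",
    "columns.phred": "cadd_phred",
}
result = split_properties(section)
print(result["annotator"])
print(result["columns"])
```

Properties that are neither annotator nor prefixed with options. or columns. are skipped by this sketch; whether the real pipeline rejects or ignores them is not stated here.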
Example Configuration
[DEFAULT]
GENOMIC_SCORES_HG19 = %(scores_hg19_dir)s
###############################
[Step-CADD-Score]
annotator = score_annotator.NPScoreAnnotator
options.scores_file = %(GENOMIC_SCORES_HG19)s/CADD/CADD.bedgraph.gz
columns.raw = cadd_raw
columns.phred = cadd_phred
###############################
[Step-MPC-Score]
annotator = score_annotator.NPScoreAnnotator
options.scores_file = %(GENOMIC_SCORES_HG19)s/MPC/fordist_constraint_official_mpc_values_v2.txt.gz
columns.MPC = mpc
#######################################
[gnomAD genome Frequency]
annotator = frequency_annotator.FrequencyAnnotator
options.scores_file = %(GENOMIC_SCORES_HG19)s/gnomAD_genome/gnomad.genomes.r2.1.sites.tsv.gz
columns.AF = genome_gnomad_af
columns.AF_percent = genome_gnomad_af_percent
columns.AC = genome_gnomad_ac
columns.AN = genome_gnomad_an
columns.controls_AC = genome_gnomad_controls_ac
columns.controls_AN = genome_gnomad_controls_an
columns.controls_AF = genome_gnomad_controls_af
columns.non_neuro_AC = genome_gnomad_non_neuro_ac
columns.non_neuro_AN = genome_gnomad_non_neuro_an
columns.non_neuro_AF = genome_gnomad_non_neuro_af
columns.controls_AF_percent = genome_gnomad_controls_af_percent
columns.non_neuro_AF_percent = genome_gnomad_non_neuro_af_percent
#######################################
[gnomAD exome Frequency]
annotator = frequency_annotator.FrequencyAnnotator
options.scores_file = %(GENOMIC_SCORES_HG19)s/gnomAD_exome/gnomad.exomes.r2.1.sites.tsv.gz
columns.AF = exome_gnomad_af
columns.AF_percent = exome_gnomad_af_percent
columns.AC = exome_gnomad_ac
columns.AN = exome_gnomad_an
columns.controls_AC = exome_gnomad_controls_ac
columns.controls_AN = exome_gnomad_controls_an
columns.controls_AF = exome_gnomad_controls_af
columns.non_neuro_AC = exome_gnomad_non_neuro_ac
columns.non_neuro_AN = exome_gnomad_non_neuro_an
columns.non_neuro_AF = exome_gnomad_non_neuro_af
columns.controls_AF_percent = exome_gnomad_controls_af_percent
columns.non_neuro_AF_percent = exome_gnomad_non_neuro_af_percent
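The %(name)s references in the example match the basic interpolation syntax of Python's configparser module. The sketch below shows how such a reference could be resolved for the first section. It assumes that scores_hg19_dir is supplied from outside the file; the directory used here is a made-up placeholder, and whether the pipeline itself reads the file with configparser is also an assumption.

```python
import configparser

CONFIG_TEXT = """
[DEFAULT]
GENOMIC_SCORES_HG19 = %(scores_hg19_dir)s
###############################
[Step-CADD-Score]
annotator = score_annotator.NPScoreAnnotator
options.scores_file = %(GENOMIC_SCORES_HG19)s/CADD/CADD.bedgraph.gz
columns.raw = cadd_raw
columns.phred = cadd_phred
"""

# Placeholder value for scores_hg19_dir (an assumption made for this sketch).
parser = configparser.ConfigParser(
    defaults={"scores_hg19_dir": "/data/genomic-scores-hg19"}
)
# configparser lowercases property names by default; keep them case-sensitive
# so that names such as columns.AF are preserved.
parser.optionxform = str
parser.read_string(CONFIG_TEXT)

section = parser["Step-CADD-Score"]
scores_file = section["options.scores_file"]
print(scores_file)
```

Note that values from the DEFAULT section are visible in every other section, so code that iterates over a section's properties would need to filter on the annotator, options. and columns. names.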
annotator
annotator = <annotator python file name>.<annotator class name>
This property indicates the type of the annotator.
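The value can be read as a module name and a class name joined by a dot. The following sketch splits such a value and, as one possible approach, loads the class with importlib. The helper names split_annotator and load_class are hypothetical, and the pipeline's actual loading mechanism may differ.

```python
import importlib


def split_annotator(value):
    """Split '<file name>.<class name>' at the last dot."""
    module_name, _, class_name = value.rpartition(".")
    if not module_name or not class_name:
        raise ValueError(f"expected '<file name>.<class name>', got {value!r}")
    return module_name, class_name


def load_class(value):
    """Import the module and return the named class (illustrative only)."""
    module_name, class_name = split_annotator(value)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


parts = split_annotator("score_annotator.NPScoreAnnotator")
print(parts)

# A standard-library class stands in for an annotator so the sketch runs anywhere.
ordered_dict_class = load_class("collections.OrderedDict")
print(ordered_dict_class.__name__)
```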
Value | Description |
---|---|
annotator_base.CopyAnnotator | Duplicates the given columns. |
cleanup_annotator.CleanupAnnotator | Removes the given columns from the output file. |
dbnsfp_annotator.dbNSFPAnnotator | Annotate variants using scores from dbNSFP. |
effect_annotator.VariantEffectAnnotator | Annotate variants with their effects. |
frequency_annotator.FrequencyAnnotator | Annotate variants with a frequency score. |
score_annotator.PositionScoreAnnotator | Annotate variants with a score file by the position/location of the variant. |
score_annotator.PositionMultiScoreAnnotator | Identical in function to PositionScoreAnnotator, but uses a directory with multiple score files. |
score_annotator.NPScoreAnnotator | Annotate variants with a score file by the location and the type of the variant. |
lift_over_annotator.LiftOverAnnotator | Create a column with the variant’s location lifted over. |
vcf_info_extractor.VCFInfoExtractor | Extract key-value pairs from a VCF file’s INFO column as separate columns. |
options.*
options.<option name> = value
These are custom options that will be passed to the annotator. Each annotator provides different options that can be set.
Option | Used by | Description |
---|---|---|
scores_file | Variant annotators | The absolute path to the score file. |
scores_config_file | Variant annotators | The absolute path to the score configuration file. |
scores_directory | PositionMultiScoreAnnotator | The absolute path to the directory containing the score files and their configs. |
dbNSFP_path | dbNSFPAnnotator | The absolute path to the directory holding dbNSFP files, separated by chromosome. |
dbNSFP_filename | dbNSFPAnnotator | A glob-like pattern of the generic dbNSFP file’s name (e.g. dbNSFP_chr*). |
dbNSFP_config | dbNSFPAnnotator | The name (not absolute path) of the score config inside the dbNSFP directory. |
Graw | VariantEffectAnnotator | The absolute path to the genome file. |
Traw | VariantEffectAnnotator | The absolute path to the gene models file. |
chain_file | LiftOverAnnotator | The absolute path to the liftover chain to be used. |
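To illustrate what a glob-like pattern such as dbNSFP_chr* selects, the sketch below applies Python's fnmatch module to a made-up list of file names. How the annotator itself resolves the pattern is not specified here, so this is only an approximation of the idea.

```python
import fnmatch

# Hypothetical directory listing; real dbNSFP file names may differ.
file_names = ["dbNSFP_chr1", "dbNSFP_chr2", "dbNSFP_chrX", "README.txt"]
pattern = "dbNSFP_chr*"

# Keep only the names that match the pattern, in their original order.
matching = fnmatch.filter(file_names, pattern)
print(matching)
```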
columns.*
columns.<raw/original column name> = <output column name>
This property describes both which columns are added to the output file and what their names will be. The pool of available columns is determined by the annotator: for example, a copy annotator’s pool is the input file’s own columns, while an annotator that uses a score file has the score file’s columns available.
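The mapping from available columns to output columns can be pictured with the sketch below. The function apply_columns and the sample record are hypothetical and use made-up values; they only illustrate the selection and renaming described above.

```python
def apply_columns(available, columns):
    """Pick the configured columns from `available` and rename them for output."""
    output = {}
    for source_name, output_name in columns.items():
        if source_name not in available:
            raise KeyError(f"column {source_name!r} is not provided by the annotator")
        output[output_name] = available[source_name]
    return output


# Hypothetical score-file record; only 'raw' and 'phred' are requested.
score_record = {"chrom": "1", "pos": 10001, "raw": 0.37, "phred": 5.2}
# Equivalent to: columns.raw = cadd_raw and columns.phred = cadd_phred
columns = {"raw": "cadd_raw", "phred": "cadd_phred"}

annotated = apply_columns(score_record, columns)
print(annotated)
```

Columns in the pool that are not listed, such as chrom and pos in this made-up record, are simply not added to the output.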
The following are some special cases, used by certain annotators.
Columns | Used by | Description |
---|---|---|
columns.cleanup | CleanupAnnotator | Comma-separated list of columns to remove. |
columns.* | CopyAnnotator | The pool of available columns is all columns in the input file. |
columns.* | VCFInfoExtractor | The pool of available columns is all keys in the INFO column. |
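As a small illustration of the columns.cleanup case, the sketch below removes a comma-separated list of columns from a row. The function name cleanup, the row contents and the column names are all hypothetical, and the real CleanupAnnotator may handle whitespace or unknown names differently.

```python
def cleanup(row, cleanup_value):
    """Drop the columns named in a comma-separated list such as 'a, b'."""
    to_remove = {name.strip() for name in cleanup_value.split(",") if name.strip()}
    return {name: value for name, value in row.items() if name not in to_remove}


# Hypothetical row with made-up column names and values.
row = {"location": "1:10001", "variant": "sub(A->T)", "cadd_raw": 0.37, "tmp_id": 7}

# Equivalent to: columns.cleanup = tmp_id, variant
cleaned = cleanup(row, "tmp_id, variant")
print(cleaned)
```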