Annotation Configuration

Annotation config files use the INI file format: each file lists annotators and their options as sections and properties. The default annotation config is the annotation.conf file in the data repository.

Each annotator is described in a single section. The name of the section must be unique, but since it does not influence the annotation process, it can be chosen freely. Lines starting with # are ignored.
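The section rules above can be tried out with Python's standard configparser module. This is only a sketch: whether the annotation pipeline itself reads its config with configparser is an assumption, and the section contents are trimmed-down illustrations.

```python
import configparser

# Two annotator sections; the names are free-form but must differ.
CONFIG_TEXT = """
# This line is a comment and is ignored.
[Step-CADD-Score]
annotator = score_annotator.NPScoreAnnotator

[Step-MPC-Score]
annotator = score_annotator.NPScoreAnnotator
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
section_names = parser.sections()

# Repeating a section name is rejected by configparser's default strict mode.
DUPLICATED = CONFIG_TEXT + "\n[Step-CADD-Score]\nannotator = annotator_base.CopyAnnotator\n"
try:
    configparser.ConfigParser().read_string(DUPLICATED)
    duplicate_rejected = False
except configparser.DuplicateSectionError:
    duplicate_rejected = True
```

Here section_names holds the two section names in file order, and the duplicated section triggers a DuplicateSectionError.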

All annotators are configured in a similar manner, using three main properties: annotator, options and columns. The annotator property has a single value, while options and columns have sub-properties, written after a dot (for example, options.scores_file or columns.raw).
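To illustrate how the dotted property names group into the three main properties, here is a minimal sketch. It assumes the section is read with Python's configparser; the helper split_properties and the path /data/CADD/CADD.bedgraph.gz are hypothetical and not part of the annotation pipeline.

```python
import configparser

CONFIG_TEXT = """
[Step-CADD-Score]
annotator = score_annotator.NPScoreAnnotator
options.scores_file = /data/CADD/CADD.bedgraph.gz
columns.raw = cadd_raw
columns.phred = cadd_phred
"""

parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep the case of property names
parser.read_string(CONFIG_TEXT)


def split_properties(section):
    """Hypothetical helper: group a section into annotator, options and columns."""
    annotator = None
    groups = {"options": {}, "columns": {}}
    for key, value in section.items():
        if key == "annotator":
            annotator = value
            continue
        # "options.scores_file" -> prefix "options", name "scores_file"
        prefix, _, name = key.partition(".")
        if prefix in groups and name:
            groups[prefix][name] = value
    return annotator, groups["options"], groups["columns"]


annotator, options, columns = split_properties(parser["Step-CADD-Score"])
```

After this runs, options maps scores_file to its path and columns maps raw and phred to cadd_raw and cadd_phred.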

Example Configuration

[DEFAULT]
GENOMIC_SCORES_HG19 = %(scores_hg19_dir)s

###############################
[Step-CADD-Score]

annotator = score_annotator.NPScoreAnnotator

options.scores_file = %(GENOMIC_SCORES_HG19)s/CADD/CADD.bedgraph.gz

columns.raw = cadd_raw
columns.phred = cadd_phred

###############################
[Step-MPC-Score]

annotator = score_annotator.NPScoreAnnotator

options.scores_file = %(GENOMIC_SCORES_HG19)s/MPC/fordist_constraint_official_mpc_values_v2.txt.gz

columns.MPC = mpc

#######################################
[gnomAD genome Frequency]

annotator = frequency_annotator.FrequencyAnnotator

options.scores_file = %(GENOMIC_SCORES_HG19)s/gnomAD_genome/gnomad.genomes.r2.1.sites.tsv.gz

columns.AF = genome_gnomad_af
columns.AF_percent = genome_gnomad_af_percent

columns.AC = genome_gnomad_ac
columns.AN = genome_gnomad_an
columns.controls_AC = genome_gnomad_controls_ac
columns.controls_AN = genome_gnomad_controls_an
columns.controls_AF = genome_gnomad_controls_af
columns.non_neuro_AC = genome_gnomad_non_neuro_ac
columns.non_neuro_AN = genome_gnomad_non_neuro_an
columns.non_neuro_AF = genome_gnomad_non_neuro_af
columns.controls_AF_percent = genome_gnomad_controls_af_percent
columns.non_neuro_AF_percent = genome_gnomad_non_neuro_af_percent

#######################################
[gnomAD exome Frequency]

annotator = frequency_annotator.FrequencyAnnotator

options.scores_file = %(GENOMIC_SCORES_HG19)s/gnomAD_exome/gnomad.exomes.r2.1.sites.tsv.gz

columns.AF = exome_gnomad_af
columns.AF_percent = exome_gnomad_af_percent

columns.AC = exome_gnomad_ac
columns.AN = exome_gnomad_an
columns.controls_AC = exome_gnomad_controls_ac
columns.controls_AN = exome_gnomad_controls_an
columns.controls_AF = exome_gnomad_controls_af
columns.non_neuro_AC = exome_gnomad_non_neuro_ac
columns.non_neuro_AN = exome_gnomad_non_neuro_an
columns.non_neuro_AF = exome_gnomad_non_neuro_af
columns.controls_AF_percent = exome_gnomad_controls_af_percent
columns.non_neuro_AF_percent = exome_gnomad_non_neuro_af_percent
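The %(name)s references in the example match the value interpolation syntax of Python's configparser. The sketch below shows how such references resolve, under two assumptions: that the config is read with configparser, and that scores_hg19_dir is supplied from outside the file. The directory /data/genomic-scores-hg19 is made up for illustration.

```python
import configparser

CONFIG_TEXT = """
[DEFAULT]
GENOMIC_SCORES_HG19 = %(scores_hg19_dir)s

[Step-CADD-Score]
annotator = score_annotator.NPScoreAnnotator
options.scores_file = %(GENOMIC_SCORES_HG19)s/CADD/CADD.bedgraph.gz
columns.raw = cadd_raw
columns.phred = cadd_phred
"""

parser = configparser.ConfigParser()
# configparser lowercases property names by default; keep them as written.
parser.optionxform = str
# Assumption: scores_hg19_dir comes from outside the file; this path is made up.
parser["DEFAULT"]["scores_hg19_dir"] = "/data/genomic-scores-hg19"
parser.read_string(CONFIG_TEXT)

# Reading the value resolves both levels of %(...)s references.
scores_file = parser["Step-CADD-Score"]["options.scores_file"]
```

With these inputs, scores_file resolves to /data/genomic-scores-hg19/CADD/CADD.bedgraph.gz.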

annotator

annotator = <annotator python file name>.<annotator class name>

This property indicates the type of the annotator.

Value                                        Description
-------------------------------------------  ------------------------------------------------------------
annotator_base.CopyAnnotator                 Duplicates the given columns.
cleanup_annotator.CleanupAnnotator           Removes the given columns from the output file.
dbnsfp_annotator.dbNSFPAnnotator             Annotate variants using scores from dbNSFP.
effect_annotator.VariantEffectAnnotator      Annotate variants with their effects.
frequency_annotator.FrequencyAnnotator       Annotate variants with a frequency score.
score_annotator.PositionScoreAnnotator       Annotate variants with a score file by the position/location of the variant.
score_annotator.PositionMultiScoreAnnotator  Identical in function to PositionScoreAnnotator, but uses a directory with multiple score files.
score_annotator.NPScoreAnnotator             Annotate variants with a score file by the location and the type of the variant.
lift_over_annotator.LiftOverAnnotator        Create a column with the variant’s location lifted over.
vcf_info_extractor.VCFInfoExtractor          Extract key-value pairs from a VCF file’s INFO column as separate columns.
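As a small illustration of the <annotator python file name>.<annotator class name> form, the following sketch splits an annotator value into its two parts. The helper split_annotator is hypothetical; how the pipeline actually locates and loads the class is not described here.

```python
def split_annotator(value):
    """Hypothetical helper: split 'file_name.ClassName' into its two parts."""
    module_name, _, class_name = value.rpartition(".")
    if not module_name or not class_name:
        raise ValueError(f"expected '<file name>.<class name>', got {value!r}")
    return module_name, class_name


module_name, class_name = split_annotator("score_annotator.NPScoreAnnotator")
```

Here module_name is score_annotator and class_name is NPScoreAnnotator; a value without a dot raises ValueError.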

options.*

options.<option name> = value

These are custom options that will be passed to the annotator. Each annotator provides different options that can be set.

Option              Used by                      Description
------------------  ---------------------------  ------------------------------------------------------------
scores_file         Variant annotators           The absolute path to the score file.
scores_config_file  Variant annotators           The absolute path to the score configuration file.
scores_directory    PositionMultiScoreAnnotator  The absolute path to the directory containing the score files and their configs.
dbNSFP_path         dbNSFPAnnotator              The absolute path to the directory holding dbNSFP files, separated by chromosome.
dbNSFP_filename     dbNSFPAnnotator              A glob-like pattern of the generic dbNSFP file’s name (e.g. dbNSFP_chr*).
dbNSFP_config       dbNSFPAnnotator              The name (not absolute path) of the score config inside the dbNSFP directory.
Graw                VariantEffectAnnotator       The absolute path to the genome file.
Traw                VariantEffectAnnotator       The absolute path to the gene models file.
chain_file          LiftOverAnnotator            The absolute path to the liftover chain to be used.
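Most of these options expect absolute paths, so a quick check before running an annotation can catch relative paths early. The sketch below is a hypothetical pre-flight check, not something the annotators are known to perform; the example file names are made up, and POSIX-style paths are assumed.

```python
import posixpath

# Options documented above as absolute paths (dbNSFP_config is a plain name).
ABSOLUTE_PATH_OPTIONS = {
    "scores_file", "scores_config_file", "scores_directory",
    "dbNSFP_path", "Graw", "Traw", "chain_file",
}


def find_relative_paths(options):
    """Return the names of path options whose value is not an absolute path."""
    return sorted(
        name
        for name, value in options.items()
        if name in ABSOLUTE_PATH_OPTIONS and not posixpath.isabs(value)
    )


problems = find_relative_paths({
    "scores_file": "CADD/CADD.bedgraph.gz",   # relative: reported
    "chain_file": "/data/example.chain.gz",   # absolute: accepted
    "dbNSFP_config": "example.conf",          # a name, not a path: skipped
})
```

In this example, problems contains only scores_file.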

columns.*

columns.<raw/original column name> = <output column name>

Each columns.* property both selects a column to add to the output file and sets its name there: the sub-property is the raw (original) column name and the value is the output column name. The pool of available columns is determined by the annotator. For example, a copy annotator’s pool is the input file’s own columns, while an annotator that uses a score file has the score file’s columns available.
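The mapping from raw column names to output column names can be pictured with a small sketch. The helper select_columns and the score row values are hypothetical; they only illustrate the renaming and are not the annotators' actual code.

```python
def select_columns(source_row, columns):
    """Hypothetical helper: pick the configured raw columns and rename them."""
    return {
        output_name: source_row[raw_name]
        for raw_name, output_name in columns.items()
    }


# A made-up row from a score file, keyed by the score file's own column names.
score_row = {"chrom": "1", "pos": "10001", "raw": "0.12", "phred": "4.5"}

# Equivalent to: columns.raw = cadd_raw and columns.phred = cadd_phred
columns = {"raw": "cadd_raw", "phred": "cadd_phred"}

annotated = select_columns(score_row, columns)
```

Only the configured columns appear in annotated, under their output names cadd_raw and cadd_phred.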

The following are some special cases, used by certain annotators.

Columns          Used by           Description
---------------  ----------------  ------------------------------------------------------------
columns.cleanup  CleanupAnnotator  Comma-separated list of columns to remove.
columns.*        CopyAnnotator     The pool of available columns consists of all columns in the input file.
columns.*        VCFInfoExtractor  The pool of available columns consists of all keys in the INFO column.
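As an illustration of the comma-separated columns.cleanup value, the sketch below removes the listed columns from a row. The helper remove_columns and the row contents are hypothetical, and stripping whitespace around the names is an assumption rather than documented behavior.

```python
def remove_columns(row, cleanup_value):
    """Hypothetical helper: drop the columns named in a comma-separated list."""
    # Assumption: surrounding whitespace and empty entries are ignored.
    to_remove = {name.strip() for name in cleanup_value.split(",") if name.strip()}
    return {name: value for name, value in row.items() if name not in to_remove}


row = {"chrom": "1", "pos": "10001", "cadd_raw": "0.12", "cadd_phred": "4.5"}

# Equivalent to: columns.cleanup = cadd_raw, cadd_phred
cleaned = remove_columns(row, "cadd_raw, cadd_phred")
```

After the call, cleaned keeps only the chrom and pos columns.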