Annotation Infrastructure
Introduction
Annotation is an essential step in any genomic analysis. It is the process of assigning attributes or properties to a set of objects that an analyst studies. For example, one can annotate the genetic variants identified in a group of case and control individuals in a genetic study with a prediction of pathogenicity of the variants and estimates for the conservation at the loci of the variants.
Example 1
- position_score: hg38/scores/phyloP7way
Example 2
- position_score: hg38/scores/phyloP7way
- effect_annotator:
    gene_models: hg38/gene_models/refSeq_v20200330
    genome: hg38/genomes/GRCh38-hg38
Example 3
- position_score: hg38/scores/phyloP7way
- effect_annotator
This pipeline takes the reference genome and gene models from the genomic context. The genomic context can be an active gpf_instance or command line parameters such as:
-ref hg38/genomes/GRCh38-hg38 -genes hg38/gene_models/refSeq_v20200330
Example 4
- position_score: hg38/scores/phyloP100way
- position_score: hg38/scores/phyloP30way
- position_score: hg38/scores/phyloP20way
- position_score: hg38/scores/phyloP7way
- position_score: hg38/scores/phastCons100way
- position_score: hg38/scores/phastCons30way
- position_score: hg38/scores/phastCons20way
- position_score: hg38/scores/phastCons7way
- np_score: hg38/scores/CADD_v1.4
- liftover_annotator:
    chain: liftover/hg38ToHg19
    target_genome: hg19/genomes/GATK_ResourceBundle_5777_b37_phiX174
    attributes:
    - source: liftover_annotatable
      destination: hg19_annotatable
      internal: true
- position_score:
    resource_id: hg19/scores/FitCons-i6-merged
    input_annotatable: hg19_annotatable
- position_score:
    resource_id: hg19/scores/Linsight
    input_annotatable: hg19_annotatable
- position_score:
    resource_id: hg19/scores/FitCons2_E067
    input_annotatable: hg19_annotatable
- position_score:
    resource_id: hg19/scores/FitCons2_E068
    input_annotatable: hg19_annotatable
- position_score:
    resource_id: hg19/scores/FitCons2_E069
    input_annotatable: hg19_annotatable
- position_score:
    resource_id: hg19/scores/FitCons2_E070
    input_annotatable: hg19_annotatable
- position_score:
    resource_id: hg19/scores/FitCons2_E071
    input_annotatable: hg19_annotatable
- position_score:
    resource_id: hg19/scores/FitCons2_E072
    input_annotatable: hg19_annotatable
- position_score:
    resource_id: hg19/scores/FitCons2_E073
    input_annotatable: hg19_annotatable
- position_score:
    resource_id: hg19/scores/FitCons2_E074
    input_annotatable: hg19_annotatable
- position_score:
    resource_id: hg19/scores/FitCons2_E081
    input_annotatable: hg19_annotatable
- position_score:
    resource_id: hg19/scores/FitCons2_E082
    input_annotatable: hg19_annotatable
- np_score:
    resource_id: hg19/scores/MPC
    input_annotatable: hg19_annotatable
- normalize_allele_annotator:
    genome: hg38/genomes/GRCh38-hg38
- allele_score:
    resource_id: hg38/variant_frequencies/SSC_WG38_CSHL_2380
    # input_annotatable: normalized_allele
- allele_score:
    resource_id: hg38/variant_frequencies/gnomAD_v2.1.1_liftover/exomes
    input_annotatable: normalized_allele
- allele_score:
    resource_id: hg38/variant_frequencies/gnomAD_v2.1.1_liftover/genomes
    input_annotatable: normalized_allele
- allele_score:
    resource_id: hg38/variant_frequencies/gnomAD_v3/genomes
    input_annotatable: normalized_allele
Annotatables
Genomic Position
VCF Variant
Genomic Region
Annotation pipeline
General structure
The pipeline is a YAML file whose top level is a list of annotators. Each annotator looks like:
- <annotator type>:
    A1: v1
    A2: v2
    ...
Syntax shortcuts are possible, such as
- <annotator type>
or
- <annotator type>: <resource id>
Some attributes are general and some are annotator-specific. The general ones include attributes and input_annotatable.
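The three ways of writing an annotator entry can be pictured as one normalized form. The following is a minimal sketch, not the library's own parser: it works on the Python objects a YAML loader would typically produce, and both the function name normalize_annotator and the assumption that the `<annotator type>: <resource id>` shortcut fills a resource_id parameter are illustrative.

```python
from typing import Any, Dict


def normalize_annotator(entry: Any) -> Dict[str, Any]:
    """Expand one pipeline list item into {"type": ..., "params": {...}}.

    Hypothetical helper for illustration only; it is not part of GPF.
    """
    if isinstance(entry, str):
        # Shortcut form: "- <annotator type>"
        return {"type": entry, "params": {}}
    if isinstance(entry, dict) and len(entry) == 1:
        annotator_type, value = next(iter(entry.items()))
        if value is None:
            # "- <annotator type>:" with nothing underneath
            return {"type": annotator_type, "params": {}}
        if isinstance(value, str):
            # Shortcut form: "- <annotator type>: <resource id>"
            # Assumption: the bare value stands for the resource_id parameter.
            return {"type": annotator_type, "params": {"resource_id": value}}
        if isinstance(value, dict):
            # Full form: a mapping of parameters
            return {"type": annotator_type, "params": dict(value)}
    raise ValueError(f"unsupported annotator entry: {entry!r}")


# The same shapes a YAML loader would return for the three syntaxes.
pipeline = [
    {"position_score": "hg38/scores/phyloP7way"},
    "effect_annotator",
    {"np_score": {"resource_id": "hg38/scores/CADD_v1.4"}},
]
normalized = [normalize_annotator(entry) for entry in pipeline]
```

After normalization every entry has the same shape, so later steps only need to handle the full form.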
Position score
- position_score:
    resource_id: <position score resource ID>
    attributes:
    - source: <source score ID>
      destination: <destination attribute name>
      position_aggregator: <aggregator to use for INDELs>
NP score
- np_score:
    resource_id: <NP-score resource ID>
    attributes:
    - source: <source score ID>
      destination: <destination attribute name>
      position_aggregator: <aggregator to use for INDELs>
Allele score
- allele_score:
    resource_id: <allele score resource ID>
    attributes:
    - source: <source score ID>
      destination: <destination attribute name>
Effect annotator
- effect_annotator:
    genome: <reference genome resource ID>
    gene_models: <gene models resource ID>
This module contains the implementation of the three score annotators.
The genomic score annotators defined are position_score, np_score, and allele_score.
- class dae.annotation.score_annotator.PositionScoreAnnotator(pipeline: AnnotationPipeline, info: AnnotatorInfo)
This class implements the position_score annotator.
The position_score annotator requires the resource_id parameter, whose value must be the ID of a genomic resource of type position_score.
The position_score resource provides a set of scores (see …) that the position_score annotator uses as attributes to assign to the annotatable.
The position_score annotator recognizes one attribute-level parameter, position_aggregator, which controls how the position scores are aggregated for annotatables that refer to a region of the reference genome.
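To make the role of position_aggregator concrete, here is a minimal sketch of aggregating per-position scores over a region. It does not use the PositionScoreAnnotator class; the function aggregate_region, the aggregator names "mean", "max" and "min", and the score values are all assumptions chosen for illustration.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

# Hypothetical aggregator names; the real set of supported values may differ.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "mean": mean,
    "max": max,
    "min": min,
}


def aggregate_region(
    scores_by_pos: Dict[int, float],
    start: int,
    end: int,
    aggregator: str,
) -> Optional[float]:
    """Combine the scores of all positions in [start, end] into one value.

    Positions without a score are skipped; if no position in the region
    has a score, None is returned.
    """
    values = [
        scores_by_pos[pos]
        for pos in range(start, end + 1)
        if pos in scores_by_pos
    ]
    if not values:
        return None
    return AGGREGATORS[aggregator](values)


# Made-up scores for three consecutive positions covered by a deletion.
scores = {100: 0.2, 101: 0.8, 102: 0.5}
region_mean = aggregate_region(scores, 100, 102, "mean")
region_max = aggregate_region(scores, 100, 102, "max")
no_data = aggregate_region(scores, 200, 205, "mean")
```

A single-position annotatable needs no aggregation, so the choice of aggregator only matters for annotatables that span more than one position.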
Normalize allele annotator
- normalize_allele_annotator:
    genome: hg38/genomes/GRCh38-hg38
Lift-over annotator
- liftover_annotator:
    chain: liftover/hg38ToHg19
    target_genome: hg19/genomes/GATK_ResourceBundle_5777_b37_phiX174
    attributes:
    - source: liftover_annotatable
      destination: hg19_annotatable
      internal: true
Gene score annotator
- gene_score_annotator:
    resource_id: <gene score resource ID>
    input_gene_list: <Gene list to use to match annotatables (see below)>
    attributes:
    - source: <source score ID>
      destination: <destination attribute name>
      gene_aggregator: <aggregator type>
Note
The input gene list is a list of genes that must be present in the annotation context.
Gene lists are provided by effect annotators, and a gene score annotator must be given one. Gene score annotation therefore depends on an effect annotator being present earlier in the pipeline.
Effect annotators currently provide two gene lists, gene_list and LGD_gene_list, so these are the two possible values.
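The interaction between an input gene list and gene_aggregator can be sketched as follows. This is an illustration under stated assumptions rather than the annotator's actual code: the function aggregate_gene_scores, the aggregator names other than max, the placeholder gene symbols, and the score values are all hypothetical.

```python
from typing import Dict, List, Optional


def aggregate_gene_scores(
    gene_list: List[str],
    gene_scores: Dict[str, float],
    gene_aggregator: str = "max",
) -> Optional[float]:
    """Reduce the scores of the genes in gene_list to a single value.

    Genes without a score are ignored; None is returned when no gene
    in the list has a score.
    """
    values = [gene_scores[gene] for gene in gene_list if gene in gene_scores]
    if not values:
        return None
    if gene_aggregator == "max":
        return max(values)
    if gene_aggregator == "min":
        return min(values)
    if gene_aggregator == "mean":
        return sum(values) / len(values)
    raise ValueError(f"unknown aggregator: {gene_aggregator}")


# Placeholder gene symbols and made-up scores, for illustration only.
gene_scores = {"GENE_A": 0.10, "GENE_B": 0.95}

# A variant whose gene list (as an effect annotator would supply it)
# contains two scored genes and one unscored gene.
gene_list = ["GENE_A", "GENE_B", "GENE_C"]
highest = aggregate_gene_scores(gene_list, gene_scores, "max")
lowest = aggregate_gene_scores(gene_list, gene_scores, "min")
```

An aggregator is needed because a single variant can affect several genes, while the annotation attribute holds one value.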
Command Line Tools
annotate_columns
annotate_vcf
Example: How to annotate variants with ClinVar
For this example, we’ll assume that you have a GRR that contains the ClinVar score resource.
We’ll use a small list of de novo variants saved as denovo-variants.tsv:
CHROM POS REF ALT person_ids
chr14 21403214 T C f1.p1
chr14 21431459 G C f1.p1
chr14 21391016 A AT f2.p1
chr14 21403019 G A f2.p1
chr14 21402010 G A f3.p1
chr14 21393484 TCTTC T f3.p1
Annotate variants with ClinVar resource
Let us create an annotation configuration stored as clinvar-annotation.yaml:
- allele_score: clinvar_20221105
Run the annotate_columns tool:
annotate_columns --grr ./grr_definition.yaml \
    --col_pos POS --col_chrom CHROM --col_ref REF --col_alt ALT \
    denovo-variants.tsv clinvar-annotation.yaml
Example: How to annotate using gene score annotators.
Preparing a variants file
For this example we will reuse the denovo-variants.tsv file from the previous example:
CHROM POS REF ALT person_ids
chr14 21403214 T C f1.p1
chr14 21431459 G C f1.p1
chr14 21391016 A AT f2.p1
chr14 21403019 G A f2.p1
chr14 21402010 G A f3.p1
chr14 21393484 TCTTC T f3.p1
Setting up the Genomic Resource Repository
We will be using the SFARI gene score along with a genome and gene models from the public GRR. Create a grr_definition.yaml that looks like this:
type: group
children:
- id: "seqpipe"
  type: "url"
  directory: "https://grr.seqpipe.org"
Setting up the annotation configuration
Create a properties-annotation.yaml like this:
- effect_annotator:
    gene_models: hg38/gene_models/refSeq_v20200330
    genome: hg38/genomes/GRCh38-hg38
- gene_score_annotator:
    resource_id: gene_properties/gene_scores/SFARI_gene_score
    input_gene_list: gene_list
    attributes:
    - source: "SFARI gene score"
      destination: SFARI_gene_score
When setting up gene score annotators, we need to have a gene list in the annotation context. Effect annotators provide two lists of genes: gene_list and LGD_gene_list. Thus, an effect annotator is required when annotating with gene scores.
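The ordering requirement can be checked before running the annotation. The sketch below is a hypothetical pre-flight check and not a feature of annotate_columns: it scans an already-loaded pipeline (the list a YAML loader would return) and reports every gene_score_annotator that has no effect_annotator before it. The function name is illustrative.

```python
from typing import Any, List


def gene_score_without_effect(pipeline: List[Any]) -> List[int]:
    """Return the indices of gene_score_annotator entries that are not
    preceded by an effect_annotator entry.

    Hypothetical helper for illustration; it only looks at annotator
    types and ignores all other parameters.
    """
    seen_effect = False
    problems: List[int] = []
    for index, entry in enumerate(pipeline):
        # An entry is either a bare string or a one-key mapping.
        annotator_type = entry if isinstance(entry, str) else next(iter(entry))
        if annotator_type == "effect_annotator":
            seen_effect = True
        elif annotator_type == "gene_score_annotator" and not seen_effect:
            problems.append(index)
    return problems


good_pipeline = [
    {"effect_annotator": {
        "gene_models": "hg38/gene_models/refSeq_v20200330",
        "genome": "hg38/genomes/GRCh38-hg38",
    }},
    {"gene_score_annotator": {
        "resource_id": "gene_properties/gene_scores/SFARI_gene_score",
        "input_gene_list": "gene_list",
    }},
]
# Same annotators in the wrong order.
bad_pipeline = list(reversed(good_pipeline))

good_problems = gene_score_without_effect(good_pipeline)
bad_problems = gene_score_without_effect(bad_pipeline)
```

An empty result means every gene score annotator has an effect annotator ahead of it in the pipeline.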
Annotating the variants
Run the annotate_columns tool:
annotate_columns --grr ./grr_definition.yaml \
    --col_pos POS --col_chrom CHROM --col_ref REF --col_alt ALT \
    denovo-variants.tsv properties-annotation.yaml
Example annotation with gene score annotator and changed aggregator
- effect_annotator:
    gene_models: hg38/gene_models/refSeq_v20200330
    genome: hg38/genomes/GRCh38-hg38
- gene_score_annotator:
    resource_id: gene_properties/gene_scores/pLI
    input_gene_list: gene_list
    attributes:
    - source: pLI
      gene_aggregator: max