annotation_pipeline

Annotation pipeline is a tool for annotating data.

The annotation_pipeline help

The tool's help text can be shown with the -h/--help option on the command line:

usage: annotation_pipeline.py [-h] [-H] --config CONFIG [--always-add]
                              [--region REGION] [--split SPLIT]
                              [--separator SEPARATOR]
                              [--options =OPTION:VALUE] [--skip-preannotator]
                              [-a A] [-c C] [--Graw GRAW] [--vcf] [-v V]
                              [-p P] [-r R] [-x X]
                              [infile] [outfile]

Program to annotate variants combining multiple annotating tools

positional arguments:
  infile                path to input file; defaults to stdin
  outfile               path to output file; defaults to stdout

optional arguments:
  -h, --help            show this help message and exit
  -H                    no header in the input file
  --config CONFIG       config file location
  --always-add          always add columns; default behavior is to replace
                        columns with the same label
  --region REGION       region to annotate (chr:begin-end) (input should be
                        tabix indexed)
  --split SPLIT         split variants based on given column
  --separator SEPARATOR
                        separator used in the split column; defaults to ","
  --options =OPTION:VALUE
                        add default arguments
  --skip-preannotator   skips preannotators
  -a A                  alternative column number/name
  -c C                  chromosome column number/name
  --Graw GRAW           genome file location
  --vcf                 if the variant position is in VCF semantics
  -v V                  variant (CSHL format) column number/name
  -p P                  position column number/name
  -r R                  reference column number/name
  -x X                  location (chr:position) column number/name

Typical invocation of annotation_pipeline.py

To run annotation_pipeline.py you must pass the required parameters:

  • --config: the path to the configuration file

Example of annotating de novo data with annotation_pipeline.py:

./annotation_pipeline.py --config file.conf input_file output_file

or:

./annotation_pipeline.py --config file.conf -x location -v variant input_file output_file

Example of annotating transmission data with annotation_pipeline.py:

./annotation_pipeline.py --config file.conf -c chromosome -p position input_file output_file

Example of annotating VCF data with annotation_pipeline.py:

./annotation_pipeline.py --config file.conf --vcf -c chromosome -p position -r reference -a alternative input_file output_file

The -x, -v, -c, -p, -r, -a and --vcf options describe the input format, which lets the tool support multiple formats. Preannotators use these arguments to generate virtual columns.
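The way the column options describe an input format can be illustrated with a small argparse parser rebuilt from the help text above. This is only a sketch: the function name build_format_parser is hypothetical, only the format-related options are reproduced, and the tool's actual parser may be organized differently.

```python
import argparse


def build_format_parser():
    """Hypothetical parser covering only the format options from the help text."""
    parser = argparse.ArgumentParser(prog="annotation_pipeline.py")
    parser.add_argument("--config", required=True, help="config file location")
    parser.add_argument("--vcf", action="store_true",
                        help="variant position is in VCF semantics")
    parser.add_argument("-c", help="chromosome column number/name")
    parser.add_argument("-p", help="position column number/name")
    parser.add_argument("-r", help="reference column number/name")
    parser.add_argument("-a", help="alternative column number/name")
    parser.add_argument("-v", help="variant (CSHL format) column number/name")
    parser.add_argument("-x", help="location (chr:position) column number/name")
    parser.add_argument("infile", nargs="?", default="-")
    parser.add_argument("outfile", nargs="?", default="-")
    return parser


# The VCF example from above, parsed into a namespace.
args = build_format_parser().parse_args(
    ["--config", "file.conf", "--vcf",
     "-c", "chromosome", "-p", "position",
     "-r", "reference", "-a", "alternative",
     "input_file", "output_file"]
)
print(args.vcf, args.c, args.p, args.r, args.a)
```

Each supported format is then just a different combination of these options: location plus variant for de novo data, chromosome plus position for transmission data, and the four VCF columns together with --vcf.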

Configuration file

The configuration file contains one section per operation; each section holds the options used for annotating the data.

Example annotation pipeline configuration file:

# comment

# The DEFAULT section defines variables that can be used throughout the config file
[DEFAULT]
data_dir=/data-dir
genome_dir=%(data_dir)s/genomes/GATK_ResourceBundle_5777_b37_phiX174
graw=%(genome_dir)s/chrAll.fa
traw=%(genome_dir)s/refGene-201309.gz

################################
# Section name
[Step-Effects]

# Annotator class. The complete list of annotators is given below in the dae.annotation.tools module
annotator=annotate_variants.EffectAnnotator

# Annotator options. The complete list of options for each annotator is given below in the dae.annotation.tools module
options.c=CSHL:chr
options.p=CSHL:position
options.v=CSHL:variant
options.Graw=%(graw)s
options.Traw=%(traw)s

# columns.<column_in_input>=<new_column_in_output>
# The columns entries tell annotation_pipeline which columns to use in the annotation and which labels to give them in the output.
columns.effect_type=effectType
columns.effect_gene=effectGene
columns.effect_details=effectDetails

################################
[Step-SSC-Frequency]

annotator=annotateFreqTransm.FrequencyAnnotator

options.c=CSHL:chr
options.p=CSHL:position
options.v=CSHL:variant
options.scores_file=%(data_dir)s/cccc/w1202s766e611/transmissionIndex-HW-DNRM.txt.bgz
options.direct = True

columns.score=SSC-freq

# virtual_columns lists columns that annotation_pipeline adds during annotation and removes afterwards
virtual_columns = score
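The %(name)s references and the [DEFAULT] section follow the conventions of Python's configparser module. The sketch below assumes the file can be read with a plain configparser.ConfigParser (the pipeline's own loader may add further processing) and shows how a shortened copy of the example resolves; the variable names are illustrative only.

```python
import configparser

# Shortened copy of the example configuration above.
CONFIG_TEXT = """
[DEFAULT]
data_dir=/data-dir
genome_dir=%(data_dir)s/genomes/GATK_ResourceBundle_5777_b37_phiX174
graw=%(genome_dir)s/chrAll.fa

[Step-Effects]
annotator=annotate_variants.EffectAnnotator
options.Graw=%(graw)s
columns.effect_type=effectType
columns.effect_gene=effectGene
"""

parser = configparser.ConfigParser()
# configparser lower-cases option names by default; keep them as written
# so that mixed-case names such as options.Graw survive.
parser.optionxform = str
parser.read_string(CONFIG_TEXT)

section = parser["Step-Effects"]
genome_file = section["options.Graw"]  # %(graw)s is expanded on access

# Collect the columns.<name>=<label> pairs into a plain dict.
columns = {
    key[len("columns."):]: value
    for key, value in section.items()
    if key.startswith("columns.")
}
print(genome_file)
print(columns)
```

Values defined in [DEFAULT] are visible from every section, which is why options.Graw can refer to graw without redefining it.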

Add Annotator or Preannotator

To create an Annotator or Preannotator, write a class that inherits from dae.annotation.utilities.AnnotatorBase. The class must implement new_columns() and line_annotations(line, new_columns). new_columns must be a @property that returns the names of the new columns; line_annotations must be a method that takes line and new_columns and returns a list of values for new_columns.
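A minimal sketch of such a class is shown below. So that it runs without the dae package, it uses a stand-in base class; the constructor arguments, the ReferenceLengthAnnotator class and its ref_length column are all hypothetical and only illustrate the new_columns and line_annotations contract described above.

```python
class AnnotatorBase:
    """Stand-in for dae.annotation.utilities.AnnotatorBase (assumed shape)."""

    def __init__(self, opts=None, header=None):
        self.opts = opts
        self.header = header


class ReferenceLengthAnnotator(AnnotatorBase):
    """Hypothetical annotator that adds the length of the reference allele."""

    @property
    def new_columns(self):
        # Names of the columns this annotator contributes.
        return ["ref_length"]

    def line_annotations(self, line, new_columns):
        # Compute every value this annotator knows about, then return them
        # in the order requested through new_columns.
        ref_index = self.header.index("reference")
        values = {"ref_length": len(line[ref_index])}
        return [values[name] for name in new_columns]


annotator = ReferenceLengthAnnotator(header=["chr", "position", "reference"])
row = ["1", "12345", "ACG"]
print(annotator.line_annotations(row, annotator.new_columns))
```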

Annotators live in the annotation/tools directory. Besides the annotator class, the annotator file must define a get_argument_parser() function and must call the main function from utilities. get_argument_parser() must return an argument parser with the annotator's options. utilities.main() takes two parameters: the first is the argument parser and the second is the Annotator.

Preannotators live in the annotation/preannotators directory. Besides the preannotator class, the preannotator file must define a get_arguments() function that returns a dictionary with the preannotator's options. A preannotator added to this directory is found by the MultiAnnotator class.
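The two helper functions might look like the following sketch. The option names, help strings and the layout of the returned dictionary are assumptions made for illustration; the call to utilities.main() is left as a comment so that the block runs without the dae package.

```python
import argparse


def get_argument_parser():
    """Return a parser describing a hypothetical annotator's options."""
    parser = argparse.ArgumentParser(description="Hypothetical length annotator")
    parser.add_argument("-r", default="reference",
                        help="reference column number/name")
    parser.add_argument("--label", default="ref_length",
                        help="name of the new output column")
    return parser


def get_arguments():
    """Return preannotator options as a dictionary (layout is an assumption)."""
    return {
        "-r": "reference column number/name",
        "--label": "name of the new output column",
    }


# An annotator file would end with the call described above, passing the
# argument parser and the Annotator class to utilities.main(). It is omitted
# here; the parser is exercised directly instead.
opts = get_argument_parser().parse_args(["--label", "refLen"])
print(opts.r, opts.label)
```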

dae.annotation.annotation_pipeline module

class dae.annotation.annotation_pipeline.PipelineAnnotator(config)

Bases: dae.annotation.tools.annotator_base.CompositeVariantAnnotator

ANNOTATION_SCHEMA_EXCLUDE = ['effect_gene_genes', 'effect_gene_types', 'effect_genes', 'effect_details_transcript_ids', 'effect_details_details', 'effect_details', 'OLD_effectType', 'OLD_effectGene', 'OLD_effectDetails']
add_annotator(annotator)
static build(options, config_file, work_dir, genomes_db, defaults=None)
build_annotation_schema()
collect_annotator_schema(schema)
line_annotation(aline)

Method returning annotations for the given line in the order from new_columns parameter.

dae.annotation.annotation_pipeline.main_cli_options(gpf_instance)
dae.annotation.annotation_pipeline.pipeline_main(argv)
dae.annotation.annotation_pipeline.run_tabix(filename)

dae.annotation.tools module

dae.annotation.tools.annotator_base module

class dae.annotation.tools.annotator_base.AnnotatorBase(config)

Bases: object

AnnotatorBase is base class of all Annotators.

annotate_df(df)
annotate_file(file_io_manager)

Method for annotating file from Annotator.

build_output_line(annotation_line)
collect_annotator_schema(schema)
line_annotation(annotation_line)

Method returning annotations for the given line in the order from new_columns parameter.

class dae.annotation.tools.annotator_base.CompositeAnnotator(config)

Bases: dae.annotation.tools.annotator_base.AnnotatorBase

add_annotator(annotator)
collect_annotator_schema(schema)
line_annotation(aline)

Method returning annotations for the given line in the order from new_columns parameter.
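The pairing of add_annotator with line_annotation suggests a composite pattern in which the composite forwards each line to its children. The sketch below is one possible reading, not the dae implementation: all class names are stand-ins, lines are modelled as plain dicts, and each child is assumed to return its new values as a dict.

```python
class UppercaseAnnotator:
    """Hypothetical child: returns an upper-cased copy of the variant column."""

    def line_annotation(self, aline):
        return {"variant_upper": aline["variant"].upper()}


class LengthAnnotator:
    """Hypothetical child: returns the length of the variant column."""

    def line_annotation(self, aline):
        return {"variant_length": len(aline["variant"])}


class CompositeSketch:
    """Stand-in composite that runs its children in insertion order."""

    def __init__(self):
        self.annotators = []

    def add_annotator(self, annotator):
        self.annotators.append(annotator)

    def line_annotation(self, aline):
        combined = {}
        for annotator in self.annotators:
            # Later children see the columns produced by earlier ones.
            combined.update(annotator.line_annotation({**aline, **combined}))
        return combined


pipeline = CompositeSketch()
pipeline.add_annotator(UppercaseAnnotator())
pipeline.add_annotator(LengthAnnotator())
result = pipeline.line_annotation({"variant": "sub(a->g)"})
print(result)
```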

class dae.annotation.tools.annotator_base.CompositeVariantAnnotator(config)

Bases: dae.annotation.tools.annotator_base.VariantAnnotatorBase

add_annotator(annotator)
collect_annotator_schema(schema)
line_annotation(aline)

Method returning annotations for the given line in the order from new_columns parameter.

class dae.annotation.tools.annotator_base.CopyAnnotator(config)

Bases: dae.annotation.tools.annotator_base.AnnotatorBase

collect_annotator_schema(schema)
line_annotation(annotation_line, variant=None)

Method returning annotations for the given line in the order from new_columns parameter.

class dae.annotation.tools.annotator_base.DAEBuilder(config, genome)

Bases: dae.annotation.tools.annotator_base.VariantBuilder

build_variant(aline)

class dae.annotation.tools.annotator_base.VCFBuilder(config, genome)

Bases: dae.annotation.tools.annotator_base.VariantBuilder

build_variant(aline)

class dae.annotation.tools.annotator_base.VariantAnnotatorBase(config)

Bases: dae.annotation.tools.annotator_base.AnnotatorBase

annotate_summary_variant(summary_variant)
collect_annotator_schema(schema)
do_annotate(aline, variant)
line_annotation(aline)

Method returning annotations for the given line in the order from new_columns parameter.

class dae.annotation.tools.annotator_base.VariantBuilder(config, genome)

Bases: object

build(annotation_line)
build_variant(annotation_line)

dae.annotation.tools.annotator_config module

class dae.annotation.tools.annotator_config.AnnotationConfigParser

Bases: dae.configuration.config_parser_base.ConfigParserBase

SPLIT_STR_LISTS = ('virtual_columns',)
classmethod parse(config, genomes_db)

Parse the SECTION section of the configuration if it is defined; otherwise parse all of the sections in the configuration.

Parameters

config (Box or dict) – configuration.

Returns

parsed configuration.

Return type

Box or dict or None

classmethod parse_section(config_section, genomes_db)

Parse one section of the configuration based on the SPLIT_STR_LISTS, SPLIT_STR_SETS, CAST_TO_BOOL, CAST_TO_INT, FILTER_SELECTORS and VERIFY_VALUES class properties. If an enabled property is defined in the configuration section, the parser checks whether the section is enabled.

Parameters

config_section (Box or dict) – section from configuration.

Returns

parsed configuration section.

Return type

Box or dict or None
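How class-level declarations such as SPLIT_STR_LISTS and CAST_TO_BOOL could drive the parsing of a section is sketched below. This is a stand-in, not the ConfigParserBase implementation: the comma separator, the accepted boolean spellings and the handling of enabled (returning None for a disabled section) are assumptions, and the genomes_db argument is left out.

```python
class SectionParserSketch:
    """Stand-in parser; the real logic lives in ConfigParserBase."""

    # Declarations borrowed from the classes documented in this module.
    SPLIT_STR_LISTS = ("virtual_columns",)
    CAST_TO_BOOL = ("chr_prefix",)
    CAST_TO_INT = ()

    @staticmethod
    def _to_bool(value):
        # Assumption: these spellings count as True, everything else as False.
        return str(value).strip().lower() in ("true", "yes", "1")

    @classmethod
    def parse_section(cls, config_section):
        if config_section is None:
            return None
        # Assumption: a section whose enabled property is false is dropped.
        if "enabled" in config_section and not cls._to_bool(config_section["enabled"]):
            return None
        parsed = dict(config_section)
        for key in cls.SPLIT_STR_LISTS:
            if key in parsed:
                # Assumption: list values are comma-separated.
                parsed[key] = [item.strip() for item in parsed[key].split(",")]
        for key in cls.CAST_TO_BOOL:
            if key in parsed:
                parsed[key] = cls._to_bool(parsed[key])
        for key in cls.CAST_TO_INT:
            if key in parsed:
                parsed[key] = int(parsed[key])
        return parsed


section = {"virtual_columns": "score, other", "chr_prefix": "False"}
print(SectionParserSketch.parse_section(section))
```

Keeping the conversion rules in class attributes lets a subclass change what is split or cast by overriding a tuple rather than the parsing code.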

classmethod read_and_parse_file_configuration(options, config_file, work_dir, genomes_db, defaults=None)

Read and parse configuration stored in a file.

Parameters
  • config_file (str) – file which contains configuration.

  • work_dir (str) – working directory which will be added as work_dir and wd default values in the configuration.

  • defaults (dict or None) – default values which will be used when the configuration file is read.

Returns

read and parsed configuration.

Return type

Box or None

class dae.annotation.tools.annotator_config.AnnotationOptionsSectionParser

Bases: dae.configuration.config_parser_base.ConfigParserBase

VERIFY_VALUES = {'vcf': <function verify_bool>}

class dae.annotation.tools.annotator_config.ScoreFileConfigParser

Bases: dae.configuration.config_parser_base.ConfigParserBase

CAST_TO_BOOL = ('chr_prefix',)
SPLIT_STR_LISTS = ('header', 'score', 'str', 'float', 'int', 'list(str)', 'list(float)', 'list(int)')

dae.annotation.tools.annotator_config.annotation_config_cli_options(gpf_instance)
dae.annotation.tools.annotator_config.verify_bool(inp_val)

dae.annotation.tools.cleanup_annotator module

class dae.annotation.tools.cleanup_annotator.CleanupAnnotator(config)

Bases: dae.annotation.tools.annotator_base.AnnotatorBase

collect_annotator_schema(schema)
line_annotation(annotation_line)

Method returning annotations for the given line in the order from new_columns parameter.

dae.annotation.tools.dbnsfp_annotator module

class dae.annotation.tools.dbnsfp_annotator.dbNSFPAnnotator(config)

Bases: dae.annotation.tools.score_annotator.NPScoreAnnotator

do_annotate(aline, variant)

dae.annotation.tools.effect_annotator module

class dae.annotation.tools.effect_annotator.EffectAnnotator(config, **kwargs)

Bases: dae.annotation.tools.effect_annotator.EffectAnnotatorBase

COLUMNS_SCHEMA = [('effect_type', 'list(str)'), ('effect_gene', 'list(str)'), ('effect_details', 'list(str)')]
do_annotate(aline, variant)

class dae.annotation.tools.effect_annotator.EffectAnnotatorBase(config, **kwargs)

Bases: dae.annotation.tools.annotator_base.VariantAnnotatorBase

collect_annotator_schema(schema)
do_annotate(aline, variant)

class dae.annotation.tools.effect_annotator.VariantEffectAnnotator(config, **kwargs)

Bases: dae.annotation.tools.effect_annotator.EffectAnnotatorBase

COLUMNS_SCHEMA = [('effect_type', 'str'), ('effect_gene_genes', 'list(str)'), ('effect_gene_types', 'list(str)'), ('effect_genes', 'list(str)'), ('effect_details_transcript_ids', 'list(str)'), ('effect_details_genes', 'list(str)'), ('effect_details_details', 'list(str)'), ('effect_details', 'list(str)')]
do_annotate(aline, variant)
classmethod effect_severity(effect)
classmethod effect_simplify(effects)
classmethod gene_effect(effects)
classmethod sort_effects(effects)
classmethod transcript_effect(effects)
classmethod worst_effect(effects)
wrap_effects(effects)

dae.annotation.tools.file_io_parquet module

class dae.annotation.tools.file_io_parquet.ParquetReader(opts, buffer_size=1000)

Bases: dae.annotation.tools.file_io_tsv.AbstractFormat

line_write(line)
lines_read_iterator()

class dae.annotation.tools.file_io_parquet.ParquetSchema(schema_dict={})

Bases: dae.annotation.tools.schema.Schema

BASE_SCHEMA = bucket_index: int32 summary_variant_index: int64 allele_index: int8 chrom: string position: int32 reference: string alternative: string variant_type: int8 alternatives_data: string effect_type: string effect_gene: string effect_data: string family_variant_index: int64 family_id: string is_denovo: bool variant_sexes: int8 variant_roles: int32 variant_inheritance: int16 variant_in_member: string genotype_data: string af_parents_called_count: int32 af_parents_called_percent: float af_allele_count: int32 af_allele_freq: float frequency_data: string genomic_scores_data: string
classmethod convert(schema)
create_column(col_name, col_type)
classmethod from_arrow(pa_schema)
classmethod from_dict(schema_dict)
classmethod from_parquet(pq_schema)
classmethod merge_schemas(left, right)
classmethod produce_type(type_name)
to_arrow()
type_map = {'bigint': (<class 'int'>, DataType(int64)), 'binary': (<class 'bytes'>, DataType(binary)), 'bool': (<class 'bool'>, DataType(bool)), 'boolean': (<class 'bool'>, DataType(bool)), 'float': (<class 'float'>, DataType(float)), 'float32': (<class 'float'>, DataType(float)), 'float64': (<class 'float'>, DataType(double)), 'int': (<class 'int'>, DataType(uint32)), 'int16': (<class 'int'>, DataType(int16)), 'int32': (<class 'int'>, DataType(int32)), 'int64': (<class 'int'>, DataType(int64)), 'int8': (<class 'int'>, DataType(int8)), 'list(float)': (<class 'float'>, ListType(list<item: double>)), 'list(int)': (<class 'int'>, ListType(list<item: uint32>)), 'list(str)': (<class 'str'>, ListType(list<item: string>)), 'smallint': (<class 'int'>, DataType(int16)), 'str': (<class 'str'>, DataType(string)), 'string': (<class 'bytes'>, DataType(string)), 'tinyint': (<class 'int'>, DataType(int8))}
class dae.annotation.tools.file_io_parquet.ParquetWriter(opts, buffer_size=1000)

Bases: dae.annotation.tools.file_io_tsv.AbstractFormat

classmethod coerce_column(col_name, col_data, expected_col_type)
static coerce_func(new_type)
classmethod get_col_type(col_data)
header_write(header)
line_write(line)
lines_read_iterator()

dae.annotation.tools.file_io_tsv module

class dae.annotation.tools.file_io_tsv.AbstractFormat(opts)

Bases: object

abstract line_write(input_)
abstract lines_read_iterator()

class dae.annotation.tools.file_io_tsv.NoRegionHelper

Bases: object

contains(pos)

class dae.annotation.tools.file_io_tsv.RegionHelper(region_string, pos_index)

Bases: object

contains(line)
static parse_region_string(region)

class dae.annotation.tools.file_io_tsv.TSVFormat(opts)

Bases: dae.annotation.tools.file_io_tsv.AbstractFormat

static is_gzip(filename)
static is_tabix(filename)

class dae.annotation.tools.file_io_tsv.TSVGzipReader(options, filename=None)

Bases: dae.annotation.tools.file_io_tsv.TSVReader

class dae.annotation.tools.file_io_tsv.TSVReader(options, filename=None)

Bases: dae.annotation.tools.file_io_tsv.TSVFormat

line_read()
line_write(line)
lines_read_iterator()

class dae.annotation.tools.file_io_tsv.TSVWriter(options, filename=None)

Bases: dae.annotation.tools.file_io_tsv.TSVFormat

NA_VALUE = ''
header_write(line)
line_write(line)
lines_read_iterator()

class dae.annotation.tools.file_io_tsv.TabixReader(options, filename=None)

Bases: dae.annotation.tools.file_io_tsv.TSVFormat

line_write(line)
lines_read_iterator()

class dae.annotation.tools.file_io_tsv.TabixReaderVariants(options, filename=None)

Bases: dae.annotation.tools.file_io_tsv.TabixReader

lines_read_iterator()

dae.annotation.tools.file_io_tsv.handle_chrom_prefix(expect_prefix, data)
dae.annotation.tools.file_io_tsv.to_str(column_value)
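The --region option in the help text takes a string of the form chr:begin-end, and RegionHelper.parse_region_string works on such strings. A minimal parser for that form is sketched below; the function name parse_region and the (chrom, begin, end) return shape are assumptions, and the real method may return something different.

```python
def parse_region(region):
    """Split 'chr:begin-end' into (chrom, begin, end); hypothetical helper."""
    chrom, colon, span = region.partition(":")
    begin, dash, end = span.partition("-")
    if not (chrom and colon and dash):
        raise ValueError(f"expected chr:begin-end, got {region!r}")
    # int() raises ValueError if either coordinate is missing or not a number.
    return chrom, int(begin), int(end)


print(parse_region("1:1000-2000"))
```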

dae.annotation.tools.file_io module

class dae.annotation.tools.file_io.IOManager(opts, io_type_r, io_type_w)

Bases: object

property header
header_write(input_)
line_write(input_)
lines_read_iterator()

class dae.annotation.tools.file_io.IOType

Bases: object

class Parquet

Bases: object

static instance_r(opts)
static instance_w(opts)

class TSV

Bases: object

static instance_r(opts)
static instance_w(opts)

dae.annotation.tools.frequency_annotator module

class dae.annotation.tools.frequency_annotator.FrequencyAnnotator(config)

Bases: dae.annotation.tools.score_annotator.VariantScoreAnnotatorBase

collect_annotator_schema(schema)
do_annotate(aline, variant)

dae.annotation.tools.lift_over_annotator module

class dae.annotation.tools.lift_over_annotator.LiftOverAnnotator(config)

Bases: dae.annotation.tools.annotator_base.VariantAnnotatorBase

static build_lift_over(chain_filename)
collect_annotator_schema(schema)
do_annotate(aline, variant)

dae.annotation.tools.schema module

class dae.annotation.tools.schema.Schema

Bases: object

property col_names
create_column(col_name, col_type)
static diff_schemas(left, right)
classmethod from_dict(schema_dict)
static merge_schemas(left, right)
order_as(ordered_col_names)
classmethod produce_type(type_name)
remove_column(col_name)
type_map = {'float': <class 'float'>, 'int': <class 'int'>, 'list(float)': <class 'float'>, 'list(int)': <class 'int'>, 'list(str)': <class 'str'>, 'str': <class 'str'>}

dae.annotation.tools.score_annotator module

class dae.annotation.tools.score_annotator.NPScoreAnnotator(config)

Bases: dae.annotation.tools.score_annotator.VariantScoreAnnotatorBase

do_annotate(aline, variant)

class dae.annotation.tools.score_annotator.PositionMultiScoreAnnotator(config)

Bases: dae.annotation.tools.annotator_base.CompositeVariantAnnotator

class dae.annotation.tools.score_annotator.PositionScoreAnnotator(config)

Bases: dae.annotation.tools.score_annotator.VariantScoreAnnotatorBase

do_annotate(aline, variant)

class dae.annotation.tools.score_annotator.VariantScoreAnnotatorBase(config)

Bases: dae.annotation.tools.annotator_base.VariantAnnotatorBase

collect_annotator_schema(schema)

dae.annotation.tools.score_file_io_bigwig module

class dae.annotation.tools.score_file_io_bigwig.BigWigAccess(score_file)

Bases: object

class dae.annotation.tools.score_file_io_bigwig.BigWigLineAdapter(score_file, chromosome, line)

Bases: object

property chrom
property pos_begin
property pos_end

dae.annotation.tools.score_file_io module

class dae.annotation.tools.score_file_io.LineAdapter(score_file, line)

Bases: object

property chrom
property pos_begin
property pos_end

class dae.annotation.tools.score_file_io.LineBufferAdapter(score_file, access)

Bases: object

append(line)
back()
property chrom
empty()
fill(chrom, pos_begin, pos_end)
front()
pop()
property pos_begin
property pos_end
purge(chrom, pos_begin, pos_end)
static regions_intersect(b1, e1, b2, e2)
reset()
select_lines(chrom, pos_begin, pos_end)
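The signature regions_intersect(b1, e1, b2, e2) suggests a test for whether two position ranges overlap. A plausible version is sketched below, assuming both ranges are closed (begin and end inclusive); whether the actual method treats the end points as inclusive is not stated in this listing.

```python
def regions_intersect(b1, e1, b2, e2):
    """Return True if the closed ranges [b1, e1] and [b2, e2] overlap."""
    # Two ranges overlap exactly when each one starts before the other ends.
    return b1 <= e2 and b2 <= e1


print(regions_intersect(100, 200, 200, 300))
print(regions_intersect(100, 200, 201, 300))
```

Under the closed-range assumption, ranges that share only an end point count as intersecting.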
class dae.annotation.tools.score_file_io.NoLine(score_file)

Bases: object

class dae.annotation.tools.score_file_io.ScoreFile(score_filename, config_filename=None)

Bases: object

property alt_name
property chr_name
fetch_scores(chrom, pos_begin, pos_end)
fetch_scores_df(chrom, pos_begin, pos_end)
property pos_begin_name
property pos_end_name
property ref_name
scores_to_dataframe(scores)

class dae.annotation.tools.score_file_io.TabixAccess(score_file)

Bases: dae.annotation.tools.file_io_tsv.TabixReader

ACCESS_SWITCH_THRESHOLD = 1500
LONG_JUMP_THRESHOLD = 5000

dae.annotation.tools.utils module

class dae.annotation.tools.utils.AnnotatorFactory

Bases: object

classmethod make_annotator(annotator_config)

class dae.annotation.tools.utils.LineMapper(source_header)

Bases: object

map(source_line)

dae.annotation.tools.utils.handle_header(source_header)

dae.annotation.tools.vcf_info_extractor module

class dae.annotation.tools.vcf_info_extractor.VCFInfoExtractor(config)

Bases: dae.annotation.tools.annotator_base.AnnotatorBase

collect_annotator_schema(schema)
line_annotation(annotation_line)

Method returning annotations for the given line in the order from new_columns parameter.