variants package

Variants in families

Example usage of variants

Example usage of variants package:

import os
from utils.variant_utils import mat2str
from variants.builder import variants_builder as VB

# Alternative datasets; uncomment exactly one prefix.
# prefix = "ivan-tiny/a"
# prefix = "spark/nspark"
prefix = 'fixtures/effects_trio'

# DAE_DB_DIR must be set; os.environ[...] raises KeyError early if it is not.
genome_dir = os.path.join(
    os.environ["DAE_DB_DIR"],
    "genomes/GATK_ResourceBundle_5777_b37_phiX174")

genome_file = os.path.join(genome_dir, "chrAll.fa")
print(genome_file)

gene_models_file = os.path.join(genome_dir, "refGene-201309.gz")
print(gene_models_file)


fvars = VB(prefix=prefix, genome_file=genome_file,
           gene_models_file=gene_models_file)

vs = fvars.query_variants()


for c, v in enumerate(vs):
    print(c, v, v.family_id, mat2str(v.best_st), sep='\t')
    for aa in v.alt_alleles:
        print(aa.effect.worst, aa.effect.genes)
        print(aa['af_allele_count'], aa['af_allele_freq'])

Family variants query interface

Once you have created a family variants interface, you can use it to search for the variants you are interested in. The interface supports queries by various attributes of the family variants:

  • query by genome regions

  • query by genes and variant effect types

  • query by inheritance types

  • query by family IDs

  • query by person IDs

  • query by sexes

  • query by family roles

  • query by variant types

  • query by real-valued variant attributes (scores)

  • query using a general-purpose filter function

In the following examples we assume that fvars is an instance of the family variants query interface.

Query by regions

The query interface supports searching for variants in a given genome region or list of regions.

Example

The following example will return variants located at a single position, 1:878109 (chromosome 1, position 878109):

from RegionOperations import Region

vs = fvars.query_variants(regions=[Region("1", 878109, 878109)])

You can specify a list of regions in the query:

from RegionOperations import Region

vs = fvars.query_variants(
    regions=[Region("1", 11539, 11539), Region("1", 11550, 11550)])
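
To make the matching rule concrete, here is a minimal sketch of what a region filter checks. It assumes that region bounds are 1-based and inclusive, which is consistent with the single-position example above. SimpleRegion and in_any_region are hypothetical stand-ins written for this illustration; they are not part of RegionOperations or of the variants package.

```python
from typing import Iterable, NamedTuple


class SimpleRegion(NamedTuple):
    """Hypothetical stand-in for a genome region (not RegionOperations.Region)."""
    chrom: str
    start: int  # 1-based, inclusive (assumption)
    stop: int   # 1-based, inclusive (assumption)


def in_any_region(chrom: str, position: int,
                  regions: Iterable[SimpleRegion]) -> bool:
    """Return True when the position falls inside at least one region."""
    return any(
        r.chrom == chrom and r.start <= position <= r.stop
        for r in regions
    )


regions = [SimpleRegion("1", 11539, 11539), SimpleRegion("1", 11550, 11550)]
print(in_any_region("1", 11539, regions))
print(in_any_region("1", 11540, regions))
```

A list of regions therefore behaves as a union: a variant is kept when it lies in any one of them.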

Query by genes and effect types

Example

The following example will return only variants with effect type frame-shift:

vs = fvars.query_variants(
    effect_types=["frame-shift"])

You can specify multiple effect types in the query. The following example will return variants with effect type frame-shift or missense:

vs = fvars.query_variants(
    effect_types=["frame-shift", "missense"])

You can search for variants in a specific gene:

vs = fvars.query_variants(
    genes=["PLEKHN1"])

or in a list of genes:

vs = fvars.query_variants(
    genes=["PLEKHN1", "SAMD11"])

You can specify a combination of effect types and genes, in which case the query will return only variants that match both criteria:

vs = fvars.query_variants(
    effect_types=["synonymous", "frame-shift"],
    genes=["PLEKHN1"])
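
The way the two criteria combine can be sketched on plain dictionaries: values inside one list are alternatives (any may match), while the two lists must both match. The record layout and the matches_effects helper below are hypothetical and only illustrate that logic; the real query runs against the package's own variant objects.

```python
def matches_effects(allele, effect_types=None, genes=None):
    """Keep an allele only if it satisfies every criterion that was given."""
    if effect_types is not None and allele["effect_type"] not in effect_types:
        return False
    if genes is not None and not set(allele["genes"]) & set(genes):
        return False
    return True


# Illustrative records only; they are not taken from any dataset.
alleles = [
    {"genes": ["PLEKHN1"], "effect_type": "synonymous"},
    {"genes": ["PLEKHN1"], "effect_type": "missense"},
    {"genes": ["SAMD11"], "effect_type": "frame-shift"},
]

selected = [
    a for a in alleles
    if matches_effects(a,
                       effect_types=["synonymous", "frame-shift"],
                       genes=["PLEKHN1"])
]
print(selected)
```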

Query by inheritance

Example

The following example will return only variants that have inheritance type denovo:

vs = fvars.query_variants(
    inheritance="denovo")

You can combine inheritance types using or:

vs = fvars.query_variants(
    inheritance="denovo or omission")

You can use not to get all family variants that have a non-reference inheritance type:

vs = fvars.query_variants(inheritance="not reference")
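
The expressions used here (and in the sexes, roles and variant types queries) read as boolean formulas over labels. The following is a minimal sketch of how such a string could be evaluated against the set of labels attached to one variant. The matches function is hypothetical, it supports only and, or and not without parentheses, and it is not the parser used by the package.

```python
def matches(expression, labels):
    """Evaluate e.g. "denovo or omission" against a set of labels.

    Precedence is the usual one: not binds tighter than and,
    and binds tighter than or.
    """
    tokens = expression.split()
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "or":
            pos += 1
            rhs = parse_and()  # always parse, so the position advances
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_not()
        while pos < len(tokens) and tokens[pos] == "and":
            pos += 1
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        nonlocal pos
        if tokens[pos] == "not":
            pos += 1
            return not parse_not()
        label = tokens[pos]
        pos += 1
        return label in labels

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


print(matches("denovo or omission", {"omission"}))
print(matches("not reference", {"reference"}))
```

Whether the package tests the expression per allele or per family variant is not stated in this section, so the choice of what goes into labels is left open here.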

Query by family IDs

Example

The following example will return only variants that affect specified families:

vs = fvars.query_variants(family_ids=['f1', 'f2'])

where f1 and f2 are family IDs.

Query by person IDs

Example

The following example will return only variants that affect specified individuals:

vs = fvars.query_variants(person_ids=['mom2', 'ch2'])

where mom2 and ch2 are person (individual) IDs.

Query by sexes

Example

The following example will return only variants that affect male individuals:

vs = fvars.query_variants(sexes="male")

You can use or, and and not to combine sexes. For example:

vs = fvars.query_variants(sexes="male and not female")

will return only family variants that affect male individuals in the family, but not female individuals.

Query by roles

Example

The following example will return only variants that affect probands in families:

vs = fvars.query_variants(roles="prb")

You can use or, and and not to combine roles. For example:

vs = fvars.query_variants(roles="prb and not sib")

will return only family variants that affect probands in the family, but not siblings.

Query by variant types

Example

The following example will return only variants that are of type sub:

vs = fvars.query_variants(variant_types="sub")

You can use or, and and not to combine variant types. For example:

vs = fvars.query_variants(variant_types="sub or del")

will return only family variants that are of type sub or del.

Query with real value variant attributes (scores)

Not fully implemented yet

Query with filter function

Not fully implemented yet

VariantBase - a base class for variants

class dae.variants.variant.VariantBase(chromosome, position, reference, alternative=None)[source]

VariantBase is a base class for variants. It describes a variant in a VCF-like style.

Expected parameters of the constructor are:

Parameters
  • chromosome – chromosome label where variant is located

  • position – position of the variant using VCF convention

  • reference – reference DNA string

  • alternative – alternative DNA string

Each object of VariantBase has the following fields:

Variables
  • chromosome – chromosome label where variant is located

  • position – position of the variant using VCF convention

  • reference – reference DNA string

  • _alternative – alternative DNA string

__eq__(other)[source]

Return self==value.

__gt__(other)[source]

Return self>value.

__init__(chromosome, position, reference, alternative=None)[source]

Initialize self. See help(type(self)) for accurate signature.

__lt__(other)[source]

Return self<value.

__ne__(other)[source]

Return self!=value.

property alternative

alternative DNA string; comma separated string when multiple alternative DNA strings should be represented; alternative is None when the variant is a reference variant.
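
Because alternative packs several alleles into one comma-separated string and uses None for a reference variant, callers usually need to unpack it before use. A minimal sketch follows; split_alternative is a hypothetical helper and not a method of VariantBase.

```python
def split_alternative(alternative):
    """Turn the alternative string into a list of DNA strings.

    None (a reference variant) becomes an empty list.
    """
    if alternative is None:
        return []
    return alternative.split(",")


print(split_alternative("A,AT"))
print(split_alternative(None))
```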

property chrom
chromosome = None

chromosome on which the variant is located

property location
position = None

1-based start position of this variant on the reference

reference = None

reference DNA string

property variant

SummaryAllele - a base class for representing alleles

class dae.variants.variant.SummaryAllele(chromosome, position, reference, alternative=None, summary_index=None, allele_index=0, attributes=None)[source]

SummaryAllele represents a single allele for given position.

__init__(chromosome, position, reference, alternative=None, summary_index=None, allele_index=0, attributes=None)[source]

Initialize self. See help(type(self)) for accurate signature.

allele_index = None

index of the allele of summary variant

property alternative

alternative DNA string; comma separated string when multiple alternative DNA strings should be represented; alternative is None when the variant is a reference variant.

attributes = None

allele additional attributes

property chrom
static create_reference_allele(allele)[source]
property cshl_location
property cshl_position
property cshl_variant
property effect
property effects
property frequency
get_attribute(item, default=None)[source]

looks up values matching key item in additional attributes passed on creation of the variant.

has_attribute(item)[source]

checks if additional variant attributes contain values for key item.

property is_reference_allele
property location
summary_index = None

index of the summary variant this allele belongs to

update_attributes(atts)[source]

updates additional attributes of variant using dictionary atts.

property variant
property variant_type

SummaryVariant - representation of summary variants

class dae.variants.variant.SummaryVariant(alleles)[source]
__contains__(item)[source]
__getitem__(item)[source]
__init__(alleles)[source]

Initialize self. See help(type(self)) for accurate signature.

alleles = None

list of all alleles in the variant

property alt_alleles

list of all alternative alleles

property alternative

alternative DNA string; comma separated string when multiple alternative DNA strings should be represented; alternative is None when the variant is a reference variant.

property chrom
property details

1-based list of VariantDetails, that describes each alternative allele.

property effects

1-based list of Effect, that describes variant effects.

property frequencies

0-based list of frequencies for variant.

get_allele(allele_index)[source]
get_attribute(item, default=None)[source]
has_attribute(item)[source]
property location
property ref_allele

the reference allele

update_attributes(atts)[source]
property variant
property variant_types

returns set of variant types.

FamilyDelegate - common inheritance methods

class dae.variants.family_variant.FamilyDelegate(family)[source]
property family_id

Returns the family ID.

get_family_members_attribute(attribute)[source]
property members_ids

Returns list of family members IDs.

property members_in_order

Returns list of the members of the family in the order specified from the pedigree file. Each element of the returned list is an object of type variants.family.Person.

people_group_attribute(attribute)[source]

FamilyAllele - representation of family allele

class dae.variants.family_variant.FamilyAllele(chromosome, position, reference, alternative, summary_index, allele_index, attributes, family, genotype)[source]
property alternative

alternative DNA string; comma separated string when multiple alternative DNA strings should be represented; alternative is None when the variant is a reference variant.

property best_st
classmethod calc_inheritance_trio(p1, p2, ch, allele_index)[source]

Calculates the inheritance type of a trio family.

Parameters
  • p1 – genotype of the first parent (pair of allele indexes).

  • p2 – genotype of the second parent.

  • ch – genotype of the child.

Returns

inheritance type as variants.attributes.Inheritance of the trio family.

static check_denovo_trio(p1, p2, ch, allele_index)[source]

Checks if the inheritance type for a trio family is denovo.

Parameters
  • p1 – genotype of the first parent (pair of allele indexes).

  • p2 – genotype of the second parent.

  • ch – genotype of the child.

Returns

True, when the inheritance is denovo.

static check_mendelian_trio(p1, p2, ch, allele_index)[source]

Checks if the inheritance type for a trio family is mendelian.

Parameters
  • p1 – genotype of the first parent (pair of allele indexes).

  • p2 – genotype of the second parent.

  • ch – genotype of the child.

Returns

True, when the inheritance is mendelian.

static check_omission_trio(p1, p2, ch, allele_index)[source]

Checks if the inheritance type for a trio family is omission.

Parameters
  • p1 – genotype of the first parent (pair of allele indexes).

  • p2 – genotype of the second parent.

  • ch – genotype of the child.

Returns

True, when the inheritance is omission.
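
The three trio checks can be illustrated with a small sketch that treats each genotype as a pair of allele indexes, as in the parameter descriptions above. The functions is_mendelian, is_denovo and is_omission are hypothetical and encode the usual textbook definitions; they are an assumption about the intended semantics, not a copy of the FamilyAllele implementation.

```python
def is_mendelian(p1, p2, ch):
    """The child's two alleles can be drawn one from each parent."""
    first_from_p1 = ch[0] in p1 and ch[1] in p2
    first_from_p2 = ch[0] in p2 and ch[1] in p1
    return first_from_p1 or first_from_p2


def is_denovo(p1, p2, ch, allele_index):
    """The child carries allele_index although neither parent does."""
    return allele_index in set(ch) - (set(p1) | set(p2))


def is_omission(p1, p2, ch, allele_index):
    """A parent is homozygous for allele_index, yet the child lacks it."""
    if allele_index in ch:
        return False
    return any(parent[0] == parent[1] == allele_index for parent in (p1, p2))


# Allele index 0 is the reference, 1 is the first alternative allele.
print(is_mendelian((0, 1), (0, 0), (0, 1)))
print(is_denovo((0, 0), (0, 0), (0, 1), allele_index=1))
print(is_omission((1, 1), (0, 0), (0, 0), allele_index=1))
```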

property chrom
static create_reference_allele(allele)
property cshl_location
property cshl_position
property cshl_variant
property effect
property effects
property family_id

Returns the family ID.

property frequency
static from_summary_allele(summary_allele, family, genotype)[source]
property genotype

Returns genotype of the family.

get_attribute(item, default=None)

looks up values matching key item in additional attributes passed on creation of the variant.

get_family_members_attribute(attribute)
gt_flatten()[source]

Return genotype of the family variant flattened to 1-dimensional array.

has_attribute(item)

checks if additional variant attributes contain values for key item.

property inheritance_in_members
property is_reference_allele
property location
property members_ids

Returns list of family members IDs.

property members_in_order

Returns list of the members of the family in the order specified from the pedigree file. Each element of the returned list is an object of type variants.family.Person.

people_group_attribute(attribute)
update_attributes(atts)

updates additional attributes of variant using dictionary atts.

property variant
property variant_in_members

Returns set of members IDs of the family that are affected by this family variant.

property variant_in_members_objects
property variant_in_roles

Returns list of roles (or ‘None’) of the members of the family that are affected by this family variant.

property variant_in_sexes

Returns list of sexes (or ‘None’) of the members of the family that are affected by this family variant.

property variant_type

FamilyVariant - representation of family variants

class dae.variants.family_variant.FamilyVariant(family_alleles, family, genotype)[source]
property alt_alleles

list of all alternative alleles

property alternative

alternative DNA string; comma separated string when multiple alternative DNA strings should be represented; alternative is None when the variant is a reference variant.

property best_st
static calc_alleles(gt)[source]

Returns allele indexes that are relevant for the given genotype.

Parameters

gt – genotype as np.array.

Returns

list of all allele indexes present in the given genotype.

static calc_alt_alleles(gt)[source]

Returns alternative allele indexes that are relevant for the given genotype.

Parameters

gt – genotype as np.array.

Returns

list of all alternative allele indexes present in the given genotype.
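
As a rough illustration of what these two helpers compute, the sketch below works on a genotype given as nested Python lists instead of an np.array, to stay dependency-free. It assumes that -1 marks an unknown call and 0 marks the reference allele; both codes are assumptions, and allele_indexes and alt_allele_indexes are hypothetical names, not the package's functions.

```python
UNKNOWN = -1    # assumed code for a missing call
REFERENCE = 0   # assumed index of the reference allele


def allele_indexes(gt):
    """Sorted allele indexes that occur in the genotype, ignoring unknowns."""
    return sorted({a for row in gt for a in row if a != UNKNOWN})


def alt_allele_indexes(gt):
    """Same as allele_indexes, but without the reference allele."""
    return [a for a in allele_indexes(gt) if a != REFERENCE]


# Two rows (one per chromosome copy), three family members.
gt = [[0, 0, 2],
      [0, 1, -1]]
print(allele_indexes(gt))
print(alt_allele_indexes(gt))
```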

property chrom
property details

1-based list of VariantDetails, that describes each alternative allele.

property effects

1-based list of Effect, that describes variant effects.

property family_id

Returns the family ID.

property frequencies

0-based list of frequencies for variant.

static from_summary_variant(summary_variant, family, genotype)[source]
property genotype

Returns genotype of the family.

get_allele(allele_index)
get_attribute(item, default=None)
get_family_members_attribute(attribute)
gt_flatten()[source]

Return genotype of the family variant flattened to 1-dimensional array.

has_attribute(item)
is_reference()[source]

Returns True if all known alleles in the family variant are reference.

is_unknown()[source]

Returns True if all alleles in the family variant are unknown.
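
The difference between the two predicates is easiest to see on small genotypes. The sketch below again uses nested lists and assumes -1 for an unknown call and 0 for the reference allele; the function names mirror the methods above, but the bodies are illustrative guesses at the described behaviour.

```python
UNKNOWN = -1    # assumed code for a missing call
REFERENCE = 0   # assumed index of the reference allele


def flatten(gt):
    """Flatten a 2-dimensional genotype into a 1-dimensional list."""
    return [allele for row in gt for allele in row]


def is_unknown(gt):
    """True when every call in the genotype is unknown."""
    return all(a == UNKNOWN for a in flatten(gt))


def is_reference(gt):
    """True when every known call is the reference allele.

    Note: in this sketch a genotype with no known calls also counts
    as reference, because the condition holds vacuously.
    """
    known = [a for a in flatten(gt) if a != UNKNOWN]
    return all(a == REFERENCE for a in known)


print(is_reference([[0, 0, 0], [0, -1, 0]]))
print(is_unknown([[-1, -1], [-1, -1]]))
```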

property location
property matched_alleles
property matched_alleles_indexes
property matched_gene_effects
property members_ids

Returns list of family members IDs.

property members_in_order

Returns list of the members of the family in the order specified from the pedigree file. Each element of the returned list is an object of type variants.family.Person.

people_group_attribute(attribute)
property ref_allele

the reference allele

set_matched_alleles(alleles_indexes)[source]
update_attributes(atts)
property variant
property variant_types

returns set of variant types.

RawVcfVariants - query interface for VCF variants

Apache Parquet variants schema

Summary Variants/Alleles flat schema

  • chrom (string) -

    chromosome where variant is located

  • position (int64) -

    1-based position of the start of the variant

  • reference (string) -

    reference DNA string

  • alternative (string) -

    alternative DNA string (None for reference allele)

  • summary_index (int64) -

    index of the summary variant

  • allele_index (int16) -

    index of the allele inside given summary variant

  • variant_type (int8) -

    variant type in CSHL notation

  • cshl_variant (string) -

    variant description in CSHL notation

  • cshl_position (int64) -

    variant position in CSHL notation

  • cshl_length (int32) -

    variant length in CSHL notation

  • effect_type (string) -

    worst effect of the variant (None for reference allele)

  • effect_gene_genes (list_(string)) -

    list of all genes affected by the variant allele (None for reference allele)

  • effect_gene_types (list_(string)) -

    list of all effect types corresponding to the effect_gene_genes (None for reference allele)

  • effect_details_transcript_ids (list_(string)) -

    list of all transcript ids affected by the variant allele (None for reference allele)

  • effect_details_details (list_(string)) -

    list of all effect details corresponding to the effect_details_transcript_ids (None for reference allele)

  • af_parents_called_count (int32) -

    count of independent parents that have a well-specified genotype for this allele

  • af_parents_called_percent (float64) -

    percent of independent parents corresponding to af_parents_called_count

  • af_allele_count (int32) -

    count of this allele in the independent parents

  • af_allele_freq (float64) -

    allele frequency

Family Variants schema

  • chrom (string)

  • position (int64)

  • family_index (int64) -

    index of the family variant

  • summary_index (int64) -

    index of the summary variant

  • family_id (string) -

    family ID

  • genotype (list_(int8)) -

    genotype of the variant for the specified family

  • inheritance (int32) -

    inheritance type of the variant

Family Alleles schema

  • family_index (int64)

  • summary_index (int64)

  • allele_index (int16)

  • variant_in_members (list_(string)) -

    list of members of the family that have this allele

  • variant_in_roles (list_(int32)) -

    list of family members’ roles that have this allele

  • variant_in_sexes (list_(int8)) -

    list of family members’ sexes that have this allele

Variant Scores schema

  • summary_index (int64)

  • allele_index (int16)

  • score_id (string or int64)

  • score_value (float64)

Pedigree file schema

  • familyId (string)

  • personId (string)

  • dadId (string)

  • momId (string)

  • sex (int8)

  • status (int8)

  • role (int32)

  • sampleId (string)

  • order (int32)
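
To show how these columns and types fit together, here is a minimal sketch that loads pedigree rows into typed records. It assumes the pedigree is available as tab-separated text with a header row, which this section does not state; read_pedigree is a hypothetical helper, and the sample rows and numeric codes are made up for illustration.

```python
import csv
import io

# Column names follow the pedigree schema above; Python str/int stand in
# for the string and integer column types.
PEDIGREE_COLUMNS = {
    "familyId": str, "personId": str, "dadId": str, "momId": str,
    "sex": int, "status": int, "role": int, "sampleId": str, "order": int,
}


def read_pedigree(handle):
    """Read tab-separated pedigree rows and convert each column's type."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {name: convert(row[name]) for name, convert in PEDIGREE_COLUMNS.items()}
        for row in reader
    ]


# Illustrative rows only; the codes for sex, status and role and the "0"
# placeholder for a missing parent are invented for this example.
header = "\t".join(PEDIGREE_COLUMNS)
rows = [
    "\t".join(["f1", "mom1", "0", "0", "2", "1", "1", "mom1", "0"]),
    "\t".join(["f1", "ch1", "dad1", "mom1", "1", "2", "3", "ch1", "2"]),
]
members = read_pedigree(io.StringIO("\n".join([header] + rows)))
print(members[1]["personId"], members[1]["momId"])
```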

Functions from parquet_io module