dae.backends.schema2 package

Submodules

dae.backends.schema2.base_query_builder module

class dae.backends.schema2.base_query_builder.BaseQueryBuilder(dialect: dae.backends.schema2.base_query_builder.Dialect, db: str, family_variant_table: str, summary_allele_table: str, pedigree_table: str, family_variant_schema: dict[str, str], summary_allele_schema: dict[str, str], table_properties: Optional[dict], pedigree_schema: dict[str, str], pedigree_df: pandas.core.frame.DataFrame, gene_models: Optional[dae.genomic_resources.gene_models.GeneModels] = None)[source]

Bases: abc.ABC

Class that abstracts away the process of building a query.

GENE_REGIONS_HEURISTIC_CUTOFF = 20
GENE_REGIONS_HEURISTIC_EXTEND = 20000
MAX_CHILD_NUMBER = 9999
QUOTE = "'"
WHERE = '\n    WHERE\n    {where}\n    '
build_query(regions: Optional[list[dae.utils.regions.Region]] = None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, limit=None, pedigree_fields=None)[source]

Build an SQL query in the correct order.
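The clause-ordering idea can be sketched as follows. This is an illustrative stand-in, not the actual BaseQueryBuilder implementation: the function name and clause handling are invented, but it mirrors the documented behaviour of assembling SELECT / FROM / WHERE / LIMIT in a fixed order, with the filter arguments (regions, genes, family_ids, ...) contributing AND-ed WHERE conditions.

```python
def build_query_sketch(table, where_clauses, limit=None):
    # Assemble the SQL clauses in their required order.
    parts = [f"SELECT * FROM {table}"]
    if where_clauses:
        # Individual filters are combined into one AND-ed WHERE clause.
        parts.append("WHERE " + " AND ".join(where_clauses))
    if limit is not None:
        parts.append(f"LIMIT {limit}")
    return "\n".join(parts)

query = build_query_sketch(
    "family_variant",
    ["chromosome = 'chr1'", "position >= 1000"],
    limit=10,
)
```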

class dae.backends.schema2.base_query_builder.Dialect(namespace: Optional[str] = None)[source]

Bases: abc.ABC

Carries information about an SQL dialect.

static add_unnest_in_join() bool[source]
build_table_name(table: str, db: str) str[source]
static float_type() str[source]
static int_type() str[source]
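A minimal sketch of what implementing this interface might look like. `DialectSketch` and `BigQueryDialectSketch` are stand-ins mirroring the documented methods, not classes imported from dae; the base-class defaults and the table-name format are assumptions (BigQuery's 64-bit type names, however, are its real SQL type names).

```python
from abc import ABC
from typing import Optional


class DialectSketch(ABC):
    """Stand-in mirroring the documented Dialect interface."""

    def __init__(self, namespace: Optional[str] = None):
        self.namespace = namespace

    @staticmethod
    def add_unnest_in_join() -> bool:
        # Whether joins over array columns need an UNNEST wrapper.
        return False

    def build_table_name(self, table: str, db: str) -> str:
        # Qualify the table with the db (and optional namespace).
        prefix = f"{self.namespace}." if self.namespace else ""
        return f"{prefix}{db}.{table}"

    @staticmethod
    def float_type() -> str:
        return "FLOAT"

    @staticmethod
    def int_type() -> str:
        return "INT"


class BigQueryDialectSketch(DialectSketch):
    """BigQuery requires UNNEST for array joins and uses 64-bit type names."""

    @staticmethod
    def add_unnest_in_join() -> bool:
        return True

    @staticmethod
    def float_type() -> str:
        return "FLOAT64"

    @staticmethod
    def int_type() -> str:
        return "INT64"
```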

dae.backends.schema2.bigquery_variants module

class dae.backends.schema2.bigquery_variants.BigQueryDialect(ns: Optional[str] = None)[source]

Bases: dae.backends.schema2.base_query_builder.Dialect

Abstracts away details related to BigQuery.

static add_unnest_in_join() bool[source]
static float_type() str[source]
static int_type() str[source]
class dae.backends.schema2.bigquery_variants.BigQueryVariants(gcp_project_id, db, summary_allele_table, family_variant_table, pedigree_table, meta_table, gene_models=None)[source]

Bases: object

Backend for BigQuery.

query_summary_variants(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, limit=None)[source]

Query summary variants.

query_variants(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, limit=None, affected_status=None)[source]

Query family variants.

dae.backends.schema2.family_builder module

class dae.backends.schema2.family_builder.FamilyQueryBuilder(dialect: dae.backends.schema2.base_query_builder.Dialect, db: str, family_variant_table: str, summary_allele_table: str, pedigree_table: str, family_variant_schema: dict[str, str], summary_allele_schema: dict[str, str], table_properties: Optional[dict], pedigree_schema: dict[str, str], pedigree_df: pandas.core.frame.DataFrame, gene_models=None, do_join_pedigree=False)[source]

Bases: dae.backends.schema2.base_query_builder.BaseQueryBuilder

Build queries related to family variants.

dae.backends.schema2.impala_variants module

class dae.backends.schema2.impala_variants.ImpalaDialect[source]

Bases: dae.backends.schema2.base_query_builder.Dialect

class dae.backends.schema2.impala_variants.ImpalaVariants(impala_helpers, db, family_variant_table, summary_allele_table, pedigree_table, meta_table, gene_models=None)[source]

Bases: object

A variants backend implemented on top of Impala.

build_family_variants_query_runner(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, limit=None, pedigree_fields=None)[source]

Build a query selecting the appropriate family variants.

static build_person_set_collection_query(person_set_collection: dae.person_sets.PersonSetCollection, person_set_collection_query: Tuple[str, Set[str]])[source]

Build a query filter for the given person set collection.

build_summary_variants_query_runner(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, limit=None) dae.backends.query_runners.QueryRunner[source]

Build a query selecting the appropriate summary variants.

connection()[source]
query_summary_variants(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, limit=None)[source]

Query summary variants.

query_variants(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, limit=None, pedigree_fields=None)[source]

Query family variants.

dae.backends.schema2.parquet_io module

class dae.backends.schema2.parquet_io.ContinuousParquetFileWriter(filepath, variant_loader, filesystem=None, rows=100000, schema='schema')[source]

Bases: object

A continuous parquet writer.

Automatically writes buffered data to the given parquet file once enough rows have been supplied, and flushes any leftover data to the file on close.
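The buffering behaviour can be modelled roughly as follows. This is a toy sketch, not the real ContinuousParquetFileWriter: it collects plain dicts in memory instead of building parquet row groups, but shows the flush-at-threshold and flush-on-close pattern.

```python
class BufferedWriterSketch:
    """Toy model of a threshold-flushing writer (names are invented)."""

    def __init__(self, rows: int = 100_000):
        self.rows = rows
        self._buffer: list[dict] = []
        # Stands in for row groups written to the parquet file.
        self.flushed_batches: list[list[dict]] = []

    def append(self, record: dict) -> None:
        # Buffer the record; flush once the batch size is reached.
        self._buffer.append(record)
        if len(self._buffer) >= self.rows:
            self._flush()

    def _flush(self) -> None:
        if self._buffer:
            self.flushed_batches.append(self._buffer)
            self._buffer = []

    def close(self) -> None:
        # Leftover data is dumped when closing.
        self._flush()
```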

append_family_allele(allele, json_data)[source]

Append the data for a family allele to the correct file.

append_summary_allele(allele, json_data)[source]

Append the data for a summary allele to the correct file.

build_table()[source]
close()[source]
data_reset()[source]
size()[source]
class dae.backends.schema2.parquet_io.NoPartitionDescriptor(root_dirname='')[source]

Bases: dae.backends.schema2.parquet_io.PartitionDescriptor

Defines a partition description for datasets without partitioning.

build_impala_partitions()[source]
property chromosomes
family_filename(family_allele)[source]
static generate_file_access_glob()[source]

Return a glob for accessing every parquet file in the partition.

has_partitions()[source]
property region_length
summary_filename(summary_allele)[source]
static variants_filename_basedir(filename)[source]

Extract the variants basedir from filename.

write_partition_configuration()[source]
class dae.backends.schema2.parquet_io.ParquetManager[source]

Bases: object

Provide functions for producing variants and pedigree parquet files.

static build_parquet_filenames(prefix, study_id=None, bucket_index=0, suffix=None)[source]

Build parquet filenames.

static families_to_parquet(families, pedigree_filename)[source]
static variants_to_parquet(variants_loader, partition_descriptor, bucket_index=1, rows=100000, include_reference=False)[source]

Read variants from variant_loader and store them in parquet.

class dae.backends.schema2.parquet_io.ParquetPartitionDescriptor(chromosomes, region_length, family_bin_size=0, coding_effect_types=None, rare_boundary=0, root_dirname='')[source]

Bases: dae.backends.schema2.parquet_io.PartitionDescriptor

Defines the partition description used for parquet datasets.

add_family_bins_to_families(families)[source]
build_impala_partitions()[source]
property chromosomes
property coding_effect_types
property family_bin_size
family_filename(family_allele)[source]

Return filename that family_allele should be appended to.

static from_config(config_path, root_dirname='')[source]

Create a partition description from the provided config file.

static from_dict(config, root_dirname='')[source]

Create a partition description from the provided dictionary.

generate_file_access_glob()[source]

Return a glob for accessing every parquet file in the partition.

has_partitions()[source]
property rare_boundary
property region_length
summary_filename(summary_allele)[source]

Return filename that summary_allele should be appended to.

variants_filename_basedir(filename)[source]

Extract the variants basedir from filename.

write_partition_configuration()[source]
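The constructor parameters above suggest how partition columns such as a region bin and a family bin could be derived. The following is a hypothetical sketch; the column naming and the choice of CRC32 as the hash are assumptions for illustration, not the library's actual implementation.

```python
import zlib


def region_bin(chromosome: str, position: int, region_length: int) -> str:
    # Variants fall into fixed-length windows along the chromosome.
    return f"{chromosome}_{position // region_length}"


def family_bin(family_id: str, family_bin_size: int) -> int:
    # Families are spread deterministically across a fixed number of bins;
    # CRC32 is an assumed hash here, chosen for reproducibility.
    return zlib.crc32(family_id.encode("utf8")) % family_bin_size
```
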
class dae.backends.schema2.parquet_io.PartitionDescriptor[source]

Bases: object

Abstract class for partition description.

build_impala_partitions()[source]
property chromosomes
family_alleles_dirname: str = 'family'
family_filename(family_allele)[source]
has_partitions()[source]
property region_length
summary_alleles_dirname: str = 'summary'
summary_filename(summary_allele)[source]
write_partition_configuration()[source]
class dae.backends.schema2.parquet_io.VariantsParquetWriter(variants_loader, partition_descriptor, bucket_index=1, rows=100000, include_reference=True, filesystem=None)[source]

Bases: object

Provide functions for storing variants into a parquet dataset.

write_dataset()[source]
write_schema()[source]

Write the schema to a separate file.

dae.backends.schema2.parquet_io.add_missing_parquet_fields(pps, ped_df)[source]

Add missing parquet fields.

dae.backends.schema2.parquet_io.pedigree_parquet_schema()[source]

Return the schema for pedigree parquet file.

dae.backends.schema2.parquet_io.save_ped_df_to_parquet(ped_df, filename, filesystem=None)[source]

Save ped_df as a parquet file named filename.

dae.backends.schema2.serializers module

class dae.backends.schema2.serializers.AlleleParquetSerializer(annotation_schema, extra_attributes=None)[source]

Bases: object

Serialize a collection of alleles.

BASE_SEARCHABLE_PROPERTIES_TYPES = {'allele_in_members': DataType(string), 'allele_in_roles': DataType(int32), 'allele_in_sexes': DataType(int8), 'allele_in_statuses': DataType(int8), 'allele_index': DataType(int32), 'bucket_index': DataType(int32), 'chromosome': DataType(string), 'effect_gene': ListType(list<item: struct<effect_gene_symbols: string, effect_types: string>>), 'end_position': DataType(int32), 'family_id': DataType(string), 'family_index': DataType(int32), 'inheritance_in_members': DataType(int16), 'is_denovo': DataType(int8), 'position': DataType(int32), 'reference': DataType(string), 'summary_index': DataType(int32), 'transmission_type': DataType(int8), 'variant_type': DataType(int8)}
ENUM_PROPERTIES = {'allele_in_roles': <enum 'Role'>, 'allele_in_sexes': <enum 'Sex'>, 'allele_in_statuses': <enum 'Status'>, 'inheritance_in_members': <enum 'Inheritance'>, 'transmission_type': <enum 'TransmissionType'>, 'variant_type': <enum 'Type'>}
FAMILY_ALLELE_BASE_SCHEMA = {'allele_in_members': ListType(list<item: string>), 'allele_in_roles': DataType(int32), 'allele_in_sexes': DataType(int8), 'allele_in_statuses': DataType(int8), 'allele_index': DataType(int32), 'bucket_index': DataType(int32), 'family_id': DataType(string), 'family_index': DataType(int32), 'inheritance_in_members': DataType(int16), 'is_denovo': DataType(int8), 'summary_index': DataType(int32)}
SUMMARY_ALLELE_BASE_SCHEMA = {'af_allele_count': DataType(int32), 'af_allele_freq': DataType(float), 'af_parents_called': DataType(int32), 'af_parents_freq': DataType(float), 'allele_index': DataType(int32), 'bucket_index': DataType(int32), 'chromosome': DataType(string), 'effect_gene': ListType(list<element: struct<effect_gene_symbols: string, effect_types: string>>), 'end_position': DataType(int32), 'position': DataType(int32), 'reference': DataType(string), 'summary_index': DataType(int32), 'transmission_type': DataType(int8), 'variant_type': DataType(int8)}
build_family_allele_batch_dict(allele, family_variant_data) dict[str, list[Any]][source]

Build a batch of family allele data in the form of a dict.

build_summary_allele_batch_dict(allele, summary_variant_data) dict[str, list[Any]][source]

Build a batch of summary allele data in the form of a dict.

property schema_family

Lazily construct and return the schema for the family alleles.

property schema_summary

Lazily construct and return the schema for the summary alleles.

property searchable_properties
property searchable_properties_family
property searchable_properties_summary
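The dict[str, list[Any]] shape returned by the build_*_batch_dict methods is a column-oriented batch: one list of values per column, ready to become columnar (Arrow-style) arrays. A minimal illustration, where rows_to_batch is an invented helper and not part of the library:

```python
from typing import Any


def rows_to_batch(
    rows: list[dict[str, Any]], columns: list[str]
) -> dict[str, list[Any]]:
    """Transpose row records into a column -> list-of-values dict."""
    return {col: [row.get(col) for row in rows] for col in columns}


batch = rows_to_batch(
    [
        {"allele_index": 1, "position": 100, "family_id": "f1"},
        {"allele_index": 2, "position": 200, "family_id": "f1"},
    ],
    ["allele_index", "position", "family_id"],
)
```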

dae.backends.schema2.summary_builder module

class dae.backends.schema2.summary_builder.SummaryQueryBuilder(dialect: dae.backends.schema2.base_query_builder.Dialect, db, family_variant_table, summary_allele_table, pedigree_table, family_variant_schema, summary_allele_schema, table_properties, pedigree_schema, pedigree_df, gene_models=None, do_join_affected=False)[source]

Bases: dae.backends.schema2.base_query_builder.BaseQueryBuilder

Build queries related to summary variants.

dae.backends.schema2.vcf2schema2 module

Import script, similar to vcf2parquet.py.

TODO: when complete, add to setup.py; do not inherit, create a new tool; retrace the steps of the Variants2ParquetTool class.

class dae.backends.schema2.vcf2schema2.MakefilePartitionHelper(partition_descriptor, genome, add_chrom_prefix=None, del_chrom_prefix=None)[source]

Bases: object

bucket_index(region_bin)[source]
build_target_chromosomes(target_chromosomes)[source]
generate_chrom_targets(target_chrom)[source]
generate_variants_targets(target_chromosomes)[source]
region_bins_count(chrom)[source]
class dae.backends.schema2.vcf2schema2.Variants2Schema2[source]

Bases: object

BUCKET_INDEX_DEFAULT = 1000
VARIANTS_FREQUENCIES: bool = True
VARIANTS_LOADER_CLASS

alias of dae.backends.vcf.loader.VcfLoader

VARIANTS_TOOL: Optional[str] = 'vcf2schema2.py'
classmethod cli_arguments_parser(gpf_instance)[source]
classmethod main(argv=sys.argv[1:], gpf_instance=None)[source]
dae.backends.schema2.vcf2schema2.construct_import_annotation_pipeline(gpf_instance, annotation_configfile=None)[source]
dae.backends.schema2.vcf2schema2.construct_import_effect_annotator(gpf_instance)[source]
dae.backends.schema2.vcf2schema2.main(argv=sys.argv[1:], gpf_instance=None)[source]
dae.backends.schema2.vcf2schema2.save_study_config(dae_config, study_id, study_config, force=False)[source]

Module contents

Implementation for the next version (v2) of the DB schema.

The variants schema is separated into two tables: summary allele and family variant.

  • supported on BigQuery and Impala (specified via Dialect)

  • parquet generation outputs two separate parquet files
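Under such a two-table layout, family variants can be re-joined with their summary alleles on the index columns that appear in both documented base schemas (bucket_index, summary_index, allele_index). A hedged sketch, with placeholder table names and a minimal column selection:

```python
# Hypothetical re-join of the two tables; column names come from the
# FAMILY_ALLELE_BASE_SCHEMA / SUMMARY_ALLELE_BASE_SCHEMA documented above,
# while the table names and selected columns are placeholders.
JOIN_SKETCH = """
SELECT sa.chromosome, sa.position, sa.variant_type, fv.family_id
FROM summary_allele AS sa
JOIN family_variant AS fv
  ON sa.bucket_index = fv.bucket_index
 AND sa.summary_index = fv.summary_index
 AND sa.allele_index = fv.allele_index
"""
```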