dae.backends.impala package
Submodules
dae.backends.impala.base_query_builder module
- class dae.backends.impala.base_query_builder.BaseQueryBuilder(db, variants_table, pedigree_table, variants_schema, table_properties, pedigree_schema, pedigree_df, gene_models=None)[source]
Bases:
abc.ABC
A base class for all query builders.
- GENE_REGIONS_HEURISTIC_CUTOFF = 20
- GENE_REGIONS_HEURISTIC_EXTEND = 20000
- MAX_CHILD_NUMBER = 9999
- QUOTE = "'"
- WHERE = '\n WHERE\n {where}\n '
- build_where(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, **_kwargs)[source]
Build the where clause of a query.
- property product
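The `QUOTE` and `WHERE` attributes above suggest that the where clause is assembled by quoting values and filling the `{where}` placeholder. The following is only a minimal sketch under that assumption: `quote_values` and `build_where_sketch` are hypothetical names, and the column names are borrowed from the searchable properties listed for the serializers, not from the real `build_where` implementation.

```python
# Class attributes copied from the BaseQueryBuilder listing above.
QUOTE = "'"
WHERE = "\n WHERE\n {where}\n "


def quote_values(values):
    """Wrap each value in the QUOTE character and join with commas."""
    return ", ".join(f"{QUOTE}{value}{QUOTE}" for value in values)


def build_where_sketch(family_ids=None, person_ids=None):
    """Hypothetical stand-in for build_where: AND together the active filters."""
    conditions = []
    if family_ids:
        conditions.append(f"family_id IN ({quote_values(family_ids)})")
    if person_ids:
        conditions.append(f"variant_in_members IN ({quote_values(person_ids)})")
    if not conditions:
        return ""  # no filters given: omit the WHERE clause entirely
    joined = " AND\n ".join(f"( {condition} )" for condition in conditions)
    return WHERE.format(where=joined)


clause = build_where_sketch(family_ids=["f1", "f2"])
print(clause)
```

Filters left at `None` contribute nothing, which mirrors the all-`None` defaults in the `build_where` signature.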
dae.backends.impala.family_variants_query_builder module
- class dae.backends.impala.family_variants_query_builder.FamilyVariantsQueryBuilder(db, variants_table, pedigree_table, variants_schema, table_properties, pedigree_schema, pedigree_df, families, gene_models=None, do_join=False)[source]
Bases:
dae.backends.impala.base_query_builder.BaseQueryBuilder
Build queries related to family variants.
- build_where(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, pedigree_fields=None)[source]
Build the where clause of a query.
dae.backends.impala.hdfs_helpers module
- class dae.backends.impala.hdfs_helpers.HdfsHelpers(hdfs_host, hdfs_port, replication=None)[source]
Bases:
object
Helper methods for working with HDFS.
- property hdfs
Return a file system for working with HDFS.
dae.backends.impala.impala_helpers module
- class dae.backends.impala.impala_helpers.ImpalaHelpers(impala_hosts, impala_port=21050, pool_size=1)[source]
Bases:
object
Helper methods for working with Impala.
dae.backends.impala.impala_query_director module
- class dae.backends.impala.impala_query_director.ImpalaQueryDirector(query_builder)[source]
Bases:
object
Build a query in the right order.
- build_query(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, limit=None, pedigree_fields=None)[source]
Build a query in the right order.
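"Build a query in the right order" reads like the director role of the builder pattern: the director decides the sequence of steps and the builder holds the partial result. The sketch below illustrates that division of labour with toy classes. Apart from `build_where`, `build_query` and `product`, which appear in the listings above, every step name here is an assumption made for illustration.

```python
class ToyQueryBuilder:
    """Hypothetical builder; the step names are illustrative only."""

    def __init__(self, table):
        self.table = table
        self.parts = []

    def build_select(self):
        self.parts.append("SELECT *")

    def build_from(self):
        self.parts.append(f"FROM {self.table}")

    def build_where(self, family_ids=None, **_kwargs):
        if family_ids:
            ids = ", ".join(f"'{fid}'" for fid in family_ids)
            self.parts.append(f"WHERE family_id IN ({ids})")

    def build_limit(self, limit=None):
        if limit is not None:
            self.parts.append(f"LIMIT {int(limit)}")

    @property
    def product(self):
        """Return the query text assembled so far."""
        return " ".join(self.parts)


class ToyQueryDirector:
    """Calls the builder steps in a fixed order."""

    def __init__(self, query_builder):
        self.query_builder = query_builder

    def build_query(self, limit=None, **filters):
        self.query_builder.build_select()
        self.query_builder.build_from()
        self.query_builder.build_where(**filters)
        self.query_builder.build_limit(limit)


builder = ToyQueryBuilder("variants")
ToyQueryDirector(builder).build_query(family_ids=["f1"], limit=10)
query = builder.product
print(query)
```

Keeping the ordering in the director means a family builder and a summary builder can share the same sequence while differing only in what each step emits.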
dae.backends.impala.impala_variants module
- class dae.backends.impala.impala_variants.ImpalaQueryRunner(connection_pool, query, deserializer=None)[source]
Bases:
dae.backends.query_runners.QueryRunner
Run a query in a separate thread.
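Running a query in a separate thread usually means that a worker executes the query and hands rows to the consumer through a queue. The sketch below shows that general technique with the standard library only. `ToyQueryRunner`, its `start` and `results` methods, and the `execute` callable are all hypothetical and do not describe the `QueryRunner` interface.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the result stream


class ToyQueryRunner:
    """Hypothetical runner: executes a query in a worker thread."""

    def __init__(self, execute, query, deserializer=None):
        self._execute = execute  # callable returning an iterable of rows
        self.query = query
        self._deserializer = deserializer or (lambda row: row)
        self._results = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        try:
            for row in self._execute(self.query):
                self._results.put(self._deserializer(row))
        finally:
            # Always signal completion so the consumer never blocks forever.
            self._results.put(_DONE)

    def start(self):
        self._thread.start()

    def results(self):
        """Yield deserialized rows until the worker signals completion."""
        while True:
            item = self._results.get()
            if item is _DONE:
                break
            yield item
        self._thread.join()


def fake_execute(query):
    """Stand-in for a database cursor; returns two fixed rows."""
    return iter([(1, "chr1"), (2, "chr2")])


runner = ToyQueryRunner(fake_execute, "SELECT ...", deserializer=lambda row: row[1])
runner.start()
rows = list(runner.results())
print(rows)
```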
- class dae.backends.impala.impala_variants.ImpalaVariants(impala_helpers, db, variants_table, pedigree_table, gene_models=None)[source]
Bases:
object
A variants backend implemented on top of Impala.
- TYPE_MAP: Dict[str, Any] = {'bigint': (<class 'int'>, DataType(int64)), 'binary': (<class 'bytes'>, DataType(binary)), 'bool': (<class 'bool'>, DataType(bool)), 'boolean': (<class 'bool'>, DataType(bool)), 'float': (<class 'float'>, DataType(float)), 'float32': (<class 'float'>, DataType(float)), 'float64': (<class 'float'>, DataType(double)), 'int': (<class 'int'>, DataType(int32)), 'int16': (<class 'int'>, DataType(int16)), 'int32': (<class 'int'>, DataType(int32)), 'int64': (<class 'int'>, DataType(int64)), 'int8': (<class 'int'>, DataType(int8)), 'list(float)': (<class 'list'>, ListType(list<item: double>)), 'list(int)': (<class 'list'>, ListType(list<item: int32>)), 'list(str)': (<class 'list'>, ListType(list<item: string>)), 'smallint': (<class 'int'>, DataType(int16)), 'str': (<class 'str'>, DataType(string)), 'string': (<class 'bytes'>, DataType(string)), 'tinyint': (<class 'int'>, DataType(int8))}
- build_count_query(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, return_reference=None, return_unknown=None, limit=None)[source]
Build a query that counts variants.
- build_family_variants_query_runner(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, limit=None, pedigree_fields=None)[source]
Build a query selecting the appropriate family variants.
- static build_person_set_collection_query(person_set_collection: dae.person_sets.PersonSetCollection, person_set_collection_query: Tuple[str, Set[str]])[source]
Undocumented: the source docstring does not describe what this method does.
- build_summary_variants_query_runner(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, limit=None)[source]
Build a query selecting the appropriate summary variants.
- property connection_pool
- property executor
- query_summary_variants(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, limit=None)[source]
Query summary variants.
- query_variants(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, limit=None, pedigree_fields=None)[source]
Query family variants.
dae.backends.impala.import_commons module
- class dae.backends.impala.import_commons.BatchImporter(gpf_instance)[source]
Bases:
object
- property families
- class dae.backends.impala.import_commons.DatasetHelpers(gpf_instance=None)[source]
Bases:
object
A collection of helper methods for working with datasets.
- dataset_remove_hdfs_directory(dataset_id, dry_run=None)[source]
Delete the HDFS directory for a dataset with id dataset_id.
- class dae.backends.impala.import_commons.MakefileGenerator[source]
Bases:
object
Generate a Makefile which when executed imports a study.
- TEMPLATE = <Template memory:7fc43a1bf940>
- class dae.backends.impala.import_commons.MakefilePartitionHelper(partition_descriptor, genome, add_chrom_prefix=None, del_chrom_prefix=None)[source]
Bases:
object
- class dae.backends.impala.import_commons.SnakefileGenerator[source]
Bases:
object
Generate a Snakefile which when executed imports a study.
- TEMPLATE = <Template memory:7fc43a1c2730>
- class dae.backends.impala.import_commons.SnakefileKubernetesGenerator[source]
Bases:
object
Generate a Snakefile which when executed imports a study using k8s.
- TEMPLATE = <Template memory:7fc43a20e640>
- class dae.backends.impala.import_commons.Variants2ParquetTool[source]
Bases:
object
Tool that stores variants in one or more parquet files.
- BUCKET_INDEX_DEFAULT = 1000
- VARIANTS_FREQUENCIES: bool = False
- VARIANTS_LOADER_CLASS: Any = None
- VARIANTS_TOOL: Optional[str] = None
- dae.backends.impala.import_commons.construct_import_annotation_pipeline(gpf_instance, annotation_configfile=None)[source]
Construct the import annotation pipeline.
dae.backends.impala.parquet_io module
Provides Apache Parquet storage of genotype data.
- class dae.backends.impala.parquet_io.ContinuousParquetFileWriter(filepath, variant_loader, filesystem=None, rows=100000)[source]
Bases:
object
Class that writes to an output parquet file.
This class automatically writes to the given parquet file once it has been supplied with enough data. Any leftover data is written to the file when the writer is closed.
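The behaviour described above (flush once enough data has accumulated, then flush the remainder on close) can be sketched without any parquet dependency. In this illustration the sink is a plain list standing in for the parquet file, and `ToyContinuousWriter`, `append` and `close` are hypothetical names. The assumption that the `rows` argument is the flush threshold is a guess based on the constructor signature.

```python
class ToyContinuousWriter:
    """Hypothetical buffered writer; a list plays the role of the file."""

    def __init__(self, sink, rows=100_000):
        self.sink = sink  # receives one list per flushed batch
        self.rows = rows  # assumed flush threshold
        self._buffer = []

    def append(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self.rows:
            self._flush()

    def _flush(self):
        if self._buffer:
            self.sink.append(list(self._buffer))
            self._buffer.clear()

    def close(self):
        # Dump whatever is left over, even if below the threshold.
        self._flush()


batches = []
writer = ToyContinuousWriter(batches, rows=3)
for record in range(7):
    writer.append(record)
writer.close()
print(batches)
```

With a threshold of 3 and seven records, two full batches are written during the loop and the single leftover record is written by `close`.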
- class dae.backends.impala.parquet_io.NoPartitionDescriptor(root_dirname='')[source]
Bases:
dae.backends.impala.parquet_io.PartitionDescriptor
Defines the partition descriptor used when no partition description is provided.
- property chromosomes
- static generate_file_access_glob()[source]
Generate a glob for accessing parquet files in the partition.
- property region_length
- class dae.backends.impala.parquet_io.ParquetManager[source]
Bases:
object
Provide functions for producing variants and pedigree parquet files.
- class dae.backends.impala.parquet_io.ParquetPartitionDescriptor(chromosomes, region_length, family_bin_size=0, coding_effect_types=None, rare_boundary=0, root_dirname='')[source]
Bases:
dae.backends.impala.parquet_io.PartitionDescriptor
Defines partition description used for parquet datasets.
- property chromosomes
- property coding_effect_types
- property family_bin_size
- static from_config(config_path, root_dirname='')[source]
Create a partition description from the provided config file.
- static from_dict(config, root_dirname='') -> dae.backends.impala.parquet_io.ParquetPartitionDescriptor [source]
Create a partition description from the provided config.
- generate_file_access_glob()[source]
Return a glob for accessing every parquet file in the partition.
- property rare_boundary
- property region_length
- class dae.backends.impala.parquet_io.PartitionDescriptor[source]
Bases:
object
Abstract class for partition description.
- property chromosomes
- property region_length
- class dae.backends.impala.parquet_io.VariantsParquetWriter(variants_loader, partition_descriptor, bucket_index=None, rows=100000, include_reference=True)[source]
Bases:
object
Provide functions for storing variants into a parquet dataset.
- dae.backends.impala.parquet_io.add_missing_parquet_fields(pps, ped_df)[source]
Add missing parquet fields.
dae.backends.impala.rsync_helpers module
dae.backends.impala.serializers module
- class dae.backends.impala.serializers.AlleleParquetSerializer(annotation_schema, extra_attributes=None)[source]
Bases:
object
Serialize a bunch of alleles.
- ALLELE_CREATION_PROPERTIES = ['chromosome', 'position', 'end_position', 'variant_type', 'reference', 'alternative', 'allele_index', 'summary_index', 'transmission_type', 'family_id', 'gt', 'best_state', 'genetic_model', 'variant_in_roles', 'variant_in_sexes', 'inheritance_in_members', 'variant_in_members']
- ALLELE_PROP_GETTERS = {'effect_details_details': <function AlleleParquetSerializer.<lambda>>, 'effect_details_transcript_ids': <function AlleleParquetSerializer.<lambda>>, 'effect_gene_genes': <function AlleleParquetSerializer.<lambda>>, 'effect_gene_types': <function AlleleParquetSerializer.<lambda>>, 'effect_type': <function AlleleParquetSerializer.<lambda>>}
- BASE_SCHEMA_FIELDS = [pyarrow.Field<bucket_index: int32>, pyarrow.Field<summary_variant_index: int64>, pyarrow.Field<allele_index: int8>, pyarrow.Field<chrom: string>, pyarrow.Field<position: int32>, pyarrow.Field<end_position: int32>, pyarrow.Field<reference: string>, pyarrow.Field<alternative: string>, pyarrow.Field<variant_type: int8>, pyarrow.Field<transmission_type: int8>, pyarrow.Field<alternatives_data: string>, pyarrow.Field<effect_type: string>, pyarrow.Field<effect_gene: string>, pyarrow.Field<effect_data: string>, pyarrow.Field<family_variant_index: int64>, pyarrow.Field<family_id: string>, pyarrow.Field<is_denovo: bool>, pyarrow.Field<variant_sexes: int8>, pyarrow.Field<variant_roles: int32>, pyarrow.Field<variant_inheritance: int16>, pyarrow.Field<variant_in_member: string>, pyarrow.Field<genotype_data: string>, pyarrow.Field<best_state_data: string>, pyarrow.Field<genetic_model_data: int8>, pyarrow.Field<inheritance_data: string>, pyarrow.Field<af_parents_called_count: int32>, pyarrow.Field<af_parents_called_percent: float>, pyarrow.Field<af_allele_count: int32>, pyarrow.Field<af_allele_freq: float>, pyarrow.Field<frequency_data: string>, pyarrow.Field<genomic_scores_data: string>]
- BASE_SEARCHABLE_PROPERTIES_TYPES = {'allele_index': DataType(int32), 'bucket_index': DataType(int32), 'chromosome': DataType(string), 'effect_gene_symbols': DataType(string), 'effect_types': DataType(string), 'end_position': DataType(int32), 'family_id': DataType(string), 'family_index': DataType(int32), 'inheritance_in_members': DataType(int16), 'is_denovo': DataType(int8), 'position': DataType(int32), 'reference': DataType(string), 'summary_index': DataType(int32), 'transmission_type': DataType(int8), 'variant_in_members': DataType(string), 'variant_in_roles': DataType(int32), 'variant_in_sexes': DataType(int8), 'variant_type': DataType(int8)}
- ENUM_PROPERTIES = {'inheritance_in_members': <enum 'Inheritance'>, 'transmission_type': <enum 'TransmissionType'>, 'variant_in_roles': <enum 'Role'>, 'variant_in_sexes': <enum 'Sex'>, 'variant_type': <enum 'Type'>}
- FAMILY_SEARCHABLE_PROPERTIES_TYPES = {'family_id': DataType(string), 'family_index': DataType(int32), 'inheritance_in_members': DataType(int16), 'is_denovo': DataType(int8), 'variant_in_members': DataType(string), 'variant_in_roles': DataType(int32), 'variant_in_sexes': DataType(int8)}
- GENOMIC_SCORES_SCHEMA_CLEAN_UP = ['worst_effect', 'family_bin', 'rare', 'genomic_scores_data', 'frequency_bin', 'coding', 'position_bin', 'chrome_bin', 'coding2', 'region_bin', 'coding_bin', 'effect_data', 'genotype_data', 'inheritance_data', 'genomic_scores_data', 'variant_sexes', 'alternatives_data', 'chrom', 'best_state_data', 'summary_variant_index', 'effect_type', 'effect_gene', 'effect_genes', 'effect_gene_genes', 'effect_gene_types', 'effect_details_details', 'effect_details_transcript_ids', 'effect_details', 'variant_inheritance', 'variant_in_member', 'variant_roles', 'genetic_model_data', 'frequency_data', 'alternative', 'variant_data', 'family_variant_index']
- LIST_TO_ROW_PROPERTIES_LISTS = [['effect_types', 'effect_gene_symbols'], ['variant_in_members']]
- PRODUCT_PROPERTIES_LIST = ['effect_types', 'effect_gene_symbols', 'variant_in_members']
- SUMMARY_SEARCHABLE_PROPERTIES_TYPES = {'allele_index': DataType(int32), 'bucket_index': DataType(int32), 'chromosome': DataType(string), 'effect_gene_symbols': DataType(string), 'effect_types': DataType(string), 'end_position': DataType(int32), 'position': DataType(int32), 'reference': DataType(string), 'summary_index': DataType(int32), 'transmission_type': DataType(int8), 'variant_type': DataType(int8)}
- property allele_batch_header
Return the names of the properties in an allele batch.
- build_allele_batch_dict(allele, variant_data, extra_attributes_data, summary_vectors)[source]
Build a batch of allele data in the form of a dict.
- deserialize_family_variant(main_blob, family, extra_blob=None)[source]
Read a family variant from the input stream.
- deserialize_summary_variant(main_blob, extra_blob=None)[source]
Read a summary variant from the input stream.
- property schema
Lazy construct and return the schema.
- property searchable_properties
- property searchable_properties_family
- property searchable_properties_summary
- class dae.backends.impala.serializers.Serializer(serialize, deserialize, type)
Bases:
tuple
- deserialize
Alias for field number 1
- serialize
Alias for field number 0
- type
Alias for field number 2
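The "Alias for field number" entries indicate that `Serializer` is a named tuple with the fields `serialize`, `deserialize` and `type`, in that order. A minimal sketch of how such a triple can be defined and used follows. The `int_serializer` instance and its `"int32"` type label are made up for illustration and are not taken from the module.

```python
import struct
from collections import namedtuple

# Same field order as in the listing above: 0, 1, 2.
Serializer = namedtuple("Serializer", ["serialize", "deserialize", "type"])

# Hypothetical instance: packs a Python int as a little-endian 32-bit value.
int_serializer = Serializer(
    serialize=lambda value: struct.pack("<i", value),
    deserialize=lambda blob: struct.unpack("<i", blob)[0],
    type="int32",
)

blob = int_serializer.serialize(42)
value = int_serializer.deserialize(blob)
print(len(blob), value)
```

Bundling both directions with the type keeps the encoder and decoder for a property together, so they can be looked up as one unit.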
- dae.backends.impala.serializers.read_in_roles(stream)[source]
Read a list of Roles from the stream.
- dae.backends.impala.serializers.read_in_sexes(stream) -> list[Optional[dae.variants.attributes.Sex]] [source]
Read a list of sexes from the stream.
- dae.backends.impala.serializers.read_inheritance(stream)[source]
Read a list of Inheritances from the stream.
- dae.backends.impala.serializers.read_string_list(stream) -> list[Optional[str]] [source]
Read a list of strings from the stream.
- dae.backends.impala.serializers.write_best_state(stream, best_state)[source]
Write best state to the stream.
- dae.backends.impala.serializers.write_big_enum_list(stream, the_list)[source]
Write a list of big enums (more than 128 states) to the stream.
- dae.backends.impala.serializers.write_effects(stream, allele)[source]
Write allele’s effect data to the stream.
- dae.backends.impala.serializers.write_enum_list(stream, the_list)[source]
Write a list of enums to the stream.
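The read and write helpers above operate on a binary stream. The sketch below shows one way a list of enum values, possibly containing `None`, can be round-tripped through such a stream. The byte layout (a one-byte length, one byte per item, 255 for a missing entry) is an assumption chosen for the example and is not the module's actual format. The `Sex` enum here is a toy with illustrative values, not `dae.variants.attributes.Sex`.

```python
import enum
import io
import struct


class Sex(enum.Enum):
    """Toy enum with illustrative values."""

    U = 0
    M = 1
    F = 2


MISSING = 255  # assumed marker for a None entry


def write_enum_list_sketch(stream, the_list):
    """Write a length byte followed by one byte per enum value."""
    stream.write(struct.pack("<B", len(the_list)))
    for item in the_list:
        code = MISSING if item is None else item.value
        stream.write(struct.pack("<B", code))


def read_enum_list_sketch(stream, enum_type):
    """Read back a list written by write_enum_list_sketch."""
    (length,) = struct.unpack("<B", stream.read(1))
    result = []
    for _ in range(length):
        (code,) = struct.unpack("<B", stream.read(1))
        result.append(None if code == MISSING else enum_type(code))
    return result


stream = io.BytesIO()
write_enum_list_sketch(stream, [Sex.M, None, Sex.F])
stream.seek(0)
decoded = read_enum_list_sketch(stream, Sex)
print(decoded)
```

A single byte per item only works for small enums, which may be why the module lists a separate `write_big_enum_list` for enums with more than 128 states.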
dae.backends.impala.summary_variants_query_builder module
- class dae.backends.impala.summary_variants_query_builder.SummaryVariantsQueryBuilder(db, variants_table, pedigree_table, variants_schema, table_properties, pedigree_schema, pedigree_df, gene_models=None, summary_variants_table=None)[source]
Bases:
dae.backends.impala.base_query_builder.BaseQueryBuilder
Build queries related to summary variants.
- build_where(regions=None, genes=None, effect_types=None, family_ids=None, person_ids=None, inheritance=None, roles=None, sexes=None, variant_type=None, real_attr_filter=None, ultra_rare=None, frequency_filter=None, return_reference=None, return_unknown=None, **_kwargs)[source]
Build the where clause of a query.