dae.parquet.schema2 package

Submodules

dae.parquet.schema2.parquet_io module

class dae.parquet.schema2.parquet_io.ContinuousParquetFileWriter(filepath: str, annotation_schema: list[dae.annotation.annotation_pipeline.AttributeInfo], filesystem: AbstractFileSystem | None = None, row_group_size: int = 50000, schema: str = 'schema', blob_column: str | None = None)

Bases: object

A continuous parquet writer.

Buffers incoming data and automatically writes it to the given parquet file once enough has accumulated. Any leftover data is written to the file when the writer is closed.
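The buffer-then-flush behaviour described above can be sketched in plain Python. This is a minimal illustration, not the ContinuousParquetFileWriter implementation: the class name BufferedRowWriter, the sink callback, and the batch_rows threshold of 3 are hypothetical stand-ins chosen to keep the example self-contained.

```python
from typing import Any, Callable


class BufferedRowWriter:
    """Hypothetical stand-in that buffers rows and flushes them in batches."""

    def __init__(
        self,
        sink: Callable[[list[dict[str, Any]]], None],
        batch_rows: int = 3,
    ) -> None:
        self._sink = sink              # called once per flushed batch
        self._batch_rows = batch_rows  # flush threshold (assumed value)
        self._rows: list[dict[str, Any]] = []

    def append(self, row: dict[str, Any]) -> None:
        """Buffer one row and flush as soon as the threshold is reached."""
        self._rows.append(row)
        if len(self._rows) >= self._batch_rows:
            self._flush()

    def size(self) -> int:
        """Return the number of rows currently buffered."""
        return len(self._rows)

    def _flush(self) -> None:
        """Hand the buffered rows to the sink and start a fresh buffer."""
        if self._rows:
            self._sink(self._rows)
            self._rows = []

    def close(self) -> None:
        """Write whatever is still buffered."""
        self._flush()


written: list[list[dict[str, Any]]] = []
writer = BufferedRowWriter(written.append, batch_rows=3)
for i in range(7):
    writer.append({"summary_index": i})
writer.close()
batch_sizes = [len(batch) for batch in written]
```

Seven rows with a threshold of three produce two full batches during appending and one smaller batch on close, which is the "leftover data" case.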

BATCH_ROWS = 1000
DEFAULT_COMPRESSION = 'SNAPPY'
append_family_allele(allele: FamilyAllele, json_data: str) → None

Append the data for an entire variant to the correct file.

append_summary_allele(allele: SummaryAllele, json_data: str) → None

Append the data for an entire variant to the correct file.

build_batch() → RecordBatch
build_table() → Table
close() → None

Close the parquet writer and write any remaining data.

data_reset() → None
size() → int
class dae.parquet.schema2.parquet_io.VariantsParquetWriter(out_dir: str, annotation_schema: list[dae.annotation.annotation_pipeline.AttributeInfo], partition_descriptor: PartitionDescriptor, bucket_index: int = 1, row_group_size: int = 50000, include_reference: bool = True, filesystem: AbstractFileSystem | None = None)

Bases: object

Provides methods for storing variants in a parquet dataset.

close() → None
write_dataset(full_variants_iterator: Iterator[tuple[dae.variants.variant.SummaryVariant, list[dae.variants.family_variant.FamilyVariant]]]) → list[str]

Write variants to a partitioned parquet dataset.
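How records end up in separate files of a partitioned dataset can be illustrated with a small standalone sketch. It is not the VariantsParquetWriter implementation: the partition key (a chromosome region bin), the region_length parameter, the path layout, and the function names partition_path and route_records are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable


def partition_path(
    chromosome: str, position: int, region_length: int = 1_000_000
) -> str:
    """Hypothetical partition key: one directory per chromosome region."""
    region_bin = f"{chromosome}_{position // region_length}"
    return f"region_bin={region_bin}/variants.parquet"


def route_records(
    records: Iterable[tuple[str, int]],
) -> dict[str, list[tuple[str, int]]]:
    """Group (chromosome, position) records by their target partition file."""
    buckets: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for chromosome, position in records:
        buckets[partition_path(chromosome, position)].append(
            (chromosome, position)
        )
    return dict(buckets)


records = [
    ("chr1", 12_345),
    ("chr1", 1_500_000),
    ("chr2", 42),
    ("chr1", 999_999),
]
buckets = route_records(records)
# One path per partition file that received at least one record.
written_paths = sorted(buckets)
```

In this sketch the list of paths plays a role similar to the list[str] return value of write_dataset, although what the real method returns is not specified in the documentation above.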

write_summary_variant(summary_variant: SummaryVariant, attributes: dict[str, Any] | None = None, sj_base_index: int | None = None) → None

Write a single summary variant to the correct parquet file.

dae.parquet.schema2.serializers module

class dae.parquet.schema2.serializers.AlleleParquetSerializer(annotation_schema: List[AttributeInfo], extra_attributes: List[str] | None = None)

Bases: object

Serialize alleles for storage in parquet.

ENUM_PROPERTIES = {'allele_in_roles': <enum 'Role'>, 'allele_in_sexes': <enum 'Sex'>, 'allele_in_statuses': <enum 'Status'>, 'inheritance_in_members': <enum 'Inheritance'>, 'transmission_type': <enum 'TransmissionType'>, 'variant_type': <enum 'Type'>}
FAMILY_ALLELE_BASE_SCHEMA = {'allele_in_members': ListType(list<item: string>), 'allele_in_roles': DataType(int32), 'allele_in_sexes': DataType(int8), 'allele_in_statuses': DataType(int8), 'allele_index': DataType(int32), 'bucket_index': DataType(int32), 'family_id': DataType(string), 'family_index': DataType(int32), 'inheritance_in_members': DataType(int16), 'is_denovo': DataType(int8), 'sj_index': DataType(int64), 'summary_index': DataType(int32)}
SUMMARY_ALLELE_BASE_SCHEMA = {'af_allele_count': DataType(int32), 'af_allele_freq': DataType(float), 'af_parents_called_count': DataType(int32), 'af_parents_called_percent': DataType(float), 'allele_index': DataType(int32), 'bucket_index': DataType(int32), 'chromosome': DataType(string), 'effect_gene': ListType(list<item: struct<effect_gene_symbols: string, effect_types: string>>), 'end_position': DataType(int32), 'family_alleles_count': DataType(int32), 'family_variants_count': DataType(int32), 'position': DataType(int32), 'reference': DataType(string), 'seen_as_denovo': DataType(bool), 'seen_in_status': DataType(int8), 'sj_index': DataType(int64), 'summary_index': DataType(int32), 'transmission_type': DataType(int8), 'variant_type': DataType(int8)}
build_family_allele_batch_dict(allele: FamilyAllele, family_variant_data: str) → dict[str, list[Any]]

Build a batch of family allele data in the form of a dict.
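A batch "in the form of a dict" typically means a column-oriented mapping from column name to a list of values. The sketch below shows that shape using three column names taken from FAMILY_ALLELE_BASE_SCHEMA; the helper rows_to_batch_dict and the sample values are hypothetical and do not reproduce how AlleleParquetSerializer extracts data from a FamilyAllele.

```python
from typing import Any

# Subset of the FAMILY_ALLELE_BASE_SCHEMA keys, used here for illustration.
COLUMNS = ("family_id", "allele_index", "is_denovo")


def rows_to_batch_dict(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-oriented records into a column-oriented batch dict."""
    batch: dict[str, list[Any]] = {name: [] for name in COLUMNS}
    for row in rows:
        for name in COLUMNS:
            # Missing values become None so all columns stay the same length.
            batch[name].append(row.get(name))
    return batch


batch = rows_to_batch_dict(
    [
        {"family_id": "f1", "allele_index": 1, "is_denovo": 0},
        {"family_id": "f2", "allele_index": 1},
    ]
)
```

A dict of equal-length lists like this is the input shape accepted by pyarrow's RecordBatch.from_pydict and Table.from_pydict.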

classmethod build_family_schema() → Schema

Build the schema for the family alleles.

build_summary_allele_batch_dict(allele: SummaryAllele, summary_variant_data: str) → dict[str, Any]

Build a batch of summary allele data in the form of a dict.

classmethod build_summary_schema(annotation_schema: list[dae.annotation.annotation_pipeline.AttributeInfo]) → Schema

Build the schema for the summary alleles.

property schema_family: Schema

Lazily construct and return the schema for the family alleles.

property schema_summary: Schema

Lazily construct and return the schema for the summary alleles.
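Lazy construction here means the schema is built on first access and reused afterwards. A minimal sketch of that pattern, assuming a cached private attribute; the class LazySchemaHolder, its _build_schema helper, and the dict standing in for a Schema object are hypothetical and not taken from AlleleParquetSerializer:

```python
from typing import Optional


class LazySchemaHolder:
    """Hypothetical holder that builds its schema only when first requested."""

    def __init__(self) -> None:
        self._schema: Optional[dict[str, str]] = None
        self.build_count = 0  # counts how often the schema is actually built

    def _build_schema(self) -> dict[str, str]:
        """Stand-in for an expensive schema construction step."""
        self.build_count += 1
        return {"family_id": "string", "allele_index": "int32"}

    @property
    def schema(self) -> dict[str, str]:
        """Build the schema on first access, then return the cached object."""
        if self._schema is None:
            self._schema = self._build_schema()
        return self._schema


holder = LazySchemaHolder()
first = holder.schema
second = holder.schema
```

Both accesses return the same object, and the build step runs only once.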

Module contents