dae.genomic_resources package
Submodules
dae.genomic_resources.aggregators module
- class dae.genomic_resources.aggregators.Aggregator[source]
Bases:
ABC
Base class for score aggregators.
- class dae.genomic_resources.aggregators.ConcatAggregator[source]
Bases:
Aggregator
Aggregator that concatenates all passed values.
- class dae.genomic_resources.aggregators.DictAggregator[source]
Bases:
Aggregator
Aggregator that builds a dictionary of all passed values.
- class dae.genomic_resources.aggregators.JoinAggregator(separator: str)[source]
Bases:
Aggregator
Aggregator that joins all passed values using a separator.
- class dae.genomic_resources.aggregators.ListAggregator[source]
Bases:
Aggregator
Aggregator that builds a list of all passed values.
- class dae.genomic_resources.aggregators.MaxAggregator[source]
Bases:
Aggregator
Maximum value aggregator for genomic scores.
- class dae.genomic_resources.aggregators.MeanAggregator[source]
Bases:
Aggregator
Aggregator for genomic scores that calculates mean value.
- class dae.genomic_resources.aggregators.MedianAggregator[source]
Bases:
Aggregator
Aggregator for genomic scores that calculates median value.
- class dae.genomic_resources.aggregators.MinAggregator[source]
Bases:
Aggregator
Minimum value aggregator for genomic scores.
- class dae.genomic_resources.aggregators.ModeAggregator[source]
Bases:
Aggregator
Aggregator for genomic scores that calculates mode value.
- dae.genomic_resources.aggregators.build_aggregator(aggregator_type: str) Aggregator [source]
- dae.genomic_resources.aggregators.create_aggregator(aggregator_def: dict[str, Any]) Aggregator [source]
Create an aggregator by aggregator definition.
- dae.genomic_resources.aggregators.create_aggregator_definition(aggregator_type: str) dict[str, Any] [source]
Parse an aggregator definition string.
- dae.genomic_resources.aggregators.get_aggregator_class(aggregator: str) Callable[[], Aggregator] [source]
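The aggregator classes above share one pattern: values are fed in one at a time and a single result is read out at the end. The following is a minimal, self-contained sketch of that pattern and is not the dae implementation. The `SketchAggregator` classes, the method names `add` and `get_final`, the `build` helper, and the `name(arg)` definition syntax handled by `parse_definition` are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class SketchAggregator(ABC):
    """Hypothetical base class: feed values in, read one result out."""

    @abstractmethod
    def add(self, value: Any) -> None:
        """Accept one value."""

    @abstractmethod
    def get_final(self) -> Any:
        """Return the aggregated result."""


class SketchMax(SketchAggregator):
    def __init__(self) -> None:
        self.current: Optional[float] = None

    def add(self, value: Any) -> None:
        # Assumption: missing values (None) are skipped.
        if value is not None and (self.current is None or value > self.current):
            self.current = value

    def get_final(self) -> Optional[float]:
        return self.current


class SketchMean(SketchAggregator):
    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def add(self, value: Any) -> None:
        if value is not None:
            self.total += value
            self.count += 1

    def get_final(self) -> Optional[float]:
        return self.total / self.count if self.count else None


class SketchJoin(SketchAggregator):
    def __init__(self, separator: str) -> None:
        self.separator = separator
        self.values: List[str] = []

    def add(self, value: Any) -> None:
        if value is not None:
            self.values.append(str(value))

    def get_final(self) -> str:
        return self.separator.join(self.values)


def parse_definition(text: str) -> Dict[str, Any]:
    """Parse 'mean' or 'join(;)' into a name plus an argument list."""
    name, paren, rest = text.strip().partition("(")
    if not paren:
        return {"name": name, "args": []}
    if not rest.endswith(")"):
        raise ValueError(f"unbalanced parentheses in {text!r}")
    return {"name": name, "args": [rest[:-1]]}


def build(text: str) -> SketchAggregator:
    """Look the parsed name up in a small registry and instantiate it."""
    registry = {"max": SketchMax, "mean": SketchMean, "join": SketchJoin}
    definition = parse_definition(text)
    return registry[definition["name"]](*definition["args"])
```

Splitting parsing (`parse_definition`) from construction (`build`) mirrors the separation between `create_aggregator_definition` and `create_aggregator` listed above, although the real dictionary layout may differ.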
dae.genomic_resources.cached_repository module
Provides caching genomic resources.
- class dae.genomic_resources.cached_repository.CacheResource(resource: GenomicResource, protocol: CachingProtocol)[source]
Bases:
GenomicResource
Represents resources stored in cache.
- class dae.genomic_resources.cached_repository.CachingProtocol(remote_protocol: ReadOnlyRepositoryProtocol, local_protocol: FsspecReadWriteProtocol)[source]
Bases:
ReadOnlyRepositoryProtocol
Defines caching GRR repository protocol.
- file_exists(resource: GenomicResource, filename: str) bool [source]
Check if the given file exists in the given resource.
- get_all_resources() Generator[GenomicResource, None, None] [source]
Return generator for all resources in the repository.
- load_manifest(resource: GenomicResource) Manifest [source]
Load resource manifest.
- open_raw_file(resource: GenomicResource, filename: str, mode: str = 'rt', **kwargs: str | bool | None) IO [source]
Open a file in a resource and return a file-like object.
- open_tabix_file(resource: GenomicResource, filename: str, index_filename: str | None = None) TabixFile [source]
Open a tabix file in a resource and return a pysam tabix file.
Not all repositories support this method. Repositories that do not support this method raise an exception.
- open_vcf_file(resource: GenomicResource, filename: str, index_filename: str | None = None) VariantFile [source]
Open a vcf file in a resource and return a pysam VariantFile.
Not all repositories support this method. Repositories that do not support this method raise an exception.
- refresh_cached_resource(resource: GenomicResource) None [source]
Refresh all resource files in the cache if necessary.
- refresh_cached_resource_file(resource: GenomicResource, filename: str) tuple[str, str] [source]
Refresh a resource file in the cache if necessary.
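The idea of refreshing a cached file only when needed can be sketched as follows. This is a simplified illustration, not the `CachingProtocol` code: plain dictionaries stand in for the remote and local protocols, the helper names `checksum` and `refresh_file` are hypothetical, and comparing SHA-256 digests is an assumed way of deciding whether a refresh is necessary.

```python
import hashlib
from typing import Dict, Tuple


def checksum(data: bytes) -> str:
    """Return a hex digest used to compare file contents."""
    return hashlib.sha256(data).hexdigest()


def refresh_file(
    remote: Dict[str, bytes],
    cache: Dict[str, bytes],
    filename: str,
) -> Tuple[str, bool]:
    """Copy `filename` into `cache` only when it is missing or stale.

    Returns the file name and a flag telling whether a copy happened.
    """
    remote_sum = checksum(remote[filename])
    cached = cache.get(filename)
    if cached is not None and checksum(cached) == remote_sum:
        # The cached copy is up to date; nothing to transfer.
        return filename, False
    cache[filename] = remote[filename]
    return filename, True
```

A real repository would normally read the expected checksum from the resource manifest instead of hashing the remote file on every call.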
- class dae.genomic_resources.cached_repository.GenomicResourceCachedRepo(child: GenomicResourceRepo, cache_url: str, **kwargs: str | None)[source]
Bases:
GenomicResourceRepo
Defines caching genomic resources repository.
- find_resource(resource_id: str, version_constraint: str | None = None, repository_id: str | None = None) GenomicResource | None [source]
Return requested resource or None if not found.
- get_all_resources() Generator[GenomicResource, None, None] [source]
Return a generator over all resources in the repository.
- get_resource(resource_id: str, version_constraint: str | None = None, repository_id: str | None = None) GenomicResource [source]
Return one resource with ID equal to resource_id.
If the resource is not found, an exception is raised.
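The difference between `find_resource` (returns None when nothing matches) and `get_resource` (raises when nothing matches) can be illustrated with a small stand-in. `SketchRepo` is hypothetical, stores plain strings instead of `GenomicResource` objects, ignores version constraints, and raises `ValueError`, which is an assumed exception type.

```python
from typing import Dict, Optional


class SketchRepo:
    """Toy repository mapping resource IDs to descriptions."""

    def __init__(self, resources: Dict[str, str]) -> None:
        self._resources = resources

    def find_resource(self, resource_id: str) -> Optional[str]:
        # A lookup that tolerates misses: the caller checks for None.
        return self._resources.get(resource_id)

    def get_resource(self, resource_id: str) -> str:
        # A strict lookup built on top of find_resource.
        found = self.find_resource(resource_id)
        if found is None:
            raise ValueError(f"resource {resource_id!r} not found")
        return found
```

Use the tolerant form when a missing resource is an expected outcome and the strict form when it signals a configuration error.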
- dae.genomic_resources.cached_repository.cache_resources(repository: GenomicResourceRepo, resource_ids: Iterable[str] | None, workers: int | None = None) None [source]
Cache resources from a list of remote resource IDs.
dae.genomic_resources.cli module
Provides CLI for management of genomic resources repositories.
- dae.genomic_resources.cli.cli_browse(cli_args: list[str] | None = None) None [source]
Provide CLI for repository browsing.
- dae.genomic_resources.cli.cli_manage(cli_args: list[str] | None = None) None [source]
Provide CLI for repository management.
- dae.genomic_resources.cli.collect_dvc_entries(proto: ReadWriteRepositoryProtocol, res: GenomicResource) dict[str, dae.genomic_resources.repository.ManifestEntry] [source]
Collect manifest entries defined by .dvc files.
dae.genomic_resources.clinvar module
dae.genomic_resources.fsspec_protocol module
Provides GRR protocols based on fsspec library.
- class dae.genomic_resources.fsspec_protocol.FsspecReadOnlyProtocol(proto_id: str, url: str, filesystem: AbstractFileSystem)[source]
Bases:
ReadOnlyRepositoryProtocol
Provides fsspec genomic resources repository protocol.
- file_exists(resource: GenomicResource, filename: str) bool [source]
Check if the given file exists in the given resource.
- get_all_resources() Generator[GenomicResource, None, None] [source]
Return generator over all resources in the repository.
- get_resource_file_url(resource: GenomicResource, filename: str) str [source]
Return url of a file in the resource.
- get_resource_url(resource: GenomicResource) str [source]
Return the URL of the specified resource.
- load_manifest(resource: GenomicResource) Manifest [source]
Load resource manifest.
- open_raw_file(resource: GenomicResource, filename: str, mode: str = 'rt', **kwargs: str | bool | None) IO [source]
Open a file in a resource and return a file-like object.
- open_tabix_file(resource: GenomicResource, filename: str, index_filename: str | None = None) TabixFile [source]
Open a tabix file in a resource and return a pysam tabix file.
Not all repositories support this method. Repositories that do not support this method raise an exception.
- open_vcf_file(resource: GenomicResource, filename: str, index_filename: str | None = None) VariantFile [source]
Open a vcf file in a resource and return a pysam VariantFile.
Not all repositories support this method. Repositories that do not support this method raise an exception.
- class dae.genomic_resources.fsspec_protocol.FsspecReadWriteProtocol(proto_id: str, url: str, filesystem: AbstractFileSystem)[source]
Bases:
FsspecReadOnlyProtocol, ReadWriteRepositoryProtocol
Provides fsspec genomic resources repository protocol.
- build_content_file() list[dict[str, Any]] [source]
Build the content of the repository (i.e. the ‘.CONTENTS’ file).
- collect_all_resources() Generator[GenomicResource, None, None] [source]
Return generator over all resources managed by this protocol.
- collect_resource_entries(resource: GenomicResource) Manifest [source]
Scan the resource and return a manifest.
- copy_resource_file(remote_resource: GenomicResource, dest_resource: GenomicResource, filename: str) ResourceFileState | None [source]
Copy a resource file into repository.
- delete_resource_file(resource: GenomicResource, filename: str) None [source]
Delete a resource file and its internal state.
- get_all_resources() Generator[GenomicResource, None, None] [source]
Return generator over all resources in the repository.
- get_resource_file_size(resource: GenomicResource, filename: str) int [source]
Return the size of a resource file.
- get_resource_file_timestamp(resource: GenomicResource, filename: str) float [source]
Return the timestamp (ISO formatted) of a resource file.
- load_resource_file_state(resource: GenomicResource, filename: str) ResourceFileState | None [source]
Load resource file state from internal GRR state.
If the specified resource file has no internal state returns None.
- obtain_resource_file_lock(resource: GenomicResource, filename: str) ContextManager [source]
Lock a resource’s file.
- save_resource_file_state(resource: GenomicResource, state: ResourceFileState) None [source]
Save resource file state into internal GRR state.
- update_resource_file(remote_resource: GenomicResource, dest_resource: GenomicResource, filename: str) ResourceFileState | None [source]
Update a resource file into repository if needed.
- dae.genomic_resources.fsspec_protocol.build_fsspec_protocol(proto_id: str, root_url: str, **kwargs: str | None) FsspecReadOnlyProtocol | FsspecReadWriteProtocol [source]
Create fsspec GRR protocol based on the root url.
- dae.genomic_resources.fsspec_protocol.build_inmemory_protocol(proto_id: str, root_path: str, content: Dict[str, Any]) FsspecReadWriteProtocol [source]
Build and return an embedded fsspec protocol for testing.
- dae.genomic_resources.fsspec_protocol.build_local_resource(dirname: str, config: Dict[str, Any]) GenomicResource [source]
Build a resource from a local filesystem directory.
dae.genomic_resources.gene_models module
- class dae.genomic_resources.gene_models.Exon(start: int, stop: int, frame: int | None = None, number: int | None = None, cds_start: int | None = None, cds_stop: int | None = None)[source]
Bases:
object
Provides exon model.
- class dae.genomic_resources.gene_models.GeneModels(resource: GenomicResource)[source]
Bases:
GenomicResourceImplementation, ResourceConfigValidationMixin, InfoImplementationMixin
Provides class for gene models.
- SUPPORTED_GENE_MODELS_FILE_FORMATS = {'ccds', 'default', 'gtf', 'knowngene', 'refflat', 'refseq', 'ucscgenepred'}
- add_statistics_build_tasks(task_graph: TaskGraph, **kwargs: Any) list[dae.task_graph.graph.Task] [source]
Add tasks for calculating resource statistics to a task graph.
- calc_statistics_hash() bytes [source]
Compute the statistics hash.
This hash is used to decide whether the resource statistics should be recomputed.
- property files: set[str]
Return the set of resource files the implementation uses.
- gene_models_by_gene_name(name: str) list[dae.genomic_resources.gene_models.TranscriptModel] | None [source]
- gene_models_by_location(chrom: str, pos1: int, pos2: int | None = None) list[dae.genomic_resources.gene_models.TranscriptModel] [source]
Retrieve TranscriptModel objects based on genomic position(s).
- Args:
chrom (str): The chromosome name.
pos1 (int): The starting genomic position.
pos2 (Optional[int]): The ending genomic position. If not provided, only models that contain pos1 will be returned.
- Returns:
list[TranscriptModel]: A list of TranscriptModel objects that match the given location criteria.
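The lookup described for `gene_models_by_location` can be sketched as a simple interval test. This is an illustration under stated assumptions rather than the library code: `SketchTranscript` is a hypothetical stand-in for `TranscriptModel`, `tx` is treated as an inclusive (start, stop) pair, and a two-position query is assumed to return every transcript that overlaps the region.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SketchTranscript:
    tr_id: str
    chrom: str
    tx: Tuple[int, int]  # (start, stop); assumed inclusive


def transcripts_by_location(
    transcripts: List[SketchTranscript],
    chrom: str,
    pos1: int,
    pos2: Optional[int] = None,
) -> List[SketchTranscript]:
    """Return transcripts containing pos1, or overlapping [pos1, pos2]."""
    if pos2 is None:
        # A single position is a region of length one.
        pos2 = pos1
    begin, end = min(pos1, pos2), max(pos1, pos2)
    return [
        t for t in transcripts
        if t.chrom == chrom and t.tx[0] <= end and begin <= t.tx[1]
    ]
```

A linear scan keeps the sketch short; an implementation serving many queries would more likely use a per-chromosome index.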
- load() GeneModels [source]
Load gene models.
- relabel_chromosomes(relabel: dict[str, str] | None = None, map_file: str | None = None) None [source]
Relabel chromosomes in gene model.
- property resource_id: str
- class dae.genomic_resources.gene_models.GeneModelsParser(*args, **kwargs)[source]
Bases:
Protocol
Gene models parser function type.
- class dae.genomic_resources.gene_models.TranscriptModel(gene: str, tr_id: str, tr_name: str, chrom: str, strand: str, tx: tuple[int, int], cds: tuple[int, int], exons: list[dae.genomic_resources.gene_models.Exon] | None = None, attributes: dict[str, Any] | None = None)[source]
Bases:
object
Provides transcript model.
- all_regions(ss_extend: int = 0, prom: int = 0) list[dae.utils.regions.BedRegion] [source]
Build and return list of regions.
- cds_regions(ss_extend: int = 0) list[dae.utils.regions.BedRegion] [source]
Compute CDS regions.
- utr3_regions() list[dae.utils.regions.BedRegion] [source]
Build and return list of UTR3 regions.
- utr5_regions() list[dae.utils.regions.BedRegion] [source]
Build list of UTR5 regions.
- dae.genomic_resources.gene_models.build_gene_models_from_file(file_name: str, file_format: str | None = None, gene_mapping_file_name: str | None = None) GeneModels [source]
Load gene models from local filesystem.
- dae.genomic_resources.gene_models.build_gene_models_from_resource(resource: GenomicResource | None) GeneModels [source]
Load gene models from a genomic resource.
- dae.genomic_resources.gene_models.join_gene_models(*gene_models: GeneModels) GeneModels [source]
Join multiple gene models into a single gene models object.
dae.genomic_resources.genomic_position_table module
- class dae.genomic_resources.genomic_position_table.Line(raw_line: tuple, chrom_key: str | int = 0, pos_begin_key: str | int = 1, pos_end_key: str | int = 2, ref_key: str | int | None = None, alt_key: str | int | None = None, header: tuple[str, ...] | None = None)[source]
Bases:
LineBase
Represents a line read from a genomic position table.
Provides attribute access to a number of important columns - chromosome, start position, end position, reference allele and alternative allele.
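The `*_key` parameters accept either a column index or a column name, which suggests a lookup along the following lines. `SketchLine` is a hypothetical, reduced version of such an adapter: it handles only the three positional columns and assumes that string keys are resolved through the `header` tuple.

```python
from typing import Optional, Tuple, Union

Key = Union[str, int]


class SketchLine:
    """Expose chrom, pos_begin and pos_end from a raw tuple of columns."""

    def __init__(
        self,
        raw_line: Tuple[str, ...],
        chrom_key: Key = 0,
        pos_begin_key: Key = 1,
        pos_end_key: Key = 2,
        header: Optional[Tuple[str, ...]] = None,
    ) -> None:
        self.raw_line = raw_line
        self.header = header
        self.chrom = self._get(chrom_key)
        self.pos_begin = int(self._get(pos_begin_key))
        self.pos_end = int(self._get(pos_end_key))

    def _get(self, key: Key) -> str:
        if isinstance(key, int):
            # Integer keys index the raw columns directly.
            return self.raw_line[key]
        if self.header is None:
            raise ValueError(f"column name {key!r} needs a header")
        # String keys are translated to an index through the header.
        return self.raw_line[self.header.index(key)]
```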
- class dae.genomic_resources.genomic_position_table.LineBuffer[source]
Bases:
object
Represent a line buffer for Tabix genome position table.
- fetch(chrom: str, pos_begin: int, pos_end: int) Generator[LineBase, None, None] [source]
Return a generator of rows matching the region.
- class dae.genomic_resources.genomic_position_table.TabixGenomicPositionTable(genomic_resource: GenomicResource, table_definition: dict)[source]
Bases:
GenomicPositionTable
Represents Tabix file genome position table.
- BUFFER_MAXSIZE = 20000
- get_all_records() Generator[LineBase | None, None, None] [source]
Return generator of all records in the table.
- get_chromosome_length(chrom: str, step: int = 100000000) int [source]
Return the length of a chromosome (or contig).
The returned value is guaranteed to be larger than the actual contig length.
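One plausible reading of the `step` parameter is that the length is estimated by probing the table in fixed increments until no records remain, which yields an upper bound instead of an exact length. The sketch below illustrates that reading only; it is not taken from the library. The helper `has_records_from` is a hypothetical stand-in for an indexed query and works on a sorted list of record positions.

```python
from bisect import bisect_left
from typing import Sequence


def has_records_from(positions: Sequence[int], start: int) -> bool:
    """Stand-in for an indexed query: is any record at or after `start`?"""
    return bisect_left(positions, start) < len(positions)


def length_upper_bound(positions: Sequence[int], step: int = 100_000_000) -> int:
    """Advance in `step` increments until no records lie beyond the bound."""
    bound = step
    while has_records_from(positions, bound):
        bound += step
    # Every record position is now strictly below `bound`.
    return bound
```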
- get_file_chromosomes() list[str] [source]
Return chromosomes in a genomic table file.
This is to be overridden by the subclass. It should return a list of the chromosomes in the file, in the order determined by the file.
- get_line_iterator(chrom: str | None = None, pos_begin: int | None = None) Generator[LineBase | None, None, None] [source]
Extract raw lines and wrap them in our Line adapter.
- class dae.genomic_resources.genomic_position_table.VCFGenomicPositionTable(genomic_resource: GenomicResource, table_definition: dict)[source]
Bases:
TabixGenomicPositionTable
Represents a VCF file genome position table.
- CHROM = 'CHROM'
- POS_BEGIN = 'POS'
- POS_END = 'POS'
- get_file_chromosomes() list[str] [source]
Return chromosomes in a genomic table file.
This is to be overridden by the subclass. It should return a list of the chromosomes in the file, in the order determined by the file.
- get_line_iterator(chrom: str | None = None, pos_begin: int | None = None) Generator[VCFLine | None, None, None] [source]
Extract raw lines and wrap them in our Line adapter.
- open() VCFGenomicPositionTable [source]
- class dae.genomic_resources.genomic_position_table.VCFLine(raw_line: VariantRecord, allele_index: int | None)[source]
Bases:
LineBase
Line adapter for lines derived from a VCF file.
Implements functionality for handling multi-allelic variants and INFO fields.
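Handling multi-allelic variants usually means producing one logical line per alternative allele and selecting the matching INFO value for each. The sketch below shows that idea with plain Python values instead of a pysam `VariantRecord`. The function `expand_record` is hypothetical, and it assumes that tuple-valued INFO fields hold one value per alternative allele while scalar fields apply to all alleles.

```python
from typing import Any, Dict, List, Tuple


def expand_record(
    chrom: str,
    pos: int,
    ref: str,
    alts: Tuple[str, ...],
    info: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Return one dictionary per alternative allele of a VCF-like record."""
    lines = []
    for allele_index, alt in enumerate(alts):
        per_allele = {
            # Tuples are indexed by allele; scalars are shared as-is.
            key: value[allele_index] if isinstance(value, (tuple, list)) else value
            for key, value in info.items()
        }
        lines.append({
            "chrom": chrom,
            "pos": pos,
            "ref": ref,
            "alt": alt,
            "allele_index": allele_index,
            "info": per_allele,
        })
    return lines
```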
- dae.genomic_resources.genomic_position_table.build_genomic_position_table(resource: GenomicResource, table_definition: dict) GenomicPositionTable [source]
Instantiate a genome position table from a genomic resource.
dae.genomic_resources.genomic_context module
- class dae.genomic_resources.genomic_context.CLIGenomicContext(context_objects: Dict[str, Any], source: tuple[str, ...])[source]
Bases:
SimpleGenomicContext
Defines CLI genomics context.
- static add_context_arguments(parser: ArgumentParser) None [source]
Add command line arguments to the argument parser.
- static context_builder(args: Namespace) CLIGenomicContext [source]
Build a CLI genomic context.
- class dae.genomic_resources.genomic_context.DefaultRepositoryContextProvider[source]
Bases:
SimpleGenomicContextProvider
Genomic context provider for default GRR.
- static context_builder() GenomicContext [source]
- class dae.genomic_resources.genomic_context.GenomicContext[source]
Bases:
ABC
Abstract base class for genomic context.
- abstract get_context_keys() set[str] [source]
Return set of all keys that could be found in the context.
- abstract get_context_object(key: str) Any | None [source]
Return a genomic context object corresponding to the passed key.
If there is no such object returns None.
- get_gene_models() GeneModels | None [source]
Return gene models from context.
- get_genomic_resources_repository() GenomicResourceRepo | None [source]
Return genomic resources repository from context.
- get_reference_genome() ReferenceGenome | None [source]
Return reference genome from context.
- class dae.genomic_resources.genomic_context.GenomicContextProvider[source]
Bases:
ABC
Abstract base class for genomic contexts provider.
- abstract get_contexts() Iterable[GenomicContext] [source]
- class dae.genomic_resources.genomic_context.PriorityGenomicContext(contexts: Iterable[GenomicContext])[source]
Bases:
GenomicContext
Defines a priority genomic context.
- class dae.genomic_resources.genomic_context.SimpleGenomicContext(context_objects: Dict[str, Any], source: tuple[str, ...])[source]
Bases:
GenomicContext
Simple implementation of genomic context.
- class dae.genomic_resources.genomic_context.SimpleGenomicContextProvider(context_builder: Callable[[], GenomicContext | None], provider_type: str, priority: int)[source]
Bases:
GenomicContextProvider
Simple implementation of genomic contexts provider.
- get_contexts() Iterable[GenomicContext] [source]
- dae.genomic_resources.genomic_context.get_genomic_context() GenomicContext [source]
- dae.genomic_resources.genomic_context.register_context(context: GenomicContext) None [source]
- dae.genomic_resources.genomic_context.register_context_provider(context_provider: GenomicContextProvider) None [source]
Register genomic context provider.
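A priority context can be pictured as a list of contexts that are consulted in order until one of them supplies the requested object. The following sketch uses hypothetical `SketchContext` and `SketchPriorityContext` classes, and it assumes that contexts earlier in the list take precedence; the real ordering rules are defined by the library.

```python
from typing import Any, Dict, Iterable, Optional, Set


class SketchContext:
    """Context backed by a plain dictionary of named objects."""

    def __init__(self, objects: Dict[str, Any]) -> None:
        self._objects = objects

    def get_context_keys(self) -> Set[str]:
        return set(self._objects)

    def get_context_object(self, key: str) -> Optional[Any]:
        return self._objects.get(key)


class SketchPriorityContext:
    """Combine several contexts; the first one holding a key wins."""

    def __init__(self, contexts: Iterable[SketchContext]) -> None:
        self._contexts = list(contexts)

    def get_context_keys(self) -> Set[str]:
        keys: Set[str] = set()
        for context in self._contexts:
            keys |= context.get_context_keys()
        return keys

    def get_context_object(self, key: str) -> Optional[Any]:
        for context in self._contexts:
            obj = context.get_context_object(key)
            if obj is not None:
                return obj
        return None
```

With this layout a command-line context can override a single object, such as the reference genome, while everything else falls through to a default context.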
dae.genomic_resources.genomic_scores module
- class dae.genomic_resources.genomic_scores.AlleleScore(resource: GenomicResource)[source]
Bases:
GenomicScore
Defines allele genomic scores.
- fetch_scores(chrom: str, position: int, reference: str, alternative: str, scores: list[str] | None = None) list[Any] | None [source]
Fetch scores values for specific allele.
- fetch_scores_agg(chrom: str, pos_begin: int, pos_end: int, scores: list[dae.genomic_resources.genomic_scores.AlleleScoreQuery] | None = None) list[dae.genomic_resources.aggregators.Aggregator] [source]
Fetch score values in a region and aggregate them.
- open() AlleleScore [source]
Open the genomic score resource and return it.
- class dae.genomic_resources.genomic_scores.AlleleScoreAggr(score: 'str', position_aggregator: 'Aggregator', allele_aggregator: 'Aggregator')[source]
Bases:
object
- allele_aggregator: Aggregator
- position_aggregator: Aggregator
- score: str
- class dae.genomic_resources.genomic_scores.AlleleScoreQuery(score: 'str', position_aggregator: 'Optional[str]' = None, allele_aggregator: 'Optional[str]' = None)[source]
Bases:
object
- allele_aggregator: str | None = None
- position_aggregator: str | None = None
- score: str
- class dae.genomic_resources.genomic_scores.GenomicScore(resource: GenomicResource)[source]
Bases:
ResourceConfigValidationMixin
Genomic scores base class.
PositionScore, NPScore and AlleleScore inherit from this class. Statistics builder implementation uses only GenomicScore interface to build all defined statistics.
- fetch_region(chrom: str, pos_begin: int | None, pos_end: int | None, scores: Iterable[str]) Iterator[dict[str, Union[str, int, float, bool, NoneType]]] [source]
Return score values in a region.
- get_default_annotation_attribute(score_id: str) str | None [source]
Return default annotation attribute for a score.
Returns None if the score is not included in the default annotation. Returns the name of the attribute if present or the score if not.
- get_number_range(score_id: str) tuple[float, float] | None [source]
Return the value range for a number score.
- get_score_histogram(score_id: str) NullHistogram | CategoricalHistogram | NumberHistogram [source]
Return defined histogram for a score.
- open() GenomicScore [source]
Open the genomic score resource and return it.
- class dae.genomic_resources.genomic_scores.NPScore(resource: GenomicResource)[source]
Bases:
GenomicScore
Defines nucleotide-position genomic score.
- fetch_scores(chrom: str, position: int, reference: str, alternative: str, scores: list[str] | None = None) list[Any] | None [source]
Fetch score values at specified genomic position and nucleotide.
- fetch_scores_agg(chrom: str, pos_begin: int, pos_end: int, scores: list[dae.genomic_resources.genomic_scores.NPScoreQuery] | None = None) list[dae.genomic_resources.aggregators.Aggregator] [source]
Fetch score values in a region and aggregate them.
- class dae.genomic_resources.genomic_scores.NPScoreAggr(score: 'str', position_aggregator: 'Aggregator', nucleotide_aggregator: 'Aggregator')[source]
Bases:
object
- nucleotide_aggregator: Aggregator
- position_aggregator: Aggregator
- score: str
- class dae.genomic_resources.genomic_scores.NPScoreQuery(score: 'str', position_aggregator: 'Optional[str]' = None, nucleotide_aggregator: 'Optional[str]' = None)[source]
Bases:
object
- nucleotide_aggregator: str | None = None
- position_aggregator: str | None = None
- score: str
- class dae.genomic_resources.genomic_scores.PositionScore(resource: GenomicResource)[source]
Bases:
GenomicScore
Defines position genomic score.
- fetch_scores(chrom: str, position: int, scores: list[str] | None = None) list[Any] | None [source]
Fetch score values at specific genomic position.
- fetch_scores_agg(chrom: str, pos_begin: int, pos_end: int, scores: list[dae.genomic_resources.genomic_scores.PositionScoreQuery] | None = None) list[dae.genomic_resources.aggregators.Aggregator] [source]
Fetch score values in a region and aggregate them.
- Case 1:
res.fetch_scores_agg("1", 10, 20) --> all scores with default aggregators
- Case 2:
res.fetch_scores_agg("1", 10, 20, non_default_aggregators={"bla": "max"}) --> all scores with default aggregators, except that 'bla' uses 'max'
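Aggregating a position score over a region amounts to collecting the value at every covered position and reducing the collection with the chosen aggregator. The sketch below is self-contained and does not call the library: records are hypothetical `(pos_begin, pos_end, value)` tuples with inclusive coordinates, `aggregate_region` is an illustrative name, and each covered position is assumed to contribute its value exactly once.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence, Tuple

AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "mean": mean,
    "max": max,
    "min": min,
}


def aggregate_region(
    records: Sequence[Tuple[int, int, float]],
    pos_begin: int,
    pos_end: int,
    aggregator: str = "mean",
) -> Optional[float]:
    """Aggregate score values over the inclusive region [pos_begin, pos_end]."""
    values: List[float] = []
    for begin, end, value in records:
        # Number of positions shared by the record and the query region.
        overlap = min(end, pos_end) - max(begin, pos_begin) + 1
        if overlap > 0:
            values.extend([value] * overlap)
    if not values:
        return None
    return AGGREGATORS[aggregator](values)
```

Passing a different `aggregator` name plays the role of overriding the default aggregator for one score, as in Case 2 above.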
- open() PositionScore [source]
Open the genomic score resource and return it.
- class dae.genomic_resources.genomic_scores.PositionScoreAggr(score: 'str', position_aggregator: 'Aggregator')[source]
Bases:
object
- position_aggregator: Aggregator
- score: str
- class dae.genomic_resources.genomic_scores.PositionScoreQuery(score: 'str', position_aggregator: 'Optional[str]' = None)[source]
Bases:
object
- position_aggregator: str | None = None
- score: str
- class dae.genomic_resources.genomic_scores.ScoreDef(score_id: str, desc: str, value_type: str, pos_aggregator: str | None, nuc_aggregator: str | None, allele_aggregator: str | None, small_values_desc: str | None, large_values_desc: str | None, hist_conf: NullHistogramConfig | CategoricalHistogramConfig | NumberHistogramConfig | None)[source]
Bases:
object
Score configuration definition.
- allele_aggregator: str | None
- desc: str
- hist_conf: NullHistogramConfig | CategoricalHistogramConfig | NumberHistogramConfig | None
- large_values_desc: str | None
- nuc_aggregator: str | None
- pos_aggregator: str | None
- score_id: str
- small_values_desc: str | None
- value_type: str
- class dae.genomic_resources.genomic_scores.ScoreLine(line: LineBase, score_defs: dict[str, dae.genomic_resources.genomic_scores._ScoreDef])[source]
Bases:
object
Abstraction for a genomic score line. Wraps the line adapter.
- property alt: str | None
- property chrom: str
- property pos_begin: int
- property pos_end: int
- property ref: str | None
- dae.genomic_resources.genomic_scores.build_score_from_resource(resource: GenomicResource) GenomicScore [source]
Build a genomic score resource and return the corresponding score.
dae.genomic_resources.group_repository module
Provides group genomic resources repository.
- class dae.genomic_resources.group_repository.GenomicResourceGroupRepo(children: list[dae.genomic_resources.repository.GenomicResourceRepo], repo_id: str | None = None)[source]
Bases:
GenomicResourceRepo
Defines group genomic resources repository.
- find_resource(resource_id: str, version_constraint: str | None = None, repository_id: str | None = None) GenomicResource | None [source]
Return one resource with ID equal to resource_id.
If resource is not found, None is returned.
- get_all_resources() Generator[GenomicResource, None, None] [source]
Return a generator over all resources in the repository.
- get_resource(resource_id: str, version_constraint: str | None = None, repository_id: str | None = None) GenomicResource [source]
Return one resource with ID equal to resource_id.
If the resource is not found, an exception is raised.
dae.genomic_resources.histogram module
Handling of genomic scores statistics.
Currently we support only genomic scores histograms.
- class dae.genomic_resources.histogram.CategoricalHistogram(config: CategoricalHistogramConfig, values: dict[str, int] | None = None)[source]
Bases:
Statistic
Class for categorical data histograms.
- VALUES_LIMIT = 100
- add_value(value: str | None) None [source]
Add a value to the categorical histogram.
Returns true if successfully added and false if failed. Will fail if too many values are accumulated.
- property bars: dict[str, int]
Return categorical histogram bars in order.
- static deserialize(content: str) CategoricalHistogram [source]
Create a statistic from serialized data.
- static from_dict(data: dict[str, Any]) CategoricalHistogram [source]
- type = 'categorical_histogram'
- class dae.genomic_resources.histogram.CategoricalHistogramConfig(value_order: list[str] | None = None, y_log_scale: bool = False)[source]
Bases:
object
Configuration class for categorical histograms.
- static default_config() CategoricalHistogramConfig [source]
- static from_dict(parsed: dict[str, Any]) CategoricalHistogramConfig [source]
Create a categorical histogram config from a configuration dict.
- value_order: list[str] | None = None
- y_log_scale: bool = False
- exception dae.genomic_resources.histogram.HistogramError[source]
Bases:
BaseException
Class used for histogram specific errors.
Histograms should be nullified when a HistogramError occurs.
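The interplay between a value limit, a histogram error and a null histogram can be sketched as follows. All names here (`SketchHistogramError`, `SketchCategorical`, `build_or_nullify`) are hypothetical, the result is returned as a plain dictionary, and skipping None values is an assumption.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Optional

VALUES_LIMIT = 100


class SketchHistogramError(Exception):
    """Raised when a histogram can no longer be built."""


class SketchCategorical:
    def __init__(self, limit: int = VALUES_LIMIT) -> None:
        self.limit = limit
        self.counts: Counter = Counter()

    def add_value(self, value: Optional[str]) -> None:
        if value is None:
            return
        # A new distinct value beyond the limit aborts the histogram.
        if value not in self.counts and len(self.counts) >= self.limit:
            raise SketchHistogramError(
                f"more than {self.limit} distinct values")
        self.counts[value] += 1


def build_or_nullify(
    values: Iterable[Optional[str]], limit: int = VALUES_LIMIT,
) -> Dict[str, Any]:
    """Build a categorical histogram, or a null one if building fails."""
    histogram = SketchCategorical(limit)
    try:
        for value in values:
            histogram.add_value(value)
    except SketchHistogramError as err:
        return {"type": "null_histogram", "reason": str(err)}
    return {"type": "categorical_histogram", "values": dict(histogram.counts)}
```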
- class dae.genomic_resources.histogram.HistogramStatisticMixin[source]
Bases:
object
Mixin for creating statistics classes with histograms.
- class dae.genomic_resources.histogram.NullHistogram(config: NullHistogramConfig | None)[source]
Bases:
Statistic
Class for annulled histograms.
- static deserialize(content: str) NullHistogram [source]
Create a statistic from serialized data.
- static from_dict(data: dict[str, Any]) NullHistogram [source]
Build a null histogram from a dict.
- type = 'null_histogram'
- class dae.genomic_resources.histogram.NullHistogramConfig(reason: str)[source]
Bases:
object
Configuration class for null histograms.
- static default_config() NullHistogramConfig [source]
- static from_dict(parsed: dict[str, Any]) NullHistogramConfig [source]
Create a null histogram config from a configuration dict.
- reason: str
- class dae.genomic_resources.histogram.NumberHistogram(config: NumberHistogramConfig, bins: ndarray | None = None, bars: ndarray | None = None)[source]
Bases:
Statistic
Class to represent a histogram.
- static deserialize(content: str) NumberHistogram [source]
Create a statistic from serialized data.
- static from_dict(data: dict[str, Any]) NumberHistogram [source]
Build a number histogram from a dict.
- type = 'number_histogram'
- property view_range: tuple[Optional[float], Optional[float]]
- class dae.genomic_resources.histogram.NumberHistogramConfig(view_range: tuple[Optional[float], Optional[float]], number_of_bins: int = 30, x_log_scale: bool = False, y_log_scale: bool = False, x_min_log: float | None = None)[source]
Bases:
object
Configuration class for number histograms.
- static default_config(min_max: MinMaxValue | None) NumberHistogramConfig [source]
Build a number histogram config from a parsed yaml file.
- static from_dict(parsed: dict[str, Any]) NumberHistogramConfig [source]
Build a number histogram config from a parsed yaml file.
- number_of_bins: int = 30
- view_range: tuple[Optional[float], Optional[float]]
- x_log_scale: bool = False
- x_min_log: float | None = None
- y_log_scale: bool = False
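The fields `view_range`, `number_of_bins` and `x_log_scale` are enough to derive bin edges for a number histogram. The sketch below shows one way to do that with the standard library only. The functions `bin_edges` and `bin_index` are hypothetical, the handling of `x_min_log` is deliberately left out because its exact role is not described here, and the last bin is assumed to include the upper edge.

```python
import math
from bisect import bisect_right
from typing import List, Optional, Tuple


def bin_edges(
    view_range: Tuple[float, float],
    number_of_bins: int = 30,
    x_log_scale: bool = False,
) -> List[float]:
    """Return number_of_bins + 1 edges spanning view_range."""
    low, high = view_range
    if x_log_scale:
        if low <= 0:
            raise ValueError("a log scale needs a positive lower bound")
        log_low, log_high = math.log10(low), math.log10(high)
        width = (log_high - log_low) / number_of_bins
        # Edges are evenly spaced in log10 space.
        return [10 ** (log_low + i * width) for i in range(number_of_bins + 1)]
    width = (high - low) / number_of_bins
    return [low + i * width for i in range(number_of_bins + 1)]


def bin_index(value: float, edges: List[float]) -> Optional[int]:
    """Return the bin holding value, or None when it is out of range."""
    if value < edges[0] or value > edges[-1]:
        return None
    index = bisect_right(edges, value) - 1
    # The upper edge belongs to the last bin.
    return min(index, len(edges) - 2)
```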
- dae.genomic_resources.histogram.build_default_histogram_conf(value_type: str, **kwargs: Any) NumberHistogramConfig | CategoricalHistogramConfig | NullHistogramConfig [source]
Build default histogram config for given value type.
- dae.genomic_resources.histogram.build_empty_histogram(config: NullHistogramConfig | CategoricalHistogramConfig | NumberHistogramConfig) NumberHistogram | CategoricalHistogram | NullHistogram [source]
Create an empty histogram from a deserialized histogram dictionary.
- dae.genomic_resources.histogram.build_histogram_config(config: dict[str, Any] | None) NullHistogramConfig | CategoricalHistogramConfig | NumberHistogramConfig | None [source]
Create a histogram config from a configuration dict.
- dae.genomic_resources.histogram.load_histogram(resource: GenomicResource, filename: str) NullHistogram | CategoricalHistogram | NumberHistogram [source]
Load and return a histogram in a resource.
On an error or missing histogram, an appropriate NullHistogram is returned.
dae.genomic_resources.liftover_resource module
dae.genomic_resources.reference_genome module
- class dae.genomic_resources.reference_genome.ReferenceGenome(resource: GenomicResource)[source]
Bases:
ResourceConfigValidationMixin
Provides an interface for querying a reference genome.
- property chrom_prefix: str
Return a prefix of all chromosomes of the reference genome.
- property chromosomes: list[str]
Return a list of all chromosomes of the reference genome.
- fetch(chrom: str, start: int, stop: int | None, buffer_size: int = 512) Generator[str, None, None] [source]
Yield the nucleotides in a specific region.
The line-feed calculation can be inaccurate, because not every fetch starts at the beginning of a line. This is harmless: line feeds only add extra characters to read, and the output is limited to the number of nucleotides expected to be read.
- property files: list[str]
- get_sequence(chrom: str, start: int, stop: int) str [source]
Return sequence of nucleotides from specified chromosome region.
- is_pseudoautosomal(chrom: str, pos: int) bool [source]
Return true if specified position is pseudoautosomal.
- open() ReferenceGenome [source]
Open reference genome resources.
- property resource_id: str
- dae.genomic_resources.reference_genome.build_reference_genome_from_file(filename: str) ReferenceGenome [source]
Open a reference genome from a file.
- dae.genomic_resources.reference_genome.build_reference_genome_from_resource(resource: GenomicResource) ReferenceGenome [source]
Open a reference genome from resource.
dae.genomic_resources.repository module
Provides basic classes for genomic resources and repositories.
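      +---------------------+                    +-----------------+
+-----| GenomicResourceRepo |--------------------| GenomicResource |
|     +---------------------+                    +-----------------+
|                ^                                        ^
|                |                                        |
|                |                                        |
|  +-----------------------------+         +----------------------------+
|  | GenomicResourceProtocolRepo | --------| ReadOnlyRepositoryProtocol |
|  +-----------------------------+         +----------------------------+
|                                                         ^
|                                                         |
|  +--------------------------+           +-----------------------------+
+--| GenomicResourceGroupRepo |           | ReadWriteRepositoryProtocol |
   +--------------------------+           +-----------------------------+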
- class dae.genomic_resources.repository.GenomicResource(resource_id: str, version: tuple[int, ...], protocol: ReadOnlyRepositoryProtocol | ReadWriteRepositoryProtocol, config: dict[str, Any] | None = None, manifest: Manifest | None = None)[source]
Bases:
object
Base class for genomic resources.
- get_file_content(filename: str, *, uncompress: bool = True, mode: str = 't') Any [source]
Return the content of file in a resource.
- get_genomic_resource_id_version() str [source]
Return a string combining resource ID and version.
Returns a string of the form aa/bb/cc[3.2] for a genomic resource with id aa/bb/cc and version 3.2. If the version is 0 the string will be aa/bb/cc.
- open_raw_file(filename: str, mode: str = 'rt', **kwargs: str | bool | None) IO [source]
Open a file in the resource and return a file-like object.
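The string format described for get_genomic_resource_id_version can be sketched as a small standalone function. This is only an illustration of the documented format; the name `format_id_version` is hypothetical, and representing version 0 as the tuple `(0,)` is an assumption.

```python
def format_id_version(resource_id: str, version: tuple[int, ...]) -> str:
    """Combine a resource ID and a version into one string (hypothetical helper)."""
    if version == (0,):
        return resource_id  # assumption: version 0 is left implicit
    version_str = ".".join(str(part) for part in version)
    return f"{resource_id}[{version_str}]"


print(format_id_version("aa/bb/cc", (3, 2)))  # aa/bb/cc[3.2]
print(format_id_version("aa/bb/cc", (0,)))    # aa/bb/cc
```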
- class dae.genomic_resources.repository.GenomicResourceProtocolRepo(proto: ReadOnlyRepositoryProtocol | ReadWriteRepositoryProtocol)[source]
Bases:
GenomicResourceRepo
Base class for real genomic resources repositories.
- find_resource(resource_id: str, version_constraint: str | None = None, repository_id: str | None = None) GenomicResource | None [source]
Return one resource with id equal to resource_id.
If the resource is not found, None is returned.
- get_all_resources() Generator[GenomicResource, None, None] [source]
Return a generator over all resources in the repository.
- get_resource(resource_id: str, version_constraint: str | None = None, repository_id: str | None = None) GenomicResource [source]
Return one resource with id equal to resource_id.
If the resource is not found, an exception is raised.
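The difference between find_resource and get_resource is only in how a missing resource is reported. The toy class below is a hypothetical stand-in, not the repository class itself: it stores plain strings instead of GenomicResource objects, ignores version constraints, and raising `FileNotFoundError` is an assumption, since the exception type is not named here.

```python
from __future__ import annotations


class ToyRepo:
    """Hypothetical stand-in for a repository; maps resource IDs to values."""

    def __init__(self, resources: dict[str, str]) -> None:
        self._resources = resources

    def find_resource(self, resource_id: str) -> str | None:
        # Absence is signalled with None.
        return self._resources.get(resource_id)

    def get_resource(self, resource_id: str) -> str:
        # Absence is signalled with an exception (type is an assumption).
        resource = self.find_resource(resource_id)
        if resource is None:
            raise FileNotFoundError(f"resource <{resource_id}> not found")
        return resource


repo = ToyRepo({"aa/bb/cc": "resource aa/bb/cc"})
print(repo.find_resource("missing"))  # None
try:
    repo.get_resource("missing")
except FileNotFoundError as err:
    print(err)  # resource <missing> not found
```

The find variant suits optional lookups, while the get variant suits code that cannot continue without the resource.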
- class dae.genomic_resources.repository.GenomicResourceRepo(repo_id: str)[source]
Bases:
ABC
Base class for genomic resources repositories.
- property definition: dict[str, Any] | None
- abstract find_resource(resource_id: str, version_constraint: str | None = None, repository_id: str | None = None) GenomicResource | None [source]
Return one resource with id equal to resource_id.
If the resource is not found, None is returned.
- abstract get_all_resources() Generator[GenomicResource, None, None] [source]
Return a generator over all resources in the repository.
- abstract get_resource(resource_id: str, version_constraint: str | None = None, repository_id: str | None = None) GenomicResource [source]
Return one resource with id equal to resource_id.
If the resource is not found, an exception is raised.
- property repo_id: str
- class dae.genomic_resources.repository.Manifest[source]
Bases:
object
Provides genomic resource manifest object.
- add(entry: ManifestEntry) None [source]
Add a manifest entry to the manifest.
- static from_file_content(file_content: str) Manifest [source]
Produce a manifest from manifest file content.
- static from_manifest_entries(manifest_entries: list[dict[str, Any]]) Manifest [source]
Produce a manifest from parsed manifest file content.
- to_manifest_entries() list[dict[str, Any]] [source]
Transform manifest to list of dictionaries.
Helpful when storing the manifest.
- update(entries: dict[str, dae.genomic_resources.repository.ManifestEntry]) None [source]
- class dae.genomic_resources.repository.ManifestEntry(name: str, size: int, md5: str | None)[source]
Bases:
object
Provides an entry into manifest object.
- md5: str | None
- name: str
- size: int
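The round trip between a manifest and a list of dictionaries (to_manifest_entries and from_manifest_entries) can be illustrated with a minimal sketch. The `Entry` dataclass mirrors the three ManifestEntry fields listed above; modelling the manifest as a plain dict keyed by file name, and using `name`, `size` and `md5` as dictionary keys, are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class Entry:
    """Hypothetical mirror of ManifestEntry: name, size and md5."""

    name: str
    size: int
    md5: str | None


def to_entries(manifest: dict[str, Entry]) -> list[dict[str, Any]]:
    # One plain dictionary per file, sorted by name for stable output.
    return [asdict(manifest[name]) for name in sorted(manifest)]


def from_entries(entries: list[dict[str, Any]]) -> dict[str, Entry]:
    # Rebuild the mapping from the list of plain dictionaries.
    return {entry["name"]: Entry(**entry) for entry in entries}


manifest = {"data.txt": Entry("data.txt", 42, None)}
stored = to_entries(manifest)
print(stored)
print(from_entries(stored) == manifest)  # True
```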
- class dae.genomic_resources.repository.ManifestUpdate(manifest: Manifest, entries_to_delete: set[str], entries_to_update: set[str])[source]
Bases:
object
Provides a manifest update object.
- entries_to_delete: set[str]
- entries_to_update: set[str]
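One way to picture the two sets carried by ManifestUpdate is as the result of comparing a stored manifest with the files currently found in a resource. The sketch below is an assumption about what such a comparison might look like, not the logic of check_update_manifest; `diff_manifests` and the `(size, md5)` tuples are hypothetical.

```python
def diff_manifests(
    stored: dict[str, tuple[int, str]],
    current: dict[str, tuple[int, str]],
) -> tuple[set[str], set[str]]:
    """Return (entries_to_delete, entries_to_update) for two file listings."""
    # Files recorded in the manifest that no longer exist.
    entries_to_delete = set(stored) - set(current)
    # Files that are new or whose (size, md5) state has changed.
    entries_to_update = {
        name for name, state in current.items()
        if stored.get(name) != state
    }
    return entries_to_delete, entries_to_update


stored = {"a.txt": (1, "x"), "b.txt": (2, "y")}
current = {"b.txt": (2, "z"), "c.txt": (3, "w")}
to_delete, to_update = diff_manifests(stored, current)
print(sorted(to_delete))  # ['a.txt']
print(sorted(to_update))  # ['b.txt', 'c.txt']
```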
- class dae.genomic_resources.repository.Mode(value)[source]
Bases:
Enum
Protocol mode.
- READONLY = 1
- READWRITE = 2
- class dae.genomic_resources.repository.ReadOnlyRepositoryProtocol(proto_id: str)[source]
Bases:
ABC
Defines read only genomic resources repository protocol.
- CHUNK_SIZE = 32768
- build_genomic_resource(resource_id: str, version: tuple[int, ...], config: dict | None = None, manifest: Manifest | None = None) GenomicResource [source]
Build a genomic resource based on this protocol.
- compute_md5_sum(resource: GenomicResource, filename: str) str [source]
Compute a md5 hash for a file in the resource.
- abstract file_exists(resource: GenomicResource, filename: str) bool [source]
Check if the given file exists in the given resource.
- find_resource(resource_id: str, version_constraint: str | None = None) GenomicResource | None [source]
Return requested resource or None if not found.
- abstract get_all_resources() Generator[GenomicResource, None, None] [source]
Return generator for all resources in the repository.
- get_file_content(resource: GenomicResource, filename: str, *, uncompress: bool = True, mode: str = 't') Any [source]
Return content of a file in given resource.
- get_manifest(resource: GenomicResource) Manifest [source]
Load and return a resource manifest.
- get_resource(resource_id: str, version_constraint: str | None = None) GenomicResource [source]
Return the requested resource or raise an exception if not found.
If the resource is not found, a FileNotFoundError exception is raised.
- abstract load_manifest(resource: GenomicResource) Manifest [source]
Load resource manifest.
- load_yaml(resource: GenomicResource, filename: str) Any [source]
Return parsed YAML file.
- abstract open_raw_file(resource: GenomicResource, filename: str, mode: str = 'rt', **kwargs: str | bool | None) IO [source]
Open a file in a resource and return a file-like object.
- abstract open_tabix_file(resource: GenomicResource, filename: str, index_filename: str | None = None) TabixFile [source]
Open a tabix file in a resource and return a pysam tabix file.
Not all repositories support this method. Repositories that do not support this method raise an exception.
- abstract open_vcf_file(resource: GenomicResource, filename: str, index_filename: str | None = None) VariantFile [source]
Open a vcf file in a resource and return a pysam VariantFile.
Not all repositories support this method. Repositories that do not support this method raise an exception.
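The CHUNK_SIZE attribute and compute_md5_sum listed for this protocol suggest that file hashes are computed by reading in fixed-size chunks. A minimal sketch of that general technique follows; whether the protocol uses CHUNK_SIZE exactly this way is an assumption, and `md5_of_stream` is a hypothetical helper operating on any binary file-like object.

```python
import hashlib
import io
from typing import BinaryIO

CHUNK_SIZE = 32768  # same value as the CHUNK_SIZE attribute listed above


def md5_of_stream(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> str:
    """Compute an MD5 hex digest without loading the whole file in memory."""
    digest = hashlib.md5()
    while chunk := stream.read(chunk_size):
        digest.update(chunk)  # feed the hash one chunk at a time
    return digest.hexdigest()


data = b"ACGT" * 20000  # 80000 bytes, so several chunks are read
print(md5_of_stream(io.BytesIO(data)) == hashlib.md5(data).hexdigest())  # True
```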
- class dae.genomic_resources.repository.ReadWriteRepositoryProtocol(proto_id: str)[source]
Bases:
ReadOnlyRepositoryProtocol
Defines read write genomic resources repository protocol.
- abstract build_content_file() list[dict[str, Any]] [source]
Build the content of the repository (i.e. the ‘.CONTENTS’ file).
- build_manifest(resource: GenomicResource, prebuild_entries: dict[str, dae.genomic_resources.repository.ManifestEntry] | None = None) Manifest [source]
Build full manifest for the resource.
- build_resource_file_state(resource: GenomicResource, filename: str, **kwargs: str | float | int | None) ResourceFileState [source]
Build resource file state.
- check_update_manifest(resource: GenomicResource, prebuild_entries: dict[str, dae.genomic_resources.repository.ManifestEntry] | None = None) ManifestUpdate [source]
Check if the resource manifest needs update.
- abstract collect_all_resources() Generator[GenomicResource, None, None] [source]
Return generator for all resources managed by this protocol.
- abstract collect_resource_entries(resource: GenomicResource) Manifest [source]
Scan the resource and return a manifest with all files.
- copy_resource(remote_resource: GenomicResource) GenomicResource [source]
Copy a remote resource into repository.
- abstract copy_resource_file(remote_resource: GenomicResource, dest_resource: GenomicResource, filename: str) ResourceFileState | None [source]
Copy a remote resource file into local repository.
- abstract delete_resource_file(resource: GenomicResource, filename: str) None [source]
Delete a resource file and its internal state.
- get_manifest(resource: GenomicResource) Manifest [source]
Load or build a resource manifest.
- get_or_create_resource(resource_id: str, version: tuple[int, ...]) GenomicResource [source]
Return a resource with specified ID and version.
If the resource is not found create an empty resource.
- abstract get_resource_file_size(resource: GenomicResource, filename: str) int [source]
Return the size of a resource file.
- abstract get_resource_file_timestamp(resource: GenomicResource, filename: str) float [source]
Return the timestamp (ISO formatted) of a resource file.
- abstract load_resource_file_state(resource: GenomicResource, filename: str) ResourceFileState | None [source]
Load resource file state from internal GRR state.
If the specified resource file has no internal state returns None.
- save_index(resource: GenomicResource, contents: str) None [source]
Save an index HTML file into the genomic resource’s directory.
- save_manifest(resource: GenomicResource, manifest: Manifest) None [source]
Save manifest into genomic resource’s directory.
- abstract save_resource_file_state(resource: GenomicResource, state: ResourceFileState) None [source]
Save resource file state into internal GRR state.
- update_manifest(resource: GenomicResource, prebuild_entries: dict[str, dae.genomic_resources.repository.ManifestEntry] | None = None) Manifest [source]
Update or create full manifest for the resource.
- update_resource(remote_resource: GenomicResource, files_to_copy: set[str] | None = None) GenomicResource [source]
Copy a remote resource into repository.
Allows copying of a subset of files from the resource via files_to_copy. If files_to_copy is None, copies all files.
- abstract update_resource_file(remote_resource: GenomicResource, dest_resource: GenomicResource, filename: str) ResourceFileState | None [source]
Update a resource file into repository if needed.
- class dae.genomic_resources.repository.ResourceFileState(filename: str, size: int, timestamp: float, md5: str)[source]
Bases:
object
Defines resource file state saved into internal GRR state.
- filename: str
- md5: str
- size: int
- timestamp: float
- dae.genomic_resources.repository.is_gr_id_token(token: str) bool [source]
Check if token can be used as a genomic resource ID.
A Genomic Resource Id Token is a string with one or more letters, numbers, ‘.’, ‘_’, or ‘-’. The function checks if the parameter token is a Genomic Resource Id Token.
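The token rule stated for is_gr_id_token translates directly into a regular expression. The sketch below is a reading of that description rather than the function's source; `looks_like_gr_id_token` is a hypothetical name, and restricting "letters" to ASCII is an assumption.

```python
import re

# One or more letters, digits, '.', '_' or '-' (ASCII letters assumed).
_TOKEN_RE = re.compile(r"[a-zA-Z0-9._-]+")


def looks_like_gr_id_token(token: str) -> bool:
    """Check a string against the documented token rule (hypothetical helper)."""
    return _TOKEN_RE.fullmatch(token) is not None


print(looks_like_gr_id_token("phastCons_100way.v1-2"))  # True
print(looks_like_gr_id_token("aa/bb"))                  # False
print(looks_like_gr_id_token(""))                       # False
```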
- dae.genomic_resources.repository.is_version_constraint_satisfied(version_constraint: str | None, version: tuple[int, ...]) bool [source]
Check if a version matches a version constraint.
- dae.genomic_resources.repository.parse_gr_id_version_token(token: str) tuple[str, tuple[int, ...]] [source]
Parse genomic resource ID with version.
A Genomic Resource Id Version Token is a Genomic Resource Id Token with an optional version appended. If present, the version suffix has the form “(3.3.2)”. The default version is (0). Returns None if the token is not a Genomic Resource Id Version Token. Otherwise returns a (token, version) tuple.
- dae.genomic_resources.repository.parse_resource_id_version(resource_path: str) tuple[str, tuple[int, ...]] [source]
Parse genomic resource id and version path into Id, Version tuple.
The default version (0,) is appended if the path has no version suffix. If present, the version suffix has the form “(3.3.2)”. Returns the tuple (None, None) if the path does not match the resource_id/version requirements. Otherwise returns the tuple (resource_id, version).
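The parsing rule described for parse_resource_id_version can be sketched with a regular expression. This is an illustration under assumptions, not the library's code: the name `parse_id_version` is hypothetical, and treating the resource ID as "/"-separated ID tokens is inferred from the aa/bb/cc example earlier in this module.

```python
from __future__ import annotations

import re

_PATH_RE = re.compile(
    r"(?P<id>[a-zA-Z0-9._-]+(?:/[a-zA-Z0-9._-]+)*)"  # tokens joined by '/'
    r"(?:\((?P<version>\d+(?:\.\d+)*)\))?"           # optional "(3.3.2)" suffix
)


def parse_id_version(
    path: str,
) -> tuple[str, tuple[int, ...]] | tuple[None, None]:
    """Split a resource path into (resource_id, version); hypothetical helper."""
    match = _PATH_RE.fullmatch(path)
    if match is None:
        return None, None  # path does not satisfy the ID/version pattern
    version = match.group("version")
    if version is None:
        return match.group("id"), (0,)  # default version
    return match.group("id"), tuple(int(part) for part in version.split("."))


print(parse_id_version("aa/bb/cc(3.3.2)"))  # ('aa/bb/cc', (3, 3, 2))
print(parse_id_version("aa/bb/cc"))         # ('aa/bb/cc', (0,))
print(parse_id_version("aa//cc"))           # (None, None)
```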
dae.genomic_resources.repository_factory module
Provides a factory for building genomic resources repositories.
- dae.genomic_resources.repository_factory.build_genomic_resource_group_repository(repo_id: str, children: list[dae.genomic_resources.repository.GenomicResourceRepo]) GenomicResourceRepo [source]
- dae.genomic_resources.repository_factory.build_genomic_resource_repository(definition: dict | None = None, file_name: str | None = None) GenomicResourceRepo [source]
Build a GRR using a definition dict or yaml file.
- dae.genomic_resources.repository_factory.build_resource_implementation(res: GenomicResource) GenomicResourceImplementation [source]
Build a resource implementation from a resource.
- dae.genomic_resources.repository_factory.get_default_grr_definition() dict[str, Any] [source]
Return default genomic resources repository definition.
dae.genomic_resources.testing module
Provides tools useful for testing.
- dae.genomic_resources.testing.build_filesystem_test_protocol(root_path: Path, repair: bool = True) FsspecReadWriteProtocol [source]
Build and return a filesystem fsspec protocol for testing.
The root_path is expected to point to a directory structure with all the resources.
- dae.genomic_resources.testing.build_filesystem_test_repository(root_path: Path) GenomicResourceProtocolRepo [source]
Build and return a filesystem fsspec repository for testing.
The root_path is expected to point to a directory structure with all the resources.
- dae.genomic_resources.testing.build_filesystem_test_resource(root_path: Path) GenomicResource [source]
- dae.genomic_resources.testing.build_http_test_protocol(root_path: Path, repair: bool = True) Generator[FsspecReadOnlyProtocol, None, None] [source]
Run an HTTP range server and construct genomic resource protocol.
The HTTP range server is used to serve directory pointed by root_path. This directory should be a valid filesystem genomic resource repository.
- dae.genomic_resources.testing.build_inmemory_test_protocol(content: dict[str, Any]) FsspecReadWriteProtocol [source]
Build and return an embedded fsspec protocol for testing.
- dae.genomic_resources.testing.build_inmemory_test_repository(content: dict[str, Any]) GenomicResourceProtocolRepo [source]
Create an embedded GRR repository using passed content.
- dae.genomic_resources.testing.build_inmemory_test_resource(content: dict[str, Any]) GenomicResource [source]
Create a test resource based on content passed.
The passed content should be appropriate for a single resource. Example content:
    {
        "genomic_resource.yaml": textwrap.dedent('''
            type: position_score
            table:
                filename: data.txt
            scores:
            - id: aaaa
              type: float
              desc: ""
              name: sc
        '''),
        "data.txt": convert_to_tab_separated('''
            #chrom start end sc
            1      10    12  1.1
            2      13    14  1.2
        ''')
    }
- dae.genomic_resources.testing.build_s3_test_bucket(s3filesystem: S3FileSystem | None = None) str [source]
Create an S3 test bucket.
- dae.genomic_resources.testing.build_s3_test_filesystem(endpoint_url: str | None = None) S3FileSystem [source]
Create an S3 fsspec filesystem connected to the S3 server.
- dae.genomic_resources.testing.build_s3_test_protocol(root_path: Path) Generator[FsspecReadWriteProtocol, None, None] [source]
Run an S3 moto server and construct fsspec genomic resource protocol.
The S3 moto server is populated with resource from filesystem GRR pointed by the root_path.
- dae.genomic_resources.testing.convert_to_tab_separated(content: str) str [source]
Convert a string into tab separated file content.
Useful for testing purposes. If you need to have a space in the file content use ‘||’.
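The conversion that convert_to_tab_separated describes can be approximated in a few lines. The sketch below is a simplified assumption about its behaviour, not the function itself: `to_tab_separated` is a hypothetical name, and dropping blank lines and the trailing newline are choices made for the example.

```python
def to_tab_separated(content: str) -> str:
    """Turn whitespace-aligned text into tab-separated text (hypothetical helper)."""
    rows = []
    for line in content.splitlines():
        fields = line.split()  # split on any run of whitespace
        if not fields:
            continue  # skip blank lines
        rows.append("\t".join(fields))
    # '||' stands for a literal space inside a field.
    return "\n".join(rows).replace("||", " ")


text = to_tab_separated("""
    #chrom start end sc
    1      10    12  1.1
    2      13    14  a||b
""")
print(repr(text))
```

Writing test tables with aligned columns keeps them readable in source code, while the converted text has exactly one tab between fields.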
- dae.genomic_resources.testing.copy_proto_genomic_resources(dest_proto: FsspecReadWriteProtocol, src_proto: FsspecReadOnlyProtocol) None [source]
- dae.genomic_resources.testing.http_process_test_server(path: Path) Generator[str, None, None] [source]
- dae.genomic_resources.testing.http_threaded_test_server(path: Path) Generator[str, None, None] [source]
Run a range HTTP threaded server.
The HTTP range server is used to serve directory pointed by root_path.
- dae.genomic_resources.testing.proto_builder(scheme: str, content: dict) Generator[FsspecReadOnlyProtocol | FsspecReadWriteProtocol, None, None] [source]
Build a test genomic resource protocol with specified content.
- dae.genomic_resources.testing.resource_builder(scheme: str, content: dict) Generator[GenomicResource, None, None] [source]
- dae.genomic_resources.testing.s3_test_protocol() FsspecReadWriteProtocol [source]
Build an S3 fsspec testing protocol on top of existing S3 server.
- dae.genomic_resources.testing.setup_dae_transmitted(root_path: Path, summary_content: str, toomany_content: str) tuple[str, str] [source]
Set up a DAE transmitted variants file using passed content.
- dae.genomic_resources.testing.setup_directories(root_dir: Path, content: str | dict[str, Any]) None [source]
Set up directory and subdirectory structures using the content.
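A nested dictionary is a convenient way to describe a directory tree for tests, which is what the content argument of setup_directories accepts. The sketch below shows one plausible mapping, where dictionaries become directories and strings become file contents; it is an assumption about the behaviour, and `write_tree` is a hypothetical name.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Any


def write_tree(root_dir: Path, content: str | dict[str, Any]) -> None:
    """Create files and directories from nested content (hypothetical helper)."""
    if isinstance(content, str):
        # A string is written as the content of a file at this path.
        root_dir.parent.mkdir(parents=True, exist_ok=True)
        root_dir.write_text(content, encoding="utf8")
        return
    # A dictionary becomes a directory; recurse into its items.
    root_dir.mkdir(parents=True, exist_ok=True)
    for name, value in content.items():
        write_tree(root_dir / name, value)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "grr"
    write_tree(root, {"one": {"genomic_resource.yaml": "", "data.txt": "alabala"}})
    print(sorted(p.relative_to(root).as_posix() for p in root.rglob("*")))
```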
- dae.genomic_resources.testing.setup_empty_gene_models(out_path: Path) GeneModels [source]
Set up empty gene models.
- dae.genomic_resources.testing.setup_gene_models(out_path: Path, content: str, fileformat: str | None = None) GeneModels [source]
Set up gene models in refflat format using the passed content.
- dae.genomic_resources.testing.setup_genome(out_path: Path, content: str) ReferenceGenome [source]
Set up reference genome using the content.
- dae.genomic_resources.testing.setup_gzip(gzip_path: Path, gzip_content: str) Path [source]
Set up a gzipped TSV file.
Module contents
- class dae.genomic_resources.GenomicResource(resource_id: str, version: tuple[int, ...], protocol: ReadOnlyRepositoryProtocol | ReadWriteRepositoryProtocol, config: dict[str, Any] | None = None, manifest: Manifest | None = None)[source]
Bases:
object
Base class for genomic resources.
- get_file_content(filename: str, *, uncompress: bool = True, mode: str = 't') Any [source]
Return the content of file in a resource.
- get_genomic_resource_id_version() str [source]
Return a string combining resource ID and version.
Returns a string of the form aa/bb/cc[3.2] for a genomic resource with id aa/bb/cc and version 3.2. If the version is 0 the string will be aa/bb/cc.
- open_raw_file(filename: str, mode: str = 'rt', **kwargs: str | bool | None) IO [source]
Open a file in the resource and return a file-like object.
- dae.genomic_resources.build_genomic_resource_repository(definition: dict | None = None, file_name: str | None = None) GenomicResourceRepo [source]
Build a GRR using a definition dict or yaml file.
- dae.genomic_resources.get_resource_implementation_builder(resource_type: str) Callable[[GenomicResource], GenomicResourceImplementation] | None [source]
Return an implementation builder for a certain resource type.
If the builder is not registered, then it will search for an entry point in the found implementations list. If an entry point is found, it will be loaded and registered and returned.