Import Tools
============

What is Import Tools
--------------------

Import Tools is the new way to import studies into seqpipe. Import Tools allows studies to be imported in one step, without the need to generate Snakefiles and execute them in a separate step. The study and the arguments used to import it are described in a single yaml config file. The config file is general enough that it can be committed in a source code repository, allowing other team members to use it. Import Tools assumes the input data has already been massaged into a format supported by seqpipe.

Import Tools config files
-------------------------

Import Tools config files have a relatively simple structure. In their simplest form they consist of 3 sections:

- Input section: describes the input files (such as the pedigree file and vcf files) and the configuration options required to read these files successfully.
- Processing config: describes how the input is supposed to be handled and processed.
- Destination: describes where to store the generated data, for example an impala table. This section also includes the partition_description.

Import Tools configuration format
---------------------------------

.. code-block:: yaml

    vars:
        my_dir: "..."

    id: SFARI_SPARK_WES_2

    input:
        file: "external file defining input"
        (OR)
        input_dir:
        pedigree:
            file: %(my_dir)s/SFARI_SPARK_WES_2.ped
            dad: fatherId
            mom: motherId
            status: affected
        vcf:
            files:
            - wes2_15995_exome.gatk.vcf.gz
            denovo_mode: ignore
            omission_mode: ignore
            add_chrom_prefix: chr
        denovo:
            files:
            - wes2_merged_cohFreq_Cut17_final_v1_ALL_042921_GPF.tsv.txt
            person_id: spid
            chrom: chrom
            pos: pos
            ref: ref
            alt: alt
            add_chrom_prefix: chr

    processing_config:
        vcf: single_bucket
        (OR)
        vcf: chromosome
        (OR)
        vcf:
            chromosomes: ['chr1', 'chr2', 'chr3', ..., 'chr22', 'chrX', 'chrY']
        (OR)
        vcf:
            chromosomes: ['autosomes', 'chrX', 'chrM']
            region_length: 100M
        work_dir: ""

    (optional, by default use the default gpf_instance)
    gpf_instance:
        path: ...

    (optional, by default use the gpf_instance annotation pipeline)
    annotation:
        gpf_pipeline: ""
        (OR)
        file: ""
        (OR)
        embedded-annotation

    (optional, by default use the default storage of the gpf instance)
    destination:
        storage_id: "id in gpf_instance"
        (OR)
        storage_type: impala
        (OR)
        storage_type: impala
        id: storage_id
        hdfs:
            base_dir: "/user/impala_schema_1/studies"
            host: seqclust0
            port: 8020
            replication: 1
        impala:
            db: "impala_schema_1"
            hosts:
            - seqclust0
            - seqclust1
            - seqclust2
            port: 21050
            pool_size: 3

    parquet_row_group_size:
        vcf: 30M

    partition_description:
        region_bin:
            chromosomes: [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX]
            region_length: 30000000
        family_bin:
            bin_size: 10
        frequency_bin:
            rare_boundary: 5
        coding_bin:
            coding_effect_types: [splice-site,frame-shift,nonsense,no-frame-shift-newStop,noStart,noEnd,missense,no-frame-shift,CDS,synonymous,coding_unknown,regulatory,3'UTR,5'UTR]

*input* is the section where we describe the input files. It is divided into subsections for each input type (vcf, denovo and so on). All files are relative to the *input_dir*. The *input_dir* is itself relative to the directory where the config file is located. *input_dir* is optional; if unspecified, every file is relative to the config file's directory. If the input configuration is in an external file, then input file paths are relative to that external file.
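To illustrate how the pieces fit together, here is a minimal sketch of an import configuration that uses only a pedigree file and a single vcf file. The study id, directory and file names are hypothetical placeholders, and the storage id is assumed to be one defined in the gpf instance:

.. code-block:: yaml

    id: example_study                  # hypothetical study id

    input:
        input_dir: data                # relative to this config file's directory
        pedigree:
            file: example_study.ped    # resolved as data/example_study.ped
        vcf:
            files:
            - example_study.vcf.gz     # resolved as data/example_study.vcf.gz

    processing_config:
        vcf: chromosome                # one import bucket per chromosome

    destination:
        storage_id: genotype_storage   # a storage id defined in the gpf instance

With this layout, the pedigree and vcf files are looked up inside the *data* directory next to the config file, the vcf is imported chromosome by chromosome, and the result goes to the named storage.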
*processing_config* is where we describe how to split the input files into smaller buckets for parallel processing. *single_bucket* means that the entire input will be processed as a single task, without splitting it into smaller parts. *chromosome* or a list of chromosomes means that each chromosome will be processed in parallel. If *region_length* is specified, then each chromosome will be split into regions of length *region_length* and all such regions will be processed in parallel. *work_dir* is the location where parquet files will be generated. If missing, the current working directory is used. For any set of input files (denovo, vcf and so on), if the corresponding section in *processing_config* is missing, the default value for bucket generation is *single_bucket*.

*gpf_instance* is an optional section that allows you to specify a gpf instance configuration file.

*annotation* is where the annotation pipeline is specified. It can be the name of a pipeline described in the gpf config (using the gpf_pipeline argument), a path to a file describing the pipeline, or an embedded annotation pipeline.

*destination* describes where the generated parquet files will be imported. This section can be either the name of a storage defined in the gpf instance or an embedded storage config. If only *storage_type* is specified, then parquet files will be generated for that particular storage type but will NOT be imported anywhere. This is useful for generating parquet files without actually importing them.

Working with the Import Tools CLI
---------------------------------

To import a study, you first need an import configuration as described above. To run import tools with the config file, execute:

.. code-block:: bash

    import_tools import_config.yaml

To list the steps that will be executed without actually executing them:

.. code-block:: bash

    import_tools import_config.yaml list

*import_tools* has a number of parameters; run it with --help to see them. A commonly used one is `-j`, which specifies the number of tasks to run in parallel.

Running on an SGE cluster
-------------------------

.. code-block:: bash

    import_tools import_config.yaml run --sge -j 100

This command will run import tools on an SGE cluster using 100 parallel workers. It assumes a preconfigured, working SGE cluster. The *import_config.yaml* file should be placed on a shared file system that can be accessed by all nodes in the cluster.

Running on a Kubernetes cluster
-------------------------------

Running on kubernetes is a little more involved because typically the nodes in the cluster don't share a common file system, and the machine where *import_tools* is run is usually not part of the cluster. The import process therefore needs common storage that can be accessed both by the nodes in the cluster and by the machine from which import tools is run. The easiest way to achieve this is by using S3. The best setup is to place the import configuration on S3 together with the input data. Accessing S3 (and other AWS services) usually happens through an access key and a secret key. Assuming these keys are already configured in the corresponding environment variables, import tools can be run like this:

.. code-block:: bash

    import_tools s3://bucket/import_config.yaml run --kubernetes --envvars AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY --image-pull-secrets seqpipe-registry-cred -j 20

The environment variables specified by --envvars will be propagated to the worker pods so that the workers can access S3.
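For example, assuming the credentials are at hand, the two keys can be exported in the shell that launches the import so that --envvars can forward them to the worker pods (the values below are placeholders):

.. code-block:: bash

    # Placeholder values; replace with your actual S3 credentials.
    export AWS_ACCESS_KEY_ID="..."
    export AWS_SECRET_ACCESS_KEY="..."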
The --image-pull-secrets option specifies a kubernetes secret that should contain the credentials for accessing the seqpipe docker registry from which the images for the worker pods are pulled. The -j option specifies that 20 workers should be started.

If using a non-AWS S3 implementation, such as Ceph storage, the endpoint url can be specified using the *S3_ENDPOINT_URL* environment variable:

.. code-block:: bash

    S3_ENDPOINT_URL=http://s3.my-server.com:7480 import_tools s3://bucket/import_config.yaml run --kubernetes --envvars AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY --image-pull-secrets seqpipe-registry-cred -j 20

Classes and Functions
---------------------

.. toctree::
   :maxdepth: 3

   modules/dae.import_tools