GPF Getting Started Guide
=========================

Prerequisites
#############

This guide assumes that you are working on a recent Linux box.

Working version of `anaconda` or `miniconda`
++++++++++++++++++++++++++++++++++++++++++++

The GPF system is distributed as an Anaconda package using the ``conda``
package manager. If you do not have a working version of Anaconda or
Miniconda, you must install one. We recommend Miniconda.

Go to the Miniconda distribution page and download the Linux installer:

.. code-block:: bash

    wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

and install it in your local environment:

.. code-block:: bash

    sh Miniconda3-latest-Linux-x86_64.sh

.. note::

    At the end of the installation process, you will be asked if you wish to
    allow the installer to initialize Miniconda3 by running ``conda init``.
    If you choose to, every terminal you open after that will have the
    ``base`` Anaconda environment activated, and you'll have access to the
    ``conda`` commands used below.

Once Anaconda/Miniconda is installed, we recommend installing ``mamba`` and
using it instead of ``conda`` when installing packages; it speeds up the
installation considerably:

.. code-block:: bash

    conda install -c conda-forge mamba

GPF Installation
################

The GPF system is developed in Python and supports Python 3.9 and up. The
recommended way to set up the GPF development environment is to use
Anaconda.

Install GPF
+++++++++++

Create an empty Anaconda environment named ``gpf``:

.. code-block:: bash

    conda create -n gpf

To use this environment, you need to activate it:

.. code-block:: bash

    conda activate gpf

Install the ``gpf_wdae`` conda package into the activated ``gpf``
environment:

.. code-block:: bash

    mamba install \
        -c defaults \
        -c conda-forge \
        -c bioconda \
        -c iossifovlab \
        gpf_wdae

This command installs GPF and all of its dependencies.
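To verify the installation, you can check that the GPF command-line tools
used later in this guide are on the ``PATH`` of the activated environment.
This is a minimal sanity check; the ``dae`` package name is taken from the
Python interface examples at the end of this guide:

.. code-block:: bash

    # run inside the activated "gpf" environment
    which wgpf import_tools

    # check that the GPF Python packages import cleanly
    python -c "import dae; print('GPF imports OK')"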
Create an empty GPF instance
++++++++++++++++++++++++++++

Create an empty directory named ``data-hg38-empty``:

.. code-block:: bash

    mkdir data-hg38-empty

and inside it, create a file named ``gpf_instance.yaml`` with the following
content:

.. code-block:: yaml

    reference_genome:
      resource_id: "hg38/genomes/GRCh38-hg38"

    gene_models:
      resource_id: "hg38/gene_models/refSeq_v20200330"

This configures a GPF instance as follows:

* the reference genome is ``hg38/genomes/GRCh38-hg38`` from the default
  genomic resources repository (GRR);

* the gene models are ``hg38/gene_models/refSeq_v20200330`` from the default
  GRR;

* unless specified otherwise, GPF uses the default genomic resources
  repository located at
  https://www.iossifovlab.com/distribution/public/genomic-resources-repository/.
  Resources are used without caching.

Run the GPF development web server
##################################

By default, the GPF system looks for a file ``gpf_instance.yaml`` in the
current directory (and its parent directories). If GPF finds such a file, it
uses it as the configuration of the GPF instance; otherwise, it throws an
exception.

Now we can run the GPF development web server and browse our empty GPF
instance:

.. code-block:: bash

    wgpf run

and browse the GPF development server at ``http://localhost:8000``.

To stop the development GPF web server, press ``Ctrl-C`` - the usual
keybinding for stopping long-running Linux commands in a terminal.

.. warning::

    The development web server run by ``wgpf run`` used in this guide is
    meant for development purposes only and is not suitable for serving the
    GPF system in production.

Import genotype variants
########################

Data Storage
++++++++++++

The GPF system uses genotype storages for storing genomic variants. In this
guide we are going to use the in-memory genotype storage. It is the easiest
to set up and use, but it is unsuitable for large studies. By default, each
GPF instance has an internal in-memory genotype storage.

Import Tools and Import Project
+++++++++++++++++++++++++++++++

Importing genotype data into a GPF instance involves multiple steps. The
tool used to import genotype data is named ``import_tools``. It expects an
import project file that describes the import and supports importing
variants from three formats:

* list of de novo variants;
* list of de novo CNV variants;
* Variant Call Format (VCF).

Example import of de novo variants: ``helloworld``
++++++++++++++++++++++++++++++++++++++++++++++++++

.. note::

    Input files for this example can be downloaded from
    :download:`denovo-helloworld.tar.gz`.

Let us import a small list of de novo variants. We will need the list of de
novo variants ``helloworld.tsv``:

.. code-block::

    CHROM  POS       REF    ALT  person_ids
    chr14  21403214  T      C    p1
    chr14  21431459  G      C    p1
    chr14  21391016  A      AT   p2
    chr14  21403019  G      A    p2
    chr14  21402010  G      A    p1
    chr14  21393484  TCTTC  T    p2

and a pedigree file ``helloworld.ped`` that describes the families:

.. code-block::

    familyId  personId  dadId  momId  sex  status  role  phenotype
    f1        m1        0      0      2    1       mom   unaffected
    f1        d1        0      0      1    1       dad   unaffected
    f1        p1        d1     m1     1    2       prb   autism
    f1        s1        d1     m1     2    2       sib   unaffected
    f2        m2        0      0      2    1       mom   unaffected
    f2        d2        0      0      1    1       dad   unaffected
    f2        p2        d2     m2     1    2       prb   autism

.. warning::

    Please note that the default separator for the de novo variants file and
    the pedigree file is ``TAB``. If you copy these snippets and paste them
    into the corresponding files, the separators between values will most
    probably become spaces. You need to ensure that the separators between
    column values are ``TAB`` symbols.
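One way to check the separators is the ``cat -A`` command (available on
GNU/Linux), which renders each ``TAB`` character as ``^I``:

.. code-block:: bash

    # TAB separators are shown as "^I"; "$" marks the end of each line
    cat -A helloworld.tsv | head -n 2
    cat -A helloworld.ped | head -n 2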
The project configuration file for importing this study,
``denovo_helloworld.yaml``, should look like this:

.. code-block:: yaml

    id: denovo_helloworld

    input:
      pedigree:
        file: helloworld.ped

      denovo:
        files:
        - helloworld.tsv
        person_id: person_ids
        chrom: CHROM
        pos: POS
        ref: REF
        alt: ALT

To import this project, run the following command:

.. code-block:: bash

    import_tools denovo_helloworld.yaml

When the import finishes, you can run the GPF development server:

.. code-block:: bash

    wgpf run

and browse the content of the GPF development server at
``http://localhost:8000``.

Example import of VCF variants: ``vcf_helloworld``
++++++++++++++++++++++++++++++++++++++++++++++++++

.. note::

    Input files for this example can be downloaded from
    :download:`vcf-helloworld.tar.gz`.

Let us have a small VCF file ``helloworld.vcf``:

.. code-block::

    ##fileformat=VCFv4.2
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##contig=<ID=chr14>
    #CHROM  POS       ID  REF   ALT  QUAL  FILTER  INFO  FORMAT  m1   d1   p1   s1   m2   d2   p2
    chr14   21385738  .   C     T    .     .       .     GT      0/0  0/1  0/1  0/0  0/0  0/1  0/0
    chr14   21385954  .   A     C    .     .       .     GT      0/0  0/0  0/0  0/0  0/1  0/0  0/1
    chr14   21393173  .   T     C    .     .       .     GT      0/1  0/0  0/0  0/1  0/0  0/0  0/0
    chr14   21393702  .   C     T    .     .       .     GT      0/0  0/0  0/0  0/0  0/0  0/1  0/1
    chr14   21393860  .   G     A    .     .       .     GT      0/0  0/1  0/1  0/1  0/0  0/0  0/0
    chr14   21403023  .   G     A    .     .       .     GT      0/0  0/1  0/0  0/1  0/1  0/0  0/0
    chr14   21405222  .   T     C    .     .       .     GT      0/0  0/0  0/0  0/0  0/0  0/1  0/0
    chr14   21409888  .   T     C    .     .       .     GT      0/1  0/0  0/1  0/0  0/1  0/0  1/0
    chr14   21429019  .   C     T    .     .       .     GT      0/0  0/1  0/1  0/0  0/0  0/1  0/1
    chr14   21431306  .   G     A    .     .       .     GT      0/0  0/1  0/1  0/1  0/0  0/0  0/0
    chr14   21431623  .   A     C    .     .       .     GT      0/0  0/0  0/0  0/0  0/1  1/1  1/1
    chr14   21393540  .   GGAA  G    .     .       .     GT      0/1  0/1  1/1  0/0  0/0  0/0  0/0

and a pedigree file ``helloworld.ped`` (the same pedigree file used in the
``helloworld`` de novo example above):

.. code-block::

    familyId  personId  dadId  momId  sex  status  role  phenotype
    f1        m1        0      0      2    1       mom   unaffected
    f1        d1        0      0      1    1       dad   unaffected
    f1        p1        d1     m1     1    2       prb   autism
    f1        s1        d1     m1     2    2       sib   unaffected
    f2        m2        0      0      2    1       mom   unaffected
    f2        d2        0      0      1    1       dad   unaffected
    f2        p2        d2     m2     1    2       prb   autism

.. warning::

    Please note that the default separator for the VCF and pedigree files is
    ``TAB``. If you copy these snippets and paste them into the
    corresponding files, the separators between values will most probably
    become spaces. You need to ensure that the separators between column
    values are ``TAB`` symbols for the import to work.

The project configuration file for importing this VCF study,
``vcf_helloworld.yaml``, should look like this:

.. code-block:: yaml

    id: vcf_helloworld

    input:
      pedigree:
        file: helloworld.ped

      vcf:
        files:
        - helloworld.vcf

To import this project, run the following command:

.. code-block:: bash

    import_tools vcf_helloworld.yaml

When the import finishes, you can run the GPF development server:

.. code-block:: bash

    wgpf run

and browse the content of the GPF development server at
``http://localhost:8000``.

Example of a dataset (group of genotype studies)
++++++++++++++++++++++++++++++++++++++++++++++++

The already imported studies ``denovo_helloworld`` and ``vcf_helloworld``
contain genomic variants for the same group of individuals described in
``helloworld.ped``. We can create a dataset (a group of genotype studies)
that includes both studies.

To this end, create a directory ``datasets/helloworld`` inside the GPF
instance directory ``data-hg38-empty``:

.. code-block:: bash

    cd data-hg38-empty
    mkdir -p datasets/helloworld

and place the following configuration file ``helloworld.yaml`` inside that
directory:

.. code-block:: yaml

    id: helloworld
    name: Hello World Dataset

    studies:
    - denovo_helloworld
    - vcf_helloworld
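At this point, it is worth checking where the hand-written configuration
files live relative to the GPF instance directory. A quick way is ``find``,
run from the parent directory of ``data-hg38-empty``; note that the imports
above may have added study configuration files of their own under the
instance directory:

.. code-block:: bash

    find data-hg38-empty -name "*.yaml"
    # the output should include, among others:
    #   data-hg38-empty/gpf_instance.yaml
    #   data-hg38-empty/datasets/helloworld/helloworld.yaml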
Example import of de novo variants from `Rates of contributory de novo mutation in high and low-risk autism families`
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Let us import de novo variants from Yoon, S., Munoz, A., Yamrom, B. et al.
Rates of contributory de novo mutation in high and low-risk autism families.
Commun Biol 4, 1026 (2021). We will focus on the de novo variants from the
SSC collection published in this paper.

To import these variants into the GPF system, we need a list of de novo
variants and a pedigree file describing the families. The list of de novo
variants is available from the paper's Supplementary Data 2. A pedigree file
for this study is not available; instead, we have a list of children
available from Supplementary Data 1.

Let us first export these Excel spreadsheets into tab-separated files. Say
the list of de novo variants from the SSC collection is saved into a file
named ``SupplementaryData2_SSC.tsv``, and the list of children is saved into
a file named ``SupplementaryData1_Children.tsv``.

.. note::

    Input files for this example can be downloaded from
    :download:`denovo-in-high-and-low-risk-papter.tar.gz`.

Preprocess the families data
____________________________

To import the data into GPF, we need a pedigree file describing the
structure of the families. ``SupplementaryData1_Children.tsv`` contains only
the list of children; there is no information about their parents.
Fortunately, for the SSC collection it is not difficult to build the full
family structures from the information we have. For SSC, if a family has ID
``<familyId>``, then the identifiers of the individuals in the family are
formed as follows:

* mother - ``<familyId>.mo``;
* father - ``<familyId>.fa``;
* proband - ``<familyId>.p1``;
* first sibling - ``<familyId>.s1``;
* second sibling - ``<familyId>.s2``.

Another important SSC convention is that the only affected person in a
family is the proband; the status of the mother, the father, and the
siblings is ``unaffected``.

Using these conventions, we can write a simple Python script
``build_ssc_pedigree.py`` to convert ``SupplementaryData1_Children.tsv``
into a pedigree file ``ssc_denovo.ped``:

.. code-block:: python

    """Converts SupplementaryData1_Children.tsv into a pedigree file."""
    import pandas as pd

    children = pd.read_csv("SupplementaryData1_Children.tsv", sep="\t")
    ssc = children[children.collection == "SSC"]

    # list of all individuals in SSC; each person is represented by a tuple:
    # (familyId, personId, dadId, momId, status, sex)
    persons = []
    for fam_id, members in ssc.groupby("familyId"):
        persons.append((fam_id, f"{fam_id}.mo", "0", "0", "unaffected", "F"))
        persons.append((fam_id, f"{fam_id}.fa", "0", "0", "unaffected", "M"))
        for child in members.to_dict(orient="records"):
            persons.append((
                fam_id, child["personId"],
                f"{fam_id}.fa", f"{fam_id}.mo",
                child["affected status"], child["sex"]))

    with open("ssc_denovo.ped", "wt", encoding="utf8") as output:
        output.write(
            "\t".join(
                ("familyId", "personId", "dadId", "momId", "status", "sex")))
        output.write("\n")
        for person in persons:
            output.write("\t".join(person))
            output.write("\n")

Running this script reads ``SupplementaryData1_Children.tsv`` and produces
the appropriate pedigree file ``ssc_denovo.ped``.
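For example, assuming ``SupplementaryData1_Children.tsv`` is in the current
directory and ``pandas`` is installed in the active environment, you can run
the script and inspect the first lines of the produced pedigree file:

.. code-block:: bash

    python build_ssc_pedigree.py
    head -n 3 ssc_denovo.ped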
Preprocess the variants data
____________________________

The ``SupplementaryData2_SSC.tsv`` file contains 255,231 variants. Importing
that many variants into the in-memory genotype storage is not appropriate,
so for this example we are going to use a subset of 10,000 variants
(``head -n 10001`` keeps the header line plus the first 10,000 variant
lines):

.. code-block:: bash

    head -n 10001 SupplementaryData2_SSC.tsv > ssc_denovo.tsv

Data import of ``ssc_denovo``
_____________________________

Now we have a pedigree file ``ssc_denovo.ped`` and a list of de novo
variants ``ssc_denovo.tsv``. Let us prepare an import project configuration
file ``ssc_denovo.yaml``:

.. code-block:: yaml

    id: ssc_denovo

    input:
      pedigree:
        file: ssc_denovo.ped

      denovo:
        files:
        - ssc_denovo.tsv
        person_id: personIds
        variant: variant
        location: location

To import the study, run:

.. code-block:: bash

    import_tools ssc_denovo.yaml

and when the import finishes, start the development GPF server:

.. code-block:: bash

    wgpf run

In the list of studies, we should now have a new study ``ssc_denovo``.
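Besides browsing the new study in the web interface, you can do a quick
sanity check from Python, using the API described in the `Example Usage of
GPF Python Interface`_ section below. This is a minimal sketch; it assumes
it is run from the GPF instance directory, so that ``gpf_instance.yaml`` can
be found:

.. code-block:: python3

    from dae.gpf_instance.gpf_instance import GPFInstance

    # GPF looks for gpf_instance.yaml in the current directory
    # (and its parent directories)
    gpf_instance = GPFInstance()
    st = gpf_instance.get_genotype_data("ssc_denovo")

    # count the imported de novo variants without storing them in memory
    print(sum(1 for _ in st.query_variants()))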
.. _reports_tool:

Getting started with Dataset Statistics
#######################################

To generate family and de novo variant reports, you can use the
``generate_common_report.py`` tool. It supports the option
``--show-studies`` to list all studies and datasets configured in the GPF
instance:

.. code-block:: bash

    generate_common_report.py --show-studies

To generate the reports for a given study or dataset, you can use the
``--studies`` option.

By default, the dataset statistics are disabled. If we try to run

.. code-block:: bash

    generate_common_report.py --studies helloworld

it will not generate the dataset statistics. Instead, it will print a
message that the reports are disabled for the study ``helloworld``:

.. code-block:: bash

    WARNING:generate_common_reports:skipping study helloworld

To enable the dataset statistics for the ``helloworld`` dataset, we need to
modify its configuration and add a new section that enables them:

.. code-block:: yaml

    id: helloworld
    name: Hello World Dataset

    studies:
    - denovo_helloworld
    - vcf_helloworld

    common_report:
      enabled: True

Let us now re-run the ``generate_common_report.py`` command:

.. code-block:: bash

    generate_common_report.py --studies helloworld

If we now start the GPF development server:

.. code-block:: bash

    wgpf run

and browse the ``helloworld`` dataset, we will see the `Dataset Statistics`
section available.

Getting started with de novo gene sets
######################################

To generate de novo gene sets, you can use the
``generate_denovo_gene_sets.py`` tool. Similar to :ref:`reports_tool` above,
you can use its ``--show-studies`` and ``--studies`` options.

By default, the de novo gene sets are disabled. If you want to enable them
for a specific study or dataset, you need to update its configuration and
add a section that enables the de novo gene sets:

.. code-block:: yaml

    denovo_gene_sets:
      enabled: true

For example, the configuration of the ``helloworld`` dataset should become
similar to:

.. code-block:: yaml

    id: helloworld
    name: Hello World Dataset

    studies:
    - denovo_helloworld
    - vcf_helloworld

    common_report:
      enabled: True

    denovo_gene_sets:
      enabled: true

Then we can generate the de novo gene sets for the ``helloworld`` dataset by
running:

.. code-block:: bash

    generate_denovo_gene_sets.py --studies helloworld

.. include:: getting_started/getting_started_with_annotation.rst

.. include:: getting_started/getting_started_with_preview_columns.rst

.. include:: getting_started/getting_started_with_gene_browser.rst

.. todo::

    WIP

.. include:: getting_started/getting_started_with_enrichment.rst

.. include:: getting_started/getting_started_with_phenotype_data.rst

.. _impala_storage:

Using Apache Impala as storage
##############################

Starting Apache Impala
++++++++++++++++++++++

To start a local instance of Apache Impala, you will need Docker installed.

.. note::

    If you are using Ubuntu, you can follow Docker's installation
    instructions for Ubuntu.

We provide a Docker container with Apache Impala. To run it, you can use the
script::

    run_gpf_impala.sh

This script pulls the container's image from Docker Hub and runs it under
the name "gpf_impala". When the container is ready, the script will print
the following message::

    ...
    ===============================================
    Local GPF Apache Impala container is READY...
    ===============================================

.. note::

    In case you need to stop this container, you can use the command
    ``docker stop gpf_impala``. To start the container again, use
    ``run_gpf_impala.sh``.

.. note::

    Here is a list of some useful Docker commands:

    - ``docker ps`` shows all running Docker containers;
    - ``docker logs -f gpf_impala`` follows the log of the "gpf_impala" container;
    - ``docker start gpf_impala`` starts the "gpf_impala" container;
    - ``docker stop gpf_impala`` stops the "gpf_impala" container;
    - ``docker rm gpf_impala`` removes the "gpf_impala" container (only if stopped).

.. note::

    The following ports are used by the "gpf_impala" container:

    - 8020 - access to HDFS;
    - 9870 - web interface to the HDFS NameNode;
    - 9864 - web interface to the HDFS DataNode;
    - 21050 - access to Impala;
    - 25000 - web interface to the Impala daemon;
    - 25010 - web interface to the Impala state store;
    - 25020 - web interface to the Impala catalog.

    Please make sure these ports are not in use on the host where you are
    going to start the "gpf_impala" container.

Configuring the Apache Impala storage
+++++++++++++++++++++++++++++++++++++

The available storages are configured in ``DAE.conf``. This is an example
section that configures an Apache Impala storage:

.. code:: none

    [storage.test_impala]
    storage_type = "impala"
    dir = "/tmp/test_impala/studies"

    impala.hosts = ["localhost"]
    impala.port = 21050
    impala.db = "gpf_test_db"

    hdfs.host = "localhost"
    hdfs.port = 8020
    hdfs.base_dir = "/user/test_impala/studies"

Importing studies into Impala
+++++++++++++++++++++++++++++

The simple study import tool has an optional argument that specifies the
storage you wish to use. You can pass the ID of the Apache Impala storage
configured in ``DAE.conf`` earlier:

.. code:: none

    --genotype-storage GENOTYPE_STORAGE
                          Id of genotype storage defined in DAE.conf
                          [default: genotype_impala]

For example, to import the IossifovWE2014 study into the "test_impala"
storage, the following command is used:

.. code:: none

    simple_study_import.py IossifovWE2014.ped \
        --id iossifov_2014 \
        --denovo-file IossifovWE2014.tsv \
        --genotype-storage test_impala

Example Usage of GPF Python Interface
#####################################

The simplest way to start using GPF's Python API is to import the
``GPFInstance`` class and instantiate it:

.. code-block:: python3

    from dae.gpf_instance.gpf_instance import GPFInstance

    gpf_instance = GPFInstance()

This ``gpf_instance`` object groups together a number of objects, each
dedicated to managing a different part of the underlying data. It can be
used to interact with the system as a whole.

For example, to list all studies configured in the startup GPF instance,
use:

.. code-block:: python3

    gpf_instance.get_genotype_data_ids()

This will return a list with the IDs of all configured studies:

.. code-block:: python3

    ['comp_vcf', 'comp_denovo', 'comp_all', 'iossifov_2014']

To get a specific study and query it, you can use:

.. code-block:: python3

    st = gpf_instance.get_genotype_data('comp_denovo')
    vs = list(st.query_variants())

.. note::

    The ``query_variants`` method returns a Python iterator.

To get basic information about the variants found by the ``query_variants``
method, you can use:

.. code-block:: python3

    for v in vs:
        for aa in v.alt_alleles:
            print(aa)

    1:865664 G->A f1
    1:865691 C->T f3
    1:865664 G->A f3
    1:865691 C->T f2
    1:865691 C->T f1

The ``query_variants`` interface allows you to specify what kind of variants
you are interested in. For example, if you only need "splice-site" variants,
you can use:

.. code-block:: python3

    st = gpf_instance.get_genotype_data('iossifov_2014')
    vs = st.query_variants(effect_types=['splice-site'])
    vs = list(vs)
    print(len(vs))

    >> 87

Or, if you are interested in "splice-site" variants only in people with the
"prb" role, you can use:

.. code-block:: python3

    vs = st.query_variants(effect_types=['splice-site'], roles='prb')
    vs = list(vs)
    len(vs)

    >> 62
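Since ``query_variants`` returns an iterator, such counts can also be
computed without materializing the variants in a list. The following sketch
reuses the ``iossifov_2014`` study object ``st`` and the ``effect_types``
and ``roles`` parameters shown above; the role values other than ``prb`` are
assumed to match the roles used in the pedigree files (``mom``, ``dad``,
``sib``):

.. code-block:: python3

    # count "splice-site" variants per role;
    # sum(1 for _ in ...) consumes the iterator without storing variants
    for role in ("prb", "sib", "mom", "dad"):
        count = sum(
            1 for _ in st.query_variants(
                effect_types=["splice-site"], roles=role))
        print(role, count)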