GPF Getting Started Guide¶
If you are using Ubuntu, you can run:
sudo apt-get install wget
The GPF system is developed in Python and supports Python 3.6 and up. The recommended way to set up the GPF development environment is to use Anaconda.
Download and install Anaconda¶
Download Anaconda from the Anaconda distribution page (https://www.anaconda.com/distribution/):
wget -c https://repo.anaconda.com/archive/Anaconda3-2019.07-Linux-x86_64.sh
and install it in your local environment, following the installer instructions:
At the end of the installation process, you will be asked if you wish
to allow the installer to initialize Anaconda3 by running conda init.
If you choose to, every terminal you open after that will have the
Anaconda environment activated, and you’ll have access to the
Create an empty conda gpf environment:
conda create -n gpf
To use this environment, you need to activate it using the following command:
conda activate gpf
Install the gpf_wdae conda package into the already activated gpf environment:
conda install -c conda-forge -c bioconda -c iossifovlab gpf_wdae
This command is going to install GPF and all its dependencies.
To start working with GPF, you will need a startup data instance. There are two GPF startup instances that are aligned with different versions of the reference human genome - for HG19 and HG38.
Besides the startup data instance some initial bootstrapping of GPF is also necessary.
To make bootstrapping easier, the script wdae_bootstrap.sh is provided, which prepares GPF for initial start.
The bootstrap script creates a working directory where the data will be stored. You can provide the name of the working directory as a parameter to the bootstrap script. For example, if you want the working directory to be named gpf_test, use the following command:
- For HG19:
  wdae_bootstrap.sh hg19 gpf_test
- For HG38:
  wdae_bootstrap.sh hg38 gpf_test
As a result, a directory named gpf_test should be created with the following structure:
gpf_test
├── annotation.conf
├── DAE.conf
├── defaultConfiguration.conf
├── geneInfo
├── geneInfo.conf
├── genomes
├── genomesDB.conf
├── genomicScores
├── genomicScores.conf
├── genomic-scores-hg19
├── genomic-scores-hg38
├── pheno
├── studies
└── wdae
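As a quick sanity check after bootstrapping, the listing above can be compared against what is actually on disk. The following is a minimal sketch, not part of GPF: the helper name `missing_entries` is hypothetical, and the list of expected entries is copied from the directory tree shown above.

```python
from pathlib import Path

# Entries copied from the directory listing above.
EXPECTED_ENTRIES = [
    "annotation.conf", "DAE.conf", "defaultConfiguration.conf",
    "geneInfo", "geneInfo.conf", "genomes", "genomesDB.conf",
    "genomicScores", "genomicScores.conf",
    "genomic-scores-hg19", "genomic-scores-hg38",
    "pheno", "studies", "wdae",
]


def missing_entries(workdir):
    """Return the expected entries that are absent from workdir (hypothetical helper)."""
    root = Path(workdir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_entries("gpf_test")` returns an empty list when every entry from the listing is present.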
Run GPF web server¶
Enter the gpf_test/ directory and source the setenv.sh script:
cd gpf_test/
source ./setenv.sh
You are now ready to run the GPF development web server:
wdaemanage.py runserver 0.0.0.0:8000
You can browse the development server using the IP of the host you’re running the server on at port 8000. For example, if you are running the GPF development server locally, you can use the following URL:
http://localhost:8000
Import a Demo Dataset¶
In the GPF startup data instance there are some demo studies that are already configured:
multi - contains some VCF variants in a multigenerational family
comp - contains de Novo and VCF variants and a phenotype database
You can download some more publicly available studies, which are prepared to be imported into the GPF startup data instance.
To demonstrate how to import new study data into the GPF data instance, we will reproduce the necessary steps for importing the comp study data.
Start local Apache Impala¶
By default, GPF uses Apache Impala as a backend for storing genomic variants. The GPF import tools import study data into Impala.
To start a local instance of Apache Impala, you need Docker installed (https://www.docker.com/get-started).
Docker can be installed by following the instructions at https://docs.docker.com/install/linux/docker-ce/ubuntu/.
To make using GPF easier, we provide a Docker container with Apache Impala. To run it, you can use the script:
run_gpf_impala.sh
This script pulls the Apache Impala image from Docker Hub, then creates and starts a Docker container named gpf_impala containing all the components needed for running Apache Impala. When the Apache Impala container is ready for use, the script prints a message:
...
===============================================
Local GPF Apache Impala container is READY...
===============================================
If you need to stop this container, you can use the Docker command docker stop gpf_impala. To start the gpf_impala container again, use run_gpf_impala.sh.
Here is a list of some useful Docker commands:
docker ps shows all running docker containers;
docker logs -f gpf_impala shows log from gpf_impala container;
docker stop gpf_impala stops the running gpf_impala container;
docker start gpf_impala starts existing stopped gpf_impala container;
docker rm gpf_impala removes existing and stopped gpf_impala container.
The following ports are used by the gpf_impala container:
8020 - port for accessing HDFS
9870 - port for the web interface to the HDFS NameNode
9864 - port for the web interface to the HDFS DataNode
21050 - port for accessing Impala
25000 - port for the web interface to the Impala daemon
25010 - port for the web interface to the Impala state store
25020 - port for the web interface to the Impala catalog
Please make sure that these ports are not in use on the host where you are starting the gpf_impala container.
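One way to check the ports before starting the container is to try connecting to each of them locally. The sketch below is only an illustration and is not shipped with GPF: the function names are hypothetical, and it assumes that a port is "in use" when something accepts TCP connections on it at 127.0.0.1.

```python
import socket

# Ports listed above for the gpf_impala container.
GPF_IMPALA_PORTS = [8020, 9870, 9864, 21050, 25000, 25010, 25020]


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken.
        return sock.connect_ex((host, port)) != 0


def busy_ports(ports=GPF_IMPALA_PORTS, host="127.0.0.1"):
    """Return the subset of ports that are already in use."""
    return [port for port in ports if not port_is_free(port, host)]
```

If `busy_ports()` returns a non-empty list, free those ports before running the script that starts the container.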
Simple study import¶
Importing study data into a GPF instance usually involves multiple steps. To make initial bootstrapping easier, you can use the simple_study_import.py tool, which combines all the necessary steps.
This tool supports importing variants from two input formats:
VCF format
DAE de Novo list of variants
To see the options supported by this tool, use:
simple_study_import.py --help
which will output a short help message:
usage: simple_study_import.py [-h] [--id <study ID>] [--vcf <VCF filename>]
                              [--denovo <de Novo variants filename>]
                              [-o <output directory>] [--skip-reports]
                              [--genotype-storage <genotype storage id>]
                              <pedigree filename>

simple import of new study data

positional arguments:
  <pedigree filename>   families file in pedigree format

optional arguments:
  -h, --help            show this help message and exit
  --id <study ID>       Unique study ID to use. If not specified the basename
                        of the family pedigree file is used for study ID
  --vcf <VCF filename>  VCF file to import
  --denovo <de Novo variants filename>
                        DAE denovo variants file
  -o <output directory>, --out <output directory>
                        output directory for storing intermediate parquet
                        files. If none specified, "parquet/" directory inside
                        GPF instance study directory is used [default: None]
  --skip-reports        skip running report generation [default: False]
  --genotype-storage <genotype storage id>
                        Id of defined in DAE.conf genotype storage [default:
                        genotype_impala]
Example import of variants¶
Let’s say you have a pedigree file comp.ped describing family information, a VCF file comp.vcf with transmitted variants and a list of de Novo variants comp.tsv. The example data can be downloaded from the following URL:
https://iossifovlab.com/distribution/public/studies/comp-latest.tar.gz
To import this data as a study into the GPF instance:
Download the comp demo study and extract the downloaded archive:
wget -c https://iossifovlab.com/distribution/public/studies/comp-latest.tar.gz
tar zxvf comp-latest.tar.gz
Enter the directory created by extracting the archive.
Run simple_study_import.py to import the VCF variants; this command uses three arguments - the study ID to use, the pedigree file name and the VCF file name:
simple_study_import.py --id comp_vcf \
    --vcf comp.vcf \
    comp.ped
This command creates a study with ID comp_vcf that contains all VCF variants.
Run simple_study_import.py to import the de Novo variants; this command uses three arguments - the study ID to use, the pedigree file name and the de Novo variants file name:
simple_study_import.py --id comp_denovo \
    --denovo comp.tsv \
    comp.ped
This command creates a study with ID comp_denovo that contains all de Novo variants.
Run simple_study_import.py to import all VCF and de Novo variants; this command uses four arguments - the study ID to use, the pedigree file name, the VCF file name and the de Novo variants file name:
simple_study_import.py --id comp_all \
    --denovo comp.tsv \
    --vcf comp.vcf \
    comp.ped
This command creates a study with ID comp_all that contains all VCF and de Novo variants.
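The three invocations above differ only in which optional arguments are passed. When scripting imports, that pattern can be captured in a small helper. This is a sketch under stated assumptions: `build_import_command` is a hypothetical name, and it uses only the options shown in the help message above (--id, --vcf, --denovo and the positional pedigree file).

```python
import shlex


def build_import_command(pedigree, study_id=None, vcf=None, denovo=None):
    """Assemble a simple_study_import.py command line as a list of arguments."""
    cmd = ["simple_study_import.py"]
    if study_id is not None:
        cmd += ["--id", study_id]
    if denovo is not None:
        cmd += ["--denovo", denovo]
    if vcf is not None:
        cmd += ["--vcf", vcf]
    cmd.append(pedigree)  # the pedigree file is the only positional argument
    return cmd


cmd = build_import_command("comp.ped", study_id="comp_all",
                           vcf="comp.vcf", denovo="comp.tsv")
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside an activated GPF environment.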
The expected format for the de Novo variants file is a tab-separated file that contains the following columns:
familyId - family Id matching a family from the pedigree file
location - location of the variant
variant - description of the variant
bestState - best state of the variant in the family
familyId   location   variant     bestState
f1         1:865664   sub(G->A)   2 2 1 2/0 0 1 0
f1         1:865691   sub(C->T)   2 2 1 2/0 0 1 0
f2         1:865664   sub(G->A)   2 2 1 2/0 0 1 0
f2         1:865691   sub(C->T)   2 2 1 2/0 0 1 0
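Before importing, it can help to confirm that a de Novo file really has the four columns described above. The sketch below reads such a file with the standard csv module. It is not the GPF loader: `read_denovo_variants` is a hypothetical helper, and it assumes that bestState is a single tab-separated field whose value contains spaces, as in the example rows.

```python
import csv
import io

REQUIRED_COLUMNS = ["familyId", "location", "variant", "bestState"]


def read_denovo_variants(handle):
    """Read de Novo variants from a tab-separated file object into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [col for col in REQUIRED_COLUMNS if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return list(reader)


# Two rows taken from the example above, joined with tabs.
sample = (
    "familyId\tlocation\tvariant\tbestState\n"
    "f1\t1:865664\tsub(G->A)\t2 2 1 2/0 0 1 0\n"
    "f2\t1:865691\tsub(C->T)\t2 2 1 2/0 0 1 0\n"
)
rows = read_denovo_variants(io.StringIO(sample))
print(rows[0]["familyId"], rows[0]["location"], rows[0]["variant"])
```

For a real file, pass `open("comp.tsv", newline="")` instead of the in-memory sample.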
Example import of de Novo variants¶
As an example of importing study with de Novo variants you can use data from:
wget -c https://iossifovlab.com/distribution/public/studies/iossifov_2014-latest.tar.gz
Extract the archive:
tar zxf iossifov_2014-latest.tar.gz
Enter the extracted directory and run the import:
cd iossifov_2014/
simple_study_import.py --id iossifov_2014 \
    --denovo IossifovWE2014.tsv \
    IossifovWE2014.ped
To see the imported variants, restart the GPF development web server and find the iossifov_2014 study.
Example Usage of GPF Python Interface¶
The simplest way to start using GPF’s Python API is to import the GPFInstance class and instantiate it:
from dae.gpf_instance.gpf_instance import GPFInstance
gpf_instance = GPFInstance()
The gpf_instance object creates and stores different types of facades. One of these facades is VariantsDb, which is responsible for creating and storing studies and datasets.
vdb = gpf_instance.variants_db
The vdb factory object allows you to get all studies and datasets in the configured GPF instance. For example, to list all studies configured in the startup GPF instance, use:
This should return a list of all studies’ IDs:
['multi', 'comp_vcf', 'comp_denovo', 'comp_all', 'iossifov_2014']
To get a specific study and query it, you can use:
st = vdb.get_study("comp_denovo")
vs = list(st.query_variants())
The query_variants method returns a Python iterator.
To get the basic information about the variants found by the query_variants method, you can use:
for v in vs:
    for aa in v.alt_alleles:
        print(aa)

1:865664 G->A f1
1:865691 C->T f3
1:865664 G->A f3
1:865691 C->T f2
1:865691 C->T f1
The query_variants interface allows you to specify what kind of variants you are interested in. For example, if you only need ‘splice-site’ variants, you can use:
st = vdb.get_study("iossifov_2014")
vs = st.query_variants(effect_types=['splice-site'])
vs = list(vs)
print(len(vs))
>> 85
Or, if you are interested in ‘splice-site’ variants only in people with role ‘prb’, you can use:
vs = st.query_variants(effect_types=['splice-site'], roles='prb')
vs = list(vs)
len(vs)
>> 60
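The two queries above can be generalized into a small counting helper. The sketch below relies only on the `query_variants(effect_types=..., roles=...)` call shown in this section; the helper name `count_variants_by_effect` is hypothetical, and the function accepts any object with a compatible `query_variants` method, so it does not import GPF itself.

```python
def count_variants_by_effect(study, effect_types, roles=None):
    """Count the variants returned by study.query_variants for each effect type."""
    counts = {}
    for effect in effect_types:
        kwargs = {"effect_types": [effect]}
        if roles is not None:
            kwargs["roles"] = roles
        # query_variants returns an iterator, so consume it to count the results.
        counts[effect] = sum(1 for _ in study.query_variants(**kwargs))
    return counts


# Usage inside a configured GPF environment (not executed here):
#   st = vdb.get_study("iossifov_2014")
#   print(count_variants_by_effect(st, ["splice-site"], roles="prb"))
```

Counting through the iterator avoids building a full list of variants for each effect type.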
Getting Started with Enrichment Tool¶
For studies that include de Novo variants, you can enable the Enrichment Tool. As an example, let us enable the Enrichment Tool for the already imported iossifov_2014 study.
Go to the directory where the configuration file of the iossifov_2014 study is located:
Edit the study configuration file iossifov_2014.conf to add the line:
enrichmentTool = yes
After editing, the configuration file should look like this:
[study]
id = iossifov_2014
genotype_storage = genotype_impala
enrichmentTool = yes
Restart the GPF development web server:
wdaemanage.py runserver 0.0.0.0:8000
Now, if you locate the iossifov_2014 study in the browser, you should be able to use the tool from the Enrichment Tool tab of the study.
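The same edit can be scripted. The following sketch assumes that the study configuration is a plain INI-style file that Python's configparser can read, which matches the example above but is not confirmed for every GPF configuration. The helper name `enable_study_option` is hypothetical, and rewriting a file this way drops any comments it contained.

```python
import configparser
import io


def enable_study_option(config_text, option, section="study"):
    """Return config_text with `option = yes` set in the given section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names such as enrichmentTool case-sensitive
    parser.read_string(config_text)
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, option, "yes")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


original = "[study]\nid = iossifov_2014\ngenotype_storage = genotype_impala\n"
updated = enable_study_option(original, "enrichmentTool")
print(updated)
```

The same helper applies to any other yes/no study option that is set in the [study] section.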
Getting Started with Preview Columns¶
For each study, we can specify the columns that are shown in the preview of variants and in the downloaded variants.
As an example, we are going to redefine the Frequency column in the comp_vcf study imported in the previous example.
Edit the configuration file comp_vcf.conf and add the following lines:
[genotypeBrowser]
genotype.freq.name = Frequency
genotype.freq.slots = exome_gnomad_af_percent:exome gnomad:E %%.3f,
    genome_gnomad_af_percent:genome gnomad:G %%.3f,
    af_allele_freq:study freq:S %%.3f
This overrides the definition of the existing Frequency preview column so that it includes the study allele frequency in addition to the gnomAD frequencies.
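Each slot in the configuration above has three colon-separated parts, and the doubled percent sign is an escape. The sketch below shows how Python's configparser would read such a value. The part names `source`, `label` and `fmt` are illustrative, and it is an assumption, not a documented fact, that GPF applies the last part as a printf-style format.

```python
import configparser

CONFIG = """
[genotypeBrowser]
genotype.freq.name = Frequency
genotype.freq.slots = exome_gnomad_af_percent:exome gnomad:E %%.3f, genome_gnomad_af_percent:genome gnomad:G %%.3f, af_allele_freq:study freq:S %%.3f
"""


def parse_slots(raw):
    """Split a slots value into (source, label, format) triples."""
    slots = []
    for item in raw.split(","):
        source, label, fmt = (part.strip() for part in item.strip().split(":", 2))
        slots.append((source, label, fmt))
    return slots


parser = configparser.ConfigParser()  # default interpolation turns %% into %
parser.read_string(CONFIG)
slots = parse_slots(parser["genotypeBrowser"]["genotype.freq.slots"])
for source, label, fmt in slots:
    # Apply the format to an arbitrary sample value to show the effect.
    print(source, "|", label, "|", fmt % 0.12345)
```

Writing a single percent sign in the file would make configparser's default interpolation raise an error, which is why the escaped form is used.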
Getting Started with Phenotype Data¶
Simple Pheno Import Tool¶
The GPF simple pheno import tool prepares phenotype data to be used by the GPF system.
As an example, we are going to show how to import simulated demo phenotype data into our demo GPF instance. We are going to use the simulated phenotype data available at:
https://iossifovlab.com/distribution/public/pheno/comp_pheno_data-latest.tar.gz
Download the archive and extract it outside the GPF instance data directory:
wget -c https://iossifovlab.com/distribution/public/pheno/comp_pheno_data-latest.tar.gz
tar zxvf comp_pheno_data-latest.tar.gz
This will create a directory containing the extracted phenotype data.
The files available in that directory are:
comp_pheno.ped - the pedigree file for all families included in the database;
instruments - directory containing all instruments;
instruments/i1.csv - all measurements for instrument i1;
comp_pheno_data_dictionary.tsv - descriptions for all measurements;
comp_pheno_regressions.conf - regression configuration file.
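To see which instruments an extracted archive contains, the instruments directory can be listed. This is a minimal sketch: `list_instruments` is a hypothetical helper, and it assumes that each instrument is stored as one CSV file named after the instrument, as instruments/i1.csv suggests.

```python
from pathlib import Path


def list_instruments(instruments_dir):
    """Return instrument names, taken from the CSV file names in the directory."""
    return sorted(path.stem for path in Path(instruments_dir).glob("*.csv"))
```

For the demo data, `list_instruments("instruments")` would be called from inside the extracted directory.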
The easiest way to import this phenotype database into the GPF instance is to use the simple_pheno_import.py tool. This tool converts the phenotype instruments and measures into a GPF phenotype database and generates the data and figures needed for the GPF Phenotype Browser. It imports the phenotype database directly into the DAE data directory specified in your environment.
simple_pheno_import.py -p comp_pheno.ped \
    -i instruments/ -d comp_pheno_data_dictionary.tsv -o comp_pheno \
    --regression comp_pheno_regressions.conf
Options used in this command are as follows:
-p option specifies the pedigree file;
-d option specifies the name of the data dictionary file for the phenotype database;
-i option specifies the directory where the instruments are located;
-o option specifies the name of the output phenotype database that will be used in the phenotype browser;
--regression option specifies the path to a pheno regression config, which describes a list of measures to make regressions against.
You can use the -h option to see all options supported by the simple_pheno_import.py tool.
Configure Phenotype Database¶
Phenotype databases have a short configuration file (whose filename usually ends with the extension .conf), which points the system to their files and specifies some other properties. When importing a phenotype database through the simple_pheno_import.py tool, a configuration file is automatically generated. You may inspect the directory of the imported phenotype database to see the configuration file generated by the import tool:
[phenoDB]
name = comp_pheno
dbfile = %(wd)s/comp_pheno.db
browser_dbfile = %(wd)s/browser/comp_pheno.db
browser_images_dir = %(wd)s/browser/comp_pheno
browser_images_url = /static/comp_pheno
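The `%(wd)s` placeholders in this file use the same syntax as Python's configparser interpolation. The sketch below shows how such values resolve once a working directory is supplied. It assumes that `wd` stands for the directory holding the configuration file; the path used here is made up for illustration, and `load_pheno_config` is a hypothetical helper, not a GPF function.

```python
import configparser

PHENO_CONF = """
[phenoDB]
name = comp_pheno
dbfile = %(wd)s/comp_pheno.db
browser_dbfile = %(wd)s/browser/comp_pheno.db
browser_images_dir = %(wd)s/browser/comp_pheno
browser_images_url = /static/comp_pheno
"""


def load_pheno_config(text, work_dir):
    """Parse a phenoDB config, resolving %(wd)s against work_dir."""
    parser = configparser.ConfigParser(defaults={"wd": work_dir})
    parser.read_string(text)
    return dict(parser["phenoDB"])


# Hypothetical location of the phenotype database directory.
conf = load_pheno_config(PHENO_CONF, "/data/gpf_test/pheno/comp_pheno")
print(conf["dbfile"])
print(conf["browser_images_dir"])
```

Because the paths are expressed relative to `wd`, the phenotype database directory can be moved without editing the file.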
Configure Phenotype Browser¶
To demonstrate how a study is configured with a phenotype database, we will be working with the manually imported comp_pheno phenotype database.
Phenotype databases can be attached to one or more studies and datasets. If you want to attach the comp_pheno phenotype database to the comp_all study, you need to specify it in the study configuration file:
[study]
id = comp_all
prefix = data/
phenoDB = comp_pheno
and to enable the phenotype browser you must add:
phenotypeBrowser = yes
If you restart the GPF web interface after this change, you should be able to see the Phenotype Browser tab in the comp_all study.
Configure Phenotype Filters in Genotype Browser¶
A study or a dataset can have Phenotype Filters configured for its Genotype Browser when it has a phenoDB attached to it. The configuration looks like this:
[genotypeBrowser]
selectedPhenoFiltersValues = sampleContinuousFilter
phenoFilters.sampleContinuousFilter.name = sampleFilterName
phenoFilters.sampleContinuousFilter.measureType = continuous
phenoFilters.sampleContinuousFilter.filter = multi:prb
selectedPhenoFiltersValues is a comma-separated list of ids of the defined Phenotype Filters. Each phenotype filter is expected to have a phenoFilters.<pheno_filter_id> configuration.
The required configuration options for each pheno filter are:
phenoFilters.<pheno_filter_id>.name - name to use when showing the pheno filter in the Genotype Browser Table Preview.
phenoFilters.<pheno_filter_id>.measureType - the measure type of the pheno filter, for example continuous.
phenoFilters.<pheno_filter_id>.filter - the definition of the filter.
The definition of a pheno filter has the format <filter_type>:<role>(:<measure_id>). Each of these parts is described below:
filter_type - either single or multiple. A single filter is used to filter on only one specified measure (specified by <measure_id>).
A multiple pheno filter allows the user to choose which measure to use for filtering. The available measures depend on the measureType of the filter.
role - which persons’ phenotype data to use for this filter. For example, prb uses the probands’ values for filtering. When the role matches more than one person, the first is chosen.
measure_id - id of the measure to be used for a single filter. Not used when a multiple filter is being defined.
After adding the configuration for Phenotype Filters and reloading the Genotype Browser the Advanced option of the Family Filters should be present.
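The filter definition format described above is simple enough to split by hand when validating a configuration. The following sketch is not GPF's own parser: the names `PhenoFilter` and `parse_pheno_filter` are hypothetical, and the second example string is invented to illustrate a definition that includes a measure id.

```python
from collections import namedtuple

PhenoFilter = namedtuple("PhenoFilter", ["filter_type", "role", "measure_id"])


def parse_pheno_filter(definition):
    """Split '<filter_type>:<role>(:<measure_id>)' into its parts."""
    parts = definition.strip().split(":", 2)
    if len(parts) < 2:
        raise ValueError("expected <filter_type>:<role>(:<measure_id>)")
    filter_type, role = parts[0], parts[1]
    # The measure id is optional; it is only given for filters on one fixed measure.
    measure_id = parts[2] if len(parts) == 3 else None
    return PhenoFilter(filter_type, role, measure_id)


multi_filter = parse_pheno_filter("multi:prb")           # value from the example config
single_filter = parse_pheno_filter("single:prb:i1.age")  # made-up example with a measure
print(multi_filter)
print(single_filter)
```

Returning a named tuple keeps the three parts addressable by name while leaving `measure_id` as `None` when it is omitted.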
Configure Phenotype Columns in Genotype Browser¶
Phenotype Columns are values from the Phenotype Database for each variant displayed in Genotype Browser Preview table. They can be added when a phenoDB is attached to a study or a dataset.
To add a Phenotype Column you need to define it in the study or dataset config:
[genotypeBrowser]
selectedPhenoColumnValues = pheno
pheno.pheno.name = Measures
pheno.pheno.slots = prb:i1.age:Age, prb:i1.iq:Iq
The selectedPhenoColumnValues property is a comma-separated list of ids for each Pheno Column to display. Each Pheno Column has to have a pheno.<measure_id> configuration with the following properties:
pheno.<measure_id>.name - the display name of the pheno column group used in the Genotype Browser Preview table.
pheno.<measure_id>.slots - comma-separated definitions for all pheno columns.
The Phenotype Column definition has the following structure:
<role>:<measure_id>:<name>
<role> - role of the person whose pheno values will be displayed. If the role matches two or more people, all of their values will be shown, separated with a comma.
<measure_id> - id of the measure whose values will be displayed.
<name> - the name of the sub-column to be displayed.
For the Phenotype Columns to appear in the Genotype Browser Preview table or the Genotype Browser Download file, they have to be present in the previewColumns or the downloadColumns in the Genotype Browser configuration.
previewColumns = family,variant,genotype,effect,weights,mpc_cadd,freq,pheno
In the above comp_all configuration, the last column, pheno, is a Phenotype Column.
Enabling the Phenotype tool¶
To enable the Phenotype tool for a study, you must edit its configuration file and set the appropriate property, as with the Phenotype browser. Open the study configuration file:
[study]
id = comp
prefix = data/
phenoDB = comp_pheno
phenotypeBrowser = yes
You can enable the Phenotype tool using the following property:
phenotypeTool = yes
Restart the GPF development web server and select the comp_all study. You should see a Phenotype Tool tab. Once you have selected it, you can select a phenotype measure of your choice. To get the tool to acknowledge the variants in the comp_all study, select the All option of the Present in Parent field. Since the effect types of the variants in the comp study are only Missense and Synonymous, you may wish to de-select the LGDs option under the Effect Types field. There is also the option to normalize the results by one or two measures configured as regressors, such as age.
Click on the Report button to produce the results.
Dataset Statistics and de Novo Gene Sets¶
Generate Variant Reports (optional)¶
To generate the families and de Novo variants reports, you should use generate_common_report.py. This tool supports the option --show-studies to list all studies and datasets configured in the GPF instance.
To generate the families and variants reports for a given configured study or dataset, you can use the --studies option. For example, to generate the families and variants reports for the comp study, you should use:
generate_common_report.py --studies comp
Generate Denovo Gene Sets (optional)¶
To generate de Novo gene sets, you should use the generate_denovo_gene_sets.py tool. This tool supports the option --show-studies to list all studies and datasets configured in the GPF instance.
To generate the de Novo gene sets for a given configured study or dataset, you can use the --studies option. For example, to generate the de Novo gene sets for the comp study, you should use:
generate_denovo_gene_sets.py --studies comp
Getting Started with Annotation Pipeline¶
Get Genomic Scores Database (optional)¶
To annotate variants with genomic scores, you will need a genomic scores database or at least the genomic scores you plan to use. You can find some genomic scores for HG19 at:
https://iossifovlab.com/distribution/public/genomic-scores-hg19/
Download and untar the genomic scores you want to use into a separate directory. For example, if you want to use gnomAD_exome frequencies and MPC genomic scores:
cd gpf_test/genomic-scores-hg19
wget -c https://iossifovlab.com/distribution/public/genomic-scores-hg19/gnomAD_exome-hg19-latest.tar
wget -c https://iossifovlab.com/distribution/public/genomic-scores-hg19/MPC-hg19-latest.tar
tar xvf gnomAD_exome-hg19-latest.tar
tar xvf MPC-hg19-latest.tar
This will create two subdirectories inside your genomic-scores-hg19 directory that contain the gnomAD_exome frequencies and MPC genomic scores, prepared to be used by the GPF annotation pipeline and the GPF import tools.
If you want to use some genomic scores for annotation of the variants you are importing, you must make appropriate changes in the GPF annotation pipeline configuration file, annotation.conf.
This configuration file contains some examples of how to configure annotation with MPC and CADD genomic scores and with gnomAD exome and gnomAD genome frequencies. Uncomment the appropriate example and adjust it according to your needs.
The genomic scores folders inside the directory generated by wdae_bootstrap.sh, genomic-scores-hg19 and genomic-scores-hg38, are the default locations to which the annotation pipeline will resolve the interpolation strings %(scores_hg19_dir)s and %(scores_hg38_dir)s, respectively. These interpolation strings are used when specifying the location of the genomic score source file to use.
You can put your genomic scores inside these directories, or you can specify a different scores_hg19_dir path at the top of the annotation configuration file. Beware that this will likely break genomic scores which were specified using the old path.
For example, if you want to annotate variants with gnomAD_exome frequencies and MPC genomic scores, the annotation.conf file should be edited in the following way:
[DEFAULT]

################################
[VariantEffectAnnotation]

annotator=effect_annotator.VariantEffectAnnotator

columns.effect_type=effect_type
columns.effect_genes=effect_genes
columns.effect_gene_genes=effect_gene_genes
columns.effect_gene_types=effect_gene_types
columns.effect_details=effect_details
columns.effect_details_transcript_ids=effect_details_transcript_ids
columns.effect_details_details=effect_details_details

##############################
[MPC Genomic Score]

annotator=score_annotator.NPScoreAnnotator

options.scores_file=%(scores_hg19_dir)s/MPC/fordist_constraint_official_mpc_values_v2.txt.gz

columns.MPC=mpc

######################################
[gnomAD Exome Frequencies]

annotator=frequency_annotator.FrequencyAnnotator

options.scores_file=%(scores_hg19_dir)s/gnomAD_exome/gnomad.exomes.r2.1.sites.tsv.gz

columns.AF=exome_gnomad_af
columns.AF_percent=exome_gnomad_af_percent
columns.AC=exome_gnomad_ac
columns.AN=exome_gnomad_an
columns.controls_AC=exome_gnomad_controls_ac
columns.controls_AN=exome_gnomad_controls_an
columns.controls_AF=exome_gnomad_controls_af
columns.non_neuro_AC=exome_gnomad_non_neuro_ac
columns.non_neuro_AN=exome_gnomad_non_neuro_an
columns.non_neuro_AF=exome_gnomad_non_neuro_af
columns.controls_AF_percent=exome_gnomad_controls_af_percent
columns.non_neuro_AF_percent=exome_gnomad_non_neuro_af_percent
The VariantEffectAnnotation section defines the variant effect annotation and should not be changed. The next section, MPC Genomic Score, defines annotation with the MPC genomic score. The last section, gnomAD Exome Frequencies, specifies which of the gnomAD exome frequencies are used in the annotation.
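To review what an annotation configuration will do before rerunning an import, its sections can be summarized with Python's configparser. This is only a sketch: `summarize_annotators` is a hypothetical helper, the scores_hg19_dir value is a made-up path set at the top of the file, and only the MPC section from the example above is reproduced. Whether GPF itself reads the file with configparser is an assumption.

```python
import configparser

ANNOTATION_CONF = """
[DEFAULT]
# Hypothetical scores location, overriding the default directory.
scores_hg19_dir = /data/gpf_test/genomic-scores-hg19

[MPC Genomic Score]
annotator = score_annotator.NPScoreAnnotator
options.scores_file = %(scores_hg19_dir)s/MPC/fordist_constraint_official_mpc_values_v2.txt.gz
columns.MPC = mpc
"""


def summarize_annotators(text):
    """Map each section to its annotator, resolved scores file and column mapping."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep column names such as MPC case-sensitive
    parser.read_string(text)
    summary = {}
    for name in parser.sections():
        section = parser[name]
        columns = {
            key[len("columns."):]: value
            for key, value in section.items()
            if key.startswith("columns.")
        }
        summary[name] = {
            "annotator": section["annotator"],
            "scores_file": section.get("options.scores_file"),
            "columns": columns,
        }
    return summary


summary = summarize_annotators(ANNOTATION_CONF)
for name, info in summary.items():
    print(name, "->", info["annotator"], info["scores_file"], info["columns"])
```

The printed scores file path shows where the pipeline would look for the score source, which makes a wrong scores_hg19_dir easy to spot.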
When you are done with the changes in the annotation configuration file, you need to rerun the import process. Let’s do it for the iossifov_2014 study:
cd iossifov_2014/
simple_study_import.py --id iossifov_2014 \
    --denovo IossifovWE2014.tsv \
    IossifovWE2014.ped
After the import is finished, restart the GPF development instance:
wdaemanage.py runserver 0.0.0.0:8000