Working With VCF Files Guide

Import data from the “1000 Genome Project”

The data used in this guide is from the “1000 Genome Project”. For more information, visit IGSR.

Begin by making a new directory, in which you can download and create files:

mkdir 1KGP

Navigate to it:

cd 1KGP

Download the data:

wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/related_samples_vcf/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5_related_samples.20130502.genotypes.vcf.gz
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130606_sample_info/20130606_sample_info.xlsx

The three downloaded files are:

ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz - contains the variants data of the individuals
ALL.chr1.phase3_shapeit2_mvncall_integrated_v5_related_samples.20130502.genotypes.vcf.gz - contains the variants data of the related individuals
20130606_sample_info.xlsx - contains information about all the examined individuals

Creating the pedigree file

Information about the individual’s relationships within their family can be found in the spreadsheet file 20130606_sample_info.xlsx, in the ‘Sample Info’ tab. Let’s create a pedigree file for the family with id “PR05”. For more information on working with pedigree files, refer to the Working With Pedigrees Guide.

First, create the pedigree file:

touch PR05.ped

Then open it in a text editor and add the necessary columns - familyId, personId, momId, dadId, sex, status and role. Fill in the individuals and the values in each column, by referring to the spreadshee’s information. After the editing, the pedigree file should look like this:

familyId    personId        momId   dadId   sex     status  role
PR05        HG00731 0       0       M       unspecified     dad
PR05        HG00732 0       0       F       unspecified     mom
PR05        HG00733 HG00732 HG00731 M       unspecified     prb

Warning

The columns should be separated by tabs, not spaces.

Next, you have to standardize the pedigree file, using the ped2ped.py tool:

ped2ped.py PR05.ped -o PR05_standardized.ped --ped-layout-mode generate

This command will generate a new pedigree file - PR05_standardized.ped with two newly added columns - sampleId and layout, which will be used by the GPF system. Now the pedigree file is ready for importing.

Creating the VCF files

To extract the variant data for the individuals HG00731, HG00732 and HG00733 in a separate files, we will use Bcftools.

Let’s start with individual HG00733, whose data is in the ALL.chr1.phase3_shapeit2_mvncall_integrated_v5_related_samples.20130502.genotypes.vcf.gz file. Using bcftools’ view --samples argument, we can get the data for a specific individual. Adding > HG00733.vcf in the end of the command will redirect the command’s result into a new file, named HG00733.vcf:

bcftools view \
--samples HG00733 \
ALL.chr1.phase3_shapeit2_mvncall_integrated_v5_related_samples.20130502.genotypes.vcf.gz \
> HG00733.vcf

The data for individuals HG00731 and HG00732 is in the second vcf file - ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.

To extract the variants data for the other two individuals in a file named HG00731_HG00732.vcf, run:

bcftools view \
--samples HG00731,HG00732 \
ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
> HG00731_HG00732.vcf

Importing the data into GPF

To import the collected data into the GPF system, it’s recommended to use the impala_batch_import.py tool. To do so, run:

impala_batch_import.py PR05.ped \
--vcf-files HG00731_HG00732.vcf HG00733.vcf \
--gs genotype_impala \
--id 1KGP \
-o parquet

Note

To see a list of it’s commands, use:

impala_batch_import.py --help

Navigate to the newly created parquet directory:

cd parquet

and run this command to initiate the importing:

make -j 10

This command will take some time to complete.

Afer it’s done, run the GPF web server:

wdaemanage.py runserver 0.0.0.0:8000

Now you should be able to see the “1KGP” dataset. To view the imported variants, navigate to the genotype_browser_ui tab and click on the Table Preview button.