Working With VCF Files Guide
============================

Import data from the "1000 Genome Project"
##########################################

The data used in this guide is from the "1000 Genome Project". For more information, visit `IGSR `_.

Begin by making a new directory in which you can download and create files:

.. code-block:: bash

    mkdir 1KGP

Navigate to it:

.. code-block:: bash

    cd 1KGP

Download the data:

.. code-block:: bash

    wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
    wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/related_samples_vcf/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5_related_samples.20130502.genotypes.vcf.gz
    wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130606_sample_info/20130606_sample_info.xlsx

The three downloaded files are:

* ``ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz`` - contains the variant data of the individuals
* ``ALL.chr1.phase3_shapeit2_mvncall_integrated_v5_related_samples.20130502.genotypes.vcf.gz`` - contains the variant data of the related individuals
* ``20130606_sample_info.xlsx`` - contains information about all the examined individuals

Creating the pedigree file
##########################

Information about each individual's relationships within their family can be found in the spreadsheet file ``20130606_sample_info.xlsx``, in the 'Sample Info' tab. Let's create a pedigree file for the family with id "PR05".

For more information on working with pedigree files, refer to the :ref:`Working With Pedigrees Guide `.

First, create the pedigree file:

.. code-block:: bash

    touch PR05.ped

Then open it in a text editor and add the necessary columns - familyId, personId, momId, dadId, sex, status and role. Fill in the individuals and the values in each column by referring to the information in the spreadsheet.
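As an alternative to typing the rows by hand, the file can be written by a short script, which avoids accidentally mixing spaces and tabs. The following is only a sketch: the function name ``write_pedigree`` is hypothetical, and the column names and rows are the ones this guide uses for family "PR05".

.. code-block:: python

    import csv

    # Column names and rows as listed in this guide for family PR05.
    COLUMNS = ["familyId", "personId", "momId", "dadId", "sex", "status", "role"]
    ROWS = [
        ["PR05", "HG00731", "0", "0", "M", "unspecified", "dad"],
        ["PR05", "HG00732", "0", "0", "F", "unspecified", "mom"],
        ["PR05", "HG00733", "HG00732", "HG00731", "M", "unspecified", "prb"],
    ]


    def write_pedigree(path, columns=COLUMNS, rows=ROWS):
        """Write a header line and one line per individual, separated by tabs."""
        with open(path, "w", newline="") as handle:
            # delimiter="\t" guarantees tab-separated columns.
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerow(columns)
            writer.writerows(rows)

Calling ``write_pedigree("PR05.ped")`` from the ``1KGP`` directory would produce the same file as the manual editing step.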
After editing, the pedigree file should look like this:

.. code-block::

    familyId    personId    momId      dadId      sex    status         role
    PR05        HG00731     0          0          M      unspecified    dad
    PR05        HG00732     0          0          F      unspecified    mom
    PR05        HG00733     HG00732    HG00731    M      unspecified    prb

.. warning::
    The columns should be separated by tabs, not spaces.

Next, standardize the pedigree file using the ``ped2ped.py`` tool::

    ped2ped.py PR05.ped -o PR05_standardized.ped --ped-layout-mode generate

This command will generate a new pedigree file, ``PR05_standardized.ped``, with two newly added columns, sampleId and layout, which will be used by the GPF system. Now the pedigree file is ready for importing.

Creating the VCF files
######################

To extract the variant data for the individuals `HG00731`, `HG00732` and `HG00733` into separate files, we will use `Bcftools `_.

Let's start with individual `HG00733`, whose data is in the ``ALL.chr1.phase3_shapeit2_mvncall_integrated_v5_related_samples.20130502.genotypes.vcf.gz`` file. Using the ``--samples`` argument of ``bcftools view``, we can get the data for a specific individual. Adding ``> HG00733.vcf`` at the end of the command redirects the command's output into a new file named ``HG00733.vcf``::

    bcftools view \
        --samples HG00733 \
        ALL.chr1.phase3_shapeit2_mvncall_integrated_v5_related_samples.20130502.genotypes.vcf.gz \
        > HG00733.vcf

The data for individuals `HG00731` and `HG00732` is in the other VCF file - ``ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz``. To extract the variant data for these two individuals into a file named ``HG00731_HG00732.vcf``, run::

    bcftools view \
        --samples HG00731,HG00732 \
        ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
        > HG00731_HG00732.vcf

Importing the data into GPF
###########################

To import the collected data into the GPF system, it's recommended to use the ``impala_batch_import.py`` tool.
To do so, run::

    impala_batch_import.py PR05.ped \
        --vcf-files HG00731_HG00732.vcf HG00733.vcf \
        --gs genotype_impala \
        --id 1KGP \
        -o parquet

.. note::
    To see a list of its arguments, use::

        impala_batch_import.py --help

Navigate to the newly created `parquet` directory::

    cd parquet

and run this command to start the import::

    make -j 10

This command will take some time to complete. After it's done, run the GPF web server::

    wdaemanage.py runserver 0.0.0.0:8000

Now you should be able to see the "1KGP" dataset. To view the imported variants, navigate to the :ref:`genotype_browser_ui` tab and click on the `Table Preview` button.
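If the dataset shows fewer individuals than expected, one thing worth checking is whether every person in the pedigree file has a matching sample column in the extracted VCF files. The sketch below is not part of GPF. The helper names are hypothetical, and it assumes that the sample names in the ``#CHROM`` header line of each VCF file are meant to match the ``personId`` values of the tab-separated pedigree file.

.. code-block:: python

    import csv
    import gzip


    def vcf_samples(path):
        """Return the sample IDs from the #CHROM header line of a VCF file."""
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as handle:
            for line in handle:
                if line.startswith("#CHROM"):
                    # The first nine columns (CHROM through FORMAT) are fixed
                    # by the VCF format; sample IDs follow them.
                    return line.rstrip("\n").split("\t")[9:]
                if not line.startswith("#"):
                    break  # data lines reached without finding a header
        return []


    def pedigree_persons(path):
        """Return the personId column of a tab-separated pedigree file."""
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            return [row["personId"] for row in reader]


    def missing_samples(ped_path, vcf_paths):
        """List pedigree members that appear in none of the VCF files."""
        found = set()
        for vcf_path in vcf_paths:
            found.update(vcf_samples(vcf_path))
        return [p for p in pedigree_persons(ped_path) if p not in found]

For the files created in this guide, ``missing_samples("PR05.ped", ["HG00731_HG00732.vcf", "HG00733.vcf"])`` would be expected to return an empty list.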