Genomic resources repository (GRR)

Introduction

A Genomic Resources Repository (GRR) is a collection of genomic resources, such as reference genomes and gene models. You can use one or more GRRs at the same time. By default (that is, without any configuration), you will use the public GRR built by the Iossifov lab, accessible at https://storage.googleapis.com/iossifovlab-grr/.

Note

To browse the content of the default GRR, follow this link: https://storage.googleapis.com/iossifovlab-grr/index.html

If you want to use additional genomic resources, you can build your own GRR (see Management of GRR below) and add it to the GRRs you use. The set of GRRs that are accessible can be configured in several ways.

To configure the GRRs used by default for your user, create the file ~/.grr_definition.yaml. An example of its contents:

id: "development"
type: group
children:
- id: "grr_local"
  type: "directory"
  directory: "~/my_grr"

- id: "default"
  type: "url"
  url: "https://storage.googleapis.com/iossifovlab-grr/"
  cache_dir: "~/default_grr_cache"

This configures a group of two repositories with the ids ‘grr_local’ and ‘default’. When you search for a resource, the system first looks in the grr_local repository, because it is listed first; if the resource is not found there, it tries the default GRR. The default GRR is a remote GRR at the given URL, and its configuration specifies that resources used from it are cached in the “~/default_grr_cache” directory. Using cached resources is significantly faster, but caching them takes some time the first time they are used, and they occupy substantial disk space.
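The lookup order described above can be sketched in a few lines of Python. This is only an illustration of first-match search across an ordered group, not GPF's implementation; the `find_resource` helper, the `GROUP` mapping, and the resource id `hg38/genomes/example_genome` are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical stand-in for the group definition: repository id -> resource ids it holds.
# Dicts preserve insertion order, which models the order of the `children` list.
GROUP: Dict[str, Set[str]] = {
    "grr_local": {"hg38/scores/score9"},
    "default": {"hg38/scores/score9", "hg38/genomes/example_genome"},
}


def find_resource(group: Dict[str, Set[str]], resource_id: str) -> Optional[str]:
    """Return the id of the first repository in the group that holds the resource."""
    for repo_id, resources in group.items():
        if resource_id in resources:
            return repo_id
    return None


print(find_resource(GROUP, "hg38/scores/score9"))           # grr_local
print(find_resource(GROUP, "hg38/genomes/example_genome"))  # default
print(find_resource(GROUP, "missing/resource"))             # None
```

A resource present in both repositories resolves to grr_local, which is why a local repository listed first effectively shadows the remote one.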

Alternatively, the system will use the GRR configuration file pointed to by the GRR_DEFINITION_FILE environment variable.

Finally, most command line tools that use GRRs have a --grr <file name> argument that overrides the defaults.
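The configuration sources described above can be summarized as a small precedence function. This is a hedged sketch, not the actual GPF code: the `pick_grr_definition` helper is hypothetical, and checking GRR_DEFINITION_FILE before ~/.grr_definition.yaml is an assumption, since the text does not state the relative order of those two.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def pick_grr_definition(
    cli_path: Optional[str],
    env: Mapping[str, str],
    home: Path,
) -> Optional[Path]:
    """Return the definition file to use, or None to fall back to the public GRR."""
    if cli_path:  # value of the --grr argument overrides the defaults
        return Path(cli_path)
    env_path = env.get("GRR_DEFINITION_FILE")
    if env_path:  # assumption: checked before the per-user file
        return Path(env_path)
    user_file = home / ".grr_definition.yaml"
    if user_file.exists():
        return user_file
    return None  # no configuration found: use the default public GRR


print(pick_grr_definition(None, os.environ, Path.home()))
```

Passing the environment and home directory as parameters keeps the function easy to try out without touching your real configuration.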

Configuration

A genomic resources repository can be accessed via different protocols. The currently supported protocols for GRR access are:

File system (directory) protocol.

id: <repo id>
type: directory
directory: <path to the local file system>

HTTP/HTTPS protocol.

id: <repo id>
type: http
url: <http:// or https:// url>

id: <repo id>
type: url
url: <http(s) url>

S3 protocol.

id: <repo id>
type: url
url: <S3 url>
endpoint_url: <endpoint url>

In-memory (embedded) protocol.
```
id: <repo id>
type: embedded
content:
```
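As a quick sanity check before handing a definition to the tools, one could verify that each repository entry carries the keys shown in the snippets above. The sketch below is illustrative only: the `REQUIRED_KEYS` table and the `missing_keys` helper are hypothetical, the key sets are read off this page's examples, and the real loader may accept or require other options (for instance cache_dir or endpoint_url).

```python
# Keys each repository type needs, read off the examples on this page.
# This is an assumption for illustration, not the loader's actual schema.
REQUIRED_KEYS = {
    "group": {"id", "type", "children"},
    "directory": {"id", "type", "directory"},
    "http": {"id", "type", "url"},
    "url": {"id", "type", "url"},
    "embedded": {"id", "type", "content"},
}


def missing_keys(definition: dict) -> set:
    """Return the expected keys that are absent from one repository definition."""
    repo_type = definition.get("type")
    if repo_type not in REQUIRED_KEYS:
        raise ValueError(f"unsupported repository type: {repo_type!r}")
    return REQUIRED_KEYS[repo_type] - set(definition)


print(missing_keys({"id": "grr_local", "type": "directory"}))  # {'directory'}
```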

Browse available resources

grr_browse [--grr grr_definition.yaml]

Management of genomic resources repository (GRR)

Genomic resources and genomic resources repository

A genomic resource is a set of files stored in a directory. To make a given directory a genomic resource, it should contain a genomic_resource.yaml file.

A genomic resources repository is a directory that contains genomic resources. To make a given directory into a repository, it should have a .CONTENTS file.
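The two marker files can be checked with nothing more than the standard library. The following is a minimal sketch of that convention, run against a throwaway directory; the helper names `is_genomic_resource` and `is_repository` are hypothetical and are not part of the GPF tools.

```python
import tempfile
from pathlib import Path


def is_genomic_resource(directory: Path) -> bool:
    """A directory is a genomic resource if it contains genomic_resource.yaml."""
    return (directory / "genomic_resource.yaml").is_file()


def is_repository(directory: Path) -> bool:
    """A directory is a repository if it contains a .CONTENTS file."""
    return (directory / ".CONTENTS").is_file()


# Build a tiny repository layout in a temporary directory and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / ".CONTENTS").touch()
    resource = repo / "hg38" / "scores" / "score9"
    resource.mkdir(parents=True)
    (resource / "genomic_resource.yaml").touch()
    checks = (
        is_repository(repo),
        is_genomic_resource(resource),
        is_genomic_resource(repo),
    )

print(checks)  # (True, True, False)
```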

Create an empty GRR

To create an empty GRR, first create an empty directory. For example, let us create an empty directory named grr_test, enter that directory, and run the grr_manage repo-init command:

mkdir grr_test
cd grr_test
grr_manage repo-init

After that the directory should contain an empty .CONTENTS file:

ls -a

.  ..  .CONTENTS

If we try to list all resources in this repository we should get an empty list:

grr_manage list

Create an empty genomic resource

Let us create our first genomic resource. Create a directory hg38/scores/score9 inside the grr_test repository and create an empty genomic_resource.yaml file inside that directory:

mkdir -p hg38/scores/score9
cd hg38/scores/score9
touch genomic_resource.yaml

This will create an empty genomic resource in our repository with ID hg38/scores/score9.
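The relationship between the directory layout and the resource ID can be illustrated with a short Python sketch: the ID is the path of the resource directory relative to the repository root. The `list_resource_ids` helper is hypothetical and only mimics the convention described here; it does not reproduce what grr_manage list actually does.

```python
import tempfile
from pathlib import Path
from typing import List


def list_resource_ids(repo_root: Path) -> List[str]:
    """Return the repository-relative path of every directory holding genomic_resource.yaml."""
    return sorted(
        config.parent.relative_to(repo_root).as_posix()
        for config in repo_root.rglob("genomic_resource.yaml")
    )


# Recreate the layout from the steps above in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    repo_root = Path(tmp)
    resource_dir = repo_root / "hg38" / "scores" / "score9"
    resource_dir.mkdir(parents=True)
    (resource_dir / "genomic_resource.yaml").touch()
    resource_ids = list_resource_ids(repo_root)

print(resource_ids)  # ['hg38/scores/score9']
```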

If we list the resources in our repository we would get:

grr_manage list

working with repository: .../grr_test
Basic                0        1            0 hg38/scores/score9

When we create or change a resource we need to repair the repository:

grr_manage repo-repair

This command will create a .MANIFEST file for our new resource hg38/scores/score9 and will update the repository .CONTENTS to include the resource.

Add genomic score resources

Add all score resource files (score file and Tabix index) inside the created directory hg38/scores/score9. Let’s say these files are:

score9.tsv.gz
score9.tsv.gz.tbi

Configure the resource hg38/scores/score9. To do so, create a genomic_resource.yaml file that contains the position score configuration:

type: position_score
table:
  filename: score9.tsv.gz
  format: tabix

  # defined by score_type
  chrom:
    name: chrom
  pos_begin:
    name: start
  pos_end:
    name: end

# score values
scores:
- id: score9
  type: float
  desc: "score9"
  index: 3
histograms:
- score: score9
  bins: 100
  y_scale: "log"
  x_scale: "linear"
default_annotation:
  attributes:
  - source: score9
    destination: score9
meta: |
  ## score9
  TODO
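In the configuration above, the histograms and default_annotation sections refer to scores by their id. A small consistency check can catch a mistyped id before running the repair commands. The sketch below is illustrative: the configuration is written as a Python dict so that no YAML parser is needed, the `undefined_score_references` helper is hypothetical, and treating `score` and `source` as references to score ids is an assumption based on this example.

```python
# Relevant parts of the configuration above, as a plain dict (no YAML parser needed).
config = {
    "type": "position_score",
    "scores": [{"id": "score9", "type": "float", "desc": "score9", "index": 3}],
    "histograms": [
        {"score": "score9", "bins": 100, "y_scale": "log", "x_scale": "linear"},
    ],
    "default_annotation": {
        "attributes": [{"source": "score9", "destination": "score9"}],
    },
}


def undefined_score_references(cfg: dict) -> list:
    """Return score names used by histograms or annotation attributes but never defined."""
    defined = {score["id"] for score in cfg.get("scores", [])}
    referenced = [hist["score"] for hist in cfg.get("histograms", [])]
    attributes = cfg.get("default_annotation", {}).get("attributes", [])
    referenced += [attr["source"] for attr in attributes]
    return [name for name in referenced if name not in defined]


print(undefined_score_references(config))  # []
```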

When ready, run grr_manage resource-repair from inside the resource directory:

cd hg38/scores/score9
grr_manage resource-repair

This command is going to calculate histograms for the score (if histograms are configured) and create or update the resource manifest.

Once the resource is ready, we need to regenerate the repository contents:

grr_manage repo-repair

Usage of genomic resources repositories (GRRs)

The GPF system can use genomic resources from different repositories. The default genomic resources repository used by the GPF system is located at https://www.iossifovlab.com/distribution/public/genomic-resources-repository/. You can browse the content of the repository using the grr_manage list command:

grr_manage list -R https://www.iossifovlab.com/distribution/public/genomic-resources-repository

If you have a repository on your local filesystem, you can browse it by providing the path to its root directory:

grr_manage list -R <path to the local repo>

You can also store a genomic resources repository in S3 storage and browse its content with:

grr_manage list -R s3://grr-bucket-test/grr \
    --extra-args "endpoint_url=http://piglet.seqpipe.org:7480"

where grr-bucket-test is the bucket where you store the repository and --extra-args is used to specify the S3 endpoint.

Genomic Resource types

position_score

np_score

allele_score

gene_models

Example genomic_resource.yaml:

type: gene_models
filename: refGeneMito-201309.gz
format: "default"

The available formats are:

default – this is a GPF internal format
refflat
refseq
ccds
knowngene
gtf
ucscgenepred