Genomic resources repository (GRR)
Introduction
A Genomic Resources Repository (GRR) is a collection of genomic resources, such as reference genomes and gene models. You can use one or more GRRs at the same time. By default (that is, without any configuration), you will use the public GRR built by the Iossifov lab and accessible at https://storage.googleapis.com/iossifovlab-grr/.
Note
To browse the content of the default GRR, follow this link: https://storage.googleapis.com/iossifovlab-grr/index.html
If you want to use additional genomic resources, you can build your own GRR (see Management of GRR below) and add it to the GRRs you use. The set of accessible GRRs can be configured in several ways.
To configure the GRRs used by default for your user, create the file ~/.grr_definition.yaml. An example of its contents:
id: "development"
type: group
children:
- id: "grr_local"
  type: "directory"
  directory: "~/my_grr"
- id: "default"
  type: "url"
  url: "https://storage.googleapis.com/iossifovlab-grr/"
  cache_dir: "~/default_grr_cache"
This configures a group of two repositories with the ids ‘grr_local’ and ‘default’. When you search for a resource, the system first looks in the grr_local repository, because it is listed first; if the resource is not found there, it tries the default GRR. The default GRR is a remote GRR at the given URL, and its configuration specifies that resources used from it will be cached in the “~/default_grr_cache” directory. Using cached resources is significantly faster, but caching them the first time they are used takes some time, and the cache occupies substantial disk space.
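The lookup order described above can be illustrated with a minimal sketch. This is not the GPF implementation: the `Repo` class, the `find_repository` helper, and the resource id `hg38/genomes/example` are hypothetical and exist only to show the "first listed repository wins" behavior.

```python
class Repo:
    """Hypothetical stand-in for one repository: an id plus the resource ids it holds."""

    def __init__(self, repo_id, resource_ids):
        self.repo_id = repo_id
        self.resource_ids = set(resource_ids)


def find_repository(children, resource_id):
    """Return the id of the first child repository that holds resource_id, or None."""
    for repo in children:  # children are tried in the order they are listed
        if resource_id in repo.resource_ids:
            return repo.repo_id
    return None


# Mirrors the example group: grr_local is listed before default.
group = [
    Repo("grr_local", {"hg38/scores/score9"}),
    Repo("default", {"hg38/scores/score9", "hg38/genomes/example"}),
]

print(find_repository(group, "hg38/scores/score9"))
print(find_repository(group, "hg38/genomes/example"))
```

A resource present in both repositories is served from grr_local; the default GRR is consulted only when the local repository does not have it.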
Alternatively, the system will use the GRR configuration file pointed to by the GRR_DEFINITION_FILE environment variable.
Finally, most command line tools that use GRRs accept a --grr <file name> argument that overrides the defaults.
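The configuration sources listed above can be summarized in a small sketch. It is an illustration under stated assumptions, not the actual GPF code: `pick_grr_definition` is a hypothetical helper, and the relative priority of the GRR_DEFINITION_FILE variable over ~/.grr_definition.yaml is an assumption, since the text only states that the command line argument overrides the defaults.

```python
import os
from pathlib import Path

DEFAULT_GRR_URL = "https://storage.googleapis.com/iossifovlab-grr/"


def pick_grr_definition(cli_grr=None, env=None, home=None):
    """Return ("file", path) or ("url", default_url) for the GRR definition to use."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # 1. An explicit --grr argument overrides everything else.
    if cli_grr:
        return ("file", str(cli_grr))

    # 2. Assumption: the environment variable is checked before the per-user file.
    env_file = env.get("GRR_DEFINITION_FILE")
    if env_file:
        return ("file", env_file)

    # 3. The per-user definition file, if it exists.
    user_file = home / ".grr_definition.yaml"
    if user_file.exists():
        return ("file", str(user_file))

    # 4. No configuration at all: fall back to the public default GRR.
    return ("url", DEFAULT_GRR_URL)
```

Passing `env` and `home` explicitly keeps the function easy to test without touching the real environment.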
Configuration
A genomic resources repository can be accessed via different protocols. The currently supported protocols for GRR access are:
File system (directory) protocol.
id: <repo id>
type: directory
directory: <path to the local file system>
HTTP/HTTPS protocol.
id: <repo id>
type: http
url: <http:// or https:// url>

id: <repo id>
type: url
url: <http(s) url>
S3 protocol.
id: <repo id>
type: url
url: <S3 url>
endpoint_url: <endpoint url>
In-memory (embedded) protocol.
id: <repo id>
type: embedded
content:
Browse available resources
grr_browse [--grr grr_definition.yaml]
Management of genomic resources repository (GRR)
Genomic resources and genomic resources repository
A genomic resource is a set of files stored in a directory. To make a given directory a genomic resource, it should contain a genomic_resource.yaml file.
A genomic resources repository is a directory that contains genomic resources. To make a given directory a repository, it should contain a .CONTENTS file.
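The two marker files can be checked with a few lines of code. The sketch below is only an illustration of the rule stated above; `is_genomic_resource` and `is_repository` are hypothetical helpers, not functions provided by the GRR tools.

```python
from pathlib import Path


def is_genomic_resource(directory):
    """A directory is a genomic resource if it contains genomic_resource.yaml."""
    return (Path(directory) / "genomic_resource.yaml").is_file()


def is_repository(directory):
    """A directory is a repository if it contains a .CONTENTS file."""
    return (Path(directory) / ".CONTENTS").is_file()
```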
Create an empty GRR
To create an empty GRR, first create an empty directory. For example, let us create an empty directory named grr_test, enter that directory, and run the grr_manage repo-init command:
mkdir grr_test
cd grr_test
grr_manage repo-init
After that, the directory should contain an empty .CONTENTS file:
ls -a
. .. .CONTENTS
If we try to list all resources in this repository we should get an empty list:
grr_manage list
Create an empty genomic resource
Let us create our first genomic resource. Create a directory hg38/scores/score9 inside the grr_test repository and create an empty genomic_resource.yaml file inside that directory:
mkdir -p hg38/scores/score9
cd hg38/scores/score9
touch genomic_resource.yaml
This will create an empty genomic resource in our repository with ID hg38/scores/score9.
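As the example suggests, the resource ID is the path of the resource directory relative to the repository root. A minimal sketch of that relationship follows; `list_resource_ids` is a hypothetical helper that scans the file system directly and is not how grr_manage list is implemented.

```python
from pathlib import Path


def list_resource_ids(repo_root):
    """Return sorted resource IDs: paths of directories holding genomic_resource.yaml,
    expressed relative to the repository root with forward slashes."""
    root = Path(repo_root)
    return sorted(
        config.parent.relative_to(root).as_posix()
        for config in root.rglob("genomic_resource.yaml")
    )
```

Run against the grr_test directory from the steps above, this would be expected to report the single ID hg38/scores/score9.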
If we list the resources in our repository, we get:
grr_manage list
working with repository: .../grr_test
Basic 0 1 0 hg38/scores/score9
When we create or change a resource we need to repair the repository:
grr_manage repo-repair
This command will create a .MANIFEST file for our new resource hg38/scores/score9 and will update the repository .CONTENTS to include the resource.
Add genomic score resources
Add all score resource files (the score file and its Tabix index) inside the created directory hg38/scores/score9. Let’s say these files are:
score9.tsv.gz
score9.tsv.gz.tbi
Configure the resource hg38/scores/score9. To this end, create a genomic_resource.yaml file that contains the position score configuration:
type: position_score
table:
  filename: score9.tsv.gz
  format: tabix
  # defined by score_type
  chrom:
    name: chrom
  pos_begin:
    name: start
  pos_end:
    name: end
# score values
scores:
- id: score9
  type: float
  desc: "score9"
  index: 3
histograms:
- score: score9
  bins: 100
  y_scale: "log"
  x_scale: "linear"
default_annotation:
  attributes:
  - source: score9
    destination: score9
meta: |
  ## score9
  TODO
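To make the column settings concrete, here is a minimal sketch of how a file laid out this way could be read. It rests on several assumptions that the configuration does not state: the file has a tab-separated header line with columns named chrom, start and end (possibly prefixed with #), and index: 3 is a zero-based column index. `read_scores` is a hypothetical helper, not part of the GRR tools, and it reads the compressed file sequentially rather than through the Tabix index.

```python
import csv
import gzip


def read_scores(path, score_index=3):
    """Yield (chrom, pos_begin, pos_end, score) tuples from a gzipped TSV score file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        # Assumption: the first line is a header, optionally starting with '#'.
        header = [name.lstrip("#") for name in next(reader)]
        chrom_i = header.index("chrom")
        begin_i = header.index("start")
        end_i = header.index("end")
        for row in reader:
            yield (
                row[chrom_i],
                int(row[begin_i]),
                int(row[end_i]),
                float(row[score_index]),  # assumption: zero-based column index
            )
```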
When ready, run grr_manage resource-repair from inside the resource directory:
cd hg38/scores/score9
grr_manage resource-repair
This command is going to calculate histograms for the score (if histograms are configured) and create or update the resource manifest.
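For intuition about what a histogram with a linear x scale and a fixed number of bins contains, here is a minimal sketch. It is not the algorithm used by grr_manage; `linear_histogram` is a hypothetical helper, and it assumes equal-width bins between the smallest and largest observed score.

```python
def linear_histogram(values, bins=100):
    """Return (edges, counts) for equal-width bins spanning min(values)..max(values)."""
    if not values:
        raise ValueError("no score values to bin")
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # all values identical: everything falls into the first bin
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for value in values:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    return edges, counts


edges, counts = linear_histogram([0.0, 0.1, 0.5, 1.0], bins=2)
print(edges, counts)
```

The sketch ignores the y_scale setting, on the assumption that it affects only how the counts are displayed.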
Once the resource is ready, we need to regenerate the repository contents:
grr_manage repo-repair
Usage of genomic resources repositories (GRRs)
The GPF system can use genomic resources from different repositories. The default genomic resources repository used by the GPF system is located at
https://www.iossifovlab.com/distribution/public/genomic-resources-repository/.
You can browse the content of the repository using the grr_manage list command:
grr_manage list -R https://www.iossifovlab.com/distribution/public/genomic-resources-repository
If you have a repository on your local filesystem, you can browse it by providing the path to its root directory:
grr_manage list -R <path to the local repo>
You can also store a genomic resources repository in S3 storage and browse its content with:
grr_manage list -R s3://grr-bucket-test/grr \
--extra-args "endpoint_url=http://piglet.seqpipe.org:7480"
where grr-bucket-test is the bucket where you store the repository and --extra-args is used to specify the S3 endpoint.
Genomic Resource types
position_score
np_score
allele_score
gene_models
Example genomic_resource.yaml:
type: gene_models
filename: refGeneMito-201309.gz
format: "default"
The available formats are:
default – this is a GPF internal format
refflat
refseq
ccds
knowngene
gtf
ucscgenepred