Genomic resources repository (GRR)
A Genomic Resource Repository (GRR) is a collection of genomic resources, such as reference genomes and gene models. You can use one or more GRRs at the same time. By default (that is, without any configuration), you will use the public GRR built by the Iossifov lab and accessible at https://www.iossifovlab.com/distribution/public/genomic-resources-repository/.
If you want to use additional genomic resources, you can build your own GRR (see Management of GRR below) and add it to the GRRs you use. The set of accessible GRRs can be configured in several ways.
To configure the GRRs used by default for your user, create the file ~/.grr_definition.yaml. An example of its contents is:
id: "development"
type: group
children:
- id: "grr_local"
  type: "directory"
  directory: "~/my_grr"
- id: "default"
  type: "url"
  url: "https://www.iossifovlab.com/distribution/public/genomic-resources-repository"
  cache_dir: "~/default_grr_cache"
This configures a group of two repositories with the ids ‘grr_local’ and ‘default’. When you search for a resource, the system first looks in the grr_local repository, because it is listed first; if the resource is not found there, it tries the default GRR. The default GRR is a remote GRR at the given URL, and its configuration specifies that resources used from it will be cached in the “~/default_grr_cache” directory. Using cached resources is significantly faster, but caching them the first time they are used takes some time, and they occupy substantial disk space.
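The search order described above can be pictured with a small shell sketch. It is not how GPF resolves resources internally; it only mimics the "first repository that has the resource wins" rule for directory repositories, assuming a resource is a directory that holds a genomic_resource.yaml file. The helper name find_resource, the scratch directories, and the resource id hg38/genomes/ref are hypothetical.

```shell
# Hypothetical helper: print the first repository directory that contains
# the given resource id, mimicking the ordered lookup of a "group".
find_resource() {
    resource_id=$1
    shift
    for repo_dir in "$@"; do
        if [ -f "$repo_dir/$resource_id/genomic_resource.yaml" ]; then
            printf '%s\n' "$repo_dir"
            return 0
        fi
    done
    return 1   # not found in any repository
}

# Two scratch repositories standing in for grr_local and default.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/grr_local/hg38/scores/score9" \
         "$demo_root/default/hg38/scores/score9" \
         "$demo_root/default/hg38/genomes/ref"
touch "$demo_root/grr_local/hg38/scores/score9/genomic_resource.yaml" \
      "$demo_root/default/hg38/scores/score9/genomic_resource.yaml" \
      "$demo_root/default/hg38/genomes/ref/genomic_resource.yaml"

# Present in both: the repository listed first is reported.
find_resource hg38/scores/score9 "$demo_root/grr_local" "$demo_root/default"
# Present only in the second repository: the lookup falls through to it.
find_resource hg38/genomes/ref "$demo_root/grr_local" "$demo_root/default"
```

Listing a local repository first therefore lets a locally built resource shadow a resource with the same id in a repository listed later.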
Alternatively, the system will use the GRR configuration file pointed to by the GRR_DEFINITION_FILE environment variable.
Finally, most command line tools that use GRRs have a --grr <file name> argument that overrides the defaults.
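One way to picture how these options interact is the sketch below. The precedence it encodes (an explicit --grr file, then GRR_DEFINITION_FILE, then ~/.grr_definition.yaml, then the public GRR) is an assumption made for illustration; the text above only states that --grr overrides the defaults. pick_grr_definition is a hypothetical helper and not part of GPF.

```shell
# Hypothetical helper: print which GRR definition would be used.
# $1 is an optional file name, as it would be passed via --grr.
# The precedence order below is an assumption, not documented behaviour.
pick_grr_definition() {
    cli_file=${1:-}
    if [ -n "$cli_file" ]; then
        printf '%s\n' "$cli_file"
    elif [ -n "${GRR_DEFINITION_FILE:-}" ]; then
        printf '%s\n' "$GRR_DEFINITION_FILE"
    elif [ -f "$HOME/.grr_definition.yaml" ]; then
        printf '%s\n' "$HOME/.grr_definition.yaml"
    else
        printf '%s\n' "public GRR (built-in default)"
    fi
}

GRR_DEFINITION_FILE=/tmp/project_grr.yaml
pick_grr_definition               # the environment variable is picked up
pick_grr_definition my_grr.yaml   # an explicit file takes priority
```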
A genomic resources repository can be accessed via different protocols. The currently supported protocols for GRR access are:
File system (directory) protocol.
id: <repo id>
type: directory
directory: <path to the local file system>
HTTP(S) protocol.
id: <repo id>
type: url
url: <http(s) url>
S3 protocol.
id: <repo id>
type: url
url: <S3 url>
endpoint_url: <endpoint url>
In-memory (embedded) protocol.
id: <repo id>
type: embedded
Browse available resources
grr_browse [--grr grr_definition.yaml]
Management of genomic resources repository (GRR)
Genomic resources and genomic resources repository
A genomic resource is a set of files stored in a directory. To make a given
directory a genomic resource, it should contain a genomic_resource.yaml file.
A genomic resources repository is a directory that contains genomic resources.
To make a given directory into a repository, it should have a .CONTENTS file.
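To make these two definitions concrete, the following sketch lists the directories under a repository root that look like genomic resources, assuming the marker file is genomic_resource.yaml (the file created in the examples in this document). It uses only find, sed and sort and is not a substitute for grr_manage list; list_resources is a hypothetical name.

```shell
# Hypothetical helper: print the resource ids (paths relative to the
# repository root) of every directory holding a genomic_resource.yaml file.
list_resources() {
    repo_root=$1
    ( cd "$repo_root" &&
      find . -type f -name genomic_resource.yaml |
      sed -e 's|^\./||' -e 's|/genomic_resource\.yaml$||' |
      sort )
}

# Scratch repository with one resource and one ordinary directory.
repo=$(mktemp -d)
mkdir -p "$repo/hg38/scores/score9" "$repo/hg38/scores/not_a_resource"
touch "$repo/hg38/scores/score9/genomic_resource.yaml"

# Only the directory with the marker file is reported.
list_resources "$repo"
```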
Create an empty GRR
To create an empty GRR, first create an empty directory. For example, let us
create an empty directory named grr_test, enter that directory, and run the
grr_manage repo-init command:
mkdir grr_test
cd grr_test
grr_manage repo-init
After that, the directory should contain an empty .CONTENTS file:
ls -a
. .. .CONTENTS
If we list all resources in this repository with grr_manage list, we should get an empty list.
Create an empty genomic resource
Let us create our first genomic resource. Create a directory
hg38/scores/score9 inside the grr_test repository and create an empty
genomic_resource.yaml file inside that directory:
mkdir -p hg38/scores/score9
cd hg38/scores/score9
touch genomic_resource.yaml
This creates an empty genomic resource, hg38/scores/score9, in our repository.
If we list the resources in our repository, we get:
grr_manage list
working with repository: .../grr_test
Basic 0 1 0 hg38/scores/score9
When we create or change a resource we need to repair the repository:
This command creates a .MANIFEST file for our new resource
hg38/scores/score9 and updates the repository .CONTENTS to include it.
Add genomic score resources
Add all score resource files (the score file and its Tabix index) to
the created directory hg38/scores/score9. Let’s say these files are
score9.tsv.gz and its Tabix index score9.tsv.gz.tbi.
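For experimenting, a tiny score file with the column layout used in this example (chrom, start, end, score9) can be produced as sketched below. All positions and values are made up for illustration. The compression and indexing step assumes that the bgzip and tabix tools from HTSlib are installed and is skipped otherwise; the options a real score file needs may differ.

```shell
# Write a small tab-separated score file with a header line.
# All positions and score values below are invented for illustration.
workdir=$(mktemp -d)
printf 'chrom\tstart\tend\tscore9\n'  > "$workdir/score9.tsv"
printf 'chr1\t100\t110\t0.25\n'      >> "$workdir/score9.tsv"
printf 'chr1\t111\t120\t1.50\n'      >> "$workdir/score9.tsv"
printf 'chr2\t50\t60\t0.75\n'        >> "$workdir/score9.tsv"

# Compress and index only when the HTSlib tools are available.
if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
    bgzip -c "$workdir/score9.tsv" > "$workdir/score9.tsv.gz"
    # -s/-b/-e give the 1-based columns for sequence name, start and end;
    # -S 1 skips the header line.
    tabix -s 1 -b 2 -e 3 -S 1 "$workdir/score9.tsv.gz"
fi
```

The resulting files would then be copied into the resource directory hg38/scores/score9.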
Configure the resource hg38/scores/score9. To this end, create a
genomic_resource.yaml file that contains the position score
configuration:
type: position_score
table:
  filename: score9.tsv.gz
  format: tabix
  # defined by score_type
  chrom:
    name: chrom
  pos_begin:
    name: start
  pos_end:
    name: end
# score values
scores:
- id: score9
  type: float
  desc: "score9"
  index: 3
histograms:
- score: score9
  bins: 100
  y_scale: "log"
  x_scale: "linear"
default_annotation:
  attributes:
  - source: score9
    destination: score9
meta: |
  ## score9
  TODO
When ready, run grr_manage resource-repair from inside the resource
directory:
cd hg38/scores/score9
grr_manage resource-repair
This command is going to calculate histograms for the score (if histograms are configured) and create or update the resource manifest.
Once the resource is ready, we need to regenerate the repository contents:
Usage of genomic resources repositories (GRRs)
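The GPF system can use genomic resources from different repositories. The
default genomic resources repository used by the GPF system is located at
https://www.iossifovlab.com/distribution/public/genomic-resources-repository.
You can browse the content of the repository using the grr_manage list command: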
grr_manage list -R https://www.iossifovlab.com/distribution/public/genomic-resources-repository
If you have a repository on your local filesystem, you can browse it by providing the path to its root directory:
grr_manage list -R <path to the local repo>
You can store a genomic resources repository in S3 storage and browse its content with:
grr_manage list -R s3://grr-bucket-test/grr \
    --extra-args "endpoint_url=http://piglet.seqpipe.org:7480"
Here, grr-bucket-test is the bucket where you store the repository, and
--extra-args is used to specify the S3 endpoint.