Working With Pedigree Files Guide

Brief outline

The GPF system has flexible support for various input formats of pedigree files. When loaded the pedigree is kept in an internal representation. The ped2ped.py tool allows loading the input pedigree file and storing it into the canonical GPF pedigree representation.

Additionally, GPF has a tool draw_pedigree.py that could generate a PDF file with drawings of all pedigrees loaded from the input.

Input pedigree file

A pedigree file (usually found with a .ped extension) is a text file with delimiter-separated values (comma, tab, etc.). Each row in this file describes an individual.

The file should contain the following columns:

Family ID:

Contains family ID for the individual.

Person ID:

Contains person ID for the individual.

Father ID:

Contains the person ID of the individual’s father. If the individual is without a specified father this column contains 0 or is left empty.

Mother ID:

Contains the person ID of the individual’s mother. If the individual is without a specified mother this column contains 0 or is left empty.

Sex:

The sex of the individual.

Status:

The affected status of the individual - whether they are affected or not.

The pedigree file can contain more columns, that specify different attributes for individuals.

Canonical pedigree file

The canonical pedigree file matches closely the internal representation of the pedigree data. Each line in the pedigree file specifies the attributes of an individual.

The columns in a canonical pedigree file are:

familyId:

Contains IDs for families included in the pedigree file.

personId:

Contains IDs for individuals included in the pedigree file. The assumption is that person IDs are unique across the whole pedigree file.

momId:

Contains the ID of the individual’s mother. If a mother is not specified this column should contain 0 or should left be empty.

dadId:

Contains the ID of the individual’s father. If a father is not specified this column should contain 0 or should be left empty.

sex:

The sex of the individual. Supported values for the sex column are described in Supported values for sex

status:

The affected status of the individual. Supported values for the status column are described in Supported values for status

role:

The role of the individual. Supported values for the role column are described in Supported values for role

sampleId:

This column is used to map an individual to the sample ID used in the genotypes file(s) (e.g a VCF file(s)). If this column is not specified in the input pedigree, it will be created and values will coincide with the personId column.

layout:

This column is optional. The suported format is following: <rank>:<x>,<y>, where

  • <rank> is the rank of the individual in the family, where the individuals from the earliest generation in the family have a rank of 1, the individuals from the next generation have a rank of 2, et. For example in a nuclear family, the mother and father have a rank of 1 and the children have a rank of 2.

  • <x> is the x-coordinate of the individual icon.

  • <y> is the y-coordinate of the individual icon.

generated:

This column specifies if a given individual is generated. The supported values in this column are True and False.

When the pedigree file contains not full families, the GPF tools add individuals to the family to make the family full.

For example, if a family contains two individuals - mother and proband, - the GPF adds father to this family to make proper visualization of the family.

These additional individuals are marked as generated and are not used in any downstream analysis. Their use is purely for visualization purposes.

When the input pedigree contains any additional columns the GPF tools keep these columns in the canonical representation.

Possible input pedigree file structures

When the input pedigree file is given to the GPF system it tries to transform it into the canonical representation described in Canonical pedigree file.

GPF system uses individuals’ roles for various queries. When the role column is not present in the input pedigree file, the GPF system tries to deduce the role of each individual in respect to the family’s proband.

The GPF system has different strategies to infer the role of each individual. Which strategy to use depends on the input data.

Plain pedigree (familyId, personId, momId, dadId, sex, status)

Often, the pedigree does not contain a role column. In this case the GPF system uses the following approach:

  • Assign a role proband to the first affected child in each family.

  • The roles of all other members in the family are inferred with respect to the proband.

Note

If no proband is found, all the roles will be set to unknown.

Example: simple pedigree file

Let’s say we have the following input pedigree file:

familyId

personId

momId

dadId

sex

status

f1

f1.01

0

0

F

unaffected

f1

f1.02

0

0

M

unaffected

f1

f1.03

f1.01

f1.02

F

affected

f1

f1.04

f1.01

f1.02

M

affected

f1

f1.05

0

0

M

unaffected

f1

f1.06

f1.01

f1.05

F

unaffected

To assign roles to the members of family f1 the GPF system will look for the first affected child in the f1 family - this will be f1.03 and this individual will get a role proband. The mother and father of f1.03 will become with roles mom and dad and hence f1.01 is going to have the role mom and f1.02 - role dad. The sibling of f1.03 will have the role sib and hence f1.04 is going to have the role sib.

This process continues until all individuals in the family have their roles set.

familyId

personId

momId

dadId

sex

status

role

f1

f1.01

0

0

F

unaffected

mom

f1

f1.02

0

0

M

unaffected

dad

f1

f1.03

f1.01

f1.02

F

affected

prb

f1

f1.04

f1.01

f1.02

M

affected

sib

f1

f1.05

0

0

M

unaffected

step_dad

f1

f1.06

f1.01

f1.05

F

unaffected

maternal_half_sibling

Pedigree with proband column (familyId, personId, momId, dadId, sex, status, prb)

When the strategy described in Plain pedigree (familyId, personId, momId, dadId, sex, status) is not appropriate the GPF can use a pedigree file with a proband column, that specifies which individual in the family has the role proband.

  • The first individual in the family for whom the proband column has value True recivies the role proband.

  • The roles of all other individuals are inferred with respect to the proband.

Note

If no proband is indicated, the tools fallback into the strategy described in Plain pedigree (familyId, personId, momId, dadId, sex, status)

Note

If more than one proband is selected, the role prb is assigned to the first of them and the rest of the roles are inferred with respect to the first (in the pedigree file) proband.

Example: pedigree file with prb column

Let’s say we have the following input pedigree file:

familyId

personId

momId

dadId

sex

status

prb

f1

f1.01

0

0

F

unaffected

0

f1

f1.02

0

0

M

unaffected

0

f1

f1.03

f1.01

f1.02

F

affected

0

f1

f1.04

f1.01

f1.02

M

affected

1

f1

f1.05

0

0

M

unaffected

0

f1

f1.06

f1.01

f1.05

F

unaffected

0

Note the prb column that specifies which individual has the role proband. So the f1.04 recivies role prb. The mother and father of f1.04 will have roles mom and dad and hence f1.01 is going to have the role mom and f1.02 - role dad. The sibling of f1.04 will have the role sib and hence f1.03 is going to have the role sib.

This process continues until all individuals in the family have their roles set.

familyId

personId

momId

dadId

sex

status

role

f1

f1.01

0

0

F

unaffected

mom

f1

f1.02

0

0

M

unaffected

dad

f1

f1.03

f1.01

f1.02

F

affected

sib

f1

f1.04

f1.01

f1.02

M

affected

prb

f1

f1.05

0

0

M

unaffected

step_dad

f1

f1.06

f1.01

f1.05

F

unaffected

maternal_half_sibling

Pedigree with role column (familyId, personId, momId, dadId, sex, status, role)

When a role column is defined in the input pedigree it becomes the source of truth about individuals’ roles. Whatever is saved in this column is interpreted as the role of the individual.

Example: pedigree with role column

familyId

personId

momId

dadId

sex

status

role

f1

f1.01

0

0

F

unaffected

mom

f1

f1.02

0

0

M

unaffected

dad

f1

f1.03

f1.01

f1.02

F

affected

prb

f1

f1.04

f1.01

f1.02

M

affected

sib

f1

f1.05

0

0

M

unaffected

step_dad

f1

f1.06

f1.01

f1.05

F

unaffected

maternal_half_sibling

Full canonical pedigree

The canonical pedigree file contains the role column and so, the GPF system uses this column to assign the role of each individual.

Todo

The loader will be upset (ERROR) if the role is not one of the recognized, names or synonyms.

The loader will output a WARNING if no proband is assigned for a family (can be suppressed with an argument???) OR consider it an ERROR condition that can be suppressed with an argument.

The loader will output a WARNING if more than one proband is assigned for a family?? (can be suppressed with an argument???)

Preparing the pedigree data

The pedigree data may require preparation beforehand. This section describes the requirements for pedigree data that must be met to use the tools.

In some cases, the initial pedigree file must be expanded with additional individuals to correctly form some families. Following that, individuals must be connected to their parents from the newly added individuals.

We must ensure the values in the sex, status and role columns in the file are supported by the GPF system. You can see a list of the supported values here - Supported values for sex, Supported values for status, Supported values for role.

Also, these properties support synonyms, which are listed in the tables below:

Supported values for sex

Sex column canonical values

Synonyms (case insensitive)

F

female, F, 2

M

male, M, 1

U

unspecified, U, 0

Supported values for status

Sex column canonical values

Synonyms (case insensitive)

affected

affected, 2

unaffected

unaffected, 1

unspecified

unspecified, -, 0

Supported values for role

Role column canonical values

Synonyms (case insensitive)

prb

proband, prb

sib

sibling, younger sibling, older sibling, sib

maternal_grandmother

maternal grandmother, maternal_grandmother

maternal_grandfather

maternal grandfather, maternal_grandfather

paternal_grandmother

paternal grandmother, paternal_grandmother

paternal_grandfather

paternal grandfather, paternal_grandfather

mom

mom, mother

dad

dad, father

child

child

maternal_half_sibling

maternal half sibling, maternal_half_sibling

paternal_half_sibling

paternal half sibling, paternal_half_sibling

half_sibling

half sibling, half_sibling

maternal_aunt

maternal aunt, maternal_aunt

maternal_uncle

maternal uncle, maternal_uncle

paternal_aunt

paternal aunt, paternal_aunt

paternal_uncle

paternal uncle, paternal_uncle

maternal_cousin

maternal cousin, maternal_cousin

paternal_cousin

paternal cousin, paternal_cousin

step_mom

step mom, step_mom, step mother

step_dad

step dad, step_dad, step father

spouse

spouse

unknown

unknown

Common arguments for the pedigree tools

positional arguments:

<families filename> families filename in pedigree or simple family format

optional arguments:
--ped-family PED_FAMILY

specify the name of the column in the pedigree file that holds the ID of the family the person belongs to [default: familyId]

--ped-person PED_PERSON

specify the name of the column in the pedigree file that holds the person’s ID [default: personId]

--ped-mom PED_MOM

specify the name of the column in the pedigree file that holds the ID of the person’s mother [default: momId]

--ped-dad PED_DAD

specify the name of the column in the pedigree file that holds the ID of the person’s father [default: dadId]

--ped-sex PED_SEX

specify the name of the column in the pedigree file that holds the sex of the person [default: sex]

--ped-status PED_STATUS

specify the name of the column in the pedigree file that holds the status of the person [default: status]

--ped-role PED_ROLE

specify the name of the column in the pedigree file that holds the role of the person [default: role]

--ped-no-role

indicates that the provided pedigree file has no role column. If this argument is provided, the import tool will guess the roles of individuals and write them in a “role” column.

--ped-proband PED_PROBAND

specify the name of the column in the pedigree file that specifies persons with role proband; this column is used only when option –ped-no-role is specified. [default: None]

--ped-no-header

indicates that the provided pedigree file has no header. The pedigree column arguments will accept indices if this argument is given. [default: False]

--ped-file-format PED_FILE_FORMAT

Families file format. It should pedigree or simple for simple family format [default: pedigree]

--ped-layout-mode PED_LAYOUT_MODE

Layout mode specifies how pedigrees drawing of each family is handled. Available options are generate and load. When the layout mode option is set to generate` the loader tries to generate a layout for each family pedigree. When load is specified, the loader tries to load the layout from the layout column of the pedigree. [default: load]

--ped-sep PED_SEP

Families file field separator [default: t]

-o OUTPUT_FILENAME

specify the name of the output file

Transform a pedigree file into canonical GPF form

To transform a pedigree file into canonical GPF form you can use the ped2ped.py tool. To see the tool’s full functionality use:

ped2ped.py --help

To demonstrate how it works, we will use the sample data. To standardize the example_families.ped file use:

ped2ped.py example_families.ped \
--ped-layout-mode generate -o example_family_standardized.ped

The output example_family_standardized.ped file has two newly generated columns - sampleId and layout, which are used by the GPF system.

The ped2ped.py tool can also process pedigree files with noncanonical column names. For such cases, it has arguments that can be used to specify which column contains the family id, role, status, sex, etc. For example, see the case of the example_families_with_noncanonical_column_names.ped file:

ped2ped.py example_families_with_noncanonical_column_names.ped \
--ped-family Family_id --ped-person Person_id --ped-dad Dad_id --ped-mom Mom_id \
--ped-sex Sex --ped-status Status --ped-role Role \
--ped-layout-mode generate -o example_families_from_noncanonical_column_names.ped

The ped2ped.py tool can also process pedigree files without headers. One such file is example_families_without_header.ped. In this case, we have to map the column’s index to a specific column name. The same way we mapped ‘Family_id’ to the family id column in the upper example, here we map the first column to family id (Keep in mind the column indices begin from 0). See the example below:

ped2ped.py example_families_without_header.ped \
--ped-no-header --ped-family 0 --ped-person 1 --ped-dad 2 --ped-mom 3 \
--ped-sex 4 --ped-status 5 --ped-role 6 \
--ped-layout-mode generate -o example_families_from_no_header.ped

Visualize a pedigree file into a PDF file

To visualize a pedigree file into a PDF file, containing drawings of the family pedigrees you can use the draw_pedigrees.py tool. To see its full functionality use:

draw_pedigree.py --help

Notice that it shares a lot of common flags with the ped2ped.py tool. Similar to the ped2ped.py tool, it can also process pedigree files with noncanonically named columns or without a header.

In addition to that, it has a --mode flag, which supports two values:

  • report

    the tool will generate a family pedigree drawing for each unique family structure family

  • families

    the tool will generate a family pedigree drawing for every individual family

To demonstrate how to use the draw_pedigree.py tool we will visualize the example_families.ped file:

draw_pedigree.py example_families.ped -o example_families_visualization.pdf

This command outputs the example_families_visualization.pdf file with the pedigree drawings.