EthSEQ: ethnicity annotation from whole-exome sequencing data


BASIC INFO

EthSEQ is an R script that infers the ethnicity of a set of samples with available whole-exome sequencing (WES) data, based on their differential SNP genotype profiles. It combines:

  1. 1,000 Genomes Project genotype data, used to generate reference models for specific WES platforms;
  2. ASEQ, used to genotype the input samples with unknown ethnicity; and
  3. EIGENSTRAT, used to perform principal component analysis on the aggregated genotyped data.

The package is organized as follows:

EthSEQ
 -> EthSEQ.R 
 -> Functions.R 
 -> CreateReferenceModel.R
 -> MultiStepRefinementAnalysis.R
 -> Models
 -> VCF
 -> Include
 -> Example
  • EthSEQ.R is the R script required to infer ethnicity, while Functions.R contains utility functions.
  • CreateReferenceModel.R is the R script required to generate a new reference model given genotype data of ethnic groups and a set of specified captured regions.
  • MultiStepRefinementAnalysis.R is the R script that implements the multi-step refinement analysis.
  • The Models folder holds reference models for a set of WES platforms (Agilent Haloplex, Roche Nimblegen version 3, Agilent SureSelect version 2 and version 4 are currently available); models are built from 1,000 Genomes Project genotype data. The Models folder is distributed empty; models should be downloaded separately from the links in the DOWNLOADS section at the bottom and uncompressed into the Models folder.
  • The VCF folder contains the lists of SNPs (in VCF format) used to generate the reference models for the available WES platforms.
  • The Include folder contains the ASEQ binaries and the EIGENSTRAT software folder with pre-compiled binaries.
  • The Example folder contains an example configuration file, an example BAM input list of individuals with unknown ethnicity, and an example output report.

INFER THE ETHNICITY OF A SET OF INDIVIDUALS

To infer the ethnicity of a set of individuals, run the EthSEQ.R script as follows:

Rscript EthSEQ.R <ConfigurationFile.R>

The configuration file has the following structure:

##########################################################################
# Basic folders
source.dir = "EthSEQ path"
bam.list = "path to a text file containing the list of BAM files to be analyzed"
out.dir = "path to the output folder"
eigenstrat.path = "path to EIGENSTRAT binaries folder"

# Models available
# SS2 = Agilent Sure Select version 2
# SS4 = Agilent Sure Select version 4
# HALO = Agilent Haloplex
# NimblegenV3 = Roche Nimblegen V3
model = "HALO"

# To run the analysis with your own reference model, uncomment the following lines and specify the needed variables
# model = "" # keep this empty
# vcf.file = "path to VCF file"
# sif.file = "path to file with ethnic annotations"
# model.ped = "path to PED file with genotype information"
# model.map = "path to MAP file with data of the variants specified in the PED file"

# ASEQ parameters
ASEQ.path = "path to ASEQ binary"
mbq=20 # minimum base quality
mrq=20 # minimum read quality
mdc=20 # minimum depth of coverage
cores=10 # number of cores to be used

# analysis options
run.genotype=TRUE
reduce.composite.model = TRUE
composite.model.call.rate = 1

# output details
verbose=FALSE
##########################################################################
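
Because the configuration file is plain R code, it can be checked before a long run is started. The following is a minimal sketch, not part of EthSEQ: check_config is a hypothetical helper, and the list of variables it treats as required is an assumption taken from the template above.

```r
# Hypothetical helper (not part of EthSEQ): load a configuration file into a
# separate environment and list the expected variables it does not define.
check_config <- function(config.file) {
  cfg <- new.env()
  source(config.file, local = cfg)  # evaluate the assignments inside cfg only
  # Variable names taken from the configuration template above (assumed required)
  required <- c("source.dir", "bam.list", "out.dir", "eigenstrat.path",
                "model", "ASEQ.path", "mbq", "mrq", "mdc", "cores")
  defined <- vapply(required, exists, logical(1), envir = cfg, inherits = FALSE)
  list(config = cfg, missing = required[!defined])
}

# Usage with a throwaway, deliberately incomplete configuration file
tmp <- tempfile(fileext = ".R")
writeLines(c('source.dir = "/path/to/EthSEQ"', 'model = "HALO"', 'mbq = 20'), tmp)
res <- check_config(tmp)
print(res$missing)
```

Loading the file into its own environment keeps the configuration variables from overwriting objects in the calling session.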


MULTI-STEP REFINEMENT METHOD

To infer the ethnicity of a set of individuals using the multi-step refinement method, run the MultiStepRefinementAnalysis.R script as follows:

Rscript MultiStepRefinementAnalysis.R <ConfigurationFile.R>

The configuration file extends the previous specification by adding the ethnic group sets specification:
# Subsets specification
subsets = list(c("AFR","SAS","EAS","EUR","ASH"),c("EUR","ASH"))
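
The subsets list can be sanity-checked before the analysis is run. The sketch below assumes that each refinement step works only on groups already present in the previous step, which this README does not state explicitly; check_subsets is a hypothetical helper, not an EthSEQ function.

```r
# Group sets as in the specification above
subsets <- list(c("AFR", "SAS", "EAS", "EUR", "ASH"), c("EUR", "ASH"))

# Hypothetical check (assumption): every step after the first should only
# name groups that also appear in the step before it.
check_subsets <- function(subsets) {
  steps <- seq_along(subsets)[-1]
  nested <- vapply(steps,
                   function(i) all(subsets[[i]] %in% subsets[[i - 1]]),
                   logical(1))
  all(nested)
}

stopifnot(check_subsets(subsets))

# Print the refinement steps in order
for (i in seq_along(subsets)) {
  cat(sprintf("Step %d: %s\n", i, paste(subsets[[i]], collapse = ", ")))
}
```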


CREATE A REFERENCE MODEL

To create a new reference model run the CreateReferenceModel.R script in the following way:

Rscript CreateReferenceModel.R

after setting the following variables in the script code:

## Parameters
sif.file = "specify_path_to_file"
vcf.file = "specify_path_to_file"
phased = FALSE # TRUE if genotypes in VCF format are phased
out.dir = "specify_path_to_dir"
model.name = "specify_model_name"
call.rate = 1 # fraction of samples with genotype calls for a specific SNP
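
The call.rate parameter can be illustrated on a toy genotype matrix. This is only a sketch: it assumes call.rate acts as a minimum threshold on the per-SNP fraction of samples with a genotype call, and the matrix layout (SNPs in rows, samples in columns, NA for a missing call) is invented for the example.

```r
# Toy genotype matrix: rows are SNPs, columns are samples, NA marks a missing call
geno <- matrix(c(0, 1, 2,
                 0, NA, 1,
                 NA, NA, 2),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("snp1", "snp2", "snp3"), c("S1", "S2", "S3")))

# Fraction of samples with a genotype call, computed per SNP
snp.call.rate <- rowMeans(!is.na(geno))

# Assumed use of the parameter: keep SNPs whose call rate reaches the threshold
call.rate <- 1
kept <- geno[snp.call.rate >= call.rate, , drop = FALSE]
print(rownames(kept))
```

With call.rate = 1 only SNPs genotyped in every sample are kept; lowering the value would tolerate some missing calls.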

The sample information file (sif.file) should be tab-separated (\t denotes a tab character) and have the following format:
Sample\tRace\tGender
S1\tEUR\tmale
S2\tAFR\tfemale
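
A file in this format can be written and read back with base R. The sketch below uses a temporary file as a stand-in for the real path; the sample names and values are the ones from the example above.

```r
# Build the two example records shown above
sif <- data.frame(Sample = c("S1", "S2"),
                  Race   = c("EUR", "AFR"),
                  Gender = c("male", "female"),
                  stringsAsFactors = FALSE)

# Temporary file used as a placeholder for the real sif.file path
sif.file <- tempfile(fileext = ".txt")
write.table(sif, sif.file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back and confirm the expected header
check <- read.delim(sif.file, stringsAsFactors = FALSE)
stopifnot(identical(colnames(check), c("Sample", "Race", "Gender")))
```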


The VCF file should specify the MAF in the INFO column (e.g. MAF=0.32) for all variants.
Genotypes should be specified using phased notation (e.g. 0|0,1|0,0|1,1|1,…) or unphased notation (e.g. 0/0,0/1,1/1,…).
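
The two conventions can be illustrated with a short parsing sketch. The helpers extract_maf and alt_allele_count are hypothetical and are not taken from the EthSEQ code; the sketch assumes biallelic variants and treats "." as a missing allele.

```r
# Hypothetical helper: pull the MAF value out of a VCF INFO string, NA if absent
extract_maf <- function(info) {
  m <- regmatches(info, regexec("(^|;)MAF=([0-9.eE+-]+)", info))
  vapply(m, function(x) if (length(x) == 3) as.numeric(x[3]) else NA_real_,
         numeric(1))
}

# Hypothetical helper: count alternative alleles in a phased or unphased genotype
alt_allele_count <- function(gt) {
  alleles <- strsplit(gt, "[|/]")
  vapply(alleles, function(a) {
    if (length(a) != 2 || any(a == ".")) NA_integer_ else sum(as.integer(a))
  }, integer(1))
}

info <- c("AC=5;MAF=0.32;AN=10", "MAF=0.05", "AC=1")
print(extract_maf(info))

gt <- c("0|0", "1|0", "0/1", "1/1", "./.")
print(alt_allele_count(gt))
```

Both separators are accepted by the same split, so phased and unphased genotypes map to the same allele counts.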


REQUIREMENTS

EthSEQ requires Linux kernel >= 2.6.15.
EthSEQ requires R >= 2.7 and the package “rgeos”.
EthSEQ requires global (absolute) folder names.
EthSEQ requires ASEQ (version 1.1.11 available in Tools folder) available also here
EthSEQ requires EIGENSTRAT (version 5.0.2 available in Tools folder) available also here


Code by Alessandro Romanel
Laboratory of Computational Oncology (F. Demichelis)
Centre for Integrative Biology, University of Trento, Italy
email contacts: romanel@science.unitn.it; demichelis@science.unitn.it

EthSEQ is distributed under the MIT Licence.


DOWNLOADS

Tool versions
Reference Models