EthSEQ: ethnicity annotation from whole-exome sequencing data


BASIC INFO

EthSEQ is an R script that infers the ethnicity of a set of samples with whole-exome sequencing (WES) data from differential SNP genotype profiles. It combines:

  1. 1,000 Genomes Project genotype data, used to generate reference models for specific WES platforms;
  2. ASEQ, used to genotype the input samples with unknown ethnicity, and
  3. EIGENSTRAT, used to perform principal component analysis on the aggregated genotyped data.
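
The role of the principal component analysis step can be illustrated with a toy sketch. This is not EthSEQ's implementation: base R's prcomp stands in for EIGENSTRAT, the genotypes are simulated, and the nearest-centroid assignment and all variable names are assumptions made for illustration only.

```r
# Toy illustration only: prcomp replaces EIGENSTRAT and nearest-centroid
# assignment replaces EthSEQ's actual annotation step (both are assumptions).
set.seed(1)
n.snps <- 200
# Hypothetical alternate allele frequencies for two reference groups
freq.a <- runif(n.snps, 0.05, 0.95)
freq.b <- runif(n.snps, 0.05, 0.95)

# Simulate genotypes: rows are samples, columns are SNPs,
# entries are alternate allele counts (0, 1 or 2)
simulate <- function(n, freq) {
  t(replicate(n, rbinom(length(freq), size = 2, prob = freq)))
}
reference <- rbind(simulate(20, freq.a), simulate(20, freq.b))
ref.labels <- rep(c("A", "B"), each = 20)
target <- rbind(simulate(1, freq.a), simulate(1, freq.b))

# PCA on the aggregated genotype matrix (reference plus target samples)
pca <- prcomp(rbind(reference, target), center = TRUE, scale. = FALSE)
pcs <- pca$x[, 1:2]
ref.pcs <- pcs[seq_len(nrow(reference)), ]
target.pcs <- pcs[-seq_len(nrow(reference)), , drop = FALSE]

# Centroid of each reference group in the space of the first two components
centroids <- apply(ref.pcs, 2, function(pc) tapply(pc, ref.labels, mean))

# Assign each target sample to the closest reference centroid
assign.group <- function(p) {
  d <- apply(centroids, 1, function(cn) sum((p - cn)^2))
  names(which.min(d))
}
inferred <- apply(target.pcs, 1, assign.group)
print(inferred)
```

The point of the sketch is only that samples from the same group cluster together in the leading components once reference and target genotypes are analysed jointly.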

The package is organized as follows:

EthSEQ
 -> EthSEQ.R 
 -> Functions.R 
 -> CreateReferenceModel.R
 -> MultiStepRefinementAnalysis.R
 -> Models
 -> VCF
 -> Include
 -> Example

INFER THE ETHNICITY OF A SET OF INDIVIDUALS

To infer the ethnicity of a set of individuals run the EthSEQ.R script in the following way:

Rscript EthSEQ.R <ConfigurationFile.R>

The configuration file has the following structure:

##########################################################################
# Basic folders
source.dir = "EthSEQ path"
bam.list = "path to a text file containing the list of BAM files to be analyzed"
out.dir = "path to the output folder"
eigenstrat.path = "path to EIGENSTRAT binaries folder"

# Models available
# SS2 = Agilent Sure Select version 2
# SS4 = Agilent Sure Select version 4
# HALO = Agilent Haloplex
# NimblegenV3 = Roche Nimblegen V3
model = "HALO"

# To run the analysis with your own reference model, uncomment the following lines and specify the needed variables
# model = "" # keep this empty
# vcf.file = "path to VCF file"
# sif.file = "path to file with ethnicity annotations"
# model.ped = "path to PED file with genotype information"
# model.map = "path to MAP file describing the variants in the PED file"

# ASEQ parameters
ASEQ.path = "path to ASEQ binary"
mbq=20 # minimum base quality
mrq=20 # minimum read quality
mdc=20 # minimum depth of coverage
cores=10 # number of cores to be used

# analysis options
run.genotype=TRUE
reduce.composite.model = TRUE
composite.model.call.rate = 1

# output details
verbose=FALSE
##########################################################################
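
Before launching the analysis it can help to build the BAM list and check the configuration file. The following is a minimal sketch, assuming that bam.list contains one BAM path per line and that the configuration file is plain R code that can be loaded with source(); the working folder, the placeholder BAM files and the names work.dir, bam.list.file, required and missing.vars are hypothetical.

```r
# Sketch under assumptions: bam.list holds one BAM path per line and the
# configuration file is plain R code that can be loaded with source().
work.dir <- file.path(tempdir(), "ethseq_example")   # hypothetical working folder
dir.create(work.dir, showWarnings = FALSE)
invisible(file.create(file.path(work.dir, c("S1.bam", "S2.bam"))))  # empty placeholders

# Collect the BAM files with absolute paths and write them to the list file
bam.files <- normalizePath(
  list.files(work.dir, pattern = "\\.bam$", full.names = TRUE),
  winslash = "/"
)
bam.list.file <- file.path(work.dir, "bam_list.txt")
writeLines(bam.files, bam.list.file)

# Write a minimal configuration file (only a few of the variables shown above)
config.file <- file.path(work.dir, "ConfigurationFile.R")
writeLines(c(
  paste("bam.list =", deparse(bam.list.file)),
  paste("out.dir =", deparse(work.dir)),
  'model = "HALO"'
), config.file)

# Load the configuration into its own environment and check it
config <- new.env()
source(config.file, local = config)
required <- c("bam.list", "out.dir", "model")
missing.vars <- setdiff(required, ls(config))
if (length(missing.vars) > 0) {
  stop("Missing configuration variables: ", paste(missing.vars, collapse = ", "))
}
bams <- readLines(config$bam.list)
stopifnot(all(file.exists(bams)))
```

Loading the file into a separate environment keeps the check from overwriting variables in the current session.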


MULTI-STEP REFINEMENT METHOD

To infer the ethnicity of a set of individuals using the multi-step refinement method run the MultiStepRefinementModel.R script in the following way:

Rscript MultiStepRefinementModel.R <ConfigurationFile.R>

The configuration file extends the previous specification by adding the specification of the ethnic group sets:

# Subsets specification
subsets = list(c("AFR","SAS","EAS","EUR","ASH"),c("EUR","ASH"))
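
In the example above the second set is contained in the first one. A small check of this property can be sketched as follows; that every step must be nested in the previous one is an assumption drawn from the example, not a documented requirement, and is.nested is a hypothetical name.

```r
# Assumption: each refinement step only uses groups present in the previous step
subsets <- list(c("AFR", "SAS", "EAS", "EUR", "ASH"), c("EUR", "ASH"))

# Compare every set (from the second onwards) with the one before it
is.nested <- vapply(
  seq_along(subsets)[-1],
  function(i) all(subsets[[i]] %in% subsets[[i - 1]]),
  logical(1)
)
if (!all(is.nested)) {
  warning("Some subsets contain groups that are absent from the previous step")
}
```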


CREATE A REFERENCE MODEL

To create a new reference model run the CreateReferenceModel.R script in the following way:

Rscript CreateReferenceModel.R

after setting the following variables in the script code:

## Parameters
sif.file = "specify_path_to_file"
vcf.file = "specify_path_to_file"
phased = FALSE # TRUE if genotypes in VCF format are phased
out.dir = "specify_path_to_dir"
model.name = "specify_model_name"
call.rate = 1 # fraction of samples with genotype calls for a specific SNP
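
The effect of a call rate threshold can be sketched on a toy genotype matrix. This is an illustration under assumptions, not the code used by CreateReferenceModel.R: it assumes that SNPs whose fraction of called samples is below call.rate are discarded, and the matrix geno and the names snp.call.rate, keep and filtered are hypothetical.

```r
# Toy genotype matrix: rows are samples, columns are SNPs, NA means no call
geno <- matrix(
  c(0, 1, 2, NA,    # snp1: 3 of 4 samples called
    1, NA, 0, NA,   # snp2: 2 of 4 samples called
    2, 1, 1, 0),    # snp3: all samples called
  nrow = 4,
  dimnames = list(paste0("S", 1:4), c("snp1", "snp2", "snp3"))
)
call.rate <- 1

# Fraction of samples with a genotype call for each SNP
snp.call.rate <- colMeans(!is.na(geno))

# Keep only the SNPs that reach the threshold (assumed interpretation)
keep <- snp.call.rate >= call.rate
filtered <- geno[, keep, drop = FALSE]
print(colnames(filtered))
```

With call.rate = 1 only SNPs genotyped in every sample are retained; lowering the value keeps more SNPs at the cost of missing genotypes.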

The sample information file (sif.file) should be tab-separated and have the following format (\t denotes a tab character):
Sample\tRace\tGender
S1\tEUR\tmale
S2\tAFR\tfemale
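
A file in this format can be written and read back from R as sketched below. The output path and the variable names are hypothetical; only the three column names and the example rows come from the description above.

```r
# Build the sample information table shown above
sif <- data.frame(
  Sample = c("S1", "S2"),
  Race   = c("EUR", "AFR"),
  Gender = c("male", "female"),
  stringsAsFactors = FALSE
)

# Write it as a tab-separated file with a header and no quoting
sif.file <- file.path(tempdir(), "samples.sif")   # hypothetical location
write.table(sif, sif.file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back and check the expected columns
check <- read.delim(sif.file, stringsAsFactors = FALSE)
stopifnot(identical(colnames(check), c("Sample", "Race", "Gender")))
```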


The VCF file should specify the MAF information in the INFO column (e.g. MAF=0.32) for all variants.
Genotypes should be specified using phased notation (e.g. 0|0,1|0,0|1,1|1,…) or unphased notation (e.g. 0/0,0/1,1/1,…).
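
To make the two requirements concrete, the sketch below extracts the MAF value from an INFO string and converts genotype strings in either notation into alternate allele counts. The helper names get.maf and gt.to.count are hypothetical, and the handling of biallelic variants only and of "." as a missing allele are assumptions; this is not how EthSEQ itself parses the VCF file.

```r
# Extract the MAF value from a single INFO string (NA if the field is absent)
get.maf <- function(info) {
  fields <- strsplit(info, ";", fixed = TRUE)[[1]]
  hit <- fields[grepl("^MAF=", fields)]
  if (length(hit) == 0) return(NA_real_)
  as.numeric(sub("MAF=", "", hit[1], fixed = TRUE))
}

# Convert phased (0|1) or unphased (0/1) genotypes to alternate allele counts.
# Assumes biallelic variants; a "." allele is treated as a missing genotype.
gt.to.count <- function(gt) {
  alleles <- strsplit(gt, "[|/]")
  vapply(
    alleles,
    function(a) if (any(a == ".")) NA_integer_ else sum(as.integer(a)),
    integer(1)
  )
}

maf <- vapply(c("AC=3;MAF=0.32;AN=10", "AC=1;AN=10"), get.maf, numeric(1))
counts <- gt.to.count(c("0|0", "1|0", "0/1", "1/1", "./."))
```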


REQUIREMENTS

EthSEQ requires Linux kernel >= 2.6.15.
EthSEQ requires R >= 2.7 and the package “rgeos”.
EthSEQ requires global (absolute) folder paths.
EthSEQ requires ASEQ (version 1.1.11, available in the Tools folder).
EthSEQ requires EIGENSTRAT (version 5.0.2, available in the Tools folder).
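
The R-side requirements can be checked from an R session before running the scripts. This is a small sketch; the variable names r.ok and rgeos.ok are hypothetical, and the check does not cover ASEQ, EIGENSTRAT or the Linux kernel version.

```r
# Check the R version against the minimum stated above
r.ok <- getRversion() >= "2.7"

# system.file() returns an empty string when a package is not installed
rgeos.ok <- nzchar(system.file(package = "rgeos"))

if (!r.ok) warning("R >= 2.7 is required")
if (!rgeos.ok) warning("The package rgeos is not installed")
```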


Code by Alessandro Romanel
Laboratory of Computational Oncology (F. Demichelis)
Centre for Integrative Biology, University of Trento, Italy
email contacts: romanel@science.unitn.it; demichelis@science.unitn.it

EthSEQ is distributed under the MIT Licence.

