EthSEQ: Ethnicity inference from whole-exome sequencing data
EthSEQ is an R script that infers the ethnicity of a set of samples with available whole-exome sequencing (WES) data, based on differential SNP genotype profiles. It combines:
The package is organized as follows:
EthSEQ
  -> EthSEQ.R
  -> Functions.R
  -> CreateReferenceModel.R
  -> MultiStepRefinementAnalysis.R
  -> Models
  -> VCF
  -> Include
  -> Example
To infer the ethnicity of a set of individuals run the EthSEQ.R script in the following way:
Rscript EthSEQ.R <ConfigurationFile.R>
The configuration file has the following structure:
##########################################################################
# Basic folders
source.dir = "EthSEQ path"
bam.list = "path to a text file containing the list of BAM files to be analyzed"
out.dir = "path to the output folder"
eigenstrat.path = "path to EIGENSTRAT binaries folder"
# Models available
# SS2 = Agilent Sure Select version 2
# SS4 = Agilent Sure Select version 4
# HALO = Agilent Haloplex
# NimblegenV3 = Roche Nimblegen V3
model = "HALO"
# To run the analysis with your own reference model, uncomment the following lines and specify the needed variables
# model = "" # keep this empty
# vcf.file = "path to VCF file"
# sif.file = "path to file with ethnicity annotations"
# model.ped = "path to PED file with genotype information"
# model.map = "path to MAP file describing the variants in the PED file"
# ASEQ parameters
ASEQ.path = "path to ASEQ binary"
mbq=20 # minimum base quality
mrq=20 # minimum read quality
mdc=20 # minimum depth of coverage
cores=10 # number of cores to be used
# analysis options
run.genotype=TRUE
reduce.composite.model = TRUE
# output details
verbose=FALSE
##########################################################################
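Since the configuration file is plain R code, it can be checked before a long run. The following is a minimal sketch, not part of EthSEQ: `check_config` is a hypothetical helper, and the list of required variable names is taken from the template above on the assumption that all of them are needed when a built-in model is used.

```r
# Hypothetical helper (not shipped with EthSEQ): load a configuration file
# into its own environment and report which expected variables are absent.
check_config <- function(config.file,
                         required = c("source.dir", "bam.list", "out.dir",
                                      "eigenstrat.path", "model", "ASEQ.path",
                                      "mbq", "mrq", "mdc", "cores")) {
  cfg <- new.env()
  sys.source(config.file, envir = cfg)   # evaluates the file inside cfg
  list(config = cfg, missing = setdiff(required, ls(cfg)))
}

# Usage with a throwaway, deliberately incomplete configuration file.
# All paths below are placeholders.
tmp <- tempfile(fileext = ".R")
writeLines(c('source.dir = "/opt/EthSEQ"',
             'bam.list = "/data/bams.txt"',
             'out.dir = "/data/out"',
             'model = "HALO"',
             'mbq=20', 'mrq=20', 'mdc=20', 'cores=10'), tmp)
res <- check_config(tmp)
print(res$missing)
```

Here the two variables left out of the throwaway file, `eigenstrat.path` and `ASEQ.path`, are the ones reported as missing.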
To infer the ethnicity of a set of individuals using the multi-step refinement method run the MultiStepRefinementModel.R script in the following way:
Rscript MultiStepRefinementModel.R <ConfigurationFile.R>
The configuration file extends the previous specification by adding the ethnic group sets specifications:
##########################################################################
# Ethnic group sets
subsets = list(c("AFR","SAS","EAS","EUR","ASH"),c("EUR","ASH"))
##########################################################################
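The `subsets` variable is an ordinary R list of character vectors, one per refinement step. The sketch below only inspects that structure. It assumes, based solely on the example above, that each set is meant to be contained in the previous one; the source does not state this as a requirement.

```r
# Ethnic group sets as given in the configuration example
subsets <- list(c("AFR", "SAS", "EAS", "EUR", "ASH"), c("EUR", "ASH"))

# Assumption: each step narrows the groups of the step before it.
# For every step after the first, test whether it is contained in its predecessor.
nested <- vapply(seq_along(subsets)[-1],
                 function(i) all(subsets[[i]] %in% subsets[[i - 1]]),
                 logical(1))

# Print the groups considered at each step
for (i in seq_along(subsets)) {
  cat("Step", i, ":", paste(subsets[[i]], collapse = ", "), "\n")
}
print(all(nested))
```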
To create a reference model run the CreateReferenceModel.R script in the following way:
Rscript CreateReferenceModel.R
after setting the following variables directly in the script:
## Parameters
sif.file = "specify_path_to_file"
vcf.file = "specify_path_to_file"
phased = FALSE # TRUE if genotypes in VCF format are phased
out.dir = "specify_path_to_dir"
model.name = "specify_model_name"
call.rate = 1 # fraction of samples with genotype calls for a specific SNP
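To illustrate what the `call.rate` parameter describes, the following sketch computes the per-SNP fraction of samples with a genotype call on a toy matrix. It is not the code used by CreateReferenceModel.R: the genotype encoding, the use of `NA` for a missing call, and the `>=` comparison against the threshold are all assumptions made for the example.

```r
# Toy genotype matrix: rows are SNPs, columns are samples, NA = no call
geno <- matrix(c(0, 1, 2, 1,
                 0, NA, 1, 1,
                 2, 2, NA, NA),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("snp1", "snp2", "snp3"), paste0("S", 1:4)))

# Fraction of samples with a genotype call, per SNP
snp.call.rate <- rowMeans(!is.na(geno))

# Assumed filter: keep SNPs whose call rate reaches the threshold
call.rate <- 1
kept <- names(snp.call.rate)[snp.call.rate >= call.rate]
print(snp.call.rate)
print(kept)
```

With `call.rate = 1` only `snp1`, which is called in every sample, passes the filter; lowering the threshold to 0.75 would also keep `snp2`.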
Sample information file should have the following format:
Sample\tRace\tGender
S1\tEUR\tmale
S2\tAFR\tfemale
…
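A file in this tab-separated layout can be produced and read back with base R. The sketch below uses only the two example rows shown above and writes to a temporary file; the path is a placeholder.

```r
# Build the sample information table from the two example rows
sif <- data.frame(Sample = c("S1", "S2"),
                  Race   = c("EUR", "AFR"),
                  Gender = c("male", "female"),
                  stringsAsFactors = FALSE)

# Write it tab-separated, with a header line and without quotes or row names
sif.file <- tempfile(fileext = ".txt")
write.table(sif, sif.file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back and confirm the expected columns are present
back <- read.delim(sif.file, stringsAsFactors = FALSE)
stopifnot(identical(colnames(back), c("Sample", "Race", "Gender")))
print(back)
```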
EthSEQ requires Linux kernel >= 2.6.15.
EthSEQ requires R >= 2.7 and the package "rgeos".
EthSEQ requires absolute (global) folder paths.
EthSEQ requires ASEQ (version 1.1.11 available in Tools folder) available also here
EthSEQ requires EIGENSTRAT (version 5.0.2 available in Tools folder) available also here
Code by Alessandro Romanel
Laboratory of Computational Oncology (F. Demichelis)
Centre for Integrative Biology, University of Trento, Italy
email contacts: romanel@science.unitn.it; demichelis@science.unitn.it
EthSEQ is distributed under the MIT Licence.