EthSEQ: ethnicity annotation from whole-exome sequencing data

----

=== BASIC INFO ===

EthSEQ is an R script that infers the ethnicity of a set of samples for which whole-exome sequencing (WES) data is available, using differential SNP genotype profiles. It combines:
  * [[http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/|1,000 Genomes Project]] genotype data, used to generate reference models for specific WES platforms;
  * [[public:aseq|ASEQ]], used to genotype the input samples with unknown ethnicity; and
  * [[http://genepath.med.harvard.edu/~reich/EIGENSTRAT.htm|EIGENSTRAT]], used to perform principal component analysis on the aggregated genotype data.

\\
The package is organized as follows:\\
EthSEQ\\
-> EthSEQ.R\\
-> Functions.R\\
-> CreateReferenceModel.R\\
-> MultiStepRefinementAnalysis.R\\
-> Models\\
-> VCF\\
-> Include\\
-> Example

  * EthSEQ.R is the R script required to infer ethnicity, while Functions.R contains utility functions.\\
  * CreateReferenceModel.R is the R script required to generate a new reference model, given genotype data of ethnic groups and a set of specified captured regions.\\
  * MultiStepRefinementAnalysis.R is the R script that implements the multi-step refinement analysis.\\
  * The Models folder contains reference models for a set of WES platforms (Agilent Haloplex, Roche Nimblegen version 3, Agilent SureSelect version 2 and version 4 are currently available); models are built from 1,000 Genomes Project genotype data.
**The Models folder is empty; models must be downloaded separately from the links at the bottom of this page and uncompressed into the Models folder.**\\
  * The VCF folder contains the **lists of SNPs** (in VCF format) used to generate the reference models for the available WES platforms.\\
  * The Include folder contains the ASEQ binaries and the EIGENSTRAT software folder with pre-compiled binaries.\\
  * The Example folder contains an example configuration file, an example BAM input list of individuals with unknown ethnicity, and an example output report.

----

=== INFER THE ETHNICITY OF A SET OF INDIVIDUALS ===

To infer the ethnicity of a set of individuals, run the EthSEQ.R script as follows:\\
\\
Rscript EthSEQ.R

The configuration file has the following structure:\\
\\
##########################################################################\\
# Basic folders\\
source.dir = "EthSEQ path"\\
bam.list = "path to a text file containing the list of BAM files to be analyzed"\\
out.dir = "path to the output folder"\\
eigenstrat.path = "path to EIGENSTRAT binaries folder"\\
\\
# Models available\\
# SS2 = Agilent Sure Select version 2\\
# SS4 = Agilent Sure Select version 4\\
# HALO = Agilent Haloplex\\
# NimblegenV3 = Roche Nimblegen V3\\
model = "HALO"\\
\\
# To run the analysis with your own reference model, uncomment the following lines and specify the needed variables\\
# model = "" # keep this empty\\
# vcf.file = "path to VCF file"\\
# sif.file = "path to file with ethnic annotations"\\
# model.ped = "path to PED file with genotype information"\\
# model.map = "path to MAP file with data of the variants specified in the PED file"\\
\\
# ASEQ parameters\\
ASEQ.path = "path to ASEQ binary"\\
mbq=20 # minimum base quality\\
mrq=20 # minimum read quality\\
mdc=20 # minimum depth of coverage\\
cores=10 # number of cores to be used\\
\\
# analysis options\\
run.genotype=TRUE\\
reduce.composite.model = TRUE\\
composite.model.call.rate = 1\\
\\
# output details\\
verbose=FALSE\\
##########################################################################

----

=== MULTI-STEP REFINEMENT METHOD ===

To infer the ethnicity of a set of individuals using the multi-step refinement method, run the MultiStepRefinementModel.R script as follows:\\
\\
Rscript MultiStepRefinementModel.R

The configuration file extends the previous specification by adding the specification of the ethnic group subsets:\\
# Subsets specification\\
subsets = list(c("AFR","SAS","EAS","EUR","ASH"),c("EUR","ASH"))\\

----

=== CREATE A REFERENCE MODEL ===

To create a new reference model, run the CreateReferenceModel.R script as follows:\\
\\
Rscript CreateReferenceModel.R

after specifying the following variables in the script code:\\
\\
## Parameters\\
sif.file = "specify_path_to_file"\\
vcf.file = "specify_path_to_file"\\
phased = FALSE # TRUE if genotypes in VCF format are phased\\
out.dir = "specify_path_to_dir"\\
model.name = "specify_model_name"\\
call.rate = 1 # fraction of samples with genotype calls for a specific SNP\\
\\
The sample information file (sif.file) should be tab-separated and have the following format:\\
//Sample\tRace\tGender\\
S1\tEUR\tmale\\
S2\tAFR\tfemale\\
...//\\
\\
The VCF file should specify the MAF in the INFO column (e.g. MAF=0.32) for all variants.\\
Genotypes should be specified using phased notation (e.g. 0|0,1|0,0|1,1|1,...) or unphased notation (e.g. 0/0,0/1,1/1,...).

----

=== REQUIREMENTS ===

EthSEQ requires Linux kernel >= 2.6.15.\\
EthSEQ requires R >= 2.7 and the package "rgeos".\\
EthSEQ requires global (i.e. absolute) folder names.\\
EthSEQ requires ASEQ (version 1.1.11, available in the Tools folder), also available [[public:aseq|here]].\\
EthSEQ requires EIGENSTRAT (version 5.0.2, available in the Tools folder), also available [[http://genepath.med.harvard.edu/~reich/EIGENSTRAT.htm|here]].\\

----

=== COPYRIGHT ===

Code by Alessandro Romanel\\
Laboratory of Computational Oncology (F. 
Demichelis)\\
Centre for Integrative Biology, University of Trento, Italy\\
email contacts: romanel@science.unitn.it; demichelis@science.unitn.it\\
EthSEQ is distributed under the MIT Licence.

----

=== DOWNLOADS ===

== Tool versions ==

  * {{:EthSEQ_v1_0.zip|EthSEQ_v1_0.zip}}

== Reference Models ==

  * {{:1000GP_HALO.zip|Reference model for Agilent HaloPlex WES design}}
  * {{:1000GP_SS2.zip|Reference model for Agilent SureSelectV2 WES design}}
  * {{:1000GP_SS4.zip|Reference model for Agilent SureSelectV4 WES design}}
  * {{:1000GP_NimblegenV3.zip|Reference model for Roche Nimblegen V3 WES design}}
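
----

=== EXAMPLE: CHECKING A SAMPLE INFORMATION FILE ===

The sample information file format given under CREATE A REFERENCE MODEL (a Sample/Race/Gender header followed by one tab-separated row per sample) is easy to get wrong, for example by separating fields with spaces instead of tabs. The shell function below is a minimal sketch of a check to run before creating a model. check_sif is a hypothetical helper and not part of EthSEQ; it assumes the header is spelled exactly as in that section and that the file holds at least one sample row.

```shell
#!/bin/sh
# check_sif FILE
# Hypothetical helper (not shipped with EthSEQ). Exits 0 only if FILE has
# the tab-separated header "Sample<TAB>Race<TAB>Gender" followed by at
# least one row with exactly three non-empty tab-separated fields.
check_sif() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NR == 1 {
            # The header must match the documented column names exactly.
            if ($0 != "Sample\tRace\tGender") {
                print "line 1: unexpected header" > "/dev/stderr"
                bad = 1
            }
            next
        }
        # Every sample row needs three non-empty fields.
        NF != 3 || $1 == "" || $2 == "" || $3 == "" {
            print "line " NR ": expected 3 non-empty tab-separated fields" > "/dev/stderr"
            bad = 1
        }
        END {
            # A header alone (or an empty file) is not a usable file.
            if (NR < 2) bad = 1
            exit bad
        }
    ' "$1"
}
```

For instance, check_sif samples.sif could be run before Rscript CreateReferenceModel.R, where samples.sif is a placeholder path. The check covers only the layout; it does not verify that the Race labels or sample names match the VCF file.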