This shows you the differences between two versions of the page.
public:ethseq_v1 [2017/04/08 21:32] alessandro.romanel@unitn.it |
public:ethseq_v1 [2017/04/08 21:38] (current) alessandro.romanel@unitn.it |
||
---|---|---|---|
Line 1: | Line 1: | ||
<html> | <html> | ||
- | <span style="color:gray;font-size:200%;">EthSEQ version 1.0</span> | + | <span style="color:gray;font-size:200%;">EthSEQ: ethnicity annotation from whole-exome sequencing data</span> |
</html> | </html> | ||
---- | ---- | ||
- | === BASIC USAGE === | + | === BASIC INFO === |
EthSEQ is an R script that infers the ethnicity of a set of samples for which whole-exome sequencing (WES) data are available, using differential SNP genotype profiles. It combines: | EthSEQ is an R script that infers the ethnicity of a set of samples for which whole-exome sequencing (WES) data are available, using differential SNP genotype profiles. It combines: | ||
- | - [[http://hapmap.ncbi.nlm.nih.gov/|HapMap]] data, used to generate reference models for specific WES platforms; | + | - [[http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/|1,000 Genomes Project]] genotype data, used to generate reference models for specific WES platforms; |
- | - [[public:aseq|ASEQ]], used to genotype the input samples, and | + | - [[public:aseq|ASEQ]], used to genotype the input samples with unknown ethnicity, and |
- | - [[http://genepath.med.harvard.edu/~reich/EIGENSTRAT.htm|EIGENSTRAT]], used to perform principal component analysis on the aggregated genotyped data. | + | - [[http://genepath.med.harvard.edu/~reich/EIGENSTRAT.htm|EIGENSTRAT]], used to perform principal component analysis on the aggregated genotyped data. \\ |
- | \\ | + | The package is organized as follows:\\ |
- | EthSEQ script has the following syntax: \\ | + | |
- | EthSEQ.R <ConfigurationFile.R> | + | |
- | + | ||
- | Folders are organized as follows:\\ | + | |
EthSEQ | EthSEQ | ||
-> EthSEQ.R | -> EthSEQ.R | ||
-> Functions.R | -> Functions.R | ||
+ | -> CreateReferenceModel.R | ||
+ | -> MultiStepRefinementAnalysis.R | ||
-> Models | -> Models | ||
-> VCF | -> VCF | ||
Line 25: | Line 23: | ||
-> Example | -> Example | ||
- | EthSEQ.R is the R script required to infer ethnicity with Functions.R file containing utility functions.\\ | + | * EthSEQ.R is the R script required to infer ethnicity while Functions.R contains utility functions.\\ |
- | The Models folder contains HapMap models for specific WES platforms (Haloplex, SureSelect version2 and version 4 are currently available).\\ | + | * CreateReferenceModel.R is the R script required to generate a new reference model given genotype data of ethnic groups and a set of specified captured regions.\\ |
- | The VCF folder contains the **lists of SNPs** (in VCF format) used to generate the HapMap models for the different WES platforms.\\ | + | * MultiStepRefinementAnalysis.R is the R script that implements the multi-step refinement analysis.\\ |
- | The Include folder contains ASEQ binaries and EIGENSTRAT software folder with pre-compiled binaries available. | + | * The Models folder contains reference models for a set of WES platforms (Agilent Haloplex, Roche Nimblegen version 3, and Agilent SureSelect version 2 and version 4 are currently available); models are built from 1,000 Genomes Project genotype data. **The Models folder is distributed empty; models should be downloaded separately from the links at the bottom of this page and uncompressed into the Models folder.**\\ |
- | The Example folder contains an example of configuration file | + | * The VCF folder contains the **lists of SNPs** (in VCF format) used to generate the reference models for the available WES platforms.\\ |
- | \\ \\ | + | * The Include folder contains ASEQ binaries and EIGENSTRAT software folder with pre-compiled binaries available.\\ |
+ | * The Example folder contains an example configuration file, an example BAM input list of individuals with unknown ethnicity, and an example output report. | ||
+ | |||
+ | ---- | ||
+ | |||
+ | === INFER THE ETHNICITY OF A SET OF INDIVIDUALS === | ||
+ | |||
+ | To infer the ethnicity of a set of individuals run the EthSEQ.R script in the following way: \\ | ||
+ | Rscript EthSEQ.R <ConfigurationFile.R> | ||
The configuration file has the following structure:\\ \\ | The configuration file has the following structure:\\ \\ | ||
##########################################################################\\ | ##########################################################################\\ | ||
- | ## Basic folders\\ | + | # Basic folders\\ |
source.dir = "EthSEQ path"\\ | source.dir = "EthSEQ path"\\ | ||
bam.list = "path to a text file containing the list of BAM files to be analyzed"\\ | bam.list = "path to a text file containing the list of BAM files to be analyzed"\\ | ||
out.dir = "path to the output folder"\\ | out.dir = "path to the output folder"\\ | ||
- | eigenstrat.path = "path to EIGENSTRAT binaries folder"\\ \\ | + | eigenstrat.path = "path to EIGENSTRAT binaries folder"\\ |
- | + | \\ | |
- | ## Models available \\ | + | # Models available \\ |
- | ## SS2 = Sure Select version 2\\ | + | # SS2 = Agilent Sure Select version 2\\ |
- | ## SS4 = Sure Select version 4\\ | + | # SS4 = Agilent Sure Select version 4\\ |
- | ## HALO = Haloplex\\ | + | # HALO = Agilent Haloplex\\ |
+ | # NimblegenV3 = Roche Nimblegen V3\\ | ||
model = "HALO"\\ | model = "HALO"\\ | ||
\\ | \\ | ||
- | ## ASEQ parameters\\ | + | # To run the analysis with your own reference model uncomment the following line and specify the needed variables\\ |
+ | # model = "" # keep this empty\\ | ||
+ | # vcf.file = "path to VCF file"\\ | ||
+ | # sif.file = "path to file with ethnicity annotations"\\ | ||
+ | # model.ped = "path to PED file with genotype information"\\ | ||
+ | # model.map = "path to MAP file describing the variants specified in the PED file" | ||
+ | \\ | ||
+ | \\ | ||
+ | # ASEQ parameters\\ | ||
ASEQ.path = "path to ASEQ binary"\\ | ASEQ.path = "path to ASEQ binary"\\ | ||
mbq=20 # minimum base quality\\ | mbq=20 # minimum base quality\\ | ||
Line 52: | Line 68: | ||
cores=10 # number of cores to be used\\ | cores=10 # number of cores to be used\\ | ||
\\ | \\ | ||
- | ## output details\\ | + | # analysis options\\ |
- | verbose=F\\ | + | run.genotype=TRUE\\ |
- | ##########################################################################\\ \\ | + | reduce.composite.model = TRUE\\ |
+ | composite.model.call.rate = 1\\ | ||
+ | \\ | ||
+ | # output details\\ | ||
+ | verbose=FALSE\\ | ||
+ | ########################################################################## | ||
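Because the configuration file is ordinary R code, it can be loaded and checked before a long run is started. The following is a minimal sketch and not part of EthSEQ: ''check_config'' is a hypothetical helper, and its default list of expected variables is an assumption based only on the variables visible in the structure above.

```r
# Hypothetical helper (not shipped with EthSEQ): load a configuration file
# into its own environment and return the names of expected variables that
# the file does not define.
check_config <- function(config.file,
                         required = c("source.dir", "bam.list", "out.dir",
                                      "eigenstrat.path", "model", "ASEQ.path",
                                      "mbq", "cores")) {
  env <- new.env()
  # The configuration file is plain R, so sourcing it defines the variables
  # inside 'env' without touching the global workspace.
  source(config.file, local = env)
  defined <- vapply(required, exists, logical(1),
                    envir = env, inherits = FALSE)
  required[!defined]
}
```

Calling ''check_config("ConfigurationFile.R")'' returns an empty character vector when every expected variable is defined; the ''required'' argument can be adapted to the variables a given run actually needs.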
+ | |||
+ | ---- | ||
+ | |||
+ | === MULTI-STEP REFINEMENT METHOD === | ||
+ | |||
+ | To infer the ethnicity of a set of individuals using the multi-step refinement method run the MultiStepRefinementModel.R script in the following way: \\ | ||
+ | Rscript MultiStepRefinementModel.R <ConfigurationFile.R> | ||
+ | |||
+ | The configuration file extends the previous specification by adding the ethnic group sets specification:\\ | ||
+ | # Subsets specification\\ | ||
+ | subsets = list(c("AFR","SAS","EAS","EUR","ASH"),c("EUR","ASH"))\\ | ||
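In the example above, the second set of groups is contained in the first. The sketch below checks that property for an arbitrary ''subsets'' list. It rests on an assumption that is not stated on this page, namely that each refinement step should only use groups already present in the previous step; ''is_nested'' is a hypothetical helper, not an EthSEQ function.

```r
# Example value taken from the configuration snippet above.
subsets <- list(c("AFR", "SAS", "EAS", "EUR", "ASH"), c("EUR", "ASH"))

# Hypothetical check (assumed requirement, not documented EthSEQ behaviour):
# TRUE when every set of groups is contained in the set that precedes it.
is_nested <- function(subsets) {
  if (length(subsets) < 2) return(TRUE)
  all(vapply(seq_along(subsets)[-1],
             function(i) all(subsets[[i]] %in% subsets[[i - 1]]),
             logical(1)))
}

is_nested(subsets)
```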
+ | |||
+ | ---- | ||
+ | |||
+ | === CREATE A REFERENCE MODEL === | ||
+ | |||
+ | To create a new reference model run the CreateReferenceModel.R script in the following way: \\ | ||
+ | Rscript CreateReferenceModel.R | ||
+ | |||
+ | by specifying in the script code the following variables:\\ | ||
+ | \\ | ||
+ | ## Parameters\\ | ||
+ | sif.file = "specify_path_to_file"\\ | ||
+ | vcf.file = "specify_path_to_file"\\ | ||
+ | phased = FALSE # TRUE if genotypes in VCF format are phased\\ | ||
+ | out.dir = "specify_path_to_dir"\\ | ||
+ | model.name = "specify_model_name"\\ | ||
+ | call.rate = 1 # fraction of samples with genotype calls for a specific SNP\\ | ||
+ | \\ | ||
+ | Sample information file (sif.file) should have the following format:\\ | ||
+ | //Sample\tRace\tGender\\ | ||
+ | S1\tEUR\tmale\\ | ||
+ | S2\tAFR\tfemale\\ | ||
+ | ...//\\ \\ | ||
+ | VCF file should specify in the INFO column the MAF information (e.g. MAF=0.32) for all variants.\\ | ||
+ | Genotypes should be specified using phased notation (e.g. 0|0,1|0,0|1,1|1,...) or unphased notation (e.g. 0/0,0/1,1/1,...). | ||
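The two input requirements described above (a tab-separated sample information file with a Sample, Race, Gender header, and a MAF entry in the INFO column of every VCF record) can be checked before CreateReferenceModel.R is run. The following is a minimal sketch under those stated requirements only; ''check_sif'' and ''missing_maf'' are hypothetical helpers and not part of EthSEQ, and the check for empty fields is an added assumption.

```r
# Hypothetical helper: return a character vector describing problems found
# in a sample information file (empty vector when none are found).
check_sif <- function(sif.file) {
  sif <- read.delim(sif.file, header = TRUE, stringsAsFactors = FALSE,
                    colClasses = "character")
  problems <- character(0)
  if (!identical(colnames(sif), c("Sample", "Race", "Gender"))) {
    problems <- c(problems, "header should be: Sample, Race, Gender")
  }
  # Assumption: every sample needs a value in every column.
  if (any(is.na(sif) | sif == "")) {
    problems <- c(problems, "empty fields found")
  }
  problems
}

# Hypothetical helper: return the positions (among the data records) of
# VCF records whose INFO column (column 8) has no MAF=... entry.
missing_maf <- function(vcf.file) {
  lines <- readLines(vcf.file)
  records <- lines[!grepl("^#", lines) & nzchar(lines)]
  info <- vapply(strsplit(records, "\t", fixed = TRUE),
                 function(fields) fields[8], character(1))
  which(!grepl("(^|;)MAF=", info))
}
```

Both helpers only read the files; an empty result from each suggests the inputs match the format described above, but it does not guarantee that the reference model will build.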
---- | ---- | ||
=== REQUIREMENTS === | === REQUIREMENTS === | ||
EthSEQ requires Linux kernel >= 2.6.15.\\ | EthSEQ requires Linux kernel >= 2.6.15.\\ | ||
- | EthSEQ requires R >= 2.7 and the package SDMTools.\\ | + | EthSEQ requires R >= 2.7 and the package "rgeos".\\ |
EthSEQ requires global (absolute) folder names.\\ | EthSEQ requires global (absolute) folder names.\\ | ||
- | EthSEQ requires ASEQ (version 1.1.8 is available in the Tools folder), also available [[public:aseq|here]]\\ | + | EthSEQ requires ASEQ (version 1.1.11 is available in the Tools folder), also available [[public:aseq|here]]\\ |
EthSEQ requires EIGENSTRAT (version 5.0.2 is available in the Tools folder), also available [[http://genepath.med.harvard.edu/~reich/EIGENSTRAT.htm|here]]\\ | EthSEQ requires EIGENSTRAT (version 5.0.2 is available in the Tools folder), also available [[http://genepath.med.harvard.edu/~reich/EIGENSTRAT.htm|here]]\\ | ||
---- | ---- | ||
Line 74: | Line 132: | ||
---- | ---- | ||
=== DOWNLOADS === | === DOWNLOADS === | ||
- | + | == Tool versions == | |
- | * {{:ethseq-v0.1.zip|EthSEQ}} | + | * {{:EthSEQ_v1_0.zip|EthSEQ_v1_0.zip}} |
+ | == Reference Models == | ||
+ | * {{:1000GP_HALO.zip|Reference model for Agilent HaloPlex WES design}} | ||
+ | * {{:1000GP_SS2.zip|Reference model for Agilent SureSelectV2 WES design}} | ||
+ | * {{:1000GP_SS4.zip|Reference model for Agilent SureSelectV4 WES design}} | ||
+ | * {{:1000GP_NimblegenV3.zip|Reference model for Roche Nimblegen V3 WES design}} |