Differences

This shows you the differences between two versions of the page.

--- public:ethseq_v1 [2017/04/08 21:32]
alessandro.romanel@unitn.it
+++ public:ethseq_v1 [2017/04/08 21:38] (current)
alessandro.romanel@unitn.it
@@ Line 1: / Line 1: @@
 <html>
- <span style="color:gray;font-size:200%;">EthSEQ version 1.0</span>
+ <span style="color:gray;font-size:200%;">EthSEQ: ethnicity annotation from whole-exome sequencing data</span>
 </html>
 ----
-=== BASIC USAGE ===
+=== BASIC INFO ===
 EthSEQ is an R script that infers the ethnicity of a set of samples with available whole-exome sequencing (WES) data from differential SNP genotype profiles. It combines:
-  - [[http://hapmap.ncbi.nlm.nih.gov/|HapMap]] data, used to generate reference models for specific WES platforms;
+  - [[http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/|1,000 Genomes Project]] genotype data, used to generate reference models for specific WES platforms;
-  - [[public:aseq|ASEQ]], used to genotype the input samples, and
+  - [[public:aseq|ASEQ]], used to genotype the input samples with unknown ethnicity, and
-  - [[http://genepath.med.harvard.edu/~reich/EIGENSTRAT.htm|EIGENSTRAT]], used to perform principal component analysis on the aggregated genotyped data.
+  - [[http://genepath.med.harvard.edu/~reich/EIGENSTRAT.htm|EIGENSTRAT]], used to perform principal component analysis on the aggregated genotyped data. \\
-\\
+The package is organized as follows:\\
-EthSEQ script has the following syntax: \\
-  EthSEQ.R <ConfigurationFile.R>
-Folders are organized as follows:\\
   EthSEQ
    -> EthSEQ.R
    -> Functions.R
+   -> CreateReferenceModel.R
+   -> MultiStepRefinementAnalysis.R
    -> Models
    -> VCF
@@ Line 25: / Line 23: @@
    -> Example
-EthSEQ.R is the R script required to infer ethnicity with Functions.R file containing utility functions.\\
+  * EthSEQ.R is the R script required to infer ethnicity while Functions.R contains utility functions.\\
-The Models folder contains HapMap models for specific WES platforms (Haloplex, SureSelect version2 and version 4 are currently available).\\
+  * CreateReferenceModel.R is the R script required to generate a new reference model given genotype data of ethnic groups and a set of specified captured regions.\\
-The VCF folder contains the **lists of SNPs** (in VCF format) used to generate the HapMap models for the different WES platforms.\\
+  * MultiStepRefinementAnalysis.R is the R script that implements the multi-step refinement analysis.\\
-The Include folder contains ASEQ binaries and EIGENSTRAT software folder with pre-compiled binaries available.
+  * The Models folder contains reference models for a set of WES platforms (Agilent Haloplex, Roche Nimblegen version 3, Agilent SureSelect version 2 and version 4 are currently available); models are built from 1,000 Genomes Project genotype data. **The Models folder is shipped empty; models should be downloaded separately from the links at the bottom of this page and uncompressed in the Models folder.**\\
-The Example folder contains and example of configuration file
+  * The VCF folder contains the **lists of SNPs** (in VCF format) used to generate the reference models for the available WES platforms.\\
-\\ \\
+  * The Include folder contains the ASEQ binaries and the EIGENSTRAT software folder with pre-compiled binaries.\\
+  * The Example folder contains an example configuration file, an example BAM input list for individuals with unknown ethnicity, and an example output report.
+----
+=== INFER THE ETHNICITY OF A SET OF INDIVIDUALS ===
+To infer the ethnicity of a set of individuals, run the EthSEQ.R script as follows: \\
+  Rscript EthSEQ.R <ConfigurationFile.R>
 The configuration file has the following structure:\\ \\
 ##########################################################################\\
-## Basic folders\\
+# Basic folders\\
 source.dir = "EthSEQ path"\\
 bam.list = "path to a text file containing the list of BAM files to be analyzed"\\
 out.dir = "path to the output folder"\\
-eigenstrat.path = "path to EIGENTRAT binaries folder"\\ \\
+eigenstrat.path = "path to EIGENSTRAT binaries folder"\\
+\\
-## Models available \\
+# Models available \\
-## SS2 = Sure Select version 2\\
+# SS2 = Agilent Sure Select version 2\\
-## SS4 = Sure Select version 4\\
+# SS4 = Agilent Sure Select version 4\\
-## HALO = Haloplex\\
+# HALO = Agilent Haloplex\\
+# NimblegenV3 = Roche Nimblegen V3\\
 model = "HALO"\\
 \\
-## ASEQ parameters\\
+# To run the analysis with your own reference model, uncomment the following lines and set the required variables\\
+# model = "" # keep this empty\\
+# vcf.file = "path to VCF file"\\
+# sif.file = "path to file with ethnicity annotations"\\
+# model.ped = "path to PED file with genotype information"\\
+# model.map = "path to MAP file describing the variants in the PED file"
+\\
+\\
+# ASEQ parameters\\
 ASEQ.path = "path to ASEQ binary"\\
 mbq=20 # minimum base quality\\
@@ Line 52: / Line 68: @@
 cores=10 # number of cores to be used\\
 \\
-## output details\\
+# analysis options\\
-verbose=F\\
+run.genotype=TRUE\\
-##########################################################################\\ \\
+reduce.composite.model = TRUE\\
+composite.model.call.rate = 1\\
+\\
+# output details\\
+verbose=FALSE\\
+##########################################################################
+----
+=== MULTI-STEP REFINEMENT METHOD  ===
+To infer the ethnicity of a set of individuals using the multi-step refinement method, run the MultiStepRefinementAnalysis.R script as follows: \\
+  Rscript MultiStepRefinementAnalysis.R <ConfigurationFile.R>
+The configuration file extends the previous specification by adding the ethnic group sets specification:\\
+# Subsets specification\\
+subsets = list(c("AFR","SAS","EAS","EUR","ASH"),c("EUR","ASH"))\\
+----
+=== CREATE A REFERENCE MODEL  ===
+To create a new reference model, run the CreateReferenceModel.R script as follows: \\
+  Rscript CreateReferenceModel.R
+after setting the following variables in the script code:\\
+\\
+## Parameters\\
+sif.file = "specify_path_to_file"\\
+vcf.file = "specify_path_to_file"\\
+phased = FALSE # TRUE if genotypes in VCF format are phased\\
+out.dir = "specify_path_to_dir"\\
+model.name = "specify_model_name"\\
+call.rate = 1 # fraction of samples with genotype calls for a specific SNP\\
+\\
+The sample information file (sif.file) should have the following tab-separated format:\\
+//Sample\tRace\tGender\\
+S1\tEUR\tmale\\
+S2\tAFR\tfemale\\
+...//\\ \\
+The VCF file should specify the minor allele frequency (MAF) in the INFO column (e.g. MAF=0.32) for all variants.\\
+Genotypes should be specified using either phased notation (e.g. 0|0,1|0,0|1,1|1,...) or unphased notation (e.g. 0/0,0/1,1/1,...).
 ----
 === REQUIREMENTS ===
 EthSEQ requires Linux kernel >= 2.6.15.\\
-EthSEQ requires R >= 2.7 and the package SDMTools.\\
+EthSEQ requires R >= 2.7 and the package "rgeos".\\
 EthSEQ requires absolute (global) folder paths.\\
-EthSEQ requires ASEQ (version 1.1.8 available in Tools folder) available also [[public:aseq|here]]\\
+EthSEQ requires ASEQ (version 1.1.11 available in Tools folder) available also [[public:aseq|here]]\\
 EthSEQ requires EIGENSTRAT (version 5.0.2 available in Tools folder) available also [[http://genepath.med.harvard.edu/~reich/EIGENSTRAT.htm|here]]\\
 ----
@@ Line 74: / Line 132: @@
 ----
 === DOWNLOADS ===
+== Tool versions ==
-  * {{:ethseq-v0.1.zip|EthSEQ}}
+  * {{:EthSEQ_v1_0.zip|EthSEQ_v1_0.zip}}
+== Reference Models ==
+  * {{:1000GP_HALO.zip|Reference model for Agilent HaloPlex WES design}}
+  * {{:1000GP_SS2.zip|Reference model for Agilent SureSelectV2 WES design}}
+  * {{:1000GP_SS4.zip|Reference model for Agilent SureSelectV4 WES design}}
+  * {{:1000GP_NimblegenV3.zip|Reference model for Roche Nimblegen V3 WES design}}
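
The inputs described under CREATE A REFERENCE MODEL can be prepared and sanity-checked before running CreateReferenceModel.R. The following is a minimal R sketch, not part of EthSEQ: it writes a sample information file in the tab-separated Sample/Race/Gender layout shown above and checks that each VCF record carries a MAF entry in its INFO column. The helper name `has_maf`, the output file name `samples.sif`, and the toy VCF records are hypothetical and used for illustration only; the exact checks performed by EthSEQ itself are not documented here.

```r
# Hypothetical example samples, using the column layout shown in the page.
sif <- data.frame(
  Sample = c("S1", "S2"),
  Race   = c("EUR", "AFR"),
  Gender = c("male", "female"),
  stringsAsFactors = FALSE
)

# Write a tab-separated file with a header row, no quoting and no row names.
sif.file <- file.path(tempdir(), "samples.sif")  # hypothetical file name
write.table(sif, file = sif.file, sep = "\t", quote = FALSE, row.names = FALSE)

# Hypothetical helper: for each non-header VCF line, return TRUE if the
# INFO column (8th tab-separated field) contains a MAF=<number> entry.
has_maf <- function(vcf.lines) {
  records <- vcf.lines[!grepl("^#", vcf.lines)]
  info <- vapply(strsplit(records, "\t", fixed = TRUE),
                 function(fields) fields[8], character(1))
  grepl("(^|;)MAF=[0-9.]+", info)
}

# Toy VCF content (made up): the first record has MAF, the second does not.
vcf.lines <- c(
  "##fileformat=VCFv4.1",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
  "1\t1000\trs1\tA\tG\t.\tPASS\tMAF=0.32\tGT\t0/1\t1/1",
  "1\t2000\trs2\tC\tT\t.\tPASS\tAC=3\tGT\t0/0\t0/1"
)
maf.ok <- has_maf(vcf.lines)
```

For a real VCF file, `has_maf(readLines(vcf.file))` would flag the records that lack MAF information, so they can be fixed before building the model.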

Demichelis Lab - Computational Oncology