SNP panel identification assay (SPIA): a genetic-based assay for the identification of cell lines


Translational research hinges on the ability to make observations in model systems and to translate those findings into clinical applications, such as the development of diagnostic tools or targeted therapeutics. Tumor cell lines are commonly used to model carcinogenesis. The same tumor cell line can be simultaneously studied in multiple research laboratories throughout the world, theoretically generating results that are directly comparable. One important assumption in this paradigm is that researchers are working with the same cells. However, recent work using high-throughput genomic analyses questions the accuracy of this assumption. Observations by our group and others suggest that experiments reported in the scientific literature may contain pre-analytic errors due to inaccurate identities of the cell lines employed. To address this problem, we developed a simple approach that enables an accurate determination of cell line and sample identity. We have described (Demichelis F, et al., Nucleic Acids Research, 2008) the empirical development of a SNP panel identification assay (SPIA) compatible with routine use in the laboratory setting to ensure the identity of tumor cell lines and human tumor samples throughout the course of long-term research use.

spia_distrib.jpg: Schematic illustration of probabilistic test settings. The figure shows the binomial distributions of the real match pair population (red dots) and of the non-pair population (blue dots) for N equal to 30, with PM (real match pair population) and Pnon-M (non-pair population) equal to 0.9 and 0.4, respectively. The red, blue and green bars define regions of ‘different’ (mnon-M set equal to 1), ‘uncertain’ and ‘similar’ (mM set equal to 2) SPIA test calls. The smaller the number of SNPs is, the narrower the region of uncertainty and the higher the probability of making an incorrect call.

spia_schema.jpg: Schema of SNP panel identification assay (SPIA) applicability and use modality.
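The binomial setting described in the spia_distrib.jpg caption can be reproduced numerically. The sketch below is illustrative only: it reads PM and Pnon-M as per-SNP probabilities that two samples agree, and the helper name overlap_mass is hypothetical, not part of SPIA. It uses the probability mass shared by the two distributions as a rough indicator of how easily the two populations can be confused.

```r
# Settings taken from the figure caption.
N <- 30            # number of SNPs
p_match <- 0.9     # PM: real match pair population
p_nonmatch <- 0.4  # Pnon-M: non-pair population

# Probability of observing k agreeing SNPs out of N in each population.
k <- 0:N
match_pmf <- dbinom(k, size = N, prob = p_match)
nonmatch_pmf <- dbinom(k, size = N, prob = p_nonmatch)

# Hypothetical helper: probability mass shared by the two distributions.
overlap_mass <- function(n, p1, p2) {
  counts <- 0:n
  sum(pmin(dbinom(counts, n, p1), dbinom(counts, n, p2)))
}

overlap_30 <- overlap_mass(30, p_match, p_nonmatch)
overlap_10 <- overlap_mass(10, p_match, p_nonmatch)

# Fewer SNPs leave the two populations harder to tell apart.
print(c(N30 = overlap_30, N10 = overlap_10))
```

Plotting match_pmf and nonmatch_pmf against k gives the two dotted curves of the figure; the shared mass grows as N shrinks, which is the effect the caption describes.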


SPIA comes in two versions.
The R package SPIAssay can be installed directly from the CRAN project website.
The standalone script SPIA can be downloaded and works under Linux.

System Requirements

SPIA requires an R installation (>= 1.8.0).

Standalone SPIA

The standalone script includes a file SPIA.R, which needs to be executable, and a file SPIAfunctions.R, which contains all the functions used by SPIA.R. SPIA searches for Rscript in the environment and can be run from the command line with the command

SPIA.R <config_file>

The <config_file> follows R syntax and contains four sections:

  1. Input files
  2. Parameters for the SPIA statistical test
  3. Output files
  4. Other parameters

1. Input files

## Location of SPIAfunctions.R
SPIAfunctions_location = "path to SPIAfunctions.R file"

## List of VCF files. Each VCF file must have at least one genotype column. If two VCF files contain the genotype of the same sample (identical sample ID), only the last one is used. If the list of SNPs in a VCF file does not match the list of SNPs of the first VCF, it will be ignored.
vcfFileList = "path to VCF file list"

2. Parameters for the SPIA statistical test

## Probability that two matching samples (e.g. biological/technical replicates, normal and tumor from same individual) have different genotypes
Pmm = 0.1

## Given N SNPs, the maximum allowed distance between two matching samples is Pmm + N * nsigma
nsigma = 2

## Probability that two unrelated samples have different genotypes (e.g. with ideal SNPs close to 0.6)
Pmm_nonM = 0.6

## Given N SNPs, the minimum allowed distance between two unrelated samples is Pmm_nonM - N * nsigma_nonM
nsigma_nonM = 5

## Minimum percentage of valid SNPs genotypes required to perform the SPIA statistical test (out of N SNPs)
PercValidCall = 0.7
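
How these parameters could combine into a three-way call is sketched below. This is not the SPIA implementation: classify_pair is a hypothetical helper, the distance is assumed to be the fraction of valid SNPs whose genotypes differ, and nsigma and nsigma_nonM are read as numbers of binomial standard deviations, sqrt(p * (1 - p) / N), around the expected mismatch rates. Those readings are assumptions of the sketch rather than something the comments above state.

```r
# Hypothetical helper, not part of SPIA.
# g1, g2: genotypes for the same SNPs, coded "AA", "AB", "BB" or NA.
classify_pair <- function(g1, g2, Pmm = 0.1, nsigma = 2,
                          Pmm_nonM = 0.6, nsigma_nonM = 5,
                          PercValidCall = 0.7) {
  stopifnot(length(g1) == length(g2))
  valid <- !is.na(g1) & !is.na(g2)
  n_valid <- sum(valid)

  # Too few SNPs genotyped in both samples: this sketch makes no call.
  if (n_valid / length(g1) < PercValidCall) {
    return(list(distance = NA_real_, call = NA_character_))
  }

  # ASSUMPTION: distance is the fraction of valid SNPs that differ.
  distance <- mean(g1[valid] != g2[valid])

  # ASSUMPTION: limits lie nsigma standard deviations from the expected rates.
  match_limit <- Pmm + nsigma * sqrt(Pmm * (1 - Pmm) / n_valid)
  nonmatch_limit <- Pmm_nonM -
    nsigma_nonM * sqrt(Pmm_nonM * (1 - Pmm_nonM) / n_valid)

  call <- if (distance <= match_limit) {
    "Similar"
  } else if (distance >= nonmatch_limit) {
    "Different"
  } else {
    "Uncertain"
  }
  list(distance = distance, call = call)
}

# Made-up genotypes for 100 SNPs, for illustration only.
reference <- rep(c("AA", "AB", "BB", "AB"), 25)
swap <- c(AA = "AB", AB = "BB", BB = "AA")
replicate_like <- reference
replicate_like[1:5] <- swap[reference[1:5]]    # 5% of genotypes changed
unrelated_like <- reference
unrelated_like[1:50] <- swap[reference[1:50]]  # 50% of genotypes changed

print(classify_pair(reference, replicate_like)$call)
print(classify_pair(reference, unrelated_like)$call)
```

With few SNPs the two limits computed this way can cross (with the default values above they do at N = 30), in which case the sketch simply tests for "Similar" first.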

3. Output files

## SPIA table
outSPIAtable_file = "path of the SPIA output table"

## SPIA can optionally plot a graphical representation of the test
saveSPIAplot = T

## SPIA plot file name
SPIAplot_file = "path of the SPIA output graph"

## Save SPIA genotype (for debugging)
saveGenotype = F

## SPIA genotype table file name
genotypeTable_file = ""

4. Other parameters

# Print verbose information (for debugging purposes)
verbose = F

# Print output on screen (if F, a log file is created instead)
print_on_screen = T
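
Because the config file is plain R, it can be checked before a run. The sketch below is a hypothetical helper, not part of SPIA: it sources a config file into its own environment and lists which of the settings named in the four sections above are absent. Treating every one of those settings as required is an assumption, and the path in the example config is a placeholder.

```r
# Settings named in the four config sections above.
required_settings <- c(
  "SPIAfunctions_location", "vcfFileList",
  "Pmm", "nsigma", "Pmm_nonM", "nsigma_nonM", "PercValidCall",
  "outSPIAtable_file", "saveSPIAplot", "SPIAplot_file",
  "saveGenotype", "genotypeTable_file",
  "verbose", "print_on_screen"
)

# Hypothetical helper: evaluate the config in a private environment
# and return the names of the settings it does not define.
missing_settings <- function(config_file) {
  cfg <- new.env()
  source(config_file, local = cfg)
  setdiff(required_settings, ls(cfg))
}

# Illustrative config that only fills in sections 1 and 2.
example_config <- tempfile(fileext = ".R")
writeLines(c(
  'SPIAfunctions_location = "path/to/SPIAfunctions.R"  # placeholder',
  'vcfFileList = "path/to/vcfList.txt"                 # placeholder',
  "Pmm = 0.1",
  "nsigma = 2",
  "Pmm_nonM = 0.6",
  "nsigma_nonM = 5",
  "PercValidCall = 0.7"
), example_config)

print(missing_settings(example_config))
```

Sourcing into a separate environment keeps the config values out of the global workspace, so the check does not interfere with an interactive session.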

SPIA output table

The SPIA output table has one row for each possible pair of samples analyzed. Each row includes the following information:

  • Sample_1 and Sample_2: identifiers of the two samples analyzed
  • Distance: genotype distance computed by SPIA
  • SPIA_Score: indicates whether Sample_1 and Sample_2 are called Similar, Different, or Uncertain
  • SNP_available: number of valid SNPs used for computing genotype distance
  • Total_SNP: total number of SNPs provided
  • One_SNP_NA: number of SNPs without genotype information in exactly one sample
  • Bot_SNP_NA: number of SNPs without genotype information in both samples
  • Diff_AvsB_or_BvsA: number of SNPs with genotype AA in Sample_1 and genotype BB in Sample_2, or vice versa
  • Diff_AorBvsAB_or_vic: number of SNPs with genotype AA or BB in Sample_1 and genotype AB in Sample_2, or vice versa
  • DiffABvsAorB: number of SNPs with genotype AB in Sample_1 and genotype AA or BB in Sample_2
  • counterBothHomoz: number of SNPs homozygous in both Sample_1 and Sample_2
  • counterBothHeter: number of SNPs heterozygous in both Sample_1 and Sample_2
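
The output table can be post-processed in R, for example to pull out the pairs that need a closer look. The sketch below assumes the table is tab-delimited with a header row, which is not stated above, and uses made-up rows with only a subset of the columns so that it runs on its own.

```r
# Made-up rows for illustration only; real values come from the file
# named by outSPIAtable_file.
example_table <- paste(
  "Sample_1\tSample_2\tDistance\tSPIA_Score\tSNP_available\tTotal_SNP",
  "cellA\tcellA_rep\t0.02\tSimilar\t140\t143",
  "cellA\tcellB\t0.55\tDifferent\t141\t143",
  "cellA_rep\tcellB\t0.30\tUncertain\t120\t143",
  sep = "\n"
)

# ASSUMPTION: tab-delimited with a header row.
# For a real run, replace text = ... with the path to the output table.
spia <- read.delim(text = example_table, stringsAsFactors = FALSE)

# Pairs called as the same sample, and pairs the test could not resolve.
similar_pairs <- spia[spia$SPIA_Score == "Similar",
                      c("Sample_1", "Sample_2", "Distance")]
needs_review <- spia[spia$SPIA_Score == "Uncertain", ]

print(table(spia$SPIA_Score))
print(similar_pairs)
```

Pairs flagged Uncertain are worth checking against SNP_available, since a call based on few valid SNPs carries less weight.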


The SPIA package contains a directory Bin with the script SPIA.R and the file SPIAfunctions.R, which holds the SPIA functions. The package also comes with a ready-to-use example folder, Example. The folder contains a SPIA config file SPIA.configFile.R, a VCF file CEU.exon.2010_03.genotypes.143SNPs.vcf, and a list of one VCF file named 1000G.CEU.exon.vcfList.txt. To test SPIA, unzip the package, enter the folder SPIA, and type

./Bin/SPIA.R ./Example/SPIA.configFile.R

If SPIA successfully completes the analysis, you will find two more files within the Example folder, which represent the tabular and graphical output of SPIA, respectively.