US8898021

Method and system for DNA mixture analysis

The present invention pertains to a process for automatically analyzing mixed DNA samples. Specifically, the process comprises the steps of obtaining a mixed DNA sample; amplifying the DNA sample to produce a product; detecting the product to produce a signal; and analyzing the signal to determine information about the composition of the mixed DNA sample. This DNA mixture analysis is useful for finding criminals and convicting them. This mixture analysis provides high quality estimates, and can determine genotypes, mixture weights, and likelihood ratios. This analysis provides confidence measures in the results it computes, and generates reports and intuitive visualizations. The process automates a tedious manual procedure, thereby reducing the cost, time, and effort involved in DNA forensic analysis. The system can greatly accelerate the rate of DNA crime analysis, and be used to exonerate innocent people.

Claims: What is claimed is:

A method of analyzing a DNA mixture comprised of the steps:
(a) obtaining a DNA mixture that contains genetic material from at least two contributing individuals;
(b) amplifying the DNA mixture in a DNA amplification process to produce an amplification product comprising DNA fragments;
(c) producing from the amplification product a signal comprising signal peaks from the DNA fragments;
(d) detecting signal peak amounts in the signal, and quantifying the amounts to produce DNA lengths and concentrations from the mixture to form quantitative genotyping data;
(e) assuming a genotype value of alleles for a contributor to the quantitative genotyping data at a genetic locus;
(f) setting a mixture weight value for a relative proportion of the contributors to the quantitative genotyping data;
(g) forming a linear combination of the genotype values based on the mixture weight value;
(h) deriving with a computer a data variance of the amplification process from a model that includes both the quantitative genotyping data and the linear combination;
(i) determining with the computer a probability of the quantitative genotyping data corresponding to a set of suspects from the DNA mixture at the locus using both the linear combination and the data variance value;
(j) computing a probability of a genotype for one of the contributing individuals using the determined probability of the quantitative genotyping data; and
(k) comparing the genotype probability with a set of suspect genotypes to identify a likely suspect.
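The assuming, setting, forming, determining and computing steps of the claim above can be illustrated with a minimal sketch. It is not the patented implementation. It assumes a two-person mixture with one known contributor, peak data rescaled so the peaks at the locus sum to 2, independent normal errors with a fixed variance supplied by the caller (the claim instead derives the variance from the model), and a uniform prior over candidate genotypes. All function names and numbers are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def genotype_vector(allele_pair, alleles):
    """Allele-count vector for one person; the counts sum to 2."""
    return [sum(1 for a in allele_pair if a == allele) for allele in alleles]


def linear_combination(genotypes, weights):
    """Weighted sum of the contributors' genotype vectors."""
    n = len(genotypes[0])
    return [sum(w * g[i] for g, w in zip(genotypes, weights)) for i in range(n)]


def log_likelihood(data, expected, variance):
    """Log density of the data, assuming independent normal errors."""
    return sum(
        -0.5 * math.log(2 * math.pi * variance) - (d - m) ** 2 / (2 * variance)
        for d, m in zip(data, expected)
    )


def genotype_probabilities(data, alleles, known_pair, weight_unknown, variance):
    """Probability of each candidate genotype for the unknown contributor,
    assuming a uniform prior over all candidate allele pairs."""
    known = genotype_vector(known_pair, alleles)
    weights = [1.0 - weight_unknown, weight_unknown]
    scores = {}
    for pair in combinations_with_replacement(alleles, 2):
        candidate = genotype_vector(pair, alleles)
        expected = linear_combination([known, candidate], weights)
        scores[pair] = log_likelihood(data, expected, variance)
    top = max(scores.values())  # subtract the maximum for numerical stability
    unnorm = {pair: math.exp(s - top) for pair, s in scores.items()}
    total = sum(unnorm.values())
    return {pair: v / total for pair, v in unnorm.items()}


# Hypothetical locus with three alleles and made-up, rescaled peak data.
alleles = [12, 13, 14]
data = [0.72, 0.29, 0.99]
probs = genotype_probabilities(
    data, alleles, known_pair=(12, 14), weight_unknown=0.3, variance=0.01
)
best = max(probs, key=probs.get)
print(best, round(probs[best], 4))
```

With these made-up numbers, the candidate (13, 14) receives nearly all of the probability, because it is the only genotype whose weighted combination with the known contributor reproduces the observed peak pattern.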
The method as described in claim 1 wherein the amplifying step at a locus generates relative amounts of DNA fragments that are proportional to relative amounts of DNA template present in the DNA mixture.
The method as described in claim 2 wherein the detecting step generates relative amounts of signal that are proportional to the relative amounts of DNA fragments.
The method as described in claim 1 wherein the forming step includes a mathematical operation based on a linear model that relates the quantitative genotyping data to a product of a genotype matrix multiplied by a weight vector that describes the relative contribution of each individual that is considered in the DNA mixture.
The method as described in claim 4 wherein after the determining step there is the step of using a computing device to generate a visualization that shows the genotype matrix and the weight vector.
The method as described in claim 4 wherein the forming step includes iteratively determining the genotype matrix based on the weight vector, and the weight vector based on the genotype matrix.
The method as described in claim 4 wherein the genotype matrix contains entries that are not integer values.
The method as described in claim 7 wherein the noninteger values represent an efficiency of the amplification.
The method as described in claim 4 wherein the mathematical operation computes a genotype of an individual by subtracting from the data genotypes of other individuals in the mixture in proportion to the weight vector.
The method as described in claim 4 wherein the genotype matrix is a design matrix of the linear model.
The method as described in claim 4 wherein the weight vector is a regression parameter of the linear model.
The method as described in claim 4 wherein the data variance is a parameter of the linear model.
The method as described in claim 4 wherein the linear combination of genotypes is represented in a linear model.
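The linear-model claims above (data related to a genotype matrix multiplied by a weight vector, with the weight vector as a regression parameter and genotypes recovered by subtraction) can be sketched for the two-contributor case. This is an illustrative simplification: ordinary least squares is swapped in as the fitting method, the weights are assumed to sum to 1, and the function names and data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fit_two_person_weight(data, g1, g2):
    """Least-squares weight w for data ~ w*g1 + (1-w)*g2, plus residual variance."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0:
        raise ValueError("identical genotypes: the weight is not identifiable")
    w = dot([d - b for d, b in zip(data, g2)], diff) / denom
    w = min(1.0, max(0.0, w))  # keep the proportion inside [0, 1]
    fitted = [w * a + (1.0 - w) * b for a, b in zip(g1, g2)]
    rss = sum((d - f) ** 2 for d, f in zip(data, fitted))
    variance = rss / max(len(data) - 1, 1)  # one fitted parameter
    return w, variance


def subtract_known(data, known, w_known):
    """Estimate the other contributor's genotype by removing the known
    contributor's share of the data and rescaling by the remaining weight."""
    if not 0.0 <= w_known < 1.0:
        raise ValueError("w_known must be in [0, 1)")
    return [(d - w_known * k) / (1.0 - w_known) for d, k in zip(data, known)]


# Hypothetical data over alleles 12, 13, 14, rescaled to sum to 2.
data = [0.72, 0.29, 0.99]
g_known = [1, 0, 1]   # genotype (12, 14)
g_other = [0, 1, 1]   # genotype (13, 14)

w, variance = fit_two_person_weight(data, g_known, g_other)
estimate = subtract_known(data, g_known, w_known=0.7)
print(round(w, 3), [round(x, 2) for x in estimate])
```

The subtracted estimate is generally not integer valued; rounding it gives a discrete allele-count vector, while keeping the fractional values retains information about measurement noise.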
The method as described in claim 1 wherein the computing step includes determining a probability of each genotype in a set of genotypes of an individual whose DNA is in the DNA mixture.
The method as described in claim 1 wherein the computing step includes determining a relative weight of an individual's DNA in the DNA mixture.
The method as described in claim 1 wherein the computing step includes determining a statistical confidence in the genotype of the DNA mixture.
The method as described in claim 1 wherein the computing step includes recording a genotype likelihood or probability of an individual in a report.
The method as described in claim 1 wherein the set of suspect genotypes contains a genotype of a convicted offender individual.
The method as described in claim 1 wherein the determining step can process at least one DNA mixture per hour.
The method as described in claim 1 wherein the determining step can determine a genotype using a computing device with memory.
The method as described in claim 20 wherein the time spent performing the determining step does not exceed one hour.
The method as described in claim 20 wherein the determining step is performed entirely by computer, without any human intervention during the determining step.
The method as described in claim 1 wherein the amplification includes a single nucleotide polymorphism (SNP) marker.
The method as described in claim 1 wherein the analysis includes more than one DNA experiment.
The method as described in claim 1 wherein the DNA mixture contains genetic material from more than two individuals.
The method as described in claim 1 wherein the assuming step includes a candidate genotype selected from a database of previously determined genotypes.
The method as described in claim 26 wherein the determining step produces a set of ranked genotypes that are in the database and match the data.
The method as described in claim 1 wherein the amplifying step includes a low copy number PCR protocol.
The method as described in claim 1 wherein the data variance model in the deriving step includes an inverse chi square probability distribution.
The method as described in claim 1 wherein the probability of the quantitative genotyping data in the determining step is a multivariate normal distribution.
The method as described in claim 1 wherein the probability of the quantitative genotyping data is combined with a genotype prior distribution to form a posterior genotype probability distribution.
The method as described in claim 31 wherein the prior distribution is computed from population allele frequency information.
The method as described in claim 31 wherein the prior distribution is uniformly distributed.
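The combination of a data probability with a genotype prior, as described in the three claims above, can be sketched as follows. The claims do not state how the prior is computed from allele frequencies; this sketch assumes Hardy-Weinberg proportions, and the function names, frequencies and likelihood values are hypothetical.

```python
from itertools import combinations_with_replacement


def hardy_weinberg_prior(freqs):
    """Genotype prior from allele frequencies, assuming Hardy-Weinberg
    proportions: p*p for homozygotes and 2*p*q for heterozygotes."""
    prior = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        prior[(a, b)] = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
    return prior


def uniform_prior(alleles):
    """Equal prior probability for every unordered allele pair."""
    pairs = list(combinations_with_replacement(sorted(alleles), 2))
    return {pair: 1.0 / len(pairs) for pair in pairs}


def posterior(likelihood, prior):
    """Posterior genotype probability: likelihood times prior, normalized."""
    unnorm = {g: likelihood.get(g, 0.0) * p for g, p in prior.items()}
    total = sum(unnorm.values())
    if total == 0:
        raise ValueError("no genotype has both nonzero likelihood and prior")
    return {g: v / total for g, v in unnorm.items()}


# Hypothetical allele frequencies and data likelihoods at one locus.
freqs = {12: 0.2, 13: 0.3, 14: 0.5}
likelihood = {(13, 14): 0.6, (14, 14): 0.4}

prior = hardy_weinberg_prior(freqs)
post_population = posterior(likelihood, prior)
post_uniform = posterior(likelihood, uniform_prior(freqs))
print(round(post_population[(13, 14)], 4), round(post_uniform[(13, 14)], 4))
```

Under the uniform prior the posterior simply mirrors the likelihood, whereas the population-based prior shifts probability toward genotypes that are more common in the population.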
A method as described in claim 1, where the quantitative genotyping data obtained in Step (d) is from a short tandem repeat (STR) locus.
A method as described in claim 1, where the quantitative genotyping data obtained in Step (d) is from a Y chromosome short tandem repeat (Y-STR) locus.
A method as described in claim 1, where the quantitative genotyping data obtained in Step (d) is low copy.
A method as described in claim 1, where the quantitative genotyping data obtained in Step (d) is derived from allele peak height or area.
A method as described in claim 1, where the quantitative genotyping data obtained in Step (d) is derived from repeated experiments.
A method as described in claim 1, where the quantitative genotyping data obtained in Step (d) entails fewer loci.
A method as described in claim 1, where the genotype value assumed in Step (e) has an allele count that is continuous, rather than discrete.
A method as described in claim 1, where the genotype value assumed in Step (e) accounts for allele dropout.
A method as described in claim 1, where the genotype value assumed in Step (e) includes a representation of PCR stutter.
A method as described in claim 1, where the genotype value assumed in Step (e) includes a representation of relative amplification.
A method as described in claim 1, where the mixture weight value set in Step (f) is related to a mixture weight probability distribution.
A method as described in claim 1, where the linear combination formed in Step (g) contains one genotype.
A method as described in claim 1, where the linear combination formed in Step (g) contains two genotypes.
A method as described in claim 1, where the linear combination formed in Step (g) contains three genotypes.
A method as described in claim 1, where the linear combination formed in Step (g) contains more than three genotypes.
A method as described in claim 1, where the data variance value derived in Step (h) is conditioned on a genotype value.
A method as described in claim 1, where the data variance value derived in Step (h) is a data covariance matrix.
A method as described in claim 1, where the data variance value derived in Step (h) is obtained from a calibration.
A method as described in claim 1, where the data variance value derived in Step (h) is related to a data variance probability distribution.
A method as described in claim 1, where the data variance value derived in Step (h) scales with peak height or area.
A method as described in claim 1, where the probability of the quantitative genotyping data determined in Step (i) involves a simulated genotype.
A method as described in claim 1, where the probability of the quantitative genotyping data determined in Step (i) uses a function with continuous, rather than binary, values.
A method as described in claim 1, where the probability of the quantitative genotyping data determined in Step (i) includes a hierarchical Bayesian model.
A method as described in claim 1, where the probability of the quantitative genotyping data determined in Step (i) uses Markov chain Monte Carlo methods.
A method as described in claim 1, where a computer draws a visualization of the computed genotype probability.
A method as described in claim 1, where genotypes are ranked in the order of their computed probability values.
A method as described in claim 1, where DNA mixture data is separated into component genotypes having computed probability.
A method as described in claim 1, where a computer generates a genotype report from the computed probability that includes a figure or a table.
A method as described in claim 1, where computed genotype probabilities are stored on a DNA database.
A method as described in claim 1, where separated genotypes of contributors to a mixture are stored on a DNA database, along with their computed probability.
A method as described in claim 1, where the computed genotype probabilities provide more specific investigative leads.
A method as described in claim 1, where a service center computes the genotype probabilities from the quantitative genotyping data.
A method of analyzing a DNA mixture comprised of the steps:
(a) obtaining a DNA mixture that contains genetic material from at least two contributing individuals;
(b) amplifying the DNA mixture in a DNA amplification process to produce an amplification product comprising DNA fragments;
(c) producing from the amplification product a signal comprising signal peaks from the DNA fragments;
(d) detecting signal peak amounts in the signal, and quantifying the amounts to produce DNA lengths and concentrations from the mixture to form quantitative genotyping data;
(e) assuming a genotype value of alleles for a contributor to the quantitative genotyping data at a genetic locus;
(f) setting a mixture weight value for a relative proportion of the contributors to the quantitative genotyping data;
(g) forming a linear combination of the genotype values based on the mixture weight value;
(h) deriving with a computer a data variance of the amplification process from a model that includes both the quantitative genotyping data and the linear combination;
(i) determining with the computer a probability of the quantitative genotyping data corresponding to a set of suspects from the DNA mixture at the locus using both the linear combination and the data variance value;
(j) calculating a likelihood ratio of a hypothesis using the determined probability of the quantitative genotyping data; and
(k) comparing with a set of suspect genotypes using the likelihood ratio to identify a likely suspect.
A method as described in claim 66, where the likelihood ratio is calculated from a plurality of genetic loci.
A method as described in claim 66, where the likelihood ratio hypothesis is related to identifying an individual.
A method as described in claim 66, where the likelihood ratio is calculated on genotypes on a DNA database to find genotype association strength.
A method as described in claim 66, where the calculated likelihood ratio provides greater certainty in human identification.