DNA evidence is the pre-eminent tool in the modern forensic scientist’s toolbox. It is widely accepted by the public, scientific and legal communities, and it has been instrumental in determining both the innocence and the guilt of individuals involved in the legal process. Despite this widespread acceptance, some members of all these communities remain uneasy about the statistical measures used to evaluate DNA evidence. In particular, some people regard the random match probabilities associated with DNA evidence as too small to be credible, or as essentially unsupportable. This vignette covers the basics of STR profiles and serves as a reference for the package’s other vignettes:
In the db_vignette (“Empirical testing of DNA match probabilities”) we discuss what it means for a pair of DNA profiles to match or partially match, and we present how the DNAtools package allows a rational examination of the statistical properties of a DNA database.
In the noa_vignette (“On the exact distribution of the numbers of alleles in DNA mixtures”) we show how to calculate the distribution of the number of distinct alleles present in a DNA mixture constituted by an arbitrary number of contributors.
Forensic genetics has its own terminology, which we briefly explain here.
Human DNA consists of 23 pairs of chromosomes and those chromosomes are
composed of a sequence of nucleotides which are labelled A, G, C and T
after the bases adenine, guanine, cytosine and thymine that are used to
form them.
Modern DNA typing uses short tandem repeats (STRs). These are regions of
DNA which are highly variable, but are patterned in that they consist of
repeats of a short sequence of DNA bases. The locations at which this
information is collected are called loci, and the (length) variations in
the patterns observed at each locus are called alleles. We have two
alleles at each locus, because humans are a diploid species, meaning
they have two copies of each chromosome. One allele comes from our
mother, and the other from our father.
A pair of alleles at a locus is called a genotype, and a DNA profile is therefore a multi-locus genotype. Modern forensic laboratories genotype DNA evidence using commercial kits, called multiplexes, which consist of 9–17 loci. The multiplex currently used in the United Kingdom (and, until recently, in New Zealand and Denmark) is called AmpFlSTR SGM Plus, or SGM Plus for short, and consists of 10 loci plus one gender-specific locus, Amelogenin. Forensic laboratories in the United States that load profiles into the FBI’s Combined DNA Index System (CODIS) collect a core set of thirteen loci, although they are not constrained to use a single multiplex.
| Locus   | vWA    | D18    | TH01   | D2     | D8     | D3     | FGA    | D16    | D21    | D19    |
| Alleles | 15, 18 | 14, 17 | 6, 9.3 | 17, 23 | 12, 15 | 15, 15 | 19, 23 | 11, 12 | 28, 28 | 13, 14 |
The table above shows a DNA profile from the SGM Plus multiplex. There
are two numbers at each locus, representing the two alleles that make up
the genotype at that locus. Each number is the number of times the
pattern, or motif, that characterises the alleles at the locus is
repeated. For example, this person’s genotype at the locus TH01 is
6, 9.3. This means that on one chromosome the motif for TH01, TCAT, was
repeated 6 times, and on the other chromosome it was repeated 9 times
and then followed by TCA. The .3 indicates that three of the four bases
of the motif are present in a final, incomplete repeat.
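The allele notation described above can be made concrete with a short sketch. The code is illustrative Python and is not part of DNAtools: the names `PROFILE`, `parse_allele` and `allele_length_in_bases` are hypothetical, and the default motif length of four bases is an assumption taken from the TH01 example (TCAT).

```python
from typing import Dict, Tuple

# The SGM Plus profile from the table above, stored as locus -> (allele, allele).
# Alleles are kept as strings so that a designation such as "9.3" is read as
# "9 repeats plus 3 bases" and never treated as a decimal number.
PROFILE: Dict[str, Tuple[str, str]] = {
    "vWA": ("15", "18"),
    "D18": ("14", "17"),
    "TH01": ("6", "9.3"),
    "D2": ("17", "23"),
    "D8": ("12", "15"),
    "D3": ("15", "15"),
    "FGA": ("19", "23"),
    "D16": ("11", "12"),
    "D21": ("28", "28"),
    "D19": ("13", "14"),
}


def parse_allele(designation: str) -> Tuple[int, int]:
    """Split an allele designation into (complete repeats, extra bases)."""
    whole, _, partial = designation.partition(".")
    extra_bases = int(partial) if partial else 0
    return int(whole), extra_bases


def allele_length_in_bases(designation: str, motif_length: int = 4) -> int:
    """Length of the repeat region in bases, assuming a fixed motif length."""
    repeats, extra_bases = parse_allele(designation)
    if extra_bases >= motif_length:
        raise ValueError("an incomplete repeat must be shorter than the motif")
    return repeats * motif_length + extra_bases
```

Under these assumptions, the TH01 genotype 6, 9.3 corresponds to repeat regions of 24 and 39 bases: `allele_length_in_bases("6")` and `allele_length_in_bases("9.3")` respectively.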
DNAtools package
The aim of the DNAtools package is to provide statisticians and forensic
scientists with access to the specific procedures described in the other
vignettes. For example, for the database matching exercise
(db_vignette), early implementations by Weir (2004) and then
Curran, Walsh, and Buckleton (2007) required custom-written code for
each new database and, in the case of Curran, Walsh, and Buckleton
(2007), generation of at least half a dozen precursor files and a
significant amount of memory. Tvedebrink (2010) and Tvedebrink et al.
(2012) reduced the computational effort of Weir (2004) and
Curran, Walsh, and Buckleton (2007) by deriving recursion formulas for
the expectation and variance of the computed summary statistics.
DNAtools aims to make all of these procedures easier to use in R.
DNAtools
The listed vignettes describe the main features of the package, which allow statisticians and forensic scientists to easily examine the properties of a forensic DNA database. In particular, our package makes it simple to carry out a database comparison exercise, in which every DNA profile in the database is compared to every other profile in the database, and to compare the resulting numbers of observed pairs of matching and partially matching profiles to their expectations under a set of population genetic assumptions. Similarly, the distribution of the number of distinct alleles in high-order DNA mixtures is easily computed.
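The database comparison exercise can be sketched in a few lines of brute-force code. This is illustrative Python, not the DNAtools implementation, and it rests on assumptions that should be checked against db_vignette: a locus is counted as matching when both alleles agree, and as partially matching when exactly one allele is shared. The function names and the three-profile toy database are invented for illustration only.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Genotype = Tuple[str, str]
Profile = Dict[str, Genotype]


def shared_alleles(g1: Genotype, g2: Genotype) -> int:
    """Number of alleles two genotypes share (0, 1 or 2), counted as multisets."""
    return sum((Counter(g1) & Counter(g2)).values())


def compare_profiles(p1: Profile, p2: Profile) -> Tuple[int, int]:
    """Return (matching loci, partially matching loci) over the common loci."""
    matches = partials = 0
    for locus in p1.keys() & p2.keys():
        shared = shared_alleles(p1[locus], p2[locus])
        if shared == 2:
            matches += 1
        elif shared == 1:
            partials += 1
    return matches, partials


def summarise_database(database: List[Profile]) -> Counter:
    """Tabulate (matches, partial matches) over every pair of profiles."""
    return Counter(
        compare_profiles(p1, p2) for p1, p2 in combinations(database, 2)
    )


# Made-up toy database with two loci, for illustration only.
TOY_DATABASE: List[Profile] = [
    {"vWA": ("15", "18"), "TH01": ("6", "9.3")},
    {"vWA": ("15", "18"), "TH01": ("6", "7")},
    {"vWA": ("16", "17"), "TH01": ("9.3", "9.3")},
]

summary = summarise_database(TOY_DATABASE)
```

Each key of `summary` is a pair (number of matching loci, number of partially matching loci) and each value is the number of profile pairs with that outcome. Comparing every pair explicitly takes a number of comparisons that grows quadratically with the size of the database, so the sketch is only meant to show what is being counted.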