---
title: "Introduction"
output: rmarkdown::html_vignette
date: "`r format(Sys.time(), '%d %B, %Y')`"
author: Mikkel Meyer Andersen
bibliography: refs.bib
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Introduction}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(fig.width = 5, fig.height = 5)
```
# Introduction
This vignette shows how to use the R package `disclapmix` that implements the method described in [@AndersenDisclap2013].
**The key function is `disclapmix_adaptive()`**.
For a more gentle introduction to the method, refer to [@AndersenDisclapIntroduction2013].
# Analysis
A Danish reference database [@Hallenberg2005YchromosomeSH] with $n = 185$ observations (male Y-STR haplotypes) at $r=10$ loci is available in the `danes` dataset.
Let us load the package as well as the data:
```{r}
library(ggplot2)
library(disclapmix)
data(danes)
```
The database is in compact format, i.e. one unique haplotype per row.
To fit the model, we need one observation per row.
This is done for example like this:
```{r}
db <- as.matrix(danes[rep(seq_len(nrow(danes)), danes$n), seq_len(ncol(danes)-1)])
str(db)
```
Also, note that the database is now an integer matrix.
# Short version
We fit models from 1 cluster up to as many necessary (see `?disclapmix_adaptive`
for how this is determined):
```{r}
fits <- disclapmix_adaptive(x = db)
```
We can extract how well each model fits the data (judged by the marginal BIC [@BIC]):
```{r}
BICs <- sapply(fits, function(x) x$BIC_marginal)
bestfit <- fits[[which.min(BICs)]]
bestfit
```
We can plot how well each model fits the data:
```{r}
clusters <- seq_along(fits) # or: sapply(fits, function(x) nrow(x$y))
plot(clusters, BICs, type = "b")
points(nrow(bestfit$y), bestfit$BIC_marginal, pch = 4, cex = 4, col = "red")
```
And use the model to predict haplotype frequencies:
```{r}
db_haplotypes <- as.matrix(danes[, seq_len(ncol(db))])
p <- predict(bestfit, db_haplotypes)
plot(log10(danes$n / sum(danes$n)), log10(p),
xlab = "Observed relative frequency",
ylab = "Discrete Laplace estimated probability")
abline(0, 1)
```
# Longer version
To fit a model using 2 clusters, the following command can be used (note the `L` postfix to emphasize that the number is an integer):
```{r, eval=FALSE}
fit <- disclapmix(x = db, clusters = 2L)
```
The number of clusters is not known beforehand.
Here, the numbers 1 through 5 are tried and the best one according to the BIC [@BIC] is taken:
```{r}
clusters <- 1L:5L
fits <- lapply(clusters, function(clusters) {
fit <- disclapmix(x = db, clusters = clusters)
return(fit)
})
marginalBICs <- sapply(fits, function(fit) {
fit$BIC_marginal
})
bestfit <- fits[[which.min(marginalBICs)]]
```
The best fit is now in the `bestfit` that can be inspected by `print` (default method called when the variable is just written) or `summary` which (currently) give the same output:
```{r}
bestfit
summary(bestfit)
```
We can also plot the fitted model:
```{r, fig.width=10}
plot(bestfit)
```
There are important information returned by `disclapmix`, e.g. the central haplotypes and the dispersion parameters for the discrete Laplace distributions:
```{r}
bestfit$y
bestfit$disclap_parameters
```
The returned object is described in `?disclapmix` and objects can be inspected using e.g. `str()`.
Haplotype frequencies can be obtained using the `predict` function. Note, that this is done per haplotype (`danes`) and not per observation (`db`):
```{r}
disclap_estimates <- predict(bestfit,
newdata = as.matrix(danes[, 1:(ncol(danes) - 1)]))
```
These can be compared to the database frequencies:
```{r}
ggplot() +
geom_abline(intercept = 0, slope = 1) +
geom_point(aes(x = danes$n/sum(danes$n), y = disclap_estimates)) +
labs(x = "Observed frequency",
y = "Predicted frequency (discrete Laplace)") +
theme_bw() +
scale_x_log10() +
scale_y_log10()
```
# References