Title: | Discrete Laplace Mixture Inference using the EM Algorithm |
---|---|
Description: | Make inference in a mixture of discrete Laplace distributions using the EM algorithm. This can e.g. be used for modelling the distribution of Y chromosomal haplotypes as described in [1, 2] (refer to the URL section). |
Authors: | Mikkel Meyer Andersen [aut, cre], Poul Svante Eriksen [aut] |
Maintainer: | Mikkel Meyer Andersen <[email protected]> |
License: | GPL-2 | file LICENSE |
Version: | 1.7.4.9910 |
Built: | 2024-10-31 03:50:52 UTC |
Source: | https://github.com/mikldk/disclapmix |
Discrete Laplace Mixture Inference using the EM Algorithm.
A central function is disclapmix_adaptive
(and the underlying disclapmixfit
).
Mikkel Meyer Andersen <[email protected]> and Poul Svante Eriksen <[email protected]>
clusterdist
calculates the distance between each pair of clusters.
The distance measure is based on a symmetric Kullback-Leibler divergence.
clusterdist(fit, ...)
clusterdist(fit, ...)
fit |
A |
... |
Not used |
A distance matrix
disclapmix-package
disclapmix
disclapmixfit
clusterprob
predict.disclapmixfit
print.disclapmixfit
summary.disclapmixfit
simulate.disclapmixfit
disclap
clusterprob
calculates the cluster origin probabilities for
haplotypes.
clusterprob(fit, newdata, ...)
clusterprob(fit, newdata, ...)
fit |
A |
newdata |
The haplotypes to predict the cluster origin probabilities for. |
... |
Not used |
A matrix where the rows correspond to the rows in newdata
and
the sum of each row is 1.
disclapmix-package
disclapmix
disclapmixfit
clusterdist
predict.disclapmixfit
print.disclapmixfit
summary.disclapmixfit
simulate.disclapmixfit
disclap
Get all possible contributor pairs from a 2 person mixture
contributor_pairs(mixture)
contributor_pairs(mixture)
mixture |
A list of integer vectors. The k'th element in the list is an integer vector with the alleles in the mixture at locus k. |
A contrib_pairs
object that is a unordered list of pairs.
Note, that contributor order is disregarded so that each contributor pair is
only present once (and not twice as would be the case if taking order into
consideration). See example usage at rank_contributor_pairs
.
rank_contributor_pairs
generate_mixture
disclapmix-package
disclapmix
disclapmixfit
clusterprob
predict.disclapmixfit
print.disclapmixfit
summary.disclapmixfit
simulate.disclapmixfit
disclap
185 Y-STR 10 loci haplotypes
A data frame with 185 observations on the following 10 loci
(n
is the number of times each haplotype has been observed)
”Y-chromosome STR haplotypes Danes” by Hallenberg et al (2005), http://www.sciencedirect.com/science/article/pii/S0379073805000095
disclapmix
makes inference in a mixture of Discrete Laplace
distributions using the EM algorithm. After the EM algorithm has converged,
the centers are moved if the marginal likelihood increases by doing so. And
then the EM algorithm is run again. This continues until the centers are not
moved.
disclapmix( x, clusters, init_y = NULL, iterations = 100L, eps = 0.001, verbose = 0L, glm_method = "internal_coef", glm_control_maxit = 50L, glm_control_eps = 1e-06, init_y_method = "pam", init_v = NULL, ret_x = FALSE, ... )
disclapmix( x, clusters, init_y = NULL, iterations = 100L, eps = 0.001, verbose = 0L, glm_method = "internal_coef", glm_control_maxit = 50L, glm_control_eps = 1e-06, init_y_method = "pam", init_v = NULL, ret_x = FALSE, ... )
x |
Dataset. |
clusters |
The number of clusters/components to fit the model for. |
init_y |
Initial central haplotypes, if NULL, these will be estimated
as described under the |
iterations |
Maximum number of iterations in the EM-algorithm. |
eps |
Convergence stop criteria in the EM algorithm which is compared
to |
verbose |
from 0 to 2 (both including): 0 for silent, 2 for extra verbose. |
glm_method |
|
glm_control_maxit |
Integer giving the maximal number of IWLS iterations. |
glm_control_eps |
Positive convergence tolerance epsilon; the
iterations converge when |
init_y_method |
Which cluster method to use for finding initial central
haplotypes, y: |
init_v |
Matrix with 'nrow(x)' rows and 'clusters' columns specifying initial posterior probabilities to get EM started, if none specified, then 'matrix(1/clusters, nrow = nrow(x), ncol = clusters)' is used. |
ret_x |
Return data 'x' |
... |
Used to detect obsolete usage (when using parameters
|
glm_method
: internal_coef
is the fastest as it uses the
relative changes in the coefficients as a stopping criterium, hence it does
not need to compute the deviance until the very end. In normal situations,
it would not be a problem to use this method. internal_dev
is the
reasonably fast method that uses the deviance as a stopping criterium (like
glm.fit
). glm.fit
to use the traditional glm.fit
IWLS
implementation and is slow compared to the other two methods.
init_y_method
: For init_y_method = 'clara'
, the sampling
parameters are: samples = 100
, sampsize =
min(ceiling(nrow(x)/2), 100 + 2*clusters)
and the random number generator
in R is used.
A disclapmixfit
object:
The supplied GLM method.
The supplied initial central haplotypes,
init_y
.
The supplied method for
choosing initial central haplotypes (only used if init_y
is
NULL
).
Whether the estimation converged or not.
Dataset used to fit the model if 'ret_x' is 'TRUE', else 'NULL'.
The
central haplotypes, y
.
The prior probabilities of
belonging to a cluster, tau
.
The matrix
v
of each observation's probability of belonging to a certain
cluster. The rows are in the same order as the observations in x
used
to generate this fit.
A matrix with the estimated dicrete Laplace parameters.
The
coefficients from the last GLM fit (used to calculate
disclap_parameters
).
Number of observations.
Number of parameters in the model.
Number of iterations performed in total (including moving centers and re-estimating using the EM algorithm).
Full log likelihood of the final model.
Marginal log likelihood of the final model.
BIC based on the full log likelihood of the final model.
BIC based on the marginal log likelihood of the final model.
The gain ,
where
v
is vic_matrix
mentioned above, during the iterations.
The prior probability of belonging to the centers during the iterations.
Full log likelihood of the models during
the iterations (only calculated when verbose = 2L
).
Marginal log likelihood of the
models during the iterations (only calculated when verbose = 2L
).
BIC based on full log likelihood of the
models during the iterations (only calculated when verbose = 2L
).
BIC based on marginal log likelihood
of the models during the iterations (only calculated when verbose =
2L
).
disclapmix-package
disclapmix
disclapmixfit
predict.disclapmixfit
print.disclapmixfit
summary.disclapmixfit
simulate.disclapmixfit
clusterdist
clusterprob
glm.fit
disclap
pam
clara
# Generate sample database db <- matrix(disclap::rdisclap(1000, 0.3), nrow = 250, ncol = 4) # Add location parameters db <- sapply(1:ncol(db), function(i) as.integer(db[, i]+13+i)) head(db) fit1 <- disclapmix(db, clusters = 1L, verbose = 1L, glm_method = "glm.fit") fit1$disclap_parameters fit1$y fit1b <- disclapmix(db, clusters = 1L, verbose = 1L, glm_method = "internal_coef") fit1b$disclap_parameters fit1b$y max(abs(fit1$disclap_parameters - fit1b$disclap_parameters)) # Generate another type of database db2 <- matrix(disclap::rdisclap(2000, 0.1), nrow = 500, ncol = 4) db2 <- sapply(1:ncol(db2), function(i) as.integer(db2[, i]+14+i)) fit2 <- disclapmix(rbind(db, db2), clusters = 2L, verbose = 1L) fit2$disclap_parameters fit2$y fit2$tau
# Generate sample database db <- matrix(disclap::rdisclap(1000, 0.3), nrow = 250, ncol = 4) # Add location parameters db <- sapply(1:ncol(db), function(i) as.integer(db[, i]+13+i)) head(db) fit1 <- disclapmix(db, clusters = 1L, verbose = 1L, glm_method = "glm.fit") fit1$disclap_parameters fit1$y fit1b <- disclapmix(db, clusters = 1L, verbose = 1L, glm_method = "internal_coef") fit1b$disclap_parameters fit1b$y max(abs(fit1$disclap_parameters - fit1b$disclap_parameters)) # Generate another type of database db2 <- matrix(disclap::rdisclap(2000, 0.1), nrow = 500, ncol = 4) db2 <- sapply(1:ncol(db2), function(i) as.integer(db2[, i]+14+i)) fit2 <- disclapmix(rbind(db, db2), clusters = 2L, verbose = 1L) fit2$disclap_parameters fit2$y fit2$tau
A wrapper around 'disclapmix_robust()' that instead of fitting one model for a given number of clusters, fits models until the best model (lowest marginal BIC) is in the interior (with margin 'M') of all number of clusters tried.
disclapmix_adaptive( x, label = "DL", margin = 5L, criteria = "BIC", init_y_generator = NULL, init_v_generator = NULL, ... )
disclapmix_adaptive( x, label = "DL", margin = 5L, criteria = "BIC", init_y_generator = NULL, init_v_generator = NULL, ... )
x |
Dataset. |
margin |
Fit models until there is at least this margin |
criteria |
The slot to chose the best model from (BIC/AIC/AICc) |
init_y_generator |
Function taking the number of clusters as input and returns 'init_y' values |
init_v_generator |
Function taking the number of clusters as input and returns 'init_v' values |
... |
Passed on to 'disclapmix_robust()' (and further to 'disclapmix()') |
E.g., the best model has 3 clusters and the margin 'M = 5', then this function ensures that models with 1, 2, ..., 3+5 = 8 clusters are fitted. If e.g. then 7 is better than 3, then it continues such that also models with up to 7+5 = 12 clusters are fitted.
Note that models with 1-5 clusters are always fitted.
A list of all 'disclapmix' fits
data(danes) db <- as.matrix(danes[rep(1:nrow(danes), danes$n), 1:(ncol(danes)-1)]) fits <- disclapmix_adaptive(db, margin = 5L) fits BICs <- sapply(fits, function(x) x$BIC_marginal) BICs ks <- sapply(fits, function(x) nrow(x$y)) # Always same as seq_along(fits) ks max_k <- max(ks) best_k <- which.min(BICs) max_k best_k max_k - best_k # = margin = 5 plot(ks, BICs, type = "b") fits_clara <- disclapmix_adaptive(db, margin = 5L, init_y_method = "clara")
data(danes) db <- as.matrix(danes[rep(1:nrow(danes), danes$n), 1:(ncol(danes)-1)]) fits <- disclapmix_adaptive(db, margin = 5L) fits BICs <- sapply(fits, function(x) x$BIC_marginal) BICs ks <- sapply(fits, function(x) nrow(x$y)) # Always same as seq_along(fits) ks max_k <- max(ks) best_k <- which.min(BICs) max_k best_k max_k - best_k # = margin = 5 plot(ks, BICs, type = "b") fits_clara <- disclapmix_adaptive(db, margin = 5L, init_y_method = "clara")
A wrapper around 'disclapmix()' that tries to avoid errors. Can sometimes avoid errors with SVD problems happening with ‘glm_method = ’internal_coef'‘ and 'glm_method = ’internal_dev''. This is done by taking a random subset of observations.
disclapmix_robust( x, clusters, rnd_obs = min(5L * clusters, round(nrow(x)/2)), rnd_tries = 10, ... )
disclapmix_robust( x, clusters, rnd_obs = min(5L * clusters, round(nrow(x)/2)), rnd_tries = 10, ... )
x |
Dataset. |
clusters |
The number of clusters/components to fit the model for. |
rnd_obs |
Number of random observations in subset if initial fit fails |
rnd_tries |
Number of tries with random subset if initial fit fails |
... |
Passed on to 'disclapmix()' |
data(danes) db <- as.matrix(danes[rep(1:nrow(danes), danes$n), 1:(ncol(danes)-1)]) fit <- disclapmix_robust(db, 3L) fit
data(danes) db <- as.matrix(danes[rep(1:nrow(danes), danes$n), 1:(ncol(danes)-1)]) fit <- disclapmix_robust(db, 3L) fit
This function can generate a mixture given a list of contributors.
generate_mixture(profiles)
generate_mixture(profiles)
profiles |
A list with profiles to mix. |
A list, e.g. for use with contributor_pairs
. See
example usage at rank_contributor_pairs
.
contributor_pairs
rank_contributor_pairs
disclapmix-package
disclapmix
disclapmixfit
clusterprob
predict.disclapmixfit
print.disclapmixfit
summary.disclapmixfit
simulate.disclapmixfit
disclap
Get rank of pair
get_rank(x, haplotype)
get_rank(x, haplotype)
x |
A |
haplotype |
A haplotype. |
Calculate haplotype diversity from a disclapmixfit
object. The
method is based on simulating a huge database that approximates the
population.
haplotype_diversity(object, nsim = 10000L)
haplotype_diversity(object, nsim = 10000L)
object |
a |
nsim |
number of haplotypes to generate for calculating the haplotype diversity. |
The calculated haplotype diversity.
disclapmix
disclapmixfit
predict.disclapmixfit
print.disclapmixfit
summary.disclapmixfit
simulate.disclapmixfit
Plot a disclapmixfit
object.
## S3 method for class 'disclapmixfit' plot(x, which = 1L, clusdist = clusterdist(x), ...)
## S3 method for class 'disclapmixfit' plot(x, which = 1L, clusdist = clusterdist(x), ...)
x |
a |
which |
What plot to make. 1L = clusters and their distances. |
clusdist |
To use previously computed cluster distances to avoid doing the same computations twice. |
... |
not used |
A data frame with discrete Laplace distributions for each cluster and locus. Side effect: A plot.
disclapmix
disclapmixfit
predict.disclapmixfit
print.disclapmixfit
simulate.disclapmixfit
summary.disclapmixfit
data(danes) db <- as.matrix(danes[rep(1:nrow(danes), danes$n), 1:(ncol(danes)-1)]) fit <- disclapmix(db, clusters = 4L) plot(fit)
data(danes) db <- as.matrix(danes[rep(1:nrow(danes), danes$n), 1:(ncol(danes)-1)]) fit <- disclapmix(db, clusters = 4L) plot(fit)
Plot ranked contributor pairs
## S3 method for class 'ranked_contrib_pairs' plot(x, top = NULL, ..., xlab = "Rank", ylab = "P(H1)P(H2)")
## S3 method for class 'ranked_contrib_pairs' plot(x, top = NULL, ..., xlab = "Rank", ylab = "P(H1)P(H2)")
x |
A |
top |
The top ranked number of pairs to print. |
... |
Delegated to the generic |
xlab |
Graphical parameter. |
ylab |
Graphical parameter. |
Is able to predict haplotype frequencies using a disclapmixfit
object.
## S3 method for class 'disclapmixfit' predict(object, newdata, marginalise = FALSE, ...)
## S3 method for class 'disclapmixfit' predict(object, newdata, marginalise = FALSE, ...)
object |
a |
newdata |
the haplotypes in matrix format to estimate haplotype probabilities for |
marginalise |
Should loci with 'NA' be dealt with by marginalising out that locus? |
... |
not used |
Note that 'NA' values give rise to an error unless the 'marginalise' argument is set to 'TRUE'.
disclapmix
disclapmixfit
print.disclapmixfit
summary.disclapmixfit
simulate.disclapmixfit
plot.disclapmixfit
clusterprob
Print contributor pairs
## S3 method for class 'contrib_pairs' print(x, ...)
## S3 method for class 'contrib_pairs' print(x, ...)
x |
A |
... |
Ignored |
Prints a disclapmixfit
object.
## S3 method for class 'disclapmixfit' print(x, ...)
## S3 method for class 'disclapmixfit' print(x, ...)
x |
a |
... |
not used |
disclapmix
disclapmixfit
predict.disclapmixfit
summary.disclapmixfit
simulate.disclapmixfit
plot.disclapmixfit
Print ranked contributor pairs
## S3 method for class 'ranked_contrib_pairs' print(x, top = 5L, hide_non_varying_loci = TRUE, ...)
## S3 method for class 'ranked_contrib_pairs' print(x, top = 5L, hide_non_varying_loci = TRUE, ...)
x |
A |
top |
The top ranked number of pairs to print/plot. |
hide_non_varying_loci |
Whether to hide alleles on loci that do not vary. |
... |
Ignored |
Separate a 2 person mixture by ranking the possible contributor pairs.
rank_contributor_pairs(contrib_pairs, fit, max_rank = NULL)
rank_contributor_pairs(contrib_pairs, fit, max_rank = NULL)
contrib_pairs |
A |
fit |
A |
max_rank |
Not used. Reserved for future use. |
A ranked_contrib_pairs
object that is basically an order
vector and the probabilities for each pair (in the same order as given in
contrib_pairs
), found by using fit
. Note, that contributor
order is disregarded so that each contributor pair is only present once (and
not twice as would be the case if taking order into consideration).
contributor_pairs
generate_mixture
disclapmix-package
disclapmix
disclapmixfit
clusterprob
predict.disclapmixfit
print.disclapmixfit
summary.disclapmixfit
simulate.disclapmixfit
disclap
data(danes) db <- as.matrix(danes[rep(1L:nrow(danes), danes$n), 1L:(ncol(danes) - 1L)]) set.seed(1) true_contribs <- sample(1L:nrow(db), 2L) h1 <- db[true_contribs[1L], ] h2 <- db[true_contribs[2L], ] db_ref <- db[-true_contribs, ] h1h2 <- c(paste(h1, collapse = ";"), paste(h2, collapse = ";")) tab_db <- table(apply(db, 1, paste, collapse = ";")) tab_db_ref <- table(apply(db_ref, 1, paste, collapse = ";")) tab_db[h1h2] tab_db_ref[h1h2] rm(db) # To avoid use by accident mixture <- generate_mixture(list(h1, h2)) possible_contributors <- contributor_pairs(mixture) possible_contributors fits <- lapply(1L:5L, function(clus) disclapmix(db_ref, clusters = clus)) best_fit_BIC <- fits[[which.min(sapply(fits, function(fit) fit$BIC_marginal))]] best_fit_BIC ranked_contributors_BIC <- rank_contributor_pairs(possible_contributors, best_fit_BIC) ranked_contributors_BIC plot(ranked_contributors_BIC, top = 10L, type = "b") get_rank(ranked_contributors_BIC, h1)
data(danes) db <- as.matrix(danes[rep(1L:nrow(danes), danes$n), 1L:(ncol(danes) - 1L)]) set.seed(1) true_contribs <- sample(1L:nrow(db), 2L) h1 <- db[true_contribs[1L], ] h2 <- db[true_contribs[2L], ] db_ref <- db[-true_contribs, ] h1h2 <- c(paste(h1, collapse = ";"), paste(h2, collapse = ";")) tab_db <- table(apply(db, 1, paste, collapse = ";")) tab_db_ref <- table(apply(db_ref, 1, paste, collapse = ";")) tab_db[h1h2] tab_db_ref[h1h2] rm(db) # To avoid use by accident mixture <- generate_mixture(list(h1, h2)) possible_contributors <- contributor_pairs(mixture) possible_contributors fits <- lapply(1L:5L, function(clus) disclapmix(db_ref, clusters = clus)) best_fit_BIC <- fits[[which.min(sapply(fits, function(fit) fit$BIC_marginal))]] best_fit_BIC ranked_contributors_BIC <- rank_contributor_pairs(possible_contributors, best_fit_BIC) ranked_contributors_BIC plot(ranked_contributors_BIC, top = 10L, type = "b") get_rank(ranked_contributors_BIC, h1)
Simulate from a disclapmixfit
object.
## S3 method for class 'disclapmixfit' simulate(object, nsim = 1L, seed = NULL, cluster = NULL, ...)
## S3 method for class 'disclapmixfit' simulate(object, nsim = 1L, seed = NULL, cluster = NULL, ...)
object |
a |
nsim |
number of haplotypes to generate. |
seed |
not used |
cluster |
which cluster to simulate from (if 'NULL', the default, take a random according to the a priori probabilities) |
... |
not used |
A matrix where the rows correspond to the simulated haplotypes.
disclapmix
disclapmixfit
predict.disclapmixfit
print.disclapmixfit
plot.disclapmixfit
summary.disclapmixfit
Summary of a disclapmixfit
object.
## S3 method for class 'disclapmixfit' summary(object, ...)
## S3 method for class 'disclapmixfit' summary(object, ...)
object |
a |
... |
not used |
disclapmix
disclapmixfit
predict.disclapmixfit
print.disclapmixfit
simulate.disclapmixfit
clusterdist