---
title: "Case study: Partial profiles"
output: rmarkdown::html_vignette
date: "`r format(Sys.time(), '%d %B, %Y')`"
author: Mikkel Meyer Andersen
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Case study: Partial profiles}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(fig.width = 7)
```

First, the library is loaded:
```{r, message=FALSE}
library(malan)
```

For reproducibility, the seed for the (pseudo) random number generator is set:
```{r}
set.seed(1)
```

# Population simulation

First, the population sizes are determined:

```{r}
population_sizes <- round(c(rep(100, 10), 100*1.02^(1:10)))
plot(population_sizes, type = "b")
```

A population can be simulated (hiding progress information) as follows:

```{r}
set.seed(1) # For reproducibility
sim_res_growth <- sample_geneology_varying_size(
  population_sizes = population_sizes, 
  
  # VRS = 0.2:
  enable_gamma_variance_extension = TRUE,
  gamma_parameter_shape = 5,
  gamma_parameter_scale = 1/5,
  
  # Live population: 
  # 3 generations
  generations_full = 3, 
  generations_return = 3,
  
  progress = FALSE)
```

Live population:

```{r}
live_pop <- sim_res_growth$individuals_generations
```

## Building the pedigrees

Until pedigrees are build/infered, there is not much information available (e.g. about children). So let us infer the pedigrees:

```{r}
pedigrees <- build_pedigrees(sim_res_growth$population, progress = FALSE)
pedigrees
pedigrees_count(pedigrees)
pedigrees_table(pedigrees)
pedigree_size(pedigrees[[1]])
```

We can look at the population as a (tidy)graph:

```{r}
g <- as_tbl_graph(pedigrees)
g
```

This can be plotted:

```{r}
if (requireNamespace("ggraph", quietly = TRUE)) {
  library(ggraph)
  p <- ggraph(g, layout = 'tree') +
    geom_edge_link() +
    geom_node_point(size = 8) +
    geom_node_text(aes(label = name), color = "white") +
    facet_nodes(~ ped_id) +
    theme_graph() 
  
  print(p)
}
```

This is rather difficult to make any sense of. Let's instead plot only pedigree 1:

```{r}
PED_ID <- 1

g_ped2 <- g %>% 
  activate(nodes) %>% 
  filter(ped_id == PED_ID)

if (requireNamespace("ggraph", quietly = TRUE)) {
  library(ggraph)
  p <- ggraph(g_ped2, layout = 'tree') +
      geom_edge_link() +
      geom_node_point(size = 8) +
      geom_node_text(aes(label = name), color = "white") +
      theme_graph() 
  print(p)
}
```


## Run a mutation process

Up until now, only the genealogy has been simulated. Now, we run a mutational process, i.e. assign haplotypes to founders and let haplotypes flow down the individuals.

We use realistic data. In the package, there is information about the individual markers:

```{r}
ystr_markers
```

Note, that `MutProb` is the point estimate given by `MutProb = Mutations / Meioses`. Information about which markers that are in which kit is also provided:

```{r}
ystr_kits
ystr_kits %>% count(Kit)
```

Let us take all PowerPlex Y23 markers and assume that we only have a partial profile where `DYS437` and `DYS448` dropped out. At the same time, we also filter out the integer alleles (for generating random founder haplotypes in a minute):

```{r}
partial_kit <- ystr_kits %>% 
  filter(Kit == "PowerPlex Y23") %>% 
  inner_join(ystr_markers, by = "Marker") %>% 
  filter(!(Marker %in% c("DYS437", "DYS448"))) %>% 
  rowwise() %>% # To work on each row
  mutate(IntegerAlleles = list(Alleles[Alleles == round(Alleles)]),
         MinIntAllele = min(IntegerAlleles),
         MaxIntAllele = max(IntegerAlleles)) %>% 
  ungroup() %>% 
  select(-Kit, -Alleles)
partial_kit
```

This "partial kit" has the following mutation probabilities:

```{r}
mu <- partial_kit %>% pull(MutProb)
mu
```

We can make a founder haplotype generator as follows (sampling alleles randomly is not how Y-STR works, but it may work fine for founder haplotypes):

```{r}
generate_random_haplotype <- function() {
  partial_kit %>% 
    rowwise() %>% 
    mutate(Allele = IntegerAlleles[sample.int(length(IntegerAlleles), 1)]) %>% 
    pull(Allele)
}
```

Now, a new haplotype is created everytime the function is called (with no arguments):

```{r}
generate_random_haplotype()
generate_random_haplotype()
```

Of course such generator can also be created for a reference database with Y-STR profiles.

Now, we are ready to assign haplotypes to the genealogy:

```{r}
set.seed(1)
pedigrees_all_populate_haplotypes_custom_founders(
  pedigrees = pedigrees, 
  get_founder_haplotype = generate_random_haplotype,
  mutation_rates = mu,
  progress = FALSE)
```

We can now plot pedigrees with haplotype information (note that `as_tbl_graph` needs to be called again):

```{r}
g_ped2 <- as_tbl_graph(pedigrees) %>% 
  activate(nodes) %>% 
  filter(ped_id == PED_ID) %>%
  group_by(name) %>% 
  mutate(haplotype_str = paste0(haplotype[[1]], collapse = ";"))
  #mutate(haplotype_str = map(haplotype, paste0, collapse = ";")[[1]])

if (requireNamespace("ggraph", quietly = TRUE)) {
  library(ggraph)
  p <- ggraph(g_ped2, layout = 'tree') +
      geom_edge_link() +
      geom_node_point(aes(color = haplotype_str), size = 8) +
      geom_node_text(aes(label = name), color = "white") +
      theme_graph() 
  print(p)
}
```


# Counting matches

We have `live_pop` from the population. 

## Drawing an individual and counting matches

```{r}
set.seed(5)
Q_index <- sample.int(n = length(live_pop), size = 1)
Q <- live_pop[[Q_index]]
print_individual(Q)
Q_hap <- get_haplotype(Q)
Q_hap
```

Now, count matches in live part of pedigree and in live part of population:

```{r}
Q_ped <- get_pedigree_from_individual(Q)
ped_live_matches <- count_haplotype_occurrences_pedigree(
  pedigree = Q_ped, 
  haplotype = Q_hap, 
  generation_upper_bound_in_result = 2) # gen 0, 1, 2
pop_live_matches <- count_haplotype_occurrences_individuals(
  individuals = live_pop, 
  haplotype = Q_hap)

ped_live_matches
pop_live_matches
```


We can also inspect pedigree matches information about number of meioses and $L_1$ distances:

```{r}
path_details <- pedigree_haplotype_matches_in_pedigree_meiosis_L1_dists(
  suspect = Q, 
  generation_upper_bound_in_result = 2)
```

```{r}
nrow(path_details)
head(path_details)
```

This can of course be repeated to many populations (genealogies, haplotype processes, suspects etc.). Also note that variability can be put on mutation rates, e.g. by a Bayesian approach.