---
title: "Alpha Diversity"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Alpha Diversity}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Input Matrix

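We will use the `ex_counts` dataset included with `ecodive`. This feature table contains counts of bacterial genera in four samples (Saliva, Gums, Nose, and Stool). The counts are rarefied first so that every sample has the same total (345 in the output below), and the matrix is transposed for display so that genera appear as rows.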

```r
library(ecodive)

counts <- rarefy(ex_counts)
t(counts)
#>                   Saliva Gums Nose Stool
#> Streptococcus        162  309    6     1
#> Bacteroides            2    2    0   341
#> Corynebacterium        0    0  171     1
#> Haemophilus          180   34    0     1
#> Propionibacterium      1    0   82     0
#> Staphylococcus         0    0   86     1
```


# Alpha Diversity

Alpha diversity measures diversity within a single sample. In `ecodive`, metrics are grouped into four categories based on the aspect of diversity they quantify.


## Richness Metrics

Richness metrics estimate the number of distinct features (e.g., genera) in a sample. The simplest metric, `observed()`, counts features with non-zero abundance.

```r
# Equivalent to rowSums(counts > 0)
observed(counts)
#> Saliva   Gums   Nose  Stool 
#>      4      3      4      5 
```

The **Chao1** estimator extends this by inferring the number of unobserved, low-abundance features from the number of singletons (`counts == 1`) and doubletons (`counts == 2`): many singletons relative to doubletons imply that many rare features were missed.

```r
# Infers 8 unobserved genera
chao1(c(1, 1, 1, 1, 2, 5, 5, 5))
#> [1] 16

# Infers fewer than one unobserved genus
chao1(c(1, 2, 2, 2, 2, 5, 5, 5))
#> [1] 8.125

# Samples with no doubletons give Inf, or NaN when there are no singletons either
chao1(counts)
#> Saliva   Gums   Nose  Stool 
#>    4.5    3.0    NaN    Inf 
```
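
To make the arithmetic behind these results explicit, here is a minimal base R sketch of the Chao1 formula listed in the Formulas section. The helper name `chao1_manual()` is hypothetical and is not part of `ecodive`; it only illustrates the formula and may differ from the package's implementation in edge cases.

```r
# Hypothetical helper (not an ecodive function): Chao1 for one sample.
chao1_manual <- function(x) {
  n  <- sum(x > 0)   # observed features
  f1 <- sum(x == 1)  # singletons
  f2 <- sum(x == 2)  # doubletons
  n + f1^2 / (2 * f2)
}

chao1_manual(c(1, 1, 1, 1, 2, 5, 5, 5))  # 8 + 4^2 / (2 * 1)
#> [1] 16
chao1_manual(c(1, 2, 2, 2, 2, 5, 5, 5))  # 8 + 1^2 / (2 * 4)
#> [1] 8.125
chao1_manual(c(6, 171, 82, 86))          # Nose counts: 0 / 0
#> [1] NaN
chao1_manual(c(1, 341, 1, 1, 1))         # Stool counts: 16 / 0
#> [1] Inf
```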


## Diversity Metrics

Diversity metrics account for both richness and evenness (how equally abundances are distributed). 

**Simpson's index** (shown here in its Gini-Simpson form, $1 - \sum P_i^2$) is often used as a measure of evenness, representing the probability that two randomly selected individuals belong to different species.

```r
# High evenness (0.8) vs. low evenness (~0.075)
simpson(c(20, 20, 20, 20, 20))
#> [1] 0.8
simpson(c(100, 1, 1, 1, 1))
#> [1] 0.07507396

# Stool < Gums < Saliva < Nose
sort(simpson(counts))
#>      Stool       Gums     Saliva       Nose 
#> 0.02302037 0.18806133 0.50725478 0.63539593 
```
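
The two toy results above can be reproduced by hand. The following is a small base R sketch of $1 - \sum P_i^2$; `simpson_manual()` is a hypothetical name used only for illustration and is not an `ecodive` function.

```r
# Hypothetical helper (not an ecodive function): Gini-Simpson index.
simpson_manual <- function(x) {
  p <- x / sum(x)  # proportional abundances
  1 - sum(p^2)     # chance that two random draws differ
}

simpson_manual(c(20, 20, 20, 20, 20))  # 1 - 5 * 0.2^2
#> [1] 0.8
simpson_manual(c(100, 1, 1, 1, 1))     # 1 - 10004 / 10816
#> [1] 0.07507396
```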

The **Shannon diversity index** (entropy, computed with the natural logarithm) is another common metric that weights both richness and evenness.

```r
# High richness, High evenness
shannon(rep(100, 100))
#> [1] 4.60517

# Stool < Gums < Saliva < Nose
sort(shannon(counts))
#>      Stool       Gums     Saliva       Nose 
#> 0.07927797 0.35692121 0.74119910 1.10615349 
```
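
As a cross-check, Shannon entropy can be written in a few lines of base R. This is a sketch under the assumption that zero counts contribute nothing to the sum; `shannon_manual()` is a hypothetical name, not an `ecodive` function.

```r
# Hypothetical helper (not an ecodive function): Shannon entropy, natural log.
shannon_manual <- function(x) {
  p <- x[x > 0] / sum(x)  # drop zeros, since 0 * log(0) is treated as 0
  -sum(p * log(p))
}

shannon_manual(rep(100, 100))  # 100 equally abundant features: log(100)
#> [1] 4.60517

# Stool counts from the table above
round(shannon_manual(c(1, 341, 1, 1, 0, 1)), 4)
#> [1] 0.0793
```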


## Dominance Metrics

Dominance metrics focus on the abundance of the most common species. The **Berger-Parker** index is the proportional abundance of the single most abundant feature.

```r
# Stool is dominated by Bacteroides (341/345 counts -> ~0.99)
# Nose is more balanced; Corynebacterium is max (171/345 counts -> ~0.50)
sort(berger(counts))
#>      Nose    Saliva      Gums     Stool 
#> 0.4956522 0.5217391 0.8956522 0.9884058 
```
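
The same proportions can be checked directly in base R. `berger_manual()` below is a hypothetical illustration of $\max(P_i)$, not an `ecodive` function.

```r
# Hypothetical helper (not an ecodive function): Berger-Parker index.
berger_manual <- function(x) max(x) / sum(x)

berger_manual(c(6, 0, 171, 0, 82, 86))  # Nose: 171 / 345
#> [1] 0.4956522
berger_manual(c(1, 341, 1, 1, 0, 1))    # Stool: 341 / 345
#> [1] 0.9884058
```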


## Phylogenetic Metrics

Phylogenetic metrics use a phylogenetic tree to incorporate evolutionary distance. **Faith's Phylogenetic Diversity (PD)** calculates the total branch length spanned by the features present in a sample, including the branches that connect them to the root of the tree.

```r
# ex_tree:
#
#       +----------44---------- Haemophilus
#   +-2-|
#   |   +----------------68---------------- Bacteroides  
#   |                      
#   |             +---18---- Streptococcus
#   |      +--12--|       
#   |      |      +--11-- Staphylococcus
#   +--11--|              
#          |      +-----24----- Corynebacterium
#          +--12--|
#                 +--13-- Propionibacterium


faith(c(Propionibacterium = 1, Corynebacterium = 1), tree = ex_tree)
#> [1] 60

faith(c(Propionibacterium = 1, Haemophilus = 1), tree = ex_tree)
#> [1] 82

# Nose < Gums < Saliva < Stool
sort(faith(counts, tree = ex_tree))
#>   Nose   Gums Saliva  Stool 
#>    101    155    180    202 
```


# Formulas

Given:

* $n$ : Number of observed features, i.e. those with non-zero counts (e.g. species, OTUs, ASVs).
* $X_i$ : Integer count of the $i$-th feature.
* $X_T$ : Total of all counts (sequencing depth). $X_T = \sum_{i=1}^{n} X_i$
* $P_i$ : Proportional abundance of the $i$-th feature. $P_i = X_i / X_T$
* $F_1$ : Number of singletons ($X_i = 1$).
* $F_2$ : Number of doubletons ($X_i = 2$).

| Metric | Formula |
| :----- | :------ |
| **Abundance-based Coverage Estimator (ACE)** | See below. |
| **Berger-Parker Index** | $\max(P_i)$ |
| **Brillouin Index** | $\displaystyle \frac{\ln{(X_T!)} - \sum_{i = 1}^{n} \ln{(X_i!)}}{X_T}$ |
| **Chao1** | $\displaystyle n + \frac{(F_1)^2}{2 F_2}$ |
| **Faith's Phylogenetic Diversity** | See below. |
| **Fisher's Alpha ($\alpha$)** | $\displaystyle \frac{n}{\alpha} = \ln{\left(1 + \frac{X_T}{\alpha}\right)}$ <br> ($\alpha$ is solved for iteratively) |
| **Gini-Simpson Index** | $1 - \sum_{i = 1}^{n} P_i^2$ |
| **Inverse Simpson Index** | $1 / \sum_{i = 1}^{n} P_i^2$ |
| **Margalef's Richness Index** | $\displaystyle \frac{n - 1}{\ln{X_T}}$ |
| **McIntosh Index** | $\displaystyle \frac{X_T - \sqrt{\sum_{i = 1}^{n} (X_i)^2}}{X_T - \sqrt{X_T}}$ |
| **Menhinick's Richness Index** | $\displaystyle \frac{n}{\sqrt{X_T}}$ |
| **Observed Features** | $n$ |
| **Shannon Diversity Index** | $-\sum_{i = 1}^{n} P_i \times \ln(P_i)$ |
| **Squares Richness Estimator** | $\displaystyle n + \frac{(F_1)^2 \sum_{i=1}^{n} (X_i)^2}{X_T^2 - nF_1}$ |


## Abundance-based Coverage Estimator (ACE)

Given:

* $r$ : Rare cutoff (features with $\le r$ counts are considered rare).
* $F_i$ : Number of features with exactly $i$ counts.
* $F_{rare}$ : Number of rare features.
* $F_{abund}$ : Number of abundant features ($> r$ counts).
* $X_{rare}$ : Total counts belonging to rare features.
* $C_{ace}$ : Sample abundance coverage estimator.
* $\gamma_{ace}^2$ : Estimated squared coefficient of variation of the rare features.
* $D_{ace}$ : The resulting ACE richness estimate.

$$C_{ace} = 1 - \frac{F_1}{X_{rare}}$$

$$\gamma_{ace}^2 = \max\left[\frac{F_{rare} \sum_{i=1}^{r}i(i-1)F_i}{C_{ace}X_{rare}(X_{rare} - 1)} - 1, 0\right]$$

$$D_{ace} = F_{abund} + \frac{F_{rare}}{C_{ace}} + \frac{F_1}{C_{ace}}\gamma_{ace}^2$$
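
The three equations above can be chained together in base R as follows. This is a sketch, not the package's implementation: `ace_manual()` is a hypothetical name, the default cutoff `r = 10` is an assumed conventional value, and the sketch does not guard against samples where every rare feature is a singleton (which makes $C_{ace}$ zero).

```r
# Hypothetical helper (not an ecodive function): ACE for one sample.
ace_manual <- function(x, r = 10) {  # r = 10 is an assumed default
  x       <- x[x > 0]
  rare    <- x[x <= r]
  f_rare  <- length(rare)               # F_rare
  f_abund <- sum(x > r)                 # F_abund
  x_rare  <- sum(rare)                  # X_rare
  f       <- tabulate(rare, nbins = r)  # f[i] is F_i for i = 1..r
  i       <- seq_len(r)

  c_ace  <- 1 - f[1] / x_rare
  gamma2 <- max(
    f_rare * sum(i * (i - 1) * f) / (c_ace * x_rare * (x_rare - 1)) - 1,
    0
  )
  f_abund + f_rare / c_ace + f[1] / c_ace * gamma2
}

# Five rare features (1, 1, 2, 3, 5) and two abundant ones (20, 30):
# C_ace = 1 - 2/12, gamma^2 = 140/110 - 1, D_ace = 2 + 6 + 2.4 * 0.2727...
ace_manual(c(1, 1, 2, 3, 5, 20, 30))
#> [1] 8.654545
```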


## Faith's Phylogenetic Diversity (Faith's PD)

Given $n$ branches (in this section $n$ counts branches in the tree, not features) with lengths $L$ and a binary vector $A$ indicating whether each branch has at least one descendant present in the sample (1) or none (0):

$$\sum_{i = 1}^{n} L_i A_i$$
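
To illustrate the sum, the sketch below transcribes the ten branch lengths of the `ex_tree` diagram shown earlier by hand and builds $A$ from a set of present genera. The objects `L`, `desc`, and `faith_manual()` are hypothetical illustrations in base R; they do not read the actual `ex_tree` object or call `ecodive`.

```r
# Branch lengths copied by hand from the ex_tree diagram (an assumption of
# this sketch), with the tips that descend from each branch.
L <- c(44, 68, 2, 18, 11, 12, 24, 13, 12, 11)
desc <- list(
  "Haemophilus",
  "Bacteroides",
  c("Haemophilus", "Bacteroides"),
  "Streptococcus",
  "Staphylococcus",
  c("Streptococcus", "Staphylococcus"),
  "Corynebacterium",
  "Propionibacterium",
  c("Corynebacterium", "Propionibacterium"),
  c("Streptococcus", "Staphylococcus", "Corynebacterium", "Propionibacterium")
)

# Hypothetical helper (not an ecodive function): Faith's PD as sum(L * A).
faith_manual <- function(present) {
  # A[i] is 1 when branch i has at least one descendant in the sample
  A <- vapply(desc, function(d) as.numeric(any(d %in% present)), numeric(1))
  sum(L * A)
}

faith_manual(c("Propionibacterium", "Corynebacterium"))  # 13 + 24 + 12 + 11
#> [1] 60
faith_manual(c("Propionibacterium", "Haemophilus"))      # 13 + 12 + 11 + 2 + 44
#> [1] 82
```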