---
title: "Alpha Diversity"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Alpha Diversity}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Input Matrix

We will use the `ex_counts` dataset included with ecodive. This feature table contains counts of bacterial genera across various samples.

```r
library(ecodive)

counts <- rarefy(ex_counts)
t(counts)
#>                   Saliva Gums Nose Stool
#> Streptococcus        162  309    6     1
#> Bacteroides            2    2    0   341
#> Corynebacterium        0    0  171     1
#> Haemophilus          180   34    0     1
#> Propionibacterium      1    0   82     0
#> Staphylococcus         0    0   86     1
```

# Alpha Diversity

Alpha diversity measures diversity within a single sample. In `ecodive`, metrics are grouped into four categories based on the aspect of diversity they quantify.

## Richness Metrics

Richness metrics estimate the number of distinct features (e.g., genera) in a sample. The simplest metric, `observed()`, counts features with non-zero abundance.

```r
# Equivalent to rowSums(counts > 0)
observed(counts)
#> Saliva   Gums   Nose  Stool
#>      4      3      4      5
```

The **Chao1** estimator extends this by inferring the number of unobserved, low-abundance features from the numbers of singletons (`counts == 1`) and doubletons (`counts == 2`). Many singletons relative to doubletons imply many unobserved features.

```r
# Infers 8 unobserved genera
chao1(c(1, 1, 1, 1, 2, 5, 5, 5))
#> [1] 16

# Infers fewer than 1 unobserved genus
chao1(c(1, 2, 2, 2, 2, 5, 5, 5))
#> [1] 8.125

# Samples without doubletons give Inf (singletons present) or NaN (no singletons)
chao1(counts)
#> Saliva   Gums   Nose  Stool
#>    4.5    3.0    NaN    Inf
```

## Diversity Metrics

Diversity metrics account for both richness and evenness (how equally abundances are distributed).

**Simpson's index** is often used as a measure of evenness. It represents the probability that two randomly selected individuals belong to different species.
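
The probability described above can be checked by hand. The following is a minimal base-R sketch, assuming that `simpson()` returns the Gini-Simpson form, one minus the sum of squared proportions. The helper name `simpson_by_hand` is hypothetical and is not part of ecodive.

```r
# Hypothetical helper (not an ecodive function): Gini-Simpson index from raw counts.
simpson_by_hand <- function(x) {
  p <- x / sum(x)  # proportional abundance of each feature
  1 - sum(p^2)     # chance that two draws (with replacement) differ
}

# Five equally abundant features
simpson_by_hand(c(20, 20, 20, 20, 20))
#> [1] 0.8

# One feature holds 100 of the 104 counts
simpson_by_hand(c(100, 1, 1, 1, 1))
#> [1] 0.07507396
```
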

```r
# High evenness (0.8) vs low evenness (~0.08)
simpson(c(20, 20, 20, 20, 20))
#> [1] 0.8
simpson(c(100, 1, 1, 1, 1))
#> [1] 0.07507396

# Stool < Gums < Saliva < Nose
sort(simpson(counts))
#>      Stool       Gums     Saliva       Nose
#> 0.02302037 0.18806133 0.50725478 0.63539593
```

The **Shannon diversity index** (entropy) is another common metric that weights both richness and evenness.

```r
# High richness, high evenness
shannon(rep(100, 100))
#> [1] 4.60517

# Stool < Gums < Saliva < Nose
sort(shannon(counts))
#>      Stool       Gums     Saliva       Nose
#> 0.07927797 0.35692121 0.74119910 1.10615349
```

## Dominance Metrics

Dominance metrics focus on the abundance of the most common species. The **Berger-Parker** index is the proportional abundance of the single most abundant feature.

```r
# Stool is dominated by Bacteroides (341/345 counts -> ~0.99)
# Nose is more balanced; Corynebacterium is max (171/345 counts -> ~0.50)
sort(berger(counts))
#>      Nose    Saliva      Gums     Stool
#> 0.4956522 0.5217391 0.8956522 0.9884058
```

## Phylogenetic Metrics

Phylogenetic metrics use a phylogenetic tree to incorporate evolutionary distance. **Faith's Phylogenetic Diversity (PD)** calculates the total branch length spanned by the features present in a sample.

```r
# ex_tree:
#
#       +----------44---------- Haemophilus
#   +-2-|
#   |   +----------------68---------------- Bacteroides
#   |
#   |             +---18---- Streptococcus
#   |      +--12--|
#   |      |      +--11-- Staphylococcus
#   +--11--|
#          |      +-----24----- Corynebacterium
#          +--12--|
#                 +--13-- Propionibacterium

faith(c(Propionibacterium = 1, Corynebacterium = 1), tree = ex_tree)
#> [1] 60

faith(c(Propionibacterium = 1, Haemophilus = 1), tree = ex_tree)
#> [1] 82

# Nose < Gums < Saliva < Stool
sort(faith(counts, tree = ex_tree))
#>   Nose   Gums Saliva  Stool
#>    101    155    180    202
```

# Formulas

Given:

* $n$ : Number of features (e.g. species, OTUs, ASVs).
* $X_i$ : Integer count of the $i$-th feature.
* $X_T$ : Total of all counts (sequencing depth).
  $X_T = \sum_{i=1}^{n} X_i$
* $P_i$ : Proportional abundance of the $i$-th feature.
  $P_i = X_i / X_T$
* $F_1$ : Number of singletons ($X_i = 1$).
* $F_2$ : Number of doubletons ($X_i = 2$).

| Metric | Formula |
| :----- | :------ |
| **Abundance-based Coverage Estimator (ACE)** | See below. |
| **Berger-Parker Index** | $\max(P_i)$ |
| **Brillouin Index** | $\displaystyle \frac{\ln{[(\sum_{i = 1}^{n} X_i)!]} - \sum_{i = 1}^{n} \ln{(X_i!)}}{\sum_{i = 1}^{n} X_i}$ |
| **Chao1** | $\displaystyle n + \frac{(F_1)^2}{2 F_2}$ |
| **Faith's Phylogenetic Diversity** | See below. |
| **Fisher's Alpha ($\alpha$)** | $\displaystyle \frac{n}{\alpha} = \ln{\left(1 + \frac{X_T}{\alpha}\right)}$ ($\alpha$ is solved for iteratively) |
| **Gini-Simpson Index** | $1 - \sum_{i = 1}^{n} P_i^2$ |
| **Inverse Simpson Index** | $1 / \sum_{i = 1}^{n} P_i^2$ |
| **Margalef's Richness Index** | $\displaystyle \frac{n - 1}{\ln{X_T}}$ |
| **McIntosh Index** | $\displaystyle \frac{X_T - \sqrt{\sum_{i = 1}^{n} (X_i)^2}}{X_T - \sqrt{X_T}}$ |
| **Menhinick's Richness Index** | $\displaystyle \frac{n}{\sqrt{X_T}}$ |
| **Observed Features** | $n$ |
| **Shannon Diversity Index** | $-\sum_{i = 1}^{n} P_i \times \ln(P_i)$ |
| **Squares Richness Estimator** | $\displaystyle n + \frac{(F_1)^2 \sum_{i=1}^{n} (X_i)^2}{X_T^2 - nF_1}$ |

## Abundance-based Coverage Estimator (ACE)

Given:

* $r$ : Rare cutoff (features with $\le r$ counts are considered rare).
* $F_i$ : Number of features with exactly $i$ counts.
* $F_{rare}$ : Number of rare features.
* $F_{abund}$ : Number of abundant features ($> r$ counts).
* $X_{rare}$ : Total counts belonging to rare features.
* $C_{ace}$ : Sample abundance coverage estimator.
* $\gamma_{ace}^2$ : Estimated coefficient of variation.
* $D_{ace}$ : The ACE richness estimate.

$$C_{ace} = 1 - \frac{F_1}{X_{rare}}$$

$$\gamma_{ace}^2 = \max\left[\frac{F_{rare} \sum_{i=1}^{r}i(i-1)F_i}{C_{ace}X_{rare}(X_{rare} - 1)} - 1, 0\right]$$

$$D_{ace} = F_{abund} + \frac{F_{rare}}{C_{ace}} + \frac{F_1}{C_{ace}}\gamma_{ace}^2$$

## Faith's Phylogenetic Diversity (Faith's PD)

Given $b$ branches with lengths $L$ and a binary vector $A$ indicating whether each branch has at least one descendant present in the sample (1) or not (0):

$$\sum_{i = 1}^{b} L_i A_i$$
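
The Faith's PD sum can be worked through in base R without a tree object. The sketch below transcribes the branch lengths from the `ex_tree` diagram shown earlier and records, for each branch, which genera descend from it. The `branches` list and the `faith_by_hand` helper are hypothetical illustrations, not ecodive's implementation.

```r
# Branch lengths transcribed from the ex_tree diagram in this vignette.
# `tips` lists the genera that descend from each branch.
branches <- list(
  list(length = 44, tips = "Haemophilus"),
  list(length = 68, tips = "Bacteroides"),
  list(length = 2,  tips = c("Haemophilus", "Bacteroides")),
  list(length = 18, tips = "Streptococcus"),
  list(length = 11, tips = "Staphylococcus"),
  list(length = 12, tips = c("Streptococcus", "Staphylococcus")),
  list(length = 24, tips = "Corynebacterium"),
  list(length = 13, tips = "Propionibacterium"),
  list(length = 12, tips = c("Corynebacterium", "Propionibacterium")),
  list(length = 11, tips = c("Streptococcus", "Staphylococcus",
                             "Corynebacterium", "Propionibacterium"))
)

# Hypothetical helper: sum of branch lengths (L) weighted by presence (A).
faith_by_hand <- function(present, branches) {
  L <- vapply(branches, function(b) b$length, numeric(1))
  # A is 1 when any descendant of the branch is present in the sample
  A <- vapply(branches, function(b) as.numeric(any(b$tips %in% present)), numeric(1))
  sum(L * A)
}

faith_by_hand(c("Propionibacterium", "Corynebacterium"), branches)
#> [1] 60

faith_by_hand(c("Propionibacterium", "Haemophilus"), branches)
#> [1] 82
```

Both results match the `faith()` calls on `ex_tree` in the Phylogenetic Metrics section, which suggests that branches leading back to the root are included in the sum.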