Package 'ecodive' reference manual

Title:	Parallel and Memory-Efficient Ecological Diversity Metrics
Description:	Computes alpha and beta diversity metrics in a parallel, memory-efficient manner using 'C' and 'pthreads'. Implements 51 classic, compositional, and phylogenetic indices including Shannon, Bray-Curtis, Faith, JSD, Hellinger, Robust Aitchison, and the UniFrac family. Provides random sub-sampling for table rarefaction and parses Newick trees into 'phylo' objects. Integrates with data structures from 'phyloseq', 'rbiom', and other common bioinformatics packages. Algorithms are described in Smith et al. (2026) <doi:10.21105/joss.09777>.
Authors:	Daniel P. Smith [aut, cre] (ORCID: <https://orcid.org/0000-0002-2479-2044>), Melissa O'Neill [ctb, cph] (PCG random number generator), Alkek Center for Metagenomics and Microbiome Research [cph, fnd]
Maintainer:	Daniel P. Smith <[email protected]>
License:	MIT + file LICENSE
Version:	2.3.0
Built:	2026-06-24 05:21:28 UTC
Source:	https://github.com/cmmr/ecodive

Abundance-based Coverage Estimator (ACE)

Description

A non-parametric estimator of species richness that separates features into abundant and rare groups.

Usage

ace(counts, cutoff = 10L, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

cutoff

The maximum number of observations to consider "rare". Default: 10.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The ACE metric separates features into "abundant" and "rare" groups based on a cutoff (usually 10 counts). It assumes that the presence of abundant species is certain, while the true number of rare species must be estimated.

Equations:

$C_{ace} = 1 - \frac{F_1}{X_{rare}}$

$\gamma_{ace}^2 = \max\left[\frac{F_{rare} \sum_{i=1}^{r}i(i-1)F_i}{C_{ace}X_{rare}(X_{rare} - 1)} - 1, 0\right]$

$D_{ace} = F_{abund} + \frac{F_{rare}}{C_{ace}} + \frac{F_1}{C_{ace}}\gamma_{ace}^2$

Where:

$r$ : Rare cutoff (default 10). Features with $\le r$ counts are considered rare.
$F_i$ : Number of features with exactly $i$ counts.
$F_1$ : Number of features where $X_i = 1$ (singletons).
$F_{rare}$ : Number of rare features where $X_i \le r$ .
$F_{abund}$ : Number of abundant features where $X_i > r$ .
$X_{rare}$ : Total counts belonging to rare features.
$C_{ace}$ : The sample abundance coverage estimator.
$\gamma_{ace}^2$ : The estimated squared coefficient of variation of the rare features' abundances.

Parameter cutoff: The integer threshold that distinguishes rare from abundant features. Standard practice is to use 10.
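The equations above can be sketched in base R for a single sample. This is only an illustration of the formulas, not the package's C implementation. The helper name ace_sketch and the example vector are hypothetical, and the sketch assumes integer counts with at least one rare feature that is not a singleton (so that $C_{ace} > 0$).

```r
# Hypothetical helper: ACE for one sample of integer counts.
ace_sketch <- function(x, cutoff = 10L) {
  x       <- x[x > 0]                 # drop unobserved features
  rare    <- x[x <= cutoff]           # counts of the rare features
  f_abund <- sum(x > cutoff)          # F_abund
  f_rare  <- length(rare)             # F_rare
  x_rare  <- sum(rare)                # X_rare
  f1      <- sum(rare == 1)           # F_1 (singletons)
  c_ace   <- 1 - f1 / x_rare          # C_ace

  # sum_{i=1}^{r} i(i-1)F_i is the same as summing X(X-1) over rare features
  ss      <- sum(rare * (rare - 1))
  gamma2  <- max(f_rare * ss / (c_ace * x_rare * (x_rare - 1)) - 1, 0)

  f_abund + f_rare / c_ace + (f1 / c_ace) * gamma2   # D_ace
}

# Made-up sample: 2 abundant features, 7 rare ones, 3 singletons
x <- c(25, 14, 6, 3, 2, 2, 1, 1, 1, 0)
ace_sketch(x)
```

The estimate is never smaller than the number of observed features, because $F_{rare} / C_{ace} \ge F_{rare}$ and the last term is non-negative.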

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Chao, A., & Lee, S. M. (1992). Estimating the number of classes via sample coverage. Journal of the American Statistical Association, 87(417), 210-217. doi:10.1080/01621459.1992.10475194

Examples

    ace(ex_counts)

Aitchison distance

Description

Calculates the Euclidean distance between centered log-ratio (CLR) transformed abundances.

Usage

aitchison(
  counts,
  margin = 1L,
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pseudocount

Value added to counts to handle zeros before the CLR transformation. See Pseudocount section.

pairs

Which combinations of samples should distances be calculated for? The default value (NULL) calculates all-vs-all. Provide a numeric or logical vector specifying positions in the distance matrix to calculate. See examples.

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.
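To make the pairs argument more concrete, the sketch below lists which sample pair sits at each position of a distance vector. It assumes that positions follow the layout of a base R dist object (lower triangle, column by column); the source does not spell out the ordering, so treat that as an assumption. The sample names and the pair_table, pairs_numeric, and pairs_logical objects are hypothetical.

```r
# Hypothetical sample names; any four samples would do.
samples <- c("A", "B", "C", "D")

# A base R 'dist' object stores the lower triangle column by column:
# (A,B), (A,C), (A,D), (B,C), (B,D), (C,D).
idx <- t(combn(length(samples), 2))   # row k holds the pair at position k
pair_table <- data.frame(
  position = seq_len(nrow(idx)),
  sample1  = samples[idx[, 1]],
  sample2  = samples[idx[, 2]]
)

# Logical form: TRUE for every pair that involves sample "A".
pairs_logical <- pair_table$sample1 == "A" | pair_table$sample2 == "A"

# Numeric form: the same selection as positions.
pairs_numeric <- which(pairs_logical)

pair_table
pairs_numeric
```

Under that assumption, either vector could be passed as pairs to restrict the calculation to comparisons that include sample "A".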

Details

The Aitchison distance is defined as:

$\sqrt{\sum_{i=1}^{n} [(\ln{X_i} - X_L) - (\ln{Y_i} - Y_L)]^2}$

Where:

$X_i$ , $Y_i$ : Absolute counts for the $i$ -th feature.
$X_L$ , $Y_L$ : Mean log of abundances. $X_L = \frac{1}{n}\sum_{i=1}^{n} \ln{X_i}$ .
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
pseudocount <- 1
x <- log((x + pseudocount) / exp(mean(log(x + pseudocount))))
y <- log((y + pseudocount) / exp(mean(log(y + pseudocount))))
sqrt(sum((x-y)^2)) # Euclidean distance

Pseudocount

Zeros are undefined in the Aitchison (CLR) transformation. If pseudocount is NULL (the default) and zeros are detected, the function uses half the minimum non-zero value (min(x[x>0]) / 2) and issues a warning.

To suppress the warning, provide an explicit value (e.g., 1).
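The effect of the pseudocount can be seen in a small base R sketch. The vectors, the clr helper, and aitchison_sketch are hypothetical. The source does not say whether the minimum non-zero value is taken per sample or over the whole matrix; the sketch takes it over both samples together, which is an assumption.

```r
# Two made-up samples, each containing a zero
x <- c(10, 0, 5, 85)
y <- c(20, 4, 0, 76)

# CLR: log of each value minus the mean log (same as dividing by the
# geometric mean before taking logs)
clr <- function(v, pseudocount) {
  lv <- log(v + pseudocount)
  lv - mean(lv)
}

aitchison_sketch <- function(x, y, pseudocount) {
  sqrt(sum((clr(x, pseudocount) - clr(y, pseudocount))^2))
}

# Half the minimum non-zero value (assumption: taken over both samples)
m        <- rbind(x, y)
half_min <- min(m[m > 0]) / 2

d_half <- aitchison_sketch(x, y, half_min)   # fallback-style pseudocount
d_one  <- aitchison_sketch(x, y, 1)          # explicit pseudocount of 1
c(half_min = d_half, one = d_one)
```

The two distances differ, which is why it is worth choosing the pseudocount deliberately rather than relying on the fallback.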

Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Aitchison, J. (1986). The statistical analysis of compositional data. Chapman and Hall. doi:10.1002/bimj.4710300705

Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2), 139-160. doi:10.1111/j.2517-6161.1982.tb01195.x

Costea, P. I., Zeller, G., Sunagawa, S., & Bork, P. (2014). A fair comparison. Nature Methods, 11(4), 359. doi:10.1038/nmeth.2897

Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., & Egozcue, J. J. (2017). Microbiome datasets are compositional: and this is not optional. Frontiers in Microbiology, 8, 2224. doi:10.3389/fmicb.2017.02224

Kaul, A., Mandal, S., Davidov, O., & Peddada, S. D. (2017). Analysis of microbiome data in the presence of excess zeros. Frontiers in Microbiology, 8, 2114. doi:10.3389/fmicb.2017.02114

Examples

    aitchison(ex_counts, pseudocount = 1)

Alpha Diversity Wrapper Function

Description

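Computes the alpha diversity metric named by metric, providing a single entry point to the individual metric functions (e.g. ace, berger, chao1).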

Usage

alpha_div(
  counts,
  metric,
  norm = "percent",
  cutoff = 10L,
  digits = 3L,
  tree = NULL,
  margin = 1L,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

metric

The name of an alpha diversity metric. One of c('ace', 'berger', 'brillouin', 'chao1', 'faith', 'fisher', 'inv_simpson', 'margalef', 'mcintosh', 'menhinick', 'observed', 'shannon', 'simpson', 'squares'). Case-insensitive and partial name matching is supported. Programmatic access via list_metrics('alpha').

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'percent'.

cutoff

The maximum number of observations to consider "rare". Default: 10.

digits

Precision of the returned values, in number of decimal places. E.g. the default digits=3 could return 6.392.

tree

A phylo-class object representing the phylogenetic tree for the OTUs in counts. The OTU identifiers given by colnames(counts) must be present in tree. Can be omitted if a tree is embedded with the counts object or as attr(counts, 'tree').

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

Integer Count Requirements

A frequent and critical error in alpha diversity analysis is providing the wrong type of data to a metric's formula. Some indices are mathematically defined based on counts of individuals and require raw, integer abundance data. Others are based on proportional abundances and can accept either integer counts (which are then converted to proportions) or pre-normalized proportional data. Using proportional data with a metric that requires integer counts will return an error message.

Requires Integer Counts Only

Chao1
ACE
Squares Richness Estimator
Margalef's Index
Menhinick's Index
Fisher's Alpha
Brillouin Index

Can Use Proportional Data

Observed Features
Shannon Index
Gini-Simpson Index
Inverse Simpson Index
Berger-Parker Index
McIntosh Index
Faith's PD
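A short base R sketch illustrates why the two groups behave differently. The vectors and the helper functions shannon and is_whole are hypothetical, and the Shannon formula here uses the natural logarithm, which is an assumption about the convention in use.

```r
# Made-up sample as raw counts and as proportions
counts <- c(40, 25, 10, 2, 1, 1, 0)
props  <- counts / sum(counts)

# Shannon index: depends only on proportions
shannon <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}

# Same answer from counts and from proportions
all.equal(shannon(counts), shannon(props))

# Chao1-style inputs are lost after normalization: no value equals 1 or 2
c(singletons_counts = sum(counts == 1), singletons_props = sum(props == 1))

# Simple check for whole-number, non-negative input
is_whole <- function(x) all(x >= 0 & x == round(x))
c(counts = is_whole(counts), props = is_whole(props))
```

A check like is_whole can be run before calling an integer-only metric to catch pre-normalized tables early.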

Value

A numeric vector.

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Examples

    # Example counts matrix
    ex_counts
    
    # Shannon diversity values
    alpha_div(ex_counts, 'Shannon')
    
    # Chao1 diversity values
    alpha_div(ex_counts, 'c')
    
    # Faith PD values
    alpha_div(ex_counts, 'faith', tree = ex_tree)
    
    

Berger-Parker Index

Description

A measure of the numerical importance of the most abundant species.

Usage

berger(counts, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Berger-Parker index is defined as the proportional abundance of the most dominant feature:

$\max(P_i)$

Where:

$P_i$ : Proportional abundance of the $i$ -th feature.

Base R Equivalent:

x <- ex_counts[1,]
p <- x / sum(x)
max(p)
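The same calculation can be extended to every sample at once, which also shows what the margin argument refers to. The matrix m below is made up for illustration; the package's own implementation is in C and is not reproduced here.

```r
# Made-up table: 2 samples (rows) x 3 features (columns)
m <- matrix(c(8, 1, 1,
              3, 3, 4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("s1", "s2"), c("f1", "f2", "f3")))

# Samples in rows (the margin = 1 layout)
berger_rows <- apply(m, 1, function(x) max(x / sum(x)))

# Same data with samples in columns (the margin = 2 layout)
m_t         <- t(m)
berger_cols <- apply(m_t, 2, function(x) max(x / sum(x)))

berger_rows
berger_cols
```

Both layouts give one value per sample; only the direction of the apply() call changes.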

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Berger, W. H., & Parker, F. L. (1970). Diversity of planktonic foraminifera in deep-sea sediments. Science, 168(3937), 1345-1347. doi:10.1126/science.168.3937.1345

Examples

    berger(ex_counts)

Beta Diversity Wrapper Function

Description

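Computes the beta diversity metric named by metric, providing a single entry point to the individual metric functions listed under Details.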

Usage

beta_div(
  counts,
  metric,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  power = 1.5,
  alpha = 0.5,
  tree = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted by some diversity metrics.

metric

The name of a beta diversity metric. One of c('aitchison', 'bhattacharyya', 'bray', 'canberra', 'chebyshev', 'chord', 'clark', 'divergence', 'euclidean', 'generalized_unifrac', 'gower', 'hamming', 'hellinger', 'horn', 'jaccard', 'jensen', 'jsd', 'lorentzian', 'manhattan', 'matusita', 'minkowski', 'morisita', 'motyka', 'normalized_unifrac', 'ochiai', 'psym_chisq', 'soergel', 'sorensen', 'squared_chisq', 'squared_chord', 'squared_euclidean', 'topsoe', 'unweighted_unifrac', 'variance_adjusted_unifrac', 'wave_hedges', 'weighted_unifrac'). Flexible matching is supported (see below). Programmatic access via list_metrics('beta').

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

power

Only used when metric = 'minkowski'. Scaling factor for the magnitude of differences between communities ( $p$ ). Default: 1.5

alpha

Only used when metric = 'generalized_unifrac'. How much weight to give to relative abundances; a value between 0 and 1, inclusive. Setting alpha=1 is equivalent to normalized_unifrac(). Default: 0.5

tree

Only used by phylogeny-aware metrics. A phylo-class object representing the phylogenetic tree for the OTUs in counts. The OTU identifiers given by colnames(counts) must be present in tree. Can be omitted if a tree is embedded with the counts object or as attr(counts, 'tree').

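pairs

Which combinations of samples should distances be calculated for? The default value (NULL) calculates all-vs-all. Provide a numeric or logical vector specifying positions in the distance matrix to calculate.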

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

List of Beta Diversity Metrics

Option / Function Name	Metric Name
`aitchison`	Aitchison distance
`bhattacharyya`	Bhattacharyya distance
`bray`	Bray-Curtis dissimilarity
`canberra`	Canberra distance
`chebyshev`	Chebyshev distance
`chord`	Chord distance
`clark`	Clark's divergence distance
`divergence`	Divergence
`euclidean`	Euclidean distance
`generalized_unifrac`	Generalized UniFrac (GUniFrac)
`gower`	Gower distance
`hamming`	Hamming distance
`hellinger`	Hellinger distance
`horn`	Horn-Morisita dissimilarity
`jaccard`	Jaccard distance
`jensen`	Jensen-Shannon distance
`jsd`	Jensen-Shannon divergence (JSD)
`lorentzian`	Lorentzian distance
`manhattan`	Manhattan distance
`matusita`	Matusita distance
`minkowski`	Minkowski distance
`morisita`	Morisita dissimilarity
`motyka`	Motyka dissimilarity
`normalized_unifrac`	Normalized Weighted UniFrac
`ochiai`	Otsuka-Ochiai dissimilarity
`psym_chisq`	Probabilistic Symmetric Chi-Squared distance
`soergel`	Soergel distance
`sorensen`	Dice-Sorensen dissimilarity
`squared_chisq`	Squared Chi-Squared distance
`squared_chord`	Squared Chord distance
`squared_euclidean`	Squared Euclidean distance
`topsoe`	Topsoe distance
`unweighted_unifrac`	Unweighted UniFrac
`variance_adjusted_unifrac`	Variance-Adjusted Weighted UniFrac (VAW-UniFrac)
`wave_hedges`	Wave Hedges distance
`weighted_unifrac`	Weighted UniFrac

Flexible name matching

Matching is case-insensitive and supports partial names. Any runs of non-alphabetic characters are converted to underscores. E.g. metric = 'Weighted UniFrac' selects weighted_unifrac.

UniFrac names can be shortened to the first letter plus "unifrac". E.g. uunifrac, w_unifrac, or 'V UniFrac'. These also support partial matching.

Finished code should always use the full primary option name to avoid ambiguity with future additions to the metrics list.
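The matching rules described above can be approximated in a few lines of base R. This is a sketch of the general idea, not the package's actual matching code: the function match_metric_sketch and the shortened metrics vector are hypothetical, and the first-letter UniFrac aliases are not handled.

```r
# Shortened, illustrative list of option names
metrics <- c("bray", "jaccard", "weighted_unifrac",
             "unweighted_unifrac", "normalized_unifrac")

# Hypothetical helper: normalize the input, then partial-match it
match_metric_sketch <- function(name, choices = metrics) {
  key <- tolower(name)                  # case-insensitive
  key <- gsub("[^a-z]+", "_", key)      # runs of non-alpha -> underscore
  hit <- pmatch(key, choices)           # exact match first, then unique prefix
  if (is.na(hit)) stop("no unique match for '", name, "'")
  choices[hit]
}

match_metric_sketch("Weighted UniFrac")
match_metric_sketch("Jac")
```

Because pmatch() returns NA for ambiguous prefixes, a prefix that is unique today can stop matching when new metrics are added, which is the reason for spelling out full names in finished code.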

Value

A numeric vector.

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'.

Zeros are undefined in the centered log-ratio (CLR) transformation. If norm = 'clr', pseudocount is NULL (the default), and zeros are detected, the function uses half the minimum non-zero value (min(x[x>0]) / 2) and issues a warning.

To suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.

Examples

    # Example counts matrix
    ex_counts
    
    # Bray-Curtis distances
    beta_div(ex_counts, 'bray')
    
    # Generalized UniFrac distances
    beta_div(ex_counts, 'GUniFrac', tree = ex_tree)
    

Bhattacharyya distance

Description

Measures the similarity of two probability distributions.

Usage

bhattacharyya(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

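pairs

Which combinations of samples should distances be calculated for? The default value (NULL) calculates all-vs-all. Provide a numeric or logical vector specifying positions in the distance matrix to calculate.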

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Bhattacharyya distance is defined as:

$-\ln{\sum_{i=1}^{n}\sqrt{P_{i}Q_{i}}}$

Where:

$P_i$ , $Q_i$ : Proportional abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
-log(sum(sqrt(p * q)))
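Two boundary cases help with interpreting the values. The sketch below wraps the base R expression above in a hypothetical helper, bhatt_sketch, and uses made-up vectors; how the package itself reports samples with no shared features is not stated in the source.

```r
# Hypothetical helper wrapping the base R expression
bhatt_sketch <- function(x, y) {
  p <- x / sum(x)
  q <- y / sum(y)
  -log(sum(sqrt(p * q)))
}

# Identical composition: the sum of sqrt(p * q) is 1, so the distance is 0
same <- bhatt_sketch(c(5, 5, 0), c(5, 5, 0))

# No shared features: the sum is 0, so the distance is infinite
disjoint <- bhatt_sketch(c(5, 5, 0, 0), c(0, 0, 3, 7))

c(same = same, disjoint = disjoint)
```

Because the distance is unbounded, a few sample pairs with little overlap can dominate downstream summaries.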

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society, 35, 99-109.

Examples

    bhattacharyya(ex_counts)

Bray-Curtis dissimilarity

Description

A standard ecological metric quantifying the dissimilarity between communities.

Usage

bray(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

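pairs

Which combinations of samples should distances be calculated for? The default value (NULL) calculates all-vs-all. Provide a numeric or logical vector specifying positions in the distance matrix to calculate.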

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Bray-Curtis dissimilarity is defined as:

$\frac{\sum_{i=1}^{n} |X_i - Y_i|}{\sum_{i=1}^{n} (X_i + Y_i)}$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x-y)) / sum(x+y)
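The pairwise formula can be extended to an all-vs-all comparison with a simple loop. The matrix m and the helper bray_pair are made up for illustration; this plain R loop is only a reference sketch and says nothing about how the package's parallel C code is organized.

```r
# Made-up table: 3 samples (rows) x 3 features (columns)
m <- rbind(s1 = c(10, 0, 5),
           s2 = c(2, 8, 5),
           s3 = c(0, 0, 15))

bray_pair <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Fill a symmetric matrix, one sample pair at a time
n <- nrow(m)
d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    d[i, j] <- d[j, i] <- bray_pair(m[i, ], m[j, ])
  }
}

bray_dist <- as.dist(d)   # compact lower-triangle form
bray_dist
```

For s1 and s2 the numerator is 8 + 8 + 0 = 16 and the denominator is 30, giving 16/30.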

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'.

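Zeros are undefined in the centered log-ratio (CLR) transformation. If norm = 'clr', pseudocount is NULL (the default), and zeros are detected, the function uses half the minimum non-zero value (min(x[x>0]) / 2) and issues a warning.

To suppress the warning, provide an explicit value (e.g., 1).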

See aitchison for references.

References

Bray, J. R., & Curtis, J. T. (1957). An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs, 27(4), 325-349. doi:10.2307/1942268

Examples

    bray(ex_counts)

Brillouin Index

Description

A diversity index derived from information theory, appropriate for fully censused communities.

Usage

brillouin(counts, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Brillouin index is defined as:

$\frac{\ln{[(\sum_{i = 1}^{n} X_i)!]} - \sum_{i = 1}^{n} \ln{(X_i!)}}{\sum_{i = 1}^{n} X_i}$

Where:

$n$ : The number of features.
$X_i$ : Integer count of the $i$ -th feature.

Base R Equivalent:

x <- ex_counts[1,]
# note: lgamma(x + 1) == log(x!)
(lgamma(sum(x) + 1) - sum(lgamma(x + 1))) / sum(x)
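The reason for using lgamma() rather than factorial() shows up with realistic sequencing depths. The sketch below uses made-up vectors to check that both routes agree on a tiny sample, and that only the lgamma() route stays finite on a large one.

```r
# Tiny made-up sample: both routes are usable
x <- c(2, 1, 1)
via_lgamma    <- (lgamma(sum(x) + 1) - sum(lgamma(x + 1))) / sum(x)
via_factorial <- (log(factorial(sum(x))) - sum(log(factorial(x)))) / sum(x)
all.equal(via_lgamma, via_factorial)

# Larger made-up sample: factorial() overflows to Inf, lgamma() does not
big <- c(5000, 3000, 2000)
overflowed <- suppressWarnings(factorial(sum(big)))
brillouin_big <- (lgamma(sum(big) + 1) - sum(lgamma(big + 1))) / sum(big)

c(factorial_finite = is.finite(overflowed),
  lgamma_finite    = is.finite(brillouin_big))
```

For the tiny sample the index works out to ln(4! / 2!) / 4 = ln(12) / 4.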

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Brillouin, L. (1956). Science and information theory. Academic Press.

Examples

    brillouin(ex_counts)

Canberra distance

Description

A weighted version of the Manhattan distance, sensitive to differences when both values are small.

Usage

canberra(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

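pairs

Which combinations of samples should distances be calculated for? The default value (NULL) calculates all-vs-all. Provide a numeric or logical vector specifying positions in the distance matrix to calculate.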

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Canberra distance is defined as:

$\sum_{i=1}^{n} \frac{|X_i - Y_i|}{X_i + Y_i}$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x-y) / (x+y))
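One practical caveat with the base R expression is that a feature absent from both samples produces 0/0. The sketch below, with made-up vectors, shows the problem and one common convention for avoiding it: skipping features where both values are zero. Whether the package follows this convention is not stated in the source, so it is an assumption here.

```r
# Made-up samples; feature 2 is absent from both
x <- c(4, 0, 6, 0)
y <- c(1, 0, 6, 3)

# Naive expression: the 0/0 term makes the whole sum NaN
naive <- sum(abs(x - y) / (x + y))

# Convention (assumed): ignore features that are zero in both samples
keep     <- (x + y) > 0
canberra <- sum(abs(x[keep] - y[keep]) / (x[keep] + y[keep]))

c(naive = naive, skipping_double_zeros = canberra)
```

The remaining terms are 3/5, 0/12, and 3/3, so the second value is 1.6.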

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'.

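Zeros are undefined in the centered log-ratio (CLR) transformation. If norm = 'clr', pseudocount is NULL (the default), and zeros are detected, the function uses half the minimum non-zero value (min(x[x>0]) / 2) and issues a warning.

To suppress the warning, provide an explicit value (e.g., 1).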

See aitchison for references.

References

Lance, G. N., & Williams, W. T. (1966). Computer programs for hierarchical polythetic classification ("similarity analyses"). The Computer Journal, 9(1), 60-64. doi:10.1093/comjnl/9.1.60

Examples

    canberra(ex_counts)

Chao1 Richness Estimator

Description

A non-parametric estimator of the lower bound of species richness.

Usage

chao1(counts, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Chao1 estimator uses the ratio of singletons to doubletons to estimate the number of missing species:

$n + \frac{(F_1)^2}{2 F_2}$

Where:

$n$ : The number of observed features.
$F_1$ : Number of features observed once (singletons).
$F_2$ : Number of features observed twice (doubletons).

Base R Equivalent:

x <- ex_counts[1,]
sum(x>0) + (sum(x == 1) ^ 2) / (2 * sum(x == 2))
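The base R expression divides by the number of doubletons, so a sample without any doubletons yields an infinite estimate. The sketch below uses a made-up vector to show this, alongside the widely used bias-corrected form $n + \frac{F_1(F_1 - 1)}{2(F_2 + 1)}$, which stays finite. The source does not say how the package treats $F_2 = 0$, so the bias-corrected line is shown only as a common alternative.

```r
# Made-up sample with three singletons and no doubletons
x  <- c(12, 7, 3, 1, 1, 1)
n  <- sum(x > 0)     # observed features
f1 <- sum(x == 1)    # singletons
f2 <- sum(x == 2)    # doubletons

# Formula as documented: division by zero when f2 == 0
classic <- n + f1^2 / (2 * f2)

# Bias-corrected variant (not necessarily what the package uses)
bias_corrected <- n + f1 * (f1 - 1) / (2 * (f2 + 1))

c(classic = classic, bias_corrected = bias_corrected)
```

Checking sum(x == 2) before interpreting a Chao1 value is a cheap way to spot samples affected by this.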

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Chao, A. (1984). Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11, 265-270.

Examples

    chao1(ex_counts)

Chebyshev distance

Description

The maximum difference between any single feature across two samples.

Usage

chebyshev(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

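pairs

Which combinations of samples should distances be calculated for? The default value (NULL) calculates all-vs-all. Provide a numeric or logical vector specifying positions in the distance matrix to calculate.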

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Chebyshev distance is defined as:

$\max(|X_i - Y_i|)$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
max(abs(x-y))

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'.

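Zeros are undefined in the centered log-ratio (CLR) transformation. If norm = 'clr', pseudocount is NULL (the default), and zeros are detected, the function uses half the minimum non-zero value (min(x[x>0]) / 2) and issues a warning.

To suppress the warning, provide an explicit value (e.g., 1).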

See aitchison for references.

References

Cantrell, C. D. (2000). Modern mathematical methods for physicists and engineers. Cambridge University Press.

Examples

    chebyshev(ex_counts)

Chord distance

Description

Euclidean distance between normalized vectors.

Usage

chord(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

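pairs

Which combinations of samples should distances be calculated for? The default value (NULL) calculates all-vs-all. Provide a numeric or logical vector specifying positions in the distance matrix to calculate.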

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Chord distance is defined as:

$\sqrt{\sum_{i=1}^{n} \left(\frac{X_i}{\sqrt{\sum_{j=1}^{n} X_j^2}} - \frac{Y_i}{\sqrt{\sum_{j=1}^{n} Y_j^2}}\right)^2}$

Where:

$X_i$ , $Y_i$ : Absolute counts of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
sqrt(sum(((x / sqrt(sum(x ^ 2))) - (y / sqrt(sum(y ^ 2))))^2))
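Because each sample is scaled to unit length first, the Chord distance depends only on the angle between the two abundance vectors: it equals $\sqrt{2(1 - \cos\theta)}$. The sketch below checks this identity and the resulting scale invariance on made-up vectors; chord_sketch is a hypothetical helper.

```r
# Hypothetical helper wrapping the base R expression
chord_sketch <- function(x, y) {
  sqrt(sum((x / sqrt(sum(x^2)) - y / sqrt(sum(y^2)))^2))
}

# Made-up samples
x <- c(3, 4, 0)
y <- c(0, 4, 3)

# Cosine of the angle between the two vectors
cosine <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

chord_direct <- chord_sketch(x, y)
chord_cosine <- sqrt(2 * (1 - cosine))
all.equal(chord_direct, chord_cosine)

# Multiplying one sample by a constant leaves the distance unchanged
all.equal(chord_sketch(10 * x, y), chord_direct)
```

Here the cosine is 16/25, so both routes give sqrt(0.72). The scale invariance means differences in sequencing depth alone do not change the distance.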

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Orlóci, L. (1967). An agglomerative method for classification of plant communities. Journal of Ecology, 55(1), 193-206. doi:10.2307/2257725

Examples

    chord(ex_counts)

Clark's divergence distance

Description

Also known as the coefficient of divergence.

Usage

clark(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

Clark's divergence distance is defined as:

$\sqrt{\sum_{i=1}^{n}\left(\frac{X_i - Y_i}{X_i + Y_i}\right)^{2}}$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
sqrt(sum((abs(x - y) / (x + y)) ^ 2))
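
One practical detail of the expression above: a feature that is absent from both samples produces 0 / 0, which R evaluates to NaN. The sketch below uses hypothetical vectors and assumes that such features should simply contribute zero; this page does not state how clark() itself treats them.

```r
# Hypothetical count vectors; the third feature is absent from both samples.
x <- c(10, 0, 0, 4)
y <- c( 5, 3, 0, 4)

# The plain expression yields NaN because 0 / 0 is undefined.
naive <- sqrt(sum((abs(x - y) / (x + y))^2))

# Assumption: features absent from both samples contribute zero.
keep     <- (x + y) > 0
clark_xy <- sqrt(sum(((x[keep] - y[keep]) / (x[keep] + y[keep]))^2))
```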

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'. Leaving it at NULL in that case results in a warning; to suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.

References

Clark, P. J. (1952). An extension of the coefficient of divergence for use with multiple characters. Copeia, 1952(2), 61-64. doi:10.2307/1438598

Examples

    clark(ex_counts)

Divergence

Description

A probabilistic divergence metric.

Usage

divergence(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

Divergence is defined as:

$2\sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}$

Where:

$P_i$ , $Q_i$ : Proportional abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
2 * sum((p - q)^2 / (p + q)^2)

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'. Leaving it at NULL in that case results in a warning; to suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.

References

Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.

Examples

    divergence(ex_counts)

Euclidean distance

Description

The straight-line distance between two points in multidimensional space.

Usage

euclidean(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Euclidean distance is defined as:

$\sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
sqrt(sum((x-y)^2))
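
For a cross-check on a whole matrix, base R's stats::dist() computes the same quantity for every pair of rows. The matrix m below is hypothetical and only serves to illustrate the formula; samples are assumed to be in rows.

```r
# Hypothetical 3-sample x 3-feature matrix (samples in rows).
m <- matrix(
  c(0, 3,  1,
    4, 0,  1,
    0, 3, 13),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("f1", "f2", "f3")))

# All pairwise Euclidean distances between samples.
d <- as.matrix(dist(m, method = "euclidean"))

# The pairwise formula from the Details section, for samples A and B.
x <- m["A", ]
y <- m["B", ]
manual <- sqrt(sum((x - y)^2))
```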

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'. Leaving it at NULL in that case results in a warning; to suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.

References

Legendre, P., & Legendre, L. (2012). Numerical ecology. Elsevier.

Examples

    euclidean(ex_counts)

Example counts matrix

Description

Genera found on four human body sites.

Usage

ex_counts

Format

A matrix of 4 samples (columns) x 6 genera (rows).

Source

Derived from The Human Microbiome Project dataset.

Example phylogenetic tree

Description

Companion tree for ex_counts.

Usage

ex_tree

Format

A phylo object.

Details

ex_tree encodes this tree structure:

      +----------44---------- Haemophilus
  +-2-|
  |   +----------------68---------------- Bacteroides  
  |                      
  |             +---18---- Streptococcus
  |      +--12--|       
  |      |      +--11-- Staphylococcus
  +--11--|              
         |      +-----24----- Corynebacterium
         +--12--|
                +--13-- Propionibacterium
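
The diagram above can also be written as a Newick string, which may be easier to reuse in other tools. The string nwk below is a hand transcription of the diagram (an assumption, not an export from the package), reading the numbers as branch lengths; the base R code only extracts the labels and lengths as a sanity check.

```r
# Hand-written Newick transcription of the diagram (an assumption).
nwk <- paste0(
  "((Haemophilus:44,Bacteroides:68):2,",
  "((Streptococcus:18,Staphylococcus:11):12,",
  "(Corynebacterium:24,Propionibacterium:13):12):11);")

# Extract tip labels and branch lengths with base R regular expressions.
tips  <- regmatches(nwk, gregexpr("[A-Za-z]+", nwk))[[1]]
brlen <- as.numeric(
  regmatches(nwk, gregexpr("(?<=:)[0-9.]+", nwk, perl = TRUE))[[1]])

length(tips)  # number of genera in the tree
sum(brlen)    # total branch length of the tree
```

If the 'ape' package happens to be installed, ape::read.tree(text = nwk) should turn the same string into a phylo object.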

Faith's Phylogenetic Diversity (PD)

Description

Calculates the sum of the branch lengths for all species present in a sample.

Usage

faith(counts, tree = NULL, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

tree

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

Faith's PD is defined as:

$\sum_{i = 1}^{n} L_i A_i$

Where:

$n$ : The number of branches in the phylogenetic tree.
$L_i$ : The length of the $i$ -th branch.
$A_i$ : A binary value (1 if any descendants of branch $i$ are present in the sample, 0 otherwise).
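
As a worked illustration of the formula, the sketch below encodes the branches of the ex_tree diagram by hand and evaluates the sum for a hypothetical sample containing only Streptococcus and Bacteroides. It applies the formula literally to every branch drawn in the diagram; the branch list and the sample are assumptions made for this example, and the sketch is not meant to describe how faith() is implemented.

```r
# Branches of the ex_tree diagram: length and the tips descending from each.
branches <- list(
  list(len = 44, tips = "Haemophilus"),
  list(len = 68, tips = "Bacteroides"),
  list(len =  2, tips = c("Haemophilus", "Bacteroides")),
  list(len = 18, tips = "Streptococcus"),
  list(len = 11, tips = "Staphylococcus"),
  list(len = 12, tips = c("Streptococcus", "Staphylococcus")),
  list(len = 24, tips = "Corynebacterium"),
  list(len = 13, tips = "Propionibacterium"),
  list(len = 12, tips = c("Corynebacterium", "Propionibacterium")),
  list(len = 11, tips = c("Streptococcus", "Staphylococcus",
                          "Corynebacterium", "Propionibacterium")))

# Hypothetical sample: counts for two genera, everything else absent.
sample_counts <- c(Streptococcus = 25, Bacteroides = 7)
present <- names(sample_counts)[sample_counts > 0]

# L_i: branch lengths. A_i: TRUE if any descendant tip is present.
L  <- vapply(branches, function(b) b$len, numeric(1))
A  <- vapply(branches, function(b) any(b$tips %in% present), logical(1))
pd <- sum(L * A)  # 68 + 2 + 18 + 12 + 11
```

Note that the counts themselves never enter the sum; only presence or absence matters.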

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Faith, D. P. (1992). Conservation evaluation and phylogenetic diversity. Biological Conservation, 61(1), 1-10. doi:10.1016/0006-3207(92)91201-3

Examples

    faith(ex_counts, tree = ex_tree)

Fisher's Alpha

Description

A parametric diversity index assuming species abundances follow a log-series distribution.

Usage

fisher(counts, digits = 3L, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

digits

Precision of the returned values, in number of decimal places. E.g. the default digits=3 could return 6.392.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

Fisher's Alpha ( $\alpha$ ) is the parameter in the equation:

$\frac{n}{\alpha} = \ln{\left(1 + \frac{X_T}{\alpha}\right)}$

Where:

$n$ : The number of features.
$X_T$ : Total of all counts (sequencing depth).

The value of $\alpha$ is solved for iteratively.
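
One way to picture the iterative solution is as a root-finding problem. The sketch below rearranges the equation to f(alpha) = alpha * ln(1 + X_T / alpha) - n and hands it to base R's uniroot(). This is a stand-in technique chosen for illustration, with hypothetical values for n and X_T; the solver that fisher() uses internally is not described here.

```r
# Hypothetical sample: 10 features, 100 total counts.
n   <- 10
x_t <- 100

# f(alpha) is zero exactly when n / alpha == ln(1 + x_t / alpha).
f <- function(alpha) alpha * log1p(x_t / alpha) - n

# f is negative near zero and positive for large alpha whenever x_t > n,
# so a wide bracket is enough for uniroot() to locate the root.
alpha <- uniroot(f, interval = c(1e-8, 1e8), tol = 1e-10)$root

# Report the estimate at three decimal places, mirroring digits = 3.
round(alpha, digits = 3)
```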

Parameter: digits

The precision (number of decimal places) to use when solving the equation.

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Fisher, R. A., Corbet, A. S., & Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 12, 42-58. doi:10.2307/1411

Examples

    fisher(ex_counts)

Generalized UniFrac (GUniFrac)

Description

A unified UniFrac distance that balances the weight of abundant and rare lineages.

Usage

generalized_unifrac(
  counts,
  tree = NULL,
  alpha = 0.5,
  margin = 1L,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

tree

alpha

How much weight to give to relative abundances; a value between 0 and 1, inclusive. Setting alpha=1 is equivalent to normalized_unifrac().

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Generalized UniFrac distance is defined as:

$\frac{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}\left|\frac{P_i - Q_i}{P_i + Q_i}\right|}{\sum_{i=1}^{n} L_i(P_i + Q_i)^{\alpha}}$

Where:

$n$ : The number of branches in the tree.
$L_i$ : The length of the $i$ -th branch.
$P_i$ , $Q_i$ : The proportion of the community descending from branch $i$ in sample P and Q.
$\alpha$ : A scalable weighting factor.

Parameter: alpha

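The alpha parameter controls the weight given to abundant lineages. $\alpha = 1$ corresponds to normalized Weighted UniFrac (see normalized_unifrac()), while $\alpha = 0$ weights every branch by its length alone, so that only the relative differences $|P_i - Q_i| / (P_i + Q_i)$ matter, which brings the result closer to Unweighted UniFrac.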
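
To see how alpha shifts the weighting, the formula can be evaluated directly on a handful of branches. Everything in the sketch below is hypothetical: the vectors L, P and Q stand for branch lengths and per-branch proportions, and the helper gunifrac() is written only for this illustration. Branches absent from both samples are dropped, on the assumption that they contribute nothing.

```r
# Hypothetical branch lengths and the proportion of each sample
# descending from each branch.
L <- c(2, 1, 3, 1)
P <- c(0.5, 0.5, 0.0, 1.0)
Q <- c(0.2, 0.0, 0.8, 1.0)

# Illustrative helper, not part of any package.
gunifrac <- function(L, P, Q, alpha) {
  keep <- (P + Q) > 0  # assumption: skip branches absent from both samples
  L <- L[keep]; P <- P[keep]; Q <- Q[keep]
  w <- L * (P + Q)^alpha  # branch weights
  sum(w * abs(P - Q) / (P + Q)) / sum(w)
}

# Distances for alpha = 0, 0.5 and 1.
d <- sapply(c(0, 0.5, 1), function(a) gunifrac(L, P, Q, a))
```

With alpha = 1 the weights cancel the denominator inside the sum, leaving sum(L * abs(P - Q)) / sum(L * (P + Q)).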

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Chen, J., Bittinger, K., Charlson, E. S., Hoffmann, C., Lewis, J., Wu, G. D., ... & Li, H. (2012). Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics, 28(16), 2106-2113. doi:10.1093/bioinformatics/bts342

Examples

    generalized_unifrac(ex_counts, tree = ex_tree, alpha = 0.5)

Gower distance

Description

A distance metric that normalizes differences by the range of the feature.

Usage

gower(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Gower distance is defined as:

$\frac{1}{n}\sum_{i=1}^{n}\frac{|X_i - Y_i|}{R_i}$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.
$R_i$ : The range of the $i$ -th feature across all samples (max - min).
$n$ : The number of features.

Base R Equivalent:

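x <- ex_counts[1,]
y <- ex_counts[2,]
# Range (max - min) of each feature across all samples (samples in rows).
r <- apply(ex_counts, 2, function(v) max(v) - min(v))
n <- length(x)
sum(abs(x-y) / r) / n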

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'. Leaving it at NULL in that case results in a warning; to suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.

References

Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27(4), 857-871. doi:10.2307/2528823

Examples

    gower(ex_counts)

Hamming distance

Description

Measures the minimum number of substitutions required to change one string into the other. For count data, this is the number of features present in exactly one of the two samples.

Usage

hamming(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Hamming distance is defined as:

$(A + B) - 2J$

Where:

$A$ , $B$ : Number of features in each sample.
$J$ : Number of features in common (intersection).

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
sum(xor(x, y))
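
The xor() shortcut and the $(A + B) - 2J$ definition can be checked against each other. The vectors below are hypothetical and chosen only so that the two routes are easy to follow by hand.

```r
# Hypothetical count vectors for two samples.
x <- c(5, 0, 2, 0, 1)
y <- c(0, 0, 7, 3, 4)

a <- sum(x > 0)          # features present in the first sample
b <- sum(y > 0)          # features present in the second sample
j <- sum(x > 0 & y > 0)  # features present in both

hamming_abj <- (a + b) - 2 * j  # definition from the Details section
hamming_xor <- sum(xor(x, y))   # features present in exactly one sample
```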

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System Technical Journal, 29(2), 147-160. doi:10.1002/j.1538-7305.1950.tb00463.x

Examples

    hamming(ex_counts)

Hellinger distance

Description

A distance metric related to the Bhattacharyya distance, often used for community data with many zeros.

Usage

hellinger(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Hellinger distance is defined as:

$\sqrt{\sum_{i=1}^{n}(\sqrt{P_i} - \sqrt{Q_i})^{2}}$

Where:

$P_i$ , $Q_i$ : Proportional abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sqrt(sum((sqrt(p) - sqrt(q)) ^ 2))

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Rao, C. R. (1995). A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Qüestiió, 19, 23-63.

Hellinger, E. (1909). Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik, 136, 210–271. doi:10.1515/crll.1909.136.210

Examples

    hellinger(ex_counts)

Horn-Morisita dissimilarity

Description

A dissimilarity index derived from Simpson's diversity index, suitable for abundance data.

Usage

horn(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Horn-Morisita dissimilarity is defined as:

$1 - \frac{2\sum_{i=1}^{n}P_{i}Q_{i}}{\sum_{i=1}^{n}P_i^2 + \sum_{i=1}^{n}Q_i^2}$

Where:

$P_i$ , $Q_i$ : Proportional abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
z <- sum(x^2) / sum(x)^2 + sum(y^2) / sum(y)^2
1 - ((2 * sum(x * y)) / (z * sum(x) * sum(y)))
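
The count-based expression above and the proportion-based formula describe the same quantity. A small check with hypothetical vectors makes the correspondence explicit:

```r
# Hypothetical count vectors for two samples.
x <- c(10,  0, 30, 60)
y <- c(20, 20,  0, 60)

# Proportion form, as in the formula.
p <- x / sum(x)
q <- y / sum(y)
horn_prop <- 1 - 2 * sum(p * q) / (sum(p^2) + sum(q^2))

# Count form, as in the Base R Equivalent.
z <- sum(x^2) / sum(x)^2 + sum(y^2) / sum(y)^2
horn_counts <- 1 - (2 * sum(x * y)) / (z * sum(x) * sum(y))
```

Here z equals sum(p^2) + sum(q^2), and sum(x * y) / (sum(x) * sum(y)) equals sum(p * q), so the two results should agree up to floating-point error.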

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'. Leaving it at NULL in that case results in a warning; to suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.

References

Horn, H. S. (1966). Measurement of "overlap" in comparative ecological studies. The American Naturalist, 100(914), 419-424. doi:10.1086/282436

Examples

    horn(ex_counts)

Inverse Simpson Index

Description

A transformation of the Simpson index that represents the "effective number of species".

Usage

inv_simpson(counts, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Inverse Simpson index is defined as:

$1 / \sum_{i = 1}^{n} P_i^2$

Where:

$n$ : The number of features.
$P_i$ : Proportional abundance of the $i$ -th feature.

Base R Equivalent:

x <- ex_counts[1,]
p <- x / sum(x)
1 / sum(p^2)
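
The "effective number of species" reading is easiest to see on extreme cases. The helper inv_simpson_vec() and both count vectors below are hypothetical and exist only for this illustration.

```r
# Illustrative helper for a single vector of counts.
inv_simpson_vec <- function(x) {
  p <- x / sum(x)
  1 / sum(p^2)
}

even   <- c(25, 25, 25, 25)  # four equally abundant features
skewed <- c(97, 1, 1, 1)     # one dominant feature, three rare ones

inv_even   <- inv_simpson_vec(even)
inv_skewed <- inv_simpson_vec(skewed)
```

A perfectly even community of four features gives exactly 4, while the dominated community gives a value only slightly above 1, even though both contain four features.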

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Simpson, E. H. (1949). Measurement of diversity. Nature, 163, 688. doi:10.1038/163688a0

Examples

    inv_simpson(ex_counts)

Jaccard distance

Description

Measures dissimilarity between sample sets.

Usage

jaccard(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Jaccard distance is defined as:

$1 - \frac{J}{(A + B - J)}$

Where:

$A$ , $B$ : Number of features in each sample.
$J$ : Number of features in common (intersection).

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
1 - sum(x & y) / sum(x | y)

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11(2), 37-50. doi:10.1111/j.1469-8137.1912.tb05611.x

Jaccard, P. (1908). Nouvelles recherches sur la distribution florale. Bulletin de la Societe Vaudoise des Sciences Naturelles, 44(163), 223-270. doi:10.5169/seals-268384

Examples

    jaccard(ex_counts)

Jensen-Shannon distance

Description

The square root of the Jensen-Shannon divergence.

Usage

jensen(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Jensen-Shannon distance is defined as:

$\sqrt{\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]}$

Where:

$P_i$ , $Q_i$ : Proportional abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sqrt(sum(p * log(2 * p / (p+q)), q * log(2 * q / (p+q))) / 2)

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Endres, D. M., & Schindelin, J. E. (2003). A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7), 1858-1860. doi:10.1109/TIT.2003.813506

Examples

    jensen(ex_counts)

Jensen-Shannon divergence (JSD)

Description

A symmetrized and smoothed version of the Kullback-Leibler divergence.

Usage

jsd(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Jensen-Shannon divergence (JSD) is defined as:

$\frac{1}{2}\left[\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)\right]$

Where:

$P_i$ , $Q_i$ : Proportional abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sum(p * log(2 * p / (p+q)), q * log(2 * q / (p+q))) / 2
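
The one-line expression above returns NaN whenever a feature is missing from one of the samples, because R evaluates 0 * log(0) as NaN. The sketch below applies the usual convention that such terms contribute zero. That convention, the helper names kl_term() and jsd_vec(), and the test vectors are assumptions made for this illustration.

```r
# Sum terms a * ln(a / m), treating terms with a == 0 as zero.
kl_term <- function(a, m) {
  out <- numeric(length(a))
  nz  <- a > 0
  out[nz] <- a[nz] * log(a[nz] / m[nz])
  out
}

# Jensen-Shannon divergence for two count vectors.
jsd_vec <- function(x, y) {
  p <- x / sum(x)
  q <- y / sum(y)
  m <- (p + q) / 2  # so that p / m == 2p / (p + q)
  sum(kl_term(p, m), kl_term(q, m)) / 2
}

jsd_disjoint  <- jsd_vec(c(1, 0), c(0, 1))  # no shared features
jsd_identical <- jsd_vec(c(3, 1), c(6, 2))  # same proportions
jsd_mixed     <- jsd_vec(c(5, 0, 5), c(0, 5, 5))
```

Samples with no features in common reach the upper bound of ln(2); taking the square root of any of these values gives the corresponding Jensen-Shannon distance.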

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151. doi:10.1109/18.61115

Examples

    jsd(ex_counts)

Find and Browse Available Metrics

Description

Programmatic access to the lists of available metrics, and their associated functions.

Usage

list_metrics(
  div = c(NA, "alpha", "beta"),
  val = c("data.frame", "list", "func", "id", "name", "div", "phylo", "weighted",
    "int_only", "true_metric"),
  nm = c(NA, "id", "name"),
  phylo = NULL,
  weighted = NULL,
  int_only = NULL,
  true_metric = NULL
)

match_metric(
  metric,
  div = NULL,
  phylo = NULL,
  weighted = NULL,
  int_only = NULL,
  true_metric = NULL
)

Arguments

div

Filter by diversity type. One of "alpha" or "beta". Default: NA (no filtering).

val

Sets the return value for this function call. See "Value" section below. Default: "data.frame"

nm

What value to use for the names of the returned object. Default is "id" when val is "list" or "func", otherwise the default is NA (no name).

phylo

Filter by whether a phylogenetic tree is required. TRUE returns only phylogenetic metrics. FALSE returns only non-phylogenetic metrics. Default: NULL (no filtering).

weighted

Filter by whether relative abundance is used. TRUE returns quantitative metrics. FALSE returns qualitative (presence/absence) metrics. Default: NULL (no filtering).

int_only

Filter by whether integer counts are required. TRUE returns metrics requiring integers (e.g. richness estimators). FALSE returns metrics that accept proportions. Default: NULL (no filtering).

true_metric

Filter by whether the metric satisfies the triangle inequality. TRUE returns proper distance metrics. FALSE returns dissimilarities. Default: NULL (no filtering).

metric

The name of an alpha/beta diversity metric to search for. Supports partial matching. All non-alphabetic characters are ignored.

Value

match_metric()

A list with the following elements.

name : Metric name, e.g. "Faith's Phylogenetic Diversity"
id : Metric ID - also the name of the function, e.g. "faith"
div : Either "alpha" or "beta".
phylo : TRUE if metric requires a phylogenetic tree; FALSE otherwise.
weighted : TRUE if metric takes relative abundance into account; FALSE if it only uses presence/absence.
int_only : TRUE if metric requires integer counts; FALSE otherwise.
true_metric : TRUE if metric is a true metric and satisfies the triangle inequality; FALSE if it is a non-metric dissimilarity; NA for alpha diversity metrics.
func : The function for this metric, e.g. ecodive::faith
params : Formal args for func, e.g. c("counts", "norm", "tree", "cpus")

list_metrics()

The returned object's type and values are controlled with the val and nm arguments.

val = "data.frame" : The data.frame from which the below options are sourced.
val = "list" : A list of objects as returned by match_metric() (above).
val = "func" : A list of functions.
val = "id" : A character vector of metric IDs.
val = "name" : A character vector of metric names.
val = "div" : A character vector of "alpha" and/or "beta" values.
val = "phylo" : A logical vector indicating which metrics require a tree.
val = "weighted" : A logical vector indicating which metrics take relative abundance into account (as opposed to just presence/absence).
val = "int_only" : A logical vector indicating which metrics require integer counts.
val = "true_metric" : A logical vector indicating which metrics are true metrics that satisfy the triangle inequality. Such metrics work better for ordinations such as PCoA.

If nm is set, then the names of the vector or list will be the metric ID (nm="id") or name (nm="name"). When val="data.frame", the names will be applied to the rownames() property of the data.frame.

Examples


    # A data.frame of all available metrics.
    head(list_metrics())
    
    # All alpha diversity function names.
    list_metrics('alpha', val = 'id')
    
    # Try to find a metric named 'otus'.
    m <- match_metric('otus')
    
    # The result is a list that includes the function.
    str(m)

Lorentzian distance

Description

A log-based distance metric that is robust to outliers.

Usage

lorentzian(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Lorentzian distance is defined as:

$\sum_{i=1}^{n}\ln{(1 + |X_i - Y_i|)}$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
sum(log(1 + abs(x - y)))
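
As a rough illustration of the outlier robustness mentioned in the Description, the sketch below compares the Lorentzian formula with a plain sum of absolute differences. The vectors x and y are made-up samples (not ex_counts), and only base R is used.

```r
# Two hypothetical samples; the last feature is an extreme outlier.
x <- c(10, 0, 5, 1000)
y <- c(12, 3, 5, 0)

abs_diff <- abs(x - y)                   # 2, 3, 0, 1000

manhattan_xy  <- sum(abs_diff)           # the outlier supplies 1000 of 1005
lorentzian_xy <- sum(log(1 + abs_diff))  # the outlier supplies log(1001)

# Fraction of each distance that comes from the outlier feature alone.
outlier_share <- c(
  manhattan  = abs_diff[4] / manhattan_xy,
  lorentzian = log(1 + abs_diff[4]) / lorentzian_xy
)
outlier_share
```

The log transform compresses the single large difference, so the outlier accounts for a smaller share of the Lorentzian value than of the untransformed sum.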

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'.

To suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.

Examples

    lorentzian(ex_counts)

Manhattan distance

Description

The sum of absolute differences, also known as the taxicab geometry.

Usage

manhattan(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Manhattan distance is defined as:

$\sum_{i=1}^{n} |X_i - Y_i|$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x-y))
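
For an all-vs-all comparison in base R, stats::dist() with method = "manhattan" applies the same formula to every pair of rows. The sketch below uses a hypothetical 3-sample matrix named counts (illustrative data, not ex_counts) and checks one pair against the two-sample formula.

```r
# Hypothetical count matrix with samples in rows.
counts <- matrix(
  c(4, 0, 3,
    1, 2, 3,
    0, 5, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("S1", "S2", "S3"), c("F1", "F2", "F3"))
)

# All pairwise Manhattan distances between rows.
d <- stats::dist(counts, method = "manhattan")

# The same value for one pair, written out by hand.
s1_s2 <- sum(abs(counts["S1", ] - counts["S2", ]))

as.matrix(d)
```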

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'.

To suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.

References

Krause, E. F. (1987). Taxicab geometry: An adventure in non-Euclidean geometry. Dover Publications.

Examples

    manhattan(ex_counts)

Margalef's Richness Index

Description

A richness metric that normalizes the number of species by the log of the total sample size.

Usage

margalef(counts, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

Margalef's index is defined as:

$\frac{n - 1}{\ln{X_T}}$

Where:

$n$ : The number of observed features (those with non-zero counts).
$X_T$ : Total of all counts (sequencing depth).

Base R Equivalent:

x <- ex_counts[1,]
(sum(x > 0) - 1) / log(sum(x))
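
The single-sample formula extends to a whole table with rowSums(). The following is a minimal base R sketch on a hypothetical two-sample matrix (counts is made-up data, and margalef_base is just an illustrative variable name).

```r
# Hypothetical counts: both samples have a depth of 100,
# but S1 has 3 observed features and S2 has 4.
counts <- matrix(
  c(10,  0,  5, 85,
    25, 25, 25, 25),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("S1", "S2"), paste0("F", 1:4))
)

n   <- rowSums(counts > 0)  # observed features per sample
x_t <- rowSums(counts)      # total counts per sample

margalef_base <- (n - 1) / log(x_t)
margalef_base
```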

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Margalef, R. (1958). Information theory in ecology. General Systems, 3, 36-71.

Gamito, S. (2010). Caution is needed when applying Margalef diversity index. Ecological Indicators, 10(2), 550-551. doi:10.1016/j.ecolind.2009.07.006

Examples

    margalef(ex_counts)

Matusita distance

Description

A distance measure closely related to the Hellinger distance.

Usage

matusita(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Matusita distance is defined as:

$\sqrt{\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2}$

Where:

$P_i$ , $Q_i$ : Proportional abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sqrt(sum((sqrt(p) - sqrt(q)) ^ 2))
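
Because the proportions in each sample sum to 1, expanding the square gives an equivalent form, $\sqrt{2 - 2\sum_{i=1}^{n}\sqrt{P_i Q_i}}$, which also shows that the value lies between 0 and $\sqrt{2}$. The sketch below checks this numerically in base R; the vectors are hypothetical and the variable names are illustrative only.

```r
# Hypothetical samples, converted to proportions.
x <- c(4, 0, 3, 13)
y <- c(1, 2, 3, 4)
p <- x / sum(x)
q <- y / sum(y)

# Direct definition.
matusita_xy <- sqrt(sum((sqrt(p) - sqrt(q))^2))

# Equivalent form via the Bhattacharyya coefficient sum(sqrt(p * q)).
bc     <- sum(sqrt(p * q))
via_bc <- sqrt(2 - 2 * bc)

# Samples with no shared features reach the upper bound, sqrt(2).
disjoint <- sqrt(sum((sqrt(c(1, 0)) - sqrt(c(0, 1)))^2))

c(direct = matusita_xy, via_bc = via_bc, disjoint = disjoint)
```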

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Matusita, K. (1955). Decision rules, based on the distance, for problems of fit, two samples, and estimation. The Annals of Mathematical Statistics, 26(4), 631-640. doi:10.1214/aoms/1177728422

Examples

    matusita(ex_counts)

McIntosh Index

Description

A dominance index based on the Euclidean distance from the origin.

Usage

mcintosh(counts, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The McIntosh index is defined as:

$\frac{X_T - \sqrt{\sum_{i = 1}^{n} (X_i)^2}}{X_T - \sqrt{X_T}}$

Where:

$n$ : The number of features.
$X_i$ : Integer count of the $i$ -th feature.
$X_T$ : Total of all counts.

Base R Equivalent:

x <- ex_counts[1,]
(sum(x) - sqrt(sum(x^2))) / (sum(x) - sqrt(sum(x)))
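
To get a feel for how the formula responds to dominance, the sketch below evaluates it on three hypothetical samples with the same depth. mcintosh_base is an illustrative helper written for this example, not a package function.

```r
# Direct transcription of the formula above.
mcintosh_base <- function(x) {
  x_t <- sum(x)
  (x_t - sqrt(sum(x^2))) / (x_t - sqrt(x_t))
}

even      <- c(25, 25, 25, 25)  # perfectly even community
dominated <- c(97, 1, 1, 1)     # one feature dominates
single    <- c(100, 0, 0, 0)    # only one feature present

sapply(
  list(even = even, dominated = dominated, single = single),
  mcintosh_base
)
```

With this formula, a sample containing a single feature scores 0, and the value rises as counts are spread more evenly across features.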

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

McIntosh, R. P. (1967). An index of diversity and the relation of certain concepts to diversity. Ecology, 48(3), 392-404. doi:10.2307/1932674

Examples

    mcintosh(ex_counts)

Menhinick's Richness Index

Description

A richness metric that normalizes the number of species by the square root of the total sample size.

Usage

menhinick(counts, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

Menhinick's index is defined as:

$\frac{n}{\sqrt{X_T}}$

Where:

$n$ : The number of observed features (those with non-zero counts).
$X_T$ : Total of all counts.

Base R Equivalent:

x <- ex_counts[1,]
sum(x > 0) / sqrt(sum(x))
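
Since the denominator is the square root of the total count, the index depends on sequencing depth even when the set of observed features is unchanged. A small base R sketch with a hypothetical sample (menhinick_base is an illustrative helper, not a package function):

```r
menhinick_base <- function(x) sum(x > 0) / sqrt(sum(x))

x <- c(10, 0, 5, 85)              # 3 observed features, depth 100

shallow <- menhinick_base(x)      # 3 / sqrt(100)
deep    <- menhinick_base(4 * x)  # same features, depth 400: 3 / sqrt(400)

c(shallow = shallow, deep = deep)
```

Quadrupling every count halves the index here, which is one reason to compare samples at a common depth.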

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Menhinick, E. F. (1964). A comparison of some species-individuals diversity indices applied to samples of field insects. Ecology, 45(4), 859-861. doi:10.2307/1934933

Examples

    menhinick(ex_counts)

Minkowski distance

Description

A generalized metric that includes Euclidean and Manhattan distance as special cases.

Usage

minkowski(
  counts,
  margin = 1L,
  power = 1.5,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

power

Scaling factor for the magnitude of differences between communities ( $p$ ). Default: 1.5

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Minkowski distance is defined as:

$\sqrt[p]{\sum_{i=1}^{n} |X_i - Y_i|^p}$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.
$n$ : The number of features.
$p$ : The geometry of the space (power parameter).

Parameter: power

The power parameter (default 1.5) determines the value of $p$ in the equation.

Special Cases

Manhattan distance: When $p = 1$ , the formula reduces to the sum of absolute differences.
Euclidean distance: When $p = 2$ , the formula reduces to the standard straight-line distance.
Chebyshev distance: When $p \to \infty$ , the formula reduces to the maximum absolute difference.

Base R Equivalent:

p <- 1.5
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x - y)^p) ^ (1/p)
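
The special cases listed above can be checked numerically. The sketch below uses two hypothetical samples and an illustrative helper, minkowski_base, that wraps the base R expression; a large finite p (50) stands in for the $p \to \infty$ limit.

```r
minkowski_base <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

# Hypothetical samples; absolute differences are 3, 2, 0, 9.
x <- c(4, 0, 3, 13)
y <- c(1, 2, 3, 4)

manhattan_xy <- sum(abs(x - y))       # 14
euclidean_xy <- sqrt(sum((x - y)^2))  # sqrt(94)
chebyshev_xy <- max(abs(x - y))       # 9

p1  <- minkowski_base(x, y, 1)    # matches manhattan_xy
p2  <- minkowski_base(x, y, 2)    # matches euclidean_xy
p50 <- minkowski_base(x, y, 50)   # approaches chebyshev_xy

# The default power of 1.5 falls between the p = 1 and p = 2 results.
p15 <- minkowski_base(x, y, 1.5)
c(p1 = p1, p1.5 = p15, p2 = p2, p50 = p50)
```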

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'.

To suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.

References

Deza, M. M., & Deza, E. (2009). Encyclopedia of distances. Springer.

Minkowski, H. (1896). Geometrie der Zahlen. Teubner.

Examples

    minkowski(ex_counts, power = 2) # Equivalent to Euclidean

Morisita dissimilarity

Description

A measure of overlap between samples that is independent of sample size. Requires integer counts.

Usage

morisita(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Morisita dissimilarity is defined as:

$1 - \frac{2\sum_{i=1}^{n}X_{i}Y_{i}}{\left(\frac{\sum_{i=1}^{n}X_i(X_i - 1)}{X_T(X_T - 1)} + \frac{\sum_{i=1}^{n}Y_i(Y_i - 1)}{Y_T(Y_T - 1)}\right)X_{T}Y_{T}}$

Where:

$X_i$ , $Y_i$ : Absolute counts of the $i$ -th feature.
$X_T$ , $Y_T$ : Total counts in each sample. $X_T = \sum_{i=1}^{n} X_i$ .
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
simpson_x <- sum(x * (x - 1)) / (sum(x) * (sum(x) - 1))
simpson_y <- sum(y * (y - 1)) / (sum(y) * (sum(y) - 1))
1 - ((2 * sum(x * y)) / ((simpson_x + simpson_y) * sum(x) * sum(y)))
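
The sketch below wraps the same expression in an illustrative helper, morisita_base (not a package function), and evaluates it on hypothetical integer counts. It is meant only to show two behaviours of the formula: samples with no shared features score exactly 1, and a sample compared with a ten-fold deeper copy of itself scores close to 0.

```r
morisita_base <- function(x, y) {
  # The x * (x - 1) terms assume whole-number counts.
  stopifnot(all(x == round(x)), all(y == round(y)))
  x_t <- sum(x)
  y_t <- sum(y)
  simpson_x <- sum(x * (x - 1)) / (x_t * (x_t - 1))
  simpson_y <- sum(y * (y - 1)) / (y_t * (y_t - 1))
  1 - (2 * sum(x * y)) / ((simpson_x + simpson_y) * x_t * y_t)
}

a <- c(10, 20, 30, 0)
b <- c(0, 0, 0, 40)   # shares no features with a

no_overlap <- morisita_base(a, b)       # exactly 1
same_shape <- morisita_base(a, 10 * a)  # close to 0 despite 10x depth

c(no_overlap = no_overlap, same_shape = same_shape)
```

Note that same_shape comes out slightly negative for these inputs, so values from this formula are not strictly confined to the 0 to 1 range.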

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Morisita, M. (1959). Measuring of interspecific association and similarity between communities. Memoirs of the Faculty of Science, Kyushu University, Series E (Biology), 3, 65-80.

Examples

    morisita(ex_counts)

Motyka dissimilarity

Description

Closely related to the Bray-Curtis dissimilarity. For non-negative abundances, the Motyka value equals (1 + Bray-Curtis) / 2, so it ranges from 0.5 (identical samples) to 1 (no shared features).

Usage

motyka(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Motyka dissimilarity is defined as:

$\frac{\sum_{i=1}^{n} \max(X_i, Y_i)}{\sum_{i=1}^{n} (X_i + Y_i)}$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
sum(pmax(x, y)) / sum(x, y)
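
Because $\max(a, b) = \frac{a + b + |a - b|}{2}$, the Motyka formula can be rewritten as $\frac{1}{2}\left(1 + \frac{\sum |X_i - Y_i|}{\sum (X_i + Y_i)}\right)$. The sketch below checks this in base R on hypothetical samples. The Bray-Curtis expression used here is the common textbook form, which is an assumption about how the package defines its own Bray-Curtis metric.

```r
# Hypothetical samples.
x <- c(4, 0, 3, 13)
y <- c(1, 2, 3, 4)

motyka_xy <- sum(pmax(x, y)) / sum(x, y)     # 22 / 30
bray_xy   <- sum(abs(x - y)) / sum(x, y)     # 14 / 30 (textbook Bray-Curtis)

c(motyka = motyka_xy, from_bray = (1 + bray_xy) / 2)
```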

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'.

To suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.

References

Motyka, J. (1947). O celach i metodach badan geobotanicznych. Annales Universitatis Mariae Curie-Sklodowska, Sectio C, 3, 1-168.

Examples

    motyka(ex_counts)

Number of CPU Cores

Description

A thin wrapper around parallely::availableCores(). If the parallely package is not installed, then it falls back to parallel::detectCores(all.tests = TRUE, logical = TRUE). Returns 1 if pthread support is unavailable or when the number of cpus cannot be determined.

Usage

n_cpus()

Value

A scalar integer, guaranteed to be at least 1.

Examples

    n_cpus()


Normalized Weighted UniFrac

Description

Weighted UniFrac normalized by the tree length to allow comparison between trees.

Usage

normalized_unifrac(
  counts,
  tree = NULL,
  margin = 1L,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

tree

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Normalized Weighted UniFrac distance is defined as:

$\frac{\sum_{i=1}^{n} L_i|P_i - Q_i|}{\sum_{i=1}^{n} L_i(P_i + Q_i)}$

Where:

$n$ : The number of branches in the tree.
$L_i$ : The length of the $i$ -th branch.
$P_i$ , $Q_i$ : The proportion of the community descending from branch $i$ in samples P and Q, respectively.
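
The formula can be traced by hand on a toy tree. The sketch below is a base R illustration only: the four-tip topology ((A,B),(C,D)), the branch lengths, the descends indicator matrix, and the helper norm_unifrac_base are all hypothetical and do not reflect how the package stores or traverses a phylo object.

```r
# One entry per branch: four tip branches plus two internal branches.
branch_len <- c(A = 1, B = 2, C = 1, D = 3, AB = 0.5, CD = 1.5)

# descends[i, j] is 1 when tip j sits below branch i.
descends <- rbind(
  A  = c(1, 0, 0, 0),
  B  = c(0, 1, 0, 0),
  C  = c(0, 0, 1, 0),
  D  = c(0, 0, 0, 1),
  AB = c(1, 1, 0, 0),
  CD = c(0, 0, 1, 1)
)
colnames(descends) <- c("A", "B", "C", "D")

norm_unifrac_base <- function(x, y) {
  p <- as.vector(descends %*% (x / sum(x)))  # P_i for every branch
  q <- as.vector(descends %*% (y / sum(y)))  # Q_i for every branch
  sum(branch_len * abs(p - q)) / sum(branch_len * (p + q))
}

s1 <- c(A = 5, B = 5, C = 0, D = 0)
s2 <- c(A = 0, B = 0, C = 2, D = 8)
s3 <- c(A = 5, B = 5, C = 5, D = 5)

c(
  identical = norm_unifrac_base(s1, s1),  # 0
  disjoint  = norm_unifrac_base(s1, s2),  # 1
  partial   = norm_unifrac_base(s1, s3)   # 2.75 / 4.75
)
```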

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Lozupone, C. A., Hamady, M., Kelley, S. T., & Knight, R. (2007). Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology, 73(5), 1576-1585. doi:10.1128/AEM.01996-06

Examples

    normalized_unifrac(ex_counts, tree = ex_tree)

Observed Features

Description

The count of unique features (richness) in a sample.

Usage

observed(counts, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

Observed features is defined simply as the number of features with non-zero abundance:

$n$

Base R Equivalent:

x <- ex_counts[1,]
sum(x > 0)

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Examples

    observed(ex_counts)

Otsuka-Ochiai dissimilarity

Description

One minus the Otsuka-Ochiai coefficient, which is the cosine similarity computed on binary (presence/absence) data.

Usage

ochiai(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Otsuka-Ochiai dissimilarity is defined as:

$1 - \frac{J}{\sqrt{AB}}$

Where:

$A$ , $B$ : Number of features in each sample.
$J$ : Number of features in common (intersection).

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
1 - sum(x & y) / sqrt(sum(x>0) * sum(y>0))
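
The link to cosine similarity can be verified directly. In the base R sketch below, the samples are hypothetical and are first converted to presence/absence vectors.

```r
# Hypothetical samples.
x <- c(4, 0, 3, 13, 0)
y <- c(1, 2, 3, 0, 0)

# Presence/absence versions.
bx <- as.numeric(x > 0)   # 1 0 1 1 0
by <- as.numeric(y > 0)   # 1 1 1 0 0

# Cosine similarity of the binary vectors: 2 shared / sqrt(3 * 3).
cosine_sim <- sum(bx * by) / (sqrt(sum(bx^2)) * sqrt(sum(by^2)))

# Otsuka-Ochiai dissimilarity from the formula above.
ochiai_xy <- 1 - sum(x & y) / sqrt(sum(x > 0) * sum(y > 0))

c(ochiai = ochiai_xy, one_minus_cosine = 1 - cosine_sim)
```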

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Ochiai, A. (1957). Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. Bulletin of the Japanese Society of Scientific Fisheries, 22, 526-530.

Examples

    ochiai(ex_counts)

Probabilistic Symmetric Chi-Squared distance

Description

A chi-squared based distance metric for comparing probability distributions.

Usage

psym_chisq(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Probabilistic Symmetric $\chi^2$ distance is defined as:

$2\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}$

Where:

$P_i$ , $Q_i$ : Proportional abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
2 * sum((p - q)^2 / (p + q))
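
In base R, a feature that is absent from both samples produces 0/0 = NaN in the expression above. The sketch below drops such features before summing. This is one reasonable convention and an assumption on our part, since this page does not state how the package handles them. The helper psym_chisq_base and the sample vectors are hypothetical.

```r
psym_chisq_base <- function(x, y) {
  p <- x / sum(x)
  q <- y / sum(y)
  keep <- (p + q) > 0   # skip features absent from both samples
  2 * sum((p[keep] - q[keep])^2 / (p[keep] + q[keep]))
}

# The last feature is zero in both samples.
x <- c(4, 0, 3, 13, 0)
y <- c(1, 2, 3, 4, 0)

# Unguarded expression: the shared zero yields NaN.
p <- x / sum(x)
q <- y / sum(y)
naive <- 2 * sum((p - q)^2 / (p + q))

guarded  <- psym_chisq_base(x, y)
disjoint <- psym_chisq_base(c(5, 5, 0, 0), c(0, 0, 3, 7))  # upper bound: 4

c(naive = naive, guarded = guarded, disjoint = disjoint)
```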

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Examples

    psym_chisq(ex_counts)

Rarefy Observation Counts

Description

Sub-sample observations from a feature table such that all samples have the same library size (depth). This is performed via random sampling without replacement.

Usage

rarefy(
  counts,
  depth = NULL,
  seed = 0,
  times = NULL,
  drop = TRUE,
  margin = 1L,
  cpus = n_cpus(),
  warn = interactive()
)

Arguments

counts

A numeric matrix or sparse matrix object (e.g., dgCMatrix). Counts must be integers.

depth

The number of observations to keep per sample. If NULL (the default), a depth is auto-selected to maximize data retention.

seed

An integer seed for the random number generator. Providing the same seed guarantees reproducible results. Default: 0

times

The number of independent rarefactions to perform. If set, returns a list of matrices. Seeds for subsequent iterations are sequential (seed, seed + 1, ...). Default: NULL

drop

Logical. If TRUE, samples with fewer than depth observations are discarded. If FALSE, they are kept with their original counts. Default: TRUE

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

warn

Logical. If TRUE, emits a warning when samples are dropped or returned unrarefied due to insufficient depth. Default: interactive()

Value

A rarefied matrix, or a list of rarefied matrices when times is set. The output class (matrix, dgCMatrix, etc.) matches the input class.

Auto-Depth Selection

If depth is NULL, the function defaults to the highest depth that retains at least 10% of the total observations in the dataset.

Dropping vs. Retaining Samples

If a sample has fewer observations than the specified depth:

drop = TRUE (Default): The sample is removed from the output matrix.
drop = FALSE: The sample is returned unmodified (with its original counts). It is not rarefied or zeroed out.

Zero-Sum Features

Features (OTUs, ASVs, Genes) that lose all observations during rarefaction are always retained as columns/rows of zeros. This ensures the output matrix dimensions remain consistent with the input (barring dropped samples).
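
Conceptually, rarefying one sample without replacement amounts to listing every observation individually, drawing depth of them at random, and re-counting per feature. The base R sketch below illustrates that idea for a single hypothetical sample. rarefy_one is an illustrative helper, and it relies on R's built-in random number generator rather than the package's own, so its draws will not match rarefy() for any seed.

```r
rarefy_one <- function(x, depth) {
  # One entry per observation, labelled by feature index.
  obs  <- rep(seq_along(x), times = x)
  # Draw `depth` observations without replacement.
  kept <- obs[sample.int(length(obs), depth)]
  # Re-count per feature; features with no draws stay in place as zeros.
  out  <- tabulate(kept, nbins = length(x))
  names(out) <- names(x)
  out
}

set.seed(42)
sample_c <- c(OTU1 = 0, OTU2 = 9, OTU3 = 5, OTU4 = 0, OTU5 = 7)  # 21 observations
rare_c   <- rarefy_one(sample_c, depth = 18)

rare_c
sum(rare_c)
```

The output keeps all five feature positions, including those that are zero, which mirrors the fixed-dimension behaviour described above.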

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Examples

    # A 4-sample x 5-OTU matrix with samples in rows.
    counts <- matrix(c(0,0,0,0,0,8,9,10,5,5,5,5,2,0,0,0,6,5,7,0), 4, 5,
      dimnames = list(LETTERS[1:4], paste0('OTU', 1:5)))
    counts
    rowSums(counts)
    
    # Rarefy all samples to a depth of 18.
    # Sample 'A' (13 counts) and 'D' (15 counts) will be dropped.
    r_mtx <- rarefy(counts, depth = 18)
    r_mtx
    rowSums(r_mtx)
    
    # Keep under-sampled samples by setting `drop = FALSE`.
    # Samples 'A' and 'D' are returned with their original counts.
    r_mtx <- rarefy(counts, depth = 18, drop = FALSE)
    r_mtx
    rowSums(r_mtx)
    
    # Perform 3 independent rarefactions.
    r_list <- rarefy(counts, times = 3)
    length(r_list)
    
    # Sparse matrices are supported and their class is preserved.
    if (requireNamespace('Matrix', quietly = TRUE)) {
      counts_dgC <- Matrix::Matrix(counts, sparse = TRUE)
      str(rarefy(counts_dgC))
    }


Read a Newick-Formatted Phylogenetic Tree

Description

A phylogenetic tree is required for computing UniFrac distance matrices. You can load a tree from a file or by providing the tree string directly. This tree must be in Newick format, also known as parenthetic format and New Hampshire format.

Usage

read_tree(newick, underscores = FALSE)

Arguments

newick

Input data as either a file path, URL, or Newick string. Compressed (gzip or bzip2) files are also supported.

underscores

If TRUE, underscores in unquoted names will remain underscores. If FALSE, underscores in unquoted names will be converted to spaces.

Value

A phylo class object representing the tree.

Examples

    tree <- read_tree("
        (A:0.99,((B:0.87,C:0.89):0.51,(((D:0.16,(E:0.83,F:0.96)
        :0.94):0.69,(G:0.92,(H:0.62,I:0.85):0.54):0.23):0.74,J:0.1
        2):0.43):0.67);")
    class(tree)


Robust Aitchison distance

Description

Calculates the pairwise Robust Aitchison distance for compositional data. This method is specifically engineered for sparse datasets - such as microbiome OTU/ASV tables - by calculating distances based only on observed positive abundances, avoiding the need for pseudo-counts.

Usage

robust_aitchison(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Robust Aitchison distance is defined as:

$\sqrt{\sum_{i=1}^{n} (X^*_i - Y^*_i)^2}$

Where:

$X^*_i$ , $Y^*_i$ : The rclr-transformed counts for the $i$ -th feature. For a given sample $X$ , $X^*_i = \ln(X_i) - X_L$ if $X_i > 0$ , and $0$ otherwise.
$X_L$ , $Y_L$ : Mean log of strictly positive abundances. $X_L = \frac{1}{|P_X|}\sum_{j \in P_X} \ln{X_j}$ , where $P_X$ is the set of indices where $X > 0$ .
$|P_X|$ , $|P_Y|$ : The number of strictly positive features in the respective samples.
$n$ : The total number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
x <- ifelse(x > 0, log(x) - mean(log(x[x > 0])), 0) # rclr transform
y <- ifelse(y > 0, log(y) - mean(log(y[y > 0])), 0)
sqrt(sum((x-y)^2)) # Euclidean distance
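
As a hedged illustration of the rclr step, the sketch below applies the formula above to two small made-up samples. The vectors `x` and `y` are hypothetical (not taken from `ex_counts`), and `rclr` is a helper name chosen here for illustration, not an ecodive function. It shows that zeros stay at zero and that the positive entries of each transformed sample are centered on zero.

```r
# Hypothetical toy samples; zeros mark unobserved features.
x <- c(10, 0, 5, 20)
y <- c( 0, 4, 8, 16)

# Illustrative helper (not part of ecodive): robust centered log ratio.
rclr <- function(v) {
  pos <- v > 0
  out <- numeric(length(v))
  out[pos] <- log(v[pos]) - mean(log(v[pos]))
  out
}

x_star <- rclr(x)
y_star <- rclr(y)

# Positive entries are centered on zero; zero entries are left at zero.
sum(x_star[x > 0])
x_star[x == 0]

# Robust Aitchison distance: Euclidean distance between the transformed samples.
d <- sqrt(sum((x_star - y_star)^2))
d
```

Because the centering uses only the positive entries of each sample, no pseudocount is involved at any point.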

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Martino, C., Morton, J. T., Marotz, C. A., Thompson, L. R., Tripathi, A., Knight, R., and Zengler, K. (2019). A novel sparse compositional technique reveals microbial perturbations. mSystems, 4(1), e00016-19. doi:10.1128/mSystems.00016-19

Examples

    robust_aitchison(ex_counts)

Shannon Diversity Index

Description

A commonly used diversity index accounting for both richness and evenness.

Usage

shannon(counts, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Shannon index (entropy) is defined as:

$-\sum_{i = 1}^{n} P_i \times \ln(P_i)$

Where:

$n$ : The number of features.
$P_i$ : Proportional abundance of the $i$ -th feature.

Base R Equivalent:

x <- ex_counts[1,]
p <- x / sum(x)
p <- p[p > 0] # drop zeros, since 0 * log(0) is NaN in R
-sum(p * log(p))
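
To make the behaviour of the index concrete, here is a minimal sketch on two made-up samples. The vectors and the helper name `shannon_base` are hypothetical and used for illustration only. With four equally abundant features the entropy reaches its maximum of `log(4)`, and it drops sharply when one feature dominates.

```r
# Hypothetical samples: one perfectly even, one dominated by a single feature.
even   <- c(25, 25, 25, 25)
skewed <- c(97, 1, 1, 1)

# Illustrative helper (not part of ecodive).
shannon_base <- function(x) {
  p <- x[x > 0] / sum(x)   # drop zeros so 0 * log(0) never appears
  -sum(p * log(p))
}

h_even   <- shannon_base(even)    # equals log(4) for four equally abundant features
h_skewed <- shannon_base(skewed)  # lower, because one feature dominates
c(even = h_even, skewed = h_skewed)
```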

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379-423.

Shannon, C. E., & Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press.

Examples

    shannon(ex_counts)

Gini-Simpson Index

Description

The probability that two entities taken at random from the dataset represent different types.

Usage

simpson(counts, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Gini-Simpson index is defined as:

$1 - \sum_{i = 1}^{n} P_i^2$

Where:

$n$ : The number of features.
$P_i$ : Proportional abundance of the $i$ -th feature.

Base R Equivalent:

x <- ex_counts[1,]
p <- x / sum(x)
1 - sum(p^2)
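
The same formula can be worked by hand on small inputs. The sketch below uses two hypothetical samples and an illustrative helper named `simpson_base` (not an ecodive function). For four equally abundant features the index is 1 - 4 * 0.25^2 = 0.75.

```r
# Hypothetical samples: one perfectly even, one dominated by a single feature.
even   <- c(25, 25, 25, 25)
skewed <- c(97, 1, 1, 1)

# Illustrative helper (not part of ecodive).
simpson_base <- function(x) {
  p <- x / sum(x)
  1 - sum(p^2)
}

g_even   <- simpson_base(even)    # 1 - 4 * 0.25^2 = 0.75
g_skewed <- simpson_base(skewed)  # 1 - (0.97^2 + 3 * 0.01^2) = 0.0588
c(even = g_even, skewed = g_skewed)
```

Unlike the entropy-based indices, zeros need no special handling here, since they contribute 0^2 = 0 to the sum.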

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Simpson, E. H. (1949). Measurement of diversity. Nature, 163, 688. doi:10.1038/163688a0

Examples

    simpson(ex_counts)

Soergel distance

Description

A distance metric related to the Bray-Curtis and Jaccard indices.

Usage

soergel(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Soergel distance is defined as:

$\frac{\sum_{i=1}^{n} |X_i - Y_i|}{\sum_{i=1}^{n} \max(X_i, Y_i)}$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x - y)) / sum(pmax(x, y))
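
The relationship to Bray-Curtis and Jaccard mentioned in the Description can be checked numerically. The sketch below is an illustration on made-up vectors (`x` and `y` are hypothetical), using the textbook forms of Bray-Curtis and the Jaccard distance rather than any ecodive function.

```r
# Hypothetical toy samples.
x <- c(4, 0, 2, 1)
y <- c(2, 3, 2, 0)

soergel_xy <- sum(abs(x - y)) / sum(pmax(x, y))   # 6 / 10 = 0.6
bray_xy    <- sum(abs(x - y)) / sum(x + y)        # 6 / 14

# Soergel can be recovered from Bray-Curtis as 2 * BC / (1 + BC).
from_bray <- 2 * bray_xy / (1 + bray_xy)

# On presence/absence data the Soergel formula gives the Jaccard distance.
xb <- as.numeric(x > 0)
yb <- as.numeric(y > 0)
soergel_bin <- sum(abs(xb - yb)) / sum(pmax(xb, yb))
jaccard_bin <- 1 - sum(xb & yb) / sum(xb | yb)

c(soergel = soergel_xy, from_bray = from_bray,
  soergel_bin = soergel_bin, jaccard_bin = jaccard_bin)
```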

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'.

To suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.
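
The normalization options and the role of the pseudocount can be sketched in base R. This is an approximation written for illustration, not the package's internal code: the sample `x` is made up, the pseudocount of 1 is an assumed value, and the 'rclr' line follows the definition given for the Robust Aitchison distance.

```r
# Hypothetical counts for one sample, including a zero.
x <- c(8, 0, 2, 10)

percent <- x / sum(x)         # 'percent': values sum to 1
binary  <- as.numeric(x > 0)  # 'binary': presence/absence

# 'clr': log of each value minus the mean log. log(0) is -Inf, so zeros
# must be offset by a pseudocount first (1 is an assumed value here).
pseudocount <- 1
clr <- log(x + pseudocount) - mean(log(x + pseudocount))

# 'rclr': the same centering using positive values only; zeros stay at 0.
rclr <- ifelse(x > 0, log(x) - mean(log(x[x > 0])), 0)

rbind(percent, binary, clr, rclr)
```

Only the 'clr' line depends on the pseudocount, which is why the parameter is ignored for the other options.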

References

Examples

    soergel(ex_counts)

Dice-Sorensen dissimilarity

Description

A statistic used for comparing the similarity of two samples.

Usage

sorensen(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Dice-Sorensen dissimilarity is defined as:

$1 - \frac{2J}{(A + B)}$

Where:

$A$ , $B$ : Number of features in each sample.
$J$ : Number of features in common (intersection).

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
1 - 2 * sum(x & y) / sum(x>0, y>0)
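
As a hedged, worked illustration of the quantities $A$, $B$, and $J$ defined above, the sketch below uses two made-up samples (`x` and `y` are hypothetical). It computes the overlap term $2J/(A + B)$ and its complement, then checks that the complement coincides with the textbook Bray-Curtis formula applied to presence/absence data.

```r
# Hypothetical toy samples.
x <- c(4, 0, 2, 1)
y <- c(2, 3, 2, 0)

A <- sum(x > 0)          # 3 features present in x
B <- sum(y > 0)          # 3 features present in y
J <- sum(x > 0 & y > 0)  # 2 features present in both

overlap    <- 2 * J / (A + B)  # 4 / 6: equals 1 when every feature is shared
complement <- 1 - overlap      # 2 / 6: equals 0 when every feature is shared

# The complement matches Bray-Curtis computed on presence/absence data.
xb <- as.numeric(x > 0)
yb <- as.numeric(y > 0)
bray_bin <- sum(abs(xb - yb)) / sum(xb + yb)

c(overlap = overlap, complement = complement, bray_bin = bray_bin)
```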

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Sørensen, T. (1948). A method of establishing groups of equal amplitude in plant sociology based on similarity of species content. Kongelige Danske Videnskabernes Selskab, Biologiske Skrifter, 5, 1-34.

Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3), 297-302. doi:10.2307/1932409

Examples

    sorensen(ex_counts)

Squared Chi-Squared distance

Description

The squared version of the Chi-Squared distance.

Usage

squared_chisq(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Squared $\chi^2$ distance is defined as:

$\sum_{i=1}^{n}\frac{(P_i - Q_i)^2}{P_i + Q_i}$

Where:

$P_i$ , $Q_i$ : Proportional abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sum((p - q)^2 / (p + q), na.rm = TRUE) # skip features absent from both samples (0 / 0)
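
One practical detail of this formula is what happens when a feature is absent from both samples. The sketch below, on hypothetical vectors, shows the per-feature terms and one way to handle the 0 / 0 case. Dropping such features is an assumption made for this illustration, on the grounds that they carry no information about the difference between the two samples.

```r
# Hypothetical toy samples; the last feature is absent from both.
x <- c(4, 0, 2, 2, 0)
y <- c(2, 3, 2, 1, 0)
p <- x / sum(x)
q <- y / sum(y)

terms <- (p - q)^2 / (p + q)
terms                  # the last entry is NaN, from 0 / 0

# Keep only features observed in at least one of the two samples.
keep <- (p + q) > 0
d <- sum(terms[keep])  # 1/12 + 3/8 + 0 + 1/24 = 0.5
d
```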

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Examples

    squared_chisq(ex_counts)

Squared Chord distance

Description

The squared version of the Chord distance.

Usage

squared_chord(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Squared Chord distance is defined as:

$\sum_{i=1}^{n}\left(\sqrt{P_i} - \sqrt{Q_i}\right)^2$

Where:

$P_i$ , $Q_i$ : Proportional abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sum((sqrt(p) - sqrt(q)) ^ 2)
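
Because the formula is stated in terms of proportions $P_i$ and $Q_i$, the result should not change when a sample is sequenced more deeply. The sketch below checks this on made-up vectors (`x` and `y` are hypothetical) and also verifies the algebraic identity $\sum(\sqrt{P_i} - \sqrt{Q_i})^2 = 2 - 2\sum\sqrt{P_i Q_i}$, which follows from each set of proportions summing to 1.

```r
# Hypothetical toy samples.
x <- c(4, 0, 2, 2, 0)
y <- c(2, 3, 2, 1, 0)
p <- x / sum(x)
q <- y / sum(y)

d <- sum((sqrt(p) - sqrt(q))^2)

# Expanding the square gives sum(p) + sum(q) - 2 * sum(sqrt(p * q)).
alt <- 2 - 2 * sum(sqrt(p * q))

# Tripling the depth of x leaves its proportions, and so d, unchanged.
x3 <- 3 * x
d3 <- sum((sqrt(x3 / sum(x3)) - sqrt(q))^2)

c(d = d, alt = alt, d3 = d3)
```

The identity also shows that the value is bounded between 0 (identical compositions) and 2 (no shared features).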

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Legendre, P., & Legendre, L. (2012). Numerical ecology. Elsevier.

Examples

    squared_chord(ex_counts)

Squared Euclidean distance

Description

The squared Euclidean distance between two vectors.

Usage

squared_euclidean(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Squared Euclidean distance is defined as:

$\sum_{i=1}^{n} (X_i - Y_i)^2$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
sum((x-y)^2)
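
For more than two samples, the same quantity can be obtained for every pair at once with `stats::dist()`. This is a base R sketch on a made-up matrix (`m` is hypothetical, with samples as rows), shown only to illustrate the all-pairs case. It does not describe how the package computes the result internally.

```r
# Hypothetical matrix: three samples (rows) by four features (columns).
m <- rbind(s1 = c(4, 0, 2, 1),
           s2 = c(2, 3, 2, 0),
           s3 = c(0, 1, 5, 5))

# Single pair, following the formula directly: 4 + 9 + 0 + 1 = 14.
d_pair <- sum((m["s1", ] - m["s2", ])^2)

# All pairs: square the Euclidean distances returned by dist().
d_all <- dist(m, method = "euclidean")^2
d_mat <- as.matrix(d_all)
d_mat
```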

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'.

To suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.

References

Legendre, P., & Legendre, L. (2012). Numerical ecology. Elsevier.

Examples

    squared_euclidean(ex_counts)

Squares Richness Estimator

Description

A richness estimator based on the number of species observed once (singletons) and the sum of squared species counts ("squares").

Usage

squares(counts, margin = 1L, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Squares estimator is defined as:

$n + \frac{(F_1)^2 \sum_{i=1}^{n} (X_i)^2}{X_T^2 - nF_1}$

Where:

$n$ : The number of observed features.
$X_T$ : Total of all counts.
$F_1$ : Number of features observed once (singletons).
$X_i$ : Integer count of the $i$ -th feature.

Base R Equivalent:

x  <- ex_counts[1,]
N  <- sum(x)      # sampling depth
S  <- sum(x > 0)  # observed features
F1 <- sum(x == 1) # singletons
S + ((sum(x^2) * (F1^2)) / ((N^2) - F1 * S))
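
The arithmetic can be followed step by step on a small made-up sample. The vector `x` below is hypothetical, and the intermediate names mirror those used in the Base R Equivalent.

```r
# Hypothetical sample: two singletons, one doubleton, one feature seen 4 times.
x <- c(1, 1, 2, 4, 0)

N  <- sum(x)        # 8  sampling depth (X_T)
S  <- sum(x > 0)    # 4  observed features (n)
F1 <- sum(x == 1)   # 2  singletons
SS <- sum(x^2)      # 22 sum of squared counts: 1 + 1 + 4 + 16

# n + F1^2 * sum(X_i^2) / (X_T^2 - n * F1) = 4 + 88 / 56
est <- S + (F1^2 * SS) / (N^2 - S * F1)
est
```

With no singletons the correction term is zero and the estimate equals the observed richness.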

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Alroy, J. (2018). Limits to species richness estimates based on subsampling. Paleobiology, 44(2), 177-194. doi:10.1017/pab.2017.38

Examples

    squares(ex_counts)

Topsoe distance

Description

A symmetric divergence measure based on the Jensen-Shannon divergence.

Usage

topsoe(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Topsoe distance is defined as:

$\sum_{i=1}^{n}P_i\ln\left(\frac{2P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n}Q_i\ln\left(\frac{2Q_i}{P_i + Q_i}\right)$

Where:

$P_i$ , $Q_i$ : Proportional abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sum(p * log(2 * p / (p+q)), q * log(2 * q / (p+q)), na.rm = TRUE) # 0 * log(0) terms count as 0
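
The link to the Jensen-Shannon divergence can be made explicit. In the sketch below, the two sums in the formula are written as Kullback-Leibler divergences from each sample to the midpoint $M = (P + Q)/2$, so the Topsoe value is twice the Jensen-Shannon divergence in its usual textbook form (natural log). The vectors and the helper `kl` are hypothetical and for illustration only. Other functions in this package may scale the Jensen-Shannon quantity differently.

```r
# Hypothetical proportions for two samples.
p <- c(0.50, 0.25, 0.25, 0.00)
q <- c(0.25, 0.25, 0.25, 0.25)
m <- (p + q) / 2

# Illustrative helper: Kullback-Leibler divergence, skipping zero terms.
kl <- function(a, b) {
  keep <- a > 0
  sum(a[keep] * log(a[keep] / b[keep]))
}

topsoe_xy <- kl(p, m) + kl(q, m)
jsd_xy    <- topsoe_xy / 2   # textbook Jensen-Shannon divergence

# The same value from the formula as written, dropping NaN (0 * log(0)) terms.
direct <- sum(p * log(2 * p / (p + q)), q * log(2 * q / (p + q)), na.rm = TRUE)

c(topsoe = topsoe_xy, direct = direct, jsd = jsd_xy)
```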

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Topsoe, F. (2000). Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46(4), 1602-1609. doi:10.1109/18.850703

Examples

    topsoe(ex_counts)

Unweighted UniFrac

Description

A phylogenetic distance metric that accounts for the presence/absence of lineages.

Usage

unweighted_unifrac(
  counts,
  tree = NULL,
  margin = 1L,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

tree

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Unweighted UniFrac distance is defined as:

$\frac{1}{n}\sum_{i=1}^{n} L_i|A_i - B_i|$

Where:

$n$ : The number of branches in the tree.
$L_i$ : The length of the $i$ -th branch.
$A_i$ , $B_i$ : Binary values (0 or 1) indicating if descendants of branch $i$ are present in sample A or B.

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Lozupone, C., & Knight, R. (2005). UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology, 71(12), 8228-8235. doi:10.1128/AEM.71.12.8228-8235.2005

Examples

    unweighted_unifrac(ex_counts, tree = ex_tree)

Variance-Adjusted Weighted UniFrac

Description

A weighted UniFrac that adjusts for the expected variance of the metric.

Usage

variance_adjusted_unifrac(
  counts,
  tree = NULL,
  margin = 1L,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

tree

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Variance-Adjusted Weighted UniFrac distance is defined as:

$\frac{\sum_{i=1}^{n} L_i\frac{|P_i - Q_i|}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }{\sum_{i=1}^{n} L_i\frac{P_i + Q_i}{\sqrt{(P_i + Q_i)(2 - P_i - Q_i)}} }$

Where:

$n$ : The number of branches in the tree.
$L_i$ : The length of the $i$ -th branch.
$P_i$ , $Q_i$ : The proportion of the community descending from branch $i$ in samples P and Q.
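
No Base R Equivalent is given for this metric, so the following is only a rough sketch of the formula on a hand-built toy tree, ((A:1,B:2):0.5,C:3). The tree, the counts, and the branch-by-tip matrix `desc` are all hypothetical. Skipping branches whose variance term is zero is an assumption of this sketch, made to avoid division by zero.

```r
# Hypothetical tree ((A:1,B:2):0.5,C:3), described one branch per row.
len  <- c(A = 1, B = 2, AB = 0.5, C = 3)  # branch lengths L_i
desc <- rbind(A  = c(1, 0, 0),            # 1 marks the tips below each branch
              B  = c(0, 1, 0),
              AB = c(1, 1, 0),
              C  = c(0, 0, 1))
colnames(desc) <- c("A", "B", "C")

x <- c(A = 2, B = 0, C = 2)  # hypothetical counts, sample P
y <- c(A = 0, B = 3, C = 1)  # hypothetical counts, sample Q

# Proportion of each community descending from each branch.
P <- as.vector(desc %*% (x / sum(x)))
Q <- as.vector(desc %*% (y / sum(y)))

# Variance term from the formula; branches where it is zero are skipped
# (an assumption of this sketch).
w    <- sqrt((P + Q) * (2 - P - Q))
keep <- w > 0

va_unifrac <- sum(len[keep] * abs(P - Q)[keep] / w[keep]) /
              sum(len[keep] * (P + Q)[keep]    / w[keep])
va_unifrac
```

Since $|P_i - Q_i| \le P_i + Q_i$ on every branch, the ratio always falls between 0 and 1.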

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Chang, Q., Luan, Y., & Sun, F. (2011). Variance adjusted weighted UniFrac: a powerful beta diversity measure for comparing communities based on phylogeny. BMC Bioinformatics, 12, 118. doi:10.1186/1471-2105-12-118

Examples

    variance_adjusted_unifrac(ex_counts, tree = ex_tree)

Wave Hedges distance

Description

A distance metric derived from Hedges' distance.

Usage

wave_hedges(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

norm

Normalize the incoming counts. Options are:

'none': No transformation.
'percent': Relative abundance (sample abundances sum to 1).
'binary': Unweighted presence/absence (each count is either 0 or 1).
'clr': Centered log ratio.
'rclr': Robust centered log ratio.

Default: 'none'.

pseudocount

Value added to counts to handle zeros when norm = 'clr'. Ignored for other normalization methods. See Pseudocount section.

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Wave Hedges distance is defined as:

$\sum_{i=1}^{n}\frac{|X_i - Y_i|}{\max(X_i, Y_i)}$

Where:

$X_i$ , $Y_i$ : Absolute abundances of the $i$ -th feature.
$n$ : The number of features.

Base R Equivalent:

x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x - y) / pmax(x, y), na.rm = TRUE) # skip features absent from both samples (0 / 0)
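
A short worked example may help show how each feature contributes. The vectors below are hypothetical. A feature contributes 1 when it is present in only one sample, a value between 0 and 1 when it is present in both at different abundances, and 0 when the abundances match. Treating features absent from both samples as contributing nothing is an assumption of this sketch.

```r
# Hypothetical toy samples; the last feature is absent from both.
x <- c(4, 0, 2, 1, 0)
y <- c(2, 3, 2, 0, 0)

ratios <- abs(x - y) / pmax(x, y)
ratios                         # 0.5, 1, 0, 1, NaN (0 / 0)

d <- sum(ratios, na.rm = TRUE) # 0.5 + 1 + 0 + 1 = 2.5
d
```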

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

Pseudocount

The pseudocount parameter is only relevant when norm = 'clr'.

To suppress the warning, provide an explicit value (e.g., 1).

See aitchison for references.

References

Examples

    wave_hedges(ex_counts)

Weighted UniFrac

Description

A phylogenetic distance metric that accounts for the relative abundance of lineages.

Usage

weighted_unifrac(
  counts,
  tree = NULL,
  margin = 1L,
  pairs = NULL,
  cpus = n_cpus()
)

Arguments

counts

A numeric matrix of count data (samples $\times$ features). Typically contains absolute abundances (integer counts), though proportions are also accepted.

tree

margin

The margin containing samples. 1 if samples are rows, 2 if samples are columns. Ignored when counts is a special object class (e.g. phyloseq). Default: 1

pairs

cpus

How many parallel processing threads should be used. The default, n_cpus(), will use all logical CPU cores.

Details

The Weighted UniFrac distance is defined as:

$\sum_{i=1}^{n} L_i|P_i - Q_i|$

Where:

$n$ : The number of branches in the tree.
$L_i$ : The length of the $i$ -th branch.
$P_i$ , $Q_i$ : The proportion of the community descending from branch $i$ in samples P and Q.
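
No Base R Equivalent is given for this metric, so the following is a minimal sketch of the formula on a hand-built toy tree, ((A:1,B:2):0.5,C:3). The tree, the counts, and the branch-by-tip matrix `desc` are hypothetical and chosen so the arithmetic can be followed by hand. A real `phylo` object would first need its branches mapped to descendant tips.

```r
# Hypothetical tree ((A:1,B:2):0.5,C:3), described one branch per row.
len  <- c(A = 1, B = 2, AB = 0.5, C = 3)  # branch lengths L_i
desc <- rbind(A  = c(1, 0, 0),            # 1 marks the tips below each branch
              B  = c(0, 1, 0),
              AB = c(1, 1, 0),
              C  = c(0, 0, 1))
colnames(desc) <- c("A", "B", "C")

x <- c(A = 2, B = 0, C = 2)  # hypothetical counts, sample P
y <- c(A = 0, B = 3, C = 1)  # hypothetical counts, sample Q

# Proportion of each community descending from each branch.
P <- as.vector(desc %*% (x / sum(x)))  # 0.50 0.00 0.50 0.50
Q <- as.vector(desc %*% (y / sum(y)))  # 0.00 0.75 0.75 0.25

# Sum of branch lengths weighted by the difference in proportions:
# 1 * 0.5 + 2 * 0.75 + 0.5 * 0.25 + 3 * 0.25 = 2.875
w_unifrac <- sum(len * abs(P - Q))
w_unifrac
```

Because the formula is not divided by a total branch length, the value is expressed in branch-length units rather than being bounded by 1.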

Input Types

The counts parameter is designed to accept a simple numeric matrix, but seamlessly supports objects from the following biological data packages:

phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment

For large datasets, standard matrix operations may be slow. See vignette('performance') for details on using optimized formats (e.g. sparse matrices) and parallel processing.

References

Examples

    weighted_unifrac(ex_counts, tree = ex_tree)

Package 'ecodive'

Help Index

Abundance-based Coverage Estimator (ACE)

Description

Usage

Arguments

Details

Input Types

References

See Also

Examples

Aitchison distance

Description

Usage

Arguments

Details

Pseudocount

Input Types

References

See Also

Examples

Alpha Diversity Wrapper Function

Description

Usage

Arguments

Details

Integer Count Requirements

Requires Integer Counts Only

Can Use Proportional Data

Value

Input Types

Examples

Berger-Parker Index

Description

Usage

Arguments

Details

Input Types

References

See Also

Examples

Beta Diversity Wrapper Function

Description

Usage

Arguments

Details

Value

Input Types

Pseudocount

Examples

Bhattacharyya distance

Description

Usage

Arguments

Details

Input Types

References

See Also

Examples

Bray-Curtis dissimilarity

Description

Usage

Arguments

Details

Input Types

Pseudocount

References

See Also

Examples

Brillouin Index

Description

Usage

Arguments

Details

Input Types

References

See Also

Examples

Canberra distance

Description