| Title: | Parallel and Memory-Efficient Ecological Diversity Metrics |
|---|---|
| Description: | Computes alpha and beta diversity metrics using concurrent 'C' threads. Metrics include 'UniFrac', Faith's phylogenetic diversity, Bray-Curtis dissimilarity, Shannon diversity index, and many others. Also parses newick trees into 'phylo' objects and rarefies feature tables. Parallel and memory-efficient algorithms are described in Smith et al. (2026) <doi:10.21105/joss.09777>. |
| Authors: | Daniel P. Smith [aut, cre] (ORCID: <https://orcid.org/0000-0002-2479-2044>), Melissa O'Neill [ctb, cph] (Author of PCG random number generator), Alkek Center for Metagenomics and Microbiome Research [cph, fnd] |
| Maintainer: | Daniel P. Smith <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2.3.0 |
| Built: | 2026-06-03 23:40:37 UTC |
| Source: | https://github.com/cmmr/ecodive |
A non-parametric estimator of species richness that separates features into abundant and rare groups.
ace(counts, cutoff = 10L, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
cutoff |
The maximum number of observations to consider "rare".
Default: |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
The ACE metric separates features into "abundant" and "rare" groups based on a cutoff (usually 10 counts). It assumes that the presence of abundant species is certain, while the true number of rare species must be estimated.
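Equations:
$$C_{ace} = 1 - \frac{F_1}{N_{rare}}$$
$$\gamma_{ace}^2 = \max\left[\frac{S_{rare} \sum_{i=1}^{r} i(i-1) F_i}{C_{ace} N_{rare} (N_{rare} - 1)} - 1,\; 0\right]$$
$$S_{ace} = S_{abund} + \frac{S_{rare}}{C_{ace}} + \frac{F_1}{C_{ace}} \gamma_{ace}^2$$
Where:
$r$: Rare cutoff (default 10). Features with $\le r$ counts are considered rare.
$F_i$: Number of features with exactly $i$ counts.
$F_1$: Number of features where the count is 1 (singletons).
$S_{rare}$: Number of rare features where the count is $\le r$.
$S_{abund}$: Number of abundant features where the count is $> r$.
$N_{rare}$: Total counts belonging to rare features.
$C_{ace}$: The sample abundance coverage estimator.
$\gamma_{ace}$: The estimated coefficient of variation.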
Parameter: cutoff The integer threshold distinguishing rare from abundant species. Standard practice is to use 10.
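The quantities defined above can be traced by hand for a single sample. The following is a minimal base R sketch of the standard Chao and Lee (1992) form of the estimator, not ecodive's implementation; the helper name ace_sketch and the count vector are made up for illustration, and degenerate inputs (for example, when every rare feature is a singleton, which makes the coverage term zero) are not guarded against.

```r
# Hypothetical helper (not part of ecodive): ACE for one sample's counts.
ace_sketch <- function(x, cutoff = 10L) {
  x       <- x[x > 0]                  # ignore unobserved features
  rare    <- x[x <= cutoff]            # counts of the rare features
  s_abund <- sum(x > cutoff)           # number of abundant features
  s_rare  <- length(rare)              # number of rare features
  n_rare  <- sum(rare)                 # total counts in rare features
  f1      <- sum(rare == 1)            # singletons

  c_ace <- 1 - f1 / n_rare             # sample abundance coverage
  i     <- seq_len(cutoff)
  f_i   <- tabulate(rare, nbins = cutoff)   # F_i for i = 1..cutoff
  gamma2 <- max(
    (s_rare * sum(i * (i - 1) * f_i)) / (c_ace * n_rare * (n_rare - 1)) - 1,
    0
  )

  s_abund + s_rare / c_ace + (f1 / c_ace) * gamma2
}

x <- c(1, 1, 1, 2, 2, 3, 5, 12, 40, 0)   # made-up counts for one sample
ace_est <- ace_sketch(x)
ace_est   # 11.6875
```

Here 9 features are observed, 7 of them rare, so the estimate of 11.6875 implies that roughly 2 to 3 rare features were missed by sampling.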
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Chao, A., & Lee, S. M. (1992). Estimating the number of classes via sample coverage. Journal of the American Statistical Association, 87(417), 210-217. doi:10.1080/01621459.1992.10475194
alpha_div(), vignette('adiv')
Other Richness metrics:
chao1(),
margalef(),
menhinick(),
observed(),
squares()
ace(ex_counts)
Calculates the Euclidean distance between centered log-ratio (CLR) transformed abundances.
aitchison(
  counts,
  margin = 1L,
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Aitchison distance is defined as:
$$\sqrt{\sum_{i=1}^{n} \left[(\ln X_i - X_L) - (\ln Y_i - Y_L)\right]^2}$$
Where:
$X_i$, $Y_i$: Absolute counts for the $i$-th feature.
$X_L$, $Y_L$: Mean log of abundances. $X_L = \frac{1}{n}\sum_{i=1}^{n} \ln X_i$.
$n$: The number of features.
Base R Equivalent:
x <- log((x + pseudocount) / exp(mean(log(x + pseudocount))))
y <- log((y + pseudocount) / exp(mean(log(y + pseudocount))))
sqrt(sum((x-y)^2)) # Euclidean distance
Zeros are undefined in the Aitchison (CLR) transformation. If
pseudocount is NULL (the default) and zeros are detected,
the function uses half the minimum non-zero value (min(x[x>0]) / 2)
and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
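To make the effect of the pseudocount concrete, here is a small base R sketch. The helper names (default_pseudocount, clr, aitchison_sketch) and the counts are hypothetical, and applying the half-minimum rule across the whole matrix, rather than per sample, is an assumption rather than something the text above specifies.

```r
# Hypothetical helpers (not ecodive's API).
# Assumption: the half-minimum rule is applied over the whole matrix.
default_pseudocount <- function(m) min(m[m > 0]) / 2

# Centered log-ratio transform of a strictly positive vector.
clr <- function(v) log(v) - mean(log(v))

# Euclidean distance between CLR-transformed, pseudocount-shifted counts.
aitchison_sketch <- function(x, y, pseudocount) {
  sqrt(sum((clr(x + pseudocount) - clr(y + pseudocount))^2))
}

m <- rbind(a = c(0, 4, 10, 2), b = c(3, 0, 8, 6))   # made-up counts
pc_default <- default_pseudocount(m)                 # 2 / 2 = 1

d_default <- aitchison_sketch(m["a", ], m["b", ], pc_default)
d_small   <- aitchison_sketch(m["a", ], m["b", ], 0.01)
d_small > d_default   # TRUE: a smaller pseudocount inflates the zeros' influence
```

With these counts the distance more than triples when the pseudocount drops from 1 to 0.01, which is why the value is best chosen explicitly and reported.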
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Aitchison, J. (1986). The statistical analysis of compositional data. Chapman and Hall. doi:10.1002/bimj.4710300705
Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2), 139-160. doi:10.1111/j.2517-6161.1982.tb01195.x
Costea, P. I., Zeller, G., Sunagawa, S., & Bork, P. (2014). A fair comparison. Nature Methods, 11(4), 359. doi:10.1038/nmeth.2897
Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., & Egozcue, J. J. (2017). Microbiome datasets are compositional: and this is not optional. Frontiers in Microbiology, 8, 2224. doi:10.3389/fmicb.2017.02224
Kaul, A., Mandal, S., Davidov, O., & Peddada, S. D. (2017). Analysis of microbiome data in the presence of excess zeros. Frontiers in Microbiology, 8, 2114. doi:10.3389/fmicb.2017.02114
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
aitchison(ex_counts, pseudocount = 1)
Alpha Diversity Wrapper Function
alpha_div(
  counts,
  metric,
  norm = "percent",
  cutoff = 10L,
  digits = 3L,
  tree = NULL,
  margin = 1L,
  cpus = n_cpus()
)
counts |
A numeric matrix of count data (samples |
metric |
The name of an alpha diversity metric. One of |
norm |
Normalize the incoming counts. Options are:
Default: |
cutoff |
The maximum number of observations to consider "rare".
Default: |
digits |
Precision of the returned values, in number of decimal
places. E.g. the default |
tree |
A |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
A frequent and critical error in alpha diversity analysis is providing the wrong type of data to a metric's formula. Some indices are mathematically defined based on counts of individuals and require raw, integer abundance data. Others are based on proportional abundances and can accept either integer counts (which are then converted to proportions) or pre-normalized proportional data. Using proportional data with a metric that requires integer counts will return an error message.
Chao1
ACE
Squares Richness Estimator
Margalef's Index
Menhinick's Index
Fisher's Alpha
Brillouin Index
Observed Features
Shannon Index
Gini-Simpson Index
Inverse Simpson Index
Berger-Parker Index
McIntosh Index
Faith's PD
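Before passing data to a count-based metric, it can help to confirm that the matrix really holds whole-number counts. The check below is a hypothetical pre-flight helper written in base R; it is not part of ecodive, and the tolerance and example matrices are illustrative assumptions.

```r
# Hypothetical pre-check (not part of ecodive): TRUE when every value is a
# finite, non-negative whole number, i.e. plausible raw count data.
is_whole_counts <- function(m, tol = 1e-8) {
  all(is.finite(m)) && all(m >= 0) && all(abs(m - round(m)) < tol)
}

raw   <- rbind(s1 = c(10, 0, 5), s2 = c(3, 7, 2))   # made-up integer counts
props <- raw / rowSums(raw)                          # each row now sums to 1

is_whole_counts(raw)     # TRUE
is_whole_counts(props)   # FALSE
```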
A numeric vector.
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
# Example counts matrix
ex_counts
# Shannon diversity values
alpha_div(ex_counts, 'Shannon')
# Chao1 diversity values
alpha_div(ex_counts, 'c')
# Faith PD values
alpha_div(ex_counts, 'faith', tree = ex_tree)
A measure of the numerical importance of the most abundant species.
berger(counts, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
The Berger-Parker index is defined as the proportional abundance of the most dominant feature:
$$\max(P_i)$$
Where:
$P_i$: Proportional abundance of the $i$-th feature.
Base R Equivalent:
x <- ex_counts[1,]
p <- x / sum(x)
max(p)
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Berger, W. H., & Parker, F. L. (1970). Diversity of planktonic foraminifera in deep-sea sediments. Science, 168(3937), 1345-1347. doi:10.1126/science.168.3937.1345
alpha_div(), vignette('adiv')
Other Dominance metrics:
mcintosh()
berger(ex_counts)
Beta Diversity Wrapper Function
beta_div(
  counts,
  metric,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  power = 1.5,
  alpha = 0.5,
  tree = NULL,
  pairs = NULL,
  cpus = n_cpus()
)
counts |
A numeric matrix of count data (samples |
metric |
The name of a beta diversity metric. One of |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
power |
Only used when |
alpha |
Only used when |
tree |
Only used by phylogeny-aware metrics. A |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
List of Beta Diversity Metrics
| Option / Function Name | Metric Name |
|---|---|
| aitchison | Aitchison distance |
| bhattacharyya | Bhattacharyya distance |
| bray | Bray-Curtis dissimilarity |
| canberra | Canberra distance |
| chebyshev | Chebyshev distance |
| chord | Chord distance |
| clark | Clark's divergence distance |
| divergence | Divergence |
| euclidean | Euclidean distance |
| generalized_unifrac | Generalized UniFrac (GUniFrac) |
| gower | Gower distance |
| hamming | Hamming distance |
| hellinger | Hellinger distance |
| horn | Horn-Morisita dissimilarity |
| jaccard | Jaccard distance |
| jensen | Jensen-Shannon distance |
| jsd | Jensen-Shannon divergence (JSD) |
| lorentzian | Lorentzian distance |
| manhattan | Manhattan distance |
| matusita | Matusita distance |
| minkowski | Minkowski distance |
| morisita | Morisita dissimilarity |
| motyka | Motyka dissimilarity |
| normalized_unifrac | Normalized Weighted UniFrac |
| ochiai | Otsuka-Ochiai dissimilarity |
| psym_chisq | Probabilistic Symmetric Chi-Squared distance |
| soergel | Soergel distance |
| sorensen | Dice-Sorensen dissimilarity |
| squared_chisq | Squared Chi-Squared distance |
| squared_chord | Squared Chord distance |
| squared_euclidean | Squared Euclidean distance |
| topsoe | Topsoe distance |
| unweighted_unifrac | Unweighted UniFrac |
| variance_adjusted_unifrac | Variance-Adjusted Weighted UniFrac (VAW-UniFrac) |
| wave_hedges | Wave Hedges distance |
| weighted_unifrac | Weighted UniFrac |
Flexible name matching
Matching is case-insensitive and accepts partial names. Any runs of non-alpha
characters are converted to underscores. E.g. metric = 'Weighted UniFrac'
selects weighted_unifrac.
UniFrac names can be shortened to the first letter plus "unifrac". E.g.
uunifrac, w_unifrac, or V UniFrac. These also support partial matching.
Finished code should always use the full primary option name to avoid ambiguity with future additions to the metrics list.
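The matching rules described above can be approximated in a few lines of base R. This is only an illustrative sketch: the helper match_metric and the shortened option list are hypothetical, ecodive's actual matching code may differ, and the UniFrac first-letter shortcuts are not handled here.

```r
# Hypothetical normalizer (not ecodive's implementation).
match_metric <- function(name, options) {
  key <- gsub("[^a-z]+", "_", tolower(name))   # runs of non-alpha -> "_"
  options[pmatch(key, options)]                # exact or unique partial match
}

# Shortened, illustrative option list.
opts <- c("bray", "euclidean", "weighted_unifrac", "unweighted_unifrac")

match_metric("Weighted UniFrac", opts)   # "weighted_unifrac"
match_metric("Eucl", opts)               # "euclidean"
match_metric("BRAY", opts)               # "bray"
```

Because pmatch() returns NA when a partial name matches more than one option, an abbreviation that is unique today can become ambiguous once new metrics are added, which is the reason for spelling out the full option name in finished code.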
A numeric vector.
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
# Example counts matrix
ex_counts
# Bray-Curtis distances
beta_div(ex_counts, 'bray')
# Generalized UniFrac distances
beta_div(ex_counts, 'GUniFrac', tree = ex_tree)
Measures the similarity of two probability distributions.
bhattacharyya(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Bhattacharyya distance is defined as:
$$-\ln \sum_{i=1}^{n} \sqrt{P_i Q_i}$$
Where:
$P_i$, $Q_i$: Proportional abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
-log(sum(sqrt(p * q)))
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society, 35, 99-109.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
bhattacharyya(ex_counts)
A standard ecological metric quantifying the dissimilarity between communities.
bray(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Bray-Curtis dissimilarity is defined as:
$$\frac{\sum_{i=1}^{n} |X_i - Y_i|}{\sum_{i=1}^{n} (X_i + Y_i)}$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x-y)) / sum(x+y)
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Bray, J. R., & Curtis, J. T. (1957). An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs, 27(4), 325-349. doi:10.2307/1942268
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
bray(ex_counts)
A diversity index derived from information theory, appropriate for fully censused communities.
brillouin(counts, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
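The Brillouin index is defined as:
$$\frac{\ln\left[\left(\sum_{i=1}^{n} X_i\right)!\right] - \sum_{i=1}^{n} \ln(X_i!)}{\sum_{i=1}^{n} X_i}$$
Where:
$n$: The number of features.
$X_i$: Integer count of the $i$-th feature.
Base R Equivalent:
x <- ex_counts[1,]
# note: lgamma(x + 1) == log(x!)
(lgamma(sum(x) + 1) - sum(lgamma(x + 1))) / sum(x)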
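The lgamma() identity noted in the base R equivalent is more than a convenience: factorials overflow double precision for realistic sample sizes, while their logarithms do not. The sketch below illustrates this with made-up counts; the helper name brillouin_sketch is hypothetical and simply wraps the expression shown above.

```r
# Hypothetical wrapper around the base R expression shown above.
brillouin_sketch <- function(x) {
  (lgamma(sum(x) + 1) - sum(lgamma(x + 1))) / sum(x)
}

x <- c(5, 3, 2)                                    # made-up small counts
all.equal(lgamma(x + 1), log(factorial(x)))        # TRUE
h_small <- brillouin_sketch(x)

big <- c(150, 100, 50)                             # total of 300 counts
# 300! exceeds the largest representable double, so factorial() gives Inf.
overflow <- is.infinite(suppressWarnings(factorial(sum(big))))
h_big <- brillouin_sketch(big)                     # still finite via lgamma()
```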
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Brillouin, L. (1956). Science and information theory. Academic Press.
alpha_div(), vignette('adiv')
Other Diversity metrics:
fisher(),
inv_simpson(),
shannon(),
simpson()
brillouin(ex_counts)
A weighted version of the Manhattan distance, sensitive to differences when both values are small.
canberra(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Canberra distance is defined as:
$$\sum_{i=1}^{n} \frac{|X_i - Y_i|}{X_i + Y_i}$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x-y) / (x+y))
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Lance, G. N., & Williams, W. T. (1966). Computer programs for hierarchical polythetic classification ("similarity analyses"). The Computer Journal, 9(1), 60-64. doi:10.1093/comjnl/9.1.60
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
canberra(ex_counts)
A non-parametric estimator of the lower bound of species richness.
chao1(counts, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
The Chao1 estimator uses the ratio of singletons to doubletons to estimate the number of missing species:
$$n + \frac{F_1^2}{2 F_2}$$
Where:
$n$: The number of observed features.
$F_1$: Number of features observed once (singletons).
$F_2$: Number of features observed twice (doubletons).
Base R Equivalent:
x <- ex_counts[1,]
sum(x>0) + (sum(x == 1) ^ 2) / (2 * sum(x == 2))
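A worked example with made-up counts shows how the singleton and doubleton tallies feed the estimate. It also shows what the plain base R expression does when a sample has no doubletons; how ecodive itself treats that case is not stated here, so the second half only describes the unguarded formula.

```r
x  <- c(1, 1, 1, 1, 2, 2, 7, 15, 0)   # made-up counts for one sample
n  <- sum(x > 0)                       # 8 observed features
f1 <- sum(x == 1)                      # 4 singletons
f2 <- sum(x == 2)                      # 2 doubletons
chao1_est <- n + f1^2 / (2 * f2)       # 8 + 16 / 4 = 12

# Unguarded formula with no doubletons: division by zero yields Inf.
y <- c(1, 1, 3, 9)
unguarded <- sum(y > 0) + sum(y == 1)^2 / (2 * sum(y == 2))
is.infinite(unguarded)                 # TRUE
```

In the first sample, the estimate of 12 suggests about 4 features went unobserved.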
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Chao, A. (1984). Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics, 11, 265-270.
alpha_div(), vignette('adiv')
Other Richness metrics:
ace(),
margalef(),
menhinick(),
observed(),
squares()
chao1(ex_counts)
The maximum difference between any single feature across two samples.
chebyshev(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Chebyshev distance is defined as:
$$\max(|X_i - Y_i|)$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
max(abs(x-y))
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Cantrell, C. D. (2000). Modern mathematical methods for physicists and engineers. Cambridge University Press.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
chebyshev(ex_counts)
Euclidean distance between normalized vectors.
chord(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
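The Chord distance is defined as:
$$\sqrt{\sum_{i=1}^{n} \left(\frac{X_i}{\sqrt{\sum_{j=1}^{n} X_j^2}} - \frac{Y_i}{\sqrt{\sum_{j=1}^{n} Y_j^2}}\right)^2}$$
Where:
$X_i$, $Y_i$: Absolute counts of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sqrt(sum(((x / sqrt(sum(x ^ 2))) - (y / sqrt(sum(y ^ 2))))^2))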
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Orlóci, L. (1967). An agglomerative method for classification of plant communities. Journal of Ecology, 55(1), 193-206. doi:10.2307/2257725
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
chord(ex_counts)
Also known as the coefficient of divergence.
clark(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Clark's divergence distance is defined as:
$$\sqrt{\sum_{i=1}^{n} \left(\frac{|X_i - Y_i|}{X_i + Y_i}\right)^2}$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sqrt(sum((abs(x - y) / (x + y)) ^ 2))
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Clark, P. J. (1952). An extension of the coefficient of divergence for use with multiple characters. Copeia, 1952(2), 61-64. doi:10.2307/1438598
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
clark(ex_counts)
A probabilistic divergence metric.
divergence(
  counts,
  margin = 1L,
  norm = "none",
  pseudocount = NULL,
  pairs = NULL,
  cpus = n_cpus()
)
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
Divergence is defined as:
$$2 \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{(P_i + Q_i)^2}$$
Where:
$P_i$, $Q_i$: Proportional abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
2 * sum((p - q)^2 / (p + q)^2)
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
divergence(ex_counts)
The straight-line distance between two points in multidimensional space.
euclidean( counts, margin = 1L, norm = "none", pseudocount = NULL, pairs = NULL, cpus = n_cpus() )
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Euclidean distance is defined as:
$$\sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sqrt(sum((x-y)^2))
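The two-sample calculation extends to every pair of samples. The sketch below uses a hypothetical matrix `toy` with samples in rows (as with `margin = 1L`) and cross-checks the result against base R's `dist()`. It illustrates the arithmetic only, not how the package computes distances internally.

```r
# Hypothetical toy counts: 3 samples (rows) x 4 features (columns).
toy <- matrix(
  c(4, 0, 3, 1,
    2, 2, 0, 5,
    0, 1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), paste0("f", 1:4)))

# Apply the two-sample formula to every pair of rows.
idx <- combn(rownames(toy), 2)
d <- apply(idx, 2, function(p) sqrt(sum((toy[p[1], ] - toy[p[2], ])^2)))
names(d) <- paste(idx[1, ], idx[2, ], sep = "-")
d

# Same values, in the same order, as base R's dist().
all.equal(unname(d), as.vector(dist(toy, method = "euclidean")))
```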
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Legendre, P., & Legendre, L. (2012). Numerical ecology. Elsevier.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
euclidean(ex_counts)
Genera found on four human body sites.
ex_counts
A matrix of 4 samples (columns) x 6 genera (rows).
Derived from The Human Microbiome Project dataset.
Companion tree for ex_counts.
ex_tree
A phylo object.
ex_tree encodes this tree structure:
       +----------44---------- Haemophilus
   +-2-|
   |   +----------------68---------------- Bacteroides
   |
   |             +---18---- Streptococcus
   |      +--12--|
   |      |      +--11-- Staphylococcus
   +--11--|
          |      +-----24----- Corynebacterium
          +--12--|
                 +--13-- Propionibacterium
Calculates the sum of the branch lengths for all species present in a sample.
faith(counts, tree = NULL, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
tree |
A |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Faith's PD is defined as:
$$\sum_{i=1}^{n} L_i A_i$$
Where:
$n$: The number of branches in the phylogenetic tree.
$L_i$: The length of the $i$-th branch.
$A_i$: A binary value (1 if any descendants of branch $i$ are present in the sample, 0 otherwise).
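As a worked example, the sum can be evaluated by hand on the branch lengths shown in the `ex_tree` diagram. The branch table below is a transcription of that diagram (an assumption about its layout), and the sample is hypothetical. No tree-handling package is used.

```r
# Each branch: its length and the tips that descend from it.
branches <- list(
  list(len = 44, tips = "Haemophilus"),
  list(len = 68, tips = "Bacteroides"),
  list(len =  2, tips = c("Haemophilus", "Bacteroides")),
  list(len = 18, tips = "Streptococcus"),
  list(len = 11, tips = "Staphylococcus"),
  list(len = 12, tips = c("Streptococcus", "Staphylococcus")),
  list(len = 24, tips = "Corynebacterium"),
  list(len = 13, tips = "Propionibacterium"),
  list(len = 12, tips = c("Corynebacterium", "Propionibacterium")),
  list(len = 11, tips = c("Streptococcus", "Staphylococcus",
                          "Corynebacterium", "Propionibacterium")))

# Hypothetical sample in which only two genera were observed.
present <- c("Streptococcus", "Bacteroides")

len    <- vapply(branches, function(b) b$len, numeric(1))
in_use <- vapply(branches, function(b) any(b$tips %in% present), logical(1))

sum(len * in_use)  # 68 + 2 + 18 + 12 + 11 = 111
```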
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Faith, D. P. (1992). Conservation evaluation and phylogenetic diversity. Biological Conservation, 61(1), 1-10. doi:10.1016/0006-3207(92)91201-3
alpha_div(), vignette('adiv')
Other Phylogenetic metrics:
generalized_unifrac(),
normalized_unifrac(),
unweighted_unifrac(),
variance_adjusted_unifrac(),
weighted_unifrac()
faith(ex_counts, tree = ex_tree)
A parametric diversity index assuming species abundances follow a log-series distribution.
fisher(counts, digits = 3L, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
digits |
Precision of the returned values, in number of decimal
places. E.g. the default |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Fisher's Alpha ($\alpha$) is the parameter in the equation:
$$\frac{n}{\alpha} = \ln\left(1 + \frac{X_T}{\alpha}\right)$$
Where:
$n$: The number of features.
$X_T$: Total of all counts (sequencing depth).
The value of $\alpha$ is solved for iteratively.
Parameter: digits
The precision (number of decimal places) to use when solving the equation.
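One way to solve for alpha iteratively is a root finder. The sketch below takes the log-series relation in its usual form, n = alpha * log(1 + N / alpha), with n the number of observed features and N the total count. `fisher_sketch()` is a hypothetical helper built on base R's `uniroot()`. The package's own solver may differ.

```r
# Hypothetical helper: solve n = alpha * log(1 + N / alpha) for alpha.
fisher_sketch <- function(x, digits = 3L) {
  n <- sum(x > 0)   # number of observed features
  N <- sum(x)       # total counts
  f <- function(a) a * log(1 + N / a) - n
  # Requires n < N so that f changes sign on the search interval.
  a <- uniroot(f, interval = c(1e-8, 1e8), tol = 10^-(digits + 2))$root
  round(a, digits)
}

x <- c(50, 30, 10, 5, 3, 1, 1)   # toy counts for one sample
fisher_sketch(x)
fisher_sketch(x, digits = 6L)    # tighter tolerance, more decimal places
```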
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Fisher, R. A., Corbet, A. S., & Williams, C. B. (1943). The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology, 12, 42-58. doi:10.2307/1411
alpha_div(), vignette('adiv')
Other Diversity metrics:
brillouin(),
inv_simpson(),
shannon(),
simpson()
fisher(ex_counts)
A unified UniFrac distance that balances the weight of abundant and rare lineages.
generalized_unifrac( counts, tree = NULL, alpha = 0.5, margin = 1L, pairs = NULL, cpus = n_cpus() )
counts |
A numeric matrix of count data (samples |
tree |
A |
alpha |
How much weight to give to relative abundances; a value
between 0 and 1, inclusive. Setting |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Generalized UniFrac distance is defined as:
$$\frac{\sum_{i=1}^{n} L_i (P_i + Q_i)^{\alpha} \left| \frac{P_i - Q_i}{P_i + Q_i} \right|}{\sum_{i=1}^{n} L_i (P_i + Q_i)^{\alpha}}$$
Where:
$n$: The number of branches in the tree.
$L_i$: The length of the $i$-th branch.
$P_i$, $Q_i$: The proportion of the community descending from branch $i$ in sample P and Q.
$\alpha$: A scalable weighting factor.
Parameter: alpha
The alpha parameter controls the weight given to abundant lineages. $\alpha = 1$ corresponds to Weighted UniFrac, while $\alpha = 0$ corresponds to Unweighted UniFrac.
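The effect of alpha is easier to see numerically. The sketch below follows the form given by Chen et al. (2012): each branch is weighted by L * (P + Q)^alpha, where L is the branch length and P and Q are the proportions of each sample descending from that branch. The inputs and the helper `gunifrac_sketch()` are hypothetical.

```r
# Hypothetical per-branch inputs: lengths and the fraction of each sample
# descending from every branch.
L <- c(44, 68, 2, 18)
P <- c(0.50, 0.10, 0.60, 0.40)
Q <- c(0.00, 0.70, 0.70, 0.30)

gunifrac_sketch <- function(L, P, Q, alpha = 0.5) {
  keep <- (P + Q) > 0                 # branches absent from both samples add nothing
  L <- L[keep]; P <- P[keep]; Q <- Q[keep]
  w <- L * (P + Q)^alpha              # branch weight
  sum(w * abs(P - Q) / (P + Q)) / sum(w)
}

gunifrac_sketch(L, P, Q, alpha = 0.5)

# With alpha = 1 the (P + Q) terms cancel in the numerator, leaving
# sum(L * |P - Q|) / sum(L * (P + Q)).
all.equal(gunifrac_sketch(L, P, Q, alpha = 1),
          sum(L * abs(P - Q)) / sum(L * (P + Q)))
```

Lower values of alpha shrink the weight gap between high-abundance and low-abundance branches, so rare lineages contribute relatively more.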
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Chen, J., Bittinger, K., Charlson, E. S., Hoffmann, C., Lewis, J., Wu, G. D., ... & Li, H. (2012). Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics, 28(16), 2106-2113. doi:10.1093/bioinformatics/bts342
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Phylogenetic metrics:
faith(),
normalized_unifrac(),
unweighted_unifrac(),
variance_adjusted_unifrac(),
weighted_unifrac()
generalized_unifrac(ex_counts, tree = ex_tree, alpha = 0.5)
A distance metric that normalizes differences by the range of the feature.
gower( counts, margin = 1L, norm = "none", pseudocount = NULL, pairs = NULL, cpus = n_cpus() )
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Gower distance is defined as:
$$\frac{1}{n} \sum_{i=1}^{n} \frac{|X_i - Y_i|}{R_i}$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
$R_i$: The range of the $i$-th feature across all samples (max - min).
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
r <- abs(x - y)  # feature ranges, taken here over these two samples only
n <- length(x)
sum(abs(x-y) / r) / n
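When more than two samples are present, the range of each feature is taken across all of them. A minimal sketch on a hypothetical three-sample matrix `toy` (samples in rows):

```r
# Hypothetical toy counts: 3 samples (rows) x 4 features (columns).
toy <- matrix(
  c(4, 0, 3, 1,
    2, 2, 0, 5,
    0, 1, 1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), paste0("f", 1:4)))

# Range of each feature across all samples (max - min).
r <- apply(toy, 2, function(v) max(v) - min(v))

# Gower distance between s1 and s2. Features with zero range would need
# special handling; this toy matrix has none.
x <- toy["s1", ]
y <- toy["s2", ]
g <- sum(abs(x - y) / r) / length(x)
g  # (2/4 + 2/2 + 3/3 + 4/4) / 4 = 0.875
```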
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27(4), 857-871. doi:10.2307/2528823
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
gower(ex_counts)
Measures the minimum number of substitutions required to change one string into the other.
hamming(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Hamming distance is defined as:
$$(A + B) - 2J$$
Where:
$A$, $B$: Number of features in each sample.
$J$: Number of features in common (intersection).
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(xor(x, y))
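For presence/absence data, the `xor()` one-liner is the same as counting features found in exactly one sample. Writing A and B for the number of features present in each sample and J for the number present in both, that count is A + B - 2J. A small check on hypothetical counts:

```r
x <- c(3, 0, 5, 0, 1)     # hypothetical counts, sample 1
y <- c(0, 0, 2, 7, 4)     # hypothetical counts, sample 2

A <- sum(x > 0)           # features present in sample 1
B <- sum(y > 0)           # features present in sample 2
J <- sum(x > 0 & y > 0)   # features present in both

(A + B) - 2 * J           # 3 + 3 - 2 * 2 = 2
sum(xor(x, y))            # same value; abundances are ignored
```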
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System Technical Journal, 29(2), 147-160. doi:10.1002/j.1538-7305.1950.tb00463.x
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Presence/Absence metrics:
jaccard(),
ochiai(),
sorensen()
hamming(ex_counts)
A distance metric related to the Bhattacharyya distance, often used for community data with many zeros.
hellinger(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Hellinger distance is defined as:
$$\sqrt{\sum_{i=1}^{n} \left(\sqrt{P_i} - \sqrt{Q_i}\right)^2}$$
Where:
$P_i$, $Q_i$: Proportional abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sqrt(sum((sqrt(p) - sqrt(q)) ^ 2))
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Rao, C. R. (1995). A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Qüestiió, 19, 23-63.
Hellinger, E. (1909). Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik, 136, 210–271. doi:10.1515/crll.1909.136.210
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
hellinger(ex_counts)
A similarity index based on Simpson's diversity index, suitable for abundance data.
horn( counts, margin = 1L, norm = "none", pseudocount = NULL, pairs = NULL, cpus = n_cpus() )
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Horn-Morisita dissimilarity is defined as:
$$1 - \frac{2 \sum_{i=1}^{n} P_i Q_i}{\sum_{i=1}^{n} P_i^2 + \sum_{i=1}^{n} Q_i^2}$$
Where:
$P_i$, $Q_i$: Proportional abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
z <- sum(x^2) / sum(x)^2 + sum(y^2) / sum(y)^2
1 - ((2 * sum(x * y)) / (z * sum(x) * sum(y)))
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Horn, H. S. (1966). Measurement of "overlap" in comparative ecological studies. The American Naturalist, 100(914), 419-424. doi:10.1086/282436
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
horn(ex_counts)
A transformation of the Simpson index that represents the "effective number of species".
inv_simpson(counts, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
The Inverse Simpson index is defined as:
$$1 \Big/ \sum_{i=1}^{n} P_i^2$$
Where:
$n$: The number of features.
$P_i$: Proportional abundance of the $i$-th feature.
Base R Equivalent:
x <- ex_counts[1,]
p <- x / sum(x)
1 / sum(p ^ 2)
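The "effective number of species" reading can be checked on two hypothetical samples that have the same richness but very different evenness. `inv_simpson_sketch()` is an illustrative helper that wraps the base R calculation.

```r
# Illustrative helper: the base R calculation wrapped in a function.
inv_simpson_sketch <- function(x) {
  p <- x / sum(x)
  1 / sum(p^2)
}

even   <- inv_simpson_sketch(c(25, 25, 25, 25))  # perfectly even community
skewed <- inv_simpson_sketch(c(97, 1, 1, 1))     # one dominant feature

even    # 4: behaves like four equally common species
skewed  # just above 1: behaves like a single species
```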
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Simpson, E. H. (1949). Measurement of diversity. Nature, 163, 688. doi:10.1038/163688a0
alpha_div(), vignette('adiv')
Other Diversity metrics:
brillouin(),
fisher(),
shannon(),
simpson()
inv_simpson(ex_counts)
Measures dissimilarity between sample sets.
jaccard(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Jaccard distance is defined as:
$$1 - \frac{J}{A + B - J}$$
Where:
$A$, $B$: Number of features in each sample.
$J$: Number of features in common (intersection).
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
1 - sum(x & y) / sum(x | y)
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11(2), 37-50. doi:10.1111/j.1469-8137.1912.tb05611.x
Jaccard, P. (1908). Nouvelles recherches sur la distribution florale. Bulletin de la Societe Vaudoise des Sciences Naturelles, 44(163), 223-270. doi:10.5169/seals-268384
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Presence/Absence metrics:
hamming(),
ochiai(),
sorensen()
jaccard(ex_counts)
The square root of the Jensen-Shannon divergence.
jensen(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
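The Jensen-Shannon distance is defined as:
$$\sqrt{\frac{1}{2} \left[ \sum_{i=1}^{n} P_i \ln\left(\frac{2 P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n} Q_i \ln\left(\frac{2 Q_i}{P_i + Q_i}\right) \right]}$$
Where:
$P_i$, $Q_i$: Proportional abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sqrt(sum(p * log(2 * p / (p+q)), q * log(2 * q / (p+q))) / 2)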
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Endres, D. M., & Schindelin, J. E. (2003). A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7), 1858-1860. doi:10.1109/TIT.2003.813506
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
jensen(ex_counts)
A symmetrized and smoothed version of the Kullback-Leibler divergence.
jsd(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
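The Jensen-Shannon divergence (JSD) is defined as:
$$\frac{1}{2} \left[ \sum_{i=1}^{n} P_i \ln\left(\frac{2 P_i}{P_i + Q_i}\right) + \sum_{i=1}^{n} Q_i \ln\left(\frac{2 Q_i}{P_i + Q_i}\right) \right]$$
Where:
$P_i$, $Q_i$: Proportional abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sum(p * log(2 * p / (p+q)), q * log(2 * q / (p+q))) / 2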
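In plain R, the one-line calculation returns NaN whenever a feature has a zero count in either sample, because `0 * log(0)` evaluates to NaN. The sketch below applies the usual convention that such terms contribute zero. The helpers `kl_half()` and `jsd_sketch()` are hypothetical, and whether the package treats zeros the same way is an assumption.

```r
# Hypothetical helper: one half of the divergence, skipping terms where a == 0.
kl_half <- function(a, m) {
  nz <- a > 0
  sum(a[nz] * log(a[nz] / m[nz]))
}

jsd_sketch <- function(x, y) {
  p <- x / sum(x)
  q <- y / sum(y)
  m <- (p + q) / 2      # p * log(p / m) equals p * log(2 * p / (p + q))
  (kl_half(p, m) + kl_half(q, m)) / 2
}

jsd_sketch(c(5, 0, 3), c(5, 0, 3))        # identical samples: 0
jsd_sketch(c(4, 0, 0), c(0, 2, 6))        # no shared features: log(2)
sqrt(jsd_sketch(c(4, 1, 0), c(0, 2, 6)))  # the Jensen-Shannon distance
```

With natural logs the divergence is bounded above by log(2), which is reached when the two samples share no features.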
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151. doi:10.1109/18.61115
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
jsd(ex_counts)
Programmatic access to the lists of available metrics, and their associated functions.
list_metrics( div = c(NA, "alpha", "beta"), val = c("data.frame", "list", "func", "id", "name", "div", "phylo", "weighted", "int_only", "true_metric"), nm = c(NA, "id", "name"), phylo = NULL, weighted = NULL, int_only = NULL, true_metric = NULL )
match_metric( metric, div = NULL, phylo = NULL, weighted = NULL, int_only = NULL, true_metric = NULL )
div |
Filter by diversity type. One of |
val |
Sets the return value for this function call. See "Value"
section below. Default: |
nm |
What value to use for the names of the returned object.
Default is |
phylo |
Filter by whether a phylogenetic tree is required.
|
weighted |
Filter by whether relative abundance is used. |
int_only |
Filter by whether integer counts are required. |
true_metric |
Filter by whether the metric satisfies the triangle
inequality. |
metric |
The name of an alpha/beta diversity metric to search for. Supports partial matching. All non-alpha characters are ignored. |
match_metric()
A list with the following elements.
name : Metric name, e.g. "Faith's Phylogenetic Diversity"
id : Metric ID - also the name of the function, e.g. "faith"
div : Either "alpha" or "beta".
phylo : TRUE if metric requires a phylogenetic tree; FALSE otherwise.
weighted : TRUE if metric takes relative abundance into account; FALSE if it only uses presence/absence.
int_only : TRUE if metric requires integer counts; FALSE otherwise.
true_metric : TRUE if metric is a true metric and satisfies the triangle inequality; FALSE if it is a non-metric dissimilarity; NA for alpha diversity metrics.
func : The function for this metric, e.g. ecodive::faith
params : Formal args for func, e.g. c("counts", "norm", "tree", "cpus")
list_metrics()
The returned object's type and values are controlled with the val and nm arguments.
val = "data.frame" : The data.frame from which the below options are sourced.
val = "list" : A list of objects as returned by match_metric() (above).
val = "func" : A list of functions.
val = "id" : A character vector of metric IDs.
val = "name" : A character vector of metric names.
val = "div" : A character vector of "alpha" and/or "beta" values.
val = "phylo" : A logical vector indicating which metrics require a tree.
val = "weighted" : A logical vector indicating which metrics take relative abundance into account (as opposed to just presence/absence).
val = "int_only" : A logical vector indicating which metrics require integer counts.
val = "true_metric" : A logical vector indicating which metrics are true metrics and satisfy the triangle inequality, which work better for ordinations such as PCoA.
If nm is set, then the names of the vector or list will be the metric ID
(nm="id") or name (nm="name"). When val="data.frame", the names will be
applied to the rownames() property of the data.frame.
# A data.frame of all available metrics.
head(list_metrics())
# All alpha diversity function names.
list_metrics('alpha', val = 'id')
# Try to find a metric named 'otus'.
m <- match_metric('otus')
# The result is a list that includes the function.
str(m)
A log-based distance metric that is robust to outliers.
lorentzian( counts, margin = 1L, norm = "none", pseudocount = NULL, pairs = NULL, cpus = n_cpus() )
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Lorentzian distance is defined as:
$$\sum_{i=1}^{n} \ln\left(1 + |X_i - Y_i|\right)$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(log(1 + abs(x - y)))
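The robustness to outliers comes from the logarithm, which compresses large per-feature differences. A small comparison against the plain sum of absolute differences, on hypothetical counts in which the last feature is an outlier:

```r
x <- c(10, 4, 0, 500)   # hypothetical counts; the last feature is an outlier
y <- c(12, 1, 3,   2)

d <- abs(x - y)
sum(d)                  # sum of absolute differences: dominated by the outlier
sum(log(1 + d))         # Lorentzian: the outlier's contribution is compressed

# Share of each total that comes from the outlier feature.
share_abs <- d[4] / sum(d)
share_log <- log(1 + d[4]) / sum(log(1 + d))
c(absolute = share_abs, lorentzian = share_log)
```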
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
lorentzian(ex_counts)
The sum of absolute differences, also known as the taxicab geometry.
manhattan( counts, margin = 1L, norm = "none", pseudocount = NULL, pairs = NULL, cpus = n_cpus() )
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Manhattan distance is defined as:
$$\sum_{i=1}^{n} |X_i - Y_i|$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x-y))
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Krause, E. F. (1987). Taxicab geometry: An adventure in non-Euclidean geometry. Dover Publications.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
manhattan(ex_counts)
A richness metric that normalizes the number of species by the log of the total sample size.
margalef(counts, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Margalef's index is defined as:
$$\frac{n - 1}{\ln X_T}$$
Where:
$n$: The number of features.
$X_T$: Total of all counts (sequencing depth).
Base R Equivalent:
x <- ex_counts[1,]
(sum(x > 0) - 1) / log(sum(x))
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Margalef, R. (1958). Information theory in ecology. General Systems, 3, 36-71.
Gamito, S. (2010). Caution is needed when applying Margalef diversity index. Ecological Indicators, 10(2), 550-551. doi:10.1016/j.ecolind.2009.07.006
alpha_div(), vignette('adiv')
Other Richness metrics:
ace(),
chao1(),
menhinick(),
observed(),
squares()
margalef(ex_counts)
A distance measure closely related to the Hellinger distance.
matusita(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Matusita distance is defined as:
$$\sqrt{\sum_{i=1}^{n} \left( \sqrt{P_i} - \sqrt{Q_i} \right)^2}$$
Where:
$P_i$, $Q_i$: Proportional abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sqrt(sum((sqrt(p) - sqrt(q)) ^ 2))
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Matusita, K. (1955). Decision rules, based on the distance, for problems of fit, two samples, and estimation. The Annals of Mathematical Statistics, 26(4), 631-640. doi:10.1214/aoms/1177728422
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
matusita(ex_counts)
A dominance index based on the Euclidean distance from the origin.
mcintosh(counts, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
The McIntosh index is defined as:
$$\frac{X_T - \sqrt{\sum_{i=1}^{n} X_i^2}}{X_T - \sqrt{X_T}}$$
Where:
$n$: The number of features.
$X_i$: Integer count of the $i$-th feature.
$X_T$: Total of all counts.
Base R Equivalent:
x <- ex_counts[1,]
(sum(x) - sqrt(sum(x^2))) / (sum(x) - sqrt(sum(x)))
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
McIntosh, R. P. (1967). An index of diversity and the relation of certain concepts to diversity. Ecology, 48(3), 392-404. doi:10.2307/1932674
alpha_div(), vignette('adiv')
Other Dominance metrics:
berger()
mcintosh(ex_counts)
A richness metric that normalizes the number of species by the square root of the total sample size.
menhinick(counts, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
Menhinick's index is defined as:
$$\frac{n}{\sqrt{X_T}}$$
Where:
$n$: The number of features.
$X_T$: Total of all counts.
Base R Equivalent:
x <- ex_counts[1,]
sum(x > 0) / sqrt(sum(x))
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Menhinick, E. F. (1964). A comparison of some species-individuals diversity indices applied to samples of field insects. Ecology, 45(4), 859-861. doi:10.2307/1934933
alpha_div(), vignette('adiv')
Other Richness metrics:
ace(),
chao1(),
margalef(),
observed(),
squares()
menhinick(ex_counts)
A generalized metric that includes Euclidean and Manhattan distance as special cases.
minkowski(counts, margin = 1L, power = 1.5, norm = "none", pseudocount = NULL, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
power |
Scaling factor for the magnitude of differences between
communities ( |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Minkowski distance is defined as:
$$\left( \sum_{i=1}^{n} |X_i - Y_i|^p \right)^{1/p}$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
$n$: The number of features.
$p$: The geometry of the space (power parameter).
Parameter: power
The power parameter (default 1.5) determines the value of $p$ in the equation.
Special Cases
Manhattan distance: When $p = 1$, the formula reduces to the sum of absolute differences.
Euclidean distance: When $p = 2$, the formula reduces to the standard straight-line distance.
Chebyshev distance: When $p \to \infty$, the formula converges to the maximum absolute difference.
Base R Equivalent:
p <- 1.5
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x - y)^p) ^ (1/p)
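The special cases listed above can be checked numerically. The sketch below uses made-up abundance vectors and a hypothetical helper, `minkowski_sketch()`, whose `p` argument plays the role of `power`; it is not the package's implementation.

```r
# Hypothetical helper; `p` corresponds to the `power` argument.
minkowski_sketch <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

# Made-up abundances for two samples.
x <- c(4, 0, 7, 1)
y <- c(1, 2, 7, 5)

manhattan <- sum(abs(x - y))        # 3 + 2 + 0 + 4 = 9
euclidean <- sqrt(sum((x - y)^2))   # sqrt(9 + 4 + 0 + 16) = sqrt(29)
chebyshev <- max(abs(x - y))        # 4

minkowski_sketch(x, y, p = 1)    # equals manhattan
minkowski_sketch(x, y, p = 2)    # equals euclidean
minkowski_sketch(x, y, p = 50)   # approaches chebyshev as p grows
```

Intermediate values such as the default of 1.5 sit between the Manhattan and Euclidean results, weighting large per-feature differences more than the former and less than the latter.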
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Deza, M. M., & Deza, E. (2009). Encyclopedia of distances. Springer.
Minkowski, H. (1896). Geometrie der Zahlen. Teubner.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
minkowski(ex_counts, power = 2) # Equivalent to Euclidean
A measure of overlap between samples that is independent of sample size. Requires integer counts.
morisita(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
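The Morisita dissimilarity is defined as:
$$1 - \frac{2 \sum_{i=1}^{n} X_i Y_i}{\left( \frac{\sum_{i=1}^{n} X_i (X_i - 1)}{X_T (X_T - 1)} + \frac{\sum_{i=1}^{n} Y_i (Y_i - 1)}{Y_T (Y_T - 1)} \right) X_T Y_T}$$
Where:
$X_i$, $Y_i$: Absolute counts of the $i$-th feature.
$X_T$, $Y_T$: Total counts in each sample. $X_T = \sum_{i=1}^{n} X_i$.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
simpson_x <- sum(x * (x - 1)) / (sum(x) * (sum(x) - 1))
simpson_y <- sum(y * (y - 1)) / (sum(y) * (sum(y) - 1))
1 - ((2 * sum(x * y)) / ((simpson_x + simpson_y) * sum(x) * sum(y)))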
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Morisita, M. (1959). Measuring of interspecific association and similarity between communities. Memoirs of the Faculty of Science, Kyushu University, Series E (Biology), 3, 65-80.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
morisita(ex_counts)
Also known as the Bray-Curtis dissimilarity when applied to abundance data, but formulated slightly differently.
motyka(counts, margin = 1L, norm = "none", pseudocount = NULL, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Motyka dissimilarity is defined as:
$$\frac{\sum_{i=1}^{n} \max(X_i, Y_i)}{\sum_{i=1}^{n} (X_i + Y_i)}$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(pmax(x, y)) / sum(x, y)
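The link to Bray-Curtis can be made explicit. Because max(x, y) = (x + y + |x - y|) / 2 for every feature, the ratio above equals (1 + BC) / 2, where BC is the Bray-Curtis dissimilarity. The sketch below checks this on made-up vectors; the Bray-Curtis line assumes the usual definition (sum of absolute differences over the combined total).

```r
# Made-up abundances for two samples.
x <- c(4, 0, 7, 1)
y <- c(1, 2, 7, 5)

motyka <- sum(pmax(x, y)) / sum(x, y)   # (4 + 2 + 7 + 5) / 27 = 18 / 27
bray   <- sum(abs(x - y)) / sum(x, y)   # assumed definition: 9 / 27

all.equal(motyka, (1 + bray) / 2)       # TRUE
```

One consequence is that this ratio ranges from 0.5 (identical samples) to 1 (no shared features), rather than from 0 to 1.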
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Motyka, J. (1947). O celach i metodach badan geobotanicznych. Annales Universitatis Mariae Curie-Sklodowska, Sectio C, 3, 1-168.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
motyka(ex_counts)
A thin wrapper around parallelly::availableCores(). If the parallelly
package is not installed, then it falls back to
parallel::detectCores(all.tests = TRUE, logical = TRUE). Returns 1 if
pthread support is unavailable or when the number of CPUs cannot be
determined.
n_cpus()
A scalar integer, guaranteed to be at least 1.
n_cpus()
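The fallback order described for this function can be sketched as follows. `n_cpus_sketch()` is a hypothetical re-implementation for illustration only; it omits the pthread check, which happens in compiled code and cannot be reproduced from R here.

```r
# Hypothetical sketch of the documented fallback order (no pthread check).
n_cpus_sketch <- function() {
  n <- if (requireNamespace("parallelly", quietly = TRUE)) {
    parallelly::availableCores()
  } else {
    parallel::detectCores(all.tests = TRUE, logical = TRUE)
  }
  n <- suppressWarnings(as.integer(n))[1]   # drop names, keep a scalar
  if (is.na(n) || n < 1L) 1L else n         # never return less than 1
}

n_cpus_sketch()
```

Preferring `parallelly::availableCores()` matters on shared systems, since it respects scheduler and container limits that `parallel::detectCores()` does not see.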
Weighted UniFrac normalized by the tree length to allow comparison between trees.
normalized_unifrac(counts, tree = NULL, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
tree |
A |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Normalized Weighted UniFrac distance is defined as:
$$\frac{\sum_{i=1}^{n} L_i |P_i - Q_i|}{\sum_{i=1}^{n} L_i (P_i + Q_i)}$$
Where:
$n$: The number of branches in the tree.
$L_i$: The length of the $i$-th branch.
$P_i$, $Q_i$: The proportion of the community descending from branch $i$ in samples $P$ and $Q$.
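As a small worked example, the calculation can be carried out by hand once branch lengths and per-branch proportions are known. The numbers below are made up, and the denominator assumes the normalization of Lozupone et al. (2007) rewritten per branch, i.e. the sum of branch length times the combined proportion descending from that branch.

```r
# Made-up values for a 4-branch tree: branch lengths and, for each sample,
# the proportion of the community descending from each branch.
L <- c(0.5, 0.2, 0.8, 0.1)
P <- c(1.0, 0.6, 0.4, 0.0)
Q <- c(1.0, 0.1, 0.5, 0.4)

weighted   <- sum(L * abs(P - Q))            # un-normalized weighted UniFrac
normalized <- weighted / sum(L * (P + Q))    # assumed normalization term

weighted     # 0.2*0.5 + 0.8*0.1 + 0.1*0.4 = 0.22
normalized   # 0.22 / 1.90
```

Because |P_i - Q_i| never exceeds P_i + Q_i, the normalized value always falls between 0 and 1, which is what makes distances from different trees comparable.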
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Lozupone, C. A., Hamady, M., Kelley, S. T., & Knight, R. (2007). Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology, 73(5), 1576-1585. doi:10.1128/AEM.01996-06
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Phylogenetic metrics:
faith(),
generalized_unifrac(),
unweighted_unifrac(),
variance_adjusted_unifrac(),
weighted_unifrac()
normalized_unifrac(ex_counts, tree = ex_tree)
The count of unique features (richness) in a sample.
observed(counts, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
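Observed features is defined simply as the number of features with non-zero abundance:
$$\sum_{i=1}^{n} [X_i > 0]$$
Where $X_i$ is the abundance of the $i$-th feature, $n$ is the number of features, and $[X_i > 0]$ is 1 when the condition holds and 0 otherwise.
Base R Equivalent:
x <- ex_counts[1,]
sum(x > 0)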
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
alpha_div(), vignette('adiv')
Other Richness metrics:
ace(),
chao1(),
margalef(),
menhinick(),
squares()
observed(ex_counts)
One minus the cosine similarity computed on presence/absence (binary) data.
ochiai(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Otsuka-Ochiai dissimilarity is defined as:
$$1 - \frac{J}{\sqrt{A B}}$$
Where:
$A$, $B$: Number of features in each sample.
$J$: Number of features in common (intersection).
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
1 - sum(x & y) / sqrt(sum(x>0) * sum(y>0))
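The connection to cosine similarity can be verified directly: once counts are reduced to presence/absence, the dot product of the two vectors is the number of shared features and each vector's norm is the square root of its feature count. A minimal check on made-up counts:

```r
# Made-up counts for two samples.
x <- c(4, 0, 7, 1, 0)
y <- c(1, 2, 7, 0, 0)

a <- as.numeric(x > 0)   # presence/absence of each feature in x
b <- as.numeric(y > 0)   # presence/absence of each feature in y

cosine <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))   # 2 / sqrt(3 * 3)
ochiai <- 1 - sum(x & y) / sqrt(sum(x > 0) * sum(y > 0))   # 1 - 2/3

all.equal(ochiai, 1 - cosine)   # TRUE
```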
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Ochiai, A. (1957). Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. Bulletin of the Japanese Society of Scientific Fisheries, 22, 526-530.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Presence/Absence metrics:
hamming(),
jaccard(),
sorensen()
ochiai(ex_counts)
A chi-squared based distance metric for comparing probability distributions.
psym_chisq(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
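The Probabilistic Symmetric $\chi^2$ distance is defined as:
$$2 \sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{P_i + Q_i}$$
Where:
$P_i$, $Q_i$: Proportional abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
keep <- (p + q) > 0  # features absent from both samples contribute 0, not NaN
2 * sum((p[keep] - q[keep])^2 / (p[keep] + q[keep]))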
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
psym_chisq(ex_counts)
Sub-sample observations from a feature table such that all samples have the same library size (depth). This is performed via random sampling without replacement.
rarefy(counts, depth = NULL, seed = 0, times = NULL, drop = TRUE, margin = 1L, cpus = n_cpus(), warn = interactive())
counts |
A numeric matrix or sparse matrix object (e.g., |
depth |
The number of observations to keep per sample. If |
seed |
An integer seed for the random number generator. Providing
the same seed guarantees reproducible results. Default: |
times |
The number of independent rarefactions to perform. If set,
returns a list of matrices. Seeds for subsequent iterations are
sequential ( |
drop |
Logical. If |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
warn |
Logical. If |
A rarefied matrix. The output class (matrix, dgCMatrix, etc.)
matches the input class.
If depth is NULL, the function defaults to the highest depth that retains
at least 10% of the total observations in the dataset.
If a sample has fewer observations than the specified depth:
drop = TRUE (Default): The sample is removed from the output matrix.
drop = FALSE: The sample is returned unmodified (with its original
counts). It is not rarefied or zeroed out.
Features (OTUs, ASVs, Genes) that lose all observations during rarefaction are always retained as columns/rows of zeros. This ensures the output matrix dimensions remain consistent with the input (barring dropped samples).
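The behavior described above can be sketched in base R. `rarefy_one()` is a hypothetical helper, not the package's algorithm: it expands counts into one entry per observation, which is simple but memory-hungry, and it uses R's own random number generator, so its output will not match rarefy() for any given seed.

```r
# Hypothetical helper: rarefy one sample's counts to `depth` by sampling
# individual observations without replacement.
rarefy_one <- function(x, depth) {
  pool <- rep(seq_along(x), times = x)          # one entry per observation
  kept <- pool[sample.int(length(pool), depth)] # sample without replacement
  tabulate(kept, nbins = length(x))             # zero-count features are kept
}

set.seed(0)
counts <- rbind(A = c(0, 0, 5, 2, 6),   # 13 observations
                B = c(0, 8, 5, 0, 5),   # 18 observations
                C = c(0, 9, 5, 0, 7))   # 21 observations
depth <- 18

# Mimic drop = TRUE: samples below the target depth are removed first.
deep_enough <- rowSums(counts) >= depth
rarefied <- t(apply(counts[deep_enough, , drop = FALSE], 1, rarefy_one,
                    depth = depth))
rowSums(rarefied)   # every remaining sample now has exactly 18 observations
```

Sample B already sits at the target depth, so sampling without replacement returns it unchanged, while sample C loses three observations and keeps all five feature columns.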
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
# A 4-sample x 5-OTU matrix with samples in rows.
counts <- matrix(c(0,0,0,0,0,8,9,10,5,5,5,5,2,0,0,0,6,5,7,0), 4, 5,
  dimnames = list(LETTERS[1:4], paste0('OTU', 1:5)))
counts
rowSums(counts)

# Rarefy all samples to a depth of 18.
# Sample 'A' (13 counts) and 'D' (15 counts) will be dropped.
r_mtx <- rarefy(counts, depth = 18)
r_mtx
rowSums(r_mtx)

# Keep under-sampled samples by setting `drop = FALSE`.
# Samples 'A' and 'D' are returned with their original counts.
r_mtx <- rarefy(counts, depth = 18, drop = FALSE)
r_mtx
rowSums(r_mtx)

# Perform 3 independent rarefactions.
r_list <- rarefy(counts, times = 3)
length(r_list)

# Sparse matrices are supported and their class is preserved.
if (requireNamespace('Matrix', quietly = TRUE)) {
  counts_dgC <- Matrix::Matrix(counts, sparse = TRUE)
  str(rarefy(counts_dgC))
}
A phylogenetic tree is required for computing UniFrac distance matrices. You can load a tree from a file or by providing the tree string directly. This tree must be in Newick format, also known as parenthetic format and New Hampshire format.
read_tree(newick, underscores = FALSE)
newick |
Input data as either a file path, URL, or Newick string. Compressed (gzip or bzip2) files are also supported. |
underscores |
If |
A phylo class object representing the tree.
tree <- read_tree("(A:0.99,((B:0.87,C:0.89):0.51,(((D:0.16,(E:0.83,F:0.96):0.94):0.69,(G:0.92,(H:0.62,I:0.85):0.54):0.23):0.74,J:0.12):0.43):0.67);")
class(tree)
Calculates the pairwise Robust Aitchison distance for compositional data. This method is designed for sparse datasets, such as microbiome OTU/ASV tables: distances are calculated from observed positive abundances only, which avoids the need for arbitrary pseudocounts.
robust_aitchison(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Robust Aitchison distance is defined as:
$$\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Where:
$x_i$, $y_i$: The rclr-transformed counts for the $i$-th feature.
For a given sample $X$, $x_i = \ln X_i - \mu_X$ if $X_i > 0$, and $x_i = 0$ otherwise.
$\mu_X$, $\mu_Y$: Mean log of strictly positive abundances.
$\mu_X = \frac{1}{n_X} \sum_{i \in S_X} \ln X_i$, where $S_X$ is the set of indices where $X_i > 0$.
$n_X$, $n_Y$: The number of strictly positive features in the respective samples.
$n$: The total number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
x <- ifelse(x > 0, log(x) - mean(log(x[x > 0])), 0)
y <- ifelse(y > 0, log(y) - mean(log(y[y > 0])), 0)
sqrt(sum((x-y)^2)) # Euclidean distance
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Martino, C., Morton, J. T., Marotz, C. A., Thompson, L. R., Tripathi, A., Knight, R., and Zengler, K. (2019). A novel sparse compositional technique reveals microbial perturbations. mSystems, 4(1), e00016-19. doi:10.1128/mSystems.00016-19
beta_div(), aitchison()
vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
robust_aitchison(ex_counts)
A commonly used diversity index accounting for both abundance and evenness.
shannon(counts, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
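The Shannon index (entropy) is defined as:
$$-\sum_{i=1}^{n} P_i \ln P_i$$
Where:
$n$: The number of features.
$P_i$: Proportional abundance of the $i$-th feature.
Base R Equivalent:
x <- ex_counts[1,]
p <- x[x > 0] / sum(x)  # drop zeros; 0 * log(0) would otherwise give NaN
-sum(p * log(p))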
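A few hand-checkable communities show how the index behaves. `shannon_sketch()` is a hypothetical helper written for this illustration; it treats features with zero counts as contributing nothing to the sum, which is the usual convention.

```r
# Hypothetical helper for illustration; zero counts are skipped because
# the term P * log(P) is taken to be 0 when P is 0.
shannon_sketch <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}

shannon_sketch(c(5, 5, 5, 5))      # perfectly even, 4 features: log(4)
shannon_sketch(c(5, 5, 5, 5, 0))   # an absent feature changes nothing
shannon_sketch(c(17, 1, 1, 1))     # same richness, less even: lower value
shannon_sketch(c(20, 0, 0, 0))     # a single feature: 0
```

For a fixed number of observed features the maximum is the log of that number, reached only when all features are equally abundant.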
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379-423.
Shannon, C. E., & Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press.
alpha_div(), vignette('adiv')
Other Diversity metrics:
brillouin(),
fisher(),
inv_simpson(),
simpson()
shannon(ex_counts)
The probability that two entities taken at random from the dataset represent different types.
simpson(counts, margin = 1L, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
cpus |
How many parallel processing threads should be used. The
default, |
The Gini-Simpson index is defined as:
$$1 - \sum_{i=1}^{n} P_i^2$$
Where:
$n$: The number of features.
$P_i$: Proportional abundance of the $i$-th feature.
Base R Equivalent:
x <- ex_counts[1,]
p <- x / sum(x)
1 - sum(p ^ 2)
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Simpson, E. H. (1949). Measurement of diversity. Nature, 163, 688. doi:10.1038/163688a0
alpha_div(), vignette('adiv')
Other Diversity metrics:
brillouin(),
fisher(),
inv_simpson(),
shannon()
simpson(ex_counts)
A distance metric related to the Bray-Curtis and Jaccard indices.
soergel(counts, margin = 1L, norm = "none", pseudocount = NULL, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
norm |
Normalize the incoming counts. Options are:
Default: |
pseudocount |
Value added to counts to handle zeros when
|
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Soergel distance is defined as:
$$\frac{\sum_{i=1}^{n} |X_i - Y_i|}{\sum_{i=1}^{n} \max(X_i, Y_i)}$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x - y)) / sum(pmax(x, y))
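Both relationships mentioned in the description can be checked on made-up vectors. Writing BC for the Bray-Curtis dissimilarity (assumed here to be the sum of absolute differences over the combined total), the ratio above equals 2 BC / (1 + BC); on presence/absence data it reduces to the Jaccard distance.

```r
# Made-up abundances for two samples.
x <- c(4, 0, 7, 1)
y <- c(1, 2, 7, 5)

soergel <- sum(abs(x - y)) / sum(pmax(x, y))   # 9 / 18
bray    <- sum(abs(x - y)) / sum(x, y)         # assumed definition: 9 / 27
all.equal(soergel, 2 * bray / (1 + bray))      # TRUE

# On presence/absence data the same ratio is the Jaccard distance.
a <- as.numeric(x > 0)
b <- as.numeric(y > 0)
soergel_binary <- sum(abs(a - b)) / sum(pmax(a, b))   # 1 / 4
jaccard        <- 1 - sum(a & b) / sum(a | b)         # 1 - 3/4
all.equal(soergel_binary, jaccard)                    # TRUE
```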
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
soergel(ex_counts)
A statistic used for comparing the similarity of two samples.
sorensen(counts, margin = 1L, pairs = NULL, cpus = n_cpus())
counts |
A numeric matrix of count data (samples |
margin |
The margin containing samples. |
pairs |
Which combinations of samples should distances be
calculated for? The default value ( |
cpus |
How many parallel processing threads should be used. The
default, |
The Dice-Sorensen dissimilarity is defined as:
$$1 - \frac{2J}{A + B}$$
Where:
$A$, $B$: Number of features in each sample.
$J$: Number of features in common (intersection).
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
1 - 2 * sum(x & y) / sum(x>0, y>0)
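A short worked example separates the Dice-Sorensen similarity coefficient from its complement, the dissimilarity. The counts are made up, and only presence or absence of each feature matters.

```r
# Made-up counts for two samples.
x <- c(4, 0, 7, 1, 0)
y <- c(1, 2, 7, 0, 0)

A <- sum(x > 0)   # 3 features observed in x
B <- sum(y > 0)   # 3 features observed in y
J <- sum(x & y)   # 2 features observed in both

similarity    <- 2 * J / (A + B)   # 4 / 6
dissimilarity <- 1 - similarity    # 2 / 6

# Identical feature sets give a dissimilarity of 0.
1 - 2 * sum(x & x) / sum(x > 0, x > 0)
```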
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Sørensen, T. (1948). A method of establishing groups of equal amplitude in plant sociology based on similarity of species content. Kongelige Danske Videnskabernes Selskab, Biologiske Skrifter, 5, 1-34.
Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3), 297–302. doi:10.2307/1932409
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Presence/Absence metrics:
hamming(),
jaccard(),
ochiai()
sorensen(ex_counts)
The squared version of the Chi-Squared distance.
squared_chisq(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

| Argument | Description |
|---|---|
| counts | A numeric matrix of count data (samples … |
| margin | The margin containing samples. |
| pairs | Which combinations of samples should distances be calculated for? The default value ( … |
| cpus | How many parallel processing threads should be used. The default, … |
The Squared $\chi^2$ distance is defined as:
$$\sum_{i=1}^{n} \frac{(P_i - Q_i)^2}{P_i + Q_i}$$
Where:
$P_i$, $Q_i$: Proportional abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sum((p - q)^2 / (p + q))
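One practical caveat with the base R expression: a feature that is absent from both samples produces `0 / 0`, so the sum becomes `NaN`. The sketch below uses made-up counts and assumes that jointly absent features should simply contribute nothing; how the package itself treats them is not stated here.

```r
# Hypothetical counts; feature 2 is absent from both samples.
x <- c(6, 0, 2, 0)
y <- c(1, 0, 0, 3)
p <- x / sum(x)
q <- y / sum(y)

naive <- sum((p - q)^2 / (p + q))   # NaN, because feature 2 gives 0 / 0

# Assumption: drop features where both proportions are zero.
keep <- (p + q) > 0
d <- sum((p[keep] - q[keep])^2 / (p[keep] + q[keep]))
```

With these inputs the three retained terms are 0.25, 0.25, and 0.75.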
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chord(),
squared_euclidean(),
topsoe(),
wave_hedges()
squared_chisq(ex_counts)
The squared version of the Chord distance.
squared_chord(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

| Argument | Description |
|---|---|
| counts | A numeric matrix of count data (samples … |
| margin | The margin containing samples. |
| pairs | Which combinations of samples should distances be calculated for? The default value ( … |
| cpus | How many parallel processing threads should be used. The default, … |
The Squared Chord distance is defined as:
$$\sum_{i=1}^{n} \left(\sqrt{P_i} - \sqrt{Q_i}\right)^2$$
Where:
$P_i$, $Q_i$: Proportional abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sum((sqrt(p) - sqrt(q)) ^ 2)
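Because the metric is a squared Euclidean distance on square-root-transformed proportions, all sample pairs can be cross-checked in base R with `stats::dist()`. The matrix below is made up for illustration, and this is only a verification sketch, not the package's implementation.

```r
# Hypothetical 3-sample x 3-feature count matrix (rows are samples).
counts <- matrix(
  c(4, 0, 6,
    1, 1, 8,
    0, 5, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("f1", "f2", "f3"))
)

props <- counts / rowSums(counts)    # convert each row to proportions
all_pairs <- dist(sqrt(props))^2     # squared Euclidean on sqrt(props)

# Check one pair against the element-wise definition.
p <- props["s1", ]
q <- props["s2", ]
by_hand <- sum((sqrt(p) - sqrt(q))^2)
```

Here `as.matrix(all_pairs)["s1", "s2"]` and `by_hand` agree up to floating-point tolerance.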
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Legendre, P., & Legendre, L. (2012). Numerical ecology. Elsevier.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_euclidean(),
topsoe(),
wave_hedges()
squared_chord(ex_counts)
The squared Euclidean distance between two vectors.
squared_euclidean(counts, margin = 1L, norm = "none", pseudocount = NULL, pairs = NULL, cpus = n_cpus())

| Argument | Description |
|---|---|
| counts | A numeric matrix of count data (samples … |
| margin | The margin containing samples. |
| norm | Normalize the incoming counts. Options are: … Default: … |
| pseudocount | Value added to counts to handle zeros when … |
| pairs | Which combinations of samples should distances be calculated for? The default value ( … |
| cpus | How many parallel processing threads should be used. The default, … |
The Squared Euclidean distance is defined as:
$$\sum_{i=1}^{n} (X_i - Y_i)^2$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum((x-y)^2)
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
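The pseudocount rule described above can be sketched in base R as follows. `clr_sketch()` is a hypothetical helper written for illustration, not a package function; it assumes the pseudocount is added to every value in a single sample's vector before the log transform.

```r
# Hypothetical helper illustrating the documented pseudocount behavior.
clr_sketch <- function(x, pseudocount = NULL) {
  if (is.null(pseudocount) && any(x == 0)) {
    pseudocount <- min(x[x > 0]) / 2   # half the minimum non-zero value
    warning("zeros detected; using pseudocount = ", pseudocount)
  }
  # Assumption: an explicit pseudocount is always added when supplied.
  if (!is.null(pseudocount)) x <- x + pseudocount
  log(x) - mean(log(x))                # centered log-ratio
}

x <- c(8, 0, 2, 4)                          # made-up counts with one zero
auto   <- suppressWarnings(clr_sketch(x))   # falls back to 2 / 2 = 1
manual <- clr_sketch(x, pseudocount = 1)    # same value, no warning
```

Supplying the value explicitly gives the same transform as the fallback in this case, but records the choice in the calling code rather than leaving it implicit.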
Legendre, P., & Legendre, L. (2012). Numerical ecology. Elsevier.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
topsoe(),
wave_hedges()
squared_euclidean(ex_counts)
A richness estimator based on the number of singletons and the sum of squared feature counts (the "squares").
squares(counts, margin = 1L, cpus = n_cpus())

| Argument | Description |
|---|---|
| counts | A numeric matrix of count data (samples … |
| margin | The margin containing samples. |
| cpus | How many parallel processing threads should be used. The default, … |
The Squares estimator is defined as:
$$S + \frac{F_1^2 \sum_{i=1}^{S} x_i^2}{N^2 - F_1 S}$$
Where:
$S$: The number of observed features.
$N$: Total of all counts.
$F_1$: Number of features observed once (singletons).
$x_i$: Integer count of the $i$-th feature.
Base R Equivalent:
x <- ex_counts[1,]
N <- sum(x)      # sampling depth
S <- sum(x > 0)  # observed features
F1 <- sum(x == 1) # singletons
S + ((sum(x^2) * (F1^2)) / ((N^2) - F1 * S))
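To make the arithmetic concrete, the sketch below wraps the same expression in a hypothetical helper, `squares_sketch()`, and applies it to made-up count vectors. It is a hand check of the formula, not the package's code.

```r
# Hypothetical helper mirroring the base R expression above.
squares_sketch <- function(x) {
  N  <- sum(x)        # sampling depth
  S  <- sum(x > 0)    # observed features
  F1 <- sum(x == 1)   # singletons
  S + (sum(x^2) * F1^2) / (N^2 - F1 * S)
}

x <- c(1, 1, 2, 5, 0, 11)   # N = 20, S = 5, F1 = 2, sum(x^2) = 152
est <- squares_sketch(x)    # 5 + (152 * 4) / (400 - 10)

# With no singletons the correction term vanishes.
no_singletons <- squares_sketch(c(2, 3, 0, 5))
```

The second call shows that the estimate collapses to the observed richness when no feature is seen exactly once.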
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Alroy, J. (2018). Limits to species richness estimates based on subsampling. Paleobiology, 44(2), 177-194. doi:10.1017/pab.2017.38
alpha_div(), vignette('adiv')
Other Richness metrics:
ace(),
chao1(),
margalef(),
menhinick(),
observed()
squares(ex_counts)
A symmetric divergence measure based on the Jensen-Shannon divergence.
topsoe(counts, margin = 1L, pairs = NULL, cpus = n_cpus())

| Argument | Description |
|---|---|
| counts | A numeric matrix of count data (samples … |
| margin | The margin containing samples. |
| pairs | Which combinations of samples should distances be calculated for? The default value ( … |
| cpus | How many parallel processing threads should be used. The default, … |
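The Topsoe distance is defined as:
$$\sum_{i=1}^{n} \left[ P_i \ln\left(\frac{2 P_i}{P_i + Q_i}\right) + Q_i \ln\left(\frac{2 Q_i}{P_i + Q_i}\right) \right]$$
Where:
$P_i$, $Q_i$: Proportional abundances of the $i$-th feature.
$n$: The number of features.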
Base R Equivalent:
x <- ex_counts[1,]; p <- x / sum(x)
y <- ex_counts[2,]; q <- y / sum(y)
sum(p * log(2 * p / (p+q)), q * log(2 * q / (p+q)))
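In base R, any feature with a zero proportion makes the expression above evaluate to `NaN`, since `0 * log(0)` is undefined. The sketch below uses made-up counts and adopts the common convention that such terms equal zero; whether the package applies the same convention is an assumption.

```r
# Hypothetical counts; each sample is missing one feature.
x <- c(6, 0, 2)
y <- c(1, 3, 0)
p <- x / sum(x)
q <- y / sum(y)

naive <- sum(p * log(2 * p / (p + q)), q * log(2 * q / (p + q)))   # NaN

# Convention assumed here: a term with zero proportion contributes 0.
term <- function(a, m) ifelse(a > 0, a * log(a / m), 0)

m <- (p + q) / 2                              # midpoint distribution
topsoe_val <- sum(term(p, m) + term(q, m))    # KL(p || m) + KL(q || m)
```

Written this way, the value is the sum of two Kullback-Leibler divergences from the midpoint distribution, i.e. twice the Jensen-Shannon divergence computed with natural logarithms.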
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Topsoe, F. (2000). Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46(4), 1602-1609. doi:10.1109/18.850703
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
wave_hedges()
topsoe(ex_counts)
A phylogenetic distance metric that accounts for the presence/absence of lineages.
unweighted_unifrac(counts, tree = NULL, margin = 1L, pairs = NULL, cpus = n_cpus())

| Argument | Description |
|---|---|
| counts | A numeric matrix of count data (samples … |
| tree | A … |
| margin | The margin containing samples. |
| pairs | Which combinations of samples should distances be calculated for? The default value ( … |
| cpus | How many parallel processing threads should be used. The default, … |
The Unweighted UniFrac distance is defined as:
$$\frac{\sum_{i=1}^{n} b_i |A_i - B_i|}{\sum_{i=1}^{n} b_i \max(A_i, B_i)}$$
Where:
$n$: The number of branches in the tree.
$b_i$: The length of the $i$-th branch.
$A_i$, $B_i$: Binary values (0 or 1) indicating if descendants of branch $i$ are present in sample A or B.
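No base R equivalent is listed for this metric, so here is a minimal hand-worked sketch on a toy tree. It assumes the usual formulation of unweighted UniFrac (branch length found in only one sample, divided by branch length observed in either sample). The tree, branch names, and counts are all made up, and the tree is encoded as a plain branch-by-leaf matrix rather than a `phylo` object.

```r
# Toy rooted tree ((t1,t2),(t3,t4)); all names and lengths are made up.
branch_len <- c(t1 = 1, t2 = 1, t3 = 2, t4 = 2, n12 = 0.5, n34 = 0.5)

# desc[i, j] is 1 when leaf j descends from branch i.
desc <- rbind(
  t1  = c(1, 0, 0, 0),
  t2  = c(0, 1, 0, 0),
  t3  = c(0, 0, 1, 0),
  t4  = c(0, 0, 0, 1),
  n12 = c(1, 1, 0, 0),
  n34 = c(0, 0, 1, 1)
)

counts_a <- c(5, 3, 0, 0)   # sample A: t1 and t2 present
counts_b <- c(0, 2, 6, 0)   # sample B: t2 and t3 present

# A branch is "present" when any of its descendant leaves has a count.
a <- as.numeric(desc %*% as.numeric(counts_a > 0) > 0)
b <- as.numeric(desc %*% as.numeric(counts_b > 0) > 0)

unshared <- sum(branch_len * abs(a - b))    # 1 + 2 + 0.5 = 3.5
observed <- sum(branch_len * pmax(a, b))    # 1 + 1 + 2 + 0.5 + 0.5 = 5
unifrac  <- unshared / observed
```

Branch `t4` has no descendants in either sample, so it contributes to neither sum.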
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Lozupone, C., & Knight, R. (2005). UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology, 71(12), 8228-8235. doi:10.1128/AEM.71.12.8228-8235.2005
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Phylogenetic metrics:
faith(),
generalized_unifrac(),
normalized_unifrac(),
variance_adjusted_unifrac(),
weighted_unifrac()
unweighted_unifrac(ex_counts, tree = ex_tree)
A weighted UniFrac that adjusts for the expected variance of the metric.
variance_adjusted_unifrac(counts, tree = NULL, margin = 1L, pairs = NULL, cpus = n_cpus())

| Argument | Description |
|---|---|
| counts | A numeric matrix of count data (samples … |
| tree | A … |
| margin | The margin containing samples. |
| pairs | Which combinations of samples should distances be calculated for? The default value ( … |
| cpus | How many parallel processing threads should be used. The default, … |
The Variance-Adjusted Weighted UniFrac distance is defined as:
Where:
: The number of branches in the tree.
: The length of the -th branch.
, : The proportion of the community descending from branch in sample P and Q.
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Chang, Q., Luan, Y., & Sun, F. (2011). Variance adjusted weighted UniFrac: a powerful beta diversity measure for comparing communities based on phylogeny. BMC Bioinformatics, 12, 118. doi:10.1186/1471-2105-12-118
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Phylogenetic metrics:
faith(),
generalized_unifrac(),
normalized_unifrac(),
unweighted_unifrac(),
weighted_unifrac()
variance_adjusted_unifrac(ex_counts, tree = ex_tree)
A distance metric derived from the Hedges' distance.
wave_hedges(counts, margin = 1L, norm = "none", pseudocount = NULL, pairs = NULL, cpus = n_cpus())

| Argument | Description |
|---|---|
| counts | A numeric matrix of count data (samples … |
| margin | The margin containing samples. |
| norm | Normalize the incoming counts. Options are: … Default: … |
| pseudocount | Value added to counts to handle zeros when … |
| pairs | Which combinations of samples should distances be calculated for? The default value ( … |
| cpus | How many parallel processing threads should be used. The default, … |
The Wave Hedges distance is defined as:
$$\sum_{i=1}^{n} \frac{|X_i - Y_i|}{\max(X_i, Y_i)}$$
Where:
$X_i$, $Y_i$: Absolute abundances of the $i$-th feature.
$n$: The number of features.
Base R Equivalent:
x <- ex_counts[1,]
y <- ex_counts[2,]
sum(abs(x - y) / pmax(x, y))
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
The pseudocount parameter is only relevant when norm = 'clr'.
Zeros are undefined in the centered log-ratio (CLR) transformation. If
norm = 'clr', pseudocount is NULL (the default), and
zeros are detected, the function uses half the minimum non-zero value
(min(x[x>0]) / 2) and issues a warning.
To suppress the warning, provide an explicit value (e.g., 1).
Why this matters: The choice of pseudocount is not neutral; it acts as a weighting factor that can significantly distort downstream results, especially for sparse datasets. See Gloor et al. (2017) and Kaul et al. (2017) for open-access discussions on the mathematical implications, or Costea et al. (2014) for the impact on community clustering.
See aitchison for references.
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Abundance metrics:
aitchison(),
bhattacharyya(),
bray(),
canberra(),
chebyshev(),
chord(),
clark(),
divergence(),
euclidean(),
gower(),
hellinger(),
horn(),
jensen(),
jsd(),
lorentzian(),
manhattan(),
matusita(),
minkowski(),
morisita(),
motyka(),
psym_chisq(),
robust_aitchison(),
soergel(),
squared_chisq(),
squared_chord(),
squared_euclidean(),
topsoe()
wave_hedges(ex_counts)
A phylogenetic distance metric that accounts for the relative abundance of lineages.
weighted_unifrac(counts, tree = NULL, margin = 1L, pairs = NULL, cpus = n_cpus())

| Argument | Description |
|---|---|
| counts | A numeric matrix of count data (samples … |
| tree | A … |
| margin | The margin containing samples. |
| pairs | Which combinations of samples should distances be calculated for? The default value ( … |
| cpus | How many parallel processing threads should be used. The default, … |
The Weighted UniFrac distance is defined as:
$$\sum_{i=1}^{n} b_i |P_i - Q_i|$$
Where:
$n$: The number of branches in the tree.
$b_i$: The length of the $i$-th branch.
$P_i$, $Q_i$: The proportion of the community descending from branch $i$ in samples P and Q.
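A hand-worked sketch on a toy tree may help here. It assumes the un-normalized form, in which each branch length is weighted by the absolute difference between the two samples in the proportion of the community descending from that branch. The tree, branch names, and counts are made up, and the tree is encoded as a plain branch-by-leaf matrix rather than a `phylo` object.

```r
# Toy rooted tree ((t1,t2),(t3,t4)); all names and lengths are made up.
branch_len <- c(t1 = 1, t2 = 1, t3 = 2, t4 = 2, n12 = 0.5, n34 = 0.5)

# desc[i, j] is 1 when leaf j descends from branch i.
desc <- rbind(
  t1  = c(1, 0, 0, 0),
  t2  = c(0, 1, 0, 0),
  t3  = c(0, 0, 1, 0),
  t4  = c(0, 0, 0, 1),
  n12 = c(1, 1, 0, 0),
  n34 = c(0, 0, 1, 1)
)

counts_p <- c(5, 3, 0, 0)   # sample P
counts_q <- c(0, 2, 6, 0)   # sample Q

# Proportion of each community descending from every branch.
p <- as.numeric(desc %*% (counts_p / sum(counts_p)))
q <- as.numeric(desc %*% (counts_q / sum(counts_q)))

w_unifrac <- sum(branch_len * abs(p - q))
```

For these inputs the per-branch differences are 0.625, 0.125, 0.75, 0, 0.75, and 0.75, which gives 3 after weighting by branch length.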
The counts parameter is designed to accept a simple numeric matrix, but
seamlessly supports objects from the following biological data packages:
phyloseq
rbiom
SummarizedExperiment
TreeSummarizedExperiment
For large datasets, standard matrix operations may be slow. See
vignette('performance') for details on using optimized formats
(e.g. sparse matrices) and parallel processing.
Lozupone, C. A., Hamady, M., Kelley, S. T., & Knight, R. (2007). Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology, 73(5), 1576-1585. doi:10.1128/AEM.01996-06
beta_div(), vignette('bdiv'), vignette('bdiv_guide')
Other Phylogenetic metrics:
faith(),
generalized_unifrac(),
normalized_unifrac(),
unweighted_unifrac(),
variance_adjusted_unifrac()
weighted_unifrac(ex_counts, tree = ex_tree)