Title: | Computes Proximity in Large Sparse Matrices |
---|---|
Description: | Computes proximity between rows or columns of large matrices efficiently in C++. Functions are optimised for large sparse matrices using the Armadillo and Intel TBB libraries. Among various built-in similarity/distance measures, computation of correlation, cosine similarity and Euclidean distance is particularly fast. |
Authors: | Kohei Watanabe [cre, aut, cph] , Robrecht Cannoodt [aut] |
Maintainer: | Kohei Watanabe <[email protected]> |
License: | GPL-3 |
Version: | 0.4.2 |
Built: | 2024-10-24 05:18:19 UTC |
Source: | https://github.com/koheiw/proxyc |
Produces the same result as apply(x, 1, sd) or apply(x, 2, sd) without coercing the matrix to a dense matrix. Values are not exactly identical to those from sd because of floating-point precision differences in C++.
colSds(x)
rowSds(x)
x |
mt <- Matrix::rsparsematrix(100, 100, 0.01)
colSds(mt)
apply(mt, 2, sd) # the same
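Because the values may differ from sd in the last few digits, a comparison is more reliable with all.equal() than with identical(). The following is a minimal sketch under that assumption; the seed is arbitrary, and the dense copy passed to apply() exists only to provide a reference value.

```r
library(Matrix)

set.seed(42)  # arbitrary seed, only for reproducibility
mt <- rsparsematrix(100, 100, 0.01)

sd_sparse <- proxyC::colSds(mt)          # computed on the sparse matrix
sd_dense <- apply(as.matrix(mt), 2, sd)  # dense reference, for comparison only

# Compare with a numerical tolerance rather than bit-for-bit equality
same_sd <- isTRUE(all.equal(as.numeric(sd_sparse), as.numeric(sd_dense)))
print(same_sd)
```

On large matrices the dense reference is exactly what colSds() is meant to avoid, so a check like this is best kept to small test inputs.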
Produces the same result as applying sum(x == 0) to each row or column.
colZeros(x)
rowZeros(x)
x |
mt <- Matrix::rsparsematrix(100, 100, 0.01)
colZeros(mt)
apply(mt, 2, function(x) sum(x == 0)) # the same
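As a cross-check that also stays sparse, the zero count per column can be derived as the number of rows minus the number of non-zero cells. This is a sketch of an equivalent calculation for illustration, not a description of how colZeros() is implemented.

```r
library(Matrix)

set.seed(42)  # arbitrary seed, only for reproducibility
mt <- rsparsematrix(100, 100, 0.01)

zeros <- proxyC::colZeros(mt)

# Alternative count: total rows minus non-zero cells in each column
zeros_alt <- nrow(mt) - Matrix::colSums(mt != 0)

same_zeros <- isTRUE(all.equal(as.numeric(zeros), as.numeric(zeros_alt)))
print(same_zeros)
```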
Fast similarity/distance computation function for large sparse matrices. You can floor small similarity values by an arbitrary threshold (min_simil) or rank (rank) to save computation time and storage space. You can specify the number of threads for parallel computing via options(proxyC.threads).
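The effect of min_simil and rank can be seen by counting the non-zero cells in the result. The sketch below assumes a matrix denser than the package examples (so that rows overlap); the threshold 0.5 and rank 3 are arbitrary illustration values.

```r
library(Matrix)

set.seed(42)  # arbitrary seed, only for reproducibility
mt <- rsparsematrix(50, 20, 0.3)

sim_all <- proxyC::simil(mt, method = "cosine")                   # no flooring
sim_min <- proxyC::simil(mt, method = "cosine", min_simil = 0.5)  # threshold
sim_top <- proxyC::simil(mt, method = "cosine", rank = 3)         # top-n only

# Count the non-zero cells that remain under each setting
n_all <- nnzero(sim_all)
n_min <- nnzero(sim_min)
n_top <- nnzero(sim_top)
print(c(all = n_all, min_simil = n_min, rank = n_top))
```

Both options keep a subset of the full result, so the counts for min_simil and rank should not exceed the count without flooring.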
simil(
  x,
  y = NULL,
  margin = 1,
  method = c("cosine", "correlation", "jaccard", "ejaccard", "fjaccard", "dice",
    "edice", "hamann", "faith", "simple matching"),
  min_simil = NULL,
  rank = NULL,
  drop0 = FALSE,
  diag = FALSE,
  use_nan = NULL,
  sparse = TRUE,
  digits = 14
)

dist(
  x,
  y = NULL,
  margin = 1,
  method = c("euclidean", "chisquared", "kullback", "jeffreys", "jensen",
    "manhattan", "maximum", "canberra", "minkowski", "hamming"),
  p = 2,
  smooth = 0,
  drop0 = FALSE,
  diag = FALSE,
  use_nan = NULL,
  sparse = TRUE,
  digits = 14
)
x |
matrix or Matrix object. Dense matrices are converted to the CsparseMatrix-class internally. |
y |
if a matrix or Matrix object is provided, proximity
between documents or features in |
margin |
integer indicating the margin of similarity/distance computation: 1 indicates rows and 2 indicates columns. |
method |
method to compute similarity or distance |
min_simil |
the minimum similarity value to be recorded. |
rank |
an integer value specifying the top-n highest similarity values to be recorded. |
drop0 |
if |
diag |
if |
use_nan |
if |
sparse |
if |
digits |
determines rounding of small values towards zero. Use primarily to correct floating point errors. Rounding is performed in C++ in a similar way as zapsmall. |
p |
weight for Minkowski distance |
smooth |
adds a fixed value to all the cells to avoid division by zero.
Only used when |
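To illustrate how x, y and margin interact, the sketch below compares the columns of two matrices that share the same number of rows, and then the rows of a single matrix. The matrix sizes are arbitrary illustration values.

```r
library(Matrix)

set.seed(42)  # arbitrary seed, only for reproducibility
x <- rsparsematrix(30, 8, 0.3)
y <- rsparsematrix(30, 5, 0.3)  # same number of rows as x

# margin = 2: columns of x against columns of y
sim_cols <- proxyC::simil(x, y, margin = 2, method = "cosine")
print(dim(sim_cols))

# margin = 1 with y = NULL: rows of x against each other
sim_rows <- proxyC::simil(x, margin = 1, method = "cosine")
print(dim(sim_rows))
```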
Similarity:
cosine: cosine similarity
correlation: Pearson's correlation
jaccard: Jaccard coefficient
ejaccard: the real value version of jaccard
fjaccard: Fuzzy Jaccard coefficient
dice: Dice coefficient
edice: the real value version of dice
hamann: Hamann similarity
faith: Faith similarity
simple matching: the percentage of common elements
Distance:
euclidean: Euclidean distance
chisquared: chi-squared distance
kullback: Kullback–Leibler divergence
jeffreys: Jeffreys divergence
jensen: Jensen–Shannon divergence
manhattan: Manhattan distance
maximum: the largest difference between values
canberra: Canberra distance
minkowski: Minkowski distance
hamming: Hamming distance
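For two of the measures above, the results can be checked against textbook definitions on a small matrix. This is a sketch using the standard formulas for cosine similarity and Euclidean distance; the matrix m is a made-up example with no all-zero rows, so the manual cosine has no division by zero. Note that dist is called with the proxyC:: prefix to distinguish it from stats::dist.

```r
library(Matrix)

# Small made-up matrix; every row has at least one non-zero value
m <- matrix(c(1, 0, 2, 3,
              0, 4, 1, 0,
              2, 2, 0, 1), nrow = 3, byrow = TRUE)
sm <- Matrix(m, sparse = TRUE)

# Cosine similarity by hand: dot products divided by products of row norms
norms <- sqrt(base::rowSums(m^2))
cos_manual <- (m %*% t(m)) / (norms %o% norms)
cos_proxyc <- as.matrix(proxyC::simil(sm, method = "cosine"))

# Euclidean distance: stats::dist as the dense reference
euc_manual <- as.matrix(stats::dist(m, method = "euclidean"))
euc_proxyc <- as.matrix(proxyC::dist(sm, method = "euclidean"))

cos_ok <- isTRUE(all.equal(unname(cos_proxyc), unname(cos_manual)))
euc_ok <- isTRUE(all.equal(unname(euc_proxyc), unname(euc_manual)))
print(c(cosine = cos_ok, euclidean = euc_ok))
```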
See the vignette for how the similarity and distance are computed:
vignette("measures", package = "proxyC")
It performs parallel computing using Intel oneAPI Threading Building Blocks. The number of threads for parallel computing should be specified via options(proxyC.threads) before calling the functions. If the value is -1, all the available threads will be used. Unless the option is set, the number of threads will be limited by the environment variables (OMP_THREAD_LIMIT or RCPP_PARALLEL_NUM_THREADS) to comply with CRAN policy and offer backward compatibility.
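A minimal sketch of setting the option before a call is shown below. The thread counts 1 and 2 are arbitrary illustration values; the previous option value is saved and restored so the setting does not leak into later code.

```r
library(Matrix)

set.seed(42)  # arbitrary seed, only for reproducibility
mt <- rsparsematrix(200, 50, 0.1)

# Set the option before calling the function; keep the old value to restore it
old <- options(proxyC.threads = 1)
sim_one <- proxyC::simil(mt, method = "cosine")

options(proxyC.threads = 2)
sim_two <- proxyC::simil(mt, method = "cosine")

options(old)  # restore the previous setting

# The thread count should affect speed only, not the values
same_result <- isTRUE(all.equal(as.matrix(sim_one), as.matrix(sim_two)))
print(same_result)
```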
zapsmall
mt <- Matrix::rsparsematrix(100, 100, 0.01)
simil(mt, method = "cosine")[1:5, 1:5]

mt <- Matrix::rsparsematrix(100, 100, 0.01)
dist(mt, method = "euclidean")[1:5, 1:5]