Similarity and Distance Measures in proxyC

This vignette explains how proxyC computes the similarity and distance measures.

Notation

$$ \begin{gather} \vec{x} = [x_1, x_2, \dots, x_n] \\ \vec{y} = [y_1, y_2, \dots, y_n] \end{gather} $$

The length of a vector is written n = ||x⃗||, while |x⃗| denotes the absolute values of its elements.

Operations on vectors are element-wise:

$$ \begin{gather} \vec{z} = \vec{x}\vec{y} \\ n = ||\vec{x}|| = ||\vec{y}|| =||\vec{z}|| \end{gather} $$

Summation of the elements of vectors is written using sigma without specifying the range:

$$ \sum{\vec{x}} = \sum_{i=1}^{n}{x_i} $$

When the elements of a vector are compared with a value inside a pair of square brackets, the summation counts the number of elements that are equal (or unequal) to that value:

$$ \sum{[\vec{x} = 1]} = \sum_{i=1}^{n}{[x_i = 1]} $$
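
As an illustration of the notation only, the conventions above map onto base R as follows. The vectors `x` and `y` are made-up examples, and the sketch does not call proxyC.

```r
# Hypothetical example vectors of equal length
x <- c(1, 0, 2, 1)
y <- c(1, 1, 0, 1)

n <- length(x)       # ||x||, the number of elements: 4
z <- x * y           # element-wise product: 1 0 0 1
total <- sum(x)      # sum over all elements of x: 4
ones <- sum(x == 1)  # sum of [x = 1], the count of elements equal to 1: 2
```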

Similarity Measures

Similarity measures are available in proxyC::simil().

Cosine similarity (“cosine”)

$$ simil = \frac{\sum{\vec{x}\vec{y}}}{\sqrt{\sum{\vec{x} ^ 2}} \sqrt{\sum{\vec{y} ^ 2}}} $$
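
The formula can be sketched directly in base R. The helper name `cosine_simil` is hypothetical and is not part of proxyC; it only transcribes the equation above for dense numeric vectors.

```r
# Cosine similarity written term by term from the formula (illustrative helper)
cosine_simil <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

x <- c(1, 2, 0, 3)
y <- c(2, 4, 0, 6)             # y is 2 * x, so the vectors are parallel

parallel <- cosine_simil(x, y)                # 1, up to floating-point error
orthogonal <- cosine_simil(c(1, 0), c(0, 1))  # 0
```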

Pearson correlation coefficient (“correlation”)

$$ simil = \frac{Cov(\vec{x},\vec{y})}{\sqrt{Var(\vec{x}) Var(\vec{y})}} $$

Jaccard similarity (“jaccard” and “ejaccard”)

The values of x and y are Boolean for “jaccard”.

$$ \begin{gather} e = \sum{\vec{x} \vec{y}} \\ w = \text{user-provided weight} \\ simil = \frac{e}{\sum{\vec{x} ^ w} + \sum{\vec{y} ^ w} - e} \end{gather} $$
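
A minimal base R sketch of this formula follows. The function name `jaccard_simil` and the default `w = 1` are assumptions made for illustration; they are not taken from proxyC.

```r
# Jaccard similarity from the formula; w is the user-provided weight
# (the default of 1 is an assumption for this sketch)
jaccard_simil <- function(x, y, w = 1) {
  e <- sum(x * y)
  e / (sum(x^w) + sum(y^w) - e)
}

# Boolean vectors: 2 shared elements, 4 elements in the union
x <- c(1, 1, 0, 1, 0)
y <- c(1, 0, 0, 1, 1)
boolean_case <- jaccard_simil(x, y)                        # 2 / (3 + 3 - 2) = 0.5

# Count vectors with w = 2
weighted_case <- jaccard_simil(c(2, 0, 1), c(1, 1, 1), w = 2)  # 3 / (5 + 3 - 3) = 0.6
```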

Fuzzy Jaccard similarity (“fjaccard”)

The values must satisfy 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1.

$$ simil = \frac{\sum{min(\vec{x}, \vec{y})}}{\sum{max(\vec{x}, \vec{y})}} $$
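
The element-wise minimum and maximum correspond to `pmin()` and `pmax()` in base R. The helper `fjaccard_simil` below is a hypothetical name used only to illustrate the formula; the range check mirrors the constraint stated above.

```r
# Fuzzy Jaccard similarity from the formula (illustrative helper)
fjaccard_simil <- function(x, y) {
  # The formula assumes membership values in [0, 1]
  stopifnot(all(x >= 0 & x <= 1), all(y >= 0 & y <= 1))
  sum(pmin(x, y)) / sum(pmax(x, y))
}

x <- c(0.2, 0.8, 0.0, 1.0)
y <- c(0.4, 0.6, 0.0, 1.0)
fuzzy <- fjaccard_simil(x, y)   # (0.2 + 0.6 + 0 + 1) / (0.4 + 0.8 + 0 + 1) = 1.8 / 2.2
```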

Dice similarity (“dice” and “edice”)

The values of x and y are Boolean for “dice”.

$$ \begin{gather} e = \sum{\vec{x} \vec{y}} \\ w = \text{user-provided weight} \\ simil = \frac{2 e}{\sum{\vec{x} ^ w} + \sum{\vec{y} ^ w}} \end{gather} $$
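
The Dice and Jaccard formulas share the same ingredients, so for the same w one can be obtained from the other as D = 2J / (1 + J). The sketch below checks this numerically; both helper names are hypothetical and independent of proxyC.

```r
# Dice and Jaccard similarity written from their formulas (illustrative helpers)
dice_simil <- function(x, y, w = 1) {
  e <- sum(x * y)
  2 * e / (sum(x^w) + sum(y^w))
}
jaccard_simil <- function(x, y, w = 1) {
  e <- sum(x * y)
  e / (sum(x^w) + sum(y^w) - e)
}

x <- c(1, 1, 0, 1, 0)
y <- c(1, 0, 0, 1, 1)

d <- dice_simil(x, y)          # 2 * 2 / (3 + 3) = 2/3
j <- jaccard_simil(x, y)       # 0.5
from_jaccard <- 2 * j / (1 + j)  # also 2/3
```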

Hamann similarity (“hamann”)

$$ \begin{gather} e = \sum{[\vec{x} = \vec{y}]} \\ n = ||\vec{x}|| = ||\vec{y}|| \\ u = n - e \\ simil = \frac{e - u}{e + u} \end{gather} $$

Faith similarity (“faith”)

$$ \begin{gather} t = \sum{[\vec{x} = 1][\vec{y} = 1]} \\ f = \sum{[\vec{x} = 0][\vec{y} = 0]} \\ n = ||\vec{x}|| = ||\vec{y}|| \\ simil = \frac{t + 0.5 f}{n} \end{gather} $$
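
The bracket notation translates into logical comparisons in base R. The helper `faith_simil` is a hypothetical name, written only to show how the counts t and f enter the formula for Boolean vectors.

```r
# Faith similarity from the formula (illustrative helper)
faith_simil <- function(x, y) {
  t <- sum(x == 1 & y == 1)   # positions where both are 1
  f <- sum(x == 0 & y == 0)   # positions where both are 0
  n <- length(x)
  (t + 0.5 * f) / n
}

x <- c(1, 1, 0, 0, 1)
y <- c(1, 0, 0, 1, 1)
faith <- faith_simil(x, y)    # t = 2, f = 1, n = 5: (2 + 0.5) / 5 = 0.5
```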

Simple matching (“matching”)

$$ simil = \sum{[\vec{x} = \vec{y}]} $$

Distance Measures

Distance measures are available in proxyC::dist(). Smoothing of the vectors can be performed when method is “chisquared”, “kullback”, “jeffreys” or “jensen”: the value of smooth is added to each element of x⃗ and y⃗.

Manhattan distance (“manhattan”)

$$ dist = \sum{|\vec{x} - \vec{y}|} $$

Canberra distance (“canberra”)

$$ dist = \sum{\frac{|\vec{x} - \vec{y}|}{|\vec{x}| + |\vec{y}|}} $$

Euclidean distance (“euclidean”)

$$ dist = \sqrt{\sum{(\vec{x} - \vec{y}) ^ 2}} $$

Minkowski distance (“minkowski”)

$$ \begin{gather} p = \text{user-provided parameter} \\ dist = \left( \sum{|\vec{x} - \vec{y}| ^ p} \right) ^ \frac{1}{p} \end{gather} $$
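
A short base R sketch makes the role of p concrete. The helper `minkowski_dist` and its default `p = 2` are assumptions for illustration and are not proxyC's implementation. With p = 1 the formula reduces to the Manhattan distance.

```r
# Minkowski distance from the formula (illustrative helper)
minkowski_dist <- function(x, y, p = 2) {
  sum(abs(x - y)^p)^(1 / p)
}

x <- c(1, 5, 2)
y <- c(4, 1, 2)               # absolute differences: 3, 4, 0

d1 <- minkowski_dist(x, y, p = 1)   # 3 + 4 + 0 = 7, same as sum(abs(x - y))
d2 <- minkowski_dist(x, y, p = 2)   # sqrt(9 + 16) = 5
```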

Hamming distance (“hamming”)

$$ dist = \sum{[\vec{x} \ne \vec{y}]} $$

The largest difference between values (“maximum”)

$$ dist = \max{|\vec{x} - \vec{y}|} $$

Chi-squared divergence (“chisquared”)

$$ \begin{gather} O_{ij} = \text{augmented matrix from } \vec{x} \text{ and } \vec{y} \\ E_{ij} = \text{matrix of expected count for } O_{ij} \\ dist = \sum{\frac{(O_{ij} - E_{ij}) ^ 2}{ E_{ij}}} \end{gather} $$
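
One way to read this definition, assuming the augmented matrix is the two vectors stacked as rows and the expected counts come from the usual row-total times column-total rule, is sketched below. The helper `chisquared_dist` is hypothetical, and its `smooth` argument only mirrors the smoothing described at the start of this section.

```r
# Chi-squared divergence under the stated assumptions (illustrative helper)
chisquared_dist <- function(x, y, smooth = 0) {
  O <- rbind(x + smooth, y + smooth)              # assumed 2 x n observed matrix
  E <- outer(rowSums(O), colSums(O)) / sum(O)     # expected counts under independence
  sum((O - E)^2 / E)
}

x <- c(10, 20)
y <- c(20, 10)
chi <- chisquared_dist(x, y)   # every expected count is 15: 4 * 25 / 15 = 20/3
```

A column that is zero in both vectors makes the expected count zero and the ratio undefined, which is one situation where a positive smooth value helps.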

Kullback–Leibler divergence (“kullback”)

$$ \begin{gather} \vec{p} = \frac{\vec{x}}{\sum{\vec{x}}} \\ \vec{q} = \frac{\vec{y}}{\sum{\vec{y}}} \\ dist = \sum{\vec{q} \log_2{\frac{\vec{q}}{\vec{p}}}} \end{gather} $$
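
The following base R sketch transcribes the formula, with an optional `smooth` argument that follows the description of smoothing given earlier. The helper `kullback_dist` is a hypothetical name, not proxyC's code.

```r
# Kullback-Leibler divergence from the formula (illustrative helper)
kullback_dist <- function(x, y, smooth = 0) {
  p <- (x + smooth) / sum(x + smooth)
  q <- (y + smooth) / sum(y + smooth)
  sum(q * log2(q / p))
}

x <- c(1, 1)                   # p = 0.5, 0.5
y <- c(1, 3)                   # q = 0.25, 0.75
kl <- kullback_dist(x, y)      # 0.25 * log2(0.5) + 0.75 * log2(1.5), about 0.189

# A zero in x where y is positive makes the ratio q / p infinite
kl_zero <- kullback_dist(c(2, 0), c(1, 1))                  # Inf
kl_smoothed <- kullback_dist(c(2, 0), c(1, 1), smooth = 1)  # finite
```

The divergence is not symmetric: swapping x and y generally gives a different value.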

Jeffreys divergence (“jeffreys”)

$$ \begin{gather} \vec{p} = \frac{\vec{x}}{\sum{\vec{x}}} \\ \vec{q} = \frac{\vec{y}}{\sum{\vec{y}}} \\ dist = \sum{\vec{q} \log_2{\frac{\vec{q}}{\vec{p}}}} + \sum{\vec{p} \log_2{\frac{\vec{p}}{\vec{q}}}} \end{gather} $$

Jensen-Shannon divergence (“jensen”)

$$ \begin{gather} \vec{p} = \frac{\vec{x}}{\sum{\vec{x}}} \\ \vec{q} = \frac{\vec{y}}{\sum{\vec{y}}} \\ \vec{m} = \frac{1}{2} (\vec{p} + \vec{q}) \\ dist = \frac{1}{2} \sum{\vec{q} \log_2{\frac{\vec{q}}{\vec{m}}}} + \frac{1}{2} \sum{\vec{p} \log_2{\frac{\vec{p}}{\vec{m}}}} \end{gather} $$
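
As a sketch of the formula in base R, the hypothetical helper `jensen_dist` below averages the two divergences from the midpoint distribution m. Unlike the Kullback–Leibler divergence, the result does not depend on the order of the arguments.

```r
# Jensen-Shannon divergence from the formula (illustrative helper)
jensen_dist <- function(x, y) {
  p <- x / sum(x)
  q <- y / sum(y)
  m <- (p + q) / 2
  0.5 * sum(q * log2(q / m)) + 0.5 * sum(p * log2(p / m))
}

x <- c(1, 3)                   # p = 0.25, 0.75
y <- c(3, 1)                   # q = 0.75, 0.25, so m = 0.5, 0.5
js_xy <- jensen_dist(x, y)     # 0.75 * log2(1.5) + 0.25 * log2(0.5), about 0.189
js_yx <- jensen_dist(y, x)     # same value, since the measure is symmetric
```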

References

Choi, S., Cha, S., & Tappert, C. C. (2010). A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, 8(1), 43–48.
Nielsen, F. (2019). On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means. Entropy, 21(5), 485. https://doi.org/10.3390/e21050485
Jain, G., Mahara, T., & Tripathi, K. N. (2020). A Survey of Similarity Measures for Collaborative Filtering-Based Recommender System. In M. Pant, T. K. Sharma, O. P. Verma, R. Singla, & A. Sikander (Eds.), Soft Computing: Theories and Applications (pp. 343–352). Springer. https://doi.org/10.1007/978-981-15-0751-9_32
Miyamoto, S. (1990). Hierarchical Cluster Analysis and Fuzzy Sets. In S. Miyamoto (Ed.), Fuzzy Sets in Information Retrieval and Cluster Analysis (pp. 125–188). Springer Netherlands. https://doi.org/10.1007/978-94-015-7887-5_6