| Title: | Word and Document Vector Models |
|---|---|
| Description: | Create dense vector representations of words and documents using 'quanteda'. Implements Word2vec (Mikolov et al., 2013) <doi:10.48550/arXiv.1310.4546>, Doc2vec (Le & Mikolov, 2014) <doi:10.48550/arXiv.1405.4053> and Latent Semantic Analysis (Deerwester et al., 1990) <doi:10.1002/(SICI)1097-4571(199009)41:6%3C391::AID-ASI1%3E3.0.CO;2-9>. |
| Authors: | Kohei Watanabe [aut, cre, cph] (ORCID: <https://orcid.org/0000-0001-6519-5265>), Jan Wijffels [aut] (Original R code), BNOSAC [cph] (Original R code), Max Fomichev [ctb, cph] (Original C++ code) |
| Maintainer: | Kohei Watanabe <[email protected]> |
| License: | Apache License (>= 2.0) |
| Version: | 0.6.2 |
| Built: | 2026-06-05 09:03:36 UTC |
| Source: | https://github.com/koheiw/wordvector |
Convert a formula to a named character vector in analogy tasks.
analogy(formula)
formula |
a formula object that defines the relationship between words
using |
a named character vector to be passed to similarity().
analogy(~ berlin - germany + france)
analogy(~ quick - quickly + slowly)
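The idea behind the formula notation can be sketched in base R. The helper below, `formula_to_weights()`, is hypothetical and is not part of wordvector; it assumes that the useful output is a vector of signed weights named by word, which is the kind of named vector that similarity() accepts as targets. It only handles `+`, `-` and parentheses.

```r
# Hypothetical helper (not the package's implementation): walk the right-hand
# side of a one-sided formula and collect each word together with its sign.
formula_to_weights <- function(f) {
  walk <- function(e, sign = 1) {
    if (is.name(e)) {
      return(setNames(sign, as.character(e)))
    }
    op <- as.character(e[[1]])
    flip <- if (op == "-") -sign else sign
    if (length(e) == 2) {
      # unary operator or parentheses, e.g. ~ -germany or ~ (berlin)
      return(walk(e[[2]], flip))
    }
    # binary operator: the left operand keeps the sign, the right one may flip
    c(walk(e[[2]], sign), walk(e[[3]], flip))
  }
  walk(f[[2]])
}

weights <- formula_to_weights(~ berlin - germany + france)
print(weights)
```

Words that are added receive a weight of 1 and words that are subtracted receive a weight of -1 in this sketch.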
Extract word or document vectors from a textmodel_word2vec or textmodel_doc2vec object.
## S3 method for class 'textmodel_doc2vec'
as.matrix(
  x,
  normalize = TRUE,
  layer = c("documents", "words"),
  group = FALSE,
  ...
)

## S3 method for class 'textmodel_word2vec'
as.matrix(x, normalize = TRUE, layer = "words", ...)
x |
a |
normalize |
if |
layer |
the layer from which the vectors are extracted. |
group |
[experimental] average sentence or paragraph vectors from the same document.
Silently ignored when |
... |
not used. |
a matrix that contains the word or document vectors in rows.
Create distributed representation of documents as weighted word vectors.
as.textmodel_doc2vec(x, model, normalize = FALSE, group_data = FALSE, ...)
x |
a quanteda::tokens or quanteda::dfm object. |
model |
a textmodel_wordvector object. |
normalize |
if |
group_data |
if |
... |
additional arguments passed to quanteda::object2id. |
Returns a textmodel_docvector object with the following elements:
values |
a list of matrices for word and document vectors. |
dim |
the size of the document vectors. |
concatenator |
the concatenator in |
docvars |
document variables copied from |
normalize |
if the document vectors are normalized. |
call |
the command used to execute the function. |
version |
the version of the wordvector package. |
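As a rough illustration of what "documents as weighted word vectors" can mean, the base R sketch below averages toy word vectors using word counts as weights. The numbers are made up, and dividing by document length is an assumption made for the example; the exact weighting used by as.textmodel_doc2vec() is not described here.

```r
# Toy word vectors: one row per word, two dimensions (made-up values).
word_vectors <- rbind(
  tax    = c(1, 0),
  budget = c(0.8, 0.2),
  goal   = c(0, 1)
)

# Toy document-feature counts: one row per document, one column per word.
counts <- rbind(
  doc1 = c(tax = 2, budget = 1, goal = 0),
  doc2 = c(tax = 0, budget = 0, goal = 3)
)

# Count-weighted sum of word vectors, divided by the document length
# (assumption: a simple weighted average).
doc_vectors <- (counts %*% word_vectors[colnames(counts), ]) / rowSums(counts)
print(doc_vectors)
```

Documents that share vocabulary end up close to each other in the same space as the words, which is why the resulting object can be queried like a word vector model.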
A corpus object containing 20,000 news summaries collected from Yahoo News via RSS feeds in 2014. The title and description of the summaries are concatenated.
data_corpus_news2014
An object of class corpus (inherits from character) of length 20000.
Watanabe, K. (2018). Newsmap: A semi-supervised approach to geographical news classification. Digital Journalism, 6(3), 294–309. https://doi.org/10.1080/21670811.2017.1293487
Compute the probability of words given other words.
probability(
  x,
  targets,
  layer = c("words", "documents"),
  mode = c("character", "numeric"),
  ...
)
x |
a trained |
targets |
words for which probabilities are computed. |
layer |
the layer based on which probabilities are computed. |
mode |
specify the type of resulting object. |
... |
passed to |
a matrix of words or documents sorted in descending order by the probability
scores when mode = "character"; a matrix of the probability scores when mode = "numeric".
When targets is a named numeric vector, probability scores are weighted by
the values.
Compute the cosine similarity between word vectors for selected words.
similarity(
  x,
  targets,
  layer = c("words", "documents"),
  mode = c("character", "numeric")
)
x |
a |
targets |
words or documents for which similarity is computed. |
layer |
the layer based on which similarity is computed. This must be "documents"
when |
mode |
specify the type of resulting object. |
a matrix of cosine similarity scores when mode = "numeric" or of
words sorted in descending order by the similarity scores when mode = "character".
When targets is a named numeric vector, word (or document) vectors are weighted and summed
before computing similarity scores.
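The weighting and summing of target vectors can be sketched in base R as follows. The word vectors are toy values chosen so that the arithmetic is easy to follow, and `cosine()` is a helper defined only for this example; this is not the package's internal code.

```r
# Cosine similarity between two numeric vectors (example helper).
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Toy word vectors: one row per word (made-up values).
word_vectors <- rbind(
  berlin  = c(1, 1, 0),
  germany = c(0, 1, 0),
  france  = c(0, 1, 1),
  paris   = c(1, 1, 1),
  banana  = c(-1, 0, 0)
)

# Named numeric targets: each word vector is multiplied by its weight.
targets <- c(berlin = 1, germany = -1, france = 1)

# Weight the target rows and sum them into a single query vector.
query <- colSums(word_vectors[names(targets), , drop = FALSE] * targets)

# Compare the query with every word and sort in descending order.
scores <- apply(word_vectors, 1, cosine, b = query)
ranked <- sort(scores, decreasing = TRUE)
print(ranked)
```

With these toy values the query vector coincides with the vector for "paris", so that word is ranked first.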
Train a doc2vec model (Le & Mikolov, 2014) using a quanteda::tokens object.
textmodel_doc2vec(
  x,
  dim = 50,
  type = c("dm", "dbow"),
  min_count = 5,
  window = 5,
  iter = 10,
  alpha = 0.05,
  model = NULL,
  use_ns = TRUE,
  ns_size = 5,
  sample = 0.001,
  tolower = TRUE,
  include_data = FALSE,
  verbose = FALSE,
  ...
)
x |
a quanteda::tokens or quanteda::tokens_xptr object. |
dim |
the size of the word vectors. |
type |
the architecture of the model; either "dm" (distributed memory) or "dbow" (distributed bag-of-words). |
min_count |
the minimum frequency of the words. Words less frequent than
this in |
window |
the size of the window for context words. Ignored when |
iter |
the number of iterations in model training. |
alpha |
the initial learning rate. |
model |
a trained Word2vec model; if provided, its word vectors are updated for |
use_ns |
if |
ns_size |
the size of negative samples. Only used when |
sample |
the rate of sampling of words based on their frequency. Sampling is
disabled when |
tolower |
lower-case all the tokens before fitting the model. |
include_data |
if |
verbose |
if |
... |
additional arguments. |
Returns a textmodel_doc2vec object with matrices for word and document vector
values, quanteda::docvars and quanteda::ntoken of x. Other elements are
the same as textmodel_word2vec.
Le, Q. V., & Mikolov, T. (2014). Distributed Representations of Sentences and Documents (No. arXiv:1405.4053). arXiv. https://doi.org/10.48550/arXiv.1405.4053
Train a Latent Semantic Analysis model (Deerwester et al., 1990) on a quanteda::tokens object.
textmodel_lsa(
  x,
  dim = 50,
  min_count = 5L,
  engine = c("RSpectra", "irlba", "rsvd"),
  weight = "count",
  tolower = TRUE,
  verbose = FALSE,
  ...
)
x |
a quanteda::tokens or quanteda::tokens_xptr object. |
dim |
the size of the word vectors. |
min_count |
the minimum frequency of the words. Words less frequent than
this in |
engine |
select the engine used to perform SVD to generate word vectors. |
weight |
weighting scheme passed to |
tolower |
if |
verbose |
if |
... |
additional arguments. |
Returns a textmodel_wordvector object with the following elements:
values |
a matrix for word vector values. |
weights |
a matrix for word vector weights. |
frequency |
the frequency of words in |
engine |
the SVD engine used. |
weight |
weighting scheme. |
min_count |
the value of min_count. |
concatenator |
the concatenator in |
call |
the command used to execute the function. |
version |
the version of the wordvector package. |
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391–407.
library(quanteda)
library(wordvector)

# pre-processing
corp <- corpus_reshape(data_corpus_news2014)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>%
    tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>%
    tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                  padding = TRUE) %>%
    tokens_tolower()

# train LSA
lsa <- textmodel_lsa(toks, dim = 50, min_count = 5, verbose = TRUE)

# find similar words
head(similarity(lsa, c("berlin", "germany", "france"), mode = "character"))
head(similarity(lsa, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "numeric"))
head(similarity(lsa, analogy(~ berlin - germany + france)))
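The core idea of Latent Semantic Analysis, a truncated singular value decomposition of a document-feature matrix, can be sketched with base R's svd(). The toy matrix and the choice to scale the right singular vectors by the singular values are assumptions made for this illustration; textmodel_lsa() relies on the engines listed above and may weight or scale the result differently.

```r
# Toy document-feature matrix: two "economy" and two "sports" documents.
dfm_toy <- rbind(
  d1 = c(tax = 2, budget = 1, goal = 0, match = 0),
  d2 = c(tax = 1, budget = 2, goal = 0, match = 0),
  d3 = c(tax = 0, budget = 0, goal = 2, match = 1),
  d4 = c(tax = 0, budget = 0, goal = 1, match = 2)
)

k <- 2  # number of dimensions to keep
dec <- svd(dfm_toy, nu = k, nv = k)

# Word vectors: right singular vectors scaled by the singular values
# (assumption for this sketch).
word_vectors <- dec$v %*% diag(dec$d[seq_len(k)], nrow = k)
rownames(word_vectors) <- colnames(dfm_toy)

# Cosine similarity between two numeric vectors (example helper).
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

sim_tax_budget <- cosine(word_vectors["tax", ], word_vectors["budget", ])
sim_tax_goal <- cosine(word_vectors["tax", ], word_vectors["goal", ])
print(c(tax_budget = sim_tax_budget, tax_goal = sim_tax_goal))
```

Words that occur in the same documents ("tax" and "budget") receive more similar vectors than words that never co-occur ("tax" and "goal").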
Train a word2vec model (Mikolov et al., 2013) using a quanteda::tokens object.
textmodel_word2vec(
  x,
  dim = 50,
  type = c("cbow", "sg", "dm"),
  min_count = 5,
  window = ifelse(type == "sg", 10, 5),
  iter = 10,
  alpha = 0.05,
  model = NULL,
  use_ns = TRUE,
  ns_size = 5,
  sample = 0.001,
  tolower = TRUE,
  include_data = FALSE,
  verbose = FALSE,
  ...
)
x |
a quanteda::tokens or quanteda::tokens_xptr object. |
dim |
the size of the word vectors. |
type |
the architecture of the model; either "cbow" (continuous bag-of-words), "sg" (skip-gram), or "dm" (distributed memory). |
min_count |
the minimum frequency of the words. Words less frequent than
this in |
window |
the size of the word window. Words within this window are considered to be the context of a target word. |
iter |
the number of iterations in model training. |
alpha |
the initial learning rate. |
model |
a trained Word2vec model; if provided, its word vectors are updated for |
use_ns |
if |
ns_size |
the size of negative samples. Only used when |
sample |
the rate of sampling of words based on their frequency. Sampling is
disabled when |
tolower |
lower-case all the tokens before fitting the model. |
include_data |
if |
verbose |
if |
... |
additional arguments. |
If type = "dm", it trains a doc2vec model but saves only
word vectors to save storage space. textmodel_doc2vec should be
used to access document vectors.
Users can change the number of processors used for parallel computing via
options(wordvector_threads). When the value is larger than one, the result
of every execution becomes slightly different even if set.seed() is used because
parameters are updated in different orders by the processors.
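A minimal sketch of a setup aimed at repeatable results, assuming that a single thread removes the ordering effect described above (the seed value is arbitrary):

```r
# Use a single thread so that parameters are updated in a fixed order.
options(wordvector_threads = 1)

# Fix the random seed before training.
set.seed(1234)

# The model would then be trained as usual, for example with
# textmodel_word2vec(); training is omitted from this sketch.
threads <- getOption("wordvector_threads")
print(threads)
```

Training with one thread is slower, so this setting is best reserved for cases where exact repeatability matters.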
Returns a textmodel_word2vec object with the following elements:
values |
a list of a matrix for word vector values. |
weights |
a matrix for word vector weights. |
dim |
the size of the word vectors. |
type |
the architecture of the model. |
frequency |
the frequency of words in |
window |
the size of the word window. |
iter |
the number of iterations in model training. |
alpha |
the initial learning rate. |
use_ns |
the use of negative sampling. |
ns_size |
the size of negative samples. |
min_count |
the value of min_count. |
concatenator |
the concatenator in |
data |
the original data supplied as |
call |
the command used to execute the function. |
version |
the version of the wordvector package. |
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. https://arxiv.org/abs/1310.4546.
library(quanteda)
library(wordvector)

# pre-processing
corp <- data_corpus_news2014
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>%
    tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>%
    tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                  padding = TRUE) %>%
    tokens_tolower()

# train word2vec
wov <- textmodel_word2vec(toks, dim = 50, type = "cbow", min_count = 5, sample = 0.001)

# find similar words
head(similarity(wov, c("berlin", "germany", "france"), mode = "character"))
head(similarity(wov, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "numeric"))
head(similarity(wov, analogy(~ berlin - germany + france), mode = "character"))