Title: | Semi-Supervised Algorithm for Document Scaling |
---|---|
Description: | A word embeddings-based semi-supervised model for document scaling (Watanabe 2020) <doi:10.1080/19312458.2020.1832976>. LSS allows users to analyze large and complex corpora on arbitrary dimensions defined by seed words, exploiting the efficiency of word embeddings (SVD, GloVe). It can generate word vectors from a user-provided corpus or incorporate pre-trained word vectors. |
Authors: | Kohei Watanabe [aut, cre, cph] |
Maintainer: | Kohei Watanabe <[email protected]> |
License: | GPL-3 |
Version: | 1.4.1 |
Built: | 2024-11-20 04:50:58 UTC |
Source: | https://github.com/koheiw/lsx |
Convert a list or a dictionary to seed words
as.seedwords(x, upper = 1, lower = 2, concatenator = "_")
x |
a list of character vectors or a dictionary object. |
upper |
numeric index or key for seed words for higher scores. |
lower |
numeric index or key for seed words for lower scores. |
concatenator |
character to replace separators of multi-word seed words. |
A named numeric vector of seed words with polarity scores.
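For example, with the defaults upper = 1 and lower = 2, the first element of a list supplies the positive seed words and the second the negative ones (a minimal sketch):

# first element -> score 1, second element -> score -1
as.seedwords(list(c("good", "nice"), c("bad", "awful")))
#>  good  nice   bad awful
#>     1     1    -1    -1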
A function to compute polarity scores of words and documents by resampling hyper-parameters from a fitted LSS model.
bootstrap_lss(
  x,
  what = c("seeds", "k"),
  mode = c("terms", "coef", "predict"),
  remove = FALSE,
  from = 100,
  to = NULL,
  by = 50,
  verbose = FALSE,
  ...
)
x |
a fitted textmodel_lss object. |
what |
choose the hyper-parameter to resample in bootstrapping. |
mode |
choose the type of the result of bootstrapping. If "terms", returns the most polarized words; if "coef", returns the polarity scores of words; if "predict", returns the polarity scores of documents. |
remove |
if TRUE, removes seed words one by one instead of resampling them. |
from, to, by |
passed to seq() to generate values of k; used only when what = "k". |
verbose |
show messages if TRUE. |
... |
additional arguments passed to predict(), such as newdata. |
bootstrap_lss() internally creates fitted textmodel_lss objects by resampling hyper-parameters and computes the polarity of words or documents. The resulting matrix can be used to assess the validity and the reliability of seeds or k. Note that objects created by as.textmodel_lss() do not contain data, so users must pass newdata via ... when mode = "predict".
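A minimal sketch, assuming lss is a fitted textmodel_lss object and dfmt_grp is a document-level dfm (both hypothetical names):

# resample seed words and return the most polarized words for each sample
bs_terms <- bootstrap_lss(lss, what = "seeds", mode = "terms")

# resample k and predict document scores; newdata must be passed via ...
bs_pred <- bootstrap_lss(lss, what = "k", from = 100, by = 50,
                         mode = "predict", newdata = dfmt_grp)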
coef() extracts model coefficients from a fitted textmodel_lss object. coefficients() is an alias.
## S3 method for class 'textmodel_lss'
coef(object, ...)

coefficients.textmodel_lss(object, ...)
object |
a fitted textmodel_lss object. |
... |
not used. |
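For example, sorting the coefficients reveals the most polarized words (a minimal sketch; lss is a hypothetical fitted model):

scores <- coef(lss)
head(sort(scores, decreasing = TRUE), 10)  # words with the highest polarity
tail(sort(scores, decreasing = TRUE), 10)  # words with the lowest polarity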
Seed words for analysis of left-right political ideology
as.seedwords(data_dictionary_ideology)
Seed words for analysis of positive-negative sentiment
Turney, P. D., & Littman, M. L. (2003). Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Trans. Inf. Syst., 21(4), 315–346. doi:10.1145/944012.944013
as.seedwords(data_dictionary_sentiment)
This model was trained on a Russian media corpus (newspapers, TV transcripts and newswires) to analyze framing of street protests. The scale is protests as "freedom of expression" (high) vs. "social disorder" (low). Although some slots are missing in this object (because the model was imported from the original Python implementation), it allows you to scale texts using predict().
Lankina, Tomila, and Kohei Watanabe. “'Russian Spring' or 'Spring Betrayal'? The Media as a Mirror of Putin's Evolving Strategy in Ukraine.” Europe-Asia Studies 69, no. 10 (2017): 1526–56. doi:10.1080/09668136.2017.1397603.
[experimental] Compute variance ratios with different hyper-parameters
optimize_lss(x, ...)
x |
a fitted textmodel_lss object. |
... |
additional arguments passed to bootstrap_lss. |
optimize_lss() computes variance ratios with different values of hyper-parameters using bootstrap_lss(). The variance ratio is defined as the variance of predicted document scores divided by the variance of word polarity scores. It is maximized when the model best distinguishes between the documents on the latent scale.
## Not run:
# the unit of analysis is not sentences
dfmt_grp <- dfm_group(dfmt)

# choose best k
v1 <- optimize_lss(lss, what = "k", from = 50, newdata = dfmt_grp, verbose = TRUE)
plot(names(v1), v1)

# find bad seed words
v2 <- optimize_lss(lss, what = "seeds", remove = TRUE, newdata = dfmt_grp, verbose = TRUE)
barplot(v2, las = 2)

## End(Not run)
Prediction method for textmodel_lss
## S3 method for class 'textmodel_lss'
predict(
  object,
  newdata = NULL,
  se_fit = FALSE,
  density = FALSE,
  rescale = TRUE,
  cut = NULL,
  min_n = 0L,
  ...
)
object |
a fitted LSS textmodel. |
newdata |
a dfm on which prediction should be made. |
se_fit |
if TRUE, returns the standard errors of document scores. |
density |
if TRUE, returns the density of polarity words in documents. |
rescale |
if TRUE, converts the raw polarity scores of documents to z-scores. |
cut |
a vector of one or two percentile values to dichotomize polarity scores of words. When two values are given, words between them receive zero polarity. |
min_n |
set the minimum number of polarity words in documents. |
... |
not used |
Polarity scores of documents are the means of the polarity scores of words weighted by their frequency. When se_fit = TRUE, this function returns the weighted means, their standard errors, and the number of polarity words in the documents. When rescale = TRUE, it converts the raw polarity scores to z-scores for easier interpretation. When rescale = FALSE and cut is used, polarity scores of documents are bounded by [-1.0, 1.0].

Documents tend to receive extreme polarity scores when they contain only a few polarity words. This is problematic when LSS is applied to short documents (e.g. social media posts) or individual sentences, but users can alleviate the problem by adding zero-polarity words to short documents via min_n. This setting does not affect empty documents.
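A minimal sketch, assuming lss was fitted on a sentence-level dfm and dfmt_grp groups the sentences back into documents (both hypothetical names):

# standard errors and polarity-word counts are returned when se_fit = TRUE
pred <- predict(lss, newdata = dfmt_grp, se_fit = TRUE)

# pad short documents with zero-polarity words to avoid extreme scores
pred_min <- predict(lss, newdata = dfmt_grp, min_n = 10)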
Seed words for Latent Semantic Analysis
seedwords(type)
type |
type of seed words; currently only sentiment ("sentiment") is available. |
Turney, P. D., & Littman, M. L. (2003). Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Trans. Inf. Syst., 21(4), 315–346. doi:10.1145/944012.944013
seedwords('sentiment')
Smooth predicted polarity scores by local polynomial regression
smooth_lss(
  x,
  lss_var = "fit",
  date_var = "date",
  span = 0.1,
  group = NULL,
  from = NULL,
  to = NULL,
  by = "day",
  engine = c("loess", "locfit"),
  ...
)
x |
a data.frame containing polarity scores and dates. |
lss_var |
the name of the column in x for the polarity scores. |
date_var |
the name of the column in x for the dates. |
span |
the level of smoothing. |
group |
the name of the column in x by which the scores are grouped; smoothing is performed separately for each group. |
from, to, by |
the range and the interval of the smoothed scores; passed to seq.Date(). |
engine |
specifies the function to be used for smoothing. |
... |
additional arguments passed to the smoothing function. |
Smoothing is performed using stats::loess() or locfit::locfit(). When x has more than 10,000 rows, it is usually better to choose the latter by setting engine = "locfit". In this case, span is passed to locfit::lp(nn = span).
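A minimal sketch, assuming dat is a data.frame with the default column names "fit" and "date" (e.g. document scores from predict() joined with document dates):

# smooth daily scores with loess
dat_smoothed <- smooth_lss(dat, span = 0.2, by = "day")

# locfit is faster for large data; span is passed to locfit::lp(nn = span)
dat_smoothed <- smooth_lss(dat, engine = "locfit", span = 0.2)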
Latent Semantic Scaling (LSS) is a word embedding-based semisupervised algorithm for document scaling.
textmodel_lss(x, ...)

## S3 method for class 'dfm'
textmodel_lss(
  x,
  seeds,
  terms = NULL,
  k = 300,
  slice = NULL,
  weight = "count",
  cache = FALSE,
  simil_method = "cosine",
  engine = c("RSpectra", "irlba", "rsvd"),
  prop_slice = NULL,
  auto_weight = FALSE,
  include_data = FALSE,
  group_data = FALSE,
  verbose = FALSE,
  ...
)

## S3 method for class 'fcm'
textmodel_lss(
  x,
  seeds,
  terms = NULL,
  w = 50,
  max_count = 10,
  weight = "count",
  cache = FALSE,
  simil_method = "cosine",
  engine = c("rsparse"),
  auto_weight = FALSE,
  verbose = FALSE,
  ...
)
x |
a dfm or fcm created by quanteda::dfm() or quanteda::fcm(). |
... |
additional arguments passed to the underlying engine. |
seeds |
a character vector or named numeric vector that contains seed words. If seed words contain "*", they are interpreted as glob patterns. See quanteda::valuetype. |
terms |
a character vector or named numeric vector that specifies words for which polarity scores will be computed; if a numeric vector, words' polarity scores will be weighted accordingly; if NULL, all the words in x are used. |
k |
the number of singular values requested from the SVD engine. Only used when x is a dfm. |
slice |
a number or indices of the components of word vectors used to compute similarity; slice < k truncates the word vectors further, which is useful for diagnosis and simulation. |
weight |
weighting scheme passed to quanteda::dfm_weight(). |
cache |
if TRUE, saves the result of the SVD for the next execution with identical x and settings. |
simil_method |
specifies the method to compute similarity between features. The value is passed to quanteda.textstats::textstat_simil(). |
engine |
select the engine to factorize x and generate word vectors. |
prop_slice |
[experimental] specify the number of components to use by proportion. |
auto_weight |
automatically determine weights to approximate the polarity of terms to seed words. See details. |
include_data |
if TRUE, saves the dfm (x) in the resulting object. |
group_data |
if TRUE, applies dfm_group() to x before saving the dfm. |
verbose |
show messages if TRUE. |
w |
the size of word vectors. Used only when x is a fcm. |
max_count |
passed to x_max of rsparse::GloVe. |
Latent Semantic Scaling (LSS) is a semisupervised document scaling method. textmodel_lss() constructs word vectors from user-provided documents (x) and weights words (terms) based on their semantic proximity to seed words (seeds). Seed words are any known polarity words (e.g. sentiment words) that users should choose manually. The required number of seed words is usually 5 to 10 for each end of the scale.
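A basic workflow might look as follows (a minimal sketch; corp is a hypothetical user-provided corpus):

library(quanteda)
library(LSX)

# tokenize and construct a document-feature matrix
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfmt <- dfm(toks)

# fit an LSS model on a sentiment dimension using the bundled seed words
lss <- textmodel_lss(dfmt, seeds = as.seedwords(data_dictionary_sentiment),
                     k = 300, include_data = TRUE)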
If seeds is a named numeric vector with positive and negative values, a bipolar LSS model is constructed; if seeds is a character vector, a unipolar model is constructed. Bipolar models usually perform better in document scaling because both ends of the scale are defined by the user.
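For example (a minimal sketch):

# bipolar seed words: both ends of the scale are defined
seeds_bipolar <- c(good = 1, nice = 1, bad = -1, awful = -1)

# unipolar seed words: only one end is defined
seeds_unipolar <- c("good", "nice")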
A seed word's polarity score computed by textmodel_lss() tends to diverge from the original score given by the user, because its computed score is affected not only by its own original score but also by the original scores of all the other seed words. If auto_weight = TRUE, the original scores are weighted automatically using stats::optim() to minimize the squared differences between the seed words' computed and original scores. The weighted scores are saved in seed_weighted in the object.
Please visit the package website for examples.
Watanabe, Kohei. 2020. "Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages", Communication Methods and Measures. doi:10.1080/19312458.2020.1832976.
Watanabe, Kohei. 2017. "Measuring News Bias: Russia's Official News Agency ITAR-TASS' Coverage of the Ukraine Crisis" European Journal of Communication. doi:10.1177/0267323117695735.
Plot similarity between seed words
textplot_simil(x)
x |
a fitted textmodel_lss object. |
Plot polarity scores of words
textplot_terms(
  x,
  highlighted = NULL,
  max_highlighted = 50,
  max_words = 1000,
  ...
)
x |
a fitted textmodel_lss object. |
highlighted |
quanteda::pattern to select words to highlight. If a quanteda::dictionary is passed, words in the top-level categories are highlighted in different colors. |
max_highlighted |
the maximum number of words to highlight. When highlighted = NULL, words are randomly sampled for highlighting in proportion to the absolute values of their polarity scores. |
max_words |
the maximum number of words to plot. Words are randomly sampled to keep the number below the limit. |
... |
passed to underlying functions. See the Details. |
Users can customize the plots through ..., which is passed to ggplot2::geom_text() and ggrepel::geom_text_repel(). The colors are specified internally, but users can override the settings by appending ggplot2::scale_colour_manual() or ggplot2::scale_colour_brewer(). The legend title can also be modified using ggplot2::labs().
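For example (a minimal sketch; lss is a hypothetical fitted model, and data_dictionary_LSD2015 ships with quanteda):

# highlight dictionary words in different colors and rename the legend
textplot_terms(lss, highlighted = quanteda::data_dictionary_LSD2015) +
  ggplot2::labs(colour = "Sentiment")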
Identify context words using user-provided patterns
textstat_context(
  x,
  pattern,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  window = 10,
  min_count = 10,
  remove_pattern = TRUE,
  n = 1,
  skip = 0,
  ...
)

char_context(
  x,
  pattern,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  window = 10,
  min_count = 10,
  remove_pattern = TRUE,
  p = 0.001,
  n = 1,
  skip = 0
)
x |
a tokens object created by quanteda::tokens(). |
pattern |
a character vector or dictionary object specifying the target words; see quanteda::pattern for details. |
valuetype |
the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. |
case_insensitive |
if TRUE, ignore case when matching. |
window |
size of window for collocation analysis. |
min_count |
minimum frequency of words within the window to be considered as collocations. |
remove_pattern |
if TRUE, words matching pattern are removed from the results. |
n |
integer vector specifying the number of elements to be concatenated in each n-gram. Each element of this vector will define a n in the n-gram(s). |
skip |
integer vector specifying the adjacency skip size for tokens forming the n-grams; the default is 0 for only immediately neighbouring words. For skip-grams, skip can be a vector of integers. |
... |
additional arguments passed to textstat_keyness(). |
p |
threshold for statistical significance of collocations. |
See also tokens_select() and textstat_keyness().
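For example, context words can restrict the terms of an LSS model to a specific subject (a minimal sketch; corp, dfmt and the pattern are hypothetical):

toks <- quanteda::tokens(corp, remove_punct = TRUE)

# words that occur around "ukrain*" significantly more often than elsewhere
ukr <- char_context(toks, pattern = "ukrain*", p = 0.05)

# compute polarity scores only for the context words
lss <- textmodel_lss(dfmt, seeds = as.seedwords(data_dictionary_sentiment),
                     terms = ukr, k = 300)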