Title: | Feature Extraction and Document Classification with Noisy Labels |
---|---|
Description: | Extract features and classify documents with noisy labels given by document meta-data or keyword matching (Watanabe & Zhou, 2020) <doi:10.1177/0894439320907027>. |
Authors: | Kohei Watanabe [aut, cre, cph] |
Maintainer: | Kohei Watanabe <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.9.2 |
Built: | 2024-10-29 03:26:31 UTC |
Source: | https://github.com/koheiw/wordmap |
accuracy() counts the number of true positive, false positive, true negative, and false negative cases for each predicted class and calculates precision, recall and F1 score based on these counts. summary() calculates micro-average precision and recall, and macro-average precision and recall based on the output of accuracy().
accuracy(x, y)

## S3 method for class 'textmodel_wordmap_accuracy'
summary(object, ...)
x | a vector of predicted classes. |
y | a vector of true classes. |
object | output of accuracy(). |
... | not used. |
accuracy() returns a data.frame with the following columns:
tp | the number of true positive cases. |
fp | the number of false positive cases. |
tn | the number of true negative cases. |
fn | the number of false negative cases. |
precision | tp / (tp + fp). |
recall | tp / (tp + fn). |
f1 | the harmonic mean of precision and recall. |
summary() returns a named numeric vector with the following elements:
p | micro-average precision. |
r | micro-average recall. |
P | macro-average precision. |
R | macro-average recall. |
class_pred <- c('US', 'GB', 'US', 'CN', 'JP', 'FR', 'CN') # prediction
class_true <- c('US', 'FR', 'US', 'CN', 'KP', 'EG', 'US') # true class
acc <- accuracy(class_pred, class_true)
print(acc)
summary(acc)
afe() computes Average Feature Entropy (AFE), which measures the randomness of occurrences of features in labelled documents (Watanabe & Zhou, 2020). In creating seed dictionaries, AFE can be used to avoid adding seed words that would decrease classification accuracy.
afe(x, y, smooth = 1)
x | a dfm for features. |
y | a dfm for labels. |
smooth | a numeric value for smoothing to include all the features. |
Returns a single numeric value.
Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.
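A minimal sketch of how afe() might be used while building a seed dictionary, assuming the quanteda pipeline from the package examples; the comparison step is illustrative:

```r
require(quanteda)

# tokenize the bundled UN General Debate corpus
toks <- tokens(corpus_reshape(data_corpus_ungd2017), remove_punct = TRUE)
dfmt_feat <- dfm(toks)

# AFE with the current seed dictionary
dfmt_dict <- dfm(tokens_lookup(toks, data_dictionary_topic))
afe(dfmt_feat, dfmt_dict)

# recomputing AFE after editing the dictionary lets you check
# whether a candidate seed word should be kept or discarded
```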
as.list() returns features with the largest coefficients as a list of character vectors. as.dictionary() returns a quanteda::dictionary object that can be used for dictionary analysis.
## S3 method for class 'textmodel_wordmap'
as.dictionary(x, separator = NULL, ...)

## S3 method for class 'textmodel_wordmap'
as.list(x, ...)
x | a model fitted by textmodel_wordmap(). |
separator | the character in between multi-word dictionary values. |
... | passed to coef.textmodel_wordmap. |
Returns a list or a quanteda::dictionary object.
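A brief sketch, assuming `map` is a model fitted by textmodel_wordmap() as in the package example:

```r
# convert the fitted model into a quanteda dictionary of top features
dict <- as.dictionary(map)

# extract the top features per class as a plain list;
# arguments in ... (e.g. n) are passed on to coef()
lis <- as.list(map, n = 20)
```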
coef() extracts the top n features with the largest coefficients for each class.
## S3 method for class 'textmodel_wordmap'
coef(object, n = 10, select = NULL, ...)

## S3 method for class 'textmodel_wordmap'
coefficients(object, n = 10, select = NULL, ...)
object | a model fitted by textmodel_wordmap(). |
n | the number of coefficients to extract. |
select | returns the coefficients only for the selected classes; specify by the names of rows in the fitted model. |
... | not used. |
Returns a list of named numeric vectors sorted in descending order.
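As a sketch, again assuming a fitted model `map`; the class name passed to select is hypothetical and must match a row name of the fitted model:

```r
# top 5 features for every class
coef(map, n = 5)

# coefficients for a single class only; "security" is a
# hypothetical class name used for illustration
coef(map, n = 10, select = "security")
```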
A corpus of 196 speeches from the 2017 UN General Debate (Mikhaylov and Baturo, 2017). The economic data for 2017 (GDP and GDP per capita) are downloaded from the World Bank website.
data_corpus_ungd2017
The corpus includes the following document variables:

- ISO3c country code, e.g. "AFG" for Afghanistan.
- UN session, a numeric identifier (in this case, 72).
- 4-digit year (2017).
- country name, in English.
- continent of the country, one of: Africa, Americas, Asia, Europe, Oceania. Note that the speech delivered on behalf of the European Union is coded as "Europe".
- GDP in $US for 2017, from the World Bank. Contains missing values for 9 countries.
- GDP per capita in $US for 2017, derived from the World Bank. Contains missing values for 9 countries.
Mikhaylov, M., Baturo, A., & Dasandi, N. (2017). "United Nations General Debate Corpus". doi:10.7910/DVN/0TJX8Y. Harvard Dataverse, V4.
Baturo, A., Dasandi, N., & Mikhaylov, S. (2017). "Understanding State Preferences With Text As Data: Introducing the UN General Debate Corpus". doi:10.1177/2053168017712821. Research and Politics.
A dictionary with seed words for six common topics at the United Nations General Assembly (Watanabe and Zhou, 2020).
data_dictionary_topic
An object of class dictionary2 of length 6.
Kohei Watanabe [email protected]
Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.
Predict document class using fitted Wordmap models.
## S3 method for class 'textmodel_wordmap'
predict(
  object,
  newdata = NULL,
  confidence = FALSE,
  rank = 1L,
  type = c("top", "all"),
  rescale = FALSE,
  min_conf = -Inf,
  min_n = 0L,
  ...
)
object | a model fitted by textmodel_wordmap(). |
newdata | a dfm on which prediction will be made. |
confidence | if TRUE, confidence scores are returned along with the predicted classes. |
rank | rank of the class to be predicted. Only used when type = "top". |
type | if "top", returns the most likely class of documents; if "all", returns scores for all the classes. |
rescale | if TRUE, rescales the confidence scores. |
min_conf | returns NA when the confidence score is lower than this value. |
min_n | set the minimum number of polarity words in documents. |
... | not used. |
Returns predicted classes as a vector. If confidence = TRUE, it returns a list of two vectors:

class | predicted classes of documents. |
confidence.fit | the confidence of predictions. |
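A short sketch of the two return shapes, assuming a fitted model `map` as in the package example:

```r
# most likely class per document, as a vector
pred <- predict(map)

# with confidence scores: a list of two parallel vectors
pred <- predict(map, confidence = TRUE)
head(pred$class)
head(pred$confidence.fit)
```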
Wordmap is a model for multinomial feature extraction and document classification. Its naive Bayesian algorithm allows users to train the model on a large corpus with noisy labels given by document meta-data or keyword matching.
textmodel_wordmap(
  x,
  y,
  label = c("all", "max"),
  smooth = 0.01,
  boolean = FALSE,
  drop_label = TRUE,
  entropy = c("none", "global", "local", "average"),
  residual = FALSE,
  verbose = quanteda_options("verbose"),
  ...
)
x | a dfm or fcm created by quanteda::dfm() or quanteda::fcm(). |
y | a dfm or a sparse matrix that records class membership of the documents. It can be created by applying quanteda::dfm_lookup() to x. |
label | if "max", uses only labels for the maximum value in each row of y. |
smooth | the amount of smoothing in computing coefficients. |
boolean | if TRUE, only the presence or absence of features in documents is considered. |
drop_label | if TRUE, drops empty columns of y and ignores their labels. |
entropy | the scheme to compute the entropy to regularize likelihood ratios. The entropy of features is computed over labels if "global" or over documents if "local"; "average" uses the average of the local entropies. |
residual | if TRUE, adds a residual class to y (see Details). |
verbose | if TRUE, shows the progress of model fitting. |
... | additional arguments passed to internal functions. |
Wordmap learns the association between words in x and classes in y based on likelihood ratios. Large likelihood ratios tend to concentrate on a small number of features, but the entropy of their frequencies over labels or documents helps to disperse the distribution.
A residual class is created internally by adding a new column to y. The column is given 1 if the other values in the same row are all zero (i.e. rowSums(y) == 0); otherwise 0. It is useful when users cannot create an exhaustive dictionary that covers all the categories.
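The residual class described above can be sketched as follows, assuming the dfm objects from the package example below:

```r
# documents whose rows in dfmt_dict are all zero are assigned to
# an internally created residual class instead of a seeded topic
map2 <- textmodel_wordmap(dfmt_feat, dfmt_dict, residual = TRUE)
table(predict(map2))
```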
Returns a fitted textmodel_wordmap object with the following elements:
model | a matrix that records the association between classes and features. |
data | the original input data. |
feature | the feature set in x. |
class | the class labels in y. |
concatenator | the concatenator in x. |
entropy | the scheme to compute entropy weights. |
boolean | the use of the Boolean transformation of x. |
call | the command used to execute the function. |
version | the version of the wordmap package. |
Watanabe, Kohei (2018). "Newsmap: semi-supervised approach to geographical news classification". doi:10.1080/21670811.2017.1293487. Digital Journalism.
Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.
require(quanteda)

# split into sentences
corp <- corpus_reshape(data_corpus_ungd2017)
# tokenize
toks <- tokens(corp, remove_punct = TRUE) %>%
  tokens_remove(stopwords("en"))
# apply seed dictionary
toks_dict <- tokens_lookup(toks, data_dictionary_topic)
# form dfm
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(toks_dict)
# fit wordmap model
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)
coef(map)
predict(map)
Plot coefficients of words
textplot_terms(
  x,
  highlighted = NULL,
  max_highlighted = 50,
  max_words = 1000,
  ...
)
x | a fitted textmodel_wordmap object. |
highlighted | quanteda::pattern to select words to highlight. If a quanteda::dictionary is passed, words in the top-level categories are highlighted in different colors. |
max_highlighted | the maximum number of words to highlight. |
max_words | the maximum number of words to plot. Words are randomly sampled to keep the number below the limit. |
... | passed to underlying functions. See the Details. |
Users can customize the plots through ..., which is passed to ggplot2::geom_text() and ggrepel::geom_text_repel(). The colors are specified internally, but users can override the settings by appending ggplot2::scale_colour_manual() or ggplot2::scale_colour_brewer(). The legend title can also be modified using ggplot2::labs().
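A sketch of these customizations, assuming a fitted model `map` and the bundled seed dictionary as in the package example:

```r
# highlight dictionary words, colored by top-level topic
textplot_terms(map, highlighted = data_dictionary_topic)

# override the internal color scale and rename the legend
textplot_terms(map, highlighted = data_dictionary_topic) +
  ggplot2::scale_colour_brewer(palette = "Set1") +
  ggplot2::labs(colour = "Topic")
```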