Title: | Semi-Supervised Model for Geographical Document Classification |
---|---|
Description: | Semi-supervised model for geographical document classification (Watanabe 2018) <doi:10.1080/21670811.2017.1293487>. This package currently contains seed dictionaries in English, German, French, Spanish, Italian, Russian, Hebrew, Arabic, Turkish, Japanese and Chinese (Simplified and Traditional). |
Authors: | Kohei Watanabe [aut, cre, cph], Stefan Müller [aut], Dani Madrid-Morales [aut], Katerina Tertytchnaya [aut], Ke Cheng [aut], Chung-hong Chan [aut], Claude Grasland [aut], Giuseppe Carteny [aut], Elad Segev [aut], Dai Yamao [aut], Barbara Ellynes Zucchi Nobre Silva [aut], Lanabi la Lova [aut], Lungta Seki [aut] |
Maintainer: | Kohei Watanabe |
License: | MIT + file LICENSE |
Version: | 0.9.1 |
Built: | 2024-11-08 04:43:13 UTC |
Source: | https://github.com/koheiw/newsmap |
Evaluate classification accuracy in precision and recall
accuracy(x, y)
x |
vector of predicted classes |
y |
vector of true classes |
class_pred <- c('US', 'GB', 'US', 'CN', 'JP', 'FR', 'CN') # prediction
class_true <- c('US', 'FR', 'US', 'CN', 'KP', 'EG', 'US') # true class
acc <- accuracy(class_pred, class_true)
print(acc)
summary(acc)
AFE (average feature entropy) measures the randomness of feature occurrences in labelled documents.
afe(x, y, smooth = 1)
x |
a dfm for features |
y |
a dfm for labels |
smooth |
a numeric value for smoothing to include all the features |
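As a rough illustration of what "randomness of occurrences" can mean here, the sketch below computes the Shannon entropy of each feature's distribution over labels in base R and then averages it. The function name feature_entropy() and the toy counts are hypothetical, and the exact formula used by afe() may differ. The sketch only shows why a smoothing value keeps features with zero counts in the calculation.

```r
# Illustration only: NOT the package's implementation of afe().
# counts: labels-by-features matrix of co-occurrence counts (hypothetical data).
feature_entropy <- function(counts, smooth = 1) {
  counts <- counts + smooth                        # avoids log2(0) for unseen pairs
  prob <- sweep(counts, 2, colSums(counts), "/")   # each feature's distribution over labels
  -colSums(prob * log2(prob))                      # Shannon entropy per feature (bits)
}

counts <- cbind(washington = c(US = 10, GB = 0),
                minister   = c(US = 5,  GB = 5))
ent <- feature_entropy(counts)
print(ent)        # "minister" is spread evenly over labels, so its entropy is higher
print(mean(ent))  # average over features
```

A feature that occurs with one label only has low entropy and is a sharper geographical cue than a feature that is spread evenly across labels.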
Extract coefficients for features
## S3 method for class 'textmodel_newsmap'
coef(object, n = 10, select = NULL, ...)

## S3 method for class 'textmodel_newsmap'
coefficients(object, n = 10, select = NULL, ...)
object |
a Newsmap model fitted by textmodel_newsmap(). |
n |
the number of coefficients to extract. |
select |
returns the coefficients for the selected class; specify by the
names of rows in |
... |
not used. |
Seed geographical dictionary in Arabic
Dai Yamao
Seed geographical dictionary in German
Stefan Müller
Seed geographical dictionary in English
Kohei Watanabe
Seed geographical dictionary in Spanish
Dani Madrid-Morales
Seed geographical dictionary in French
Claude Grasland
Seed geographical dictionary in Hebrew
Elad Segev
Seed geographical dictionary in Italian
Giuseppe Carteny
Seed geographical dictionary in Japanese
Kohei Watanabe
Seed geographical dictionary in Portuguese
Barbara Ellynes Zucchi Nobre Silva
Seed geographical dictionary in Russian
Katerina Tertytchnaya
Lanabi la Lova
Seed geographical dictionary in Turkish
Lungta Seki
Seed geographical dictionary in Chinese (simplified)
Ke Cheng
Seed geographical dictionary in Chinese (traditional)
Chung-hong Chan
Predict document class using a trained Newsmap model
## S3 method for class 'textmodel_newsmap'
predict(
  object,
  newdata = NULL,
  confidence = FALSE,
  rank = 1L,
  type = c("top", "all"),
  rescale = FALSE,
  min_conf = -Inf,
  min_n = 0L,
  ...
)
object |
a fitted Newsmap textmodel. |
newdata |
dfm on which prediction should be made. |
confidence |
if |
rank |
rank of the class to be predicted. Only used when |
type |
if |
rescale |
if |
min_conf |
return |
min_n |
set the minimum number of polarity words in documents. |
... |
not used. |
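To make the role of rank concrete, the base R sketch below picks the class with the highest (or second-highest) score per document from a document-by-class score matrix. The matrix values and the helper pick_class() are hypothetical and are not part of the package. How predict() computes and filters its scores internally is not shown here.

```r
# Hypothetical document-by-class score matrix (higher means more likely).
scores <- rbind(text1 = c(IE = 1.2,  GB = 0.4,  KR = -0.9),
                text2 = c(IE = -0.5, GB = -0.1, KR = 2.3))

# Hypothetical helper: name of the class at the given rank for each document.
pick_class <- function(scores, rank = 1L) {
  apply(scores, 1, function(s) names(sort(s, decreasing = TRUE))[rank])
}

top_class <- pick_class(scores)         # most likely class per document
second_class <- pick_class(scores, 2L)  # runner-up per document
print(top_class)
print(second_class)
```

Returning the full score matrix instead of a single class per document corresponds to the idea behind type = "all".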
This function calculates micro-average precision (p) and recall (r) and
macro-average precision (P) and recall (R) based on a confusion matrix from
accuracy().
## S3 method for class 'textmodel_newsmap_accuracy'
summary(object, ...)
object |
output of accuracy() |
... |
not used. |
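The difference between the micro and macro averages can be sketched in base R with the standard textbook definitions. The helper avg_precision_recall() below is hypothetical and is not the package's code. In particular, dropping classes whose precision or recall is undefined (never predicted or never true) is an assumption of this sketch.

```r
# Hypothetical helper, not part of newsmap: micro- and macro-averaged
# precision and recall from vectors of predicted and true classes.
avg_precision_recall <- function(pred, true) {
  lev <- sort(unique(c(pred, true)))
  tab <- table(factor(pred, levels = lev), factor(true, levels = lev))
  tp <- diag(tab)            # correctly predicted, per class
  fp <- rowSums(tab) - tp    # predicted as the class, but truly another
  fn <- colSums(tab) - tp    # truly the class, but predicted as another
  c(p = sum(tp) / sum(tp + fp),               # micro: pool counts over classes
    r = sum(tp) / sum(tp + fn),
    P = mean(tp / (tp + fp), na.rm = TRUE),   # macro: average per-class values,
    R = mean(tp / (tp + fn), na.rm = TRUE))   # skipping undefined (0/0) classes
}

class_pred <- c('US', 'GB', 'US', 'CN', 'JP', 'FR', 'CN')
class_true <- c('US', 'FR', 'US', 'CN', 'KP', 'EG', 'US')
res <- avg_precision_recall(class_pred, class_true)
print(round(res, 3))
```

Micro averages weight every document equally, while macro averages weight every class equally, so rare classes influence P and R more than p and r.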
Train a Newsmap model to predict geographical focus of documents with labels given by a dictionary.
textmodel_newsmap(
  x,
  y,
  label = c("all", "max"),
  smooth = 1,
  boolean = FALSE,
  drop_label = TRUE,
  verbose = quanteda_options("verbose"),
  entropy = c("none", "global", "local", "average"),
  ...
)
x |
a dfm or fcm created by |
y |
a dfm or a sparse matrix that records the class membership of the
documents. It can be created applying |
label |
if "max", uses only labels for the maximum value in each row of
|
smooth |
a value added to the frequency of words to smooth likelihood ratios. |
boolean |
if |
drop_label |
if |
verbose |
if |
entropy |
[experimental] the scheme to compute the entropy to
regularize likelihood ratios. The entropy of features is computed over
labels if |
... |
additional arguments passed to internal functions. |
Newsmap learns associations between words and classes as likelihood
ratios based on the features in x and the labels in y. Large likelihood
ratios tend to concentrate in a small number of features, but the
entropy of their frequencies over labels or documents helps to disperse the
distribution.
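The idea of a smoothed likelihood ratio can be sketched in base R with small dense matrices. This is a simplified illustration under stated assumptions, not the package's implementation: the helper lr_scores() and the toy data are hypothetical, and the entropy-based regularization is left out.

```r
# Simplified sketch: smoothed log-likelihood ratios per label and feature.
# x: documents-by-features counts; y: documents-by-labels membership (0/1).
lr_scores <- function(x, y, smooth = 1) {
  total <- colSums(x)
  scores <- sapply(colnames(y), function(k) {
    freq_k <- colSums(x[y[, k] > 0, , drop = FALSE])
    in_k <- freq_k + smooth            # feature counts in documents with label k
    out_k <- total - freq_k + smooth   # feature counts in all other documents
    log(in_k / sum(in_k)) - log(out_k / sum(out_k))
  })
  t(scores)                            # labels in rows, features in columns
}

# Hypothetical toy data: two documents, three features, two labels.
x <- rbind(doc1 = c(dublin = 2, seoul = 0, minister = 1),
           doc2 = c(dublin = 0, seoul = 3, minister = 1))
y <- rbind(doc1 = c(IE = 1, KR = 0),
           doc2 = c(IE = 0, KR = 1))
lr <- lr_scores(x, y)
print(round(lr, 3))
```

Features tied to one label ("dublin", "seoul") receive large positive scores for that label and negative scores elsewhere, while a feature shared by both labels ("minister") stays close to zero.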
Kohei Watanabe. 2018. "Newsmap: semi-supervised approach to geographical news classification." Digital Journalism 6(3): 294-309.
require(quanteda)
text_en <- c(text1 = "This is an article about Ireland.",
             text2 = "The South Korean prime minister was re-elected.")
toks_en <- tokens(text_en)
label_toks_en <- tokens_lookup(toks_en, data_dictionary_newsmap_en, levels = 3)
label_dfm_en <- dfm(label_toks_en)
feat_dfm_en <- dfm(toks_en, tolower = FALSE)
model_en <- textmodel_newsmap(feat_dfm_en, label_dfm_en)
predict(model_en)