Package 'newsmap'

Title: Semi-Supervised Model for Geographical Document Classification
Description: Semi-supervised model for geographical document classification (Watanabe 2018) <doi:10.1080/21670811.2017.1293487>. This package currently contains seed dictionaries in English, German, French, Spanish, Portuguese, Italian, Russian, Hebrew, Arabic, Turkish, Japanese and Chinese (Simplified and Traditional).
Authors: Kohei Watanabe [aut, cre, cph], Stefan Müller [aut], Dani Madrid-Morales [aut], Katerina Tertytchnaya [aut], Ke Cheng [aut], Chung-hong Chan [aut], Claude Grasland [aut], Giuseppe Carteny [aut], Elad Segev [aut], Dai Yamao [aut], Barbara Ellynes Zucchi Nobre Silva [aut], Lanabi la Lova [aut], Lungta Seki [aut]
Maintainer: Kohei Watanabe
License: MIT + file LICENSE
Version: 0.9.1
Built: 2024-11-08 04:43:13 UTC
Source: https://github.com/koheiw/newsmap

Help Index


Evaluate classification accuracy in precision and recall

Description

Evaluate classification accuracy in precision and recall

Usage

accuracy(x, y)

Arguments

x

vector of predicted classes

y

vector of true classes

Examples

class_pred <- c('US', 'GB', 'US', 'CN', 'JP', 'FR', 'CN') # prediction
class_true <- c('US', 'FR', 'US', 'CN', 'KP', 'EG', 'US') # true class
acc <- accuracy(class_pred, class_true)
print(acc)
summary(acc)
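The quantities behind this evaluation can be sketched in base R. This is only an illustration of per-class precision and recall computed from the two vectors above; whether accuracy() derives them in exactly this way, and how it names its output columns, is an assumption. The names lev, conf, tp, fp and fn are introduced here for illustration.

```r
class_pred <- c("US", "GB", "US", "CN", "JP", "FR", "CN") # prediction
class_true <- c("US", "FR", "US", "CN", "KP", "EG", "US") # true class

# Use a common set of levels so that the confusion matrix is square.
lev <- sort(union(class_pred, class_true))
conf <- table(predicted = factor(class_pred, lev),
              true = factor(class_true, lev))

tp <- diag(conf)           # documents correctly assigned to each class
fp <- rowSums(conf) - tp   # predicted as the class but truly another class
fn <- colSums(conf) - tp   # truly the class but predicted as another class

precision <- tp / (tp + fp) # NaN when the class is never predicted
recall <- tp / (tp + fn)    # NaN when the class never occurs in the truth
print(data.frame(tp, fp, fn, precision, recall))
```

Classes that appear in only one of the two vectors (for example 'GB' or 'EG') produce a 0/0 division for one of the measures, which is why NaN values show up in the printed table.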

Compute average feature entropy (AFE)

Description

AFE computes the randomness of feature occurrences in labelled documents.

Usage

afe(x, y, smooth = 1)

Arguments

x

a dfm for features

y

a dfm for labels

smooth

a numeric value for smoothing to include all the features
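The exact formula used by afe() is not given on this page. As a rough illustration of the idea only, the sketch below computes the Shannon entropy of smoothed feature counts across labels and averages it over features. The function feature_entropy() and the count matrix are hypothetical and are not part of the package.

```r
# Hypothetical counts: rows are labels, columns are features.
counts <- matrix(c(10, 0, 0,   # "dublin" occurs under one label only
                   5, 5, 5),   # "government" is spread evenly over labels
                 nrow = 3,
                 dimnames = list(c("ie", "kr", "jp"),
                                 c("dublin", "government")))

feature_entropy <- function(counts, smooth = 1) {
  smoothed <- counts + smooth                  # smooth > 0 avoids log2(0)
  p <- t(t(smoothed) / colSums(smoothed))      # proportions within each feature
  -colSums(p * log2(p))                        # Shannon entropy per feature
}

ent <- feature_entropy(counts, smooth = 1)
print(ent)        # the evenly spread feature has the higher entropy
print(mean(ent))  # average over features
```

In this reading, a lower average means that features are concentrated in fewer labels, that is, they occur less randomly with respect to the labels.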


Extract coefficients for features

Description

Extract coefficients for features

Usage

## S3 method for class 'textmodel_newsmap'
coef(object, n = 10, select = NULL, ...)

## S3 method for class 'textmodel_newsmap'
coefficients(object, n = 10, select = NULL, ...)

Arguments

object

a Newsmap model fitted by textmodel_newsmap().

n

the number of coefficients to extract.

select

returns the coefficients for the selected class; specify by the names of rows in object$model.

...

not used.
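A short usage sketch, assuming the quanteda and newsmap packages are installed; the two example sentences are reused from the textmodel_newsmap() examples. Class names are read from the row names of object$model rather than hard-coded, because the exact labels depend on how the label dfm was built.

```r
if (requireNamespace("quanteda", quietly = TRUE) &&
    requireNamespace("newsmap", quietly = TRUE)) {
  library(quanteda)
  library(newsmap)

  toks <- tokens(c(text1 = "This is an article about Ireland.",
                   text2 = "The South Korean prime minister was re-elected."))
  label_dfm <- dfm(tokens_lookup(toks, data_dictionary_newsmap_en, levels = 3))
  feat_dfm <- dfm(toks, tolower = FALSE)
  model <- textmodel_newsmap(feat_dfm, label_dfm)

  classes <- rownames(model$model)  # class names accepted by `select`
  print(coef(model, n = 5))         # up to five coefficients for every class
  top <- coef(model, n = 5, select = classes[1])  # first class only
  print(top)
}
```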


Seed geographical dictionary in Arabic

Description

Seed geographical dictionary in Arabic

Author(s)

Dai Yamao


Seed geographical dictionary in German

Description

Seed geographical dictionary in German

Author(s)

Stefan Müller


Seed geographical dictionary in English

Description

Seed geographical dictionary in English

Author(s)

Kohei Watanabe


Seed geographical dictionary in Spanish

Description

Seed geographical dictionary in Spanish

Author(s)

Dani Madrid-Morales


Seed geographical dictionary in French

Description

Seed geographical dictionary in French

Author(s)

Claude Grasland


Seed geographical dictionary in Hebrew

Description

Seed geographical dictionary in Hebrew

Author(s)

Elad Segev


Seed geographical dictionary in Italian

Description

Seed geographical dictionary in Italian

Author(s)

Giuseppe Carteny


Seed geographical dictionary in Japanese

Description

Seed geographical dictionary in Japanese

Author(s)

Kohei Watanabe


Seed geographical dictionary in Portuguese

Description

Seed geographical dictionary in Portuguese

Author(s)

Barbara Ellynes Zucchi Nobre Silva


Seed geographical dictionary in Russian

Description

Seed geographical dictionary in Russian

Author(s)

Katerina Tertytchnaya

Lanabi la Lova


Seed geographical dictionary in Turkish

Description

Seed geographical dictionary in Turkish

Author(s)

Lungta Seki


Seed geographical dictionary in Chinese (simplified)

Description

Seed geographical dictionary in Chinese (simplified)

Author(s)

Ke Cheng


Seed geographical dictionary in Chinese (traditional)

Description

Seed geographical dictionary in Chinese (traditional)

Author(s)

Chung-hong Chan


Prediction method for textmodel_newsmap

Description

Predict document class using a trained Newsmap model

Usage

## S3 method for class 'textmodel_newsmap'
predict(
  object,
  newdata = NULL,
  confidence = FALSE,
  rank = 1L,
  type = c("top", "all"),
  rescale = FALSE,
  min_conf = -Inf,
  min_n = 0L,
  ...
)

Arguments

object

a fitted Newsmap textmodel.

newdata

dfm on which prediction should be made.

confidence

if TRUE, returns the likelihood ratio scores.

rank

rank of the class to be predicted. Only used when type = "top".

type

if "top", returns the class at the position given by rank (the most likely class when rank = 1); otherwise returns a matrix of likelihood ratio scores for all possible classes.

rescale

if TRUE, likelihood ratio scores are normalized using scale(). This affects both types of results.

min_conf

returns NA when the confidence is lower than this value.

min_n

set the minimum number of polarity words in documents.

...

not used.
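The arguments above can be combined as in the following sketch, which assumes that quanteda and newsmap are installed and reuses the two sentences from the textmodel_newsmap() examples. It only calls arguments documented on this page; the structure of each returned object is not asserted here.

```r
if (requireNamespace("quanteda", quietly = TRUE) &&
    requireNamespace("newsmap", quietly = TRUE)) {
  library(quanteda)
  library(newsmap)

  toks <- tokens(c(text1 = "This is an article about Ireland.",
                   text2 = "The South Korean prime minister was re-elected."))
  label_dfm <- dfm(tokens_lookup(toks, data_dictionary_newsmap_en, levels = 3))
  feat_dfm <- dfm(toks, tolower = FALSE)
  model <- textmodel_newsmap(feat_dfm, label_dfm)

  print(predict(model))                     # most likely class per document
  print(predict(model, rank = 2))           # second most likely class
  print(predict(model, confidence = TRUE))  # classes with their scores
  scores <- predict(model, type = "all")    # scores for all classes
  print(scores)
  print(predict(model, min_conf = 0))       # NA where confidence is below 0
}
```

Setting rescale = TRUE in any of these calls would normalize the scores with scale() before they are returned or thresholded.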


Calculate micro and macro average measures of accuracy

Description

This function calculates micro-average precision (p) and recall (r) and macro-average precision (P) and recall (R) based on a confusion matrix from accuracy().

Usage

## S3 method for class 'textmodel_newsmap_accuracy'
summary(object, ...)

Arguments

object

output of accuracy()

...

not used.
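The difference between the two kinds of average can be sketched in base R. The per-class counts below are hypothetical, and the helper micro_macro() is not part of the package; in particular, how summary() treats classes with undefined precision or recall is an assumption (they are skipped here with na.rm = TRUE).

```r
# Hypothetical per-class counts of true positives, false positives
# and false negatives for three classes.
tp <- c(a = 8, b = 1, c = 1)
fp <- c(a = 2, b = 1, c = 3)
fn <- c(a = 1, b = 3, c = 2)

micro_macro <- function(tp, fp, fn) {
  c(p = sum(tp) / sum(tp + fp),               # micro-average precision
    r = sum(tp) / sum(tp + fn),               # micro-average recall
    P = mean(tp / (tp + fp), na.rm = TRUE),   # macro-average precision
    R = mean(tp / (tp + fn), na.rm = TRUE))   # macro-average recall
}

avg <- micro_macro(tp, fp, fn)
print(round(avg, 3))
```

Micro averages pool the counts over classes, so frequent classes dominate; macro averages give every class the same weight, so poor performance on rare classes lowers them more.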


Semi-supervised Bayesian multinomial model for geographical document classification

Description

Train a Newsmap model to predict the geographical focus of documents with labels given by a dictionary.

Usage

textmodel_newsmap(
  x,
  y,
  label = c("all", "max"),
  smooth = 1,
  boolean = FALSE,
  drop_label = TRUE,
  verbose = quanteda_options("verbose"),
  entropy = c("none", "global", "local", "average"),
  ...
)

Arguments

x

a dfm or fcm created by quanteda::dfm()

y

a dfm or a sparse matrix that records class membership of the documents. It can be created by applying quanteda::dfm_lookup() to x.

label

if "max", uses only labels for the maximum value in each row of y.

smooth

a value added to the frequency of words to smooth likelihood ratios.

boolean

if TRUE, only considers the presence or absence of features in each document to limit the impact of words repeated in a few documents.

drop_label

if TRUE, drops empty columns of y and ignores their labels.

verbose

if TRUE, shows progress of training.

entropy

[experimental] the scheme to compute the entropy used to regularize likelihood ratios. The entropy of features is computed over labels if "global", or over documents with the same labels if "local". Local entropy is averaged if "average". See the details.

...

additional arguments passed to internal functions.

Details

Newsmap learns associations between words and classes as likelihood ratios based on the features in x and the labels in y. Large likelihood ratios tend to concentrate on a small number of features, but the entropy of their frequencies over labels or documents helps to disperse the distribution.
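To make the idea of likelihood ratios concrete, the base R sketch below computes smoothed log-likelihood ratios from a small dense count matrix and uses them to score documents. It is an illustration under stated assumptions, not the package's implementation: the matrices x and y, the function log_lr() and the scoring step are hypothetical, and entropy regularization is left out.

```r
# Hypothetical feature counts: rows are documents, columns are words.
x <- matrix(c(2, 0, 1,
              0, 3, 1,
              1, 0, 2),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"),
                            c("dublin", "seoul", "minister")))
# Hypothetical labels: TRUE where a seed word of the class matched.
y <- matrix(c(TRUE, FALSE,
              FALSE, TRUE,
              TRUE, FALSE),
            nrow = 3, byrow = TRUE,
            dimnames = list(rownames(x), c("ie", "kr")))

log_lr <- function(x, y, smooth = 1) {
  total <- colSums(x)
  t(sapply(colnames(y), function(key) {
    in_class <- colSums(x[y[, key], , drop = FALSE])
    inside <- in_class + smooth            # word counts in labelled documents
    outside <- total - in_class + smooth   # word counts in all other documents
    log(inside / sum(inside)) - log(outside / sum(outside))
  }))
}

model <- log_lr(x, y)                      # classes in rows, words in columns
print(round(model, 3))

# Score documents by summing the ratios of the words they contain.
scores <- x %*% t(model)
pred <- colnames(scores)[apply(scores, 1, which.max)]
print(pred)
```

Words that occur mostly in documents labelled with a class receive positive values for that class, while words that are evenly spread, such as "minister" here, stay closer to zero.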

References

Kohei Watanabe. 2018. "Newsmap: semi-supervised approach to geographical news classification." Digital Journalism 6(3): 294-309.

Examples

require(quanteda)
text_en <- c(text1 = "This is an article about Ireland.",
             text2 = "The South Korean prime minister was re-elected.")

toks_en <- tokens(text_en)
label_toks_en <- tokens_lookup(toks_en, data_dictionary_newsmap_en, levels = 3)
label_dfm_en <- dfm(label_toks_en)

feat_dfm_en <- dfm(toks_en, tolower = FALSE)

model_en <- textmodel_newsmap(feat_dfm_en, label_dfm_en)
predict(model_en)