Introduction
A large amount of textual content is generated online in the form of news articles, blog posts and technical reports/papers. Most of this content is rich in text but is seldom labelled. Moreover, even when labels exist, they may differ across platforms: a news article may be labelled "entertainment" on one platform and the same article "sports" on another. It therefore becomes important to develop techniques that automatically assign labels to documents based on the text they contain. In information retrieval terms, this boils down to the following supervised learning problem: given a corpus of documents (D) along with their labels (which may be user defined), come up with a method to automatically label a new document (x). Many methods have been proposed in the IR literature, such as SVMs, Neural Networks, Linear Least Squares Fit, Naive Bayes and kNN classifiers. In this blog post we will take a look at each of these models in detail and also compare their performance as reported in the literature on the benchmark Reuters-21578 dataset.

History of Text Classification
Other than the statistical methods mentioned in the introduction, some other methods have also been proposed, such as CONSTRUE, Decision Trees, Rocchio and WORD. Decision Trees have been used to identify informative words by calculating the information gain each word provides in classification. CONSTRUE is a non-learning method which relies on expert annotations, so it clearly doesn't scale. Rocchio uses the vector space model to represent each category with a vector, which is a weighted sum of the vectors of the documents belonging to that category. This is quick in terms of computation but fails when documents within a category tend to form different clusters. Lastly, WORD simply ranks the categories for a document based on word matching with the category name. While all these models had been tried out by various researchers in the early and mid 1990s, it wasn't clear what the state-of-the-art in text classification was, as the results were reported on different datasets and looked at different metrics. It was only in 1999 that Yiming Yang of Carnegie Mellon University gave an empirical large-scale analysis of the methods by using the same dataset for cross-method evaluation. She found that statistical methods not only scale better than non-learning algorithms but also give better accuracies and F1 scores. The rest of the blog will summarise her work on text categorisation and will focus only on statistical learning methods.

Models
Before we dig into how these models work, we should get familiar with some jargon. The Vector Space Model (VSM) refers to the model discussed in the lecture, where each entry in a document's vector corresponds to a dictionary term and is defined using the tf-idf score of that term in the document. Each vector xi represents the document di, where D = {d1, d2, ..., dn} is the set of all annotated documents. Consequently, X = {x1, x2, ..., xn} is the set of vectors corresponding to these documents.
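As a concrete illustration, here is a minimal sketch of building such tf-idf vectors with scikit-learn; the three documents below are made-up stand-ins, not part of any real corpus.

```python
# Minimal sketch: building the VSM (tf-idf vectors) with scikit-learn.
# The toy corpus below is a made-up stand-in for the annotated set D.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "stocks fell sharply in early trading",
    "the team won the championship game",
    "quarterly earnings beat analyst forecasts",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # row i is the vector xi for document di

print(X.shape)                      # (n documents, m dictionary terms)
```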
1. Support Vector Machines (SVMs): SVMs are a well-known machine learning model which learns a maximum-margin hyper-plane given a training set with binary labels. Mathematically, it aims to learn w and b that solve

min || w ||²

subject to the constraints

yi (w . xi - b) ≥ 1 for every xi in X
Here yi represents the label of the i-th document: it's 1 if the document belongs to the positive class and -1 otherwise.
Once the weights are learned, a new document x' is assigned to the positive class if w . x' - b > 0 and to the negative class otherwise. This binary model can be scaled to multi-class classification using one-vs-all or one-vs-one techniques.
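A hedged sketch of this setup with scikit-learn is below; LinearSVC fits one hyper-plane per class in one-vs-all fashion, and the tiny corpus and labels are invented for illustration.

```python
# Sketch: maximum-margin classification over tf-idf vectors.
# LinearSVC trains one-vs-all hyper-planes (w, b), one per category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = ["stocks fell sharply today", "the team won the final game"]  # toy data
train_labels = ["finance", "sports"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_docs, train_labels)

# The sign of w.x' - b decides the side of the hyper-plane for a new document x'.
print(clf.predict(["the team lost the game"]))
```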
2. Neural Networks (NNet): Here the document labels are kept in one-hot encoded format and a non-linear mapping is learned between the input document vector x and the label. Many different network architectures have been tried by varying the number of hidden layers and the number of nodes. It has been found that a shallow neural net (one hidden layer) usually gives good performance, where the number of hidden nodes (k) is treated as a hyper-parameter and found using cross-validation.
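As a sketch, a one-hidden-layer network of this kind could look as follows in scikit-learn; the hidden size of 64 is an arbitrary placeholder for the cross-validated k, and the data is fabricated.

```python
# Sketch: a shallow (single hidden layer) network over tf-idf inputs.
# hidden_layer_sizes=(64,) stands in for the cross-validated k.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

train_docs = ["stocks fell sharply today", "the team won the final game"]  # toy data
train_labels = ["finance", "sports"]

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)

nnet = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
nnet.fit(X, train_labels)

print(nnet.predict(vec.transform(["the final game was close"])))
```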
3. Linear Least Squares Fit (LLSF): In this technique a multivariate regression model is learned from the training set. As the name suggests, the model learns weights that minimise the squared error between its predictions and the training targets. The learning problem is formulated as:

min || A F - B ||²
Here F is the weight matrix of size m x k, A is the input matrix of size n x m and B is the one-hot encoded label matrix of size n x k, where n is the number of documents, k is the number of classes and m is the size of each document's vector representation. Once the best weights F are learned (either via the pseudo-inverse or by gradient descent), a new document x can be classified by thresholding the k category scores in Fᵀx against per-category constants b.
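A rough NumPy sketch of this fit, with invented toy dimensions, is shown below; np.linalg.lstsq solves the least-squares problem directly rather than forming an explicit inverse.

```python
# Sketch: LLSF as a multivariate least-squares fit, min || A F - B ||^2.
# Sizes and data are toy stand-ins: n docs, m terms, k classes.
import numpy as np

n, m, k = 6, 10, 3
rng = np.random.default_rng(0)
A = rng.random((n, m))                     # n x m matrix of document vectors
B = np.eye(k)[rng.integers(0, k, n)]       # n x k one-hot label matrix

F, *_ = np.linalg.lstsq(A, B, rcond=None)  # m x k weight matrix

x_new = rng.random(m)                      # a new document's vector
scores = x_new @ F                         # one score per category; threshold or argmax
print(scores.argmax())
```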
4. Naive Bayes (NB): Naive Bayes is widely used as a baseline in the machine learning literature. It is a probabilistic model which uses Bayes' theorem to assign class probabilities given the prior and likelihood probabilities. The naive assumption in this model is the independence of the features, hence the name "Naive Bayes". The class probability for document classification is modelled as:

P(y | x1, ..., xn) ∝ P(y) P(x1, ..., xn | y)

Using the independence assumption:

P(y | x1, ..., xn) ∝ P(y) P(x1 | y) P(x2 | y) ... P(xn | y)

For us, xi corresponds to a word in a document and y is the label assigned to the document. We can find P(y=0 | x1,...,xn), P(y=1 | x1,...,xn), ..., P(y=m | x1,...,xn) for m labels using this assumption, and then predict the document's class as whichever label gives the maximum probability.
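The sketch below shows this with scikit-learn's MultinomialNB over raw word counts; the documents and labels are fabricated for the example.

```python
# Sketch: Naive Bayes over bag-of-words counts.
# MultinomialNB combines the class prior with per-word likelihoods
# under the independence assumption described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["stocks fell sharply today", "the team won the final game"]  # toy data
train_labels = ["finance", "sports"]

vec = CountVectorizer()
X = vec.fit_transform(train_docs)

nb = MultinomialNB()
nb.fit(X, train_labels)

# P(y | x1, ..., xn) for each label y; predict the argmax.
print(nb.predict_proba(vec.transform(["the team played a great game"])))
```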
5. k Nearest Neighbours (kNN): In this method we have a vectorised representation of the training documents. Given a document to be classified, we identify the k nearest training vectors (where k is user defined and may be found using grid search) and then take a weighted average of the categories of these k vectors to score the document in question. Say a new document x needs to be classified: we first compute the weights, which are simply the cosine similarity scores of x with each of the k vectors. Then, for each category, we pick the vectors among the k which carry that label and sum up their weights. This gives a likelihood score of x belonging to that category. After calculating this likelihood for each label, we get a ranked list of categories the document x may belong to, and by thresholding these scores we can make the labels dichotomous. Mathematically, the decision rule can be written as:

score(x, cj) = Σ over the k nearest documents d of sim(x, d) . y(d, cj) - bj
Here sim(x, d) is the cosine similarity of the new document x with training document d (which belongs to D), y(d, cj) is 1 if d carries label cj and 0 otherwise, and bj is the thresholding constant for category j. This constant can be learned using cross-validation.
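Here is a from-scratch sketch of this decision rule, assuming the training vectors are unit-normalised so that a dot product equals cosine similarity; all names and data are illustrative.

```python
# Sketch: similarity-weighted kNN scoring. score(x, cj) sums sim(x, d) over
# the k nearest neighbours d of x that carry label cj (threshold with bj later).
import numpy as np

def knn_category_scores(x, X_train, labels, k, n_classes):
    sims = X_train @ x                 # cosine similarities (rows are unit-norm)
    nearest = np.argsort(sims)[-k:]    # indices of the k most similar documents
    scores = np.zeros(n_classes)
    for i in nearest:
        scores[labels[i]] += sims[i]   # accumulate sim(x, d) per category
    return scores

rng = np.random.default_rng(1)
X_train = rng.random((8, 5))
X_train /= np.linalg.norm(X_train, axis=1, keepdims=True)  # unit-normalise rows
labels = rng.integers(0, 3, size=8)    # toy category labels in {0, 1, 2}

x = rng.random(5)
x /= np.linalg.norm(x)
print(knn_category_scores(x, X_train, labels, k=3, n_classes=3))
```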
Results
Most of the work prior to Yiming Yang's analysis used metrics like accuracy, precision, recall and F1 score, often inconsistently across publications. Also, most of the evaluation was done on balanced datasets obtained by conveniently sub-sampling the larger Reuters corpus. However, the distribution of labels in the real world is mostly skewed, and hence it's important to evaluate these models on such datasets as well. To account for skewness, micro- and macro-level averaging of the metrics also needs to be done. On doing the micro-level analysis, it was found that SVMs work slightly better than kNN, which works much better than LLSF and NNet, both of which give a similar performance. NB works the worst of all the models due to its very strong assumption of word independence. To summarise:

SVM > kNN >> {LLSF, NNet} >> NB
However, if one only looks at the macro-averaged F1 scores, the insights obtained are quite different (and, taken alone, misleading): SVM, kNN and LLSF become indistinguishable from one another, all significantly outperforming NNet and NB:

{SVM, kNN, LLSF} >> {NNet, NB}
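To see concretely why the two averages diverge on skewed data, here is a tiny fabricated example: a classifier that always predicts the majority class looks strong under micro averaging but is exposed by macro averaging.

```python
# Sketch: micro vs macro F1 on a skewed label distribution (fabricated data).
from sklearn.metrics import f1_score

y_true = [0] * 9 + [1]   # nine majority-class documents, one rare-class document
y_pred = [0] * 10        # a classifier that never predicts the rare class

print(f1_score(y_true, y_pred, average="micro"))                   # ~0.90
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.47
```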
The entire analysis is beyond the scope of this blog, but if interested I would urge you to read Yiming Yang's paper at SIGIR'99 [1].