Multi Aspect Term Frequency : A Novel Approach in TF-IDF Weighting Schemes



Introduction:


Term weighting schemes play a pivotal role in data science today. Be it machine learning, natural language processing, or information retrieval, these techniques help filter away redundant and/or useless information and bring anchor terms in a body of text to light. The importance of a term is measured by its term frequency (TF) in the document (D), the size of the document, and the specificity of the term in the collection. The input text is filtered on the basis of the scheme used, and a bag of rare, descriptive words is output.

E.g.: "He said he was happy and that he'd like a sundae" returns {"happy", "sundae"}.
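
To make the idea concrete, here is a minimal, hypothetical sketch of a plain TF-IDF keyword extractor over a toy corpus. The documents, the stop-word list, and the IDF smoothing are illustrative choices made for this sketch, not part of the scheme discussed in this article.

```python
import math

# Toy corpus; the target sentence from the example above is the first document.
docs = [
    "he said he was happy and that he'd like a sundae",
    "she said she was happy with the result",
    "they said the weather made everyone happy",
]

# Hypothetical minimal stop-word list, just for this sketch.
STOP = {"he", "she", "they", "he'd", "said", "was", "and", "that", "with", "the", "like", "a"}

corpus_tokens = [[w for w in d.lower().split() if w not in STOP] for d in docs]

def tf_idf(term, doc_tokens, corpus):
    """Plain TF-IDF: raw term frequency times a smoothed inverse document frequency."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in corpus if term in d)           # number of documents containing the term
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1   # smoothed IDF
    return tf * idf

target = corpus_tokens[0]
scores = {t: tf_idf(t, target, corpus_tokens) for t in set(target)}
print(sorted(scores, key=scores.get, reverse=True))    # ['sundae', 'happy']
```

Common words are either stopped out or receive a low IDF, so the rare, descriptive term "sundae" ends up with the highest score.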
Context:

Most weighting schemes focus on only a single aspect of term normalization: either document length normalization or relative term frequency normalization. Length-based normalization reduces the bias of term frequency towards larger documents, whereas relative term frequency normalization promotes rarer words in a document that are likely more descriptive of its content.

Hence, the results of these schemes tend to be biased either towards large documents that match more query terms, or towards shorter documents that have higher relative frequencies simply because of their smaller size.

To tackle this issue, a novel scheme, Multi Aspect TF (MATF), was proposed; it incorporates both of the above features, with a weight assigned to each. The weights are heuristics inferred from empirical studies.

Therefore, a good balance is obtained when ranking retrieved documents, both large and small. MATF also uses the query length to normalize the number of matched terms in a document: for example, if a short query matches the same number of terms in a document as a long query, that document should rank higher in the results returned for the short query.


Performance of MATF:


Given a query Q and a document D, a term weighting scheme assigns each document a score based on the query terms captured in that document. The main objective of a term weighting scheme is to quantify the salience of the query terms in the document.
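
As a rough sketch of that scoring template (with made-up documents and a generic TF × IDF weight standing in for any particular scheme), the score of a document is simply the sum of per-term weights over the query terms it contains:

```python
import math

# Hypothetical toy collection and query, purely for illustration.
docs = {
    "d1": "term weighting schemes rank documents by matched query terms".split(),
    "d2": "weighting matters less in this long narrative document about travel".split(),
}
query = "term weighting".split()

def idf(term):
    """Smoothed inverse document frequency over the toy collection."""
    df = sum(1 for d in docs.values() if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1

def score(query_terms, doc):
    # Sum a TF x IDF weight over the query terms that appear in the document.
    return sum(doc.count(t) * idf(t) for t in query_terms if t in doc)

ranking = sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)
print(ranking)   # ['d1', 'd2'] -- d1 matches both query terms and ranks first
```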

There are three hypotheses for quantifying the importance of a term in a generic TF-IDF scheme (a short numeric sketch after the list illustrates them):

  1. Term Frequency Hypothesis (TFH): The weight of a term should increase with its frequency; however, the relationship is not linear, so a damped version, log(TF), is used instead of the raw TF.

  2. Advanced TF Hypothesis (AD-TFH): The rate of change of a term's weight should decrease as TF grows; e.g., an increase in TF from 3 to 4 is more significant than an increase from 20 to 21.

  3. Document Length Hypothesis (DLH): Long documents are more likely to contain terms with higher frequencies. E.g., for two documents of different lengths with the same TF value for a term, the contribution of that TF should be higher for the shorter document.
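
The short numeric sketch below illustrates the first two hypotheses, plus a naive way to fold in document length. The length_aware_tf helper is an illustrative assumption only, not the formula MATF actually uses (that follows in the next section).

```python
import math

# TFH / AD-TFH: the damped value log2(1 + TF) grows with TF, but with diminishing returns.
for tf in (3, 4, 20, 21):
    print(tf, round(math.log2(1 + tf), 3))
# TF going from 3 to 4 moves the damped value by ~0.32,
# while TF going from 20 to 21 moves it by only ~0.07.

# DLH: with the same raw TF, a shorter document should get more credit.
# A naive illustration: divide the damped TF by the (damped) document length.
def length_aware_tf(tf, doc_len):
    return math.log2(1 + tf) / math.log2(1 + doc_len)

print(length_aware_tf(5, 100), length_aware_tf(5, 1000))   # the shorter document scores higher
```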

To account for the above hypotheses, MATF takes two metrics into consideration: Length Regularized TF (LRTF) and Relative Intra-document TF (RITF).

LRTF(t, D) = TF(t, D) × log2(1 + ADL(C) / len(D))
RITF(t, D) = log2(1 + TF(t, D)) / log2(1 + Avg.TF(D))
[Here 't' is a term, 'D' is a document, TF(t, D) is the frequency of t in D, len(D) is the length of D, ADL(C) is the average document length of the collection C, and Avg.TF(D) is the average term frequency in D.]

MATF is a weighted combination of these two factors, and hence accounts for both of the requirements discussed above.
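
A minimal sketch of the two factors and one plausible way to interpolate them is shown below. The lrtf and ritf functions follow the definitions above, while the query-length-dependent interpolation weight and the x/(1 + x) bounding of each factor are assumptions made for illustration, not a verbatim reproduction of the paper's final formula.

```python
import math
from collections import Counter

def lrtf(tf, avg_doc_len, doc_len):
    """Length Regularized TF: TF(t, D) * log2(1 + ADL(C) / len(D))."""
    return tf * math.log2(1 + avg_doc_len / doc_len)

def ritf(tf, avg_tf_in_doc):
    """Relative Intra-document TF: log2(1 + TF(t, D)) / log2(1 + Avg.TF(D))."""
    return math.log2(1 + tf) / math.log2(1 + avg_tf_in_doc)

def matf_tf_factor(tf, doc_len, avg_doc_len, avg_tf_in_doc, query_len):
    """Weighted combination of the two factors (illustrative interpolation).

    Assumption: shorter queries lean more on RITF, so the weight w shrinks as the
    query grows; each factor is squashed into [0, 1) via x / (1 + x) before mixing.
    """
    w = 2.0 / (1.0 + math.log2(1.0 + query_len))
    l = lrtf(tf, avg_doc_len, doc_len)
    r = ritf(tf, avg_tf_in_doc)
    return w * (r / (1 + r)) + (1 - w) * (l / (1 + l))

# Toy usage: a 12-word document in a collection whose average document length is 300.
doc = "data science uses term weighting for text mining and data retrieval tasks".split()
counts = Counter(doc)
avg_tf = sum(counts.values()) / len(counts)
print(matf_tf_factor(counts["data"], len(doc), 300, avg_tf, query_len=2))
```

With a combination like this, a high raw TF in a long document and a modest TF in a very short one can receive comparable credit, which is exactly the balance described earlier.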

Comparisons and Results:


In the study, MATF was compared with pre-existing models, Pivot TF-IDF and Lemur TF-IDF, in three scenarios: news article collections (TREC-678), web page collections (W10G), and query data sets (MQ-07).

The results of the analysis were clearly in favour of MATF, with clear margins of performance gain across each collection. Some precision statistics:

| Method | TREC-678 | W10G | MQ-07 |
| --- | --- | --- | --- |
| Lemur TF-IDF | 20.9 | 18.4 | 39.6 |
| Pivot TF-IDF | 21.5 | 20.5 | 40.0 |
| MATF | 23.5 | 22.2 | 44.2 |
| % better than Lemur TF-IDF | 12.0 | 20.7 | 11.6 |
| % better than Pivot TF-IDF | 8.8 | 8.3 | 10.5 |



Conclusion:



MATF performed significantly better than existing TF-IDF models, and the statistics back up the logic behind this new scheme. Since TF-IDF models have many applications, MATF can be deemed a useful improvement that can lead to better results and performance.
