Multi Aspect Term Frequency: A Novel Approach in TF-IDF Weighting Schemes
Introduction:

Term weighting schemes play a pivotal role in data science today. Be it machine learning, natural language processing, or information retrieval, these techniques help filter out redundant or uninformative content and bring to light the anchor terms in a body of text. The importance of a term is measured by its term frequency (TF) in the document (D), the size of the document, and the specificity of the term in the collection. The input text is filtered on the basis of the scheme used, and a bag of rare, descriptive words is output. E.g., "He said he was happy and that he'd like a sundae" returns {"happy", "sundae"}.
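As a rough sketch of that filtering idea (a plain TF-IDF score, not the MATF scheme described below; the toy corpus and the top-k selection are my own assumptions):

```python
import math
from collections import Counter

def tfidf_keywords(doc, corpus, top_k=2):
    """Return the top_k most distinctive words of `doc` relative to
    `corpus`, scored by raw TF times smoothed IDF (toy sketch only)."""
    docs = [d.lower().split() for d in corpus]
    tf = Counter(doc.lower().split())
    n = len(corpus)

    def idf(word):
        df = sum(1 for d in docs if word in d)  # document frequency
        return math.log((1 + n) / (1 + df))    # rare words score high

    ranked = sorted(tf, key=lambda w: tf[w] * idf(w), reverse=True)
    return ranked[:top_k]

corpus = [
    "he said that he was sad",
    "she said she would like a coffee",
    "he was there and he said so",
]
keywords = tfidf_keywords(
    "he said he was happy and that he would like a sundae", corpus)
```

Against this small corpus the common function words score near zero, so the rare, descriptive words come out on top.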
Context:

Most weighting schemes focus on only a single aspect of term normalization: either document length normalization or relative term frequency normalization. Length-based normalization reduces the bias of term frequency towards larger documents, whereas relative term frequency based normalization favours the rarer words in a document, which are often more descriptive of its content. Hence, most results produced by these schemes are biased either towards large documents that match more query terms, or towards shorter documents that achieve higher relative frequencies simply because of their smaller size.
To tackle this issue, a novel scheme, Multi Aspect TF (MATF), was proposed that incorporates both of the above features, with a weight assigned to each. The weights assigned are heuristics inferred from empirical studies. As a result, a good balance is obtained when ranking retrieved documents, both large and small. MATF also uses the input query length to normalize the number of matched terms in a document. E.g., if a small query matches the same number of terms in a document as a large query does, the document should rank higher in the results returned by the small query.
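The article gives no formula for this query-length factor; the simplest illustration of the stated principle is the fraction of query terms matched (function and variable names here are mine, not from the paper):

```python
def matched_fraction(query_terms, doc_terms):
    """Fraction of the query's distinct terms found in the document;
    the same raw match count weighs more under a shorter query."""
    query = set(query_terms)
    return len(query & set(doc_terms)) / len(query)

doc = ["term", "weighting", "schemes", "for", "retrieval"]
short_q = matched_fraction(["term", "weighting"], doc)                    # 2 of 2 match
long_q = matched_fraction(["term", "weighting", "ranking", "bm25"], doc)  # 2 of 4 match
```

The document matches two terms of each query, but the shorter query rewards it more, as described above.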
Performance of MATF:

Given a query Q and a document D, a term weighting scheme aims to compute a score for each document based on the number of query terms captured in the document. The main objective of a term weighting scheme is to quantify the salience of the query terms in the document. There are three hypotheses for quantifying the importance of a term in a generic TF-IDF scheme:
- Term Frequency Hypothesis (TFH): the weight of a term should increase with its frequency; however, the relationship should not grow linearly, so a damped version log(TF) is used instead of raw TF.
- Advanced TF Hypothesis (AD-TFH): the rate of change in a term's weight should decrease at larger TF values; e.g., an increase of TF from 3 to 4 is more significant than an increase from 20 to 21.
- Document Length Hypothesis (DLH): long documents are more likely to contain terms with higher frequency. E.g., for two documents of different lengths with the same TF value for a term, the contribution of TF should be higher for the shorter document.
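TFH and AD-TFH together say the damping curve should grow, but ever more slowly. A quick check with log2 damping (the base used in the MATF formulas):

```python
import math

def damped_tf(tf):
    """Log-damped term frequency (TFH); its concavity gives AD-TFH."""
    return math.log2(1 + tf)

# the jump from TF=3 to TF=4 adds more weight than from TF=20 to TF=21
low_gain = damped_tf(4) - damped_tf(3)
high_gain = damped_tf(21) - damped_tf(20)
```

The weight still increases with TF (TFH), but each extra occurrence contributes less than the last (AD-TFH).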
To account for the above hypotheses, MATF takes two metrics into consideration: Length Regularized TF (LRTF) and Relative Intra-document TF (RITF).

LRTF(t, D) = TF(t, D) × log2(1 + ADL(C) / len(D))

RITF(t, D) = log2(1 + TF(t, D)) / log2(1 + Avg.TF(D))

[ADL(C) is the average document length of the collection C, Avg.TF(D) is the average term frequency in D, 't' is the term, and 'D' is the document.]
MATF is a weighted combination of these two metrics, and hence accounts for both of the requirements discussed above.
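Putting the two formulas into code (names are mine; the interpolation weight w is an assumed constant for illustration, since the article does not say how MATF derives its weights):

```python
import math

def lrtf(tf, doc_len, adl):
    """Length Regularized TF: raw TF scaled up for documents
    shorter than the collection's average length ADL(C)."""
    return tf * math.log2(1 + adl / doc_len)

def ritf(tf, avg_tf):
    """Relative Intra-document TF: log-damped TF normalized by
    the document's average term frequency Avg.TF(D)."""
    return math.log2(1 + tf) / math.log2(1 + avg_tf)

def matf_tf(tf, doc_len, adl, avg_tf, w=0.5):
    """Weighted combination of the two aspects; w=0.5 is an
    assumption for the sketch, not the paper's actual weighting."""
    return w * ritf(tf, avg_tf) + (1 - w) * lrtf(tf, doc_len, adl)
```

Per DLH, the same TF contributes more in a shorter document, e.g. lrtf(5, 100, 200) > lrtf(5, 400, 200).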
Comparisons and Results:

In the study, MATF was compared with two pre-existing models, Pivot TF-IDF and Lemur TF-IDF, in three scenarios: news article collections (TREC-678), webpage collections (W10G), and query data sets (MQ-07).
The results of the analysis were clearly in favour of MATF, with solid margins of performance gain across every collection. Some precision statistics:
| Method | TREC-678 | W10G | MQ-07 |
|---|---|---|---|
| Lemur TF-IDF | 20.9 | 18.4 | 39.6 |
| Pivot TF-IDF | 21.5 | 20.5 | 40.0 |
| MATF | 23.5 | 22.2 | 44.2 |
| % better than Lemur TF-IDF | 12.0 | 20.7 | 11.6 |
| % better than Pivot TF-IDF | 8.8 | 8.3 | 10.5 |
Conclusion:

MATF performed significantly better than the existing TF-IDF models, and the statistics back up the reasoning that founded this new scheme. Since TF-IDF models have many applications, MATF can be deemed a useful improvement that can lead to better results and performance.