Sqrt-Cosine Similarity Measurement

Text Mining and Information Retrieval are among the hottest research topics in the computer industry nowadays. Text similarity measurement
is one of the central concerns within them; it aims to quantify how much text documents have in common. One of the most popular methods
to calculate it is cosine similarity.
But with ever-increasing data, a more effective approach to measuring text similarity is needed.
Recently a couple of researchers came up with an interesting paper that proposes a new method called "sqrt-cosine similarity". So, further in
this blog I will quickly explain some basic terms and cosine similarity, and then for the majority of the blog I will walk you through
sqrt-cosine similarity.

If you already have an idea about the cosine similarity method and some basic terms like tf-idf, then you may want to skip to the sqrt-cosine section.
Otherwise, it is recommended you read through. Before explaining all the technical terms, let's first cover why we need them. Document similarity,
one of the core requirements in IR, is simply a measure of how similar or alike two documents are. This similarity between pairs of
documents can be found using one of the most popular algorithms, the cosine similarity algorithm, which measures the cosine of the angle
between two vectors. In this case, each document is represented as a vector whose direction is determined by its set of tf-idf values in
the space.

Hold up, don't get confused by so many technical terms; let's visit them one by one and try to get the hang of them in a very easy and simple
manner.

Cosine Similarity
One of the basic requirements for finding document similarity is that the tf-idf values are known first. Then the only part that remains is to find the
actual similarity, which can be achieved through various similarity algorithms; here we will only discuss cosine similarity.
Cosine similarity is the measure of the cosine of the angle between two vectors (text documents), each represented as a vector of tf-idf values. The
angle obtained is a measure of the overlap or similarity (based on shared terms) between the documents. You can compute this cosine similarity
by using the dot product. In order to find the cosine similarity between two documents x and y, we first need to normalize each of them to one in the
L2 norm.
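Concretely, normalizing a vector to unit length in the L2 norm just means dividing it by its Euclidean length:

$$ \hat{x} = \frac{x}{\lVert x \rVert_2}, \qquad \lVert x \rVert_2 = \sqrt{\sum_{i=1}^{n} x_i^{2}} $$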

With two normalized vectors x and y, the cosine similarity between them is simply their dot product.
This cosine formula is in fact closely tied to the Euclidean distance.
If you are unfamiliar with the term Euclidean distance, or L2 distance, it refers to the square root of the sum of squared differences
between corresponding elements of the two vectors.
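Written out for n-dimensional vectors x and y (these are the standard textbook definitions, not anything specific to the paper):

$$ \cos(x, y) = \frac{x \cdot y}{\lVert x \rVert_2 \, \lVert y \rVert_2} = \frac{\sum_{i=1}^{n} x_i\, y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\; \sqrt{\sum_{i=1}^{n} y_i^{2}}}, \qquad d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^{2}} $$

For vectors already normalized to unit length, $d(x, y)^2 = 2 - 2\cos(x, y)$, which is the sense in which the cosine formula and the Euclidean distance are tied to each other.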

If the value obtained after calculating the cosine is 1, it means that the documents being compared are completely similar to one another, and
the similarity reduces as the value of the cosine reduces.
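To make this concrete, here is a minimal sketch using scikit-learn (the library choice and the toy documents are my own, not from the paper): it builds tf-idf vectors for a few documents and computes their pairwise cosine similarities.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: three short "documents"
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]

# Represent each document as a vector of tf-idf values
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # one row per document

# Pairwise cosine similarities between all documents
sims = cosine_similarity(tfidf)
print(sims.round(3))   # values near 1 mean very similar, near 0 mean unrelated
```

As expected, the two cat sentences score much closer to 1 with each other than with the stock-market sentence.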

Sqrt-Cosine Similarity
Finally, coming to the main topic: after taking in this much information, I hope you still have an appetite for the main course. Let's get straight to the
point then.
In information retrieval it is very common to encounter high-dimensional data, and this is where the problem
arises. High-dimensional data is a lot more troublesome and messier to work with while the Euclidean distance is still used as the
mechanism. In fact, for higher dimensions, Euclidean distance is not even considered an effective distance measure. This has been
demonstrated by many sources.

This is believed to be because it has been shown that in high dimensions, the distances calculated by various distance functions become highly
concentrated, resulting in the ratio of the distances of the nearest and farthest neighbors to a given target being almost equal to one. This
behavior means there is hardly any variation between the distances of different data points, so the distances stop being informative. It has also been shown by
many researchers that, in practice, it is preferable to use a lower value of k in the Lk norm for high-dimensional data. So, using this result,
you may benefit from using an L1-based distance, like the Hellinger distance, instead of L2 (Euclidean distance) for a high-dimensional application.
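As a quick numerical illustration of this concentration effect (a toy simulation of my own, not taken from the paper), the sketch below samples random points in spaces of increasing dimension and compares the farthest and nearest Euclidean distances from a random query point; as the dimension grows, their ratio shrinks towards one.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n_points=1000):
    """Ratio of the farthest to the nearest Euclidean distance from a random query point."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.max() / dists.min()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:>4}: farthest/nearest ratio ~ {distance_contrast(dim):.2f}")
```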
If you are not satisfied with only this much explanation for using the L1 norm and crave more, don't worry. There are lots of sources all over the
internet with much more detailed explanations. For this blog's purpose, though, this much information is sufficient.
The Hellinger distance formula goes like this:
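For two non-negative vectors x and y it is commonly written as follows (some sources put a constant factor of $1/\sqrt{2}$ in front, which does not matter for ranking documents):

$$ L_H(x, y) = \sqrt{\sum_{i=1}^{n} \left(\sqrt{x_i} - \sqrt{y_i}\right)^{2}} $$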
Using this Hellinger formula, the cosine formula is re-written as:
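The sqrt-cosine similarity is usually stated along the following lines (the exact normalization in the denominator is best double-checked against the referenced paper):

$$ \mathrm{SqrtCos}(x, y) = \frac{\sum_{i=1}^{n} \sqrt{x_i\, y_i}}{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)} $$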
This new formula goes by its own name: the sqrt-cosine similarity measurement.
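A minimal sketch of that formula in Python (assuming non-negative tf-idf style vectors, and following the form written above rather than quoting the paper's exact equation):

```python
import numpy as np

def sqrt_cosine(x, y):
    """Sqrt-cosine similarity between two non-negative vectors (e.g. tf-idf weights)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    numerator = np.sum(np.sqrt(x * y))      # sum over i of sqrt(x_i * y_i)
    denominator = np.sum(x) * np.sum(y)     # (sum of x_i) * (sum of y_i)
    return numerator / denominator

# Toy example with two small tf-idf-like vectors
print(sqrt_cosine([0.2, 0.0, 0.5], [0.1, 0.3, 0.4]))
```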

To show that this method is more effective not only in theory but also in practice, the authors of the paper conducted thorough testing, taking into
consideration several famous datasets and various performance measures. For all of you enthusiastic people, the link to the paper is given in
the references below.

References

  1. Improved Sqrt-Cosine Similarity Measurement, Journal of Big Data
