When IR meets NLP

NLP and IR are two fields that have grown remarkably large, and remarkably fast, over the years. One could hardly have imagined that Netflix, YouTube, and Facebook would one day make such impressive personalized recommendations. With the tremendous amount of data now available on the Internet, one can only wonder what such systems might achieve; I think there is no better example than IBM Watson.

Both NLP and IR techniques have played their part from the beginning. NLP tries to understand, utilize, and preserve the meaning of words and sentences, where grammatical relations between words form sentences. IR, on the other hand, retrieves documents for a query using ranked retrieval based on a similarity measure. The similarity measure largely relies on the term frequency (TF) of the query terms in the collection.

Now imagine one has to build a system that deals with semantically related queries. For example, "Mom gave me milk" and "Woman is making dinner" are two subject-verb-object phrases that contain different words. Even though they share no syntactically related terms, there is a semantic relation between the phrases.

The question is: for such systems, will you choose NLP methods, IR methods, or a combination of both?


Before coming to the solution, let's first see what methods are used:


1. Distributional Similarity (DS)
In IR, a document that contains more query terms is ranked above documents that contain no query terms at all. In DS, the semantic representations of words and sentences are their vector representations: the smaller the cosine angle between two vectors, the more similar the words they represent.

2. Vector-based Distributional Similarity
Let there be a feature space consisting of 1,000 words, and let "ocean", "sailing", "animal", and "pet" be the first four words. The vectors for "ship" and "boat" could be as follows:
ship = (4, 3, 0, 0, . . .)
boat = (2, 3, 0, 0, . . .)

The target word "ship" occurs four times in the neighbourhood of "ocean" (one position before or after "ship") and three times in the neighbourhood of "sailing", but zero times in the neighbourhoods of "animal" and "pet". Assuming all the remaining coordinates are zero, the lengths of the two vectors are √25 and √13, and we obtain:
cos(ship, boat) = 17 / (√25 · √13) ≈ 0.943.
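The cosine above can be checked with a short sketch in plain Python (the function name is mine, not from the paper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two co-occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Co-occurrence counts with "ocean", "sailing", "animal", "pet"
ship = [4, 3, 0, 0]
boat = [2, 3, 0, 0]
print(round(cosine(ship, boat), 3))  # 0.943
```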


3. Phrase-based Semantic IR
The main task is to measure the dependence between the source (a document) and the target (our query). This can be computed as:

                                    PMI(d, q) = log( P(d, q) / (P(d) · P(q)) )

                                    d: a document, q: a query

The document-query independence measure is also referred to as the point-wise mutual information, PMI(d, q).

So we can say that DS focuses on the similarity between two target words or phrases, while IR focuses on the similarity between a source document and a query.

Other correspondences between distributional similarity and IR can be seen in the diagram above.

Now we come to the main part: what can we achieve when both techniques are combined? The combination is a two-layer structure: a foreground IR layer handles the symbolic matching of terms, while a background DS layer provides the vector representations. The resulting mixture model can be written as:

                                     P_mix(sq | sd) = λ · P_fg(sq | sd) + (1 − λ) · P_bg(sq | sd)

                                     This is called a DS-based IR model.

Here λ is taken as 0.5, P_fg(sq | sd) is the IR model, and P_bg(sq | sd) is the DS model.

To understand this, consider the two sentences "men love yachts" and "women love porsches". The probabilities are:

P_IR(men|women) = 0.2                          P_IR(yachts|porsches) = 0.1

P_DS(men, women) = 0.72                      P_DS(yachts, porsches) = 0.80

The mixture probabilities are:
Pλ_mix(men | women, sim) = 0.2/2 + 0.72/2 = 0.46

Pλ_mix(yachts | porsches, sim) = 0.1/2 + 0.80/2 = 0.45

The score can then be calculated as the product over the aligned terms:

score = Pλ_mix(men | women, sim) · Pλ_mix(yachts | porsches, sim) = 0.46 × 0.45 ≈ 0.21
The mixture model performed better than the IR-only model, because the similarity-based background DS model assigns higher values than the IR model's direct-occurrence estimates.
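The interpolation above can be sketched directly, using the illustrative probabilities from the text (the function name is mine):

```python
def p_mix(p_ir, p_ds, lam=0.5):
    """Interpolate a foreground IR probability with a background DS probability."""
    return lam * p_ir + (1 - lam) * p_ds

# Values from the "men love yachts" / "women love porsches" example
men_women = p_mix(p_ir=0.2, p_ds=0.72)        # 0.46
yachts_porsches = p_mix(p_ir=0.1, p_ds=0.80)  # 0.45

# One plausible phrase score: the product over the aligned terms
score = men_women * yachts_porsches
print(round(score, 3))
```

Because the DS probabilities are much larger than the sparse IR ones, the interpolation lifts the score well above what term matching alone would give.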

Evaluation and Results:

The model was evaluated on the correlation between the model score and a reference score. For distributional similarity models this is similarity-based; for IR models, a setting was simulated in which a query contains exactly one phrase and a document also contains exactly one phrase.
[Table: average (Pearson) correlation per query]
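The evaluation metric, Pearson correlation between model scores and reference scores, can be sketched as follows (function name and sample data are my own, not from the paper):

```python
def pearson(xs, ys):
    """Pearson correlation between model scores and reference scores."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical model scores vs. human reference similarity judgements
model = [0.21, 0.45, 0.80]
reference = [2.0, 4.5, 6.8]
print(round(pearson(model, reference), 3))
```

A value near 1 means the model's ranking of phrase pairs closely tracks the human judgements.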


Conclusion

It was found that the mixture model performed better than the others. The average improvement was small, but it was markedly higher in some cases.


Reference:

[1] Dmitrijs Milajevs, Mehrnoosh Sadrzadeh, and Thomas Roelleke (School of Electronic Engineering and Computer Science, Queen Mary University of London, UK). "IR Meets NLP: On the Semantic Similarity between Subject-Verb-Object Phrases." In Proceedings of the 2015 International Conference on the Theory of Information Retrieval (ICTIR '15), pages 231–240.
