Recommendation System Using Content Based Similarity

As more and more information becomes available, it becomes harder to find what we are looking for. When we browse the web for documents, we usually want results on the topic behind our query, but generally the search engine returns results based on crude keyword matching.

Recommendation systems often use only the meta-data of a document to classify it into a category. Such techniques can be useful in areas where the content matters little for classification. But for blogs or research papers, we might not be able to infer anything from the meta-data alone. In such cases, we need to parse the content to classify the documents into topics. This process is called topic modelling.
Topic modelling groups words into clusters such that each cluster represents a topic, as can be seen in the following figure.



Topic modelling can be particularly useful in document classification, where certain words tend to occur together with other words for a given topic. It can also be useful for movie recommendation based on the plot. Currently, movie recommendations are typically based on the meta-data of the film, such as the genre and the actors.
Topic modelling can be used to recommend movies from the plot itself.

Such a recommender system faces several challenges. The amount of data grows rapidly, and we need efficient methods for handling it. The computational cost also increases because topic modelling algorithms are complex. Even with these challenges, topic modelling can be useful for finding relationships between documents and topics.

Methods for Topic Modelling

We need topic models to discover hidden topic-based patterns. Words are related to topics, and topics in turn are related to documents. Hence, topic models can be used to retrieve relevant documents. Here we go through the following topic modelling techniques.

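Latent Semantic Analysis (LSA):

  • Here we first form the tf-idf matrix, with one row per term and one column per document.
  • Then use SVD to factor this matrix into 3 matrices, U, Sigma and V, so that the matrix equals U Sigma V^T.
  • U relates terms to "concepts".
  • V relates "concepts" to documents.
  • Sigma is the diagonal matrix of singular values, which measure the strength of each "concept".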
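
The LSA steps above can be sketched with NumPy as follows. This is only a minimal illustration under stated assumptions: the toy corpus, the plain log(N / df) idf weighting, and the choice of k = 2 concepts are all made up for the example and are not taken from any particular library or dataset.

```python
import numpy as np

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]

# Term-document count matrix: rows are terms, columns are documents.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        counts[index[w], j] += 1

# tf-idf weighting (assumed variant): raw count times log(N / document frequency).
df = (counts > 0).sum(axis=1)
idf = np.log(len(docs) / df)
tfidf = counts * idf[:, None]

# SVD: tfidf = U @ diag(sigma) @ Vt, with singular values in descending order.
U, sigma, Vt = np.linalg.svd(tfidf, full_matrices=False)

# Keep only the k strongest "concepts".
k = 2
term_concepts = U[:, :k]                  # one row per term
doc_concepts = (Vt[:k] * sigma[:k, None]).T  # one row per document

print(doc_concepts.round(2))
```

Each row of `doc_concepts` places a document in the reduced concept space, so documents can be compared there (for example with cosine similarity) instead of in the full term space.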

Latent Dirichlet Allocation (LDA): LDA represents each document as a mixture of topics, where each topic generates words with different probabilities. It returns two things. First, the words in each topic. Each topic forms a cluster of words, and clusters may overlap, since many topics can contain the same words. Second, for each document, how many words belong to each topic. This can be useful for finding a semantic relationship between documents and a query, which might not be possible with keyword matching alone.
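
To make the two outputs of LDA concrete, here is a small sketch that fits LDA with collapsed Gibbs sampling, one common inference method among several. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions chosen for illustration, and a real system would more likely rely on an established library.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents (lists of words)."""
    rng = random.Random(seed)
    vocab = sorted({w for words in docs for w in words})
    vocab_size = len(vocab)

    # Count tables: words per topic in each document, and word counts per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{w: 0 for w in vocab} for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every word occurrence.
    assignments = []
    for d, words in enumerate(docs):
        z_d = []
        for w in words:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                # Remove the current assignment of this word from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed word probabilities per topic and topic mixture per document.
    topic_words = [
        {w: (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta) for w in vocab}
        for t in range(num_topics)
    ]
    doc_mix = [
        [(c + alpha) / (len(words) + num_topics * alpha) for c in doc_topic[d]]
        for d, words in enumerate(docs)
    ]
    return topic_words, doc_topic, doc_mix

# Toy, pre-tokenised documents (hypothetical).
toy_docs = [
    "cat dog pet cat".split(),
    "dog pet vet".split(),
    "stocks bonds market".split(),
    "market stocks investors".split(),
]
topic_words, doc_topic, doc_mix = lda_gibbs(toy_docs, num_topics=2)
print(doc_topic)  # for each document, how many of its words fall in each topic
```

Here `topic_words` corresponds to the first output described above (the words in each topic, with probabilities) and `doc_topic` to the second (per-document word counts for each topic). `doc_mix` is the same information normalised into a topic mixture.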


LSA and LDA can be used to find content-based similarity between documents. This is very useful when we are trying to find documents related to some topic. It can be extended to other areas such as movie recommendation, where we try to find similarity based on the plot and not just the meta-data of the movies. Content-based similarity is more useful when we are searching for related topics and not just for keywords, because it captures a deeper relationship between documents.
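
Once every document is described by a topic vector, content-based recommendation can be sketched as ranking documents by cosine similarity to a query vector. In the example below, the movie names, the three-topic vectors, and the helper `rank_by_similarity` are hypothetical. In practice the vectors would come from a topic model such as LSA or LDA.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_by_similarity(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical topic mixtures over three topics, one vector per movie plot.
doc_vecs = {
    "heist_movie": [0.8, 0.1, 0.1],
    "space_opera": [0.1, 0.8, 0.1],
    "crime_drama": [0.7, 0.1, 0.2],
}
query = [0.9, 0.05, 0.05]  # topic mixture of the plot the user liked

ranking = rank_by_similarity(query, doc_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the comparison happens in topic space, two plots can score as similar even when they share few exact keywords.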


