A very fast way of finding similar document: Autoencoders as Hashing function

A very fast way of finding similar document: Really 😊

Lets see How?😋

Semantic hashing [1] provides an extremely effecient way of finding documents similar to a query document. The idea is to convert a document into a memory address. In that memory, to organize things such that if you go to a particular address and look at the nearby addresses, you will be able to find documents that are very similar. People have known for a long time that if you get binary descriptor of images, you would have a very good way of retrieving images quickly. This extremely fast retrieval makes it possible to search using multiple different transformations of the query image.
Some binary descriptors are easy to get but its much harder to get a list of 30 binary descriptors which are more or less orthogonal to one another which is really need. In this context, machine learning or deep learning methods work which we can study in the following:

Finding binary codes:-
Deep auto-encoders [2] are used to map images to short binary codes. Using semantic hashing, 28-bit codes can be used to retrieve images that are similar to a query image in a time that is independent of the size of the database.  256-bit binary codes allow much more accurate matching and can be used to prune the set of images found using the 28-bit codes. We can do it by training an auto-encoder that has logistic units in it's code layer. During fine tuning stage, add noise to the inputs to the code units. The noise forces their activities to become bi-modal in order to resist the effect of the noise. Then, simply threshold the activities to get a binary code.

Feature Selection:-
If we can train an auto-encoder like this, we will be able to convert the counts for a bag of words into a small number of binary values. In another words, we"ll have learned a set of binary features that are good for reconstructing the bag of problems.

Matching:-
Once we have got these short binary codes, we could do a sequential search where for each known document, we store a code. And when a query arrives, we first extract its code, if it's not one of our known documents, then we can compare the code of all the stored documents. 
Semantic Hashing [wikipedia]

Comparison Time:
The comparison can be very fast, because it use special bit operations on a CPU which can compare many bits in parallel. In addition, we can treat the code as it was a memory address. 
The idea is to use autoencoder as hash function that converts a document into bit address and in that memory each address have a pointer to the documents that have address. if several documents have same address then we can point them as similar. if the auto encoder is successful in making similar documents have similar addresses, we have a very fast way of finding similar documents. Simply, take the query document, and go to the address in the memory that corresponds to its binary code, then look at the nearby addresses and aceess nearby addresses which differs by small bits.In other words, we can take a memory address and flip a few bits and access all the nearby addresses which looks up the similar documnets.


References:-
1. Salakhutdinov, Ruslan, and Geoffrey Hinton. "Semantic hashing." International Journal of Approximate Reasoning50.7 (2009): 969-978.
2. Krizhevsky, Alex, and Geoffrey E. Hinton. "Using very deep autoencoders for content-based image retrieval." ESANN. 2011.

Comments