Cross-domain information matching using Deep Learning


The human body has five primary senses that act as input channels for the mind, which processes them to extract information from our surroundings and build a picture of what is happening around us.

In a similar way, information on the Internet is available in several formats across its vast expanse:
- text (tweets being one of the most abundant sources)
- images (from Instagram/Facebook), and in some cases
- audio (sent to companies like Google and Apple via their voice assistants).

Mature state-of-the-art techniques exist for information retrieval and semantic analysis within each of these domains individually. However, very little work actually tries to bring information from the various domains (text, audio, video, images, social context) together.
The human mind rarely processes input from one sensory organ in isolation from the others. To identify and understand why somebody is angry at you, listening only to their words, or only looking at them, is unlikely to work well. What if we could apply the same idea: bring together information from different domains and use it jointly for better, more efficient and more meaningful information extraction?

Recent work by Google, 'One Model to Learn Them All', proposes and analyzes an architecture that can be used to tackle this problem. The proposed architecture uses information from data in one domain to improve the model's knowledge of data in other domains. For instance, knowing that the image of a banana resembles the image of a yellow mustard bottle can be used to handle a user query like "what food item looks like a mustard bottle?". Although the paper does not claim any performance boost within the individual domains analyzed, it does claim to need far less data, using the abundance of data in one domain to help improve another.

The architecture builds on existing state-of-the-art models for each domain when designing its input modules. Information from any domain is first encoded into a unified internal representation (similar to how audio and vision are unified inside our brains for processing), which lets the model learn jointly from inputs across multiple domains. The three main types of blocks the model uses are:
  1. Convolution blocks: detect local patterns and generalize across space.
  2. Attention blocks: allow the model to focus on specific elements to improve performance.
  3. Sparsely-gated mixture-of-experts layers: expand the model's capacity without a proportional increase in computation cost (see the sketch after this list).
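
To make the third block type concrete, here is a minimal, self-contained sketch of top-k sparse gating in NumPy. It is not the paper's implementation (which uses noisy gating and trains everything end to end); the point is only to show why computation stays roughly constant as experts are added: only the k selected experts are ever evaluated.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sparse_moe_forward(x, expert_weights, gate_weights, k=2):
    """Toy sparsely-gated mixture-of-experts forward pass.

    x:              input vector, shape (d_in,)
    expert_weights: list of per-expert matrices, each (d_in, d_out)
    gate_weights:   gating matrix, shape (d_in, n_experts)
    k:              number of experts evaluated per input
    """
    gate_logits = x @ gate_weights           # score every expert
    top_k = np.argsort(gate_logits)[-k:]     # indices of the k best experts
    gates = softmax(gate_logits[top_k])      # renormalise their scores
    # Only the k selected experts run, so capacity (number of experts)
    # can grow without a proportional increase in computation.
    return sum(g * (x @ expert_weights[i]) for g, i in zip(gates, top_k))

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 16, 8, 8
experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]
gate = rng.normal(size=(d_in, n_experts))
y = sparse_moe_forward(rng.normal(size=d_in), experts, gate, k=2)
print(y.shape)   # -> (8,)
```
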
A combination of these blocks, chosen to suit the input domain (convolution blocks for images and video, where they have been shown to work well; attention blocks for text and audio), encodes the data into a common representation. Such a representation can then be used, together with metric learning, to answer queries originating from any domain! An audio query like 'show me some images of bananas' can be converted directly into the unified representation and matched against existing image embeddings to retrieve results.
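
As an illustration of the retrieval mechanics, the sketch below matches a query against an image index in a shared embedding space using cosine similarity. The `encode_text` and `encode_image` functions are hypothetical stand-ins (here they just return random vectors) for the attention-based and convolutional encoders described above, so the printed ranking is meaningless; only the unified-space matching step is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical encoders: stand-ins for the text/audio and image encoders,
# both assumed to emit vectors in the same unified representation space.
def encode_text(query):
    return rng.normal(size=128)

def encode_image(image_file):
    return rng.normal(size=128)

query_vec = encode_text("show me some images of bananas")

# Pretend these are unified-space embeddings of an indexed image collection.
image_index = {name: encode_image(name)
               for name in ["banana.jpg", "mustard_bottle.jpg", "car.jpg"]}

# Retrieval = nearest neighbours of the query in the shared space.
ranked = sorted(image_index.items(),
                key=lambda item: cosine_similarity(query_vec, item[1]),
                reverse=True)
for name, vec in ranked:
    print(name, round(cosine_similarity(query_vec, vec), 3))
```
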

The applications of such an architecture to information retrieval have not been explored yet, and they could be enormous! Joint information extraction while, say, scraping Facebook data could analyze the videos, images, text and audio inside a post together to build a clearer picture of the information it carries, enabling better, more semantically capable retrieval for any user who wants to query or analyze it (see the sketch below). The abundance of YouTube comments could likewise be used to extract information about a video without the computational (and therefore time) cost of analyzing the video itself.
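
One simple way such joint extraction could be wired up: once every modality of a post has been encoded into the unified space, the per-modality vectors can be pooled into a single post-level embedding that is indexed and queried like any other. The sketch below uses plain mean pooling and random placeholder vectors; both are assumptions made for illustration, not anything prescribed by the paper.

```python
import numpy as np

def embed_post(text_vecs, image_vecs, audio_vecs):
    """Fuse per-modality unified-space embeddings of one post into a single
    vector by mean pooling -- one simple choice among many possible fusions."""
    all_vecs = [v for vecs in (text_vecs, image_vecs, audio_vecs) for v in vecs]
    return np.mean(all_vecs, axis=0)

rng = np.random.default_rng(1)
post_vec = embed_post(
    text_vecs=[rng.normal(size=128)],                         # the caption
    image_vecs=[rng.normal(size=128), rng.normal(size=128)],  # two photos
    audio_vecs=[],                                            # no audio here
)
print(post_vec.shape)   # -> (128,): one queryable vector per post
```
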

Cross-domain information matching and fusion is mostly unexplored, but with tools and architectures like the one described above, it can be used to push beyond the current limits on how much semantic meaning models can extract from content, towards a more intelligent information extraction and retrieval system. The architecture is also available for out-of-the-box use, in the form of the blocks described above, in the Tensor2Tensor library.

References:
  1. MultiModel: Multi-Task Machine Learning Across Domains, Google Research Blog 
  2. One Model to Learn Them All, arXiv
  3. Tensor2Tensor, GitHub
