
Similar to the above scenario, information is available in several formats across the vast expanse of the Internet:
- text (tweets being the most popular source)
- images (from Instagram/Facebook), and in some cases
- audio (to companies like Google and Apple via their speech assistants).
Sufficient state-of-the-art techniques exist for successful information retrieval and semantic analysis in each of these domains individually. However, there is close to zero work that actually tries to bring all this information (from domains like text, audio, video, images, and social context) together.

Recent work by Google, 'One Model to Learn Them All', analyzes and proposes an architecture that can be used to tackle this problem. The proposed architecture uses information from data in one domain to improve the model's confidence/knowledge about data in other domains. For instance, knowledge of the fact that the image of a banana resembles the image of a yellow mustard bottle can be used to handle a user query like "what food item looks like a mustard bottle". Though the paper does not claim any boost in performance in the individual domains analyzed, it does claim the model needs much less data, by utilizing the abundance of data in one domain to help improve another.
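As a toy illustration of this kind of cross-domain transfer (not the paper's actual method, and with entirely made-up embedding vectors), a shared representation space lets visual similarity answer a text query via a simple nearest-neighbour lookup:

```python
# Toy sketch: if food images live in a shared embedding space, the query
# "what food item looks like a mustard bottle" reduces to finding the food
# whose image embedding is closest to the mustard bottle's. All vectors here
# are invented for illustration.
import numpy as np

food_image_embeddings = {            # hypothetical image embeddings
    "banana":    np.array([0.9, 0.8, 0.1]),
    "tomato":    np.array([0.1, 0.2, 0.9]),
    "blueberry": np.array([0.2, 0.1, 0.8]),
}
query_embedding = np.array([0.85, 0.75, 0.15])   # image of a mustard bottle

def nearest(query, candidates):
    """Return the candidate whose embedding is closest (cosine) to the query."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(candidates, key=lambda name: cos(query, candidates[name]))

print(nearest(query_embedding, food_image_embeddings))   # -> "banana"
```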
The architecture utilizes existing state-of-the-art models per domain as its information extraction modules. Input from any domain is first encoded into a unified representation (similar to how audio and vision are all unified inside our brains for processing), so that inputs from multiple domains can jointly expand the model's knowledge. The three main types of blocks the model uses, sketched in code after the list, are:
- Convolution blocks: detect local patterns and generalize across space.
- Attention blocks: allow the model to focus on specific elements to improve performance.
- Sparsely-gated mixture-of-experts blocks: expand the model's capacity without a proportional increase in computation cost.
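To make the roles of these blocks concrete, here is a minimal NumPy sketch of each; the shapes, sizes, and top-k gating scheme are assumptions for illustration, not the actual Tensor2Tensor implementations:

```python
import numpy as np

def conv_block(x, kernel):
    """1-D convolution: detect local patterns that generalize across positions."""
    k = kernel.shape[0]                               # kernel width
    return np.array([np.sum(x[i:i + k] * kernel)      # slide kernel over the sequence
                     for i in range(len(x) - k + 1)])

def attention_block(q, k, v):
    """Scaled dot-product attention: each query focuses on the most relevant keys."""
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (n_q, n_k) relevance scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # softmax over keys
    return w @ v

def sparse_moe_block(x, experts, gate_w, top_k=2):
    """Sparsely-gated mixture-of-experts: only the top-k experts run per input,
    so capacity grows with the expert count while compute stays roughly flat."""
    gates = x @ gate_w                                # one gating logit per expert
    top = np.argsort(gates)[-top_k:]                  # pick the k best experts
    w = np.exp(gates[top]) / np.exp(gates[top]).sum() # renormalize their weights
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

# Example: route a 4-dim input through 3 tiny "experts" with top-2 gating.
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(4, 4)): np.tanh(x @ W) for _ in range(3)]
print(sparse_moe_block(rng.normal(size=4), experts, rng.normal(size=(4, 3))))
```

The mixture-of-experts block is the interesting one for scale: adding experts adds parameters, but since only `top_k` of them run per input, the per-input compute barely changes.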
The applications of such an architecture in information retrieval have not been analyzed yet, and they could be enormous. Joint information extraction while, say, scraping Facebook data could analyze the videos, images, text, and audio inside a post to build a clearer picture of the information it carries, enabling better, more semantically capable retrieval for any user who queries it. Similarly, the abundance of YouTube video comments could be used to extract the information inside a video without the computational (and thus time) cost of analyzing the video itself.
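A hedged sketch of what such joint extraction could look like: each modality of a post is embedded by its own encoder into the shared space, fused into one vector, and ranked against a query. The encoders here are hypothetical stand-ins (random pseudo-embeddings), not a real API; the point is the structure, not the numbers:

```python
# Hypothetical sketch of joint multi-modal indexing for retrieval.
import numpy as np

def toy_encoder(content, dim=8):
    """Stand-in per-modality encoder: a deterministic pseudo-embedding.
    A real system would use trained modality nets instead."""
    rng = np.random.default_rng(abs(hash(str(content))) % (2**32))
    return rng.normal(size=dim)

encoders = {"text": toy_encoder, "image": toy_encoder, "audio": toy_encoder}

def fuse_post(post, encoders):
    """Embed every modality present in a post and average into a single
    vector representing the whole post in the unified space."""
    return np.mean([encoders[kind](content) for kind, content in post.items()],
                   axis=0)

def search(query_text, posts, encoders, top_n=3):
    """Rank posts against a text query by cosine similarity in the shared space."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    q = encoders["text"](query_text)
    scored = [(cos(q, fuse_post(p, encoders)), i) for i, p in enumerate(posts)]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:top_n]

posts = [{"text": "beach day", "image": "beach.jpg"},
         {"text": "new puppy", "image": "dog.jpg", "audio": "bark.wav"}]
print(search("beach day", posts, encoders, top_n=1))
```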
Cross-domain information matching and fusion is mostly unexplored, but with tools and architectures like the one described above, it can push beyond the current limits on models' ability to extract semantic meaning from content, towards a more intelligent information extraction and retrieval system. The architecture is also available to us out of the box, in the form of the blocks described above, in the Tensor2Tensor library.
References:
- MultiModel: Multi-Task Machine Learning Across Domains, Google Research Blog
- One Model to Learn Them All, Kaiser et al., arXiv:1706.05137
- Tensor2Tensor, GitHub: github.com/tensorflow/tensor2tensor