Techniques for Automatic Video Annotation

In today's world, videos have become one of the most commonplace ways to record and share information. Entire businesses (e.g., YouTube) have been built just to share and manage large volumes of video content. However, as video data proliferates, managing it has become harder. We need ways to retrieve videos effectively through queries, which is where video annotation comes in.

Video annotation (VA), in a nutshell, describes videos: it attaches descriptive features (visual, semantic, etc.) to a video based on its content. A general VA system takes in the frames of a video, detects keyframes, and extracts features from them; a minimal sketch of such a pipeline is given below.
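As an illustration, here is a small, hypothetical keyframe-detection and feature-extraction pipeline using OpenCV. The histogram-difference threshold and the colour-histogram feature are assumptions made for this sketch, not the design of any particular system described here:

```python
import cv2

def frame_histogram(frame, bins=32):
    """Colour histogram used as a simple per-frame feature."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def annotate_keyframes(video_path, diff_threshold=0.4):
    """Detect keyframes by histogram change and return their features."""
    cap = cv2.VideoCapture(video_path)
    keyframes = []          # (frame_index, feature_vector)
    prev_hist = None
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = frame_histogram(frame)
        # A frame is a keyframe if its histogram differs enough from the last keyframe.
        if prev_hist is None or cv2.compareHist(
                prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > diff_threshold:
            keyframes.append((index, hist))
            prev_hist = hist
        index += 1
    cap.release()
    return keyframes

if __name__ == "__main__":
    for idx, feat in annotate_keyframes("example.mp4"):
        print(f"keyframe at frame {idx}, feature dim = {feat.shape[0]}")
```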


We now look at a few techniques that have been used for automated video annotation:

1. Semi-automatic video annotation: In this paper, the authors propose a content description ontology consisting of four content layers:

  • Video Description layer: Category of the entire video. Can answer queries like: "What does the video talk about?"
  • Group Description layer: Conveys event information between adjacent shots (i.e., it describes the event and the actors involved). Can answer queries like: "Give me all the assists in the football videos".
  • Shot Description layer: Describes a single shot. Can answer queries like: "Give me all the scenes where Messi scores a goal".
  • Frame Description layer: Describes objects in the frame. Can answer queries regarding that particular frame.
Then, a semi-automatic annotation algorithm is applied, which uses video processing techniques and assists the annotator in identifying scenarios for annotation. Scene detection algorithms are also used to refine the results. A minimal sketch of such a layered content description is given after this item.
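For illustration only, the four content layers can be modelled as nested records; the class and field names below are assumptions chosen for this sketch, not the ontology's actual vocabulary:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FrameDescription:
    """Frame layer: objects visible in a single frame."""
    frame_index: int
    objects: List[str] = field(default_factory=list)      # e.g. ["ball", "goalkeeper"]

@dataclass
class ShotDescription:
    """Shot layer: what happens within a single shot."""
    shot_id: int
    description: str                                       # e.g. "Messi scores a goal"
    frames: List[FrameDescription] = field(default_factory=list)

@dataclass
class GroupDescription:
    """Group layer: an event spanning adjacent shots, with its actors."""
    event: str                                             # e.g. "assist"
    actors: List[str] = field(default_factory=list)
    shots: List[ShotDescription] = field(default_factory=list)

@dataclass
class VideoDescription:
    """Video layer: the category of the whole video."""
    title: str
    category: str                                          # e.g. "football match"
    groups: List[GroupDescription] = field(default_factory=list)

# A query like "give me all the assists" then reduces to a filter over groups:
def find_events(video: VideoDescription, event: str) -> List[GroupDescription]:
    return [g for g in video.groups if g.event == event]
```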

2. Automated video annotation using hierarchical topic trajectory models: Described in this paper, this approach represents the relationship between video frames and their labels by incorporating the temporal dynamics of the video and the co-occurrences between visual and textual information. It also has four layers: temporal data (video frames, audio, etc.), features extracted from that data, latent variables, and hidden state variables. This architecture reduces the computational cost of model parameter estimation and, unlike the previous approach, is fully automated.

The model's architecture connects these layers through the following variables: v_t = video frames, w_t = text labels, x_t and y_t = extracted features, z_t = latent variable, s_t = hidden state.

The joint probability density of the model couples the hidden states, latent variables, extracted features, and labels across time; a plausible factorization is sketched below.
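Under standard hidden-state / latent-variable assumptions (the exact form in the paper may differ), the density factorizes as

$$
p(x_{1:T}, y_{1:T}, z_{1:T}, s_{1:T}) \;=\; p(s_1)\prod_{t=2}^{T} p(s_t \mid s_{t-1}) \prod_{t=1}^{T} p(z_t \mid s_t)\, p(x_t \mid z_t)\, p(y_t \mid z_t),
$$

where $x_t$ and $y_t$ are the features extracted from the video frames $v_t$ and text $w_t$, $z_t$ is the latent variable tying the two modalities together, and $s_t$ is the hidden state carrying the temporal dynamics.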


3. Text detection: Text and captions in the video are detected using OCR and corner-based algorithms, and their motion is tracked with optical flow. Since on-screen text often gives direct information about what is happening in the video, it can be used for annotation as well. This technique is also largely language-independent. A rough sketch of the detect-and-track step is given below.
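For illustration, the sketch below OCRs one frame with pytesseract and tracks corner points into the next frame with Lucas-Kanade optical flow; the file names, thresholds, and the pytesseract dependency are assumptions of this sketch:

```python
import cv2
import numpy as np
import pytesseract  # assumed available; wraps the Tesseract OCR engine

def detect_and_track_text(frame_t, frame_t1):
    """OCR the first frame, then track corner points into the next frame."""
    gray_t = cv2.cvtColor(frame_t, cv2.COLOR_BGR2GRAY)
    gray_t1 = cv2.cvtColor(frame_t1, cv2.COLOR_BGR2GRAY)

    # 1. OCR: recover caption text that can be turned into annotations.
    caption = pytesseract.image_to_string(gray_t).strip()

    # 2. Corner detection: text regions tend to be rich in corners.
    corners = cv2.goodFeaturesToTrack(gray_t, maxCorners=200,
                                      qualityLevel=0.01, minDistance=5)
    if corners is None:
        return caption, np.empty((0, 1, 2), dtype=np.float32)

    # 3. Lucas-Kanade optical flow: track the corners into the next frame.
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(gray_t, gray_t1, corners, None)
    tracked = next_pts[status.flatten() == 1]
    return caption, tracked

if __name__ == "__main__":
    f0 = cv2.imread("frame_000.png")
    f1 = cv2.imread("frame_001.png")
    text, pts = detect_and_track_text(f0, f1)
    print("OCR text:", text)
    print("tracked corner points:", len(pts))
```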

4. Annotation of web and mobile videos: Many web and mobile videos are captured in a rough, uncontrolled manner. To counter this, foreground detection algorithms are used, and the focus is placed on moving rigid objects. For this, algorithms such as the Consensus Foreground Object Template (CFOT) and SIFT are used, followed by object detection. Audio features are also used to improve accuracy. A sketch of the SIFT matching component is given below.
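As an illustration of the SIFT part only (CFOT itself is not sketched here), matching keypoints between consecutive frames can expose a rigid object that moves consistently; the ratio-test threshold and file names are assumptions of this sketch:

```python
import cv2

def match_rigid_object_keypoints(frame_a, frame_b, ratio=0.75):
    """Match SIFT keypoints between two frames; consistent matches hint at a rigid moving object."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(gray_a, None)
    kp_b, des_b = sift.detectAndCompute(gray_b, None)

    # Lowe's ratio test over 2-nearest-neighbour matches.
    matcher = cv2.BFMatcher()
    raw = matcher.knnMatch(des_a, des_b, k=2)
    good = []
    for pair in raw:
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])

    # Keypoint pairs that survive the ratio test; their common motion
    # can then be used to localise the foreground object.
    return [(kp_a[m.queryIdx].pt, kp_b[m.trainIdx].pt) for m in good]

if __name__ == "__main__":
    a = cv2.imread("frame_010.png")
    b = cv2.imread("frame_011.png")
    pairs = match_rigid_object_keypoints(a, b)
    print(f"{len(pairs)} matched keypoints between frames")
```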

Recently, another method has been devised which uses data mining for annotation: videos in a database are first ranked by a multimodal search, and their transcripts are then mined for keywords that serve as tag annotations. A small sketch of the transcript-mining step is given below.
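For illustration, keywords can be mined from the transcripts of the top-ranked videos with TF-IDF; scikit-learn is an assumed dependency here, and the ranking step itself is not shown:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def mine_tags(transcripts, top_k=5):
    """Return the top TF-IDF keywords of each transcript as candidate tags."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(transcripts)      # rows: videos, cols: terms
    terms = vectorizer.get_feature_names_out()

    tags = []
    for row in tfidf.toarray():
        top = row.argsort()[::-1][:top_k]              # highest-weighted terms first
        tags.append([terms[i] for i in top if row[i] > 0])
    return tags

if __name__ == "__main__":
    transcripts = [
        "the striker scores a goal after a brilliant assist",
        "the chef chops onions and fries them in olive oil",
    ]
    for video_tags in mine_tags(transcripts):
        print(video_tags)
```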

