Intelligent System Creates Descriptive Sentences Directly From Videos

Wednesday, August 27, 2014


Summarizing what is going on in a video is another task that may soon be done automatically, thanks to work from the Video In Sentences Out study. Using artificial-intelligence deep learning methods, the team has already been able to achieve accurate results on almost half of the videos the system has examined.




In a DARPA-funded research effort, a team combining computer vision, robotics, natural-language processing, and deep learning has created a system that provides sentence descriptions of what is occurring in a video.

Apart from the obvious use of summarizing YouTube videos, the applications of this narrow artificial intelligence system are many, including robotics and the development of smart cameras.

The system, called Video In Sentences Out, produces what the researchers call "sentential descriptions" of video. These descriptions capture who did what to whom, and where and how they did it.

The research for Video in Sentences Out was conducted at the University of Toronto, Purdue University, and the University of South Carolina.

Video In Sentences Out was developed by the Purdue-University of South Carolina-University of Toronto team under the DARPA Mind's Eye program. The Mind's Eye program seeks to develop the capability for visual intelligence by automating the ability to learn generally applicable and generative representations of action between objects in a scene directly from visual inputs, and then reason over those learned representations.

Video in Sentences Out Example 2

The study has been published in the Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, and the code has been fully open-sourced on GitHub.

Actions are returned by the system as verbs, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases, spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers.
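To make that mapping concrete, here is a minimal sketch in Python of how such components could be assembled into a sentence; the field names and the to_sentence helper are hypothetical illustrations, not the team's actual code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DetectedEvent:
    """Hypothetical container for what the vision stages might report."""
    agent: str                                            # noun phrase for the actor
    verb: str                                             # recognized action
    patient: Optional[str] = None                         # object acted upon
    adjectives: List[str] = field(default_factory=list)   # modifiers of the patient
    adjuncts: List[str] = field(default_factory=list)     # prepositional phrases / adverbials

def to_sentence(event: DetectedEvent) -> str:
    """Render a detected event as a simple declarative sentence."""
    parts = [event.agent, event.verb]
    if event.patient:
        parts.append(" ".join(["the", *event.adjectives, event.patient]))
    parts.extend(event.adjuncts)
    return " ".join(parts).capitalize() + "."

# -> "The person carried the red object to the left."
print(to_sentence(DetectedEvent(agent="the person", verb="carried", patient="object",
                                adjectives=["red"], adjuncts=["to the left"])))
```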

Using an approach called event recognition, the research team was able to extract the information needed to create sentence descriptions from nearly 750 short videos. This included recognition of object tracks (where things are going and from where), to whom or what objects were going, and the changing body postures of the people in the videos.
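As a rough illustration of what an object track might carry, the sketch below represents a track as a sequence of per-frame bounding boxes and infers a coarse direction of motion from it; the representation and threshold are assumptions for illustration, not the study's implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x, y, width, height) of one frame's detection

def horizontal_direction(track: List[Box]) -> str:
    """Classify the coarse horizontal motion of a tracked object across frames."""
    start_x = track[0][0] + track[0][2] / 2    # center x in the first frame
    end_x = track[-1][0] + track[-1][2] / 2    # center x in the last frame
    shift = end_x - start_x
    if abs(shift) < 10:                        # pixel threshold, purely illustrative
        return "stays in place"
    return "moves rightward" if shift > 0 else "moves leftward"

# A person-sized box drifting right across five frames -> "moves rightward"
print(horizontal_direction([(10, 50, 40, 120), (30, 50, 40, 120), (55, 50, 40, 120),
                            (80, 50, 40, 120), (105, 50, 40, 120)]))
```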

Video in Sentences Out Example 1

For the language-processing elements, the system uses a simple vocabulary of 118 words: 1 coordination, 48 verbs, 24 nouns, 20 adjectives, 8 prepositions, 4 lexical prepositional phrases, 4 determiners, 3 particles, 3 pronouns, 2 adverbs, and 1 auxiliary.
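Those category counts can be written down compactly; the snippet below simply encodes the reported breakdown (the study's actual word lists are not reproduced here, so no sample words are included).

```python
# Category counts as reported for the 118-word lexicon.
LEXICON_SIZES = {
    "coordination": 1, "verbs": 48, "nouns": 24, "adjectives": 20,
    "prepositions": 8, "lexical_pp": 4, "determiners": 4,
    "particles": 3, "pronouns": 3, "adverbs": 2, "auxiliary": 1,
}

assert sum(LEXICON_SIZES.values()) == 118   # matches the reported vocabulary size
```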

Video in Sentences Out architecture

While summarizing a short video clip may seem at first to be a very straightforward activity for a human, it is far from easy for an artificial system.

Creating a descriptive sentence from video requires recognizing the primary action being performed, because such actions are rendered as verbs, and verbs serve as the central structure of sentences. However, event recognition alone is insufficient to generate the remaining components. Video In Sentences Out must also recognize object classes in order to render nouns. But even object recognition is insufficient to generate meaningful sentences: the system must determine the roles those objects play in the event.

The overall architecture of Video In Sentences Out first runs detectors for each object class on each frame of the video. At this stage, a number of sub-systems cross-check the detections to weed out false positives. A dynamic programming algorithm is then employed to select the main objects detected over the course of the action, and these objects are tracked by the system.
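A minimal sketch of that dynamic-programming step, assuming per-frame detection scores and a simple penalty for large positional jumps between consecutive detections; the actual cost terms and detectors used by the team are not described in this article.

```python
from typing import List, Tuple

Detection = Tuple[float, float]   # (center_x, detector_score) for one candidate box

def select_track(frames: List[List[Detection]], jump_penalty: float = 0.05) -> List[int]:
    """Choose one detection per frame, maximizing total detector score while
    penalizing large positional jumps between consecutive frames (Viterbi-style DP)."""
    best = [score for _, score in frames[0]]    # best cumulative score per candidate in frame 0
    back: List[List[int]] = []                  # backpointers for frames 1..T-1
    for t in range(1, len(frames)):
        new_best, pointers = [], []
        for x, score in frames[t]:
            candidates = [best[j] + score - jump_penalty * abs(x - frames[t - 1][j][0])
                          for j in range(len(frames[t - 1]))]
            j_star = max(range(len(candidates)), key=candidates.__getitem__)
            new_best.append(candidates[j_star])
            pointers.append(j_star)
        best = new_best
        back.append(pointers)
    # Trace the best path backwards through the backpointers.
    idx = max(range(len(best)), key=best.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    return path[::-1]

# Two candidate detections per frame; the spatially coherent track (indices 0, 0, 0) wins.
print(select_track([[(10.0, 0.9), (200.0, 0.8)],
                    [(12.0, 0.9), (150.0, 0.7)],
                    [(15.0, 0.9), (90.0, 0.6)]]))   # -> [0, 0, 0]
```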

People in the videos are likewise detected and tracked, and their postures and actions are determined using deep learning. Hidden Markov Models (HMMs) are then used to determine the verbs that will appear in the developing sentences, and adjectives and prepositional phrases are incorporated during final sentence generation.
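The verb-selection idea can be sketched as scoring a track of discretized pose features against a bank of per-verb HMMs and keeping the best-scoring verb; the toy models and feature encoding below are assumptions for illustration, not the team's trained parameters.

```python
import numpy as np

def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under one HMM,
    computed with the scaled forward algorithm."""
    alpha = start * emit[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        log_lik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return log_lik

def classify_verb(obs, verb_hmms):
    """Return the verb whose HMM assigns the highest likelihood to the feature track."""
    return max(verb_hmms, key=lambda v: forward_log_likelihood(obs, *verb_hmms[v]))

# Two toy verb models over 2 hidden states and 2 pose symbols: "walk" expects the
# symbols to alternate (e.g. left/right leg forward), "stand" is indifferent.
verb_hmms = {
    "walk":  (np.array([1.0, 0.0]),                   # initial state distribution
              np.array([[0.1, 0.9], [0.9, 0.1]]),     # state transitions
              np.array([[0.9, 0.1], [0.1, 0.9]])),    # emission probabilities
    "stand": (np.array([0.5, 0.5]),
              np.array([[0.5, 0.5], [0.5, 0.5]]),
              np.array([[0.5, 0.5], [0.5, 0.5]])),
}
print(classify_verb([0, 1, 0, 1], verb_hmms))   # -> "walk"
```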

To test the system, human judges rated each video-sentence pair, assessing whether the sentence was true of the video and whether it described the event depicted in the video. 26.7% of the video-sentence pairs were found to be true, and 7.9% were deemed salient, covering the main elements seen in the video.

The video above represents highlights from the study. The full videos are available here.


By 33rd Square
