Showing posts with label computer vision.

Monday, July 24, 2017

Latest Version of Microsoft HoloLens Will Use AI Coprocessor for Processing Deep Neural Networks


Deep learning has made large inroads in the world of computer vision, and many other recognition tasks, in recent years. Microsoft has just announced that it is bringing the technology to its HoloLens system, integrating a deep neural network into the system's holographic processor.


Many of the most difficult recognition and computer vision problems have seen major gains in recent years. Now, Microsoft hopes to embed this technology into the latest version of their HoloLens augmented reality computer system.

Microsoft HoloLens


"I work on HoloLens, and in HoloLens, we’re in the business of making untethered mixed reality devices, writes Microsoft's Marc Pollefeys, Director of Science for HoloLens. "We put the battery on your head, in addition to the compute, the sensors, and the display. Any compute we want to run locally for low-latency, which you need for things like hand-tracking, has to run off the same battery that powers everything else. So what do you do?"

"You create custom silicon to do it."

"Mixed reality and artificial intelligence represent the future of computing."
"Mixed reality and artificial intelligence represent the future of computing," Pollefeys writes.

HoloLens contains a custom multiprocessor called the Holographic Processing Unit, or HPU. It is responsible for processing the information coming from all of the on-board sensors, including Microsoft’s custom time-of-flight depth sensor, head-tracking cameras, the inertial measurement unit (IMU), and the infrared camera.

According to Microsoft, the HPU is part of what makes HoloLens the world’s first–and still only–fully self-contained holographic computer.

Recently, Harry Shum, executive vice president of the company's Artificial Intelligence and Research Group, announced in a keynote speech at the Conference on Computer Vision and Pattern Recognition (CVPR) that the second version of the HPU, currently under development, will incorporate an AI coprocessor to natively and flexibly implement deep neural networks (DNNs).

The chip will be fully programmable and supports a wide variety of layer types. At the conference, Shum demonstrated an early spin of the second version of the HPU running live code implementing hand segmentation.

The AI coprocessor is designed to work in the next version of HoloLens, running continuously off the HoloLens battery.

According to Pollefeys, this is the kind of technology that needs to be developed to bring about "mixed reality devices that are themselves intelligent."




SOURCE  Microsoft Research


By 33rd Square





Tuesday, April 18, 2017

Is AI Progressing Faster Than We Think?


Artificial Intelligence

YouTube channel ColdFusion has released a new video that explores the latest developments in artificial intelligence. The video looks at stacked generative adversarial networks, and how the progress of AI could drastically alter society.


The YouTube channel ColdFusion (previously called ColdfusTion) has released a new video that explores the latest developments in artificial intelligence.

"Imagine typing a descriptive sentence of a theme, and having an artificial intelligence generate a convincing photorealistic image just from your text input," asks Dagogo Altraide in the video. "This has just been created."

The system, detailed in the researchers' paper, uses an approach called Stacked Generative Adversarial Networks (StackGAN).

Creating photorealistic images from text descriptions alone is a challenging problem in computer vision with many practical applications. Until recently, samples generated by text-to-image approaches could only roughly reflect the meaning of the given descriptions and lacked the necessary detail and vivid, recognizable objects.

Stacked Generative Adversarial Networks (StackGAN)
Images of birds created with a Stacked Generative Adversarial Network (StackGAN)

To point to one area where the StackGAN approach may one day be used, think about how Tony Stark designs his Iron Man suit with the help of J.A.R.V.I.S. Stark isn't sitting at a CAD system plugging in numbers and detailing every aspect of the suit design, he is giving broad descriptions, and letting the AI system fill in all the necessary details.

StackGAN approach
StackGAN System - Xun Huang, Yixuan Li, Omid Poursaeed, John Hopcroft, and Serge Belongie. (https://arxiv.org/abs/1612.04357)

The adversarial neural network used in the paper above is one possible path to such a system.

As Altraide describes,

If we combine two neural networks together, and make them compete against each other so that they can train and improve themselves without human intervention, that's what StackGAN is doing. It uses one neural network to generate images, and another neural network within the same system to decide if the image generated is real or fake. What ends up happening is that the generative neural network improves itself at generating images based on the feedback given by the deciding network. In the same stride, the deciding network gets better at distinguishing what's real and fake.

This software architecture creates a feedback loop of continuous improvement without human intervention that keeps refining the generated images. As Altraide comments, "The end results are nothing short of stunning."
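For readers who want to see that feedback loop in code, here is a minimal, generic GAN training step in PyTorch. It is an illustrative sketch of the generator-versus-discriminator idea, not the StackGAN implementation, and the layer sizes are arbitrary.

```python
# Minimal GAN training loop illustrating the generator/discriminator feedback
# described above. Illustrative sketch only, not the StackGAN code.
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 784  # illustrative sizes (e.g. 28x28 images flattened)

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Tanh(),
)
discriminator = nn.Sequential(
    nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

criterion = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_images):
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) The deciding (discriminator) network learns to tell real from fake.
    noise = torch.randn(batch, latent_dim)
    fake_images = generator(noise)
    d_loss = criterion(discriminator(real_images), real_labels) + \
             criterion(discriminator(fake_images.detach()), fake_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) The generative network improves from the discriminator's feedback:
    #    it is rewarded when its fakes are classified as real.
    g_loss = criterion(discriminator(fake_images), real_labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```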

The video also highlights WaveNet, an AI system capable of producing natural-sounding human speech and music on its own, and a system from Carnegie Mellon University that has mastered Texas Hold 'Em poker.

Jeff Dean's recent talk about how image recognition is already exceeding human capability is also showcased in the video.

"It seems like playtime is over in regards to AI."
"It seems like playtime is over in regards to AI, especially with techniques like deep learning and neural networks," states Altraide. He discusses how social disruption is soon to follow these rapid technological advances.

Another strong video from ColdFusion!



SOURCE  ColdFusion


By 33rd Square





Tuesday, February 10, 2015

Microsoft Achieves Substantial Beyond Human Level Deep Learning Advance

 Deep Learning
A new computer vision system based on deep convolutional neural networks has for the first time eclipsed the abilities of humans to classify objects. The Microsoft researchers claim their system achieved a 4.94 percent error rate on the ImageNet database. Humans tested on the same data averaged a 5.1 percent error rate.




Researchers at Microsoft claim their latest deep learning computer vision system can outperform humans in image recognition.

In their paper, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," the Microsoft Research Asia developers say their system achieved a 4.94 percent error rate for the correct classification of images in the 2012 version of the widely recognized ImageNet data set, compared with a 5.1 percent error rate among humans.

The challenge involved identifying objects in the images and then correctly selecting the most accurate categories for the images, out of 1,000 options. Categories included “hatchet,” “geyser,” and “microwave.”
“To the best of our knowledge, our result surpasses for the first time the reported human-level performance on this visual recognition challenge,” Microsoft researchers Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun wrote in the paper.

Deep learning involves training artificial neural networks on lots of information derived from images, audio, and other inputs, and then presenting the systems with new information and receiving inferences about it in response.
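The paper's title refers to one of its key ingredients: the Parametric Rectified Linear Unit (PReLU), an activation function whose slope for negative inputs is learned during training rather than fixed at zero. Here is a minimal illustrative sketch of the idea, not the authors' code.

```python
# Parametric ReLU (PReLU): identity for positive inputs, learned slope `a` for
# negative inputs. Illustrative sketch of the activation described in the paper.
import numpy as np

def prelu(x, a):
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(prelu(x, a=0.25))  # -> [-0.5   -0.125  0.     1.5  ]
```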

"To the best of our knowledge, our result surpasses for the first time the reported human-level performance on this visual recognition challenge."


The research builds on the company's other impressive deep learning work, such as Project Adam, which was first demonstrated last year.

Along with surpassing human capability, the new system improves on Google’s award-winning GoogLeNet system, which posted a 6.66 percent error rate, by 26 percent in relative terms, the Microsoft researchers claim.

In a bit of modesty, the researchers noted that they don’t feel computer vision trumps human vision.

“While our algorithm produces a superior result on this particular dataset, this does not indicate that machine vision outperforms human vision on object recognition in general,” they wrote. “On recognizing elementary object categories (i.e., common objects or concepts in daily lives) such as the Pascal VOC task, machines still have obvious errors in cases that are trivial for humans. Nevertheless, we believe that our results show the tremendous potential of machine algorithms to match human-level performance on visual recognition.”

There is no word yet from Microsoft on whether this development will be used in Cortana or in the upcoming release of Windows 10.


SOURCE  Microsoft Research

By 33rd Square

Friday, January 2, 2015

Teaching Robots How To Manipulate Objects By Having Them Watch YouTube Videos

 Machine Learning
Using convolutional neural networks, a team of researchers has taught robots how to manipulate objects by having them watch videos from the Internet.




Using machine learning, an international team of researchers has taught robots how to manipulate objects by having them watch videos from the Internet.

The research, which will be presented later this month at the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), was carried out by Yezhou Yang, a Ph.D. candidate, and Yi Li at the Computer Vision Lab in the Department of Computer Science at the University of Maryland, College Park, under the supervision of Professor Yiannis Aloimonos and Dr. Cornelia Fermuller.

"Our ultimate goal is to build a self-learning robot that is able to enrich its knowledge about fine grained manipulation actions by “watching” demo videos."


The lower level of the system consists of two convolutional neural network (CNN) based recognition modules, one for classifying the hand grasp type and the other for object recognition. The higher level is a probabilistic manipulation action grammar based parsing module that aims at generating visual sentences for robot manipulation.

In experiments conducted on a publicly available dataset of unconstrained videos, the team showed that the system could learn manipulation actions by “watching” the videos with high accuracy.

"Our ultimate goal is to build a self-learning robot that is able to enrich its knowledge about fine grained manipulation actions by “watching” demo videos," the team writes in the research paper.

Working with objects is not so easy

Teaching robots how to grasp objects remains a tedious and complicated task, involving multiple subsystems like computer vision, 3D scanning and biomechanics. People generally don't think about these actions when manipulating objects once past toddlerhood; the biological system of our brains and dexterous hands is that good.

robot grasping

The researchers chose to classify manipulation actions into multiple levels of abstraction.  At lower levels the symbolic quantities are grounded in perception, and at the high level a grammatical structure represents symbolic information for objects, grasps and actions.  Their system uses CNN based object recognition and CNN based grasp type recognition.  

Using visual sentences like (LeftHand GraspType1 Object1 Action RightHand GraspType2 Object2), the system puts everything together into a program for the robot. By using the visual information from the videos, the robot chooses the grasp type based on the object. Moreover, because the videos involve human grasp behaviors, in essence, the robots are learning by watching people.
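As a rough illustration of what such a visual sentence might look like once parsed, the hypothetical sketch below turns one into an ordered list of robot commands; the field names and command format are invented for illustration and are not taken from the paper.

```python
# Hypothetical representation of a parsed "visual sentence" of the kind
# described above. Field names and command strings are illustrative only.
from dataclasses import dataclass

@dataclass
class VisualSentence:
    left_grasp: str     # grasp type recognized for the left hand
    left_object: str    # object held by the left hand
    action: str         # manipulation action
    right_grasp: str
    right_object: str

def to_robot_plan(s: VisualSentence) -> list[str]:
    """Turn a parsed visual sentence into an ordered list of robot commands."""
    return [
        f"grasp({s.left_object}, grasp_type={s.left_grasp}, hand=left)",
        f"grasp({s.right_object}, grasp_type={s.right_grasp}, hand=right)",
        f"execute({s.action}, {s.left_object}, {s.right_object})",
    ]

# Example: left hand power-grasps a knife, right hand precision-grasps a cucumber.
plan = to_robot_plan(VisualSentence("power", "knife", "cut", "precision", "cucumber"))
print("\n".join(plan))
```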

The right grasp for the job

The system also takes into account the type of grippers the robot has to work with.  For instance, a humanoid robot with one parallel gripper and one vacuum gripper using a power grasp should select the vacuum gripper for a stable grasp, and the parallel gripper for a precision grasping task.

The researchers found that their CNN system achieved 93% success in object recognition and 76% on grasp recognition. This resulted in an 83% success rate for manipulation actions by the robots, although they admit the robot did get confused about how to handle the tofu.

"We believe this preliminary integrated system raises hope towards a fully intelligent robot for manipulation tasks that can automatically enrich its own knowledge resource by “watching” recordings from the World Wide Web," write the researchers.

Future directions

In future studies the researchers plan to further extend the list of grasping types with a finer categorization, investigate the possibility of using the grasp type as an additional feature for action recognition, and automatically segment a long demonstration video into action clips based on the change of grasp type.

The team is also looking at a higher level system that would use machine learning to construct a language of manipulation much more naturally.  This work is similar to work being done on language that extends machine learning beyond words and sentences, to contextual understanding of the underlying message.


SOURCE  Robot Learning Manipulation Action Plans by "Watching" Unconstrained Videos from the World Wide Web

By 33rd Square

Thursday, November 20, 2014

Major Advances in Computer Vision Made

 Computer Vision
An essential element for robotic systems that can navigate on their own will be the ability to see and make sense of the world around them. New advancements in machine learning are greatly extending computer vision.




Computer software only recently became smart enough to recognize objects in photographs. Now, Stanford and Google researchers using machine learning have created a system that takes the next step, writing a simple story of what's happening in any digital image.

"The system can analyze an unknown image and explain it in words and phrases that make sense," said  Fei-Fei Li, a professor of computer science and director of the Stanford Artificial Intelligence Lab.

"This is an important milestone," Li said. "It's the first time we've had a computer vision system that could tell a basic story about an unknown image by identifying discrete objects and also putting them into some context."

"It's the first time we've had a computer vision system that could tell a basic story about an unknown image by identifying discrete objects and also putting them into some context."


The research, which has been published online, details how the team used a novel combination of convolutional neural networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modes.
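One simplified way to picture an objective that aligns the two modes is a score that matches every word of a sentence to its best-supporting image region and sums the similarities. The sketch below is illustrative only, with random stand-ins for the CNN region embeddings and RNN word embeddings.

```python
# Simplified image-sentence alignment score: each word is matched to its
# best-supporting image region and the similarities are summed. A sketch of
# the kind of structured objective described above, not the authors' code.
import numpy as np

def alignment_score(region_embeddings, word_embeddings):
    sims = word_embeddings @ region_embeddings.T   # (num_words, num_regions)
    return sims.max(axis=1).sum()                  # best region per word, summed

rng = np.random.default_rng(0)
regions = rng.normal(size=(19, 128))   # e.g. CNN embeddings of 19 detected regions
words = rng.normal(size=(6, 128))      # e.g. RNN embeddings of a 6-word sentence
print(alignment_score(regions, words))
```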

Humans, Li said, create mental stories that put what we see into context. "Telling a story about a picture turns out to be a core element of human visual intelligence but so far it has proven very difficult to do this with computer algorithms," she said. Li goes as far as saying vision is the key factor in the development of intelligence in animals after the Cambrian Explosion.

At the heart of the Stanford system are algorithms that enable the system to improve its accuracy by scanning scene after scene, looking for patterns, and then using the accumulation of previously described scenes to extrapolate what is being depicted in the next unknown image.

"It's almost like the way a baby learns," Li, who is featured in a video below, said.

She and her collaborators, including Andrej Karpathy, a graduate student in computer science, describe their approach in a paper submitted in advance of a forthcoming conference on cutting edge research in the field of computer vision.


Eventually these advances will lead to robotic systems that can navigate unknown situations. In the near term, machine-based systems that can discern the story in a picture will enable people to search photo or video archives and find specific images. The possibilities for computer surveillance are nothing short of chilling.

"Most of the traffic on the Internet is visual data files, and this might as well be dark matter as far as current search tools are concerned," Li said. "Computer vision seeks to illuminate that dark matter."

The new Stanford paper describes two years of effort that flows from research that Li has been pursuing for a decade. Her work builds on advances that have come, slowly at times, over the last 50 years since MIT scientist Seymour Papert convened a "summer project" to create computer vision in 1966.

Conceived during the early days of artificial intelligence, that timeline proved exceedingly optimistic, as computer scientists struggled to replicate in machines what took millions of years to evolve in living beings. It took researchers 20 years to create systems that could take the relatively simple first step of recognizing discrete objects in photographs.

More recently the emergence of the Internet has helped to propel computer vision. On one hand, the growth of photo and video uploads has created a demand for tools to sort, search and sift visual information. On the other, sophisticated algorithms running on powerful computers have led to electronic systems that can train themselves by performing repetitive tasks, improving as they go.

Computer scientists call this machine learning, and Li likened this to how a child learns soccer by getting out and kicking the ball. A coach might demonstrate how to kick, and comment on the child's technique. But improvement occurs from within as the child's eyes, brain, nerves and muscles make tiny adjustments.

Researchers such as Li are developing ways to create positive feedback loops in machines by inserting mathematical instructions into software. Her latest algorithms incorporate work that her researchers and others have done. This includes training their system on a visual dictionary, using a database of more than 14 million objects.

Each object is described by a mathematical term, or vector, that enables the machine to recognize the shape the next time it is encountered. Those mathematical definitions are linked to the words humans would use to describe the objects, be they cars, carrots, men, mountains or zebras.

Li played a leading role in creating this training tool, the ImageNet project, but her current work goes well beyond memorizing this visual dictionary.

 Her team's new computer vision algorithm trained itself by looking for patterns in a visual dictionary, but this time a dictionary of scenes, a more complicated task than looking just at objects.

 This was a smaller database, made up of tens of thousands of images. Each scene is described in two ways: in mathematical terms that the machine could use to recognize similar scenes and also in a phrase that humans would understand. For instance, one image might be "cat sits on keyboard" while another could be "girl rides on horse in field."

These two databases – one of objects and the other of scenes – served as training material.  Li's machine-learning algorithm analyzed the patterns in these predefined pictures and then applied its analysis to unknown images and used what it had learned to identify individual objects and provide some rudimentary context. In other words, it told a simple story about the image.



SOURCE  Stanford University

By 33rd Square

Tuesday, September 9, 2014


 Computer Vision
Google has explained their new award-winning image detection system that can identify multiple objects in a scene, even if they're partly obscured. The key is a neural network that can rapidly refine the criteria it's looking for without requiring a lot of extra computing power.




During this year's annual ImageNet computer vision competition, the winning techniques continued the field's rapid progress, blowing last year's entries out of the water.

John Markoff of the New York Times recently published a piece on the competition and some of those improvements.

“We see innovation and creativity exploding,” said Fei-Fei Li, the director of the Stanford Artificial Intelligence Laboratory and one of the creators of a vast set of labeled digital images that is the basis for the contest. “The algorithms are more complex and they are just more interesting.”

"These technological advances will enable even better image understanding on our side and the progress is directly transferable to Google products such as photo search, image search, YouTube, self-driving cars, and any place where it is useful to understand what is in an image as well as where things are."


In the five years that the contest has been held, the organizers have twice, once in 2012 and again this year, seen striking improvements in accuracy, accompanied by more sophisticated algorithms along with larger and faster computers.

Now, Google has published a blog post explaining some of their techniques, including deep learning networks. The team of researchers used the methods to win in several categories at the competition.

The deeper scanning system Google used can both identify more objects and make better guesses, correctly picking out items in a living room and, in one example, a jumping cat.

Despite the incredible increases in computer vision accuracy, the systems still cannot match human vision, according to the researchers, and there is a lot of progress remaining to equal a human looking at an image.


According to the post, "These technological advances will enable even better image understanding on our side and the progress is directly transferable to Google products such as photo search, image search, YouTube, self-driving cars, and any place where it is useful to understand what is in an image as well as where things are."


SOURCE  Google Research

By 33rd Square

Thursday, June 12, 2014


 Artificial Intelligence
Computer scientists from the University of Washington and the Allen Institute for Artificial Intelligence in Seattle have created the first fully automated computer program that teaches itself everything there is to know about any visual concept.




Computer scientists from the University of Washington and the Allen Institute for Artificial Intelligence in Seattle have created the first fully automated computer program that teaches itself everything there is to know about any visual concept. Called Learning Everything about Anything, or LEVAN, the program searches millions of books and images on the Web to learn all possible variations of a concept, then displays the results to users as a comprehensive, browsable list of images, helping them explore and understand topics quickly in great detail.

"The program learns to tightly couple rich sets of phrases with pixels in images. This means that it can recognize instances of specific concepts when it sees them."


“It is all about discovering associations between textual and visual data,” said Ali Farhadi, a UW assistant professor of computer science and engineering. “The program learns to tightly couple rich sets of phrases with pixels in images. This means that it can recognize instances of specific concepts when it sees them.”

The research team will present the project and a related paper this month at the Computer Vision and Pattern Recognition annual conference in Columbus, Ohio.


The program learns which terms are relevant by looking at the content of the images found on the Web and identifying characteristic patterns across them using object recognition algorithms. It’s different from online image libraries because it draws upon a rich set of phrases to understand and tag photos by their content and pixel arrangements, not simply by words displayed in captions.

Users can browse the existing library of roughly 175 concepts. Existing concepts range from “airline” to “window,” and include “beautiful,” “breakfast,” “shiny,” “cancer,” “innovation,” “skateboarding,” “robot,” and the researchers’ first-ever input, “horse.”

If the concept you’re looking for doesn’t exist, you can submit any search term and the program will automatically begin generating an exhaustive list of subcategory images that relate to that concept. For example, a search for “dog” brings up the obvious collection of subcategories: Photos of “Chihuahua dog,” “black dog,” “swimming dog,” “scruffy dog,” “greyhound dog.” But also “dog nose,” “dog bowl,” “sad dog,” “ugliest dog,” “hot dog” and even “down dog,” as in the yoga pose.

The technique works by searching the text from millions of books written in English and available on Google Books, scouring for every occurrence of the concept in the entire digital library. Then, an algorithm filters out words that aren’t visual. For example, with the concept “horse,” the algorithm would keep phrases such as “jumping horse,” “eating horse” and “barrel horse,” but would exclude non-visual phrases such as “my horse” and “last horse.”

Once it has learned which phrases are relevant, the program does an image search on the Web, looking for uniformity in appearance among the photos retrieved. When the program is trained to find relevant images of, say, “jumping horse,” it then recognizes all images associated with this phrase.
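As a toy illustration of the filtering step, the sketch below keeps a phrase only when the images retrieved for it look alike, and discards phrases whose images are visually inconsistent. The similarity measure, threshold and data are invented for illustration and are not LEVAN's code.

```python
# Toy sketch of the phrase-filtering idea described above: phrases whose
# retrieved images agree visually are kept as "visual" subcategories,
# inconsistent ones (e.g. "my horse") are dropped. Illustrative only.
import numpy as np

def average_pairwise_similarity(features: np.ndarray) -> float:
    """Mean cosine similarity between all image-feature pairs for one phrase."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(features)
    return (sims.sum() - n) / (n * (n - 1))   # exclude self-similarity

def filter_visual_phrases(phrase_to_features: dict, threshold: float = 0.3) -> list:
    return [p for p, feats in phrase_to_features.items()
            if average_pairwise_similarity(feats) >= threshold]

rng = np.random.default_rng(1)
coherent = rng.normal(size=(1, 64)) + 0.1 * rng.normal(size=(20, 64))   # look-alike images
incoherent = rng.normal(size=(20, 64))                                  # unrelated images
print(filter_visual_phrases({"jumping horse": coherent, "my horse": incoherent}))
```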

“Major information resources such as dictionaries and encyclopedias are moving toward the direction of showing users visual information because it is easier to comprehend and much faster to browse through concepts. However, they have limited coverage as they are often manually curated. The new program needs no human supervision, and thus can automatically learn the visual knowledge for any concept,” said Santosh Divvala, a research scientist at the Allen Institute for Artificial Intelligence and an affiliate scientist at UW in computer science and engineering.

The research team also includes Carlos Guestrin, a UW professor of computer science and engineering. The researchers launched the program in March with only a handful of concepts and have watched it grow since then to tag more than 13 million images with 65,000 different phrases.

LEVAN

Right now, the program is limited in how fast it can learn about a concept because of the computational power it takes to process each query, up to 12 hours for some broad concepts. The researchers are working on increasing the processing speed and capabilities.

The team wants the open-source program to be both an educational tool as well as an information bank for researchers in the computer vision community. The team also hopes to offer a smartphone app that can run the program to automatically parse out and categorize photos.


SOURCE  University of Washington

By 33rd Square

Monday, March 24, 2014

Computers See Through Faked Expressions of Pain Better Than People

 Computer Vision
A joint study by researchers at the University of California, San Diego and the University of Toronto has found that a computer system spots real or faked expressions of pain more accurately than people can.




A joint study by researchers at the University of California, San Diego and the University of Toronto has found that a computer system spots real or faked expressions of pain more accurately than people can.

The work, titled “Automatic Decoding of Deceptive Pain Expressions,” is published in the latest issue of Current Biology.

“The computer system managed to detect distinctive dynamic features of facial expressions that people missed,” said Marian Bartlett, research professor at UC San Diego’s Institute for Neural Computation and lead author of the study. “Human observers just aren’t very good at telling real from faked expressions of pain.”

"Our computer-vision system can be applied to detect states in which the human face may provide important clues as to health, physiology, emotion, or thought."


Senior author Kang Lee, professor at the Dr. Eric Jackman Institute of Child Study at the University of Toronto, said “humans can simulate facial expressions and fake emotions well enough to deceive most observers. The computer’s pattern-recognition abilities prove better at telling whether pain is real or faked.”

The research team found that humans could not discriminate real from faked expressions of pain better than random chance and, even after training, only improved accuracy to a modest 55 percent. The computer system attained 85 percent accuracy.

“In highly social species such as humans,” said Lee, “faces have evolved to convey rich information, including expressions of emotion and pain. And, because of the way our brains are built, people can simulate emotions they’re not actually experiencing – so successfully that they fool other people. The computer is much better at spotting the subtle differences between involuntary and voluntary facial movements.”

“By revealing the dynamics of facial action through machine vision systems,” said Bartlett, “our approach has the potential to elucidate ‘behavioral fingerprints’ of the neural-control systems involved in emotional signaling.”

The single most predictive feature of falsified expressions, the study shows, is the mouth, and how and when it opens. Fakers’ mouths open with less variation and too regularly.
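To make the over-regularity cue concrete, the toy sketch below flags a sequence of mouth-opening durations as suspicious when its variation is unusually low. The threshold and numbers are invented for illustration; this is not the study's actual classifier.

```python
# Toy illustration of the "over-regularity" cue described above: genuine pain
# expressions vary more in how and when the mouth opens, so unusually low
# variability is treated as a red flag. Thresholds and data are illustrative.
import statistics

def looks_faked(opening_durations_ms, min_coeff_of_variation=0.25):
    """Flag a sequence of mouth-opening durations that is 'too regular'."""
    mean = statistics.mean(opening_durations_ms)
    cv = statistics.stdev(opening_durations_ms) / mean   # coefficient of variation
    return cv < min_coeff_of_variation

print(looks_faked([410, 400, 405, 395, 402]))   # very regular -> True (suspicious)
print(looks_faked([250, 610, 330, 480, 900]))   # irregular    -> False
```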

“Further investigations,” said the researchers, “will explore whether over-regularity is a general feature of fake expressions.”

In addition to detecting pain malingering, the computer-vision system might be used to detect other real-world deceptive actions in the realms of homeland security, psychopathology, job screening, medicine, and law, said Bartlett.

“As with causes of pain, these scenarios also generate strong emotions, along with attempts to minimize, mask, and fake such emotions, which may involve ‘dual control’ of the face,” she said. “In addition, our computer-vision system can be applied to detect states in which the human face may provide important clues as to health, physiology, emotion, or thought, such as drivers’ expressions of sleepiness, students’ expressions of attention and comprehension of lectures, or responses to treatment of affective disorders.”


SOURCE  UC San Diego

By 33rd Square

Sunday, November 24, 2013

Never Ending Image Learner (NEIL)


 Artificial Intelligence
Running since July of this year, Carnegie Mellon University's computer vision system NEIL, the Never Ending Image Learner, has analyzed over five million images, labeled half a million of them and learned 3,000 common sense relationships.




The Never Ending Image Learner (NEIL) at Carnegie Mellon University is running 24 hours a day, searching the internet for images and doing its best to understand them on its own and, as it builds a growing visual database, gathering common sense on a massive scale.

NEIL, which is partially funded by Google, leverages recent advances in computer vision that enable computer programs to identify and label objects in images, to characterize scenes and to recognize attributes, such as colors, lighting and materials, all with a minimum of human supervision. In turn, the data it generates will further enhance the ability of computers to understand the visual world.

But NEIL also makes associations between these things to obtain common sense information that people just seem to know without ever saying — that cars often are found on roads, that buildings tend to be vertical and that ducks look sort of like geese. Based on text references, it might seem that the color associated with sheep is black, but people — and NEIL — nevertheless know that sheep typically are white.

"Images are the best way to learn visual properties," said Abhinav Gupta, assistant research professor in Carnegie Mellon's Robotics Institute. "Images also include a lot of common sense information about the world. People learn this by themselves and, with NEIL, we hope that computers will do so as well."

A computer cluster has been running the NEIL program since late July and already has analyzed three million images, identifying 1,500 types of objects in half a million images and 1,200 types of scenes in hundreds of thousands of images. It has connected the dots to learn 2,500 associations from thousands of instances.

Never Ending Image Learner (NEIL)

The public can now view NEIL's findings at the project website, www.neil-kb.com.

The research team, including Xinlei Chen, a Ph.D. student in CMU's Language Technologies Institute, and Abhinav Shrivastava, a Ph.D. student in robotics, will present its findings on Dec. 4 at the IEEE International Conference on Computer Vision in Sydney, Australia.

One motivation for the NEIL project is to create the world's largest visual structured knowledge base, where objects, scenes, actions, attributes and contextual relationships are labeled and catalogued.

"What we have learned in the last 5-10 years of computer vision research is that the more data you have, the better computer vision becomes," Gupta said.

Some projects, such as ImageNet and Visipedia, have tried to compile this structured data with human assistance. But the scale of the Internet is so vast — Facebook alone holds more than 200 billion images — that the only hope to analyze it all is to teach computers to do it largely by themselves.

Shrivastava said NEIL can sometimes make erroneous assumptions that compound mistakes, so people need to be part of the process. A Google Image search, for instance, might convince NEIL that "pink" is just the name of a singer, rather than a color.

"People don't always know how or what to teach computers," he observed. "But humans are good at telling computers when they are wrong."

People also tell NEIL what categories of objects, scenes, etc., to search and analyze. But sometimes, what NEIL finds can surprise even the researchers. It can be anticipated, for instance, that a search for "apple" might return images of fruit as well as laptop computers. But Gupta and his landlubbing team had no idea that a search for F-18 would identify not only images of a fighter jet, but also of F18-class catamarans.

As its search proceeds, NEIL develops subcategories of objects — tricycles can be for kids, for adults and can be motorized, or cars come in a variety of brands and models. And it begins to notice associations — that zebras tend to be found in savannahs, for instance, and that stock trading floors are typically crowded.
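As a rough, hypothetical illustration of how such associations could be mined, the sketch below counts how often object and scene labels co-occur across labeled images and keeps the strongly linked pairs; none of this is NEIL's actual implementation.

```python
# Hypothetical sketch of mining object/scene associations from labeled images,
# in the spirit of the NEIL examples above (e.g. zebras tend to appear in
# savannahs). Data, counts and threshold are illustrative only.
from collections import Counter

labeled_images = [
    {"objects": ["zebra"], "scene": "savannah"},
    {"objects": ["zebra", "acacia"], "scene": "savannah"},
    {"objects": ["car"], "scene": "road"},
    {"objects": ["car", "bus"], "scene": "road"},
    {"objects": ["zebra"], "scene": "zoo"},
]

pair_counts = Counter()
for img in labeled_images:
    for obj in img["objects"]:
        pair_counts[(obj, img["scene"])] += 1   # count object/scene co-occurrences

associations = [pair for pair, n in pair_counts.items() if n >= 2]
print(associations)   # [('zebra', 'savannah'), ('car', 'road')]
```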

NEIL is computationally intensive, the research team noted. The program runs on two clusters of computers that include 200 processing cores.



SOURCE  Carnegie Mellon University

By 33rd Square

Monday, September 23, 2013

What Computers See

 Object Recognition
Researchers have developed a new technique that enables the visualization of a common mathematical representation of images, which should help researchers understand why their current recognition algorithms fail.




Object-recognition systems, software that tries to identify objects in digital images, are still fairly limited in capability. Even the best object-recognition systems succeed only around 30 or 40 percent of the time, and their failures can be totally baffling.

Now, in an attempt to improve these systems, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory have created a system that allows humans to see the world the way an object-recognition system does.

The team has also published their results in a paper available online.

HOG

Their system, called HOG Glasses, takes an ordinary image, translates it into the mathematical representation used by an object-recognition system and then, using inventive new algorithms, translates it back into a conventional image.

The researchers report that, when presented with the re-translation of a translation, human volunteers make classification errors that are very similar to those made by computers.

That suggests that the learning algorithms are just fine, and throwing more data at the problem won’t help; it’s the feature selection that’s the culprit.

The researchers are also hopeful that, in addition to identifying the problem, their system will also help solve it, by letting their colleagues reason more intuitively about the consequences of particular feature decisions.

The feature set most widely used in computer-vision research is called the histogram of oriented gradients, or HOG. HOG first breaks an image into square chunks, usually eight pixels by eight pixels. Then, for each square, it identifies a “gradient,” or change in color or shade from one region to another. It characterizes the gradient according to 32 distinct variables, such as its orientation — vertical, horizontal or diagonal, for example — and the sharpness of the transition — whether it changes color suddenly or gradually.

Thirty-two variables for each square translates to thousands of variables for a single image, which define a space with thousands of dimensions. Any conceivable image can be characterized as a single point in that space, and most object-recognition systems try to identify patterns in the collections of points that correspond with particular objects.
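Readers who want to experiment with HOG features can compute them in a few lines with scikit-image, which implements a closely related variant of the descriptor (the exact 32-variable flavor described above differs slightly). The parameters below are illustrative.

```python
# Computing HOG features with scikit-image. This is a related HOG variant,
# used here only to illustrate the descriptor described above.
from skimage.feature import hog
from skimage import data, color

image = color.rgb2gray(data.astronaut())   # sample image shipped with skimage
features, hog_image = hog(
    image,
    orientations=9,            # number of gradient-orientation bins
    pixels_per_cell=(8, 8),    # the 8x8-pixel squares described above
    cells_per_block=(2, 2),    # local normalization blocks
    visualize=True,            # also return an image of the dominant gradients
)
print(features.shape)          # thousands of variables for a single image
```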

“This feature space, HOG, is very complex,” says Carl Vondrick, an MIT graduate student in electrical engineering and computer science and first author on the new paper. “A bunch of researchers sat down and tried to engineer, ‘What’s the best feature space we can have?’ It’s very highly dimensional. It’s almost impossible for a human to comprehend intuitively what’s going on. So what we’ve done is built a way to visualize this space.”

Vondrick; his advisor, Antonio Torralba, an associate professor of electrical engineering and computer science; and two other researchers in Torralba’s group, graduate student Aditya Khosla and postdoc Tomasz Malisiewicz, experimented with several different algorithms for converting points in HOG space back into ordinary images. One of those algorithms, which didn’t turn out to be the most reliable, nonetheless offers a fairly intuitive understanding of the process.

The algorithm first produces a HOG for an image and then scours a database for images that match it — on a very weak understanding of the word “match.”

“Because it’s a weak detector, you won’t find very good matches,” Vondrick explains. “But if you average all the top ones together, you actually get a fairly good reconstruction. Even though each detection is wrong, each one still captures the statistics of the original image patch.”

The reconstruction algorithm that ended up proving the most reliable is more complex. It uses a so-called “dictionary,” a technique that’s increasingly popular in computer-vision research. The dictionary consists of a large group of HOGs with fairly regular properties: One, for instance, might have a top half that’s all diagonal gradients running bottom left to upper right, while the bottom half is all horizontal gradients; another might have gradients that rotate slowly as you move from left to right across each row of squares. But any given HOG can be represented as a weighted combination of these dictionary “atoms.”

The researchers’ algorithm assembled the dictionary by analyzing thousands of images downloaded from the Internet and settled on the dictionary that allowed it to reconstruct the HOG for each of them with, on average, the fewest atoms. The trick is that, for each atom in the dictionary, the algorithm also learned the ordinary image that corresponds to it. So for an arbitrary HOG, it can apply the same weights to the ordinary images that it does to the dictionary atoms, producing a composite image.
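The paired-dictionary idea can be sketched in a few lines: solve for the weights that rebuild a HOG vector from HOG-space atoms, then apply those same weights to the paired image-space atoms. The toy version below uses plain least squares and random atoms purely for illustration, where the paper uses learned dictionaries and sparse coding.

```python
# Toy sketch of the paired-dictionary inversion idea described above (not the
# researchers' algorithm): weights found in HOG space are reused in pixel space.
import numpy as np

rng = np.random.default_rng(0)
n_atoms, hog_dim, image_dim = 50, 300, 4096           # illustrative sizes
hog_atoms = rng.normal(size=(hog_dim, n_atoms))       # dictionary in HOG space
image_atoms = rng.normal(size=(image_dim, n_atoms))   # paired atoms in pixel space

def invert_hog(hog_vector: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate image from a HOG descriptor via shared weights."""
    weights, *_ = np.linalg.lstsq(hog_atoms, hog_vector, rcond=None)
    return image_atoms @ weights                      # same weights, image atoms

reconstruction = invert_hog(rng.normal(size=hog_dim))
print(reconstruction.shape)                           # (4096,) e.g. a 64x64 patch
```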

The volunteers were slightly better than machine-learning algorithms at identifying the objects depicted in the reconstructions, but only slightly — nowhere near the disparity of 60 or 70 percent when object detectors and humans are asked to identify objects in the raw images. And the dropoff in accuracy as the volunteers moved from the easiest cases to the more difficult ones mirrored that of the object detectors.

Using HOG, the researchers hope to help others develop more efficient object recognition systems and highlight why failures may result. As Marcel Proust noted, "The real voyage of discovery consists not in seeking new landscapes but in having new eyes."



SOURCE  MIT

By 33rd Square

Friday, June 14, 2013

Marauder's Map-Like Tracking System from Carnegie Mellon

 Computer Vision
Researchers at Carnegie Mellon University have developed a method for tracking the locations of multiple individuals in complex, indoor settings using a network of video cameras, creating something similar to the fictional Marauder’s Map used by Harry Potter to track comings and goings at the Hogwarts School.




Researchers at Carnegie Mellon University have developed a method for tracking the locations of multiple individuals in complex, indoor settings using a network of video cameras, creating something similar to the fictional Marauder's Map used by Harry Potter to track comings and goings at the Hogwarts School.

The method used in the research was able to automatically follow the movements of 13 people within a nursing home, even though individuals sometimes slipped out of view of the cameras. None of Potter's magic was needed to track them for prolonged periods; rather, the researchers made use of multiple cues from the video feed: apparel color, person detection, trajectory and, perhaps most significantly, facial recognition.

Multi-camera, multi-object tracking has been an active field of research for a decade, but automated techniques have only focused on well-controlled lab environments. The Carnegie Mellon team, by contrast, proved their technique with actual residents and employees in a nursing facility—with camera views compromised by long hallways, doorways, people mingling in the hallways, variations in lighting and too few cameras to provide comprehensive, overlapping views.

The performance of the Carnegie Mellon algorithm significantly improved on two of the leading algorithms in multi-camera, multi-object tracking. It located individuals within one meter of their actual position 88 percent of the time, compared with 35 percent and 56 percent for the other algorithms.

The researchers—Alexander Hauptmann, principal systems scientist in the Computer Science Department (CSD); Shoou-I Yu, a Ph.D. student in the Language Technologies Institute; and Yi Yang, a CSD post-doctoral researcher—will present their findings June 27 at the Computer Vision and Pattern Recognition Conference in Portland, Ore.

Though Harry Potter could activate the Marauder's Map only by first solemnly swearing "I am up to no good," the Carnegie Mellon researchers developed their tracking technique as part of an effort to monitor the health of nursing home residents.

"The goal is not to be Big Brother, but to alert the caregivers of subtle changes in activity levels or behaviors that indicate a change of health status," Hauptmann said. All of the people in this study consented to being tracked.

These automated tracking techniques also would be useful in airports, public facilities and other areas where security is a concern. Despite the importance of cameras in identifying perpetrators following this spring's Boston Marathon bombing and the 2005 London bombings, much of the video analysis necessary for tracking people continues to be done manually, Hauptmann noted.

The CMU work on monitoring nursing home residents began in 2005 as part of a National Institutes of Health-sponsored project called CareMedia, which is now associated with the Quality of Life Technology Center, a National Science Foundation engineering research center at CMU and the University of Pittsburgh.

"We thought it would be easy," Hauptmann said of multi-camera tracking, "but it turned out to be incredibly challenging."

Something as simple as tracking based on color of clothing proved difficult, for instance, because the same color apparel can appear different to cameras in different locations, depending on variations in lighting. Likewise, a camera's view of an individual can often be blocked by other people passing in hallways, by furniture and when an individual enters a room or other area not covered by cameras, so individuals must be regularly re-identified by the system.

Face detection helps immensely in re-identifying individuals on different cameras. But Yang noted that faces can be recognized in less than 10 percent of the video frames. So the researchers developed mathematical models that enabled them to combine information, such as appearance, facial recognition and motion trajectories.

Using all of the information is key to the tracking process, but Yu said facial recognition proved to be the greatest help. When the researchers removed facial recognition information from the mix, their on-track performance in the nursing home data dropped from 88 percent to 58 percent, not much better than one of the existing tracking algorithms.
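As a toy illustration of combining cues into a single identity score, the sketch below weights appearance, face, and trajectory similarities, falling back to the available cues when no face is visible. The weights and scoring scheme are invented for illustration and are not the CMU algorithm.

```python
# Toy sketch of fusing multiple tracking cues into one identity score, in the
# spirit of the approach described above. Cues, weights and the fallback rule
# are illustrative only.
from typing import Optional

def identity_score(appearance_sim: float,
                   face_sim: Optional[float],
                   trajectory_consistency: float,
                   w_appearance: float = 0.3,
                   w_face: float = 0.5,
                   w_trajectory: float = 0.2) -> float:
    """Weighted combination of cues; face similarity is missing in most frames."""
    if face_sim is None:                       # face visible in under 10% of frames
        total = w_appearance + w_trajectory    # renormalize over available cues
        return (w_appearance * appearance_sim +
                w_trajectory * trajectory_consistency) / total
    return (w_appearance * appearance_sim +
            w_face * face_sim +
            w_trajectory * trajectory_consistency)

print(identity_score(0.8, None, 0.6))   # appearance + trajectory only
print(identity_score(0.8, 0.95, 0.6))   # face recognition available
```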

The nursing home video analyzed by the researchers was recorded in 2005 using 15 cameras; the recordings are just more than six minutes long.

Further work will be necessary to extend the technique during longer periods of time and enable real-time monitoring. The researchers also are looking at additional ways to use video to monitor resident activity while preserving privacy, such as by only recording the outlines of people together with distance information from depth cameras similar to the Microsoft Kinect.



SOURCE  Carnegie Mellon University

By 33rd Square