(Press-News.org) Object recognition -- determining what objects are where in a digital image -- is a central research topic in computer vision.
But a person looking at an image will spontaneously make a higher-level judgment about the scene as a whole: It's a kitchen, or a campsite, or a conference room. Among computer science researchers, the problem known as "scene recognition" has received relatively little attention.
Last December, at the Annual Conference on Neural Information Processing Systems, MIT researchers announced the compilation of the world's largest database of images labeled according to scene type, with 7 million entries. Exploiting a machine-learning technique known as "deep learning" -- a revival of the classic artificial-intelligence technique of neural networks -- they used the database to train the most successful scene classifier yet, which was between 25 and 33 percent more accurate than its best predecessor.
At the International Conference on Learning Representations this weekend, the researchers will present a new paper demonstrating that, en route to learning how to recognize scenes, their system also learned how to recognize objects. The work implies that at the very least, scene-recognition and object-recognition systems could work in concert. But it also holds out the possibility that they could prove to be mutually reinforcing.
"Deep learning works very well, but it's very hard to understand why it works -- what is the internal representation that the network is building," says Antonio Torralba, an associate professor of computer science and engineering at MIT and a senior author on the new paper. "It could be that the representations for scenes are parts of scenes that don't make any sense, like corners or pieces of objects. But it could be that it's objects: To know that something is a bedroom, you need to see the bed; to know that something is a conference room, you need to see a table and chairs. That's what we found, that the network is really finding these objects."
Torralba is joined on the new paper by first author Bolei Zhou, a graduate student in electrical engineering and computer science; Aude Oliva, a principal research scientist, and Agata Lapedriza, a visiting scientist, both at MIT's Computer Science and Artificial Intelligence Laboratory; and Aditya Khosla, another graduate student in Torralba's group.
Under the hood
Like all machine-learning systems, neural networks try to identify features of training data that correlate with annotations performed by human beings -- transcriptions of voice recordings, for instance, or scene or object labels associated with images. But unlike the machine-learning systems that produced, say, the voice-recognition software common in today's cellphones, neural nets make no prior assumptions about what those features will look like.
That sounds like a recipe for disaster, as the system could end up churning away on irrelevant features in a vain hunt for correlations. But instead of deriving a sense of direction from human guidance, neural networks derive it from their structure. They're organized into layers: Banks of processing units -- loosely modeled on neurons in the brain -- in each layer perform random computations on the data they're fed. But they then feed their results to the next layer, and so on, until the outputs of the final layer are measured against the data annotations. As the network receives more data, it readjusts its internal settings to try to produce more accurate predictions.
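As a rough illustration of the setup just described -- banks of units whose settings start out random, each layer feeding its results to the next, with the final layer's output scored against the human annotations and the internal settings readjusted -- here is a minimal sketch in PyTorch. The framework, layer sizes, and category count are illustrative assumptions, not details from the researchers' actual system.

```python
# Minimal sketch of a layered network trained against image labels.
# PyTorch, the 256-dim inputs, and the 10 scene categories are assumptions
# for illustration, not the researchers' actual architecture.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),   # first bank of units (weights start random)
    nn.Linear(128, 64), nn.ReLU(),    # results feed forward to the next layer
    nn.Linear(64, 10),                # final layer, scored against the labels
)
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def training_step(images, labels):
    """One pass: predict, compare with the annotations, readjust the settings."""
    optimizer.zero_grad()
    outputs = net(images)             # outputs of the final layer
    loss = loss_fn(outputs, labels)   # measured against the human annotations
    loss.backward()                   # propagate the error back through the layers
    optimizer.step()                  # readjust the internal settings
    return loss.item()

# Toy usage with random tensors standing in for real images and scene labels.
images = torch.randn(32, 256)
labels = torch.randint(0, 10, (32,))
print(training_step(images, labels))
```

Repeating that step over millions of labeled images is what gradually turns the initially random computations into useful ones.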
After the MIT researchers' network had processed millions of input images, readjusting its internal settings all the while, it was about 50 percent accurate at labeling scenes -- where human beings are only 80 percent accurate, since they can disagree about high-level scene labels. But the researchers didn't know how their network was doing what it was doing.
The units in a neural network, however, respond differentially to different inputs. If a unit is tuned to a particular visual feature, it won't respond at all if the feature is entirely absent from a particular input. If the feature is clearly present, it will respond forcefully.
The MIT researchers identified the 60 images that produced the strongest response in each unit of their network; then, to avoid biasing the assessments, they sent the collections of images to paid workers on Amazon's Mechanical Turk crowdsourcing site, whom they asked to identify commonalities among the images.
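For readers curious what that "strongest response" analysis might look like in code, here is a hypothetical sketch: given a matrix of per-unit activations over a set of images (a stand-in for the network's actual responses), it pulls out, for each unit, the indices of the images that drive that unit hardest. The array shapes and the helper name are illustrative assumptions, not the paper's released code.

```python
# Sketch of selecting the top-k images per unit by activation strength.
# The activation matrix here is random toy data, not real network output.
import numpy as np

def top_images_per_unit(activations, k=60):
    """Return, for each unit, the indices of the k images that excite it most.

    activations: array of shape (num_images, num_units), one row per input image.
    """
    order = np.argsort(activations, axis=0)   # ascending per unit
    return order[-k:][::-1].T                 # (num_units, k), strongest first

# Toy usage: 1000 fake images, 50 units.
acts = np.random.rand(1000, 50)
top = top_images_per_unit(acts, k=60)
print(top.shape)   # (50, 60) -- 60 image indices per unit
```

Each row of the result is the set of images that would be bundled together and shown to a Mechanical Turk worker for one unit.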
Beyond category
"The first layer, more than half of the units are tuned to simple elements -- lines, or simple colors," Torralba says. "As you move up in the network, you start finding more and more objects. And there are other things, like regions or surfaces, that could be things like grass or clothes. So they're still highly semantic, and you also see an increase."
According to the assessments by the Mechanical Turk workers, about half of the units at the top of the network are tuned to particular objects. "The other half, either they detect objects but don't do it very well, or we just don't know what they are doing," Torralba says. "They may be detecting pieces that we don't know how to name. Or it may be that the network hasn't fully converged, fully learned."
In ongoing work, the researchers are starting from scratch and retraining their network on the same data sets, to see if it consistently converges on the same objects, or whether it can randomly evolve in different directions that still produce good predictions. They're also exploring whether object detection and scene detection can feed back into each other, to improve the performance of both. "But we want to do that in a way that doesn't force the network to do something that it doesn't want to do," Torralba says.
Related links
ARCHIVE: Never forget a face
http://newsoffice.mit.edu/2013/never-forget-a-face
ARCHIVE: Teaching computers to see -- by learning to see like computers
http://newsoffice.mit.edu/2013/teaching-computers-to-see-by-learning-to-see-like-computers-0919