MIT neuroscientists see design flaws in computer vision tests
Cathryn M. Delude, McGovern Institute
January 24, 2008
For years, scientists have been trying to teach computers how to see like
humans, and recent research has seemed to show computers making
progress in recognizing visual objects.
A new MIT study, however, cautions that this apparent success may be
misleading because the tests being used are inadvertently stacked in favor of computers.
Computer vision is important for applications ranging from "intelligent" cars
to visual prosthetics for the blind. Recent computational models show
apparently impressive progress, boasting 60-percent success rates in
classifying natural photographic image sets. These include the widely used
Caltech101 database, intended to test computer vision algorithms against
the variety of images seen in the real world.
However, James DiCarlo, a neuroscientist in the McGovern Institute for
Brain Research at MIT, graduate student Nicolas Pinto and David Cox of
the Rowland Institute at Harvard argue that these image sets have design
flaws that enable computers to succeed where they would fail with
more-authentically varied images. For example, photographers tend to
center objects in a frame and to prefer certain views and contexts. The
visual system, by contrast, encounters objects in a much broader range of conditions.
"The ease with which we recognize visual objects belies the computational
difficulty of this feat," explains DiCarlo, senior author of the study in the Jan.
25 online edition of PLoS Computational Biology. "The core challenge is
image variation. Any given object can cast innumerable images onto the
retina depending on its position, distance, orientation, lighting and background."
The team exposed the flaws in current tests of computer object recognition by using a simple "toy" computer model inspired by the earliest steps in the
brain's visual pathway. Artificial neurons with properties resembling those in
the brain's primary visual cortex analyze each point in the image and
capture low-level information about the position and orientation of line
boundaries. The model lacks the more sophisticated analysis that happens
in later stages of visual processing to extract information about higher-level
features of the visual scene such as shapes, surfaces or the spaces between objects.
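The paper specifies its own V1-like model, with its own filters, normalization and pooling; purely as an illustration of what low-level, orientation-tuned features of this kind look like in code, here is a minimal sketch of a Gabor-filter front end. The filter size, wavelength, number of orientations and the random test image are arbitrary choices for this example, not the study's parameters.

import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size=11, wavelength=4.0, theta=0.0, sigma=2.5):
    """Oriented Gabor filter, a common stand-in for a V1 simple-cell receptive field."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot**2 + y_rot**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_rot / wavelength)
    kernel = envelope * carrier
    return kernel - kernel.mean()  # zero-mean, so uniform regions give no response

def v1_like_features(image, n_orientations=4):
    """Convolve a grayscale image with Gabor filters at several orientations
    and return the rectified response maps as a single feature vector."""
    responses = []
    for k in range(n_orientations):
        theta = k * np.pi / n_orientations
        resp = convolve2d(image, gabor_kernel(theta=theta),
                          mode="same", boundary="symm")
        responses.append(np.abs(resp))  # rectified edge-energy at this orientation
    return np.concatenate([r.ravel() for r in responses])

# Example: extract features from a random 64x64 "image"
features = v1_like_features(np.random.rand(64, 64))
print(features.shape)  # (4 * 64 * 64,) = (16384,)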
The researchers intended this model as a straw man, expecting it to fail and thereby establish a baseline. When they tested it on the Caltech101 images, however, the model did surprisingly well, performing as well as or better than five state-of-the-art object-recognition systems.
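The authors ran their own benchmarking protocol; the fragment below is only a generic sketch of how such comparisons are typically scored: feed per-image feature vectors (for instance, the V1-like features sketched above) to a simple classifier and measure held-out accuracy. The scikit-learn classifier, the random placeholder data and all the sizes here are assumptions for illustration, not the study's setup.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in data: in a real test these rows would be feature vectors
# computed from each labeled image in the benchmark set.
n_images, n_features, n_categories = 300, 512, 2
X = rng.normal(size=(n_images, n_features))
y = rng.integers(0, n_categories, size=n_images)

# Train a simple linear classifier on the features and estimate accuracy
# with cross-validation; chance level here is 1 / n_categories.
clf = LinearSVC(C=1.0, max_iter=5000)
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.2f} (chance: {1 / n_categories:.2f})")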
How could that be? "We suspected that the supposedly natural images in
current computer vision tests do not really engage the central problem of
variability, and that our intuitions about what makes objects hard or easy to
recognize are incorrect," Pinto explains.
To test this idea, the authors designed a more carefully controlled test.
Using just two categories--planes and cars--they introduced variations in
position, size and orientation that better reflect the range of variation in the real world.
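The study built its controlled test from its own rendered object views; the sketch below merely illustrates the general recipe of injecting variation in position, size and in-plane orientation into a stimulus set. The transformation ranges, canvas size and the toy object are hypothetical values chosen for the example.

import numpy as np
from scipy.ndimage import rotate, zoom

rng = np.random.default_rng(0)

def place_with_variation(obj, canvas_size=128,
                         max_shift=32, scale_range=(0.5, 1.5), max_angle=90):
    """Paste a small grayscale object image onto a blank canvas with a
    random scale, in-plane rotation and position."""
    # Random scale and rotation of the object itself.
    scaled = zoom(obj, rng.uniform(*scale_range), order=1)
    rotated = rotate(scaled, rng.uniform(-max_angle, max_angle),
                     reshape=True, order=1)

    # Random placement on the canvas (clipped so the object stays inside).
    canvas = np.zeros((canvas_size, canvas_size))
    h, w = rotated.shape
    r = int(np.clip(canvas_size // 2 - h // 2 + rng.integers(-max_shift, max_shift + 1),
                    0, max(canvas_size - h, 0)))
    c = int(np.clip(canvas_size // 2 - w // 2 + rng.integers(-max_shift, max_shift + 1),
                    0, max(canvas_size - w, 0)))
    canvas[r:r + h, c:c + w] = rotated
    return canvas

# Example: a toy 32x32 "object" rendered under ten different variations.
toy_object = np.ones((32, 32))
stimuli = [place_with_variation(toy_object) for _ in range(10)]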
"With only two types of objects to distinguish, this test should have been
easier for the 'toy' computer model, but it proved harder," Cox says. The
team's conclusion: "Our model did well on the Caltech101 image set not
because it is a good model but because the 'natural' images fail to
adequately capture real-world variability."
As a result, the researchers argue for revamping the current standards and
images used by the computer-vision community to compare models and
measure progress. Before computers can approach the performance of the
human brain, they say, scientists must better understand why the task of
object recognition is so difficult and the brain's abilities are so impressive.
One approach is to build models that more closely reflect the brain's own
solution to the object recognition problem, as has been done by Tomaso
Poggio, a close colleague of DiCarlo's at the McGovern Institute (Tech Talk
Feb 28, 2007).
This study was supported by the National Eye Institute, the Pew Charitable
Trust and the McKnight Endowment Fund for Neuroscience.
A version of this article appeared in MIT Tech Talk on January 30, 2008.