Peter Szolovits - Mining physicians' notes for medical insights
A new approach to algorithmically
distinguishing words with multiple
possible meanings could help find useful
data in electronic medical records.
Larry Hardesty, MIT News Office
October 30, 2012
In the last 10 years, it’s become far more common
for physicians to keep records electronically. Those
records could contain a wealth of medically useful
data: hidden correlations between symptoms,
treatments and outcomes, for instance, or
indications that patients are promising candidates
for trials of new drugs.
Much of that data, however, is buried in physicians’ freeform notes. One of the difficulties in extracting data from unstructured text is what computer scientists call
word-sense disambiguation. In a physician’s notes, the word “discharge,” for instance, could refer to a bodily
secretion — but it could also refer to release from a hospital. The ability to infer words’ intended meanings makes
it much easier for computers to find useful patterns in mountains of data.
At the American Medical Informatics Association’s (AMIA) annual symposium next week, researchers from MIT’s
Computer Science and Artificial Intelligence Laboratory will present a new system for disambiguating the senses
of words used in doctors’ clinical notes. On average, the system is 75 percent accurate in disambiguating words
with two senses, a marked improvement over previous methods. But more important, says Anna Rumshisky, an
MIT postdoc who helped lead the new research, it represents a fundamentally new approach to word
disambiguation that could lead to much more accurate systems while drastically reducing the amount of human
effort required to develop them.
Indeed, Rumshisky says, the paper that was initially accepted to the AMIA symposium described a system that
used a more conventional approach to word disambiguation, with an average accuracy of only about 63 percent. “In our opinion, that wasn’t enough to actually be usable,” Rumshisky says. “So what we tried instead was
something that’s been tried before in the general domain but never in the biomedical or clinical domains.”
In particular, Rumshisky explains, she and her co-authors — graduate student Rachel Chasin, whose master's
thesis is the basis for the new paper; Peter Szolovits, an MIT professor of computer science and engineering and
health science and technology; and research affiliate Özlem Uzuner, who got her PhD at MIT and is now an
assistant professor at the University at Albany — adapted algorithms from a research area known as topic
modeling. Topic modeling seeks to automatically identify the topics of documents by inferring relationships among
prominently featured words.
“The twist on it that we’re trying to transpose from the general domain is to treat occurrences of a target word as
documents and to treat senses as hidden topics that we’re trying to infer,” Rumshisky says.
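As a rough illustration of that framing, the sketch below turns each occurrence of a target word into its own small "document" made of the surrounding words. It is not the researchers' code: the function name `occurrence_documents`, the window size, and the two sample notes are all assumptions invented for the example.

```python
import re

def occurrence_documents(notes, target, window=4):
    """Return one bag of context words per occurrence of `target`.

    Each occurrence becomes a pseudo-document; a topic model run over
    these pseudo-documents would then treat word senses as topics.
    """
    docs = []
    for note in notes:
        # Crude tokenizer (an assumption): lowercase alphabetic runs only.
        tokens = re.findall(r"[a-z]+", note.lower())
        for i, tok in enumerate(tokens):
            if tok == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                docs.append(left + right)
    return docs

# Hypothetical clinical-note snippets, made up for illustration.
notes = [
    "Patient reports purulent discharge from the wound.",
    "Plan discharge home tomorrow; discharge summary to follow.",
]
docs = occurrence_documents(notes, "discharge")
for doc in docs:
    print(doc)
```

The second note contains the target word twice, so it yields two pseudo-documents, one per occurrence.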
Where an ordinary topic-modeling algorithm will search through huge bodies of text to identify clusters of words
that tend to occur in close proximity to each other, Rumshisky and her colleagues’ algorithm identifies correlations
not only between words but between words and other textual “features” — such as the words’ syntactic roles. If
the word “discharge” is preceded by an adjective, for instance, it’s much more likely to refer to a bodily secretion than to an administrative event.
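One way to picture such features is to add them to an occurrence's bag of words as extra tokens. The sketch below is only a guess at the general idea, not the system described here: the `ADJECTIVES` set is a hypothetical stand-in for a real part-of-speech tagger, and the feature names `w=...` and `prev_pos=ADJ` are invented for the example.

```python
# Hypothetical stand-in for a part-of-speech tagger: a tiny adjective list.
ADJECTIVES = {"purulent", "bloody", "clear", "nasal", "yellow"}

def occurrence_features(tokens, i, window=3):
    """Return word and syntactic features for the target word at position i."""
    # Word features: every token within `window` positions of the target.
    feats = [
        f"w={tok}"
        for j, tok in enumerate(tokens)
        if j != i and abs(j - i) <= window
    ]
    # Syntactic feature: is the word just before the target an adjective?
    if i > 0 and tokens[i - 1] in ADJECTIVES:
        feats.append("prev_pos=ADJ")
    return feats

secretion = "patient has purulent discharge from wound".split()
release = "plan discharge home tomorrow".split()
print(occurrence_features(secretion, 3))
print(occurrence_features(release, 1))
```

Because the syntactic feature is just another token in the bag, a topic model can pick up its correlation with a sense in the same way it picks up correlations among ordinary words.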
Ordinarily, topic-modeling algorithms assign different weights to different topics: A single news article, for
instance, might be 50 percent about politics, 30 percent about the economy, and 20 percent about foreign affairs.
Similarly, the MIT researchers’ new algorithm assigns different weights to the different possible meanings of ambiguous words.
One advantage of topic-modeling algorithms is that they’re “unsupervised”: They can be deployed on huge bodies
of text without human oversight. As a consequence, the researchers can keep revising their algorithm so that it
incorporates more features, then set it loose on unannotated medical papers to draw its own inferences. And the
more features it incorporates, the more accurate it should be, Rumshisky says.
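To make the unsupervised step concrete, here is a minimal sketch of a textbook topic model, latent Dirichlet allocation fitted by collapsed Gibbs sampling, applied to per-occurrence bags of features. It is a generic technique swapped in for illustration, not the researchers' algorithm. The function name `infer_sense_weights`, the hyperparameters, and the toy data are all assumptions. No sense labels are supplied, and the output is one weight per candidate sense for each occurrence.

```python
import random
from collections import defaultdict

def infer_sense_weights(docs, n_senses=2, iters=200, alpha=0.1, beta=0.1, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-occurrence sense weights."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_sense = [[0] * n_senses for _ in docs]        # counts per occurrence
    sense_word = [defaultdict(int) for _ in range(n_senses)]
    sense_total = [0] * n_senses
    assign = []
    # Random initial sense assignment for every feature token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_senses)
            zs.append(z)
            doc_sense[d][z] += 1
            sense_word[z][w] += 1
            sense_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][n]
                doc_sense[d][z] -= 1
                sense_word[z][w] -= 1
                sense_total[z] -= 1
                # Resample its sense given all other assignments.
                weights = [
                    (doc_sense[d][k] + alpha)
                    * (sense_word[k][w] + beta)
                    / (sense_total[k] + vocab_size * beta)
                    for k in range(n_senses)
                ]
                z = rng.choices(range(n_senses), weights=weights)[0]
                assign[d][n] = z
                doc_sense[d][z] += 1
                sense_word[z][w] += 1
                sense_total[z] += 1
    # Smoothed proportion of each sense within each occurrence.
    return [
        [(doc_sense[d][k] + alpha) / (len(doc) + n_senses * alpha)
         for k in range(n_senses)]
        for d, doc in enumerate(docs)
    ]

# Hypothetical feature bags for four occurrences of "discharge".
docs = [
    ["purulent", "wound", "prev_pos=ADJ"],
    ["bloody", "nasal", "prev_pos=ADJ"],
    ["home", "tomorrow", "summary"],
    ["plan", "home", "summary"],
]
for weights in infer_sense_weights(docs):
    print([round(x, 2) for x in weights])
```

The senses come out as unnamed clusters, so a person would still have to decide which cluster corresponds to "secretion" and which to "release". On a toy corpus this small the split is not guaranteed to match the intended senses.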
Among the features that the researchers plan to incorporate into the algorithm are listings in a huge thesaurus of
medical terms, compiled by the National Institutes of Health, called the Unified Medical Language System (UMLS).
Indeed, word associations in the UMLS were the basis of the researchers’ original algorithm — the one that
achieved 63 percent accuracy. There, the problem was that the length and structure of the paths from one word to
another in the UMLS didn’t always correspond to the semantic difference between the words. But the new system
intrinsically identifies only those correspondences that recur with enough frequency that they’re likely to be useful.
“The parts of the [UMLS] that are relevant for distinguishing the senses would basically float to the top by
themselves,” Rumshisky says. “It kind of gives you, for free, this association, if it’s valid. If it’s not valid, it just
The researchers are also experimenting with additional syntactic and semantic features that could help with word
disambiguation and with word associations established by NIH’s Medical Subject Headings paper-classification
scheme. “It’s still not perfect, because we haven’t integrated all the linguistic features that we want to,”
Rumshisky says. “But my hunch is that this is the way to go.”
“About 80 percent of clinical information is buried in clinical notes,” says Hongfang Liu, an associate professor of
medical informatics at the Mayo Clinic. “A lot of words or phrases are ambiguous there. So in order to get the
correct interpretation, you need to go through the word-disambiguation phase.”
Liu says that while some computational linguists have applied topic-modeling algorithms to the problem of
word-sense disambiguation, “My feeling is that they work on kind of toy problems. And here, I think, it can actually
be used in production-scale natural-language-processing systems.”