Machines That Learn
by David Pescovitz
Computers aren't big thinkers, but they can analyze massive amounts of information much faster than our own brains. That's the idea behind machine learning, a method that enables computer software to recognize hidden patterns in a slew of data and learn from what it finds. UC Berkeley statistics professor Michael I. Jordan is applying machine learning to myriad applications, from genomics to information retrieval to the development of Internet software that repairs itself.
"Learning is a branch of statistics," says Jordan, who is also a professor in the College of Engineering's Computer Science Division. "Classical statistics generally centered around a person sitting down with some data, applying some procedure, and ending up with a pattern or hypothesis. We want to more thoroughly automate the process so computers can learn from data in a statistical sense."
Michael I. Jordan is a fellow of the American Association for Artificial Intelligence
Drawing from esoteric mathematics, Jordan develops pattern-finding algorithms that compare various sets of historical data and identify commonalities between them. The software then uses those specific instances to generate a predictive model of what's likely to occur in the future.
For example, Jordan and professor Steven Brenner of the Department of Microbial Biology are leading the development of a system to aid biologists in sussing out the function of proteins in plant or animal genomes. Biologists have developed the Gene Ontology, a collection of terms they use to hierarchically organize proteins that have already been characterized. The Gene Ontology has become a lingua franca for the description of biological function, Jordan says.
"A good statistical question is 'given what's currently available in the Gene Ontology and the scientific literature, can I infer the label of an unlabeled protein?'" Jordan says. "In particular, I'd like to take into account the evolutionary history of these proteins and the function of related proteins."
The machine learning technique does just that. First, it identifies all of the known proteins that are near the protein in question, in the sense that they have similar strings of amino acids. It then builds a phylogenetic tree, a diagram representing the evolutionary relationships among the various proteins. A number of factors are taken into account to place the known proteins in appropriate locations on the tree. Finally, the algorithm propagates information around the tree, labeling the uncharacterized proteins with the terms from the Gene Ontology. Jordan, Brenner, and graduate student Barbara Engelhardt recently tested the system on a known set of proteins and demonstrated accuracies greater than 90 percent.
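To make the last step concrete, here is a minimal Python sketch of spreading labels across a tree. It is not the researchers' statistical method: it assumes the tree is already built, treats every branch as the same length, and simply gives each unlabeled node the label of its closest labeled relative. The function name `propagate_labels`, the node names, and the two function labels are all hypothetical, chosen only for illustration.

```python
from collections import deque


def propagate_labels(edges, known_labels):
    """Give every node the label of its nearest labeled node in the tree.

    edges: list of (node, node) pairs describing an undirected tree.
    known_labels: dict mapping already-characterized nodes to a label.
    Simplifying assumption: all branches have equal length.
    """
    # Build an undirected adjacency map from the edge list.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Breadth-first search started from all labeled nodes at once, so the
    # first label to reach a node comes from its closest labeled relative.
    assigned = dict(known_labels)
    queue = deque(sorted(known_labels))
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in assigned:
                assigned[neighbor] = assigned[node]
                queue.append(neighbor)
    return assigned


# Toy tree: two families, each with one characterized protein (hypothetical).
edges = [
    ("root", "family_A"), ("root", "family_B"),
    ("family_A", "protein_1"), ("family_A", "protein_2"),
    ("family_B", "protein_3"), ("family_B", "protein_4"),
]
known = {"protein_1": "kinase activity", "protein_3": "transporter activity"}
labels = propagate_labels(edges, known)
```

In this toy tree, the uncharacterized `protein_2` inherits the label of its sibling `protein_1`, and `protein_4` inherits the label of `protein_3`. A statistical approach like the one described above would instead weigh evolutionary history and report how confident each inferred label is.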
For Jordan, though, a genome is just another set of raw data to be statistically analyzed. The goal of all of his work, he says, "is to create methods that without a huge effort can be turned into analysis engines for different kinds of data."
In another project, he's tailoring the machine learning algorithms to infer meaning from vast bodies of documents so they can be searched and categorized more efficiently and accurately. In this case, pages and pages of, say, technical papers about computer science would be loaded into the system. The system analyzes that data and generates a hierarchical tree where the highest branches represent common words found in the documents like "and," "but," and "the." Somewhere below that level would be more specific words such as "database," "operating system," and "network." As you move down the tree, the terms become more specialized. The result is a method of classifying and clustering documents based on their particular "paths" down the tree.
"So you can pump in a document corpus and it'll figure out the tree, the branches, and the probabilities of all of the words on those branches," Jordan says. "Then when you input a new document, it will predict where it falls on the tree. It's a general purpose corpus analyzer."
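The prediction step Jordan describes can be illustrated with a small Python sketch. It covers only the placement of a new document, not the learning of the tree from a corpus. The tree, its word probabilities, and the names `TREE`, `paths`, `log_likelihood`, and `best_path` are hypothetical and hand-built for illustration, and the sketch assumes each word is equally likely to come from any node on a path.

```python
import math

# Hypothetical hand-built tree. In a real system the tree and its
# probabilities would be inferred from the corpus, not written by hand.
TREE = {
    "root": {"words": {"the": 0.5, "and": 0.3, "but": 0.2},
             "children": ["systems", "data"]},
    "systems": {"words": {"network": 0.5, "kernel": 0.5}, "children": []},
    "data": {"words": {"database": 0.6, "query": 0.4}, "children": []},
}


def paths(tree, node="root", prefix=()):
    """Yield every root-to-leaf path as a tuple of node names."""
    prefix = prefix + (node,)
    children = tree[node]["children"]
    if not children:
        yield prefix
    for child in children:
        yield from paths(tree, child, prefix)


def log_likelihood(tree, path, words, floor=1e-6):
    """Score a document under one path.

    Assumption: each word is drawn from a uniform mixture of the nodes on
    the path. Words unseen on the path get a small floor probability.
    """
    total = 0.0
    for word in words:
        p = sum(tree[n]["words"].get(word, 0.0) for n in path) / len(path)
        total += math.log(max(p, floor))
    return total


def best_path(tree, words):
    """Return the root-to-leaf path under which the document is most likely."""
    return max(paths(tree), key=lambda p: log_likelihood(tree, p, words))


placement = best_path(TREE, "the database and the query".split())
```

Here the common words are explained by the top of the tree and the specialized ones by a lower branch, so the toy document lands on the path from `root` to `data`.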
The system could not only help readers identify documents that are truly related to one another but might also improve speech recognition software by figuring out the "gist" of what someone is saying. Meanwhile, a group at MIT has adapted the algorithm for a computer vision system that identifies objects in an image.
In the next few months, Jordan will embark on another statistical journey in computer science. He is a founding member of the University's new Reliable, Adaptive and Distributed systems Laboratory (RAD Lab). Funded with $7.5 million from Google, Microsoft, and Sun Microsystems, the laboratory will develop technology that leverages the power of machine learning so that one person can create the next eBay, Amazon, or even Google, by herself. The aim is to enable the quick development of Web services that can adapt to changing numbers of users while self-diagnosing and repairing bugs or other potential problems.
"Systems and software should be much more graceful in their ability to work in the context of the real world and I think statistical methods are the core technology for achieving that," Jordan says.