'Analysis Engines' Enhance Speed, Accuracy of Internet Searches

April 26, 2006

The picture anchoring IBM Corp.'s research page on UIMA (unstructured information management architecture) shows a bright orange train waiting at an empty subway station. Far down the platform, the tiny figures of people can barely be seen. It's a cleverly selected image, because the buzz among IT researchers pegs UIMA as one of the virtual trains designed to carry us to future destinations in cyberspace.


In fact, David Ferrucci, senior manager at IBM's T.J. Watson Research Center, who describes UIMA as "a sort of software framework," said his team's discoveries will enable businesses to more effectively search their own databases -- something like an in-house, laser-sharp Google -- for not only documents but also "unstructured" resource material such as video clips, sound bites and photos.


UIMA is designed to facilitate high-level, conceptual database searches that go far beyond the traditional keyword scan. An example of a traditional keyword scan is typing "bank" into Google. In just 0.22 seconds during an April 20, 2006, search, the search engine pulled up 1,090,000,000 hits, including a state job bank, dozens of financial institutions, a Flagstaff food bank, the Arizona Diamondbacks home page and an organ-donor bank. A ranking algorithm within the search engine notes which of the millions of documents were clicked on by previous searchers using the same keywords, then automatically bumps them to the top of the list seen by the computer user.


The vast majority of these hits are irrelevant to the searcher, regardless of which type of bank they are looking for. But this overkill response is typical of what you get with the existing "bag of words" approach embedded in online search engines, according to Robert St. Louis, chairman of the Department of Information Systems at the W. P. Carey School of Business.


That irrelevancy is a huge problem, St. Louis explained. "So the question has to be asked: 'How can you train Google so it is able to sort through a full-text search, filter the one billion results within and get the five or six results you're interested in to appear on the first page?'" he continued. "It also has to be almost instantaneous, and a fairly simple procedure" so users don't get discouraged, he added.


Refining the search by typing in more words often narrows the search too much, IBM's Ferrucci said. "If I put in a lot of words, the fewer documents I get, the search narrows and is less successful. As a result, we are conditioned to use just a word or two keywords," he explained.
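
The trade-off Ferrucci and St. Louis describe can be sketched in a few lines of Python. This is only an illustration of "bag of words" matching under simple assumptions, not how Google or any real search engine works: the four sample documents, the `tokens` helper and the `keyword_search` function are all invented here. A document is a hit only if it contains every query word.

```python
import re

# Hypothetical mini-collection, loosely modeled on the "bank" hits above.
DOCS = {
    "job_bank": "State job bank lists open government positions",
    "food_bank": "Flagstaff food bank seeks volunteers",
    "savings": "Local bank raises interest rates on savings accounts",
    "organ": "Organ donor bank expands registry",
}


def tokens(text):
    """Reduce text to an unordered set of lowercase words (a 'bag of words')."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_search(query, docs):
    """Return the ids of documents that contain every word in the query."""
    wanted = tokens(query)
    return sorted(doc_id for doc_id, text in docs.items() if wanted <= tokens(text))


one_word = keyword_search("bank", DOCS)                           # every kind of bank
two_words = keyword_search("bank savings", DOCS)                  # narrower
many_words = keyword_search("bank savings mortgage loans", DOCS)  # nothing left

print(len(one_word), two_words, many_words)
```

In this toy collection, one keyword returns all four documents, relevant or not, while four keywords return none, even though the savings document is probably what the searcher wanted.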


The remedy seems to be the UIMA approach, which, while not a search engine, is the platform allowing searchers to use concepts, themes and ideas -- rather than keywords -- to scan databases with sophisticated software components called "analysis engines." The analysis engines "scan documents and identify concepts and relationships in documents," Ferrucci said.
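
To make the idea of an analysis engine concrete, here is a toy Python sketch. It does not use the UIMA framework or its API. The `Annotation` class, the small term list and the function names are assumptions made up for illustration. The "engine" scans a document and tags spans that suggest a concept, and the search then matches on the concept rather than on a keyword.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A span of text tagged with a concept (hypothetical structure)."""
    concept: str
    start: int
    end: int
    text: str


# Assumed, hand-picked cues for the financial sense of "bank".
FINANCIAL_TERMS = re.compile(
    r"\b(?:interest rate|loan|deposit|mortgage)s?\b", re.IGNORECASE
)


def financial_institution_engine(document):
    """Toy analysis engine: tag spans that point to a financial institution."""
    return [
        Annotation("financial_institution", m.start(), m.end(), m.group())
        for m in FINANCIAL_TERMS.finditer(document)
    ]


def concept_search(concept, docs, engines):
    """Return ids of documents in which any engine found the given concept."""
    hits = []
    for doc_id, text in docs.items():
        annotations = [a for engine in engines for a in engine(text)]
        if any(a.concept == concept for a in annotations):
            hits.append(doc_id)
    return sorted(hits)


DOCS = {
    "food_bank": "The Flagstaff food bank seeks volunteers.",
    "lender": "The lender cut its mortgage fees and raised deposit limits.",
}

hits = concept_search("financial_institution", DOCS, [financial_institution_engine])
print(hits)
```

The document it finds never uses the word "bank," while the food bank story, which does, is left out. A real analysis engine would rely on far richer language analysis than a short term list.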


While UIMA advances will vastly improve the search efficacy of public Internet databases, Ferrucci and his team are focused on applying the technology to corporate and government settings. Databases in those settings lack the hyperlinks that public search engines rely on, posing another obstacle to efficient, fast ranking of hits.


Until recently, the 200 or so IBM researchers delving into this area were split among various corporate-environment search-engine projects. But since they weren't linked, their work tended to "silo" -- to develop in isolation from each other's similar projects. One of Ferrucci's goals, now met, was to get them all on the same virtual page. As a result, progress on UIMA is accelerating.


Here's a UIMA scenario: A chief financial officer wants information that spans two documents. Using the only keyword he knows, he'll get one of the documents, but not both, displayed on the first page of hits. Using UIMA-facilitated software to analyze documents for the concept of, say, "fiscally responsible background checking," he may instead get exactly the reports he needs, stored by two departments under different titles.


Another example: Say you're interested in reading business pundits' opinions of Company ABC's president, chairman and chief operating officer, but don't know the three honchos' names. Analysis engines would scan database content and identify the concepts of interest ("positive opinions of company leaders") typed in by the searcher, Ferrucci said. "This is a very different search concept that we don't have today. UIMA is the plumbing that puts everyone on the same platform," he added.


IBM gives away its UIMA framework and source code free of charge online, "to provide a common foundation for industry and academia to collaborate and accelerate the worldwide development of technologies." Of course, this is a savvy marketing strategy, because by doing so, the computing behemoth also creates a global market for the UIMA-facilitating software applications currently being developed in-house.


So far, individuals and companies have downloaded the UIMA kit and source code just under 10,000 times, Ferrucci added, and IBM is partnering with Carnegie Mellon to publish UIMA components. He predicts the technology will prove especially helpful in the automotive and health-care industries, as well as in national security.



Another pioneering approach


But IBM isn't alone in pioneering deeper database search methods. St. Louis said researchers at the W. P. Carey School "recently found a way to add content to a Google search."


He continued, "There's an explosion in unstructured information-based technology going on, to search not just documents but e-mails and images." Department members, assisted by doctoral students, are building on Google to make it more context-sensitive.


Using a patented algorithm, they start by evaluating each hit (St. Louis calls them "snippets of information") returned by a database search on a particular topic. "Each snippet that comes up has a button next to it saying 'high,' 'medium' or 'low.' You click the appropriate button based on how relevant that particular snippet is to you, personally," he explained.


According to Asim Roy, a W. P. Carey information systems professor and developer of this approach, "the algorithm can go back and learn, from your input, what the high-relevance snippets have in common. Next, it brings forward the snippets high in relevance for you, thus making Google's results much more useful to you. Algorithms learn incredibly fast."
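
The article does not describe how the patented algorithm works, so the Python sketch below substitutes a much simpler technique, term weighting from the searcher's ratings, to show the general idea of learning from "high," "medium" and "low" clicks. The weights assigned to each label, the sample snippets and the function names are all assumptions.

```python
import re
from collections import Counter

# Assumed values: how much each rating pushes a snippet's words up or down.
LABEL_WEIGHT = {"high": 1.0, "medium": 0.0, "low": -1.0}


def tokens(text):
    """Split a snippet into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def learn_term_weights(rated_snippets):
    """Add up, per word, the ratings of the snippets that word appears in."""
    weights = Counter()
    for snippet, label in rated_snippets:
        for term in set(tokens(snippet)):
            weights[term] += LABEL_WEIGHT[label]
    return weights


def rerank(snippets, weights):
    """Order snippets so those sharing words with high-rated ones come first."""
    def score(snippet):
        return sum(weights[term] for term in set(tokens(snippet)))
    return sorted(snippets, key=score, reverse=True)


# The searcher rates three snippets returned for "bank".
rated = [
    ("Bank raises savings interest rates", "high"),
    ("Food bank seeks volunteers", "low"),
    ("State job bank lists openings", "low"),
]
weights = learn_term_weights(rated)

# Two snippets the searcher has not rated yet.
fresh = [
    "Volunteers stock food bank shelves",
    "Bank offers new savings accounts",
]
reranked = rerank(fresh, weights)
print(reranked[0])
```

After three ratings, the savings snippet moves ahead of the food bank snippet, because it shares a word with the one snippet the searcher marked "high."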


Another W. P. Carey researcher, Dmitri Roussinov, is developing a search approach that allows the searcher to ask a question and get a relevant answer from a database. For instance, if you enter "who&is&the&largest&software&developer&worldwide" into Google -- using the "bag of words" approach currently available -- you'll get zero hits. Entering "world's&largest&software&developer," likewise, draws nothing.


"Yet just about everyone in business knows that Microsoft is the world's largest software developer," marveled St. Louis. "Such a basic question and Microsoft is all over the Internet, but no results."



Bottom line:

  • Despite its ranking as the most popular public search engine in the world, repository for millions upon millions of records, Google can't answer a simple question: "who is the world's largest software developer?"
  • IBM is creating a market for a new line of software components by giving away its framework and source code for UIMA (unstructured information management architecture).