A Smarter Web
March/April 2007
This article appears in the March/April 2007 issue of Technology Review.
Last year, Eric Miller, an MIT-affiliated computer scientist, stood on a beach in southern France, watching the sun set, studying a document he'd printed earlier that afternoon. A March rain had begun to fall, and the ink was beginning to smear.
Five years before, he'd agreed to lead a diverse group of researchers working on a project called the Semantic Web, which seeks to give computers the ability--the seeming intelligence--to understand content on the World Wide Web. At the time, he'd made a list of goals, a copy of which he now held in his hand. If he'd achieved those goals, his part of the job was done.
Taking stock on the beach, he crossed off items one by one. The Semantic Web initiative's basic standards were in place; big companies were involved; startups were merging or being purchased; analysts and national and international newspapers, not just technical publications, were writing about the project. Only a single item remained: taking the technology mainstream. Maybe it was time to make this happen himself, he thought. Time to move into the business world at last.
"For the Semantic Web, it was no longer a matter of if but of when," Miller says. "I felt I could be more useful by helping people get on with it."
Now, six months after the launch of Zepheira, his consulting company that helps businesses link fragmented data sources into easily searched wholes, Miller's beachside decision seems increasingly prescient. The Semantic Web community's grandest visions, of data-surfing computer servants that automatically reason their way through problems, have yet to be fulfilled. But the basic technologies that Miller shepherded through research labs and standards committees are joining the everyday Web. They can be found everywhere--on entertainment and travel sites, in business and scientific databases--and are forming the core of what some promoters call a nascent "Web 3.0."
Already, these techniques are helping developers stitch together complex applications or bring once-inaccessible data sources online. Semantic Web tools now in use improve and automate database searches, helping people choose vacation destinations or sort through complicated financial data more efficiently. It may be years before the Web is populated by truly intelligent software agents automatically doing our bidding, but their precursors are helping people find better answers to questions today.
The "3.0" claim is ambitious, casting these new tools as successors to several earlier--but still viable--generations of Net technology. Web 1.0 refers to the first generation of the commercial Internet, dominated by content that was only marginally interactive. Web 2.0, characterized by features such as tagging, social networks, and user-created taxonomies of content called "folksonomies," added a new layer of interactivity, represented by sites such as Flickr, Del.icio.us, and Wikipedia.
Analysts, researchers, and pundits have subsequently argued over what, if anything, would deserve to be called "3.0." Definitions have ranged from widespread mobile broadband access to a Web full of on-demand software services. A much-read article in the New York Times last November clarified the debate, however. In it, John Markoff defined Web 3.0 as a set of technologies that offer efficient new ways to help computers organize and draw conclusions from online data, and that definition has since dominated discussions at conferences, on blogs, and among entrepreneurs.
The 3.0 moniker has its critics. Miller himself, like many in his research community, frowns at the idea of applying old-fashioned software release numbers to a Web that evolves continually and on many fronts. Yet even skeptics acknowledge the advent of something qualitatively different. Early versions of technologies that meet Markoff's definition are being built into the new online TV service Joost. They've been used to organize Yahoo's food section and make it more searchable. They're part of Oracle's latest, most powerful database suite, and Hewlett-Packard has produced open-source tools for creating Semantic Web applications. Massive scientific databases, such as the Creative Commons-affiliated Neurocommons, are being constructed around the new ideas, while entrepreneurs are readying a variety of tools for release this year.
The next wave of technologies might ultimately blend pared-down Semantic Web tools with Web 2.0's capacity for dynamic user-generated connections. It may include a dash of data mining, with computers automatically extracting patterns from the Net's hubbub of conversation. The technology will probably take years to fulfill its promise, but it will almost certainly make the Web easier to use.
"There is a clear understanding that there have to be better ways to connect the mass of data online and interrogate it," says Daniel Waterhouse, a partner at the venture capital firm 3i. Waterhouse calls himself skeptical of the "Web 3.0" hyperbole but has invested in at least one Semantic Web-based business, the U.K. company Garlik. "We're just at the start," he says. "What we can do with search today is very primitive."
Melvil Dewey and the Vision of a New Web
For more than a decade, Miller has been at the center of this slow-cresting technological wave. Other names have been more prominent--Web creator Tim Berners-Lee is the Semantic Web's most visible proselytizer, for example. But Miller's own experiences trace the technology's history, from academic halls through standards bodies and, finally, into the private sector.
In the often scruffy Web world, the 39-year-old Miller has been a clean-cut exception, an articulate and persuasive technological evangelist who looks less programmer than confident young diplomat. He's spent most of his professional life in Dublin, OH, far from Silicon Valley and from MIT, where he continues to serve as a research scientist. But it's no accident that Zepheira is based in this Columbus suburb, or that Miller himself has stayed put. Dublin is a hub of digital library science, and as the Semantic Web project has attempted to give order to the vast amounts of information online, it has naturally tapped the expertise of library researchers here.
Miller joined this community as a computer engineering student at Ohio State University, near the headquarters of a group called the Online Computer Library Center (OCLC). His initial attraction was simple: OCLC had the largest collection of computers in the vicinity of Ohio State. But it also oversees the venerable Dewey Decimal System, and its members are the modern-day inheritors of Melvil Dewey's obsession with organizing and accessing information.
Dewey was no technologist, but the libraries of his time were as poorly organized as today's Web. Books were often placed in simple alphabetical order, or even lined up by size. Libraries commonly numbered shelves and assigned books to them heedless of subject matter. As a 21-year-old librarian's assistant, Dewey found this system appalling: order, he believed, made for smoother access to information.
Dewey envisioned all human knowledge as falling along a spectrum whose order could be represented numerically. Even if arbitrary, his system gave context to library searches; when seeking a book on Greek history, for example, a researcher could be assured that other relevant texts would be nearby. A book's location on the shelves, relative to nearby books, itself aided scholars in their search for information.
As the Web gained ground in the early 1990s, it naturally drew the attention of Miller and the other latter-day Deweys at OCLC. Young as it was, the Web was already outgrowing attempts to categorize its contents. Portals like Yahoo forsook topic directories in favor of increasingly powerful search tools, but even these routinely produced irrelevant results.
Nor was it just librarians who worried about this disorder. Companies like Netscape and Microsoft wanted to lead their customers to websites more efficiently. Berners-Lee himself, in his original Web outlines, had described a way to add contextual information to hyperlinks, to offer computers clues about what would be on the other end.
This idea had been dropped in favor of the simple, one-size-fits-all hyperlink. But Berners-Lee didn't give it up altogether, and the idea of connecting data with links that meant something retained its appeal.
On the Road to Semantics
By the mid-1990s, the computing community as a whole was falling in love with the idea of metadata, a way of providing Web pages with computer-readable instructions or labels that would be invisible to human readers.
To use an old metaphor, imagine the Web as a highway system, with hyperlinks as connecting roads. The early Web offered road signs readable by humans but meaningless to computers. A human might understand that "FatFelines.com" referred to cats, or that a link led to a veterinarian's office, but computers, search engines, and software could not.
Metadata promised to add the missing signage. XML--a markup language that underlies many of today's complicated websites and lets developers describe how content is structured--emerged as one powerful variety. But even XML can't serve as an ordering principle for the entire Web; it was designed to let Web developers label data with their own custom "tags"--as if different cities posted signs in related but mutually incomprehensible dialects.
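The dialect problem is easy to see in miniature. The sketch below is illustrative only: the two bookstores, their tag names, and the helper find_price are invented for this example. Both stores describe the same book in well-formed XML, but a program that knows only one store's vocabulary comes up empty on the other.

```python
import xml.etree.ElementTree as ET

# Two hypothetical bookstores describe the same book with their own custom tags.
store_a = "<book><title>Moby-Dick</title><price>9.99</price></book>"
store_b = "<item><name>Moby-Dick</name><cost>9.99</cost></item>"


def find_price(xml_text):
    """Look for a <price> child element; return its text, or None if it is absent."""
    node = ET.fromstring(xml_text).find("price")
    return node.text if node is not None else None


price_a = find_price(store_a)  # store A happens to use the tag we expect
price_b = find_price(store_b)  # store B calls the same thing <cost>
print(price_a, price_b)
```

Nothing is wrong with either document on its own. The price is present in both; the second store simply posts it under a sign the program cannot read.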
In early 1996, researchers at the MIT-based World Wide Web Consortium (W3C) asked Miller, then an Ohio State graduate student and OCLC researcher, for his opinion on a different type of metadata proposal. The U.S. Congress was looking for ways to keep children from being exposed to sexually explicit material online, and Web researchers had responded with a system of computer-readable labels identifying such content. The labels could be applied either by Web publishers or by ratings boards. Software could then use these labels to filter out objectionable content, if desired.
Miller, among others, saw larger possibilities. Why, he asked, limit the descriptive information associated with Web pages to their suitability for minors? If Web content was going to be labeled, why not use the same infrastructure to classify other information, like the price, subject, or title of a book for sale online? That kind of general-purpose metadata--which, unlike XML, would be consistent across sites--would be a boon to people, or computers, looking for things on the Web.
This idea resonated with other Web researchers, and in the late 1990s it began to bear fruit. Its first major result was the Resource Description Framework (RDF), a new system for locating and describing information whose specifications were published as a complete W3C recommendation in 1999. But over time, proponents of the idea became more ambitious and began looking to the artificial-intelligence community for ways to help computers independently understand and navigate through this web of metadata.
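RDF expresses each piece of metadata as a three-part statement: a subject, a predicate, and an object. The sketch below imitates that model with plain Python tuples. It is a simplification under stated assumptions, not a real RDF toolkit: the URLs, book data, and the match helper are made up for illustration, and in actual RDF the predicates would themselves be URIs rather than bare words.

```python
# Each RDF-style statement is a (subject, predicate, object) triple.
# The URLs and values below are invented for illustration.
TRIPLES = [
    ("http://example.org/books/1", "title", "Moby-Dick"),
    ("http://example.org/books/1", "subject", "whaling"),
    ("http://example.org/books/1", "price", "9.99"),
    ("http://example.org/books/2", "title", "Walden"),
    ("http://example.org/books/2", "subject", "nature"),
    ("http://example.org/books/2", "price", "7.50"),
]


def match(subject=None, predicate=None, obj=None):
    """Return every triple that agrees with the fields supplied; None matches anything."""
    return [
        t for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Find books about nature, then look up the price of the first one found.
nature_books = [s for s, _, _ in match(predicate="subject", obj="nature")]
walden_price = match(subject=nature_books[0], predicate="price")[0][2]
print(nature_books, walden_price)
```

Because every site would use the same three-part shape, the same query works no matter who published the statements.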
Since 1998, researchers at W3C, led by Berners-Lee, had been discussing the idea of a "semantic" Web, which not only would provide a way to classify individual bits of online data such as pictures, text, or database entries but would define relationships between classification categories as well. Dictionaries and thesauruses called "ontologies" would translate between different ways of describing the same types of data, such as "post code" and "zip code." All this would help computers start to interpret Web content more efficiently.
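In its simplest form, the translating role of an ontology can be pictured as a lookup table between vocabularies. The sketch below assumes a hypothetical shared term, postal_code, and two made-up address records; real ontologies are far richer, but the principle of mapping "post code" and "zip code" onto one concept is the same.

```python
# A toy "ontology": each site-specific label maps to one shared term.
# The shared name "postal_code" is an assumption made for this sketch.
EQUIVALENT_TERMS = {
    "post code": "postal_code",
    "zip code": "postal_code",
}


def normalize(record):
    """Rewrite a record's labels into the shared vocabulary, leaving unknown labels alone."""
    return {EQUIVALENT_TERMS.get(label, label): value for label, value in record.items()}


# Two invented records that describe the same kind of data in different words.
uk_record = {"city": "London", "post code": "SW1A 1AA"}
us_record = {"city": "Columbus", "zip code": "43215"}

normalized = [normalize(uk_record), normalize(us_record)]
postal_codes = [r["postal_code"] for r in normalized]
print(postal_codes)
```

Once both records share a label, a single query can search British and American addresses alike.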
In this vision, the Web would take on aspects of a database, or a web of databases. Databases are good at providing simple answers to queries because their software understands the context of each entry. "One Main Street" is understood as an address, not just random text. Defining the context of online data just as clearly--labeling a cat as an animal, and a veterinarian as an animal doctor, for example--could result in a Web that computers could browse and understand much as humans do, researchers hoped.
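The cat and veterinarian labels can be sketched as a small chain of categories that a program follows on its own. This is a minimal illustration, not how any particular Semantic Web tool works: the IS_A table and the is_a function are hypothetical, and the link from "animal doctor" to "doctor" is an extra assumption added to show a two-step inference.

```python
# A toy category chain. The first two links come from the example in the text;
# the third ("animal doctor" is a kind of "doctor") is assumed for illustration.
IS_A = {
    "cat": "animal",
    "veterinarian": "animal doctor",
    "animal doctor": "doctor",
}


def is_a(term, category):
    """Follow the chain upward from term and report whether it reaches category."""
    seen = set()
    while term in IS_A and term not in seen:
        seen.add(term)  # guard against accidental loops in the table
        term = IS_A[term]
        if term == category:
            return True
    return False


cat_is_animal = is_a("cat", "animal")
vet_is_doctor = is_a("veterinarian", "doctor")  # inferred through "animal doctor"
cat_is_doctor = is_a("cat", "doctor")
print(cat_is_animal, vet_is_doctor, cat_is_doctor)
```

No one told the program directly that a veterinarian is a doctor; it reached that conclusion by following the labels, which is the kind of context researchers hoped to give the Web.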
To go back to the Web-as-highway metaphor, this might be analogous to creating detailed road signs that cars themselves could understand and upon which they could act. The signs might point out routes, describe road and traffic conditions, and offer detailed information about destinations. A car able to understand the signs could navigate efficiently to its destination, with minimal intervention by the driver.
In articles and talks, Berners-Lee and others began describing a future in which software agents would similarly skip across this "web of data," understand Web pages' metadata content, and complete tasks that take humans hours today. Say you'd had some lingering back pain: a program might determine a specialist's availability, check an insurance site's database for in-plan status, consult your calendar, and schedule an appointment. Another program might look up restaurant reviews, check a map database, cross-reference open table times with your calendar, and make a dinner reservation.
At the beginning of 2001, the effort to realize this vision became official. The W3C tapped Miller to head up a new Semantic Web initiative, unveiled at a conference early that year in Hong Kong. Miller couldn't be there in person; his wife was in labor with their first child, back in Dublin. Miller saw it as a double birthday.
Standards and Critics