DBpedia, UMBEL & the Future Web's Ecology – interview with Mike Bergman & Sören Auer
The Linked Open Data infrastructure is maturing at a tremendous pace – the recent release of UMBEL’s web service and the incorporation of UMBEL classes in DBpedia are yet another confirmation of this exciting process. Having met DBpedia co-initiator, Triplify main developer and head of the AKSW research group Sören Auer and UMBEL editor and Zitgist CEO Mike Bergman in various contexts, I felt it was time to talk to both of these key players and pick their brains in a dialog. The (first) result is the interview you can find below. As not everyone can be expected to be familiar with both projects, here is some background to get you started (you can also go directly to the interview):
DBpedia has become the largest RDF repository for encyclopaedic knowledge, extracting structured information from Wikipedia and making it available on the Web of Data. UMBEL, on the other hand, provides an OpenCYC-based, lightweight ontology structure for relating Web content and data to a standard set of subject concepts, 20,000 of them at present. In the Linked Data Cloud, DBpedia and UMBEL map and cross-reference each other.
In practice this means that UMBEL provides classes to describe the concepts of which “things” are members. For instance, named entities from Wikipedia such as “John F. Kennedy” are mapped to subject concepts such as Leader, Person, Administrator and Graduate, with broader and equivalent classes in CYC and FOAF and broader subject concepts within UMBEL. A link is set to Wikipedia, as well as a ‘same as’ reference to DBpedia. A class structure enables faceted browsing and extraction, inferencing, and navigation and discovery for all datasets linked to that structure.
DBpedia, in turn, returns properties of ‘John F. Kennedy’ (e.g. abstracts in the available Wikipedia languages, demographic information such as birth date and place, alma mater, predecessors and successors), and ‘same as’ references, e.g., to the JFK entry in Freebase (which recently released its RDF service) and the aforementioned page in UMBEL. Furthermore, DBpedia maps the URI to the available RDF types, for instance foaf:Person or yago:AssassinatedAmericanPoliticians and, once again, to UMBEL’s subject concepts Person, Administrator, Graduate and Leader.
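To make the cross-referencing a little more tangible, here is a minimal Python sketch of how such statements can be held as subject–predicate–object triples and looked up by pattern. The prefixed names (`dbpedia:`, `umbel-sc:`, `umbel-ne:`, `freebase:`) are shorthand invented for this illustration and are not copied from the live datasets; the helper functions `match` and `objects` are likewise hypothetical.

```python
# Illustrative triples only: the prefixed names are shorthand chosen for this
# sketch and are NOT copied from the live DBpedia, UMBEL or Freebase data.
TRIPLES = {
    ("dbpedia:John_F._Kennedy", "rdf:type", "foaf:Person"),
    ("dbpedia:John_F._Kennedy", "rdf:type", "yago:AssassinatedAmericanPoliticians"),
    ("dbpedia:John_F._Kennedy", "rdf:type", "umbel-sc:Person"),
    ("dbpedia:John_F._Kennedy", "rdf:type", "umbel-sc:Leader"),
    ("dbpedia:John_F._Kennedy", "owl:sameAs", "freebase:John_F._Kennedy"),
    ("dbpedia:John_F._Kennedy", "owl:sameAs", "umbel-ne:John_F._Kennedy"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the sorted triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


def objects(triples, subject, predicate):
    """Return the sorted object values for one subject and one predicate."""
    return [o for (_, _, o) in match(triples, subject, predicate)]


jfk = "dbpedia:John_F._Kennedy"
rdf_types = objects(TRIPLES, jfk, "rdf:type")   # classes the entity belongs to
same_as = objects(TRIPLES, jfk, "owl:sameAs")   # the same entity in other datasets
print(rdf_types)
print(same_as)
```

The point of the sketch is only that type statements and ‘same as’ links are ordinary triples about one URI, so a single lookup pattern serves both.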
Due to its reliance on Wikipedia, DBpedia does a great job of covering a range of knowledge as broad as the interests of the people participating in Wikipedia. Its strength lies in the area of named entities, i.e. entities such as persons, organizations and locations, which have a proper name but are not necessarily part of a particular, acknowledged domain or discipline. UMBEL’s most apparent advantage, on the other hand, is its reliance on OpenCyc and, with that, the strong inferencing and logic capabilities of the CYC knowledge base, which are thus also brought to the Web of Data. DBpedia is a community project started by the University of Leipzig, Free University Berlin and OpenLink Software, while the open and free UMBEL is developed and hosted by Zitgist with support from, again, OpenLink Software.
Now, and in particular with the recent release of Zitgist’s web service endpoints and with the incorporation of UMBEL classes in DBpedia, questions arise as to the relationship between the two projects and the role of OpenLink Software in the further process. To draw a distinction:
One could say that DBpedia’s goal is to lower the barrier for web developers and end-users in the actual use of the semantic web, while UMBEL aims at bringing “order to the chaos” that is inherent to user-generated, collective knowledge.
Would you agree with this description – and is it a contradiction at all or the kind of dynamic the Semantic Web community has been waiting for?
Mike Bergman: Yes, I would agree with this description, though we have tried many others. For example, in various writings in the past, we have described UMBEL as a roadmap, or middleware, or a backbone, or a concept ontology, or an ‘infocline’, or a meta layer for metadata, and others. Today, what I tend to use, particularly in reference to DBpedia, is the TBox-ABox distinction in computer science and description logics. UMBEL is more of a class or structural and concept relationships schema (a TBox), while DBpedia is more of an instance and entity layer with attributes (an ABox). I think they are pretty complementary…
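For readers unfamiliar with the TBox-ABox distinction, a rough Python sketch may help. It assumes a tiny, hypothetical class hierarchy (the `Agent` class and the broader-class links are made up for illustration, not taken from UMBEL) and shows how schema-level statements let us infer types that were never asserted for the instance itself.

```python
# TBox: schema-level statements (class -> its direct broader classes).
# This hierarchy is hypothetical and only serves the illustration.
TBOX = {
    "Leader": {"Person"},
    "Administrator": {"Person"},
    "Graduate": {"Person"},
    "Person": {"Agent"},
}

# ABox: instance-level assertions (entity -> asserted classes and attributes).
ABOX = {
    "John_F._Kennedy": {
        "types": {"Leader", "Graduate"},
        "attributes": {"birthYear": 1917},
    },
}


def broader_classes(tbox, cls):
    """Return every class reachable from cls by following broader-class links."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in tbox.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(tbox, abox, entity):
    """Combine the asserted ABox types with the classes implied by the TBox."""
    asserted = abox[entity]["types"]
    implied = set()
    for cls in asserted:
        implied |= broader_classes(tbox, cls)
    return asserted | implied


types = inferred_types(TBOX, ABOX, "John_F._Kennedy")
print(sorted(types))
```

Here the instance data never says that Kennedy is a `Person`; that follows from the schema alone, which is the sense in which the two layers complement each other.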
Sören Auer: I very much agree with Mike, but would like to add that Wikipedia authors do not have in mind to create a coherent and consistent knowledge base when working on Wikipedia. I think the more we demonstrate the benefits of the semantic representations in DBpedia to the Wikipedia community, the more these people will start to organize and rearrange content to enable the use of Wikipedia as a knowledge base. Right now, Wikipedia authors just have not yet been confronted with the problem of synonymous infobox properties or the uncleanliness of the category system, for example. I think with a few small and non-invasive changes to Wikipedia, much of the current chaos can already be resolved.
Mike Bergman: I agree, too, with Sören’s adder: I think it is difficult for Wikipedia authors to be consistent or coherent across the entire Wikipedia knowledge base. I think, then, the real question is where does that coherence or structural consistency come from? I think the nature of that task is quite different than creating or editing instance articles.
As for the dynamics and drivers of the community, the role of DBpedia for practical, linked data can not be overstressed. It was the first, remains the biggest, and has brought much visibility and awareness to linked data. I think I was one of the first to give DBpedia some press shortly after its release nearly two years ago, which is still one of my most popular blog posts. I’d like to think that UMBEL is now timely and well-positioned to help provide a complementary resource of concept classes, but that is not proven. DBpedia is.
DBpedia relies on user-generated content; UMBEL, with CYC, is expert-driven. How will a system that combines these divergent approaches continue to grow?
Mike Bergman: DBpedia is in the enviable position of being able to leverage two things: 1) the phenomenal success and growth of its source content, Wikipedia; and 2) the growing sophistication of information and structure extraction techniques built around that phenomenon. In fact, my most recent SWEETpedia listing of research projects based on Wikipedia now exceeds 170 or something projects and we discover and learn about more daily. We now see major research efforts in Germany, Austria, Japan, New Zealand, the US and England (among those I know) aggressively mining and learning from Wikipedia. I’m sure DBpedia will learn more, but others will increasingly contribute as well. This is unprecedented and very exciting. Wikipedia may have as strong a heritage in contributing to research and language and structure understanding as it does as a reference encyclopedia.
But, that is largely instance and attribute data, the ABox to use my earlier terminology. For structural and conceptual relationships — the coherent way to organize and relate things in the world; that is, the TBox — I think there is much subtlety and thinking required. I’ve used the phrase before that creating structural schema is “not like flinging hash”. Efforts such as Cyc, with nearly 1,000 person years of consistent testing and effort behind it, or perhaps others such as SUMO or what is coming out of the biology community with OBO (Open Biomedical Ontologies), offer better coherency and the ability to interoperate across diverse datasets and domains. Perhaps Wikipedia and its data extraction offshoots may someday get to this point — and I truly hope so — but are not anywhere near that today in my opinion.
Sören Auer: I’m actually not as concerned about the lack of structure and coherency as Mike. If we look at the current, mostly textual Web and the search engines making it accessible to us humans, there is almost no data, little structure and even less coherency, but search engines still manage to provide enormous added value. If we add more data to the Web, much more sophisticated browsing, searching and data integration interfaces can be built. Structure and coherency will then emerge automatically, once people see how their content is indexed and can be easily found (or not).
The same happened by the way with the traditional Web 1.0 – in the beginning nobody used HTML’s meta-data tags. Once search engines started to interpret those for ranking results, meta-tags shifted to the center of attention of every Web content manager. Applied to the Semantic Data Web: once search engines understand foaf:person, everybody will use this concept for describing people.
A little experiment – Web 3.0, Semantic Web, Web of Data, Linked Data: Can you think of an ontology that is able to connect these terms and reveal the concepts behind them?
Mike Bergman: Well, I’d like to think UMBEL is that ontology (smile). That is certainly our intent, though truthfully we are still working out structural details and have not added all of this nice SemWeb terminology. But it is coming shortly (smile).
Sören Auer: A precise definition of these terms in the mathematical sense does not (and probably will never) exist, so articles such as those in Wikipedia (or many other publications) about the terms are from my point of view completely sufficient to reveal the concepts behind them to us. Of course it’s nice to have pretty and world-wide unique identifiers (such as provided by UMBEL) to annotate articles about these terms.
Mike Bergman: Well, UMBEL is about linking to concepts, though we welcome anyone thinking our identifiers are pretty (smile). One key aspect we will see moving forward is how we can translate those concepts into one of the 250 languages now used by Wikipedia while retaining existing structure. That is a real exciting prospect.
What are the concepts you would personally want to employ to explain the over-arching idea of these terms to a newbie?
Mike Bergman: Structured data on the Web is becoming like newly visible stars as nighttime darkens in the desert. Structured data are points of light in a global information space. We need fixed reference points in that sky to find specific stars. We need linkages to extract meanings and constellations from them. So, we both need to expose those stars — as linked data — and to provide fixed references to find them again and connections to draw meaning from them. Objects, references and connections all work in concert to expose the wonder.
Sören Auer: It’s difficult for me to add something after Mike’s truly poetic description of pretty technical terms (smile). I see Linked Data or the Web of Data as the next milestone on the road to realizing the vision of the Semantic Web. In this regard, I’m Marxist (smile) and think Marx’s Law of transformation of Quantity into Quality applies: once we have a sufficient quantity of data out there on the Web, a new quality will emerge. Unfortunately, we are still far away from reaching a critical mass, since the Semantic or Data Web, as we recently found out (cf.: Triplify – Lightweight Linked Data Publication from Relational Databases, PDF, 332 KB), is effectively shrinking if compared with the growth of the traditional Web.
Kingsley Idehen from OpenLink Software was the first to announce on the W3C mailing list that DBpedia & UMBEL are now “fully connected.” Is Kingsley the bridge-builder between the two projects?
Mike Bergman: Without question. Kingsley has backed both efforts in a big way with people and resources. His company actually did the first RDF linkage between the two projects. What is more remarkable, however, is that DBpedia and UMBEL are but a small slice of the things Kingsley has been backing. He has been a leader in middleware, scalable clusters and cloud computing, RDFizing all data forms, converting relational legacy data to linked RDF data, and providing demos and teaching to newbies on mailing lists. I’m glad you asked this question because Kingsley is a real catalyst and visionary.
Sören Auer: I agree, Kingsley is a mover and shaker in many areas of technology and innovation and in particular the Data Web. However, we should not forget his marvelous team at OpenLink with the database mastermind Orri Erling, Hugh Williams, Ivan Mikhailov, Yrjäna Rankka and all the others.
In the same vein: What are roles that are vital for the LOD-engineering process? Are there also “gardeners” or “bureaucrats“, as Wikipedians would put it?
Mike Bergman: I think the LOD (linking open data) and DBpedia mailing lists have been very effective, and there continues to be good community and organization around those efforts. We know that Wikipedia is not a free for all, with a kind of self-policing plus type of governance. I think that works well at the instance level.
However, structure decisions at the level of conceptual schema such as UMBEL or OpenCyc or even the mapping of classes between ontologies or datasets requires more skill and care. Others may not agree, but I think the schema aspects essential to UMBEL’s purpose — while definitely needing to be open and participatory at the suggestion or input level — possibly require roles more like “priest” or “professional” or “authority” at the actual roll-out level. Without quality, structure is nothing, and all of this is just an elaborate toy.
Sören Auer: Again my philosophy here differs a little from Mike’s: I’m pretty skeptical that there will be one ontology or organization scheme for the Semantic Web. Rather, I think structure and homogeneity will be achieved on a peer-to-peer basis first and a community consensus will emerge later.
Mike Bergman: Yeah, this is an excellent point and I’m glad Sören raised it. While it is true we have put much effort into creating a lightweight structure of concept classes for linking disparate datasets, we too do not see “one ontology to rule them all.” I suspect there will often be none, and then many times other frameworks chosen. It really depends on the use case and purpose. UMBEL’s specific purpose is to provide a coherent framework for serious knowledge engineers looking to federate data. After that, other frameworks with different purposes may then need to do the heavy lifting of actual data interoperability.
With its recent RDF service release, Freebase has risen to the level of a major SemWeb knowledge base, too. Where do you see its role in the future SemWeb ecology, also in relation to your own projects?
Sören Auer: To be honest, I was not very convinced by Freebase from the very beginning, although their technology, in particular the user interfaces, is impressive. From my point of view the Freebase approach was too centralized and proprietary. A better strategy would have been to develop an open (source) technology, which people can deploy on their own websites, combined with server-side crawling and search facilities. The first part of this equation is, by the way, exactly what we are aiming for with our OntoWiki and Triplify projects. However, if Freebase now moves towards more openness and interoperability, this can only be applauded.
Mike Bergman: I think that early on Freebase needed to find its feet and did not do much with regard to open standards or external interoperability. These are good signs we are now seeing. However, I think the revenue model around user-supplied data remains highly suspect. Heck, Wikipedia with all of its tremendous success is daily soliciting contributions from users. It’s hard to get traction without being open and free, and it’s hard to make money when you are open and free even with traction. But these new announcements now make it much easier for us to use Freebase should our customers request it.
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. Wikipedia is by far the largest publicly available encyclopedia on the Web. Wikipedia editions are available in over 250 languages, with the English one accounting for more than 2.49 million articles. Unfortunately, Wikipedia’s search capabilities are limited to full-text search, which allows very limited access to this valuable knowledge base. Semantic Web technologies enable expressive queries against structured and interlinked information on the Web. DBpedia allows you to make sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. The DBpedia data set currently provides information about more than 2.49 million “things”, including at least 108,000 persons, 392,000 places, 57,000 music albums, and 36,000 films. Altogether, the DBpedia data set consists of 218 million pieces of information (RDF triples).
dbpedia.org (general website)
wiki.dbpedia.org/OnlineAccess (DBpedia Wiki – Online Access)
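As a rough idea of what a “sophisticated query” against such data can look like, the following Python sketch assembles a simple SPARQL query and the corresponding HTTP GET URL. It is an illustration under stated assumptions, not DBpedia’s documented usage: the helper names `build_query` and `build_request_url` are invented here, and `http://example.org/sparql` is a placeholder endpoint (the Online Access page above describes the real access options). No request is actually sent.

```python
from urllib.parse import urlencode

# Standard namespace IRIs for the three vocabularies used in the query.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def build_query(class_name, language="en", limit=10):
    """Build a SPARQL SELECT for members of a class plus their labels."""
    prefix_lines = "\n".join(
        f"PREFIX {name}: <{iri}>" for name, iri in PREFIXES.items()
    )
    return (
        f"{prefix_lines}\n"
        "SELECT ?thing ?label WHERE {\n"
        f"  ?thing rdf:type {class_name} .\n"
        "  ?thing rdfs:label ?label .\n"
        f'  FILTER (lang(?label) = "{language}")\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request URL."""
    return f"{endpoint}?{urlencode({'query': query})}"


query = build_query("foaf:Person", limit=5)
# Placeholder endpoint: substitute a real SPARQL endpoint before sending.
url = build_request_url("http://example.org/sparql", query)
print(query)
print(url)
```

A full-text search cannot express “all things typed as a person, with their English labels”; a structured query states exactly that in three lines.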
UMBEL (Upper Mapping and Binding Exchange Layer) is a lightweight ontology structure for relating Web content and data to a standard set of 20,000 subject concepts. Its purpose is to provide a fixed set of reference points in a global knowledge space. These subject concepts have defined relationships between them, and can act as binding or attachment points for any Web content or data. UMBEL is like a map of an interstate highway system, a set of roadsigns to help find related content and a way of getting from one big place to another. Once in the right vicinity, other maps (or ontologies) — more akin to detailed street maps — are then necessary to get to specific locations or street addresses. By definition, these more fine-grained maps are beyond UMBEL’s scope. But UMBEL can help provide the context for placing such detailed maps in relation to one another and in relation to the Big Picture of what related content is about.
umbel.org (project website)
umbel.zitgist.com (UMBEL webservice – sandbox)
The interview was led by Andreas Blumauer, SWC