Interview with Georgi Kobilarov: "I believe that data publishing must happen in a distributed style."
Uberblic.org connects structured data from the web. The Berlin-based inventor Georgi Kobilarov gives a brief insight into the mashup service and talks about the challenges of building applications on linked data.
You have recently published the service uberblic.org, a Linked Data mashup editor. What was your motivation to develop this tool?
Uberblic.org provides an integrated view of web data. Our goal is to integrate all the structured data on the web, and give web developers a single point of access to that reconciled data. More than that, we will open up the tools we use to manage the data sources to the community, so that people can help us curate that repository of free data. We re-publish all the data we import as Linked Data, under the licenses of the original data publishers.
Some of the data sources we import are available in the Linked Open Data cloud as well, but many are not. Linked Data is an elegant way to publish data in a distributed way on the web, but consuming it from that distributed cloud is – at least – impractical. In every real-world application using linked data from the web I’ve seen, organizations built up internal copies of the cloud, and often even reconciled linked data sources. They build their own Linked Data proxies. Uberblic.org helps those users by providing one public proxy for data from the web. Many of our sources are monitored for data changes, and the corresponding data in uberblic is updated in real-time.
Can you give us a brief insight into how the tool works? What technology is it built on?
My company, Uberblic Labs, has developed a data integration platform that we use to power uberblic.org. We call it the Uberblic Platform (the name uberblic is derived from the German “Überblick” – English “overview”). This platform enables us to do the full process of “data fusion”: Importing and converting external data sources, mapping the data schemas to a central ontology, filtering out data errors, automatically suggesting duplicates to the user, and merging data from different sources into a single, reconciled representation.
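As an editorial illustration of the "data fusion" steps named above (mapping schemas to a central ontology, filtering errors, suggesting duplicates, merging), here is a minimal Python sketch. It is not Uberblic's code: the source names, field names, sample values, similarity threshold and merge rule are all assumptions made up for the example.

```python
from difflib import SequenceMatcher

# Hypothetical per-source mappings from source fields to central ontology properties.
SCHEMA_MAP = {
    "source_a": {"title": "name", "pop": "population"},
    "source_b": {"label": "name", "inhabitants": "population"},
}


def map_schema(source, record):
    """Rename source-specific fields to ontology properties; drop unmapped fields."""
    mapping = SCHEMA_MAP[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def is_valid(record):
    """Filter obvious data errors: an empty name or a negative population."""
    pop = record.get("population")
    return bool(record.get("name")) and (pop is None or pop >= 0)


def suggest_duplicates(records, threshold=0.85):
    """Yield index pairs with similar names; a human editor would confirm them."""
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a = records[i]["name"].lower()
            b = records[j]["name"].lower()
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                yield (i, j)


def merge(a, b):
    """Merge two confirmed duplicates; values from `a` win on conflict (assumed rule)."""
    return {**b, **a}


# Made-up sample records, not real data.
raw = [
    ("source_a", {"title": "Example City", "pop": 100}),
    ("source_b", {"label": "example city", "inhabitants": 105, "area": 12}),
    ("source_b", {"label": "", "inhabitants": -1}),
]
mapped = [map_schema(s, r) for s, r in raw]
clean = [r for r in mapped if is_valid(r)]
pairs = list(suggest_duplicates(clean))
fused = merge(clean[0], clean[1]) if pairs else None
```

The duplicate step only suggests candidate pairs; in a workflow like the one described, a person would accept or reject each suggestion before the merge is applied.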
Structured and semi-structured data from the web is an excellent use case for our software platform, since there we come across all the interesting cases of real-world data heterogeneity. But what I think is especially powerful, and yet missing in other Linked Data projects I know, is the ability to subscribe to update feeds. We do that extensively, fetching updates in real-time from Wikipedia and the like.
Our platform is built in Scala and runs on a cluster of machines, with workers communicating through a messaging system. We developed an RDF storage layer on top of a distributed key-value store for storing all provenance information used in the extraction process, currently around 100 million named graphs for uberblic.org. That storage layer does not directly provide SPARQL access, so we push all the output data into a SPARQL endpoint hosted by Talis as well.
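To make the idea of named graphs on a key-value store concrete, here is a small Python sketch in which a plain dictionary stands in for the distributed store. Everything in it (the class name, the subject index, the `urn:` and `ex:` identifiers) is a hypothetical illustration, not the Uberblic Platform's actual design, which is written in Scala and not described in detail here.

```python
from collections import defaultdict


class NamedGraphStore:
    """Toy quad store: key = graph IRI, value = set of (s, p, o) triples."""

    def __init__(self):
        # In-memory dicts stand in for a distributed key-value store.
        self._graphs = defaultdict(set)
        # Secondary index (an assumption): subject -> graph IRIs mentioning it.
        self._by_subject = defaultdict(set)

    def add(self, graph, s, p, o):
        """Store one triple under the named graph that records its origin."""
        self._graphs[graph].add((s, p, o))
        self._by_subject[s].add(graph)

    def graph(self, graph):
        """Key lookup: return a copy of all triples in one named graph."""
        return set(self._graphs.get(graph, ()))

    def describe(self, s):
        """Return every triple about `s`, paired with the graph it came from."""
        return sorted(
            (g, t)
            for g in self._by_subject.get(s, ())
            for t in self._graphs[g]
            if t[0] == s
        )


store = NamedGraphStore()
store.add("urn:graph:source-a/1", "ex:city", "ex:name", "Example City")
store.add("urn:graph:source-b/7", "ex:city", "ex:population", "100")
facts = store.describe("ex:city")
```

A store like this answers key lookups and simple per-subject questions, and keeps the origin of each fact. It has no query planner for arbitrary graph patterns, which is one plausible reason to push the output into a separate SPARQL endpoint.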
What have been the biggest challenges in tackling the integration issues of dispersed data?
It was quite a steep learning curve to do Linked Data not only in an academic environment, but in a reliable, industrial-strength set-up. In academia, there was always the excuse that things are just research prototypes. Now that excuse is gone. That’s also where it becomes necessary to manually clean up data. And there are two ways to do that: Either you enable the users to change facts directly in your repository after you have imported the external data (that is what Freebase does), or you facilitate clean-up cycles in the original data source and fetch these updates in real-time. That is what we do.
I believe that data publishing must happen in a distributed style, because then each data source gets taken care of by a specialized group of people using specialized tools. And it’s what you see not only on the web, but also inside organizations and enterprises. But consuming data through centralized APIs is more than just convenient. We all use Google or another search engine as a central access point to web pages which are published in a distributed way all over the web, don’t we? Can you imagine today researching a topic on the web without the centralization power of search engines, just by following links across web sites, like in the old days?
When we built the Uberblic Platform, some of the things I imagined to be large headaches, like schema mapping, turned out to work really well. Those pathological cases you often see in academic “challenges” are – well – pathological. It’s not necessary to solve them fully automatically through super-intelligent algorithms. Much more important than the sophistication of your algorithms are well-designed workflows, so that the user becomes a part of the solution. And that’s not about crowd-sourcing or swarm intelligence: the editorial curation of schema mappings and object reconciliation can be done by just a small team of people. If they have the right set of tools.
What are the next plans with uberblic.org? Where will the journey go?
Uberblic.org will continue to integrate more interesting and useful data sources from the web, and we will start making more APIs available to web developers to build their applications on top. We are also looking for partners who are interested in developing applications and have been struggling in the past to get the cross-source data from the web they need.
The work on improving uberblic.org will also benefit our Uberblic Platform, and hence our clients who use that same software for integrating organizational data sources with each other and with the web of data.
About Georgi Kobilarov
Georgi is founder and managing director of Uberblic Labs, a company based in Berlin specialized in Linked Data integration. He worked as a research associate in the Web-based Systems Group at Freie Universität Berlin and as a visiting researcher at Hewlett Packard Labs Bristol. As co-founder and lead developer of DBpedia, he was also a day-one contributor to the Linking Open Data project. Georgi is consulting with the BBC on several Linked Data related projects. He organizes the Web of Data Meetup London, a bi-yearly gathering of the UK Linked Data community. Georgi graduated with a Diplom in business administration from Freie Universität Berlin and has many years of work experience as a software developer. Visit his blog: http://blog.georgikobilarov.com