As a matter of fact, things change – the Web of Data is no exception in that respect. While some sources, such as Twitter, are intrinsically dynamic, others change every now and then, potentially at unforeseeable intervals. In the recent Talis Nodalities Magazine, we made a case for Keeping up with a LOD of changes; here I’m going to elaborate a bit more on the current state of Dataset Dynamics and its challenges.
However, tackling dataset-level changes is a rather new field with no agreed-upon, let alone standardised, solution at hand. The main problem is that a dataset typically talks about many thousands to millions of distinct entities, which makes it impractical to apply entity-level solutions for a range of use cases, such as link maintenance or replication (see also Fig. 2).
I often hear these days: “it seems there is no solution for handling dataset-level changes”; nevertheless, I think quite the opposite is true. There are plenty of proposed solutions from both academia and practitioners, targeting different challenges in the areas of:
Change discovery – how do I find out about dataset changes?
Propagating changes – if there is a change, how is the change communicated to a consumer?
Change semantics – how do I learn what has changed (has been added, removed, etc.)?
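To make the third challenge concrete, change semantics at its simplest boils down to a diff between two versions of a dataset: which statements have been added and which have been removed. Here is a minimal sketch in Python, treating a dataset snapshot as a set of RDF triples written as plain tuples; all names and triples are purely illustrative, not any agreed-upon vocabulary.

```python
# Illustrative sketch: change semantics as a diff between two
# dataset snapshots, each modelled as a set of (s, p, o) tuples.

def diff_snapshots(old, new):
    """Return (added, removed) triples between two dataset versions."""
    added = new - old      # triples only in the new version
    removed = old - new    # triples only in the old version
    return added, removed

old = {
    ("ex:Alice", "foaf:name", '"Alice"'),
    ("ex:Alice", "foaf:knows", "ex:Bob"),
}
new = {
    ("ex:Alice", "foaf:name", '"Alice"'),
    ("ex:Alice", "foaf:knows", "ex:Carol"),
}

added, removed = diff_snapshots(old, new)
print("added:", added)
print("removed:", removed)
```

In practice one would of course exchange such a delta in a serialised form (for example as an RDF changeset) rather than recompute it from full snapshots, but the semantics – added vs. removed statements – stay the same.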
No matter what (set of) solutions the community eventually agrees on to address the handling of dataset-level changes, it should adhere to the following principles:
light-weight
distributed and scalable
standards-based
Obviously, a light-weight (and ideally RESTful) approach lowers the barriers to adoption and enables a quick uptake. When I say light-weight, I mean it both in terms of protocol and code: it should be easy to integrate into RDF stores and libraries, and available in all common Web programming languages, including but not limited to Java, PHP, the .NET family, etc.
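As one illustration of what “light-weight in terms of protocol” can mean: plain HTTP already offers conditional requests, so a consumer can poll a dataset description and pay almost nothing when nothing has changed. A hedged sketch using only the Python standard library follows; the URL, ETag, and date are made up for illustration, and this is just one possible discovery mechanism, not the mechanism.

```python
# Illustrative sketch: light-weight, RESTful change discovery via a
# conditional HTTP GET. If the resource is unchanged, the server can
# answer 304 Not Modified instead of re-sending the whole dataset.
import urllib.request

def build_conditional_request(url, etag=None, last_modified=None):
    """Prepare a GET that only transfers data if the resource changed."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

req = build_conditional_request(
    "http://data.example.org/void.ttl",       # hypothetical dataset URL
    etag='"abc123"',                          # hypothetical validator
    last_modified="Sat, 01 May 2010 12:00:00 GMT",
)
# Note: urllib stores header names capitalised, e.g. "If-none-match".
print(req.headers)
```

Sending the request (e.g. via `urllib.request.urlopen`) is omitted here since it requires a live endpoint.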
Just as the Web of Data is a globally distributed dataspace, handling of changes should be done in a distributed fashion. There will be many different publishers and consumers (such as agents, indexers, consolidator platforms, etc.) of datasets with different requirements and capabilities. A distributed approach can cope with this challenge in a cost- and performance-efficient way. Tightly connected to this: it has to scale. Today, we’re dealing with a few hundred LOD datasets. In the next couple of years, this will likely explode into the millions, and hence one needs to be able to deal with such growth. The same, just sooner, is true for the number of consumers of the changes.
Last but not least, the Dataset Dynamics solution should be based on standards. It doesn’t necessarily need to be RDF for all of the challenges outlined above. For example, Atom offers a standardised, extensible and widely accepted format to propagate changes; taking this further, PubSubHubbub can be utilised to enable a standardised, distributed publisher-subscriber scheme (Fig. 3).
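To give a flavour of the Atom-plus-PubSubHubbub combination: a publisher exposes a change feed whose entries announce dataset updates, and advertises its hub via a `rel="hub"` link, so consumers subscribe at the hub instead of polling. Below is a minimal sketch that builds such a feed with Python’s standard library; all URLs, titles, and identifiers are invented for illustration.

```python
# Illustrative sketch: an Atom change feed advertising a
# PubSubHubbub hub via a rel="hub" link element.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialise Atom as the default namespace

feed = ET.Element(f"{{{ATOM}}}feed")
ET.SubElement(feed, f"{{{ATOM}}}title").text = "Example dataset changes"
# Consumers discover the hub from this link and subscribe there,
# rather than repeatedly polling the feed itself.
ET.SubElement(feed, f"{{{ATOM}}}link",
              rel="hub", href="https://hub.example.org/")
ET.SubElement(feed, f"{{{ATOM}}}link",
              rel="self", href="https://data.example.org/changes.atom")

entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
ET.SubElement(entry, f"{{{ATOM}}}title").text = "Dataset updated"
ET.SubElement(entry, f"{{{ATOM}}}id").text = "urn:example:change-1"
ET.SubElement(entry, f"{{{ATOM}}}updated").text = "2010-05-01T12:00:00Z"

print(ET.tostring(feed, encoding="unicode"))
```

What exactly goes inside each entry – a link to a changeset, an embedded delta, or just a ping – is precisely one of the open questions this post is about.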
As I’ve outlined above, it might still be too early for a conclusion on how to deal with dataset-level changes. However, people interested in this area have already gathered in the Dataset Dynamics group, where solutions are discussed and implemented, potentially leading to W3C standardisation work.