In a previous blog post I discussed the power of SPARQL to go beyond data retrieval to analytics. Here I look into how to implement a product recommender entirely in SPARQL. Products are considered similar if they share relevant characteristics, and the higher the overlap, the higher the similarity. In the case of movies or TV programs there are static characteristics (e.g. genre, actors, director) and dynamic ones such as the viewing patterns of the audience.
The static part we can look up in resources like DBpedia. If we look at the data related to the resource <http://dbpedia.org/resource/Friends> (which represents the TV show “Friends”), we can use, for example, the associated subjects (see predicate dcterms:subject). In this case we find, for example, <http://dbpedia.org/resource/Category:American_television_sitcoms> or <http://dbpedia.org/resource/Category:Television_shows_set_in_New_York_City>. If we want to find other TV shows that are related to the same subjects, we can do this with a query that works in the following steps:
- Count the number of subjects related to TV show “Friends”.
- Get all TV shows that share at least one subject with “Friends” and count how many they have in common.
- For each of those related shows count the number of subjects they are related to.
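- Now we can calculate the relative overlap in subjects, which is (number of shared subjects) / (number of subjects for “Friends” + number of subjects for the other show - number of shared subjects). This ratio of shared subjects to all distinct subjects of both shows is also known as the Jaccard index.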
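To make the arithmetic of these steps concrete outside of SPARQL, here is a minimal Python sketch. It is not the query itself, only the same calculation on plain sets. The function names and the two small subject sets are made up for illustration and are not taken from DBpedia.

```python
def overlap_score(n_a: int, n_b: int, n_shared: int) -> float:
    """Shared subjects divided by the number of distinct subjects of both shows."""
    union = n_a + n_b - n_shared
    # Guard against two shows without any subjects at all
    return n_shared / union if union else 0.0


def subject_overlap(subjects_a: set[str], subjects_b: set[str]) -> float:
    """Same score, computed directly from two sets of subject identifiers."""
    shared = subjects_a & subjects_b
    return overlap_score(len(subjects_a), len(subjects_b), len(shared))


# Hypothetical subject sets, for illustration only
show_a = {"sitcoms", "set_in_new_york_city", "subject_x"}
show_b = {"sitcoms", "set_in_new_york_city", "subject_y", "subject_z"}

# 2 shared subjects, 3 + 4 - 2 = 5 distinct subjects in total
print(subject_overlap(show_a, show_b))
```

Because the score only needs three counts, the SPARQL version can work with COUNT aggregates and never has to materialize the sets themselves.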
This gives us a score of how related one show is to another one. The results are sorted by score (the higher the better) and these are the results for “Friends”:
In the first line of the results we see that “Friends” is associated with 16 subjects (that is the same in every line), “Will & Grace” with 18, and they share 10 subjects. That results in a score of 10 / (16 + 18 - 10) = 0.416667. Other characteristics to look at are the actors starring in a show, the creators (authors), or the executive producers.
We can pack all this into one query and retrieve similar TV shows based on shared subjects, starring actors, creators, and executive producers. The inner queries retrieve the shows that share some of those characteristics, count the numbers as shown before, and calculate a score for each dimension. The individual scores can be weighted. In the example here the creator score is multiplied by 0.5 and the producer score by 0.75 to adjust the influence of each of them.
This results in:
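The weighting step can be sketched in Python as follows. The weights 0.5 for creators and 0.75 for producers come from the text above. Everything else is an assumption made for illustration: that subjects and starring actors keep a weight of 1.0, that the weighted scores are summed into the final score, and the per-dimension scores of the example show.

```python
# Weights per dimension. 0.5 and 0.75 are from the text; 1.0 is an assumption.
WEIGHTS = {"subject": 1.0, "starring": 1.0, "creator": 0.5, "producer": 0.75}


def combined_score(scores: dict[str, float],
                   weights: dict[str, float] = WEIGHTS) -> float:
    """Sum of the weighted per-dimension scores (summing is an assumption).

    A dimension missing from `scores` counts as 0, i.e. nothing is shared.
    """
    return sum(weights[dim] * scores.get(dim, 0.0) for dim in weights)


def rank(candidates: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Sort candidate shows by their combined score, best first."""
    scored = [(show, combined_score(s)) for show, s in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-dimension overlap scores for two made-up shows
candidates = {
    "show_b": {"subject": 0.4, "creator": 0.5, "producer": 0.2},
    "show_c": {"subject": 0.3, "starring": 0.1},
}

for show, score in rank(candidates):
    print(show, round(score, 3))
```

Lowering a weight reduces how much a single shared creator or producer can lift a show in the ranking, which is the purpose of the multipliers in the query.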
Each line shows the individual scores for each of the predicates used and in the last column the final score. You can also try out the query with “House” <http://dbpedia.org/resource/House_(TV_series)> or “Suits” <http://dbpedia.org/resource/Suits_(TV_series)> and get shows related to those.
This approach can also be used for any other data where we want to find similar items based on the characteristics they share. One could, for example, compare persons (e.g. by profession, interests, …) or consumer electronics products like cameras (resolution, storage, size, or price range).