SPARQL analytics proves boxers live dangerously

SPARQL, SPARQL code, Uncategorized

You have always thought that SPARQL is only a query language for RDF data? Then think again, because SPARQL can also be used to implement some cool analytics. I show here two queries that demonstrate that principle.

For simplicity we use a publicly available dataset of DBpedia on an open SPARQL endpoint: http://live.dbpedia.org/sparql (execute with default graph = http://dbpedia.org).

Mean life expectancy for different sports

The query shown here starts from the class dbp:Athlete and retrieves sub classes thereof that cover different sports. With that athletes of that areas are obtained and their birth and death dates (i.e. we only take into account deceased individuals). From the dates the years are extracted. Here a regular expression is used because the SPARQL function to extract years from a literal of a date type returned errors and could not be used. From the birth and death years the age is calculated (we filter for a range of 20 to 100 years because in data sources like this erroneous entries have always to be accounted for). Then the data is simply grouped and we count for each sport the number of athletes that were selected and the average age they reached.

prefix dbp:<http://dbpedia.org/ontology/> select ?athleteGroupEN (count(?athlete) as ?count) (avg(?age) as ?ageAvg) where { filter(?age >= 20 && ?age <= 100) . { select distinct ?athleteGroupEN ?athlete (?deathYear - ?birthYear as ?age) where { ?subOfAthlete rdfs:subClassOf dbp:Athlete . ?subOfAthlete rdfs:label ?athleteGroup filter(lang(?athleteGroup) = "en") . bind(str(?athleteGroup) as ?athleteGroupEN) ?athlete a ?subOfAthlete . ?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) . ?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) . bind (strdt(replace(?birth,"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) . bind (strdt(replace(?death,"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) . } } } group by ?athleteGroupEN having (count(?athlete) >= 25) order by ?ageAvg

The results are not unexpected and show that athletes in the area of motor sports, wresting and boxing die at younger age. On the other hand horse riders, but also tennis and golf players live on average clearly longer.

athleteGroupEN	count	ageAvg
wrestler	693	58.962481962481962
winter sport Player	1775	66.60169014084507
tennis player	577	71.483535528596187
table tennis player	45	68.733333333333333
swimmer	402	68.674129353233831
soccer player	6572	63.992391965916007
snooker player	25	70.12
rugby player	1452	67.272038567493113
rower	69	63.057971014492754
poker player	30	66.866666666666667
national collegiate athletic association athlete	44	68.090909090909091
motorsport racer	1237	58.117219078415521
martial artist	197	67.157360406091371
jockey (horse racer)	139	65.992805755395683
horse rider	181	74.651933701657459
gymnast	175	65.805714285714286
gridiron football player	4247	67.713680244878738
golf player	400	71.13
Gaelic games player	95	70.589473684210526
cyclist	1370	67.469343065693431
cricketer	4998	68.420368147258904
chess player	45	70.244444444444444
boxer	869	60.352128883774453
bodybuilder	27	52
basketball player	822	66.165450121654501
baseball player	9207	68.611382643640708
Australian rules football player	2790	69.52831541218638

This is especially relevant when that data is large and one would have to extract it from the database and import it into another tool to do the counting and calculations.

Simple statistical measures over life expectancy

Another standard statistical measure is the standard deviation. A good description about how to calculate it can be found for example here. We start again with the class dbp:Athlete and calculate the ages they reached (this time for the entire class dbp:Athlete not its sub classes). Another thing we need are the squares of the ages that we calculate with “(?age * ?age as ?ageSquare)”. At the next stage we count the number of athletes in the result, and calculate the average age, the square of the sums and the sum of the squares. With those values we can calculate in the next step the standard deviation of the ages in our data set. Note that SPARQL does not specify a function for calculating square roots but RDF stores like Virtuoso (that hosts the DBpedia data) provide additional functions like bif:sqrt for calculating the square root of a value.

prefix dbp:<http://dbpedia.org/ontology/> select ?count ?ageAvg (bif:sqrt((?ageSquareSum - (strdt(?ageSumSquare,xsd:double) / ?count)) / (?count - 1)) as ?standDev) where { { select (count(?athlete) as ?count) (avg(?age) as ?ageAvg) (sum(?age) * sum(?age) as ?ageSumSquare) (sum(?ageSquare) as ?ageSquareSum) where { { select ?subOfAthlete ?athlete ?age (?age * ?age as ?ageSquare) where { filter(?age >= 20 && ?age <= 100) . { select distinct ?subOfAthlete ?athlete (?deathYear - ?birthYear as ?age) where { ?subOfAthlete rdfs:subClassOf dbp:Athlete . ?athlete a ?subOfAthlete . ?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) . ?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) . bind (strdt(replace(?birth,"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) . bind (strdt(replace(?death,"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) . } } } } } } }

count	ageAvg	standDev
38542	66.876290799647138	17.6479

These examples show that SPARQL is quite powerful and a lot more than “just” a query language for RDF data but that it is possible to implement basic statistical methods directly at the level of the triple store without the need to extract the data and import it into another tool.

SPARQL analytics proves boxers live dangerously

Mean life expectancy for different sports

Simple statistical measures over life expectancy

Follow us on