Learn more
- Sep 29, 2015
SPARQL analytics proves boxers live dangerously
You have always thought that SPARQL is only a query language for RDF data? Then think again, because SPARQL can also be used to implement some cool analytics. I show here two queries that demonstrate that principle.
For simplicity we use a publicly available dataset of DBpedia on an open SPARQL endpoint: http://live.dbpedia.org/sparql (execute with default graph = http://dbpedia.org).
Mean life expectancy for different sports
The query shown here starts from the class dbp:Athlete and retrieves sub classes thereof that cover different sports. With that athletes of that areas are obtained and their birth and death dates (i.e. we only take into account deceased individuals). From the dates the years are extracted. Here a regular expression is used because the SPARQL function to extract years from a literal of a date type returned errors and could not be used. From the birth and death years the age is calculated (we filter for a range of 20 to 100 years because in data sources like this erroneous entries have always to be accounted for). Then the data is simply grouped and we count for each sport the number of athletes that were selected and the average age they reached.
prefix dbp:<http://dbpedia.org/ontology/>
select ?athleteGroupEN (count(?athlete) as ?count) (avg(?age) as ?ageAvg)
where {
filter(?age >= 20 && ?age <= 100) .
{
select distinct ?athleteGroupEN ?athlete (?deathYear - ?birthYear as ?age)
where {
?subOfAthlete rdfs:subClassOf dbp:Athlete .
?subOfAthlete rdfs:label ?athleteGroup filter(lang(?athleteGroup) = "en") .
bind(str(?athleteGroup) as ?athleteGroupEN)
?athlete a ?subOfAthlete .
?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) .
?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) .
bind (strdt(replace(?birth,"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) .
bind (strdt(replace(?death,"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) .
}
}
} group by ?athleteGroupEN having (count(?athlete) >= 25) order by ?ageAvg
The results are not unexpected and show that athletes in the area of motor sports, wresting and boxing die at younger age. On the other hand horse riders, but also tennis and golf players live on average clearly longer.
athleteGroupEN
|
count
|
ageAvg
|
---|---|---|
wrestler | 693 | 58.962481962481962 |
winter sport Player | 1775 | 66.60169014084507 |
tennis player | 577 | 71.483535528596187 |
table tennis player | 45 | 68.733333333333333 |
swimmer | 402 | 68.674129353233831 |
soccer player | 6572 | 63.992391965916007 |
snooker player | 25 | 70.12 |
rugby player | 1452 | 67.272038567493113 |
rower | 69 | 63.057971014492754 |
poker player | 30 | 66.866666666666667 |
national collegiate athletic association athlete | 44 | 68.090909090909091 |
motorsport racer | 1237 | 58.117219078415521 |
martial artist | 197 | 67.157360406091371 |
jockey (horse racer) | 139 | 65.992805755395683 |
horse rider | 181 | 74.651933701657459 |
gymnast | 175 | 65.805714285714286 |
gridiron football player | 4247 | 67.713680244878738 |
golf player | 400 | 71.13 |
Gaelic games player | 95 | 70.589473684210526 |
cyclist | 1370 | 67.469343065693431 |
cricketer | 4998 | 68.420368147258904 |
chess player | 45 | 70.244444444444444 |
boxer | 869 | 60.352128883774453 |
bodybuilder | 27 | 52 |
basketball player | 822 | 66.165450121654501 |
baseball player | 9207 | 68.611382643640708 |
Australian rules football player | 2790 | 69.52831541218638 |
This is especially relevant when that data is large and one would have to extract it from the database and import it into another tool to do the counting and calculations.
Simple statistical measures over life expectancy
Another standard statistical measure is the standard deviation. A good description about how to calculate it can be found for example here. We start again with the class dbp:Athlete and calculate the ages they reached (this time for the entire class dbp:Athlete not its sub classes). Another thing we need are the squares of the ages that we calculate with “(?age * ?age as ?ageSquare)”. At the next stage we count the number of athletes in the result, and calculate the average age, the square of the sums and the sum of the squares. With those values we can calculate in the next step the standard deviation of the ages in our data set. Note that SPARQL does not specify a function for calculating square roots but RDF stores like Virtuoso (that hosts the DBpedia data) provide additional functions like bif:sqrt for calculating the square root of a value.
prefix dbp:<http://dbpedia.org/ontology/>
select ?count ?ageAvg (bif:sqrt((?ageSquareSum - (strdt(?ageSumSquare,xsd:double) / ?count)) / (?count - 1)) as ?standDev)
where {
{
select (count(?athlete) as ?count) (avg(?age) as ?ageAvg) (sum(?age) * sum(?age) as ?ageSumSquare) (sum(?ageSquare) as ?ageSquareSum)
where {
{
select ?subOfAthlete ?athlete ?age (?age * ?age as ?ageSquare)
where {
filter(?age >= 20 && ?age <= 100) .
{
select distinct ?subOfAthlete ?athlete (?deathYear - ?birthYear as ?age)
where {
?subOfAthlete rdfs:subClassOf dbp:Athlete .
?athlete a ?subOfAthlete .
?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) .
?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) .
bind (strdt(replace(?birth,"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) .
bind (strdt(replace(?death,"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) .
}
}
}
}
}
}
}
count
|
ageAvg
|
standDev
|
---|---|---|
38542 | 66.876290799647138 | 17.6479 |
These examples show that SPARQL is quite powerful and a lot more than “just” a query language for RDF data but that it is possible to implement basic statistical methods directly at the level of the triple store without the need to extract the data and import it into another tool.