Semantic Web Company

  • Sep 29, 2015

SPARQL analytics proves boxers live dangerously

  • SPARQL, SPARQL code, Uncategorized

If you have always thought of SPARQL as only a query language for RDF data, think again: SPARQL can also be used to implement some useful analytics. Here I show two queries that demonstrate this.

For simplicity we use the publicly available DBpedia dataset on an open SPARQL endpoint: http://live.dbpedia.org/sparql (execute with default graph = http://dbpedia.org).

Mean life expectancy for different sports

The query shown here starts from the class dbp:Athlete and retrieves its subclasses, which cover the different sports. For each subclass it retrieves the athletes of that sport together with their birth and death dates (i.e. we only take deceased individuals into account). The years are then extracted from the dates. A regular expression is used here because the SPARQL function for extracting the year from a date-typed literal returned errors and could not be used. The age is calculated from the birth and death years; we keep only ages between 20 and 100 years, because erroneous entries must always be expected in data sources like this. Finally the data is grouped by sport, and for each sport we count the selected athletes and compute the average age they reached. Sports with fewer than 25 selected athletes are dropped.

prefix dbp: <http://dbpedia.org/ontology/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
# number of athletes and average age reached, per sport
select ?athleteGroupEN (count(?athlete) as ?count) (avg(?age) as ?ageAvg)
where {
  # discard implausible ages caused by erroneous entries
  filter(?age >= 20 && ?age <= 100) .
  {
    select distinct ?athleteGroupEN ?athlete (?deathYear - ?birthYear as ?age)
    where {
      ?subOfAthlete rdfs:subClassOf dbp:Athlete .
      ?subOfAthlete rdfs:label ?athleteGroup filter(lang(?athleteGroup) = "en") .
      bind(str(?athleteGroup) as ?athleteGroupEN)
      ?athlete a ?subOfAthlete .
      ?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) .
      ?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) .
      # extract the year with a regular expression; str() passes the lexical form of the date
      bind (strdt(replace(str(?birth),"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) .
      bind (strdt(replace(str(?death),"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) .
    }
  }
} group by ?athleteGroupEN having (count(?athlete) >= 25) order by ?ageAvg

The results are not unexpected: athletes in motor sports, wrestling and boxing die at a younger age. Horse riders, on the other hand, as well as tennis and golf players, clearly live longer on average.

athleteGroupEN count ageAvg
wrestler 693 58.962481962481962
winter sport Player 1775 66.60169014084507
tennis player 577 71.483535528596187
table tennis player 45 68.733333333333333
swimmer 402 68.674129353233831
soccer player 6572 63.992391965916007
snooker player 25 70.12
rugby player 1452 67.272038567493113
rower 69 63.057971014492754
poker player 30 66.866666666666667
national collegiate athletic association athlete 44 68.090909090909091
motorsport racer 1237 58.117219078415521
martial artist 197 67.157360406091371
jockey (horse racer) 139 65.992805755395683
horse rider 181 74.651933701657459
gymnast 175 65.805714285714286
gridiron football player 4247 67.713680244878738
golf player 400 71.13
Gaelic games player 95 70.589473684210526
cyclist 1370 67.469343065693431
cricketer 4998 68.420368147258904
chess player 45 70.244444444444444
boxer 869 60.352128883774453
bodybuilder 27 52
basketball player 822 66.165450121654501
baseball player 9207 68.611382643640708
Australian rules football player 2790 69.52831541218638

Computing such aggregates directly in the triple store is especially useful when the data is large, because otherwise one would have to extract it from the database and import it into another tool to do the counting and calculations.
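To make the individual steps of the query above easier to follow (year extraction, age filter, grouping, counting and averaging), here is a minimal Python sketch of the same logic. It is only an illustration: the sample records, the names year_of and mean_age_by_group, and the min_count parameter are assumptions made for this example and do not come from DBpedia or from the query itself. Unlike the query, the sketch does not remove duplicate athletes.

```python
import re
from collections import defaultdict

# Hypothetical sample records (sport, birth date, death date); not DBpedia data.
RECORDS = [
    ("boxer", "1900-05-01", "1960-03-12"),
    ("boxer", "1910-01-20", "1972-11-02"),
    ("golf player", "1905-07-07", "1980-02-14"),
    ("golf player", "1920-09-30", "1990-06-01"),
    ("golf player", "1950-01-01", "1955-01-01"),  # age 5, removed by the range filter
]

# Same pattern as in the query: keep the leading digits before the first hyphen.
YEAR_PATTERN = re.compile(r"^(\d+)-.*")


def year_of(date_literal):
    """Return the year of a date string as an integer, mirroring replace() plus strdt()."""
    return int(YEAR_PATTERN.sub(r"\1", date_literal))


def mean_age_by_group(records, min_count=1, min_age=20, max_age=100):
    """Group ages by sport and return (sport, count, average age) sorted by average age."""
    ages = defaultdict(list)
    for group, birth, death in records:
        age = year_of(death) - year_of(birth)
        if min_age <= age <= max_age:  # corresponds to filter(?age >= 20 && ?age <= 100)
            ages[group].append(age)
    rows = [
        (group, len(values), sum(values) / len(values))
        for group, values in ages.items()
        if len(values) >= min_count  # corresponds to having (count(?athlete) >= 25)
    ]
    return sorted(rows, key=lambda row: row[2])  # corresponds to order by ?ageAvg


RESULT = mean_age_by_group(RECORDS)
for row in RESULT:
    print(row)
```

With min_count=25 the sketch would apply the same threshold as the query; the default of 1 is used here only because the sample is tiny.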

Simple statistical measures over life expectancy

Another standard statistical measure is the standard deviation. A good description of how to calculate it can be found, for example, here. We start again with the class dbp:Athlete and calculate the ages the athletes reached (this time for the entire class dbp:Athlete, not per subclass). We also need the squares of the ages, which we calculate with “(?age * ?age as ?ageSquare)”. At the next stage we count the number of athletes in the result and calculate the average age, the square of the sum of the ages and the sum of the squared ages. With those values we can calculate the standard deviation of the ages in our data set in the final step. Note that SPARQL does not specify a function for calculating square roots, but RDF stores like Virtuoso (which hosts the DBpedia data) provide additional functions such as bif:sqrt for this purpose.

prefix dbp: <http://dbpedia.org/ontology/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
# standard deviation from the sum of squares and the square of the sum; bif:sqrt is Virtuoso-specific
select ?count ?ageAvg (bif:sqrt((?ageSquareSum - (strdt(?ageSumSquare,xsd:double) / ?count)) / (?count - 1)) as ?standDev)
where {
  {
    # count, average, square of the sum and sum of the squares
    select (count(?athlete) as ?count) (avg(?age) as ?ageAvg) (sum(?age) * sum(?age) as ?ageSumSquare) (sum(?ageSquare) as ?ageSquareSum)
    where {
      {
        # square of each age, after discarding implausible ages
        select ?subOfAthlete ?athlete ?age (?age * ?age as ?ageSquare)
        where {
          filter(?age >= 20 && ?age <= 100) .
          {
            select distinct ?subOfAthlete ?athlete (?deathYear - ?birthYear as ?age)
            where {
              ?subOfAthlete rdfs:subClassOf dbp:Athlete .
              ?athlete a ?subOfAthlete .
              ?athlete dbp:birthDate ?birth filter(datatype(?birth) = xsd:date) .
              ?athlete dbp:deathDate ?death filter(datatype(?death) = xsd:date) .
              # extract the year with a regular expression; str() passes the lexical form of the date
              bind (strdt(replace(str(?birth),"^(\\d+)-.*","$1"),xsd:integer) as ?birthYear) .
              bind (strdt(replace(str(?death),"^(\\d+)-.*","$1"),xsd:integer) as ?deathYear) .
            }
          }
        }
      }
    }
  }
}

 

count ageAvg standDev
38542 66.876290799647138 17.6479
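The expression inside bif:sqrt is the usual computational form of the sample standard deviation: the sum of the squared ages, minus the square of the sum divided by the count, all divided by the count minus one. As a small cross-check of that arithmetic, the following Python sketch applies the same formula to a handful of made-up ages and compares it with the standard library. The function name sample_std_dev and the values in AGES are assumptions for illustration only, not data from the query.

```python
import math
import statistics


def sample_std_dev(ages):
    """Sample standard deviation computed the same way as in the SPARQL query."""
    count = len(ages)
    age_sum_square = sum(ages) ** 2                   # square of the sum (?ageSumSquare)
    age_square_sum = sum(age * age for age in ages)   # sum of the squares (?ageSquareSum)
    # Python's "/" is true division, so no cast to double is needed here.
    return math.sqrt((age_square_sum - age_sum_square / count) / (count - 1))


# Hypothetical ages, not taken from DBpedia.
AGES = [60, 62, 70, 75, 81]

STD_DEV = sample_std_dev(AGES)
print(STD_DEV)
print(statistics.stdev(AGES))  # reference implementation of the sample standard deviation
```

Both printed values should agree up to floating-point rounding. The division by count minus one (rather than count) is what makes this the sample standard deviation, which matches the (?count - 1) term in the query.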

These examples show that SPARQL is quite powerful and a lot more than “just” a query language for RDF data: basic statistical methods can be implemented directly at the level of the triple store, without the need to extract the data and import it into another tool.


2023 © Semantic Web Company