As a matter of fact still a lot of “semantic technologies” are around which do nothing else than pure statistical analysis of text. Sure, this is better than simple full text search but there are still quite a lot of opportunities to improve search, especially when it comes to more sophisticated applications like “similarity search”, the search for similar documents to enable cross-reading or recommendation systems.
Providers of first generation semantic technologies calculate rather basic “semantic networks” by co-occurency analysis which results sometimes in disappointing results. Bearing in mind that Google just bought a company (“Google buys Metaweb“) which has been working on one of the largest knowledge bases in the world, we could assume that some of the last miles towards a semantic search engine can be achieved by applying thesauri or other structured knowledge bases.
A demo application was recently developed by PoolParty team where one can find out how thesauri will improve search results on top of second generation semantic technologies. With PoolParty SKOS based controlled vocabularies can be managed and also can be enriched with linked data. PoolParty Tag & Content Recommender analyzes virtually any text or website to recommend corresponding tags, concepts from (in this case) STW (Standard Thesaurus für Wirtschaft), DBpedia and respective articles from Wikipedia.
STW which was developed by the German National Library of Economics (ZBW) provides vocabulary on any economic subject: about 6,000 standardized subject headings and about 18,000 entry terms to support individual keywords.
This background knowledge is used in this demo app to improve the search for similar documents dramatically:
Similarity between two documents can be calculated not only on a key-phrase basis but also on a rather conceptual basis. Even if two documents do not have one single word or phrase in common they can be identified as “similar documents”.
This can be achieved because thousands of important relations between economic subjects are represented in the domain specific thesaurus. Thus, in this special case best results are achieved with documents from economics (for instance from Econstor) but of course for other recommender systems thesauri from other domains can be used instead of STW.
Nevertheless, also this approach can be improved and this development is underway: SKOS thesauri enriched with Linked Data do an even better job. This kind of third generation semantic technologies are currently developed by LASSO project and LOD2 project, two innovative projects in the area of linked data and the semantic web.