LDL4HELTA: Linked Data Lexicography for High-End Language Technology Applications
Language ambiguity remains one of the hardest technical challenges for NLP tools and applications. The research partner KDictionaries is an established provider of online and offline lexicographic data for over 40 languages. This project makes use of this large lexical data repository by transforming it into RDF and interlinking it with third-party data sources to improve language technologies.
“Lexicography is at a crossroads. The digital turn, followed by the semantic turn, did not just mean a change of format for lexicographers, linguists and terminologists, but implies a complete turn-over of the workflow.”
— German Research Centre for Artificial Intelligence (2014)
How Can Dictionaries Improve Text Analytics?
Dictionaries include the most complete linguistic information, ranging from syntactic relations and dialectal varieties to pronunciations and word forms. This data is the holy grail for a truly sophisticated semantic technology solution, provided that it is interlinked, processable as RDF, and directly embedded in knowledge graphs. It then becomes a robust foundation for word-sense disambiguation and text mining.
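To make the link between dictionary data and word-sense disambiguation concrete, here is a minimal sketch of a simplified Lesk-style approach: it picks the sense whose dictionary definition shares the most words with the surrounding text. This is an illustration only, not the method used in the project; the sense inventory, the stopword list, and the function names are hypothetical.

```python
import re

# Hypothetical sense inventory, as it might be read from a dictionary entry.
SENSES = {
    "bank#1": "an institution that accepts deposits and lends money",
    "bank#2": "the sloping land alongside a river or lake",
}

# Tiny illustrative stopword list (an assumption, not a linguistic resource).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "on", "at"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(context, senses):
    """Return the sense id whose definition overlaps most with the context.

    On a tie, the first sense in the inventory wins.
    """
    ctx = tokens(context)
    return max(senses, key=lambda sid: len(ctx & tokens(senses[sid])))


if __name__ == "__main__":
    print(disambiguate("She deposits money at the bank every Friday", SENSES))
    print(disambiguate("They fished from the river bank", SENSES))
```

Richer dictionary data (synonyms, example sentences, links to other resources) would give such an overlap measure more evidence to work with, which is the motivation for exposing it as interlinked RDF.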
What Is the Biggest Challenge When Transforming Lexicographic Data into RDF?
Reusing lexicographic data within a Linked Data application depends on a rock-solid data model. The W3C provides lexicon models, but they have to be extended and customized depending on the language. Terms come with a multitude of related information, such as synonyms, pronunciations and lexical senses, that has to be represented in the language graph. In this research project, available de facto W3C standards such as OntoLex were used and heavily extended to meet the requirements of lexical data. For the automatic transformation of dictionary data from XML to RDF, the semantic technologies component UnifiedViews also had to be configured accordingly.
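As a rough illustration of what such an XML-to-RDF mapping involves, the following sketch converts one dictionary entry into OntoLex-style triples and prints them as N-Triples. It is not the project's UnifiedViews pipeline: the XML layout, the `http://example.org/lexicon/` base IRI, and the function names are assumptions made for this example, and attaching `skos:definition` to the sense is one possible modelling choice. Only the OntoLex and SKOS vocabulary terms are taken from the published vocabularies.

```python
import xml.etree.ElementTree as ET

ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/lexicon/"  # hypothetical base IRI

# Hypothetical input; the real dictionary XML schema is not described here.
SAMPLE_XML = """
<entry id="bank_n" lang="en">
  <headword>bank</headword>
  <pronunciation>bæŋk</pronunciation>
  <sense id="1"><definition>a financial institution</definition></sense>
  <sense id="2"><definition>the land alongside a river</definition></sense>
</entry>
"""


def entry_to_triples(xml_text):
    """Map one <entry> element to (subject, predicate, object) triples.

    IRI objects are plain strings; literals are (text, language) tuples.
    """
    entry = ET.fromstring(xml_text)
    lang = entry.get("lang")
    entry_iri = BASE + entry.get("id")
    form_iri = entry_iri + "#form"
    triples = [
        (entry_iri, RDF_TYPE, ONTOLEX + "LexicalEntry"),
        (entry_iri, ONTOLEX + "canonicalForm", form_iri),
        (form_iri, RDF_TYPE, ONTOLEX + "Form"),
        (form_iri, ONTOLEX + "writtenRep", (entry.findtext("headword"), lang)),
    ]
    pron = entry.findtext("pronunciation")
    if pron:
        # Simplification: a real model might use a tag such as "en-fonipa".
        triples.append((form_iri, ONTOLEX + "phoneticRep", (pron, lang)))
    for sense in entry.findall("sense"):
        sense_iri = f"{entry_iri}#sense{sense.get('id')}"
        triples.append((entry_iri, ONTOLEX + "sense", sense_iri))
        triples.append((sense_iri, RDF_TYPE, ONTOLEX + "LexicalSense"))
        triples.append(
            (sense_iri, SKOS + "definition", (sense.findtext("definition"), lang))
        )
    return triples


def to_ntriples(triples):
    """Serialize the triples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        if isinstance(o, tuple):
            text, lang = o
            escaped = text.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"@{lang}'
        else:
            obj = f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_ntriples(entry_to_triples(SAMPLE_XML)))
```

Additional information such as synonyms or dialectal variants would need further properties, and this is where language-specific extensions of the base model come into play.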
Further Developments at Semantic Web Company
The lexicographic, multilingual, interlinked and enriched RDF data is incorporated into the PoolParty text mining component, where it improves analysis and disambiguation results through a highly enriched semantic layer. In the future, the multilingual lexicographic RDF data will also lead to a PoolParty translation service for knowledge models.