LDL4HELTA: Using Lexicographical Data for Semantic Text Mining


LDL4HELTA: Linked Data Lexicography for High-End Language Technology Applications

Language ambiguity remains one of the hardest technical challenges for NLP tools and applications. The research partner KDictionaries is an established provider of online and offline lexicographic data for more than 40 languages. This project makes use of this large lexical data repository by transforming it into RDF and interlinking it with third-party data sources to improve language technologies.


“Lexicography is at a crossroads. The digital turn, followed by the semantic turn, did not just mean a change of format for lexicographers, linguists and terminologists, but implies a complete turn-over of the workflow.”
— German Research Centre for Artificial Intelligence (2014)

How Can Dictionaries Improve Text Analytics?

Dictionaries contain the most complete linguistic information, ranging from syntactic relations and dialectal varieties to pronunciations and word forms. This data is the holy grail for a truly sophisticated semantic technology solution, provided that it is interlinked, processable as RDF and directly embedded in knowledge graphs. It then becomes a robust foundation for word sense disambiguation and text mining.
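
To illustrate how dictionary definitions can support word sense disambiguation, here is a minimal sketch of a simplified Lesk-style approach: the sense whose definition shares the most words with the surrounding text is selected. This is not the method used in the project or in PoolParty. The sense inventory, the sense identifiers and the stopword list are all hypothetical and far smaller than real lexicographic data.

```python
import re

# Hypothetical mini sense inventory; real dictionary data is far richer.
SENSES = {
    "bank": {
        "bank_finance": "an institution that accepts deposits and lends money",
        "bank_river": "the sloping land alongside a river or lake",
    }
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "to", "in", "on", "by"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def disambiguate(word, context):
    """Return the sense id whose definition overlaps most with the context."""
    context_tokens = tokenize(context) - {word}
    best_sense, best_overlap = None, -1
    for sense_id, definition in SENSES[word].items():
        overlap = len(tokenize(definition) & context_tokens)
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


print(disambiguate("bank", "She sat on the bank of the river and watched the water"))
print(disambiguate("bank", "The bank lends money to small businesses"))
```

With richer data, such as synonyms and links to other data sources, each sense has more words to match against, which is one reason interlinked lexical data can help disambiguation.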

What Is the Biggest Challenge When Transforming Lexicographic Data into RDF?

Reusing lexicographic data within a Linked Data application depends on a rock-solid data model. The W3C provides lexicon models, but they have to be extended and customized depending on the language. Terms come with a multitude of related information, such as synonyms, pronunciation and lexical senses, that has to be represented in the language graph. In this research project, available de facto W3C standards such as OntoLex were used and heavily extended to meet the requirements of lexical data. For the automatic transformation of dictionary data from XML to RDF, the semantic technologies component UnifiedViews also had to be configured accordingly.
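
As a rough illustration of the XML-to-RDF step, the following sketch converts one dictionary entry into OntoLex triples using only the Python standard library. It does not reproduce the project's UnifiedViews pipeline or its extended data model. The XML layout, the `http://example.org/lexicon/` base IRI and the helper names are assumptions made for this example, and attaching `skos:definition` directly to a sense is a modeling simplification.

```python
import xml.etree.ElementTree as ET

ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/lexicon/"  # hypothetical base IRI

# Hypothetical XML layout; the real dictionary schema is not shown here.
SAMPLE_XML = """
<entry id="bank-n" lang="en">
  <headword>bank</headword>
  <pronunciation>bæŋk</pronunciation>
  <sense id="1"><definition>a financial institution</definition></sense>
  <sense id="2"><definition>the land alongside a river</definition></sense>
</entry>
"""


def iri(value):
    """Wrap an IRI in angle brackets for N-Triples output."""
    return f"<{value}>"


def literal(text, lang):
    """Build a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_to_triples(xml_text):
    """Map one <entry> element to a list of (subject, predicate, object)."""
    entry = ET.fromstring(xml_text.strip())
    lang = entry.get("lang")
    entry_iri = BASE + entry.get("id")
    form_iri = entry_iri + "#form"
    triples = [
        (iri(entry_iri), iri(RDF_TYPE), iri(ONTOLEX + "LexicalEntry")),
        (iri(entry_iri), iri(ONTOLEX + "canonicalForm"), iri(form_iri)),
        (iri(form_iri), iri(RDF_TYPE), iri(ONTOLEX + "Form")),
        (iri(form_iri), iri(ONTOLEX + "writtenRep"),
         literal(entry.findtext("headword"), lang)),
    ]
    pronunciation = entry.findtext("pronunciation")
    if pronunciation:
        # IPA transcriptions carry the "fonipa" variant subtag.
        triples.append((iri(form_iri), iri(ONTOLEX + "phoneticRep"),
                        literal(pronunciation, lang + "-fonipa")))
    for sense in entry.findall("sense"):
        sense_iri = f"{entry_iri}#sense{sense.get('id')}"
        triples.append((iri(entry_iri), iri(ONTOLEX + "sense"), iri(sense_iri)))
        triples.append((iri(sense_iri), iri(RDF_TYPE), iri(ONTOLEX + "LexicalSense")))
        triples.append((iri(sense_iri), iri(SKOS + "definition"),
                        literal(sense.findtext("definition"), lang)))
    return triples


def to_ntriples(triples):
    """Serialize the triples as one N-Triples statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


print(to_ntriples(entry_to_triples(SAMPLE_XML)))
```

Synonyms, dialectal varieties and other language-specific details would need additional properties, which is where extending the base model becomes necessary.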

Further Developments at Semantic Web Company

The multilingual, interlinked and enriched lexicographic RDF data is incorporated into the PoolParty text mining component, where it improves analytical and disambiguation results through a highly enriched semantic layer. In the future, this multilingual lexicographic RDF data will also result in a PoolParty translation service for knowledge models.

Project title

LDL4HELTA

Project website

https://ldl4.com/

Duration

2 years

Methods Applied

  • Data Processing
  • Word Sense Disambiguation
  • Semantic Enrichment

Our research partners

Project sponsors

LDL4HELTA has obtained the approval of the National Funding Agency FFG in Austria and the Chief Scientist in Israel and was launched as part of the Austria-Israel Bilateral Agreement within the EUREKA framework.
