Investigating Semantic and Machine Learning Approaches with the SWC Research Team
Research and innovation have always been core principles of the Semantic Web Company (SWC) culture, as befits a pioneer in the field of semantic AI. More than 15 years after SWC’s founding, these values live on.
Our Research Team at SWC has been busy this past year: several papers have been published or accepted for presentation at conferences, and further papers are planned for submission before the year closes.
On top of this work, our researchers have taken part in company webinars. Their recent webinar on Semantic Web enabled Machine Learning had some of the highest turnout and participant engagement of all the webinars we’ve held this year.
According to Director of Research Artem Revenko, “The overarching goal of our research strategy is to identify and develop novel semantic and machine learning approaches to tackle the most important use cases across different industries. Therefore, in our research we, on the one hand, systematically investigate cutting edge methods and tools such as Semantic AI systems or data management platforms, and on the other hand, develop benchmarks (WiC-TSV-de) and methods for automatic knowledge acquisition, structuring, linking and decision making.”
So far in 2022, the Research Team has contributed to the following papers.
A Systematic Review of Data Management Platforms
“Systematically reviews a set of well-established data management platforms and compares their functionality. We derived an initial criteria catalogue from existing research work and extended it based on the input gathered through several expert interviews. Finally, we applied this criteria catalogue to a set of data management platforms. The contribution of this work is (i) an up-to-date criteria catalogue to systematically assess the feature-richness of data management platforms, generalizable to related use-cases (e.g. data markets), and (ii) the systematic review of a selected set of data management platforms along these criteria. This work lays the foundation for future research in this area, being subject to periodic re-evaluation to also include developments and improvements of the platforms.”
Citation: Boch, Michael; Gindl, Stefan; Barnett, Alan; Margetis, George; Mireles, Victor; Adamakis, Emmanouil and Knoth, Petr (2022). A Systematic Review of Data Management Platforms. In: WorldCIST’22, 12-14 Apr 2022, Budva, Montenegro.
WiC-TSV-de: German Word-in-Context Target-Sense-Verification Dataset and Cross-Lingual Transfer Analysis
Presented at LREC 2022, this paper introduces WiC-TSV-de, “a multi-domain dataset for German Target Sense Verification,” created by Researcher Anna Breit and Director of Research Artem Revenko.
The paper examines this dataset:
“Target Sense Verification (TSV) describes the binary disambiguation task of deciding whether the intended sense of a target word in a context corresponds to a given target sense. In this paper, we introduce WiC-TSV-de, a multi-domain dataset for German Target Sense Verification. While the training and development sets consist of domain-independent instances only, the test set contains domain-bound subsets, originating from four different domains, being Gastronomy, Medicine, Hunting, and Zoology. The domain-bound subsets incorporate adversarial examples such as in-domain ambiguous target senses and context-mixing (i.e., using the target sense in an out-of-domain context) which contribute to the challenging nature of the presented dataset. WiC-TSV-de allows for the development of sense-inventory-independent disambiguation models that can generalise their knowledge for different domain settings. By combining it with the original English WiC-TSV benchmark, we performed monolingual and cross-lingual analysis, where the evaluated baseline models were not able to solve the dataset to a satisfying degree, leaving a big gap to human performance.”
The dataset is publicly available on GitHub, and a blog post about the TSV task is available on Medium.
Contributing authors: Anna Breit, Artem Revenko, Narayani Blaschke
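To make the binary nature of the TSV task concrete, here is a minimal Python sketch. It is not the WiC-TSV-de data format or any baseline from the paper: the field names, the English example sentences, and the word-overlap heuristic are all invented for illustration. It shows one way an instance could be represented and why a purely surface-level heuristic can stumble when a context mentions vocabulary from a different sense.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class TSVInstance:
    """Hypothetical representation of one Target Sense Verification example."""
    context: str           # sentence containing the target word
    target: str            # the word to disambiguate
    sense_definition: str  # description of the candidate target sense
    label: bool            # True if the context uses the target in that sense


def content_words(text: str, exclude: str) -> set[str]:
    """Lowercased words longer than three characters, minus the target word."""
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return {w for w in words if len(w) > 3 and w != exclude.lower()}


def overlap_baseline(inst: TSVInstance) -> bool:
    """Toy heuristic (an assumption, not a published baseline):
    answer True if context and sense definition share any content word."""
    ctx = content_words(inst.context, inst.target)
    sense = content_words(inst.sense_definition, inst.target)
    return bool(ctx & sense)


def accuracy(instances: Sequence[TSVInstance],
             predict: Callable[[TSVInstance], bool]) -> float:
    """Fraction of instances where the binary prediction matches the label."""
    return sum(predict(i) == i.label for i in instances) / len(instances)


ANIMAL = "a large cat that lives in the rainforest and hunts prey"
CAR = "a British brand of luxury cars"

# Invented examples; the real dataset is German and far larger.
EXAMPLES = [
    TSVInstance("The jaguar hunted its prey in the rainforest at night.",
                "jaguar", ANIMAL, True),
    TSVInstance("He parked the jaguar outside the restaurant.",
                "jaguar", ANIMAL, False),
    # The animal is meant, but the sentence borrows car vocabulary.
    TSVInstance("The jaguar at the zoo was named after a luxury car.",
                "jaguar", CAR, False),
]

if __name__ == "__main__":
    print(f"accuracy: {accuracy(EXAMPLES, overlap_baseline):.2f}")
```

On these three invented instances the heuristic gets the first two right and is misled by the third, which hints at why the adversarial subsets described in the abstract are hard for shallow models.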
Extraction and Semantic Representation of Domain-Specific Relations in Spanish Labour Law
Accepted for SEPLN 2022, this paper, written by Artem Revenko and Patricia Martín-Chozas, explores the following: “Despite the freedom of information and the development of various open data repositories, the access to legal information to general audience remains hindered due to the difficulty of understanding and interpreting it. In this paper we aim at employing modern language models to extract the most important information from legal documents and structure this information in a knowledge graph. This knowledge graph can later be used to retrieve information and answer legal questions. To evaluate the performance of different models we formalize the task as event extraction and manually annotate 133 instances. We evaluate two models: GRIT and Text2Event. The latter model achieves a better score of 0.8 F1 score for identifying legal classes and 0.5 F1 score for identifying roles in legal relations. We demonstrate how the produced legal knowledge graph could be exploited with 2 example use cases. Finally, we annotate the whole Workers’ Statute using the fine-tuned Text2Event model and publish the results in an open repository.”
Some data about this study can be found here.
Contributing authors: Artem Revenko, Patricia Martín-Chozas; Semantic Web Company, Vienna, Austria; Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain
A Lifecycle Framework for Semantic Web Machine Learning Systems
Accepted for the MLKgraphs section of DEXA 2022, Anna Breit’s paper focuses on the following:
“Semantic Web Machine Learning Systems (SWeMLS) characterise applications, which combine symbolic and subsymbolic components in innovative ways. Such hybrid systems are expected to benefit from both domains and reach new performance levels for complex tasks. While existing taxonomies in this field focus on building blocks and patterns for describing the interaction within the final systems, typical lifecycles describing the steps of the entire development process have not yet been introduced. Thus, we present our SWeMLS lifecycle framework, providing a unified view on SW, ML, and their interaction in a SWeMLS. We further apply the framework in a case study based on three systems, described in literature. This work should facilitate the understanding, planning, and communication of SWeML system designs and process views.”
This paper can be referenced here.
Contributing authors: Anna Breit, Laura Waltersdorfer, Fajar J. Ekaputra, Tomasz Miksa, and Marta Sabou; Semantic Web Company, Vienna, Austria; Technical University Vienna, Vienna, Austria; Vienna University of Economics and Business, Vienna, Austria
Learning ontologies from text by clustering lexical substitutes derived from language models
Accepted for SEMANTiCS 2022, this paper by Artem Revenko, Victor Mireles, Anna Breit, and colleagues presents the following:
“Many tools for knowledge management and the Semantic Web presuppose the existence of an arrangement of instances into classes, i.e. an ontology. Creating such an ontology, however, is a labor-intensive task. We present an unsupervised method to learn an ontology from text. We rely on pre-trained language models to generate lexical substitutes of given entities and then use matrix factorization to induce new classes and their entities. Our method differs from previous approaches in that (1) it captures the polysemy of entities; (2) it produces interpretable labels of the induced classes; (3) it does not require any particular structure of the text; (4) no re-training is required. We evaluate our method on German and English WikiNER corpora and demonstrate the improvements over state-of-the-art approaches.”
Contributing authors: Artem Revenko, Victor Mireles, Anna Breit, Peter Bourgonje, Julian Moreno-Schneider, Maria Khvalchik, Georg Rehm
SWC is proud to see how the Research Team’s efforts have culminated in many opportunities to showcase their work at international conferences and beyond.
To follow more of our Researchers’ work, please refer to their personal Twitter handles: