- Jun 26, 2008
Combining Closed and Open Data Classification Mechanisms in an Extended Thesaurus
In the next session, Rolf Sint gave us insights into his approach to combining closed and open data classification mechanisms, which builds on the findings of his master’s thesis. Probably the most widely used retrieval method for digital content is full-text search; Google’s and Yahoo’s indexing methods, for instance, rely on it. For this method to work, the search terms must actually occur in the content, which leads to obvious problems with synonyms, ambiguous terms and the differing vocabularies of different languages. Its advantages are that it is easy to use and that no maintenance is required, since this responsibility rests with the content providers.
At the other end of the spectrum, among the open data classification mechanisms, we have social tagging. Tagging (in general) means that a user assigns labels to content items. The advantage here is that content is classified immediately; tagging is thus an easy way to provide metadata for content, in particular because the user does not have to think about (arbitrary, system-dictated) structures. However, this leads to problems if singulars and plurals are used side by side, if synonyms are used, if spelling mistakes occur, and so on. With tags, exactly the same spelling has to be used if items are to be assigned to the same group. But if tagging is done collectively (and that is what social tagging is about), the wisdom of crowds can improve the signal-to-noise ratio significantly; see the miracle of the tag cloud.
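The exact-spelling problem described above can be illustrated with a small sketch. The tag list and the `normalize_tag` helper are invented for illustration and are not taken from Rolf's thesis; the normalization is deliberately naive (trim, lowercase, strip a trailing "s"), so it merges case and plural variants but does nothing about spelling mistakes or synonyms.

```python
from collections import Counter


def normalize_tag(tag: str) -> str:
    """Naive, hypothetical normalization: trim, lowercase, drop a plural 's'."""
    t = tag.strip().lower()
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        t = t[:-1]
    return t


# Invented example tags: case variant, plural, trailing space, typo.
raw_tags = ["Jaguar", "jaguars", "jaguar ", "jagaur", "cats"]

# Grouping by exact spelling: every variant ends up in its own group.
exact_groups = Counter(raw_tags)

# Grouping after normalization: case and plural variants collapse,
# but the misspelling "jagaur" still forms a separate group.
normalized_groups = Counter(normalize_tag(t) for t in raw_tags)

print(len(exact_groups), "groups by exact spelling")
print(dict(normalized_groups))
```

The remaining stray group for the misspelled tag is the kind of noise that simple string cleanup cannot remove, which is where a controlled vocabulary becomes useful.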
What Rolf proposed in his thesis was to combine the two approaches. In his design, he used an extended thesaurus as an instrument to achieve vocabulary control. It is an “extended” thesaurus because it is not simply built around a taxonomy, but is expanded with tags that were assigned by users and integrated using a vocabulary management tool.
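One way to picture such an extended thesaurus is as a set of taxonomy concepts that each carry editor-maintained labels plus the user tags merged in later. The following is only a minimal sketch under that assumption; the class names, fields and concept ids are hypothetical and do not describe the data model or the vocabulary management tool used in the thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Concept:
    preferred_label: str
    broader: Optional[str] = None  # id of the parent concept in the taxonomy
    alt_labels: Set[str] = field(default_factory=set)      # editor-maintained synonyms
    community_tags: Set[str] = field(default_factory=set)  # user tags merged in later


class ExtendedThesaurus:
    """Hypothetical container: a taxonomy extended with community tags."""

    def __init__(self) -> None:
        self.concepts: Dict[str, Concept] = {}
        self._index: Dict[str, Set[str]] = {}  # lowercased label -> concept ids

    def _register(self, label: str, concept_id: str) -> None:
        self._index.setdefault(label.strip().lower(), set()).add(concept_id)

    def add_concept(self, concept_id, preferred_label, broader=None, alt_labels=()):
        self.concepts[concept_id] = Concept(preferred_label, broader, set(alt_labels))
        for label in (preferred_label, *alt_labels):
            self._register(label, concept_id)

    def integrate_tag(self, tag: str, concept_id: str) -> None:
        """Attach a user-assigned tag to an existing concept."""
        self.concepts[concept_id].community_tags.add(tag)
        self._register(tag, concept_id)

    def lookup(self, term: str) -> List[str]:
        """Return the ids of all concepts that carry this label or tag."""
        return sorted(self._index.get(term.strip().lower(), set()))


thesaurus = ExtendedThesaurus()
thesaurus.add_concept("animal", "animal")
thesaurus.add_concept("jaguar_animal", "jaguar", broader="animal",
                      alt_labels=["Panthera onca"])
thesaurus.add_concept("jaguar_car", "Jaguar", alt_labels=["Jaguar Cars"])
thesaurus.integrate_tag("jag", "jaguar_car")  # community term mapped by an editor

print(thesaurus.lookup("jaguar"))  # two concept ids, so the term is ambiguous
print(thesaurus.lookup("jag"))
```

Keeping community tags in a separate field from the editorial labels is a design choice of this sketch: it lets the vocabulary contain the community's terminology while still showing where each label came from.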
This extended thesaurus can be applied in multiple ways. During a tag event, for instance, the user can be assisted by questions like “Did you mean…” if a term is ambiguous.
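A “Did you mean…” prompt at tagging time could work roughly as sketched here. The `SENSES` vocabulary and the `assist_tagging` function are made up for this example, and the fuzzy matching of unknown tags via Python's standard `difflib` module is an added assumption, not something described in the talk.

```python
import difflib

# Hypothetical controlled vocabulary: term -> known senses.
SENSES = {
    "jaguar": ["jaguar (animal)", "jaguar (car)"],
    "python": ["python (snake)", "python (programming language)"],
    "thesaurus": ["thesaurus"],
}


def assist_tagging(entered: str) -> dict:
    """Decide whether the user should see a 'Did you mean...' prompt."""
    term = entered.strip().lower()
    senses = SENSES.get(term)
    if senses is None:
        # Unknown tag: offer similarly spelled vocabulary terms, if any.
        close = difflib.get_close_matches(term, list(SENSES), n=3, cutoff=0.6)
        return {"status": "unknown", "did_you_mean": close}
    if len(senses) > 1:
        # Ambiguous tag: let the user pick one of the known senses.
        return {"status": "ambiguous", "did_you_mean": senses}
    return {"status": "ok", "did_you_mean": []}


print(assist_tagging("Jaguar"))     # ambiguous: two senses offered
print(assist_tagging("jagaur"))     # unknown: a similar term is suggested
print(assist_tagging("thesaurus"))  # unambiguous: no prompt needed
```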
Search can be improved, too: if a user submits a search query, related terms can be suggested, drawing on the thesaurus. For example, the term ‘jaguar’ would call up related terms, allowing the user to refine the query and clarify that he (or she) is looking for the predatory animal (i.e. not the car).
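The ‘jaguar’ example can be sketched as a simple lookup followed by query expansion. The thesaurus entries, the sense labels and both function names below are assumptions made for illustration; a real system would draw these relations from the extended thesaurus itself.

```python
from typing import Dict, List

# Hypothetical thesaurus excerpt: query term -> senses -> related vocabulary.
THESAURUS: Dict[str, Dict[str, Dict[str, List[str]]]] = {
    "jaguar": {
        "jaguar (animal)": {"broader": ["big cat"],
                            "related": ["panthera onca", "predator"]},
        "jaguar (car)": {"broader": ["car"],
                         "related": ["automobile"]},
    },
}


def suggest_refinements(query: str) -> Dict[str, List[str]]:
    """For each known sense of the query, list terms the user could add."""
    senses = THESAURUS.get(query.strip().lower(), {})
    return {sense: rel["broader"] + rel["related"] for sense, rel in senses.items()}


def expand_query(query: str, sense: str) -> List[str]:
    """Build the list of search terms once the user has picked a sense."""
    term = query.strip().lower()
    return [term] + suggest_refinements(term).get(sense, [])


print(suggest_refinements("jaguar"))
print(expand_query("jaguar", "jaguar (animal)"))
```

Once the user has picked the animal sense, the expanded term list could be passed to an ordinary full-text search, so documents that mention only a related term are found as well.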
In the long term, using an extended thesaurus as a light-weight ontology can reduce the amount of work needed to maintain a vocabulary. What’s special in Rolf’s proposal is that the controlled vocabulary also contains the terminology of the community. The user is thus able to navigate within the communal information space and, as a result, problems with homonyms, synonyms and different languages would be reduced.
A paper in which Rolf and two of his colleagues explain this approach in more detail is currently being prepared for publication: Güntner, G., Sint, R., Westenthaler, R. (2008): “Ein Ansatz zur Unterstützung traditioneller Klassifikation durch Social Tagging”. Tagungsband des ExpertInnenworkshops “Social Tagging in der Wissensorganisation – Perspektiven und Potenziale”, 2008 (in press). Further details about the publication can be obtained from Rolf.