Dr. Tomaž Erjavec, Dr. Jerneja Žganec Gros, Tomaž Erjavec, ...
Ministry of Higher Education, Science and Technology (Ministrstvo za visoko šolstvo, znanost in tehnologijo)
Dan Tufiş, Svetla Koeva, Tomaž Erjavec, Maria Gavrilidou, Cvetana Krstev
The paper presents the results of a small, short-term SEE-ERA.net project whose purpose was to investigate the feasibility of machine translation (MT) research and development for several...
Tomaž Erjavec, Kristina Hmeljak Sangawa, Irena Srdanović Erjavec
The paper presents our experiences in producing a hypertext learners' Japanese-Slovene dictionary, jaSlo, which currently contains over 10,000 entries. The paper discusses the conversion of the...
Building the Slovene Wordnet: First Steps, First Problems (2008)
We report on the prototype Slovene wordnet which currently contains about 5,000 top-level concepts. The resource is based on the Serbian wordnet which has been automatically translated with the help...
The paper outlines the methodology used to present Slovenian literary texts and documents in critical e-editions. The encoding and linking of the several forms of the text in one single edition was...
St International Conference, Jezikovne Tehnologije, Tomaž Erjavec, Jerneja Žganec Gros
conference on language technologies.
Learning POS Tagging from a Tagged Macedonian Text Corpus (2008)
Viktor Vojnovski, Sašo Džeroski, Tomaž Erjavec
This paper presents several new linguistic resources for the Macedonian language, in particular a language corpus consisting of the digitized and annotated Orwell's “1984” in the Macedonian...
Report A web corpus and word sketches for Japanese (2008)
Irena Srdanović Erjavec, Tomaž Erjavec, Adam Kilgarriff
Of all the major world languages, Japanese is lagging behind in terms of publicly accessible and searchable corpora. In this paper we describe the development of JpWaC, a large corpus of 400 million...
Massive multi-lingual corpus compilation: Acquis Communautaire (2008)
Tomaž Erjavec, Camelia Ignat, Bruno Pouliquen, Ralf Steinberger
The paper discusses the compilation of massively multilingual corpora, the EU ACQUIS corpus, and the corpus annotation tool “totale”. The ACQUIS text collection has recently become available on...
Massive multi-lingual corpus compilation: Acquis Communautaire and totale (2008)
Tomaž Erjavec, Camelia Ignat, Bruno Pouliquen, Ralf Steinberger
The paper discusses the compilation of massively multilingual corpora, the EU ACQUIS corpus, and the corpus annotation tool “totale”. The ACQUIS text collection has recently become available on...
Building Slovene Wordnet (2008)
Tomaž Erjavec, Darja Fišer, Department of Knowledge Technologies
A WordNet is a lexical database in which nouns, verbs, adjectives and adverbs are organized in a conceptual hierarchy, linking semantically and lexically related concepts. Such semantic lexicons have...
The paper presents the methodology, technology and results of a collaborative Slovenian project aimed at e-publishing text-critical editions of literary heritage. The materials exhibit great...
Morphosyntactic Tagging of Slovene Legal Language. Informatica 30:483–488 (2006)
Part-of-speech tagging or, more accurately, morphosyntactic tagging, is a procedure that assigns to each word token appearing in a text its morphosyntactic description, e.g. “masculine singular...
The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages (2006)
Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, ...
We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional...
Towards a Slovene dependency treebank (2006)
Sašo Džeroski, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdenek Žabokrtsky, Andreja Žele
The paper presents the initial release of the Slovene Dependency Treebank, currently containing 2,000 sentences or 30,000 words. Our approach to annotation is based on the Prague Dependency Treebank,...
The English-Slovene ACQUIS corpus (2006)
The paper presents the SVEZ-IJS corpus, a large parallel annotated English-Slovene corpus containing translated legal texts of the European Union, the ACQUIS Communautaire. The corpus contains...
The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages (2006)
Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş
We are presenting a new and unique parallel corpus available in all 20 official European Union (EU) languages, with additional documents available for some EU candidate countries. The average size is...
Digitisation of Literary Heritage Using Open Standards (2005)
The paper presents the methodology, technology and results of a collaborative Slovenian project aimed at e-publishing text-critical editions of literary heritage. The materials exhibit...
A tool set for the quick and efficient exploration of large document collections (2005)
Camelia Ignat, Ralf Steinberger, Bruno Pouliquen, Tomaž Erjavec
We are presenting a set of multilingual text analysis tools that can help analysts in any field to explore large document collections quickly in order to determine whether the documents contain...
Making an XML-based Japanese-Slovene Learners' Dictionary (2004)
Tomaž Erjavec, Irena Srdanović, Kristina Hmeljak Sangawa, Anton Ml. Vahčič
In this paper we present a hypertext dictionary of Japanese lexical units for Slovene students of Japanese at the Faculty of Arts of Ljubljana University. The dictionary is planned as a long-term...
2003: The MULTEXT-East Morphosyntactic Specifications for Slavic Languages (2004)
Tomaž Erjavec, Kiril Simov, Cvetana Krstev, Marko Tadić, Vladimír Petkevič, Duško Vitas
Word-level morphosyntactic descriptions, such as “Ncmsn” designating a common masculine singular noun in the nominative, have been developed for all Slavic languages, yet there have been few...
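The positional reading of a description such as “Ncmsn” can be sketched as follows. This is a minimal illustration, not the MULTEXT-East specification itself: the function name `decode_noun_msd` and the table `NOUN_POSITIONS` are hypothetical, and the only codes confirmed by the abstract above are those in “Ncmsn” (noun, common, masculine, singular, nominative). The remaining codes are assumed here for illustration and may differ per language.

```python
# Partial, illustrative attribute table for noun descriptions.
# Only the codes used in "Ncmsn" come from the abstract; the other
# values are assumptions added to make the sketch less trivial.
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "a": "accusative"}),
]


def decode_noun_msd(msd: str) -> dict:
    """Expand a positional noun description into attribute-value pairs."""
    if not msd or msd[0] != "N":
        raise ValueError(f"not a noun description: {msd!r}")
    features = {"category": "noun"}
    # Each character after the category letter fills one attribute slot.
    for code, (attribute, values) in zip(msd[1:], NOUN_POSITIONS):
        if code == "-":  # slot left unspecified
            continue
        features[attribute] = values.get(code, f"unknown({code})")
    return features


if __name__ == "__main__":
    print(decode_noun_msd("Ncmsn"))
```

Each character occupies a fixed position, so the same compact string carries the part of speech and all of its attribute values.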
Migrating Language Resources from SGML to XML: the Text Encoding Initiative Recommendations (2002)
Syd Bauman, Tomaž Erjavec, Alejandro Bia, Christine Ruotolo, Lou Burnard, Susan Schreibman
The Text Encoding Initiative (TEI), established in 1987, has been the largest effort in the area of standardisation of computer encoding of language resources. TEI chose SGML (Standard Generalized...
Automatic Sense Tagging Using Parallel Corpora (2001)
Nancy Ide, Tomaž Erjavec, Dan Tufiş
This article reports the results of an analysis of translation equivalents in six languages from different language families, automatically extracted from an on-line 7-way parallel corpus of George...