State of Language Technology

Hannes Vilhjalmsson
Administrator

Teacher

Associate Professor of Computer Science at Reykjavik University

Posts: 82

Post by Hannes Vilhjalmsson on Sept 9, 2009 18:43:07 GMT -5

Hi all,

In class this week we ask the question: what is the state of Language Technology (LT) for the language spoken in your country? Please summarize your findings here, and provide references or links to resources and tools that are available. Don't forget to mention the language/country!

Cheers,

- hannes / hrafn

nik
New Member

Posts: 8

Post by nik on Sept 10, 2009 6:59:01 GMT -5

Okay, I will start with what we have in Germany. Of course it's just a short overview of my findings and I am not that deep in the matter.

Indeed I think we have nearly everything for German that is available for English. In most cases the quality is quite good, but not as good as for English.
We have good spell-checkers (http://www.j3e.de/cgi-bin/spellchecker or www.canoo.net/services/GermanSpellingChecker/Controller to mention just a couple).
There are also German text-to-speech applications and screen readers, which are mainly German versions of English programs, i.e. just the model is changed. There is an English version of a German website listing many available TTS applications: ttssamples.syntheticspeech.de/
Speech-recognition systems are also widely used in Germany: for telephone dialogues, of course, but also for speech-controlled GPS navigation systems. We have companies offering speech control for industrial machinery (http://www.novotech-gmbh.de/sprachsteuerung.htm) and lots of stuff like this. But it's not something everyone uses every day, so it is still in the early stages of development.
We have grammar checkers, too, for example in office products (you know, when it underlines a sentence in green, not red). Again they work quite well, but to me it seems that the English grammar checkers perform even better. That might be because I don't make as many mistakes in German as in English.

Furthermore, question-answering systems are used in Germany. The distance university "Fernuniversität Hagen" has developed the "InSicht" question-answering system, which was rated the best German one at CLEF 2004 (Cross-Language Evaluation Forum). As far as I know it is not yet used commercially, but it could be used for intelligent search in newspaper archives, websites, internal company documentation, or wherever else there is a need for systems like this (http://pi7.fernuni-hagen.de/Frage-Antwort-System/Frage-Antwort-System.html)
There are also some search engines on the web where you can ask questions in German and get something like a summary of all answers, for example fa1.auskunft.org/cgi-bin/qa.cgi, but it works fairly poorly. I think it's something like Wolfram Alpha, where somebody has to put all the answers into a huge database.
We also have organisations in Germany dealing with natural language processing: the "Deutsche Gesellschaft für Sprachwissenschaft (DGfS)" and the "Gesellschaft für Linguistische Datenverarbeitung (GLDV)". I post the websites but neither of them is available in English.

I assume Oliver will provide some further information but from my point of view: That's it!

oliver
New Member

Posts: 4

Post by oliver on Sept 13, 2009 8:36:36 GMT -5

Sorry Nik,

but I think you covered them all. I could provide more links for each topic that you mentioned, but those would only be corporations or societies further down the list of Google search results.

Well, as we will speak about corpora tomorrow, here are some links to websites with German corpora:

A very large, growing, online German corpus archive (778 million words in August 2000) maintained by the "Institut für Deutsche Sprache" in Mannheim, Germany. A copyright-free portion of the archive (379 million words in August 2000) is freely searchable. Invited guests have access to the whole archive. Partially tagged.
www.ids-mannheim.de/kl/corpora.html
negr@ corpus - A Syntactically Annotated Corpus of German Newspaper Texts
www.coli.uni-saarland.de/projects/sfb378/negra-corpus/
Berlin-Brandenburgische Akademie der Wissenschaften - "Das digitale Wörterbuch der deutschen Sprache des 20. Jahrhunderts" (DWDS) provides a so-called core corpus, the first balanced corpus for the German language of the 20th century. Other balanced corpora are available too.
The "Institut für Kommunikationswissenschaften" at the "Rheinische Friedrich Wilhelms Universität Bonn" provides four different corpora.
www.ikp.uni-bonn.de/forschung/computerlinguistik/computerlinguistik/
The TIGER Treebank (Version 2.1) consists of approx. 900,000 tokens (50,000 sentences) of German newspaper text, taken from the Frankfurter Rundschau. The corpus was semi-automatically POS-tagged and annotated with syntactic structure. Moreover, it contains morphological and lemma information for terminal nodes.
www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/

tihomir
New Member

Posts: 4

Post by tihomir on Sept 13, 2009 17:16:57 GMT -5

The projects that I know of for Bulgarian are:

SpeechLab 2.0 - a text-to-speech synthesis system
ItaEst - spell checker and an application that breaks words into syllables (corrects texts in word processing applications when a word has to be divided between two lines)
Intex - a parser.
GrammLab - a tagger.
www.bacl.org/products.html

Cyrilla - a spell-checker and grammar checker. (http://www.cyrilla.bg/ - website is only in Bulgarian)
BulTreeBank - www.bultreebank.org/ (some other projects on this website)
There is a text corpus of around one million words and a speech corpus. A frequency dictionary also exists.
There are a number of applications for transliteration between Cyrillic and Latin script (e.g., bg.translit.cc/)
BulNet - a wordnet as part of the BalkaNet project: dcl.bas.bg/BulNet/wordnet_en.html
OCoRrect - an application for correction of texts after optical recognition: lml.bas.bg/~stoyan/ocorrect/index.html
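
The Cyrillic-to-Latin transliteration mentioned in the list above can be sketched in a few lines. This is only a minimal illustration: the mapping table follows one common romanization convention and is an assumption, not necessarily what bg.translit.cc or any other listed tool uses, and the function name `transliterate` is made up for this example.

```python
# Hypothetical sketch of Bulgarian Cyrillic -> Latin transliteration.
# The table is an assumed convention, not the mapping of any listed tool.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "ts", "ч": "ch",
    "ш": "sh", "щ": "sht", "ъ": "a", "ь": "y", "ю": "yu", "я": "ya",
}


def transliterate(text: str) -> str:
    """Replace each Bulgarian Cyrillic letter with its Latin equivalent."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            # Not a letter in the table (digits, spaces, Latin text): keep it.
            out.append(ch)
        elif ch.isupper():
            # Preserve capitalization: "Щ" becomes "Sht", not "sht".
            out.append(lat.capitalize())
        else:
            out.append(lat)
    return "".join(out)


print(transliterate("България"))
```

A character-by-character table like this ignores context-dependent rules (for example special treatment of word endings), which real transliteration tools may apply.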

lele120
New Member

Posts: 3

Post by lele120 on Sept 13, 2009 18:27:28 GMT -5

The projects that I have found are:
tcc.itc.it/ - The Cognitive and Communication Technologies (TCC) division at ITC-Irst is a major European research group in areas such as Natural Language Processing, Human-Computer Interaction and Dialogue Systems, Multi-modality and Natural Language Generation, Production and Maintenance of Linguistic Resources, and Linguistic Theory.
www.spellcheckanywhere.com/home/langauge/italian.asp - Italian spell check. Italian (italiano, or lingua italiana) is a Romance language spoken by about 80 million people, primarily in Italy. Standard Italian is based on the Tuscan dialect and is somewhat intermediate between the languages of Southern Italy and the Gallo-Romance languages of the North.

thors
New Member

CS Dweeb

Posts: 23

Post by thors on Sept 13, 2009 19:20:25 GMT -5

This is for Icelandic.

First, I found a paper from 2008 attempting to describe the state of affairs at that date.

According to the paper „Icelandic Language Technology Ten Years Later“ by E. Rögnvaldsson ( www3.hi.is/~eirikur/Ten_Years_Later.pdf ), the current status of Icelandic LT is mostly „should be“, that is, a complete list of incomplete wishes.

According to the list at Tungutækni ( tungutaekni.is/products/hugbun.html ), there are however a few products:

Íslensk orðabók - tölvuútgáfa: Icelandic dictionary, commercial product.
Ordabok.is: Web-based is-en/en-is dictionary, commercial product.
Púki 2003: Spelling and grammar tool for the Icelandic language, for use under Microsoft Windows only. Commercial product.
Ragga: Icelandic speech synthesizer. Samples are from 2006, and its webpages seem to have gone stale. Status unknown.
Snorri: Icelandic speech synthesizer. Acquired from Infovox by Acapela. See below...
Stakorðagreinir fyrir íslensku: ?
Stóra tölvuorðabókin: Icelandic dictionary, commercial product.
Vefbækur Eddu: Icelandic and bilingual Icelandic dictionaries. Commercial products.
Hjal: A speech recognition system from 2002/3 by University of Iceland ( www.tungutaekni.is/new/hjal.PDF ). Status unknown.

Those projects that might be considered research projects all seem to be in limbo.

Then there are a few that weren't listed...
The Finnish Laboratory of Acoustics and Audio Signal Processing has a sample waveform ( www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/appa.html sample 52 ) with Icelandic text being read. Very primitive and hardly understandable.

eSpeak ( espeak.sourceforge.net ) is a speech synthesizer that claims to be capable of Icelandic. It's got potential, but it still has a long way to go ( e.g. "fær" becomes "faaar" )

The Acapela speech synthesizer can be downloaded from: www.acapela-group.com/download-infovox-desktop-text-to-speech-demo.html

According to the Master's thesis of Arnar Thor Jensson, B. Kristinsson (2004, Towards speech synthesis for Icelandic) is working on a better speech synthesizer than „Snorri“.

And for the record - Arnar's MSc thesis makes a wonderful read that probably fits this course as a glove fits the hand it's made for.

( www.furui.cs.titech.ac.jp/~arnar/publications/masterThesis.pdf )

De gustibus non disputandum est.

jeppewelling
New Member

Posts: 2

Post by jeppewelling on Sept 14, 2009 6:12:05 GMT -5

Danish

University of Southern Denmark:
visl.sdu.dk/
Here you can find tools for parsing:
visl.sdu.dk/visl/da/parsing/automatic/
I especially like the visual tree view that they provide:
visl.sdu.dk/visl/da/parsing/automatic/trees.php
They also provide various translators from Danish to other languages.

The Society for Danish Language and Literature:
ordnet.dk/korpusdk/
They provide a corpus with 56 million words.

Also I found this website with links to various projects:
www.cst.dk/dandokcenter/resultat/ressourcer/index-material-ny.html

ivar
New Member

Posts: 4

Post by ivar on Sept 14, 2009 12:45:58 GMT -5

CLARA: They use a web crawler to collect everything that is written on blogs and other online pages, then parse it so they can show you interesting results about what people are writing online about your company.
We went to visit them last Friday. Very interesting what they are working on.
www.clara.is
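
To make the crawl-then-analyse idea above a bit more concrete, here is a toy sketch. It is not CLARA's actual method, which I don't know; it assumes the pages have already been fetched as plain text, and the function name `mention_report` and the sample data are invented for illustration.

```python
import re
from collections import Counter


def mention_report(pages, company):
    """Count mentions of `company` and the words that appear alongside it.

    `pages` is a list of already-downloaded page texts (an assumption;
    a real system would get these from a crawler).
    """
    pattern = re.compile(re.escape(company), re.IGNORECASE)
    mentions = 0
    context = Counter()
    for text in pages:
        hits = pattern.findall(text)
        if not hits:
            continue  # page does not talk about the company
        mentions += len(hits)
        words = re.findall(r"\w+", text.lower())
        # Collect every other word on the page as rough "context".
        context.update(w for w in words if w != company.lower())
    return mentions, context


pages = ["Acme is great", "I dislike acme support", "unrelated post"]
mentions, context = mention_report(pages, "Acme")
print(mentions, context.most_common(3))
```

A real system would need proper parsing, language-specific tokenization and some notion of sentiment; the sketch only shows the overall shape of the pipeline.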

best regards,
Ívar Björn Hilmarsson

grimur
New Member

Posts: 2

Post by grimur on Sept 21, 2009 16:02:02 GMT -5

The forum system ate my previous post, so here is a second attempt.

Aside from the tools mentioned on tungutaekni.is there is an online (free) version of the Púki spellchecker (http://vefur.puki.is/vefpuki/).

There does not seem to be a BLARK for Icelandic. IceNLP aims to be a part of one. IceNLP consists of, at least, a morphological analyzer, a rule-based part-of-speech tagger, a trigram tagger and a shallow parser. Information on that can be found at: nlp.ru.is/projects.htm. Note that the information on the current status of the various ongoing research projects is at best limited.
A 56K list of phonetically transcribed Icelandic words does exist (http://www.tungutaekni.is/materials/001.html).
Work on a 25 million word tagged corpus was supposed to be completed in 2008. It does not appear to be completed.
There is a good list of recent or current research on www.tungutaekni.is/researchsystems/rannsoknir.html.
The main web page for NLP for Icelandic seems to be www.tungutaekni.is. If that is correct, not much seems to be going on: the last news item is from January and the latest papers they link to are from 2007.

Alberto
Guest

Post by Alberto on Sept 22, 2009 9:02:59 GMT -5

Hi all,

Spanish,

I know a webpage about the Spanish Language Tools (http://www.datsi.fi.upm.es/~coes/coes.html): The COES Spanish Language Tools are a research field of the Departamento de Arquitectura y Tecnología de Sistemas Informáticos (DATSI) of the Universidad Politécnica de Madrid (UPM) and the Departamento de Informática of the Universidad Carlos III de Madrid (my university).

The main task of this research is to develop an extensive set of Spanish grammatical rules and to apply them to test the correctness of documents written in Spanish. To enhance distribution, COES is integrated with the ispell tool. COES has been distributed for free since the end of 1994.

Currently this package is distributed under the GNU license or any other by specific agreement with the authors.

Other projects from the University of Alicante (http://gplsi.dlsi.ua.es/gplsi09/doku.php?id=proyectos) are:

R2D2: Answers in digital documents. Construction of hybrid analyzers of natural languages.

3LB: Construction of databases of syntactic-semantic trees.

TUSIR: Development of a text compression system for information retrieval.

TEXT-MESS: Intelligent, Interactive and Multilingual Text Mining based on Human Language Technologies

I'm going to look for more information about this.

bye,
