Post by Hannes Vilhjalmsson on Sept 9, 2009 18:43:07 GMT -5
In class this week we ask the question: what is the state of Language Technology (LT) for the language spoken in your country? Please summarize your findings here, and provide references or links to resources and tools that are available. Don't forget to mention the language/country!
Okay, I will start with what we have in Germany. Of course this is just a short overview of my findings, and I have not gone very deep into the matter.
I think nearly everything that is available for English is also available for German. In most cases the quality is quite good, though not as good as for English.
We have good spell-checkers (http://www.j3e.de/cgi-bin/spellchecker or www.canoo.net/services/GermanSpellingChecker/Controller, to mention just a couple). There are also German text-to-speech applications and screen readers, which are mainly German versions of English programs, i.e. only the model is changed. There is an English version of a German website listing many available TTS applications: ttssamples.syntheticspeech.de/
Speech recognition systems are also widely used in Germany: for telephone dialogues, of course, but also for speech-controlled GPS navigation systems. We have companies offering speech control for industrial machinery (http://www.novotech-gmbh.de/sprachsteuerung.htm) and lots of things like this. But it is not something everyone uses every day, so it is still in the early stages of development.
We have grammar checkers, too, for example in office products (you know, when it underlines a sentence in green rather than red). Again they work quite well, but to me the English grammar checkers seem to perform even better. That might be because I don't make as many mistakes in German as in English.
Furthermore, question-answering systems are used in Germany. The distance university "Fernuniversität Hagen" developed the "InSicht" question-answering system, which was rated the best German one at CLEF 2004 (Cross-Language Evaluation Forum).
As far as I know it is not used commercially yet, but it could be used for intelligent search in newspaper archives, websites, internal company documentation, or wherever else there is a need for such a system (http://pi7.fernuni-hagen.de/Frage-Antwort-System/Frage-Antwort-System.html).
There are also some search engines on the web where you can ask questions in German and get something like a summary of all answers, for example fa1.auskunft.org/cgi-bin/qa.cgi, but it works fairly poorly. I think it is something like Wolfram Alpha, where somebody has to put all the answers into a huge database.
We also have organisations in Germany dealing with natural language processing: the "Deutsche Gesellschaft für Sprachwissenschaft (DGfS)" and the "Gesellschaft für linguistische Datenverarbeitung (GLDV)". I post the websites but neither of them is available in English.
I assume Oliver will provide some further information but from my point of view: That's it!
but I think you covered them all. I could provide more links for each topic that you mentioned, but they would only be companies or societies from further down the Google search results.
Well, as we will talk about corpora tomorrow, here are some links to websites with German corpora:
A very large, growing, online German corpus archive (778 million words in August 2000) maintained by the "Institut für Deutsche Sprache" in Mannheim, Germany. A copyright-free portion of the archive (379 million words in August 2000) is freely searchable. Invited guests have access to the whole archive. Partially tagged. www.ids-mannheim.de/kl/corpora.html
Berlin-Brandenburgische Akademie der Wissenschaften - "Das digitale Wörterbuch der deutschen Sprache des 20. Jahrhunderts" (DWDS) provides a so-called core corpus, the first balanced corpus for the German language of the 20th century. Other balanced corpora are available too.
The TIGER Treebank (Version 2.1) consists of approx. 900,000 tokens (50,000 sentences) of German newspaper text, taken from the Frankfurter Rundschau. The corpus was semi-automatically POS-tagged and annotated with syntactic structure. Moreover, it contains morphological and lemma information for terminal nodes. www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/
SpeechLab 2.0 - a text-to-speech synthesis system.
ItaEst - a spell checker and an application that breaks words into syllables (corrects text in word processing applications when a word has to be divided between two lines).
Intex - a parser.
GrammLab - a tagger.
www.bacl.org/products.html
Cyrilla - a spell-checker and grammar checker (http://www.cyrilla.bg/ - the website is only in Bulgarian).
BulTreeBank - www.bultreebank.org/ (there are some other projects on this website). There is a text corpus of around one million words and a speech corpus. A frequency dictionary also exists.
There are a number of applications for transliteration between Cyrillic and Latin script (e.g., bg.translit.cc/).
BulNet - a wordnet, part of the BalkaNet project: dcl.bas.bg/BulNet/wordnet_en.html
OCoRrect - an application for correcting texts after optical character recognition: lml.bas.bg/~stoyan/ocorrect/index.html
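To give an idea of what the transliteration tools mentioned above do, here is a minimal sketch in Python. It is not how bg.translit.cc or any of the listed applications work internally. The letter table is my assumption of the official "Streamlined System" for Bulgarian, and the function name `transliterate` is made up for this example. It goes letter by letter and ignores special cases (for example a word-final "-ия", which is officially written "-ia").

```python
# Assumed Bulgarian Cyrillic -> Latin table (Streamlined System, as far as I know).
BG_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "ts", "ч": "ch",
    "ш": "sh", "щ": "sht", "ъ": "a", "ь": "y", "ю": "yu", "я": "ya",
}


def transliterate(text: str) -> str:
    """Hypothetical helper: replace each Bulgarian letter by its Latin form."""
    out = []
    for ch in text:
        latin = BG_TO_LATIN.get(ch.lower())
        if latin is None:
            # Not a Bulgarian letter (digit, space, punctuation): keep as-is.
            out.append(ch)
        elif ch.isupper():
            # Uppercase input: capitalize only the first Latin letter ("Щ" -> "Sht").
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("Пловдив"))
```

Going in the other direction (Latin to Cyrillic) is harder, because digraphs such as "sh" or "ts" are ambiguous, so a simple table like this one is not enough.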
The projects that I have found are:
tcc.itc.it/ The Cognitive and Communication Technologies (TCC) division at ITC-Irst is a major European research group in areas such as Natural Language Processing, Human-Computer Interaction and Dialogue Systems, Multi-modality and Natural Language Generation, Production and Maintenance of Linguistic Resources, and Linguistic Theory.
www.spellcheckanywhere.com/home/langauge/italian.asp An Italian spell checker. Italian (italiano, or lingua italiana) is a Romance language spoken by about 80 million people, primarily in Italy. Standard Italian is based on the Tuscan dialect and is somewhat intermediate between the languages of Southern Italy and the Gallo-Romance languages of the North.
First, I found a paper from 2008 attempting to describe the state of affairs at that date.
According to the paper „Icelandic Language Technology Ten Years Later“ by E. Rögnvaldsson ( www3.hi.is/~eirikur/Ten_Years_Later.pdf ), the current status of Icelandic LT is mostly „should be“, that is, a complete list of incomplete wishes.
Íslensk orðabók - tölvuútgáfa: Icelandic dictionary, commercial product.
Ordabok.is: Web-based is-en/en-is dictionary, commercial product.
Púki 2003: Spelling and grammar tool for the Icelandic language, for use under Microsoft Windows only. Commercial product.
Ragga: Icelandic speech synthesizer. Samples are from 2006, and its webpages seem to have gone stale. Status unknown.
Snorri: Icelandic speech synthesizer. Acquired from Infovox by Acapela. See below...
Stakorðagreinir fyrir íslensku: ?
Stóra tölvuorðabókin: Icelandic dictionary, commercial product.
Vefbækur Eddu: Icelandic and bilingual Icelandic dictionaries. Commercial products.
Hjal: A speech recognition system from 2002/3 by the University of Iceland ( www.tungutaekni.is/new/hjal.PDF ). Status unknown.
The projects that might be considered research projects all seem to be in limbo.
CLARA: They use a web crawler to collect everything that is written on blogs and other online pages, then parse it and can show you interesting results about what people are writing online about your company. We went to visit them last Friday. What they are working on is very interesting. www.clara.is
The forums system ate my previous post, so here is a second attempt.
Aside from the tools mentioned on tungutaekni.is there is an online (free) version of the Púki spellchecker (http://vefur.puki.is/vefpuki/).
There does not seem to be a BLARK (Basic Language Resource Kit) for Icelandic. IceNLP aims to be a part of that. IceNLP consists of, at least, a morphological analyzer, a rule-based part-of-speech tagger, a trigram tagger and a shallow parser. Information on that can be found at: nlp.ru.is/projects.htm. Note that the information on the current status of the various ongoing research projects is limited at best. A 56K list of phonetically transcribed Icelandic words does exist (http://www.tungutaekni.is/materials/001.html). Work on a 25 million word tagged corpus was supposed to be completed in 2008. It does not appear to be completed. There is a good list of recent or current research on www.tungutaekni.is/researchsystems/rannsoknir.html. The main web page for NLP for Icelandic seems to be www.tungutaekni.is. If that is correct, not much seems to be going on: the last news item is from January and the latest papers they link to are from 2007.
I know a webpage about the Spanish Language Tools (http://www.datsi.fi.upm.es/~coes/coes.html): The COES Spanish Language Tools are a research project of the Departamento de Arquitectura y Tecnología de Sistemas Informáticos (DATSI) of the Universidad Politécnica de Madrid (UPM) and the Departamento de Informática of the Universidad Carlos III de Madrid (my university).
The main task of this research is to develop an extensive set of Spanish grammatical rules and to apply them to check the correctness of documents written in Spanish. To make distribution easier, COES is integrated with the ispell tool. COES has been distributed for free since the end of 1994.
Currently this package is distributed under the GNU license, or under any other license by specific agreement with the authors.
Other projects from the University of Alicante (http://gplsi.dlsi.ua.es/gplsi09/doku.php?id=proyectos) are:
R2D2: Answers in digital documents. Construction of hybrid analyzers for natural languages.
3LB: Construction of databases of syntactic-semantic trees.
TUSIR: Development of a text compression system for information retrieval.
TEXT-MESS: Intelligent, Interactive and Multilingual Text Mining based on Human Language Technologies
I'm going to look for more information about this.