|
Post by Hrafn Loftsson on Nov 28, 2009 16:02:53 GMT -5
Hello MSc students in Darmstadt. I just uploaded (as of Saturday, November 28th, at 22:00) a newly revised description of your tagging project. If you downloaded the description earlier, you need to download it again. Moreover, I uploaded a new version of IceNLP to www.ru.is/faculty/hrafn/Software/IceNLP-1.3.zip - make sure you have that version. Regards, Hrafn.
|
|
ali
New Member
Posts: 3
|
Post by ali on Dec 4, 2009 16:27:56 GMT -5
Hello all,
I'm encountering a problem while doing the fourth part of the project.
While tagging the test corpus file I get an error message saying that the number of states found is larger than the maximum number of states:
./tritagger.sh -p paramDefault.txt -cp
************************************************
* TriTagger - A HMM tagger (bi- or trigrams) *
* Version 1.1 *
* Copyright (C) 2005-2009, Hrafn Loftsson *
************************************************
Input: ./test_hmm.txt, one token per line
Output: ./test.tri.out, one token per line
Sentences start with an upper case letter
Started at: 12/4/09 10:05 PM
Loading model ../../ngrams/models/corpus ...
Using trigrams
Found 65357 states but maximum is: 65356
The output file remains empty.
If I reduce the number of tokens in test.corpus (e.g. from 47373 to 500), the tagger works properly.
Any idea what I'm doing wrong?
Ali
|
|
|
Post by Sren on Dec 5, 2009 6:42:02 GMT -5
I'm still fighting with a Unicode problem. The file "tiger_release_aug07.export" is encoded in latin1. Because of this I have problems with umlauts when using cat and tail. If I change the encoding of the file to UTF-8 by hand (with vi), everything works fine. My locale setting seems to be correct (LANG=en_US.UTF-8).
|
|
|
Post by Hrafn Loftsson on Dec 7, 2009 3:42:53 GMT -5
Is it possible that your training file is not UTF-8 encoded? Or, does your test_hmm.txt file actually contain both the tokens and the tags (it should only contain the tokens)?
|
|
|
Post by Hrafn Loftsson on Dec 7, 2009 3:52:26 GMT -5
Yes, it is true that the "tiger_release_aug07.export" file is latin1 encoded. I use Linux commands like tail, grep, and awk to pre-process the file (but without explicitly searching for umlauts), i.e. to get the relevant data. No problems there and my default encoding is UTF-8.
After pre-processing, I convert the resulting file to UTF-8, so my training and test corpora are UTF-8 encoded. You could just as well convert at the very start, i.e. change the encoding of "tiger_release_aug07.export" to UTF-8 before pre-processing.
Note that you can use the Linux iconv command to change encoding, e.g.:
iconv -f iso-8859-1 -t utf-8 <infile >outfile
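For anyone who prefers to do the conversion inside a script, the same re-encoding can be sketched in Python. This is only an illustration of what the iconv call above does; the function name latin1_to_utf8 and the file names in the usage comment are made up for this example.

```python
def latin1_to_utf8(src: str, dst: str) -> None:
    """Re-encode a Latin-1 (ISO-8859-1) text file as UTF-8.

    Hypothetical helper, roughly equivalent to:
        iconv -f iso-8859-1 -t utf-8 <src >dst
    """
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding="iso-8859-1", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)

# Example call (file names are placeholders):
# latin1_to_utf8("tiger_release_aug07.export", "tiger_utf8.export")
```

Every byte is a valid Latin-1 character, so this never fails on decoding; the flip side is that it will silently produce garbage if the input was already UTF-8, so it should be run only once.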
|
|
|
Post by ali on Dec 7, 2009 5:43:11 GMT -5
Neither of them. My training file is UTF-8 encoded and test_hmm.txt contains only tokens. The paramDefault file is attached to this reply. It actually looks fine to me, but maybe I'm overlooking a mistake...
|
|
|
Post by ali on Dec 7, 2009 10:42:08 GMT -5
Meanwhile, we have found out what caused the problem: we had mistakenly removed the empty lines that separate the sentences in the input file (test_hmm.txt). As a result, the tagger was not able to recognise the beginning of new sentences and aborted tagging after ~500 tokens.
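A quick way to catch this kind of mistake before running the tagger is to look at the sentence lengths in the input file. The sketch below is not part of IceNLP; sentence_lengths is a hypothetical helper that assumes the one-token-per-line format with an empty line between sentences.

```python
def sentence_lengths(lines):
    """Return the number of tokens in each sentence.

    Assumes one token per line and an empty line between sentences.
    """
    lengths, current = [], 0
    for line in lines:
        if line.strip():          # non-empty line: one more token
            current += 1
        elif current:             # empty line closes the current sentence
            lengths.append(current)
            current = 0
    if current:                   # last sentence without a trailing empty line
        lengths.append(current)
    return lengths

# Example call (file name as used in this thread):
# with open("test_hmm.txt", encoding="utf-8") as f:
#     lengths = sentence_lengths(f)
# print(len(lengths), "sentences, longest has", max(lengths), "tokens")
```

If the file comes out as a single "sentence" containing all 47373 tokens, the separating empty lines are missing.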
|
|
|
Post by andresherrera on Dec 7, 2009 11:09:52 GMT -5
Hello,
there's something I would like to ask about the data extraction in the first pre-processing part. When I'm creating the file data.txt there are:
888,578 lines beginning with a character different from '#' (which is also the number of tokens said to be in the last line of the tiger_release_aug07.export file)
474,911 lines beginning with the character '#', of which
-> 373,964 begin with #5XX (where X are digits)
-> 50,474 begin with #EOS
-> 50,473 begin with #BOS
So, if data.txt is supposed to have 939,052 lines, I have to keep the lines that do not begin with '#' and also the lines beginning with '#EOS'. Do I really need to keep the #EOS lines in the file?
|
|
|
Post by Sren on Dec 7, 2009 14:36:57 GMT -5
You have to remove all lines beginning with a #, with one exception: every line beginning with #EOS has to be replaced with an empty line. The assignment says you have to put an empty line between sentences, and #EOS marks the end of a sentence.
So in the end you have 888,578 "word-lines" and 50,474 empty lines.
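The filtering rule described above can be sketched as follows. This is only one possible way to do it, not the official solution; export_to_token_lines is a made-up name, and the sketch deliberately leaves the token lines unchanged (selecting the word and tag columns is a separate step that is not covered here).

```python
def export_to_token_lines(lines):
    """Apply the rule from this thread to lines of the export file.

    - lines beginning with '#EOS' become an empty line (sentence boundary)
    - all other lines beginning with '#' are dropped
    - every remaining line is passed through unchanged
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#EOS"):
            yield ""
        elif line.startswith("#"):
            continue
        else:
            yield line

# Example call (assumes the export file was already converted to UTF-8;
# the input file name is a placeholder):
# with open("tiger_utf8.export", encoding="utf-8") as fin, \
#      open("data.txt", "w", encoding="utf-8") as fout:
#     for out_line in export_to_token_lines(fin):
#         fout.write(out_line + "\n")
```

Counting the empty and non-empty lines of the result is a simple sanity check against the 888,578 and 50,474 figures quoted above.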
|
|
|
Post by Sren on Dec 13, 2009 6:28:20 GMT -5
My results for part 3 are different from the given example output:
Number of tokens: 47373
Number of errors: 18752
Overall tagging accuracy: 60.42%
Tagging accuracy for known words: 71.61%
Number of unkown words: 7810
Unknown word ratio: 16.49%
Number of errors for unknown words: 7519
Tagging accuracy for unknown words: 3.73%
Does anybody else have different results?
|
|
|
Post by andresherrera on Dec 13, 2009 15:07:35 GMT -5
yes...
Number of tokens: 47373
Number of errors: 16427
Overall tagging accuracy: 65.32413%
Tagging accuracy for known words: 71.08749%
Number of unknown words: 4108
Unknown word ratio: 8.671606%
Number of errors for unknown words: 3918
Tagging accuracy for unknown words: 4.625122%
|
|
|
Post by Sren on Dec 14, 2009 7:05:42 GMT -5
We found the error. It was a UTF-8 problem, once again, in our baseTagger. Here are our new results:
Opening "combined.txt"
Number of tokens: 47373
Number of errors: 16384
Overall tagging accuracy: 65.41%
Tagging accuracy for known words: 71.19%
Number of unkown words: 4108
Unknown word ratio: 8.67%
Number of errors for unknown words: 3918
Tagging accuracy for unknown words: 4.63%
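For comparing results, it may help to see how the percentages follow from the four counts in such a report. The sketch below is not the evaluation program from the project; tagging_metrics is a hypothetical helper, and it assumes that the errors on known words are simply the total errors minus the errors on unknown words, which is consistent with the numbers posted in this thread.

```python
def tagging_metrics(tokens, errors, unknown, unknown_errors):
    """Derive the reported percentages from the raw counts.

    Assumption: known-word errors = total errors - unknown-word errors.
    """
    known = tokens - unknown
    known_errors = errors - unknown_errors
    return {
        "overall": 100.0 * (tokens - errors) / tokens,
        "known": 100.0 * (known - known_errors) / known,
        "unknown_ratio": 100.0 * unknown / tokens,
        "unknown": 100.0 * (unknown - unknown_errors) / unknown,
    }

# Counts from the last report in this thread:
m = tagging_metrics(tokens=47373, errors=16384, unknown=4108, unknown_errors=3918)
for name, value in m.items():
    print(f"{name}: {value:.2f}%")
```

With these counts the four values round to 65.41%, 71.19%, 8.67% and 4.63%, matching the report.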
|
|