Project: PoS tagging - Darmstadt

Hrafn Loftsson
Teachers

Assistant Professor

Posts: 33

Post by Hrafn Loftsson on Nov 28, 2009 16:02:53 GMT -5

Hello MSc students in Darmstadt.

I have just uploaded (on Saturday, November 28th, at 22:00) a revised description of your tagging project. If you downloaded the description earlier, you need to download it again.

Moreover, I uploaded a new version of IceNLP to www.ru.is/faculty/hrafn/Software/IceNLP-1.3.zip - make sure you have that version.

Regards, Hrafn.

ali
New Member

Posts: 3

Post by ali on Dec 4, 2009 16:27:56 GMT -5

Hello all,

I'm encountering a problem while doing the fourth part of the project.

While tagging the test corpus file, I get an error message saying that the number of states found exceeds the maximum number of states:

./tritagger.sh -p paramDefault.txt -cp
************************************************
* TriTagger - A HMM tagger (bi- or trigrams) *
* Version 1.1 *
* Copyright (C) 2005-2009, Hrafn Loftsson *
************************************************
Input: ./test_hmm.txt, one token per line
Output: ./test.tri.out, one token per line
Sentences start with an upper case letter
Started at: 12/4/09 10:05 PM
Loading model ../../ngrams/models/corpus ...
Using trigrams
Found 65357 states but maximum is: 65356

The output file remains empty.

If I reduce the number of tokens in test.corpus (e.g. from 47373 to 500), the tagger works properly.

Any idea what I'm doing wrong?

Ali

Sren
Guest

Post by Sren on Dec 5, 2009 6:42:02 GMT -5

I'm still fighting with a Unicode problem. The file "tiger_release_aug07.export" is latin1 encoded. Because of this, I have problems with umlauts when using cat and tail. If I change the encoding of the file to UTF-8 by hand (with vi), everything works fine. My locale setting seems to be correct (LANG=en_US.UTF-8).

Post by Hrafn Loftsson on Dec 7, 2009 3:42:53 GMT -5

Dec 4, 2009 16:27:56 GMT -5 ali said:

While tagging the test corpus file, I get an error message saying that the number of states found exceeds the maximum number of states:

./tritagger.sh -p paramDefault.txt -cp
************************************************
* TriTagger - A HMM tagger (bi- or trigrams) *
* Version 1.1 *
* Copyright (C) 2005-2009, Hrafn Loftsson *
************************************************
Input: ./test_hmm.txt, one token per line
Output: ./test.tri.out, one token per line
Sentences start with an upper case letter
Started at: 12/4/09 10:05 PM
Loading model ../../ngrams/models/corpus ...
Using trigrams
Found 65357 states but maximum is: 65356

Is it possible that your training file is not UTF-8 encoded?
Or, does your test_hmm.txt file actually contain both the tokens and the tags (it should only contain the tokens)?
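
One way to test the first possibility is to try decoding the file's bytes as UTF-8 and report where decoding fails. The Python sketch below is only an illustration; the helper name first_invalid_utf8_offset is made up for this example and is not part of IceNLP or TriTagger.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    None means the whole byte string decodes cleanly as UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None


if __name__ == "__main__":
    # The same word encoded in two ways: only the latin1 bytes are rejected.
    word = "Gr\u00f6\u00dfe"
    print(first_invalid_utf8_offset(word.encode("utf-8")))
    print(first_invalid_utf8_offset(word.encode("latin-1")))
```

For a file on disk, the bytes can be read with open(path, "rb").read(), where path is whatever your training file is called.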

Post by Hrafn Loftsson on Dec 7, 2009 3:52:26 GMT -5

Dec 5, 2009 6:42:02 GMT -5 Sren said:

I'm still fighting with a Unicode problem. The file "tiger_release_aug07.export" is latin1 encoded. Because of this, I have problems with umlauts when using cat and tail. If I change the encoding of the file to UTF-8 by hand (with vi), everything works fine. My locale setting seems to be correct (LANG=en_US.UTF-8).

Yes, it is true that the "tiger_release_aug07.export" file is latin1 encoded. I use Linux commands like tail, grep, and awk to pre-process the file (but without explicitly searching for umlauts), i.e. to get the relevant data. No problems there and my default encoding is UTF-8.

After pre-processing, I convert the resulting file to UTF-8, so my training and test corpora are UTF-8 encoded. You could just as well change the encoding right from the start, i.e. convert "tiger_release_aug07.export" to UTF-8 before pre-processing.

Note that you can use the Linux iconv command to change encoding, e.g.:

iconv -f iso-8859-1 -t utf-8 <infile >outfile
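
The same conversion can be sketched in Python, assuming the input really is ISO-8859-1 (latin1). The names latin1_to_utf8 and convert_file are hypothetical helpers for illustration, and the file paths are up to you.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 bytes."""
    # Every byte value is valid in ISO-8859-1, so decoding never fails;
    # it only gives the right text if the input really is latin1.
    return data.decode("iso-8859-1").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path line by line, converting latin1 to UTF-8."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding="iso-8859-1", newline="") as fin, \
            open(dst_path, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
```

Reading line by line keeps memory use low, which helps with a corpus file of this size.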

Last Edit: Dec 7, 2009 3:53:37 GMT -5 by Hrafn Loftsson

Post by ali on Dec 7, 2009 5:43:11 GMT -5

Dec 7, 2009 3:42:53 GMT -5 Hrafn Loftsson said:

Is it possible that your training file is not UTF-8 encoded?
Or, does your test_hmm.txt file actually contain both the tokens and the tags (it should only contain the tokens)?

Neither. My training file is UTF-8 encoded and test_hmm.txt contains only tokens.
The paramDefault file is attached to this reply. It actually looks fine, but maybe I'm overlooking a mistake...

Post by ali on Dec 7, 2009 10:42:08 GMT -5

In the meantime, we have found the cause of the "Found 65357 states but maximum is: 65356" error reported earlier in this thread:

We had mistakenly removed the empty lines that separate sentences in the input file (test_hmm.txt). As a result, the tagger could not recognise where new sentences begin and aborted tagging after ~500 tokens.
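
A quick sanity check for this kind of mistake is to measure sentence lengths in the one-token-per-line input, treating empty lines as sentence boundaries. The sketch below is a hypothetical helper, not part of TriTagger; a single extremely long "sentence" suggests that the separators are missing.

```python
def sentence_lengths(lines):
    """Return the number of tokens in each sentence.

    `lines` is any iterable of strings with one token per line;
    empty (or whitespace-only) lines are treated as sentence boundaries.
    """
    lengths = []
    current = 0
    for line in lines:
        if line.strip():
            current += 1          # a token line
        elif current:
            lengths.append(current)  # boundary closes the current sentence
            current = 0
    if current:
        lengths.append(current)      # last sentence without trailing blank line
    return lengths


if __name__ == "__main__":
    with_blanks = ["Das", "ist", "gut", ".", "", "Ja", "."]
    without_blanks = [tok for tok in with_blanks if tok]
    print(max(sentence_lengths(with_blanks)))
    print(max(sentence_lengths(without_blanks)))
```

To check a real file, pass the open file object, e.g. sentence_lengths(open("test_hmm.txt", encoding="utf-8")).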

andresherrera
New Member

Posts: 2

Post by andresherrera on Dec 7, 2009 11:09:52 GMT -5

Hello,

There's something I would like to ask about the data extraction in the first pre-processing part. When I create the file data.txt, there are:

888,578 lines beginning with a character other than '#' (which is also the number of tokens stated in the last line of the tiger_release_aug07.export file)
474,911 lines beginning with the character '#'
-> of which 373,964 begin with #5XX (where each X is a digit)
-> 50,474 begin with #EOS
-> 50,473 begin with #BOS

So, if I want data.txt to have 939,052 lines, I should keep the lines without '#' and also the lines beginning with '#EOS'. Do I really need to keep these lines in the file?

Post by Sren on Dec 7, 2009 14:36:57 GMT -5

You have to remove all lines beginning with '#', with one exception: every line beginning with #EOS must be replaced with an empty line. The assignment says to put an empty line between sentences, and #EOS marks the end of a sentence.

So in the end you have 888,578 "word lines" and 50,474 empty lines.
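
The rule described here can be sketched in Python as follows. The helper name to_sentence_lines and the sample lines are made up for illustration (real export lines contain more columns), and the sketch assumes that only sentence blocks are passed in, i.e. any file header has already been stripped.

```python
def to_sentence_lines(export_lines):
    """Yield word lines unchanged and an empty line for each #EOS line.

    All other lines starting with '#' (e.g. #BOS or #5XX nodes) are dropped.
    """
    for line in export_lines:
        line = line.rstrip("\n")
        if line.startswith("#EOS"):
            yield ""              # sentence boundary becomes an empty line
        elif line.startswith("#"):
            continue              # drop #BOS and non-terminal node lines
        else:
            yield line            # keep the word line as it is


if __name__ == "__main__":
    # Made-up sample in the spirit of the export format.
    sample = ["#BOS 1", "Das\tPDS", "#500\tNP", "#EOS 1",
              "#BOS 2", "Ja\tITJ", "#EOS 2"]
    for out in to_sentence_lines(sample):
        print(repr(out))
```

Applied to the counts in this thread, 888,578 word lines plus 50,474 empty lines give the expected 939,052 lines.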

Post by Sren on Dec 13, 2009 6:28:20 GMT -5

My results for part 3 differ from the given example output:

Number of tokens: 47373
Number of errors: 18752
Overall tagging accuracy: 60.42%
Tagging accuracy for known words: 71.61%
Number of unkown words: 7810
Unknown word ratio: 16.49%
Number of errors for unknown words: 7519
Tagging accuracy for unknown words: 3.73%

Does anybody else have different results?

Post by andresherrera on Dec 13, 2009 15:07:35 GMT -5

yes...

Number of tokens: 47373
Number of errors: 16427
Overall tagging accuracy: 65.32413%
Tagging accuracy for known words: 71.08749%
Number of unknown words: 4108
Unknown word ratio: 8.671606%
Number of errors for unknown words: 3918
Tagging accuracy for unknown words: 4.625122%

Post by Sren on Dec 14, 2009 7:05:42 GMT -5

We found the error. It was a UTF-8 problem, once again, in our baseTagger.

Here are our new results:

Opening "combined.txt"
Number of tokens: 47373
Number of errors: 16384
Overall tagging accuracy: 65.41%
Tagging accuracy for known words: 71.19%
Number of unkown words: 4108
Unknown word ratio: 8.67%
Number of errors for unknown words: 3918
Tagging accuracy for unknown words: 4.63%
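
The percentages in such a report follow from four counts: tokens, errors, unknown words, and errors on unknown words. Below is a minimal Python sketch of that arithmetic, assuming known words are simply the tokens that are not unknown. The function name tagging_report is hypothetical, and this is not the evaluation program used in the course.

```python
def tagging_report(tokens, errors, unknown, unknown_errors):
    """Derive accuracy percentages from raw counts.

    tokens          total number of tokens in the test corpus
    errors          total number of tagging errors
    unknown         number of unknown words
    unknown_errors  number of errors made on unknown words
    """
    known = tokens - unknown                  # assumption: known = not unknown
    known_errors = errors - unknown_errors
    return {
        "overall_accuracy": 100.0 * (tokens - errors) / tokens,
        "known_accuracy": 100.0 * (known - known_errors) / known,
        "unknown_ratio": 100.0 * unknown / tokens,
        "unknown_accuracy": 100.0 * (unknown - unknown_errors) / unknown,
    }


if __name__ == "__main__":
    # Counts taken from the report directly above.
    report = tagging_report(47373, 16384, 4108, 3918)
    for name, value in report.items():
        print(f"{name}: {value:.2f}%")
```

With the counts from the report above (47373, 16384, 4108, 3918), this gives 65.41%, 71.19%, 8.67% and 4.63% after rounding to two decimals, matching the listed figures.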
