Post by Hrafn Loftsson on Oct 19, 2009 5:10:48 GMT -5
Here you can discuss and pose questions about the PoS tagging programming project.
Regards, Hrafn.
Post by nik on Oct 22, 2009 5:55:04 GMT -5
It's nothing really worth mentioning, but I noticed a mistake in the assignment description. It says unknown words beginning with an upper-case letter should be tagged as NNP, but in the example output they are tagged as NNS. I suppose NNP is the right way to go.
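As a sketch, the corrected rule for capitalized unknown words could look like this in Python. The function name and the NN fallback for lower-case unknown words are my own illustrative choices, not part of the assignment:

```python
def tag_unknown(word):
    """Guess a tag for a word that was not seen during training.

    Follows the (corrected) assignment rule: unknown words starting
    with an upper-case letter are tagged NNP. The NN fallback for
    everything else is a hypothetical default, not the assignment's rule.
    """
    if word and word[0].isupper():
        return "NNP"  # per the corrected assignment description
    return "NN"       # illustrative fallback only
```

For example, `tag_unknown("Reykjavik")` returns `"NNP"`, while `tag_unknown("water")` falls through to the default.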
Post by Hrafn Loftsson on Oct 22, 2009 17:53:46 GMT -5
Post by nik on Oct 23, 2009 9:44:56 GMT -5
The gold standard also seems not to be that golden.
Another DT one NN just RB like VB all PDT the DT others NNS . .
"like" should rather be IN here, right?
The DT sea NN of IN this DT medium-young NN planet NN was VBD full JJ of IN perfectly RB good JJ water NN , , . . .
I guess "medium-young" is more of an adjective (JJ) here.
He PRP 'd MD be VB getting VBG a DT haircut VB soon RB enough RB . .
"haircut" as a verb???
Miniscule JJ organisms NN washed VBN upon IN the DT rocks NNS , , . . .
Isn't "organisms" a plural noun (NNS)?
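For anyone inspecting the file like this: since the gold standard alternates words and tags on each line, a small helper can pull out the (word, tag) pairs. This is only a sketch assuming the space-separated alternating format shown above; the function name is made up:

```python
def parse_tagged_line(line):
    """Split a line of alternating word/tag tokens into (word, tag) pairs.

    Assumes the format seen in the examples above: whitespace-separated
    tokens, with words at even positions and tags at odd positions.
    """
    tokens = line.split()
    return list(zip(tokens[0::2], tokens[1::2]))
```

For example, `parse_tagged_line("Another DT one NN just RB")` yields `[("Another", "DT"), ("one", "NN"), ("just", "RB")]`.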
Post by Hrafn Loftsson on Oct 24, 2009 6:37:49 GMT -5
Good job, Nik!
These are indeed errors.
This "gold standard" was created in the following manner. First, I trained TriTagger using the eng.train file. Second, I made TriTagger tag the eng.tst file and called the resulting file eng.tst.gold. Last, I looked at the unknown words in eng.tst.gold and corrected many of the errors. I probably did not find all the errors, and since I did not examine the known words in any detail, it is very likely that more errors remain in this gold standard.
All of you should correct the errors that Nik pointed out, and please post here any additional errors that you find.
Post by Hrafn Loftsson on Oct 28, 2009 4:37:30 GMT -5
I got the following question from one of the students:
How exactly do I calculate the tagging accuracy for known words in exercise 2?
Note that, when tagging new text, a word is unknown if it was not found during training. Therefore, you should be able to calculate the accuracy separately for known words and unknown words.
Recall also that the gold standard file has the "correct" tag for each word, no matter whether the word is in the training file or not.
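A minimal sketch of that separate accuracy computation, assuming you keep the set of words seen in eng.train (all the names here are illustrative, not part of the assignment):

```python
def split_accuracy(gold_pairs, predicted_tags, known_words):
    """Compute tagging accuracy separately for known and unknown words.

    gold_pairs:     list of (word, gold_tag) pairs from the gold standard
    predicted_tags: list of tags produced by your tagger, same order
    known_words:    set of words that occurred in the training file
    """
    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for (word, gold_tag), predicted in zip(gold_pairs, predicted_tags):
        bucket = "known" if word in known_words else "unknown"
        counts[bucket][1] += 1
        if predicted == gold_tag:
            counts[bucket][0] += 1
    # Accuracy per bucket; 0.0 if a bucket is empty
    return {k: (c / t if t else 0.0) for k, (c, t) in counts.items()}
```

The key point matches the note above: a word counts as unknown exactly when it is absent from the training vocabulary, so a single membership test decides which bucket each token falls into.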
Post by Hrafn Loftsson on Oct 28, 2009 17:01:47 GMT -5
Tihomir found several additional errors in the eng.tst.gold file. I have corrected the file - the new version is now in www.ru.is/faculty/hrafn/Data/eng.zip . You should use the new file for testing the accuracy of your tagger.