Post by Hrafn Loftsson on Oct 19, 2009 5:10:48 GMT -5
Here you can discuss and pose questions about the PoS tagging programming project.
Regards, Hrafn.
Post by nik on Oct 22, 2009 5:55:04 GMT -5
It's nothing really worth mentioning, but I noticed a mistake in the assignment description. It says unknown words beginning with an upper-case letter should be tagged as NNP, but in the example output they are tagged as NNS. I suppose NNP is the right way to go.
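As a sketch, the corrected rule for capitalized unknown words could look like this in Python. The function name and the NN fallback for lower-case unknown words are my own illustrative choices, not part of the assignment:

```python
def tag_unknown(word):
    """Guess a tag for a word that was not seen during training.

    Follows the (corrected) assignment rule: unknown words starting
    with an upper-case letter are tagged NNP. The NN fallback for
    everything else is a hypothetical default, not the assignment's rule.
    """
    if word and word[0].isupper():
        return "NNP"  # per the corrected assignment description
    return "NN"       # illustrative fallback only
```

For example, `tag_unknown("Reykjavik")` returns `"NNP"`, while `tag_unknown("water")` falls through to the default.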
Post by Hrafn Loftsson on Oct 22, 2009 17:53:46 GMT -5
Post by nik on Oct 23, 2009 9:44:56 GMT -5
The gold standard also seems not to be that golden.
Another DT one NN just RB like VB all PDT the DT others NNS . .
"like" should rather be IN here, right?
The DT sea NN of IN this DT medium-young NN planet NN was VBD full JJ of IN perfectly RB good JJ water NN , , . . .
I guess "medium-young" is more of an adjective (JJ) here.
He PRP 'd MD be VB getting VBG a DT haircut VB soon RB enough RB . .
"haircut" as a verb???
Miniscule JJ organisms NN washed VBN upon IN the DT rocks NNS , , . . .
Isn't "organisms" a plural noun (NNS)?
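For anyone inspecting the file like this: since the gold standard alternates words and tags on each line, a small helper can pull out the (word, tag) pairs. This is only a sketch assuming the space-separated alternating format shown above; the function name is made up:

```python
def parse_tagged_line(line):
    """Split a line of alternating word/tag tokens into (word, tag) pairs.

    Assumes the format seen in the examples above: whitespace-separated
    tokens, with words at even positions and tags at odd positions.
    """
    tokens = line.split()
    return list(zip(tokens[0::2], tokens[1::2]))
```

For example, `parse_tagged_line("Another DT one NN just RB")` yields `[("Another", "DT"), ("one", "NN"), ("just", "RB")]`.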
Post by Hrafn Loftsson on Oct 24, 2009 6:37:49 GMT -5
Good job, Nik!
These are indeed errors.
This "gold standard" was created in the following manner. First, I trained TriTagger using the eng.train file. Second, I made TriTagger tag the eng.tst file and called the resulting file eng.tst.gold. Last, I looked at the unknown words in eng.tst.gold and corrected many of the errors. I probably did not find all the errors, and since I did not examine the known words in any detail, it is very likely that more errors remain in this gold standard.
All of you should correct the errors that Nik pointed out, and please post here any additional errors that you find.
Post by Hrafn Loftsson on Oct 28, 2009 4:37:30 GMT -5
I got the following question from one of the students:
How exactly do I calculate the tagging accuracy for known words in exercise 2?
Note that, when tagging new text, a word is unknown if it was not found during training. Therefore, you should be able to calculate the accuracy separately for known words and unknown words.
Recall also that the gold standard file has the "correct" tag for each word, no matter whether the word is in the training file or not.
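A minimal sketch of that separate accuracy computation, assuming you keep the set of words seen in eng.train (all the names here are illustrative, not part of the assignment):

```python
def split_accuracy(gold_pairs, predicted_tags, known_words):
    """Compute tagging accuracy separately for known and unknown words.

    gold_pairs:     list of (word, gold_tag) pairs from the gold standard
    predicted_tags: list of tags produced by your tagger, same order
    known_words:    set of words that occurred in the training file
    """
    counts = {"known": [0, 0], "unknown": [0, 0]}  # [correct, total]
    for (word, gold_tag), predicted in zip(gold_pairs, predicted_tags):
        bucket = "known" if word in known_words else "unknown"
        counts[bucket][1] += 1
        if predicted == gold_tag:
            counts[bucket][0] += 1
    # Accuracy per bucket; 0.0 if a bucket is empty
    return {k: (c / t if t else 0.0) for k, (c, t) in counts.items()}
```

The key point matches the note above: a word counts as unknown exactly when it is absent from the training vocabulary, so a single membership test decides which bucket each token falls into.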
Post by Hrafn Loftsson on Oct 28, 2009 17:01:47 GMT -5
Tihomir found several additional errors in the eng.tst.gold file. I have corrected the file - the new version is now in www.ru.is/faculty/hrafn/Data/eng.zip . You should use the new file for testing the accuracy of your tagger.