|
Post by Hrafn Loftsson on Sept 30, 2009 8:32:27 GMT -5
Here you can discuss and pose questions about the tokenisation programming project.
Regards, Hrafn.
|
|
Jeppe Welling Hansen
Guest
|
Post by Jeppe Welling Hansen on Sept 30, 2009 9:06:04 GMT -5
I was wondering what to do with special characters, e.g. the Danish æøå. When I process them with JFlex I get strange values (somehow the characters get transformed into something that looks like Chinese letters). Is there some way to specify that the input file is UTF-8, or what can be done?
- Jeppe
|
|
Jeppe Welling Hansen
Guest
|
Post by Jeppe Welling Hansen on Sept 30, 2009 9:11:58 GMT -5
I should probably also specify that I was trying to match the characters in the file like:
Lower = [a-zæøå]
Upper = [A-ZÆØÅ]
I think this is where it gets messed up... I want to be able to match special characters.
/ Jeppe
|
|
nik
New Member
Posts: 8
|
Post by nik on Sept 30, 2009 9:26:36 GMT -5
According to the JFlex manual you could try \u plus the hex Unicode value of your characters.
"a \u followed by four hexadecimal digits [a-fA-F0-9] (denoting an unicode escape sequence)"
I don't know if this helps.
|
|
|
Post by Hrafn Loftsson on Sept 30, 2009 9:28:56 GMT -5
From the JFlex manual: %unicode defines the set of characters the scanner will work on. For scanning text files, %unicode should always be used. Also make sure that you read jflex.de/manual.html#sec:encodings

I have not had any problems with this when my input files are UTF-8 encoded and the platform default encoding is also UTF-8 (which is the case for my Linux machine). When you run a UTF-8 input file through a JFlex scanner under Windows, you might encounter problems. In that case, you may need to change the Java code that JFlex produces, that is, where it uses InputStreamReader in the scanner constructor:

  /**
   * Creates a new scanner.
   * There is also a java.io.Reader version of this constructor.
   *
   * @param in the java.io.InputStream to read input from.
   */
  public EngGood(java.io.InputStream in) {
    this(new java.io.InputStreamReader(in));
  }

Here, java.io.InputStreamReader(in) might have to be changed to java.io.InputStreamReader(in, "UTF-8").
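If you want to see the effect in isolation, here is a small standalone sketch (the class EncodingDemo is made up for illustration; it is not part of any generated scanner). It decodes the UTF-8 bytes of "æøå" with an explicit charset, the same way the modified scanner constructor would:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Decode a byte stream with an explicitly named charset,
    // mirroring new InputStreamReader(in, "UTF-8") in the scanner.
    static String decode(byte[] bytes, String charset) throws Exception {
        BufferedReader r = new BufferedReader(
            new InputStreamReader(new ByteArrayInputStream(bytes), charset));
        return r.readLine();
    }

    public static void main(String[] args) throws Exception {
        // "æøå" is 3 characters but 6 bytes in UTF-8 (two bytes each).
        byte[] utf8 = "æøå".getBytes(StandardCharsets.UTF_8);

        // Correct charset: the three characters come back intact.
        System.out.println(decode(utf8, "UTF-8").equals("æøå"));

        // Wrong (one-byte) charset: each two-byte sequence becomes
        // two mojibake characters, so the string is twice as long.
        System.out.println(decode(utf8, "ISO-8859-1").length());
    }
}
```

Running it prints true and 6: with the wrong charset the three letters turn into six garbage characters, which is exactly the kind of mangling described above.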
|
|
|
Post by Hrafn Loftsson on Oct 2, 2009 8:07:45 GMT -5
Note that I do not insist that you use JFlex in preference to some other lexical analyser tool. You could, for example, use lex (which generates C code) or flex (which can generate C or C++ code).
|
|
Jeppe Welling Hansen
Guest
|
Post by Jeppe Welling Hansen on Oct 4, 2009 9:15:04 GMT -5
Actually the problem still remains.
We have something like:
// 00C6 = Æ // 00D8 = Ø // 00C4 = Ä // 00C5 = Å // 00D6 = Ö
// 00E6 = æ // 00F8 = ø // 00E4 = ä // 00E5 = å // 00F6 = ö
SpecialUpper = [\u00c6\u00d8\u00c4\u00c5\u00d6]
SpecialLower = [\u00e6\u00f8\u00e4\u00e5\u00f6]
Lower = [a-z]|SpecialLower
Upper = [A-Z]|SpecialUpper
It is supposed to match the letters shown in the commented section above. However, when using JFlex, none of the letters are recognized.
Did anyone have any luck?
|
|
ivar
New Member
Posts: 4
|
Post by ivar on Oct 5, 2009 11:55:52 GMT -5
If you are using a Mac, then by default you are using the Mac-Roman encoding, which is the operating system's default. You should change the default encoding of the shell (the terminal, not the OS); this is done differently depending on which terminal app you are using.
In the default Terminal that ships with your Mac, you change this under Window -> Settings,
and in iTerm you go to the session info window and change it there.
Just changing the terminal to use UTF-8 will most likely solve the character-encoding problems you might have. At least files created in bash will then be encoded in UTF-8.
Hope that helps.
|
|
|
Post by oliver on Oct 5, 2009 15:01:48 GMT -5
Looks like I have the same problem as Jeppe. My input file is, according to Vim, encoded in UTF-8. I love Vim. Vim never lied to me. So that should not be the problem. I also tried the source-code-level fix Hrafn provided, but that did not work either. Even replacing my default Mac bash with the newest version from MacPorts did not help. I will try flex next. If that one fails too, I will just bang my head against a wall or something.
|
|
ivar
New Member
Posts: 4
|
Post by ivar on Oct 5, 2009 21:08:55 GMT -5
After reading that some of you have problems with Mac encoding, I decided to take the project I was developing on Linux and test it on my Mac, and everything is working fine on my Intel MacBook Pro. I am currently running Snow Leopard and using iTerm, since I never liked the default terminal that ships with the Mac. I changed the encoding to UTF-8 in iTerm and everything works fine.
I should also mention that I am using Flex and C++ with GNU GCC 4.2. I have tried Danish, Swedish, and my native tongue Icelandic, and everything works; at least I can parse the special characters. I am not a good enough Danish or Swedish speaker to say whether it was parsed 100% correctly.
I don't know if Java uses a different character encoding on the Mac; I always thought it used UTF-8 by default, but maybe it uses the operating system's default encoding, which is Mac-Roman.
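You can check what the JVM actually picks up with a two-line sketch (the output depends on your platform and JVM, so I can't say what it will print on your machine):

```java
import java.nio.charset.Charset;

public class DefaultCharset {
    public static void main(String[] args) {
        // The JVM reads the platform default encoding at startup;
        // on older Mac OS X setups this may be MacRoman, not UTF-8.
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}
```

If this prints MacRoman, that would explain why an InputStreamReader created without an explicit charset mangles the input.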
|
|
|
Post by lele120 on Oct 6, 2009 13:07:57 GMT -5
Hi, does anyone know how to invoke shell commands from a Perl program? I want to invoke "java tokenizer input.txt > output.txt" in a Perl program.
|
|
|
Post by Hrafn Loftsson on Oct 6, 2009 13:51:39 GMT -5
|
|
|
Post by oliver on Oct 7, 2009 8:10:40 GMT -5
Well, I think Snow Leopard does the trick, because it did not work with iTerm on Mac OS X Leopard. My copy of Snow Leopard should arrive tomorrow, so there will be plenty of time to try it on the Mac again after I upgrade my machine. Until then, I'll stick to my Windows virtual machine.
|
|