MIT 6.035: September 2006

Line Numbers in the Parser

I received a question regarding how one can get line number (and
column) information for tokens in the parser. Note that this is not
required for this phase of the project.

This is not straightforward because the parser extracts the "value"
field from the Symbol class when you label a terminal or non-terminal
in the RHS of a production. The other fields of ComplexSymbol are not
available in the action code of the parser.

The JFlex manual gives a solution to this problem. I find it a bit
distasteful but it works. I will detail it below:

In addition to the value field, the parser also extracts the left and
right fields from the Symbol in the RHS of a production. For example
if you have:

id ::= IDENTIFIER:Val

You can use Valleft and Valright in the action code. For example:

id ::= IDENTIFIER:Val {: System.out.println(Valleft + " " + Valright); :} ;

The default constructor for ComplexSymbol does not set the left and
right fields of its superclass, Symbol. We can use left and right to
store the line and column of the token. Our symbol(...) factory
method in the scanner can explicitly set symbol.left and symbol.right
(assuming the newly created ComplexSymbol is named symbol).
The implementation should be straightforward.
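
A minimal sketch of such a factory method follows. So that it runs without CUP on the classpath, Symbol and ComplexSymbol here are simplified stand-ins for the real classes (only the fields discussed above), and LineInfoSketch, the parameter list of symbol(...), and describe(...) are hypothetical names. In a JFlex scanner the line and column would normally come from yyline and yycolumn, which are available when the %line and %column directives are used.

```java
// Simplified stand-in for the CUP Symbol class: only the fields used here.
class Symbol {
    int sym;        // token id
    int left;       // reused below to hold the line number
    int right;      // reused below to hold the column number
    Object value;   // the value the parser hands to action code

    Symbol(int sym, Object value) {
        this.sym = sym;
        this.value = value;
    }
}

// Simplified stand-in for ComplexSymbol: its constructor leaves left/right unset.
class ComplexSymbol extends Symbol {
    String name;

    ComplexSymbol(String name, int sym, Object value) {
        super(sym, value);
        this.name = name;
    }
}

class LineInfoSketch {
    // Hypothetical factory method: builds the symbol, then stores position info.
    static Symbol symbol(String name, int id, Object value, int line, int column) {
        ComplexSymbol symbol = new ComplexSymbol(name, id, value);
        symbol.left = line;     // later visible as <label>left in action code
        symbol.right = column;  // later visible as <label>right in action code
        return symbol;
    }

    // Hypothetical helper showing how the stored position could feed an error message.
    static String describe(Symbol s) {
        return "line " + s.left + ", column " + s.right + ": " + s.value;
    }

    public static void main(String[] args) {
        Symbol id = symbol("IDENTIFIER", 1, "foo", 12, 5);
        System.out.println(describe(id));
    }
}
```

With the fields set this way, Valleft and Valright in the action code above would carry the line and column instead of their usual contents.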

After the line number is available, we can use it when reporting errors.

(This is distasteful because this is not the intended use for the left
and right fields.)

Scanner Output Inconsistency

Your classmate Zev found an inconsistency in the provided scanner
output: the outputs for char3 and char5 disagree on the line number
the scanner records in the ComplexSymbol for an illegal character.

The desired behavior is for the line number in the symbol for
character literals to take the value of the *beginning* line of the
char literal. This only makes a difference for illegal char literals
that are unterminated (meaning a newline is encountered before a
closing ' ). char3 is an example of this behavior.

But char5 uses the ending line in the line field of the symbol.

I have changed the provided output of char5 to reflect the beginning line.

Since this is short notice and the difference is unimportant, I will
accept either option from your scanner (ending line or beginning
line).
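
The beginning-line behavior can be sketched as follows. This is a hand-written toy, not the JFlex scanner for the project: CharLiteralLineSketch and firstUnterminatedCharLine are hypothetical names, numbering lines from 1 is an assumption, and escape sequences are ignored for brevity. The point is only that the line is captured when the opening ' is seen, not when the newline that ends the literal is reached.

```java
class CharLiteralLineSketch {
    // Returns the line on which the first unterminated char literal begins,
    // or -1 if every opening quote is closed before a newline.
    static int firstUnterminatedCharLine(String input) {
        int line = 1;              // assumption: lines are numbered from 1
        boolean inLiteral = false;
        int startLine = -1;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\'') {
                if (!inLiteral) {
                    inLiteral = true;
                    startLine = line;   // remember the beginning line
                } else {
                    inLiteral = false;  // literal was properly closed
                }
            } else if (c == '\n') {
                if (inLiteral) {
                    return startLine;   // newline before the closing quote
                }
                line++;
            }
        }
        // End of input inside a literal also counts as unterminated.
        return inLiteral ? startLine : -1;
    }

    public static void main(String[] args) {
        String source = "x = 'a';\ny = 'b\nz = 'c';\n";
        System.out.println(firstUnterminatedCharLine(source));
    }
}
```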

Let me know if there are any further problems. Sorry for the confusion
and thank you Zev.

.05 is an Identifier!

You should note that .05 (and any other "floating point number" of the form .digit*) is an identifier. Somewhat weird, and it confused me for a bit, but since the language does not include floating point numbers, it is fine.

So we can have a statement like

.5 = .7 + 6;

where .5 and .7 are variables.

Tabs in Char/String literals

Your scanner should not allow literal tab characters (typed with the Tab key, as opposed to the escape sequence '\t') in char literals and string literals. Look at scanner test char9.

Scanner Output

Question:

Do you have a comprehensive list somewhere of all the token strings and
error strings that you expect? Or can you at least guarantee that the
public tests contain all of them? It would suck if one of my error messages
didn't match yours because I didn't know what it should look like.

Answer:

Only the token strings that are listed in the sample scanner output
are specified. The names for any others you recognize are not
specified and I don't know if the provided tests for the scanner are
complete. There is no list of the token strings.

Don't worry about the exact format of any error messages that are not
given in the sample output. Just worry about catching as many errors
in the scanner as you can, and try to match the output of the given
tests. I will look closely at the scanner/parser output for the hidden
tests and I will not simply diff the output against my scanner/parser.
And also, if you really cannot match the output of the scanner
perfectly, don't try too hard, I'll examine any tests that fail to
match the output exactly.

LALR(1) Conflict Resolution

After class a question was raised regarding how the parser can recognize a field declaration versus a method declaration because they both start out with a type declaration and an identifier.

I think I understand the confusion. You should remember that an LALR(1) parser has one lookahead symbol to decide which *state* it should transition into, but a state can include more than one of the original productions.

So during a parse:

class Program {
int f . (

The parser should be in a state that "includes" both method declaration (if there are no field decls) and field declaration and upon seeing the "(" will move into a state for a method declaration.

The trick is to delay the decision as to which rule should be
matched as long as possible. If you have the following rules:

program -> CLASS id LBRACE field_decl method_decl RBRACE;

field_decl -> type id ... | ;

method_decl -> type id LPAREN ... | ;

You will get a conflict upon seeing a type.

The parser will have to decide upon seeing a type whether it is a
field_decl or a method_decl. But it cannot tell yet, and you did not give it the option of delaying the decision until later and continuing in a state that includes both.

You do not want to force the parser to commit to a decision early in the parse. The early states need to contain each option (i.e., whether there is a field_decl or not).
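
The same idea can be illustrated outside of a parser generator. The sketch below is a hand-written classifier, not CUP output and not a rewrite of the grammar; the Tok enum, DelayedDecisionSketch, and classify are hypothetical names, and treating every token other than LPAREN as the start of a field declaration's tail is a simplification. It consumes the shared prefix "type id" without committing to either kind of declaration and decides only when it sees the token that distinguishes them.

```java
import java.util.Arrays;
import java.util.List;

class DelayedDecisionSketch {
    // Hypothetical token kinds; only the ones needed for the illustration.
    enum Tok { TYPE, ID, LPAREN, SEMI, OTHER }

    // Classifies a declaration that starts at position 0 of the token list.
    static String classify(List<Tok> tokens) {
        // Shared prefix: both declarations begin with "type id",
        // so nothing is decided while these two tokens are consumed.
        if (tokens.size() < 3 || tokens.get(0) != Tok.TYPE || tokens.get(1) != Tok.ID) {
            return "error";
        }
        // Only now, looking at the next token, is the choice made.
        return tokens.get(2) == Tok.LPAREN ? "method_decl" : "field_decl";
    }

    public static void main(String[] args) {
        System.out.println(classify(Arrays.asList(Tok.TYPE, Tok.ID, Tok.LPAREN)));
        System.out.println(classify(Arrays.asList(Tok.TYPE, Tok.ID, Tok.SEMI)));
    }
}
```

An LALR(1) parser does the equivalent with states rather than code: it stays in a state covering both productions through "type id" and branches on the lookahead.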

How can you re-write the grammar above to be conflict-free?

I hope this helps. Let me know if there are still any problems. I can write more on this subject.

Setting Up CVS

You can follow the steps below on Athena to create a cvs repository and import the skeleton to begin working on the project. Lines beginning with # are comments.

cd /mit/6.035/groups/leXX
mkdir cvsroot

#set your CVSROOT, should be added to your startup script
#bash
export CVSROOT=/mit/6.035/groups/leXX/cvsroot
# or csh
setenv CVSROOT /mit/6.035/groups/leXX/cvsroot

#create the repository
cvs -d $CVSROOT init

#import the skeleton
cd /mit/6.035/provided/skeleton
#the argument compiler is the parent directory for the project
#leXX is a vendor tag and start is a release tag
cvs import -m "Importing Skeleton" compiler leXX start

Now you can run 'cvs checkout compiler' to check out the skeleton where you want to begin working.

If you would like to work on a non-Athena computer that runs Linux and has CVS installed, you can use remote CVS by setting the environment variable CVS_RSH to "ssh" and setting your CVSROOT to:

:ext:username@ant.mit.edu:/mit/6.035/groups/leXX/cvsroot

Then you can run cvs checkout from your Linux box.

Introduction

Welcome to the blog for MIT 6.035. Here I will post answers to questions that I receive. Students can also post corrections and updates via the comments.

Sunday, September 17, 2006

Line Numbers in the Parser

Saturday, September 16, 2006

Scanner Output Inconsistency

Friday, September 15, 2006

.05 is an Identifier!

Tabs in Char/String literals

Thursday, September 14, 2006

Scanner Output

Wednesday, September 13, 2006

LALR(1) Conflict Resolution

Tuesday, September 12, 2006

Setting Up CVS

Introduction
