Anda di halaman 1dari 14

Writing a Compiler with SableCC

To create a compiler with SableCC we follow the following steps: 1. We create a SableCC specification file containing the lexical definitions and the grammar for the language being designed. 2. After creating the SableCC specification file, we generate the framework by launching SableCC on the specification file. . 3. At this stage, we generate the working classes. This is the only code we have to write in JavaTM. It is in this step that we write the semantic analyzer, the code generator, and possibly the code optimizer. In the case of an interpreter, we write a single class. These working classes may be subclasses of one of the classes from the analysis subfolder generated by SableCC in the previous step. 4. It is during this step that we create the driver class of the compiler. It is used to activate the lexer, parser and working classes. 5. Finally, in this step, we compile the compiler with a JavaTM compiler. To illustrate this, we are going to implement a subset of Pascal by following the steps outlined above. We are going to implement a subset instead of the full language, because we want to give the reader an idea of how SableCC works. We are now going to start the implementation

SableCC specification file


Before we start implementing the SableCC specification file for the subset of Pascal, we are going to describe the syntax of SableCC using BNF. I am assuming that you have read the introduction to regular expressions and the contextfree grammars. The syntax of SableCC is as follows: <grammar> [<package declaration>] [<helper declarations>] [<states declarations>] [<token declarations>] [<ignored tokes>] [<productions>] As we can see from the grammar, a SableCC specification file may by an empty file. Remember the meaning of anything between [ and ] ? This means that anything between [ and ] can only be the empty string or a single occurrence of whatever is between the brackets. From the allowed sections in a SableCC specification file, we will not take into account the <state declarations> because we will not use it. But if the reader is interested to know how it works, he should refer to [1]. We will now describe the syntax of the other productions. The <package declaration> is used to name the destination root, that is, the directory where all the generated files will be placed. We declare the name of the package as follows: Package uk.co.brainycreatures.smallpascal; In this example, the root directory will be uk\co\brainycreatures\smallpascal. Meaning that all the genarated subfolders will be placed under this directory. A package name may be any sequence of identifiers, starting with a letter followed by zero or more letters or digits, separated by a dot. The <helper declarations> works like constants in Pascal. And, as its name imply, it is used as a helper of

something, in this case the <token declarations> section (more about this next). To see how a <helper declarations> is helpful lets examine the following regular expression: id = [a .. z] ([a .. z] | [0 .. 9])* In this example we could simplify this regular expression by defining digit = [0 .. 9] and letter = [a .. z], then our regular expression ID becomes id = letter (letter | digit)* Letter and digit are our helpers. Heres how we declare helpers in SableCC: Helpers letter = [a .. z]; digit = [0 .. 9]; After the <helper declarations> follows the <token declarations>, which is where we define our terminals or tokens. Lets use the example above to show how to declare tokens in SableCC: Tokens id = letter (letter | digit)*; This is very similar to regular expressions as described in the Regular Expressions tutorial. A token defined by a string of characters is declared between quotes, e.g. program = program, and every declaration ends with a semicolon. Following the <token declarations> is the <ignored tokens>, in other words, the section where we declare the tokens to be ignored by the parser. For example, comments, blanks, carriage return, etc. Heres an example: Helpers any_charater = [0x0 .. 0xfffff]; nl = \n; Tokens comment = any_character nl blank = 10 | 10 13 | 9; Ignored Tokens comment, blank; In this case, comment and blank will be ignored by the parser. Finally, in the <productions> section, we declare the productions grammar for the language. The productions are defined in BNF or EBNF (Extended BackusNaur Form). Heres an example of how to declare productions: Tokens identifier = letter (letter | digit)*; number = digit+; plus = +; Productions expression = identifier plus identifier minus number ; As we can see it is similar to the context-free grammars presented in the grammars tutorial. The only

difference is that in SableCC we dont declare a nonterminal surrounded by < and >, and we replace the by =. In the productions section, it is sometimes required to precede a token by T. and a production by a P. This happens when we have a token and a production with the same name. For example if we have a production program and a token program, then in the production below we must precede program by T., because will not know if it is the token program or the production program. Tokens program = program; semicolon = ; ; Productions program = T.program identifier semicolon ; Now that we are familiar with the syntax of SableCC, lets implement our subset. The root of our package will be uk.co.brainycreatures.smallpascal. Hence in the package declaration we will have Package uk.co.brainycreatures.smallpascal; After defining the package, we are going to declare our helpers, which are shown below: Helpers a = a | A ; b = b | B ; e = e | E ; g = g | G ; i = i | I ; m = m | M; n = n | N ; o = o | O ; p = p | P ; r = r | R ; t = t | T; v = v | V ; w = w | W; cr = 13 ; carriage return lf = 10 ; line feed tab = 9 ; tab char ascii_char = [32 .. 127] ; blank = ; digit = [0 .. 9] ; letter = [[a .. z] + [A .. Z]] ; l_brace = { ; r_brace = } ; l_paren_star = (* ; r_paren_star = *) ; Note the space between the letters in the definition of the keywords. In this case we need them, because each letter is a helper token. So, in defining our tokens, we have a sequence of helpers separated by a space. Finally, we will describe the grammar for the language. Here it is:

Productions program = program_heading global_declaration_part begin main_statement_part end dot; program_heading = {non_empty} program identifier semicolon | {empty} ; global_declaration_part = var_declaration; var_declaration = var var_decl+; var_decl = identifier_list colon type; identifier_list = {single} identifier | {sequence} identifier_list comma identifier; type = integer; main_statement_part = statement_sequence; statement_sequence = {single} statement | {sequence} statement_sequence semicolon statement ; statement = {assign} identifier assignop expression | {writeln} writeln_stmt ; writeln_stmt = {simple}writeln | {arguments} writeln l_paren expression_list r_paren ; expression_list = {single} expression | {sequence} expression_list comma expression ; expression = {term} term | {plus} expression plus term | {minus} expression minus term ; term = {factor} factor | {mult} term mult factor | {div} term div factor; factor = {identifier} identifier | {integer} integer;

Generating the working classes


In this section we will implement the semantic analyser and the code generators for the language. We start by implementing the semantic analyser below. The semantic analyser The semantic analysis phase completes the static analysis (parsing) phase of the source program, by ensuring that the grammatically correct statements passed to it by the parser make sense. For instance, the syntax of assignment statements in Pascal does not require that the identifiers have been declared, that they are variables, that they are of similar type, and that the operations can be performed on those types. All these restrictions are defined by the static semantics of Pascal, and the semantic analyser performs the corresponding checks. To perform this checking, the semantic analyser completely processes the declarations and creates equivalent information known as property information. This information is stored in a data structure known as symbol table. The symbol table is used so that given any identifier the associated information may be found. For this small subset, the only checking that needs to be performed is that the identifiers are declared before use and that they are declared only once. There is no need for type checking because all the identifiers are of the same type. The semantic analysis is processed by a depth first traversal on the abstract syntax tree, storing information about identifiers in the symbol table after having visited the subtrees of the production containing the variables declaration. These information will then be fetched as we encounter identifiers in productions factor and variable, respectively. Now that we have the required information to implement the semantic analyser, we will now describe how to implement it. Returning to the grammar above, we are now going to explain the effect of the names between { and } in the alternatives of each production. The name of each alternative in the productions of the grammar above are prefixed by A and concatenated by the name of the production to produce a type name for the alternative. For example, for the production below Factor = {identifier} identifier | {integer} integer_literal | ; a type named AIdentifierFactor and a type name AIntegerFactor will be produced. Given the information above, we design our semantic analyser by defining the methods for the alternatives including identifiers. That is, alternatives ASingleIdentifierList, ASequenceIdentifierList, AAssignStatement and AIdentifierFactor. Also, we need to process alternative AVarDecl. Note that for the alternative for var_decl doe not have a name. This is allowed in SableCC if the production is defined by one alternative only, and the resulting type is the name of the production prefixed by A. Here is the implementation of our semantic analyser: package uk.co.brainycreatures.smallpascal; import uk.co.brainycreatures.smallpascal.node.*; import uk.co.brainycreatures.smallpascal.analysis.*; import java.util.*; for the Hashtable public class SemanticAnalyzer extends DepthFirstAdapter { stores the identifiers being defined Hashtable symbol_table = new Hashtable(); create a new table check if the identifier is already in the table and report an error

if it is public void outASingleIdentifierList(AidentifierList node) { identifier to be stored in the symbol table TIdentifier ident = node.getIdentifier(); name of the identifier to be stored in the table String key = ident.getText().toUpperCase(); is the identifier in the table? if (symbol_table.containsKey(key)) { report an error System.out.println(Identifier already defined.); System.exit(0); } else { symbol_table.put(key, key); } } same as above public void outASingleIdentifierList(AidentifierList node) { identifier to be stored in the symbol table TIdentifier ident = node.getIdentifier(); name of the identifier to be stored in the table String key = ident.getText().toUpperCase(); is the identifier in the table? if (symbol_table.containsKey(key)) { report an error System.out.println(Error: [ ident.getLine() + , + ident.getPos() + ] Identifier already defined.); System.exit(0); } else { symbol_table.put(key, key); } } checks if the identifier in the assignment statement was previously declared, and report an error if it wasnt public void outAAssignStatement(AAssignStatement node) { Tidentifier ident = node.getIdentifier(); String key = ident.getText().toUpperCase(); Is the identifier in the table? if not report error if (!symbol_table.containsKey(key)) { System.out.println(Error: [ + ident.getLine() + , ident.getPos() + ] Unknown identifier.); System.exit(0); }

} same as above public void outAIdentifierFactor(AIdentifierFactor node) { Tidentifier ident = node.getIdentifier(); String key = ident.getText().toUpperCase(); Is the identifier in the table? if not report error if (!symbol_table.containsKey(key)) { System.out.println(Error: [ + ident.getLine() + , + ident.getPos() + ] Unknown identifier.); System.exit(0); } } } We define the class SemanticAnalyser as extending class DepthFirstAdapter. This automatically provides a depthfirst traversal of the AST. To check the declarations of identifiers we just need methods outASingleIdentifierList and outASequenceIdentifierList. To check if they were declared before use we just need methods outAAssignStatement and outAIdentifierFactor. For more about how SableCC generates this methods refer to chapters 5 and 6 of [1]. We are now done with the semantic analyser. For our code generation we do the same thing, but we generate code as well as we process the identifiers in the symbol table. The Class Generator The ClassGenerator is also a subclass of class DepthFirstAdapter. To implement the code generator we need to do a depth first traversal of the tree, just as we did with the semantic analyser. But for the code generator we will use alternatives from other productions as well. For example, we will use the program_heading to define the name of the class to be generated. That is, we will use the name of the identifier that follows the program keyword, if the program heading is present in the program. If the program heading is not present in the program, the name of the class will be a.out, just as the executables generated by a C compiler in a Unix based system. We will now start the description of the design of our code generator. For example, from the production program_heading = {non_empty} T.program identifier semicolon | {empty} ; we will define methods outANonEmptyProgramHeading and outAEmptyProgramHeading. To generate a class file using the Jas API we need to create an instance of class ClassEnv first. Then we set the class attributes: its name, its super class, the source from which it is being compiled and its access modifiers. This is all done in the methods outlined above. We also need to create a default constructor. Although we dont realise, the Java compiler generates a default constructor for us. This is how it looks in Java assembly: .method public <init>()V aload_0 invokevirtual javalangObject<init>()V return

.end method To define this method, we can either do it in production program_heading, or even before production program traverses its subtrees. More precisely, on the definition of its inAProgram method. It is up to the programmers preference. In our case, we are going to define the constructor in method inAProgram. To define a method we need to define an instance of class CodeAttr, then we add the instructions with its addInsn method. We also need to create a new instance of class CodeAttr for object main_code. Here is how we implement it: public void inAProgram(AProgram node) { create an new instance of CodeAttr init = new CodeAttr(); main_code = new CodeAttr(); create a new object for main program body try { define the method init.addInsn(new Insn(opc_aload_0)); init.addInsn(new Insn(opc_invokevirtual, new MethodCP(java/lang/Object, <init>, ()V))); init.addInsn(new Insn(opc_return)); } catch (Exception e) { System.out.println(e); } } After defining the method, we must add it to the class. This will be done in either method outAEmptyProgramHeading or outANonEmptyProgramHeading. Here is its implementation: public void outANonEmptyProgramHeading(ANonEmptyProgramHeading node) { name of the class to be generated class_name = node.getIdentifier().getText(); try { set class attributes main_class.setClass((new ClassCP(class_name)); main_class.setSuper(new ClassCP(java/lang/Object)); source from which it is compiled main_class.setSource(new SourceAttr(source_file_name)); make class public main_class.setClassAccess((short) ACC_PUBLIC); add the constructor method to the class main_class.addMethod((short) ACC_PUBLIC, , <init>, ()V, init, null); } catch (Exception e) { System.out.println(e); } } The code is self explanatory, so we wont go into more details. The code for method outAEmptyProgramHeading is similar to the code above, except for the name of the class. It sets class_name to a.out. Proceeding with the grammar, we are now going to define the code that processes identifiers; that is, the code that defines the identifiers. Pascal variables will be defined as Java fields. To define fields in the Jas API, we use method addField of class ClassEnv. The

name of the generated variables will be in lowercase. We dont need a symbol table in order to fetch the signature of a field. Rather, we use the name of the class to which it belongs. Here is the implementation: public void outASingleIdentifierList(AsingleIdentifierList node) { get the name of the name of the variable in lower case String var_name = node.getIdentifier().getText().toLowerCase(); try { add the field to the class main_class.addField(new Var((short) (ACC_STATIC | ACC_PUBLIC)), new AsciiCP(var_name), new AsciiCP(I), new ConstAttr(new IntegerCP(0))))); } catch (Exception e) { System.out.println(e); } } The code above is equivalent to the following Java declaration: public static int <var_name> = 0; where <var_name> is the same as the variable being declared. The code for method outASequenceIdentifierList is the same as the code for outASingleIdentifierList above. So, there is no need to describe it again. Lets now describe how we implement the statements. Before going any further lets analyse how we translate Pascal statements into Java assembly. Because the Java Virtual Machine is stack based, we need to think in terms of stack operations only. So to translate an assignment statement we push all the factors on the right hand side to the stack and then we pop the result into the variable on the left hand side. For example, the assignment statement a := b + c is translated to the following Java assembly code: getstatic <class name>b I getstatic <class name>c I iadd putstatic <class name>a I where <class name> is the name of the class to which each of the fields a, b and c belong. In Java we use the ., but in Java assembly we use the / to access either methods or fields. The I following the name of the variable means that this variable is of type int. The instruction getstatic is used to get the value of static fields of a class, and the instruction putstatic is used to store values into static fields of a class. The other arithmetic instructions that will be used in the code generation of the compiler for this subset are: isub used to subtract integer numbers. This instruction expects to find two numbers on the top of the stack, and if they are there it pops them, subtracts them and pushes the result onto the top of the stack. imul the same as isub, but it multiplies instead

idiv used to divide integers In order to generate the appropriate code, we need to traverse the subtrees of the corresponding operator, and after visiting them we generate the opcode corresponding to this same operator. We need to push numbers and identifiers as we encounter them in production factor, and then generate the opcode for the arithmetic instruction in the alternatives of productions expression and term. Here are the implementations: public void caseAIdentifierFactor(AidentifierFactor node) { String var_name = node.getIdentifier().getText().toLowerCase(); try { getstatic <class_name><var_name> I code.addInsn(new Insn(opc_getstatic, new FieldCP(class_name, var_name, I))); } catch (Exception e) { System.out.println(e); } } Code for integer factors: public void caseAIntegerFactor(AIntegerFactor node) { get the string value String num_image = node.getIdentifier().getText(); int value = Integer.parseInt(num_image); try { switch(value) { case 1 : code.addInsn(new Insn(opc_iconst_m1)); break; case 0: code.addInsn(new Insn(opc_iconst_0)); break; case 1 : code.addInsn(new Insn(opc_iconst_1)); break; case 2: code.addInsn(new Insn(opc_iconst_2)); break; case 3: code.addInsn(new Insn(opc_iconst_3)); break; case 4: code.addInsn(new Insn(opc_iconst_4)); break; case 5: code.addInsn(new Insn(opc_iconst_5)); break; default: if (value => 128 && value <= 127) { code.addInsn(new Insn(opc_bipush, value)); } else if (value >= 65536 && value <= 65535) { code.addInsn(new Insn(opc_sipush, value)); } else { code.addInsn(new Insn(opc_ldc, value));

} break; } } } Code for addition operator: public void outAPlusExpression(APlusExpression node) { try { code.addInsn(new Insn(opc_iadd)); } catch (Exception e) { System.out.println(e); } } Code for subtraction operator: public void outAMinusExpression(AMinusExpression node) { try { code.addInsn(new Insn(opc_isub)); } catch (Exception e) { System.out.println(e); } } Code for multiplication operator: public void outAMultTerm(AMultTerm node) { try { code.addInsn(new Insn(opc_imul)); } catch (Exception e) { System.out.println(e); } } Code for division operator: public void outADivTerm(ADivTerm node) { try { code.addInsn(new Insn(opc_idiv)); } catch (Exception e) { System.out.println(e); } } Code for assignment statement: public void outAAssignStatement(AassignStatement node) { String var_name = node.getIdentifier().getText().toLowerCase();

try { code.addInsn(new Insn(opc_putstatic, new FieldCP(class_name, var_name, I))); } catch (Exception e) { System.out.println(e); } } Code for writeln statement: public void inAWritelnStatement(AwritelnStatement node) { try { code.addInsn(new Insn(opc_getstatic, new FieldCP(java/lang/Object, out, Ljava/io/PrintStream;))); } catch (Exception e) { System.out.println(e); } } public void outAWritelnStatement(AwritelnStatement node) { try { code.addInsn(new Insn(opc_invokevirtual, new MethodCP(java/io/PrintStream, println, (I)V))); } catch (Exception e) { System.out.println(e); } } After generating all the code, we add an extra return instruction, we add the method to the class, and we dump the class. This is done in method outAProgram, that is at the exit of a program. The code is a follows: public void outAProgram(AProgram node) { try { add return instruction to the end of the method code.addInsn(new Insn(opc_return); add main method to main_class main_class.addMethod((short)( ACC_PUBLIC | ACC_STATIC), main, ([Ljava/lang/String;)V, code, null)); generate class file main_class.write(new DataOutputStream(new FileOutputStream(class_name + .class))); output status System.out.println(Wrote + class_name + .class]); } catch (Exception e) { System.out.println(e);

} }

The Main class


Now that we have implemented the working classes, it time to implement the Main class. This class simply activates the lexer, parser, semantic analyser and the class generator. This is always the same for every compiler or interpreter implemented using SableCC. The code for the main class is shown below: package uk.co.brainycreatures.smallpascal; import uk.co.brainycreatures.smallpascal.semantic.SemanticAnalyser; import uk.co.brainycreatures.smallpascal.code.ClassGenerator; import uk.co.brainycreatures.smallpascal.parser.*; import uk.co.brainycreatures.smallpascal.lexer.*; import uk.co.brainycreatures.smallpascal.node.*; import java.io.*; public class Main { public static void main(String[] args) { long start_time, stop_time; times compilation if (args.length < 1) { System.out.println(Usage:); System.out.println( java uk.co.brainycreatures.smallpascal.Main <filename>); } try { start_time = System.currentTimeMillis(); create lexer Lexer lexer = new Lexer (new PushbackReader(new BufferedReader(new FileReader(args[0])), 1024)); parser program Parser parser = new Parser(lexer); Start ast = parser.parse(); check program semantics ast.apply(new SemanticAnalyser()); generate class file ast.apply(new ClassGenerator()); } catch (Exception e) { System.out.println(e); }

} } Now that we have implemented everything, it is time to compile all the source code. Assuming that the current working directory is in our CLASSPATH, we compile the compiler as follows: C:\mycompilers\javac uk\co\brainycreatures\Main.java If no errors were found in the source, we will have all the sources compiled. We then just have to debug the compiler. That is, test it with small pascal programs. The best way to test a compiler is by inputig erroneous data to it. That is, invalid programs. If you have any queries, please do not hesitate to contact me. Best Regards Fidel. References: 1. Etienne Gagnon,"SableCC, An ObjectOriented Compiler Framework", Masters thesis, McGill University, Montreal, Quebec, March 1998.

Anda mungkin juga menyukai