UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN Z-^^^ UTlJCDCS-R-75-7i+5 }u). 7/y TXrf^^/ IHii LIUKAKY OH iH£ SEP 8 1975 iirjivfriTv OF II I iMOr. AN INTERACTIVE TABLE -DRIVEN PARSER SYSTEM by Michael Harry Tindall August, 1975 Digitized by the Internet Archive in 2013 http://archive.org/details/interactivetable745tind UTUCDCS-R-75-7^5 AN INTERACTIVE TABLE -DRIVE PARSER SYSTEM by Michael Harry Tindall August, 1975 Department of Computer Science University of Illinois at Urbana-Champaign Urbana, Illinois 6I8OI *This work' was supported in part by the IBM Corporation, Thomas J. Watson Research Laboratory and the National Science Foundation, Grant No. EC -^1511 and was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, August 1975* iii ACKNOWLEDGEMENT I would like to express my graditude to my thesis advisor, Professor Thomas Wilcox. He provided many s\:;ggestions and much useful guidance during my work on this project. Thanks are also in order for Mike Mj.lner and Bob Barnett who worked on the initial implementations of Fortran IV and COBOL. They provided very useful feedback on the operations and design of the overall system, and their assistance is greatly appreciated. Also, a word of thanks goes to Cinda Robbins for her excellent typing of this manuscript. (. iv TABLE OF CONTENTS Page ACKNOWLEDGMENT iii 1. INTRODUCTION • 1 1.1 Organization of this Thesis 1 1.2 The Compiling Environment 1 1.3 The Interative Compiler 2 2. THE TRANSITION DIAGRAM PARSING MODEL k 2.1 The Basic Model h 2 .2 Extensions to the Model 9 3. ASSEMBLER SYNTAX SOURCE INSTRUCTION SPECIFICATION ... 10 3-1 Introduction and Chapter Organization 10 3.2 Overall Organization of the Syntax Specification ik 3.3 Parser Action Instruction Description 22 3.^ Description of Valid Parameters Used in Action Instructions 36 h. THE PARSER TABLE INSTRUCTION FORMS k^J U.l Notation., i+Y U.2 The Table Instructions 48 5. THE COMPILER'S TABLE MAINTENANCE SYSTEM 57 5.1 Maintenance System's Purpose 57 5 .2 General Operation of the Maintenance System ... 57 5.3 Logging into the Maintenance System . 58 5-4 Preparing the Syntax Table for the Compiler System 64 6. FUTURE DEA/ELOIMENT 67 LIST OF REFERENCES 69 APPENDIX • 70 CHAPTER 1. 1. INTRODUCTION 1.1 Organization of this Thesis This paper discusses the design and implementation of a table- driven syntactic parser with concurrent static semantic checks to he used in an interactive compiling environment. After a brief introduction in this chapter to the compiling environment in which the parser is to be used, and the the general operation of the parser system. Chapter 2 will present a model of the selected transition diagram parsing technique. Chapter 3 contains documentation on the parser's assembler language syntax source instruction specification. Chapter h contains detailed documentation on the form of each instruction that is used in the actual parser table. Chapter 5 describes the ta.ble maintenance system that is used by the compiler system, and the thesis concludes with a few comments about further refinements that can be made to this parsing system. 1.2 The Compiling Environment Recently a project at the University of Illinois at Urbana- Champaign has been under way to automate the teaching of the basic Computer Science courses by utilizing the PLA.TO IV computer-aided instructional system that is being developed on this campus [h]. This computer system features an excellent graphical display terminal and fairly sophisticated computer-aided instructional software support for the writing of instructional lessons and the corresponding course curriculum [5]. The curriculum that is being implemented will teach new 2 comp-uter science programming language concepts and constructs. Specific programming detail on a variety of languages (e.g., FORTRAN IV, PL/1, COBOL, basic) is available; students will progress at their own speeds thro-ugh a fairly flexible course structure [2], An important part of the system is an online compiler in which a student can easily and conveniently try out new programming constructs immediately after learning about them in an instructional lesson. The remainder of this paper will discuss this compiler and, in particular, the parser system that is used in the compiler. r 1.3 The Interative Compiler A number of design criterion emerge from examining the environment for this compiler system. First, the compiler should be as interactive as possible to utilize the PLATO IV system effectively and to maintain a desirable computer-aided instructional environment for the student. To accomplish this, the compiler compiles character-by-character, that is, each single key press by the student using the compiler is examined immediately as the student types it in; thus the compiler keeps up completely with the student and detects programming syntax errors as soon as possible. The student is able to edit his program by moving a cursor through the program on the screen. The compiler moves with the cursor, compiling when the cursor moves forward in the program, and backing-up ("un compiling", i.e., resetting the lexical and syntactical analyzers to previous states) when the cursor moves backward in the program. Thus, the compiler is highly interactive and easy to use. A second design criterion for the compiler is that it be multilingual. To accomplish this, the compiler is completely table -driven; to allow 3 another language to be recognized by the compiler system and used by students, a language designer must merely fill in a new set of tables and provide an execution supervisor system for the actual interpretive execution of compiled programs. This paper is concerned with one of these compiler tables, neunely, the syntax parser table. A third design criterion for the compiler is that it provide a. a high and sophisticated level of error diagnostics for the student when a syntactic or semantic error is detected in the program by the parser system. Since the intended users of the compiler are beginning students, the error messages must be direct and to-the-point . To accomplish this goal, an automatic, interactive error diagnostic system has been designed and implemented [6]. The important point for this discussion is that this automatic error system is driven by the compiler'. syntax tables, that is, it is essentially language-independent. Thus, it is apparent that these syntax tables are a very important part of this compiler system. We now examine a model for the transition diagram parsing technique used by the compiler system. CHAPTER 2. 2. THE TRMSITION DIAGRAM PARSING MODEL . 2.1 The Basic Model The syntactic analyzer used in this interactive compiler is based on the transition diagram systems first introduced by Conway [1], and recently formalized by Lomet [3]. A transition diagram system consists of a set of nested push-down automata (NPDA) that have the capability of invoking one another. The remainder of this section will present first an intuitive, graphical description of transition diagram systems, followed by a slightly more formal description of the transition diagram model. A key concept of a transition diagram parser is that of the parser "STATE": the STATE is a descriptor that maintains information about what input has already been accepted and what further inputs would be acceptable to the parser. While "STATE" is a very important concept, it is a very easy thing to visualize and implement in a transition diagram system. For the rest of this paper, STATE refers to this "state of the parse". "The action of a transition diagram parser is to examine the "possible" or "acceptable" parsing options (determined by the current STATE and the (transition idagrams) along with the current input token; based on this information, the parser will accept the token by updating the STATE information and asking for a new input token, or reject the current input token and signal a syntactic/semantic error if the token does not satisfy any of the available options. Note that all the parsing options are defined in terms of "tokens" only, that is, the tokens that are acciomulated and output from the lexical analyzer. This process can be conveniently shown graphically as follows : Let SI denote STATE "Si"; each branch out of a STATE corresponds to a possible syntax option for that STATE: these branches are labeled with their particular syntactic option requirements (these labels will be described as the paper progresses). Then, a PL/l "GOTO" statement can be shown in transition diagram form as i "GOTO" 'GO' S2 'TO' [label-name] /"^)A ";" , ^^ This is interpreted as: if the parser is in STATE SI and the current input token is "GOTO", make the state transition to STATE S3; if the following token is a [label-name], move to STATE Sk; if the token after that is ";", accept it, and (in this case) accept an entire "GOTO" state- ment. Note that if none of the branch options for the current STATE satisfies the current input token, then that token is in error, and the normal parsing error condition should be signaled. Another example is a PL/l "IF" statement: SI 'IF' S2 S3 "THEN" $k ST In this case, notice the references to and as labels on option branches: this indicates that if, for example, the parser is in STATE S2, then when trying to accept the input token, it should refer to another transition diagram in the system, corresponding to ; after a has been parsed and accepted, the parser should then return to STATE SS, having successfully satisfied the option branch out of STATE S2. This is an example of one transition diagram "invoking" ("calling", "referring to") another . One more example is needed to illustrate another important feature of transition diagrams: a transition diagram system which has been invoked by another transition diagram has the capability of return- ing to one of a number of possible STATES in the invoking transition diagrsim. A good example of where this is useful arises in trying to parse a (simplified) PL/I as follows: expression parentheses ( ( ( I + 100 ) * J ) = K ) L conditional expression parentheses When the initial parentheses "(" are examined, it is not knovm whether they are part of the overall conditional expression or are part of the inside simple expression. The technique to solve this ambiguity is to allow an "invoked" transition diagram to return to more than one STATE in the "calling" transition diagram, depending on what tokens are found later on. The diagram for can be drawn as : 8 In this case, each time is invoked, 2 return STATES n are specified (i.e., Rl and R2); when the parser reaches , it returns to return-state n. Note that the way in which this has been drawn corresponds to assuming that the initial "(" belongs to the overall conditional expression; if it turns out they actually belonged to the first simple expression, then the parse is resumed at point 5, which accepts the assixmed "conditional" parenthesis and continues parsing the simple expression. As a more formal description, a transition diagram system consists of a set of nested push-down automata (NPDA) that have the capability of invoking one another. Each NPDA is capable of reading a portion of the input string and accepting or rejecting it. Lomet calls the NPDA that are capable of being invoked "submachines"; the initial STATE of a submachine is known as its "entry" STATE and a submachine is invoked by the use of its entry STATE number by another NPDA. This results in the invoking STATE being saved at the top of a parser stack and the parse resumed at the new entry STATE. Each submachine also contains one or more "exit" STATEs; when an exit STATE is reached, the top of the stack together with the particular exit STATE determine the STATE in the original invoking NPDA with which to continue the parse. An error in the parse is detected if an NPDA reads a token from the input string for which there is no corresponding STATE transition that the NPDA can make. The reader is referred to the discussion by Lomet [3] for further technical details. 2.2 Extensions to the Model The preceding discussion in this chapter describes a parser model that is sufficiently powerful to recognize all deterministic context-free languages [3]' A few extensions to the model have actually- been included in the compiler's parser system to enable the handling of context-sensitive static semantic requirements such as proper and consistent declaration of attributes for an identifier, consistent references to declared array variables^ etc. These extensions include allowing auxiliary memory variables to be utilized by the parser (for operations such as counting the number of subscripts in an array reference), allowing the labels on any branch in a transition diagram to refer to any of these auxiliary variables or any symbol table field, and allowing special, language -dependent semantic -subroutines to be invoked at appropriate times during the operations of the parser system. To summarize the activities of a transition diagram parser: it must be capable of requesting that a new input token be read in; it has to be able to test the current input token to decide which labeled branch to follow out of the current STATE; it must have facilities to manipulate a return-state parser stack; and, finally, it must be able to perform any semantic analysis that is required by a particular language. 10 CHAPTER 3. 3. ASSEMBLER SYNTAX SOURCE INSTRUCTION SPECIFICATION 3.1 Introduction and Chapter Organization 3.1.1 The Syntax Parser Table Description A table-driven parser system has been designed that incorporates the actions described in the preceding chapter. The syntax table is actually an encoding of the programming language syntax productions in a small, interpretable instruction set. The parser of the compiler consists of a routine that interprets this instruction set in an appropriate way (this routine can be called the "table interpreter" or "table driven" routine). The parser has control over and maintains certain tables and data structures within the compiler. All of these tables and structures are located in a region of memory that is referred to as the parser storage area. One particular variable that is maintained is the parser STATE variable. This variable always points to some instruction in the syntax table. As the input string is parsed, this state pointer variable is updated (i.e., moved to point to a different sequence) to reflect the current state of the parse. The parser maintains two stacks. The first is the regular parsing return-state stack which is used when one NPDA invokes another. The other stack is called the "variable stack"; it is used by a language implementor to save miscellaneous information as the parse of an input string progresses; the language implementor has control over the 11 manipulation of entries in this stack: when to put new entries on the stack and delete entries from the stack (synchronized with PROC entry and exit), or change the value stored in an entry on the stack. There are also certain tables that are maintained by the parser. These include the compiler's symbol-table and the compiler's block structure tables. Entries from these tables can be examined or modified by the parser with the complete control of the language implernentor. The action of the parser system in the compiler is to examine and interpret, beginning at some location in the syntax table (determined by the STATE variable), the table instructions that specify the acceptable syntax for the programming language. The instructions are interpreted sequentially unless a particular instruction modifies the parser's table instruction pointer. For convenience in specifying the instructions in the parser table, an assembler language representation for each table instruction has been designed, and an assembler program is provided as part of the compiler system's maintanence utilities (chapter 5) that translates the assembler source language representation into the actual table instruction form that is used by the compiler system.. The remainder of this chapter documents the assembler source language specification requirements, and chapter h documents the actual table form of the parser table instructions. 3.1.2 Chapter Organization This chapter documents the assembler source language that is used to specify the syntactic requirements for a programming language. 12 Once a syntax source representation for the language has been prepared (using the compiler system's table builder option), an assembler program will translate the source representation into a compact table form to be used by the actual compiler. The chapter is divided into three major sections as follows : Section 3.2: Overall Organization of the Syntax Specification. Discusses the proper ordering of instructions for the syntax specification; discusses the function and use of procedures (PROC-EKD blocks) in the specification; explains the form and the use of the mnemonic definition (DEFINE) instruction, the storage allocation (ALLOCATE) instruction, the different error NAME instructions, and the purpose and form of the FINAL PARSE STATE instruction. Section 3-3'- Parser Action Instructions. Discusses the form and purpose of the instructions that are used to control the actions of the parser: SCAN, GOTO, CALLI, CALL, RETI, RET, BC, SEMA, and the auxiliary environment - changing instructions (ASSIGN, MASKON, MASKOFF, ADDIT, SUBIT). Section 3-^: Description of Valid Parameters used in Action Instructions, CLASS, PDN, UDN, PDSTP, UDSTP, defined constants, ALLOCATEd variables, Symbol-table entries. Block-table entries and TEMP variables . 3.1.3 Source Text Preparation Rules The rules for preparing the syntax source specification for the assembler program are as follows: 13 1) Names: All def Ined-names , variable-names, PROC-names and label-names discussed in this paper are of the form: Any ccambination of 9 letters and number, with the first character being a letter (note that capital letters are acceptable, but they require 2 character positions in the name), 2) Form of instructions: a) All instructions in section 3.2 are of the form: eol that is, one instruction per line. The eol is inserted automatically by the editor program when a line is terminated. No explicit spacing is required within a line (free form, extra blanks ignored), b) All instructions in section 3.3 (Action Instructions) are of the form: