'L I B RAHY OF THE UNIVLRSITY OF ILLINOIS 510.84 IZQT no. 237-242 cop. 2 The person charging this material is re- sponsible for its return on or before the Latest Date stamped below. Theft, mutilation, and underlining of books are reasons for disciplinary action and may result in dismissal from the University. University of Illinois Library MARJ8 mar*;* DEC ? ': ft$ L161— O-1096 Digitized by the Internet Archive in 2013 http://archive.org/details/translatorgenera237mcin ^ a37 Report No. 237 July 12, 1967 C00-1018-1121 THE TRANSLATOR GENERATOR SYSTEM by A. W. Mclnnes Z^ I ^ DEPARTMENT OF COMPUTER SCIENCE • UNIVERSITY OF ILLINOIS ■ URBANA, ILLINOIS Report No. 237 THE TRANSLATOR GENERATOR SYSTEM by A. W. Mclnnes July 12, 1967 Department of Computer Science University of Illinois Urbana, Illinois 61801 Supported in part by Contract AT (11-1) -1018 with the U.S* Atomic Energy Commission and the Advanced Research Projects Agency. THE TRANSLATOR GENERATOR SYSTEM A. W. Mclnnes 1 . INTRODUCTION This paper describes a system designed to produce a scanner from the Backus Naur Form definition of the syntax of a programming language. The theoretical aspects of this problem are considered in the paper by Eickel and Paul [1J. This problem really involves two important problems: (a) Parsing problem To give a method which decides whether or not a given string, according to some rules of a language, belongs to it, and if so to give the corresponding rules, (b) Ambiguity problem To decide whether or not a language is unique, i.e., whether or not for every string in the language the corre- sponding set of rules is uniquely determined. In this paper we restrict our attention to Chomsky two grammars (i.e., having a production system having just one element of the alphabet of the language as left hand side of each production) having just one distinguished element, Z, of the alphabet to which every sentence in the language may be reduced by repeated applications of the members of the production system. The main purpose of this system is to produce a purposive parsing system i.e., if in a left to right scan of a string, we find any production (replacement rule in the syntactical definition of tne language; matched by a substring of the given string, we can make the indicated replacement knowing that this is the only possible replacement. However, due to the practical necessity of providing bounds on the algorithm at various stages, it is not clear that such a goal is possible for every such language considered. The size of the subset which can be successfully treated will be a matter of practical experimentation, and possible modi- fication based on experience with the system, 1 . 1 Basic Conc epts Let (?< = (A, B, C, .».., Z} be a fin ite set of distinguished characters which we shall call the alphabet. The elements (a, b, c, „,.„.) of the free semigroup J (Of) over Of under concatenation will be called strings over 01 . The unit element (or empty string) of/ ('-J will be denoted by e. If x = X,X„ . .. X (X, e 'Jf ) , then Ixj - n :'.s called the J 1 2 n k length of x. A finite set II of ordered pairs of strings, n - { £L :: = b ; a , b R e fY#) "* (e } ; k = l s 2, , .. , , ra } is called a production system in j (OJ) - A production a :: = b (read b, may be replaced by, or becomes, a ' ) may be regarded as a replacement rule in the defining syntax of the language, a is called the left side and b, the right side of the production a :: = b , II generates relations p and p in -/ (D'l) in the following manner; Definition 1 u, v e -f(Oj) : u p Q v iff x, y e f(<%) , 3 (a : : = b) e II 9u™ xby, v = xay i.e., for given strings u and v in the free semigroup, we define u in relation to v (u p v) if and only if there exist strings x and y in the free semigroup and a production a :: = b in the production system II, such that the string u has the form xby and v has the form xay. Definition 2. u, v e f(a) : u p v if f 3 1 ' 2 ' •••• n ' 1) -2- i.e., we define the string u to be in relation to the string v (u p v) if and only if there exists a finite sequence of strings u such that tL p u, . according to definition 1, and such that we are led by repeated applica- tions of the relation p from the string u to the string v. That is the relation p is the transitive closure of the relation p . For u p v we o shall say that u is a word for v. A special subset of (7( is distinguished: 01 ~ { X | X e 0( » X does not occur as a constituent of the right side of a production of II } We call the subset 0( the set of anti-terminal characters in the alphabet. The system we wish to consider is a quadruple Z = ( (?{ , II, Z, y) where 0[ is an alphabet, II is a production system in f (.01) , Z is the -vJ. distinguished element in U( , and Y = ( x x e * (Ol) , x p Z } i.e., the set of all strings in the free semigroup which are words for the distinguished element Z, including the case x = Z. However note that we have restricted our consideration to those systems having a production system of which the left side of each production is just a single element of the alphabet (X . The set y of such strings is called a language. Definition 3 (critical production) : A production A :: = B.B ... B is called critical if there r 12m exists a k e { 1,2, .... m} and a production A' :: = b' in II, not id b' identical with the production given above such that B occurs the string Using this concept we now define a relation p contained in p o' -3- D efinition _4 u, v e f(C'(). u p" o v iff x, y e /^, 3 (a :: - b) e il 3 a ; : = b is not critical, u = xby , v = xay. i.e., the relation p is just the relation p restricted to the non- ' o o critical productions. Corresponding to definition 2, we also define the relation p as the transitive closure of p ; that is u p v if we can pass from u to v by a finite sequence of applications of p using only the non-critical productions „ Clearly p contains p. The relation p has the following property: (//) u, v, w e ■/([}/) '• upv,upw =>vgw To demonstrate this property by an example let us suppose we have productions: A l :: =B 1 A 2 :: -B A :: = B The first is the only non-critical production in this system= Let us suppose u=aB bBcBd Thus if v=aAbBcBd We have u p v Now if w = a A b A c A_ d then u p w since for p we may use any of the productions and we have the following sequence of relations p : o -4- u = aBbBcBdp a B, b B c A„ d p a A, b B c A„ d p a A, b A^ c A„ d 1 o 1 3 o 1 3 o 1 2 3 and note that these two relations satisfy the condition that p is just the relation p restricted to the non-critical productions. Thus we clearly have v p w since v = a A bBcBdpaA b B c A d p a A b A„ c A. d Note that this property # is not true for p since we could then have u p v = a A b A„ c B d and u p w - a A b A p c A- d In this case it is not true that v p_ w since although v p v = a A b A., c A d , there is no way we can replace the element A„ between the substrings b and c to become the element A required for the string w. Thus we have come to a "dead end" and have been led outside the set y consisting of the strings comprising our language. Thus if the relation p (p) had this property (#), then the parsing problem would become trivial for then u g y, v g ffa) : u P v i>v g 7 , and this would mean that the application of the syntactical rules ac- cording to p never leads from inside to outside of 7. In general however, p does not have this property. Therefore, the main goal to be achieved is to restrict the production system II to a syster by means of contextual conditions such that the orrespondingl^ stricted relation p* (C p) has the above mentioned property (#) . All ambiguities in the language are characterized by the fact that they are words for Z according to more than one essentially different ways. In other words we may obtain more than one distinct sequence of -5- replacement rules (or productions) which may be applied to a given string and reduce it to the distinguished element Z. Thus the given string be regarded as a valid word in the language in more than one essentially different ways - i.e., in the language of programming, it may be scanned in more than one way and still turn up as an apparently legal construct. This is essentially the problem as discussed above except that the "wrong choice" of a production now leads again into the set y instead of outside of y and to a "dead end" as before. In order to rigorously explain "essentially different" as used in the previous paragraph we give the following definitions: i )efinition 5 (local „ambi £uit x)_i. A string x is called lo^^lj{_ambij'uo_us_ if there exist in II two different productions a : : » b and a 1 :: = b' with |x|< |bj + |b'| where b and b' are substrings of x, which cover x completely. Thus since the length of the string x is less than the sum of the lengths of the strings b and !>' and since {b} (J {b'} covers x com- pletely, it follows that {!>} f\ {b'} ^ . Thus it is clear that the strings b ana b' have at least one element in common and so since the productions are different, if there exist two such productions they must be critical by definition 3. From definition 5 it follows that for ev< \ locally ambiguous string x, there exist u, v, u' , v' e -/^(Of) with x=ubv=u' b'v' , x p u a v ~ x, o 1 x p u' a' v' = x~ o 2 De finition 6 (ambiguity) : A string m is called ambiguous iff there exist u, v e y and a locally ambiguous string x (where x p x , x p x^) such that m p_ u x v , u x v P Z and u x v P Z This rigorously defines what we mean by saying that the string m is a word for Z in two essentially different ways, and, places in terms of the concepts rigorously defined above, our intuitive idea of an ambiguity as a construct having two (or more) apparently valid interpretations within the structure of the language, Thus it is clear, that in order to isolate those elements of IT which are causing either ambiguities occuring in Z, or the relation p not having the quoted property(#), we need to isolate the critical productions of n. We then attempt to resolve these problems by finding contextual conditions on the critical productions and obtain a unique system which contains no ambiguities. The remainder of this report gives details of how this goal may be achieved. Chapter 2 deals with the reduction of our system to a Simple Chomsky system; Chapter 3 considers the determination of the contextual conditions for the critical productions and Chapter k describes a purposive scanner obtained from the system. 1.2 Example In order to better explain some of the concepts mentioned above, we shall illustrate by means of a short example. Suppose we have the production system: F = (E) = F = T * F F T T E E Z = E + T =■■ E It will be seen that these are part of the productions for an arithmetic expression in ALGOL, where V = variable, F = factor, T = term and E = expression, which becomes the distinguished element in the system. -7- Suppose we consider the string V + V * V. Note that all productions except the first are critical so we may expect that some methods of reducing this string will lead to dead ends. There is a production F : : = V. So suppose we replace all the elements V in the string by F. The string now becomes F + F * F There is a production T :: = F, so we can replace every F by a T. The string becomes T + T * t Similarly, using the production E :: = T, the string becomes E + E * E and finally using the production Z :: = E, the string becomes Z + Z * Z. However, we have now reached a dead end since the string has not been reduced to the single distinguished element Z and no further reductions form the production system can be made. Let us consider what went wrong. By an inspection of the production system and our intuitive knowledge in this case, we find that the problems began in replacing the F by T. If the final F in the string had not been replaced, we would have obtained the string T + T * F and the last three elements of the string is a right side of a production, T : : = T * F. Making this replacement we obtain the string T + T -8- However, if we again use the production E : : - T in both cases here, we are again led to a dead end Z + Z But if we use this production to replace only the first T we obtain the string E + T which is again a right side of a production E : : = E + T. Making this replacement and then using Z : : - E, we have finally reduced the string to the single element Z as required Thus, our reduction of the original string V + V * V, would have been satisfactory if we use the production T :: ^ F to repl F by T, except when F is preceded by a *. Or expressing the same con- dition constructively, we use the production T : : - F only when F is preceded by a '(' or a '+' or 'e f (e is the null string)- Thu; we replace the production T :: s F in the system by the three produ eT : : - eF (T ;: - (F +T :: - +F Then we can make purposive replacements of F, using any production whos< right side matches a substring of the given string, and then being sure that we will not be led to a dead end= Similarly for the production E :: - T, we find that we m replace T bv E except when T is eded by a '-»", Hem replace this production in the system by the three productions -9- eE :: = eT (E :: ■ (T *E :: = *T We now have a purposive passing scheme using this modified production system which avoids leading to dead ends by the addition of contextual conditions to certain productions. The following system attempts to implement in a systematic manner, our intuitive evaluation of this simple system. -10- 2. SIMPLE CHOMSKY SYSTEMS The first part of this system will transform the production system of a CHOMSKY - language given in Backus Naur Form into a simple production system for the same CHOMSKY - language. The essential property of a simple production system is that its critical productions are easily recognizable and are all of length one. The description of the implemen- tation given in section 2.2 should be read in conjunction with the flow charts for the first part given in the appendix* 2.1 Basic Concepts We are given a CHOMSKY system, which includes a set of produc- tions II. Initially we reduce the given CHOMSKY system to a simple CHOMSKY system which implies that the set of productions II must be trans- formed to a set II 1 where II 1 is defined in definition 7. Definition 7 (Simple CHOMSKY System) A CHOMSKY System E' = { & IT, Z, y 1 } is called Simple if n' = ( k^ :: = b k ; k = 1, 2, ... n } has the following properties: 1) ! l>, j ^ 2 i.e., the right sides of the productions have length one or two. Thus all productions of n' have the form: A. :: = S. T. i - 1, 2, , £ l li A. , : : ■ U . j = 1, 2, , m 2) The sets {S.}, {T.}, {U.} are pairwise disioint. i i j r J 3) S t S , T $ T for all p, q - 1, 2, . .., I and p # q. pqpq r»-i >» r-i Thus all the critical productions are of length one and the local ambi- guities (i.e., having more than one replacement possible from a given element) are exactly the right-hand sides of the critical productions. -11- The simple CHOMSKY system is obtained from the given system by carrying out the following operations: i) If there is a production X :: ■ X.X....X where n > 2 then 1 2 n a new character E is created and this production is replaced by the pair of productions: E :: =X X X 2 A * I ~~ Lj An ° O • A 3 n The second of these new productions is of length n - 1. If n - 1 > 2 then this procedure is repeated iteratively until all the new productions have length 2. The new character E is adjoined to the alphabet 01 . ii) If there exist two productions of the form A :: ■ XT; B :: ■ SX (or a production of the form A :: = XX), then the symbols X^ , X are created and the productions are replaced by A :: -XjT or A : : '\\ B II -SI, X R : : = X X, :: -I \ : : = X \ ■■■■ =X The new characters X_ , X^ are adjoined to the alphabet 01 . iii) If there exist two productions of the form A : : ■ XT (A : : - SX) and B : : = X, then the symbol E is created and the first production is replaced by A : : - ET (A : : = SE) E : : - X The new character E is adjoined to the alphabet vl . -12- iv) If there exist two productions of the form A : : = ST and B : : « ST then a new symbol H is created and these productions are replaced by H : : = S A : : - H B : : = H The new character H is adjoined to the alphabet Ot • v) If there exist two productions of the form A : : = XT and B : : - XT where T f T (or A : : = S.X and B : : - S 2 X where S i S 2 ) then the symbols G, H are created and these productions are replaced by A : : = GT (A :: = S 1 G) G : : = X B : : =HT 2 (B :: - S 2 H) H : : = X The new characters G, H are adjoined to the alphabet 6y . This construction, if carried out in the given order, leads after finitely many steps to a simple CHOMSKY system. The proof of this assertion is given in the paper by Eickel and Paul [1]. -13- 2 .2 Implementation 2„2,1 Functional Description In order to accomplish this reduction and also in order to facilitate the next stage in which the ambiguous productions are resolved by means of context, the productions IT are collected in three stacks: (1) Stack S (with stack pointer p_) which contains all productions of length 2< (2) Stack S 1 (with stack pointer p..) which contains all unambiguous productions of length 1. (3) Stack S (with stack pointer p ) which contains all critical productions (of length 1 by hypothesis) . As each production of the given system is read in, each new alphabet element is entered in the symbol table u[ in the first avail- able location o The location of this element in the table (ft gives its internal number which is used in all subsequent manipulations • Each element of the symbol table Qf is of one word in length and has the following form: Flag bit on first byte i r 1 — J 1 — I — 1 1 — I i — I — 1 — ! ! — I 1 — I ! < — I -\ — i — i — i — r I Z K. ^T ~J Five alphabetic characters of external representation. Ol Pointer, p , to the next available word in the table. Thus p will become the a internal number of the next symbol entered in (Jl . -14- If the new element is an anti-terminal symbol (i.e., a member of Ol which consists of all those symbols that do not appear on the right-hand side of a production), then the flag bit on the first byte is set to one. The first two bits of the word are used to signal certain conditions on the symbol represented in the rest of the word. If both bits are zero then the external symbol is not more than five alphabetic characters in length and is represented in BCD code in the remaining 30 bits of the word. If both bits are one then the external symbol is more than five alphabetic characters. In this case only the first character is represented in BCD code in the remaining six bits of the first byte and the last three bytes contain the link address of the first byte of the sequence of consecutive bytes which contain the remaining characters of the extended symbol. The flag bit is set on the last byte of this sequence to indicate the length of the extended symbol. If the first bit is one and the second bit is zero then a "created symbol" is indicated. In this case the external representation is left blank, but the internal number corresponding to this position is used in the pro- ductions to represent a created symbol as outlined above. To summarize: External symbol of length ^_ 5 alphabetic characters Created symbol Extended external symbol ;,,e., lengtt 5 characters , In order to conserve storage and yet not be limited by ; he num of symbols that may be used in a given system, the internal number will be represented in the following form: Since in general the reduced system will not have more chan 255 symbols in its "alphabet", the internal number of each system will be represented by one byte. If however there are more than 255 symbols in the alphabet then one byte is added to allow another 255 elements to be Bit 1 Bit 2 1 1 1 -15- represented. This process is continued for as large an internal number as required i.e., if the first byte has the value 255 (all bits are units) then to find the internal number the value of the next byte is added to 255. If the second byte also has the value 255 then the next byte is considered and its value is also added on. e.g- 11111111 11111111 00000011 Internal number = 255 + 255 + 3 The forms of the stacks S_, S n , and S are 2 1 u V s r a. S J T. l i i 1 1 i 1 Po " * b. J U. J C k v k 1 1 J 1 P x - ! 1 < k < p - 1 — — r u where the symbols denote the i-th, j-th, k-th elements of the respective stacks, and where each symbol represents an internal number of one byte in length in general, but extended as above when necessary. -16- In the procedure outlined below for making the required reduction to a simple Chomsky system, the operations described previously are combined in a series of subroutines. As each production is read in, each symbol is compared with the symbols already in the symbol table 0{ , and converted to its internal number. The production, represented in internal number form is thus built up in a special location called PR0DN, from which it is pushed into the appropriate stack after any necessary operations have been carried out. If these operations involve the creation of a new symbol, then the 'created symbol' bit is set on in the next available cell in the symbol table Ql , the external symbol representation is left blank, and the corresponding internal number is us ad in PR0DN as the created symbol. The pointer p is of course increased by one. The symbols of PR0DN are denoted by a, S, T (not subscripted) for the zeroth, first and second symbols respectively . The procedure to accomplish the reduction and set up the system for the resolution of ambiguities, is a relatively simple one which calls on three subroutines; C0MPAR, E and RH0 which will be explained in detail below . Note that the left hand side of a production is called the zeroth element and the first, second etc, elements of the right hand side are termed the first, second etc. elements respectively of the production. The parameter n determines the position of the symbol of the production currently under consideration and the parameter j is a measure of the number of elements of the right-hand side of the productions which match previous right hand sides (see Subroutine E(n)). -17- 2.2o2 Operation Description The main procedure SIMPLE A new production is read in, the location PR0DN (where the produc- tion is to be assembled in its internal form) is cleared, the parameters n, j are initialized and a call is made on the subroutine C0MPAR to find if this symbol has already been used in a production. The next symbol is tested and if this is empty and if n = (i.e., a left hand side was being considered) then there is an error condition, ERR0R 1, since the production did not have a right hand side. If n 4 then, since all symbols of this production have been read in, a call is made to the subroutine E which will push the produc- tion in the appropriate stack depending on the value of the parameter j* If, however, the next symbol is not empty, then if it is not the third symbol, the comparison procedure with the symbol table 0[ is again repeated with this new symbol by means of the subroutine C0MPAR, as before. If the next symbol is the third then the subroutine RH0 is called, which separates off a pro- duction of length two as described in operation (i) above and third symbol now becomes the second before repeating this process of comparison. Eventually the "next symbol" will be empty (the productions are of finite length) and the procedure loops back, as above, to consider the next pro- duction. Subroutine C0MPAR (n, j) This procedure searches through the symbol table Cn (whose i-th element is denoted by N.) to check if the n-th symbol of the production matches any element of Q{ , i„e., the n-th symbol has been used in a previous production. If no match is found then the new symbol is entered into the symbol table Q( . Note that if the symbol is longer than five alphabetic characters then only the first character is entered into Of and the remain- ing bytes contain a link address to the extension as explained previously, The pointer p is tne internal number of this new symbol and this is placed in -18- the n-th position of PR0DN (where if the internal number is greater than 255 it is converted to the representation described above). If n ■ the symbol is a left hand side and so the flag bit is initially set, to denote that this symbol possibly is an anti-terminal symbol. However if a match is found and we are not considering a left-hand side (i.e., n # 0) then a check is made to determine if the matched symbol was provisionally marked as an anti- terminal symbol (i.e., the flag bit is set) . If this is the case then the flag bit is set to zero since we are considering a right hand side and the symbol, therefore, is not anti-terminal. If the flag is zero then frhis right-hand side symbol matches a right-hand side symbol of a previously considered production and the parameter j is adjusted accordingly. Whether the symbol is a right or left hand side symbol the loon variable i (converted to the required representation if necessary) is placed in the n-th position of PR0DN and a return is made to the calling point of the procedure. Note if n = 1 and this first element of the right-hand side was now matchedchen j is set to the value 1/2 before the return, in order that j has the correct value consistent with the description in the subroutine E(n) in the case that the second element is matched „ Subroutine KH0 (n) This procedure separates off a production of length two from a production of length greater than two, as described in operation (i) . A new symbol is created in the manner previously described. Since the zeroth element (left hand side) of the production will be needed for further processing of the over length production, this is stored temporarily before placing the newly created symbol (in internal number form) in the zeroth position. Then, since the subroutine E must be called to determine into which stack this production of length two is to be pushed (since there may -19- have been matches in the first two elements, the subroutine E will destroy the contents of PR0DN) the contents of PR0DN are stored temporarily before calling E. Upon return from E, processing of this production is continued with the original left hand side as left hand side of the production and the left-hand side of the split off production as the new first element. The third symbol of the production being processed now becomes the second symbol upon return. Note that in the main procedure, after calling RH0, the parameter j is set to one half. This is because the first symbol is now the created symbol and so cannot match any symbol of a previous production. Subroutine E(n) This is the main subroutine which determines into which stack the production is to be pushed. It combines all the operations (ii) , (iii) , (iv) , (v) in case any of these conditions are met. This subroutine in turn calls on three further subroutines: a) REPL4 in case condition (iv) is met, b) REPL2 in case either conditions (ii), (iii) or (v) are met, c) SSI which searches the stacks of productions of length one in case the matching element was not found among the elements of S 2 . The parameter n is the measure of the number of matching right-hand sides . If the parameter n = then no matches were found and so if the second element T of PR0DN is empty then the production is pushed into S else it is pushed into S_ before making a return. If the parameter n = 1 then the first element of PR0DN was matched but not the second. If n = 2 then the second element of PR0DN was matched but not the first. If n = 3 both elements of production were matched, Thus if n ^ the first and second elements of S„ are searched from the top down -20- to attempt lu find where the matching element occurs. If no match has been found when the bottom of S 9 has been reached then before the subroutine SSI is called to search S, and S for the match, we check whether n = 2, and 1 u if so, whether the matched element in fact matches the first element of the production under consideration. In this case we call the subroutine REPL2 to effect the replacements called for by condition (ii) . We set the second parameter of REPL2 to be the T element of this new production which will be pushed into S by the subroutine REPL3 which is called by REPL2 „ If n = 2 the second element of PR0DN was matched and so SSI is called with parameter T (the second element of PR0DN) . If n = 1, the first element was matched and so SSI is called with parameter S. However if n = 3, after calling SSI (S) and determining the match for the first element, the match for the second element has yet to be determined and so the production must be popped from S_ and the same production evaluated again for the match of the second element by calling SSI (T) . On the other hand if a match is found in S ? , the resulting action again depends on the value of the parameter n« (if n = 2 only T need be checked against the element of S ). If T matches an element of S then subroutine REPL2 with appropriate parameters, is called which essentially makes the replacements outlined in operations (ii) and (v) . Then if n f 3 a return is made, otherwise, the match of the first element S has yet to be found and so the production is popped from S for further consideratiun. The parameter n is set to 1 (first element matched) and the search of S_ is continued since this match may also be in S, but among the elements as yet unsearched. If S matches an element of S there is a more complicated situation. If S matches a first element and n = 3 then a check is made to see whether T matches the second element of the same production. If this is true subroutine REPLA is called to carry out operation (iv) and then return „ If not then -21- REPL2 is called as above, and again if n = 3 the top element of S must be popped for recondiseration and the parameter set to 2 (second element matched but not the first) . Subroutine REPL 4{n) This subroutine carries out operation (iv) when the first and second elements of PR0DN match the first and second elements of a production in S_» Since both productions of length one have the same right hand side, these are pushed into the stack S of critical productions . When there is a sequence of productions in S with the same right-hand sides then to signal the end of the sequence, the flag bit is set on the first byte of the final production of the sequence Finally the zero-th element of the matched production in S~ is replaced by the created symbol (which of course is still the topmost element in the symbol table (/( ) , Note that a push into a stack from PR0DN is assumed to be a non-destructive store from PR0DN o Subroutine REPL2 (Y,X) This subroutine combines the replacements required in operations (ii) (iii) and (iv) „ If the contents of PR0DN is a production of length two (i.e., T is not empty) then the subroutine REPL3 (Y, X) is called to make the appropriate replacement of the matched element in PR0DN s and set up the contents of PR0DN for the remaining replacements which is the same situation as if the contents of PR0DN was production of length one and a match had been found in S , The contents of PR0DN are pushed into S , a new symbol is created, the matched symbol in S is replaced by the created symbol, the zero-th element of PR0DN is replaced by the created symbol (the first element is still the matched element since the store from PR0DN has been assumed non-destructive), and the contents of PR0DN is pushed into S (with flag bit set on first byte since this is the last production of this sequence of critical productions) before making a return. -22- Subroutine KEPL3 (Y , X) Tnis subroutine replaces the matched symbol (parameter Y) with a created symbol, pushes the production in PR0DN into S,,, and tne PR0DN is set up as a production of length one for the required replacements of pro- ductions of length one. The created symbol is placed in the zero-th position of PR0DN and the matched element (taken from the matched element in S tnis ti~.c) is placed in the first position. Subroutine SS1(Y) Tnis subroutine searches the stacks of productions of length one (namely S and S ) to find the element which matches the parameter Y. It in turn calls three further subroutines: a) REPL3 - to handle the production of length two as above, b) ADJS1 - to adjust the stack S of the matching element is found here, c) ADJSU - to adjust the stack S if tne matching element is found here. The stack S n is searched from the top down. If a ma ten is found 1 and PR0DN contains a production of length two (i.e., T is not empty), then subroutine REPL3 is called to push the modified production of length two into S„ (according to operation (iii)), and then ADJS1 is called to modify the stack S since an element has been matched in this stack, If PR0DN i is of length one then only ADJS1 is called to adjust the stack S . If no match is found in S., , then the critical productions in S 1 u are searched for a match. When the match is found then, if T is not empty REPL3 is called to deal with the production of length two as before, and then ADJSU is called to adjust the stack S since an element in this stack u has been matched, Note that if no match is found in this stack then all the stacks S.-., S, and S have been searched without finding a matcning 2 1 u element. This is necessarily an error condition (ERR0R 2) since these subroutines are called only if the symbol of the production under consid- eration is a right-hand side which matches a symbol in the symbol table which is also a rignt hand side. Thus a match must occur in one of these three stacks. -23- Subroutine ADJSl(n) This subroutine is called when the n-th element of S is matched. Since this implies that there are two productions of length one with identical right-hand sides, both of these productions must be pushed into the stack S of critical productions. Thus the n-th production in S, is popped from S.. and pushed into S . The productions of S.. which were on top of the n-th and thus also popped in this process, are pushed back into S , The contents of PR0DN are pushed into S and the flag bit set since this is the end of this sequence of critical productions. Subroutine ADJSU(n) This subroutine is called when the n-th production of S is r u matched. Since the search of S was made from the top down, all the pro- ductions of S above the n-th are popped. Since the top element of the stack is now the final production of this sequence of critical productions, the flag bit signifying this on the top production of the stack is set to zero and is reset on the new top element of the stack after PR0DN has been pushed into S . Finally the productions of S which were previously popped are pushed back into S and a return is made. -24- 3. DETERMINATION OF CONTEXTUAL CONDITIONS In this second part of the translator generator system, the contextual conditions necessary to solve the parsing and ambiguity problems are obtained. Within given bounds, M and N, of right and left context respectively, we set up successively, context compatible with the local ambiguities determined from the simple production system obtained in the first part. Note that if the production system is not uniquely determined within the given bounds, it may be necessary to run the program again with one or both of the bounds increased. The notation of the first part is carried over into this section. 3.1 Basic Concepts Since every CHOMSKY system can be embedded into a simple CHOMSKY system such that the set of words for Z in the simple system, over the alphabet of the original system, is the same as the set of words for Z in the original system (see [1]), it is sufficient to solve the problems raised in the introduction, for simple CHOMSKY systems only. Therefore, in the following we shall deal only with simple CHOMSKY systems as defined in definition 7. Note that this implies that is a disjoint partition for Q( , Simple CHOMSKY systems have the valuable advantage, that all critical productions of n belong to the set {B. :: - U.; j - 1,2, . . . m} 3 3 -25- and that the local ambiguities are exactly the right sides of the critical productions which are contained in this set. This means especially that the local ambiguities of simple CHOMSKY systems are all of length one. A necessary condition for a string u e "fwu to be in y is that u does not contain a substring XY of one of the following forms: (a) X e {S 1 } , Y e ft (b) X e Q& , Y e {T ± } (c) X e {S i } , Y e {T } and J k such that XY = ST i.e., X,Y are not corresponding elements in the sets {S.} , {T } respectively. For then it is apparent that u does not belong to y. In order to accomplish the determination of the contextual conditions we consider a five-tuple associated with each critical or undetermined production. Suppose C :: - V is a critical production. Then the five-tuple associated with this production is denoted by (a, V, e,w, C) where a is the string of left context elements, 3 is the string of right context elements, and W is an element such that a V 3 is a word for W; i.e., there exists a sequence of productions in the production system n, such that the string a V 3 may be reduced to W by making the indicated replacements. An examination of W will determine what the next context element is since if W = S. for some i in { 1,2 .... I } then the next l context element will be the corresponding T and thus T. is added to the -26- , right context strings . On the other hand if W = T. for some i in { 1,2, ...., £} then the next context elemenc is the corresponding element S. and S. is added to the left context string a. 11 In order to determine whether a given critical production has been determined by the context found at a given stage we need to define the concept of compatibility. Definition 8 (Compatibility) Let a = A | i ... A. .A. or 2 1 i _ A '| a .| - vv 6 = B 1 B 2 — B |S| S 1 » B.'B ' ... B f . . be strings. 12 | b'| then we call (a, V, 6) compatibl e with (a', V', B') if and only if (1) V - V (2) A - A ' I i = 1,2, ... , min (jaj , |a* | )] (3) B. - B. r [ j - 1,2, ... , min (|s| , |3'| )] i.e., the two strings a V 8 and o.' V B' are compatible if their common substring exhausts the smaller of the substrings of right context and also the smaller of the substrings of left context. -27- 3.2 Implementation 3.2.1 Functional Description 3.2.1.1 Program Structure The main program consists of three basic procedures called ZETA, MU and SIGMA which are utilized at each step until the set of undetermined productions vanishes . The bounds M, N of left and right context respectively are used as global parameters which stop the itera- tion if the set of undetermined productions has not vanished after (M + N) steps. The procedure ZETA decides whether the context set up so far is sufficient for the determination of the critical productions. The procedure MU essentially makes all possible replacements in the produc- tions by considering only those productions of length one. The procedure SIGMA obtains the context elements of the productions. The procedure ZETA gives a partition of the set, G, of unde- termined productions from the previous step into a determined part and an undetermined part. For a given production in this set, we determine whether there is different production (*) in the set which is compatible with the given production. If such a production (*) is found, then if W = W* the production (*) is placed in a subset G of G. If W fi W* A then the production (*) is placed in the set, G , of undetermined pro- ductions. The set of determined productions are those elements of G which are not in G . The procedure SIGMA determines the context elements at a given stage, by checking whether the element W is a member of the set of the right-hand sides of the productions in the stack S„ of productions of -28- length 2. If W = T. for some i, then we replace the string a by the string S a, and replace W by a. . If W = S. for some j, then we replace the string 6 by B T. and replace W by a . The procedure MU operates on the element W of the five- tuple, and determines whether W is a member of the set of right-hand sides of the productions in S_, or is a member of the set A of elements which do not appear on the right-hand side of any production. If not then an iteration process is entered. First all possible reductions are made from all the productions of length one, and then the resulting elements are again checked to determine whether they are right-hand sides in S or in A . If not, we then repeat the process. A global parameter K is used to stop the process if the procedure is not satisfied after K steps. The object of the whole process is to reduce the set of unde- termined productions to the empty set. However, if this is not achieved then one can either run the program again with an increased value for one or all of the global parameters M, N, K; or one can apply the pro- duction system as it is and accept the disadvantage of parsing strings in ways which do not lead to valid constructions in the language. -29- 3.2.1.2 Data Structure It is assumed that the global parameters M,N,K are given to the system, as are the results of the first part. The context is built up one element at a time (either left or right context) within the given bounds M,N and controlled by the variable P. In order that the bounds may be considered finite but unbounded, the main idea of the implementation is that the context elements at each stage are stored in one stack which is linked to the elements of the preceding stacks in the sequence, in a cascading manner. The stacks in this sequence are denoted by S(0), S(l), S(2), . .. S(P), S(P + 1). The pointer for the stack S(j) is denoted by p(j) and the k-th element of the stack S(j) is denoted by y, . Each element of every stack in the sequence after S(0) (i.e., for S(j), 1 <_ j £ P + 1) is to be at least two bytes in length, although extended where necessary to accomodate a larger internal number (see first part). The flag bit on the first byte of each element is called "FLAG 1" and the flag bit on the second byte is called "FLAG 2" . When the context parameter has the value P, we are finding the (P - l)st contextual element, and we have P + 2 stacks. S(0) S(l) FLAG 1 FLAG 2 f k S(P-2) Last P-2 S(P-l) New P-l r k S(P) Old W P Y k S(P+1) New W P+l are the typical elements in each stack -30- The first two stacks, S and S(O) have elements which are only one byte in length but extended when necessary as described in the first part. The first stack contains all the different elements V which are the right-hand sides of the critical productions found in the first part. The second stack S(0) contains all the left-hand sides of the critical productions. How these stacks are linked will be described below. The first context element found for each production is stored in the stack S(l). If it is a left context element (i.e., y° = A ) then FLAG 2 is not set, but if it is a right context element (i.e., y e = B..) then FLAG 2 is set. When a given production becomes determined, it is popped from this system and the context for the remaining undetermined productions is built up by building a new stack at each stage. When the context parameter has the value P we have the following situation: The last context element found for each production is stored in the stack S(P - 2). During this iterative pass the new context elements found are stored in S(P - 1) . With each production is associated a five-tuple containing an element W, and these W's found during the previous iterative pass are stored in S(P) while the elements W being determined during this iterative pass are stored successively in S(P + 1). Now in the stack S of critical productions found in the first u part, for each right-hand side V , there exists at least one other pro- duction in S with the same right-hand side — by definition of a critical production. Thus conversely, associated with each distinct V, in S , k u there is a set of C.'s which contains at least two elements. Thus we J first push an element V 1 into the stack S and then push the set of C.'s associated with it, into the stack S(0), setting FLAG 1 on the first element of this set pushed into S(0). Thus to determine all the C.'s associated with the first element of S we must search the stack S(0) from v -31- the bottom until we come to the second flag — at which point we have a left-hand side of V~ , the second element of the stack S . In calling 2 v ° the procedure MU each such C. could in itself give rise to such a set of left-hand sides, so we have 'cascaded' the set of productions corresponding to the right-hand sides V, . Each of these can have its own first context element and so we push into S(l) the set of first context elements (the first element of each set having FLAG 1 set on) corresponding to each C.. This cascading process is continued throughout the sequence of stacks of context elements. However, there is a one-to-one correspondence between the stack of elements W and the last stack of context elements at each stage. This means that there exists a one-one correspondence between the stacks S(P) and S(P-2) and also between the stacks S(P + 1) and S(P - 1). Thus the flags 1 and 2 have no significance in the stacks S(P) and S(P + 1) and hence FLAG 1 in these stacks has been utilized to denote that the flagged production is an element of the subset G of the set G of undetermined productions at the previous stage as mentioned in the previous section of this report. -32- s and call the subroutine MU directly. Since the variables m, n denote the number of left and right context-elements respectively for each produc- tion, these are set to zero during this first iteration, -36- During other iterations we proceed through the stacks S( j ) successively, using the current value e(j) of the running variable for each stack, determining whether the value stored in each stack is a member of a or (3 and so building up the stacks S and S appropriately. If Q!I p I the last context stack S(P-2) is exhausted, control is returned to the calling point of this subroutine, while if any other stack is exhausted we have an error situation in the FLAG l's as described in the previous subroutine . When all the context elements have been found for this pro- duction at this stage (i.e., we have reached S(P-2), then a call is made to CCMP 1 if this is the main pass, or to CCMP 2 if this is the comparison pass. CCMP 1 initiates the comparison pass and CCMP 2 determines whether there is a compatible production. On return from these subroutines, the stacks S „, S „ are adjusted to be ready to accept next context elements af pf of the production. Note that because of the cascading nature of the stacks described above we may have to pop a sequence of elements from these stacks if the FLAG l's denoting a new group are set. If this cascade reaches right down to S(0) then we have to return and adjust the first two stacks in the subroutine ZETA-. If the production is "determined" (i.e., D = l), then it must be popped as in the subroutine ZETA. However since the possible context strings for a production may have many elements in common (in the preceding stacks S(j)), we pop only those stacks containing elements which are unique i.e., in cascading back to S(o) we pop no further stacks after finding one in which the next element is not a member of a new group . The preceding elements of each stack must be replaced after each pop. After completing the pops, we return to the process where we left off after setting the "switch" D back to zero. This final pop is affected by the short subroutine PR. Subroutine CCMP 1 This subroutine initiates the comparison pass for a given production by setting the "switch" f to 2 . The comparison pass is designed to find a production compatible to the given one. This sub- -37- routine achieves essentially those decisions described in procedure ZETA in section 3.2.1.1. Since the definition of compatibility requires the right-hand side V be the same, we need check only those productions 'cas- cading' from one element V of the stack S . Thus we initialize the running variables g(h) for h ■ 0, . ..., P-2 to the values they had on beginning new groups when we began consideration of the new V, . We com- pare all possible productions up to the next FLAG 1 in S(0) when we would be beginning a new group corresponding to the next V. The number of elements in the a, 3 stacks corresponding to the given production are saved in variables m and u, n and v respectively for use later, and then we again make a call on the subroutine LP to build up the a and 3 strings for the productions being compared. We noted in section 3.2.1.1 that the production is determined if it is not undertermined and so after checking all possible compatible productions and it is still not undetermined, it must be determined and we call the subroutine DET to place it in the stack S of determined pro- ductions, f is then set back to 1 and we proceed with consideration of the next production in the main pass. Subroutine DET If the production is found to be "determined", this subroutine pushes it into a stack S of determined productions and then sets D to 1 so that the offending production is then popped from the system. In order to conserve storage, the 5-tuple corresponding to the determined production is stored "vertically" in the stack S whose elements are one byte in length, extended where necessary. The method of storage in S is depicted on the figure. -38- % • • *. % « Wi • • • • • VA i Count for cc A a V A, Count for p - B IP P "I ^■a v >P w c (FLAG bit on W denotes bit G„) t FLAG Bit All productions in S are pushed in this fixed format and so unstacking will immediately produce the productions in the desired form viz: W = A a , A 2 A X C B X B 2 B IP = A a ,A 2 A 1 V B X B 2 . . B IP -v- We first push in C from S(0) and then W from S(P), We check whether this production is a member of G. (denoted by flag 1 set on the element W in S(P)) and if so call the subroutine DETA before popping the W element of this production from the stack S(P). The subroutine DETA determines whether this production is a valid member of G. and if so increases the counter GA of the number of members in the ambiguous class G. A Next we push into S , in order, the elements of right context, the last found first and the first found last, from the stack S , followed p by the count for p i.e., the number of elements of right context. The count is flagged, firstly to distinguish it from the internal numbers representing -39- symbols and secondly to denote the end of the arbitrary string 0. V is pushed in next followed by the elements of the left context a, the first found element first and the last found element last, from the stack S , al followed by the count for a flagged as above. Subroutine COMP 2 This subroutine determines whether this production in the comparison pass is compatible with the given production in the main pass, Since we have required V = V , recall that this means that A ± - A i [i ■ 1, 2, »..., min (ja| J a i ) ] B, - B * [j - 1, 2 S ...., min (|b|,|B | ) ] Since we know these conditions were satisfied for the previous iteration, we need only check that the last found context element satisfies these conditions and so we compare the top elements of the stacks S s S from each pass. If it is not compatible, we return and consider the next production of the comparison pass, while if it is compatible, we check that the left-hand sides, C 9 (found in S(0))are different and then if the W elements (in S(P))are the same we flag the W element of the given production to denote a member of G and return to the comparison pass, else if the W elements are different we have found an undetermined production and so make a call to the subroutine MU before initializing the a and 3 stacks of the comparison pass and returning to the next production of the main pass, Subroutine MU The subroutine MU places the W of the production being considered in a dummy location Q and then calls the main subroutine, OP, on the undetermined productions, The subroutine OP includes the procedure SIGMA described in the first section. The subroutine OP operates on the W of the production and determines whether i: is 'open' i e e , whether it is a member of tne set of elements which are on the right-hand side of the -40- productions In S„ or it is in A,^ (i.e», is not on the right-hand side of any production) . If W is not open, it is returned with the flag bit set which means a call must be made to NUITER which is the subroutine which makes all possible replacements from the productions of length 1. Before returning from MU we set FLAG 1 on the next available cell in S(P-l) since the next context element placed in there will be the first element of a new group as we will then be considering the next W element in S(P). Subroutine OP(x) This subroutine determines whether the argument, X, is open (i.e., an element of a right-hand side of some production in S„) or does not appear on any right-hand side. This latter case was the reason for putting all these anti-terminal elements in a separate stack A . If X is not open, we set the flag bit and return., However if X is a member of a right-hand side of S_, we take this opportunity to determine the next context element . The number of left and right context elements found to date for this production have been saved by means of the variables m, n respectively and so after checking that the addition of another context element will not violate the given bound, the context element is pushed into the new context stack S(P-l) and FLAG 2 is set if it is a right context element. The left-hand side of the matched element in S„ becomes the new W and so is pushed into the new W stack, S(P + 1) . The next available cell of S(P-l) is then cleared of everything including flags so that a FLAG 1 set to denote a W in G will not cause confusion here. Note that by construction in part 1, all the right-hand sides of S are different from each other and all other right-hand sides and so having found one match we need look no further. If the addition of a further context element would violate the given bounds M or N of left and right context, then we can find no more context within the given bounds. In this case the production is pushed into a stack S of the same form as S described previously, and a de- cision about what to do about such productions must be made upon completion of the program. Clearly this problem could be eliminated by running the program again with increased bounds, -41- On the other hand if a match is obtained with an element of A , then no further reductions can be made since this element does not appear as the right-hand side of any production in our system. In this case we first check that this element is not the distinguished element Z of our language, and if it is not we push this production into a stack S , of R the same form as S , If the element is the distinguished element, Z, then after checking that the addition of another context element would not violate the bounds on right context, we add the symbol // as the next right context element. The symbol # is the "string terminator" symbol and in this context means that replacing it by a right element at the end of a string is not possible without leading outside of Y. The modifications to the stack when a new context element is added are carried out as described above and in this case the new element W in the stack S(P + 1} remains as Z . Subroutine NUITER(Q) This subroutine is an iterative procedure making all possible reductions from the productions of length one if the argument Q is not open. The iteration is bounded by the given global parameter K, and if all the new W elements so produced are not open after K iterations, they are pushed into the new W stack S(P + 1) with the corresponding new context elements left blank and the process continues as before. The iteration is controlled by the variable & . We initially check Q against the right-hand sides of all the non-critical productions. If no match is found, it is compared to the right-hand sides of the critical productions. If no match is found here either, then we must have an error situation since Q was not open and by construction of part one 01 - {S.} U {T.l U (u.) u {v k } U (A^ is a disjoint partition of the alphabet Of . If a match is found with U. for some j then Q is replaced by the left-hand side of the matched production, a call is made to the subroutine OP with this new Q as argument and the iteration proceeds. -42- If a match is found with V. for some j, the same procedure is followed except that now all possible left-hand sides associated with the V are now used as replacements successively. Note that if a given V. is not matched then the running variable g of the stack S(0) of left-hand sides must be adjusted appropriately according to the scheme outlined previously. Note also that this running variable g must also be a function of the iteration variable 2,, and hence having found an open Q, we must check that all the possible "left-hand sides" have been dealt with before returning from this subroutine. This is the reason for the series of tests before the return on page 9B of the flow-charts. Subroutine DETA(X) This subroutine determines whether or not the given production is a valid member of the ambiguous set G . This requires that the element W associated with this production is a word for the distinguished element Z. All those productions for which W is not a word for Z may be dropped from the set G . The element W of this production is stored in the i (P)th position of the stack S(P) and it is placed in the cell W in order to carry out these operations. If W is the element Z, then the flag 1 associated with the W element in the stack S is set in order to denote that this production is a member of the class G and the counter GA (which counts the number of elements in the class G ) is increased by A 1 before returning to DET(X) to store the remaining elements of this production in S . But if W is not the element Z, we attempt to reduce x it to Z by finding a sequence of replacements from the stack S of productions of length 1, since W itself is just of length 1. If no further reductions can be found (the top of the stack S is reached without finding a matching element for W) , then this production may be dropped from the set G and we return to DET(X) as before. •43- 4. PURPOSIVE PARSER 4.1 Implementation 4.1.1 Functional Description The production system constructed in the first two parts allows the purposive parsing of strings; for this one uses a modification of the NU procedure used before. The production system consists of those productions stored in the stacks S and S_, which contain productions of length one and two respectively and also those in the stack S of 'de- termined' productions in which the sets of critical productions have been 'determined' or differentiated by context. Let Z be the distinguished element which is given and suppose the program which is to be scanned is read into the stack S with elements X. and with pointer p . The program string is terminated by reading the string terminator symbol, //, into the top of the stack. Assume the be- ginning of the program is at the bottom of the stack, and we assume a left to right scan. Since we do not assume a 'look back' facility and in order to simplify the procedure, we made only one replacement of a string element at each consideration of an element, it is necessary to iterate the scanning passes through the program string, the iteration being bounded by an arbitrary given parameter Q. 4.1.2 Program Description The Scanning Procedure We first initialize the iteration parameter q and the variable i which counts the elements in the program string. Note that if the program string x is 'open' (i.e., if X- e {T, } or Xi i e (S, } ) then J. 1C x J K x i y, i.e., the program is not a sentence in the language. Thus we first check whether X- e {T } and if so go to EXIT 1 which is an error condi- tion since such an element needs an element of {S, } in front of it in k order to be reduced. -44- We next note that if x = u U v and U is an anti-terminal element (i.e., 6 C{ consisting of elements which do not appear on the right-hand side of any production) then x j. 7, since no further re- ductions could then be made on this element and it could not thus be reduced to the distinguished element Z. This is of course assuming that Z is not the only element in the program string, in which case we have finished. If, upon reaching the top of the stack S< excluding the string terminator symbol^), and the single element is not Z or there is more than one element left in the stack, then we increase the iteration counter q by one and, provided this counter is less than the iteration bound Q, we return to the start and repeat the process. If the iteration bound has been reached, then we exit by EXIT 3 and must re-evaluate the situation. It may be necessary to increase Q. Having checked these initial conditions on the element X. of to 1 the program string, we proceed to finding the appropriate replacement rules in the procedure NU. X. is first checked against the right-hand sides of the distinct productions of length one found in S. . If a match is found, the appropriate replacement is made in the stack S and we return to consideration of the next element in the program string since the alphabet has been divided into disjoint sets as described in the previous parts. If no match is found in S , we next seek a match in S . The method of storing these productions in S was explained in part two, and an examination of this structure will show why it is necessary to examine the flag bits on each element of the stack in order to jump over the context strings a and p for each production. Since the first flag is set on the count for the right context p, if a match is found and the count of p is zero, we need only compare the left context oc. The number of left context elements is again found by checking for the flag bit, but note that if the count backwards, k, on the program string reaches zero, then there are not enough left context elements in the program string and this replacement rule fails. Otherwise we check each context element against the corresponding element in the program string. If a non- matching element is found, we return to the search of S but otherwise -U5- on reaching the end of the left context string (denoted by another flag), a valid context has been found. Thus, since for this value of j we hh V. = Icul, and the count for 8 is obtained from the element in 3 with J N index j -V. -2. With this new value of the index j we finally obtain J the replacement symbol for this production from the element of S with index j - V. -2 [compare the storage scheme for S in previous part]. If the symbol X. was not, matched by a right-hand side of a production in S , then we must jump over the left context of this pro- duction in S , before returning to look for the next right-hand side to compare . If the count for 8 was not zero after a match for a right-hand side has been found, then we must check also the right context (see label F). If the count, i, on the program string plus the count for 8, V., is not less than the stack pointer p , then there are not enough right context elements in the string for this replacement to be made, and we resume the search on S . Otherwise we compare each right context element against the corresponding symbol in the program string as was done above for left context. If this comparison is successful, the left context is compared as above. If no match is found in S^, then we search for a match amongst the right-hand sides of productions of length two. If the entire stack S is searched without finding a match, then we have an error exit at EXIT 2 since we should have found a match in one of these stacks due to the partition of the alphabet into disjoint classes. If the element X. matches a member of the class (T ), then we return to the beginning and consider the next element of the program string . This is because the previous element may have been replaced to become an element of the class [S }, and we are not assuming a 'look back' property. This re- K. placement would be made on the next scanning pass through the program string. If, however, a match is found with a member of the class (S }, then after checking that X. is not the last element of the program string, we check that the following element of the program string is the corresponding member of this production in S . Note if X. is the last element of the string, then the string is open and so we go to error -1+6- exit EXIT 1, since we cannot obtain the corresponding T that this final S, needs in order to made a replacement „ Otherwise, if the corresponding T, does not match the next element of the program string, then it must be a member of one of the other disjoint classes of the alphabet and so we return to the beginning and consider this next element of the program string. However if it does match, then the two elements are replaced by the left-hand side of this production in S ? and then remaining elements of the program string in S are pushed down one space, in turn, before a return is made to the beginning to consider the next element in S . As mentioned previously, when the top of the stack S is reached, if there is more than one element remaining in the stack or if the single element is not the distinguished element Z, then the scanning counter q is increased by one and another scanning pass is initiated provided the scanning counter is less than the iteration bound Q. If the single remaining element in S is the distinguished element Z, then we have achieved our goal of reducing the program by successive appli- cations of the productions of the system. -47- 5. CONCLUDING REMARKS 5.1 Consequences of the Parsing Algorithm In order to determine to what extent we have solved the ambiguity problem and to what extent this parsing algorithm is really purposive, we have to discuss the following cases: (a) S - and G - (J x J Recall from 3.2,1„1 that G was a subset of the set, G, of A undetermined productions at a given stage . An element of G has the A property that if (a, V, 8, W, C) is the 5-tuple associated with this production, then a V 8 is ambiguous* in the following sense: 3 i, j such that C, :: = V c. :: - V. i i j J are productions in the production system II such that In this case, for any u e y (i.e„, u a word for Z) , the parsing algorithm gives, successively, strings u„ (i = 1, 2, .=.,, n) such that u = u, , u - Z, u. p u..,. While running, the main program does not In 1 1+1 ° r o consider tentatively other sequences v. which do not lead to Z. The ambiguity problem is solved since E is unique. This is because it does not contain ambiguities in the sense described above, which are necessary for the existence of ambiguities (see definition 6) -A8- (b) S - , G A 4 u A In this case we have ambiguities in the sense described in case (a) above . Such W have to be checked whether they are words for Z, i,e., W p Z, All elements of G for which the corresponding W is not a word for Z can be dropped from G , as is done in the procedure DET in the second part of this system, The production system is ambiguous if and only if G , thus A reduced, is not empty. This means that the parsing algorithm should sometimes consider more than one sequence {u.}, all of which lead to Z and therefore are of interest. However at the present time, the algorithm has been set up to consider just one sequence, namely that obtained by using the first matched production found, and the notation is given that the production system is ambiguous. This decision will be reviewed in the light of practical experience with the system but there would seem to be some doubt as to the practical utility of a system which is ambiguous in this sense. (c) S 4 U In this case the ambiguity problem is not solved. It may possibly be solved by increasing any or all of the bounds, M, N, K (See 3.2.1.1) For the parsing problem, the comments under (a) and (b) above hold except that we generate terms u„ which contain a right side of a production which is not in S , ie., a production in either S.. , S~ or S . All the substrings u. which are words for Z and which do not contain a substring which is a right side of a production in S , S ? or S but which contain a substring which is a right side of a production in S , must be of the following form: -49- (i) u contains a substring v zy ((X^Ol ) with |x| >_ m, where m is the length of the right side of the critical production assoc- iated with the locally ambiguous string x, contained in the substring v. (See definition 5) i.e., v contains a locally ambiguous substring x which implies from the definition that the right side of a corresponding critical production must be a substring of i.e., m <_ |x|. (ii) u contains a substring v 1 such that there exists a sequence {v.} (i = 1, 2, ... , t) , v. = v\ v f Z i,e , there exist sequences which lead outside of y, the set of words for Z By increasing K and/or M and/or N, both of these situations become rarer-. If one of these situations occurs during the parsing of a string one has two choices: (i) One can run the main program again with one or more of the bounds M, N, K increased, and thereby improve the resulting production system, (ii) One can tentatively apply all the corresponding productions of S and accept the disadvantage of following sequences which lead outside of y i.e., do not lead to the desired distinguished element Z„ 5.2 The Semantics The next stage in the system is to adjoin the semantics to actually compile the code for the constructs recognized by the parsing algorithm. One method of doing this would be to attempt to match the semantics associated with the definition of the language in BNF, to the rather expanded set of productions produced by the main program. However it has been suggested [2] that an easier method would be to run each right side of the original definition of the language, through this system and hence obtain a definitive context for each of these. Then we use these original productions (with context) purposively in the scanner. In this form the semantics may be adjoined in their original form and we have also eliminated all the "created characters" used to "determine" the production system, However this will have to be the object of further study. •50- REFERENCES [1] Eickel, J. and M. Paul, "The Parsing and Ambiguity Problem for Chomsky-Languages," TH Munchen, Bericht Nr. 6409 , (August 1964) [2] Paul, M. , Private Communication -51- ( START j_ READ IN A NEW PRODUCTION CLEAR PR0DN' n n+l CALL RHO (j) MAIN PROCEDURE: SIMPLE CREATE A NEW SYMBOL H TEMPORARILY STORE THE ZERO ELEMENT OF ' PRODN ' a — H STORE ' PRODN ' TEMPORARILY A i RESTORE THE CONTENTS OF 'PRODN' S — a RESTORE THE ZEROTH ELEMENT OF 'PR0DN' CALL \_ . E '"' J 1 * THIRD SYMBOL OF PRODUCTION BEING A\ dctiidmU CONSIDERED BECOMES THE S NOW ECOND. SUBROUTINE: RHO (n) FIGURE IA 52 ENTER n th SYMBOL INTO SYMBOL TABLE A IN THE NEXT AVAILABLE LOCATION i + l I PLACE p a (CONVERTED IF>25U) IN THE n th POSITION IN THE CELL 'PRODN' 2j + l SET FLAG TO ZERO SET FLAG BIT ON FIRST BYTE OF TOP ELEMENT IN THE SYMBOL TABLE A P a — V PLACE i (CONVERTED IF > 25U) IN THE n th POSITION IN THE CELL 'PR0DN' SUBROUTINE: C0MPAR (n,j) FIGURE 2A 53 PUSH 'PRODN' INTO S, OF S„ CALL REPL2 (S,T.) f CALL 1 SUBROUTINE: E(n) [PAGE I] FIGURE 3 A ^ _E1 3A CALL REPL 2 (S, Sj) CALL REPL 2 (S, Tj n 2 POP THE TOP PRODUCTION OF S 2 CALL REPL 4 ( i ) SUBROUTINE: E (n) [PAGE 2] FIGURE 4A 55 : CALL REPL 3 (Y, X) PUSH ' PRODN ' INTO S, CREATE NEW SYMBOL H X — H a — H PUSH ' PR0DN ' INTO S u AND SET FLAG BIT SUBROUTINE: REPL 2 (Y,X) REPL 3 \ CREATE NEW SYMBOL G Y - - G PUSH ' PRODN ' INTO S 2 a — G (Y,X) J S — x CLEAR T SUBROUTINE: REPL 3 (Y,X) (REPL4(h) \ CREATE NEW SYMBOL H S - H CLEAR T PUSH ' PRODN ' INTO Sy a — a n 1 1 ^/[RETURN j\ PUSH 'PRODN' INTO S u AND SET THE FLAG BIT a n _ H SUBROUTINE: REPL 4 (n) FIGURE 5A 56 SUBROUTINE: SS I (Y) ( ADJSU (n)j- POP n th PRODUCTION OF S| AND PUSH IT INTO S„ PUSH PRECEDING PRODUCTIONS OF S| BACK INTO Si SUBROUTINE: ADJSI (n) POP (n+l) st PRODUCTION OF S„ SET FLAG BIT TO ZERO ON TOP ELEMENT OF STACK S u PUSH 'PR0DN' INTO S u AND SET THE FLAG BIT PUSH 'PR0DN' INTO S u AND SET THE FLAG BIT PUSH PRECEDING PRODUCTIONS OF S u BACK INTO. S„ SUBROUTINE: ADJSU (n) FIGURE 6A 57 START k — k-l 4 PUSH C k INTO S(O) WITHOUT FLAG BIT PUSH C k INTO S(2) WITHOUT FLAG BIT i — u -I GA — P — ■* P — P+l CALL ZETA (P) MAIN PROCEDURE: CONDETN (M,N,K) FIGURE IB 58 k —I -*T* j — j+l » j - j — j*l i(j) — I CALL REPLACE PRECEDING ELB4ENTS OF POP v k SUBROUTINE: ZETA (P) FIGURE 2B 59 j — jfl ISABEL n UB POP TOP ELEMENT OF y POP TOP ELEMENT OF Stf 'LABEL 3 SUBROUTINE: LP (e,f,p) [PAGE I] FIGURE 3B 60 'LABEL 4 3B j _ j -I REPLACE PRECEDING ELEMENTS OF S(j) POP r. e(j) SUBROUTINE: LP (e.f.p) [PAGE 2] POP X i(j) FROM THE STACK S(j) REPLACE PRECEDING ELEMENTS OF S(j) SUBROUTINE: PR ( X i ( j > > FIGURE 4B 61 u — -p - 1 1 •" -"Pol - 1 v — -p - 1 n -^ -l h — 2 0(h) - q (h) ii ' h — h+l CALL LP (o.f.P) u — u + 1 0(0) - o(o)+i CALL DET (D) SUBROUTINE: COMP I (P) PUSH X ',(ol INTO S» PUSH (U a| -I) INTO S x AND SET FLAG I D - 1 PUSH r |(pJ INTO S x PUSH (ftg, -I INTO S x AND SET FLAG I PUSH V K INTO S x PUSH 0„ INTO S x ' ) v — v - I SUBROUTINE: DET (X) FIGURE 5B 62 a2 — I *02 *" » f SUBROUTINE: COMP 2 FIGURE 6B 63 on vp+d PUSH a; INTO S(P+I) SET FLAG 2 P-l ON r. p(P+D PUSH Z INTO S(P+i; CLEAR y p(p+|) INCLUDING FLAGS SUBROUTINE OP (X) FIGURE 7B 61+ •> £ — o 0(1) - v(o) Jl — £+1 j - -p I - 1 I j - j-l SET ON >♦ PUSH INTO S(P+I) P-l CLEAR y P (Pt " INCLUDING FLAGS — b, CALL OP (0) J — i(£) - I j - j + i »(( ERROR J] 0(£) _ g(j?M SUBROUTINE: NUITER [PAGE FIGURE 8B - X WITHOUT FLAG BIT \ ' k — k-l V fc j —j »• J V j — J-Vj-2 1 ' X. — V. 1 J PROCEDURE: SCANNER [PAGE 2] FIGURE 2C 69 ( LABa E V_ j — o j — j + l »<^ J=P2' K*J t\ " 4— = "J i