Report No. 172

A FAST TRANSLATION TECHNIQUE WHICH PARTIALLY OPTIMIZES THE OBJECT CODE

by

C. W. Gear

April 5, 1965

Department of Computer Science
University of Illinois
Urbana, Illinois

Abstract

A three-pass compiler is briefly described with the following properties: The last two passes scan an intermediate language produced by the preceding pass in essentially the reverse of the order in which it was generated, so that the first pass is the only one which has to read the relatively bulky problem oriented input. The double scan, one in each direction, performed by the first two passes allows the compiler to remove locally constant expressions and recursively calculable expressions from loops and to do the important part of common subexpression recognition. Optimization such as the effective use of index registers, although as important, is not discussed since the object code which would be most efficient is highly machine dependent. The discussion is in terms of a FORTRAN-like language, although the technique is applicable to most algebraic languages.

1. Introduction

Existing compilers tend to fall into two classes: those that are relatively fast, very often syntax directed, one or two pass programs with either an inefficient or an interpretive object, and those which spend a large amount of time optimizing the object for the particular machine. Descriptions of the latter do not often appear in print because of their highly machine dependent nature. (However, descriptions of the IBM FORTRAN II [4] and the KDF 9 [3] optimizing ALGOL compilers are available.) The arguments in favor of this situation are well known: code checks are run on the fast compiler and then an optimized object is obtained from the other for production. It has been the experience of this author that a large number of medium-sized jobs seldom remain unchanged; before each production run small changes are made at the source language level. Both because of this, and because the source language is a condensed, readily comprehensible representation of the job, it is highly desirable to seek a middle ground with some of the features of both worlds.

The purpose of this paper is to describe a method for the high-speed compilation of code which has been rearranged so as to optimize it in terms of the number of basic operations performed. This technique is machine independent. Machine-dependent features, such as index registers, finite length operand stacks, etc., are not discussed.

The object code of a compiler can be made more efficient in a number of ways. They include the location of locally constant expressions or expressions that can be calculated by recursion, the location of common subexpressions, and the rearranging of the calculation to facilitate either of these objectives, so that code can be removed from loops or duplicated coding can be eliminated.
Non-buffered input from auxiliary storage can also be made more efficient within the operating system by pre-calling the input package to load a buffer. For example, the FORTRAN statement READ TAPE N, i/o list will cause a wait by the main frame until tape N has been read into a buffer so that it can be copied to the i/o list. This wait can be reduced by calling for the buffer fill earlier.

In order to perform these types of optimization it is necessary to analyze all statements in a neighborhood of a given statement. This neighborhood is the extent of the loop or block of program from which it is desired to remove code. Thus most optimization procedures involve a considerable amount of rescanning, or the construction of very large tables which must be examined more or less at random. The requirements for high-speed compilation are that a minimum of scans of the entire data be used, and that it not be necessary to refer at random to sections of the program remote from the one being considered, or equivalently, to very large tables containing sections of the program. For high-speed computer systems now being designed, it is a further advantage to refer to memory in a sequential manner rather than in a highly data dependent manner, since a larger degree of memory-main frame overlap is then possible using the various look-ahead schemes.

Thus the requirements of high-speed compilation and efficient object code production are somewhat incompatible. Any solution is bound to reflect the assessment of the individual as to their relative importance. In this paper high-speed compilation is stressed; optimization is performed within this constraint. This paper will not deal with the techniques involved in the standard compilation process since they are both well known and machine dependent. It will only point out additions to the processes which achieve optimization.

2. An Outline of the Program

An example will clarify the type of optimization being sought. The following section of a FORTRAN program

      DIMENSION A(15,5)
      DO 1 I = M1,N1,L1
      DO 2 J = M2,N2,L2
    2 A(I,J) = A(I,J)*C*15.5*D*.5
    1 C = C + 1.

might compile on a fast "single-pass" compiler as the machine language equivalent of

            Move M1 to I
    LOOP1   Move M2 to J
    LOOP2   (J-1)*15 + I-1 + A -> T1
            (J-1)*15 + I-1 + A -> T2
            Load indirect from T2
            Multiply by C
            Multiply by 15.5
            Multiply by D
            Multiply by 0.5
            Store indirect into T1
            J + L2 -> J
            Subtract N2 and transfer to LOOP2 if result <= 0
            C + 1. -> C
            I + L1 -> I
            Subtract N1 and transfer to LOOP1 if result <= 0

where T1 and T2 are temporary storage cells, and A is the memory cell containing A(1,1). A fairly efficient object version is given in the form below.

            L1 -> X4
            A - 16 + 15*M2 + M1 -> X3
            L2*15 -> X2
            D*7.75 -> T2
            M1 -> I
            Transfer to LOOP1+1
    LOOP1   X3 + X4 -> X3
            X3 -> X1
            T2*C -> T1
            M2 -> J
            Transfer to LOOP2+1
    LOOP2   X1 + X2 -> X1
            Load indirect from X1
            Multiply by T1
            Store indirect in X1
            J + L2 -> J
            Subtract N2 and transfer to LOOP2 if result <= 0
            C + 1. -> C
            I + L1 -> I
            Subtract N1 and transfer to LOOP1 if result <= 0

The basic steps that have been taken in this example are:

a) The locally constant expression C*15.5*D*.5 was removed from the inner loop since C and D are not changed in that loop. Since C, but not D, is changed in the outer loop, the 15.5*D*.5 part can also be removed from that loop.

b) The expression (J-1)*15 + I-1 + A is recognized to be recursively calculable in the inner loop by starting at (M2-1)*15 + I-1 + A and adding increments of 15*L2.
Then the starting expression is also recognized as recursively calculable in the outer loop.

c) The expression 15.5*D*.5 is rearranged to get D*7.75.

d) The expression (J-1)*15 + I-1 + A is recognized to occur twice.

The first two of these steps may only be taken if the variables concerned are either not changed inside the loop, or are changed in an incremental manner. It is therefore necessary to examine the range of each loop for potential changes. The assumptions made about function or subroutine calls are crucial in this problem. If side effects are allowed, then it is necessary either to tabulate information about the changes effected by each such procedure, or else to assume that everything may be changed and abandon this optimization in any loop containing such a call.

In order to gather the information required about each loop in the program in a straightforward manner, Pass I maintains a stack S_V, each of whose entries is a simple list of variable names. The top level of the stack corresponds to the iterative loop of the program currently being scanned, and lower levels correspond to each of the loops within which the current one is nested. As each statement is scanned, the names of variables which are changed by the statement are added to the top level of S_V. When the end of a loop is reached, the top level contains the list of all changed variables. This list is saved and also added to the second level of S_V, so that when the old top level is discarded, the new top level contains the current set of changed variables for the next outer loop.

Pass I also translates the source input into a more compact internal form where, for example, arithmetic expressions are in a Polish postfix notation. In order not to have to save each of the lists of changed variables in main core, they also are added to the intermediate language output from Pass I. They occur at the end of the block to which they refer. To handle recursive calculation, a second stack S_C (which could be part of S_V) is maintained by Pass I. The loop parameters are entered into the top of this stack when they are read so that they can also be moved to the end of the loop, where a test can be made to see if the controlled variable or the step quantity is changed inside the loop. Pass I will also perform the syntax checking of the source language, and, for efficiency, it should replace all identifiers by an internal form such as a table address.

Pass II basically scans the output from Pass I in reverse order. This makes the list of changed variable names and other loop information available to the program just as the first (in time) statement of a loop is read. Pass II maintains a stack S_V by placing the changed variable list on top of S_V when the end of each loop is entered, and removing it when the beginning of the loop is left. Thus during the processing of any loop, the top of the stack S_V contains the list of variables changed in that loop. Similarly a stack S_C contains at its top the current loop control parameters. Additional stacks S_L and S_R, whose entries are also lists, are maintained by Pass II. These stacks contain respectively the lists of strings representing the local constants and the recursive increments which are in the process of being moved to the beginning of a loop.

The output of Pass II is in a machine language with symbolic addresses. The length of the program being generated can be calculated during the second pass, so that at its completion, variable storage can be assigned. Pass III will then produce a relocatable binary object.
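As an illustration of the bookkeeping just described, the following is a minimal sketch of the Pass I changed-variable stack S_V, written here in Python with invented names; it is not the original implementation, which would have been hand coded for the machine.

    # A sketch of the Pass I stack S_V: one list of changed names per
    # open loop.  Names and representation are illustrative only.
    class ChangedVariableStack:
        def __init__(self):
            self.levels = []                # one entry per open loop

        def open_loop(self):
            self.levels.append(set())       # new top level for this loop

        def note_change(self, name):        # called for each assigned name
            if self.levels:
                self.levels[-1].add(name)

        def close_loop(self):
            # The finished top level is written to the intermediate
            # output at the end of the loop's block, and also merged
            # into the level below, so the new top again holds the
            # current changed set for the enclosing loop.
            closed = self.levels.pop()
            if self.levels:
                self.levels[-1] |= closed
            return closed

    # For the DO nest of Section 2: the inner loop changes A and J.
    s_v = ChangedVariableStack()
    s_v.open_loop()                         # outer I loop
    s_v.open_loop()                         # inner J loop
    s_v.note_change("A"); s_v.note_change("J")
    inner = s_v.close_loop()                # {"A", "J"}; now merged below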
The method described below involves operations such as concatenating elements and strings to form larger strings, where these elements or strings are members of stacks. Deliberately, no mention is made of how to mechanize this since it is highly machine dependent. On some machines it is better to chain everything and not to move elements unless absolutely necessary; on other machines the opposite holds. These tricks of the programmer's trade can easily save or cost a factor of 2 in speed, so a judgment of the method should include an evaluation of their potential value in this problem.

3. Removing Locally Constant Subexpressions

This section indicates the manner in which Pass II scans arithmetic expressions in order to locate locally constant subexpressions. The information it has available is contained in the top entry of the stack S_V, which contains a list of changed names. For speed it is convenient on most machines to translate this information so that it is represented by a bit in the name table.

Arithmetic expressions have been translated by Pass I into a Polish postfix form. Although the main Pass II scan is proceeding backwards, it is more convenient to scan expressions forward. This can be handled either by having Pass I reverse the order of the expression or, probably more conveniently, by having Pass I generate a pointer at the end of an expression indicating the beginning of the expression.

As the Polish postfix form is scanned from left to right, operand names are read and placed in an operand stack S_O. As they are put in this stack they are marked as changed variables or not, depending on whether or not they appear in the top of S_V. When operations are read in from the Polish postfix string they must be compiled. The nature of the compiling action depends on the operands involved. Consider only binary operations. (n-ary operations follow the same principle.) If neither of the top two operands in S_O is changed, no code is generated; rather, the top two levels are replaced with a single entry representing the combination. For example, if the top two levels are both locally constant variables (LCV) they are replaced by a single locally constant expression LCE. This is mechanized by forming a Polish postfix string consisting of the top two operands followed by the operation, placing this in main store, and placing a pointer to it in the stack with the mark LCE. Thus the entries in the operand stack can be any one of the classes

    C    (constant, e.g., 15.5)
    LCV  (locally constant variable)
    LCE  (locally constant expression)
    V    (variable, one that is changed in the loop)

The rules for the combination of these by any binary operation are:

    1st and 2nd level operands    Action
    --------------------------    ------
    C - C                         Combine at compile time to get a new
                                  constant.

    C - LCV, C - LCE,             Form a new string consisting of the
    LCV - LCV, LCV - LCE,         concatenation of the 1st level of S_O,
    and LCE - LCE                 the 2nd level of S_O, and the operation.
                                  The single stack entry becomes an LCE
                                  pointing to this string.

    C - V, LCV - V, V - V         Compile object code directly. The
                                  single stack entry becomes V type.

    LCE - V                       Allocate a temporary storage cell with
                                  name T, say. Compile code for T, V,
                                  operation. Add 'T =' to the end of the
                                  string representing the LCE and add
                                  this new string to the top level of the
                                  S_L stack, the stack of locally constant
                                  expressions. The single S_O stack entry
                                  becomes V type.
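A minimal executable sketch of these rules may help; it is written in Python with invented names (OPS, new_temp, the (kind, tokens) stack entries are all inventions of this illustration) and should not be read as the original routine.

    import operator

    OPS = {"+": operator.add, "-": operator.sub,
           "*": operator.mul, "/": operator.truediv}
    _temp_count = 0

    def new_temp():                        # invented helper: fresh cell names
        global _temp_count
        _temp_count += 1
        return "T%d" % _temp_count

    def combine(stack, op, s_l, code):
        # Stack entries are (kind, tokens) pairs, kind one of "C", "LCV",
        # "LCE", "V"; tokens is a Polish postfix token list.
        b = stack.pop()                    # top level of S_O
        a = stack.pop()                    # next level of S_O
        if a[0] == "C" and b[0] == "C":
            # C - C: combine at compile time to get a new constant.
            stack.append(("C", [OPS[op](a[1][0], b[1][0])]))
        elif "V" not in (a[0], b[0]):
            # C/LCV/LCE mixtures: concatenate the two strings and the
            # operation; the single entry becomes an LCE.
            stack.append(("LCE", a[1] + b[1] + [op]))
        elif "LCE" in (a[0], b[0]):
            # LCE - V: hoist the LCE into a temporary computed ahead of
            # the loop; its assignment string joins the top list of S_L.
            lce, v = (a, b) if a[0] == "LCE" else (b, a)
            t = new_temp()
            s_l[-1].append(lce[1] + [t, "="])
            code.append((op, [t], v[1]))
            stack.append(("V", ["<acc>"]))
        else:
            # C - V, LCV - V, V - V: compile object code directly.
            code.append((op, a[1], b[1]))
            stack.append(("V", ["<acc>"]))

Applied to the factors of the Section 2 example (C changed in the loop, D not), the constant part grows into a single LCE before any code is emitted:

    stack, s_l, code = [("C", [15.5]), ("LCV", ["D"])], [[]], []
    combine(stack, "*", s_l, code)         # -> LCE  15.5 D *
    stack.append(("C", [0.5]))
    combine(stack, "*", s_l, code)         # -> LCE  15.5 D * 0.5 *
    stack.append(("V", ["C"]))
    combine(stack, "*", s_l, code)         # hoists  15.5 D * 0.5 * T1 =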
When the beginning of the loop is reached in the Pass II scan, the loop is closed, the top level of S_V is discarded, and the Polish postfix assignment statements are removed from the top of the S_L stack and compiled to yield the program which generates the local constants. Since these expressions may be partially or wholly local constants in this outer loop, they should be handled in the same manner as Pass I generated expressions; that is, all that is necessary is to switch input from the intermediate language of Pass I to this list until it is exhausted. Since the temporary storage name generated for a locally constant expression is not marked as changed, the rules outlined above will cause an expression to be removed from all loops in which it does not change.

General n-ary operations such as functions can be handled by the obvious extension of the process. Indexing is an extension of functions. If the indexing expressions are local constants then the address of the element can be calculated outside of the loop. If in addition no elements of the array are changed, then the element could be part of a larger locally constant expression.

4. Recursive Calculation

Recursive calculation of addresses has been discussed by Samelson and Bauer [5] for ALGOL. Restrictions essentially similar to those of FORTRAN are imposed on the subscripts. The method proposed here is to calculate all expressions recursively where that can be done. If the expression happens to be an index expression, then a compiler for a particular machine can take note of it and use index registers or other features to perform the calculation.

The technique is essentially identical to the one used for identifying local constants. A new type of operand stack entry is defined, the "Step Variable" (SV). This is a variable which is incremented by constant amounts during the execution of a loop. The controlled variable in the loop is of this type provided that neither it nor the step quantity is changed inside the loop. This is the reason for moving the controlled variable name and loop parameters to the end of the loop by the S_C stack mechanism in Pass I. As the end of a loop is entered, the new top level is added to the S_C stack, and the loop count mechanism is compiled. If the expression used for the step quantity is a fixed point constant or local constant, the controlled variable is checked against the changed list. If it is not changed in the loop (except by the loop control) it is flagged in the table as a step variable for the duration of this loop. (If the controlled variable of the next outer loop is also flagged as a step variable, its flag must be turned off until the beginning of the loop has been reached.)

Another type of operand stack entry is defined, a "Step Expression" (SE). An SE is a combination, by addition or subtraction, of any two SV's, SE's, LCE's, LCV's or C's, one of which is an SV or SE; or a combination, by multiplication, of an SE or SV with an LCE, LCV or C type. Thus a step expression can be expressed in the form LCEA + LCEB*I, where I is the step variable and LCEA and LCEB are two C's, LCV's or LCE's. Since I is being incremented by a locally constant amount, say INC, the value of the step expression can be obtained by starting at LCEA + LCEB*S and incrementing each time by LCEB*INC, where S is the starting value of I. The procedure is mechanized by marking a variable in the operand stack S_O as an SV if it is so marked in the variable table.
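Before the combination rules are described, the LCEA + LCEB*I form can be made concrete with a small sketch (Python; the representation and the temporaries T1 and T2 are inventions of this illustration) applied to the inner-loop subscript of the Section 2 example, where J is the step variable running from M2 in steps of L2:

    # An SE carried as a pair of locally constant postfix strings
    # standing for LCEA + LCEB*J; illustrative representation only.
    lcea = ["I", 1, "-", "'A'", "+", 15, "-"]   # I - 1 + A - 15
    lceb = [15]                                  # coefficient of J

    # Hoisted ahead of the loop:
    start = lcea + lceb + ["M2", "*", "+", "T1", "="]  # T1 = LCEA + LCEB*M2
    step  = lceb + ["L2", "*", "T2", "="]              # T2 = LCEB*L2
    # Compiled at the head of each pass around the loop:
    increment = ["T1", "T2", "+", "T1", "="]           # T1 = T1 + T2

This is exactly the X1/X2 recursion of the optimized code in Section 2; the general mechanism is as follows.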
When two (or more) operands are combined which result in an SE, the entry in the stack is so marked, and a pointer is placed there which points to a pair of strings. The first string represents LCEA and the second represents LCEB. When the operation is an addition (or subtraction), the new lists are formed by concatenating the pairs of strings representing each operand together with a trailing plus (minus) sign. If either of the expressions is simpler than an SE, then its LCEA and LCEB are implied in the obvious way. When the operation is multiplication of an SE or SV by a C, LCV or LCE, the new LCEA is the string LCEA {C | LCV | LCE} * and the new LCEB is similarly LCEB {C | LCV | LCE} *.

In all other cases involving an SE or SV, some object code is to be compiled. If the operand is either an SV or an SE with its LCEB equal to +1, then there is no point in handling it as other than a changed variable (with possibly a locally constant additive LCEA). Apart from these cases, it is necessary to assign a temporary storage cell, say T1, to the SE for the purpose of compiling the current operation. A second temporary storage cell T2 must also be provided for the increment storage. The strings LCEA LCEB S * + T1 = and LCEB INC * T2 = are added to the top list of the stack S_L of locally constant expressions, and T1 T2 + T1 = is added to the top of the S_R stack of recursive increments.

When the beginning of the loop is reached in the scan, it is necessary to prepare the incrementing orders by compiling the top level of the S_R stack and then discarding it. The loop is then closed by removing the top level of the S_C stack and compiling the loop initialization, plus the transfer around the incrementation orders, before compiling the top list of the S_L stack.

Just as the multiplication operation has been reduced to successive additions, the exponentiation operation could be reduced to successive multiplications, but this author feels that it is of doubtful value, both because of rounding error and frequency of occurrence.

5. Rearranging the Calculation of an Expression

Whether or not this should be done is open to question numerically. The fact that some compilers work from left to right, others right to left, and still others in some optimizing manner means that if the programmer desires to calculate A+B+C in a specific order, he should indicate it with parentheses, e.g., A+(B+C), if he intends to use the program in more than one installation.

In order to indicate the difference between the three expressions A+B+C, A+(B+C) and (A+B)+C in the Polish postfix string, a convention must be used which allows the rearrangement of calculation in the first instance. Such a convention is to allow n-ary versions of operators. Thus A+B+C becomes A B C +3, where the subscript 3 indicates that + is the summation of the 3 elements A, B and C. A+(B+C) and (A+B)+C would then uniquely be A B C + + and A B + C +, where the subscript 2 for binary operations has been dropped. In order that this convention also handle the subtraction (and division) operations, the "sign" of the operation is placed over the last element representing that operand. Thus A-B+C is A B̄ C +3, while A-(B+C) is A B C +̄ + and A+(-B+C) is A B̄ C + +. Similarly A/B*C and A/(B*C) are A B̄ C *3 and A B C *̄ * respectively.

When the operation +3 in the expression A B C +3 is to be compiled by Pass II, the objective is to rearrange the elements so that the minimum of work is done at execution time. This has been discussed in the case of constants by Floyd [1] and Gear [2].
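In token form the convention might be carried roughly as follows (a sketch only: the trailing digit for the operand count and the trailing ~ standing in for the overbar are notations invented for this illustration):

    # Flat token strings for the five expressions above.
    A_minus_B_plus_C   = ["A", "B~", "C", "+3"]        # A B̄ C +3
    A_minus_sum_BC     = ["A", "B", "C", "+~", "+"]    # A B C +̄ +
    A_plus_negB_plus_C = ["A", "B~", "C", "+", "+"]    # A B̄ C + +
    A_over_B_times_C   = ["A", "B~", "C", "*3"]        # A B̄ C *3
    A_over_prod_BC     = ["A", "B", "C", "*~", "*"]    # A B C *̄ *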
There is a hierarchy of operands that are possible in the stack at any time. This ordering is

    Constant
    Locally Constant Expression (including Locally Constant Variable)
    Step Expression (including Step Variable)
    Variable

When an n-ary commutative operation + or * is encountered by Pass II, the top n levels of the operand stack S_O are examined to find all occurrences of constants first; these are combined at compile time and n is appropriately reduced. If it is not now 1, the remaining levels are examined for LCE's or LCV's. If there are any, they are combined with any constant to give a new LCE. If n still has not been reduced to 1, the search continues for SE's or SV's which can be legitimately combined into SE's. Finally, if n is not 1, code is compiled and the stack entry becomes a variable.

6. Common Subexpressions

Floyd [1] has described one way of locating common subexpressions within a single expression. This method could be extended to a number of expressions within a program segment at a considerable cost in main storage requirements. It seems doubtful that any simple technique can be found to handle this problem without considerable searching and comparing. This author feels that in general the pay-off for the amount of work involved is marginal, for the following reasons:

a) If the common subexpression is very short, e.g., A+B or A*B, the gain in execution speed is small since in most cases a store order is involved to save the result in place of the common arithmetic operation.

b) If the common subexpression is lengthy, the laziness of the programmer can be relied upon to do the necessary simplification; e.g.,

    ... (A+B)*(C+D)/(E+(A+B)*(C+D)) ...

is likely to appear as

    T = (A+B)*(C+D)
    ... T/(E+T) ...

which is a considerable saving in writing.

However, when a feature of the language expresses a fairly involved expression in a simple manner, the recognition of common subexpressions becomes advantageous. Such a case is indexing. For example, the assignment statement for three-dimensional Gauss-Seidel iteration is

    A(I,J,K) = (A(I,J,K-1) + A(I,J,K+1) + A(I,J-1,K)
              + A(I,J+1,K) + A(I-1,J,K) + A(I+1,J,K))/6.

This contains the expression ((K-1)*N2 + J-1)*N1 + I-1 + A, plus various constants, seven times, where N1 and N2 are the dimensions of the first two subscripts of A. The method proposed here is not to look for such common subexpressions unless they are either local constants or step functions. If this is done, then each time that a new step function or local constant is generated, it should be compared against the existing set in the top level of S_L before being entered there. A step function contains two local constants. If the step part LCEB agrees with the step part of another step function (the names of these are located in S_R), then only one incremented function is needed; the other can be determined by calculating the difference between the two constant parts LCEA and adding this relocation at the time it is used. If these are identical, so much the better.

In many situations, such as the example above, the two constant parts LCEA of two step functions will only differ by a fixed constant, but this fact will be hidden by the arrangement of the Polish postfix strings representing them. For example, if I is the step variable, and it is being incremented by 1 starting at 1, the LCEA part of A(I,J,K-1) is

    K 2̄ + N2 * J 1̄ + + N1 * 1̄ 'A' + +

whereas for A(I+1,J,K) it is

    K 1̄ + N2 * J 1̄ + + N1 * 'A' +

where 'A' is the address of A(1,1,1). A direct comparison will not easily yield the fact that these differ by N2*N1 + 1.

To aid this recognition, it is proposed that the LCE's be reorganized as they are generated. A suitable form is such that local constants always appear before constants in additive groups. They can easily be rearranged as they are removed from the S_O stack during formation. Additionally, when an LCE of the form "LCE string, constant, +" is multiplied by a constant such as N2, the value of constant * N2 should be calculated, yielding, say, C'. The new LCE is then "LCE string, N2, *, C', +". In this way, additive constants are moved to the right hand side of the string (and in an expression like (I+3)*7+J+2, additions are saved). A direct comparison of the LCE's can now be used to determine if they differ only by a constant. If this is so, the second expression can be calculated from the first by simple addition. Whether this is better done outside the loop or at the time of use depends on the machine characteristics.
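The reorganization can be sketched as follows (Python; the (tokens, constant) pairing and the helper names are inventions of this illustration, with N1 and N2 given arbitrary compile-time values). It shows how the two LCEA's above become directly comparable:

    # An LCE is held as (tokens, constant): the additive constant is
    # kept folded at the right-hand end of the string.
    N1, N2 = 10, 20                     # dimensions of A, known at compile time

    def var(name, k=0):                 # an LCV with an additive constant
        return ([name], k)

    def add(l1, l2):                    # concatenate strings, add constants
        joiner = ["+"] if l1[0] and l2[0] else []
        return (l1[0] + l2[0] + joiner, l1[1] + l2[1])

    def mul_const(lce, c):              # "LCE string, N2, *, C', +"
        return (lce[0] + [c, "*"], lce[1] * c)

    # LCEA of A(I,J,K-1):  ((K-2)*N2 + J-1)*N1 - 1 + 'A'
    a1 = add(add(mul_const(add(mul_const(var("K", -2), N2),
                               var("J", -1)), N1),
                 ([], -1)),
             var("'A'"))
    # LCEA of A(I+1,J,K):  ((K-1)*N2 + J-1)*N1 + 'A'
    a2 = add(mul_const(add(mul_const(var("K", -1), N2),
                           var("J", -1)), N1),
             var("'A'"))

    assert a1[0] == a2[0]               # identical strings, so the pair
    assert a2[1] - a1[1] == N2*N1 + 1   # differs only by the relocation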
7. Auxiliary Storage Input

The reverse pass feature of this method can be used to handle this optimizing problem at a simple level. During the first pass, the input statement is moved forward over statements as far as possible such that

a) No transfers into or out of these statements are possible (determined by the nonexistence of statement numbers or control transfers in FORTRAN),

b) No variables used or changed by these statements are in the input list, and

c) None of these statements refers to the same i/o unit.

During the second pass, the call to the buffering program or monitor is moved back to be the first statement of the program segment, that is, immediately following the first transfer into or out of the program.

8. Conclusion

Because the output of one pass is read backwards by the next pass, this method is particularly suited to machines with tapes that can be read in either direction. Disk files or drums are also suitable for the intermediate storage, and can lead to very fast second and third passes.

On a machine with a large main memory many methods of optimization which use large parts of core are feasible. Since the first pass is usually highly input limited, it might be an advantage to use the method outlined above on several programs simultaneously; that is, the compiler program is time-shared by several inputs and auxiliary devices, each using a small piece of memory for data storage.

Partial optimization of the object code in a machine independent fashion is certainly feasible using a small amount of main memory with a reversible auxiliary store. On most machines it is likely that the input time for the source program will dominate the compile speed. Optimization of the use of machine features such as index registers is a harder problem, probably requiring two further passes, one in each direction, since in the proposed compiler, information about which addresses would best be in index registers is not available until Pass II, and Pass II must generate code whose length can be determined by the end of Pass II.
"Systems Manual for 70k FORTRAN and 709 F0RTRAN," Applied Programming Department, IBM, April, i960. [5] Samelson, K. and F. L. Bauer. "Sequential Formula Translation," Comm. ACM, 3 (February, i960), p. 76. -15- c f „ r002 V171 1B709651 llllac III « pin"" ' ■ * 30112088398208 Ml