Report No. UIUCDCS-R-75-694
NSF-OCA-GJ-36936-000007

SPEEDUP OF ITERATIVE PROGRAMS IN MULTIPROCESSING SYSTEMS

by
Shyh-ching Chen

January 1975

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the National Science Foundation under Grant No. US NSF-GJ-36936 and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, January 1975.

SPEEDUP OF ITERATIVE PROGRAMS IN MULTIPROCESSING SYSTEMS
Shyh-ching Chen, Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1975

Loop computations are the basic constructs which consume most of the computation time in ordinary programs. For a multiprocessing (parallel or pipelined) system, the overall speedup which might otherwise be obtained will be seriously degraded if these computations are carried out in the sequential manner in which they are presented. In this paper, we present algorithms to expose the vector operations and recurrence relations which constitute a loop computation and, in turn, to transform recurrence relations into vector operations. For the latter case, time and processor bounds for different classes of problems and practical situations will be discussed. Bounds on the speedups of different types of loop computations will also be given, so that machine designers or users can know how well they are doing in some absolute sense.

ACKNOWLEDGMENT

The author wishes to express his deepest gratitude to Professor David J. Kuck. His fruitful advice, constant enthusiasm and continuing help throughout this work are most appreciated and admired. Special thanks should go to Mrs. Vivian Alsip for her excellent job of typing and remarkable patience in doing it. Mrs. Gayanne Carpenter gave me kindly departmental assistance. Drafting and printing services of this department have been invaluable. Sincere thanks are also extended to my parents and wife, whose encouragement and understanding have made my long period of study possible, and to them it is dedicated.

TABLE OF CONTENTS

1. INTRODUCTION 1
2. TIME AND PARALLEL PROCESSOR BOUNDS FOR LINEAR RECURRENCE SYSTEMS WITH ARBITRARY COEFFICIENTS 5
   2.1 Introduction 5
   2.2 General Linear Recurrences 8
   2.3 M-th Order Linear Recurrences 26
   2.4 Practical Implications 37
3. TIME AND PARALLEL PROCESSOR BOUNDS FOR LINEAR RECURRENCE SYSTEMS WITH CONSTANT COEFFICIENTS 45
   3.1 Introduction 45
   3.2 General Linear Recurrences with Constant Coefficients 46
   3.3 M-th Order Linear Recurrences with Constant Coefficients 56
   3.4 Practical Implications 69
4.
SPEEDUPS OF PRACTICAL LINEAR RECURRENCE SYSTEMS WITH A LIMITED NUMBER OF PROCESSORS 70 4.1 Introduction 70 4.2 M-th Order Linear Recurrences with Arbitrary Coefficients 71 4.3 M-th Order Linear Recurrences with Constant Coefficients 87 5. SPEEDUP OF LOOP COMPUTATIONS IN ORDINARY PROGRAMS 94 5.1 Introduction 94 5.2 Speedup Characteristics of Program Loops 97 5.3 Exploitation of Parallelism in Program Loops 103 5.4 Time Bounds of Program Loops 112 5.5 Practical Implications 121 6. CONCLUSION 124 LIST OF REFERENCES 126 VITA 129 LIST OF TABLES Page 1. Experimental Results of General Linear Recurrences A3 2. Experimental Results of m-th Order Linear Recurrences .... 44 3. Experimental Results of m-th Order Linear Recurrences by Algorithm 5 for n = 16,384 84 4. Experimental Results of m-th Order Linear Recurrences by Algorithm 2 and Algorithm 5 for n = 128 85 5. Experimental Results of m-th Order Linear Recurrences by Algorithm 2 and Algorithm 5 for n = 1024 86 vi LIST OF FIGURES Page 1. Evaluation Trees for R<4> 21 2. Dynamic Partitioning for Algorithm 1 22 3. Parallel Evaluation of R<8> 23 A. Dynamic Partitioning for Algorithm 2 31 5. Parallel Evaluation of R<16,2> 33 6. Dynamic Partitioning for Algorithm 3 51 7. Parallel Evaluation of R<8> 52 8. Dynamic Partitioning for Algorithm 4 58 9. Parallel Evaluation of R<16,2> 59 10. Dynamic Partitioning for Parallel Evaluation of Remote Term in Homogeneous R by Algorithm 4 67 11. Dynamic Partitioning for Parallel Evaluation of Remote Term in Homogeneous R by Algorithm 2 of Chapter 2 . . . 68 12. Dynamic Partitioning for Algorithm 5 72 13. Dynamic Partitioning of Matrix E for Algorithm 6 89 1 . INTRODUCTION The primary objective of this research is to develop fast and efficient algorithms to evaluate programs which are presented in serial form. Specifically, the recurrence relations in loop computa- tions which are encountered quite often in ordinary programs such as Fortran DO loops and Algol FOR loops will be the center of our concern. During the last decade, as the cost and size of hardware circuits became remarkably cheaper and smaller, and as the speed of electronic devices approaches its physical limits , multiprocessing systems were introduced in an attempt to speed up machines by allowing concurrent processing. As a result, various types of multiprocessing systems have emerged. These include ILLIAC IV, CDC STAR, TI ASC, C.mmp and many others, each enjoying some success in the computations tailored to its own structure, and hence generally intended for a special group of users. However, we believe that the future trend of multiprocessing system designs will and should make the high-speed computational power of such machines available to ordinary users [1], Two basic design and application principles which help to attain the full effectiveness of multioperational machines are: 1) Extracting parallelism from ordinary programs as much as possible. Obviously, a plentiful supply of possible concurrent operations will exploit the highest speed of any given or proposed system. There may be two approaches : a) Devising new parallel algorithms and providing existing languages with adequate parallel processing features. b) Automatic detection of parallelism at the compiler level. 2) Minimizing data accessing and alignment conflicts in the system in order to achieve high-speed processing. 
This may be achieved by properly designed memory systems, storage schemes and data alignment networks which combine to allow conflict-free accessing and processing of various slices of data required in most ordinary programs [2], [3], [4]. Loop computations are the basic constructs which consume most of the computation time in ordinary programs. This is true intuitively as well as in actual experimental measurements [5], [6], [7]. For a multiprocessing system, the overall speedup which might be otherwise obtained will be seriously degraded if these computations are carried out in the sequential manner as they are presented. Therefore, according to the above principles, they are very good and important candidates deserving careful theoretical and practical studies if future machines of this class are to be in general use at all. Most previous results pertaining to this problem treat indi- vidually each principle stated before. These will be mentioned in later chapters with their related topics. In this thesis, we would like to look at the problem from a unified point of view. Briefly put, we ask ourselves if there is any fast algorithm (or algorithms) which we can use for evaluation of recurrence relations such that: 1) They may be written in numerical subroutines by extending current languages with some useful features, 2) Upon recognition in program loops, it should be easy for a compiler to transfer these iterative constructs into the equivalent but fast evaluations proposed by these algorithms, 3) With care, it is possible to use or design a well-organized system to achieve a desirable speed. All the algorithms developed herein have some implicit bearing on certain (although not all) solutions to these three questions. In short, we try to devise algorithms for easy implementations through user subroutines or compiler algorithms, and the assumed general machine structures and data routing patterns underlying these algorithms should enhance their maximum speeds. In Chapter 2, we discuss the algorithms to evaluate linear recurrence relations with arbitrary coefficients for both general and m-th order problems. Chapter 3 emphasizes the cases of constant co- efficients which occur frequently in general numerical programs. All algorithms as they stand are for the fastest computations and anticipate future machines with as many processors as the number of IC packages used in current machines. Also, some easy and general modification procedures are introduced to maintain substantial speedups while using processors more efficiently. Some representative experimental results to this effect will be included. These types of information and practicalities should be able to lead us to good machine design and application procedures in future parallel machines. On the other hand, for realistic designs and uses at present, Chapter 4 provides some methods to evaluate most practical m-th order linear recurrences where m is very small and under the assumption that usually the number of processors is far less than the problem size. Extending the results from these three chapters, we then can treat general jump-free loop computations in a uniform way either theoretically or practically in Chapter 5. Bounds on speedups of different types of loop structures will be given so that designers or users could know how well they were doing in some absolute sense. 
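To fix ideas, the following is a minimal illustration (in Python, with names of our own choosing; it is not taken from the thesis) of the kind of loop computation at issue: a second-order linear recurrence of the sort that appears in ordinary Fortran DO loops, and which is strictly sequential if executed as written.

```python
def serial_second_order(c, a1, a2):
    """Serial evaluation of  x[i] = c[i] + a1[i]*x[i-1] + a2[i]*x[i-2],
    i.e. the Fortran-style loop
          DO 10 I = 1, N
       10 X(I) = C(I) + A1(I)*X(I-1) + A2(I)*X(I-2)
    Each iteration needs the two previous results, so executed this way the
    loop runs one step at a time no matter how many processors are available."""
    x = []
    for i in range(len(c)):
        prev1 = x[i - 1] if i >= 1 else 0.0
        prev2 = x[i - 2] if i >= 2 else 0.0
        x.append(c[i] + a1[i] * prev1 + a2[i] * prev2)
    return x
```

Chapters 2 through 4 study how recurrences of exactly this form can nevertheless be evaluated with many processors in roughly logarithmic time.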
Although this work provides some insight to these serious problems, no claim can be made about the perfection of the results as spelled out in the open questions listed in Chapter 6. For notational simplicity, we will assume all logarithmic functions take their ceiling values throughout this thesis. 2.. TIME AND PARALLEL PROCESSOR BOUNDS FOR LINEAR RECURRENCE SYSTEMS WITH ARBITRARY COEFFICIENTS 2.1 Introduction One of the most difficult aspects of speeding up ordinary programs for parallel or pipeline computation is the treatment of recurrence relations [7], [8]. Although they occur with rather low overall frequency, an annoyingly large fraction of the programs one sees contain at least one recurrence relation. Most of these are linear recurrence relations. Suppose some multidimensional computation space contains n points over which a general linear recurrence relation is defined. We will show o that this computation can be speeded up by a factor of ©((n/loggn) ) using 3 0(n ) processors. More generally stated, we consider the parallel solution of any linear system of the form x ■ c+Ax where A is a strictly lower tri- angular matrix and c is a constant column vector. While a serial machine can 2 solve this system straightforwardly in 0(n ) steps, our algorithm finds the 2 3 2 solution in 0((log_n) ) time steps using at most n /68 + 0(n ) processors. Further, all processors perform the same operations at the same time (SIMD operation). These matters are discussed in Section 2. In Section 3 we consider the restricted problem of solving an m-th order linear recurrence relation. We show that given any linear system of the form x=c+Ax where A is a strictly lower triangular matrix of bandwidth m, its solution can be speeded up by a factor of O(mn/log mlog n) using at most p 0(m n) processors. Again, this result is achieved with uniform operations in all processors (SIMD operation) . These results are far from ideal in several respects. First we observe that in the general case a parallel computation time of O(log p n) could be hoped for. This follows from a simple fan-in argument using the 0(n ) data elements in the A matrix and c vector. Similarly, in the m-th order case we could hope for a parallel computation time of O(log mn). Note that we approximate this when n » m. A second shortcoming of these bounds is that the number of pro- cessors required is rather high so that rather low efficiencies result. This leads us to present in Section k some practical modifications to the algorithms of Sections 2 and 3- These modifications allow much higher effi- j ciencies without significantly reducing the speedups . The parallel evaluation of recurrence relations has been studied by a number of people. In [9], Karp et.al. discuss the formalization of the problem and prove some results about the resulting graphs . Muraoka in [10] discusses algorithms for speeding up general classes of recurrence relations and also presents bounds of 0(log n) time steps on a number of simple recurrences. Kogge and Stone [11] discuss m-th order recurrences and while they do not give a general processor bound, their time bound is about twice ours for this case. Heller has studied general recurrence 2 relations and proves that they can be evaluated in 0(log«n) steps with 4 0(n ) processors using a machine capable of simultaneous addition and multiplication (MIMD operation) . 
He recently has improved the processor 3 bound to 0(n ) while the time bound remains twice that reported here [12], Although we do not discuss any details of machine organization in this paper, it is in order to sketch a machine here. First, we list some idealizing assumptions. 1) Instructions are always available for execution as required and are never held up by a control unit. 2) Each operation takes the same amount of time, which we refer to as a unit step . 3) Any number of processors and memories may be used at any time. 4) There are never any accessing conflicts in the memory. 5) No time is required to communicate data between the proces- sors and memories. In a properly designed system, for steady state data and control flow, standard pipelining techniques may be used to approximate 1 and 2. We will present theoretical and practical bounds on the number of proces- sors required, so 3 can be removed. It will be noted that the algorithms of this paper mainly involve various kinds of vector and matrix products. References [2] and [4] show various schemes for accessing rows, columns, diagonals, square blocks, etc., without conflicts using a single memory storage scheme. Such a memory should provide a good approximation to 4. A pipelined alignment network can be used to approximate 5 and the details of one possibility are given in [13] • Finally, we note that for the algorithms we present, simultaneous multiplication and addition (MIMD operation) would allow us to further reduce our processor bounds, but as our algorithms are presented all processors execute the same operation at the same time (SIMD operation) . Thus we feel that our results could be of direct use in the design of real machines. The following definitions will hold throughout the paper. Let T be the time, measured in unit steps, required to perform some calculation using i independent (parallel or pipeline) processors. We define the T T T l speedup of a p-processor machine over a uniprocessor as S = ■=— , and we T l define its efficiency as E = _< 1, which may be regarded as the quo- P P P tient of S and the maximum possible p-processor speedup p. For various results in Sections 2 and 3 we will bound p, T and S . In Section 4 we P P will discuss these further and introduce E values as well. P 2 .2 General Linear Recurrences In this section we discuss bounds on the time and processors for evaluating general linear recurrences. The main result is Theorem 1. We also give Algorithm 1 which may be used for the parallel evaluation of a general linear recurrence system. Some general notations will be followed throughout this paper unless otherwise specified. We use upper case letters to denote square i matrices, lower case letters to denote scalars, and lower case letters with overbars to denote column vectors. Given any n by n matrix A = [a..] , A, denotes the sub-matrix consisting of all elements a., for ij n*n Tc ij k< i _< n, k, is defined as R: x. = for i < 0, 1-1 = c. + Z a. . x. for 1 < i < n . i . 1 ij 3 ~ ~ In matrix notation we write this as x = c + Ax where x = \ x -i > x p> • * • j x n ' > c = \ c n y C 2' ' ' ' y °n ' and A = Ta. .1 with a. . = for i < j. ' ij nxn i j - For simplicity, we will assume n is a power of 2 for the remaining dis- cussion. If it is not, the system can be augmented and the results applied directly. We refer to x as the solution vector of R . If the — — * * solution vector x of R is equal to the solution vector x of R , * we say that R = R . 
i 10 Definition 2 Given any general linear recurrence system R : x = c + Ax , we define a set of n-1 systems {R.|l <_ j <_ n - 1} where R.: y. = a. + A.y. J J J J J in which y. is the solution vector of R.. The set of all solution J J vectors y., 1 £ j _5_n - 1, can be written in matrix form as a strictly lower triangular matrix Y = [y. . in which y. . = for i < j , ij - i-1 = a. . + E a.,y, . for 1 <_ j < i <_ n . 1J k=j+l lk k3 As an example of these definitions, consider the following linear recurrence system with n = 4. Example 1 Given R<4> : x, = c. x 2 = c 2 + a 21 X;L X 3 C 3 + a 31 X l + a 32 X 2 X A c 4 + a 41 x l + a 42 x 2 + a 43 X 3 we have 11 A = 21 31 32 a 7A %1 %2 %3 L x - [X--., Xp, x , x< ) , - V C n > C p> c o> C h ' In this case there are three corresponding systems R. for 1 < j < 3, written: R x <3>: y 21 = a 21 y 3l = a 3l + a 32 y 2l y ^l = \l + %2 y 2l + %3 y 3l ' R 2 <2>: y 32 " a 32 and Thus we have y ^2 ~ %2 + a J+3 y 32> R 5 <1>: ^1+3 = %$ ' Y = y 21 y 31 y 32 Jkl Y k2 J k5 where y. is the solution vector of the corresponding system R. for J J 1 < 3 < 3. 12 [.emma 1 Given any R : x = c + Ax , there is a corresponding R*^i>: x = c + Yc - (I+Y)c , such that R = R - Proof : Our proof is by induction on n. The cases n = 1 and n = 2 are trivial so we will examine n = 3 as our basis. If n = 3, we have from Definition 1 R <3>: x x = c X 2 = C 2 + a 21 X l = °2 + a 21 C l x 3 = c^ + a 31 x ± + a^ x 2 = c 3 + a^ c ± + a^ (c 2 + a^ c^ The corresponding system has the form R*<3>: x = c x 2 = c 2 + y 21 C] _ x 3 = c 3 + y 3l °i + y 32 °2 ' Now, by Definition 2 we know that y 21 = a 21 (1) y 3l = a 31 + a 32 y 21 = a 3l + a 32 a 2l (2) and y = a . (3) By substituting Equations (1), (2), and (3) into R <3>, we get R*<3>: x x = c ± X 2 = C 2 + a 21 C l x 3 = c 3 + (a 3l +a 32 a 2l )c l + a 32 c 2 = c 3 + a 3l C l + a 32 (c 2 +a 2l c ]. ) - 13 By comparing this system with the final righthand side obtained above for R<3> , we see that R<3> = R* <3> ■ Now as our induction hypothesis, we assume the lemma is true for n = k, and prove it for n = k + 1. That is, given any R, we assume there is an R =■ R where R can be written as : \ y 21 i y 3i y 32 x • ♦ • • y ki y k2 •' y k,k-l \ (k) Any system R has one additional equation whose form by Definition 1 is *k+l = C k+1 + V-l.l X l + ^+1,2 X 2 + '" + a k+l,k *k : C k+1 + (a k+l,l' a k+l,2' •*•' a k+l,k } \X, > X» , • ■ • j ^T»' (5) If we substitute the righthand side of Equation (4) into (5), we get 14 x k+l " °k+l + ( Vl,l' Vl,2 ) * • • > Vl,k ) r 21 31 J 32 y^o i • • • • y y, i kl ,y k2 * ' -%k-l c l] °2 ; v. ) By the definition of y k+1 for 1 <_ j £ k (see Definition 2) , this can be written as *k+l = C k+1 + (y k+l,l' y k + l,2^'-^ y k+l,k ) ^ (6) Now by combining Equations (4) and (6) , we can write x = x l 1 r— — > C l X 2 y 21 i C 2 x 3 y 31 y 32 1 , c 3 • • • • 1 m . = • • • ' • - • • • • m \ y kl * * * y k,k-l 1 | 1- °k L \ + l _ y k+l,l ' • ' y k+l,k-l y k+l,kl 1 _ C k + l (I+Y)'c Thus, the lemma holds for n = k + 1, and therefore for any integer n. Q.E.D. Before giving the next lemma, further definitions are needed. 15 Definition ^ Let f be a function of a. , where 1 < i < n and < j < n-1 and f be a function of a. , where p+1 < i < p+n and p < j < p+n-1. Let f ' be derived from f by replacing each a. . in f. by a. . We sav that f -l ij l l+p, j+p * 2 is a p-shifted function of f^ if f» = f and we express this symbolically by writing f, * f . 
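Before proceeding, a small numerical sketch may help make Definition 2 and Lemma 1 concrete. The code below (our own illustration with made-up data, not part of the thesis) builds the associated matrix Y column by column directly from the recurrence of Definition 2 and checks that x = (I + Y)c reproduces the forward-substitution solution of x = c + Ax.

```python
import numpy as np

def associated_Y(A):
    """Matrix Y of Definition 2 for a strictly lower triangular A (0-based):
    Y[i,j] = A[i,j] + sum_{k=j+1}^{i-1} A[i,k] * Y[k,j]."""
    n = A.shape[0]
    Y = np.zeros_like(A)
    for j in range(n):
        for i in range(j + 1, n):
            Y[i, j] = A[i, j] + A[i, j + 1:i] @ Y[j + 1:i, j]
    return Y

# Example data (ours): a 4-equation system x = c + A x.
A = np.array([[0., 0., 0., 0.],
              [2., 0., 0., 0.],
              [1., 3., 0., 0.],
              [2., 1., 4., 0.]])
c = np.array([1., 2., 1., 3.])

# Serial forward substitution gives the reference solution.
x = np.zeros(4)
for i in range(4):
    x[i] = c[i] + A[i, :i] @ x[:i]

# Lemma 1: the solution is x = (I + Y) c.
assert np.allclose((np.eye(4) + associated_Y(A)) @ c, x)
```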
Definition ^ For any R < n >, and any ordered pair (i, j), < j < n-1, 1 < i < n, we define a sequence of functions f,(i, j), f„(i, j), f.(i,j ),..., f (i, j) such that j+2r-l f _ (i,j) = f (1,3) + 2 f (i,k).f (k,a) for r=2 m , 0< m < t-1, dr r k=j+r with the boundary conditions f^i, j) = a for i > J, in which f (i, j) - for i < j, a._ = c. for 1 < i < n. iO i — — For any given f (i, j), if we apply repeatedly the recurrence definition to express f (i, j) in terms of f , f ,..., and finally of f , 2 IT we can express f (i, j) as a function of a. .'s only. In the following, we will assume this final expression for any f (i,j) unless otherwise specified. Some properties of the f function can be observed directly from the above definition: Property i) fg r (i,j) = f r (i,j) for i-j < r. P Property ii) f r (i,j) ■• f r (i+p,j+p) • 16 Property iii) If g is derived by replacing each a in f (i,,j) lK r by a ±+p k , then g - f r (i+p,j), for i-,j > r. To illustrate Property ii) and Property iii), let n = 8, i = k, j - 0, r - 2, p = 1. Definition k, Property ii) and Property iii), respectively, give us: f 2 (M) = ^(M) + f 1 (4,l)-f 1 (l,0) = a^ Q + a 4l a 1() , f 2 (5,l) = ^(5,1) + f 1 (5,2)'f 1 (2,l) = a 51 + a 52 a 21 , and fg(5,0) = f x (5,0) + f 1 (5,D'f 1 (l,0) = a 5Q + a 51 a 10 . Wow we will prove a lemma which forms the basis of the remaining theorems . Lemma 2 Given any R , we have i-1 f (i,0) = a. _ + Z a. .x. = x. n v ' xO . , ij J i for 1 < i < n . Proof ; Our proof is by induction on n. As a basis, the lemma is obvious for n = 2. As an induction hypothesis, we assume it holds for n = s. Thus, by hypothesis we have i-1 f (i,0) = a.~ + Z a. .x. = x. for 1 < i < s. s v ' lO 1 • X3 J i ~ " Now we prove the lemma for n = 2s and consider two cases. Case 1 1 < i < s. We know from Property i) and Equation 7 f (1,0) = f (i,0) = x. for 1 < i < s. 2s ' s ' i — — (7) 17 Case 2 s+1 < i < 2s. From Equation 7 w e have s-1 f (s,0) = a - + Z a .x. = x for 1 < i < s. S V ' ' BO , , B] 3 S - By applying Property iii) to the above f (s,0) we can get s s-1 f (s+k, 0) = a ^ _ + Z a . .x. for 1 < k < s. s v ' s+k,0 s+k, j j - J (8) Having these values, the equations for x , , s < i < 2s, of R<2s> can be ex- pressed in matrix form as Equation 9« X s / X s + 1 X s+2 = • • • w X 2s . >. ff s (B,0) ■ f (s+1,0 ) f (s+2,0) s ' ' f g (2s,0) s+1, s !> .« a ~ , s+2, s s+2, s+1 fc a 2s,s a 2s,s+l * * * a 2s,2s-l ° By Lemma 1, we know that Equation 9 is equivalent to X s X s+1 • X s+2 • X 2s L J (9) ^s+1 L s+2 L 2s s+1, s y s+2,s y 2s, s s+2, s+1 1 y y. 2s, s+1 . . . ,y 2s,2s-l 1 f s (s,0) " f (s+1,0) f g (s+2,0) f a (2s,0) (10) where by Definition 2 we have i-1 y. . = a. . + Z a., y, . U 13 k=J+1 ik y kj for s < j < i < 2s. (11) 18 Now consider the equations for x , 1 <_ i <_ s , of R<2s>. If we relabel all the x. as x. n , we obtain a system of the form 1 1 y ^ i-1 x i,o = a i,o + ^ a ik"k, k=l ' 1 < i < s. (12) From Equation 11 and Equation 12 it follows by Definition 3 that X i-j,0 "* y i,j ' and from Equation 7 we have s < j < i < 2s, (13) f (i-j,0) = x. . = x. . n , 1 < i-j < s. (i*0 Now by Property ii) we have f 8 (l-3,o) I f fl (l,3) > so by Equation Ik (15) Thus we see from Equations 13 and 15 that y. . and f (i, j) are both j -shifted 13 s functions of x. . _. and it follows that 1-3,0* y ld = f fl (i,d), s < j < i < 2s. 
(16) By substituting the right hand sides of Equation 16 into Equation 10 we get s+1 s+2 ^2s "1 • f s (s+l,s) 1 f (s+2,s) f (s+2, s+1) S S f (2s, s) f (2s, s+1) s s f (2s,2s-l) 1 s f s (s,0) f s ( S +i,o: f (s+2,0 s f (2s,0) s (17) 19 and by Definition k, = (f s (s,0), f 2g (s+l,0), f 2s (s+2,0),...,f 2s (2s,0)) t . (18) Hence we have f 2s (i,0) = x., S+l < i < 2s. Cases 1,2 together show that the lemma holds for n = 2s and hence Q.E.D. for any n. Now, we can express our parallel algorithm for evaluating R in terms of iterative computations of tJL±,0) = x ± for 1 < i < n. It follows from Definition k that each f (i,0), 1 < i < n, can be expressed as a function of f , f , .... f, recursively. Hence, we can evaluate them in the reverse n ' n ' ' 1 ' 2 k \ order simultaneously for all i. By Property i) of the function f, the evalua- tion of f (i,0) = x. will terminate on the (log i)-th iteration and hence it n ' i 2 takes at most logun iterations to complete the evaluation of the whole R system. This is illustrated in the example of Figure 1. To carry out this evaluation scheme for any given R system, we present in the following a computational algorithm, in which identical function values (such as f (2,0), f (3,0) in Figure l) at the same level of computation will be evaluated only once. Algorithm 1 a. Let B be a lower triangular matrix of order (n x n) in which the j-th column contains a. . n for j < i < n where a. = c. . u i,j-l "" "~ io i b. Let C be an alias for B, i.e., B and C represent the same memory locations . c. Repeat this step for i = 1, 2, . . ., log n; i: Set k = 2 1 ; ii: Partition B and C as shown in Figure 2; ili: Compute S. = S. + T. * Q. for i < j < n/k simultaneously. J j 3 3 _ 20 d. The first column of B contains the solutions x for 1 <_ i <_ n. The effectiveness of this algorithm for evaluating R can be easily demonstrated by the fact that at the beginning of the i-th iteration any element b in any of partitions S., T., or Q., for p,q j j j 1 <_ j <_ n/k, represents the value of f, ,.(p,q-l); and at the end of that iteration, any element in S. represents f, (p,q-l). The example shown in Figure 1 relates the functional notation f to the matrix notation of Algorithm. Note that at the end of the i-th iteration of Algorithm 1, the first 2 elements of the first column of B contain the solutions fx. Il < j < 2 } . We will illustrate this as well as the algorithm in the J following example. Example _2 Given R<8> as follows. x 1 = 2 j x 2 = 1 + x 1 x^ = 1 + 2x 1 + x 2 \ = k + x x + 3 x p + Sx^ x = 2 + 3x 1 + 3x, p + 2x + 3x^ x 6 = 3 + x l + x 2 + x 5 + \ + x 5 - Xy = 1 + 2x.. + 2x_ + x, + x, + k r + 2x A I J- d J) 4 9 O x 8 = 3 + x 1 + x 2 4- 2x 3 + 2x^ + x 5 + x 6 + x ? Tne solution set is [2, 3, 8, 31, 126, 173, 930, 1285) which is 21 a o •H ■U CO u d CM c o •H ■U CO U A v ttf M O «4-l CO W u 3 bO •H 22 O X en u o M u o «4-l 00 c c o QD X cr < 2 fa E C Q M be •H fa 23 ITERATION 1 ITERATION 2 ITERATION 3 2\ 3 8 X X ^s® 31 X 33 X X X 3\J 16 X X X 4 N 19 X X X X X X 21 6 30 8 2\ 24 3 \ FINISHED 2\ 3 8 X 31 X X 126 X X X 173 X X X X >X 900 X X X X X }X 1285 X X X XXX }x © Figure 3. Parallel Evaluation of R<8> 24 the same as shown in Figure 3. In the figure 'x 1 denotes a number which is no longer needed, matrices B and C have been overlayed as a single matrix, and © , © indicate, respectively, matrix multiplicatiorf and addition specified by Algorithm 1. From the above, we can obtain a theorem for general linear recurrence problems. 
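Before stating the theorem, the following sketch transcribes Definition 4 and Lemma 2 directly. It mirrors the quantities Algorithm 1 computes, but without the B/C partitioning of Figure 2, and it is serial code intended only to check the arithmetic; the function name and the random test data are ours.

```python
import numpy as np

def doubling_solve(A, c):
    """Sketch of the doubling idea behind Algorithm 1, via Definition 4:
        f_1(i,j)  = a_ij  (with a_i0 = c_i),
        f_2r(i,j) = f_r(i,j) + sum_{k=j+r}^{j+2r-1} f_r(i,k) * f_r(k,j),
    and Lemma 2 gives f_n(i,0) = x_i.  Rows are indexed 1..n and columns
    0..n-1 as in the text; n is assumed to be a power of 2."""
    n = len(c)
    F = np.zeros((n + 1, n))
    F[1:, 0] = c                          # a_{i,0} = c_i
    for i in range(2, n + 1):
        F[i, 1:i] = A[i - 1, :i - 1]      # strictly lower triangular a_{i,j}
    r = 1
    while r < n:
        G = F.copy()                      # every update must use the old f_r values
        for i in range(1, n + 1):
            for j in range(i):            # f(i,j) = 0 for i <= j
                k_hi = min(j + 2 * r, i)  # f_r(i,k) = 0 once k >= i
                if j + r < k_hi:
                    G[i, j] = F[i, j] + F[i, j + r:k_hi] @ F[j + r:k_hi, j]
        F, r = G, 2 * r
    return F[1:, 0]                       # x_1, ..., x_n

# Self-check against straightforward serial evaluation, in the spirit of Example 2.
rng = np.random.default_rng(0)
n = 8
A = np.tril(rng.integers(1, 4, (n, n)), -1).astype(float)
c = rng.integers(1, 5, n).astype(float)
x = np.zeros(n)
for i in range(n):
    x[i] = c[i] + A[i, :i] @ x[:i]
assert np.allclose(doubling_solve(A, c), x)
```

Each pass of the while loop corresponds to one iteration of Algorithm 1: one multiplication step followed by a summation of at most r+1 operands, which is where the parallel time bound of Theorem 1 below comes from.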
In what follows, we will write log_x to denote flog ? xT for notational simplicity. Theorem 1 Any R can be computed in 12 3 T <_ ■£ log 2 n + - log 2 n , n 3 with P £ ~ + 0(n ) . 2 The minimum speedup is 0( =— ) . log 2 n Proof: By Lemma 2, x. = f (i,0), 1 < i < n. But from J l n — — Definition 4, each f (i,0) can be expressed in terms of f , f , ..., f , • n n ii 1 2 4 Each level of the expansion, e.g., from f, to f, increases the computa- 2 tion time by one multiplication and one summation of ("T + 1) operands; k the summation can be done in at most log_(-^- + 1) steps which is less than or equal to log_k steps. Thus, for a total of log„n levels, we have log 2 n log 2 n 12 5 T < Z .1 + E 1 = - log 2 n + g- logon. y j=l ,1=1 d d 25 Since the serial computation time is n(n-l), the speedup S is n(n-l) S > 1, 2 3 n P - g g? n + 2 log 2 n 0( n 2 log ? n For the processor count, we refer to Algorithm 1 and Figure 2. The maximum number of processors are required at the multiplication time for each iteration i. Hence, we will count the total number of multiplica- tions for T. * Q., 1 < j < n/k, as the required number of processors p(? ) J J for a particular iteration i. It can be easily observed from Figure 2 that p(k=2 i ) = [k/2 E J + k (n-k) ♦* k/2 1) I j + f (k + 2k + . + (n-2k) for 1 <_ i <_ log 2 n . For large n, p(k) takes its maximum value at k = n/A and gives p(n/4) = (r^ n + I6n + 8n)/128; hence the maximum number of processors is 8 n 3 /68 + 0(n 2 ) Q.E.D. We close this section with several observations about Theorem 1. If multiplication time is significantly greater than addition time on some computer it is worthwhile to note that only log„n of our unit time steps are multiplications, the rest being additions. We also note that the processor bounds can be further improved by delaying the evaluation of some smaller trees, e.g., those for x. and x in Figure 1. However, as the algorithm 26 stands, all processors execute the same operation at the same time (SIMD organization) and by delaying some trees we would require additions and multiplications at the same time (MIMD organization) . As our algorithm 3 stands we need about n /68 processors (in an SIMD organization) and for 8 2 n <^ 2 this could be regarded as 0(n ) processors. Also note that in a very straightforward way, we can apply the algorithm and results to the solution of any triangular linear system of equations Ax = b . This is done by transforming the system Ax = b to the system x = c + Bx in which c . = b ./a. . for 1 < i < n l i n — — and b.. = -a.. /a.. for 1 < j < i < n , lj lj li - J - = for i < j . 2. 3 M-th Order Linear Recurrences In this section we discuss bounds on the time and processors for evaluating m-th order linear recurrences. The main result is Theorem 2. We also give Algorithm 2 which may be used for the parallel evaluation of an m-th order linear recurrence system. Definition 6 An m-th order linear recurrence system of n equations, R, is defined as R: x. = for i <_ , i-1 = c. + Z a. . x. for 1 < i < n , 1 • • !J 3 — — j=i-m where 1 < m < n. In matrix notation we write the above as x = c + Ax where 27 A is a strictly lower triangular band matrix, i.e., a = for i < j or i - j > m. For simplicity we will assume n and m are powers of 2. If n is not, we can augment the system, and if m is not, we choose the smallest power of two greater than m and apply our results directly. Since we are now dealing with a band matrix, we are able to prove the following lemma. 
Lemma 3 For any R, m <_ s <_ n/2, we have i-1 f 2s (i ' 0) = f s (i > 0) + Z f s (i,k)'f,(k,0) "for s < i < s+m, (19) k-s = § + £ " f (i,k)'f (k,0) for s+ra < i < 2s, (20) k=s where g is derived by replacing each a. ^ . in f (i,s+m-l) bv a k., s+m-1 s ' kO Proof : Since we are only concerned about i j^ 2s, it is sufficient to consider the first 2s equations, i.e., x. for 1 £ i <_ 2s as an R<2s,m> system which is a special case of an R<2s> system. Now, the equations for x., s < i < 2s, can be recalled from Equa- tion 17 (derived from Equation 9 and Equation 10), and the equivalent relation stated in Equation 18. However, by Equation 8 and the property that a. . = for i - j > m, we have f (s+k, 0) = a for m < k < s . (21) S S '&* vJ mmt ™" By substituting Equation 21 into Equation 17, and decomposing the x vector and its corresponding f vector into an m-vector and an (s-m+l) -vector, it is obvious that the first m-vector will confirm Equation 19 • Furthermore, the 28 second (s-m+1) -vector shows that f_ (s+m,0) zs f (s+m+1, 0] f 2s (2s,0) s+m x s+m+1 2s f (s+m,s) s f (s+m+1, s) s f (2s, s) f (s+m,s+m-l) f (s+m+1, s+m-1) f (2s, s+m-1) f s (s,0) f g (s+l,0) f ( s+m-1, 0) + b where b = f (s+nH-l,s+m) 1 f (s+m+2,s+m) f (s+m+2, s+m+1) 1 f (2s, s+m) f (2s, s+m+1) . s+m,0 s+m+1 , C 2s, (22) By Equation 16, we can rewrite b as b = Z s+m, s+m+1, s+m+2, 2s, where Z = y s+m+1, s+m y s +m+2 , s +m y s +m+2 , s +m+l y, 2s, s+m y' 2s, s+m+1 (23) 29 However, from Equation 16 and Lemma 1 we have f (s+xn, s+m-1 f (s+m+1, s+m-1 ) s ' f (2s, s+m-1) s y s+m, s+m-1 y s+m+l, s+m-1 y, 2s, s+m-1 = Z . a s+m, s+m-1 s+m+1, s+m-1 2s, s+m-1 (24) Note that each a. - for s+m < k < 2s in Equation 24 has a k, s+m-1 — — one-to-one corresponding a, _ in Equation 23, and that from Definition 2 there are no a. . for <^ j <_ s+m in any element of Z. Hence, Equation 22 through Equation 24 together prove Equation 20 in the lemma. Q.E.D. By using Property ii) and Property iii), we can easily extend the lemma as follows. Lemma k For any R, m < s < n/2, we have i-1 f_ (i, j) = f (i, j) + Z f (i,k)*f (k, j) for s < i-j < s+m , "2s k=j+s j+s+m-1 + Z f ,(i,k)-f (k,j) k-j+s for s+m < i-j < n , where g is derived by replacing each a. k,j+s+m-l inf s (i ^ +S+m - l)bya kr The parallel algorithm for evaluating R is essentially the same as that for general case, i.e., for each f (i, 0) = x. , 1 < i < n, we express it in terms of f , f , . . . f, recursively, and then evaluate them in the reverse n/ n' 1 "" 2 k 30 order simultaneously for all i. The main difference between this case and the general case is that the recurrence definition stated in Lemma 4 will be used instead of Definition 4 for the expansion of f , m < s < n/2. As a result, the computation time introduced at each level of expansion from f to f is a n m constant instead of a variable increasing with the levels of expansion. Similarly, to carry out this evaluation scheme for any given R system we can establish the following computational algorithm. Algorithm 2 a. Let B be a lower triangular matrix of order (nxn) in which the j-th column contains a. . , for j < i < n, where a. = c. and a. . = for 1,3-1 ° - - 10 i ij i-j > m initially. b. Let C be an alias for B i.e., B and C represent the same memory locations. Repeat this step for i = 1 , 2, . . . , log p n ; i: Set k = 2 1 ; ii: Partition B (diagonally shaded areas) and C (vertically shaded areas) as shown in Figure 4(a) (1 <^ i <^ log„m) and Figure 4(b) (log 9 2m <_ i <_ log_n) . 
Note that matrices B and C are overlayed as a single matrix in the figure; iii: If 1 < i < log m, then compute S =S .+T.*Q. for 1 < j < n/k simultaneously, u J J J else compute S.=S.+T.*Q. for 1 < j < n/k , J J J j ~~ ™~ and V.=V.+T *U f or 2 < j < n/k simultaneously J J J J 31 CM M •H U o 00 o 4-1 bO C •H e o •H ■U •H ■U U P* o e <1> M 60 32 VI E CM o 60 o 4-1 00 c •rl c o Ptl •a a) 3 C u 3 bO •H fa 33 /x X X X 2 2 Ji O CM CM \ *-t / yi" * /x * ! * X XIX /v x X X | X A \ m CM ^ A x \ is S 2 to in M O N 01 s s /* \ < oo — m r- c\J (V 0> /* \ 1.1 t I f ■ ! CM C\J ♦ >o t O » J (O U> £ § § * x\ - * A CN VO ■H V « M-l o G O •H •u ctj rH CO > W a) ctj >-i CO &4 (V J-i 3 00 •H 34 d. The first column of B contains the solutions x for 1 <_ 1 <_ n . At the end of the i-th iteration of Algorithm 2, the first 2 elements of the first column of B contains the solutions {x.|l <_ j <_ 2 }. In the algorithm, the V , 2 £ j <_ n/k, contain all the required values of function g as specified in Lemma 4 which will be used during the next iteration. We illustrate this in the following example. Example 3 Given R<16,2> as follows: x ± = 1 x 9 = 3+x^Xq x 2 = 2 +Xl x^ = l + x 8+ x 9 x 3 = 2+ Xl +x 2 x i:L = l+2x 9 +3x 10 x k = l+3x 2+ x 3 ^ = 2 + 2x 10+ x i;L x 5 = 2+3^+2^ x 13 = 2+x 11 +2x 12 x 6 = ^ +3 % +2x 5 x iU = 1+X 12 +X 13 ^ = l + x 5+ x 6 x 15 = 5+2x 13+Xl4 xg = 2 + 2x 6+ x ? x l6 = ^+x lif+ x 15 . ^ | The solution set is {1,3, 6, 16, 40,100, 141,3^3, 830, 1174, 5183,7533,20251,27785, 68292,96081} which is the same as shown in Figure 5« Now, we can state a general theorem about this class of problems. Theorem 2 Any R, can be computed in 1 2 T < (log 2 m + 2)log 2 n - — (log 2 m + log 2 m) . 3m 2 n with p <_ —7 — + 0(mn) . The minimum speedup is 0(- = ) . log 2 m log 2 n 35 Proof ; By Lemma 4 any f n (i,0) ■ x., 1 <_ i <_ n, can be expressed in terms of f^, f^, ..., f . Each level of the expansion increases the computa- 2 4 tion time by one multiplication and one summation of (m+1) operands. For a total of log„(— ) levels, we have °2 m t x < (log 2 Cmfl) + 1) logg(£> < (log 2 m + 2) logg(£) . To express f^ in terms of f^, f^, ..., f we use Definition A for ex- 2 4 pansion as in the general case. Hence, as stated in the proof of Theorem 1 we have 12? t 2 5 2 1 °S 2 in 4 t l0g 2 m * Thus, the total time T < t 4 t , i.e., Xr -*- C T p < ( lo e 2 m * 2)log 2 n - |(log 2 m + log^) . Since the serial computation time is 2mn-m(m+l), the speedup S is Q v, 2mn-m(m+l ) , mn v P " /•, o\-i I/-, 2 _ v log^m log^n' (log 2 m + 2)log 2 n - 2 (log 2 m + log 2 m) & 2 & 2 For the processor count, we refer to Algorithm 2 and Figure h. We will count the total number of multiplications for any iteration i as the required number of processors for that particular iteration. This can be 36 described in 2 cases: (i) 2 < 2 1 < m (Figure k&) f vl k / 2 , (k=2 i, < [x^fi.x)!] £ J + 1 + (2f -D| k(m-l) 2 k k . "^"l . 0=1 |g (k+2)(n-k+2)+^(m-l)(n-m-k+2)+2in(m-k) for 2 < k < m, and (ii) 2m < 2 1 < n (Figure Vb), p(k=2 i ) = \l + (g -2)(m+l)J [" Z j + m(| -1)1 (m+1) [m 3=1 j + m(- -m) ] for 2m < k < - , and p(k) m p v • , /n \ mn m -m = 2 j + m- (- -m) = — - — — 3=1 l 2 for k = n Since p(k) takes the maximum value at k=2m for n>>m, the maxi- mum number of processors p is 2 3 2 2 3m n + 2mn - n 10m - 2m - 4m 3m n , n/ . P 1 4 1 = -J— + 0(mn) . Q.E.D. As was true with Theorem 1, Theorem 2 requires only log_n multiplication steps. Similarly uniform operations are performed across all processors with this algorithm. 
By delaying some small trees, better 37 processor bounds could be achieved at the expense of requiring simul- taneous multiplications and additions. The algorithm and results can also be used in the solution of any linear system Ax=b where A is a triangular band matrix by a simple transformation as stated in section 1. 2.U Practical Implications Now we turn to a discussion of some practical implications of our algorithms. As noted earlier, to obtain the largest theoretical speedup achievable by Algorithms 1 and 2, an unreasonably large number of processors is required. However, by the following methods substantial speedups over serial machines can be achieved at reasonable efficiencies with greatly re- duced numbers of processors. First we shall discuss two methods of reducing the number of pro- cessors required. Then we present numerical results which give detailed estimates of possible speedups and efficiencies using various numbers of processors. Both general and m-th order recurrences are discussed and results are presented for ranges of m and n. Present multi-operation machines perform up to 64 simultaneous bit parallel operations (ILLIAC IV) or up to several hundred simultaneous bit serial operations (STARAN) . Systems under discussion allow for 10,000 or more bit serial processors [14]. As higher levels of circuit integra- tion become practical, it may be possible to use as many processors in future machines as the number of integrated circuit packages in current machines. Thus, machines with several hundred thousand processors may 38 some day be reasonable. The numbers of processors presented below are for this reason allowed a wide range. In modifications of both Algorithms 1 and 2, we use two methods to reduce the number of processors required. The first method, which we call c utting is to cut an nxn A matrix into [n/kl vertical strips of width k and process just one strip at a time. We carry out the following for k = r, 2r, 4r, ..., n, with r=2 for general recurrences and r=m for m-th order recurrences. For a given k value, consider the leftmost k column strip of the A matrix. The strip may be regarded as a triangular kxk system T (at the top of the strip) and a rectangular (n-k) xk system R (at the bottom of the strip). Furthermore, each of the [n/kl strips has the same form, with fewer and fewer rows in successive rectangular systems. Assume that p_ processors are required to solve T (the value of p can be determined from Theorem 1). Using the solution of T, we can solve R by simply substituting these values into each row and computing an inner product. Assume that this required p processors (actually p_ = k(n-k)). Let p = max(p ,p ) K R IK processors be chosen to compute the first strip as well as the remaining [n/kl - 1 strips. Clearly, p will suffice for each of the other strips if they are evaluated in two parts as was the first strip. Using this p value, we obtain overall speedup and efficiency figures and the procedure is repeated for all values of k = r, 2r, 4r, ..., n. Note that we could achieve better speedup results by increasing the strip width to occupy all p processors as we sweep to the right. This 39 was avoided to simply our experiments and its avoidance would simplify the compiler implementation of such a scheme for a real machine. The above procedure generates a number of possible speedups for various reduced numbers of processors. Next, we attack the problem of increasing the efficiency. 
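Before turning to efficiency, here is a small serial simulation of the cutting scheme just described (a sketch only, under the assumptions that n is a multiple of k and that the k-by-k triangular block is solved by plain forward substitution as a stand-in for Algorithm 1; the names are ours).

```python
import numpy as np

def solve_by_cutting(A, c, k):
    """Process the lower-triangular system x = c + A x in vertical strips of
    width k: solve the triangular block T at the top of the strip, then
    substitute the k new values into every remaining row (the rectangular
    block R) with one inner product per row."""
    n = len(c)
    x = np.zeros(n)
    rhs = c.copy()                     # accumulates contributions of finished strips
    for s in range(0, n, k):
        for i in range(s, s + k):      # triangular k x k system T
            x[i] = rhs[i] + A[i, s:i] @ x[s:i]
        rhs[s + k:] += A[s + k:, s:s + k] @ x[s:s + k]   # rectangular system R
    return x

# Usage check on random data (ours): x really solves x = c + A x.
n, k = 8, 4
rng = np.random.default_rng(1)
A = np.tril(rng.standard_normal((n, n)), -1)
c = rng.standard_normal(n)
x = solve_by_cutting(A, c, k)
assert np.allclose(x, c + A @ x)
```

With p_T processors for the triangular block and p_R = k(n-k) for the inner products below it, p = max(p_T, p_R) processors suffice for every strip, which is the processor count underlying the experiments reported in Tables 1 and 2.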
Assume we have a tree of height t which contains 2 -1 operation nodes and whose evaluation requires 2 processors. The 2 fc -l - 2 efficiency of such a tree evaluation is — — = = — . By halving the 2 t -t fc number of processors, only one extra processing step is required, so we can 2 t -l - 4 achieve an efficiency of — —z = — — . For large t, this approximately 2 t (t+1) t+1 doubles the efficiency with a negligible decrease in speedup. We call this process a fold of the tree. After i folds are performed the tree height is t+2 -i-2, while the number of processors is reduced to 2 /2 . Thus, for large i the product of (processors) * (parallel time), which is the denominator of the efficiency calculation, changes very slowly and the technique is of little value. In practice at most 4 or 5 folds proved useful. In the implementation of our algorithms it is of course not the case that a single tree is being folded. But at each stage of the calcula- tion a number of trees must be evaluated in parallel. In spite of the fact that the trees are of different height, our procedure folds them all uniformly. This would simplify a compilation algorithm and also allows all processors to perform the same operations at the same time. By more 40 elaborate scheduling procedures, greater efficiencies could be achieved. The results of our cutting and folding experiments are sum- marized in Tables 1 and 2. In both cases we first generated a number of possible results by cutting and folding. Then, by using the processor count intervals shown as columns in the tables, we tabulated the maximum speedup result obtained for each row. In fact, this is the maximum of a set of lower bounds obtained for each row, column, position, from the minimum speedup . In Tables 1 and 2, the rightmost columns are labelled "maximum possible." These refer to the upper bounds on the number of processors given by Theorems 1 and 2, respectively. In cases where this column is empty, the rightmost value tabulated is the maximum possible. Gaps in both tables for efficiency values up to 1.0 were not run for technical reasons, but could be filled in a continuous way. Table 1 may be interpreted as follows. If one had a machine consisting of p processors, by scanning down the column labelled p, minimum achievable speedup and minimum achievable efficiency can be read for various problem sizes n. Alternately, if one wants to design a machine for problems of size n, a row scan will indicate the maximum number of proces- sors required by our method to achieve the speedup tabulated. It should be noted that the speedups shown along the diagonal with p=n-l, which are trivial to achieve on a parallel machine, can be substantially improved by our methods. By reading across rows to the right, we see that for small n, the speedup can be improved by a factor 41 of 2 or 3 while still maintaining efficiencies of about 10% or better. For large values of n the speedup can be improved by about a factor of 10 over the speedup of the simple n-1 processor case, with order of 10% efficiencies. Of course, if a large number of processors exist they can be used on smaller problems with lower efficiencies to obtain higher speedup improvements than mentioned above. Table 2 may be interpreted as was Table 1, although several points should be clarified. The values shown in Table 2 were obtained for the case n=16,384. However, they hold approximately for other n values as well. This may be explained as follows. 
If we approximate the parallel computation time of Theorem 2 as T = log^m log_n, for a cutting scheme which uses strips of width k the speedup is approximately T c _ __± Z mn mk P T n log„m log„k P - log 2 m log 2 k & 2 & 2 ~ 2 Correspondingly, we have p = m k, which leads to S = * 2~ . (25) P m log 2 m log 2 (p/m ) It may be seen that Equation 25, with p held constant, has a minimum S value for some ra in the range of interest to us. Thus, for p > 256, the columns of Table 2 exhibit a decrease in S and then an increase as m P varies from 2 to 128. Equation 25 also shows that speedup, by this approximation, is 42 independent of n. Thus, we can use Table 2 as an approximation for any n value. Of course, we should not use arbitrarily many processors on any 2 problem and by Theorem 2 the limit can be approximated by p < m n. It follows that entries in Table 2 are valid approximately for n > *s- . Our m experiments were exhaustive for m _< 128 and n <_ 16,384 and our data substantiates the above. The approximations are introduced here simply to avoid presenting many similar numerical tables. Now let us consider the speedup improvement which can be obtained by our method over the simple parallel implementation using p=m shown on the diagonal. For small m and efficiencies of about 10% we can obtain improvements of 3 or 4. For larger m the improvement does not increase much, unlike the general recurrence case. However, if a rather large number of processors is available (say to solve large general recurrences) then very large speedup improvements (at efficiencies lower than 10%) can be realized for small m. For example, if 16,384 processors were available, we could achieve a speedup improvement over the diagonal values of about 225 for m=2 (S =451.93) and about 83 for m=4 (S =332.62). P P We conclude that for m-th order recurrences with small m, which are in practice encountered quite frequently, the methods discussed here should be useful in exploiting large parallel machines. 43 m 1/1 CM O 887 287 • • in 88 8 £8' CM O CM s W— CM 88! m o » p: 3 N6U1 5o| en on 5 S ^3 en o in Sin CM lO«< — ©< enf» ♦ noig F-om tON 9i cn r^O _ «M ^ S o oB jjoS SoS set 383 m Of- R S 10 CM 10 cm out £33 now Ne'3 33g — f >o 888 8 - S3 CM £ IO KOA in d * 8SK in dip *;7S s - x - r-CM3 CjON 1 s lOKie cm p— in cm d # cm m p in •— cm • • f. IS i cm m ;d* 3£3 CO O — in d 8 c* en in • • CM CM ON en en oi>>8 *7l| *MCM jf r-! d 8 g m 2Ri NOV I - nunc no* CM C* CM »— en ^- O O CM 8og m a? en en ia »n m en o d d >» 583 CM O * s° 5 8~3 in d i7 s •*8 c- o r-» 388 • -CM S oSJ Sm cm O m en d3 e». encM CM«-f- uion 323 V 8 CM ' en di 8 o ^78 S' M Q CM O •- ^^H SS| 83 = 8 C in in in ■ . cm »•- l# * CM tf> «d p cm in S8S8 CM d 83! SrsiS in o 883 d d 8 In ■ cm o 8 CM •— m en 8 enm oo d ON ' oin So «■ in CM o 88^ 44 CO 0) o c •» o 3 eo 8 S • • evj 53' cm o 88' CM r— 8' r- cm cm °.PR fO O ^- 00 US : $ CM CM CM ♦ on . . vo O n- $< >— 1*1 CM >-Offl ison SO CM o r- . . 00 •— o — CM O >— • • Cn Son ■- o co en cm UIOi- 8S' «* o en S tn~ s§: s°5 in -*■ 3 © ! aiei S°s 5°2 3S8 • • m ss .s — o * r-^O en m Q en — © »'ei 5s: est o 88' cm o 88T to CM WON t-o <5 • • «■> SCSI to o co ■ . en S82 . . 00 ui o c6 n ia :8J£ cnj o •— f* S • • • CM 8°r r^ o >- 8 • © CM CSI (SI 5°i co t— to O o ■— • • O 3 ro CM en r- eo co o to =s 3 © CM >— O CM £81 VO O CM to © in • • i/» r% o r** 3 3 8 >— 00 O CM • r— ■«■ o to IO a »- CM S o.- co 8SS s: — CM CO CM O CM O 4J o. en Q • io o f- • • CU 8°' 3o3 85S O IO w MOO Sf*» CM ©en eo o eo a 45 ems A 3. 
TIME AND PARALLEL PROCESSOR BOUNDS FOR LINEAR RECURRENCE SYSTEMS WITH CONSTANT COEFFICIENTS 3.1 Introduction Linear recurrences with constant coefficients arise frequently in general numerical computations. Several analytic methods for the solution of these equations are available, such as by solving for the roots of its characteristic polynomial or by the use of generating functions. We are interested in a parallel and direct computational algo- rithm which can be used to evaluate them at a high speed. Such syst may be represented as x = c + Ax" where c is a constant column vector, is an nxn strictly lower triangular matrix with bandwidth of m, m < n, and all elements are identical along each sub-diagonal. We show that 0(log 2 m log 2 n) time steps and 0(mn) processors are sufficient to solve such a system. We also show that general recurrences, i.e., with m = n, can be computed within Odog'n) time steps with at most n 2 /4 processors. While such systems remain in the same order of speed as in the algorithms for arbitrary coefficients discussed in Chapter 2, the bounds on processors are sharpened by a factor of m and n for the m-th order and general constant coefficient system, respectively. To achieve these results, all processors need only perform one type of operation at each time step (SIMD operation). We further show that our method can be modified to evaluate the remote terms in a homogeneous linear recurrence sequence with at most 46 2 3 0(m ) processors in contrast to the 0(m ) processors required by the Miller-Brown algorithm. Also, the parallel evaluation of n-th degree polynomials can be completed within 21og 9 (n+l) time steps with at most ^r- + 1 processors which compares favorably with recent results [15], [l6l for practical applications. 3.2 General Linear Recurrences with Constant Coefficients In this section we discuss bounds on the time and processors for the direct evaluation of the following class of general linear recurrences R: x. = 1 for i < 0, i-1 = c. + £ a. x. . for 1 < i < n . In matrix notation, we write this as x = c + Ax where A = strictly lower triangular matrix. For example, [a. . ] is a nxn R<4>: r" "" x i X 2 = *3 x 4 . ; ll L *J a 1 a 2 a ± a~ a„ a, r "1 x i X 2 • X 3 x 4 _ For simplicity, we will assume n is a power of 2. The main result is Theorem 3. We also give Algorithm 3 which may be used as a basic algorithm for the parallel evaluation of this class of problems. 47 First, let us state one important lemma which follows directly from Definition 2. Lemma 5 Given any R : x = c + Ax , its associated matrix Y as defined in Definition 2 has identical elements along each sub-diagonal. For example, let A = 1 2 1 3 2 1 then Y = 1 3 1 8 3 1 It is obvious that the j-th column of Y can be obtained by simply shifting the first column (j-1) places downward. We will denote this operation as y . = y, [ j-1] where y. is the j-th column of matrix Y, These notations will be used throughout this chapter. Now, we present the proposed parallel algorithm and the proof of its effectiveness will immediately follow. Algorithm 3 a) Let B be a lower triangular matrix of order (n*n) 1 1 ' ^"9 ' * * * ' ^"n ' b 2 = (0, o^, ct 2 , . . ., « n _ 1 ) , and b = b 2 [j-2], 3 < j £n . 48 b) Let C be an alias for B, i.e., B and C represent the same memory locations. c) Repeat this step for i = 1, 2, . .., log~n; i: Set k = 2 ; ii: Partition B and C as shown in Figure 6; iii: Compute S. - S. + T. * Q. 
for 1 < i < min(2,?-) j j j j ~ - k simultaneously; a) (2) n iv: Set y = y ± [j-1] for 1 < j < k, 2 <_ ft <_ £ ; v: Compute D. = D. + W. * Z. 3 3 3 3 simultaneously. n for 1 j< j j< min(2,-rr-) d) The first column of B contains the solutions x. for 1 < i < n. claim: To justify this algorithm, we need only to prove the following Lemma 6 At the end of the i-th iteration of Algorithm 3, we have - (1) t 1) y 1 ■ (x 15 x 2 , . .. , x^) ; 2) for each partition y £k+l,£k y Jlk+2,£k y £k+2,Jlk+l (£+1) . for 1 < I < 7- - — — k y ik+k,Ak y j2k+k,£k+l J2k+k, flc+k-1 where y are elements of the associated matrix Y of A; and 49 3) for k <_ — , D is the vector of the partial sums of the first k terms in the expression of x, . , x, _, ..., x~, , and similarly D 2 is that of y 3k+lj2k , y 3k+2>2kJ •••> y 4k>2k for k ■ which combines with (y k+ i,k' y k +2 ,k' •••' "TiCtJ. , K _ K.+Z , K. y_ ) from the previous iteration (hypothesis Condition (2)) in forming |k,k (2) (2) the first column y of partition Y of the current iteration. —(2) With the new result of y , now, we can see that the shifting operation in step (iv) will result in Condition (2) due to Lemma 5. After step (iii) and step (iv) , we know that Z = (x , x , ..., 3c _i ) • Since in all previous iterations, no operation has been done on the data below the partitions on the diagonal, Y (l) for 1 <_ i ^ | , we have D l (c k+l' C k+2' •'•' c k ) ' ; a k+l,l a k+l,2 w 1 -| a 2k,l a 2k,2 • • ^l.k-l ^+2,1 a k+2,2 * ^+2^-1 2k,k-l Hence, the resultant vector D of step (v) is a vector of the partial sums as stated in Condition (3). Similarly, it is true for D_. This completes our proof. 51 CO O 60 u o M c •H c o vO K 274 x X X J X X X \^ Figure 7. Parallel Evaluation of R<8> 53 It is worthwhile to note that the shifting operation stated in step (iv) could be easily accomplished by a proper indexing function, thereby no major computation time is involved and more efficient use of storage is possible by such a scheme. To illustrate this algorithm, we use a R<8> as an example shown in Figure 7. With this algorithm, we can how establish a general theorem about this class of problems . Theorem 3 Any R can be computed in 2 T < log n + 21og n - 1 . p — z £. with p _< n /4 . 2 The minimum speedup is 0( r— ) . log 2 n Proof : During the i-th iteration of Algorithm 3, if a sufficient number of processors are given, then the matrix operation in step (iii) takes at most one multiplication time plus the time for the summation of at most Cr+1) operands, which is log_k addition steps. So that for a total of log n iterations, this step alone takes log ? n 12 3 t 1 5. l J + log 2 n = 2 lo 8 2 n + 2 log 2 n * Similarly, step (v) takes at most one multiplication time plus the time for the summation of k operands which is log-k addition steps. 54 ,IK . Thus, for a total of log_(y) iterations, this step needs log 2 ty n t < Z 3 + log 9 (y) in 2 ! i i 2 log 2 n + 2 log 2 n " X Therefore, the total computation time T < t. + t. , i.e., 2 T <_ lo 8 2 n + 21o 82 n " X ' Since the serial computation time is n(n-l), the speedup S is s > 2 ^ = o(-3— -) . P log 2 n + 21og 2 n - 1 log 2 n To achieve the above speed, a sufficient number of processors should be provided at the multiplication time for each iteration i. So we will choose the maximum of the number of multiplications in step (iii) and step (v) as the number of processors required for that particular iteration. 
For step (iii), we can see from Figure 6(a) that k/2 p (k=2 X ) <2Ej for 2 is a special case of arbitrary coefficient system, we can use the latter algorithm to compute R with some operations masked off. Due to Lemma 5, now the matrix operations S. = S. + T. * Q. in step (iii) of Algorithm 1 are performed J J J J only for j = 1, 2. All others are masked and supplemented by shifting operations of S_. Further masking is possible even in computing the portion of S„ in the top triangular part (Figure 2) with Q ? and S_ shrinking to vectors instead of matrices, as in Algorithm 3 (Figure 6). In this case, it can be shown that the maximum number of processors required is about n 3 /128. 56 3.3 M-th Order Linear Recurrences with Constant Coefficients In this section, we will turn to the considerations of practical m-th order recurrence problems, i.e., R: x. = for i £ , m = c. + Z a. x. . for 1 < i < n , i i=1 i t-j - - where 1 <_ m < n. In matrix operation x = c + Ax with A being a strictly lower triangular matrix of bandwidth m. For simplicity, we will assume both n and m are powers of 2. Since A is now a band matrix, we can further speed up the computation of Algorithm 3 when i >. log 9 2m . This is done by changing partitions S., Q., T., Z., D. and W. in Figure 6 to that in Figure 8, 3 3 3 3 3 3 and introducing new partitions U., V., G. and E. and their corresponding 3 3 3 3 matrix operations as described in Algorithm 4. The proof of its validity can be found in Lemma 3 and 4 of Chapter 2. Algorithm 4 a) Let B be a lower triangular matrix of order (nxn) in which t>2 = (0> a i> a 2' ***' a m' ^' ' ' * » 0) > and b = b 2 [j-2], 3 <. j < n . b) Let C be an alias for B, i.e., B and C represent the same memory locations. c) Repeat this step for i = 1, 2, ..., log n ; 57 (i) Set k = 2 ; (ii) Partition B and C as shown in Figure 6 (1 <_ i <_ log„m) and Figure 8 (log 2m <_ i <_ log^n) . (iii) If 1 <. i £. log»m , n, then compute S. = S. + T^ * Q. for 1 < j < min(2,— ) J 3 J 3 ~ ~ 'm' simultaneously ; n s else compute S. = S. + T. * Q. for 1 < j < min(2,— ) , 3 3 J j ~ ~ k and V. = V. + T * U. for 2 < j < - 3 J 2 j - J -k simultaneously . (£) (2) n (iv) Set y = y. [j-1] for 1 <_ j <_ k, 2 £ l <_ t- ; (v) If 1 <_ i £ log m , n then compute D. = D + W. * Z. for 1 < j < min(2, ■=— ) 3 5 3 3 simultaneously ; 2m' n else compute D. = D. + W. * Z. for 1 < j < min(2,-rr-) Till — — 2k and E. = E. + W * G. f or 2 < j < Tjr 3 3 simultaneously. - 2k d) The first column of B contains the solutions x. for 1 <_ i £ n. For illustration, an example of R<16,2> is shown in Figure 9, In that figure, all numbers are kept in place to help understanding, although they may not be used after certain iterations. From the above, we can obtain a general theorem for the m-th order linear recurrence problems. 58 CD X (T < 2 e x. •U •H M O 60 O H-i 60 C •H c o ■u J-i (0 Pu c Q oo QJ U 3 60 59 ITERATION I tfsrf ^k^> T*pO 2 i ' 2 fx -— "f — — ^* "^M^"W i i • Hfe !\ -- -_w -_ + _-4--i_-t-'-V ' ^ .L^J._L.!iK if^« f4«X .i;j.j__.i_^X .UJ_i-LW-i?X Z 1 I I I 12 iX --}-■»— f-l-4-i-J—|— f-'X -_ r _4._J._4--l 1 1 l-VX 3 " ' i i ' ' 1 2 iK — r-*-- I-+-J— «— T-t-f-l— '^V Oil 1 1 ' ' ' 2 "Vv - h -uf- ( .-t.- t -L-' t-j-i-A iii iii i i i 1 2 rx. -_4_4 1 1 1 ( 1 [ 1 L-U.|..L l \ 1 i i I I i 1 i i 1 i 1 2 iK 2 i i i i i i i ' ' i I '2 i\. 2 n --T_±: i -_i._U_L._l., 4» l » I 3 ' i\ — ' — — |--t--l-. ..i4-L-L4-^ i ' i ' 3 [ ".\ + _^_4_ 4 — I l--i £■ 4--1- _, 1 ,__ ,__!__ 3 ! f — i — J — »— +-+-+-+- 4—}— p\ ' i l I i I i ? 
i\ Lii_i_4_j_w_ »_J_j___j_jiJi: ' I ' 7 I ' I , ' ! ' I ' ' * i.N 1 i l I i iii 1 3 ' tV li i i i I I i i i i i ' 2 i\ _4.-4.-__ + -^-4---_4_^_-|_-»--j-H--_-N 2 I i I ' i I ! ' i ' ! i |3 N _J I I I i I I U_l I I i 1 ITERATION 2 Figure 9. Parallel Evaluation of R<16,2> 60 ITERATION 3 IV 2 \ 5 2 \ a 3 r i 3 \ "\® 22 2 45 92~J — ]5 3 K 183 1 ill ^ — 5 3 [JjK 2^ 3 "* I 3 •5 l\ 3 l\ | © 10 16 + -- "" la 5 11 3 V^ 5 3 l\ 37 |43 21 11 5 |3 \ 70 |85 43 21 11 '5 3>v 146 |171 85 43 21 11 5 3\ ITERATION 4 rv 2\ 5 2\ 11 3 22 N- ~ 45 92 ^, 3 Jv --^53 ^K ~~" s® 183T 11 5 3\ 185] 2 2 3 >v 3 5 3 IV 10 11 5 3\ 16 21 It 5 3 }v 37 43 21 11 5 3 jv 70 85 43 21 11 5 3 >v 146 171 85 43 21 11 5 3 i\ ^•^^ lS. 2 \ 5 2 1\ 11 3 l\ 22 2 r\ 45 3 l\ 92 5 3 rs. 183 11 5 3 ^ ji\ 368 736 is V 1473 \f> 3 JV 2948 111 5 3 i\ 5694 [21 11 5 3 l\ 11791 | 43 21 11 5 3 i\ 23580 J85 43 21 11 5 i\ 47164 • 171 85 43 21 11 5 3 f^ i Figure 9 (continued). Parallel Evaluation of R<16,2> 61 Theorem A Any R can be computed in T <_ 2 log-n for m = 1, 2 £ Ulog-m + 3) log.n - (log 2 m + log_m + 1) for m > 1, with p < mn. The minimum speedup is 0(- ; ) . log 2 m log 2 n Proof ; For 1^.1^, log 9 m, step (iii) is similar to that of Algorithm 3 and takes totally log 2 m 12 3 t _< Z j + log 2 m = - log 2 m + - log 2 m . j=l However, for log 2m <^ i <_ log n, this step needs only one multiplication time and one summation time of (m+1) operands per iteration. This amounts to t 2 < (log 2 (m+l) + 1) log 2 (^) < (log 2 m + 2) log^) . For step (v) , it is similar to Algorithm 3 for 1 < i < log-(-r-) and the total time is log^-j) m 1 2 1 t 2 — l J + log 2^ = 2 log 2 m + 2 log 2 m ~ 1 * This same step will take one multiplication time and one summation time of m operands per iteration for log_m £ i <_ log(— ) . Thus, in total it has t 4 < (log 2 m + 1) log <*) . Hence, 62 T < E t = (21og m + 3) log„n - (log_m + log„m + 1) , P - j = 1 j 2 2 2 2 m > 1 . For m = 1, since the W.'s diminish, T < t, + t„ ■ 21og_n. 3 p — 1 2 b 2 Comparing this with the serial computation time 2mn - m(m + 1) , the speedup is S > 2mn - m(mfl) (21og 2 m + 3) log 2 n - (log 2 m + log 2 m + 1) = 0( mn log 2 m log 2 n ) . In obtaining the above speed, we assume that there are enough processors at the multiplication time of any iteration. The number of multiplications required in step (iii) can be observed from Figure 6(a) and Figure 8(a) that p-lOc . k/2 2 ) <2 Z j f or 2 < k < m , J-l i(r+l)[Zj+m(f- m)] k j=l l m k <^ E j + m(— - m) for k n for 2m <_ k <_ y , = n For step (v) , we can see from Figure 6(b) and Figure 8(b) that 63 m p 2 (k) < 2k(k-l) for 2 £ k < | , m-1 <_ 2 E j for k = m , j=l m-1 < (—+ 1)( Z j) for 2m < k < | , 3=1 m-1 _< E j for k = y . When n>>m, p.. (■—) = (3mn - 6m + 6m) /4 is the maximum value and 2 p_(k) = 0(m ) for all k's. Therefore, the processor bound is mn. Q.E.D. As was true with Theorem 3, Theorem 4 requires only 21og 9 n - 1 steps. All processors are performing the same operations at the same time. Without this restriction, a better processor bound could be achieved. Also note that as m increases, greater efficiency can be achieved by using this algorithm instead of Algorithm 2 for arbitrary coefficients as described in Chapter 2. On the contrary,, we can achieve twice this speed by applying the latter algorithm with some operations masked off. In this case, the processor bound can be shown to be 3 0(mn + m ) } an d one might prefer it as m becomes very small. 
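Before turning to those comparisons, it may help to see the logarithmic behaviour of Theorem 4 in a more familiar form. The sketch below is not Algorithm 4 itself: it is a present-day illustration (in Python, assuming NumPy; all names are ours) that recasts a constant-coefficient m-th order recurrence as a product of (m+1)x(m+1) companion-style matrices and forms all prefix products by an associative scan, so that about log_2 n rounds of simultaneous matrix products recover every x_i. This is essentially the recursive-doubling idea of Kogge and Stone [11]; its constant factors and processor count (roughly n m^2 here) are worse than the T_p and p <= mn bounds of Theorem 4, which exploits the band structure directly.

    # A sketch only (not Algorithm 4): solve x_i = c_i + a_1*x_{i-1} + ... + a_m*x_{i-m},
    # with x_i = 0 for i <= 0, by an associative scan over per-step update matrices.
    # Assumes NumPy; the parallel rounds are merely simulated by the while loop.
    import numpy as np

    def recurrence_by_scan(a, c):
        m, n = len(a), len(c)
        mats = []
        for i in range(n):
            # State s_i = (x_i, x_{i-1}, ..., x_{i-m+1}, 1) and s_i = M_i s_{i-1}.
            M = np.zeros((m + 1, m + 1))
            M[0, :m] = a                 # the m previous values feed x_i
            M[0, m] = c[i]               # ... plus the constant term c_i
            for r in range(1, m):
                M[r, r - 1] = 1.0        # shift the window of previous values
            M[m, m] = 1.0                # carry the affine '1' component along
            mats.append(M)
        d = 1
        while d < n:                     # ceil(log2 n) rounds, each one parallel step
            mats = [mats[i] @ mats[i - d] if i >= d else mats[i] for i in range(n)]
            d *= 2
        s0 = np.zeros(m + 1)
        s0[m] = 1.0                      # zero initial values
        return np.array([(P @ s0)[0] for P in mats])

    # Check against the straightforward serial evaluation for m = 2.
    a, c = [2.0, -1.0], [float(i + 1) for i in range(16)]
    xs, x1, x2 = [], 0.0, 0.0
    for ci in c:
        xi = ci + a[0] * x1 + a[1] * x2
        xs.append(xi)
        x2, x1 = x1, xi
    assert np.allclose(recurrence_by_scan(a, c), xs)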
We will conclude this section with some comparisons between these results and some known algorithms in the evaluations of two distinct special cases. First, in computing the remote terms, e.g., x , in a homogeneous recurrence sequence relation x- + olx. -, + a.x. .+ ... + a x, =0, 1 ± l-l 2 i-2 m i-m 64 Miller and Brown [17] have shown an algorithm with about the same speed 3 as that of Algorithm 4 by using 0(m ) parallel processors. However, it can be shown easily that all vectors U., V,, G, and E. are zero vectors, j J j J and hence those operations can be masked off totally. In addition, since we can leave out all unnecessary intermediate results, parts of the matrix operation S. = S. + T. * Q. in step (iii) and D = D + W * Z in step (v) can be further masked out and modified as shown in the new partitioning in Figure 10. Therefore, the maximum number of processors 2 3 2 is 3m . On the other hand, given m + m processors, we can use Algorithm 2 in Chapter 2 with the new partitioning as shown in Figure 11, and achieve twice the speed of the Miller-Brown algorithm. As a result, we can establish the following facts. Corollary 4.1 Any remote term x of a homogeneous linear recurrence relation x + a x . , + a o x. + .. . + a . x = , i 1 l-l 2 x-2 u i-m m 2 can be computed within (21og m + 3) log 9 n time steps with 0(m ) processors, 3 and within (log m + 2) log_n steps with 0(m ) processors. Any polynomial P (a) of degree n can be represented as the last solution of the following linear recurrence relation R: x. = c. + a x. .. for 1 < i _< n + 1 . x l x-i — By Theorem 4, this can be computed within 21og (n + 1) steps. However, the processor bound can be reduced as we only need the last solution. This is done by changing the partitions Q , S. and T. for j = 1,2 in Figure 8 65 to that in Figure 10 (note that other partitions U., V. in Figure 8 still remain). Thus, it can be shown to have the maximum demand of processors for k = 2, i.e., p £ I— — I + 1. For illustration, let us evaluate P ? (a) by Algorithm 4 (or Algorithm 2 since for m = 1, both algorithms become identical) but with the modified partitions mentioned above on the matrix B = [b..] of Figure 8. The whole process can 1J 8x8 be summarized as follows. First, we compute 2 a = a*a , and b. . = c. + c. , a i,l l i-1 for i = 2, 4, 6, 8 simultaneously; then compute 4 2 2 a = a *a , and b i,l = b i,l + b i-2,l a for i = 4, 8 simultaneously; and finally compute b 8,l = b 8,l + b 4,l a which is now = [(c g + c 7 a) + (c 6 + c 5 a)a ] 2 4 + [ (c , + c_a) + (c~ + c.a)a ]a The final result in b-- is obviously the value of P_(a). This procedure may be generalized to any degree n and the computational process will 66 proceed like 2 P (a) =c +c .a+ (c o+c a)a n n n-1 n-2 n-3 2 4 + (c . + c c a + (c , + c -,a)a )a n-4 n-5 n-6 n-7 + (c n _ 8 + c n _ 9 a + (c n _ 1Q + c n _ n a)a 2 + (c n _ 12 + C n-13 a + (c n-14 + c n-15 o)o,2)oA) ° 8 "T* • • • • It is interesting to note that the above procedure is exactly equivalent to the well-known Estrin's method [l8l« I n other words, Estrin's method becomes a very special case of our algorithms for evalu- ation of general R systems. Hence, we can formalize it as a corollary. Corollary 2.2 Any n-th degree polynomial can be computed within 2 log (n + 1) time steps using at most I— r— + 1 processors. By comparing this result with the known k-th order Horner's rule, it is slightly faster and generally requires less number of processors to achieve the same speed [10], [19]. 
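To make Corollary 2.2 concrete, the following short sketch (Python; the function names are ours, not notation from the thesis) evaluates p(x) = c_0 + c_1 x + ... + c_n x^n by the pairwise scheme just described: each round combines adjacent pairs with the current power of x and then squares that power, so about log_2(n+1) rounds of independent multiply-add operations suffice, and no round needs more than about (n+1)/2 of them.

    # A sketch of Estrin-style pairwise evaluation (our naming):
    # p(x) = c[0] + c[1]*x + ... + c[n]*x**n.
    def estrin(c, x):
        vals = list(c)                   # invariant: p(x) = sum_j vals[j] * y**j
        y = x                            # current power of x: x, x^2, x^4, ...
        while len(vals) > 1:
            if len(vals) % 2:
                vals.append(0.0)         # pad with a zero coefficient if needed
            # the pairs below are independent: one parallel multiply-add round
            vals = [vals[2*j] + vals[2*j + 1] * y for j in range(len(vals) // 2)]
            y = y * y
        return vals[0]

    def horner(c, x):                    # serial reference for the check
        acc = 0.0
        for coef in reversed(c):
            acc = acc * x + coef
        return acc

    coeffs = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, 6.0]     # a degree-7 example
    assert abs(estrin(coeffs, 1.7) - horner(coeffs, 1.7)) < 1e-9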
Kuck and Maruyama [16] show that a general polynomial form of degree n can be evaluated in 0(log_n + /81og_n) steps. While the speed is faster, it requires far more processors (p = 2n) than that required by our result. In practical applications where n is not very large, this method, even compared to the fastest known multifolding method [15], becomes attractive not only because of its simple implementation but also its easy integration with more general linear recurrence problems. 67 a JS 4-1 •H V-l c o o 60 •H rH 4-1 < «0 3 >. iH x> CO > A W B a .H C 01 V i-H l OS iH co W M 3 03 O PL, (U C >-l ■> M-l p o 0) M 3 60 •i-l 68 •H W P o fH toO 0) iH rH < iH rt >« V4 Xi cfl cm A s M n O c y-i V OS 00 c CO •H 3 C o o OJ •H c ■u ai •H 60 4J o H B «J cm X o c •H •H e E >m. However, the practical multiprocessing systems at present may consist of only a limited number of processors, p < n. In this case, using those algorithms could result in somewhat inefficient computations as shown in the middle of the upper left portion of Table 2. The major reason is the inherent nature of non-uniform distributions of processors required at each computational stage. On the other end of the line, it is trivial to show that such systems can be solved in 0(n) steps using p = m processors, with a limited speedup S = m. P We provide in this chapter two alternative algorithms for the evaluation of m-th order recurrences. It is shown that if n>>p>m, then any such system can be speeded up at least by ~ — - — p for arbitrary 2m+3 coefficients and ~ ^~_ p for constant coefficients, at an efficiency independent of n. When m is very small in practical programs, computations by these methods will generally be more efficient than by those discussed in Chapters 2 and 3. 71 4.2 M-th Order Linear Recurrences with Arbitrary Coefficients In this section we consider the evaluation of any given R: x = c + Ax for n > p > m . For simplicity, we assume n, p and m are all powers of 2 although our algorithms can be used for arbitrary values. We will describe this algorithm symbolically by an R<32,4> system as an example. The principle can apply to any R system. First, let B be a lower triangular matrix of order (nxn) and B = ["c,A] . All given data are shown as dots in Figure 12(a). Next, let us assume we have p = km = 16, then we can partition B into k = 2- = 4 subsystems Z , 1 < 1 < k. These are shown as solid m J — J — dots in Figure 12(a), in which each Z "* , 2 <_ j <± k, consists of the solid dots in the diagonal pre-concatenated by those in the first column of B. Thus, the number of columns in each Z , is r - (m-1) = 5 except Z with ^ = 8. We also partition B into Y J , 2 £ j <_ k, as shown in Figure 12(b), each with — columns. Every Z ■* or Y ~* will be treated as an independent recurrence system with the chopped corner assuming zero values. In the first stage, we solve Z J , 1 <_ j _< k, simultaneously, each with m processors. This is done in a very straightforward way, i.e., multiplying the i-th column by the (i-l)-th element of the first 72 M-. • • • • • • • • • • •«, • • z • • • • • • • #|0 • «io o o o o o o o o o o o • 2h,16- 3h,24 - n,32 o o • • o • z • • • • • • • •lO • •jo o _#|o o o o o o o o o o o o w (3) o • z • • • • • IO o I o oq_ 0006 000 o o • • .2. (a) MATRIX B Figure 12. Dynamic Partitioning for Algorithm 5 73 h.8 • n,32 • • • • Y *% • • • • • • • • • • • • • • (3) • • • V (b) MATRIX C Figure 12 (continued) . 
Dynamic Partitioning for Algorithm 5 74 h,8- 2h,16- 3h,24 ■ n,32 o o o o o o •I 2i y- -* o o 2 2 O D 2 O •J O o o o o o o o o o o o o o o o o o OOOOl o o olc o o o I o o o o o o ri: o o o o o w x o o o, ojo 0,0 OIO •~!o • • • • • •l Y 2 #l I o o o OOOI o o olc fo ^••^ o o o o o o o o o o o o, o ojo W 2 0|0 o o_oio o o • «no o o o o I I • Y 3 #| ° olo ojo o f o 10 1^ 10 o o o OOOi o o ojo o w 3 o,o o O O OIO o o o o o o (c) MATRIX E Figure 12 (continued) . Dynamic Partitioning for Algorithm 5 75 column of Z , and adding the resultant vector to the first column. This proceeds successively from i = 2 to the last column. Having done this, we will have in Figure 12(c) D == (x l' X 2' •*•' X n/k-l ) ' and, in particular Z, = (Z 11' Z 12' •" Z lm ) where z.. denotes the partial sum of the first (m+2-j) non-zero terms in the expression of x , 1 <_ j £ m, respectively. <£>i+j-i In the second stage, we solve each Y J and its associated (m-1) subsystems (defined in Definition 2) in parallel, and also simul- taneously for 2 <_ j _< k. As a result , we have W. = J r+l,r r+2,r y r+2,r+l r-hn,r y * * * y r+m,r+l y r+m,r+m-l y ,n - y ,n _ ... J r-h— l,r r+r— l,r+l k k r-h — l,r+m-l k for 1 £ j <_ k - 1 , 76 and Y. = J y r y r,|+m-l y r+m-l,| ' r+m-l,yfm-l for 2 <_ j £ k, ,n s where r = (— )j and y's are elements of the associated matrix Y of A. Now, by applying Lemma 3 repeatedly, we can compute the following first-order recurrence system of vector matrix form Z. = Z. + Y. Z. . , 2 < i < k xxi x-1 — — (26) where new x xl x2 • * ' Z *J) xm and z.. is defined as above. Note that Z, and Y, have been augmented by xj k k zeroes to become conformable and the first element of each resultant Z., x i.e., z.-, 1 < i < k, is the solution x, ,. s . . xl' — — (n/k)x After that, by the same reasoning of Lemma 3, we may evaluate all other solutions by D. x + W.Z. for 1 < i < k - 1, xx — — simultaneously . We may summarize this procedure as follows, 77 Algorithm 5 a. Let B be a lower triangular matrix of order (nxn) in which the i-th column contains a. . ., for j < i < n, where a. rt = c. J i,j-l -j _ — ' iO i and a. . = for i - j > m initially. b. Let C, E be aliases for B, i.e., B, C and E represent the same memory locations c. Set k ■» *■ and h = 7- ; m k i: Partition B, C and E as shown in Figure 12(a), (b) and (c) , respectively, in which the order of each partition is Z (j) : (h+m-1, h) for j = 1, (h, h-m+1) for 2 <_ j <_ k - 1 , (h-m+1, h-m+1) for j = k , y'^': (h+m-1, h) f or 2 <_ j < k - 1 , (h, h) for j = k , Dy (h-1, 1) for j = , • (h-m, 1) for 1 < j < k - 1 , for 1 <_ j < k - 1 , for j = k , for 1 _< j <_ k - 1 , for 2 < j <_ k - 1 , for j = k . Z.: 3 (m, 1) (1. 1) W.: J (h, m) Y.: J (m, m) (I, m) 78 ii: iii: Compute all recurrence systems Z , 1 <_ j <^ k, simultaneously; Compute each recurrence system Y J and its associated (m-1) subsystems in parallel, and also simultaneously for 2 < j < k; 1 < i < n. iv: Compute the recurrence system of vector-matrix form Z. = Z_, + Y. Z. . , 2 < i < k . x i i x-1 — — v: Compute + W.Z. 1 x for 1 < i < k - 1 simultaneously. d. The first column of B contains the solutions x. for x Note that at the end of step (ii) , solutions x., 1 <_ i _< h, are in place; at the end of step (iv) , x, . , 1 <_ i <^ k, are also in their corresponding positions. In the following we use a R<16,1> as an example to illustrate this algorithm. Note that matrices Z. and Y. become scalars in this i J case. Example 4 where Given <16,1>: x = c + Ax c = (1,2, 1,1, 1,5, 4, 1,2, 1,2, 4, 3, 1,2,1)* , a . . 
= for i - j > 1 or j-i^.0, 79 and the first subdiagonal of A is a d s (a 21' a 32 ,a 43'*" ,a 16,15 ) = (1,1,2,2,4,2,1,1,2,1,1,2,3,4,1) If we have p = 4, then we have k ■ 4 and (1) _ (4) . (2) . 1 "l "2 " 2 1 10 1 z (2 > = 5 4 4 2 z (3 ) = 1 2 2 1 10 2 , 1 1 , 4 1 - 3 1 3 2 4 10 1 , 2 " 1 " 2 - 4 2 Y (3) . 2 1 Y (4) m 3 4 1 , 1 , 1 The solutions of Z , 1 < i < 4, are D o" (x 1 ,x 2 ,x 3 ) = (1,3,4), Z 1 (1,9,22), Z 2 = 23 , (2,5,7), Z 3 = 11 , (3,10,42), Z 4 - 43 . = x, = 9 , 80 and the solutions of Y , 2 <_ i <_ 4, are W* = (0,2,8,16), Y 2 = 16 , Wj = (0,1,2,2), Y 3 = 2 , \t* = (0,2,6,24), Y 4 = 24 . With these results, we will solve the following recurrence system to obtain Z. = Z. + Y. Z. - for2>p>m, then any R can be computed by using p processors within (2m +3m)— + 0(m log_(p/m)) time steps. The minimum speedup is approximately ~ _, p Proof: In step (ii) , each system Z , 2 j< j <_ k, can be computed by m processors within o /inn \ 1 /TOR 1 \ t < 2(— - m) < 2(- 1) . 1 - p - p (j) In step (iii) , one processor is assigned to every Y J and each of its associated (m-1) subsystems for 2 <^ j _< k. This requires t„ < (2m -m) (2m-l) . I — p By applying Algorithm 2 to step (iv) now, during each iteration i k i (let r=2 ) we have — matrix-vector products as well as vector-vector k r additions, plus (— - —) matrix-matrix products. These amount to km , ,k r, 2 . , ^ „ . , . Tj - + (y - y) m inner-products of 2 m-vectors in total. By assigning 82 one processor to each inner-product and assuming vector-vector additions are performed at the end of each iteration, then it takes k 2 t 3 < log 2 k + E (2m-l) f(jp (m+1) - ^/p] r=2 < [(2m-l)£ + 1) + 1] log 9 ( £ ) . For the last step (v) , there are n - — - (k-1) pending solutions, each requiring one inner-product and one addition. Hence, totally we need t 4 < (2m)|-(n- if**- 1))/P] • The entire computation, therefore, can be done within 9 n 9 T £ E t. - (2m +3m)- + 0(m log, (p/m)) P i=1 i P for n>>p>m . And since the serial computation time is T- = 2mn - m(m+l) , 2mn - m(m+l) P (2m 2 +3m)-+ 0(m 2 log 9 (p/m)) P *• 2 p for n>>p>m 2m+3 Q.E.D, If only the last solution, say x , is required, then step (v) can be omitted. This gives us Corollary 5.1 If n>>p>m, then any remote term x of R 9 n 2 can be computed by using p processors within (2m +m)— + 0(m log- (p/m)) 2 time steps. The minimum speedup is approximately — - — p 2m+l r 83 For homogeneous R, a slight modification of the above algorithm should have more efficient computations than that stated in Theorem 5 and Corollary 5.1. However, we will not elaborate on it here. We close this section with some observations about the computa- tional efficiency. A quick comparison could be made at e.g., m = 1 as shown in Example 4, in which we solve the following system R: x.=c.+a.x. n , 1 < i < n , ill l-l — — in ~ — time steps in general, and within " — time steps if only the P P last solution is needed. For the latter case, we can compare it to the p-th order Horner's rule for the parallel evaluation of n-th degree polynomials (n+l>>p>2) which is equivalent to finding the last solution of the following system R: x. » c. + a x_, , , 1 < i < n + 1 . i i i-I — — It is well known that p-th order Horner's rule will take ~ — time steps P while our method only needs steps to evaluate a far more complex y P system. For more numeric comparisons with the previous methods, we list in Table 3 some experimental results for n = 16,384. 
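Before examining those tables, the structure of Algorithm 5 in its simplest setting may be worth restating in executable form. The sketch below (Python; every name is ours) carries out the m = 1 setting of Example 4: each of the k chunks is solved independently with the value entering the chunk left symbolic, the k chunk-boundary values are then chained through a short recurrence of length k, and finally the interior entries of every chunk are corrected in parallel. The parallel stages are only simulated; the point is that the work per processor is a small constant times n/p when n >> p, which is the regime Theorem 5 addresses.

    # A sketch only (our naming) of the m = 1 case of the block decomposition:
    # x_i = c_i + a_i * x_{i-1}, x_0 = 0, solved in k chunks of size h = n // k,
    # one (simulated) processor per chunk.
    def blocked_first_order(a, c, k):
        n = len(c)
        h = n // k                      # assume k divides n, as in the text
        x = [0.0] * n
        zend = [0.0] * k                # chunk-end value obtained with input 0
        w = [1.0] * k                   # product of the a's across a chunk
        for j in range(k):              # stage 1: chunks are independent
            v = 0.0
            for i in range(j * h, (j + 1) * h):
                v = c[i] + a[i] * v
                w[j] *= a[i]
                x[i] = v                # provisional value (exact for chunk 0 only)
            zend[j] = v
        bound = [0.0] * (k + 1)         # stage 2: true values at chunk boundaries
        for j in range(k):
            bound[j + 1] = zend[j] + w[j] * bound[j]
        for j in range(1, k):           # stage 3: fix up chunk interiors in parallel
            prod = 1.0
            for i in range(j * h, (j + 1) * h):
                prod *= a[i]
                x[i] += prod * bound[j]
        return x

    # Check against the plain serial loop.
    a = [0.5 + 0.01 * i for i in range(32)]
    c = [1.0] * 32
    serial, v = [], 0.0
    for i in range(32):
        v = c[i] + a[i] * v
        serial.append(v)
    assert max(abs(u - t) for u, t in zip(blocked_first_order(a, c, 4), serial)) < 1e-9

For m > 1 the same three stages survive, but the chunk-boundary quantities become the m-vectors Z_i and the correction multipliers become the matrices W_i and Y_i of Algorithm 5.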
It is easy to see that for small m this alternative algorithm can give far better speedups and efficiencies than the ones in Table 2, as shown in the entries to the right of the bracket in each row. For example, given p = 256 we can obtain a speedup improvement 70 32 over the previous method of about 5 (= * ) for m = 2 and about 14 . 81 84 w > •H ,Q ^ H ■s Eh ■ ' " ■ -=1- vo OJ VO H rH O O H f- O ITN H VO o LT\ OJ r- O OJ VO LT\ H VO o 0\CO on o -=*" o OJ CO VO H H O H CO LTN o o on o OS on ONO H OJ on H OJ C- O CO Lf\ -=1- H -=T O LTNVC O H LT\ O CO o H H on o H O H OJ O OJ VO O VO VO ir\ on H O H VO H CO -3- t—co on oj J- o ONVO O OJ on o on h J- on oj d OACO oj d OJ ON ON H OJ O LT\ H o> a ~=t- o ONO IfN OJ rH O C— CO vo lt\ j 1 d C— H H O l>- o • - on-* on on h d OJ VO VO VO cO d OJ ON O ON O i-H H * / ./ £ OJ -=j- CO VO H 86 w oj o c ID H U 3 o CO 1* -* CM M o «0 iH 0) c II •H hJ C M M 0) O T3 "4-1 M O m J5 1 1 ■u g •H J-i vw O o M iH CO ! w .a m s s s CM 5 ° o o ro o\ o H O £ 8 Sn $ r- On co O -* o H itn O i-4 On o H H CO O CM H VQ IA iH 0J o ON H £ S 00 21 Ov O CVJ Ov O H ?1 ° VO o ir\ co 00 O ^ s vo ia 35. r4 vS « CO O ON O ro O ro ro vp vo cu ro oo O i-t ITN r-l ITN i-l ro vo vo O vo o d ° VO O ro O VO 5s d IfN VQ VO H Os J- J- to vo o VO o H O CVI g* as ON CVJ itn on CO H CO r4 J- H CO vo CO o CM en 00 o ITN O ro o H O O ITN VO H -* d p ro CVI CVJ vo O ro ON ro ro aj d o o t— CO O H oo cvi r- H ^* o CO O H O 3 CO o t- CVJ O ro ITN O 5 3 t^- d 2 3 cu d VO CVJ in d CO 04 d ON O C— VO jt d ro ro •H O J- r— t- VO VO cvj d cu I 1 2.00 1 1.00 ; - - ft / / fl cvj j- 00 3 87 3 (= ->/ >yc. ) f° r m = 4. However, when m becomes bigger which is not the practical case usually, then this method will become impractical. To show the differences for a moderate range of n we use double entries in Table 4 for n = 128 and Table 5 for n = 1024. Each entry represents S and E . The left entries are for the previous algorithm with p p p omitted, and the right ones for the new algorithm. From these data and other untabulated results for different ranges of n, we can conclude that: Observation In the evaluation of m-th order linear recurrences , if the following conditions are satisfied: -i \ n .2 !) 2 - P - Am and 2) m <_ 16 , then the computation by Algorithm 5 will yield the same or better speedup (and hence efficiency) than by Algorithm 2. 4.3 M-th Order Linear Recurrences with Constant Coefficients In this section we discuss an efficient algorithm for R when n>p>m. From Lemma 5 in Chapter 3, we know that the partitions of 00 00 Figure 12 now will have (by neglecting the augmented parts of Z , Y , Vi and V Z (D „ z o> CO . y (j) for 2 ^ ± k> 2 < j £ k , 88 and hence the resultant matrices W i =W for l£i < k- 1, 1 < j < k- 1 , i t -t for 2 £ i £ k, 2£j£k . Further, t. = t^Ij-1] for 2 < j < m , where t. is the j-th column of matrix T as shown in Figure 13, and brackets [x] indicate shifting operations as defined in Chapter 3. With these properties, we need only to solve the recurrence (2) — — system Y (without any associated subsystem) to obtain w. - and y 91 • After the shifting operation of T, we can complete the evaluation of Y 2 by Y 2 = Y 2 + Q W;L . Therefore, we can establish the following method for R problems Algorithm 6 a. Let B be a lower triangular matrix of order (nxn) in which b l (C 1' °2' •*" °n ) ' b ? - (0, a n , a , ..., a , 0, ..., 0) V ^2 m and b = b 2 [j-2], 3 < j < n . 
89 h.8- 2h,16- 3h, 24- n,32 - o o o o |0 o o IO o o JO O O Oj 'l |OJb 6 oj IOIO W} Ol l9loo_° H ' )••••) Ml 2 • • Y 2 • o o o o o o_o o _o| \0 Q o!C «_«_y_ w J s^> O > O \ \ o o o o o o, o o OIO o w 2 o|o °_2-9_°J • • • •io • • • •]■ • • Y S «I L*fJL*l o o o o o o o o °\ o|o 0|0 OIO io l_ o o o O O 0| O O OIO IOOWjO'OO |o_o polo o o >• • Y 4 ^iO OOP Figure 13. Dynamic Partitioning of Matrix E for Algorithm 6 90 b. Let C,E be aliases for B, i.e., B, C and E represent the same memory locations c. Let k = £ and h = — ; m k Partition B and C as shown in Figure 12(a) and (b) , respectively. Partition E as shown in Figure 13, where T is of order (h+l,m) , and W , Y , Q are all of order (m-l,m-l); (j) ii: Compute all recurrence system Z J for 1 <^ j j< k (2) and Y , simultaneously; iii: Let t. = t- [j-1] for 2 <_ j <. m, and then compute Y 2 - Y 2 + Q Wl ; iv: Compute the recurrence system of vector-matrix from z i " z i + Vi-i > 2 < i < k ; v: Compute D. D. + W-Z. for 1 < i < k - 1, 1 i — — simultaneously . d. The first column of B contains the solutions x. for x 1 < i < n, In analogy with Theorem 5, we can derive a general theorem from this algorithm. 91 Theorem 6 If n>>p>m then any R can be computed by using p processors within "^ _ n + 0(m log (p/m)) time steps. The minimum speedup is approximately r*- _ p . Proof : Each independent system in step (ii) can be computed by m processors within . .,mn \ ^ n z ™ 1 i n t < 2( m) < 2(— — -l; . 1 — x p-m — p-ni 2 There are (m-1) inner-products of 2 (m-l)-vectors in step (iii) One processor can compute one inner-product in 2. t 2 < 2 (m-1) (m-1)' • In step (iv) , we can use Algorithm 4. Then, during each i k iteration i (let r = 2 ) , there are — matrix-vector products as well as r k vector-vector additions, plus -r matrix-matrix products for 2 <_ r <_ — . Applying the same strategy as that for step (iv) in Theorem 5, we require k/2 2 t 3 < ( I (2m-l) f(|S + a-)/p|) + (2m-l)[^| + log 2 k < [(2m-l)(|+ 1) + 1] log 2 (£) . Similarly to that for Theorem 5, step (v) takes Thus , the algorithm totally requires 4 T < j t 2mn(2p-m) + q^ ( /m)) for n»p>m . v i=i p(.p-mj l 92 Since T = 2mn - m(nrl-l) , we have 2mn - m(m +l) p — 2mn(2p-m) n/ 2 1 , , N . p(p-m) + ° (m log 2 (p/m)) p-m , % p for n>>p>m . 2p-m Q.E.D. By omitting the last step (v) , we can easily show that Corollary 6.1 If n>>p>m then any remote term x of R can be computed by using p processors within — 1- 0(m log» (p/m)) time steps. The minimum speedup is approximately p-m. It is easy to see that this algorithm offers a very efficient computation for linear recurrences with constant coefficients. Since, by letting e.g., p = 2m, we have — <_ E <_ — , and if only the last solution is required we have — <_ E < 1. In other words, by using this algorithm under practical situations, we may obtain speedups proportional to the limited number of processors we have, and the constant of proportionality S P P possible speedup, is at least one-third or more. Also, it is interesting to note that for the parallel evaluation of n-th degree polynomials, which is a special case of Corollary 6.1 with m = 1, this method has the same E = —*- , which can be regarded as the ratio of the actual and maximum order of speed as does the p-th order Horner's rule (n + l>>p _> 2), with efficiency E > -r- . p - 2 As was true with Algorithm 5, Algorithm 6 could be expected 93 to give higher speedups and efficiencies for the evaluation of practical R systems. 
Similar conditions to those observed in the preceding section may be used to decide when this algorithm is warranted. However, since a constant-coefficient system is a special case of an R<n,m> system, the previous observation still holds for R<n,m> systems.

5. SPEEDUP OF LOOP COMPUTATIONS IN ORDINARY PROGRAMS

5.1 Introduction

With all the studies of linear recurrence systems behind us, we now proceed to look at the possible speedups of iterative programs, which are usually written as loops in some high-level language using, for example, Fortran DO or Algol FOR statements. As stated in Chapter 1, there are two ways to increase the speed of these loop computations on a multiprocessing system. The first is to devise new parallel algorithms and provide existing languages with adequate parallel features. This approach helps by providing explicit parallelism in the program and reducing the time otherwise spent in iterative loops. For instance, the development of the parallel FFT algorithm is a prominent example [20]. All algorithms in the previous chapters can be used to replace iterative loops that solve triangular systems in many numerical programs. An extensive survey of these areas can be found in [21], [22].

What we are interested in here is the second approach, i.e., automatically detecting the implicit parallelism in ordinary program loops at the compiler level. Many people have worked in this area. In his automatic program analysis, Russell gives some algorithms to test the possibilities of replication of simple loops [5]. The most extensive studies are done by Muraoka [10]. He provides methods to analyze subscript expressions and to extract parallelism in a loop with a single assignment statement by the replication or wave-front method. The latter scheme is also used by Lamport for an ILLIAC IV Fortran compiler [23]. For loops with multiple assignment statements, Muraoka also describes algorithms to exploit parallelism by replication of the whole loop or by distribution of the original loop into a set of smaller loops, as well as by the wave-front method between assignment statements. Lately, we have generalized most of his work and implemented it in a Fortran program analyzer to measure the potential speedups in ordinary Fortran programs [7], [8]. Cohagan recently also reported techniques to generate vector operations from DO loops in the TI ASC Fortran compiler [24].

All of these works have demonstrated the feasibility of detecting a fairly large amount of parallelism in ordinary program loops. However, a frequently encountered statement in a loop, such as

    X(I) = X(I-1) + Y ,

is not handled by most of the previous workers because of its sequential nature. In fact, this is a special case of linear recurrence relations. Muraoka has discussed this class of first-order recurrences and shows that they can be speeded up significantly by forward substitution and tree-height reduction algorithms. Nevertheless, for a slightly different statement such as

    X(I) = X(I-1) + X(I-2) + Y ,

the above method becomes very difficult to implement and may result in poor speedup if the number of iterations is large. But by using the algorithms of Chapter 2 through Chapter 4, which are simple, efficient, and require no complicated machine structure to support them, we feel that it is practically feasible and desirable to extract parallelism from even these highly serial codes in ordinary programs.
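As an informal illustration of this point, the sketch below (Python, assuming NumPy; all variable names are ours) records what a compiler would have to notice about the statement X(I) = X(I-1) + X(I-2) + Y: the loop is an R<n,2> system x = c + Ax with constant coefficients a_1 = a_2 = 1 and c_i = Y, so any of the algorithms of Chapters 3 and 4 could replace the serial loop. Here we only verify the identification against the loop as written, rather than invoke one of those algorithms.

    # A sketch: the loop body  X(I) = X(I-1) + X(I-2) + Y  is the linear system
    # x = c + A x, with A strictly lower triangular of bandwidth 2 and constant
    # coefficients a_1 = a_2 = 1 (assumes NumPy; names are ours).
    import numpy as np

    n, Y = 12, 0.25
    c = np.full(n, Y)                    # constant term of every row
    A = np.zeros((n, n))
    for i in range(n):
        if i >= 1:
            A[i, i - 1] = 1.0            # coefficient of X(I-1)
        if i >= 2:
            A[i, i - 2] = 1.0            # coefficient of X(I-2)

    # The serial DO loop as written, taking X(0) = X(-1) = 0.
    x = np.zeros(n)
    prev1 = prev2 = 0.0
    for i in range(n):
        x[i] = prev1 + prev2 + Y
        prev2, prev1 = prev1, x[i]

    # The compiler-level identification: the loop's results satisfy x = c + A x.
    assert np.allclose(x, c + A @ x)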
In section 2, we discuss theoretically the speedup spectrum of different types of loops in terms of the problem size (the sequential computation time T_1). As shown before, the smaller the order of any recurrence represented in the loop, the better, as far as speed and processors are concerned. To facilitate this, we introduce informally in section 3 the distribution algorithm to decompose a given loop into loops of smaller sizes. By this algorithm, we may also be able to expose vector operations as much as possible. In section 4, we discuss time bounds of some loop structures with known computation patterns (dependence graphs) and certain measurable parameters, with which the minimum speedup on a multiprocessing system of a particular class of loop computations (and hence of the algorithms they represent) can be estimated. At last, some practical implications for compiler and machine design are described in section 5.

Before we proceed further, some basic definitions and notations are given below which will be used throughout the rest of this chapter.

Definition 7

An assignment statement is denoted by x = E, where x is a single or array variable, and E is a well-formed arithmetic expression composed of the four arithmetic operations (+, -, *, /), left and right parentheses, and atoms which are constants or variables.

Definition 8

A loop is denoted by

    L = (I_1 <- N_1, I_2 <- N_2, ..., I_d <- N_d)(S_1, S_2, ..., S_s)
      = (I_1, I_2, ..., I_d)(S_1, S_2, ..., S_s)

where I_j is called a loop index, N_j is an ordered set called an index set, and S_j is called a body statement, which may be an assignment statement or another loop. We will call L a basic loop if no S_j is itself a loop.

For example, a basic loop like

    L:     DO S_2 I_1 = 1, 10, 2
           DO S_2 I_2 = 1, 10, 1
    S_1:   Y(I_1,I_2) = A(I_1,I_2) + B
    S_2:   Z(I_1,I_2) = C(I_1,I_2) + D

is written as L = (I_1 <- N_1, I_2 <- N_2)(S_1, S_2), or simply L = (I_1, I_2)(S_1, S_2), where N_1 = {1,3,5,7,9} and N_2 = {1,2,3,...,10}, with |N_1| = 5 and |N_2| = 10.

5.2 Speedup Characteristics of Program Loops

In this section, we will characterize program loops according to their possible speedups over the serial computation time T_1. Obviously, a loop is equivalent to a sequence of assignment statements laid out in the original execution order defined by the index sets. In order to formalize the problem, let the i-th statement of such a sequence be

    x_i = E_i ,    1 <= i <= n ,

where n is the total number of assignment statements executed by the loop. For instance, given a loop L as that in the preceding section, we have

    L:  x_1   = Y(1,1)  = A(1,1)  + B
        x_2   = Z(1,1)  = C(1,1)  + D
        ...
        x_99  = Y(9,10) = A(9,10) + B
        x_100 = Z(9,10) = C(9,10) + D .

If none of the expressions E_i, 1 <= i <= n, contains an atom x_{i-m} for m > 0, then there is no dependence relation between any pair of statements. Thus, all x_i = E_i, 1 <= i <= n, can be computed in parallel. We will call this set of loops R. For example, the following loop, which performs a matrix addition and a scalar multiplication,

        DO 5 I = 1, 10
        DO 5 J = 1, 10
        Y(I,J) = A(I,J) + B(I,J)
    5   Z(I,J) = C(I,J)*D ,

has this property. The total time required by any L in R is, by Brent's argument [25],

    T_p <= 4 log_2 e ,

where e = max {e_i} and e_i is the number of atoms in E_i, 1 <= i <= n. Hence, we have for R

    S_p >= T_1 / (4 log_2 e) = O(T_1) .                                   (27)

Now, let us study a slightly more complicated loop, such as one that performs a vector inner-product:

        DO 5 I = 1, 10
    5   T = T + A(I)*B(I) .
Loops of this kind have the property that some of the expressions E_i, 1 <= i <= n, contain a linear combination of x_{i-m_i} for m_i > 0. We will call this set of loops LR. For any L in LR, if we compute simultaneously all subexpressions in E_i, 1 <= i <= n, which do not depend on any value computed in the loop, i.e., on x_i for 1 <= i <= n, then the resultant loop can be treated as an R<n,m> system, where m <= n is the maximum of the m_i over all i. This can be illustrated by the above example of a vector inner-product, in which we can evaluate A(I)*B(I) for 1 <= I <= 10 in parallel. As a result, we obtain all required coefficients to compute the R<10,1> system represented by the original loop. The total computation time of such loops is, therefore, the preprocessing time of the coefficients, which is 4 log_2 e time steps, plus the time to solve an R<n,m> system as stated in Theorem 2. Since n <= T_1 <= ne, we have a speedup for this subset of LR

    S_p >= T_1 / [ (log_2 m + 2) log_2 n - (1/2)(log_2^2 m + log_2 m) + 4 log_2 e ]
        = O(T_1 / log_2 T_1) .                                            (28)

Next, consider the subset of loops which has m = n. For example, given an upper triangular matrix A, to solve Ax = b by the traditional back-substitution method we may write a loop like

        DO 5 I = 10, 1, -1
        X(I) = B(I)/A(I,I)
        DO 5 J = I + 1, 10, 1
    5   X(I) = X(I) - (A(I,J)/A(I,I))*X(J) .

In this example, if we preprocess B(I)/A(I,I) for all I, and A(I,J)/A(I,I) for all I and J, we obtain an R<10> system. Since m = n, this is the worst-case loop of LR. Hence, we can say that the computation time of any L in LR is less than 4 log_2 e plus the time stated in Theorem 1. Therefore, we have for any L in LR

    S_p >= T_1 / [ (1/2) log_2^2 n + (3/2) log_2 n + 4 log_2 e ]
        = O(T_1 / log_2^2 T_1) .                                          (29)

Next, we study a simple-looking, but more complicated loop:

        DO 5 I = 1, 10
    5   X(I) = (X(I-1) + A/X(I-1))/2 .

This is a familiar iterative program for approximating sqrt(A). This class of loops has the property that some of the expressions E_i, 1 <= i <= n, are non-linear combinations of x_{i-m_i} for m_i > 0. We denote this set by L̄R̄ (non-linear recurrence loops). LU decomposition algorithms in standard numerical programs are another well-known example in this class.

Muraoka shows that by using forward substitution any loop with
5.3 Exploitation of Parallelism in Program Loops In the preceding section, a theoretical speedup spectrum has been established for various kinds of loops, i.e., from the simplest Type loop to the most complicated Type 3 loop. In this section, we present a general algorithm which may be used in the semantic analysis 104 portion of a compiler to detect inter- and intra-statement parallelism in program loops and achieve higher speedups. The central idea of this algorithm is to distribute (decompose) a given Type i loop into a set of smaller size loops of Type j , <_ j _< i, which upon execution give the equivalent results of the original one. This is essentially to reduce a. in s - * P ajdogjT/ (and hence increase speedup) as much as possible. We will describe the algorithm informally by using examples. The idea of loop distribution was introduced by Muraoka [10], and later was implemented in our Fortran program analyzer to measure potential parallelism in ordinary programs [7], [8]. Should it be used in an actual compiler, the details will depend largely on different levels of sophistication in implementing them as discussed later. Hence, in what follows, we will address only the basic concepts which would be required in an idealized loop distribution algorithm. Distribution Algorithm Given a loop L = (L , lo» * * * » H ' i » o * • • • > " Q ' * Step 1 By analyzing subscript expressions and indexing patterns, construct a dependence graph D with respect to loop index I for 1 <. u 5. d* In D , there are s nodes representing S. for 1<_ j <_ s. An 105 arc from S, to S indicates that, for fixed values of I- , I-, ..., I , an atom of S during certain iterations has been updated by the output variable of S, during an earlier (or the same if k < &) iteration generated by the nest I , I 1 , ..., I, . Example 5 Given a loop DO S 2 I- - 1, 10 DO S 2 I 2 = 1, 10 DO S 2 I = 1, 10 A(I 1 ,I 2 ,I 3 ) = B(I 1 -1,I 2 ,I 3 )*C(I 1 ,I 2 ) + D*E B(I r I 2 ,I 3 ) = A(I 1 ,I 2 -1,I 3 )*F(I 2 ,I 3 ) we have D r D, D 3 : Example 6 Given a loop L: V V S 3 = S 5 : DO S I - 1, 10 A(I) - A(I-l) + F(I) B(I) = A(I) + B(I+1) C(I) = D(I-2) + H(I) D(I) = C(I-l) + B(I) E(I) = D(I) + N(I+3) 106 we have Step 2 On the dependence graph D for 1 < u < d, define a partition tt of {S , S , S } in such a way that S and S are in the same subset if and only if there is a chain of arcs from S. to S k. a and vice versa. Elements of each subset are ordered by their subscripts, For Example 5, we have TT ;L = {(S 1 ,S 2 )>, 7T 2 = {(S 1 ),(S 2 )}, TT 3 = {(S 1 ),(S 2 )} , and for Example 6, we have 7 r 1 = {(s 1 ),(s 2 ),(s 3 ,s 4 ),(s 5 )} Step 3 On the partition it = {if ,, i „, ...} for 1 < u < d, u ul u2 — — define a partial ordering relation y in such a way that tt . y t\ . (reflexive), and for i ^ i, tt . v tt . if there is an arc in D from some Ul uj u element of tt . to some element of tt . . The v relation is also ux uj ' 107 antisymmetric and transitive. By neglecting those relations implied by reflexivity and transitivity, we have for Example 5 (S.) y (S,,) in i^- And for Example 6, we have (S^ y (S 2 ) , (S 2 > y (S 3 ,S 4 ), (S 3 ,S 4 ) y (S 5 > in ir^ Step 4 Let the (d-u+1) innermost nest of L be L for 1 < u < d, i.e. , " tt r i 2 W "Wi V (s r s 2 V 1 ■ (I 1- Z 2 W*"' • Then replace L according to tt with a set of loops {(I)(tt .),(I)(tt „) , ...} where (I) = (1^1^,...,:^). The condition for the partial ordering relation y insures that data are updated before being used. 
Hence, any execution order of the set of loops which replaces L will be valid as long as this relation is not violated. Hence, for fixed values of I, , I„, ..., I , , if it . v 12 u-*l ui tt then loop (I) (tt .) must be evaluated before (I) (tt .), otherwise uj r ui uj they may be computed in parallel. (In general, we can also use statement substitution to remove this relation between some of the distributed loops. This will be discussed in the next section). For instance, in Example 5, we have now three possible ways in the evaluation of L. Besides the first one which happens to be L, the 108 other two are and do s r- *»l, 10 DO S I = 1, 10 DO S I 3 - 1, 10 S x : A(I 1 ,I 2 ,I 3 ) = BClj-1,1^, I 3 )*C(I 1 ,I 2 ) + D*E DO S 2 I 2 = 1, 10 DO S 2 I = 1, 10 S 2 : B(I 1 ,I 2 ,I 3 ) = A(I 1 ,I 2 -1,I 3 )*F(I 2 ,I 3 ) , DO S I = 1, 10 DO S 3 I = 1, 10 DO S 1 I = 1, 10 DO S 2 I = 1, 10 S ± : A(I V 1 2 ,1 3 ) = •.. S 2 : BCI^I^I^ = ... S 3 : CONTINUE in which the two innermost loops are computed in parallel. Similarly, one can evaluate the loop in Example 6 as four sequential loops: 109 DO S I = 1, 10 S 1 : A(I) = A(I-l) + F(I) DO S 2 I = 1, 10 S 2 : B(I) = A(I) + B(I+1) DO S, I = 1, 10 S 3 : C(I) = D(I-2) + H(I) S 4 : D(I) = C(I-l) + B(I) DO S 5 I = 1, 10 S : E(I) = D(I) + N(I+3) Ideally, we would like to choose the one which gives the fastest results for the original loop L under the given machine con- figurations. For illustration, let us assume an arbitrary number of processors are available and statement substitution is not allowed. By using Example 5 again, the decomposition according to D_ will yield two Type loops inside the nested loops I and I~, and each of them can be evaluated in parallel. Thus, T = 10xl0xmax{2,l} = 200 . P We may also distribute it by D ? which gives us two sequential type loops inside loop I, and T = (2+1) x 10 = 30 ; P 110 or by D which may result in an R<2000,199> system after preprocessing of coefficients as described in the previous section. Thus, we have for this case T = (log 2 199+2) log 2 2000 - l/2(log 2 199+log 2 199) + log 2 4 = 76 . Clearly, we will choose the second decomposition to obtain a speedup of s = li . 4000 p T 30 ■ LJJ ' P Similarly, the loop in Example 6 will be distributed into a set of four sequential loops (I) (S ) , (I) (S ) , (I)(S ,S.) and (I) (S ) which are Type 1 (m=l) , Type 0, Type 1 (m=3) and Type loops, re- spectively. Obviously, both loops (I) (S ) and (I) (S-) need t.. = t„ = 1 Let preprocessing time of coefficients of loops (I)(S 1 ) and (I)(S_,S.) be one time step, then the computation times of both loops are t 3 = 21og 2 10 = 8 , and t 4 = (log 2 3+2) log^O - l/2(log 2 3+log 2 3) = 17 . Thus , the overall 4 T = I t. + 1 = 28 , P 1=1 X while T = 5x10 = 50 . So that, we have for Example 6 T s =-i a 10 p T 28 L ' P Ill It is interesting to note that both loops in Example 5 and Example 6 are Type 1 by their very nature. However, the speedup of the second loop is far smaller than that of the first one. This is due to the big difference of problem sizes (T ratio = 4000/50) . If we in- crease the number of iterations of the second loop from 10 to 800, then both loops will have the same size but now we can obtain a speedup S a .- * 62, which is comparable to the first loop's speedup. We should also notice that, without distribution, the loop in Example 6 can be computed as an R<50,9> system. 
It yields about the same speedup (T ~ 27) as that obtained by distribution, but requires far more processors (see Theorem 2) to achieve the same speed. This demonstrates that loop distribution may not give better speedup but it could improve the efficiency in the use of processors. In summary, the distribution algorithm introduced in this section, resembling the distribution algorithm for the reduction of tree- height of any given arithmetic expression [ 8 ] , [10 ] > may introduce more parallelism into a program loop than that obtained from an undistributed one. The simpler the loop structure is, the more parallelism can be extracted by the loop distribution techniques. Since most loops in ordinary programs are quite simple, we believe a practical loop distribu- tion algorithm with reasonable complexity can be justified to be built into a real compiler and generate efficient object codes from ordinary programs for future parallel machines. 112 5.4 Time Bounds of Program Loops In section 2, we have given two crude bounds on computation time without any subscript information. As stated before, by performing subscript analysis we can obtain more refined bounds. In this section, we will discuss some of these bounds in terms of certain measurable loop parameters and the connectivity of its dependence graph. The graph is the product of subscript analysis or a computation pattern that we know a priori. For simplicity, we shall concentrate on basic loops and it is not difficult to generalize the discussion to non-basic ones. All parameters used are listed as follows. Definition 9 For a given basic loop L, we define 1) k = tt N. where In. I is the number of 3-1 J j iterations of loop index I . and d is the number 3 of nested loops. 2) s = the number of assignment statements in L. 3) E={e.|l,1.2, . . . ,k) is the number of iterations over which the h-th dependence line in D- passes from S. to S.. If there is only one such i 3 113 line, we omit the superscript (h) . The maximum q.. is denoted by q. ij With these parameters, any basic loop can be represented by a four-tuple, i.e., L(k,s,E,Q). For instance, the loop in Example 5 is L(1000,2, {4,2}, {q =10, q =100}) and in Example 6 is L(10,5, {2,2,2,2,2} {q ll =1 ' q 12 =0 ' q 24 = °' q 34 =1 ' q 43 =2 ' q 45 =0}) ' Now we will present various bounds based on this loop informa- tion. The simplest case is a loop whose dependence graph is empty, i.e., Q=cf>. This implies that LeR and hence from Equation (27) we obtain Theorem 8 Given any L(k,s,E,Q), if Q=cf) then T <_ 41og_e. For the set of loops with non-empty graphs, we can easily distinguish two cases, i.e., acyclic and cyclic graphs. The acyclic graphs can be readily recognized as either tree-connected or non-tree-connected. Formally, we define these as follows. Definition 10 An acyclic dependence graph is a dependence graph of s nodes, S. for 1 <^ i <^ s, with no pair (S.,S.) such that there exists a chain of arcs (dependence lines) from S. to S and vice versa. If there is only one chain of arcs between any pair of nodes, we call it tree-connected . Figure 14 shows all these cases. 114 (a) Acyclic & Tree- connected (b) Acyclic & Non- tree-connected (c) Cyclic Figure 14. Connectivity of Dependence Graph For acyclic graphs, it is easy to demonstrate that we can perform statement substitution between any pair of nodes which have a dependence relation: that is, we substitute for each left-hand-side variable of S. 
on the right-hand side of S., which is the cause of a i 3 dependence relation, the corresponding arithmetic expression on the right-hand side of S. with all subscript expressions shifted in ac- cordance with the q. . representing this relation. For example, given a loop s r DO S, I = 1, 10 A(I) = E(I-l) + F(I) B(I) = A(I) + A(I-l) , C(I) = B(I-l) + G(I) D(I) = A(I-2) + H(I) , D, 115 we have Q = {q| 2 , q{ 2 » q 2 3' q 14* " *°' 1 * 1 * 2 * * By *PP 1 y in 8 thij procedure, a set of four independent loops results: DO S| I - 1, 10 Sj: A(I) = E(I-l) + F(I) DO S^ I = 1, 10 S^: B(I) = E(I-l) + F(I) + E(I-2) + F(I-l) DO S' I = 1, 10 S^: C(I) = E(I-2) + F(I-l) + E(I-3) + F(I-2) + G(I) DO S' I = 1, 10 S^: D(I) = E(I-3) + F(I-2) + H(I) . And all of them can be computed in parallel as the dependence relations among them have been removed by this process . Evidently, T and p of each resultant statement S ' , 1 j< j <_ s, can be bounded by the number of atoms present in its final form. The growth of atoms in each resultant statement can be easily observed from the fact that if S has y., chains of arcs connected to S,, then S will contribute Yj.( e J~l) atoms to S| . Hence, in general the number of atoms in S|, 1 < j < s, will be the summation of e, over the contributions J - - j from all other nodes which have at least a chain of arcs connected to S . This is shown in the following example in which S! contains 116 e + (e -1) + 2(e 2 -l) + (e^-l) atoms © (e r l) X Now, we can state a general theorem for the class of loops with acyclic dependence graphs. Theorem 9 Given L(k,s,E,Q) with acyclic dependence graph. If statement substitution is allowed, then we have 1) for a loop with a tree-connected graph T <_ 41og 9 (se). 2) for a loop with a non-tree-connected graph T <_ 41og~(s'e) where Z-l s-1 s" = (s-£+2)r <_ 2r < r , & = maximum length (number of elements) of the chains of partially ordered set n as defined in section 3 (31 £ si . r = maximum number of chains from any element of ir 1 . Proof: For the tree-connected case , any S . , 1 <_ j <_ s , can have at most one chain of arcs from S., 1 < j < s and i ^ j. Hence, let e 1 1 ~ ~ J 117 be the number of atoms of S', then e! <^ e. + Z (e_.-l) 5 se 1=1 1=1 X Since all Si, 1 _f_ j £. S, can be evaluated in parallel, part (a) follows directly from Brent's argument. For the non-tree-connected case, there are at most r chains (r >_ 2) of arcs coming out of one node and the maximum length of these chains is l. Thus, the maximum contribution by S . to S! is i J £-1 SL-1 r e.(j< r e) . Let all (s-A+1) nodes contribute in the same manner, e! < e. + (s-£+l)r J! '" 1 e + (r £_2 +r £ ~ 3 +. . .+r)e J - J < (s-£+2)/ -1 e < 2r S_1 e < r S e . Therefore, part (b) is also proved. Q.E.D. Next, we study the case when statement substitution is not allowed. In this case, we can prove that Theorem 10 Given any L(k,s,E,Q) with acyclic dependence graph. If statement substitution is not allowed, then we have T < 4£log_e < 4slog_e . p — 2 - °2 118 Proof : Since tt has the maximum length of the chains i, then the elements in it can be partitioned into £ disjoint antichains (subsets of elements), and in each antichain there is no relation among any pair of its elements. This is a dual form of Dilworth's Theorem [27] • Hence, all decomposed loops, each of which represented by one element of a particular antichain, can be computed in parallel. Since the graph is acyclic, each of such loops has Q= with no more than one statement. 
Thus, by Theorem 8, the computation time of all loops represented in one antichain is no more than 41og~e, and for I antichains in total we have T £ 4Uog e. Q.E.D. Theorem 9 and Theorem 10 give two extreme bounds with and with- out statement substitution. Needless to say, that any combination of partial substitution and partial distribution will yield time bounds between these two results. Also note that Theorems 8-10 can be applied to bound the compu- tation time of a block of assignment statements, as it is a special case of a loop with k=l, q.. = {^,0} for all h and i < j, and q . = $ for i >^ j . Finally, let us examine those loops with cyclic dependence graphs. Recall from the previous section that any element tt_ . of tt- with multiple statements actually forms a maximal cycle , i.e., it con- tains all smaller cycles formed by its elements. Those statements 119 represented in a maximal cycle are coupled together by dependence relations and no statement substitution is possible. By treating such a cycle as a single loop, we can compute it as a recurrence system if it is linear. Hence, we can use all results in the previous chapters to bound the time and processors for these maximal cycles. In particular, we can state a general theorem for LeLR as follows. Theorem 11 Given any L(k,s,E,Q) with cyclic dependence graph, then if LeLR, we have T £ [(log 2 (cq+c-l) + 2) log 2 (ck)].mina,t) + 4Uog 2 e (31) < 1/2 log^Csk) + 3/21og 2 (sk) + 41og 2 e (32) where c = maximum number of nodes in one maximal cycle of D , which is equivalent to the maximum number of statements represented by one element of the partially ordered set it. (1 <_ c <_ s) . t = total number of maximal cycles in D.. . Proof : We first show it for c^s. In this case, as discussed in the proof of Theorem 10, the elements of tt. can be partitioned in I dis- joint antichains. Any decomposed loop, represented by one element of a particular antichain, contains no more than c statements. Hence, the maximum computation time of all loops represented in one antichain is equivalent to that of an R system with at most 120 m = c(q-l) + (c-1) + c = c(q+l) - 1 , plus the maximum time for preprocessing coefficients (41og_e) . Since there are at most min(£,t) maximal cycles in a chain, Equation (31) is proved by Theorem 2. For the case c=s, Equation (32) directly follows from Theorem 1. Q.E.D, In summary, we have given Theorems 8-11 to estimate time bounds (and hence speedups since T ~ ske for all cases) for a large class of program loops if their computation patterns are known or reconstructed by subscript analysis. These bounds are based on some gross parameters such as (s,k,e,q,£,c, t) which we used here. Within a specific class of problems and their computational programs, e.g., eigenvalue programs or linear system solvers, users usually have a very good idea of average computation patterns and typical values of these parameters inside their program loops. With these kinds of analytical bounds, they can have a general idea about the ultimate speedup range they can expect from their programs running on a multiprocessing system. This may also help the designer of computers for a given class of computations in knowing how well they are doing in comparison to these possible speedups with an arbitrary number of processors. 
Although we do not give time bounds for a fixed number of processors, due to their infinite number of variations, it is conceivable that by using some tables like that in Chapter 2 or eventually some analytically derived formulas, we can also obtain some of these results 121 for practical usages. Similarly, a careful redefinition of parameters and detailed studies of loop structures should be able to generalize our discussions here to non-basic loops, and further to loops with jump/ logic control informations. 5.5 Practical Implications As discussed in sections 3 and 4, a good compiler algorithm is desirable to decompose a given loop into a set of loops which can be evaluated efficiently for a fixed machine. The complexity of loop distribution algorithms will be decided by the levels of optimization we want and the given machine structure we have. For instance, we have various ways of constructing the dependence graph from a loop: 1) only analyze the input and output variable names of each assignment statement without checking of subscript expressions at all, or 2) Subscript analysis is performed but renaming procedure is not allowed. In this case, for a loop like DO S 2 1-1, N S 1 : A(I) = B(I) + B(I-l) S 2 : B(I) = C(I) + D(I+1) , we will have a dependence line from S« to S. , and an artificial line from S. to S. even though S- never updates S_ in the loop; otherwise, we may distribute and evaluate (I) (S.) and (I) (S.. ) sequentially. The latter will obviously result in erroneous solutions as the first atom B(I) in 122 S needs old values instead of the updated ones generated by this new execution order. Hence, the artificial line inserted will lock up these two statements from being distributed. This technique is used in TI ASC Fortran compiler [24]- On the other hand, if we are allowed to rename B in S and the second B in S 1 (as well as B(0) outside the loop) to a new variable name, e.g., B', then the distribution is possible. Techniques to create these new storages is reported in [10]. We also mentioned that statement substitution between two decomposed loops which have dependence relation may enable them to be evaluated in parallel. However, if the machine structure does not allow more than one array fetched at the same time, then the effect on speedup of statement substitution will be degraded or not preferred at all. Finally, let us look at one decomposed loop, e.g., DO 5 I = 1, N DO 5 J = 1, N 5 A(I,J) = A(I,J-1) + A(I-1,J) , which has two-dimensional recurrence relations and may be treated as a 2 very sparse R system with n=N and m=N. Although we can solve it by the previous general algorithms, for a system with very small p, the wave-front method in [10], [23] may also be attractive because of its simplicity. On the contrary, if the above statement is changed to 5 A(I,J) = A(I,J-1) + A(I-1, J+k) where k is a relatively large number, then the latter method will become very inefficient. 123 Hence, we can conclude that, to design an efficient system for executing program loops, a general compiler simulator can be established to distribute loops according to different machine con- figurations and various parallelism optimization levels. Each distributed loop can be assessed with computation time by the time bounds described in this section, or more practically by a general machine simulator to obtain the overall speedup of the original loop. 
Experiments of this kind on a large sample of ordinary programs will indicate which options will influence average speedup (or efficiency) in major ways. As a result of such studies, a practical loop distribu- tion algorithm of compiler with reasonable complexity can emerge for a given machine; and, in general, an efficient structure of the machine for ordinary programs will also be revealed. 124 6. CONCLUSION In this thesis, we have shown efficient algorithms for parallel evaluation of general linear recurrence systems and m-th order linear recurrence systems, with arbitrary or constant coefficients. Detailed results have been given in the introduction of each chapter and need not be repeated here. These algorithms could be used directly as a building block in forming parallel algorithms for standard numerical programs such as in [28 3, [29]. We can also use them in compilers. With the loop distribution algorithm discussed here, a compiler should be able to decompose a given loop into loops which can be computed as vector operations or recurrence relations. The latter case can be further transformed into vector operations (or matrix operations if the machine structure allows) by our algorithms without programmer intervention. All algorithms described in this thesis are primarily for conceptual understanding. A properly designed machine organization plus some practical modifications of the basic algorithms presented here for that particular system should provide us high efficiency computations of iterative programs. We conclude this thesis by giving the following several topics which are opened up by this research and deserve further studies: 1) Is there any other algorithm which can be used to evaluate general linear recurrence systems in O(log„n) time steps instead of 2 2 O(log-n) steps? Since only 0(n ) data elements are involved in matrix A and c, it follows from a fan-in argument that one might hope for a faster 125 scheme requiring only logarithmetic time. Otherwise, could we disprove this possibility by some abstract combinatorial argument? Similar questions can be raised for m-th order linear recurrence systems. 2) Transformation of various standard numerical algorithms into algorithms which use linear recurrences, such as the ones mentioned above and recently in [30], could be explored for future parallel algorithms. 3) Practical compiler algorithms for decomposition of loops with jump/logic statements and various indexing patterns should be studied in detail. 4) Numerical stability problems of these basic algorithms require more understanding, although it is expected that the error bounds will be within the range of serial computations. 126 LIST OF REFERENCES [1] D. J. Kuck, "Multioperation Machine Computational Complexity," Proceedings of Symposium on Complexity of Sequential and Parallel Numerical Algorithms , pp. 17-47, Academic Press, 1973. [2] P. Budnik and D. J. Kuck, "The Organization and Use of Parallel Memories," IEEE Transactions on Computers , Vol. C-20, pp. 1566-1569, December 1971. [3] D. H. Lawrie, "Memory Systems for Parallel Array Processors," Proceedings of the Eleventh Annual Allerton Conference on Circuit and System Theory , Allerton House, Monticello, 111., pp. 568-576, October 1973. [4] D. H. Lawrie, "Access Requirements and Design of Primary Memory for Array Processors," submitted for publication IEEE Transactions on Computers , (IEEE Repository No. R74-30) , 1974. [5] E. C. Russell, "Automatic Program Analysis," Ph.D. 
thesis, Department of Electrical Engineering, University of California, Los Angeles, Report No. 69-12, March 1969.

[6] D. E. Knuth, "An Empirical Study of FORTRAN Programs," Department of Computer Science, Stanford University, Report No. CS-186, 1970.

[7] D. Kuck, P. Budnik, S-C. Chen, E. Davis, Jr., J. Han, P. Kraska, D. Lawrie, Y. Muraoka, R. Strebendt and R. Towle, "Measurements of Parallelism in Ordinary FORTRAN Programs," Sagamore Computer Conference on Parallel Processing, Syracuse University, pp. 23-29, August 1973; also appears in IEEE Computer, pp. 37-46, January 1974.

[8] D. J. Kuck, Y. Muraoka and S. C. Chen, "On the Number of Operations Simultaneously Executable in FORTRAN-Like Programs and Their Resulting Speed-Up," IEEE Transactions on Computers, Vol. C-21, pp. 1293-1310, December 1972.

[9] R. M. Karp, R. E. Miller and S. Winograd, "The Organization of Computations for Uniform Recurrence Equations," Journal of the ACM, Vol. 14, No. 3, pp. 563-590, July 1967.

[10] Y. Muraoka, "Parallelism Exposure and Exploitation in Programs," Ph.D. thesis, University of Illinois at Urbana-Champaign, Department of Computer Science, Report No. 424, February 1971.

[11] P. M. Kogge and H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations," IEEE Transactions on Computers, Vol. C-22, No. 8, pp. 786-792, August 1973.

[12] D. Heller, "A Determinant Theorem with Applications to Parallel Algorithms," Carnegie-Mellon University, Department of Computer Science, March 1973; also "On the Efficient Computation of Recurrence Relations," Institute for Computer Applications in Science and Engineering (ICASE), June 1974.

[13] D. H. Lawrie, "Memory-Processor Connection Networks," Ph.D. thesis, University of Illinois at Urbana-Champaign, Department of Computer Science, Report No. 557, February 1973.

[14] S. F. Reddaway, "An Elementary Array with Processing and Storage Capabilities," International Workshop on Computer Architecture, Grenoble, France, June 1973.

[15] K. Maruyama, "On the Parallel Evaluation of Polynomials," IEEE Transactions on Computers, Vol. C-22, pp. 2-5, January 1973.

[16] D. J. Kuck and K. Maruyama, "Time Bounds on the Parallel Evaluation of Arithmetic Expressions," to appear in SIAM Journal on Computing.

[17] J. C. P. Miller and D. J. S. Brown, "An Algorithm for Evaluation of Remote Terms in a Linear Recurrence Relation," Computer Journal, Vol. 9, pp. 188-190, 1967.

[18] G. Estrin, "Organization of Computer Systems: the Fixed Plus Variable Structure Computer," Proceedings of the Western Joint Computer Conference, pp. 33-40, May 1960.

[19] W. S. Dorn, "Generalizations of Horner's Rule for Polynomial Evaluation," IBM Journal of Research and Development, No. 6, April 1962.

[20] M. C. Pease, "An Adaptation of the Fast Fourier Transform for Parallel Processing," Journal of the ACM, Vol. 15, No. 2, April 1968.

[21] W. G. Poole, Jr. and R. G. Voigt, "Numerical Algorithms for Parallel and Vector Computers: An Annotated Bibliography," Institute for Computer Applications in Science and Engineering (ICASE), 1974.

[22] J. L. Baer, "A Survey of Some Theoretical Aspects of Multiprocessing," Computing Surveys, Vol. 5, No. 1, March 1973.

[23] L. Lamport, "The Parallel Execution of DO Loops," Communications of the ACM, Vol. 17, No. 2, February 1974.

[24] W. L. Cohagan, "Vector Optimization for the ASC," Proceedings of the Seventh Annual Princeton Conference on Information Sciences and Systems, 1973.

[25] R. P.
Brent, "The Parallel Evaluation of General Arithmetic Expressions," Journal of the ACM , Vol. 21, No. 2, April 1974. [26] H. T. Kung, "New Algorithms and Lower Bounds for the Parallel Evaluation of Certain Rational Expressions," Proceedings of the Sixth Annual ACM Symposium on Theory of Computing , April 1974. [27] C. L. Liu, "Topics in Combinatorial Mathematics," Notes on Lectures, MAA Summer Seminar, Williams College, Massachusetts, 1972. [28] H. S. Stone, "An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations," Journal of the ACM , Vol. 20, No. 1, January 1973. [29] A. H. Sameh, S. C. Chen and D. J. Kuck, "Parallel Direct Poisson and Biharmonic Solvers," submitted for publication. [30] F. S. Acton, "Recurrence Relations for the Fresnel Integral and Similar Integrals," Communications of the ACM , Vol. 17, No. 8, August 1974. 129 VITA Shyh-ching Chen was born in Fukien, China, on February 1, 1944. He received the B.S. degree in electrical engineering from the National Taiwan University, Taipei, Taiwan, in 1966 and the M.S. degree in electrical engineering from Villanova University, Villanova, Pennsylvania, in 1972. From August 1967 to August 1969 he was with the Electronic Computing Center, National Taiwan University. Since September 1970 he has been a research assistant in the Department of Computer Science, University of Illinois, Urbana. He is a member of Sigma Xi, IEEE and ACM. IBLIOGRAPHIC DATA 1EET 1. Report No. UIUCDCS-R-75-694 3. Recipient's Accession No. Fillc and Subtitle PEEDUP OF ITERATIVE PROGRAMS IN MULTIPROCESSING SYSTEMS 5- Report Date January 1975 6. A ut hor(s ) hyh-ching Chen 8. Performing Organization Rept. No " UIUCDCS-R-75-694 Performing Organization Name and Address niversity of Illinois at Urbana-Champaign lepartment of Computer Science rbana, Illinois 61801 10. Project/Task/Work Unit No. 11. Contract /Grant No. US NSF GJ-36936 , Sponsoring Organization Name and Address ational Science Foundation ashington, D. C. 13. Type of Report & Period Covered Doctoral - 1975 14. Supplementary Notes . Abstracts .oop computations are the basic constructs which consume most of the computation time r. ordinary programs. For a multiprocessing (parallel or pipelined) system, the over- ill speedup which might be otherwise obtained will be seriously degraded if these :omputations are carried out in the sequential manner as they are presented. .ti this paper, we present algorithms to expose vector operations and recurrence re- ations which constitute a loop computation and, in turn, to transform recurrence elations into vector operations. For the latter case, time and processor bounds for [ifferent classes of problems and practical situations will be discussed. Bounds on ipeedups of different types of loop computations will also be given, so that machine lesigners or users could know how well they were doing in some absolute sense. . Key Words and Document Analysis. 17a. Descriptors Compiler General Linear Recurrences i-th Order Linear Recurrence lultiprocessing Systems )rdinary Program Loops 'arallel Computation 'rocessor Bounds Speedup Efficiencies rime Bounds .riangular Linear Systems b. Identifiers /Opcn-Ended Terms COSATI Field/Group Availability Statement ELEASE UNLIMITED 19. Security Class (This Report) UNTt.ASSIFIFD 20. Security Class (This Page UNCLASSIFIED 21. No. of Pages 136 22. Price 'M NTIS-35 ( 10-70) USCOMM-DC 40329-P7 1 % o % en CO *0» JVJ* Bom UNIVERStTY OF ILLINOISURBAnT »>0»4U«Rno COO? 