Report No. UIUCDCS-R-75-694
NSF-OCA-GJ-36936-000007

SPEEDUP OF ITERATIVE PROGRAMS IN MULTIPROCESSING SYSTEMS

by
Shyh-ching Chen

January 1975

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the National Science Foundation under Grant No. US NSF-GJ-36936 and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, January 1975.

SPEEDUP OF ITERATIVE PROGRAMS IN MULTIPROCESSING SYSTEMS
Shyh-ching Chen, Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1975

Loop computations are the basic constructs which consume most of the computation time in ordinary programs. For a multiprocessing (parallel or pipelined) system, the overall speedup which might otherwise be obtained will be seriously degraded if these computations are carried out in the sequential manner in which they are presented. In this paper, we present algorithms to expose the vector operations and recurrence relations which constitute a loop computation and, in turn, to transform recurrence relations into vector operations. For the latter case, time and processor bounds for different classes of problems and practical situations will be discussed. Bounds on the speedups of different types of loop computations will also be given, so that machine designers or users can know how well they are doing in some absolute sense.

ACKNOWLEDGMENT

The author wishes to express his deepest gratitude to Professor David J. Kuck. His fruitful advice, constant enthusiasm and continuing help throughout this work are most appreciated and admired. Special thanks should go to Mrs. Vivian Alsip for her excellent job of typing and remarkable patience in doing it. Mrs. Gayanne Carpenter gave me kindly departmental assistance. Drafting and printing services of this department have been invaluable. Sincere thanks are also extended to my parents and wife, whose encouragement and understanding have made my long period of study possible, and to them it is dedicated.

TABLE OF CONTENTS

1. INTRODUCTION 1
2. TIME AND PARALLEL PROCESSOR BOUNDS FOR LINEAR RECURRENCE SYSTEMS WITH ARBITRARY COEFFICIENTS 5
   2.1 Introduction 5
   2.2 General Linear Recurrences 8
   2.3 M-th Order Linear Recurrences 26
   2.4 Practical Implications 37
3. TIME AND PARALLEL PROCESSOR BOUNDS FOR LINEAR RECURRENCE SYSTEMS WITH CONSTANT COEFFICIENTS 45
   3.1 Introduction 45
   3.2 General Linear Recurrences with Constant Coefficients 46
   3.3 M-th Order Linear Recurrences with Constant Coefficients 56
   3.4 Practical Implications 69
4.
SPEEDUPS OF PRACTICAL LINEAR RECURRENCE SYSTEMS WITH A LIMITED NUMBER OF PROCESSORS 70 4.1 Introduction 70 4.2 M-th Order Linear Recurrences with Arbitrary Coefficients 71 4.3 M-th Order Linear Recurrences with Constant Coefficients 87 5. SPEEDUP OF LOOP COMPUTATIONS IN ORDINARY PROGRAMS 94 5.1 Introduction 94 5.2 Speedup Characteristics of Program Loops 97 5.3 Exploitation of Parallelism in Program Loops 103 5.4 Time Bounds of Program Loops 112 5.5 Practical Implications 121 6. CONCLUSION 124 LIST OF REFERENCES 126 VITA 129 LIST OF TABLES Page 1. Experimental Results of General Linear Recurrences A3 2. Experimental Results of m-th Order Linear Recurrences .... 44 3. Experimental Results of m-th Order Linear Recurrences by Algorithm 5 for n = 16,384 84 4. Experimental Results of m-th Order Linear Recurrences by Algorithm 2 and Algorithm 5 for n = 128 85 5. Experimental Results of m-th Order Linear Recurrences by Algorithm 2 and Algorithm 5 for n = 1024 86 vi LIST OF FIGURES Page 1. Evaluation Trees for R<4> 21 2. Dynamic Partitioning for Algorithm 1 22 3. Parallel Evaluation of R<8> 23 A. Dynamic Partitioning for Algorithm 2 31 5. Parallel Evaluation of R<16,2> 33 6. Dynamic Partitioning for Algorithm 3 51 7. Parallel Evaluation of R<8> 52 8. Dynamic Partitioning for Algorithm 4 58 9. Parallel Evaluation of R<16,2> 59 10. Dynamic Partitioning for Parallel Evaluation of Remote Term in Homogeneous R by Algorithm 4 67 11. Dynamic Partitioning for Parallel Evaluation of Remote Term in Homogeneous R by Algorithm 2 of Chapter 2 . . . 68 12. Dynamic Partitioning for Algorithm 5 72 13. Dynamic Partitioning of Matrix E for Algorithm 6 89 1 . INTRODUCTION The primary objective of this research is to develop fast and efficient algorithms to evaluate programs which are presented in serial form. Specifically, the recurrence relations in loop computa- tions which are encountered quite often in ordinary programs such as Fortran DO loops and Algol FOR loops will be the center of our concern. During the last decade, as the cost and size of hardware circuits became remarkably cheaper and smaller, and as the speed of electronic devices approaches its physical limits , multiprocessing systems were introduced in an attempt to speed up machines by allowing concurrent processing. As a result, various types of multiprocessing systems have emerged. These include ILLIAC IV, CDC STAR, TI ASC, C.mmp and many others, each enjoying some success in the computations tailored to its own structure, and hence generally intended for a special group of users. However, we believe that the future trend of multiprocessing system designs will and should make the high-speed computational power of such machines available to ordinary users [1], Two basic design and application principles which help to attain the full effectiveness of multioperational machines are: 1) Extracting parallelism from ordinary programs as much as possible. Obviously, a plentiful supply of possible concurrent operations will exploit the highest speed of any given or proposed system. There may be two approaches : a) Devising new parallel algorithms and providing existing languages with adequate parallel processing features. b) Automatic detection of parallelism at the compiler level. 2) Minimizing data accessing and alignment conflicts in the system in order to achieve high-speed processing. 
This may be achieved by properly designed memory systems, storage schemes and data alignment networks which combine to allow conflict-free accessing and processing of various slices of data required in most ordinary programs [2], [3], [4]. Loop computations are the basic constructs which consume most of the computation time in ordinary programs. This is true intuitively as well as in actual experimental measurements [5], [6], [7]. For a multiprocessing system, the overall speedup which might be otherwise obtained will be seriously degraded if these computations are carried out in the sequential manner as they are presented. Therefore, according to the above principles, they are very good and important candidates deserving careful theoretical and practical studies if future machines of this class are to be in general use at all. Most previous results pertaining to this problem treat indi- vidually each principle stated before. These will be mentioned in later chapters with their related topics. In this thesis, we would like to look at the problem from a unified point of view. Briefly put, we ask ourselves if there is any fast algorithm (or algorithms) which we can use for evaluation of recurrence relations such that: 1) They may be written in numerical subroutines by extending current languages with some useful features, 2) Upon recognition in program loops, it should be easy for a compiler to transfer these iterative constructs into the equivalent but fast evaluations proposed by these algorithms, 3) With care, it is possible to use or design a well-organized system to achieve a desirable speed. All the algorithms developed herein have some implicit bearing on certain (although not all) solutions to these three questions. In short, we try to devise algorithms for easy implementations through user subroutines or compiler algorithms, and the assumed general machine structures and data routing patterns underlying these algorithms should enhance their maximum speeds. In Chapter 2, we discuss the algorithms to evaluate linear recurrence relations with arbitrary coefficients for both general and m-th order problems. Chapter 3 emphasizes the cases of constant co- efficients which occur frequently in general numerical programs. All algorithms as they stand are for the fastest computations and anticipate future machines with as many processors as the number of IC packages used in current machines. Also, some easy and general modification procedures are introduced to maintain substantial speedups while using processors more efficiently. Some representative experimental results to this effect will be included. These types of information and practicalities should be able to lead us to good machine design and application procedures in future parallel machines. On the other hand, for realistic designs and uses at present, Chapter 4 provides some methods to evaluate most practical m-th order linear recurrences where m is very small and under the assumption that usually the number of processors is far less than the problem size. Extending the results from these three chapters, we then can treat general jump-free loop computations in a uniform way either theoretically or practically in Chapter 5. Bounds on speedups of different types of loop structures will be given so that designers or users could know how well they were doing in some absolute sense. 
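To fix ideas, the following is a minimal illustration (in Python, with names of our own choosing; it is not taken from the thesis) of the kind of loop computation at issue: a second-order linear recurrence of the sort that appears in ordinary Fortran DO loops, and which is strictly sequential if executed as written.

```python
def serial_second_order(c, a1, a2):
    """Serial evaluation of  x[i] = c[i] + a1[i]*x[i-1] + a2[i]*x[i-2],
    i.e. the Fortran-style loop
          DO 10 I = 1, N
       10 X(I) = C(I) + A1(I)*X(I-1) + A2(I)*X(I-2)
    Each iteration needs the two previous results, so executed this way the
    loop runs one step at a time no matter how many processors are available."""
    x = []
    for i in range(len(c)):
        prev1 = x[i - 1] if i >= 1 else 0.0
        prev2 = x[i - 2] if i >= 2 else 0.0
        x.append(c[i] + a1[i] * prev1 + a2[i] * prev2)
    return x
```

Chapters 2 through 4 study how recurrences of exactly this form can nevertheless be evaluated with many processors in roughly logarithmic time.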
Although this work provides some insight to these serious problems, no claim can be made about the perfection of the results as spelled out in the open questions listed in Chapter 6. For notational simplicity, we will assume all logarithmic functions take their ceiling values throughout this thesis. 2.. TIME AND PARALLEL PROCESSOR BOUNDS FOR LINEAR RECURRENCE SYSTEMS WITH ARBITRARY COEFFICIENTS 2.1 Introduction One of the most difficult aspects of speeding up ordinary programs for parallel or pipeline computation is the treatment of recurrence relations [7], [8]. Although they occur with rather low overall frequency, an annoyingly large fraction of the programs one sees contain at least one recurrence relation. Most of these are linear recurrence relations. Suppose some multidimensional computation space contains n points over which a general linear recurrence relation is defined. We will show o that this computation can be speeded up by a factor of ©((n/loggn) ) using 3 0(n ) processors. More generally stated, we consider the parallel solution of any linear system of the form x ■ c+Ax where A is a strictly lower tri- angular matrix and c is a constant column vector. While a serial machine can 2 solve this system straightforwardly in 0(n ) steps, our algorithm finds the 2 3 2 solution in 0((log_n) ) time steps using at most n /68 + 0(n ) processors. Further, all processors perform the same operations at the same time (SIMD operation). These matters are discussed in Section 2. In Section 3 we consider the restricted problem of solving an m-th order linear recurrence relation. We show that given any linear system of the form x=c+Ax where A is a strictly lower triangular matrix of bandwidth m, its solution can be speeded up by a factor of O(mn/log mlog n) using at most p 0(m n) processors. Again, this result is achieved with uniform operations in all processors (SIMD operation) . These results are far from ideal in several respects. First we observe that in the general case a parallel computation time of O(log p n) could be hoped for. This follows from a simple fan-in argument using the 0(n ) data elements in the A matrix and c vector. Similarly, in the m-th order case we could hope for a parallel computation time of O(log mn). Note that we approximate this when n » m. A second shortcoming of these bounds is that the number of pro- cessors required is rather high so that rather low efficiencies result. This leads us to present in Section k some practical modifications to the algorithms of Sections 2 and 3- These modifications allow much higher effi- j ciencies without significantly reducing the speedups . The parallel evaluation of recurrence relations has been studied by a number of people. In [9], Karp et.al. discuss the formalization of the problem and prove some results about the resulting graphs . Muraoka in [10] discusses algorithms for speeding up general classes of recurrence relations and also presents bounds of 0(log n) time steps on a number of simple recurrences. Kogge and Stone [11] discuss m-th order recurrences and while they do not give a general processor bound, their time bound is about twice ours for this case. Heller has studied general recurrence 2 relations and proves that they can be evaluated in 0(log«n) steps with 4 0(n ) processors using a machine capable of simultaneous addition and multiplication (MIMD operation) . 
He recently has improved the processor 3 bound to 0(n ) while the time bound remains twice that reported here [12], Although we do not discuss any details of machine organization in this paper, it is in order to sketch a machine here. First, we list some idealizing assumptions. 1) Instructions are always available for execution as required and are never held up by a control unit. 2) Each operation takes the same amount of time, which we refer to as a unit step . 3) Any number of processors and memories may be used at any time. 4) There are never any accessing conflicts in the memory. 5) No time is required to communicate data between the proces- sors and memories. In a properly designed system, for steady state data and control flow, standard pipelining techniques may be used to approximate 1 and 2. We will present theoretical and practical bounds on the number of proces- sors required, so 3 can be removed. It will be noted that the algorithms of this paper mainly involve various kinds of vector and matrix products. References [2] and [4] show various schemes for accessing rows, columns, diagonals, square blocks, etc., without conflicts using a single memory storage scheme. Such a memory should provide a good approximation to 4. A pipelined alignment network can be used to approximate 5 and the details of one possibility are given in [13] • Finally, we note that for the algorithms we present, simultaneous multiplication and addition (MIMD operation) would allow us to further reduce our processor bounds, but as our algorithms are presented all processors execute the same operation at the same time (SIMD operation) . Thus we feel that our results could be of direct use in the design of real machines. The following definitions will hold throughout the paper. Let T be the time, measured in unit steps, required to perform some calculation using i independent (parallel or pipeline) processors. We define the T T T l speedup of a p-processor machine over a uniprocessor as S = ■=— , and we T l define its efficiency as E = _< 1, which may be regarded as the quo- P P P tient of S and the maximum possible p-processor speedup p. For various results in Sections 2 and 3 we will bound p, T and S . In Section 4 we P P will discuss these further and introduce E values as well. P 2 .2 General Linear Recurrences In this section we discuss bounds on the time and processors for evaluating general linear recurrences. The main result is Theorem 1. We also give Algorithm 1 which may be used for the parallel evaluation of a general linear recurrence system. Some general notations will be followed throughout this paper unless otherwise specified. We use upper case letters to denote square i matrices, lower case letters to denote scalars, and lower case letters with overbars to denote column vectors. Given any n by n matrix A = [a..] , A, denotes the sub-matrix consisting of all elements a., for ij n*n Tc ij k< i _< n, k, is defined as R: x. = for i < 0, 1-1 = c. + Z a. . x. for 1 < i < n . i . 1 ij 3 ~ ~ In matrix notation we write this as x = c + Ax where x = \ x -i > x p> • * • j x n ' > c = \ c n y C 2' ' ' ' y °n ' and A = Ta. .1 with a. . = for i < j. ' ij nxn i j - For simplicity, we will assume n is a power of 2 for the remaining dis- cussion. If it is not, the system can be augmented and the results applied directly. We refer to x as the solution vector of R . If the — — * * solution vector x of R is equal to the solution vector x of R , * we say that R = R . 
i 10 Definition 2 Given any general linear recurrence system R : x = c + Ax , we define a set of n-1 systems {R.|l <_ j <_ n - 1} where R.: y. = a. + A.y. J J J J J in which y. is the solution vector of R.. The set of all solution J J vectors y., 1 £ j _5_n - 1, can be written in matrix form as a strictly lower triangular matrix Y = [y. . in which y. . = for i < j , ij - i-1 = a. . + E a.,y, . for 1 <_ j < i <_ n . 1J k=j+l lk k3 As an example of these definitions, consider the following linear recurrence system with n = 4. Example 1 Given R<4> : x, = c. x 2 = c 2 + a 21 X;L X 3 C 3 + a 31 X l + a 32 X 2 X A c 4 + a 41 x l + a 42 x 2 + a 43 X 3 we have 11 A = 21 31 32 a 7A %1 %2 %3 L x - [X--., Xp, x , x< ) , - V C n > C p> c o> C h ' In this case there are three corresponding systems R. for 1 < j < 3, written: R x <3>: y 21 = a 21 y 3l = a 3l + a 32 y 2l y ^l = \l + %2 y 2l + %3 y 3l ' R 2 <2>: y 32 " a 32 and Thus we have y ^2 ~ %2 + a J+3 y 32> R 5 <1>: ^1+3 = %$ ' Y = y 21 y 31 y 32 Jkl Y k2 J k5 where y. is the solution vector of the corresponding system R. for J J 1 < 3 < 3. 12 [.emma 1 Given any R : x = c + Ax , there is a corresponding R*^i>: x = c + Yc - (I+Y)c , such that R = R - Proof : Our proof is by induction on n. The cases n = 1 and n = 2 are trivial so we will examine n = 3 as our basis. If n = 3, we have from Definition 1 R <3>: x x = c X 2 = C 2 + a 21 X l = °2 + a 21 C l x 3 = c^ + a 31 x ± + a^ x 2 = c 3 + a^ c ± + a^ (c 2 + a^ c^ The corresponding system has the form R*<3>: x = c x 2 = c 2 + y 21 C] _ x 3 = c 3 + y 3l °i + y 32 °2 ' Now, by Definition 2 we know that y 21 = a 21 (1) y 3l = a 31 + a 32 y 21 = a 3l + a 32 a 2l (2) and y = a . (3) By substituting Equations (1), (2), and (3) into R <3>, we get R*<3>: x x = c ± X 2 = C 2 + a 21 C l x 3 = c 3 + (a 3l +a 32 a 2l )c l + a 32 c 2 = c 3 + a 3l C l + a 32 (c 2 +a 2l c ]. ) - 13 By comparing this system with the final righthand side obtained above for R<3> , we see that R<3> = R* <3> ■ Now as our induction hypothesis, we assume the lemma is true for n = k, and prove it for n = k + 1. That is, given any R, we assume there is an R =■ R where R can be written as : \ y 21 i y 3i y 32 x • ♦ • • y ki y k2 •' y k,k-l \ (k) Any system R has one additional equation whose form by Definition 1 is *k+l = C k+1 + V-l.l X l + ^+1,2 X 2 + '" + a k+l,k *k : C k+1 + (a k+l,l' a k+l,2' •*•' a k+l,k } \X, > X» , • ■ • j ^T»' (5) If we substitute the righthand side of Equation (4) into (5), we get 14 x k+l " °k+l + ( Vl,l' Vl,2 ) * • • > Vl,k ) r 21 31 J 32 y^o i • • • • y y, i kl ,y k2 * ' -%k-l c l] °2 ; v. ) By the definition of y k+1 for 1 <_ j £ k (see Definition 2) , this can be written as *k+l = C k+1 + (y k+l,l' y k + l,2^'-^ y k+l,k ) ^ (6) Now by combining Equations (4) and (6) , we can write x = x l 1 r— — > C l X 2 y 21 i C 2 x 3 y 31 y 32 1 , c 3 • • • • 1 m . = • • • ' • - • • • • m \ y kl * * * y k,k-l 1 | 1- °k L \ + l _ y k+l,l ' • ' y k+l,k-l y k+l,kl 1 _ C k + l (I+Y)'c Thus, the lemma holds for n = k + 1, and therefore for any integer n. Q.E.D. Before giving the next lemma, further definitions are needed. 15 Definition ^ Let f be a function of a. , where 1 < i < n and < j < n-1 and f be a function of a. , where p+1 < i < p+n and p < j < p+n-1. Let f ' be derived from f by replacing each a. . in f. by a. . We sav that f -l ij l l+p, j+p * 2 is a p-shifted function of f^ if f» = f and we express this symbolically by writing f, * f . 
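Before proceeding, a small numerical sketch may help make Definition 2 and Lemma 1 concrete. The code below (our own illustration with made-up data, not part of the thesis) builds the associated matrix Y column by column directly from the recurrence of Definition 2 and checks that x = (I + Y)c reproduces the forward-substitution solution of x = c + Ax.

```python
import numpy as np

def associated_Y(A):
    """Matrix Y of Definition 2 for a strictly lower triangular A (0-based):
    Y[i,j] = A[i,j] + sum_{k=j+1}^{i-1} A[i,k] * Y[k,j]."""
    n = A.shape[0]
    Y = np.zeros_like(A)
    for j in range(n):
        for i in range(j + 1, n):
            Y[i, j] = A[i, j] + A[i, j + 1:i] @ Y[j + 1:i, j]
    return Y

# Example data (ours): a 4-equation system x = c + A x.
A = np.array([[0., 0., 0., 0.],
              [2., 0., 0., 0.],
              [1., 3., 0., 0.],
              [2., 1., 4., 0.]])
c = np.array([1., 2., 1., 3.])

# Serial forward substitution gives the reference solution.
x = np.zeros(4)
for i in range(4):
    x[i] = c[i] + A[i, :i] @ x[:i]

# Lemma 1: the solution is x = (I + Y) c.
assert np.allclose((np.eye(4) + associated_Y(A)) @ c, x)
```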
Definition ^ For any R < n >, and any ordered pair (i, j), < j < n-1, 1 < i < n, we define a sequence of functions f,(i, j), f„(i, j), f.(i,j ),..., f (i, j) such that j+2r-l f _ (i,j) = f (1,3) + 2 f (i,k).f (k,a) for r=2 m , 0< m < t-1, dr r k=j+r with the boundary conditions f^i, j) = a for i > J, in which f (i, j) - for i < j, a._ = c. for 1 < i < n. iO i — — For any given f (i, j), if we apply repeatedly the recurrence definition to express f (i, j) in terms of f , f ,..., and finally of f , 2 IT we can express f (i, j) as a function of a. .'s only. In the following, we will assume this final expression for any f (i,j) unless otherwise specified. Some properties of the f function can be observed directly from the above definition: Property i) fg r (i,j) = f r (i,j) for i-j < r. P Property ii) f r (i,j) ■• f r (i+p,j+p) • 16 Property iii) If g is derived by replacing each a in f (i,,j) lK r by a ±+p k , then g - f r (i+p,j), for i-,j > r. To illustrate Property ii) and Property iii), let n = 8, i = k, j - 0, r - 2, p = 1. Definition k, Property ii) and Property iii), respectively, give us: f 2 (M) = ^(M) + f 1 (4,l)-f 1 (l,0) = a^ Q + a 4l a 1() , f 2 (5,l) = ^(5,1) + f 1 (5,2)'f 1 (2,l) = a 51 + a 52 a 21 , and fg(5,0) = f x (5,0) + f 1 (5,D'f 1 (l,0) = a 5Q + a 51 a 10 . Wow we will prove a lemma which forms the basis of the remaining theorems . Lemma 2 Given any R , we have i-1 f (i,0) = a. _ + Z a. .x. = x. n v ' xO . , ij J i for 1 < i < n . Proof ; Our proof is by induction on n. As a basis, the lemma is obvious for n = 2. As an induction hypothesis, we assume it holds for n = s. Thus, by hypothesis we have i-1 f (i,0) = a.~ + Z a. .x. = x. for 1 < i < s. s v ' lO 1 • X3 J i ~ " Now we prove the lemma for n = 2s and consider two cases. Case 1 1 < i < s. We know from Property i) and Equation 7 f (1,0) = f (i,0) = x. for 1 < i < s. 2s ' s ' i — — (7) 17 Case 2 s+1 < i < 2s. From Equation 7 w e have s-1 f (s,0) = a - + Z a .x. = x for 1 < i < s. S V ' ' BO , , B] 3 S - By applying Property iii) to the above f (s,0) we can get s s-1 f (s+k, 0) = a ^ _ + Z a . .x. for 1 < k < s. s v ' s+k,0 s+k, j j - J (8) Having these values, the equations for x , , s < i < 2s, of R<2s> can be ex- pressed in matrix form as Equation 9« X s / X s + 1 X s+2 = • • • w X 2s . >. ff s (B,0) ■ f (s+1,0 ) f (s+2,0) s ' ' f g (2s,0) s+1, s !> .« a ~ , s+2, s s+2, s+1 fc a 2s,s a 2s,s+l * * * a 2s,2s-l ° By Lemma 1, we know that Equation 9 is equivalent to X s X s+1 • X s+2 • X 2s L J (9) ^s+1 L s+2 L 2s s+1, s y s+2,s y 2s, s s+2, s+1 1 y y. 2s, s+1 . . . ,y 2s,2s-l 1 f s (s,0) " f (s+1,0) f g (s+2,0) f a (2s,0) (10) where by Definition 2 we have i-1 y. . = a. . + Z a., y, . U 13 k=J+1 ik y kj for s < j < i < 2s. (11) 18 Now consider the equations for x , 1 <_ i <_ s , of R<2s>. If we relabel all the x. as x. n , we obtain a system of the form 1 1 y ^ i-1 x i,o = a i,o + ^ a ik"k, k=l ' 1 < i < s. (12) From Equation 11 and Equation 12 it follows by Definition 3 that X i-j,0 "* y i,j ' and from Equation 7 we have s < j < i < 2s, (13) f (i-j,0) = x. . = x. . n , 1 < i-j < s. (i*0 Now by Property ii) we have f 8 (l-3,o) I f fl (l,3) > so by Equation Ik (15) Thus we see from Equations 13 and 15 that y. . and f (i, j) are both j -shifted 13 s functions of x. . _. and it follows that 1-3,0* y ld = f fl (i,d), s < j < i < 2s. 
(16) By substituting the right hand sides of Equation 16 into Equation 10 we get s+1 s+2 ^2s "1 • f s (s+l,s) 1 f (s+2,s) f (s+2, s+1) S S f (2s, s) f (2s, s+1) s s f (2s,2s-l) 1 s f s (s,0) f s ( S +i,o: f (s+2,0 s f (2s,0) s (17) 19 and by Definition k, = (f s (s,0), f 2g (s+l,0), f 2s (s+2,0),...,f 2s (2s,0)) t . (18) Hence we have f 2s (i,0) = x., S+l < i < 2s. Cases 1,2 together show that the lemma holds for n = 2s and hence Q.E.D. for any n. Now, we can express our parallel algorithm for evaluating R in terms of iterative computations of tJL±,0) = x ± for 1 < i < n. It follows from Definition k that each f (i,0), 1 < i < n, can be expressed as a function of f , f , .... f, recursively. Hence, we can evaluate them in the reverse n ' n ' ' 1 ' 2 k \ order simultaneously for all i. By Property i) of the function f, the evalua- tion of f (i,0) = x. will terminate on the (log i)-th iteration and hence it n ' i 2 takes at most logun iterations to complete the evaluation of the whole R system. This is illustrated in the example of Figure 1. To carry out this evaluation scheme for any given R system, we present in the following a computational algorithm, in which identical function values (such as f (2,0), f (3,0) in Figure l) at the same level of computation will be evaluated only once. Algorithm 1 a. Let B be a lower triangular matrix of order (n x n) in which the j-th column contains a. . n for j < i < n where a. = c. . u i,j-l "" "~ io i b. Let C be an alias for B, i.e., B and C represent the same memory locations . c. Repeat this step for i = 1, 2, . . ., log n; i: Set k = 2 1 ; ii: Partition B and C as shown in Figure 2; ili: Compute S. = S. + T. * Q. for i < j < n/k simultaneously. J j 3 3 _ 20 d. The first column of B contains the solutions x for 1 <_ i <_ n. The effectiveness of this algorithm for evaluating R can be easily demonstrated by the fact that at the beginning of the i-th iteration any element b in any of partitions S., T., or Q., for p,q j j j 1 <_ j <_ n/k, represents the value of f, ,.(p,q-l); and at the end of that iteration, any element in S. represents f, (p,q-l). The example shown in Figure 1 relates the functional notation f to the matrix notation of Algorithm. Note that at the end of the i-th iteration of Algorithm 1, the first 2 elements of the first column of B contain the solutions fx. Il < j < 2 } . We will illustrate this as well as the algorithm in the J following example. Example _2 Given R<8> as follows. x 1 = 2 j x 2 = 1 + x 1 x^ = 1 + 2x 1 + x 2 \ = k + x x + 3 x p + Sx^ x = 2 + 3x 1 + 3x, p + 2x + 3x^ x 6 = 3 + x l + x 2 + x 5 + \ + x 5 - Xy = 1 + 2x.. + 2x_ + x, + x, + k r + 2x A I J- d J) 4 9 O x 8 = 3 + x 1 + x 2 4- 2x 3 + 2x^ + x 5 + x 6 + x ? Tne solution set is [2, 3, 8, 31, 126, 173, 930, 1285) which is 21 a o •H ■U CO u d CM c o •H ■U CO U A v ttf M O «4-l CO W u 3 bO •H 22 O X en u o M u o «4-l 00 c c o QD X cr < 2 fa E C Q M be •H fa 23 ITERATION 1 ITERATION 2 ITERATION 3 2\ 3 8 X X ^s® 31 X 33 X X X 3\J 16 X X X 4 N 19 X X X X X X 21 6 30 8 2\ 24 3 \ FINISHED 2\ 3 8 X 31 X X 126 X X X 173 X X X X >X 900 X X X X X }X 1285 X X X XXX }x © Figure 3. Parallel Evaluation of R<8> 24 the same as shown in Figure 3. In the figure 'x 1 denotes a number which is no longer needed, matrices B and C have been overlayed as a single matrix, and © , © indicate, respectively, matrix multiplicatiorf and addition specified by Algorithm 1. From the above, we can obtain a theorem for general linear recurrence problems. 
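Before stating the theorem, the following sketch transcribes Definition 4 and Lemma 2 directly. It mirrors the quantities Algorithm 1 computes, but without the B/C partitioning of Figure 2, and it is serial code intended only to check the arithmetic; the function name and the random test data are ours.

```python
import numpy as np

def doubling_solve(A, c):
    """Sketch of the doubling idea behind Algorithm 1, via Definition 4:
        f_1(i,j)  = a_ij  (with a_i0 = c_i),
        f_2r(i,j) = f_r(i,j) + sum_{k=j+r}^{j+2r-1} f_r(i,k) * f_r(k,j),
    and Lemma 2 gives f_n(i,0) = x_i.  Rows are indexed 1..n and columns
    0..n-1 as in the text; n is assumed to be a power of 2."""
    n = len(c)
    F = np.zeros((n + 1, n))
    F[1:, 0] = c                          # a_{i,0} = c_i
    for i in range(2, n + 1):
        F[i, 1:i] = A[i - 1, :i - 1]      # strictly lower triangular a_{i,j}
    r = 1
    while r < n:
        G = F.copy()                      # every update must use the old f_r values
        for i in range(1, n + 1):
            for j in range(i):            # f(i,j) = 0 for i <= j
                k_hi = min(j + 2 * r, i)  # f_r(i,k) = 0 once k >= i
                if j + r < k_hi:
                    G[i, j] = F[i, j] + F[i, j + r:k_hi] @ F[j + r:k_hi, j]
        F, r = G, 2 * r
    return F[1:, 0]                       # x_1, ..., x_n

# Self-check against straightforward serial evaluation, in the spirit of Example 2.
rng = np.random.default_rng(0)
n = 8
A = np.tril(rng.integers(1, 4, (n, n)), -1).astype(float)
c = rng.integers(1, 5, n).astype(float)
x = np.zeros(n)
for i in range(n):
    x[i] = c[i] + A[i, :i] @ x[:i]
assert np.allclose(doubling_solve(A, c), x)
```

Each pass of the while loop corresponds to one iteration of Algorithm 1: one multiplication step followed by a summation of at most r+1 operands, which is where the parallel time bound of Theorem 1 below comes from.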
In what follows, we will write log_x to denote flog ? xT for notational simplicity. Theorem 1 Any R can be computed in 12 3 T <_ ■£ log 2 n + - log 2 n , n 3 with P £ ~ + 0(n ) . 2 The minimum speedup is 0( =— ) . log 2 n Proof: By Lemma 2, x. = f (i,0), 1 < i < n. But from J l n — — Definition 4, each f (i,0) can be expressed in terms of f , f , ..., f , • n n ii 1 2 4 Each level of the expansion, e.g., from f, to f, increases the computa- 2 tion time by one multiplication and one summation of ("T + 1) operands; k the summation can be done in at most log_(-^- + 1) steps which is less than or equal to log_k steps. Thus, for a total of log„n levels, we have log 2 n log 2 n 12 5 T < Z .1 + E 1 = - log 2 n + g- logon. y j=l ,1=1 d d 25 Since the serial computation time is n(n-l), the speedup S is n(n-l) S > 1, 2 3 n P - g g? n + 2 log 2 n 0( n 2 log ? n For the processor count, we refer to Algorithm 1 and Figure 2. The maximum number of processors are required at the multiplication time for each iteration i. Hence, we will count the total number of multiplica- tions for T. * Q., 1 < j < n/k, as the required number of processors p(? ) J J for a particular iteration i. It can be easily observed from Figure 2 that p(k=2 i ) = [k/2 E J + k (n-k) ♦* k/2 1) I j + f (k + 2k + . + (n-2k) for 1 <_ i <_ log 2 n . For large n, p(k) takes its maximum value at k = n/A and gives p(n/4) = (r^ n + I6n + 8n)/128; hence the maximum number of processors is 8 n 3 /68 + 0(n 2 ) Q.E.D. We close this section with several observations about Theorem 1. If multiplication time is significantly greater than addition time on some computer it is worthwhile to note that only log„n of our unit time steps are multiplications, the rest being additions. We also note that the processor bounds can be further improved by delaying the evaluation of some smaller trees, e.g., those for x. and x in Figure 1. However, as the algorithm 26 stands, all processors execute the same operation at the same time (SIMD organization) and by delaying some trees we would require additions and multiplications at the same time (MIMD organization) . As our algorithm 3 stands we need about n /68 processors (in an SIMD organization) and for 8 2 n <^ 2 this could be regarded as 0(n ) processors. Also note that in a very straightforward way, we can apply the algorithm and results to the solution of any triangular linear system of equations Ax = b . This is done by transforming the system Ax = b to the system x = c + Bx in which c . = b ./a. . for 1 < i < n l i n — — and b.. = -a.. /a.. for 1 < j < i < n , lj lj li - J - = for i < j . 2. 3 M-th Order Linear Recurrences In this section we discuss bounds on the time and processors for evaluating m-th order linear recurrences. The main result is Theorem 2. We also give Algorithm 2 which may be used for the parallel evaluation of an m-th order linear recurrence system. Definition 6 An m-th order linear recurrence system of n equations, R, is defined as R: x. = for i <_ , i-1 = c. + Z a. . x. for 1 < i < n , 1 • • !J 3 — — j=i-m where 1 < m < n. In matrix notation we write the above as x = c + Ax where 27 A is a strictly lower triangular band matrix, i.e., a = for i < j or i - j > m. For simplicity we will assume n and m are powers of 2. If n is not, we can augment the system, and if m is not, we choose the smallest power of two greater than m and apply our results directly. Since we are now dealing with a band matrix, we are able to prove the following lemma. 
Lemma 3 For any R, m <_ s <_ n/2, we have i-1 f 2s (i ' 0) = f s (i > 0) + Z f s (i,k)'f,(k,0) "for s < i < s+m, (19) k-s = § + £ " f (i,k)'f (k,0) for s+ra < i < 2s, (20) k=s where g is derived by replacing each a. ^ . in f (i,s+m-l) bv a k., s+m-1 s ' kO Proof : Since we are only concerned about i j^ 2s, it is sufficient to consider the first 2s equations, i.e., x. for 1 £ i <_ 2s as an R<2s,m> system which is a special case of an R<2s> system. Now, the equations for x., s < i < 2s, can be recalled from Equa- tion 17 (derived from Equation 9 and Equation 10), and the equivalent relation stated in Equation 18. However, by Equation 8 and the property that a. . = for i - j > m, we have f (s+k, 0) = a for m < k < s . (21) S S '&* vJ mmt ™" By substituting Equation 21 into Equation 17, and decomposing the x vector and its corresponding f vector into an m-vector and an (s-m+l) -vector, it is obvious that the first m-vector will confirm Equation 19 • Furthermore, the 28 second (s-m+1) -vector shows that f_ (s+m,0) zs f (s+m+1, 0] f 2s (2s,0) s+m x s+m+1 2s f (s+m,s) s f (s+m+1, s) s f (2s, s) f (s+m,s+m-l) f (s+m+1, s+m-1) f (2s, s+m-1) f s (s,0) f g (s+l,0) f ( s+m-1, 0) + b where b = f (s+nH-l,s+m) 1 f (s+m+2,s+m) f (s+m+2, s+m+1) 1 f (2s, s+m) f (2s, s+m+1) . s+m,0 s+m+1 , C 2s, (22) By Equation 16, we can rewrite b as b = Z s+m, s+m+1, s+m+2, 2s, where Z = y s+m+1, s+m y s +m+2 , s +m y s +m+2 , s +m+l y, 2s, s+m y' 2s, s+m+1 (23) 29 However, from Equation 16 and Lemma 1 we have f (s+xn, s+m-1 f (s+m+1, s+m-1 ) s ' f (2s, s+m-1) s y s+m, s+m-1 y s+m+l, s+m-1 y, 2s, s+m-1 = Z . a s+m, s+m-1 s+m+1, s+m-1 2s, s+m-1 (24) Note that each a. - for s+m < k < 2s in Equation 24 has a k, s+m-1 — — one-to-one corresponding a, _ in Equation 23, and that from Definition 2 there are no a. . for <^ j <_ s+m in any element of Z. Hence, Equation 22 through Equation 24 together prove Equation 20 in the lemma. Q.E.D. By using Property ii) and Property iii), we can easily extend the lemma as follows. Lemma k For any R, m < s < n/2, we have i-1 f_ (i, j) = f (i, j) + Z f (i,k)*f (k, j) for s < i-j < s+m , "2s k=j+s j+s+m-1 + Z f ,(i,k)-f (k,j) k-j+s for s+m < i-j < n , where g is derived by replacing each a. k,j+s+m-l inf s (i ^ +S+m - l)bya kr The parallel algorithm for evaluating R is essentially the same as that for general case, i.e., for each f (i, 0) = x. , 1 < i < n, we express it in terms of f , f , . . . f, recursively, and then evaluate them in the reverse n/ n' 1 "" 2 k 30 order simultaneously for all i. The main difference between this case and the general case is that the recurrence definition stated in Lemma 4 will be used instead of Definition 4 for the expansion of f , m < s < n/2. As a result, the computation time introduced at each level of expansion from f to f is a n m constant instead of a variable increasing with the levels of expansion. Similarly, to carry out this evaluation scheme for any given R system we can establish the following computational algorithm. Algorithm 2 a. Let B be a lower triangular matrix of order (nxn) in which the j-th column contains a. . , for j < i < n, where a. = c. and a. . = for 1,3-1 ° - - 10 i ij i-j > m initially. b. Let C be an alias for B i.e., B and C represent the same memory locations. Repeat this step for i = 1 , 2, . . . , log p n ; i: Set k = 2 1 ; ii: Partition B (diagonally shaded areas) and C (vertically shaded areas) as shown in Figure 4(a) (1 <^ i <^ log„m) and Figure 4(b) (log 9 2m <_ i <_ log_n) . 
Note that matrices B and C are overlayed as a single matrix in the figure; iii: If 1 < i < log m, then compute S =S .+T.*Q. for 1 < j < n/k simultaneously, u J J J else compute S.=S.+T.*Q. for 1 < j < n/k , J J J j ~~ ™~ and V.=V.+T *U f or 2 < j < n/k simultaneously J J J J 31 CM M •H U o 00 o 4-1 bO C •H e o •H ■U •H ■U U P* o e <1> M 60 32 VI E CM o 60 o 4-1 00 c •rl c o Ptl •a a) 3 C u 3 bO •H fa 33 /x X X X 2 2 Ji O CM CM \ *-t / yi" * /x * ! * X XIX /v x X X | X A \ m CM ^ A x \ is S 2 to in M O N 01 s s /* \ < oo — m r- c\J (V 0> /* \ 1.1 t I f ■ ! CM C\J ♦ >o t O » J (O U> £ § § * x\ - * A CN VO ■H V « M-l o G O •H •u ctj rH CO > W a) ctj >-i CO &4 (V J-i 3 00 •H 34 d. The first column of B contains the solutions x for 1 <_ 1 <_ n . At the end of the i-th iteration of Algorithm 2, the first 2 elements of the first column of B contains the solutions {x.|l <_ j <_ 2 }. In the algorithm, the V , 2 £ j <_ n/k, contain all the required values of function g as specified in Lemma 4 which will be used during the next iteration. We illustrate this in the following example. Example 3 Given R<16,2> as follows: x ± = 1 x 9 = 3+x^Xq x 2 = 2 +Xl x^ = l + x 8+ x 9 x 3 = 2+ Xl +x 2 x i:L = l+2x 9 +3x 10 x k = l+3x 2+ x 3 ^ = 2 + 2x 10+ x i;L x 5 = 2+3^+2^ x 13 = 2+x 11 +2x 12 x 6 = ^ +3 % +2x 5 x iU = 1+X 12 +X 13 ^ = l + x 5+ x 6 x 15 = 5+2x 13+Xl4 xg = 2 + 2x 6+ x ? x l6 = ^+x lif+ x 15 . ^ | The solution set is {1,3, 6, 16, 40,100, 141,3^3, 830, 1174, 5183,7533,20251,27785, 68292,96081} which is the same as shown in Figure 5« Now, we can state a general theorem about this class of problems. Theorem 2 Any R, can be computed in 1 2 T < (log 2 m + 2)log 2 n - — (log 2 m + log 2 m) . 3m 2 n with p <_ —7 — + 0(mn) . The minimum speedup is 0(- = ) . log 2 m log 2 n 35 Proof ; By Lemma 4 any f n (i,0) ■ x., 1 <_ i <_ n, can be expressed in terms of f^, f^, ..., f . Each level of the expansion increases the computa- 2 4 tion time by one multiplication and one summation of (m+1) operands. For a total of log„(— ) levels, we have °2 m t x < (log 2 Cmfl) + 1) logg(£> < (log 2 m + 2) logg(£) . To express f^ in terms of f^, f^, ..., f we use Definition A for ex- 2 4 pansion as in the general case. Hence, as stated in the proof of Theorem 1 we have 12? t 2 5 2 1 °S 2 in 4 t l0g 2 m * Thus, the total time T < t 4 t , i.e., Xr -*- C T p < ( lo e 2 m * 2)log 2 n - |(log 2 m + log^) . Since the serial computation time is 2mn-m(m+l), the speedup S is Q v, 2mn-m(m+l ) , mn v P " /•, o\-i I/-, 2 _ v log^m log^n' (log 2 m + 2)log 2 n - 2 (log 2 m + log 2 m) & 2 & 2 For the processor count, we refer to Algorithm 2 and Figure h. We will count the total number of multiplications for any iteration i as the required number of processors for that particular iteration. This can be 36 described in 2 cases: (i) 2 < 2 1 < m (Figure k&) f vl k / 2 , (k=2 i, < [x^fi.x)!] £ J + 1 + (2f -D| k(m-l) 2 k k . "^"l . 0=1 |g (k+2)(n-k+2)+^(m-l)(n-m-k+2)+2in(m-k) for 2 < k < m, and (ii) 2m < 2 1 < n (Figure Vb), p(k=2 i ) = \l + (g -2)(m+l)J [" Z j + m(| -1)1 (m+1) [m 3=1 j + m(- -m) ] for 2m < k < - , and p(k) m p v • , /n \ mn m -m = 2 j + m- (- -m) = — - — — 3=1 l 2 for k = n Since p(k) takes the maximum value at k=2m for n>>m, the maxi- mum number of processors p is 2 3 2 2 3m n + 2mn - n 10m - 2m - 4m 3m n , n/ . P 1 4 1 = -J— + 0(mn) . Q.E.D. As was true with Theorem 1, Theorem 2 requires only log_n multiplication steps. Similarly uniform operations are performed across all processors with this algorithm. 
By delaying some small trees, better 37 processor bounds could be achieved at the expense of requiring simul- taneous multiplications and additions. The algorithm and results can also be used in the solution of any linear system Ax=b where A is a triangular band matrix by a simple transformation as stated in section 1. 2.U Practical Implications Now we turn to a discussion of some practical implications of our algorithms. As noted earlier, to obtain the largest theoretical speedup achievable by Algorithms 1 and 2, an unreasonably large number of processors is required. However, by the following methods substantial speedups over serial machines can be achieved at reasonable efficiencies with greatly re- duced numbers of processors. First we shall discuss two methods of reducing the number of pro- cessors required. Then we present numerical results which give detailed estimates of possible speedups and efficiencies using various numbers of processors. Both general and m-th order recurrences are discussed and results are presented for ranges of m and n. Present multi-operation machines perform up to 64 simultaneous bit parallel operations (ILLIAC IV) or up to several hundred simultaneous bit serial operations (STARAN) . Systems under discussion allow for 10,000 or more bit serial processors [14]. As higher levels of circuit integra- tion become practical, it may be possible to use as many processors in future machines as the number of integrated circuit packages in current machines. Thus, machines with several hundred thousand processors may 38 some day be reasonable. The numbers of processors presented below are for this reason allowed a wide range. In modifications of both Algorithms 1 and 2, we use two methods to reduce the number of processors required. The first method, which we call c utting is to cut an nxn A matrix into [n/kl vertical strips of width k and process just one strip at a time. We carry out the following for k = r, 2r, 4r, ..., n, with r=2 for general recurrences and r=m for m-th order recurrences. For a given k value, consider the leftmost k column strip of the A matrix. The strip may be regarded as a triangular kxk system T (at the top of the strip) and a rectangular (n-k) xk system R (at the bottom of the strip). Furthermore, each of the [n/kl strips has the same form, with fewer and fewer rows in successive rectangular systems. Assume that p_ processors are required to solve T (the value of p can be determined from Theorem 1). Using the solution of T, we can solve R by simply substituting these values into each row and computing an inner product. Assume that this required p processors (actually p_ = k(n-k)). Let p = max(p ,p ) K R IK processors be chosen to compute the first strip as well as the remaining [n/kl - 1 strips. Clearly, p will suffice for each of the other strips if they are evaluated in two parts as was the first strip. Using this p value, we obtain overall speedup and efficiency figures and the procedure is repeated for all values of k = r, 2r, 4r, ..., n. Note that we could achieve better speedup results by increasing the strip width to occupy all p processors as we sweep to the right. This 39 was avoided to simply our experiments and its avoidance would simplify the compiler implementation of such a scheme for a real machine. The above procedure generates a number of possible speedups for various reduced numbers of processors. Next, we attack the problem of increasing the efficiency. 
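Before turning to efficiency, here is a small serial simulation of the cutting scheme just described (a sketch only, under the assumptions that n is a multiple of k and that the k-by-k triangular block is solved by plain forward substitution as a stand-in for Algorithm 1; the names are ours).

```python
import numpy as np

def solve_by_cutting(A, c, k):
    """Process the lower-triangular system x = c + A x in vertical strips of
    width k: solve the triangular block T at the top of the strip, then
    substitute the k new values into every remaining row (the rectangular
    block R) with one inner product per row."""
    n = len(c)
    x = np.zeros(n)
    rhs = c.copy()                     # accumulates contributions of finished strips
    for s in range(0, n, k):
        for i in range(s, s + k):      # triangular k x k system T
            x[i] = rhs[i] + A[i, s:i] @ x[s:i]
        rhs[s + k:] += A[s + k:, s:s + k] @ x[s:s + k]   # rectangular system R
    return x

# Usage check on random data (ours): x really solves x = c + A x.
n, k = 8, 4
rng = np.random.default_rng(1)
A = np.tril(rng.standard_normal((n, n)), -1)
c = rng.standard_normal(n)
x = solve_by_cutting(A, c, k)
assert np.allclose(x, c + A @ x)
```

With p_T processors for the triangular block and p_R = k(n-k) for the inner products below it, p = max(p_T, p_R) processors suffice for every strip, which is the processor count underlying the experiments reported in Tables 1 and 2.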
Assume we have a tree of height t which contains 2 -1 operation nodes and whose evaluation requires 2 processors. The 2 fc -l - 2 efficiency of such a tree evaluation is — — = = — . By halving the 2 t -t fc number of processors, only one extra processing step is required, so we can 2 t -l - 4 achieve an efficiency of — —z = — — . For large t, this approximately 2 t (t+1) t+1 doubles the efficiency with a negligible decrease in speedup. We call this process a fold of the tree. After i folds are performed the tree height is t+2 -i-2, while the number of processors is reduced to 2 /2 . Thus, for large i the product of (processors) * (parallel time), which is the denominator of the efficiency calculation, changes very slowly and the technique is of little value. In practice at most 4 or 5 folds proved useful. In the implementation of our algorithms it is of course not the case that a single tree is being folded. But at each stage of the calcula- tion a number of trees must be evaluated in parallel. In spite of the fact that the trees are of different height, our procedure folds them all uniformly. This would simplify a compilation algorithm and also allows all processors to perform the same operations at the same time. By more 40 elaborate scheduling procedures, greater efficiencies could be achieved. The results of our cutting and folding experiments are sum- marized in Tables 1 and 2. In both cases we first generated a number of possible results by cutting and folding. Then, by using the processor count intervals shown as columns in the tables, we tabulated the maximum speedup result obtained for each row. In fact, this is the maximum of a set of lower bounds obtained for each row, column, position, from the minimum speedup . In Tables 1 and 2, the rightmost columns are labelled "maximum possible." These refer to the upper bounds on the number of processors given by Theorems 1 and 2, respectively. In cases where this column is empty, the rightmost value tabulated is the maximum possible. Gaps in both tables for efficiency values up to 1.0 were not run for technical reasons, but could be filled in a continuous way. Table 1 may be interpreted as follows. If one had a machine consisting of p processors, by scanning down the column labelled p, minimum achievable speedup and minimum achievable efficiency can be read for various problem sizes n. Alternately, if one wants to design a machine for problems of size n, a row scan will indicate the maximum number of proces- sors required by our method to achieve the speedup tabulated. It should be noted that the speedups shown along the diagonal with p=n-l, which are trivial to achieve on a parallel machine, can be substantially improved by our methods. By reading across rows to the right, we see that for small n, the speedup can be improved by a factor 41 of 2 or 3 while still maintaining efficiencies of about 10% or better. For large values of n the speedup can be improved by about a factor of 10 over the speedup of the simple n-1 processor case, with order of 10% efficiencies. Of course, if a large number of processors exist they can be used on smaller problems with lower efficiencies to obtain higher speedup improvements than mentioned above. Table 2 may be interpreted as was Table 1, although several points should be clarified. The values shown in Table 2 were obtained for the case n=16,384. However, they hold approximately for other n values as well. This may be explained as follows. 
If we approximate the parallel computation time of Theorem 2 as T = log^m log_n, for a cutting scheme which uses strips of width k the speedup is approximately T c _ __± Z mn mk P T n log„m log„k P - log 2 m log 2 k & 2 & 2 ~ 2 Correspondingly, we have p = m k, which leads to S = * 2~ . (25) P m log 2 m log 2 (p/m ) It may be seen that Equation 25, with p held constant, has a minimum S value for some ra in the range of interest to us. Thus, for p > 256, the columns of Table 2 exhibit a decrease in S and then an increase as m P varies from 2 to 128. Equation 25 also shows that speedup, by this approximation, is 42 independent of n. Thus, we can use Table 2 as an approximation for any n value. Of course, we should not use arbitrarily many processors on any 2 problem and by Theorem 2 the limit can be approximated by p < m n. It follows that entries in Table 2 are valid approximately for n > *s- . Our m experiments were exhaustive for m _< 128 and n <_ 16,384 and our data substantiates the above. The approximations are introduced here simply to avoid presenting many similar numerical tables. Now let us consider the speedup improvement which can be obtained by our method over the simple parallel implementation using p=m shown on the diagonal. For small m and efficiencies of about 10% we can obtain improvements of 3 or 4. For larger m the improvement does not increase much, unlike the general recurrence case. However, if a rather large number of processors is available (say to solve large general recurrences) then very large speedup improvements (at efficiencies lower than 10%) can be realized for small m. For example, if 16,384 processors were available, we could achieve a speedup improvement over the diagonal values of about 225 for m=2 (S =451.93) and about 83 for m=4 (S =332.62). P P We conclude that for m-th order recurrences with small m, which are in practice encountered quite frequently, the methods discussed here should be useful in exploiting large parallel machines. 43 m 1/1 CM O 887 287 • • in 88 8 £8' CM O CM s W— CM 88! m o » p: 3 N6U1 5o| en on 5 S ^3 en o in Sin CM lO«< — ©< enf» ♦ noig F-om tON 9i cn r^O _ «M ^ S o oB jjoS SoS set 383 m Of- R S 10 CM 10 cm out £33 now Ne'3 33g — f >o 888 8 - S3 CM £ IO KOA in d * 8SK in dip *;7S s - x - r-CM3 CjON 1 s lOKie cm p— in cm d # cm m p in •— cm • • f. IS i cm m ;d* 3£3 CO O — in d 8 c* en in • • CM CM ON en en oi>>8 *7l| *MCM jf r-! d 8 g m 2Ri NOV I - nunc no* CM C* CM »— en ^- O O CM 8og m a? en en ia »n m en o d d >» 583 CM O * s° 5 8~3 in d i7 s •*8 c- o r-» 388 • -CM S oSJ Sm cm O m en d3 e». encM CM«-f- uion 323 V 8 CM ' en di 8 o ^78 S' M Q CM O •- ^^H SS| 83 = 8 C in in in ■ . cm »•- l# * CM tf> «d p cm in S8S8 CM d 83! SrsiS in o 883 d d 8 In ■ cm o 8 CM •— m en 8 enm oo d ON ' oin So «■ in CM o 88^ 44 CO 0) o c •» o 3 eo 8 S • • evj 53' cm o 88' CM r— 8' r- cm cm °.PR fO O ^- 00 US : $ CM CM CM ♦ on . . vo O n- $< >— 1*1 CM >-Offl ison SO CM o r- . . 00 •— o — CM O >— • • Cn Son ■- o co en cm UIOi- 8S' «* o en S tn~ s§: s°5 in -*■ 3 © ! aiei S°s 5°2 3S8 • • m ss .s — o * r-^O en m Q en — © »'ei 5s: est o 88' cm o 88T to CM WON t-o <5 • • «■> SCSI to o co ■ . en S82 . . 00 ui o c6 n ia :8J£ cnj o •— f* S • • • CM 8°r r^ o >- 8 • © CM CSI (SI 5°i co t— to O o ■— • • O 3 ro CM en r- eo co o to =s 3 © CM >— O CM £81 VO O CM to © in • • i/» r% o r** 3 3 8 >— 00 O CM • r— ■«■ o to IO a »- CM S o.- co 8SS s: — CM CO CM O CM O 4J o. en Q • io o f- • • CU 8°' 3o3 85S O IO w MOO Sf*» CM ©en eo o eo a 45 ems A 3. 
TIME AND PARALLEL PROCESSOR BOUNDS FOR LINEAR RECURRENCE SYSTEMS WITH CONSTANT COEFFICIENTS 3.1 Introduction Linear recurrences with constant coefficients arise frequently in general numerical computations. Several analytic methods for the solution of these equations are available, such as by solving for the roots of its characteristic polynomial or by the use of generating functions. We are interested in a parallel and direct computational algo- rithm which can be used to evaluate them at a high speed. Such syst may be represented as x = c + Ax" where c is a constant column vector, is an nxn strictly lower triangular matrix with bandwidth of m, m < n, and all elements are identical along each sub-diagonal. We show that 0(log 2 m log 2 n) time steps and 0(mn) processors are sufficient to solve such a system. We also show that general recurrences, i.e., with m = n, can be computed within Odog'n) time steps with at most n 2 /4 processors. While such systems remain in the same order of speed as in the algorithms for arbitrary coefficients discussed in Chapter 2, the bounds on processors are sharpened by a factor of m and n for the m-th order and general constant coefficient system, respectively. To achieve these results, all processors need only perform one type of operation at each time step (SIMD operation). We further show that our method can be modified to evaluate the remote terms in a homogeneous linear recurrence sequence with at most 46 2 3 0(m ) processors in contrast to the 0(m ) processors required by the Miller-Brown algorithm. Also, the parallel evaluation of n-th degree polynomials can be completed within 21og 9 (n+l) time steps with at most ^r- + 1 processors which compares favorably with recent results [15], [l6l for practical applications. 3.2 General Linear Recurrences with Constant Coefficients In this section we discuss bounds on the time and processors for the direct evaluation of the following class of general linear recurrences R: x. = 1 for i < 0, i-1 = c. + £ a. x. . for 1 < i < n . In matrix notation, we write this as x = c + Ax where A = strictly lower triangular matrix. For example, [a. . ] is a nxn R<4>: r" "" x i X 2 = *3 x 4 . ; ll L *J a 1 a 2 a ± a~ a„ a, r "1 x i X 2 • X 3 x 4 _ For simplicity, we will assume n is a power of 2. The main result is Theorem 3. We also give Algorithm 3 which may be used as a basic algorithm for the parallel evaluation of this class of problems. 47 First, let us state one important lemma which follows directly from Definition 2. Lemma 5 Given any R : x = c + Ax , its associated matrix Y as defined in Definition 2 has identical elements along each sub-diagonal. For example, let A = 1 2 1 3 2 1 then Y = 1 3 1 8 3 1 It is obvious that the j-th column of Y can be obtained by simply shifting the first column (j-1) places downward. We will denote this operation as y . = y, [ j-1] where y. is the j-th column of matrix Y, These notations will be used throughout this chapter. Now, we present the proposed parallel algorithm and the proof of its effectiveness will immediately follow. Algorithm 3 a) Let B be a lower triangular matrix of order (n*n) 1 1 ' ^"9 ' * * * ' ^"n ' b 2 = (0, o^, ct 2 , . . ., « n _ 1 ) , and b = b 2 [j-2], 3 < j £n . 48 b) Let C be an alias for B, i.e., B and C represent the same memory locations. c) Repeat this step for i = 1, 2, . .., log~n; i: Set k = 2 ; ii: Partition B and C as shown in Figure 6; iii: Compute S. - S. + T. * Q. 
for 1 < i < min(2,?-) j j j j ~ - k simultaneously; a) (2) n iv: Set y = y ± [j-1] for 1 < j < k, 2 <_ ft <_ £ ; v: Compute D. = D. + W. * Z. 3 3 3 3 simultaneously. n for 1 j< j j< min(2,-rr-) d) The first column of B contains the solutions x. for 1 < i < n. claim: To justify this algorithm, we need only to prove the following Lemma 6 At the end of the i-th iteration of Algorithm 3, we have - (1) t 1) y 1 ■ (x 15 x 2 , . .. , x^) ; 2) for each partition y £k+l,£k y Jlk+2,£k y £k+2,Jlk+l (£+1) . for 1 < I < 7- - — — k y ik+k,Ak y j2k+k,£k+l J2k+k, flc+k-1 where y are elements of the associated matrix Y of A; and 49 3) for k <_ — , D is the vector of the partial sums of the first k terms in the expression of x, . , x, _, ..., x~, , and similarly D 2 is that of y 3k+lj2k , y 3k+2>2kJ •••> y 4k>2k for k ■ which combines with (y k+ i,k' y k +2 ,k' •••' "TiCtJ. , K _ K.+Z , K. y_ ) from the previous iteration (hypothesis Condition (2)) in forming |k,k (2) (2) the first column y of partition Y of the current iteration. —(2) With the new result of y , now, we can see that the shifting operation in step (iv) will result in Condition (2) due to Lemma 5. After step (iii) and step (iv) , we know that Z = (x , x , ..., 3c _i ) • Since in all previous iterations, no operation has been done on the data below the partitions on the diagonal, Y (l) for 1 <_ i ^ | , we have D l (c k+l' C k+2' •'•' c k ) ' ; a k+l,l a k+l,2 w 1 -| a 2k,l a 2k,2 • • ^l.k-l ^+2,1 a k+2,2 * ^+2^-1 2k,k-l Hence, the resultant vector D of step (v) is a vector of the partial sums as stated in Condition (3). Similarly, it is true for D_. This completes our proof. 51 CO O 60 u o M c •H c o vO K 274 x X X J X X X \^ Figure 7. Parallel Evaluation of R<8> 53 It is worthwhile to note that the shifting operation stated in step (iv) could be easily accomplished by a proper indexing function, thereby no major computation time is involved and more efficient use of storage is possible by such a scheme. To illustrate this algorithm, we use a R<8> as an example shown in Figure 7. With this algorithm, we can how establish a general theorem about this class of problems . Theorem 3 Any R can be computed in 2 T < log n + 21og n - 1 . p — z £. with p _< n /4 . 2 The minimum speedup is 0( r— ) . log 2 n Proof : During the i-th iteration of Algorithm 3, if a sufficient number of processors are given, then the matrix operation in step (iii) takes at most one multiplication time plus the time for the summation of at most Cr+1) operands, which is log_k addition steps. So that for a total of log n iterations, this step alone takes log ? n 12 3 t 1 5. l J + log 2 n = 2 lo 8 2 n + 2 log 2 n * Similarly, step (v) takes at most one multiplication time plus the time for the summation of k operands which is log-k addition steps. 54 ,IK . Thus, for a total of log_(y) iterations, this step needs log 2 ty n t < Z 3 + log 9 (y) in 2 ! i i 2 log 2 n + 2 log 2 n " X Therefore, the total computation time T < t. + t. , i.e., 2 T <_ lo 8 2 n + 21o 82 n " X ' Since the serial computation time is n(n-l), the speedup S is s > 2 ^ = o(-3— -) . P log 2 n + 21og 2 n - 1 log 2 n To achieve the above speed, a sufficient number of processors should be provided at the multiplication time for each iteration i. So we will choose the maximum of the number of multiplications in step (iii) and step (v) as the number of processors required for that particular iteration. 
For step (iii), we can see from Figure 6(a) that k/2 p (k=2 X ) <2Ej for 2 is a special case of arbitrary coefficient system, we can use the latter algorithm to compute R with some operations masked off. Due to Lemma 5, now the matrix operations S. = S. + T. * Q. in step (iii) of Algorithm 1 are performed J J J J only for j = 1, 2. All others are masked and supplemented by shifting operations of S_. Further masking is possible even in computing the portion of S„ in the top triangular part (Figure 2) with Q ? and S_ shrinking to vectors instead of matrices, as in Algorithm 3 (Figure 6). In this case, it can be shown that the maximum number of processors required is about n 3 /128. 56 3.3 M-th Order Linear Recurrences with Constant Coefficients In this section, we will turn to the considerations of practical m-th order recurrence problems, i.e., R: x. = for i £ , m = c. + Z a. x. . for 1 < i < n , i i=1 i t-j - - where 1 <_ m < n. In matrix operation x = c + Ax with A being a strictly lower triangular matrix of bandwidth m. For simplicity, we will assume both n and m are powers of 2. Since A is now a band matrix, we can further speed up the computation of Algorithm 3 when i >. log 9 2m . This is done by changing partitions S., Q., T., Z., D. and W. in Figure 6 to that in Figure 8, 3 3 3 3 3 3 and introducing new partitions U., V., G. and E. and their corresponding 3 3 3 3 matrix operations as described in Algorithm 4. The proof of its validity can be found in Lemma 3 and 4 of Chapter 2. Algorithm 4 a) Let B be a lower triangular matrix of order (nxn) in which t>2 = (0> a i> a 2' ***' a m' ^' ' ' * » 0) > and b = b 2 [j-2], 3 <. j < n . b) Let C be an alias for B, i.e., B and C represent the same memory locations. c) Repeat this step for i = 1, 2, ..., log n ; 57 (i) Set k = 2 ; (ii) Partition B and C as shown in Figure 6 (1 <_ i <_ log„m) and Figure 8 (log 2m <_ i <_ log^n) . (iii) If 1 <. i £. log»m , n, then compute S. = S. + T^ * Q. for 1 < j < min(2,— ) J 3 J 3 ~ ~ 'm' simultaneously ; n s else compute S. = S. + T. * Q. for 1 < j < min(2,— ) , 3 3 J j ~ ~ k and V. = V. + T * U. for 2 < j < - 3 J 2 j - J -k simultaneously . (£) (2) n (iv) Set y = y. [j-1] for 1 <_ j <_ k, 2 £ l <_ t- ; (v) If 1 <_ i £ log m , n then compute D. = D + W. * Z. for 1 < j < min(2, ■=— ) 3 5 3 3 simultaneously ; 2m' n else compute D. = D. + W. * Z. for 1 < j < min(2,-rr-) Till — — 2k and E. = E. + W * G. f or 2 < j < Tjr 3 3 simultaneously. - 2k d) The first column of B contains the solutions x. for 1 <_ i £ n. For illustration, an example of R<16,2> is shown in Figure 9, In that figure, all numbers are kept in place to help understanding, although they may not be used after certain iterations. From the above, we can obtain a general theorem for the m-th order linear recurrence problems. 58 CD X (T < 2 e x. •U •H M O 60 O H-i 60 C •H c o ■u J-i (0 Pu c Q oo QJ U 3 60 59 ITERATION I tfsrf ^k^> T*pO 2 i ' 2 fx -— "f — — ^* "^M^"W i i • Hfe !\ -- -_w -_ + _-4--i_-t-'-V ' ^ .L^J._L.!iK if^« f4«X .i;j.j__.i_^X .UJ_i-LW-i?X Z 1 I I I 12 iX --}-■»— f-l-4-i-J—|— f-'X -_ r _4._J._4--l 1 1 l-VX 3 " ' i i ' ' 1 2 iK — r-*-- I-+-J— «— T-t-f-l— '^V Oil 1 1 ' ' ' 2 "Vv - h -uf- ( .-t.- t -L-' t-j-i-A iii iii i i i 1 2 rx. -_4_4 1 1 1 ( 1 [ 1 L-U.|..L l \ 1 i i I I i 1 i i 1 i 1 2 iK 2 i i i i i i i ' ' i I '2 i\. 2 n --T_±: i -_i._U_L._l., 4» l » I 3 ' i\ — ' — — |--t--l-. ..i4-L-L4-^ i ' i ' 3 [ ".\ + _^_4_ 4 — I l--i £■ 4--1- _, 1 ,__ ,__!__ 3 ! f — i — J — »— +-+-+-+- 4—}— p\ ' i l I i I i ? 
i\ Lii_i_4_j_w_ »_J_j___j_jiJi: ' I ' 7 I ' I , ' ! ' I ' ' * i.N 1 i l I i iii 1 3 ' tV li i i i I I i i i i i ' 2 i\ _4.-4.-__ + -^-4---_4_^_-|_-»--j-H--_-N 2 I i I ' i I ! ' i ' ! i |3 N _J I I I i I I U_l I I i 1 ITERATION 2 Figure 9. Parallel Evaluation of R<16,2> 60 ITERATION 3 IV 2 \ 5 2 \ a 3 r i 3 \ "\® 22 2 45 92~J — ]5 3 K 183 1 ill ^ — 5 3 [JjK 2^ 3 "* I 3 •5 l\ 3 l\ | © 10 16 + -- "" la 5 11 3 V^ 5 3 l\ 37 |43 21 11 5 |3 \ 70 |85 43 21 11 '5 3>v 146 |171 85 43 21 11 5 3\ ITERATION 4 rv 2\ 5 2\ 11 3 22 N- ~ 45 92 ^, 3 Jv --^53 ^K ~~" s® 183T 11 5 3\ 185] 2 2 3 >v 3 5 3 IV 10 11 5 3\ 16 21 It 5 3 }v 37 43 21 11 5 3 jv 70 85 43 21 11 5 3 >v 146 171 85 43 21 11 5 3 i\ ^•^^ lS. 2 \ 5 2 1\ 11 3 l\ 22 2 r\ 45 3 l\ 92 5 3 rs. 183 11 5 3 ^ ji\ 368 736 is V 1473 \f> 3 JV 2948 111 5 3 i\ 5694 [21 11 5 3 l\ 11791 | 43 21 11 5 3 i\ 23580 J85 43 21 11 5 i\ 47164 • 171 85 43 21 11 5 3 f^ i Figure 9 (continued). Parallel Evaluation of R<16,2> 61 Theorem A Any R can be computed in T <_ 2 log-n for m = 1, 2 £ Ulog-m + 3) log.n - (log 2 m + log_m + 1) for m > 1, with p < mn. The minimum speedup is 0(- ; ) . log 2 m log 2 n Proof ; For 1^.1^, log 9 m, step (iii) is similar to that of Algorithm 3 and takes totally log 2 m 12 3 t _< Z j + log 2 m = - log 2 m + - log 2 m . j=l However, for log 2m <^ i <_ log n, this step needs only one multiplication time and one summation time of (m+1) operands per iteration. This amounts to t 2 < (log 2 (m+l) + 1) log 2 (^) < (log 2 m + 2) log^) . For step (v) , it is similar to Algorithm 3 for 1 < i < log-(-r-) and the total time is log^-j) m 1 2 1 t 2 — l J + log 2^ = 2 log 2 m + 2 log 2 m ~ 1 * This same step will take one multiplication time and one summation time of m operands per iteration for log_m £ i <_ log(— ) . Thus, in total it has t 4 < (log 2 m + 1) log <*) . Hence, 62 T < E t = (21og m + 3) log„n - (log_m + log„m + 1) , P - j = 1 j 2 2 2 2 m > 1 . For m = 1, since the W.'s diminish, T < t, + t„ ■ 21og_n. 3 p — 1 2 b 2 Comparing this with the serial computation time 2mn - m(m + 1) , the speedup is S > 2mn - m(mfl) (21og 2 m + 3) log 2 n - (log 2 m + log 2 m + 1) = 0( mn log 2 m log 2 n ) . In obtaining the above speed, we assume that there are enough processors at the multiplication time of any iteration. The number of multiplications required in step (iii) can be observed from Figure 6(a) and Figure 8(a) that p-lOc . k/2 2 ) <2 Z j f or 2 < k < m , J-l i(r+l)[Zj+m(f- m)] k j=l l m k <^ E j + m(— - m) for k n for 2m <_ k <_ y , = n For step (v) , we can see from Figure 6(b) and Figure 8(b) that 63 m p 2 (k) < 2k(k-l) for 2 £ k < | , m-1 <_ 2 E j for k = m , j=l m-1 < (—+ 1)( Z j) for 2m < k < | , 3=1 m-1 _< E j for k = y . When n>>m, p.. (■—) = (3mn - 6m + 6m) /4 is the maximum value and 2 p_(k) = 0(m ) for all k's. Therefore, the processor bound is mn. Q.E.D. As was true with Theorem 3, Theorem 4 requires only 21og 9 n - 1 steps. All processors are performing the same operations at the same time. Without this restriction, a better processor bound could be achieved. Also note that as m increases, greater efficiency can be achieved by using this algorithm instead of Algorithm 2 for arbitrary coefficients as described in Chapter 2. On the contrary,, we can achieve twice this speed by applying the latter algorithm with some operations masked off. In this case, the processor bound can be shown to be 3 0(mn + m ) } an d one might prefer it as m becomes very small. 
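Before turning to those comparisons, it may help to see the logarithmic behaviour of Theorem 4 in a more familiar form. The sketch below is not Algorithm 4 itself: it is a present-day illustration (in Python, assuming NumPy; all names are ours) that recasts a constant-coefficient m-th order recurrence as a product of (m+1)x(m+1) companion-style matrices and forms all prefix products by an associative scan, so that about log_2 n rounds of simultaneous matrix products recover every x_i. This is essentially the recursive-doubling idea of Kogge and Stone [11]; its constant factors and processor count (roughly n m^2 here) are worse than the T_p and p <= mn bounds of Theorem 4, which exploits the band structure directly.

    # A sketch only (not Algorithm 4): solve x_i = c_i + a_1*x_{i-1} + ... + a_m*x_{i-m},
    # with x_i = 0 for i <= 0, by an associative scan over per-step update matrices.
    # Assumes NumPy; the parallel rounds are merely simulated by the while loop.
    import numpy as np

    def recurrence_by_scan(a, c):
        m, n = len(a), len(c)
        mats = []
        for i in range(n):
            # State s_i = (x_i, x_{i-1}, ..., x_{i-m+1}, 1) and s_i = M_i s_{i-1}.
            M = np.zeros((m + 1, m + 1))
            M[0, :m] = a                 # the m previous values feed x_i
            M[0, m] = c[i]               # ... plus the constant term c_i
            for r in range(1, m):
                M[r, r - 1] = 1.0        # shift the window of previous values
            M[m, m] = 1.0                # carry the affine '1' component along
            mats.append(M)
        d = 1
        while d < n:                     # ceil(log2 n) rounds, each one parallel step
            mats = [mats[i] @ mats[i - d] if i >= d else mats[i] for i in range(n)]
            d *= 2
        s0 = np.zeros(m + 1)
        s0[m] = 1.0                      # zero initial values
        return np.array([(P @ s0)[0] for P in mats])

    # Check against the straightforward serial evaluation for m = 2.
    a, c = [2.0, -1.0], [float(i + 1) for i in range(16)]
    xs, x1, x2 = [], 0.0, 0.0
    for ci in c:
        xi = ci + a[0] * x1 + a[1] * x2
        xs.append(xi)
        x2, x1 = x1, xi
    assert np.allclose(recurrence_by_scan(a, c), xs)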
We will conclude this section with some comparisons between these results and some known algorithms in the evaluations of two distinct special cases. First, in computing the remote terms, e.g., x , in a homogeneous recurrence sequence relation x- + olx. -, + a.x. .+ ... + a x, =0, 1 ± l-l 2 i-2 m i-m 64 Miller and Brown [17] have shown an algorithm with about the same speed 3 as that of Algorithm 4 by using 0(m ) parallel processors. However, it can be shown easily that all vectors U., V,, G, and E. are zero vectors, j J j J and hence those operations can be masked off totally. In addition, since we can leave out all unnecessary intermediate results, parts of the matrix operation S. = S. + T. * Q. in step (iii) and D = D + W * Z in step (v) can be further masked out and modified as shown in the new partitioning in Figure 10. Therefore, the maximum number of processors 2 3 2 is 3m . On the other hand, given m + m processors, we can use Algorithm 2 in Chapter 2 with the new partitioning as shown in Figure 11, and achieve twice the speed of the Miller-Brown algorithm. As a result, we can establish the following facts. Corollary 4.1 Any remote term x of a homogeneous linear recurrence relation x + a x . , + a o x. + .. . + a . x = , i 1 l-l 2 x-2 u i-m m 2 can be computed within (21og m + 3) log 9 n time steps with 0(m ) processors, 3 and within (log m + 2) log_n steps with 0(m ) processors. Any polynomial P (a) of degree n can be represented as the last solution of the following linear recurrence relation R: x. = c. + a x. .. for 1 < i _< n + 1 . x l x-i — By Theorem 4, this can be computed within 21og (n + 1) steps. However, the processor bound can be reduced as we only need the last solution. This is done by changing the partitions Q , S. and T. for j = 1,2 in Figure 8 65 to that in Figure 10 (note that other partitions U., V. in Figure 8 still remain). Thus, it can be shown to have the maximum demand of processors for k = 2, i.e., p £ I— — I + 1. For illustration, let us evaluate P ? (a) by Algorithm 4 (or Algorithm 2 since for m = 1, both algorithms become identical) but with the modified partitions mentioned above on the matrix B = [b..] of Figure 8. The whole process can 1J 8x8 be summarized as follows. First, we compute 2 a = a*a , and b. . = c. + c. , a i,l l i-1 for i = 2, 4, 6, 8 simultaneously; then compute 4 2 2 a = a *a , and b i,l = b i,l + b i-2,l a for i = 4, 8 simultaneously; and finally compute b 8,l = b 8,l + b 4,l a which is now = [(c g + c 7 a) + (c 6 + c 5 a)a ] 2 4 + [ (c , + c_a) + (c~ + c.a)a ]a The final result in b-- is obviously the value of P_(a). This procedure may be generalized to any degree n and the computational process will 66 proceed like 2 P (a) =c +c .a+ (c o+c a)a n n n-1 n-2 n-3 2 4 + (c . + c c a + (c , + c -,a)a )a n-4 n-5 n-6 n-7 + (c n _ 8 + c n _ 9 a + (c n _ 1Q + c n _ n a)a 2 + (c n _ 12 + C n-13 a + (c n-14 + c n-15 o)o,2)oA) ° 8 "T* • • • • It is interesting to note that the above procedure is exactly equivalent to the well-known Estrin's method [l8l« I n other words, Estrin's method becomes a very special case of our algorithms for evalu- ation of general R systems. Hence, we can formalize it as a corollary. Corollary 2.2 Any n-th degree polynomial can be computed within 2 log (n + 1) time steps using at most I— r— + 1 processors. By comparing this result with the known k-th order Horner's rule, it is slightly faster and generally requires less number of processors to achieve the same speed [10], [19]. 
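To make Corollary 2.2 concrete, the following short sketch (Python; the function names are ours, not notation from the thesis) evaluates p(x) = c_0 + c_1 x + ... + c_n x^n by the pairwise scheme just described: each round combines adjacent pairs with the current power of x and then squares that power, so about log_2(n+1) rounds of independent multiply-add operations suffice, and no round needs more than about (n+1)/2 of them.

    # A sketch of Estrin-style pairwise evaluation (our naming):
    # p(x) = c[0] + c[1]*x + ... + c[n]*x**n.
    def estrin(c, x):
        vals = list(c)                   # invariant: p(x) = sum_j vals[j] * y**j
        y = x                            # current power of x: x, x^2, x^4, ...
        while len(vals) > 1:
            if len(vals) % 2:
                vals.append(0.0)         # pad with a zero coefficient if needed
            # the pairs below are independent: one parallel multiply-add round
            vals = [vals[2*j] + vals[2*j + 1] * y for j in range(len(vals) // 2)]
            y = y * y
        return vals[0]

    def horner(c, x):                    # serial reference for the check
        acc = 0.0
        for coef in reversed(c):
            acc = acc * x + coef
        return acc

    coeffs = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, 6.0]     # a degree-7 example
    assert abs(estrin(coeffs, 1.7) - horner(coeffs, 1.7)) < 1e-9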
Kuck and Maruyama [16] show that a general polynomial form of degree n can be evaluated in 0(log_n + /81og_n) steps. While the speed is faster, it requires far more processors (p = 2n) than that required by our result. In practical applications where n is not very large, this method, even compared to the fastest known multifolding method [15], becomes attractive not only because of its simple implementation but also its easy integration with more general linear recurrence problems. 67 a JS 4-1 •H V-l c o o 60 •H rH 4-1 < «0 3 >. iH x> CO > A W B a .H C 01 V i-H l OS iH co W M 3 03 O PL, (U C >-l ■> M-l p o 0) M 3 60 •i-l 68 •H W P o fH toO 0) iH rH < iH rt >« V4 Xi cfl cm A s M n O c y-i V OS 00 c CO •H 3 C o o OJ •H c ■u ai •H 60 4J o H B «J cm X o c •H •H e E >m. However, the practical multiprocessing systems at present may consist of only a limited number of processors, p < n. In this case, using those algorithms could result in somewhat inefficient computations as shown in the middle of the upper left portion of Table 2. The major reason is the inherent nature of non-uniform distributions of processors required at each computational stage. On the other end of the line, it is trivial to show that such systems can be solved in 0(n) steps using p = m processors, with a limited speedup S = m. P We provide in this chapter two alternative algorithms for the evaluation of m-th order recurrences. It is shown that if n>>p>m, then any such system can be speeded up at least by ~ — - — p for arbitrary 2m+3 coefficients and ~ ^~_ p for constant coefficients, at an efficiency independent of n. When m is very small in practical programs, computations by these methods will generally be more efficient than by those discussed in Chapters 2 and 3. 71 4.2 M-th Order Linear Recurrences with Arbitrary Coefficients In this section we consider the evaluation of any given R: x = c + Ax for n > p > m . For simplicity, we assume n, p and m are all powers of 2 although our algorithms can be used for arbitrary values. We will describe this algorithm symbolically by an R<32,4> system as an example. The principle can apply to any R system. First, let B be a lower triangular matrix of order (nxn) and B = ["c,A] . All given data are shown as dots in Figure 12(a). Next, let us assume we have p = km = 16, then we can partition B into k = 2- = 4 subsystems Z , 1 < 1 < k. These are shown as solid m J — J — dots in Figure 12(a), in which each Z "* , 2 <_ j <± k, consists of the solid dots in the diagonal pre-concatenated by those in the first column of B. Thus, the number of columns in each Z , is r - (m-1) = 5 except Z with ^ = 8. We also partition B into Y J , 2 £ j <_ k, as shown in Figure 12(b), each with — columns. Every Z ■* or Y ~* will be treated as an independent recurrence system with the chopped corner assuming zero values. In the first stage, we solve Z J , 1 <_ j _< k, simultaneously, each with m processors. This is done in a very straightforward way, i.e., multiplying the i-th column by the (i-l)-th element of the first 72 M-. • • • • • • • • • • •«, • • z • • • • • • • #|0 • «io o o o o o o o o o o o • 2h,16- 3h,24 - n,32 o o • • o • z • • • • • • • •lO • •jo o _#|o o o o o o o o o o o o w (3) o • z • • • • • IO o I o oq_ 0006 000 o o • • .2. (a) MATRIX B Figure 12. Dynamic Partitioning for Algorithm 5 73 h.8 • n,32 • • • • Y *% • • • • • • • • • • • • • • (3) • • • V (b) MATRIX C Figure 12 (continued) . 
Dynamic Partitioning for Algorithm 5 74 h,8- 2h,16- 3h,24 ■ n,32 o o o o o o •I 2i y- -* o o 2 2 O D 2 O •J O o o o o o o o o o o o o o o o o o OOOOl o o olc o o o I o o o o o o ri: o o o o o w x o o o, ojo 0,0 OIO •~!o • • • • • •l Y 2 #l I o o o OOOI o o olc fo ^••^ o o o o o o o o o o o o, o ojo W 2 0|0 o o_oio o o • «no o o o o I I • Y 3 #| ° olo ojo o f o 10 1^ 10 o o o OOOi o o ojo o w 3 o,o o O O OIO o o o o o o (c) MATRIX E Figure 12 (continued) . Dynamic Partitioning for Algorithm 5 75 column of Z , and adding the resultant vector to the first column. This proceeds successively from i = 2 to the last column. Having done this, we will have in Figure 12(c) D == (x l' X 2' •*•' X n/k-l ) ' and, in particular Z, = (Z 11' Z 12' •" Z lm ) where z.. denotes the partial sum of the first (m+2-j) non-zero terms in the expression of x , 1 <_ j £ m, respectively. <£>i+j-i In the second stage, we solve each Y J and its associated (m-1) subsystems (defined in Definition 2) in parallel, and also simul- taneously for 2 <_ j _< k. As a result , we have W. = J r+l,r r+2,r y r+2,r+l r-hn,r y * * * y r+m,r+l y r+m,r+m-l y ,n - y ,n _ ... J r-h— l,r r+r— l,r+l k k r-h — l,r+m-l k for 1 £ j <_ k - 1 , 76 and Y. = J y r y r,|+m-l y r+m-l,| ' r+m-l,yfm-l for 2 <_ j £ k, ,n s where r = (— )j and y's are elements of the associated matrix Y of A. Now, by applying Lemma 3 repeatedly, we can compute the following first-order recurrence system of vector matrix form Z. = Z. + Y. Z. . , 2 < i < k xxi x-1 — — (26) where new x xl x2 • * ' Z *J) xm and z.. is defined as above. Note that Z, and Y, have been augmented by xj k k zeroes to become conformable and the first element of each resultant Z., x i.e., z.-, 1 < i < k, is the solution x, ,. s . . xl' — — (n/k)x After that, by the same reasoning of Lemma 3, we may evaluate all other solutions by D. x + W.Z. for 1 < i < k - 1, xx — — simultaneously . We may summarize this procedure as follows, 77 Algorithm 5 a. Let B be a lower triangular matrix of order (nxn) in which the i-th column contains a. . ., for j < i < n, where a. rt = c. J i,j-l -j _ — ' iO i and a. . = for i - j > m initially. b. Let C, E be aliases for B, i.e., B, C and E represent the same memory locations c. Set k ■» *■ and h = 7- ; m k i: Partition B, C and E as shown in Figure 12(a), (b) and (c) , respectively, in which the order of each partition is Z (j) : (h+m-1, h) for j = 1, (h, h-m+1) for 2 <_ j <_ k - 1 , (h-m+1, h-m+1) for j = k , y'^': (h+m-1, h) f or 2 <_ j < k - 1 , (h, h) for j = k , Dy (h-1, 1) for j = , • (h-m, 1) for 1 < j < k - 1 , for 1 <_ j < k - 1 , for j = k , for 1 _< j <_ k - 1 , for 2 < j <_ k - 1 , for j = k . Z.: 3 (m, 1) (1. 1) W.: J (h, m) Y.: J (m, m) (I, m) 78 ii: iii: Compute all recurrence systems Z , 1 <_ j <^ k, simultaneously; Compute each recurrence system Y J and its associated (m-1) subsystems in parallel, and also simultaneously for 2 < j < k; 1 < i < n. iv: Compute the recurrence system of vector-matrix form Z. = Z_, + Y. Z. . , 2 < i < k . x i i x-1 — — v: Compute + W.Z. 1 x for 1 < i < k - 1 simultaneously. d. The first column of B contains the solutions x. for x Note that at the end of step (ii) , solutions x., 1 <_ i _< h, are in place; at the end of step (iv) , x, . , 1 <_ i <^ k, are also in their corresponding positions. In the following we use a R<16,1> as an example to illustrate this algorithm. Note that matrices Z. and Y. become scalars in this i J case. Example 4 where Given <16,1>: x = c + Ax c = (1,2, 1,1, 1,5, 4, 1,2, 1,2, 4, 3, 1,2,1)* , a . . 
= for i - j > 1 or j-i^.0, 79 and the first subdiagonal of A is a d s (a 21' a 32 ,a 43'*" ,a 16,15 ) = (1,1,2,2,4,2,1,1,2,1,1,2,3,4,1) If we have p = 4, then we have k ■ 4 and (1) _ (4) . (2) . 1 "l "2 " 2 1 10 1 z (2 > = 5 4 4 2 z (3 ) = 1 2 2 1 10 2 , 1 1 , 4 1 - 3 1 3 2 4 10 1 , 2 " 1 " 2 - 4 2 Y (3) . 2 1 Y (4) m 3 4 1 , 1 , 1 The solutions of Z , 1 < i < 4, are D o" (x 1 ,x 2 ,x 3 ) = (1,3,4), Z 1 (1,9,22), Z 2 = 23 , (2,5,7), Z 3 = 11 , (3,10,42), Z 4 - 43 . = x, = 9 , 80 and the solutions of Y , 2 <_ i <_ 4, are W* = (0,2,8,16), Y 2 = 16 , Wj = (0,1,2,2), Y 3 = 2 , \t* = (0,2,6,24), Y 4 = 24 . With these results, we will solve the following recurrence system to obtain Z. = Z. + Y. Z. - for2>p>m, then any R can be computed by using p processors within (2m +3m)— + 0(m log_(p/m)) time steps. The minimum speedup is approximately ~ _, p Proof: In step (ii) , each system Z , 2 j< j <_ k, can be computed by m processors within o /inn \ 1 /TOR 1 \ t < 2(— - m) < 2(- 1) . 1 - p - p (j) In step (iii) , one processor is assigned to every Y J and each of its associated (m-1) subsystems for 2 <^ j _< k. This requires t„ < (2m -m) (2m-l) . I — p By applying Algorithm 2 to step (iv) now, during each iteration i k i (let r=2 ) we have — matrix-vector products as well as vector-vector k r additions, plus (— - —) matrix-matrix products. These amount to km , ,k r, 2 . , ^ „ . , . Tj - + (y - y) m inner-products of 2 m-vectors in total. By assigning 82 one processor to each inner-product and assuming vector-vector additions are performed at the end of each iteration, then it takes k 2 t 3 < log 2 k + E (2m-l) f(jp (m+1) - ^/p] r=2 < [(2m-l)£ + 1) + 1] log 9 ( £ ) . For the last step (v) , there are n - — - (k-1) pending solutions, each requiring one inner-product and one addition. Hence, totally we need t 4 < (2m)|-(n- if**- 1))/P] • The entire computation, therefore, can be done within 9 n 9 T £ E t. - (2m +3m)- + 0(m log, (p/m)) P i=1 i P for n>>p>m . And since the serial computation time is T- = 2mn - m(m+l) , 2mn - m(m+l) P (2m 2 +3m)-+ 0(m 2 log 9 (p/m)) P *• 2 p for n>>p>m 2m+3 Q.E.D, If only the last solution, say x , is required, then step (v) can be omitted. This gives us Corollary 5.1 If n>>p>m, then any remote term x of R 9 n 2 can be computed by using p processors within (2m +m)— + 0(m log- (p/m)) 2 time steps. The minimum speedup is approximately — - — p 2m+l r 83 For homogeneous R, a slight modification of the above algorithm should have more efficient computations than that stated in Theorem 5 and Corollary 5.1. However, we will not elaborate on it here. We close this section with some observations about the computa- tional efficiency. A quick comparison could be made at e.g., m = 1 as shown in Example 4, in which we solve the following system R: x.=c.+a.x. n , 1 < i < n , ill l-l — — in ~ — time steps in general, and within " — time steps if only the P P last solution is needed. For the latter case, we can compare it to the p-th order Horner's rule for the parallel evaluation of n-th degree polynomials (n+l>>p>2) which is equivalent to finding the last solution of the following system R: x. » c. + a x_, , , 1 < i < n + 1 . i i i-I — — It is well known that p-th order Horner's rule will take ~ — time steps P while our method only needs steps to evaluate a far more complex y P system. For more numeric comparisons with the previous methods, we list in Table 3 some experimental results for n = 16,384. 
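Before examining those tables, the structure of Algorithm 5 in its simplest setting may be worth restating in executable form. The sketch below (Python; every name is ours) carries out the m = 1 setting of Example 4: each of the k chunks is solved independently with the value entering the chunk left symbolic, the k chunk-boundary values are then chained through a short recurrence of length k, and finally the interior entries of every chunk are corrected in parallel. The parallel stages are only simulated; the point is that the work per processor is a small constant times n/p when n >> p, which is the regime Theorem 5 addresses.

    # A sketch only (our naming) of the m = 1 case of the block decomposition:
    # x_i = c_i + a_i * x_{i-1}, x_0 = 0, solved in k chunks of size h = n // k,
    # one (simulated) processor per chunk.
    def blocked_first_order(a, c, k):
        n = len(c)
        h = n // k                      # assume k divides n, as in the text
        x = [0.0] * n
        zend = [0.0] * k                # chunk-end value obtained with input 0
        w = [1.0] * k                   # product of the a's across a chunk
        for j in range(k):              # stage 1: chunks are independent
            v = 0.0
            for i in range(j * h, (j + 1) * h):
                v = c[i] + a[i] * v
                w[j] *= a[i]
                x[i] = v                # provisional value (exact for chunk 0 only)
            zend[j] = v
        bound = [0.0] * (k + 1)         # stage 2: true values at chunk boundaries
        for j in range(k):
            bound[j + 1] = zend[j] + w[j] * bound[j]
        for j in range(1, k):           # stage 3: fix up chunk interiors in parallel
            prod = 1.0
            for i in range(j * h, (j + 1) * h):
                prod *= a[i]
                x[i] += prod * bound[j]
        return x

    # Check against the plain serial loop.
    a = [0.5 + 0.01 * i for i in range(32)]
    c = [1.0] * 32
    serial, v = [], 0.0
    for i in range(32):
        v = c[i] + a[i] * v
        serial.append(v)
    assert max(abs(u - t) for u, t in zip(blocked_first_order(a, c, 4), serial)) < 1e-9

For m > 1 the same three stages survive, but the chunk-boundary quantities become the m-vectors Z_i and the correction multipliers become the matrices W_i and Y_i of Algorithm 5.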
It is easy to see that for small m this alternative algorithm can give far better speedups and efficiencies than the ones in Table 2, as shown in the entries to the right of the bracket in each row. For example, given p = 256 we can obtain a speedup improvement 70 32 over the previous method of about 5 (= * ) for m = 2 and about 14 . 81 84 w > •H ,Q ^ H ■s Eh ■ ' " ■ -=1- vo OJ VO H rH O O H f- O ITN H VO o LT\ OJ r- O OJ VO LT\ H VO o 0\CO on o -=*" o OJ CO VO H H O H CO LTN o o on o OS on ONO H OJ on H OJ C- O CO Lf\ -=1- H -=T O LTNVC O H LT\ O CO o H H on o H O H OJ O OJ VO O VO VO ir\ on H O H VO H CO -3- t—co on oj J- o ONVO O OJ on o on h J- on oj d OACO oj d OJ ON ON H OJ O LT\ H o> a ~=t- o ONO IfN OJ rH O C— CO vo lt\ j 1 d C— H H O l>- o • - on-* on on h d OJ VO VO VO cO d OJ ON O ON O i-H H * / ./ £ OJ -=j- CO VO H 86 w oj o c ID H U 3 o CO 1* -* CM M o «0 iH 0) c II •H hJ C M M 0) O T3 "4-1 M O m J5 1 1 ■u g •H J-i vw O o M iH CO ! w .a m s s s CM 5 ° o o ro o\ o H O £ 8 Sn $ r- On co O -* o H itn O i-4 On o H H CO O CM H VQ IA iH 0J o ON H £ S 00 21 Ov O CVJ Ov O H ?1 ° VO o ir\ co 00 O ^ s vo ia 35. r4 vS « CO O ON O ro O ro ro vp vo cu ro oo O i-t ITN r-l ITN i-l ro vo vo O vo o d ° VO O ro O VO 5s d IfN VQ VO H Os J- J- to vo o VO o H O CVI g* as ON CVJ itn on CO H CO r4 J- H CO vo CO o CM en 00 o ITN O ro o H O O ITN VO H -* d p ro CVI CVJ vo O ro ON ro ro aj d o o t— CO O H oo cvi r- H ^* o CO O H O 3 CO o t- CVJ O ro ITN O 5 3 t^- d 2 3 cu d VO CVJ in d CO 04 d ON O C— VO jt d ro ro •H O J- r— t- VO VO cvj d cu I 1 2.00 1 1.00 ; - - ft / / fl cvj j- 00 3 87 3 (= ->/ >yc. ) f° r m = 4. However, when m becomes bigger which is not the practical case usually, then this method will become impractical. To show the differences for a moderate range of n we use double entries in Table 4 for n = 128 and Table 5 for n = 1024. Each entry represents S and E . The left entries are for the previous algorithm with p p p omitted, and the right ones for the new algorithm. From these data and other untabulated results for different ranges of n, we can conclude that: Observation In the evaluation of m-th order linear recurrences , if the following conditions are satisfied: -i \ n .2 !) 2 - P - Am and 2) m <_ 16 , then the computation by Algorithm 5 will yield the same or better speedup (and hence efficiency) than by Algorithm 2. 4.3 M-th Order Linear Recurrences with Constant Coefficients In this section we discuss an efficient algorithm for R when n>p>m. From Lemma 5 in Chapter 3, we know that the partitions of 00 00 Figure 12 now will have (by neglecting the augmented parts of Z , Y , Vi and V Z (D „ z o> CO . y (j) for 2 ^ ± k> 2 < j £ k , 88 and hence the resultant matrices W i =W for l£i < k- 1, 1 < j < k- 1 , i t -t for 2 £ i £ k, 2£j£k . Further, t. = t^Ij-1] for 2 < j < m , where t. is the j-th column of matrix T as shown in Figure 13, and brackets [x] indicate shifting operations as defined in Chapter 3. With these properties, we need only to solve the recurrence (2) — — system Y (without any associated subsystem) to obtain w. - and y 91 • After the shifting operation of T, we can complete the evaluation of Y 2 by Y 2 = Y 2 + Q W;L . Therefore, we can establish the following method for R problems Algorithm 6 a. Let B be a lower triangular matrix of order (nxn) in which b l (C 1' °2' •*" °n ) ' b ? - (0, a n , a , ..., a , 0, ..., 0) V ^2 m and b = b 2 [j-2], 3 < j < n . 
89 h.8- 2h,16- 3h, 24- n,32 - o o o o |0 o o IO o o JO O O Oj 'l |OJb 6 oj IOIO W} Ol l9loo_° H ' )••••) Ml 2 • • Y 2 • o o o o o o_o o _o| \0 Q o!C «_«_y_ w J s^> O > O \ \ o o o o o o, o o OIO o w 2 o|o °_2-9_°J • • • •io • • • •]■ • • Y S «I L*fJL*l o o o o o o o o °\ o|o 0|0 OIO io l_ o o o O O 0| O O OIO IOOWjO'OO |o_o polo o o >• • Y 4 ^iO OOP Figure 13. Dynamic Partitioning of Matrix E for Algorithm 6 90 b. Let C,E be aliases for B, i.e., B, C and E represent the same memory locations c. Let k = £ and h = — ; m k Partition B and C as shown in Figure 12(a) and (b) , respectively. Partition E as shown in Figure 13, where T is of order (h+l,m) , and W , Y , Q are all of order (m-l,m-l); (j) ii: Compute all recurrence system Z J for 1 <^ j j< k (2) and Y , simultaneously; iii: Let t. = t- [j-1] for 2 <_ j <. m, and then compute Y 2 - Y 2 + Q Wl ; iv: Compute the recurrence system of vector-matrix from z i " z i + Vi-i > 2 < i < k ; v: Compute D. D. + W-Z. for 1 < i < k - 1, 1 i — — simultaneously . d. The first column of B contains the solutions x. for x 1 < i < n, In analogy with Theorem 5, we can derive a general theorem from this algorithm. 91 Theorem 6 If n>>p>m then any R can be computed by using p processors within "^ _ n + 0(m log (p/m)) time steps. The minimum speedup is approximately r*- _ p . Proof : Each independent system in step (ii) can be computed by m processors within . .,mn \ ^ n z ™ 1 i n t < 2( m) < 2(— — -l; . 1 — x p-m — p-ni 2 There are (m-1) inner-products of 2 (m-l)-vectors in step (iii) One processor can compute one inner-product in 2. t 2 < 2 (m-1) (m-1)' • In step (iv) , we can use Algorithm 4. Then, during each i k iteration i (let r = 2 ) , there are — matrix-vector products as well as r k vector-vector additions, plus -r matrix-matrix products for 2 <_ r <_ — . Applying the same strategy as that for step (iv) in Theorem 5, we require k/2 2 t 3 < ( I (2m-l) f(|S + a-)/p|) + (2m-l)[^| + log 2 k < [(2m-l)(|+ 1) + 1] log 2 (£) . Similarly to that for Theorem 5, step (v) takes Thus , the algorithm totally requires 4 T < j t 2mn(2p-m) + q^ ( /m)) for n»p>m . v i=i p(.p-mj l 92 Since T = 2mn - m(nrl-l) , we have 2mn - m(m +l) p — 2mn(2p-m) n/ 2 1 , , N . p(p-m) + ° (m log 2 (p/m)) p-m , % p for n>>p>m . 2p-m Q.E.D. By omitting the last step (v) , we can easily show that Corollary 6.1 If n>>p>m then any remote term x of R can be computed by using p processors within — 1- 0(m log» (p/m)) time steps. The minimum speedup is approximately p-m. It is easy to see that this algorithm offers a very efficient computation for linear recurrences with constant coefficients. Since, by letting e.g., p = 2m, we have — <_ E <_ — , and if only the last solution is required we have — <_ E < 1. In other words, by using this algorithm under practical situations, we may obtain speedups proportional to the limited number of processors we have, and the constant of proportionality S P P possible speedup, is at least one-third or more. Also, it is interesting to note that for the parallel evaluation of n-th degree polynomials, which is a special case of Corollary 6.1 with m = 1, this method has the same E = —*- , which can be regarded as the ratio of the actual and maximum order of speed as does the p-th order Horner's rule (n + l>>p _> 2), with efficiency E > -r- . p - 2 As was true with Algorithm 5, Algorithm 6 could be expected 93 to give higher speedups and efficiencies for the evaluation of practical R systems. 
Similar conditions to those observed in the preceding section may be used to decide when this algorithm is warranted. However, since a constant-coefficient system is a special case of an R<n,m> system, the previous observation still holds for R<n,m> systems.

5. SPEEDUP OF LOOP COMPUTATIONS IN ORDINARY PROGRAMS

5.1 Introduction

With all the studies of linear recurrence systems behind us, we now proceed to look at the possible speedups of iterative programs, which are usually written as loops in some high-level language using, for example, Fortran DO or Algol FOR statements. As stated in Chapter 1, there are two ways to increase the speed of these loop computations on a multiprocessing system. The first is to devise new parallel algorithms and provide existing languages with adequate parallel features. This approach helps by providing explicit parallelism in the program and reducing the time otherwise spent in iterative loops. For instance, the development of the parallel FFT algorithm is a prominent example [20]. All algorithms in the previous chapters can be used to replace iterative loops that solve triangular systems in many numerical programs. An extensive survey of these areas can be found in [21], [22].

What we are interested in here is the second approach, i.e., automatically detecting the implicit parallelism in ordinary program loops at the compiler level. Many people have worked in this area. In his automatic program analysis, Russell gives some algorithms to test the possibilities of replication of simple loops [5]. The most extensive studies are done by Muraoka [10]. He provides methods to analyze subscript expressions and to extract parallelism in a loop with a single assignment statement by the replication or wave-front method. The latter scheme is also used by Lamport for an ILLIAC IV Fortran compiler [23]. For loops with multiple assignment statements, Muraoka also describes algorithms to exploit parallelism by replication of the whole loop or by distribution of the original loop into a set of smaller loops, as well as by the wave-front method between assignment statements. Lately, we have generalized most of his work and implemented it in a Fortran program analyzer to measure the potential speedups in ordinary Fortran programs [7], [8]. Cohagan recently also reported techniques to generate vector operations from DO loops in the TI ASC Fortran compiler [24].

All of these works have demonstrated the feasibility of detecting a fairly large amount of parallelism in ordinary program loops. However, a frequently encountered statement in a loop, such as

    X(I) = X(I-1) + Y ,

is not handled by most of the previous workers because of its sequential nature. In fact, this is a special case of linear recurrence relations. Muraoka has discussed this class of first-order recurrences and shows that they can be speeded up significantly by forward substitution and tree-height reduction algorithms. Nevertheless, for a slightly different statement such as

    X(I) = X(I-1) + X(I-2) + Y ,

the above method becomes very difficult to implement and may result in poor speedup if the number of iterations is large. But by using the algorithms of Chapter 2 through Chapter 4, which are simple, efficient, and require no complicated machine structure to support them, we feel that it is practically feasible and desirable to extract parallelism from even these highly serial codes in ordinary programs.
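As an informal illustration of this point, the sketch below (Python, assuming NumPy; all variable names are ours) records what a compiler would have to notice about the statement X(I) = X(I-1) + X(I-2) + Y: the loop is an R<n,2> system x = c + Ax with constant coefficients a_1 = a_2 = 1 and c_i = Y, so any of the algorithms of Chapters 3 and 4 could replace the serial loop. Here we only verify the identification against the loop as written, rather than invoke one of those algorithms.

    # A sketch: the loop body  X(I) = X(I-1) + X(I-2) + Y  is the linear system
    # x = c + A x, with A strictly lower triangular of bandwidth 2 and constant
    # coefficients a_1 = a_2 = 1 (assumes NumPy; names are ours).
    import numpy as np

    n, Y = 12, 0.25
    c = np.full(n, Y)                    # constant term of every row
    A = np.zeros((n, n))
    for i in range(n):
        if i >= 1:
            A[i, i - 1] = 1.0            # coefficient of X(I-1)
        if i >= 2:
            A[i, i - 2] = 1.0            # coefficient of X(I-2)

    # The serial DO loop as written, taking X(0) = X(-1) = 0.
    x = np.zeros(n)
    prev1 = prev2 = 0.0
    for i in range(n):
        x[i] = prev1 + prev2 + Y
        prev2, prev1 = prev1, x[i]

    # The compiler-level identification: the loop's results satisfy x = c + A x.
    assert np.allclose(x, c + A @ x)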
In section 2, we discuss theoretically the speedup spectrum of different types of loops in terms of the problem size (the sequential computation time T_1). As shown before, the smaller the order of any recurrence represented in the loop, the better, as far as speed and processors are concerned. To facilitate this, we introduce informally in section 3 the distribution algorithm to decompose a given loop into loops of smaller sizes. By this algorithm, we may also be able to expose vector operations as much as possible. In section 4, we discuss time bounds of some loop structures with known computation patterns (dependence graphs) and certain measurable parameters, with which the minimum speedup on a multiprocessing system of a particular class of loop computations (and hence of the algorithms they represent) can be estimated. At last, some practical implications for compiler and machine design are described in section 5.

Before we proceed further, some basic definitions and notations are given below which will be used throughout the rest of this chapter.

Definition 7

An assignment statement is denoted by x = E, where x is a single or array variable, and E is a well-formed arithmetic expression composed of the four arithmetic operations (+, -, *, /), left and right parentheses, and atoms which are constants or variables.

Definition 8

A loop is denoted by

    L = (I_1 <- N_1, I_2 <- N_2, ..., I_d <- N_d)(S_1, S_2, ..., S_s)
      = (I_1, I_2, ..., I_d)(S_1, S_2, ..., S_s)

where I_j is called a loop index, N_j is an ordered set called an index set, and S_j is called a body statement, which may be an assignment statement or another loop. We will call L a basic loop if no S_j is itself a loop.

For example, a basic loop like

    L:     DO S_2 I_1 = 1, 10, 2
           DO S_2 I_2 = 1, 10, 1
    S_1:   Y(I_1,I_2) = A(I_1,I_2) + B
    S_2:   Z(I_1,I_2) = C(I_1,I_2) + D

is written as L = (I_1 <- N_1, I_2 <- N_2)(S_1, S_2), or simply L = (I_1, I_2)(S_1, S_2), where N_1 = {1,3,5,7,9} and N_2 = {1,2,3,...,10}, with |N_1| = 5 and |N_2| = 10.

5.2 Speedup Characteristics of Program Loops

In this section, we will characterize program loops according to their possible speedups over the serial computation time T_1. Obviously, a loop is equivalent to a sequence of assignment statements laid out in the original execution order defined by the index sets. In order to formalize the problem, let the i-th statement of such a sequence be

    x_i = E_i ,    1 <= i <= n ,

where n is the total number of assignment statements executed by the loop. For instance, given a loop L as that in the preceding section, we have

    L:  x_1   = Y(1,1)  = A(1,1)  + B
        x_2   = Z(1,1)  = C(1,1)  + D
        ...
        x_99  = Y(9,10) = A(9,10) + B
        x_100 = Z(9,10) = C(9,10) + D .

If none of the expressions E_i, 1 <= i <= n, contains an atom x_{i-m} for m > 0, then there is no dependence relation between any pair of statements. Thus, all x_i = E_i, 1 <= i <= n, can be computed in parallel. We will call this set of loops R. For example, the following loop, which performs a matrix addition and a scalar multiplication,

        DO 5 I = 1, 10
        DO 5 J = 1, 10
        Y(I,J) = A(I,J) + B(I,J)
    5   Z(I,J) = C(I,J)*D ,

has this property. The total time required by any L in R is, by Brent's argument [25],

    T_p <= 4 log_2 e ,

where e = max {e_i} and e_i is the number of atoms in E_i, 1 <= i <= n. Hence, we have for R

    S_p >= T_1 / (4 log_2 e) = O(T_1) .                                   (27)

Now, let us study a slightly more complicated loop, such as one that performs a vector inner-product:

        DO 5 I = 1, 10
    5   T = T + A(I)*B(I) .
Loops of this kind have the property that some of the expressions E_i, 1 <= i <= n, contain a linear combination of x_{i-m_i} for m_i > 0. We will call this set of loops LR. For any L in LR, if we compute simultaneously all subexpressions in E_i, 1 <= i <= n, which do not depend on any value computed in the loop, i.e., on x_i for 1 <= i <= n, then the resultant loop can be treated as an R<n,m> system, where m <= n is the maximum of the m_i over all i. This can be illustrated by the above example of a vector inner-product, in which we can evaluate A(I)*B(I) for 1 <= I <= 10 in parallel. As a result, we obtain all required coefficients to compute the R<10,1> system represented by the original loop. The total computation time of such loops is, therefore, the preprocessing time of the coefficients, which is 4 log_2 e time steps, plus the time to solve an R<n,m> system as stated in Theorem 2. Since n <= T_1 <= ne, we have a speedup for this subset of LR

    S_p >= T_1 / [ (log_2 m + 2) log_2 n - (1/2)(log_2^2 m + log_2 m) + 4 log_2 e ]
        = O(T_1 / log_2 T_1) .                                            (28)

Next, consider the subset of loops which has m = n. For example, given an upper triangular matrix A, to solve Ax = b by the traditional back-substitution method we may write a loop like

        DO 5 I = 10, 1, -1
        X(I) = B(I)/A(I,I)
        DO 5 J = I + 1, 10, 1
    5   X(I) = X(I) - (A(I,J)/A(I,I))*X(J) .

In this example, if we preprocess B(I)/A(I,I) for all I, and A(I,J)/A(I,I) for all I and J, we obtain an R<10> system. Since m = n, this is the worst-case loop of LR. Hence, we can say that the computation time of any L in LR is less than 4 log_2 e plus the time stated in Theorem 1. Therefore, we have for any L in LR

    S_p >= T_1 / [ (1/2) log_2^2 n + (3/2) log_2 n + 4 log_2 e ]
        = O(T_1 / log_2^2 T_1) .                                          (29)

Next, we study a simple-looking, but more complicated loop:

        DO 5 I = 1, 10
    5   X(I) = (X(I-1) + A/X(I-1))/2 .

This is a familiar iterative program for approximating sqrt(A). This class of loops has the property that some of the expressions E_i, 1 <= i <= n, are non-linear combinations of x_{i-m_i} for m_i > 0. We denote this set by L̄R̄ (non-linear recurrence loops). LU decomposition algorithms in standard numerical programs are another well-known example in this class.

Muraoka shows that by using forward substitution any loop with
5.3 Exploitation of Parallelism in Program Loops In the preceding section, a theoretical speedup spectrum has been established for various kinds of loops, i.e., from the simplest Type loop to the most complicated Type 3 loop. In this section, we present a general algorithm which may be used in the semantic analysis 104 portion of a compiler to detect inter- and intra-statement parallelism in program loops and achieve higher speedups. The central idea of this algorithm is to distribute (decompose) a given Type i loop into a set of smaller size loops of Type j , <_ j _< i, which upon execution give the equivalent results of the original one. This is essentially to reduce a. in s - * P ajdogjT/ (and hence increase speedup) as much as possible. We will describe the algorithm informally by using examples. The idea of loop distribution was introduced by Muraoka [10], and later was implemented in our Fortran program analyzer to measure potential parallelism in ordinary programs [7], [8]. Should it be used in an actual compiler, the details will depend largely on different levels of sophistication in implementing them as discussed later. Hence, in what follows, we will address only the basic concepts which would be required in an idealized loop distribution algorithm. Distribution Algorithm Given a loop L = (L , lo» * * * » H ' i » o * • • • > " Q ' * Step 1 By analyzing subscript expressions and indexing patterns, construct a dependence graph D with respect to loop index I for 1 <. u 5. d* In D , there are s nodes representing S. for 1<_ j <_ s. An 105 arc from S, to S indicates that, for fixed values of I- , I-, ..., I , an atom of S during certain iterations has been updated by the output variable of S, during an earlier (or the same if k < &) iteration generated by the nest I , I 1 , ..., I, . Example 5 Given a loop DO S 2 I- - 1, 10 DO S 2 I 2 = 1, 10 DO S 2 I = 1, 10 A(I 1 ,I 2 ,I 3 ) = B(I 1 -1,I 2 ,I 3 )*C(I 1 ,I 2 ) + D*E B(I r I 2 ,I 3 ) = A(I 1 ,I 2 -1,I 3 )*F(I 2 ,I 3 ) we have D r D, D 3 : Example 6 Given a loop L: V V S 3 = S 5 : DO S I - 1, 10 A(I) - A(I-l) + F(I) B(I) = A(I) + B(I+1) C(I) = D(I-2) + H(I) D(I) = C(I-l) + B(I) E(I) = D(I) + N(I+3) 106 we have Step 2 On the dependence graph D for 1 < u < d, define a partition tt of {S , S , S } in such a way that S and S are in the same subset if and only if there is a chain of arcs from S. to S k. a and vice versa. Elements of each subset are ordered by their subscripts, For Example 5, we have TT ;L = {(S 1 ,S 2 )>, 7T 2 = {(S 1 ),(S 2 )}, TT 3 = {(S 1 ),(S 2 )} , and for Example 6, we have 7 r 1 = {(s 1 ),(s 2 ),(s 3 ,s 4 ),(s 5 )} Step 3 On the partition it = {if ,, i „, ...} for 1 < u < d, u ul u2 — — define a partial ordering relation y in such a way that tt . y t\ . (reflexive), and for i ^ i, tt . v tt . if there is an arc in D from some Ul uj u element of tt . to some element of tt . . The v relation is also ux uj ' 107 antisymmetric and transitive. By neglecting those relations implied by reflexivity and transitivity, we have for Example 5 (S.) y (S,,) in i^- And for Example 6, we have (S^ y (S 2 ) , (S 2 > y (S 3 ,S 4 ), (S 3 ,S 4 ) y (S 5 > in ir^ Step 4 Let the (d-u+1) innermost nest of L be L for 1 < u < d, i.e. , " tt r i 2 W "Wi V (s r s 2 V 1 ■ (I 1- Z 2 W*"' • Then replace L according to tt with a set of loops {(I)(tt .),(I)(tt „) , ...} where (I) = (1^1^,...,:^). The condition for the partial ordering relation y insures that data are updated before being used. 
Hence, any execution order of the set of loops which replaces L will be valid as long as this relation is not violated. Hence, for fixed values of I, , I„, ..., I , , if it . v 12 u-*l ui tt then loop (I) (tt .) must be evaluated before (I) (tt .), otherwise uj r ui uj they may be computed in parallel. (In general, we can also use statement substitution to remove this relation between some of the distributed loops. This will be discussed in the next section). For instance, in Example 5, we have now three possible ways in the evaluation of L. Besides the first one which happens to be L, the 108 other two are and do s r- *»l, 10 DO S I = 1, 10 DO S I 3 - 1, 10 S x : A(I 1 ,I 2 ,I 3 ) = BClj-1,1^, I 3 )*C(I 1 ,I 2 ) + D*E DO S 2 I 2 = 1, 10 DO S 2 I = 1, 10 S 2 : B(I 1 ,I 2 ,I 3 ) = A(I 1 ,I 2 -1,I 3 )*F(I 2 ,I 3 ) , DO S I = 1, 10 DO S 3 I = 1, 10 DO S 1 I = 1, 10 DO S 2 I = 1, 10 S ± : A(I V 1 2 ,1 3 ) = •.. S 2 : BCI^I^I^ = ... S 3 : CONTINUE in which the two innermost loops are computed in parallel. Similarly, one can evaluate the loop in Example 6 as four sequential loops: 109 DO S I = 1, 10 S 1 : A(I) = A(I-l) + F(I) DO S 2 I = 1, 10 S 2 : B(I) = A(I) + B(I+1) DO S, I = 1, 10 S 3 : C(I) = D(I-2) + H(I) S 4 : D(I) = C(I-l) + B(I) DO S 5 I = 1, 10 S : E(I) = D(I) + N(I+3) Ideally, we would like to choose the one which gives the fastest results for the original loop L under the given machine con- figurations. For illustration, let us assume an arbitrary number of processors are available and statement substitution is not allowed. By using Example 5 again, the decomposition according to D_ will yield two Type loops inside the nested loops I and I~, and each of them can be evaluated in parallel. Thus, T = 10xl0xmax{2,l} = 200 . P We may also distribute it by D ? which gives us two sequential type loops inside loop I, and T = (2+1) x 10 = 30 ; P 110 or by D which may result in an R<2000,199> system after preprocessing of coefficients as described in the previous section. Thus, we have for this case T = (log 2 199+2) log 2 2000 - l/2(log 2 199+log 2 199) + log 2 4 = 76 . Clearly, we will choose the second decomposition to obtain a speedup of s = li . 4000 p T 30 ■ LJJ ' P Similarly, the loop in Example 6 will be distributed into a set of four sequential loops (I) (S ) , (I) (S ) , (I)(S ,S.) and (I) (S ) which are Type 1 (m=l) , Type 0, Type 1 (m=3) and Type loops, re- spectively. Obviously, both loops (I) (S ) and (I) (S-) need t.. = t„ = 1 Let preprocessing time of coefficients of loops (I)(S 1 ) and (I)(S_,S.) be one time step, then the computation times of both loops are t 3 = 21og 2 10 = 8 , and t 4 = (log 2 3+2) log^O - l/2(log 2 3+log 2 3) = 17 . Thus , the overall 4 T = I t. + 1 = 28 , P 1=1 X while T = 5x10 = 50 . So that, we have for Example 6 T s =-i a 10 p T 28 L ' P Ill It is interesting to note that both loops in Example 5 and Example 6 are Type 1 by their very nature. However, the speedup of the second loop is far smaller than that of the first one. This is due to the big difference of problem sizes (T ratio = 4000/50) . If we in- crease the number of iterations of the second loop from 10 to 800, then both loops will have the same size but now we can obtain a speedup S a .- * 62, which is comparable to the first loop's speedup. We should also notice that, without distribution, the loop in Example 6 can be computed as an R<50,9> system. 
It yields about the same speedup (T ~ 27) as that obtained by distribution, but requires far more processors (see Theorem 2) to achieve the same speed. This demonstrates that loop distribution may not give better speedup but it could improve the efficiency in the use of processors. In summary, the distribution algorithm introduced in this section, resembling the distribution algorithm for the reduction of tree- height of any given arithmetic expression [ 8 ] , [10 ] > may introduce more parallelism into a program loop than that obtained from an undistributed one. The simpler the loop structure is, the more parallelism can be extracted by the loop distribution techniques. Since most loops in ordinary programs are quite simple, we believe a practical loop distribu- tion algorithm with reasonable complexity can be justified to be built into a real compiler and generate efficient object codes from ordinary programs for future parallel machines. 112 5.4 Time Bounds of Program Loops In section 2, we have given two crude bounds on computation time without any subscript information. As stated before, by performing subscript analysis we can obtain more refined bounds. In this section, we will discuss some of these bounds in terms of certain measurable loop parameters and the connectivity of its dependence graph. The graph is the product of subscript analysis or a computation pattern that we know a priori. For simplicity, we shall concentrate on basic loops and it is not difficult to generalize the discussion to non-basic ones. All parameters used are listed as follows. Definition 9 For a given basic loop L, we define 1) k = tt N. where In. I is the number of 3-1 J j iterations of loop index I . and d is the number 3 of nested loops. 2) s = the number of assignment statements in L. 3) E={e.|l,1.2, . . . ,k) is the number of iterations over which the h-th dependence line in D- passes from S. to S.. If there is only one such i 3 113 line, we omit the superscript (h) . The maximum q.. is denoted by q. ij With these parameters, any basic loop can be represented by a four-tuple, i.e., L(k,s,E,Q). For instance, the loop in Example 5 is L(1000,2, {4,2}, {q =10, q =100}) and in Example 6 is L(10,5, {2,2,2,2,2} {q ll =1 ' q 12 =0 ' q 24 = °' q 34 =1 ' q 43 =2 ' q 45 =0}) ' Now we will present various bounds based on this loop informa- tion. The simplest case is a loop whose dependence graph is empty, i.e., Q=cf>. This implies that LeR and hence from Equation (27) we obtain Theorem 8 Given any L(k,s,E,Q), if Q=cf) then T <_ 41og_e. For the set of loops with non-empty graphs, we can easily distinguish two cases, i.e., acyclic and cyclic graphs. The acyclic graphs can be readily recognized as either tree-connected or non-tree-connected. Formally, we define these as follows. Definition 10 An acyclic dependence graph is a dependence graph of s nodes, S. for 1 <^ i <^ s, with no pair (S.,S.) such that there exists a chain of arcs (dependence lines) from S. to S and vice versa. If there is only one chain of arcs between any pair of nodes, we call it tree-connected . Figure 14 shows all these cases. 114 (a) Acyclic & Tree- connected (b) Acyclic & Non- tree-connected (c) Cyclic Figure 14. Connectivity of Dependence Graph For acyclic graphs, it is easy to demonstrate that we can perform statement substitution between any pair of nodes which have a dependence relation: that is, we substitute for each left-hand-side variable of S. 
on the right-hand side of S., which is the cause of a i 3 dependence relation, the corresponding arithmetic expression on the right-hand side of S. with all subscript expressions shifted in ac- cordance with the q. . representing this relation. For example, given a loop s r DO S, I = 1, 10 A(I) = E(I-l) + F(I) B(I) = A(I) + A(I-l) , C(I) = B(I-l) + G(I) D(I) = A(I-2) + H(I) , D, 115 we have Q = {q| 2 , q{ 2 » q 2 3' q 14* " *°' 1 * 1 * 2 * * By *PP 1 y in 8 thij procedure, a set of four independent loops results: DO S| I - 1, 10 Sj: A(I) = E(I-l) + F(I) DO S^ I = 1, 10 S^: B(I) = E(I-l) + F(I) + E(I-2) + F(I-l) DO S' I = 1, 10 S^: C(I) = E(I-2) + F(I-l) + E(I-3) + F(I-2) + G(I) DO S' I = 1, 10 S^: D(I) = E(I-3) + F(I-2) + H(I) . And all of them can be computed in parallel as the dependence relations among them have been removed by this process . Evidently, T and p of each resultant statement S ' , 1 j< j <_ s, can be bounded by the number of atoms present in its final form. The growth of atoms in each resultant statement can be easily observed from the fact that if S has y., chains of arcs connected to S,, then S will contribute Yj.( e J~l) atoms to S| . Hence, in general the number of atoms in S|, 1 < j < s, will be the summation of e, over the contributions J - - j from all other nodes which have at least a chain of arcs connected to S . This is shown in the following example in which S! contains 116 e + (e -1) + 2(e 2 -l) + (e^-l) atoms © (e r l) X Now, we can state a general theorem for the class of loops with acyclic dependence graphs. Theorem 9 Given L(k,s,E,Q) with acyclic dependence graph. If statement substitution is allowed, then we have 1) for a loop with a tree-connected graph T <_ 41og 9 (se). 2) for a loop with a non-tree-connected graph T <_ 41og~(s'e) where Z-l s-1 s" = (s-£+2)r <_ 2r < r , & = maximum length (number of elements) of the chains of partially ordered set n as defined in section 3 (31 £ si . r = maximum number of chains from any element of ir 1 . Proof: For the tree-connected case , any S . , 1 <_ j <_ s , can have at most one chain of arcs from S., 1 < j < s and i ^ j. Hence, let e 1 1 ~ ~ J 117 be the number of atoms of S', then e! <^ e. + Z (e_.-l) 5 se 1=1 1=1 X Since all Si, 1 _f_ j £. S, can be evaluated in parallel, part (a) follows directly from Brent's argument. For the non-tree-connected case, there are at most r chains (r >_ 2) of arcs coming out of one node and the maximum length of these chains is l. Thus, the maximum contribution by S . to S! is i J £-1 SL-1 r e.(j< r e) . Let all (s-A+1) nodes contribute in the same manner, e! < e. + (s-£+l)r J! '" 1 e + (r £_2 +r £ ~ 3 +. . .+r)e J - J < (s-£+2)/ -1 e < 2r S_1 e < r S e . Therefore, part (b) is also proved. Q.E.D. Next, we study the case when statement substitution is not allowed. In this case, we can prove that Theorem 10 Given any L(k,s,E,Q) with acyclic dependence graph. If statement substitution is not allowed, then we have T < 4£log_e < 4slog_e . p — 2 - °2 118 Proof : Since tt has the maximum length of the chains i, then the elements in it can be partitioned into £ disjoint antichains (subsets of elements), and in each antichain there is no relation among any pair of its elements. This is a dual form of Dilworth's Theorem [27] • Hence, all decomposed loops, each of which represented by one element of a particular antichain, can be computed in parallel. Since the graph is acyclic, each of such loops has Q= with no more than one statement. 
Thus, by Theorem 8, the computation time of all loops represented in one antichain is no more than 41og~e, and for I antichains in total we have T £ 4Uog e. Q.E.D. Theorem 9 and Theorem 10 give two extreme bounds with and with- out statement substitution. Needless to say, that any combination of partial substitution and partial distribution will yield time bounds between these two results. Also note that Theorems 8-10 can be applied to bound the compu- tation time of a block of assignment statements, as it is a special case of a loop with k=l, q.. = {^,0} for all h and i < j, and q . = $ for i >^ j . Finally, let us examine those loops with cyclic dependence graphs. Recall from the previous section that any element tt_ . of tt- with multiple statements actually forms a maximal cycle , i.e., it con- tains all smaller cycles formed by its elements. Those statements 119 represented in a maximal cycle are coupled together by dependence relations and no statement substitution is possible. By treating such a cycle as a single loop, we can compute it as a recurrence system if it is linear. Hence, we can use all results in the previous chapters to bound the time and processors for these maximal cycles. In particular, we can state a general theorem for LeLR as follows. Theorem 11 Given any L(k,s,E,Q) with cyclic dependence graph, then if LeLR, we have T £ [(log 2 (cq+c-l) + 2) log 2 (ck)].mina,t) + 4Uog 2 e (31) < 1/2 log^Csk) + 3/21og 2 (sk) + 41og 2 e (32) where c = maximum number of nodes in one maximal cycle of D , which is equivalent to the maximum number of statements represented by one element of the partially ordered set it. (1 <_ c <_ s) . t = total number of maximal cycles in D.. . Proof : We first show it for c^s. In this case, as discussed in the proof of Theorem 10, the elements of tt. can be partitioned in I dis- joint antichains. Any decomposed loop, represented by one element of a particular antichain, contains no more than c statements. Hence, the maximum computation time of all loops represented in one antichain is equivalent to that of an R system with at most 120 m = c(q-l) + (c-1) + c = c(q+l) - 1 , plus the maximum time for preprocessing coefficients (41og_e) . Since there are at most min(£,t) maximal cycles in a chain, Equation (31) is proved by Theorem 2. For the case c=s, Equation (32) directly follows from Theorem 1. Q.E.D, In summary, we have given Theorems 8-11 to estimate time bounds (and hence speedups since T ~ ske for all cases) for a large class of program loops if their computation patterns are known or reconstructed by subscript analysis. These bounds are based on some gross parameters such as (s,k,e,q,£,c, t) which we used here. Within a specific class of problems and their computational programs, e.g., eigenvalue programs or linear system solvers, users usually have a very good idea of average computation patterns and typical values of these parameters inside their program loops. With these kinds of analytical bounds, they can have a general idea about the ultimate speedup range they can expect from their programs running on a multiprocessing system. This may also help the designer of computers for a given class of computations in knowing how well they are doing in comparison to these possible speedups with an arbitrary number of processors. 
Although we do not give time bounds for a fixed number of processors, due to their infinite number of variations, it is conceivable that by using some tables like that in Chapter 2 or eventually some analytically derived formulas, we can also obtain some of these results 121 for practical usages. Similarly, a careful redefinition of parameters and detailed studies of loop structures should be able to generalize our discussions here to non-basic loops, and further to loops with jump/ logic control informations. 5.5 Practical Implications As discussed in sections 3 and 4, a good compiler algorithm is desirable to decompose a given loop into a set of loops which can be evaluated efficiently for a fixed machine. The complexity of loop distribution algorithms will be decided by the levels of optimization we want and the given machine structure we have. For instance, we have various ways of constructing the dependence graph from a loop: 1) only analyze the input and output variable names of each assignment statement without checking of subscript expressions at all, or 2) Subscript analysis is performed but renaming procedure is not allowed. In this case, for a loop like DO S 2 1-1, N S 1 : A(I) = B(I) + B(I-l) S 2 : B(I) = C(I) + D(I+1) , we will have a dependence line from S« to S. , and an artificial line from S. to S. even though S- never updates S_ in the loop; otherwise, we may distribute and evaluate (I) (S.) and (I) (S.. ) sequentially. The latter will obviously result in erroneous solutions as the first atom B(I) in 122 S needs old values instead of the updated ones generated by this new execution order. Hence, the artificial line inserted will lock up these two statements from being distributed. This technique is used in TI ASC Fortran compiler [24]- On the other hand, if we are allowed to rename B in S and the second B in S 1 (as well as B(0) outside the loop) to a new variable name, e.g., B', then the distribution is possible. Techniques to create these new storages is reported in [10]. We also mentioned that statement substitution between two decomposed loops which have dependence relation may enable them to be evaluated in parallel. However, if the machine structure does not allow more than one array fetched at the same time, then the effect on speedup of statement substitution will be degraded or not preferred at all. Finally, let us look at one decomposed loop, e.g., DO 5 I = 1, N DO 5 J = 1, N 5 A(I,J) = A(I,J-1) + A(I-1,J) , which has two-dimensional recurrence relations and may be treated as a 2 very sparse R system with n=N and m=N. Although we can solve it by the previous general algorithms, for a system with very small p, the wave-front method in [10], [23] may also be attractive because of its simplicity. On the contrary, if the above statement is changed to 5 A(I,J) = A(I,J-1) + A(I-1, J+k) where k is a relatively large number, then the latter method will become very inefficient. 123 Hence, we can conclude that, to design an efficient system for executing program loops, a general compiler simulator can be established to distribute loops according to different machine con- figurations and various parallelism optimization levels. Each distributed loop can be assessed with computation time by the time bounds described in this section, or more practically by a general machine simulator to obtain the overall speedup of the original loop. 
Experiments of this kind on a large sample of ordinary programs will indicate which options will influence average speedup (or efficiency) in major ways. As a result of such studies, a practical loop distribu- tion algorithm of compiler with reasonable complexity can emerge for a given machine; and, in general, an efficient structure of the machine for ordinary programs will also be revealed. 124 6. CONCLUSION In this thesis, we have shown efficient algorithms for parallel evaluation of general linear recurrence systems and m-th order linear recurrence systems, with arbitrary or constant coefficients. Detailed results have been given in the introduction of each chapter and need not be repeated here. These algorithms could be used directly as a building block in forming parallel algorithms for standard numerical programs such as in [28 3, [29]. We can also use them in compilers. With the loop distribution algorithm discussed here, a compiler should be able to decompose a given loop into loops which can be computed as vector operations or recurrence relations. The latter case can be further transformed into vector operations (or matrix operations if the machine structure allows) by our algorithms without programmer intervention. All algorithms described in this thesis are primarily for conceptual understanding. A properly designed machine organization plus some practical modifications of the basic algorithms presented here for that particular system should provide us high efficiency computations of iterative programs. We conclude this thesis by giving the following several topics which are opened up by this research and deserve further studies: 1) Is there any other algorithm which can be used to evaluate general linear recurrence systems in O(log„n) time steps instead of 2 2 O(log-n) steps? Since only 0(n ) data elements are involved in matrix A and c, it follows from a fan-in argument that one might hope for a faster 125 scheme requiring only logarithmetic time. Otherwise, could we disprove this possibility by some abstract combinatorial argument? Similar questions can be raised for m-th order linear recurrence systems. 2) Transformation of various standard numerical algorithms into algorithms which use linear recurrences, such as the ones mentioned above and recently in [30], could be explored for future parallel algorithms. 3) Practical compiler algorithms for decomposition of loops with jump/logic statements and various indexing patterns should be studied in detail. 4) Numerical stability problems of these basic algorithms require more understanding, although it is expected that the error bounds will be within the range of serial computations. 126 LIST OF REFERENCES [1] D. J. Kuck, "Multioperation Machine Computational Complexity," Proceedings of Symposium on Complexity of Sequential and Parallel Numerical Algorithms , pp. 17-47, Academic Press, 1973. [2] P. Budnik and D. J. Kuck, "The Organization and Use of Parallel Memories," IEEE Transactions on Computers , Vol. C-20, pp. 1566-1569, December 1971. [3] D. H. Lawrie, "Memory Systems for Parallel Array Processors," Proceedings of the Eleventh Annual Allerton Conference on Circuit and System Theory , Allerton House, Monticello, 111., pp. 568-576, October 1973. [4] D. H. Lawrie, "Access Requirements and Design of Primary Memory for Array Processors," submitted for publication IEEE Transactions on Computers , (IEEE Repository No. R74-30) , 1974. [5] E. C. Russell, "Automatic Program Analysis," Ph.D. 
thesis, Department of Electrical Engineering, University of California, Los Angeles, Report No. 69-12, March 1969.

[6] D. E. Knuth, "An Empirical Study of FORTRAN Programs," Department of Computer Science, Stanford University, Report No. CS-186, 1970.

[7] D. Kuck, P. Budnik, S-C. Chen, E. Davis, Jr., J. Han, P. Kraska, D. Lawrie, Y. Muraoka, R. Strebendt and R. Towle, "Measurements of Parallelism in Ordinary FORTRAN Programs," Sagamore Computer Conference on Parallel Processing, Syracuse University, pp. 23-29, August 1973; also appears in IEEE Computer, pp. 37-46, January 1974.

[8] D. J. Kuck, Y. Muraoka and S. C. Chen, "On the Number of Operations Simultaneously Executable in FORTRAN-Like Programs and Their Resulting Speed-Up," IEEE Transactions on Computers, Vol. C-21, pp. 1293-1310, December 1972.

[9] R. M. Karp, R. E. Miller and S. Winograd, "The Organization of Computations for Uniform Recurrence Equations," Journal of the ACM, Vol. 14, No. 3, pp. 563-590, July 1967.

[10] Y. Muraoka, "Parallelism Exposure and Exploitation in Programs," Ph.D. thesis, University of Illinois at Urbana-Champaign, Department of Computer Science, Report No. 424, February 1971.

[11] P. M. Kogge and H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations," IEEE Transactions on Computers, Vol. C-22, No. 8, pp. 786-792, August 1973.

[12] D. Heller, "A Determinant Theorem with Applications to Parallel Algorithms," Carnegie-Mellon University, Department of Computer Science, March 1973; also "On the Efficient Computation of Recurrence Relations," Institute for Computer Applications in Science and Engineering (ICASE), June 1974.

[13] D. H. Lawrie, "Memory-Processor Connection Networks," Ph.D. thesis, University of Illinois at Urbana-Champaign, Department of Computer Science, Report No. 557, February 1973.

[14] S. F. Reddaway, "An Elementary Array with Processing and Storage Capabilities," International Workshop on Computer Architecture, Grenoble, France, June 1973.

[15] K. Maruyama, "On the Parallel Evaluation of Polynomials," IEEE Transactions on Computers, Vol. C-22, pp. 2-5, January 1973.

[16] D. J. Kuck and K. Maruyama, "Time Bounds on the Parallel Evaluation of Arithmetic Expressions," to appear in SIAM Journal on Computing.

[17] J. C. P. Miller and D. J. S. Brown, "An Algorithm for Evaluation of Remote Terms in a Linear Recurrence Relation," Computer Journal, Vol. 9, pp. 188-190, 1967.

[18] G. Estrin, "Organization of Computer Systems: the Fixed Plus Variable Structure Computer," Proceedings of the Western Joint Computer Conference, pp. 33-40, May 1960.

[19] W. S. Dorn, "Generalizations of Horner's Rule for Polynomial Evaluation," IBM Journal of Research and Development, No. 6, April 1962.

[20] M. C. Pease, "An Adaptation of the Fast Fourier Transform for Parallel Processing," Journal of the ACM, Vol. 15, No. 2, April 1968.

[21] W. G. Poole, Jr. and R. G. Voigt, "Numerical Algorithms for Parallel and Vector Computers: An Annotated Bibliography," Institute for Computer Applications in Science and Engineering (ICASE), 1974.

[22] J. L. Baer, "A Survey of Some Theoretical Aspects of Multiprocessing," Computing Surveys, Vol. 5, No. 1, March 1973.

[23] L. Lamport, "The Parallel Execution of DO Loops," Communications of the ACM, Vol. 17, No. 2, February 1974.

[24] W. L. Cohagan, "Vector Optimization for the ASC," Proceedings of the Seventh Annual Princeton Conference on Information Sciences and Systems, 1973.

[25] R. P.
Brent, "The Parallel Evaluation of General Arithmetic Expressions," Journal of the ACM , Vol. 21, No. 2, April 1974. [26] H. T. Kung, "New Algorithms and Lower Bounds for the Parallel Evaluation of Certain Rational Expressions," Proceedings of the Sixth Annual ACM Symposium on Theory of Computing , April 1974. [27] C. L. Liu, "Topics in Combinatorial Mathematics," Notes on Lectures, MAA Summer Seminar, Williams College, Massachusetts, 1972. [28] H. S. Stone, "An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations," Journal of the ACM , Vol. 20, No. 1, January 1973. [29] A. H. Sameh, S. C. Chen and D. J. Kuck, "Parallel Direct Poisson and Biharmonic Solvers," submitted for publication. [30] F. S. Acton, "Recurrence Relations for the Fresnel Integral and Similar Integrals," Communications of the ACM , Vol. 17, No. 8, August 1974. 129 VITA Shyh-ching Chen was born in Fukien, China, on February 1, 1944. He received the B.S. degree in electrical engineering from the National Taiwan University, Taipei, Taiwan, in 1966 and the M.S. degree in electrical engineering from Villanova University, Villanova, Pennsylvania, in 1972. From August 1967 to August 1969 he was with the Electronic Computing Center, National Taiwan University. Since September 1970 he has been a research assistant in the Department of Computer Science, University of Illinois, Urbana. He is a member of Sigma Xi, IEEE and ACM. IBLIOGRAPHIC DATA 1EET 1. Report No. UIUCDCS-R-75-694 3. Recipient's Accession No. Fillc and Subtitle PEEDUP OF ITERATIVE PROGRAMS IN MULTIPROCESSING SYSTEMS 5- Report Date January 1975 6. A ut hor(s ) hyh-ching Chen 8. Performing Organization Rept. No " UIUCDCS-R-75-694 Performing Organization Name and Address niversity of Illinois at Urbana-Champaign lepartment of Computer Science rbana, Illinois 61801 10. Project/Task/Work Unit No. 11. Contract /Grant No. US NSF GJ-36936 , Sponsoring Organization Name and Address ational Science Foundation ashington, D. C. 13. Type of Report & Period Covered Doctoral - 1975 14. Supplementary Notes . Abstracts .oop computations are the basic constructs which consume most of the computation time r. ordinary programs. For a multiprocessing (parallel or pipelined) system, the over- ill speedup which might be otherwise obtained will be seriously degraded if these :omputations are carried out in the sequential manner as they are presented. .ti this paper, we present algorithms to expose vector operations and recurrence re- ations which constitute a loop computation and, in turn, to transform recurrence elations into vector operations. For the latter case, time and processor bounds for [ifferent classes of problems and practical situations will be discussed. Bounds on ipeedups of different types of loop computations will also be given, so that machine lesigners or users could know how well they were doing in some absolute sense. . Key Words and Document Analysis. 17a. Descriptors Compiler General Linear Recurrences i-th Order Linear Recurrence lultiprocessing Systems )rdinary Program Loops 'arallel Computation 'rocessor Bounds Speedup Efficiencies rime Bounds .riangular Linear Systems b. Identifiers /Opcn-Ended Terms COSATI Field/Group Availability Statement ELEASE UNLIMITED 19. Security Class (This Report) UNTt.ASSIFIFD 20. Security Class (This Page UNCLASSIFIED 21. No. of Pages 136 22. Price 'M NTIS-35 ( 10-70) USCOMM-DC 40329-P7 1 % o % en CO *0» JVJ* Bom UNIVERStTY OF ILLINOISURBAnT »>0»4U«Rno COO? 