2. Linear Recurrence Systems

Definition 1

An R<n,m> system is the set of n equations

    x_i = a_i * x_{i-1}^T ,   1 <= i <= n                                (1)

where * denotes matrix multiplication, a_i = (a_{i1}, a_{i2}, ..., a_{im}, c_i) is the vector of coefficients, and x_{i-1} = (x_{i-1}, x_{i-2}, ..., x_{i-m}, 1) is the vector of variables, with the initial vector x_0 = (x_0, x_{-1}, ..., x_{-m+1}, 1) specified in advance. Furthermore, we will need a set of m+1 unit vectors

    e_j = (e_{j1}, e_{j2}, ..., e_{j,m+1}) ,  where e_{jk} = 1 if k = j and e_{jk} = 0 otherwise.     (2)

If a row (column) of any matrix is equal to a unit vector, it is called a trivial row (column).

Theorem 1

Any R<n,m> system can be written in the form

    x_i = b_i * x_0^T                                                    (3)

where, for all i, 1 <= i <= n,

    b_i = a_i * B_{i-1}                                                  (4)

B_0 is the identity matrix of order m+1, and B_i is the (m+1) x (m+1) matrix whose rows are b_i, b_{i-1}, ..., b_{i-m+1}, e_{m+1}.

Proof

(Basis): x_1 = a_1 * x_0^T = (a_1 * B_0) * x_0^T = b_1 * x_0^T, by def. of b_1 and B_0.

(Induction step): Assume x_i^T = B_i * x_0^T. Then

    x_{i+1} = a_{i+1} * x_i^T              by Def. 1
            = a_{i+1} * (B_i * x_0^T)      by hypothesis
            = (a_{i+1} * B_i) * x_0^T
            = b_{i+1} * x_0^T              by def. of b_{i+1} .

Theorem 1 shows that any set of consecutive variables {x_i | j <= i <= k} can be computed in parallel, that is, independently of each other, if the vectors of parallel coefficients b_i, j <= i <= k, are generated and substituted for the a_i, j <= i <= k. The computation of all b_i is the overhead that must be paid for parallelism. The following example should clarify this idea.
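Before turning to the example, the construction of Theorem 1 can be sketched in executable form. The following Python fragment (NumPy, random data; all names are illustrative) generates the parallel coefficients b_i of eq. (4) and checks that b_i * x_0^T reproduces the serially computed variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2
A = rng.standard_normal((n, m))        # row i holds (a_{i1}, ..., a_{im})
c = rng.standard_normal(n)
x0 = rng.standard_normal(m)            # (x_0, x_{-1}, ..., x_{-m+1})
x0_vec = np.append(x0, 1.0)            # the initial vector of eq. (1)

# Serial evaluation: x_i = c_i + sum_j a_{ij} x_{i-j}
last, xs = list(x0), []                # `last` holds previous values, newest first
for i in range(n):
    x_i = c[i] + sum(A[i, j] * last[j] for j in range(m))
    xs.append(x_i)
    last = [x_i] + last[:-1]

# Parallel coefficients: b_i = a_i * B_{i-1}, with B_0 the identity and
# B_i built from the rows b_i, b_{i-1}, ... and the unit vector e_{m+1}.
B, bs = np.eye(m + 1), []
for i in range(n):
    b_i = np.append(A[i], c[i]) @ B    # a_i = (a_{i1}, ..., a_{im}, c_i)
    bs.append(b_i)
    B = np.vstack([b_i, B[:m - 1], B[-1:]])

# Theorem 1: every x_i follows directly from the initial vector.
assert np.allclose([b @ x0_vec for b in bs], xs)
```

Once the b_i are known, the loop evaluating b @ x0_vec has no sequential dependence, which is exactly the parallelism the theorem buys.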
Let us consider the linear recurrence system R<9,2>. Using the standard notation of linear algebra, R<9,2> can be written as

    [   1                                                              ] [x_1]   [c_1]
    [ -a_21    1                                                       ] [x_2]   [c_2]
    [ -a_32  -a_31    1                                                ] [x_3]   [c_3]
    [        -a_42  -a_41    1                                         ] [x_4]   [c_4]
    [               -a_52  -a_51    1                                  ] [x_5] = [c_5]
    [                      -a_62  -a_61    1                           ] [x_6]   [c_6]
    [                             -a_72  -a_71    1                    ] [x_7]   [c_7]
    [                                    -a_82  -a_81    1             ] [x_8]   [c_8]
    [                                           -a_92  -a_91    1      ] [x_9]   [c_9]

(the terms involving the known initial values x_0 and x_{-1} are absorbed into c_1 and c_2). This system can be transformed using (4) into

    [   1                                                              ] [x_1]   [c_1]
    [ -a_21    1                                                       ] [x_2]   [c_2]
    [ -a_32  -a_31    1                                                ] [x_3]   [c_3]
    [        -a_42  -a_41    1                                         ] [x_4]   [c_4]
    [        -b_52  -b_51           1                                  ] [x_5] = [d_5]
    [        -b_62  -b_61                  1                           ] [x_6]   [d_6]
    [        -b_72  -b_71                         1                    ] [x_7]   [d_7]
    [        -b_82  -b_81                                1             ] [x_8]   [d_8]
    [                                           -a_92  -a_91    1      ] [x_9]   [c_9]

where

    [b_51, b_52, d_5] = [a_51, a_52, c_5] * | 1    0    0 |
                                            | 0    1    0 |
                                            | 0    0    1 |
                      = [a_51, a_52, c_5]

    [b_61, b_62, d_6] = [a_61, a_62, c_6] * | b_51 b_52 d_5 |
                                            | 1    0    0   |
                                            | 0    0    1   |
                      = [a_61 a_51 + a_62,  a_61 a_52,  a_61 c_5 + c_6]

    [b_71, b_72, d_7] = [a_71, a_72, c_7] * | b_61 b_62 d_6 |
                                            | b_51 b_52 d_5 |
                                            | 0    0    1   |
                      = [a_71(a_61 a_51 + a_62) + a_72 a_51,
                         a_71 a_61 a_52 + a_72 a_52,
                         a_71(a_61 c_5 + c_6) + a_72 c_5 + c_7]

    [b_81, b_82, d_8] = [a_81, a_82, c_8] * | b_71 b_72 d_7 |
                                            | b_61 b_62 d_6 |
                                            | 0    0    1   |
                      = [a_81(a_71(a_61 a_51 + a_62) + a_72 a_51) + a_82(a_61 a_51 + a_62),
                         a_81(a_71 a_61 a_52 + a_72 a_52) + a_82 a_61 a_52,
                         a_81(a_71(a_61 c_5 + c_6) + a_72 c_5 + c_7) + a_82(a_61 c_5 + c_6) + c_8]

The above coefficients can be obtained by direct substitution of x_5 into x_6; of x_5 and x_6 into x_7; and finally of x_5, x_6 and x_7 into x_8. The transformed matrix allows parallel computation of x_5, x_6, x_7 and x_8, since they no longer depend on each other. It might seem that nothing has been accomplished, since the computation of the b_i, 5 <= i <= 8, must be executed serially and requires more time than the original computation. However, the generation of the b_i, 5 <= i <= 8, is independent of any other subset of variables whose intersection with {x_i | 5 <= i <= 8} is empty; therefore the generation of the b_i's for any two disjoint subsets of variables can be performed concurrently.

Corollary 1

In any R<n,m> the computation of b_i = a_i * B_{i-1}, 1 <= i <= n, requires at most m(m+1) multiplications and (m-1)(m+1) + 1 = m^2 additions. Furthermore, the computation of x_i = b_i * x_0^T requires at most m multiplications and m additions.
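The expanded coefficients above can be checked by carrying out the substitution numerically. A small sketch (random data; all names are illustrative); note the a_81 c_7 term carried into d_8 through d_7:

```python
import numpy as np

rng = np.random.default_rng(1)
a51, a52, a61, a62, a71, a72, a81, a82 = rng.standard_normal(8)
c5, c6, c7, c8 = rng.standard_normal(4)

# Recursive generation of [b_i1, b_i2, d_i] inside the block {x_5,...,x_8};
# the block starts from the identity, i.e. x_4 and x_3 play the role of x_0.
B = np.eye(3)
b = {}
for i, (ai1, ai2, ci) in zip(range(5, 9),
                             [(a51, a52, c5), (a61, a62, c6),
                              (a71, a72, c7), (a81, a82, c8)]):
    b[i] = np.array([ai1, ai2, ci]) @ B
    B = np.vstack([b[i], B[0], [0.0, 0.0, 1.0]])   # rows b_i, b_{i-1}, e_3

# Closed forms quoted in the text.
assert np.allclose(b[6], [a61*a51 + a62, a61*a52, a61*c5 + c6])
assert np.allclose(b[8], [a81*(a71*(a61*a51 + a62) + a72*a51) + a82*(a61*a51 + a62),
                          a81*(a71*a61*a52 + a72*a52) + a82*a61*a52,
                          a81*(a71*(a61*c5 + c6) + a72*c5 + c7)
                              + a82*(a61*c5 + c6) + c8])
```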
Corollary 2

In any R<n,m> the computation of all b_i = a_i * B_{i-1}, 1 <= i <= n, requires

    K_m(n) = (n - (m+1)/2) m(m+1)      if n >= m+1
    K_m(n) = (1/2) n(n-1)(m+1)         if n <  m+1                       (5)

multiplications and

    K_a(n) = (n - (m+1)/2) m^2         if n >= m+1
    K_a(n) = (1/2) n(n-1) m            if n <  m+1                       (6)

additions. Furthermore, the total number of operations required is

    K(n) = K_m(n) + K_a(n) = (n - (m+1)/2)(2m^2 + m)   if n >= m+1
                           = (1/2) n(n-1)(2m+1)        if n <  m+1       (7)

Proof

For all i, 1 <= i <= m+1, B_{i-1} contains only i-1 nontrivial rows, so that b_i = a_i * B_{i-1} requires only (i-1)(m+1) multiplications. Therefore, for all n <= m+1,

    K_m(n) = sum_{i=1}^{n} (i-1)(m+1) = (m+1) n(n-1)/2 .                 (8)

On the other hand, for n >= m+1, K_m consists of two parts. Using (8), the number of multiplications needed to compute all b_i, 1 <= i <= m, is equal to (1/2)(m+1)m(m-1). Each consecutive b_i, m+1 <= i <= n, requires m(m+1) multiplications by Corollary 1. Therefore, for all n >= m+1,

    K_m(n) = (n-m) m(m+1) + (1/2)(m+1)m(m-1)
           = n m(m+1) - m^2(m+1) + (1/2)(m+1)m(m-1)
           = n m(m+1) - (1/2) m(m+1)^2 .

Each nontrivial row, not counting the first, requires m+1 additions, and each trivial row only one addition, under the assumption that the first row of B_{i-1} is nontrivial. If the first row is trivial, then i = 1 and B_{i-1} = B_0 is the identity matrix; no additions are required in the computation of b_1 = a_1 * B_0 = a_1. Therefore, for all n <= m+1,

    K_a(n) = sum_{i=2}^{n} [(i-2)(m+1) + (m+1) - (i-1)]
           = sum_{i=2}^{n} m(i-1)
           = (1/2) m n(n-1) .

Similarly, K_a(n), n >= m+1, consists of two parts, since each b_i, m+1 <= i <= n, requires m^2 additions by Corollary 1, and the computation of the remaining b_i, 1 <= i <= m, altogether requires (1/2) m^2(m-1) additions. Thus

    K_a(n) = (n-m) m^2 + (1/2) m^2(m-1)
           = n m^2 - (1/2) m^2(m+1) .

3. Parallel-Processor System

The model of a parallel computer (Fig. 1) that we will consider consists of Parallel Memory (PM), Parallel Processor (PP) and Control Unit (CU). PP has p Arithmetic Elements (AE). Each AE can perform any arithmetic operation in one time step (clock).
All AEs are assumed to perform the same operation on each time step (SIMD computer). It is assumed further that no time is required to communicate data between processors and memories, and that the storage arrangement of data is irrelevant.

Algorithm 1

Given a linear recurrence system R<n,m>, divide R<n,m> into ceil(n/p) subsystems R^(j)<p,m>, where 1 <= j <= ceil(n/p):

 1. begin
 2.   x_p^(0) := x_0 ;
 3.   for j := 1 until ceil(n/p) do
 4.   begin
 5.     B_0^(j) := identity matrix ;
 6.     for k := 1 until p do b_k^(j) := a_k^(j) * B_{k-1}^(j) ;
 7.   end
 8.   for j := 1 until ceil(n/p) do
 9.     for k := 1 until p do x_k^(j) := b_k^(j) * x_p^(j-1) ;
10. end

[Fig. 1. A model of a parallel computer]

An application of Algorithm 1 to the system R<28,2> is shown in Fig. 2; a parallel computer with four AEs is used. In general, a square with the inverse diagonal symbolizes the computation of b_i^(j) = a_i^(j) * B_{i-1}^(j), 1 <= i <= 4, 1 <= j <= 7. The input on the top of the square represents a_i and the output on the bottom b_i. The horizontal input represents the matrix B_{i-1}, whose rows are the vectors b_{i-1}, b_{i-2}, ..., e_{m+1}; the dots on the horizontal line indicate the vectors that are rows of B_{i-1}. The unit vector e_{m+1}, which is always the last row of any matrix B_{i-1}, is not connected, for clarity. The dot closest to the square represents the top row of B_{i-1}, that is b_{i-1}; the natural ordering of the rows in the matrix is established by proximity to the square. Similarly, the empty square represents the computation of x_i = b_i * x_p^T, with vertical input b_i, vertical output x_i, and horizontal input the vector x_p. Each of the four AEs performs the computation indicated by approximately 1/4 of the squares. AEs should be assigned to squares in a way that minimizes computational time and simplifies memory access and communication. The assignment imposed by Algorithm 1 for R<28,2> is shown in Table 1.
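A sequential simulation of Algorithm 1 makes the data flow of Fig. 2 concrete. The sketch below (Python with NumPy; it assumes p >= m, and all names are illustrative) computes the b_k^(j) of every subsystem independently, then evaluates each block from the tail of the previous one, and checks the result against plain serial evaluation of R<28,2>.

```python
import numpy as np
from math import ceil

def serial(A, c, x0):
    """Straightforward serial evaluation of x_i = a_i * x_{i-1}^T."""
    n, m = A.shape
    last, out = list(x0), []
    for i in range(n):
        v = c[i] + sum(A[i, j] * last[j] for j in range(m))
        out.append(v)
        last = [v] + last[:-1]
    return np.array(out)

def algorithm1(A, c, x0, p):
    """Sequential simulation of Algorithm 1 (sketch; assumes p >= m).
    Lines 3-7: the parallel coefficients of each subsystem are generated
    independently; this is the part the p AEs would execute concurrently.
    Lines 8-9: each block of p variables is then obtained directly from
    the last variables of the previous block."""
    n, m = A.shape
    blocks = []
    for j in range(ceil(n / p)):
        B = np.eye(m + 1)                          # B_0^(j)
        bs = []
        for k in range(j * p, min((j + 1) * p, n)):
            b = np.append(A[k], c[k]) @ B          # b_k^(j)
            bs.append(b)
            B = np.vstack([b, B[:m - 1], B[-1:]])
        blocks.append(bs)
    xs = np.empty(n)
    xprev = np.append(x0, 1.0)                     # x_p^(0) = x_0
    for j, bs in enumerate(blocks):
        for k, b in enumerate(bs):
            xs[j * p + k] = b @ xprev              # x_k^(j) = b_k^(j) * x_p^(j-1)
        xprev = np.append(xs[(j + 1) * p - m:(j + 1) * p][::-1], 1.0)
    return xs

rng = np.random.default_rng(2)
A = 0.5 * rng.standard_normal((28, 2))
c, x0 = rng.standard_normal(28), rng.standard_normal(2)
assert np.allclose(algorithm1(A, c, x0, p=4), serial(A, c, x0))
```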
Note that the upper part of Table 1 represents redundant computation not required in computation on a serial computer.

[Fig. 2. Application of Algorithm 1 to R<28,2> on four AEs]

[Table 1. AE assignment for R<28,2>. Upper part: each AE serially generates the parallel coefficients of ceil(7/4) = 2 of the seven subsystems. Lower part: the variables are evaluated in slices of four, subsystem by subsystem, with x_{4j-3}, ..., x_{4j} assigned to AE_4, ..., AE_1.]

Theorem 2

The number of time steps required to compute a linear recurrence system R<n,m> using a parallel computer with p AEs is

    T <= (n/p)((2m^2 + 3m) - ((m+1)/(2p))(2m^2 + m))   if p >= m+1
    T <= n((m+1) - (p - 2m + 1)/(2p))                  if p <  m+1

Proof

The proof is based on Algorithm 1. The solution consists of two parts, described by the two DO-loops in lines 6 and 9, respectively. First, using only one AE, all b_k^(j), 1 <= k <= p, are computed for some R^(j)<p,m>. Since there are p AEs, p subsystems can be computed at the same time; the same procedure (the DO-loop in line 6) is repeated ceil(n/p^2) times. Note that the DO-loop in line 3 can be considered a vector operation that is executed in parallel in slices of p elements each, while the DO-loop in line 6 is executed serially. Second, using all p AEs, all x_k^(j), 1 <= k <= p, are evaluated in parallel for each R^(j)<p,m>. This time the DO-loop in line 8 is executed serially and the one in line 9 as a vector operation in slices of size p.
The number T' of time steps required to compute the DO-loop in line 6 using one AE is given by Corollary 2:

    T' = (p - (m+1)/2)(2m^2 + m)    if p >= m+1
    T' = (1/2) p(p-1)(2m+1)         if p <  m+1

Each x_k^(j) = b_k^(j) * x_p^(j-1) takes T'' = 2m time steps (m multiplications and m additions). Therefore

    T = (n/p^2) T' + (n/p) T'' .

For p >= m+1,

    T = (n/p^2)(p - (m+1)/2)(2m^2 + m) + (n/p) 2m
      = (n/p)((2m^2 + 3m) - ((m+1)/(2p))(2m^2 + m)) ,

and for p < m+1,

    T = (n/p^2)(1/2) p(p-1)(2m+1) + (n/p) 2m
      = n((m+1) - (p - 2m + 1)/(2p)) .

The computational time T given by Theorem 2 can be improved if the restriction of the SIMD model is removed, that is, the assumption that all AEs perform the same operation at the same time. In the MIMD model that we shall consider, (p-1) AEs still perform the same function at the same time, while the one remaining AE may not. The idea can be easily explained using Fig. 2. The system R^(1)<4,2> does not require the computation of the parallel coefficients b_1, b_2, b_3 and b_4: since the initial vector x_0 is known, the variables x_1, x_2, x_3 and x_4 can be computed directly from x_0, that is, x_i = a_i * x_{i-1}^T, 1 <= i <= 4. Furthermore, the number of time steps needed to do so, 2pm = 16, is less than the time needed to compute all parallel coefficients b_i, 1 <= i <= 4, of any R^(j)<4,2>, 2 <= j <= 7. Since the time to compute the parallel coefficients is (p - (m+1)/2)(2m^2 + m) = 25, the system R^(1)<4,2> can be enlarged to R^(1)<6,2> and still be solved in 24 time steps using only one AE. This is shown in Fig. 3.

[Fig. 3. The MIMD schedule for R<28,2>: the first six variables are computed serially while the remaining AEs generate parallel coefficients]

The same trick can be employed after every p(p-1) + r variables, where r is the size of the first subsystem R^(i,1)<r,m>. At that moment the values of x_17 and x_18 are known, and there is no need to compute the parallel coefficients b_19, b_20, b_21, b_22, b_23 and b_24.
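The step counts used in the proofs can be checked mechanically. The following sketch encodes Corollary 2, the two phases of Algorithm 1 from the proof of Theorem 2, and the threshold r, as reconstructed here; the equalities hold when p^2 divides n.

```python
from math import ceil, floor

def K(q, m):
    """Corollary 2, eq. (7): total operations to generate b_1..b_q serially."""
    if q >= m + 1:
        return (q - (m + 1) / 2) * (2 * m * m + m)
    return q * (q - 1) * (2 * m + 1) / 2

def T_alg1(n, m, p):
    """Phase counts from the proof of Theorem 2: the coefficient loop is
    repeated ceil(n/p^2) times at K(p, m) steps each; the evaluation loop
    runs ceil(n/p) times at 2m steps per slice of p variables."""
    return ceil(n / p ** 2) * K(p, m) + ceil(n / p) * 2 * m

def T_bound(n, m, p):
    """Closed-form bound of Theorem 2 (exact when p^2 divides n)."""
    if p >= m + 1:
        return (n / p) * ((2 * m * m + 3 * m) - (m + 1) / (2 * p) * (2 * m * m + m))
    return n * ((m + 1) - (p - 2 * m + 1) / (2 * p))

def r_threshold(p, m):
    """Size r of the serially solved first subsystem (used by Algorithm 2)."""
    if p >= m + 1:
        return floor((p - (m + 1) / 2) * (2 * m + 1) / 2)
    return floor(p * (p - 1) * (2 * m + 1) / (4 * m))

assert T_alg1(32, 2, 4) == T_bound(32, 2, 4) == 82
assert T_alg1(8, 3, 2) == T_bound(8, 3, 2) == 38
assert r_threshold(4, 2) == 6     # the R<6,2> block solved serially in the text
```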
Algorithm 2

Given a linear recurrence system R<n,m>, divide R<n,m> into subsystems R^(i)<p(p-1)+r, m>, 1 <= i <= ceil(n/(p(p-1)+r)). Divide furthermore each R^(i) into p subsystems: R^(i,1)<r,m>, where r = floor((p - (m+1)/2)(2m+1)/2), and R^(i,j)<p,m>, 2 <= j <= p.

 1. begin
 2.   x_p^(0,p) := x_0 ;
 3.   for i := 1 until ceil(n/(p(p-1) + r)) do
 4.   begin
 5.     x_0^(i,1) := x_p^(i-1,p) ;
 6.     for k := 1 until r do x_k^(i,1) := a_k^(i,1) * x_{k-1}^(i,1) ;
 7.     for j := 2 until p do
 8.     begin
 9.       B_0^(i,j) := identity matrix ;
10.       for k := 1 until p do b_k^(i,j) := a_k^(i,j) * B_{k-1}^(i,j) ;
11.     end
12.     for k := 1 until p do x_k^(i,2) := b_k^(i,2) * x_r^(i,1) ;
13.     for j := 3 until p do
14.       for k := 1 until p do x_k^(i,j) := b_k^(i,j) * x_p^(i,j-1) ;
15.   end
16. end

Theorem 3

The number of time steps required to compute a linear recurrence system R<n,m> using a MIMD computer with p AEs is

    T <= (n/(p + m + 1/2))(2m^2 + 3m)    if p >= m+1
    T <= (n/(2p))(p(2m+1) + 4m)          if p <  m+1

Proof

The proof is based on Algorithm 2. A recurrence system R<n,m> is divided into systems R^(i)<p(p-1)+r, m>, 1 <= i <= ceil(n/(p(p-1)+r)). Each R^(i) is further subdivided into p subsystems: R^(i,1)<r,m> and R^(i,j)<p,m>, 2 <= j <= p.
With p AEs in a MIMD computer, one AE is used to solve R^(i,1)<r,m> in 2rm time steps; its variables are computed sequentially, x_k^(i,1) = a_k^(i,1) * x_{k-1}^(i,1). Each of the remaining (p-1) processors is used to compute the parallel coefficients b_k^(i,j) = a_k^(i,j) * B_{k-1}^(i,j), 1 <= k <= p, of some subsystem R^(i,j)<p,m>. The number of time steps required for this is (p - (m+1)/2)(2m^2 + m) if p >= m+1, or (1/2) p(p-1)(2m+1) if p < m+1, by Corollary 2. Requiring that the solution of R^(i,1)<r,m> take at most as much time as the computation of all parallel coefficients of any R^(i,j)<p,m>, it is easy to find the value of r:

    r = floor((p - (m+1)/2)(2m+1)/2)    if p >= m+1
    r = floor(p(p-1)(2m+1)/(4m))        if p <  m+1

After all parallel coefficients for all subsystems have been computed, all p processors are used to compute x_k^(i,2) = b_k^(i,2) * x_r^(i,1), and consecutively x_k^(i,j) = b_k^(i,j) * x_p^(i,j-1) for all j, 3 <= j <= p.
Each slice of p variables takes 2m time steps, so each block R^(i)<p(p-1)+r, m> is completed in 2rm + 2m(p-1) time steps. Hence

    T <= ceil(n/(p(p-1)+r)) (2rm + 2m(p-1)) ,

which, after substituting r, yields

    T <= (n/(p + m + 1/2))(2m^2 + 3m)    if p >= m+1
    T <= (n/(2p))(p(2m+1) + 4m)          if p <  m+1 .

4. Pipelined-Processor System

A pipelined-processor model consists of Main Memory (MM), Pipelined Processor (PP) and Control Unit (CU). PP may consist of several pipelined functional units. In our model we shall assume only two functional units, a multiplication pipe and an addition pipe, each with s stages. The two functional units are connected serially, with the multiplication pipe feeding the addition pipe through register R1. The results from the addition pipe are either stored in MM or fed back into the addition pipe through register R2 (Fig. 4).

[Fig. 4. A model of a pipelined computer]

Algorithm 3

Given a linear recurrence system R<n,m>, divide R<n,m> into ceil(n/p) subsystems R^(j)<p,m>, where 1 <= j <= ceil(n/p). Execute the following algorithm:

 1. begin
 2.   for i := 0 until p^2 - 1 do a_{-i} := null ;
 3.   for i := 0 until p^2 - 1 do a_{n+i+1} := null ;
 4.   for j := 1 until ceil(n/p) + p do
 5.   begin
 6.     B_0^(j) := identity matrix ;
 7.     for k := 1 until p do b_k^(j-k+1) := a_k^(j-k+1) * B_{k-1}^(j-k+1) ;
 8.     for k := 1 until p do x_{p-k+1}^(j-p) := b_{p-k+1}^(j-p) * x_p^(j-p-1) ;
 9.   end
10. end

To point out the difference between Algorithms 1 and 3, the system R<28,2> is redrawn in Fig. 5. The order of fetching from MM and of computing coefficients and variables in PP is indicated by dotted arrows.

[Fig. 5. Order of computation of coefficients and variables for R<28,2> in the pipelined model]

The computation of a coefficient b_i requires a_i, b_{i-1} and b_{i-2} to be fetched from MM and entered into PP. It is impossible to proceed immediately with the computation of b_{i+1}, since b_i becomes available only after a certain number of time steps. Therefore, in order to keep the pipelined functional units busy, Algorithm 3 interleaves the coefficient and variable computations of several subsystems, so that by the time a newly generated coefficient is needed again it is available from MM or at the output of PP.
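The interleaving performed by Algorithm 3 is easiest to see by generating the schedule of subsystem indices it implies. A sketch (the tuple layout is illustrative; padded subsystems outside 1..ceil(n/p) are simply skipped):

```python
from math import ceil

def schedule(n, p):
    """Interleaved order of Algorithm 3 (sketch): at outer step j the
    coefficient b_k of subsystem j-k+1 is generated for k = 1..p, then
    the variables of subsystem j-p are evaluated, newest index first.
    Subsystem indices outside 1..ceil(n/p) correspond to the null
    padding of lines 2-3 and are dropped here."""
    J = ceil(n / p)
    order = []
    for j in range(1, J + p + 1):
        for k in range(1, p + 1):
            if 1 <= j - k + 1 <= J:
                order.append(("b", k, j - k + 1))
        if 1 <= j - p <= J:
            for k in range(1, p + 1):
                order.append(("x", p - k + 1, j - p))
    return order

sched = schedule(28, 4)
# Every coefficient and every variable of the 7 subsystems appears once,
assert len([t for t in sched if t[0] == "b"]) == 28
assert len([t for t in sched if t[0] == "x"]) == 28
# and the last coefficient of a subsystem precedes its variables.
assert sched.index(("x", 4, 1)) > sched.index(("b", 4, 1))
```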
Theorem 4

Given an integer p > 0, any linear recurrence system R<n,m> can be solved in

    T_s <= (n/p + p)(m + 2)

time steps using pipelined functional units with s = (m+2)p - (m+1) stages.

Proof

The proof is based on Algorithm 3. The system R<n,m> is divided into subsystems R^(j)<p,m>, 1 <= j <= ceil(n/p). To keep the pipelined units busy almost all the time, p + 1 subsystems are solved concurrently; that is, different subsystems use different stages of the pipelined units. Each system R^(j)<p,m> is solved in two parts: b_k^(j) = a_k^(j) * B_{k-1}^(j), 1 <= k <= p, is computed first, and x_{p-k+1}^(j) = b_{p-k+1}^(j) * x_p^(j-1), 1 <= k <= p, is evaluated afterwards. With p + 1 systems computed concurrently, the order of computation is as follows:

    ..., b_1^(j), b_2^(j-1), ..., b_p^(j-p+1), x_p^(j-p), x_{p-1}^(j-p), ..., x_1^(j-p),
         b_1^(j+1), b_2^(j), ..., b_p^(j-p+2), x_p^(j-p+1), x_{p-1}^(j-p+1), ..., x_1^(j-p+1), ...

Each b_i is a vector containing m+1 elements, and each element of b_i is computed as a sum of products (sop); similarly, each x_i is one sop. Between the moment a value is generated and the moment it is used again, p-1 b-vectors and p variables are in flight, that is, N = (m+1)(p-1) + p = (m+2)p - (m+1) sops altogether. Each sop has at most m+1 products, so the multiplication and addition pipes are used m+1 times for each sop. The bottleneck of the pipelined processor is obviously the addition pipe, since a new product can be added to the corresponding partial sum only one addition-pipe step after the previous one. Therefore m+1 addition steps and one multiplication step are needed to obtain the p solutions of R^(j)<p,m>, and the total time needed to compute R<n,m> is less than (n/p + p)(m + 2) pipeline-processor steps. The number of stages s in the addition pipe must be equal to N: this way, the partial sum for some sop and the new product to be added to it are generated at the same time by the addition and multiplication pipes, respectively, and are ready to be entered into the addition pipe again.
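The trade-off in Theorem 4, where a larger p shortens the n/p term but lengthens the p term, can be probed numerically under the theorem's cost model:

```python
from math import isqrt

def T_pipe(n, m, p):
    """Theorem 4 bound: (n/p + p)(m+2) pipeline steps, assuming the
    functional units have s = (m+2)p - (m+1) stages."""
    return (n / p + p) * (m + 2)

n, m = 400, 2
best = min(range(1, n + 1), key=lambda p: T_pipe(n, m, p))
assert best == isqrt(n) == 20          # the minimum sits at p = sqrt(n)
assert T_pipe(n, m, best) == (400 / 20 + 20) * 4 == 160
```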
It is worth noting the desirability of a large p <= sqrt(n) (the function (n/p + p)(m+2) has a minimum at p = sqrt(n)). However, a large p requires a large number of stages in the pipelined functional units. Since the number of gate levels in an implementation of floating-point multiplication or addition is fixed, the number of stages s will hardly exceed 10-15; in present-day machines s is between 2 and 8. Furthermore, an increased s increases the cost of the functional units, since extra latches or registers have to be added, which in turn introduces extra delay, reducing the effect of pipelining. The problem of finding the optimal number of stages s for a given p was answered by Theorem 4. In reality, pipelined processors have a fixed s, and the problem is to find the minimal T_s. Without loss of generality, we shall assume that all functional units have the same number of stages.

Corollary 3

Given a pipelined processor with s-stage functional units, any R<n,m> can be computed in

    T_s <= min[ (n/p' + p')(m+2) , (((m+2)p'' - (m+1))/s)(n/p'' + p'')(m+2) ]

where p' = floor((s + m + 1)/(m + 2)) and p'' = ceil((s + m + 1)/(m + 2)).

Proof

For a given p, N = (m+2)p - (m+1). If s = N, all stages are busy performing some computation. If s > N, exactly s - N stages idle, waiting for the partial sums to become available at the output of the addition pipe; the time required to compute R<n,m> is still the same, (n/p + p)(m+2), but the effectiveness, the fraction N/s of busy stages, has decreased. On the other hand, if s < N there is no idling, but N/s times as many pipeline time steps are required to compute R<n,m>. Consequently, either s >= N (p <= (s+m+1)/(m+2)) or s < N (p >= (s+m+1)/(m+2)). In either case a p that minimizes T_s is sought. In the former case the function (n/p + p)(m+2) is monotonically decreasing for positive p, p <= sqrt(n), with the minimum at p = sqrt(n); assuming p << n, the minimum time is obtained for the largest p such that p <= (s+m+1)/(m+2).
When s < N, the function (N/s)(n/p + p)(m+2) is monotonically increasing for positive p, and therefore the smallest p, p >= (s+m+1)/(m+2), requires the minimum time to compute R<n,m>.

5. Conclusion

The results of this paper are collected in Table 2. The exact bounds have been approximated as closely as possible by short, compact expressions.

[Table 2. Time, speedup and effectiveness bounds for the SIMD, MIMD and pipelined models]

The advantage of MIMD over SIMD systems is negligible for large p and small m, and the speedup of a MIMD machine over an SIMD machine increases by less than 1 for small p; for example, for p = 2, m = 1 the speedups are 7/5 and 4/5. However, for small p the advantage of a parallel computer over a single-processor system is not earthshaking. The pipelined computer seems to have the advantage for medium to large p and moderate m (approximately m >= 4). However, if the performance-cost ratio is considered, the pipelined computer has the advantage even for small m (m = 1, 2, 3). Let us assume that the cost of an AE is 1; then the cost of pipelining the same AE is 1.2-2.0, depending on the number of stages, the extra cost being incurred by the additional hardware for storing partial results at each stage. The ratios of speedup to cost for the parallel and the pipelined computer are then approximately

    S_p/C_p = (2/(2m+3))(1 + m/p + 1/(2p))    and    S_s/C_s = s/(m + 4 + 5/m) ,

where C_p = p and C_s = 2 are the costs of parallel and pipelined arithmetic. It should be noted that the extra delay introduced by the writing and reading of registers at each stage has been neglected; that alone may degrade the performance of the pipelined computer by 10%-50%, depending on the number of stages.

The advantage of the pipelined computer can be easily explained by an inadequacy of the parallel model. It was assumed that each AE is capable of executing one operation in one time step. However, hardware for multiplication and addition has only one logic unit in common: the mantissa adder.
Sharing it between multiplication and addition may cost more than duplicating it. Therefore, instead of having roughly one-half of the AE logic idle during each operation, an AE could be designed to execute one multiplication and one addition simultaneously. This way the speedup of the parallel model would double, providing better performance than the pipelined model for all p and m. Furthermore, the performance of the parallel computer can be increased by increasing p. As mentioned earlier, this is not possible with the pipelined computer, since there is only a finite number of stages that the multiplication or addition logic will tolerate.

A disadvantage of the parallel model is the access of Parallel Memory, which has p input and p output ports. If it is thought of as a two-dimensional array of words, it must allow parallel access to any p elements in any row or column. This can easily be seen from Table 1. For example, to compute b_1, b_5, b_9 and b_13 using four AEs, the PM must deliver a_1, a_5, a_9 and a_13 simultaneously; after the computation, b_1, b_5, b_9 and b_13 are stored in place of a_1, a_5, a_9 and a_13. Assuming four memory modules, the upper half of Table 1 can be used as the storage assignment for each memory module. However, when computing x_1, x_2, x_3 and x_4, the PM must deliver simultaneously b_1, b_2, b_3 and b_4, which are unfortunately all in the same memory module. Therefore, a memory scheme with skewed storage and input and output alignment networks is required, further increasing the cost of a parallel system.

It should be mentioned, at the end, that real parallel and pipelined computers have not been designed for solving linear recurrences, and therefore the mapping of the algorithms presented in this paper onto the available machines may present difficulty and lead to execution times above the bounds given in Table 2.
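The module conflict described above can be made concrete. In the sketch below (illustrative assignments, not the paper's exact scheme), the straightforward one-module-per-subsystem layout is conflict-free for the coefficient phase but serializes the evaluation phase, while a skewed layout is conflict-free for both:

```python
p = 4

def pos(i):
    """b_i belongs to subsystem j at position k (both 0-based)."""
    return divmod(i - 1, p)

def straight(i):
    """One memory module per subsystem (the layout of Table 1)."""
    return pos(i)[0] % p

def skewed(i):
    """Skewed layout: module index is (subsystem + position) mod p."""
    j, k = pos(i)
    return (j + k) % p

coeff_slice = (1, 5, 9, 13)   # b's produced together in the coefficient phase
eval_slice = (1, 2, 3, 4)     # b's fetched together in the evaluation phase

assert len({straight(i) for i in coeff_slice}) == 4   # conflict-free
assert len({straight(i) for i in eval_slice}) == 1    # all in one module!
assert len({skewed(i) for i in coeff_slice}) == 4     # conflict-free
assert len({skewed(i) for i in eval_slice}) == 4      # conflict-free
```

Skewing alone is not enough in hardware, which is why the text also calls for input and output alignment networks between the modules and the AEs.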
Acknowledgment

I would like to thank Professors Kuck and Sameh for helpful discussions during the development of this paper.

UIUCDCS-R-78-953
AN ALGORITHM FOR SOLVING LINEAR RECURRENCE SYSTEMS ON PARALLEL AND PIPELINED MACHINES
Daniel D. Gajski
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
December 1978. Supported in part by the National Science Foundation under grant US NSF MCS77-27910.

Abstract. An algorithm for the solution of linear recurrence systems on parallel or pipelined computers is described. Bounds on speedup and effectiveness for SIMD and MIMD machines with p processing elements, as well as for pipelined computers with a fixed number of stages per operation, are obtained. The model of each computer is discussed in detail to explain the better performance of the pipelined model. A suggestion is made for the design of processing elements for recurrence solvers.

Key words: linear recurrences; triangular system solvers; parallel evaluation; parallel processors; pipelined processors; complexity of algorithms; computer organization.