2. Linear Recurrence Systems

Definition 1

An R<n,m> system is the set of n equations

    x_i = a_i * x_{i-1}^T ,   1 <= i <= n                                (1)

where * denotes matrix multiplication, a_i = (a_{i1}, a_{i2}, ..., a_{im}, c_i) is the vector of coefficients, and x_{i-1} = (x_{i-1}, x_{i-2}, ..., x_{i-m}, 1) is the vector of variables, with the initial vector x_0 = (x_0, x_{-1}, ..., x_{-m+1}, 1) specified in advance. Furthermore, we will need a set of m+1 unit vectors

    e_j = (e_{j1}, e_{j2}, ..., e_{j,m+1}) ,  where e_{jk} = 1 if k = j and e_{jk} = 0 otherwise.     (2)

If a row (column) of any matrix is equal to a unit vector, it is called a trivial row (column).

Theorem 1

Any R<n,m> system can be written in the form

    x_i = b_i * x_0^T                                                    (3)

where, for all i, 1 <= i <= n,

    b_i = a_i * B_{i-1}                                                  (4)

B_0 is the identity matrix of order m+1, and B_i is the (m+1) x (m+1) matrix whose rows are b_i, b_{i-1}, ..., b_{i-m+1}, e_{m+1}.

Proof

(Basis): x_1 = a_1 * x_0^T = (a_1 * B_0) * x_0^T = b_1 * x_0^T, by def. of b_1 and B_0.

(Induction step): Assume x_i^T = B_i * x_0^T. Then

    x_{i+1} = a_{i+1} * x_i^T              by Def. 1
            = a_{i+1} * (B_i * x_0^T)      by hypothesis
            = (a_{i+1} * B_i) * x_0^T
            = b_{i+1} * x_0^T              by def. of b_{i+1} .

Theorem 1 shows that any set of consecutive variables {x_i | j <= i <= k} can be computed in parallel, that is, independently of each other, if the vectors of parallel coefficients b_i, j <= i <= k, are generated and substituted for the a_i, j <= i <= k. The computation of all b_i is the overhead that must be paid for parallelism. The following example should clarify this idea.
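Before turning to the example, the construction of Theorem 1 can be sketched in executable form. The following Python fragment (NumPy, random data; all names are illustrative) generates the parallel coefficients b_i of eq. (4) and checks that b_i * x_0^T reproduces the serially computed variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2
A = rng.standard_normal((n, m))        # row i holds (a_{i1}, ..., a_{im})
c = rng.standard_normal(n)
x0 = rng.standard_normal(m)            # (x_0, x_{-1}, ..., x_{-m+1})
x0_vec = np.append(x0, 1.0)            # the initial vector of eq. (1)

# Serial evaluation: x_i = c_i + sum_j a_{ij} x_{i-j}
last, xs = list(x0), []                # `last` holds previous values, newest first
for i in range(n):
    x_i = c[i] + sum(A[i, j] * last[j] for j in range(m))
    xs.append(x_i)
    last = [x_i] + last[:-1]

# Parallel coefficients: b_i = a_i * B_{i-1}, with B_0 the identity and
# B_i built from the rows b_i, b_{i-1}, ... and the unit vector e_{m+1}.
B, bs = np.eye(m + 1), []
for i in range(n):
    b_i = np.append(A[i], c[i]) @ B    # a_i = (a_{i1}, ..., a_{im}, c_i)
    bs.append(b_i)
    B = np.vstack([b_i, B[:m - 1], B[-1:]])

# Theorem 1: every x_i follows directly from the initial vector.
assert np.allclose([b @ x0_vec for b in bs], xs)
```

Once the b_i are known, the loop evaluating b @ x0_vec has no sequential dependence, which is exactly the parallelism the theorem buys.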
Let us consider the linear recurrence system R<9,2>. Using the standard notation of linear algebra, R<9,2> can be written as

    [   1                                                              ] [x_1]   [c_1]
    [ -a_21    1                                                       ] [x_2]   [c_2]
    [ -a_32  -a_31    1                                                ] [x_3]   [c_3]
    [        -a_42  -a_41    1                                         ] [x_4]   [c_4]
    [               -a_52  -a_51    1                                  ] [x_5] = [c_5]
    [                      -a_62  -a_61    1                           ] [x_6]   [c_6]
    [                             -a_72  -a_71    1                    ] [x_7]   [c_7]
    [                                    -a_82  -a_81    1             ] [x_8]   [c_8]
    [                                           -a_92  -a_91    1      ] [x_9]   [c_9]

(the terms involving the known initial values x_0 and x_{-1} are absorbed into c_1 and c_2). This system can be transformed using (4) into

    [   1                                                              ] [x_1]   [c_1]
    [ -a_21    1                                                       ] [x_2]   [c_2]
    [ -a_32  -a_31    1                                                ] [x_3]   [c_3]
    [        -a_42  -a_41    1                                         ] [x_4]   [c_4]
    [        -b_52  -b_51           1                                  ] [x_5] = [d_5]
    [        -b_62  -b_61                  1                           ] [x_6]   [d_6]
    [        -b_72  -b_71                         1                    ] [x_7]   [d_7]
    [        -b_82  -b_81                                1             ] [x_8]   [d_8]
    [                                           -a_92  -a_91    1      ] [x_9]   [c_9]

where

    [b_51, b_52, d_5] = [a_51, a_52, c_5] * | 1    0    0 |
                                            | 0    1    0 |
                                            | 0    0    1 |
                      = [a_51, a_52, c_5]

    [b_61, b_62, d_6] = [a_61, a_62, c_6] * | b_51 b_52 d_5 |
                                            | 1    0    0   |
                                            | 0    0    1   |
                      = [a_61 a_51 + a_62,  a_61 a_52,  a_61 c_5 + c_6]

    [b_71, b_72, d_7] = [a_71, a_72, c_7] * | b_61 b_62 d_6 |
                                            | b_51 b_52 d_5 |
                                            | 0    0    1   |
                      = [a_71(a_61 a_51 + a_62) + a_72 a_51,
                         a_71 a_61 a_52 + a_72 a_52,
                         a_71(a_61 c_5 + c_6) + a_72 c_5 + c_7]

    [b_81, b_82, d_8] = [a_81, a_82, c_8] * | b_71 b_72 d_7 |
                                            | b_61 b_62 d_6 |
                                            | 0    0    1   |
                      = [a_81(a_71(a_61 a_51 + a_62) + a_72 a_51) + a_82(a_61 a_51 + a_62),
                         a_81(a_71 a_61 a_52 + a_72 a_52) + a_82 a_61 a_52,
                         a_81(a_71(a_61 c_5 + c_6) + a_72 c_5 + c_7) + a_82(a_61 c_5 + c_6) + c_8]

The above coefficients can be obtained by direct substitution of x_5 into x_6; of x_5 and x_6 into x_7; and finally of x_5, x_6 and x_7 into x_8. The transformed matrix allows parallel computation of x_5, x_6, x_7 and x_8, since they no longer depend on each other. It might seem that nothing has been accomplished, since the computation of the b_i, 5 <= i <= 8, must be executed serially and requires more time than the original computation. However, the generation of the b_i, 5 <= i <= 8, is independent of any other subset of variables whose intersection with {x_i | 5 <= i <= 8} is empty; therefore the generation of the b_i's for any two disjoint subsets of variables can be performed concurrently.

Corollary 1

In any R<n,m> the computation of b_i = a_i * B_{i-1}, 1 <= i <= n, requires at most m(m+1) multiplications and (m-1)(m+1) + 1 = m^2 additions. Furthermore, the computation of x_i = b_i * x_0^T requires at most m multiplications and m additions.
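The expanded coefficients above can be checked by carrying out the substitution numerically. A small sketch (random data; all names are illustrative); note the a_81 c_7 term carried into d_8 through d_7:

```python
import numpy as np

rng = np.random.default_rng(1)
a51, a52, a61, a62, a71, a72, a81, a82 = rng.standard_normal(8)
c5, c6, c7, c8 = rng.standard_normal(4)

# Recursive generation of [b_i1, b_i2, d_i] inside the block {x_5,...,x_8};
# the block starts from the identity, i.e. x_4 and x_3 play the role of x_0.
B = np.eye(3)
b = {}
for i, (ai1, ai2, ci) in zip(range(5, 9),
                             [(a51, a52, c5), (a61, a62, c6),
                              (a71, a72, c7), (a81, a82, c8)]):
    b[i] = np.array([ai1, ai2, ci]) @ B
    B = np.vstack([b[i], B[0], [0.0, 0.0, 1.0]])   # rows b_i, b_{i-1}, e_3

# Closed forms quoted in the text.
assert np.allclose(b[6], [a61*a51 + a62, a61*a52, a61*c5 + c6])
assert np.allclose(b[8], [a81*(a71*(a61*a51 + a62) + a72*a51) + a82*(a61*a51 + a62),
                          a81*(a71*a61*a52 + a72*a52) + a82*a61*a52,
                          a81*(a71*(a61*c5 + c6) + a72*c5 + c7)
                              + a82*(a61*c5 + c6) + c8])
```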
Corollary 2

In any R<n,m> the computation of all b_i = a_i * B_{i-1}, 1 <= i <= n, requires

    K_m(n) = (n - (m+1)/2) m(m+1)      if n >= m+1
    K_m(n) = (1/2) n(n-1)(m+1)         if n <  m+1                       (5)

multiplications and

    K_a(n) = (n - (m+1)/2) m^2         if n >= m+1
    K_a(n) = (1/2) n(n-1) m            if n <  m+1                       (6)

additions. Furthermore, the total number of operations required is

    K(n) = K_m(n) + K_a(n) = (n - (m+1)/2)(2m^2 + m)   if n >= m+1
                           = (1/2) n(n-1)(2m+1)        if n <  m+1       (7)

Proof

For all i, 1 <= i <= m+1, B_{i-1} contains only i-1 nontrivial rows, so that b_i = a_i * B_{i-1} requires only (i-1)(m+1) multiplications. Therefore, for all n <= m+1,

    K_m(n) = sum_{i=1}^{n} (i-1)(m+1) = (m+1) n(n-1)/2 .                 (8)

On the other hand, for n >= m+1, K_m consists of two parts. Using (8), the number of multiplications needed to compute all b_i, 1 <= i <= m, is equal to (1/2)(m+1)m(m-1). Each consecutive b_i, m+1 <= i <= n, requires m(m+1) multiplications by Corollary 1. Therefore, for all n >= m+1,

    K_m(n) = (n-m) m(m+1) + (1/2)(m+1)m(m-1)
           = n m(m+1) - m^2(m+1) + (1/2)(m+1)m(m-1)
           = n m(m+1) - (1/2) m(m+1)^2 .

Each nontrivial row, not counting the first, requires m+1 additions, and each trivial row only one addition, under the assumption that the first row of B_{i-1} is nontrivial. If the first row is trivial, then i = 1 and B_{i-1} = B_0 is the identity matrix; no additions are required in the computation of b_1 = a_1 * B_0 = a_1. Therefore, for all n <= m+1,

    K_a(n) = sum_{i=2}^{n} [(i-2)(m+1) + (m+1) - (i-1)]
           = sum_{i=2}^{n} m(i-1)
           = (1/2) m n(n-1) .

Similarly, K_a(n), n >= m+1, consists of two parts, since each b_i, m+1 <= i <= n, requires m^2 additions by Corollary 1, and the computation of the remaining b_i, 1 <= i <= m, altogether requires (1/2) m^2(m-1) additions. Thus

    K_a(n) = (n-m) m^2 + (1/2) m^2(m-1)
           = n m^2 - (1/2) m^2(m+1) .

3. Parallel-Processor System

The model of a parallel computer (Fig. 1) that we will consider consists of Parallel Memory (PM), Parallel Processor (PP) and Control Unit (CU). PP has p Arithmetic Elements (AE). Each AE can perform any arithmetic operation in one time step (clock).
All AEs are assumed to perform the same operation on each time step (SIMD computer). It is assumed further that no time is required to communicate data between processors and memories, and that the storage arrangement of data is irrelevant.

Algorithm 1

Given a linear recurrence system R<n,m>, divide R<n,m> into ceil(n/p) subsystems R^(j)<p,m>, where 1 <= j <= ceil(n/p):

 1. begin
 2.   x_p^(0) := x_0 ;
 3.   for j := 1 until ceil(n/p) do
 4.   begin
 5.     B_0^(j) := identity matrix ;
 6.     for k := 1 until p do b_k^(j) := a_k^(j) * B_{k-1}^(j) ;
 7.   end
 8.   for j := 1 until ceil(n/p) do
 9.     for k := 1 until p do x_k^(j) := b_k^(j) * x_p^(j-1) ;
10. end

[Fig. 1. A model of a parallel computer]

An application of Algorithm 1 to the system R<28,2> is shown in Fig. 2; a parallel computer with four AEs is used. In general, a square with the inverse diagonal symbolizes the computation of b_i^(j) = a_i^(j) * B_{i-1}^(j), 1 <= i <= 4, 1 <= j <= 7. The input on the top of the square represents a_i and the output on the bottom b_i. The horizontal input represents the matrix B_{i-1}, whose rows are the vectors b_{i-1}, b_{i-2}, ..., e_{m+1}; the dots on the horizontal line indicate the vectors that are rows of B_{i-1}. The unit vector e_{m+1}, which is always the last row of any matrix B_{i-1}, is not connected, for clarity. The dot closest to the square represents the top row of B_{i-1}, that is b_{i-1}; the natural ordering of the rows in the matrix is established by proximity to the square. Similarly, the empty square represents the computation of x_i = b_i * x_p^T, with vertical input b_i, vertical output x_i, and horizontal input the vector x_p. Each of the four AEs performs the computation indicated by approximately 1/4 of the squares. AEs should be assigned to squares in a way that minimizes computational time and simplifies memory access and communication. The assignment imposed by Algorithm 1 for R<28,2> is shown in Table 1.
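A sequential simulation of Algorithm 1 makes the data flow of Fig. 2 concrete. The sketch below (Python with NumPy; it assumes p >= m, and all names are illustrative) computes the b_k^(j) of every subsystem independently, then evaluates each block from the tail of the previous one, and checks the result against plain serial evaluation of R<28,2>.

```python
import numpy as np
from math import ceil

def serial(A, c, x0):
    """Straightforward serial evaluation of x_i = a_i * x_{i-1}^T."""
    n, m = A.shape
    last, out = list(x0), []
    for i in range(n):
        v = c[i] + sum(A[i, j] * last[j] for j in range(m))
        out.append(v)
        last = [v] + last[:-1]
    return np.array(out)

def algorithm1(A, c, x0, p):
    """Sequential simulation of Algorithm 1 (sketch; assumes p >= m).
    Lines 3-7: the parallel coefficients of each subsystem are generated
    independently; this is the part the p AEs would execute concurrently.
    Lines 8-9: each block of p variables is then obtained directly from
    the last variables of the previous block."""
    n, m = A.shape
    blocks = []
    for j in range(ceil(n / p)):
        B = np.eye(m + 1)                          # B_0^(j)
        bs = []
        for k in range(j * p, min((j + 1) * p, n)):
            b = np.append(A[k], c[k]) @ B          # b_k^(j)
            bs.append(b)
            B = np.vstack([b, B[:m - 1], B[-1:]])
        blocks.append(bs)
    xs = np.empty(n)
    xprev = np.append(x0, 1.0)                     # x_p^(0) = x_0
    for j, bs in enumerate(blocks):
        for k, b in enumerate(bs):
            xs[j * p + k] = b @ xprev              # x_k^(j) = b_k^(j) * x_p^(j-1)
        xprev = np.append(xs[(j + 1) * p - m:(j + 1) * p][::-1], 1.0)
    return xs

rng = np.random.default_rng(2)
A = 0.5 * rng.standard_normal((28, 2))
c, x0 = rng.standard_normal(28), rng.standard_normal(2)
assert np.allclose(algorithm1(A, c, x0, p=4), serial(A, c, x0))
```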
Note that the upper part of Table 1 represents redundant computation not required in computation on a serial computer.

[Fig. 2. Application of Algorithm 1 to R<28,2> on four AEs]

[Table 1. AE assignment for R<28,2>. Upper part: each AE serially generates the parallel coefficients of ceil(7/4) = 2 of the seven subsystems. Lower part: the variables are evaluated in slices of four, subsystem by subsystem, with x_{4j-3}, ..., x_{4j} assigned to AE_4, ..., AE_1.]

Theorem 2

The number of time steps required to compute a linear recurrence system R<n,m> using a parallel computer with p AEs is

    T <= (n/p)((2m^2 + 3m) - ((m+1)/(2p))(2m^2 + m))   if p >= m+1
    T <= n((m+1) - (p - 2m + 1)/(2p))                  if p <  m+1

Proof

The proof is based on Algorithm 1. The solution consists of two parts, described by the two DO-loops in lines 6 and 9, respectively. First, using only one AE, all b_k^(j), 1 <= k <= p, are computed for some R^(j)<p,m>. Since there are p AEs, p subsystems can be computed at the same time; the same procedure (the DO-loop in line 6) is repeated ceil(n/p^2) times. Note that the DO-loop in line 3 can be considered a vector operation that is executed in parallel in slices of p elements each, while the DO-loop in line 6 is executed serially. Second, using all p AEs, all x_k^(j), 1 <= k <= p, are evaluated in parallel for each R^(j)<p,m>. This time the DO-loop in line 8 is executed serially and the one in line 9 as a vector operation in slices of size p.
The number T' of time steps required to compute the DO-loop in line 6 using one AE is given by Corollary 2:

    T' = (p - (m+1)/2)(2m^2 + m)    if p >= m+1
    T' = (1/2) p(p-1)(2m+1)         if p <  m+1

Each x_k^(j) = b_k^(j) * x_p^(j-1) takes T'' = 2m time steps (m multiplications and m additions). Therefore

    T = (n/p^2) T' + (n/p) T'' .

For p >= m+1,

    T = (n/p^2)(p - (m+1)/2)(2m^2 + m) + (n/p) 2m
      = (n/p)((2m^2 + 3m) - ((m+1)/(2p))(2m^2 + m)) ,

and for p < m+1,

    T = (n/p^2)(1/2) p(p-1)(2m+1) + (n/p) 2m
      = n((m+1) - (p - 2m + 1)/(2p)) .

The computational time T given by Theorem 2 can be improved if the restriction of the SIMD model is removed, that is, the assumption that all AEs perform the same operation at the same time. In the MIMD model that we shall consider, (p-1) AEs still perform the same function at the same time, while the one remaining AE may not. The idea can be easily explained using Fig. 2. The system R^(1)<4,2> does not require the computation of the parallel coefficients b_1, b_2, b_3 and b_4: since the initial vector x_0 is known, the variables x_1, x_2, x_3 and x_4 can be computed directly from x_0, that is, x_i = a_i * x_{i-1}^T, 1 <= i <= 4. Furthermore, the number of time steps needed to do so, 2pm = 16, is less than the time needed to compute all parallel coefficients b_i, 1 <= i <= 4, of any R^(j)<4,2>, 2 <= j <= 7. Since the time to compute the parallel coefficients is (p - (m+1)/2)(2m^2 + m) = 25, the system R^(1)<4,2> can be enlarged to R^(1)<6,2> and still be solved in 24 time steps using only one AE. This is shown in Fig. 3.

[Fig. 3. The MIMD schedule for R<28,2>: the first six variables are computed serially while the remaining AEs generate parallel coefficients]

The same trick can be employed after every p(p-1) + r variables, where r is the size of the first subsystem R^(i,1)<r,m>. At that moment the values of x_17 and x_18 are known, and there is no need to compute the parallel coefficients b_19, b_20, b_21, b_22, b_23 and b_24.
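The step counts used in the proofs can be checked mechanically. The following sketch encodes Corollary 2, the two phases of Algorithm 1 from the proof of Theorem 2, and the threshold r, as reconstructed here; the equalities hold when p^2 divides n.

```python
from math import ceil, floor

def K(q, m):
    """Corollary 2, eq. (7): total operations to generate b_1..b_q serially."""
    if q >= m + 1:
        return (q - (m + 1) / 2) * (2 * m * m + m)
    return q * (q - 1) * (2 * m + 1) / 2

def T_alg1(n, m, p):
    """Phase counts from the proof of Theorem 2: the coefficient loop is
    repeated ceil(n/p^2) times at K(p, m) steps each; the evaluation loop
    runs ceil(n/p) times at 2m steps per slice of p variables."""
    return ceil(n / p ** 2) * K(p, m) + ceil(n / p) * 2 * m

def T_bound(n, m, p):
    """Closed-form bound of Theorem 2 (exact when p^2 divides n)."""
    if p >= m + 1:
        return (n / p) * ((2 * m * m + 3 * m) - (m + 1) / (2 * p) * (2 * m * m + m))
    return n * ((m + 1) - (p - 2 * m + 1) / (2 * p))

def r_threshold(p, m):
    """Size r of the serially solved first subsystem (used by Algorithm 2)."""
    if p >= m + 1:
        return floor((p - (m + 1) / 2) * (2 * m + 1) / 2)
    return floor(p * (p - 1) * (2 * m + 1) / (4 * m))

assert T_alg1(32, 2, 4) == T_bound(32, 2, 4) == 82
assert T_alg1(8, 3, 2) == T_bound(8, 3, 2) == 38
assert r_threshold(4, 2) == 6     # the R<6,2> block solved serially in the text
```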
Algorithm 2

Given a linear recurrence system R<n,m>, divide R<n,m> into subsystems R^(i)<p(p-1)+r, m>, 1 <= i <= ceil(n/(p(p-1)+r)). Divide furthermore each R^(i) into p subsystems: R^(i,1)<r,m>, where r = floor((p - (m+1)/2)(2m+1)/2), and R^(i,j)<p,m>, 2 <= j <= p.

 1. begin
 2.   x_p^(0,p) := x_0 ;
 3.   for i := 1 until ceil(n/(p(p-1) + r)) do
 4.   begin
 5.     x_0^(i,1) := x_p^(i-1,p) ;
 6.     for k := 1 until r do x_k^(i,1) := a_k^(i,1) * x_{k-1}^(i,1) ;
 7.     for j := 2 until p do
 8.     begin
 9.       B_0^(i,j) := identity matrix ;
10.       for k := 1 until p do b_k^(i,j) := a_k^(i,j) * B_{k-1}^(i,j) ;
11.     end
12.     for k := 1 until p do x_k^(i,2) := b_k^(i,2) * x_r^(i,1) ;
13.     for j := 3 until p do
14.       for k := 1 until p do x_k^(i,j) := b_k^(i,j) * x_p^(i,j-1) ;
15.   end
16. end

Theorem 3

The number of time steps required to compute a linear recurrence system R<n,m> using a MIMD computer with p AEs is

    T <= (n/(p + m + 1/2))(2m^2 + 3m)    if p >= m+1
    T <= (n/(2p))(p(2m+1) + 4m)          if p <  m+1

Proof

The proof is based on Algorithm 2. A recurrence system R<n,m> is divided into systems R^(i)<p(p-1)+r, m>, 1 <= i <= ceil(n/(p(p-1)+r)). Each R^(i) is further subdivided into p subsystems: R^(i,1)<r,m> and R^(i,j)<p,m>, 2 <= j <= p.
With p AEs in a MIMD computer, one AE is used to solve R^(i,1)<r,m> in 2rm time steps; its variables are computed sequentially, x_k^(i,1) = a_k^(i,1) * x_{k-1}^(i,1). Each of the remaining (p-1) processors is used to compute the parallel coefficients b_k^(i,j) = a_k^(i,j) * B_{k-1}^(i,j), 1 <= k <= p, of some subsystem R^(i,j)<p,m>. The number of time steps required for this is (p - (m+1)/2)(2m^2 + m) if p >= m+1, or (1/2) p(p-1)(2m+1) if p < m+1, by Corollary 2. Requiring that the solution of R^(i,1)<r,m> take at most as much time as the computation of all parallel coefficients of any R^(i,j)<p,m>, it is easy to find the value of r:

    r = floor((p - (m+1)/2)(2m+1)/2)    if p >= m+1
    r = floor(p(p-1)(2m+1)/(4m))        if p <  m+1

After all parallel coefficients for all subsystems have been computed, all p processors are used to compute x_k^(i,2) = b_k^(i,2) * x_r^(i,1), and consecutively x_k^(i,j) = b_k^(i,j) * x_p^(i,j-1) for all j, 3 <= j <= p.
Each slice of p variables takes 2m time steps, so each block R^(i)<p(p-1)+r, m> is completed in 2rm + 2m(p-1) time steps. Hence

    T <= ceil(n/(p(p-1)+r)) (2rm + 2m(p-1)) ,

which, after substituting r, yields

    T <= (n/(p + m + 1/2))(2m^2 + 3m)    if p >= m+1
    T <= (n/(2p))(p(2m+1) + 4m)          if p <  m+1 .

4. Pipelined-Processor System

A pipelined-processor model consists of Main Memory (MM), Pipelined Processor (PP) and Control Unit (CU). PP may consist of several pipelined functional units. In our model we shall assume only two functional units, a multiplication pipe and an addition pipe, each with s stages. The two functional units are connected serially, with the multiplication pipe feeding the addition pipe through register R1. The results from the addition pipe are either stored in MM or fed back into the addition pipe through register R2 (Fig. 4).

[Fig. 4. A model of a pipelined computer]

Algorithm 3

Given a linear recurrence system R<n,m>, divide R<n,m> into ceil(n/p) subsystems R^(j)<p,m>, where 1 <= j <= ceil(n/p). Execute the following algorithm:

 1. begin
 2.   for i := 0 until p^2 - 1 do a_{-i} := null ;
 3.   for i := 0 until p^2 - 1 do a_{n+i+1} := null ;
 4.   for j := 1 until ceil(n/p) + p do
 5.   begin
 6.     B_0^(j) := identity matrix ;
 7.     for k := 1 until p do b_k^(j-k+1) := a_k^(j-k+1) * B_{k-1}^(j-k+1) ;
 8.     for k := 1 until p do x_{p-k+1}^(j-p) := b_{p-k+1}^(j-p) * x_p^(j-p-1) ;
 9.   end
10. end

To point out the difference between Algorithms 1 and 3, the system R<28,2> is redrawn in Fig. 5. The order of fetching from MM and of computing coefficients and variables in PP is indicated by dotted arrows.

[Fig. 5. Order of computation of coefficients and variables for R<28,2> in the pipelined model]

The computation of a coefficient b_i requires a_i, b_{i-1} and b_{i-2} to be fetched from MM and entered into PP. It is impossible to proceed immediately with the computation of b_{i+1}, since b_i becomes available only after a certain number of time steps. Therefore, in order to keep the pipelined functional units busy, Algorithm 3 interleaves the coefficient and variable computations of several subsystems, so that by the time a newly generated coefficient is needed again it is available from MM or at the output of PP.
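The interleaving performed by Algorithm 3 is easiest to see by generating the schedule of subsystem indices it implies. A sketch (the tuple layout is illustrative; padded subsystems outside 1..ceil(n/p) are simply skipped):

```python
from math import ceil

def schedule(n, p):
    """Interleaved order of Algorithm 3 (sketch): at outer step j the
    coefficient b_k of subsystem j-k+1 is generated for k = 1..p, then
    the variables of subsystem j-p are evaluated, newest index first.
    Subsystem indices outside 1..ceil(n/p) correspond to the null
    padding of lines 2-3 and are dropped here."""
    J = ceil(n / p)
    order = []
    for j in range(1, J + p + 1):
        for k in range(1, p + 1):
            if 1 <= j - k + 1 <= J:
                order.append(("b", k, j - k + 1))
        if 1 <= j - p <= J:
            for k in range(1, p + 1):
                order.append(("x", p - k + 1, j - p))
    return order

sched = schedule(28, 4)
# Every coefficient and every variable of the 7 subsystems appears once,
assert len([t for t in sched if t[0] == "b"]) == 28
assert len([t for t in sched if t[0] == "x"]) == 28
# and the last coefficient of a subsystem precedes its variables.
assert sched.index(("x", 4, 1)) > sched.index(("b", 4, 1))
```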
Theorem 4

Given an integer p > 0, any linear recurrence system R<n,m> can be solved in

    T_s <= (n/p + p)(m + 2)

time steps using pipelined functional units with s = (m+2)p - (m+1) stages.

Proof

The proof is based on Algorithm 3. The system R<n,m> is divided into subsystems R^(j)<p,m>, 1 <= j <= ceil(n/p). To keep the pipelined units busy almost all the time, p + 1 subsystems are solved concurrently; that is, different subsystems use different stages of the pipelined units. Each system R^(j)<p,m> is solved in two parts: b_k^(j) = a_k^(j) * B_{k-1}^(j), 1 <= k <= p, is computed first, and x_{p-k+1}^(j) = b_{p-k+1}^(j) * x_p^(j-1), 1 <= k <= p, is evaluated afterwards. With p + 1 systems computed concurrently, the order of computation is as follows:

    ..., b_1^(j), b_2^(j-1), ..., b_p^(j-p+1), x_p^(j-p), x_{p-1}^(j-p), ..., x_1^(j-p),
         b_1^(j+1), b_2^(j), ..., b_p^(j-p+2), x_p^(j-p+1), x_{p-1}^(j-p+1), ..., x_1^(j-p+1), ...

Each b_i is a vector containing m+1 elements, and each element of b_i is computed as a sum of products (sop); similarly, each x_i is one sop. Between the moment a value is generated and the moment it is used again, p-1 b-vectors and p variables are in flight, that is, N = (m+1)(p-1) + p = (m+2)p - (m+1) sops altogether. Each sop has at most m+1 products, so the multiplication and addition pipes are used m+1 times for each sop. The bottleneck of the pipelined processor is obviously the addition pipe, since a new product can be added to the corresponding partial sum only one addition-pipe step after the previous one. Therefore m+1 addition steps and one multiplication step are needed to obtain the p solutions of R^(j)<p,m>, and the total time needed to compute R<n,m> is less than (n/p + p)(m + 2) pipeline-processor steps. The number of stages s in the addition pipe must be equal to N: this way, the partial sum for some sop and the new product to be added to it are generated at the same time by the addition and multiplication pipes, respectively, and are ready to be entered into the addition pipe again.
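The trade-off in Theorem 4, where a larger p shortens the n/p term but lengthens the p term, can be probed numerically under the theorem's cost model:

```python
from math import isqrt

def T_pipe(n, m, p):
    """Theorem 4 bound: (n/p + p)(m+2) pipeline steps, assuming the
    functional units have s = (m+2)p - (m+1) stages."""
    return (n / p + p) * (m + 2)

n, m = 400, 2
best = min(range(1, n + 1), key=lambda p: T_pipe(n, m, p))
assert best == isqrt(n) == 20          # the minimum sits at p = sqrt(n)
assert T_pipe(n, m, best) == (400 / 20 + 20) * 4 == 160
```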
It is worth noting the desirability of a large p <= sqrt(n) (the function (n/p + p)(m+2) has a minimum at p = sqrt(n)). However, a large p requires a large number of stages in the pipelined functional units. Since the number of gate levels in an implementation of floating-point multiplication or addition is fixed, the number of stages s will hardly exceed 10-15; in present-day machines s is between 2 and 8. Furthermore, an increased s increases the cost of the functional units, since extra latches or registers have to be added, which in turn introduces extra delay, reducing the effect of pipelining. The problem of finding the optimal number of stages s for a given p was answered by Theorem 4. In reality, pipelined processors have a fixed s, and the problem is to find the minimal T_s. Without loss of generality, we shall assume that all functional units have the same number of stages.

Corollary 3

Given a pipelined processor with s-stage functional units, any R<n,m> can be computed in

    T_s <= min[ (n/p' + p')(m+2) , (((m+2)p'' - (m+1))/s)(n/p'' + p'')(m+2) ]

where p' = floor((s + m + 1)/(m + 2)) and p'' = ceil((s + m + 1)/(m + 2)).

Proof

For a given p, N = (m+2)p - (m+1). If s = N, all stages are busy performing some computation. If s > N, exactly s - N stages idle, waiting for the partial sums to become available at the output of the addition pipe; the time required to compute R<n,m> is still the same, (n/p + p)(m+2), but the effectiveness, the fraction N/s of busy stages, has decreased. On the other hand, if s < N there is no idling, but N/s times as many pipeline time steps are required to compute R<n,m>. Consequently, either s >= N (p <= (s+m+1)/(m+2)) or s < N (p >= (s+m+1)/(m+2)). In either case a p that minimizes T_s is sought. In the former case the function (n/p + p)(m+2) is monotonically decreasing for positive p, p <= sqrt(n), with the minimum at p = sqrt(n); assuming p << n, the minimum time is obtained for the largest p such that p <= (s+m+1)/(m+2).
When s < N, the function (N/s)(n/p + p)(m+2) is monotonically increasing for positive p, and therefore the smallest p, p >= (s+m+1)/(m+2), requires the minimum time to compute R<n,m>.

5. Conclusion

The results of this paper are collected in Table 2. The exact bounds have been approximated as closely as possible by short, compact expressions.

[Table 2. Time, speedup and effectiveness bounds for the SIMD, MIMD and pipelined models]

The advantage of MIMD over SIMD systems is negligible for large p and small m, and the speedup of a MIMD machine over an SIMD machine increases by less than 1 for small p; for example, for p = 2, m = 1 the speedups are 7/5 and 4/5. However, for small p the advantage of a parallel computer over a single-processor system is not earthshaking. The pipelined computer seems to have the advantage for medium to large p and moderate m (approximately m >= 4). However, if the performance-cost ratio is considered, the pipelined computer has the advantage even for small m (m = 1, 2, 3). Let us assume that the cost of an AE is 1; then the cost of pipelining the same AE is 1.2-2.0, depending on the number of stages, the extra cost being incurred by the additional hardware for storing partial results at each stage. The ratios of speedup to cost for the parallel and the pipelined computer are then approximately

    S_p/C_p = (2/(2m+3))(1 + m/p + 1/(2p))    and    S_s/C_s = s/(m + 4 + 5/m) ,

where C_p = p and C_s = 2 are the costs of parallel and pipelined arithmetic. It should be noted that the extra delay introduced by the writing and reading of registers at each stage has been neglected; that alone may degrade the performance of the pipelined computer by 10%-50%, depending on the number of stages.

The advantage of the pipelined computer can be easily explained by an inadequacy of the parallel model. It was assumed that each AE is capable of executing one operation in one time step. However, hardware for multiplication and addition has only one logic unit in common: the mantissa adder.
Sharing it between multiplication and addition may cost more than duplicating it. Therefore, instead of having roughly one-half of the AE logic idle during each operation, an AE could be designed to execute one multiplication and one addition simultaneously. This way the speedup of the parallel model would double, providing better performance than the pipelined model for all p and m. Furthermore, the performance of the parallel computer can be increased by increasing p. As mentioned earlier, this is not possible with the pipelined computer, since there is only a finite number of stages that the multiplication or addition logic will tolerate.

A disadvantage of the parallel model is the access of Parallel Memory, which has p input and p output ports. If it is thought of as a two-dimensional array of words, it must allow parallel access to any p elements in any row or column. This can easily be seen from Table 1. For example, to compute b_1, b_5, b_9 and b_13 using four AEs, the PM must deliver a_1, a_5, a_9 and a_13 simultaneously; after the computation, b_1, b_5, b_9 and b_13 are stored in place of a_1, a_5, a_9 and a_13. Assuming four memory modules, the upper half of Table 1 can be used as the storage assignment for each memory module. However, when computing x_1, x_2, x_3 and x_4, the PM must deliver simultaneously b_1, b_2, b_3 and b_4, which are unfortunately all in the same memory module. Therefore, a memory scheme with skewed storage and input and output alignment networks is required, further increasing the cost of a parallel system.

It should be mentioned, at the end, that real parallel and pipelined computers have not been designed for solving linear recurrences, and therefore the mapping of the algorithms presented in this paper onto the available machines may present difficulty and lead to execution times above the bounds given in Table 2.
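The module conflict described above can be made concrete. In the sketch below (illustrative assignments, not the paper's exact scheme), the straightforward one-module-per-subsystem layout is conflict-free for the coefficient phase but serializes the evaluation phase, while a skewed layout is conflict-free for both:

```python
p = 4

def pos(i):
    """b_i belongs to subsystem j at position k (both 0-based)."""
    return divmod(i - 1, p)

def straight(i):
    """One memory module per subsystem (the layout of Table 1)."""
    return pos(i)[0] % p

def skewed(i):
    """Skewed layout: module index is (subsystem + position) mod p."""
    j, k = pos(i)
    return (j + k) % p

coeff_slice = (1, 5, 9, 13)   # b's produced together in the coefficient phase
eval_slice = (1, 2, 3, 4)     # b's fetched together in the evaluation phase

assert len({straight(i) for i in coeff_slice}) == 4   # conflict-free
assert len({straight(i) for i in eval_slice}) == 1    # all in one module!
assert len({skewed(i) for i in coeff_slice}) == 4     # conflict-free
assert len({skewed(i) for i in eval_slice}) == 4      # conflict-free
```

Skewing alone is not enough in hardware, which is why the text also calls for input and output alignment networks between the modules and the AEs.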
Acknowledgment

I would like to thank Professors Kuck and Sameh for helpful discussions during the development of this paper.

UIUCDCS-R-78-953
AN ALGORITHM FOR SOLVING LINEAR RECURRENCE SYSTEMS ON PARALLEL AND PIPELINED MACHINES
Daniel D. Gajski
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
December 1978. Supported in part by the National Science Foundation under grant US NSF MCS77-27910.

Abstract. An algorithm for the solution of linear recurrence systems on parallel or pipelined computers is described. Bounds on speedup and effectiveness for SIMD and MIMD machines with p processing elements, as well as for pipelined computers with a fixed number of stages per operation, are obtained. The model of each computer is discussed in detail to explain the better performance of the pipelined model. A suggestion is made for the design of processing elements for recurrence solvers.

Key words: linear recurrences; triangular system solvers; parallel evaluation; parallel processors; pipelined processors; complexity of algorithms; computer organization.