Report No. UIUCDCS-R-75-701
NSF-OCA-GJ-36936-000009

LINEAR SYSTEM SOLVERS FOR PARALLEL COMPUTERS

by

A. H. Sameh and D. J. Kuck

February 1975

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

A. H. Sameh: Department of Computer Science, and Center for Advanced Computation, University of Illinois, Urbana, Illinois 61801.
D. J. Kuck: Department of Computer Science, University of Illinois, Urbana, Illinois 61801.

This work was supported in part by NSF Grant GJ-36936, and in part by the Advanced Research Projects Agency of the Department of Defense under Contract No. DAHC04-72-C-0001.

Abstract

In this paper we consider the direct solution of dense and tridiagonal linear systems of n equations on parallel computers, using the LDU factorization and the square-root-free Givens triangularization. We show that a dense system may be solved, using LDU factorization without pivoting or Givens reduction, in T_p = O(n) steps with p = O(n^2) processors. The resulting speedup over the sequential algorithm is O(n^2), with an efficiency independent of n. If pivoting is required for a stable LDU factorization, the system can be solved in T_p = O(n log_2 n) steps with p = O(n^2) processors; in this case we obtain a speedup of O(n^2/log_2 n) and an efficiency of O(1/log_2 n). Using Givens reduction or LU factorization without pivoting, a tridiagonal system may be solved in T_p = O(log_2 n) steps with p = O(n) processors, resulting in a speedup of O(n/log_2 n) and an efficiency of O(1/log_2 n).

Keywords: efficiency of a parallel computation; Givens triangularization; LDU factorization; linear systems; orthogonal transformations; parallel computers; speedup of a parallel computation.

1. Introduction

Solving a system of linear equations is one of the most frequently encountered problems in numerical computing, and several traditional algorithms are available for solving such systems. Using a serial computer, a system of order n can be solved by Gaussian elimination or LU factorization using (1/3)n^3 + O(n^2) multiplications, while Givens triangularization requires (4/3)n^3 + O(n^2) multiplications and (1/2)n^2 square roots. In this paper we consider parallel versions of the LU decomposition and Givens reduction algorithms for both dense and tridiagonal systems of equations. For each algorithm discussed we give bounds on the time and the number of processors required to achieve that time.

The following definitions and assumptions hold throughout this paper. If p processors are used, p > 1, we denote the computation time by T_p. Thus the speedup of a parallel algorithm which uses p processors over a serial algorithm is given by S_p = T_1/T_p >= 1. We define the efficiency of such a computation as E_p = S_p/p <= 1; this can be regarded as the ratio of the achieved speedup to the maximum possible speedup using p processors.
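As a small numerical illustration of these definitions (the figures below are hypothetical, not taken from this paper):

    def speedup_efficiency(T1, Tp, p):
        # S_p = T1/Tp and E_p = S_p/p, as defined above
        Sp = T1 / Tp
        return Sp, Sp / p

    # hypothetical figures: 10**6 serial steps, 10**4 parallel steps
    # on 256 processors give S_p = 100 and E_p = 0.390625
    print(speedup_efficiency(10**6, 10**4, 256))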
We also assume that (i) any number of processors may be used at any time, though we give bounds on this number, (ii) each processor can perform any of the four arithmetic operations in one time step, and (iii) there are no memory or data alignment time penalties.

In Section 2 we discuss the LDU decomposition algorithm. We show that irreducible, diagonally dominant linear systems of order n may be solved in T_p = O(n) steps using p = O(n^2) processors, while for an arbitrary real nonsingular matrix, p = O(n^2) and T_p = O(n log n) steps; the extra time is required for pivoting. In Theorem 3 we give a new result for solving tridiagonal systems: a system of order n may be solved in T_p = 7 log n steps using no more than p = 4n processors if pivoting is not required. This is an improvement on the results in [1].

Section 3 contains our main results. We adapt the Givens process for parallel computation so that any nonsingular linear system of order n can be solved in T_p = O(n) steps using p = O(n^2) processors. Since T_1 = O(n^3), we achieve a speedup S_p = O(n^2) with an efficiency which is independent of n. This is a substantial improvement over the LDU decomposition with pivoting. Finally, we show that any nonsingular tridiagonal system of order n can be solved in T_p = 10 log n + O(1) steps with p = 4n. This algorithm is more stable than the LU decomposition of Section 2 without pivoting, in cases where pivoting is necessary for a stable LU factorization, and is also faster than the result of [1]. Throughout the paper we denote log_2 x by log x.

2. LDU Decomposition

In this section we consider the parallel solution of dense and tridiagonal systems of equations. For dense systems we discuss both cases, in which the matrix of coefficients is (i) irreducible and diagonally dominant, so that pivoting is not required [2], or (ii) an arbitrary real nonsingular matrix, for which pivoting is usually required for a stable decomposition. An algorithm for the LDU factorization in each of the two cases is presented in Theorems 1 and 2, respectively. Lemma 1 can be used in either case to obtain the solution by forward and backward substitution. It should be noted that the time and processors required in Lemma 1 (T_p = O(n)) do not change the order of the results of Theorem 1 (T_p = O(n)) or Theorem 2 (T_p = O(n log n)); hence the results of the theorems can be taken as time and processor bounds for linear system solvers of this type. We also note that if a linear system, once factored, is to be solved for a number of different right-hand sides (not all at the same time), then the time and processors of Lemma 1 are all we need for each right-hand side. In Theorem 3 we present an algorithm for solving tridiagonal systems; since the decomposition and the forward and backward substitutions all require the same order of processors and time, we present these as one theorem, namely T_p = 7 log n.

We first consider the case of dense, irreducible, and diagonally dominant systems.

Theorem 1. Let A be an irreducible, diagonally dominant matrix of order n. Then it can be factored in the form A = LDU in T_p = O(n) steps using p = O(n^2) processors, where L and U are unit lower and upper triangular matrices, respectively, and D is a diagonal matrix with nonzero entries. Thus we obtain a speedup S_p = O(n^2) with an efficiency 2/9 < E_p <= 2/3.
Proof: Let L = [l_1, l_2, ..., l_n] and U^t = [u_1, u_2, ..., u_n], where

l_k^t = (0, ..., 0, 1, l_{k+1,k}, ..., l_{nk}),   u_k^t = (0, ..., 0, 1, u_{k,k+1}, ..., u_{kn}).

Hence

A = A_1 = sum_{i=1}^{n} d_i l_i u_i^t,

in which d_i is the i-th element of D. At the beginning of the k-th stage we have the elements (d_1, d_2, ..., d_{k-1}), the column vectors [l_1, l_2, ..., l_{k-1}] and [u_1, u_2, ..., u_{k-1}], and the matrix

A_k = A - sum_{i=1}^{k-1} d_i l_i u_i^t,

whose first (k-1) rows and columns are zero. Thus

d_k = a_{kk}^{(k)},   l_{rk} = a_{rk}^{(k)}/d_k,   u_{kr} = a_{kr}^{(k)}/d_k,   r = k+1, k+2, ..., n,

are obtained by one division using 2(n-k) processors. We then obtain the matrix (d_k l_k)u_k^t in one multiplication using (n-k)^2 processors, and A_{k+1} = A_k - d_k l_k u_k^t in one subtraction using the same number of processors. Therefore, three time steps are required at the k-th stage, using max{2(n-k), (n-k)^2} processors. Hence the whole factorization requires T_p = 3(n-1) steps with p = (n-1)^2 processors for n >= 3. It can be shown that the sequential algorithm requires T_1 = 2n(n^2-1)/3 steps. Thus we obtain a speedup S_p = 2n(n+1)/9 with a corresponding efficiency 2/9 < E_p <= 2/3, where the upper limit is achieved for n = 3.

If A is symmetric positive definite, pivoting is again not necessary [3], and A can be factored in 3(n-1) steps, as in Theorem 1, using no more than n(n-1)/2 processors. So again we obtain a speedup of O(n^2), with an efficiency greater than 2/9.

For a general nonsingular matrix, a pivoting strategy is usually necessary for the numerical stability of the decomposition. Adopting a partial pivoting strategy, we show that the decomposition can be performed in O(n log n) steps.

Theorem 2. Let A be a nonsingular matrix of order n = 2^m, for a positive integer m. Then it can be factored in the form PA = LDU in T_p = O(n log n) steps using p = O(n^2) processors, where L and U are unit lower and upper triangular matrices, respectively, D is a diagonal matrix with nonzero entries, and P is a permutation matrix. Thus we obtain a speedup S_p = O(n^2/log n) and an efficiency E_p = O(1/log n).

Proof: The proof is similar to that of Theorem 1. At the k-th stage, k = 1, 2, ..., n-1, we require three steps as before using (n-k)^2 processors, in addition to ceil(log(n-k+1)) steps for searching a column vector of (n-k+1) elements for the pivot using ceil((n-k+1)/2) processors. Hence the factorization can be obtained in T_p <= 3(n-1) + n log n steps using p = (n-1)^2 processors. Ignoring the time required for pivoting, the sequential algorithm requires T_1 = (2/3)n(n^2-1) steps. Thus, for large n, we obtain a speedup S_p ≈ 2n^2/(3 log n) with an efficiency E_p ≈ 2/(3 log n).

Next we give a lemma for the forward and backward substitutions, which are used to obtain a solution once the matrix of coefficients is factored.

Lemma 1. Given the LDU factorization of PA, the linear system of equations Ax = f can be solved in T_p = O(n) steps using p = O(n) processors, with a speedup S_p = O(n) and an efficiency E_p = O(1).

Proof: The solution vector x can be obtained by sequentially solving the three systems Lz = Pf, Dy = z, and Ux = y. The lower triangular system can be written as

sum_{i=1}^{n} z_i l_i = Pf = f^(1);

hence z is given by z_k = f_k^(k) (k = 1, 2, ..., n), where

f^(j+1) = f^(j) - z_j l_j,   j = 1, 2, ..., n-1.

Therefore each of the triangular systems may be solved in 2(n-1) steps using (n-1) processors. Since Dy = z can be solved in one step using n processors, the total time for solving Ax = f, using p = n processors, is T_p = 4n - 3. The sequential algorithm requires T_1 = n(2n-1) steps; hence we obtain a speedup S_p ≈ n/2 and an efficiency E_p ≈ 1/2.
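The proofs above translate directly into code. The following Python sketch (ours, with illustrative names; the report itself gives no programs) simulates the stages of Theorem 1 and the column sweeps of Lemma 1 serially; each commented step would be a single parallel time step in the model of Section 1.

    import numpy as np

    def ldu(A):
        # Stage-by-stage rank-one elimination as in the proof of
        # Theorem 1.  Each iteration is one stage: a parallel division
        # step, a multiplication step, and a subtraction step on up to
        # (n-k)^2 processors.  No pivoting, so A is assumed irreducible
        # and diagonally dominant (or symmetric positive definite).
        A = np.asarray(A, dtype=float).copy()
        n = A.shape[0]
        L, U, d = np.eye(n), np.eye(n), np.zeros(n)
        for k in range(n):
            d[k] = A[k, k]                      # d_k = a_kk^(k)
            L[k + 1:, k] = A[k + 1:, k] / d[k]  # one division step
            U[k, k + 1:] = A[k, k + 1:] / d[k]
            A[k + 1:, k + 1:] -= d[k] * np.outer(L[k + 1:, k], U[k, k + 1:])
        return L, d, U

    def solve_ldu(L, d, U, f):
        # Lemma 1: forward column sweeps, one diagonal division,
        # backward column sweeps; 4n - 3 steps with n processors.
        n = len(d)
        z = np.asarray(f, dtype=float).copy()
        for j in range(n - 1):
            z[j + 1:] -= z[j] * L[j + 1:, j]    # two steps per column
        x = z / d                               # one step, n processors
        for j in range(n - 1, 0, -1):
            x[:j] -= x[j] * U[:j, j]
        return x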
We have shown [4,5] that a triangular system of equations may be solved in O(log^2 n) steps using O(n^3) processors. Since we are dealing here with dense matrices of moderate size (n <= 1000, say), the overall process remains essentially of O(n), dominated by the factorization, even if such a triangular solver were used; hence, for higher efficiency, we use the above algorithm, which requires only n processors.

For tridiagonal systems of equations it has been shown [1] that the direct LU factorization of the matrix of coefficients and the solution of the resulting system can be obtained in 17 ceil(log n) steps using n processors. We next give a smaller upper bound on T_p for solving such systems. As a further comparison with the method of [1], we note that using Theorem 3 and n processors, a tridiagonal system can be solved without pivoting in just 16 log n steps.

Theorem 3. Let A be a nonsingular tridiagonal matrix of order n = 2^m, for a positive integer m. Then the system of equations Ax = f may be solved using the LU factorization A = LU, where L is unit lower triangular and U is upper triangular, in T_p = 7 log n steps using p = 4n processors. This results in a speedup S_p = O(n/log n) with an efficiency E_p = O(1/log n).

Proof: It has been shown in [4] that any m-th order linear recurrence system of n equations, or any unit triangular system of order n and bandwidth (m+1), may be solved in

T = (2 + log m) log n - (1/2)(log^2 m + log m)

steps with p = O(m^2 n) processors. In [1] it was shown that the LU factorization of A is essentially equivalent to solving a second-order (m = 2) linear recurrence system of n equations. Thus the factorization can be obtained in T_{p_1} = 3 log n - 1 steps with p_1 = 4n. Each of the forward and backward sweeps requires the solution of a first-order linear recurrence system of n equations; hence the time required for these sweeps is T_{p_2} = 1 + 4 log n with p_2 = n processors. Therefore, the total time for solving the tridiagonal system of equations is T_p = T_{p_1} + T_{p_2} = 7 log n steps with p = 4n processors. This results in a speedup S_p = 8(n-1)/(7 log n) with an efficiency E_p ≈ 2/(7 log n).

Note that if pivoting is necessary for a stable LU factorization, then the scheme in [1] is not satisfactory. In the following section we introduce another algorithm that is more stable.
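For concreteness, the following serial simulation (our own sketch, not the algorithm of [1] or [4]) shows the kind of first-order recurrence solver such sweeps rely on; each pass of the loop corresponds to a constant number of parallel steps, which is where the O(log n) totals come from.

    import numpy as np

    def first_order_recurrence(a, f):
        # Solve x_1 = f_1, x_k = a_k x_{k-1} + f_k (k = 2, ..., n) by
        # recursive doubling: log2(n) sweeps, and within a sweep every
        # element can be updated simultaneously, i.e., each assignment
        # below is one parallel step.  The m-th order analogue of this
        # idea underlies the bounds quoted from [1,4].
        x = np.asarray(f, dtype=float).copy()
        a = np.asarray(a, dtype=float).copy()   # a[0] is unused
        n, d = len(x), 1
        while d < n:
            x[d:] = x[d:] + a[d:] * x[:-d]      # fold in the term d back
            a[d:] = a[d:] * a[:-d]              # coefficient now spans 2d
            d *= 2
        return x

    # x_k = 2 x_{k-1} + 1 with x_1 = 1 gives 1, 3, 7, 15:
    print(first_order_recurrence([0, 2, 2, 2], [1, 1, 1, 1]))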
3. Givens Triangularization

In this section we consider the parallel solution of a general linear system using an orthogonal factorization of the matrix of coefficients. Theorem 4 shows that with p = O(n^2) processors we can triangularize any real nonsingular matrix in T_p = O(n) steps using a modified Givens process; thus we achieve a speedup S_p = O(n^2) and an efficiency independent of n. Lemma 2 shows that, after the orthogonal factorization of the matrix of coefficients, we can solve the linear system with n processors in O(n) steps. Thus Theorem 4 determines the order of time and processors for linear system solving using Givens reduction. Again we note that if a linear system is to be solved for various right-hand sides, we use the time and processors of Theorem 4 once, and of Lemma 2 for each right-hand side. Finally, Theorem 5 shows that any tridiagonal system of equations may be solved using Givens reduction in T_p = 10 log n + O(1) steps with p = 4n processors.

Now we consider the orthogonal factorization of a real nonsingular matrix A,

QA = R,   (1)

where Q is an orthogonal matrix formed as the product of plane rotations, and R is upper triangular. This standard Givens process requires roughly 4n^3/3 multiplications and n^2/2 square roots, where n is the order of the matrix A. Recently [6,7] it has been shown that the matrices D and R in the modified Givens reduction,

QA = D^{1/2} R,   (2)

can be obtained using half the multiplications of the standard Givens process and no square roots, provided Q is not computed. We will use this modified process in our parallel algorithm to avoid a square root evaluation after every transformation. First we describe a numerically stable version of this square-root-free Givens reduction. A Givens rotation applied to the two rows

(sqrt(k) u_1, sqrt(k) u_2, ..., sqrt(k) u_t)   and   (sqrt(ℓ) v_1, sqrt(ℓ) v_2, ..., sqrt(ℓ) v_t),

so as to annihilate the first element of the second row, produces the rows

(sqrt(p) ũ_1, sqrt(p) ũ_2, ..., sqrt(p) ũ_t)   and   (0, sqrt(q) ṽ_2, ..., sqrt(q) ṽ_t),

and the rotation is determined by

c^2 = 1/(1 + αβ),   s^2 = αβ/(1 + αβ),   (3)

where

α = ℓ v_1/(k u_1),   β = v_1/u_1.   (4)

Thus,

ũ_r = u_r + (s sqrt(ℓ))/(c sqrt(k)) v_r,   ṽ_r = -(s sqrt(k))/(c sqrt(ℓ)) u_r + v_r.   (5)

Simplifying (5) by means of (3) and (4), we obtain

ũ_r = u_r + α v_r,   ṽ_r = -β u_r + v_r.   (5')

There are several choices for p and q; one such choice is

p = c^2 k   and   q = c^2 ℓ.   (6)

Furthermore,

p = k/(1 + αβ)   and   q = ℓ/(1 + αβ).   (6')

Unfortunately, however, the method suffers from the possibility of underflow in p and q. This may be avoided by interchanging the two rows under consideration. On a serial computer, evaluating ũ_r, ṽ_r (r = 1, 2, ..., t), p, and q requires (4t+5) steps, while on a parallel computer this process requires four steps with (2t+5) processors, as shown in Figure 1.

[Figure 1: the four-step parallel evaluation of ũ_r and ṽ_r (r = 1, 2, ..., t), together with p = k^2 u_1^2 (k u_1^2 + ℓ v_1^2)^{-1} and q = ℓ k u_1^2 (k u_1^2 + ℓ v_1^2)^{-1}.]
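In code, one such square-root-free rotation might look as follows (a sketch under our own naming; the scale factors k and ℓ are carried alongside the rows, as in the text):

    import numpy as np

    def fast_givens(u, ku, v, lv):
        # One square-root-free rotation, eqs. (3)-(6').  The two working
        # rows are stored in factored form sqrt(ku)*u and sqrt(lv)*v; the
        # step annihilates the first element of the second row without
        # taking any square root.  Assumes u[0] != 0; if p or q threatens
        # to underflow, the rows should be interchanged first, as noted
        # above.
        u = np.asarray(u, dtype=float)
        v = np.asarray(v, dtype=float)
        alpha = lv * v[0] / (ku * u[0])        # eq. (4)
        beta = v[0] / u[0]
        denom = 1.0 + alpha * beta
        p, q = ku / denom, lv / denom          # eq. (6')
        u_new = u + alpha * v                  # eq. (5'): one parallel
        v_new = -beta * u + v                  #   step each, t processors
        return u_new, p, v_new, q              # v_new[0] vanishes

    # consistency check against the ordinary rotation: the rotated pair
    # preserves the first-column norm, p*un[0]**2 == ku*u0^2 + lv*v0^2
    u, v, ku, lv = [3.0, 1.0, 2.0], [4.0, 5.0, 6.0], 2.0, 0.5
    un, p, vn, q = fast_givens(u, ku, v, lv)
    assert abs(p * un[0]**2 - (ku * u[0]**2 + lv * v[0]**2)) < 1e-12
    assert abs(vn[0]) < 1e-12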
In fact, we can apply more than one rotation at a time in triangularizing A. Similar to the parallel Jacobi method [8], several schemes are possible. We present an annihilation scheme that requires O(n) steps to produce D, R, and Q in (2). Denoting each element annihilated in the k-th transformation by the integer k, the annihilation regime is shown in Figure 2.

[Figure 2: the annihilation regime for n = 16. The integer k in position (i, j) below the diagonal indicates that the element is annihilated by the k-th transformation; k runs from 1 at position (n, 1) to 2n-3 = 29 at position (n, n-1).]

Let us denote by (i, î, j) that rotation which rotates rows i and î, annihilating the element in position (î, j). The k-th transformation consists of the rotations (i, i+1, j), where i and j are given by the sequences:

(i) for k = 1, 2, ..., n-1:

    i = {(n-k), (n-k+2), ..., (n-1-s(k))},   j = {1, 2, ..., ceil(k/2)};   (7)

(ii) for k = n, n+1, ..., 2n-3:

    i = {(k-n+2), (k-n+4), ..., (n-1-s(k))},   j = {(k-n+2), (k-n+3), ..., ceil(k/2)};   (8)

where s(k) = 0 for k odd and s(k) = 1 for k even.

Since it is only an odd-numbered transformation that introduces a zero in the last row, the total number of required transformations is (2n-3). Hence, in computing D^{1/2} and R in (2), we require T_{p_1} = 4(2n-3) steps and one square root, with the number of processors not exceeding

$$p_1 = \sum_{t=n-\lceil n/2 \rceil + 1}^{n} (2t+3) = \lceil n/2 \rceil \bigl( 2(n+2) - \lceil n/2 \rceil \bigr).$$

For n = 2^m, where m is a positive integer, p_1 = (3/4)n^2 + 2n. The orthogonal matrix Q is given by Q = Q_{2n-3} ... Q_2 Q_1, where each transformation Q_k is the direct product of the rotations (i, i+1, j) given by the sequences (7) and (8); that is, Q_k is a block-diagonal matrix consisting of 2 x 2 diagonal submatrices. Thus, while we do not form Q explicitly, Q consists of a total of n(n-1)/2 rotations, all of which can be evaluated from (3) in T_{p_2} = 3 steps plus one square root, using p_2 = n(n-1) processors. Therefore, the total parallel time for obtaining D^{1/2}, R, and Q is given by T_p = T_{p_1} + T_{p_2} = (8n-9) steps and two square roots, with p = n(n-1) processors.

On a serial machine, the time required for obtaining D^{1/2} and R in (2), i.e., evaluating (5') and (6'), is given by

$$T_1' = \sum_{t=2}^{n} (t-1)(4t+5) = n(n-1)(8n+23)/6$$

steps and n square roots, while the time required for evaluating the n(n-1)/2 rotations in Q is T_1'' = 7n(n-1)/2 steps and n(n-1) square roots, as can be shown from (3) and (4). Thus the total sequential time is T_1 = T_1' + T_1'' = 2n(n-1)(2n+11)/3 steps and n^2 square roots, and we achieve a speedup S_p ≈ n^2/6 with an efficiency 1/6 < E_p <= 1/3. Hence we have proved the following theorem.

Theorem 4. Let A be a nonsingular matrix of order n. Then it can be factored in the form QA = D^{1/2} R, where D^{1/2}, R, and Q_i (i = 1, 2, ..., 2n-3) may be evaluated in T_p = O(n) steps and two square roots using p = O(n^2) processors. This results in a speedup S_p = O(n^2) with an efficiency 1/6 < E_p <= 1/3.

We have also developed an annihilation scheme that requires 2(n-1) - log n transformations instead of (2n-3); however, the relations between i, î, and j are more complicated than those in (7) and (8).

Lemma 2. Given the orthogonal factorization QA = D^{1/2} R, i.e., D, R, and the Q_i, the linear system of equations Ax = f may be solved in T_p = O(n) steps and one square root using no more than p = n processors. Hence we obtain a speedup S_p = O(n) with an efficiency 4/9 < E_p <= 6/7.

Proof: From (2) we have

D^{1/2} R x = Qf.   (9)

Since the matrices Q_i in Q = Q_{2n-3} ... Q_2 Q_1 are block diagonal, the right-hand side f̃ = Qf may be computed sequentially,

f^(i+1) = Q_i f^(i),   i = 1, 2, ..., 2n-3,   (10)

where f^(1) = f and f^(2n-2) = f̃. Using n processors, we require three steps for each f^(i+1); thus f̃ may be computed in 3(2n-3) steps using n processors. Solving for z in D^{1/2} z = f̃ requires one division with n processors. Finally, the solution vector x in Rx = z can be obtained in 3(n-1) steps using (n-1) processors (note that R is not unit upper triangular). Thus the total time for solving (9) is given by T_p = (9n-11) steps with p = n processors. The sequential algorithm requires T_1 = 2n(2n-1) steps. Thus we obtain a speedup S_p ≈ 4n/9 with an efficiency 4/9 < E_p <= 6/7.
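As an illustration of this solution phase, the following sketch (ours, assuming the rotations are stored as (i, c, s) triples) applies the transformations of (10) and then performs the column-oriented back substitution; the rotations within one Q_i act on disjoint row pairs, which is what makes each application of (10) cost three parallel steps.

    import numpy as np

    def apply_Q(transformations, f):
        # f~ = Qf via (10): apply Q_1, ..., Q_{2n-3} in order.  Each
        # transformation is a list of (i, c, s) triples, a rotation of
        # the disjoint row pair (i, i+1); since the pairs are disjoint,
        # one transformation costs three parallel steps on n processors.
        f = np.asarray(f, dtype=float).copy()
        for rotations in transformations:
            for i, c, s in rotations:      # independent: parallelizable
                fi, fi1 = f[i], f[i + 1]
                f[i], f[i + 1] = c * fi + s * fi1, -s * fi + c * fi1
        return f

    def solve_upper(R, z):
        # Back substitution for Rx = z by column sweeps, as in Lemma 1,
        # plus a division since R is not unit triangular: roughly
        # 3(n-1) steps with (n-1) processors in the model above.
        R = np.asarray(R, dtype=float)
        x = np.asarray(z, dtype=float).copy()
        for j in range(len(x) - 1, -1, -1):
            x[j] /= R[j, j]
            x[:j] -= x[j] * R[:j, j]
        return x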
Next we consider Givens reduction of tridiagonal matrices. Let

$$A = \begin{bmatrix} a_1 & b_1 & & & \\ e_2 & a_2 & b_2 & & \\ & \ddots & \ddots & \ddots & \\ & & e_{n-1} & a_{n-1} & b_{n-1} \\ & & & e_n & a_n \end{bmatrix}$$

be of order n = 2^m, where m is a positive integer. We annihilate the subdiagonal elements of A by applying the orthogonal transformation Q = Q_{n-1} ... Q_2 Q_1, where each Q_j is a Givens transformation rotating rows j and j+1 such that the element in position (j+1, j) is eliminated; Q_j coincides with the identity matrix except in rows j and j+1, where it has the 2 x 2 block with rows (c_j, s_j) and (-s_j, c_j). Thus, after step (r-1), Q_{r-1} Q_{r-2} ... Q_1 A is given by

$$\begin{bmatrix} p_1 & \beta_1 & \gamma_1 & & & \\ & \ddots & \ddots & \ddots & & \\ & & p_{r-1} & \beta_{r-1} & \gamma_{r-1} & \\ & & & p_r & c_{r-1} b_r & \\ & & & e_{r+1} & a_{r+1} & b_{r+1} \\ & & & & \ddots & \ddots \end{bmatrix}$$

and by induction it follows that

c_r = p_r/δ_r,   s_r = e_{r+1}/δ_r,   (11)

p_r = c_{r-1} a_r - c_{r-2} s_{r-1} b_{r-1},
β_r = s_r a_{r+1} + c_{r-1} c_r b_r,      r = 1, 2, ..., n-1,   (12)
γ_r = s_r b_{r+1},

where c_0 = 1, s_0 = 0, and δ_r = (p_r^2 + e_{r+1}^2)^{1/2}, so that c_r^2 + s_r^2 = 1. From (11) and the first equation in (12) we obtain the recurrence relation

δ_r c_r = c_{r-1} a_r - c_{r-2} (e_r b_{r-1})/δ_{r-1}.   (13)

Let τ_k = c_k prod_{i=1}^{k} δ_i; then (13) reduces to

τ_0 = 1,   τ_1 = a_1,
τ_r = a_r τ_{r-1} - (e_r b_{r-1}) τ_{r-2},   r = 2, 3, ..., n-1.   (14)

This can be expressed as the system of linear equations

$$\begin{bmatrix} 1 & & & & \\ -a_2 & 1 & & & \\ e_3 b_2 & -a_3 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & e_{n-1} b_{n-2} & -a_{n-1} & 1 \end{bmatrix} \begin{bmatrix} \tau_1 \\ \tau_2 \\ \tau_3 \\ \vdots \\ \tau_{n-1} \end{bmatrix} = \begin{bmatrix} a_1 \\ -e_2 b_1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$

The matrix of coefficients can be formed in one step using (n-2) processors. The system itself can be solved [4] in (3 log n - 1) steps using no more than 4n processors; thus obtaining the vector τ requires T_{p_1} = 3 log n steps with p_1 = 4n processors. Since c_k^2 + s_k^2 = 1, then defining θ_k = prod_{i=1}^{k} δ_i we obtain the recurrence relations

θ_0 = τ_0 = 1,
θ_k^2 = e_{k+1}^2 θ_{k-1}^2 + τ_k^2,   k = 1, 2, ..., n-1,   (15)

where we have used (11). The corresponding system of equations is

$$\begin{bmatrix} 1 & & & \\ -e_3^2 & 1 & & \\ & \ddots & \ddots & \\ & & -e_n^2 & 1 \end{bmatrix} \begin{bmatrix} \theta_1^2 \\ \theta_2^2 \\ \vdots \\ \theta_{n-1}^2 \end{bmatrix} = \begin{bmatrix} \tau_1^2 + e_2^2 \\ \tau_2^2 \\ \vdots \\ \tau_{n-1}^2 \end{bmatrix}.$$

The matrix of coefficients and the right-hand side can be formed in one step using 2(n-1) processors, and the bidiagonal system is solved in 2 log n steps with (n-1) processors. Thus the elements θ_i^2, i = 1, 2, ..., n-1, can be obtained in T_{p_2} = 1 + 2 log n steps with p_2 = 2(n-1) processors. Now we obtain c_j^2, δ_j^2, and s_j^2 from

c_j^2 = τ_j^2/θ_j^2,   δ_j^2 = θ_j^2/θ_{j-1}^2,   s_j^2 = e_{j+1}^2/δ_j^2   (16)

in two steps and one square root using 3(n-1) processors. Once these are obtained, the elements β_j, γ_j, and p_j in (12) of the upper triangular matrix are obtained in three steps using 2n processors. Thus the total time required for obtaining the upper triangular matrix and the rotations (c_j, s_j), j = 1, 2, ..., n-1, is T_p = 6 + 5 log n steps and one square root, using no more than 4n processors.

It can be shown [9] that the orthogonal matrix Q is of the form

$$Q(u,v,w) = \begin{bmatrix} u_1 v_1 & w_1 & & & \\ u_2 v_1 & u_2 v_2 & w_2 & & \\ \vdots & & \ddots & \ddots & \\ u_{n-1} v_1 & u_{n-1} v_2 & \cdots & u_{n-1} v_{n-1} & w_{n-1} \\ u_n v_1 & u_n v_2 & \cdots & u_n v_{n-1} & u_n v_n \end{bmatrix},$$

where the vectors u, v, and w are given by

u_j = c_j η_{j-1},   v_j = c_{j-1}/η_{j-1},   j = 1, 2, ..., n,
w_j = s_j,   j = 1, 2, ..., n-1,   (17)

with c_0 = c_n = 1 and

η_0 = 1,   η_j = (-1)^j prod_{k=1}^{j} s_k,   j = 1, 2, ..., n-1.   (18)

The matrix Q is not formed explicitly, but is represented by the vectors u, v, and w. The vector η is obtained in log n steps using (n-2)/2 processors; hence u and v are obtained in one step using (2n-3) processors. We are now in a position to introduce the following theorem.

Theorem 5. Let A be a nonsingular tridiagonal matrix of order n = 2^m, for a positive integer m. Then the system of equations Ax = f may be solved using the orthogonal factorization QA = R, where R is upper triangular of bandwidth 3, in T_p = 10 log n + O(1) steps with p = 4n processors. The resulting speedup and efficiency are of the same order as in Theorem 3.

Proof: The determination of R and Q requires (7 + 6 log n) steps and one square root using no more than 4n processors. Now the system to be solved is Rx = Qf, and f̃ = Qf may be expressed as

f̃ = (U L̃ V + W) f,

where U = diag(u_i), V = diag(v_i), W is the matrix whose only nonzero entries are w_1, w_2, ..., w_{n-1} on the superdiagonal, and L̃ is the lower triangular matrix with ℓ_{ij} = 1 for i >= j.
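To make these recurrences concrete, the following sketch (ours) accumulates τ and θ^2 serially and recovers the rotations via (16); the paper instead obtains them in O(log n) steps by solving the two banded systems displayed above.

    import numpy as np

    def tridiag_givens_cs(a, b, e):
        # Serial illustration of recurrences (14)-(16).  a is the
        # diagonal (length n), b the superdiagonal and e the subdiagonal
        # (both length n-1; e[0] stands for e_2, the element in position
        # (2, 1)).  Returns c_j, s_j for j = 1, ..., n-1.
        n = len(a)
        tau = [1.0, float(a[0])]              # tau_0 = 1, tau_1 = a_1
        for r in range(2, n):                 # eq. (14)
            tau.append(a[r-1]*tau[r-1] - e[r-2]*b[r-2]*tau[r-2])
        th2 = [1.0]                           # theta_0^2 = 1
        for k in range(1, n):                 # eq. (15)
            th2.append(e[k-1]**2 * th2[k-1] + tau[k]**2)
        # eq. (16): c_j = tau_j/theta_j, s_j = e_{j+1} theta_{j-1}/theta_j
        c = np.array([tau[j] / np.sqrt(th2[j]) for j in range(1, n)])
        s = np.array([e[j-1] * np.sqrt(th2[j-1] / th2[j]) for j in range(1, n)])
        return c, s

    # e.g. for A = tridiag(e = (1, 1), a = (2, 3, 4), b = (1, 1)):
    c, s = tridiag_givens_cs([2.0, 3.0, 4.0], [1.0, 1.0], [1.0, 1.0])
    # c[0], s[0] = 2/sqrt(5), 1/sqrt(5): the rotation annihilating A[1,0]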