Report No. UIUCDCS-R-75-701
NSF-OCA-GJ-36936-000009

LINEAR SYSTEM SOLVERS FOR PARALLEL COMPUTERS

by

A. H. Sameh and D. J. Kuck

February 1975

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

A. H. Sameh: Department of Computer Science, and Center for Advanced Computation, University of Illinois, Urbana, Illinois 61801.
D. J. Kuck: Department of Computer Science, University of Illinois, Urbana, Illinois 61801.

This work was supported in part by NSF Grant GJ-36936, and in part by the Advanced Research Projects Agency of the Department of Defense under Contract No. DAHC04-72-C-0001.

Abstract

In this paper we consider the direct solution of dense and tridiagonal linear systems of n equations on parallel computers, using the LDU factorization and the square-root-free Givens triangularization. We show that a dense system may be solved, using LDU factorization without pivoting or Givens reduction, in T_p = O(n) steps with p = O(n^2) processors. The resulting speedup over the sequential algorithm is O(n^2), with an efficiency independent of n. If pivoting is required for a stable LDU factorization, the system can be solved in T_p = O(n log_2 n) steps with p = O(n^2) processors; in this case we obtain a speedup of O(n^2/log_2 n) and an efficiency of O(1/log_2 n). Using Givens reduction or LU factorization without pivoting, a tridiagonal system may be solved in T_p = O(log_2 n) steps with p = O(n) processors, resulting in a speedup of O(n/log_2 n) and an efficiency of O(1/log_2 n).

Keywords: efficiency of a parallel computation; Givens triangularization; LDU factorization; linear systems; orthogonal transformations; parallel computers; speedup of a parallel computation.

1. Introduction

Solving a system of linear equations is one of the most frequently encountered problems in numerical computing, and several traditional algorithms are available for solving such systems. Using a serial computer, a system of order n can be solved by Gaussian elimination or LU factorization using (1/3)n^3 + O(n^2) multiplications, while Givens triangularization requires (4/3)n^3 + O(n^2) multiplications and (1/2)n^2 square roots. In this paper we consider parallel versions of the LU decomposition and Givens reduction algorithms for both dense and tridiagonal systems of equations. For each algorithm discussed we give bounds on the time and the number of processors required to achieve that time.

The following definitions and assumptions hold throughout this paper. If p processors are used, p > 1, we denote the computation time by T_p. Thus the speedup of a parallel algorithm which uses p processors over a serial algorithm is given by S_p = T_1/T_p >= 1. We define the efficiency of such a computation as E_p = S_p/p <= 1; this can be regarded as the ratio of the achieved speedup to the maximum possible speedup using p processors.
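As a small numerical illustration of these definitions (the figures below are hypothetical, not taken from this paper):

    def speedup_efficiency(T1, Tp, p):
        # S_p = T1/Tp and E_p = S_p/p, as defined above
        Sp = T1 / Tp
        return Sp, Sp / p

    # hypothetical figures: 10**6 serial steps, 10**4 parallel steps
    # on 256 processors give S_p = 100 and E_p = 0.390625
    print(speedup_efficiency(10**6, 10**4, 256))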
We also assume that (i) any number of processors may be used at any time, though we give bounds on this number, (ii) each processor can perform any of the four arithmetic operations in one time step, and (iii) there are no memory or data alignment time penalties.

In Section 2 we discuss the LDU decomposition algorithm. We show that irreducible, diagonally dominant linear systems of order n may be solved in T_p = O(n) steps using p = O(n^2) processors, while for an arbitrary real nonsingular matrix, p = O(n^2) and T_p = O(n log n) steps; the extra time is required for pivoting. In Theorem 3 we give a new result for solving tridiagonal systems: a system of order n may be solved in T_p = 7 log n steps using no more than p = 4n processors if pivoting is not required. This is an improvement on the results in [1].

Section 3 contains our main results. We adapt the Givens process for parallel computation so that any nonsingular linear system of order n can be solved in T_p = O(n) steps using p = O(n^2) processors. Since T_1 = O(n^3), we achieve a speedup S_p = O(n^2) with an efficiency which is independent of n. This is a substantial improvement over the LDU decomposition with pivoting. Finally, we show that any nonsingular tridiagonal system of order n can be solved in T_p = 10 log n + O(1) steps with p = 4n. This algorithm is more stable than the LU decomposition of Section 2 without pivoting, in cases where pivoting is necessary for a stable LU factorization, and is also faster than the result of [1]. Throughout the paper we denote log_2 x by log x.

2. LDU Decomposition

In this section we consider the parallel solution of dense and tridiagonal systems of equations. For dense systems we discuss both cases, in which the matrix of coefficients is (i) irreducible and diagonally dominant, so that pivoting is not required [2], or (ii) an arbitrary real nonsingular matrix, for which pivoting is usually required for a stable decomposition. An algorithm for the LDU factorization in each of the two cases is presented in Theorems 1 and 2, respectively. Lemma 1 can be used in either case to obtain the solution by forward and backward substitution. It should be noted that the time and processors required in Lemma 1 (T_p = O(n)) do not change the order of the results of Theorem 1 (T_p = O(n)) or Theorem 2 (T_p = O(n log n)); hence the results of the theorems can be taken as time and processor bounds for linear system solvers of this type. We also note that if a linear system, once factored, is to be solved for a number of different right-hand sides (not all at the same time), then the time and processors of Lemma 1 are all we need for each right-hand side. In Theorem 3 we present an algorithm for solving tridiagonal systems; since the decomposition and the forward and backward substitutions all require the same order of processors and time, we present these as one theorem, namely T_p = 7 log n.

We first consider the case of dense, irreducible, and diagonally dominant systems.

Theorem 1. Let A be an irreducible, diagonally dominant matrix of order n. Then it can be factored in the form A = LDU in T_p = O(n) steps using p = O(n^2) processors, where L and U are unit lower and upper triangular matrices, respectively, and D is a diagonal matrix with nonzero entries. Thus we obtain a speedup S_p = O(n^2) with an efficiency 2/9 < E_p <= 2/3.
Proof: Let L = [l_1, l_2, ..., l_n] and U^t = [u_1, u_2, ..., u_n], where

l_k^t = (0, ..., 0, 1, l_{k+1,k}, ..., l_{nk}),   u_k^t = (0, ..., 0, 1, u_{k,k+1}, ..., u_{kn}).

Hence

A = A_1 = sum_{i=1}^{n} d_i l_i u_i^t,

in which d_i is the i-th element of D. At the beginning of the k-th stage we have the elements (d_1, d_2, ..., d_{k-1}), the column vectors [l_1, l_2, ..., l_{k-1}] and [u_1, u_2, ..., u_{k-1}], and the matrix

A_k = A - sum_{i=1}^{k-1} d_i l_i u_i^t,

whose first (k-1) rows and columns are zero. Thus

d_k = a_{kk}^{(k)},   l_{rk} = a_{rk}^{(k)}/d_k,   u_{kr} = a_{kr}^{(k)}/d_k,   r = k+1, k+2, ..., n,

are obtained by one division using 2(n-k) processors. We then obtain the matrix (d_k l_k)u_k^t in one multiplication using (n-k)^2 processors, and A_{k+1} = A_k - d_k l_k u_k^t in one subtraction using the same number of processors. Therefore, three time steps are required at the k-th stage, using max{2(n-k), (n-k)^2} processors. Hence the whole factorization requires T_p = 3(n-1) steps with p = (n-1)^2 processors for n >= 3. It can be shown that the sequential algorithm requires T_1 = 2n(n^2-1)/3 steps. Thus we obtain a speedup S_p = 2n(n+1)/9 with a corresponding efficiency 2/9 < E_p <= 2/3, where the upper limit is achieved for n = 3.

If A is symmetric positive definite, pivoting is again not necessary [3], and A can be factored in 3(n-1) steps, as in Theorem 1, using no more than n(n-1)/2 processors. So again we obtain a speedup of O(n^2), with an efficiency greater than 2/9.

For a general nonsingular matrix, a pivoting strategy is usually necessary for the numerical stability of the decomposition. Adopting a partial pivoting strategy, we show that the decomposition can be performed in O(n log n) steps.

Theorem 2. Let A be a nonsingular matrix of order n = 2^m, for a positive integer m. Then it can be factored in the form PA = LDU in T_p = O(n log n) steps using p = O(n^2) processors, where L and U are unit lower and upper triangular matrices, respectively, D is a diagonal matrix with nonzero entries, and P is a permutation matrix. Thus we obtain a speedup S_p = O(n^2/log n) and an efficiency E_p = O(1/log n).

Proof: The proof is similar to that of Theorem 1. At the k-th stage, k = 1, 2, ..., n-1, we require three steps as before using (n-k)^2 processors, in addition to ceil(log(n-k+1)) steps for searching a column vector of (n-k+1) elements for the pivot using ceil((n-k+1)/2) processors. Hence the factorization can be obtained in T_p <= 3(n-1) + n log n steps using p = (n-1)^2 processors. Ignoring the time required for pivoting, the sequential algorithm requires T_1 = (2/3)n(n^2-1) steps. Thus, for large n, we obtain a speedup S_p ≈ 2n^2/(3 log n) with an efficiency E_p ≈ 2/(3 log n).

Next we give a lemma for the forward and backward substitutions, which are used to obtain a solution once the matrix of coefficients is factored.

Lemma 1. Given the LDU factorization of PA, the linear system of equations Ax = f can be solved in T_p = O(n) steps using p = O(n) processors, with a speedup S_p = O(n) and an efficiency E_p = O(1).

Proof: The solution vector x can be obtained by sequentially solving the three systems Lz = Pf, Dy = z, and Ux = y. The lower triangular system can be written as

sum_{i=1}^{n} z_i l_i = Pf = f^(1);

hence z is given by z_k = f_k^(k) (k = 1, 2, ..., n), where

f^(j+1) = f^(j) - z_j l_j,   j = 1, 2, ..., n-1.

Therefore each of the triangular systems may be solved in 2(n-1) steps using (n-1) processors. Since Dy = z can be solved in one step using n processors, the total time for solving Ax = f, using p = n processors, is T_p = 4n - 3. The sequential algorithm requires T_1 = n(2n-1) steps; hence we obtain a speedup S_p ≈ n/2 and an efficiency E_p ≈ 1/2.
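The proofs above translate directly into code. The following Python sketch (ours, with illustrative names; the report itself gives no programs) simulates the stages of Theorem 1 and the column sweeps of Lemma 1 serially; each commented step would be a single parallel time step in the model of Section 1.

    import numpy as np

    def ldu(A):
        # Stage-by-stage rank-one elimination as in the proof of
        # Theorem 1.  Each iteration is one stage: a parallel division
        # step, a multiplication step, and a subtraction step on up to
        # (n-k)^2 processors.  No pivoting, so A is assumed irreducible
        # and diagonally dominant (or symmetric positive definite).
        A = np.asarray(A, dtype=float).copy()
        n = A.shape[0]
        L, U, d = np.eye(n), np.eye(n), np.zeros(n)
        for k in range(n):
            d[k] = A[k, k]                      # d_k = a_kk^(k)
            L[k + 1:, k] = A[k + 1:, k] / d[k]  # one division step
            U[k, k + 1:] = A[k, k + 1:] / d[k]
            A[k + 1:, k + 1:] -= d[k] * np.outer(L[k + 1:, k], U[k, k + 1:])
        return L, d, U

    def solve_ldu(L, d, U, f):
        # Lemma 1: forward column sweeps, one diagonal division,
        # backward column sweeps; 4n - 3 steps with n processors.
        n = len(d)
        z = np.asarray(f, dtype=float).copy()
        for j in range(n - 1):
            z[j + 1:] -= z[j] * L[j + 1:, j]    # two steps per column
        x = z / d                               # one step, n processors
        for j in range(n - 1, 0, -1):
            x[:j] -= x[j] * U[:j, j]
        return x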
We have shown [4,5] that a triangular system of equations may be solved in O(log^2 n) steps using O(n^3) processors. Since we are dealing here with dense matrices of moderate size (n <= 1000, say), the overall process remains essentially of O(n), dominated by the factorization, even if such a triangular solver were used; hence, for higher efficiency, we use the above algorithm, which requires only n processors.

For tridiagonal systems of equations it has been shown [1] that the direct LU factorization of the matrix of coefficients and the solution of the resulting system can be obtained in 17 ceil(log n) steps using n processors. We next give a smaller upper bound on T_p for solving such systems. As a further comparison with the method of [1], we note that using Theorem 3 and n processors, a tridiagonal system can be solved without pivoting in just 16 log n steps.

Theorem 3. Let A be a nonsingular tridiagonal matrix of order n = 2^m, for a positive integer m. Then the system of equations Ax = f may be solved using the LU factorization A = LU, where L is unit lower triangular and U is upper triangular, in T_p = 7 log n steps using p = 4n processors. This results in a speedup S_p = O(n/log n) with an efficiency E_p = O(1/log n).

Proof: It has been shown in [4] that any m-th order linear recurrence system of n equations, or any unit triangular system of order n and bandwidth (m+1), may be solved in

T = (2 + log m) log n - (1/2)(log^2 m + log m)

steps with p = O(m^2 n) processors. In [1] it was shown that the LU factorization of A is essentially equivalent to solving a second-order (m = 2) linear recurrence system of n equations. Thus the factorization can be obtained in T_{p_1} = 3 log n - 1 steps with p_1 = 4n. Each of the forward and backward sweeps requires the solution of a first-order linear recurrence system of n equations; hence the time required for these sweeps is T_{p_2} = 1 + 4 log n with p_2 = n processors. Therefore, the total time for solving the tridiagonal system of equations is T_p = T_{p_1} + T_{p_2} = 7 log n steps with p = 4n processors. This results in a speedup S_p = 8(n-1)/(7 log n) with an efficiency E_p ≈ 2/(7 log n).

Note that if pivoting is necessary for a stable LU factorization, then the scheme in [1] is not satisfactory. In the following section we introduce another algorithm that is more stable.
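For concreteness, the following serial simulation (our own sketch, not the algorithm of [1] or [4]) shows the kind of first-order recurrence solver such sweeps rely on; each pass of the loop corresponds to a constant number of parallel steps, which is where the O(log n) totals come from.

    import numpy as np

    def first_order_recurrence(a, f):
        # Solve x_1 = f_1, x_k = a_k x_{k-1} + f_k (k = 2, ..., n) by
        # recursive doubling: log2(n) sweeps, and within a sweep every
        # element can be updated simultaneously, i.e., each assignment
        # below is one parallel step.  The m-th order analogue of this
        # idea underlies the bounds quoted from [1,4].
        x = np.asarray(f, dtype=float).copy()
        a = np.asarray(a, dtype=float).copy()   # a[0] is unused
        n, d = len(x), 1
        while d < n:
            x[d:] = x[d:] + a[d:] * x[:-d]      # fold in the term d back
            a[d:] = a[d:] * a[:-d]              # coefficient now spans 2d
            d *= 2
        return x

    # x_k = 2 x_{k-1} + 1 with x_1 = 1 gives 1, 3, 7, 15:
    print(first_order_recurrence([0, 2, 2, 2], [1, 1, 1, 1]))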
3. Givens Triangularization

In this section we consider the parallel solution of a general linear system using an orthogonal factorization of the matrix of coefficients. Theorem 4 shows that with p = O(n^2) processors we can triangularize any real nonsingular matrix in T_p = O(n) steps using a modified Givens process; thus we achieve a speedup S_p = O(n^2) and an efficiency independent of n. Lemma 2 shows that, after the orthogonal factorization of the matrix of coefficients, we can solve the linear system with n processors in O(n) steps. Thus Theorem 4 determines the order of time and processors for linear system solving using Givens reduction. Again we note that if a linear system is to be solved for various right-hand sides, we use the time and processors of Theorem 4 once, and of Lemma 2 for each right-hand side. Finally, Theorem 5 shows that any tridiagonal system of equations may be solved using Givens reduction in T_p = 10 log n + O(1) steps with p = 4n processors.

Now we consider the orthogonal factorization of a real nonsingular matrix A,

QA = R,   (1)

where Q is an orthogonal matrix formed as the product of plane rotations, and R is upper triangular. This standard Givens process requires roughly 4n^3/3 multiplications and n^2/2 square roots, where n is the order of the matrix A. Recently [6,7] it has been shown that the matrices D and R in the modified Givens reduction,

QA = D^{1/2} R,   (2)

can be obtained using half the multiplications of the standard Givens process and no square roots, provided Q is not computed. We will use this modified process in our parallel algorithm to avoid a square root evaluation after every transformation. First we describe a numerically stable version of this square-root-free Givens reduction. A Givens rotation applied to the two rows

(sqrt(k) u_1, sqrt(k) u_2, ..., sqrt(k) u_t)   and   (sqrt(ℓ) v_1, sqrt(ℓ) v_2, ..., sqrt(ℓ) v_t),

so as to annihilate the first element of the second row, produces the rows

(sqrt(p) ũ_1, sqrt(p) ũ_2, ..., sqrt(p) ũ_t)   and   (0, sqrt(q) ṽ_2, ..., sqrt(q) ṽ_t),

and the rotation is determined by

c^2 = 1/(1 + αβ),   s^2 = αβ/(1 + αβ),   (3)

where

α = ℓ v_1/(k u_1),   β = v_1/u_1.   (4)

Thus,

ũ_r = u_r + (s sqrt(ℓ))/(c sqrt(k)) v_r,   ṽ_r = -(s sqrt(k))/(c sqrt(ℓ)) u_r + v_r.   (5)

Simplifying (5) by means of (3) and (4), we obtain

ũ_r = u_r + α v_r,   ṽ_r = -β u_r + v_r.   (5')

There are several choices for p and q; one such choice is

p = c^2 k   and   q = c^2 ℓ.   (6)

Furthermore,

p = k/(1 + αβ)   and   q = ℓ/(1 + αβ).   (6')

Unfortunately, however, the method suffers from the possibility of underflow in p and q. This may be avoided by interchanging the two rows under consideration. On a serial computer, evaluating ũ_r, ṽ_r (r = 1, 2, ..., t), p, and q requires (4t+5) steps, while on a parallel computer this process requires four steps with (2t+5) processors, as shown in Figure 1.

[Figure 1: the four-step parallel evaluation of ũ_r and ṽ_r (r = 1, 2, ..., t), together with p = k^2 u_1^2 (k u_1^2 + ℓ v_1^2)^{-1} and q = ℓ k u_1^2 (k u_1^2 + ℓ v_1^2)^{-1}.]
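In code, one such square-root-free rotation might look as follows (a sketch under our own naming; the scale factors k and ℓ are carried alongside the rows, as in the text):

    import numpy as np

    def fast_givens(u, ku, v, lv):
        # One square-root-free rotation, eqs. (3)-(6').  The two working
        # rows are stored in factored form sqrt(ku)*u and sqrt(lv)*v; the
        # step annihilates the first element of the second row without
        # taking any square root.  Assumes u[0] != 0; if p or q threatens
        # to underflow, the rows should be interchanged first, as noted
        # above.
        u = np.asarray(u, dtype=float)
        v = np.asarray(v, dtype=float)
        alpha = lv * v[0] / (ku * u[0])        # eq. (4)
        beta = v[0] / u[0]
        denom = 1.0 + alpha * beta
        p, q = ku / denom, lv / denom          # eq. (6')
        u_new = u + alpha * v                  # eq. (5'): one parallel
        v_new = -beta * u + v                  #   step each, t processors
        return u_new, p, v_new, q              # v_new[0] vanishes

    # consistency check against the ordinary rotation: the rotated pair
    # preserves the first-column norm, p*un[0]**2 == ku*u0^2 + lv*v0^2
    u, v, ku, lv = [3.0, 1.0, 2.0], [4.0, 5.0, 6.0], 2.0, 0.5
    un, p, vn, q = fast_givens(u, ku, v, lv)
    assert abs(p * un[0]**2 - (ku * u[0]**2 + lv * v[0]**2)) < 1e-12
    assert abs(vn[0]) < 1e-12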
In fact, we can apply more than one rotation at a time in triangularizing A. Similar to the parallel Jacobi method [8], several schemes are possible. We present an annihilation scheme that requires O(n) steps to produce D, R, and Q in (2). Denoting each element annihilated in the k-th transformation by the integer k, the annihilation regime is shown in Figure 2.

[Figure 2: the annihilation regime for n = 16. The integer k in position (i, j) below the diagonal indicates that the element is annihilated by the k-th transformation; k runs from 1 at position (n, 1) to 2n-3 = 29 at position (n, n-1).]

Let us denote by (i, î, j) that rotation which rotates rows i and î, annihilating the element in position (î, j). The k-th transformation consists of the rotations (i, i+1, j), where i and j are given by the sequences:

(i) for k = 1, 2, ..., n-1:

    i = {(n-k), (n-k+2), ..., (n-1-s(k))},   j = {1, 2, ..., ceil(k/2)};   (7)

(ii) for k = n, n+1, ..., 2n-3:

    i = {(k-n+2), (k-n+4), ..., (n-1-s(k))},   j = {(k-n+2), (k-n+3), ..., ceil(k/2)};   (8)

where s(k) = 0 for k odd and s(k) = 1 for k even.

Since it is only an odd-numbered transformation that introduces a zero in the last row, the total number of required transformations is (2n-3). Hence, in computing D^{1/2} and R in (2), we require T_{p_1} = 4(2n-3) steps and one square root, with the number of processors not exceeding

$$p_1 = \sum_{t=n-\lceil n/2 \rceil + 1}^{n} (2t+3) = \lceil n/2 \rceil \bigl( 2(n+2) - \lceil n/2 \rceil \bigr).$$

For n = 2^m, where m is a positive integer, p_1 = (3/4)n^2 + 2n. The orthogonal matrix Q is given by Q = Q_{2n-3} ... Q_2 Q_1, where each transformation Q_k is the direct product of the rotations (i, i+1, j) given by the sequences (7) and (8); that is, Q_k is a block-diagonal matrix consisting of 2 x 2 diagonal submatrices. Thus, while we do not form Q explicitly, Q consists of a total of n(n-1)/2 rotations, all of which can be evaluated from (3) in T_{p_2} = 3 steps plus one square root, using p_2 = n(n-1) processors. Therefore, the total parallel time for obtaining D^{1/2}, R, and Q is given by T_p = T_{p_1} + T_{p_2} = (8n-9) steps and two square roots, with p = n(n-1) processors.

On a serial machine, the time required for obtaining D^{1/2} and R in (2), i.e., evaluating (5') and (6'), is given by

$$T_1' = \sum_{t=2}^{n} (t-1)(4t+5) = n(n-1)(8n+23)/6$$

steps and n square roots, while the time required for evaluating the n(n-1)/2 rotations in Q is T_1'' = 7n(n-1)/2 steps and n(n-1) square roots, as can be shown from (3) and (4). Thus the total sequential time is T_1 = T_1' + T_1'' = 2n(n-1)(2n+11)/3 steps and n^2 square roots, and we achieve a speedup S_p ≈ n^2/6 with an efficiency 1/6 < E_p <= 1/3. Hence we have proved the following theorem.

Theorem 4. Let A be a nonsingular matrix of order n. Then it can be factored in the form QA = D^{1/2} R, where D^{1/2}, R, and Q_i (i = 1, 2, ..., 2n-3) may be evaluated in T_p = O(n) steps and two square roots using p = O(n^2) processors. This results in a speedup S_p = O(n^2) with an efficiency 1/6 < E_p <= 1/3.

We have also developed an annihilation scheme that requires 2(n-1) - log n transformations instead of (2n-3); however, the relations between i, î, and j are more complicated than those in (7) and (8).

Lemma 2. Given the orthogonal factorization QA = D^{1/2} R, i.e., D, R, and the Q_i, the linear system of equations Ax = f may be solved in T_p = O(n) steps and one square root using no more than p = n processors. Hence we obtain a speedup S_p = O(n) with an efficiency 4/9 < E_p <= 6/7.

Proof: From (2) we have

D^{1/2} R x = Qf.   (9)

Since the matrices Q_i in Q = Q_{2n-3} ... Q_2 Q_1 are block diagonal, the right-hand side f̃ = Qf may be computed sequentially,

f^(i+1) = Q_i f^(i),   i = 1, 2, ..., 2n-3,   (10)

where f^(1) = f and f^(2n-2) = f̃. Using n processors, we require three steps for each f^(i+1); thus f̃ may be computed in 3(2n-3) steps using n processors. Solving for z in D^{1/2} z = f̃ requires one division with n processors. Finally, the solution vector x in Rx = z can be obtained in 3(n-1) steps using (n-1) processors (note that R is not unit upper triangular). Thus the total time for solving (9) is given by T_p = (9n-11) steps with p = n processors. The sequential algorithm requires T_1 = 2n(2n-1) steps. Thus we obtain a speedup S_p ≈ 4n/9 with an efficiency 4/9 < E_p <= 6/7.
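As an illustration of this solution phase, the following sketch (ours, assuming the rotations are stored as (i, c, s) triples) applies the transformations of (10) and then performs the column-oriented back substitution; the rotations within one Q_i act on disjoint row pairs, which is what makes each application of (10) cost three parallel steps.

    import numpy as np

    def apply_Q(transformations, f):
        # f~ = Qf via (10): apply Q_1, ..., Q_{2n-3} in order.  Each
        # transformation is a list of (i, c, s) triples, a rotation of
        # the disjoint row pair (i, i+1); since the pairs are disjoint,
        # one transformation costs three parallel steps on n processors.
        f = np.asarray(f, dtype=float).copy()
        for rotations in transformations:
            for i, c, s in rotations:      # independent: parallelizable
                fi, fi1 = f[i], f[i + 1]
                f[i], f[i + 1] = c * fi + s * fi1, -s * fi + c * fi1
        return f

    def solve_upper(R, z):
        # Back substitution for Rx = z by column sweeps, as in Lemma 1,
        # plus a division since R is not unit triangular: roughly
        # 3(n-1) steps with (n-1) processors in the model above.
        R = np.asarray(R, dtype=float)
        x = np.asarray(z, dtype=float).copy()
        for j in range(len(x) - 1, -1, -1):
            x[j] /= R[j, j]
            x[:j] -= x[j] * R[:j, j]
        return x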
Next we consider Givens reduction of tridiagonal matrices. Let

$$A = \begin{bmatrix} a_1 & b_1 & & & \\ e_2 & a_2 & b_2 & & \\ & \ddots & \ddots & \ddots & \\ & & e_{n-1} & a_{n-1} & b_{n-1} \\ & & & e_n & a_n \end{bmatrix}$$

be of order n = 2^m, where m is a positive integer. We annihilate the subdiagonal elements of A by applying the orthogonal transformation Q = Q_{n-1} ... Q_2 Q_1, where each Q_j is a Givens transformation rotating rows j and j+1 such that the element in position (j+1, j) is eliminated; Q_j coincides with the identity matrix except in rows j and j+1, where it has the 2 x 2 block with rows (c_j, s_j) and (-s_j, c_j). Thus, after step (r-1), Q_{r-1} Q_{r-2} ... Q_1 A is given by

$$\begin{bmatrix} p_1 & \beta_1 & \gamma_1 & & & \\ & \ddots & \ddots & \ddots & & \\ & & p_{r-1} & \beta_{r-1} & \gamma_{r-1} & \\ & & & p_r & c_{r-1} b_r & \\ & & & e_{r+1} & a_{r+1} & b_{r+1} \\ & & & & \ddots & \ddots \end{bmatrix}$$

and by induction it follows that

c_r = p_r/δ_r,   s_r = e_{r+1}/δ_r,   (11)

p_r = c_{r-1} a_r - c_{r-2} s_{r-1} b_{r-1},
β_r = s_r a_{r+1} + c_{r-1} c_r b_r,      r = 1, 2, ..., n-1,   (12)
γ_r = s_r b_{r+1},

where c_0 = 1, s_0 = 0, and δ_r = (p_r^2 + e_{r+1}^2)^{1/2}, so that c_r^2 + s_r^2 = 1. From (11) and the first equation in (12) we obtain the recurrence relation

δ_r c_r = c_{r-1} a_r - c_{r-2} (e_r b_{r-1})/δ_{r-1}.   (13)

Let τ_k = c_k prod_{i=1}^{k} δ_i; then (13) reduces to

τ_0 = 1,   τ_1 = a_1,
τ_r = a_r τ_{r-1} - (e_r b_{r-1}) τ_{r-2},   r = 2, 3, ..., n-1.   (14)

This can be expressed as the system of linear equations

$$\begin{bmatrix} 1 & & & & \\ -a_2 & 1 & & & \\ e_3 b_2 & -a_3 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & e_{n-1} b_{n-2} & -a_{n-1} & 1 \end{bmatrix} \begin{bmatrix} \tau_1 \\ \tau_2 \\ \tau_3 \\ \vdots \\ \tau_{n-1} \end{bmatrix} = \begin{bmatrix} a_1 \\ -e_2 b_1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$

The matrix of coefficients can be formed in one step using (n-2) processors. The system itself can be solved [4] in (3 log n - 1) steps using no more than 4n processors; thus obtaining the vector τ requires T_{p_1} = 3 log n steps with p_1 = 4n processors. Since c_k^2 + s_k^2 = 1, then defining θ_k = prod_{i=1}^{k} δ_i we obtain the recurrence relations

θ_0 = τ_0 = 1,
θ_k^2 = e_{k+1}^2 θ_{k-1}^2 + τ_k^2,   k = 1, 2, ..., n-1,   (15)

where we have used (11). The corresponding system of equations is

$$\begin{bmatrix} 1 & & & \\ -e_3^2 & 1 & & \\ & \ddots & \ddots & \\ & & -e_n^2 & 1 \end{bmatrix} \begin{bmatrix} \theta_1^2 \\ \theta_2^2 \\ \vdots \\ \theta_{n-1}^2 \end{bmatrix} = \begin{bmatrix} \tau_1^2 + e_2^2 \\ \tau_2^2 \\ \vdots \\ \tau_{n-1}^2 \end{bmatrix}.$$

The matrix of coefficients and the right-hand side can be formed in one step using 2(n-1) processors, and the bidiagonal system is solved in 2 log n steps with (n-1) processors. Thus the elements θ_i^2, i = 1, 2, ..., n-1, can be obtained in T_{p_2} = 1 + 2 log n steps with p_2 = 2(n-1) processors. Now we obtain c_j^2, δ_j^2, and s_j^2 from

c_j^2 = τ_j^2/θ_j^2,   δ_j^2 = θ_j^2/θ_{j-1}^2,   s_j^2 = e_{j+1}^2/δ_j^2   (16)

in two steps and one square root using 3(n-1) processors. Once these are obtained, the elements β_j, γ_j, and p_j in (12) of the upper triangular matrix are obtained in three steps using 2n processors. Thus the total time required for obtaining the upper triangular matrix and the rotations (c_j, s_j), j = 1, 2, ..., n-1, is T_p = 6 + 5 log n steps and one square root, using no more than 4n processors.

It can be shown [9] that the orthogonal matrix Q is of the form

$$Q(u,v,w) = \begin{bmatrix} u_1 v_1 & w_1 & & & \\ u_2 v_1 & u_2 v_2 & w_2 & & \\ \vdots & & \ddots & \ddots & \\ u_{n-1} v_1 & u_{n-1} v_2 & \cdots & u_{n-1} v_{n-1} & w_{n-1} \\ u_n v_1 & u_n v_2 & \cdots & u_n v_{n-1} & u_n v_n \end{bmatrix},$$

where the vectors u, v, and w are given by

u_j = c_j η_{j-1},   v_j = c_{j-1}/η_{j-1},   j = 1, 2, ..., n,
w_j = s_j,   j = 1, 2, ..., n-1,   (17)

with c_0 = c_n = 1 and

η_0 = 1,   η_j = (-1)^j prod_{k=1}^{j} s_k,   j = 1, 2, ..., n-1.   (18)

The matrix Q is not formed explicitly, but is represented by the vectors u, v, and w. The vector η is obtained in log n steps using (n-2)/2 processors; hence u and v are obtained in one step using (2n-3) processors. We are now in a position to introduce the following theorem.

Theorem 5. Let A be a nonsingular tridiagonal matrix of order n = 2^m, for a positive integer m. Then the system of equations Ax = f may be solved using the orthogonal factorization QA = R, where R is upper triangular of bandwidth 3, in T_p = 10 log n + O(1) steps with p = 4n processors. The resulting speedup and efficiency are of the same order as in Theorem 3.

Proof: The determination of R and Q requires (7 + 6 log n) steps and one square root using no more than 4n processors. Now the system to be solved is Rx = Qf, and f̃ = Qf may be expressed as

f̃ = (U L̃ V + W) f,

where U = diag(u_i), V = diag(v_i), W is the matrix whose only nonzero entries are w_1, w_2, ..., w_{n-1} on the superdiagonal, and L̃ is the lower triangular matrix with ℓ_{ij} = 1 for i >= j.
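To make these recurrences concrete, the following sketch (ours) accumulates τ and θ^2 serially and recovers the rotations via (16); the paper instead obtains them in O(log n) steps by solving the two banded systems displayed above.

    import numpy as np

    def tridiag_givens_cs(a, b, e):
        # Serial illustration of recurrences (14)-(16).  a is the
        # diagonal (length n), b the superdiagonal and e the subdiagonal
        # (both length n-1; e[0] stands for e_2, the element in position
        # (2, 1)).  Returns c_j, s_j for j = 1, ..., n-1.
        n = len(a)
        tau = [1.0, float(a[0])]              # tau_0 = 1, tau_1 = a_1
        for r in range(2, n):                 # eq. (14)
            tau.append(a[r-1]*tau[r-1] - e[r-2]*b[r-2]*tau[r-2])
        th2 = [1.0]                           # theta_0^2 = 1
        for k in range(1, n):                 # eq. (15)
            th2.append(e[k-1]**2 * th2[k-1] + tau[k]**2)
        # eq. (16): c_j = tau_j/theta_j, s_j = e_{j+1} theta_{j-1}/theta_j
        c = np.array([tau[j] / np.sqrt(th2[j]) for j in range(1, n)])
        s = np.array([e[j-1] * np.sqrt(th2[j-1] / th2[j]) for j in range(1, n)])
        return c, s

    # e.g. for A = tridiag(e = (1, 1), a = (2, 3, 4), b = (1, 1)):
    c, s = tridiag_givens_cs([2.0, 3.0, 4.0], [1.0, 1.0], [1.0, 1.0])
    # c[0], s[0] = 2/sqrt(5), 1/sqrt(5): the rotation annihilating A[1,0]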