LIBRARY OF THE UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 510.84 IJ*6r no .679-6 84- cop. 2. The person charging this material is re- sponsible for its return to the library from which it was withdrawn on or before the Latest Date stamped below. Theft, mutilation, and underlining of books are reasons for disciplinary action and may result in dismissal from the University. UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN NOV 5 OCT 13 -; RECD ,/.:■:; > ; i^4 L161 — O-1096 NSF - OCA - GJ-36936 - 000006 PARALLEL DIRECT POISSON AND BIHARMONIC SOLVERS by A. H. Sameh, S. C. Chen, and D. J. Kuck July 1974 The JAM 1 ' 19 75 ..tiiuis DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN URBANA, ILLINOIS Parallel Direct Poisson and Biharmonic Solvers by * ** ** A. H. Sameh , S. C. Chen , and D. J. Kuck +m, • This work was supported in part by NSF Grant GJ-36936; and in part by the Advanced Research Project Agency of the Department of Defense under Contract No. DAHC04-72-C-0001. * Department of Computer Science and the Center for Advanced Computation, University of Illinois, Urbana, Illinois 61801. ** Department of Computer Science, University of Illinois, Urbana, Illinois 61801, July 19, 1974 Digitized by the Internet Archive in 2013 http://archive.org/details/paralleldirectpo684same 1. INTRODUCTION Several algorithms have been developed for the direct solution of finite difference approximations of the Poisson and biharmonic equa- tions on a rectangle [l- 3 ], [k,5]. For square meshes of n x n points, the fastest available sequential algorithms require 0( n 2 log 2 n) time steps for the Poisson equation, and 0(n 3 log 2 n) time steps for the biharmonic equa- tion. Using as many parallel processors as necessary, one could hope to achieve a solution to either of these problems in 0(log 2 n) time steps. In fact, it follows from a simple fan in argument that any algorithm which requires the interaction of n 2 data points with each other cannot be com- puted in less than 21og n time steps. We are concerned here with parallel computation and shall mea- sure time in unit steps. If p processors are being used, then we denote the computation time as T p unit steps. Thus T ± is the time required by a serial machine. We define the speedup of a computation using p processors over the serial computation time as S = T /T > 1. Further, we define the efficiency of a p processor computation as E = S /p < 1. We also y p assume that: (i) any number of processors may be used at any time, but we will give bounds on this number, (ii) each processor may perform any of the four arithmetic operations in one time step, and (iii) there are no memory or data alignment time penalties. In this paper we develop a parallel algorithm for the Poisson equation which uses only 0(log 2 n) time steps and 0(n 2 ) processors. Thus 2 we achieve a speedup of 0(n ) over the fastest known sequential algorithms, at an efficiency which is independent of n. Furthermore, we present a parallel algorithm for the biharmonic equation which uses only 0(n) time 3 2 steps and 0(n ) processors for a speedup of 0(n log n) over the fastest known sequential algorithm. We also present a slower, more efficient 5 2 algorithm which uses 0(nlog n) time steps and only 0(n ) processors, for 2 a speedup of 0(n ). 2. PRELIMINARY RESULTS Now we introduce several lemmas which will be fundamental in obtaining the main results of this paper. Throughout section 2, we assume n = 2 for a positive integer k. Lemma 1 The coefficients y. , l2). Then the system Ax = f may be solved in T =2 log n steps using p = (n-l) processors with a speedup S = (n-l ) /log n and an efficiency E = l/log_n. This is a corollary of Theorem 2 in [l0]. Lemma k Let A be a nonsingular symmetric matrix of order n. Then it can be factored in the form A = LDL 11 in T = 3 (n-l) steps using p = n(n-l)/2 P processors, where L is unit lower triangular and D is a diagonal matrix with nonzero elements. Thus, we obtain a speedup S = n(2n+5)/18 with an 2 efficiency -5- < E < 1 . Proof Let L = U r * 2 ,...,* n ], where I* = (0.. .. ,0,1, Vl.k'W.k'" •• l n,k , « n t hence A = A - £ d.£.£.. At the beginning of the k-th stage we have (d ,d_,...,cL ), [Z- ,Jlp, . . . , JL _ ) , and the symmetric matrix A^ = A - E d. . Jl . Jl . i=l (k) (k) whose first (k-l) rows and columns are zero. Thus cL = a, , and I = a /cL (r = k+l,...,n) are obtained by one division using (n-k) processors. We then obtain the matrix cL JL JL in one multiplication using (n-k) (n-k+l)/2 processors, and A = A - till in one subtraction using the same number of processors. Therefore, 3 time-steps are required in the k-th stage using (n-k) (n-k+l)/2 processors. Hence, the whole factorization requires T = 3(n-l) P steps with p = n(n-l)/2 processors. The sequential algorithm requires n(n-l) (2n+5)/6 steps, thus we obtain the above-mentioned speedup and efficiency. 3. A POISSON SOLVER We consider the five-point finite difference approximation of the Poisson equation -V 2 !! = f (1) on a unit square. We superimpose a square grid over the unit square with a mesh size h = l/n and assume that log n = 2 for a positive integer k. In fact, the results of this paper hold for n = 2 , but for later notational simplicity we make the above restriction. Considering the periodic boundary conditions: u(x Q , y) = u(x n , y) u(x s y Q ) = u(x, y n ), (2) we are led to the linear system, where and M v = w , P M = A -I -I N A -I -I -I -I A -I -I A v = w t t = (v , w 2 , / t t t> (v x , v 2 , ..., v n ), . . . , w ), f 2 2 ^ (n x n ), (3) A = k -1 -1 \ 1 h -l \ s -i " N u S V V N S V -1 -1 (n x n . The matrix A has the eigenvalues A = k + 2cos— "— , (j = 0, 1, 2, ..., n-1 ) , and the corresponding eigenvectors are 1 t q o ^ (1, -1, 1, ..., -1), q£ - -~ (1, 1, 1, ..-, 1), 2 ^ t /2 , 2£tt kin on , q„ = /— (cos , cos , .... cos 2£tt)_ H V n n n ' and f2 , . 2£tt . kH-n — (sin , sin , sin 2&ir)* where £ = 1,2, ...,—- 1. Let Q be the eigenvector matrix of A, v. = Q v. , and w. = Q w. (i = 1, 2, . . . , n). Thus, the solution of system (3) can be reduced [3] to that of solving the n systems of linear equations. r .v, = v. (j = 0, 1, 2, ..., n-1), (k) where and r. = J A . -1 -1 -1 A . -1 \\ \ -1 A . -1 -1 -1 A. J -J v r ( v v j2' •■•• V' (nxn), in which v and w are the j-th components of v and w , respectively, algorithm proceeds as follows in three stages: The i) Compute w. = Q w. (i = 1, 2, ... , n), ii) Solve the n systems r.v, = w J J J iii) Compute v. = Qv. (j = 0, 1, 2, ..., n-1), (i = 1, 2, ... , n). With this theoretical "background for the Poisson equation, we now turn to the development of an algorithm which computes a solution in 0(log 9 n) time steps. This will be developed by analyzing p and T for stages i - iii above. The result is presented below as Theorem 1. We describe an algorithm that capitalizes on Lemmas i and 2 so as to require a time within a small multiple of the time required by a fan in 2 argument on n points. We give the number of time steps and processors re- quired in each stage for the major operations whose computation time depends on n. In other words we will ignore operations which can be computed in constant time. Figure 1, below, gives a time-processor chart for the parallel Poisson solver. r Tp 3 log 2 n 2 log 2 n jL l°g 2 n T 2 log 2 n f 3 log 2 n -jn/log n w. = Q w. l l z j a. J v. = Qv. 1 stage i, i=l, 2, ...,n stage ii, j = 0, 1, ..., n-1 stage iii, i = 1, 2, ..., n Figure 1 From the eigenvectors q and q , of the matrix A we see that v. = Q w. will give rise to discrete cosine and sine transforms each of n/2 1 1 real data points. By Corollary 1, each of these can be computed by an FFT algorithm of n/U complex data points; thus we can evaluate (w , ..., w lo , w , 9xl) ..., w ) in 3 log (n/U ) steps using n i,l i,n/£!-l i,n/^+± i,n-l 2 processors. Using (n/log n) additional processors, we can simultaneously evaluate w 1 n-1 , n-1 , = I (-1) J w. . and w •n j=0 ' d i,- /n j=0 Z w. . in (3 log p n - log p log p n) steps. Therefore, stages i and iii require 2 2 3 log_n steps each with (n + n /log_n) processors as shown in Fig. 1. In stage ii we solve the linear systems (h) using Toeplitz factorization in conjunction with the Woodburry formula, [ll]. The matrix T can be considered as the sum (B. + SV.), where J J j -1 X . -\ B. = \ J \ ! \\ \l -1 A. -1 -1 X. (nxn), oJ S = [e ,e ], and lu.-i o- t i J V. = j J -1 o- -1 i ! (2xn). The matrix B can then be factored into the product LU , where j J J ^j- 1 u. =! -l 1 , and L. W ■ -1 N S N s I y J ! 2 1 /2 in which y = (X /2) + ((X A) - l) , and for numerical stability of the factorization the sign is chosen such that |y.| = |x /2 1 + ((X./1+) -l ) J J J Thus, v. = rT w. 1/2 = uT^lT 1 * - l^s (i + v*u: 1 L: 1 sr :L v*u , : 1 L" 1 i,]. J J J J 2 J j J J J J j Solving the systems L.G. = S and U.H. = V. is rather straightforward; J J J J J and H j = [(P J% - y j le n»' "I lg j ] ' where t ,_ -1 -2 -n+h g = (1, y , y , ... , y ) J J J J We assume that such a vector may "be computed in constant time. The major operations in computing the v.'s are those involved in solving the systems J L.z. = w., and U.v. = a. , J J J J J J (5) in which a. = z. - G.(I„ + H*0.) 1 H^z. J J J2 jj jj (6) By Lemma 3 we see that solving each of the systems in (5) requires 2 log p n steps and (n-l) processors. Since g.g. = (l-y. )/(l-y. ), then the only J J J J major operation in (6) is that of computing the inner product g.z.. This J J requires log n steps and n processors. Stage ii, therefore, requires a 2 total of 5 log p n steps with n processors as shown in Fig. 1. 10 Thus the total number of time steps required for this algorithm 2 2 is 11 log n with n + n /log p n processors. Similarly we can show that for Neumann boundary conditions the same numbers of steps and processors are required, while for the Dirichlet boundary conditions we require only 2 10 log n steps with n processors, since stage ii needs only k log n steps. The following theorem is therefore established, Theorem 1 The five-point finite difference approximation of the Poisson equation on the unit square with mesh size h = l/n can be solved in at most T = 11 log„n steps using no more than p = n(n+[n/log nl ) processors, with P 2 « a speedup S = 0(n ) and an efficiency 9/44 <_ E < 9/20. O Comparing our parallel algorithm with the odd-even reduction scheme [3] , we obtain, S = | n 2 log 2 n/ll log 2 n = 9n 2 /22, 9/44 < E £ 9/20 where the upper bound is achieved for the Dirichlet boundary conditions. We use [xl to denote the smallest integer greater than or equal to x. 11 k. A BIHARMONIC SOLVER We consider the "biharmonic equation, k V u = f on a unit square with the boundary conditions u(x, y) =0 and u (x, y) = for (x, y) e 9R, where u (x, y) is the outward normal derivative at the boundary. Superimposing a square mesh on the unit square with mesh size h = l/(n+l), where n = 2 , then a finite difference approximation leads to the linear system M v = w, (T) where "b A+I 2B I 2B A 2B I I 2B A 2B I "1 2B A 2B I I 2B A 2B I 2B A+I f 2 2 \ (n x n ) 3 in which B = -k 1 1 -U 1 \ s \ \ \ \ 1-Ul 1 -h t A = B + 21 + 2EE (nxn), = V + 2EE , and ■*- 1 o o o 1 (2xn) 12 The eigenvalues of B are given by in w. = -h + 2 cos - l n+1 (i = 1, 2, ..., n), and the eigenvector matrix, <» -[«„!■ n+1 sin 1 JTT n+1 System (?) can be expressed [5] as where (G + 2FF )v = w, (i, j = 1, 2, . . . , n). G = V+I 2B I 2B V 2B I I 2B V 2B I 2B V 2B I \ \ \ I 2B V 2B I 2B V+I (n xn ) , and F/ 2 \ = diag(E, E, ..., E). Thus by the Woodburry formula, \Zi x c.n ) v = G _1 w - 2G~ 1 F(I^ + 2F t G X F) 1 F t G _1 w 2n (8) and the algorithm proceeds as follows in three stages: i) Solve the (2n+l) linear systems GY = F and Gv = w, ii) Factor the symmetric matrix {T + 2F Y) in the form LDL and solve the linear systems t^ t Lz„ = F v, Dz = Zp, and L z = z , iii) Obtain the solution, >\> v = v - 2Yz. 13 In stage i, we would like to solve (2n+l) systems of the form Gx = b . (9) As in the direct Poisson solver, solving the system (9) can be reduced to that of solving the n systems, fix = b j = 1,2,. ..,n , (10) J.J J where the positive-definite matrices fi. are given by J 1+A. J 2u> fi. = J 2u>. 1 J 2d) . 1 J J 2w. A. 2o). 1 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 2u 2w 2u. A . J J 2u. 1 2u. l+X J J (nxn) , where A. = 2 + id, j ; c. J ' ( *J1' 5 J2» in which, , x. ) , and b . = (b , b , . . . , b. ) , J Jl J2' jn \ " «*V \ " «\ k = 1, 2, . . . , n 1/2 Let y = (u /2) - ((oj /If) _ i) ' ; then fi. can be expressed as J J j J n. = t 2 + s.v,' j j j j (ii) where 1U T. = v i 1 1 u, 1 \ ^ N \ N * N l 0) 1 (J 1 OJ, (n*n), in which S j = C(a j e l + 6 j e 2 K V e n ] ' md V j = Ce l' 6 j e 2> e n ] ' 2 2 a. = 2 + co. - y . , 6 . = u>. - y . J J J J J J T. has the Toeplitz factorization L.U. similar to that of B in Sec. 3, j J J J where "both L and U. are diagonally dominant. Therefore the numerical J J stability of the process of solving the linear systems (L.U.) y. = f. is assured. Thus the solutions to the systems (10) may he written as, x = (L U )" 2 h - (L U )~ 2 S.(1„ + V^(L.U.)" 2 S.)"V(L.U J" 2 "b.. (12) j j j j J J j3 J J J J J J J J 2 By Lemma 3, solving the system (L U.) y. = h. requires 8 log n steps j j j j <— using (n-1) processors. Furthermore, since x in Lx = e. (i=l,2,n), and Ux = e can be obtained in constant time, the solution of the system n / \2^ ^ / N A (L.U.J S. = S. requires only 16 log n steps using (n-1) processors. Thus x. J J J J *- I ■H i ■HI