LI B R.AR.Y OF THE UNIVERSITY Of ILLINOIS 510. 84 U6r *o. 271-278 cop. 2 1 '• • The person charging this material is re- sponsible for its return to the library from which it was withdrawn on or before the Latest Date stamped below. Theft, mutilation, and underlining of books are reasons for disciplinary action and may result in dismissal from the University. UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN HOV ?2 19* JAN 2 8 1»2 L161 — O-1096 Digitized by the Internet Archive in 2013 http://archive.org/details/seismicsignalpro277acki __ , Report No . 277 II (th) yrutvi; SEISMIC SIGNAL PROCESSING VIA THE ILLIAC IV COMPUTER by G. M. Ackins D. J. Kuck ILLIAC IV Document No. 201 DEPARTMENT OF COMPUTER SCIENCE • UNIVERSITY OF ILLINOIS • URBANA, ILLINOIS DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF ILLINOIS URBANA, ILLINOIS 6l801 SEISMIC SIGNAL PROCESSING VIA THE ILLIAC IV COMPUTER by G. M. Ackins and D. J, Kuck ILLIAC IV Document No. 201 August 15, I968 ABSTRACT This paper discusses the ILLIAC IV computer and its application to seismic signal-processing problems. ILLIAC IV "will be operational at the University of Illinois' Computer Science Department in 1970. ILLIAC IV is an array of 256 coupled high-speed digital computers; as such, it has the capacity to execute algorithms 256 times faster than present-day computers. Achievement of large speed-up factors, however, is dependent upon "structuring" an algorithm into groups of (nominally, 256) identical computations. As an aid to the programmer, the Tranquil language is being designed with certain common statements, such as matrix operations, prestructured to provide for the parallel execution of code on many pieces of data. These, as well as other, aspects of the ILLIAC IV System and the Tranquil language are described in the first part of this paper. The second part of the paper discusses the structuring of common signal-processing algorithms, such as beam forming, convolution, and fast Fourier transform, for ILLIAC IV parallel computation. It is shown that, in a great variety of situations, the full speed-up factor, 256, can be obtained. * ILLIAC IV is being fabricated by the Burroughs Corporation for delivery at the University of Illinois in early 1970- - 11 - THE ILLIAC IV COMPUTER System Organization ILLIAC IV is an array of 256 coupled computers driven by instruc- tions from a common control unit. Each of the 256 processing elements (PEs) has 2048 words of 6k bit memory with a 2^0 nanosecond cycle time. Each PE is capable of 6k bit floating point multiplication in U00 nano- seconds and addition in 2^0 nanoseconds. Thirty-two bit floating point operations and 8 bit fixed point operations are also available. The PE instruction set is similar to that of conventional machines, with two exceptions. First, the PEs are capable of communicating data to four neigh- boring PEs by means of routing instructions. Second, the PEs are able to set their own mode registers to effectively disable or enable themselves. Figure 1 shows 6k PEs, each having three arithmetic registers (A, B, and C) and one protected programmer register (S). The registers, words, and paths in Figure 1 are all 6k bits wide, except the PE index registers (XR), mode registers, and as noted. The mode register may be regarded as one bit which may be used to block the participation of its PE in any action. The routing registers are shown connected to neighbors at distances ±1 and 18; similar end around connections are provided at 1, 6k, etc. Programs and data are stored in PE memory. Instructions are fetched by the control unit (CU) as required. Figure 1 also shows the essential registers and paths in the CU and their relations to the PEs. Instructions are decoded and control For a more detailed discussion of the ILLIAC IV hardware and software, see [1] and [2]. - 1 - signals are sent to the PE array from the control unit. Some instructions are executed directly in the CU; e.g., the loading of CU accumulator registers (CARs) with program literals. Operand addresses may be indexed once in the CU and again separately in each PE. It is possible to load the local data buffer (6k words of 6k bits each) and CARs from PE memory. Local data buffer registers and CARs may be loaded from each other. A broadcast instruction allows the contents of a CAR to be transmitted simul- taneously to all PEs. It is often convenient to manipulate all PE mode bits or a number from one PE in a CAR. For this purpose, the broadcast path is bidirectional. ILLIAC IV has k control units, each of which may drive 6k PEs independently. In the united configuration, all 256 PEs are effectively driven by one CU and routing proceeds across PE boundaries. This allows some flexibility in fitting problems to the array. The complete ILLIAC IV system (Figure 2) includes a Burroughs 6500 which performs input/ output operations, compilation, and contains an 9 operating system to control ILLIAC IV. Also, there is a 10 bit, head per track disk with a kO ms. rotation speed and an effective transfer rate of lCr bits/second. This allows the loading of the 32 X 10 bit ILLIAC IV memory from the disk in 32 ms. The input/ output controller (IOC) contains a queuer for disk requests and a 2-*-° bit buffer memory to smooth transmissions to and from the slower B-65OO memory. Programming ILLIAC IV It is apparent that ILLIAC IV may be suited to the execution of algorithms defined on arrays of data (e.g., matrix operations and certain - 2 - common signal -processing algorithms). However, there are two major diffi- culties in using such a machine. One is that an algorithm must be devised which allows identical computations to be performed simultaneously on a large number of operands. The other is closely related to the first: The data must be so placed in the ILLIAC IV memory that, over the course of the calculation, few PEs are disabled. In other words, the algorithm and the data storage allocation must be mutually designed for highly parallel computation. For this reason ILLIAC IV codes are expressed as two parts: a storage allocation block followed by a computational algorithm block. These codes may be written in a high level language called Tranquil which is described below. Storage Allocation In some cases data-array elements are mapped directly into PE memory locations (i.e., element a. . goes to the i location in the j PE). In other cases this may be unsatisfactory. For example, if it is desirable to have parallel access to columns of the array as well as rows, one may use a skewed storage scheme such as: a. . goes into the i row of the PE 10 whose number is (i+j) mod n, where n is the number of PEs. This is th illustrated in Figure 3(a). Now, for example, the column of the A matrix may be accessed by setting the XR of the i PE to i. Tranquil The Tranquil language is similar to existing languages for serial machines except that certain statements provide for the parallel execution of code on many pieces of data. Operations on entire arrays may - 3 - be performed by simply declaring arguments to be arrays. Thus if A, B, and C are declared to be m x m matrices, A ^B*C will be compiled as a matrix multiplication code, assigning the product to matrix A. Index sets may be used to refer to particular elements of an array or to control a block of code. Tranquil allows two kinds of control statements--SEQ and SIM. SEQ is analogous to the standard sequential con- trol statement of the FORTRAN DO or ALGOL FOR variety. SIM, however, allows simultaneous operation on all elements of an index set. For example, if we define II, J J, and KK to be index sets as follows: II «-< 0, 2, ..., 2n> JJ «- < 1, 3, • • • , 2n+l> KK <- < 9, •••, 2n+l> then FOR (I, J) SIM (II*JJ) DO BEGIN C(I,J) «-o. FOR K SEQ (KK) C (I, J) *-C(l,j) + A(I,K) * B(K,J); END will cause the matrix C to have the elements in the intersection of even rows and odd columns set to the inner products of appropriate rows and columns of A and B. Tranquil also has conditional expressions which test if any or all elements of an array meet some Boolean condition. - k - Tranquil will provide a number of canned storage allocation schemes for standard geometries plus some slicing algorithms for arbitrary shapes. The user will also have the prerogative of writing his own allocation map. (The Tranquil storage allocation block provides this mapping from symbolic notation into PE locations.) The Tranquil translator uses this storage allocation information when translating the Tranquil algorithm into ILLIAC IV machine language. In particular., routing, indexing and mode operations will be generated from the storage allocation information. An example of a Tranquil storage allocation statement is shown in Figure 3(b). This statement defines the skewed storage scheme shown in Figure 3(a); the skewing into memory is done by the compiler. EXECUTION OF SIGNAL PROCESSING ALGORITHMS Digital Filter Design Several multichannel filter design procedures [3>^] involve matrix operations --primarily addition, multiplication;, transposition and inversion. As explained previously, these operations will be a part of the ILLIAC IV software. Thus, if r, a, and 'a are matrices and T denotes transposition, expressions such as (3) T (3) T (3) which appears in the Robinson synthesis [5]* will be compiled and executed with high efficiency. If, for example, the problem involves an array consisting of 8 subarrays, each of 32 sensors, the design procedure may be implemented simultaneously for all eight subarrays. The problem of - 5 - computing correlation matrices (e.g., the r's in the above formula) from sampled data is also equivalent to matrix multiplication. The following computation, which involves approximately 3 X 10 multiplications of 25 X 25 matrices is estimated to take about 10 seconds: Given an array consisting of 10 subarrays of 25 sensors each, use the Robinson synthesis technique to design for each subarray a 100 point least squares filter. Beam Forming Figure k illustrates ILLIAC IV s capability to form beams for a large number of sensors. The rectangular structures represent the computer- - its 256 PEs stretching horizontally across the figure and PE memory extend- ing downward. The circles represent the sensors and the incoming signal is labeled by the letter "S"- -superscripts denote time-values; subscripts indicate a particular sensor. Sensors are "paired" with PEs; that is, each sensor "feeds" a particular PE. If, for the moment, attention is restricted to a single straight summation beam, then what is desired is the vector - element sum , s i 1 + s 2 1+ ••• + s 2 5 6 i (1) of the 256-element vector stretching across the A registers of the PEs. Upon output of the vector -element sum of time step i, the data of time step i + 1 can be presented and the process continued. This process uses the vector -element summation algorithm illustrated in Figure 5 and achieves a speed-up of 32:1 over serial operation. Figure 6 exhibits a Tranquil program for this procedure. The program will be compiled as the series of 6 - routes and adds illustrated in Figure 5. After execution, the vector -element 2 sum (l) is found in all PEs. It should be realized, however, that even though a speed-up of 32:1 over serial operation has been achieved, ILLIAC IV is operating at a low 12.5$ efficiency since at 6k bit precision the maximum speed-up is 256:1. In view of this, the question immediately arises: Can the summa- tions of successive time-steps be interleaved to produce a more efficient algorithm? It turns out that nearly 100$ efficient operation can be achieved by any of several interleaving schemes. One scheme [6] has a structure essentially identical to the method (Figure 9) used to execute the Cooley- Tukey algorithm. ILLIAC IV can efficiently form beams for arrays of more than 256 sensors. For example, to form beams for 512 sensors, the data is "folded over" so that sensors 257 through 512 also feed PEs 1 through 256, respectively. Thus, the first ILLIAC IV addition, « i g i « i b l b 2 b 256 + s 25? , + s 258 , + s 512 , is completely parallel even without interleaving. For an array of 5000 sensors, the data is "folded over" until the number of sensors is depleted (20 times) and 19 additions would be performed in parallel before a vector- sum technique is implemented. (Note that some efficiency is lost because 5000 is not an integer multiple of 256. ) Fifty subarrays, each consisting 2 To avoid a cluttered appearance, Figure 5 shows only the operations involved in obtaining the particular sum which ends up in PE 1. - 7 - of 100 sensors, can be processed in much the same way as a single 5000 sensor array [6] . At a seismic sampling rate of 20 per second, the maximum number of sensors that could be handled in a real-time (single) beam forming situation is 50 million. If the sensors are used in a multiple beam- forming situation (see below), the number 50 million is reduced accordingly. The more general problem of delay-and-sum beam forming involving many different beams is illustrated in Figure J. The figure shows a portion of memory (loc 1 through loc h) large enough to accommodate the largest delay. The beam patterns shown define the beams for all time. The inde- pendent access feature of PE memory (i.e., the PE index register) enables the components of each beam to be accessed simultaneously, and the sum can be formed as if it were a straight sum beam. After all beams have been processed for t = 1, new data (in Figure "J: S ) may be entered into memory (in Figure r J: loc 1 or loc 5), and the process is repeated. Convolution The convolution of a sampled time-series with predetermined filter weights can become time-consuming if (l) the problem involves multichannel filters, or (2) the length of the filter is very large. These cases are considered separately. (l) Suppose samples are fed to each PE from a particular sensor (as in Figure k) . The parallelism for multichannel convolution is evident-- each PE performs a single channel convolution and the results are summed. For example, a 50-point convolutional filter would involve a standard serial convolution algorithm which is executed simultaneously in all PEs - 8 - followed (after every 50 intra-PE operations) by a vector-element summation across the entire PE array (in the case of one 256- channel filter) or sums across sections of the PE array (in the case of several smaller multichannel filters). These vector-element summations may be interleaved if desired. (2) Less transparent, but still easily implemented, is the case of a single long convolution. Consider the 256-point single channel con- volution illustrated in Figure 8. Denoting filter weights by W and signal samples by S, the desired successive filter outputs are (sV + S 2 * 2 + . . . + S 256 ^ 56 ), (SV + SV + . . . + S 25 V 56 ), etc. At t = the weights are spread across register S. At t = 1, S is received in the CU, broadcast multiplied into the array of weights, and this result added to loc 1. The weights now (actually, while the addition is in progress) are routed leftward one PE. (See t = 1 in Figure 8.) At t = 2, 2 S is received, multiplied into the weights, and the partial sum in loc 1 is updated. The weights move leftward, and so on. At t = 256 the first 3 genuine filter output is found in PE 1. Thereafter, each iteration produces a new filter output which is found in the PE leftward of the previous output. Thus, the second output, sV + sV + ... + s 25 V 56 , will be found in PE 256. After each result is obtained, loc 1 in the PE it came from is reset to zero. It should be noted that this scheme is not restricted to 256 samples, and that the procedure achieves a speed-up factor of approximately 256 over serial operation. 3 Dummy outputs were produced at times 1 through 255 as denoted by the O's in loc 1. This is done so that the same algorithm suffices for all time. - 9 - Cooley-Tukey Algorithm The Cooley-Tukey Algorithm [7] for the' case N=2 , where p is an integer, is well suited for parallel computation. An ILLIAC IV Cooley-Tukey program can be conveniently divided into three sections: (i) W-generation, (ii) iteration, and (iii) unscrambling. W-generation refers to the task of computing the complex constants (W's) needed to implement the algorithm. If the algorithm is used successively for different sets of data, the W's remain unchanged- -that is, they depend only on p. The iteration scheme which will be presented leaves the Fourier components in an unnatural order, and it is necessary to unscramble them. The unscrambling procedure (iii) takes less than 10$ the time of (ii). Thus, in a real-time situation, it is (ii) which determines the computation time. For these reasons, and since it is "iteration" which computes the finite Fourier transform of the N samples, this paper will present a detailed description only of (ii). The schemes which implement (i) and (iii) are similar to that for implementing (ii) and all are highly parallel [8]. A good feeling for what (ii) accomplishes can be obtained by examining the Cooley-Tukey algorithm for a relatively simple case (Figure 9)« The case illustrated involves 16 data points (A's) and 8 PEs. The figure shows 12 "operations". Original storage appears in (l) ; Fourier samples are the A, (i) appearing (scrambled) in Cl2\ .' The caption between the boxes describes the "operation" which transforms the box above the caption into the box below the caption. Note that for ease in portrayal A, (0) is written in place of A(0) + A(8)W and A_(8) is written in place of h A(0) - A(8)W , etc. The switch , multiply, add, subtract; switch, multiply T Note that the "switch" operation is of variable "length" - 10 - add, subtract structure of the program is dictated by the inherent structure of the algorithm. The algorithm for l6 data points involves four iterations. Of the four iterations, only the first does not entail routing (in Figure 9: "switch"). The algorithm for 32 data points would involve five iterations, the data would be folded over k deep, and both the first and second iter: : would not entail routing [8]. The algorithm for N = 4096 data points (6k bits; parts ii and iii only) is executed in about .6 millisecond. 11 - REFERENCES [1] G. H. Barnes, R. M. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, and R. E. Stokes, "The ILLIAC IV computer", 1968; accepted for publication. [2] D. J. Kuck, "ILLIAC IV software and applications programming", 1968; accepted for publication. [3] P. E. Green, E. J. Kelly, Jr., and M. J. Levin, "A comparison of seismic array processing methods", Geophys. J. R. Astr. Soc , vol. 2, pp. 67 -84, 1966. [4] J. Capon, R. J. Greenfield, and R. J. Kolker, "Multidimensional maximum- likelihood processing of a large aperture seismic array", Proc. IEEE , vol. 55, pp. 192-211, 1967. [5] F. A. Robinson and R. A. Wiggins, "Recursive solution to the multichannel filtering problem", J. Geophys. Res ., vol. 70, pp. I885-I89I, 1965. [6] G. M. Ackins, "Successive vector-element processes and variations on the sorting problem", Department of Computer Science, University of Illinois, ILLIAC IV Document No. 195, File No. 165, July 11, 1968. [7] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series", Math, of Comp. , vol. 19, pp. 297-301, 1965. [8] G. M. Ackins, "Fast Fourier Transform via ILLIAC IV", Department of Computer Science, University of Illinois, ILLIAC IV Document No. 198, File No. 768, July 12, 1968. FIGURE CAPTIONS Fig. 1. Quadrant configuration. Fig. 2. System organization. Fig. 3. Storage allocation J Part (a) Skewed storage scheme. | Part (b) Corresponding Tranquil storage allocation block, Fig. h. Beam forming: Input of data. Fig. 5. Beam forming: Vector-element sum. Fig. 6. Vector-element sum: Tranquil code. Fig. 7. Delay-and-sum beam forming. Fig. 8. Convolution. Fig. 9. Cooley-Tukey algorithm. INSTRUCTION AND DATA FETCHING CD CM in INSTR. FETCH REQUEST I MSTRUCTION BUFFER ASSOCIATIVE MEMORY CONTROL UNIT T 1 WORDS 64 WORDS 1 1 OATA BUFFER INSTRUCTION DECODER C U INDEXING CAR CAR 3 T ~i P E CONTROL SIGNALS DATA FETCHING DATA BROADCASTING f 64 57 T 2048 WORDS 1 I MODE B XR MIR PE MEMORY 64 BITS PE 1 1-1 9 i-8 • • • MODE B XR MM PE RwtMvfrT 1+1 63 i+8 56 • • • | MODE B R XR MIR PE MEMORY PEi PE64 Fig. 1. Quadrant configuration USER TERMINALS ARPA CONTRACTORS NETWORK ILLINET REAL TIME INPUTS Fig. 2. System organization PE PE i PE n loc o a^ a „ a on loc 1 a ln Oi,i-i a X|ll _ lOC i Oi.n-i + i Qi.o Qi.n-I loc n a„i Qn i- m .... u n ,_ n .... u n0 (a) SKEW A[o:n,o:n] (b) Fig. 3« Storage allocation fPart (a) Skewed storage schemes. \ Part ("b) Corresponding Tranquil stora ^ allocation block. SIGNAL 256 SENSORS ILLIAC IV loc 1 loc 2 0© 1 I PE 1 PE 2 • • • • • • J PE 256 si b 2 b 256 A REGISTER PE MEMORY Fig. h. Beam forming: Input of data PE MEMORY LOCATION S AFTER ®-l OPERATIONS S* Q (ROUTE: DISTANCE 1, LEFT; ADD) si+s| S|4Sj sj+sj St+Ss (D (ROUTE: DISTANCE 2, LEFT; ADD) 4 Is? i-1 e Est i-5 (D (ROUTE: DISTANCE 4, LEFT; ADD) 256 , © PE 1 PE 2 PE 3 PE 4 PE 5 PE 6 PE 7 PE 8 Fig. 5. Beam forming: Vector-element sum BEGIN REAL ARRAY S(b:255|; FOR (K) SEQ (0, . . . , 7) BEGIN J<-2tK FOR (I) SIM (0, . . . , 255) DO S[l]<-S[l]+S[(I+J)MOD 256] END; END: Fie. 6. Vector-element sum: Tranquil code PE 1 PE 2 PE 3 PE 4 PE 5 PE 6 loc 1 loc 2 loc 3 loc 4 loc 5 i D i D -i D i D i° D Sf o Si Si Si Si c2 q2 q2 q2 q2 ^1 ^2 ^ ^3 ^4 ^ ^5 S Q Q Q Q 1 D 2 D 3 D 4 D 5 S^ C* Q Q Q 1 D 2 ^3 ^4 ^5 "BEAM PATTERNS" D f = s{+S24-S 3 +S44-S5+ ... o T = 5^^+52 -rS 3 +S 4 +S 5 + ... O t = Si+S 2 +1 +S 3 +S 4 +1 +Sj+ ... Fig. 7. Delay and sum "beam forming reg S toe 1 reg S loci reg S loc 1 PE 1 PE 2 PE 3 • • • PE 255 PE 256 W 1 W 2 W s • • • W A t (12) A ( (13) A ( (14) A, (15) sw itch: A^O) Ajd) A ( (2) 1^3) A ( (8) A ( (9) A, (10) A^ll) A,<4) A,(5) A ( (6) A,<7) A, (12) A ( (13) A,(14) A ( (15) multiply: loc 2* (loc 2) x (loc 4) A t (0) A t (1) A ( <2) A ( (3) A t (8) A ( (9) A ( (10) A^ll) A (4)*° A (5)^ A (6)^ A (7)^ A (12)! 4 A (13)! 4 A (M)f 4 A ( (15)«f 4 add, subtract: loc 1* loc 1 ♦ loc 2 loc 2* loc 1 - loc 2 A 2 (0) A 2 (1) A 2 (2) A 2 (3) A 2 (8) A 2 (9) A 2 (10) * 2 <11) A 2 (4) A 2 (5) A 2 (6) A 2 (7) A ? (12) A 2 (13) A ? (14) A ? (15) Fig. 9« Cooley-Tukey algorithm sw i tch loc 1 loc 2 * 2 (0) * 2 (2) * 2 (D A 2 (3) A 2 (4) A 2 (5) A 2 (8) A 2 (8) A 2 (12) A 2 (6) A 2 (7) A 2 (10) A 2 (11) A 2 (U) A 2 (13) A 2 (15) © multiply: loc 2-+— (loc 2) x (loc 5) loc 1 loc 2 A 2 (0) A 2 (1) A 2 (4) A 2 (5) A 2 (8) A 2 (9) A 2 (12) A 2 (2)W° A 2 (3)i° A 2 (6)f 4 A 2 (7)l 4 A 2 (10)» 2 A 2 (11)f 2 A 2 (14)i 8 A 2 (13) AjdS)* 6 © add, subtract: loc 1« loc 2< ■loc 1 + loc 2 ■loc 1 - loc 2 A 3 (0) A 3 (1) A 3 (4) A 3 (5) A 3 (8) A 3 (2) A 3 (3) A 3 (6) A 3 (7) * 3 (10) A 3 (9) A 3 (12) A 3 (13) A 3 (11) A 3 (14) A 3 (15) © sw i tch: loc 1 loc 2 A (0) A (2) A (4) A (6) A (8) 3 9 3 3 3 A 3 (D A 3 (3) A 3 (5) A 3 (7) A 3 (9) A (10) A (12) A (14) 3 3 3 A 3 (11) A 3 (13) A 3 (15) multiply: loc 2+ (loc 2) x (loc 6) A 3 (0) A 3 (2) A 3 (4) A 3 (6) A 3 (8) A (1)W° A (3)W 4 A (5)W 2 A (7)W 6 A (9)*' A 3 (10) A 3 (12) A 3 (14) A 3 (11)W 5 A 3 (13)W 3 A 3 ( 1 5 ) W 7 © add, subtract: loc 1 loc 2 A 4 (0) A 4 (2) A 4 (D A 4 (3) A 4 (4) A 4 (6) A 4 (8) A/5) A 4 (7) A 4 (9) A 4 (10) A 4 (12) A 4 (14) A 4 (11) A 4 (13) A 4 (15) PE 1 PE 2 PE 3 PE 4 PE 5 PE 6 PE 7 PE 8 Fig. 9- (continued)