LI B R.AR.Y 
 
 OF THE 
 
 UNIVERSITY 
 
 Of ILLINOIS 
 
 510. 84 
 
 U6r 
 
 *o. 271-278 
 
 cop. 2 
 
 1 '• • 
 
The person charging this material is re- 
 sponsible for its return to the library from 
 which it was withdrawn on or before the 
 Latest Date stamped below. 
 
 Theft, mutilation, and underlining of books 
 are reasons for disciplinary action and may 
 result in dismissal from the University. 
 
 UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN 
 
 HOV ?2 19* 
 
 JAN 2 8 1»2 
 
 L161 — O-1096 
 
Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/seismicsignalpro277acki 
 
__ , Report No . 277 
 
 II (th) 
 
 yrutvi; 
 
 SEISMIC SIGNAL PROCESSING VIA 
 THE ILLIAC IV COMPUTER 
 
 by 
 
 G. M. Ackins 
 D. J. Kuck 
 
 ILLIAC IV Document No. 201 
 
 DEPARTMENT OF COMPUTER SCIENCE • UNIVERSITY OF ILLINOIS • URBANA, ILLINOIS 
 
DEPARTMENT OF COMPUTER SCIENCE 
 UNIVERSITY OF ILLINOIS 
 URBANA, ILLINOIS 6l801 
 
 SEISMIC SIGNAL PROCESSING VIA 
 THE ILLIAC IV COMPUTER 
 
 by 
 G. M. Ackins and D. J, Kuck 
 
 ILLIAC IV Document No. 201 
 August 15, I968 
 
ABSTRACT 
 
 This paper discusses the ILLIAC IV computer and its application 
 to seismic signal-processing problems. ILLIAC IV "will be operational at 
 the University of Illinois' Computer Science Department in 1970. 
 
 ILLIAC IV is an array of 256 coupled high-speed digital computers; 
 as such, it has the capacity to execute algorithms 256 times faster than 
 present-day computers. Achievement of large speed-up factors, however, is 
 dependent upon "structuring" an algorithm into groups of (nominally, 256) 
 identical computations. As an aid to the programmer, the Tranquil language 
 is being designed with certain common statements, such as matrix operations, 
 prestructured to provide for the parallel execution of code on many pieces 
 of data. These, as well as other, aspects of the ILLIAC IV System and the 
 Tranquil language are described in the first part of this paper. 
 
 The second part of the paper discusses the structuring of common 
 signal-processing algorithms, such as beam forming, convolution, and fast 
 Fourier transform, for ILLIAC IV parallel computation. It is shown that, 
 in a great variety of situations, the full speed-up factor, 256, can be 
 obtained. 
 
 * 
 
 ILLIAC IV is being fabricated by the Burroughs Corporation for delivery at 
 
 the University of Illinois in early 1970- 
 
 - 11 - 
 
THE ILLIAC IV COMPUTER 
 
 System Organization 
 
 ILLIAC IV is an array of 256 coupled computers driven by instruc- 
 tions from a common control unit. Each of the 256 processing elements 
 (PEs) has 2048 words of 6k bit memory with a 2^0 nanosecond cycle time. 
 Each PE is capable of 6k bit floating point multiplication in U00 nano- 
 seconds and addition in 2^0 nanoseconds. Thirty-two bit floating point 
 operations and 8 bit fixed point operations are also available. The PE 
 instruction set is similar to that of conventional machines, with two 
 exceptions. First, the PEs are capable of communicating data to four neigh- 
 boring PEs by means of routing instructions. Second, the PEs are able to 
 set their own mode registers to effectively disable or enable themselves. 
 
 Figure 1 shows 6k PEs, each having three arithmetic registers 
 (A, B, and C) and one protected programmer register (S). The registers, 
 words, and paths in Figure 1 are all 6k bits wide, except the PE index 
 registers (XR), mode registers, and as noted. The mode register may be 
 regarded as one bit which may be used to block the participation of its PE 
 in any action. The routing registers are shown connected to neighbors 
 at distances ±1 and 18; similar end around connections are provided at 
 1, 6k, etc. Programs and data are stored in PE memory. Instructions are 
 fetched by the control unit (CU) as required. 
 
 Figure 1 also shows the essential registers and paths in the CU 
 and their relations to the PEs. Instructions are decoded and control 
 
 For a more detailed discussion of the ILLIAC IV hardware and 
 software, see [1] and [2]. 
 
 - 1 - 
 
signals are sent to the PE array from the control unit. Some instructions 
 are executed directly in the CU; e.g., the loading of CU accumulator 
 registers (CARs) with program literals. Operand addresses may be indexed 
 once in the CU and again separately in each PE. It is possible to load the 
 local data buffer (6k words of 6k bits each) and CARs from PE memory. 
 Local data buffer registers and CARs may be loaded from each other. A 
 broadcast instruction allows the contents of a CAR to be transmitted simul- 
 taneously to all PEs. It is often convenient to manipulate all PE mode 
 bits or a number from one PE in a CAR. For this purpose, the broadcast path 
 is bidirectional. 
 
 ILLIAC IV has k control units, each of which may drive 6k PEs 
 independently. In the united configuration, all 256 PEs are effectively 
 driven by one CU and routing proceeds across PE boundaries. This allows 
 some flexibility in fitting problems to the array. 
 
 The complete ILLIAC IV system (Figure 2) includes a Burroughs 
 
 6500 which performs input/ output operations, compilation, and contains an 
 
 9 
 
 operating system to control ILLIAC IV. Also, there is a 10 bit, head per 
 
 track disk with a kO ms. rotation speed and an effective transfer rate of 
 lCr bits/second. This allows the loading of the 32 X 10 bit ILLIAC IV 
 memory from the disk in 32 ms. The input/ output controller (IOC) contains 
 a queuer for disk requests and a 2-*-° bit buffer memory to smooth transmissions 
 to and from the slower B-65OO memory. 
 
 Programming ILLIAC IV 
 
 It is apparent that ILLIAC IV may be suited to the execution of 
 algorithms defined on arrays of data (e.g., matrix operations and certain 
 
 - 2 - 
 
common signal -processing algorithms). However, there are two major diffi- 
 culties in using such a machine. One is that an algorithm must be devised 
 which allows identical computations to be performed simultaneously on a 
 large number of operands. The other is closely related to the first: The 
 data must be so placed in the ILLIAC IV memory that, over the course of the 
 calculation, few PEs are disabled. In other words, the algorithm and the 
 data storage allocation must be mutually designed for highly parallel 
 computation. For this reason ILLIAC IV codes are expressed as two parts: 
 a storage allocation block followed by a computational algorithm block. 
 These codes may be written in a high level language called Tranquil which 
 is described below. 
 
 Storage Allocation 
 
 In some cases data-array elements are mapped directly into PE 
 
 memory locations (i.e., element a. . goes to the i location in the j PE). 
 
 In other cases this may be unsatisfactory. For example, if it is desirable 
 
 to have parallel access to columns of the array as well as rows, one may 
 
 use a skewed storage scheme such as: a. . goes into the i row of the PE 
 
 10 
 
 whose number is (i+j) mod n, where n is the number of PEs. This is 
 
 th 
 illustrated in Figure 3(a). Now, for example, the column of the A 
 
 matrix may be accessed by setting the XR of the i PE to i. 
 
 Tranquil 
 
 The Tranquil language is similar to existing languages for 
 serial machines except that certain statements provide for the parallel 
 execution of code on many pieces of data. Operations on entire arrays may 
 
 - 3 - 
 
be performed by simply declaring arguments to be arrays. Thus if A, B, 
 and C are declared to be m x m matrices, 
 
 A ^B*C 
 will be compiled as a matrix multiplication code, assigning the product to 
 matrix A. 
 
 Index sets may be used to refer to particular elements of an 
 array or to control a block of code. Tranquil allows two kinds of control 
 statements--SEQ and SIM. SEQ is analogous to the standard sequential con- 
 trol statement of the FORTRAN DO or ALGOL FOR variety. SIM, however, allows 
 simultaneous operation on all elements of an index set. For example, if 
 we define II, J J, and KK to be index sets as follows: 
 
 II «-< 0, 2, ..., 2n> 
 JJ «- < 1, 3, • • • , 2n+l> 
 KK <- < 9, •••, 2n+l> 
 then 
 
 FOR (I, J) SIM (II*JJ) DO 
 BEGIN 
 
 C(I,J) «-o. 
 FOR K SEQ (KK) 
 
 C (I, J) *-C(l,j) + A(I,K) * B(K,J); 
 END 
 will cause the matrix C to have the elements in the intersection of even 
 rows and odd columns set to the inner products of appropriate rows and 
 columns of A and B. 
 
 Tranquil also has conditional expressions which test if any or 
 all elements of an array meet some Boolean condition. 
 
 - k - 
 
Tranquil will provide a number of canned storage allocation 
 schemes for standard geometries plus some slicing algorithms for arbitrary 
 shapes. The user will also have the prerogative of writing his own 
 allocation map. (The Tranquil storage allocation block provides this mapping 
 from symbolic notation into PE locations.) The Tranquil translator uses 
 this storage allocation information when translating the Tranquil algorithm 
 into ILLIAC IV machine language. In particular., routing, indexing and mode 
 operations will be generated from the storage allocation information. An 
 example of a Tranquil storage allocation statement is shown in Figure 3(b). 
 This statement defines the skewed storage scheme shown in Figure 3(a); the 
 skewing into memory is done by the compiler. 
 
 EXECUTION OF SIGNAL PROCESSING ALGORITHMS 
 Digital Filter Design 
 
 Several multichannel filter design procedures [3>^] involve matrix 
 operations --primarily addition, multiplication;, transposition and inversion. 
 As explained previously, these operations will be a part of the ILLIAC IV 
 software. Thus, if r, a, and 'a are matrices and T denotes transposition, 
 expressions such as 
 
 (3) T (3) T (3) 
 
 which appears in the Robinson synthesis [5]* will be compiled and executed 
 with high efficiency. If, for example, the problem involves an array 
 consisting of 8 subarrays, each of 32 sensors, the design procedure may be 
 implemented simultaneously for all eight subarrays. The problem of 
 
 - 5 - 
 
computing correlation matrices (e.g., the r's in the above formula) from 
 sampled data is also equivalent to matrix multiplication. 
 
 The following computation, which involves approximately 3 X 10 
 multiplications of 25 X 25 matrices is estimated to take about 10 
 seconds: Given an array consisting of 10 subarrays of 25 sensors each, 
 use the Robinson synthesis technique to design for each subarray a 100 
 point least squares filter. 
 
 Beam Forming 
 
 Figure k illustrates ILLIAC IV s capability to form beams for a 
 large number of sensors. The rectangular structures represent the computer- - 
 its 256 PEs stretching horizontally across the figure and PE memory extend- 
 ing downward. The circles represent the sensors and the incoming signal is 
 labeled by the letter "S"- -superscripts denote time-values; subscripts 
 indicate a particular sensor. Sensors are "paired" with PEs; that is, each 
 sensor "feeds" a particular PE. If, for the moment, attention is restricted 
 to a single straight summation beam, then what is desired is the vector - 
 element sum , 
 
 s i 1 + s 2 1+ ••• + s 2 5 6 i (1) 
 
 of the 256-element vector stretching across the A registers of the PEs. 
 Upon output of the vector -element sum of time step i, the data of time 
 step i + 1 can be presented and the process continued. This process uses 
 the vector -element summation algorithm illustrated in Figure 5 and achieves 
 a speed-up of 32:1 over serial operation. Figure 6 exhibits a Tranquil 
 program for this procedure. The program will be compiled as the series of 
 
 6 - 
 
routes and adds illustrated in Figure 5. After execution, the vector -element 
 
 2 
 
 sum (l) is found in all PEs. 
 
 It should be realized, however, that even though a speed-up of 
 32:1 over serial operation has been achieved, ILLIAC IV is operating at a 
 low 12.5$ efficiency since at 6k bit precision the maximum speed-up is 
 256:1. In view of this, the question immediately arises: Can the summa- 
 tions of successive time-steps be interleaved to produce a more efficient 
 algorithm? It turns out that nearly 100$ efficient operation can be achieved 
 by any of several interleaving schemes. One scheme [6] has a structure 
 essentially identical to the method (Figure 9) used to execute the Cooley- 
 Tukey algorithm. 
 
 ILLIAC IV can efficiently form beams for arrays of more than 256 
 sensors. For example, to form beams for 512 sensors, the data is "folded 
 over" so that sensors 257 through 512 also feed PEs 1 through 256, 
 respectively. Thus, the first ILLIAC IV addition, 
 
 « i g i « i 
 
 b l b 2 b 256 
 
 + s 25? , + s 258 , + s 512 , 
 
 is completely parallel even without interleaving. For an array of 5000 
 sensors, the data is "folded over" until the number of sensors is depleted 
 (20 times) and 19 additions would be performed in parallel before a vector- 
 sum technique is implemented. (Note that some efficiency is lost because 
 5000 is not an integer multiple of 256. ) Fifty subarrays, each consisting 
 
 2 
 To avoid a cluttered appearance, Figure 5 shows only the 
 
 operations involved in obtaining the particular sum which ends up in PE 1. 
 
 - 7 - 
 
of 100 sensors, can be processed in much the same way as a single 
 
 5000 sensor array [6] . At a seismic sampling rate of 20 per second, 
 
 the maximum number of sensors that could be handled in a real-time (single) 
 
 beam forming situation is 50 million. If the sensors are used in a multiple 
 
 beam- forming situation (see below), the number 50 million is reduced 
 
 accordingly. 
 
 The more general problem of delay-and-sum beam forming involving 
 many different beams is illustrated in Figure J. The figure shows a portion 
 of memory (loc 1 through loc h) large enough to accommodate the largest 
 delay. The beam patterns shown define the beams for all time. The inde- 
 pendent access feature of PE memory (i.e., the PE index register) enables 
 the components of each beam to be accessed simultaneously, and the sum can 
 be formed as if it were a straight sum beam. After all beams have been 
 processed for t = 1, new data (in Figure "J: S ) may be entered into memory 
 (in Figure r J: loc 1 or loc 5), and the process is repeated. 
 
 Convolution 
 
 The convolution of a sampled time-series with predetermined 
 filter weights can become time-consuming if (l) the problem involves 
 multichannel filters, or (2) the length of the filter is very large. 
 These cases are considered separately. 
 
 (l) Suppose samples are fed to each PE from a particular sensor 
 (as in Figure k) . The parallelism for multichannel convolution is evident-- 
 each PE performs a single channel convolution and the results are summed. 
 For example, a 50-point convolutional filter would involve a standard 
 serial convolution algorithm which is executed simultaneously in all PEs 
 
 - 8 - 
 
followed (after every 50 intra-PE operations) by a vector-element summation 
 across the entire PE array (in the case of one 256- channel filter) or sums 
 across sections of the PE array (in the case of several smaller multichannel 
 filters). These vector-element summations may be interleaved if desired. 
 
 (2) Less transparent, but still easily implemented, is the case 
 of a single long convolution. Consider the 256-point single channel con- 
 volution illustrated in Figure 8. Denoting filter weights by W and signal 
 samples by S, the desired successive filter outputs are 
 
 (sV + S 2 * 2 + . . . + S 256 ^ 56 ), (SV + SV + . . . + S 25 V 56 ), etc. 
 At t = the weights are spread across register S. At t = 1, S is received 
 in the CU, broadcast multiplied into the array of weights, and this result 
 added to loc 1. The weights now (actually, while the addition is in 
 
 progress) are routed leftward one PE. (See t = 1 in Figure 8.) At t = 2, 
 
 2 
 
 S is received, multiplied into the weights, and the partial sum in loc 1 
 
 is updated. The weights move leftward, and so on. At t = 256 the first 
 
 3 
 genuine filter output is found in PE 1. Thereafter, each iteration produces 
 
 a new filter output which is found in the PE leftward of the previous output. 
 
 Thus, the second output, 
 
 sV + sV + ... + s 25 V 56 , 
 
 will be found in PE 256. After each result is obtained, loc 1 in the PE it 
 came from is reset to zero. It should be noted that this scheme is not 
 restricted to 256 samples, and that the procedure achieves a speed-up factor 
 of approximately 256 over serial operation. 
 
 3 
 Dummy outputs were produced at times 1 through 255 as denoted by 
 
 the O's in loc 1. This is done so that the same algorithm suffices for all 
 
 time. 
 
 - 9 - 
 
Cooley-Tukey Algorithm 
 
 The Cooley-Tukey Algorithm [7] for the' case N=2 , where p is an 
 integer, is well suited for parallel computation. An ILLIAC IV Cooley-Tukey 
 program can be conveniently divided into three sections: (i) W-generation, 
 (ii) iteration, and (iii) unscrambling. W-generation refers to the task of 
 computing the complex constants (W's) needed to implement the algorithm. If 
 the algorithm is used successively for different sets of data, the W's 
 remain unchanged- -that is, they depend only on p. The iteration scheme 
 which will be presented leaves the Fourier components in an unnatural order, 
 and it is necessary to unscramble them. The unscrambling procedure (iii) 
 takes less than 10$ the time of (ii). Thus, in a real-time situation, it 
 is (ii) which determines the computation time. For these reasons, and since 
 it is "iteration" which computes the finite Fourier transform of the N 
 samples, this paper will present a detailed description only of (ii). The 
 schemes which implement (i) and (iii) are similar to that for implementing 
 (ii) and all are highly parallel [8]. 
 
 A good feeling for what (ii) accomplishes can be obtained by 
 examining the Cooley-Tukey algorithm for a relatively simple case (Figure 9)« 
 The case illustrated involves 16 data points (A's) and 8 PEs. The figure 
 shows 12 "operations". Original storage appears in (l) ; Fourier samples 
 are the A, (i) appearing (scrambled) in Cl2\ .' The caption between the 
 boxes describes the "operation" which transforms the box above the caption 
 into the box below the caption. Note that for ease in portrayal A, (0) is 
 
 written in place of A(0) + A(8)W and A_(8) is written in place of 
 
 h 
 
 A(0) - A(8)W , etc. The switch , multiply, add, subtract; switch, multiply 
 
 T 
 
 Note that the "switch" operation is of variable "length" 
 
 - 10 - 
 
add, subtract structure of the program is dictated by the inherent structure 
 of the algorithm. The algorithm for l6 data points involves four iterations. 
 Of the four iterations, only the first does not entail routing (in Figure 9: 
 "switch"). The algorithm for 32 data points would involve five iterations, 
 the data would be folded over k deep, and both the first and second iter: : 
 would not entail routing [8]. The algorithm for N = 4096 data points (6k 
 bits; parts ii and iii only) is executed in about .6 millisecond. 
 
 11 - 
 
REFERENCES 
 
 [1] G. H. Barnes, R. M. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, 
 and R. E. Stokes, "The ILLIAC IV computer", 1968; accepted for publication. 
 
 [2] D. J. Kuck, "ILLIAC IV software and applications programming", 
 1968; accepted for publication. 
 
 [3] P. E. Green, E. J. Kelly, Jr., and M. J. Levin, "A comparison 
 of seismic array processing methods", Geophys. J. R. Astr. Soc , vol. 2, 
 pp. 67 -84, 1966. 
 
 [4] J. Capon, R. J. Greenfield, and R. J. Kolker, "Multidimensional 
 maximum- likelihood processing of a large aperture seismic array", Proc. IEEE , 
 vol. 55, pp. 192-211, 1967. 
 
 [5] F. A. Robinson and R. A. Wiggins, "Recursive solution to the 
 multichannel filtering problem", J. Geophys. Res ., vol. 70, pp. I885-I89I, 
 1965. 
 
 [6] G. M. Ackins, "Successive vector-element processes and variations 
 on the sorting problem", Department of Computer Science, University of 
 Illinois, ILLIAC IV Document No. 195, File No. 165, July 11, 1968. 
 
 [7] J. W. Cooley and J. W. Tukey, "An algorithm for the machine 
 calculation of complex Fourier series", Math, of Comp. , vol. 19, pp. 297-301, 
 1965. 
 
 [8] G. M. Ackins, "Fast Fourier Transform via ILLIAC IV", 
 Department of Computer Science, University of Illinois, ILLIAC IV Document No. 
 198, File No. 768, July 12, 1968. 
 
FIGURE CAPTIONS 
 
 Fig. 1. Quadrant configuration. 
 
 Fig. 2. System organization. 
 
 Fig. 3. Storage allocation 
 
 J Part (a) Skewed storage scheme. 
 
 | Part (b) Corresponding Tranquil storage allocation block, 
 
 Fig. h. Beam forming: Input of data. 
 
 Fig. 5. Beam forming: Vector-element sum. 
 
 Fig. 6. Vector-element sum: Tranquil code. 
 
 Fig. 7. Delay-and-sum beam forming. 
 
 Fig. 8. Convolution. 
 
 Fig. 9. Cooley-Tukey algorithm. 
 
INSTRUCTION AND DATA FETCHING 
 
 CD 
 CM 
 
 in 
 
 INSTR. 
 
 FETCH 
 
 REQUEST 
 
 I 
 
 MSTRUCTION 
 BUFFER 
 
 ASSOCIATIVE 
 MEMORY 
 
 CONTROL UNIT 
 
 T 
 
 1 
 
 WORDS 
 
 64 
 
 WORDS 
 
 1 
 
 1 
 
 OATA 
 BUFFER 
 
 INSTRUCTION 
 DECODER 
 
 
 C U INDEXING 
 
 CAR 
 
 CAR 3 
 
 T 
 
 ~i 
 
 P E CONTROL SIGNALS 
 
 DATA FETCHING 
 DATA BROADCASTING f 
 
 64 
 
 57 
 
 T 
 
 2048 
 WORDS 
 
 1 
 
 I 
 
 MODE 
 
 B 
 
 XR 
 
 MIR 
 
 PE 
 MEMORY 
 
 64 
 BITS 
 
 PE 1 
 
 1-1 
 
 9 i-8 
 
 • • • 
 
 MODE 
 
 B 
 
 XR 
 
 MM 
 
 PE 
 
 RwtMvfrT 
 
 1+1 
 
 63 
 
 i+8 56 
 
 • • • 
 
 | MODE 
 
 B 
 
 R 
 
 XR 
 
 MIR 
 
 PE 
 MEMORY 
 
 PEi 
 
 PE64 
 
 Fig. 1. Quadrant configuration 
 
USER TERMINALS 
 
 ARPA 
 
 CONTRACTORS 
 
 NETWORK 
 
 ILLINET 
 
 REAL TIME 
 INPUTS 
 
 Fig. 2. System organization 
 
PE PE i PE n 
 
 loc o a^ a „ a on 
 
 loc 1 a ln Oi,i-i a X|ll _ 
 
 lOC i Oi.n-i + i Qi.o Qi.n-I 
 
 loc n a„i Qn i- 
 
 m .... u n ,_ n .... u n0 
 
 (a) 
 
 SKEW A[o:n,o:n] 
 (b) 
 
 Fig. 3« Storage allocation 
 
 fPart (a) Skewed storage schemes. 
 
 \ Part ("b) Corresponding Tranquil stora 
 
 ^ allocation block. 
 
SIGNAL 
 
 256 
 
 SENSORS 
 
 ILLIAC IV 
 
 loc 1 
 loc 2 
 
 0© 
 1 I 
 
 PE 1 PE 2 
 
 • • • 
 
 • • • 
 
 J 
 
 PE 256 
 
 si 
 
 b 2 
 
 
 b 256 
 
 A REGISTER 
 
 PE MEMORY 
 
 Fig. h. Beam forming: Input of data 
 
PE MEMORY LOCATION S AFTER ®-l OPERATIONS 
 
 S* 
 
 Q 
 
 (ROUTE: DISTANCE 1, LEFT; ADD) 
 
 si+s| 
 
 S|4Sj 
 
 sj+sj 
 
 St+Ss 
 
 (D 
 
 (ROUTE: DISTANCE 2, LEFT; ADD) 
 
 4 
 
 Is? 
 
 i-1 
 
 e 
 
 Est 
 
 i-5 
 
 (D 
 
 (ROUTE: DISTANCE 4, LEFT; ADD) 
 
 256 , 
 
 © 
 
 PE 1 PE 2 PE 3 PE 4 PE 5 PE 6 PE 7 PE 8 
 
 Fig. 5. Beam forming: Vector-element sum 
 
BEGIN REAL ARRAY S(b:255|; 
 
 FOR (K) SEQ (0, . . . , 7) 
 
 BEGIN 
 
 J<-2tK 
 
 FOR (I) SIM (0, . . . , 255) DO 
 
 S[l]<-S[l]+S[(I+J)MOD 256] 
 
 END; 
 END: 
 
 Fie. 6. Vector-element sum: Tranquil code 
 
PE 1 PE 2 PE 3 PE 4 PE 5 PE 6 
 
 loc 1 
 loc 2 
 loc 3 
 loc 4 
 loc 5 
 
 i D i D -i D i D i° D 
 
 Sf o Si Si Si Si 
 
 c2 q2 q2 q2 q2 
 
 ^1 ^2 ^ ^3 ^4 ^ ^5 
 
 S Q Q Q Q 
 
 1 D 2 D 3 D 4 D 5 
 
 S^ C* Q Q Q 
 
 1 D 2 ^3 ^4 ^5 
 
 "BEAM PATTERNS" 
 
 D f = s{+S24-S 3 +S44-S5+ ... 
 
 o T = 5^^+52 -rS 3 +S 4 +S 5 + ... 
 
 O t = Si+S 2 +1 +S 3 +S 4 +1 +Sj+ ... 
 
 Fig. 7. Delay and sum "beam forming 
 
reg S 
 
 toe 1 
 
 reg S 
 
 loci 
 
 reg S 
 
 loc 1 
 
 PE 1 
 
 PE 2 
 
 PE 3 
 
 • • • 
 
 PE 255 
 
 PE 256 
 
 W 1 
 
 W 2 
 
 W s 
 
 • • • 
 
 W <S9 
 
 W 25« 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 w 2 
 
 w s 
 
 w 4 
 
 
 W 256 
 
 W 1 
 
 
 w^-s 1 
 
 W 2 S 1 
 
 W 3 S X 
 
 
 W ZS6 S 1 
 
 
 
 
 w 3 
 
 w 4 
 
 w 8 
 
 
 W 1 
 
 w 2 
 
 
 W 1 S 1 +W 2 S 2 
 
 W 2 S 1 +W 3 S 2 
 
 w'sHw^* 
 
 
 
 
 o+wJ-s 2 
 
 t = 
 
 t=l 
 
 t=2 
 
 regS 
 
 loci 
 
 w 1 
 
 w 2 
 
 w 3 
 
 
 w 255 
 
 w 256 
 
 
 256 
 
 Jw'S* = lst "GENUINE" OUTPUT 
 
 t-i 
 
 t=256 
 
 Vs ,+2 
 
 
 w t s t+i 
 
 : THIS SUM LACKS 
 
 ONLY THE TERM 
 W 256 S 257 AND W)LL 
 
 BE OUTPUT AT t=257. 
 
 Fig. 8. Convolution 
 
PE MEMORY AFTER MINUS ONE " OPERATIONS" 
 
 loc 1 
 loc 2 
 loc 3 
 loc 4 
 loc 5 
 loc 6 
 
 A(0) 
 
 A(1) 
 
 A(2) 
 
 A(3) 
 
 A(4) 
 
 A(5) 
 
 A(6) 
 
 A(7) 
 
 A(8) 
 
 A(9) 
 
 A(I0) 
 
 A(11) 
 
 A(12) 
 
 A(13) 
 
 A(14) 
 
 A(15) 
 
 1° 
 
 f° 
 
 1° 
 
 1° 
 
 f° 
 
 ■° 
 
 1° 
 
 i° 
 
 1° 
 
 i° 
 
 1° 
 
 1° 
 
 f 4 
 
 i 4 
 
 f 4 
 
 w 4 
 
 W° 
 
 1° 
 
 I 4 
 
 ■ 4 
 
 w 2 
 
 • 2 
 
 • 8 
 
 i 6 
 
 1° 
 
 f 4 
 
 w 2 
 
 •' 
 
 ■' 
 
 « 5 
 
 • 3 
 
 w 7 
 
 loc 1 
 loc 2 
 
 multiply: loc 2* (loc 2) x (loc 3) 
 
 A(0) 
 
 A(1) A(2) A(3) A(4) A(5) A(6) 
 
 A(7) 
 
 A(8)W° A(9)l° A(10)i° A(11)f° A(12)t° A(13)f° A(14)l° A(15)*° 
 
 loc 1 
 
 loc 2 
 
 loc 1 
 
 loc 2 
 
 loc 1 
 
 loc 2 
 
 loc 1 
 loc 2 
 
 add, subtract: loc l« loc 1 + loc 2 
 loc 2* loc 1 - loc 2 
 
 A^O) A^l) A t (2) A ( (3) A t (4) A^S) A ( (6) A ( (7) 
 A,(8) A,(9) A t (10) A 1 C 1 1 > A t (12) A ( (13) A ( (14) A, (15) 
 
 sw itch: 
 
 A^O) Ajd) A ( (2) 1^3) A ( (8) A ( (9) A, (10) A^ll) 
 A,<4) A,(5) A ( (6) A,<7) A, (12) A ( (13) A,(14) A ( (15) 
 
 multiply: loc 2* (loc 2) x (loc 4) 
 
 A t (0) A t (1) A ( <2) A ( (3) A t (8) A ( (9) A ( (10) A^ll) 
 A (4)*° A (5)^ A (6)^ A (7)^ A (12)! 4 A (13)! 4 A (M)f 4 A ( (15)«f 4 
 
 add, subtract: loc 1* loc 1 ♦ loc 2 
 loc 2* loc 1 - loc 2 
 
 A 2 (0) A 2 (1) A 2 (2) A 2 (3) A 2 (8) A 2 (9) A 2 (10) * 2 <11) 
 A 2 (4) A 2 (5) A 2 (6) A 2 (7) A ? (12) A 2 (13) A ? (14) A ? (15) 
 
 Fig. 9« Cooley-Tukey algorithm 
 
sw i tch 
 
 loc 1 
 loc 2 
 
 * 2 (0) 
 * 2 (2) 
 
 * 2 (D 
 A 2 (3) 
 
 A 2 (4) A 2 (5) A 2 (8) A 2 (8) 
 
 A 2 (12) 
 
 A 2 (6) 
 
 A 2 (7) 
 
 A 2 (10) A 2 (11) A 2 (U) 
 
 A 2 (13) 
 A 2 (15) 
 
 © 
 
 multiply: loc 2-+— (loc 2) x (loc 5) 
 
 loc 1 
 loc 2 
 
 A 2 (0) A 2 (1) A 2 (4) A 2 (5) 
 
 A 2 (8) A 2 (9) 
 
 A 2 (12) 
 
 A 2 (2)W° A 2 (3)i° A 2 (6)f 4 A 2 (7)l 4 A 2 (10)» 2 A 2 (11)f 2 A 2 (14)i 8 
 
 A 2 (13) 
 AjdS)* 6 
 
 © 
 
 add, subtract: loc 1« 
 loc 2< 
 
 ■loc 1 + loc 2 
 ■loc 1 - loc 2 
 
 A 3 (0) A 3 (1) A 3 (4) A 3 (5) A 3 (8) 
 A 3 (2) A 3 (3) A 3 (6) A 3 (7) * 3 (10) 
 
 A 3 (9) A 3 (12) A 3 (13) 
 A 3 (11) A 3 (14) A 3 (15) 
 
 © 
 
 sw i tch: 
 
 loc 1 
 loc 2 
 
 A (0) A (2) A (4) A (6) A (8) 
 
 3 9 3 3 3 
 
 A 3 (D A 3 (3) A 3 (5) A 3 (7) A 3 (9) 
 
 A (10) A (12) A (14) 
 3 3 3 
 
 A 3 (11) A 3 (13) A 3 (15) 
 
 multiply: loc 2+ (loc 2) x (loc 6) 
 
 A 3 (0) A 3 (2) A 3 (4) A 3 (6) A 3 (8) 
 
 A (1)W° A (3)W 4 A (5)W 2 A (7)W 6 A (9)*' 
 
 A 3 (10) A 3 (12) A 3 (14) 
 
 A 3 (11)W 5 A 3 (13)W 3 A 3 ( 1 5 ) W 7 
 
 © 
 
 add, subtract: 
 
 loc 1 
 loc 2 
 
 A 4 (0) A 4 (2) 
 
 A 4 (D A 4 (3) 
 
 A 4 (4) A 4 (6) A 4 (8) 
 
 A/5) A 4 (7) A 4 (9) 
 
 A 4 (10) A 4 (12) A 4 (14) 
 
 A 4 (11) A 4 (13) A 4 (15) 
 
 PE 1 
 
 PE 2 
 
 PE 3 
 
 PE 4 
 
 PE 5 
 
 PE 6 
 
 PE 7 
 
 PE 8 
 
 Fig. 9- (continued)