LIBRARY OF THE 
 
 UNIVERSITY OF ILLINOIS 
 
 AT URBANA-CHAMPAIGN 
 
 UlCbr 
 v/Y\.O.S»53-SS?> 
 
 QO^ 
 
 z 
 
CENTRAL CIRCULATION AND BOOKSTACKS 
 
 be charged a minimum fee or * 
 
 each non-returned or o ^. _ ^ 
 
 Theft, mufilation. o. de ^^'^materiais owned by 
 causes for s»uder,t «"««P ,, ""» ""'"•«,. prop erry of .he S.a.e 
 , he University of HlinoU J L.br«v « »" . P JP ^ ^^ 
 of Illinois and are protected by Art. 
 UW and ^edure^ ^ 333.^, 
 
 Unl ver^ J EL« = r-'^^Z s 
 
 £UUl 
 
 JUL i I ZUU1 
 
 When renewing by phone, write new due d£ 
 below previous due date. 
 
Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/memoryprocessorc557lawr 
 
^-^br Report No> uiUCDCS-R-73-557 
 
 y\c.ssi 
 
 -$& 
 
 ^ 
 
 ^ 
 
 0* 
 
 MEMORY- PROCESSOR CONMICTION NETWORKS 
 
 by 
 
 ■?& 
 
 k c* 
 
 ov^ 
 
 V 
 
 o^ 
 
 Duncan Hamish Lawrie 
 
 February 1973 
 
Report No. UIUCDCS-R- 73-557 
 
 MEMORY-PROCESSOR CONNECTION NETWORKS 
 
 hy 
 
 Duncan Hamish Lawrie 
 
 February 1973 
 
 Department of Computer Science 
 University of Illinois at Urbana- Champaign 
 Urbana, Illinois 6l801 
 
 This work was supported in part by the National Science Foundation 
 under Crant No. US NSF CJ 2 r jkk6 and was submitted in partial ful- 
 fillment of the requirements for the degree of Doctor of Philosophy 
 in Computer Science, February 1973 • 
 
MEMORY-PROCESSOR CONNECTION NETWORKS 
 
 Duncan Hamish Lawrie, Ph.D. 
 Department of Computer Science 
 University of Illinois at Urbana- Champaign, 1973 
 
 In order to utilize the potential speed of a SIMD type parallel 
 processor it is necessary to arrange data in the memory system so that sub- 
 sets of this data can be fetched in parallel without memory conflicts. Addi- 
 tionally, we must provide sufficient memory-processor paths to allow the data 
 to be correctly aligned with the processor array. In this paper we present 
 several storage mapping algorithms together with a memory-processor inter- 
 connection network. We demonstrate the cost and effectiveness of these and 
 compare them with other networks which have been proposed for this application, 
 
Ill 
 
 ACKNOWLEDGMENT 
 
 The author would like to thank Professor David J. Kuck for his 
 patience and inspiration, the Department of Computer Science for its support, 
 and the ILLIAC IV Project for a long and usually fruitful relationship. 
 
 The diligent efforts of Vivian Alsip and Barbara Armstrong in 
 typing this document and the workmanship of the printing and drafting de- 
 partments are greatly appreciated. 
 
IV 
 
 TABLE OF CONTENTS 
 
 Page 
 
 1. INTRODUCTION 1 
 
 2. ft NETWORKS, THEORETICAL DEVELOPMENT 5 
 
 2.1 Fundamentals . 5 
 
 2.2 The t Relation and its Properties 9 
 
 2.3 Construction of a Network from ft l8 
 
 2.k The Equivalence of H(ft) and ft 2^4 
 
 2.5 Minimal ft Networks 29 
 
 2.6 Summary 31 
 
 3. EFFECTIVENESS OF ft NETWORKS 33 
 
 3-1 Representation of Array Storage Schemes and N-Vectors. . . 35 
 
 3-2 Important N-Vectors 37 
 
 3.3 Memory Conflicts, Memory and Vector Equations, and the 
 
 Network Connection Equation 38 
 
 3.4 Effectiveness of ft Networks ^1 
 
 3.5 Additional Considerations, Cost 69 
 
 3-6 Summary 7° 
 
 k. CONSTRUCTION OF SEVERAL ft NETWORKS 72 
 
 k.l A Bit Serial ft Network 72 
 
 k.2 A Better Network 7& 
 
 k.3 Processor-Processor Connections °7 
 
 k.k Summary 92 
 
 5 • CONSTRUCTION AND PROPERTIES OF OTHER NETWORKS 95 
 
 5.1 Uniform Shift Networks 95 
 
 5-2 The Barrel Shifter 100 
 
 5-3 The Batcher Network 10° 
 
 5-h The Benes Network 10l+ 
 
 5.5 The Crossbar Switch 1Q 6 
 
 5-6 Network Comparison ' 1°" 
 
 6. CONCLUSION 112 
 
 LIST OF REFERENCES lll+ 
 
 VITA 115 
 
1. INTRODUCTION 
 
 It has been suggested that the use of a large number of processors 
 operating on a single program can be used to produce improvements in speed 
 and possibly cost effectiveness. Recent experiments with this concept have 
 indicated a number of problems which are as yet unsolved. The purpose of 
 this paper is to investigate a possible solution to one of these problems, 
 the problem of how to provide the processors with data at a rate which matches 
 the processors' speed. 
 
 It is clear that given N processors, we need on the order of N 
 memories just to provide the raw bandwidth requirements. Given an order of N 
 memories, we must still solve two problems. The first problem is to assign 
 the data to memory in such a way that memory conflicts are minimized; that 
 is, the data must be arranged so that during the fetch of any required set of 
 N data elements, no two of these will be stored in the same memory. This will 
 be called the memory assignment problem. Once the data has been fetched, each 
 datum must be sent to the correct processor. This is the second problem which 
 we will call the data alignment problem. 
 
 Independent solutions to the above problems have been suggested in 
 the literature and elsewhere, but to the best of our knowledge no practical 
 and compatible solution to both problems has ever been published. For example, 
 (±1,±/n), and barrel shifters have been proposed (and built) to solve the 
 data alignment problem [ 1 ] and memory assignment algorithms have been pro- 
 posed to solve the memory assignment problem [2, 3 ] • Unfortunately, the 
 better solutions of the latter generally require alignments which are more 
 complex than simple uniform shifts (see Chapter 5)- More complex networks 
 have been proposed [.'i, 5] , but their high cost or excessive switching time 
 
make them impractical for our purposes. 
 
 In this paper we present several effective memory assignment 
 algorithms and propose a new network which can meet the resulting alignment 
 requirements. We begin by proving some useful results about these networks 
 and then showing their effectiveness in handling certain important align- 
 ments. We follow this with a comparison with other solutions and show that 
 our solution is more effective than these previous solutions. 
 
 Before continuing we will define some terms and assumptions which 
 will be implicit in the following work. By a network we mean a device which 
 has n input and n output ports. Each datum entering the network remains 
 integral, but the ordering of the data may be altered by the network. We 
 assume that a different ordering (or permutation) of the inputs is required 
 for each successive set of inputs, i.e., a different switching of the network 
 is required for each set of n or less inputs. Thus, the time required to 
 switch the network is a significant factor. 
 
 This network may be thought of as lying between the N processors 
 and M memories as in Figure 1, as a separate function box available to the 
 processors as in Figure 2, or as an integral part of the processors. 
 
 In general, we assume that at each unit of time, N or fewer of the 
 processors require one datum each. These data are fetched and aligned with 
 the correct processors. We shall also assume that these data are taken from 
 a uniform part of an N x N matrix, e.g., row, column, diagonal, etc. We will 
 define this more clearly in Chapter 3. 
 

 
 
 • • • 
 
 
 
 
 
 
 
 
 
 ALIGNMENT 
 NETWORK 
 
 
 
 
 
 
 
 
 
 
 
 
 • • • 
 
 
 PROCESSORS 
 
 MEMORIES 
 
 iCure ] . l'rooer.jor-Network-Memory System 
 
ALIGNMENT 
 NETWORK 
 
 
 
 
 
 
 A Mi A 
 
 
 
 
 
 
 • • W 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 PROCESSORS 
 
 MEMORIES 
 
 Figure 2. Network- Processor-Memory System 
 
2. fi NETWORKS, THEORETICAL DEVELOPMENT 
 
 The purpose of a switching network, variously known as a 
 permutation, indexing, connection, or sorting network, is to produce 
 connections between input nodes and output nodes. The network may he 
 changed or switched by means of control signals in such a way as to 
 alter these connections to meet various requirements. In this chapter 
 we will investigate in some detail a special class of network, hence- 
 forth ft networks, which we feel are especially useful for connections 
 between a vector of memories and a vector of processors. 
 
 ft networks are based on a mathematical notion of a "ft base" 
 representation of integers. We begin by investigating some special 
 properties of an ft base system. We then proceed to show how to build 
 a network from an ft base and then using our results we show how to 
 determine whether the ft network can produce certain special connections. 
 
 In later chapters we will derive a number of connections which 
 are frequently required during operation of a parallel computer. Then, 
 using the results derived in this chapter, we will show that the ft 
 network performs well in certain types of parallel computers. 
 
 2.1 Fundamentals 
 
 We will begin with some fundamental definitions from number 
 
 theory. Any number x can be expressed as the sum of a multiple of a 
 modulus m and a remainder r : 
 
 x = mb + r r<m 
 
Given two numbers 
 
 x = nib + r 
 x x 
 
 y = nib + r 
 
 J y y 
 
 we say x = y if and only if r = r . Further, we say 
 
 * y 
 
 m 
 
 m 
 x = y 
 
 if and only if b = b . We say x divides y 
 x y 
 
 x|y 
 
 if there exists an integer a such that ax = y. Further, whether or not 
 x divides y, we define 
 
 XTy 
 to be the integer part of the quotient x/y. 
 
 The greatest common divisor of x and y, 
 gcd(x,y) 
 is the largest integer a such that 
 
 a|x and a|y. 
 If gcd(x,y) = 1, we say x and y are relatively prime, x is prime to y, or 
 y is prime to x. 
 
 The greatest prime factor, written 
 gpf(x,y), 
 is the greatest integer factor (divisor) of x which is prime to y. Thus, 
 
 gcd(gpf (x,y) ,y) = 1. 
 
 We now present a number of results. Most proofs can be 
 found in Vinogradov [6] or Shanks [7]. 
 
PI 
 
 P2 
 P3 
 PU 
 
 P5 
 
 P6 
 
 PT 
 
 P8 
 
 P9 
 P10 
 
 x+a = y+a <--* x = y 
 m m 
 
 x = y -> ax e ay 
 m m 
 
 gcd(a,m)=l A ax = ay -> x e y 
 m m 
 
 a|m A ax e ay + x e y 
 m m/a 
 
 a|n A x E y ^ x = y 
 m m/a 
 
 m 
 x ^yAxEy + x^y 
 
 am m 
 
 x = y A (0 <_ x,y < m/a) ■*■ x = y 
 m/a m 
 
 x = y -* ax E ay 
 
 m am 
 
 m/a m 
 
 a|mAx e y->xEy 
 
 a-xEa-y-e+xEy 
 
 m m 
 
 am m 
 
 x E y -+ x^a E y^-a 
 
 Pll 
 
 m am 
 
 P12 x J y<--+ ax f ay 
 
 We will tend to use these results in their rontraposit ive form. 
 
 That is , 
 
 P2 
 
 ax i ay -+ x £ y 
 m m 
 
 Let R be an ordered set of factors of 
 n 
 
 n, R n = {p 1 , P 2 ,..., P k 
 
 such that p. x p_x...X p. = n. Define ft (R ) to be the set 
 1 2 K n 
 
 ft(R ) = lw, |w. = 1, cj. = ix).^.. x p..-, < i < k-1}, so co. = Ji p 
 
 n L ' k l l+l l+l — i . . . , 
 
 j=l+l 
 
 JNote that oj = n. 
 o 
 
 In general, any number x < co can be expressed as 
 
 x = l x. X co . , 
 , , i l 
 
 i=l 
 
8 
 
 where x. < p . . We say that x x x ... x^ is the "base ft" representation 
 of x. 
 
 Thus, if ft = {8,U,2,l}, then we get the familiar base 2 
 number system while if ft = {12,6,3,1}, then we get a somewhat unusual 
 (but useful) number system where 
 
 °10 = 000 ft 
 
 X 10 = 001 ft 
 2 10 = 002 ft 
 3 10 = 010 ft 
 u io = °1^ 
 5 10 = 012 ft 
 6 10 = 100 ft 
 
 7 10 
 
 - 
 
 101ft 
 
 8 10 
 
 = 
 
 l 02 ft 
 
 9 10 
 
 = 
 
 ll°ft 
 
 10 io 
 
 = 
 
 111ft 
 
 U 10 
 
 = 
 
 1120, 
 
 Notice that 
 
 0) . 
 
 J 
 
 x 
 
 7 "* X i = y i 1 1 i 1 J 
 
 and 
 
 x = y -> x. = y. j + 1 ^ i <_ k. 
 J 
 For example, if R n = (2,5,3), then ft(lT) = {30,15,3,1}, and 
 
 5 10 = 012 ft 
 U 10 = 011 a 
 
 1T 10 = 102 ft 
 
So 
 
 5 ~= h (012, s 011^) 
 
 5 5 17 (012^ = 102 ) 
 
 5 ? IT (012^ 4 102 ). 
 15 -fi 15 -^ 
 
 In most of the following work, when R is understood or not 
 important we will refer simply to n rather than R(R ). Further, the 
 unsubscripted lower case letters x, y, z, s, d are usually variables 
 whose range is the set of non-negative integers. The notation 
 {f(x) Condition on x} denotes the set of all numbers in the range of 
 the function f where the domain of f is all values of x which satisfy 
 the specified condition. Thus, {2x:x >_ 0} is the set of all even, 
 non-negative integers. Subscripted letters usually refer to specific 
 values of the variable. 
 
 2.2 The t Relation and its Properties 
 
 An n-set represents a mapping or connection between n input 
 nodes and n output nodes . 
 Definition 1 : n-set 
 
 An n-set P is a set of pairs of integers, P = {(s,d)}. If 
 (s,d)e P, then we say P connects input s' mod n to output d mod n. □ 
 
 Thus, if P = {(2,1), (7,0), (2,0)} and n=U-, then P 
 represents the connection illustrated in Figure J. 
 
10 
 
 Input Output 
 
 Nodes Nodes 
 
 Figure 3 • A Connection 
 
 Definition 2 : m + P 
 
 Given an n-set P, we say the integer m passes P, written 
 m t P, if an only if for all (s.,d.), (s ,d.)e P 
 
 m 
 
 s . $ s . and s. ^ s J -> d. ^ d . D 
 
 1 n J - X m J x J 
 
 Definition 3 : a + P 
 
 Given an n-set P and a = {w =n, w-,,..., w, =1} , if 
 for all id e H, to t P, then we say 8 +.P. That is 
 
 n + P«-> (Voe fl, V(s, ,d ),azxd y(s ,d.)e P, s. f s . a s. e b a - d L] 
 
 11 J J x n J x 03 J i 3 
 
 The negative form of a A P is useful and perhaps its statement will make 
 the relation i clearer. 
 
 a f P +■■+ 3 to e a and 3(s ,d ),(s.,d.)e P, such that s. i s 
 
 n ° 
 
 oo 
 and s. e s. and d. e d . n 
 
 1 u) J x J 
 
 We will now demonstrate several functions f(P) on n-sets and 
 
 show that 
 
11 
 a + p •* a + f(p) . 
 
 Theorem 1 : Let P + c = { (s+c ,d) : ( s ,d)e P , c an integer constant). 
 
 Then q + p -». Q + p+ c 
 
 We shall prove the contrapositive form of this theorem: 
 
 ntP + c->^tP. We have 
 
 fi fP + c^ 3oj e fi, 3(s.+c, d ) , and 3(s .+c, d.) , such that 
 
 J J 
 
 (0 
 
 s. + c^s + e . s.+ces,+c, and d . S d . . But 
 
 i j i J i J 
 
 n co 
 
 PI 
 
 S. +C^S.+C"> S. ^S, 
 
 1 n J X n J 
 
 and* 
 
 PI 
 s. + c = s , + c -► S . = S , 
 
 1 J 1 J 
 
 so J] f P . D 
 
 Theorem 2 : Let P x a = { (xxa,d) : ( s ,d)e P, a an integer constant} 
 
 If for all co e JJ, gcd(a,co) = 1, then ft + P -> ft + P x a. 
 
 * The notation PI above the implication symbol means that the implication 
 follows from property PI found at the beginning of this chapter. 
 
12 
 
 Proof : 
 
 o,+ PXa + 3 W £0^ 3 (s xa,d .), and 3 (s-Xa,d.), such that 
 
 CO 
 
 s.xa^s^xa, s.xa^s.xa-j and d . Ed.. But 
 
 i J ^ ,, J 1 J 
 
 n w 
 
 P2 
 s.xa^s.x. a-* s. ^ s , and if gcd(a,co) = 1, then 
 1 n J X n J 
 
 P3 
 s.x aEs.x a ■* s. e s . . Thus , 0, * P . n 
 
 a) J X co J 
 
 Corollary 2.1 : If gcd(a,n) = 1, then ft+P + ft+Px a . 
 
 Proof : 
 
 We need only show that gcd(a,n) = 1 -*■ y w e ft, gcd(a,co) = 1 
 and then apply Theorem 2. Assume there exists an co e ft , such that 
 
 gcd(a,to) = b . 
 Then 
 
 b | a and b | co , 
 but recall that u|n (from the definition of ft ) , i.e., there exists an 
 integer c such that 
 
 ioc = n. 
 Since b | co then b | toe or b|n. Thus, gcd(a,n) ^_b, and so 
 
 gcd(a,n) = 1 -*■ V" £ ^> gcd(a,co) =1. □ 
 
13 
 Theorem 3 : Let c - P = { (c-s ,d) : (s ,d)e P} . Then ft + P ■*■ ft + c - P 
 
 Proof : 
 
 P10 
 
 As before, c-s. = c - s. -»■ s. = s. 
 
 i J i J 
 CO co 
 
 so Hfc-P^-flfP qi 
 
 Let P = {(ax + b, ex + d):0 <_x < E,} , where a,b,c and d are 
 integer constants . We will now prove a theorem which states that if 
 a,c and £ satisfy certain conditions with respect to ft, then ft + P. 
 
 Theorem h : 
 
 Let Pc {(ax + b, cx+ d)|0 < x < £} be an n-set and 
 
 define 
 
 h = gpf(c,n) 
 Y m = gcd(c,m) 
 
 n m = gcd(c,a Ym ) 
 
 6 m = gpf (a VV m/Y m ) 
 C m = gpf(a,m) 
 
 R n = { Pi>P 2 '"-' p k } 
 ft(R ) = {m =n,m ,m , . . . , nL=l} 
 
 \i = the largest m e ft such that gcd(a,m) = m. 
 If the following three conditions hold, 
 A: c <_ n 
 B: ah/c > 1 or £ < hn/c 
 
Ik 
 
 C : For all m e 9,, u <_ m < n 
 
 CI: a/(6 xi ) < 1, or 
 
 mm — 
 
 C2: £ < m z, /a, 
 — m 
 
 then n i P. 
 
 The proof will consist of two parts and several subparts. First, we 
 
 need a preliminary result. Let (ax + b, ex + d) , (ay + b, cy + d) be 
 
 any two elements of P. Then we will show 
 
 ax + b iay+b ■+ ex + d ^ cy + d 
 n n 
 
 We have 
 
 PI 
 ex + d = cy + d -> ex = cy 
 n n 
 
 P3 
 
 -> — E -£- where h = gpf ( c ,n ) 
 n 
 
 Now since c < n and h = gpf(c,n), then ^-| n . So 
 
 ?k 
 
 ■* x E y 
 hn 
 c 
 
 Multiplying by a 
 
 P8 
 ■+ ax = ay, 
 
 ahn 
 
 ah P5 
 
 Suppose ~ >_ 1. Then -> ax E ay. If — < 1, then since x,y < E < — 
 
 - c 
 
 (condition B), then ax, ay < =E. Thus, 
 
Finally, 
 
 15 
 
 PT 
 
 •+ ax = ay . 
 n 
 
 PI 
 ■> ax + b e ay + b . 
 
 n 
 
 Thus, 
 
 ax+b2"ay+b->-cx + d^ay + d. 
 n n 
 
 Now we will prove that if for all m e ft where u ^_ m < n, 
 
 condition CI or C2 holds, then for all m e ft 
 
 m 
 ax+b ^ ay + b Aax + b Eay+b->-cx + d^cy + d 
 n m 
 
 and thus !1 t P. Specifically, we shall prove the equivalent theorem 
 
 m 
 
 ax + b ^ay+b Acx + dEay + d->-ax + b ^ ay + b 
 n m 
 
 Using our previous result, we have from condition B 
 
 ax+b ^ay+b+cx+d^cy+d. 
 n n 
 
 Then, by P6 we get 
 
 Now 
 
 m P6 
 
 cx + d^cy + dAcx + dEay + d ■* cx + d^cy + d. 
 n m 
 
 PI 
 
 ■+ ex t cy 
 m 
 
 P8 
 
 -*■ -3- t -*- where y = gcd(c,m) 
 
 Y Y m 
 
 mm m 
 
 ^™ 
 m 
 
 n m = gcd(c,a Ym ). 
 
Let 
 
 16 
 
 v = c/ii , u = ay /n. 
 
 mm m m m 
 
 Then since 6 = gpf(u ,m/y ] 
 m OJ ^ m m 
 
 PU,3 
 
 u ex u cy 
 
 m -. m 
 
 + ^ 
 
 Y Y 
 
 m mu m 
 
 m 
 
 6 y 
 mm 
 
 P2 
 
 u ex u cy 
 
 m / m 
 
 But since 
 
 v Y v y 
 
 mm u m mm 
 m 
 
 6 y 
 
 m'm 
 
 u ay n 
 
 m m m a 
 
 v y Y n c c 
 
 mm mm 
 
 we have 
 
 ax f. ay 
 
 u m 
 m 
 
 5 Y 
 mm 
 
 Now suppose CI holds, i.e., u /& y < 1. Then 
 
 m m'm — 
 
 P5 
 
 -*■ ax f. ay 
 m 
 
 and 
 
 PI 
 
 ax + "b ^ ay + b 
 
 m 
 
 Suppose now that CI does not hold, i.e., u /S ■ y > 1, but C2 does hold, i.e 
 
 m mm 
 
 4 < ^m 
 
17 
 
 Let 3 = u /& Y • Then we have 
 m mm 
 
 ax ^ ay 3 > 1 
 
 P2 
 
 m 3m m 
 
 where 
 
 where 
 
 ? m = gpf(a,m). 
 
 mc 
 
 Since x < £ < , then ax, ay < mr 
 
 — a m 
 
 PT 
 and - ax , ay.- 
 
 mm m 
 
 P3 
 
 ->■ ax ^ ay 
 
 m 
 
 ^m = SPf(a,m) ■* gcdU m ,m) = 1 
 
 Finally, 
 
 PI 
 
 ■* ax + b ^ ay + b . 
 m 
 
 New, suppose p is the largest m e fi such that gcd(a,m) = m, and that 
 
 CI or C2 hold for all m, n > m >_ p. We will now show that for all 
 
 m < p, ex + b p cy + b . Let me fi be <_ p such that m = p/p as per the 
 
 definition of fi. Since gcd(a,p) = p, 
 
 ax + b = ay + b 
 
 P 
 
 must be true always. Since CI or C2 hold for p, we have 
 
 P 
 ax+b say+b->cx + d^cy + d. 
 
 P 
 
18 
 
 By P9 ; 
 
 y u/p 
 
 ex + d % cy + d -*■ ex + d t cy + d. 
 
 Since the original premise is true, then the final conclusion must be 
 
 true, and since m = U/p, 
 
 m 
 ex + d i cy + d. 
 
 Now, in the special case where u = n, then neither CI nor C2 
 
 need hold. But since gcd(a,n) = n, then 
 
 ax + b = ay + b 
 n 
 
 must be true always . Thus , ft \ P trivially . □ 
 
 Theorems 1, 2, and 3 tell us that if Q tP, then ft + P+c , 
 
 ft t c-P and if gcd(a,n) = 1, then ft i Pxa. However, they tell us nothing 
 
 about whether or not ft i P. Theorem k tells us that if P satisfies 
 
 certain conditions with respect to ft, then ft + P. We will use all of 
 
 these theorems later. First, we will present an algorithm for 
 
 constructing a switching network from a given ft. We will then conclude 
 
 this chapter by showing that if ft t P, then the network constructed from 
 
 ft can produce without conflict the input-output connections specified by P, 
 
 2.3 Construction of a Network from ft 
 
 Algorithm 1 : Construction of a network from ft. 
 
 Let ft(R ) be as defined earlier, i.e., ft(R ) is the set 
 n ' ' n 
 
 ft(R n ) = {oj.: w =1, u. = n p.,0<i<k-l}. 
 1 X j=i+l J 
 
 The network will consist of k stages numbered 1,2,..., k from left to 
 
19 
 
 right. The i stage will consist of n/p . crossbar switches*, each 
 switch being p. x P.« Refer to Figure k and number the inputs (left) 
 and outputs (right) of each stage from to n - 1. 
 
 Now, for 1 < i ^_ k - 1, connect output j of stage i to input 
 I of stage i + 1 where 
 
 I = (j t u i _ 1 ) x w i _ 1 
 
 + (j mod p . ) x w - 
 
 + (j mod Usj) * P i 
 
 (Recall x v y is the integer part of the quotient x/y) . 
 
 Finally, the actual network inputs must be connected to the 
 stage 1 inputs (actually this will be shown in the figures as a 
 renumbering of the stage 1 inputs). Let 
 
 R n = {p k' p k-l'"" p 2' p l } 
 
 where 
 
 and form 
 
 and 
 
 R n = {p lt p ,..., P k _ r P k K 
 
 fl(R ) = {(ii.- 0) =1, u).=co. p , •: i < k-1] 
 n i k l l+l l+l — — 
 
 fi(R n ) = (« 1 :« |p . 1 . V w i+1 p k-i+l' ° ^ i ^ k -^ 
 Define a function t(x) on •'■ x < n - 1 such that if x = (x x . .. r )", 
 then t(x) = (x ... x x ) fi . Now connect (or renumber) stage 1 input x 
 to x(x). □ 
 
 Figures 5, 6, and 7 are examples of networks constructed in this way. 
 
 * A crossbar switch is capable of producing any one-to-one or one-to- 
 many connection of inputs to outputs. An attempt to produce a many- 
 to-one connection results in a "conflict." 
 
20 
 
 STAGE i 
 INPUTS 
 
 STAGE i STAGE 1*1 
 OUTPUTS INPUTS 
 
 - 
 
 1 - 
 
 />■-!— 1 
 
 STAGE i+1 
 OUTPUTS 
 
 — 
 
 — 1 
 
 p f -I p ul -l 1 I /0, + i-l 
 
 P\ P\+i 
 
 pi+1 pi.i+1 
 
 *P\ "I 1 I 2/), -1 2/D I+l -l 
 
 I 2/>, +1 -l 
 
 n-^+1 
 
 n-1 
 
 n+/>,+l 
 
 n-1 n-1 
 
 A*iV>i* 
 
 #n> 
 
 '/».« 
 
 n-/> 
 
 i*l 
 
 n-1 
 
 STAGE 
 
 STAGE 1*1 
 
 Figure k. Stages of a Network Constructed from fi 
 
21 
 
 I 
 
 N 
 P 
 U 
 
 T 
 
 — 
 
 1 
 2 
 
 3 
 
 4 
 
 5 
 6 
 
 7 
 8 
 
 9 
 10 
 
 — 11 
 
 
 
 u 
 
 T 
 P 
 U 
 
 T 
 
 Figure 5. Network Constructed from fi = (12, 6, 2, 1} 
 
22 
 
 Figure 6. Network Constructed from Q, = {l8, 9, 3, 1} 
 
23 
 
 I 
 
 N 
 
 P 
 U 
 
 T 
 
 
 U 
 
 T 
 P 
 U 
 
 T 
 
 ire 7- Network Constructed from = (l8, 6, 3, 1} 
 
2h 
 
 Algorithm 2 : Network Control Algorithm : 
 
 Each (s,d) pair in P establishes a path through the network as 
 
 follows. Assume s and d are expressed in base fi notation, i.e., 
 
 s = s s ... s , d = id ... d. Then, starting at input s, for 
 
 i = 1, 2,..., k, each (s,d) pair "enters" a p. x p. crossbar switch and 
 
 is connected to output d. of that crossbar. Alternately, starting at 
 
 output d for i - k, k-1,..., 1, each (s,d) pair "enters" an output of a 
 
 p. x P. crossbar and is connected to input s n . , ., of that crossbar switch. □ 
 11 - k-i+1 
 
 Figure 8 shows the necessary switching of the fi = (l8, 6, 3, 1} 
 network for E = {(0,7), (0,9), (7,l6), (l6,13)> where 7 = 101^, 
 9 = HOjj, 13 = 201„, and 16 = 211 . Figure 9 shows the switching for 
 P = {(12,15), (15, l6)>, where 15 = 210 , and l6 = 211„ . Note the 
 conflict at the output of stage 2: H(fl)f P and 9, f" P. 
 
 2.k The Equivalence of H(°Q and fl 
 
 It remains for us to show that this network is in some way 
 equivalent to fi. That is, we would like to show that if 0, t P, then 
 the network constructed from 0, "passes" P, and conversely. 
 
 Each output of each stage of the 0, network is only accessible 
 to elements of P having certain characteristics. This follows from the 
 construction algorithm and the control algorithm. Let 
 (xx... x a ... a ) represent the set of all integers z such that 
 
 Z of, (£ Vl V2 ••' \ } fi' 
 
 J 
 
 and (a a ... a. xx... x) represent the set of all integers z, such that 
 
 03 . 
 
 z -= (a^... a. 00... 0) Q . 
 
25 
 
 I 
 
 N 
 P 
 U 
 
 T 
 
 
 
 u 
 
 T 
 P 
 U 
 
 T 
 
 Figure 8. fi = (l8, 6, 3, D Switching for P = {(0,7), (0,9), (7,l6), 
 (16,1-))} 
 
26 
 
 N 
 P 
 U 
 
 T 
 
 
 
 u 
 
 T 
 P 
 
 u 
 
 T 
 
 Figure 9- Switching with Conflict for P - {(12,15), (15, l6)} 
 
27 
 
 Finally, (xx... x s s ... s , id ... d. xx. . . x) represents the 
 set of all (s,d) such that 
 
 S ' (S J + 1 S j + 2 ••• S k } fi 
 J 
 
 and 
 
 w . 
 d = ( dl d 2 ... d. 00... 0) Q . 
 
 Now each output of each stage of the network can "be labeled with the 
 representation of the set of (s,d) pairs which are accessible to that 
 output. This is shown in Figure lOfor a specific network. 
 
 Theorem 5 : 
 
 Let H(Q) -t- P denote the fact that the network H(fi) constructed 
 from 9, passes or connects P. Then Q + P -*--* H(fi) + P 
 
 Proof ; 
 
 Assume H(p») t P. Then there must be a conflict at the output 
 
 of one of the crossbar switches. That is, two pairs (s ,d ), 
 
 (s ,d )e P, s i Sp, must be trying to use the same output. Without 
 n 
 
 loss of generality, assume this output is labeled 
 
 (xx... x a a ... a k , b.^ . . . b xx. . . x) . 
 
 Then (s ,d ) and (s ,d ) must both be elements of the set represented 
 by this output label. This implies 
 
 s 1 = s 2 and ^ = d g . 
 
28 
 
 ooo 
 
 XXX, 001 
 
 I— 1 
 
 2 
 
 XXX, 010 
 
 i XXX .011 
 
 - 4 
 
 XXX, 020 
 
 XXX, 121 
 — 11 
 
 Figure 10. Network for ft = {12, 6, 2, 1} with Stage Outputs Labelled 
 with (s, d) Classes 
 
29 
 But this implies Q, f P. So we have 
 
 H(n) t"p -* fi ~ p 
 
 or 
 
 ti + P + H(R) + P. 
 
 By a similar argument it is easy to show that if ft t P, 
 
 then H(Q) t P> a nd thus we have 
 
 fl t P «--*■ H(fi) + P. □ 
 
 Thus far in this chapter we have defined and explored a 
 
 relation + between a special kind of set ft(R ) and an arbitrary set 
 
 of pairs P which represents a set of connections between two sets of 
 
 nodes. We have shown how to construct a switching network H(ft(R )) 
 
 n 
 
 from an ft(R ), and we have shown that if 0, i P, then E(Q) + P and 
 conversely. Thus, given a network H(ft) and an n-set P we have 
 some useful tools for determining whether or not H(fi) can produce the 
 connections specified by P. 
 
 2 . 5 Minimal fl Networks 
 
 Before leaving this chapter we will present three more results 
 which will be needed in later chapters. 
 Let 
 
 R n = {p liP ..... p ±1 P i+1 ,..., P k > 
 
 and 
 
 R n = W'V-' p i X p i + l'--" p k } 
 
 Thus, R has k elements while R' has k - 1 elements, 
 n n 
 
 Theorem 6: fi(R ) t P ■* fi(R') + P 
 n n 
 
30 
 
 Proof : 
 
 Assume fi ? T P. Then 3 co e fi f and 3(s ,d ), and 3(s ,d )e P 
 
 w 
 such that s $ s and s P s and d = d . But then fit P, so we have 
 
 n w 
 
 fi(R ) + P -»■ n(R') t P D 
 
 n n 
 
 Theorem 7: The network constructed from fi(R ) has the same 
 n 
 
 number or fewer gates than the network constructed fromfi(R' ). 
 
 n 
 
 Proof- 
 
 2 
 A p x P crossbar switch has order of P gates . So the 
 
 number of gates in H(fl) is just 
 
 k 2 k 
 
 ;(fi) = S (p. ) n/p . = n Z p . , 
 i=l i=l 
 
 while the number of gates in H(fi') is 
 
 g(fl') = p n + p n +...+ p. n + p.p. n +...+ p n n. 
 1 2 l-l ii+l k 
 
 Since for p . , p . . > 1, 
 l l+l 
 
 p p > p + p 
 
 i M i+l - M i p i+l 
 
 it follows that 
 
 g(0) i g(n'). □ 
 
 Notice the special case of Theorem 7 where p. = p. = 2. Then 
 
 s(n) = g(n f ) 
 
 since p. + p. +i - p . p . +r 
 
 Theorem 8 : Minimal Network 
 
 Let p 1 ,p 2 ,..., p^ be a prime factorization of n; that is, 
 P 1 P 2 P 3*'* p k ~ n ' and p i P rime numbers. Then the network constructed 
 
31 
 
 from 
 
 fl = £u i :« 1 = p i+1 U . +1 , < i < k - 1, u = n} 
 
 is minimal for n in terms of gates . 
 
 Proof : 
 
 This follows easily from Theorem 7. If 
 
 Qi = {^.-u). = p. +i u> i+1 , o < i < k - 1, a, = n}, 
 
 and for some i = I , p^ +l is not prime, i.e., p £+i = ab , then the 
 fl generated from p^.pg,..., P^ , a,b ,p £+1> . . . , p k yields a network with 
 equal or fewer gates than fl'. Continued application of this until 
 each p i is prime, therefore, results in a minimal network. □ 
 
 2.6 Summary 
 
 In this chapter we have developed some basic results relating 
 the concept of an Q set to a class of switching networks. In 
 particular, we showed how to construct a network from 0, and we proved 
 sufficient conditions on P relative to fi such that 11 t P. In review 
 
 Theorem 1: ft+P+fttP+c 
 
 Corollary 2.1: gcd(a,n) =1, fi + P-»-n + Pxa 
 
 Theorem 3 
 Theorem 5 
 Theorem 6 
 Theorem 7 
 Theorem 8 
 
 ft+P + ft+c-P 
 fl + P •«"»■ H(fl) + P 
 
 n + p -*■ n 1 + p 
 g (n) < g(fi') 
 
 If all p . e R are prime numbers , 
 in 
 
 then H(fl(R )) is a (gate) minimal network 
 n 
 
32 
 
 In the following chapters we shall examine n-sets which 
 are needed in computer programs, and we will examine in more detail 
 the construction of some networks. In particular, we will compare 
 ft network with other classes of networks in terms of capability and cost 
 
33 
 
 3- EFFECTIVENESS OF fi NETWORKS 
 
 One of central problems in utilizing parallel computers is in 
 alignment of data in the memory system. In this paper we are restricting 
 ourselves to the primary memory problem. Here the problem is two-fold: 
 
 1. If x and y are two data elements which are required at the 
 same time, then x and y should be stored in separate memory 
 modules in order to avoid a "memory conflict." 
 
 2. If x is required by processor p, then x should be stored in a 
 memory module which is "readily accessible" to p via the 
 processor-memory connection network. 
 
 For example, refer to Figure 11, which shows an N x N array stored "straight" 
 in N memories, i.e., element a. . is stored in memory j . An attempt to fetch 
 
 a column of this array would result in memory conflicts since the elements of 
 any given column are stored in one memory. In addition, data is only acces- 
 sible to processor p, < p < n-1 if the data is in memories p or p ±1 since 
 each processor is only connected to the three closest memories. 
 
 In order to solve the first problem, a number of special storage 
 schemes have been proposed (Budnik [2], Kraska [3], Muraoka [8]). In general, 
 these schemes place data in the memories in such a way that the most desirable 
 N-vectors can be fetched without memory conflict. 
 
 The purpose of this paper is to solve both of these problems simul- 
 taneously. We begin with the following assumptions. We assume that there are 
 N processors and M > N memories. The processors and memories are connected 
 together by an n x n network. The data is an N X N array where the (i,j)-th 
 element is logically in the i-th row, j-th column of the matrix space. 
 
^ 
 
 PROCESSORS 
 
 MEMORIES 
 
 N-l 
 
 Figure 11. An Example of an N X N Array Stored Straight in a Parallel 
 Memory-Processor System 
 
35 
 
 3 .1 Representation of Array Storage Schemes and ^-Vectors 
 
 Definition k : Memory Equation 
 
 We define the function ju(i, j ) to yield the memory number where data 
 with coordinates i, j is stored. D 
 
 For example, if an array is stored "straight", then u(l,j) = j, 
 whereas if it is stored with skew = 1, then ju(i, j) = i + j (mod M) . In general, 
 we will consider only linear equations u(i, j) = ax + bj + c(mod M), where a is 
 called the skew and b is the skip. 
 
 Memory equations for some of the more popular storage schemes are 
 given below: 
 
 straight : j 
 
 1-skew: i + j 
 
 VN-skew: 7n i + j 
 
 1-skew, 2-skip: i + 2j 
 
 2- skew, 1-skip: 2i + j 
 
 TN+1-skew: (i/N+l)i + j 
 
 Definition 5 : Vector Equation 
 
 The N processors operate on N-vectors of data where the x-th element 
 of the N-vector goes to the x-th processor. We define v(x) to be a function 
 which yields the i, j coordinates of the x-th element of the N-vector. The 
 vector equation for some of the common N-vectors are shown below: 
 
 X 
 
 All arithmetic is mod M. 
 
 1 arithmetic is mod N. N is assumed to be a perfect square where necessary. 
 
36 
 
 1 . row : 
 
 v(x) = 
 
 <V J 'o + x) 
 
 column: 
 
 v(x) = (i + x, j ) 
 
 forward diagonal: v(x) = 
 
 (i Q + x, j + x Q ) 
 
 reverse diagonal: v(x) = (i + x, j - x ) 
 
 Tijx 7k partition: v(x) 
 
 N broadcast: 
 
 v(x) - 
 
 -/n row broadcast: v(x) = 
 VN column broadcast: v(x) = 
 
 (i + x -j- -/n, j + x mod -/n) 
 v o o 
 
 <w 
 
 (i o ,J o +(x * Vn) X Vw) 
 
 (i Q +(x -f Vn) X t/S, j Q ) D 
 
 We should make it clear at this point just what the vector equation 
 means . Notice that i , 3 in the above equations determine the first element 
 
 of the N-vector. For example, if i =1 and j = 2, then the row equation 
 
 specifies the row beginning with element (1,2). Thus, each vector equation 
 above actually represents a class of vector equations and a particular equa- 
 tion is determined by a choice of i and j . An N-vector is really repre- 
 sented by a function v(x, i , j ) for all x. For the sake of brevity, we will 
 
 delete the subscript o on i and j in the following pages. 
 
 Combining a vector equation representing a given N-vector with the 
 memory equation which determines the storage map for the data, we get 
 
 m(v(x)) : < x < N-l , 
 
 which is the memory in which the x-th element of the N-vector is stored. The 
 set 
 
 {ju(v(x)) : < x < N} 
 
 thus represents the set of memories which contain the given N-vector. 
 
37 
 
 Later we will define a "network connection equation" which deter- 
 mines the memory-processor connection which is required for N processors to 
 access a given N-vector in a particular order. This network connection equa- 
 tion will depend primarily on the memory and vector equations. We will then 
 evaluate various connection networks to determine whether or not they can 
 produce the necessary or common memory-processor connections. Obviously then, 
 we must first know which N-vectors are necessary or common. 
 
 3.2 Important N-Vectors 
 
 Let us assume for a moment that we intend to build a parallel 
 processor and that we know exactly which problems will be solved on this 
 machine. We could then analyze these problems to determine which N-vectors 
 are important. Using these results, we could then determine the effectiveness 
 of any given connection network in producing the connections determined by 
 these N-vectors. 
 
 Unfortunately, there are several problems here. First, we do not 
 have a specific problem set in mind. Second, the determination of important 
 N-vectors from a given problem set is not trivial. In fact, there generally 
 exists more than one algorithm for solving any given problem and each algorithm 
 may require different types of N-vectors. For example, one algorithm for 
 solving the fast Fourier transform requires very strange memory-processor 
 connections (Pease [9l) but a trivial change in this algorithm requires only 
 simple shifts (Stevens [10]). 
 
 So where does this leave us? It is our purpose only to present a 
 new connection network and show it is a plausible solution to the memory- 
 processor connection problem. Thus, we will only consider those vector equa- 
 tions and memory equations listed above. We must leave it to the reader to 
 
38 
 
 apply the results of the previous chapter in evaluating the effectiveness 
 of fl networks on other vector-memory equations. 
 
 The vector equations listed above represent some of the N-vectors 
 which are most frequently encountered in programming the ILLIAC IV computer. 
 Of these, the first two, rows and columns, are so important that we will 
 disregard any system which cannot handle them. Diagonals are also important 
 in many applications. /n partitions represent any submatrix of size equal 
 or less than -/n X VN. Of course, sometimes rectangular partitions are 
 referenced where one of the dimensions is greater than VN , but we will not 
 consider them here. N broadcast is simply one element broadcast to all the 
 processors, -/n broadcast is every -/if-th element of a row or column sent to 
 -/n consecutive processors. This pattern is frequently encountered when, 
 for reasons of storage efficiency, a large matrix is partitioned into 
 -/n x t/n blocks and operations are performed in parallel on these blocks. 
 
 3-3 Memory Conflicts, Memory and Vector Equations, and the Network Connection 
 Equation 
 
 So far we have a memory equation and a vector equation. Thus, the 
 
 memory where the x-th element of an N-vector is stored is simply jh(v(x)). 
 
 Definition 6: Memory Conflict 
 
 We say the N-vector v(x) can be fetched without conflict if and 
 
 only if x/y (0 < x, y < n) and ju(v(x)) = ju(v(y)) -»v(x) = v(y) . That is, 
 
 M 
 
 if two elements of the N-vector v(x) and v(y) occur in the same memory 
 element, then there will be a memory conflict unless v(x) and v(y) are the 
 same . D 
 
 Assume ju(v(x)) = ax + b, < x < £. A sufficient condition that 
 v(x) can be fetched without conflict is the following. 
 
39 
 Theorem 9: If iu("v(x)) = ax + b, < x < £, and if 
 
 £ < M a/a 
 where a = gpf(a,M), 
 then the £ -vector v(x), < x < I, can be fetched without conflict, 
 
 Proof : 
 
 We need only to show that for all < x, y < £, 
 
 Xjty->ax + b^ay+b . 
 M 
 
 / ¥a 
 
 We have x f y and x,y < £ < — <M 
 
 M 
 
 ■Pk 
 
 a / a 
 
 _» - x ^ r. y 
 
 a 
 
 P7 
 
 -» - x if - y since x,y < 
 
 a M a a 
 
 M_a 
 
 P3 
 
 -> ax ^ ay since a = gpf(a,M) 
 
 M then gcd(o,M) = 1 
 
 PI 
 
 -> ax + b ^ ay + b □ 
 
 M 
 
 Next we need to specify the connection between network ports and 
 memories. 
 
 Definition 7 : Memory Connection Equation 
 
 We define l(m) to be the network port connected to memory m. □ 
 In most cases l(m) = m. 
 
>+0 
 
 Thus, the network port at which the x-th element of an N-vector will 
 appear is simply 
 
 Ku(v(x))). 
 
 Now we have said that the x-th element of the N-vector must go to processor 
 x. In order to determine the output port to which this element must be sent 
 we must specify how the processors are connected to the network ports. 
 
 Definition 8 : Processor Connection Equation 
 
 We define $ (x) to be the port(s) attached to processor x. □ 
 
 In most cases <$> (x) = x. However, in case n = M = 2N, <f> (x) is multivalued 
 and may have one of the following two forms: 
 
 <f> (x) - 2x and 2x + 1 
 or $ (x) = x and x + N. 
 
 We now have enough information to specify the network connections 
 P = {(s,d)} required to send each element of the N-vector to its correct proces- 
 sor. 
 
 Definition 9 : Network Connection Equation 
 
 We define C(x) to be the (s(x),d(x)) connection required to route 
 v(x) from memory u(v(x)) through network input l(ju(v(x))) to output $ (x) and 
 thence to processor x. 
 Thus C(x) = (l(/x(v(x))), $(x)) . □ 
 
 Now what have we got? Recall in the previous chapter (Theorem k) 
 that if 
 
 Pc{(f(x),g(x)):0 <x < i) , 
 and if f, g and g satisfy certain conditions with respect to Q, then fitP and so 
 
kl 
 
 the network constructed from U can produce the connections specified by P. 
 Thus, in many cases we can easily show that a given ft network will pass C(x) 
 if C(x) satisfies Theorem k. 
 
 3 .k Effectiveness of ft Networks 
 
 In this section we will be concerned with showing the effectiveness 
 of a particular type of ft network for various memory and storage configurations. 
 First, we shall assume that N, the number of processors, is a power of two. 
 Second, we will assume that ft = {n, n/2, . .., 2, 1], where n, the "size" of 
 the ft network, is a power of two. This latter assumption is pragmatic. Recall 
 Theorem 8 of the previous chapter. If n is a power of 2, then ft = {n, n/2, ..., 
 2, 1} results in a network which is minimal in terms of gates. 
 
 Our choice of a power of two for the number of processors is not 
 so defensible. Suffice to say that investigation of any specific system 
 where N is not a power of two is beyond the scope of this paper. Additionally, 
 we only consider values of N < ^096, except where noted otherwise. 
 
 Tables I- HI show C(x) for various combinations of u(i, j), v(x), 
 l(m) and$(x). Shown along with C(x) are two numbers. The first of these, 
 memory cycles, is the maximum number of memory cycles required to fetch the 
 specified N-vector. For example, if 2 elements of the N-vector lie in the 
 same memory, then at least two cycles of the memory system will be required. 
 
 The second number shown with C(x) is the number of network cycles 
 required to pass the N-vector. For example, if ft f P but P = P, U P and 
 
 ft f P-,, ft t Pp then two cycles of the network are required. 
 
 •>ome cases, where N must have an integer square root, we assume N is a power 
 Of 'l . 
 
k2 
 In some cases the numbers given for memory cycles and net-work cycles 
 
 represent (not necessarily least) upper bounds or lower bounds. This is 
 indicated by the symbols < and > respectively. Table IV" is a summary of the 
 results presented in Tables I-III. The numbers shown in Table IV are the 
 maximum of memory cycles or network cycles. As we can see, of the configu- 
 rations investigated, only Table III-D can handle all 8 of the most frequent 
 types of N-vectors. Before discussing additional pros and cons of these con- 
 figurations, we will investigate one additional W-vector which sometimes occurs 
 
 Consider the f-vector consisting of the elements of small dxd sub- 
 matrices along the diagonal (see Figure 12). We assume that the first row of 
 the first submatrix goes in order to the first d processors, the next row of 
 
 this first submatrix to the second d processors, and so on [11] . 
 
 2 
 In all there can be at most N-s-d full dxd submatrices in an I -vector 
 
 2 2 
 where i < N. Thus, there are a total of I = (N-s-d ) x d elements. We have 
 
 2 2 2 2 
 
 v(x) = ( (x mod d )+ d + d(x*d ) + i, (x mod d ) mod d + d(x*d ) + j), 
 
 and for jLt(i, j) = (t/n+1)i + 2j (Table III-D) we get 
 
 u(v(x)) - (/w+l)[(x mod d 2 * d + d(x-fd 2 ) + i] + 
 
 2 2 
 
 2[ (x mod d mod d + d(x*d ) + j] . 
 
 This function has yet to yield to analysis. Instead, an exhaustive 
 check' was performed using a computer and assuming the same configuration as in 
 Table III-D for 
 
 W = k^, where k = 2, 3, k, 5- 
 
 The results are summarized in Table V. A second check was performed for 
 
 J2 = {2N, N/2, ..., k, 1} and the results were identical to those of Table V. 
 
^ 
 
 N processors 
 N memories 
 N x N network 
 $(x) = x 
 I(m) = m 
 
 u(i>J) - 3 (straight storage) 
 C(x) = (ju(v(x)), x) 
 
 v(x) 
 
 0) 
 
 -p 
 8 
 
 m(v(x)) 
 
 QJ >j 
 
 6 o 
 
 ?H w 
 O QJ 
 
 |S r-l 
 -P V 
 
 QJ >i 
 
 (:i, ;j+x) rows 
 
 
 j+x 
 
 1 
 
 1 
 
 (i+x, j) columns 
 
 a 
 
 j 
 
 N 
 
 1 
 
 fi+x, ;'mx) fwd. diag. 
 
 
 j+x 
 
 1 
 
 1 
 
 Ci ^x, ,j-x) rev. diag. 
 
 
 j-X 
 
 1 
 
 1 
 
 (i+(x+/I), j + (x mod /n) ) 
 /iix /ti partitions 
 
 b 
 
 j + (x mod -/n) 
 
 ■/n 
 
 1 
 
 (i,,i) 
 
 N broadcast 
 
 c 
 
 j 
 
 1 
 
 1 
 
 (i,J + (xWN) x Vn) 
 
 AT row broadcast 
 
 d 
 
 j + (xWn)x/n 
 
 1 
 
 1 
 
 M ! 1 :•■ : /N ) X /N, j ) 
 
 /i7 column broadcast 
 
 e 
 
 
 
 /n 
 
 1 
 
 Table I-A 
 
kk 
 
 N processors 
 
 N memories 
 
 N X N network 
 
 $(x) = x 
 
 l(m) = m 
 
 ji(i, j) = i+j (skewed storage) 
 
 C(x) = (n(v(x)), x) 
 
 v(x) . 
 
 0) 
 
 ■p 
 
 o 
 
 ^(v(x)) 
 
 fn 0) 
 
 a o 
 
 S o 
 
 ^H W 
 O 0) 
 
 -P O 
 
 (i, j+x) rows 
 
 
 i+j+x 
 
 i 
 
 1 
 
 (i+x, j) columns 
 
 
 i+j+x 
 
 l 
 
 1 
 
 (i+x, j+x) fwd. diag. 
 
 f 
 
 i+ j +2x 
 
 2 
 
 N/2 
 
 (i+x, j-x) rev. diag. 
 
 g 
 
 i+j 
 
 N 
 
 1 
 
 (i+(xWN), j + (x mod t/n)) 
 ■/n x Vn partitions 
 
 h 
 
 i+j+xWN+x mod -/I 
 
 t/N 
 
 7n 
 
 (i,d) 
 
 N broadcast 
 
 i 
 
 i+j 
 
 1 
 
 i 
 
 (i,j + (xWSf) x t/n) 
 
 VW row broadcast 
 
 J 
 
 i+j + (xWN)xi/N 
 
 1 
 
 l 
 
 (i+(xWn) x 7n, j) 
 
 vN column broadcast 
 
 k 
 
 i+j + (x-jVN)x-/N 
 
 1 
 
 l 
 
 Table I-B 
 
^ 
 
 N processors 
 
 N memories 
 
 N x N network 
 
 $(x) = x 
 
 I(m) = m 
 
 fi(i,j) = 3i+j (3 skew) 
 
 Cfx) = (u(v(x)), x) 
 
 n = /N 
 
 v(x) 
 
 •p 
 O 
 
 u(v(x)) 
 
 B o 
 
 X 
 
 O <1> 
 -P O 
 
 (i, ;j+x) rows 
 
 
 3i+j+x 
 
 1 
 
 1 
 
 (i4x, ,j) columns 
 
 f 
 
 3(i+x)+j 
 
 1 
 
 1 
 
 f i t x, ,i+x) fwd. diag. 
 
 m 
 
 i+x+3i+a 
 
 ^ 
 
 nA 
 
 (i-ix, ,i-x) rev. diag. 
 
 n 
 
 2x+3i+j 
 
 2 
 
 N/2 
 
 (i+lx-r/N), j + (x mod /n) ) 
 ; x -/n partitions 
 
 
 
 
 
 (i,.i) 
 
 broadcast 
 
 
 3i+j 
 
 1 
 
 1 
 
 . i i (xWn) x /n) 
 
 /ii row broadcast 
 
 
 3i+j+(x-fr|)xTi 
 
 1 
 
 1 
 
 . : /N) X t/N, j) 
 
 /N column broadcast 
 
 
 
 
 
 Table I-C 
 
k6 
 
 N processors 
 
 N memories 
 
 N X N network 
 
 $ (x) = x 
 
 l(m) = m 
 
 /i(i >t l) = T]i+j (-/n skewing) 
 
 C(x) = (ju(v(x)), x) 
 
 v(x) 
 
 0) 
 
 o 
 
 m(v(x)) 
 
 S o 
 
 o cu 
 
 CD >3 
 
 (i, j+x) rows 
 
 
 rii+j+x 
 
 1 
 
 1 
 
 (i+x, j) columns 
 
 
 
 T]X+Tli+j 
 
 /n 
 
 t/n 
 
 (i+x, j+x) fwd. diag. 
 
 P 
 
 (T]+l)x+T]i+j 
 
 1 
 
 1 
 
 (i+x, j-x) rev. diag. 
 
 q 
 
 (t}-1)x+t)X+j 
 
 1 
 
 1 
 
 (l+(xWw), j + (x mod Tn) ) 
 /I x -/i partitions 
 
 r 
 
 x+Tii+j 
 
 1 
 
 1 
 
 (i,j) 
 
 N broadcast 
 
 
 r)i+J 
 
 1 
 
 1 
 
 (i,j + (xWI) x t/n) 
 
 VN row "broadcast 
 
 s 
 
 ■Tli+j+(x*T))XT) 
 
 1 
 
 1 
 
 (i-i(xWN) xi/l, j) 
 
 Vff column "broadcast 
 
 t 
 
 Tii+j+N(x-fTi) 
 
 7n 
 
 1 
 
 Table I-D 
 
^7 
 
 N processors 
 N memories 
 N x N network 
 $(x) = x 
 Km) = m 
 //(i,«l) = rji+J 
 
 (t/n+1 skewing) 
 
 C(x) - U(v(x)), x) 
 
 n = 7n+i 
 
 v(x) 
 
 
 •p 
 o 
 
 u(v(x)) 
 
 ^ 0) 
 
 QH 
 
 S o 
 
 X 
 o cu 
 
 -P O 
 
 0) >5 
 
 S3 O 
 
 (j, ;j+x) rows 
 
 
 Tri+j+x 
 
 1 
 
 1 
 
 (i+x, ,j) columns 
 
 u 
 
 T)(i+x)+J 
 
 1 
 
 1 
 
 (i+x, ,i+x) fwd. diag. 
 
 V 
 
 (T)+l)x+T)i+j 
 
 w2 
 
 - 
 
 (i+x, ,j-x) rev. diag. 
 
 w 
 
 (TT,-l)x+T 1 i+j 
 
 t/N 
 
 7n 
 
 (i+(x+/N), j+(x mod t/n)) 
 
 t/n x /n partitions 
 
 X 
 
 T](x-^-/N)+T]i+Ti(x mod 7n) + 
 
 >1 
 
 
 (i,.D 
 
 II broadcast 
 
 
 r\±+3 
 
 1 
 
 i 
 
 (i, j + (x+i/N) x t/n) 
 
 /N row broadcast 
 
 y 
 
 TT,i + j + ( x+/N) x7N( X+-/N) 
 
 1 
 
 l 
 
 M ! (x+t/n) x t/n, ,i 
 
 /?! .lumn broadcast 
 
 z 
 
 tp.+j+N(x+-/n)+t/n (x-m/n) 
 
 1 
 
 i 
 
 Table I-E 
 
ii-8 
 
 Notes to Table I 
 
 a) Since all elements of a column lie in the same memory, i.e., 
 v(x) = (i+x, j), ju(v(x)) = 2> that memory must be cycled N times to produce 
 all elements of the column. One might suspect that the networks would also 
 have to be cycled N times. This is not necessarily so. Notice the resulting 
 connection equation C(x) = (j, x) satisfies the condition gcd(a, N) = N since 
 a = 0. Thus, the network can produce the connection C(x) = (j, x), < x < 
 
 N - 1 by Theorem k of the last chapter. So how do we reconcile the fact that 
 our theory says the connection is possible while intuition says it is not. 
 In fact, there is no real reconciliation. If two different data 
 elements are present at the same input port, then any network would have to be 
 cycled at least twice to pass both elements. We shall simply state that the 
 number shown under network cycles represents the number of cycles required, 
 assuming no multiple data at any input port. Extra cycles required due to 
 such multiple data are reflected in the memory cycles column. 
 
 b) The equation C(x) = (j +(x mod Vn), x) is not of the correct 
 
 form to be covered by Theorem k. Thus, we must show that C(x) satisfies 
 
 k i— k/2 
 
 definition 3. We assume N = 2 where k is even. Thus, VN = 2 ' . Now we 
 
 assume x mod t/n 4 7 na°d /n (so x ^ y), 
 N N 
 
 m 
 and x = y. Then, (P6) 
 
 x 
 
 4 7- 
 
 m 
 
 Now, if m < 7n, then obviously x mod 7n = x and y mod -/n = y , so 
 
 m m 
 
 x mod 7n ^ y mod 7n . 
 
 m 
 
^9 
 
 If m > /n, then we have (since x mod /n, y mod -/n < /n < m) 
 
 x mod Vn ^ y mod 7N -> x mod 7n ^ y mod -/i? 
 N 7n 
 
 and finally 
 
 P5 
 
 -» 
 
 x mod Tn ^ mod -/n . 
 
 m 
 
 Thus, ft t P, where P = { (x mod /N, x) : o < x < N - 1} so by Theorem 1 (previous 
 chapter) ft t P + j, where P + j = {(j +(x mod /n), x) : < x < N - 1} . □ 
 
 c) See note (a) . 
 
 d) Again we must show that C(x) = (j + (xWn) x /n, x) satisfies 
 
 definition 3> i-e., for all m e ft 
 
 m 
 s(x) ^ s (y) and s ( x ) - s (y) implies x ^ y . 
 
 N m 
 
 First, consider the case m > -/n. Let r\ = VN. 
 
 We have j + (x+T])r) ^ ,j +(y+T])n 
 
 N 
 
 I'l 
 
 -» (x+t])t) ^ (y+T])ii 
 N 
 
 and j i(x+t])t] s j + (y4t|)r) 
 
 m 
 
 PI 
 
 -» (x+ti)ti ■ 3 +(y+T])Ti. 
 m 
 
 Thus, 
 
 P6 m 
 
 -> (X+T])T] ^ (y+T])T) 
 
50 
 
 P12 m/r] 
 
 -> (x+tj) 4 (y+Tj) 
 
 Pll m 
 
 -» x ^ y . 
 
 Now, for m < ri let rj = mcr = 7n . We have 
 H 
 
 -» ( x *t) f (y-Mi) 
 
 and 
 
 or 
 
 Thus, 
 
 ti(x-mi) = n(y +T i) 
 
 m 
 
 mcr(x-M-)) == mcr(y4-T]) 
 m 
 
 P4 
 
 -> cr(x-fTi) = ^(y-Mi) 
 
 1 
 
 P3 
 
 -» (x-s-ti) = (y*T)) • 
 
 1 
 
 p6 i 
 
 Pll mcr 
 
 -» x ^ y 
 
 P9 m 
 -» x ^ y . 
 
51 
 
 Thus, fi t C(x). The vector v(x) can be fetched without conflict since 
 £i(v(x)) satisfies definition 6. 
 
 e) See notes (a) and (b) . 
 
 f) Divide v(x) into two n/2 -vectors: 
 
 v 1 (x) = (i+x, j+x), < < N/2 
 v 2 (x) = (i+x+N/2, j+x+N/2), < x < n/2 . 
 Then v(x) = v 1 (x) U v (x), and 
 
 jLi(v,(x)) and u(v (x)) both satisfy Theorem 9. See also 
 
 note (n) . 
 
 g) See note (a) . 
 
 h) It is easy to see that there are -/n elements (the reverse 
 diagonal) of the partition in memory ni(v/N-l)) = i+j+ t/N-1 . Thus, the -/n 
 memory cycles. In order to show that the network cycles are < t/N, we divide 
 v(x) into -/n /N-vectors: 
 
 v (x) = (i+p, j+x mod 7n),0 < x < -/n, < p < -/n . 
 
 /n-i 
 
 Then v(x) = U v (x) 
 
 p=0 P 
 
 and by the argument in note (b), 
 
 fi t C (x), where C (x) = (u(v (x)), x + p-/N), < x < 7n, < p < /n. 
 
 i) See note (a) . 
 
52 
 
 j) See note (d) . 
 k) See note (d) . 
 
 l) Since 3 is prime to any power of 2, C (x) satisfies Theorem 4 
 and *i(v(x)) satisfies Theorem 9* 
 
 m) The required partition for the memory is: 
 
 v (x) = (i+x, j+x), < x < N/4, < p < 4 
 
 This allows the N-vector to be fetched in 4 cycles. 
 The partition required for the network is 
 
 v (x)"=(i+p+Nx/Ij-, j+p+NxA), < x < 4, < p < nA 
 
 and thus (since Nx=0), 
 
 N 
 
 C (x) = (4p+3i+j, p+NxA), < x <4, < p < nA- 
 
 n) An argument similar to m applies here. 
 
 o) An argument similar to m applies here. 
 
 p) If t/n + 1 is prime to N, then C (x) satisfies Theorem 4 and 
 ju(v(x)) satisfies Theorem 9« t/n + 1 is prime to 4, l6, 64, 256, 1024, and 
 4096. 
 
 q) This is true for N = 4, l6, 64, 256, 1024, and 4096 . 
 
 r) Note that -/n(xW"N) + x mod Vn = x. 
 
 s) See note (d) . 
 
 t) The required partition is 
 
 v p (x) = (p+i, 3), < x < n 
 
 < p < T] 
 
 T) = Vn , 
 giving us ju(v (x)) == r)(i+p) + j, which satisfies definition 6 since 
 
 ir 
 
53 
 
 v (x) = v (y). 
 
 To prove Q, f C(x), we have 
 
 T]i+j+N(x*t)) = iii+j+N(y^ri), 
 i.e., the source is the same memory element for all data. See note (a). 
 
 u) 7n + 1 is prime to N = k, l6, 6k, 256, 1024, and 4096. ■ Thus, 
 this result holds for these N by Theorems k and 9« 
 
 v) The cycles required here vary since r\+l = VN+2 is not prime to 
 N = h f l6, 6k, 256, or 1024. The memory cycles vary from k for N = k to 2 for 
 N = 1024, where in general memory cycles is (Theorem 9) 
 
 — where a = gpf (a,N) . 
 
 The required number of network cycles is unknown. 
 
 w) Note T)-l = Vn . An argument similar to note m applies here. 
 x) t}(xWn + x mod -/S) + T)(i+j) 
 
 = -/n(xWn) + (x mod t/n) + (x+ t/n) + Vn(x mod t/n) + T\(l+j) 
 
 = x + (x+VN) + /n(x mod 7n) + r)(i+j) • 
 
 Since )u(v(Vn-1)) s ju(v(N-/n)) e rj(i+j) , we know that memory cycles 
 N N 
 
 are > 1. 
 
 y) See note (d) . 
 
 z) Since N(xWn) = , 
 N 
 
 jLi(v(x)) = 71 i+,j + (x+/N)>c/N . 
 N 
 
 See note (d) . 
 
5+ 
 
 N processors 
 
 2N memories 
 
 2NX2N network 
 
 $(x) = x, < x < N 
 
 l(m) = m, < m < 2N 
 
 /i(i, j) = j (straight storage) 
 
 C(x) = (n(v(x)), x), < x < N 
 
 v(x), < x < N 
 
 (U 
 
 •p 
 
 g 
 
 m(v(x)) 
 
 H (L) 
 
 s H 
 
 S o 
 
 o 
 
 O <D 
 
 -P O 
 
 CD >> 
 
 (i, j+x) rows 
 
 
 j+x 
 
 1 
 
 1 
 
 (i+x, j) columns 
 
 a 
 
 d 
 
 N 
 
 1 
 
 (i+x, j+x) fwd. diag. 
 
 
 j+x 
 
 1 
 
 1 
 
 (i+x, j-x) rev. diag. 
 
 
 j-x 
 
 1 
 
 1 
 
 (i+(xWl), j+(x mod i/i)) 
 
 -/N x Vn partitions 
 
 b 
 
 j + (x mod V^) 
 
 ■/n 
 
 1 
 
 N broadcast 
 
 
 
 
 1 
 
 1 
 
 (i,j + (xWN) X Vn) 
 
 VN row broadcast 
 
 
 . j+(xWn)xt/n 
 
 1 
 
 1 
 
 (i+(xW5) x Vw, j) 
 
 VN column broadcast 
 
 c 
 
 J 
 
 t/n 
 
 1 
 
 Table II-A 
 
55 
 
 N processors 
 
 2N memories 
 
 2NX2N network 
 
 $(x) = x, < x < N 
 
 I(m) = m, < m < 2N 
 
 u(i,3) = 2i+j (2- skew, 1-skip) 
 
 Cfx) = (m(v(x)), x), < x < N 
 
 v(x), o < x < N 
 
 -p 
 
 s 
 
 u(v(x)) 
 
 
 M 
 
 U w 
 O <D 
 
 -P O 
 
 S3 O 
 
 (i, ;j+x) rows 
 
 
 2i+j+x 
 
 1 
 
 1 
 
 (i+x, ,j ) columns 
 
 d 
 
 2x+2i+j 
 
 1 
 
 N 
 2 
 
 (i+x, /Hx) fwd. diag. 
 
 
 3x+2i+j 
 
 1 
 
 1 
 
 (i+x, ,1-x) rev. diag. 
 
 
 x+2i+j 
 
 1 
 
 1 
 
 (i + (x+i/N), j+(x mod /i)) 
 /n x /n partitions 
 
 e 
 
 2(x+/N)+2i+j+x mod t/n 
 
 7n 
 
 
 N broadcast 
 
 
 2i+j 
 
 i 
 
 1 
 
 (i,j+(xWN) x t/n) 
 
 i/N row broadcast 
 
 f 
 
 .2i+j + (x+/N)x /N 
 
 l 
 
 1 
 
 (:i i(xWn) x t/n, j) 
 
 /u column bro idc ist 
 
 g 
 
 2i+j+2-/N(xWlf) 
 
 l 
 
 1 
 
 Table II-B 
 
56 
 
 Notes to Table II 
 
 a) See note (a), Table I. 
 
 b) See note (b), Table I. 
 
 c) See notes (a) and (b), Table I, 
 
 d) Theorem 9 i- s satisfied since 
 
 a 2 
 
 See note (n), Table I. 
 
 e) Partition v(x), < x < N 
 into 
 
 v (x) = (i+ps/N, 3), < x < 7n, < p < -/n . 
 
 Then since n(y (x) ) satisfies Theorem 9 and 
 
 71-1 
 v(x) = U v (x) . 
 p=0 P 
 
 f) See note (d), Table I. 
 
 g) We have (after applying Po and 2-/I(xWn) = 2t/n x -j- 2U) 
 
 m 
 2t/n x + 2N ^ 2t/k y -s- 2N 
 
 Pll 2mN 
 
 -» 2-/N x ^ 2t/n y 
 
 P12, 9 m 
 
 -» x ^ y . 
 
57 
 
 N processors 
 
 2N memories 
 
 2NX2N network 
 
 $(x) = 2x, < x < N 
 
 I(m) = m, < m < 2N 
 
 iu(i>«j) = j (straight storage) 
 
 C(x) = (u(v(x)), 2x), < x < N 
 
 v(x), < x < N 
 
 -P 
 
 s 
 
 m(v(x)) 
 
 S o 
 
 ^H W 
 
 O CD 
 £ H 
 -P c> 
 
 CD >s 
 
 (i, ,j+x) rows 
 
 a 
 
 d+x 
 
 1 
 
 1 
 
 (i+x, ,j) columns 
 
 b 
 
 j 
 
 N 
 
 1 
 
 (i+x, j+x) fwd. diag. 
 
 
 j+x 
 
 1 
 
 1 
 
 (i+x, j-x) rev. diag. 
 
 
 j-x 
 
 1 
 
 1 
 
 (i+(xWN), j + (x mod t/n)) 
 /N x t/n partitions 
 
 c 
 
 j+x mod t/n 
 
 t/n 
 
 1 
 
 (i,j) 
 
 N broadcast 
 
 
 j 
 
 1 
 
 1 
 
 (i,j+(xWN) x Vn) 
 
 /N row broadcast 
 
 d 
 
 j + (x /N) -/TT 
 
 1 
 
 1 
 
 (xWn) x i/N, j) 
 vN column broadcast 
 
 e 
 
 j 
 
 /N 
 
 1 
 
 Table III-A 
 
58 
 
 N processors 
 
 2N memories 
 
 2NX2N network 
 
 J>(x) = 2x, < x < N 
 
 l(m) = m, < m < 2N 
 
 ju(i,o) = 2i+j (2- skew, 1-skip) 
 
 C(x) = (ju(v(x-)), 2x), < x < N 
 
 v(x), < x < N 
 
 -p 
 
 s 
 
 m(v(x)) 
 
 5h (L) 
 S o 
 
 M 
 
 J-l CO 
 
 O d) 
 £ H 
 -P c> 
 0) >> 
 
 C o 
 
 (i, ,i+x) rows 
 
 
 2i+j+x 
 
 1 
 
 1 
 
 (i+x, j) columns 
 
 f 
 
 2i+2x+j 
 
 1 
 
 1 
 
 (i+x, t i+x) fwd. diag. 
 
 
 3x+2i+j 
 
 1 
 
 1 
 
 (i-tx, j-x) rev. diag. 
 
 
 x+2i+j 
 
 1 
 
 1 
 
 (i+(xWN),. j + (x mod t/n) ) 
 t/n X t/n partitions 
 
 g 
 
 2i+j+2(xWN) + x mod t/n 
 
 t/N 
 
 _ 
 
 (i,j) 
 
 N broadcast 
 
 
 3 
 
 1 
 
 1 
 
 (i, J + (xWn) X t/n) 
 
 vN row broadcast 
 
 h 
 
 2i+J + (x4-/N) VN 
 
 1 
 
 1 
 
 (i-i (xWN) x t/n, j) 
 
 ■fu column broadcast 
 
 i 
 
 2i+j+2(xWN) t/n 
 
 1 
 
 1 
 
 Table III-B 
 
59 
 
 N processors 
 
 2N memories 
 
 2Nx2N network 
 
 $(x) = 2x, < x < N 
 
 l(m) = m, < m < 2N 
 
 iu(i,j) = i+2j (1-skew, 2- skip) 
 
 C(x) = (m(v(x)), 2x), <x <N 
 
 v(x), < x < N 
 
 CD 
 •P 
 
 s 
 
 m(v(x)) 
 
 OH 
 
 e o 
 aj ;>> 
 
 ^ W 
 
 CD 
 5rH 
 •P O 
 
 <D >> 
 
 (i, ,j+x) rows 
 
 j 
 
 i+2j+2x 
 
 1 
 
 1 
 
 (i+x, j) columns 
 
 
 x+i+2j 
 
 1 
 
 1 
 
 (i+x, ,]4x) fwd. diag. 
 
 
 3x+2j+i 
 
 1 
 
 1 
 
 (i+x, j-x) rev. diag. 
 
 
 x+2j+i 
 
 1 
 
 1 
 
 (i+(x+VN), j + (x mod /n) ) 
 /ix /l partitions 
 
 k 
 
 xWN+i+2j+2(x mod /I) 
 
 7n 
 
 2 
 
 2 
 
 N broadcast 
 
 
 i+2j 
 
 1 
 
 1 
 
 (i,j+(xWw) x t/n) 
 
 vi? row broadcast 
 
 I 
 
 • i+2j+2(x4-/N) /N 
 
 1 
 
 1 
 
 (i i (xWS) x/n, j) 
 
 i/Ti column broadcast 
 
 m 
 
 2i+j + (xWN) t/n 
 
 1 
 
 1 
 
 Table III-C 
 
6o 
 
 N processors 
 2N memories 
 2Ik<2N network 
 
 k) = 2x, < x < N 
 Hm) = m, < m < 2N 
 //(i, ,0 = T)i+2j (-/n+1 skew, 2 skip) 
 
 C jc) = (n(v(x)), 2x), < x < N 
 
 T) = nTn + 1 
 
 v(x), < x < N 
 
 s 
 
 u(v(x)) 
 
 CD >j 
 
 S o 
 
 O QJ 
 -p a 
 
 CD >5 
 
 C o 
 
 (i, j+x) rows 
 
 n 
 
 T]i+2j+2x 
 
 1 
 
 1 
 
 (i+x, j) columns 
 
 o 
 
 T]i+2j+T]X 
 
 1 
 
 1 
 
 fi+x, j+x) fwd. diag. 
 
 P 
 
 (Ti+2)x+T]i+2j 
 
 1 
 
 1 
 
 (i-tx, j-x) rev. diag. 
 
 q 
 
 (r 1 -2)x+T 1 i+2j 
 
 1 
 
 1 
 
 i+(x+VN), j + (x mod /ii) ) 
 vN x -/n partitions 
 
 r 
 
 r)(x4-7N)+T]i+2j + 
 2(x mod -/N) 
 
 1 
 
 1 
 
 (i,,1) 
 
 N broadcast 
 
 
 T]i+2j 
 
 1 
 
 1 
 
 i,j+(xWw) x 7w) 
 
 /N row broadcast 
 
 s 
 
 Y]i+2j+2i/N(xWN) 
 
 1 
 
 1 
 
 (H(xWn) x t/n, j) 
 
 /w column broadcast 
 
 t 
 
 T]i+2j+T]7N(x4-7N) 
 
 1 
 
 1 
 
 Table III-D 
 
61 
 
 Notes, Table III 
 
 a) Note that c(x) is now (u(v(x)), 2x) . 
 
 b) See note (a), Table I. 
 
 c) See note (b), Table II. 
 
 d) See note (d), Table I. 
 
 e) See notes (a) and (b), Table I. 
 
 f) Note that 2x + 2i + j satisfies Theorem 9, since x < N. Also, 
 C (x) = (2x, 2x) satisfies Theorem k. 
 
 g) See note (e), Table II. 
 h) See note (d), Table I. 
 i) See note (d), Table I. 
 j) See note (f), Table III. 
 
 k) Note that 
 
 2(x mod nTn) = 2x mod 2 >/n. 
 
 Partition v(x), < x < N into 
 
 v (x) = (i+(x4- nTn)+2p, j+(x mod/N)), 
 
 < x < 2 J~N, 
 < p < vTn/2 . 
 
 Then iu(v (x)) satisfies Theorem 9 since x < 2 n'N < 2N, and 
 
 n/n/2 
 v(x) = U v (x). 
 p=0 p 
 
62 
 
 and i— , 
 
 VN/2 
 
 v(x) = U v (x) . 
 
 p=0 P 
 
 These same partitions satisfy Theorem k. 
 
 l) The result follows from note (&), Table I, and P2. 
 
 m) See note (d), Table I. 
 
 n) Theorem 9 is satisfied since x < 2N/2. Theorem 4 is also 
 satisfied (since $(x) = 2x) . 
 
 o) VN + 1 is prime to 2N for N = k 9 16, 64, 256, 1024, or 4096. 
 Thus, Theorems 4 and 9 are satisfied again. 
 
 p) ti + 2 = 7n + 3. 7n + 3 is prime to 2N for N = k, 16, 64, 256, 
 1024, or 4096. 
 
 q) t] - 2 = -i/n - 1. t/n - 1 is prime to 2N for N = 4, l6, 64, 256, 
 1024, or 4096. 
 
 r) An exhaustive check of the function 
 
 C(x) = ((VI+1) (xWn) + 2(x mod 7n), 2x), < x < N - 1 
 
 was made by computer for N = 4, 16, 64, 256, 1024, and 4096. In all cases, 
 definitions 3 and 6 were satisfied. The required connections then follow 
 by Theorem 1. 
 
 s) We have 
 
 T)i + 2j + 2t/n(xWn) = T]i + 2j . + 2-/I(yWM r ) 
 
 m 
 
 PI 
 
 -^ 2/n(xWn) s 2yiT(yWN) 
 
 m 
 
 But tji + 2,j + 27n(xWn) ^ T)i + 2j + 2-/N(yWN) 
 
 2N 
 
 PI 
 
 -* 2t/n(xWn) ^ 27N(y-7N) 
 2N 
 
63 
 
 These two results give us 
 
 P6 m 
 
 -* 2-/n(x-«-t/n) 4 2i/n(v.*t/n) . 
 
 Since 27n(xs.-/n) - (2y/lf)+ (2N), 
 
 we get 
 
 Pll 2mN 
 
 _> 2-/N x ^ 27n y 
 
 P12 2m/I 
 -» 2x ^ 2y 
 
 P9 m 
 
 -* 2x ^ 2y . 
 
 Thus, definition 3 is satisfied. Definition 6 is satisfied since it is 
 satisfied for rows (see note (d), Table i) . 
 t) We have 
 
 T|i + 2j + tiTn(xWn) = r\± + 2j + Ty/N(y-s7N) 
 
 m 
 
 PI 
 
 _» t)/n(xWn) s ^(yWN), where t]=/n +1 
 m 
 
 Again, t] is prime to 2N for 2N=2 , so 
 
 P3 
 
 _, Vn(xWn) = VN(y+n/N) . 
 
 m 
 
 But t>l + 2j + t]/n(xWn) ^ T|i + 2j + Ty/N(yWN) 
 
 2N 
 
 PI, 3 
 
 -» /N(xWN) ^ /N(y-sVN) 
 2N 
 
6k 
 
 So P6 m 
 
 Since Tn(x-jV^) = (x^O+N, we get 
 
 m 
 
 (xM) + N ^ (yV^)* N 
 
 Pll Mn 
 
 _>- 7n x 4 -& y 
 
 P9 Nm/2 
 
 -» Vn x ^ 71 y 
 
 P12 2Mn/2 
 
 _> 271 x ^ 271 y 
 
 Nm 
 27N x ^ 271 y 
 
 P12 Vk/fS 
 
 -> 2x ^ 2y 
 
 P9 m 
 
 -> 2x ^ 2y . 
 
 Thus, definition 3 is satisfied since 
 
 m 
 
 s(x) ^ s(y) A s(x) s s(y) -»d(x) ^ d(y) . 
 2N m 
 
 Definition 6 is satisfied by an argument similar to note (s), Table III, 
 
65 
 
 That is, since there are no memory conflicts for columns, and since the Vn 
 column broadcast involves fetching only -/n elements out of N in a column, then 
 
 there cannot be any memory conflicts in a 7n column broadcast. That there 
 are no memory conflicts for a column fetch follows from the fact that 7w + 1 
 is prime to 2N = 2 . Thus, Theorem 9 is satisfied since the greatest factor 
 of 7n + 1 prime to 2N is -/n + 1, 
 
 gpf(-/N+l,N) = /N + 1 
 
 and so it is only required that x be less than M = 2N, 
 
 x < I < M = 2N . 
 
 Thus x^y and ji(v(x)) = jLi(v(y)) 
 
 M 
 
 or (/N+l)(x+i) + 2j ■ (7N+l)(y+i) + 2j 
 
 M 
 
 PI 
 
 -> (/N+l)(x+i) =(/N+l)(y+i) 
 
 M 
 
 P3 
 
 -» x+i = y+i 
 
 M 
 
 _» (x+i,o) = (y+i,j) since x,y <M 
 
 -» v(x) - v(y) . 
 
66 
 
 o 
 
 
 
 
 
 
 
 
 
 
 H 
 
 
 
 H 
 H 
 
 H 
 
 H 
 
 H 
 
 H 
 
 H 
 
 H 
 
 rH 
 
 H 
 
 OJ 
 
 *? 
 
 OJ 
 
 OJ 
 
 H 
 
 
 
 
 
 
 
 
 
 
 
 
 
 O 
 
 
 
 
 
 
 
 
 
 
 
 
 
 l 
 
 H 
 H 
 
 H 
 
 H 
 
 H 
 
 H 
 
 fl w 
 
 H 
 
 H 
 
 H 
 
 [25 
 
 OJ 
 
 H 
 
 OJ 
 
 X 
 
 OJ 
 
 H 
 
 
 
 
 
 
 
 
 
 
 
 
 
 pq 
 
 H 
 
 H 
 
 H 
 
 H 
 
 H 
 
 15 
 
 H 
 
 H 
 
 H 
 
 [25 
 
 OJ 
 
 rH 
 
 X 
 
 H 
 
 
 
 
 
 
 
 
 
 OJ 
 
 
 
 OJ 
 
 H 
 
 
 
 
 
 A| 
 
 
 
 
 
 
 
 
 < 
 
 
 
 
 
 
 
 
 
 
 
 
 
 H 
 H 
 
 H 
 
 S 
 
 H 
 
 H 
 
 »e 
 
 H 
 
 H 
 
 <? 
 
 [25 
 
 OJ 
 
 O 
 
 rH 
 
 OJ 
 
 H 
 
 
 
 
 
 
 
 
 
 
 
 
 
 PQ 
 
 I 
 
 H 
 
 OJ 
 
 H 
 
 H 
 
 is 
 
 H 
 
 H 
 
 rH 
 
 [25 
 
 OJ 
 
 rH 
 
 X 
 
 H 
 
 
 A| 
 
 
 
 
 
 
 
 OJ 
 
 
 
 
 H 
 
 
 
 
 
 A| 
 
 
 
 
 
 
 
 
 1 
 
 H 
 
 H 
 
 ■ S 
 
 H 
 
 H 
 
 <? 
 
 H 
 
 H 
 
 •? 
 
 [25 
 
 OJ 
 
 O 
 
 H 
 
 X 
 
 H 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 H 
 
 
 
 i 
 H 
 
 H 
 
 H 
 
 2cvj 
 
 I? 
 
 CM 
 
 Al 
 
 H 
 
 H 
 
 H 
 
 [25 
 
 + 
 
 1? 
 
 H 
 
 X 
 
 O 
 
 i 
 H 
 
 H 
 
 »? 
 
 H 
 
 H 
 
 H 
 
 H 
 
 H 
 
 «! 
 
 [Sh- 
 
 •§ 
 
 H 
 
 X 
 
 O 
 
 
 
 _* 
 
 OJ 
 
 
 
 
 
 
 
 
 
 l 
 
 H 
 
 H 
 
 ~\^ 
 
 ^~^ 
 
 1 
 
 H 
 
 H 
 
 1 
 
 is; 
 
 KA 
 
 H 
 
 X 
 
 H 
 
 
 
 S 
 
 S 
 
 
 
 
 
 
 
 
 
 pq 
 
 i 
 
 H 
 
 H 
 
 H 
 
 CVJ 
 
 S 
 
 •? 
 
 H 
 
 H 
 
 H 
 
 [25 
 
 H 
 
 H 
 
 X 
 
 < 
 
 i 
 
 H 
 
 H 
 
 S 
 
 H 
 
 rH 
 
 "? 
 
 H 
 
 H 
 
 i? 
 
 S 
 
 O 
 
 H 
 
 X 
 
 
 
 
 
 
 
 
 
 P 
 
 
 
 
 
 
 
 
 
 
 
 
 
 W 
 
 
 
 
 
 
 
 
 
 
 £ 
 
 
 
 ■ CO 
 
 
 
 
 
 
 
 
 
 
 O 
 
 
 P 
 
 o 
 
 
 
 
 
 
 
 
 
 
 •H 
 
 
 02 
 
 d 
 
 
 
 
 
 ?H 
 
 
 
 
 
 -P 
 
 
 CO 
 
 CO 
 
 
 
 
 
 cu 
 
 
 
 
 
 ■H 
 
 
 o 
 
 o 
 
 
 
 
 
 £ 
 
 
 
 
 
 -P 
 
 CO 
 
 
 d 
 
 co 
 
 o 
 
 
 
 
 
 
 S 
 
 
 
 CO 
 
 bO 
 
 CO 
 
 Ph 
 
 -P 
 w 
 
 
 1 
 
 w 
 
 
 
 
 CD 
 
 H 
 
 
 CO 
 
 •H 
 d 
 
 •H 
 
 d 
 
 «? 
 
 CO 
 O 
 
 |3 
 
 H 
 
 CD 
 
 •H 
 
 
 
 
 r^ 
 
 
 E 
 
 
 
 
 d 
 
 O 
 
 o 
 
 !h 
 
 
 
 
 03 
 
 w 
 
 p 
 
 • 
 
 • 
 
 X 
 
 CO 
 
 U 
 
 o 
 
 O 
 
 ,5 
 
 Ph 
 
 s- — V 
 
 EH 
 
 £ 
 
 rH 
 
 d 
 
 J> 
 
 
 o 
 
 
 
 g 
 
 CD 
 
 •H 
 
 X 
 
 
 O 
 
 rH 
 
 o 
 
 O 
 
 IS 
 
 <H 
 
 CD 
 
 i? 
 
 
 >? 
 
 «5 
 
 CD 
 
 
 
 o 
 
67 
 
 d 
 
 d 
 
 d 
 
 "d" 
 
 d 
 d 
 
 N 
 
 N 
 
 are 12. Storage of dx d Submatrices Along the Diagonal 
 
68 
 
 c- 
 
 saxoiCo 
 3[ jcoMq.au 
 
 i 
 
 H 
 
 OJ 
 
 A| 
 
 OJ 
 
 A| 
 
 sbjoAo 
 Aj.ovsdm 
 
 ■ 
 
 H 
 
 H 
 
 OJ 
 
 A| 
 
 VO 
 
 sajo^o 
 3fjcoMq.au 
 
 i 
 
 H 
 
 OJ 
 
 A| 
 
 OJ 
 
 Al 
 
 saiojfo 
 JLzomBm 
 
 i 
 
 H 
 
 OJ 
 
 A| 
 
 H 
 
 UA 
 
 sa"[Oi?o 
 s[ jcoM-q.au 
 
 i 
 
 OJ 
 
 OJ 
 
 A| 
 
 OJ 
 
 Al 
 
 sajo^o 
 jCjcoraara 
 
 ! 
 
 rH 
 
 H 
 
 H 
 
 -4- ■ 
 
 saxc/fo 
 3j;jcoM.q.au 
 
 H 
 
 OJ 
 
 OJ 
 
 A| 
 
 OJ 
 
 A| 
 
 saxo^o 
 ^jcoutara 
 
 H 
 
 04 
 
 OJ 
 
 A| 
 
 OJ 
 
 Al 
 
 KA 
 
 saxojCo 
 ^JcoAq.au 
 
 H 
 
 OJ 
 
 OJ 
 
 Al 
 
 OJ 
 
 Al 
 
 saxo.<fo 
 .Aiomara 
 
 H 
 
 OJ 
 
 OJ 
 
 Al 
 
 OJ 
 
 Al 
 
 CVJ 
 
 S8~["0.£0 
 
 3];jcoM.q.au 
 
 OJ 
 
 OJ 
 
 OJ 
 
 Al 
 
 OJ 
 
 A| 
 
 safjo^o 
 ^jcoraani 
 
 H 
 
 H 
 
 OJ 
 
 A| 
 
 OJ 
 
 Al 
 
 n3 
 
 fe 
 
 VD 
 H 
 
 
 LIA 
 
 OJ 
 
 -=J- 
 
 OJ 
 
 o 
 
 H 
 
 H 
 03 
 C 
 O 
 bD 
 cO 
 •H 
 R 
 
 d) 
 
 -P 
 
 o 
 
 oa 
 
 o 
 o 
 
 H 
 
 pq 
 
 CO 
 
 > 
 
 CD 
 H 
 ,£> 
 CO 
 
 EH 
 
69 
 
 3.5 Additional Considerations, Cost 
 
 Tables I-V give us a pretty good idea of the performance of the ft 
 network on certain highly frequent N-vectors. We will not discuss these 
 results any further in this chapter. (in a later chapter we will compare 
 these results with similar results from other types of networks) . We will, 
 however, discuss several questions raised by these results. 
 
 First, it seems apparent that using 2N memories yields fewer memory 
 conflicts than a system with N memories. In fact, purely combinatorial 
 arguments tell us that of a total of ( ) = — 4 — — — possible distinct N- 
 
 (ir-N) in: 
 
 vectors in an N x N array, the N memory systen can deliver NUT without 
 
 conflict, while the 2N memory system can deliver -^ — without conflict, 
 
 NI2 
 
 But we must ask: What is the difference in cost between an N memory system 
 and a 2N memory system? Let us assume that the total storage capacity of 
 both systems is equal, e.g., each memory in the N memory system can store K 
 words while each memory in the 2N memory system can store k/2 words. 
 
 There is some evidence which suggests that at least the cost of the 
 basic memory components would be the same. For example, Intel supplies a 
 solid state memory board which can be configured either as kK X l8 bit words 
 or 8K x 9 bit words. However, the total cost of a memory system is 
 primarily determined by the cost of power supplies, packaging, inter- 
 connections, etc., and thus further analysis of this 
 
 * p 
 
 The assumption is made that all N~ elements of the N x N array are evenly 
 
 distributed among all N or 2N memories. 
 
TO 
 problem is beyond the scope of this paper. We will simply assert that it is 
 
 not unlikely that the cost of a 2N memory system would not be prohibitively 
 
 greater than the cost of an N memory system when the total cost of the 
 
 computer system is considered. 
 
 Another question which arises is the relative cost of a 2N X 2N ft 
 
 network versus an N X N ft network. As we shall see in a later chapter, the 
 
 number of gates in an N X N ft network (including control circuitry) is on the 
 
 order of 
 
 2|(d + log 2 (log 2 N)) log 2 W , 
 
 where d is the number of bits per data word. The gate ratio of a 2N X 2N ft 
 network to an N x N network is approximately 2. (This should be 
 compared with a ratio of k for a crossbar switch) . 
 
 Finally, it might be argued that each memory in a 2N memory system 
 only needs to be half as fast as each memory in an N memory system in order 
 
 to provide the same effective memory bandwidth. This argument is valid only 
 if successive N-vectors do not interfere with each other. 'While an analysis 
 of successive N-vector interference is beyond the scope of this work, it seems 
 obvious that this interference could be significant and so the argument 
 justifying slower memories would not hold. 
 
 3.6 Summary 
 
 In this chapter we have explored several memory systems in conjunction 
 with various memory equations. We have shown that in almost all cases the ft 
 network performs at least as well as the memories themselves. That is, in 
 almost all cases, if the N-vector can be accessed without memory conflict, 
 then the ft network can establish the necessary memory-processor connection. 
 
71 
 In the next chapter we will discuss actual implementation of Q. networks. This 
 will be followed in later chapters by comparisons with various other switching 
 networks which have been proposed in the literature. 
 
72 
 k. CONSTRUCTION OF SEVERAL ft NETWORKS 
 
 In this chapter we will present several implementations of ft net- 
 works. Of course, many implementations are possible but we will present only 
 a few which represent various extremes of design. The first design is probably 
 the simplest design possible. It is somewhat slow, due to the fact that it 
 operates in bit serial mode. It might be useful in applications requiring the 
 switching of bit serial data, such as switching data between tracks of a 
 rotating memory. 
 
 The second design is a more general network which might be used for 
 the processor-memory connections discussed earlier. Finally, we will discuss 
 a particular interconnection of processors which is surprisingly related to ft 
 networks . 
 
 k.l A Bit Serial ft Network 
 
 This ft network, shown in Figure 13, will be constructed from elements 
 shown in Figure lh . (Notice that the two center NAND gates form a bistable 
 device). The network operates as follows. The network is first reset by using 
 the reset lines which are common to all elements. Associated with each input 
 is a log n bit number which represents the number of the output port to which 
 
 that input port is to be connected (the destination tag) . This number is in- 
 serted bit serially into each input port, most significant bit first. Each 
 element then transmits these bits 
 
 A = C, B = D. 
 As the i-th most significant bit is fed into the network, the strobe signal is 
 turned on momentarily in the i-th stage (from the left). This strobe signal 
 allows the bistable pair of NAND gates to be switched, and if A = 1, then the 
 element will begin transmitting A = D, B = C; otherwise, it will continue to 
 
 • 
 
73 
 
 Figure 13 . An fi Network 
 
7^ 
 
 * ♦ A B 
 
 Strobe Reset 
 
 L_L 
 
 B 
 
 — D 
 
 FT 
 
 Strobe Reset 
 
 Figure Ik . One Element of a Bit Serial ft Network Constructed from 
 NAND Gates 
 
75 
 
 transmit A - 0, B = D. 
 
 Now assume there are no conflicts as defined in Chapter 2. Then it 
 can be shown (refer to Figure 10 in Chapter 2) that both streams entering an 
 element of the i-th stage (from the left) must he equal in the first i-1 most 
 significant bits and must be unequal in the i-th most significant bit position. 
 Thus, the i-th stage is strobed when the i-th most significant bits are present 
 at the inputs. (if there are no conflicts, then these bits are unequal). If 
 A = at this time, then the element remains in state A = C, B = D. Otherwise, 
 A = 1 and the element enters state A = D, B = C. The strobe is now turned off 
 which effectively locks the element in this state until the reset signals are 
 used. Thus, the network is being switched according to Algorithm 2 of Chapter 2 
 as long as there are no conflicts. 
 
 Now there are two problems. First, if a conflict does arise, the 
 network will produce an erroneous connection. (The destination tags could be 
 examined as they emerge from the outputs to determine if the correct connection 
 was established) . Second, it is impossible to set up any one-to-many connec- 
 tions . (Recall these are allowed in the generalized Q network presented in 
 Chapter 2). Nevertheless, this network may prove useful in some applications. 
 
 Notice in Figure Ik that each element requires 9 NAND gates. Thus, 
 the total number of gates required for the network is 9 X p l°gp n or 
 
 ■( 
 
 ~ n log n. Further examination of Figure Ik reveals that 3 gate delays are 
 
 needed for switching and 2 for transmission through each element. This, to- 
 gether with the bit serial nature of transmission, reveals that 
 
 lo G p n 
 
 *- p 
 
 L • i 2i = 21og n +(log n) gate delays are required to properly switch 
 i I 2 2 
 
 the network. 
 
7 6 
 
 This could be reduced to 51og_n by the addition of appropriate 
 latching registers and clock signals. 
 
 We will now turn to a more complicated ft network. 
 
 k.2 A Better Network 
 
 In the previous network when a conflict arises, one of the inputs 
 is switched in the "wrong" direction. This wrongly switched input can, in a 
 later stage, cause other inputs to be wrongly switched. The network which we 
 will now discuss prevents this by associating a validity signal with each in- 
 put. As soon as an input is switched in the wrong direction, its validity 
 signal is turned off and thus the wrongly switched input is prevented from 
 influencing later switching decisions. 
 
 Additionally, this network will be capable of producing one to many 
 connections (broadcasts) as discussed in Chapter 2. This is accomplished by 
 using source tags rather than destination tags. A source tag is a number as- 
 sociated with each output port which represents the input to which that output 
 port is connected. 
 
 This network is divided into one "control plane" and one or more 
 "data planes" as shown in Figure 15. Signals generated by the control plane 
 are used to control the switching of the data planes. Each plane is similar 
 (at least topo logic ally) to Figure 13- Notice that the control signals will 
 flow from right to left while the data will be transmitted from left to right. 
 
 We begin by designing the "control plane" for this ft network. The 
 control plane generates signals which are fed to the switches in each data 
 plane of the ft network. Refer to Figure l6. On the right edge are n shift 
 registers of log g n bits each. Each of these shift registers contain the 
 
 number of the input port to which this output is to be connected. The least 
 
77 
 
 Data 
 Planes^ 
 
 Control 
 Plane 
 
 
 Figure 15. The Control and Data Planes of an fi Network 
 
78 
 
 
 
 
 
 
 
 f 
 
 
 
 r 
 
 
 
 
 
 
 
 
 
 r 
 
 
 
 
 
 r 
 
 
 
 
 
 
 
 
 
 
 s 
 
 
 1 
 
 SOURCE 
 
 
 
 
 « 
 
 
 
 TAG SHIFT 
 
 
 
 
 
 | 
 
 
 
 
 REGISTERS 
 
 
 
 ^\// f 
 
 
 r 
 
 
 
 
 2 
 
 
 
 
 
 
 r 
 
 
 
 
 
 
 
 
 
 
 3 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 
 
 
 
 
 
 
 
 * 
 
 
 
 
 4 
 
 
 
 * 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 5 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 I 
 
 
 | 
 
 
 | 
 
 
 
 
 
 
 
 
 
 
 
 
 
 6 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 7 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 STf 
 
 *OBE 3 
 
 
 STI 
 
 ROBE 2 
 
 
 STF 
 
 OBE 1 
 
 
 Figure l6. Control Plane for an 8 x 8 n Network 
 
79 
 
 significant bit of each of these registers is connected to the inputs of the 
 control plane as shown in Figure 16. Generally, each stage of the control 
 plane will he "set up" during one major clock. The entire plane is thus set 
 in log n major clocks. During the i-th major clock, the i-th stage (from the 
 right) is strobed allowing the signal from the current least significant bit 
 to set the switching and memory of each block in the stage. Then the source 
 tag registers are shifted allowing the next least significant bit to be 
 switched through the i-th stage to the (i+l)-th stage where this process is 
 repeated. 
 
 The following signals are used in each element of the control plane 
 (refer to Figure 17) : 
 
 Strobe: enables setting of the flip flops 
 
 A: upper output tag 
 
 B: lower output tag 
 
 V»: upper output validity 
 
 V_: lower output validity 
 B 
 
 C : upper input tag 
 
 D: lower input tag 
 
 V p : upper input validity 
 
 V : lower input validity 
 
 F n ' upper switching signal for data switch 
 
 F : lower switching signal for data switch 
 
 Reset: resets both flip flops 
 
 A special situation arises if an attempt is made to switch both 
 inputs to the same output in this control plane. This may arise if a broad- 
 cast connection is being set up (i.e., two network outputs requesting con- 
 nection to the same network input) or in the case of a genuine conflict. When 
 
8o 
 
 Reset ♦ ♦ Strobe 
 
 Reset 
 
 Strobe 
 
 Figure 17. One Control Plane Switching Elenent 
 
8i 
 
 this happens, the signals which will control the data planes are set to 
 connect the broadcast connection but the switch of the control plane is set 
 to transmit only one of the tag signals correctly. The other tag signal is 
 transmitted incorrectly but its associated validity signal is set to zero. 
 (Note that the validity signals may also be used to indicate that a given 
 network output port requires no connection) . 
 
 Thus, each source tag travels bit by bit from right to left through 
 the control plane. Each bit causes a given stage to switch the next bit 
 through to the next stage . 
 
 The logic diagram of one element of the control plane is shown in 
 Figure l8. The two flip flops are set or reset by their respective input 
 (C or D) provided they are enabled by the strobe signal. (These flip flops 
 are the same as the three gate devices shown in Figure Ik) . The outputs of 
 these flip flops control subsequent transmission of this element in addition 
 to the switching of corresponding elements of each data plane (see Figure 20). 
 Associated with each tag input signal (C or D) are two validity signals 
 (y or V ) . These validity signals indicate whether or not the corresponding 
 
 tag signal is valid. (A tag signal is invalid if it was incorrectly switched 
 in a previous stage or if the corresponding network output port is not re- 
 questing a connection) . 
 
 The logic equations are determined as follows. First, the flip flops 
 are strobed and set or reset as C and D are set or reset. The various trans- 
 mission states of this element are shown in Figure 19- Numbers on the right of 
 each box correspond, top to bottom to C, V , D, and V , respectively. Lines 
 
 in each box represent connections and the numbers on the right indicate trans- 
 mitted validity signals (V., V_.) . Don't care conditions were chosen to 
 
 A B 
 
82 
 
 Reset Strobe 
 
 Reset Strobe 
 
 Figure l8. Control Plane Switching Element Circuitry 
 
83 
 
 i ° n 
 
 V° V° 
 
 Ao Ao 
 
 11 
 
 : x: 
 
 10 
 
 
 
 7: v\ '7: ' 
 
 10 1 
 
 
 
 1 
 
 1 
 1 
 
 
 
 'Xi 
 
 i i 
 
 °\: 
 
 i i 
 
 °x 
 
 i 
 
 m 
 
 ;ure 19 . Transmission States of One Control Element 
 
Qk 
 
 minimize gate counts. 
 
 As mentioned earlier, the signals from one of these control planes 
 will be used to control the switching of one or more data planes. Each data 
 plane consists of log p n stages of n/2 elements. One such element is shown in 
 Figure 20. Notice that the switching of this element is controlled "by the 
 signals F , F , which are generated by the corresponding element of the control 
 
 plane (see Figure 3) • Figure 21 shows the states of this data switching 
 
 element. The numbers above each box represent values of F and F , respec- 
 
 o u 
 
 tively. 
 
 In summary then, each output requests connection to a particular 
 input port by placing the number of that port in the source tag register. The 
 control plane is then clocked log n times after which the data planes are set 
 to provide some or all of the specified connections between inputs and outputs. 
 If the requested connection does not create conflicts in the sense defined in 
 Chapter 2, then the resulting connection will be the requested connection. 
 If there is a conflict, then some output (s) will not be connected to their 
 requested input. For example, refer to Figure 16. Assume output requests 
 connection to input and output 2 to input k. Then the first element of the 
 middle stage of Figure 16 will be placed in state f (Figure 19) . At this point 
 the request from output 2 will be effectively cancelled, since its validity 
 signal is turned off and in fact output 2 will be connected via the data planes 
 to input instead of k. 
 
 In many cases it would be desirable to detect this condition. This 
 can be done by including a log n bit source tag along with each element of 
 data at the inputs. After the data is sent through the network each output 
 port checks the source tag received against the source tag requested. Lack 
 
Upper 
 
 Lower 
 
 85 
 
 F c Signal From Control Plane 
 
 Upper 
 
 Data 
 Output 
 
 Lower 
 
 F D Signal From Control Plane 
 
 Figure 20. One Element of a Data Plane Using NAND Gates 
 
86 
 
 
 
 a 
 
 i I 
 
 I o 
 
 Figure 21. States of One Data Switching Element 
 
8 T 
 
 of equality indicates a conflict occurred, and the process must be repeated 
 to process the unsatisfied requests. 
 
 We have discussed in some detail the construction of the control and 
 data planes of an Q. based network which allows broadcast patterns and is 
 
 capable of detecting conflicts. If we assume that a flip flop requires 3 
 
 2 l0g 2 I 
 
 ^ates, then a control plane requires yr log n elements of 22 gates (see 
 
 Figure l6) for a total of lln log n gates. Each data plane has — log n 
 
 elements of 8 gates for a total of 4n log n gates. If there are d data planes, 
 
 then the entire network requires (4d+ll)n log n gates. 
 
 Examination of Figure 18 reveals that transmission through a stage 
 requires 3 gate delays and switching of a stage required 3 gate delays. Thus, 
 
 log n 
 
 5 2 
 
 it requires E 3(i-l) + 3 = — (log„n + (log„n) ) gate delays to switch the 
 
 i=l 
 entire network. This could be changed to 6 log n by the addition of appropriate 
 
 latches between stages. 
 
 k .3 Processor-Processor Connections 
 
 As readers may have noticed, the topology of the P. network is 
 equivalent to that of the last log n stages of both the bitonic sorting net- 
 works discussed by Batcher [ k ] and the binary switching networks discussed 
 by Benes [ 5 ]• This in itself is somewhat surprising. It is even more sur- 
 prising to discover that the interconnections between stages turn out to be 
 the "perfect shuffle" connection discussed by Pease[ 9 1 and Stone [12]. We 
 Ml not present a proof of this result but it can be easily seen by examining 
 i ">< , which is the same binary Q. network we have been discussing all 
 along. While it appeared earlier (Figure 13) that each pair of stages was 
 
88 
 
 f\ 
 
 1 
 
 
 1 
 
 
 1 
 
 — 
 
 — l 
 
 u 
 
 1 V J 
 
 
 
 
 
 
 
 
 
 
 2 —v /\ 
 
 3 
 
 
 3 
 
 
 2 
 
 — 2 
 
 3-^ V / 
 
 
 
 
 
 
 — 3 
 
 
 
 
 
 
 
 
 4 A 
 
 2 
 
 
 2 
 
 
 3 
 
 — 4 
 
 5 -/ \ > 
 
 
 
 
 
 
 — 5 
 
 
 
 
 
 
 
 
 6 -^ \ 
 
 7 
 
 4 
 
 
 4 
 
 
 4 
 
 — 6 
 
 — 7 
 
 ^ 
 
 
 
 
 
 
 Figure 22. An fi Network with Identical Interstage Connections Revealed 
 
8 9 
 
 interconnected differently, (it was expedient to draw them this way to facili- 
 tate understanding of how they worked), in fact, a reordering of elements in 
 each stage allows them to be drawn as shown in Figure 22. Thus, it becomes 
 clear that the stage interconnections are identical for each stage. (The 
 numbers on each element indicate their original order as shown in Figure 13) . 
 Since each stage is identical, it appears that it might be possible 
 to save more gates and build just one stage of an 0, network, add some 
 registers, and simply recycle this stage log n times. We will now examine 
 
 some particularly effective ways of doing this. 
 
 Assume that we have N processors interconnected by the perfect 
 shuffle, as shown in Figure 25« This interconnection was proposed by Stone 
 [12]. The new results, effectively proved in Chapter 2, is that shifts and 
 other patterns required by our solution to the memory conflict problem can 
 be performed in exactly 1 + log N cycles. This is accomplished as follows. 
 
 Each processor will have as many as two data words which need to be 
 transmitted to another processor (possibly itself) . Associated with each data 
 word will be a log N bit destination tag and a one bit validity indicator. 
 
 Assume these interconnections are wide enough to allow parallel transmission 
 of d bits of data, log_N bits of tag and one extra bit. During each cycle the 
 
 contents of the output registers (see Figure 2*0 which contain data, tag and 
 validity information are sent via the shuffle connection to the input registers 
 of other processors. Then, depending on the tags and validity bits, the two 
 input registers in each processor may exchange or not as they are gated to the 
 output registers, (it should be noted that one to many connections are not 
 possible using this scheme, but another scheme similar to section 2 of this 
 
90 
 
 
 
 
 
 
 
 
 
 w 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 ftg 
 
 
 
 *■ 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 „ 
 
 
 
 
 
 
 
 
 
 
 Figure 23- Four Processors Connected by the Perfect Shuffle 
 
91 
 
 
 Output Registers Input Registers 
 
 Figure 2^. Internal Registers Required for the Processor Interconnection 
 Scheme 
 
92 
 
 chapter is possible which does allow broadcast type connections) . 
 
 From Chapter 2, we know that any permutation of n numbers which 
 satisfies definition 3 of that chapter can be produced in log n cycles. Since 
 
 n = 2N, this becomes 1 + log N cycles. In addition, we know that any permuta- 
 tion can be done in log n(log n+l)/2 cycles. This latter result follows from 
 
 Batcher's work [ k ], and the facts that the stage interconnections in the 
 Batcher network are perfect shuffles and that the only data dependent informa- 
 tion required by a Batcher element can be obtained from the inputs to that 
 element. It is not known at this time whether or not log n(log n+l)/2 is a 
 
 least upper bound, given the above constraints. 
 
 One additional result which follows from Batcher is that we can sort 
 2N numbers using this same hardware. 
 
 k.k Summary 
 
 It is difficult to summarize the results of this chapter in terms of 
 actual numbers. In one case we took existing processors, possibly modified 
 them, and formed an ft network. The primary cost here would not be determined 
 by gates (which presumably are already available in the processors) but by 
 wires, drivers, connectors, etc. In two other cases we actually designed 
 separate ft networks. We can say that any permutation which satisfies defi- 
 nition 3 of Chapter 2 can be produced in time on the order of log n cycles and 
 
 with an order of n gates/stage. 
 
 Thus, we have a number of options available : 
 I. Processor-processor connections. 
 
 A. Advantages: possibly cheaper due to the use of already 
 existing gates and can be used to implement a bitonic 
 
93 
 
 sorter with almost no extra logic. 
 B. Disadvantage: during each cycle, some input must travel a 
 distance of n/2 processors. If the processors are large or 
 there are a lot of them, the wire lengths involved may be a 
 problem. 
 II. Separate networks. 
 
 A. Multistage. 
 
 1. Advantages: size and thus wire lengths are smaller. 
 
 2. Disadvantage: cost in terms of gates. 
 
 B. Single stage. 
 
 1. Advantages: size and wire lengths are smaller than 
 processor-processor connections; uses fewer gates than 
 the multistage networks. 
 
 2. Disadvantage: probably slower than multistage due to 
 extra clocking requirements and i/o registers. 
 
 C. Multistage pipelined. 
 
 1. Advantages: size and wire length are smaller than 
 processor-processor connections, higher bandwidth. 
 
 2. Disadvantages: cost in terms of gates for the necessary 
 interstage registers and in terms of extra transmission 
 time due to these registers and extra clocking. 
 
 It should also be pointed out that it is not necessary to build these networks 
 using binary elements. Larger elements, consistent with the particular 
 implementation technology, can be used with no loss in capability (Theorem 6, 
 Chapter 2) and yield a possibly faster network. 
 
 In the next chapter we will review some other networks which might 
 be adaptable to memory-processor connection networks and we will compare these 
 
9h 
 networks with ft networks in terms of cost, speed and effectiveness. 
 
95 
 5. CONSTRUCTION AND PROPERTIES OF OTHER NETWORKS 
 
 In this chapter we will examine five networks which have been pro- 
 posed as data alignment networks. 
 
 5.1 Uniform Shift Networks 
 
 A -uniform shift network has the capability of "shifting" all inputs 
 ±1 or ± S (mod n) positions through any stage or cycle of the network. How- 
 ever, during any cycle it must shift all inputs by the same distance. It is 
 easy to show that the maximum number of cycles required to perform an arbitrary 
 
 In S 
 uniform shift is Loi-^lJ + LoJ-1* This upper bound is minimized if S = Vn 
 
 (assuming n is a square). This type of network generally performs well for 
 uniform shifts, the time being bounded by 7n-l cycles. But if non-uniform 
 shifts are required, then we run into trouble. 
 
 In the case of ILLIAC IV, it was generally true that significant 
 changes had to be made either to the storage mapping function or to the 
 algorithm itself in order to force the alignment patterns to be uniform shifts. 
 If this could not be done, then a software routine had to be used which resulted 
 in n shifts of +1. Any problem which relied on this routine usually performed 
 so badly that it was no longer considered for use on ILLIAC IV. 
 
 Let us examine our alignment requirements in terms of uniform shifts. 
 Recall (Chapter 3) that we characterize our' alignment requirements by a 
 function 
 
 C(x) = (ju(v(x)), *(x)) , 
 
 where we considered the cases <t(x) = x or r (x) = 2x. In terms of shift 
 
 "^e all stages would be identical, we may assume only one stage is used 
 but is recycled to necessary number of times. 
 
 "^he network used in ILLIAC IV is a +1, ±8 uniform shifter. 
 
96 
 
 distances then, the x-th element of the n vector will require a shift of 
 
 S(x) = $(x) - ju(v(x)) (mod n) . 
 
 Now, if <$>(x) is independent of x, then it is a -uniform shift; otherwise it is 
 not. Table VI summarizes the results of Tables I-III of Chapter 3, in terms 
 of uniform or non-uniform shifts. An x represents a non-uniform shift while 
 represents a uniform shift. As we can see, in no case are more than two 
 of the n-vectors listed accessible by uniform shifts. (We have not even con- 
 sidered broadcast patterns since this kind of network cannot produce them) . 
 
 There are several ways of building such a network. One way is to 
 use a single stage as shown in Figure 25. (Figure 25 actually shows two stages 
 in order to more easily show the connections. One may mentally fold the second 
 stage into the first). Figure 26 shows one of the n elements. Clearly, the 
 interstage wiring for this type of network is more complex than that of the 9. 
 network. This network requires 5n gates/stage (see Figure 20 of Chapter h) . 
 
 Finally, we may consider connecting the N processors together with a 
 ±1, ±S connection and using the processors as a single, recycleable stage. 
 Each processor requires h input and h output lines, while using fi connections 
 required only two input and two output lines. Further, the fi connections 
 allowed us to align 2N data words while the ± 1 , ± S connection only allows us 
 to shift N words. 
 
 let us propose that we add ±1 connections to the Q connections. 
 Then, we have a system whose cost is approximately the same as the ±1, ± /n 
 system, but which can do any uniform shift in < log N cycles as opposed to 
 < t/n cycles for the ±1, ±71 system. Additionally, the ±1, fl system can 
 produce permutations which are not uniform shifts. 
 
 In spite of the apparently overwhelming evidence against the uniform 
 
97 
 
 o 
 
 
 
 
 
 
 
 H 
 
 
 
 H 
 
 O 
 
 X 
 
 X 
 
 X 
 
 X 
 
 S 
 
 OJ 
 
 X 
 
 H 
 
 H 
 
 
 
 
 
 
 C\J 
 
 »? 
 
 
 OJ 
 
 O 
 
 
 
 
 
 
 
 
 
 
 1 
 
 H 
 
 O 
 
 X 
 
 X 
 
 X 
 
 X 
 
 s 
 
 H 
 
 OJ 
 
 X 
 
 H 
 
 
 
 
 
 
 OJ 
 
 
 
 OJ 
 
 H 
 
 
 
 
 
 
 
 
 
 
 PQ 
 
 
 
 
 
 
 
 
 
 
 I 
 
 H 
 
 X 
 
 o 
 
 X 
 
 X 
 
 X 
 
 s 
 
 OJ 
 
 H 
 
 M 
 
 H 
 
 
 
 
 
 
 CVI 
 
 
 
 OJ 
 
 H 
 
 
 
 
 
 
 
 
 
 
 < 
 
 
 
 
 
 
 
 
 
 
 I 
 
 H 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 S 
 
 o 
 
 H 
 
 X 
 
 H 
 
 
 
 
 
 
 CJ 
 
 
 
 OJ 
 
 H 
 
 
 
 
 
 
 
 
 
 
 pq 
 
 
 
 
 
 
 
 
 
 
 1 
 
 o 
 
 X 
 
 X 
 
 o 
 
 X 
 
 s 
 
 OJ 
 
 H 
 
 x 
 
 H 
 
 
 
 
 
 
 CVJ 
 
 
 
 
 H 
 
 
 
 
 
 
 
 
 
 
 <! 
 
 
 
 
 
 
 
 
 
 
 
 o 
 
 X 
 
 o 
 
 X 
 
 X 
 
 5 
 
 o 
 
 H 
 
 x 
 
 H 
 
 
 
 
 
 
 OJ 
 
 
 
 
 1— 1 
 
 
 
 
 
 
 
 
 
 
 W 
 
 
 
 
 
 
 
 rH 
 
 
 
 i 
 
 o 
 
 X 
 
 X 
 
 X 
 
 X 
 
 Ss5 
 
 + 
 
 H 
 
 X 
 
 H 
 
 
 
 
 
 
 
 US 
 
 
 
 Q 
 1 
 
 H 
 
 o 
 
 X 
 
 X 
 
 X 
 
 o 
 
 & 
 
 <? 
 
 rH 
 
 X 
 
 O 
 1 
 
 H 
 
 o 
 
 X 
 
 X 
 
 X 
 
 X 
 
 [25 
 
 K> 
 
 H 
 
 X 
 
 pq 
 
 
 
 
 
 
 
 
 
 
 i 
 
 o 
 
 o 
 
 X 
 
 X 
 
 X 
 
 }?*; 
 
 rH 
 
 rH 
 
 X 
 
 M 
 
 
 
 
 
 
 
 
 
 
 < 
 1 
 
 o 
 
 X 
 
 o 
 
 X 
 
 X 
 
 ft 
 
 o 
 
 H 
 
 X 
 
 H 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 co 
 
 
 
 
 
 
 
 
 CO 
 
 co 
 
 d 
 
 
 
 
 
 
 
 
 rH 
 
 H 
 
 O 
 
 
 
 
 
 H 
 
 
 
 CO 
 
 CO 
 
 •H 
 
 
 
 
 
 <D 
 
 
 
 a 
 
 a 
 
 -p 
 
 
 
 
 
 rg 
 
 
 
 o 
 bo 
 
 o 
 5p 
 
 •H 
 
 -P 
 
 
 
 
 
 £ 
 
 
 
 CO 
 
 CO 
 
 Fl 
 
 
 
 
 
 
 
 •H 
 
 •H 
 
 CO 
 
 
 
 
 
 CD 
 
 
 
 >d 
 
 13 
 
 ft 
 
 w 
 
 
 
 
 H 
 
 
 ta 
 
 
 CD 
 
 to 
 
 I? 
 
 CD 
 •H 
 
 
 
 
 CO 
 
 
 3 
 
 CO 
 
 h 
 
 
 Fh 
 
 
 
 
 Eh 
 
 w 
 
 p 
 
 •j 
 
 Q) 
 
 x 
 
 
 ' : 
 
 ft 
 
 ^— ^ 
 
 
 [5 
 
 rH 
 
 fi 
 
 
 
 R 
 
 CD 
 
 •H 
 
 X 
 
 
 B 
 
 o 
 o 
 
 o 
 
 I? 
 
 a) 
 
 5 
 
 CO 
 
 co 
 
 o 
 
 CO 
 
 d 
 
 CD 
 -P 
 -P 
 
 CO 
 
 Cm 
 
 ■s 
 
 CD 
 
 •H 
 
 rH 
 < 
 
 -P 
 
 <H 
 •H 
 
 en 
 
 •H 
 
 d 
 
 d 
 
 o 
 
 •H 
 
 d 
 
 CD 
 
 H 
 
 ■s 
 
 E-t 
 
9 8 
 
 Figure 25. A (±1,±2) Uniform Shift Network 
 
99 
 
 CONTROL SIGNALS 
 
 -S INPUT 
 
 1 INPUT 
 
 1 INPUT 
 
 ♦ S INPUT 
 
 ] =o 
 
 o 
 
 o 
 
 o 
 
 :> 
 
 -> OUTPUT -S 
 •» OUTPUT -1 
 -> OUTPUT +1 
 -> OUTPUT +S 
 
 re 26. One Element of a (± 1, ± S) Shift Network Using NAND Gates 
 
100 
 
 shift networks, we will examine one more such network, the barrel shifter. 
 
 5.2 The Barrel Shifter 
 
 The barrel shifter is capable of doing any -uniform shift. It is 
 conceptually a simple device. It consists of log n non-identical stages. 
 
 The i-th stage is capable of switching all inputs either or 2 positions. 
 This is shown in Figure 27 for a four port network. 
 
 Thus, the barrel shifter requires log n stage delays for any uni- 
 form shift. It requires 3n gates/stage. This is to be compared with the o, 
 network which requires Un gates/stage and logpn stage delays. Unfortunately, 
 the stage interconnections of the barrel shifter are not uniform and thus we 
 cannot build just a single stage and recycle it log n times. 
 
 The only advantage of the barrel shifter is that the upper bound on 
 the time required to perform any uniform shift is less than the same bound for 
 the ±1, ±S shifter. But the latter network has a smaller lower bound. The 
 barrel shifter is clearly inferior for our purposes to the Q network. 
 
 5-3 The Batcher Network 
 
 The Batcher network [ k ] was first proposed as a sorting network 
 (see Figure 29)- However, it should be clear that any network capable of 
 sorting n numbers is also capable of switching n numbers. We will not go into 
 the details of this network. It is sufficient to say that the networks 
 consist of log n(l+log n)/2 stages. Each stage is (or can be) interconnected 
 
 by the perfect shuffle and each of the n/2 elements of each stage is capable 
 of ordering its two inputs into descending or ascending sequence. Each 
 element can be built with 13 NOR gates for a total of 13n/2 gates/stage 
 (see [h]) . This is for the "control" plane. Data planes can be identical to 
 
101 
 
 Figure 27. A Four Port Barrel Shifter 
 
102 
 
 CONTROL LINES 
 
 INPUT -2 1 
 
 INPUT 
 
 O 
 
 o 
 
 OUTPUT 
 
 OUTPUT 2' 
 
 Figure 28. One Barrel Switch Element 
 
103 
 
 
 
 1 
 
 2 
 3 
 
 4 
 5 
 
 6 
 7 
 
 Figun 29. An 8 x 8 Batcher Network. Crosshatched Elements Sort in 
 Descending Order 
 
104 
 
 those of the ft network except there are more stages. 
 
 We could build only one stage and recycle it «(log n) times in 
 
 order to save gates. However, one additional problem arises. Each element 
 will require some extra logic since during some cycles it must produce a 
 descending order while during others it must produce an ascending order. We 
 will ignore this problem. 
 
 Finally, we could use the processors themselves as Batcher elements. 
 Now we have a system which is practically identical to that of section 3, 
 Chapter k. Batcher's results tell us that using this system we can do any 
 
 permutation in an order of (log n) cycles. The results of Chapter 2 tell us 
 that if the permutation satisfies definition 3 of Chapter 2 (which includes 
 uniform shifts), then it can be done in log n cycles. Other permutations may 
 
 require fewer or greater than log n cycles. 
 
 5-4 The Benes Network 
 
 While Benes did not invent this network, his extensive discussion 
 of this network prompted us to name it after him [ 5 1 ■ It was originally in- 
 tended for telephone traffic switching. 
 
 The data planes of a Benes network consist of (21og n)-l stages of 
 n/2 elements interconnected by the perfect shuffle and with the exception of 
 the number of stages they are identical to the data planes of the ft and 
 Batcher networks (see Figure 30). Benes shows that the typology of such a 
 network is sufficient to produce any permutation of inputs. Unfortunately, 
 the determination of the switching of each element in the network is complex. 
 The best known result [13] requires time on the order of n Log n 
 
 Actually, Benes considered these networks in a more general form. We 
 consider here only binary networks. 
 
105 
 
 Figure 30. A Binary 8x8 Benes Network 
 
io6 
 
 operations. Thus, it would appear that this network would not be practical 
 in situations where the control signals have to he determined in real time. 
 However, in some instances it might be possible to compute these signals 
 ahead of time, say in a compiler, and in these cases the Benes network might 
 be useful. 
 
 5.5 The Crossbar Switch 
 
 A conceptual diagram of a crossbar switch is shown in Figure 31. 
 This could be implemented as shown in Figure 32 using fan- in logic and 
 limiting fan-in and fan-out to 2. Control signals for this network are derived 
 
 directly from the source tags. This circuit can produce any set of one-to- 
 
 2 
 one or one-to-many connections. It requires an order of n gates and log n 
 
 gate delays for transmission. 
 
 5.6 Network Comparison 
 
 At this point we have discussed these networks sufficiently to allow 
 us to compare them as much as possible with ft networks. We will begin by 
 assuming that we will build a complete network, i.e., we will not build just a 
 single stage which can be recycled. We will follow this by a comparison based 
 on single stage design. In this first case we will not consider the ±1, + S 
 uniform shifter since it is clearly nonsensical to build such a network. For 
 each of the ft, barrel shift, Batcher and Benes networks we will compare control 
 gates, data gates, switching time, data transmission time, and capability. 
 Table VII shows the relative values of these parameters. The expressions listed 
 represent only relative magnitudes and do not represent actual values. - 
 
 Now let us assume we interconnect the N processors in by ±1, ± /n 
 connections or by the perfect shuffle connections. Table VIII summarizes these 
 results. The only difference between the ft, Batcher, and Benes "networks" is 
 
107 
 
 (I 
 2 
 3 
 
 INPUTS < 
 
 N 
 
 ill f 
 
 1^^ 1< 
 
 ihh^ ^ 
 
 12 3 • • • N 
 OUTPUTS 
 
 Figure 51. A Conceptual Diagram of an N x N Crossbar Switch 
 
108 
 
 Figure 32. A k X k Crossbar Using AWD-OR Components and Controlled by 
 Source Indexing 
 
109 
 
 
 
 ™T^^ 
 
 
 
 
 
 
 (D 
 
 
 
 
 
 
 ^ 
 
 
 
 
 
 
 2 
 
 
 
 
 
 
 H 
 
 
 
 
 
 
 bO O 
 
 
 
 
 
 
 2 sd 
 
 
 
 
 
 
 •H -H 
 
 
 
 
 
 
 £~ 
 
 
 
 
 ra 
 
 
 W OJ 
 
 
 
 
 cd 
 
 
 ■H 
 
 
 
 
 •H 
 
 
 -P • 
 
 
 
 
 -P 
 
 
 co ,d 
 
 
 
 
 ■H 
 H 
 
 •H 
 
 
 W o 
 
 
 
 
 
 fltH^ 
 
 sd 
 
 sd 
 
 sd 
 
 ,Q 
 
 w 
 
 O O w 
 
 o 
 
 o 
 
 o 
 
 CO 
 
 ■p 
 
 •H -P 
 
 ■H 
 
 •H 
 
 •H 
 
 ft 
 
 CH 
 
 ■PfO'H 
 
 -P 
 
 -p 
 
 -P 
 
 CO 
 
 •H 
 
 CO -H 
 
 CO 
 
 CO 
 
 co 
 
 o 
 
 & 
 
 -P d .£ 
 
 -P 
 
 -p 
 
 -P 
 
 
 CO 
 
 g O w 
 
 g 
 
 g 
 
 g 
 
 
 S 
 
 fi -P £ 
 
 s 
 
 H 
 
 M 
 
 
 In 
 
 CD -H £ 
 
 CD 
 
 (D 
 
 CD 
 
 
 O 
 
 ft sd o 
 
 ft 
 
 ft 
 
 ft 
 
 
 Ch 
 
 •H <Vn 
 
 
 
 
 
 •H 
 
 t^'Pl »H 
 
 >? 
 
 £> 
 
 £? 
 
 
 $H 
 
 C CD id 
 
 S 
 
 sd 
 
 sd 
 
 
 ^ 
 
 CO ^d 2 
 
 CO 
 
 CO 
 
 CO 
 
 a 
 
 
 
 
 
 
 o 
 
 
 
 0J 
 
 
 
 ■H 
 CO 
 
 Sd 
 
 Sd 
 
 'cT 
 
 a 
 
 sd 
 
 CO 
 
 
 
 
 
 
 •H 
 
 bO 
 
 bD 
 
 bO 
 
 bO 
 
 bD 
 
 g 
 
 O 
 
 O 
 
 O 
 
 o 
 
 o 
 
 CO 
 
 H 
 
 rH 
 
 H 
 
 H 
 
 rH 
 
 (B CJ O 
 
 
 
 — 
 
 OJ 
 
 
 -Ptlig 
 
 
 
 
 
 
 CO fn >H 
 
 
 
 
 
 
 Q H Eh 
 
 
 
 
 
 
 
 
 
 OJ 
 
 
 
 H 
 O 
 
 
 C 
 
 'H' 
 
 sd 
 
 
 -p (1) 
 
 H 
 
 bO 
 
 '.:) 
 
 bD 
 
 H 
 
 
 
 o 
 
 O 
 
 O 
 
 
 
 H 
 
 r-\ 
 
 H 
 
 
 O En 
 
 
 
 
 id 
 
 
 
 
 
 OJ 
 
 
 
 
 Sd 
 
 d 
 
 Id 
 
 sd 
 
 
 co 
 
 
 
 
 
 
 CO 0) 
 
 bD 
 
 bO 
 
 bD 
 
 bD 
 
 
 -p -p 
 
 O 
 
 O 
 
 o 
 
 O 
 
 W ,H 
 
 CO CO 
 
 H 
 
 H 
 
 H 
 
 r-l 
 
 sd 
 
 « C3 
 
 
 
 «« — ^ 
 
 
 
 
 Sd 
 
 id 
 
 d 
 
 sd 
 
 OJ 
 
 
 
 
 
 OJ 
 
 ,0 
 
 
 rH 
 
 o 
 
 
 Sd 
 
 $d 
 
 "id 
 
 
 CO 
 
 bD 
 
 bO 
 
 M 
 
 i 
 
 U w 
 
 *. * 
 
 O 
 
 O 
 
 o 
 
 i 
 
 -P 0) 
 
 o 
 
 <-\ 
 
 r-\ 
 
 H 
 
 i 
 
 d -p 
 
 
 
 ^-^ 
 
 
 
 o ro 
 
 
 a 
 
 d 
 
 sd 
 
 
 o a 
 
 
 
 
 A 
 
 
 
 fH 
 
 
 
 
 
 
 CD 
 
 
 
 
 
 
 -p 
 
 
 
 
 
 
 fH 
 
 
 
 
 
 
 •H 
 
 
 
 
 
 
 .£ 
 
 
 
 
 
 
 w 
 
 
 5h 
 
 
 Sh 
 
 CO 
 
 
 H 
 
 
 CD 
 
 
 ,0 
 
 
 0) 
 
 
 A 
 
 co 
 
 co 
 
 
 fn 
 
 
 o 
 
 QJ 
 
 w 
 
 
 fn 
 
 
 -p 
 
 sd 
 
 O 
 
 
 CO 
 
 
 § 
 
 cd 
 
 fn 
 
 
 Ml 
 
 a 
 
 ■ 
 
 o 
 
 CD 
 
 -P 
 •H 
 
 sd 
 
 •H 
 CH 
 
 H 
 CD 
 S* 
 
 w 
 CO 
 
 I 
 
 •p 
 
 •H 
 Jh 
 O 
 bD 
 
 H 
 
 CO 
 
 bD 
 
 sd 
 
 ■s 
 
 -p 
 
 •H 
 
 -p 
 
 co 
 CO 
 
 
 
 ■p 
 
 Sh 
 
 .2 
 
 tS 
 CD 
 
 5h 
 •H 
 
 & 
 
 CD 
 fH 
 
 CD 
 Sh 
 
 CO 
 
 CO 
 
 CD 
 •H 
 U 
 
 I 
 
 fn 
 
 ^> 
 
 •H 
 bD 
 •H 
 H 
 bD 
 CD 
 
 I- 1 
 
 u 
 
 •, 1 
 
 d 
 
 J3 
 
 •i i 
 
 
 rd 
 
 sd 
 
 o 
 
 
 ed 
 
 bD E 
 
 o 
 
 
 H 
 
 CD 
 
 
 1 ' 
 
 1 
 
 1 ' 
 
 CO 
 
 -p 
 
 (D 
 
 CD 
 Ph 
 
 O 
 
 sd 
 o 
 
 co 
 •H 
 
 fH 
 
 CO 
 
 & 
 8 
 
 H 
 H 
 > 
 
 CD 
 
 H 
 ■§ 
 
 EH 
 
110 
 
 CD 
 
 
 
 
 "1 
 
 £ 
 
 
 
 
 
 •H 
 
 
 
 
 ■ 
 
 Eh 
 
 
 
 
 
 M a) 
 
 
 
 
 1 
 
 Pi H 
 
 
 
 
 i 
 
 •H O 
 
 o 
 
 H 
 
 Pi 
 
 i— i 
 
 ,£ >3 
 
 
 
 
 
 O O 
 
 
 
 
 
 -P 
 
 
 
 
 
 •H Jh 
 
 
 
 
 
 £ CD 
 
 
 
 
 
 GQ f^ 
 
 
 
 
 
 !» 
 
 
 
 
 
 Ti fH 
 
 
 
 
 
 pi cc 
 
 
 
 
 
 O fn fn 
 
 
 
 
 
 •H -P 
 
 
 CVJ 
 
 
 OJ 
 
 r& PS -H <H 
 
 H 
 
 — -s 
 
 
 — ■* 
 
 S3 C^ ,Q O 
 
 I 
 
 PI 
 
 PJ 
 
 pi 
 
 pi CD H -H 
 O Pi < 4J 
 
 S 
 
 OJ 
 
 OJ 
 
 OJ 
 
 
 bO 
 
 w 
 
 bO 
 
 PQ co 
 
 
 O 
 
 o 
 
 O 
 
 w Pi -P 
 
 
 H 
 
 H 
 
 H 
 
 fn 0) CO pi 
 
 
 s — - 
 
 CVJ 
 
 - — - 
 
 CD i—l S 
 
 
 
 
 
 fto ^ S 
 
 
 
 
 
 P) i>iO OJ 
 
 
 
 
 
 lb o *U Ph 
 
 
 
 
 
 -d 
 
 
 
 
 
 CD 
 
 
 
 
 
 ?H 
 
 H 
 
 
 
 
 •H 
 
 i 
 
 
 
 P! 
 
 &g 
 
 *? 
 
 
 
 0J 
 
 bO 
 
 CD O 
 
 
 P! 
 
 PI 
 
 O 
 
 K <+H 
 
 V| 
 
 OJ 
 
 OJ 
 
 H 
 
 •H 
 
 
 bO 
 
 bO 
 
 
 w Pi 
 
 o 
 
 O 
 
 O 
 
 V| 
 
 J) D -p 
 
 
 H 
 
 H 
 
 
 H fH 
 
 V| 
 
 
 OJ 
 
 o 
 
 o fn -H 
 
 
 
 
 
 >3 O rd 
 
 H 
 
 
 
 V| 
 
 O fH 02 
 
 
 
 
 H 
 
 fH 
 
 
 
 
 
 -P o 
 
 
 
 
 
 p5 co 
 
 
 
 
 
 Pi w a 
 
 
 
 
 
 -P fH 0) 
 
 
 
 
 
 pi -H O 
 
 
 
 
 
 O CO o 
 
 -=h 
 
 OJ 
 
 OJ 
 
 _4- 
 
 \ Pm "h 
 
 
 
 
 
 -P Ph 
 
 
 
 
 
 pi CD 
 
 
 
 
 
 Ph Pi fH 
 
 
 
 
 
 Pi -H CD 
 
 
 
 
 
 H hlft 
 
 
 
 
 
 
 
 
 
 G 
 
 | 
 
 
 
 
 
 fH 
 
 
 
 G 
 
 
 CD 
 
 
 ? 
 
 1 
 
 
 o 
 
 CO 
 
 
 +1 
 
 CD 
 
 CO 
 
 pq 
 
 
 »\ 
 
 O 
 
 CD 
 
 •N 
 
 
 H 
 
 -P 
 
 CO 
 
 P! 
 CD 
 
 H 
 
 
 +1 
 
 PQ 
 
 m 
 
 +1 
 
 CO 
 CD 
 
 .a 
 
 o 
 
 m 
 
 a 
 o 
 
 •H 
 
 -p 
 
 o 
 
 
 o 
 o 
 
 fH 
 
 CD 
 -P 
 P! 
 H 
 
 fH 
 
 O 
 
 w 
 
 CO 
 CD 
 O 
 O 
 fH 
 
 Ph 
 
 Ch 
 O 
 
 p! 
 o 
 to 
 
 •H 
 fH 
 
 CO 
 
 t 
 
 O 
 
 o 
 
 > 
 
 CD 
 H 
 
 Eh 
 
Ill 
 
 the decision process required in each processor during each cycle. In the 
 case of the P. network, this requires examination of one bit each of two log N 
 
 hit tags in each processor during each cycle- For the Batcher network this 
 requires arithmetic comparison of two log„N bit tags in each processor during 
 
 each cycle. Since these two are so similar, we have combined them in Table 
 VIII. The Benes network using Opferman and Tsao-Wu's algorithm requires a 
 control time on the order of n log n but cannot be done based on examination 
 of the two data elements in each processor at each cycle. The switching time 
 shown in Table VIII simply reflects the total time divided be the number of 
 cycles. 
 
 It should be pointed out that the ± 1, ± /n connection can shift N 
 inputs and during each cycle only one of the four I/O line pairs is used. The 
 shuffle connections allow permutation of n = 2N numbers and during each cycle 
 both i/o line pairs are used. We have shown in Chapter 3 that using 2N memories 
 resulted in the ability to fetch significantly more N-vectors. Thus, the 
 ability to handle any N out of 2N inputs may be significant. 
 
112 
 
 6. conclusion : 
 
 The problem of data alignment for a large parallel computer has been 
 investigated before, but previous results generally relate only to inter- 
 processor connections or switching networks; memory assignment; or (rarely) 
 control algorithms for switching networks. For example, Benes [5 ] in- 
 vestigated the properties of some 21og n stage binary switching networks which 
 might provide a solution except that no control algorithm is known which is 
 fast enough to allow the fast word-after-word switching required in a highly 
 parallel computer. Kuck and Budnik [ 2 ] proposed a solution to the memory 
 assignment problem which allows a wide variety of vectors to be fetched from 
 a parallel memory system without conflicts, but the alignment and indexing 
 problems indicate that this solution is probably not practical. 
 
 We have attempted to consider all these problems simultaneously in 
 order to arrive at a realistic solution. In this paper we have presented such 
 a solution. In order to prove the merits of this solution, we began in Chapter 
 2 by deriving some theoretical properties of a particular type of network which 
 we called an ft network. We showed, for example, a necessary and sufficient 
 condition (definition 3) for a given permutation to be producible by a given 
 ft network. Then, in Chapter 3, we derived some important N-vectors, several 
 storage assignment schemes, and considered several memory systems. We showed 
 in each case whether or not certain N-vectors could be fetched in parallel 
 without conflict and aligned by a binary ft network. We found that by using 
 twice as many memories as processors, a (-/N+l)-skew, 2 skip memory assignment, 
 and a binary ft network we could fetch and align at least the four most im- 
 portant N-vectors (rows, columns, diagonals and t/n x t/n positions) without 
 conflicts. 
 
 Finally, in Chapter 5, we discussed several other types of networks 
 
. 
 
 113 
 
 and their relationships to the ft network. Here we showed that in general the 
 uniform shift networks are not flexible enough to "be useful in general 
 parallel computing applications and, in fact, need not be considered since ft 
 networks are more flexible and either cost less or at least do not cost 
 significantly more than uniform shift networks. In addition, comparison with 
 more general networks (Benes [5] and Batcher [!)-]) which are capable of 
 producing any arbitrary permutation showed the ft network to be cheaper and 
 faster. 
 
 We also discussed the possibility of interconnecting the processors 
 together in such a way as to form one stage of an ft network. By recycling this 
 single stage the correct number of times we could effectively simulate an ft, 
 Batcher, or Benes network. As it turns out, this particular interconnection 
 of processors was proposed by Stone [12] and called the perfect shuffle. Using 
 the results of Chapter 2, we were able to prove the new result that certain 
 important permutations (including uniform shifts) could be done in exactly 
 log-N cycles where N is the number of processors. In comparison, the ILLIAC IV 
 
 ±1, ± /N connections allow any uniform shift in time < 7n. If we add ±1 
 connections to the perfect shuffle connection we get a system with the same 
 number of i/o line pairs (probably the major cost item) and which can handle 
 
 any uniform shift plus other important permutations in time < log N. 
 
 Thus, while we may not have found a panacea, we have demonstrated a 
 class of network which provides a feasible solution to the data alignment and 
 memory conflict problems in a parallel vector processing machine. 
 
114 
 
 LIST OF REFERENCES 
 
 [l] G. H. Barnes, et. al., "The Illiac IV Computer," IEEE Transactions on 
 Computers, Vol. C-17, pp. 746-757; August 196b. 
 
 [2] P. Budnik, and D. J. Kuck, "The Organization of Parallel Memories," 
 IEEE Transactions on Computers , C20, p. 1566; December 1971- 
 
 [3] P. W. Kraska, • "Array Storage Allocation," M.S. Thesis, Department of 
 Computer Science, University of Illinois at Urbana- Champaign, 
 Report No. 344; August 1969. 
 
 [4] K. E. Batcher, "Sorting Networks and Their Applications," Proceedings 
 of the Spring Joint Computer Conference , 1968; pp. 307-314. 
 
 [5] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone 
 Traffic, Academic Press, New York; 1965 • 
 
 [6] I. M. Vinogradov, Elements of Number Theory , Dover Publications; 1954. 
 
 [7] D. Shanks, Solved and Unsolved Problems in Number Theory, Vol. I, 
 Spartan Books, Washington, D. C; 1962. 
 
 [8] Y. Muraoka, "Storage Allocation Algorithms in the TRANQUIL Compiler," 
 M.S. Thesis, Department of Computer Science, University of 
 Illinois at Urbana-Champaign, Report No. 297; January 1969- 
 
 [9] M. C. Pease, "An Adaption of the Fast Fourier Transform for Parallel 
 
 Processing," Journal of the ACM , Vol. 15, pp. 252-264; April 1968 
 
 [10] J. E. Stevens, "A Fast Fourier Transform Subroutine for Illiac IV," 
 Center for Advanced Computation, University of Illinois at 
 Urbana-Champaign, Document No. 17; October 1971* 
 
 [11] D. J. Kuck, and A. H. Sameh, "Parallel Computation of Eigenvalues of 
 Real Matrices," Proceedings of the IFIP Congress , p. TA-1-24; 
 1971. 
 
 [12] H. S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE 
 Transactions on Computers , C20, pp. 153-l6l; February 1971 • 
 
 [13] D. C. Opferman, and N. T. Tsao-Wu, "On a Class of Rearrangeable 
 
 Switching Networks," Bell System Technical Journal , Vol. 50, 
 pp. 1579-1618; May-June 1971. 
 
115 
 
 VITA 
 
 Duncan Hamish Lawrie was born in Chicago, Illinois, on April 26, 
 19^3. He received the B.A. degree from DePauw University in Greencastle, 
 Indiana, and the B.S.E.E. degree from Purdue University, "both in 1966, and the 
 M.S. degree in Computer Science from the University of Illinois in 1969. While 
 at the University of Illinois Mr. Lawrie held a NASA Traineeship in Computer 
 Science and a Research Assistantship in the Department of Computer Science. 
 He also worked on the Illiac IV project as Senior Research Programmer in charge 
 of language development and computer operations. He is a member of Tau Beta Pi, 
 Eta Kappa Nu, the Association for Computing Machinery, the Institute of Elec- 
 trical and Electronic Engineers, and is an associate member of Sigma Xi . He 
 has published two papers, "The Use and Performance of Memory Hierarchies: A 
 Survey," in Software Engineering , Vol. I, Academic Press, New York (197°); and 
 "interconnection Networks for Processors and Memories in Large Systems, " pre- 
 sented at the COMPCON 72 convention in San Francisco, California (1972), both 
 with D. J. Kuck. 
 
BLIOGRAPHIC DATA 
 IEET 
 
 1. Report No. 
 
 uiucdcs-r-73-557 
 
 Title and Subtitle 
 
 MEMORY-PROCESSOR CONNECTION NETWORKS 
 
 3. Recipient's Accession No. 
 
 5. Report Date 
 
 February 1973 
 
 6. 
 
 Author(s) 
 
 Duncan Hamish Lawrie 
 
 8. Performing Organization Rept. 
 
 No -uiucdcs-R-73-557 
 
 Performing Organization Name and Address 
 
 University of Illinois at Urbana- Champaign 
 Department of Computer Science 
 Urbana, Illinois 6l801 
 
 10. Project/Task/Work Unit No. 
 
 11. Contract/Grant No. 
 
 US NSF GJ 274J+6 
 
 Sponsoring Organization Name and Address 
 
 National Science Foundation 
 Washington, D. C. 
 
 13. Type of Report & Period 
 Covered 
 
 Doctoral - 1972 
 
 14. 
 
 Supplementary Notes 
 
 Abstracts 
 
 In order to utilize the potential speed of a SIMD type parallel processor it 
 is necessary to arrange data in the memory system so that subsets of this data can 
 be fetched in parallel without memory conflicts. Additionally, we must provide 
 sufficient memory-processor paths to allow the data to be correctly aligned with 
 the processor array. In this paper we present several storage mapping algorithms 
 together with a memory-processor interconnection network. We demonstrate the cost 
 and effectiveness of these and compare them with other networks which have been 
 proposed for this application. 
 
 Key Words and Document Analysis. 17o. Descriptors 
 
 Indexing Networks 
 Sorting Networks 
 Switching Networks 
 Crossbar Switch 
 Memory-Processor Connection 
 
 b. Identifiers/Open-Ended Terms 
 
 e. COSAT1 lie Id /Group 
 
 • Availability Statement 
 
 RELEASE UNLIMITED 
 
 19. Security Class (This 
 Report) 
 
 UNCLASSIFIED 
 
 :uritv Class (Thi 
 
 20. Security Class (This 
 Page 
 
 1JNCLASS1FIED 
 
 21. No. of Pages 
 
 121 
 
 22. Price 
 
 RM N TIS-19 (10-701 
 
 USCOMM-DC 40329-P7 1 
 
mtfiiaw* 
 

OCT 24 19/3 
 
UNIVERSITY OF ILLINOI9-URBANA 
 
 3 0112 052121743