LIBRARY OF THE UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN UlCbr v/Y\.O.S»53-SS?> QO^ z CENTRAL CIRCULATION AND BOOKSTACKS be charged a minimum fee or * each non-returned or o ^. _ ^ Theft, mufilation. o. de ^^'^materiais owned by causes for s»uder,t «"««P ,, ""» ""'"•«,. prop erry of .he S.a.e , he University of HlinoU J L.br«v « »" . P JP ^ ^^ of Illinois and are protected by Art. UW and ^edure^ ^ 333.^, Unl ver^ J EL« = r-'^^Z s £UUl JUL i I ZUU1 When renewing by phone, write new due d£ below previous due date. Digitized by the Internet Archive in 2013 http://archive.org/details/memoryprocessorc557lawr ^-^br Report No> uiUCDCS-R-73-557 y\c.ssi -$& ^ ^ 0* MEMORY- PROCESSOR CONMICTION NETWORKS by ■?& k c* ov^ V o^ Duncan Hamish Lawrie February 1973 Report No. UIUCDCS-R- 73-557 MEMORY-PROCESSOR CONNECTION NETWORKS hy Duncan Hamish Lawrie February 1973 Department of Computer Science University of Illinois at Urbana- Champaign Urbana, Illinois 6l801 This work was supported in part by the National Science Foundation under Crant No. US NSF CJ 2 r jkk6 and was submitted in partial ful- fillment of the requirements for the degree of Doctor of Philosophy in Computer Science, February 1973 • MEMORY-PROCESSOR CONNECTION NETWORKS Duncan Hamish Lawrie, Ph.D. Department of Computer Science University of Illinois at Urbana- Champaign, 1973 In order to utilize the potential speed of a SIMD type parallel processor it is necessary to arrange data in the memory system so that sub- sets of this data can be fetched in parallel without memory conflicts. Addi- tionally, we must provide sufficient memory-processor paths to allow the data to be correctly aligned with the processor array. In this paper we present several storage mapping algorithms together with a memory-processor inter- connection network. We demonstrate the cost and effectiveness of these and compare them with other networks which have been proposed for this application, Ill ACKNOWLEDGMENT The author would like to thank Professor David J. Kuck for his patience and inspiration, the Department of Computer Science for its support, and the ILLIAC IV Project for a long and usually fruitful relationship. The diligent efforts of Vivian Alsip and Barbara Armstrong in typing this document and the workmanship of the printing and drafting de- partments are greatly appreciated. IV TABLE OF CONTENTS Page 1. INTRODUCTION 1 2. ft NETWORKS, THEORETICAL DEVELOPMENT 5 2.1 Fundamentals . 5 2.2 The t Relation and its Properties 9 2.3 Construction of a Network from ft l8 2.k The Equivalence of H(ft) and ft 2^4 2.5 Minimal ft Networks 29 2.6 Summary 31 3. EFFECTIVENESS OF ft NETWORKS 33 3-1 Representation of Array Storage Schemes and N-Vectors. . . 35 3-2 Important N-Vectors 37 3.3 Memory Conflicts, Memory and Vector Equations, and the Network Connection Equation 38 3.4 Effectiveness of ft Networks ^1 3.5 Additional Considerations, Cost 69 3-6 Summary 7° k. CONSTRUCTION OF SEVERAL ft NETWORKS 72 k.l A Bit Serial ft Network 72 k.2 A Better Network 7& k.3 Processor-Processor Connections °7 k.k Summary 92 5 • CONSTRUCTION AND PROPERTIES OF OTHER NETWORKS 95 5.1 Uniform Shift Networks 95 5-2 The Barrel Shifter 100 5-3 The Batcher Network 10° 5-h The Benes Network 10l+ 5.5 The Crossbar Switch 1Q 6 5-6 Network Comparison ' 1°" 6. CONCLUSION 112 LIST OF REFERENCES lll+ VITA 115 1. INTRODUCTION It has been suggested that the use of a large number of processors operating on a single program can be used to produce improvements in speed and possibly cost effectiveness. Recent experiments with this concept have indicated a number of problems which are as yet unsolved. The purpose of this paper is to investigate a possible solution to one of these problems, the problem of how to provide the processors with data at a rate which matches the processors' speed. It is clear that given N processors, we need on the order of N memories just to provide the raw bandwidth requirements. Given an order of N memories, we must still solve two problems. The first problem is to assign the data to memory in such a way that memory conflicts are minimized; that is, the data must be arranged so that during the fetch of any required set of N data elements, no two of these will be stored in the same memory. This will be called the memory assignment problem. Once the data has been fetched, each datum must be sent to the correct processor. This is the second problem which we will call the data alignment problem. Independent solutions to the above problems have been suggested in the literature and elsewhere, but to the best of our knowledge no practical and compatible solution to both problems has ever been published. For example, (±1,±/n), and barrel shifters have been proposed (and built) to solve the data alignment problem [ 1 ] and memory assignment algorithms have been pro- posed to solve the memory assignment problem [2, 3 ] • Unfortunately, the better solutions of the latter generally require alignments which are more complex than simple uniform shifts (see Chapter 5)- More complex networks have been proposed [.'i, 5] , but their high cost or excessive switching time make them impractical for our purposes. In this paper we present several effective memory assignment algorithms and propose a new network which can meet the resulting alignment requirements. We begin by proving some useful results about these networks and then showing their effectiveness in handling certain important align- ments. We follow this with a comparison with other solutions and show that our solution is more effective than these previous solutions. Before continuing we will define some terms and assumptions which will be implicit in the following work. By a network we mean a device which has n input and n output ports. Each datum entering the network remains integral, but the ordering of the data may be altered by the network. We assume that a different ordering (or permutation) of the inputs is required for each successive set of inputs, i.e., a different switching of the network is required for each set of n or less inputs. Thus, the time required to switch the network is a significant factor. This network may be thought of as lying between the N processors and M memories as in Figure 1, as a separate function box available to the processors as in Figure 2, or as an integral part of the processors. In general, we assume that at each unit of time, N or fewer of the processors require one datum each. These data are fetched and aligned with the correct processors. We shall also assume that these data are taken from a uniform part of an N x N matrix, e.g., row, column, diagonal, etc. We will define this more clearly in Chapter 3. • • • ALIGNMENT NETWORK • • • PROCESSORS MEMORIES iCure ] . l'rooer.jor-Network-Memory System ALIGNMENT NETWORK A Mi A • • W PROCESSORS MEMORIES Figure 2. Network- Processor-Memory System 2. fi NETWORKS, THEORETICAL DEVELOPMENT The purpose of a switching network, variously known as a permutation, indexing, connection, or sorting network, is to produce connections between input nodes and output nodes. The network may he changed or switched by means of control signals in such a way as to alter these connections to meet various requirements. In this chapter we will investigate in some detail a special class of network, hence- forth ft networks, which we feel are especially useful for connections between a vector of memories and a vector of processors. ft networks are based on a mathematical notion of a "ft base" representation of integers. We begin by investigating some special properties of an ft base system. We then proceed to show how to build a network from an ft base and then using our results we show how to determine whether the ft network can produce certain special connections. In later chapters we will derive a number of connections which are frequently required during operation of a parallel computer. Then, using the results derived in this chapter, we will show that the ft network performs well in certain types of parallel computers. 2.1 Fundamentals We will begin with some fundamental definitions from number theory. Any number x can be expressed as the sum of a multiple of a modulus m and a remainder r : x = mb + r r ax e ay m m gcd(a,m)=l A ax = ay -> x e y m m a|m A ax e ay + x e y m m/a a|n A x E y ^ x = y m m/a m x ^yAxEy + x^y am m x = y A (0 <_ x,y < m/a) ■*■ x = y m/a m x = y -* ax E ay m am m/a m a|mAx e y->xEy a-xEa-y-e+xEy m m am m x E y -+ x^a E y^-a Pll m am P12 x J y<--+ ax f ay We will tend to use these results in their rontraposit ive form. That is , P2 ax i ay -+ x £ y m m Let R be an ordered set of factors of n n, R n = {p 1 , P 2 ,..., P k such that p. x p_x...X p. = n. Define ft (R ) to be the set 1 2 K n ft(R ) = lw, |w. = 1, cj. = ix).^.. x p..-, < i < k-1}, so co. = Ji p n L ' k l l+l l+l — i . . . , j=l+l JNote that oj = n. o In general, any number x < co can be expressed as x = l x. X co . , , , i l i=l 8 where x. < p . . We say that x x x ... x^ is the "base ft" representation of x. Thus, if ft = {8,U,2,l}, then we get the familiar base 2 number system while if ft = {12,6,3,1}, then we get a somewhat unusual (but useful) number system where °10 = 000 ft X 10 = 001 ft 2 10 = 002 ft 3 10 = 010 ft u io = °1^ 5 10 = 012 ft 6 10 = 100 ft 7 10 - 101ft 8 10 = l 02 ft 9 10 = ll°ft 10 io = 111ft U 10 = 1120, Notice that 0) . J x 7 "* X i = y i 1 1 i 1 J and x = y -> x. = y. j + 1 ^ i <_ k. J For example, if R n = (2,5,3), then ft(lT) = {30,15,3,1}, and 5 10 = 012 ft U 10 = 011 a 1T 10 = 102 ft So 5 ~= h (012, s 011^) 5 5 17 (012^ = 102 ) 5 ? IT (012^ 4 102 ). 15 -fi 15 -^ In most of the following work, when R is understood or not important we will refer simply to n rather than R(R ). Further, the unsubscripted lower case letters x, y, z, s, d are usually variables whose range is the set of non-negative integers. The notation {f(x) Condition on x} denotes the set of all numbers in the range of the function f where the domain of f is all values of x which satisfy the specified condition. Thus, {2x:x >_ 0} is the set of all even, non-negative integers. Subscripted letters usually refer to specific values of the variable. 2.2 The t Relation and its Properties An n-set represents a mapping or connection between n input nodes and n output nodes . Definition 1 : n-set An n-set P is a set of pairs of integers, P = {(s,d)}. If (s,d)e P, then we say P connects input s' mod n to output d mod n. □ Thus, if P = {(2,1), (7,0), (2,0)} and n=U-, then P represents the connection illustrated in Figure J. 10 Input Output Nodes Nodes Figure 3 • A Connection Definition 2 : m + P Given an n-set P, we say the integer m passes P, written m t P, if an only if for all (s.,d.), (s ,d.)e P m s . $ s . and s. ^ s J -> d. ^ d . D 1 n J - X m J x J Definition 3 : a + P Given an n-set P and a = {w =n, w-,,..., w, =1} , if for all id e H, to t P, then we say 8 +.P. That is n + P«-> (Voe fl, V(s, ,d ),azxd y(s ,d.)e P, s. f s . a s. e b a - d L] 11 J J x n J x 03 J i 3 The negative form of a A P is useful and perhaps its statement will make the relation i clearer. a f P +■■+ 3 to e a and 3(s ,d ),(s.,d.)e P, such that s. i s n ° oo and s. e s. and d. e d . n 1 u) J x J We will now demonstrate several functions f(P) on n-sets and show that 11 a + p •* a + f(p) . Theorem 1 : Let P + c = { (s+c ,d) : ( s ,d)e P , c an integer constant). Then q + p -». Q + p+ c We shall prove the contrapositive form of this theorem: ntP + c->^tP. We have fi fP + c^ 3oj e fi, 3(s.+c, d ) , and 3(s .+c, d.) , such that J J (0 s. + c^s + e . s.+ces,+c, and d . S d . . But i j i J i J n co PI S. +C^S.+C"> S. ^S, 1 n J X n J and* PI s. + c = s , + c -► S . = S , 1 J 1 J so J] f P . D Theorem 2 : Let P x a = { (xxa,d) : ( s ,d)e P, a an integer constant} If for all co e JJ, gcd(a,co) = 1, then ft + P -> ft + P x a. * The notation PI above the implication symbol means that the implication follows from property PI found at the beginning of this chapter. 12 Proof : o,+ PXa + 3 W £0^ 3 (s xa,d .), and 3 (s-Xa,d.), such that CO s.xa^s^xa, s.xa^s.xa-j and d . Ed.. But i J ^ ,, J 1 J n w P2 s.xa^s.x. a-* s. ^ s , and if gcd(a,co) = 1, then 1 n J X n J P3 s.x aEs.x a ■* s. e s . . Thus , 0, * P . n a) J X co J Corollary 2.1 : If gcd(a,n) = 1, then ft+P + ft+Px a . Proof : We need only show that gcd(a,n) = 1 -*■ y w e ft, gcd(a,co) = 1 and then apply Theorem 2. Assume there exists an co e ft , such that gcd(a,to) = b . Then b | a and b | co , but recall that u|n (from the definition of ft ) , i.e., there exists an integer c such that ioc = n. Since b | co then b | toe or b|n. Thus, gcd(a,n) ^_b, and so gcd(a,n) = 1 -*■ V" £ ^> gcd(a,co) =1. □ 13 Theorem 3 : Let c - P = { (c-s ,d) : (s ,d)e P} . Then ft + P ■*■ ft + c - P Proof : P10 As before, c-s. = c - s. -»■ s. = s. i J i J CO co so Hfc-P^-flfP qi Let P = {(ax + b, ex + d):0 <_x < E,} , where a,b,c and d are integer constants . We will now prove a theorem which states that if a,c and £ satisfy certain conditions with respect to ft, then ft + P. Theorem h : Let Pc {(ax + b, cx+ d)|0 < x < £} be an n-set and define h = gpf(c,n) Y m = gcd(c,m) n m = gcd(c,a Ym ) 6 m = gpf (a VV m/Y m ) C m = gpf(a,m) R n = { Pi>P 2 '"-' p k } ft(R ) = {m =n,m ,m , . . . , nL=l} \i = the largest m e ft such that gcd(a,m) = m. If the following three conditions hold, A: c <_ n B: ah/c > 1 or £ < hn/c Ik C : For all m e 9,, u <_ m < n CI: a/(6 xi ) < 1, or mm — C2: £ < m z, /a, — m then n i P. The proof will consist of two parts and several subparts. First, we need a preliminary result. Let (ax + b, ex + d) , (ay + b, cy + d) be any two elements of P. Then we will show ax + b iay+b ■+ ex + d ^ cy + d n n We have PI ex + d = cy + d -> ex = cy n n P3 -> — E -£- where h = gpf ( c ,n ) n Now since c < n and h = gpf(c,n), then ^-| n . So ?k ■* x E y hn c Multiplying by a P8 ■+ ax = ay, ahn ah P5 Suppose ~ >_ 1. Then -> ax E ay. If — < 1, then since x,y < E < — - c (condition B), then ax, ay < =E. Thus, Finally, 15 PT •+ ax = ay . n PI ■> ax + b e ay + b . n Thus, ax+b2"ay+b->-cx + d^ay + d. n n Now we will prove that if for all m e ft where u ^_ m < n, condition CI or C2 holds, then for all m e ft m ax+b ^ ay + b Aax + b Eay+b->-cx + d^cy + d n m and thus !1 t P. Specifically, we shall prove the equivalent theorem m ax + b ^ay+b Acx + dEay + d->-ax + b ^ ay + b n m Using our previous result, we have from condition B ax+b ^ay+b+cx+d^cy+d. n n Then, by P6 we get Now m P6 cx + d^cy + dAcx + dEay + d ■* cx + d^cy + d. n m PI ■+ ex t cy m P8 -*■ -3- t -*- where y = gcd(c,m) Y Y m mm m ^™ m n m = gcd(c,a Ym ). Let 16 v = c/ii , u = ay /n. mm m m m Then since 6 = gpf(u ,m/y ] m OJ ^ m m PU,3 u ex u cy m -. m + ^ Y Y m mu m m 6 y mm P2 u ex u cy m / m But since v Y v y mm u m mm m 6 y m'm u ay n m m m a v y Y n c c mm mm we have ax f. ay u m m 5 Y mm Now suppose CI holds, i.e., u /& y < 1. Then m m'm — P5 -*■ ax f. ay m and PI ax + "b ^ ay + b m Suppose now that CI does not hold, i.e., u /S ■ y > 1, but C2 does hold, i.e m mm 4 < ^m 17 Let 3 = u /& Y • Then we have m mm ax ^ ay 3 > 1 P2 m 3m m where where ? m = gpf(a,m). mc Since x < £ < , then ax, ay < mr — a m PT and - ax , ay.- mm m P3 ->■ ax ^ ay m ^m = SPf(a,m) ■* gcdU m ,m) = 1 Finally, PI ■* ax + b ^ ay + b . m New, suppose p is the largest m e fi such that gcd(a,m) = m, and that CI or C2 hold for all m, n > m >_ p. We will now show that for all m < p, ex + b p cy + b . Let me fi be <_ p such that m = p/p as per the definition of fi. Since gcd(a,p) = p, ax + b = ay + b P must be true always. Since CI or C2 hold for p, we have P ax+b say+b->cx + d^cy + d. P 18 By P9 ; y u/p ex + d % cy + d -*■ ex + d t cy + d. Since the original premise is true, then the final conclusion must be true, and since m = U/p, m ex + d i cy + d. Now, in the special case where u = n, then neither CI nor C2 need hold. But since gcd(a,n) = n, then ax + b = ay + b n must be true always . Thus , ft \ P trivially . □ Theorems 1, 2, and 3 tell us that if Q tP, then ft + P+c , ft t c-P and if gcd(a,n) = 1, then ft i Pxa. However, they tell us nothing about whether or not ft i P. Theorem k tells us that if P satisfies certain conditions with respect to ft, then ft + P. We will use all of these theorems later. First, we will present an algorithm for constructing a switching network from a given ft. We will then conclude this chapter by showing that if ft t P, then the network constructed from ft can produce without conflict the input-output connections specified by P, 2.3 Construction of a Network from ft Algorithm 1 : Construction of a network from ft. Let ft(R ) be as defined earlier, i.e., ft(R ) is the set n ' ' n ft(R n ) = {oj.: w =1, u. = n p.,0■-!— 1 STAGE i+1 OUTPUTS — — 1 p f -I p ul -l 1 I /0, + i-l P\ P\+i pi+1 pi.i+1 *P\ "I 1 I 2/), -1 2/D I+l -l I 2/>, +1 -l n-^+1 n-1 n+/>,+l n-1 n-1 A*iV>i* #n> '/».« n-/> i*l n-1 STAGE STAGE 1*1 Figure k. Stages of a Network Constructed from fi 21 I N P U T — 1 2 3 4 5 6 7 8 9 10 — 11 u T P U T Figure 5. Network Constructed from fi = (12, 6, 2, 1} 22 Figure 6. Network Constructed from Q, = {l8, 9, 3, 1} 23 I N P U T U T P U T ire 7- Network Constructed from = (l8, 6, 3, 1} 2h Algorithm 2 : Network Control Algorithm : Each (s,d) pair in P establishes a path through the network as follows. Assume s and d are expressed in base fi notation, i.e., s = s s ... s , d = id ... d. Then, starting at input s, for i = 1, 2,..., k, each (s,d) pair "enters" a p. x p. crossbar switch and is connected to output d. of that crossbar. Alternately, starting at output d for i - k, k-1,..., 1, each (s,d) pair "enters" an output of a p. x P. crossbar and is connected to input s n . , ., of that crossbar switch. □ 11 - k-i+1 Figure 8 shows the necessary switching of the fi = (l8, 6, 3, 1} network for E = {(0,7), (0,9), (7,l6), (l6,13)> where 7 = 101^, 9 = HOjj, 13 = 201„, and 16 = 211 . Figure 9 shows the switching for P = {(12,15), (15, l6)>, where 15 = 210 , and l6 = 211„ . Note the conflict at the output of stage 2: H(fl)f P and 9, f" P. 2.k The Equivalence of H(°Q and fl It remains for us to show that this network is in some way equivalent to fi. That is, we would like to show that if 0, t P, then the network constructed from 0, "passes" P, and conversely. Each output of each stage of the 0, network is only accessible to elements of P having certain characteristics. This follows from the construction algorithm and the control algorithm. Let (xx... x a ... a ) represent the set of all integers z such that Z of, (£ Vl V2 ••' \ } fi' J and (a a ... a. xx... x) represent the set of all integers z, such that 03 . z -= (a^... a. 00... 0) Q . 25 I N P U T u T P U T Figure 8. fi = (l8, 6, 3, D Switching for P = {(0,7), (0,9), (7,l6), (16,1-))} 26 N P U T u T P u T Figure 9- Switching with Conflict for P - {(12,15), (15, l6)} 27 Finally, (xx... x s s ... s , id ... d. xx. . . x) represents the set of all (s,d) such that S ' (S J + 1 S j + 2 ••• S k } fi J and w . d = ( dl d 2 ... d. 00... 0) Q . Now each output of each stage of the network can "be labeled with the representation of the set of (s,d) pairs which are accessible to that output. This is shown in Figure lOfor a specific network. Theorem 5 : Let H(Q) -t- P denote the fact that the network H(fi) constructed from 9, passes or connects P. Then Q + P -*--* H(fi) + P Proof ; Assume H(p») t P. Then there must be a conflict at the output of one of the crossbar switches. That is, two pairs (s ,d ), (s ,d )e P, s i Sp, must be trying to use the same output. Without n loss of generality, assume this output is labeled (xx... x a a ... a k , b.^ . . . b xx. . . x) . Then (s ,d ) and (s ,d ) must both be elements of the set represented by this output label. This implies s 1 = s 2 and ^ = d g . 28 ooo XXX, 001 I— 1 2 XXX, 010 i XXX .011 - 4 XXX, 020 XXX, 121 — 11 Figure 10. Network for ft = {12, 6, 2, 1} with Stage Outputs Labelled with (s, d) Classes 29 But this implies Q, f P. So we have H(n) t"p -* fi ~ p or ti + P + H(R) + P. By a similar argument it is easy to show that if ft t P, then H(Q) t P> a nd thus we have fl t P «--*■ H(fi) + P. □ Thus far in this chapter we have defined and explored a relation + between a special kind of set ft(R ) and an arbitrary set of pairs P which represents a set of connections between two sets of nodes. We have shown how to construct a switching network H(ft(R )) n from an ft(R ), and we have shown that if 0, i P, then E(Q) + P and conversely. Thus, given a network H(ft) and an n-set P we have some useful tools for determining whether or not H(fi) can produce the connections specified by P. 2 . 5 Minimal fl Networks Before leaving this chapter we will present three more results which will be needed in later chapters. Let R n = {p liP ..... p ±1 P i+1 ,..., P k > and R n = W'V-' p i X p i + l'--" p k } Thus, R has k elements while R' has k - 1 elements, n n Theorem 6: fi(R ) t P ■* fi(R') + P n n 30 Proof : Assume fi ? T P. Then 3 co e fi f and 3(s ,d ), and 3(s ,d )e P w such that s $ s and s P s and d = d . But then fit P, so we have n w fi(R ) + P -»■ n(R') t P D n n Theorem 7: The network constructed from fi(R ) has the same n number or fewer gates than the network constructed fromfi(R' ). n Proof- 2 A p x P crossbar switch has order of P gates . So the number of gates in H(fl) is just k 2 k ;(fi) = S (p. ) n/p . = n Z p . , i=l i=l while the number of gates in H(fi') is g(fl') = p n + p n +...+ p. n + p.p. n +...+ p n n. 1 2 l-l ii+l k Since for p . , p . . > 1, l l+l p p > p + p i M i+l - M i p i+l it follows that g(0) i g(n'). □ Notice the special case of Theorem 7 where p. = p. = 2. Then s(n) = g(n f ) since p. + p. +i - p . p . +r Theorem 8 : Minimal Network Let p 1 ,p 2 ,..., p^ be a prime factorization of n; that is, P 1 P 2 P 3*'* p k ~ n ' and p i P rime numbers. Then the network constructed 31 from fl = £u i :« 1 = p i+1 U . +1 , < i < k - 1, u = n} is minimal for n in terms of gates . Proof : This follows easily from Theorem 7. If Qi = {^.-u). = p. +i u> i+1 , o < i < k - 1, a, = n}, and for some i = I , p^ +l is not prime, i.e., p £+i = ab , then the fl generated from p^.pg,..., P^ , a,b ,p £+1> . . . , p k yields a network with equal or fewer gates than fl'. Continued application of this until each p i is prime, therefore, results in a minimal network. □ 2.6 Summary In this chapter we have developed some basic results relating the concept of an Q set to a class of switching networks. In particular, we showed how to construct a network from 0, and we proved sufficient conditions on P relative to fi such that 11 t P. In review Theorem 1: ft+P+fttP+c Corollary 2.1: gcd(a,n) =1, fi + P-»-n + Pxa Theorem 3 Theorem 5 Theorem 6 Theorem 7 Theorem 8 ft+P + ft+c-P fl + P •«"»■ H(fl) + P n + p -*■ n 1 + p g (n) < g(fi') If all p . e R are prime numbers , in then H(fl(R )) is a (gate) minimal network n 32 In the following chapters we shall examine n-sets which are needed in computer programs, and we will examine in more detail the construction of some networks. In particular, we will compare ft network with other classes of networks in terms of capability and cost 33 3- EFFECTIVENESS OF fi NETWORKS One of central problems in utilizing parallel computers is in alignment of data in the memory system. In this paper we are restricting ourselves to the primary memory problem. Here the problem is two-fold: 1. If x and y are two data elements which are required at the same time, then x and y should be stored in separate memory modules in order to avoid a "memory conflict." 2. If x is required by processor p, then x should be stored in a memory module which is "readily accessible" to p via the processor-memory connection network. For example, refer to Figure 11, which shows an N x N array stored "straight" in N memories, i.e., element a. . is stored in memory j . An attempt to fetch a column of this array would result in memory conflicts since the elements of any given column are stored in one memory. In addition, data is only acces- sible to processor p, < p < n-1 if the data is in memories p or p ±1 since each processor is only connected to the three closest memories. In order to solve the first problem, a number of special storage schemes have been proposed (Budnik [2], Kraska [3], Muraoka [8]). In general, these schemes place data in the memories in such a way that the most desirable N-vectors can be fetched without memory conflict. The purpose of this paper is to solve both of these problems simul- taneously. We begin with the following assumptions. We assume that there are N processors and M > N memories. The processors and memories are connected together by an n x n network. The data is an N X N array where the (i,j)-th element is logically in the i-th row, j-th column of the matrix space. ^ PROCESSORS MEMORIES N-l Figure 11. An Example of an N X N Array Stored Straight in a Parallel Memory-Processor System 35 3 .1 Representation of Array Storage Schemes and ^-Vectors Definition k : Memory Equation We define the function ju(i, j ) to yield the memory number where data with coordinates i, j is stored. D For example, if an array is stored "straight", then u(l,j) = j, whereas if it is stored with skew = 1, then ju(i, j) = i + j (mod M) . In general, we will consider only linear equations u(i, j) = ax + bj + c(mod M), where a is called the skew and b is the skip. Memory equations for some of the more popular storage schemes are given below: straight : j 1-skew: i + j VN-skew: 7n i + j 1-skew, 2-skip: i + 2j 2- skew, 1-skip: 2i + j TN+1-skew: (i/N+l)i + j Definition 5 : Vector Equation The N processors operate on N-vectors of data where the x-th element of the N-vector goes to the x-th processor. We define v(x) to be a function which yields the i, j coordinates of the x-th element of the N-vector. The vector equation for some of the common N-vectors are shown below: X All arithmetic is mod M. 1 arithmetic is mod N. N is assumed to be a perfect square where necessary. 36 1 . row : v(x) = ax + b^ay+b . M / ¥a We have x f y and x,y < £ < — ax ^ ay since a = gpf(a,M) M then gcd(o,M) = 1 PI -> ax + b ^ ay + b □ M Next we need to specify the connection between network ports and memories. Definition 7 : Memory Connection Equation We define l(m) to be the network port connected to memory m. □ In most cases l(m) = m. >+0 Thus, the network port at which the x-th element of an N-vector will appear is simply Ku(v(x))). Now we have said that the x-th element of the N-vector must go to processor x. In order to determine the output port to which this element must be sent we must specify how the processors are connected to the network ports. Definition 8 : Processor Connection Equation We define $ (x) to be the port(s) attached to processor x. □ In most cases <$> (x) = x. However, in case n = M = 2N, (x) is multivalued and may have one of the following two forms: (x) - 2x and 2x + 1 or $ (x) = x and x + N. We now have enough information to specify the network connections P = {(s,d)} required to send each element of the N-vector to its correct proces- sor. Definition 9 : Network Connection Equation We define C(x) to be the (s(x),d(x)) connection required to route v(x) from memory u(v(x)) through network input l(ju(v(x))) to output $ (x) and thence to processor x. Thus C(x) = (l(/x(v(x))), $(x)) . □ Now what have we got? Recall in the previous chapter (Theorem k) that if Pc{(f(x),g(x)):0 ome cases, where N must have an integer square root, we assume N is a power Of 'l . k2 In some cases the numbers given for memory cycles and net-work cycles represent (not necessarily least) upper bounds or lower bounds. This is indicated by the symbols < and > respectively. Table IV" is a summary of the results presented in Tables I-III. The numbers shown in Table IV are the maximum of memory cycles or network cycles. As we can see, of the configu- rations investigated, only Table III-D can handle all 8 of the most frequent types of N-vectors. Before discussing additional pros and cons of these con- figurations, we will investigate one additional W-vector which sometimes occurs Consider the f-vector consisting of the elements of small dxd sub- matrices along the diagonal (see Figure 12). We assume that the first row of the first submatrix goes in order to the first d processors, the next row of this first submatrix to the second d processors, and so on [11] . 2 In all there can be at most N-s-d full dxd submatrices in an I -vector 2 2 where i < N. Thus, there are a total of I = (N-s-d ) x d elements. We have 2 2 2 2 v(x) = ( (x mod d )+ d + d(x*d ) + i, (x mod d ) mod d + d(x*d ) + j), and for jLt(i, j) = (t/n+1)i + 2j (Table III-D) we get u(v(x)) - (/w+l)[(x mod d 2 * d + d(x-fd 2 ) + i] + 2 2 2[ (x mod d mod d + d(x*d ) + j] . This function has yet to yield to analysis. Instead, an exhaustive check' was performed using a computer and assuming the same configuration as in Table III-D for W = k^, where k = 2, 3, k, 5- The results are summarized in Table V. A second check was performed for J2 = {2N, N/2, ..., k, 1} and the results were identical to those of Table V. ^ N processors N memories N x N network $(x) = x I(m) = m u(i>J) - 3 (straight storage) C(x) = (ju(v(x)), x) v(x) 0) -p 8 m(v(x)) QJ >j 6 o ?H w O QJ |S r-l -P V QJ >i (:i, ;j+x) rows j+x 1 1 (i+x, j) columns a j N 1 fi+x, ;'mx) fwd. diag. j+x 1 1 Ci ^x, ,j-x) rev. diag. j-X 1 1 (i+(x+/I), j + (x mod /n) ) /iix /ti partitions b j + (x mod -/n) ■/n 1 (i,,i) N broadcast c j 1 1 (i,J + (xWN) x Vn) AT row broadcast d j + (xWn)x/n 1 1 M ! 1 :•■ : /N ) X /N, j ) /i7 column broadcast e /n 1 Table I-A kk N processors N memories N X N network $(x) = x l(m) = m ji(i, j) = i+j (skewed storage) C(x) = (n(v(x)), x) v(x) . 0) ■p o ^(v(x)) fn 0) a o S o ^H W O 0) -P O (i, j+x) rows i+j+x i 1 (i+x, j) columns i+j+x l 1 (i+x, j+x) fwd. diag. f i+ j +2x 2 N/2 (i+x, j-x) rev. diag. g i+j N 1 (i+(xWN), j + (x mod t/n)) ■/n x Vn partitions h i+j+xWN+x mod -/I t/N 7n (i,d) N broadcast i i+j 1 i (i,j + (xWSf) x t/n) VW row broadcast J i+j + (xWN)xi/N 1 l (i+(xWn) x 7n, j) vN column broadcast k i+j + (x-jVN)x-/N 1 l Table I-B ^ N processors N memories N x N network $(x) = x I(m) = m fi(i,j) = 3i+j (3 skew) Cfx) = (u(v(x)), x) n = /N v(x) •p O u(v(x)) B o X O <1> -P O (i, ;j+x) rows 3i+j+x 1 1 (i4x, ,j) columns f 3(i+x)+j 1 1 f i t x, ,i+x) fwd. diag. m i+x+3i+a ^ nA (i-ix, ,i-x) rev. diag. n 2x+3i+j 2 N/2 (i+lx-r/N), j + (x mod /n) ) ; x -/n partitions (i,.i) broadcast 3i+j 1 1 . i i (xWn) x /n) /ii row broadcast 3i+j+(x-fr|)xTi 1 1 . : /N) X t/N, j) /N column broadcast Table I-C k6 N processors N memories N X N network $ (x) = x l(m) = m /i(i >t l) = T]i+j (-/n skewing) C(x) = (ju(v(x)), x) v(x) 0) o m(v(x)) S o o cu CD >3 (i, j+x) rows rii+j+x 1 1 (i+x, j) columns T]X+Tli+j /n t/n (i+x, j+x) fwd. diag. P (T]+l)x+T]i+j 1 1 (i+x, j-x) rev. diag. q (t}-1)x+t)X+j 1 1 (l+(xWw), j + (x mod Tn) ) /I x -/i partitions r x+Tii+j 1 1 (i,j) N broadcast r)i+J 1 1 (i,j + (xWI) x t/n) VN row "broadcast s ■Tli+j+(x*T))XT) 1 1 (i-i(xWN) xi/l, j) Vff column "broadcast t Tii+j+N(x-fTi) 7n 1 Table I-D ^7 N processors N memories N x N network $(x) = x Km) = m //(i,«l) = rji+J (t/n+1 skewing) C(x) - U(v(x)), x) n = 7n+i v(x) •p o u(v(x)) ^ 0) QH S o X o cu -P O 0) >5 S3 O (j, ;j+x) rows Tri+j+x 1 1 (i+x, ,j) columns u T)(i+x)+J 1 1 (i+x, ,i+x) fwd. diag. V (T)+l)x+T)i+j w2 - (i+x, ,j-x) rev. diag. w (TT,-l)x+T 1 i+j t/N 7n (i+(x+/N), j+(x mod t/n)) t/n x /n partitions X T](x-^-/N)+T]i+Ti(x mod 7n) + >1 (i,.D II broadcast r\±+3 1 i (i, j + (x+i/N) x t/n) /N row broadcast y TT,i + j + ( x+/N) x7N( X+-/N) 1 l M ! (x+t/n) x t/n, ,i /?! .lumn broadcast z tp.+j+N(x+-/n)+t/n (x-m/n) 1 i Table I-E ii-8 Notes to Table I a) Since all elements of a column lie in the same memory, i.e., v(x) = (i+x, j), ju(v(x)) = 2> that memory must be cycled N times to produce all elements of the column. One might suspect that the networks would also have to be cycled N times. This is not necessarily so. Notice the resulting connection equation C(x) = (j, x) satisfies the condition gcd(a, N) = N since a = 0. Thus, the network can produce the connection C(x) = (j, x), < x < N - 1 by Theorem k of the last chapter. So how do we reconcile the fact that our theory says the connection is possible while intuition says it is not. In fact, there is no real reconciliation. If two different data elements are present at the same input port, then any network would have to be cycled at least twice to pass both elements. We shall simply state that the number shown under network cycles represents the number of cycles required, assuming no multiple data at any input port. Extra cycles required due to such multiple data are reflected in the memory cycles column. b) The equation C(x) = (j +(x mod Vn), x) is not of the correct form to be covered by Theorem k. Thus, we must show that C(x) satisfies k i— k/2 definition 3. We assume N = 2 where k is even. Thus, VN = 2 ' . Now we assume x mod t/n 4 7 na°d /n (so x ^ y), N N m and x = y. Then, (P6) x 4 7- m Now, if m < 7n, then obviously x mod 7n = x and y mod -/n = y , so m m x mod 7n ^ y mod 7n . m ^9 If m > /n, then we have (since x mod /n, y mod -/n < /n < m) x mod Vn ^ y mod 7N -> x mod 7n ^ y mod -/i? N 7n and finally P5 -» x mod Tn ^ mod -/n . m Thus, ft t P, where P = { (x mod /N, x) : o < x < N - 1} so by Theorem 1 (previous chapter) ft t P + j, where P + j = {(j +(x mod /n), x) : < x < N - 1} . □ c) See note (a) . d) Again we must show that C(x) = (j + (xWn) x /n, x) satisfies definition 3> i-e., for all m e ft m s(x) ^ s (y) and s ( x ) - s (y) implies x ^ y . N m First, consider the case m > -/n. Let r\ = VN. We have j + (x+T])r) ^ ,j +(y+T])n N I'l -» (x+t])t) ^ (y+T])ii N and j i(x+t])t] s j + (y4t|)r) m PI -» (x+ti)ti ■ 3 +(y+T])Ti. m Thus, P6 m -> (X+T])T] ^ (y+T])T) 50 P12 m/r] -> (x+tj) 4 (y+Tj) Pll m -» x ^ y . Now, for m < ri let rj = mcr = 7n . We have H -» ( x *t) f (y-Mi) and or Thus, ti(x-mi) = n(y +T i) m mcr(x-M-)) == mcr(y4-T]) m P4 -> cr(x-fTi) = ^(y-Mi) 1 P3 -» (x-s-ti) = (y*T)) • 1 p6 i Pll mcr -» x ^ y P9 m -» x ^ y . 51 Thus, fi t C(x). The vector v(x) can be fetched without conflict since £i(v(x)) satisfies definition 6. e) See notes (a) and (b) . f) Divide v(x) into two n/2 -vectors: v 1 (x) = (i+x, j+x), < < N/2 v 2 (x) = (i+x+N/2, j+x+N/2), < x < n/2 . Then v(x) = v 1 (x) U v (x), and jLi(v,(x)) and u(v (x)) both satisfy Theorem 9. See also note (n) . g) See note (a) . h) It is easy to see that there are -/n elements (the reverse diagonal) of the partition in memory ni(v/N-l)) = i+j+ t/N-1 . Thus, the -/n memory cycles. In order to show that the network cycles are < t/N, we divide v(x) into -/n /N-vectors: v (x) = (i+p, j+x mod 7n),0 < x < -/n, < p < -/n . /n-i Then v(x) = U v (x) p=0 P and by the argument in note (b), fi t C (x), where C (x) = (u(v (x)), x + p-/N), < x < 7n, < p < /n. i) See note (a) . 52 j) See note (d) . k) See note (d) . l) Since 3 is prime to any power of 2, C (x) satisfies Theorem 4 and *i(v(x)) satisfies Theorem 9* m) The required partition for the memory is: v (x) = (i+x, j+x), < x < N/4, < p < 4 This allows the N-vector to be fetched in 4 cycles. The partition required for the network is v (x)"=(i+p+Nx/Ij-, j+p+NxA), < x < 4, < p < nA and thus (since Nx=0), N C (x) = (4p+3i+j, p+NxA), < x <4, < p < nA- n) An argument similar to m applies here. o) An argument similar to m applies here. p) If t/n + 1 is prime to N, then C (x) satisfies Theorem 4 and ju(v(x)) satisfies Theorem 9« t/n + 1 is prime to 4, l6, 64, 256, 1024, and 4096. q) This is true for N = 4, l6, 64, 256, 1024, and 4096 . r) Note that -/n(xW"N) + x mod Vn = x. s) See note (d) . t) The required partition is v p (x) = (p+i, 3), < x < n < p < T] T) = Vn , giving us ju(v (x)) == r)(i+p) + j, which satisfies definition 6 since ir 53 v (x) = v (y). To prove Q, f C(x), we have T]i+j+N(x*t)) = iii+j+N(y^ri), i.e., the source is the same memory element for all data. See note (a). u) 7n + 1 is prime to N = k, l6, 6k, 256, 1024, and 4096. ■ Thus, this result holds for these N by Theorems k and 9« v) The cycles required here vary since r\+l = VN+2 is not prime to N = h f l6, 6k, 256, or 1024. The memory cycles vary from k for N = k to 2 for N = 1024, where in general memory cycles is (Theorem 9) — where a = gpf (a,N) . The required number of network cycles is unknown. w) Note T)-l = Vn . An argument similar to note m applies here. x) t}(xWn + x mod -/S) + T)(i+j) = -/n(xWn) + (x mod t/n) + (x+ t/n) + Vn(x mod t/n) + T\(l+j) = x + (x+VN) + /n(x mod 7n) + r)(i+j) • Since )u(v(Vn-1)) s ju(v(N-/n)) e rj(i+j) , we know that memory cycles N N are > 1. y) See note (d) . z) Since N(xWn) = , N jLi(v(x)) = 71 i+,j + (x+/N)>c/N . N See note (d) . 5+ N processors 2N memories 2NX2N network $(x) = x, < x < N l(m) = m, < m < 2N /i(i, j) = j (straight storage) C(x) = (n(v(x)), x), < x < N v(x), < x < N (U •p g m(v(x)) H (L) s H S o o O > (i, j+x) rows j+x 1 1 (i+x, j) columns a d N 1 (i+x, j+x) fwd. diag. j+x 1 1 (i+x, j-x) rev. diag. j-x 1 1 (i+(xWl), j+(x mod i/i)) -/N x Vn partitions b j + (x mod V^) ■/n 1 N broadcast 1 1 (i,j + (xWN) X Vn) VN row broadcast . j+(xWn)xt/n 1 1 (i+(xW5) x Vw, j) VN column broadcast c J t/n 1 Table II-A 55 N processors 2N memories 2NX2N network $(x) = x, < x < N I(m) = m, < m < 2N u(i,3) = 2i+j (2- skew, 1-skip) Cfx) = (m(v(x)), x), < x < N v(x), o < x < N -p s u(v(x)) M U w O «j) = j (straight storage) C(x) = (u(v(x)), 2x), < x < N v(x), < x < N -P s m(v(x)) S o ^H W O CD £ H -P c> CD >s (i, ,j+x) rows a d+x 1 1 (i+x, ,j) columns b j N 1 (i+x, j+x) fwd. diag. j+x 1 1 (i+x, j-x) rev. diag. j-x 1 1 (i+(xWN), j + (x mod t/n)) /N x t/n partitions c j+x mod t/n t/n 1 (i,j) N broadcast j 1 1 (i,j+(xWN) x Vn) /N row broadcast d j + (x /N) -/TT 1 1 (xWn) x i/N, j) vN column broadcast e j /N 1 Table III-A 58 N processors 2N memories 2NX2N network J>(x) = 2x, < x < N l(m) = m, < m < 2N ju(i,o) = 2i+j (2- skew, 1-skip) C(x) = (ju(v(x-)), 2x), < x < N v(x), < x < N -p s m(v(x)) 5h (L) S o M J-l CO O d) £ H -P c> 0) >> C o (i, ,i+x) rows 2i+j+x 1 1 (i+x, j) columns f 2i+2x+j 1 1 (i+x, t i+x) fwd. diag. 3x+2i+j 1 1 (i-tx, j-x) rev. diag. x+2i+j 1 1 (i+(xWN),. j + (x mod t/n) ) t/n X t/n partitions g 2i+j+2(xWN) + x mod t/n t/N _ (i,j) N broadcast 3 1 1 (i, J + (xWn) X t/n) vN row broadcast h 2i+J + (x4-/N) VN 1 1 (i-i (xWN) x t/n, j) ■fu column broadcast i 2i+j+2(xWN) t/n 1 1 Table III-B 59 N processors 2N memories 2Nx2N network $(x) = 2x, < x < N l(m) = m, < m < 2N iu(i,j) = i+2j (1-skew, 2- skip) C(x) = (m(v(x)), 2x), > ^ W CD 5rH •P O > (i, ,j+x) rows j i+2j+2x 1 1 (i+x, j) columns x+i+2j 1 1 (i+x, ,]4x) fwd. diag. 3x+2j+i 1 1 (i+x, j-x) rev. diag. x+2j+i 1 1 (i+(x+VN), j + (x mod /n) ) /ix /l partitions k xWN+i+2j+2(x mod /I) 7n 2 2 N broadcast i+2j 1 1 (i,j+(xWw) x t/n) vi? row broadcast I • i+2j+2(x4-/N) /N 1 1 (i i (xWS) x/n, j) i/Ti column broadcast m 2i+j + (xWN) t/n 1 1 Table III-C 6o N processors 2N memories 2Ik<2N network k) = 2x, < x < N Hm) = m, < m < 2N //(i, ,0 = T)i+2j (-/n+1 skew, 2 skip) C jc) = (n(v(x)), 2x), < x < N T) = nTn + 1 v(x), < x < N s u(v(x)) CD >j S o O QJ -p a CD >5 C o (i, j+x) rows n T]i+2j+2x 1 1 (i+x, j) columns o T]i+2j+T]X 1 1 fi+x, j+x) fwd. diag. P (Ti+2)x+T]i+2j 1 1 (i-tx, j-x) rev. diag. q (r 1 -2)x+T 1 i+2j 1 1 i+(x+VN), j + (x mod /ii) ) vN x -/n partitions r r)(x4-7N)+T]i+2j + 2(x mod -/N) 1 1 (i,,1) N broadcast T]i+2j 1 1 i,j+(xWw) x 7w) /N row broadcast s Y]i+2j+2i/N(xWN) 1 1 (H(xWn) x t/n, j) /w column broadcast t T]i+2j+T]7N(x4-7N) 1 1 Table III-D 61 Notes, Table III a) Note that c(x) is now (u(v(x)), 2x) . b) See note (a), Table I. c) See note (b), Table II. d) See note (d), Table I. e) See notes (a) and (b), Table I. f) Note that 2x + 2i + j satisfies Theorem 9, since x < N. Also, C (x) = (2x, 2x) satisfies Theorem k. g) See note (e), Table II. h) See note (d), Table I. i) See note (d), Table I. j) See note (f), Table III. k) Note that 2(x mod nTn) = 2x mod 2 >/n. Partition v(x), < x < N into v (x) = (i+(x4- nTn)+2p, j+(x mod/N)), < x < 2 J~N, < p < vTn/2 . Then iu(v (x)) satisfies Theorem 9 since x < 2 n'N < 2N, and n/n/2 v(x) = U v (x). p=0 p 62 and i— , VN/2 v(x) = U v (x) . p=0 P These same partitions satisfy Theorem k. l) The result follows from note (&), Table I, and P2. m) See note (d), Table I. n) Theorem 9 is satisfied since x < 2N/2. Theorem 4 is also satisfied (since $(x) = 2x) . o) VN + 1 is prime to 2N for N = k 9 16, 64, 256, 1024, or 4096. Thus, Theorems 4 and 9 are satisfied again. p) ti + 2 = 7n + 3. 7n + 3 is prime to 2N for N = k, 16, 64, 256, 1024, or 4096. q) t] - 2 = -i/n - 1. t/n - 1 is prime to 2N for N = 4, l6, 64, 256, 1024, or 4096. r) An exhaustive check of the function C(x) = ((VI+1) (xWn) + 2(x mod 7n), 2x), < x < N - 1 was made by computer for N = 4, 16, 64, 256, 1024, and 4096. In all cases, definitions 3 and 6 were satisfied. The required connections then follow by Theorem 1. s) We have T)i + 2j + 2t/n(xWn) = T]i + 2j . + 2-/I(yWM r ) m PI -^ 2/n(xWn) s 2yiT(yWN) m But tji + 2,j + 27n(xWn) ^ T)i + 2j + 2-/N(yWN) 2N PI -* 2t/n(xWn) ^ 27N(y-7N) 2N 63 These two results give us P6 m -* 2-/n(x-«-t/n) 4 2i/n(v.*t/n) . Since 27n(xs.-/n) - (2y/lf)+ (2N), we get Pll 2mN _> 2-/N x ^ 27n y P12 2m/I -» 2x ^ 2y P9 m -* 2x ^ 2y . Thus, definition 3 is satisfied. Definition 6 is satisfied since it is satisfied for rows (see note (d), Table i) . t) We have T|i + 2j + tiTn(xWn) = r\± + 2j + Ty/N(y-s7N) m PI _» t)/n(xWn) s ^(yWN), where t]=/n +1 m Again, t] is prime to 2N for 2N=2 , so P3 _, Vn(xWn) = VN(y+n/N) . m But t>l + 2j + t]/n(xWn) ^ T|i + 2j + Ty/N(yWN) 2N PI, 3 -» /N(xWN) ^ /N(y-sVN) 2N 6k So P6 m Since Tn(x-jV^) = (x^O+N, we get m (xM) + N ^ (yV^)* N Pll Mn _>- 7n x 4 -& y P9 Nm/2 -» Vn x ^ 71 y P12 2Mn/2 _> 271 x ^ 271 y Nm 27N x ^ 271 y P12 Vk/fS -> 2x ^ 2y P9 m -> 2x ^ 2y . Thus, definition 3 is satisfied since m s(x) ^ s(y) A s(x) s s(y) -»d(x) ^ d(y) . 2N m Definition 6 is satisfied by an argument similar to note (s), Table III, 65 That is, since there are no memory conflicts for columns, and since the Vn column broadcast involves fetching only -/n elements out of N in a column, then there cannot be any memory conflicts in a 7n column broadcast. That there are no memory conflicts for a column fetch follows from the fact that 7w + 1 is prime to 2N = 2 . Thus, Theorem 9 is satisfied since the greatest factor of 7n + 1 prime to 2N is -/n + 1, gpf(-/N+l,N) = /N + 1 and so it is only required that x be less than M = 2N, x < I < M = 2N . Thus x^y and ji(v(x)) = jLi(v(y)) M or (/N+l)(x+i) + 2j ■ (7N+l)(y+i) + 2j M PI -> (/N+l)(x+i) =(/N+l)(y+i) M P3 -» x+i = y+i M _» (x+i,o) = (y+i,j) since x,y o g CD •H X O rH o O IS ? «5 CD o 67 d d d "d" d d N N are 12. Storage of dx d Submatrices Along the Diagonal 68 c- saxoiCo 3[ jcoMq.au i H OJ A| OJ A| sbjoAo Aj.ovsdm ■ H H OJ A| VO sajo^o 3fjcoMq.au i H OJ A| OJ Al saiojfo JLzomBm i H OJ A| H UA sa"[Oi?o s[ jcoM-q.au i OJ OJ A| OJ Al sajo^o jCjcoraara ! rH H H -4- ■ saxc/fo 3j;jcoM.q.au H OJ OJ A| OJ A| saxo^o ^jcoutara H 04 OJ A| OJ Al KA saxojCo ^JcoAq.au H OJ OJ Al OJ Al saxo. CD H ,£> CO EH 69 3.5 Additional Considerations, Cost Tables I-V give us a pretty good idea of the performance of the ft network on certain highly frequent N-vectors. We will not discuss these results any further in this chapter. (in a later chapter we will compare these results with similar results from other types of networks) . We will, however, discuss several questions raised by these results. First, it seems apparent that using 2N memories yields fewer memory conflicts than a system with N memories. In fact, purely combinatorial arguments tell us that of a total of ( ) = — 4 — — — possible distinct N- (ir-N) in: vectors in an N x N array, the N memory systen can deliver NUT without conflict, while the 2N memory system can deliver -^ — without conflict, NI2 But we must ask: What is the difference in cost between an N memory system and a 2N memory system? Let us assume that the total storage capacity of both systems is equal, e.g., each memory in the N memory system can store K words while each memory in the 2N memory system can store k/2 words. There is some evidence which suggests that at least the cost of the basic memory components would be the same. For example, Intel supplies a solid state memory board which can be configured either as kK X l8 bit words or 8K x 9 bit words. However, the total cost of a memory system is primarily determined by the cost of power supplies, packaging, inter- connections, etc., and thus further analysis of this * p The assumption is made that all N~ elements of the N x N array are evenly distributed among all N or 2N memories. TO problem is beyond the scope of this paper. We will simply assert that it is not unlikely that the cost of a 2N memory system would not be prohibitively greater than the cost of an N memory system when the total cost of the computer system is considered. Another question which arises is the relative cost of a 2N X 2N ft network versus an N X N ft network. As we shall see in a later chapter, the number of gates in an N X N ft network (including control circuitry) is on the order of 2|(d + log 2 (log 2 N)) log 2 W , where d is the number of bits per data word. The gate ratio of a 2N X 2N ft network to an N x N network is approximately 2. (This should be compared with a ratio of k for a crossbar switch) . Finally, it might be argued that each memory in a 2N memory system only needs to be half as fast as each memory in an N memory system in order to provide the same effective memory bandwidth. This argument is valid only if successive N-vectors do not interfere with each other. 'While an analysis of successive N-vector interference is beyond the scope of this work, it seems obvious that this interference could be significant and so the argument justifying slower memories would not hold. 3.6 Summary In this chapter we have explored several memory systems in conjunction with various memory equations. We have shown that in almost all cases the ft network performs at least as well as the memories themselves. That is, in almost all cases, if the N-vector can be accessed without memory conflict, then the ft network can establish the necessary memory-processor connection. 71 In the next chapter we will discuss actual implementation of Q. networks. This will be followed in later chapters by comparisons with various other switching networks which have been proposed in the literature. 72 k. CONSTRUCTION OF SEVERAL ft NETWORKS In this chapter we will present several implementations of ft net- works. Of course, many implementations are possible but we will present only a few which represent various extremes of design. The first design is probably the simplest design possible. It is somewhat slow, due to the fact that it operates in bit serial mode. It might be useful in applications requiring the switching of bit serial data, such as switching data between tracks of a rotating memory. The second design is a more general network which might be used for the processor-memory connections discussed earlier. Finally, we will discuss a particular interconnection of processors which is surprisingly related to ft networks . k.l A Bit Serial ft Network This ft network, shown in Figure 13, will be constructed from elements shown in Figure lh . (Notice that the two center NAND gates form a bistable device). The network operates as follows. The network is first reset by using the reset lines which are common to all elements. Associated with each input is a log n bit number which represents the number of the output port to which that input port is to be connected (the destination tag) . This number is in- serted bit serially into each input port, most significant bit first. Each element then transmits these bits A = C, B = D. As the i-th most significant bit is fed into the network, the strobe signal is turned on momentarily in the i-th stage (from the left). This strobe signal allows the bistable pair of NAND gates to be switched, and if A = 1, then the element will begin transmitting A = D, B = C; otherwise, it will continue to • 73 Figure 13 . An fi Network 7^ * ♦ A B Strobe Reset L_L B — D FT Strobe Reset Figure Ik . One Element of a Bit Serial ft Network Constructed from NAND Gates 75 transmit A - 0, B = D. Now assume there are no conflicts as defined in Chapter 2. Then it can be shown (refer to Figure 10 in Chapter 2) that both streams entering an element of the i-th stage (from the left) must he equal in the first i-1 most significant bits and must be unequal in the i-th most significant bit position. Thus, the i-th stage is strobed when the i-th most significant bits are present at the inputs. (if there are no conflicts, then these bits are unequal). If A = at this time, then the element remains in state A = C, B = D. Otherwise, A = 1 and the element enters state A = D, B = C. The strobe is now turned off which effectively locks the element in this state until the reset signals are used. Thus, the network is being switched according to Algorithm 2 of Chapter 2 as long as there are no conflicts. Now there are two problems. First, if a conflict does arise, the network will produce an erroneous connection. (The destination tags could be examined as they emerge from the outputs to determine if the correct connection was established) . Second, it is impossible to set up any one-to-many connec- tions . (Recall these are allowed in the generalized Q network presented in Chapter 2). Nevertheless, this network may prove useful in some applications. Notice in Figure Ik that each element requires 9 NAND gates. Thus, the total number of gates required for the network is 9 X p l°gp n or ■( ~ n log n. Further examination of Figure Ik reveals that 3 gate delays are needed for switching and 2 for transmission through each element. This, to- gether with the bit serial nature of transmission, reveals that lo G p n *- p L • i 2i = 21og n +(log n) gate delays are required to properly switch i I 2 2 the network. 7 6 This could be reduced to 51og_n by the addition of appropriate latching registers and clock signals. We will now turn to a more complicated ft network. k.2 A Better Network In the previous network when a conflict arises, one of the inputs is switched in the "wrong" direction. This wrongly switched input can, in a later stage, cause other inputs to be wrongly switched. The network which we will now discuss prevents this by associating a validity signal with each in- put. As soon as an input is switched in the wrong direction, its validity signal is turned off and thus the wrongly switched input is prevented from influencing later switching decisions. Additionally, this network will be capable of producing one to many connections (broadcasts) as discussed in Chapter 2. This is accomplished by using source tags rather than destination tags. A source tag is a number as- sociated with each output port which represents the input to which that output port is connected. This network is divided into one "control plane" and one or more "data planes" as shown in Figure 15. Signals generated by the control plane are used to control the switching of the data planes. Each plane is similar (at least topo logic ally) to Figure 13- Notice that the control signals will flow from right to left while the data will be transmitted from left to right. We begin by designing the "control plane" for this ft network. The control plane generates signals which are fed to the switches in each data plane of the ft network. Refer to Figure l6. On the right edge are n shift registers of log g n bits each. Each of these shift registers contain the number of the input port to which this output is to be connected. The least 77 Data Planes^ Control Plane Figure 15. The Control and Data Planes of an fi Network 78 f r r r s 1 SOURCE « TAG SHIFT | REGISTERS ^\// f r 2 r 3 1 * 4 * 5 I | | 6 7 STf *OBE 3 STI ROBE 2 STF OBE 1 Figure l6. Control Plane for an 8 x 8 n Network 79 significant bit of each of these registers is connected to the inputs of the control plane as shown in Figure 16. Generally, each stage of the control plane will he "set up" during one major clock. The entire plane is thus set in log n major clocks. During the i-th major clock, the i-th stage (from the right) is strobed allowing the signal from the current least significant bit to set the switching and memory of each block in the stage. Then the source tag registers are shifted allowing the next least significant bit to be switched through the i-th stage to the (i+l)-th stage where this process is repeated. The following signals are used in each element of the control plane (refer to Figure 17) : Strobe: enables setting of the flip flops A: upper output tag B: lower output tag V»: upper output validity V_: lower output validity B C : upper input tag D: lower input tag V p : upper input validity V : lower input validity F n ' upper switching signal for data switch F : lower switching signal for data switch Reset: resets both flip flops A special situation arises if an attempt is made to switch both inputs to the same output in this control plane. This may arise if a broad- cast connection is being set up (i.e., two network outputs requesting con- nection to the same network input) or in the case of a genuine conflict. When 8o Reset ♦ ♦ Strobe Reset Strobe Figure 17. One Control Plane Switching Elenent 8i this happens, the signals which will control the data planes are set to connect the broadcast connection but the switch of the control plane is set to transmit only one of the tag signals correctly. The other tag signal is transmitted incorrectly but its associated validity signal is set to zero. (Note that the validity signals may also be used to indicate that a given network output port requires no connection) . Thus, each source tag travels bit by bit from right to left through the control plane. Each bit causes a given stage to switch the next bit through to the next stage . The logic diagram of one element of the control plane is shown in Figure l8. The two flip flops are set or reset by their respective input (C or D) provided they are enabled by the strobe signal. (These flip flops are the same as the three gate devices shown in Figure Ik) . The outputs of these flip flops control subsequent transmission of this element in addition to the switching of corresponding elements of each data plane (see Figure 20). Associated with each tag input signal (C or D) are two validity signals (y or V ) . These validity signals indicate whether or not the corresponding tag signal is valid. (A tag signal is invalid if it was incorrectly switched in a previous stage or if the corresponding network output port is not re- questing a connection) . The logic equations are determined as follows. First, the flip flops are strobed and set or reset as C and D are set or reset. The various trans- mission states of this element are shown in Figure 19- Numbers on the right of each box correspond, top to bottom to C, V , D, and V , respectively. Lines in each box represent connections and the numbers on the right indicate trans- mitted validity signals (V., V_.) . Don't care conditions were chosen to A B 82 Reset Strobe Reset Strobe Figure l8. Control Plane Switching Element Circuitry 83 i ° n V° V° Ao Ao 11 : x: 10 7: v\ '7: ' 10 1 1 1 1 'Xi i i °\: i i °x i m ;ure 19 . Transmission States of One Control Element Qk minimize gate counts. As mentioned earlier, the signals from one of these control planes will be used to control the switching of one or more data planes. Each data plane consists of log p n stages of n/2 elements. One such element is shown in Figure 20. Notice that the switching of this element is controlled "by the signals F , F , which are generated by the corresponding element of the control plane (see Figure 3) • Figure 21 shows the states of this data switching element. The numbers above each box represent values of F and F , respec- o u tively. In summary then, each output requests connection to a particular input port by placing the number of that port in the source tag register. The control plane is then clocked log n times after which the data planes are set to provide some or all of the specified connections between inputs and outputs. If the requested connection does not create conflicts in the sense defined in Chapter 2, then the resulting connection will be the requested connection. If there is a conflict, then some output (s) will not be connected to their requested input. For example, refer to Figure 16. Assume output requests connection to input and output 2 to input k. Then the first element of the middle stage of Figure 16 will be placed in state f (Figure 19) . At this point the request from output 2 will be effectively cancelled, since its validity signal is turned off and in fact output 2 will be connected via the data planes to input instead of k. In many cases it would be desirable to detect this condition. This can be done by including a log n bit source tag along with each element of data at the inputs. After the data is sent through the network each output port checks the source tag received against the source tag requested. Lack Upper Lower 85 F c Signal From Control Plane Upper Data Output Lower F D Signal From Control Plane Figure 20. One Element of a Data Plane Using NAND Gates 86 a i I I o Figure 21. States of One Data Switching Element 8 T of equality indicates a conflict occurred, and the process must be repeated to process the unsatisfied requests. We have discussed in some detail the construction of the control and data planes of an Q. based network which allows broadcast patterns and is capable of detecting conflicts. If we assume that a flip flop requires 3 2 l0g 2 I ^ates, then a control plane requires yr log n elements of 22 gates (see Figure l6) for a total of lln log n gates. Each data plane has — log n elements of 8 gates for a total of 4n log n gates. If there are d data planes, then the entire network requires (4d+ll)n log n gates. Examination of Figure 18 reveals that transmission through a stage requires 3 gate delays and switching of a stage required 3 gate delays. Thus, log n 5 2 it requires E 3(i-l) + 3 = — (log„n + (log„n) ) gate delays to switch the i=l entire network. This could be changed to 6 log n by the addition of appropriate latches between stages. k .3 Processor-Processor Connections As readers may have noticed, the topology of the P. network is equivalent to that of the last log n stages of both the bitonic sorting net- works discussed by Batcher [ k ] and the binary switching networks discussed by Benes [ 5 ]• This in itself is somewhat surprising. It is even more sur- prising to discover that the interconnections between stages turn out to be the "perfect shuffle" connection discussed by Pease[ 9 1 and Stone [12]. We Ml not present a proof of this result but it can be easily seen by examining i ">< , which is the same binary Q. network we have been discussing all along. While it appeared earlier (Figure 13) that each pair of stages was 88 f\ 1 1 1 — — l u 1 V J 2 —v /\ 3 3 2 — 2 3-^ V / — 3 4 A 2 2 3 — 4 5 -/ \ > — 5 6 -^ \ 7 4 4 4 — 6 — 7 ^ Figure 22. An fi Network with Identical Interstage Connections Revealed 8 9 interconnected differently, (it was expedient to draw them this way to facili- tate understanding of how they worked), in fact, a reordering of elements in each stage allows them to be drawn as shown in Figure 22. Thus, it becomes clear that the stage interconnections are identical for each stage. (The numbers on each element indicate their original order as shown in Figure 13) . Since each stage is identical, it appears that it might be possible to save more gates and build just one stage of an 0, network, add some registers, and simply recycle this stage log n times. We will now examine some particularly effective ways of doing this. Assume that we have N processors interconnected by the perfect shuffle, as shown in Figure 25« This interconnection was proposed by Stone [12]. The new results, effectively proved in Chapter 2, is that shifts and other patterns required by our solution to the memory conflict problem can be performed in exactly 1 + log N cycles. This is accomplished as follows. Each processor will have as many as two data words which need to be transmitted to another processor (possibly itself) . Associated with each data word will be a log N bit destination tag and a one bit validity indicator. Assume these interconnections are wide enough to allow parallel transmission of d bits of data, log_N bits of tag and one extra bit. During each cycle the contents of the output registers (see Figure 2*0 which contain data, tag and validity information are sent via the shuffle connection to the input registers of other processors. Then, depending on the tags and validity bits, the two input registers in each processor may exchange or not as they are gated to the output registers, (it should be noted that one to many connections are not possible using this scheme, but another scheme similar to section 2 of this 90 w ftg *■ „ Figure 23- Four Processors Connected by the Perfect Shuffle 91 Output Registers Input Registers Figure 2^. Internal Registers Required for the Processor Interconnection Scheme 92 chapter is possible which does allow broadcast type connections) . From Chapter 2, we know that any permutation of n numbers which satisfies definition 3 of that chapter can be produced in log n cycles. Since n = 2N, this becomes 1 + log N cycles. In addition, we know that any permuta- tion can be done in log n(log n+l)/2 cycles. This latter result follows from Batcher's work [ k ], and the facts that the stage interconnections in the Batcher network are perfect shuffles and that the only data dependent informa- tion required by a Batcher element can be obtained from the inputs to that element. It is not known at this time whether or not log n(log n+l)/2 is a least upper bound, given the above constraints. One additional result which follows from Batcher is that we can sort 2N numbers using this same hardware. k.k Summary It is difficult to summarize the results of this chapter in terms of actual numbers. In one case we took existing processors, possibly modified them, and formed an ft network. The primary cost here would not be determined by gates (which presumably are already available in the processors) but by wires, drivers, connectors, etc. In two other cases we actually designed separate ft networks. We can say that any permutation which satisfies defi- nition 3 of Chapter 2 can be produced in time on the order of log n cycles and with an order of n gates/stage. Thus, we have a number of options available : I. Processor-processor connections. A. Advantages: possibly cheaper due to the use of already existing gates and can be used to implement a bitonic 93 sorter with almost no extra logic. B. Disadvantage: during each cycle, some input must travel a distance of n/2 processors. If the processors are large or there are a lot of them, the wire lengths involved may be a problem. II. Separate networks. A. Multistage. 1. Advantages: size and thus wire lengths are smaller. 2. Disadvantage: cost in terms of gates. B. Single stage. 1. Advantages: size and wire lengths are smaller than processor-processor connections; uses fewer gates than the multistage networks. 2. Disadvantage: probably slower than multistage due to extra clocking requirements and i/o registers. C. Multistage pipelined. 1. Advantages: size and wire length are smaller than processor-processor connections, higher bandwidth. 2. Disadvantages: cost in terms of gates for the necessary interstage registers and in terms of extra transmission time due to these registers and extra clocking. It should also be pointed out that it is not necessary to build these networks using binary elements. Larger elements, consistent with the particular implementation technology, can be used with no loss in capability (Theorem 6, Chapter 2) and yield a possibly faster network. In the next chapter we will review some other networks which might be adaptable to memory-processor connection networks and we will compare these 9h networks with ft networks in terms of cost, speed and effectiveness. 95 5. CONSTRUCTION AND PROPERTIES OF OTHER NETWORKS In this chapter we will examine five networks which have been pro- posed as data alignment networks. 5.1 Uniform Shift Networks A -uniform shift network has the capability of "shifting" all inputs ±1 or ± S (mod n) positions through any stage or cycle of the network. How- ever, during any cycle it must shift all inputs by the same distance. It is easy to show that the maximum number of cycles required to perform an arbitrary In S uniform shift is Loi-^lJ + LoJ-1* This upper bound is minimized if S = Vn (assuming n is a square). This type of network generally performs well for uniform shifts, the time being bounded by 7n-l cycles. But if non-uniform shifts are required, then we run into trouble. In the case of ILLIAC IV, it was generally true that significant changes had to be made either to the storage mapping function or to the algorithm itself in order to force the alignment patterns to be uniform shifts. If this could not be done, then a software routine had to be used which resulted in n shifts of +1. Any problem which relied on this routine usually performed so badly that it was no longer considered for use on ILLIAC IV. Let us examine our alignment requirements in terms of uniform shifts. Recall (Chapter 3) that we characterize our' alignment requirements by a function C(x) = (ju(v(x)), *(x)) , where we considered the cases (x) is independent of x, then it is a -uniform shift; otherwise it is not. Table VI summarizes the results of Tables I-III of Chapter 3, in terms of uniform or non-uniform shifts. An x represents a non-uniform shift while represents a uniform shift. As we can see, in no case are more than two of the n-vectors listed accessible by uniform shifts. (We have not even con- sidered broadcast patterns since this kind of network cannot produce them) . There are several ways of building such a network. One way is to use a single stage as shown in Figure 25. (Figure 25 actually shows two stages in order to more easily show the connections. One may mentally fold the second stage into the first). Figure 26 shows one of the n elements. Clearly, the interstage wiring for this type of network is more complex than that of the 9. network. This network requires 5n gates/stage (see Figure 20 of Chapter h) . Finally, we may consider connecting the N processors together with a ±1, ±S connection and using the processors as a single, recycleable stage. Each processor requires h input and h output lines, while using fi connections required only two input and two output lines. Further, the fi connections allowed us to align 2N data words while the ± 1 , ± S connection only allows us to shift N words. let us propose that we add ±1 connections to the Q connections. Then, we have a system whose cost is approximately the same as the ±1, ± /n system, but which can do any uniform shift in < log N cycles as opposed to < t/n cycles for the ±1, ±71 system. Additionally, the ±1, fl system can produce permutations which are not uniform shifts. In spite of the apparently overwhelming evidence against the uniform 97 o H H O X X X X S OJ X H H C\J »? OJ O 1 H O X X X X s H OJ X H OJ OJ H PQ I H X o X X X s OJ H M H CVI OJ H < I H X X X X X S o H X H CJ OJ H pq 1 o X X o X s OJ H x H CVJ H H X pq i o o X X X }?*; rH rH X M < 1 o X o X X ft o H X H co CO co d rH H O H CO CO •H d 13 ft w H ta CD to I? CD •H CO 3 CO h Fh Eh w p •j Q) x ' : ft ^— ^ [5 rH fi R CD •H X B o o o I? a) 5 CO co o CO d CD -P -P CO Cm ■s CD •H rH < -P -> OUTPUT -S •» OUTPUT -1 -> OUTPUT +1 -> OUTPUT +S re 26. One Element of a (± 1, ± S) Shift Network Using NAND Gates 100 shift networks, we will examine one more such network, the barrel shifter. 5.2 The Barrel Shifter The barrel shifter is capable of doing any -uniform shift. It is conceptually a simple device. It consists of log n non-identical stages. The i-th stage is capable of switching all inputs either or 2 positions. This is shown in Figure 27 for a four port network. Thus, the barrel shifter requires log n stage delays for any uni- form shift. It requires 3n gates/stage. This is to be compared with the o, network which requires Un gates/stage and logpn stage delays. Unfortunately, the stage interconnections of the barrel shifter are not uniform and thus we cannot build just a single stage and recycle it log n times. The only advantage of the barrel shifter is that the upper bound on the time required to perform any uniform shift is less than the same bound for the ±1, ±S shifter. But the latter network has a smaller lower bound. The barrel shifter is clearly inferior for our purposes to the Q network. 5-3 The Batcher Network The Batcher network [ k ] was first proposed as a sorting network (see Figure 29)- However, it should be clear that any network capable of sorting n numbers is also capable of switching n numbers. We will not go into the details of this network. It is sufficient to say that the networks consist of log n(l+log n)/2 stages. Each stage is (or can be) interconnected by the perfect shuffle and each of the n/2 elements of each stage is capable of ordering its two inputs into descending or ascending sequence. Each element can be built with 13 NOR gates for a total of 13n/2 gates/stage (see [h]) . This is for the "control" plane. Data planes can be identical to 101 Figure 27. A Four Port Barrel Shifter 102 CONTROL LINES INPUT -2 1 INPUT O o OUTPUT OUTPUT 2' Figure 28. One Barrel Switch Element 103 1 2 3 4 5 6 7 Figun 29. An 8 x 8 Batcher Network. Crosshatched Elements Sort in Descending Order 104 those of the ft network except there are more stages. We could build only one stage and recycle it «(log n) times in order to save gates. However, one additional problem arises. Each element will require some extra logic since during some cycles it must produce a descending order while during others it must produce an ascending order. We will ignore this problem. Finally, we could use the processors themselves as Batcher elements. Now we have a system which is practically identical to that of section 3, Chapter k. Batcher's results tell us that using this system we can do any permutation in an order of (log n) cycles. The results of Chapter 2 tell us that if the permutation satisfies definition 3 of Chapter 2 (which includes uniform shifts), then it can be done in log n cycles. Other permutations may require fewer or greater than log n cycles. 5-4 The Benes Network While Benes did not invent this network, his extensive discussion of this network prompted us to name it after him [ 5 1 ■ It was originally in- tended for telephone traffic switching. The data planes of a Benes network consist of (21og n)-l stages of n/2 elements interconnected by the perfect shuffle and with the exception of the number of stages they are identical to the data planes of the ft and Batcher networks (see Figure 30). Benes shows that the typology of such a network is sufficient to produce any permutation of inputs. Unfortunately, the determination of the switching of each element in the network is complex. The best known result [13] requires time on the order of n Log n Actually, Benes considered these networks in a more general form. We consider here only binary networks. 105 Figure 30. A Binary 8x8 Benes Network io6 operations. Thus, it would appear that this network would not be practical in situations where the control signals have to he determined in real time. However, in some instances it might be possible to compute these signals ahead of time, say in a compiler, and in these cases the Benes network might be useful. 5.5 The Crossbar Switch A conceptual diagram of a crossbar switch is shown in Figure 31. This could be implemented as shown in Figure 32 using fan- in logic and limiting fan-in and fan-out to 2. Control signals for this network are derived directly from the source tags. This circuit can produce any set of one-to- 2 one or one-to-many connections. It requires an order of n gates and log n gate delays for transmission. 5.6 Network Comparison At this point we have discussed these networks sufficiently to allow us to compare them as much as possible with ft networks. We will begin by assuming that we will build a complete network, i.e., we will not build just a single stage which can be recycled. We will follow this by a comparison based on single stage design. In this first case we will not consider the ±1, + S uniform shifter since it is clearly nonsensical to build such a network. For each of the ft, barrel shift, Batcher and Benes networks we will compare control gates, data gates, switching time, data transmission time, and capability. Table VII shows the relative values of these parameters. The expressions listed represent only relative magnitudes and do not represent actual values. - Now let us assume we interconnect the N processors in by ±1, ± /n connections or by the perfect shuffle connections. Table VIII summarizes these results. The only difference between the ft, Batcher, and Benes "networks" is 107 (I 2 3 INPUTS < N ill f 1^^ 1< ihh^ ^ 12 3 • • • N OUTPUTS Figure 51. A Conceptual Diagram of an N x N Crossbar Switch 108 Figure 32. A k X k Crossbar Using AWD-OR Components and Controlled by Source Indexing 109 ™T^^ (D ^ 2 H bO O 2 sd •H -H £~ ra W OJ cd ■H •H -P • -P co ,d ■H H •H W o fltH^ sd sd sd ,Q w O O w o o o CO ■p •H -P ■H •H •H ft CH ■PfO'H -P -p -P CO •H CO -H CO CO co o & -P d .£ -P -p -P CO g O w g g g S fi -P £ s H M In CD -H £ CD (D CD O ft sd o ft ft ft Ch •H ? £> £? $H C CD id S sd sd ^ CO ^d 2 CO CO CO a o 0J ■H CO Sd Sd 'cT a sd CO •H bO bD bO bO bD g O O O o o CO H rH H H rH (B CJ O — OJ -Ptlig CO fn >H Q H Eh OJ H O C 'H' sd -p (1) H bO '.:) bD H o O O H r-\ H O En id OJ Sd d Id sd co CO 0) bD bO bD bD -p -p O O o O W ,H CO CO H H H r-l sd « C3 «« — ^ Sd id d sd OJ OJ ,0 rH o Sd $d "id CO bD bO M i U w *. * O O o i -P 0) o <-\ r-\ H i d -p ^-^ o ro a d sd o a A fH CD -p fH •H .£ w 5h Sh CO H CD ,0 0) A co co fn o QJ w fn -p sd O CO § cd fn Ml a ■ o CD -P •H sd •H CH H CD S* w CO I •p •H Jh O bD H CO bD sd ■s -p •H -p co CO ■p Sh .2 tS CD 5h •H & CD fH CD Sh CO CO CD •H U I fn ^> •H bD •H H bD CD I- 1 u •, 1 d J3 •i i rd sd o ed bD E o H CD 1 ' 1 1 ' CO -p (D CD Ph O sd o co •H fH CO & 8 H H > CD H ■§ EH 110 CD "1 £ •H ■ Eh M a) 1 Pi H i •H O o H Pi i— i ,£ >3 O O -P •H Jh £ CD GQ f^ !» Ti fH pi cc O fn fn •H -P CVJ OJ r& PS -H iO OJ lb o *U Ph -d CD ?H H •H i P! &g *? 0J bO CD O P! PI O K <+H V| OJ OJ H •H bO bO w Pi o O O V| J) D -p H H H fH V| OJ o o fn -H >3 O rd H V| O fH 02 H fH -P o p5 co Pi w a -P fH 0) pi -H O O CO o -=h OJ OJ _4- \ Pm "h -P Ph pi CD Ph Pi fH Pi -H CD H hlft G | fH G CD ? 1 o CO +1 CD CO pq »\ O CD •N H -P CO P! CD H +1 PQ m +1 CO CD .a o m a o •H -p o o o fH CD -P P! H fH O w CO CD O O fH Ph Ch O p! o to •H fH CO t O o > CD H Eh Ill the decision process required in each processor during each cycle. In the case of the P. network, this requires examination of one bit each of two log N hit tags in each processor during each cycle- For the Batcher network this requires arithmetic comparison of two log„N bit tags in each processor during each cycle. Since these two are so similar, we have combined them in Table VIII. The Benes network using Opferman and Tsao-Wu's algorithm requires a control time on the order of n log n but cannot be done based on examination of the two data elements in each processor at each cycle. The switching time shown in Table VIII simply reflects the total time divided be the number of cycles. It should be pointed out that the ± 1, ± /n connection can shift N inputs and during each cycle only one of the four I/O line pairs is used. The shuffle connections allow permutation of n = 2N numbers and during each cycle both i/o line pairs are used. We have shown in Chapter 3 that using 2N memories resulted in the ability to fetch significantly more N-vectors. Thus, the ability to handle any N out of 2N inputs may be significant. 112 6. conclusion : The problem of data alignment for a large parallel computer has been investigated before, but previous results generally relate only to inter- processor connections or switching networks; memory assignment; or (rarely) control algorithms for switching networks. For example, Benes [5 ] in- vestigated the properties of some 21og n stage binary switching networks which might provide a solution except that no control algorithm is known which is fast enough to allow the fast word-after-word switching required in a highly parallel computer. Kuck and Budnik [ 2 ] proposed a solution to the memory assignment problem which allows a wide variety of vectors to be fetched from a parallel memory system without conflicts, but the alignment and indexing problems indicate that this solution is probably not practical. We have attempted to consider all these problems simultaneously in order to arrive at a realistic solution. In this paper we have presented such a solution. In order to prove the merits of this solution, we began in Chapter 2 by deriving some theoretical properties of a particular type of network which we called an ft network. We showed, for example, a necessary and sufficient condition (definition 3) for a given permutation to be producible by a given ft network. Then, in Chapter 3, we derived some important N-vectors, several storage assignment schemes, and considered several memory systems. We showed in each case whether or not certain N-vectors could be fetched in parallel without conflict and aligned by a binary ft network. We found that by using twice as many memories as processors, a (-/N+l)-skew, 2 skip memory assignment, and a binary ft network we could fetch and align at least the four most im- portant N-vectors (rows, columns, diagonals and t/n x t/n positions) without conflicts. Finally, in Chapter 5, we discussed several other types of networks . 113 and their relationships to the ft network. Here we showed that in general the uniform shift networks are not flexible enough to "be useful in general parallel computing applications and, in fact, need not be considered since ft networks are more flexible and either cost less or at least do not cost significantly more than uniform shift networks. In addition, comparison with more general networks (Benes [5] and Batcher [!)-]) which are capable of producing any arbitrary permutation showed the ft network to be cheaper and faster. We also discussed the possibility of interconnecting the processors together in such a way as to form one stage of an ft network. By recycling this single stage the correct number of times we could effectively simulate an ft, Batcher, or Benes network. As it turns out, this particular interconnection of processors was proposed by Stone [12] and called the perfect shuffle. Using the results of Chapter 2, we were able to prove the new result that certain important permutations (including uniform shifts) could be done in exactly log-N cycles where N is the number of processors. In comparison, the ILLIAC IV ±1, ± /N connections allow any uniform shift in time < 7n. If we add ±1 connections to the perfect shuffle connection we get a system with the same number of i/o line pairs (probably the major cost item) and which can handle any uniform shift plus other important permutations in time < log N. Thus, while we may not have found a panacea, we have demonstrated a class of network which provides a feasible solution to the data alignment and memory conflict problems in a parallel vector processing machine. 114 LIST OF REFERENCES [l] G. H. Barnes, et. al., "The Illiac IV Computer," IEEE Transactions on Computers, Vol. C-17, pp. 746-757; August 196b. [2] P. Budnik, and D. J. Kuck, "The Organization of Parallel Memories," IEEE Transactions on Computers , C20, p. 1566; December 1971- [3] P. W. Kraska, • "Array Storage Allocation," M.S. Thesis, Department of Computer Science, University of Illinois at Urbana- Champaign, Report No. 344; August 1969. [4] K. E. Batcher, "Sorting Networks and Their Applications," Proceedings of the Spring Joint Computer Conference , 1968; pp. 307-314. [5] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, New York; 1965 • [6] I. M. Vinogradov, Elements of Number Theory , Dover Publications; 1954. [7] D. Shanks, Solved and Unsolved Problems in Number Theory, Vol. I, Spartan Books, Washington, D. C; 1962. [8] Y. Muraoka, "Storage Allocation Algorithms in the TRANQUIL Compiler," M.S. Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, Report No. 297; January 1969- [9] M. C. Pease, "An Adaption of the Fast Fourier Transform for Parallel Processing," Journal of the ACM , Vol. 15, pp. 252-264; April 1968 [10] J. E. Stevens, "A Fast Fourier Transform Subroutine for Illiac IV," Center for Advanced Computation, University of Illinois at Urbana-Champaign, Document No. 17; October 1971* [11] D. J. Kuck, and A. H. Sameh, "Parallel Computation of Eigenvalues of Real Matrices," Proceedings of the IFIP Congress , p. TA-1-24; 1971. [12] H. S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE Transactions on Computers , C20, pp. 153-l6l; February 1971 • [13] D. C. Opferman, and N. T. Tsao-Wu, "On a Class of Rearrangeable Switching Networks," Bell System Technical Journal , Vol. 50, pp. 1579-1618; May-June 1971. 115 VITA Duncan Hamish Lawrie was born in Chicago, Illinois, on April 26, 19^3. He received the B.A. degree from DePauw University in Greencastle, Indiana, and the B.S.E.E. degree from Purdue University, "both in 1966, and the M.S. degree in Computer Science from the University of Illinois in 1969. While at the University of Illinois Mr. Lawrie held a NASA Traineeship in Computer Science and a Research Assistantship in the Department of Computer Science. He also worked on the Illiac IV project as Senior Research Programmer in charge of language development and computer operations. He is a member of Tau Beta Pi, Eta Kappa Nu, the Association for Computing Machinery, the Institute of Elec- trical and Electronic Engineers, and is an associate member of Sigma Xi . He has published two papers, "The Use and Performance of Memory Hierarchies: A Survey," in Software Engineering , Vol. I, Academic Press, New York (197°); and "interconnection Networks for Processors and Memories in Large Systems, " pre- sented at the COMPCON 72 convention in San Francisco, California (1972), both with D. J. Kuck. BLIOGRAPHIC DATA IEET 1. Report No. uiucdcs-r-73-557 Title and Subtitle MEMORY-PROCESSOR CONNECTION NETWORKS 3. Recipient's Accession No. 5. Report Date February 1973 6. Author(s) Duncan Hamish Lawrie 8. Performing Organization Rept. No -uiucdcs-R-73-557 Performing Organization Name and Address University of Illinois at Urbana- Champaign Department of Computer Science Urbana, Illinois 6l801 10. Project/Task/Work Unit No. 11. Contract/Grant No. US NSF GJ 274J+6 Sponsoring Organization Name and Address National Science Foundation Washington, D. C. 13. Type of Report & Period Covered Doctoral - 1972 14. Supplementary Notes Abstracts In order to utilize the potential speed of a SIMD type parallel processor it is necessary to arrange data in the memory system so that subsets of this data can be fetched in parallel without memory conflicts. Additionally, we must provide sufficient memory-processor paths to allow the data to be correctly aligned with the processor array. In this paper we present several storage mapping algorithms together with a memory-processor interconnection network. We demonstrate the cost and effectiveness of these and compare them with other networks which have been proposed for this application. Key Words and Document Analysis. 17o. Descriptors Indexing Networks Sorting Networks Switching Networks Crossbar Switch Memory-Processor Connection b. Identifiers/Open-Ended Terms e. COSAT1 lie Id /Group • Availability Statement RELEASE UNLIMITED 19. Security Class (This Report) UNCLASSIFIED :uritv Class (Thi 20. Security Class (This Page 1JNCLASS1FIED 21. No. of Pages 121 22. Price RM N TIS-19 (10-701 USCOMM-DC 40329-P7 1 mtfiiaw* OCT 24 19/3 UNIVERSITY OF ILLINOI9-URBANA 3 0112 052121743