Report No. UIUCDCS-R-72-522

ELIMINATION OF ROTATIONAL LATENCY BY DYNAMIC DISK ALLOCATION*

by

DAVID E. GOLD

May 24, 1972

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the National Science Foundation under Grant No. US NSF GJ 27446 and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, June 1972.

ELIMINATION OF ROTATIONAL LATENCY BY DYNAMIC DISK ALLOCATION

David E. Gold, Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1972

In input/output bound computer systems which use disks for backup memory, rotational latency is frequently the source of the I/O boundness. This paper presents several methods for eliminating rotational latency in head-per-track disks or equivalent memory devices. These methods assume pre-run-time knowledge of the I/O sequences required by a particular program which will be run on the system. They work by constantly reorganizing the data on the disk during execution of the program. Measures are obtained of certain system parameters which indicate the requirements for use of these techniques, and the various tradeoffs may be examined and quantified. These results are useful both for designing a new operating system and for removing latency from an already existing operating system.

ACKNOWLEDGMENT

I wish to thank my advisor, David J. Kuck. Without his constant questions, suggestions and disagreements neither this paper nor the research contained in it would have emerged. I thank also the National Science Foundation and the Department of Computer Science at the University of Illinois for their support during the evolution of this research. And finally I wish to thank Diana Mercer and Vivian Alsip for their fine job of typing the manuscript from my semi-legible and frequently changing handwritten original.

TABLE OF CONTENTS

1. INTRODUCTION ... 1
   1.1 The Problem ... 1
   1.2 Terminology and Notation ... 6
2. MESH PROBLEMS ... 10
   2.1 Introduction ... 10
   2.2 Single Process Programs ... 11
   2.3 Two Computational Process Programs ... 22
       2.3.1 Density Buffering, fast process to slow process transition ... 25
       2.3.2 Density Buffering, slow process to fast process transition ... 30
       2.3.3 Origin Shifting with Density Buffering ... 31
   2.4 Generalization to N-Processes ... 34
       2.4.1 Data Rate Considerations for Some n-Process Problems ... 36
       2.4.2 Sweeps of Less Than a Disk Rotation in Duration ... 40
3. OTHER SWEEPING PROBLEMS ... 42
   3.1 Introduction ... 42
   3.2 Generalized Data Transmissions ... 42
   3.3 Non-Standard Sweeping Patterns ... 47
   3.4 Discussion ... 54
4. ARBITRARY PERMUTATIONS OF BLOCKS BETWEEN SWEEPS ... 55
   4.1 Definition of the Problem ... 55
   4.2 The Algorithms for Arbitrary Permutations ... 59
   4.3 Permuting in a Memory Hierarchy ... 67
   4.4 Special Cases of Permutations - Addition and Deletion of Blocks ... 68
       4.4.1 Other Special Cases ... 70
   4.5 Combining Permutations with Origin Shifting and Density Buffering ... 72
5. IMPLEMENTATION CONSIDERATIONS ... 74
   5.1 Non-Zero Data Transmission Times ... 74
   5.2 Memory Sizes and Data Rates ... 75
   5.3 Speed of the Buffer Memory ... 78
   5.4 Implementation of Row Permutations ... 78
   5.5 Disks Which Do Not Read and Write Simultaneously ... 81
   5.6 Real Data Block Spacing and Synchronization ... 84
   5.7 Obtaining Compile-Time Information on Data Accesses ... 85
   5.8 The Effects of Block Size on Memory ... 85
       5.8.1 What the Effect Is ... 86
       5.8.2 Processes in Which Changing the Block Size Has No Effect ... 94
6. CONCLUSIONS AND DISCUSSION ... 95
   6.1 Applications of Latency Elimination Techniques to Existing Systems ... 95
   6.2 System Design Applications of Latency Elimination Techniques ... 96
   6.3 Areas for Future Research ... 96
APPENDIX A ... 98
LIST OF REFERENCES ... 100

1. INTRODUCTION

1.1 The Problem

This thesis is concerned with the elimination of rotational latency in several classes of I/O bound programs. Within the context of this paper, I/O boundness is that situation in which a processor requires data faster than the system's backup memory can supply them using a simple on-demand strategy. This situation, therefore, is memory, processor and program dependent and is a property of all three of these. Thus, a program may be I/O bound for some particular system (memories and processor) but not for some other. The term I/O bound program will implicitly refer to both a program and some particular system.

The literature contains a succession of drum latency minimization techniques, starting with the IBM-650 and IBM-650-like processors [6,7,9]. In these machines, all memory references including instruction sequencing were potential sources of rotational latency. The more recent literature deals more with the statistical nature of requests for data from a drum and/or the retrieval of data (records, pages, etc.) already stored there in some non-optimal manner [1,4,5]. In this paper we deal with a predetermined sequence of demands to the drum and assume that the initial configuration of these data on the drum is established as an optimal one. Subsequent positioning of data on the drum (for example during output) is such that the locations continue to be optimal with respect to future retrieval from the drum. This approach was taken for a limited class of problems in [3].

The solution of an I/O bound program will consist of an algorithm which provides a particular sequence of I/O transactions between memories such that rotational latency is eliminated. Included in such a solution is a program and processor independent upper bound on the size of buffer memory required. Thus the results presented here are useful either in programming for an existing system or as system design parameters.

The data of such an I/O bound program are assumed to be stored on a sufficiently large periodically addressable device such as a head-per-track disk, a drum, a recycling delay line or shift register, or some other device whose only latency is rotational in nature. This eliminates disks with head positioning latency, but some of the results obtained here may be extended to include these devices.

All of the zero latency schemes given in this paper will require disks which are capable of both reading and writing at the same time. That is, at any point in time while the program is being executed, data may be read from the disk, written to the disk, neither of these or both of these. While this requirement is somewhat artificial, it permits the analysis of the data transmissions in the methods given here to remain more comprehensible.
It is also the case that several disk manufacturers have indicated that such disks may be produced for no more than a 10-20% additional cost over their single-I/O counterparts. Furthermore, Chapter 5 gives a method by which standard disks without the concurrent read/write ability may simulate the other kind entirely through software techniques.

In a uniprogramming situation, which this paper is concerned with, I/O bound programs cost, in time, an average of half a disk rotation plus transmission time for each access. If the program is long running, as in the case of one which iterates many times over most or all of its data base, the total latency becomes quite large. Such a situation may easily give rise to more latency time than actual process time. For many programs, however, it is clearly the case that with some forethought and planning the data could actually be placed at the proper locations on the disk to eliminate this latency for at least the first pass through the data. That is, for the first time the data are accessed from the disk, they may be placed in those locations on the disk from which they may be read at precisely the proper moment to be transferred to the processor when they are needed by the processor. Indeed, a fairly simple algorithm will accomplish this effect with less precise requirements: data are pre-fetched from the disk in the order in which they will be required and held in a buffer until required. This will yield a zero latency data accessing sequence for the first pass through the data, although it is not obvious what the size of buffer must be to accomplish this. For subsequent passes, however, it seems that somewhat more sophisticated algorithms may be used to accomplish the same effect. These would insure that all necessary data can be pre-fetched in the allotted time and further insure that there is a fixed and reasonable bound on the size of the buffer which is required for the procedure.

This paper exploits the above premise. We show that for some classes of programs there exists an initial disk layout of the data base which provides a zero latency initial pass of the program through its data. We further provide algorithms which reformat the disk (in real time) to provide zero latency data accesses from the disk for subsequent passes over the data base. This, in effect, accomplishes a continuous dynamic allocation of the disk.

One requirement must be met before any methods presented here may be applied. The processor, the program to be executed by it and the disk which will be used must be well matched. This simply means that the maximum data rate requirements of the program (executed on the particular processor) must be satisfiable by the disk. In other words, the program/processor should not crunch numbers faster than the disk's ability to provide them. Unless this criterion is met, there can be no hope of a zero latency solution of the type derived here.

A program consists of one or more computational processes which operate on the data base of that program. A computational process will require input data in a predictable sequence. It is assumed that the execution time of each computational process is known to within a relatively small tolerance. This last assumption permits the prediction of data requirement timing as well as sequencing. The solutions which are outlined here all have the property that the disk and processor remain synchronized with each other (cf. section 5.6).
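The buffer-size question raised above can be made concrete with a small simulation. The following is a minimal sketch (not from the thesis), assuming blocks are fetched strictly in demand order, every fetch takes the same time, and the processor consumes one block per fixed compute interval; the function name and all numbers are illustrative.

```python
import math

def peak_prefetch_buffer(n_blocks, fetch_time, compute_time):
    """Peak buffer occupancy (in blocks) for in-order pre-fetching."""
    start = fetch_time              # computation begins once block 0 has arrived
    peak = 0
    for i in range(n_blocks):
        needed = start + i * compute_time          # when block i is consumed
        if (i + 1) * fetch_time > needed:          # block i not yet fetched in time
            raise RuntimeError("disk too slow: program and disk are not well matched")
        # blocks fetched so far, minus the i blocks already consumed, sit in the buffer
        fetched = min(n_blocks, math.floor(needed / fetch_time))
        peak = max(peak, fetched - i)
    return peak

print(peak_prefetch_buffer(n_blocks=100, fetch_time=3.0, compute_time=4.0))
```

For a disk faster than the processor the buffer keeps growing over the pass, which is exactly why the algorithms of the later chapters bound the buffer explicitly.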
The extraction of computational processes and their relevant parameters is expected to take place at compile time, although some of the parameters may be obtained at run time. In any event, it is this "early" information which will permit most of the latency eliminating methods presented here to work. This information may be obtained, in some cases, by a fairly straightforward scanner to determine data sequences. For complicated programs, however, an analyzer of the type described in [10] may be suitably modified to yield this information.

The data are assumed to be lumped into blocks. Block size is chosen on the basis of several considerations. A block size smaller than the smallest addressable segment on the disk is inefficient of disk capacity. Furthermore, since many algorithms require many variables in primary memory at one time (as in the case of partial differential equation-type mesh problems where neighborhoods of variables are computed upon at the same time), very small blocks are uneconomical. This is because certain variables known as edge values are stored redundantly. The number of such edge values is directly proportional to the number of boundaries between blocks. Thus a partitioning into many small blocks will produce many redundantly stored values. On the other hand, there will typically be only a small number of blocks in memory at one time (interaction between data is assumed to be either within blocks or between only a few blocks over any time interval). And because the size (and hence cost) of primary memory scales linearly with block size, this mitigates against very large blocks. Thus the block size is determined as a function of the primary memory (processor's memory), the disk's characteristics and the program which will be running. Chapter 5 contains a discussion of block size determination.

The reader may wish to substitute the word "page" for "block" wherever it appears here. The common use of "page" is consistent with "block" here except that the preceding remarks about size determination should be heeded.

This paper will give algorithms which produce zero latency solutions for particularly well-defined classes of programs. Generalizations are also given which permit application of these same methods to somewhat different, less restrictive types of programs. These generalizations, while less formal than the original formulations, will hopefully suggest additional applications of the basic methods to latency elimination in programs.

1.2 Terminology and Notation

A disk is any large capacity rotating storage device whose latency is solely rotational. Thus, drums and recycling shift registers are disks.

Latency is rotational latency, or the time spent "waiting" for a disk to rotate so that the head is correctly positioned to begin reading the next needed data.

A physical track of a disk is a recording track on a disk which is accessed by a single read and/or write head. It may be thought of as beginning at some arbitrary angular origin and extending for a full rotation of the disk.

A logical track of a disk is a simulated physical track of arbitrary length. That is, it is a recording track which may extend for more than a single disk rotation. The simulation is accomplished by switching read or write heads from one physical track to another at successive occurrences of the origin at the heads. Intuitively, a logical track is several physical tracks laid out end-to-end.
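As an illustration of the logical-track definition, here is a minimal sketch; the function name `locate` and the word-based units are assumptions made for the example, not the thesis' notation.

```python
# A logical track chains physical tracks end-to-end, switching heads each time
# the angular origin passes. Mapping a logical position back to hardware is a
# single division: the quotient selects the physical track, the remainder the
# angular offset within it.

def locate(logical_position, track_capacity, first_physical_track=0):
    """Map a word offset on a logical track to (physical track, angular offset)."""
    rotations, angular_offset = divmod(logical_position, track_capacity)
    return first_physical_track + rotations, angular_offset

# e.g. word 2500 of a logical track built from 1000-word physical tracks
print(locate(2500, 1000))   # -> (2, 500): third physical track, half way around
```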
Density is a measure of the amount of data stored on some portion of a disk. It is usually expressed in bits per physical track or words per physical track, although these units are more commonly referred to as bits per rotation or words per rotation, respectively. The maximum density for a disk is simply that quantity of bits or words per track which is the largest that may be recorded on a single physical track of that particular disk, due to its physical properties.

Q_max is that amount of data (in either words or bits) which may occupy a physical track on a disk at its maximum density.

Q is that amount of data which occupies a physical track on a disk. If the data under consideration happen to be stored at maximum density, Q = Q_max. This quantity of data is frequently referred to as "a rotation of data."

T_d is the disk period. Time is frequently expressed in terms of this quantity. That is, we frequently measure time in disk rotations. A disk rotation is T_d seconds.

The data base is the collection of all data associated with a program. It includes all initial data as well as intermediate results and final results.

A data entity is some distinguishable portion of a program's data base. Data entities are mutually disjoint. An array is an example of a data entity.

A program during its execution will require input data. In the schema here, this is in the form of input blocks. We say that computational processes compute on their input data. Results of computational processes are output data. This is handled in the form of output blocks.

A computational process is a distinct computational portion of a program. It is assumed to be devoid of I/O statements and all conditional branches except those which provide for looping or are otherwise predictable. The input and output sequences for these processes are generated by the algorithms presented in this paper. A computational process which only does updating of values in a data entity will have both its input blocks and output blocks contained in that same data entity. A process which performs, say, a binary operation on two matrices to yield a third one will have its input blocks coming from two entities. Its output blocks will be contained in yet a third. A particular computational process may or may not operate over its program's entire data base, but it will usually operate over an entire data entity if it operates on any part of it. This is because a computational process is usually assumed to operate iteratively over an entire entity. We frequently refer to a computational process as simply a process where no ambiguity will result.

Primary memory is the fastest memory available to the processor (save perhaps a limited number of high speed registers). It is random access and contains the current data being computed upon (with the exception of perhaps a small amount in registers). Because processes will access their input data from primary memory, it is frequently referred to as processor memory. Primary memory need contain only "a few" blocks of data at any point in time; i.e., primary memory may be as small as 3 blocks of data plus any needed to store the program itself. Thus this memory is relatively small. A discussion of block size appears in Chapter 5.

Buffer memory is a typically large, slow memory with respect to primary memory. It is also random access memory and is used to match data rates between the disk and the primary memory and also as a data reservoir in which data are reformatted. Buffer memory is frequently considered to be logically partitioned into two areas.
The input buffer contains data on its path from the disk to the primary memory, whereas the output buffer contains data on its path from the primary memory to the disk. Output data will pass through the output buffer to the disk. When they are next retrieved, they are input data and pass through the input buffer to the primary memory.

A system is a processor, a primary memory, a buffer memory and a disk in the following configuration:

[Figure: system configuration - the disk feeds the input buffer, which feeds the primary memory and processor; output returns through the output buffer to the disk.]

To minimize the number of edge values which need be passed, the size of edge blocks will similarly be much smaller than the data blocks. It is also possible for mesh problems to be three dimensional rather than two dimensional as previously described. In such a case we think of the two dimensional mesh as described as being a plane. In general, then, a sweep over a plane in the three dimensional model will have both edge values within its plane (as described) as well as edge values which are the actual data blocks in adjacent planes. In this case the inter-plane edge value blocks are identical in size to the actual data blocks being processed. It is also possible to allow the edge blocks to become as large as the data blocks themselves and apply the method of Chapter 5 to obtain a zero latency solution.

2.3 Two Computational Process Programs

Two process programs are those which have two processes which alternately sweep the entire mesh. The particular complication introduced by generalizing to two processes is the fact that the two processes have different compute times, C, for their respective blocks of data. Thus, for a mesh with its data blocks properly spaced on the disk for one of the processes, the output blocks sent back to the disk will, by the method of the previous section, be spaced in like fashion. The next process, however, because its compute time is different, will not be able to operate at its no-latency tempo. It requires a different spacing of these data blocks (its input blocks) - one which corresponds to its particular speed. It is obvious that the data blocks cannot migrate over the disk. Once they are output, their spacing is fixed and they will be input with the identical spacing. Furthermore, all the problems considered in this paper are assumed to be non-core contained, thus precluding the trivial solution of keeping the entire mesh in primary memory.

The method of solution presented here uses a small amount of buffer memory to constantly alter the critical spacing of the data blocks. The result is a no-latency solution in which the processor is always being supplied its requisite data at the proper rate.

The following assumptions are made concerning the nature of the problem. There will be several sweeps over the mesh by both processes. This will require a steady-state solution in which each of the two processes is assumed to both precede and follow the other. (The result will yield a zero latency solution in case of a single iteration by each of the kernels.) A sweep over the mesh is made first by one process over all of the mesh and then by the other process over all of the mesh. The pattern of sweeping must be identical for both processes. That is, the sequence of partitions of the mesh must be identical for each process and for all sweeps.

In describing the transition between any two processes, it will be necessary to consider the relative speeds or compute times of the two. One process will be faster than the other and this one will be called the fast process.
Similarly, the slower process is christened the slow process. The fast process takes less compute time per datum than the slow process. The two process problem is defined such that both processes operate on precisely the same data over their respective sweeps. Thus, as one would expect, the slow process takes more time per sweep than the fast process. The data per sweep being constant and the time per sweep being different, the density requirements of the two processes are different. The fast process requires data from the disk at a higher rate than does the slower process. This creates one of the two potential latency producing problems. Density buffering is the process of altering the density of the data comprising a mesh from the density corresponding to one process to that density corresponding to the other process.

The second possible area in which latency could be introduced in a two process program is virtually identical to that of the one process problems - intersweep transitions. That is, when a process is ready to begin computing, the data probably are not positioned to begin flowing from the disk. To insure no latency at this point, the single process solution uses buffer memory to temporarily store up to a rotation of this data and thus insure its accessibility when necessary. The use of buffer memory in this manner is again called origin shifting since the memory functionally shifts the origin of the data stream to correspond to the processor's timing requirements.

Both density buffering and origin shifting are handled differently as a function of whether the data are being passed from the fast process to the slow process or vice versa. These will be explained separately for fast-slow and slow-fast transitions.

2.3.1 Density Buffering, fast process to slow process transition

Transition here refers to the altering of the data spacing on the disk, i.e. density, from that corresponding to the fast process to that of the slow process. Thus the transition is that of the data format and not that of the processor as it switches from one process to the other. The data transmission, as will be explained, occurs while one of the processes is operating.

In the case of the fast-slow transition, the data blocks are assumed to be stored on the disk at the fast process' density. The goal is to input data from the disk at this rate and provide it to the processor at the slow process' rate. This can be accomplished in several ways, but it is necessary to keep buffer memory requirements low. The following solution is given:

Each rotation of the disk contains enough data blocks to supply the processor with its slow process computation for some period of time which is longer than a disk rotation. (Intuitively, one rotation of data at the higher density corresponds to more than one rotation of data at the lower density.) If the ratio of the higher to lower density is denoted r, one rotation of data as it is stored on the disk will keep the processor going for r rotations (r > 1) at the slow process speed.

This method assumes initiation of data transfers from the disk only at some arbitrary origin. This is assumed to be that angular position at which the first block of the sweep is stored on the disk. The original transmission will start with this block and continue for an integral number of disk revolutions.
All subsequent inputs of data corresponding to this sweep will be initiated at the same angular position (which is where the previous transmission stopped) and continue for integral numbers of disk revolutions. As data blocks are being input at the higher data rate, the process is making use of them at a lesser rate. Thus there is a net increase of data in buffer memory at such times. Input from the disk is halted at its origin every time the supply of unused data already in memory is sufficient to supply the processor for at least one full disk rotation. Data, then, start to accumulate in buffer memory as soon as input from the disk occurs. At certain points during the computation, however, the accumulated data are sufficient to supply the processor, and input ceases temporarily. In this way buffer memory is bounded while still providing the processor with data at its required rate. Stated concisely as an algorithm, we have

Algorithm 1:

i) Input is initiated as the origin of the data passes under the read head of the disk just before it is required by the processor, and continues until the conditions in ii are satisfied (at least a full disk rotation).

ii) At successive rotations of the disk to its angular origin (i.e., an integral number of rotations from the origin of the data), when the amount of data in the buffer is greater than that amount needed by the slow process for the next entire disk rotation, input terminates.

iii) At successive rotations to the data origin, input is initiated (or continued) when the amount of data in the input buffer is insufficient to keep the processor supplied for the next disk rotation.

This bound is obtained by considering the worst case. Let the amounts of data in bits required by the fast and slow processes for the time corresponding to a revolution of the disk be α and β, respectively. Thus α = βr. From the description of the fast-slow density buffering, if we are to allow for any value of r > 1 and arbitrarily large data entities, it is obvious that buffer memory must be at least as large as β bits. The only time at which input will cease is when the origin of the disk is passed over by the read head(s) and there are at least β bits in buffer memory. The best case is when exactly the requisite β bits are resident at the time the origin is passed over. The worst case, then, is when there are β − ε bits, thus necessitating an input transmission. At the end of another disk rotation, α bits have been input and β bits have been "used" by the processor, leaving α − ε bits in buffer memory. This being sufficient to supply the processor for at least the next disk rotation, transmission stops. Input resumes only when the contents of buffer memory fall below β bits again. By this algorithm, then, the amount of buffer memory required for density buffering is never larger than α bits. The quantity α, however, is identical to Q, a rotation of data at the fast process' rate.

This process can be illustrated by a simple example. In Fig. 2-4, the curve labelled "data from disk" is the cumulative number of bits input to buffer memory from time 0. Note that transmissions initiate and terminate at integral numbers of disk rotations from the origin. The line below the axis is the integral of the amount of data transmitted to the processor and hence, data from buffer memory. The dotted line represents the amount of data resident in the input buffer and is simply the sum of the other two curves.
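Algorithm 1 and the worst-case argument above can be checked numerically. A minimal sketch follows, assuming time is discretized into whole disk rotations with decisions taken only at the data origin, as steps ii and iii require; alpha and beta follow the text's α and β, everything else is illustrative.

```python
# While input is on, alpha bits arrive per rotation; beta bits are consumed
# per rotation regardless (alpha = beta * r, r > 1). Occupancy rises
# monotonically during an inputting rotation, so sampling at each origin
# crossing captures the maximum. The peak should never exceed alpha.

def simulate_algorithm_1(alpha, beta, rotations):
    assert alpha > beta > 0
    buffered = 0.0            # bits in the input buffer at the data origin
    inputting = True          # step i: input starts just before data is needed
    peak = 0.0
    for _ in range(rotations):
        buffered += (alpha if inputting else 0.0) - beta
        peak = max(peak, buffered)
        # steps ii/iii: stop once a rotation's worth (beta) is on hand, resume otherwise
        inputting = buffered < beta
    return peak

print(simulate_algorithm_1(alpha=2700.0, beta=1000.0, rotations=100))  # stays <= alpha
```

The printed peak never exceeds α, in agreement with Theorem 1 below.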
In the example diagrammed, the scale is such that α = (4/3)β. That is, the fast process uses data at 4/3 the rate of the slow process or, in other terms, the fast process operates on the same data in 3/4 the time of the slow process.

Theorem 1: For the algorithm described for density buffering of data for the fast-slow transition, no more than Q buffer memory is required.

Proof: Data transmissions when initiated will read or write data over an integral number of disk rotations - origin to origin. Furthermore, data will always be flowing to the processor more slowly than they are input to the input buffer because α > β. Therefore, the maximum amount of data that will have to be accommodated by buffer memory will be that occurring at exactly the end of a transmission. Assume that this quantity is larger than α, say α + x. Then, by the algorithm, one rotation earlier, α less bits would have been input and β less bits would have been forwarded to the processor. That is, one rotation before the quantity α + x was in buffer memory, the quantity β + x would have to have been resident. But this amount being at least β, input would have terminated at that origin by step ii of the algorithm, a contradiction. Q.E.D.

[Fig. 2-4: cumulative data input from the disk, data forwarded to the processor, and the resulting input buffer occupancy, as functions of time, for α = (4/3)β]

2.3.2 Density Buffering, slow process to fast process transition

Assume that a quantity d larger than α is resident in the output buffer, say d = α + x. Then a rotation before, there was α + x − β in the output buffer. But this contradicts the stated algorithm because this amount of data would have triggered an output transmission at that time. Q.E.D.

Note that α may be as large as Q, that is, a rotation of data. Fig. 2-5 shows the data movement for this algorithm for the same parameters as in Fig. 2-4, α = (4/3)β.

2.3.3 Origin Shifting with Density Buffering

All density buffering takes place while the slow process is computing. Intuitively this is because density buffering between any two densities will always take as long as the slower process. If this procedure is to be completely masked from the processor, it must be concurrent with the slower process' execution. This requires a total buffer memory of 2Q for density buffering, because each of the buffers may be as large as Q. It might be thought of as configured as in Fig. 2-6.

[Fig. 2-6: an input buffer of size Q bits alongside an output buffer of size Q bits]

But this does not account for origin shifting to eliminate intersweep latency. In general, data will not be at exactly the angular position to be read at the right instant for the beginning of a sweep. There is, however, a scheme permitting buffering to eliminate this type of latency without requiring any more memory than that already necessary for density buffering.

Observe that origin shifting may be accomplished "in advance" by storing data in an output buffer until it may be placed at precisely that point which corresponds to the angular position at which it can ultimately be read for no latency. That is, data may be made to appear to have shifted on the disk by either using an input buffer to read them early and hold them, or by using an output buffer to actually shift them. The amount of buffer memory is identical for either method: a rotation of the disk, or Q.

The scheme for density buffering does not permit origin shifting during the slow process' execution without additional buffer memory. It does however permit origin shifting during the fast process' execution. In fact, since no density buffering occurs at this time, origin shifting may occur in both the input and output buffers. The input buffer will shift the partitions of data as they come off the disk and eliminate latency for the fast process. The output buffer will place the partitions in the proper position on the disk for the slow process. Thus both origin shifting and density buffering may be accomplished with only 2Q of buffer memory.
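The bookkeeping behind origin shifting reduces to one line of arithmetic. A sketch under stated assumptions (not the thesis' notation): the stream is stored at `density` bits per rotation and must appear shifted by a fraction `shift` of a rotation; whichever buffer does the holding carries shift × density bits, never more than Q.

```python
# Whether the shift is realized by reading early (input buffer) or writing
# late (output buffer), the holding requirement is the same: the data that
# pass by during the shift interval.

def origin_shift_buffer(density, shift):
    """Bits that must be held to shift a stream's origin by `shift` rotations."""
    assert 0.0 <= shift < 1.0
    return shift * density        # <= density <= Q for any legal stream

# e.g. a stream stored at 3000 bits/rotation, appearing 0.4 rotation later,
# needs 1200 bits of holding space
print(origin_shift_buffer(3000.0, 0.4))
```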
2.4 Generalization to N-Processes

One and two process mesh sweeping may be graphically represented as in Fig. 2-7, a and b, respectively. Such illustrations are called code graphs. The nodes represent processes and the directed links correspond to data transmissions. In the particular case of the mesh problems already considered in this chapter, the links also correspond to the order of execution. This need not, however, always be true. The "upward arrows" are necessary because of the iterative nature of the problem. Fig. 2-7c shows an n-process mesh problem. This is simply one in which several processes operate on a common data base in succession. Each is assumed to take a different time to compute than the other processes. Thus a zero latency solution must here perform both density buffering and origin shifting.

[Fig. 2-7: code graphs for (a) one-process, (b) two-process and (c) n-process mesh problems]

The solution to this is a simple generalization of the two-process solution. If the solution for the two-process problem is considered as consisting of two transition components, it can be stated as follows:

Fast/slow process transitions have origin shifting taking place in the output buffer during the fast process execution. Density buffering occurs in the input buffer while the slow process is executing.

Slow/fast process transitions perform the density buffering in the output buffer during the slow process execution. Origin shifting is accomplished in the input buffer during the fast process execution.

N-process problems, then, are simply like two-process problems in which there is some sequence of these slow/fast and fast/slow transitions. The only significant difference is that in an n-process problem, any particular process need not be simply fast or slow. The attribute of being either fast or slow is a function of the process' neighbor. Since in the n-process problem a process has two neighbors, it is not possible to characterize that process' speed as one or the other. If a process is faster than its predecessor process, then it is "fast" with respect to that predecessor and that transition from its predecessor is a slow/fast one. If a process is faster than its successor process, it is "fast" with respect to its successor and that transition to its successor is a fast/slow one. A process may however be fast with respect to its predecessor and slow with respect to its successor, or vice versa. The important point is that an interprocess transition will occur in one of two ways, both of which are provided for in the solution.

2.4.1 Data Rate Considerations for Some n-Process Problems

The statement of n-process problems has considered only those data in common to the processes. Some algorithms may be such that they consist of several processes, each of which operates on different variables at each point in the mesh. That is, there could be a set of variables over the mesh which is common to all of the processes and other classes of variables which are operated upon only by specific individual processes.

One obvious way to handle this case is to lump all of the variables and handle the entire mesh as if it were common to all nodes. This would require origin shifting and density buffering of every variable for every process, even though some processes would not make use of all of these variables.
Sometimes, however, the overhead of carrying the "deadwood" variables along might cause the data rate from the disk to be exceeded where the data rate of the process (which just considers the necessary input data) would permit a solution. These situations may be solved by a decomposition of solutions. The original mesh is split up into several data bases, each made up of different variables. All of the possibilities of such decomposition will not be enumerated in this paper. An example is given, however, of one such decomposition for a two-process problem. The code graph of Fig. 2-7b is modified to that of Fig. 2-8 to show all of the data transmissions discussed in the two process decomposition.

[Fig. 2-8: the two-process code graph of Fig. 2-7b augmented with the per-process data transmissions]

A no latency solution for such a program will consist of the solution to the two process portion plus the two one process solutions. That is, there are three sets of data or meshes which are each provided for independently. The nature of the solutions allows for this "interspersing" of data while insuring correct arrival at the processor with only minor additional considerations.

One of these considerations is the proper handling of data rates to and from the disk. A no latency solution is feasible only when every process in a program requires data at a slower rate than the disk can provide. Very large non-core contained arrays which can be rapidly "crunched" will have no low latency solution if the numbers are "crunched" faster than they may be supplied. The point with respect to mixed solutions of the kind being considered here is that it is not sufficient that the processor/disk data rates be compatible for each of the processes in each solution. The combined data for each process for its two solutions must be acceptable. Simply, the sum of the processor data rates of the two solutions for each process must be less than that of the disk.

The buffer memory requirements obtained for either the one- or two-process solutions are tied to these processor and disk data rates. The buffer requirements for combined solutions are therefore identical to the memory upper bound of the two process solution. This is because no matter how many solutions for any one process are combined, as long as the sum of their data rates is acceptable, no more than 2Q bits of buffer memory is required to process the data at that rate.

There is one final matter to be handled in any real combined solution. The fact that the combined data rates to the processor are less than the disk's bound does not guarantee that all the data will be accessible by the disk. Data could conceivably be output to the disk such that when blocks from two different solutions are to be subsequently input, they are in the same angular position on the disk. Unless the disk can read from several heads/tracks at the same time, there is no way to input all the required data when they are needed. This local overlap condition occurs in spite of the fact that the average density over the disk's total rotation is acceptable.

One way of precluding this problem is to merely insure that overlaying solutions will have their data blocks interleaved on the disk. Except for edge value blocks, all input blocks for a particular process have the same spacing between consecutive blocks. This is true no matter how many disjoint streams of data have been prescribed by separate solutions. Therefore any overlapping of these streams can be corrected by shifting one entire sequence of these blocks relative to the one it overlaps with. That is, the periodicity of any two or more streams of overlapping data blocks is identical, and merely insuring that the first blocks of the respective streams will interleave forces all of them to. This can be done at the time the two solutions are combined. Compare all streams of input blocks in the combined solution. Where overlaps exist, one or more of those streams may be shifted by delaying the original output requests for the corresponding blocks the amount of time necessary to eliminate the overlap. At most, this procedure will increase the buffer memory requirements by the size of one block for each stream which will be delayed. This is clearly very much smaller than the amounts of buffer memory already conservatively required and can therefore be ignored.
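The shift computation just described reduces to modular arithmetic on block positions. A sketch, assuming two streams with the identical period guaranteed by the text, positions measured in fractions of a rotation; all names are illustrative.

```python
# Blocks of each stream occupy [start + k*period, start + k*period + block).
# With equal periods, the two streams overlap iff their starts are closer
# than a block length modulo the period, so delaying one stream's first
# output by the amount below clears every overlap in the pair at once.

def shift_to_interleave(start_a, start_b, period, block):
    """Return the delay to add to stream b so its blocks clear stream a's."""
    assert 2 * block <= period            # otherwise interleaving is infeasible
    gap = (start_b - start_a) % period
    if block <= gap <= period - block:
        return 0.0                        # already interleaved
    return (block - gap) % period         # smallest delay that clears the overlap

print(shift_to_interleave(start_a=0.00, start_b=0.02, period=0.25, block=0.05))
```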
Edge value blocks in combined solutions are handled very much the same as in the one process solution. In the former case, however, it is the combined solution itself into which the edge values are interstitially placed. This avoids overlap conflict in the combined solution.

2.4.2 Sweeps of Less Than a Disk Rotation in Duration

Thus far, an implicit assumption was made in the statements of the previous problems. It has been assumed that any sweep over a data base would take at least as long as a single rotation of the disk. Indeed the solutions to these problems use the fact that data may be output to the disk and later input for adjacent sweeps. This section considers the case in which a process takes less than the time of a disk rotation to perform a single sweep.

It is important here to recall one of the first assumptions made about all of the problems to be solved in this paper. All of the input values required by any process must be required by the process, for the particular processor being considered, at a rate not to exceed the capacity of the disk to provide these data. Intuitively, no process should crunch numbers faster than the disk can supply them.

Sweeps of too short duration to make use of the disk are handled in a simple, straightforward manner. Instead of being routed from the output buffer to the disk, as the previous solutions call for, the data are routed from the output buffer directly to the input buffer. What this means is that for sweeps of sufficiently short duration, the required data are entirely buffer memory contained. The fact that the size of the buffer memory is sufficient to accommodate these data follows directly from the fact that a rotation of data is provided for in each buffer. Short duration processes of time less than a rotation must use less than a rotation of data for input, and less than a rotation of data will be output. Thus, the solutions given earlier in this chapter with the above modification provide a means of generating no latency solutions for the previously considered problems. This will be effective no matter what the duration of the sweep of a process over its data base.

3. OTHER SWEEPING PROBLEMS

3.1 Introduction

This chapter considers a more general program structure than the last chapter. The generalizations include more diversified data transmissions than n-process problems and sweeping patterns which are not row by row as above, but regular sweeping patterns as are found, for example, in matrix operations. In both cases a significant alteration of the previous problem formulations is the explicit merging of more than one input data set to a process. In previous cases only a single data entity was assumed to be input to each process from other processes.
The problems considered here permit several such entities to be merged and input to the process. An important requirement for this procedure is that the combined data rate of all sets of data to be input to a node does not exceed the bandwidth of the disk. This ensures the ability to interleave the data blocks on a single logical track without overlap.

3.2 Generalized Data Transmissions

Fig. 3-1a shows the relationship of processes to data transmissions for the problems considered thus far. Fig. 3-1b shows a more general type of structure. The nodes of this code graph correspond to operations which occur iteratively to variables over an entire data entity. As such, they have the same meaning as nodes in the n-process mesh problems: each node corresponds to a computational process. Since this paper will consider only serial processors, code graphs of the form depicted in Fig. 3-1b can be distorted into that of Fig. 3-1c.

[Fig. 3-1: (a) the chain-structured code graphs considered thus far; (b) a more general code graph; (c) the graph of (b) distorted into a serial execution order]

This procedure allows a structure which shows the temporal sequence of process execution, and is not in general unique. That is, given an arbitrary code graph, there may be several distorted ones, each of which corresponds to an allowable sequence of execution over the processes.

A comparison of Fig. 3-1a and c shows that the only real difference introduced into the problem is that of the data transmissions. There are two methods of dealing with this new problem. One of these, in fact, involves one further mapping which makes the structure of Fig. 3-1b become that of the type depicted by Fig. 3-1a. All of the data operated upon by all of the processes can be lumped into one large data array. This is done such that the variables of each of the original smaller arrays are uniformly distributed from the beginning to the end of the new combined array. This large array can be thought of as a single mesh. This mesh is used as if by an n-process program in which each process has the entire array as input data. However, it is only that subset of the large array which any particular process requires that is used by that process. The remainder is input to buffer memory, but need not be routed to primary memory. This unused data is recombined with the updated data from the process (output data) and then output to the disk. The large combined data array is thus input and subsequently output for each process in the problem. It is constantly being density buffered and origin shifted.

One obvious drawback of this type of solution is that the entire data base of any such problem must be streamed through buffer memory for each process. This would require a bandwidth from disk to buffer memory (and, of course, from buffer memory back to the disk) very much higher than if just the necessary data had to be input. This is a problem only in those cases where the disk to be used has a lower bandwidth than is required by using the above method. In such cases, the data transmission must be treated differently. For transmission of required data only, the following observations are made:

The combined input data rate to each process may not exceed the maximum deliverable by the disk. This is the sum of the data rates corresponding to each of the inputs to a multi-input node. Merging these multi-inputs for a process is not a problem. As long as the data rate requirement is met, the various input data streams may be interleaved when output to the disk before the merging retrieval.

The sequence of execution of the processes in a code graph is not important. It is assumed that any sequence under consideration is legal in the sense that any process which must supply data for another is executed first. For any sequence under consideration, then, the only matter this paper is concerned with is the data transitions between nodes.
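The legality condition just stated is easy to check mechanically. A minimal sketch, assuming a code graph given as producer-to-consumer links; all names are illustrative.

```python
# A sequence is legal when every process appears after all of its suppliers,
# i.e. when it is a topological ordering of the code graph's data links.

def is_legal_sequence(sequence, edges):
    """True if no process is executed before one of its data suppliers."""
    position = {proc: i for i, proc in enumerate(sequence)}
    return all(position[src] < position[dst] for src, dst in edges)

# e.g. process A feeds B and C, and B feeds C
edges = [("A", "B"), ("A", "C"), ("B", "C")]
print(is_legal_sequence(["A", "B", "C"], edges))   # True
print(is_legal_sequence(["B", "A", "C"], edges))   # False: B runs before its supplier A
```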
For this more general problem, however, the data transmissions need not be between adjacently executed processes. The difficult problem therefore is that of multiply routed output data from a process. If data are updated at some particular data rate (corresponding to the speed of that process which does the updating), and required by several different processes at differing data rates, conflicts may develop. If these data were to be required at rates both faster and slower than the original rate, that process which outputs the data can be called neither a fast nor a slow process. If it is to prepare its output data for a faster process, density buffering must occur during the process' execution. If these same data are to be ready for a slower process, they should be output at their generated data rate and density buffered while the slower process computes. Clearly, both of these cannot occur at the same time, unless many redundant images of many of the data blocks are to be cluttering up the disk.

Actually there are two fairly simple methods of handling this situation. The simpler one to explain is that of "speeding up" the data. When a data entity is first placed on the disk, it is put there at the highest density required by any of the set of processes which uses this array as input data. Similarly, output data from any process are density buffered to the highest data rate that any of the successor nodes will require. Each process may therefore expect to have to density buffer ("slowing down") its input data. This scheme would require an input and an output buffer, each of size Q, to do density buffering. Intuitively, to completely eliminate latency, an additional (input or output) buffer of size Q would also be needed to perform origin shifting. It will be shown in a later chapter that density buffering and origin shifting may be accomplished concurrently in the same buffer of size Q. For the purposes of this chapter, the amount of buffer memory required for the above scheme is bounded by 3Q.

Another way of handling the problem of output data blocks which are routed to several processes is to consider the entire sequence of accesses of those blocks. An allowable sequence is assigned to the code graph. All output data are placed on the disk at that density corresponding to the next process which will use said data as input data. This much of the scheme is identical to the n-process problem's solution. The difference arises in this problem because the output data of this successor process are not, in general, an updated version of its input data. Thus, while this process is executing, both its output data and its input data (if used as input data for other processes) must in general be density buffered. As long as the maximum data transmission rate is not exceeded, this method will accomplish zero latency scheduling with only the 2Q buffer memory as used by the n-process solution.

3.3 Non-Standard Sweeping Patterns

The schemes of Chapter 2 have provided solutions for processes which were assumed to sweep their respective data entities. They were expected to operate on the entire set of their input blocks on each sweep. A large class of problems however does not meet this requirement.
Matrix operations are one such example. In this section we will consider zero latency solutions for such problems. McKellar and Coffman in [11] show that savings can be effected for this type of problem by accessing the blocks in "good" ways relative to the partitioning scheme of the large matrices. They are concerned with efficiency of storage and minimization of block accesses, whereas we are not concerned with storage efficiency and permit any number of accesses since they are planned so as not to introduce any latency.

Consider two matrices A and B which are to be multiplied, the resultant matrix stored as C (i.e., C ← A × B). Such operations for the purposes of this paper are assumed to be performed for matrices very much larger than the available primary memory. Thus, the matrices are partitioned into many smaller arrays. One such partitioning is shown in Fig. 3-2. Such partitions are again called blocks. Both matrices A and B have been partitioned into blocks. Each such block is assumed to contain several elements of the original matrix, but not so many as to preclude several such blocks from occupying primary memory at the same time. These blocks are further assumed to be square. That is, they contain as many columns as rows of the original matrix. The nomenclature of these blocks is obvious from Fig. 3-2. Multiplication of the partitioned matrices is accomplished in the standard way, i.e.,

    C_ij = Σ_{k=1}^{n} A_ik B_kj.

The problem evident in most matrix operations quickly becomes obvious as soon as one begins to consider the initial disk layout for this simple matrix multiply operation: matrices simply do not get operated upon in a standard row-by-row or column-by-column sweeping fashion. Each element of, say, the A matrix will be multiplied by (different) elements of the B matrix many times, and while there are many different possible orderings of such operations, none will allow the two matrices to be simply swept at the same rate and produce the desired result.

Another matter should also be considered at this point. While considering the order in which the blocks will be operated upon, it should be recognized that it would be desirable to have a standard matrix layout for, say, multiplication. That is, for obvious reasons, an ordering over the blocks which was a function of the matrix's size or of the particular operation being performed would be undesirable.

[Fig. 3-2: the partitioning of the matrices A and B into square blocks A_11, A_12, ..., and B_11, B_12, ...]

There are two choices for handling the partial sum blocks of C: (i) the C partitions may be output as they are computed; or (ii) if this row of C would not occupy too much of primary memory to be simply contained there (that is, if primary memory were at least as large as P plus a few blocks), these blocks may simply reside in primary memory.*

* Note that the two conditions are not mutually exclusive, leaving some room for choice in many cases.

The operations to occur now are the multiplication of A_12 by the first row of blocks of B. A_12 will have been picked up from the disk sometime during the last rotation before it was needed. The first row of blocks of B will have similarly been origin shifted such that its first block also is in primary memory at the time at which multiplication is to commence. If the first partial sum row of C was output to the disk, its first block has likewise been input and is in primary memory.
Note that at this point two operations will be occurring: multiplication of an A block by a B block, and an addition of this result to the first partial sum of the corresponding C partition, yielding the second partial sum of that C block. This extra addition operation (the addition of the previous partial sum) will cause the time between successive operations to be longer than it was for just the matrix multiplication. Thus input data to the processor will be density buffered in the input buffer. The new (second partial sum) C blocks will be density buffered in the output buffer to the higher data rate of the original A and B blocks - which will remain intact on the disk.

Successive rows will follow in like order: the next block from the first row of A is multiplied by the next row of blocks of B and added to the respective last partial sum row of C blocks. The A and B blocks are density buffered (slowed down) in the input buffer. The currently generated partial sum C blocks are, if output, density buffered (sped up) in the output buffer. Finally the first row of blocks of C will have been computed and placed on the disk at a spacing equal to that of the A and B blocks. The procedure then continues for blocks from successive rows of the A matrix, yielding successive rows of the C matrix.

This procedure effectively sweeps one of the matrices once and the other many times to produce the product matrix. There are obvious variations in which the original matrix blocks may be stored by columns or diagonals instead of the original rows. The important point is simply that the one matrix whose data are required most rapidly (in this case, B) is on the disk at a sufficiently high density to be supplied to the processor accordingly. The remaining matrices may be stored at this same data rate and retrieved at the slower rate.

Another possibility with this same scheme is to store all matrices in like fashion, say, in rows. The density at which they are stored is the highest density of all the operations which occur in that particular program. If, for example, a program consisted of only the standard matrix operations, the A and B matrices of the previous example could be stored with their blocks spaced appropriately for matrix addition. Thus they could be added with no latency and multiplied just as described in the example except that the B matrix will always be density buffered on input. This would effectively "slow down" the data from the addition rate to the multiplication rate. (A sketch of this block-ordered sweep appears at the end of this chapter.)

3.4 Discussion

This section has given several ideas for approaches to the particular class of problems mentioned. No algorithms were given for all programs in this class because one would almost have to exhaustively consider each individual program (e.g. each possible sequence of all matrix operations in a matrix program). What we have tried to show, however, is that for any specific program a zero latency solution can always be obtained when one of the following conditions is satisfied:

i) the I/O sequences have some identifiable regularity, or

ii) the input data bandwidth requirements of the program are very much less than the bandwidth of the disk. This condition will allow skewing of data blocks on the disk such that there will be little if any overlap between blocks of different entities. That is, additional disk bandwidth capacity will help preclude attempts to retrieve more than one data block from identical disk sectors at the same time.

High data rate programs which retrieve input data in non-regular sequences may qualify for zero latency solutions as generated by the algorithms in Chapter 4.
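Stripped of the disk and buffer timing, the block-ordered sweep of section 3.3 is blocked matrix multiplication. A minimal sketch of the access order only (the I/O machinery is abstracted away); the function name, block count and matrix sizes are illustrative.

```python
# The sweep reads A once and B nb times, accumulating partial-sum blocks of C
# one row of blocks at a time - exactly the sequence the disk layout above
# must support.
import numpy as np

def block_matmul_sweep(A, B, nb):
    """C <- A x B over square blocks; A and B are (nb*s) x (nb*s) arrays."""
    s = A.shape[0] // nb                      # block edge size
    C = np.zeros_like(A)
    for i in range(nb):                       # one row of C blocks at a time
        for k in range(nb):                   # A_ik paired with row k of B blocks
            Aik = A[i*s:(i+1)*s, k*s:(k+1)*s]
            for j in range(nb):               # sweep a whole row of B blocks
                Bkj = B[k*s:(k+1)*s, j*s:(j+1)*s]
                # add this product to the running partial sum of block C_ij
                C[i*s:(i+1)*s, j*s:(j+1)*s] += Aik @ Bkj
    return C

A = np.arange(36.0).reshape(6, 6); B = np.eye(6)
assert np.allclose(block_matmul_sweep(A, B, nb=3), A @ B)
```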
4. ARBITRARY PERMUTATIONS OF BLOCKS BETWEEN SWEEPS

4.1 Definition of the Problem

This chapter deals with the problem evidenced when a partitioned data base is accessed several times but in different orders. Specifically, the blocks remain the same in terms of the variables they represent but are referenced in different sequences during successive passes over the data base. An example of this is a matrix which is accessed first by rows of blocks and then by columns of blocks or diagonals. The accessing need not be as regular as rows or columns, however. This chapter deals with accessing in any random order for each pass over a given data base. In effect, the solution given provides for permutations of blocks of a data entity on each pass through that data entity. This is accomplished, as with previous results, in a manner transparent to the processor. Judicious use of the disk and buffer memory will provide the blocks to the primary memory in their permuted order continuously.

This section will explain the mechanism for arbitrary permutations only and will ignore several other considerations. For the purposes of this section, then, it will be assumed that density buffering and origin shifting are unnecessary. Furthermore, it is also assumed that each block in an array will be accessed exactly once during each pass. Later sections will deal with combining permutations with origin shifting and density buffering, and with additions and deletions of blocks while they are being permuted.

The mechanism for arbitrary permutations involves the following steps:

i) Rectangularize the sequence of blocks as they are required for, say, the initial pass.

ii) Obtain from the resultant matrix a sequence of alternating row, column, row permutations which will yield the desired new sequence of blocks.

iii) Linearize the above result into a linear sequence of the blocks.

iv) Relate the above directly to the operating system.

The rectangularization is a process whereby the original sequence of blocks is represented in a manner which shows what their positional layout would be on the disk if they were to be directly output. The horizontal dimension corresponds to the number of such blocks which can occupy a physical track (i.e. a single rotation of the disk). The vertical dimension, then, corresponds to the number of disk rotations the entire set of blocks will occupy at their given density. This is given more formally below.

Consider the original sequence of data blocks (the order from which they will be permuted). Without loss of generality these may be relabelled 1 to n in the order of this original sequence. These n elements are placed in a rectangular array. The width of this array, or horizontal dimension, h, is the number of blocks which will occupy one physical rotation of the disk at the required input data rate of the process which operates on it. In general, the timing may be such that there is a non-integral number of partitions corresponding to an exact disk rotation. In such cases, that integer just less than the non-integral number of partitions which fills a disk rotation will be used. This introduces a small amount of latency into the solution. This latency is usually negligible, however.

The height or vertical dimension of the matrix, v, corresponds to the number of rotations that all of the n blocks will occupy on the disk. That is, v is the smallest integer not less than n/h; i.e., v = ⌈n/h⌉.
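Rectangularization itself is a one-liner. A sketch, using the n = 20, h = 5 example developed below; the function name is an assumption.

```python
# h blocks fit in one disk rotation, so the n relabelled blocks 1..n are laid
# out v = ceil(n/h) rows deep, padded with the dummy labels n+1 .. v*h exactly
# as the text describes for Fig. 4-1.
import math

def rectangularize(n, h):
    """Return the original matrix: v rows of h block labels each."""
    v = math.ceil(n / h)
    labels = list(range(1, v * h + 1))        # n real blocks plus dummy packing
    return [labels[r * h:(r + 1) * h] for r in range(v)]

for row in rectangularize(n=20, h=5):
    print(row)     # 4 rotations of 5 blocks each: [1..5], [6..10], ...
```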
That is, v is the smallest integer not less than n/h. This matrix is shown schematically in Fig. 4-1.

[Fig. 4-1: schematic of the v x h matrix of block numbers]

Note that in general n < vh and the matrix has been "packed" with the numbers n+1 to vh even though there are no such blocks. Each entry in such a matrix corresponds to a particular block and will frequently be referred to as an element of the matrix. This matrix is, quite literally, a map of the sequence of blocks on the disk if they would have been directly output to the disk. That is, each row represents a rotation of data blocks around the disk. The entire matrix represents the entire sequence placed on the disk with respect to some arbitrary origin. This matrix will be referred to as the original matrix and is the rectangularization of the sequence of blocks.

The sequences of permutations over the rows and columns of the original matrix are obtained by first forming the resultant matrix. The resultant matrix is obtained in a manner similar to that of the original matrix. The final permuted sequence of blocks is rectangularized as if going backwards. That is, a matrix representation of the n partitions is obtained which gives the positions the blocks would be in on the disk if they were already in permuted order.

An example is given to aid in the explanation of the algorithm which yields the desired transformations. For an original 20 blocks labelled 1-20 in the order they were computed upon on the original pass, assume they are next needed in the sequence 8, 12, 20, 5, 2, 13, 18, 19, 14, 6, 10, 15, 17, 16, 4, 3, 9, 7, 11, 1. For h = 5, v = 4 the original and resultant matrices are obtained trivially and shown in Fig. 4-2.

     1  2  3  4  5          8 12 20  5  2
     6  7  8  9 10         13 18 19 14  6
    11 12 13 14 15         10 15 17 16  4
    16 17 18 19 20          3  9  7 11  1

    ORIGINAL MATRIX       RESULTANT MATRIX

                 Fig. 4-2

What will be obtained will be a set of row permutations followed by a set of column permutations followed by yet another set of row permutations. These permutations will be such that when successively applied to the original matrix they will produce the resultant matrix. A set of row permutations consists of v permutations of h elements each. That is, each row is permuted independently and according to the particular pattern determined for it. Similarly, a set of column permutations consists of h permutations of v elements each, where each column permutation is independent of the others.

4.2 The Algorithms for Arbitrary Permutations

One important point should be kept in mind during the following discussion. A row permutation is in fact a switching between columns. That is, when a row of elements is permuted, the elements change columns. Similarly, column permutations cause row changes to be effected.

Basically the sequence of row-column-row permutations which yields the total permutation over the entire matrix is obtained by first finding the middle step - the column permutations. Since the sets of row-column-row permutations are not in general unique, any satisfactory set of column permutations is generated first. The row permutations then are generated by working "outward" from the set of column permutations just obtained. Intuitively, the set of column permutations swaps the elements from their original rows into the rows they will occupy in the resultant matrix. The set of row permutations which precedes this aligns the elements so that there will be no conflicts when they are row swapped.
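A small sketch of the rectangularization step may help; it reproduces the example above (h = 5, so v = 4 for n = 20). The function name and list representation are mine, not the thesis's.

```python
# A minimal sketch of rectangularization, assuming block sequences
# are plain Python lists of block numbers.

from math import ceil

def rectangularize(sequence, h):
    """Lay a block sequence out as a v x h matrix: one row per disk
    rotation, h blocks per rotation, padded with the dummy block
    numbers n+1..vh when n is not a multiple of h."""
    seq = list(sequence)
    n = len(seq)
    v = ceil(n / h)                  # smallest integer not less than n/h
    seq += range(n + 1, v * h + 1)   # "pack" the matrix
    return [seq[row * h:(row + 1) * h] for row in range(v)]

original  = rectangularize(range(1, 21), h=5)
resultant = rectangularize([8, 12, 20, 5, 2, 13, 18, 19, 14, 6,
                            10, 15, 17, 16, 4, 3, 9, 7, 11, 1], h=5)
```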
The final set of row permutations merely shifts the elements into their positions which make up the resultant matrix.

The algorithm which describes this process is now given, performing the operations on the example given earlier.

Standard Permutation Algorithm:

I. Construct a transition matrix which shows the transitions that each block makes in going from the original to the resultant matrix. This matrix will be quite large: hv by hv. In the standard fashion, we use a 1 to indicate a transition and a 0 or blank otherwise. This matrix is the Total Transition Matrix or TTM and is shown in Fig. 4-3.

[Fig. 4-3: the 20 x 20 Total Transition Matrix for the example; each row and each column contains a single 1]

II. Abstract from the TTM a matrix which just shows the necessary transitions between rows of the original and resultant matrices. This is accomplished by first partitioning the TTM into partitions which represent the row transitions as shown in Fig. 4-4. That is, the TTM is partitioned into h by h partitions. The partitioned TTM will always have v^2 partitions of h^2 of the original entries each.

[Fig. 4-4: the partitioned TTM]

This matrix is then reduced into the Row Transition Matrix or RTM. This step consists simply of collapsing the partitioned TTM into a v x v matrix, each entry of which is the sum of the entries in the corresponding position of the partitioned TTM. For example, the partition in the first row, first column of the partitioned TTM is summed to yield the element of the RTM which appears in the first row and first column. The RTM for this example is shown in Fig. 4-5. The Roman numerals I to IV are used arbitrarily to label the rows 1 to 4 of the original/resultant matrices, respectively. The interpretation of the RTM is as follows: the entry in the i-th row, j-th column is the number of elements in the i-th row of the original matrix which appear in the j-th row of the resultant matrix.

          I   II  III  IV
    I     2       1    2
    II    1   1   1    2
    III   1   2   1    1
    IV    1   2   2

    Fig. 4-5 Reduced Transition Matrix

Thus, for example, the entry in the first column, first row indicates that two blocks which appear in the first row of the original matrix also appear (somewhere) in the first row of the resultant matrix. These two blocks are (from the TTM) the ones which move from position 2 to position 5 and from position 5 to position 4. Note that the sum of each row and each column in the TTM is 1. The sum of each row and each column of the RTM must therefore be h, or in the example, 5.

III. Decompose the RTM into h individual transition matrices, each of which will correspond to a single column permutation. A permutation of k elements can be represented by a k x k matrix in which the sum of each row and column is unity. The TTM is one such matrix. It represents a permutation of hv elements. The existence of this decomposition is shown by Ryser [13] as Theorem 5.3 and Hall [8] as Theorem 5.1.9. Appendix A gives the listing of a program which performs this decomposition. For the example given, one such decomposition is shown in Fig. 4-6.

[Fig. 4-6: five 4 x 4 permutation matrices (a) to (e) forming the decomposition of the RTM]

IV. Assign each of these component matrices to a column to obtain the column permutations. The sets of row permutations then follow. These component matrices are then referred to as column transition matrices.
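The decomposition of step III can be sketched in a few lines. This is not the Appendix A program; it is a simple reconstruction that repeatedly finds a perfect matching on the non-zero entries (which the Ryser/Hall theorems cited above guarantee to exist) and subtracts it, h times.

```python
# Sketch: decompose a matrix whose row and column sums all equal h
# into h permutation matrices, by repeated bipartite matching.

def find_matching(M):
    v = len(M)
    match = [-1] * v                       # match[col] = row

    def augment(row, seen):
        for col in range(v):
            if M[row][col] > 0 and col not in seen:
                seen.add(col)
                if match[col] < 0 or augment(match[col], seen):
                    match[col] = row
                    return True
        return False

    for row in range(v):
        if not augment(row, set()):
            raise ValueError("no perfect matching; are the sums equal?")
    return [(match[c], c) for c in range(v)]

def decompose(M, h):
    M = [row[:] for row in M]              # work on a copy
    parts = []
    for _ in range(h):
        P = [[0] * len(M) for _ in M]
        for r, c in find_matching(M):
            P[r][c] = 1
            M[r][c] -= 1                   # remove the matched transition
        parts.append(P)
    return parts

rtm = [[2, 0, 1, 2],
       [1, 1, 1, 2],
       [1, 2, 1, 1],
       [1, 2, 2, 0]]
column_perms = decompose(rtm, h=5)         # five 4 x 4 permutation matrices
```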
Each of the 5 column transition matrices in Fig. 4-6 corresponds to a permutation of 4 elements. These may be arbitrarily assigned to the 5 columns. Thus, if the matrix in 4-6(a) corresponds to the permutation of the first column, it indicates the following permutation: the element in the first row remains there, the element in the second row is moved to the third row, the element in the third row is moved to the fourth row, and the element in the fourth row is moved to the second row. For the purposes of this example, the matrices in Fig. 4-6(a) to (e) are assigned to columns 1 to 5, respectively.

The remainder of the process consists of identifying the particular elements from the original matrix that correspond to those being permuted in the columns. From this information the row permutations are easily obtained. The first set of row permutations is obtained by examining the set of column transition matrices and comparing these to the partitioned TTM, a row at a time.

In the example, the 1's appearing in the column transition matrices are assigned to particular transitions in the partitioned TTM. For example, the column transition matrix which corresponds to the first column permutation has a non-zero entry in row 1, column 1. This is matched with a non-zero entry in the partitioned TTM which is in that partition that is in the first row and first column of blocks. In this particular case there are two possibilities: the transition from position 2 to position 5 and the transition from position 5 to position 4. Either of these two transitions is acceptable for the matching. Suppose it is the transition from position 2 to position 5 which is matched to this entry in the first column transition matrix. Since this is the first column permutation, the matching has the following significance: the element in position 2 of the original matrix should be (during the first set of row permutations) permuted to column 1. The column permutations will insure that it gets relocated in the correct row of the resultant matrix. Thus the particular block under consideration gets "permuted" to the same row it started in and the final row permutation places it in position 5.

This algorithm can be concisely given as: For each non-zero entry in a particular column transition matrix, match it with a particular transition from the partitioned TTM. For this transition, the element will first be row permuted from its position in the original matrix to that column (in the same row) which corresponds to the particular column for the column transition matrix being considered. After being column permuted, this partition is then row permuted into its position in the resultant matrix.

This procedure is followed for the remainder of the example. Fig. 4-7 shows the column permutation matrices in which the numbers of the blocks with which their non-zero entries are matched have been indicated.

     2  5  4  1  3
    10  7  9  6  8
    11 13 12 15 14
    18 16 19 20 17

    (a) (b) (c) (d) (e)

    Fig. 4-7

Since the matrices (a) to (e) shown in Fig. 4-7 correspond to columns 1-5, the original matrix after the first set of row permutations is immediately obtained. This is shown in Fig. 4-8(a).

     2  5  4  1  3        2  5 12 20  8        8 12 20  5  2
    10  7  9  6  8       18 13 19  6 14       13 18 19 14  6
    11 13 12 15 14       10 16  4 15 17       10 15 17 16  4
    18 16 19 20 17       11  7  9  1  3        3  9  7 11  1

         (a)                  (b)                  (c)

                       Fig. 4-8

Fig. 4-8(b) shows the matrix of Fig. 4-8(a) after being column permuted. Fig. 4-8(c) is the resultant matrix and is obtained by row permuting the matrix of Fig. 4-8(b).
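The full row-column-row sequence for the example can be checked mechanically. The index conventions below are mine; the three permutation sets are read directly off Figs. 4-8(a) to (c).

```python
# Sketch: verify that row, then column, then row permutations carry
# the original matrix of the example into the resultant matrix.

def permute_rows(M, perms):
    # perms[r][j] = source column of the element landing in column j of
    # row r (a row permutation moves elements between columns)
    return [[M[r][perms[r][j]] for j in range(len(M[r]))]
            for r in range(len(M))]

def permute_cols(M, perms):
    # perms[c][i] = source row of the element landing in row i of
    # column c (a column permutation moves elements between rows)
    return [[M[perms[c][i]][c] for c in range(len(M[0]))]
            for i in range(len(M))]

original = [[ 1,  2,  3,  4,  5],
            [ 6,  7,  8,  9, 10],
            [11, 12, 13, 14, 15],
            [16, 17, 18, 19, 20]]

first_rows = [[1, 4, 3, 0, 2], [4, 1, 3, 0, 2],
              [0, 2, 1, 4, 3], [2, 0, 3, 4, 1]]
columns    = [[0, 3, 1, 2], [0, 2, 3, 1], [2, 3, 0, 1],
              [3, 1, 2, 0], [1, 2, 3, 0]]
final_rows = [[4, 2, 3, 1, 0], [1, 0, 2, 4, 3],
              [0, 3, 4, 1, 2], [4, 2, 1, 0, 3]]

step1 = permute_rows(original, first_rows)   # Fig. 4-8(a)
step2 = permute_cols(step1, columns)         # Fig. 4-8(b)
step3 = permute_rows(step2, final_rows)      # Fig. 4-8(c)

assert step3 == [[ 8, 12, 20,  5,  2],
                 [13, 18, 19, 14,  6],
                 [10, 15, 17, 16,  4],
                 [ 3,  9,  7, 11,  1]]
```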
An equivalent algorithm is given by Opferman and Tsao-Wu in [12]. They are, however, concerned with switching signals instead of transmitting data blocks, but the analyses are identical. Thus any rectangular array of elements may be randomly permuted by a set of row permutations, a set of column permutations, and a final set of row permutations.

4.3 Permuting in a Memory Hierarchy

The preceding method is implemented for a head-per-track disk system by performing row permutations in intermediate (bulk storage) random access memory. The first row permutation is accomplished by accumulating blocks in an output buffer as they come from the primary memory. When a row is accumulated, these blocks start to be output to the disk in their permuted order. The final row permutations are performed in a similar manner. One row is accumulated in an input buffer before being routed to primary memory. The blocks of this row are then sent to primary memory in their permuted order.

Column permutations must take place between the time the blocks leave the output buffer for the disk and the time they arrive at the input buffer from the disk. They are accomplished by simply switching the heads of the disk appropriately. Fig. 4-8(a) shows a map of the allocation of the blocks on the disk. Since a column represents a particular angular area on a disk, the disk may read any of the blocks in a particular column with equal ease. Thus the sequence shown in Fig. 4-8(b) can be obtained by switching the read heads of the disk. Note, however, that the effect of column permutations may also be obtained by switching the write heads of the disk. Thus either the read or the write heads may be switched to achieve the effect of column permutations, whichever one would require the least switching time on a particular disk.

The amount of buffer memory required is simply that for a row of blocks in the input buffer and an equal amount for the output buffer. Since a row of blocks was previously defined to be a disk rotation, the required buffer memory for this scheme is 2Q.

4.4 Special Cases of Permutations - Addition and Deletion of Blocks

The previous sections have dealt only with permutations over the entire data base. All partitions over that base were assumed to be computed upon for each iteration. Real programs frequently do not access their data in that fashion. Some blocks may be abandoned after several iterations and/or some may first appear in the computation only after some number of iterations. This section deals with such modifications to the permutation algorithm.

The deletion of blocks after having been used for several iterations is a relatively straightforward procedure:

Deletion Algorithm: Assume that blocks are to be subsequently output (otherwise they are simply overlaid in primary memory). The resultant permutation matrix for the first iteration in which these blocks do not appear is generated to show these blocks at the end of all those blocks which will be used. That is, the blocks to be deleted occur in any sequence after the sequence of blocks which are not deleted. The permutation algorithm is used to obtain the row-column-row permutations as before, except that the deleted blocks are not picked up after the others. Thus, they are "shifted" (permuted) to the end of the line and then truncated from the sequence and left on the disk.
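The Deletion Algorithm's bookkeeping amounts to one list operation before the standard algorithm runs. The sketch below assumes block sequences are plain Python lists; the function name is mine.

```python
# Sketch of the Deletion Algorithm's bookkeeping: blocks making their
# last appearance are placed at the tail of the resultant sequence
# handed to the standard permutation algorithm, then never read back.

def resultant_with_deletions(still_needed, deleted):
    """Resultant sequence for the first iteration in which the
    'deleted' blocks no longer appear."""
    return list(still_needed) + list(deleted)

seq = resultant_with_deletions(still_needed=[8, 2, 5, 1, 7, 4, 6],
                               deleted=[3, 9])
# Run the standard algorithm on 'seq'; during the final set of row
# permutations the last len(deleted) blocks are simply not picked up,
# so they remain on the disk where the column permutations left them.
```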
Addition of blocks at iterations other than the first is accomplished in much the same way. The blocks to be added are placed on the disk at the locations from which they will be read for the first iteration in which they appear. These locations are easily obtained from the standard permutation algorithm.

Addition Algorithm: All blocks which will later be added to the set being permuted are entered in the set notationally by appending them at the end of the sequence of blocks which are being used. This is done for the last iteration before the new blocks are to be used, and they thus appear in the original permutation matrix. They are placed in the resultant permutation matrix at those places at which they are required by the processor. The permutation algorithm will generate the usual row-column-row permutations for all the blocks. The first row permutations are not carried out for the blocks which will be added, however. They are pre-placed on the disk at those locations which the permutation algorithm would have placed them had they been computed upon during the previous iteration.

Thus blocks may be added and deleted at any time during the iterative computations over the data base. As explained in the next section, however, blocks may not be deleted and later added by application of the deletion and addition algorithms.

4.4.1 Other Special Cases

When a block is deleted it is placed by the permutation algorithm at some particular angular position on the disk. This position may be anywhere insofar as the computational process is concerned because it is finished with this block. The permutation algorithm, however, puts it in one of a limited set of positions which are consistent with its goal of performing the necessary permutations. Blocks to be added, on the other hand, must be placed in particular angular positions on the disk before being input. In general, there is no way to insure that a block be abandoned at precisely that position from which it should be retrieved for subsequent iterations over the data base. In fact, cases can be constructed which preclude this fortuitous placing of blocks on the disk. Thus the procedures described above for additions and deletions do not permit the addition of blocks which have already been used and deleted.

There are however several ways in which this situation may be handled. For processes which do not require data at the disk's maximum rate, the following methods are feasible. (Processes which require the disk to be constantly transmitting data must have additions and deletions of their blocks adhere to the constraint implicit in the section on additions and deletions. That is, they may not delete a particular block and then add that same block several iterations later.)

When the process requires input at a lower rate than the disk's maximum capacity, data can be made to appear to "move" on the disk. This is accomplished by reading from the disk some data which are not required by the process for some time to come. These data are kept in buffer memory and subsequently output to their final destination on the disk, without having been processed. The input and output of these "moving" data are performed interstitially with the regular input and output of blocks for the current process.
Another way to handle the placing of deleted blocks so that they may subsequently be added is to simply keep the deleted blocks in buffer memory until the disk rotates to those positions corresponding to allowable locations for subsequent addition. This is possible because the buffers, if matched to the data rate of the disk, will be partly unused and hence available for temporary storage of blocks.

Yet another method which makes use of the fact that the process is not demanding data at the disk's maximum rate is useful for certain situations. Suppose the entire data base is made up of several variables over, say, a mesh. On successive iterations, different variables will be required as input to the process with perhaps some other variables in common. That is, some of these variables are required on every iteration while others are required only for selected iterations. One way to handle what would otherwise be a quite messy addition and deletion problem is to include all variables for a particular portion of the mesh in a block. The entire block is then retrieved as an input block while only the requisite variables are made use of (and perhaps updated). This will work when the virtual data rate made up by lumping variables in blocks is less than that of the disk's maximum capacity.

4.5 Combining Permutations with Origin Shifting and Density Buffering

The permutation algorithm stores blocks on the disk in such a way that it no longer is reasonable to consider them as a continuous sequence. Rather, a row, or rotation, of these blocks is input from several tracks on the disk. These blocks are further permuted before being used by the computational process. The concept of origin shifting deals with a continuous stream of data (in this case, blocks) which must be anticipated so that its origin is available to the processor when needed. Since the data blocks when being permuted have no real origin, there is no reason for the operation previously described as origin shifting. Before a process will be able to use its first row of input data, the blocks comprising this row must already have been read from the disk. This is because a row permutation must be performed first. It makes no difference at what angular position reading was initiated. The procedure is simply to begin transmitting the blocks from the disk to buffer memory one disk rotation before they must be sent on to the primary memory. At the end of one rotation, the necessary blocks will have been transmitted to the input buffer, regardless of the order in which they were transmitted.

The same lack of data origin allows for density buffering in much the same way. The algorithm for this is as follows:

Density Buffering Algorithm (Input): One disk rotation before a row of blocks is required in primary memory, transmission is initiated from the disk to the input buffer. This input ceases after a single rotation and commences again one disk rotation before the next row of blocks is required.

This will always allow for the permutation of data blocks in successive rows and incorporates what was previously referred to as density buffering as well as origin shifting. Note that this entire procedure is described as taking place in the input buffer. Thus, for input, no additional buffer memory is required than for simply the permutation algorithm. Output of data is handled in an identical fashion:

Density Buffering Algorithm (Output): When a rotation of data (at the density at which it is to be stored on the disk) is accumulated in the output buffer, it is output to the disk. The output buffer then continues to accumulate blocks until another rotation is collected.

In this way mismatches of data rates are provided for along with reorderings of the accesses of the data blocks.
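The input-side schedule implied by the algorithm above is simple enough to sketch. The times and the per-row demand period below are invented for illustration; only the one-rotation lead is from the text.

```python
# Input-side density buffering with permutations: each row of blocks
# begins transmission exactly one disk rotation before primary memory
# needs it.  Numeric values here are illustrative assumptions.

def input_schedule(row_needed_at, rotation_time):
    """(start_time, row) pairs telling the disk supervisor when to
    begin each disk-to-input-buffer transfer."""
    return [(t - rotation_time, row) for row, t in enumerate(row_needed_at)]

# A process consuming one row every 60 msec from a 40 msec disk:
needed = [0.100, 0.160, 0.220, 0.280]
for start, row in input_schedule(needed, rotation_time=0.040):
    print(f"row {row}: begin disk->buffer transfer at t = {start:.3f} s")
```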
The necessary buffer memory is still 2Q, i.e., Q each for the input and output buffers.

5. IMPLEMENTATION CONSIDERATIONS

5.1 Non-Zero Data Transmission Times

The methods used in this chapter and indeed those of the remainder of this paper assume no time to transmit data from one level of memory to another. This assumption is made simply to facilitate explanation of the techniques developed. Taking into account the effects of real transmission times is relatively straightforward: instead of retrieving data within a rotation of when it is required, data are retrieved within a rotation plus the total transmission time between the disk and the processor. That is, the transmission times between the disk and buffer memory and between buffer memory and primary memory are added. This time is added to the time of a single disk rotation and the result is the new "lead time" which determines how soon before it is needed that data blocks are retrieved. This will always be a very small correction to the time relative to a disk rotation. Nevertheless, to be totally correct, the size of buffer memory should also be augmented by an amount equal to the amount of data which will be read in this additional increment of time.

5.2 Memory Sizes and Data Rates

An upper bound on the size of the buffer memory was determined to be 2Q. Q is the maximum amount of data which may be transmitted by the disk in a single rotation. Fig. 5-1 shows some of the larger disks currently available and the values of Q to which they correspond.

[Fig. 5-1: disk rotation time (msec) plotted against Q for a number of current disks]

Note that Fig. 5-1 includes movable head disks as well as head-per-track disks. This is not intended to imply that the algorithms presented in this paper are directly applicable to movable-head disks. Rather, these entries on the graph are meant to be indicative of current disk parameters. It is noteworthy that essentially all of the current production disks yield a Q < .5 Megabit. Furthermore, most of these are under .25 Megabit.

The preceding leads to the observation that a separate buffer memory may in fact not be necessary: for 32-bit words, the "buffer" memory requirement need be in the neighborhood of 8K to 16K words. Thus, for a system comprised of a processor and any standard disk, the input and output buffers may well be sections of primary memory.

Not all processor/disk systems can operate in this fashion, however. Some configurations may contain processors which do not provide for the necessary additional primary memory. Another possibility is a system making use of the type of disk which is currently being manufactured for the ILLIAC IV computer by the Burroughs Corporation. These are head-per-track disks with a period of 40 msec and a data rate of 500 Megabits per second. Q for such a disk, then, is 500 x 10^6 bits/second x 40 x 10^-3 seconds = 20 x 10^6 bits. The necessary buffer memory for a system making use of such a disk is 40 Megabits if the processor is likely to require data at the disk's rate.

Most extant processors, however, will require data at a somewhat lesser rate. Fig. 5-2 shows just what this data rate would be for various "fast" processors as a function of the number of operations per variable which will occur during the computational process. The size of each buffer is then determined by these data rates.

[Fig. 5-2: required data rate (bits/second) for various fast processors versus operations per variable, with the corresponding Q in bits for the 40 msec ILLIAC IV disk]
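As a quick numerical check of these figures (a sketch; the disk parameters are the ones just quoted):

```python
# Q = maximum data transmittable by the disk in one rotation, and the
# 2Q buffer bound, for the disks discussed above.

def q_bits(data_rate_bits_per_sec, rotation_time_sec):
    return data_rate_bits_per_sec * rotation_time_sec

illiac_q = q_bits(500e6, 40e-3)     # 20e6 bits = 20 Megabits
buffer_bits = 2 * illiac_q          # 40 Megabits at the disk's full rate

# For a production disk with Q = .25 Megabit and 32-bit words, the two
# buffers together need about 16K words (the upper end of the 8K-16K
# range noted above):
words = 2 * 0.25e6 / 32
```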
5.3 Speed of the Buffer Memory

The required bandwidth of the buffer memory is the sum of the data rates along all paths to and from it. Fig. 5-3 schematically shows the buffer memory and its data paths. Thus the buffer memory need have a bandwidth of exactly four times that of the disk.

[Fig. 5-3: the buffer memory and its four data paths - disk to input buffer, input buffer to primary memory, primary memory to output buffer, and output buffer to disk]

Note that implicit in Fig. 5-3 is the assumption that data are routed to the primary memory from the buffer memory in blocks. These blocks need not, however, be identical to those which are recorded on the disk. Although this paper does not mention the need for such procedures, the buffer may be used as a staging area for reformatting data. If it could provide the data at the proper rates, there is no reason why it could not clump k blocks from the disk into a single block which gets routed to primary memory. Similarly, blocks from primary memory could be clumped into larger blocks for disk storage.

5.4 Implementation of Row Permutations

As stated in the beginning of this paper, the algorithms described in this paper are predicated upon knowledge at compile or run time of what the data accesses will be. It is a relatively straightforward procedure to determine at that time what the input and output sequences and permutations will be. This is done by making use of the known data block access sequences and the timing characteristics of the processor and disk.

The i/o sequences and permutations obtained before actual execution may well take up more than the available primary memory. How, then, can this information be stored so as to be made use of during execution? One way is as follows. For each disk rotation during execution while permuting blocks, the following information must be available to the system:

i) a sequence listing the (at most h) tracks from which the next rotation's blocks will be retrieved. These can be track numbers without sector information, but if the entire disk address is used, it provides complete information about the next set of inputs from the disk as well as synchronizing the actual computation process with the disk.

ii) a similar sequence to the one above for output addresses.

iii) two sets of permutations - one for the input buffer and one for the output buffer. Each of these is a permutation of h elements.

Thus each permutation may be (wastefully) stored as a string of h integer numbers. Suppose further that a whole word is to be used for each such number. Then i), ii), iii) above consist of at most 4h words. h is bounded by the number of addressable sectors on the disk, but will obviously be typically somewhat less. For h = 64, a block of 256 words would be necessary for each disk rotation of data. A block such as this could simply be stored on the disk interstitially with respect to the data blocks. It would appear once for each rotation of data which will be retrieved. Note that in the case of density buffering, n rotations of data may actually be retrieved in more than n physical disk rotations. Thus these i/o blocks are, in general, retrieved at a frequency less than one per disk revolution. Each such block contains essentially an input request queue, an output queue and two permutation sequences.
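The per-rotation i/o block might be represented as follows. The field names are mine; the text fixes only the contents - at most h input addresses, h output addresses, and two h-element permutations, i.e. at most 4h words.

```python
# A sketch of the per-rotation i/o block described above, assuming one
# whole word per entry.

from dataclasses import dataclass
from typing import List

@dataclass
class RotationIOBlock:
    input_tracks:  List[int]  # tracks supplying the next rotation's blocks
    output_tracks: List[int]  # destinations for the blocks being output
    input_perm:    List[int]  # row permutation applied in the input buffer
    output_perm:   List[int]  # row permutation applied in the output buffer

    def words(self) -> int:
        # at most 4h words: h + h input/output addresses plus two
        # h-element permutations, one word per entry
        return (len(self.input_tracks) + len(self.output_tracks)
                + len(self.input_perm) + len(self.output_perm))
```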
These sequences may be assigned as descriptors or tags to their respective blocks as they enter the buffers, or they may be encoded in tables by which the system will route the blocks and thus effect the permutations.

Alternatively, routing may be accomplished by hardware techniques. One possible method is through the use of a Batcher network [2]. In such a system, each data block will enter an input or output buffer with a tag which indicates what its permuted position should be. The transmission from these buffers will be through a Batcher network and thus automatically perform the row permutations. In such an implementation, each data block need only store a single address for each row permutation it undergoes - that one which indicates its permuted destination.

Requests for rows of data blocks may be handled by hardware means also. At the time at which a new row must start to be input (one rotation plus transmission times before it is needed), the addresses of the blocks comprising that row are sent to a queuer of the type manufactured by Burroughs Corporation for use in conjunction with the ILLIAC IV disks. These queuers will cause the requests to be serviced in the fastest order possible. In this case, then, the order of the block requests may be relative to the arbitrary data origin on the disk but the blocks will start to be input immediately in a "wrap-around" sequence. The destination tags on each block will then insure that the row permutation is accomplished correctly.

One obvious observation which may be made at this point is that many permutation patterns will be cyclical. That is, there may be two basic regular accessing patterns which alternate over the data base. In such cases, the use of unique i/o blocks for each rotation of data would probably be unnecessary. A periodic pattern of accesses would probably suffice.

5.5 Disks Which Do Not Read and Write Simultaneously

The algorithms and methods presented in this paper have been predicated upon the existence of a disk which can simultaneously read and write (although not with the same head or on the same track). That is, data blocks are assumed to be read from the disk according to a particular algorithm and written on the disk according to some other algorithm. Never, however, has this paper been concerned with constraints upon the times when reading and writing may occur. Indeed it is probably less efficient to introduce such constraints than to assume concurrent read/write capabilities of the disk and then simulate this facility on the standard disk.

The mechanism by which a disk can be made to appear to both read and write at the same time is relatively straightforward. Each physical track on the disk is divided into an even number of sectors. The size of these sectors is such that each may be addressed as a unit. For reasons which will shortly become clear, these sectors should also be relatively small within the limits of the above constraint. These sectors are then considered as adjacent pairs. Thus from some arbitrary rotational origin on the disk, each physical track will consist of some number of pairs of addressable sectors. As the reader might expect, each pair is considered to be a read/write unit. That is, for each track-sector pair, one sector will be used to write data on and the other to read data from. In this way a disk i/o supervisor can at a macroscopic level appear to both read and write at the same time while in reality it is only reading (at most) half of that time and similarly writing.
Note that this method cannot assign particular track-sectors permanent read status and others permanent write status. Adjacent pairs of these units are, as a pair, assigned the duty of splitting up the read/write responsibility. This is because a track-sector which is at one time used to write data to must (if it is to be useful) later be used to read data from. After data is read from a particular track-sector it may be considered a write unit. If it is the first half of a read/write pair, the second half will already be a write unit and may be thus used. If the second half of the pair is not immediately used, either unit is available for a write. As soon as one half of the pair gets written on, however, it becomes the read portion. Thus available unused i/o pairs may be thought of as write/write pairs. Write portions of a read/write unit may only be used (written in) when an adjacent input will read the contents of its read portion. That is, we disallow read/read pairs. In this way, read/write pairs are maintained as either read/write pairs (or released by being converted to write/write pairs) although the assignment of the read or write duties will change within the pair.

As an algorithm, this can be stated as: the current write sector is the mate to the current read sector (which is uniquely defined by the algorithm which provides input data to the processor's memory). If, as happens in some cases, more output data are generated than the input data being used, the output destinations are assigned from sector pairs on the disk which are not currently being used.

Note that this method of making the disk appear to both read and write will impose some additional restrictions. Since, in reality, data blocks are not being both read and written at the same time, an additional amount of buffering will be necessary. For example, a situation in which blocks are assumed to be directly output to the disk will, in reality, require the buffering of an amount of data equal to that contained in one half of a read/write pair. Typically this should be the size chosen as a data block, and an additional amount of buffer memory equal to two data blocks (one for each buffer) in size is added to the buffer memory requirements. This additional amount of memory is incremental relative to the 2Q already required.

A more serious consequence, however, comes from the fact that the disk now cannot be read at its maximum data rate. Similarly it cannot be written to at this maximum data rate either. Thus the effective data transfer rate for disks being used in the above fashion is one half that of their maximum transfer rates. This has the desirable effect of halving the necessary buffer memory but the unfortunate effect of precluding zero latency solutions for processes which require data at rates greater than the new effective data transfer rate. The result of this may be the requirement to use a disk with a higher data transfer rate to simulate one with read/write capabilities at a lesser data transfer rate.

One more consequence of this simulation is the relatively minor effect of the decreased data storage capacity which comes about from partitioning the disk. Each access from disk may require some unused time (= space) for switching between heads on the disk. The read/write pairs may be chosen to be relatively small due to buffer memory considerations. Their small size, however, will cause more data transmission initiations and hence a small degradation of the disk's data handling abilities.
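The pairing rule lends itself to a small state sketch. The class and method names are mine; the invariant implemented is the one stated above - a pair is either read/write or write/write, never read/read, with the roles swapping within the pair.

```python
# Sketch of the sector-pair bookkeeping for simulating simultaneous
# read and write on a standard disk.

class SectorPair:
    def __init__(self):
        self.read_half = None          # index (0/1) of the half holding
                                       # unread data; None => write/write

    def write_new(self, half=0):
        """Write into a free (write/write) pair; the written half
        becomes the read portion."""
        assert self.read_half is None, "pair already holds unread data"
        self.read_half = half

    def read_and_write(self):
        """Simultaneous i/o on one pair: read the read half (freeing
        it) while writing its mate, which becomes the new read half."""
        assert self.read_half is not None, "nothing to read"
        self.read_half = 1 - self.read_half   # roles swap within the pair

    def read_only(self):
        """Read without a paired write: the pair reverts to
        write/write and is available for reassignment."""
        self.read_half = None
```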
5.6 Real Data Block Spacing and Synchronization

Throughout this paper we have relied on the predictability of execution times of the computational processes. This has permitted the spacing between data blocks such that their retrieval rate from the disk exactly corresponded to their computation rate by the processor. Any real processor, however, cannot be relied upon to be totally predictable to the microsecond level (which may well be called for). How, then, do such methods as those presented here remain practical for real processors?

The above question is in part answered by the fact that the buffer memory acts as a cushion to allow a slight drift in execution time of a process. If the process takes less than its predicted time for some data block, the next block is already in buffer memory when the first is completed. If the processor consistently takes less than the predicted time, eventually the data available in the buffer will be depleted and the processor will become synchronized to the new input as it becomes available. In such a case, the prediction of compute time was probably in error. In the case of the processor continuously taking too much time, the buffer will eventually fill to capacity. At this point a supervisor program should terminate reading from the disk to the input buffer for a rotation. Again, this is the result of an erroneous prediction of the computation time.

In either of these two cases, no severe problem develops. The processor eventually computes at the slower of its natural rate and that of the input data available to it. In the case where the processor computes at a rate slower than it could if its input data were always available, the slowdown is essentially caused by rotational latency. Note, however, that in such a case the data rate and process are still "closely" matched. This has the consequence of introducing much less latency than if no preplanning had been undertaken. Thus "bad" predictions of compute times still yield "good" results relative to no predictive planning.

5.7 Obtaining Compile-Time Information on Data Accesses

Pre-run information on the sequences of data blocks to be required by a program could be obtained by modifying an analyzer of the type described in [10]. The analyzer could be modified to obtain possible sequences over a process's input data. A sequence would be a sequential "trace" over the computations which the analyzer currently finds may be performed in parallel. Accurate timing information may be obtained at this same time. For each computational process, the particular operations per datum (or per mesh point in some cases) are obtained. (Note that in some cases this information may be dependent upon the block size also, thus yielding the number and types of operations per datum as a function of the block size.) The compute time per block then comes from computing the actual time to perform the particular operations. This should include any non-processing time taken for retrieval from primary memory, data alignment, etc. Such timing information can obviously also be obtained empirically by running a program or portion thereof beforehand.

5.8 The Effects of Block Size on Memory

For some processes, the block size chosen will have an effect on the buffer memory and disk characteristics. This occurs when the number of operations performed upon each variable, while constant for an entire iteration over the data base, is related to the block size.
That is, as the block size is varied the number of operations performed per variable each time a block is accessed will also vary. An example of this is matrix multiplication.

5.8.1 What the Effect Is

In multiplying two n by n matrices, a total of n^3 multiplications will occur. If the blocks consist of k by k partitions of the matrices, there will be k^3 multiplications for each access of a block from each matrix. There will be (n/k)^2 blocks and each block will be accessed n/k times. (The number of multiplies per block accessed times the number of accesses per iteration times the number of blocks in a matrix gives n^3 multiplications.)

The required input bandwidth B_w for this problem with the above partitioning scheme is approximated by considering only the multiplications:

    B_w ~ [(3 blocks/multiplication) x (size of each block)] / [(number of multiplies per block pair) x (time/multiply for particular machine)]

Letting t_m represent the processor's multiply time,

    B_w ~ 3k^2/(k^3 t_m) = 3/(k t_m).

If the block size, B_L, were increased by doubling the linear dimension of the partition, we obtain

    B_L' = 2k by 2k = 4k^2
    (2k)^3 = 8k^3 multiplications per block
    (n/2k)^2 = (1/4)(n/k)^2 blocks per matrix
    n/2k accesses of each block

There will be three blocks input for each multiplication when the partial sum from a previous multiplication must be retrieved and updated after each current multiplication.

    B_w' ~ 3(4k^2)/(8k^3 t_m) = 3/(2k t_m)

Thus the bandwidth is halved by quadrupling the block size. This follows by recognizing that matrix multiplication is an N^(3/2) process. That is, for N elements in each matrix (N = n x n) there are N^(3/2) multiplications over the entire matrix multiplication. For K (= k x k) elements per block, there are K^(3/2) multiplications, and the (input) bandwidth, which is measured in bits or words per second, comes from

    B_w = 3K/(K^(3/2) t_m) = 3/(K^(1/2) t_m).

For an N^(3/2) process, then, increasing the block size by a factor of P will result in a decrease of the necessary buffer memory bandwidth by 1/P^(1/2). In general, for an N^r process (r > 1) an increase of block size by a factor of P will reduce the required bandwidth by the factor P^(r-1). (We make the assumption here that the required output bandwidth is equal to that for input. This is very conservative since few operations "generate" more new data than they "consume.")

Increasing the block size for an O(N^r) process (r > 1) has several effects on the various memories in the hierarchy. The most obvious of these are:

i) As previously shown, the required bandwidth from the buffer memory (as well as the disk) is reduced, and

ii) The size of buffer memory decreases. This is because its required size comes directly from the bandwidth and the disk's rotation time: buffer memory = 2Q = 2 B_w T_d.

At first, then, it may seem that things continue to improve as the block size gets bigger, or, for r = 1, at least remain constant. Why not just make it as big as possible and thereby reduce the required buffer memory size and the buffer memory and disk's bandwidth? This turns out not to be a successful approach for all values of r for two reasons:

i) The buffer memory must be at least large enough to contain a single block. Thus, the quantity 2Q is insufficient when B_L > 2Q.

ii) The primary memory will have to hold a few blocks. Typically its requirement will be about three to five blocks.

One further observation should be made at this point. When the block size (or the combined size of two blocks in the case of a binary operation such as matrix multiplication) exceeds Q, there is no longer any reason to make use of the algorithms developed in this paper. An input buffer equal to the block size is sufficient to eliminate all latency. This is accomplished by the straightforward procedure of picking up each block (or pair of blocks) in the last rotation before they are needed. Output blocks are written directly on the disk. This will work for random accesses over the entire data base as well as for different speeds of the computational processes which operate on the blocks. The curve in Fig. 5-4 depicts the required buffer memory at that point.

[Fig. 5-4: buffer memory size versus block size. The decreasing portion (2Q) is due to the change in bandwidth; a discontinuity occurs where the buffer memory can accommodate exactly one rotation of data; past it, buffer memory = 1 block, increasing as the block size increases.]
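These bandwidth relations are easy to check numerically. The sketch below follows the formulas just given; the 100 nsec multiply time is an assumed value, not one from the text.

```python
# Check: for an N^(3/2) process, quadrupling the block size (doubling
# the linear dimension) halves the required input bandwidth.

def input_bandwidth(K, t_m, s=3, r=1.5):
    """Words/second: s blocks of K words each per K**r operations of
    t_m seconds apiece."""
    return s * K / (K ** r * t_m)

t_m = 100e-9                                 # assumed multiply time
b1 = input_bandwidth(K=64 * 64,   t_m=t_m)   # k = 64
b2 = input_bandwidth(K=128 * 128, t_m=t_m)   # linear dimension doubled
assert abs(b1 / b2 - 2.0) < 1e-9             # bandwidth is halved
```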
When the block size (or the combined size of two blocks in the case of a binary opera- tion such as matrix multiplication) exceeds Q,, there is no longer any reason to make use of the algorithms developed in this paper. An input buffer equal to the block size is sufficient to eliminate all latency. This is accomplished by the straightforward procedure of picking up each block (or pair of blocks) in the last rotation before they are needed. Output blocks are written directly on the disk. This will work for random accesses over the entire data base as well as for different speeds of the computational processes which operate on the blocks . The curve in Fig. 5-k depicts the required buffered memory at that point . We define an s-ary process to be one in which s data blocks are required to be in primary memory for the process to compute. As an example, matrix multiplication of whole matrices or single partitions is a binary process. Matrix multiplication of partitioned matrices (which includes the addition of partial sums as well as the multiplication process) may be con- 89 UJ M >- (T O UJ 2 sr UJ 3 CD DECREASING PORTION DUE TO CHANGE IN BANDWIDTH (20) DISCONTINUITY WHERE THE BUFFER MEMORY CAN ACCOMODATE EXACTLY ONE ROTATION OF DATA. BUFFER MEMORY = 1 BLOCK INCREASING AS BLOCK SIZE INCREASES. BLOCK SIZE -> Fig. 5-k sidered a ternary process. (Actually it is only macroscopically a ternary process while locally it is a (binary) multiplication process followed by a (binary) addition process but it is the macroscopic level with which we are concerned.) Since the memory and bandwidth requirements go up as the number of blocks which must be accommodated at a single time, we define the virtual block size , B , of a process to be the number of blocks, s, which must be co- resident in primary memory times the size of each block, B T . Thus, B = sB_ . h V Jj In the equations which follow, buffer memory is indicated by the symbol B . From Fig. 5-4 it is clear that if there are no additional consider- ations, the optimum block size to minimize the size of buffer memory is simply that which yields one block retrieval from the disk on each rotation. 90 An 0(Nj process is one in which the dominant operation is performed aB times for each virtual block (typically, 1 < a < 10) we denote the time to perform this operation by t . The necessary input disk bandwidth is the number of data words per computation on a block divided by the number of dominant operations in a block computation times the time for that operation. SB L aB T t aB T t L p L p Buffer memory = 2Q for the left portion of the curve in Fig. 5-U 2sT B = 2Q = 2B T, = =■ — for B T < Q m w d T,r-1, L aB T t L p and B = B T for B T > m L L and we obtain the point of discontinuity in terms of the block size: sT B T = Q = B T , = L w d T,r-1, aB_ t L p sT\ r d or, B_ = — r— . L at P 2T That is, for B^ < ^- , B m = 2Q P 2T and for B^ > — E , B m = B L< P 91 If primary memory contains, say, p blocks, then its size is pB (s < p < 10) and one can, then, compute the block size which actually minimizes the com- bined cost of buffer and primary memory. If the buffer memory cost is d/(unit of memory) and the primary is c/(unit of memory), we want the minimum of the function r 2sT. cost = d | r-1 i aB T t L L p_j + c[pB L ] This minimum will always occur before the discontinuity in buffer memory size. We prove this by solving for the minimum. Solve for the deriative of the cost function with respect to block size and set to zero. ^g ost > = -(r-l)d oB_ -L 2sT. 
    d(cost)/dB_L = -(r-1) d[2s T_d/(a B_L^r t_p)] + cp = 0

which, with suitable algebraic manipulation, yields

    B_L^r = 2s T_d d(r-1)/(a t_p c p)

or

    B_L = [2s T_d d(r-1)/(a t_p c p)]^(1/r).

The discontinuity from our earlier calculations is at

    B_L = [s T_d/(a t_p)]^(1/r).

The minimum of the cost function occurs at

    B_L = [s T_d/(a t_p)]^(1/r) [2d(r-1)/(cp)]^(1/r)

and we want to consider the possible values for [2d(r-1)/(cp)]^(1/r). We note first that [s T_d/(a t_p)]^(1/r) is positive (since r > 1) and that each of the terms is also. The quantity d/c will always be < 1; in fact it will almost always be in the neighborhood of 0.1. For most practical algorithms, r will not be larger than 2 or 3; let us take 4 as an upper bound, so that r - 1 < 3. As mentioned earlier, 1 < s < p < 10, and if we take the very conservative value of 1 for p, the above has an upper bound of

    [(2)(0.1)(3)]^(1/r) = [0.6]^(1/r)

and since this is < 1 for our conservative estimates, the minimum of the cost function will always occur for a smaller block size than that at which the discontinuity occurs.

Fig. 5-5 demonstrates some of these relationships for the following values:

    s = 1
    p = 5
    t_p = 30 nsec
    T_d = 40 msec
    r = 3/2
    a = 3
    d = $.12/word (about $.002/bit for a 64-bit word)
    c = $1.20/word (about $.02/bit for a 64-bit word)

[Fig. 5-5: buffer memory size and combined memory cost versus block size for the values above]
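A worked check of these formulas, using the Fig. 5-5 values above (a sketch; everything follows directly from the equations in the text, with units of words and seconds):

```python
# Cost-minimizing block size versus the buffer memory discontinuity.

s, p = 1, 5
t_p  = 30e-9             # dominant operation time (30 nsec)
T_d  = 40e-3             # disk rotation time (40 msec)
r, a = 1.5, 3
d, c = 0.12, 1.20        # $/word for buffer and primary memory

# discontinuity: B_L = (s*T_d/(a*t_p))^(1/r)
b_discontinuity = (s * T_d / (a * t_p)) ** (1 / r)

# cost minimum: B_L = (2*s*T_d*d*(r-1)/(a*t_p*c*p))^(1/r)
b_minimum = (2 * s * T_d * d * (r - 1) / (a * t_p * c * p)) ** (1 / r)

assert b_minimum < b_discontinuity     # the minimum precedes the jump
# roughly 430 words versus roughly 5800 words with these values
```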
This might be done by storing several parti- cular data block sequences at particular angular positions on the disk when it has been predicted that one of them (but not which one) will be needed at that point. A method which was effective for this type of imperfect infor- mation would require a less sophisticated analyzer than that of the methods in this paper to extract the i/O information before run time, and would probably also be more useful for a multi -programming environment. The methods employed to effect the permutations of data blocks can be applied to memory configurations to obtain hardware methods of reordering, retrieving and/or storing data in real time. Thus data could be made to migrate on a disk through basically hardware means. 98 APPENDIX A O O « _l a ui -D LL O o in ■— >>» in -j ■v in a. >. •• o u. O ae • IP 2 o uj ae z x z o z — UJ in 1/1 at: z u. H_ > CO *• (\j z O ■— « o o <0 X u. X > < a ii >»■ 3 zu ae LU LU • -J ¥ < ae i- ae X -> >0 o < z -i m* » o t- Q LU ZOO LU < U, "* o Q z < LU X m o. >» z i— at a o o i/) -1 < • xo ae (T\ M a to X o UL — < o oac LU z o < K ■■* O X o 1/1 O 1 ►- Q ►- • • ►- 1— _J < z LU Z „ ,, f-4 1 -i > —t _) C5 m ae Z «n II -* ♦ < LO < z ae ae ae ** Ml • » -5 o ^ «• a! a — t- O < CD Z « A a «« ae Q- 1/1 a o LL *• • » o 0m a 3 < X z z o a a z «* o * -i ox a z LU to LU ►- ■ a •. ae z o 3 a — X 3 X o uo o i LU _J l/l * K X ■* ^ »- > Mi _i o ae V Q O LU O t- LL n < o a. h- CO o ae < i/> ■■4 f-l ae II X — •p 0. < ae ae LU X -» II ae -^ o t- □ w z o o •- • 1- o o o •» -1 < M ae < z 3 loZ o * *: < fO a -» «■• • • mm X LU z LU < o o X » ■ o K z z K» •» ii t» ae < O a. < 1/1 3 o «• > a Z LU < — < — . ^ «« z> i/> O CO M D m z< LU X o X w ^ o •• h- t- Q I o Z LO •^ ae -» Z < K -^ o OQ. < 3 OO _l < * M z >- Oae •^ o t- U3 x a JO z a. ae QC -I z < h- < » a. .^ o -> H O •• 1) z a ae z < X o z M z » ♦ — t o » — _J o — i-m LU LU — h- o -J p4 ^ M a ■fa MM II -1 2 xae H — Q CO t- • O — o •> H 3 h- L> M X r o o rg i/> z o 3 I X t- X o -* ->a < • •• «0 o z ae aeo < X 1- ae o -» o o i_> lux aei^tu c< vi vi a. a O a. xhZOIUIUUI^UOU^"" 1 " IX XXrj1JMMMM-r«-cotT>o-4jm'*'m^)eoo , >0»<'Mfn<»-(nj<»>(*ie»><«ifim«n"ifO«'>'#^ INPUT ARRAY 99 COMPONENT ARRAYS 100 LIST OF REFERENCES [1] J. Abate and H. Dubner, "Optimizing the Performance of a Drum-Like Storage," ] EEE Transactions on Computers , vol. C-l8, pp. 992-997, November 1969. [2] K. E. Batcher, "Sorting Networks and Their Applications," Proc . SJCC 1968 , pp. 307-31^. [3] Bruce A. Bernott, "Disk i/O for Non-Core -Contained P.D.E. Meshes and Arrays," University of Illinois at Urbana-Champaign, Department of Computer Science Report No. 311, March 1969. [h] E.G. Coffman, Jr., "Analysis of a Drum Input/output Queue Under Scheduled Operation in a Paged Computer System," Journal of the ACM , vol. l6, pp. 73-90, January 1969. [5] S. H. Fuller, "An Optimal Drum Scheduling Algorithm," Technical Report No. 12, Digital Systems Laboratory, Stanford Electronics Laboratories, April I97.I. [6] A. Gill, "The Optimal Organization of Serial Memory Transfers," IRE Trans - actions on Electronic Computers , vol. EC-9, pp. 12-15, March i960. [7] B. Gordon, "An Optimizing Algorithm for the IBM 65O, " Journal of the ACM , vol. 3, pp« 3-5, January 1956. [8] M. Hall, Jr., Combinatorial Theory . Waltham, Massachusetts. Blaisdell, 1967. [9] D. Knuth, "Minimizing Drum Latency Time," Journal of the ACM , vol. 8, pp. 119-150, April 1961. [l0] D. J. Kuck, Y. Muraoka and S. C. 
Chen, "On the Number of Operations Simul- taneously Executable in FORTRAN-Like Programs and Their Resulting Speedup, " to be published in IEEE Transactions on Computers . [ll] A. C. McKellar and E. G. Ccffman, Jr., "Organizing Matrices and Matrix Operations for Paged Memory Systems," Communications of the ACM , vol. 12, pp. 153-165, March 1969- [12] D. C. Opferman and N. T. Tsao-Wu, "On a Class of Rearrangeable Switching Networks Part I : Control Algorithm, " The Bell System Technical Journal , vol. 50, pp. 1579-1600, May- June 1971. [13] H. J. Ryser, Combinatorial Mathematics. New York: Wiley, 1963. BLIOGRAPHIC DATA IEET 1. Report No. UIUCDCS-R-72-522 I Title and Subtitle Elimination of Rotational Latency by Dynamic Disk Allocation 3. Recipient's Accession No. 5. Report Date May 2U, 1972 6. ] Author(s) David E. Gold 8. Performing Organization Rept. N °- UIUCDCS-R-72-522 f Performing Organization Name and Address I University of Illinois at Urbana-Champaign I Department of Computer Science j Urbana, Illinois 61801 10. Project/Task/Worlc Unit No. 11. Contract /Grant No. US NSF GJ 27^1+6 I. Sponsoring Organization Name and Address I National Science Foundation Washington, D .C . 13. Type of Report & Period Covered Doctoral Dissertation 14. . Supplementary Notes Abstracts In Input/Output bound computer systems which use disks for back up memory, rotational latency is frequently the source of the i/O boundness. This paper presents several methods for eliminating rotational latency in head-per-track disks or equivalent memory devices. These methods assume pre-run time knowledge of the i/O sequences required by a particular program which will be run on the system. They work by constantly reorganizing the data on the disk during execution of the program. Measures are obtained of certain system parameters which indicate the requirements for use of these techniques and the various tradeoffs may be examined and quantified. These results are useful both for designing a new operating system and for removing latency from an already existing operating system 1 Key Words and Document Analysis. 17a. Descriptors Magnetic Disks Magnetic Drums Rotational Latency Dynamic Storage Allocation Memory Hierarchies Buffer Memory Permutation Networks ». Identifiers/Open-Ended Terms c. COSATI Field/Group Availability Statement Release Unlimited RM NTIS-35 ( 10-70) 19. Security Class (This Report) 20. Security Class (This Page UNCLASSIFIED 21. No. of Pages 106 22. Price USCOMM-DC 4032S-P7 1 %/> ■:• mn? no *>' 0l y ■ i -r ■ vv*r i ■ ■ HI nl VBt, n 1 1 ■ HI ■ ■ :;■:■'•■•■« ■ 1 1 B9 II B ■ m mm 0} sun H 533m H jbhhmmmhBJIH ESSC IBS < <>*! WBiwm HH H ■■■ BHH1 RaEI