OBmBKUOBtS/UKS, 
 
 BnlwDHmlSMl 
 
 HBHHHuBBnmfflnHiuii 
 
 1^ lira 11 sign 
 
 SKI Bf&fffl 
 
 Hi 
 
 HnHMI/MnSUuiJIMifRlU 
 
 ■HOP 
 
 ■11 
 
LIBRARY OF THE 
 
 UNIVERSITY OF ILLINOIS 
 
 AT URBANA-CHAMPAIGN 
 
 cop. 2. 
 
Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/eliminationofrot522gold 
 

 Report No. UIUCDCS-R-72-522 
 
 /7/sLU~ 
 
 r 
 
 ELIMINATION OF ROTATIONAL 
 LATENCY BY DYNAMIC DISK ALLOCATION 
 
 by 
 
 David E. Gold 
 
 May 1972 
 
ELIMINATION OF ROTATIONAL 
 LATENCY BY DYNAMIC DISK ALLOCATION* 
 
 By 
 
 DAVID E. GOLD 
 
 May 2k, 1972 
 
 Department of Computer Science 
 
 University of Illinois at Urbana-Champaign 
 
 Urbana, Illinois 618OI 
 
 Report No. UIUCDCS-R-72-522 
 
 * This work "was supported in part by the National Science Foundation 
 
 under Grant No. US NSF GJ 2 r (hk6 and was submitted in partial fulfillment 
 of the requirements for the degree of Doctor of Philosophy in Computer 
 Science, June 1972. 
 
ELIMINATION OF ROTATIONAL 
 LATENCY BY DYNAMIC DISK ALLOCATION 
 
 David E. Gold, Ph.D. 
 
 Department of Computer Science 
 
 University of Illinois at Urb ana -Champaign, 1972 
 
 In Input/Output bound computer systems which use disks for back up 
 memory, rotational latency is frequently the source of the i/o boundness. 
 This paper presents several methods for eliminating rotational latency in 
 head-per-track disks or equivalent memory devices. These methods assume 
 pre -run time knowledge of the i/o sequences required by a particular program 
 which will be run on the system. They work by constantly reorganizing the data 
 on the disk during execution of the program. Measures are obtained of 
 certain system parameters which indicate the requirements for use of these 
 techniques and the various tradeoffs may be examined and quantified. These 
 results are useful both for designing a new operating system and for removing 
 latency from an already existing operating system. 
 

 Ill 
 
 ACKNOWLEDGMENT 
 
 I wish to thank my advisor, David J. Kuck. Without his constant 
 questions, suggestions and disagreements neither this paper nor the research 
 contained in it would have emerged. I thank also the National Science Founda- 
 tion and the Department of Computer Science at the University of Illinois for 
 their support during the evolution of this research. 
 
 And finally I wish to thank Diana Mercer and Vivian Alsip for their 
 fine job of typing the manuscript from my semi-legible and frequently changing 
 handwritten original. 
 
 
IV 
 
 TABLE OF CONTENTS 
 
 1. INTRODUCTION 
 
 Page 
 
 1.1 The Problem X 
 
 1.2 Terminology and Notation 6 
 
 2. MESH PROBLEMS 10 
 
 2.1 Introduction 10 
 
 2.2 Single Process Programs 11 
 
 2.3 Two Computational Process Programs 22 
 
 2.3.1 Density Buffering, fast process to slow process 
 
 transition 25 
 
 2.3-2 Density Buffering, slow process to fast process 
 
 transition 30 
 
 2.3.3 Origin Shifting with Density Buffering 31 
 
 2.4 Generalization to N- Processes 3!+ 
 
 2.4.1 Data Rate Considerations for Some n- Process Problems. 36 
 
 2.4.2 Sweeps of Less Than a Disk Rotation in Duration ... 1+0 
 
 3- OTHER SWEEPING PROBLEMS 1+2 
 
 3.1 Introduction 1+2 
 
 3.2 Generalized Data Transmissions 1+2 
 
 3-3 Non- Standard Sweeping Patterns I4.7 
 
 3*4 Discussion 5IJ. 
 
 4. ARBITRARY PERMUTATIONS OF BLOCKS BETWEEN SWEEPS 55 
 
 4.1 Definition of the Problem 55 
 
 4.2 The Algorithms for Arbitrary Permutations 59 
 
 4.3 Permuting in a Memory Hierarchy 67 
 
 4.4 Special Cases of Permutations - Addition and Deletion of 
 Blocks 68 
 
 4.4.1 Other Special Cases 70 
 
 4.5 Combining Permutations with Origin Shifting and Density 
 Buffering 72 
 
 5. IMPLEMENTATION CONSIDERATIONS 74 
 
 5-1 Non- Zero Data Transmission Times 7I+ 
 
 5.2 Memory Sizes and Data Rates 75 
 
 5-3 Speed of the Buffer Memory 78 
 
 5.4 Implementation of Row Permutations 78 
 
 5.5 Disks Which Do Not Read and Write Simultaneously 8l 
 
Page 
 
 5.6 Real Data Block Spacing and Synchronization Qk 
 
 5.7 Obtaining Compile -Time Information on Data Accesses. ... 85 
 
 5.8 The Effects of Block Size on Memory 85 
 
 5.8.1 What the Effect Is 86 
 
 5.8.2 Processes in Which Changing the Block Size Has No 
 
 Effect 9k 
 
 6. CONCLUSIONS AND DISCUSSION 95 
 
 6.1 Applications of Latency Elimination Techniques to Existing 
 Systems 95 
 
 6.2 System Design Applications of Latency Elimination 
 
 Techniques 96 
 
 6.3 Areas for Future Research 96 
 
 APPENDIX A 98 
 
 LIST OF REFERENCES 100 
 
1. INTRODUCTION 
 
 1.1 The Problem 
 
 This thesis is concerned with the elimination of rotational latency 
 in several classes of i/O bound programs. Within the context of this paper, 
 i/O boundness is that situation in which a processor requires data faster 
 than the system's back up memory can supply same using a simple on-demand 
 strategy. This situation, therefore, is memory, processor and program de- 
 pendent and is a property of all three of these. Thus, a program may be i/O 
 bound for some particular system (memories and processor) but not for some 
 other. The term i/O bound program will implicitly refer to both a program 
 and some particular system. 
 
 The literature contains a succession of drum latency minimization 
 techniques, starting with the IBM-65O and IBM-650-like processors [6,7,9]* 
 In these machines, all memory references including instruction sequencing 
 were potential sources of rotational latency. The more recent literature 
 deals more with the statistical nature of requests for data from a drum and/ 
 or the retrieval of data (records, pages, etc.) already stored there in some 
 non- optimal manner [l,^-,5]. 
 
 In this paper we deal with a predetermined sequence of demands to 
 the drum and assume that the initial configuration of these data on the drum 
 is established as an optimal one. Subsequent positioning of data on the drum 
 (for example during output) is such that the locations continue to be optimal 
 with respect to future retrieval from the drum. This approach was taken for 
 a limited class of problems in [3]* 
 
 The solution of an i/O bound program will consist of an algorithm 
 which provides a particular sequence of i/o transactions between memories 
 
such that rotational latency is eliminated. Included in such a solution is 
 a program and processor independent upper bound on the size of buffer memory 
 required. Thus the results presented here are useful either in programming 
 for an existing system or as system design parameters. The data of such an 
 i/O bound program is assumed to be stored on a sufficiently large periodically 
 addressable device such as a head-per-track disk, a drum, a recycling delay 
 line or shift register, or some other device whose only latency is rotational 
 in nature. This eliminates disks with head positioning latency but some of 
 the results obtained here may be extended to include these devices. 
 
 All of the zero latency schemes given in this paper will require 
 disks which are capable of both reading and writing at the same time. That 
 is, at any point in time while the program is being executed, data may be 
 read from the disk, being written to the disk, neither of these or both of 
 these. 
 
 While this requirement is somewhat artificial, it permits the 
 analysis of the data transmissions in the methods given here to remain more 
 comprehensible. It is also the case that several disk manufacturers have 
 indicated that such disks may be produced for no more than a 10-20$ additional 
 cost over their single i/O counterparts. Furthermore, Chapter 5 gives a 
 method by which standard disks without the concurrent read/write ability may 
 simulate the other kind entirely through software techniques . 
 
 In a uniprogramming situation, which this paper is concerned with, 
 I/O bound programs cost, in time, an average of half a disk rotation plus 
 transmission time for each access. If the program is long running, as in 
 the case of one which iterates many times over most or all of its data base, 
 the total latency becomes quite large. Such a situation may easily give rise 
 to more latency time than actual process time. 
 
For many programs, however, it is clearly the case that with some 
 forethought and planning the data could actually be placed at the proper 
 locations on the disk to eliminate this latency for at least the first pass 
 through the data. That is, for the first time the data are accessed from the 
 disk, they may be placed in those locations on the disk from which they may 
 be read at precisely the proper moment to be transferred to the processor 
 when they are needed by the processor. Indeed, a fairly simple algorithm 
 will accomplish this effect with less precise requirements: data are pre- 
 fetched from the disk in the order in which they will be required and held in 
 a buffer until required. This will yield a zero latency data accessing 
 sequence for the first pass through the data, although it is not obvious what 
 the size of buffer must be to accomplish this. For subsequent passes, how- 
 ever, it seems that somewhat more sophisticated algorithms may be used to 
 accomplish the same effect. These would insure that all necessary data can 
 be pre-fetched in the allotted time and further insure that there is a fixed 
 and reasonable bound on the size of the buffer which is required for the pro- 
 cedure . 
 
 This paper exploits the above premise. We show that for some 
 classes of programs there exists an initial disk layout of the data base 
 which provides a zero latency initial pass of the program through its data. 
 We further provide algorithms which reformat the disk (in real-time) to pro- 
 vide zero latency data accesses from the disk for subsequent passes over the 
 data base. This, in effect accomplishes a continuous dynamic allocation of 
 the disk. 
 
 One requirement must be met before any methods presented here may 
 be applied. The processor, the program to executed by it and the disk which 
 
will be used must be well matched . This simply means that the maximum data 
 rate requirements of the program (executed on the particular processor) must 
 be satisfiable by the disk. In other words, the program/processor should 
 not crunch numbers faster than the disk's ability to provide them. Unless 
 this criterion is met, there can be no hope of a zero latency solution of 
 the type derived here. 
 
 A program consists of one or more computational processes which 
 operate on the data base of that program. A computational process will re- 
 quire input data in a predictable sequence. It is assumed that the execution 
 time of each computational process is known to within a relatively small 
 tolerance. This last assumption permits the prediction of data requirement 
 timing as well as sequencing. The solutions which are outlined here all have 
 the property that the disk and processor remain synchronized with each other 
 (cf . section 5.6) . 
 
 The extraction of computational processes and their relevant para- 
 meters is expected to take place at compile time although some of the parameters 
 may be obtained at run time. In any event, it is this "early" information 
 which will permit most of the latency eliminating methods presented here to 
 work. 
 
 This information may be obtained, in some cases, by a fairly straight- 
 forward scanner to determine data sequences. For complicated programs, 
 however, an analyzer' of the type described in [10] may be suitably modified 
 
 to yield this information. 
 
 The data are assumed to be lumped into blocks . Block size is 
 chosen on the basis of several considerations. A block size smaller than the 
 smallest addressable segment on the disk is inefficient of disk capacity. 
 

 Furthermore, since many algorithms require many variables in primary memory 
 at one time (as in the case of partial differential equation-type mesh 
 problems where neighborhoods of variables are computed upon at the same time), 
 very small blocks are uneconomical. This is because certain variables known 
 as edge values are stored redundantly. The number of such edge values is 
 directly proportional to the number of boundaries between blocks. Thus a 
 partitioning into many small blocks will produce many redundantly stored 
 values. On the other hand, there will typically be only a small number of 
 blocks in memory at one time (interaction between data is assumed to be 
 either within blocks or between only a few blocks over any time interval) . 
 And because the size (and hence cost) of primary memory scales linearly with 
 block size, this mitigates against very large blocks. Thus the block size 
 is determined as a function of the primary memory (processor's memory), the 
 disk's characteristics and the program which will be running. Chapter 5 
 contains a discussion of block size determination. 
 
 The reader may wish to substitute the work "page" for "block" 
 wherever it appears here. The common use of "page" is consistent with 
 "block" here except that the preceding remarks about size determination 
 should be heeded. 
 
 This paper will give algorithms which produce zero latency 
 solutions for particularly well-defined classes of programs. Generalizations 
 are also given which permit application of these same methods to somewhat 
 different, less restrictive types of programs. These generalizations, while 
 less formal than the original formulations, will hopefully suggest additional 
 application of the basic methods to latency elimination in programs. 
 
1.2 Terminology and Notation 
 
 A disk is any large capacity rotating storage device whose latency is 
 solely rotational. Thus, drums and recycling shift registers are disks. 
 
 Latency is rotation latency, or the time spent "waiting" for a disk 
 to rotate so that the head is correctly positioned to begin reading the next 
 needed data. 
 
 A physical track of a disk is a recording track on a disk which is 
 accessed by a single read and/or write head. It may be thought of as be- 
 ginning at some arbitrary angular origin and extending for a full rotation of 
 the disk. 
 
 A logical track of a disk is a simulated physical track of arbitrary 
 length. That is, it is a recording track which may extend for more than a 
 single disk rotation. The simulation is accomplished by switching read or 
 write heads from one physical track to another at successive occurrences of 
 the origin at the heads. Intuitively, a logical track is several physical 
 tracks laid out end-to-end. 
 
 Density is a measure of the amount of data stored on some portion of 
 a disk. It is usually expressed in bits per physical track or words per 
 physical track although these units are more commonly referred to as bits per 
 rotation or words per rotation, respectively. The maximum density for a disk 
 is simply that quantity of bits or words per track which is the largest that 
 may be recorded on a single physical track of that particular disk, due to 
 its physical properties. 
 
 is that amount of data (in either words or bits) which may occupy 
 
 a physical track on a disk at its maximum density. 
 
Q is that amount of data which occupies a physical track on a disk. 
 If the density of the data under consideration happens to be stored at 
 maximum density, Q = Q . This quantity of data is frequently referred to 
 as "a rotation of data." 
 
 T is the disk period. Time is frequently expressed in terms of this 
 
 quantity. That is, we frequently measure time in disk rotations . A disk 
 
 rotation is T , seconds . 
 d 
 
 The data base is the collection of all data associated with a program. 
 It includes all initial data as well as intermediate results and final re- 
 sults. 
 
 A data entity is some distinguishable portion of a program's data base. 
 Data entities are mutually disjoint. An array is an example of a data entity. 
 
 A program during its execution will require input data . In the 
 schema here, this is in the form of input blocks . We say that computational 
 processes compute on their input data. Results of computational processes 
 are output data. This is handled in the form of output blocks. 
 
 A c omput at i ona 1 process is a distinct computational portion of a 
 program. It is assumed to be devoid of I/O statements and all conditional 
 branches except those which provide for looping or are otherwise predictable. 
 The input and output sequences for these processes are generated by the al- 
 gorithms presented in this paper. 
 
 A computational process which only does updating of values in a data 
 entity will have both its input blocks and output blocks contained in that 
 same data entity. A process which performs, say, a binary operation on two 
 matrices to yield a third one will have its input blocks coming from two 
 entities. Its output blocks will be contained in yet a third. A particular 
 
computational process may or may not operate over its program's entire data 
 base but it will usually operate over an entire data entity if it operates 
 on any part of it. This is because a computational process is usually 
 assumed to operated iteratively over an entire entity. We frequently refer 
 to a computational process as simply a process where no ambiguity will result. 
 
 Primary memory is the fastest memory available to the processor (save 
 perhaps a limited number of high speed registers). It is random access and 
 contains the current data being computed upon (with the exception of perhaps 
 small amount in registers). Because processes will access their input data 
 from primary memory it is frequently referred to processor memory . 
 
 Primary memory need contain only "a few" blocks of data at any point 
 in time. I.e. primary memory may be as small as 3 blocks of data plus any 
 needed to store the program itself. Thus this memory is relatively small. 
 A discussion of block size appears in Chapter 5- 
 
 Buffer memory is a typically large, slow memory with respect to 
 primary memory. It is also random access memory and is used to match data 
 rates between the disk and the primary memory and also as a data reservoir 
 in which data are reformatted. Buffer memory is frequently considered to be 
 logically partitioned into two areas. The input buffer contains data on its 
 path from the disk to the primary memory whereas the output buffer contains 
 data on its path from the primary memory to the disk. Output data will pass 
 through the output buffer to the disk. When they are next retrieved, they are 
 input data and pass through the input buffer to the primary memory. 
 
 A system is a processor, a primary memory, a buffer memory and a disk 
 in the following configuration: 
 
9 
 
 ■» 
 
 input buffer 
 
 output buffer 
 
 <T 
 
 primary 
 memory 
 
 processor 
 
 The relevant parameters are the disk period, T , the maximum data rate from 
 the disk, D, the size and bandwidth of the buffer memory, B , and the data 
 word consumption rate of the processor. The primary memory is assumed to be 
 matched to the processor. The bandwidth of the buffer memory must be at 
 least as large as that of the disk. 
 
 
10 
 2. MESH PROBLEMS 
 
 2.1 Introduction 
 
 This chapter is concerned with a particular class of programs and 
 the elimination of rotational latency from their execution. Several of the 
 results presented here will be superseded in later chpaters but are nevertheless 
 interesting. 
 
 The problem class described here is fairly common, representing a 
 considerable fraction of those which are frequently i/o bound. Solutions 
 applicable to this particular class of problems are thus of interest. Further- 
 more, an understanding of the solutions developed in this chapter will enable 
 the reader to better generalize these results to those of later chapters. 
 
 The types of programs to be analyzed in this chapter include those 
 with data structures which are derived from partial differential equation 
 (PDE) problems. These data structures are characterized by a large partitioned 
 array or mesh which consists of one or more variables at each mesh point. 
 Except for small areas of the mesh close to its boundaries, the mesh is assumed 
 to be homogeneous and continuous and computations on the variable (s) at meBh 
 points or nodes will use the values of neighboring variables. Thus, one signi- 
 ficant property of this class of programs is the requirement of data from 
 partitions other than the one being directly computed upon. 
 
 The problem to be solved here is the minimization of latency for 
 programs whose data cannot be contained in primary memory. An example of the 
 class of programs being considered are those programs which obtain the solution 
 of large PDE problems by finite difference methods. The following discussion, 
 then, will be restricted to those programs, but it should be remembered that 
 the methods presented here can be used directly for other computations with 
 the same data structures . 
 
11 
 
 A program for which latency is to be eliminated is partitioned into 
 one or more computational processes , or processes , each of which compute on 
 the data base. The processes are chosen for the predictability of their 
 running times as well as their relatively "long" running times. By long 
 running times we mean that the time for a computational process to compute 
 on an input block should be no less than the transmission time between the 
 input buffer and primary memory for an input block. It is assumed that each 
 process will be applied iteratively to the entire data structure. 
 
 This problem will be explicitly solved for those programs consisting 
 of a single process. Solutions will then be given for generalizations to 
 multi-process programs. To simplify the following discussion, all transmission 
 times between memories are assumed to be zero. 
 
 2.2 Single Process Programs 
 
 The discussion in this section is limited to those programs containing 
 a single process. This set of operations occurs iteratively over the data 
 points of the mesh and, in general, in iterative sweeps over the mesh. That 
 is, if a sweep is an application of the process to all of the data, real 
 programs will contain several sweeps over the data. 
 
 These data are usually altered during each sweep although the 
 collection of all relevant variables will be imprecisely referred to as "the 
 mesh" or "the data." 
 
 One feature that makes finite difference equation data structures 
 difficult to handle is the interactions between mesh data points. While the 
 overall array may be partitioned into blocks each of which fits into the 
 primary memory, these blocks must interact in that values along their boundaries 
 are required in the computation using adjacent blocks. Edge values are 
 
12 
 
 variables along the boundaries of these data blocks and must be passed from 
 one subarray to the next. For example, in a simple five point difference 
 scheme such as illustrated in Fig-. 2-1, each point is updated in terms of 
 its four orthogonal mesh neighbors, d is the depth of the finite difference 
 operator. When the computation is operating on a row at the edge of some 
 block, data values from another partition are required. Figure 2-2 
 schematically shows the edge values of data blocks. 
 
 For large problems too much space would be required if all these 
 edge values were stored in primary memory during the entire computation. Thus 
 we seek a scheme which periodically writes these edge values as well as the 
 partitions of the overall array into the secondary storage. The positions 
 into which these data are written are, of course, critical. We wish to write 
 them in such a way that when they are again required for computation they will 
 be available with very small or no latency. 
 
 A critical measure is the ratio of a process' computation time on 
 a partition to the time it takes to bring a new partition into primary memory 
 from the disk. For disks which can only read or write but not both at any 
 one time, the theoretical lower limit of this ratio is not less than two. 
 That is, if we could read a block of data and write another block of data in 
 precisely the time it takes to compute on a third block, the ratio would be 
 two and theoretically a zero latency solution would be feasible. 
 
 For disks which could both read and write at the same time this 
 limit is one. That is, if the compute time for a block of data were no less 
 than the data rate from the disk for that block, a solution is feasible. 
 Intuitively what this means is that if the data to be calculated upon cannot 
 be retrieved from the disk at least as fast as it is needed by the processor, 
 no zero latency solutions can be found for large l/0-bound problems. 
 
 
13 
 
 NEIGHBORING MESH POINTS 
 REQUIRED FOR THE COMPUTATION 
 
 MESH POINT BEING 
 COMPUTED UPON 
 
 Fig. 2.1 
 
 w 
 
 Data 
 Block 
 
 V&ZZZZZZZZ2. 
 
 ZZZZZA 
 
 4 V 
 
 YZZZZZZZZZfy 
 
 I 
 
 vi y\ Data 
 ^ K Block 
 
 Edge Values 
 
 ^2 WZZZZZZZfli ^7777? 
 
 I 
 
 ^s 
 
 0ii 
 
 vi :k 
 
 ^ Pd ata 
 
 '//////////A 
 
 A Data ^ 
 ™ Block H 
 
 Fig. 2-2 
 
lU 
 
 Data blocks are ideally so spaced apart on the disk that the time 
 between reading of two successive blocks is identical to the time required 
 for a block to be computed upon. Real world constraints, however, preclude 
 such a precise positioning. Real world constraints, however, preclude such 
 a precise positioning. 
 
 While a disk is physically a homogeneous recording medium, it gets 
 formatted such that logically the possible storage locations are quantized. 
 Thus, words or bits may only be recorded at particular locations. The origin 
 or address of a data block on the disk is uniquely defined by its track and 
 angular position. The area or space on the disk occupied by a data block are 
 defined by its track and sector . A sector, then, is a specifically oriented 
 angular displacement on the disk. 
 
 Because data blocks may only start at addressable locations on the 
 disk, they will start at the next possible addressable location following the 
 theoretical and unrealizable location that its compute time would call for. 
 This rule will, of course, introduce some latency into the solution. This 
 latency as a ratio of processor running time is equal to the incremental 
 amount per block, as just described, divided by the compute time per block. 
 This figure for most configurations can be made quite small by allowing a 
 reasonably large block size. 
 
 The scheme dealt with here will be described as having blocks 
 spaced appropriately on the disk. No further mention will be made in this 
 chapter of the second order consideration mentioned above. This effect 
 being negligible, all solutions and statements relating to them will be 
 referred to as zero latency but must be recognized as including this small 
 incremental latency. 
 
15 
 
 The first iteration of the process over the data blocks (ignoring 
 inter-block data requirements) are particularly easy to provide for with no 
 latency. The data blocks are simply spaced out on the disk in one con- 
 tinuous logical track. The blocks are spaced appropriately before the 
 program beings, and the process procedes to work on its data. 
 
 What we have yet to provide for, however, is further iterations 
 over the entire mesh, or sweeps . 
 
 It is necessary to consider the data requirements of such a pro- 
 gram. In general an algorithm which requires successive sweeps over its 
 data base will generate a series of intermediate results, or updated portions 
 of its data base. That is, a mesh will be updated on each sweep. It is the 
 updated mesh which is operated upon on the next sweep. 
 
 Since we have assumed that the program contains a single process, 
 the spacing between updated blocks will be identical to that of the blocks 
 before updating. The process is thus considered to have both input data and 
 output data. The input data are made up of the input blocks which the process 
 operates upon directly. These blocks are input from the disk for the process. 
 The output data are made up of the output blocks which are the result of the 
 updating operations performed by the process upon the input blocks. Each 
 output block is output to the disk shortly after the process finishes with it. 
 
 The output blocks, when placed on the disk during the i sweep over a mesh 
 
 th 
 are the input blocks for the i+1 sweep. 
 
 An instantaneous description, then, of the data block memory allo- 
 cation for a system without buffer memory during the i sweep would be as 
 follows : 
 
16 
 
 That block being computed upon is in primary memory 
 along with perhaps several adjacent neighboring blocks 
 (some already processed and some not yet processed). 
 
 The disk contains those input blocks for the i 
 sweep which have not yet during this sweep been moved 
 into primary memory. 
 
 Also on the disk are those output blocks for the 
 i sweep which have already been computed on and output 
 during this sweep. These are the input blocks for the i+1 
 sweep. 
 
 This is a steady-state condition. The time for the processor to 
 compute on a block is the block compute time, C, . This is exactly the time 
 interval between successive input block transmissions from the disk and exactly 
 the time interval between successive output block transmissions to the disk. 
 Thus, at any point in time the disk may be servicing both input and output 
 requests . 
 
 Providing for proper data block spacing for a single sweep only 
 solves half of the latency problem. A total solution must also provide for 
 inter-sweep gaps. Assume that the first sweep originates as soon as the 
 processor receives the beginning of the requisite data and that the updated 
 data blocks are output to the disk shortly after being updated. The dis- 
 placement between an input block and its updated version (output block) on 
 the disk is therefore a function of C. , the two transmission times between 
 primary and secondary (disk) memory and the amount of time it remains idle in 
 primary memory. The ideal locations on the disk for these updated blocks 
 
17 
 
 correspond to the length of time it takes to make a sweep. That is, the 
 ideal displacement, modulo the disk rotation time, is a function of C, and 
 the total number of blocks in the mesh, both of which are fixed by the pro- 
 gram and the processor. 
 
 Intuitively this means that if no special arrangements are made, 
 there will be, in general, some latency between sweeps. The amount of this 
 inter-sweep latency can vary from zero to as much as an entire disk rotation, 
 with the value distributed uniformly over the interval. 
 
 This latency need not be serious. For programs which take many 
 disk rotations per sweep, an average latency of half a rotation per sweep 
 could be insignificant. There is, however, a straightforward means of 
 eliminating the latency. 
 
 Buffer memory may be used to pick up and store the input data when 
 it passes under the disk's read heads within the disk's last rotation before 
 the process' next sweep begins. That is, if there would normally be a half 
 rotation of latency, half a rotation of data would be "cycled through" the 
 buffer memory. Each input block would be read from the disk at that point 
 at which it passed under the read head and buffered for exactly half a 
 rotation until the processor was ready for it. This procedure is called 
 origin shifting since it has the effect of shifting the origin of the data 
 on the disk to that angular position which would yield zero latency. Note 
 that the origin of the data is not physically shifted on disk. What is ac- 
 complished is the effect of having thus shifted it. Another manner in which 
 this same effect is obtained is that of using an output buffer to delay the 
 actual transmission of data to the disk by an amount equal to what is neces- 
 sary to correctly place it in position for later retrieval with no latency. 
 
 
18 
 
 These two processes are equivalent and both are called origin shifting al- 
 though only the latter one actually causes the data to be shifted from their 
 natural position on the disk. 
 
 It is necessary to consider the size of such buffer memory. An 
 upper bound comes from recognizing that no more than one rotation need be 
 buffered. (in reality, no more than (l-e) rotation need be provided for 
 since an entire rotation corresponds to that fortuitous disk layout which 
 produces zero latency.) One disk rotation of data, therefore, is sufficient. 
 There are two quantities used in this paper to express that frequently- 
 occurring quantity: 
 
 Q is the maximum amount which could be recorded 
 Tnax 
 
 on a single rotation of the particular disk under con- 
 sideration. This is the most conservative estimate of 
 required buffer storage. It is simply equal to the 
 number of bits which could be transmitted by the disk in 
 a single rotation. 
 
 If the disk's bandwidth is D bits/second and the 
 disk has a rotational period T seconds, 
 
 inax ~ d 
 
 a more realistic quantity representing a rotation of data is obtained by 
 considering the actual data requirements of the processor for a particular 
 process. Since real programs on any given computer system will rarely re- 
 quire data at the disk's maximum rate, it is often desirable to express 
 memory requirements in terms of the process/processor's data needs. 
 
19 
 
 This quantity is denoted as Q with a subscript 
 
 appropriate to the process being considered, or, in 
 
 more general situations, as Q _. 
 
 For a block of B T bits which takes C, seconds 
 L a 
 
 to compute (update), 
 
 <W - £)T a 
 
 d 
 
 Chapter 5 contains several charts which give values for Q and show re- 
 lationships between Qy, eal and Q . 
 
 It should be noted that the amount of memory necessary to eliminate 
 inter-sweep latency, Q, is totally independent of mesh size. Thus for larger 
 data bases with the same program, buffer memory does not grow. Furthermore 
 G is dependent only on the system's disk's characteristics. 
 
 If not for the problem of the intercommunication between data blocks, 
 the single process zero latency solution would be complete. This intercom- 
 munication is considered here and a method is given in which it can be handled 
 in a manner consistent with the previous results. 
 
 Row 1 
 Row 2 
 
 • 
 
 Last Row 
 
 Fig. 2-3 
 
20 
 
 For the sake of explanation, the mesh is assumed to be swept in 
 a left-to-right, top-to-bottom sequence as shown in Fig. 2-3, each row con- 
 taining many partitions and having a height of, in general, several mesh points. 
 
 Edge values, because they are a relatively small fraction of an 
 entire data block may be lumped in groups of k into edge value blocks . The 
 value of the integer k can be chosen to insure that the total size of the 
 edge value blocks is at least a minimum i/O transmission for the disk being 
 used. 
 
 There are four types of edge values: Left, right, upper and lower. 
 They are defined in the expected way and some are stored redundantly, once 
 in the data block in which they naturally occur and once in the edge value 
 block. 
 
 Fig. 2-2 shows edge values relative to their data blocks. Right edge 
 values of a particular block are those used by that partition to the right 
 of that block and left edge values are similarly defined. Similarly upper 
 edge values of a block are used by that block above the original. 
 
 Right and left edge values present no problem. They may be 
 handled without the use of edge value blocks by merely retaining in primary 
 memory at least part of the left and right neighbors of a block. Thus for 
 those edge values of any particular block, the left edge values are used 
 while the preceding block is being operated upon. This is because the block 
 containing the left edge values is in primary memory somewhat sooner than 
 absolutely necessary. In like fashion, the right edge values are available 
 to the right neighbor partition because the block containing these edge 
 values is retained in primary memory until its right edge values are no 
 longer needed. 
 

 21 
 
 It is the upper and lower edge values which require their own 
 blocks in order to be intercommunicated. These are simply output inter- 
 stitially with respect to data blocks. That is, an edge value block (con- 
 sisting of edge values for k successive blocks) is output between two data 
 blocks. It is simply input during the last disk rotation before it is re- 
 quired. The simple rule to remember in scheduling these "mini -transmissions" 
 is that, for the top-to-bottom example given, lower edge values have a 
 periodicity corresponding to a single row across the mesh. Upper edge 
 values are required with periodicity corresponding to an entire sweep over 
 the mesh minus a row across it. Data blocks, of course, have periodicity 
 corresponding to exactly an entire sweep over the mesh. 
 
 Upper and lower edge value blocks are therefore simple lumped in 
 number as a function of the disk into redundant edge value blocks. These 
 are output to the disk at the first inter-data block possibility and 
 retrieved at the last possibility before being needed by the program. Their 
 effect upon storage requirements is negligible and the only requirement of 
 this method is that there exist inter-data block gaps . These gaps will in 
 fact exist in all but those systems which are so matched to the programs 
 running on them that the data requirements are identical to the disk's 
 capacity. 
 
 ■*Some partial differential equation solutions are obtained by using the non- 
 updated lower edge values for each sweep, saving the updated values for the 
 next sweep. In such cases the updated lower edge values during the i 
 iteration are output for use during the i+1 iteration. Their periodicity 
 is then an entire sweep over the mesh plus a row across it. During the sweep 
 across any row there exist two sets of lower edge values for that row: the 
 one being generated for use in the next iteration over the mesh and those 
 from the last iteration which will be used on the sweep of the next row. 
 
22 
 
 As a measure of the size of the edge value blocks, consider RLg. 
 2-1. d is the depth of the finite difference operator. In this case d is 
 only one but may range up to, say, six or seven. This same depth is also 
 obviously the depth of the edge value "strip". Thus, if a data block is a 
 partition of the original mesh made up of m (height) by n (width) mesh 
 points, the size of an upper or lower edge value block need be no larger than 
 nd mesh points. Since m will always be chosen such that m > > d to minimize 
 the number of edge values which need be passed, the size of edge blocks will 
 similarly be much smaller than the data blocks. 
 
 It is also possible for mesh problems to be three dimensional rather 
 than two dimensional as previously described. In such a case we think of the 
 two dimensional mesh as described as being a plane. In general, then, a 
 sweep over a plane in the three dimensional model will have both edge values 
 within its plane (as described) as well as edge values which are the actual 
 data blocks in adjacent planes. In this case the inter-plane edge value 
 blocks are identical in size to the actual data block being processed. 
 
 It is also possible to allow the edge blocks to become as large as 
 
 the data blocks themselves and apply the method of Chapter 5 to obtain a 
 
 zero latency solution. 
 
 2. 3 Two Computational Process Programs 
 
 Two process programs are those which have two processes which 
 alternately sweep the entire mesh. The particular complication introduced 
 by generalizing to two processes is the fact that the two processes have 
 different compute times, C , for their respective blocks of data. Thus, for 
 a mesh with its data blocks properly spaced on the disk for one of the pro- 
 cesses, the output blocks sent back to the disk will, by the method of the 
 
23 
 
 previous section, be spaced in like fashion. The next process, however, be- 
 cause its compute time is different, will not be able to operate at its no- 
 latency tempo. It requires a different spacing of these data blocks (its 
 input blocks) - one which corresponds to its particular speed. 
 
 It is obvious that the data blocks cannot migrate over the disk. 
 Once they are output, their spacing is fixed and they will be input with the 
 identical spacing. Furthermore, all the problems considered in this paper 
 are assumed to be non-core contained, thus precluding the trivial solution 
 of keeping the entire mesh in primary memory. 
 
 The method of solution presented here uses a small amount of 
 buffer memory to constantly alter the critical spacing of the data blocks. 
 The result is a no-latency solution in which the processor is always being 
 supplied its requisite data at the proper rate. 
 
 The following assumptions are made concerning the nature of the 
 problem . 
 
 There will be several sweeps over the mesh by both 
 processes. This will require a steady-state solution in 
 which each of the two processes is assumed to both precede 
 and follow the other. (The result will yield a zero 
 latency solution in case of a single iteration by each of 
 the kernels.) 
 
 A sweep over the mesh is made first by one process 
 over all of the mesh and then by the other process over 
 all of the mesh. 
 
2U 
 
 The pattern of sweeping must be identical for both 
 processes. That is, the sequence of partitions of the 
 mesh must be identical for each process and for all sweeps. 
 
 In describing the transition between any two processes, it will be 
 necessary to consider the relative speeds or compute times of the two. One 
 process will be faster than the other and this one will be called the fast 
 process . Similarly, the slower process is christened the slow process . The 
 fast process takes less compute time/datum than the slow process. The two 
 process problem is defined such that both processes operate on precisely the 
 same data over their respective sweeps. Thus, as one would expect, the slow 
 process takes more time per sweep than the fast process. 
 
 The data per sweep being constant and the time per sweep being 
 different, the density requirements of the two processes are different. The 
 fast process requires data from the disk at a higher rate than does the slower 
 process. This creates one of the two potential latency producing problems. 
 Density buffering is the process of altering the density of the data comprising 
 a mesh from the density corresponding to one process to that density corre- 
 sponding to the other process. 
 
 The second possible area in which latency could be introduced in a 
 two process program is virtually identical to that of the one process 
 problems - intersweep transitions. That is, when a process is ready to begin 
 computing, the data probably is not positioned to begin flowing from the 
 disk. To insure no latency at this point, the single process solution uses 
 buffer memory to temporarily store up to a rotation of this data and thus in- 
 sure its accessibility when necessary. The use of buffer memory in this 
 
25 
 
 manner is again called origin shifting since the memory functionally shifts 
 the origin of the data stream to correspond to the processor's timing re- 
 quirements . 
 
 Both density buffering and origin shifting are handled differently 
 as a function of whether the data are being passed from the fast process to 
 the slow process or vice versa. These will be explained separately for 
 fast-slow and slow- fast transitions. 
 
 2.3*1 Density Buffering, fast process to slow process transition 
 
 Transition here refers to the altering of the data spacing on the 
 disk, i.e. density, from that corresponding to the fast process to that of the 
 slow process. Thus the transition is that of the data format and not that of 
 the processor as it switches from one process to the other. The data trans- 
 mission, as will be explained, occurs while one of the processes is operating. 
 
 In the case of the fast-slow transition, the data blocks are 
 assumed to be stored on the disk at the fast process' density. The goal is 
 to input data from the disk at this rate and provide it the processor at the 
 slow process' rate. This can be accomplished in several ways, but it is 
 necessary to keep buffer memory requirements low. The following solution 
 is given: 
 
 Each rotation of the disk contains enough data blocks to supply 
 the processor with its slow process computation for some period of time 
 which is longer than a disk rotation. (intuitively, one rotation of data 
 at the higher density corresponds to more than one rotation of data at the 
 lower density.) If the ratio of the higher to lower density is denoted r, 
 one rotation of data as it is stored on the disk will keep the processor 
 going for r rotations (r > l) at the slow process speed. 
 
26 
 
 This method assumes initiation of data transfers from the disk only 
 at some arbitrary origin. This is assumed to be that angular position at 
 which the first block of the sweep is stored on the disk. The original 
 transmission will start with this block and continue for an integral number 
 of disk revolutions . All subsequent inputs of data corresponding to this 
 sweep will be initiated at the same angular position (which is where the 
 previous transmission stopped) and continue for integral numbers of disk 
 revolutions. 
 
 As data blocks are being input at the higher data rate, the process 
 is making use of it at a lesser rate. Thus there is a net increase of data 
 in buffer memory at such times . Input from the disk is halted at its origin 
 every time the supply of unused data already in memory is sufficient to supply 
 the processor for at least one full disk rotation. Data, then, starts to 
 accumulate in buffer memory as soon as input from the disk occurs. At certain 
 points during the computation however, the accumulated data is sufficient to 
 supply the processor, and input ceases temporarily. In this way buffer memory 
 is bounded while still providing the processor with data at its required rate. 
 Stated concisely as an algorithm, we have 
 Algorithm 1 : 
 
 i) Input is initiated as the origin of the data passes 
 
 under the read head of the disk just before it is required 
 
 by the processor and continues until the conditions in ii 
 
 are satisfied (at least a full disk rotation) 
 
 ii) at successive rotations of the disk to its 
 
 angular origin (i.e. integral number of rotations from 
 
27 
 
 the origin of the data), when the amount of data in the 
 buffer is greater than that amount needed by the slow process 
 for the next entire disk rotation, input terminates 
 
 iii) at successive rotations to the data origin, input 
 is initiated (or continued) when the amount of data in the 
 input buffer is insufficient to keep the processor supplied 
 for the next disk rotation. 
 
 This bound is obtained by considering the worst case. Let the 
 amounts of data in bits required by the fast and slow processes for the 
 time corresponding to a revolution of the disk be CU and (3, respectively. 
 Thus a = pr. 
 
 From the description of the fast-slow density buffering, if we are 
 to allow for any value of r > 1 and arbitrarily large data entities, it is 
 obvious that baffer memory must be at least as large as p bits. The only 
 time at which input will cease is when the origin of the disk is passed over 
 by the read head(s) and there are at least (3 bits in buffer memory. 
 
 The best case is when exactly the requisite p bits are resident at 
 the time the origin is passed over. The worst case, then, is when there are 
 P-e bits, thus necessitating an input transmission. At the end of another 
 disk rotation, a bits have been input and p bits have been "used" by the 
 processor, leaving a- e bits in buffer memory. This being sufficient to 
 supply the processor for at least the next disk rotation, transmission stops. 
 Input resumes only when the contents of buffer memory fall below (3 bits again. 
 By this algorithm, then, the amount of buffer memory required for density 
 buffering is never larger than a bits. The quantity a, however, is identical 
 to 0^. ^ a rotation of data at the fast process' rate. 
 
28 
 
 This process can be illustrated by a simple example. In Fig. 2-U, 
 the curve labelled "data from disk" is the cummulative number of bits input 
 to buffer memory from time . Note that transmissions initiate and terminate 
 at integral numbers of disk rotations from the origin. 
 
 The line below the axis is the integral of the amount of data trans- 
 mitted to the processor and hence, data from buffer memory. The dotted line 
 represents the amount of data resident in the input buffer and is simply the 
 sum of the other two curves . 
 
 In the example diagrammed, the scale is such that Cd = — p. That is, 
 the fast process uses data at =■ the rate of the slow process or, in other 
 
 o 
 
 terms, the fast process operates on the same data in ft the time of the slow 
 
 process. 
 
 Theorem 1 : 
 
 For the algorithm described for density buffering of data for the 
 fast-slow transition, no more than Q buffer memory is required. 
 Proof 
 
 Data transmissions when initiated will read or write data over an 
 integral number of disk rotations - origin to origin. Furthermore, data will 
 always be flowing to the processor more slowly than it is input to the input 
 buffer because (X > p. Therefore, the maximum amount of data that will have 
 to be accommodated by buffer memory will be that occurring at exactly the 
 end of a transmission. 
 
 Assume that this quantity is larger than OC, say CH+x. Then, by the 
 algorithm, one rotation earlier, Oi less bits would have been input and p less 
 bits would have been forwarded to the processor. That is, one rotation 
 before the quantity Q!+x was in buffer memory the quantity p+x would have to 
 
29 
 
 o 
 
 •H 
 
 •p 
 
 •H 
 CO 
 
 M 
 -P 
 
 8 
 
 H 
 < 
 
 CO 
 
 aS 
 
 «H 
 
 m 
 O 
 
 «m 
 
 I? 
 •H 
 fH 
 
 Cm 
 <m 
 
 .Q 
 
 >> 
 •P 
 
 •H 
 CO 
 
 a 
 
 <M 
 O 
 
 X 
 
 i 
 
 bo 
 
 io in 
 
30 
 
 have been in buffer memory. But this is a contradiction because the 
 algorithm calls for input to terminate should the buffer contain this much 
 data at this point 
 
 Q.E.D. 
 
 2.3.2 Density Buffering, slow process to fast process transition 
 
 Here data blocks are accumulated in an output buffer while the slow 
 
 process is processing more data. Roughly speaking, when the amount of such 
 
 accumulated data is sufficient for a disk rotation at the fast process' density, 
 
 an output is initiated. 
 
 Specifically, the algorithm is as follows: 
 
 Algorithm 2 : 
 
 When the slow process begins its computation, its output 
 data are accumulated in the output buffer for the first disk 
 rotation. 
 
 At the origin on the disk of the second and subsequent 
 rotations r 
 
 i) if the amount of data already accumulated in the output 
 buffer plus the amount to be generated in the next rotation 
 (p) is greater than or equal to a, output of this data is 
 initiated now and continues for an entire rotation. 
 
 ii) if the amount of accumulated data is < a-3, continue 
 to accumulate data for another rotation. 
 
 Thus the output buffer allows data accumulation until a full high 
 density output transmission can occur. The data initiation with accumulated 
 
31 
 
 data of a-p, rather than simply a corresponds to a one rotation "look-ahead". 
 That is, by knowing that at all times during the rotation sufficient data will 
 be available to continue transmitting to the disk at the higher density, we 
 can initiate the output at the start of that rotation. 
 
 Theorem 2 : 
 
 For the algorithm described for density buffering of data for the 
 slow- fast transition, no more than a bits of buffer memory is required. 
 Proof 
 
 The maximum amount of data ever held in the output buffer occurs 
 immediately before an output transmission. Suppose at some time an amount 
 of data d > Qt is resident in the output buffer, say d = a+x. Then a 
 rotation before, there was a+x-p in the output buffer. But this contradicts 
 the stated algorithm because this amount of data would have triggered an 
 output transmission at that time. 
 
 Q.E.D. 
 
 Note that a may be as large as Q, that is, a rotation of data. Fig. 2-5 shows 
 
 the data movement for this algorithm for the same parameters as in Fig. 2-4. 
 
 k 
 
 a = -p. 
 
 2. 3.3 Origin Shifting with Density Buffering 
 
 All density buffering takes place while the slow process is computing. 
 Intuitively this is because density buffering between any two densities will 
 always take as long as the slower process. If this procedure is to be com- 
 pletely masked from the processor, it must be concurrent with the slower 
 process' execution. This requires a total buffer memory of 2Q for density 
 buffering, because each of the buffers may be as large as Q. It might be 
 though of as configured as in Fig. 2-6. But this does not account for origin 
 
32 
 
 b£> 
 
33 
 
 Size = Bits 
 
 Z. 
 
 Input Buffer 
 
 Output Buffer 
 
 X" 
 
 Size = Bits 
 
 Fig. 2-6 
 
 shifting to eliminate intersweep latency. In general, data will not be at 
 exactly the angular position to be read at the right instant for the beginning 
 of a sweep. There is, however, a scheme permitting buffering to eliminate 
 this type of latency without requiring any more memory than that already 
 necessary for density buffering. 
 
 Observe that origin shifting may be accomplished "in advance" by 
 
 storing data in an output buffer until it may be placed at precisely that 
 point which corresponds to the angular position at which it can ultimately be 
 read for no latency. That is, data may be made to appear to have shifted on 
 the disk by either using an input buffer to read it early and hold it, or by 
 using an output buffer to actually shift it. The amount of buffer memory is 
 identical for either method: A rotation of the disk or Q. 
 
 The scheme for density buffering does not permit origin shifting 
 during the slow process' execution without additional buffer memory. It 
 does however permit origin shifting during the fast process' execution. In 
 fact, since no density buffering occurs at this time, origin shifting may 
 
3U 
 
 occur in both the input and output buffers. The input buffer will shift the 
 partitions of data as they come off the disk and eliminate latency for the 
 fast process. The output buffer will place the partitions in the proper 
 position on the disk for the slow process. Thus both origin shifting and 
 density buffering may be accomplished with only 2Q of buffer memory. 
 
 2.k Generalization to N-Processes 
 
 One and two process mesh sweeping may be graphically represented as 
 in Fig. 2-7, a and b, respectively. Such illustrations are called code graphs, 
 The nodes represent processes and the directed links correspond to data 
 transmissions. In the particular case of the mesh problems already considered 
 in this chapter the links also correspond to the order of execution. This 
 need not, however, always be true. The "upward arrows" are necessary because 
 of the iterative nature of the problem. 
 
 Fig»2-7c shows an n-process mesh problem. This is simply one in 
 which several processes operate on a common data base in succession. Each is 
 assumed to take a different time to compute as the other processes. Thus 
 a zero latency solution must here perform both density buffering and origin 
 shifting. 
 
 The solution to this is a simple generalization of the two-process 
 solution. If the solution for the two-process problem is considered as con- 
 sisting of two transition components, it can be stated as follows: 
 
 Fast/slow process transitions have origin shifting 
 taking place in the output buffer during the fast process 
 execution. Density buffering occurs in the input buffer 
 while the slow process is executing. 
 
35 
 
 Fig. 2-7 
 
36 
 
 Slow/fast process transitions perform the density- 
 buffering in the output buffer during the slow process 
 execution. Origin shifting is accomplished in the input 
 buffer during the fast process execution. 
 
 N-process problems, then, are simply like two-process problems in 
 which there is some sequence of these slow/fast and fast/slow sequences. The 
 only significant difference is that in an n-process problem, any particular 
 process need not be simply fast or slow. The attribute of being either fast 
 or slow is a function of the process' neighbor. Since in the n-process 
 problem a process has two neighbors, it is not possible to characterize that 
 process' speed as one or the other. 
 
 If a process is faster than its predecessor process, then it is 
 "fast" with respect to that predecessor and that transition from its prede- 
 cessor is a slow/ fast one. If a process is faster than its successor process, 
 it is "fast" with respect to its successor and that transition to its successor 
 is a fast/slow one. A process may however be fast with respects to its 
 predecessor and slow with respects to its successor or vice versa. The impor- 
 tant point is that an interprocess transition will occur in one of two ways, 
 
 both of which are provided for in the solution. 
 
 2.4.1 Data Rate Considerations for Some n-Process Problems 
 
 The statement of n-process problems has considered only that data in 
 
 common to the processes. Some algorithms may be such that they consist of 
 
 several processes, each of which operates on different variables at each point 
 
 in the mesh. That is, there could be a set of variables over the mesh which 
 
 is common to all of the processes and other classes of variables which are 
 
 operated upon only by specific individual processes. 
 
37 
 
 One obvious way to handle this case is to lump all of the variables 
 and handle the entire mesh as if it were common to all nodes. This would 
 require origin shifting and density buffering of every variable for every 
 process, even though some processes would not make use of all of these variables. 
 
 Somethimes, however, the overhead of carrying the "deadwood" variables 
 along might cause the data rate from the disk to be exceeded where the data 
 rate of the process (which just considers the necessary input data) would 
 permit a solution. These situations may be solved by a decomposition of 
 solutions. The original mesh is split up into several data bases, each made 
 up of different variables. 
 
 All of the possibilities of such decomposition will not be enumerated 
 in this paper. An example is given however of one such decomposition for a 
 two-process problem. 
 
 The code graph of Fig. 2-7 b is modified to that of Fig 2-8 to show 
 all of the data transmissions discussed in the two process decomposition. 
 
 Fig. 2-8 
 
38 
 
 A no latency solution for such a program will consist of the solution 
 to the two process portion plus the two one process solutions. That is, there 
 are three sets of data or meshes which are each provided for independently. 
 The nature of the solutions allows for this "interspersing" of data while 
 insuring correct arrival at the processor with only minor additional consider- 
 ations. One of these considerations is the proper handling of data rates 
 to and from the disk. 
 
 A no latency solution is feasible only when every process in a 
 program requires data at a slower rate than the disk can provide. Very large 
 non-core contained arrays which can be rapidly "crunched" will have no 
 low latency solution if the numbers are "crunched" faster than they may be 
 supplied. The point with respect to mixed solutions of the kind being con- 
 sidered here is that it is not sufficient that the processor/disk data rates 
 be compatible for each of the processes in each solution. The combined data 
 
 for each process for its two solutions must be acceptable. Simply, the sum 
 of the processor data rates of the two solutions for each process must be 
 less than that of the disk. 
 
 The buffer memory requirements obtained for either the one- or two- 
 process solutions are tied to these processor and disk data rates. The buffer 
 requirements for combined solutions are therefore identical to the memory 
 upper bound of the two process solution. This is because no matter how many 
 solutions for any one process are combined, as long as the sum of their data 
 rates is acceptable, no more than 2Q bits of buffer memory is required to 
 process the data at that rate. 
 
 There is one final matter to be handled in any real combined solution. 
 The fact that the combined data rates to the processor are less than the 
 
39 
 
 disk's bound does not guarantee that all the data will be accessible by the 
 
 disk. Data could conceivably be output to the disk such that when blocks 
 
 from two different solutions are to be subsequently input, they are in the 
 
 same angular position on the disk. Unless the disk can read from several 
 
 heads/tracks at the same time, there is no way to input all the required data 
 
 when it is needed. This local overlap condition occurs in spite of the fact 
 
 that the average density over the disks total rotation is acceptable. 
 
 One way of precluding this problem is to merely insure that overlaying 
 
 solutions will have their data blocks interleaved on the disk. Except for 
 edge value blocks, all input blocks for a particular process have the same 
 spacing between consecutive blocks. This is true no matter how many disjoint 
 streams of data have been prescribed by separate solutions. Therefore any 
 overlapping of these streams can be corrected by shifting one entire sequence 
 of these blocks relative to the one it overlaps with. That is, the periodicity 
 of any two or more streams of overlapping data blocks is identical and merely 
 insuring that the first blocks of the respective streams will interleave 
 forces all of them to. This can be done at the time the two solutions are 
 combined. Compare all streams of input blocks in the combined solution. 
 Where overlaps exist, one or more of those streams may be shifted by delaying 
 the original output requests for the corresponding blocks the amount of time 
 necessary to eliminate the overlap. 
 
 At most, this procedure will increase the buffer memory requirements 
 by the size of one block for each stream which will be delayed. This is 
 clearly very much smaller than the amounts of buffer memory already conserva- 
 tively required and can therefore be ignored. 
 
 
ho 
 
 Edge value blocks in combined solutions are handled very much the 
 same as in the one process solution. In the former case, however, it is the 
 combined solution itself into which the edge values are inter stitially placed. 
 This avoids overlap conflict in the combined solution. 
 
 2. k.2 Sweeps of Less Than a Disk Rotation in Duration 
 
 Thus far, an implicit assumption was made in the statements of the 
 previous problems. It has been assumed that any sweep over a data base would 
 take at least as long as a single rotation of the disk. Indeed the solutions 
 to these problems use the fact that data may be output to the disk and later 
 input for adjacent sweeps. This section considers the case in which a process 
 takes less than the time of a disk rotation to perform a single sweep. 
 
 It Is important here to recall one of the first assumptions made 
 about all of the problems to be solved in this paper. All of the input values 
 required by any process must be required by the process for the particular 
 processor being considered at a rate not to exceed the capacity of the disk 
 to provide this data. Intuitively, no process should crunch numbers faster 
 than the disk can supply them. 
 
 Sweeps of too short duration to make use of the disk are handled in 
 a simple straightforward manner. Instead of being routed from the output 
 buffer to the disk, as the previous solutions call for, the data is routed 
 from the output buffer directly to the input buffer. 
 
 What this means is that for sweeps of sufficiently short duration, 
 the required data are entirely buffer memory contained. The fact that the 
 size of the buffer memory is sufficient to accommodate this data follows 
 directly from the fact that a rotation of data is provided for in each buffer. 
 
kl 
 
 Short duration processes of time less than a rotation must use less than a 
 rotation of data for input and less than a rotation of data will be output. 
 Thus, the solutions given earlier in this chapter with the above 
 modification provides a means of generating no latency solutions for the 
 previously considered problems. This will be effective no matter what the 
 duration of the sweep of a process over its data base. 
 
k2 
 3- OTHER SWEEPING FROBLENE 
 
 3.1 Introduction 
 
 This chapter considers a more general program structure than the 
 last chapter. The generalizations include more diversified data transmissions 
 than n-process problems and sweeping patterns which are not row by row as 
 above, but regular sweeping patterns as are found, for example, in matrix 
 operations. In both cases a significant alteration of the previous problem 
 formulations is the explicit merging of more than one input data set to a 
 process . 
 
 In previous cases only a single data entity was assumed to be input 
 to each process from other processes. The problems considered here permit 
 several such entities to be merged and input to the process. An important 
 requirement for this procedure is that the combined data rate of all sets of 
 data to be input to a node does not exceed the bandwidth of the disk. This 
 ensures the ability to interleave the data blocks on a single logical track 
 without overlap. 
 
 3.2 Generalized Data Transmissions 
 
 Fig. 3-1 a shows the relationship of processes to data transmissions 
 for the problems considered thus far. Fig. 3-1 b shows a more general type 
 of structure. The nodes of this code graph correspond to operations which 
 occur iteratively to variables over an entire data entity. As such, they 
 have the same meaning as nodes in the n-process mesh problems: each node 
 corresponds to a computational process. Since this paper will consider only 
 serial processors, code graphs of the form depicted in Fig. 3-1 b can be dis- 
 torted into that of Fig. 3-1 c. This procedure allows a structure which shows 
 
U3 
 
 v 
 
 i 
 
 
hk 
 
 the temporal sequence of process execution, and is not in general unique. 
 That is, given an arbitrary code graph, there may be several distorted ones, 
 each of which corresponds to an allowable sequence of execution over the 
 processes . 
 
 A comparison of Fig. 3-1 a and c show that the only real difference 
 introduced into the problem is that of the data transmissions. There are two 
 methods of dealing with this new problem. One of these in fact, involves one 
 further mapping which makes the structure of Fig. 3-1 b become that of the 
 type depicted by Fig. 3-1 a. 
 
 All of the data operated upon by all of the process can be lumped 
 
 into one large data array. This is done such that the variables of each of 
 
 the original smaller arrays are uniformly distributed from the beginning to 
 
 the end of the new combined array. This large array can be thought of a 
 
 single mesh. This mesh is used as if by an n-process program in which each 
 
 process has the entire array as input data. However, it is only that subset 
 
 of the large array which any particular process requires that is used by that 
 
 process. The remainder is input to buffer memory, but need not be routed to 
 
 primary memory. This unused data is recombined with the updated data from 
 
 the process (output data) and then output to the disk. 
 
 The large combined data array is thus input and subsequently output 
 
 for each process in the problem. It is constantly being density buffered and 
 
 origin shifted. One obvious drawback of this type of solution is that the 
 
 entire data base of any such problem must be streamed through buffer memory 
 
 for each process. This would require a bandwidth from disk to buffer memory 
 
 (and, of course, from buffer memory back to the disk) very much higher than 
 
 if just the necessary data had to be input. This is a problem only in those 
 
h5 
 
 cases where the disk to be used has a lower bandwidth than is required by- 
 using the above method. In such cases, the data transmission must be treated 
 differently. 
 
 For transmission of required data only, the following observations 
 are made : 
 
 The combined input data rate to each process may 
 not exceed the maximum deliverable by the disk. This is 
 the sum of the data rates corresponding to each of the 
 inputs to a multi-input node. 
 
 Merging these multi-inputs for a process is not a 
 problem. As long as the data rate requirement is met, the 
 various input data streams may be interleaved when output 
 to the disk before the merging retrieval. 
 
 The sequence of execution of the processes in a 
 code graph is not important. It is assumed that any 
 sequence under consideration is legal in the sense that 
 any process which must supply data for another is executed 
 first. For any sequence under consideration, then, the 
 only matter this paper is concerned with is the data 
 transitions between nodes. For this more general problem, 
 however, the data transmissions need not be between 
 adjacently executed processes. 
 
 The difficult problem therefore is that of multiply routed output 
 data from a process. If data are updated at some particular data rate (cor- 
 responding to the speed of that process which does the updating), and required 
 
k6 
 
 by several different processes at differing data rates, conflicts may develop. 
 If this data were to be required at rates both faster and slower than the 
 original rate, that process which outputs the data can be called neither a 
 fast nor a slow process. 
 
 If it is to prepare its output data for a faster process, density 
 buffering must occur during the process' execution. If these same data are 
 to be ready for a slower process, it should be output at its generated data 
 rate and density buffered while the slower process computes. Clearly, both 
 of these cannot occur at the same time, unless many redundant images of many 
 of the data blocks are to be cluttering up the disk. 
 
 Acutally there are two fairly simple methods of handling this 
 situation. The more simple one to explain is that of "speeding up" the data. 
 
 When a data entity is first placed on the disk, it is put there at 
 
 the highest density required by any of the set of processes which uses this 
 
 array as input data. Similarly, output data from any process are density 
 
 buffered to the highest data rate that any of the successor nodes will require. 
 
 Each process may therefore expect to have to density buffer "slowing down" its 
 
 input data. This scheme would require an input and an output buffer, each of 
 
 size Q to do density buffering. Intuitively, to completely eliminate latency, 
 
 an additional (input or output) buffer of size Q, would also be needed to perform 
 
 origin shifting. 
 
 It will be shown in a later chapter that density buffering and 
 
 origin shifting may be accomplished concurrently in the same buffer of size 
 
 Q. For the purposes of this chapter, the amount of buffer memory required 
 
 for the above scheme is bounded by 30, • 
 
 Another way of handling the problem of output data blocks which 
 
^7 
 
 are routed to several processes is to consider the entire sequence of accesses 
 of those blocks. An allowable sequence is assigned to the code graph. All 
 output data are placed on the disk at that density corresponding to the next 
 process which will use said data as input data. This much of the scheme is 
 identical to the n-process problem's solution. The difference arises in this 
 problem because the output data of this successor process is not, in general, 
 an updated version of its input data. Thus, while this process is executing, 
 both its output data and its input data (if used as input data for other 
 processes) must be in general density buffered. As long as the maximum data 
 transmission rate is not exceeded, this method will accomplish zero latency 
 scheduling with only the 2Q, buffer memory as used by the n-process solution. 
 
 3.3 Non-Standard Sweeping Patterns 
 
 The schemes of Chapter 2 have provided solutions for processes which 
 were assumed to sweep their respective data entities. They were expected to 
 operate on the entire set of their input blocks on each sweep. A large class 
 of problems however does not meet this requirement. Matrix operations are one 
 such example. In this section we will consider zero latency solutions for 
 such problems. 
 
 McKellar and Coffman in [ll] show that savings can be effected for 
 this type of problem by accessing the blocks in "good" ways relative to the 
 partitioning scheme of the large matrices. They are concerned with efficiency 
 of storage and minimization of block accesses, whereas we are not concerned 
 with storage efficiency and permit any number of accesses since they are 
 planned so as not to introduce any latency. 
 
 Consider two matrices A and B which are to be multiplied, the 
 
U8 
 
 resultant matrix stored as C. (i.e. C<- A x B) Such operations for the purposes 
 of this paper are assumed to be performed for matrices very much larger than 
 the available primary memory. Thus, the matrices are partitioned into many 
 smaller arrays. One such partitioning is shown in Fig. 3-2. Such partitions 
 are again called blocks. Both matrices A and B have been partitioned into 
 blocks. Each such block is assumed to contain several elements of the original 
 matrix, but not so many as to preclude several such blocks from occupying 
 primary memory at the same time. These blocks are further assumed to be 
 square. That is, they contain as many columns as rows of the original matrix. 
 The nomenclature of these blocks is obvious from Fig. 3-2. Multiplication of 
 
 the partitioned matrices is accomplished in the standard way, i.e. 
 
 n 
 
 C. . = S A.. B. .. 
 
 ij k=l J 
 
 The problem evident in most matrix operations quickly becomes 
 
 obvious as soon as one begins to consider the initial disk layout for this 
 
 r 
 
 simple matrix multiply operation: Matrices simply do not get operated upon 
 in a standard row-by-row or column-by-column sweeping fashion. Each element 
 of, say, the A matrix will be multiplied by (different) elements of the B 
 matrix many times and while there are many different possible orderings of 
 such operations, none will allow the two matrices to be simply swept at the 
 same rate and produce the desired result. 
 
 Another matter should also be considered at this point. While 
 considering the order in which the blocks will be operated upon, it should 
 be recognized that it would be desirable to have a standard matrix layout 
 for, say, multiplication. That is, for obvious reasons, an ordering over the 
 blocks which was a function of the matrix's size or of whether it was being 
 
k 9 
 
 a. 
 GO 
 
 • 
 • 
 • 
 
 GO 
 
 • • • 
 
 • • • 
 
 • 
 • 
 • 
 
 z 
 < 
 
 • • • 
 
 < 
 
 • 
 
 
 • 
 
 • 
 
 
 • 
 
 • 
 
 
 (VJ 
 
 < 
 
 
 w 
 
 —* 
 
 -4 
 < 
 
 r-4 
 
 CVJ 
 
 < 
 
 • • • 
 
 <? 
 
 OJ 
 
 I 
 
 bo 
 
 •H 
 
50 
 
 pre- or post-multiplied would be unsatisfactory. If, for example, partitions 
 had to be placed in either row or column order, depending upon whether they 
 were being pre- or post-multiplied, many matrices would have to be stored 
 redundantly or have the ordering of their blocks restructured several times 
 during the execution of the program in which they were used. Thus, a single 
 standard format is desirable. 
 
 Solutions to this type of problem use the following basic technique. 
 When only one of several data entities which are to be merged is required to 
 be swept and the others will only have a few of their data blocks used in 
 the same interval, all of these blocks may be stored on the disk at the 
 appropriate spacing for the swept one. It is always possible to retrieve 
 from the disk a few blocks of an entire array as long as this retrieval rate 
 is less than that rate at which the data are recorded on the disk. 
 
 For matrix operations, there are several ways in which to accom- 
 plish a zero latency solution. One such method for matrix multiplication 
 will be shown. 
 
 k n 
 
 Bii 
 
 '12 
 
 B 
 
 12 
 
 *13 
 
 • • • 
 
 LOGICAL DISK TRACK 
 
 Fig. 3-5 
 
51 
 For the operation C<-AxB, let A and B be layed out on the disk 
 such that their respective blocks do not overlap and the distance, s, between 
 successive blocks corresponds to the time for a full matrix multiplication of 
 two such partitions. Both sets of partitions are in row order. Refer to 
 Fig. 3-3- 
 
 Clearly, the blocks cannot simply be retrieved in the order in whi ch 
 they appear as pairs along the disk track. The order of retrieval is as 
 follows : 
 
 Pick up A and retain in primary memory while 
 the first row of B is input and multiplied. That is, 
 bring in B, , , B lp , ... B one at a time and pre- 
 multiply by A , , forming the first partial sums of 
 the first row of C. A superscript will be used to 
 
 denote the partial sum. Thus C , C , ... C have 
 been computed. This resultant data is handled in one of 
 
 two ways. 
 
 If this row of blocks of C would occupy more than one 
 rotation, , when placed on the disk with the same spacing 
 as A and B (Ps >i), the C partitions may be output as 
 they are computed. 
 
 If this row of C would not occupy too much of 
 primary memory to be simply contained there (that is 
 if primary memory were at least as large as P plus a 
 few blocks) these blocks may simply reside in primary 
 memory* 
 
 ^ 6 i hat the tW ° conditlons are n °t mutually exclusive leaving some room 
 for choice in many cases. * ame room 
 
52 
 
 The operations to occur now are the multiplication 
 of A- p by the first row of blocks of B. A- will have 
 been picked up from the disk sometime during the last 
 rotation before it was needed. The first row of 
 blocks of B will have similarly been origin shifted such 
 that its first block also is in primary memory at the 
 time at which multiplication is to commence. If the 
 first partial sum row of C was output to the disk, its 
 first block has likewise been input and is in primary 
 memory . 
 
 Note that at this point two operations will be 
 occurring: Multiplication of an A block by a B block 
 and an addition of this result to the first partial 
 sum of the corresponding C partition, yielding the 
 second partial sum of that C block. 
 
 This extra addition operation (the addition of the 
 previous partial sum) will cause the time between suc- 
 cessive operations to be longer than it was for just the 
 matrix multiplication. Thus input data to the processor 
 will be density buffered in the input buffer. 
 
 The new (second partial sum) C blocks will be 
 density buffered in the output to the higher data 
 rate of the original A and B blocks - which will 
 remain intact on the disk. 
 
 Successive rows will follow in like order: 
 
53 
 
 The next block from the first row of A is multiplied 
 by the next row of blocks of B and. added to the 
 respective last partial sum row of C blocks. The 
 A and B blocks are density buffered (slowed down) 
 in the input buffer. The currently generated partial 
 sum C blocks are, if output, density buffered (sped 
 up) in the output buffer. Finally the first row of 
 blocks of C will have been computed and placed on the 
 disk at a spacing equal to that of the A and B blocks. 
 The procedure then continues for blocks from successive 
 rows of the A matrix and yielding successive rows of 
 the C matrix. 
 
 This procedure effectively sweeps one of the matrices once and the 
 other many times to produce the product matrix. There are obvious variations 
 in which the original matrix blocks may be stored by columns or diagonals 
 instead of the original rows. The important point is simply that the one 
 matrix whose data are required most rapidly (in this case, B) is on the disk 
 at a sufficiently high density to be supplied to the processor accordingly. 
 The remaining matrices may be stored at this same data rate and retrieved 
 
 at the slower rate. 
 
 Another possibility with this same scheme is to store all matrices 
 in like fashion, say, in rows. The density at which they are stored is that 
 highest density of all the operations which occur in that particular program. 
 If, for example, a program consisted of only the standard matrix operations, 
 the A and B matrices of the previous example could be stored with their 
 
5h 
 
 blocks spaced appropriately for matrix addition. Thus they could be added 
 with no latency and multiplied just as described in the example except that 
 the B matrix will always be density buffered on input. This would effectively 
 "slow down" the data from the addition rate to the multiplication rate. 
 
 3-U Discussion 
 
 This section has given several ideas for approaches to the parti- 
 cular class of problems mentioned. No algorithms were given for all pro- 
 grams in this class because one would almost have to exhaustively consider 
 each individual program (e.g. each possible sequence of all matrix operations 
 in a matrix program) . What we have tried to show, however, is that for any 
 specific program a zero latency solution can always be obtained when one of 
 the following conditions is satisfied: 
 
 i) the i/o sequences have some identifiable 
 regularity, or 
 
 ii) the input data bandwidth requirements of the program 
 are very much less than the bandwidth of disk. This condition 
 will allow skewing of data blocks on the disk such that there will 
 be little if any overlap between blocks of different entities. 
 That is, additional disk bandwidth capacity will help preclude 
 attempts to retrieve more than one data block from identical 
 disk sectors at the same time. 
 
 High data rate programs which retrieve input data in non regular sequences 
 may qualify for zero latency solutions as generated by the algorithms in 
 Chapter k. 
 
55 
 k. ARBITRARY PERMUTATIONS OF BLOCKS BETWEEN SWEEPS 
 
 k.l Definition of the Problem 
 
 This chapter deals with the problem evidenced when a partitioned 
 data base is accessed several times but in different orders. Specifically, the 
 blocks remain the same in terms of the variables they represent but are 
 referenced in different sequences during successive passes over the data base. 
 An example of this is a matrix which is accessed first by rows of blocks and 
 then by columns of blocks or diagonals. The accessing need not be as regular 
 as rows or columns, however. This chapter deals with accessing in any random 
 order for each pass over a given data base. 
 
 In effect, the solution given provides for permutations of blocks 
 of a data entity on each pass through that data entity. This. is accomplished 
 as with previous results in a manner transparent to the processor. Judicious 
 use of the disk and buffer memory will provide the blocks to the primary 
 memory in their permuted order continuously. 
 
 This section will explain the mechanism for arbitrary permutations only 
 and will ignore several other considerations. For the purposes of this section, 
 then, it will be assumed that density buffering and origin shifting are 
 unnecessary. Furthermore, it is also assumed that each block in an array 
 will be accessed exactly once during each pass. Later sections will deal with 
 combining permutations with origin shifting and density buffering and with 
 additions and deletions of blocks while they are being permuted. 
 
 The mechanism for arbitrary permutations involves the following steps: 
 
 i) Rectangularize the sequence of blocks as they are required for, 
 say, the initial pass. 
 
56 
 
 ii) Obtain from the resultant matrix a sequence of alternating 
 row, column, row permutations which will yield the desired new sequence of 
 blocks. 
 
 iii) Linearize the above result into a linear sequence of the blocks 
 
 iv) Relate the above directly to the operating system. 
 
 The rectangularization is a process whereby the original sequence 
 of blocks is represented in a manner which shows what their positional layout 
 would be on the disk if they were to be directly output. The horizontal 
 dimension corresponds to the number of such blocks which can occupy a physical 
 track (i.e. a single rotation of the disk). The vertical dimension, then, 
 corresponds to the number of disk rotations the entire set of blocks will 
 occupy at their given density. This is given more formally below. 
 
 Consider the original sequence of data blocks (the order from which 
 they will be permuted) . Without loss of generality these may be relabelled 1 
 to n in the order of this original sequence. 
 
 These n elements are placed in a rectangular array. The width of 
 this array, or horizontal dimension, h, is the number of blocks which will 
 occupy one physical rotation of the disk at the required input data rate of 
 the process which operates on it. In general, the timing may be such that 
 there is a non- integral number of partitions corresponding to an exact disk 
 rotation. In such cases, that integer just less than the non-integral number 
 of partitions which fills a disk rotation will be used. This introduces a 
 small amount of latency into the solution. This latency is usually negligible, 
 however . 
 
 The height or vertical dimension of the matrix, v, corresponds to 
 the number of rotations that all of the n blocks will occupy on the disk. 
 
57 
 
 That is, v is the smallest integer not less than 
 
 This is given by v = 
 h/n. 
 
 This matrix is shown schematically in Fig. k—1. 
 
 Fig. k-l 
 
 Note that in general n < vh and the matrix has been "packed" with the numbers 
 n+1 to vh even though there are no such blocks. Each entry in such a matrix 
 corresponds to a particular block and will frequently be referred to as an 
 element of the matrix. 
 
 This matrix is, quite literally, a map of the sequence of blocks 
 on the disk if they would have been directly output to the disk . That is, 
 each row represents a rotation of data blocks around the disk. The entire 
 matrix represents the entire sequence placed on the disk with respect to some 
 arbitrary origin. This matrix will be referred to as the original matrix 
 and is the rectangularization of the sequence of blocks. 
 
58 
 
 The sequences of permutations over the rows and columns of the 
 original matrix are obtained by first forming the resultant matrix . The 
 resultant matrix is obtained in a manner similar to that of the original matri> 
 The final permuted sequence of blocks is rect angular i zed as if going back- 
 wards. That is, a matrix representation of the n partitions is obtained 
 which gives the positions the blocks would be in on the disk if they were 
 already in permuted order. 
 
 An example is given to aid in the explanation of the algorithm 
 which yields the desired transformations. For an original 20 blocks labelled 
 1-20 in the order they were computed upon on the original pass, assume they 
 are next needed in the sequence 8, 12, 20, 5, 2, 13, l8, 19, 1**-, 6, 10, 15, 
 17 16, 1+, 3, 9, 7, 11, 1- For h = 5, v = h the original and resultant 
 matrices are obtained trivially and shown in Fig. k-2. 
 
 1 
 
 2 
 
 3 
 
 4 
 
 5 
 
 6 
 
 7 
 
 8 
 
 9 
 
 10 
 
 11 
 
 12 
 
 13 
 
 14 
 
 15 
 
 16 
 
 17 
 
 18 
 
 19 
 
 20 
 
 8 
 
 12 
 
 20 
 
 5 
 
 2 
 
 13 
 
 18 
 
 19 
 
 14 
 
 6 
 
 10 
 
 15 
 
 17 
 
 16 
 
 4 
 
 3 
 
 9 
 
 7 
 
 11 
 
 1 
 
 ORIGINAL 
 MATRIX 
 
 RESULTANT 
 MATRIX 
 
 Fig. k-2 
 
59 
 
 What will be obtained will be a set of row permutations followed 
 by a set of column permutations followed by yet another set of row permutations. 
 These permutations will be such that when successively applied to the original 
 matrix they will produce the resultant matrix. 
 
 A set of row permutations consists of h permutations of v elements 
 each. That is, each row is permuted independently and according to the 
 particular pattern determined for it. Similarly, a set of column permutations 
 consists of v permutations of h elements each, where each column permutation 
 is independent of the others. 
 
 k.2 The Algorithms for Arbitrary Permutations 
 
 One important point should be kept in mind during the following 
 discussion. A row permutation is in fact a switching between columns. That 
 is, when a row of elements is permuted, the elements change columns. Similarly, 
 column permutations cause row changes to be effected. 
 
 Basically the sequence of row- column- row permutations which yields 
 the total permutation over the entire matrix is obtained by first finding 
 the middle step - the column permutations. Since the sets of row- column-row 
 permutations are not in general unique, any satisfactory set of column permu- 
 tations is generated first . The row permutations then are generated by working 
 "outward" from the set of column permutations just obtained. 
 
 Intuitively, the set of column permutations swaps the elements from 
 their original rows into the rows they will occupy in the resultant matrix. 
 The set of row permutations which precedes this aligns the elements so that 
 there will be no conflicts when they are row swapped. The final set of row 
 permutations merely shifts the elements into their positions which make up 
 the resultant matrix. 
 
60 
 
 The algorithm which describes this process is now given, performing 
 the operations on the example given earlier. 
 
 Standard Permutation Algorithm : 
 
 I. Construct a transition matrix which shows the transitions that each block 
 makes in going from the original to the resultant matrix . This matrix will 
 be quite large: hv by hv. In the standard fashion, we use a 1 to indicate 
 
 a transition and or blank otherwise. This matrix is the Total Transition 
 Matrix or TTM and is shown in Fig. k-3. 
 
 II. Abstract from the TTM a matrix which just shows the necessary transitions 
 between rows of the original and resultant matrices. 
 
 This is accomplished by first partitioning the TTM into partitions 
 which represent the row transitions as shown in Fig. k-k. That is, the TTM 
 
 is partitioned into k by h partitions. 
 
 2 2 
 
 The partitioned TTM will always have v partitions of h of the 
 
 original entries each. 
 
 This matrix is then reduced into the Row Transition Matrix or RTM. 
 This step consists simply of collapsing the partitioned TTM into a v x v 
 matrix, each entry of which is the sum of the entries in the corresponding 
 position of the partitioned TTM. For example, the partition in the first 
 row, first column of the partitioned TTM is summed to yield the element of 
 the RTM which appears in the first row and first column. The RTM for this 
 example is shown in Fig. k—5- The Roman numerals I to IV are used arbitrarily 
 to label the rows 1 to k of the original/resultant matrices, respectively. 
 
 The interpretation of the RTM is as follows: The entry in the i 
 
 row, j column is the number of elements in the i row of the original 
 
 th 
 matrix which appear in the j row of the resultant matrix. 
 
61 
 
 
 1 
 
 2 
 
 3 
 
 4 
 
 5 
 
 6 
 
 7 
 
 8 
 
 9 
 
 10 11 
 
 12 13 
 
 14 15 
 
 16 
 
 17 
 
 18 
 
 19 20 
 
 1 
 
 
 1 
 
 2 
 
 
 1 
 
 
 3 
 
 
 1 
 
 
 4 
 
 
 1 
 
 
 5 
 
 
 1 
 
 
 6 
 
 
 1 
 
 
 7 
 
 
 1 
 
 
 8 
 
 1 
 
 
 9 
 
 
 1 
 
 
 10 
 
 
 1 
 
 
 11 
 
 
 1 
 
 
 12 
 
 
 1 
 
 
 13 
 
 
 1 
 
 
 14 
 
 
 1 
 
 
 15 
 
 
 1 
 
 
 16 
 
 
 1 
 
 
 17 
 
 
 1 
 
 
 18 
 
 
 1 
 
 
 19 
 
 
 1 
 
 
 20 
 
 
 1 
 
 
 Fig. l+_3 Total Transmission Matrix 
 
62 
 
 
 1 
 
 2 
 
 3 
 
 4 
 
 5 
 
 6 
 
 7 
 
 8 
 
 9 
 
 10 
 
 11 
 
 12 
 
 13 
 
 14 
 
 15 
 
 16 
 
 17 
 
 18 
 
 19 20 
 
 1 
 
 
 1 
 
 2 
 
 
 1 
 
 
 3 
 
 
 1 
 
 
 4 
 
 
 1 
 
 
 5 
 
 
 1 
 
 
 6 
 
 
 1 
 
 
 7 
 
 
 1 
 
 
 8 
 
 1 
 
 
 9 
 
 
 1 
 
 
 10 
 
 
 1 
 
 
 11 
 
 
 1 
 
 
 12 
 
 
 1 
 
 
 13 
 
 
 1 
 
 
 14 
 
 
 1 
 
 
 15 
 
 
 1 
 
 
 16 
 
 
 1 
 
 
 17 
 
 
 1 
 
 
 18 
 
 
 1 
 
 
 19 
 
 
 1 
 
 
 20 
 
 
 1 
 
 
 Fig. k-k Partitioned TTM 
 
63 
 
 I 
 
 n 
 m 
 m 
 
 iniE 
 
 2 
 
 
 1 
 
 2 
 
 1 
 
 1 
 
 1 
 
 2 
 
 1 
 
 2 
 
 1 
 
 1 
 
 1 
 
 2 
 
 2 
 
 
 Fig. 4-5 Reduced Transition Matrix 
 
 Thus, for example, the entry in the first column, first row indicates 
 that two blocks which appear in the first row of the original matrix also 
 appear (somewhere) in the first row of the resultant matrix. These two 
 blocks are (from the TTM) the ones which move from position 2 to position 5 
 and from position 5 to position k. 
 
 Note that the sum of each row and each column in the TTM is 1. The 
 sum of row and each column of the RTM must therefore be h, or in the example, 5< 
 III. Decompose the RTM into h individual transition matrices, each of which 
 will correspond to a single column partition . 
 
 A permutation of k elements can be represented by a k x k matrix 
 in which the sum of each row and column is unity. The TTM is one such matrix. 
 It represents a permutation of hv elements. 
 
 The existence of this decomposition is shown by Ryser [13 1 as 
 Theorem 5*3 and Hall [8] as Theorem 5-1 -9' Appendix A gives the listing of 
 a program which performs this decomposition. 
 
 For the example given, one such decomposition is shown in Fig. 
 k-6. 
 
6k 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 1 
 
 
 (0) 
 
 (b) 
 
 (C) 
 
 (d) 
 
 (e) 
 
 Fie. k-6 
 
 IV. Assign each of these component matrices to a column to obtain the column 
 permutations. The sets of row permutations then follow . 
 
 These component matrices are then referred to as column transition 
 matrices. Each of the 5 column transition matrices in Fig. k-6 corresponds 
 to a permutation of k elements. These may be arbitrarily assigned to the 
 5 columns. Thus, if the matrix in k-6 (a) corresponds to the permutation in 
 the first row, it indicates the following permutation: the element in the 
 first row remains there, the element in the second row is moved to the third 
 row, the element in the third row is moved to the fourth row, the element in 
 the fourth row is moved to the second row. 
 
 For the purposes of this example, the matrices in Fig. 4- 6(a) to 
 (e) are assigned to columns 1 to 5, respectively. 
 
 The remainder of the process consists of identifying the particular 
 elements from the original matrix that correspond to those being permuted in 
 the columns. From this information the row permutations are easily obtained. 
 
 The first set of row permutations is obtained by examining the set 
 of column transition matrices and comparing these to the partitioned TTM, a 
 row at a time. 
 
65 
 
 In the example, the l's appearing in the column transition matrices 
 are assigned to particular transitions in the partitioned TTM. For example, 
 the column transition matrix which corresponds to the first column permutation 
 has a non-zero entry in row 1, column 1. This is matched with a non-zero entry 
 in the partitioned TTM which is in that partition that is in the first row 
 and first column of blocks. In this particular case there are two possibilities: 
 the transition from position 2 to position 5 and the transition from position 
 5 to position k. Either of these two transitions is acceptable for the matching. 
 
 Suppose it is the transition from position 2 to position 5 which is 
 matched to this entry in the first column transition matrix. Since this is 
 the first column permutation, the matching has the following significance: 
 
 The element in position 2 of the original matrix 
 should be (during the first set of row permutations) per- 
 muted to column 1. The column permutations will insure 
 that it gets relocated in the correct row of the resultant 
 matrix . 
 
 Thus the particular block under consideration gets "permuted" to 
 the same row it started in and the final row permutation places it in position 5' 
 This algorithm can be concisely given as: 
 
 For each non-zero entry in a particular column 
 transition matrix, match it with a particular transition 
 from the partitioned TTM. For this transition, the 
 element will first be row permuted from its position in 
 the original matrix to that column (in the same row) 
 which corresponds to the particular column for the 
 
66 
 
 column transition matrix being considered. After 
 being column permuted, this partition is then row 
 permuted into its position in the resultant matrix. 
 
 This procedure is followed for the remainder of the example. Fig. 
 4-7 shows the column permutation matrices in which the numbers of the blocks 
 with which their non-zero entries are matched have been indicated. 
 
 2 
 
 
 5 
 
 
 4 
 
 
 1 
 
 
 3 
 
 
 10 
 
 
 7 
 
 
 9 
 
 
 6 
 
 
 8 
 
 
 11 
 
 
 13 
 
 
 12 
 
 
 15 
 
 
 14 
 
 
 18 
 
 
 6 
 
 
 19 
 
 
 20 
 
 
 17 
 
 
 (a) 
 
 (b) 
 
 (C) 
 
 (d) 
 
 (e) 
 
 Fig. 4-7 
 
 Since the matrices (a) to (e) shown in Fig. 4-7 correspond to 
 columns 1-5, the original matrix after the first set of row permutations is 
 immediately obtained. This is shown in Fig. 4-8 (a) . 
 
 2 
 
 5 
 
 4 
 
 1 
 
 3 
 
 10 
 
 7 
 
 9 
 
 6 
 
 8 
 
 11 
 
 13 
 
 12 
 
 15 
 
 14 
 
 18 
 
 16 
 
 19 
 
 20 
 
 17 
 
 (a) 
 
 2 
 
 5 
 
 12 
 
 20 
 
 8 
 
 18 
 
 13 
 
 19 
 
 6 
 
 14 
 
 10 
 
 16 
 
 4 
 
 15 
 
 17 
 
 11 
 
 7 
 
 9 
 
 1 
 
 3 
 
 (b) 
 
 Fie. 4-8 
 
 8 
 
 12 
 
 20 
 
 5 
 
 2 
 
 13 
 
 18 
 
 19 
 
 14 
 
 6 
 
 10 
 
 15 
 
 17 
 
 16 
 
 4 
 
 3 
 
 9 
 
 7 
 
 11 
 
 1 
 
 (C) 
 
67 
 
 Fig. 1]-- 8(b) shows the matrix of Fig. 4-8 (a) after being column 
 permuted. Fig. 4-8(c) is the resultant matrix and is obtained by row permu- 
 ting the matrix of Fig. 4- 8(b) . 
 
 An equivalent algorithm is given by Opferman and Tsa-Wu in [32]. 
 They are, however, concerned with switching signals instead of transmitting 
 data blocks, but the analyses are identical. 
 
 Thus any rectangular array of elements may be randomly permuted by 
 a set of row permutations, a set of column permutations, and a final set of 
 row permutations. 
 
 k.3 Permuting in a Memory Hierarchy 
 
 The preceding method is implemented for a head-per- track disk system 
 by performing row permutations in intermediate (bulk storage) random access 
 memory. The first row permutation is accomplished by accumulating blocks 
 in an output buffer as they come from the primary memory. When a row is 
 accumulated, these blocks start to be output to the disk in their permuted 
 order. The final row permutations are performed in a similar manner. One 
 row is accumulated in an input buffer before being routed to primary memory. 
 The blocks of this row are then sent to primary memory in their permuted order. 
 
 Column permutations must take place between the time the bbcks leave 
 the output buffer for the disk and the time they arrive at the input buffer 
 from the disk. They are accomplished by simply switching the heads of the 
 disk appropriately. Fig. 4_8(a) shows a map of the allocation of the blocks 
 on the disk. Since a column represents a particular angular area on a disk, 
 the disk may read any of the blocks in a particular column with equal ease. 
 Thus the sequence shown in Fig. 4-8(b) can be obtained by switching the read 
 heads of the disk. Note, however, that the effect of column permutations 
 
68 
 
 may also be obtained by switching the write heads of the disk. Thus either 
 the read or the write heads may be switched to achieve the effect of column 
 permutations, whichever one would require the least switching time on a par- 
 ticular disk. 
 
 The amount of buffer memory required is simply that for a row of 
 blocks in the input buffer and an equal amount for the output buffer. Since 
 a row of blocks was previously defined to be a disk rotation, the required 
 buffer memory for this scheme is 2Q. 
 
 k.k Special Cases of Permutations - Addition and Deletion of Blocks 
 
 The previous sections have dealt only with permutations over the 
 entire data base. All partitions over that base were assumed to be computed 
 upon for each iteration. Real programs frequently do not access their data 
 in that fashion. Some blocks may be abandoned after several iterations and/ 
 or some may first appear in the computation only after some number of iterations. 
 This section deals with such modifications to the permutation algorithm. 
 
 The deletion of blocks after having been used for several iterations 
 is a relatively straightforward procedure: 
 
 Deletion Algorithm : 
 
 Assume that blocks are to be subsequently output 
 (otherwise they are simply overlayed in primary 
 memory) . The resultant permutation matrix for the 
 first iteration in which these blocks do not appear is 
 generated to show these blocks at the end of all those 
 blocks which will be used. That is, the blocks to be 
 deleted occur in any sequence after the sequence of 
 blocks which are not deleted. The permutation algo- 
 

 69 
 
 rithm is used to obtain the row-column-row permutations 
 as before, except that the deleted blocks are not picked 
 up after the others. Thus, they are "shifted" (permuted) 
 to the end of the line and then truncated from the 
 sequence and left on the disk. 
 
 Addition of blocks at iterations other than the first is accomplished 
 in much the same way. The blocks to be added are placed on the disk at the 
 locations from which they will be read for the first iteration in which they 
 appear. These locations are easily obtained from the standard permutation 
 algorithm. 
 
 Addition Algorithm : 
 
 All blocks which will later be added to the set being 
 permuted are entered in the set notationally by appending 
 them at the end of the sequence of blocks which are being 
 used. This is done for the last iteration before the new 
 blocks are to be used and they thus appear in the original 
 permutation matrix. They are placed in the resultant 
 permutation matrix at those places at which they are required 
 by the processor. The permutation algorithm will generate 
 the usual row- column- row permutations for all the blocks. 
 The first row permutations are not carried out for the 
 blocks which will be added, however. They are pre-placed 
 on the disk at those locations which the permutation algo- 
 rithm would have placed them had they been computed upon 
 during the previous iteration. 
 
70 
 
 Thus blocks may be added and deleted at any time during the itera- 
 tive computations over the data base. As explained in the next section, how- 
 ever, blocks may not be deleted and later added by application of the deletion 
 and addition algorithms. 
 
 k . h . 1 Other Special Cases 
 
 When a block is deleted it is placed by the permutation algorithm 
 at some particular angular position on the disk. This position may be any- 
 where insofar as the computational process is concerned because it is finished 
 with this block. The permutation algorithm, however, puts it in one of a 
 limited set of positions which are consistent with its goal of performing the 
 necessary permutations. 
 
 Blocks to be added, on the other hand, must be placed in particular 
 angular positions on the disk before being input. In general, there is no way 
 to insure that a block be abandoned at precisely that position from which it 
 should be retrieved for subsequent interations over the data base. In fact, 
 cases can be constructed which preclude this fortuitous placing of blocks on 
 the disk. Thus the procedures described above for additions and deletions do 
 not permit the addition of blocks which have already been used and deleted. 
 
 There are however several ways in which this situation may be handled. 
 For processes which do not require data at the disk's maximum rate, the following 
 methods are feasible. (Processes which require the disk to be constantly 
 transmitting data must have additions and deletions of their blocks adhere 
 to the constraint implicit in the section on additions and deletions. That 
 is, they may not delete a particular block and then add that same block several 
 iterations later.) 
 
 When the process requires input at a lower rate than the disk's 
 maximum capacity, data can be made to appear to "move" on the disk. This is 
 
71 
 
 accomplished by reading from the disk some data which are not required by the 
 process for some time to come. These data are kept in buffer memory and 
 subsequently output to their final destination on the disk, without having 
 been processed. The input and output of these "moving" data are performed 
 interstitially with the regular input and output of blocks for the current 
 process. 
 
 Another way to handle the placing of deleted blocks so that they 
 may subsequently be added is to simply keep the deleted blocks in buffer 
 memory until the disk rotates to those positions corresponding to allowable 
 locations for subsequent addition. This is possible because the buffers, if 
 matched to the data rate of the disk, will be partly unused and hence available 
 for temporary storage of blocks. 
 
 Yet another method which makes use of the fact that the process is 
 not demanding data at the disk's maximum rate is useful for certain situations. 
 Suppose the entire data base is made up of several variables over, say, a 
 mesh. On successive iterations, different variables will be required as input 
 to the process with perhaps some other variables in common. That is, some of 
 these variables are required on every iteration while others are required only 
 for selected iterations. 
 
 One way to handle what would otherwise be a quite messy addition 
 and deletion problem is include all variables for a particular portion of the 
 mesh in a block. The entire block is then retrieved as an input block while 
 only the requisite variables are made use of (and perhaps updated) . This 
 will work when the virtual data rate made up by lumping variables in blocks 
 is less than that of the disk's maximum capacity. 
 
72 
 
 U.5 Combining Permutations with Origin Shifting and Density Buffering 
 
 The permutation algorithm stores blocks on the disk in such a way 
 that it no longer is reasonable to consider them as a continuous sequence. 
 Rather, a row, or rotation, of these blocks is input from several tracks on 
 the disk. These blocks are further permuted before being used by the compu- 
 tational process. The concept of origin shifting deals with a continuous 
 stream of data (in this case, blocks) which must be anticipated so that its 
 origin is available to the processor when needed. 
 
 Since the data blocks when being permuted have no real origin, there 
 is no reason for the operation previously described as origin shifting. Before 
 a process will be able to use its first row of input data, the blocks comprising 
 this row must already have been read from the disk. This is because a row 
 permutation must be performed first. It makes no difference at what angular 
 position reading was initiated. The procedure is simply to begin transmitting 
 the blocks from the disk to buffer memory one disk rotation before they must 
 be sent on to the primary memory. At the end of one rotation, the necessary 
 blocks will have been transmitted to the input buffer, regardless of the order 
 in which they were transmitted. 
 
 The same lack of data origin allows for density buffering in much 
 the same way. The algorithm for this is as follows: 
 
 Density Buffering Algorithm (input) : 
 
 One disk rotation before a row of blocks is required 
 in primary memory, transmission is initiated from the 
 disk to the input buffer. This input ceases after a single 
 rotation and commences again one disk rotation before the 
 next row of blocks is required. 
 
73 
 
 This will always allow for the permutation of data blocks in suc- 
 cessive rows and incorporates what was previously referred to as density buf- 
 fering as well as origin shifting. Note that this entire procedure is des- 
 cribed as taking place in the input buffer. Thus, for input, no additional 
 buffer memory is required than for simply the permutation algorithm. 
 
 Output of data is handled in an identical fashion: 
 
 Density Buffering Algorithm (Output) : 
 
 When a rotation of data (at the density at which it 
 is to be stored on the disk) is accumulated in the output 
 buffer, it is output to the disk. The output buffer then 
 continues to accumulate blocks until another rotation 
 is collected. 
 
 In this way mismatches of data rates are provided for along with 
 reorderings of the accesses of the data blocks. The necessary buffer memory 
 is still 2Q, i.e. Q, each for the input and output buffers. 
 
7U 
 5- IMPLEMENTATION CONSIDERATIONS 
 
 5.1 Non-Zero Data Transmission Times 
 
 The methods used in this chapter and indeed those of the remainder of 
 this paper assume no time to transmit data from one level of memory to another. 
 This assumption is made simply to facilitate explanation of the techniques 
 developed. Taking into account the effects of real transmission times is 
 relatively straightforward: 
 
 Instead of retrieving data within a rotation of when 
 it is required, data are retrieved within a rotation plus 
 the total transmission time between the disk and the 
 processor. That is, the transmission times between the 
 disk and buffer memory and between buffer memory and 
 primary memory are added. This time is added to the 
 time of a single disk rotation and the result is the 
 new "lead time" which determines how soon before it is 
 needed that data blocks are retrieved. 
 
 This will always be a very small correction to the 
 time relative to a disk rotation. Nevertheless, to be 
 totally correct, the size of buffer memory should also 
 be augmented by an amount equal to the amount of data 
 which will be read in this additional increment of time. 
 
 5.2 Memory Sizes and Data Rates 
 
 An upper bound on the size of the buffer memory was determined to 
 be 20 . is the maximum amount of data which may be transmitted by the 
 disk in a single rotation. Fig. 5-1 shows some of the larger disks currently 
 
75 
 
 o 
 
 10 
 
 o 
 m 
 
 o 
 
 u 
 
 UJ 
 
 < H 
 O I 
 
 2 
 
 O 
 
 
 z 
 o 
 a. 
 u. 
 
 < 
 
 < 
 
 
 o 
 
 (X) 
 
 o 
 
 IT) 
 
 I 
 
 o 
 
 o 
 
 o 
 
 (33SW) 3WI1 NOIlVlOa XSIQ 
 
76 
 
 available and the values of Q. to which they correspond. Note that Fig. 5-1 
 includes movable head disks as well as head-per-track disks. This is not 
 intended to imply that the algorithms presented in this paper are directly 
 applicable to movable-head disks. Rather, these entries on the graph are 
 meant to be indicative of current disk parameters. It is noteworthy that 
 essentially all of the current production disks yield a < .5 Megabit. 
 Furthermore, most of these are under .25 Megabit. 
 
 The preceding leads to the observation that a separate buffer memory 
 may in fact not be necessary: for 32-bit words, the "buffer" memory require- 
 ment need be in the neighborhood of 8k to l6K words. Thus, for a system com- 
 prised of a processor and any standard disk, the input and output buffers may 
 well be sections of primary memory. 
 
 Not all processor /disk systems can operate in this fashion, however. 
 Some configurations may contain processors which do not provide for the neces- 
 sary additional primary memory. Another possibility is a system making use 
 of the type of disk which is currently being manufactured for the ILLIAC IV 
 computer by the Burroughs Corporation. 
 
 These are head-per-track disks with a period of k-0 msec and a data 
 
 rate of 500 Megabits per second. Q for such a disk, then, is 500 x 10 x 
 
 -3 6 
 k-0 x 10 = 20 x 10 bits. The necessary buffer memory for a system making use 
 
 of such a disk is ^0 Megabits if the processor is likely to require data at the 
 
 disk's rate. Most extant processors, however, will require data at a somewhat 
 
 lesser rate. Fig. 5-2 shows just what this data rate would be for various 
 
 "fast" processors as a function of the number of operations per variable which 
 
 will occur during the computational process. The size of each buffer determined 
 
 by these data rates is termed . 
 
77 
 
 V) 
 
 V 
 
 < 
 
 or 
 o 
 
 UJ 
 
 _l 
 
 CD 
 
 < 
 
 en 
 
 < 
 
 UJ 
 0. 
 
 o 
 
 5 
 
 0T 
 O 
 O 
 
 UJ 
 (T 
 UJ 
 
 u. 
 u. 
 
 o 
 
 tr 
 
 e 
 
 < 
 
 UJ 
 
 <T 
 
 o 
 
 a 
 
 z 
 <x 
 
 en 
 
 UJ 
 
 »- 
 < 
 <r 
 
 o 
 
 <r 
 
 u 
 
 UJ 
 V) 
 
 E 
 o 
 
 o 
 
 UJ 
 
 v> 
 
 O 
 
 x 
 
 Q 
 
 
 tr 
 
 — 
 
 o 
 
 H 
 
 h- 
 
 
 1 > 
 
 m 
 
 < 
 
 * 
 
 _l 
 
 <x> 
 
 -I 
 
 o 
 
 u 
 
 III 
 
 — 
 
 <o 
 
 H 
 
 z 
 
 (/) 
 
 fO 
 
 2 
 
 
 UJ 
 
 n 
 
 2 
 
 o 
 
 1- 
 
 
 Q. 
 
 
 o 
 
 
 Ui 
 
 _l 
 
 CD 
 
 < 
 
 <r 
 
 > 
 
 < 
 
 or 
 
 UJ 
 
 a 
 
 O OJ 
 
 LT\ 
 
 
 QN0D3S/S1I8 
 
 3JLV8 viva 
 
 
 O 
 
 UJ 
 
 O 
 
 CM 
 
 O 
 UJ 
 
 <0 
 
 o 
 
 UJ 
 
 2 
 
 CVJ 
 
 o 
 
 UJ 
 
 2 
 
 o 
 
 UI 
 
 2 
 
 >tsia D3swotr 'jo: Dvmi uoj sua ni -ivsho 
 
 o 
 
 J 
 
78 
 
 5.3 Speed of the Buffer Memory 
 
 The required bandwidth of the buffer memory is the sum of the data 
 rates along all paths to and from it. Fig. 5-3 schematically shows the buffer 
 memory and its data paths. Thus the buffer memory need have a bandwidth of 
 exactly four times that of the disk. 
 
 DISK 
 
 
 *»- 
 
 INPUT 
 
 
 ^W 
 
 PRIMARY 
 MEMORY 
 
 \ 
 
 >* 
 
 * 
 
 I 
 
 OUTPUT 
 
 ^r- 
 
 
 ^M 
 
 
 BUFFER MEMORY 
 
 Fig. 5-3 
 
 Note that implicit in Fig. 5-3 is the assumption that data are routed to 
 the primary memory from the buffer memory in blocks. These blocks need not, 
 however, be identical to those which are recorded on the disk. Although this 
 paper does not mention the need for such procedures, the buffer may be used 
 as a staging area for reformatting data. If it could provide the data at the 
 proper rates, there is no reason why it could not clump k blocks from the 
 disk into a single block which gets routed to primary memory. Similarly, 
 blocks from primary memory could be clumped into larger blocks for disk storage. 
 
 5 .k Implementation of Row Permutations 
 
 As stated in the beginning of this paper, the algorithms described 
 in this paper are predicated upon knowledge at compile or run time of what 
 the data accesses will be. It is a relatively straightforward procedure to 
 determine at that time what the input and output sequences and permutations 
 

 79 
 
 will be. This is done by making use of the known data block access sequences 
 and the timing characteristics of the processor and disk. 
 
 The i/o sequences and permutations obtained before actual execution 
 may well take up more than the available primary memory. How, then, can this 
 information be stored so as to be made use of during execution? One way is 
 as follows. 
 
 For each disk rotation during execution while permuting blocks, the 
 following information must be available to the system. 
 
 i) a sequence listing the (at most h) tracks from which the next 
 rotation's blocks will be retrieved. These can be track numbers without sector 
 information, but if the entire disk address is used, it provides complete 
 information about the next set of inputs from the disk as well as synchronizing 
 the actual computation process with the disk. 
 
 ii) a similar sequence to the one above for output addresses. 
 
 iii) two sets of permutations - one for the input buffer and one for 
 the output buffer. Each of these is a permutation of h elements. Thus each 
 permutation may be (wastefully) stored as a string of h integer numbers. 
 Suppose further that a whole word is to be used for each such number. 
 
 i), ii), iii) above consist of at most ^h words, h is bounded by 
 the number of addressable sectors on the disk, but will obviously be typically 
 somewhat less. For h = 6k, a block of 256 would be necessary for each disk 
 rotation of data. 
 
 A block such as this could simply be stored on the disk inter stitially 
 with respect to the data blocks. It would appear once for each rotation of 
 data which will be retrieved. Note that in the case of density buffering, n 
 rotations of data may actually be retrieved in more than n physical disk rota- 
 tions. Thus these i/o blocks are, in general, retrieved at a frequency less 
 than one per disk revolution. 
 
80 
 
 Each such block contains essentially an input request queue, an 
 output queue and two permutation sequences. These sequences may be assigned 
 as descriptors or tags to their respective blocks as they enter the buffers or 
 they may be encoded in tables by which the system will route the blocks and 
 thus effect the permutations. 
 
 Alternatively, routing may be accomplished by hardware techniques. 
 One possible method is through the use of a Batcher network [2]. In such a 
 system, each data block will enter an input or output buffer with a tag which 
 indicates what its permuted position should be. The transmission from these 
 buffers will be through a Batcher network and thus automatically perform the 
 row permutations. In such an implementation, each data block need only store 
 a single address for each row permutation it undergoes - that one which indicates 
 its permuted destination. 
 
 Requests for rows of data blocks may be handled by hardware means 
 also. At the time at which a new row must start to be input (one rotation 
 plus transmission times before it is needed), the addresses of the blocks 
 comprising that row are sent to a queue r of the type manufactured by Burroughs 
 Corporation for use in conjunction with the ILLIAC IV disks. These queuers will 
 cause the requests to be serviced in the fastest order possible. In this case, 
 then, the order of the block requests may be relative to the arbitrary data 
 origin on the disk but the blocks will start to be input immediately in a 
 "wrap-around" sequence. The destination tags on each tag will then insure 
 that the row permutation is accomplished correctly. 
 
 One obvious observation which may be made at this point is that many 
 permutation patterns will be cyclical. That is, there may be two basic regular 
 accessing patterns which alternate over the data base. In such cases, the use 
 
81 
 
 of unique i/o "blocks for each rotation of data would probably be unnecessary. 
 A periodic pattern of accesses would probably suffice* 
 
 5.5 Disks Which Do Not Read and Write Simultaneously 
 
 The algorithms and methods presented in this paper have been pre- 
 dicated upon the existence of a disk which can simultaneously read and write 
 (although not with the same head or on the same track) . That is, data blocks 
 are assumed to be read from the disk according to a particular algorithm and 
 written on the disk according to some other algorithm. Never, however, has 
 this paper been concerned with constraints upon the times when reading and 
 writing may occur. Indeed it is probably less efficient to introduce such 
 constraints than to assume concurrent read/write capabilities of the disk 
 and then simulate this facility on the standard disk. 
 
 The mechanism by which a disk can be made to appear to both read 
 and write at the same time is relatively straightforward. Each physical track 
 on the disk is divided into an even number of sectors. The size of these 
 sectors is such that each may be addressed as a unit. For reasons >hich will 
 shortly become clear, these sectors should also be relatively small within 
 the limits of the above constraint. 
 
 These sectors are then considered as adjacent pairs. Thus from 
 some arbitrary rotational origin on the disk, each physical track will consist 
 of some number of pairs of addressable sectors. As the reader might expect, 
 each pair is considered to be a read/write unit. That is, for each track- 
 sector pair, one sector will be used to write data on and the other to read 
 data from. In this way a disk i/o supervisor can at a macroscopic level appear 
 to both read and write at the same time while in reality it is only reading 
 (at most) half of that time and similarly writing. 
 
82 
 
 Note that this method cannot assign particular track- sectors permanent 
 read status and others permanent write status. Adjacent pairs of these units 
 are, as a pair, assigned the duty of splitting up the read/write responsibility. 
 This is because a track-sector which is at one time used to write data to must 
 (if it is to be useful) later be used to read data from. 
 
 After data is read from a particular track sector it may be considered 
 a write unit. If it is the first half of a read/write pair, the second half 
 will already be a write unit and may be thus used. If the second half of the 
 pair is not immediately used, either unit is available for a write. As soon 
 as one half of the pair gets written on, however, it becomes the read portion. 
 Thus available unused i/o pairs may be thought of as write/write pairs. 
 
 Write portions of a read/write unit may only be used (written in) 
 when an adjacent input will read the contents of its read portion. That is, 
 we disallow read/read pairs. In this way, read/write pairs are maintained as 
 either read/write pairs (or released by being converted to write/write pairs) 
 although the assignment of the read or write duties will change within the 
 pair. 
 
 As an algorithm, this can be stated as: 
 
 The current write sector is the mate to the current 
 read sector (which is uniquely defined by the algorithm 
 which provides input data to the processor's memory). 
 
 If, as happens in come cases, more output data 
 are generated than the input data being used, the output 
 destinations are assigned from sector pairs on the disk 
 which are not currently being used. 
 
83 
 
 Note that this method of making the disk appear to both read and 
 write will impose some additional restrictions. Since, in reality, data blocks 
 are not being both read and written at the same time, an additional amount of 
 buffering will be necessary. For example, a situation in which blocks are 
 assumed to be directly output to the disk will, in reality, require the buffer- 
 ing of an amount of data equal to that contained in one half of a read/write 
 pair. Typically this should be the size chosen as a data block and an additional 
 amount of buffer memory equal to two data blocks (one each for each buffer) in 
 size is added to the buffer memory requirements. This additional amount of 
 memory is incremental relative to the 2Q already required. 
 
 A more serious consequence, however, comes from the fact that the 
 disk now cannot be read at its maximum data rate. Similarly it cannot be 
 written to at this maximum data rate either. Thus the effective data transfer 
 rates for disks being used in the above fashion is one half that of their 
 maximum transfer rates. This has the desirable effect of halving the necessary 
 buffer memory but the unfortunate effect of precluding zero latency solutions 
 for processes which require data at rates greater than the new effective data 
 transfer rate. 
 
 The result of this may be the requirement to use a disk with a 
 higher data transfer rate to simulate one with read/write capabilities at a 
 lesser data transfer rate. 
 
 One more consequence of this simulation is the relatively minor 
 effect of the decreased data storage capacity which comes about from partition- 
 ing the disk. Each access from disk may require some unused time (= space) 
 for switching between heads on the disk. The read/write pairs may be chosen 
 to be relatively small due to buffer memory considerations. Their small size, 
 however, will cause more data transmission initiations and hence a small 
 degradation of the disks data handling abilities. 
 
8k 
 
 5.6 Real Data Block Spacing and Synchronization 
 
 Throughout this paper we have relied on the predictability of execu- 
 tion times of the computational processes. This has permitted the spacing 
 between data blocks such that their retrieval rate from the disk exactly 
 corresponded to their computation rate by the processor. Any real processor, 
 however, cannot be relied upon to be totally predictable to the /^second level 
 (which may well be called for) . How, then, do such methods as those presented 
 here remain practical for real processors? 
 
 The above question is in part answered by the fact that the buffer 
 memory acts as a cushion to allow a slight drift in execution time of a 
 process. If the process takes less than its predicted time for some data 
 block, the next block is already in buffer memory when the first is completed. 
 
 If the processor consistently takes less than the predicted time, 
 eventually the data available in the buffer will be depleted and the processor 
 will become synchronized to the new input as it becomes available. In such 
 a case, the prediction of compute time was probably in error. 
 
 In the case of the processor continuously taking too much time, the 
 buffer will eventually fill to capacity. At this point a supervisor program 
 should terminate reading from the disk to the input buffer for a rotation. 
 Again, this is the result of an erroneous prediction of the computation time. 
 In either of these two cases, no severe problem develops. The processor 
 eventually computes at the slower of its natural rate and that of the input 
 data available to it. In the case where the processor computes at a rate 
 slower than it could if its input data were always available, the slowdown 
 is essentially caused by rotational latency. 
 
 Note, however, that in such a case the data rate and process are 
 still "closely" matched. This has the consequence of introducing much less 
 
 
85 
 
 latency than if no preplanning had been undertaken. Thus bad" predictions 
 
 of compute times still yield "good" results relative to no predictive planning. 
 
 5.7 Obtaining Gompile-Time Information on Data Accesses 
 
 Pre-run information on the sequences of data blocks to be required 
 by a program could be obtained by modifying an analyzer of the type described 
 in [10] . The analyzer could be modified to obtain possible sequences over a 
 processes input data. A sequence would be a sequential "trace" over the com- 
 putations which the analyzer currently finds may be performed in parallel. 
 
 Accurate timing information may be obtained at this same time. For 
 each computational process, the particular operations per datum (or per mesh 
 point in some cases) are obtained. (Note that in some cases this information 
 may be dependent upon the block size also, thus yielding the number and types 
 of operations per datum as function of the block size.) The compute time 
 per block then comes from computing the actual time to perform the particular 
 operations. This should include any non-processing time taken for retrieval 
 from primary memory, data alignment, etc . Such timing information can obviously 
 also be obtained empirically by running a program or portion thereof beforehand. 
 
 5.8 The Effects of Block Size on Memory 
 
 For some processes, the block size chosen will have an effect on 
 the buffer memory and disk characteristics. This occurs when the number of 
 operations performed upon each variable, while constant for an entire iteration 
 over the data base, will be related to the block size. That is, as the block 
 size is varied the number of operations performed per variable each time a 
 block is accessed will also vary. An example of this is matrix multiplication. 
 
86 
 
 5.8.1 What the Effect Is 
 
 In multiplying two n by n matrices, a total of n multiplications 
 
 will occur. If the blocks consist of k by k partitions of the matrices, there 
 
 3 
 will be k multiplications for each access of a block from each matrix. There 
 
 will be (r-l blocks and each block will be accessed — times. (The number of 
 
 multiplies per block accessed times the number of accesses/iteration times the 
 
 number of blocks in a matrix gives n multiplications.) 
 
 The required input bandwidth B for this problem with the above 
 
 w 
 
 partitioning scheme is approximated by considering only the multiplications 
 
 ^ (3 blocks/multiplication) x (size of each block) 
 
 w ~ (number of multiplies/2 block pair) x (time /multiply for parti- 
 
 cular machine) 
 
 letting t represent the processor's multiply time, 
 
 3 -3r 
 
 w J 
 
 k^t 
 
 kt 
 
 m 
 
 m 
 
 If the block size, B , were increased by doubling the linear dimen- 
 sion of the partition, we obtain 
 
 B T = 2k by 2k = k-'k 
 
 Li 
 
 (2k) = 8k multiplications per block 
 
 Trr- j = T-(r-) blocks per matrix 
 2k/ \ \kj 
 
 yacy 
 
 — - accesses of each block 
 2k 
 
 There will be three blocks input for each multiplication when the partial sum 
 from a previous multiplication must be retrieved and updated after each current 
 multiplication . 
 
87 
 
 ok t m 
 m 
 
 Thus the bandwidth is halved by quadrupling the block size. This follows by 
 
 3/2 
 
 recognizing that matrix multiplication is an N process. That is, for N 
 
 3/2 
 
 elements in each matrix (n x n) there are N multiplications over the entire 
 
 3/2 
 
 matrix multiplication. For K (= k x k) elements per block, there are K ' 
 
 multiplications and the (input) bandwidth which is measured in bits or words 
 per second, comes from 
 
 B - 3K - -2- 
 
 m 
 
 3/2 
 For an N process, then, increasing the block size by a factor of 
 
 P will result in a decrease of the necessary buffer memory bandwidth by -i— . 
 
 In general, for an N process (r > l) an increase of block size by a factor 
 
 of P will reduce the required bandwidth by the factor =-. (We make the 
 
 P 
 assumption here that the required output bandwidth is equal to that for input. 
 
 This is very conservative since few operations "generate" more new data than 
 
 they "consume".) 
 
 Increasing the block size for an 0(w) process (r > l) has several 
 effects on the various memories in the hierarchy. The most obvious of these 
 are 
 
 i) As previously shown, the required bandwidth from the buffer 
 memory (as well as the disk) is reduced, and 
 
 ii) The size of buffer memory decreases. This is because its 
 required size, comes directly from the bandwidth and the disk's rotation time. 
 Buffer memory = 2Q, = 2-B -T . 
 
88 
 
 At first, then, it may seem that things continue to improve as the 
 block size gets "bigger, or, for r = 1, at least remain constant. Why not just 
 make it as big as possible and thereby reduce the required buffer memory size 
 and the buffer memory and disk's bandwidth? This turns out not to be a suc- 
 cessful approach for all values of r for two reasons: 
 
 i) The buffer memory must be at least large enough to contain a 
 single block. Thus, the quantity 2Q, is insufficient when B > 2Q. 
 
 Li 
 
 ii) The primary memory will have to hold a few blocks. Typically 
 its requirement will be about three to five blocks. 
 
 One further observation should be made at this point. When the 
 block size (or the combined size of two blocks in the case of a binary opera- 
 tion such as matrix multiplication) exceeds Q,, there is no longer any reason 
 to make use of the algorithms developed in this paper. An input buffer equal 
 to the block size is sufficient to eliminate all latency. This is accomplished 
 by the straightforward procedure of picking up each block (or pair of blocks) 
 in the last rotation before they are needed. Output blocks are written directly 
 on the disk. This will work for random accesses over the entire data base 
 as well as for different speeds of the computational processes which operate 
 on the blocks . 
 
 The curve in Fig. 5-k depicts the required buffered memory at that 
 point . 
 
 We define an s-ary process to be one in which s data blocks are 
 required to be in primary memory for the process to compute. As an example, 
 matrix multiplication of whole matrices or single partitions is a binary 
 process. Matrix multiplication of partitioned matrices (which includes the 
 addition of partial sums as well as the multiplication process) may be con- 
 
89 
 
 UJ 
 M 
 
 >- 
 
 (T 
 O 
 
 UJ 
 
 2 
 sr 
 
 UJ 
 
 3 
 CD 
 
 DECREASING PORTION DUE TO 
 CHANGE IN BANDWIDTH (20) 
 
 DISCONTINUITY WHERE THE BUFFER MEMORY 
 CAN ACCOMODATE EXACTLY ONE ROTATION 
 OF DATA. 
 
 BUFFER MEMORY = 1 BLOCK 
 INCREASING AS BLOCK SIZE 
 INCREASES. 
 
 BLOCK SIZE -> 
 
 Fig. 5-k 
 
 sidered a ternary process. (Actually it is only macroscopically a ternary 
 process while locally it is a (binary) multiplication process followed by a 
 (binary) addition process but it is the macroscopic level with which we are 
 concerned.) 
 
 Since the memory and bandwidth requirements go up as the number of 
 blocks which must be accommodated at a single time, we define the virtual 
 block size , B , of a process to be the number of blocks, s, which must be co- 
 resident in primary memory times the size of each block, B T . Thus, B = sB_ . 
 
 h V Jj 
 
 In the equations which follow, buffer memory is indicated by the 
 symbol B . 
 
 From Fig. 5-4 it is clear that if there are no additional consider- 
 ations, the optimum block size to minimize the size of buffer memory is simply 
 that which yields one block retrieval from the disk on each rotation. 
 
90 
 
 An 0(Nj process is one in which the dominant operation is performed 
 aB times for each virtual block (typically, 1 < a < 10) we denote the time 
 to perform this operation by t . 
 
 The necessary input disk bandwidth is the number of data words per 
 computation on a block divided by the number of dominant operations in a 
 block computation times the time for that operation. 
 
 SB L 
 
 aB T t aB T t 
 L p L p 
 
 Buffer memory = 2Q for the left portion of the curve in Fig. 5-U 
 
 2sT 
 
 B = 2Q = 2B T, = =■ — for B T < Q 
 
 m w d T,r-1, L 
 
 aB T t 
 L p 
 
 and B = B T for B T > 
 m L L 
 
 and we obtain the point of discontinuity in terms of the block size: 
 
 sT 
 
 B T = Q = B T , = 
 
 L w d T,r-1, 
 aB_ t 
 L p 
 
 sT\ 
 r d 
 or, B_ = — r— . 
 L at 
 P 
 
 2T 
 That is, for B^ < ^- , B m = 2Q 
 
 P 
 
 2T 
 and for B^ > — E , B m = B L< 
 
 P 
 
91 
 
 If primary memory contains, say, p blocks, then its size is pB (s < p < 10) 
 and one can, then, compute the block size which actually minimizes the com- 
 bined cost of buffer and primary memory. 
 
 If the buffer memory cost is d/(unit of memory) and the primary is 
 c/(unit of memory), we want the minimum of the function 
 
 r 
 
 2sT. 
 
 cost = d | 
 
 r-1 
 i aB T t 
 L L p_j 
 
 + c[pB L ] 
 
 This minimum will always occur before the discontinuity in buffer memory size. 
 We prove this by solving for the minimum. Solve for the deriative of the cost 
 function with respect to block size and set to zero. 
 
 ^g ost > = -(r-l)d 
 oB_ 
 
 -L 
 
 2sT. 
 
 aB^t 
 _ LP. 
 
 + cp = 
 
 which, with suitable algebraic manipulation yields 
 
 £ 
 
 "2sT d(r-l)1 
 
 aT cp ! 
 P 
 
 or 
 
 B L = 
 
 2sT,d(r-l)" 
 d 
 
 aT cp 
 P 
 
 l/r 
 
 The discontinuity from our earlier calculations is 
 
 B L = 
 
 sT. 
 
 aT 
 
 l/i 
 
 P ! 
 
92 
 
 The minimum of the cost function occurs at 
 
 n 
 
 2d(r-l) jl/r 
 
 cp 
 
 _j 
 
 want to consider the possible values for 
 
 2d(r-l) ! l/r 
 
 sT. 
 c 
 
 aT 
 
 L PJ 
 
 iA 
 
 and we 
 
 We note first that r 
 
 L ^P ; 
 is positive (since r > l) and that each of the terms is also. 
 
 The quantity — will always be <1, in fact it will almost always be 
 in the neighborhood of 0.1. 
 
 For most practical algorithms, r will not be larger than 2 or 3, let 
 us take It as an upper bound. 
 
 As mentioned earlier, 1 < s < p < 10 and if we take the very con- 
 servative amount of 1, the above has an upper bound of 
 
 ■ (2) (0.1) (3) ll/r _ , 6] l/r 
 J 
 
 1 
 
 and since this is < 1 for our conservative estimates the minimum of the cost 
 function will always occur for a smaller block size than that at which the 
 discontinuity occurs. 
 
 Fig. 5-5 demonstrates some of these relationships for the following 
 
 values 
 
 s = 1 
 
 P = 5 
 
 t = 30 nsec 
 P 
 
 T., = 40 msec 
 d 
 
 r = 3/2 
 
 a = 3 
 
 d = $.12/word » $.02/bit for 6k bit word 
 
 c = $1.20/word -a $.20 bit for 6k bit word 
 
93 
 
 
 Q 
 
 
 * 
 
 
 V 
 
 
 H 
 
 
 CD 
 
 
 <T 
 
 
 CO 
 
 
 t- 
 
 
 m 
 
 
 N, 
 
 
 -o- 
 
 
 C\J 
 
 
 ? 
 
 
 O 
 
 
 O 
 
 * 
 
 
 * 
 
 ■s 
 
 
 V 
 
 n 
 
 
 •o- 
 
 M 
 
 
 s 
 
 
 2 
 
 
 > 
 
 CO 
 CO 
 
 III 
 
 
 u. 
 
 O 
 
 ft 
 
 IT 
 
 O 
 
 2 
 
 Ld 
 2 
 
 u 
 o 
 
 or 
 a 
 
 o 
 o 
 
 or 
 
 co 
 o 
 
 CJ 
 
 t- 
 CO 
 
 o 
 o 
 
 
 a 
 
 UJ 
 
 >- 
 
 V 
 
 tr 
 
 UJ 
 
 2 
 
 a. 
 
 o 
 
 u. 
 
 
 it 
 
 2 
 
 UJ 
 
 2 
 
 2 
 
 U. 
 
 ro 
 
 CO 
 
 UJ 
 
 CD 
 
 ro 
 
 a 
 
 2 
 
 
 < ) 
 
 o 
 
 
 > 
 
 CO 
 
 lit 
 
 UJ 
 
 oe 
 
 Lt 
 
 
 CO 
 
 CO 
 
 UJ 
 
 < 
 
 > 
 
 z 
 
 2 
 
 U. 
 
 T 
 
 <r 
 
 
 u. 
 
 
 < 
 
 8 
 
 o 
 
 3 
 
 cc 
 
 2 
 
 * 
 
 CD 
 
 0. 
 
 Ct 
 
 
 a 
 
 
 z 
 
 or 
 
 
 UJ 
 
 o 
 
 
 UJ 
 
 u. 
 
 
 * 
 
 
 t- 
 
 CO 
 
 
 ui 
 
 1- 
 
 
 CD 
 
 co 
 o 
 
 
 a 
 
 u 
 
 
 i 
 co 
 z 
 o 
 
 Q 
 
 z 
 < 
 
 
 1- 
 < 
 _l 
 
 UJ 
 
 CO 
 UJ 
 
 
 <r 
 
 co 
 
 
 I 
 
 LP, 
 
 tuO 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 x X 
 
 X 
 
 X 
 
 w 
 
 x 
 
 X 
 
 X 
 
 o 
 
 CM 
 
 o 
 
 O 
 
 2 
 
 o Q 
 
 O 
 
 O 
 
 5 
 
 o 
 
 o 
 
 o 
 
 o 
 CM 
 
 CM 
 
 CM 
 
 o 
 
 N 
 
 S2 * 
 
 cm 
 
 9 
 
 CO 
 
 (0 
 
 «• 
 
 CM 
 
 1 1 
 
 1 
 
 1 
 
 1 
 
 
 - (SQU0M) 3ZIS 
 
 AW0W3W • 
 
 1 
 
 1 
 
 1 
 
 1 1 
 
 
 00 
 
 
 X 
 
 
 X 
 
 X 
 
 
 10 
 
 
 X 
 
 00 
 
 
 00 
 
 
 «■ 
 
 
 0i 
 
 * 
 
 
 1 
 
 1 
 
 
 CM 
 
 1 
 
 
 3 
 
 1(1 1 CM 11 
 
 1 
 
 
 ■ 
 
 
 1 
 
 1 
 
9h 
 
 That point at which the total memory cost is a minimum (from Fig. 
 5-5) is approximately: 
 Buffer memory = ^2K words 
 Primary memory = 2K words 
 Total memory cost = $7.2K 
 
 It must he borne in mind, however, that there is an additional con- 
 straint which must be considered that does not appear on the graph of Fig. 
 5-5« At some point the block size is too small for the bandwidth of the disk 
 to even allow a solution of the kind presented here. This threshold can be 
 represented by a horizontal line corresponding to 20 for the particular 
 disk being considered. Block sizes small enough to require a buffer memory 
 larger than this are not feasible for our solutions, because even if we could 
 merely increase the buffer memory size, the bandwidth of the disk would be 
 inadequate . 
 
 5.8.2 Processes in Which Changing the Block Size Has No Effect 
 
 Processes which are 0(w) will require aB operations for each s 
 
 blocks of size B . The bandwidth and buffer memory size are thus not dependent 
 J_j 
 
 upon the block size as long as the block size does not get larger than is 
 necessary to be processed for an entire disk rotation. Thus such a process 
 will have its buffer memory size and cost plotted as a horizontal line on a 
 graph such as that of Fig. 5-5* 
 
 Minimizing the memory cost in this case is accomplished by deter- 
 mining the smallest possible block size consistent with the various hardware 
 and software constraints imposed by the system. 
 
95 
 6. CONCLUSIONS AND DISCUSSION 
 
 The algorithms and other methods presented in this thesis may be 
 directly applied to a significantly large class of problems. The important 
 requirements for any program are that it iterate over its data base or 
 portions thereof and that the i/o sequences required during these iterations 
 and their timing be predictable. Obvious examples of such programs are 
 numerical partial differential equation programs with large data bases (e.g. 
 weather codes) and programs which contain matrix operations for very large 
 matrices (which, for standard matrix operations, are inherently iterative and 
 always predictable). 
 
 We may treat these latency removing techniques as either system 
 design or programming tools. In the former case we can obtain various bounds 
 which will indicate the necessary hardware parameters for the memory heir- 
 archy that will fit the program and processor parameters. In the latter case, 
 we can determine whether a latency free execution is possible for the parti- 
 cular system and program. 
 
 6.1 Applications of Latency Elimination Techniques to Existing Systems 
 
 The application of our techniques to existing systems is chiefly 
 to determine if they will "fit". In the case of an 0(n) process, this is 
 first done by determining if the memory bandwidths are adequate for the 
 algorithm and processor. If they are, the sizes of buffer and primary 
 memory are next considered. In this case buffer memory must be equal to 
 20 but the size of primary memory is dependent upon (among other para- 
 meters) the block size. Thus, data block size may be adjusted to obtain a 
 fit in primary memory. 
 
96 
 
 For 0(tr) processes, r > 1, altering the block size has the addi- 
 tional effects of altering the required disk and buffer memory bandwidths and 
 the buffer memory size. For programs containing such processes one can compute 
 or plot the values of these parameters versus block size to obtain those values 
 for the block size which are physically realizable for the existing system. 
 
 6.2 System Design Applications of Latency Elimination Techniques 
 
 When designing a system which is to employ the latency elim- 
 ination techniques presented in this paper, the relationships between the 
 various system parameters are known and all of the trade-offs may be con- 
 sidered. Thus, for example, the required disk and buffer memory bandwidths 
 may be related to the processor, process and, (for an 0(lT ) process, r > 1), 
 block size. The class of all processes which may be run on the system could 
 be provided for by providing a memory bandwidth as large as the largest re- 
 quired. That is, a system may be designed for a particular type of program 
 (as many special purpose computer systems frequently are) or for many types 
 of programs. We need only estimate such parameters as the upper bounds of the 
 speed and size over the entire program mix. 
 
 For systems "tuned" to a particular class of programs, the relation- 
 ships between the system parameters can be used to obtain an optimal memory 
 heirarchy configuration as in the manner in which memory cost was minimized 
 in Section 5-8. 
 
 6. 3 Areas for Future Research 
 
 Some further research which is suggested by that in this paper is a 
 generalization of our techniques to moveable arm disks. Some obvious con- 
 straints (e.g. requiring a minimum wait of the longest possible head positioning 
 
97 
 
 time between i/O operations) will allow the application of our techniques 
 directly, but more sophisticated refinement might yield a more practical model 
 of such devices . 
 
 Preplanning techniques such as ours could perhaps be generalized to 
 handle situations with imperfect information on i/O operations. This would 
 permit the same type of latency elimination schemes with just statistical in- 
 formation on the i/O sequences . This might be done by storing several parti- 
 cular data block sequences at particular angular positions on the disk when 
 it has been predicted that one of them (but not which one) will be needed at 
 that point. A method which was effective for this type of imperfect infor- 
 mation would require a less sophisticated analyzer than that of the methods 
 in this paper to extract the i/O information before run time, and would 
 probably also be more useful for a multi -programming environment. 
 
 The methods employed to effect the permutations of data blocks can 
 be applied to memory configurations to obtain hardware methods of reordering, 
 retrieving and/or storing data in real time. Thus data could be made to 
 migrate on a disk through basically hardware means. 
 
98 
 
 APPENDIX A 
 
 O O 
 
 « _l 
 
 a ui 
 
 -D LL 
 
 O 
 
 o 
 
 in ■— 
 
 >>» 
 
 
 in 
 
 
 -j 
 
 
 ■v 
 
 
 in 
 
 
 a. 
 
 
 >. 
 
 
 •• 
 
 
 o 
 
 u. 
 
 O 
 
 
 ae 
 
 
 • 
 
 IP 
 
 
 2 
 
 
 o 
 
 
 uj ae 
 
 
 z 
 
 
 x 
 
 
 z o 
 
 
 z 
 
 
 — 
 
 
 UJ 
 
 
 in 
 
 1/1 
 
 at: 
 
 
 z u. 
 
 
 H_ 
 
 > 
 
 CO 
 
 
 *• 
 
 
 (\j 
 
 z 
 
 O 
 
 
 ■— « 
 
 
 o 
 
 
 o 
 
 
 <0 
 
 X 
 
 u. 
 
 
 X 
 
 
 > 
 
 < 
 
 a 
 
 
 ii 
 
 
 >»■ 
 
 3 
 
 
 zu 
 
 
 ae 
 
 LU 
 
 LU 
 
 
 • 
 
 -J 
 
 ¥ 
 
 
 < ae 
 
 
 i- 
 
 ae 
 
 X 
 
 
 -> 
 
 
 >0 
 
 o 
 
 
 < 
 
 
 z 
 
 -i 
 
 m* 
 
 
 » 
 
 
 o 
 
 t- 
 
 
 Q LU 
 ZOO 
 
 
 LU 
 
 < 
 
 U, 
 
 
 "* 
 
 
 o 
 
 Q 
 
 z 
 
 
 < 
 
 
 LU 
 
 X 
 
 m 
 
 
 o. 
 
 
 >» 
 
 z 
 
 i— 
 
 
 at 
 
 
 a 
 
 o 
 
 o 
 
 
 i/) 
 
 
 -1 
 
 < 
 
 
 • 
 
 xo 
 
 
 ae 
 
 (T\ 
 
 
 M 
 
 
 a 
 
 to 
 
 X 
 
 o 
 
 UL 
 
 — < 
 o 
 
 oac 
 <r ae 
 
 LU 
 LU 1 
 
 
 LU 
 O 
 
 < 
 
 LU 
 1/1 
 
 * 
 
 o 
 rr, 
 
 
 • • 
 
 
 ae. 
 
 a 
 
 
 I a 
 
 
 LU 
 
 a. 
 
 
 Z 
 
 
 X 
 
 h- z 
 
 
 o 
 
 X 
 
 i/> 
 
 
 LU 
 
 z o 
 
 < 
 
 
 K 
 
 ■■* 
 
 
 O 
 
 
 X 
 
 o 
 
 1/1 
 
 O 1 
 
 
 ►- 
 
 
 Q 
 
 
 ►- 
 
 
 • • 
 
 
 ►- 
 
 1— 
 
 _J 
 
 < 
 
 
 z 
 
 LU 
 
 Z 
 
 „ 
 
 ,, 
 
 
 f-4 
 
 
 1 
 
 
 -i 
 
 
 > 
 
 —t 
 
 _) 
 
 C5 m 
 
 
 ae 
 
 
 Z 
 
 «n 
 
 
 II 
 
 
 -* 
 
 
 ♦ 
 
 
 < 
 
 LO 
 
 < 
 
 z ae 
 
 
 ae 
 
 ae 
 
 ** 
 
 Ml 
 
 
 • » 
 
 
 -5 
 
 
 o 
 
 
 ^ 
 
 
 «• 
 
 
 a! 
 
 a 
 
 
 — t- 
 
 
 O 
 
 < 
 
 CD 
 
 Z 
 
 
 « 
 
 
 A 
 
 
 a 
 
 
 «« 
 
 
 ae 
 
 Q- 
 
 1/1 
 
 a 
 
 
 o 
 
 LL 
 
 
 *• 
 
 • » 
 
 o 
 
 
 0m 
 
 
 a 
 
 
 3 
 
 
 < 
 
 X 
 
 z 
 
 z o 
 
 
 a 
 
 a 
 
 
 z 
 
 «* 
 
 o 
 
 
 * 
 
 
 -i 
 
 
 ox 
 
 a z 
 
 
 LU 
 
 to 
 
 LU 
 
 ►- 
 
 
 ■ 
 
 
 a 
 
 
 •. 
 
 
 ae 
 
 
 z 
 
 o 3 
 
 a — 
 
 
 X 
 
 3 
 
 X 
 
 
 o 
 
 uo 
 
 
 o 
 
 
 i 
 
 
 LU 
 
 _J 
 
 l/l * 
 
 
 K 
 
 X 
 
 ■* 
 
 ^ 
 
 
 »- 
 
 > 
 
 Mi 
 
 
 _i 
 
 
 o 
 
 
 ae 
 
 
 V 
 
 Q O 
 
 LU O 
 
 
 t- 
 
 LL 
 
 n 
 
 
 < 
 
 o 
 
 
 a. 
 
 
 h- 
 
 
 CO 
 
 
 o 
 
 ae < 
 
 
 i/> 
 
 
 ■■4 
 
 
 f-l 
 
 ae 
 
 II 
 
 
 X 
 
 
 — 
 
 •p 
 
 0. 
 
 
 < 
 
 
 ae ae 
 
 
 LU 
 
 X 
 
 -» 
 
 
 II 
 
 ae 
 
 -^ 
 
 
 o 
 
 
 t- 
 
 □ 
 
 w 
 
 
 z 
 
 
 o 
 
 o •- 
 
 • 
 
 1- 
 
 o 
 
 o 
 
 o 
 
 •» 
 
 -1 
 
 < 
 
 M 
 
 
 ae 
 
 
 < 
 
 z 
 
 3 
 
 
 loZ 
 
 o * 
 
 *: 
 
 < 
 
 
 fO 
 
 a 
 
 -» 
 
 
 «■• 
 
 • • 
 
 mm 
 
 
 X 
 
 LU 
 
 
 z 
 
 LU 
 
 < 
 
 o 
 
 
 o 
 
 X 
 
 » 
 
 
 ■ 
 
 o 
 
 K 
 
 z 
 
 z 
 
 K» 
 
 •» 
 
 ii 
 
 t» 
 
 ae 
 
 
 < 
 
 O 
 
 
 a. < 
 
 1/1 
 
 
 3 
 
 o 
 
 «• 
 
 > 
 
 a 
 
 Z 
 
 LU 
 
 
 < 
 
 — < 
 
 — . 
 
 ^ 
 
 «« 
 
 
 z> 
 
 i/> 
 
 O CO 
 
 M 
 
 D 
 
 
 m 
 
 z< 
 
 
 LU 
 
 X 
 
 o 
 
 X 
 
 w 
 
 ^ 
 
 o •• 
 
 h- 
 
 
 t- 
 
 Q 
 
 I 
 
 o 
 
 
 Z 
 
 LO 
 
 •^ 
 
 
 ae 
 
 -» 
 
 Z 
 
 < 
 
 K 
 
 
 -^ 
 
 o 
 
 OQ. 
 
 < 
 
 
 3 
 
 OO 
 
 _l < 
 
 * 
 
 M 
 
 z 
 
 >- 
 
 Oae 
 
 •^ 
 
 o 
 
 t- 
 
 
 U3 
 
 x 
 
 a 
 
 JO 
 
 z 
 
 
 a. 
 
 ae 
 
 QC 
 
 
 -I 
 
 
 z 
 
 < 
 
 h- < 
 
 » 
 
 a. 
 
 
 .^ 
 
 
 o 
 
 -> 
 
 H O 
 
 •• 1) 
 
 
 z 
 
 a 
 
 
 ae z 
 
 < 
 
 X 
 
 o 
 
 z 
 
 
 M 
 
 z 
 
 » 
 
 ♦ 
 
 — t 
 
 o 
 
 » 
 
 — _J 
 
 o — 
 
 
 i-m 
 
 
 LU 
 
 LU — 
 
 h- 
 
 o 
 
 -J 
 
 
 p4 
 
 ^ 
 
 M 
 
 a 
 
 ■fa 
 
 MM 
 
 II 
 
 -1 
 
 2 
 
 xae 
 
 H — 
 
 
 Q CO 
 
 t- • 
 
 O 
 
 — 
 
 o 
 
 •> 
 
 H 
 
 3 
 
 h- 
 
 L> 
 
 M 
 
 X 
 
 r 
 
 
 o 
 
 o 
 
 rg 
 
 
 i/> 
 
 z o 
 
 3 I X 
 
 t- 
 
 X 
 
 o 
 
 -* 
 
 ->a 
 
 < 
 
 • •• 
 
 «0 
 
 o 
 
 
 z 
 
 ae 
 
 aeo 
 
 
 < 
 
 <t 
 
 I 
 
 O O - 
 
 • 
 
 Jt 
 
 
 o 
 
 
 Z 
 
 z 
 
 — -J 
 
 ae 
 
 ae 
 
 * 
 
 LU 
 
 n« 
 
 — *- 
 
 x ae 
 
 
 X 
 
 ae ae 
 
 
 LU 
 
 ei 
 
 o 
 
 •" 
 
 — 
 
 t-< 
 
 h- 
 
 ■— 
 
 o 
 
 •>X 
 
 1- 
 
 ae 
 
 o -» 
 
 o 
 o 
 
 </>i_> lux aei^tu c< vi vi a. a O a. 
 
 xhZOIUIUUI^UOU^"" 1 " 
 
 IX XX<tI0L<i-<OQ.-J_J 
 
 Kj^hwujK-uj" xae 
 
 a _i ►- »- 
 * .. u LU LU 
 
 •v O.OU 19 
 
 a 
 
 —■ - 4 _j(N|„ ■lNjxim^-rj'-fl ,, >rj1JMMMM-<Htu 
 
 <Mm<rm^>r«-cotT>o-4<Mr-iTO'^< , >jm'*'m^)eoo , >0»<'Mfn<»-(n<of-ooo ,, 0" rf 
 •^.^— «-<»^(\jnjMfM»\jiM<Mnjp>j<»>(*ie»><«ifim«n"ifO«'>'#^ 
 
INPUT ARRAY 
 
 99 
 
 COMPONENT ARRAYS 
 
100 
 
 LIST OF REFERENCES 
 
 [1] J. Abate and H. Dubner, "Optimizing the Performance of a Drum-Like Storage," 
 ] EEE Transactions on Computers , vol. C-l8, pp. 992-997, November 1969. 
 
 [2] K. E. Batcher, "Sorting Networks and Their Applications," Proc . SJCC 1968 , 
 pp. 307-31^. 
 
 [3] Bruce A. Bernott, "Disk i/O for Non-Core -Contained P.D.E. Meshes and Arrays," 
 University of Illinois at Urbana-Champaign, Department of Computer Science 
 Report No. 311, March 1969. 
 
 [h] E.G. Coffman, Jr., "Analysis of a Drum Input/output Queue Under Scheduled 
 Operation in a Paged Computer System," Journal of the ACM , vol. l6, 
 pp. 73-90, January 1969. 
 
 [5] S. H. Fuller, "An Optimal Drum Scheduling Algorithm," Technical Report No. 
 12, Digital Systems Laboratory, Stanford Electronics Laboratories, 
 April I97.I. 
 
 [6] A. Gill, "The Optimal Organization of Serial Memory Transfers," IRE Trans - 
 actions on Electronic Computers , vol. EC-9, pp. 12-15, March i960. 
 
 [7] B. Gordon, "An Optimizing Algorithm for the IBM 65O, " Journal of the ACM , 
 vol. 3, pp« 3-5, January 1956. 
 
 [8] M. Hall, Jr., Combinatorial Theory . Waltham, Massachusetts. Blaisdell, 
 1967. 
 
 [9] D. Knuth, "Minimizing Drum Latency Time," Journal of the ACM , vol. 8, 
 pp. 119-150, April 1961. 
 
 [l0] D. J. Kuck, Y. Muraoka and S. C. Chen, "On the Number of Operations Simul- 
 taneously Executable in FORTRAN-Like Programs and Their Resulting 
 Speedup, " to be published in IEEE Transactions on Computers . 
 
 [ll] A. C. McKellar and E. G. Ccffman, Jr., "Organizing Matrices and Matrix 
 
 Operations for Paged Memory Systems," Communications of the ACM , vol. 12, 
 pp. 153-165, March 1969- 
 
 [12] D. C. Opferman and N. T. Tsao-Wu, "On a Class of Rearrangeable Switching 
 
 Networks Part I : Control Algorithm, " The Bell System Technical Journal , 
 vol. 50, pp. 1579-1600, May- June 1971. 
 
 [13] H. J. Ryser, Combinatorial Mathematics. New York: Wiley, 1963. 
 
BLIOGRAPHIC DATA 
 IEET 
 
 1. Report No. 
 
 UIUCDCS-R-72-522 
 
 I Title and Subtitle 
 
 Elimination of Rotational Latency by Dynamic Disk Allocation 
 
 3. Recipient's Accession No. 
 
 5. Report Date 
 
 May 2U, 1972 
 
 6. 
 
 ] Author(s) 
 
 David E. Gold 
 
 8. Performing Organization Rept. 
 
 N °- UIUCDCS-R-72-522 
 
 f Performing Organization Name and Address 
 
 I University of Illinois at Urbana-Champaign 
 I Department of Computer Science 
 j Urbana, Illinois 61801 
 
 10. Project/Task/Worlc Unit No. 
 
 11. Contract /Grant No. 
 
 US NSF GJ 27^1+6 
 
 I. Sponsoring Organization Name and Address 
 
 I National Science Foundation 
 Washington, D .C . 
 
 13. Type of Report & Period 
 Covered 
 
 Doctoral Dissertation 
 
 14. 
 
 . Supplementary Notes 
 
 Abstracts 
 
 In Input/Output bound computer systems which use disks for back up memory, 
 rotational latency is frequently the source of the i/O boundness. This paper 
 presents several methods for eliminating rotational latency in head-per-track 
 disks or equivalent memory devices. These methods assume pre-run time knowledge 
 of the i/O sequences required by a particular program which will be run on the 
 system. They work by constantly reorganizing the data on the disk during execution 
 of the program. Measures are obtained of certain system parameters which indicate 
 the requirements for use of these techniques and the various tradeoffs may be 
 examined and quantified. These results are useful both for designing a new 
 operating system and for removing latency from an already existing operating system 
 
 1 Key Words and Document Analysis. 17a. Descriptors 
 
 Magnetic Disks 
 Magnetic Drums 
 Rotational Latency 
 Dynamic Storage Allocation 
 Memory Hierarchies 
 Buffer Memory 
 Permutation Networks 
 
 ». Identifiers/Open-Ended Terms 
 
 c. COSATI Field/Group 
 
 Availability Statement 
 
 Release Unlimited 
 
 RM NTIS-35 ( 10-70) 
 
 19. Security Class (This 
 Report) 
 
 20. Security Class (This 
 Page 
 
 UNCLASSIFIED 
 
 21. No. of Pages 
 
 106 
 
 22. Price 
 
 USCOMM-DC 4032S-P7 1 
 
%/> 
 
 ■:• 
 

 mn? no *>' 0l y 
 
 ■ 
 
 i -r ■ 
 
 vv*r 
 
 i ■ 
 
 ■ 
 
 HI 
 
 nl 
 
 VBt, 
 
 n 
 
 
 1 1 
 
 ■ 
 HI 
 
 ■ 
 
 ■ :;■:■'•■•■« 
 
 ■ 1 1 B9 
 
 II B 
 
 ■ 
 
 m mm 0} 
 
 sun H 533m H 
 
 jbhhmmmhBJIH ESSC IBS < <>*! 
 
 WBiwm 
 
 HH H ■■■ BHH1 RaEI