Report No. UIUCDCS-R-74-662                NSF-OCA-GJ-328-000002

PREPAGING AND APPLICATIONS TO STRUCTURED ARRAY PROBLEMS*

by

Kishor Shridharbhai Trivedi

July 1974

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

*This work was supported in part by the National Science Foundation under Grant No. US NSF GJ-328 and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, July 1974.

ACKNOWLEDGMENT

The author wishes to express his gratitude to Professor J. Richard Phillips for his guidance during the preparation of this thesis. I would like to thank the National Science Foundation and the Department of Computer Science for financial support. Thanks are also due to Mrs. Vivian Alsip and Mrs. June Wingler for typing this thesis. Finally, I would like to thank my wife, Kalpana, for her encouragement and patience.

TABLE OF CONTENTS

1. INTRODUCTION
2. GENERAL PAGING
   2.0 Introduction
   2.1 Performance Measures
   2.2 Important Variables
       2.2.1 Multiprogramming
       2.2.2 Main Memory Allotment
       2.2.3 The Page Size
       2.2.4 The Page-Fetch Time
       2.2.5 Program Behavior
       2.2.6 Pagination
       2.2.7 Paging Algorithms
   2.3 The Formalism of Paging Algorithms
       2.3.1 Cost of a Paging Algorithm
   2.4 Working Set Algorithm
   2.5 Performance Measurement
       2.5.1 Stack Algorithms [COFF73, MATT70]
       2.5.2 King's Model
3. GENERAL PREPAGING ALGORITHMS
   3.0 Introduction
   3.1 The Optimal Demand Prepaging Algorithm
       3.1.1 Performance Measurement for DPMIN
   3.2 Realizable Prepaging Algorithms
       3.2.1 Freeing Dead Pages
       3.2.2 Prepaging
             3.2.2.1 Joseph's OBL Algorithm [JOSE70]
   3.3 PWS Algorithm
4. IMPROVING LOCALITY OF ARRAY PROGRAMS
   4.0 Introduction
   4.1 Cholesky Decomposition
   4.2 Other Matrix Algorithms
       4.2.1 Matrix Multiplication
       4.2.2 LU Decomposition
       4.2.3 Gaussian Elimination
       4.2.4 Gram-Schmidt Orthogonalization
   4.3 Conclusion
   4.4 A Note About Measurement of Performance
5. APPLICATION OF FREEING AND PREPAGING TO ARRAY PROGRAMS
   5.0 Introduction
   5.1 Cholesky Decomposition
   5.2 Other Examples
       5.2.1 LU Decomposition
       5.2.2 Gaussian Elimination
       5.2.3 Gram-Schmidt Orthogonalization
       5.2.4 Matrix Multiplication
   5.3 Combined Effect of Locality Improvement and Working-Set Identification Methods
6. AUTOMATION OF PERFORMANCE IMPROVEMENT TECHNIQUES
   6.0 Introduction
   6.1 Transformations Which Improve Locality
       6.1.1 Loop Reversal
       6.1.2 Loop Decomposition
       6.1.3 Submatrix-Multiplication Transformation
       6.1.4 Conversion From a Non-Submatrix to a Submatrix Algorithm
   6.2 Freeing and Prepaging
7. CONCLUSION AND FUTURE RESEARCH
   7.0 Conclusion
   7.1 Future Research
LIST OF REFERENCES
APPENDIX A - PL/I PROGRAMS FOR THE SIMULATION OF SOME PAGING ALGORITHMS
APPENDIX B - PROOF OF THEOREM 3.4
APPENDIX C - PL/I PROGRAMS FOR SEVERAL MATRIX ALGORITHMS
APPENDIX D - ALGORITHM AP
VITA

LIST OF FIGURES

4.1  Submatrix Storage
4.2  Cholesky Decomposition
4.3  Cholesky Decomposition with Reversal
4.4  Localities of CD and CDR
4.5  OL/2 Program for Cholesky Decomposition
4.6  Cholesky Decomposition with Submatrix Multiplication
4.7  Localities of CD and CDM
4.8  Localities of CD and CDS
4.9  Localities of CD, CDR, CDM and CDS
4.10 Page Faults Using WS Algorithm
4.11 Page Faults Using LRU Algorithm
4.12 Page Faults Using MIN Algorithm
4.13 Localities of MM, MMR and MMS
4.14 Localities of LU, LUR, LUM and LUS
4.15 Localities of GOS, GOSR, GOSM and GOSS
4.16 Localities of ORTH, ORTHR, ORTHM and ORTHS
4.17 Page Faults for CD, LU and GOS
5.1  Cholesky Decomposition with Freeing
5.2  Page Faults Using LRU, FREELRU and MIN Algorithms
5.3  Page Faults Using LRU, FREEPRE4LRU, MIN and DPMIN Paging Algorithms
5.4  Page Pulls Using LRU, FREEPRE4LRU, MIN and DPMIN Paging Algorithms
5.5  Effect of Prepaging on LU Decomposition
5.6  Effect of Prepaging on Gaussian Elimination
5.7  Effect of Prepaging on Orthonormalization
5.8  Effect of Prepaging on Matrix Multiplication
5.9  Page Faults for CD and CDSRFP
5.10 Page Faults for LU and LUSRFP
5.11 Page Faults for GOS and GOSSRFP
5.12 Page Faults for ORTH and ORTHSRFP
5.13 Page Faults for MM and MMSRFP
6.1  The Partition Lines

1. INTRODUCTION

Paged virtual memory systems (PVMS) were introduced by Kilburn et al [KILB62] in the Ferranti Atlas computer. The objective of a PVMS is to relieve the programmer of the burden of storage management. At the outset, however, the cost in terms of overhead and performance degradation was thought to be high [FINE66, KUEH68]. Nevertheless, others argued in favor of PVMS [DENN70, SAYR69]; PVMS are now well accepted and have proved to be useful in practice [DENN70]. But their effectiveness and overhead still leave much to be desired [MASU72, ORGA72]. This is especially true in view of the changing technology, with very large memories, and of problems with large data bases and large arrays. In this thesis we are concerned with the performance of programs running under a PVMS and with techniques and algorithms, including prepaging, that offer a more viable solution for the computer systems of the future.

In Chapter 2, we introduce the terminology used and survey the relevant literature. We identify the variables of a PVMS which affect the performance of its programs. We discuss the question of what performance measures to use.
We then study the effect of each of these variables on performance, drawing on the available literature. We focus our attention on two important variables of a PVMS, namely the paging algorithm used by the system and the locality of its programs. We note that the most popular paging algorithms are demand paging algorithms, chiefly because of their simplicity of implementation and because little is known about prepaging. The property of locality has been observed in practical programs and is perhaps the chief reason for the practicability of PVMS [DENN70]. For special applications, methods have been given for writing "more local" programs. Compiler-implemented locality-improvement methods have been restricted to code and not data; further, these methods perform only a physical reorganization of the code, not a logical reorganization.

In Chapter 3, we present several new paging algorithms, some of them prepaging algorithms, and we show why prepaging is useful. We present a new algorithm, DPMIN, which is a demand-prepaging algorithm, and prove that it is an optimal demand-prepaging algorithm. However, DPMIN cannot be implemented in practice, since it requires that the program's reference string be known in advance. We also present several practical prepaging algorithms. We then present a variable-memory prepaging algorithm called PWS, which is based on Denning's WS algorithm, and prove that it incurs zero page faults. The PWS algorithm is impractical in the same sense as DPMIN and is only useful for theoretical purposes. We also study the question of performance measurement while using these prepaging algorithms; in particular, we study whether or not the proposed paging algorithms are stack algorithms [MATT70].

In Chapter 4, we show how to improve the locality of matrix algorithms. This includes a logical reorganization of the program and its application to several common matrix algorithms. We also measure and compare the average working set size of programs and show that this is a reasonable way to measure locality. In Chapter 5, we introduce prepaging into matrix algorithms. Using the prepaging algorithms proposed in Chapter 3, we show how to improve the performance of the matrix algorithms. In Chapter 6, we discuss the automation of the performance-improvement techniques of Chapters 4 and 5. Prepaging applied to matrix algorithms can be implemented in a compiler but not in an operating system. Eventually one would hope that these techniques would find their way into some of the compilers of the future and thereby provide efficient prepaging that cannot be obtained in other ways.

2. GENERAL PAGING

2.0 Introduction

In a paged virtual memory system (PVMS), a program's address space is divided into fixed-size blocks of addresses called pages, and the main memory (MM) is divided into blocks of locations of matching size called page frames. We will let N = {1, 2, ..., n} denote a program's address space and M = {1, 2, ..., c} denote the set of MM page frames allotted to the program; generally 1 ≤ c ≤ n. When a program is executing, it makes a sequence of references to its address space N. We will denote such a sequence by w = r_1, r_2, ..., r_t, ..., r_T, where r_t ∈ N and t ≥ 1. We will let |w| denote the number of references in w, N^0 denote the set containing only the null string Λ, N^T = {w = r_1, r_2, ..., r_T : |w| = T}, N* denote the set of all finite strings over N, and N^+ = N* - {Λ}.
Now, since the CPU (central processing unit) can refer only to the MM, we must interpose a mapping mechanism between the CPU and the MM. The address map f is then a function f: N → M. Since it is possible that c < n, f may be a partial function:

   f(i) = j           if page i resides in page frame j
   f(i) = undefined   otherwise.

The function f is also dynamic, since in general a program may refer to any of the pages of N. If the page size is p, a valid virtual address is an integer α with 0 < α ≤ np. Similarly, a valid MM address is an integer β with 0 < β ≤ cp. When presented with the virtual address α, the address translation mechanism (MAP) obtains the pair (i, ξ) such that α = (i-1)p + ξ with 0 < ξ ≤ p, and then generates the memory address β = [f(i)-1]p + ξ if f(i) is defined, and a 'page fault' if f(i) is undefined.

A page fault interrupts the execution of the program, and the program makes a transition from the running state to the page-wait state. If we assume that the whole of the program's address space resides on a secondary memory device such as a disc, a drum or a bulk core store, then upon the occurrence of a page fault the memory manager has to bring the required page from secondary memory into MM. The program waits for the required page for the duration of one page-fetch time, assumed to be, on the average, T time units. If the MM was originally full, the above process also forces us to choose a page already resident in MM to be replaced. Page replacement may also involve a page push from MM to the secondary memory if the page was written into during its last stay in MM. After the required page has been brought into MM, the interrupted program is in the ready state, ready to execute on a CPU if one is available.

During the time of the above page transfer, the CPU will remain idle unless we resort to multiprogramming. In multiprogramming, we try to mask a long page-transfer time by overlapping I/O and execution: after a page-fault interrupt, the CPU is allotted to another ready program if one is available. If the system does not allow multiprogramming, it is a monoprogrammed system. It has been observed in practice that a page fault incurs a significant amount of overhead. As an example, Masuda [MASU72] reports that in his system more than 50 ... When T exceeds about 6000 units, there is no gain to be had from multiprogramming. Here T is measured in average instruction time units (roughly microseconds). Roughly speaking, this means that a movable-head disc is a very bad paging device. Drums and fixed-head discs fall below the 6000 mark and are useful. Extended core storage (ECS) provides T < 1000 and is an excellent paging device. In fact, with ECS it has been recommended that the process switch after a page fault be avoided altogether and the CPU be left idle during the page-wait time [ORGA72]. Denning [DENN68] reports that the main reason for thrashing can be traced to a large value of T. For a further discussion see [DENN70].

2.2.5 Program Behavior

The memory-referencing behavior of a program is a very important variable in determining the performance of a PVMS. A program which scatters its references over the entire address space will page-fault heavily. It is therefore desirable that the program tend to refer to a small subset of its address space in a small time interval.
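This tendency can be quantified by sliding a window of length τ over the reference string and counting the distinct pages inside it (Section 2.4 formalizes this quantity as the working-set size). The following is a minimal illustrative sketch in Python; the page numbers are invented, and the measurement programs actually used in this thesis are the PL/I programs of Appendix A.

```python
def distinct_pages_in_windows(w, tau):
    """For each t >= tau, count the distinct pages among r_{t-tau+1}, ..., r_t."""
    return [len(set(w[t - tau:t])) for t in range(tau, len(w) + 1)]

# A program looping over a few pages keeps the count small and stable,
# while a program that scatters its references touches nearly tau pages per window.
looping   = [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3]
scattered = [1, 5, 9, 2, 7, 4, 8, 3, 6, 1, 9, 5]
print(distinct_pages_in_windows(looping, 4))    # [2, 3, 2, 3, 2, 3, 2, 3, 2]
print(distinct_pages_in_windows(scattered, 4))  # [4, 4, 4, 4, 4, 4, 4, 4, 4]
```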
Fortunately, the above property, known as the locality property, is found to be exhibited by most programs in practice [BELA.66, COFF73, DEM68, DFJW70, DEM72a, FINE66, LIPT68, BRAW68, BRAW70, MCKE69, KUEH68, SAYR69, SHED72], In fact, this property of locality is the key reason of the feasibility of PVMS. More specifically, the property of locality can be summarized in three statements [DEM72a] : During any time interval, a program distributes its references nonuniformly over its address space, some pages being favored over the others. The density of references to a given page changes slowly in time or the set of favored pages changes membership slowly. Two disjoint segments of the page reference string tend to be highly correlated when the interval between them is short, and tend to become uncorrelated as the interval between them increases. It is clear that the behavior and style of the programmer has a direct bearing on the locality of programs. Programmers tend to use sequential and looping control structures and they tend to concentrate on small sections of large programs for moderately long time intervals, and they generally group data into content-related blocks. Ik The set of favored pages seem naturally to split into four areas of activity [GIBS66, J0SE70] . One of these areas is constituted by the instruction addresses of the program. The other three areas are constituted by data addresses. The processes of data analysis often consist of arithmetic or logical manipulations on two or more strings of independently (semi) coherent addresses which in combination form the sequence of operand addresses. The fourth area of activity is the output catchment area where the results are held prior to outputting. If it is not possible to study the program behavior in this great a detail, it may be profitable to separate the set of favored pages into data pages and instruction pages [DEM72], We will investigate this topic in detail in this thesis. We remark that for problems with large data bases, the instruction paging is almost trivial compared with the data paging. Therefore, we will concentrate, for the most part, on the data paging problem in this thesis. Examples of what may happen to performance of a program with poor locality are too numerous [FINE66, BRAW68, DEM65, BAIR68, SMIT67] . Therefore, it is very important that methods be developed to create guidelines on how to write programs with a high degree of locality. Before we discuss these locality improvement methods, we discuss several other results on program behavior. To study and predict performance of programs running under FVMS, several models of program behavior have been investigated. These models are used to generate reference strings to be used in the analytical study of program behavior. The validation of a proposed model is another question that need be answered. The simplest of all models investigated 15 is the independent reference model, which has been studied by several authors [COFF73, KING71, FRMTfk, AH071, BELA.66] . In this model, it is assumed that the probability of a reference to page i at time t is given "by, P [r, = i] = p. V t > 1 V 1 < i < n, where the set of all p. i r "0 l -i is n fixed and Z p. = 1. It has been observed that, this model is relatively 1=1 1 simple to analyze but it does not mirror the behavior of actual programs [COFF73] . Aho et al [AH071] proposed a very general ^-order nonstationary model of program behavior. 
However, except in 0-order stationary case (which is identical to the independent reference model), very few results are obtained for this general model. Denning et al [DENN72a] propose several locality models of program behavior. They were able to produce some interesting results with these models which compared favorably with the behavior of practical programs. Matt son et al [MATT70] propose an LRU stack model of program behavior which was improved and modified by Shedler et al [SHED72]. In the LRU stack model, it is assumed that the locality L of the program at time t is of a fixed size c, equal to the page allotment, and consists of the last c pages referred by the program. Quite clearly, this is a good model. Saltzer [SALT7^] has recently proposed a simple linear model of program behavior which is validated against the MULTICS system. This model assumes that the page fault probability is inversely proportional to the page allotment c. Several authors have given guidelines to produce programs with better locality [BRAW68, DENF70, F1NE66, HATF71, KUEH68, WEIN72]. The use of modular and highly structured programming and programming languages is a suggestion made most often. The improvement of locality by proper 16 arrangement of cede and data in VA space is a problem we will discuss separately. Here we are concerned with logical organization of the program. When asked to solve a particular problem, a programmer, generally, has a choice of a variety of algorithms to use. As an example, if he is to solve a system of linear equations, he can use Gaussian elimination, LU decomposition or Cholesky factorization and many others [ISAA66]. The choice of the algorithm is based upon the mathematical properties of data; therefore, we expect the locality to vary with each algorithm. Even when a particular algorithm is chosen, it may be possible to change the order of some operations without affecting the algorithm drastically. We would like to consider the order of operations which yields the maximum locality. In order to discuss this topic in detail, we have to consider individual applications. For applications involving sorting, see [BRAW70], for searching problem see [KNIG ], and for the list processing application see [BOBR67] . In the case of matrix algorithms, McKeller et al [MCKE69] have shown that block algorithms have a superior locality property compared with nonblock algorithms. Lubrulle [DUBR72] has discussed the solution of the Eigenvalue problem in a paged environment. Moler [M0LE72] has discussed a method of loop reversal for improving matrix algorithms. Rogers [ROGE73] discusses the solution of linear equations in a paged environment. We will discuss methods of improving locality of matrix algorithms Q Our work will be distinguished from that of the above by the fact that we will be studying many more matrix algorithms and that our measurements will be more extensive. In particular, we will study matrix multiplication, Cholesky decomposition, EU decomposition, Gaussian 17 elimination and Gram Schmidt orthonormalization. We note here that, we choose to ignore the error growth aspect of our algorithms unlike Rogers. We will measure the paging performance of our algorithms under a variety of paging algorithms, both realizable and unrealizable. As against this, McKeller et al have only measured the performance of their algorithms under MIN [BELA.66] paging algorithm, which is unrealizable. So far we have discussed only programmer-implemented locality improvement methods. 
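The interaction between the order of operations and the layout of data in the VA space can be seen in a toy model: an n x n matrix stored row by row, with a 'page' holding a fixed number of consecutive words. The sketch below (Python; the sizes, the row-major layout and the crossing count are illustrative assumptions only, and the thesis itself assumes submatrix storage, Section 2.2.6) counts how often successive element references cross a page boundary for two traversal orders of the same computation.

```python
def page_of(i, j, n, words_per_page):
    """Page holding element (i, j) of an n x n matrix stored row by row."""
    return (i * n + j) // words_per_page

def reference_pages(n, words_per_page, by_rows):
    order = [(i, j) for i in range(n) for j in range(n)] if by_rows else \
            [(i, j) for j in range(n) for i in range(n)]
    return [page_of(i, j, n, words_per_page) for i, j in order]

def boundary_crossings(pages):
    return sum(1 for a, b in zip(pages, pages[1:]) if a != b)

n, wpp = 8, 8   # one matrix row per page
print(boundary_crossings(reference_pages(n, wpp, by_rows=True)))   # 7: each row stays on one page
print(boundary_crossings(reference_pages(n, wpp, by_rows=False)))  # 63: every step changes pages
```

The computation visits exactly the same elements in both cases; only the order differs, which is precisely the kind of reordering (loop reversal, blocking) studied in Chapters 4 and 6.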
One of the motivations for adopting a virtual memory system is that it relieves the programmer from the burden of memory management (or overlay problems) when his program cannot fit into the available MM [DENN70, KILB62], therefore, asking the programmer to worry about locality of his programs seems to be a step backwards. The job of improving locality of programs must then be carried out by the computer system. The operating system cannot possibly do much, since it does not know the global structure of the program. Therefore, it appears plausible that the optimization we seek can be achieved by the compiler. We know of no earlier work along these lines. Observe that, we are considering the topic of logical program reorganization and not physical reorganization. We will consider this type of compiler optimization for matrix algorithms written in both a high-level language (FORTRAN or PL/l) and an array language OL/2 [PHIL72]. We remark that the OL/2 language is an array language which allows dynamic partitioning and block structure, but it does not allow a 'GOTO' statement. These characteristics greatly influence the program reorganization strategies. 18 2.2.6 Pagi nation The organization of a program's information in the VA space has been termed pagination [DEM70] . This can be further divided into code (instruction) pagination and data pagination. In code pagination, at the highest level one could study the grouping of subroutines to minimize page faults. Such experiments were made by Comeau [COME67]. Code pagination can be done on the basis of statistical properties (e.g., branch probabilities) [FAMA66, DEHN70], the history collected from previous runs [HATF71, FERR73], or solely on the basis of syntax [YEL071] . The problem of data pagination has received very little attention. In particular, only for special applications like matrix algorithms have investigations been made. It has been noted by several authors [MCKE69, ROGE73] that storing matrices by square blocks coupled with the use of block algorithms seems to improve paging performance. We will assume block (or submatrix) type of storage for matrices. 2.2.7 Paging Algorithms We now discuss paging algorithms and their effect on paging performance. A paging algorithm implements three policies. The fetch policy determines when a particular page should be brought into MM; a replacement policy determines when a particular page is removed from MM; and a placement policy determines an available page frame to hold a fetched page. In a PVMS (as against a segmented VMS), the placement policy is trivial and therefore, will not be considered [DENF70] . Basically there are two types of fetch policies, namely, demand fetching and anticipatory fetching. Under demand fetching, a page is 19 "brought into MM only on demand, i.e., when a page is referenced and is not found to be resident in MM (i.e., at the time of a page fault). Under anticipatory fetch policy (also called prefetching), one or more pages may be brought into MM at any time, usually in advance of reference to the page. A special kind of prefetching policy is called demand prefetching [COFF73]. Assume that at time t, a page x is referenced and is not found in MM, thus creating a page fault. As a result a page fetch for page x will be initiated. A demand prefetch policy will allow us to fetch some other pages together with page x in the above situation. A replacement policy can also be one of two types : demand an anticipatory. 
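The difference between the two fetch policies can be seen in a small simulation. The sketch below (Python, illustrative only; it uses a least-recently-used replacement rule, and the sequential reference string is invented) counts page faults and page pulls for pure demand fetching and for a simple anticipatory variant that, on a fault for page x, also pulls page x+1, the kind of sequential prefetch discussed below.

```python
from collections import OrderedDict

def simulate(w, c, fetch_next=False):
    """Demand fetch, optionally pulling page x+1 along with a faulting page x.
    Replacement is least recently used.  Returns (page faults, page pulls)."""
    mem = OrderedDict()                  # resident pages, most recently used last
    faults = pulls = 0
    def pull(p):
        nonlocal pulls
        if p in mem:
            return
        if len(mem) >= c:
            mem.popitem(last=False)      # evict the least recently used page
        mem[p] = True
        pulls += 1
    for x in w:
        if x not in mem:
            faults += 1
            pull(x)
            if fetch_next:
                pull(x + 1)              # anticipatory fetch of the next page
        mem.move_to_end(x)
    return faults, pulls

w = [1, 2, 3, 4, 5, 6, 7, 8]             # a purely sequential pattern
print(simulate(w, 3))                    # (8, 8)
print(simulate(w, 3, fetch_next=True))   # (4, 8): half the faults, the same pulls
```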
Most popular paging algorithms are of demand_fetch, demand_replace variety and are commonly known as demand paging algorithms. All the other varieties of paging algorithms are commonly known as prepaging algorithms. Demand paging has been widely used in practice due to its simplicity in implementation. Prepaging, on the other hand, is more difficult to implement, chiefly because good predictions of a program's pages needed in future are not easily obtained [DEM70, DEM72], Two types of prepaging schemes have been reported. One scheme, known as swapping, has been used to guard against excessive paging due to a small time quantum (e.g., time- sharing systems) [DEM70, KUEH68, 0RGA72]. In this scheme (except for the first time quantum) a program's working set is preloaded at the beginning of the time quantum and demand paging is used during the time quantum. We note that the working set to be preloaded is determined from the set of pages the program had acquired during the last time quantum. This scheme is successful with small time quantum. 20 The second scheme, proposed by Joseph [J0SE70], is as follows. Whenever a page fault occurs for page x, then fetch page x and x+1 . This scheme is particularly successful when the address pattern of a program is sequential. Prepaging incurs less page faults than demand paging, but the number of page pulls may be the same or even larger because of the possibility of an incorrect prediction [J0SE70], The effect of prepaging on ST product of MM used is not clear from the literature. Joseph reports an increase in ST product whereas Denning [DEM70] shows that if the probability of incorrect prediction is low, then ST product is reduced by prepaging. Because of a reduction in total page wait time, ST product tends to reduce but at the same time it tends to increase because pages are brought in MM in advance of their use. We will discuss this question in Chapter 3. If more than one page is brought into MM at the same time and a proper layout of pages on the secondary memory is used, then the effective access time per page (t) can be reduced, provided rotating secondary memory is used. The basic question, then seems to be that which performance measure is more important, page fault or page pull. There is no easy answer to this question. However, if the goal is better cpu utilization then page fault is a more important measure and if reduction of the channel traffic is the goal, then page pull is a more important measure. It seems, therefore, that prepaging is justifiable if good predictions are available. We will elaborate on these ideas in subsequent chapters. 2.3 The Formalism of Paging Algorithms In the abstract, a paging algorithm A is a mechanism for processing a reference string w = r, r~ ... r ... and generating, in response, a 21 sequence of memory states S_ S, . . . S . .. where S Q is a specified initial memory state [AH071, COFF73] . Each memory state Sis the set of pages from N which reside in M at time t; they satisfy the conditions S, c N, |S, I < c, r, e S (t > 0). Moreover, S and S . are related by, S, = S, + X - Y where X, c N - S, , is the set of pages fetched at time t and Y. c S, , is the set of replaced pages. To determine X, and Y at time t, the paging algorithm must maintain internal records. A set of control states Q is used for this purpose, q being a designated initial control state. A configuration of the paging algorithm is any pair (S, q) in which |s| < c, q e Q. 
Associated with the algorithm A is a transition function g_A such that g_A(S, q, x) = (S', q'), with x ∈ S', where (S, q) is the present configuration, (S', q') is the next configuration, and x is the current page reference causing the transition. In particular, the memory-state sequence S_0 S_1 ... S_t ... induces a configuration sequence (S_0, q_0) (S_1, q_1) ... (S_t, q_t) ... generated by

   (S_t, q_t) = g_A(S_{t-1}, q_{t-1}, r_t),   t ≥ 1.

A paging algorithm is a demand-fetching algorithm if, for a given c > 0, |X_t| ∈ {0, 1}, with X_t = ∅ if r_t ∈ S_{t-1} and X_t = {r_t} otherwise. For a page x we let the backward distance λ_t(x) be the number of references since the most recent reference to x in r_1, r_2, ..., r_t (λ_t(x) = ∞ if x does not appear in r_1, ..., r_t), and the forward distance d_t(x) be the number of references until the next reference to x in r_{t+1}, r_{t+2}, ... (d_t(x) = ∞ if x is not referenced again).

Presently we consider paging algorithms which work with a fixed memory allotment (i.e., a fixed value of c). Some typical demand paging algorithms will now be described. Let S = S_t and let S' = S_{t+1} = S + x - R(S, q, x), where x ∉ S, x ∈ S', and R(S, q, x) denotes the page selected for replacement; R(S, q, x) = ∅ if |S| < c.

(1) LRU: This algorithm chooses the least recently used page as the page to be replaced. In other words, R(S, q, x) = y iff λ_t(y) = max over z ∈ S of λ_t(z). This algorithm is widely used in practice and is found to behave fairly well [DENN70, CORB69, LIPT68].

(2) FIFO: This algorithm chooses for replacement the page that was fetched into MM first. It behaves well only for highly sequential reference patterns; however, it is very simple to implement.

(3) RAND: It selects the page to be replaced at random from the pages resident in MM. Clearly, this algorithm behaves well only for the independent reference model of program behavior [BELA66, KING71, COFF73]. It is also very simple to implement.

(4) MIN: The page to be replaced is the one with the largest forward distance; in case of a tie, the tie-breaking rule uses lexicographic ordering. This algorithm was first proposed by Belady [BELA66], and it is unrealizable since it presumes advance knowledge of the program's reference string. It is, however, useful for theoretical purposes, since it can be proved optimal in a certain sense. It is sometimes also called B_0 [COFF73] or OPT [MATT70]. More specifically [COFF73], R(S, q, x) = y iff y = min[z ∈ S*], where S* is defined by: i ∈ S* iff d_t(i) = max over u ∈ S of d_t(u). Note that lexicographic ordering is used in S*.

2.3.1 Cost of a Paging Algorithm

Aho et al [AHO71] define the cost of processing a reference string w = r_1, r_2, ..., r_T with a paging algorithm A operating with c page frames of MM as

   C(A, c, w) = Σ_{t=1..T} h(|X_t|),

where h is a function such that h(0) = 0 and h(k) ≥ h(1) = 1. Aho et al prove the following theorem: if h(k) ≥ k, then for any given paging algorithm A there exists a demand paging algorithm A' such that C(A', c, w) ≤ C(A, c, w) for every reference string w and every value of c.

We note that for rotating auxiliary memory devices such as discs and drums h(k) < k; only for a large core storage device can we hope to have h(k) = k. In that case the above cost function measures the number of page pulls. A paging algorithm A is said to be optimal (with respect to the cost function defined) if it minimizes C(A, c, w) for all reference strings w and all values of c. It can be proved that MIN is an optimal paging algorithm by this definition of optimality [COFF73, POME71, MATT70]. Therefore the MIN algorithm, though unrealizable, is useful as a benchmark for evaluating other paging algorithms.
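The replacement rules above translate directly into code. The sketch below (Python, illustrative; the thesis's own simulators are the PL/I programs of Appendix A) plugs LRU, FIFO and MIN victim selectors into one demand-paging loop. Ties are broken arbitrarily here rather than lexicographically, and MIN needs the entire reference string in advance in order to compute forward distances.

```python
def simulate(w, c, choose_victim):
    """Demand paging with c frames; choose_victim picks the page to replace."""
    S, last_use, load_time = set(), {}, {}
    faults = 0
    for t, x in enumerate(w):
        if x not in S:
            faults += 1
            if len(S) == c:
                y = choose_victim(S, t, w, last_use, load_time)
                S.remove(y)
            S.add(x)
            load_time[x] = t
        last_use[x] = t
    return faults

def lru(S, t, w, last_use, load_time):
    return min(S, key=lambda z: last_use[z])      # largest backward distance

def fifo(S, t, w, last_use, load_time):
    return min(S, key=lambda z: load_time[z])     # fetched earliest

def opt(S, t, w, last_use, load_time):            # MIN: largest forward distance
    def forward(z):
        rest = w[t:]
        return rest.index(z) if z in rest else float('inf')
    return max(S, key=forward)

w = list("ABCDEBCBDAEAC")
print(simulate(w, 3, lru), simulate(w, 3, fifo), simulate(w, 3, opt))
# 11 11 7 -- the LRU count matches the worked example in the next section
```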
Assume that a probability distribution is specified over N^T such that for w ∈ N^T, p(w) is the probability of occurrence of w. We then define the expected cost of a paging algorithm A by C(A, c) = Σ_w p(w) C(A, c, w) [COFF73], and we define a paging algorithm to be optimal with respect to the distribution p if it minimizes C(A, c) for all c ≥ 1. Denning et al [DENN68b] proposed a demand paging algorithm A_0, which replaces the page with the longest expected time until next reference. It can be proved that A_0 is an optimal paging algorithm with respect to a distribution corresponding to the independent reference model of program behavior [AHO71].

2.4 Working Set Algorithm

We consider Denning's working set algorithm, which is a variable-memory algorithm [DENN68a]. A program's working set at time t is defined to be

   W(t, τ) = {i ∈ N | page i appears among r_{t-τ+1}, ..., r_t},

where τ ≥ 1 is a parameter called the window size. In other words, W(t, τ) is the 'contents' of a window of size τ looking backwards at the reference string from reference r_t. The working-set replacement algorithm essentially states: do not replace a page belonging to the working set. This algorithm is a little difficult to implement but behaves very well in practice [DOHE70, RODR72, RODR73, WEIZ69]. The working-set principle consists in using a working-set replacement algorithm together with the following scheduling policy: a program may run only if its working set is in MM [DENN70]. Denning has shown that thrashing can be avoided by use of the WS principle [DENN68].

Denning et al [DENN72b] define and prove several important properties of the working-set model. Define the working set size w(t, τ) = |W(t, τ)| and the average working set size s(τ) = lim_{k→∞} s_k(τ), where s_k(τ) = (1/k) Σ_{t=1..k} w(t, τ). Define the binary variable λ(t, τ) = 1 if r_{t+1} ∉ W(t, τ) and 0 otherwise; then the missing-page rate (page fault probability) is m(τ) = lim_{k→∞} (1/k) Σ_{t=0..k-1} λ(t, τ). Suppose that in the reference string w two successive references to page i occur at times t and t + x_i; we call x_i an interreference interval for page i. The interreference distribution for page i is defined to be

   F_i(x) = lim_{k→∞} (number of x_i in r_1, ..., r_k with x_i ≤ x) / (number of x_i in r_1, ..., r_k),

and the interreference density function for page i is f_i(x) = F_i(x) - F_i(x-1). With these definitions the following properties hold:

   P1: 1 = s(1) ≤ s(τ) ≤ s(τ+1) ≤ min{n, τ+1}.
   P2: s(τ+1) - s(τ) = m(τ).
   P3: 0 ≤ m(τ+1) ≤ m(τ) ≤ m(0) = 1.
   P4: m(τ) = 1 - F(τ) = Σ_{y>τ} f(y).

2.5 Performance Measurement

As an example, consider the reference string w = A B C D E B C B D A E A C processed by the LRU paging algorithm with c = 3 (the three rows of the memory state list the resident pages in order of decreasing recency of use):

   reference    A  B  C  D  E  B  C  B  D  A  E  A  C
                A  B  C  D  E  B  C  B  D  A  E  A  C
   memory       -  A  B  C  D  E  B  C  B  D  A  E  A
   state S_t    -  -  A  B  C  D  E  E  C  B  D  D  E
   page fault   *  *  *  *  *  *  *     *  *  *     *

The number of page faults in this example is 11. Observe how the LRU algorithm chooses a page to be replaced: at t = 4, r_4 = D ∉ S_3, which implies a page fault, which in turn implies a replacement; in S_3, A is the page least recently used, and it is therefore chosen for replacement. Now if, for the same w and the same paging algorithm but a different value of c, we are asked to find the page fault count, we must go through the simulation again. This means a rescan of the reference string for each new value of c. Since practical reference strings are very long, say a million references, such simulation is very inefficient. Mattson et al [MATT70] have developed a technique which scans the reference string only once and produces the page fault count for all values of 1 ≤ c ≤ n. We will have occasion to use these ideas in Chapter 3.
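For LRU the one-pass idea is particularly simple, because the LRU stack is just the pages ordered by recency of use. In the illustrative Python sketch below, each reference is assigned its stack distance, and the fault count for every memory size c is then read off from the distribution of distances; one scan of the reference string serves all values of c.

```python
def lru_stack_distances(w):
    """For each reference, the number of distinct pages referenced since its
    previous use (counting the page itself); infinity on a first reference."""
    stack, distances = [], []
    for x in w:
        if x in stack:
            d = stack.index(x) + 1
            stack.remove(x)
        else:
            d = float('inf')
        distances.append(d)
        stack.insert(0, x)          # most recently used page on top
    return distances

def lru_faults_for_all_c(w, n):
    distances = lru_stack_distances(w)
    return {c: sum(1 for d in distances if d > c) for c in range(1, n + 1)}

w = list("ABCDEBCBDAEAC")
print(lru_faults_for_all_c(w, 5))
# {1: 13, 2: 11, 3: 11, 4: 8, 5: 5} -- c = 3 reproduces the 11 faults above
```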
2.5.1 Stack Algorithms [C0FF73, MATT70] We introduce the notation S(A, c, w) to stand for the memory state resulting after the paging algorithm A has processed reference string w in a memory of size c, assuming that S = f). That is, if w = r,, r , ..., r, generates S_, S , ..., S under A, then S(A, c, w) = S, . If A is understood, we simply write S(c, w). 28 An algorithm A is called a stack algorithm if its memory states satisfy the following inclusion property: S(c, w) c S(c+1, w) V c > 1, V uieN . This means that the memory states form a collection of nested sets. This condition is equivalent to the following condition: for each w, there exist a permutation of N, s(w) - (s^oj), Sg(w), . .., B (»)), such that, for all 1 < c < n, S(c, u>) = {s 1 (w), s 2 (w), . .., s q (w)). The vector _s(w) is called the stack vector or imply the stack. For a given stack algorithm, the inclusion property implies that for each reference string r _, . .., r , a sequence of stacks s_ , s , . .., _s can be constructed so that the memory state sequence for each value of c can be determined by simply taking the topmost c pages of the stack. This property of stack algorithms implies that the page fault behavior of a given reference string can be computed effectively in parallel for all memory sizes 1 < c < n and in one scan of the reference string. Examples of stack algorithms are MEN, LRU, LFU (least frequently used), whereas FIFO is not a stack algorithm [COFF73, MATT70] . The stack distance D (w) is defined for page x to be the position of x in the stack s(w). Thus if s. (oj) = x, then D (oj) = k. If x does K X not appear in _s(w), then D (w) = ». Observe that a page fault occurs for page x as the last reference in the string wx iff D (to) > c, since A the first c elements of _s(oj) are the contents of memory of size c. Suppose that u> = r n , r . ..., r is processed by A producing the stack -L d. y sequence s_ , s^, ..., s,, ..., s . There is, associated with oo, a stack distance sequence D , D , ..., D , where L\ is the position of r^ in X. d. y % t St-l* If *( A > c ' **) denotes the number of page faults in processing oo with memory size c using algorithm A then, Define 29 it (A, c, to) = |{t|D > c, 1 < t < y\. a. = | (t|D , = k, 1 < t < 7} | then clearly, n it (A, c, w) = a + Z a . k=c+l K c Alternatively, the success function N(A, c, to) is N(A, c, u>) = Z a, k=l * the success frequency F(A, c, to) = N(A, c, w)/^, and the page fault frequency m(A, c, to) = jt(A, c, to)/y. Gordon [G0RD73] has shown a method by which one can compute the average working set size and the page fault probability for the WS algorithm and for the LRU algorithm in one pass of the reference string. Define I (w) (reference interval) associated with each page in the stack _s(w) as the number of distinct pages referenced since the last reference to this page and I (tax) = 1 (i.e., reference interval is defined to be equal to 1 at the time this page is referenced). If x^_s(oo) then I (to) = 00. Let I be the reference interval associated with page r , = x in the stack s. . Then associated with the reference string , there is a reference interval sequence I.., I p , ..., I , ..., I . Clearly «(WS, T, w) = I {t|l > t, 1 < t < 7} I . If we define, b = | {t|l = k, 1 < t < 7)|, then *(WS, t, to) = b + z b Also iq(t) = it(WS, t, to)/| k=T+l K T-l and s(t) = Z m(z) where m(0) = 1 by definition. 
z=0 We have noted that MEN is a stack algorithm but since it requires a scan of the future reference string in order to determine the page to be replaced, a one pass algorithm is difficult to obtain. Mattson et al [MA.TT70] have given a two pass algorithm to get MIN statistics. More 30 recently, Belady et al [BELA.7U] present a one pass algorithm to obtain MIN page fault behavior. So far we have discussed methods to obtain paging performance when the reference string is specified. Collection of page traces, however, is very time consuming and for very long page traces running simulations to obtain paging performance is also very time consuming. It is beneficial, therefore, to consider analytical methods to predict paging behavior of programs. This requires a model of program behavior to be formulated. King [KING71] has analyzed program behavior using the independent reference model. 2.5.2 King's Model If the configuration sequence (S Q , q_) (S,, q, ) ... (S,, q^ ) ... generated by a paging algorithm form a Markov chain, then analytical methods exist to find m(c) where m(c) = Z p(w) m(c, co). Here N is the 00 uxeN set of all infinite reference strings over N. Suppose we are given the transition probability matrix P of the Markov chain whose states are V x Q> where V-{S|ScN, |s| =c} (i.e., y is the set of memory states). Now if the Markov chain is irreducible, then there exists a unique stationary probability distribution a such that aP = P and z a. = 1. Define the equivalence class of ie V XQ 1 state i of the Markov chain by: [i] = [i j transition from i -* j and j -> i is a non-page fault transition} . 31 It can be shown that, VXQ m(c) = 1 - Z ( L a. p 0=1 {k|[k]=[j]) K K ' J where P = (p v .)« Under the independent reference model of program behavior LRU, FIFO, A_ algorithms satisfy the criteria for this treatment and closed form expression of page fault probability can be obtained. 32 3. GENERAL PREPAGING ALGORITHMS 3.0 Introduction In this chapter we present arguments as to how prepaging can be useful. We present several new prepaging algorithms and study their behavior. We have noted in Chapter 2 that, Aho et al proved that given an arbitrary paging algorithm A, there exists a demand paging algorithm A' which performs no worse than A. First of all, this theorem holds only if h(k) > k. For rotating auxilliary memories h(k) < k and only for bulk core storage, do we have h(k) = k. Since rotating auxilliary memories are in widespread use, the assumptions of the theorem do not hold in practice. If h(k) = k then the cost function defined by Aho et al measures the number of page pulls. If the channel is the bottleneck in the system then the page traffic should be minimized. Heuristically, minimization of page pulls for individual programs will imply minimization of the system page traffic. Under this assumption the theorem is valid. When cpu overhead is critical [MASU72, 0RGA72], however, and we have multi- programming system then a page fault implies a process switch in which case, how many pages we pull after a page fault is immaterial and therefore the cost of a paging algorithm is the number of page faults it incurs. Under this assumption, h(k) = 1 V k > 1. Reduction in cpu overhead is affected in two ways : first since the number of page faults is decreased and as we have seen, few milliseconds of cpu t im e is spent 33 in servicing a page fault. Secondly, after a page fault, if a ready- process is not available, the cpu may have to be idle. 
By reducing the number of page faults we reduce the possibility of such idleness. The effect of prepaging on the ST factor will be discussed in Section 3.1.

Granted that prepaging has its benefits, there are two problems yet to be resolved: who should specify the paging needs in advance, and when should the page fetches be carried out. Hoare [HOAR72] notes that the memory management routines do not have sufficient knowledge of the future reference string; therefore these forecasts must be made by the programs, either directly by the programmer or by the compiler. We agree with this statement. To show the viability of prepaging, we will consider a particular application and show that prepaging can be done and that it does result in improved performance.

If we resort to demand prepaging, then the problem of when to fetch pages is solved: whenever a prepage request is made, we simply flag the page, and on the occurrence of a page fault we bring in all the flagged pages. Note that under demand paging the distinction between a page fault and a page pull vanishes, since either implies the other. Under an arbitrary paging algorithm neither need imply the other. Under demand prepaging, a page fault implies a page pull but not vice versa; this means that for demand prepaging, # page faults ≤ # page pulls. The ideal number of page faults is zero. It is possible, at least in theory, to achieve this ideal with arbitrary prepaging, but not with demand prepaging or with demand paging.
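The flag-and-fetch rule just described is easy to prototype. In the sketch below (Python, illustrative; the PRE(x) requests are assumed to have been placed in the reference string by the programmer or the compiler, and flagged pages are brought in only while free frames remain), every page fault implies a page pull but not conversely.

```python
def demand_prepage(w, c):
    """w contains page numbers and ('PRE', x) hints.  Hinted pages are flagged
    and fetched at the next page fault, into free frames only.  Replacement is
    least recently used.  Returns (page faults, page pulls)."""
    mem, flagged = [], set()            # mem ordered least- to most-recently used
    faults = pulls = 0
    def pull(p):
        nonlocal pulls
        if p not in mem:
            mem.append(p)
            pulls += 1
    for r in w:
        if isinstance(r, tuple) and r[0] == 'PRE':
            flagged.add(r[1])
            continue
        if r in mem:                    # a hit: just update the recency order
            mem.remove(r); mem.append(r)
            continue
        faults += 1
        if len(mem) >= c:
            mem.pop(0)                  # evict the least recently used page
        pull(r)
        for p in list(flagged):         # bring flagged pages along with the fault
            if len(mem) < c:
                pull(p)
        flagged.clear()
    return faults, pulls

w = [1, ('PRE', 2), ('PRE', 3), 1, 4, 2, 3, 4]
print(demand_prepage(w, 4))   # (2, 4); plain demand paging on this string gives (4, 4)
```

On this made-up string the hints halve the fault count without increasing the page traffic; with wrong hints the pull count could of course grow, which is the trade-off discussed above.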
"C K— -L "t it k (S+x, t-l) = « k-1 (S+x, t), 36 therefore, it is sufficient to show that Tt,(M.+x, t) = min. («, (S+x, t)). k t 3rN-x k |BT- c-1 In fact, we will prove that An, = jt, (S+x, t) - jl (M.+x, t) < 1. We will prove this by induction on k. It is clearly true for k = (by definition of n (S, t)). Assume true for V i < k. If S = M then we are done, therefore assume S f M and let i be the smallest index such that y. 4. S, "C i also let d (y. ) = i. First note that i < c-1. Then, rt k (S+x, t) = 1 + min.(Tr k _^(S+X-Y, t+i+l)) and « k (M t +x, t) = * k _^(M t +x, t+i+l). By inductive assumption, min. (ji^CS+X-Y, t+i+l)) Also, by induction, ^k-i = Vi (s ' +r t+i+i' t+i+1 > ^-^ M f. + ^ +r + + ^V t+ ' +1 ) t+i+1 > - ^^(M^x, t+i+l) 37 note that r „ , € M +x. .'. taking S' = M t +x " r t+ ^ +1 * ^k = 2 + Vi ( W r t + i + l' t+2+1) • - v/ s ' +r t + ^ t+ ^ +1 ) = or 1 by the inductive hypothesis on Art, g . Therefore for V k > 0, This completes the proof of the optimality of DPMIN. Lemma 3. ! • Given two fixed memory paging algorithms A and A' it (A, c, w) < it(A', c, go)=>ST(A, c, w) < ST^, c, go) on the average, assimiing the behavior of other concurrent programs remain the same. Proof : Since memory alloted to the program is fixed, ST = c * t where t is the total time that the program occupies the MM. t ~ t + m * a *■ m _ C p U t . , + t , . Since cpu time of a program is unaltered by page-wait ready queue a change in the paging algorithm, only the other two factors need to be considered. Now assume T 1 is the average time that the program has to spend in page wait and then in the ready queue. Clearly, T' depends on the characteristics of other concurrently executing programs and on the average page fetch time T. Therefore, T' can be assumed to be a constant. t = t + #page faults * T' m cpu * to The required result immediately follows. 38 Theorem 3.2 : DIMM minimizes ST product among all demand prepaging algorithms under the assumptions of lemma 3.1. Proof : The proof of this theorem follows directly from theorem 3.1 and lemma 3.1. We would have liked to prove optimality of DIMM with respect to page pulls, but unfortunately, it does not hold. We will be able to see this in Chapter 5. As a result of lemma 3.1> we need to only discuss page fault and page pull measures for all fixed memory type paging algorithms. Note that the class of all demand paging algorithms is included in the class of all demand prepaging algorithms. Therefore, applying Theorem 3.1 to the optimal demand paging algorithm MM, we get it (DIMM, c, w) < jt(MM, c, w). We will show that there are cases when strict inequality holds. On the other hand, it is trivial to find examples for which equality holds. Note that MM is optimal in page pulls among all paging algorithms, therefore , C(DIMM, c, w) > C(MM, c, co) = it (MM, c, u>). Once again, there are cases when equality holds in the above relation and there are cases when strict inequality holds. Thus DIMM is superior in two performance measures (page faults and ST product) and MEN is superior in page pulls. Quantitative results of the comparison will be given in Chapter 5 for specific applications. 39 3.1.1 Performance Measurement for DPMIN The problem is to find the number of page faults and the number of page pulls for a given reference string w and a given page allotment c, using the DPMIN paging algorithm. We present a numeral matrix algorithm for this purpose, based on the numeral matrix algorithm for'MIN [BEIA7U] . 
The numeral matrix has as many rows as the number of pages in the VA space and as many columns as the number of page faults. The algorithm is as follows. Initially the matrix is blank (take the rightmost nonempty column to be column 0, and treat column 0 as a full column of blanks).

1. Suppose the next reference in w is to a page x and that the rightmost nonempty column is (i-1).
2. Let j be the rightmost column with c markings (note: j = i-1 or i-2).
3. If j = i-1 then: if the entry (x, i-1) is blank, mark (x, i) and return; else return.
4. If j = i-2 then: if the entry (x, i-1) is blank, mark (x, i-1) and return; else return.

In step 3, marking (x, i) represents a page fault and a page pull. In step 4, if (x, i-2) is blank at the time (x, i-1) is marked, a page pull (without a page fault) has occurred. After the complete reference string has been processed, the number of page faults is obtained by counting the number of nonempty columns of the numeral matrix, and the number of page pulls in excess of page faults is obtained by counting the number of markings made in step 4 for which the entry (x, i-2) was blank.

Clearly, this algorithm is c-dependent. We illustrate its use by the following example. Let N = {A, B, C, D, E}, c = 3 and w = A B C D E D B C B A E A C. The numeral 1 is used for marking; if a page pull occurred without a page fault, we indicate this with an asterisk. The numeral matrix for our example is:

        1    2    3    4
   A    1         1*   1
   B    1*   1    1
   C    1*        1    1
   D         1
   E         1*        1

Clearly, the number of page faults is equal to four. The number of page pulls is given by

   # page pulls = # of 1* + # page faults = 4 + 4 = 8.

Belady et al [BELA74] show that, for the same reference string, the MIN algorithm produces # page faults = 8 = # page pulls.

The numeral matrix algorithm is clearly c-dependent, i.e., for each new value of c a rescan of the reference string is needed. In particular, DPMIN is not a stack algorithm. We may verify this fact by the following example. Let w = A B C D. For c = 2 the memory state sequence is {}, {A, B}, {A, B}, {C, D}, {C, D}, while for c = 3 it is {}, {A, B, C}, {A, B, C}, {A, B, C}, {D}. Since S_4(2) is not contained in S_4(3), the inclusion property fails, which gives the required result.

Another problem in the implementation of the numeral matrix algorithm is the storage of the matrix. A close inspection, however, reveals that only two columns of the numeral matrix need be stored at any one time. In step 2 we noted that j = i-1 or i-2; this is because we create a new column only when the previous column is full. Thus only the present column and the last column need be kept in storage. A PL/I implementation of DPMIN is given in Appendix A.

3.2 Realizable Prepaging Algorithms

We have noted that DPMIN is an unrealizable paging algorithm. We would now like to consider some paging algorithms which can be realized.

3.2.1 Freeing Dead Pages

Assume that at time t a certain page x is 'dead', i.e., it does not occur in the reference string after time t. This is equivalent to saying that d_t(x) = ∞. Assume that either the programmer or the compiler has discovered this fact, and furthermore that this information is provided to the operating system, so that it can either push the page x onto the secondary memory or flag the page dead, so that at the time of the next replacement decision page x will have the highest priority for replacement. Assume that we originally started with a demand paging algorithm A and then modify it by the above mechanism; call the result FREEA. The process of declaring a page dead will be called freeing.

There are two possible ways to interpret FREEA. One way is to assume that on the issuance of a FREE(x) instruction, page x is replaced; in this case FREEA is a demand-fetch, anticipatory-replace algorithm. The second way is to assume that on the issuance of a FREE(x) instruction, page x is
There are two possible ways to interpret FREEA: One way is to assume that on the issuance of a FREE(x) instruction, page x is replaced. In this case FREEA is a demandrfetch-anticipatory-replace algorithm. The second way is to assume that on the issuance of a FREE(x) instruction, page x is k2 flagged (i.e., a 'dead* bit associated with the page table entry of page x is set), and when a replacement decision is to be made, first a search is made for a page with the dead bit on. If none is found then the replacement procedure as in algorithm A is followed. With this interpre- tation, FREEA is a demand paging algorithm. Quite clearly, these two interpretations do not affect the number of page pulls or the number of page faults and therefore, we do not have to distinguish between them. It is clear that n(FREEA) < tt(A). To verify this, assume that A has made a wrong replacement decision at time t, and further assume that working with FREEA, at time t, a dead page was present in memory and therefore FREEA will not make a wrong replacement decision. On the other hand, if FREEA has made a wrong replacement decision then quite clearly, A must have also. It is also clear that rt(MIN) < jt( FREEA), since FREEA is a demand paging algorithm and the optimal! ty of MIN is applicable. It should be noted that it (MIN) = ^(FREEMIN), since MIN never makes a wrong replacement decision. If we started out with a demand prepaging algorithm A then we can show that jt(DPMIN).< jt(FREEA) < it(A) and that jt(DFMIN) = it ( FREEDPMIN ) . If we start out with a general paging algorithm A, then < it ( FREEA.) < k(A), Freeing, however, tends to reduce the ST product: First by reducing the number of page faults and second by reducing the average page-wait time. The second reason holds only when anticipatory replacement is used. We have pointed out in Chapter 2 that, it is desirable that a given paging algorithm is a stack algorithm. Theorem 3.3 answers this question for FREEA. k3 Theorem 3 • 3 ' (Assume A is a demand paging algorithm) If A is a stack algorithm then FREEA is a stack algorithm. Proof : The proof makes use of the proposition 6.1 from the book by Coffman et al [COFF73] which states that: A demand paging algorithm B is a stack algorithm iff R(S+y, q, x) = R(S, q, x) or y whenever x ^ S + y. Let r denote the set of distinct pages in the string r_, r p , . .., r . FREEA partitions the pages of r, into two classes. One class is the set of dead pages. We assume that the set of dead pages is lexicographically ordered, and the set of nondead pages is ordered by the same ordering as in algorithm A. Let the set of dead pages in S be denoted by A(S). We want to prove that the above proposition is satisfied by FREEA. Therefore, consider the following two cases : (i) [y e A(S+y)]: If y= min. (A(S+y)) then R(S+y, q, x) = y and we are done, otherwise if y/ min. (A(S+y)) then let y' = min. (A(S+y)). This means y 1 = min. (a(S)) which implies R(S+y, q, x) = R(S, q, x) = y« and we are done, (ii) [y^A(S+y)]: If A(S) = $ then A(S+y) = fi (since y j. A(S+y)) therefore, FREEA behaves exactly like A which means R(S+y, q, x) = R(S, q, x) (since A is a stack algorithm). kk If A(S) / f> then let y' = min.(A(S)). Since y £ A(S+y) =»A(S) = A(S+y) =>y f = min.(A(S+y)) ^> R(S+y, q, x) = R(S, q, x) = y\ Thus FREEA is a stack algorithm. In Appendix A, we give a PL/ I program for FREELRU. 3.2.2 Prepaging We now consider some practical prepaging algorithms. 
Our objective is to consider prepaging algorithms which reduce page faults without increasing page pulls significantly and thereby obtain an improvement over demand paging algorithms. When working with fixed memory algorithms, at the beginning of a time slice, a few page frames remain unoccupied with a demand paging situation. If we could prepage seme useful pages to fill up this unused space, we would probably reduce page faults. The second important rule we incorporate into the algorithm is that we never replace a useful (non-dead) page for bringing in a prepaged page. This reduces the possibility of increasing the page faulting in a prepaging algorithm After the MM is full, this scheme would not allow us to do any prepaging unless, we have some dead pages which can be freed. We will see that this situation is the most valuable. The first prepaging algorithm that we propose is PRE1A. Assume that reference string is modified to have PRE(x) inserted at various places. PRE1A : (Assume A is a demand paging algorithm) If |sj < c then if r. +1 = PRE(x) and x f( S then S = S + x. Otherwise PRE1A works just like A. This algorithm has two drawbacks. First, it implies a page-wait for each prepage operation which is what we were trying to avoid in the first place and second, it is not able to prepage frequently since the situation | S , | < c arises only initially. We can solve the first problem in two ways. One way is to allow the execution of the program and the input operation of the prepaged page simultaneously, which will be called the "overlap solution. " A second way is to concatenate many page fetches together and resort to demand prepaging. The solution to ' |S | < c' problem is to combine freeing and prepaging together. If we use the overlap solution then after the occurrence of PKE(x) in the reference string, a page frame is reserved for x but x cannot be assumed to reside in S . , -. • Page x is then said to be in a 'not-set-up' state. After elapsing of real time T (the average page-wait time), page x said to be in 'set-up' state and can be considered resident in MM. Now, if before a page x is set up, a page fault for another page y (or the same page x) occurs, then during this page-wait period, page x will become set-up. Thus, in general, on the occurrence of a page fault, all not-set-up pages in MM will become set-up. We divide the memory state S t (c) into two disjoint sets U (c) and N.(c). N (c) consists of the set of all not- set-up pages and U (c) = S (c) - N (c). We will assume c > 2 for all prepaging algorithms. We now define PRE2A. We associate a set-up counter 'COUNT' with each not-set-up page. PRE2A: [1] (Page fault for a not-set-up page) If r +. + i = x € - N 4-^ c ^ tlien (declare all not-set-up pages set-up) \ \ + i^ = V c > u V°> - s t (o) = W o) - I N t+1 (c) . 0. We can easily add freeing to PRE2A and obtain FREEPRE2A. It can easily be proved that both, PRE2A and FREEPRE2A are not stack algorithms. It appeared at first that if we restrict our attention to a certain class of reference strings then PRE2A could be proved to be a stack algorithm. We will say that a reference string co satisfies the property P if r, = PRE(x) => (j§ t 1 < t such that r» t , = x or r t , = PRE(x)). Initially it was believed that PRE2A is a stack algorithm for (u £ N P( w )}j however, this was found to be false. The reason for this can be explained as follows: Suppose we are in step (3) of PRE2A. For a certain value of c, say c., assume that r = x ^ S (c, ) and let c. 
be the largest such value of c, =# x e S (c.,+1) (assume that x is set-up in a memory of size c +l). A page fault occurs for c = c,, whereas no page fault occurs for c = c.,+1 =>U, (c, ) = S (c ) + x - y, ye U (c ). But, U^Cc-j+l) = U^c-j+l) U {y e IT ( Cl +1), COUNT(y) > T-l} . Now, the inclusion property for U cannot be proved in general. A similar situation occurs in step (l). We, therefore, modify PRE2A to obtain PRE3A, which is indeed a stack algorithm. PRE3A : [1] If r. +1 = x e N then [a] if x e I (c) then (considered to be a page fault) V y 6 N t (c), COUNT(y) «- COUM](y) + 1; U t+1 (c) = U t (c) + x U{y|y € ^(c), COUM!(y) > T} ; S t+1 (c) = S t (c); W c) = W c )- u t + i (c) ' and RETURN: hi [2] (Page reference to a set-up page) If r, . = x € U, (c) then t+1 t (increment COUNT for each non-set-up page) v y e N , (c) COUNT(y) <- COUNT(y) + 1. (declare all non- set-up pages with COUNT > T, set-up) U t+1 (c) = U t (c) U {y e N t (c) | COUNT(y) > T} . S t + l (c) = S t (c) N t+l (c) =S t + l (c) " U t + l (c) - [3] (reference to page not in S, (c) => page fault) If r. +1 = x fi S.(c) then (bring in the required page, replace a page if necessary and declare all not-set-up pages set-up). U t+1 (c) = S t (c) + x - R A (U t , q, x) N t + i (c) -* W c) = u t + i^ c) - Note here that R.(U , q, x) denotes the page that would have been replaced by paging algorithm A with the given values of the parameters. [k] (a prepage instruction) If r = PRE(x) and x ft S (c) then if | S | < c then \ +1 (c) = N t (c) + x, COUNT (x) <- Vi (c) = u t (c) - End PRE2A. ka [b] If x e U. (c) then (reference to a set-up page) Vy e N.(c) COUNT(y) *- COUNT(y) + lj (c) = U, (c) U{y|y e N. (c), COUNT(y) > T} ; 11 t+1 W c) - s t (c) ' N t + i (c) = s t+i (c) -Vi (c) and RETURN; [c] If x fi T. (equivalent to saying x ^ S , (c f ) for any c f which is equivalent to saying ^ t' < t such that r , = x or r , , = FRE(x) which is equivalent to saying x ^ s_ (k) for any k) then if |S,(c)| < c then U t+1 (c) = S t (c) + x; N t+1 (c) = cp; S t+1^ = U t + l (c) ' else U t+1 (c) = S t (c) + x - R A (U^ q, x); N t+1 and RETURN; (c) = T) ; t+1 (c) = S.(c) + x; N +J ,(c) = S,^(c) - U. ..(c); t+1 t+1' t+r and RETURN; else (|s t (c)| = c) U +4 .n( c ) = U.(c) + x - R A (U + , q, x) t+1 t+1 u (y|y € N t (c), COUNT(y) > T) (c) == S, (c) + x - R fl (U, q, x); t A v V N t+l v (c) = S +J -(c) - U +j _ n (c); t+1 t+1' h9 and RETURN; else RETURN; [2] If j» , = PRE(x), x e N and x ^ S t (c) then if |S (c) < c then N t+1 (c) = N t ( c ) + x ; U t+1 (c) = U t (c); COUNT(x) - 0; S t+1 (c) = S t (c) + x; and RETURN; END PRE3A; We will say that an absolute page fault has occurred at time t+1 if r = x and x ^ T, (or any of the equivalent conditions). What this amounts to is that r = x is the first occurrence of x in w. PRE3A is different from PRE2A in that, only on the occurrence of an absolute page fault, all not- set-up pages are declared set-up, whereas in the latter on every page fault this is done. Theorem 3.^- ' PRE3A is a stack algorithm for {w|co e N + , P(w)}. Proof : See Appendix B for the proof. Theorem 3.^- allows us to construct an efficient one pass algorithm to obtain the paging performance of a given reference string w. Because of the nature of PRE3A, we would expect PRE2A to give better results than PRE3A. Intuitively, it seems that jt(PRE2A, c, to) < n(PRE3A, c, w), though we have not been able to prove this relation. 
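The bookkeeping that PRE2A and PRE3A share—a counter COUNT per not-set-up page, advanced on each reference, with a page promoted to the set-up state once COUNT reaches the page-wait time T—can be made concrete with the following sketch. It is a simplification, not either algorithm: the underlying demand algorithm A is taken to be LRU, every page fault (as in PRE2A) sets up all not-set-up pages, and the function name and data structures are ours.

# Simplified sketch of the not-set-up / set-up mechanism of PRE2A/PRE3A.
# U: set-up pages kept in LRU order; N: not-set-up pages with their COUNT.
# T is the page-wait time; the reference string mixes page numbers with
# ('PRE', x) entries.
def simulate_pre2_like(refs, c, T):
    U, N = [], {}                # U_t(c) as an LRU list, N_t(c) as {page: COUNT}
    faults = 0
    for r in refs:
        if isinstance(r, tuple) and r[0] == 'PRE':
            x = r[1]
            if x not in U and x not in N and len(U) + len(N) < c:
                N[x] = 0                         # reserve a frame, start COUNT
            continue
        x = r
        if x in U:                               # success on a set-up page
            U.remove(x); U.append(x)
            for y in N:
                N[y] += 1
            for y in [y for y in N if N[y] >= T]:
                del N[y]; U.append(y)            # promote matured pages
        else:                                    # page fault: x is not set up
            faults += 1
            for y in list(N):                    # the page wait sets everything up
                del N[y]; U.append(y)
            if x in U:
                U.remove(x)
            elif len(U) >= c:
                U.pop(0)                         # LRU replacement among set-up pages
            U.append(x)
    return faults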
However, results of applying PRE3A to a certain reference string does tell us, at least approximately, the performance of PRE2A. We note that for each of the algorithms PREiA we can create another algorithm, say FREEPREiA, which 50 incorporates freeing and thereby improve the paging performance. Clearly, FREEPRE3A Is a stack algorithm which follows from theorem 3.3. In both, PRE2A and PRE3A, whenever a page made a transition from not-set-up state to set-up state, we immediately consider it the same as any other page in U, . This means that a prepaged page can be replaced even before it is used at least once. This kind of replacement may lead to increased page traffic. When a PRE(x) has been inserted in the reference string, e ither by the programmer or by the compiler, we have reason to believe that page x will be used in the near future, so it might pay off to keep the page locked in MM until its first use. We now, divide the memory state S (c) into three disjoint sets: U (c) is the set of used pages (used at least once), P (c) is the set of prepaged, set-up but not yet used pages and N (c) is the set of not-set-up pages. We define PRE^A corresponding to PRE2A as follows : PREVIA : [1] If r = x e N then [a] if x e N (c) then (page fault) U t + l (c)=U t (c) + X P t+1 (c) = P t (c) + N t (c) - x N t+1 (c) = fi and RETURN; [b] if x € P, (c) then (success) Vy e n (c) COUNT(y) - COUM'(y) + 1; P t+l (c) = p t ( c ) - x U (yk 6 N t (c), CCOTT(y) > T}; U t + l (c ) - U t ( c ) + ^ N t+1 ( c ). = (y G N t (c) | COUNT(y) < T} ; t+1 and RETURN; 51 [c] if x e U, (c) then (success) V y e N (c), COUNT(y) *- COUNT(y) + 1; Vi (c) =u t (c); P t+1 (c) = P t (c) U {y\y e N t (c), COUNT(y) > T} ; N t+1 (c) = {y e N t (c) | COUNT(y) < T} ; and RETURN; [d] if x/ S (c) then (page fault) U t+l (c) = U t (c) + X ' V U f q ' x); P t+1 (c) = P t (c) U N t (c); N t+1 (c) = P; and RETURN; [2] If r = PRE(x) then if x j. S then if | S ( c ) | < c then N. +1 (c) = N (c) + x; COUNT(x) <- 0; P t + l (c) =P t (c)? U t + l (c) = V c) ' END PRE^+A; Quite clearly, PRE^A is not a stack algorithm. We can define another algorithm PRE5A which does a not-set-up to set-up conversion only on the occurrence of an absolute page fault. PRE5A will be proved to be a stack algorithm. PRE5A: [1] If r t+1 - x e N then [a] if x e N (c) then (a page fault) Vy e N t (c),C0UNT(y) «- COUNT(y) + 1; p t+i = p t u {y € N t' x I C0UWT (y) > T ^ U t + l =U t +x; N fc+1 = {y € N t -x I COUNT(y) < T) ; and RETURN; 52 [b] if x e P, (c) then (success) V y e N , COUNT(y) «- COUNT(y) + 1; P t+1 = P t -x U{y e N t | COUNT(y) > T} ; U t + 1 = U t + X ' N t+1 = {y e N t | COUNT(y) < T} ; and RETURN; [c] if x e U, then (success) Vy e N , COUNT (y) «- COUNT(y) + 1; u t + i ■ V p t+i = p t U{y € N t I C0UKT (y) > T ); N t+1 = {y € N t | COUNT(y) < T} ; and RETURN; [d] if x ft T, then (an absolute page fault) U t + 1 " U t + X " W *< X); P t + 1 " P t U V H t + 1 " * and RETURN; [e] if x ft S then (a page fault) Vy e N , COUNT(y) «- COUNT(y) + 1; VI = U t + X " V U t' *> X) ' p t+i = p t U{y e N t I C0UNT (y) - T )5 N t+l = fy e N t I C0UNT (y) < T}; and RETURN; [2] if r t+1 = PRE(x) and x ft S and I S^_ ( c ) | < c then (prepage the required page) Vi ■ u t 53 N. -, = N. + x; COUNT(x) «- 0; t+1 t P = P ; and RETURN; END PRE5A; Theorem 3.5 : PRE5A is a stack algorithm for (ue N |P(w)}. The proof is very similar to the proof of theorem 3.*+ so we omit it here. When faced with the choice between the use of PREljA and PRE5A, we recommend the use of FREljA. 
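The feature that distinguishes PRE4A and PRE5A from the earlier algorithms is the locking rule: a page that has been prepaged and set up but not yet used stays in P_t(c) and is never chosen for replacement. The rule in isolation can be sketched as follows (an illustration with names of our own choosing; LRU among the used pages is assumed, and the full algorithms of course do much more).

# The locking rule of PRE4A/PRE5A in isolation.  The memory state is split
# into used pages U (here in LRU order), prepaged-but-unused pages P and
# not-set-up pages N; replacement victims are drawn from U only.
def choose_victim(U_lru, P, N):
    """Return the page to replace, or None if every frame is locked."""
    assert not (set(U_lru) & P) and not (set(U_lru) & N)   # the three sets are disjoint
    return U_lru[0] if U_lru else None   # LRU among used pages; P and N are never victims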
However, for efficiency in performance mea- surement, one could use PRE5A. In the following theorem we prove that the performance of PRE^+A is no worse than the performance of PRE5A which explains the comments just made about these two algorithms. Theorem 3.6 : jt(PRE^+A, c, w) < it(PRE5A, c, w). Proof : We will first prove that V t > 1, S t (PREl+A, C, u>) = S (PRE5A, c, u>), U (PRE^A, c, o>) = U t (PRE5A, c, w), P t (PREl+A, c, u) =3 P t (PRE5A, c, w). From these, the required result follows since a page fault occurs when r, , = x e S, (c) but x ^ P, (c), x ^ U, (c). The proof is obtained by studying the inputs and outputs to the sets P, , U,, N, . We draw a set input-output diagram in which three circles are the sets P., N , U, . The arcs are the input and outputs to these sets. Labels to the arcs are of the form d/X where d is the step of the algorithm in which this arc is relevant and X is the set of pages to which this arc applies. 5k For PREUA, the diagram is For PRE5A, the diagram is: Note that the following relation between the steps of the two algorithms holds : PRE^A. (Id) = PRE5A. (id) U PRE5A. (le). and PREUA. (2) = PRE5A. (2). =» PREUA. S = PRE5A. S t V t > 1, Vc>2. Input to U, is strictly a function of go and output from U^ occurs only in steps l(d) and l(e) therefore by the above step relations, PREUA. U = PRE5A. U, V t > 1, Vc>2. 55 Then clearly, "by the arrows going from N to P we conclude that PREUa. P 2 PRE5A. P Vt>l, Vc>2, This gives the required result. □ Note that, freeing can be added to any PREdA d = 1, 2, 3, k, 5 and we obtain the corresponding FREEPREdA with a better performance, i.e., jr(PREdA) > jt( FREEPREdA). We can also prove that FREEPRE3A and FREEPRE5A are stack algorithms for (w £ H P( w )}. We can also show that it (FREEPREDA, c, co) < jt(FREEPRE5A, c, w) which means that, though FREEPREDA is recommended for use in practice, performance can be bounded by that of FREEPRE5A in lesser time. Note that each of these prepaging algorithms can be reduced to a demand prepaging algorithm by removing the mechanism of time -dependent set-up of a not- set-up page and allow set-up only on the occurrence of a page fault. To implement this in practice, we do the following: on the occurrence of a PRE(x) in w, a prepage bit is turned on in the page table entry associated with x. At the time of the next page fault, say, for a page y, we issue page pull commands for y and all the pages with prepage bit turned on (subject to the memory size constraint). We will label (FREE)PREdA, after the above transformation as (FREE)DPREdA. We note that, it(DPMIN, c, w) < jt(FREEDPREdA, c, w). We now have to show that these new paging algorithms do achieve something for us in practice. We will use FREELRU and FREEDPREULRU and study the behavior "of these two algorithms as compared to LRU, MIN and DPMIN. We will carry out our experiments on several matrix problems in Chapter 5 and report the results in that chapter. 56 3.2.2.1 Joseph's OBL Algorithm [J0SE70] So far, we have only discussed deterministic prepaging, i.e., the prepaging specified by the programmer or compiler. In the deterministic case, we assumed that the prediction is always correct. Prepaging schemes developed by Joseph [J0SE70] can be called probabilistic prepaging schemes. These schemes are implemented in the operating systems and does not need a prescan of the program. They are based on the observation that a program tends to refer to its pages sequentially. 
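The idea can be made concrete with a small sketch: on a fault for page x, the missing page is fetched and its sequential successor x ⊕ 1 is prepaged as well if it is absent. This is only the one-block-lookahead idea layered on LRU, written in Python with names of our own choosing; the precise automaton DPOBLLRU analysed below differs in the details of which pages it displaces.

# One-block-lookahead (OBL-style) prepaging on top of LRU, as a sketch.
# Pages are numbered 1..n and x (+) 1 denotes x + 1 mod n; c >= 2 is assumed.
def obl_lru(refs, c, n):
    stack = []                           # LRU order, most recently used last
    faults = pulls = 0

    def bring_in(page, make_mru):
        nonlocal pulls
        if page in stack:
            if make_mru:
                stack.remove(page); stack.append(page)
            return
        pulls += 1
        if len(stack) == c:
            stack.pop(0)                 # evict the LRU page
        if make_mru:
            stack.append(page)
        else:
            stack.insert(-1, page)       # the prepaged page enters below the faulting page

    for x in refs:
        if x in stack:
            stack.remove(x); stack.append(x)
        else:
            faults += 1
            bring_in(x, make_mru=True)
            bring_in(x % n + 1, make_mru=False)   # prepage the successor x (+) 1
    return faults, pulls

Counting faults and pulls separately makes the intended trade-off visible: successes on prepaged pages reduce the former without reducing the latter.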
Joseph has shown experimentally that such prepaging schemes do reduce page faults. We would like to show analytically that such is the case. Clearly, for such analytical work, we should have a program behavior model. We will define a sequential-random model of program behavior for this purpose. Before that, we will define a demand prepaging algorithm based on the OBL algorithm of Joseph. We will follow the approach of King [KING71] to analyze the behavior of this algorithm. DPOBLLRU = (V, Q, N, S Q , q Q , g), where V = {S|S c N, |S| = c} the set of memory states. S Q , the initial memory state, q Q , is the initial control state, Q is the set of control states q = (j,, j , ..., j ) where j £ N (V k e [1, c] ) and j. ^ j . for i / I (i.e., they are all distinct), g is the transition function, g: V XQXN-> V X Q. r (S, q') if x e S g(S, q, x) = / (S", q") if x ft S, x © 1 £ S ^ (S ,n , q m ) if x { S, x © 1 i S. 57 Where x © 1 denotes x + 1 mod n, q' = ($!> >>> J k _^ J k+1 > >-> 3 Q , x ) where ^ = x. S" = if x © 1 / j ± then S - J_ + x; else S - j p + x; q" = if x © 1 / J x then (jg, . . . , o q , x); eJ.se \d-,j J^> • ••* 0„^ x /j S ,M = S - j 1 - j 2 + x + (x © 1); q'" = (x © 1, J 3 , . .., j c , x); MD DPOBLLRU; From the algorithm, it is clear that given q = (j' , . .., j ) we can conclude that S = (j , . .., j }. Therefore, a configuration (S, q) of the algorithm is completely specified by specifying q. Therefore, the set of configurations is the set Q. Our first objective is to seek the number of states in Q. Definition: A state q e Q is said to have a property z (i.e., z(q)) if 3 x e N such that x and x © 1 both occur in q. Lemma 3.2 : z(q ) -* V q e Q, z(q). Proof : The proof is by induction. In the transition function g of DPOBLLRU, assume that z(q) and we will prove that z(q'), z(q") and z(q™ ). Observe that the property z is invariant under a permutation of its argument. Therefore, z(q) =>z(q') since q' is a permutation of q. Since x © 1 e S and from the definition of S", x © 1 € S". Also from the definition of S", x e S" therfore z(q"). From the definition of q'", we have z(q'" ). The result of the lemma immediately follows. □ 58 Lemma 3« 3 '• In I n v> (n-c)! n l Q l = P c _ (n-2c)!(n-c) where P denotes the c-permutations out of n objects, c Proof : Berge [BERG71, p. 31] has proved that the number of subsets of size c (from the set N of n elements), which do not contain either two consecutive integers or both 1 and n simultaneously, is given by ***.t \ / n_c \ P f*(n,c) = ( ) . v ' ' v c y n-c From Lemma 3.2 it follows that |q| = n P - c!f*(n, c). Substituting for f*(n, c), we get the required result. It has been mentioned that the set of configurations (Q,) of DPOBLLRU form a Markov Chain. Assume that the transition probability matrix P = (p , , ) for the Markov Chain is given. q ; q Lemma 3.^- s The Markov Chain of configurations of DPOBLLRU is irreducible provided Pr[r, =i] ^ 0, V i e N. Proof : It is sufficient to show that V q e Q, V q 1 e Q, 3 k > 0, such that p , > where p , represents the probability of a k-step transition from state q' to state q. Starting with an arbitrary state q f € Q we will show a finite sequence of transitions to an arbitrary state q e Q and thus we will prove the lemma. Let q = (i,, ..., i ) and q r = (j^ ..., j ) and q / q'. By applying Lemma 3.2 to q, we get, 3 i e N such that both i and i © 1 occur in q. Let j„.= i, 1<^ $,. We present a sequence 59 of states starting from q' and leading to q and strictly following the transition function g. 
Clearly, there is an associated reference string, u) , such that g*(q', w ) = q. Note here that g* is an extension of g and that we have omitted the memory state S, since it is completely- specified by the control state. Now, since we give a transition from q' to q in a finite number of steps, oo is a finite string. And since by our assumption p(co ) > 0, we get the required result. [Note that in the following, an asterisk means unspecified or irrelevant page number.] q T -* q-L = {*, *>3 1 ) x = j q-j_ - Qg = (*, *,3 1} J 2 ) x = J, H-2 ^£ q i-l ~ (*"--*>Ji>J2' ,, * t ^-l) q i-l "" q i = (*>•••*> Ji> J £■ 2 ,,,,J "i-l ,J ' / g+l x 'i-1 x = J 4+1 q s-3 _> q s-2 ~ (*"-*>Ji>***'^-l'^+l' , * ,J 's-l' ) L s-2 q s -l = (**..** J^... Jj.!^ i+r'^s-i^s+i- x = J x = J s-1 s+1 ^-2 q G-2 ~~ ^*'*' 5 l" , ^i-.l' J ^.l' ,,,; 's-l'' J B+l ,,,, « J c' q c-i = W b >3 1 9-">3i_ 1 j3 £-l' d £+l' • ,,d s-l' J s+l > • • • j' c > jjjj X x = J Thus with a reference string, 3 1' ' J i-1' d 4+l' *•" d s-l' d s+l> **' d c' we have reached the state q . Now with the following reference string, u 3 = a *l' "" V d i+l' "•' J s' d s+l • • • j J > 6o we will reach the state q. Note that all the intermediate states, while going from q to q, are permutations of q. Thus U) = U u is the required reference string. Clearly, to is finite and therefore we have the required result. □ We define the sequential-random model of program behavior as follows : Pr[r t+1 =j | r t =l] = v ± . is given by: p io P-L if i = 3, p 2 if j = i © 1, P otherwise We require that p. > Pp > p„ > and P, + P p + (n-2) p = 1. We note that this model is closer to the behavior of real programs as compared to the independent reference model. Conjecture 3.1: m(DP0BLLRU, c) < m(LRU, c) for the sequential-random model of program behavior. Where m(A, c) indicates the long term average page fault probability working with algorithm A and with c page frames of MM. Unfortunately, we have not been able to prove this theorem. However, we can prove that m(DP0BLLRU, 2) < m(LRU, 2). 3.3 PWS Algorithm We assume that the program's reference string is fully known in advance. With this assumption, we will propose a new paging algorithm, PWS which will incur zero page faults, on the average. We will also derive the expressions for page pull probability and the average number 61 of memory page frames required and compare the results with WS algorithm of Denning [DENN68a] . Assume that w = r, , . .., r is the given reference string. Define the 'futuristic' working-set W*(t, T) by: W*(t, X) = {x e N | at', t < t' < t +T , r., = x). In other words, W* is the contents of a window of size T , looking forward. The purpose of algorithm PWS is to keep W* in MM. At any time t, if r, T e W*(t, 7 ) then nothing needs to be done; otherwise, we issue instructions to fetch this page and thereby prepage it. We also determine whether r, , e W*(t, T ) and if not, we either push it from MM or declare it dead. If it is assumed that T > T, where T is the page -fetch time, then since a fetch instruction for a page is issued T time units in advance of its use, we will have no page faults whatsoever. In other words, we have m(l¥S, T ) = if 1 > T. We will now study properties of this model. From the above definitions the following properties can be proved. (We denote |w*(t,T )| by u*(t,CT).) PI: a) W*(t, 0) = jZ5 which implies w*(t, 0) = b) W*(L, J) - {r L }, V T > 0. c) w*(t, J ) is a monotonically non-decreasing function of J~ . d) tjj*(t, J) is concave downwards. 
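Before verifying these properties, the definition can be made concrete. The following is a minimal sketch (ours, with the reference string held in a Python list, the indexing adjusted so that r_1 is the first element) of W*(t, T) and of the average size s*(T); the concavity claim in (d) is taken up next.

# Sketch of the futuristic working set W*(t, T) and its average size s*(T).
# refs holds the reference string r_1 ... r_L; Python indexing is 0-based.
def w_star(refs, t, T):
    """Pages referenced in the forward window: W*(t, T) = {r_t, ..., r_{t+T-1}}."""
    return set(refs[t - 1 : t - 1 + T])

def s_star(refs, T):
    """Average futuristic working-set size s*(T), averaging w*(t, T) over t = 1 .. L."""
    L = len(refs)
    return sum(len(w_star(refs, t, T)) for t in range(1, L + 1)) / L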
To see this, note that, W*(t, 2-7) = W*(t,0" ) U W*(t+ T, J). Therefore, . w*(t, 2'J) < w*(t, J ) + u»(t+CT, j), since the working sets may overlap. 62 Assuming statistical regularity, w*(t+ 7, 1 ) behaves, on the average, like w*(t, 7) therefore, u*(t, 27) < 2w*(t, T). Hence we have the concavity property, e) If t < L - 1 then W*(t, 7") = {x e N | x = r, ,, t < t * < L} . Let us denote 1 k k k t=l And let the average working set size s*(T) = s*(T). The page pulling rate m*(T) measures the number of pages per unit time returning to the working set (note that, m*(T) will be an upper bound, since a page which has left W*(t, X ) may still be in MM). Let A(t, 7 ) = if t+7 i W*(t,T) J otherwise. (t > l) Note that, A(t, j) = V t > L - 7 . Let m* = A(0, T) = {r^ ..., r^ } The page pulling rate is given by: m*(T) = Lim. if k-*L k-1 Z A(t,T) + m* Lt=l The interference distribution F. (x) and the interference density function f.(x) as well as the relative frequency of reference X. have the same meaning as in Denning' s work which is outlined in Chapter 2. Let the mean n overall reference interval x = L X. x. , x. = Z x f. (x). The page i is x=l x>0 called 'recurrent' if \. ^ and j denotes the number of recurrent pages in N. If \. f then x. = i/\ . i i ' i 63 P2 : s*(T) is nondecreasing in 7 and is bounded above and below: 1 = s*(l) < s*(T) < s*(T+l) < min.{n, 7 +1). Proof : P(t,j)cP(t,T+l)4 w*(t, 7) < w*(t, 7+1). => s*(T) < s*(T+l). Also, w*(t, 1) = 1=» s*(l) - 1. w*(t, 7+1) < min. {n, J +1} =$ s*(j +l) < min. {n, 7 +1) . m* P3: s*(7+l) - s*(7) = m*(T) - -£ . m Note that -=- is a very small number compared to m*(7) and can be ignored for most purposes, particularly if L is very large. Proof : W*(t, 7+1) - W*(t, J) E ( r t+ T } * Therefore, u*(t, 7+1) - w*(t, 7) = A(t, 7 ). Therefore, Lt k-+L • k ± k 1 - Z u*(t, T+l) - - Z w*(t, 7) . t=l k t=l = Lt rr I A(t, 7 ). k->L t=l Therefore, s*(7+l) - s*(7) = Lt k->L r\ k-1 t- Z A(t, 7 ) + m* t=l m* + A(L,7) _ ^0 L L * m o m*(T). D 6k y PU: -£ < m*(7+l) < ™*(t) < m*(0) = 1. which states that m*( ) is a nonincreasing function of T and is bounded above and below. This follows immediately from P2 and P3. P5: m*(T) = 1 - F(T) + -j% which states that m*(T) can be regarded as the probability that x > "J . Proof: Define the binary variable B. (t, x) = 1 iff r =i ; ^uL 1 v ' t=l k-T Z ft. (t, x) 1 k-T n Lim. - Z Z 6. (t, x) k-»L t=l 1=1 1 since n. (k)-l 1 ~ n.(k) n Define e(t, x) = Z 6. (t, x) and observe that fi(t, x) = 1 - A(t, x), i=l k-T Therefore, 1 - F(x)=Lim. - Z A(t, x) k-*L t=l = m*(x) - m */L. This proves P5. P6: m*(T+l) - m*(T) = - f(T+l). This follows immediately from P5. T-l T-l m o. P7: s*(T) = 2 m*(z) = Z (l - F(z) + -^) z=0 z=0 L T-l = Z Z f(y) + m^ n T/L. z=0 y T, b) m*(lWS, T ) 2 m(WS, T ), c) s^(PWS, ? ) ~ s(WS, T ), d) ST(PWS, T ) « ST(WS, J ). Proof : Part a of the theorem has been already proved when we introduced the algorithm PWS. Part b follows from the property P5 of PWS and the property PU of WS, given in section 2. k- Part c follows from the property P7 of PWS and the property P6 of WS, given in section 2. k» Finally, the last part can be shown as follows : ST(PWS, 7 ) = 2 w*(x) *(ot*) where St* = average execution burst plus the average page wait time plus the average ready queue delay. w*(*7") is the average working set size during the interval St*. Similarly, ST(WS, *T ) = Z w(T) * t. Now since approximately w*(T ) ~ co(t ) and 6t* « 6t since we have no page faults in PWS,St*~ average execution burst. 
Therefore, we have the required result. □ Implication of Theorem 3.8 is that we have defined a paging algorithm with zero. page faults which pulls the same number of pages as the WS algorithm and which requires the same average working set size 66 as WS. We have also proved that the memory utilization (ST product) of EWS is highly superior to that of WS. The main drawback, however, is that IWS is unrealizable. If partial information is available about future working sets then we may be able to utilize it to improve performance over WS and at least approach the performance of PWS. We feel that further investigations should be made in this direction. We will add some comments to this topic in Chapter 6. We have ignored several points in the above discussion of PWS which we now discuss. Define the residency of a page in MM as the fraction of time it is potentially available in MM. We assume that a page in the working set is never replaced. Once a page has entered W*(t, T ), it will remain in MM for at least T time units. Let x be the interreference interval for the page i. We have: 1) If x < J , the page residency is IQQPjo. 2) If 7" < x then the following diagram gives us the page residency. x -> t-T t+T t+x 67 If during the last residency, the page was updated, then the above diagram requires that t+x-y>t+T in order for the page to be released at time t. Which means that if x > T +7 then only the page i can leave memory. Otherwise (i.e., T < x < T +T ) then if the page was allowed to leave memory and called again after x time units then the contents of page i will be the old contents of page i (i.e., before the update during the last residency). Therefore, in such a case, the page must be kept in MM and it will always be resident. This means that more than uj*(t, T) pages will occupy MM. To implement this in practice, we assume that the reference string w = r,, r , ..., r is such that r, = (y, k) where y e N is the current page referenced and k is the time to next reference to y. The value of k will be zero if page y if not referenced again. The modified PWS, therefore, can be written as follows : MEWS: (assume y > T) Vt >1 let r twl = (y, k) and r t+/J = (y', k'). 1) If k = or k> T + T then FREE(y) 2) If y*£ W*(t, 7 ) then PREPAGE (y f ). EHD MPWS; 68 k. IMPROVING LOCALITY OF ARRAY PROGRAMS k.O Introduction Locality of programs can "be improved in various ways, as noted in Chapter 2. By proper organization of data among the data pages, we can reduce the size of the localities as well as the probability of an interlocality transition. This has been termed the problem of data pagination. For matrix algorithms operating on large arrays, we need to consider various techniques for storing them, such as, row major order, column major order, packed row storage, packed column storage, or submatrix (block) storage. Several investigations have indicated that submatrix storage is preferable to other storage schemes [MCKE69, ROGE73]. We shall restrict our investigation to the submatrix type of organization. For large matrix problems, it should be clear that instruction paging is dominated by data paging; for this reason we ignore instruction paging. With respect to improving locality, it is possible, in certain instances, to resequence some of the operations in a program so that we operate on the data that is already resident in MM rather than execute unnecessary paging. This amounts to a logical reorganization of the code as against pagination, which does a physical reorganization. 
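To fix ideas about the 'physical' side, the pagination of an array is nothing but the map from element indices to page numbers. The sketch below (ours; the quantities m, N and the column-major ordering of submatrices anticipate the notation introduced in the following paragraphs and used again in section 4.4) contrasts the conventional row-major pagination with the submatrix pagination adopted in this thesis. The transformations of this chapter leave such a map fixed and instead change the order in which the program touches the pages.

# Two paginations of an n x n matrix A, as maps from a 1-based index pair
# (i, j) to a page number.  p is the page size in elements, assumed to be a
# perfect square; m is the submatrix order and N = n / m.
import math

def page_row_major(i, j, n, p):
    offset = (i - 1) * n + (j - 1)              # linear position of A(i, j)
    return offset // p + 1

def page_submatrix(i, j, n, p):
    m = math.isqrt(p)                           # submatrix order (m * m = p)
    N = n // m                                  # submatrices per dimension
    I, J = math.ceil(i / m), math.ceil(j / m)   # submatrix containing A(i, j)
    return N * (J - 1) + I                      # submatrices stored in column-major order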
In this chapter, we will include a discussion of some of these methods of logical reorganization for matrix programs. We will assume that these 69 modifications are done by the programmer. The problem of automating some of these methods will be considered in Chapter 6. Since we are interested in improving locality, we are also interested in measuring these improvements. We feel that the average working set size s(*X) as a function of T is a very good indication of the locality of a program. We also measure the number of page faults using LRU, WS and MIN paging algorithms. We will present three techniques for the improvement of locality of matrix programs. We will illustrate their use by applying them on several common matrix algorithms. We will show that an order of magnitude improvement in locality can be obtained by these methods. We now consider the notation for storing matrices in VA space. Let A be a rectangular matrix of size (n,Xn p ). The matrix will be divided into square submatrices of order m as shown in Figure k.l. If 2 we assume that p, the page size, is a perfect square and that m = p then each of the submatrices can be stored in one page. Note that there will be some submatrices On the right and the bottom borders which will not be full. We will assume that n, and n p are both multiples of m and therefore, such fragmentation will not occur. If the matrix A is a square matrix then n = n p = n. We define the integers N, N, and N p through the equations n = N.m, n, = E, «m and n p = N p «m. Therefore, the number of data pages occupied by a n Xn matrix is N «N , by a nXn matrix is N and by a triangular matrix of order n is N(N+l)/2. *+. 1 Cholesky Decomposition We start with a particular matrix algorithm, namely, the Cholesky factorization of a symmetric, positive definite matrix. We will try to 70 — m — m i An A 12 Ain 2 n A21 A22 A2N2 l AnxI A Nl 2 A Nl N 2 n 2 Figure k* 1. Submatrix Storage 71 improve the locality of this algorithm in various ways. This will, then, lead us to general methods of locality improvement. We will try these methods on same other typical matrix algorithms. Cholesky decomposition factors a symmetric positive definite matrix A into the product GXG' where G is a lower triangular matrix and G 1 denotes the transpose of G [WILK71] . The algorithm is defined by the following relations : k-1 2 §kk = sqrt(a kk " ;= g ij } for 1 < k < n ik = (a ik " E 1 g jk SikVSkk' .1=1 Where A = fa. .1 is an nXn matrix, and G = fg. .1 is a lower triangular matrix of order n. We recall that the basic algorithm in PL/ I like language is as in Figure k.2. Note that, we operate only on the lower triangular part of A and we overwrite G onto A. We have omitted all declarations, input, output and such other statements from the program. Note that, for our measurements, we will use a matrix (A) of order n = 2k, a page size p = 16 (=> m = k). We note that the first inner loop, th 'DO J = 1 TO k - 1', scans the k row of the matrix A. This is done with each traversal of the major loop (with a different value of k). Since several (in fact, m) rows are stored across in a row of pages, it is clear that the locality will be improved if we alternately traverse the rows in opposite directions. We observe that these traversals are possible because the order of computation within the loop is immaterial. By appropriately reversing the other two loops, we will improve the locality in a similar way. The resulting program is labelled CDR, which is given in Figure k.3. 
72 CD: /* CHOLESKY DECOMPOSITION */ DO k = 1 TO n; S - 0; DO J = 1 TO k - 1; S = S + A(k, J) * A(k, j); END; A(k, k) = SOJRT(A(k, k) - S); DO I = 1 TO n - k; S = 0; DO J = 1 TO k - 1; S = S + A(k+I, J) * A(k, J); END; A(k+I, k) = (A(k+I, k) - S)/A(k, k); END; END CD; Figure 1^.2. Cholesky Decomposition 73 CDR: DO k = 1 TO n; S = 0; IF MOD(k, 2) = 1 THEN DO; JLL = 1; JHL = k - 1; JSTEP = 1; END; ELSE DO; JLL = k - 1; JHL = 1; JSTEP = -1; END; DO J = JLL TO JHL BY JSTEP; S - S + A(k, J) * A(k, J); END; A(k, k) = SQRT(A(k, k) - S); IF MOD(k, 2) - 1 THEN DO; IHL = n - k; ILL = 1; ISTEP = 1; END; ELSE DO; IHL * 1; ILL = n - k; ISTEP = -1; END; DO I = ILL TO IHL BY ISTEP; IF MOD (I, 2) = 1 THEN DO; JLL = 1; JHL = k - 1; JSTEP = 1; END; ELSE DO; JLL = k - 1; JHL = 1; JSTEP - -1; END; S = 0; DO J = JLL TO JHL BY JSTEP; S = S + A(k+I, J) * A(k, J); END; A(k+I, k) - (A(k+I, k) - S)/A(k, k); END CDR: Figure I4..3. Cholesky Decomposition with Reversal Ik In Figure h.k, we have plotted the average working set size s(0") as a function of the window width 1 for CD and CDR. The improvement of locality due to the reversal technique is clearly shown. The locality of a program, such as CD, may he improved more significantly if we extend the algebra of the language to include submatrix operations. We sketch the program using the 0L./2 language [PHIL72]. The names of all OL/2 programs will be prefixed by the letter '0'. In Figure h-5, we give the program OCD. The order of the element by element computation of (C - M X R') is unspecified in OCD. Since a matrix-vector multiplication is specified by H x R 1 , we have the choice of implementing the multiplication by a submatrix multiplication and thereby improve the locality. We will briefly explain what we mean by submatrix multiplication. Assume we are carrying out Z = X * Y. Where X, Y and Z are matrices of order n. We shall denote the (I, j) submatrix of the matrix X by the usual subscript notation, e.g., X . Multiplication of X and Y can, therefore, be written as follows : N \ = T X * Y IJ ,\ A IK KJ k=l 1 < I, J < N. Note that, the sum and product in the above equation denotes the matrix sum and matrix product for matrices of order m. We apply this submatrix operation to modify CD and obtain the results shown in Figure h.7 for the program CDM in Figure k.6. We have plotted, in Figure k.7, s(t) vs. J for CD and CDM. We note that the improvement obtained by this method is much more than that obtained by the method of loop reversal. 75 CDR Figure k>k* Localities of CD and CDR 76 OCD: PROC(A, n); LET A BE A LOWER TRIANGULAR MATRIX OF ORDER (n); TOR K = 1, 2, ..., n; PARTITION A AFTER ROWS K - 1, K; SET R = A <2,1> ROW VECTOR, M = A <3, 1> , D = SQRT(D - (R*, R'))j C = (C - M X R' )/D; END OCD; D = A <2,2> SCALAR, • C = A <3,2> COLUMN VECTOR: K-l K Figure k. 
5- OL/2 Program for Cholesky Decomposition 77 CDM: FM = FLOAT (M); DO K = 1 TO n; KIM = (K-l)/M; KCM = CEIL(K/FM); S = 0; DO J = 1 TO K - 1; S = S + A(K, J) * A(K, J); END; A(K, K) - SQ£T(A(K, K) - S); DO J = 1 TO K1FM; DO I = K+l TO KCM * M; S = 0; DO Jl = 1 TO M; S = S + A(I, (J-l) * M + Jl) * A(K, (J-1) * M + Jl); END; A(I, K) = A(l, K) - S; END; END; DO I = K+l TO KCM * M; S = 0; DO J = K1FM * M + 1 TO K - 1; S = S + A(I, J) * A(K, J); END; A(l, K) = (A(I, K) - S)/A(K, K); END; DO I = KCM + 1 TO N; DO J = 1 TO K1FM; DO II = 1 TO M; S = 0; DO Jl = 1 TO M; S = S + A((l-1) * M + II, (J-1) * M + Jl) * A(K, (J-1) * M + Jl); END; A((I-1) * M + II, K) = A((l-1) * M + II, K) - S; END; END; DO II = 1 TO M; S = 0; DO J = K1EM * M + 1 TO K - 1; S = S + A((I-1) * M + II, J) * A(K, J); END; A((I-1) * M + II, K) = (A((I-1) * M + II, K) - S)/A(K, K); END CDM; Figure 1^.6. Cholesky Decomposition with Submatrix Multiplication 78 Figure If. 7. Localities of CD and CDM 79 We now turn to an even better technique for improving the locality. This requires the construction of a submatrix algorithm and we illustrate it for the Cholesky factorization [H0US6^] . G n = CHDLDEC <*n " %. (G u X ^r )} for 1 < I < N. for I < J < N We now present the OL/2 program for this algorithm. A PL/l program, CDS, is given in Appendix C. OCDS: PROC; FOR K = 1, 2, . .., N; PARTITION A AFTER ROWS m * (K-l), m * K; SET R = A <2,1> , D = A <2,2>, C = A <3,2> , M = A <3,1>; D = D - R X R« CALL OCD (D, m); C = (C - M X R') X (D*)" 5 EEL OCDS; Note that, D is a lower triangular matrix of order m, R is a rectangular matrix of size m X m(k-l). Similarly, C is (N-m)k X m and M is (N-m) k X m(k-l) matrices respectively. To carry out the Cholesky factorization of D, only one page is involved, and therefore, any method can be used without changing the performance. Therefore, we use OCD for simplicity. Based on OCDS, we write CDS given in Appendix C. We carry out the multiplication M X R' by submatrices. We store the inverse of D' in the strictly upper triangular part of D. This way, we will not be able to store the diagonal elements of (D')~ . Note that, this storage scheme is possible because the page containing the subarray D has 80 precisely m(m-l)/2 words vacant. Also note that, the diagonal elements of (D* ) are the reciprocals of the corresponding diagonal elements of D. In Figure 1+.8, we have plotted s(t) vs. T for CD and CDS. The large improvement in locality is quite obvious. We note that the reversal technique can be applied to both, CDM and CDS obtaining CDMR and CDSR respectively. Improvements thus gained are not significant enough to report. In Figure 1^.9, we have plotted s(T) vs. T for CD, CDR, CDM and CDS. It can be easily seen that the versions of Cholesky factorization in order of increasingly better locality are CD, CDR, CDM and CDS. In Figure In 10, we have plotted the number of page fault, jt(t) vs. 7 for CD, CDR, CDM and CDS using WS paging algorithm. In Figure k.H and Figure k.12, we have plotted it(c) vs. c (the page allotment) for CD, CDR, CDM and CDS using LRU and MIN paging algorithms respectively. It is clear from these figures that, improved locality implies a reduction in page faults. k-2 Other Matrix Algorithms From the example of Cholesky decomposition, we observe that there are' three general methods of locality improvement of matrix algorithms. The first method is the method of loop reversal in which we change the direction of loop traversal each time we traverse a loop. 
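A direct transcription of these relations may make them easier to follow. The sketch below (ours, in Python with 0-based indexing) computes L, including its diagonal, and the strictly upper part of the unit upper-triangular U in place over A, which is how the text stores them in OLU.

# Crout LU decomposition, transcribing the relations above.  A is a square
# matrix given as a list of rows; L and U overwrite A, with u_kk = 1 implicit.
def crout_lu(A):
    n = len(A)
    for k in range(n):
        for i in range(k, n):                   # column k of L
            A[i][k] -= sum(A[i][j] * A[j][k] for j in range(k))
        for j in range(k + 1, n):               # row k of U
            A[k][j] = (A[k][j] - sum(A[k][i] * A[i][j] for i in range(k))) / A[k][k]
    return A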
We note that, depending on the computation within the loop, this may not always be possible. The second method is to carry out all matrix multiplications occurring within the algorithm as submatrix multiplications. The third method consists in using a submatrix algorithm from the beginning. We 81 15 10 15 20 T Figure lj.,8. Localities of CD and CDS 5 -- 82 1 6 — I 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1- 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 f Figure k.9. Localities of CD, CDR, CDM and CDS LOG 2 (tt(t)) PAGING ALGORITHM - WS n=24,m = 4, N = 6 Figure l^.. 10. Page Faults Using WS Algorithm 81+ ! L0G 2 (7T(C)) 15 -- PAGING ALGORITHM -LRU n = 24 m = 4 N = 6 20 21 Figure If. 11. Page Faults Using LRU Algorithm 85 LOG (tt(c)) 15-- PAGING ALGORITHM -MIN n = 24 m = 4 N = 6 13-- 10-- 1 — I — | — I — I — I — I — — I — I — I — I — I — i — I — I — I — — K 10 15 20 Figure q.,12. Page Faults Using MIN Algorithm 86 note that, for several matrix algorithms, their submatrix versions already exist in the literature. However, for eigenvalue problems, such submatrix algorithms do not exist and probably will not be found in the future. We believe that the three methods can be usefully applied to most matrix algorithms. We will consider several more algorithms in this chapter. We note that, the conclusions regarding the performance improvements by these methods hold for any matrix size and any page size, although we have considered only one matrix size and one page size. If. 2.1 Matrix Multiplication Three versions of matrix multiplication we consider are : MM (basic matrix multiplication), MR (matrix multiplication with loop reversal, and MMS (submatrix multiplication). In Figure U.13, we plot s( T) vs. *J for these three programs. k-2.2 LU Decomposition Given a nonsingular matrix A, it can be uniquely decomposed into L x U where L is a lower triangular matrix and U is unit upper triangular. The method is known as Crout EQ decomposition, and can be written as [ISAA66] : for k as 1 to n do; for i = k to n do: k-1 I.. = d.. - z I. . u.. ; ik ik ij jk 5 end; for j = k+1 to n do; k-1 \j = Ga kj ".f/kiV/^; end; end: 87 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 30 T Figure 1+.13. Localities of MM, MMR and MMS 88 A straightforward implementation of this will be called IU. Appropriate loop reversal turns it into IUR. To obtain LUM, we must identify matrix multiplications within the algorithm. Towards this end, we write the algorithm in OL/2 and call it OLU. Note that, we store L, U over A. OLU: PROC(A, n); LET A BE A MATRIX OF ORDER (n); FOR K = 1, 2, ..., n; PARRITION A AFTER ROW K-l AND AFTER COLUMNS K-l and K; SET C = A <2,1>, B = A <1, 3>, X = A <1,2>, Y = A <2,2>, M = A <2,3>; PARTITION C, Y, M AFTER ROW 1; SET R = C <1>, D = Y <1>, Z = M <1>; Y = Y - C * X; Z = (Z - R * B)/D; END OLU; K-l K B K-l K R D -M- 89 From OHJ, we see that C * X is a matrix-vector product and R * B is a vector-matrix product. Coding these products as submatrix products (as in CDM) we obtain HJM (we omit the detailed coding here). Submatrix LU decomposition algorithm can be written as [ISAA66] for K = 1 to N do; ( L KK> V - 1U - DEC V " y, L KJ V' J— 1 for I = K + 1 to N do; L IK - (A IK " . L IJ V * 'V^ J — X end; for J = K + 1 to N do; U KJ - ^Kk'" 1 * (A KJ " £ L KI V; end K; This can easily be programmed into OL/2 as well as PL/ I, obtaining OLUS and LUS respectively. In both cases, we store L and U on the original storage of A. While programming LUS, we store the inverses of L^. 
and U on a scratch page (only one scratch page is needed). Once again, we do not give the detailed codes for these. In Figure k.lk, we have plotted s(T) vs. y for LU, LUR, LUM and LUS. We have used n = 2k ( = order of the matrix A), page size p = 16 (implies m = k). Thus the number of pages occupied by A is 6 X 6 = 36 (in general, IT). U.2.3 Gaussian Elimination Gaussian elimination reduces a nonsingular matrix A into an upper triangular matrix. Gaussian elimination with coefficient storage can be written as [MCKE69] : 90 1 3 5 7 9 11 13 17 21 25 29 33 37 T Figure i4_.Hi-. Localities of LU, LUR, LUM and LUS 91 for k = 1 to n - 1 do; for i = k + 1 to n do; J\\ a ik ~ a ik' °kk' end; for j = k + 1 to n do; for i = k + 1 to n do; a. . s a. . - a., * a, . . ij ij ik kj > end; end; end; A straightforward implementation of this algorithm is called GOS (we omit the coding here). From GOS with loop reversal, we easily obtain GOSR (code omitted here). To see the matrix multiplications within GOS, we write it in OL/2 and call it OGOS. OGOS: PROC(A, n); LET A BE A MATRIX OF ORDER (n); FOR K - 1, 2, . .., n-1; PARTITION A AFTER ROWS K-l, K AND AFTER COLUMNS K-l, K; SET D = A <£,2> SCALAR, R = A <2,3> ROW VECTOR, C = A <3,2> COHJMNVECTOR, M = A <3,3>; C = C/D; M = (M - C * R); END OGOS; K-l K • K-l K A <1,1> A <1,2> A <1, 3> A <2,1> D R A <3,1> M 92 By programming M=M-C*Rby submatrices, we obtain GOSM (code omitted here). Submatrix Gaussian elimination with coefficient storage can be written as [HOUS64, MCKE69] : for K = 1 to N do; A^ gos(a kk , m)j for J = 1 to m do; for i = K*m+1 to n do; a i, (K-l)*m+J = a i, (K-l)*m+j/ a (K-l)*m+J, (K-l)*m+J; end; for i = K*m+1 to n do; for j = (K-l)*m+J+l to (K-l)*ntf-m; a ij = a ij " a i,(K-l)*m+J * a (K-l)*m+J,j\ end; end: for i = (K-l)*m+J+l to K*m do; for j = K*m+1 to n do; a ij = a ij " a i,(K-l)*m+J * a (K-l)*m+J,j> end; end; end; for I = K + 1 TO N do; for Jl = K + 1 TO N do; A = A - A * A I,J1 I,J1 IK K,J1 end: end; end; Based on this algorithm, we can write OGCSS and GOSS (code omitted here). We note that both, GOSM and GOSS can be modified by the loop reversal technique. McKeller et al [MCKE69] present a submatrix Gaussian elimination algorithm which, in our terminology, is GOSSR. In Figure k m l$, we have plotted s( J) vs. J for GOS, GOSR, GOSM and GOSS. Once again, behavior similar to Cholesky factorization is observed. S(T) 93 n = 24, m = 4, N=6 GOS < GOSR 13 17 21 25 29 33 36 Figure U.15- Localities of GOS, GOSR, GOSM and GOSS 91+ l*.2.i| Gram- Schmidt Orthogonalization Given a generally tall (n X n„) matrix A of rank n ? (n ? < n ), its decomposition into A = B X R is known as the Gram-Schmidt orthogonalization method. Here B is an (n, X n ) matrix with orthonormalized columns and R is a nonsingular upper triangular matrix of order ru [SCHW73]. It can be written as: for k = 1 to n do; for j = 1 to k - 1 do; n l r.. = Z b. . * a., j 3 i=l 1J lk for i = 1 to n do: b.. = a., -p.. * b. .; lk lk jk \y end; end; r n = V Z b . , , kk . , lk i=l for i = 1 to n, do; b ik = b ik/ r kk' end; end; A straightforward implementation of this algorithm is called ORTH (omitted here). We store B over A, thus destroying the original elements of A. A reversal technique applied to ORTH turns it into ORTHR. To identify the matrix multiplications within ORTH, we write it in OL/2 and call it OORTH. 
OORTH: PROC(A, R, n , n ); LET A BE A n_ X n^ MATRIX AND LET R BE AH UPPER TRIANGULAR MATRIX OF ORDER (n ); FOR K = 1, 2, ..., n 2 ; PARTITION A, R AFTER COLUMNS K-l, K; SET Q = A <1>; AK = A <2>, RK = R <1,2>, D = R <2,2>; 95 EK = Q* * AK; AK - AK - Q * RK; D = (AK, AK); AK = AK/D; END OORTH; A K-l K Q ■ AK If we code the statements RK = Q' * AK and AK = AK - Q * RK by submatrices, we obtain ORTHM. Submatrix version of ORTH is [SCH073] n n written in OL/2, and is called OORTHS. Note that, N = — and N-, = — . 1 ' 2 m 1 m OORTHS : PROC, LET A BE n X n 2 MATRIX AND R BE AN UPPER TRIANGUIAR MATRIX OF ORDER (n ); FOR K - 1, 2, ..., N 2 ; PARTITION A, R AFTER COLUMNS (K-l) * m, K * m; SET Q = A <2>, D = R <2,2>, M = R <£,3>, G = A <3>; CALL OORTH (Q, D, n , m); M = Q' * G; G = G - Q, * M; END OORTH; 96 (K-l)*m (K-l)*m This can be easily programmed in PL/ I and is called ORTHS. In Figure If. 16, we have plotted s(T) vs. T for ORTH, ORTHR, ORTHM and ORTHS. Once again, reversal technique can be applied to ORTHM and ORTHS but the improvement thus gained is not significant. k» 3 Conclusion In all four algorithms considered so far, we have seen that three methods of locality improvement work very well. We have seen that a submatrix version of an algorithm improves the locality to the greatest extent. The submatrix multiplication technique for all internal multiplications is the second best technique for improvement. Reversal technique also improves locality but by a much smaller degree than the other two techniques. We also notice that, locality improving methods cannot reduce page faults for very large core allotments. The reason is that, with a large number of pages allotted, the working set of an 97 sir) 8 -- 7-- 6-- 5-- 4 -- 3 -- 2 -- fix s 24, n 2 = 16 N! * 6 , N 2 = 4 ORTHR ORTHM 13 17 21 25 29 33 37 39 Figure 1|. 16. Localities of ORTH, ORTHR, ORTHM and ORTHS 98 ill-structured (poor locality) program can also be kept in MM. We have observed that locality improvement methods are effective when c < ^(N). Where (q (N) denotes kN for some constant k. Thus the asymptotic value of page faults is unaffected by locality improvements. It is also clear that the asymptotic value of page faults cannot be changed by changing paging algorithms if we restrict ourselves to demand paging. [There is no replacement required at high core allotments.] In Chapter 5> we will see that the asymptotic value of the number of page faults can be reduced by devising prepaging algorithms. At this point, we are in a position to compare the paging performance of the three decomposition algorithms used in solution of linear systems [Cholesky decomposition, LU decomposition and Gaussian elimination]. In Figure h.YJ, we have plotted u(c) vs. c for CD, HJ and COS using LRU paging algorithm. The superiority of Cholesky decomposition is very clear. But, as is well known, it can only be used when the matrix A is symmetric and positive definite. It is also known that this algorithm has minimum operation count and has superior stability property among these three decomposition algorithms [ISAA66] . When the given matrix A is not symmetric or is not positive definite and if we have to choose between HJ decomposition and Gaussian elimination, BJ decomposition is preferable. Also, LU decomposition incurs lesser number of operations and is also more stable [ISAA66], from Figure ^-.17> it is seen that LU decomposition is generally superior to Gaussian elimination in number of page faults also. 99 L0G 2 (7T(c)) PAGING ALG. 
Figure 4.17. Page Faults for CD, LU and GOS

4.4 A Note About Measurement of Performance

We have combined the LRU stack algorithm of Mattson et al [MATT70], the WS statistics-gathering algorithm of Gordon [GORD73] and the multivalued MIN algorithm of Belady [BELA74] into the LRU_WS_MIN algorithm. This algorithm obtains the LRU, WS and MIN statistics of a given reference string in one pass, for all values of the page allotment (up to the size of the address space) and for all values of the window width τ (up to a fixed maximum). We give a PL/I program for this algorithm in Appendix A.

Once a simple one-pass paging algorithm simulation is available, the next question is how to obtain easily the reference strings for a given program. For the matrix programs that we have considered, each program has only one trace and, therefore, only one reference string associated with it. Also note that we are interested only in data references and, in particular, only in array references. Given a matrix program written in PL/I, we remove all declarations. All assignment statements and expressions are reduced by removing all scalar references (retaining only array references). Each array reference is then replaced by a call to the appropriate paging algorithm simulator (in our case LRU_WS_MIN). For each such call we must also supply the page number of the referenced page. This is determined from the name and index of the array referenced and from the organization of the arrays in VA space. For example, if a matrix of order n is stored from page one to page N² by submatrices, and the submatrices are stored in column-major order, then a reference A(i, j) is a reference to the page number x = N·(⌈j/m⌉ − 1) + ⌈i/m⌉, since the index of the submatrix to which the element A(i, j) belongs is (⌈i/m⌉, ⌈j/m⌉).

5. APPLICATION OF FREEING AND PREPAGING TO ARRAY PROGRAMS

5.0 Introduction

The objective of prepaging is to reduce or avoid page faults. Many array algorithms, which are dominated by data paging, have an inherent structure which allows a fore-knowledge of paging needs. This can be utilized in several ways. First, if we know that a set of data pages is not required by the program in the future, then the space occupied by these pages in MM can be freed for some other use. Second, if we have extra space in MM, then we may be able to prefetch some of the required pages. We have noted in Chapter 4 that the locality-improvement methods reduce page faults only for small values of the page allotment c. We will see that the methods used in this chapter can help reduce page faults for larger values of c. Thus we will have covered the complete spectrum of c-values. We will show that an order of magnitude reduction in the asymptotic value of page faults can be achieved by our methods. Once again, we start out with an example, namely Cholesky decomposition (CD), and show how to improve its performance. We will also be concerned with measuring the improvement in performance. Finally, in subsequent sections, we will apply the techniques to other matrix algorithms. In this chapter, we assume that the programmer is making these improvements in his program. In Chapter 6, we will discuss the question of automating these techniques.

5.1 Cholesky Decomposition

It is easier if we describe the algorithm using OL/2 and an associated diagram to identify the data-working-set.
Let us consider the program OCD (refer to section J+.l) and the following partitioned array : k-1 k A <1,1> \ k-1 >D R ^k M C A <3,3> \ The algorithm OCD shows that, for a particular value of k, the subarrays R, D, M and C are referenced, but the subarrays A <1, 1> and A <3, 3> are not. Furthermore, we note that the elements of A <3,3> were not used in the past, but the present elements of A <3,3> will be referenced in the future. On the other hand, all the elements of A <1, 1> were used in the past, therefore, they are likely to be in MM and the algorithm shows that they will not be used again. Thus A <1, 1> may be marked as a dead subarray. Since the mechanism that we have, only allows pages to be declared dead, it is necessary to modify our mechanism for dead subarrays. Because of dynamic partitioning, the extent of the subarrays 103 vary dynamically. In this example, A <1, 1> may share pages with the useful subarrays R, D, M and C. The set of shared pages is a null set whenever the condition, (k-l) = mod m, is satisfied. This condition corresponds to the alignment of a page boundary with the bottom boundary of the subarray A <1, 1> (given by the expression k-l). Under such a condition, we can free all the pages of A <1, 1>. We assume. the existence of a procedure FREE(b) which frees all the pages of the subarrays, which are resident in MM. Based on this observation, OCD can be modified to yield OCDF as follows: OCDF: PR0C(A, n); IF M0D(k, m) - 1 THEN CALL FREE(A <1,1>); D = SQRT(D - (R T , r t )); C = (C - M X R')/D; END OCDF; A corresponding modification is made in the Pi/ I program CD (refer to section ^-.l) to obtain CDF. We assume the existence of a procedure FREE (A, I, J), which declares the (I, J) submatrix of array A dead. We note that this freeing operation is carried out each time k - 1 = mod m. Therefore, we need only free the bottom-most row of pages of A <1, 1>. In Figure 5.1, we give the PL/l program CDF. In Figure 5.2, we have plotted it(c) vs. c for CD using LRU and MIN paging algorithms and for CDF using FREELRU paging algorithm. We see that at and around point D, FREELRU behaves almost like MIN (the optimal demand paging algorithm). The page allotment for point D corresponds to the expression max ( |p(R,D,M, C) | ) where ke[l,n] P(A-., Ap, ...) denotes the pages containing the subarrays A,, A~, ... and 101+ CDF: DO k = 1 TO n; K1FM - (k-1 )/m; IF MOD(k, m) = 1 THEN DO J = 1 TO K1FM; CALL FREE (A, K1PM, J); END; S - 0; DO J = 1 TO k-1; S = S + A(k, J) * A(k, J); END; A(k, k) = SQRT(A(k, k) - S); DO I = 1 TO n-k; S = 0; DO J - 1 TO k-1; S = S + A(k+I, J) * A(k, J); END; A(k+I, k) = (A(k+I, k) - S)/A(k, k); END; END CDF; Figure 5.1. Cholesky Decomposition with Freeing 105 LOG 2 (7T(c)) n = 24, m =4, N = 6 3 5 7 9 11 13 15 17 19 21 Figure 5.2. Page Faults Using LRU, FREELRU and MIN Algorithms 106 p| denotes the number of pages in the set P. Clearly, this is the same as the maximum number of active pages during the execution of the program CD. This is easily found to be equal to » r < which in the case N = 6 reduces to 12. We note that the asymptotic value of page faults, i.e., n(o°), is not changed by freeing. Note that it(») = total number of distinct pages referenced = size of the address space (in case each page is referenced) = — ^ - (= 21 in the present case). We note that for very low values of c, freeing does not reduce page faults because dead pages are removed anyway by the replacement action of LRU. Only after c = >6(N) then freeing starts to help. 
For very high values of c (i.e., £)(Nj), once again, freeing does not help since the whole of the VA space can be stored in MM and no page replacements are necessary. Because of simulation time limits we have not carried out our experiments with large values of the matrix size. It is hoped that with large value of n (the matrix size) the advantages of freeing will be brought out more clearly. We note that the freeing technique can be applied to other variations of Cholesky decomposition (e.g., CDR, CDM, CDS). Since the nature of improvements is similar, we do not discuss these further. Next we proceed to investigate what can be prepaged in OCD. We know that the four useful subarrays are: R, C, M and D. Since the extent of these subarrays change with the loop control variable k, we will indicate this by subscripting these subarrays. Note that, R(k+1) c M(k) U C(k) and M(k+l) c M(k) U C(k). Therefore, if we decide to look only one loop execution ahead, we need to prepage the subarrays C and D only. We note that, P(c(k+l) = P(c(k)) if k f mod m. Here 107 P(B) denotes the pages containing the sub array B. Similarly, P(D(k+l)) = P(D(k)) if k ^ mod m. The reason for this can be seen from the following diagram: 1 m J R(k) D(k) D(k+l) R(k+1) Page boundaries C(k) C(k+l) Therefore, we need to prepage only when the condition k = mod m is satisfied. Under such a condition we prepage C(k+l) and D(k+l). If we decide to prepage the newly required elements of C and D for each new execution of the loop just before beginning the execution of the loop, then we will prepage C(k) and D(k) if k = 1 mod m. Since we are going to consider only demand prepaging algorithms, it does not matter when we prepage a certain set of pages, so long as the prepaging is done prior to a reference to any one of these pages. Assume that we 108 have issued a prepage statement for a set of pages P at time t. Assume further that at t' > t, r = i 6 P, then and at t' we have a page fault. At this time page i will be fetched and as many pages of P will be prefetched as will be allowed by the number of empty pages in MM. With this modification, we obtain the following algorithm: OCDFP: PR0C(A, n, m); IF M0D(k, m) = 1 THEN DO; CALL FREE (A <1, 1>); CALL PREPAGE (C, D); END; D = SQRT(D - (R 1 , R'))j C = (C - M X R')/D; END OCDFP; We can also modify CDF to obtain CDFP in a similar way. In Figure 5.3, we have plotted Jt(c) vs. c curve for CD using LRU paging algorithm, for CDFP using FREEDPRElfLRU paging algorithm and for CD using DPMIN paging algorithm. We see that prepaging using FREEDPREI^LRU (in fact, demand prepaging) reduces the page faulting considerably for large values of c. In particular, the. asymptotic value of it has been brought down from n(n+i) — *~ — - to N. This is one order of magnitude reduction. We see that the asymptotic value of jt for DPMIN is 1. We note here that such a prepaging technique can also be applied to other variations of Cholesky factorization (namely, CDR, CDM and CDS) but we omit this here, since the results are very similar to that of CD. Since with prepaging, the number of page pulls is different from the number of page faults, we need to consider these two measures separately. In Figure 5. k, we have plotted, the number of page pulls vs. c for CD using LRU, MIN and DPMIN paging algorithms and for CDFP using FREEDPRE^LRU paging algorithm. 
We see that, the number page pulls for FREEDPRElf.LRU is generally lower than the page pulls for LRU and only L0G,(7T(C)) 15" 109 n = 24, m = 4, N = 6 10-- 1 3 5 7 9 11 13 15 17 19 21 Figure 5.3. Page Faults Using LRU, FREEDPTOLRU, MIN and DPMIN Paging Algorithms 110 A LOG 2 (PAGE PULLS) 13- Figure 5-4. Page Pulls Using LRU, FREED PRE IfLRU, MIN and DPMIN Paging Algorithms Ill rarely does it go above and then only by a very small amount. We see that the page pulls for DPMTN and MIN are generally equal except in some cases where DPMLN incurs more page pulls than MIN. Note that the asymptotic value of page pulls is the same for all cases and is equal to the size of the data address space (= N(N+l)/2). We can view the process of freeing and prepaging done in CD in a slightly different way. Let us denote the data-working set of one loop execution of CD by DWS(k), where k is the loop control index. In words, DWS(k) is the set of pages referenced in the k execution of the loop. DWS(k) can be recursively defined as follows: a) DWS(l) = P(C(1)) U P(D(1)) b) If k = 1 mod m then DWS(k) = DWS(k-l) U P(D(k)) U P(C(k)) - P(A(k) - A(k-1)) = DWS(k-l) U P(D(k)) U P(C(k)) - P(R(k-l)) - P(D(k-l)) else DWS(k) = DWS(k-l); This knowledge of DWS allows us to do the appropriate freeing and prepaging. 5.2 Other Examples 5.2.1 LU Decomposition For LU decomposition (refer to OLU and the associated diagram in section k.2), we have a) DWS(l) = P(Y(1)) U P(Z(1)); 112 b ) IF k = 1 mod m then DWS(k) = DWS(k-l) U P(Y(k)) U P(Z(k)) - P(A(k-l)) = DWS(k-l) U P(Y(k)) U P(z(k) - P(R(k-l)) - P(X(k-l)) - P(D(k-l)); else DWS(k) = DWS(k-l); Based on this observation, LU and OLU can be modified to free A<1, 1> and prepage Y and Z. We do not present the detailed programs LUF, LUFP, OLUF and OLUFP. In Figure 5.5, we have plotted n(c) vs. c for LU using LRU, MIN and DPMIN paging algorithms, for HJF using FREEDPREULRU paging algorithm. We see that freeing and prepaging helps only for larger values of c and that the asymptotic value of -n(c) has been brought down from N to N. 5.2.2 Gaussian Elimination For Gaussian elimination (referring to OGOS and its associated diagram in section h.2.), we have, a) DWS(l) = P(A); i.e., the whole array A. b ) If k = 1 mod m then DWS(k) = DWS(k-l) - P(A<1,1> U A<1,2> U A<1,3> U A<2,1> U A<3,1>). = DWS(k-l) - P(c(k-1)) - P(D(k-l)) - P(R(k-l)); else DWS(k) = DWS(k-l); 113 , | L06 2 (7T(c)) n « 24, m = 4, N'6 LU Figure 5. 5. Effect of Prepaging on LU Decomposition 114 From these equations, we see that only prepaging that can be done is outside (before) the loop. And since whole of the array A is to be prepaged only a limited number of pages as allowed by the available space will really be prefetched. Freeing of the subarray A<1, 1>, A<1,2>, A<1, 3>, A<2,3> and A<3>1> can be done within the loop when the condition, k = 1 mod m, is met. With these modifications we have GOSF, GOSFP, OGOSF, and OGOSFP (we omit the corresponding programs). In Figure 5.6, we have plotted jt(c) vs. c for GOS using IRQ, MIN and DPMIN paging algorithms, for GOSF using FREELRU paging algorithm and for GOSFP using FREEDPREULRU paging algorithm. We note that jt( FREELRU ) = rt(LRU). In other words, freeing does not help in this algorithm. Reason for this is that dead page is precisely the least recently used. page. Thus the page replaced by FREEIiRU and LRU is the same at any time. The same effect is not necessarily observed with other paging algorithms. 
For example, if we were using FIFO instead of LRU, then in all probability π(FREEFIFO) < π(FIFO) for GOS. We also note that prepaging and freeing do help, reducing the asymptotic value of π(c) from N^2 to 1.

[Figure 5.6. Effect of Prepaging on Gaussian Elimination (n = 24, m = 4, N = 6); curves for LRU, FREELRU, MIN, DPMIN and FREEDPRE4LRU.]

5.2.3 Gram-Schmidt Orthogonalization

For Gram-Schmidt orthogonalization (referring to OORTH and its associated diagram in section 4.2), we have

a) DWS(1) = P(D(1)) ∪ P(RK(1)) ∪ P(AK(1));
b) if k ≡ 1 mod m then
      DWS(k) = DWS(k-1) ∪ P(AK(k)) ∪ P(RK(k)) ∪ P(D(k)) - P(R(k))
             = DWS(k-1) ∪ P(AK(k)) ∪ P(RK(k) ∪ D(k)) - P(RK(k-1) ∪ D(k-1));
   else DWS(k) = DWS(k-1).

From these equations we see that, when the condition k ≡ 1 mod m holds (within the loop, just after the backward branch target), we can free the subarray R<1,1> and prepage the subarrays AK, RK and D. With appropriate modifications, we obtain ORTHF, ORTHFP, OORTHF and OORTHFP. In Figure 5.7, we have plotted π(c) vs. c for ORTH using the LRU, MIN and DPMIN paging algorithms, for ORTHF using the FREELRU paging algorithm, and for ORTHFP using the FREEDPRE4LRU paging algorithm. We see, once again, that freeing and prepaging help for larger values of c and that they bring down the asymptotic value of π(c) from N1*N2 + N2(N2+1)/2 to N2.

[Figure 5.7. Effect of Prepaging on Orthonormalization (n1 = 24, n2 = 16, m = 4).]

5.2.4 Matrix Multiplication

For matrix multiplication, consider the following program OMM:

OMM: LET A, B, C BE MATRICES OF ORDER (n);
     FOR k = 1, 2, ..., n-1;
        PARTITION A, B AFTER ROWS k-1, k;
        A<2> = B<2> * C;
     END OMM;

[Diagram: A and B partitioned after rows k-1 and k, with B<2> the remaining rows of B.]

[Note that the above can easily be programmed in OL/2 as the primitive A = B * C, but we have written OMM for the purpose of obtaining DWS.] From the above, we have

a) DWS(1) = P(C) ∪ P(B<2>(1)) ∪ P(A<2>(1));
b) if k ≡ 1 mod m then
      DWS(k) = DWS(k-1) - P(B(k)) - P(A(k)) ∪ P(B<2>(k)) ∪ P(A<2>(k))
             = DWS(k-1) - P(B<2>(k-1)) - P(A<2>(k-1)) ∪ P(B<2>(k)) ∪ P(A<2>(k));
   else DWS(k) = DWS(k-1).

Based on this, we must prepage the whole array C outside (before) the loop body; inside the loop we can free the subarrays B<1> and A<1> and prepage the subarrays B<2> and A<2> when the loop control variable satisfies the condition k ≡ 1 mod m. With appropriate modifications, we obtain MMF and MMFP. In Figure 5.8, we have plotted π(c) vs. c for MM using the LRU, MIN and DPMIN paging algorithms, for MMF using the FREELRU paging algorithm, and for MMFP using the FREEDPRE4LRU paging algorithm. Once again, we see that for large values of c freeing and prepaging help, reducing the asymptotic value of π(c) from 3N^2 to N.

[Figure 5.8. Effect of Prepaging on Matrix Multiplication (n = 16, m = 4).]

We have seen that our methods of freeing and prepaging improve performance for large values of c (> O(N)) and that the methods work for many different matrix algorithms. The method owes its success to our ability to identify the data-working-sets of the programs. This identification is possible because of the well-structured data referencing of matrix algorithms.
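The following is a minimal sketch in Python (our illustration, not the thesis's PL/I or OL/2 code) of how a DWS recursion of the kind given above translates into per-iteration free and prepage sets, using the Cholesky recursion of section 5.1 as the example. The function pages_of is a hypothetical stand-in for the page map P(.), and the element sets are assumed to be given for each iteration.

def pages_of(elements, elements_per_page):
    """Map linearized element indices to the pages that hold them."""
    return {e // elements_per_page for e in elements}

def dws_step(dws_prev, k, m, C_k, D_k, R_km1, D_km1, elements_per_page):
    """One step of DWS(k) = DWS(k-1) u P(D(k)) u P(C(k)) - P(R(k-1)) - P(D(k-1))
    when k = 1 (mod m), and DWS(k) = DWS(k-1) otherwise.
    Returns (DWS(k), pages to prepage, pages that may be freed)."""
    if k % m != 1:
        return dws_prev, set(), set()
    new = pages_of(C_k | D_k, elements_per_page)
    dead = pages_of(R_km1 | D_km1, elements_per_page)
    dws = (dws_prev | new) - dead
    # prepage only pages not already in the working set; never free a page
    # that the new working set still needs (a conservative refinement of ours)
    return dws, new - dws_prev, dead - new

The same skeleton serves for the LU, Gaussian elimination, Gram-Schmidt and matrix multiplication recursions; only the element sets passed in differ.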
In Chapter 6, we will investigate whether the identification of data-working-sets can be done by an optimizing compiler.

5.3 Combined Effect of Locality Improvement and Working-Set Identification Methods

We have noted that the locality improvement methods of Chapter 4 improve paging performance for low values of c (< O(N)) and that the methods of this chapter improve paging performance for higher values of c (> O(N)). In this section we combine the two methods and measure the resultant improvements.

The version of Cholesky factorization without any improvements is CD and the version with all improvements is CDSRFP; we give a PL/I program for the latter in Appendix C. In Figure 5.9, we have plotted π(c) vs. c for CD using the LRU paging algorithm and for CDSRFP using the FREEDPRE4LRU paging algorithm. The vast improvement in performance needs no elaboration. We have also plotted the number of page pulls vs. c for CDSRFP. In Figures 5.10, 5.11, 5.12 and 5.13 we draw similar curves for LU decomposition, Gaussian elimination, Gram-Schmidt orthogonalization and matrix multiplication, respectively. Fully improved versions of these algorithms are also presented in Appendix C.

[Figure 5.9. Page Faults for CD and CDSRFP (n = 24, m = 4, N = 6): (1) page faults for CD using LRU; (2) page pulls for CDSRFP using FREEDPRE4LRU; (3) page faults for CDSRFP using FREEDPRE4LRU.]

[Figure 5.10. Page Faults for LU and LUSRFP (n = 24, m = 4, N = 6), with the same three curves for LU and LUSRFP.]

[Figure 5.11. Page Faults for GOS and GOSSRFP, with the same three curves for GOS and GOSSRFP.]

[Figure 5.12. Page Faults for ORTH and ORTHSRFP (n1 = 24, n2 = 16, m = 4, N1 = 6, N2 = 4), with the same three curves for ORTH and ORTHSRFP.]

[Figure 5.13. Page Faults for MM and MMSRFP, with the same three curves for MM and MMSRFP.]

6. AUTOMATION OF PERFORMANCE IMPROVEMENT TECHNIQUES

6.0 Introduction

We have shown, in Chapter 4, how to improve the locality of matrix algorithms and, in Chapter 5, how to include freeing and prepaging techniques. We have seen that these techniques improve the paging performance of matrix algorithms. However, in the previous chapters we have implicitly assumed that the programmer has the responsibility of modifying his program to obtain the improved paging performance. In general, programmers are neither capable of nor willing to make these changes; neither is it desirable to distract the user from his main purpose, namely, to solve his problem. Therefore, we will consider how to automate these techniques by incorporating them into the computer system.

The methods that we have presented clearly require advance knowledge of the program's reference pattern, and the operating system has insufficient knowledge of the reference string to apply them.
An optimizing compiler, on the other hand, may be able to gather enough information about the reference pattern and to make the necessary changes to obtain the improved paging performance. In section 6.1, we sketch several transformations of program segments which improve the locality of the segment, and we give simple conditions for these transformations to be valid. In section 6.2, we present an algorithm which abstracts information from an OL/2 program and then, later, uses this information to do freeing and prepaging for the program. A similar algorithm can be given for an element (PL/I-like) source language, but the amount of information to be processed would be so large as to make the overhead prohibitive.

6.1 Transformations Which Improve Locality

6.1.1 Loop Reversal

First we start with an element language like PL/I. Consider a program segment P as follows:

P: DO I = 1 TO n1;
      ...
      DO J = 1 TO n2;
         S1
         ...
         S2
      END;
      ...
   END P;

Now consider a transformation T of the program segment P into the program segment P':

P': DO I = 1 TO n1;
       ...
       IF MOD(I,2) = 1 THEN
          DO; JLL = 1; JHL = n2; JSTEP = 1; END;
       ELSE
          DO; JLL = n2; JHL = 1; JSTEP = -1; END;
       DO J = JLL TO JHL BY JSTEP;
          S1
          ...
          S2
       END;
       ...
    END P';

We will say that the transformation T is valid if the program segments P and P' are equivalent (in the input-output sense) [ALLE71]. In case T is valid, we will say that the inner loop of P is reversible. We want to obtain (easily testable) sufficient conditions under which T is valid.

Let us denote the sequence of statements within the inner DO loop (of P) by F(I,J), where the subscripts indicate the values of the two loop control variables during the execution of the statement sequence F. Clearly, P and P' are equivalent provided the following segments X and X' are equivalent:

X:  F(I,1) F(I,2) ... F(I,n2)
X': F(I,n2) ... F(I,2) F(I,1)

A sufficient condition for X and X' to be equivalent is that F(I,j) commutes with F(I,k) for all I ∈ [1,n1] and all j, k ∈ [1,n2], j ≠ k. Bernstein [BERN66] has derived sufficient conditions for commutativity of program segments. Note that we need only partial commutativity here, i.e., we require that the sequence F(I,j) F(I,k) be equivalent to the sequence F(I,k) F(I,j) for all k > j and all I. We repeat his conditions here. Let us denote the set of addresses fetched and not stored during the execution of F(I,j) by W(I,j), the set of addresses stored and not fetched by X(I,j), the set of addresses fetched first and stored later by Y(I,j), and the set of addresses stored first and fetched later by Z(I,j). Let us denote by W_R, X_R, Y_R, Z_R the corresponding sets of addresses for the rest of the program segment P, excluding the inner DO loop. Then sufficient conditions for P and P' to be equivalent are:

1) (W(I,k) ∪ Y(I,k)) ∩ (X(I,j) ∪ Y(I,j) ∪ Z(I,j)) = ∅
2) (W(I,j) ∪ Y(I,j)) ∩ (X(I,k) ∪ Y(I,k) ∪ Z(I,k)) = ∅
3) (X(I,k) ∪ Y(I,k) ∪ Z(I,k)) ∩ (X(I,j) ∪ Y(I,j) ∪ Z(I,j)) ∩ (W_R ∪ Y_R) = ∅

Note that these are simple conditions which are easy to test. However, they exclude the following case:

P1: DO I = 1 TO n1;
       S = 0;
       DO J = 1 TO n2;
          S = S + A(I,J) * B(I,J);
       END;
    END;

This case represents an inner product being calculated within the inner loop, and as such it is a most frequently occurring construct in matrix algorithms. For the program segment P1 we have W(I,J) = {A(I,J), B(I,J)}, X(I,J) = ∅, Y(I,J) = {S}, Z(I,J) = ∅. Therefore Y(I,j) ∩ Y(I,k) = {S} ≠ ∅, which implies that conditions 1) and 2) are violated.
But it is clear that the inner loop of P1 is reversible. This shows that the conditions we have are sufficient but not necessary. We will therefore require that this construct be specially recognized as reversible. In fact, the following generalization is valid. Suppose we have the following program segment:

DO I = 1 TO n1;
   DO J = 1 TO n2;
      S = S □ E(J);
   END;
END;

Assume that □ is a commutative and associative operator and that E(J) is some expression; then the inner J loop is reversible. Therefore, our loop reversal technique is a modified form of Bernstein's technique. Furthermore, it is generally profitable to reverse adjacent loops in opposite directions.

We apply this automated procedure to CD to see whether we obtain CDR, the transformed procedure of section 4.1. The two inner J loops are immediately recognized as reversible by the special construct. For the loop 'DO I = 1 TO n-K', we have:

W(K,I) = {A(K+I,1), ..., A(K+I,K-1), A(K,1), ..., A(K,K-1), A(K,K)},
X(K,I) = ∅,
Y(K,I) = {A(K+I,K)},
Z(K,I) = {S},

so that

W(K,I) ∪ Y(K,I) = {A(K+I,1:K), A(K,1:K)},
X(K,J) ∪ Y(K,J) ∪ Z(K,J) = {A(K+J,K), S}.

Clearly, the intersection is empty. Also,

W_R = {A(K+1:n, 1:n-1)}, X_R = ∅, Y_R = {A(K+1,K+1), ..., A(n,n)}, Z_R = {S},

and the triple intersection (X(K,I) ∪ Y(K,I) ∪ Z(K,I)) ∩ (X(K,J) ∪ Y(K,J) ∪ Z(K,J)) ∩ (W_R ∪ Y_R) = ∅. Thus, the reversibility of the loop is established. By hand simulation, we have verified that our automated loop-reversing procedure will indeed translate CD into CDR. In a similar fashion, we can show this for other algorithms.

We now turn our attention to programs written in OL/2. Quite clearly, the techniques presented for an element source language are applicable. We have, in addition, several more possibilities of reversal. Consider a vector assignment statement within a loop in an OL/2 program. The vector assignment statement will be compiled into a DO loop by the OL/2 compiler. If this loop is detected to be reversible, then the compiler can carry out the loop reversal. The conditions for reversibility are simpler in this case and are as follows (assume that the statement assigns a vector expression to a vector C):

1) either the right side of the assignment statement does not contain the vector C, its ancestor, or its subarrays in the ACB tree [PHIL72];
2) or the assignment statement is given by C = C + Exp, where Exp is a vector expression satisfying condition 1.

If these conditions are satisfied, then the vector assignment is reversible. Clearly, this procedure will translate OCD into CDR.

6.1.2 Loop Decomposition

We have not discussed this transformation in Chapter 4. Loop decomposition is also known as loop unswitching [ALLE71]. Consider the following program segment P:

P: DO I = 1 TO n;
      A(I) = A(I) * A(I);
      B(I) = B(I) + I;
   END;
   ... rest of the program

Now consider the following transformation P' of P:

P': DO I = 1 TO n;
       A(I) = A(I) * A(I);
    END;
    DO I = 1 TO n;
       B(I) = B(I) + I;
    END;
    ... rest of the program

It is easily seen that the locality of P' is much better than that of P. In other words, whenever a single loop operates on two or more independent data streams, it is better to decouple these operations into two or more DO loops. Sufficient conditions for the transformation to be valid can be stated as follows [BERN66]:

a) (W1(I) ∪ Y1(I)) ∩ (X2(J) ∪ Y2(J) ∪ Z2(J)) = ∅,
b) (W2(J) ∪ Y2(J)) ∩ (X1(I) ∪ Y1(I) ∪ Z1(I)) = ∅,
c) (X1(I) ∪ Z1(I)) ∩ (X2(J) ∪ Z2(J)) ∩ (W_R ∪ Y_R) = ∅,

for 1 ≤ J < I ≤ n.
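The following is a minimal sketch in Python (our illustration, not part of the thesis's compiler) of how such Bernstein-style reference-set conditions can be tested mechanically, given the W, X, Y, Z sets of two statement instances and of the remainder of the segment. The set-of-names representation is an assumption made for the illustration.

def commutes(f1, f2, rest):
    """f1, f2, rest: dicts with keys 'W', 'X', 'Y', 'Z' holding sets of
    variable names (or page identifiers).  Returns True when the three
    sufficient conditions of section 6.1.1 hold, so f1 and f2 may be
    interchanged."""
    mod1 = f1['X'] | f1['Y'] | f1['Z']      # locations written by f1
    mod2 = f2['X'] | f2['Y'] | f2['Z']      # locations written by f2
    cond1 = not ((f2['W'] | f2['Y']) & mod1)
    cond2 = not ((f1['W'] | f1['Y']) & mod2)
    cond3 = not (mod1 & mod2 & (rest['W'] | rest['Y']))
    return cond1 and cond2 and cond3

# The inner-product body S = S + A(I,J)*B(I,J) fails the test, as in the text,
# because S appears in Y of both instances.
f_j = {'W': {'A(1,1)', 'B(1,1)'}, 'X': set(), 'Y': {'S'}, 'Z': set()}
f_k = {'W': {'A(1,2)', 'B(1,2)'}, 'X': set(), 'Y': {'S'}, 'Z': set()}
rest = {'W': set(), 'X': set(), 'Y': set(), 'Z': set()}
assert commutes(f_j, f_k, rest) is False

The same predicate, applied pairwise, serves both the loop reversal test of section 6.1.1 and the loop decomposition test above; only the operand sets differ.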
6.1.3 Submatrix Multiplication Transformation

Many of the matrix algorithms of Chapter 4 contain matrix multiplication. It is, therefore, profitable to convert these internal matrix multiplications into submatrix multiplications. In an array language like OL/2, the programmer is allowed to write a matrix multiplication as a primitive operation; in such a case, the compiler can easily compile this primitive into a submatrix multiplication. Thus OCD will be compiled into CDM. If we also apply the procedure of section 6.1.1, then OCD will be compiled into CDMR, with a correspondingly high degree of locality. The same statement holds for the other matrix algorithms. For an element source language it is more difficult, since the compiler has to detect a sequence of statements which corresponds to matrix multiplication or one of its variants. A number of papers have appeared on the topic of automatic recognition of vector and matrix operations for the purpose of exploiting parallelism [SCHN72]; the same techniques are applicable for our purposes.

6.1.4 Conversion From a Non-Submatrix to a Submatrix Algorithm

In general, it is impossible to automate a conversion from a non-submatrix algorithm like CD to a submatrix algorithm like CDS. The transformation procedure would be required not only to understand the logic of the program, but also to have knowledge, or 'intelligence', about the theory of matrix algebra. We illustrate the difficulties involved in such a task by examples.

First we establish submatrix analogs (also known as block analogs) of scalar operations. For example, the submatrix analogs of scalar addition and scalar multiplication are matrix addition and matrix multiplication, respectively. When setting up the submatrix analog of division, we have a problem. Assume that in the non-submatrix version of a certain matrix program the expression E/A(k,k) occurs. A possible submatrix analog of this expression is E_S * (A_{I,I})^{-1}, where A_{I,I} is the (I,I) submatrix of A, I = ceil(k/m), and E_S is the submatrix analog of E. Another possible analog of the expression is (A_{I,I})^{-1} * E_S. Clearly, these two expressions produce different effects, and only one of them is the correct analog; it is not at all clear from the original program which of the two is correct. This particular example occurs in the transformation of CD into CDS (refer to section 4.1). If we consider Cholesky decomposition, then the submatrix analog of the square root operation is the Cholesky decomposition of a submatrix; there is no way to automate this type of knowledge at present. Another example occurs in attempting to transform ORTH into ORTHS (refer to section 4.2). The transformation is possible because of the known properties of orthonormal vectors [SCH073, SCHW73]; again, we cannot incorporate this knowledge in a compiler. In the case of the eigenvalue problem, submatrix algorithms may not even exist; the basic problem is the noncommutativity of matrix multiplication [LOVA72].

6.2 Freeing and Prepaging

In this section, we discuss the automation of freeing and prepaging for matrix algorithms written in the OL/2 language. It should be clear that we can do the same kind of optimization for any other source language; however, for an element source language the overhead incurred by our procedure is likely to be intolerable.
The optimizer will be a module of the OL/2 compiler. It will collect some information about the program during the syntactic phase and later use this information to do efficient freeing and prepaging for the program. Simply stated, the job of the optimizer consists of two parts: (a) prepage the subarrays and arrays needed by the program at the proper time during its execution, and (b) free the memory space occupied by dead subarrays and dead arrays as soon as possible. We will use "(sub)array" to denote "subarray or array."

Ideally, the proper time to prepage is such that every page of the (sub)array is set up in MM before the first reference to it, and 'as soon as possible' is taken to mean 'just after the last use of the corresponding page'. Such an ideal situation would clearly be equivalent to the use of the FWS paging algorithm and would result in zero page faults as well as very economical use of memory. Such an ideal system, however, would be very expensive in terms of overhead, so we consider a practical solution.

It is clear that some sort of list of useful (sub)arrays needs to be prepared during the syntax analysis phase. This list can be used in two possible ways. First, it may be used to insert instructions for freeing and prepaging (sub)arrays during the coding phase. Second, it may be used dynamically, to prepage (sub)arrays from the list as far into the future as demanded by the desired performance. Clearly, the second approach requires a variable page allotment of MM. It will also incur more overhead, since the list must be kept in MM during the execution phase, but it is likely to result in better paging performance. We will discuss the implementation of the first approach, since it is simpler; the technique of constructing the list is common to both approaches.

Let us assume, for the sake of argument, that the program can acquire as many pages as necessary. Furthermore, assume that once a (sub)array has been brought into MM it remains there until after the last time it is used. Define the lifetime of a (sub)array A as L(A) = (t1, t2), where t1 is the time of the first use and t2 the time of the last use of A. Once the lifetime of each (sub)array is determined, our problem is solved, at least in theory, since we can follow the following prepaging and freeing strategy: given L(A) = (t1, t2), if t < t1 - T then PREPAGE(A), and if t > t2 then FREE(A). The variable t is the current process time and T is the average page-fetch time.

We construct, therefore, the list USE_LIST, whose nodes correspond to the (sub)arrays in the order in which they are referenced by the program. On the program level the normal sequential ordering is used, and within a statement the right-end order of the expression tree is followed [JURI72]. The primary link in the USE_LIST is the one implied by this ordering; we will refer to it as the 'static link' or the NEXT pointer.

In the case of repeated use of a (sub)array, it is not worthwhile to enter each use in the USE_LIST, since the USE_LIST would become very long. Instead, we consider only the first and the last use; the interval between them has been defined as the lifetime. The decision made here is quite arbitrary. We could, alternatively, have defined the lifetime L'(A) of a (sub)array A as L'(A) = L(A) ∩ SS(A), where L(A) is the lifetime by the earlier definition and SS(A) is the syntactic scope of A (as in a block-structured language like OL/2).
Note that SS(A), in general, is a set of disjoint intervals; since L(A) is a single interval, L'(A) will be a set of disjoint intervals. The use of L' would involve more overhead but would result in better paging performance. There are many other ways of defining lifetime, but we will use the first definition, namely L.

We have to store L(A) for each A in the USE_LIST. A practical way to do this is to have a node for A corresponding to the first applied occurrence of A and to store in that node a lifetime pointer, called LP, pointing to the last applied occurrence of A.

We have tacitly assumed that the sequence of uses of (sub)arrays in a program can be linearized. There are several language constructs that prevent this. The GOTO statement introduces nonlinearity in the USE_LIST; since the programming language we are using (OL/2) is GOTO-less, we do not have this problem. Another source of nonlinearity is the IF statement. Let us assume that the statement I = (IF E THEN S1 ELSE S2) occurs in a program. Ideally, the USE_LIST for this statement should branch after U(E), with a true branch containing U(S1) and a false branch containing U(S2). A simpler solution is to stop all prepaging activity after the evaluation of the if-expression. Which of these solutions is the most useful can only be decided by experience with a working system; we select the first solution for the rest of the discussion.

The existence of loops (e.g., FOR loops, WHILE loops) is still another source of nonlinearity. One possible way of linearizing in this case is to unfold the loops, but this solution is unattractive: it is not applicable to WHILE loops, since the number of times the loop is traversed is not known at compile time, and it is wasteful of memory space, since a lot of information is duplicated in the USE_LIST. The alternative is to allow loops in the USE_LIST.

The fourth source of nonlinearity in the USE_LIST is the call to a procedure. We assume that each procedure has code inserted in it to do its own freeing and prepaging. Thus, in the calling procedure, we stop prepaging at and beyond the point of call and resume freeing and prepaging after the return from the called procedure. In our treatment we have ignored call statements altogether, but the above solution may easily be incorporated.

Assume that the USE_LIST has been constructed for a given program. Each (sub)array in the USE_LIST may have at most four boundaries. Since OL/2 allows dynamic partitioning, these boundaries can move during the execution of the program; therefore, each boundary has associated with it a boundary expression [PHIL72]. If all the boundary expressions of a (sub)array remain constant during the execution of a loop, then the (sub)array will be called static (with respect to the loop). Those subarrays for which at least one of the boundary expressions varies within the loop are called dynamic subarrays. Clearly, the paging techniques for these two types of (sub)arrays should differ: the static (sub)arrays need be prepaged only once, whereas the extents of the dynamic subarrays change during the execution of the loop, and hence parts of these subarrays may have to be prepaged during the loop execution. We therefore originate a dynamic link at the beginning of each FOR loop; it threads all subarrays within the scope of the FOR loop which are dynamic with respect to this loop. If the loop control variable is k, the dynamic link will also be called the k-chain.
We define a predicate P such that, for a subarray B and loop control variable k, P(B,k) = T (true) iff B is dynamic with respect to loop k.

Once the USE_LIST has been constructed as outlined above, prepaging instructions can be inserted during the coding phase as follows. At nest level 0, we prepage (sub)arrays in the USE_LIST, traversing it along the static link. At nest level greater than 0, prior to generating code for the loop, we insert instructions for prepaging (sub)arrays, traversing the USE_LIST along the static link; and within the loop, we insert instructions for prepaging dynamic subarrays, traversing the USE_LIST along the dynamic link.

We will now discuss how to insert the lifetime pointers (LP), or, equivalently, how to determine L(A) for each A in the USE_LIST. We assume the existence of a table of (sub)array names (addressed by the (sub)array name) containing two pointers, IP and CP, for each entry. IP, the initial pointer, points to the first occurrence of the (sub)array in the USE_LIST; CP, the current pointer, points to the node in the USE_LIST where the currently last use of the (sub)array would have been listed if we were to list each use. At the end of the syntax analysis phase, LP = CP for each (sub)array.

How and when to free (sub)arrays is the topic to be discussed next. A simple answer is to free a (sub)array soon after the end of its lifetime. First of all, this implies the existence of a node, called the end-of-lifetime node, for each (sub)array in the USE_LIST. Whenever such a node is scanned during the traversal of the USE_LIST in the coding phase, instructions can be inserted to free the corresponding (sub)array. However, there are several problems with this procedure. If the lifetime L(A) of a (sub)array A is completely contained within a loop f, then the above procedure will make us prepage and free A during every traversal of the loop. To alleviate this difficulty, we make the following change: at nest level 0 we free A on scanning the end-of-lifetime node of A, and at nest level greater than 0 we free A, on scanning the end-of-lifetime node of A, only during the last traversal of the loop.

If A is a dynamic subarray, this refined procedure also breaks down. For such a subarray, the contents of A during the last iteration of a loop may be significantly different from the contents of A during the previous iterations, and the procedure outlined thus far may leave a lot of dead elements in MM. A simple scheme immediately comes to mind: at the expiration of the lifetime of A within a loop, free only the dead elements of A; i.e., if we let A(k) be the elements of A when the loop control variable takes the value k, then at the expiration of the lifetime of A during the k-th loop traversal free (A(k+1) - A(k)). The danger in this procedure is that these freed elements may be the useful elements of an adjacent subarray during the next loop traversal. We solve this problem in the following way. Define the scope of an array A by

S(A) = (min over B ⊆ A of IP(B), max over B ⊆ A of LP(B)),

where B ⊆ A means B is a subarray of A. Intuitively, S(A) is an interval specifying the first use and the last use of A or any of its subarrays. The scope S(A) of array A defines a natural universe over which we can define the dead subarrays of A. Define the DEAD_LIST as follows:

DEAD_LIST(A, S(A)) = A - (A ∩ USE_LIST) - {B | B ⊆ A, B is static with respect to all loops in which it is referenced}.

The expression (A ∩ USE_LIST) denotes the list of all subarrays of A that occur in the USE_LIST. Clearly, B ∈ DEAD_LIST implies that B is dynamic with respect to at least one loop.
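The following is a minimal sketch in Python (our illustration, not the OL/2 compiler's data structures) of the lifetime bookkeeping behind the USE_LIST: the first and last applied occurrence of each (sub)array is recorded (playing the roles of IP and LP, with CP advancing during the scan), and the simple strategy "PREPAGE(A) when t < t1 - T, FREE(A) when t > t2" is then driven from these lifetimes. The names and the linearized reference order are assumptions of the illustration.

def build_lifetimes(reference_order):
    """reference_order: (sub)array names in the order the program uses them.
    Returns {name: (t1, t2)}, the first and last use positions (IP and LP)."""
    lifetimes = {}
    for t, name in enumerate(reference_order):
        first, _ = lifetimes.get(name, (t, t))
        lifetimes[name] = (first, t)        # CP advances; LP = CP at the end
    return lifetimes

def actions_at(t, lifetimes, fetch_time):
    """Return the (sub)arrays to prepage and to free at process time t."""
    prepage = [a for a, (t1, _) in lifetimes.items() if t == t1 - fetch_time]
    free    = [a for a, (_, t2) in lifetimes.items() if t == t2 + 1]
    return prepage, free

uses = ['C', 'D', 'R', 'C', 'D', 'M', 'C']
life = build_lifetimes(uses)                 # e.g. life['D'] == (1, 4)
print(actions_at(5, life, fetch_time=2))     # ([], ['D'])

In the compiler the same information is kept as pointers into the USE_LIST rather than as integer times, and the freeing decision is further qualified by the nest-level and dynamic-subarray rules described above.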
Now it is easy to merge the DEAD_LIST with the USE_LIST. For each array A, we do the following: for each B ∈ DEAD_LIST(A, S(A)) and for each loop (with control variable k) in which B occurs, if P(B,k) = T, we link B to the dynamic chain of the loop. It is clearly helpful to keep all dead subarrays at the head of the dynamic chain, after which all the useful dynamic subarrays follow. The procedure of DEAD_LIST construction and linking to the USE_LIST can only be done in a post-syntax, pre-coding phase. The code for freeing and prepaging dynamic subarrays is then easily inserted in the loop by traversing the dynamic chain of the loop.

Thus our description of the procedure to construct the USE_LIST is reasonably complete. However, we have ignored one important problem, namely, for B ⊆ A, L(A) ∩ L(B) ≠ ∅, i.e., the possibility that the lifetimes of a subarray B and a (sub)array A may overlap. We solve this problem by considering several cases separately. If L(B) ⊆ L(A), then we delete B from the USE_LIST: since we have assumed that the (sub)array A is resident in MM throughout its lifetime, its subarrays need not be considered separately. If L(B) is not contained in L(A) but IP(B) < IP(A) and LP(B) < LP(A), then we extend the lifetime of B so that LP(B) = LP(A); the reason for this is that when the lifetime of B has ended but the lifetime of A has not, B cannot be freed anyway. If LP(B) > LP(A), then we extend the lifetime of A so that LP(A) = LP(B). If this simple approach is not used, an alternative approach has to find the complement of B with respect to A and free all the subarrays in that complement; the procedure then becomes difficult and time consuming.

The next question to be discussed is how to relate (sub)arrays to pages. The freeing and prepaging instructions available to us can free and prepage only pages, but the information in the USE_LIST is about (sub)arrays; somehow, we have to bridge the gap. The problem is compounded by the fact that the subarrays can change their sizes during the execution of the program. We consider the cases separately.

If an array occurs in the USE_LIST, then we can prepage all of its pages a few nodes in advance. Similarly, at the expiration of its lifetime, we can free all of its pages. We have assumed that the array does not share any pages with other useful (sub)arrays.

If we consider a static subarray smaller than a page, then while prepaging it we can simply bring in the page in which the subarray is stored. While freeing, however, we would have to make sure that this page does not contain any useful information; for the sake of simplicity, we will not free such a subarray. If we consider a large static subarray, then while prepaging we fetch all the pages containing the elements of the subarray. While freeing, we can free only those pages which contain just the elements of this subarray; other pages, which are shared with some other subarrays, cannot be freed.

We now consider the case of dynamic subarrays. Consider a loop f with control variable k such that, for the given subarray B, P(B,k) = T (i.e., B is dynamic with respect to the loop f).
Let us denote the values that the control variable k takes during the execution of the loop by k_0, k_1, ..., k_i, k_{i+1}, .... Let us denote the elements of the subarray B during the k-th loop traversal by B(k) and the pages containing B(k) by P(B(k)). Then P(B(k_0)) ⊆ DWS(k_0) and

ΔP(B(k_{i+1})) = P(B(k_{i+1})) - P(B(k_i)) ⊆ DWS(k_{i+1}) - DWS(k_i).

This means that the elements of B(k_0) must be prepaged before the first traversal of the loop and that during each subsequent traversal ΔP(B(k_{i+1})) need be prepaged (if not null). The first requirement is already satisfied, since before the first traversal of the loop we prepage all the subarrays along the static chain of the loop. To satisfy the second requirement, we must determine the conditions under which ΔP(B(k_{i+1})) will be non-null. Before that, however, we determine the conditions under which ΔB(k_{i+1}) = B(k_{i+1}) - B(k_i) will be non-null.

We assume that the subarray B has four boundary expressions associated with it. Let us further assume that B = A_{I,J}, i.e., it is a subarray of A. Then r_{I-1}, r_I, c_{J-1}, c_J are the boundary expressions associated with it, as shown in Figure 6.1.

[Figure 6.1. The Partition Lines: the boundary expressions r_{I-1}, r_I, c_{J-1}, c_J delimiting the subarray A_{I,J}.]

In general, these four expressions are functions of k. Let X1 = {r_{I-1}(k), c_{J-1}(k)}, X2 = {r_I(k), c_J(k)}, and X = X1 ∪ X2. Define f_{i+1}(x) for all x ∈ X by:

f_{i+1}(x) =  1 if x(k_{i+1}) > x(k_i),
              0 if x(k_{i+1}) = x(k_i),
             -1 if x(k_{i+1}) < x(k_i).

Then, clearly, the condition for ΔB(k_{i+1}) to be non-null is:

(∃ x ∈ X1 such that f_{i+1}(x) = -1) ∨ (∃ x ∈ X2 such that f_{i+1}(x) = 1).

Intuitively, the condition asserts that the subarray B is expanding in at least one of the four directions. Let us assume that X0 is the subset of X such that, for every x ∈ X0, B is expanding along the boundary expression x. We would like to determine the conditions under which ΔP(B(k_{i+1})) ≠ ∅. First of all, it is clear that ΔB(k_{i+1}) = ∅ implies ΔP(B(k_{i+1})) = ∅. Given that ΔB(k_{i+1}) ≠ ∅, it may still be that ΔP(B(k_{i+1})) = ∅. If ∃ x ∈ X0 such that x(k_i) ≡ 0 mod m, then ΔP(B(k_{i+1})) ≠ ∅; intuitively, this condition asserts that one of the expanding boundaries has just crossed a page boundary. Thus the required condition for prepaging is:

(ΔB(k_{i+1}) ≠ ∅) ∧ (∃ x ∈ X0 such that x(k_i) ≡ 0 mod m).

As regards the freeing of dynamic subarrays: during the k-th loop traversal, we scan the dynamic chain and try to free all the dead subarrays. If a subarray shares pages with other subarrays, then we cannot free these pages. Also, if the subarray is contracting in at least one of the directions, then we cannot free it, since the present elements of this dead subarray may become the future elements of some useful subarray. The condition under which a dead subarray B is not contracting in any direction is:

(∀ x ∈ X1, f_{i+1}(x) ≤ 0) ∧ (∀ x ∈ X2, f_{i+1}(x) ≥ 0).

The condition under which the subarray B does not share pages with any other subarray is:

∀ x ∈ X, x(k_{i+1}) ≡ 0 mod m.

When both these conditions are satisfied, we can free all the pages containing the dead subarray B.

It may happen that some of the shared pages of a dead subarray belong to another dead subarray; this sharing stops the freeing of both subarrays unnecessarily. We can easily incorporate a 'combination of dead subarrays' function in our procedure. Assume that A<1,1>, A<1,2>, A<1,3> occur as dead subarrays and that there are only two vertical partition lines of A. Then the combination may be recognized as a whole (top) row of subarrays and may be denoted by A<1,*>. With this kind of combination, we may be able to free more frequently; in fact, such a situation does occur in Gaussian elimination. We will not, however, discuss the details.
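The following is a minimal sketch in Python (our illustration) of the page-boundary test just derived: a dynamic subarray's newly exposed elements need prepaging only when some expanding boundary expression has just crossed a page (partition) boundary, i.e., its previous value was congruent to 0 mod m. The argument grouping into upper-left (X1) and lower-right (X2) boundaries is an assumption of the illustration.

def growth_direction(x_old, x_new):
    """f(x): +1 growing, 0 unchanged, -1 shrinking."""
    return (x_new > x_old) - (x_new < x_old)

def must_prepage(upper_left_old, upper_left_new,
                 lower_right_old, lower_right_new, m):
    """upper_left_*: values of the X1 expressions (r_{I-1}, c_{J-1});
    lower_right_*: values of the X2 expressions (r_I, c_J)."""
    expanding_old = [old for old, new in zip(upper_left_old, upper_left_new)
                     if growth_direction(old, new) == -1]
    expanding_old += [old for old, new in zip(lower_right_old, lower_right_new)
                      if growth_direction(old, new) == +1]
    delta_b_nonempty = bool(expanding_old)            # B expands somewhere
    crossed_page = any(old % m == 0 for old in expanding_old)
    return delta_b_nonempty and crossed_page

# Example: the lower row boundary r_I moves from 8 to 9 with m = 4;
# 8 = 0 (mod 4), so the newly uncovered elements start a fresh page row.
print(must_prepage((0, 0), (0, 0), (8, 12), (9, 12), m=4))   # True

The freeing test of this section is the mirror image: no boundary may be contracting, and every boundary value must lie on a page boundary so that no page is shared with a neighbouring subarray.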
The scheme for freeing that we have discussed is still not complete, because there may exist a subarray which is dead but which always shares pages with useful subarrays and is therefore never freed. To alleviate this difficulty, when the scope S(A) of an array A ends, we free all the subarrays of A remaining in MM. This, of course, implies an end-of-scope node for each array A. We give an algorithm (called ALG_AP) in Appendix D which builds the USE_LIST, builds the DEAD_LIST and attaches it to the USE_LIST, and finally inserts instructions in the code for freeing and prepaging.

7. CONCLUSION AND FUTURE RESEARCH

7.0 Conclusion

The improvement of the paging performance of programs running under a paged virtual memory system has been demonstrated in this thesis. As we have seen, there are many variables which affect the performance; the paging algorithm used by the system and the locality of the program under consideration are two of the more important ones. We have studied the effects of these two variables on the performance of programs running under such a system.

In analyzing the performance, it was necessary to choose a set of performance measures. The traditional performance measure is the number of page pulls incurred by the program, but the number of page faults and the space-time product of main memory are also important performance measures. Which of these is the most important depends on the particular system and the mix of programs running under it.

Demand paging algorithms have been very popular, both in theory and in practice. We have shown, both theoretically and practically, that non-demand algorithms are useful in improving paging performance. We have defined a fixed-memory demand prepaging algorithm DPMIN and proved that it is an optimal demand prepaging algorithm in the number of page faults. This algorithm, however, is unrealizable, since it requires advance knowledge of the future reference string; it serves as a benchmark of performance for realizable demand prepaging algorithms.

We have defined several realizable prepaging algorithms and have studied, for each of them, whether or not it is a stack algorithm. To show the practical usefulness of these algorithms, we resorted to a specific application: we have shown that these prepaging algorithms improve the paging performance of common matrix algorithms considerably. The reduction in the number of page faults is shown to be as high as an order of magnitude. We have measured the number of page faults and the number of page pulls for many different matrix algorithms using many different paging algorithms.

The application of realizable prepaging algorithms requires some knowledge of the future paging needs of the program. If the programmer is willing to supply this, he is suitably rewarded. We have shown that a compiler for a structured array language like OL/2 can extract, without too much overhead, enough information from a program to improve its paging performance. The compiler can identify portions of the present working set which cease to be in the future working set, and it can also provide reliable advance information about the working set. As a result, moving unwanted information out of memory and bringing useful information in, in advance of its use, can be done efficiently.

We have defined a variable-memory prepaging algorithm FWS, based on Denning's WS algorithm.
We have shown that PWS algorithm incurs zero page faults, has the same number of page pulls as WS, and it is more economical in ST product than WS. However, IWS is an unrealizable algorithm since it requires that the program's reference string be known in advance. 151 PWS algorithm can serve as a benchmark of performance among all variable- memory prepaging algorithms. We have identified three methods of improving the locality of matrix algorithms. These are: loop reversal, submatrix multiplication and the construction of a submatrix algorithm. We have shown that these methods yield a vast improvement in locality and that the most powerful method is the construction of a submatrix algorithm. The average working set size is known to be a good measure of the locality of a program. We have shown that a decrease in the average working set size, generally, implies a reduction in the number of page faults. We have considered the problem of automating these methods to improve locality. If the source language is a high-level array- language, like OL/2, then the nature of the primitives of the language provide a much easier means of optimization. We have shown that the locality- improvement methods are effective for lower values of the page allotment and that prepaging methods are effective at higher values of the page allotment. 7. 1 Future Research As we have noted, very little is known about prepaging algorithms. Both, theoretical and practical investigations are needed for understanding the behavior of prepaging algorithms. In future, memory technology will provide very large memories making program (instruction) paging less important. Parallel processing will provide the means for solving problems with very large data bases, thereby making data paging more important. As a result we expect the emphasis in paging to shift toward specialized systems where more information can be used to anticipate paging demands. 152 There are many matrix algorithms, for which submatrix algorithms do not yet exist. This is a useful area of investigation. Actual implementation, either in software or hardware, of the paging optimizer proposed in thesis needs to be done. The use of the paging optimizer in conjunction with variable -memory prepaging algorithms can also be investigated. A study of models of program behavior in conjunction with prepaging algorithms is another fruitful area of investigation. In particular, we refer to conjecture 3.1 regarding the optimality of the paging algorithm DPOBLLRU for the sequential-random model of program behavior (refer to section 3.2.2.1). 153 LIST OF REFERENCES [AH071] Aho, A. V., Denning, P. J. and Ullman, J. D., "Principles of Optimal Page Replacement," JACM, Vol. 18, No. 1, Jan. 1971, pp. 80-93. [ALLE71] Allen, F. E. and Cocke, J., "A Catalogue of Optimizing Transformations," IBM T. J. Watson Research Center Report RC-35^8 (Sept. 1971). [ANAC67] Anacker, W. and Wang, C. P., "Performance Evaluation of Computing Systems with Memory Hierarchies,". IEEE Transactions on Computers, Vol. EC-16, No. 6, Dec. 1967, pp. 76^-773. [BAER71] Baer, J. L. and Sager, G. R., "Measurement and Improvements of Methods for Evaluation under Paging Systems, " in Proc. of Conf. on Statistical Methods for Evaluation of Computer System Performance, Brown University, Nov. 1971* PP« 2^1-269* [BAIR68] Bairstow, J. N., "Time -Sharing, " Electronic Design, Vol. l6, No. 9 (1968), pp. C1-C22. [BAYL68] Bay lis, M. H. J., Fletcher, D. G. and Howarth, D. J., "Paging Studies Made on the I.C.T. 
Atlas Computer," IFIP (1968), D113. [BELA66] Belady, L. A., "A Study of Replacement Algorithms for a Virtual Storage Computer," IBM Sys. J., Vol. 5, No. 2, (1966), pp. 78-IOI. [BELA69] Belady, L. A. and Kuehner, C. J., "Dynamic Space Sharing in Computer Systems," CACM, Vol. 12, No. 5 (May 1969), pp. 282-288. [BELA]^-] Belady, L. A. and Palermo, F. P., "On-line Measurement of Paging Behaviour by the Multivalued MIN Algorithm, " IBM J.R.D. Jan. 197^, pp. 2-19. [BERG71] Berge, C, Principles of Combinatorics, Academic Press, New York, 1971. [BERN66] Bernstein, A. J., "Analysis of Programs for Parallel Programming," IEEE Transactions on Computers, Vol. EC-15, No. 5, (Oct. 1966), PP. 757-763. [BOBR67] Bobrow, D. G. and Murphy, D. L., "Structure of a LISP System Using Two-Level Storage," CACM, Vol. 10, No. 3 (March I967), p. 155. IV, [BRAW68] Brawn, B. and Gustavson, F., "Program Behaviour in a Paging Environment/' AFIPS FJCC, Vol. 33 (1968), pp. 1019-1032. [BRAW70] Brawn, B. and Gustavson, F., "Sorting in a Paging Environment," cacm, Vol. 13, No. 8 (Aug. 1970), p. 483- [COFF68] Coffman, E. G. and Varian, L. C, "Further Experimental Data on the Behaviour of Programs in a Paging Environment, " CACM , Vol. 11, No. 7 (July 1968), pp. 471-I+7IU [C0FF72] Coffman, E. G. and Ryan, T., "A Study of Storage Partitioning Using a Mathematical Model of Locality," CACM, Vol. 15, No. 3 (Mar. 1972), pp. 185-190. [COFF73] Coffman, E. G. and Denning, P. J., Operating System Theory, Prentice -Hall, Inc., Englewood Cliffs, N.J., 1973- [COME67] Comeau, L. W., "A Study of the Effects of User Program Optimization in a Paging System, " ACM Symposium on OS, Oct. I967. [CORB69] Corbato, F. J., "A Paging Experiment with the Multics System," Chapter 19, In Honor of P. M. Morse, M.I.T. Press, Cambridge, Massachusetts, 1967, pp. 217-228. [DENN68] Denning, P. J., "Thrashing: Its Causes and Prevention," AFIPS SJCC, Vol. 33 (1968), pp. 915-922. [DENN68a] Denning, P. J., "The Working Set Model of Program Behaviour," CACM, Vol. 11, No. 5 (1968), pp. 323-333. [DEHN68b] Denning, P. J., Chen, Y. C. and Shedler, G. S., "A Model for Program Behaviour under Demand Paging," Rep. RC-2301, IBM T. J. Watson Res. Center, Yorktown Heights, N. Y., Sept. I968. [DEWN70] Denning, P. J., "Virtual Memory," Computing Surveys, Vol. 2, No. 3, Sept. 1970, pp. 153-189. [DENN72] Denning, P. J., "On Modelling Program Behaviour," AFIPS SJCC, Vol. kO (1972), pp. 937-9^- [DENN72a] Denning, P. J., Savage, J. E. and Spirn, J. R., "Models for Locality in Program Behaviour," TR-107, Dept. of Elect. Eng., Princeton Univ., Apr. 1972. [DEHN72b] Denning, P. J. and Schwartz, S. C, "Properties of the Working- Set Model," CACM, Vol. 15, No. 3 (Mar. 1972), pp. I9I-I98. [DENN65] Dennis, J. B. and Glaser, E. L., "The Structure of On-Line Information Processing Systems, " Proc. Second Congress on Information Systems Sciences, 1965, pp. 5~lk. [DOHE70] Doherty, W., "Scheduling TSS/360 for Responsiveness," AFIPS FJCC, Vol. 37 (1970), pp. 97-112. 155 [DOYL ] Doyle, M. S., "The Effects of Program Behaviour on the Design of Storage Hierarchies," Ph.D. Thesis, University of Waterloo, Canada. [DUBB.72] Dubrulle, A. A., "Solution of the Complete Symmetric Eigenvalue Problem in a Virtual Memory Environment," IBM J . R . D . , Nov. 1972, pp. 612-615. [FERR73] Ferrari, D., "A Tool for Automatic Program Restructuring," Proc. ACM Nat. Conf., 1973, p. 228. [FINE66] Fine, G. H., Jackson, C. W. and Mclsaac, P. V., "Dynamic Program Behaviour Under Paging, " Proc. 21 s fr ACM Nat. 
Conf., 1966, pp. 223-228. [FRAFflf] Franaszek, P. A. and Wagner, T. J., "Some Distribution-free Aspects of Paging Algorithm Performance," JACM, Vol. 21, No. 1, (Jan. 197*0, PP- 31-39- [FREI68] Freibergs, I. F., "The Dynamic Behaviour of Programs," AFIPS fjcc, Vol. 33 (1968), pp. 1163-1168. [GELE71] Gelenbe, E. and Tiberio, P., "The WSTI and Page Size in the Design of Virtual Memory Computer Systems, " Proc. 9' t * 1 Annual Allerton Conf. on Circuits and System Theory (1971), P- 10*1-. [GELE7J] Gelenbe, E., Tiberio, P. and Boekhorst, J. C. A., "Page Size in Demand-Paging Systems," Proc. ACM SIGME Symp. (1973), PP- 1-12. [GIBS66] Gibson, D. H., "Considerations in Block-Oriented Systems Design," AFIPS SJCC, Vol. 30 (1967), pp. 75-80. [GORD73] Gordon, R. L., "The Organization and Control of a Slave Memory Hierarchy," TR-kk-29, Naval Underwater Systems Center, Feb. 1973. [HATF71] Hatfield, D. J. and Gerald, J., "Program Restructuring for Virtual Memory," IBM Sys. J., Vol. 10, No. 3 (1971), PP- 168-192. [HATF72] Hatfield, D. J., "Experiments on Page Size, Program Access Patterns and Virtual Memory Performance," IBM J.R.D. , Jan. 1972, P- 58. [HOAR72] Hoare, C. A. R., "A Survey of Store Management Techniques Part Two, " in Operating Systems Techniques, (Ed. ) Hoare and Perrott, APIC Studies in Data Processing, No. 9, 1972. [H0US6^] Househoder, A. S., The Theory of Matrices in Numerical Analysis, Blaisdell Publishing Company, New York, 196k. [ISAA66] Isaacson, E. and Keller, H. B., Analysis of Numerical Methods, John Wiley and Sons, 1966. [JOSE70] Joseph, M., "An Analysis of Paging and Program Behaviour," The Computer Journal, Vol. 13, No. 1, Feb. 1970, p. k-Q. 156 [JURI72] Jurich, D. R., "An Approach to the Compilation of Array- Expressions in the OL/2 Language, " University of Illinois at Urb ana -Champaign, Department of Computer Science Report No. 550, 1972. [KILB62] Kilburn, T., Edwards, D. G., Lanigan, M. J. and Sumner, F. H., "One Level Storage System," IRE, EC-11, No. 2 (Apr. 1962), pp. 223-235. [KING71] King, W. F., "Analysis of Demand Paging Algorithms," IFIP 1971, Vol. 1, p. ^85. [KNIG ] Knight, J. C. and Page, E. S., "Searching on a Paging Environment, " to appear. [KUCK70] Kuck, D. J. and Lawrie, D. H., "The Use and Performance of Memory Hierarchies: A Survey," Software Engineering, Vol. 1, 1970, pp. 45-77. [KUEH68] Kuehner, C. and Randell, B., "Demand Paging in Perspective," AFIPS SJCC, Vol. 32 (1968), pp. 1011-1018. [LEHM68] Lehman, M. M. and Rosenfeld, J. L., "Performance of a Simulated Multiprogramming System," AFIPS SJCC, Vol. 33 (1968), pp. 1^31-1^2, [LIPT68] Liptay, J. S., "Structural Aspects of the System 360/85: II, The Cache," IBM Sys. J., Vol. 7, No. 1 (1968), p. 15. [LOVA72] Lovass-Nagy, V. and Powers, D. L., "A Note on Block Diagonalization of Some Partitioned Matrices, " Linear Algebra and Its Applications, Vol. 5, Oct. 1972, pp. 339-3^6. [MA.SU72] Masuda, T., Takahashi, N. and Yoshizawa, Y., "The Comparison of Swapping Algorithms and Some Program Behaviour Under a Paging Environment," Information Processing in Japan, Vol. 12 (1972), pp. 70-75. ~ ~ '" - [MATT70] Mattson, R. L., Gecsei, J., Slutz, D. R. and Traiger, I. L., "Evaluation Techniques of Storage Hierarchies," IBM Sys. J. , Vol. 9, No. 2 (1970), pp. 78-117. [MCKE69] McKeller, A. C. and Coffman, E. G., "The Organization of Matrices and Matrix Operations in a Paged Multiprogramming Environment, CACM, Vol. 12, No. 3 (1969), pp. 153-165. [M0LE72] Moler, C. 
B., "Matrix Computations with Fortran and Paging," CACM, Vol. 15, No. k (1972), p. 268. [ONEI67] O'Neill, R. W., "Experience Using a Time Sharing Multiprogramming System with Dynamic Address Relocation Hardware," AFIPS SJCC, Vol. 30 (1967), pp. 611-621. [OPPE68] Oppenheimer, G. and Weizer, N., "Resource Management for a Medium Scale Time Sharing Operating System," CACM, Vol. 11, No. 5, (May 1968), p. 113. 157 [0RGA72] [PHIL72] [PINK68] [P0ME71] [RAMA66] [RAND68] [RAND69] [R0DR72] [RODR75] [R0GE73] [SAL17^] [SAYR69] [SCRN72] [SCH073] [SCHW73] Organick, E. I., The MULTICS System: An Examination of Its Structure, MIT Press, Cambridge, Mass., 1972. Phillips, J. R. and Adams, H. C, "Dynamic Partitioning for Array Languages," CACM, Vol. 15, No. 12 (Dec. 1972), pp. 1023-1032. Pinkerton, T., "Program Behaviour and Control in Virtual Storage Computer Systems, " University of Michigan, C0NC0MP Report h (Apr. 1968). Pomeranz, J. E., "Paging with Fewest Expected Replacements," ifip, (1971), Vol. 1, pp. 1+91-^93. Ramamoorthy, C. V., "The Analytical Design of a Dynamic Lookahead and Program Segmenting Scheme for Multiprogrammed Computers, " Proc. 21 st ACM Nat. Conf., I966, pp. 229-239. Ran dell, B. and Kuehner, C. J., "Dynamic Storage Allocation Systems," CACM, Vol. 11, No. 5 (May I968), pp. 297-306. Randell, B., "A Note on Storage Fragmentation and Program Segmentation," CACM, Vol. 12, No. 7 (July 1969), p. 365. Rodriguez -Ros ell, J. and Dupuy, J. P., "The Evaluation of a Time-Sharing, Page Demand System," AFIPS SJCC (1972), Vol. k-0, pp. 759-765. Rodriguez -Rosell, J. and Dupuy, J. P., "The Design, Implementation, and Evaluation of a Working Set Dispatcher, " CACM, Vol. 16, No. k (Apr. 1973), PP. 2V7-253. Rogers, L. D., "Optimal Paging Strategies and Stability Considerations for Solving Large Linear Systems," Ph.D. Thesis University of Waterloo, Canada (1973). Saltzer, J. H., "A Simple Linear Model of Demand Paging Performance," CACM, Vol. 17, No. k (Apr. 197*0, PP- 181-186. Sayre, D., "Is Automatic Folding of Programs Efficient Enough to Displace Manual?" CACM Vol. 12, No. 12 (Dec. 1969), pp. 656-660. Schneck, P. B., "Automatic Recognition of Vector and Parallel Operations in a Higher Level Language, " Proc. ACM Nat. Conf. (Aug. 1972), pp. 772-780. Schonage, A., "Fast Schmidt Ortogonalization and Unitary Transformations of Large Matrices, " in Complexity of Sequential and Parallel Numerical Algorithms, Ed. J. F. Traub, Proc. Symp. (1973), Carnegie -Mellon University. Schwarz, H. R., Rutishauser, H. and Stiefel, E., Numerical Analysis of Symmetric Matrices, Prentice -Hall, Inc., Englewood Cliffs, N.J. (1973), translated by Hertelendy, P. 158 [SHED72] Shedler, G. S. and Tung, C, "Locality in Page Reference Strings," SIAM J. on Comput., Vol. 1, No. 3 (Sept. 1972), pp. 218-2*4-1. [SHEM66] Shemer, J. E. and Shippey, G. A., "Statistical Analysis of Paged and Segmented Computer Systems," IEEE, Vol. EC-15, No. 6 (Dec. 1966), pp. 855-863. [SHEM69] Shemer, J. E. and Gupta, S. C, "On the Design of Bayesian Storage Allocation Algorithms for Paging and Segmentation, " IEEE, Vol. C-18, No. 7 (July 1969), PP- 6^-651. [SISS68] Sisson, S. S. and Flynn, M., "Addressing Patterns and Memory Handling Algorithms, " APIPS FJCC, Vol. 33, No. 2, (1968), pp. 957-967. [SMIT67] Smith, J. L., "Multiprogramming under a Page on Demand Strategy," CACM, Vol. 10, No. 10 (Oct. I967), pp. 636-6^6. [STEV68] Stevenson, D. A. and Vermillion, W. H., "Core Storage as a Slave Memory for Disk Storage Devices," IFIP (1968), pp. F86-F9I. 
[VARI68] Varian, L. C. and Coffman, E. G., "An Empirical Study of the Behaviour of Programs in a Paging Environment, " CACM, Vol. 11, No. 7 (July 1968), pp. h'Jl-hfk. [WEIN72] Weinberg, G. M., "Programming and Compiling Strategies for Paging Systems, " Software-Practice and Experience, Vol. 2, 1972, pp. 165-171. [WEIZ69] Weizer, N. and Oppenheimer, G., "Virtual Memory Management in a Paging Environment," AFIPS SJCC (1969), Vol. J>k, pp. 270-271. [WILK63] Wilkinson, J. H., Rounding Errors in Algebraic Processes, Her Majesty's Stationery Office, London (1963). [WILK71] Wilkinson, J. H. and Reinsch, C, Linear Algebra, Springer Verlag, 1971. [YEL071] Yelowitz, L., "Toward Optimal Paginations for Structured Programs," Symposium on Computers and Automata, Polytechnic Institute of Brooklyn, 1971, pp. 225-236. 159 APPENDIX A PL/I PROGPAMS FOR THE SIMULATION OP SOME PAGING ALGORITHMS A.l LRU WS MIN We will present LRU_WS_MIN in this section. There are three modules: the first module declares all the global variables used by LRU_WS_MIN, the second module is the procedure LRU_WS_MEN and the third module computes the results. Note that LRU_WS_MIN Is a stack algorithm and therefore, in one pass of the reference string, obtains the page fault behavior for all possible values of the page allotment or the window size whichever is pertinent to the algorithm. 160 INIT((NA+1)(0)) V INITUNA + 1 )(UJJ, INITUNA+1) (0)), INIT((NA+l)(0)) f INIKOI , INITIO) » » t t f VARIABLES ♦ / , /♦ADDRESS SPACE SIZE ♦ / /♦LRU DI ST. COUNTER */ /♦REF. INT. FREQ. COUNTER*/ /♦MMC FREQ. COUNTER ♦ / /♦LRUSTACKPAGEXINOEX */ /♦LRUSTACK POINTER */ /♦REF.STR. LENGTH */ /♦MEMORY SIZE */ /♦LRU SUCC. FUNCTION */ /♦WS SUCC. FUNCTION ♦/ /♦MIN SUCC. FUNCTION ♦/ /♦THEIR PAGE FAULTS ♦/ /♦REF. INTERVAL COUNTER*/ /♦P -STACK ♦/ /♦P_STACK POINTER ♦/ /♦PAGE FAULT FREQENCIE*/ /♦AVE.WS SIZE ♦/ */ MODULEi:/*DECLARE ALL GLOBAL DC UNA LRUCNT(0:NAJ WSCNT (0:NA) MINCNT(0:NA) XCNA+1) CURSTACK LNGTH C NCLRU NTWS NCMIN PCLRUtPTWStPCMINt REFCNT(NA+1) IN IT ( ( NA«-1 ) ( 0) ) , P(NA+l) INITUNA*1)<0) J, P_POINT INIT(Oi , «) FIXED BINOlfO); DCL(MCLRU,MTWS»MCMIN, STAU ♦ ##) FLOAT BIN; M0DULE2:/* PROCEDURE LRU_WS_MIN LRU_WS_MIN:PROC(XX> ; /♦THIS PROC SIMULATES MATTSON-LRU-STACK-ALG. TOGETHER*/ /♦WITH GORDON-MODIFICATION TO GET WS-STATISTICS AND ♦/ /♦ONE PASS MIN CF.BELADY £ PALERMO ♦/ DCLCXX , /♦CURRENTLY REFERENCED PAGE ♦/ DIST INIT(O), /♦STACK DISTANCE*/ INT INIT<0)t /♦REF. INTERVAL ♦/ I,J,JP f MINP.MINPP) FIXED BINOLtO); LRU.WS: /*SEE R.L.GORDON ♦/ LNGTH=LNGTH+1; CURST ACK=CURSTACK*l; /♦PUT THIS PAGE ON TOP OF STACK */ X( CURSTACK J=XX;REFCNT( CURSTACK )*l; /♦SEARCH FOR THIS PAGE IN THE STACK ♦/ DO I-CURSTACK-1 TO 1 BY -1; IF X( U = XX THEN DO; /♦ FOUND ♦/ INT*REFCNT( I ) ;DI ST=CURSTACK-I ; /♦MOVE PART OF STACK ONE BELOW */ DO J=I TO CURSTACK-l BY i; X( J)=X(J*1);REFCNT(J)=REFCNT< J*l); END; END;ELSE REFCNT(IJ=REFCNT(I)-i-i; END; /♦ STEP LRU L WS COUNTERS ♦/ IF INT>NA THEN INT*NA;/*DONT TRUST WSCNT(NA)+/ LRUCNT(DISTJ=LRUCNT{DIST)+l;WSCNT(INT)=WSCNT(INT)+l; IF DIST=0 THEN DIST=CURSTACK-1 ; ELSE DO;CURSTACK=CURSTACK-l;DIST=OIST-l; END; /♦NEXT IS THE CODE FOR BELADYS M-OPERATOR ♦/ 161 MIN: IF 01ST*0 THEN RETURN; IF DIST>P_POINT THEN 00; /* A FRESH PAGE-REF. 
*/ P_P0INT*P_P01NT*1;P(P_P0INT>-DIST*1;RETURN; END; /♦FIND SMALLEST ELE BELOW DIST IN P-STACK */ MINPsP(l) ;J=l;LSMALL*P_POINT-DIST*i; DO 1*2 TO LSMALL BY 1; IF PU) < MINP THEN do;minp=pCURST-ACK THEN CURSTACK = K; END;/*CCDE FOR ACTICN=NORMAL ENDS ♦/ IF ACTICN=DFREE THEN /♦DECLARE PAGE XX DEAD IF IN MEM, »/ DO; FLAG'O; DO I»l TO CURSTACK WHI LE ( FLAG=0 ) ; IF PSTATE1 U=FULL L X(I)*XX THEN DO;FLAG=l;PSTATE( I)=EMPTY;K=I ;END; END; if flag=1 then if k=curstack then curst ack=curs tack- 1; eno freelru; mgdule3:/*result computation */ nc=o; DC C=l TO NA; NC = NC«-DISTCNT(C) ; FC =Nu/FLGAT < LNGTH J ; MC = LO-FC ;FAULTS=LNGTH-NC ; PUT SKIP DATA(C,NC, FAULTS, FC,MC); END; 166 167 k.k FREEDPREULRU In this section, we will present FREEDERE^LRU. This is not a stack algorithm and therefore, requires one pass of the reference string for each value of the page allotment c. There are three modules: the first module declares all the global variables, the second module is the procedure FREEDPRE^LRU and the third module computes and prints out the required statistics. 168 MODULEl: /'DECLARE ALL GLOBAL VARIABLES DCL(NA ,/*SIifc OF ADDRESS SPACE INITUNAUO) ), INITI (NA)O)), INITO) , /♦PAGfcTABLE /♦LRU STACK /♦PAGE REF. ♦ / ♦/ ♦ / ♦ / ♦ / DP4_PT(NA) DP4.STACMNA) NORMAL PREPA6E INIT(I) OFREE INITO) 0P4 POINT INITO) FAULTS INITIO) PULLS INITO) PRECMT INITIO) LNGTH INITO) Ct /♦MEM. SUE */ «) FIXED BINOlfO); M0OULE2:/*PROC FREEDPRE4LRU FREEDPRE4LRU:PR0C(ACTI0N,X); /♦ THIS PROC. OBTAINS #PAGE FAULTS £ /♦FOR FREEDPRE4LRU PAGING ALGORITHM DCHACTON, /'INDICATES WHETHER TO FRE E t PREEPAGE /*A REFERENCE TO PAGE X X, /♦CURRENTLV REFERENCED PAGE /♦LRUSTACK POINTER ♦ / /♦*OF PEGE FAULTS ♦ / /*#GF PAGE PJLLS */ /♦#OF PREPAGED PAGES IN MEM ♦/ /♦REF. STRING LNGTH*/ #PAGE PULLS */ ♦ / ♦ / OR*/ */ */ DCLCNOTPRESENT NOTSETUP NOTUSED USED I, J) FIXED DCL(LACTION(0:2) LPAGE (0:3) INITO), INIT(I), INITO), INITO), BINOlfO); INIT(LNORMAL,LPREPAGE,LFREE), INITI SQT_PRESENT,NOT_SETUP, M3T.USED,LUSE0)) LABEL; GOTO LACTION(ACTION); /*A REFERENCE TO PAGE X */ LNORMAL: LNGTH =LNGTH+i; GOTO LPAGE(DP4_PT(X)); NOT_SETUP:/*REQUIRED PAGE NOT YET SETUP IN MEM FAULTS*FAULTS*i;/* INC. PAGEFAULT COUNT »/ /♦PUT THIS PAGE ON STACK */ DP4_P0INT»DP4_P0INT*i ; DP4_STACK(0P4_PCINT)=X; DP4.PT (X)»USED;PREC^T*PRECNT-l; /♦DECLARE ALL NOTSETUP PAGES SETUP ♦/ DO 1=1 TO NA; IF DP4_PT(I)«N0TSETUP THEN D0;DP%_PT»C THEN 00; DP4_PTl, V t > 0. The proof is by induction on t. Assume inductively that, (1) U (c+1) = U t (c) + y where y = fi or y ± e N. (2) N.(c+1) = N t (c) + y 2 where y g = or y 2 e N. (3) S t (c+1) = S t (c) + y 3 where y^ = ft or y^ = y 1 or y^ = y 2 « (^) C0UNT(y, t, c) = C0UNT(y, t, c+l), Vy e N.(c). Clearly, all these assumptions are trivially true for t = 0. Now depending on r, , there are two cases to consider: t+1 case (l) r, -. = x e N. Now depending on x, there are four case to consider: l(a): x e N, (c+l) which implies x e N, (c) orx/ N, (c). Assume first that x e N (c). From step 1(a) of PRE3A, Vy € N t (c), C0UNT(y, t+1, c) = C0UNT(y, t, c) + 1 = COUNT (y, t, c+1) + 1 - COUNT (y, t+1, c+1). Let N t (c+1) = N t (c) + y 2 . Then- 171 {y|y e N t (c+1), COUNT (y, t+1, c+l) > T} = {y|y € N t (°)^ COUNT(y, t+1, c+l) > T) + if COUNT (y , t+1, c+l) > T then y 2 = {y|y € N t (c), COUNT(y, t+1, c) > T} + if COUNT (y , t+1, c+l) > T then y„. Also let U, (c+l) = U, (c) + y where y and y cannot both be nonzero. 
APPENDIX B

PROOF OF THEOREM 3.4

We show, for every page allotment c, that the following relations hold for all t ≥ 0. The proof is by induction on t. Assume inductively that

(1)  U_t(c+1) = U_t(c) + y_1, where y_1 = ∅ or y_1 ∈ N;
(2)  N_t(c+1) = N_t(c) + y_2, where y_2 = ∅ or y_2 ∈ N;
(3)  S_t(c+1) = S_t(c) + y_3, where y_3 = ∅ or y_3 = y_1 or y_3 = y_2;
(4)  COUNT(y, t, c) = COUNT(y, t, c+1) for all y ∈ N_t(c).

Clearly, all these assumptions are trivially true for t = 0. Now, depending on r_{t+1}, there are two cases to consider.

Case (1):  r_{t+1} = x ∈ N. Depending on x, there are four sub-cases to consider.

Case 1(a):  x ∈ N_t(c+1), which implies x ∈ N_t(c) or x ∉ N_t(c). Assume first that x ∈ N_t(c). From step 1(a) of PRE3A, for all y ∈ N_t(c),

    COUNT(y, t+1, c) = COUNT(y, t, c) + 1 = COUNT(y, t, c+1) + 1 = COUNT(y, t+1, c+1).

Let N_t(c+1) = N_t(c) + y_2. Then

    {y | y ∈ N_t(c+1), COUNT(y, t+1, c+1) > T}
        = {y | y ∈ N_t(c), COUNT(y, t+1, c+1) > T} + (y_2 if COUNT(y_2, t+1, c+1) > T)
        = {y | y ∈ N_t(c), COUNT(y, t+1, c) > T} + (y_2 if COUNT(y_2, t+1, c+1) > T).

Also let U_t(c+1) = U_t(c) + y_1, where y_1 and y_2 cannot both be nonnull. Therefore we have, from step 1(a) of PRE3A,

    U_{t+1}(c+1) = U_{t+1}(c) + (y_1 or y_2 or ∅),
    N_{t+1}(c+1) = N_{t+1}(c) + (y_2 or ∅).

This proves all the inductive steps.

Next assume x ∉ N_t(c). Now if N_t(c+1) = N_t(c) + y_2, then x = y_2. Also y_1 = ∅, and S_t(c+1) = S_t(c) + x. Therefore, for page allotment c we are in step 1(d) of PRE3A, and for page allotment c+1 we are in step 1(a) of PRE3A.

    U_{t+1}(c+1) = U_t(c+1) + x ∪ {y | y ∈ N_t(c+1), COUNT(y, t+1, c+1) > T}
                 = U_t(c) + x ∪ {y | y ∈ N_t(c), COUNT(y, t+1, c+1) > T}.

Also, since COUNT(y, t+1, c) = COUNT(y, t, c) + 1 and COUNT(y, t+1, c+1) = COUNT(y, t, c+1) + 1 for all y ∈ N_t(c), we have

    U_{t+1}(c+1) = U_t(c) + x ∪ {y | y ∈ N_t(c), COUNT(y, t+1, c) > T}.

But from step 1(d) this equals U_{t+1}(c) + R_A(U_t(c), q, x). Also,

    S_{t+1}(c+1) = S_t(c+1) = S_t(c) + x = S_{t+1}(c) + R_A(U_t(c), q, x).

From these, the required results follow.

Case 1(b):  x ∈ U_t(c+1), which implies x ∈ U_t(c) or x ∉ U_t(c). Assume first that x ∈ U_t(c), and let U_t(c+1) = U_t(c) + y_1. From step 1(b) of PRE3A,

    U_{t+1}(c+1) = U_t(c+1) ∪ {y | y ∈ N_t(c+1), COUNT(y, t+1, c+1) > T}
                 = U_t(c) + y_1 ∪ {y | y ∈ N_t(c), COUNT(y, t+1, c) > T} + (y_2 if COUNT(y_2, t+1, c+1) > T)
                 = U_{t+1}(c) + (y_1 or y_2 or ∅).

Also, S_{t+1}(c+1) = S_t(c+1) = S_t(c) + (y_1 or y_2) and S_{t+1}(c) = S_t(c); therefore S_{t+1}(c+1) = S_{t+1}(c) + (y_1 or y_2). This proves all the required relations.

Next assume x ∉ U_t(c), so that U_t(c+1) = U_t(c) + x, N_t(c+1) = N_t(c), and S_t(c+1) = S_t(c) + x. Therefore, for page allotment c+1 we are in step 1(b), and for page allotment c we are in step 1(d) of PRE3A.

    U_{t+1}(c+1) = U_t(c+1) ∪ {y | y ∈ N_t(c+1), COUNT(y, t+1, c+1) > T}
                 = U_t(c) + x ∪ {y | y ∈ N_t(c), COUNT(y, t+1, c) > T}
                 = U_{t+1}(c) + R_A(U_t(c), q, x).

Also, S_{t+1}(c+1) = S_t(c+1) = S_t(c) + x and S_{t+1}(c) = S_t(c) + x - R_A(U_t(c), q, x); therefore S_{t+1}(c+1) = S_{t+1}(c) + R_A(U_t(c), q, x). From these the required results follow.

Case 1(c):  x ∉ S_t(c+1) and N_t(c+1) = ∅ (hence, by (2), N_t(c) = ∅ as well); then we are in step 1(c) of PRE3A at both allotments. We have

    S_{t+1}(c+1) = U_{t+1}(c+1) = S_t(c+1) + x - R_A(U_t(c+1), q, x),
    S_{t+1}(c)   = U_{t+1}(c)   = S_t(c)   + x - R_A(U_t(c), q, x).

Let U_t(c+1) = U_t(c) + y_1. Now, since A is a stack algorithm, R_A(U_t(c+1), q, x) = R_A(U_t(c), q, x) or y_1. Therefore,

    S_{t+1}(c+1) = S_t(c+1) + x - (R_A(U_t(c), q, x) or y_1)
                 = S_t(c) + y_1 + x - (R_A(U_t(c), q, x) or y_1),

so that S_{t+1}(c+1) = S_{t+1}(c) + (R_A(U_t(c), q, x) or y_1). Also, since N_{t+1}(c+1) = N_{t+1}(c) = ∅, the required results follow.

Case 1(d):  x ∉ S_t(c+1), which implies x ∉ S_t(c). Therefore we are in step 1(d) of PRE3A.

    U_{t+1}(c+1) = U_t(c+1) + x - R_A(U_t(c+1), q, x) ∪ {y | y ∈ N_t(c+1), COUNT(y, t+1, c+1) > T}
                 = U_t(c) + y_1 + x - (R_A(U_t(c), q, x) or y_1) ∪ {y | y ∈ N_t(c), COUNT(y, t+1, c) > T}
                   + (y_2 if COUNT(y_2, t+1, c+1) > T).

Also,

    U_{t+1}(c) = U_t(c) + x - R_A(U_t(c), q, x) ∪ {y | y ∈ N_t(c), COUNT(y, t+1, c) > T}.

Since y_1 and y_2 cannot both be nonnull, we have

    U_{t+1}(c+1) = U_{t+1}(c) + (y_1 or y_2 or R_A(U_t(c), q, x)).

Also,

    S_{t+1}(c+1) = S_t(c+1) + x - R_A(U_t(c+1), q, x)
                 = S_t(c) + y_3 + x - (R_A(U_t(c), q, x) or y_1)
                 = S_{t+1}(c) + (y_1 or y_2 or R_A(U_t(c), q, x)).

From these, the required result follows.

Case (2):  r_{t+1} = pre(x). Since ω satisfies property P, we have x ∉ S_t(c) for any c. Assume first that S_t(c+1) = S_t(c) + y_3 with y_3 not null.

(i) If |S_t(c+1)| < c+1, then |S_t(c)| < c. Then

    N_{t+1}(c+1) = N_t(c+1) + x = N_t(c) + y_2 + x   and   N_{t+1}(c) = N_t(c) + x.

Also, since y_3 = y_2, N_{t+1}(c+1) = N_{t+1}(c) + y_3. Further,

    S_{t+1}(c+1) = S_t(c+1) + x = S_t(c) + y_3 + x   and   S_{t+1}(c) = S_t(c) + x,

so that S_{t+1}(c+1) = S_{t+1}(c) + y_3. This proves the required result.

(ii) If |S_t(c+1)| = c+1, then |S_t(c)| = c. Then S_{t+1}(c) = S_t(c) and S_{t+1}(c+1) = S_t(c+1), etc. Therefore all the required results follow.

Now assume that y_3 is null, i.e., S_t(c+1) = S_t(c).

(i) If |S_t(c)| < c, which implies |S_t(c+1)| < c+1, then the required results are trivial to prove.

(ii) If |S_t(c)| = c, which implies |S_t(c+1)| = c < c+1, then

    S_{t+1}(c+1) = S_t(c+1) + x = S_t(c) + x   and   S_{t+1}(c) = S_t(c),

which implies S_{t+1}(c+1) = S_{t+1}(c) + x. Also

    N_{t+1}(c+1) = N_t(c+1) + x = N_t(c) + x   and   N_{t+1}(c) = N_t(c),

which implies N_{t+1}(c+1) = N_{t+1}(c) + x. This proves the required result.

Since we have now covered all possible cases of r_{t+1}, the proof is complete.
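The inclusion property established here can also be checked empirically for a given paging algorithm by simulating it at successive allotments and comparing the memory states at every instant. The following sketch (Python, not the thesis code; PRE3A itself is not reproduced, so plain LRU is used, and the reference string is hypothetical) performs such a check:

    def lru_states(refs, c):
        """Yield the memory state (as a set) after each reference under LRU with allotment c."""
        stack = []
        for x in refs:
            if x in stack:
                stack.remove(x)
            elif len(stack) >= c:
                stack.pop(0)                      # evict the least recently used page
            stack.append(x)
            yield set(stack)

    def satisfies_inclusion(refs, cmax):
        """Check that S_t(c) is a subset of S_t(c+1) for every t and every 1 <= c < cmax."""
        for c in range(1, cmax):
            for small, large in zip(lru_states(refs, c), lru_states(refs, c + 1)):
                if not small <= large:
                    return False
        return True

    if __name__ == "__main__":
        refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]   # hypothetical reference string
        print(satisfies_inclusion(refs, cmax=6))      # True: LRU has the stack property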
APPENDIX C

PL/I PROGRAMS FOR SEVERAL MATRIX ALGORITHMS

C.1  Cholesky Decomposition

[PL/I listing of CDSRFP(A, NSMALL, MSMALL), given only in outline here. This procedure factors the symmetric positive-definite matrix A (of order NSMALL) into the product of a lower triangular and an upper triangular matrix. It is a submatrix algorithm, with submatrices of order MSMALL, using loop reversal, freeing and prepaging. Within the major loop over the N = NSMALL/MSMALL block columns it frees the dead subarrays A<I-1,J> of the previous block row, prepages the subarrays A<K,I> of the current block column, carries out D = D - R*R' on the diagonal block, performs the Cholesky decomposition of the diagonal block by a call to CD, inverts the strict lower triangular part of D (storing it in the strict upper triangular part), and finally multiplies the subdiagonal blocks by the inverse of G.]

C.2  LU Decomposition

[PL/I listing of LUSRFP(A, NSMALL, MSMALL), in outline. This procedure computes the LU decomposition of the matrix A of order NSMALL and stores L and U in A; it employs a submatrix algorithm with loop reversal, freeing and prepaging. In the major loop it frees the dead subarrays A<I,K-1> and A<K-1,I>, prepages the subarrays of the current block column and block row, forms D = D - R*X, performs the LU decomposition of the diagonal block by a call to LU, computes the inverses of its lower and upper triangular factors, and completes the elimination on the remaining subarrays of the block row and block column.]
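The listings above work on explicit MSMALL-by-MSMALL submatrices. The following sketch (Python with NumPy, not the thesis PL/I; the names free, prepage and blocked_cholesky, the block size and the test matrix are all invented for illustration, and free/prepage are left as hypothetical stubs standing for the storage-management calls) shows the same submatrix organization of the Cholesky decomposition:

    import numpy as np

    def free(a, i, j):
        """Hypothetical stub: tell the memory manager that block (i, j) is dead."""
        pass

    def prepage(a, i, j):
        """Hypothetical stub: ask the memory manager to pull block (i, j) ahead of use."""
        pass

    def blocked_cholesky(a, m):
        """In-place lower Cholesky factor of the symmetric positive-definite matrix a,
        computed one block column at a time with block size m.  A sketch of the
        submatrix organization only, not the thesis code."""
        n = a.shape[0] // m
        def blk(i, j):
            return a[i*m:(i+1)*m, j*m:(j+1)*m]
        for i in range(n):
            for j in range(i):
                free(a, i - 1, j)            # blocks of block row i-1 are not referenced again
            for k in range(i, n):
                prepage(a, k, i)             # the current block column is about to be used
            for j in range(i):               # D = D - R*R'
                blk(i, i)[:] -= blk(i, j) @ blk(i, j).T
            blk(i, i)[:] = np.linalg.cholesky(blk(i, i))
            for k in range(i + 1, n):        # update and scale the blocks below the diagonal
                for j in range(i):
                    blk(k, i)[:] -= blk(k, j) @ blk(i, j).T
                blk(k, i)[:] = np.linalg.solve(blk(i, i), blk(k, i).T).T
        return np.tril(a)

    if __name__ == "__main__":
        m, nb = 2, 3
        n = m * nb
        x = np.random.rand(n, n)
        a = x @ x.T + n * np.eye(n)          # symmetric positive-definite test matrix
        l = blocked_cholesky(a.copy(), m)
        print(np.allclose(l @ l.T, a))       # True

The point of the blocking is the one made in Chapter 4: each block touches a small, contiguous set of pages, so the working set during the major loop is a few submatrices rather than whole rows and columns of A.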
C.3  Gaussian Elimination

[PL/I listing of GOSSRFP, the submatrix Gaussian elimination with loop reversal, freeing and prepaging, in outline. All of A is prepaged at the outset when the memory allotment permits; in the major loop over block columns the procedure frees the dead subarrays A<I,K-1> and A<K-1,I>, performs Gaussian elimination on the diagonal block by a call to GOS, and completes the elimination on the subarrays R and C of the current block row and block column, reversing the direction of the inner loops on alternate passes (according to the parity of the enclosing loop index) to preserve locality.]

C.4  Gram-Schmidt Orthogonalization

[PL/I listing of ORTHSRFP, the submatrix Gram-Schmidt orthogonalization with loop reversal, freeing and prepaging, in outline. The columns of A are orthogonalized MSMALL vectors at a time; in the major loop the procedure frees the dead subarrays of A and of the triangular factor R, prepages the subarrays of R that are about to be produced, orthogonalizes the current group of vectors and computes the corresponding columns of R, and then updates the remaining columns of A (A<3> = A<3> - A<2>*R<2,3>), again reversing the direction of the inner loops on alternate passes.]

C.5  Matrix Multiplication

[PL/I listing of MMSRFP, the submatrix matrix multiplication with loop reversal, freeing and prepaging, in outline. All of C is prepaged at the outset; in the major loop the procedure frees the dead subarrays of A and B from the previous block row, prepages the subarrays of A and B needed for the current block row, and accumulates the block products B<I,K>*C<K,J> into A<I,J>, reversing the direction of the J and K loops on alternate passes.]
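Each of the preceding listings uses the loop-reversal device: when an inner loop is executed repeatedly, its direction is alternated with the parity of the enclosing loop index, so that the pages touched at the end of one sweep are exactly the pages touched at the start of the next. A minimal sketch of the transformation (Python, with an invented array and workload, not the thesis code):

    def sweep_forward_only(a, passes):
        """Conventional form: every pass walks the columns 0..n-1 in the same order."""
        n = len(a)
        for p in range(passes):
            for j in range(n):
                a[j] += 1          # stands for work that touches the page holding column j

    def sweep_with_reversal(a, passes):
        """Loop reversal: odd-numbered passes walk the columns backwards, so the columns
        referenced at the end of pass p are referenced first in pass p+1."""
        n = len(a)
        for p in range(passes):
            cols = range(n) if p % 2 == 0 else range(n - 1, -1, -1)
            for j in cols:
                a[j] += 1

    if __name__ == "__main__":
        a = [0] * 8
        sweep_with_reversal(a, passes=3)
        print(a)                   # every column updated 3 times, as in the forward-only form

The two forms do the same work; only the order of page references, and hence the fault behavior under a fixed allotment, differs.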
C.6  CDS

[PL/I listing of CDS(A, NSMALL, MSMALL), in outline. This procedure factors the symmetric positive-definite matrix A (of order NSMALL) into G*G', where G is lower triangular; it is the submatrix algorithm alone, without loop reversal, freeing or prepaging. For each block column it computes D = D - R*R', performs the decomposition of the diagonal submatrix D by a call to CD, inverts the strict lower triangular part of D (storing it in the strict upper triangular part), and multiplies the subdiagonal blocks by the inverse of G.]

APPENDIX D

ALGORITHM AP

[The opening of the appendix, Step AP-1 and the beginning of Step AP-2, which scan the program during syntax analysis and build the USE-LIST of array and subarray references, is not legible in this copy. The recoverable portion of Step AP-2 follows.]

    if CP[B] = null then /* first occurrence */
        do; Create a node of type SUBARRAY, pointed to by P, and link it to the USE-LIST;
            set IP[B] = CP[B] = P; NAME[P] = B; LP[P] = null;
            X[P] = (r_{I-1}, r_I, c_{J-1}, c_J); KC[P] = null;
        end;
    else /* not first occurrence */ CP[B] = P;

    [D] if SCAN = 'END' then /* end of a FOR loop */
        do; Create a node of type FOR-END, pointed to by P, and link it to the USE-LIST;
            link this node to the k-chain of the corresponding FOR loop and
            set BP[P] = pointer to the matching FOR-BEGIN node;
        end;

Step AP-3 [This step modifies the USE-LIST to add dead subarrays, lifetime pointers, corrected lifetimes and end-of-scope nodes. This is done in a post-syntax-analysis, precoding phase]:

    [A] [fill in lifetime pointers and create end-of-lifetime nodes]:
    Scan the USE-LIST sequentially.
    if SCAN = ARRAY[P] or SUBARRAY[P] then
        do; B = NAME[P]; LP[P] = CP[B];
            Create a node of type END-OF-LIFETIME, pointed to by Q, and insert it in the
            USE-LIST at LP[P]; set RP[Q] = P;
        end;

    [B] [link dynamic subarrays on the k-chain]:
    Scan the USE-LIST sequentially.
    if the current nest level l > 0 then
        if SCAN = SUBARRAY[P] then
            do; B = NAME[P];
                for all 'FOR' loops f do;
                    if S(f) ∩ L(B) ≠ φ then
                        do; k = NAME[f]; if P(B, k) = T then link B to the k-chain of loop f; end;
                end;
            end;
    [C] [modify the USE-LIST to take care of overlapping lifetimes, introduce end-of-scope
        nodes, and enter dead subarrays]:

    (1) [modify overlapping lifetimes]:
    for all arrays A such that S(A) ≠ φ do;
        /* let l(B) be the level of a subarray B of A; here the level means the level in the
           tree structure of A and its subarrays */
        l_min = min over B ⊂ A of l(B);  l_max = max over B ⊂ A of l(B);
        /* note that l(A) = 1 and l(A<i,j>) = 2 */
        for each level j from l_min to l_max - 1 do;
            for all D ⊂ A such that l(D) = j do;
                let Q be the pointer to the node of D in the USE-LIST;
                let the lowest-level ancestor of D in the USE-LIST be denoted by E, and let P
                point to its node in the USE-LIST;
                if (IP[E] > IP[D]) and (LP[P] > LP[Q]) then
                    do; CP[D] = LP[Q] = LP[P]; delete the END-OF-LIFETIME node of D; end;
                if (LP[P] < LP[Q]) then
                    do; CP[E] = LP[P] = LP[Q]; delete the END-OF-LIFETIME node of E; RP[Q] = P;
                if (IP[E] < IP[D]) then
                    do; delete node D from the USE-LIST; CP[D] = IP[D] = null;
    end [C](1);

    (2) [insert the END-OF-SCOPE nodes]:
    for all arrays A such that there is a subarray B of A with B ∈ USE-LIST do;
        set (P, Q) = S(A);
        Create a node of type END-OF-SCOPE and insert it in the USE-LIST at Q; NAME[Q] = A;
    end;

    (3) [create dead subarrays and link them to the USE-LIST]:
    for all arrays A ∉ USE-LIST such that there is a subarray B of A with B ∈ USE-LIST do;
        j = max { l(B) : B ⊂ A, B ∈ USE-LIST, there is a k such that P(B, k) = T };
        construct the sequence of subarrays of A at level j; name this sequence DEAD-LIST(A);
        if (B ∈ DEAD-LIST(A)) and ((B ∈ USE-LIST) or (for all k, P(B, k) = F)) then
            delete B from DEAD-LIST(A);
        for each FOR loop f do;
            for each B ∈ DEAD-LIST(A) do;
                let the loop-control variable of loop f be k; let Q point to the FOR-BEGIN node of f;
                if S(f) ∩ S(A) ≠ φ then
                    if P(B, k) = T then
                        do; create a node of type DEAD-SUBARRAY, pointed to by P, and link it to
                            the k-chain of f; NAME[P] = B; X[P] = (r_{I-1}, r_I, c_{J-1}, c_J);
                        end;
    end Step AP-3;

Step AP-4 [use of the USE-LIST in the coding phase to do freeing and prepaging]:

    [A] [freeing]:
    if at lexical level l = 0 then on scanning an END-OF-LIFETIME node of B do;
        if B is an array then generate the instruction:  'FREE(B)';
        if B is a subarray then generate the instructions:
            'IF for all x ∈ X, MOD(x, m) = 0 THEN FREE(B)';
    end;
    on scanning an END-OF-SCOPE node of A, generate the instruction:  'FREE(A)';
    if at level l > 0 then on scanning an END-OF-LIFETIME or END-OF-SCOPE node of B, do the same
    as above, except that the code is inserted after the level l is reduced to zero, i.e., after
    exit from the outermost loop;

    [B] [prepaging outside the loop]:
    if at level l = 0 then generate instructions to prepage a few nodes in advance;
    if at level l > 0 then, prior to generating code for the loop, generate instructions to
    prepage all the (sub)arrays in the USE-LIST from the FOR-BEGIN node to the FOR-END node
    along the static link;

    [C] [freeing within a loop]: (level l > 0)
    for all dead subarrays B along the dynamic chain of the loop, generate the following code
    just after the backward branch target of the loop:
        'IF ((for all x ∈ X_1, f_{k_{l+1}}(x) < 0) AND (for all x ∈ X_2, f_{k_{l+1}}(x) > 0)) THEN
             IF for all x ∈ X, MOD(x(k_{l+1}), m) = 0 THEN FREE(B)';

    [D] [incremental prepaging within a loop]: (level l > 0)
    for all dynamic subarrays B along the dynamic chain of the loop, insert the following code
    after the backward branch target of the loop:
        'IF ((there exists x ∈ X_1 such that f_{k_{l+1}}(x) = -1) OR
             (there exists x ∈ X_2 such that f_{k_{l+1}}(x) = 1)) THEN
             IF there exists x ∈ X such that MOD(x(k_{l+1}), m) = 0 THEN PREPAGE(B)';

end ALG-AP;
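The essential bookkeeping of Algorithm AP, recording first and last uses and deriving FREE and PREPAGE directives from them, can be illustrated by a much simpler sketch. The Python below is hypothetical (its names are invented, and it handles only a straight-line sequence of block references, not nested FOR loops, k-chains, or overlapping lifetimes): it builds a small use list in one scan and then emits a PREPAGE just before each block's first use and a FREE just after its last use.

    def plan_directives(refs):
        """refs: straight-line sequence of (array, block) references, e.g. ('A', (1, 2)).
        Returns the sequence with PREPAGE inserted before first uses and FREE inserted
        after last uses -- a toy analogue of Steps AP-3 and AP-4."""
        first, last = {}, {}
        for t, blk in enumerate(refs):               # one scan fills the 'use list'
            first.setdefault(blk, t)
            last[blk] = t
        out = []
        for t, blk in enumerate(refs):
            if first[blk] == t:
                out.append(("PREPAGE", blk))         # pull the block just before its first use
            out.append(("REF", blk))
            if last[blk] == t:
                out.append(("FREE", blk))            # the block is dead after its last use
        return out

    if __name__ == "__main__":
        trace = [("A", (0, 0)), ("A", (1, 0)), ("A", (0, 0)), ("B", (0, 0))]
        for action in plan_directives(trace):
            print(action)

In the thesis setting the prepage directives are of course issued earlier than "just before first use", and the freeing tests involve the loop-control variables as shown in Steps AP-4 [C] and [D]; the sketch only shows where the information comes from.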
VITA

Kishor Shridharbhai Trivedi was born in Bhavnagar, Gujarat State (India), on August 20, 1946. He received the Bachelor's degree in Electrical Engineering from the Indian Institute of Technology, Bombay (India), in 1968. Between 1968 and 1970 he was employed first by Larsen and Toubro India Limited as an Apprentice Engineer and then by International Business Machines (WTC) of India as a Customer Engineer. He joined the University of Illinois, Urbana, as a graduate research assistant in the Department of Computer Science in 1970. He conducted research in the area of Digital Computer Arithmetic under the guidance of Professor James E. Robertson and obtained his Master's degree in Computer Science in 1972. Since then, he has conducted research in the area of Operating Systems and Numerical Analysis under the guidance of Professor J. Richard Phillips. He is the coauthor, with J. E. Robertson, of a paper entitled "The Status of Investigations into Computer Hardware Design Based on the Use of Continued Fractions," which was published in IEEE Transactions on Computers, June 1973.

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-74-662
Title and Subtitle: PREPAGING AND APPLICATIONS TO STRUCTURED ARRAY PROBLEMS
Report Date: July 1974
Author: Kishor Shridharbhai Trivedi
Performing Organization Report No.: UIUCDCS-R-74-662
Performing Organization Name and Address: University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, Illinois 61801
Contract/Grant No.: US NSF GJ-328
Sponsoring Organization Name and Address: National Science Foundation, Washington, D.C.
Type of Report and Period Covered: Doctoral, 1974

Abstract: A demand prepaging algorithm is defined and proved to be an optimal demand prepaging algorithm. A variable-memory prepaging algorithm FWS is also defined, based on Denning's WS algorithm; it is shown that FWS incurs zero page faults. Both of these algorithms, however, cannot be used in practice, since they require that the reference string be known in advance; we therefore describe several practical prepaging algorithms. Data paging is of primary concern for problems with large data bases and for many types of array problems. In particular, we show that prepaging reduces the paging problems of array algorithms operating on large arrays. We also show that the use of a submatrix algorithm considerably improves the locality. Finally, we consider the automation of these performance-improvement techniques by means of a compiler in the context of a structured array language.

Key Words and Document Analysis (Descriptors): Array Language; Compiler; Demand Paging; Locality; Matrix Algorithm; Multiprogramming; Paging; Performance; Prepaging; Virtual Memory
Availability Statement: RELEASE UNLIMITED
Security Class (This Report / This Page): UNCLASSIFIED
No. of Pages: 200