H 
 
 IP 
 
 HHmBh Hals 
 
 HnKSMnHflnN 
 IHlffiSHffiilll 
 
 
 -XL.- 
 
 HBBHf1 IWHHrafflfl 
 
 mill 
 
LIBRARY OF THE 
 
 UNIVERSITY OF ILLINOIS 
 
 AT URBANA-CHAMPAICN 
 
 510. 84 
 ttft- 
 
 no. G6I-666 
 cop. 2 
 
CENTRAL CIRCULATION BOOKSTACKS 
 
 The person charging this material is re- 
 sponsible for its renewal or its return to 
 the library from which it was borrowed 
 on or before the Latest Date stamped 
 below. You may be charged a minimum 
 fee of $75.00 for each lost book. 
 
 Theft, mutilation, and underlining of books are reasons 
 for disciplinary action and may result In dismissal from 
 the University. 
 
 TO RENEW CALL TELEPHONE CENTER, 333-8400 
 UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN 
 
 AUG 1 4 1996 
 
 Allfi 8 1996 
 
 When renewing by phone, write new due date below 
 previous due date. L162 
 

 Report No. UIUCDCS-R-72j.-662 
 
 NSF - OCA - GJ-328 - 000002 
 
 PREPAGING AND APPLICATIONS TO STRUCTURED ARRAY PROBLEMS 
 
 by 
 
 Kishor Shri&harbhai Trivedi 
 
 July 1974 
 
 JHE UBRARY OF 
 
 SEP 2 3 1974 
 
 UNIVERSITY OF ILUNOtS 
 
 DEPARTMENT OF COMPUTER SCIENCE 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
 
 URBANA, ILLINOIS 
 
Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/prepagingapplica662triv 
 
Report No. UIUCDCS-R-74-662 
 
 PREPAGING AND APPLICATIONS TO STRUCTURED ARRAY PROBLEMS* 
 
 by 
 
 Kishor Shridharbhai Trivedi 
 
 July 1974 
 
 Department of Computer Science 
 University of Illinois at Urbana-Champaign 
 Urbana, Illinois 61801 
 
 This work was supported in part by the National Science Foundation 
 under Grant No. US NSF GJ-328 and was submitted in partial ful- 
 fillment of the requirements for the degree of Doctor of Philosophy 
 in Computer Science, July 197 1 !-. 
 
Ill 
 
 ACKNOWLEDGMENT 
 
 The author wishes to express his gratitude to Professor J. Richard 
 Phillips for his guidance during the preparation of this thesis. 
 
 I would like to thank the National Science Foundation and the 
 Department of Computer Science for financial support. Thanks are also due 
 to Mrs. Vivian Alsip and Mrs. June Wingler for typing this thesis. 
 
 Finally, I would like to thank my wife, Kalpana, for her encourage- 
 ment and patience. 
 
IV 
 
 TABLE OF CONTENTS 
 
 Page 
 
 1. INTRODUCTION 1 
 
 2. GENERAL PAGING 4 
 
 2.0 Introduction „ 4 
 
 2.1 Performance Measures 6 
 
 2.2 Important Variables 8 
 
 2.2.1 Multiprogramming 9 
 
 2.2.2 Main Memory Allotment 10 
 
 2.2.3 The Page Size 11 
 
 2. 2. ^ The Page-Fetch Time 12 
 
 2.2.5 Program Behavior 13 
 
 2.2.6 Pagination 18 
 
 2.2.7 Paging Algorithms 18 
 
 2. 3 The Formalism of Paging Algorithms 20 
 
 2. 3. 1 Cost of a Paging Algorithm 23 
 
 2.4 Working Set Algorithm 24 
 
 2. 5 Performance Measurement 26 
 
 2.5.1 Stack Algorithms [COFF73, MATT7O] 27 
 
 2.5.2 King's Model 30 
 
 3. GENERAL PREPAGING ALGORITHMS 32 
 
 3.0 Introduction . . . „ 32 
 
 3. 1 The Optimal Demand Prepaging Algorithm 34 
 
 3.1.1 Performance Measurement for DPMIN 39 
 
 3.2 Realizable Prepaging Algorithms 41 
 
 3.2.1 Freeing Dead Pages J+l 
 
 3.2.2 Prepaging 44 
 
 3.2.2.1 Joseph's OBL Algorithm [J0SE70]. ... 56 
 
 3.3 FWS Algorithm. 60 
 
 4. IMPROVING LOCALITY OF ARRAY PROGRAMS 68 
 
 4. Introduction 68 
 
 4 o 1 Cholesky Decomposition 69 
 
 If. 2 Other Matrix Algorithms 80 
 
 4.2.1 Matrix Multiplication 86 
 
 4.2.2 LU Decomposition, 86 
 
 4.2«3 Gaussian Elimination 89 
 
 4.2.4 Gram-Schmidt Orthogonalization „ 94 
 
 4.3 Conclusion „ „ 96 
 
V 
 
 Page 
 
 1+. k A Note About Measurement of Performance 100 
 
 5. APPLICATION OF FEEEING AND PREPARING TO ARRAY PROGRAMS . 101 
 
 5.0 Introduction 101 
 
 5.1 Cholesky Decomposition . . 102 
 
 5.2 Other Examples . . Ill 
 
 5.2.1 LU Decomposition Ill 
 
 5.2.2 Gaussian Elimination . . . . 112 
 
 5. 2. 3 Gram-Schmidt Orthogonalization 114 
 
 5.2.1+ Matrix Multiplication 116 
 
 5.3 Combined Effect of Locality Improvement 
 
 and Working-Set Identification Methods 120 
 
 6. AUTOMATION OF PERFORMANCE IMPROVEMENT TECHNIQUES .... 126 
 
 6.0 Introduction. 126 
 
 6. 1 Transformations Which Improve Locality 127 
 
 6.1.1 Loop Reversal. . . . . . . 127 
 
 6.1.2 Loop Decomposition „ 132 
 
 6. 1. 3 Submatrix-Multiplication-Transformation. . . I33 
 6.1.1+ Conversion From a Non-Submatrix to a 
 
 Submatrix Algorithm 133 
 
 6.2 Freeing and Prepaging I32+ 
 
 7. CONCLUSION AND FUTURE RESEARCH 11+9 
 
 7.0 Conclusion 11+9 
 
 7.1 Future Research 151 
 
 LIST OF REFERENCES 153 
 
 APPENDIX A - PL/l PROGRAMS FOR THE SIMULATION OF SOME PAGING 
 
 ALGORITHMS ..... 159 
 
 APPENDIX B - PROOF OF THEOREM 3. 1+ 170 
 
 APPENDIX C - PL/l PROGRAMS FOR SEVERAL MATRIX ALGORITHMS .... I76 
 
 APPENDIX D - ALGORITHM AP 189 
 
 VITA 197 
 
VI 
 
 LIST OF FIGURES 
 
 Page 
 
 If.l Submatrix Storage 70 
 
 If. 2 Cholesky Decomposition 72 
 
 If . 3 Cholesky Decomposition with Reversal 73 
 
 if. If Localities of CD and CDR 75 
 
 If. 5 OL/2 Program for Cholesky Decomposition 76 
 
 If. 6 Cholesky Decomposition with Submatrix Multiplication .... 77 
 
 If. 7 Localities of CD and CDM 78 
 
 If. 8 Localities of CD and CDS 81 
 
 If. 9 Localities of CD, CDR, CDM and CDS 82 
 
 If. 10 Page Faults Using WS Algorithm 83 
 
 If. 11 Page Faults Using LRU Algorithm 8 If 
 
 If. 12 Page Faults Using MOT Algorithm 85 
 
 If. 13 Localities of MM, MMR and MMS 87 
 
 If.llf Localities of LU, LUR, LUM and LUS 90 
 
 If. 15 Localities of GOS, GOSR, GOSM and GOSS 93 
 
 If. 16 Localities of ORTH, ORTHR, ORTHM and ORTHS 97 
 
 If. 17 Page Faults for CD, LU and GOS 99 
 
 5.1 Cholesky Decomposition with Freeing lOIf 
 
 5.2 Page Faults Using LRU, FREELRU and MIN Algorithms 10 5 
 
 5.3 Page Faults Using LRU, FREEDPREIfLRU, MIN and DPMIN 
 
 Paging Algorithms 109 
 
 5. If Page Pulls Using LRU, FREEDPREIfLRU, MIN and DPMIN 
 
 Paging Algorithms 110 
 
 5.5 Effect of Prepaging on LU Decomposition 113 
 
Vll 
 
 Page 
 
 5.6 Effect of Prepaging on Gaussian Elimination 11 5 
 
 5.7 Effect of Prepaging on Orthonormalization 117 
 
 5.8 Effect of Prepaging on Matrix Multiplication 119 
 
 5.9 Page Faults for CD and CDSRFP 121 
 
 5.10 Page Faults for LU and LUSRFP 122 
 
 5.11 Page Faults for GOS and GOSSRFP 123 
 
 5.12 Page Faults for ORTH and ORTHSRFP 12^ 
 
 5.13 Page Faults for MM and MMSRFP 125 
 
 6.1 The Partition Lines „ 11^5 
 
1. INTRODUCTION 
 
 Paged virtual memory systems (PVMS) were introduced by Kilburn 
 et al [KILB62] in the Ferranti Atlas computer. The objective of PVMS is 
 to relieve the programmer of the burden of storage management. However, 
 the cost, in terms of overhead, and performance degradation was, at the 
 beginning, thought to be high [FINE66, KUEH.68] . Nevertheless, others 
 argued in favor of PVMS [DENN70,SAYP69], and PVMS are well accepted and 
 they have proved to be useful in practice [DENN70]. But the effectiveness 
 and the overhead leave much to be desired [MASU72,0RGA72] . This is 
 especially true in terms of the changing technology with very large 
 memories and the types of problems with large data bases and large arrays. 
 
 In this thesis, we are concerned with the performance of programs 
 running under a PVMS and with techniques and algorithms that include 
 prepaging as a more viable solution for the computer systems of the 
 future. 
 
 In Chapter 2, we introduce the terminology used and survey the 
 relevant literature. We identify variables of a PVMS which affect the 
 performance of its programs. We discuss the question of what performance 
 measures to use. We then study the effect of each of the variables of 
 PVMS on the performance, from the available literature. We focus our 
 attention on two important variables of a PVMS, namely, the paging 
 algorithm used by the system and the locality of the programs. We note 
 that the most popular paging algorithms are demand paging algorithms 
 
chiefly because of their simplicity in implementation and because little 
 is known about prepaging. The property of locality has been observed in 
 practical programs and is, perhaps, the chief reason of the practicability 
 of PVMS [DENN70], For special applications, people have given methods 
 to write "more local" programs. Compiler implemented locality improvement 
 methods have restricted them to code and not data. Further, these methods 
 only do the physical reorganization of code and not a logical reorganization. 
 
 In Chapter 3, we present several new paging algorithms, same of 
 them being prepaging algorithms. We show why prepaging is useful. We 
 present a new algorithm, DPMIN, which is a demand -prepaging algorithm, and 
 prove that it is an optimal demand -prepaging algorithm. However, DPMIN 
 cannot be implemented in practice since it requires that the program's 
 reference string be known in advance. We also present several practical 
 prepaging algorithms. We then present a variable-memory prepaging algorithm 
 called PWS, which is based on Denning' s WS algorithm and prove that it 
 incurs zero page faults. PWS algorithm is also impractical in the same 
 sense as DPMIN and is only useful for theoretical purposes. We also 
 study the question of performance measurement while using these prepaging 
 algorithms, in particular, we study whether or not the proposed paging 
 algorithms are stack algorithms [MA.TT70] . 
 
 In Chapter h, we show how to improve the locality of matrix 
 algorithms. This includes a logical reorganization of the program and 
 its application to several common matrix algorithms. We also measure 
 and compare the average working set size of programs and show that this 
 is a reasonable way to measure locality. 
 
In Chapter 5> we introduce prepaging in matrix algorithms. By- 
 using the prepaging algorithms proposed in Chapter 3> we show how to 
 improve the performance of the matrix algorithms. 
 
 In Chapter 6, we discuss the automation of the performance-improvement 
 techniques of Chapters h and 5. Prepaging applied to matrix algorithms can 
 be implemented in a compiler but not in an operating system. Eventually 
 one would hope that these techniques would find their way into some of 
 the compilers of the future and thereby provide efficient prepaging that 
 cannot be obtained in other ways. 
 
2. GENERAL PAGING 
 
 2.0 Introduction 
 
 In a paged virtual memory system (PVMS), a program's address space 
 is divided into fixed-size blocks of addresses called pages and the main 
 memory (MM) is divided into matching size blocks of locations called 
 page frames. 
 
 We will let N = {1, 2, . .., n) denote a program's address space 
 and M = {1, 2, . .., c} denote the set of MM page frames allot ed to the 
 program, and generally 1 < c < n. When a program is executing, it makes 
 a sequence of references to its address space N. We will denote such a 
 sequence by w = r,, r p , . .., r., ... r , where r e N, t > 1. We will 
 
 let |w| denote the number of references in w, N denote the set containing 
 
 T iii 
 
 the null string A, N = {w = v- f r p , . .., r , ... ! |w| = T), W denote 
 
 the set of all finite strings on N and N = N* - {A} . 
 
 Now since the CPU (central processing unit) can only refer to 
 
 the MM, we must interpose a mapping mechanism between the CPU and the MM. 
 
 The address map f, then, is a function, f : N -» M. Since it is possible 
 
 that c < n, f may be a partial function as follows : 
 
 r j if page i resides in page frame j 
 f(i) = 
 
 ^ undefined otherwise 
 
The function f is also dynamic, since in general, a program may refer 
 to any of the pages of N. If the page size is p, a valid virtual address 
 is an integer a, < a < np. Similarly, a valid MM address is an 
 integer £, < p < cp. When presented with the virtual address Ot, the 
 address translation mechanism (MAP) obtains (i, £) such that Oi = (i-l) p + £ 
 where < £ < p, and then generates memory address p = [f(i)-l] p + £ if 
 f(i) is defined and a 'page fault' if f(i) is undefined. A page fault 
 interrupts the execution of the program and the program makes a transition 
 from running state to the page-wait state. If we assume that the whole 
 of the program's address space resides on a secondary memory device such 
 as a disc, a drum or a bulk core store, then upon the occurrence of a 
 page fault, the memory manager has to bring the required page from 
 secondary memory to MM. The program waits for the required page for 
 the duration of one page fetch time, assumed to be, on the average, T 
 time units. If the MM was originally full, the above process will also 
 force us to choose a page, already resident in MM, to be replaced. The 
 process of page replacement may also involve a page push from MM to the 
 secondary memory if it was written into during its last stay in MM. 
 After the required page has been brought into MM, the interrupted 
 program is in the ready state, ready to execute on a CPU if one is 
 available. During the time of the above page transfer, the CPU will 
 remain idle unless we resort to multiprogramming. In multiprogramming, 
 we try to mask a long page transfer time by overlapping I/O and execution. 
 After a page fault interrupt, the CPU is allot ed to another ready 
 program if one is available. If the system does not allow multi- 
 programming, then it is a monoprogrammed system. 
 
It has been observed in practice that a page fault incurs a 
 significant amount of overhead. As an example, Masuda [MASU72] reports 
 that in his system, more than 50</ of running time of the operating system 
 is spent servicing page fault interrupts and 30% of the time is a pure 
 idle time during which the cpu has no task to execute and is waiting 
 for the interrupt of the I/O termination. The situation is not much 
 different in other PVMS [0RGA72]. Thus it seems that, it is very important 
 to study the performance of a PVMS and find methods to improve performance. 
 Improving performance of a program running under PVMS will obviously 
 improve the performance of the PVMS. We will investigate the performance 
 of programs running under a PVMS. 
 
 2.1 Performance Measures 
 
 To study the performance, we need a set of measures. Furthermore, 
 performance as seen by the user and as seen by the system are two different 
 ways to study this topic. From the point of view of system efficiency, 
 the number of page faults ir, the number of page pulls C, and space time 
 product ST, of MM used are three important performance measures that we 
 will consider. From the user point of view, the cost of running the 
 program and the turnaround time are important performance measures. In 
 a monoprogrammed system, user oriented measures are directly related to 
 the number of page faults, thus in such a system it will be sufficient 
 to study the page fault measure. In a multiprogrammed system, user 
 oriented measures cannot be so easily related to the system oriented 
 measures. In most systems page faults or page pulls do not occur 
 explicitly in the cost charged to the user. However, a few milliseconds 
 
7 
 of cpu time per page fault is generally required. Since the cost is 
 generally determined by the cpu time used and memory usage, we see that 
 the cost is affected by page faults and the ST product. In a similar 
 way, the turnaround time is affected, at least indirectly, by the number 
 of page faults incurred. Because of these reasons, we will consider 
 only the system oriented performance measures. 
 
 A page fault implies the following sequence of operations: 
 interruption of the program and a switch to the page fault handler, 
 search for the required page on the secondary memory device, search for 
 a page to be replaced, issuing of output instructions to push the page 
 to be replaced (if required), issuing of input instructions to fetch the 
 required page, searching the ready queue and finally a process switch to 
 a new program from the ready queue. All these operations take considerable 
 amount of time, in fact, a few milliseconds per page fault [MASU72, ORGA'72] . 
 Thus if we want to improve the cpu utilization, we must minimize the 
 number of page faults. Another effect of a page fault is poor memory 
 utilization. For instance, during the page wait time and also during 
 the time the process spends in the ready queue waiting for a cpu, the 
 page frames of MM allot ed to the program are unavailable to other processes. 
 
 The number of page pulls has a direct relationship to the channel 
 traffic. With a given system, the maximum amount of channel traffic that 
 it can support is fixed. This implies that the page pulls incurred by an 
 individual program is an important measure of its behavior. In multi- 
 programmed systems there is an extreme situation, called thrashing, in 
 which excessive page traffic occurs [DENN68] . Thrashing has the effect 
 of rapidly decreasing cpu utilization, and it is therefore important to 
 
8 
 
 eliminate it if possible. Note that, if we use demand paging, then the 
 page fault and the page pull measures became indistinguishable. A related 
 performance measure is the number of page pushes; however, it is a fixed 
 percentage of the number of page pulls, on the average. 
 
 During the page-wait time, the set of pages alloted to the 
 interrupted program is wasted. If these pages were alloted to the 
 currently executing program, we could improve its performance. Therefore, 
 the utilization of MM can be measured by the space time product of MM 
 used by a program during the entire time of its residence in MM. 
 
 2.2 Important Variables 
 
 There are seven important variables in a PVMS which can affect 
 the performance. These are: 
 
 1. Whether the system is monoprogrammed or 
 multiprogrammed . 
 
 2. The number of allocated page frames c. 
 
 3. The page size p. 
 
 k. The average page fetch time T, and the ratio 
 T/a where A is the memory cycle time. 
 
 5. The memory referencing behavior of the program, 
 which is reflected in the page reference string oj. 
 
 6. The organization of the program's information 
 (both instructions and data) in the virtual 
 address space. 
 
 7. The paging algorithm used. 
 
There are four controllers of these seven variables: hardware, 
 operating system, compiler and the programmer [WEIN72]. 
 
 The hardware controls the page size p, the average page fetch 
 time T, the memory cycle time Z\, and the total amount of MM available. 
 In some systems it also controls the paging algorithm used [LIPT68], 
 
 The operating system controls the degree of multiprogramming used, 
 the number of allocated page frames c and whether or not c is allowed to 
 vary dynamically, and finally, it determines the paging algorithm used. 
 
 The compiler decides the page reference string, at least partially, 
 insofar as the translation process allows the freedom to do so. Storage 
 allocation phase of the compiler organizes a program's information in 
 the virtual address space. 
 
 The programmer may control the page reference string and to a 
 certain extent, the organization of the program's information in the 
 virtual address space,, 
 
 The effect of these seven variables on the performance will be 
 considered now in more detail. 
 
 2.2.1 Multiprogramming 
 
 We have observed that multiprogramming is used to overlap the 
 page transfers for one program with the execution of another program, and 
 as a result it has a potential of improving the cpu utilization. We note 
 that multiprogramming also has other uses, such as, to provide time 
 sharing, but in this work we concern ourselves with the former use only. 
 The number of programs occupying the MM concurrently is known as the 
 degree of multiprogramming. The basic questions in such a system are 
 what degree of multiprogramming to use and how much improvement in 
 cpu utilization is accrued as a result. If a very low degree is used, 
 
10 
 
 we may not be able to find a ready job to run on the occurrence of a 
 page fault; as a result the cpu will remain idle. On the other hand, 
 whenever several concurrent programs share MM in order to 'mask' I/O 
 time each program gets a smaller number of MM page frames which tends to 
 increase page faulting. In an excellent survey article, Kuck, et al. 
 [KUCK70] discuss this question in great depth. Their conclusion is that 
 for multiprogramming to yield a reasonable gain in cpu utilization, 
 there must be 'sufficient' amount of MM. They have also shown that with 
 other variables being fixed, the cpu utilization, generally, goes up 
 starting from a degree of multiprogramming equal to one (i.e., monopro- 
 gramming) until it reaches a maximum and then it falls down very quickly. 
 If the degree of multiprogramming is very high, the system is susceptible 
 to thrashing, a collapse of performance that may occur when MM is 
 overcommitted [DENN68, DENF70] . Denning has proposed that by using a 
 working set scheduling strategy, thrashing can be avoided. We will not 
 discuss the question of degree of multiprogramming any further. 
 Interested readers are referred to [KUCK70, DEMT70, COFF73, SMIT67] . 
 
 2.2.2 Main Memory Allotment 
 
 The effect of MM allotment on page faults has been studied in 
 a number of papers [BELA.69, DENN68, DEMf68a, KUCK70, LEHM68, OPPE68, 
 SHEM66, SHEM69, SISS68, SMIT67, FINE66, MATT70, SALT7U], and many 
 experiments have been conducted to determine program paging behavior 
 [ANA.C67, BAER71, BAYL68, BEIA66, BEAW68, COFF08, DEM70, FINE66, FREI68, 
 J0SE70, LEHM68, ONEI67, PINK68, SALT7^, STEV68, VARI68] . Conclusion of 
 all these studies seems to be that the paging rate is very high at low 
 
11 
 
 page allotments and it reduces very fast as the allotment is increased. 
 After an amount of page allotment called 'parachor' [KUEH68] is reached, 
 the decrease in paging rate achieved by increasing c is not very 
 significant. The value of c at parachor is dependent on the particular 
 program under consideration. This immediately brings us to the point of 
 fixed versus variable size page allotment. 
 
 In a monoprogrammed system, page allotment is fixed and such a 
 question does not arise; however, if we have a multiprogrammed system 
 then control could be exercised. It has been shown that variable memory 
 allotment is generally superior to fixed memory allotment [BELA.69, C0FF72, 
 DENM72]. 
 
 2.2.3 The Page Size 
 
 The effect of page size p on the performance of a PVMS has been 
 extensively studied in the literature [BAER71, DEM70, DOYL , GELE71, 
 GELE73, HATF72, J0SE70, KUCK70, RAND68, RAND69] . There are several 
 factors which affect the choice of a page size. Operating system 
 typically rounds the number of pages up to an integral number of pages 
 which results in unused words called internal fragmentation [RAND69] . 
 Reducing the page size will reduce internal fragmentation. Kuck [KUCK70] 
 defines superfluity to be the unreferenced words on a page during the 
 interval of residence of the page in MM. Doyle [DOYL ] has called this 
 gap fragmentation. Reducing the page size tends to minimize this problem. 
 By decreasing the size of the page, more space must be kept in the page 
 table which leaves less MM available for program paging. This effect is 
 called table fragmentation [DEM70, DOYL ] . A balance should be achieved 
 
12 
 
 to minimize these fragmentation problems. Note that fragmentation only- 
 affects the memory utilization. There are other effects of the page size. 
 With a fixed MM size, reducing page size seems to have beneficial effect 
 on the page faults [ANAC67, BAER71, BAYL68, COFF68, J0SE70] . Decreasing 
 the page size, however, increases cpu overhead, the hardware cost of the 
 address translation mechanism, the channel traffic and average word fetch 
 time [DENN70], We will not study the question of page size in this work. 
 
 2.2.4 The Page- Fetch Time 
 
 The effect of the average page fetch time T on the performance of 
 a PVMS is the next point to be discussed. In a monoprogrammed system, 
 the cpu is idle during this time, so a large T implies a very low cpu 
 utilization and also a large turnaround time. In a multiprogramming 
 system, the effect of a large T is masked by overlapping page-wait time 
 of one program with the execution of another. But, as we have seen, 
 multiprogramming implies memory sharing, which in turn implies more page 
 faulting. If we avoid memory sharing and resort to swapping then the 
 channel traffic is increased. 
 
 Kuck [KUCK70] has concluded that for T > 6000 units, there is no 
 gain to be had from multiprogramming. Here T is measured in average 
 instruction time units (roughly microseconds). Roughly speaking, this 
 means that movable head disc is a very bad paging device. Drums and fixed 
 head disc fall below the 6000 mark, and are useful. Extended core 
 storage (ECS) provides T < 1000 and is an excellent paging device. In 
 fact, with ECS, it has been recommended that the process switch after 
 a page fault be avoided altogether and the cpu be left idle during the 
 page-wait time [0EGA72]. 
 
13 
 Denning [DEOT68] reports that the main reason for thrashing can 
 be traced to a large value of T. For a further discussion see [DEM70] . 
 
 2.2.5 Program Behavior 
 
 The memory referencing behavior of a program is a very important 
 variable in determining performance of a PVMS. A program which scatters 
 its references over the entire address space will page fault heavily. It 
 is therefore desirable that the program tend to refer to a small subset 
 of its address space in a small time interval. Fortunately, the above 
 property, known as the locality property, is found to be exhibited by 
 most programs in practice [BELA.66, COFF73, DEM68, DFJW70, DEM72a, 
 FINE66, LIPT68, BRAW68, BRAW70, MCKE69, KUEH68, SAYR69, SHED72], In fact, 
 this property of locality is the key reason of the feasibility of PVMS. 
 
 More specifically, the property of locality can be summarized 
 
 in three statements [DEM72a] : 
 
 During any time interval, a program distributes its 
 references nonuniformly over its address space, some 
 pages being favored over the others. 
 
 The density of references to a given page changes 
 slowly in time or the set of favored pages changes 
 membership slowly. 
 
 Two disjoint segments of the page reference string 
 tend to be highly correlated when the interval between 
 them is short, and tend to become uncorrelated as the 
 interval between them increases. 
 
 It is clear that the behavior and style of the programmer has a 
 
 direct bearing on the locality of programs. Programmers tend to use 
 
 sequential and looping control structures and they tend to concentrate 
 
 on small sections of large programs for moderately long time intervals, 
 
 and they generally group data into content-related blocks. 
 
Ik 
 
 The set of favored pages seem naturally to split into four areas 
 of activity [GIBS66, J0SE70] . One of these areas is constituted by the 
 instruction addresses of the program. The other three areas are 
 constituted by data addresses. The processes of data analysis often 
 consist of arithmetic or logical manipulations on two or more strings of 
 independently (semi) coherent addresses which in combination form the 
 sequence of operand addresses. The fourth area of activity is the output 
 catchment area where the results are held prior to outputting. If it is 
 not possible to study the program behavior in this great a detail, it 
 may be profitable to separate the set of favored pages into data pages 
 and instruction pages [DEM72], We will investigate this topic in detail 
 in this thesis. We remark that for problems with large data bases, the 
 instruction paging is almost trivial compared with the data paging. 
 Therefore, we will concentrate, for the most part, on the data paging 
 problem in this thesis. 
 
 Examples of what may happen to performance of a program with 
 poor locality are too numerous [FINE66, BRAW68, DEM65, BAIR68, SMIT67] . 
 Therefore, it is very important that methods be developed to create 
 guidelines on how to write programs with a high degree of locality. 
 Before we discuss these locality improvement methods, we discuss several 
 other results on program behavior. 
 
 To study and predict performance of programs running under FVMS, 
 several models of program behavior have been investigated. These models 
 are used to generate reference strings to be used in the analytical study 
 of program behavior. The validation of a proposed model is another 
 question that need be answered. The simplest of all models investigated 
 
15 
 
 is the independent reference model, which has been studied by several 
 
 authors [COFF73, KING71, FRMTfk, AH071, BELA.66] . In this model, it is 
 
 assumed that the probability of a reference to page i at time t is given 
 
 "by, P [r, = i] = p. V t > 1 V 1 < i < n, where the set of all p. i 
 r "0 l -i 
 
 is 
 
 n 
 
 fixed and Z p. = 1. It has been observed that, this model is relatively 
 1=1 1 
 
 simple to analyze but it does not mirror the behavior of actual programs 
 [COFF73] . Aho et al [AH071] proposed a very general ^-order nonstationary 
 model of program behavior. However, except in 0-order stationary case 
 (which is identical to the independent reference model), very few results 
 are obtained for this general model. Denning et al [DENN72a] propose 
 several locality models of program behavior. They were able to produce 
 some interesting results with these models which compared favorably with 
 the behavior of practical programs. Matt son et al [MATT70] propose 
 an LRU stack model of program behavior which was improved and modified 
 by Shedler et al [SHED72]. In the LRU stack model, it is assumed that 
 the locality L of the program at time t is of a fixed size c, equal to 
 the page allotment, and consists of the last c pages referred by the 
 program. Quite clearly, this is a good model. Saltzer [SALT7^] has 
 recently proposed a simple linear model of program behavior which is 
 validated against the MULTICS system. This model assumes that the page 
 fault probability is inversely proportional to the page allotment c. 
 
 Several authors have given guidelines to produce programs with 
 better locality [BRAW68, DENF70, F1NE66, HATF71, KUEH68, WEIN72]. The 
 use of modular and highly structured programming and programming languages 
 is a suggestion made most often. The improvement of locality by proper 
 
16 
 
 arrangement of cede and data in VA space is a problem we will discuss 
 separately. Here we are concerned with logical organization of the 
 program. When asked to solve a particular problem, a programmer, generally, 
 has a choice of a variety of algorithms to use. As an example, if he is 
 to solve a system of linear equations, he can use Gaussian elimination, 
 LU decomposition or Cholesky factorization and many others [ISAA66]. The 
 choice of the algorithm is based upon the mathematical properties of data; 
 therefore, we expect the locality to vary with each algorithm. Even when 
 a particular algorithm is chosen, it may be possible to change the order 
 of some operations without affecting the algorithm drastically. We would 
 like to consider the order of operations which yields the maximum locality. 
 In order to discuss this topic in detail, we have to consider individual 
 applications. For applications involving sorting, see [BRAW70], for 
 searching problem see [KNIG ], and for the list processing application 
 see [BOBR67] . In the case of matrix algorithms, McKeller et al [MCKE69] 
 have shown that block algorithms have a superior locality property 
 compared with nonblock algorithms. Lubrulle [DUBR72] has discussed the 
 solution of the Eigenvalue problem in a paged environment. Moler [M0LE72] 
 has discussed a method of loop reversal for improving matrix algorithms. 
 Rogers [ROGE73] discusses the solution of linear equations in a paged 
 environment. We will discuss methods of improving locality of matrix 
 algorithms Q Our work will be distinguished from that of the above by the 
 fact that we will be studying many more matrix algorithms and that our 
 measurements will be more extensive. In particular, we will study matrix 
 multiplication, Cholesky decomposition, EU decomposition, Gaussian 
 
17 
 elimination and Gram Schmidt orthonormalization. We note here that, we 
 choose to ignore the error growth aspect of our algorithms unlike Rogers. 
 We will measure the paging performance of our algorithms under a variety 
 of paging algorithms, both realizable and unrealizable. As against this, 
 McKeller et al have only measured the performance of their algorithms 
 under MIN [BELA.66] paging algorithm, which is unrealizable. 
 
 So far we have discussed only programmer-implemented locality 
 improvement methods. One of the motivations for adopting a virtual memory 
 system is that it relieves the programmer from the burden of memory 
 management (or overlay problems) when his program cannot fit into the 
 available MM [DENN70, KILB62], therefore, asking the programmer to worry 
 about locality of his programs seems to be a step backwards. The job 
 of improving locality of programs must then be carried out by the computer 
 system. The operating system cannot possibly do much, since it does not 
 know the global structure of the program. Therefore, it appears plausible 
 that the optimization we seek can be achieved by the compiler. We know 
 of no earlier work along these lines. Observe that, we are considering 
 the topic of logical program reorganization and not physical reorganization. 
 We will consider this type of compiler optimization for matrix algorithms 
 written in both a high-level language (FORTRAN or PL/l) and an array 
 language OL/2 [PHIL72]. We remark that the OL/2 language is an array 
 language which allows dynamic partitioning and block structure, but it 
 does not allow a 'GOTO' statement. These characteristics greatly influence 
 the program reorganization strategies. 
 
18 
 
 2.2.6 Pagi nation 
 
 The organization of a program's information in the VA space has 
 been termed pagination [DEM70] . This can be further divided into code 
 (instruction) pagination and data pagination. In code pagination, at the 
 highest level one could study the grouping of subroutines to minimize page 
 faults. Such experiments were made by Comeau [COME67]. Code pagination 
 can be done on the basis of statistical properties (e.g., branch 
 probabilities) [FAMA66, DEHN70], the history collected from previous 
 runs [HATF71, FERR73], or solely on the basis of syntax [YEL071] . The 
 problem of data pagination has received very little attention. In 
 particular, only for special applications like matrix algorithms have 
 investigations been made. It has been noted by several authors [MCKE69, 
 ROGE73] that storing matrices by square blocks coupled with the use of 
 block algorithms seems to improve paging performance. We will assume 
 block (or submatrix) type of storage for matrices. 
 
 2.2.7 Paging Algorithms 
 
 We now discuss paging algorithms and their effect on paging 
 performance. A paging algorithm implements three policies. The fetch 
 policy determines when a particular page should be brought into MM; a 
 replacement policy determines when a particular page is removed from MM; 
 and a placement policy determines an available page frame to hold a 
 fetched page. In a PVMS (as against a segmented VMS), the placement 
 policy is trivial and therefore, will not be considered [DENF70] . 
 
 Basically there are two types of fetch policies, namely, demand 
 fetching and anticipatory fetching. Under demand fetching, a page is 
 
19 
 "brought into MM only on demand, i.e., when a page is referenced and is not 
 found to be resident in MM (i.e., at the time of a page fault). Under 
 anticipatory fetch policy (also called prefetching), one or more pages may 
 be brought into MM at any time, usually in advance of reference to the 
 page. A special kind of prefetching policy is called demand prefetching 
 [COFF73]. Assume that at time t, a page x is referenced and is not found 
 in MM, thus creating a page fault. As a result a page fetch for page x 
 will be initiated. A demand prefetch policy will allow us to fetch some 
 other pages together with page x in the above situation. 
 
 A replacement policy can also be one of two types : demand an 
 anticipatory. 
 
 Most popular paging algorithms are of demand_fetch, demand_replace 
 variety and are commonly known as demand paging algorithms. All the other 
 varieties of paging algorithms are commonly known as prepaging algorithms. 
 Demand paging has been widely used in practice due to its simplicity in 
 implementation. Prepaging, on the other hand, is more difficult to 
 implement, chiefly because good predictions of a program's pages needed 
 in future are not easily obtained [DEM70, DEM72], 
 
 Two types of prepaging schemes have been reported. One scheme, 
 known as swapping, has been used to guard against excessive paging due to 
 a small time quantum (e.g., time- sharing systems) [DEM70, KUEH68, 0RGA72]. 
 In this scheme (except for the first time quantum) a program's working 
 set is preloaded at the beginning of the time quantum and demand paging 
 is used during the time quantum. We note that the working set to be 
 preloaded is determined from the set of pages the program had acquired 
 during the last time quantum. This scheme is successful with small time 
 quantum. 
 
20 
 The second scheme, proposed by Joseph [J0SE70], is as follows. 
 
 Whenever a page fault occurs for page x, then fetch page x and x+1 . This 
 
 scheme is particularly successful when the address pattern of a program 
 
 is sequential. 
 
 Prepaging incurs less page faults than demand paging, but the 
 
 number of page pulls may be the same or even larger because of the 
 
 possibility of an incorrect prediction [J0SE70], The effect of prepaging 
 
 on ST product of MM used is not clear from the literature. Joseph reports 
 
 an increase in ST product whereas Denning [DEM70] shows that if the 
 
 probability of incorrect prediction is low, then ST product is reduced 
 
 by prepaging. Because of a reduction in total page wait time, ST product 
 
 tends to reduce but at the same time it tends to increase because pages 
 
 are brought in MM in advance of their use. We will discuss this question 
 
 in Chapter 3. If more than one page is brought into MM at the same time 
 
 and a proper layout of pages on the secondary memory is used, then the 
 
 effective access time per page (t) can be reduced, provided rotating 
 
 secondary memory is used. The basic question, then seems to be that which 
 
 performance measure is more important, page fault or page pull. There is 
 
 no easy answer to this question. However, if the goal is better cpu 
 
 utilization then page fault is a more important measure and if reduction 
 
 of the channel traffic is the goal, then page pull is a more important 
 
 measure. It seems, therefore, that prepaging is justifiable if good 
 
 predictions are available. We will elaborate on these ideas in subsequent 
 
 chapters. 
 
 2.3 The Formalism of Paging Algorithms 
 
 In the abstract, a paging algorithm A is a mechanism for processing 
 a reference string w = r, r~ ... r ... and generating, in response, a 
 
21 
 sequence of memory states S_ S, . . . S . .. where S Q is a specified 
 initial memory state [AH071, COFF73] . Each memory state Sis the set 
 of pages from N which reside in M at time t; they satisfy the conditions 
 S, c N, |S, I < c, r, e S (t > 0). Moreover, S and S . are related by, 
 S, = S, + X - Y where X, c N - S, , is the set of pages fetched at time 
 t and Y. c S, , is the set of replaced pages. To determine X, and Y at time 
 t, the paging algorithm must maintain internal records. A set of control 
 states Q is used for this purpose, q being a designated initial control 
 state. A configuration of the paging algorithm is any pair (S, q) in which 
 |s| < c, q e Q. Associated with the algorithm A is a transition function g , 
 such that g A (S, q, x) = (S',q'), x e S', where (S, q) is the present 
 configuration and (S ' , q') is the next configuration and x is the current 
 page reference causing the transition. In particular, the memory state 
 sequence S n S ... S ... induces a configuration sequence (S_, q ) (S, , q, ) . . 
 (S,, q, ) ... generated by, 
 
 {S t> q t } = g A (S t-l' %-l> r t^ t ^ 1 ' 
 
 A paging algorithm is a demand fetching algorithm if for a given c > 0, 
 |x J e {0, 1} and if r e S , then X = <fi. A paging algorithm is a demand 
 replacement algorithm if |Y | e [0, 1} and if r e S, , or |s | < c 
 then Y = p. A paging algorithm is a demand paging algorithm 
 (demand_fetch_demand_replace) if g (S, q, x) = (S', q' ) satisfies the 
 
 conditions : 
 
 r 
 
 S if x e S 
 
 ( S+x if x £ S and |s| < c 
 
 L S+x-y 3 y g S, if x jt S, |s| - c. 
 
22 
 
 Thus for a demand paging algorithm, < | Y t | < |X^ | < 1. A paging 
 algorithm is a demand prefetching algorithm if g A (S, q, x) = (S», q* ) 
 satisfies the condition that if x e S then X ■ 0. In particular, the 
 restriction |X. | < 1 is removed. And finally, a paging algorithm is a 
 demand prepaging algorithm [COFF73] if sA^t <l> x ) ■ ( s '> <T ) satisfies the 
 conditions : 
 
 r S if x e S 
 S' = < S+X if x ft S, x e X, |S+X| < c 
 
 L S+X-Y if x ft S, x e X, |S+X-Y| = c. 
 
 In a reference string w, the forward distance d (x) at time t 
 for page x is the distance to the first reference to x after time t: 
 
 {k, if r, , is the first occurrence 
 ' t+k 
 of x in r t+1 , r t+2 , . . . 
 co^ if x does not appear in 
 r t+l' r t+2' *•• 
 
 Similarly, the backward distance b , (x) is the distance to the most 
 recent reference to x : 
 
 "" k, if r , . is the last occurrence of x 
 
 ' t-k 
 
 b t (x) =/ in r l' r 2> "" r t 
 
 i^oo^ if x does not appear in r,, r , ..., r 
 
 Presently, we consider paging algorithms, which work with a fixed 
 memory allotment (i.e., a fixed value of c). Some typical demand paging 
 algorithms will now be described. Let S = S, and let S' = S, ] = 
 S + x - R(S, q, x) such that x 4. S, , x e S and R(S, q, x) denotes 
 
 u t t+1 
 
 the page selected for replacement so that R(S, q, x) = if |s| < c. 
 
23 
 
 (1) LRU: This algorithm chooses the page least recently used 
 
 as the page to he replaced. In other words, 
 
 R(S, q, x) = y iff \(y) = m ax O t ( z )]- 
 
 zeS 
 
 This algorithm is used widely in practice and is 
 found to behave fairly well [DENN70, CORB69, LIPT68] . 
 
 (2) FIFO: This algorithm chooses for replacement the page 
 
 that was fetched first in MM. This algorithm 
 behaves well only for highly sequential reference 
 patterns. However, it is very simple to 
 implement. 
 
 (3) RAM): It selects the page to be replaced randomly from the 
 
 pages resident in MM. Clearly, this algorithm 
 behaves well only for independent reference model 
 of program behavior [BELA66, KING71, COFF73]. It 
 is also very simple to implement. 
 
 (h) MIN: The page to be replaced has the largest forward 
 
 distance. In case of a tie, the tie-breaking rule 
 uses lexicographic ordering. This algorithm was 
 first proposed by Belady [BELA.66], and it is 
 unrealizable since it presumes the advance knowledge 
 of the program's reference string. It is, however, 
 useful for theoretical purposes since it can be 
 proved optimal in a certain sense. It is also 
 called B Q [COFF73] and 0PT[MATT70] sometimes. 
 
 More specifically [COFF73], R(S, q, x) = y iff 
 y = min [z]. Here, S* is defined so that 
 
 zeS* 
 ieS* iff d (i) = max[d (u)]. 
 ueS 
 
 Note here that lexicographical ordering is used in S*. 
 
 2.3.1 Cost of a Paging Algorithm 
 
 Aho et al [AH071] define the cost of processing a reference string 
 
 w = r-,, r p , ..., r with a paging algorithm A operating with c page 
 
 7 
 frames of MM: C(A, c, <*>) = Z h(|X, |). Here h is a function such that 
 
 t=l t 
 
 h(0) = 0, h(k) >h(l) = 1. Aho et al prove the following theorem: 
 
 Suppose h(k) > k then for any given paging algorithm A, there 
 
 exists a demand paging algorithm A' such that C(A', c, co) < c(A, c, to) 
 for any reference string to and for any value of c. 
 
2k 
 We note that for rotating auxilliary memory devices like discs 
 and drums h(k) < k. Only for large core storage device can we hope to 
 have h(k) = k. In such a case, the above cost function measures the 
 number of page pulls. 
 
 A paging algorithm A is said to be optimal (with respect to the 
 cost function defined) if it minimizes C(A, c, u>) for all reference 
 strings w and all values of c. It can be proved that KEN is an optimal 
 paging algorithm by the above definition of optimality [COFF73, P0ME71, 
 MATT70] . Therefore, the MEN algorithm, though unrealizable, is useful as 
 
 a benchmark for evaluating other paging algorithms. 
 
 + 
 
 Assume that a probability distribution is specified over N such 
 
 that for wg N , p( w ) is the probability of occurrence of oj, then we define 
 the expected cost of a paging algorithm A by: 
 
 C(A, c) = Z p(co) C(A, c, u) [COFF73]. 
 
 Define a paging algorithm to be optimal with respect to distribution p if 
 it minimizes C(A, c), V c > 1. Denning et al [DENN68b] proposed a demand 
 paging algorithm A , which replaces the page with the longest expected 
 time until next reference. It can be proved that A n is an optimal paging 
 algorithm with respect to a distribution corresponding to the independent 
 reference model of program behavior [AH071] . 
 
 2.k Working Set Algorithm 
 
 We consider Denning ' s working set algorithm, which is a variable 
 memory algorithm [DENN68a] . A program's working set at time t is defined 
 to be 
 
 W(t, t) = {i € N| page i appears among r 
 
 t-T+1 
 
 t . . . , r , J 
 
25 
 
 where t is a parameter called the window size and t > 1. In other words, 
 
 W(t, t) is the 'contents' of a window of size t looking backwards at the 
 
 reference string from reference r, . Working set replacement algorithm 
 
 essentially states that, do not replace a page from the working set. This 
 
 algorithm is a little difficult to implement but behaves very well in 
 
 practice [D0HE70, R0DR72, RODR73, WEIZ69] . Working set principle consists 
 
 in using a working set replacement algorithm and using the following 
 
 scheduling policy: a program may run if its working is in MM [DENF70] . 
 
 Denning has shown that thrashing can be avoided by use of WS principle 
 
 [DEJM68] . 
 
 Denning et al [DENN72b] define and prove several important 
 
 properties of the working-set model. Define the working set size 
 
 w(t, t) = |w(t, t)| and the average working set size s(t) = lim s (t) 
 
 i k 
 
 1 k 
 where s (t) = 7- Z w(t, t). Define the binary variable 
 
 "\(t, T) =f lifr ^ w(t ' T) 
 
 L otherwise, 
 
 1 k_1 
 then the missing page rate (page fault probability) m(x) = lim — E A(t, t), 
 
 k-^00 K t=0 
 Suppose that in the reference string to, two successive references to page i 
 
 occur at times t and t+x. . We call x. an interference interval for page i. 
 
 11 
 
 The interference distribution for page i is defined to be 
 
 F. (x) = lim 
 1 i 
 
 K-KXJ 
 
 The interference density function for page i is defined to be 
 
 f.(x) = F. (x) - F. (x-l). With these definitions following properties hold: 
 
 "no. 
 
 x. m r,, r„, •••> 
 
 r, with x. < x 
 k 1 — 
 
 _ 
 
 no. x. in r,, 
 1 1' 
 
 *•*' r k J 
 
26 
 
 PI: 1 ■ s(l) < s(t) < s(t+1) < min{n, t+1). 
 
 P2: s(t+1) - s(t) = m(T). 
 
 P3: < m(T+l) < m(T) < m(0) = 1. 
 
 PU: m(T) = 1 - F(t) = L f(y). 
 
 y<T 
 
 P5j_ m(T+l) - m(T) - - f(T+l). 
 
 T-l T-l 
 P6: s(t) = Z m(z) = Z (l-F(z)) 
 z=0 z=0 
 
 T-l 
 
 = Z E f(z). 
 z=0 y<z 
 
 P7: Lim s(t) = n. 
 P9: Lim m(x) = 0. 
 
 T^°° 
 
 We note here that the average working set size of a program is a 
 very good indicator of its locality [DEKN70] . 
 
 2.5 Performance Measurement 
 
 If we are given a reference string w, a page allotment c and a 
 paging algorithm A then we can simulate the processing of co by A and obtain 
 the page fault count and the page pull count. We can also measure ST 
 product provided the average page wait time is assumed to be fixed. With 
 a demand paging algorithm page faults and page pulls are identical and 
 the ST product is directly dependent on page faults, therefore, most 
 measurements are restricted to only page pulls. As an example, we 
 consider the reference string, u = ABCDEBCBDAEAC, where N = {A, B, C, D, E) 
 We use c = 3 and find the page faults for the above u using LRU. 
 
27 
 
 paging algorithm 
 
 
 
 
 
 
 
 
 
 
 
 
 
 l s i 
 
 S 2 
 
 S 3 
 
 % 
 
 
 
 
 
 
 
 
 
 S 13 
 
 U)-> 
 
 h^ 
 
 B 
 
 c 
 
 D 
 
 E 
 
 B 
 
 C 
 
 B 
 
 D 
 
 A 
 
 E 
 
 A 
 
 c 
 
 memory- 
 
 1 A 
 
 B 
 
 c 
 
 D 
 
 E 
 
 B 
 
 c 
 
 B 
 
 D 
 
 A. 
 
 E 
 
 A 
 
 c 
 
 state 
 
 1 
 
 A 
 
 B 
 
 C 
 
 D 
 
 E 
 
 B 
 
 C 
 
 B 
 
 D 
 
 A 
 
 E 
 
 A 
 
 S-> 
 
 
 
 A 
 
 B 
 
 C 
 
 D 
 
 E 
 
 E 
 
 C 
 
 B 
 
 D 
 
 D 
 
 E 
 
 FAULT 
 
 * 
 
 * 
 
 -* 
 
 -* 
 
 * 
 
 # 
 
 * 
 
 
 -* 
 
 # 
 
 * 
 
 
 * 
 
 The number of page faults in this example is 11. Observe how the LEU 
 algorithm chooses a page to be replaced. At t=^, r i. =D > r i, 4 S t; wn i- cn 
 implies a page fault which in turn implies a replacement. In S , A is the 
 page used least recently, therefore, it is chosen for replacement. Now 
 for the same w and paging algorithm, but for a different value of c, we 
 are asked to find the page fault count, we will have to go through the 
 simulation again. This means a rescan of the reference string for each 
 new value of c. Since practical reference strings are very long, say a 
 million references, the simulation is very inefficient. Mattson et al 
 [MATT70] have developed a technique which scans the reference string only 
 once and produces the page fault count for all values of 1 < c < n. We 
 will have occasion to use his ideas in Chapter 3. 
 
 2.5.1 Stack Algorithms [C0FF73, MATT70] 
 
 We introduce the notation S(A, c, w) to stand for the memory 
 state resulting after the paging algorithm A has processed reference string 
 w in a memory of size c, assuming that S = f). That is, if w = r,, r , 
 ..., r, generates S_, S , ..., S under A, then S(A, c, w) = S, . If A 
 is understood, we simply write S(c, w). 
 
28 
 
 An algorithm A is called a stack algorithm if its memory states 
 satisfy the following inclusion property: S(c, w) c S(c+1, w) V c > 1, 
 V uieN . This means that the memory states form a collection of nested 
 sets. This condition is equivalent to the following condition: for each 
 w, there exist a permutation of N, s(w) - (s^oj), Sg(w), . .., B (»)), 
 such that, for all 1 < c < n, S(c, u>) = {s 1 (w), s 2 (w), . .., s q (w)). The 
 vector _s(w) is called the stack vector or imply the stack. For a given 
 stack algorithm, the inclusion property implies that for each reference 
 string r _, . .., r , a sequence of stacks s_ , s , . .., _s can be constructed 
 so that the memory state sequence for each value of c can be determined by 
 simply taking the topmost c pages of the stack. This property of stack 
 algorithms implies that the page fault behavior of a given reference 
 string can be computed effectively in parallel for all memory sizes 
 1 < c < n and in one scan of the reference string. 
 
 Examples of stack algorithms are MEN, LRU, LFU (least frequently 
 used), whereas FIFO is not a stack algorithm [COFF73, MATT70] . 
 
 The stack distance D (w) is defined for page x to be the position 
 of x in the stack s(w). Thus if s. (oj) = x, then D (oj) = k. If x does 
 
 K X 
 
 not appear in _s(w), then D (w) = ». Observe that a page fault occurs 
 for page x as the last reference in the string wx iff D (to) > c, since 
 
 A 
 
 the first c elements of _s(oj) are the contents of memory of size c. 
 Suppose that u> = r n , r . ..., r is processed by A producing the stack 
 
 -L d. y 
 
 sequence s_ , s^, ..., s,, ..., s . There is, associated with oo, a stack 
 
 distance sequence D , D , ..., D , where L\ is the position of r^ in 
 
 X. d. y % t 
 
 St-l* If *( A > c ' **) denotes the number of page faults in processing oo 
 with memory size c using algorithm A then, 
 
Define 
 
 29 
 it (A, c, to) = |{t|D > c, 1 < t < y\. 
 
 a. = | (t|D , = k, 1 < t < 7} | then clearly, 
 
 n 
 it (A, c, w) = a + Z a . 
 
 k=c+l K 
 
 c 
 
 Alternatively, the success function N(A, c, to) is N(A, c, u>) = Z a, 
 
 k=l * 
 
 the success frequency F(A, c, to) = N(A, c, w)/^, and the page fault 
 frequency m(A, c, to) = jt(A, c, to)/y. 
 
 Gordon [G0RD73] has shown a method by which one can compute the 
 average working set size and the page fault probability for the WS 
 algorithm and for the LRU algorithm in one pass of the reference string. 
 Define I (w) (reference interval) associated with each page in the stack 
 _s(w) as the number of distinct pages referenced since the last reference 
 to this page and I (tax) = 1 (i.e., reference interval is defined to be 
 equal to 1 at the time this page is referenced). If x^_s(oo) then 
 I (to) = 00. Let I be the reference interval associated with page r , = x 
 in the stack s. . Then associated with the reference string , there is 
 a reference interval sequence I.., I p , ..., I , ..., I . Clearly 
 «(WS, T, w) = I {t|l > t, 1 < t < 7} I . If we define, b = | {t|l = k, 
 
 1 < t < 7)|, then *(WS, t, to) = b + z b Also iq(t) = it(WS, t, to)/| 
 
 k=T+l K 
 
 T-l 
 
 and s(t) = Z m(z) where m(0) = 1 by definition. 
 z=0 
 
 We have noted that MEN is a stack algorithm but since it requires 
 a scan of the future reference string in order to determine the page to 
 be replaced, a one pass algorithm is difficult to obtain. Mattson et al 
 [MA.TT70] have given a two pass algorithm to get MIN statistics. More 
 
30 
 
 recently, Belady et al [BELA.7U] present a one pass algorithm to obtain 
 MIN page fault behavior. 
 
 So far we have discussed methods to obtain paging performance when 
 the reference string is specified. Collection of page traces, however, is 
 very time consuming and for very long page traces running simulations to 
 obtain paging performance is also very time consuming. It is beneficial, 
 therefore, to consider analytical methods to predict paging behavior of 
 programs. This requires a model of program behavior to be formulated. 
 King [KING71] has analyzed program behavior using the independent reference 
 model. 
 
 2.5.2 King's Model 
 
 If the configuration sequence (S Q , q_) (S,, q, ) ... (S,, q^ ) ... 
 generated by a paging algorithm form a Markov chain, then analytical 
 methods exist to find m(c) where m(c) = Z p(w) m(c, co). Here N is the 
 
 00 
 
 uxeN 
 set of all infinite reference strings over N. 
 
 Suppose we are given the transition probability matrix P of the 
 
 Markov chain whose states are V x Q> where V-{S|ScN, |s| =c} 
 
 (i.e., y is the set of memory states). Now if the Markov chain is 
 
 irreducible, then there exists a unique stationary probability distribution 
 
 a such that aP = P and z a. = 1. Define the equivalence class of 
 
 ie V XQ 1 
 
 state i of the Markov chain by: 
 
 [i] = [i j transition from i -* j and j -> i 
 is a non-page fault transition} . 
 
31 
 
 It can be shown that, 
 
 VXQ 
 m(c) = 1 - Z ( L a. p 
 
 0=1 {k|[k]=[j]) K K ' J 
 
 where P = (p v .)« Under the independent reference model of program behavior 
 LRU, FIFO, A_ algorithms satisfy the criteria for this treatment and closed 
 form expression of page fault probability can be obtained. 
 
32 
 
 3. GENERAL PREPAGING ALGORITHMS 
 
 3.0 Introduction 
 
 In this chapter we present arguments as to how prepaging can be 
 useful. We present several new prepaging algorithms and study their 
 behavior. 
 
 We have noted in Chapter 2 that, Aho et al proved that given an 
 arbitrary paging algorithm A, there exists a demand paging algorithm A' 
 which performs no worse than A. First of all, this theorem holds only if 
 h(k) > k. For rotating auxilliary memories h(k) < k and only for bulk 
 core storage, do we have h(k) = k. Since rotating auxilliary memories 
 are in widespread use, the assumptions of the theorem do not hold in 
 practice. If h(k) = k then the cost function defined by Aho et al measures 
 the number of page pulls. If the channel is the bottleneck in the system 
 then the page traffic should be minimized. Heuristically, minimization 
 of page pulls for individual programs will imply minimization of the 
 system page traffic. Under this assumption the theorem is valid. When 
 cpu overhead is critical [MASU72, 0RGA72], however, and we have multi- 
 programming system then a page fault implies a process switch in which 
 case, how many pages we pull after a page fault is immaterial and 
 therefore the cost of a paging algorithm is the number of page faults 
 it incurs. Under this assumption, h(k) = 1 V k > 1. Reduction in cpu 
 overhead is affected in two ways : first since the number of page faults 
 is decreased and as we have seen, few milliseconds of cpu t im e is spent 
 
33 
 
 in servicing a page fault. Secondly, after a page fault, if a ready- 
 process is not available, the cpu may have to be idle. By reducing the 
 number of page faults we reduce the possibility of such idleness. The 
 effect of prepaging on ST factor will be discussed in section J>,1. 
 
 Granted that prepaging has its benefits, there are two problems 
 yet to be resolved: who should specify the paging needs in advance and 
 when to carry out the page fetches. Haore [HA0R72] notes that the memory 
 management routines do not have sufficient knowledge of the future reference 
 strings, therefore, these forecasts must be made by the programs, either 
 directly by the programmer or by the compiler. We agree with this statement. 
 To show the viability of prepaging, we will consider a particular application 
 and show that prepaging can be done and that it does result in improved 
 performance. If we resort to demand prepaging, then the problem of when 
 to fetch pages is solved. Whenever a prepage request is made, we just 
 flag the page and on the occurrence of a page fault, we bring in all the 
 flagged pages. 
 
 Note that, under demand paging, the distinction between a page 
 fault and a page pull vanishes since either implies the other. Under an 
 arbitrary paging algorithm neither need imply the other. Under demand 
 prepaging, a page fault implies a page pull but not vice versa. This 
 means that for demand prepaging, # page faults < # page pulls. The ideal 
 for the number of page faults is zero. It is possible, at least in theory, 
 to achieve this ideal with arbitrary prepaging but not with demand 
 prepaging or with demand paging. 
 
34 
 
 3.1 The Optimal Demand Prepaging Algorithm 
 
 We define a demand prepaging algorithm DPMIN using the notation 
 
 of section 2.3. 
 
 DPMIN: The transition function g(S, q, x) = (S', q' ) 
 
 where, 
 
 S' - S if x e S 
 
 S' - [y lt y 2 , ..., y £ ) if x ft s 
 
 where I = min(c, |. w |) and 
 
 ,w = r.j . .., r , and where 
 t t 7 
 
 V i, y. e N & V x e N - S* d (x) > d (y. ). 
 In other words, at the time of a page fault, DPMIN scans the future 
 reference string and fetches the first c pages that will be referenced 
 in future. 
 
 Note that, DPMIN is unrealizable in the same sense as MIN is 
 unrealizable. We will see that DPMIN serves as a benchmark of performance. 
 
 We define a paging algorithm A to be optimal with respect to 
 page faults if for any arbitrary paging algorithm A', and V c > 1, 
 V oo e n*, 
 
 it (A, c, go) < it(A', c, oj), 
 
 where n denotes the number of page faults „ 
 Theorem 3.1 : 
 
 DPMIN is optimal among all demand prepaging algorithms (in number 
 of page faults). 
 
 Proof: Let jl (S, t)- denote the minimum achievable cost under demand 
 prepaging of processing references r . _, ..., r, given that S = S. If 
 
35 
 
 we define it n (S, t) = and let r , , = x then we can write 
 
 r 
 
 * k (s, t) = 
 
 it, JS, t+1) if x e 
 k-1 
 
 1 + min it. . (S+X-Y, t+l) 
 YcS k_1 
 
 XcN-S if x ft S. 
 
 xeX 
 
 This relation may be recognized as the principle of optimality 
 in a dynamic programming problem [AH071, C0FF73] . The proof of the 
 theorem now reduces to showing that paging according to DBCEN is 
 characterized by the above principle of optimality. 
 
 Let <, be an ordering defined over N for V t > 1 such that 
 x < y iff d (x) < d (y) (if d (x) = d (y), lexicographic ordering is 
 assumed). Denote M = {y , ..., y } such that V x e N-M , x < y. 
 for V 1 < i < c-1. Then the transition function for DFMIN may be written 
 as : 
 
 'DFMIN 
 
 (S, q, x) = 
 
 (S, q') if x e S 
 
 (M +x, q») if x ft S, 
 
 Then it is sufficient to show that 
 
 il (M.+x, t-1) = min. (jl (S+x, t-l)), 
 K t ScN-x * 
 
 |S|=c-l 
 
 Clearly, it,(M +x, t-l) = it, (M +x, t) and 
 
 K. "C K— -L "t 
 
 it k (S+x, t-l) = « k-1 (S+x, t), 
 
36 
 therefore, it is sufficient to show that 
 
 Tt,(M.+x, t) = min. («, (S+x, t)). 
 k t 3rN-x k 
 
 |BT- c-1 
 
 In fact, we will prove that An, = jt, (S+x, t) - jl (M.+x, t) < 1. We will 
 
 prove this by induction on k. It is clearly true for k = (by definition 
 
 of n (S, t)). Assume true for V i < k. If S = M then we are done, 
 
 therefore assume S f M and let i be the smallest index such that y. 4. S, 
 
 "C i 
 
 also let d (y. ) = i. First note that i < c-1. Then, 
 
 rt k (S+x, t) = 1 + min.(Tr k _^(S+X-Y, t+i+l)) 
 
 and 
 
 « k (M t +x, t) = * k _^(M t +x, t+i+l). 
 
 By inductive assumption, 
 
 min. (ji^CS+X-Y, t+i+l)) 
 
 Also, by induction, 
 
 ^k-i = Vi (s ' +r t+i+i' t+i+1 > 
 
 ^-^ M f. + ^ +r + + ^V t+ ' +1 ) <!■ 
 
 Then, 
 
 A\ = ir k (S+x, t) - \(M t +x, t) 
 
 1 + \-i (M t+i +r t+i+i> t+i+1 > 
 
 - ^^(M^x, t+i+l) 
 
37 
 note that r „ , € M +x. 
 
 .'. taking S' = M t +x " r t+ ^ +1 * 
 ^k = 2 + Vi ( W r t + i + l' t+2+1) 
 
 • - v/ s ' +r t + ^ t+ ^ +1 ) 
 
 = or 1 by the inductive hypothesis on Art, g . 
 Therefore for V k > 0, 
 
 This completes the proof of the optimality of DPMIN. 
 Lemma 3. ! • 
 
 Given two fixed memory paging algorithms A and A' 
 
 it (A, c, w) < it(A', c, go)=>ST(A, c, w) < ST^, c, go) 
 
 on the average, assimiing the behavior of other concurrent programs remain 
 
 the same. 
 
 Proof : Since memory alloted to the program is fixed, ST = c * t where 
 
 t is the total time that the program occupies the MM. t ~ t + 
 m * a *■ m _ C p U 
 
 t . , + t , . Since cpu time of a program is unaltered by 
 page-wait ready queue 
 
 a change in the paging algorithm, only the other two factors need to be 
 considered. Now assume T 1 is the average time that the program has to 
 spend in page wait and then in the ready queue. Clearly, T' depends on 
 the characteristics of other concurrently executing programs and on the 
 average page fetch time T. Therefore, T' can be assumed to be a constant. 
 
 t = t + #page faults * T' 
 m cpu * to 
 
 The required result immediately follows. 
 
38 
 
 Theorem 3.2 : 
 
 DIMM minimizes ST product among all demand prepaging algorithms 
 under the assumptions of lemma 3.1. 
 
 Proof : The proof of this theorem follows directly from theorem 3.1 and 
 lemma 3.1. 
 
 We would have liked to prove optimality of DIMM with respect to 
 page pulls, but unfortunately, it does not hold. We will be able to see 
 this in Chapter 5. As a result of lemma 3.1> we need to only discuss 
 page fault and page pull measures for all fixed memory type paging 
 algorithms. 
 
 Note that the class of all demand paging algorithms is included 
 in the class of all demand prepaging algorithms. Therefore, applying 
 Theorem 3.1 to the optimal demand paging algorithm MM, we get 
 
 it (DIMM, c, w) < jt(MM, c, w). 
 We will show that there are cases when strict inequality holds. On the 
 other hand, it is trivial to find examples for which equality holds. 
 Note that MM is optimal in page pulls among all paging algorithms, 
 therefore , 
 
 C(DIMM, c, w) > C(MM, c, co) 
 
 = it (MM, c, u>). 
 
 Once again, there are cases when equality holds in the above relation 
 and there are cases when strict inequality holds. Thus DIMM is superior 
 in two performance measures (page faults and ST product) and MEN is 
 superior in page pulls. Quantitative results of the comparison will be 
 given in Chapter 5 for specific applications. 
 
39 
 3.1.1 Performance Measurement for DPMIN 
 
 The problem is to find the number of page faults and the number 
 of page pulls for a given reference string w and a given page allotment c, 
 using the DPMIN paging algorithm. We present a numeral matrix algorithm 
 for this purpose, based on the numeral matrix algorithm for'MIN [BEIA7U] . 
 The numeral matrix has as many rows as the number of pages in the VA space 
 and as many columns as the nu er of page faults. The algorithm is as 
 follows : 
 
 Initially the matrix is blank. 
 
 1. Suppose the next reference in w is to a page x and that 
 the rightmost nonempty column is (i-l). 
 
 2. Let j be the rightmost column with c markings 
 (note: j = i-l or i-2). 
 
 3. If j = i-l 
 
 then if the entry (x, i-l) is blank 
 
 then mark (x, i) and return 
 else return 
 else 
 k. If j = i-2 then 
 
 if (x, j) is blank then mark (x, i-l) 
 and return else mark (x, i-l) and return. 
 In step 3 marking (x, i) implies a page fault and a page pull. In step k 
 if (x, j ) is blank then a page pull occurs (no page fault). 
 
 After the complete reference string is processed, then the number 
 of page faults is obtained by counting the number of nonempty columns of 
 the numeral matrix. Number of page pulls in excess of page faults is 
 obtained by counting the number of times the clause 'if (x, j ) is blank' 
 is satisfied in step 3. 
 
ko 
 
 Clearly, this algorithm is c -dependent. We illustrate its use 
 by the following example. 
 
 N = {A, B, C, D, E}, c = 3, 
 
 w = ABCDEDBCBMEAC . The number 1 is used for marking. 
 If a page pull occurred without a page fault then we 
 indicate this with an asterisk. The numeral matrix 
 for our example is : 
 
 A 
 
 1 
 
 
 
 1 
 
 B 
 
 1* 
 
 1 
 
 1 
 
 
 C 
 
 1* 
 
 
 1 
 
 1 
 
 D 
 
 
 1 
 
 1 
 
 
 E 
 
 
 1* 
 
 
 1* 
 
 Clearly, the number of page faults is equal to four. The number of page 
 pulls is given by: 
 
 # page pulls = # of 1* + # page faults 
 = k + h = 8. 
 Belady et al [BEIA7^] show that for the same reference string, the MEN 
 algorithm produces 
 
 # page faults = 8 = # of page pulls. 
 
 The numeral matrix algorithm is clearly c-dependent, i.e., for 
 each new value of c, a rescan of the reference string is needed. In 
 particular, DFMEN is not a stack algorithm. We may verify this fact by 
 the following example : w = ABCD, for c = 2, the memory state sequence 
 is: {}, {A, B}, {A, B}, {C, D}, {C, D} . For c = 3, the memory state 
 sequence is: {}, {A, B, C}, {A, B, C}, {A, B, C), {D}. From these, 
 S (2) ^ S (3) which gives the required result. 
 
Another problem in the implementation of the numeral matrix 
 algorithm is the storage of the matrix. A close inspection, however, 
 reveals that only two columns of the numeral matrix need be stored at 
 any one time. In step 2, we have noted that j = i-1 or i-2. This is 
 "because, we create a new column only when the previous column is full. 
 Thus only the present column and the last column need be kept in storage. 
 A PL/l implementation of DPMIN is given in Appendix A. 
 
 3.2 Realizable Prepaging Algorithms 
 
 We have noted that DPMIN is an unrealizable paging algorithm. We 
 would like to consider same paging algorithms which can be realized. 
 
 3.2.1 Freeing Dead Pages 
 
 Assume that at time t, a certain page x is 'dead', i.e., it does 
 not occur in the reference string after time t. This is equivalent to 
 saying: d (x) = <». Assume that either the programmer or the compiler 
 has discovered this fact. Furthermore assume that this information is pro- 
 vided to the operating system so that either it can push the page x onto the 
 secondary memory or it can flag the page dead so that at the time of the 
 next replacement decision, page x will have the highest priority to be 
 replaced. Assume that, originally, we started out with a demand paging 
 algorithm A, and then modify it by the above mechanism and call it FREEA. 
 The process of declaring a page dead will be called Freeing. There are 
 two possible ways to interpret FREEA: One way is to assume that on the 
 issuance of a FREE(x) instruction, page x is replaced. In this case 
 FREEA is a demandrfetch-anticipatory-replace algorithm. The second way 
 is to assume that on the issuance of a FREE(x) instruction, page x is 
 
k2 
 
 flagged (i.e., a 'dead* bit associated with the page table entry of page x 
 is set), and when a replacement decision is to be made, first a search 
 is made for a page with the dead bit on. If none is found then the 
 replacement procedure as in algorithm A is followed. With this interpre- 
 tation, FREEA is a demand paging algorithm. Quite clearly, these two 
 interpretations do not affect the number of page pulls or the number of 
 page faults and therefore, we do not have to distinguish between them. 
 It is clear that n(FREEA) < tt(A). To verify this, assume that A has made 
 a wrong replacement decision at time t, and further assume that working 
 with FREEA, at time t, a dead page was present in memory and therefore 
 FREEA will not make a wrong replacement decision. On the other hand, if 
 FREEA has made a wrong replacement decision then quite clearly, A must 
 have also. 
 
 It is also clear that rt(MIN) < jt( FREEA), since FREEA is a demand 
 paging algorithm and the optimal! ty of MIN is applicable. It should be 
 noted that it (MIN) = ^(FREEMIN), since MIN never makes a wrong replacement 
 decision. If we started out with a demand prepaging algorithm A then we 
 can show that 
 
 jt(DPMIN).< jt(FREEA) < it(A) 
 
 and that jt(DFMIN) = it ( FREEDPMIN ) . 
 
 If we start out with a general paging algorithm A, then < it ( FREEA.) < k(A), 
 Freeing, however, tends to reduce the ST product: First by reducing the 
 number of page faults and second by reducing the average page-wait time. 
 The second reason holds only when anticipatory replacement is used. 
 
 We have pointed out in Chapter 2 that, it is desirable that a 
 given paging algorithm is a stack algorithm. Theorem 3.3 answers this 
 question for FREEA. 
 
k3 
 
 Theorem 3 • 3 ' (Assume A is a demand paging algorithm) 
 
 If A is a stack algorithm then FREEA is a stack algorithm. 
 Proof : The proof makes use of the proposition 6.1 from the book by 
 Coffman et al [COFF73] which states that: 
 
 A demand paging algorithm B is a stack algorithm 
 iff R(S+y, q, x) = R(S, q, x) or y whenever x ^ S + y. 
 Let r denote the set of distinct pages in the string r_, r p , . .., r . 
 FREEA partitions the pages of r, into two classes. One class is the set 
 of dead pages. We assume that the set of dead pages is lexicographically 
 ordered, and the set of nondead pages is ordered by the same ordering as 
 in algorithm A. Let the set of dead pages in S be denoted by A(S). We 
 want to prove that the above proposition is satisfied by FREEA. Therefore, 
 consider the following two cases : 
 (i) [y e A(S+y)]: 
 
 If y= min. (A(S+y)) then R(S+y, q, x) = y 
 
 and we are done, otherwise 
 if y/ min. (A(S+y)) then let 
 
 y' = min. (A(S+y)). This means 
 y 1 = min. (a(S)) which implies 
 R(S+y, q, x) = R(S, q, x) = y« 
 and we are done, 
 (ii) [y^A(S+y)]: 
 
 If A(S) = $ then A(S+y) = fi (since y j. A(S+y)) 
 therefore, FREEA behaves exactly like A 
 which means R(S+y, q, x) = R(S, q, x) 
 (since A is a stack algorithm). 
 
kk 
 
 If A(S) / f> then let y' = min.(A(S)). 
 Since y £ A(S+y) =»A(S) = A(S+y) 
 =>y f = min.(A(S+y)) ^> 
 
 R(S+y, q, x) = R(S, q, x) = y\ 
 Thus FREEA is a stack algorithm. 
 In Appendix A, we give a PL/ I program for FREELRU. 
 
 3.2.2 Prepaging 
 
 We now consider some practical prepaging algorithms. Our 
 objective is to consider prepaging algorithms which reduce page faults 
 without increasing page pulls significantly and thereby obtain an 
 improvement over demand paging algorithms. When working with fixed 
 memory algorithms, at the beginning of a time slice, a few page frames 
 remain unoccupied with a demand paging situation. If we could prepage 
 seme useful pages to fill up this unused space, we would probably reduce 
 page faults. The second important rule we incorporate into the algorithm 
 is that we never replace a useful (non-dead) page for bringing in a 
 prepaged page. This reduces the possibility of increasing the page 
 faulting in a prepaging algorithm After the MM is full, this scheme 
 would not allow us to do any prepaging unless, we have some dead pages 
 which can be freed. We will see that this situation is the most valuable. 
 
 The first prepaging algorithm that we propose is PRE1A. Assume 
 that reference string is modified to have PRE(x) inserted at various 
 places. 
 PRE1A : (Assume A is a demand paging algorithm) 
 
 If |sj < c then if r. +1 = PRE(x) 
 
 and x f( S then S = S + x. 
 
 Otherwise PRE1A works just like A. 
 
This algorithm has two drawbacks. First, it implies a page-wait for each 
 prepage operation which is what we were trying to avoid in the first 
 place and second, it is not able to prepage frequently since the 
 situation | S , | < c arises only initially. We can solve the first 
 problem in two ways. One way is to allow the execution of the program 
 and the input operation of the prepaged page simultaneously, which will 
 be called the "overlap solution. " A second way is to concatenate many 
 page fetches together and resort to demand prepaging. The solution to 
 ' |S | < c' problem is to combine freeing and prepaging together. 
 
 If we use the overlap solution then after the occurrence of PKE(x) 
 in the reference string, a page frame is reserved for x but x cannot be 
 assumed to reside in S . , -. • Page x is then said to be in a 'not-set-up' 
 state. After elapsing of real time T (the average page-wait time), page x 
 said to be in 'set-up' state and can be considered resident in MM. Now, 
 if before a page x is set up, a page fault for another page y (or the 
 same page x) occurs, then during this page-wait period, page x will 
 become set-up. Thus, in general, on the occurrence of a page fault, all 
 not-set-up pages in MM will become set-up. We divide the memory state 
 S t (c) into two disjoint sets U (c) and N.(c). N (c) consists of the set 
 of all not- set-up pages and U (c) = S (c) - N (c). We will assume 
 c > 2 for all prepaging algorithms. We now define PRE2A. We associate 
 a set-up counter 'COUNT' with each not-set-up page. 
 PRE2A: 
 [1] (Page fault for a not-set-up page) 
 
 If r +. + i = x € - N 4-^ c ^ tlien (declare all not-set-up pages set-up) 
 
 \ \ + i^ = V c > u V°> - s t (o) = W o) - I 
 
 N t+1 (c) . 0. 
 
We can easily add freeing to PRE2A and obtain FREEPRE2A. It can 
 easily be proved that both, PRE2A and FREEPRE2A are not stack algorithms. 
 
 It appeared at first that if we restrict our attention to a 
 certain class of reference strings then PRE2A could be proved to be a 
 stack algorithm. We will say that a reference string co satisfies the 
 property P if r, = PRE(x) => (j§ t 1 < t such that r» t , = x or r t , = PRE(x)). 
 
 Initially it was believed that PRE2A is a stack algorithm for 
 (u £ N P( w )}j however, this was found to be false. The reason for 
 this can be explained as follows: Suppose we are in step (3) of PRE2A. 
 For a certain value of c, say c., assume that r = x ^ S (c, ) and let 
 c. be the largest such value of c, =# x e S (c.,+1) (assume that x is 
 set-up in a memory of size c +l). A page fault occurs for c = c,, whereas 
 no page fault occurs for c = c.,+1 =>U, (c, ) = S (c ) + x - y, ye U (c ). 
 But, U^Cc-j+l) = U^c-j+l) U {y e IT ( Cl +1), COUNT(y) > T-l} . Now, the 
 inclusion property for U cannot be proved in general. A similar 
 situation occurs in step (l). We, therefore, modify PRE2A to obtain 
 PRE3A, which is indeed a stack algorithm. 
 PRE3A : 
 [1] If r. +1 = x e N then 
 
 [a] if x e I (c) then (considered to be a page fault) 
 
 V y 6 N t (c), COUNT(y) «- COUM](y) + 1; 
 
 U t+1 (c) = U t (c) + x U{y|y € ^(c), COUM!(y) > T} ; 
 
 S t+1 (c) = S t (c); 
 
 W c) = W c )- u t + i (c) ' 
 
 and RETURN: 
 
hi 
 
 [2] (Page reference to a set-up page) 
 
 If r, . = x € U, (c) then 
 t+1 t 
 
 (increment COUNT for each non-set-up page) 
 
 v y e N , (c) COUNT(y) <- COUNT(y) + 1. 
 
 (declare all non- set-up pages with COUNT > T, set-up) 
 
 U t+1 (c) = U t (c) U {y e N t (c) | COUNT(y) > T} . 
 
 S t + l (c) = S t (c) 
 
 N t+l (c) =S t + l (c) " U t + l (c) - 
 
 [3] (reference to page not in S, (c) => page fault) 
 If r. +1 = x fi S.(c) then 
 
 (bring in the required page, replace a page if 
 necessary and declare all not-set-up pages set-up). 
 U t+1 (c) = S t (c) + x - R A (U t , q, x) 
 
 N t + i (c) -* W c) = u t + i^ c) - 
 
 Note here that R.(U , q, x) denotes the page that would 
 have been replaced by paging algorithm A with the given 
 values of the parameters. 
 
 [k] (a prepage instruction) 
 
 If r = PRE(x) and x ft S (c) then 
 if | S | < c then 
 
 \ +1 (c) = N t (c) + x, 
 
 COUNT (x) <- 
 
 Vi (c) = u t (c) - 
 
 End PRE2A. 
 
ka 
 
 [b] If x e U. (c) then (reference to a set-up page) 
 Vy e N.(c) COUNT(y) *- COUNT(y) + lj 
 
 (c) = U, (c) U{y|y e N. (c), COUNT(y) > T} ; 
 
 11 
 
 t+1 
 
 W c) - s t (c) ' 
 
 N t + i (c) = s t+i (c) -Vi (c) 
 
 and RETURN; 
 [c] If x fi T. (equivalent to saying x ^ S , (c f ) for any c f 
 
 which is equivalent to saying ^ t' < t such that r , = x or 
 r , , = FRE(x) which is equivalent to saying x ^ s_ (k) 
 for any k) then 
 
 if |S,(c)| < c then 
 
 U t+1 (c) = S t (c) + x; N t+1 (c) = cp; 
 
 S t+1^ = U t + l (c) ' 
 
 else U t+1 (c) = S t (c) + x - R A (U^ q, x); 
 
 N t+1 
 and RETURN; 
 
 (c) = <p; S. _(c) = U.^Cc) 
 
 t+1 
 
 t+1 
 
 [d] If x £ S (c) [but x e r ] then 
 
 X ~G 
 
 V y e N,(c) COUNT(y) <- COUNT(y) + 1; 
 
 if | S (c) | < c then 
 
 U ++1 (c) = U (c) + x U{y|y 6 IS Ac), COUNT(y) > T) ; 
 
 t+1 
 
 (c) = S.(c) + x; N +J ,(c) = S,^(c) - U. ..(c); 
 
 t+1 
 
 t+1' 
 
 t+r 
 
 and RETURN; 
 else (|s t (c)| = c) 
 
 U +4 .n( c ) = U.(c) + x - R A (U + , q, x) 
 
 t+1 
 
 t+1 
 
 u (y|y € N t (c), COUNT(y) > T) 
 (c) == S, (c) + x - R fl (U, q, x); 
 
 t 
 
 A v V 
 
 N 
 
 t+l v 
 
 (c) = S +J -(c) - U +j _ n (c); 
 
 t+1 
 
 t+1' 
 
h9 
 
 and RETURN; 
 else RETURN; 
 [2] If j» , = PRE(x), x e N and x ^ S t (c) 
 
 then 
 
 if |S (c) < c then N t+1 (c) = N t ( c ) + x ; 
 
 U t+1 (c) = U t (c); COUNT(x) - 0; 
 
 S t+1 (c) = S t (c) + x; 
 
 and RETURN; 
 END PRE3A; 
 
 We will say that an absolute page fault has occurred at time t+1 
 if r = x and x ^ T, (or any of the equivalent conditions). What this 
 amounts to is that r = x is the first occurrence of x in w. PRE3A is 
 different from PRE2A in that, only on the occurrence of an absolute page 
 fault, all not- set-up pages are declared set-up, whereas in the latter 
 on every page fault this is done. 
 Theorem 3.^- ' 
 
 PRE3A is a stack algorithm for {w|co e N + , P(w)}. 
 Proof : See Appendix B for the proof. 
 
 Theorem 3.^- allows us to construct an efficient one pass algorithm 
 to obtain the paging performance of a given reference string w. Because 
 of the nature of PRE3A, we would expect PRE2A to give better results than 
 PRE3A. Intuitively, it seems that jt(PRE2A, c, to) < n(PRE3A, c, w), 
 though we have not been able to prove this relation. However, results 
 of applying PRE3A to a certain reference string does tell us, at least 
 approximately, the performance of PRE2A. We note that for each of the 
 algorithms PREiA we can create another algorithm, say FREEPREiA, which 
 
50 
 
 incorporates freeing and thereby improve the paging performance. Clearly, 
 FREEPRE3A Is a stack algorithm which follows from theorem 3.3. 
 
 In both, PRE2A and PRE3A, whenever a page made a transition from 
 not-set-up state to set-up state, we immediately consider it the same as 
 any other page in U, . This means that a prepaged page can be replaced 
 even before it is used at least once. This kind of replacement may lead 
 to increased page traffic. When a PRE(x) has been inserted in the 
 reference string, e ither by the programmer or by the compiler, we have 
 reason to believe that page x will be used in the near future, so it 
 might pay off to keep the page locked in MM until its first use. We now, 
 divide the memory state S (c) into three disjoint sets: U (c) is the 
 set of used pages (used at least once), P (c) is the set of prepaged, 
 set-up but not yet used pages and N (c) is the set of not-set-up pages. 
 We define PRE^A corresponding to PRE2A as follows : 
 PREVIA : 
 [1] If r = x e N then 
 
 [a] if x e N (c) then (page fault) 
 
 U t + l (c)=U t (c) + X 
 P t+1 (c) = P t (c) + N t (c) - x 
 
 N 
 
 t+1 
 
 (c) = fi and RETURN; 
 
 [b] if x € P, (c) then (success) 
 
 Vy e n (c) COUNT(y) - COUM'(y) + 1; 
 
 P t+l (c) = p t ( c ) - x U (yk 6 N t (c), CCOTT(y) > T}; 
 
 U t + l (c ) - U t ( c ) + ^ 
 
 N t+1 ( c ). = (y G N t (c) | COUNT(y) < T} ; 
 
 t+1 
 and RETURN; 
 
51 
 
 [c] if x e U, (c) then (success) 
 
 V y e N (c), COUNT(y) *- COUNT(y) + 1; 
 
 Vi (c) =u t (c); 
 
 P t+1 (c) = P t (c) U {y\y e N t (c), COUNT(y) > T} ; 
 N t+1 (c) = {y e N t (c) | COUNT(y) < T} ; 
 and RETURN; 
 
 [d] if x/ S (c) then (page fault) 
 
 U t+l (c) = U t (c) + X ' V U f q ' x); 
 P t+1 (c) = P t (c) U N t (c); 
 
 N t+1 (c) = P; 
 
 and RETURN; 
 [2] If r = PRE(x) then if x j. S then 
 if | S ( c ) | < c then 
 
 N. +1 (c) = N (c) + x; COUNT(x) <- 0; 
 
 P t + l (c) =P t (c)? U t + l (c) = V c) ' 
 END PRE^+A; 
 
 Quite clearly, PRE^A is not a stack algorithm. We can define 
 
 another algorithm PRE5A which does a not-set-up to set-up conversion only 
 
 on the occurrence of an absolute page fault. PRE5A will be proved to 
 
 be a stack algorithm. 
 
 PRE5A: 
 
 [1] If r t+1 - x e N then 
 
 [a] if x e N (c) then (a page fault) 
 
 Vy e N t (c),C0UNT(y) «- COUNT(y) + 1; 
 
 p t+i = p t u {y € N t' x I C0UWT (y) > T ^ 
 
 U t + l =U t +x; 
 
 N fc+1 = {y € N t -x I COUNT(y) < T) ; 
 
 and RETURN; 
 
52 
 
 [b] if x e P, (c) then (success) 
 
 V y e N , COUNT(y) «- COUNT(y) + 1; 
 P t+1 = P t -x U{y e N t | COUNT(y) > T} ; 
 
 U t + 1 = U t + X ' 
 
 N t+1 = {y e N t | COUNT(y) < T} ; 
 
 and RETURN; 
 
 [c] if x e U, then (success) 
 
 Vy e N , COUNT (y) «- COUNT(y) + 1; 
 
 u t + i ■ V 
 
 p t+i = p t U{y € N t I C0UKT (y) > T ); 
 
 N t+1 = {y € N t | COUNT(y) < T} ; 
 and RETURN; 
 
 [d] if x ft T, then (an absolute page fault) 
 
 U t + 1 " U t + X " W *< X); 
 
 P t + 1 " P t U V 
 
 H t + 1 " * 
 and RETURN; 
 
 [e] if x ft S then (a page fault) 
 
 Vy e N , COUNT(y) «- COUNT(y) + 1; 
 
 VI = U t + X " V U t' *> X) ' 
 
 p t+i = p t U{y e N t I C0UNT (y) - T )5 
 
 N t+l = fy e N t I C0UNT (y) < T}; 
 and RETURN; 
 [2] if r t+1 = PRE(x) and x ft S and 
 
 I S^_ ( c ) | < c then (prepage the required page) 
 
 Vi ■ u t 
 
53 
 
 N. -, = N. + x; COUNT(x) «- 0; 
 t+1 t 
 
 P = P ; and RETURN; 
 END PRE5A; 
 Theorem 3.5 : 
 
 PRE5A is a stack algorithm for (ue N |P(w)}. 
 
 The proof is very similar to the proof of theorem 3.*+ so we 
 omit it here. 
 
 When faced with the choice between the use of PREljA and PRE5A, we 
 recommend the use of FREljA. However, for efficiency in performance mea- 
 surement, one could use PRE5A. In the following theorem we prove that the 
 performance of PRE^+A is no worse than the performance of PRE5A which 
 explains the comments just made about these two algorithms. 
 Theorem 3.6 : 
 
 jt(PRE^+A, c, w) < it(PRE5A, c, w). 
 Proof : We will first prove that V t > 1, 
 
 S t (PREl+A, C, u>) = S (PRE5A, c, u>), 
 U (PRE^A, c, o>) = U t (PRE5A, c, w), 
 P t (PREl+A, c, u) =3 P t (PRE5A, c, w). 
 
 From these, the required result follows since a page fault 
 occurs when r, , = x e S, (c) but x ^ P, (c), x ^ U, (c). 
 
 The proof is obtained by studying the inputs and outputs to the 
 sets P, , U,, N, . We draw a set input-output diagram in which three 
 circles are the sets P., N , U, . The arcs are the input and outputs 
 to these sets. Labels to the arcs are of the form d/X where d is the 
 step of the algorithm in which this arc is relevant and X is the set 
 of pages to which this arc applies. 
 
5k 
 
 For PREUA, the diagram is 
 
 For PRE5A, the diagram is: 
 
 Note that the following relation between the steps of 
 
 the two algorithms holds : 
 
 PRE^A. (Id) = PRE5A. (id) U PRE5A. (le). 
 
 and PREUA. (2) = PRE5A. (2). 
 
 =» PREUA. S = PRE5A. S t V t > 1, Vc>2. 
 
 Input to U, is strictly a function of go and output from U^ occurs 
 only in steps l(d) and l(e) therefore by the above step relations, 
 PREUA. U = PRE5A. U, V t > 1, Vc>2. 
 
55 
 Then clearly, "by the arrows going from N to P we conclude that 
 
 PREUa. P 2 PRE5A. P Vt>l, Vc>2, 
 
 This gives the required result. □ 
 
 Note that, freeing can be added to any PREdA d = 1, 2, 3, k, 5 
 and we obtain the corresponding FREEPREdA with a better performance, 
 i.e., jr(PREdA) > jt( FREEPREdA). We can also prove that FREEPRE3A and 
 FREEPRE5A are stack algorithms for (w £ H P( w )}. We can also show 
 that it (FREEPREDA, c, co) < jt(FREEPRE5A, c, w) which means that, though 
 FREEPREDA is recommended for use in practice, performance can be bounded 
 by that of FREEPRE5A in lesser time. 
 
 Note that each of these prepaging algorithms can be reduced to 
 a demand prepaging algorithm by removing the mechanism of time -dependent 
 set-up of a not- set-up page and allow set-up only on the occurrence of 
 a page fault. To implement this in practice, we do the following: on 
 the occurrence of a PRE(x) in w, a prepage bit is turned on in the page 
 table entry associated with x. At the time of the next page fault, say, 
 for a page y, we issue page pull commands for y and all the pages with 
 prepage bit turned on (subject to the memory size constraint). We will 
 label (FREE)PREdA, after the above transformation as (FREE)DPREdA. We 
 note that, 
 
 it(DPMIN, c, w) < jt(FREEDPREdA, c, w). 
 
 We now have to show that these new paging algorithms do achieve 
 something for us in practice. We will use FREELRU and FREEDPREULRU and 
 study the behavior "of these two algorithms as compared to LRU, MIN and 
 DPMIN. We will carry out our experiments on several matrix problems in 
 Chapter 5 and report the results in that chapter. 
 
56 
 
 3.2.2.1 Joseph's OBL Algorithm [J0SE70] 
 
 So far, we have only discussed deterministic prepaging, i.e., the 
 prepaging specified by the programmer or compiler. In the deterministic 
 case, we assumed that the prediction is always correct. Prepaging 
 schemes developed by Joseph [J0SE70] can be called probabilistic 
 prepaging schemes. These schemes are implemented in the operating 
 systems and does not need a prescan of the program. They are based on 
 the observation that a program tends to refer to its pages sequentially. 
 Joseph has shown experimentally that such prepaging schemes do reduce 
 page faults. We would like to show analytically that such is the case. 
 Clearly, for such analytical work, we should have a program behavior 
 model. We will define a sequential-random model of program behavior 
 for this purpose. Before that, we will define a demand prepaging 
 algorithm based on the OBL algorithm of Joseph. We will follow the 
 approach of King [KING71] to analyze the behavior of this algorithm. 
 
 DPOBLLRU = (V, Q, N, S Q , q Q , g), where V = {S|S c N, |S| = c} 
 the set of memory states. S Q , the initial memory state, q Q , is the initial 
 control state, Q is the set of control states q = (j,, j , ..., j ) where 
 j £ N (V k e [1, c] ) and j. ^ j . for i / I (i.e., they are all distinct), 
 g is the transition function, g: V XQXN-> V X Q. 
 
 r (S, q') if x e S 
 g(S, q, x) = / (S", q") if x ft S, x © 1 £ S 
 ^ (S ,n , q m ) if x { S, x © 1 i S. 
 
57 
 
 Where x © 1 denotes x + 1 mod n, 
 
 q' = ($!> >>> J k _^ J k+1 > >-> 3 Q , x ) where ^ = x. 
 S" = if x © 1 / j ± then S - J_ + x; 
 
 else S - j p + x; 
 q" = if x © 1 / J x then (jg, . . . , o q , x); 
 
 eJ.se \d-,j J^> • ••* 0„^ x /j 
 S ,M = S - j 1 - j 2 + x + (x © 1); 
 q'" = (x © 1, J 3 , . .., j c , x); 
 MD DPOBLLRU; 
 
 From the algorithm, it is clear that given q = (j' , . .., j ) we 
 can conclude that S = (j , . .., j }. Therefore, a configuration (S, q) 
 of the algorithm is completely specified by specifying q. Therefore, 
 the set of configurations is the set Q. Our first objective is to seek 
 the number of states in Q. 
 
 Definition: A state q e Q is said to have a property z (i.e., z(q)) if 
 3 x e N such that x and x © 1 both occur in q. 
 Lemma 3.2 : 
 
 z(q ) -* V q e Q, z(q). 
 Proof : The proof is by induction. In the transition function g of 
 DPOBLLRU, assume that z(q) and we will prove that z(q'), z(q") and z(q™ ). 
 Observe that the property z is invariant under a permutation of its 
 argument. Therefore, z(q) =>z(q') since q' is a permutation of q. 
 
 Since x © 1 e S and from the definition of S", x © 1 € S". Also 
 from the definition of S", x e S" therfore z(q"). From the definition 
 of q'", we have z(q'" ). The result of the lemma immediately follows. □ 
 
58 
 
 Lemma 3« 3 '• 
 
 In I n v> (n-c)! n 
 
 l Q l = P c _ (n-2c)!(n-c) 
 
 where P denotes the c-permutations out of n objects, 
 c 
 
 Proof : Berge [BERG71, p. 31] has proved that the number of subsets of 
 size c (from the set N of n elements), which do not contain either two 
 consecutive integers or both 1 and n simultaneously, is given by 
 
 ***.t \ / n_c \ P 
 f*(n,c) = ( ) . 
 
 v ' ' v c y n-c 
 
 From Lemma 3.2 it follows that |q| = n P - c!f*(n, c). 
 
 Substituting for f*(n, c), we get the required result. 
 
 It has been mentioned that the set of configurations (Q,) of 
 
 DPOBLLRU form a Markov Chain. Assume that the transition probability 
 
 matrix P = (p , , ) for the Markov Chain is given. 
 q ; q 
 
 Lemma 3.^- s 
 
 The Markov Chain of configurations of DPOBLLRU is irreducible 
 provided Pr[r, =i] ^ 0, V i e N. 
 Proof : It is sufficient to show that V q e Q, V q 1 e Q, 3 k > 0, such 
 
 that p , > where p , represents the probability of a k-step 
 transition from state q' to state q. Starting with an arbitrary state 
 q f € Q we will show a finite sequence of transitions to an arbitrary 
 state q e Q and thus we will prove the lemma. 
 Let q = (i,, ..., i ) and 
 
 q r = (j^ ..., j ) and q / q'. 
 By applying Lemma 3.2 to q, we get, 3 i e N such that both i and i © 1 
 occur in q. Let j„.= i, 1<^<c, and i =i©l, l<s<c, s ^ £. 
 Without loss of generality, assume that s > $,. We present a sequence 
 
59 
 
 of states starting from q' and leading to q and strictly following the 
 transition function g. Clearly, there is an associated reference 
 string, u) , such that g*(q', w ) = q. Note here that g* is an extension 
 of g and that we have omitted the memory state S, since it is completely- 
 specified by the control state. Now, since we give a transition from 
 q' to q in a finite number of steps, oo is a finite string. And since by 
 our assumption p(co ) > 0, we get the required result. [Note that in the 
 following, an asterisk means unspecified or irrelevant page number.] 
 
 q T -* q-L = {*, *>3 1 ) x = j 
 
 q-j_ - Qg = (*, *,3 1} J 2 ) 
 
 x = J, 
 
 H-2 ^£ 
 
 q i-l ~ (*"--*>Ji>J2' ,, * t ^-l) 
 
 q i-l "" q i = (*>•••*> Ji> J 
 
 £■ 
 2 ,,,,J "i-l ,J ' / g+l 
 
 x 
 
 'i-1 
 
 x = J 
 
 4+1 
 
 q s-3 _> q s-2 ~ (*"-*>Ji>***'^-l'^+l' , * ,J 's-l' ) 
 
 L s-2 
 
 q s -l = (**..** J^... Jj.!^ 
 
 i+r'^s-i^s+i- 
 
 x = J 
 
 x = J 
 
 s-1 
 
 s+1 
 
 ^-2 
 
 q G-2 ~~ ^*'*' 5 l" , ^i-.l' J ^.l' ,,,; 's-l'' J B+l ,,,, « J c' 
 
 q c-i = W b >3 1 9-">3i_ 1 j3 
 
 £-l' d £+l' • ,,d s-l' J s+l 
 
 > • • • j' c > jjjj 
 
 X 
 
 x = J 
 
 Thus with a reference string, 
 
 3 1' 
 
 ' J i-1' d 4+l' *•" d s-l' d s+l> **' d c' 
 
 we have reached the state q . Now with the following reference string, 
 
 u 3 = a *l' "" V d i+l' "•' J s' d s+l 
 
 • • • j J > 
 
6o 
 
 we will reach the state q. Note that all the intermediate states, while 
 going from q to q, are permutations of q. Thus U) = U u is the 
 required reference string. Clearly, to is finite and therefore we have 
 the required result. □ 
 
 We define the sequential-random model of program behavior as 
 follows : 
 
 Pr[r t+1 =j | r t =l] = v ± . is given by: 
 
 p io 
 
 P-L if i = 3, 
 
 p 2 if j = i © 1, 
 P otherwise 
 
 We require that p. > Pp > p„ > and P, + P p + (n-2) p = 1. We note 
 that this model is closer to the behavior of real programs as compared 
 to the independent reference model. 
 Conjecture 3.1: 
 
 m(DP0BLLRU, c) < m(LRU, c) for the sequential-random model of 
 program behavior. Where m(A, c) indicates the long term average page 
 fault probability working with algorithm A and with c page frames of MM. 
 
 Unfortunately, we have not been able to prove this theorem. 
 However, we can prove that m(DP0BLLRU, 2) < m(LRU, 2). 
 
 3.3 PWS Algorithm 
 
 We assume that the program's reference string is fully known in 
 advance. With this assumption, we will propose a new paging algorithm, 
 PWS which will incur zero page faults, on the average. We will also 
 derive the expressions for page pull probability and the average number 
 
61 
 of memory page frames required and compare the results with WS algorithm 
 of Denning [DENN68a] . 
 
 Assume that w = r, , . .., r is the given reference string. Define 
 the 'futuristic' working-set W*(t, T) by: 
 
 W*(t, X) = {x e N | at', t < t' < t +T , r., = x). 
 
 In other words, W* is the contents of a window of size T , looking forward. 
 The purpose of algorithm PWS is to keep W* in MM. At any time t, if 
 r, T e W*(t, 7 ) then nothing needs to be done; otherwise, we issue 
 instructions to fetch this page and thereby prepage it. We also 
 determine whether r, , e W*(t, T ) and if not, we either push it from 
 MM or declare it dead. 
 
 If it is assumed that T > T, where T is the page -fetch time, 
 then since a fetch instruction for a page is issued T time units in 
 advance of its use, we will have no page faults whatsoever. In other words, 
 we have m(l¥S, T ) = if 1 > T. 
 
 We will now study properties of this model. From the above 
 definitions the following properties can be proved. (We denote |w*(t,T )| 
 by u*(t,CT).) 
 PI: a) W*(t, 0) = jZ5 which implies w*(t, 0) = 
 
 b) W*(L, J) - {r L }, V T > 0. 
 
 c) w*(t, J ) is a monotonically non-decreasing function of J~ . 
 
 d) tjj*(t, J) is concave downwards. To see this, note that, 
 
 W*(t, 2-7) = W*(t,0" ) U W*(t+ T, J). 
 
 Therefore, . w*(t, 2'J) < w*(t, J ) + u»(t+CT, j), 
 since the working sets may overlap. 
 
62 
 
 Assuming statistical regularity, w*(t+ 7, 1 ) behaves, on the 
 average, like w*(t, 7) 
 
 therefore, 
 
 u*(t, 27) < 2w*(t, T). 
 
 Hence we have the concavity property, 
 e) If t < L - 1 then W*(t, 7") = {x e N | x = r, ,, t < t * < L} . 
 
 Let us denote 
 
 1 k 
 k k t=l 
 
 And let the average working set size s*(T) = s*(T). The page pulling 
 
 rate m*(T) measures the number of pages per unit time returning to the 
 
 working set (note that, m*(T) will be an upper bound, since a page which 
 
 has left W*(t, X ) may still be in MM). 
 
 Let A(t, 7 ) = 
 
 if 
 
 t+7 
 
 i W*(t,T) 
 
 J otherwise. (t > l) 
 
 Note that, A(t, j) = V t > L - 7 . Let m* = A(0, T) = {r^ ..., r^ } 
 
 The page pulling rate is given by: 
 
 m*(T) = Lim. if 
 k-*L 
 
 k-1 
 Z A(t,T) + m* 
 Lt=l 
 
 
 
 The interference distribution F. (x) and the interference density function 
 
 f.(x) as well as the relative frequency of reference X. have the same 
 
 meaning as in Denning' s work which is outlined in Chapter 2. Let the mean 
 
 n 
 
 overall reference interval x = L X. x. , x. = Z x f. (x). The page i is 
 
 x=l x>0 
 
 called 'recurrent' if \. ^ and j denotes the number of recurrent pages in N. 
 
 If \. f then x. = i/\ . 
 i i ' i 
 
63 
 
 P2 : s*(T) is nondecreasing in 7 and is bounded above and below: 
 
 1 = s*(l) < s*(T) < s*(T+l) < min.{n, 7 +1). 
 
 Proof : P(t,j)cP(t,T+l)4 w*(t, 7) < w*(t, 7+1). 
 
 => s*(T) < s*(T+l). Also, 
 
 w*(t, 1) = 1=» s*(l) - 1. 
 
 w*(t, 7+1) < min. {n, J +1} =$ s*(j +l) < min. {n, 7 +1) . 
 
 m* 
 P3: s*(7+l) - s*(7) = m*(T) - -£ . 
 
 m 
 Note that -=- is a very small number compared to m*(7) and can be 
 
 ignored for most purposes, particularly if L is very large. 
 Proof : 
 
 W*(t, 7+1) - W*(t, J) E ( r t+ T } * 
 
 Therefore, u*(t, 7+1) - w*(t, 7) = A(t, 7 ). 
 
 Therefore, Lt 
 k-+L 
 
 • k ± k 1 
 
 - Z u*(t, T+l) - - Z w*(t, 7) 
 . t=l k t=l 
 
 = Lt rr I A(t, 7 ). 
 
 k->L 
 
 t=l 
 
 Therefore, s*(7+l) - s*(7) = Lt 
 
 k->L 
 
 r\ k-1 
 t- Z A(t, 7 ) + m* 
 
 t=l 
 
 m* 
 
 + A(L,7) _ ^0 
 
 L L * 
 
 m o 
 
 m*(T). 
 
 D 
 
6k 
 
 y 
 PU: -£ < m*(7+l) < ™*(t) < m*(0) = 1. 
 
 which states that m*( ) is a nonincreasing function of T and is bounded 
 
 above and below. This follows immediately from P2 and P3. 
 
 P5: m*(T) = 1 - F(T) + -j% 
 
 which states that m*(T) can be regarded as the probability that x > "J . 
 
 Proof: Define the binary variable B. (t, x) = 1 iff r =i ; ^u<t+J 
 
 such that r = i and t + T - u < x. Define n. (k) as the number of 
 u — 1 
 
 references to page i in r , ..., r . The definition of P(x) can be 
 
 1 K. 
 
 expressed as: 
 
 1 
 
 n n n. (k) 
 
 F(x) = Z X. F. (x) = Z Lim. -~ • —7—7 
 
 .-,11 k n. (k)-l , _ w i 
 
 i=l i=l k->L 1 v ' t=l 
 
 k-T 
 
 Z ft. (t, x) 
 
 1 
 
 k-T n 
 
 Lim. - Z Z 6. (t, x) 
 k-»L t=l 1=1 1 
 
 since n. (k)-l 
 1 
 
 ~ n.(k) 
 
 n 
 
 Define e(t, x) = Z 6. (t, x) and observe that fi(t, x) = 1 - A(t, x), 
 
 i=l 
 
 k-T 
 
 Therefore, 1 - F(x)=Lim. - Z A(t, x) 
 
 k-*L t=l 
 
 = m*(x) - m */L. 
 This proves P5. 
 P6: m*(T+l) - m*(T) = - f(T+l). 
 
 This follows immediately from P5. 
 
 T-l 
 
 T-l 
 
 m o. 
 
 P7: s*(T) = 2 m*(z) = Z (l - F(z) + -^) 
 z=0 z=0 L 
 
 T-l 
 = Z Z f(y) + m^ n T/L. 
 z=0 y<z 
 
 
 
65 
 
 This follows immediately from P5 and P6. 
 
 Since m* T/L is very small for large values of L (i.e., for long 
 
 T-l 
 reference strings), s*(T) ~ 2 ' ^ f(y)« 
 
 z=0 y<z 
 
 Theorem 3.8 : 
 
 a) m(PWS, T ) = if J > T, 
 
 b) m*(lWS, T ) 2 m(WS, T ), 
 
 c) s^(PWS, ? ) ~ s(WS, T ), 
 
 d) ST(PWS, T ) « ST(WS, J ). 
 
 Proof : Part a of the theorem has been already proved when we introduced 
 the algorithm PWS. Part b follows from the property P5 of PWS and the 
 property PU of WS, given in section 2. k- Part c follows from the 
 property P7 of PWS and the property P6 of WS, given in section 2. k» 
 Finally, the last part can be shown as follows : 
 
 ST(PWS, 7 ) = 2 w*(x) *(ot*) 
 
 where St* = average execution burst plus the average 
 
 page wait time plus the average ready queue delay. 
 
 w*(*7") is the average working set size during the 
 
 interval St*. 
 Similarly, ST(WS, *T ) = Z w(T) * t. 
 
 Now since approximately w*(T ) ~ co(t ) and 6t* « 6t since we 
 have no page faults in PWS,St*~ average execution burst. Therefore, 
 we have the required result. □ 
 
 Implication of Theorem 3.8 is that we have defined a paging 
 algorithm with zero. page faults which pulls the same number of pages 
 as the WS algorithm and which requires the same average working set size 
 
66 
 
 as WS. We have also proved that the memory utilization (ST product) of 
 EWS is highly superior to that of WS. The main drawback, however, is that 
 IWS is unrealizable. If partial information is available about future 
 working sets then we may be able to utilize it to improve performance 
 over WS and at least approach the performance of PWS. We feel that 
 further investigations should be made in this direction. We will add 
 some comments to this topic in Chapter 6. 
 
 We have ignored several points in the above discussion of PWS 
 which we now discuss. Define the residency of a page in MM as the 
 fraction of time it is potentially available in MM. We assume that a 
 page in the working set is never replaced. Once a page has entered 
 W*(t, T ), it will remain in MM for at least T time units. Let x be 
 the interreference interval for the page i. We have: 
 
 1) If x < J , the page residency is IQQPjo. 
 
 2) If 7" < x then the following diagram gives us the page 
 residency. 
 
 x 
 
 -> 
 
 t-T 
 
 t+T 
 
 t+x 
 
67 
 
 If during the last residency, the page was updated, then the 
 above diagram requires that t+x-y>t+T in order for the page 
 to be released at time t. Which means that if x > T +7 then only the 
 page i can leave memory. Otherwise (i.e., T < x < T +T ) then if the 
 page was allowed to leave memory and called again after x time units then 
 the contents of page i will be the old contents of page i (i.e., before 
 the update during the last residency). Therefore, in such a case, the 
 page must be kept in MM and it will always be resident. This means 
 that more than uj*(t, T) pages will occupy MM. To implement this in 
 practice, we assume that the reference string w = r,, r , ..., r is 
 such that r, = (y, k) where y e N is the current page referenced and k is 
 the time to next reference to y. The value of k will be zero if page y 
 if not referenced again. The modified PWS, therefore, can be written 
 as follows : 
 
 MEWS: 
 
 (assume y > T) 
 
 Vt >1 let r twl = (y, k) and r t+/J = (y', k'). 
 
 1) If k = or k> T + T then 
 
 FREE(y) 
 
 2) If y*£ W*(t, 7 ) then PREPAGE (y f ). 
 EHD MPWS; 
 
68 
 
 k. IMPROVING LOCALITY OF ARRAY PROGRAMS 
 
 k.O Introduction 
 
 Locality of programs can "be improved in various ways, as noted 
 in Chapter 2. By proper organization of data among the data pages, we 
 can reduce the size of the localities as well as the probability of an 
 interlocality transition. This has been termed the problem of data 
 pagination. For matrix algorithms operating on large arrays, we need 
 to consider various techniques for storing them, such as, row major 
 order, column major order, packed row storage, packed column storage, 
 or submatrix (block) storage. Several investigations have indicated 
 that submatrix storage is preferable to other storage schemes [MCKE69, 
 ROGE73]. We shall restrict our investigation to the submatrix type of 
 organization. For large matrix problems, it should be clear that 
 instruction paging is dominated by data paging; for this reason we 
 ignore instruction paging. 
 
 With respect to improving locality, it is possible, in certain 
 instances, to resequence some of the operations in a program so that we 
 operate on the data that is already resident in MM rather than execute 
 unnecessary paging. This amounts to a logical reorganization of the 
 code as against pagination, which does a physical reorganization. In 
 this chapter, we will include a discussion of some of these methods of 
 logical reorganization for matrix programs. We will assume that these 
 
69 
 
 modifications are done by the programmer. The problem of automating some 
 of these methods will be considered in Chapter 6. 
 
 Since we are interested in improving locality, we are also 
 interested in measuring these improvements. We feel that the average 
 working set size s(*X) as a function of T is a very good indication of 
 the locality of a program. We also measure the number of page faults 
 using LRU, WS and MIN paging algorithms. 
 
 We will present three techniques for the improvement of locality 
 of matrix programs. We will illustrate their use by applying them on 
 several common matrix algorithms. We will show that an order of magnitude 
 improvement in locality can be obtained by these methods. 
 
 We now consider the notation for storing matrices in VA space. 
 Let A be a rectangular matrix of size (n,Xn p ). The matrix will be 
 
 divided into square submatrices of order m as shown in Figure k.l. If 
 
 2 
 we assume that p, the page size, is a perfect square and that m = p 
 
 then each of the submatrices can be stored in one page. Note that there 
 
 will be some submatrices On the right and the bottom borders which will 
 
 not be full. We will assume that n, and n p are both multiples of m 
 
 and therefore, such fragmentation will not occur. If the matrix A is 
 
 a square matrix then n = n p = n. We define the integers N, N, and N p 
 
 through the equations n = N.m, n, = E, «m and n p = N p «m. Therefore, the 
 
 number of data pages occupied by a n Xn matrix is N «N , by a nXn 
 
 matrix is N and by a triangular matrix of order n is N(N+l)/2. 
 
 *+. 1 Cholesky Decomposition 
 
 We start with a particular matrix algorithm, namely, the Cholesky 
 factorization of a symmetric, positive definite matrix. We will try to 
 
70 
 
 
 — m — 
 
 
 
 
 
 
 m 
 
 i 
 
 An 
 
 A 12 
 
 
 Ain 2 
 
 n 
 
 
 
 A21 
 
 A22 
 
 
 A2N2 
 
 
 
 
 
 
 
 l 
 
 
 AnxI 
 
 A Nl 2 
 
 
 A Nl N 2 
 
 
 n 2 
 
 Figure k* 1. Submatrix Storage 
 
71 
 improve the locality of this algorithm in various ways. This will, then, 
 
 lead us to general methods of locality improvement. We will try these 
 
 methods on same other typical matrix algorithms. 
 
 Cholesky decomposition factors a symmetric positive definite 
 
 matrix A into the product GXG' where G is a lower triangular matrix and 
 
 G 1 denotes the transpose of G [WILK71] . The algorithm is defined by 
 
 the following relations : 
 
 k-1 2 
 §kk = sqrt(a kk " ;= g ij } 
 
 for 1 < k < n 
 
 ik = (a ik " E 1 g jk SikVSkk' 
 .1=1 
 
 
 
 Where A = fa. .1 is an nXn matrix, and G = fg. .1 is a lower triangular 
 matrix of order n. We recall that the basic algorithm in PL/ I like 
 language is as in Figure k.2. Note that, we operate only on the lower 
 triangular part of A and we overwrite G onto A. We have omitted all 
 declarations, input, output and such other statements from the program. 
 Note that, for our measurements, we will use a matrix (A) of order 
 
 n = 2k, a page size p = 16 (=> m = k). We note that the first inner loop, 
 
 th 
 'DO J = 1 TO k - 1', scans the k row of the matrix A. This is done 
 
 with each traversal of the major loop (with a different value of k). 
 
 Since several (in fact, m) rows are stored across in a row of pages, it 
 
 is clear that the locality will be improved if we alternately traverse 
 
 the rows in opposite directions. We observe that these traversals are 
 
 possible because the order of computation within the loop is immaterial. 
 
 By appropriately reversing the other two loops, we will improve the 
 
 locality in a similar way. The resulting program is labelled CDR, which 
 
 is given in Figure k.3. 
 
72 
 
 CD: /* CHOLESKY DECOMPOSITION */ 
 DO k = 1 TO n; 
 S - 0; 
 DO J = 1 TO k - 1; 
 
 S = S + A(k, J) * A(k, j); 
 END; 
 
 A(k, k) = SOJRT(A(k, k) - S); 
 DO I = 1 TO n - k; 
 S = 0; 
 DO J = 1 TO k - 1; 
 
 S = S + A(k+I, J) * A(k, J); 
 END; 
 
 A(k+I, k) = (A(k+I, k) - S)/A(k, k); 
 END; 
 END CD; 
 
 Figure 1^.2. Cholesky Decomposition 
 
73 
 
 CDR: 
 
 DO k = 1 TO n; 
 S = 0; 
 IF MOD(k, 2) = 1 THEN 
 
 DO; JLL = 1; JHL = k - 1; JSTEP = 1; END; 
 
 ELSE 
 DO; JLL = k - 1; JHL = 1; JSTEP = -1; END; 
 DO J = JLL TO JHL BY JSTEP; 
 S - S + A(k, J) * A(k, J); 
 END; 
 
 A(k, k) = SQRT(A(k, k) - S); 
 IF MOD(k, 2) - 1 THEN 
 
 DO; IHL = n - k; ILL = 1; ISTEP = 1; END; 
 
 ELSE 
 DO; IHL * 1; ILL = n - k; ISTEP = -1; END; 
 DO I = ILL TO IHL BY ISTEP; 
 IF MOD (I, 2) = 1 THEN 
 
 DO; JLL = 1; JHL = k - 1; JSTEP = 1; END; 
 
 ELSE 
 DO; JLL = k - 1; JHL = 1; JSTEP - -1; END; 
 S = 0; 
 DO J = JLL TO JHL BY JSTEP; 
 
 S = S + A(k+I, J) * A(k, J); 
 END; 
 
 A(k+I, k) - (A(k+I, k) - S)/A(k, k); 
 END CDR: 
 
 Figure I4..3. Cholesky Decomposition with Reversal 
 
Ik 
 
 In Figure h.k, we have plotted the average working set size s(0") 
 as a function of the window width 1 for CD and CDR. The improvement of 
 locality due to the reversal technique is clearly shown. 
 
 The locality of a program, such as CD, may he improved more 
 significantly if we extend the algebra of the language to include 
 submatrix operations. We sketch the program using the 0L./2 language 
 [PHIL72]. The names of all OL/2 programs will be prefixed by the letter 
 '0'. In Figure h-5, we give the program OCD. 
 
 The order of the element by element computation of (C - M X R') 
 is unspecified in OCD. Since a matrix-vector multiplication is specified 
 by H x R 1 , we have the choice of implementing the multiplication by a 
 submatrix multiplication and thereby improve the locality. We will 
 briefly explain what we mean by submatrix multiplication. Assume we 
 are carrying out Z = X * Y. Where X, Y and Z are matrices of order n. 
 We shall denote the (I, j) submatrix of the matrix X by the usual 
 subscript notation, e.g., X . Multiplication of X and Y can, therefore, 
 
 be written as follows : 
 
 N 
 
 \ = T X * Y 
 IJ ,\ A IK KJ 
 k=l 
 
 1 < I, J < N. 
 
 Note that, the sum and product in the above equation denotes the matrix 
 sum and matrix product for matrices of order m. We apply this submatrix 
 operation to modify CD and obtain the results shown in Figure h.7 for 
 the program CDM in Figure k.6. 
 
 We have plotted, in Figure k.7, s(t) vs. J for CD and CDM. 
 We note that the improvement obtained by this method is much more than 
 that obtained by the method of loop reversal. 
 
75 
 
 CDR 
 
 Figure k>k* Localities of CD and CDR 
 
76 
 
 OCD: PROC(A, n); 
 LET A BE A LOWER TRIANGULAR MATRIX OF 
 ORDER (n); 
 
 TOR K = 1, 2, ..., n; PARTITION A AFTER ROWS 
 K - 1, K; 
 SET R = A <2,1> ROW VECTOR, 
 M = A <3, 1> , 
 
 D = SQRT(D - (R*, R'))j 
 C = (C - M X R' )/D; 
 END OCD; 
 
 D = A <2,2> SCALAR, • 
 
 C = A <3,2> COLUMN VECTOR: 
 
 K-l 
 
 K 
 
 Figure k. 5- OL/2 Program for Cholesky Decomposition 
 
77 
 
 CDM: 
 FM = FLOAT (M); 
 
 DO K = 1 TO n; KIM = (K-l)/M; KCM = CEIL(K/FM); 
 S = 0; 
 DO J = 1 TO K - 1; 
 
 S = S + A(K, J) * A(K, J); 
 END; 
 
 A(K, K) - SQ£T(A(K, K) - S); 
 DO J = 1 TO K1FM; 
 DO I = K+l TO KCM * M; 
 S = 0; 
 DO Jl = 1 TO M; 
 
 S = S + A(I, (J-l) * M + Jl) * A(K, (J-1) * M + Jl); 
 END; 
 
 A(I, K) = A(l, K) - S; 
 END; 
 END; 
 DO I = K+l TO KCM * M; 
 S = 0; 
 DO J = K1FM * M + 1 TO K - 1; 
 
 S = S + A(I, J) * A(K, J); 
 END; 
 
 A(l, K) = (A(I, K) - S)/A(K, K); 
 END; 
 DO I = KCM + 1 TO N; 
 DO J = 1 TO K1FM; 
 DO II = 1 TO M; 
 S = 0; 
 
 DO Jl = 1 TO M; 
 S = S + A((l-1) * M + II, (J-1) * M + Jl) * 
 A(K, (J-1) * M + Jl); 
 END; 
 
 A((I-1) * M + II, K) = A((l-1) * M + II, K) - S; 
 END; 
 END; 
 
 DO II = 1 TO M; 
 S = 0; 
 DO J = K1EM * M + 1 TO K - 1; 
 
 S = S + A((I-1) * M + II, J) * A(K, J); 
 END; 
 
 A((I-1) * M + II, K) = (A((I-1) * M + II, K) - S)/A(K, K); 
 END CDM; 
 
 Figure 1^.6. Cholesky Decomposition with Submatrix Multiplication 
 
78 
 
 Figure If. 7. Localities of CD and CDM 
 
79 
 
 We now turn to an even better technique for improving the locality. 
 This requires the construction of a submatrix algorithm and we illustrate 
 it for the Cholesky factorization [H0US6^] . 
 
 G n = CHDLDEC <*n " %. (G u X ^r )} 
 
 
 for 
 
 1 < I < N. 
 
 for I < J < N 
 
 We now present the OL/2 program for this algorithm. A PL/l 
 
 program, CDS, is given in Appendix C. 
 
 OCDS: PROC; 
 
 FOR K = 1, 2, . .., N; PARTITION A AFTER 
 ROWS m * (K-l), m * K; 
 SET R = A <2,1> , D = A <2,2>, 
 
 C = A <3,2> , M = A <3,1>; 
 
 D = D - R X R« 
 
 CALL OCD (D, m); 
 
 C = (C - M X R') X (D*)" 5 
 
 EEL OCDS; 
 
 Note that, D is a lower triangular matrix of order m, R is a 
 
 rectangular matrix of size m X m(k-l). Similarly, C is (N-m)k X m and 
 
 M is (N-m) k X m(k-l) matrices respectively. To carry out the Cholesky 
 
 factorization of D, only one page is involved, and therefore, any method 
 
 can be used without changing the performance. Therefore, we use OCD 
 
 for simplicity. Based on OCDS, we write CDS given in Appendix C. We carry 
 
 out the multiplication M X R' by submatrices. We store the inverse of D' 
 
 in the strictly upper triangular part of D. This way, we will not be able 
 
 to store the diagonal elements of (D')~ . Note that, this storage 
 
 scheme is possible because the page containing the subarray D has 
 
80 
 
 precisely m(m-l)/2 words vacant. Also note that, the diagonal elements 
 of (D* ) are the reciprocals of the corresponding diagonal elements of D. 
 
 In Figure 1+.8, we have plotted s(t) vs. T for CD and CDS. The 
 large improvement in locality is quite obvious. We note that the 
 reversal technique can be applied to both, CDM and CDS obtaining CDMR 
 and CDSR respectively. Improvements thus gained are not significant 
 enough to report. In Figure 1^.9, we have plotted s(T) vs. T for CD, 
 CDR, CDM and CDS. It can be easily seen that the versions of Cholesky 
 factorization in order of increasingly better locality are CD, CDR, CDM 
 and CDS. 
 
 In Figure In 10, we have plotted the number of page fault, jt(t) vs. 
 7 for CD, CDR, CDM and CDS using WS paging algorithm. In Figure k.H 
 and Figure k.12, we have plotted it(c) vs. c (the page allotment) for 
 CD, CDR, CDM and CDS using LRU and MIN paging algorithms respectively. 
 It is clear from these figures that, improved locality implies a 
 reduction in page faults. 
 
 k-2 Other Matrix Algorithms 
 
 From the example of Cholesky decomposition, we observe that there 
 are' three general methods of locality improvement of matrix algorithms. 
 The first method is the method of loop reversal in which we change the 
 direction of loop traversal each time we traverse a loop. We note that, 
 depending on the computation within the loop, this may not always be 
 possible. The second method is to carry out all matrix multiplications 
 occurring within the algorithm as submatrix multiplications. The third 
 method consists in using a submatrix algorithm from the beginning. We 
 
81 
 
 15 10 15 20 T 
 
 Figure lj.,8. Localities of CD and CDS 
 
5 -- 
 
 82 
 
 1 6 — I 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1- 
 
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 f 
 
 Figure k.9. Localities of CD, CDR, CDM and CDS 
 
LOG 2 (tt(t)) 
 
 PAGING ALGORITHM - WS 
 n=24,m = 4, N = 6 
 
 Figure l^.. 10. Page Faults Using WS Algorithm 
 
81+ 
 
 ! 
 
 L0G 2 (7T(C)) 
 
 15 -- 
 
 PAGING ALGORITHM -LRU 
 
 n = 24 
 m = 4 
 N = 6 
 
 
 20 21 
 
 Figure If. 11. Page Faults Using LRU Algorithm 
 
85 
 
 LOG (tt(c)) 
 
 15-- 
 
 PAGING ALGORITHM -MIN 
 
 n = 24 
 m = 4 
 N = 6 
 
 13-- 
 
 10-- 
 
 1 — I — | — I — I — I — I — — I — I — I — I — I — i — I — I — I — — K 
 
 10 
 
 15 
 
 20 
 
 Figure q.,12. Page Faults Using MIN Algorithm 
 
86 
 note that, for several matrix algorithms, their submatrix versions already 
 
 exist in the literature. However, for eigenvalue problems, such submatrix 
 
 algorithms do not exist and probably will not be found in the future. 
 
 We believe that the three methods can be usefully applied to most matrix 
 
 algorithms. We will consider several more algorithms in this chapter. 
 
 We note that, the conclusions regarding the performance improvements by 
 
 these methods hold for any matrix size and any page size, although we 
 
 have considered only one matrix size and one page size. 
 
 If. 2.1 Matrix Multiplication 
 
 Three versions of matrix multiplication we consider are : MM 
 (basic matrix multiplication), MR (matrix multiplication with loop 
 reversal, and MMS (submatrix multiplication). 
 
 In Figure U.13, we plot s( T) vs. *J for these three programs. 
 
 k-2.2 LU Decomposition 
 
 Given a nonsingular matrix A, it can be uniquely decomposed 
 into L x U where L is a lower triangular matrix and U is unit upper 
 triangular. The method is known as Crout EQ decomposition, and can be 
 written as [ISAA66] : 
 
 for k as 1 to n do; 
 
 for i = k to n do: 
 k-1 
 
 I.. = d.. - z I. . u.. ; 
 
 ik ik ij jk 5 
 
 end; 
 
 for j = k+1 to n do; 
 k-1 
 
 \j = Ga kj ".f/kiV/^; 
 
 end; 
 end: 
 
87 
 
 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 30 T 
 
 Figure 1+.13. Localities of MM, MMR and MMS 
 
88 
 
 A straightforward implementation of this will be called IU. 
 
 Appropriate loop reversal turns it into IUR. To obtain LUM, we must 
 
 identify matrix multiplications within the algorithm. Towards this 
 
 end, we write the algorithm in OL/2 and call it OLU. Note that, we 
 
 store L, U over A. 
 
 OLU: PROC(A, n); 
 LET A BE A MATRIX OF ORDER (n); 
 FOR K = 1, 2, ..., n; PARRITION A AFTER ROW 
 K-l AND AFTER COLUMNS K-l and K; 
 SET C = A <2,1>, B = A <1, 3>, X = A <1,2>, Y = A <2,2>, 
 
 M = A <2,3>; 
 PARTITION C, Y, M AFTER ROW 1; 
 SET R = C <1>, D = Y <1>, Z = M <1>; 
 
 Y = Y - C * X; 
 Z = (Z - R * B)/D; 
 END OLU; 
 
 K-l 
 
 K 
 
 B 
 
 K-l 
 
 K 
 
 R 
 
 D 
 
 -M- 
 
89 
 
 From OHJ, we see that C * X is a matrix-vector product and R * B 
 is a vector-matrix product. Coding these products as submatrix products 
 (as in CDM) we obtain HJM (we omit the detailed coding here). Submatrix 
 LU decomposition algorithm can be written as [ISAA66] 
 for K = 1 to N do; 
 
 ( L KK> V - 1U - DEC V " y, L KJ V' 
 
 J— 1 
 
 for I = K + 1 to N do; 
 
 L IK - (A IK " . L IJ V * 'V^ 
 
 J — X 
 
 end; 
 
 for J = K + 1 to N do; 
 
 U KJ - ^Kk'" 1 * (A KJ " £ L KI V; 
 end K; 
 
 This can easily be programmed into OL/2 as well as PL/ I, obtaining 
 
 OLUS and LUS respectively. In both cases, we store L and U on the 
 original storage of A. While programming LUS, we store the inverses 
 of L^. and U on a scratch page (only one scratch page is needed). Once 
 again, we do not give the detailed codes for these. In Figure k.lk, we 
 have plotted s(T) vs. y for LU, LUR, LUM and LUS. We have used 
 n = 2k ( = order of the matrix A), page size p = 16 (implies m = k). 
 Thus the number of pages occupied by A is 6 X 6 = 36 (in general, IT). 
 
 U.2.3 Gaussian Elimination 
 
 Gaussian elimination reduces a nonsingular matrix A into an 
 upper triangular matrix. Gaussian elimination with coefficient storage 
 can be written as [MCKE69] : 
 
90 
 
 1 3 5 7 9 11 13 17 21 25 29 33 37 T 
 
 Figure i4_.Hi-. Localities of LU, LUR, LUM and LUS 
 
91 
 
 for k = 1 to n - 1 do; 
 for i = k + 1 to n do; 
 
 J\\ 
 
 a ik ~ a ik' °kk' 
 end; 
 for j = k + 1 to n do; 
 
 for i = k + 1 to n do; 
 
 a. . s a. . - a., * a, . . 
 
 ij ij ik kj > 
 
 end; 
 end; 
 end; 
 
 A straightforward implementation of this algorithm is called GOS (we omit 
 the coding here). From GOS with loop reversal, we easily obtain GOSR 
 (code omitted here). To see the matrix multiplications within GOS, we 
 write it in OL/2 and call it OGOS. 
 
 OGOS: PROC(A, n); 
 LET A BE A MATRIX OF ORDER (n); 
 FOR K - 1, 2, . .., n-1; PARTITION A AFTER 
 ROWS K-l, K AND AFTER COLUMNS K-l, K; 
 SET D = A <£,2> SCALAR, R = A <2,3> ROW VECTOR, 
 C = A <3,2> COHJMNVECTOR, M = A <3,3>; 
 C = C/D; 
 
 M = (M - C * R); 
 END OGOS; 
 
 K-l 
 K • 
 
 K-l 
 
 K 
 
 A <1,1> 
 
 A <1,2> 
 
 A <1, 3> 
 
 A <2,1> 
 
 D 
 
 R 
 
 A <3,1> 
 
 M 
 
92 
 
 By programming M=M-C*Rby submatrices, we obtain GOSM (code omitted 
 here). Submatrix Gaussian elimination with coefficient storage can be 
 written as [HOUS64, MCKE69] : 
 
 for K = 1 to N do; 
 A^ gos(a kk , m)j 
 
 for J = 1 to m do; 
 
 for i = K*m+1 to n do; 
 
 a i, (K-l)*m+J = a i, (K-l)*m+j/ a (K-l)*m+J, (K-l)*m+J; 
 end; 
 for i = K*m+1 to n do; 
 
 for j = (K-l)*m+J+l to (K-l)*ntf-m; 
 
 a ij = a ij " a i,(K-l)*m+J * a (K-l)*m+J,j\ 
 
 end; 
 
 end: 
 
 for i = (K-l)*m+J+l to K*m do; 
 for j = K*m+1 to n do; 
 
 a ij = a ij " a i,(K-l)*m+J * a (K-l)*m+J,j> 
 end; 
 end; 
 end; 
 
 for I = K + 1 TO N do; 
 for Jl = K + 1 TO N do; 
 
 A = A - A * A 
 I,J1 I,J1 IK K,J1 
 
 end: 
 
 end; 
 end; 
 
 Based on this algorithm, we can write OGCSS and GOSS (code 
 omitted here). We note that both, GOSM and GOSS can be modified by the 
 loop reversal technique. McKeller et al [MCKE69] present a submatrix 
 Gaussian elimination algorithm which, in our terminology, is GOSSR. In 
 Figure k m l$, we have plotted s( J) vs. J for GOS, GOSR, GOSM and GOSS. 
 Once again, behavior similar to Cholesky factorization is observed. 
 
S(T) 
 
 93 
 
 n = 24, m = 4, N=6 
 
 GOS 
 
 < GOSR 
 
 13 17 21 25 29 33 36 
 
 Figure U.15- Localities of GOS, GOSR, GOSM and GOSS 
 
91+ 
 
 l*.2.i| Gram- Schmidt Orthogonalization 
 
 Given a generally tall (n X n„) matrix A of rank n ? (n ? < n ), 
 its decomposition into A = B X R is known as the Gram-Schmidt 
 orthogonalization method. Here B is an (n, X n ) matrix with 
 orthonormalized columns and R is a nonsingular upper triangular matrix 
 of order ru [SCHW73]. It can be written as: 
 
 for k = 1 to n do; 
 
 for j = 1 to k - 1 do; 
 
 n l 
 r.. = Z b. . * a., j 
 
 3 i=l 1J lk 
 for i = 1 to n do: 
 
 b.. = a., -p.. * b. .; 
 
 lk lk jk \y 
 
 end; 
 end; 
 
 
 r n = V Z b . , , 
 kk . , lk 
 i=l 
 
 for i = 1 to n, do; 
 
 b ik = b ik/ r kk' 
 end; 
 end; 
 
 A straightforward implementation of this algorithm is called 
 ORTH (omitted here). We store B over A, thus destroying the original 
 elements of A. A reversal technique applied to ORTH turns it into 
 ORTHR. To identify the matrix multiplications within ORTH, we write it 
 in OL/2 and call it OORTH. 
 
 OORTH: PROC(A, R, n , n ); 
 
 LET A BE A n_ X n^ MATRIX AND LET 
 
 R BE AH UPPER TRIANGULAR MATRIX OF ORDER (n ); 
 FOR K = 1, 2, ..., n 2 ; PARTITION A, R AFTER 
 
 COLUMNS K-l, K; 
 SET Q = A <1>; AK = A <2>, RK = R <1,2>, D = R <2,2>; 
 
95 
 
 EK = Q* * AK; 
 AK - AK - Q * RK; 
 D = (AK, AK); AK = AK/D; 
 END OORTH; 
 
 A 
 
 K-l 
 
 K 
 
 Q ■ AK 
 
 
 If we code the statements RK = Q' * AK and AK = AK - Q * RK by 
 
 submatrices, we obtain ORTHM. Submatrix version of ORTH is [SCH073] 
 
 n n 
 
 written in OL/2, and is called OORTHS. Note that, N = — and N-, = — . 
 1 ' 2 m 1 m 
 
 OORTHS : PROC, 
 LET A BE n X n 2 MATRIX AND R BE AN 
 
 UPPER TRIANGUIAR MATRIX OF ORDER (n ); 
 
 FOR K - 1, 2, ..., N 2 ; PARTITION A, R AFTER 
 
 COLUMNS (K-l) * m, K * m; 
 SET Q = A <2>, D = R <2,2>, M = R <£,3>, G = A <3>; 
 
 CALL OORTH (Q, D, n , m); 
 M = Q' * G; 
 G = G - Q, * M; 
 END OORTH; 
 
96 
 
 (K-l)*m 
 
 (K-l)*m 
 
 This can be easily programmed in PL/ I and is called ORTHS. In 
 Figure If. 16, we have plotted s(T) vs. T for ORTH, ORTHR, ORTHM and 
 ORTHS. Once again, reversal technique can be applied to ORTHM and 
 ORTHS but the improvement thus gained is not significant. 
 
 k» 3 Conclusion 
 
 In all four algorithms considered so far, we have seen that three 
 methods of locality improvement work very well. We have seen that a 
 submatrix version of an algorithm improves the locality to the greatest 
 extent. The submatrix multiplication technique for all internal 
 multiplications is the second best technique for improvement. Reversal 
 technique also improves locality but by a much smaller degree than the 
 other two techniques. We also notice that, locality improving methods 
 cannot reduce page faults for very large core allotments. The reason is 
 that, with a large number of pages allotted, the working set of an 
 
97 
 
 sir) 
 
 8 -- 
 
 7-- 
 
 6-- 
 
 5-- 
 
 4 -- 
 
 3 -- 
 
 2 -- 
 
 fix s 24, n 2 = 16 
 N! * 6 , N 2 = 4 
 
 ORTHR 
 
 ORTHM 
 
 13 17 21 25 29 33 37 39 
 
 Figure 1|. 16. Localities of ORTH, ORTHR, ORTHM and ORTHS 
 
98 
 ill-structured (poor locality) program can also be kept in MM. We have 
 observed that locality improvement methods are effective when c < ^(N). 
 Where (q (N) denotes kN for some constant k. Thus the asymptotic value 
 of page faults is unaffected by locality improvements. It is also clear 
 that the asymptotic value of page faults cannot be changed by changing 
 paging algorithms if we restrict ourselves to demand paging. [There is 
 no replacement required at high core allotments.] In Chapter 5> we will 
 see that the asymptotic value of the number of page faults can be 
 reduced by devising prepaging algorithms. 
 
 At this point, we are in a position to compare the paging 
 performance of the three decomposition algorithms used in solution of 
 linear systems [Cholesky decomposition, LU decomposition and Gaussian 
 elimination]. In Figure h.YJ, we have plotted u(c) vs. c for CD, HJ 
 and COS using LRU paging algorithm. The superiority of Cholesky 
 decomposition is very clear. But, as is well known, it can only be 
 used when the matrix A is symmetric and positive definite. It is also 
 known that this algorithm has minimum operation count and has superior 
 stability property among these three decomposition algorithms [ISAA66] . 
 When the given matrix A is not symmetric or is not positive definite and 
 if we have to choose between HJ decomposition and Gaussian elimination, 
 BJ decomposition is preferable. Also, LU decomposition incurs lesser 
 number of operations and is also more stable [ISAA66], from Figure ^-.17> 
 it is seen that LU decomposition is generally superior to Gaussian 
 elimination in number of page faults also. 
 
 
99 
 
 L0G 2 (7T(c)) 
 
 PAGING ALG. USED - LRU 
 n = 24 , m = 4 
 
 9 13 17 21 25 29 33 37 
 
 Figure 4. 17. Page Faults for CD, LU and GOS 
 
100 
 
 k.k A Note About Measurement of Performance 
 
 We have combined LRU stack algorithm of Mattson et al [MATT70], 
 WS statistics gathering algorithm of Gordon [GORD73] and multivalued MIN 
 algorithm of Belady [BELAY 1 *] into LRU_WS_MIN algorithm. This algorithm 
 obtains the LRU, WS and MIN statistics of a given reference string in 
 one pass for all values of page allotments (up to the size of the address 
 space) and for all values of the window width 7 (up to a fixed maximum). 
 We give a PL/ I program for this algorithm in Appendix A. 
 
 Once a simple one pass, paging algorithm simulation is available, 
 the next question is to easily obtain the reference strings for a given 
 program. Since for the matrix programs that we have considered, each 
 program has only one trace and, therefore, there is only one reference 
 string associated with it. Also note that, we are only interested in 
 data references and, in particular, only in array references. Given a 
 matrix program written in PL/ I, we remove all declarations. All assignment 
 statements and expressions are reduced by removing all scalar references 
 (and retaining only array references). Each array reference is now 
 replaced by a call to the appropriate paging algorithm simulator 
 (in our case LRU_WS_MIN ) . For each such call we must also supply the 
 page number of the referenced page. This is determined from the name 
 and index of the array referenced and the organization of the arrays in 
 VA space. For example, a matrix of order n is stored from page one to 
 page W by submatrices and the submatrices are stored in a column major 
 order, then a reference A(i, j ) will be a reference to the page number 
 
 -i-|-l) + fi- 
 rn 1 ' ' in- 
 clement A (i, j) belongs, is (f-"|, f^l ). 
 
 x = (N*(r— 1-1) + \— ] ), since the index of the submatrix, to which the 
 
 m" ' m 
 
101 
 
 5. APPLICATION OF FREEING AND PREPAGING TO ARPAY PROGRAMS 
 
 5.0 Introduction 
 
 The objective of prepaging is to reduce or avoid page faults. In 
 many array algorithms, which are dominated by data paging, we have some 
 inherent structure which will allow a fore -knowledge of paging needs. 
 This can be utilized in several ways : First, if we know that a set 
 of data pages is not required by the program in future then the space 
 occupied by these pages in MM can be freed for some other use. Second, 
 if we have extra space in MM, then we may be able to prefetch some of 
 the required pages. We have noted in Chapter h, that the locality- 
 improvement methods reduce page faults only for small values of the page 
 allotment c. We will see that the methods used in this chapter can 
 help reduce page faults for larger values of c. Thus we will have covered 
 the complete spectrum of c-values. We will show that an order of 
 magnitude reduction in the asymptotic value of page faults can be achieved 
 by our methods. 
 
 Once again, we start out with an example, namely, Cholesky 
 decomposition (CD), and show how to improve its performance. We will 
 also be concerned with measuring the improvement in the performance. 
 Finally, in subsequent sections, we will apply the techniques to other 
 matrix algorithms. - In this chapter, we assume that the programmer is 
 making these improvements in his program. In Chapter 6, we will discuss 
 the question of automating these techniques. 
 
102 
 
 5.1 Cholesky Decomposition 
 
 It is easier if we describe the algorithm using OL/2 and an 
 associated diagram to identify the data-working- set. Let us consider 
 the program OCD (refer to section J+.l) and the following partitioned 
 array : 
 
 k-1 
 
 k 
 
 A <1,1> \ 
 
 k-1 
 
 >D 
 
 R 
 
 
 ^k 
 
 M 
 
 C 
 
 A <3,3> \ 
 
 The algorithm OCD shows that, for a particular value of k, the subarrays 
 R, D, M and C are referenced, but the subarrays A <1, 1> and A <3, 3> are 
 not. Furthermore, we note that the elements of A <3,3> were not used 
 in the past, but the present elements of A <3,3> will be referenced in 
 the future. On the other hand, all the elements of A <1, 1> were used 
 in the past, therefore, they are likely to be in MM and the algorithm 
 shows that they will not be used again. Thus A <1, 1> may be marked as 
 a dead subarray. Since the mechanism that we have, only allows pages 
 to be declared dead, it is necessary to modify our mechanism for dead 
 subarrays. Because of dynamic partitioning, the extent of the subarrays 
 
103 
 
 vary dynamically. In this example, A <1, 1> may share pages with the 
 useful subarrays R, D, M and C. The set of shared pages is a null set 
 whenever the condition, (k-l) = mod m, is satisfied. This condition 
 corresponds to the alignment of a page boundary with the bottom boundary 
 of the subarray A <1, 1> (given by the expression k-l). Under such a 
 condition, we can free all the pages of A <1, 1>. We assume. the existence 
 of a procedure FREE(b) which frees all the pages of the subarrays, which 
 are resident in MM. Based on this observation, OCD can be modified to 
 yield OCDF as follows: 
 OCDF: PR0C(A, n); 
 
 IF M0D(k, m) - 1 THEN CALL FREE(A <1,1>); 
 D = SQRT(D - (R T , r t )); 
 C = (C - M X R')/D; 
 END OCDF; 
 
 A corresponding modification is made in the Pi/ I program CD (refer 
 
 to section ^-.l) to obtain CDF. We assume the existence of a procedure 
 
 FREE (A, I, J), which declares the (I, J) submatrix of array A dead. We 
 
 note that this freeing operation is carried out each time k - 1 = mod m. 
 
 Therefore, we need only free the bottom-most row of pages of A <1, 1>. In 
 
 Figure 5.1, we give the PL/l program CDF. 
 
 In Figure 5.2, we have plotted it(c) vs. c for CD using LRU and MIN 
 
 paging algorithms and for CDF using FREELRU paging algorithm. 
 
 We see that at and around point D, FREELRU behaves almost like 
 
 MIN (the optimal demand paging algorithm). The page allotment for 
 
 point D corresponds to the expression max ( |p(R,D,M, C) | ) where 
 
 ke[l,n] 
 
 P(A-., Ap, ...) denotes the pages containing the subarrays A,, A~, ... and 
 
101+ 
 
 CDF: 
 DO k = 1 TO n; K1FM - (k-1 )/m; 
 IF MOD(k, m) = 1 THEN 
 DO J = 1 TO K1FM; 
 
 CALL FREE (A, K1PM, J); 
 END; 
 S - 0; 
 DO J = 1 TO k-1; 
 
 S = S + A(k, J) * A(k, J); 
 END; 
 
 A(k, k) = SQRT(A(k, k) - S); 
 DO I = 1 TO n-k; 
 S = 0; 
 DO J - 1 TO k-1; 
 
 S = S + A(k+I, J) * A(k, J); 
 END; 
 
 A(k+I, k) = (A(k+I, k) - S)/A(k, k); 
 END; 
 END CDF; 
 
 Figure 5.1. Cholesky Decomposition with Freeing 
 
105 
 
 LOG 2 (7T(c)) 
 
 n = 24, m =4, N = 6 
 
 3 5 7 9 11 13 15 17 19 21 
 
 Figure 5.2. Page Faults Using LRU, FREELRU and MIN Algorithms 
 
106 
 p| denotes the number of pages in the set P. Clearly, this is the same 
 
 as the maximum number of active pages during the execution of the program 
 CD. This is easily found to be equal to » r < which in the case 
 N = 6 reduces to 12. We note that the asymptotic value of page faults, 
 i.e., n(o°), is not changed by freeing. Note that it(») = total number 
 of distinct pages referenced = size of the address space (in case each 
 
 page is referenced) = — ^ - (= 21 in the present case). We note that for 
 
 very low values of c, freeing does not reduce page faults because dead 
 pages are removed anyway by the replacement action of LRU. Only after 
 c = >6(N) then freeing starts to help. For very high values of c 
 (i.e., £)(Nj), once again, freeing does not help since the whole of the 
 VA space can be stored in MM and no page replacements are necessary. 
 Because of simulation time limits we have not carried out our experiments 
 with large values of the matrix size. It is hoped that with large value 
 of n (the matrix size) the advantages of freeing will be brought out 
 more clearly. 
 
 We note that the freeing technique can be applied to other 
 variations of Cholesky decomposition (e.g., CDR, CDM, CDS). Since the 
 nature of improvements is similar, we do not discuss these further. 
 
 Next we proceed to investigate what can be prepaged in OCD. 
 We know that the four useful subarrays are: R, C, M and D. Since the 
 extent of these subarrays change with the loop control variable k, we 
 will indicate this by subscripting these subarrays. Note that, 
 R(k+1) c M(k) U C(k) and M(k+l) c M(k) U C(k). Therefore, if we decide 
 to look only one loop execution ahead, we need to prepage the subarrays 
 C and D only. We note that, P(c(k+l) = P(c(k)) if k f mod m. Here 
 
107 
 
 P(B) denotes the pages containing the sub array B. Similarly, 
 P(D(k+l)) = P(D(k)) if k ^ mod m. The reason for this can be seen 
 from the following diagram: 
 
 1 
 
 m 
 
 J 
 
 R(k) 
 D(k) 
 
 D(k+l) 
 
 R(k+1) 
 
 Page boundaries 
 
 C(k) C(k+l) 
 
 Therefore, we need to prepage only when the condition k = mod m is 
 satisfied. Under such a condition we prepage C(k+l) and D(k+l). 
 
 If we decide to prepage the newly required elements of C and D 
 for each new execution of the loop just before beginning the execution 
 of the loop, then we will prepage C(k) and D(k) if k = 1 mod m. Since 
 we are going to consider only demand prepaging algorithms, it does not 
 matter when we prepage a certain set of pages, so long as the prepaging 
 is done prior to a reference to any one of these pages. Assume that we 
 
108 
 
 have issued a prepage statement for a set of pages P at time t. Assume 
 further that at t' > t, r = i 6 P, then and at t' we have a page fault. 
 At this time page i will be fetched and as many pages of P will be 
 prefetched as will be allowed by the number of empty pages in MM. With 
 this modification, we obtain the following algorithm: 
 OCDFP: PR0C(A, n, m); 
 
 IF M0D(k, m) = 1 THEN 
 
 DO; CALL FREE (A <1, 1>); CALL PREPAGE (C, D); 
 
 END; 
 D = SQRT(D - (R 1 , R'))j 
 C = (C - M X R')/D; 
 END OCDFP; 
 
 We can also modify CDF to obtain CDFP in a similar way. In 
 
 Figure 5.3, we have plotted Jt(c) vs. c curve for CD using LRU paging 
 
 algorithm, for CDFP using FREEDPRElfLRU paging algorithm and for CD using 
 
 DPMIN paging algorithm. We see that prepaging using FREEDPREI^LRU (in fact, 
 
 demand prepaging) reduces the page faulting considerably for large values 
 
 of c. In particular, the. asymptotic value of it has been brought down from 
 
 n(n+i) 
 
 — *~ — - to N. This is one order of magnitude reduction. We see that the 
 
 asymptotic value of jt for DPMIN is 1. We note here that such a prepaging 
 technique can also be applied to other variations of Cholesky factorization 
 (namely, CDR, CDM and CDS) but we omit this here, since the results are 
 very similar to that of CD. Since with prepaging, the number of page pulls 
 is different from the number of page faults, we need to consider these two 
 measures separately. In Figure 5. k, we have plotted, the number of page 
 pulls vs. c for CD using LRU, MIN and DPMIN paging algorithms and for CDFP 
 using FREEDPRE^LRU paging algorithm. We see that, the number page pulls 
 for FREEDPRElf.LRU is generally lower than the page pulls for LRU and only 
 
L0G,(7T(C)) 
 
 15" 
 
 109 
 
 n = 24, m = 4, N = 6 
 
 10-- 
 
 1 3 5 7 9 11 13 15 17 19 21 
 
 Figure 5.3. Page Faults Using LRU, FREEDPTOLRU, MIN and 
 DPMIN Paging Algorithms 
 
110 
 
 A LOG 2 (PAGE PULLS) 
 13- 
 
 Figure 5-4. Page Pulls Using LRU, FREED PRE IfLRU, MIN 
 and DPMIN Paging Algorithms 
 
Ill 
 
 rarely does it go above and then only by a very small amount. We see that 
 the page pulls for DPMTN and MIN are generally equal except in some cases 
 where DPMLN incurs more page pulls than MIN. Note that the asymptotic 
 value of page pulls is the same for all cases and is equal to the size of 
 the data address space (= N(N+l)/2). 
 
 We can view the process of freeing and prepaging done in CD in 
 a slightly different way. Let us denote the data-working set of one loop 
 execution of CD by DWS(k), where k is the loop control index. In words, 
 DWS(k) is the set of pages referenced in the k execution of the loop. 
 DWS(k) can be recursively defined as follows: 
 
 a) DWS(l) = P(C(1)) U P(D(1)) 
 
 b) If k = 1 mod m then 
 DWS(k) = DWS(k-l) U P(D(k)) 
 
 U P(C(k)) - P(A<l,l>(k) - A<l,l>(k-1)) 
 = DWS(k-l) U P(D(k)) U P(C(k)) 
 
 - P(R(k-l)) - P(D(k-l)) 
 else 
 
 DWS(k) = DWS(k-l); 
 This knowledge of DWS allows us to do the appropriate freeing and prepaging. 
 
 5.2 Other Examples 
 5.2.1 LU Decomposition 
 
 For LU decomposition (refer to OLU and the associated diagram 
 in section k.2), we have 
 
 a) DWS(l) = P(Y(1)) U P(Z(1)); 
 
112 
 b ) IF k = 1 mod m then 
 
 DWS(k) = DWS(k-l) U P(Y(k)) U P(Z(k)) 
 
 - P(A<l,l>(k-l)) 
 = DWS(k-l) U P(Y(k)) U P(z(k) 
 
 - P(R(k-l)) - P(X(k-l)) - P(D(k-l)); 
 else 
 DWS(k) = DWS(k-l); 
 Based on this observation, LU and OLU can be modified to free 
 A<1, 1> and prepage Y and Z. We do not present the detailed programs 
 LUF, LUFP, OLUF and OLUFP. In Figure 5.5, we have plotted n(c) vs. c for 
 LU using LRU, MIN and DPMIN paging algorithms, for HJF using FREEDPREULRU 
 paging algorithm. We see that freeing and prepaging helps only for larger 
 values of c and that the asymptotic value of -n(c) has been brought down 
 from N to N. 
 
 5.2.2 Gaussian Elimination 
 
 For Gaussian elimination (referring to OGOS and its associated 
 diagram in section h.2.), we have, 
 
 a) DWS(l) = P(A); i.e., the whole array A. 
 
 b ) If k = 1 mod m then 
 DWS(k) = DWS(k-l) 
 
 - P(A<1,1> U A<1,2> U A<1,3> U A<2,1> 
 
 U A<3,1>). 
 = DWS(k-l) - P(c(k-1)) - P(D(k-l)) - P(R(k-l)); 
 
 else 
 DWS(k) = DWS(k-l); 
 
113 
 
 , | L06 2 (7T(c)) 
 
 n « 24, m = 4, N'6 
 LU 
 
 Figure 5. 5. Effect of Prepaging on LU Decomposition 
 
114 
 
 From these equations, we see that only prepaging that can be 
 done is outside (before) the loop. And since whole of the array A is 
 to be prepaged only a limited number of pages as allowed by the available 
 space will really be prefetched. Freeing of the subarray A<1, 1>, A<1,2>, 
 A<1, 3>, A<2,3> and A<3>1> can be done within the loop when the condition, 
 k = 1 mod m, is met. With these modifications we have GOSF, GOSFP, 
 OGOSF, and OGOSFP (we omit the corresponding programs). In Figure 5.6, 
 we have plotted jt(c) vs. c for GOS using IRQ, MIN and DPMIN paging 
 algorithms, for GOSF using FREELRU paging algorithm and for GOSFP using 
 FREEDPREULRU paging algorithm. We note that jt( FREELRU ) = rt(LRU). In 
 other words, freeing does not help in this algorithm. Reason for this 
 is that dead page is precisely the least recently used. page. Thus the 
 page replaced by FREEIiRU and LRU is the same at any time. The same 
 effect is not necessarily observed with other paging algorithms. For 
 example if we were using FIFO instead of LRU, then in all probability, 
 Tt(FREEFIFO) < it (FIFO) for GOS. We also note that, prepaging and freeing 
 does help and reduces the asymptotic value of it(c) from DJ to 1. 
 
 5.2.3 Gram-Schmidt Orthogonalization 
 
 For Gram-Schmidt orthogonalization (referring to OORTH and its 
 associated diagram in section h.2) } we have, 
 
 a) DWS(l) = P(D(1)) U P(RK(1)) U P(AK(l)); 
 
 b ) If k = 1 mod m then 
 
 DWS(k) = DWS(k-l) U P(AK(k)) U P(RK(k)) 
 
 U P(D(k)) - P(R<l,l>(k)) 
 = DWS(k-l) U P(AK(k)) U P(RK(k) U D(k)) 
 
115 
 
 1 
 
 i L0G 2 (7T(c)) 
 
 
 
 
 15- 
 
 
 
 
 
 14- 
 
 
 
 
 
 13- 
 
 
 
 
 
 12- 
 
 
 
 
 
 11- 
 
 
 
 
 
 10- 
 
 
 
 
 
 9- 
 
 
 
 /FREELRU 
 
 
 8- 
 
 MIN' ^v 
 
 
 / / L " u 
 
 
 
 
 
 7- 
 
 
 
 6- 
 
 
 
 
 
 5- 
 
 
 4- 
 
 
 dpmin' > 
 
 FREEPRE4LRU~-^J 
 
 
 3- 
 
 
 
 
 ^ \ 1 
 
 
 2 - 
 
 
 
 s 
 
 
 1- 
 
 
 1 1 
 
 1 1 
 
 \ 
 
 -^ *• 
 
 10 
 
 15 
 
 20 
 
 25 
 
 30 
 
 35 
 
 Figure 5.6. Effect of Prepaging on Gaussian Elimination 
 
116 
 
 - P(RK(k-l) U D(k-l)); 
 else 
 DWS(k) = DWS(k-l); 
 From these equations, we see that, when the condition k = 1 mod m 
 (within the loop just after the backward branch target), we can free the 
 subarray R<1, 1> and prepage the subarrays AK, RK and D. With appropriate 
 modifications, we obtain ORTHF, ORTHFP, OORTHF, and OORTHFP. In 
 Figure 5.7, we have plotted jt(c) vs. c for ORTH using LRU, MM and DPMIN 
 paging algorithms, for ORTHF using FREELRU paging algorithm and for 
 ORTHFP using FREEDPRE^LRU paging algorithm. We see, once again, that 
 freeing and prepaging helps for larger values of c and that it brings 
 down the asymptotic value of it(c) from (N N + N (Np+l)/2) to N . 
 
 5.2. k Matrix Multiplication 
 
 For matrix multiplication, consider the following program OMM: 
 OMM: 
 
 LET A, B, C BE MATRICES OF ORDER (n); 
 
 FOR k = 1, 2, ..., n-1; PARTITION A, B AFTER 
 
 ROWS k-1, k; 
 
 A<2> = B<£> * C; 
 END OMM; 
 
 k-1 
 
 k 
 
 B<2> 
 
 A 
 
 B 
 
117 
 
 A L0G 2 (7T(C)) 
 14-- 
 
 n l »24, n 2 « 16, m«4 
 
 40 C 
 
 Figure 5. !• Effect of Prepaging on Orthonormalization 
 
118 
 
 [Note that the above can easily be programmed in OL/2 as the primitive 
 A = B * C but we have written OMM for obtaining DWS.] From the above, 
 we have, 
 
 a) DWS(l) = P(C) U P(B<2>(1)) U P(A<2>(1)); 
 
 b ) If k = 1 mod m then 
 
 DWS(k) = DWS(k-l) - P(B<l>(k)) 
 
 - P(A<l>(k)) U p(B<2>(k)) U p(A<^>(k)) 
 = DWS(k-l) - P(B<2>(k-l)) - P(A<2>(k-l)) 
 P(B<2>(k)) U p(A<2>(k)); 
 else 
 DWS(k) = DWS(k-l); 
 Based on this, we must prepage the whole array C outside (before) 
 the loop body and inside the loop, we can free subarrays B<1> and A<1> 
 and prepage subarrays B<2> and A<2> when the loop control variable 
 satisfies the condition k = 1 mod m. With appropriate modifications, 
 we obtain MMF and MMFP. In Figure 5.8, we have plotted it(c) vs. c for 
 MM using LRU, MEN and DPMIN paging algorithms, for MMF using FEEELRU 
 paging algorithm and for MMFP using FKEEDFRElj.LRU paging algorithm. 
 Once again, we see that for large values of c, freeing and prepaging helps 
 and that it reduces the asymptotic value of jt(c) from 3N^ to N. 
 
 We have seen that our methods of freeing and prepaging improve 
 performance for large values of c (> £)(N)) and that the methods work 
 for many different matrix algorithms. The method owes its success to 
 our ability to identify the data-working- sets of the programs. This 
 identification is possible, because of the well structured data- referencing 
 
119 
 
 L0G 2 (7T(c)) 
 
 n*l6, m=4 
 
 1 5 9 13 17 21 25 29 33 37 41 45 49 c 
 
 Figure 5»8» Effect of Prepaging on Matrix Multiplication 
 
120 
 
 of matrix algorithms. In Chapter 6, we will investigate whether the 
 identification of data-working- sets can be done by an optimizing compiler, 
 
 5.3 Combined Effect of Locality Improvement and 
 Working-Set Identification Methods 
 
 We have noted that the locality improvement methods of Chapter k 
 improve paging performance for low values of c (<6(N)) and the methods 
 of this chapter improve paging performance for higher values of c (>£(N)), 
 In this section we will combine the two methods and measure the resultant 
 improvements . 
 
 The version of Cholesky factorization without any improvements is 
 CD and the version with all improvements is CDSRFP. We give a PL/ I 
 program for the latter in Appendix C. In Figure 5. 9, we have plotted 
 jt(c) vs. c for CD using LRU paging algorithm and for CDSRFP using 
 FREEDFREULRU paging algorithm. The vast improvement in performance 
 needs no elaboration. We have also plotted the number of page pulls 
 vs c for CDSRFP. 
 
 In Figure 5.10, 5 .11, 5.12 and 5.13 we draw similar curves for 
 LU decomposition, Gaussian elimination, Gram- Schmidt orthogonal! zat ion 
 and matrix multiplication respectively. Fully improved versions of 
 these algorithms are also presented in Appendix C. 
 
121 
 
 L0G 2 (7T(C)) 
 
 24, m = 4, N»6 
 
 ® PAGE-FAULTS FOR CD USING LRU PAGING ALGORITHM. 
 
 (D PAGE- PULLS FOR CDSRFP USING 
 
 FREEDPRE4LRU PAGING ALGORITHM. 
 
 (D PAGE -FAULTS FOR CDSRFP USING 
 FREEDPRE4LRU PAGING ALGORITHM. 
 
 Figure 5.9. Page Faults for CD and CDSRFP 
 
122 
 
 n * 24, m « 4, N « 6 
 
 @ PAGE FAULTS FOR LU USING LRU PAGING ALGORITHM 
 
 (D PAGE PULLS FOR LUSRFP USING FREEDPRE4LRU PAGING ALGORITHM 
 
 (D PAGE FAULTS FOR LUSRFP USING FREEDPRE4LRU PAGING ALGORITHM 
 
 Figure 5.10. Page Faults for LU and LUSRFP 
 
123 
 
 LOG 2 (7T(c)) 
 
 PAGE FAULTS FOR GOS USING LRU PAGING ALGORITHM 
 
 (D PAGE PULLS FOR GOSSRFP USING FREEDPRE4LRU PAGING ALGORITHM 
 
 (5) PAGE FAULTS FOR GOSSRFP USING FREEDPRE4LRU PAGING ALGORITHM 
 
 Figure 5.11. Page Faults for GOS and GOSSRFP 
 
12U 
 
 n,«24, n 2 »l6, m»4, 
 Nj«6, N 2 «4 
 
 PAGE FAULTS FOR ORTH USING LRU PAGING ALGORITHM. 
 
 (D PAGE PULLS FOR ORTHSRFP USING FREE0PRE4LRU 
 PAGING ALGORITHM. 
 
 (D PAGE FAULTS FOR ORTHSRFP USING FREEDPRE4LRU 
 PAGING ALGORITHM. 
 
 11 13 15 17 19 21 23 25 27 29 31 33 35 37 
 
 Figure 5.12. Page Faults for ORTH and ORTHSRFP 
 
LOG 2 (7T(c)) 
 
 125 
 
 PAGE FAULTS FOR MM USING LRU PAGING ALGORITHM 
 
 © PAGE PULLS FOR MMSRFP USING FREEDPRE4LRU 
 PAGING ALGORITHM 
 
 ® PAGE FAULTS FOR MMSRFP USING FREEDPRE4LRU 
 PAGING ALGORITHM 
 
 Figure 5.13. Page Faults for MM and MMSRFP 
 
126 
 
 6. AUTOMATION OF PERFORMANCE IMPROVEMENT TECHNIQUES 
 
 6.0 Introduction 
 
 We have shown, in Chapter k, how to improve the locality of matrix 
 algorithms, and in Chapter 5, we have shown how to include freeing and 
 prepaging techniques. We have seen that these techniques improve the 
 paging performance of matrix algorithms. However, in the previous chapters, 
 we have implicitly assumed that the programmer has the responsibility to 
 modify his program to obtain the improved paging performance. In general, 
 they are not capable or willing to make these changes; neither is it 
 desirable to detract the user from his main purpose — namely, to solve his 
 problem. Therefore, we will consider how one is to automate these 
 techniques by incorporating them into the computer system. 
 
 The methods that we have presented clearly require an advanced 
 knowledge of the program's reference pattern, which means that the operating 
 system contains insufficient knowledge of the reference string. An 
 optimizing compiler, on the other hand, may be able to gather enough 
 information about the reference pattern and may be able to make the 
 necessary changes to obtain the improved paging performance. In section 
 6.1, we will sketch several transformations of program segments which 
 improve the locality of the segment, and we will give simple conditions 
 for these transformations to be valid. In section 6.2, we will present 
 an algorithm which will abstract information from an 0L/2 program and. 
 then, later, use this information to do freeing and prepaging for the 
 
127 
 program. It is clear that a similar algorithm can be given for an 
 element (Pl/^ like) source language, but the amount of information to 
 be processed will be very large so as to be prohibitive in the amount of 
 overhead incurred. 
 
 6.1 Transformations Which Improve Locality 
 
 6.1.1 Loop Reversal 
 
 First we start out with an element language like PL/l. Consider a 
 program segment P as follows : 
 P: DO I = 1 TO n ; 
 
 « 
 
 DO J - 1 TO n ; 
 
 s i 
 
 S 2 
 
 END; 
 
 • 
 
 END P; 
 Now consider a transformation T ; of the program segment P into the program 
 segment P'. 
 
 P' : DO I = 1 TO n ; 
 
 • 
 
 IF M0D(l,2) = 1 THEN 
 DO; JLLsd; JHL=*n p ; JSTEP=1 ; END; ELSE J 
 
 DO; JLL^; JHL=1; JSTEP=-1; END; 
 
 DO J-JLL TO JHL BY JSTEP; 
 
 s i 
 
 S„ 
 
 END; 
 
 END P'; 
 
128 
 
 We will say that the transformation T is valid if the program 
 segments P and P' are equivalent (in the input-output sense) [ALLE71] . In 
 case T is valid, we will say that the inner loop of P is reversible. We 
 want to obtain (easily testable) sufficient conditions under which T is 
 valid. 
 
 Let us denote the sequence of statements within the inner DO loop 
 (of P) by F(l,j) where the subscripts indicate the value of the two loop 
 control variables during the execution of the statement sequence F. 
 Clearly, P and P' are equivalent, provided the following segments X and X' 
 are equivalent. 
 
 X: F(I,1) 
 
 X': F(l,n 2 ) 
 
 F(l,l) 
 
 A sufficient condition for X and X' to be equivalent is: 
 
 F(l, j) commutes with F(l,k) for 
 
 Vie [1,1^] and V j, k e [l,n 2 J 3 ^ k. 
 Bernstein [BEBN66] has derived sufficient conditions for cammutivity of 
 program segments. Note that, we need only pa ial cammutivity here, 
 i.e., we require that, F(l, j) F(l,k) sequence is equivalent to 
 F(l,k) F(l, j) for V k > 3, V I. We will repeat his conditions here. 
 Let us denote the set of addresses fetched and not stored during the 
 
129 
 
 execution of F(l, j) by W(l, j), the set of addresses stored and not fetched 
 
 by X(l, j), the set of addresses fetched first and stored later by Y(l, j) 
 
 and the set of addresses stored first and fetched later by Z(l, j). Let 
 
 us denote by VL,, X , Y_,, Z , the similar set of addresses for the rest of 
 K K K K 
 
 the program segment P excluding the inner DO loop. Then, sufficient 
 conditions for P and P' to be equivalent are: 
 
 1) (W(l,k) U Y(l,k)) fl (X(l,o) U Y(l,j) U Z(I,J)) = fa 
 
 2) (¥(l,j) u T(I,J)) n (X(l,k) U Y(l,k) U Z(l,k)) = fa 
 
 3) (x(i,k) u y(i,k) u z(i,k)) n 
 
 (X(l,j) U T(I,J) u z(i,j)) n (w R U Y R ) = fa 
 
 Note that, these are simple conditions which are easy to test. 
 
 However, these conditions exclude the following case: 
 
 P : DO I = 1 TO n ; 
 S = 0; 
 
 DO J = 1 TO n 2 5 
 
 S = S + A(I,J) * B(l, J); 
 
 END; 
 
 END; 
 
 This case represents an inner product being calculated within the 
 inner loop and as such is a most frequently occurring construct in matrix 
 algorithms. For the program segment P-,> we have, 
 
 W(I,J) = {A(I,J), B(I,J)}, 
 
 X(I,J) = fa Y(I,J) = {S}, 
 
 Z(I,J) = fa 
 Therefore, Y(l,j) fl Y(l,k) = {S} ^ which implies that conditions l) and 2) 
 are violated. But- it is clear that the inner loop of P is reversible. 
 This shows that the conditions we have are sufficient but not necessary. 
 
We will require that this construct be specially recognized as reversible. 
 In fact, the following generalization is valid. Suppose, we have the 
 following program segment: 
 DO I = 1 TO n-; 
 
 DO J = 1 TO n 2 ; 
 
 S = S □ E(J); 
 END; 
 
 END; 
 Assume that □ is a commutative and associative operator and E(j) is some 
 expression, then the inner J loop is reversible. 
 
 Therefore, our loop reversal technique is a modified technique 
 of Bernstein. Furthermore, it is generally profitable to reverse adjacent 
 loops in the opposite directions. We apply this automated procedure to 
 CD and see whether we can obtain CDR as the transformed procedure of 
 section k.l. 
 
 The two inner J loops are immediately recognized as reversible 
 by the special construct. For the loop, 'DO I = 1 TO n-k 1 , we have: 
 W(K,I) = {A(K+I,1), ... A(K+I, K-l), 
 A(K,1) ... A(K, k-l), 
 A(K,K)}, 
 X(K,I) = p, 
 Y(K,I) = {A(K+I,K)}, 
 Z(K,I) = {S}, 
 
 =>W(K,I) U Y(K,I) = {A(K+I,1:K), A(K,1:K)} 
 X(K,J) U Y(K,J) U Z(K,J) = {A(K+J,K), S) . 
 
131 
 
 Clearly, intersection is empty. 
 
 W R = (A(K+l:n, l:n-l)} 
 
 \ = fi, \= {A(K+1, K+l), ..., A(n,n)} 
 
 Z R ={S}. 
 The triple intersection is: SO (W U Y ) = $. Thus, the reversibility of 
 the loop is established. By hand simulation, we have proved that our 
 automated loop reversing procedure will indeed translate CD into CDR. In 
 a similar fashion, we can show this for other algorithms. 
 
 We now turn our attention to programs written in 0L./2. Quite 
 clearly, the techniques presented for an element source language are 
 applicable. We have, in addition, several more possibilities of reversal. 
 Consider a vector assignment statement within a loop in an OL/2 program. 
 The vector assignment statement will be compiled into a DO loop by the OL/2 
 compiler. If this loop is detected to be reversible, then the compiler 
 can carry out the loop reversal. The conditions for reversibility are 
 simpler in this case and are as follows (assume that the statement is a 
 vector expression being assigned to a vector C): 
 
 1) Either the right side of the assignment statement does 
 not contain the vector C, its ancestor, or subarrays in 
 the ACB tree [FHIL72]; 
 
 2) Or the assignment statement is given by C = C + Exp 
 where Exp is a vector expression satisfying condition 1. 
 
 If these conditions are satisfied then the vector assignment is 
 reversible. Clearly, this procedure will translate OCD into CDR. 
 
132 
 
 6.1.2 Loop Decomposition 
 
 We have not discussed this transformation in Chapter h. Loop 
 
 decomposition is also known as loop unswitching [ALLE71] . Consider the 
 
 following program segment P: 
 
 P: DO I - 1 TO n; 
 
 A(I) = A(I) * A(I); 
 B(I) = B(I) + Is 
 
 END; 
 rest of the program 
 
 Now consider the following transformation P' of P: 
 
 P' : DO I = 1 TO ns 
 
 A(I) = A(I) * A(I); 
 
 END; 
 
 DO I = 1 TO n; 
 B(I) = B(I) + I; 
 
 END; 
 
 • 
 
 I rest of the program 
 
 It is easily seen that the locality of P' is much better than that 
 of P. In other words, whenever a single loop operates on two or more 
 independent data streams then it is better to decouple these operations 
 into two or more DO loops. Sufficient conditions for the transformation 
 to be valid can be stated as follows [BERN66] : 
 
 a) (W 1 (I) U Y 1 (I)) fl (X 2 (J) U Y 2 (J) U ^(j)) = fi, 
 
 b) (w 2 (j) u y 2 (j)) n ( Xl (i) u y 1 (i) u Zl (i)) = fi, 
 
 c) (x 1 (i) u Zl (i)) n (x 2 (j) u z 2 (j)) n (w R u y r ) = fi, 
 
 for 1 < J < I < n. 
 
133 
 
 6.1.3 Submat rix-Mu.lt ipli cati on- Trans format! on 
 
 In Chapter k, many matrix algorithm contain matrix multiplication. 
 It is, therefore, profitable to convert these internal matrix multiplications 
 into submatrix multiplications. 
 
 In an array language like 0L./2, the programmer is allowed to write 
 a matrix multiplication as a primitive operation. In such a case, the 
 compiler can easily compile this primitive into a submatrix multiplication. 
 Thus OCD will be compiled into CDM. If we also apply the procedure of 
 section 6.1.1 then, OCD will be compiled into CDMR with a corresponding 
 high degree of locality. The same statement holds for other matrix 
 algorithms. 
 
 For an element source language, it is more difficult since the 
 compiler has to detect a sequence of statements which corresponds to 
 matrix multiplication or its variants. A number of papers have appeared 
 on the topic of automatic recognition of vector and matrix operations for 
 the purpose of parallelism exploitation [SCHN72], The same techniques 
 are applicable for our purposes. 
 
 6.1.U Conversion From a Non-Submatrix to a Submatrix Algorithm 
 
 In general, it is impossible to automate a conversion from a 
 non- submatrix algorithm like CD to a submatrix algorithm like CDS. The 
 transformation procedure would be required not only to understand the 
 logic of the program, but also to have knowledge or 'intelligence' about 
 the theory of matrix algebra. We will illustrate the difficulties 
 involved in such a task by examples. 
 
 First we establish submatrix analogs (also known as block analogs) of 
 scalar operations. For example, the submatrix analogs of scalar addition 
 
13* 
 
 and scalar multiplication are matrix addition and matrix multiplication 
 respectively. When setting up the submatrix analog of division, we have 
 a problem. Assume that in the non- submatrix version of a certain matrix 
 program, the expression, E/A(k,k), occurs. A possible submatrix analog 
 of this expression could be E * (A T t ) , where A is the (I, I) 
 
 D J-, A- J., J. 
 
 submatrix of A, I = fK/m], and E~ is the submatrix analog of E. Another 
 
 possible analog of the expression is (A ) * E . Clearly, these two 
 
 -L, X o 
 
 expressions produce different effects, and only one of them is the correct 
 analog. It is not at all clear from the original program which of the 
 two possible analogs is correct. This particular example occurs in the 
 transformation of CD into CDS (refer to section k.l). 
 
 If we consider Cholesky decomposition, then the submatrix analog 
 of the square root operation is the Cholesky decomposition of a submatrix. 
 There is no way to automate this type of knowledge at present. 
 
 Another example occurs in attempting to transform ORTH into ORTHS 
 (refer to section ^-.2). The transformation is possible because of the 
 known properties of orthonormal vectors [SCH073; SCHW73] . Again, we cannot 
 incorporate this knowledge in a compiler. 
 
 In case of eigenvalue problem, submatrix algorithms may not even 
 exist. The basic problem is the noncommutative property of matrix 
 multipli cation [ IDVA72 ] . 
 
 6.2 Freeing and Prepaging 
 
 In this section, we will discuss the automation of freeing and 
 prepaging for matrix algorithms, written in OL/2 language. It should be 
 clear that, we can do the same kind of optimization for any other source 
 
135 
 
 language. However, for an element source language the overhead incurred 
 by our procedure is likely to be intolerable. 
 
 The optimizer will be a module of the OL/2 compiler. It will 
 collect some information about the program during the syntactic phase and 
 later use this information to do efficient freeing and prepaging for the 
 program. Simply stated, the job of the optimizer consists of two parts: 
 (a) prepage the subarrays and the arrays needed by the program at the 
 proper time during the execution of the program and (b) free memory space 
 occupied by the dead subarrays and dead arrays as soon as possible. We 
 will use (sub) array to denote "subarray or array." Ideally, the proper 
 time to prepage is such that any page of the (sub) array should be 
 set up in MM before the first reference to it, and 'as soon as possible' 
 is taken to mean 'just after the last use of the corresponding page'. 
 Such an ideal situation will, clearly, be equivalent to the use of PWS 
 paging algorithm and result in zero page faults as well as very economic 
 use of memory. Such an ideal system, however, will be very expensive in 
 terms of overhead. We consider a practical solution. 
 
 It is clear that some sort of a list of useful (sub) arrays needs 
 to be prepared during the syntax analysis phase. This list can be used 
 in two possible ways. First, it may be used to insert instructions for 
 freeing and prepaging (sub) arrays during the coding phase. Second, it 
 may be used dynamically, to prepage (sub) arrays from the list as far into 
 future as demanded by the desired performance. Clearly, the second 
 approach requires variable page allotment of MM. It will also incur 
 more overhead since the list must be kept in MM during the execution 
 
136 
 
 phase, but it is likely to result in an improved paging performance. We 
 will discuss the implementation of the first approach, since it is more 
 simple; however, the technique of constructing the list is common to both 
 approaches. 
 
 Let us assume, for the sake of argument; that the program can 
 acquire as many pages as necessary. Furthermore, assume that once a 
 (sub) array has been brought into MM it remains there until after the last 
 time it is used. Define the lifetime of a (sub) array A as follows: 
 L(A) = (t-,, t ) where t is the time of the first use and t p the time of 
 the last use of A. Once the lifetime of each (sub) array is determined, 
 then our problem is solved, at least in theory, since we can follow the 
 following prepaging and freeing strategy: 
 
 Given L(A) = (t 1 , t £ ), if t < t ± - T then PREPAGE(A); and if 
 t > t then FREE (A). The variable t is the current process time and 
 T is the average page-fetch time. 
 
 We construct, therefore, the list USE_LIST whose nodes correspond 
 to the (sub) arrays in the order in which they are referenced by the program. 
 On the program level the normal sequential ordering is used, and within a 
 statement, right end order of the expression tree is followed [JURI72], 
 The primary link in the USE_LIST is the one implied by the above ordering. 
 We will refer to this link by the names, 'static link' and the NEXT pointer. 
 In case of repeated use of an (sub) array, it is not worthwhile to enter 
 each use in the USE_LIST, since the USE_LIST will become very long. 
 Instead, we consider only the first and the last use and the interval 
 between them has been defined as the lifetime. The decision made here is 
 
 
137 
 
 quite arbitrary. We could have, alternately, defined the lifetime L* of 
 ( sub ) array A as : 
 
 L*(A) = L(A) SS(A) 
 
 where L(A) is the lifetime by the earlier definition and SS(A) is the 
 syntactic scope of A (as in a block-structured language like OL/2). Note 
 that, SS(A), in general, is a set of disjoint intervals and since L(A) is 
 a single interval, L' (A) will be a set of disjoint intervals. The use of L' 
 will involve more overhead but will result in obtaining better paging 
 performance. There are many other ways of defining lifetime, but we will 
 use the first definition, namely, L. We have to store L(A) for each A 
 in the USE_LIST. A practical way to do this is to have a node for A 
 corresponding to the first applied occurrence of A and store a lifetime 
 pointer, called LP, in the node, pointing to the last applied occurrence 
 of A. 
 
 We have tacitly assumed that the sequence of uses of (sub) arrays 
 in a program can be linearized. There are several language constructs 
 that prohibit this. 
 
 The GOTO statement introduces nonlinearity in the USE_LIST. Since 
 the programming language that we are using (OL/2) is GOTO-less, we do not 
 have this problem. 
 
 Another source of nonlinearity is the IF statement. Let us assume 
 that the statement I = (IF E THEN S, ELSE Sp) occurs in a program. 
 Ideally, the USE_LIST for this statement should be: 
 
 T 
 
 / U(S ) \ 
 
 u(D = u( E ) <r i y 
 
138 
 
 where U(X) denotes the USE_LIST for the program segment X. Since the 
 branch actually taken by the program is unknown until the execution time, 
 we have an obvious problem. The IF statement makes the predictability of 
 the future reference pattern poor. Therefore, we have to settle for 
 approximate solutions. There are various solutions to this problem. 
 
 The first solution is to linearize U(l) as U(E ) - U(S ) - U(S ). 
 This solution implies that, independent of the branch taken by the program, 
 the (sub) arrays in both S and S p will be considered part of the 
 data-working- set and will be prepaged. This has two implications: first, 
 potentially useless pages are brought into MM incurring extra page pulls 
 and wasting MM. Second, if U(S.. ) is large so that only part of it can be 
 prepaged when the if-expression E is evaluated and if the 'false' branch 
 is taken then a good part of U(S p ) will have to be demand paged. 
 
 The second approach assumes that U(l) = U(E ). In other words, we 
 assume that U(s) =U(S p ) = null. This solution implies that depending 
 on the branch taken by the program, U(S ) or U(S p ) will have to be demand 
 paged, but no memory space is wasted nor an extra channel traffic occurs. 
 
 The third approach allows the USE_LIST to be nonlinear and assumes 
 that 
 
 M S 2 
 
 U(I) = U( El ) - [STOP] <^u(?{> • 
 
 F e - 
 
 This solution implies that all prepaging activity be stopped after the 
 evaluation of the if-expression. 
 
 Which of these solutions is the most useful can only be decided 
 by experience with a working system. We select the first solution for 
 the rest of the discussion. 
 
139 
 
 The existence of loops (e.g., K>R loops, WHILE loops) is still 
 another source of nonlinearity. One possible way of linearizing, in this 
 case, is to unfold the loops. But this solution is unattractive because: 
 it is not applicable to WHILE loops since the number of times the loop is 
 traversed is not known at compile time, and it is wasteful of memory space 
 since a lot of information is duplicated in the USE_LIST. The alternative 
 is to allow loops in the USE_LIST. 
 
 The fourth source of nonlinearity in the USE_LIST is the call to 
 a procedure. We assume that each procedure has code inserted in it to 
 do its own freeing and prepaging. Thus, in the calling procedure, we stop 
 prepaging at and beyond the point of call and resume freeing and prepaging 
 after a return from the called procedure. In our treatment, we have 
 ignored call statements altogether but the above solution may easily be 
 incorporated. 
 
 Assume that the USE_LIST has been constructed for a given program. 
 Each (sub) array in the USE_LIST may have at most four boundaries. Now 
 since OL/2 allows dynamic partitioning, these boundaries can move during 
 the execution of the program. Therefore, each boundary has associated with 
 it a boundary expression [PHIL72], If all the boundary expressions of a 
 (sub) array remain constant during the execution of a loop, then such a 
 (sub) array will be called static (with respect to the loop). Those 
 subarrays for which at least one of the boundary expressions vary within 
 the loop are called dynamic subarrays. Clearly, paging techniques for 
 these two types of (sub) arrays should be different. The static (sub) arrays 
 need be prepaged only once. Whereas the extents of the dynamic subarrays 
 change during the execution of the loop and hence parts of these subarrays 
 
140 
 
 may have to be prepaged during the loop execution. We therefore, decide 
 to originate a dynamic link at the beginning of a FOR loop and it will 
 thread all subarrays within the scope of the FDR loop which are dynamic 
 with respect to this loop. If the loop control variable is k, then the 
 dynamic link will also be called k-chain. We define a predicate P such 
 that for a subarray B and the loop control variable k, P(B,k) = T (true), 
 iff B is dynamic with respect to loop k. 
 
 Once the USE_LIST has been constructed, as outlined above, 
 prepaging instructions can be inserted during the coding phase as follows . 
 
 At nest level I = 0, we prepage (sub) arrays in the USE_LIST, 
 traversing it along the static link. At nest level I > 0, prior to 
 generating code for the loop, we insert instructions for prepaging 
 (sub) arrays, traversing the USE_LIST along the static link; and within 
 the loop, we insert instructions for prepaging dynamic subarrays 
 traversing the USE_LIST along the dynamic link. 
 
 We will now discuss how to insert the lifetime pointers (LP) or 
 equivalently, how to determine L(A) for each A in the USE_IIST. We 
 assume the existence of a table of (sub) array names (addressed by the 
 (sub) array name) and containing two pointers IP and CP for each entry. 
 IP, the initial pointer, points to the first occurrence of the (sub) array 
 in the USE_LIST and CP, the current pointer, points to a node in the 
 USE_LIST, where currently last use of the (sub) array would have been 
 listed if we were to list each use of the (sub) array. At the end of 
 the syntax analysis phase, LP = CP for each (sub) array. 
 
 How and when to free (sub) arrays is the topic to be discussed next. 
 
 A simple answer to this question is to free a (sub) array soon after its 
 
 ■ 
 
end of lifetime. First of all, this implies the existence of a node, 
 called the end-of- lifetime node, for each (sub) array in the USE_LIST. 
 Whenever such a node is scanned during the traversal of USE_LIST in the 
 coding phase, instructions can be inserted to free the corresponding 
 (sub) array. However, there are several problems with this procedure. 
 If the lifetime L(A) of a (sub) array A is completely contained within a 
 loop f, then the above procedure will make us prepage and free A during 
 every traversal of the loop. To alleviate this difficulty, we make the 
 following change in our procedure. If at nest level £ = then we free A, 
 on scanning the end-of- lifetime node of A; and if at nest level £ > 
 then we free A, on scanning the end-of- lifetime node of A, only during 
 the last traversal of the loop. Now if A is a dynamic subarray then this 
 refined procedure also breaks down. For such a subarray, the contents of A 
 during the last iteration of a loop may be significantly different from 
 the contents of A during the previous iterations. The procedure we have 
 outlined thus far, may leave a lot of dead elements in MM. A simple 
 scheme immediately comes to mind to solve this problem. At the expiration 
 of the lifetime of A within a loop, free only the dead elements of A, 
 i.e., if we let A(k) to be the elements of A when the loop control variable 
 takes the value k then at the expiration of lifetime of A during k loop 
 traversal free (A(k+l) - A(k)). The danger in this procedure is that 
 these freed elements may be the useful elements of an adjacent subarray 
 during the next loop traversal. We solve this problem in the following 
 way. Define the scope of an array A by, 
 
 S(A) = ( B ^ IP(B), /^ LP(B)) 
 
142 
 
 where B C A means B is a subarray of A. Intuitively, S(A) is an interval 
 
 specifying the first use and the last use of A or any of its subarrays. 
 
 The scope S(A) of array A defines a natual universe over which we can 
 
 define the dead subarrays of A. Define the DEAD_LIST as follows: 
 
 DEAB_LIST (A, S(A)) = A - (A fl USE_LIST) 
 
 - {B|B c A, B is static with respect to all 
 loops in which it is referenced} . 
 
 The expression (A PI USE_LIST) denotes the list of all subarrays of A that 
 
 occur in the USE_LIST. Clearly, B e DEAD_LIST implies that B is dynamic 
 
 with respect to at least one loop. Now it is easy to merge the DEAD_LIST 
 
 with the USE_LIST. For each array A, we do the following: for each 
 
 B € DEAD_LIST (A, S(A)), and for each loop (with control variable k), in 
 
 which B occurs, if P(B, k) = T, we link B to the dynamic chain of the 
 
 loop. It is clearly, helpful to keep all dead subarrays at the head of 
 
 the dynamic chain after which all the useful dynamic subarrays follow. 
 
 The procedure of DEAD_LIST construction and linking to USE_LIST can only 
 
 be done in a post-syntax, pre-coding phase. Now the code for freeing and 
 
 prepaging dynamic subarrays is easily inserted in the loop, by traversing 
 
 the dynamic chain of the loop. 
 
 Thus our description of the procedure to construct the USE_UST is 
 reasonably complete. However, we have ignored one important problem, 
 namely, for B c A, L(A) fl L(B) ^ ft, i.e., the possibility that the 
 lifetimes of a subarray B and (sub) array A may overlap. We solve this 
 problem by considering several cases separately. 
 
 If L(B) c L(A) then we delete B from the USE_EIST. Since we have 
 assumed that the (sub) array A is resident in MM throughout its lifetime, 
 therefore, its subarrays need not be considered separately. 
 
 
 
 
lk3 
 
 If L(B) c L(A) but IP(B) < IP(A) and LP(B) < LP(a) then we extend 
 the lifetime of B so that LP(B) = LP(A). The reason for this is that when 
 the lifetime of B has ended, but the lifetime of A has. not, B cannot be 
 freed anyway. 
 
 If LP(B) > LP(A) then we extend the lifetime of A so that 
 LP(A) = LP(B). If this simple approach is not used, then an alternative 
 approach has to find the complement of B with respect to A and free all the 
 the subarrays in the complement of B with respect to A and free all the 
 subarrays in the complement. The procedure then becomes difficult and 
 time consuming. 
 
 The next question to be discussed is, how to relate (sub) arrays 
 to pages. The freeing and prepaging instructions available to us can free 
 and prepage only pages, but the information in USE_LIST is about the 
 (sub) arrays. Somehow, we have to bridge the gap. The problem is 
 compounded by the fact that the subarrays can change their sizes during 
 the execution of the program. We will consider the cases separately. 
 
 If an array occurs in the USE_IIST then we can prepage all of its 
 pages a few nodes in advance. Similarly, at the expiration of its 
 lifetime, we can free all of its pages. We have assumed that the array 
 does not share any pages with other useful (sub) arrays. 
 
 If we consider a static subarray smaller than a page, then while 
 prepaging it, we can just bring in the page in which the subarray is 
 stored. While freeing, however, we have to make sure that this page does 
 not contain any useful information. For the sake of simplicity, we will 
 not free such a subarray. 
 
If we consider a large static subarray then while prepaging, we 
 fetch all the pages containing the elements of the subarray. While freeing, 
 we can free only those pages which contain just the elements of this 
 subarray. Other pages, which are shared with some other subarrays cannot 
 be freed. 
 
 We now consider the case of dynamic subarrays. Consider a loop f 
 with the control variable k such that for the given subarray B, P(B,k) = T 
 (i.e., B is a dynamic subarray with respect to the loop f). Let us denote 
 the values that, the control variable k takes during the execution of 
 
 the loop, by k , k , ..., k., k , ... . Let us denote the elements of 
 
 th 
 subarray B, during the k loop traversal by B(k) and the pages containing 
 
 B(k) by P(B(k)). Then P(B(k Q )) e DWS(k Q )) and A P(B(k i+1 )) = P(B(k i+1 )) - 
 
 P(B(k.)) e DWS(k i+1 ) - DWS(k ± ). 
 
 This means that the elements of B(k ) must be prepaged before the 
 first traversal of the loop and that during each subsequent traversal 
 A P(B(k )) need be prepaged (if not null). The first requirement is 
 already satisfied since before the first traversal of the loop, we prepage 
 all the subarrays along the static chain in the loop. 
 
 For satisfying the second requirement, we must determine the 
 
 conditions under which A P(B(k.- n )) will be non-null. Before that, 
 
 l+l 
 
 however, we will determine the conditions under which A B(k. , ) = 
 
 l+l 
 
 B(k. , ) - B(k. ) will be non-null. We will assume that the subarray B 
 
 has four boundary expressions associated with it. Let us further assume 
 
 that, B = A <I, J>, i.e., it is <I, J> subarray of A. Then r T ., r , C , 
 
 l-i 1 J— i 
 
 Cj are the boundary expressions associated with it as shown in Figure 6.1. 
 
U5 
 
 
 
 
 
 A<I.J> 
 
 
 
 
 
 I-l 
 
 ri 
 
 Cj-i Cj 
 
 Figure 6. 1. The Partition Lines 
 
Ilf6 
 
 In general, these four expressions are functions of k. Let X = 
 {^(k), C J-;L (k)}, X 2 = {^(k), Cj(k)) and X = X 1 U X^ Define 
 
 f (x) V x e X by: 
 
 l+l 
 
 1 if x(k. . ) > x(k. ) 
 l+l i 
 
 if x(k i+1 ) - x(k ± ) 
 
 -1 if x(k. . ) < x(k. ). 
 v i+l y v i' 
 
 Then clearly, the condition for A B(k. ) to be non-null is given by: 
 
 (axeXBf (x) = -1) V (ax 6 x 2 B f (x) = 1). 
 
 i+1 i+1 
 
 Intuitively, the condition asserts that the subarray B is expanding in at 
 least one of the four directions. Let us assume that for a subset 
 X of X, V x € X => B is expanding along the boundary expression x. We 
 would like to determine the conditions under which A P(B(k. , )) f p. 
 First of all, it is clear that, 
 
 AB(k. .) = p^/\ P(B(k. _)) = p. 
 
 i+1 
 
 i+1' 
 
 Given that A B(k ) fi ft, it may be that A P(B(k )) = p. If 
 ^ x e X 3 x(k. ) = mod m, then A P(B(k. )) ^ p. Intuitively, this 
 condition asserts that one of the expanding boundaries has just crossed 
 a page boundary. Thus the required condition for prepaging is: 
 
 (A B(k i+1 ) ^ p) /\ ( a x e X Q ^ x(k ± ) = mod m). 
 
H7 
 
 As regards freeing dynamic sub arrays, during the k loop 
 
 traversal, we scan the dynamic chain and try to free all the dead subarrays. 
 If the subarray shares pages with other subarrays then we cannot free 
 these pages. Also if the subarray is contracting in at least one of the 
 directions then we cannot free it, since the present elements of this 
 dead subarray may become the future elements of some useful subarray. The 
 condition under which a dead subarray B is not contracting in any direction 
 is given by: 
 
 (Vx e X n , f (x) <0)a(Vxe X , f (x) > 0). 
 
 The condition under which, the subarray B does not share pages with any 
 other subarray is given by: V x e X, x(k. _ ) = mod m. When both these 
 conditions are satisfied, we can free all the pages containing the dead 
 subarray B. 
 
 It may happen that some of the shared pages of a dead subarray belong 
 to another dead subarray. This sharing will stop freeing of both these 
 subarrays unnecessarily. We can easily incorporate a 'combination of dead 
 subarray' function in our procedure. Assume that A <1,1>, A<1,2>, A<1,3> 
 occur as dead subarrays and that there are only two vertical partition 
 lines of A. Then the combination may be recognized as a whole (top) row of 
 subarrays and may be denoted by A<1,*>. With this kind of combination, 
 we may be able to free more frequently, in fact, such a situation does occur 
 in Gaussian elimination. We will not, however, discuss the details. 
 
 The scheme for freeing that we have discussed is not complete, 
 because there may exist a subarray which is dead but which always shared 
 pages with useful subarrays, and therefore is never freed. To alleviate 
 
11*8 
 
 this difficulty, when the scope S(A) of an array A ends, we will free all 
 the subarrays of A remaining in MM. This of course, implies an end-of-scope 
 node for each array A. 
 
 We give an algorithm (called ALG_AP), in Appendix D, which builds 
 the USE_I1ST, builds the DEAD_I1ST and attaches to the USE_LIST and finally 
 inserts instructions, in the code, for freeing and prepaging. 
 
U9 
 
 7. CONCLUSION AND FUTURE RESEARCH 
 
 7.0 Conclusion 
 
 The improvement of paging performance of programs running under a 
 paged virtual memory system has been demonstrated in this thesis. As 
 we have seen, there are many variables which affect the performance. The 
 paging algorithm used by the system and the locality of the program under 
 consideration are two of the more important variables of the system. We 
 have studied the effects of these two variables on the performance of 
 programs running under such a system. 
 
 In analyzing the performance, it was necessary to choose a set 
 of performance measures. The traditional performance measure is the 
 number of page pulls incurred by the program, but the number of page 
 faults and the space-time product of main memory are also important 
 performance measures. Which of these performance measures is the most 
 important depends on a particular system and the mix of programs running 
 under it. 
 
 Demand paging algorithms have been very popular, both in theory 
 and in practice. We have shown, both theoretically and practically, that 
 non-demand algorithms are useful in improving paging performance. 
 
 We have defined a fixed-memory demand prepaging algorithm DEMEN 
 and proved that it is an optimal demand prepaging algorithm in the number 
 of page faults. This algorithm, however, is unrealizable since it requires 
 
150 
 
 the advance knowledge of the future reference string. It serves as a 
 benchmark of performance for realizable demand prepaging algorithms. 
 
 We have defined several realizable prepaging algorithms. We have 
 studied, for each of these algorithms, whether or not it is a stack algorithm. 
 To show practical usefulness of these algorithms, we resort to a specific 
 application. We have shown that these prepaging algorithms improve the 
 paging performance of common matrix algorithms considerably. The reduction 
 in the number of page faults is shown to be as high as an order of 
 magnitude. We have measured the number of page faults and the number of 
 page pulls for many different matrix algorithms using many different paging 
 algorithms. 
 
 The application of realizable prepaging algorithms require some 
 knowledge of the future paging needs of the program. If the programmer 
 is willing to do this, he is suitably rewarded. We have shown that a 
 compiler for a structured array- language like OL/2 can extract, without 
 too much overhead, enough information from a program to improve its paging 
 performance. The compiler can identify portions of the present working 
 set which cease to be in the future working set. The compiler can also 
 provide reliable advance information of the working set. As a result, 
 moving unwanted information out. of memory and bringing in useful 
 information, in advance of its use, can be done efficiently. 
 
 We have defined a variable -memory prepaging algorithm IWS, based 
 on Denning' s WS algorithm. We have shown that PWS algorithm incurs zero 
 page faults, has the same number of page pulls as WS, and it is more 
 economical in ST product than WS. However, IWS is an unrealizable algorithm 
 since it requires that the program's reference string be known in advance. 
 
 
151 
 
 PWS algorithm can serve as a benchmark of performance among all variable- 
 memory prepaging algorithms. 
 
 We have identified three methods of improving the locality of 
 matrix algorithms. These are: loop reversal, submatrix multiplication 
 and the construction of a submatrix algorithm. We have shown that these 
 methods yield a vast improvement in locality and that the most powerful 
 method is the construction of a submatrix algorithm. The average working 
 set size is known to be a good measure of the locality of a program. We 
 have shown that a decrease in the average working set size, generally, 
 implies a reduction in the number of page faults. 
 
 We have considered the problem of automating these methods to 
 improve locality. If the source language is a high-level array- language, 
 like OL/2, then the nature of the primitives of the language provide a 
 much easier means of optimization. 
 
 We have shown that the locality- improvement methods are effective 
 for lower values of the page allotment and that prepaging methods are 
 effective at higher values of the page allotment. 
 
 7. 1 Future Research 
 
 As we have noted, very little is known about prepaging algorithms. 
 Both, theoretical and practical investigations are needed for understanding 
 the behavior of prepaging algorithms. In future, memory technology will 
 provide very large memories making program (instruction) paging less 
 important. Parallel processing will provide the means for solving problems 
 with very large data bases, thereby making data paging more important. 
 As a result we expect the emphasis in paging to shift toward specialized 
 systems where more information can be used to anticipate paging demands. 
 
152 
 
 There are many matrix algorithms, for which submatrix algorithms 
 do not yet exist. This is a useful area of investigation. 
 
 Actual implementation, either in software or hardware, of the 
 paging optimizer proposed in thesis needs to be done. The use of the 
 paging optimizer in conjunction with variable -memory prepaging algorithms 
 can also be investigated. 
 
 A study of models of program behavior in conjunction with prepaging 
 algorithms is another fruitful area of investigation. In particular, we 
 refer to conjecture 3.1 regarding the optimality of the paging algorithm 
 DPOBLLRU for the sequential-random model of program behavior (refer to 
 section 3.2.2.1). 
 
153 
 
 LIST OF REFERENCES 
 
 [AH071] Aho, A. V., Denning, P. J. and Ullman, J. D., "Principles 
 
 of Optimal Page Replacement," JACM, Vol. 18, No. 1, Jan. 1971, 
 pp. 80-93. 
 
 [ALLE71] Allen, F. E. and Cocke, J., "A Catalogue of Optimizing 
 
 Transformations," IBM T. J. Watson Research Center Report 
 RC-35^8 (Sept. 1971). 
 
 [ANAC67] Anacker, W. and Wang, C. P., "Performance Evaluation of 
 
 Computing Systems with Memory Hierarchies,". IEEE Transactions 
 on Computers, Vol. EC-16, No. 6, Dec. 1967, pp. 76^-773. 
 
 [BAER71] Baer, J. L. and Sager, G. R., "Measurement and Improvements of 
 Methods for Evaluation under Paging Systems, " in Proc. of Conf. 
 on Statistical Methods for Evaluation of Computer System 
 Performance, Brown University, Nov. 1971* PP« 2^1-269* 
 
 [BAIR68] Bairstow, J. N., "Time -Sharing, " Electronic Design, Vol. l6, 
 No. 9 (1968), pp. C1-C22. 
 
 [BAYL68] Bay lis, M. H. J., Fletcher, D. G. and Howarth, D. J., "Paging 
 Studies Made on the I.C.T. Atlas Computer," IFIP (1968), D113. 
 
 [BELA66] Belady, L. A., "A Study of Replacement Algorithms for a Virtual 
 
 Storage Computer," IBM Sys. J., Vol. 5, No. 2, (1966), pp. 78-IOI. 
 
 [BELA69] Belady, L. A. and Kuehner, C. J., "Dynamic Space Sharing in 
 
 Computer Systems," CACM, Vol. 12, No. 5 (May 1969), pp. 282-288. 
 
 [BELA]^-] Belady, L. A. and Palermo, F. P., "On-line Measurement of 
 
 Paging Behaviour by the Multivalued MIN Algorithm, " IBM J.R.D. 
 Jan. 197^, pp. 2-19. 
 
 [BERG71] Berge, C, Principles of Combinatorics, Academic Press, New York, 
 1971. 
 
 [BERN66] Bernstein, A. J., "Analysis of Programs for Parallel Programming," 
 IEEE Transactions on Computers, Vol. EC-15, No. 5, (Oct. 1966), 
 PP. 757-763. 
 
 [BOBR67] Bobrow, D. G. and Murphy, D. L., "Structure of a LISP System 
 Using Two-Level Storage," CACM, Vol. 10, No. 3 (March I967), 
 p. 155. 
 
IV, 
 
 [BRAW68] Brawn, B. and Gustavson, F., "Program Behaviour in a Paging 
 Environment/' AFIPS FJCC, Vol. 33 (1968), pp. 1019-1032. 
 
 [BRAW70] Brawn, B. and Gustavson, F., "Sorting in a Paging Environment," 
 cacm, Vol. 13, No. 8 (Aug. 1970), p. 483- 
 
 [COFF68] Coffman, E. G. and Varian, L. C, "Further Experimental Data 
 on the Behaviour of Programs in a Paging Environment, " CACM , 
 Vol. 11, No. 7 (July 1968), pp. 471-I+7IU 
 
 [C0FF72] Coffman, E. G. and Ryan, T., "A Study of Storage Partitioning 
 Using a Mathematical Model of Locality," CACM, Vol. 15, No. 3 
 (Mar. 1972), pp. 185-190. 
 
 [COFF73] Coffman, E. G. and Denning, P. J., Operating System Theory, 
 Prentice -Hall, Inc., Englewood Cliffs, N.J., 1973- 
 
 [COME67] Comeau, L. W., "A Study of the Effects of User Program 
 Optimization in a Paging System, " ACM Symposium on OS, 
 Oct. I967. 
 
 [CORB69] Corbato, F. J., "A Paging Experiment with the Multics System," 
 Chapter 19, In Honor of P. M. Morse, M.I.T. Press, Cambridge, 
 Massachusetts, 1967, pp. 217-228. 
 
 [DENN68] Denning, P. J., "Thrashing: Its Causes and Prevention," 
 AFIPS SJCC, Vol. 33 (1968), pp. 915-922. 
 
 [DENN68a] Denning, P. J., "The Working Set Model of Program Behaviour," 
 CACM, Vol. 11, No. 5 (1968), pp. 323-333. 
 
 [DEHN68b] Denning, P. J., Chen, Y. C. and Shedler, G. S., "A Model for 
 
 Program Behaviour under Demand Paging," Rep. RC-2301, IBM T. J. 
 Watson Res. Center, Yorktown Heights, N. Y., Sept. I968. 
 
 [DEWN70] Denning, P. J., "Virtual Memory," Computing Surveys, Vol. 2, 
 No. 3, Sept. 1970, pp. 153-189. 
 
 [DENN72] Denning, P. J., "On Modelling Program Behaviour," AFIPS SJCC, 
 Vol. kO (1972), pp. 937-9^- 
 
 [DENN72a] Denning, P. J., Savage, J. E. and Spirn, J. R., "Models for 
 Locality in Program Behaviour," TR-107, Dept. of Elect. Eng., 
 Princeton Univ., Apr. 1972. 
 
 [DEHN72b] Denning, P. J. and Schwartz, S. C, "Properties of the Working- 
 Set Model," CACM, Vol. 15, No. 3 (Mar. 1972), pp. I9I-I98. 
 
 [DENN65] Dennis, J. B. and Glaser, E. L., "The Structure of On-Line 
 Information Processing Systems, " Proc. Second Congress on 
 Information Systems Sciences, 1965, pp. 5~lk. 
 
 [DOHE70] Doherty, W., "Scheduling TSS/360 for Responsiveness," AFIPS 
 FJCC, Vol. 37 (1970), pp. 97-112. 
 
155 
 
 [DOYL ] Doyle, M. S., "The Effects of Program Behaviour on the 
 
 Design of Storage Hierarchies," Ph.D. Thesis, University 
 of Waterloo, Canada. 
 
 [DUBB.72] Dubrulle, A. A., "Solution of the Complete Symmetric Eigenvalue 
 Problem in a Virtual Memory Environment," IBM J . R . D . , Nov. 1972, 
 pp. 612-615. 
 
 [FERR73] Ferrari, D., "A Tool for Automatic Program Restructuring," 
 Proc. ACM Nat. Conf., 1973, p. 228. 
 
 [FINE66] Fine, G. H., Jackson, C. W. and Mclsaac, P. V., "Dynamic 
 
 Program Behaviour Under Paging, " Proc. 21 s fr ACM Nat. Conf., 
 1966, pp. 223-228. 
 
 [FRAFflf] Franaszek, P. A. and Wagner, T. J., "Some Distribution-free 
 
 Aspects of Paging Algorithm Performance," JACM, Vol. 21, No. 1, 
 (Jan. 197*0, PP- 31-39- 
 
 [FREI68] Freibergs, I. F., "The Dynamic Behaviour of Programs," AFIPS 
 
 fjcc, Vol. 33 (1968), pp. 1163-1168. 
 
 [GELE71] Gelenbe, E. and Tiberio, P., "The WSTI and Page Size in the 
 Design of Virtual Memory Computer Systems, " Proc. 9' t * 1 Annual 
 Allerton Conf. on Circuits and System Theory (1971), P- 10*1-. 
 
 [GELE7J] Gelenbe, E., Tiberio, P. and Boekhorst, J. C. A., "Page Size in 
 Demand-Paging Systems," Proc. ACM SIGME Symp. (1973), PP- 1-12. 
 
 [GIBS66] Gibson, D. H., "Considerations in Block-Oriented Systems Design," 
 AFIPS SJCC, Vol. 30 (1967), pp. 75-80. 
 
 [GORD73] Gordon, R. L., "The Organization and Control of a Slave Memory 
 Hierarchy," TR-kk-29, Naval Underwater Systems Center, Feb. 1973. 
 
 [HATF71] Hatfield, D. J. and Gerald, J., "Program Restructuring for 
 
 Virtual Memory," IBM Sys. J., Vol. 10, No. 3 (1971), PP- 168-192. 
 
 [HATF72] Hatfield, D. J., "Experiments on Page Size, Program Access 
 
 Patterns and Virtual Memory Performance," IBM J.R.D. , Jan. 1972, 
 P- 58. 
 
 [HOAR72] Hoare, C. A. R., "A Survey of Store Management Techniques Part 
 Two, " in Operating Systems Techniques, (Ed. ) Hoare and Perrott, 
 APIC Studies in Data Processing, No. 9, 1972. 
 
 [H0US6^] Househoder, A. S., The Theory of Matrices in Numerical Analysis, 
 Blaisdell Publishing Company, New York, 196k. 
 
 [ISAA66] Isaacson, E. and Keller, H. B., Analysis of Numerical Methods, 
 John Wiley and Sons, 1966. 
 
 [JOSE70] Joseph, M., "An Analysis of Paging and Program Behaviour," The 
 Computer Journal, Vol. 13, No. 1, Feb. 1970, p. k-Q. 
 
156 
 
 [JURI72] Jurich, D. R., "An Approach to the Compilation of Array- 
 Expressions in the OL/2 Language, " University of Illinois 
 at Urb ana -Champaign, Department of Computer Science Report 
 No. 550, 1972. 
 
 [KILB62] Kilburn, T., Edwards, D. G., Lanigan, M. J. and Sumner, F. H., 
 "One Level Storage System," IRE, EC-11, No. 2 (Apr. 1962), 
 pp. 223-235. 
 
 [KING71] King, W. F., "Analysis of Demand Paging Algorithms," IFIP 
 1971, Vol. 1, p. ^85. 
 
 [KNIG ] Knight, J. C. and Page, E. S., "Searching on a Paging 
 Environment, " to appear. 
 
 [KUCK70] Kuck, D. J. and Lawrie, D. H., "The Use and Performance of 
 
 Memory Hierarchies: A Survey," Software Engineering, Vol. 1, 
 1970, pp. 45-77. 
 
 [KUEH68] Kuehner, C. and Randell, B., "Demand Paging in Perspective," 
 AFIPS SJCC, Vol. 32 (1968), pp. 1011-1018. 
 
 [LEHM68] Lehman, M. M. and Rosenfeld, J. L., "Performance of a Simulated 
 
 Multiprogramming System," AFIPS SJCC, Vol. 33 (1968), pp. 1^31-1^2, 
 
 [LIPT68] Liptay, J. S., "Structural Aspects of the System 360/85: II, 
 The Cache," IBM Sys. J., Vol. 7, No. 1 (1968), p. 15. 
 
 [LOVA72] Lovass-Nagy, V. and Powers, D. L., "A Note on Block Diagonalization 
 of Some Partitioned Matrices, " Linear Algebra and Its Applications, 
 Vol. 5, Oct. 1972, pp. 339-3^6. 
 
 [MA.SU72] Masuda, T., Takahashi, N. and Yoshizawa, Y., "The Comparison of 
 Swapping Algorithms and Some Program Behaviour Under a Paging 
 Environment," Information Processing in Japan, Vol. 12 (1972), 
 pp. 70-75. ~ ~ '" - 
 
 [MATT70] Mattson, R. L., Gecsei, J., Slutz, D. R. and Traiger, I. L., 
 "Evaluation Techniques of Storage Hierarchies," IBM Sys. J. , 
 Vol. 9, No. 2 (1970), pp. 78-117. 
 
 [MCKE69] McKeller, A. C. and Coffman, E. G., "The Organization of Matrices 
 and Matrix Operations in a Paged Multiprogramming Environment, 
 CACM, Vol. 12, No. 3 (1969), pp. 153-165. 
 
 [M0LE72] Moler, C. B., "Matrix Computations with Fortran and Paging," 
 CACM, Vol. 15, No. k (1972), p. 268. 
 
 [ONEI67] O'Neill, R. W., "Experience Using a Time Sharing Multiprogramming 
 System with Dynamic Address Relocation Hardware," AFIPS SJCC, 
 Vol. 30 (1967), pp. 611-621. 
 
 [OPPE68] Oppenheimer, G. and Weizer, N., "Resource Management for a 
 
 Medium Scale Time Sharing Operating System," CACM, Vol. 11, No. 5, 
 (May 1968), p. 113. 
 
157 
 
 [0RGA72] 
 [PHIL72] 
 [PINK68] 
 
 [P0ME71] 
 [RAMA66] 
 
 [RAND68] 
 [RAND69] 
 [R0DR72] 
 
 [RODR75] 
 
 [R0GE73] 
 
 [SAL17^] 
 [SAYR69] 
 [SCRN72] 
 
 [SCH073] 
 
 [SCHW73] 
 
 Organick, E. I., The MULTICS System: An Examination of Its 
 Structure, MIT Press, Cambridge, Mass., 1972. 
 
 Phillips, J. R. and Adams, H. C, "Dynamic Partitioning for 
 
 Array Languages," CACM, Vol. 15, No. 12 (Dec. 1972), pp. 1023-1032. 
 
 Pinkerton, T., "Program Behaviour and Control in Virtual Storage 
 Computer Systems, " University of Michigan, C0NC0MP Report h 
 (Apr. 1968). 
 
 Pomeranz, J. E., "Paging with Fewest Expected Replacements," 
 ifip, (1971), Vol. 1, pp. 1+91-^93. 
 
 Ramamoorthy, C. V., "The Analytical Design of a Dynamic Lookahead 
 and Program Segmenting Scheme for Multiprogrammed Computers, " 
 Proc. 21 st ACM Nat. Conf., I966, pp. 229-239. 
 
 Ran dell, B. and Kuehner, C. J., "Dynamic Storage Allocation 
 Systems," CACM, Vol. 11, No. 5 (May I968), pp. 297-306. 
 
 Randell, B., "A Note on Storage Fragmentation and Program 
 Segmentation," CACM, Vol. 12, No. 7 (July 1969), p. 365. 
 
 Rodriguez -Ros ell, J. and Dupuy, J. P., "The Evaluation of a 
 Time-Sharing, Page Demand System," AFIPS SJCC (1972), Vol. k-0, 
 pp. 759-765. 
 
 Rodriguez -Rosell, J. and Dupuy, J. P., "The Design, Implementation, 
 and Evaluation of a Working Set Dispatcher, " CACM, Vol. 16, No. k 
 (Apr. 1973), PP. 2V7-253. 
 
 Rogers, L. D., "Optimal Paging Strategies and Stability 
 Considerations for Solving Large Linear Systems," Ph.D. Thesis 
 University of Waterloo, Canada (1973). 
 
 Saltzer, J. H., "A Simple Linear Model of Demand Paging Performance," 
 CACM, Vol. 17, No. k (Apr. 197*0, PP- 181-186. 
 
 Sayre, D., "Is Automatic Folding of Programs Efficient Enough 
 
 to Displace Manual?" CACM Vol. 12, No. 12 (Dec. 1969), pp. 656-660. 
 
 Schneck, P. B., "Automatic Recognition of Vector and Parallel 
 Operations in a Higher Level Language, " Proc. ACM Nat. Conf. 
 (Aug. 1972), pp. 772-780. 
 
 Schonage, A., "Fast Schmidt Ortogonalization and Unitary 
 Transformations of Large Matrices, " in Complexity of Sequential 
 and Parallel Numerical Algorithms, Ed. J. F. Traub, Proc. Symp. 
 (1973), Carnegie -Mellon University. 
 
 Schwarz, H. R., Rutishauser, H. and Stiefel, E., Numerical 
 Analysis of Symmetric Matrices, Prentice -Hall, Inc., Englewood 
 Cliffs, N.J. (1973), translated by Hertelendy, P. 
 
158 
 
 [SHED72] Shedler, G. S. and Tung, C, "Locality in Page Reference 
 Strings," SIAM J. on Comput., Vol. 1, No. 3 (Sept. 1972), 
 pp. 218-2*4-1. 
 
 [SHEM66] Shemer, J. E. and Shippey, G. A., "Statistical Analysis of 
 Paged and Segmented Computer Systems," IEEE, Vol. EC-15, 
 No. 6 (Dec. 1966), pp. 855-863. 
 
 [SHEM69] Shemer, J. E. and Gupta, S. C, "On the Design of Bayesian 
 Storage Allocation Algorithms for Paging and Segmentation, " 
 IEEE, Vol. C-18, No. 7 (July 1969), PP- 6^-651. 
 
 [SISS68] Sisson, S. S. and Flynn, M., "Addressing Patterns and Memory 
 Handling Algorithms, " APIPS FJCC, Vol. 33, No. 2, (1968), 
 pp. 957-967. 
 
 [SMIT67] Smith, J. L., "Multiprogramming under a Page on Demand Strategy," 
 CACM, Vol. 10, No. 10 (Oct. I967), pp. 636-6^6. 
 
 [STEV68] Stevenson, D. A. and Vermillion, W. H., "Core Storage as a 
 
 Slave Memory for Disk Storage Devices," IFIP (1968), pp. F86-F9I. 
 
 [VARI68] Varian, L. C. and Coffman, E. G., "An Empirical Study of the 
 Behaviour of Programs in a Paging Environment, " CACM, Vol. 11, 
 No. 7 (July 1968), pp. h'Jl-hfk. 
 
 [WEIN72] Weinberg, G. M., "Programming and Compiling Strategies for 
 
 Paging Systems, " Software-Practice and Experience, Vol. 2, 1972, 
 pp. 165-171. 
 
 [WEIZ69] Weizer, N. and Oppenheimer, G., "Virtual Memory Management in 
 
 a Paging Environment," AFIPS SJCC (1969), Vol. J>k, pp. 270-271. 
 
 [WILK63] Wilkinson, J. H., Rounding Errors in Algebraic Processes, Her 
 Majesty's Stationery Office, London (1963). 
 
 [WILK71] Wilkinson, J. H. and Reinsch, C, Linear Algebra, Springer Verlag, 
 1971. 
 
 [YEL071] Yelowitz, L., "Toward Optimal Paginations for Structured Programs," 
 Symposium on Computers and Automata, Polytechnic Institute of 
 Brooklyn, 1971, pp. 225-236. 
 
159 
 
 APPENDIX A 
 
 PL/I PROGPAMS FOR THE SIMULATION OP 
 SOME PAGING ALGORITHMS 
 
 A.l LRU WS MIN 
 
 We will present LRU_WS_MIN in this section. There are three 
 modules: the first module declares all the global variables used by 
 LRU_WS_MIN, the second module is the procedure LRU_WS_MEN and the third 
 module computes the results. Note that LRU_WS_MIN Is a stack algorithm 
 and therefore, in one pass of the reference string, obtains the page fault 
 behavior for all possible values of the page allotment or the window size 
 whichever is pertinent to the algorithm. 
 
160 
 
 INIT((NA+1)(0)) V 
 
 INITUNA + 1 )(UJJ, 
 INITUNA+1) (0)), 
 INIT((NA+l)(0)) f 
 INIKOI , 
 
 INITIO) » 
 
 » 
 t 
 t 
 f 
 
 VARIABLES ♦ / 
 
 , /♦ADDRESS SPACE SIZE ♦ / 
 
 /♦LRU DI ST. COUNTER */ 
 
 /♦REF. INT. FREQ. COUNTER*/ 
 
 /♦MMC FREQ. COUNTER ♦ / 
 
 /♦LRUSTACKPAGEXINOEX */ 
 
 /♦LRUSTACK POINTER */ 
 
 /♦REF.STR. LENGTH */ 
 
 /♦MEMORY SIZE */ 
 
 /♦LRU SUCC. FUNCTION */ 
 
 /♦WS SUCC. FUNCTION ♦/ 
 
 /♦MIN SUCC. FUNCTION ♦/ 
 
 /♦THEIR PAGE FAULTS ♦/ 
 /♦REF. INTERVAL COUNTER*/ 
 
 /♦P -STACK ♦/ 
 
 /♦P_STACK POINTER ♦/ 
 
 /♦PAGE FAULT FREQENCIE*/ 
 /♦AVE.WS SIZE ♦/ 
 
 */ 
 
 MODULEi:/*DECLARE ALL GLOBAL 
 DC UNA 
 
 LRUCNT(0:NAJ 
 WSCNT (0:NA) 
 MINCNT(0:NA) 
 XCNA+1) 
 CURSTACK 
 LNGTH 
 C 
 
 NCLRU 
 NTWS 
 NCMIN 
 
 PCLRUtPTWStPCMINt 
 REFCNT(NA+1) IN IT ( ( NA«-1 ) ( 0) ) , 
 P(NA+l) INITUNA*1)<0) J, 
 P_POINT INIT(Oi , 
 
 «) FIXED BINOlfO); 
 DCL(MCLRU,MTWS»MCMIN, 
 STAU ♦ 
 
 ##) FLOAT BIN; 
 M0DULE2:/* PROCEDURE LRU_WS_MIN 
 LRU_WS_MIN:PROC(XX> ; 
 
 /♦THIS PROC SIMULATES MATTSON-LRU-STACK-ALG. TOGETHER*/ 
 /♦WITH GORDON-MODIFICATION TO GET WS-STATISTICS AND ♦/ 
 /♦ONE PASS MIN CF.BELADY £ PALERMO ♦/ 
 
 DCLCXX , /♦CURRENTLY REFERENCED PAGE ♦/ 
 
 DIST INIT(O), /♦STACK DISTANCE*/ 
 INT INIT<0)t /♦REF. INTERVAL ♦/ 
 I,J,JP f MINP.MINPP) FIXED BINOLtO); 
 LRU.WS: /*SEE R.L.GORDON ♦/ 
 LNGTH=LNGTH+1; CURST ACK=CURSTACK*l; 
 /♦PUT THIS PAGE ON TOP OF STACK */ 
 X( CURSTACK J=XX;REFCNT( CURSTACK )*l; 
 /♦SEARCH FOR THIS PAGE IN THE STACK ♦/ 
 DO I-CURSTACK-1 TO 1 BY -1; 
 IF X( U = XX THEN 
 DO; /♦ FOUND ♦/ 
 INT*REFCNT( I ) ;DI ST=CURSTACK-I ; 
 /♦MOVE PART OF STACK ONE BELOW */ 
 DO J=I TO CURSTACK-l BY i; 
 
 X( J)=X(J*1);REFCNT(J)=REFCNT< J*l); 
 END; 
 END;ELSE REFCNT(IJ=REFCNT(I)-i-i; 
 END; 
 
 /♦ STEP LRU L WS COUNTERS ♦/ 
 
 IF INT>NA THEN INT*NA;/*DONT TRUST WSCNT(NA)+/ 
 LRUCNT(DISTJ=LRUCNT{DIST)+l;WSCNT(INT)=WSCNT(INT)+l; 
 IF DIST=0 THEN DIST=CURSTACK-1 ; ELSE 
 DO;CURSTACK=CURSTACK-l;DIST=OIST-l; END; 
 /♦NEXT IS THE CODE FOR BELADYS M-OPERATOR ♦/ 
 
161 
 MIN: 
 
 IF 01ST*0 THEN RETURN; 
 
 IF DIST>P_POINT THEN 
 
 00; /* A FRESH PAGE-REF. */ 
 
 P_P0INT*P_P01NT*1;P(P_P0INT>-DIST*1;RETURN; 
 END; 
 
 /♦FIND SMALLEST ELE BELOW DIST IN P-STACK */ 
 MINPsP(l) ;J=l;LSMALL*P_POINT-DIST*i; 
 DO 1*2 TO LSMALL BY 1; 
 
 IF PU) < MINP THEN 
 
 do;minp=p<i);j»i; end; 
 end; 
 
 DO WHILEfl'BJ; 
 
 if j*lsmall then 
 do; 
 
 DO I = J TO P_P0INT-1 BY 1; 
 
 P(I)=P(I+1); 
 
 END; 
 
 P(P_POI NT )=MINP;MINCNT< MINP )=MINCNT( MINP 1*1; 
 
 RETURN; 
 END; 
 
 minpp=pu«-1J; JP*J*l; 
 DO I = J*2 TO LSMALL BY 1; 
 
 IF P(l) < MINPP THEN 
 
 D0;MINPP*P(1 );JP=I;END; 
 END; 
 
 P(JP)=MINP;P( J)=MINPP;J-JP; 
 END; 
 END LRU_WS_MIN; 
 
162 
 
 A. 2 DEMIN 
 
 In this section, we will present DEMIN. There are three modules: 
 the first module declares all the global variables used by DEMIN, the second 
 module is the procedure DEMIN and the third module computes and prints the 
 required statistics. Note that, DEMIN is not a stack algorithm and therefore, 
 requires one pass of the reference string for each value of the page 
 allotment c. 
 
 
163 
 
 MODUI_El:/*DECLARE 
 DCKNA 
 
 DM_STK(2,NA) 
 
 PREV 
 
 CURR 
 
 CURR.CNT 
 
 FULL 
 
 EMPTY 
 
 LNGTH 
 
 C 
 
 FAULTS 
 
 PULLS 
 
 ALL GLOBAL VARIABLES */ 
 
 ./♦SIZE OF THE ADDRESS SPACE ♦ / 
 
 INIT( (2*NA)(Oll f /*TWO COL. OF NUM. MATRIX */ 
 
 INIT(l) 
 INIT(Z) 
 
 INIT(J) 
 INIT(l) 
 INITO) 
 INIT(O) 
 
 ./♦INDICATES 
 ./♦INDICATES 
 
 ,/*#PAGES IN 
 
 LAST MfcMSTATE ♦ / 
 CURR MEMSTATE ♦ / 
 MEMSTATE ♦ / 
 
 INIT(O) 
 INIT(O) 
 ) FIXED 
 MODULE2:/*PROCEDURE DPMIN 
 DPMIN:PROC(XJ; 
 /♦THIS PROC. OBTAINS STAT. 
 DCL(X , /♦PAGE 
 
 ./♦#DF 
 
 ./♦#OF 
 ./♦#0F 
 BIN(31,0) ; 
 
 MEM. PAGES ALLOCATED ♦/ 
 PAGE FAULTS ♦/ 
 PAGE PULLS ♦ / 
 
 ♦ / 
 
 FOR DPMIN 
 INDEX 
 
 ♦/ 
 ♦ / 
 
 I,ITEMP) FIXED 8IM(31fO); 
 LNGTH = LNSTm-l; 
 IF CURR_CNT = C THEN 
 DO; /♦PRESENT MEMSTATE FULL ♦/ 
 
 /♦IF THIS PAGE IN MEM. THEN QUIT ♦/ 
 
 IF DM_STK(CURR,X)=FULL THEN RETURN; 
 
 ITEMP=PREV;PREV=CURR;CURR=ITEMP; 
 
 DO I«l TO NA; 
 DM_STK ( CURR , I ) = EMPTY ; 
 
 END; 
 
 DM_STK(CURRtXJ=FULL;;uRR_CNT = l;FAULTS = FAULTS«-l; 
 
 return; 
 end; 
 
 IF DM_STMCURR,X)=FULL THEN RETURN; 
 IF DM_STK(PREV,X)=FULL THEN 
 
 DO ;DM_STK( CURR, X)=FULL;CURR_CNT=CURR_CNT*l; RETURN; END; 
 DM_STK(CURRfX)=FULL;CUR*„CNT»CURR_CNT*l; 
 IF CURR_CNT=l THEN FAULTS*FAULTS*1 ; 
 ELSE PULLS =PULLS *1; 
 
 END DPMIN; 
 MOOULE3:/^COMPUTATION OF RESULTS 
 
 PULLS = FAULTS«-PULLS; 
 
 PUT SKIP DATA(C»FAULTS, PULLS, LNGTH) ; 
 
 ♦ / 
 
16k 
 
 A. 3 FREELRU 
 
 In this section, we will present FREELRU. This is a stack algorithm 
 and therefore, it obtains the page fault statistics, for all possible values 
 of the page allotment c, in one pass of the reference string. There are 
 three modules: the first module declares all the global variables used by 
 FREELRU, the second module is the procedure FREELRU and the third module 
 computes the results. 
 
165 
 
 MODULEl:/*DECLARE ALL GLOBAL VARIABLES */ 
 
 DCHNA ,/*SIZE OF AODRESS SPACE */ 
 
 DISTCNT(G:NA) IN IT( ( NA+1 )( 0) )» /*LRU DIST. COUNTER */ 
 X(NA*1) IMT( <NA*1H0)),/*STACK PAGE INDEX */ 
 PSTATECNA+1) INITUNA*1M0)),/*EFPTY-FULL INDICATOR */ 
 EMPTY INIT(0)t 
 
 FULL INIT(l), 
 
 NORMAL IMT(O), /"^INDICATES A PAGE REF. */ 
 
 DFREE INIT(l), VINDICATES PAGE IS DEAD*/ 
 
 CURSTACK INIT(OI, /^POINTER TO STACK */ 
 LNGTH INIT(O), /*REF. STRONG LENGTH */ 
 
 C t /*MEM.PAGE ALLOTMENT */ 
 
 NC t /*SUCCESS FUNCTION */ 
 
 FAULTS INITIO), /*#GF PAGEFAULTS */ 
 #) FIXED BIN(31 f 0); 
 DCHFC ,/*SUCCESSFREQENCY*/ 
 MC ,/*PAGEFAULT FREQ.*/ 
 ##) FLOAT BIN; 
 MCDULE2:/*PkCC FREELRU( STACK ALGORITHM) */ 
 FREELRU:PROC(ACTICN,XX) ; 
 DCHACTICN, VINDICATES WHETHER A PAGE REF. OR A FREE CCMM*/ 
 XX ,/*CURRETLY REFERENCED PAGE */ 
 I ,J,K,PTEMP,XTEMP, TEMP, FLAG) FIXED BIN(3l,0); 
 IF ACTICN=NCRMAL THEN 
 DO;/*A REFERENCE TO PAGE XX*/ 
 LNGTH=LNGTH + 1; 
 
 /♦SEARCH FOR THIS PAGE IN STACK */ 
 FLAG^O; 
 
 DO 1 = 1 TC CURSTACK WHILEt FLAG=0 ) ♦ 
 IF PSTATEt I)=FULL L X(I)=XX THEN 
 DO; 
 FLAG=i; /* TO TERMINATE SEARCH */ 
 PSTATEU )=EMPTY; 
 DISTCNT( I)=U1STCNT(I )+l; 
 END; 
 END; 
 
 IF FLAG=0 THEN 
 
 /* PAGE NOT FOUND IN CORE ,D ISTCNT < ) IS ABS. */ 
 /♦PAGE FAULT CCUNTER */ 
 
 DISTCNT(0)=UISTCNT(0)+1; 
 /* PLT CURR PAGE ON STACK */ 
 PTEMP = PSTATE(l);xTEMP = Xa) ;PSTATE( 1)=FULL; 
 
 X(i)=xx;K=i; 
 
 DC J=2 TO CURSTACK + 1 WHILE ( PTEMP=FULL ) ; 
 TEMP=PSTATE(J) ;PSTATE( J )=PTEMP ;PTEMP=T EMP ; 
 TEMP = X( J);X( J)=XTEMP;XTEMP=TEMP;K=J; 
 END; 
 
 IF K>CURST-ACK THEN CURSTACK = K; 
 END;/*CCDE FOR ACTICN=NORMAL ENDS ♦/ 
 IF ACTICN=DFREE THEN 
 
/♦DECLARE PAGE XX DEAD IF IN MEM, »/ 
 
 DO; 
 
 FLAG'O; 
 
 DO I»l TO CURSTACK WHI LE ( FLAG=0 ) ; 
 IF PSTATE1 U=FULL L X(I)*XX THEN 
 DO;FLAG=l;PSTATE( I)=EMPTY;K=I ;END; 
 END; 
 
 if flag=1 then if k=curstack then 
 
 curst ack=curs tack- 1; 
 eno freelru; 
 
 mgdule3:/*result computation */ 
 nc=o; 
 
 DC C=l TO NA; 
 
 NC = NC«-DISTCNT(C) ; FC =Nu/FLGAT < LNGTH J ; 
 
 MC = LO-FC ;FAULTS=LNGTH-NC ; 
 
 PUT SKIP DATA(C,NC, FAULTS, FC,MC); 
 END; 
 
 166 
 
167 
 
 k.k FREEDPREULRU 
 
 In this section, we will present FREEDERE^LRU. This is not a stack 
 algorithm and therefore, requires one pass of the reference string for each 
 value of the page allotment c. There are three modules: the first module 
 declares all the global variables, the second module is the procedure 
 FREEDPRE^LRU and the third module computes and prints out the required 
 statistics. 
 
168 
 
 MODULEl: /'DECLARE ALL GLOBAL VARIABLES 
 DCL(NA ,/*SIifc OF ADDRESS SPACE 
 
 INITUNAUO) ), 
 INITI (NA)O)), 
 INITO) , 
 
 /♦PAGfcTABLE 
 /♦LRU STACK 
 /♦PAGE REF. 
 
 ♦ / 
 ♦/ 
 
 ♦ / 
 
 ♦ / 
 
 ♦ / 
 
 DP4_PT(NA) 
 
 DP4.STACMNA) 
 
 NORMAL 
 
 PREPA6E INIT(I) 
 
 OFREE INITO) 
 
 0P4 POINT INITO) 
 
 FAULTS INITIO) 
 
 PULLS INITO) 
 
 PRECMT INITIO) 
 
 LNGTH INITO) 
 
 Ct /♦MEM. SUE */ 
 
 «) FIXED BINOlfO); 
 M0OULE2:/*PROC FREEDPRE4LRU 
 FREEDPRE4LRU:PR0C(ACTI0N,X); 
 /♦ THIS PROC. OBTAINS #PAGE FAULTS £ 
 /♦FOR FREEDPRE4LRU PAGING ALGORITHM 
 DCHACTON, /'INDICATES WHETHER TO FRE E t PREEPAGE 
 /*A REFERENCE TO PAGE X 
 X, /♦CURRENTLV REFERENCED PAGE 
 
 /♦LRUSTACK POINTER ♦ / 
 
 /♦*OF PEGE FAULTS ♦ / 
 
 /*#GF PAGE PJLLS */ 
 
 /♦#OF PREPAGED PAGES IN MEM ♦/ 
 
 /♦REF. STRING LNGTH*/ 
 
 #PAGE PULLS 
 
 */ 
 
 ♦ / 
 
 ♦ / 
 
 OR*/ 
 */ 
 */ 
 
 DCLCNOTPRESENT 
 
 NOTSETUP 
 
 NOTUSED 
 
 USED 
 
 I, J) FIXED 
 DCL(LACTION(0:2) 
 
 LPAGE (0:3) 
 
 INITO), 
 INIT(I), 
 INITO), 
 INITO), 
 BINOlfO); 
 
 INIT(LNORMAL,LPREPAGE,LFREE), 
 INITI SQT_PRESENT,NOT_SETUP, 
 M3T.USED,LUSE0)) LABEL; 
 GOTO LACTION(ACTION); 
 /*A REFERENCE TO PAGE X */ 
 LNORMAL: LNGTH =LNGTH+i; 
 GOTO LPAGE(DP4_PT(X)); 
 
 NOT_SETUP:/*REQUIRED PAGE NOT YET SETUP IN MEM 
 FAULTS*FAULTS*i;/* INC. PAGEFAULT COUNT »/ 
 /♦PUT THIS PAGE ON STACK */ 
 DP4_P0INT»DP4_P0INT*i ; 
 DP4_STACK(0P4_PCINT)=X; 
 DP4.PT (X)»USED;PREC^T*PRECNT-l; 
 /♦DECLARE ALL NOTSETUP PAGES SETUP ♦/ 
 DO 1=1 TO NA; 
 IF DP4_PT(I)«N0TSETUP THEN 
 D0;DP%_PT<I)»NOTUSEO;PULLS*PULLS*i;END; 
 END; 
 RETURN; 
 
 not_used:/^req.page setup */ 
 
 dp4_pt(x)«used;precnt=precnt-1; 
 
 dp*_pqint«dp*_p0int«-1;dp4_stacmdp4_p0int) = x; 
 return; 
 lused:/*page x has been used before ♦/ 
 
 do i«dp4_p0int to 1 by -1 ; 
 
 IF DP4_STACMI)«X THEN 
 DO; 
 DO J«I TO DP4.P0INT-1; 
 
 DP4_STACMJ)»DP4_STACKO-H); 
 END; 
 
 DP4.ST ACK I DP4.P0I NT )«X ; RETURN ; 
 END; 
 END; 
 /♦PAGE X IS NOT IN MEM */ 
 NOT.PRESENT: FAULTS*FAULTS*1 ; 
 
 ♦ / 
 
169 
 
 IF DP4.P0INT+PRECNT >»C THEN 
 00; 
 DP4_PT<DP4_STACKUJ1»N0TPRESENT ;/*REPLACED PAGE */ 
 DO J*l TO DP4_P0INT-l; 
 
 DP4_STACM J)*DP4_SrACM J*ll ; 
 END; 
 END; 
 
 ELSE DP4_P0INT«DP4_P0INT*l; 
 DP4_STACMDP4_P0INTI«X; 
 DP4_PT(X)*USED; 
 
 /♦DECLARE ALL NOTSETJP PAGES SETUP */ 
 DO 1*1 TO NA; 
 IF DP4.PTI 1)«N0TSETUP THEN 
 DO;DP*_PT(I)«NOTUSED;PULLS«PULLS«-i;END; 
 END; 
 RETURN; 
 LPREPAGE:/*PREPAGE PAGE X */ 
 IF DP4_PT(X)*N0TPRES£NT £ DP4_PDINT*PRECNT<C £ PRECNT < C-l THEN 
 DO; DP4.PT <X)«NOTSETUP;PRECNT*PRECNT«-l; end; 
 RETURN; 
 
 LFREE:/*PAGE X IS KN3WN TO BE DEAD NOW */ 
 IF DP*_PT(X) *USED THEN 
 DO I*DP4_P0INT TO 1 BY -1; 
 IF DP4_STACMI)«X THEN 
 
 do; 
 
 DO J* I TO DP4_P0INT-1; 
 
 DP4_STACK( J)*DP4_Sr4CK( J*l) ; 
 END; 
 
 DP4_P0INT=DP4_P0INT-l; 
 END; 
 END; 
 
 DP4_PT(X)*N0TPRESENT; 
 RETURN; 
 END FREEDPRE4LRU; 
 
 M0DULE3:/*C0MPUTATION OF RESULT */ 
 PULLS = FAULTS«-PJLLS; 
 PUT SKIP DATA(C t FAULTS,PULLStLNGTH); 
 
170 
 
 APPENDIX B 
 
 PROOF OF THEOREM 3.4 
 
 Theorem 3.^ : PRE3A is a stack algorithm for {w|oo e N , P( w )}« 
 Proof : It is sufficient to prove that 
 
 s t (c) C S (c+1) Vc>l, V t > 0. 
 
 The proof is by induction on t. Assume inductively that, 
 
 (1) U (c+1) = U t (c) + y where y = fi or y ± e N. 
 
 (2) N.(c+1) = N t (c) + y 2 where y g = or y 2 e N. 
 
 (3) S t (c+1) = S t (c) + y 3 where y^ = ft or y^ = y 1 or y^ = y 2 « 
 (^) C0UNT(y, t, c) = C0UNT(y, t, c+l), Vy e N.(c). 
 
 Clearly, all these assumptions are trivially true for t = 0. Now depending 
 
 on r, , there are two cases to consider: 
 
 t+1 
 
 case (l) r, -. = x e N. Now depending on x, there are four case to consider: 
 l(a): x e N, (c+l) which implies x e N, (c) orx/ N, (c). 
 Assume first that x e N (c). 
 From step 1(a) of PRE3A, 
 
 Vy € N t (c), C0UNT(y, t+1, c) = C0UNT(y, t, c) + 1 
 = COUNT (y, t, c+1) + 1 
 - COUNT (y, t+1, c+1). 
 Let N t (c+1) = N t (c) + y 2 . 
 Then- 
 
171 
 
 {y|y e N t (c+1), COUNT (y, t+1, c+l) > T} 
 = {y|y € N t (°)^ COUNT(y, t+1, c+l) > T) + 
 
 if COUNT (y , t+1, c+l) > T then y 2 
 = {y|y € N t (c), COUNT(y, t+1, c) > T} + 
 if COUNT (y , t+1, c+l) > T then y„. 
 Also let U, (c+l) = U, (c) + y where y and y cannot both be nonzero. 
 Therefore, we have, from step 1(a) of PRE3A, 
 
 U t+l( C+1 ) = U t+l^ C ' ) + ^ y l ° r y 2 ° r null ' ) 
 N t+1 (c+l) = N t+1 (c) + (y 2 or null) 
 
 This proves all the inductive steps. 
 
 Next assume x e N (c). Now if N, (c+l) = N (c) + y p then x = y p . 
 Also y = 0, and S (c+l) = S,(c) + x. Therefore, for page allotment c, 
 we are in step l(d) of PRE3A and for page allotment c+l, we are in step 1(a) 
 of PRE3A. 
 
 U t+1 (c+l) = U t (c+1) + x U{y|y e N t (c+l), C0UNT(y, t+1, c+l) > T) 
 = U t (c) + x U{y|y e N t (c), C0UNT(y, t+1, c+l) > T } 
 Also since 
 
 Vy e N .(c), C0UNT(y, t+1, c) = C0UNT(y, t, c) + 1 
 
 "G 
 
 and C0UNT(y, t+1, c+l) = C0UNT(y, t, c+l) + 1, 
 we have, 
 
 U t+1 (c+l) = U t (c) + x U{y|y e N t (c), C0UNT(y, t+1, c) > T}. 
 
172 
 
 But from step l(d) we have this 
 
 = U t+ l (c) + V U t (c) ' q > x) * 
 Also, 
 
 S t+1 (c+l) = S t (c+1) = S t (c) + x 
 
 = s t+ l (c) + W c) ' q > X) * 
 From these, the required results follow. 
 case (lb) : x e U (c+l) =» x e U (c) or x ^ U (c). Assume first that 
 
 "C Xj "C 
 
 x 6 U (c) and assume that U (c+l) = U (c) + y, . From step 1(b) of PRE3A, 
 
 ~G "C X J- 
 
 U t+1 (c+l) = U t (c+1) U {y|y e N t (c+1), COUNT(y, t+1, c+l) > T) 
 = U t (c) + y ± U {y|y € N t (c), COUNT(y, t+1, c) > T) 
 + if C0UNT(y 2 , t+1, c+l) > T then y 2 
 
 = u t+1 ( c ) + (y ± or y 2 or nul1 ) 
 
 Also, 
 
 S t+1 (c+l) = S t (c+1) = S t (c) + (y x or y 2 ) 
 
 and S t+1 (c) = S t (c) 
 
 .'. . S t+1 (c+l) = S t+1 (c) + y 1 or y 2 . 
 
 This proves all the required things. 
 
 Next assume x ^ U (c) =» U (c+l) = U (c) + x. 
 
 X> Xj ~G 
 
 N t (c+1) = N t (c), S t (c+1) = S t (c) + x. 
 
 Therefore, for page allotment c+l, we are in step 1(b) and for page 
 allotment c, we are in step 1(d) of PRE3A. 
 
173 
 
 U (c+l) = U (c+l) U {y|y € N.(c+1), COUNT(y, t+1, c+l) > T) 
 
 t+1 
 
 = U t (c) + x U {y|y 6 \(c), COUNT(y, t+1, c) > t) 
 
 = u t+ i (c) + W c) ' q ' x)# 
 
 Also, 
 
 S t+1 (c+l) = S t (c+1) = S t (c) + x, 
 
 S t+1 (c) = S t (c) + x - R A (U t (c), q, x). 
 
 .*. S t+1 (c+l) - S t+1 (c) + R A (U t (c), q, x). 
 From these the required results follow. 
 case (lc) : x j. r, then we are in step l(c) of PRE3A. We have 
 
 S t+1 (c+l) = U t+1 (c+l) = S t (c+1) + x - R A (U t (c+l), q, x) 
 
 S t+1 (c) = U t (c) = S t (c) + x - R A (U t (c), q, x). 
 
 Let U t (c+1) = U (c) + y 1 . 
 Now since A is a stack algorithm, 
 
 R A (U t (c+l), q, x) = R A (U t (c), q, x) or y ± . 
 
 Therefore, 
 
 S t+1 (c+l) = 5 t (c+l) + x - (R A (U t (c), q, x) or y^ 
 
 = S t (c) + y 1 + x - (R A (U t (c), q, x) or 3^), 
 
 Therefore, 
 
 S t+1 (c+l) = S t+1 (c) + (R A (U t (c), q, x) or y ± ) 
 
 Also, since 
 
 N t+1 (e + D -H t+1 (c)=A 
 
 the required results follow. 
 
17+ 
 
 case (id) : x ^ S, (c+l) =» x ^ S, (c). Therefore, we are in step 1(d) of 
 PRE3A. 
 
 U t+1 (c+l) = U t (c+1) + x - R A (U t (c+l), q, x) 
 
 U {y|y 6 N (c+l), COUNT (y, t+1, c+l) > T} 
 
 = U t (c) + y 1 + x - (R A (U t (c), q, x) or y ± ) 
 
 U {y|y e N t (c), COUNT(y, t+1, c) > T} 
 
 + if COUNT (y 2 , t+1, c+l) > T then y . 
 Also, 
 
 U t+1 (c) = U t (c) + x - R A (U t (c), q, x) 
 
 U {y|y e N t (c), COUNT(y, t+1, c) > T) . 
 Since y and y cannot both be nonzero, we have, 
 
 U t+1 (c+l) = U t+1 (c) + (y 1 or y 2 or R A (U t (c), q, x)) 
 
 Also, 
 
 S t+1 (c+l) = S t (c+1) + x - R A (U t (c+l), q, x) 
 
 = S t (c) + y 3 + x - (R A (U t (c), q, x) or y ± ) 
 
 = s t+1 ( c ) + (y ± or y 2 or R A (U t (c), q, x)). 
 
 From these, the required result follows. 
 
 case (2) ; r ^ +i - pre( x ) 
 
 Since OJ satisfies the property P, we have, x / |"? (or 
 
 x / S, (c) for any c. ). Assume S, (c+l) = S, (c) + y, and y_ is not null, 
 w t t j 5 
 
 (i) If |s t (c + 1)| < c+l, then |s (c)| < c. 
 
 Then, N.' n (c+l) = N. (c+l) + x = N. (c) + y„ + x 
 "C+-L X> "t 2 
 
 and N.^, (c) = N. (c) + x 
 
 "C+J. "t 
 
175 
 
 Also since y = y , 
 
 N t + l (ctl) = N t+l (c) + y 5 
 S t+1 (c+l) = S t (c+1) + x 
 
 = S t (c) + y^ + x 
 and S t+1 (c) = S t (c) + x 
 
 Therefore, S t+1 (c+l) = S t+1 (c) + y 
 
 This proves the required result. 
 
 (ii) If Is (c+l)| = c+l, then |s.(c)| = c. 
 
 Then, s t+1 ( c ) = s f ( c ); and s t+1 ( c+1 ) = s t ( c+1 ) > etc « 
 Therefore, all the required results follow. 
 
 Now assume that y is null, i.e. S, (c+l) = S, (c). 
 j5 t "t 
 
 (i) S .(c) < c which implies S (c+l) < c+l, then the 
 required results are trivial to prove. 
 
 (ii) S (c) = c which implies S (c+l) = c < c+l. 
 Therefore, 
 
 S (c+l) = S (c+l) + x = S (c) + x 
 and S (c) = S (c) which implies S , (c+l) = S (c) + x. 
 Also N (c+l) = N (c+l) + x = N t (c) + x 
 and N (c) = N (c) which implies N (c+l) = N (c+l) + x. 
 This proves the required result. 
 
 Since we have completed all possible cases of r, .. , the proof is 
 
 complete. 
 
176 
 
 APPENDIX C 
 
 PL/ I PROGRAMS FOR SEVERAL MATRIX ALGORITHMS 
 
 0. 1 Cholesky Decomposition 
 
 CDSRFP:PROC( A, NSMALL, MSMALL); 
 
 /♦THIS PROC FACTORS THE SYM.POS .DEF. MATRIX*/ 
 /♦A (OF ORDER NSMALL) INTO THE PROD, OF A */ 
 /♦LOWER TRIANG.fc AN UPP.TRI.MAT. */ 
 
 /*it is a submatrix alg.with loop reversal*/ 
 /*& freeing i prepaging */ 
 
 n=nsmall/msmall; 
 
 00 1=1 TO N; 
 /* FREE A<1,1> */ 
 00 J = l TC I-l; 
 CALL FREE(A,I-1,J); 
 
 END; 
 
 /* PREPAGE SUBARRAYS C & DIFFERENTIALLY */ 
 DO K = I TO N; 
 
 CALL PREPAGE(A,K,I ) ; 
 END; 
 
 /* CARRY OUT D=D-R*R' */ 
 DC J = l TC I-l; 
 DO I 1=1 TO msmall; 
 DO Jl = i TO Hi 
 S=0; 
 
 DO Kl=l TO MSMALL; 
 S=S+A( (I-l)*MSMALL+Ilf ( J-l ) *MSMALL* J 1 ) 
 *A( (I-l)*MSMALL+Ilt(J-ii*MSMALL*Jl); 
 END; 
 
 A( ( I-l >*MSMALL+I1,(I-1)*MSMALL+I1)= 
 A( (I-l)*MSMALL*Il,(l-l)*MSMALL+Ili-S; 
 END; 
 END; 
 END; 
 
 /*D0 CHOLDEC OF THE DI AG. SUBMATRIX U */ 
 CALL CO(A(< I-l)*MSMALL*l:l*MSMALL, ( I-l )*MSMALL+1 : I*MSMALL) 
 
 fMSHALLi; 
 /♦INVERT SLT PART OF I) t STORE INTO SUT PART OF D*/ 
 DO K=2 TC MSMALL; 
 Sl=A((I-l)*MSMALL*K f ( 1-1 )*MS MALL *K ) ; 
 DO I 1 = 1 TO K-l; 
 S=A(( I-1**MSMALL+K,J I-l )*MSMALL+I 1 ) / 
 
 A( (I-l)*MSMALL+Il,(I-l)*MSMALL+Ill; 
 DO J=Il*l TO K-l; 
 S=S*A((I-1)*MSMALL*-K,< I- 1 ) *MSMALL* J )* 
 A( (I-1)*MSMALL+I1, <I-l)*MSMALL+J); 
 
 end; 
 
 A({ I-1)*MSMALL*U,(I-H*MSMALL+K) = S/S1; 
 END; 
 END; 
 
 /♦CCMPUTE C=C-M*R» ♦/ 
 
 /♦REVERSE FOLLOWING LOOP WITH RE. TO I */ 
 IF M0D(I,2J=1 THEN 
 
177 
 
 do;jll*i*i;jhl«n;jstep*i;end;else 
 D0;jll*N;jhl*I«-1;JSTEP*-1;END; 
 
 DO J*JLL TO JHL BY JSTEP; 
 /♦REVERSE FOLLOW. LOOP WITH RES. TO J */ 
 IF MOD<J t 2)*l THEN 
 
 do;kll*i;khl*i-i;kstep*i;end;else 
 DO;kll*I-1;khl*1;kstep*-1;END; 
 00 K*KLL TO KHL BY KSTEP; 
 DO Jl*l TO msmall; 
 DO 1 1* I TO msmall; 
 S*g; 
 
 DO Kl*l TO MSMALL*. 
 S=S*A((J-l)*MSMALL+Jl,(K-i)*MSMALL*KD* 
 A(( I-1)*MSMALL*I1,(K-1)*MSMALL*K1); 
 END; 
 
 A((J-1)*MSMALL*J1,(I-1)*MSMALL*IIJ* 
 A((J-1)*MSMALL*J1,( I-1)*MSMALL*11)-S; 
 END; 
 END; 
 END; 
 
 end; 
 
 /* MULT G<JtI> BY G<I,I> INVERSE */ 
 DO 4*1*1 TO N; 
 DCL S(MSMALL) FLOAT; /*TEMP. VECTOR */ 
 DO Jl*l TO msmall; 
 DO 11*1 TO MSMALL; 
 
 S< II )*A< < J-l )*MSMALL+J1, ( 1-1 )*MSMALL+I 1) ; 
 END; 
 
 DO 11*1 TO MSMALL; 
 S*S( 1 1)/A( (I-1)*MSMALL«-Ilt ( I- 1 )*MSMALL*I 1) ; 
 DO Kl=l TO 11-15 
 
 S*S*S(K1)*A(( I-1)*MSMALL*K1,(I-1)*MSMALL*I1); 
 END; 
 
 A( ( J-1)*MSMALL*J1,U-1)*MSMALL + I1J*S; 
 END; 
 END; 
 END; 
 END CDSRFP; 
 
178 
 
 C.2 LU Decomposition 
 
 lusrfp:proc( a, nsmall, msmall ) ; 
 
 /♦computes lu decomposition of matrix a of order*/ 
 /*nsmall & stores l t u ln ajemploys a submatrix*/ 
 /♦algorithm with loop rev . t freeinggprepaging */ 
 n=nsmall/msmall; 
 DO K=l TO N; 
 /* FREE SUBAKRAY A<1,1> */ 
 DO 1=1 TO K-18 
 
 CALL FREE(A,I,K-1) ;CALL FR EE ( A , K- 1 , I ) ; 
 END; 
 
 /* PREPAGE SUBARRAYS Y E, I */ 
 DO I=K TO N; 
 
 CALL PREPAGEU, I t K) ; 
 END; 
 
 DC J=K+1 TC n; 
 CALL PREPAGE(A,K, J); 
 
 END; 
 
 /* Fuf^M D = D-R * X */ 
 DO KK = 1 TC K-l; 
 DO 11=1 TC MSMALL; 
 DC Jl=l TC MbMALL; 
 S = 0; 
 
 DO KK1=1 TC MSMALL; 
 S=S+A( (K-1)*MSMALL+I1 t(KK-l)*MSMALL+KKl) 
 *A( ( KK-1)* MSMALL +KK1 , ( K~ 1 ) *MS MALL* Jl ) ; 
 END; 
 
 A( (K~l)*Mi>MALL + Il,(K-l) J «MSMALL+Jl) = 
 £( (K-l)*MSMALLtI it (K-1)*MSMALL+J1)-S; 
 END: 
 END; 
 END; 
 
 /* DO LU DECCMP CF D */ 
 CALL IU(M (K-l ) *MSMALL> 1 : K*MSMAL L , (K-l )*MSMALL*L: 
 
 K*MSMALL) , MSMALL) *, 
 /♦OCOMP. INV(LD), INV(UDJ I STOKE IN PAGE <N+1,1> */ 
 DC Kl=l TC MSMALL; 
 SliA(NSMALL+Kl ,K1 ) = 1/ A ( ( K- 1 ) *MSMAL L + Kl , 
 
 <K-1)*MSMALL*K1) ; 
 DO J=l TO Kl-l; 
 S=0; 
 
 DC I=J TC Kl-l; 
 S=S+A( (K-l)*MSMALL+Kl, (K-i )*MSMALL+I J* 
 A(NSMALL*I f J); 
 END; 
 
 A(NSMALL+K1, J)=-S*S1; 
 END; 
 
 DO J=Kl*l TC MSMALLJ 
 S = A( (K-1)*MSMALL + K1, ( K- 1 ) *M SMALL* J ); 
 DO I=K1*1 TC J-l; 
 S=S+A( (K-1)*MSMALL«-K1, (K-l )*MSMALL*I)* 
 
179 
 
 A(NSMALL«-I,JJ; 
 END; 
 
 A(NSMALL*K1, J)=-S; 
 END; 
 END; 
 
 /* COMP. P=P-Q * X 
 /♦REVERSE THE FOLLW.LUO 
 IF MOD(K,2)=l THEN 
 DO;ILL=K«-l ;IHL=N;ISTE 
 DC;ILL=N;IHL=K*1;ISTEP 
 DO I=ILL TG IHL BY ISTE 
 /*REV. THE FOLLOW. LOOP 
 IF M0D(I,2)=i THEN 
 
 dg;kkl=i;kkh=k-i;kkst 
 dg;kkl=k-i;kkh=i;kks7 
 dc kk=kkl tg kkh by kk 
 DG 11=1 TO msmall; 
 DC Jl=l tc msmall; 
 S=0; 
 
 DO KK1=1 TL MSMALL; 
 S = S*M II-1)*MSMALL 
 M (KK-1)*MSMAL 
 END; 
 
 A( (I-1)*MSMALL+Ilt< 
 A( (I-1)*MSMmLL*I1,( 
 end; 
 end; 
 END; 
 end; 
 /* comp. p = p*( inv(uo)*-i 
 
 DCL S(MSMALL) FLUAT /*T 
 OC I=K*MSMALL+1 TG NSMA 
 DG J=l TC msmall; 
 
 S( J)=A( I , (K-l )*MSMALL 
 END; 
 
 CC J=l TG msmall; 
 S = G; 
 DC Kl=l rc j-i; 
 
 S=S+S(K1)*A<NSMALL+K 
 END; 
 
 £< i ,<k-1)*msmall+j)=a 
 end; 
 END; 
 
 /♦COMP. 2=?~R*6 
 /♦REVERSE THE FOLLOW. lG 
 IF MCD(K,2)=1 THEN 
 
 Dc; jll=k*i; jhl=n; JST 
 DU; JLL=N; JHL=K+l; JST 
 DC J=JLL TO JHL BY JSTE 
 /♦REV. THE FOLLOW. LOGP 
 
 ♦ / 
 
 p with re. to k ♦/ 
 
 p=i;end;else 
 
 =-i;tND; 
 
 p; 
 
 WITH RE. I ♦/ 
 
 EP=l;END;ELSE 
 ep=-i;end; 
 
 STEP; 
 
 ♦lit (KK-D^MSMALL+KKU^ 
 L+KK1,(K-1)^MSMALL*J1); 
 
 K-1MMSMALL+J1) = 
 K-U^MSM*LL+Ji)-S; 
 
 ) ♦/ 
 
 EMP. STORAGE VECTOR ♦/ ; 
 LL; 
 
 ♦J); 
 
 if J); 
 
 (I t (K-l)^MSMALLO)+S; 
 
 ♦ / 
 uP WITH RE. TO K*/ 
 
 ep=i ;end;elSE 
 
 EP=-l;END; 
 
 p; 
 
 WITH RE. TO J ♦/ 
 
180 
 
 IF MOD 
 
 do;kk 
 
 oc;kk 
 
 DC KK = 
 
 DC 11 
 
 DC J 
 
 3 = 
 
 DG 
 
 S = 
 
 END 
 
 A( ( 
 
 A( ( 
 
 END; 
 
 end; 
 
 END; 
 
 END; 
 
 /* CGMP 
 
 DC J=K* 
 
 DC Ki = 
 
 S(Kl) 
 
 END; 
 
 DG 11 = 
 
 S=o; 
 
 DC Kl 
 
 S=S + 
 
 END; 
 
 A( IK- 
 
 END; 
 
 END; 
 
 END LUSR 
 
 (J,2)=l THEN 
 
 L = i ; KKH= K- I ; KK ST EP= I ; END ; EL SE 
 
 L=K-l;KKH*l;KKSTEP»-i;END; 
 
 KKL TU KKH BY KKSTEP; 
 
 =1 TC msmall; 
 
 1=1 TG msmall; 
 
 • 
 
 KKl=i tc msmall; 
 
 S + A< (K-i )*MS MALLEI It ( KK-1 ) *MSMALL«-KK1 ) * 
 A( (KK-1)*MSMALL*KK1,( J-l) *MSMALL+ J U ; 
 
 K-l)*MSMALL*Ilt ( J-1)*MSMALL+J1)= 
 K-1)*MSMALL+Ilf ( J-l)*MSMALL+Jl)-S; 
 
 UTE Z=INV(LD)*Z */ 
 MSMALL+1 TG NSMALL; 
 
 i to msmall; 
 
 =A( (K-1)*MSMALL*K1, J); 
 
 l to msmall; 
 
 =1 tc 11; 
 
 a(nsmall+i 1,ki)*s(k1) ; 
 
 1)*M$MALL*U,J)=S; 
 
 fp; 
 
181 
 
 C.3 Gaussian Elimination 
 
 gossp fp:proc( a, nsmallt msmall); 
 
 /♦this proc dces gaussian elim.on matrix a(of oroep */ 
 /♦nsmaldusing submatrix alg.with lcop reversal i ♦ / 
 /♦also using freeing £ prepaging ♦ / 
 
 n*nsmall/msmall; 
 
 /♦ preload as many pages of a as possible ♦ / 
 /♦desirable only when page allot. >2^n ♦/ 
 
 DO J = l TC N; 
 DO 1=1 TO N; 
 CALL PREPAGE(AtltJ); 
 END;end; 
 DO K=l TO N; 
 
 /♦MAJOR LOOP OF THE ALGORITHM ♦/ 
 
 /♦ FREE THE DEAD SUBARRAYS ♦/ 
 DO 1=1 TC K-l; 
 CALL FREE(AiIfK-l); 
 CALL FREECA.K-1, I) ; 
 END; 
 
 /♦ DO GAUSSELIM OF D ♦/ 
 CALL GOS(A((K-i)^MSMALL*i:K+MSMALL,(K-l)+ 
 
 MSMALL«-l:K^MSMALL), MSMALLJ; 
 /♦COMPLETE ELIM. ON SUBARRAYS R £ C ♦/ 
 DO J=l TO MSMALL; 
 /♦COMPUTE C<2>=C<2>/D<2,2> ♦/ 
 
 /♦REVERSE LOOP WUTH RE. TO J ♦/ 
 
 IF M0D(J,2I=1 THEN 
 DO; ILL = MSMALL^K + 1 ;IHL=NSMALL ; I STEP*1 ; END;ELSE 
 DO; ILL = NSMALL;lHL=K^MSMALL«-l;ISTEP--l;END; 
 DO I=ILL TC IHL BY ISTEP; 
 A( I,<K-1)^MSMALL*J)=A(I ,( K- 1 ) ♦MSMALL* J ) / 
 
 A( (K-l) ♦MSMALL * J , ( K- i ) ♦MSMALL* J ) ; 
 END; 
 
 /♦ C<3>=C<3>-C<2>+D<1,3> ♦/ 
 /♦PEVERSE THE FCLOW.LOOP OPP. TO PREV.LOP+/ 
 IF MCD(J,2)=0 THEN 
 DC;ILL = K«-l; IHL = N;ISTEP=i;END;ELSE 
 DO; ILL=N; IHL=k+i; ISTEP=-i;end; 
 DO I=ILL TO IHL BY ISTEP; 
 DC Jl=J*i TO MSMALL; 
 DO 1 1= 1 TC MSMALL; 
 A( ( I-1KMSMALL*I1,(K-1)^MSMALL*J1)«A( (I-l)^ 
 MSMALL* I i , (K-1)+MSMALL+J1)-A( ( I- I) ♦MSMALL *I I , 
 (K-1MMSMALL*J)^A( ( K-l ) ♦MS MALL* J, (K-l) ♦MSMALL 
 
 *Ji); 
 
 END; 
 END; 
 END; 
 /♦ R<3>=R<3>-D<3,2>^R<2> ♦/ 
 
 /♦ REV WITH RE. TO J ♦/ 
 
 IF MOD(J,2)=l THEN 
 
182 
 
 DO;JJL»K«-l; JJH*N;JJSTEP»l; END; ELSE 
 DO;JJL»N; JJH*K+1;JJ STEP— 1; END; 
 DO JJ=JJL TO JJH BY JJSTEP; 
 DO I1=J+1 TO msmall; 
 DO JJl*i to msmall; 
 A( (K-1)*MSMALL*U, ( JJ-1 >*MSMALL*JJ II* 
 A((K-1)*MSMALL*I1»( JJ-1 >*MSMALL+JJ 1 )- 
 A( <K-n*MSMALL*Ilt<K-ll*MSMALL*J) * 
 A((K-l)*MSMALL*J t <JJ-1J*MSMALL*JJL); 
 END; 
 
 end; 
 end; 
 end; 
 
 /* m=M-C*R */ 
 
 /*REV WITH RE. TO K */ 
 
 IF MOD(K,2)=l THEN 
 
 DO;ILL=K*l;lHL=N;ISTEP=l;END;ELSE 
 
 DO;ILL=N;IHL=K>i;lSTEP=-i;END; 
 DC 1=1 LL TO IHL BY istep; 
 
 /*REV WITH RE. TO I */ 
 
 IF MODU ,2)=1 THEN 
 DC; JLL=K*1; JHL=N;JSTEP*l;ENO;ELSE 
 DO; jll=n; jhl=k+i; jSTEP=-i;END; 
 
 DO J = JLL TO JHL BY JSTEP; 
 DO II =1 TO msmall; 
 DO Jl=l TO msmall; 
 S=o; 
 
 DC Kl =1 TO msmall; 
 S=S*A(U-1)*MSMALL«-I1i<K-1)*MSMALL«-K1)* 
 A( (K-1)*MSMALL«-K1,( J-l )*MSMALL*Jl) ; 
 END; 
 
 A( (I-1)*MSMALL*I1 f ( J-1)*MSMALL«-JU = 
 A( ( I-1)*MSMALL + U,< J-1I*MSMALL*J1I-S; 
 END GOSSRFP; 
 
183 
 
 C.J+ Gram-Schmidt Orthogonalization 
 
 ORTHSRFP:PROC(A,R,NSMALLl f ASMALL2 f MSMALL I ; 
 /♦THIS PROC FACTORS MATRIX A(OF SIZE NSMALLi X NSMALL2*/ 
 /♦) INTO V ♦ R, V HAS ORTHONORMAL COLUMNS & R IS UPPER*/ 
 /♦ TPIAGULAR ♦ / 
 
 N1=NSMALL1/MSMALL;N2=NSMALL2/MSMALL; 
 
 /♦PREPAGE ARRAY A, ADVISABLE IF PAGE ALLOT. >2*N1*/ 
 
 DO J = l TO N2; 
 DO I«l TO Nl; 
 CALL PREPAGE(A,I,J); 
 
 END;END; 
 
 DO K*l TO N2; 
 /♦ FREE ♦ / 
 
 DO 1 = 1 TO Nl WHILE(K>1) ; 
 CALL FREE(AtltK-l) ; 
 
 END; 
 
 DO J=K-l TG N2 WHILE(K>i); 
 
 CALL FREE(R,K-i,J) ; 
 END; 
 
 /♦ PREPAGE ♦ / 
 DO J=K TO N2; 
 
 CALL PREPAGE(R,K,J|; 
 END; 
 
 /♦ GPTHO. OF M VECTORS A<*,J) FOR K*M>* J>( K-l )*M ♦/ 
 DO Kl =1 TG MSMALL, 
 /♦ COMP.U-I J*M*-Kl TH. COL. OF R ♦ / 
 /♦REV. LOOP WITH RE. TO Kl ♦ / 
 IF MOD(Kl,2)=l THEN 
 DC;ILL=l;IHL=Nl;ISTEP=l;ENO;ELSE 
 DO;ILL=Nl;IHL=l;ISTEP=-i;ENU; 
 DC I=ILL TO IHL BY ISTEP; 
 /♦REV. LOOP WITH RE. TO I ♦ / 
 IF MOD( I,2)=l THEN 
 DO;JLL=i;JHL=Kl-l;JSTEP=l;END;ELSE 
 DO; JLL=Kl-i;JHL=l;JSTEP=-l;END; 
 DC J = JLL TO JHL BY JSTEP; 
 
 S=o; 
 
 DO 11=1 TO MSMALL; 
 S=S*A(<I-1I*MSMALL«-U,(K-1)*MSMALL«-J)* 
 A((I-l)*MSMALL^Iit<K-l)*MSMALL*J); 
 END; 
 
 P< <K- 11 ♦MSMALL *Jt (K-l) ♦MSMALL «-Kl)« 
 R«(K-l)*MSMALL*J,(K-l)*MSMALL-»-Kl)+S; 
 END; 
 END; 
 
 /♦COMP. (K-1I*M*K1 TH. COL OF A ♦ / 
 /♦REVERSE WITH RE. TO Kl */ 
 DO I = IHL TO ILL BY -ISTEP; 
 DO 11=1 TO MSMALL; 
 
 S = 0; 
 /♦REVERSE WITH RE. TO II ♦/ 
 
181* 
 
 IF M00( 11,21=1 THEN 
 DC;J1LL*1;JIHL»KI-1;J1STEP=1;END;ELSE 
 D0;J1LL*ki-1; J1HL=1;J1STEP=-1;ENC; 
 DO J1=J1LL TO J1HL BY J1STEP; 
 S=S*A(< I-l)*MSMALL*ILf ( K-l l*MSMALL*J I ) * 
 R< (K-l)*MSMALL<-Jl,(K-l)*MSMALL*Kl); 
 END; 
 
 A( ( I-l)*MSMALL^Ii,(K-U*MSMALL*Kl)« 
 A( ( I-1)*MSMALH-I1,(K-1)*MSMALL + K1)-S; 
 END; 
 END; 
 /♦NORMALIZE JUST COMPUTED COL OF A ♦ / 
 
 S=o; 
 
 DO 1=1 TO NSMALLl; 
 
 S=S*A( It <K-1J*MSMALL+K1)*A(I, ( K-l )*MSMALL*K1 1 ; 
 END; 
 
 S=SOPT(SJ;R( < K-l ) ♦ MS MALL *K1, (K-l J ♦MSMALL+K I i = S; 
 /♦REVERSE LOOP PERMANENTLY */ 
 DO I=NSMALL1 TO 1 BY -15 
 
 A(I, (K-li*MSMALL + Kl)=A(I, (K-l)*MSMALL-»-Kl)/S; 
 END; 
 end; 
 
 /♦COMPUTF R<2,3>=A<2>* * A<3> ♦/ 
 
 /♦REV LOOP WITH RE. TO K ♦ / 
 IF M0D(K,2)=1 THEN 
 
 DO;ILL=l;IHL=Nl;ISTEP*l;END;ELSE 
 
 DO;ILL=Nl;IHL=l;ISTEP=-l;ENO; 
 DO I=ILL TO IHL BY ISTEP; 
 
 /♦REV LOOP WITH RE. TO I ♦/ 
 
 IF MCO( 1,21*1 THEN 
 
 dc;kkl=k*i;kkh=N2;kkstep=i;END;else 
 dc;kkl=N2;kkh=k«-1;kkstep=-1;END; 
 DO KK=KKL to kkh by kkstep; 
 
 DC KK1=1 TO MSMALL; 
 DO Jl=l TO msmall; 
 S=o; 
 
 DC 11=1 TO MSMALL; 
 S=S+AIU-1>+MSMALL«-I1, ( K-l ) ♦MSMALL* J 1 ) ♦ 
 
 A( <I-i)^MSMALL*Il, (KK-1)^MSMALL+KK1I ; 
 END; 
 
 R{ (K-1)^MSMALL*-J1,(KK-1)^MSMALL*KK1)*S«- 
 R( (K-U+MSMALL+J1, < KK-1 )^MSMALL«-KK1 ); 
 END; 
 END; 
 END; 
 END; 
 
 /♦COMPUTE A<3>=A<3> - A<2> ♦ R<2,3> ♦/ 
 /♦REV IN A DIRECTION OPP. TO PREV. LOOP ♦/ 
 DO I=IHL TO ILL BY -ISTEP; 
 /♦REV WITH RE. TO I ♦/ 
 
185 
 
 IF MODI 1.21=1 THEN 
 DC; JLl=K*l; JHL=N2;JSTEP=i;END;ELSE 
 DO;JLL=N2; JHL = K«-l;JSTEP=-i;END; 
 DO J = JLL TO JHL BY JSTEP; 
 DO 11=1 TO MSMALL; 
 DO Ji=l TO MSMALL; 
 S = o; 
 
 DC Kl=l TO MSMALL; 
 S=S+A( (I-1)*MSMALL*I1, <K-1 )*MSMALL*K I ) * 
 R<(K-1)*MSMALL+K1,(J-1)*MSMALL*J1); 
 
 END; 
 
 A( (I-1)*MSMALL«-I1,< J-1)*MSMALL+J1)= 
 
 A( ( I-1)*MSMALL*I 1,( J-1)*MSMALL + JU-S; 
 
 END OPTHSRFP; 
 
186 
 
 C. 5 Matrix Multiplication 
 
 MMSRF P:PROC< A, B.CfNSMALL, MSMALL) ; 
 
 /♦ THIS PROC MULTIPLIES MATRICES B & C & STORES ♦ / 
 
 /♦ INTO A, BY SUBMATRICES;USES RE VER S AL , FREE ING ♦ / 
 
 /♦ 6 prepaging ♦ / 
 
 n=nsmall/msmall; 
 /♦prepage as much cf array c as possible ♦ / 
 
 /♦DESIRABLE IF PAGE ALLOT. > 2*N ♦ / 
 
 00 J=L TC N; 
 DC 1=1 TO n; 
 CALL PREPAGE(C, I tJ) ; 
 END;END; 
 
 /♦MAIN LOOP CF ALG. ♦ / 
 DO 1=1 TO n; 
 /♦FREE DEAD PORTIONS OF A & B ♦/ 
 DC 11=1 TC N WHILE( I>1) ; 
 
 CALL FREECA.I-1, I1MCALL FREE ( B, 1-1 , 1 1 ) \ 
 END; 
 
 /♦PREPAGE PCRTIONS CF A £ B ♦/ 
 DC 11=1 TC n; 
 CALL PREPAGE(B,I, II); 
 CALL PREPAGEUt If II); 
 END; 
 
 /♦REV LOOP WITH RE. TO I ♦/ 
 IF M0D(I,2)=1 THEN 
 DO; JLL=l; JHL=N; JSTEP=L ;END; ELSE 
 DC; JLL=N;JHL=l;JSTEP=-l; END; 
 DO J=JLL TC JHL BY JSTEP; 
 /♦REV LOCP WITH RE TC J ♦/ 
 IF M0DU,2) = 1 THEN 
 DO;KLL=l;KHL=N;KSTEP=l;END; ELSE 
 
 do;kll=N;khl=1;kstep=-1 ;END; 
 
 DQ K=KLL TC KHL BY KSTEP; 
 DC 11=1 TC MSMALL; 
 DC Jl=l TC MSMALL; 
 
 s = o; 
 
 DO Ki=l TO MSMALL; 
 S = S + B( (1-1) ♦MS MALL -Hit (K-l )*MSMALL+K1 )♦ 
 C((K-1)^MSMALL+K1, ( J-l ) ♦MSMALL* J 1 ) ; 
 END; 
 
 A( (I-1)*MSMALL + Il f ( J-1)^MSMALL+J1)=S+ 
 A( (I-1)*MSMALL+Ilt ( J-l) ♦MSMALL* J 1) t 
 END MMSRFP; 
 
187 
 
 C.6 CDS 
 
 CDS:PROC(A,NSMALL t MSMALL) ; 
 
 /♦THIS PROC FACTORS SYM.POS.DEF. MATRIX ACOF ORDER*/ 
 /♦NSMALL) INTO G*G« ,G IS LCWER TIANG.THIS IS A */ 
 /♦SUBMATRIX ALGORITHM */ 
 
 N=NSMALL/MSMALL; 
 DO 1=1 TO n; 
 /* COMPUTE D=D-R*R» */ 
 DO J=i TO I-l; 
 DO 11=1 TO msmall; 
 DO Jl=l TO 11; 
 S=o; 
 
 DO Kl=l TO MSMALL; 
 S=S*A( (I-1)*MSMALL+Ilf ( J-l )*MSMALL + K1 ) * 
 A( ( I-1)*MSMALL*J1,( J-1)*MSMALL+K1); 
 END; 
 
 A( ( I-1)*MSMALL+I1, (I-1)*MSMALL+J1)= 
 A( ( I-1)*MSMALL"U1,<I-1)*MSMALL+J1>-S; 
 END; 
 END; 
 END; 
 
 /* DO DECOMPOSITION OF DIAG.SUBM. D */ 
 CALL CD(A( ( I-l)*MSMALL*l:I*MSMALL, 
 
 ( I- l)*MSMALL«-l:I*MS MALL), MSMALL); 
 /♦INVERT SLT PART OF D L STORE IN SUT PART OF D*/ 
 DO K=2 TO MSMALL; 
 Si=A( ( I-1)*MSMALL«-K,(I-1)*MSMALL+K); 
 DO 11=1 TO K-l; 
 S = A(U-1)*MSMALL*K, (I-l >*MSMALL+I 1 )/ 
 
 A( <I-1)*MSMALL+I1,(I-1)*MSMALL+I1); 
 DO J=I1+1 TO K-l; 
 S=S+A< ( I-1)*MSMALL+K,( I-l ) *MSMALL*J )* 
 A(( I-l )*MS MALLEI If (I-1)*MSMALL*J); 
 END; 
 
 A(( I-1)*MSMALL*I1,<I-1)*MSMALL*K)=S/Sl; 
 END; 
 END; 
 
 /♦COMPUTE C=C-M*R» */ 
 DO J=I + 1 TO n; 
 DO K=l TO I-l; 
 DO Jl=l TO MSMALL; 
 DO 11=1 TO msmall; 
 S=0; 
 
 DO Kl=l TO MSMALL; 
 S=S*A( ( J-1)*MSMALL+J1, (K-l ) *MSMALL+K1 )* 
 A( ( I-1)*MSMALL«-I If (K-1)*MSMALL*K1) ; 
 END; 
 
 A( (J-1)*MSMALL*J1,( I-1)*MSMALL-H1) = 
 A( (J-1)*MSMALL+J1,( I-1)*MSMALL+I1)-S; 
 END; 
 END; 
 
188 
 
 END; 
 END; 
 
 /* MULT G<J,I> BY G<ItI> INVERSE */ 
 DCL S(MSMALL) FLOAT; /*TEMP. VECTOR */ 
 DC J=I+1 TO n; 
 DO J 1=1 TO MSMALL; 
 DO 11=1 TO MSMALL; 
 
 S(I1)=A( (J-i)*MSMALL+Jl, { 1-1 ) *MSMALL*I 1J ; 
 END; 
 
 DC 11=1 TO MSMALL; 
 S = S ( I U/A( { I-l)*MSMALL*Ilt ( 1-1 > *MSMALL*I 1 ) ; 
 DO Kl=l TO I l-l; 
 
 S=S*S<K1)*A<( I-1)*MSMALL«-K1,U-1)*MSMALL*I1); 
 END; 
 
 A( (J-l)*MSMALL*Jlt< 1-1 )*MSMALL + I1) = S; 
 END CDS; 
 
189 
 
 APPENDIX D 
 ALGORITHM AP 
 
 ALG-AP: 
 
 /* Constructs the USE-LIST and uses it to insert instructions in an 
 OL/2 program */ 
 Step AP-1 [Data Structure]: 
 
 [A]: A node in the USE-LIST is one of the seven types: ARRAY, SUBARRAY, 
 FOR-BEGIN, FOR-END, DEAD -SUBARRAY, END-OF-LIFETIME, END-OF-SCOFE. 
 
 (1) A node of type ARRAY has three fields: NAME, LP and NEXT. 
 NAME is a pointer to the ACB of the array, LP is the lifetime 
 pointer of the array, and NEXT is the static link in the USE-LIST. 
 
 (2) A node of type SUBARRAY has the following fields: NAME, X, KC, 
 NEXT, LP. 
 
 NAME is a pointer to the ACB of the subarray, X consists of four 
 subf ields : r , r , c , c , each indicating a boundary expression 
 
 _L — -L _L u — -L J 
 
 of the subarray. NEXT and LP are as in (l) above. KC is a 
 (possibly empty) set of subf ields, each threaded on a dynamic link 
 (k-chain) of some FOR loop (and indicating that this subarray is 
 dynamic with respect to that loop). 
 
 (3) A node of type FOR-BEGIN has three fields: NAME, KC and NEXT. 
 
 NAME contains some unique identifier of the FOR loopj we have assumed 
 that the loop control variable is such an identifier. NEXT is as in 
 (l) above. The field KC originates the k-chain of the loop. 
 
\')0 
 
 (k) A node of type FOR-END has three fields: BP, NEXT and KC. 
 
 BP is a back-pointer to the associated FOR-BEGIN node, NEXT is 
 as in (l) above, KC is the end of the k-chain of the loop. 
 
 (5) A node of type DEAD-SUBARRAY has three fields: NAME, X and KC. 
 All these three fields are as in (2) above. 
 
 (6) A node of type END- OF- LIFETIME has two fields: RP and NEXT. 
 RP is a return-pointer to the ARRAY or SUBARRAY node of this 
 (sub)array. NEXT is as in (l) above. 
 
 (7) A node of type END-OF-SCOPE has two fields: NAME and NEXT. 
 NAME is the name of the array and NEXT is as before. 
 
 [B]: The ACB of each array is modified to have two additional pointer 
 fields, IP and CP. 
 
 [C]: Definitions : (Scope and lifetime) 
 
 (1) S(A) = if A is of type ARRAY then ( min IP[B], max LP[B]); 
 
 BCA BgA 
 
 else if A is of type FOR-BEGIN then 
 
 (pointer to FOR-BEGIN[A] , pointer to FOR-END[A]); 
 
 (2) L(A) = if A is of type ARRAY or SUBARRAY then (lP[A], LP[A])j 
 
 Step AP-2 [ Build up the USE-LIST -during the syntax analysis phase] : 
 [A] if SCAN = 'FOR' then 
 do; 
 Create a node of type FOR-BEGIN, pointed to by P, and link it 
 to the USE-LIST (on the static link); set NAME[P] = 'k*; 
 KC[P] = null; 
 end; 
 
191 
 
 [B] if SCAN = 'A' /* an array */ then if CP[A] = null then 
 
 /* the first occurrence of A */ 
 
 do; 
 Create a node of type ARRAY, pointed to by P, and link it to 
 the USE-LIST; set IP[A] = P; CP[A] = P; NAME[P] = A; LP[ P] = null; 
 
 end; else /* not the first occurrence */ CP[A] = P; 
 
 [C] if SCAN = 'B' then /* a subarray, B = A<I,J> */ 
 
 if CP[fi] = null then /* first occurrence */ 
 do; 
 Create a node of type SUEARRAY, pointed to by P, and link it 
 to the USE-LIST; set IP[B] = CP[B] = P; NAME[P] = B; LP[P]=null; 
 X[P] = (r I _ 1 , t x , c J _ 1 , Cj ); KC[P] = null; 
 
 end; else /* not first occurrence */ CP[B] = P; 
 
 [D] if SCAN = 'END' then /* end of a FOR loop */ 
 
 do; 
 Create a node of type FOR-END, pointed to by P, and link it 
 to the USE-LIST; link this node to the k-chain of the corresponding 
 FOR loop and set BP[ P] = pointer to the matching FOR-BEGIN node; 
 
 end; 
 
 Step AP-3 [ This step modifies the USE-LIST to add dead subarrays, lifetime 
 pointer, correct lifetimes and end of scope nodes. This is 
 done in a post-syntax-analysis, precoding phase]: 
 
192 
 
 [A] [fill up lifetime pointers and create end-of-lifetime nodes]: 
 Scan the USE-LIST sequentially. 
 
 if SCAN = ARRAY[P] or SUBARRAY[ P] then 
 
 do; 
 
 B = NAME [P]j LP[P] = CP[B]; Create a node of type 
 END-OF-LIFETIME, pointed to by Q, and insert in the USE-LIST 
 at LP[P]j set RP[Q] = P; 
 
 end; 
 
 [B] [ link dynamic suharrays on the k-chain ]: 
 Scan the USE-LIST sequentially. 
 
 if the current nest level 1 > then if SCAN = SUBARPAY[P] then 
 do; 
 
 B = NAME [P]; 
 
 for all 'FOR' loops f do; 
 
 if S(f) H L(B) £ cp then 
 do; 
 
 k = NAME[f]; 
 
 if P(B, k) = T then link B to the k-chain of loop f; 
 end; 
 end; end; 
 
 [C] [modify the USE-LIST to take care of the overlapping lifetimes, 
 
 introduce end-of- scope node, enter dead- suharrays] : 
 (l) [modify overlapping lifetimes]: 
 
 for all arrays A such that S(A) ^ cp, do; 
 
 /* let 1 be the level of a sub-array B of A. Here the level 
 means, the level in ACB tree structure */ 
 
 
193 
 
 1 . = min 1 (B): 1 = max l(B) ; 
 
 mn BCA maX BCA 
 
 /* note that l(A) ■ 1, 1 (A <L, 2>) = 2 */ 
 
 for j=l to 1 . -1, do: 
 max min 
 
 for all D^A such that l(D) = j do; 
 
 let Q be the pointer to the node D in the USE-LIST; let 
 
 the lowest level ancestor of D in the USE-LIST be denoted 
 
 by E and P points to its node in the USE-LIST; 
 
 if (IP[E] > IP[D] ) and ( LP[P] > LP[ Q] ) then 
 
 do; 
 
 CP[D] = LP[Q] = LP[P]; delete the END-OF-LIFETIME 
 
 node of D; 
 end; 
 
 if ( LP[P] < LP[Q] ) then 
 do; 
 
 CP[E] = LP[P] = LP[Q]; delete the END- OF- LIFETIME 
 node of E; RP[Q] = P; 
 if (IP[E] < IP[D] ) then 
 do; 
 
 delete node D from the USE-LIST; CP[D] = IP[D] = null; 
 end [C] (1); 
 
19k 
 
 (2) [insert the END-OF-SCOPE nodes]: 
 
 for all arrays A such that there is a subarray B of A 
 
 such that B € USE-LIST do; 
 
 set (P,Q) = S(A); Create a node of type END-OF-SCOPE 
 and insert it in the USE-LIST at Q; NAME[Q] = A; 
 
 end; 
 
 (3) [create dead subarrays and link to USE-LIST]: 
 
 for all arrays j. USE-LIST such that there is a subarray B 
 of A such that B e USE-LIST do; 
 
 j=aiax fl(B) BCA, B e USE-LIST, there is a k 
 
 such that P(B,k) = T } ; 
 construct a sequence of subarrays of A at level j; 
 name the above sequence DEAD-LIST (A); 
 if (B e DEAD-LIST (A)) and ( (B e USE-LIST) or 
 
 ( for all k P(B,k)=F)) then delete B from DEAD-LIST(A); 
 for each FOR loop f do; 
 for each B e DEAD-LIST (A) do; 
 let the loop control variable of loop f be k; 
 let Q point to FOR-BEGIN node of f; 
 if S(f) O S(A) f cp then if P(B,k) = T then do; 
 create a node of type DEAD -SUBARRAY, pointed to by P, 
 and link it to the k-chain of f; 
 NAME[P] = B; X[P] = (^^ , ^ , Cj-1 , Cj ) ; 
 end; 
 end Step AP-3; 
 Step AP-k [use of the USE-LIST in the coding phase to do freeing 
 and prepaging] : 
 
195 
 
 [A] [freeing]: 
 
 if at lexical level 1=0 then on scanning an END-OF-LIFETIME 
 node of B do; 
 
 if B is an array then generate instruction: 'FREE(b)'; 
 
 if B is a subarray then generate instructions: 
 'IF for all x € X, MOD(x,m)=0 THEN FREE(b)'; 
 end; 
 on scanning an END-OF-SCOPE node of A generate the instruction: 
 
 'FREE (A)'; 
 if at level I > then on scanning an END-OF-LIFETIME or 
 END-OF-SCOPE node of B do the same as above except that the 
 code is inserted after the level £ is reduced to zero, i.e. 
 after exit from the outermost loop; 
 
 [B] [prepaging outside the loop]: 
 
 if at level £ = then generate instructions to prepage a few 
 
 nodes in advance; 
 if at level I > then prior to generating code for the loop, 
 
 generate instructions to prepage all the (sub) arrays in the 
 
 USE-LIST from the FOR-BEGIN node to the FOR-END node along 
 
 the static link; 
 
 [C] [freeing within a loop]: 
 (level i > ) 
 
 for all dead- sub arrays B along the dynamic chain of the loop, 
 
 generate the following code just after the backward branch 
 
 target of the loop: 
 
 'IF ((for all x e X , f (x) < 0) AND 
 
 1 *i+l 
 (for all x e X , f (x) > 0)) THEN 
 
 l+l 
 
196 
 
 IF for all x e X,MOD( x(k ),m) = THEN FREE(b)' ; 
 [D] [incremental prepaging within a loop]: 
 (level I > 0) 
 
 for all dynamic subarrays B along the dynamic chain of the 
 loop, insert the following code after the backward branch 
 target of the loop: 
 
 'IF ((there exists x e X, such that f, (x) = -l) OR 
 
 1 k. , ., 
 
 l+l 
 
 (there exists x e X such that f (x) = 1)) THEN 
 
 2 k. , 
 
 * i+1 
 
 IF there exists x e X Q such that M0D( x(k.),m) = 
 THEN PREPAGE(B) ' ; 
 end ALG-AP: 
 
197 
 
 VITA 
 
 Kishor Shridharbhai Trivedi was born in Bhavnagar, Gujarat State 
 (India), on August 20, 19M>. He received the Bachelor's degree in 
 Electrical Engineering from the Indian Institute of Technology, Bombay 
 (India) in 1968. Between 1968 and 1970* he was first employed by Larsen 
 and Toubro India Limited as an Apprentice Engineer and then by Inter- 
 national Business Machines (WTC) of India as a Customer Engineer. 
 
 He joined the University of Illinois, Urbana, as a graduate 
 research assistant in the Department of Computer Science in 1970. He 
 conducted research in the area of Digital Computer Arithmetic under the 
 guidance of Professor James E. Robertson. He obtained his Master's 
 degree in Computer Science in 1972. Since then, he conducted research in 
 the area of Operating Systems and Numerical Analysis under the guidance 
 of Professor J. Richard Phillips. He is the coauthor with J. E. Robertson 
 of a paper entitled, "The Status of Investigations into Computer Hardware 
 Design Based on the Use of Continued Fractions," which was published in 
 IEEE Transactions on Computers, June 1973» 
 
BIBLIOGRAPHIC DATA 
 SHEET 
 
 4. Title and Subtitle 
 
 1. Report No. 
 
 UIUCDCS-R-74-662 
 
 PREPAGING AND APPLICATIONS TO STRUCTURED ARRAY PROBLEMS 
 
 7. Author(s) 
 
 Kishor Shridharbhai Trivedi 
 
 9. Performing Organization Name and Address 
 
 University of Illinois at Urbana-Champaign 
 Department of Computer Science 
 Urbana, Illinois 61801 
 
 12. Sponsoring Organization Name and Address 
 
 National Science Foundation 
 Washington, D. C. 
 
 15. Supplementary Notes 
 
 3. Recipient's Accession No. 
 
 5. Report Date 
 
 July I974 
 
 8. Performing Organization Rent. 
 
 No. UIUCDCS-R-74-662 
 
 10. Project/Task/Work Unit No. 
 
 11. Contract/Grant No. 
 
 US NSF GJ-328 
 
 13. Type of Report & Period 
 Covered 
 
 Doctoral - 1974 
 
 14. 
 
 16. Abstracts ~~ 
 
 A demand prepaging algorithm is defined and proved to be an optimal demand 
 prepaging algorithm. A variable-memory prepaging algorithm PWS is also defined and is 
 based on Denning 's WS algorithm. It is shown that i5s incurs zero page faults Both 
 of these algorithms, however, cannot be used in practice since they require that the 
 lllZ n ll e TAlZsT knWn ^ ^^ ™°~' - *—ibe several p^act^afpre! 
 
 Data paging is of primary concern for problems with large data bases and for 
 many types of array problems. In particular, we show that prepaging reduces the 
 paging problems of array algorithms operating on large arrays. We^lso show that the 
 use of a submatrix algorithm considerably improves the locality. Finally we consider 
 
 in tnf "° mat ing + these Performance-improvement techniques^ means of Tcompner 
 in the context of a structure-array-language. compiler 
 
 7. Keywords and Document Analysis. 17a. Descriptors 
 
 Array Language 
 
 Compiler 
 
 Demand Paging 
 
 Locality 
 
 Matrix Algorithm 
 
 Multiprogramming 
 
 Paging Performance 
 
 Prepaging 
 
 Virtual Memory 
 
 b. Identifiers /Open-Ended Te 
 
 c ''OSATI Field/Group 
 
 • Availability Statement 
 
 RELEASE UNLIMITED 
 
 tRU NTIS-35 ( 10-70) 
 
 19. Security Class (This 
 Report) 
 
 -,. UNCLASSIFIED 
 
 20. Security Class (This 
 Page 
 
 UNCLASSIFIED 
 
 21. N> 
 
 ■ i Pages 
 
 22. Price 
 
 200 
 
 USCOMM-DC 40329-P? 
 
% 
 
 '9?4