UIUCDCS-R-78-922
UILU-ENG 78 1747

MINIMUM MEAN RUNNING TIME FUNCTION GENERATION USING READ ONLY MEMORY

BY
GILLES HENRI GARCIA

December 1978

MINIMUM MEAN RUNNING TIME FUNCTION GENERATION USING READ ONLY MEMORY

BY
Gilles Henri Garcia
Ing., École Supérieure d'Électricité, 1972
M.S., University of Illinois, 1976

THESIS
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1979

Urbana, Illinois

To my family.

MINIMUM MEAN RUNNING TIME FUNCTION GENERATION USING READ ONLY MEMORY

Gilles Henri Garcia, Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1979

This thesis presents a method for generating functions using a minimum mean running time polynomial approximation. Although the method can be used for software library routines, the work focuses on hardware implementations. With the advent of VLSI, Read Only Memories may be used to store many constants very economically. In addition, large multipliers are now available which multiply two n-bit numbers in O(n) or O(log n) time. In this discussion, multiplication is considered as the basic operator.

The method consists of splitting the interval [a,b] into several large partitions. Within each large partition, the function f(x) is evaluated in a large number of small subintervals using an approximating polynomial of very low degree. The coefficients of the polynomials are stored in ROMs. A set of partitions of [a,b] is defined such that the polynomials have the same degree within each partition. The optimal partition is obtained by solving a non-linear programming problem in which the objective is the minimization of the average number of multiplications over the whole range, assuming a logarithmic distribution of the mantissas of floating-point numbers, and the cost constraint is the maximum number of ROM words available.

A program has been written to solve this NLP problem, and a set of optimal curves representing the average number of multiplications achievable vs. the number of ROM words available has been obtained for the function f(x) = 1/x. All possible combinations of 2 and 3 partitions using Chebyshev polynomials of degrees 1 to 5 were considered.

Various bounds are given on the complexity of the method. It can generate f(x) to precision n in O(log(…)·M(n)) time, where M(n) is the time to multiply two n-bit integers and γ is a positive (non-zero) constant. The method uses O(n^…) words of memory and O(…) processors.

ACKNOWLEDGMENTS

I would like to thank my thesis advisor, Professor William J. Kubitz, for the advice, guidance and beyond-the-call-of-duty encouragement he gave me during these interminable years. I am particularly grateful to Professor A. Sameh for his interest and helpful suggestions, and to the other members of my thesis committee, Professors W. J. Poppelbaum, J. E. Robertson, and E. H. Davidson.
In addition, I would like to thank the Department of Computer Science for their financial support.

I wish to thank Lucy van Weringh for her patience, typing and psychological counseling, which had to be reciprocated at times. I am also indebted to Gayanne Carpenter and to Cinda Robbins for their assistance. And finally, I am especially indebted to Dan Pitt, with whom I have had many fruitful conversations about other things.

TABLE OF CONTENTS

1. INTRODUCTION ... 1
   1.1 Sequential vs. Parallel Function Generation Methods ... 5
   1.2 Current Parallel Methods ... 7
       1.2.1 Stefanelli's Inverting Circuit ... 9
       1.2.2 Chin's Method ... 12
   1.3 Rationale for the Approach Used Here ... 14
2. POLYNOMIAL APPROXIMATIONS FOR HARDWARE FUNCTION GENERATION ... 16
   2.1 Choice of a Criterion for Goodness of Fit ... 18
   2.2 Colloquation Points ... 22
   2.3 Quasi-Optimal Polynomial Approximation ... 24
   2.4 Piecewise Polynomial Approximation ... 25
   2.5 The Logarithmic Distribution of Leading Digits in Floating-Point Numbers ... 27
   2.6 Notation Used ... 29
3. ARCHITECTURE FOR HIGH-SPEED FUNCTION GENERATION: FOUR CLASSES ... 32
   3.1 Uniform Interval Length Using Uniform Degree Polynomials (UIUD) ... 33
   3.2 Variable Interval Length Using Uniform Degree Polynomials (VIUD) ... 35
   3.3 Variable Degree Polynomials with either Uniform or Variable Intervals (UIVD and VIVD) ... 40
       3.3.1 The Minimum Mean Running Time Approximation ... 42
       3.3.2 Algorithm for Finding the Number of ROM Words ... 46
       3.3.3 Linearization of the Non-Linear Programming Problem ... 60
4. THE SOLUTION OF THE NON-LINEAR PROGRAMMING PROBLEM FOR f(x) = 1/x ... 65
   4.1 Penalty Function Method ... 66
   4.2 Software Considerations ... 72
   4.3 Study of the Minimum ... 81
       4.3.1 Influence of μ ... 89
       4.3.2 Influence of the Maximum Number of ROM Words R_0 ... 89
5. RESULTS AND EXAMPLES ... 101
   5.1 Average Number of Multiplications vs. Number of ROM Words: Optimal Curves ... 101
       5.1.1 One Joint Results ... 103
       5.1.2 Two Joint Results ... 118
   5.2 The Effect of a Finite-Length Look-Up Address ... 130
6. HARDWARE IMPLEMENTATION ... 132
   6.1 Design Considerations: Expected Time Delay; How to Store the Coefficients ... 132
   6.2 Architectural Considerations: Obtaining the Address from the Argument ... 139
       6.2.1 UIUD Case ... 139
       6.2.2 VIUD Case ... 142
       6.2.3 UIVD Case ... 149
       6.2.4 VIVD Case ... 152
7. THE COMPLEXITY OF THE METHOD ... 154
   7.1 Introduction and Background ... 154
       7.1.1 The Measure of Effectiveness of an Algorithm ... 155
       7.1.2 The Best Known Bounds for Parallel Polynomial Evaluation ... 156
       7.1.3 Notation and Assumptions ... 157
       7.1.4 Time vs. Space: An Intuitive Look at the R-Method ... 161
   7.2 Memory Complexity of the R-Method ... 164
       7.2.1 Definition of Type 1 and Type 2 Functions ... 165
       7.2.2 Memory Complexity of the Method ... 166
       7.2.3 The Relationship between Polynomial Degree and ROM Size ... 169
   7.3 The Effectiveness of the R-Method ... 177
       7.3.1 The S-Method as a Particular Case of the R-Method ... 177
       7.3.2 The Number of Processors, Speed-Up, Cost, and Efficiency for the R-Method ... 181
       7.3.3 Comparisons Using 1 Processor ... 184
       7.3.4 Comparisons Using p Processors ... 188
       7.3.5 Comparisons to the Best Known Bound for Function Evaluation ... 191
8. CONCLUSIONS ... 192
   8.1 Areas of Further Research ... 192
   8.2 Summary and Conclusions ... 198
LIST OF REFERENCES ... 200
APPENDIX ... 205
   A.1 Program Listings ... 205
   A.2 Distribution of Floating Point Numbers Outside the Interval [1/β, 1] ... 219
   A.3 Limit on ROM Savings Coefficient When n → ∞ ... 222
VITA ... 227

LIST OF FIGURES

1.1 Stefanelli's Generation of 1/x ... 11
2.1 Colloquation Points ... 23
2.2 Notation Used Throughout the Discussion ... 30
3.1 Uniform Interval Length Case ... 34
3.2 UIUD Design Procedure ... 36
3.3 Illustration of C(f, P_m, x_i, x_i + h) as a Function of x_i for f = 1/x ... 37
3.4 VIUD Design Procedure ... 39
3.5 Variable Degree Polynomials with either Uniform or Variable Intervals (UIVD and VIVD) ... 41
3.6 Number of Breakpoints in the Non-Uniform Case ... 52
3.7 Number of Breakpoints in the Non-Uniform Case; Different Behavior of N(x_0, x) ... 54
3.8 Number of Breakpoints in the Non-Uniform Case; Another Behavior of N(x_0, x) ... 55
4.1 Computer Solution; Subroutine Structure ... 74
4.2 Average Number of Multiplications; No Penalty ... 83
4.3 Number of ROM Words (View 1) ... 84
4.4 Number of ROM Words (View 2, Rotation) ... 85
4.5 Number of ROM Words (View 3, Rotation) ... 86
4.6 Location of the Minimum on the Surface Representing Objective + Penalty as a Function of Joint Locations ... 87
4.7 Combination of Average Number of Multiplications and the Penalty on the Number of ROM Words ... 90
4.8 Influence of μ; μ = 0 ... 91
4.9 Influence of μ; μ = 10 ... 92
4.10 Influence of μ; μ = 100 ... 93
4.11 Influence of μ; μ = 10,000 ... 94
4.12 Influence of μ; μ = 1,000,000 ... 95
4.13 Influence of the Maximum Number of ROM Words R_0 on the Position of the Minimum; R_0 = 200 ... 97
4.14 Influence of the Maximum Number of ROM Words R_0 on the Position of the Minimum; R_0 = 350 ... 98
4.15 Influence of the Maximum Number of ROM Words R_0 on the Position of the Minimum; R_0 = 500 ... 99
4.16 Influence of the Maximum Number of ROM Words R_0 on the Position of the Minimum; R_0 = 1024 ... 100
5.1 Each Point on the Curve Represents a Minimum ... 104
5.2 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; One Joint; Polyn. Deg. = 1,2 ... 105
5.3 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; One Joint; Polyn. Deg. = 1,3 ... 106
5.4 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; One Joint; Polyn. Deg. = 1,4 ... 107
5.5 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; One Joint; Polyn. Deg. = 1,5 ... 108
5.6 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; One Joint; Polyn. Deg. = 2,3 ... 109
5.7 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; One Joint; Polyn. Deg. = 2,4 ... 110
5.8 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; One Joint; Polyn. Deg. = 2,5 ... 111
5.9 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; One Joint; Polyn. Deg. = 3,4 ... 112
5.10 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; One Joint; Polyn. Deg. = 3,5 ... 113
5.11 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; One Joint; Polyn. Deg. = 4,5 ... 114
5.12 Ordering Relationship for Implementations Involving Two Different Polynomials (One Joint) ... 115
5.13 The Designer's Choice Among Two Possible Implementations ... 117
5.14 The Designer's Choice Among More Than Two Possible Implementations ... 119
5.15 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; Two Joints; Polyn. Deg. = 1,2,3 ... 120
5.16 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; Two Joints; Polyn. Deg. = 1,2,4 ... 121
5.17 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; Two Joints; Polyn. Deg. = 1,2,5 ... 122
5.18 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; Two Joints; Polyn. Deg. = 1,3,4 ... 123
5.19 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; Two Joints; Polyn. Deg. = 1,3,5 ... 124
5.20 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; Two Joints; Polyn. Deg. = 1,4,5 ... 125
5.21 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; Two Joints; Polyn. Deg. = 2,3,5 ... 126
5.22 Optimal Curve: Average Number of Multiplications vs. Cost in Number of ROM Words; VIVD Case; Two Joints; Polyn. Deg. = 2,4,5 ... 127
5.23 Ordering Relationship for Implementations Involving Three Polynomials (Two Joints) ... 128
5.24 Global Ordering Relationship for Implementations with 2 or 3 Polynomials ... 129
5.25 Effect of Digitization on the Position of the Minimum; Two Joints; Polynomial Degrees 1,2,3 ... 131
6.1 Sketch of the Arithmetic Unit for Function Evaluation ... 133
6.2 Type 1 Architecture: UIUD ... 141
6.3 VIUD Case; Variation of the h's Over the Interval ... 144
6.4 Type 2 Architecture: VIUD ... 147
6.5 Barrel Shifter ... 148
6.6 Address Decoding Using a Barrel Shifter ... 150
6.7 Type 3 Architecture: UIVD ... 151
6.8 Type 4 Architecture: VIVD ... 153
7.1 The R-Method ... 158
7.2 Tradeoffs when n → ∞ ... 159
8.1 Optimal Determination of the Joints for Several Functions Simultaneously ... 195
A.1 Probability Density Function of Floating Point Numbers between 0 and 1 ... 221

LIST OF TABLES

3.1 Number of ROM Words for Uniform and Non-Uniform Intervals ... 59
5.1 Conditions Used for Figures 5.2 through 5.22 ... 102

1. INTRODUCTION

Since the publication of the article by Wallace [WAL64], giving an algorithm for performing division using multiplication as the basic operator, many researchers have considered methods for implementing high-speed division using the capabilities of high-speed multipliers [FER67], [STE72]. This research has also led to numerous papers on the generation of functions, such as 1/x and the trigonometric functions, by array-like function generators. The latter approach requires a different array for each function, resulting in an expensive process when many functions must be generated.

Considerable work has also been done in the area of digit-by-digit hardware function generation. De Lugish [DEL70], in a general approach to this problem, uses addition, subtraction, and shifting as the basic operations in his algorithms for implementing all arithmetic operations. An increase in speed for these types of sequential algorithms usually requires a redundant or high-radix number system. Numerous other papers have been published in this area [BAK73], [ERC75]. It is apparent, therefore, that the problem of sequential function generation has been thoroughly investigated.

This thesis deals with the following question: Is it possible to economically implement, in hardware, a numerical routine that performs faster than the traditional add-shift sequence of a digit-by-digit routine?

Historically, sequential algorithms were devised to avoid using multiplication by replacing it with shifting, which was faster and easier to implement. It is feasible today, however, to implement high-speed parallel multipliers at reasonable cost. These multipliers, often of the Wallace-tree type, can be built from LSI blocks based on large Read Only Memories (ROM) [STE77].
This paper investigates an approach to function generation based on table look-up and on the minimum mean running time approximation of functions. The basic operator is multiplication and our argument is based on existing high-speed hardware multipliers. Such multipliers can multiply two numbers in O(log n) gate delays. In order to consider algorithms having speed compatible with multiplication speeds, parallelism must be introduced at the hardware level. Most of the effort here is directed toward implementation of a function of one variable, f(x). The example f(x) = 1/x is used for illustration.

The general organization of the thesis is as follows. In Section 1, we briefly review the current parallel methods and give the rationale for the approach used here. Section 2 comprises a general discussion and a brief mathematical overview of the polynomial approximation methods. The applicability of piecewise polynomial approximation to hardware function generation is also considered. The choice of a criterion of "goodness of fit" is discussed, and the nature of the distribution of the leading digits in floating-point numbers is introduced for use in the following sections.

In Section 3, the general architecture of high-speed function generators is presented and we distinguish four classes whose implementations in hardware increase in complexity. In this section the minimum mean running time approximation is introduced, and it is shown that an optimal realization requires solving a non-trivial non-linear programming problem. In some instances this problem can be linearized or at least simplified. In Section 4, we actually undertake to solve this problem using a penalty method. A FORTRAN program has been written to solve this, with f(x) = 1/x as the example, for various combinations of polynomials in the range [.5, 1]. Some elementary considerations on the behavior of the minimum and a thorough study of the influence of various parameters lead to the results presented in Section 5. Examples of optimum curves are presented for f(x) = 1/x and polynomial degrees ranging from 1 to 5. The question of the choice of implementation given design constraints and the problem of address generation are discussed in Section 6. The four possible architectural types are described and some recommendations are made in view of the trends in ROM cost and size.

The complexity of the method is studied in detail in Section 7. Bounds for complexity in terms of memory requirements and bounds on the degrees of the polynomials are given. The method is also compared with a single-polynomial approximation approach. In Section 8, the generation of other functions is briefly overviewed and the problem of the probability distribution of the exponents of floating-point numbers is discussed. Finally, suggestions for future research are offered.

1.1 Sequential vs. Parallel Function Generation Methods

Sequential algorithms for numerical computations have been extensively studied during the past five years. One of the original reasons for the study of digit-by-digit algorithms was the problem of high-speed hardware implementation of numerical routines for function evaluation. At the time the research was carried out, multiplication schemes were not implemented in parallel because of the high cost and the amount of required hardware, O(n^2).
As a result, sequential algorithms were easily justified because the time required for evaluation, O(n), was comparable to the time required for multiplication [HAB70], [BAU73]. Compared to numerical function evaluation routines using Taylor or Chebyshev series, which required several multiplications to attain a given precision, digit-by-digit algorithms presented a viable solution to the problem of high-speed function evaluation.

However, the continuing reduction in the size and cost of integrated circuit components and the advent of modular elements allowing a fast Wallace-tree reduction scheme [STE77], [SIN73] have made multiplication an operation economically implementable at speeds of O(log n) [DAD65]. The rigidity of the digit-by-digit algorithms makes them incompatible with the parallel evaluation schemes which are applied to multiplication. Still, sequential algorithms find a broad range of applications in pocket calculators and special-purpose processors where speed is not a major concern.

The number of papers published on high-speed parallel algorithms for evaluation of commonly used functions that are implementable in hardware is surprisingly small [FER67], [CHI70], [FLY70]. It is interesting to quote [HAR68] concerning table look-up (in this case software) techniques:

"Tabular data are an important element of machine evaluation of functions in two major cases: when the convenient source of quantitative data consists of a set of numerical values, and when the need for speed of evaluation outweighs considerations of storage and high precision."

Storage size has been a major factor in ruling out table look-up techniques in the literature. This is due to the fact that large-scale ROMs and PLAs were not feasible as little as six years ago [now in 1978]. In addition, the only available basic operations were addition, subtraction and shifting. Today, LSI has made feasible long word length multipliers in very compact and efficient implementations [BRU73], [CHU74], [JOH73], [KIN71], [PEZ71], and multiplication with time of O(log n) should be fully included among the basic operations for numerical routines. Next, an overview of the parallel evaluation schemes considered in the literature is given.

1.2 Current Parallel Methods

The only major work in the area of parallel function evaluation was done by Tung Chin and Algirdas Avizienis in a paper entitled "Combinational Arithmetic Systems for the Approximation of Functions" [CHI70]. Their idea was to introduce a basic building block that allowed a modular implementation of Chebyshev polynomial or rational fraction evaluators. The set of polynomial coefficients was to be retrieved from a ROM and then used to evaluate the Chebyshev approximation of a given function f(x).

Some work has also focused on the area of high-speed evaluation of the function 1/x. These studies have been concerned with implementing fast division algorithms. There are several such algorithms in the literature dealing with the computation of the reciprocal, 1/x, of a normalized number x. Wallace developed an iterative division scheme making use of a fast multiplier [WAL64].
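As a concrete illustration of this class of multiplicative schemes, the following is a minimal sketch (in Python, for illustration only; it is a Newton-Raphson form with an assumed linear seed, not Wallace's actual circuit):

    # Reciprocal by multiplication only: r <- r * (2 - d * r).
    # The seed coefficients below are illustrative, not from [WAL64].

    def reciprocal(d, iterations=4):
        """Approximate 1/d for 0.5 <= d < 1 using only multiply/add."""
        r = 2.9 - 2.0 * d          # crude linear seed, good to a few bits
        for _ in range(iterations):
            r = r * (2.0 - d * r)  # the error 1 - d*r squares on every pass
        return r

    print(reciprocal(0.7), 1 / 0.7)

Because the error squares at each step (quadratic convergence), a handful of passes through a fast multiplier suffices for full working precision.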
The IBM 360 Model 90 floating-point execution unit also makes use of a fast multiplier to perform division [AND67]. It uses an iterative process where, on the kth iteration, a factor R_k multiplies both numerator and denominator such that the resultant denominator converges quadratically toward one and the resulting numerator converges quadratically toward the desired quotient:

    N / D = (N · R_0 R_1 ... R_n) / (D · R_0 R_1 ... R_n)  →  quotient / 1

Ferrari [FER67] has also proposed a multiplicative scheme for computing the reciprocal of a number x. He bases his algorithm on interpolation of 1/x. Ling [LIN70] and Stefanelli [STE72] describe an inversion algorithm based on the Taylor series, making use of Read Only Memories and a set of dedicated, non-cascaded multipliers. This latter technique is very fast because it is entirely parallel, but there are some inconveniences:

- The method uses dedicated multipliers to implement the division algorithm, but it would seem that such an algorithm should be designed around an n x n multiplier that could be used for the multiplication operation alone, as well as for the generation of other functions.
- The Taylor polynomial is known to produce very poor behavior with respect to error; it is possible to use other approximating polynomials which yield better error control.

It will be seen later that it is expensive to use algorithms based on series approximations and ROM look-up to implement a totally parallel divider for a precision of more than 32 to 34 bits. The precision can be increased, however, by more efficient use of the ROMs than that proposed by Stefanelli.

It seems then that the flexibility of ROMs and the possibilities introduced by Programmable Logic Arrays have not really been exploited in the current literature. It is interesting, however, to review the two main existing hardware implementations: the Stefanelli approach to a fast inverse function generator and the general approach proposed by Chin and Avizienis.

1.2.1 Stefanelli's Inverting Circuit

Stefanelli [STE72] computes the reciprocal 1/x of a binary number x. His design yields an inverter requiring 4K ROMs and 12 x 12 multipliers. The inversion algorithm is based on the Maclaurin series. x is written as

    x = 1.b_1 b_2 ... b_r b_{r+1} ... b_n,    or    x = x_1 + 2^{-r} x_2,

where x_1 = 1.b_1 b_2 ... b_r holds the leading bits and 2^{-r} x_2 the remaining ones. One can expand 1/x as

    1/x = 1/(x_1 + 2^{-r} x_2)
        = (1/x_1) · 1/(1 + 2^{-r} x_2/x_1)
        = 1/x_1 - 2^{-r} x_2/x_1^2 + 2^{-2r} x_2^2/x_1^3 - ...

By taking the first-order approximation, the inverse is

    1/x ≈ 1/x_1 - 2^{-r} x_2/x_1^2.

This can also be viewed as a polynomial of the variable x_2 approximating 1/x between the two points x = 1.b_1 b_2 ... b_r and x = 1.b_1 b_2 ... b_r + 2^{-r}. (See Figure 1.1.) Stefanelli's technique consists of storing 1/x_1 and 1/x_1^2 (the coefficient of the variable x_2) in ROM and performing the multiplication by 2^{-r} x_2.

It is not a very efficient algorithm. His approximation is equivalent to a piecewise polynomial approximation, where the intervals have size 2^{-r} and the polynomial is a Taylor polynomial. It will be seen that for the same amount of memory one can certainly reduce the error by using a better polynomial such as a Lagrange or Chebyshev polynomial. He also uses a set of dedicated multipliers to perform the multiplication; this is unnecessary, as the multiplier could be used to generate several functions and, as such, provide a common hardware structure for several algorithms. And finally, the precision of the approximation is not uniform over the interval [1/2, 1].
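A minimal sketch of the first-order scheme just described (in Python, for illustration; the table width R is an assumed parameter, not Stefanelli's):

    # Split x = x1 + 2**-r * x2, look up 1/x1 and 1/x1**2,
    # and spend one multiplication.

    R = 6                                   # bits of x kept in x1
    TABLE = {i: (1.0 / (1 + i * 2**-R),     # 1/x1
                 1.0 / (1 + i * 2**-R)**2)  # 1/x1**2, coefficient of x2
             for i in range(2**R)}

    def approx_inverse(x):
        """First-order approximation of 1/x for 1 <= x < 2."""
        i = int((x - 1.0) * 2**R)           # leading bits select the interval
        x1 = 1 + i * 2**-R
        t = x - x1                          # this is 2**-r * x2
        inv_x1, inv_x1_sq = TABLE[i]
        return inv_x1 - inv_x1_sq * t       # 1/x ~ 1/x1 - 2**-r * x2 / x1**2

    x = 1.37
    print(approx_inverse(x), 1 / x)         # error O(2**-2r), not uniform

Running the error over the whole range makes the non-uniformity visible: the approximation is markedly worse near x = 1 than near x = 2, which is the criticism made above.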
[Figure 1.1: Stefanelli's generation of 1/x]

1.2.2 Chin's Method

Chin's approach constitutes an almost direct transposition of a subroutine library method into hardware. He defines the concepts of Arithmetic Building Blocks (ABBs) and Combinational Arithmetic nets (CAs). An ABB can be seen as a single package having an adder (called the "summer") which is an array of carry-save adders. This adder receives inputs from an "inputter" that converts a binary number into a minimally redundant radix-16 system. An "outputter" converts back into conventional binary form. Just to give an idea of the complexity involved, the number of I/O connections of the ABB is 126 plus the necessary power supply pins. For the whole ABB, the circuit complexity is the sum of 560 gates, 131 full adders, and 14 flip-flops. The longest signal path in the ABB contains 19 gates, 20 full adders, and 1 flip-flop. The ABB can generate a two-digit (base 16) product plus the standard arithmetic functions. Chin uses many ABBs interconnected to form parallel arrays called combinational arithmetic (CA) nets or pipelined combinational arithmetic (PCA) nets.

Each ABB can be considered to be a two-digit arithmetic unit that can perform the basic operations of addition, subtraction and multiplication. These building blocks can be connected so as to evaluate a polynomial in either a parallel fashion or a series fashion (pipelined CA). Between these two extremes, the designer must make a trade-off between the cost of hardware and the speed of execution. Division is difficult in a CA net, and this fact forced the author to choose polynomial approximation, which uses no division at all, for the approximation of functions.

The main interest of the ABB is the modularity which allows pipelined techniques, but the cost of implementation is absolutely prohibitive in spite of the benefits of modularity. A 60-bit precision result would require between 2,500 and 4,000 of these ABBs, and the speed would still be quite low.

In conclusion, it is apparent that although the modularity and flexibility of the design procedure using CA nets has certain advantages, the cost and actual hardware implementation would be extremely high and burdensome. The key point in Chin's approach is that a direct software-to-hardware transposition does not bring any improvement to the function generation problem. Clearly, some other methods must be investigated, using the possibilities of LSI, ROMs and PLAs, if high speed and simplicity of design are to be achieved.

1.3 Rationale for the Approach Used Here

While the design of mathematical function routines has already been thoroughly investigated from the numerical analysis point of view, the research on combinational arithmetic units for function generation has been limited to special array-type function units or division. Research in computer arithmetic has focused on digit-by-digit or sequential algorithms, which make use of the high-speed addition, subtraction and shifting capabilities of current computer technology [BAK73]. Higher-radix number systems have been considered, with the basic motivation of increasing the speed of the sequential algorithms by the introduction of redundancy into the digit representation of the numbers [IRW77], [ERC75], [DEL70], [LAK76], [ROB76].
The impact of LSI on these algorithms has been studied from the point of view of modularity, and research is currently focused on various types of Universal Arithmetic Blocks that can perform addition, subtraction, multiplication and division in a serial, digit-by-digit fashion.

It seems, however, that insufficient attention has been given to the ever increasing capability of LSI modularity and speed and its potential application to combinational arithmetic units. Large and fast memories are now a part of today's generation of large-scale computers [TOR75]. High-speed LSI has allowed the design of large combinational multipliers in very compact form. If the time to multiply two n-bit numbers has been decreased from O(n) to a feasible O(log n), it seems reasonable to attempt to obtain, in the same amount of time, the commonly used functions. The possibility of easy implementation of interpolation, including piecewise and spline interpolation, now exists at the hardware level. The object of this thesis is to study the application of the flexibility and versatility of Read Only Memories, Programmable Logic Arrays and LSI in general, to high-speed, parallel arithmetic.

2. POLYNOMIAL APPROXIMATIONS FOR HARDWARE FUNCTION GENERATION

In this section, the basic concepts of piecewise polynomial interpolation are considered and the statistical properties of floating-point numbers are reviewed. Rather than cumbersome and highly general mathematical definitions, a notation is used which, if not always rigorous, makes the development of the arguments understandable to both logic designers and mathematicians. The problem is neither to find the most efficient approximating polynomial nor to find the most general form. This has already been investigated at length in the numerical analysis literature. Our area of investigation is the practicality of implementation: how can one make judicious use of ROMs, PLAs and large-size multipliers to produce efficient, practical realizations? How can one increase the speed of function evaluation by applying techniques well known to numerical analysts to the problem of hardware polynomial approximation, while at the same time taking into consideration the nature of the numbers, in this case, floating-point numbers in a computer?

First, floating-point numbers are not uniformly distributed, and one should make use of this fact in evaluating a function using these numbers. Secondly, what is important is not the speed of evaluation of f(x) over a given interval at a given point x but rather the statistical speed, that is, the global speed, the one that the user sees. The expected speed of evaluation in terms of the number of multiplications is the important question to consider.

Given a function f(x) to be approximated in [a,b], the steps taken to obtain an approximation are the following:

(1) Reduce the interval of definition of the function to a more suitable interval like [1/β, 1] for a base-β number representation system. This step is always included in any function evaluation algorithm and is sometimes called "normalization" (a minimal sketch of this reduction follows this list).

(2) Subdivide the basic interval into subintervals such that the function can be approximated by a low-degree polynomial; one that is lower in degree than that of polynomials used in software function generation, usually of degree 10 to 22. This subdivision can be done after:
a. choosing the type of polynomial approximant
b. choosing the criterion for "goodness of fit," a measure criterion.
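For step (1), a minimal sketch of normalization for β = 2 (in Python, for illustration; math.frexp exposes the mantissa/exponent split directly, and the reassembly rule shown is the one for f(x) = 1/x):

    import math

    def normalize(x):
        """Write x = m * 2**e with m in [0.5, 1)."""
        m, e = math.frexp(x)
        return m, e

    x = 23.0
    m, e = normalize(x)            # 23 = 0.71875 * 2**5
    inv = (1.0 / m) * 2**-e        # 1/x = (1/m) * 2**-e: only 1/m is tabulated
    print(m, e, inv, 1 / x)

In hardware the same step is free: the mantissa of a normalized floating-point argument is already in [1/β, 1), and only the exponent arithmetic changes from function to function.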
In the next section, the general method of piecewise polynomial approximation is presented in the light of its potential use for hardware implementation with ROMs or PLAs. The logarithmic distribution of mantissas of floating-point numbers is then reviewed, and the notation used throughout this work is presented for clarity of exposition.

2.1 Choice of a Criterion for Goodness of Fit

We will use, implicitly, the fundamental theorem of function approximation. Given a continuous function f(x) on I = [-1, 1] and a (Riemann) integrable weight function w(x) which is positive on I (except, possibly, at a finite set of points of I at which w(x) = 0), there exists a unique polynomial P_μ(x), of degree μ, which minimizes

    ||f - P_μ||^2 = \int_{-1}^{1} (f(x) - P_μ(x))^2 w(x) dx.

2.2 Colloquation Points

The polynomial which agrees with f(x) at the μ+1 nodes x_0, x_1, ..., x_μ is called the Lagrange interpolating polynomial. It is easily obtained if the nodes are known a priori. It is, in fact, much easier to find the interpolating polynomial, given the nodes, than to find the polynomial (and nodes) given one of the previous error criteria.

The error term for the Lagrange interpolating polynomial has been shown to be of the form

    π_μ(x) = f(x) - P_μ(x) = \prod_{k=0}^{μ} (x - x_k) · f^{(μ+1)}(ξ) / (μ+1)!        (2.8)

where ξ is located in the smallest interval containing the points x_0, x_1, x_2, ..., x_μ, x.

[Figure 2.1: Colloquation points]

2.3 Quasi-Optimal Polynomial Approximation

Equation (2.8) cannot be used to calculate the exact value of the error f - P_μ, since ξ as a function of x is, in general, not known. Because there is nothing one can do about f^{(μ+1)}(ξ), the only effective way of minimizing the error is to minimize the maximum magnitude of the (μ+1)st degree polynomial \prod_{k=0}^{μ} (x - x_k). The derivative f^{(μ+1)}(ξ) is treated as though it were constant. Then the choice of the nodes x_0, x_1, ..., x_μ can be made, using the criterion for goodness of fit defined earlier, by taking π(x) = B \prod_{k=0}^{μ} (x - x_k), where B is a constant. This dictates the position of the nodes. For example, it is known that if w(x) = 1/\sqrt{1 - x^2} and the criterion is (2.4), then the nodes that minimize the error criterion are the roots of the (μ+1)st degree Chebyshev polynomial [RIV74], [LUT65]. This technique is sometimes used for its simplicity and ease of computation, and it is the one which will be used here.

2.4 Piecewise Polynomial Approximation

Usually, when the function f is approximated over a large interval [a,b] by a single polynomial for the entire interval, the degree of the polynomial must be high if reasonable precision is required. In addition, there may also be some undesirable oscillatory properties. Separate approximations over subintervals smaller than [a,b] can improve the approximation to the function by:

a. requiring fewer terms of the polynomial approximant
b. adding flexibility to the form of approximants for each subinterval.

If the interval [a,b] is divided into subintervals I_1, I_2, ..., I_S, and if f(x) is approximated by a function F(x) of the form

    F(x) = P_{μ_1}^{(1)}(x)   for x in I_1
           P_{μ_i}^{(i)}(x)   for x in I_i
           P_{μ_S}^{(S)}(x)   for x in I_S

where each P_{μ_i}^{(i)} is a polynomial of degree μ_i, then F is called an S-fold segmented approximation to f. P_{μ_i}^{(i)}(x) may be taken to be the best approximation to f on I_i. Then the deviation ||f - F|| (where ||·|| is one of the norms) depends only on the subinterval breakpoints which define the subintervals I_i.
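A minimal numerical sketch of the two ideas just introduced (in Python, for illustration; the interval count S and degree m are arbitrary choices): each piece interpolates f at Chebyshev nodes (Section 2.3), and the pieces are glued into an S-fold segmented approximation (Section 2.4).

    import math

    def cheb_nodes(a, b, m):
        """m+1 Chebyshev nodes mapped onto [a, b]."""
        return [(a + b) / 2 + (b - a) / 2
                * math.cos((2 * k + 1) * math.pi / (2 * m + 2))
                for k in range(m + 1)]

    def lagrange(nodes, values, x):
        """Evaluate the Lagrange interpolating polynomial through the nodes."""
        total = 0.0
        for j, (xj, yj) in enumerate(zip(nodes, values)):
            w = 1.0
            for k, xk in enumerate(nodes):
                if k != j:
                    w *= (x - xk) / (xj - xk)
            total += yj * w
        return total

    f = lambda x: 1.0 / x
    S, m = 16, 2                                  # 16 subintervals, degree-2 pieces
    breaks = [0.5 + 0.5 * i / S for i in range(S + 1)]
    pieces = [cheb_nodes(a, b, m) for a, b in zip(breaks, breaks[1:])]

    def F(x):                                      # the segmented approximant
        i = min(int((x - 0.5) * 2 * S), S - 1)     # locate the subinterval of x
        nodes = pieces[i]
        return lagrange(nodes, [f(t) for t in nodes], x)

    xs = [0.5 + 0.5 * i / 10000 for i in range(10001)]
    print(max(abs(F(x) - f(x)) for x in xs))       # max deviation over [0.5, 1]

Doubling S or raising m trades memory (more stored coefficients) against accuracy, which is exactly the trade-off studied in the rest of this work.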
When the routines for approximating f(x) are implemented in software, two major drawbacks arise:

a. If the degree of the polynomial in each subinterval must be low (for fast computation), segmented approximation will frequently require many words of memory for storing the coefficients of the polynomials.

b. Given the argument x of f(x), it is not obvious how to determine the interval in which x lies. This is the problem of address computation, that is, location of the coefficients of the appropriate polynomial.

However, when the routine is implemented in hardware, one can make use of LSI to solve both a and b. Storing a thousand numbers in ROM is neither impossible nor impractical; 16K and larger Read Only Memories are common. 16K words of ROM easily fit on a small circuit card, but this number of words has traditionally been considered to be a large amount of Read Only Memory. Four thousand 24-bit coefficients easily fit side by side within the width of this page. As a consequence, it is not unrealistic to consider the possibility of hardware implementation for high-speed approximation of functions using piecewise low-degree polynomials. The questions to be considered are:

- the choice of the subintervals which will maximize the speed and minimize the amount of ROM;
- the trade-offs between the speed and the memory requirements;
- the best architecture for efficient function generation.

Next, let us describe briefly a property of the distribution of mantissas of floating-point numbers which we will make use of in later sections.

2.5 The Logarithmic Distribution of Leading Digits in Floating-Point Numbers

A number of papers have been written discussing the distribution of the most significant digits in tables of physical constants. The logarithmic law of distribution is derived using various approaches. Pinkham and Raimi derive the logarithmic law by imposing certain assumptions on a statistical model. Benford and Flehinger [FLE66] find the law by analyzing the distribution of the leading digits of the set of integers. Hamming [HAM70] gives a very clear account of the statistical properties of the mantissas of floating-point numbers and shows how the arithmetic operations of a computer transform various distributions toward the limiting reciprocal distribution

    r(x) = 1 / (x ln β)

where β is the base of the number system. He also shows that for both multiplication and division, if one of the factors comes from the reciprocal distribution, the result also belongs to the reciprocal distribution, independent of the distribution of the other factor. In other words, the reciprocal distribution persists under multiplication and division and cannot be changed by choosing any of the other factors. Some examples are given, such as the effect on the representation error of numbers in base 2 and base 16. Several other papers dealing with the study of roundoff errors [KAN73], [TSA74] include the logarithmic law as a weighting function in various measure functions.

This logarithmic distribution will be used in the remainder of this work. As will be seen, the position of the joints over the range [1/β, 1] is influenced when taking this law into account.
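The persistence property just cited from Hamming is easy to observe numerically. A minimal Monte Carlo sketch (in Python, for illustration; sample sizes are arbitrary), for β = 2, where the mantissas live in [0.5, 1) and the reciprocal law gives P(m ≤ t) = log2(2t):

    import math, random

    def mantissa(x):
        return math.frexp(x)[0]            # m in [0.5, 1)

    random.seed(1)
    products = []
    for _ in range(100000):
        p = 1.0
        for _ in range(10):                # product of 10 uniform mantissas
            p *= random.uniform(0.5, 1.0)
        products.append(mantissa(p))

    # Compare the empirical CDF with the reciprocal law P(m <= t) = log2(2t).
    for t in (0.6, 0.7, 0.8, 0.9):
        emp = sum(m <= t for m in products) / len(products)
        print(t, round(emp, 3), round(math.log2(2 * t), 3))

Even though the factors are uniformly distributed, the mantissas of the products already follow the reciprocal density closely, which is why this density is the natural weight for the averages computed in Section 3.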
2.6 Notation Used

The notation used in the remainder of the discussion is explained below and illustrated in Figure 2.2. The breakpoints are denoted by x_i, where i may range from 1 to S. (This breakpoint index is often implicit in later sections.) The polynomial approximant for the ith interval [x_i, x_{i+1}] is denoted P_{μ_i}^{(i)}(x) and is of degree μ_i.

It will be convenient later to divide the overall interval of interest into partitions within which the degrees of the approximating polynomials are the same. These partitions are denoted by [X_j, X_{j+1}], where the X_j are called joints. There are s such intervals. Since μ_i is unchanged in [X_j, X_{j+1}], it is convenient to denote it by m_j. This leads to the notation P_{m_j}^{(i)}(x), with i ranging from 1 to S and j ranging from 1 to s. To summarize:

- The partition joints are denoted by X_j (capital X).
- The polynomial approximants are P_{m_j}^{(i)}(x) between joints X_j and X_{j+1}, i.e., within a given partition; m_j is the degree of this polynomial in [X_j, X_{j+1}].
- There are s partitions and s+1 joints.
- It is understood that there is a different polynomial P_{m_j}^{(i)}(x) for each breakpoint interval [x_i, x_{i+1}]. The index i is implicit when we refer to m_j.
- C(f, P_{m_j}, x_i, x_{i+1}) is the "goodness of fit" criterion as defined earlier.
- β is the base of the number representation.

[Figure 2.2: Notation used throughout the discussion]

3. ARCHITECTURE FOR HIGH-SPEED FUNCTION GENERATION: FOUR CLASSES

In this section, we will introduce one of the main ideas in this discussion, the concept of the minimum mean running time approximation. This will yield a non-linear programming problem that we will attempt to solve in the next sections. Before setting up the general non-linear programming problem, it will be necessary to have some way of computing the number of breakpoints for a given criterion C(f, P_m, x_1, x_2) and ε_0 over a given range [a, b]. This is discussed in Section 3.3.2. First, four schemes for piecewise function approximation are presented. Given an argument x represented as a normalized floating-point number, and assuming that the range [a, b] has been reduced to the interval [1/β, 1], one must answer the following questions:

a. How should one choose the breakpoint locations for the piecewise polynomial approximation?
b. How should one choose the degrees for each one of these piecewise polynomials?

The four choices are described below.

3.1 Uniform Interval Length Using Uniform Degree Polynomials (UIUD)

The first reasonable choice is a uniform mesh of breakpoints with a fixed polynomial degree over the whole interval range [1/β, 1]. The width of each interval will be called h, with breakpoints x_i:

    h = x_{i+1} - x_i.

It is natural to choose an interval length such that h = 2^{-ℓ}, where ℓ will depend on a trade-off between speed and memory size. This choice is natural because the value of x_i such that

    x_i ≤ x < x_i + 2^{-ℓ}

is extracted from the argument x itself, simply by taking the ℓ most significant bits of the normalized argument x (Fig. 3.1). Consequently, the address of the coefficients of the ith polynomial approximating f(x) between x_i and x_i + 2^{-ℓ} is represented by those ℓ most significant bits. Moreover, if the base β is 2 then, knowing that the first bit is a 1, only (ℓ - 1) bits are necessary for generating the address of the ith polynomial. When the intervals have a uniform length h = 2^{-ℓ}, the position of the ith interval can thus be extracted from x directly.
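A minimal sketch of this address extraction (in Python, for illustration; ℓ and the mantissa width are assumed parameters):

    L = 6                                    # interval width h = 2**-6
    FRAC_BITS = 24                           # assumed mantissa width

    def uiud_address(x):
        """ROM address of the polynomial piece for x in [0.5, 1), beta = 2."""
        bits = int(x * 2**FRAC_BITS)         # fixed-point view: 0.1 b2 b3 ... b24
        # keep the top L bits, then drop the leading 1 (always set for beta = 2)
        return (bits >> (FRAC_BITS - L)) & ((1 << (L - 1)) - 1)

    print(uiud_address(0.5))                 # first interval  -> address 0
    print(uiud_address(0.734))               # 0.101110... -> address 01110 = 14
    print(uiud_address(1 - 2**-FRAC_BITS))   # last interval -> address 2**(L-1) - 1

No decoding logic at all is needed: the address is a bit field of the argument, which is the defining advantage of the UIUD case.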
[Figure 3.1: Uniform interval length case (breakpoints 1, 2, 3, ..., i, i+1)]

The design procedure for this case, where the interval size is uniform as well as the degree of the polynomial, is flow-charted in Fig. 3.2. This case will be referred to as the UIUD (Uniform Interval, Uniform Degree) case.

[Figure 3.2: UIUD design procedure (flowchart: initialize the number of multiplications m = 0; increment m; determine the largest interval size h satisfying C(f, P_m, x_i, x_i + h) = ε_0; if the resulting ROM is too big, trade off speed for one more multiplication)]

3.2 Variable Interval Length Using Uniform Degree Polynomials (VIUD)

It should be realized that the value of C(f, P_m, x_i, x_{i+1}) in the previous scheme will generally not remain constant for all x_i in [1/β, 1]. The only thing that the above procedure assures is the choice of h for the worst case (which depends on the function f). As an example, the function 1/x for x between .5 and 1, with h = 2^{-ℓ} and a Lagrange polynomial of degree 1 in each interval, shows (Fig. 3.3) that the criterion value changes by a factor of 10 over the interval. For a polynomial of higher degree the change would be even greater. Consequently, it seems that some ROM can be saved by having variable interval lengths h_i. Given x_i, the following non-trivial equation must be solved recursively for h_i and P_m(x, h_i):

    \int_{x_i}^{x_i + h_i} (f(x) - P_m(x, h_i))^2 w(x) dx = ε_0^2        (3.1)

where the notation P_m(x, h_i) is used to indicate that the choice of P_m(x) depends on the breakpoint spacing h_i, with the initial condition

    \int_{x_1}^{x_1 + h_1} (f(x) - P_m(x, h_1))^2 w(x) dx = ε_0^2.

[Figure 3.3: Illustration of C(f, P_m, x_i, x_i + h) as a function of x_i for f = 1/x]

The problem created by this procedure at the hardware level is in the address generation. Because of the variable intervals, the address can no longer be extracted from the argument x directly. A ROM or PLA is needed in order to provide the correspondence between the ℓ (or ℓ - 1) most significant bits of x and the ROM address of the coefficients. The design procedure for this case is outlined in Fig. 3.4. Table 3.1 in Section 3.3.2 shows that for f = 1/x and P_m(x) of degree 1, the savings in the number of ROM words approaches 40%. Thus the address decoding memory is worthwhile.

[Figure 3.4: VIUD design procedure (flowchart: initialize the number of multiplications m = 0; compute the number of intervals N_0 by solving C(f(x), P_m, x_i, h_i) = ε_0 for each h_i; if N_0 is too big, trade off speed — one more multiplication; otherwise the hardware algorithm is found, with m multiplications, N_0 ROM words, and precision ε_0; compute the coefficients and addresses and store them in ROMs)]
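A minimal sketch of the VIUD decode just described (in Python, for illustration): the ℓ - 1 leading fraction bits index a small decode ROM, which maps each uniform 2^{-ℓ} cell to the (fewer) variable-width intervals actually stored. The breakpoints below are hypothetical, quantized to the 2^{-ℓ} grid so that every uniform cell falls inside exactly one interval (this is the finite-length look-up address effect studied in Section 5.2).

    L = 6
    # Hypothetical widening breakpoints on the 2**-6 grid (not computed
    # from (3.1)); widths 2,2,3,3,4,4,4,5,5 sixty-fourths.
    breakpoints = [k / 64 for k in (32, 34, 36, 39, 42, 46, 50, 54, 59, 64)]

    # Build the decode ROM: cell k covers [0.5 + k*2**-L, 0.5 + (k+1)*2**-L).
    decode_rom = []
    for k in range(2 ** (L - 1)):
        cell_start = 0.5 + k * 2.0**-L
        i = max(j for j, b in enumerate(breakpoints[:-1]) if b <= cell_start)
        decode_rom.append(i)                 # index of the coefficient set

    def viud_address(x):
        k = int((x - 0.5) * 2**L)            # the L - 1 leading fraction bits
        return decode_rom[k]

    print(viud_address(0.51), viud_address(0.76), viud_address(0.99))

Here 32 one-word decode entries replace direct addressing into 32 coefficient sets by indirection into only 9, which is the ~40% saving cited above, at the cost of one extra (small, fast) memory access.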
If one computes f (x) 10,000 times, what is the "average" speed? Perhaps we can implement the function generator over one-half of the interval with polynomials of degree 1 and over the remaining half with polynomials of degree 2 (see Fig. 3.5). Assuming a uniform distribu- tion of the arguments x, on the average the number of multiplications necessary to evaluate our function is 1.5. 41 f (x) AVERAGE NUMBER OF MULTIPLICATIONS IS 1.3 ONE MULTIPLICATION 4 -». TWO MULTIPLICATIONS a=X. b=X. Figure 3.5 Variable Degree Polynomials with either Uniform or Variable Intervals (UIVD and VIVD) . 42 Sometimes, x falls in the 2 multiplication range while at other times only 1 multiplication will be necessary. More- over, when we consider the non-uniform distribution of the arguments, with smaller values of x more likely, x will certainly fall more often in the first half of the range so that, in fact, the average number of multiplications may be closer to 1.3. Thus, the distribution of the argu- ments x of the function to be evaluated will affect the choice of the partitions and, quoting Hamming in his famous paper on the distribution of numbers [HAM7 0] : "it is easy to see as a general rule that when we try to optimize a library routine for minimum mean running time (as against the Chebyshev minimax run time) we need to consider the distribution of the input data. Hence floating point numerical routines need to consider the reciprocal distribu- tion: the square root, log, exponential, and sine are all examples. In the case of the exponential and sine, some study of the exponents is also necessary." The choice of the partitions is the subject of the next section. 3.3.1 The Minimum Mean Running Time Approximation We have seen earlier (Section 2.5) that not all values of the arguments x are equiprobable. This means that, on the average, more time is spent evaluating f near x = 1/3 than near x = 1. Why not, then, use a low degree polynomial where the probability of finding x is 43 low? In other words, the fact that values of the argument x around 1 are less probable than around 1/3 can be utilized by computing the function using, for example, the follow- ing strategy: fast (1 multiplication, P (x) is of degree 1) where the number x occurs often (around 1/8) • less fast: (2 multiplications, P m (x) of degree 2, and larger intervals) where x occurs less frequently (somewhere between 1/8 and 1) . slowly (3 multiplications, P m (x) of degree 3 and even larger intervals) where x is not found too often (around 1) . One would then consider the average number of multi- plications over the entire interval. The problem consists of minimizing the mean running time depending on the posi- tions of the "joints" (the X.'s in Fig. 2.2). As an example, assume that there are four joints, X.. , X 2 , X , and X. , and that the degrees of the polynomials in the 3 intervals are 1, 2, 3. The number of breakpoint intervals in [X., X-,-.] is denoted as N(X., X . , . ) and the number of multiplications j D+l necessary to evaluate f at x, xefX., X.,,], is m.. The a. v. *. j * j+1 3 number of multiplications needed at any point x is denoted M(x) . 44 The first condition that must be satisfied is to minimize the average number of multiplications M over the 2 c av whole interval. mm M = / — M(x)dx av X. mm 1 In 'X. X. + 2 In + 3 In X X. (3.2) The second constraint is to keep the number of ROM words under some maximum, Rp . 
The second constraint is to keep the number of ROM words under some maximum, R_0. This is stated as:

    2 N(X_1, X_2) + 3 N(X_2, X_3) + 4 N(X_3, X_4) ≤ R_0        (3.3)

(3.2) can be rewritten as:

    min (X_2/X_1) (X_3/X_2)^2 (X_4/X_3)^3

which, since X_1 and X_4 are fixed, becomes:

    max (X_1 · X_2 · X_3 · X_4)        (3.4)

Condition (3.3) is non-linear for the case of non-uniform intervals and depends on the precision desired and the behavior of f(x). However, if we consider the case of uniform interval breakpoints, condition (3.3) becomes linear. (3.3) and (3.4) form a Non-Linear Programming problem whose solution is the vector of joint positions X_2, ..., X_s (X_1 and X_{s+1} are the fixed interval boundaries).

The solution of this problem is the optimum position of the X_j's for the lowest average number of multiplications with a constraint on the number of ROM words available. This type of approximation will be referred to as the minimum mean running time or MMRT approximation. It is quite different from the traditional Chebyshev minimum-maximum (minimax) running time approximation used in software numerical routines. We must now answer two questions:

1. How can we compute the number of breakpoints, N(X_j, X_{j+1}), between two joints?
2. How can we solve the non-linear programming problem stated above?

The next section is devoted to the search for an algorithm for finding the number of breakpoints in the interval [X_j, X_{j+1}]. Knowing the degree of the polynomial approximating f(x) between X_j and X_{j+1}, the number of ROM words needed for the piecewise approximation of f(x) for x between the two joints can be derived. If we assume that the polynomial is of degree m_j, and that it is neither even nor odd and that no coefficient is null, then the number of ROM words is N(X_j, X_{j+1})(m_j + 1), since the number of coefficients required is one greater than the degree of the polynomial in that interval. The number of multiplications between X_j and X_{j+1} is m_j, the degree of the polynomial.* We assume that the criterion for "goodness of fit" is the RMS criterion defined earlier.

3.3.2 Algorithm for Finding the Number of ROM Words

Choosing the RMS norm for C(f, P_m, x_i, x_{i+1}) requires the solution of the following equation for h:**

    \int_{x_i}^{x_i + h} (f(x) - P_m(x_i, h))^2 w(x) dx = ε_0^2        (3.5)

----
* If a parallel polynomial evaluation scheme is used, then the number of multiplications is bounded by 2 log(m_j), and condition (3.2) changes due to the presence of 2 log(m_j) instead of m_j. This will affect the position of the joints. However, we assume that we evaluate P_{m_j}(x) serially with one processor, using Horner's rule. The more general case is left for further study.
** It is understood that h = h(x_i). That is, each interval [x_i, x_i + h] may have a different h. Also, P_m(x, h) is, in this case, P_{m_j}(x, h_i). We are considering a breakpoint interval somewhere between two joints.
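The serial evaluation assumed in the first footnote is Horner's rule: a degree-m piece costs exactly m multiplications and m additions. A minimal sketch (in Python, for illustration):

    def horner(coeffs, x):
        """Evaluate c[0] + c[1]*x + ... + c[m]*x**m with m multiplications."""
        acc = coeffs[-1]
        for c in reversed(coeffs[:-1]):
            acc = acc * x + c      # one multiply, one add per degree
        return acc

    print(horner([1.0, -1.0, 1.0], 0.3))   # 1 - 0.3 + 0.09 = 0.79

This is why M(x) in (3.2) is simply the degree m_j of the partition that x falls in.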
, and x.: x i+1 = G(x A ) An approximate solution would give us an idea of the ad- vantage of having variable interval lengths versus uniform interval lengths, in terms of ROM size. Equation (3.5) is not in an easily usable form. First recall the rela- * tion: , i x ,. (m+1) , „ > m (x-x ) f ' (£• ) f(x)-P (x. ,h)= n — m x k=0 (m+1) ! (i) Again, P (x,h) 5 P' '(x,h.) and we are considering the interval [xi, x i+1 ] . j x 48 where f is the (m+l)st derivative of f and the x, ' s are the roots of the Chebyshev polynomial of degree m be- tween x. and x - +1 « (We have limited ourselves to Chebyshev approximating polynomials . ) £ . is anywhere between x_ and x (the first and 1 J m last polynomial roots) but also depends on x. We will now make the approximation f (m+l) Ui) ^ f (m+l) (Xi) i — i Equation (3.5) can then be written: f (m+1) (x.) 1 1 2 - x.+h (m+1) ! * m 1 ? 9 n (x-xn\(x)dx = €Z k=0 K ° X. 1 (3.7) If f is monotonically increasing, equation (3.7) will give us a lower bound on the number of intervals. A mono- tonically decreasing equation (3.7) will yield an upper bound on the number of breakpoints. In the discussion that follows, we assume that the functions we are considering are "well-behaved" so that these assumptions apply. A change of variable will further simplify equation (3.7). Let us call r the new variable which maps interval [x., x. ,] onto the interval [0,1]: 49 x-x . r(x) = r = — r — or x = x . + rh h 1 Then (3.7) becomes: h 2m+3 f (m+1) (x.) (m+1) ! 2 J m S t (r-a )w(x.+rh) dr = c (3.8) where a, 's are the roots of the normalized Chebyshev poly- nomial of degree m. For example, if m = 2: 1 + rr i - rr a i = 2 , _ 2 — and a = and in general: "k = T [cos(l+2(k-l). — )+l] Notice that if w(x) = 1, equation (3.8) has the form: 2m+3 r£ (m+l) ,„ * ■• 2 2 i )J K in h*"" "[f »""*' (x,)] K_=€ (3.9) where K is the integral: m trf^ ) 2 dr (3.10) [ m+1) !] 50 which does not depend on x . or h but only on the type of polynomial approximant. If w(x) is not equal to 1, we can make, for h small enough, the approximation: w(x. + rh) - w(x.)» and equation (3.8) can be expanded and will have terms in h , h , etc., but is more difficult to solve analytically. We will therefore limit this study to the first term of the equation. This leads to the final form: 2m+3 f (m+ ¥x< > i2 (m+1) ! w(x i ) 1 m 2 ? k=0 k (3.11) or , 2m+3 , _(m+l) , >,2 . .„ 2 h [f (x.) ] w(x. )K =€ l l m (3.12) Solving for h: x. , , -x. =h= l+l i K r4r (m+l) , n 2 , . m [f (x i ) ] w(x i ) 2m+3 or x. . , -x . = l+l l m - (m+1) , n 2 , « f (x i ) ] w(x i ) 2m+3 (3.13) 51 This relation, of the form: x- , n - x. = k g (x. ) l+l 1 m 3 l can be represented pictorially as shown in Fig. 3.6, 3.7, and 3.8 for three different types of functions g(x). Starting from x. we obtain x l = X + k m g(x ) X 2 = x l + k m g(x l } x . , . =x. + k g(x.) l+l i m 3 i etc., until the end of the interval is reached. If the number of breakpoints S is large enough, one can derive an expression for the ratio of the number of breakpoints in the non-uniform interval case to the number of breakpoints in the uniform interval case. Relation (3.13) shows that if the magnitude of f (x.) increases with x., the smallest interval would i i be [x_, x ]. (See Fig. 3.6.) In the uniform interval case, all the intervals would have to be chosen equal to [x A , x.] to guarantee I If - P I < e . 01 ' ■ m ' ' — o 52 Figure 3.6 Number of Breakpoints in the Non-Uniform Case 53 If the magnitude of f (x.) 
decreases with x., the smallest interval would be the last one (right most) [x-/ x q+i^ = £ x q' x, +i-' anc * * ts ma 9 n i fcude would be k g(x ) - k g(x.+ ,). (See Fig. 3.7). We will thus assume that the smallest breakpoint interval is at either extre- mity of the partition f X . , X. +1 ] . Let us assume then that the smallest interval is [X. , x^ = [x Q , x ± ] . The number of breakpoints in [x_, x] will be given by the function N(x Q , x) . From x. to x. , N(x Q , x) increases by one. We then have: N(X ,X. +1 )-N(X ,X.) 1 • 1 x i + r x i x i+r x i V If (x./ x. ,) is small enough, this is close to dN dx k g(xi ) x=x^ m* ^*- and therefore N(x Q ,x)= -/ x du k m i 9(u) x 54 N(X. ,£ ) Figure 3.7 Number of Breakpoints in the Non-Uniform Case; Different Behavior of N(x ,x) 55 g(x) NON MONOTONIC x. + k g( X .) 1 m ^ 1 N ( X £ ) Figure 3.8 Number of Breakpoints in the Non-Uniform Case; Another Behavior of N(x Q ,x) 56 x N(x Q ,x)= r r: (ni+1) , , ,2 , . [f (u) ] w(u) 2m+3 du m x The number of breakpoints in the interval [x , x] for the uni- form case is: x - x o N (x f x) = i — u o ' k g (x ) nr 3 o or, in the non^-uniform case N (x Q , x) = ^ L [ A : m J y { * : x. o The ratio of the non-uniform to uniform case is: IK-.*.*^ 9(x °> N u (x ,x) x - X o X. X J 9(4 Similarly, the inverse is: l/(x A ,x) = _ x - X n (x ,x) g( x i) / u g(u) For the whole partition these become: * J S X . -X.J Vi-Vx. g(x) 57 and tf(X X >= g(X. ) r dx X j + 1 -x./ x g(x) -l where g(x) = — -.a 1 [f (m+1) (x)? w P9i 2m+3 For the interval [a,b] , \J be comes: V(a,b) §i£ ) / b-a g(x) d Note that this ratio, although a first order approximation, does not depend on ; the value of e o the type of polynomial (Chebyshev, Legendre . . .) but depends on: the behavior of f(x) and its (m+l)st derivative f (m+1) (x). - the degree m of the polynomial approximant. Note also that N(a,b) = - N(b,a) and that N(a,b) = N(a,c) + N(c,b) . 58 Example : Using f (x) = 1/x as an example with m = 1, Vl 3 1 XT\ 5 D + l D ^ X * / J dx With X. = .5 and X.,, = 1, one obtains D D+l (.5, 1) = .603 The results obtained by computer program are in good agreement with this value for intervals less than 2 . See Table 3.1. In summary, we have found a way to compute the number of breakpoints in the interval [X., X. ,]. This expression is necessary for computing the NLP constraint * on the number of ROM words. In the next section we briefly consider the case where the non-linear programming problem can be simplified or linearized. in append The limit on r \J(x. f X ) when m ■+ °° is treated ix A. 3. 3 3 L 59 # of ROM Words # of ROM Words (.5,1) h Non-Uniform Uniform Ratio Interval Length Interval Length 2" 4 6 8 .75 2" 5 11 16 .69 2" 6 20 32 .625 2" 7 40 64 .625 2- 8 78 128 .61 2" 9 156 256 .61 2 -io 311 512 .609 2 -ll 621 1024 .606 2" 12 1241 2048 .606 For f(x) = 1/x between .5 and 1. P m is a Chebyshev polynomial of degree 1. Table 3.1 Number of ROM Words for Uniform and Non-Uniform Intervals 60 3.3.3 Linearization of the Non-Linear Programming Problem As defined in Section 3.3.1 our problem is to solve for the (s-1) joints, X 2 , . . . , X : Minimize . . with X, - a = -r J- p and X , , = b = 1. s+1 subject to the constraint E (m j +l)N ( x j' x J+ i>< R ( j=i where we compute N (X., X. ,) as explained in the last section. 
It is possible to linearize this NLP as follows: If the distribution of mantissas of floating point numbers is assumed to be uniform in [1/3/ 1] then, mini- mizing the average number of multiplications ?S+1 M = A" 1 M(x)dx av J p X l 61 becomes: in | £(m. + l) tt j+ x-ty) mi D= which is a linear objective function. To linearize the constraint, one can assume a UIVD realization and within each interval [X., X. ,] the number of breakpoints is X.^-X. N( x.,x j+1 )= h. where h. is the minimum breakpoint width guaranteeing a precision e over the whole interval [1/3/ 1] when using approximating polynomials of degree m.. The constraint: S /X.J.T-X. is linear. The other constraints: X. - X. ,- '< j = 1, . . . , s 3 3+1- are also linear. The problem has now been transformed into a linear programming problem of the form: 62 minimize C X subject to A X < b where X is the (s-1) dimensional column vector X — (X~ / . « • | X- I • • • / ^e' and C is the (s-1) dimensional column vector: C ■ lU»vf . • •/ Lj / • • • f ^-„/ with c, = (m. - m.) j = 2, . .., s d 3-1 2 A is the (s) x (s-1) matrix: A = a 2 a 3 (s-1). Note that A has an interesting diagonal form and that Ax=b can be solved without actually inverting A. 63 where the elements m. ,+1 m.+l xT a .= 3 h. , h. 3-1 3 and b is the s-dimensional column vector: V S x i V The form of this problem can be put into standard linear program form by the use of slack variables and the answers would be the positions of the joints X., j=2, ..., s. This linearization may be of use for cases involving a large number of joints but we have limited ourselves to 2 joints and the NLP problem will be solved without lineariza- tion. Quadratic case For the quadratic case, the function to be minimized can be approximated by a quadratic function of the joint locations: y m s j=l\ X j / i=l i^j 3 = 1 64 The problem is then transformed into a quadratic problem. A method like the steepest descent could be suc- cessfully applied to search for a minimum. 65 4. THE SOLUTION OF THE NON-LINEAR PROGRAMMING PROBLEM FOR f (x) = 1/x The non-linear programming problem to be solved (as described in Section 3) falls into the general class known as constrained, medium-scale problems. For our purposes, the number of internal joints has been limited to one and two. A computer solution is essential in cases like this if easily interpretable results are to be obtained. In addition, this choice of solution was motivated by the exis- tence of reliable software for solving the general minimiza- tion problem. A large number of different algorithms are in fairly common use but the programs implementing these algorithms are not readily available and have generally been used on a limited basis only. Since our goal was not to write a program for solving the general problem of con- strained minimization, some other method had to be found which alleviated writing an extensive and very involved software package. Fortunately, there exists reliable and readily accessible software for computing the unconstrained minimum of a function of N variables. Three FORTRAN pro- grams that are readily available are: 66 FMFP - (IBM Scientific Subroutine Package) This subroutine finds a local minimum of a function of several variables by the method of Fletcher and Powell [IBM68] . FMCG - (IBM Scientific Subroutine Package) This subroutine finds a local minimum of a function of several variables by the method of conjugate gradients [IBM68] . 
ZXMIN - (International Mathematical Software Library) This subroutine determines the minimum of a function of N variables using a quasi-Newton method. In order to simplify the problem, we were led to con- sider methods that replace, or approximate, constrained opti- mization problems by unconstrained problems. Such methods exist, are readily available, and are known as Penalty and Barrier function methods. These methods are discussed be- low and used, ultimately, to solve the optimization problem. 4.1 Penalty Function Method The basic idea of Penalty/Barrier methods is to add to the objective function a term that prescribes a high cost 67 for violation of the constraints (Penalty Method) or to add a term that favors joints interior to the feasible region over those near the boundary (Barrier Method) . This high cost is represented by a parameter y that weights the size of the Penalty or Barrier. When y ->■ «> the approximation to the original problem becomes more and more accurate and tends toward the true solution. These methods are appealing because: They are comparatively simple and straightforward to implement. The software exists and consists of a routine to find the unconstrained minimum of a function of N variables. They possess a certain degree of generality. However, as the parameter y is increased so as to yield a good approximation to the constrained problem, the corresponding structure of the unconstrained problem be- comes increasingly unfavorable in that the convergence rate of the algorithms used to find the minimum decreases. A study of the location of the approximate solution (opti- mum) as a function of y has been made without knowing the exact behavior of these algorithms as a function of y. The programs used were implemented in FORTRAN. The study was limited to the one and two joint cases, that is, the one and two variable NLP problem. Only the Penalty Method was used. 68 The problem, then, may be stated as follows: minimize M(X) (objective function) subject to X cS (constraints) (4-1) where M is a continuous function on E and S is a constraint set in E . In our case, S is a set of inequalities des- cribing the relative positions of the joints X = (X., X. ... and the inequality concerning the number of available Read Only Memory words R fl . The penalty function method replaces * problem (4-1) by an unconstrained problem of the form: minimize Q (X, y) = M(X) + y P (X) where y is a positive parameter and the penalty function P is a function of the vector X and is defined as follows: (i) P is continuous P(X) (ii) P(X) >_ for all X e E n (iii) P (X) = if an only if X e S In other words, the new problem minimizes the objec- tive function only when X satisfies the constraints X £ S and minimizes the objective function plus a penalty when- ever X violates the set of constraints S. * The meanings of P, Q and S here are different from those used in Sections 2 and 3. 69 The problem, then, is to devise a penalty function P (X) that is continuous and positive but null whenever X is within S. The penalty function is weighted by the para- meter \i which is allowed to increase toward °° so as to approach the exact solution of the initial problem (4-1) . The procedure for solving problem (4-1) is as * follows: Let {^i,} i k = 1/2,... be a sequence tending to infinity such that for each k, y, ^ and \x .. > y. Then, for each k, solve the problem: minimize Q(u k , X R ) = M(X R ) + y k P(X k ) (4-2) obtaining a solution X,. It is assumed that a solution to (4-2) exists for every k. 
The following two lemmas [LUE72] lead to the global convergence of the approximate solution X, to the exact solutuion of the initial problem. Lemma 1 Q(M k ,X k )< Q(M k+r X k+1 ) P(x k ) >P(x k+1 ) M(X k ) < M(X k+1 ) where X, is a solution of (4-2) . * The use of k and y here is different from that of Section 2 . 70 Lemma 2 each k, Let X be a solution of problem (4-1) . Then for m(x ) > Q(/\,x k ) > F - Wl (4-5) A penalty function that satisfies properties (i) , (ii) , and (iii) and which is frequently used is: s+1 P(X)= T (max(0,g. (X))) 2 since if g . (X) £ for j = 1,2,.... s, then P (X) =0 but P (X) is always positive if g. (X) > for some j. Conse- quently, we must now implement the function Q(u, X) = JVI (X) + yP(X) a V in FORTRAN and then link it with an existing minimization routine in order to solve our general NLP problem. 72 4.2 Software Considerations The routine chosen to compute the minimum was ZXMIN of IMSL. The IMSL software is well supported unlike the IBM SSP routines. The storage requirements, although not critical for our application, are smaller for ZXMIN (the Hessian Matrix is stored in symmetric storage mode) . Moreover, the routine ZXMIN implements Newton's method which has the advantage of alleviating some of the slow convergence problems associated with the extremely unfavor- able eigenvalue structure that accompanies penalty methods The subroutine ZXMIN also evaluates the Hessian Matrix and the user is not required to supply formulas for its evaluation. A FORTRAN program has been written to a. evaluate the number of ROM words for a given joint configuration. b. determine the average number of multiplications using the 1/x In 3 probability density function. The above penalty function was used and the rou- tine ZXMIN was modified for this specific problem. The parameters that can be fixed by the user are e , the precision of the evaluation of f (x) equal vs unequal intervals number of joints, and the degree of polynomial in each partition. 73 The polynomials were assumed to be of the Chebyshev type and the colloquation points were computed accordingly. The weight function w(x) was implemented as a function subroutine and, for simplicity/ w(x) = 1 was chosen. The structure of the program is shown in Fig. 4.1. The program was run under the FORTRAN IV G compiler. For a listing of IMSL routines, refer to IMSL Corporation. Routine ZXMIN is not given in the appendix because it is copyrighted. The remainder of the program is listed in Appendix A.l. Here we will describe each subroutine from the bot- tom up along with a brief explanation of the meaning of the parameters. First recall that in the general case, we must solve for the length of each breakpoint interval [x., x. ,] within the partition [X., X. .] by solving the foil ing equation for h-p ow- / x.+h . i i (4-6) (f(x)-P (x,h )) 2 w(x)dx- 2 1 -Q X We have seen (Section 3.3.2) that the above equa- tion can be replaced by: 2m+3 f (m+1) (x.) (m+1) ?I (r-c* )w(x.+rh)dr - e 2 k=0 k i MAIN 74 ZXMIN I ZSRCH QTOMIN PENLTY NROM N FCT TEGRAL UTOH FCRIT 1 PI W Computer Sol ution; Subrou tine Structure 75 This avoids many rounding problems associated with the computation of the difference f (x) - P m (x, h) which can become very small. The following routines are required. (See Fig, 4.1 and Appendix A.l.) Function W(XV,NTYPE) Parameter XV is the x variable of equation (4-6) . Parameter NTYPE is the order of the derivative of w(x). 
If one wishes to use the relative error in equation (4-6) , simply replace w(x) by — —^ (f (x)r Function PI (XV, M) PI represents the Lagrangian polynomial: m. II (x - x k } k=0 (4-8) (m.+l) ! where the x, 's are the roots of the Chebyshev Polynomial of degree m. in the interval [x., x. . ,]. The roots of 3 3 l l+l the normalized polynomial of degree m. are referred to as a,. Parameter M is the degree of polynomial (4-8), m.. •K "1 76 Function FCT(XV,M) FCT represents function f(x) in (4-6). Parameter M represents the order of the derivative of function f. If M = 0, the function returns the value of f(XV). Otherwise the derivative is computed by the subroutine. The deri- vative function was hard-coded due to the simplicity of the functions used (— ,lnx, . . .) Function FCRIT (R, HU, INTKND, XI, M, PI, W) FCRIT is the product: m. r D n (r_a k } k=0 (m.+l) ! w(x.+rh) when INTKND=1 (See 4-7) or the product r m. 3 n (r-a k ) k=0 _ (m.+l) ! 3 r-w (x +rh) when INTKND = 2 i or m. n (r-a, ) k=0 K (m.+l) ! for INTKND = 3 77 M is, of course, m . while the jth partition is being con- sidered. These 3 functions appear in the solution of (4-7) using Newton's method. Parameter R: represents variable r and takes values between and 1 . Parameter HU: represents h, the width of [x . , x. , ,] e l l+l Parameter XI: represents x. Function TEGRAL (XL, XU, HU, INTKND, XI, M, PI, W) Function TEGRAL computes the integral of FCRIT summed over R from XL to XU. The evaluation is done by- means of a 24-Point Gauss Quadrature formula which inte- grates polynomials up to degree 47 exactly. Subroutine FSOL (U, F, DERF, XI, M, PI, FCT, W, ETA, EPSQ) The use of function FSOL can be explained as fol- lows: Since h and e are very small, it seemed more con- venient to rearrange equation (4-7) as: ,(m+l) , , ,2 r 1 2 (f x, ) ) / m e dr = ° [x.)T f L n> 2 -=3 — / n (r-a.) w(x.+ rh) J o k= * * h2m+3 Here m = mj . Then, the change of variables: h u = 78 m+2 e o was made and the functions FBIG and FSMALL were defined as: (f (m+1) (x.)) 2 f 1 m _ FBIG = ^ — / r (r-a t ) w(x. + rh)dr k=0 FSMALL = -=j u Then the equation: FBIG(u) - FMSMALL (u) = was solved for u using Newton's method: FBIG(u ) - FSMALL (u , ) n n-1 u = u . - n n-1 (FBlG(u ,) - FSMALL (u , )) \ n-1 n-1 Therefore the derivative of FBIG - FSMALL is needed: (f G / ^ 0< / - - / CD-) z f6x to / f- ■H Lz 4-1 L » i— i U -H r ° o rH / i •H 4J / ^ o rH 3 ro^ s / CD m 'rn£3 / U f Ten u 5 -ONHJ T»^ 0) i • 0) tn ■H fa 84 CO uj Q x cr i- o 2 3 u. w UJ UJ 5 W no U o g o « o Q) ro H •H fa 85 SOW UOB » o •H fd -p o « •H > w •a n s o « 4-1 o n W U i O O u in u •H En 87 •P c CD a* 0) > -H -P O 0) ■n JQ O tn -H -P C _ X 2 side of the diagonal. 2. A is higher than B which is higher than C: when XJ(2) increases the average number of multiplica- tions decreases. Of course, the number of words of ROM required increases but does not influence the curve as long as it is less than R~, the allowed maximum. The penalty starts applying when the number of ROM words exceeds a given value R» (around XJ(2) = .65 on Fig. 4.6). Before the penalty applies, the curves represent the average number of multiplications. The region marked M is where the minimum occurs. The minimum gives a. a value for XJ(2), b. a value for XJ(3), c. the average number of multiplications, and d. the number of ROM necessary, R^. 
From D, to A the penalty does not apply yet and the curve D, A represents the average number of multiplications which decreases when X., increases because this has the effect of replacing polynomials of degree rru by polynomials of lower degree m ? . The next curve D^B is below the curve D,A. More precisely, for equal X 3 ' s D 2 B is less than D-.A. This is 89 because X~ has increased and consequently more degree (m 2 ) polynomials have been traded for polynomials of lower degree m, and the average number of multiplications has diminished. Figure 4.7 can be considered as a combination of Fig. 4.2 (average number of multiplications) and Fig. 4.5 (number of ROM words) . By taking the "combination" (Fig. 4.2) + y * penalty (Fig. 4.2) one has a visual illustration of the origin of the surface representing Q. 4.3.1 Influence of y Figures 4.8 through 4.12 represent the influence of jj on the position of the minimum. y increases from to 10 and a value of y of the order 10000 was chosen for the runs without creating stability problems in solving the problem on the computer. Illustrated is the fact that y weights the penalty function and this contributes to the steepness of the "penalty wall." 4.3.2 Influence of the Maximum Number of ROM Words R Q By increasing the number of available Read Only Mem- ory words, one can intuitively expect the following: If R n is increased, more polynomials can be used. 90 to XJ fl o >1 +J H id c c W C o •H 4-> m u •H rH C^ •H -P -H P s u CD X) 1 P Q) > m o c o •H -U id c •H XI g o u M P tn •H U O S o « O 5-1 CD X) 91 * *'!"# UlUN3d HUM •OHOi T90 \V II 3. o o c QJ 3 H m c 00 •H 92 X LxJ Q-0£ xnaod hiw -ww "^ 4-1 o •H W CU •4-> M-i CD U c 0) H c LO CD u •H t>4 o o in « e e •H a •H S CD 45 +J 100 CD -p o c o •H -p •H w o fa x: 4-> c o fa w no O a o fa 4-1 o 5-1 QJ .Q 6 e •H fd S (1) -P m o u c sO 0) •H fa IN O fa e a e •H C •H a 101 5. RESULTS AND EXAMPLES 5.1 Average Number of Multiplications vs. Number of ROM Words: Optimal Curves The optimization program was run for various values of the maximum number of available Read Only Memory words R-. The cases chosen are listed in Table 5.1. Each point on these optimum curves was obtained as a solution of the non-linear programming problem as ex- plained in Section 4. The preliminary study of Section 4 made it possible to avoid having to use a grid search for the initialization of the program. Although the function to be minimized is not unimodal, starting with XJ(1) = XJ(2) = .5 insures the convergence to the minimum. This is due to the fact that the degrees of the polynomials increase from left to right. The number of points computed was limited by computer time and cost considerations. Solutions were sought having 3 significant digits. The IMSL mini- mization routine (ZXMIN) allows this choice. Depending on the specific case, the range of R n values is [25, 650] and the value of e is 2 o -24 102 TS CD P G CD en CD p a c •H (0 P .H 3 to Q) « T3 H CD ftf W •H D g w C CD >i CD -H P o tn 1 fa P w -P u c CD -H XI e ^ 9 S3 C •H — » P X u w C M-l 3 fa =L >i4J P C .h rd to P C w 0) fi fa u CD a, >i Eh OHinvDMXX^OHtN X)r--CX3CrirHr-|i-|^lrHrHr-|CNtN(N 0^C7>C7*tX>t7iC7iC7iCX>CT>C7>t7^CT>CT>t7>C7i0">tT>0"> •H -H -H -H -H -H -H -H -H -H -H -H -H -H -H -H -H -H [j-| p^| pM LJL4 p^| pM Pl-I P^4 (-M (-M [xh [x-I P^l Cl-I PM PM PM (-^4 HHHHMN(NOin>>>>>>>>>>>>>>>>> HHHHHHHHHI-II-II-IHHHI-IHH >>>>>>>>>>>>>>>>>> 103 5.1.1 One Joint Results Figure 5.1 shows clearly the origin of each point on the optimal curves. 
In general, when the number of ROM words available increases, the average number of multipli- cations decreases. The shape of the curve depends on the degrees of the polynomials chosen. In all cases, if the polynomial degrees are iru, iru, m~ , where m. < nu < m_ , the average number of multiplications decreases between m, and iru . In some cases, the curve levels off above a given number of ROM words (for example, polynomial degrees 2, 3, Fig. 5.6). This is due to the fact that penalty does not apply and we have a degenerate case where the approximation is done with the lowest polynomial degree over the whole range. For example, in the case of polynomials of degrees 2 and 3 beyond 150 ROM words, the average (and in this case the true) number of multiplications is 2. That is, there are no polynomials of degree 3 used in the range [.5,1] and the two joints are located by the program at the extreme right (X~ = X- = X. = 1) . In this example the point where the 3rd degree polynomials are no longer needed is at approximately R~ = 150. The one joint results are given in Fig. 5.2 - 5.11. Fig. 5-12 represents a partial ordering between all the one joint implementations. There are two possible cases: 104 X i 1 1- / f-" z / o / ~3 / Ld / 1 —> ;z / X o / \ ! ^ X y~f U_ / O cr uj CD X < UJ _l CD < _J < > < CO Q cr o o en Figure 5.1 Each Point on the Curve Represents a Minimum 105 CO ■z. o < CJ oo CD U o M li_ CO o >- UJ o GO 2> 2 UJ 3 2 z UJ o • < CO cr > UJ > < s s -« r~ o-s O'V 0-C 0-2 SNOUHDndinnw am 33d«3Ab o-i +w M-l u 0) -Q B 3 2 fl •H -P CO U W > CO C •H -P (0 U •H .H a •H +J CN H «. (Q 3 rH g a II X M-l • z: Di o u Q) tt v Q or ja OD E • z 3 C 2 Q) Oi 04 r0 M • « a < •H O id • • cu G u O 3 u a) f-i (0 fd rd S u •H -P Q a > o H > CN ... • M LD U w G •H -M id •H rH P. •H -P ro (0 .H v U P rH Ql g 3 MH II r: • o tP tr >H QJ QJ Q OD X) z | t G 2; rH (U Ui Oi to u ■a. c < •H •"3 • • QJ 0) > g H o 3 u i7 H w (0 (0 i u •H 4J Q a > o H > on »^, • W m •a u 0) n 5 3 Oi s ■H o Pm « 107 CD U □ o Q_ SNOiibondinnw aaN aauiOAd o Q) J3 C •H -P CO O U CO > w C O •H +J fU o •H rH Q. -H -P •-I 3 a M-l O ^1 (1) CD tr» «J CD > CD > u f0 6 •H •P Qi O in Q) ■H fa CT> CD Q C >i H O 4-> c •H o CD c o CD to u Q > CO O a s 108 in CD u o •z. >-• _j a SNOiibondinnw a8N 33ua3Ad M-l u Q) XI 2 C •H -P (0 u w > w c •H +J id u ■H <-\ a -H -P in pH «. 3 rH S II 14-1 • & V4 0) a) Q xj 1 • p C 2 >i iH Q) tP rxi fO >-4 •» CD 4J > C < ■H >"3 • • a) Q) > C u O P u • •n. QJ rH (0 (0 ni s U •H -P Q a. > o H > m t% t W in T3 M ID >-i S 3 tn s •H o Cm pc 109 m rsT CD u o >1 _J o 0_ o«T s s -W 0* 0-C 0-2 SNOiibondiimw yaN 33ya3Ad 0-1 m m CD £> 2 c •H -p CO u CO > CO c •H ■P rt o •H .H a •H -U m (Q rH «. Q 3 CN Qu s 3 m II x= • o CP or. m Q) 00 c < •H • t CD CD > C! H O 3 u CD <-\ CO o H > vo ,„ • CO m J-l CD u & 3 tP s •H o fa rt -8 no 4-1 o u CD Xi CD LJ o _j o CL ■8 -« r 0-S 04 0-C 0-2 SNOJlbOHdinnw H8N 33HM3AB 0*1 c •H +» CO O u to > to c o •H +J a o •H rH a ■H +J rH P s M-l o M CD CD CD > CD > u u e •H a o CO U O s o OS Ill B CD U o o -8 -R sno i iy a ndinnw hqn 33B83AB fi •H ■M co u to > 0) C •H +J fO U •H rH 04 •H -P in rH * (Q 3 CN 9 £ II 3 m • ZT tJ> o ^ CD or 0) Q a: 00 • z 3 C s rH CD fr Oi -i • V > C < ■H • ■ a) CD > c ^ o s u CD rH to id rrj g U •H -P Q a > o H > 00 ,„ • to in r-l 0) u IS 3 Cn s •H o Pm tf 112 n CD U o o Q_ r- 04 O'C 0'? SNOllbOndUTflW &8N 33da3AH 0-1 M-l O H Q) s | C •H 4-' to U CO > co C •H -P a u 8 ■H a •H 4-> M" iH «. 
JQ 3 n 8 s II 3 M-l • jr tn o u CD Or a> Q or JQ 00 £ • z 3 C z r-i cu O Cn cu fl ^ • •^ ft <1) > -p < •H O • 1 CD 0) > C J-l O P u o > CTi ,^ • (0 n m U CD u 5 P tn s •H o fe « 113 LT> n CO u o o a. S -8 -R D* O'C 0-Z SNOUUOIIdirmU 88N 33B83AB O-l m u a) XJ £ P S C •H -P CO U CO > co G -H -P a o •H ■H a -H -P ■H LO P «. cn S CO M M-l II 6 o x • __ ^ cx> £ <^ cu 5 -Q 01 s Q o: P • gg z c z >1 CL) <-\ CP ftf Oi M (U • ^ > -p < c •H • • t-D cu > CU M c p o u •*. iH cu «J CO g fO •H u -P & Q O > H > o rH • v • 10 m TJ ^1 Q) M s P U> a •H o Pn « 114 IT) H CD U Q O CL -s S -« 0-S 2. O'V C-C 0-2 SNOUHOIIdinnW 88N 33BH3AB 0-1 « M-l H (U XI s c ■H -P CO U CO > to C •H -P rd •H r-i a. •H -P iH m a ^ IS S ■^ g m II 3 C n D> O or XJ Q i (U rH Qi rd ft M Q) •» > -p < c •H •• ^ (1) > H > H H 1* • CO IT) Q) ^l 5 3 Di 2 •H O tn ttj 115 1,5 Fastest Implementation 2,5 )-«— Slowest Figure 5.12 Ordering Relationship for Implementations Involving Two Different Polynomials (One Joint) 116 either the optimal curve of one implementation is always under the optimal curve for another imple- mentation (e.g.: 1.2 is always faster than 1.3 for the same amount of ROM) . or the curves cross at a certain value of R fl (see Fig. 5.13). In Fig. 5.12, implementations that are comparable are linked together. If they are not (such as 1.3 and 2.3), they cross somewhere. This graph has the properties of a lattice. In Fig. 5.13 where two different optimal curves are plotted on the same graph, the point E corresponds to statistically equivalent implementations. For any point such that R Q > R QE / curve B is obviously more advantageous because it leads to a faster evaluation of the function for the same cost in terms of number of Read Only Memory words. However, this comparison should be weighted by the fact that curve B may involve more joints than curve A and there- fore increase the complexity of control. For implementa- tions involving the same number of polynomial degrees, clearly curve B should be chosen. For any point on the R fi axis such that R Q > R flF / curve B no longer represents the optimum choice and it is obvious that for the same amount of ROM, curve A now yields a faster implementation. 117 FOR THE SAME NUMBER OF ROM WORDS THE B APPROACH IS FASTEST CO o < o _J EQUIVALENT IMPLEMENTATIONS IN THIS INTERVAL THE A CURVE IS BETTER FASTER AVERAGE SPEED R, R OE NUMBER OF ROM WORDS Figure 5.13 The Designer's Choice Among Two Possible Implementations 118 This situation can, of course, be generalized when there is a choice between many implementations. In Fig. 5.14, the dark heavy curve should be the optimal realiza- tion when one is faced between implementations A, B, C, D, etc Control complexity may well be a factor, of course. 5.1.2 Two Joint Results The case where three different polynomial degrees are available for implementation is illustrated in Fig. 5.15 through 5.22. The comparison between these different imple- mentations is shown in the graph of Fig. 5.23. The remarks made for the one joint case apply equally when there is a choice between different possible implementations involving 3 degrees. An overall ordering relationship is shown in Fig. 5.24. 119 AVERAGE NUMBER OF MULTIPLICATIONS NUMBER OF ROM WORDS Figure 5.14 The Designer's Choice Among More Than Two Possible Implementations 120 CD u a o 0_ r o-v ft * I o-c o-z SNOlldOndllinw «8N 39Hd3AU m p w G •H -P f0 o •H H a ro •H «. +J (N r-\ «. 3 r-\ s II <4-l • D^ M QJ 0) Q XI | • 3 G 2 >i rH 0) O & o. 
-p < G •H • • h> QJ > Q ^l S d M u • *» ■H QJ (0 W £ fl •H U -P & Q O > H > LO •H • «. • CO LO 73 M Q) O M s 3 Cn a •H o P4 « 121 U) u a o a. r 0* i i SNOJlBOndUTH U8N 33UH3AB M-l 5-1 a) ,Q z G •H -U (0 u CO > w c •H -P fO O •H .H a ■^ •H v -P i H a) tP cu rd u ■ * +J < a ■H • • b CD > u £ 3 E-i U • *, r-l Q) ri (0 £ flj •H u -P a Q o > H > v^> H • *, • CO LO T3 M 0) u S= 3 cn g •H o Pm « 122 CD U O 21 >-< O CL 0^ 1 1 SNOlldOndUinH 88N 33B83AU m m CD JQ Z C •H +» CO u CO > CO c •H -P (0 O •H rH o. LO •H V 4J CN iH * 3 rH J>J II 4-1 • tn u CD CD Q -Q P C 2 >i H 0) Cn Ot CO U • V CD CO > -p < G •H • • h) 0) > M 5 3 Eh U • ** H CD crj CO 6 CO •H U Jj a Q > H > r^ H • ^ • CO in •O U CD u 5 3 en s •H Cn « 123 CD U a o CL r 0-V SNOlibOndliinN U8N 39tft3Ab ••I m u Q) £) 2 c •H -p W u 0) > to c •H •U i H -M < c •H • • h) cu > u 5 3 En u • *» rH 0! (d CO g rd •H u 4-> a Q o > M > 00 rH • V • 10 m T3 H CD u ss 3 Cn S •H o Cm « 124 ID CD LJ a o CL r o-c t-z SNOUMIIdliTM *8N 33W83AB c •H -P CO U CO > co c •H +J (0 o •H rH a m -H * -P m H *» 3 H s II MH • (^ u 0) i rH -p < fi •H • • r3 0) > u ss 2 Eh u • •* rH Q) tf CO £ «J •H u -P a Q o > H > CTi rH •« • CO in -o U Q) U s 3 0* g •H o Em « 125 LD CD U a o Q- SNOllWOndlimw 88N 33Bd3Aa I 6 -■ 0-1 M (1) XI s c •H -P W u w > w c -H -P fd u •H .H a in •H ** 4-> *tf rH *. 3 H •^ A II M-l • tP U 0) i H (1) & a. fd u **. CD W > JL» H 5 3 E-t u »•* H a) fC 0) e id •H u 4J a Q o > H > o CN • % • CO ID •a u Q) M 5 3 Oi £ •H o P4 « 126 in II CD Ld o >- _] o Q_ B 8 0-C 0* SNOUdOIldinnw H8N 33bH3AB n to C •H -U ctj O •H rH a m •H v +> n M *. 3 CN 2 II M-l O • CJi u a> Q) Q X) g • d G 2 >i rH CD Cn (^ CD to > -p < c •H • • h> CD > H 2 3 E-t u • *, rH CD H > rH CN to • CO LD TJ u CD u 5 3 tp s ■H o fa tf 127 in ii CD U o o Q_ r SNOIlfctflldirmu 88N 33BM3Ad 8 0-1 a ■H •P W u w > w C •H 4J <0 o •H H a m •H ** 4-> ^ H ^ 3 CN s II in • Cn M CD 0) Q ^Q g • 3 C 2 >i -H +J 4 a •H • • b (U > o u 5 3 Eh u • •* H QJ «j W £ fO •H U -P ft Q o > H > OJ CN • *, • (0 IT) X3 H 0) M £ 3 fr s •H o fc. « 128 OPTIMUM CURVES CROSS SOMEWHERE Figure 5.23 Ordering Relationship for Implementations Involving Three Polynomials (Two Joints) 129 i -H o ro O (N JC -P •H W c o •rH -P fO -P c 0) 6 0) rH a E o •H W C o •H ffl rH Q) OS tn C •H 5-1 0) ^ O «0 .Q O rH IT) 0) u •H p4 130 5.2 The Effect of a Finite-Length Look-Up Address The program was run using the same algorithm but with each iterative evaluation of the breakpoint interval h. = x. , - x. replaced by the closest value satisfying: -k 2 <_ h. [k is some integer here] This replaces each interval length by its closest power of 2 approximation. Intuitively what should happen is an increase of the number of ROM words for the same average number of multiplications. In other words, for a given cost, that is, a given maximum number of ROM words available, the speed should be smaller. This effect can be seen in Fig. 5.2 5 which can be compared with Fig. 4.12. The minimum is situated at (XJ(2) - .7 and XJ(3) - .8) in Fig. 4.12 whereas, for the same cost (R fi = 350 Words), the minimum in Fig. 5.2 5 is situated at (XJ(2) = .55 and XJ(3) .75). The implication of this is that fewer one or two degree polynomials are used and consequently slower evalua- tion results. 131 CO o (Z O O m rO It o I LxJ CO < o ju 1 >"* to c •H h) 5 Eh «** • g s •H C •H \ ^ s I * (D | o J! 
I ^— > +J r\J o UD /- * "^ o r° x -H +J •H I *-< CO / N ^ ft [6 •— f £> 4-> ^> C 00 * N •H ^ -P -H Cn •H Q m M-l (N rH -p W (D 0) m 0) U-t u W Q in CM rH • ftf in •H 1 0* rH ■H o fa Dh 132 6. HARDWARE IMPLEMENTATION 6.1 Design Considerations: Expected Time Delay. How to Store the Coefficients Figure 6.1 shows a general structure for evaluating a function. The coefficients of the polynomials are stored in ROMs and the address decoder determines where to look up the coefficients from the argument x. The algorithm may loop several times through the multiplier. We will distinguish four types of function generator architecture: UIUD Uniform Interval , Uniform Degree : same degree for all the polynomials over the whole range. VIUD Variable Interval , Uniform Degree : intervals will have different lengths to save ROM; same degree for all the polynomials over the whole range. LTVD Uniform Interval/ Variable Degree : intervals have the same lengths between joints, but different polynomial degrees are used over the whole range (but the same between joints) . VIVD Variable Interval, Variable Degree : the most general case; a combination of variable intervals and variable degrees. x = 0. 133 The coefficients of the polynomials are stored in a ROM or PLA-- Address decoder High Speed Combinational Multiplier v Result Figure 6.1 Sketch of the Arithmetic Unit for Function Evaluation 134 Before looking at the detailed design, we will evaluate the expected time delay. For parallel evaluation of the polynomial, the average delay is given by: Vi T avp = ]P(21og(m.) T(n) + (1 ROM access time) .) / L j = l J x xln J dx where s is tne total number of joint partitions T(n) is the time to multiply two n-bit numbers the ROM access time must be counted only once in the interval j since subsequent accesses can be overlapped with the multiplication cycle. 21og(m.) is the number of levels necessary to evaluate a polynomial of degree m. entirely in parallel, the number of multiplications. /D+l 1 _ £ X is the probability of finding argument x In 3 X. D x in interval j If the evaluation of the polynomial of degree m. is not performed in parallel the expression above should be written (serial case) : X ) (m. T(n)+(1 ROM access time) .)/ ' ~! J J J xln 3= ~ X. j+l L_ 3 l j+l 135 If there is no address decoding and if we assume an implementation where the breakpoints are separated by - £ • h- = 2 ] or, in other words, where there is a direct look- up over (&. - 1) bits, the ROM access time is proportional to log 2 (£. -1). A factor K™ will be introduced to take into account the technology used. If we assume that the multiplier uses the same technology as the ROM, we can write: s [X T avs = B m j T(n).K T +log 2 U.-l).K T ) ln j = l L X J J Depending on the structure of the multiplier, T(n) can be proportional either to n or to logn. This formula, when K T is known, can be used to find the average speed expected from a given technology and design. The first ROM access is the most important because the first coefficient, a . must be obtained in order to m begin the evaluation of P (x) = (((a x + a ,)x + a „)x + a _)x + a „ . m. m m-1 m-2' m-3 m-4 While a x is being computed, a -, can be retrieved m 3 c m-1 and added to ax. While (a x + a ,)x is being computed, a ~ ^ s looked up and the time sequence can be viewed as follows: H 136 rd u rd a c •H i >i >1 o H H H o & Q4 a rd •H -P ■H -P •H -P a H N = m n - m J — ^ — Jo = m n (1- — ~ Jo) This diminishes the amount of ROM needed by a rela- tive value of -i-s — - £. 
If m - t , the savings in Memory can be up to 50% compared to a straightforward storage scheme. In addition, only an n by (n-ic) multiplication need be performed rather than an (nxn) multiplication. 6.2 Architectural Considerations: Obtaining the Address from the Argument 6.2.1 UIUD Case In this case both the breakpoint interval length and the degree of polynomial approximant are uniform over the whole range. The value of h satisfying C(f,P m ,x.,x.+h)£ e \c i , 1 must be determined first. This h must be the most "strin- gent" h over [a,b] so that the value of the criterion 140 C will always be less than e . h can be written: h = 2 £ + E 2~ l E 1 T -1 1 (6.1) where E < 2~ l Stated another way, the number of ROM address lines necessary to look up the coefficients of the poly- nomial approximant will be: I = flog h where x is the "ceiling function," the smallest integer greater than or equal to x. In the case of 3 =2 and a normalized floating point number, the most significant digit is 1 and the number of ROM address input bits is then (£ - 1). Figure 6.2 shows the essential elements of the UIUD hardware realization. When the E defined in (6.1) is very close to 2 , a large amount of memory is wasted using a UIUD design and a VIUD design, as described in the following section, may be more economical. 141 x = 0. I / n / l-\ lines To the Multiplier Figure 6.2 Type 1 Architecture: UIUD 142 6.2.2 VIUD Case In this case each interval [x., x. ,] has a length h. dependent on the value of the function g(x) defined in Section 3.3.2: h i = X i+1 ■ X i = k m 9 (x i> where ^2 1 k 1^2-] 2m+3 m K m and K = / =■ dr / ((m+1)!) 2 / m 2 1 k"o (r -V . . r ... (m+l) . . . 2 . . 2m+3 g(x) = { (f (x i )) w(x i ) } h. depends essentially on the behavior of the (m+l)st derivative of the function f. • dh The rate of change of h i as a function of x is ^. Assuming for simplicity w(x) = 1 2k (m+2) dh m . f (x) Example: dx 2m+3 2m+5 , . -. > 2m+3 [f(x) (m+1) ] f (x) = - m = 1 v ' x 143 2 dh m _1 /_2_\ 5 dx " 5 x ^ 3 ; When the number of breakpoints is large h increases approxi- mately as the continuous function -=—. The region where the size of h changes the most rapidly is at the left side of the interval. As in the UIUD case, it is not possible to implement the look-up for a general (infinite precision) h. since locating the two breakpoints close together would require many address lines. The number of ROM address lines is, for the smallest h: I - 1 = [log ij - 1 However, the smallest value of h need not be used over the entire interval. The smallest h is at x = .5 in the case of 1/x. (See Fig. 6.3) When x. increases from . 5 to 1 in this example, the length h. increases. At point P 2 the value of h satisfying C(f,P , x.,x.+h.) =e becomes twice 2 3 n i l i o -I -$, the smallest value 2 . At points P-., h = 3x2 and at point P . , h = j 2 . It is much easier to detect integral -I -£ multiples of 2 than fractional increments in 2 The total number of ROM addresses is P. . - P. 3 +1 3 A(.5,l) = Z j _„ L P A 1 3 1 -£ 3 144 rO ft* . (M ft* ■ CU LU CO < CO LU CO Q cr o o cr CD cr LU I- Q — (0 ^ _l < £5 CO < > cr LU LU O \- £ - CO CO Q cr o rO LU =? Q 2 < Ld < cr o I— cr CO rH f0 < 00 > ID M O -P C H A -P o CJ > O o CO en o LU CO Q CO CM 'x: cr cr u_ 0) 0_ o o £ co £ Q -M LU 2 < o h- o LU -z. q; H c < CO o cr < rO z •H -P 7) •H o Ll > en u Q D Q) •H fa 145 The number of address lines is Hog- A(P,,1)]. The abscis- sas P. are such that: k m 9< P j> " i 2 ~ l If we want to determine how many such points P . 
there are between .5 and 1, simply compute the ratio: k g(l) m 3 kg (.5) m -I and if k g(.5) = 2 (this depends on e ) then the ratio: -*IB, will give an approximate value of the numbers of such points P. . 3 Example : f (x) = - m = 1 w(x) =1 3=2 2% g(x) = (^r-) 5 x -3 g(l) = 2"* 4 = 1.3195 g(.5) = 2~ 8/5 = .3928 siiL H 4 g(.5) 146 log( ?T75) ) So we have 2 such points for the above example. The standard way for decoding the addresses would then be to introduce a ROM decoder between the I -1 most significant bits of the argument x and the address lines (£ ) of the coefficient ROM. Each value of h must satisfy the condition h = j.2 * for feasibility of implmentation in hardware (see Fig. 6.4). that If one is willing to consider only points P . such h = 2 J • 2"\ then the ROM decoding used above can be advantageously re- placed by a barrel shifter (Fig. 6.5). When the interval detected is such that h = 2.2 , it is necessary to increment by one, using only every other word. This can be done by looking up bits 1, 2 . . . (£ - 1), discarding bit -I 1. When h = 4.2 , bit (£ - 1) is discarded and bits 1, 2, . . . (£ - 2) are used for direct look-up. The barrel shifter allows a combinational shifting of a string of input bits (Fig. 6.5). The number of shifted places is controlled by independent inputs and the propagation bit can be chosen to be or 1. The control for the shift amount is decoded 147 I x = 0. jS l-\ lines ROM Decoder / /'< /-lines ROM Coefficients of P m (x) ^_ j Address Generation \7 To Multiplier Figure 6.4 Type 2 Architecture: VIUD 148 °7 °6 °i _J I I °4 °3 °2 °1 ~i i i i i I i — r h h h h h h 'i 'o Enable Inhibit Example with a shift of 3 ($ S,$ 2 = 110) Truth table : INHIBIT ENAILE 1*2 «0 «1 «2 °0 °1 °2 °3 °4 °5 °« °1 r o 'l '2 ! 3 '4 's '6 h 1 'l '2 '3 '4 '5 '6 '7 1 T 2 T 3 T 4 T 5 T 6 T 7 1 1 T 3 T 4 \ T 6 T 7 1 T 4 T 5 T 6 T 7 1 1 1 T 5 T 6 T 7 1 1 1 T 6 T 7 1 1 1 X 1 X 1 X 1 X T 7 1 1 1 1 1 1 1 X X X X 1 1 1 1 Figure 6.5 Barrel Shifter 149 (with low precision) directly from the input x (Fig. 6.6). The shifted number is now the address of the ROM. The ROM savings to be expected this way will be less than for the general case, but are balanced by the reduction in control and the simplicity of design. For large N, however, this technique is not ad- vantageous for all types of functions (see Appendix A. 3). 6.2.3 UIVD Case When there are several polynomials and the intervals are chosen to be uniform the designer has a. the flexibility of a variable average number of multiplications b. the advantage of uniform intervals, which is easier to decode. Figure 6.7 illustrates the general organization of a UIVD function generator. By choosing the largest h satisfying the criterion and having h = 2 , direct look-up using the argument is feasible. A decoder of low precision will determine the number of multiplications to be performed and which bank of ROMs to read from. This decoder can be implemented with a small ROM. By choosing the same joints for several func- tions, one can choose the same addressing scheme (See Section 8) . This structure is the most appealing for its flexi- bility and ease of implementation. 150 Cm o 04 D K CO ^H 2 X, o < o o CQ PS J n Q Q + + < i I / 1 -r- > 1 h> ■f- * CN h> X (N X C* PQ J w b pa Eh CM W 2 Pn Q « H H < EC • > PQ CO H Q _ j 1 -n Cm I 1 •H X Eh U $ Eh + ' 00 D W i k J 1 ■H T" X Ph ■n 4 fM m 0* pa co pa 03 « pa Pa k tc Q 5 Q C •H W D Cn C ■H T3 O O CU Q CO CO a; u < VO CU c^ •H Ph 151 0. 
Breakpoint to Degree Control the number of multiplications and memory select I n Coefficients of PmjW Polynomial L ajX + bj Polynomial ajX 2 + bjX + Cj o] bi Cj Polynomial ,x +q x + dj a-, x 3 +bix 2 +Ci x + d Q; b; C; d J Figure 6.7 Type 3 Architecture: UIVD 152 6.2.4 VIVD Case This case is illustrated in Fig. 6.8. Due to the variable intervals within each joint, a decoder is needed for each bank of ROMs storing the coeffi- cients of the polynomials of different degree. The decoder can be implemented with ROMs. This architecture would be interesting for very high precision implementations where substantial savings in ROM could be achieved. 153 x=0. Polynomial Degree Decoder Iz \7 Control of Number of Multiplications and Memory r I n De- er H Coefficients for Polynomial P^ Degree m. UN rV I De Coefficients for coder — ^ P m2 (x) Coefficients for P m3 (x) m 4 l_ J To Multiplier Figure 6.8 Type 4 Architecture VIVD 154 7. COMPLEXITY OF THE METHOD 7.1 Introduction and Background In this section we consider the bounds on the ROM- based method (the R-method) for function generation as the precision n -> °°. We ultimately compare the R-method with the classical single polynomial evaluation scheme. This will be referred to as the "S-method."* We will assume that an unlimited number of processors is available. This requires a brief review of what is meant by "processor" when memory is allowed to grow with n. We are concerned with order of magnitude results and multiplicative constants are ignored. Multiplicative constants for the bounds are left for further research. We assume that f (x) is to be evaluated with error 0(2 ) as n ■> °° for any floating point number x with an n-bit fraction in a finite interval [a,b]. The function f(x) is continuously dif ferentiable in [a,b] and "well-behaved" in general. In the derivations below the notation f * g means that lim f (n) , and f = 0(g) n ~° g(n) means that there is a positive constant C such that f(n) < Cg(n) for n sufficiently large. We will also present a bound for the time to eval- uate f (x) which is asymptotically equivalent to a bound given by Brent BRE76 and is obtained by using a number of ROM words that does not grow exponentially but rather in polynomial space. 155 Before starting the study of the complexity of the R-method as n -> °° we will briefly review the measure of effectiveness of an algorithm and the best known bounds for parallel polynomial evaluation and multiple precision function evaluation. 7.1.1 The Measure of Effectiveness of an Algorithm Kuck [ KUC74] defines speed-up of a parallel algorithm Ts A over a serialone a as S A = =— where T is the number of A time units required to perform some calculation using p independent processors and T is the time required to do the same calculation on a single processor of the same type. The definition of processor depends on the algorithm con- sidered. It can be a CPU, a hardware multiplier, an adder or even a single gate. The efficiency is defined as . e = S A /R(i_l) • Because speed-up alore is not an adequate measure for comparing algorithms, Kuck defines the cost of an algo- rithm using p processors as C, = r T. . The measure of the effectiveness of an algorithm is then given by the ratio: E = S A /C A - A high cost counterbalances a high speed-up factor. To compare two algorithms A and B that perform the S S same function, it is necessary to compare -^- and -^-. A is S A / S B A B better than B if - — /— - is greater than 1. 
This ratio can C A / L B p T B 2 be seen to be equal to — =■— *■ • As a result, a parallel Pa V 156 algorithm can be characterized by 4 numbers: speed-up, efficiency, cost and effectiveness. As n ■+ °° , the limits of these 4 numbers can be studied and constitute an adequate and compact means of comparing algorithms. 7.1.2 T he Best Known Bounds for Parallel Polynomial Evaluation Several algorithms for the parallel evaluation of arithmetic expressions have been proposed. The special case of polynomial evaluation was investigated by Maruyama [MAR 73 ] and Munro and Paterson [MUN73 ]. They essentially showed that an upper bound for the number of steps required to evaluate polynomials of degree y is given by: (l+e)logu where e^i log-u In other words, a yth degree polynomial can be evaluated in logy + O(logy) steps. This bound assumes that multiplica- tion and addition times are equal and that arbitrarily many processors are available. The number of processors required to attain this bound is 2y. 157 When the number of processors is bounded and finite, Snir and Barak [ SNI77] have shown that any polynomial can be computed in Y~+l + °(P ) steps with p processors. 7.1.3 N otation and Assumptions The R-method (ROM-method) as we define it for this study is illustrated in Figs. 7.1 and 7.2. f (x) is to be evaluated with precision 0(2 ) as n -* °° over a suitable interval [a,b]. We are interested in the case where n is much greater than the word-length of the machine so that the mantissa occupies several single precision words. Thus, when we say that a "word" of Read-Only Memory has size (n) , we mean that it would take 0(n) single precision words to store such a "word." We assume a UIUD approach. The VIVD improvement affects the bounds by multiplicative constants as is shown in Appendix A- 3. The polynomial approximants have degree y. Because the degrees can be made dependent on n as n -> °°, we will accordingly denote by y(n) the degrees of the polynomial approximants. The total number of Read Only Memory words necessary to store all the polynomial coefficients will be denoted by #)(n) . The number of distinct memory addresses is R(n) and differs fromV?(n) by the product of the number of required coefficients times the number of words, ny (n) . We assume binary representation. Each nolynomial P , , ' u (n) can be evaluated in two ways. One can assume a number 158 05 a: UJ _J Q. _J 2 il / c - o a: __ o?-jr s-\ / x / u-/ ▼ ^ X X o x: .u (X) THIS INCREASES THE HARDWARE OR COST THIS DECREASES THE SPEED Figure 7.2 Tradeoffs when n 160 of processors limited to p for evaluating P , . or else 1 |j (n) an unlimited number of processors may be available as n -> °° . A processor takes a step time t (n) . In our case t (n) will be the time to multiply two n-bit integers using 2 single precision multiplications is (n ) . It has been shown that a better time is possible by neglecting terms which do not contribut to precision 0(n). One bound is (n)=0(n 1 - 58 " The best known bound however has been obtained by Schonhage and Strassen [SCH71J. They showed that (n)=0(n.log(n) .loglog(n) ) (7.1) is the time to multiply two n-bit numbers using a single precision multiplier and the result has precision n. It has been conjectured by Knuth that this is the optimum bound. Thus, we assume in what follows that the time to add/multiply two numbers is t (n) given by the Schonage and Strassen bound. 161 7.1.4 Time vs space. 
An Intuitive Look at the R-Method We would like to pause here before going on to a formal analysis, and present an informal discussion about a question which, in a sense, is a key point in the area of function evaluation in a computer. One must ask whether or not it is at all worthwhile to use Read Only Memory for computing f (x) rather than the traditional algorithmic, one-polynomial scheme. In other words, is it worth trading time for memory? We would like to point out that this question is of general theoretical interest and that Hopcroft, Paul and Valiant [ HOP77 ] in a paper entitled "On Time vs Space" have very recently presented an important result related to this problem. Their theorem states that: "Every deterministic multitape Turing Machine of time complexity t (n) can be simulated by a deterministic Turing machine of tape complexity t (n) /logt (n) . " More research is needed in order to determine the exact relationship between this important theoretical result and the more specific problem of function generation using Read Only Memory. However, the reader will not fail to notice the striking similarity between the above theorem and the appearance of the bound (n/logn) scattered in our derivations on the complexity of the R-method. 162 The R-method has an easily implementable advantage that other methods do not have: the evaluation of f (x) can be speeded up two ways, one of which consists of using memory : a) by evaluating P (x) in parallel (many multipliers) or b) by having more polynomials of lower degree, namely by doing more look-up (many memory words) . If n is fixed and the number of ROM words is fixed, we have seen that there is an optimum choice for the poly- nomials that yield the lowest average number of multiplica- tions. When f is approximated over [ a,b ] by a single poly- nomial, the only storage required is for the coefficients of this one polynomial. This simplicity is penalized by the necessity of using a high degree polynomial in order to maintain the required precision 2 over the entire interval. The speed of evaluation for this s method can be improved in only one way: by evaluating this single polynomial in parallel. What happens when e (n) = 0(2 n ) and n -*■ «>? In the S-method, the only way to maintain a precision e (n) is to increase the degree of the polynomial approxi- mant. The rate of increase of the degree will depend on the function f (x) and will have the effect of slowing down the evaluation of f(x) by increasing the number of multiplica- tions. 163 In the R-method, one can modify both the number of ROM words and the degree of the piecewise polynomials to maintain the precision e (n) =0(2 ). There is a relation- ship between the degree of the polynomials y (n) and the total number of ROM words ^7(n) necessary to guarantee I f - p , , II <2 . When the number of ROM words increases, 11 y(n) " this has two opposite effects: a) it improves the speed-up of the method, and b) it increases the cost of the computation. What it does to the effectiveness (as defined by Kuck) is a question that we will try to answer in the next sections. Whether there is a strategy [ y (n) , "''(n)] that will make the effectiveness higher than the single-polynomial, parallel evaluation scheme for all n as n ■*■ °° and still guarantee a precision 0(2 ) is also a f undamental question. If this were known, one could choose the best growth rate 0(R(n)), matched with the appropriate degrees 0(y(n)) to make the evaluation the most effective. 
It is possible that, as n -»■ °°, it may not be feasible to keep the effectiveness of the R-method higher than the effectivenss of the S- method. In other words, the increase in speed obtained by adding memory may not "be worth" the increase in cost of memory . In order to answer these questions we must express the following items as functions of n: 164 a) the relationship between y (n) and R(n) b) the speed-up of the R-method: S_. K c) the cost of the R-method: C D d) the effectiveness of the R-method: S /C_. R R e) the ratio of effectiveness of R and S-methods S R /-£- and study it as n ■*■ °°. c s 7, 2 M emory Complexity of the R-Method We start from the relation y(n) f(x>-p (x)= n f (y+1, u) y i=o (y+D! If the x . ' s are chosen to be the roots of the Chebyshev polynomials of degree u + 1 then | f - P can be bounded: h M , , |f(x)-P (x)|<— H±L (7.1) y 2 y (y+D! where M , is a bound on the (y + 1) st derivative of f (x) . At this point a closer look at the behavior of the (y + 1) st derivative of f(x) as y -> °° is necessary. Before proceeding, we introduce some definitions. 165 7.2.1 Definition of Type 1 and Type 2 Functions Definition 1 A continuously dif ferentiable function f(x) over an interval [ a,h] is said to be of type 1 if there exists a constant C, independent of y such that for any x e[a,b] f (U) (x) |- c 2 y (p + 1) ! 2 y 167 Since y + 1 = (y) , (7.3) becomes so ^--°°and y(n)-*°°: n 71b (n)=0(n2 u ) 168 Proof: Using (7.1) we wish to satisfy the condition y + 1 . C2 ~ n 2 U (y + 1) !" By the definition of a type 2 function, M = 0(0^ ) as — We then get h p+1 M xl h u c!/ y + 1 1 • = — - as y^°° 2 1 " (y+1)! 2%! Again y + 1 = 0(y) and, using Stirling's formula, h y cV h u 2%! 2V X Inequality (7.5) then becomes: (C n e) y (2TTy j'* 5 h^ cV J* -n y 9 v y z i — (2TTyj <», y •> 1 and therefore the number of addressable ROM words is b " a 1 IT R,(n)= -0(i 2 y ) 1 h p Multiplying by the number of polynomial coefficients and the word length of each coefficient: n 1Tb 2 (n)=0(n2 y ) To summarize, the number of ROM words for type 1 functions ( x, l/x,logx...) grows y times faster than the number of ROM words for type 2 functions (e , cosx / sinx...) when y,n -»■ ». The bounds for type 1 and type 2 functions will differ accordingly due to the factor y(n). 7.2.3 The Relationship between Polynomial D egree and ROM Size It would be interesting to know if there is a way to increase y(n) as n -> °° such that the total number of ROM words does not grow "too fast." More specifically: 170 What p(n) will guarantee c =0(2 ) and maintain j • - k ' -1 (n) or 2 (n^ ^ n ^ for some fi xe d k greater than 0? We next present two theorems which answer this question. They establish a growth rate for u(n) which guarantees a polynomial growth rate for the number of ROM words (single precision) . Theorem 1 If -oo< a allows y. to remain finite. The next iteration gives: n y 2 = (y + l) logn-logy , replacing p. n y 2 = ylogn+log (ylogn) The third iteration is: |_i = GfyO n V 3 " (y+] ) logn-logy2 or, replacing y-: n y = 172 2 , , . log (ylogn) (log (ylogn)) ylogn+log (ylogn) + — 2-^ 2 — _ E 2 2 Log2.1ogn 2Log2 . (ylogn) which, by using ln(l+x)= - *=- + becomes n y 3 = (y+l) logn+logn+log (ylogn+log (ylogn) ) 173 and therefore n y-,(n) = 3 ylogn-log(ylogn) +0(^211) k 2+Y Recall that k = y + 2 and therefore n = n so that TV (n) = 0(n 2+Y ). ^Rl Theorem 2 If -« < a < b < fc? 
R Jn) = 0(n 1+Y ) single precision Read Only Memory words suffice to evaluate type 2 functions f (x) to precision n, uniformly, for all floating point numbers xe[a,b] using the R-method with poly- nomials of degree Ylogn as n -*• <» and where y is an arbitrary positive constant. Proof ; From Lemma 2 n 7Ib 2 (n)=0(n2 u ) Replacing y by — ; where y = k - 1 yields: ^ Note that the relationship between y and k is dif- ferent than in 1 . y is simply the smallest positive number allowing y(n) to remain finite for finite n. 174 inyn) = 0(2T lQ g n . n) or /7 RZ {n) = °( nl+Y ) In theorem 1 and 2 we were concerned with polynomial growth of the total number of ROM words as a whole . The next two theorems deal with polynomial (n ) growth of the number of ROM addresses (denoted R, (n) and R 2 (n) ) . Then, in order to obtain the total number of single precision ROM words it is necessary to multiply R, (n) or R 2 (n) by y(n)*n. The next theorem is presented without proof. Theorem 1-A If -oo < a < b < oo or' - ±i n ~) 6 R1 y Vlogn/ single precision Read Only Memory words suffice to evaluate a type 1 function f (x) to precision n, uniformly, for all floating point numbers Xe[a,b] using the R-method with poly- nomials of degree ' / N n Rl v ylogn as n -> °° where y is an arbitrary positive constant. In this case y = k. 175 Theorem 2-A If - 0° < a Y logn-log(Ylogn)+0( ^ Qg g ) single precision Read Only Memory words suffice to evaluate a type 2 function f (x) to precision n, uniformly, for all floating point numbers x e[a,b] using the R-method with poly- nomial degrees y (n) R2 - y logn-lcg ( ylogn) +1/ Y (i^£2£L) logn as n ■*■ 00 where y is an arbitrary positive constant. Proof ; To maintain the number of addresses bounded by n, one must find a y(n) such that: n 2 y k R 2 (n) =— = n K taking the logs of both sides: — - log y = klogn and the equation above, put in the f orm y = G(y) becomes 176 n \i = klogn+logy and is solved recursively. The method is similar to the proof obtained for theorem 1 choosing p Q = n as the initial value and replacing gives the first iteration y, : n v l = M 2 = G ^ w i^ : (k+1) logn n y 2 = klogn+logy 1 replacing with p. : n y 2 = y 2 - klogn+logn-log( (k+1) logn) n (k+1) logn-log ( (k+1) logn) Continuing this way, the result is after a third 2 iteration and using ln(l-x) - -x - 2 n y = (k+l) logn-log (k + l) logn+0 ( 1 ^ 1 ° gn ) 177 multiplying n by n'p gives the total memory k +2 n ^ti4 n) = "~; ,* ,xi , A/ loglognT RZ ( k fi)logn+log( k -H)logn+0( x g n y > replacing (k+1) by y gives the result. 7.3 The Effectiveness of the R-Method 7.3.1 The S-Method as a Particular Case of the R-Method For the parallel polynomial evaluation method, the function f (x) is approximated by a single polynomial over the entire interval [a,bj. The following two lemmas give an order of magnitude of the growth of the degree of the poly- nomial that approximates f(x) with precision n in [a,b] . It is important to note that the amount of memory will not stay constant. This is simply due to the fact that to "keep up" with the increasing precision, the degree of the poly- nomial must also increase and therefore the number of poly- nomial coefficients must increase to infinity. This is due to a well-known theorem of approxima- tion theory established by Kolmogorov and sometimes referred to as the "n-width theorem. 
" A consequence is that it is not possible to find a sequence of polynomials P (x) of degree y having a finite number of coefficients, such that P (x) converges uniformly to f(x) in [a,b] as n ■+ °°. 178 Lemma 3 If -oo < a < b < °°, the S-method can evaluate a type 1 function f (x) to precision n, uniformly, for all floating point numbers x [a,b] as n ^ oo if the degree \i . (n) of the O J- polynomial approximant is : M SI (n) = n and therefore the amount of memory is ''7'si (n) = 0(n2) Proof : For a type 1 function the number of address- able words is 0(2—) y If 2— is to stay of 0(1) (since there is only one polynomial) this implies —=0(1) or y=n u Each coefficient has word length 0(n) and there- fore: "/77' sl (n) = 0(n 2 ) 179 Lemma 4 If -oo < a < b < °°, the S-method can evaluate a type 2 function f(x) to precision n, uniformly, for all floating point numbers z [a,bj as n ■> » with the degree y ' (n) of the polynomial such that: n R2 n t n , ~ /loglogn, logn-loqlogn+O (— -^ -2—) ^ ^ ^ logn Proof : The proof can be inferred by trying to find such that: n - = 1 M By taking the log: - - log y = y or n y = logy Let us fix n and solve this equation for y. Let us give an initial value to y = n. 3 o n y l logn iterating: n n 2~ logy-, logn-loglogn n loglogn logn|l logn iterating again: n 3 logy. n y 3 = " loglogn logn-logl logn ( 1- logn n loglogn 1 logn-loglogn-log I 1-- logn using 1 2 i n (l-x)«x + ~T x + • • • and developing the log yields u = 3 180 2 , . , loa( loqn) (log( logn)) logn-log( logn)- — + 2 Log2.1ogn 2Log2 . ( logn) which can be written (n) = , , i ^ ^. loglogn , t>2 loan-log ( logn)+ 0( ) loan 181 where loglogn n . - +0 when n->-°° logn Each coefficient has a word length 0(n) and there- fore: 2 n log n - log logn + 0( — £ Q ^ ) Lemmas 3 and 4 give an idea of how fast the degree of the polynomial which approximates f (x) over [a,b] must grow in order to guarantee a precision 0(n). In essence, a type 2 function can be approximated by a polynomial having a degree smaller by a factor of (logn - loglogn) than that required for a type 1 function approximated with the same precision. It can be seen from theorem 1 that for type 2 functions the S-method is a particular case of the R-method y-1 where the number of ROM addresses is 0(n' ) with 7=1. 7.3.2 The Number of Processors, Speed-Up, Cost, and Efficiency for the R-Method The number of processors for the R-method can be finite and independent of p, or can be equal to 2y if Maruyama's algorithm is used. In what follows, however, we 182 will assume that only one multiplier is available after the bank of ROMs so that we can distinguish the effect of the memory approach from that using a parallel polynomial evalu- ation scheme. We assume that Horner's Rule is used for the evaluation of P (x) . V We can now easily obtain, as a function of y and n, the various measures as defined by Kuck for type 1 and type 2 functions. We will use indexes 1 and 2 for the two types of functions. In all of the derivations it is assumed that the address delay is negligible compared to the polynomial evaluation time. Number of Processors The number of processors will be the number of single precision memory words. Speed-up : T q is the number of steps using one serial processor (that is: using one multiplier and Horner's rule). From Lemma 3, T q , = u q , (n) = n 1 n From Lemma 4, t -, = y_ (n) = S2 M S2 logn - loglogn From Theorem I T = y (n) or y (n) Rl R-L KX or I-A, 183 From Theorem II _ _ ,, . ' s , . 
We can now easily obtain, as a function of μ and n, the various measures defined by Kuck [KUC74] for type 1 and type 2 functions. We will use indexes 1 and 2 for the two types of functions. In all of the derivations it is assumed that the address delay is negligible compared to the polynomial evaluation time.

Number of processors. The number of processors will be the number of single precision memory words.

Speed-up. T₁ is the number of steps using one serial processor (that is, using one multiplier and Horner's rule). From Lemma 3, T_S1 = μ_S1(n) = n. From Lemma 4,

$$T_{S2} = \mu_{S2}(n) = \frac{n}{\log n - \log\log n}.$$

From Theorem 1 or Theorem 1-A, T_R1 = μ_R1(n) or μ′_R1(n); from Theorem 2 or Theorem 2-A, T_R2 = μ_R2(n) or μ′_R2(n). Then

$$S_1 = \frac{n}{\mu_1} \qquad\text{and}\qquad S_2 = \frac{n}{\mu_2(\log n - \log\log n)},$$

where μ₁ can be μ_R1 or μ′_R1 and μ₂ can be μ_R2 or μ′_R2.

7.3.3 Comparisons Using One Processor

If, for the R-method, we use the scheme that limits the growth of the number of addresses to n^k (i.e., the degrees μ′_R1(n) and μ′_R2(n) of Theorems 1-A and 2-A), the efficiency ratios can be studied as follows.

Type 1 functions. We have

$$\eta_1 = \frac{P_{S1}\,T_{S1}^2}{P_{R1}\,T_{R1}^2},$$

with

$$P_{S1} = \mathcal{M}_{S1}(n) = n^2, \qquad P_{R1} = \mathcal{M}_{R1}(n) = n\,\mu\,2^{n/\mu}, \qquad T_{S1} = n, \qquad T_{R1} = \mu.$$

Thus

$$\eta_1 = \frac{n^2\cdot n^2}{n\,\mu\,2^{n/\mu}\cdot\mu^2} = \frac{n^3}{\mu^3\,2^{n/\mu}} = \frac{x^3}{2^x}, \qquad x = \frac{n}{\mu},$$

and we want to determine μ(n) such that η₁ > 1. Studying the function η₁(x) = x³/2^x, it is easy to find that it has a maximum at x = 4.328…, equal to 4.036, and that the equation η₁(x) = 1 has two roots, one at x = 1.368… and one at x = 9.937…. Hence

$$\mu(n) = \frac{n}{4.328\ldots} = \Theta(n)$$

will be the most effective choice of the polynomial degree. For μ = Θ(n) it is then possible to find a multiplicative constant such that η₁(x) is greater than 1, and the R-method is more efficient than the method using one polynomial. This tells us that the optimum number of ROM addresses is, for type 1 functions,

$$R_1 = \Theta(1).$$
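The stationary point and unit roots of η₁ are easy to confirm numerically. A small sketch (ours): since d/dx log η₁ = 3/x − ln 2, the maximum sits at x = 3/ln 2 ≈ 4.328, and simple bisection locates the two unit roots.

```python
import math

def eta1(x):
    # Efficiency ratio for type 1 functions: eta_1(x) = x^3 / 2^x
    return x**3 / 2**x

x_max = 3 / math.log(2)            # zero of d/dx log eta1 = 3/x - ln 2
print(x_max, eta1(x_max))          # ~4.328, ~4.036

def bisect(f, lo, hi, tol=1e-12):
    # Basic bisection on a bracketed sign change of f
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(lo) > 0) == (f(mid) > 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

g = lambda x: eta1(x) - 1.0
print(bisect(g, 0.5, x_max))       # small unit root, ~1.37
print(bisect(g, x_max, 20.0))      # large unit root, ~9.94
```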
Type 2 functions. We have

$$\eta_2 = \frac{P_{S2}\,T_{S2}^2}{P_{R2}\,T_{R2}^2},$$

with

$$P_{S2} = \mathcal{M}_{S2}(n) = \frac{n^2}{\log n - \log\log n}, \qquad T_{S2} = \frac{n}{\log n - \log\log n},$$

and

$$P_{R2} = n\,2^{n/\mu}, \qquad T_{R2} = \mu.$$

Therefore

$$\eta_2 = \frac{n^3}{\mu^2\,2^{n/\mu}\,(\log n - \log\log n)^3},$$

and we want to find μ(n) such that η₂ > 1. To study this function we first make the change of variable x = 2^{n/μ}, so that μ = n/log x and η₂ becomes

$$\eta_2 = \frac{n\,(\log x)^2}{x\,(\log n - \log\log n)^3}.$$

We are looking for x such that

$$n\,(\log x)^2 > x\,(\log n - \log\log n)^3.$$

One can show that the solution is approximated by

$$x \le \frac{n}{\log n}\left(1 - O\!\left(\frac{\log\log n}{\log n}\right)\right),$$

and therefore

$$\mu = \frac{n}{\log n - \log\log n + O\!\left(\frac{\log\log n}{\log n}\right)}$$

will satisfy η₂ > 1. This is the same expression for μ as is found in Lemma 4. Consequently, the choice of μ asymptotically corresponds to the S-method and, for type 2 functions, the R-method is asymptotically equivalent to the S-method for the same efficiency.

7.3.4 Comparisons Using p Processors

From Maruyama's bound, the S-method for type 1 functions gives

$$T_S = O(\log\mu) = O(\log n).$$

For the R-method,

$$T_R = O(\log\mu(n)).$$

If the S-method is used to evaluate f(x) with precision n, the total hardware requirement is the amount of memory plus the number of multipliers (processors). (Recall that we are counting single precision multipliers, that is, the size of a multiplier is Θ(1); in the same way we count single precision memory words.) From Maruyama's bound, the number of multipliers is 2μ(n) = 2·O(n) = O(n). From Lemma 3 the amount of memory is O(n²). So, for the S-method,

$$P_S = O(n^2) + O(n) = O(n^2).$$

If the R-method is used, from Lemma 1 the amount of memory is O(n μ 2^{n/μ}) and the number of processors is O(μ(n)), so

$$P_R = O\!\left(\mu\,(1 + n\,2^{n/\mu})\right).$$

The ratio η is then

$$\eta = \frac{n^2(\log n)^2}{\mu\,(1 + n\,2^{n/\mu})\,(\log\mu)^2}.$$

Is there a μ(n) such that η > 1? This would imply that effectiveness is gained by using some Read Only Memory. Again we must study the function η(x), using x = n/μ; dropping lower-order terms,

$$\eta(x) = \frac{x\,(\log n)^2}{2^x\,(\log n - \log x)^2}.$$

We want η > 1, which implies

$$\frac{2^x}{x} < \left(\frac{\log n}{\log n - \log x}\right)^2.$$

The left member is equal to 2 for x = 1 and for x = 2, and has a minimum between 1 and 2. This inequality always has a solution and, because μ must grow to infinity with n, polynomial degrees μ exist such that η(n/μ) > 1.

7.3.5 Comparisons to the Best Known Bound for Function Evaluation

The best known bound for function evaluation has been determined by Brent [BRE74] [BRE76]. He showed, by using ascending Landen transformations and elliptic integrals, that the elementary functions can be evaluated in

$$O(\log n \cdot Y(n))$$

time, where Y(n) is the Schonhage-Strassen multiplication bound. This assumes a single processor and O(n) memory. For the R-method, by evaluating polynomials in parallel in less than 2 log μ time, and by choosing a degree μ(n) such that μ(n) = γ log n, with memory bounded by n^{γ+2}, the speed achievable is

$$O(\log\mu\cdot Y(n)) = O\big(\log(\gamma\log n)\cdot Y(n)\big),$$

which is asymptotically equivalent to the bound given by Brent. It is a slight improvement of Brent's bound, at the expense of O(n^{γ+2}) words of memory and 2γ log n processors.

8. CONCLUSIONS

8.1 Areas of Further Research

Curves of Average Speed vs. Number of ROM Words

A detailed study should be made of the curves representing the average number of multiplications vs. the number of ROM words, for the most common functions and for word lengths up to, say, 120 bits. The curves would be of interest in determining the feasibility of a particular implementation given the design constraints. This study should be done for the three cases where:

- only one multiplier is available;
- a finite number of multipliers, P, is available;
- an unlimited number of multipliers can be used in parallel to evaluate the polynomial.

In previous chapters the curves represented a lower bound, because digitization was not taken into account. It would be useful to have more data on the quantitative effects of digitization on the curves.

Polynomial Approximants

We assumed the most general Chebyshev polynomial in our developments. In practice, however, the polynomials can be chosen with an even or odd degree, depending on the function, or can be economized using telescoping series [BEL68]. This would contribute to a further reduction in memory.

Finding the Joints Simultaneously for Several Functions

To find the position of the joints for a function f₁(x), we built a function Q₁(X,μ) whose minimization gave a solution for the vector of joint locations X for the NLP problem as defined in Section 3. When the hardware implementation of the algorithm for f₁(x) was considered, we saw that a Read Only Memory decoder was needed in order to detect the joints closest to the argument x, and thus determine the number of ROM accesses (i.e., the number of multiplications) necessary to evaluate P^{(i)}(x). Given another function f₂(x) to be implemented, the positions of the joints for this function would likely be entirely different from the positions of the joints for f₁(x); consequently, the same bank of ROMs could not be used for address decoding and joint interval determination for both f₁(x) and f₂(x). Would it be feasible to have the same joints for f₁(x) and f₂(x)? Stated in a more general way, the question raised is: can one solve several of these NLP problems simultaneously? We build a function Q₁(X,μ) for function f₁(x), a function Q₂(X,μ) for function f₂(x), etc., and we try to solve, for X,

minimize Q₁(X,μ)
minimize Q₂(X,μ)      (8.1)
...
minimize Q_P(X,μ)

simultaneously. Fig. 8.1 illustrates this problem for the three functions

f₁(x) = 1/x in [.5, 1], f₂(x) = √x in [.5, 1], f₃(x) = ln x in [.5, 1].

More research in this direction would be useful, in the sense that having a unique set of joints for several functions would considerably simplify the structure of the arithmetic unit. The difficulty lies in minimizing several functions at once. One possibility would be to replace the general problem (8.1) by the simpler scalarized problem (sketched in code below):

minimize (λ₁Q₁(X,μ) + λ₂Q₂(X,μ) + ... + λ_P Q_P(X,μ)),

where the constants λᵢ could be chosen proportionally to the frequency of calls to each function fᵢ(x) in a representative set of scientific programs, for example.
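As an illustration of this weighted-sum formulation, here is our sketch (not from the thesis): the quadratic functions q1, q2, q3 are hypothetical stand-ins for the true objectives Qᵢ(X,μ) of Section 3, and the joint vector X is optimized directly with a derivative-free method.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stand-ins for the per-function objectives Q_i(X, mu):
# each maps a vector X of joint locations in (0.5, 1) to an average
# multiplication count. The real Q_i come from the NLP of Section 3.
def q1(X): return np.sum((X - np.array([0.60, 0.80]))**2) + 2.0
def q2(X): return np.sum((X - np.array([0.70, 0.90]))**2) + 1.5
def q3(X): return np.sum((X - np.array([0.55, 0.85]))**2) + 1.8

lams = np.array([0.5, 0.3, 0.2])   # call-frequency weights, summing to 1

def scalarized(X):
    # Weighted sum replacing the simultaneous minimization (8.1)
    return lams[0]*q1(X) + lams[1]*q2(X) + lams[2]*q3(X)

res = minimize(scalarized, x0=np.array([0.65, 0.85]), method='Nelder-Mead')
print(res.x)   # compromise joint locations shared by all three functions
```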
[Figure 8.1. Optimal determination of the joints for several functions simultaneously: the joint positions XJ(2) and XJ(3) are determined simultaneously for f₁(x) = 1/x, f₂(x) = √x, and f₃(x) = ln x by solving a general NLP problem.]

Care must be taken, because the solution of the general problem (8.1) by a method like Newton's or steepest descent may involve non-square matrices and overdetermined or underdetermined systems: the number of joints is (s − 1) and the number of functions to be minimized is P, and, in general, P ≠ s − 1. We would like to point out that a technique such as the generalized inverse (sometimes called the "pseudo-inverse"; it allows solving a system of the form AX = b where A is a rectangular or even singular matrix, and in the case of the Moore-Penrose inverse the solution has minimum norm in a least-squares sense) may prove useful for solving this kind of problem. IMSL has a subroutine to compute the generalized inverse of a rectangular matrix; a sketch of the idea follows.
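In a modern setting, the IMSL routine's role is played by, for example, NumPy's pseudo-inverse. A toy sketch (ours; all numbers are hypothetical) of solving an overdetermined joint system in the minimum-norm least-squares sense:

```python
import numpy as np

# Hypothetical overdetermined stationarity system A X = b arising when
# P = 3 functions constrain s - 1 = 2 joints, so A is rectangular (3x2).
A = np.array([[1.0, 0.2],
              [0.8, 1.1],
              [0.3, 0.9]])
b = np.array([0.65, 1.40, 0.95])

X = np.linalg.pinv(A) @ b          # Moore-Penrose (pseudo-inverse) solution
X_ls, *_ = np.linalg.lstsq(A, b, rcond=None)   # equivalent least-squares form
print(X, X_ls)                     # the two solutions agree
```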
The effect of a global minimization on the average number of multiplications of each individual function would have to be determined. The whole question of the relative size and growth rates of the address decoding memory versus the coefficient memory also needs to be considered: which one grows faster? The total amount of memory needs to be examined, particularly in a case where the same joint locations are being forced onto several functions.

Theoretical Issues

As mentioned in Section 7, the theorem stating that "a deterministic multitape Turing machine of time complexity t(n) can be simulated by a deterministic Turing machine of tape complexity t(n)/log t(n)" is related to the problem of replacing some of the computations by lookup. What the exact relationship is, and what this theorem tells us about the limits and capabilities of the R-method, remains to be established in a formal way.

The behavior of the multiplicative constants involved in the derivations of Section 7 could be fruitfully investigated. Using this, one could determine, for a given technology and amount of hardware, the value of n at which Brent's algorithm using ascending Landen transformations becomes faster and/or more efficient than the R-method.

Bounds should be derived assuming that the size of the multiplier is allowed to grow with n and is implemented with Read Only Memory, which allows evaluation in O(log n) time; in our derivations we have assumed that a multiplier of fixed precision was used. A bound is needed on the time to multiply two n-bit integers with precision O(n), as well as on the amount of ROM needed to implement such a multiplier. (Bounds already exist if the precision is O(n²) (Kuck).) As memory becomes cheaper and cheaper, it will become possible to design generalized monomial or polynomial evaluators. For example, special purpose hardware using ROMs could be designed to evaluate terms of the form ax^i. Further research on the bound to evaluate ax^i in hardware should be considered.

Finally, the question raised in Section 5 of the possible lattice structure of the various implementations, under the partial ordering relationship "to have an implementation always faster for the same amount of ROM," should be investigated in more detail.

8.2 Summary and Conclusions

We have presented a method for function generation which emphasizes the capabilities of LSI by using large-scale Read Only Memory in conjunction with a parallel multiplier as the basic operator for evaluating polynomials. The design of the function generator is presented as a trade-off between speed of evaluation, expressed in terms of the average number of multiplications, and cost, in terms of the number of Read Only Memory words available. It has been shown that the minimization of the Mean Running Time, with a constraint on the number of ROM words, is a Non-Linear Programming problem. The solution yields the optimal location of the joints that separate the partitions within which the degree of the polynomial approximant is the same. Within each partition there is a large number of subintervals of small size. The polynomial approximating f(x) in each subinterval has a low degree and therefore requires only a few multiplications for evaluation. The average speed is guaranteed to be optimal by the solution of the NLP problem.

The coefficients of all of the polynomials are stored in ROMs. Decoding the address of the coefficients from the argument x presents practical design problems in the VIVD case. The UIVD technique seems the easiest to implement: it has the flexibility of a variable average number of multiplications and the ease of address decoding with uniform intervals. It has been shown that a speed-up of γ log n, or (γ + 1), depending on the type of function, is achievable, and that the amount of memory needed to achieve this speed-up does not grow exponentially but polynomially, as n^{γ+2}. This method could have a wide span of applications, ranging from function generation in microprocessors to high-speed, high-precision arithmetic units. The structure of the arithmetic unit is conceptually simple and relies on Read Only Memory. The control is straightforward and consists essentially of applying Horner's Rule. The method does not require a dedicated multiplier. More work is needed in order to obtain data for a common arithmetic unit for a class of functions, and more data is needed on the amount of ROM vs. speed for large word sizes.

LIST OF REFERENCES

AND67 Anderson, Earle, Goldschmidt and Powers, "Model 91 Floating-Point Execution," IBM Journal, pp. 34-53, 1967.

BAK73 Baker, P.W., "Predictive Algorithms for Some Elementary Functions in Radix 2," Electronics Letters, Vol. 9, No. 21, pp. 493-494, October 1973.

BAU73 Baugh, Charles R. and Bruce A. Wooley, "A Two's Complement Parallel Array Multiplication Algorithm," IEEE Transactions on Computers, Vol. C-22, No. 12, pp. 1045-1047, December 1973.

BEL68 Bell, Richard A. and S.M. Shah, "Oscillating Polynomials and Approximations to Fractional Powers of x," Journal of Approximation Theory, Vol. 1, pp. 269-274, 1968.

BRE74 Brent, Richard P., "The Complexity of Multiple-Precision Evaluation of Elementary Functions," first draft, 48 pp., November 1974.

BRE76 Brent, Richard P., "Fast Multiple-Precision Evaluation of Elementary Functions," Journal of the Association for Computing Machinery, Vol. 23, No. 2, pp. 242-251, April 1976.

BRU73 Breuer, David R., "A High-Speed Monolithic 8x8 Multiplier," unpublished document.
CHE66 Cheney, E.W., Introduction to Approximation Theory, International Series in Pure and Applied Mathematics, 259 pp., McGraw-Hill Book Co., 1966.

CHE74 Cheney, E.W. and T.J. Rivlin, "Some Polynomial Approximation Operators," Center for Numerical Analysis, The University of Texas at Austin, 16 pp., November 1974.

CHI70 Chin, Tung and Algirdas Avizienis, "Combinational Arithmetic Systems for the Approximation of Functions," Spring Joint Computer Conference, pp. 95-107, 1970.

CHU74 Chung, T.J. and S.D. Bedrosian, "Fast Digital Multiplier Based on Iterative Cellular Arrays of ROMs," unpublished document.

DAD65 Dadda, L., "Some Schemes for Parallel Multipliers," Alta Frequenza, Vol. XXXIV, No. 5, 1965.

DEL70 DeLugish, Bruce Gene, "A Class of Algorithms for Automatic Evaluation of Certain Elementary Functions in a Binary Computer," Ph.D. Thesis, Department of Computer Science, University of Illinois, Urbana, Illinois, June 1970.

ERC75 Ercegovac, Milos D., "A General Method for Evaluation of Functions and Computations in a Digital Computer," Ph.D. Thesis, Department of Computer Science, University of Illinois, Urbana, Illinois, August 1975.

FER67 Ferrari, Domenico, "A Division Method Using a Parallel Multiplier," IEEE Transactions on Electronic Computers, pp. 224-226, April 1967.

FLE66 Flehinger, B.J., "On the Probability That a Random Integer Has Initial Digit A," American Mathematical Monthly, pp. 1056-1061, December 1966.

FLY70 Flynn, Michael J., "On Division by Functional Iteration," IEEE Transactions on Computers, Vol. C-19, No. 8, pp. 702-706, August 1970.

HAB70 Habibi, A. and P.A. Wintz, "Fast Multipliers," IEEE Transactions on Computers, pp. 153-157, February 1970.

HAM70 Hamming, R.W., "On the Distribution of Numbers," The Bell System Technical Journal, Vol. 49, No. 8, pp. 1609-1625, October 1970.

HAR68 Hart, J.F., Computer Approximations, John Wiley and Sons, Inc., New York, 1968.

HO73 Ho, Irving T. and Tien Chi Chen, "Multiple Addition by Residue Threshold Functions and Their Representation by Array Logic," IEEE Transactions on Computers, Vol. C-22, No. 8, pp. 762-767, August 1973.

HOP77 Hopcroft, John, Wolfgang Paul and Leslie Valiant, "On Time Versus Space," Journal of the Association for Computing Machinery, Vol. 24, No. 2, pp. 332-337, April 1977.

IBM68 System/360 Scientific Subroutine Package (360A-CM-03X) Version III Programmer's Manual.

IRW77 Irwin, Mary Jane, "An Arithmetic Unit for On-Line Computation," Ph.D. Thesis, Department of Computer Science, University of Illinois, Urbana, Illinois, June 1977.

JOH73 Johnson, N., "Improved Binary Multiplication System," Electronics Letters, Vol. 9, No. 1, pp. 6-7, January 1973.

KAN73 Kaneko, Toyohisa and Bede Liu, "On Local Roundoff Errors in Floating-Point Arithmetic," Journal of the Association for Computing Machinery, Vol. 20, No. 3, pp. 391-398, July 1973.

KIN71 Kingsbury, N.G., "High-Speed Binary Multiplier," Electronics Letters, Vol. 7, No. 10, pp. 277-278, May 1971.

KIO75 Kiostelidis, John B. and John K. Petrou, "A Piecewise Linear Approximation of log2 x with Equal Maximum Errors in All Intervals," IEEE Transactions on Computers, Vol. C-24, No. 9, pp. 858-860, September 1975.

KUC74 Kuck, D.J., "On the Speedup and Cost of Parallel Computation," Proceedings on the Computational Complexity of Problem Solving, The Australian National University, December 1974.

LAK76 Lakshmi, Goyal, "A Study in the Design of an Arithmetic Element for Serial Processing in an Iterative Structure," Ph.D. Thesis, Department of Computer Science, University of Illinois, Urbana, Illinois, 1976.
LIN70 Ling, H., "High-Speed Computer Multiplication Using a Multiple-Bit Decoding Algorithm," IEEE Transactions on Computers, Vol. C-19, No. 8, pp. 706-709, August 1970.

LUT65 Luttmann, F.W. and T.J. Rivlin, "Some Numerical Experiments in the Theory of Polynomial Interpolation," IBM Journal, pp. 187-191, May 1965.

MAR73 Maruyama, K., "On the Parallel Evaluation of Polynomials," IEEE Transactions on Computers, Vol. C-22, pp. 2-5, January 1973.

MUN73 Munro, I. and M. Paterson, "Optimal Algorithms for Parallel Polynomial Evaluation," Journal of Computer and System Sciences, Vol. 7, pp. 189-198, 1973.

PEZ71 Pezaris, Stylianos D., "A 40-ns 17-Bit by 17-Bit Array Multiplier," IEEE Transactions on Computers, pp. 442-447, April 1971.

RIV74 Rivlin, Theodore J., The Chebyshev Polynomials, John Wiley and Sons, New York, 186 pp., 1974.

ROB76 Robertson, J.E., "Class Notes on Computer Arithmetic," Department of Computer Science, Urbana, Illinois, Fall 1976.

SCH71 Schonhage, A. and V. Strassen, "Schnelle Multiplikation grosser Zahlen," Computing, Vol. 7, pp. 281-292, 1971.

SIN73 Singh, Shanker and Ronald Waxman, "Multiple Operand Addition and Multiplication," IEEE Transactions on Computers, Vol. C-22, No. 2, pp. 113-120, February 1973.

SNI77 Snir, Marc and Amnon B. Barak, "A Direct Approach to the Parallel Evaluation of Rational Expressions with a Small Number of Processors," IEEE Transactions on Computers, Vol. C-26, No. 10, pp. 933-937, October 1977.

STE72 Stefanelli, R., "Binary Read-Only-Memory Dividers," unpublished correspondence, 1972.

STE77 Stenzel, Wm. J., Wm. J. Kubitz and Gilles H. Garcia, "A Compact High-Speed Parallel Multiplication Scheme," IEEE Transactions on Computers, Vol. C-26, No. 10, pp. 948-957, October 1977.

SWE65 Sweeney, D.W., "An Analysis of Floating-Point Addition," IBM Systems Journal, Vol. 4, No. 1, 1965.

SZE59 Szego, Gabor, Orthogonal Polynomials, American Mathematical Society Colloquium Publications, Volume XXIII, American Mathematical Society, New York, 1959.

TOD63 Todd, John, Introduction to the Constructive Theory of Functions, Birkhaeuser Verlag, Basel and Stuttgart, 127 pp., 1963.

TOR75 Torrero, Edward A., "Focus on Semiconductor Memories," Electronic Design, 7, pp. 98-107, April 1975.

TSA74 Tsao, Nai-Kuan, "On the Distributions of Significant Digits and Roundoff Errors," Communications of the ACM, Vol. 17, No. 5, pp. 269-271, May 1974.

WAL64 Wallace, C.S., "A Suggestion for a Fast Multiplier," IEEE Transactions on Electronic Computers, pp. 14-17, February 1964.

APPENDIX

A.1 Program Listings

[The program listings of the original appendix did not survive digitization and are omitted here.]
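Chapter 4 describes the penalty-function method that the listed programs implemented. As a stand-in for the lost listings, here is a minimal modern sketch (ours, not the thesis's code: the objective avg_mults and the ROM-count model rom_words are illustrative placeholders, not the actual formulas of Section 3) of minimizing an average multiplication count subject to a ROM-word budget via a quadratic exterior penalty:

```python
import numpy as np
from scipy.optimize import minimize

R0 = 4096.0   # maximum number of ROM words available (the cost constraint)

def avg_mults(X):
    # Placeholder for the mean multiplication count as a function of the
    # joint-location vector X; the true objective comes from Section 3.3.
    return 3.0 - 1.2 * X[0] + 0.8 * X[1] ** 2

def rom_words(X):
    # Placeholder model of the ROM words consumed for joints X.
    return 3000.0 + 2500.0 * X[0] + 1500.0 * X[1]

def penalized(X, r=100.0):
    # Quadratic exterior penalty: feasible points incur no penalty.
    violation = max(0.0, rom_words(X) - R0)
    return avg_mults(X) + r * violation ** 2

res = minimize(penalized, x0=np.array([0.3, 0.3]), method='Nelder-Mead')
print(res.x, avg_mults(res.x), rom_words(res.x))
```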
A.2 Distribution of Floating Point Numbers Outside the Interval [1/β, 1]

For the sine and exponential functions, the range cannot easily be reduced to [1/β, 1]. If we assume instead that the range has been reduced to [0,1], the arguments have the form

$$X = x\cdot\beta^{e}.$$

The distribution of the exponent e is problem dependent: for some problems e may have a very narrow range of possible values, while in another application there may be wide variations in the exponent. The only thing that is known is the distribution 1/(x ln β) of the mantissas x.
Nothing general is known about the distribution of the exponents of floating point numbers [SWE65], and consequently the only choice is to assume that the e's have a uniform distribution. Representing X between 0 and 1 requires values of e varying between 0 and a large negative number, which we will take to be −∞ for all practical purposes. What, then, is the probability density function of the argument X?

For X ∈ [1/β, 1], e = 0; for X ∈ [1/β², 1/β], e = −1; for X ∈ [1/β³, 1/β²], e = −2; and, in general, for X ∈ [1/β^{i+1}, 1/β^i], e = −i.

Within each one of these intervals, the probability density of the mantissas x is 1/(x ln β). Consequently, the probability density of X for X ∈ [1/β^{i+1}, 1/β^i] is

$$p(X) = \frac{1}{X\ln\beta}\cdot\frac{\beta-1}{\beta^{i+1}},$$

where (β − 1)/β^{i+1}, the length of the interval [1/β^{i+1}, 1/β^i], is the probability that a uniformly distributed number falls in that interval. The probability density function for numbers between 0 and 1 is sketched in Fig. A.1. This is the distribution that must be used for solving the NLP problem in the cases f(x) = e^X and f(x) = sin((π/2)X).

It is interesting to note that, if several polynomial degrees are chosen, it is better in this case to put the lowest degree polynomials close to 1 and the highest degree polynomials close to 0: X is found more often at the right of the interval [0,1], and the probability of finding X close to 0 gets smaller as X tends to 0. The cumulative probability between 1/β^{i+1} and 1/β^i is

$$\frac{\beta-1}{\beta^{i+1}}\int_{1/\beta^{i+1}}^{1/\beta^{i}}\frac{dX}{X\ln\beta} = \frac{\beta-1}{\beta^{i+1}},$$

which tends towards 0 as i increases. The average becomes smaller as X → 0.

[Figure A.1. Probability density function of floating point numbers between 0 and 1; x is the mantissa.]

A.3 Limit on the ROM Savings Coefficient When n → ∞

We have seen in Section 3.3.2 that the "ROM Savings Coefficient," defined as

$$V(a,b) = \frac{1}{(b-a)\,\max_{[a,b]}g}\int_a^b g(x)\,dx, \qquad g(x) = \left(f^{(m+1)}(x)\right)^{\frac{2}{2m+3}}$$

(here the weight function w(x) = 1), was obtained as the ratio

V(a,b) = (number of breakpoints in [a,b] using variable intervals) / (number of breakpoints in [a,b] using uniform intervals).

The smaller V(a,b) is, the greater the savings in ROM. We are concerned here with whether or not this ratio increases as the polynomial degree increases. In other words: as m → ∞, is it still worth using variable intervals to save memory? We already know that, to first approximation, g(x) does not depend on n. We will consequently find an approximate expression for g(x) and look at the limit of the integral when m, the degree of the polynomial, tends to ∞. As in Section 7, we will consider type 1 and type 2 functions.

Type 1 functions. Take f(x) = e^x, for which f^{(m+1)}(x) = e^x and g(x) = e^{2x/(2m+3)}. Writing u = 2(b − a)/(2m + 3), we have u → 0 as m → ∞, and using e^u − 1 = u + u²/2! + ⋯,

$$V_1(a,b) = \frac{e^{u}-1}{u} = 1 + \frac{u}{2!} + \frac{u^2}{3!} + \cdots$$

Therefore V₁(a,b) → 1 when m → ∞. Consequently, for f(x) = e^x, as m → ∞ the ROM savings from using variable rather than uniform intervals tend to be null.

It is possible to show that the same thing occurs for the functions sin and cos. Choosing f(x) = cos x, for example, the (m+1)-st derivative of cos x is, up to sign, either cos x or sin x, so [f^{(m+1)}(x)]² is either (cos x)² or (sin x)², and V(a,b) can be expressed as

$$V(a,b) = \frac{1}{(b-a)(\cos a)^{2/(2m+3)}}\int_a^b (\sin x)^{2/(2m+3)}\,dx$$

or

$$V(a,b) = \frac{1}{(b-a)(\cos a)^{2/(2m+3)}}\int_a^b |\cos x|^{2/(2m+3)}\,dx.$$

We know that

$$\int_0^{\pi/2}\sin^{\,r-1}x\,dx = \int_0^{\pi/2}\cos^{\,r-1}x\,dx = \frac{\sqrt{\pi}\,\Gamma(r/2)}{2\,\Gamma(r/2+1/2)};$$

applying this formula with r = 2/(2m+3) + 1 and letting m → ∞ (so that r → 1) yields

$$V(0,\pi/2) = \frac{2}{\pi}\cdot\frac{\sqrt{\pi}\,\Gamma(1/2)}{2\,\Gamma(1)},$$

but Γ(1/2) = √π and Γ(1) = 1; consequently V(0,π/2) = 1, and the ROM savings is null. We conjecture here that, for type 2 functions, the ROM savings limit is null when the polynomial degrees tend to infinity.
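A quick numerical check of this limit (our sketch; it evaluates the closed form V₁(a,b) = (e^u − 1)/u derived above for f(x) = e^x):

```python
import math

def rom_savings_exp(a, b, m):
    # V_1(a,b) = (e^u - 1)/u with u = 2(b-a)/(2m+3), for f(x) = e^x
    u = 2.0 * (b - a) / (2 * m + 3)
    return math.expm1(u) / u

for m in (1, 2, 5, 10, 50, 100):
    print(m, rom_savings_exp(0.5, 1.0, m))   # tends to 1 as m grows
```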
VITA

Gilles Henri Garcia was born in France in 1947. He received degrees in Electrical Engineering from the University of Toulouse in 1970 and from the University of Paris in 1972. With the aid of a Fulbright Grant, he came to the University of Illinois at Urbana-Champaign, where he received his M.S. and Ph.D. in Computer Science in 1976 and 1978, respectively.

From 1973 to 1977, Mr. Garcia was employed as a graduate teaching and research assistant by the Department of Computer Science and the Institute of Aviation at the University of Illinois. He has co-authored a paper in IEEE Transactions on Computers. From 1975 to 1977, Mr. Garcia was also associated with the Marketing Department of the Illinois Bell Telephone Company in Chicago as a consultant. He is a member of the ACM and of the Société des Ingénieurs de l'École Supérieure d'Électricité. Mr. Garcia is currently employed by Schlumberger Well Services, Log Analysis Software, in Houston, Texas.