Digitized by the Internet Archive in 2013
http://archive.org/details/measuringknowled786mich

UIUCDCS-R-76-786

MEASURING THE KNOWLEDGE-CONTENT OF PROGRAMS

by

Donald Michie

May 1976

DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS 61801

CONTENTS

Introductory outline
Basic definition: f
Basic definition: K
Basic definition: L
Basic definition: e
Purity of formulations
Properties of p in relation to f
Motivation and informal remarks
Technical need
Assumptions and restrictions
Information about a function
Knowledge of a function
Partitions of memory
The predicate "known"
Knowledge in a heuristic
ω-concepts and β-concepts
Simplified and refined forms
Simplified theory: definitions
Information concerning f
Knowledge-content of the look-up program p_T
Knowledge-content of the solution program p_A
Difference between information-content and knowledge-content
Knowledge-content of advice program and knowledge-overhead of control program
Refinements: partial (truncated) evaluation
Refinements: error
Simplest case of error: no gradations, no partial evaluation
Gradations but no partial evaluation
Partial evaluation, no gradations
General case: gradations of error in partial evaluation
Refinements: differential argument-utilities
Some β-independent quantities
The memo-function as a model of rote-learning
Telegram-factory model of concept learning
Learning, rote-skill and grasp
Concluding remarks
Acknowledgments
References
Appendix 1: Relation to Aristotle's theory of knowledge
Reference
Appendix 2: What to substitute for S
Reference

Introductory outline

An approach is described for defining and numerically measuring a system's knowledge of a finite function f, and for estimating the net knowledge-increment obtainable from a given body of advice. The key idea is that a system "knows" only what it can evaluate (whether by calculation or by look-up) in user-acceptable time. The role of advice is to make it know more. Knowledge of f is expressed quantitatively in information-bits (not to be confused with store-bits). If β denotes an upper bound to the time available for evaluating f for any argument, the information-content of what is β-evaluable is the amount of knowledge, K, about f.

The envisaged application is to the evaluation of a recursively defined function by an algorithm A implemented as a program p_A, to which a body of advice B is made available in the form of a knowledge-base p_B, through the mediation of a control program p_C. p_B is a kind of data-base (parts of which may on occasion be executed as program). A typical action of the advice contained in it is to alter p_A's termination condition in such a way as to cause earlier exit from the recursion without altering the computed result.

Advice theory offers a cost-benefit analysis of p_B and of its constituents. Benefit is expressed in units of knowledge. Cost is attributed to store occupancy and is expressed in store-bits. All measures are defined with respect to the machine specification S and the solution program p_A. The measured benefit-to-cost ratio for p_A alone is termed the system's algorithmic advantage.
The measured net knowledge-increment conferred by p_B, divided by its store-occupancy, is the system's advisory advantage relative to p_A. The system's computational advantage is the total knowledge (algorithmic plus advisory, less an overhead associated with the operations of p_C) divided by the store-cost (store occupancy of p_A plus store occupancy of p_B plus store occupancy of the advice-management routines p_C). The term penetration is used when advantage is expressed on a logarithmic scale.

Computational yield is defined as the ratio of the actual amount of knowledge to the maximum attainable with the given S and store-cost; the usefulness of this measure is restricted to the case where the store available is less than or equal to that which just permits complete knowledge of f (i.e. less than or equal to f's β-complexity, see below). More versatile measures are effectiveness, defined as the ratio of the actual to the maximum advantage, and efficiency, defined as the ratio of the minimum to the actual store occupancy for given K.

Knowledge-increase is seen as occurring not only through input of advice from the user but also through the system's storage of instances of, or (more powerfully) inductive generalization over, past evaluations of f. These processes can all modify or add to p_B so as to increase its knowledge-content. They can also, through restructuring or deleting existing advice, reduce p_B's store occupancy. In either case advisory advantage is increased through purely internal reorganization. We thus have a domain-independent way of measuring knowledge-acquisition in machine learning, following a scheme of which the memo-function (Michie, 1967) is a primitive example.

By a restriction of Solomonoff-Kolmogorov complexity we obtain β-complexity as a numerical measure of a problem-domain's difficulty. Extensions of advice theory deal with partial evaluation, evaluation error, and differential utilities of evaluating f for different arguments.
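The memo-function cited above (Michie, 1967) can be sketched as a wrapper that migrates computed (x, f(x)) pairs into a look-up store, so that later calls are answered within user-acceptable time without recomputation. This is a minimal modern illustration; the Python dict standing in for the rote part of p_B is our own device, not the paper's notation:

```python
def memo(f):
    """Wrap f so that each computed (x, f(x)) pair is stored for
    instant look-up on later calls: rote-learning in miniature."""
    table = {}                      # plays the role of the rote part of p_B
    def remembered(x):
        if x not in table:          # not yet "known" by look-up
            table[x] = f(x)         # evaluate once, then store
        return table[x]
    remembered.table = table        # expose the store for inspection
    return remembered

def factorial(n):
    return 1 if n == 0 else n * factorial(n - 1)

fact = memo(factorial)
fact(6)                             # computed on first call, then stored
```

After the first call, evaluation of fact(6) is a pure look-up: the pair has moved from the implicitly computable to the explicitly stored.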
With this last extension, in place of "pure" knowledge we speak of "useful" knowledge. Computational advantage and penetration as earlier defined divide correspondingly into "pure" and "useful" forms (each consisting as before of algorithmic and advisory parts), and similarly for yield, efficiency and any other quantities compounded from the primitive definitions.

The rate at which p (p_A, or p_A plus p_B plus p_C, as the case may be) can deliver complete information about f is independent of β, and is termed its computational capacity, expressed in information-bits/sec. Dividing by the bit-cost of p gives its computational power: the number of store-equivalents of information about f delivered per second. This is a measure of a system's overall performance in its knowledge-delivery as opposed to knowledge-acquisition aspect. The latter is treated by an extension of the theory addressed to measures of the performance of rote-learning and inductive generalization programs.

Two instances of f, one numerical and one non-numerical, are mentioned, namely finite-factorial and bestmove (in chess). Reference is also made to a probabilistic generalization by I. J. Good (1976) of the main definition. In anticipation of the body of the paper we give below a brief snapshot of the primitive notions.

Basic definition: f

f: X → Y, where X is an ordered domain of size N and Y, the co-domain, is of size M.

Basic definition: K

A program p's "useful knowledge" about a finite function f is measured in information-bits as

    K = Σ_{i=1}^{N} u_i × surprisal(Y_i)

where u_i is the normalized frequency with which p is invoked to evaluate f for the i-th member of X, and Y_i is the candidate set remaining after evaluating f(x_i), using p, for just β seconds. Surprisal (Samson, 1951) is log_2 (1/p_i), where p_i is the prior probability that an arbitrarily chosen member of Y will fall into Y_i. Evaluation is assumed to have the form of a successive approximation so that Y → Y_{i,1,p} → Y_{i,2,p} → ...
in successive cycles, 0, 1, 2, ... of the computation.

Basic definition: L

L is the number of store-bits occupied by the program p. L_K is the bit-cost of a program containing K bits of useful knowledge. L*_K is the minimum possible L for any such program; L_β is the minimum possible L for a program containing complete knowledge, i.e. K = I(f). K* is the maximum possible K for any program of no more than L store-bits. I(f) is the total information-content of f, calculated in a way to be described later.

Basic definition: e

The extraction time, e, is the time needed by the system to evaluate f fully over the whole of its domain.

Purity of formulations

We do not propose to make a fetish of purity, in particular when it comes to quarantining abstract objects such as "algorithms", "arguments" and the like from their counterparts in the world of concrete embodiments, such things as "programs" and "inputs". We have taken the trouble above to write "... frequency with which p is invoked to evaluate f for the i-th member of X ..." when we could easily have written "... frequency with which p is applied to the i-th member of X ..." (strictly speaking the latter would be wrong: one cannot apply a concrete object to an abstract one). But we do not undertake to observe these laws of quarantine on occasions when law-breaking offers large gains of brevity or perspicuousness. If mathematical taste is thereby offended we are sorry. But communication with the reader who is willing to "see what we mean" has, rightly or wrongly, been placed first.
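The basic definition of K can be transcribed literally, under the simplifying assumption that the prior probabilities are taken from a complete frequency table of f. The function names and the toy table below are our own illustration, not the paper's:

```python
from math import log2

def surprisal(candidates, freq, N):
    """-log2 of the prior probability that a randomly chosen member
    of Y falls in the candidate set, priors taken from the table of f."""
    p = sum(freq[y] for y in candidates) / N
    return -log2(p)

def useful_knowledge(u, candidate_sets, freq, N):
    """K = sum over i of u_i * surprisal(Y_i)."""
    return sum(u_i * surprisal(Y_i, freq, N)
               for u_i, Y_i in zip(u, candidate_sets))

# toy f over a domain of size 4, values drawn from {a, b}
freq = {'a': 3, 'b': 1}             # n(y): occurrences in the table of f
N = 4
u = [0.25, 0.25, 0.25, 0.25]        # all invocations equally frequent
# every evaluation here runs to completion, leaving a unit candidate set
Y_sets = [{'a'}, {'a'}, {'a'}, {'b'}]
K = useful_knowledge(u, Y_sets, freq, N)
```

With these figures the rarer value 'b' contributes 2 bits of surprisal against roughly 0.415 bits for each 'a', so K comes to about 0.81 information-bits per invocation.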
Properties of p in relation to f:

    Quantity                      Units              Definition
    Information-content (of f)    Information-bits   I(f)
    Knowledge (of p about f)      Information-bits   K
    Store-cost                    Store-bits         L
    Computational advantage       Pure number        C = K/L = (K_A + K_B + K_C)/(L_A + L_B + L_C)
    Penetration                   Decs               D = log10 K - log10 L
    Yield                         Pure number        K/K*
    Efficiency                    Pure number        L*_K/L
    Effectiveness                 Pure number        (K/L) × (L_β/I(f))
    Advisory advantage            Pure number        C_B = K_B/L_B = (K_r + K_c)/(L_r + L_c)
    Advisory penetration          Decs               D_B = log10 K_B - log10 L_B
    Operating cost                Bit-seconds        L × t
    Learning index                Decs/bit-second    ΔD_B/(L_U × t_U)

K and L denote the knowledge-content and bit-cost of a given p, and L_β is the β-complexity of f, i.e. the minimum bit-cost for a program possessing total knowledge of f, I(f) in amount. L*_K is the β-complexity of that subset of f covered by K, i.e. of the partial function f_K. In the index of learning, L_U denotes the store-cost and t_U the processor time associated, when running the program (p_A, p_B, p_C, p_U), with p_U, a procedure which uses results computed by (p_A, p_B, p_C) to update p_B. Subscripts r and c denote "rote" and "concept" respectively, distinguishing two forms in which information about f may be stored in p_B.

It is not expected that the foregoing "snapshot" will be immediately intelligible: the rest of the paper can be regarded as an attempt to make it so.

Motivation and informal remarks

In AI we talk about "knowledge", but would like to progress beyond the mere slogan. Is "knowledge" that which is explicitly stored? At first sight this seems reasonable. For example, it takes care of the following kind of paradox. A number theorist knows a great deal more than he did when he first learned the axioms of arithmetic. Yet everything he knows, and infinitely more besides, is strictly deducible from these axioms. Therefore, there cannot be anything new for him to have gained. What, then, is this extra stuff which he knows?
Presumably it consists of explicitly stored conclusions which previously were only implicitly available to him. Since everything that number theory has done with the integers since God made them has been strictly redundant, we conclude that the only new value created has been that of accessibility. The point of new knowledge is to make less time-consuming the retrieval of old truth. According to the explicit/implicit idea, I know the definition of a prime number and I know that 23 is prime. But I do not know that 233 is prime, although I can quickly work it out.

At second sight, the criterion of explicit storage is not quite so reasonable. We suggest that the notion of explicit storage is handy only because it correlates highly (in most information processing systems) with the speed with which the needed information can be retrieved. On this more general criterion a desk-top calculator which calculates factorials in a second has as much claim to a "knowledge" of the function as a similarly priced table-look-up machine. This is essentially the Aristotelian position (see Appendix 1) and we shall adopt it here. Knowledge must be of something, and for definiteness it will always be of a finite mathematical function.

Technical need

Information theory allows us to specify the quantity of information contained in a message and to calculate the least space in a computer memory required to store a minimal encoding of it. Computation theory allows us to say what functions are computable in principle, and whether particular programs compute them. Complexity theory can show, for the evaluation of certain simple functions, how some measures of computational cost can be minimized.
What we lack is a calculus which will enable us to say: not merely how much information about a function f is stored in a computing system; not merely whether and how the system could in principle compute f; not merely how the computation might be optimized with respect to specified cost criteria; but how much knowledge of f the system possesses. "Knowledge" should be defined in such a way as to be compatible with the informal use of the word in the machine intelligence literature. The definition, and any calculus based upon it, must be capable of characterizing the dynamical aspects of knowledge: its acquisition, transfer, and use. The essence of our approach will be to regard knowledge as the practically retrievable part of the total information possessed by the system about f.

Assumptions and restrictions

We measure information in the normal way, and express it in bits. In order to avoid confusion we shall use the word "store-bit" to mean a binary digit in machine memory. We exclude from strict consideration any function (e.g. a real-valued function or a function of a real variable) which may require infinitely many bits to represent a value or an argument. But we do not exclude the finite functions by which such functions may be approximated. Examples (the first chosen for its simplicity and the second for its complexity):

    finite-factorial: X → Y, where x ∈ non-negative integers, x ≤ xmax, and y ∈ positive integers
    bestmove: legal chess-positions → legal moves

Consider a member of this class, f: X → Y. Our approach is to regard "information" as a measure of certain properties of a static representation of f, conceived simply as a message which encodes f. In this classical information-theoretic world there are no store-time trade-offs, and the information-content of a message is equal to the minimal number of bits required to store it explicitly.
By contrast, "knowledge" measures certain properties of the evaluation of f, whether by look-up from store, by calculation, or by any mixture of the two. In this world we can, and typically do, encode long messages in short programs. How much do such programs "know"?

Information about a function

We picture f as laid out in extensional (tabular) form, comprising a sequence of (x, y) pairs, N in number. The first member of each pair is drawn from the ordered domain X, and the second from the co-domain Y. A message of length N, where N = size(X), has thus been constructed, the values y_1, y_2, y_3, ..., y_N being regarded as a sequence of choices from an alphabet Y. Assuming equiprobable choice, the information content of such a message is given by Hartley's (1927) expression N log_2 M, where M = size(Y). This follows from the fact that there are exactly M^N possible functions which could be constructed with the given domain and co-domain. Allowing for different frequencies of the different y-values, we have

    - Σ_{j=1}^{M} n_j log_2 (n_j / N)

where n_j is the number of occurrences of the j-th member of Y. An equivalent, and for our purposes more convenient, formulation is obtained by summing over the domain rather than the co-domain of f, thus:

    - Σ_{i=1}^{N} log_2 (n_i / N).

Suppose, now, that only k of the N entries have been inscribed in the table. Then the table's information content is

    - Σ_{i=1}^{k} log_2 (n_i / N),

where n_i is the number of occurrences in the complete table of (x, y) pairs whose right-hand element matches that of the i-th entry of the reduced, k-sized table. This expression is the basis for measuring the information-content of a "partially known" function.

Knowledge of a function

The test of "knowledge" is the ability to answer questions: "What is the finite-factorial of 6?", "What is the best continuation of this chess position?" If an answer can be produced within some time limit, then the system is held to "know" it.
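The two formulations above, the sum over the co-domain and the sum over the domain, can be checked to agree on a small table. This is a sketch with names of our own choosing:

```python
from math import log2
from collections import Counter

def info_codomain(ys):
    """I(f) = -sum_j n_j * log2(n_j / N), summed over distinct y-values."""
    N = len(ys)
    return -sum(n * log2(n / N) for n in Counter(ys).values())

def info_domain(ys):
    """Equivalent form: -sum_i log2(n(y_i) / N), one term per table row."""
    N = len(ys)
    n = Counter(ys)
    return -sum(log2(n[y] / N) for y in ys)

# right-hand column of a toy table T with N = 4, M = 3
ys = ['a', 'a', 'b', 'c']
```

For this table both expressions give 6 bits, against the Hartley maximum of N log_2 M ≈ 6.34 bits for equiprobable values.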
As preamble to an informal definition, consider the respective tasks of answering (1) "What is the finite-factorial of 6?", and (2) "What is the finite-factorial of 10^6?" In the second case it would seem that since the answer is so much harder to get, there must be more information obtained. This is not so. An answer conveys -log_2 (n_i / N) bits of information (its "surprisal"), and that is the end of the matter. But it is true that either

(i) the system is more likely to deliver an answer to (1) than to (2) within an acceptable waiting time (our definition of knowledge will allow us to say that the system is more likely to "know" (1) than (2)), or

(ii) if obliged to reply within a time limit, the system will be able to answer (1) to a higher degree of approximation than (2).

A definition which permitted degrees of knowledge of f(x) would allow us to say that the system has more knowledge of (1) than of (2). Such a definition is possible and has interesting consequences. These are discussed later.

Partitions of memory

We subdivide machine memory into six sectors, dedicated respectively to the following purposes:

1. Storage of the input argument ("data space").
2. Storage of a solution program p_A for the function f ("function space").
3. Storage of data generated and used by the evaluation algorithm in the course of, and solely for the purposes of, the given evaluation. This "work-space" is empty on entry and cleared on exit.
4. Storage of an advice program p_B (the "knowledge base" or "body of advice"), which modifies the action of p_A, typically by causing earlier exit without affecting the computed result. We refer to this sector of memory as "advice space".
5. Storage of a control program p_C for making p_B available to p_A.
6. Storage of an update program p_U for reorganizing p_B between evaluations, at the lowest level by rote-operations, at the highest by inductive generalization.
In most conventional computing, machine memory is partitioned among 1, 2, and 3 only. Our view of the distinguishing characteristics of an intelligent system is that the key components are precisely the last three. Systems exhibiting such a morphology as the above are at the dawn of their evolution. Specimens are scarce, and highly evolved specimens non-existent. The best known example is probably Buchanan, Feigenbaum and Lederberg's Heuristic DENDRAL (for recent reviews see Michie and Buchanan, 1974; Buchanan, 1975). A system of the same kind for chess end-games would map onto the six sectors as follows:

1. Current board position.
2. Lookahead routine (every chess program has one).
3. Lookahead tree.
4. Body of advice, consisting of chess heuristics to guide and prune the lookahead routine.
5. Routines to access and apply the heuristics.
6. A learning routine using accumulating samples of the extensional representation of bestmove, as they are discovered by the lookahead, to add new heuristics and modify old ones.

The predicate "known"

A predicate known has as arguments a function f, an input argument x, a machine specification S (see Appendix 2), programs p_A, p_B and p_C corresponding to 2, 4 and 5 above, and a time bound β. known(f, x, p_A, p_B, p_C, S, β) is true iff f(x) can be evaluated in no more than β time units. If, in the given advice state, f(x) is known for all x, then we say that f is β-evaluable on the given machine by the given (p_A, p_B, p_C). If f(x) is known only for some x we speak of the "known part" of the function. There is thus an extension here of the familiar notion of a partially defined function, except that we deal with partially known functions. In the new system, which incorporates the time dimension, "known" corresponds to a subset of the concept "defined" in the old system, but unlike "defined" changes dynamically. Thus, a system's knowledge of a function is the information-content of its β-evaluable extent.
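The predicate known can be sketched as follows. This is our own simplification: a dict of stored pairs stands in for p_B, the machine specification S is whatever interpreter runs the code, and a completed evaluation is timed after the fact rather than interrupted at the cut-off:

```python
import time

def known(f, x, beta, advice=None):
    """Sketch of known(f, x, p_A, p_B, p_C, S, beta): true iff f(x)
    can be delivered in no more than beta seconds.  A look-up hit in
    the advice store is assumed to cost far less than beta."""
    if advice is not None and x in advice:
        return True                     # immediate look-up
    start = time.monotonic()
    try:
        f(x)                            # evaluate by calculation
    except RecursionError:              # evaluation diverged on this machine
        return False
    return time.monotonic() - start <= beta
```

Note how the same f(x) can pass from unknown to known purely by an entry appearing in the advice store, which is the dynamical behaviour the text attributes to "known" as against "defined".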
We accordingly say that the number of units of knowledge of f is equal to the minimal number of bits of information required to specify that part of f which is β-evaluable, i.e., how much is known.

Knowledge in a heuristic

Consider (by way of example) Huberman's (1968) treatment of chess programming in certain elementary end-games, where p_A stands for her "forcing tree" lookahead routine and p_B stands for her logic formulae expressing "better" and "worse" predicates. We wish to measure the knowledge-content of individual heuristics in p_B. Proceed as follows (in principle: in practice we would sample from the domain, not chart it exhaustively).

1. With the heuristic loaded, run p_A for each x in X, and record the number of bits of information corresponding to the known part of the function.
2. Unload the heuristic and repeat.
3. The difference between the two measurements gives the amount of knowledge about f present in the heuristic.

The knowledge-gain has to be traded against the heuristic's storage cost: the ratio of gain (in information-bits) to cost (in store-bits) is the advisory advantage.

ω-concepts and β-concepts

We are turning our backs on "in principle" concepts of computability, completeness, contradiction and so forth. In so far as we need to refer to them we can speak of "ω-computability," "ω-completeness," "ω-contradiction," etc. Instead of these concepts we envisage a calculus of "β-concepts." A system is β-complete if every question admissible within the system can be answered in the given state of the advice in no more than β time units; a set of assertions is β-inconsistent if a contradiction can be β-derived and β-consistent if it cannot (an ω-consistent body of assertions may be β-consistent or it may be β-inconsistent); an assertion is β-redundant if its addition leaves unaffected what can be β-derived.
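Returning to the three-step procedure for measuring the knowledge in a heuristic, it can be expressed directly once the system is abstracted as a function from an advice state to the set of arguments it makes β-evaluable. All names and the toy figures here are ours:

```python
from math import log2
from collections import Counter

def known_bits(known_xs, table):
    """Information-content of the known part of f: the sum of
    surprisals -log2(n(y)/N) over the beta-evaluable arguments."""
    N = len(table)
    n = Counter(table.values())
    return sum(-log2(n[table[x]] / N) for x in known_xs)

def heuristic_knowledge(evaluable, table, heuristic):
    """Steps 1-3: bits known with the heuristic loaded, minus bits
    known with it unloaded."""
    return (known_bits(evaluable(heuristic), table)
            - known_bits(evaluable(None), table))

# toy system: loading the heuristic makes two further arguments evaluable
table = {1: 'a', 2: 'a', 3: 'b', 4: 'c'}
def evaluable(h):
    return {1, 2} if h is None else {1, 2, 3, 4}

gain = heuristic_knowledge(evaluable, table, heuristic="h1")
```

Dividing gain by the heuristic's cost in store-bits would give its advisory advantage.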
The theory evidently has implications for the special case of function evaluation known as automatic deduction, where the concept of a β-theorem presents itself and claims relationship to what is usually called "belief" (see Meltzer, 1970, for the notion of logical validity subject to a depth limit on the deduction).

We do not worry at this stage about the exact magnitude of the bound β. For concreteness we can suppose that the system must answer a question, if it can at all, within a few seconds or minutes rather than milli-seconds or days, and that it has access to a store of, say, a million, rather than a thousand or a billion, locations, with access times in the microsecond to millisecond range. Design and implementation of an "advice language" would be desirable, with facilities for handling and coordinating bodies of advice. But the case for developing and testing the theoretical measures is independent of these possibilities.

Simplified and refined forms

An account of the theory follows. First a simplified form is presented from which all issues likely to distract from the central exposition are omitted, specifically the treatment of truncations, errors and utilities. These elaborations are then introduced to complete the theory. A bare framework of some elementary objects and notions is pictured in Figure 1.

Simplified theory: definitions

Information concerning f

We associate with a function f a recursively defined evaluation algorithm A and a function table T. f maps an ordered finite domain X of size N into a finite co-domain Y of size M. T is the ordered set of pairs {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} such that the set of the left-hand elements is equal to X and for each pair f(x) = y.
If the right-hand elements of T, taken in the order y_1, y_2, ..., y_N, are interpreted as the symbols of a message composed from an alphabet Y, a natural expression for the amount of information associated with f is

    I(f) = -N Σ_Y p(y) log_2 p(y) bits,     (1)

where p(y) is the relative frequency of occurrence of y in the message. We prefer a re-formulation as a sum over the N symbols of the message of the "surprisals" associated with the individual symbols, defined for the i-th symbol as -log_2 p(y_i). This gives us

    I(f) = - Σ_{i=1}^{N} log_2 (n(y_i) / N),

where n(y_i) is the number of occurrences of y_i as a right-hand element of T. This may be rewritten

    I(f) = N log_2 N - Σ_{i=1}^{N} log_2 n(y_i) bits.     (1A)

Knowledge-content of the look-up program p_T

A program p_T implements T as a look-up table for evaluating f on a machine of specification S. Let β be an upper bound to the evaluation time of f for any x in X. Suppose that look-up consumes arbitrarily little time on each occasion and in any case always less than β. With the role of β thus rendered vacuous, the system's knowledge of f is equal to its total stored information about f, subject to the aforesaid restriction relative to β. Thus p_T's knowledge-content is

    K_T = N log_2 N - Σ_{i=1}^{N} log_2 n(y_i).     (2)

Note that the numerical value is equal to that of I(f) and also to the store-occupancy in binary digits of a minimal-length version of p_T.

Knowledge-content of the solution program p_A

Now consider a different program p_A for evaluating f. p_A implements A on a machine of the same specification S. Let β-evaluable be a predicate which is true for just those x's for which f(x) can be evaluated within this bound, and let π denote the null program. Then

    X_A = {x | x ∈ X ∧ β-evaluable(f, x, p_A, π, π, S)}

and T_A = {(x_i, y_i), (x_j, y_j), ..., (x_k, y_k)} such that the set of the left-hand elements of T_A is equal to X_A (which has N_A members) and, for each pair, f(x) = y. Clearly X_A ⊆ X and T_A ⊆ T.
We call the function f_A associated with T_A the "algorithmically known" part of f. p_A's knowledge-content is thus

    K_A = N_A log_2 N - Σ_{i=1}^{N_A} log_2 n(y_i),     (3)

where N_A is the size of T_A.

Difference between information-content and knowledge-content

Clearly the amount of information associated with A is the same as that associated with T, namely I(f). Yet their respective implementations differ widely in knowledge-content. For K_A to equal I(f) either β must be arbitrarily increased or S so altered that the machine runs arbitrarily fast, allowing the system to solve all problems in X within user-acceptable time.

Note some connexions with the Chaitin-Solomonoff-Kolmogorov definitions (see Chaitin, 1975) of randomness and complexity. If p*_A is a minimal-length implementation of A, then its store-occupancy in bits, which we shall write L_min(A), is equal to what in Solomonoff and Kolmogorov's (Chaitin, 1975) terminology would be called the complexity of T, and in ours its ω-complexity (ω stands for an arbitrarily long time-allowance for reconstructing the whole of T from p_A). The maximum attainable compression ratio for T is, in our notation,

    I(f) / ω-complexity(f) = L_min(T) / L_min(A),     (4)

where L_min(T) is a minimal encoding of T as a look-up table for an abstract machine of the same specification as that on which p*_A can be executed. We will later see that this ratio defines the maximum computational advantage attainable for the evaluation of f (e.g. by Martians, who think infinitely fast and never need advice).

Computational advantage is defined as an information ratio: information-bits divided by store-bits. It is the number of bits of β-retrievable information per bit of store. We give the name penetration to the logarithm of this ratio. Computation is a trick for exploiting the time dimension to achieve penetration into f, subject to the limitation β. If we optimize a pure look-up system, we get computational advantage equal to unity, zero penetration.
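The two quantities just defined can be sketched numerically; the helper names and the figures are ours:

```python
from math import log10

def advantage(K, L):
    """Computational advantage: information-bits of beta-retrievable
    knowledge per store-bit."""
    return K / L

def penetration(K, L):
    """Penetration in decs (decimal compressions): log10 K - log10 L,
    i.e. log10 of the computational advantage."""
    return log10(K) - log10(L)

# an optimized pure look-up system stores exactly what it knows:
lookup = advantage(4096, 4096)      # advantage 1, zero penetration
# a computing system retrieving a millionfold more than it stores:
decs = penetration(10**9, 10**3)    # 6 decs
```

On this scale the "40 3-minute decs" of the next paragraph would correspond to an advantage of 10^40.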
Penetration is expressed in "decs" (decimal compressions). "40 3-minute decs", for example, characterizes a system which computes, in less than three minutes for each input, an f whose T would require n bits for explicit storage, but which manages with only 10^-40 × n. The informational economics of chess are such that a Grandmaster might constitute a realistic example.

Knowledge-content of advice program and knowledge-overhead of control program

On a machine of specification S which runs the solution program p_A, an advice program p_B implements a body of advice B. Making the advice available to p_A requires the action of control routines p_C. In the simplified theory we assume a time penalty associated with the invocation of p_C which imposes an overhead on every attempted evaluation of f. Consider the ordered sets

    X_AC = {x | x ∈ X ∧ β-evaluable(f, x, p_A, π, p_C, S)}

and T_AC = {(x_i, y_i), (x_j, y_j), ..., (x_k, y_k)} such that the set of left-hand elements of T_AC is equal to X_AC and for each pair f(x) = y. Since the invocation of p_C on each call of p_A consumes time, but obtains no return (because the advice program is null), the predicate β-evaluable may give the result false for some members of X_A (for all of which it formerly gave true). Hence X_AC ⊆ X_A and T_AC ⊆ T_A. Correspondingly the quantity

    K_AC = N_AC log_2 N - Σ_{i=1}^{N_AC} log_2 n(y_i)

is interpretable as the amount of algorithmic knowledge of f less the knowledge-overhead incurred by p_C.

Now replace the null program by p_B and consider the ordered sets

    X' = {x | x ∈ X ∧ β-evaluable(f, x, p_A, p_B, p_C, S)},

which has size N', and T' = {(x_i, y_i), (x_j, y_j), ..., (x_k, y_k)} such that the set of left-hand elements of T' is equal to X' and for each pair f(x) = y. We form T_B (which has N_B members) as T' - T_AC and define the quantity of advisory knowledge of f relative to p_A as

    K_B = N_B log_2 N - Σ_{i=1}^{N_B} log_2 n(y_i)     (5)
The system's overall knowledge of f is measured as

    K = N' log_2 N - Σ_{i=1}^{N'} log_2 n(y_i) = K_A + K_B + K_C     (6)

where K_C, a negative quantity, is the knowledge-overhead associated with the control program, and is measured as K_AC - K_A. Note that corresponding to T' is a partial function f_K which we call the known part of f, and that f_K can be further partitioned into f_AC and f_B corresponding to T_AC and T_B respectively.

Now if the store-occupancies of p_A, p_B, p_C are L_A, L_B, L_C, termed respectively algorithmic cost, advisory cost and overhead, then

    L = L_A + L_B + L_C = overall store-cost     (7)

    C_A = K_A / L_A = algorithmic advantage     (8)

    C_B = K_B / L_B = advisory advantage relative to p_A     (9)

    C = K / L = (K_A + K_B + K_C) / (L_A + L_B + L_C) = overall computational advantage     (10)

Note that all these measures of advantage can be thought of as compression ratios relative to the smallest store-occupancies which could be attained if f_A, f_B and f_K were represented in pure look-up form.

Let L_β be the β-complexity of f relative to S', i.e. the smallest store cost for which complete knowledge of f could be obtained with a machine specification S' differing from S only in having unlimited store. Let L_max be the total program space available in S and let L_max ≤ L_β. If K* is the maximum amount of knowledge about f attainable for store cost L_max (not for fixed p_A, which we are free to rewrite in any way whatsoever), then

    K / K* = the computational yield of the system     (11)

Another interesting quantity is I(f)/L_β, which gives the maximum attainable computational advantage for a system limited by the β bound; compare with I(f)/L_min(A) cited earlier for Martians.

Consider now K, L_A for some program and define L*_K as the minimum store in which K units of knowledge about f can be represented, i.e. L*_K is the β-complexity of the partial function f_K.
Then since K / L*_K is the maximum attainable computational advantage for given K, the computational efficiency, defined as

    (K / L) / (K / L*_K) = L*_K / L    (12)

can be regarded as the ratio of an actual to an ideal advantage.

Dropping the subscript K to make room, if L*_A is the length of a minimal p_A representing K_A units of knowledge about f, and L_A is the length of an actual p_A with the same K_A, then

    L*_A / L_A = the algorithmic efficiency of p_A    (13)

Define L*_B as the minimum store which can contain a program p_B with K_B units of advisory knowledge relative to p_A. Then

    L*_B / L_B = the advisory efficiency of p_B    (14)

A similar quantity, effectiveness, relates p's advantage K/L to the maximum advantage attainable by a program with maximum K (i.e. I(f)) rather than by a program with given K. We then define effectiveness as

    (K / L) / (I(f) / L_min)    (15)

Applications of this expression are illustrated graphically in Figure 2.

Refinements: partial (truncated) evaluation

If evaluation of f for the i-th member of X is interrupted by the β cut-off before completion, then in the foregoing treatment the system was allotted a zero score for its knowledge of f(x_i). This simplification is now replaced by the following, in which we no longer speak of the "known part" of f, since some knowledge adheres to every part (the treatment is closely related to Good's (1976) discussion of the "knowledge that 12^3 = 1728"). Let the evaluation have the form of a successive approximation in which an initial candidate set of results is reduced in succeeding cycles, ending in a unit set consisting of f(x_i) when evaluation goes to completion, but otherwise in a residual set of candidates containing f(x_i). The initial candidate set is the entire co-domain Y. In each time-interval a program p transforms this set according to the sequence

    Y_{i,1,p} → Y_{i,2,p} → ...

such that Y ⊇ Y_{i,1,p} and Y_{i,j,p} ⊇ Y_{i,(j+1),p}. Y_{i,β,p} is the candidate set at the moment of β cut-off.
We require to quantify the information (surprisal) associated with Y_{i,β,p}, so that we can measure the system's overall knowledge of f as Σ_i surprisal(Y_{i,β,p}). Where evaluation is always complete, so that Y_{i,β,p} = {y_i}, we have seen that the information gained from evaluating f(x_i) is −log2 p(y_i) = −log2 (n(y_i)/N), where n(y_i) is the number of occurrences of the symbol y_i as a right-hand element in T. Where y_i ∈ Y_{i,β,p} and Y_{i,β,p} may have more than one member (i.e. the evaluation, although incomplete, is correct as far as it has gone) we write

    −log2 ( Σ_{y ∈ Y_{i,β,p}} n(y) / N ).

The system's overall knowledge of f is the sum of all the surprisals, namely

    K = Σ_{i=1}^{N} ( log2 N − log2 Σ_{y ∈ Y_{i,β,p}} n(y) )    (16)

To individualize the above to the knowledge-content of given programs p_T, p_A, p_B, p_C we substitute for p, thus:

    K_T = N log2 N − Σ_{i=1}^{N} log2 Σ_{y ∈ Y_{i,β,p_T}} n(y),

and similarly to construct definitions of K_A, K_AC and K, using p = p_A, p = (p_A, p_C) and p = (p_A, p_B, p_C).

Refinements: error

In the simplified theory either an evaluation is complete, or it contributes zero knowledge. But now suppose that for some x_i a program p completes the evaluation of f, but incorrectly, so that apply(x_i, p) ≠ f(x_i). Note that in our scheme error can only arise from the action of p_B; p_A is assumed ex hypothesi to be a correct implementation of A. To calculate the knowledge associated with the application of p to x_i we multiply the surprisal of f(x_i) by a coefficient g, −∞ < g ≤ 1. g is given by a function defined over the space of pairs (f(x), apply(x, p)). The user can supply any function he pleases provided that (i) for all x ∈ X, g = 1 iff apply(x, p) = f(x); (ii) for all x ∈ X, a "guessing program" p_G (which selects its output at random from among the M members of Y with relative frequencies p(y_1), p(y_2), ..., p(y_M), Σ p(y_i) = 1) has zero mathematical expectation for g × surprisal(apply(x, p_G)).
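The measure (16) reduces, for complete evaluations, to the earlier table-based formula. A minimal sketch in Python; the toy function table and candidate sets are our invention, introduced only to illustrate the arithmetic:

```python
import math

def knowledge(table, candidate_sets):
    """Knowledge per equation (16): for each argument x_i, the surprisal
    log2 N - log2(sum of n(y) over the candidate set Y_i)."""
    N = len(table)
    # n(y): number of occurrences of y as a right-hand element of the table
    n = {}
    for _, y in table:
        n[y] = n.get(y, 0) + 1
    return sum(math.log2(N) - math.log2(sum(n[y] for y in Y_i))
               for Y_i in candidate_sets)

# A toy function table T for f over a four-element domain X.
T = [(0, 'a'), (1, 'a'), (2, 'b'), (3, 'c')]

# Complete evaluation: every candidate set is the singleton {f(x_i)};
# the measure then reduces to N log2 N - sum_i log2 n(y_i).
complete = [{'a'}, {'a'}, {'b'}, {'c'}]
K_complete = knowledge(T, complete)

# Truncated evaluation of x = 0: the candidate set still holds two symbols.
truncated = [{'a', 'b'}, {'a'}, {'b'}, {'c'}]
K_truncated = knowledge(T, truncated)
assert K_truncated < K_complete  # partial evaluation yields less knowledge
```

With this table K_complete comes out as 6 information-bits, and the truncated run loses exactly the surprisal deficit of its two-member candidate set.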
Regarding the evaluation of a finite function as analogous to the play of a finite one-person game, g is the outcome value (g = 1 corresponds to a "win") and g × −log2 (n(y_i)/N) is the payoff. This terminology differs somewhat from Good's (1976: his Examples 1–5), who associates different "utilities" with correct and erroneous results of function evaluation. We reserve this term (see later) for discrimination between different members of X ("argument-utilities"), and not for discrimination between different possible results of evaluating f(x) for given x.

Simplest case of error: no gradations, no partial evaluation

In the simplest case, the knowledge-content for a correct result is g × −log2 (n(y_i)/N), where y_i = f(x_i) and g = 1. An incorrect result is assigned a negative payoff g × −log2 (n(y_i)/N) with g = −n(y_i)/(N − n(y_i)); this value of g satisfies condition (ii), namely that p_G's expected payoff is zero, as can be verified from the following tabulation.

    type of result | frequency when computed as apply(x, p_G) | contribution to expectation (= frequency × payoff)
    correct        | n(y_i)/N                                 | (n(y_i)/N) × −log2 (n(y_i)/N)
    incorrect      | (N − n(y_i))/N                           | −(n(y_i)/N) × −log2 (n(y_i)/N)

Gradations but no partial evaluation

In the more general case differentiation is made between different degrees of discrepancy between computed and true values. In chess, for example, an erroneous evaluation may be thought more of a failure if the computed result changes a won position to a lost one than if it changes a won position to a drawn one. The user again supplies his own penalty function, provided that it satisfies conditions (i) and (ii).

Partial evaluation, no gradations

Now consider error in the case of a truncated evaluation which yields a set Y_{i,β,p} of size m_i. This set may or may not contain f(x_i). To specify the knowledge-content of p allowing for error, we introduce a guessing program p_G which for each given x_i gives a candidate set also of size m_i.
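Condition (ii) in the simplest case can be checked numerically. The following sketch, with an invented frequency distribution n(y), verifies that the guessing program's expected payoff vanishes under g = −n(y)/(N − n(y)):

```python
import math

# Frequencies n(y) of each result symbol in a hypothetical table of size N.
n = {'a': 5, 'b': 3, 'c': 2}
N = sum(n.values())

def expected_payoff(y_true):
    """Expected payoff of the guessing program p_G for a fixed true result:
    with probability n(y_true)/N it guesses correctly (g = 1); otherwise it
    incurs g = -n(y_true)/(N - n(y_true)).  Payoff = g * surprisal, where
    surprisal = -log2(n(y_true)/N)."""
    surprisal = -math.log2(n[y_true] / N)
    p_correct = n[y_true] / N
    g_wrong = -n[y_true] / (N - n[y_true])
    return p_correct * 1 * surprisal + (1 - p_correct) * g_wrong * surprisal

for y in n:
    assert abs(expected_payoff(y)) < 1e-9  # condition (ii): zero expectation
```

The cancellation is exact for any distribution: the (n/N)·surprisal gained on correct guesses is exactly offset by the penalty on wrong ones.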
We suppose that p_G proceeds by making random deletions from an initially complete function table T, until just so many (x, y) pairs remain that the union of their right-hand members forms a set of size m_i — the candidate set Y_{i,β,p_G}. Initially there are n(y_i) pairs in T with y_i as right-hand member. The probability that after d successive deletions no pairs of this kind remain can be calculated from the contingency table:

                  right-hand member = y_i    right-hand member ≠ y_i
    undeleted             0                         N − d                N − d
    deleted             n(y_i)                    d − n(y_i)               d
                        n(y_i)                    N − n(y_i)               N

Fisher's "exact" method (Statistical Methods for Research Workers) gives the probability as

    d! (N − n(y_i))! / ( N! (d − n(y_i))! )

Denoting this quantity by 1 − P, we now construct a penalty function as shown in Table 1.

[Table 1, setting out this penalty function, is not legible in this copy.]

General case: gradations of error in partial evaluation

In Table 1 g can be interpreted as a proximity, with value 1 for zero difference between computed and true values and −P/(1 − P) for any other difference. Generalizing this, we first consider a proximity
function which, in the expression θ(y, f(x_i)) where y ∈ Y_{i,β,p}, yields a value for g, −∞ < g ≤ 1. g expresses the "proximity" of y to the correct result. A truncated evaluation which yields a candidate set can in this context be viewed as a question-answerer which produces, in response to the question x_i, a number of possible answers of varying degrees of correctness. To calculate the corresponding knowledge-content, two alternatives present themselves.

(1) We can summarize the candidate set in terms of the expected value, given this set, of θ(apply(x_i, p), f(x_i)), where apply(x_i, p) is the output value which would be obtained by running the program p to completion on the input x_i. A statistical approach of this kind demands that the user have some model of the computational process which successively shrinks the candidate set.

(2) We can adopt a "worst-case" position and use a penalty function pf defined by

    pf(Y_{i,β,p}) = min over y ∈ Y_{i,β,p} of θ(y, f(x_i)).

In other words we take the proximity of the candidate set to the correct result as being given by that of its most distant member.

Refinements: differential argument-utilities

As just shown, different answers to a given question rate different values according to their proximity to the correct answer. The user may also, and independently, attach different values to the receipt of answers to different questions.
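The deletion probability obtained earlier by Fisher's exact method is just the hypergeometric probability that all n(y_i) relevant pairs fall among the d deleted rows. A sketch confirming the closed form; N, n(y_i) and d are invented:

```python
from math import comb, factorial

def prob_all_deleted(N, n_yi, d):
    """Probability that d random deletions from an N-row table remove all
    n_yi rows whose right-hand member is y_i: of the C(N, d) equally likely
    deletion sets, those containing every y_i-row number C(N - n_yi, d - n_yi)."""
    return comb(N - n_yi, d - n_yi) / comb(N, d)

def fisher_form(N, n_yi, d):
    """Fisher's closed form from the contingency table:
    d! (N - n_yi)! / (N! (d - n_yi)!)."""
    return factorial(d) * factorial(N - n_yi) / (factorial(N) * factorial(d - n_yi))

N, n_yi, d = 12, 3, 7
assert abs(prob_all_deleted(N, n_yi, d) - fisher_form(N, n_yi, d)) < 1e-12
```

Expanding the binomial coefficients shows the two expressions are identical term by term.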
Argument-utilities will be regarded as fixed — corresponding to the notion of "importance." This simplifying assumption is made for exposition only, since they could in practice perfectly well fluctuate (dependent on the progress of some problem-solving process) and thus act as "relevance" weightings. An example of the latter case would be the evaluation of bestmove for a chess position containing Queens at a stage when the Queens had already been swapped off. For such an x as this the system could afford not to know f(x) for the time being. Argument-utilities are introduced as weighting coefficients thus:

    K = Σ_{i=1}^{N} u_i surprisal(Y_{i,β,p}).

A normalizing constraint demands that the distribution of argument-utilities has unit mean: this relates the amount of "useful" knowledge to the amount of "pure" knowledge. We use the qualified term "pure" knowledge in a case where the u_i's are omitted or, equivalently, set uniformly equal to unity. Without prejudice to the user's right to define his own utility function, the natural approach is to assume utilities to be directly proportional to the frequencies with which p is called for the different x_i's (compare Good, 1976).

Some β-independent quantities

If one does not like the idea of imposing all-or-nothing cut-offs after the expiration of β seconds, there are other measures relevant to the notion of knowledge which can be applied to the evaluation of f. We can introduce a time-base e, the time required for the system to evaluate f for every x in X (the "extraction time"). We then define computational capacity as I(f)/e, measured in bits per second. Decomposition into algorithmic and advisory components can be performed as before. Likewise I(f)/(L·e) measures computational (algorithmic plus advisory) power; the units are store-equivalents/sec. Power is thus a dynamic version of advantage. Some β-independent measures are tabulated below.
    Extraction time    seconds                   e
    Capacity           baud (bits/sec)           I(f)/e
    Power              store-equivalents/sec     I(f)/(L·e)
    Turnover time      seconds                   L·e/I(f)

Each of the above is separately measurable for the algorithmic, p_A, and advisory, p_B, components of p — giving "algorithmic capacity" and "power" and "advisory capacity" and "power".

The harshness of the β cut-off idea can be softened in another way. Introduce a multiplier r which is a function of elapsed time t, and rewrite the expression of the preceding section as

    K = Σ_{i=1}^{N} r_i u_i surprisal(Y_{i,t,p})    (16A)

If r is a step function such that r_i = 1 for t ≤ β and r_i = 0 otherwise, then K is maximized when t = β, which corresponds to the earlier treatment. Now let r be some other function, monotonically decreasing in
Only if the rote-search fails is the body of the procedure entered. The aim is to improve evaluation time. It wi 1 1 immediately be observed that if the procedure's initial knowledge of f is restricted by a p,-bound on evaluation time then the intro- duction of a rot ory confers an increase of knowledge - precisely measurable in our system. In effect, the memo- function instantiates Fiqure ; for p p a dynamically varied fragment of Pj. The rote 31 consists solely of advice at "ground level" - i.e. no abstraction, only selection, from f's function-table. The algorithmic procedure (memo-ised by the library routine "newmemo") corresponds to p., while the library package itself consists of p c (for applying the rote) and p.. (for updating it). With repeated application of p« to inputs sampled from the function's domain, (x, y) pairs accumulate in the rote and evaluation times decrease. Gains accrue not only when the input argument itself is found to be already on the rote, but also from rote-knowledge of results required in recursive calls: if you want to evaluate factorial (17) it helps if you already know factorial (16) - or even factorial (6). By our definitions, then, e-evaluability increases, i.e. the system's advisory knowledge of f . Note that the advisory cost also increases as the rote grows. If we "freeze" the rote at some stage, paralyzing the rote-updating mechanism, we can directly apply the earlier techniques to the knowledge associated with p R at the moment of freezing. Rut how are we to measure the cost-effectiveness not of the behavior of the program as a device for knowing about f (we have seen how to do this) but its behavior as a device for getting to know more about f? The following approach seems workable. Measurement of an amount of knowledge is concerned with pronerties of (p., p R , p f ). 
Measurement of rates of increase of the amount is concerned with properties of p.., the update program, in particular its caDacity to enhance the value to the user of p R . For a given p.. we want to have some measure of its effectiveness in doing this, and of the cost incurred. First we must decide precisely what attribute of p R to use in the numerator of p 's ratio of effectiveness to cost. Clearly it should be related to the increase of p B 's advisory advantage conferred by p..: a p.. conferring a small knowledge-gain but a large cost-gain upon p R must not 32 be allowed to rate high on effectiveness. We have chosen the looarithm of advisory advantage, i.e. advisory penetration, expressed in decs ("decimal compressions"). We shall use the symbol D R for this quantity. The denominator must be a function of p..'s store-cost and the time spent by the processor in p (|f both of which olainly contribute to the total cost of running p,. over the learning period. We must also consider the costs associated with running p« in conjunction with p R and p c to provide rote entries in the first place. We accordingly define a learning index: aD r R = r-f- (17) L U L U L.. is the store-cost of p.., and t, is the time spent in p.. by the processor during the learning period. The meaning of R is "increase in advisory penetration conferred by the operation of p.., relative to the cost of its operation." The concomitant operation of the rest of the program (p., p R , p«) also has a cost, which can be thought of as the cost of the "practice evaluations" separate from the learning cost. We have devoted space to the economics of rote-learning, because we shall later require to do the same for the economics of "concent-learning" ■ i.e. to measure the capability of a given p., to enhance d r 's value to the user in the case where p R is not the rote of a simple memo-function, but contains more organized information abstracted from f's function-table. 
We shall then use the rote-learning formulation as a base-line for measurement of the ability to acquire conceptualized as opposed to merely rote knowledge . 33 Telegram-factory model of concept-learning A concept learning system is a device for the synthesis of cheap but effective inter-galactic telegrams. The stimulus for this deliberately mysterious remark is a passage from 6. J. Chaitin (1975): "Suppose you have a friend who is visiting a planet in another galaxy s and that sending him telegrams is very expensive. He forgot to take along his tables of trigonometric functions, and he has asked you to supply them. You oould simply translate the numbers into an appropriate code (such as the binary numbers) and transmit them directly , but even the most modest tables of the six functions have a few thousand digits, so that the cost would be high. A much cheaper way to convey the same information would be to transmit instructions for calculating the tables from the underlying trigonometric formulas, such as Euler's i x equation e = cos x + l sin x. Such a message could be relatively brief, yet inherent in it is all the information contained in even the largest tables." The mathematical work which Chaitin is here popularizing, carried out by himself, by R. J. Solomonoff and by A. N. Kolmogorov, has been concerned with measuring the "complexity" of a message as the minimal number of store-bits required to hold a program capable of reconstructing it. In this theory, the inter-galactic telegram is to be run as a program on the receiver's machine. Provided that it terminates in finite time no one cares how long it takes to perform the computation. But consider the following scenario, diagrammed in Figure 3. Our distant friend has to play in a chess tournament, and asks us to send him, not the trigonometric functions, but the bestmove function for chess. He may suppose that the tournament rules demand, as on Earth, that moves be made without undue delay. 
Neither of the f ol lowing two responses to his request will earn any marks: (1) to send the complete look-up table, p T in our notation. The 50 total bit- cost of the telegram will be in the region 10 . 34 (2) to follow the Chai tin prescription and send a minimal program embodying an evaluation algorithm (in our case say the classic Zermelo- Borel-von Neumann total-lookahead algorithm), a minimal p A in our notation. In the case of (2) all the information has certainly been sent - at a cost of only about 10 4 bits. But our friend will object that we have not put him in possession of much useful knowledge . What should we have sent him? Plainly a telegram comprising p. and n R and n An argument based on the maximum rate of transfer of information into human long-term memory suggests that a Grandmaster is unlikely to have acquired more than 10 10 bits of chess information for his p R . So although the telegram will be costly compared to one consisting of o. alone, its cost is a droo in the ocean of p T 's cost. But like p T , and unlike unaided p«, it is executable within acceptable waiting time. We now ask: "What is the shortest inter-galactic telegram which will enable the recipient to 8-evaluate hestmove for every x in X?" P.v obvious analogy with Chaitin-Solomonoff-Kolmogorov complexity (which we refer to as ^-complexi ty) we term the length of the shortest such telegram the s-complexity of hestmove (always of course with resnect to S). In contrast with u>- complexity, the interest of which is ourely mathematical, v/e find in s-complexity an explication of the notion of intrinsic intellec- tual difficulty as a measurable (in principle) property of a problem domain. The "bestmove" function for a large version of the game of Him, say with 1G heaps of 1000 matches each, has approximately the same information- content as for chess. Rut the corresponding p. 
(implementing the veil- known parity rule of play) is so fast to evaluate that s-comolexity can hardly differ from .-complexity and is small. So 16 x 1000 Nim has little 35 intrinsic difficulty. But here is the continuation of the previously quoted passage from Chaitin. "Suppose, on the other -hand, your friend is interested not in trigonometry but in baseball. He would like to know the scores of all the major-league games played since he left the earth some thousands of years before. In this case it is most unlikely that a formula could be found for compressing the information into a short message; in such a series of numbers each digit is essentially an independent item of information, and it cannot be predicted from its neighbors or from some underlying rule. There is no alternative to transmitting the entire list of scores. " In this case co-complexity is high, and a fortiori so is B-comnlexity. Answering questions about baseball scores would be, at the least, a difficult task. Escalating the inter-galactic fantasy, suppose now that apneals from our distant friend for this or that finite function become so frequent that consideration must be given to the economics of automatically synthe- sizing new telegrams to his requirements. The resulting set-up is depicted in Figure 4. The recipient's machine, as before, is cost-free but the sender's machine is not, and must operate as a cost-effective telegram- factory. For each new synthesis the sender machine starts with p., n- and some p B (possibly null) ready loaded. It is thus already in a position to emit either of two unsatisfactory answers: to transmit a telegram consisting merely of p«, or to run p» over the whole domain in order to generate and send p T> The first would be good for the sender, bad for the receiver; the second good for the receiver, bad for the sender. 
To produce the right answer the brunt of the sender machine's task falls on p« and p.., respectively to explore fragments of f ("practice") and to generalize over p.'s discoveries to build a body of advice p R . At some stage a halt is called, and the telegram (p., p B , p c ) is transmitted, p.. stays at home. The 36 total cost is the manufacturing cost plus the cost of inter-galactic transmission. Manufacturing cost in turn decomposes into learning cost (Lt for runs of p..) and cost of practice (Lt for the associated runs of (p A » P r » Pr))« An illustrative miniature example of a p.. operating in something like this style is Negri's (1976) use of an inductive learning program to synthesize a p p able to discriminate won from drawn positions in the K + P v. K chess end-game. One envisages its use coupled with a p A which looks ahead until positions of known game-theoretic values are encountered. Lookahead would initially go to the end of the game. The final telegram could be the mul tivariable logic expression induced by the program. Learning, rote-skill and grasp To justify on economic grounds the addition of an advisory package (p R , p_) to a solution program it is not enough, or even to the point, to show that the package confers increase of knowledge: it must confer enhanced computational advantage or (to say the same thing logarithmically) enhanced penetration. Note that we earlier defined learning in terms of increase of penetration over successive trial evaluations, not in terms o f increase of knowledge . Even a decrease of knowledge, if accompanied by a disproportionate decrease in p 's bit-cost, could qualify as a form of learning in our scheme. The critical inequality, which justifies a niven modification of p R , or indicates that a step of positive learning has occurred, is Kg, * L c r- ! r^ 1» or p quivalently D n < D D , where R relates to the 1 x K B BR to its modified form. 
The ratio on the le f t abo ve 37 is the "relative advisory effectiveness" of p DI with respect to p D . D D With this in mind, let us dissect advisory advantage into two dissimilar components, one associated with a rote-dictionary in the style of a memo-function and the other associated with a "concept-dictionary," in which certain of the (x, y) "facts" discovered in past evaluations have been condensed into descriptional form. The "ground-level" is distinguished from the "conceptualized" component: K K Rote advantage = r— ; conceptual advantage = y-S- r c We shall call the logarithms of these quantities "rote-skill" and "grasp" respectively. For some purposes it will be useful to write advisory advantage in the form K + K , r + , c , especially when considering the profit-and-loss r c score of assimilating rote elements into the concept-dictionary. A consequence of the foregoing system of definitions is that an ordinary recursive procedure for the factorial function is credited with penetration but with neither rote-skill nor grasp . The memo-ised version of the same procedure can claim rote- learning , defined in our system as the autonomous improvement of penetration specifically by increase of rote- skill. Improvement through use of the memo-function's facility for direct assignment of values by the user to the rote exemplifies teaching by rote . If after some regime of rote-learning and/or rote-teaching we decide to suspend the action of p., and to investigate whether the memo-function in the form in which it has been frozen shows signs of having profited from 38 the regime, then we shall pronounce positively only if we find that penetration has indeed been increased beyond the base-line set by p.. If additionally the memo- function' s up-date program p.. 
were of itself to impose a beneficial structural reorganization upon the rote, for example by introducing dynamic grouping of result- values into intervals as in Marsh (1969) or (to be fanciful) by discovering Stirling's approximation and using it to replace in p B all values larger than a certain threshold by the corresponding predicate, then and only then could we say not only that this particular (p„, p B , p~) exhibited penetration in the form of rote-skill and grasp, but that the associated p.. had exhibited concept-learning. A recapitulation of some of these ideas is set out in Table 2. 39 Qual ity or aptitude Knowledge Advantage Penetration Rote-skill Grasp Difficulty (e-complexity) Rote-learning Parameter-learning Concept-learning Informal explication Retrievable information. Ability to retrieve more than is stored. Logarithm of advantage. Penetration attributable to storage of rote-facts. Penetration attributable to storage of descriptions. Size of smallest program able to confer complete knowledge. Increase of rote-skill by storing results of past evaluations. Increase of grasp by adjusting already stored descriptions in the light of past evaluations. Increase of grasp by abstracting new descriptions from past evaluations. Table 2. Recapitulation of some main ideas. An intermediate level of learning ("parameter learning," not discussed in the text) has been inserted. Examples of this mode are found in early AI work such as that of Samuel (1959) and of Selfridge (1959), and in many modern adaptive control systems. A "description" is an intensional definition of a (typically non-unit) set. Concluding remarks In the pre-scientific era, lack of a system of numerical definitions concerning mass and motion hindered the development of the mechanical arts. An analogous situation prevails today with regard to machine intelligence and provides incentive for the present work. Detailed mathematical and experimental development is required. 
The latter could usefully include the programming of a cluster of interrelated subdomains (for example chess end-games) using an invariant p.. Knowledge and learning parameters of 40 different p 's and p 's could thus be systematically compared over a range of cases. Acknowledgements I am indebted to I. J. Good, Martin Kruskal , David Plaisted and Boris Tsybakov for helpful discussion and in particular to Rod Fletcher, John McCarthy and Michael Tanner who spotted errors in earlier versions. The Department of Computer Science at the University of Illinois provided me with facilities for a substantial part of the work. I wish in particular to thank Pam Farr who was responsible for meticulous typing of difficult material. 41 References Buchanan, B. G. (1975) Applications of Artificial Intelligence to scientific reasoning. Proc. 2nd USA-Japan Computer Conference . Chaitin, G. J. (1975) Randomness and mathematical proof. Scientific American , 232 , 47-52 (May, 1975). ™" Good, I. J. (1976) Dynamic probability, computer chess, and the measurement of knowledge, in Machi ne Rep res en t a ti ons of Know] edge (eds. E. M. El cock and D. Michie), Dordrecht: D. Reidel Publishing Company. Hartley, R. V. L. (1928) Transmission of information. Re 11 System Tech. Jour. , 7_, 535-563. Hubeman, B. J. (1968) A program to play chess end games. Technical report no. CS 106 . Stanford University: Computer Science Department. Marsh, D. L. (1969) Memo functions, the Graph Traverser and a simple control situation, in Machine Intelligence 5 , pp. 281-300 (eds. B. Meltzer and D. Michie), Edinburgh: Edinburgh University Press. Meltzer, B. (1970) The semantics of induction and the possibility of complete systems of inductive inference. Artificial Intel ligence, 1, 189-192. Michalski, R. and P. Negri (1976) An experiment on inductive learning in chess end games: the King-Pawn-King case, in Machine Representations of Knowledge (eds. E. W. El cock and D. Michie), Dordrecht: D. 
Reidel Publishing Company. Michie, D. (1967) Memo functions: a language facility with "rote-learning" properties. Research Memorandum MIP-R-29 , Edinburgh: Department of Machine Intelligence and Perception. Michie, D., Popplestone, R. J., and Marsh, D. L. (1968) LIB MEMOFNS. pp. 209-212 of Programming in POP-2 by R. M. Burstall , J. S. Collins and R. J. Popplestone. Edinburgh: Edinburgh University Press, 1971. Michie, D. and B. G. Buchanan. Artificial Intelligence in Mass Spectroscopy: a review of the Heuristic Dendral program. In Computers for Spectroscopists (ed. R. A. G. Carrington) London: Adam Hilger. Samson, E. W. (1951) Fundamental natural concepts of information theory . Air Force Cambridge Research Station: Report E5079. Samuel, A. L. (1959) Some studies in machine learning using the game of checkers. IBM Jour. Res. Dev. , 3, 211-229. Selfridge, 0. G. (1959) Pandemonium: a paradigm for learning. NPL Symposium no. 10 , Mechanization of Thought Processes , 511-531. London HMS0. van Emden, M. H. (1974) LIB MEMO FUNCTIONS: program specification. POP-2 Program Library . Edinburgh: School of Artificial Intelligence. 42 o 4-> C 0) S- 03 CO co JZ •r— U +-> fO U CD fO ■r™ CO j_ -M +J 03 ■a I/) +-> a> -O E s_ < CD OJ to ■a CD i_ • s_ CO Q. V -- •»-> OJ u S_ CD -t-> JD JD u 03 O 03 -M r— +J C 03 to O c -O •1 — o fO 4-> •r— U +-> "O E OJ QJ 3 +J +-> 4- 3 to CL 03 03 QJ E i_ S- O +J CO =3 O c fO CT f — u -a c 03 2 03 r— -M E n3 .c u 0J 4J • r- +J • r— -t-> O i_ 1— 03 C E -a O c •r— 4- fa 4-> O 03 •i — -0 a> -^ a> X c: CO 03 3 _Q CO •1— a; 03 s_ i_ 03 03 >> > r— 1 03 c * 4- aj to u > c C • r— «< O +-> •r— -0 •r— O +-> CD -t-> QJ to CD O Q. 
1 — C c CO OJ • r— ZJ OJ q: J_ 4- i_ 43 Figure 2 Graphical plots of the relation between knowledge and bit-cost for proorams of contrasted characteristics: p* - idealized program (minimal length, complete knowledge) p, - non-minimal program with complete knowledge Pp - program with incomplete knowledge p- - another program with incomplete knowledge p. - minimal look-up program with complete knowledge Pc - non-minimal look-up program with complete knowledge For a given p, the tangent of the angle made by the broken line with the horizontal axis measures the program's computational advantage. Effectiveness of the i-th program p. = K. L L 1 x 1(f) i.e. the computational advantage of p. relative to that of p* bit-cost knowledge V Kf) L T Kf) L 2 , h l r K 3 Kf). Kf) L 5' Kf) 44 I N F R M A T I M B I T S Kf) STORE-BITS 45 Figure 3 Chaitin's inter-galactic telegram in the context of "advice theory". We allow the sender a cost-free machine of his own in case he wishes to comoile his telegram from some specification language (see Chaitin's discussion) and we assign the receiver's machine unlimited data and work-space, and zero processing costs. But we assume a limit of 6 seconds on the evaluation of f in the TGM for any x. Attention is thus restricted to the bit-cost of the telegram itself. 46 r c cc a> ■o o o 1— +J a Q- O) t— * r T ' ' JD c O zr E o u O CL t— Ou ca z Cl. 1— 1 #» «=£ D. +-> O -Q fD E 03 S_ CD O) r— a; +j > a> x > Q. CQ Q- Q. a (T3 ai <-> :r: c_> < i on en _1 o LU a: C3 h- O H— < «3. CC > D_ c Q C/1 - c \— CJ cz CO _l CT O UJ 2- LU Oi. CJ 1— < cj —3 ZT D. Q. o: d. C O on < co cj UJ _j UJ UJ cj cj cc J— ^^( > Q. C- c < o. co < 1— < Q a 1 E •i — > "C en 03 cu »— a; "C OJ -c- 00 Q. CC a. 1— Q CO — < UJ z- H- ! UJ X CJ CJ < <=r Q. 3L CO CJ ^ i— i OL h- O CJ < _J c? 2: CO *1 - g UJ CJ «=C