Digitized by the Internet Archive in 2013
http://archive.org/details/measuringknowled786mich

UIUCDCS-R-76-786

MEASURING THE KNOWLEDGE-CONTENT OF PROGRAMS

by

Donald Michie

May 1976

DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS 61801

CONTENTS

Introductory outline
Basic definition: f
Basic definition: K
Basic definition: L
Basic definition: e
Purity of formulations
Properties of p in relation to f
Motivation and informal remarks
Technical need
Assumptions and restrictions
Information about a function
Knowledge of a function
Partitions of memory
The predicate "known"
Knowledge in a heuristic
ω-concepts and β-concepts
Simplified and refined forms
Simplified theory: definitions
Information concerning f
Knowledge-content of the look-up program p_T
Knowledge-content of the solution program p_A
Difference between information-content and knowledge-content
Knowledge-content of advice program and knowledge-overhead of control program
Refinements: partial (truncated) evaluation
Refinements: error
Simplest case of error: no gradations, no partial evaluation
Gradations but no partial evaluation
Partial evaluation, no gradations
General case: gradations of error in partial evaluation
Refinements: differential argument-utilities
Some β-independent quantities
The memo-function as a model of rote-learning
Telegram-factory model of concept learning
Learning, rote-skill and grasp
Concluding remarks
Acknowledgments
References
Appendix 1: Relation to Aristotle's theory of knowledge
Reference
Appendix 2: What to substitute for S
Reference

Introductory outline

An approach is described for defining and numerically measuring a system's knowledge of a finite function f, and for estimating the net knowledge-increment obtainable from a given body of advice. The key idea is that a system "knows" only what it can evaluate (whether by calculation or by look-up) in user-acceptable time. The role of advice is to make it know more. Knowledge of f is expressed quantitatively in information-bits (not to be confused with store-bits). If β denotes an upper bound to the time available for evaluating f for any argument, the information-content of what is β-evaluable is the amount of knowledge, K, about f.

The envisaged application is to the evaluation of a recursively defined function by an algorithm A implemented as a program p_A, to which a body of advice B is made available in the form of a knowledge-base p_B, through the mediation of a control program p_C. p_B is a kind of data-base (parts of which may on occasion be executed as program). A typical action of the advice contained in it is to alter p_A's termination condition in such a way as to cause earlier exit from the recursion without altering the computed result.

Advice theory offers a cost-benefit analysis of p_B and of its constituents. Benefit is expressed in units of knowledge. Cost is attributed to store occupancy and is expressed in store-bits. All measures are defined with respect to the machine specification S and the solution program p_A. The measured benefit-to-cost ratio for p_A alone is termed the system's algorithmic advantage.
The measured net knowledge-increment conferred by p_B, divided by its store-occupancy, is the system's advisory advantage relative to p_A. The system's computational advantage is the total knowledge (algorithmic plus advisory, less an overhead associated with the operations of p_C) divided by the store-cost (store occupancy of p_A plus store occupancy of p_B plus store occupancy of the advice-management routines p_C). The term penetration is used when advantage is expressed on a logarithmic scale.

Computational yield is defined as the ratio of the actual amount of knowledge to the maximum attainable with the given S and store-cost; the usefulness of this measure is restricted to the case where the store available is less than or equal to that which just permits complete knowledge of f (i.e. less than or equal to f's β-complexity, see below). More versatile measures are effectiveness, defined as the ratio of the actual to the maximum advantage, and efficiency, defined as the ratio of the minimum to the actual store occupancy for given K.

Knowledge-increase is seen as occurring not only through input of advice from the user but also through the system's storage of instances of, or (more powerfully) inductive generalization over, past evaluations of f. These processes can all modify or add to p_B so as to increase its knowledge-content. They can also, through restructuring or deleting existing advice, reduce p_B's store occupancy. In either case advisory advantage is increased through purely internal reorganization. We thus have a domain-independent way of measuring knowledge-acquisition in machine learning, following a scheme of which the memo-function (Michie, 1967) is a primitive example.

By a restriction of Solomonoff-Kolmogorov complexity we obtain β-complexity as a numerical measure of a problem-domain's difficulty. Extensions of advice theory deal with partial evaluation, evaluation error, and differential utilities of evaluating f for different arguments.
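The memo-function cited above (Michie, 1967) can be sketched as a wrapper that migrates computed (x, f(x)) pairs into a look-up store, so that later calls are answered within user-acceptable time without recomputation. This is a minimal modern illustration; the Python dict standing in for the rote part of p_B is our own device, not the paper's notation:

```python
def memo(f):
    """Wrap f so that each computed (x, f(x)) pair is stored for
    instant look-up on later calls: rote-learning in miniature."""
    table = {}                      # plays the role of the rote part of p_B
    def remembered(x):
        if x not in table:          # not yet "known" by look-up
            table[x] = f(x)         # evaluate once, then store
        return table[x]
    remembered.table = table        # expose the store for inspection
    return remembered

def factorial(n):
    return 1 if n == 0 else n * factorial(n - 1)

fact = memo(factorial)
fact(6)                             # computed on first call, then stored
```

After the first call, evaluation of fact(6) is a pure look-up: the pair has moved from the implicitly computable to the explicitly stored.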
With this last extension, in place of "pure" knowledge we speak of "useful" knowledge. Computational advantage and penetration as earlier defined divide correspondingly into "pure" and "useful" forms (each consisting as before of algorithmic and advisory parts), and similarly for yield, efficiency and any other quantities compounded from the primitive definitions.

The rate at which p (p_A, or p_A plus p_B plus p_C, as the case may be) can deliver complete information about f is independent of β, and is termed its computational capacity, expressed in information-bits/sec. Dividing by the bit-cost of p gives its computational power: the number of store-equivalents of information about f delivered per second. This is a measure of a system's overall performance in its knowledge-delivery as opposed to knowledge-acquisition aspect. The latter is treated by an extension of the theory addressed to measures of the performance of rote-learning and inductive generalization programs.

Two instances of f, one numerical and one non-numerical, are mentioned, namely finite-factorial and bestmove (in chess). Reference is also made to a probabilistic generalization by I. J. Good (1976) of the main definition. In anticipation of the body of the paper we give below a brief snapshot of the primitive notions.

Basic definition: f

f: X → Y, where X is an ordered domain of size N and Y, the co-domain, is of size M.

Basic definition: K

A program p's "useful knowledge" about a finite function f is measured in information-bits as

    K = Σ_{i=1}^{N} u_i × surprisal(Y_i)

where u_i is the normalized frequency with which p is invoked to evaluate f for the i-th member of X, and Y_i is the candidate set remaining after evaluating f(x_i), using p, for just β seconds. Surprisal (Samson, 1951) is log_2 (1/p_i), where p_i is the prior probability that an arbitrarily chosen member of Y will fall into Y_i. Evaluation is assumed to have the form of a successive approximation so that Y → Y_{i,1,p} → Y_{i,2,p} → ...
in successive cycles, 0, 1, 2, ... of the computation.

Basic definition: L

L is the number of store-bits occupied by the program p. L_K is the bit-cost of a program containing K bits of useful knowledge. L*_K is the minimum possible L for any such program; L_β is the minimum possible L for a program containing complete knowledge, i.e. K = I(f). K* is the maximum possible K for any program of no more than L store-bits. I(f) is the total information-content of f, calculated in a way to be described later.

Basic definition: e

The extraction time, e, is the time needed by the system to evaluate f fully over the whole of its domain.

Purity of formulations

We do not propose to make a fetish of purity, in particular when it comes to quarantining abstract objects such as "algorithms", "arguments" and the like from their counterparts in the world of concrete embodiments, such things as "programs" and "inputs". We have taken the trouble above to write "... frequency with which p is invoked to evaluate f for the i-th member of X ..." when we could easily have written "... frequency with which p is applied to the i-th member of X ..." (strictly speaking the latter would be wrong: one cannot apply a concrete object to an abstract one). But we do not undertake to observe these laws of quarantine on occasions when law-breaking offers large gains of brevity or perspicuousness. If mathematical taste is thereby offended we are sorry. But communication with the reader who is willing to "see what we mean" has, rightly or wrongly, been placed first.
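The basic definition of K can be transcribed literally, under the simplifying assumption that the prior probabilities are taken from a complete frequency table of f. The function names and the toy table below are our own illustration, not the paper's:

```python
from math import log2

def surprisal(candidates, freq, N):
    """-log2 of the prior probability that a randomly chosen member
    of Y falls in the candidate set, priors taken from the table of f."""
    p = sum(freq[y] for y in candidates) / N
    return -log2(p)

def useful_knowledge(u, candidate_sets, freq, N):
    """K = sum over i of u_i * surprisal(Y_i)."""
    return sum(u_i * surprisal(Y_i, freq, N)
               for u_i, Y_i in zip(u, candidate_sets))

# toy f over a domain of size 4, values drawn from {a, b}
freq = {'a': 3, 'b': 1}             # n(y): occurrences in the table of f
N = 4
u = [0.25, 0.25, 0.25, 0.25]        # all invocations equally frequent
# every evaluation here runs to completion, leaving a unit candidate set
Y_sets = [{'a'}, {'a'}, {'a'}, {'b'}]
K = useful_knowledge(u, Y_sets, freq, N)
```

With these figures the rarer value 'b' contributes 2 bits of surprisal against roughly 0.415 bits for each 'a', so K comes to about 0.81 information-bits per invocation.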
Properties of p in relation to f:

    Quantity                      Units              Definition
    Information-content (of f)    Information-bits   I(f)
    Knowledge (of p about f)      Information-bits   K
    Store-cost                    Store-bits         L
    Computational advantage       Pure number        C = K/L = (K_A + K_B + K_C)/(L_A + L_B + L_C)
    Penetration                   Decs               D = log10 K - log10 L
    Yield                         Pure number        K/K*
    Efficiency                    Pure number        L*_K/L
    Effectiveness                 Pure number        (K/L) × (L_β/I(f))
    Advisory advantage            Pure number        C_B = K_B/L_B = (K_r + K_c)/(L_r + L_c)
    Advisory penetration          Decs               D_B = log10 K_B - log10 L_B
    Operating cost                Bit-seconds        L × t
    Learning index                Decs/bit-second    ΔD_B/(L_U × t_U)

K and L denote the knowledge-content and bit-cost of a given p, and L_β is the β-complexity of f, i.e. the minimum bit-cost for a program possessing total knowledge of f, I(f) in amount. L*_K is the β-complexity of that subset of f covered by K, i.e. of the partial function f_K. In the index of learning, L_U denotes the store-cost and t_U the processor time associated, when running the program (p_A, p_B, p_C, p_U), with p_U, a procedure which uses results computed by (p_A, p_B, p_C) to update p_B. Subscripts r and c denote "rote" and "concept" respectively, distinguishing two forms in which information about f may be stored in p_B.

It is not expected that the foregoing "snapshot" will be immediately intelligible: the rest of the paper can be regarded as an attempt to make it so.

Motivation and informal remarks

In AI we talk about "knowledge", but would like to progress beyond the mere slogan. Is "knowledge" that which is explicitly stored? At first sight this seems reasonable. For example, it takes care of the following kind of paradox. A number theorist knows a great deal more than he did when he first learned the axioms of arithmetic. Yet everything he knows, and infinitely more besides, is strictly deducible from these axioms. Therefore, there cannot be anything new for him to have gained. What, then, is this extra stuff which he knows?
Presumably it consists of explicitly stored conclusions which previously were only implicitly available to him. Since everything that number theory has done with the integers since God made them has been strictly redundant, we conclude that the only new value created has been that of accessibility. The point of new knowledge is to make less time-consuming the retrieval of old truth. According to the explicit/implicit idea, I know the definition of a prime number and I know that 23 is prime. But I do not know that 233 is prime, although I can quickly work it out.

At second sight, the criterion of explicit storage is not quite so reasonable. We suggest that the notion of explicit storage is handy only because it correlates highly (in most information processing systems) with the speed with which the needed information can be retrieved. On this more general criterion a desk-top calculator which calculates factorials in a second has as much claim to a "knowledge" of the function as a similarly priced table-look-up machine. This is essentially the Aristotelian position (see Appendix 1) and we shall adopt it here. Knowledge must be of something, and for definiteness it will always be of a finite mathematical function.

Technical need

Information theory allows us to specify the quantity of information contained in a message and to calculate the least space in a computer memory required to store a minimal encoding of it. Computation theory allows us to say what functions are computable in principle, and whether particular programs compute them. Complexity theory can show, for the evaluation of certain simple functions, how some measures of computational cost can be minimized.
What we lack is a calculus which will enable us to say: not merely how much information about a function f is stored in a computing system; not merely whether and how the system could in principle compute f; not merely how the computation might be optimized with respect to specified cost criteria; but how much knowledge of f the system possesses. "Knowledge" should be defined in such a way as to be compatible with the informal use of the word in the machine intelligence literature. The definition, and any calculus based upon it, must be capable of characterizing the dynamical aspects of knowledge: its acquisition, transfer, and use. The essence of our approach will be to regard knowledge as the practically retrievable part of the total information possessed by the system about f.

Assumptions and restrictions

We measure information in the normal way, and express it in bits. In order to avoid confusion we shall use the word "store-bit" to mean a binary digit in machine memory. We exclude from strict consideration any function (e.g. a real-valued function or a function of a real variable) which may require infinitely many bits to represent a value or an argument. But we do not exclude the finite functions by which such functions may be approximated. Examples (the first chosen for its simplicity and the second for its complexity):

    finite-factorial: X → Y, where x ∈ non-negative integers, x ≤ xmax, and y ∈ positive integers
    bestmove: legal chess-positions → legal moves

Consider a member of this class, f: X → Y. Our approach is to regard "information" as a measure of certain properties of a static representation of f, conceived simply as a message which encodes f. In this classical information-theoretic world there are no store-time trade-offs, and the information-content of a message is equal to the minimal number of bits required to store it explicitly.
By contrast, "knowledge" measures certain properties of the evaluation of f, whether by look-up from store, by calculation, or by any mixture of the two. In this world we can, and typically do, encode long messages in short programs. How much do such programs "know"?

Information about a function

We picture f as laid out in extensional (tabular) form, comprising a sequence of (x, y) pairs, N in number. The first member of each pair is drawn from the ordered domain X, and the second from the co-domain Y. A message of length N, where N = size(X), has thus been constructed, the values y_1, y_2, y_3, ..., y_N being regarded as a sequence of choices from an alphabet Y. Assuming equiprobable choice, the information content of such a message is given by Hartley's (1927) expression N log_2 M, where M = size(Y). This follows from the fact that there are exactly M^N possible functions which could be constructed with the given domain and co-domain. Allowing for different frequencies of the different y-values, we have

    - Σ_{j=1}^{M} n_j log_2 (n_j / N)

where n_j is the number of occurrences of the j-th member of Y. An equivalent, and for our purposes more convenient, formulation is obtained by summing over the domain rather than the co-domain of f, thus:

    - Σ_{i=1}^{N} log_2 (n_i / N).

Suppose, now, that only k of the N entries have been inscribed in the table. Then the table's information content is

    - Σ_{i=1}^{k} log_2 (n_i / N),

where n_i is the number of occurrences in the complete table of (x, y) pairs whose right-hand element matches that of the i-th entry of the reduced, k-sized table. This expression is the basis for measuring the information-content of a "partially known" function.

Knowledge of a function

The test of "knowledge" is the ability to answer questions: "What is the finite-factorial of 6?", "What is the best continuation of this chess position?" If an answer can be produced within some time limit, then the system is held to "know" it.
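The two formulations above, the sum over the co-domain and the sum over the domain, can be checked to agree on a small table. This is a sketch with names of our own choosing:

```python
from math import log2
from collections import Counter

def info_codomain(ys):
    """I(f) = -sum_j n_j * log2(n_j / N), summed over distinct y-values."""
    N = len(ys)
    return -sum(n * log2(n / N) for n in Counter(ys).values())

def info_domain(ys):
    """Equivalent form: -sum_i log2(n(y_i) / N), one term per table row."""
    N = len(ys)
    n = Counter(ys)
    return -sum(log2(n[y] / N) for y in ys)

# right-hand column of a toy table T with N = 4, M = 3
ys = ['a', 'a', 'b', 'c']
```

For this table both expressions give 6 bits, against the Hartley maximum of N log_2 M ≈ 6.34 bits for equiprobable values.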
As preamble to an informal definition, consider the respective tasks of answering (1) "What is the finite-factorial of 6?", and (2) "What is the finite-factorial of 10^6?" In the second case it would seem that since the answer is so much harder to get, there must be more information obtained. This is not so. An answer conveys -log_2 (n_i / N) bits of information (its "surprisal"), and that is the end of the matter. But it is true that either

(i) the system is more likely to deliver an answer to (1) than to (2) within an acceptable waiting time (our definition of knowledge will allow us to say that the system is more likely to "know" (1) than (2)), or

(ii) if obliged to reply within a time limit, the system will be able to answer (1) to a higher degree of approximation than (2).

A definition which permitted degrees of knowledge of f(x) would allow us to say that the system has more knowledge of (1) than of (2). Such a definition is possible and has interesting consequences. These are discussed later.

Partitions of memory

We subdivide machine memory into six sectors, dedicated respectively to the following purposes:

1. Storage of the input argument ("data space").
2. Storage of a solution program p_A for the function f ("function space").
3. Storage of data generated and used by the evaluation algorithm in the course of, and solely for the purposes of, the given evaluation. This "work-space" is empty on entry and cleared on exit.
4. Storage of an advice program p_B (the "knowledge base" or "body of advice"), which modifies the action of p_A, typically by causing earlier exit without affecting the computed result. We refer to this sector of memory as "advice space".
5. Storage of a control program p_C for making p_B available to p_A.
6. Storage of an update program p_U for reorganizing p_B between evaluations, at the lowest level by rote-operations, at the highest by inductive generalization.
In most conventional computing, machine memory is partitioned among 1, 2, and 3 only. Our view of the distinguishing characteristics of an intelligent system is that the key components are precisely the last three. Systems exhibiting such a morphology as the above are at the dawn of their evolution. Specimens are scarce, and highly evolved specimens non-existent. The best known example is probably Buchanan, Feigenbaum and Lederberg's Heuristic DENDRAL (for recent reviews see Michie and Buchanan, 1974; Buchanan, 1975). A system of the same kind for chess end-games would map onto the six sectors as follows:

1. Current board position.
2. Lookahead routine (every chess program has one).
3. Lookahead tree.
4. Body of advice, consisting of chess heuristics to guide and prune the lookahead routine.
5. Routines to access and apply the heuristics.
6. A learning routine using accumulating samples of the extensional representation of bestmove, as they are discovered by the lookahead, to add new heuristics and modify old ones.

The predicate "known"

A predicate known has as arguments a function f, an input argument x, a machine specification S (see Appendix 2), programs p_A, p_B and p_C corresponding to 2, 4 and 5 above, and a time bound β. known(f, x, p_A, p_B, p_C, S, β) is true iff f(x) can be evaluated in no more than β time units. If, in the given advice state, f(x) is known for all x, then we say that f is β-evaluable on the given machine by the given (p_A, p_B, p_C). If f(x) is known only for some x we speak of the "known part" of the function. There is thus an extension here of the familiar notion of a partially defined function, except that we deal with partially known functions. In the new system, which incorporates the time dimension, "known" corresponds to a subset of the concept "defined" in the old system, but unlike "defined" changes dynamically. Thus, a system's knowledge of a function is the information-content of its β-evaluable extent.
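The predicate known can be sketched as follows. This is our own simplification: a dict of stored pairs stands in for p_B, the machine specification S is whatever interpreter runs the code, and a completed evaluation is timed after the fact rather than interrupted at the cut-off:

```python
import time

def known(f, x, beta, advice=None):
    """Sketch of known(f, x, p_A, p_B, p_C, S, beta): true iff f(x)
    can be delivered in no more than beta seconds.  A look-up hit in
    the advice store is assumed to cost far less than beta."""
    if advice is not None and x in advice:
        return True                     # immediate look-up
    start = time.monotonic()
    try:
        f(x)                            # evaluate by calculation
    except RecursionError:              # evaluation diverged on this machine
        return False
    return time.monotonic() - start <= beta
```

Note how the same f(x) can pass from unknown to known purely by an entry appearing in the advice store, which is the dynamical behaviour the text attributes to "known" as against "defined".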
We accordingly say that the number of units of knowledge of f is equal to the minimal number of bits of information required to specify that part of f which is β-evaluable, i.e., how much is known.

Knowledge in a heuristic

Consider (by way of example) Huberman's (1968) treatment of chess programming in certain elementary end-games, where p_A stands for her "forcing tree" lookahead routine and p_B stands for her logic formulae expressing "better" and "worse" predicates. We wish to measure the knowledge-content of individual heuristics in p_B. Proceed as follows (in principle: in practice we would sample from the domain, not chart it exhaustively).

1. With the heuristic loaded, run p_A for each x in X, and record the number of bits of information corresponding to the known part of the function.
2. Unload the heuristic and repeat.
3. The difference between the two measurements gives the amount of knowledge about f present in the heuristic.

The knowledge-gain has to be traded against the heuristic's storage cost: the ratio of gain (in information-bits) to cost (in store-bits) is the advisory advantage.

ω-concepts and β-concepts

We are turning our backs on "in principle" concepts of computability, completeness, contradiction and so forth. In so far as we need to refer to them we can speak of "ω-computability," "ω-completeness," "ω-contradiction," etc. Instead of these concepts we envisage a calculus of "β-concepts." A system is β-complete if every question admissible within the system can be answered in the given state of the advice in no more than β time units; a set of assertions is β-inconsistent if a contradiction can be β-derived and β-consistent if it cannot (an ω-consistent body of assertions may be β-consistent or it may be β-inconsistent); an assertion is β-redundant if its addition leaves unaffected what can be β-derived.
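Returning to the three-step procedure for measuring the knowledge in a heuristic, it can be expressed directly once the system is abstracted as a function from an advice state to the set of arguments it makes β-evaluable. All names and the toy figures here are ours:

```python
from math import log2
from collections import Counter

def known_bits(known_xs, table):
    """Information-content of the known part of f: the sum of
    surprisals -log2(n(y)/N) over the beta-evaluable arguments."""
    N = len(table)
    n = Counter(table.values())
    return sum(-log2(n[table[x]] / N) for x in known_xs)

def heuristic_knowledge(evaluable, table, heuristic):
    """Steps 1-3: bits known with the heuristic loaded, minus bits
    known with it unloaded."""
    return (known_bits(evaluable(heuristic), table)
            - known_bits(evaluable(None), table))

# toy system: loading the heuristic makes two further arguments evaluable
table = {1: 'a', 2: 'a', 3: 'b', 4: 'c'}
def evaluable(h):
    return {1, 2} if h is None else {1, 2, 3, 4}

gain = heuristic_knowledge(evaluable, table, heuristic="h1")
```

Dividing gain by the heuristic's cost in store-bits would give its advisory advantage.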
The theory evidently has implications for the special case of function evaluation known as automatic deduction, where the concept of a β-theorem presents itself and claims relationship to what is usually called "belief" (see Meltzer, 1970, for the notion of logical validity subject to a depth limit on the deduction).

We do not worry at this stage about the exact magnitude of the bound β. For concreteness we can suppose that the system must answer a question, if it can at all, within a few seconds or minutes rather than milli-seconds or days, and that it has access to a store of, say, a million, rather than a thousand or a billion, locations, with access times in the microsecond to millisecond range. Design and implementation of an "advice language" would be desirable, with facilities for handling and coordinating bodies of advice. But the case for developing and testing the theoretical measures is independent of these possibilities.

Simplified and refined forms

An account of the theory follows. First a simplified form is presented from which all issues likely to distract from the central exposition are omitted, specifically the treatment of truncations, errors and utilities. These elaborations are then introduced to complete the theory. A bare framework of some elementary objects and notions is pictured in Figure 1.

Simplified theory: definitions

Information concerning f

We associate with a function f a recursively defined evaluation algorithm A and a function table T. f maps an ordered finite domain X of size N into a finite co-domain Y of size M. T is the ordered set of pairs {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} such that the set of the left-hand elements is equal to X and for each pair f(x) = y.
If the right-hand elements of T, taken in the order y_1, y_2, ..., y_N, are interpreted as the symbols of a message composed from an alphabet Y, a natural expression for the amount of information associated with f is

    I(f) = -N Σ_Y p(y) log_2 p(y) bits,     (1)

where p(y) is the relative frequency of occurrence of y in the message. We prefer a re-formulation as a sum over the N symbols of the message of the "surprisals" associated with the individual symbols, defined for the i-th symbol as -log_2 p(y_i). This gives us

    I(f) = - Σ_{i=1}^{N} log_2 (n(y_i) / N),

where n(y_i) is the number of occurrences of y_i as a right-hand element of T. This may be rewritten

    I(f) = N log_2 N - Σ_{i=1}^{N} log_2 n(y_i) bits.     (1A)

Knowledge-content of the look-up program p_T

A program p_T implements T as a look-up table for evaluating f on a machine of specification S. Let β be an upper bound to the evaluation time of f for any x in X. Suppose that look-up consumes arbitrarily little time on each occasion and in any case always less than β. With the role of β thus rendered vacuous, the system's knowledge of f is equal to its total stored information about f, subject to the aforesaid restriction relative to β. Thus p_T's knowledge-content is

    K_T = N log_2 N - Σ_{i=1}^{N} log_2 n(y_i).     (2)

Note that the numerical value is equal to that of I(f) and also to the store-occupancy in binary digits of a minimal-length version of p_T.

Knowledge-content of the solution program p_A

Now consider a different program p_A for evaluating f. p_A implements A on a machine of the same specification S. Let β-evaluable be a predicate which is true for just those x's for which f(x) can be evaluated within this bound, and let π denote the null program. Then

    X_A = {x | x ∈ X ∧ β-evaluable(f, x, p_A, π, π, S)}

and T_A = {(x_i, y_i), (x_j, y_j), ..., (x_k, y_k)} such that the set of the left-hand elements of T_A is equal to X_A (which has N_A members) and, for each pair, f(x) = y. Clearly X_A ⊆ X and T_A ⊆ T.
We call the function f_A associated with T_A the "algorithmically known" part of f. p_A's knowledge-content is thus

    K_A = N_A log_2 N - Σ_{i=1}^{N_A} log_2 n(y_i),     (3)

where N_A is the size of T_A.

Difference between information-content and knowledge-content

Clearly the amount of information associated with A is the same as that associated with T, namely I(f). Yet their respective implementations differ widely in knowledge-content. For K_A to equal I(f) either β must be arbitrarily increased or S so altered that the machine runs arbitrarily fast, allowing the system to solve all problems in X within user-acceptable time.

Note some connexions with the Chaitin-Solomonoff-Kolmogorov definitions (see Chaitin, 1975) of randomness and complexity. If p*_A is a minimal-length implementation of A, then its store-occupancy in bits, which we shall write L_min(A), is equal to what in Solomonoff and Kolmogorov's (Chaitin, 1975) terminology would be called the complexity of T, and in ours its ω-complexity (ω stands for an arbitrarily long time-allowance for reconstructing the whole of T from p_A). The maximum attainable compression ratio for T is, in our notation,

    I(f) / ω-complexity(f) = L_min(T) / L_min(A),     (4)

where L_min(T) is a minimal encoding of T as a look-up table for an abstract machine of the same specification as that on which p*_A can be executed. We will later see that this ratio defines the maximum computational advantage attainable for the evaluation of f (e.g. by Martians, who think infinitely fast and never need advice).

Computational advantage is defined as an information ratio: information-bits divided by store-bits. It is the number of bits of β-retrievable information per bit of store. We give the name penetration to the logarithm of this ratio. Computation is a trick for exploiting the time dimension to achieve penetration into f, subject to the limitation β. If we optimize a pure look-up system, we get computational advantage equal to unity, zero penetration.
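The two quantities just defined can be sketched numerically; the helper names and the figures are ours:

```python
from math import log10

def advantage(K, L):
    """Computational advantage: information-bits of beta-retrievable
    knowledge per store-bit."""
    return K / L

def penetration(K, L):
    """Penetration in decs (decimal compressions): log10 K - log10 L,
    i.e. log10 of the computational advantage."""
    return log10(K) - log10(L)

# an optimized pure look-up system stores exactly what it knows:
lookup = advantage(4096, 4096)      # advantage 1, zero penetration
# a computing system retrieving a millionfold more than it stores:
decs = penetration(10**9, 10**3)    # 6 decs
```

On this scale the "40 3-minute decs" of the next paragraph would correspond to an advantage of 10^40.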
Penetration is expressed in "decs" (decimal compressions). "40 3-minute decs", for example, characterizes a system which computes, in less than three minutes for each input, an f whose T would require n bits for explicit storage, but which manages with only 10^-40 × n. The informational economics of chess are such that a Grandmaster might constitute a realistic example.

Knowledge-content of advice program and knowledge-overhead of control program

On a machine of specification S which runs the solution program p_A, an advice program p_B implements a body of advice B. Making the advice available to p_A requires the action of control routines p_C. In the simplified theory we assume a time penalty associated with the invocation of p_C which imposes an overhead on every attempted evaluation of f. Consider the ordered sets

    X_AC = {x | x ∈ X ∧ β-evaluable(f, x, p_A, π, p_C, S)}

and T_AC = {(x_i, y_i), (x_j, y_j), ..., (x_k, y_k)} such that the set of left-hand elements of T_AC is equal to X_AC and for each pair f(x) = y. Since the invocation of p_C on each call of p_A consumes time, but obtains no return (because the advice program is null), the predicate β-evaluable may give the result false for some members of X_A (for all of which it formerly gave true). Hence X_AC ⊆ X_A and T_AC ⊆ T_A. Correspondingly the quantity

    K_AC = N_AC log_2 N - Σ_{i=1}^{N_AC} log_2 n(y_i)

is interpretable as the amount of algorithmic knowledge of f less the knowledge-overhead incurred by p_C.

Now replace the null program by p_B and consider the ordered sets

    X' = {x | x ∈ X ∧ β-evaluable(f, x, p_A, p_B, p_C, S)},

which has size N', and T' = {(x_i, y_i), (x_j, y_j), ..., (x_k, y_k)} such that the set of left-hand elements of T' is equal to X' and for each pair f(x) = y. We form T_B (which has N_B members) as T' - T_AC and define the quantity of advisory knowledge of f relative to p_A as

    K_B = N_B log_2 N - Σ_{i=1}^{N_B} log_2 n(y_i)     (5)
The system's overall knowledge of f is measured as

    K = N' log_2 N - Σ_{i=1}^{N'} log_2 n(y_i) = K_A + K_B + K_C     (6)

where K_C, a negative quantity, is the knowledge-overhead associated with the control program, and is measured as K_AC - K_A. Note that corresponding to T' is a partial function f_K which we call the known part of f, and that f_K can be further partitioned into f_AC and f_B corresponding to T_AC and T_B respectively.

Now if the store-occupancies of p_A, p_B, p_C are L_A, L_B, L_C, termed respectively algorithmic cost, advisory cost and overhead, then

    L = L_A + L_B + L_C = overall store-cost     (7)

    C_A = K_A / L_A = algorithmic advantage     (8)

    C_B = K_B / L_B = advisory advantage relative to p_A     (9)

    C = K / L = (K_A + K_B + K_C) / (L_A + L_B + L_C) = overall computational advantage     (10)

Note that all these measures of advantage can be thought of as compression ratios relative to the smallest store-occupancies which could be attained if f_A, f_B and f_K were represented in pure look-up form.

Let L_β be the β-complexity of f relative to S', i.e. the smallest store cost for which complete knowledge of f could be obtained with a machine specification S' differing from S only in having unlimited store. Let L_max be the total program space available in S and let L_max ≤ L_β. If K* is the maximum amount of knowledge about f attainable for store cost L_max (not for fixed p_A, which we are free to rewrite in any way whatsoever), then

    K / K* = the computational yield of the system     (11)

Another interesting quantity is I(f)/L_β, which gives the maximum attainable computational advantage for a system limited by the β bound; compare with I(f)/L_min(A) cited earlier for Martians.

Consider now K, L_A for some program and define L*_K as the minimum store in which K units of knowledge about f can be represented, i.e. L*_K is the β-complexity of the partial function f_K.
Then since K / L*_K is the maximum attainable computational advantage for given K, the computational efficiency, defined as

    (K / L) / (K / L*_K) = L*_K / L    (12)

can be regarded as the ratio of an actual to an ideal advantage.

Dropping the subscript K to make room, if L*_A is the length of a minimal p_A representing K_A units of knowledge about f, and L_A is the length of an actual p_A with the same K_A, then

    L*_A / L_A = the algorithmic efficiency of p_A    (13)

Define L*_B as the minimum store which can contain a program p_B with K_B units of advisory knowledge relative to p_A. Then

    L*_B / L_B = the advisory efficiency of p_B    (14)

A similar quantity, effectiveness, relates p's advantage K/L to the maximum advantage attainable by a program with maximum K (i.e. I(f)) rather than by a program with given K. We then define effectiveness as

    (K / L) / (I(f) / L_min)    (15)

Applications of this expression are illustrated graphically in Figure 2.

Refinements: partial (truncated) evaluation

If evaluation of f for the i-th member of X is interrupted by the β cut-off before completion, then in the foregoing treatment the system was allotted a zero score for its knowledge of f(x_i). This simplification is now replaced by the following, in which we no longer speak of the "known part" of f, since some knowledge adheres to every part (the treatment is closely related to Good's (1976) discussion of the "knowledge that 12^3 = 1728"). Let the evaluation have the form of a successive approximation in which an initial candidate set of results is reduced in succeeding cycles, ending in a unit set consisting of f(x_i) when evaluation goes to completion, but otherwise in a residual set of candidates containing f(x_i). The initial candidate set is the entire co-domain Y. In each time-interval a program p transforms this set according to the sequence

    Y_{i,1,p} → Y_{i,2,p} → ...

such that Y ⊇ Y_{i,1,p} and Y_{i,j,p} ⊇ Y_{i,(j+1),p}. Y_{i,β,p} is the candidate set at the moment of β cut-off.
We require to quantify the information (surprisal) associated with Y_{i,β,p}, so that we can measure the system's overall knowledge of f as Σ_i surprisal(Y_{i,β,p}). Where evaluation is always complete, so that Y_{i,β,p} = {y_i}, we have seen that the information gained from evaluating f(x_i) is −log2 p(y_i) = −log2 (n(y_i)/N), where n(y_i) is the number of occurrences of the symbol y_i as a right-hand element in T. Where y_i ∈ Y_{i,β,p} and Y_{i,β,p} may have more than one member (i.e. the evaluation, although incomplete, is correct as far as it has gone) we write

    −log2 ( Σ_{y ∈ Y_{i,β,p}} n(y) / N ).

The system's overall knowledge of f is the sum of all the surprisals, namely

    K = Σ_{i=1}^{N} ( log2 N − log2 Σ_{y ∈ Y_{i,β,p}} n(y) )    (16)

To individualize the above to the knowledge-content of given programs p_T, p_A, p_B, p_C we substitute for p, thus:

    K_T = N log2 N − Σ_{i=1}^{N} log2 Σ_{y ∈ Y_{i,β,p_T}} n(y),

and similarly to construct definitions of K_A, K_AC and K, using p = p_A, p = (p_A, p_C) and p = (p_A, p_B, p_C).

Refinements: error

In the simplified theory either an evaluation is complete, or it contributes zero knowledge. But now suppose that for some x_i a program p completes the evaluation of f, but incorrectly, so that apply(x_i, p) ≠ f(x_i). Note that in our scheme error can only arise from the action of p_B; p_A is assumed ex hypothesi to be a correct implementation of A. To calculate the knowledge associated with the application of p to x_i we multiply the surprisal of f(x_i) by a coefficient g, −∞ < g ≤ 1. g is given by a function defined over the space of pairs (f(x), apply(x, p)). The user can supply any function he pleases provided that (i) for all x ∈ X, g = 1 iff apply(x, p) = f(x); (ii) for all x ∈ X, a "guessing program" p_G (which selects its output at random from among the M members of Y with relative frequencies p(y_1), p(y_2), ..., p(y_M), Σ p(y_i) = 1) has zero mathematical expectation for g × surprisal(apply(x, p_G)).
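The measure (16) reduces, for complete evaluations, to the earlier table-based formula. A minimal sketch in Python; the toy function table and candidate sets are our invention, introduced only to illustrate the arithmetic:

```python
import math

def knowledge(table, candidate_sets):
    """Knowledge per equation (16): for each argument x_i, the surprisal
    log2 N - log2(sum of n(y) over the candidate set Y_i)."""
    N = len(table)
    # n(y): number of occurrences of y as a right-hand element of the table
    n = {}
    for _, y in table:
        n[y] = n.get(y, 0) + 1
    return sum(math.log2(N) - math.log2(sum(n[y] for y in Y_i))
               for Y_i in candidate_sets)

# A toy function table T for f over a four-element domain X.
T = [(0, 'a'), (1, 'a'), (2, 'b'), (3, 'c')]

# Complete evaluation: every candidate set is the singleton {f(x_i)};
# the measure then reduces to N log2 N - sum_i log2 n(y_i).
complete = [{'a'}, {'a'}, {'b'}, {'c'}]
K_complete = knowledge(T, complete)

# Truncated evaluation of x = 0: the candidate set still holds two symbols.
truncated = [{'a', 'b'}, {'a'}, {'b'}, {'c'}]
K_truncated = knowledge(T, truncated)
assert K_truncated < K_complete  # partial evaluation yields less knowledge
```

With this table K_complete comes out as 6 information-bits, and the truncated run loses exactly the surprisal deficit of its two-member candidate set.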
Regarding the evaluation of a finite function as analogous to the play of a finite one-person game, g is the outcome value (g = 1 corresponds to a "win") and g × −log2 (n(y_i)/N) is the payoff. This terminology differs somewhat from Good's (1976: his Examples 1–5), who associates different "utilities" with correct and erroneous results of function evaluation. We reserve this term (see later) for discrimination between different members of X ("argument-utilities"), and not for discrimination between different possible results of evaluating f(x) for given x.

Simplest case of error: no gradations, no partial evaluation

In the simplest case, the knowledge-content for a correct result is g × −log2 (n(y_i)/N), where y_i = f(x_i) and g = 1. An incorrect result is assigned a negative payoff g × −log2 (n(y_i)/N) with g = −n(y_i)/(N − n(y_i)); this value of g satisfies condition (ii), namely that p_G's expected payoff is zero, as can be verified from the following tabulation.

    type of result | frequency when computed as apply(x, p_G) | contribution to expectation (= frequency × payoff)
    correct        | n(y_i)/N                                 | (n(y_i)/N) × −log2 (n(y_i)/N)
    incorrect      | (N − n(y_i))/N                           | −(n(y_i)/N) × −log2 (n(y_i)/N)

Gradations but no partial evaluation

In the more general case differentiation is made between different degrees of discrepancy between computed and true values. In chess, for example, an erroneous evaluation may be thought more of a failure if the computed result changes a won position to a lost one than if it changes a won position to a drawn one. The user again supplies his own penalty function, provided that it satisfies conditions (i) and (ii).

Partial evaluation, no gradations

Now consider error in the case of a truncated evaluation which yields a set Y_{i,β,p} of size m_i. This set may or may not contain f(x_i). To specify the knowledge-content of p allowing for error, we introduce a guessing program p_G which for each given x_i gives a candidate set also of size m_i.
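Condition (ii) in the simplest case can be checked numerically. The following sketch, with an invented frequency distribution n(y), verifies that the guessing program's expected payoff vanishes under g = −n(y)/(N − n(y)):

```python
import math

# Frequencies n(y) of each result symbol in a hypothetical table of size N.
n = {'a': 5, 'b': 3, 'c': 2}
N = sum(n.values())

def expected_payoff(y_true):
    """Expected payoff of the guessing program p_G for a fixed true result:
    with probability n(y_true)/N it guesses correctly (g = 1); otherwise it
    incurs g = -n(y_true)/(N - n(y_true)).  Payoff = g * surprisal, where
    surprisal = -log2(n(y_true)/N)."""
    surprisal = -math.log2(n[y_true] / N)
    p_correct = n[y_true] / N
    g_wrong = -n[y_true] / (N - n[y_true])
    return p_correct * 1 * surprisal + (1 - p_correct) * g_wrong * surprisal

for y in n:
    assert abs(expected_payoff(y)) < 1e-9  # condition (ii): zero expectation
```

The cancellation is exact for any distribution: the (n/N)·surprisal gained on correct guesses is exactly offset by the penalty on wrong ones.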
We suppose that p_G proceeds by making random deletions from an initially complete function table T, until just so many (x, y) pairs remain that the union of their right-hand members forms a set of size m_i — the candidate set Y_{i,β,p_G}. Initially there are n(y_i) pairs in T with y_i as right-hand member. The probability that after d successive deletions no pairs of this kind remain can be calculated from the contingency table:

                  right-hand member = y_i    right-hand member ≠ y_i
    undeleted             0                         N − d                N − d
    deleted             n(y_i)                    d − n(y_i)               d
                        n(y_i)                    N − n(y_i)               N

Fisher's "exact" method (Statistical Methods for Research Workers) gives the probability as

    d! (N − n(y_i))! / ( N! (d − n(y_i))! )

Denoting this quantity by 1 − P, we now construct a penalty function as shown in Table 1.

[Table 1, setting out this penalty function, is not legible in this copy.]

General case: gradations of error in partial evaluation

In Table 1 g can be interpreted as a proximity, with value 1 for zero difference between computed and true values and −P/(1 − P) for any other difference. Generalizing this, we first consider a proximity
function which, in the expression θ(y, f(x_i)) where y ∈ Y_{i,β,p}, yields a value for g, −∞ < g ≤ 1. g expresses the "proximity" of y to the correct result. A truncated evaluation which yields a candidate set can in this context be viewed as a question-answerer which produces, in response to the question x_i, a number of possible answers of varying degrees of correctness. To calculate the corresponding knowledge-content, two alternatives present themselves.

(1) We can summarize the candidate set in terms of the expected value, given this set, of θ(apply(x_i, p), f(x_i)), where apply(x_i, p) is the output value which would be obtained by running the program p to completion on the input x_i. A statistical approach of this kind demands that the user have some model of the computational process which successively shrinks the candidate set.

(2) We can adopt a "worst-case" position and use a penalty function pf defined by

    pf(Y_{i,β,p}) = min over y ∈ Y_{i,β,p} of θ(y, f(x_i)).

In other words we take the proximity of the candidate set to the correct result as being given by that of its most distant member.

Refinements: differential argument-utilities

As just shown, different answers to a given question rate different values according to their proximity to the correct answer. The user may also, and independently, attach different values to the receipt of answers to different questions.
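The deletion probability obtained earlier by Fisher's exact method is just the hypergeometric probability that all n(y_i) relevant pairs fall among the d deleted rows. A sketch confirming the closed form; N, n(y_i) and d are invented:

```python
from math import comb, factorial

def prob_all_deleted(N, n_yi, d):
    """Probability that d random deletions from an N-row table remove all
    n_yi rows whose right-hand member is y_i: of the C(N, d) equally likely
    deletion sets, those containing every y_i-row number C(N - n_yi, d - n_yi)."""
    return comb(N - n_yi, d - n_yi) / comb(N, d)

def fisher_form(N, n_yi, d):
    """Fisher's closed form from the contingency table:
    d! (N - n_yi)! / (N! (d - n_yi)!)."""
    return factorial(d) * factorial(N - n_yi) / (factorial(N) * factorial(d - n_yi))

N, n_yi, d = 12, 3, 7
assert abs(prob_all_deleted(N, n_yi, d) - fisher_form(N, n_yi, d)) < 1e-12
```

Expanding the binomial coefficients shows the two expressions are identical term by term.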
Argument-utilities will be regarded as fixed — corresponding to the notion of "importance." This simplifying assumption is made for exposition only, since they could in practice perfectly well fluctuate (dependent on the progress of some problem-solving process) and thus act as "relevance" weightings. An example of the latter case would be the evaluation of bestmove for a chess position containing Queens at a stage when the Queens had already been swapped off. For such an x as this the system could afford not to know f(x) for the time being. Argument-utilities are introduced as weighting coefficients thus:

    K = Σ_{i=1}^{N} u_i surprisal(Y_{i,β,p}).

A normalizing constraint demands that the distribution of argument-utilities has unit mean: this relates the amount of "useful" knowledge to the amount of "pure" knowledge. We use the qualified term "pure" knowledge in a case where the u_i's are omitted or, equivalently, set uniformly equal to unity. Without prejudice to the user's right to define his own utility function, the natural approach is to assume utilities to be directly proportional to the frequencies with which p is called for the different x_i's (compare Good, 1976).

Some β-independent quantities

If one does not like the idea of imposing all-or-nothing cut-offs after the expiration of β seconds, there are other measures relevant to the notion of knowledge which can be applied to the evaluation of f. We can introduce a time-base e, the time required for the system to evaluate f for every x in X (the "extraction time"). We then define computational capacity as I(f)/e, measured in bits per second. Decomposition into algorithmic and advisory components can be performed as before. Likewise I(f)/(L·e) measures computational (algorithmic plus advisory) power; the units are store-equivalents/sec. Power is thus a dynamic version of advantage. Some β-independent measures are tabulated below.
    Extraction time    seconds                   e
    Capacity           baud (bits/sec)           I(f)/e
    Power              store-equivalents/sec     I(f)/(L·e)
    Turnover time      seconds                   L·e/I(f)

Each of the above is separately measurable for the algorithmic, p_A, and advisory, p_B, components of p — giving "algorithmic capacity" and "power" and "advisory capacity" and "power".

The harshness of the β cut-off idea can be softened in another way. Introduce a multiplier r which is a function of elapsed time t, and rewrite the expression of the preceding section as

    K = Σ_{i=1}^{N} r_i u_i surprisal(Y_{i,t,p})    (16A)

If r is a step function such that r_i = 1 for t ≤ β and r_i = 0 otherwise, then K is maximized when t = β, which corresponds to the earlier treatment. Now let r be some other function, monotonically decreasing in
Only if the rote-search fails is the body of the procedure entered. The aim is to improve evaluation time. It wi 1 1 immediately be observed that if the procedure's initial knowledge of f is restricted by a p,-bound on evaluation time then the intro- duction of a rot ory confers an increase of knowledge - precisely measurable in our system. In effect, the memo- function instantiates Fiqure ; for p p a dynamically varied fragment of Pj. The rote 31 consists solely of advice at "ground level" - i.e. no abstraction, only selection, from f's function-table. The algorithmic procedure (memo-ised by the library routine "newmemo") corresponds to p., while the library package itself consists of p c (for applying the rote) and p.. (for updating it). With repeated application of p« to inputs sampled from the function's domain, (x, y) pairs accumulate in the rote and evaluation times decrease. Gains accrue not only when the input argument itself is found to be already on the rote, but also from rote-knowledge of results required in recursive calls: if you want to evaluate factorial (17) it helps if you already know factorial (16) - or even factorial (6). By our definitions, then, e-evaluability increases, i.e. the system's advisory knowledge of f . Note that the advisory cost also increases as the rote grows. If we "freeze" the rote at some stage, paralyzing the rote-updating mechanism, we can directly apply the earlier techniques to the knowledge associated with p R at the moment of freezing. Rut how are we to measure the cost-effectiveness not of the behavior of the program as a device for knowing about f (we have seen how to do this) but its behavior as a device for getting to know more about f? The following approach seems workable. Measurement of an amount of knowledge is concerned with pronerties of (p., p R , p f ). 
Measurement of rates of increase of the amount is concerned with properties of p.., the update program, in particular its caDacity to enhance the value to the user of p R . For a given p.. we want to have some measure of its effectiveness in doing this, and of the cost incurred. First we must decide precisely what attribute of p R to use in the numerator of p 's ratio of effectiveness to cost. Clearly it should be related to the increase of p B 's advisory advantage conferred by p..: a p.. conferring a small knowledge-gain but a large cost-gain upon p R must not 32 be allowed to rate high on effectiveness. We have chosen the looarithm of advisory advantage, i.e. advisory penetration, expressed in decs ("decimal compressions"). We shall use the symbol D R for this quantity. The denominator must be a function of p..'s store-cost and the time spent by the processor in p (|f both of which olainly contribute to the total cost of running p,. over the learning period. We must also consider the costs associated with running p« in conjunction with p R and p c to provide rote entries in the first place. We accordingly define a learning index: aD r R = r-f- (17) L U L U L.. is the store-cost of p.., and t, is the time spent in p.. by the processor during the learning period. The meaning of R is "increase in advisory penetration conferred by the operation of p.., relative to the cost of its operation." The concomitant operation of the rest of the program (p., p R , p«) also has a cost, which can be thought of as the cost of the "practice evaluations" separate from the learning cost. We have devoted space to the economics of rote-learning, because we shall later require to do the same for the economics of "concent-learning" ■ i.e. to measure the capability of a given p., to enhance d r 's value to the user in the case where p R is not the rote of a simple memo-function, but contains more organized information abstracted from f's function-table. 
We shall then use the rote-learning formulation as a base-line for measurement of the ability to acquire conceptualized as opposed to merely rote knowledge . 33 Telegram-factory model of concept-learning A concept learning system is a device for the synthesis of cheap but effective inter-galactic telegrams. The stimulus for this deliberately mysterious remark is a passage from 6. J. Chaitin (1975): "Suppose you have a friend who is visiting a planet in another galaxy s and that sending him telegrams is very expensive. He forgot to take along his tables of trigonometric functions, and he has asked you to supply them. You oould simply translate the numbers into an appropriate code (such as the binary numbers) and transmit them directly , but even the most modest tables of the six functions have a few thousand digits, so that the cost would be high. A much cheaper way to convey the same information would be to transmit instructions for calculating the tables from the underlying trigonometric formulas, such as Euler's i x equation e = cos x + l sin x. Such a message could be relatively brief, yet inherent in it is all the information contained in even the largest tables." The mathematical work which Chaitin is here popularizing, carried out by himself, by R. J. Solomonoff and by A. N. Kolmogorov, has been concerned with measuring the "complexity" of a message as the minimal number of store-bits required to hold a program capable of reconstructing it. In this theory, the inter-galactic telegram is to be run as a program on the receiver's machine. Provided that it terminates in finite time no one cares how long it takes to perform the computation. But consider the following scenario, diagrammed in Figure 3. Our distant friend has to play in a chess tournament, and asks us to send him, not the trigonometric functions, but the bestmove function for chess. He may suppose that the tournament rules demand, as on Earth, that moves be made without undue delay. 
Neither of the f ol lowing two responses to his request will earn any marks: (1) to send the complete look-up table, p T in our notation. The 50 total bit- cost of the telegram will be in the region 10 . 34 (2) to follow the Chai tin prescription and send a minimal program embodying an evaluation algorithm (in our case say the classic Zermelo- Borel-von Neumann total-lookahead algorithm), a minimal p A in our notation. In the case of (2) all the information has certainly been sent - at a cost of only about 10 4 bits. But our friend will object that we have not put him in possession of much useful knowledge . What should we have sent him? Plainly a telegram comprising p. and n R and n An argument based on the maximum rate of transfer of information into human long-term memory suggests that a Grandmaster is unlikely to have acquired more than 10 10 bits of chess information for his p R . So although the telegram will be costly compared to one consisting of o. alone, its cost is a droo in the ocean of p T 's cost. But like p T , and unlike unaided p«, it is executable within acceptable waiting time. We now ask: "What is the shortest inter-galactic telegram which will enable the recipient to 8-evaluate hestmove for every x in X?" P.v obvious analogy with Chaitin-Solomonoff-Kolmogorov complexity (which we refer to as ^-complexi ty) we term the length of the shortest such telegram the s-complexity of hestmove (always of course with resnect to S). In contrast with u>- complexity, the interest of which is ourely mathematical, v/e find in s-complexity an explication of the notion of intrinsic intellec- tual difficulty as a measurable (in principle) property of a problem domain. The "bestmove" function for a large version of the game of Him, say with 1G heaps of 1000 matches each, has approximately the same information- content as for chess. Rut the corresponding p. 
(implementing the veil- known parity rule of play) is so fast to evaluate that s-comolexity can hardly differ from .-complexity and is small. So 16 x 1000 Nim has little 35 intrinsic difficulty. But here is the continuation of the previously quoted passage from Chaitin. "Suppose, on the other -hand, your friend is interested not in trigonometry but in baseball. He would like to know the scores of all the major-league games played since he left the earth some thousands of years before. In this case it is most unlikely that a formula could be found for compressing the information into a short message; in such a series of numbers each digit is essentially an independent item of information, and it cannot be predicted from its neighbors or from some underlying rule. There is no alternative to transmitting the entire list of scores. " In this case co-complexity is high, and a fortiori so is B-comnlexity. Answering questions about baseball scores would be, at the least, a difficult task. Escalating the inter-galactic fantasy, suppose now that apneals from our distant friend for this or that finite function become so frequent that consideration must be given to the economics of automatically synthe- sizing new telegrams to his requirements. The resulting set-up is depicted in Figure 4. The recipient's machine, as before, is cost-free but the sender's machine is not, and must operate as a cost-effective telegram- factory. For each new synthesis the sender machine starts with p., n- and some p B (possibly null) ready loaded. It is thus already in a position to emit either of two unsatisfactory answers: to transmit a telegram consisting merely of p«, or to run p» over the whole domain in order to generate and send p T> The first would be good for the sender, bad for the receiver; the second good for the receiver, bad for the sender. 
To produce the right answer the brunt of the sender machine's task falls on p« and p.., respectively to explore fragments of f ("practice") and to generalize over p.'s discoveries to build a body of advice p R . At some stage a halt is called, and the telegram (p., p B , p c ) is transmitted, p.. stays at home. The 36 total cost is the manufacturing cost plus the cost of inter-galactic transmission. Manufacturing cost in turn decomposes into learning cost (Lt for runs of p..) and cost of practice (Lt for the associated runs of (p A » P r » Pr))« An illustrative miniature example of a p.. operating in something like this style is Negri's (1976) use of an inductive learning program to synthesize a p p able to discriminate won from drawn positions in the K + P v. K chess end-game. One envisages its use coupled with a p A which looks ahead until positions of known game-theoretic values are encountered. Lookahead would initially go to the end of the game. The final telegram could be the mul tivariable logic expression induced by the program. Learning, rote-skill and grasp To justify on economic grounds the addition of an advisory package (p R , p_) to a solution program it is not enough, or even to the point, to show that the package confers increase of knowledge: it must confer enhanced computational advantage or (to say the same thing logarithmically) enhanced penetration. Note that we earlier defined learning in terms of increase of penetration over successive trial evaluations, not in terms o f increase of knowledge . Even a decrease of knowledge, if accompanied by a disproportionate decrease in p 's bit-cost, could qualify as a form of learning in our scheme. The critical inequality, which justifies a niven modification of p R , or indicates that a step of positive learning has occurred, is Kg, * L c r- ! r^ 1» or p quivalently D n < D D , where R relates to the 1 x K B BR to its modified form. 
The ratio on the le f t abo ve 37 is the "relative advisory effectiveness" of p DI with respect to p D . D D With this in mind, let us dissect advisory advantage into two dissimilar components, one associated with a rote-dictionary in the style of a memo-function and the other associated with a "concept-dictionary," in which certain of the (x, y) "facts" discovered in past evaluations have been condensed into descriptional form. The "ground-level" is distinguished from the "conceptualized" component: K K Rote advantage = r— ; conceptual advantage = y-S- r c We shall call the logarithms of these quantities "rote-skill" and "grasp" respectively. For some purposes it will be useful to write advisory advantage in the form K + K , r + , c , especially when considering the profit-and-loss r c score of assimilating rote elements into the concept-dictionary. A consequence of the foregoing system of definitions is that an ordinary recursive procedure for the factorial function is credited with penetration but with neither rote-skill nor grasp . The memo-ised version of the same procedure can claim rote- learning , defined in our system as the autonomous improvement of penetration specifically by increase of rote- skill. Improvement through use of the memo-function's facility for direct assignment of values by the user to the rote exemplifies teaching by rote . If after some regime of rote-learning and/or rote-teaching we decide to suspend the action of p., and to investigate whether the memo-function in the form in which it has been frozen shows signs of having profited from 38 the regime, then we shall pronounce positively only if we find that penetration has indeed been increased beyond the base-line set by p.. If additionally the memo- function' s up-date program p.. 
were of itself to impose a beneficial structural reorganization upon the rote, for example by introducing dynamic grouping of result- values into intervals as in Marsh (1969) or (to be fanciful) by discovering Stirling's approximation and using it to replace in p B all values larger than a certain threshold by the corresponding predicate, then and only then could we say not only that this particular (p„, p B , p~) exhibited penetration in the form of rote-skill and grasp, but that the associated p.. had exhibited concept-learning. A recapitulation of some of these ideas is set out in Table 2. 39 Qual ity or aptitude Knowledge Advantage Penetration Rote-skill Grasp Difficulty (e-complexity) Rote-learning Parameter-learning Concept-learning Informal explication Retrievable information. Ability to retrieve more than is stored. Logarithm of advantage. Penetration attributable to storage of rote-facts. Penetration attributable to storage of descriptions. Size of smallest program able to confer complete knowledge. Increase of rote-skill by storing results of past evaluations. Increase of grasp by adjusting already stored descriptions in the light of past evaluations. Increase of grasp by abstracting new descriptions from past evaluations. Table 2. Recapitulation of some main ideas. An intermediate level of learning ("parameter learning," not discussed in the text) has been inserted. Examples of this mode are found in early AI work such as that of Samuel (1959) and of Selfridge (1959), and in many modern adaptive control systems. A "description" is an intensional definition of a (typically non-unit) set. Concluding remarks In the pre-scientific era, lack of a system of numerical definitions concerning mass and motion hindered the development of the mechanical arts. An analogous situation prevails today with regard to machine intelligence and provides incentive for the present work. Detailed mathematical and experimental development is required. 
The latter could usefully include the programming of a cluster of interrelated subdomains (for example chess end-games) using an invariant p.. Knowledge and learning parameters of 40 different p 's and p 's could thus be systematically compared over a range of cases. Acknowledgements I am indebted to I. J. Good, Martin Kruskal , David Plaisted and Boris Tsybakov for helpful discussion and in particular to Rod Fletcher, John McCarthy and Michael Tanner who spotted errors in earlier versions. The Department of Computer Science at the University of Illinois provided me with facilities for a substantial part of the work. I wish in particular to thank Pam Farr who was responsible for meticulous typing of difficult material. 41 References Buchanan, B. G. (1975) Applications of Artificial Intelligence to scientific reasoning. Proc. 2nd USA-Japan Computer Conference . Chaitin, G. J. (1975) Randomness and mathematical proof. Scientific American , 232 , 47-52 (May, 1975). ™" Good, I. J. (1976) Dynamic probability, computer chess, and the measurement of knowledge, in Machi ne Rep res en t a ti ons of Know] edge (eds. E. M. El cock and D. Michie), Dordrecht: D. Reidel Publishing Company. Hartley, R. V. L. (1928) Transmission of information. Re 11 System Tech. Jour. , 7_, 535-563. Hubeman, B. J. (1968) A program to play chess end games. Technical report no. CS 106 . Stanford University: Computer Science Department. Marsh, D. L. (1969) Memo functions, the Graph Traverser and a simple control situation, in Machine Intelligence 5 , pp. 281-300 (eds. B. Meltzer and D. Michie), Edinburgh: Edinburgh University Press. Meltzer, B. (1970) The semantics of induction and the possibility of complete systems of inductive inference. Artificial Intel ligence, 1, 189-192. Michalski, R. and P. Negri (1976) An experiment on inductive learning in chess end games: the King-Pawn-King case, in Machine Representations of Knowledge (eds. E. W. El cock and D. Michie), Dordrecht: D. 
Reidel Publishing Company. Michie, D. (1967) Memo functions: a language facility with "rote-learning" properties. Research Memorandum MIP-R-29 , Edinburgh: Department of Machine Intelligence and Perception. Michie, D., Popplestone, R. J., and Marsh, D. L. (1968) LIB MEMOFNS. pp. 209-212 of Programming in POP-2 by R. M. Burstall , J. S. Collins and R. J. Popplestone. Edinburgh: Edinburgh University Press, 1971. Michie, D. and B. G. Buchanan. Artificial Intelligence in Mass Spectroscopy: a review of the Heuristic Dendral program. In Computers for Spectroscopists (ed. R. A. G. Carrington) London: Adam Hilger. Samson, E. W. (1951) Fundamental natural concepts of information theory . Air Force Cambridge Research Station: Report E5079. Samuel, A. L. (1959) Some studies in machine learning using the game of checkers. IBM Jour. Res. Dev. , 3, 211-229. Selfridge, 0. G. (1959) Pandemonium: a paradigm for learning. NPL Symposium no. 10 , Mechanization of Thought Processes , 511-531. London HMS0. van Emden, M. H. (1974) LIB MEMO FUNCTIONS: program specification. POP-2 Program Library . Edinburgh: School of Artificial Intelligence. 42 o 4-> C 0) S- 03 CO co JZ •r— U +-> fO U CD fO ■r™ CO j_ -M +J 03 ■a I/) +-> a> -O E s_ < CD OJ to ■a CD i_ • s_ CO Q. V -- •»-> OJ u S_ CD -t-> JD JD u 03 O 03 -M r— +J C 03 to O c -O •1 — o fO 4-> •r— U +-> "O E OJ QJ 3 +J +-> 4- 3 to CL 03 03 QJ E i_ S- O +J CO =3 O c fO CT f — u -a c 03 2 03 r— -M E n3 .c u 0J 4J • r- +J • r— -t-> O i_ 1— 03 C E -a O c •r— 4- fa 4-> O 03 •i — -0 a> -^ a> X c: CO 03 3 _Q CO •1— a; 03 s_ i_ 03 03 >> > r— 1 03 c * 4- aj to u > c C • r— «< O +-> •r— -0 •r— O +-> CD -t-> QJ to CD O Q. 
1 — C c CO OJ • r— ZJ OJ q: J_ 4- i_ 43 Figure 2 Graphical plots of the relation between knowledge and bit-cost for proorams of contrasted characteristics: p* - idealized program (minimal length, complete knowledge) p, - non-minimal program with complete knowledge Pp - program with incomplete knowledge p- - another program with incomplete knowledge p. - minimal look-up program with complete knowledge Pc - non-minimal look-up program with complete knowledge For a given p, the tangent of the angle made by the broken line with the horizontal axis measures the program's computational advantage. Effectiveness of the i-th program p. = K. L L 1 x 1(f) i.e. the computational advantage of p. relative to that of p* bit-cost knowledge V Kf) L T Kf) L 2 , h l r K 3 Kf). Kf) L 5' Kf) 44 I N F R M A T I M B I T S Kf) STORE-BITS 45 Figure 3 Chaitin's inter-galactic telegram in the context of "advice theory". We allow the sender a cost-free machine of his own in case he wishes to comoile his telegram from some specification language (see Chaitin's discussion) and we assign the receiver's machine unlimited data and work-space, and zero processing costs. But we assume a limit of 6 seconds on the evaluation of f in the TGM for any x. Attention is thus restricted to the bit-cost of the telegram itself. 46 r c cc a> ■o o o 1— +J a Q- O) t— * r T ' ' JD c O zr E o u O CL t— Ou ca z Cl. 1— 1 #» «=£ D. +-> O -Q fD E 03 S_ CD O) r— a; +j > a> x > Q. CQ Q- Q. a (T3 ai <-> :r: c_> < i on en _1 o LU a: C3 h- O H— < «3. CC > D_ c Q C/1 - c \— CJ cz CO _l CT O UJ 2- LU Oi. CJ 1— < cj —3 ZT D. Q. o: d. C O on < co cj UJ _j UJ UJ cj cj cc J— ^^( > Q. C- c < o. co < 1— < Q a 1 E •i — > "C en 03 cu »— a; "C OJ -c- 00 Q. CC a. 1— Q CO — < UJ z- H- ! UJ X CJ CJ < <=r Q. 3L CO CJ ^ i— i OL h- O CJ < _J c? 2: CO *1 - g UJ CJ «=C