Cornell University Library

The original of this book is in the Cornell University Library. There are no known copyright restrictions in the United States on the use of the text.

http://www.archive.org/details/cu31924003092917

DEPARTMENT OF APPLIED MATHEMATICS, UNIVERSITY COLLEGE, UNIVERSITY OF LONDON.

DRAPERS' COMPANY RESEARCH MEMOIRS.

BIOMETRIC SERIES II.

MATHEMATICAL CONTRIBUTIONS TO THE THEORY OF EVOLUTION.

XIV. ON THE GENERAL THEORY OF SKEW CORRELATION AND NON-LINEAR REGRESSION.

BY KARL PEARSON, F.R.S.

[WITH FIVE DIAGRAMS.]

LONDON: PUBLISHED BY DULAU AND CO., 37, SOHO SQUARE, W. 1905.

Price Five Shillings.

In March, 1903, the Worshipful Company of Drapers announced their intention of granting £1,000 to the University of London to be devoted to the furtherance of research and higher work at University College. After consultation between the University and College authorities, the Drapers' Company presented £1,000 to the University to assist the statistical work and higher teaching of the Department of Applied Mathematics. It seemed desirable to commemorate this, probably the first occasion on which a great City Company has directly endowed higher research work in mathematical science, by the issue of a special series of memoirs in the preparation of which the Department has been largely assisted by the grant. Such is the aim of the present series of "Drapers' Company Research Memoirs."

K.P.

Mathematical Contributions to the Theory of Evolution. XIV. On the General Theory of Skew Correlation and Non-linear Regression.

By Karl Pearson, F.R.S.

Contents.

(1.) Introductory. General conceptions as to skew variation and correlation. General theory of skew variation within the limits of practical errors of sampling.
(2.) Generalised idea of correlation. The correlation ratio $\eta$ and its relation to the correlation coefficient $r$.
(3.) Probable errors of the correlation ratio and other constants of the arrays. Probable error of $r$.
(4.) On the higher types of regression. Homoscedastic and heteroscedastic systems. Homoclitic and heteroclitic systems.
(5.) Cubical regression. General equations for regression of any order.
(6.) Parabolic regression.
(7.) Linear regression.
(8.) Illustration A. On the skew correlation between number of branches to the whorl and position of the whorl on the spray in the case of Asperula odorata.
(9.) Illustration B. On the skew correlation between age and head height in girls.
(10.) Illustration C. On the skew correlation between size of cell and size of body in Daphnia magna.
(11.) Illustration D. On the skew correlation between number of branches to the whorl and position of the whorl on the stem in Equisetum arvense.
(12.) Quartic regression. Necessary criteria for various types of regression.
(13.) Illustration E. Calculation of quartic regression in the case of Equisetum arvense.
(14.) General conclusions. Nomenclature, clitic and scedastic curves. Difference between mere curve fitting and regression calculations. Remarks on retention of decimals.

(1.) Introductory.

In a series of memoirs presented to the Royal Society I have endeavoured to show that the Gaussian-Laplace normal distribution is very far from being a general law of frequency distribution either for errors of observation* or for the distribution of deviations from type such as occur in organic populations.†

* "On Errors of Judgment, &c.," 'Phil. Trans.,' A, vol. 198, pp. 235-299.
† "On Skew Variation, &c.," 'Phil. Trans.,' A, vol. 186, pp. 343-414.

It is quite true that the normal distribution applies within certain fields with a remarkable degree of accuracy, notably in a whole series of anthropometric, particularly craniometric, observations.* In other fields it is not even approximately correct, for example in the distribution of barometric variations,† of grades of fertility and incidence of disease.‡ For such cases I have introduced a series of skew frequency curves which serve the purpose of describing the frequency of innumerable skew distributions well within the errors of random sampling. An exact test for "goodness of fit" in the case of frequency distributions has also been now provided.§

In dealing with frequency which diverges more or less conspicuously from the normal law we require to bear in mind at least three important points:

(i.) Any expression for frequency must be a graduation formula. It is not a disadvantage, but a fundamental requisite that it should smooth off "Scheingipfeln," so far as these are irregularities within the limits of random sampling. Hence formulae like those provided by Thiele‖ and Wundt's pupils,¶ which depend upon taking enough "moments" to reproduce the complete frequency, are a priori fallacious. Many interpolation formulae would do this completely, but such interpolation formulae are not graduation formulae.

(ii.) The graduation formula must not depend upon the calculation of constants having such a high probable error that their value is practically worthless. Now, the probable error of high moments and products increases rapidly with their dimensions; hence there is, beyond the labour of arithmetic, a practical limit to the number of moments or products which can be effectively used in a graduation formula.

(iii.) There must be a systematic method of approaching frequency distributions, which can be applied to all cases with reasonably practical ease.
* 'Biometrika,' vol. I., p. 443; vol. II., p. 344; vol. III., p. 230.
† 'Phil. Trans.,' A, vol. 190, pp. 423-469.
‡ 'Phil. Trans.,' A, vol. 192, pp. 257-330; 'The Chances of Death,' vol. I., pp. 69, et seq.; 'Biometrika,' vol. I., p. 134 and p. 292; and for disease, 'Phil. Trans.,' A, vol. 186, pp. 390 and 407; A, vol. 197, p. 159.
§ 'Phil. Mag.,' vol. 50, 1900, pp. 157-174, and 'Biometrika,' vol. I., pp. 154-163.
‖ 'Forelaesninger over Almindelig Iagttagelseslaere,' Kjobenhavn, 1889; 'Theory of Observations,' London, 1903.
¶ Wundt, 'Philosophische Studien.' A whole series of papers, by G. F. Lipps and others, seems to me to quite miss the point of (i.) and (ii.) above.

Now the immense majority, if not the totality, of frequency distributions in homogeneous material show, when the frequency is indefinitely increased, a tendency to give a smooth curve characterised by the following properties:

(i.) The frequency starts from zero, increases slowly or rapidly to a maximum, and then falls again to zero, probably at a quite different rate, as the character for which the frequency is measured is steadily increased. This is the almost universal unimodal distribution of the frequency of homogeneous series. Homogeneity may for practical purposes be taken to imply unimodality, although the converse is very far from true.

(ii.) In the next place there is generally contact of the frequency curve at the extremities of the range.

These characteristics at once suggest the following form of frequency curve, if $y\,\delta x$ measure the frequency falling between $x$ and $x+\delta x$:

[…]

[…] $b_0, b_1, \dots, b_{s-1}$ in terms of the moments. For example, if we stop at $b_0$ we require two moments, at $b_1$ three moments, at $b_2$ four moments, at $b_3$ six moments, at $b_4$ eight moments, and at $b_{s-1}$, $s>2$, $2s-2$ moments.
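The way the constants follow from the moments can be sketched numerically. The block below is only an illustration under stated assumptions, not the author's own procedure: it assumes a frequency curve of the form $(1/y)\,dy/dx = (x+a)/(b_0+b_1x+b_2x^2)$, a form chosen because it reproduces the sign pattern of the moment equations (vi.), takes the origin at the mean, and stops at $b_2$, so that four moments suffice. The function names `solve_linear` and `pearson_constants` are hypothetical.

```python
from fractions import Fraction


def solve_linear(A, b):
    """Solve A z = b by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [row[-1] for row in M]


def pearson_constants(mu2, mu3, mu4):
    """Return [a, b0, b1, b2] from moments about the mean (assumed curve form).

    Row n (n = 0..3) of the system is
        mu_n a + n mu_{n-1} b0 + (n+1) mu_n b1 + (n+2) mu_{n+1} b2 = -mu_{n+1}.
    """
    mu = [1, 0, mu2, mu3, mu4]  # mu_0 = 1, mu_1 = 0 with origin at the mean
    A, rhs = [], []
    for n in range(4):
        row = [mu[n]]  # coefficient of a
        for k in range(3):  # coefficient of b_k is (n + k) * mu_{n+k-1}
            idx = n + k - 1
            row.append((n + k) * mu[idx] if idx >= 0 else 0)
        A.append(row)
        rhs.append(-mu[n + 1])
    return solve_linear(A, rhs)


normal_case = pearson_constants(1, 0, 3)    # moments of a unit normal series
gamma_case = pearson_constants(4, 8, 72)    # moments of a gamma series, shape 4
```

For the normal moments only $b_0$ survives, and for the gamma moments $b_2$ vanishes, which is the behaviour expected of the normal and Type III. cases respectively under the assumed form.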
* For example, cases in which there is a minimum frequency or antimode at $x=-a$, and $dy/dx$ infinite at one or two values for which $y=0$, as in the frequency distributions discussed in 'Phil. Trans.,' A, vol. 186, pp. 364-5, and 'Roy. Soc. Proc.,' vol. 62, p. 287, "Cloudiness, a Novel Case of Frequency."

There is no difficulty whatever in finding the $b$'s; we have the system of equations:

$$\begin{aligned}
\mu'_0 a + 0 + \mu'_0 b_1 + 2\mu'_1 b_2 + 3\mu'_2 b_3 + 4\mu'_3 b_4 + \dots &= -\mu'_1,\\
\mu'_1 a + \mu'_0 b_0 + 2\mu'_1 b_1 + 3\mu'_2 b_2 + 4\mu'_3 b_3 + 5\mu'_4 b_4 + \dots &= -\mu'_2,\\
\mu'_2 a + 2\mu'_1 b_0 + 3\mu'_2 b_1 + 4\mu'_3 b_2 + 5\mu'_4 b_3 + 6\mu'_5 b_4 + \dots &= -\mu'_3,\\
\mu'_3 a + 3\mu'_2 b_0 + 4\mu'_3 b_1 + 5\mu'_4 b_2 + 6\mu'_5 b_3 + 7\mu'_6 b_4 + \dots &= -\mu'_4,\\
\mu'_4 a + 4\mu'_3 b_0 + 5\mu'_4 b_1 + 6\mu'_5 b_2 + 7\mu'_6 b_3 + 8\mu'_7 b_4 + \dots &= -\mu'_5,
\end{aligned} \qquad \text{(vi.)}$$

where $\mu'_0 = 1$. Hence $a, b_0, b_1, b_2, b_3, \dots$ are at once given in terms of the determinant $\Delta$ and its minors, where:

$$\Delta = \begin{vmatrix}
\mu'_0 & 0 & \mu'_0 & 2\mu'_1 & 3\mu'_2 & 4\mu'_3 & \dots\\
\mu'_1 & \mu'_0 & 2\mu'_1 & 3\mu'_2 & 4\mu'_3 & 5\mu'_4 & \dots\\
\mu'_2 & 2\mu'_1 & 3\mu'_2 & 4\mu'_3 & 5\mu'_4 & 6\mu'_5 & \dots\\
\mu'_3 & 3\mu'_2 & 4\mu'_3 & 5\mu'_4 & 6\mu'_5 & 7\mu'_6 & \dots\\
\mu'_4 & 4\mu'_3 & 5\mu'_4 & 6\mu'_5 & 7\mu'_6 & 8\mu'_7 & \dots\\
\dots & \dots & \dots & \dots & \dots & \dots & \dots
\end{vmatrix} \qquad \text{(vii.)}$$

The results may be simplified slightly by taking the origin at the mean, and the moments about the mean, indicating this by dropping the dashes and putting $\mu_1 = 0$. Thus we have the following series of frequency curves, the origin being the mean:

(i.) Keeping $b_0$ only:

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x}{\mu_2} \qquad \text{(viii.)}$$

This is the Laplace-Gaussian normal form.

(ii.) Keeping $b_0$, $b_1$ only:

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + \dfrac{\mu_3}{2\mu_2}}{\mu_2 + \dfrac{\mu_3}{2\mu_2}\,x} \qquad \text{(ix.)}$$

This is the Type III. curve of my memoir on skew variation.*

(iii.) Keeping $b_0$, $b_1$, $b_2$ only:

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + \dfrac{\mu_3(\mu_4+3\mu_2^2)}{10\mu_2\mu_4-18\mu_2^3-12\mu_3^2}}{\dfrac{\mu_2(4\mu_2\mu_4-3\mu_3^2)}{10\mu_2\mu_4-18\mu_2^3-12\mu_3^2} + \dfrac{\mu_3(\mu_4+3\mu_2^2)}{10\mu_2\mu_4-18\mu_2^3-12\mu_3^2}\,x + \dfrac{2\mu_2\mu_4-3\mu_3^2-6\mu_2^3}{10\mu_2\mu_4-18\mu_2^3-12\mu_3^2}\,x^2} \qquad \text{(x.)}$$

* 'Phil. Trans.,' A, vol. 186, p. 373.

This equation gave Types I.-VI. of my two memoirs on skew variation,* and provides at once the expressions

$$d = \text{distance from mode to mean} = \frac{\sigma\sqrt{\beta_1}\,(\beta_2+3)}{2(5\beta_2-6\beta_1-9)} \qquad \text{(xi.)},$$

$$\text{skewness} = \frac{\sqrt{\beta_1}\,(\beta_2+3)}{2(5\beta_2-6\beta_1-9)} \qquad \text{(xii.)},$$

where $\sigma = \sqrt{\mu_2}$, $\beta_1 = \mu_3^2/\mu_2^3$, $\beta_2 = \mu_4/\mu_2^2$, given in my memoir on the theory of errors of observation without proof.†

There is no theoretical limit, however, to this process; we can from (vi.) and (vii.) express the $a$ and $b$'s at once in terms of determinants, and expanding obtain forms which, like the formulae of Thiele, will fit closer and closer to the observed distribution of frequency, the more moments we take. But there are three fundamental practical objections to this. These are the following:

(a.) Experience shows that the form (x.) suffices for certainly the great bulk of frequency distributions, i.e., it describes them effectively within the limits of random sampling. If the distribution be even approximately normal, the series in the denominator converges very rapidly, for the coefficients of every power of $x$ vanish for moments obeying the relationships:

$$\mu_{2s+1} = 0, \qquad \mu_{2s} = (2s-1)\,\mu_2\,\mu_{2s-2},$$

which hold for a normal series.

(b.) The labour of arithmetic and of analysis becomes very great, if we desire to keep higher moments. If we go to $b_4$ we should have to calculate the first eight moments of the observations about their centroid, a by no means easy task. Further, the classification of the resulting curves and the criteria for the right one to use in a special case, although not absolutely prohibitive, if we only go as far as $b_3$, are for practical purposes idle in the case of taking into account $b_4$.

(c.) The probable errors of the higher moments are so large that the values found for $\mu_7$, $\mu_8$, &c., are quite untrustworthy, and even that for $\mu_6$ is doubtful,‡ unless we have frequency series far larger than usually occur in actual observations. This is a strong argument against the utility of any descriptions of frequency, such as those suggested by Thiele or Lipps, which depend upon moments higher than the fifth or sixth.

* 'Phil. Trans.,' A, vol. 186, pp. 343-414, and 'Phil.
Trans.,' A, vol. 197, pp. 443-459.
† 'Phil. Trans.,' A, vol. 198, p. 277.
‡ In 'Phil. Trans.,' A, vol. 185, pp. 71-110, I have given a method of breaking up a frequency distribution into two normal series. I obtained long ago the criterion for determining whether such a resolution is possible or not. But it involves moments higher than the fifth, and the probable error of the criterion is thus so great that for practical purposes it is worthless.

The question of the probable deviations of the higher moments can be illustrated as follows, by finding the standard deviation of the moment when we take a number of random samples from a general population. Let $\Sigma_{\mu_s}$ be the standard deviation of $\mu_s$, then $100\,\Sigma_{\mu_s}/\mu_s$ is the percentage variability of $\mu_s$ due to random sampling. The table below shows the increase of these percentages in the case of the moments of normal distributions, which, quite as well as any other, will illustrate the rapid increase in probable error as we use higher and higher moments. The general values of the standard deviations of some of the moments were first given by Czuber,* then far more completely by Sheppard,† and a résumé of all the results recently in 'Biometrika.'‡

Percentage Variability in Moments due to Random Sampling when the Series is supposed to be Normal.

Moment.     500 in series.     1000 in series.
$\mu_2$           6.3                4.5
$\mu_4$          14.6               10.3
$\mu_6$          30.1               21.3
$\mu_8$          60.6               42.9

Precisely the same rapid increase takes place when we find the variabilities of the ratios $\mu_4/\mu_2^2$, $\mu_6/\mu_2^3$, $\mu_8/\mu_2^4$, &c., which are the forms in which the moments actually occur in our coefficients. In this case we have to remember that errors in the moments are correlated, but the correlations are given in the papers cited above.§ I find in this case the following series, which is almost as suggestive as the previous table.

Percentage Variabilities in Ratio of Moments due to Random Sampling, the Series being Normal.

Ratio.              500 in series.     1000 in series.
$\mu_4/\mu_2^2$           7.3                5.2
$\mu_6/\mu_2^3$          23.3               16.5
$\mu_8/\mu_2^4$          55.1               39.0

* 'Theorie der Beobachtungsfehler,' S. 130, et seq.
† 'Phil. Trans.,' A, vol. 192, pp. 122, et seq.
‡ Vol. II., pp. 273-281.
§ Ibid., p. 277.

The order of this increase of percentage variability, and therefore of probable error, is the same for skew as for normal variation, and it seems therefore, with the length of the series in customary use, idle to use the 7th or 8th moments; these have variabilities varying from 30 to 60 per cent. of their values, and accordingly we might easily on a random sample reach a 7th or 8th moment having half, or double the value it actually has in the general population. Constants based on these high moments will be practically idle. They may enable us to describe closely an individual random sample, but no safe argument can be drawn from this individual sample as to the general population at large, at any rate so far as the argument is based on the constants depending upon these high moments.

It seems to me accordingly obvious that, bearing in mind the object of a theory of frequency (i.e., the description of the distribution in the general population by aid of a graduated sample, agreeing with the general population within the probable errors of random sampling), we can dismiss from practical use all theories which call upon us to use moments as high as the seventh or eighth. Any use of the general form (ii.) beyond $b_3$, indirectly or directly, involves such higher moments. Personally I am inclined to doubt whether the continental series using higher moments are, from the standpoint of graduation, nearly as good as my form (ii.). Hence we seem driven to the skew curves embraced in (x.) as a practical frequency series. If we have a frequency not described by (x.) we may, perhaps, use $\mu_5$ and $\mu_6$,* but it is difficult to see how its description can possibly be bettered by the use of still higher moments.
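The two tables of percentage variability can be approximately reproduced from the moments of the normal curve alone. The sketch below is an illustration under stated assumptions rather than the computation actually used for the tables: it uses the first-order sampling results $\mathrm{var}(\mu_r) = (\mu_{2r}-\mu_r^2)/n$ and $\mathrm{cov}(\mu_r,\mu_2) = (\mu_{r+2}-\mu_r\mu_2)/n$, valid for even $r$ in a normal series, and a first-order (delta method) expansion for the ratios. The function names are hypothetical.

```python
from math import sqrt


def normal_moment(k):
    """k-th moment about the mean of a unit normal: (k-1)!! for even k, 0 for odd k."""
    if k % 2:
        return 0
    result = 1
    for j in range(1, k, 2):
        result *= j
    return result


def pct_variability_moment(r, n):
    """Percentage variability of the r-th moment (r even) in normal samples of size n."""
    mu = normal_moment
    variance = (mu(2 * r) - mu(r) ** 2) / n
    return 100 * sqrt(variance) / mu(r)


def pct_variability_ratio(r, n):
    """Percentage variability of mu_r / mu_2^(r/2) (r even), to first order."""
    mu = normal_moment
    p = r // 2
    relative_variance = (
        (mu(2 * r) - mu(r) ** 2) / mu(r) ** 2            # from mu_r itself
        + p ** 2 * (mu(4) - mu(2) ** 2) / mu(2) ** 2     # from the power of mu_2
        - 2 * p * (mu(r + 2) - mu(r) * mu(2)) / (mu(r) * mu(2))  # their correlation
    ) / n
    return 100 * sqrt(relative_variance)


moments_500 = [pct_variability_moment(r, 500) for r in (2, 4, 6, 8)]
ratios_500 = [pct_variability_ratio(r, 500) for r in (4, 6, 8)]
```

Under these assumptions the values for 500 in series come out close to the tabulated ones, agreeing to within about two-tenths of a per cent. in the worst case; the small differences at the 8th moment presumably reflect the rounding or the exact formulae used in the original.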
This may seem a counsel of despair; but it is very far from being so in reality when we remember that (x.) has proved its efficiency now, I might almost say without exception, in a wide range of economic, physical, biometric, and actuarial data. In this memoir on skew correlation I shall accordingly confine my attention, for the most part, to constants the discovery of which does not involve the use of moments or products of higher than six dimensions, judging all above this limit to be, as a rule, disqualified for practical service by the magnitude of their probable errors.

* Referring to equation (ii.), I propose to call curves which stop at $b_q$ skew curves of the $q$th order. Thus the normal curve is a skew curve of zero order; curve of Type III. is a skew curve of the 1st order; Types I., II., V., and VI. are of the 2nd order. I hope shortly to publish a discussion of skew curves of the 3rd order to complete the practically legitimate range of such curves.

(2.) Generalised Idea of Correlation.

Given any two variables or characters A and B, we say that they are correlated when, with different values $x$ of A, we do not find the same value $y$ of B equally likely to be associated. In other words, certain values of B are relatively more likely to occur with the value $x$ than others. The distribution of B's associated with a given value $x$ of A is termed an $x$-array of B's. If $N$ pairs of A and B are taken, and $n_x$ of these have the character A = $x$, these $n_x$ form the $x$-array of B's. This array, like any other frequency distribution, will have its mean, which we will denote by $\bar y_x$, and its standard deviation, which we will denote by $\sigma_{n_x}$. The mean of all the B characters shall be $\bar y$ and their variability given by the standard deviation $\sigma_y$. Similarly $\bar x$, $\sigma_x$ will denote the mean and standard deviation of the A's, and $n_y$, $\bar x_y$, and $\sigma_{n_y}$ the number of individuals, the mean and the standard deviation for a $y$-array of A's.
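The array notation just introduced can be made concrete with a short sketch. It is an illustration only: the six pairs are invented data, the function name `x_arrays` is hypothetical, and standard deviations are taken with divisor $n_x$ (moments about the array mean) rather than $n_x - 1$.

```python
from collections import defaultdict
from math import sqrt


def x_arrays(pairs):
    """Group (x, y) pairs into x-arrays; return {x: (n_x, mean of y, s.d. of y)}."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    table = {}
    for x, ys in sorted(groups.items()):
        n = len(ys)
        mean = sum(ys) / n
        sd = sqrt(sum((y - mean) ** 2 for y in ys) / n)  # divisor n, not n - 1
        table[x] = (n, mean, sd)
    return table


# Invented pairs (value of A, value of B), for illustration only.
pairs = [(1, 1), (1, 3), (2, 4), (2, 6), (3, 1), (3, 3)]
arrays = x_arrays(pairs)
```

Each entry of `arrays` then holds the number in the array, its mean and its standard deviation for one value of $x$.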
Now clearly a knowledge of $\bar y_x$ and $\sigma_{n_x}$ will not fix the B's which will be found associated with a given A, but it will define the limits of probable or even possible B's. The curve obtained by plotting $\bar y_x$ to $x$ is termed the regression curve of $y$ on $x$. A curve in which the ratio of $\sigma_{n_x}$ to the standard deviation $\sigma_y$ is plotted to $x$ may be termed a scedastic* curve. Since the standard deviation is always a positive quantity, this curve always lies on one side of the axis; it is a horizontal line in the case of normal correlation, i.e., the Gauss-Laplacian distribution of deviations, and coincides with the axis in any case where correlation passes into causation, i.e., when one value of B only is associated with each A. The mean ordinate of this curve would clearly be a sort of general measure of the degree of correlation between A and B, but it seems for many reasons better to base our measure on the mean square of the weighted standard deviations of the arrays, or

$$\sigma_a^2 = S(n_x\sigma_{n_x}^2)/N \qquad \text{(xiii.)}$$

* I.e., a curve which measures the "scatter" in the arrays.

$\sigma_a$ will thus measure the average variability in B to be found associated with any A; its vanishing will mean that the scedastic curve as defined above will coincide with the axis. Now let a new quantity $\eta$, defined by

$$\sigma_a^2 = (1-\eta^2)\,\sigma_y^2 \qquad \text{(xiv.)},$$

be introduced. Then clearly $\eta$ must lie between $\pm 1$, because $\sigma_a^2$ cannot be negative, being the sum of a number of positive squares. I term $\eta$ the correlation ratio, to distinguish it from the correlation coefficient represented by $r$. When $\eta = \pm 1$ the correlation is perfect or we have causation. Further we have by a well-known property of moments, if

$$\sigma_M^2 = S\{n_x(\bar y_x-\bar y)^2\}/N \qquad \text{(xv.)},$$

$$\sigma_y^2 = \sigma_a^2 + \sigma_M^2,$$

or

$$\eta = \sigma_M/\sigma_y \qquad \text{(xvi.)}$$

This shows us that the correlation ratio is the ratio of the variability of the means of the $x$-arrays to the variability of B's in general. If $\eta = 0$, it follows that $\sigma_M$ is zero, or from (xv.) that every $\bar y_x = \bar y$, i.e., there is no association of B's with special A's at all, or correlation is zero. Thus the correlation ratio $\eta$, as defined by either (xiv.) or (xvi.), is an excellent measure of the stringency of correlation, always lying numerically between the values 0 and 1, which mark absolute independence and complete causation respectively.

Further, remembering the definition of $r$, the coefficient of correlation, i.e.,

$$N r\,\sigma_x\sigma_y = S\{n_x(x-\bar x)(\bar y_x-\bar y)\} \qquad \text{(xvii.)},$$

we have, from (xv.) and (xvii.),

$$N(\eta^2-r^2)\,\sigma_y^2 = S\Big[n_x(\bar y_x-\bar y)\Big\{\bar y_x-\bar y-\frac{r\sigma_y}{\sigma_x}(x-\bar x)\Big\}\Big].$$

Now let

$$Y = \bar y + \frac{r\sigma_y}{\sigma_x}(x-\bar x) \qquad \text{(xviii.)};$$

then (xviii.), as is well known, gives the best fitting straight line to the series of points $\bar y_x$ loaded with their respective $n_x$. We can now write

$$N(\eta^2-r^2)\,\sigma_y^2 = S\{n_x(\bar y_x-Y)^2\} + S\{n_x(Y-\bar y)(\bar y_x-Y)\}.$$

But, using (xviii.),

$$S\{n_x(Y-\bar y)(\bar y_x-Y)\} = \frac{r\sigma_y}{\sigma_x}\,S\Big[n_x(x-\bar x)\Big\{\bar y_x-\bar y-\frac{r\sigma_y}{\sigma_x}(x-\bar x)\Big\}\Big] = 0.$$

Thus the last summation vanishes, and we have

$$N(\eta^2-r^2)\,\sigma_y^2 = S\{n_x(\bar y_x-Y)^2\} \qquad \text{(xix.)}$$

The right-hand side must always be positive, unless $\bar y_x = Y$, when it is zero. Hence we conclude that $\eta$ is always numerically greater than $r$, or the correlation ratio greater than the correlation coefficient, except in the special case when the means of the $x$-arrays of $y$'s all fall on a straight line, i.e., we have linear regression, and then the two correlation constants are equal. Thus the expression $(\eta^2-r^2)\,\sigma_y^2$ has an important physical meaning; it is the mean square deviation of the regression curve from the straight line which fits this curve most closely.* We have now freed our treatment of correlation from any condition as to linearity of the regression, and it remains to consider the probable errors of the various quantities dealt with.

* The properties of the correlation ratio were briefly noted in a footnote to a paper by the author in 'Roy. Soc. Proc.,' vol. 71, pp. 303-4. It has been systematically used in my laboratory for some years and determined alongside $r$ for many distributions.

(3.) Probable Errors of Constants of Correlation.

We shall first prove a number of general propositions relating to the probable errors of correlation constants. We first note that if $n$ and $n'$ be the frequencies in any two sub-groups of a total $N$, for which no member of $n$ is a member of $n'$, then the standard deviation of $n$ due to random sampling is given by

$$\Sigma_n^2 = n\Big(1-\frac{n}{N}\Big) \qquad \text{(xx.)},$$

and the correlation between deviations in $n$ and $n'$ due to random sampling is given by

$$R_{nn'}\,\Sigma_n\Sigma_{n'} = -\frac{n n'}{N} \qquad \text{(xxi.)}$$

Problem I. To find the correlation in deviations due to random sampling between the number $n_{x_p}$ in the $x_p$-array of $y$'s and the number $n_{y_s}$ in the $y_s$-array of $x$'s.

If the symbol $\delta n$ denote the error or deviation in $n$, we have with an obvious subscript notation*

$$\delta n_{x_p} = \delta n_{x_p y_1} + \delta n_{x_p y_2} + \delta n_{x_p y_3} + \dots + \delta n_{x_p y_q},$$

if there be $q$ groups of $y$'s, and again

$$\delta n_{y_s} = \delta n_{x_1 y_s} + \delta n_{x_2 y_s} + \delta n_{x_3 y_s} + \dots + \delta n_{x_l y_s},$$

if there be $l$ groups of $x$'s. Multiply the expressions for $\delta n_{x_p}$ and $\delta n_{y_s}$ together and we have

$$\delta n_{x_p}\,\delta n_{y_s} = (\delta n_{x_p y_s})^2 + S(\delta n_{x_p y_v}\,\delta n_{x_u y_s}),$$

where the summation is for every pair of values of $u$ and $v$, differing from $s$ and $p$. Summing all such pairs of values for every random sample and dividing by the number of samples taken, we have the usual definition of correlation, or

$$\Sigma_{n_{x_p}}\Sigma_{n_{y_s}}R_{n_{x_p} n_{y_s}} = n_{x_p y_s} - \frac{n_{x_p} n_{y_s}}{N} \qquad \text{(xxii.)}$$

This gives $R_{n_{x_p} n_{y_s}}$, the required correlation, since $\Sigma_{n_{x_p}}$ and $\Sigma_{n_{y_s}}$ are known from (xx.).

Problem II. To find the correlation between deviations in the total $n_{x_p}$ of any array and in any sub-group $n_{x_p y_s}$ of this array.

We have at once

$$\delta n_{x_p}\,\delta n_{x_p y_s} = (\delta n_{x_p y_s})^2 + S(\delta n_{x_p y_s}\,\delta n_{x_p y_u}),$$

where $u$ is to be taken every value other than $s$ in the summation term. Summing for all random samples and dividing by their number, we have, after using results like (xx.) and (xxi.),

$$\Sigma_{n_{x_p}}\Sigma_{n_{x_p y_s}}R_{n_{x_p} n_{x_p y_s}} = n_{x_p y_s}\Big(1-\frac{n_{x_p}}{N}\Big) \qquad \text{(xxiii.)},$$

which gives $R_{n_{x_p} n_{x_p y_s}}$.
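* $n_{xy}$ = frequency of groups with characters $x$ and $y$.

Proposition III. There is no correlation between deviations in the mean of an $x$-array $\bar y_{x_p}$ and the total number in that array.

$$n_{x_p}\bar y_{x_p} = S(n_{x_p y_u}\,y_u),$$

$$n_{x_p}\,\delta\bar y_{x_p}\,\delta n_{x_p} = -\bar y_{x_p}(\delta n_{x_p})^2 + S\{\delta n_{x_p}\,\delta n_{x_p y_u}\,y_u\}.$$

Hence as before, using (xxiii.), &c.,

$$n_{x_p}\Sigma_{\bar y_{x_p}}\Sigma_{n_{x_p}}R_{\bar y_{x_p} n_{x_p}} = 0,$$

which proves that $R_{\bar y_{x_p} n_{x_p}}$ is zero.

Proposition IV. There is no correlation between deviations in the mean of an $x$-array and in the total number in any other array.

Proof as before.

Proposition V. There is no correlation between deviations in the mean of one $x$-array and in the mean of a second $x$-array.

We have

$$n_{x_p}\,\delta\bar y_{x_p} = S(\delta n_{x_p y_u}\,y_u) - \bar y_{x_p}\,\delta n_{x_p},$$

$$n_{x_{p'}}\,\delta\bar y_{x_{p'}} = S(\delta n_{x_{p'} y_u}\,y_u) - \bar y_{x_{p'}}\,\delta n_{x_{p'}}.$$

Multiply these two expressions together, sum for all random samples, and divide by the number of such samples. We find

$$n_{x_p}n_{x_{p'}}\Sigma_{\bar y_{x_p}}\Sigma_{\bar y_{x_{p'}}}R_{\bar y_{x_p}\bar y_{x_{p'}}} = -\frac{n_{x_p}n_{x_{p'}}\bar y_{x_p}\bar y_{x_{p'}}}{N} + \bar y_{x_p}\,S(n_{x_p}n_{x_{p'} y_u}\,y_u)/N + \bar y_{x_{p'}}\,S(n_{x_{p'}}n_{x_p y_u}\,y_u)/N - S\{n_{x_p y_u}n_{x_{p'} y_{u'}}\,y_u y_{u'}\}/N.$$

The last term is $-n_{x_p}\bar y_{x_p}\,n_{x_{p'}}\bar y_{x_{p'}}/N$, and thus the right-hand side is identically zero. It thus appears that there is no correlation between errors made in finding the means of two arrays. This result is not at once obvious, although a very little consideration shows it must be true.

Proposition VI. To prove that the standard deviation of the mean $\bar y_{x_p}$ of any $x$-array due to random sampling equals $\sigma_{n_{x_p}}/\sqrt{n_{x_p}}$.

We have

$$n_{x_p}\,\delta\bar y_{x_p} = S(\delta n_{x_p y_u}\,y_u) - \bar y_{x_p}\,\delta n_{x_p}.$$

Square, sum for all random samples, and divide by the number of such samples. We have

$$n_{x_p}^2\Sigma_{\bar y_{x_p}}^2 = \bar y_{x_p}^2\,n_{x_p}\Big(1-\frac{n_{x_p}}{N}\Big) - 2\bar y_{x_p}\,S\Big\{n_{x_p y_u}\Big(1-\frac{n_{x_p}}{N}\Big)y_u\Big\} - 2S\Big\{\frac{n_{x_p y_u}n_{x_p y_{u'}}}{N}\,y_u y_{u'}\Big\} + S\Big\{n_{x_p y_u}\Big(1-\frac{n_{x_p y_u}}{N}\Big)y_u^2\Big\} = n_{x_p}\sigma_{n_{x_p}}^2.$$

Hence

$$\Sigma_{\bar y_{x_p}} = \sigma_{n_{x_p}}/\sqrt{n_{x_p}} \qquad \text{(xxiv.)}$$

Thus the probable error of the mean of an array has exactly the same form as the probable error of the mean of a random sample of a definite number of individuals. The array may have a variable number of individuals, but we have seen in Proposition III. that there is no correlation between errors in its mean and errors in the total number of individuals contained in it.

Problem VII. To find the probable error of the standard deviation of any array.

By a precisely similar investigation to that of the previous proposition we find

$$\Sigma_{\sigma_{n_{x_p}}}^2 = \frac{\mu_4-\mu_2^2}{4\mu_2\,n_{x_p}} \qquad \text{(xxv.)},$$

where $\mu_2$ and $\mu_4$ are the 2nd and 4th moments of the array about its mean. This is identical with the probable error we should have if the array were a random sample of constant size. In many cases it will be sufficiently approximate to put $\mu_4 = 3\mu_2^2$ and we then have

$$.67449\,\Sigma_{\sigma_{n_{x_p}}} = .67449\,\frac{\sigma_{n_{x_p}}}{\sqrt{2n_{x_p}}} \qquad \text{(xxvi.)},$$

the well-known form for the probable error of the standard deviation of a normal distribution of a definite number of individuals.

Problem VIII. To find the standard deviation of the standard deviation $\sigma_M$ of the means of the arrays due to random sampling.

Since

$$N\sigma_M^2 = S\{n_x(\bar y_x-\bar y)^2\},$$

$$2N\sigma_M\,\delta\sigma_M = S\{\delta n_x(\bar y_x-\bar y)^2\} + 2S\{n_x(\bar y_x-\bar y)\,\delta\bar y_x\} - 2\,\delta\bar y\,S\{n_x(\bar y_x-\bar y)\},$$

the last term of which vanishes, since $N\bar y = S(n_x\bar y_x)$. Square the above relation, sum for all random samples, and divide by the number of such samples. We find

$$\begin{aligned}
4N^2\sigma_M^2\Sigma_{\sigma_M}^2 ={}& S\Big\{n_x\Big(1-\frac{n_x}{N}\Big)(\bar y_x-\bar y)^4\Big\} - 2S\Big[\frac{n_x n_{x'}}{N}(\bar y_x-\bar y)^2(\bar y_{x'}-\bar y)^2\Big]\\
&+ 4S\{\Sigma_{n_x}\Sigma_{\bar y_x}R_{n_x\bar y_x}\,n_x(\bar y_x-\bar y)^3\} + 4S\{\Sigma_{n_x}\Sigma_{\bar y_{x'}}R_{n_x\bar y_{x'}}\,n_{x'}(\bar y_x-\bar y)^2(\bar y_{x'}-\bar y)\}\\
&+ 4S\{\Sigma_{\bar y_x}\Sigma_{\bar y_{x'}}R_{\bar y_x\bar y_{x'}}\,n_x n_{x'}(\bar y_x-\bar y)(\bar y_{x'}-\bar y)\} + 4S\{\Sigma_{\bar y_x}^2\,n_x^2(\bar y_x-\bar y)^2\}.
\end{aligned}$$

But $R_{n_x\bar y_x}$, $R_{n_x\bar y_{x'}}$, and $R_{\bar y_x\bar y_{x'}}$ vanish by Propositions III., IV., and V. Further, by VI., $\Sigma_{\bar y_x} = \sigma_{n_x}/\sqrt{n_x}$. Hence we have

$$4N^2\sigma_M^2\Sigma_{\sigma_M}^2 = S\Big\{n_x\Big(1-\frac{n_x}{N}\Big)(\bar y_x-\bar y)^4\Big\} - 2S\Big[\frac{n_x n_{x'}}{N}(\bar y_x-\bar y)^2(\bar y_{x'}-\bar y)^2\Big] + 4S\{n_x\sigma_{n_x}^2(\bar y_x-\bar y)^2\}.$$

[…]

[…] $(\mu_4-3\mu_2^2)/\mu_2^2$, $(\lambda_4-3\lambda_2^2)/\lambda_2^2$ will probably be small and thus

$$\text{Probable error of } \eta = .67449\,(1-\eta^2)/\sqrt{N}, \text{ nearly} \qquad \text{(xxxiv.)}$$

This simple form suffices for many practical cases. If greater exactitude is wanted, there is, however, no great labour in using (xxxiii.). We find the means and standard deviations of each array. Then $N\lambda_2$ and $N\lambda_4$ are the 2nd and 4th moments of the means of these arrays about their mean. $N\mu_2$ and $N\mu_4$ are the 2nd and 4th moments about the mean of the $y$-characters, and will always be known for skew variation. $\chi_1$ is defined by

$$\chi_1 = \frac{S\{n_x\sigma_{n_x}^2(\bar y_x-\bar y)^2\}}{N(1-\eta^2)\,\sigma_M^2\sigma_y^2} \qquad \text{(xxxv.)},$$

and can be easily found when the means and standard deviations of each array have been found.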
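The quantities of sections (2.) and (3.) can be gathered into one short computation. The sketch below is an illustration on invented data, not one of the memoir's worked examples: it finds $r$ and $\eta$, checks the identity (xix.) numerically, and applies the approximate form (xxxiv.). With only six pairs the probable error is of course of no practical meaning; it is computed merely to show the arithmetic. The function name `correlation_summary` is hypothetical.

```python
from math import sqrt


def correlation_summary(pairs):
    """Return r, eta, both sides of (xix.) and the approximate probable error of eta."""
    N = len(pairs)
    x_bar = sum(x for x, _ in pairs) / N
    y_bar = sum(y for _, y in pairs) / N
    var_x = sum((x - x_bar) ** 2 for x, _ in pairs) / N
    var_y = sum((y - y_bar) ** 2 for _, y in pairs) / N
    cov = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / N
    r = cov / sqrt(var_x * var_y)

    arrays = {}
    for x, y in pairs:
        arrays.setdefault(x, []).append(y)

    # sigma_M^2: weighted mean square deviation of the array means, as in (xv.)
    var_m = sum(len(ys) * (sum(ys) / len(ys) - y_bar) ** 2 for ys in arrays.values()) / N
    eta = sqrt(var_m / var_y)

    # Both sides of (xix.); the line Y has slope r * sigma_y / sigma_x = cov / var_x.
    slope = cov / var_x
    lhs = N * (eta**2 - r**2) * var_y
    rhs = sum(
        len(ys) * (sum(ys) / len(ys) - (y_bar + slope * (x - x_bar))) ** 2
        for x, ys in arrays.items()
    )
    probable_error = 0.67449 * (1 - eta**2) / sqrt(N)  # approximate form (xxxiv.)
    return {"r": r, "eta": eta, "lhs": lhs, "rhs": rhs, "pe_eta": probable_error}


# Invented pairs whose array means rise and then fall, so regression is not linear.
summary = correlation_summary([(1, 1), (1, 3), (2, 4), (2, 6), (3, 1), (3, 3)])
```

For these pairs $r$ vanishes while $\eta$ does not, which illustrates why the correlation ratio is the safer measure when the regression is far from linear.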
The most troublesome expression is $\chi_2$ defined by

[…] (xxxvi.).

But as we do not take usually more than 10 to 20 arrays, the discovery of their 3rd moments is not an extremely difficult task. As a rule, however, $\chi_2$ is very small and may be fairly neglected, even when we must find $\chi_1 - 1$. All these points will be dealt with in the numerical illustrations given later in this paper. At present we note that the probable error of $\eta$ has been determined, and that its value for the general case is not really more complex than the value of the probable error of $r$ in the general case, which requires the determination of product moments of the 4th order.*

* Let $N p_{qq'} = S\{n_{xy}(x-\bar x)^q(y-\bar y)^{q'}\}$, then the probable error of $r$ is given by

$$.67449\,\frac{r}{\sqrt N}\Big\{\frac{p_{22}-3p_{11}^2}{p_{11}^2} + \frac{p_{22}-3p_{20}p_{02}}{2p_{20}p_{02}} + \frac{p_{40}-3p_{20}^2}{4p_{20}^2} + \frac{p_{04}-3p_{02}^2}{4p_{02}^2} - \frac{p_{31}-3p_{11}p_{20}}{p_{11}p_{20}} - \frac{p_{13}-3p_{11}p_{02}}{p_{11}p_{02}}\Big\}^{\frac12} \qquad \text{(xxxvii.)}$$

This agrees with the value given by Sheppard ('Phil. Trans.,' A, vol. 192, p. 128), except that the $r^2$ factor has been dropped by a printer's error in his paper. For the special case of a normal distribution, we have easily from the equation to the normal surface

$$p_{40} = 3p_{20}^2,\quad p_{04} = 3p_{02}^2,\quad p_{31} = 3p_{11}p_{20},\quad p_{13} = 3p_{11}p_{02},\quad (p_{22}-3p_{11}^2)/p_{11}^2 = (1-r^2)/r^2,$$

and the well-known form ('Phil. Trans.,' A, vol. 191, p. 245).

(4.) On the Higher Types of Regression.

We have already seen how the introduction of the correlation ratio $\eta$ enables us to drop the limitations associated with the Gauss-Laplacian form of frequency, and the Bravais correlation formulae. The fundamental step towards this advance was undoubtedly taken by G. U. Yule in his paper in the 'Roy. Soc. Proc.,' vol. 60, pp. 477 et seq., wherein he shows that if the regression be linear, the Bravais type of formula applied to multiple correlation is still true, although we make no assumption as to the form of the frequency surface.
It would undoubtedly be a gain to have skew frequency surfaces which would describe skew correlation for the great mass of cases as effectively as the series of skew frequency curves describe skew variation, but although a considerable amount of progress has been made in the consideration of these surfaces, their full theory has not yet been worked out owing to difficulties of analysis, and their complete discussion must still be postponed.

Yule's method of approaching the problem from the form of the regression curves is, however, available and capable of very great extension. Its chief advantage is that it makes little or no assumption as to the distribution of frequency; its chief defect lies even in this advantage of generality: it does not enable us to predict the probability of an individual with a given combination of characters. This follows at once from the fact that we make no assumption as to the form of the distribution within an array. Without some theory as to variation within the array, we are reduced to the laborious process of calculating the standard deviation, skewness, and other general characters of each array, a lengthy and troublesome process compared with a theory which would, like the Bravais theory, give these at once in terms of a few constants determined from the data as a whole.

In the great bulk of biometrical and economical enquiries, however, the regression does not diverge very markedly from the linear form. In the cases of non-linear regression that I have hitherto had to deal with, I find that parabolas of the 2nd or 3rd order will suffice as a rule to describe the deviation from linearity.
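The use of parabolas of low order to describe a regression curve can be sketched as follows. The block fits polynomials of a chosen degree to the means of the arrays, each mean being loaded with the number in its array. This is ordinary weighted least squares solved through its normal equations, offered only as a stand-in illustration; it is not the moment-based procedure of the memoir, and the array means and frequencies below are invented. The function name `fit_regression` is hypothetical.

```python
from fractions import Fraction


def fit_regression(xs, means, counts, degree):
    """Weighted least-squares polynomial through the array means.

    Returns the coefficients [a0, a1, ..., a_degree], lowest power first,
    computed exactly with rational arithmetic.
    """
    m = degree + 1
    xs = [Fraction(x) for x in xs]
    means = [Fraction(y) for y in means]
    # Normal equations: sum_x n_x x^(i+j) a_j = sum_x n_x x^i ybar_x
    A = [[sum(n * x ** (i + j) for x, n in zip(xs, counts)) for j in range(m)]
         for i in range(m)]
    b = [sum(n * x**i * y for x, y, n in zip(xs, means, counts)) for i in range(m)]
    # Gauss-Jordan elimination on the normal equations.
    for col in range(m):
        pivot = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(m):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [u - factor * v for u, v in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return [b[i] / A[i][i] for i in range(m)]


# Invented array means lying exactly on 1 + 2x + x^2, with invented frequencies.
xs, means, counts = [0, 1, 2, 3], [1, 4, 9, 16], [5, 10, 10, 5]
line = fit_regression(xs, means, counts, 1)
parabola = fit_regression(xs, means, counts, 2)
cubic = fit_regression(xs, means, counts, 3)
```

Here the parabola reproduces the means exactly and the cubic adds nothing, its highest coefficient being zero; with real arrays the comparison would have to be judged against the probable errors of the constants.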
If they did not, we could, of course, use curves of higher orders, but the difficulty referred to in the first section of this paper at once arises : we then need to use in the determination moments and product-moments of such high orders that the probable errors of the constants are so high as to render valueless their calculation from such statistical data as we can hope for in most actual inquiries. In th,-; great bulk of investigations it is practically impossible to increase our random samples from 500 to 1,000 individuals up to 50,000 to 100,000. Nor in the great bulk of statistical cases is any such increase even desirable, for a fairly wide experience shows that 2"* and .S'** order parabolae amply suffice to describe the skewness of the regression line. I shall accordingly classify skew correlation in the following manner : — 22 PROFESSOR K. PEARSON ON THE GENERAL THEORY OF {a. ) Linear Regression : The mean of an x-arraj of y's, i.e., y^^, is given by ya:,=ci'o+<^r^P (xxxviii.). (&.) Parabolic^ Regression : The mean of an a;-array of y's, i.e., y^^, is given by Vx^aQ-^-aiXp-^-a^Xp^ (xxxix.). {c.) Cubical* Regression : The mean of an a^-array of ^s, i.e., y^^, is given by ya:=ao+<^iXp+a^Xp^+a^Xp^ (xL). It is conceivable— in fact, from unpublished work already done, highly probable — that the theory of skew variation will give regression curves, not of the exact form involved in (xxxix.) or (xl.), but containing product terms in x and y. The most general equation to a regression curve may be taken to be of the type and what experience shows us is : that for the great bulk of vital phenomena it is sufficient to expand by Maclaurin's theorem and keep the first three or four terms. Indeed, in the large majority of cases, (xxxviii.) alone suffices. Hence, if (xxxix.) or (xl.) 
fit the data within the limits of random sampling, we are not injudiciously circumscribing future developments of the theory of skew correlation by casting our regression curves into the above forms. I shall deal first with the theory of cubical regression, for we can then obtain from this the conditions necessary for parabolic and linear regressions.

I must remind the reader, however, that the form of the regression line does not in any way limit the nature of the distribution of the array about its mean; the variability of an array, i.e., the standard deviation of an array, having for its mean value $\sigma_y\sqrt{1-\eta^2}$, may or may not be the same for all arrays. If it is the same, or all arrays are equally scattered about their means, I shall speak of the system as a homoscedastic system, otherwise it is a heteroscedastic system. The Gauss-Laplacian correlation surface gives a homoscedastic linear system. Mr. Yule's linear regression is not necessarily homoscedastic; it may, however, be homoscedastic without being normal, and then the scatter of each array is measured by $\sigma_y\sqrt{1-r^2}$. When a system is homoscedastic, but not linear, then $\sigma_{n_{x_p}}^2 = \sigma_y^2(1-\eta^2)$, and consequently the $\chi_1$ of (xxxv.) is equal to unity. $\chi_1 = 1$ is a necessary result of homoscedasticity.

* 'Parabolic' and 'cubical' are here used in the narrower sense of regression curves corresponding to ordinary parabolae of the 2nd order and of the 3rd order respectively: in both cases the axis of the parabola being parallel to the axis of the $y$-character.

Lastly, we want a word to express the idea of all the arrays having equal skewness, or being asymmetrical in an equal degree about their means. I shall express this by the term homoclitic; generally the arrays will not be equally asymmetrical round their means, and in this case we shall speak of them as heteroclitic. If there were no skewness in any of the arrays, then $m_3$ of (xxxvi.)
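would be zero for all of them. I term arrays of no skewness isocurtic, and skew arrays allocurtic. If we supposed that a curve of Type III. would sufficiently express the skewness of an array, we should have $\mathrm{Sk.} = \tfrac12 m_3/\sigma_{n_{x_p}}^3$, and therefore from (xxxvi.) (the divisors of this and of the three following expressions are illegible in this copy)
$$\chi_3 = 2S\{n_{x_p}\sigma_{n_{x_p}}^3(\mathrm{Sk.})(y_{x_p}-\bar y)\}\,/\,[\ldots].$$
For a homoscedastic system we have $\sigma_{n_{x_p}} = \sigma_y\sqrt{1-\eta^2}$, and therefore
$$\chi_3 = 2S\{n_{x_p}(\mathrm{Sk.})(y_{x_p}-\bar y)\}\,/\,[\ldots],$$
and for a homoclitic system
$$\chi_3 = 2(\mathrm{Sk.})\,S\{n_{x_p}\sigma_{n_{x_p}}^3(y_{x_p}-\bar y)\}\,/\,[\ldots].$$
For a homoclitic homoscedastic system, whether isocurtic or allocurtic,
$$\chi_3 = 2(\mathrm{Sk.})\,S\{n_{x_p}(y_{x_p}-\bar y)\}\,/\,[\ldots] = 0.$$
Thus $\chi_3$ is to a certain extent a measure of both homoscedasticity and homoclisy. But as the correlation between $\sigma_{n_{x_p}}$ and $y_{x_p}-\bar y$ is in most cases extremely small, while the skewness of the array can well change its sign with arrays above or below the mean, we can fairly consider the smallness of $\chi_3$ to be a measure of the approach to homoclisy. I am thus inclined to speak of $\chi_1 - 1$ and $\chi_3$ as measures of heteroscedasticity and heteroclisy. When they both vanish we have a homoscedastic homoclitic system. For such systems $\eta$, the correlation ratio, tells us effectively the scatter of any array, and as a rule all we want to know, in addition, is the form of the regression line.

(5.) Cubical Regression.

We have already used the following notation
$$N p_{qq'} = S\{n_{xy}(x-\bar x)^q(y-\bar y)^{q'}\} \quad\text{(xlii.)}.$$
We shall shorten our formulae if we write
$$r = \frac{p_{11}}{\sigma_x\sigma_y},\qquad \epsilon = \frac{p_{21}}{\sigma_x^2\sigma_y} - \sqrt{\beta_1}\,r,\qquad \zeta = \frac{p_{31}}{\sigma_x^3\sigma_y} - \beta_2\,r.$$
[A passage, including equations (xliii.) to (xlix.), is illegible in this copy; it ends with equation (l.), whose right-hand side closes in $+\,b_3\phi_3$,] where
$$\phi_3 = (\beta_3 - \beta_1\beta_2 - \beta_1)/\sqrt{\beta_1} \quad\text{(li.)}.$$

* 'Phil. Trans.,' A, vol. 186, p. 368, and A, vol. 198, p. 278.

Eliminating $b_2$, we can write (xlix.)
$$\frac{Y_{x_p}}{\sigma_y} = r\frac{X_p}{\sigma_x} + \frac{\epsilon}{\phi_2}\Big\{\Big(\frac{X_p}{\sigma_x}\Big)^2 - \sqrt{\beta_1}\frac{X_p}{\sigma_x} - 1\Big\} + b_3\Big[\Big(\frac{X_p}{\sigma_x}\Big)^3 - \beta_2\frac{X_p}{\sigma_x} - \sqrt{\beta_1} - \frac{\phi_3}{\phi_2}\Big\{\Big(\frac{X_p}{\sigma_x}\Big)^2 - \sqrt{\beta_1}\frac{X_p}{\sigma_x} - 1\Big\}\Big] \quad\text{(lii.)}.$$
Now multiply by $n_{x_p}(X_p/\sigma_x)^3$ and sum for all arrays; we find [two equations illegible in this copy], where
$$\phi_4 = \beta_4 - \beta_2^2 - \beta_1 \quad\text{(liv.)}.$$
It follows from (l.) that
$$b_3 = \frac{\zeta\phi_2 - \epsilon\phi_3}{\phi_2\phi_4 - \phi_3^2} \quad\text{(liii.)},$$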
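$$b_2 = \frac{\epsilon\phi_4 - \zeta\phi_3}{\phi_2\phi_4 - \phi_3^2} \quad\text{(lv.)}.$$
We can thus write the cubic regression curve in either of the forms*
$$\frac{Y_{x_p}}{\sigma_y} = r\frac{X_p}{\sigma_x} + \frac{\epsilon}{\phi_2}\Big\{\Big(\frac{X_p}{\sigma_x}\Big)^2 - \sqrt{\beta_1}\frac{X_p}{\sigma_x} - 1\Big\} + \frac{\zeta\phi_2 - \epsilon\phi_3}{\phi_2\phi_4 - \phi_3^2}\Big[\Big(\frac{X_p}{\sigma_x}\Big)^3 - \beta_2\frac{X_p}{\sigma_x} - \sqrt{\beta_1} - \frac{\phi_3}{\phi_2}\Big\{\Big(\frac{X_p}{\sigma_x}\Big)^2 - \sqrt{\beta_1}\frac{X_p}{\sigma_x} - 1\Big\}\Big] \quad\text{(lvi.)},$$
or
$$\frac{Y_{x_p}}{\sigma_y} = r\frac{X_p}{\sigma_x} + \frac{\epsilon\phi_4 - \zeta\phi_3}{\phi_2\phi_4 - \phi_3^2}\Big\{\Big(\frac{X_p}{\sigma_x}\Big)^2 - \sqrt{\beta_1}\frac{X_p}{\sigma_x} - 1\Big\} + \frac{\zeta\phi_2 - \epsilon\phi_3}{\phi_2\phi_4 - \phi_3^2}\Big\{\Big(\frac{X_p}{\sigma_x}\Big)^3 - \beta_2\frac{X_p}{\sigma_x} - \sqrt{\beta_1}\Big\} \quad\text{(lvi.) bis}.$$
The former arrangement of the solution, while it is apparently more cumbersome, is, perhaps, the better, for it gives us at once the measure of the deviation from parabolic or 2nd order regression, i.e., the approach of $\zeta\phi_2 - \epsilon\phi_3$ to zero.

* The method is perfectly easy of extension, if we choose to use higher products and moments, to a regression curve of any order, e.g.,
$$\frac{Y_{x_p}}{\sigma_y} = b_0 + b_1\frac{X_p}{\sigma_x} + b_2\Big(\frac{X_p}{\sigma_x}\Big)^2 + \ldots + b_n\Big(\frac{X_p}{\sigma_x}\Big)^n.$$
Writing $\gamma_s$ for the $s$th moment of $x$ in terms of $\sigma_x$ and $\epsilon_{p1}$ for $p_{p1}/(\sigma_x^p\sigma_y)$, we have
$$\begin{aligned}
\epsilon_{01} &= \gamma_0 b_0 + \gamma_1 b_1 + \gamma_2 b_2 + \gamma_3 b_3 + \ldots + \gamma_n b_n,\\
\epsilon_{11} &= \gamma_1 b_0 + \gamma_2 b_1 + \gamma_3 b_2 + \gamma_4 b_3 + \ldots + \gamma_{n+1} b_n,\\
\epsilon_{21} &= \gamma_2 b_0 + \gamma_3 b_1 + \gamma_4 b_2 + \gamma_5 b_3 + \ldots + \gamma_{n+2} b_n,\\
&\;\;\vdots\\
\epsilon_{p1} &= \gamma_p b_0 + \gamma_{p+1} b_1 + \gamma_{p+2} b_2 + \gamma_{p+3} b_3 + \ldots + \gamma_{n+p} b_n.
\end{aligned}$$
Hence writing $\epsilon_{01} = 0$, $\gamma_0 = 1$, $\gamma_1 = 0$, $\gamma_2 = 1$, we have
$$b_n = (\epsilon_{01}\Delta_{0n} + \epsilon_{11}\Delta_{1n} + \epsilon_{21}\Delta_{2n} + \ldots + \epsilon_{p1}\Delta_{pn} + \ldots)/\Delta,$$
where
$$\Delta = \begin{vmatrix}
\gamma_0 & \gamma_1 & \gamma_2 & \gamma_3 & \ldots & \gamma_n\\
\gamma_1 & \gamma_2 & \gamma_3 & \gamma_4 & \ldots & \gamma_{n+1}\\
\gamma_2 & \gamma_3 & \gamma_4 & \gamma_5 & \ldots & \gamma_{n+2}\\
\vdots & & & & & \vdots\\
\gamma_p & \gamma_{p+1} & \gamma_{p+2} & \gamma_{p+3} & \ldots & \gamma_{p+n}
\end{vmatrix},$$
and $\Delta_{qn}$ is the minor of the constituent in the $(q+1)$th row and $(n+1)$th column. As we have already noted, however, solutions involving anything beyond $\gamma_5$ are hardly likely to be of practical value. The value above for $b_n$ is the type equation given by the method of least squares, when we strike the best fitting curve to all the entries in the correlation table. I have already pointed out that the method of moments becomes identical with that of least squares, when we fit parabolae of any order ('Biometrika,' vol. I., p. 271). The retention of the method of moments, however, enables us, without abrupt change of method, to introduce the needful $\eta$, and to grasp at once the application of the proper Sheppard's corrections. The extension of the method of least squares to continua in space has not yet, as far as I am aware, been fully considered.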
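The statement in the footnote, that the moment equations and least squares give the same parabola of any order, can be checked numerically. The following is only a sketch under stated assumptions: the data are synthetic, the function name `poly_regression_by_moments` is hypothetical, and NumPy's least-squares polynomial fit stands in for "the method of least squares."

```python
import numpy as np

def poly_regression_by_moments(x, y, order):
    """Solve the moment equations for Y/sigma_y = sum_k b_k (X/sigma_x)**k."""
    u = (x - x.mean()) / x.std()          # X / sigma_x
    v = (y - y.mean()) / y.std()          # Y / sigma_y
    # gamma_s: moments of x in terms of sigma_x (gamma_0 = 1, gamma_1 = 0, gamma_2 = 1)
    gamma = np.array([np.mean(u**s) for s in range(2 * order + 1)])
    # eps_p1: product-moments in terms of sigma_x**p * sigma_y (eps_01 = 0)
    eps = np.array([np.mean(u**p * v) for p in range(order + 1)])
    # Matrix of the moment equations: entry (p, q) is gamma_{p+q}
    hankel = np.array([[gamma[p + q] for q in range(order + 1)]
                       for p in range(order + 1)])
    return np.linalg.solve(hankel, eps)   # b_0, b_1, ..., b_order

# Synthetic, deliberately skew illustration data (not from the memoir).
rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=1.0, size=400)
y = 1.0 + 0.8 * x - 0.05 * x**2 + rng.normal(scale=0.5, size=x.size)

b_moments = poly_regression_by_moments(x, y, order=3)

# Ordinary least squares on the same standardised variables.
u = (x - x.mean()) / x.std()
v = (y - y.mean()) / y.std()
b_least_squares = np.polynomial.polynomial.polyfit(u, v, deg=3)

print(np.allclose(b_moments, b_least_squares))
```

The two coefficient vectors agree because the moment equations are the normal equations of the least-squares problem divided by $N$.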
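In the case of normal correlation both $\epsilon$ and $\zeta$ vanish, and neglecting higher terms the condition for linear regression is that $\epsilon = 0$, and $\zeta\phi_2 - \epsilon\phi_3 = 0$, or, again, $\epsilon$ and $\zeta = 0$. For material in which the $x$-variability is isocurtic, the odd moments of $x$ vanish, so that $\sqrt{\beta_1} = 0$ and $\phi_3 = 0$, and the regression curve takes the simple form
$$\frac{Y_{x_p}}{\sigma_y} = r\frac{X_p}{\sigma_x} + \frac{\epsilon}{\phi_2}\Big\{\Big(\frac{X_p}{\sigma_x}\Big)^2 - 1\Big\} + \frac{\zeta}{\phi_4}\Big\{\Big(\frac{X_p}{\sigma_x}\Big)^3 - \beta_2\frac{X_p}{\sigma_x}\Big\} \quad\text{(lvi.) ter}.$$

We now turn to express these relations in terms of the correlation ratio $\eta$. Multiply (lvi.) by $n_{x_p}Y_{x_p}/\sigma_y$, and sum for all arrays, we obtain [intermediate equation illegible in this copy], whence results that
$$\eta^2 = r^2 + \frac{\epsilon^2}{\phi_2} + \frac{(\zeta\phi_2 - \epsilon\phi_3)^2}{\phi_2(\phi_2\phi_4 - \phi_3^2)} \quad\text{(lvii.)}$$
is a necessary condition of cubical regression. It is of course not a sufficient condition, as we ought to show that $b_4$, $b_5$, &c., all vanish, and thus any number of conditions may be found. For example, multiply by $n_{x_p}X_p^4/\sigma_x^4$ and sum for all arrays, then [equation (lviii.) and the definition accompanying it are illegible in this copy] is also a necessary condition. But the high as well as complicated value of the probable errors of such expressions renders it idle to consider them in practice.

Substituting (lvii.) in (lvi.) we have:
$$\frac{Y_{x_p}}{\sigma_y} = r\frac{X_p}{\sigma_x} + \frac{\epsilon}{\phi_2}\Big\{\Big(\frac{X_p}{\sigma_x}\Big)^2 - \sqrt{\beta_1}\frac{X_p}{\sigma_x} - 1\Big\} \pm \sqrt{\frac{\phi_2(\eta^2 - r^2) - \epsilon^2}{\phi_2\phi_4 - \phi_3^2}}\,\Big[\Big(\frac{X_p}{\sigma_x}\Big)^3 - \beta_2\frac{X_p}{\sigma_x} - \sqrt{\beta_1} - \frac{\phi_3}{\phi_2}\Big\{\Big(\frac{X_p}{\sigma_x}\Big)^2 - \sqrt{\beta_1}\frac{X_p}{\sigma_x} - 1\Big\}\Big] \quad\text{(lix.)}.$$
Which sign is to be given to the root will often be visible on inspection of the observations. Otherwise the sign of the root must be the same as that of $\zeta\phi_2 - \epsilon\phi_3$. (lix.) will save the calculation of $\zeta$ if the root-sign can be found by inspection.

Finally there is a third form into which we may put the cubic. Eliminate [quantity illegible in this copy] from (lix.) by aid of (lvii.) and it becomes [equation (lx.) and the sentence following it are illegible in this copy]. Besides this, we require the first six moments of the independent variable $x$. Of course if the regression of $x$ on $y$ be required, as well as that of $y$ on $x$, the second correlation ratio and cubic product as well as the first six moments of $y$ must be found. It is rare, however, that both regression curves are needed for a single enquiry.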
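The sense of condition (lvii.) can be illustrated numerically. On the reading of the constants used here, an assumption drawn from the worked illustrations ($\epsilon = p_{21}/(\sigma_x^2\sigma_y) - \sqrt{\beta_1}\,r$ and $\zeta = p_{31}/(\sigma_x^3\sigma_y) - \beta_2 r$), the right-hand side of (lvii.) is the share of the variance of $y$ accounted for by the best-fitting cubic, and $\eta^2$ can equal it only when the means of the arrays lie exactly on that cubic. The sketch below uses synthetic data, and the name `cubic_fit_share` is hypothetical.

```python
import numpy as np

def cubic_fit_share(x, y):
    """r^2 + eps^2/phi2 + (zeta*phi2 - eps*phi3)^2 / (phi2*(phi2*phi4 - phi3^2))."""
    u = (x - x.mean()) / x.std()
    v = (y - y.mean()) / y.std()
    # g3 = sqrt(beta1), g4 = beta2, g5 = beta3/sqrt(beta1), g6 = beta4
    g3, g4, g5, g6 = (np.mean(u**k) for k in (3, 4, 5, 6))
    phi2 = g4 - g3**2 - 1.0
    phi3 = g5 - g3 * g4 - g3
    phi4 = g6 - g4**2 - g3**2
    r = np.mean(u * v)
    eps = np.mean(u**2 * v) - g3 * r
    zeta = np.mean(u**3 * v) - g4 * r
    return (r**2 + eps**2 / phi2
            + (zeta * phi2 - eps * phi3)**2 / (phi2 * (phi2 * phi4 - phi3**2)))

# Synthetic grouped data: seven arrays whose means do not lie on a cubic.
rng = np.random.default_rng(1)
x = rng.integers(1, 8, size=600).astype(float)
y = np.sin(x) + rng.normal(scale=0.4, size=x.size)

share = cubic_fit_share(x, y)

# Share of variance explained by a least-squares cubic, for comparison.
u = (x - x.mean()) / x.std()
coef = np.polynomial.polynomial.polyfit(u, y, deg=3)
resid = y - np.polynomial.polynomial.polyval(u, coef)
r_squared = 1.0 - resid.var() / y.var()

# Squared correlation ratio from the means of the arrays.
levels = np.unique(x)
means = np.array([y[x == k].mean() for k in levels])
counts = np.array([(x == k).sum() for k in levels])
eta2 = np.sum(counts * (means - y.mean())**2) / (y.size * y.var())

print(np.isclose(share, r_squared), eta2 >= share)
```

The excess of `eta2` over `share` is the part of the scatter of the array means that no cubic can describe.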
As to the general form of (lix.), we note that there will always be a real point of inflexion given by [equation (lxi.) and the definition accompanying it are illegible in this copy], and further that there may be two points of horizontality given by a certain quadratic. Thus, in general, the regression line will tend to be part of an S-shaped curve. The horizontal points may be imaginary, or, if real, either they or the point of inflexion may be far beyond the portion of the curve which crosses the observed field of frequency. If we consider, however, the slope of the regression curve to measure the regression in the neighbourhood of any point, we note that the regression is a maximum at the point given by (lxi.), and grows smaller and smaller towards the two points of horizontality, i.e., points of complete local independence of the two characters. These are not unfamiliar features in certain practical cases of skew correlation,* and accordingly the cubic regression curve provides us with a ready means of describing regression phenomena, which cannot be dealt with by the simple line or the parabola.

It may of course be suggested that a quartic or quintic curve would give a better result than a cubic. The answer to this is: Possibly, but the high moments and products required render it impossible to deal even superficially with the probable errors of the constants involved. The calculation of the probable error of $\eta$ is a sufficiently stiff task in the general case. To test the probable error of a condition like (lvii.), to say nothing of one like (lviii.), would involve an immense amount of work, since we should want the correlation of errors in $\eta$, $\epsilon$, $\zeta$, and $\theta$. Speaking with some experience of practical statistical possibilities, I think the tendency to use very high moments or product-moments must be curtailed to the minimum of actual needs. We cannot deny the existence of skew variation, nor of the sensible curvature of regression lines.
We must admit their existence as the result of statistical experience. This existence involves a great widening of the old frequency notions and the need for a new means of description. But we must remember that statistics are essentially a practical study, the art of describing by a few numerical constants observational experience, and we must curtail at every turn the desire to run riot in mathematical formulae, which cannot be generally applied in actual practice.† Still I propose later in this paper to deal with the general formulae for quartic regression.

* Compare for example the regression line of mean age of bridegroom for actual age of bride, which gives a typical S-shaped curve. See 'Biometrika,' vol. II., p. 20.

† These remarks have special reference to the points dealt with on p. 6.

(6.) Parabolic Regression.

For a parabolic system $b_3$ must vanish, or nearly vanish. Hence we have from (liii.) and (lvii.)
$$\zeta\phi_2 - \epsilon\phi_3 = 0 \quad\text{(lxii.)},$$
$$\phi_2(\eta^2 - r^2) - \epsilon^2 = 0 \quad\text{(lxiii.)}.$$
From these conditions we find [equation illegible in this copy]. These give for the form of the parabolic regression curve
$$\frac{Y_{x_p}}{\sigma_y} = r\frac{X_p}{\sigma_x} + \frac{\epsilon}{\phi_2}\Big\{\Big(\frac{X_p}{\sigma_x}\Big)^2 - \sqrt{\beta_1}\frac{X_p}{\sigma_x} - 1\Big\} \quad\text{(lxiv.)},$$
or
$$\frac{Y_{x_p}}{\sigma_y} = r\frac{X_p}{\sigma_x} \pm \sqrt{\frac{\eta^2 - r^2}{\phi_2}}\Big\{\Big(\frac{X_p}{\sigma_x}\Big)^2 - \sqrt{\beta_1}\frac{X_p}{\sigma_x} - 1\Big\} \quad\text{(lxv.)}.$$
The latter form, besides the correlation coefficient and correlation ratio, requires only a knowledge of the skew variation constants $\beta_1$ and $\beta_2$, and is therefore very easy to determine. Except for very nearly linear regression, there can be no doubt as to the sign of $\sqrt{\eta^2 - r^2}$, as we can tell at once whether the parabola ought to be concave or convex to the $x$-axis. In other cases the sign of $\sqrt{\eta^2 - r^2}$ must be taken to coincide with that of $\epsilon$, which must therefore be found. It will then be as easy to use (lxiv.) as (lxv.), although probably $\eta$ and $r$ can be found with less error than $\epsilon$. It is thus quite easy to allow for such curvature of the regression line as can be expressed by a parabola of the 2nd order of the type considered.
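As an illustration of how little (lxv.) demands, the sketch below expands it into ordinary powers of $X_p$. The function name `parabola_lxv` and its argument names are hypothetical; the numerical constants are those printed for the woodruff data of Illustration A later in this memoir, and the concave sign of the root is chosen by inspection, as the text suggests.

```python
import math

def parabola_lxv(r, eta, sqrt_beta1, beta2, sigma_x, sigma_y, mean_y, concave=True):
    """Coefficients (c0, c1, c2) of y = c0 + c1*X + c2*X**2 obtained from (lxv.),
    X being measured from the mean of the x-character."""
    phi2 = beta2 - sqrt_beta1**2 - 1.0
    root = math.sqrt((eta**2 - r**2) / phi2)
    k = -root if concave else root        # sign of the root, settled by inspection
    c2 = sigma_y * k / sigma_x**2
    c1 = sigma_y * (r - k * sqrt_beta1) / sigma_x
    c0 = mean_y - sigma_y * k
    return c0, c1, c2

# Constants as printed for Illustration A (branches to the whorl in woodruff).
c0, c1, c2 = parabola_lxv(r=-0.207579, eta=0.249911, sqrt_beta1=0.130487,
                          beta2=1.828767, sigma_x=1.336887, sigma_y=0.897842,
                          mean_y=6.655375)
print(round(c0, 4), round(c1, 4), round(c2, 4))
```

To within rounding these are the coefficients of the parabola (b.) quoted for that illustration, $6.794{,}052 - .125{,}872\,X_p - .077{,}592\,X_p^2$.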
We notice at once that the regression curve does not pass through the mean of the two characters. Or, an individual with the mean of one character will most probably not have the mean of a second character. This is a rather important result, which follows at once for nearly all types of skew correlation. It will be seen, for example, that Quetelet's "mean man," defended by Professor Edgeworth as theoretically justifiable, depends entirely on human characters giving linear regression curves. Such linear curves are certainly given by many pairs of characters, e.g., cranial and body measurements, but there are certainly other characters for which regression ceases to be sensibly linear, and the conception of the "mean man" in this case fails. For example, if age be considered as a character, then the regression is certainly not linear, and the individual of mean age will not necessarily have either the mean physical or psychical characters. This seems of some importance for the general conception of "type," if by type we denote the mean, for probably there are other characters than age for which regression is skew.

The regression, i.e., $dY_{x_p}/dX_p$, will be zero for a point $X_p$ for which
$$\frac{X_p}{\sigma_x} = \tfrac12\Big\{\sqrt{\beta_1} \mp r\sqrt{\frac{\phi_2}{\eta^2 - r^2}}\Big\} \quad\text{(lxvi.)},$$
the sign of the root being determined as before. Clearly, therefore, unless $r$ be very small, or $\eta^2$ diverges very sensibly from $r^2$, this point of zero regression may correspond to a very large abscissa, and in some cases will lie entirely outside the range of observable frequency.

The parabola of regression cuts the line of regression, i.e., the line of best fit to the series of regression points, or to the means of the $x$-arrays, in two points determined by the quadratic equation
$$\Big(\frac{X_p}{\sigma_x}\Big)^2 - \sqrt{\beta_1}\frac{X_p}{\sigma_x} - 1 = 0,$$
or
$$\frac{X_p}{\sigma_x} = \tfrac12\{\sqrt{\beta_1} \pm \sqrt{\beta_1 + 4}\} \quad\text{(lxvii.)}.$$
These points are always real, and correspond, if regression be truly parabolic, to the same values of the $x$-character, whatever be the $y$-character of which we are considering the correlation.
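The two results just stated follow in a line or two from (lxv.). The working below is a sketch in shortened notation that is not the memoir's own: $u$ stands for $X_p/\sigma_x$ and $k$ for the signed root $\pm\sqrt{(\eta^2-r^2)/\phi_2}$. The numerical values use $\sqrt{\beta_1} = .130{,}487$ from the woodruff illustration.

```latex
% Parabola (lxv.) and regression line in the shortened notation
\frac{Y}{\sigma_y} = r\,u + k\,(u^2 - \sqrt{\beta_1}\,u - 1),
\qquad
\frac{Y}{\sigma_y}\Big|_{\text{line}} = r\,u .

% They meet where the bracket vanishes, whatever the value of k:
u^2 - \sqrt{\beta_1}\,u - 1 = 0
\;\Longrightarrow\;
u = \tfrac12\{\sqrt{\beta_1} \pm \sqrt{\beta_1 + 4}\}.

% The discriminant beta_1 + 4 is positive, so both roots are real; their
% product is -1, so one lies on each side of the mean.

% Zero regression: differentiate the parabola and equate to zero.
\frac{d}{du}\Big(\frac{Y}{\sigma_y}\Big) = r + k\,(2u - \sqrt{\beta_1}) = 0
\;\Longrightarrow\;
u = \tfrac12\Big\{\sqrt{\beta_1} - \frac{r}{k}\Big\}.

% Woodruff: sqrt(beta_1) = 0.130487 gives intersections near u = 1.067 and u = -0.937.
```

Because $k$ drops out of the first result, the crossing points depend on the $x$-distribution alone, which is the remark made in the text.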
In the case of normal variation of the $x$-character only, these are the points of inflexion of the $x$-distribution.

(7.) Linear Regression.

In this case it is necessary that both $b_2$ and $b_3$ vanish within the limits of random sampling, and, although these are not theoretically sufficient (for a whole series of relations between the higher product-moments could be written down*), they are for practical purposes sufficient. Hence we have the following conditions for linear regression:
$$\eta^2 = r^2 \quad\text{(lxviii.)},$$
or, the coefficient of correlation, without regard to sign, should be equal to the correlation ratio. Further $\epsilon$ should be zero, or
$$p_{21}p_{20} - p_{11}p_{30} = 0 \quad\text{(lxix.)}.$$
The theory of linear regression is so familiar that it need not be further discussed here. In the actual practice of statistics, the determination of the means of the $x$-arrays and the drawing of the regression line will often suffice to show the fairly trained eye whether the deviations from it are random or not. If they are not random, then we must proceed to the determination of $\eta$ and of the higher product-moments. The following are numerical examples of skew correlation, selected to illustrate the theory developed above.

* For example, it is necessary in most cases that $\zeta$ should vanish. In the instance of that very special case of linear regression, the Gauss-Laplacian normal frequency, it is easy to show that the constants $\epsilon$, $\zeta$ both vanish as well as $\eta^2 = r^2$.

Statistical Illustrations.

(8.) Illustration A. On the Skew Correlation between Number of Branches to the Whorl and Position of the Whorl on the Spray in the case of Asperula odorata.

In this case the material was collected in a lane near Horsham, Sussex, at Whitsuntide, 1903, by Miss M. Radford. There were 150 independent sprays, the woodruff had just flowered, and the whorls were counted from the flower downwards.
Being early in the season, the maximum number of whorls was five, and, in some cases, not even as many were available. The material was counted and tabled by the author, and the results are exhibited in the table below:

Table I. Correlation of Whorl-Branches and Position of Whorl.

                Number of branches in whorl
Whorl (x)       4     5     6     7     8     n_xp    y_xp     sigma_nxp   m_2     m_3
First           -     3    66    42    39     150    6.7800    .8553      .7316   .1535
Second          -     3    61    47    39     150    6.8133    .8437      .7117   .0985
Third           -     6    60    40    44     150    6.8133    .9047      .8185   .0383
Fourth          1    12    68    39    22     142    6.4859    .8780      .7709   .1347
Fifth           1    13    53    10    10      87    6.1724    .8605      .7404   .4049
Totals          2    37   308   178   154     679    6.6554      -          -       -

We require the regression curve giving the probable number of branches for a given whorl. Dealing first with the skew variation in position, a purely arbitrary system depending solely on the number of whorls dealt with in each position, we find, not using Sheppard's correction,*

Mean = 2.802,651, $\sigma_x$ = 1.336,887, $\nu_2$ = 1.787,268, $\nu_3$ = .311,783, $\nu_4$ = 5.841,682, $\nu_5$ = 2.799,638, $\nu_6$ = 22.678,308.

Hence we determine

$\beta_1$ = .017,027, $\beta_2$ = 1.828,767, $\beta_3$ = .085,545, $\beta_4$ = 3.972,295, $\phi_2$ = .811,740, $\phi_3$ = .286,465, $\phi_4$ = .610,879, and $\sqrt{\beta_1}$ = +.130,487.

* The numbers are tabulated to six places, because we cannot be sure that the final calculations are for the data true to two places, which is all we finally retain, unless this is done. Any number of figures can really be retained with perfect ease when the work is done on a calculator.

We now turn to the skew variation in the number of branches to the whorl, and get the following constants:

Mean = 6.655,375, $\mu_2$ = .806,124, $\sigma_y$ = .897,842, $\mu_3$ = .132,090, $\mu_4$ = 1.138,410.

The values of $y_{x_p}$, $m_2$, and $m_3$ are given in the table above. Using them we find

$\sigma_M$ = .224,377, $\eta$ = .249,911, $\sigma_a = \sigma_y\sqrt{1-\eta^2}$ = .869,355, $\lambda_2 = \sigma_M^2$ = .050,345, $\lambda_4$ = .007,474, $\chi_1$ = .990,862, $\chi_3$ = -.059,851.
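The leading constants above can be recovered directly from the cell frequencies of Table I. The following is a minimal sketch, assuming the frequencies as read from that table (with the two single entries for four branches placed in the fourth and fifth whorls, as the row totals require) and no Sheppard's correction; the variable names are hypothetical.

```python
import numpy as np

# Cell frequencies of Table I: rows = whorl position 1..5, columns = 4..8 branches.
table = np.array([
    [0,  3, 66, 42, 39],
    [0,  3, 61, 47, 39],
    [0,  6, 60, 40, 44],
    [1, 12, 68, 39, 22],
    [1, 13, 53, 10, 10],
], dtype=float)
x = np.arange(1, 6, dtype=float)     # position of whorl
y = np.arange(4, 9, dtype=float)     # number of branches

N = table.sum()
n_x = table.sum(axis=1)              # sizes of the x-arrays
n_y = table.sum(axis=0)

mean_x = (n_x * x).sum() / N
mean_y = (n_y * y).sum() / N
var_x = (n_x * (x - mean_x)**2).sum() / N
var_y = (n_y * (y - mean_y)**2).sum() / N

# Correlation ratio: standard deviation of the array means over sigma_y.
array_means = table @ y / n_x
var_of_means = (n_x * (array_means - mean_y)**2).sum() / N
eta = np.sqrt(var_of_means / var_y)

# Correlation coefficient from the first product-moment p11.
p11 = ((x - mean_x)[:, None] * (y - mean_y)[None, :] * table).sum() / N
r = p11 / np.sqrt(var_x * var_y)

print(round(mean_x, 6), round(mean_y, 6), round(eta, 4), round(r, 4))
```

The excess of `eta**2` over `r**2` is the quantity on which the choice between line and parabola turns in what follows.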
These give by (xxxiii.), showing the numerical contribution of each term,
$$\Sigma_\eta^2 = \frac{1}{N}\{.878{,}991 - .010{,}323 - .000{,}888 - .007{,}231 + .013{,}578\},$$
or the probable error of $\eta$ = .0242. Had we calculated the probable error of $\eta$ from (xxxiv.), we should have found for its value .0243. It is clear that for this special case the simple formula (xxxiv.) is amply sufficient, the small terms almost cancelling.

We see that $\chi_1$ is almost unity, and the graph of $\sigma_{n_{x_p}}/\sigma_y$ shows indeed that the system is sensibly homoscedastic. $\chi_3$ is small, but a glance at the graph of the clitic curve on Diagram I. shows that we can hardly treat the system as homoclitic, the changes in the skewness forming a fairly uniform curve.* For practical purposes, we may treat the variability of the number of branches in any array as sufficiently closely given by $\sigma_y\sqrt{1-\eta^2}$.

We now turn to the product-moments† and find

$p_{11}$ = -.249,160, $p_{21}$ = -.236,289, $p_{31}$ = -.896,415, $p_{41}$ = -1.210,225.

These lead to

$r$ = -.207,579, $\epsilon$ = -.120,164, $\zeta$ = -.038,241, $\theta$ = -.285,890.

* Throughout these illustrations the clitic curve is plotted by calculating the skewness of the arrays from $\tfrac12 m_3/(m_2)^{3/2}$. See p. 23.

† In calculating these products referred to the centroid from those referred to any axes, generally corresponding to whole numbers in the table, the following reduction formulae will be found useful. We take $N\Pi_{qq'} = S\{n_{xy}\,x'^q y'^{q'}\}$, $x'$ and $y'$ being measured from any axes; further, $\bar x'$, $\bar y'$ are the distances of the means from these axes, and $\nu_2$, $\nu_3$, $\nu_4$ the moments of the $x$-character about its mean as tabled above.
$$\begin{aligned}
p_{11} &= \Pi_{11} - \bar x'\Pi_{01},\\
p_{21} &= \Pi_{21} - 2\bar x'\Pi_{11} + \bar x'^2\Pi_{01} - \bar y'\nu_2,\\
p_{31} &= \Pi_{31} - 3\bar x'\Pi_{21} + 3\bar x'^2\Pi_{11} - \bar x'^3\Pi_{01} - \bar y'\nu_3,\\
p_{41} &= \Pi_{41} - 4\bar x'\Pi_{31} + 6\bar x'^2\Pi_{21} - 4\bar x'^3\Pi_{11} + \bar x'^4\Pi_{01} - \bar y'\nu_4.
\end{aligned}$$
The $p$'s should be further corrected for grouping by Sheppard's corrections (given on my p. 36), provided there be high contact at the contour of the surface of frequency. Sheppard's corrections have not in this case been used, as this condition is not fulfilled. The axes $x'$, $y'$ actually taken for woodruff were those through the third whorl and through six branches. An obvious warning about the signs of the sums of the products may be given which may save computators some trouble. The axes being taken positive, as in the accompanying figure, then the sums of the products for $\Pi_{11}$ and $\Pi_{31}$ are positive in the 1st and 3rd, negative in the 2nd and 4th quadrants. For $\Pi_{21}$ and $\Pi_{41}$ they are positive in the 1st and 4th quadrants and negative in the 2nd and 3rd quadrants. [Figure: axes $+x$ and $+y$ with the four quadrants numbered 1st to 4th.] In the figure the axes are taken so as to suit the $x$ and $y$-directions of the table on p. 31. Care must, of course, be paid to this point. The products may also be found from the $y_{x_p}$'s in the manner indicated on p. 35, footnote. They were thus verified in this case.

Thus all the constants are determined. We find
$$\eta^2 - r^2 = .019{,}367,$$
$$\phi_2(\eta^2 - r^2) - \epsilon^2 = .001{,}281,$$
$$\phi_2(\eta^2 - r^2) - \epsilon^2 - (\zeta\phi_2 - \epsilon\phi_3)^2/(\phi_2\phi_4 - \phi_3^2) = .000{,}276.$$
These should be respectively zero for linear, parabolic, and cubical regressions. It will be seen that they are satisfied with increasing closeness; we might well be satisfied even with the parabolic regression curve. The following are the regression curves determined, $y_{x_p}$ being the actual number of branches in the whorl ($= 6.655{,}375 + Y_{x_p}$), and $x_p$ the actual position of the whorl:

(a.) Straight line: $y_{x_p} = 7.046{,}087 - .139{,}408\,x_p$.

(b.) Parabola from (lxv.): $y_{x_p} = 6.794{,}052 - .125{,}872\,X_p - .077{,}592\,X_p^2$; or, $y_{x_p} = 6.853{,}561 - .077{,}592\,(x_p - 1.991{,}535)^2$. This clearly gives a maximum number of branches, 6.8536, corresponding to $x_p$ = 1.9915, a value within the limits of observation.

(c.) Cubic from (lix.): $y_{x_p} = 6.799{,}399 - .192{,}489\,X_p - .084{,}230\,X_p^2 + .020{,}915\,X_p^3$.

Here $X_p$ is measured from the mean position, $= x_p - 2.802{,}651$, and $y_{x_p}$ is, as before, the total number of branches for the given position. Condition (lvii.) is so closely satisfied that we shall here get sensibly as good results from (lix.) as from (lvi.). In the table below and in the curves of Diagram I. the values of the mean of the arrays, as found from line, parabola, and cubic, are given and compared with observation.

Table II. Mean Branches to each Whorl.

x_p =                   0       1       2       3       4       5       6
y_xp from line        7.046   6.907   6.767   6.628   6.488   6.349   6.210
y_xp from parabola    6.546   6.777   6.854   6.775   6.541   6.151   5.607
y_xp from cubic       6.117   6.750   6.889   6.758   6.443   6.192   6.007
Observed                -     6.780   6.813   6.813   6.486   6.172     -

I think we may safely say that in the relationship of branches to position of the whorl in woodruff we have a case of homoscedastic correlation, which is effectively described by a parabolic regression curve. Thus, in a case of this kind, it is only needful, besides the moments up to the fourth of the $x$-character, to find the correlation coefficient $r$ and the correlation ratio $\eta$.

(9.) Illustration B. On the Correlation between Age and Head Height in Girls.

The data for this are taken from my School Measurement series, and involve the auricular heights of 2272 girls between the ages of 3 and 22. There was considerable paucity of material at the extreme ends of the range, and accordingly, as our correlation curves are all obtained by weighting the observations, we can hardly expect good fits near 3 or 22 years of age. The actual correlation table is given as Table III. Sheppard's corrections were applied throughout, and the unit of height is 2 millims. In the first place the means, standard deviations, and 3rd moments of all the arrays of heights for different years of age were determined. These are given at the foot of Table III., but in actually calculating the constants more places of decimals were used. Then the first six moments of the frequency of the ages were found and the first four moments of the height frequencies. These are the $x$ and $y$-frequencies. They give us:
Table III. Correlation between Age and Auricular Height of Head in Girls. [To face page 34.]

[The cell frequencies of this fold-out table cannot be aligned in this copy; its marginal totals and the array constants printed at its foot are given below.]

Array constants for each year of age:

Age (years)   Total   Mean (1-millim. units)   Standard deviation (2-millim. units)   Third moment (2-millim. units)
 3-4             1        115.2500                      -                                    -
 4-5             7        116.9643                    2.8853                              - 42.822
 5-6            18        117.4722                    2.9276                              - 18.108
 6-7            40        119.1000                    2.9641                                 7.679 (sign illegible)
 7-8            76        120.3026                    2.9882                              +  1.782
 8-9           125        121.6340                    2.6366                              -  6.171
 9-10          177        121.7246                    3.3877                              + 15.893
10-11          235        122.8160                    2.9653                              +  2.330
11-12          261        123.1427                    3.2089                              +   .238
12-13          309        123.8908                    3.2061                              +  8.219
13-14          263        124.8622                    3.3589                              -  7.286
14-15          198        125.7146                    3.5865                              +  3.015
15-16          214        126.1565                    3.4663                              -  9.615
16-17          162        126.5340                    3.8696                              +  9.379
17-18           95        126.9132                    3.1679                              +  2.991
18-19           61        127.0205                    3.1235                              +  0.070
19-20           13        129.5577                    4.8406                              - 29.164
20-21            7        123.8214                    2.5311                              -  2.729
21-22            8        126.5000                    4.1414                              + 85.816
22-23            2        125.2500                     .9574                                 -
Totals        2272        124.0467                    3.4541                              +  5.206

Totals for each class of auricular height:

Height (millims.)    Total        Height (millims.)    Total
102.25-104.25           2         126.25-128.25         219
104.25-106.25          10         128.25-130.25         197
106.25-108.25          10         130.25-132.25         131
108.25-110.25          27         132.25-134.25          88
110.25-112.25          56         134.25-136.25          77
112.25-114.25          59         136.25-138.25          52
114.25-116.25         115         138.25-140.25          20
116.25-118.25         142         140.25-142.25          16
118.25-120.25         244         142.25-144.25          11
120.25-122.25         265         144.25-146.25           4
122.25-124.25         261         146.25-148.25           1
124.25-126.25         265         Total                2272

Height Constants. Mean height = 124.0467 millims.; in 2-millim. units, $\sigma_y$ = 3.454,125, $\mu_2$ = 11.930,977, $\mu_3$ = 5.206,247, $\mu_4$ = 438.639,633; $\beta'_1$ = .015,960, $\beta'_2$ = 3.081,454.

Age Constants. Mean age = 12.7007; in year units, $\sigma_x$ = 3.064,819, $\nu_2$ = 9.393,110, $\nu_3$ = 1.051,882, $\nu_4$ = 239.157,055, $\nu_5$ = 104.298,702, $\nu_6$ = 9536.265,059; $\beta_1$ = .001,335, $\beta_2$ = 2.710,593, $\beta_3$ = .014,093, $\beta_4$ = 11.506,681, $\sqrt{\beta_1}$ = +.036,538, $\phi_2$ = 1.709,258, $\phi_3$ = .250,123, $\phi_4$ = 4.158,032.

Further, $\sigma_M$ = 2.093,366 millims., and, in 1-millim. units, $\lambda_2$ = 4.382,181, $\lambda_4$ = 62.399,135. Hence
$$(\lambda_4 - 3\lambda_2^2)/(4\lambda_2^2) = .062{,}340.$$

In the next place the products were worked out and referred to the means with the following results:*

$p_{11}$ = 3.113,712, $p_{21}$ = -1.957,022, $p_{31}$ = 74.447,616, $p_{41}$ = -108.701,559,

whence

$r$ = .294,128, $\epsilon$ = -.071,065, $\zeta$ = -.048,576, $\theta$ = -.470,126.

Further, from $\sigma_M$, $\eta$ = .303,024.

* These products were in this case (as in all other cases) verified by calculating from the means of the arrays $y_{x_p}$ the expressions $S\{n_{x_p}Y_{x_p}(x_p - \bar x)^q\}/N$ for $q$ = 1, 2, 3, 4. Of course it is easiest to calculate these products about some arbitrary origin coinciding with the abscissa of one array. If these products be then $p'_{11}$, $p'_{21}$, $p'_{31}$, $p'_{41}$, and $\bar x'$ be the mean, we have
$$p_{11} = p'_{11},\quad p_{21} = p'_{21} - 2\bar x' p'_{11},\quad p_{31} = p'_{31} - 3\bar x' p'_{21} + 3\bar x'^2 p'_{11},\quad p_{41} = p'_{41} - 4\bar x' p'_{31} + 6\bar x'^2 p'_{21} - 4\bar x'^3 p'_{11}.$$

In deducing the product-moments after they had been referred to the means, the proper Sheppard's corrections were introduced. These are, if $\{p_{11}\}$, $\{p_{21}\}$, $\{p_{31}\}$, $\{p_{41}\}$ represent the uncorrected moments:
$$p_{11} = \{p_{11}\},\quad p_{21} = \{p_{21}\},\quad p_{31} = \{p_{31}\} - \tfrac14\{p_{11}\},\quad p_{41} = \{p_{41}\} - \tfrac12\{p_{21}\},$$
the units of grouping being the units throughout. From the constants for the arrays, I found

$\chi_1 - 1$ = -.000,675, $\chi_3$ = -.007,198.

Whence the probable error of $\eta$ was determined by (xxxiii.).
Its value was*

Probable error of $\eta$ = .012,913.

If found from the simple formula $.67449\,(1-\eta^2)/\sqrt N$, the value is .012,851. We accordingly are again forced to the conclusion that the probable error of $\eta$ may for practical purposes be found from this simple formula, instead of the complicated result (xxxiii.).

Although both $\chi_1 - 1$ and $\chi_3$ are small, it is very doubtful whether we can legitimately consider the system as homoscedastic. The dotted line ab of Diagram II. would fairly well represent increasing variability with age. The skewness of the arrays is relatively small and changes sign so frequently that we can certainly not attribute any law to such heteroclitic tendencies as there are. They are probably due to errors of random sampling from truly isocurtic material. It will be seen that the height frequencies with $\beta'_1$ = .0160 and $\beta'_2$ = 3.0815 do not differ very much from a normal distribution; in fact, we can lay no stress on the heteroclisy of the system at all. But the values of the standard deviations of the arrays, or the graph of $\sigma_{n_{x_p}}/\sigma_y$, certainly shows increasing variation with increasing age, a phenomenon with which one is familiar in a variety of other human characters.† This heteroscedasticity, due to increasing variation with growth, would hardly have been anticipated from a mere inspection of the smallness of $\chi_1 - 1$; it is somewhat obscured by the irregular values of the standard deviations of the small arrays at the adult end of the age range. The mean value of the standard deviation of the weighted arrays is $\sigma_y\sqrt{1-\eta^2}$ = 3.2992 in 2-millim. units.

* The contributions of the successive terms of (xxxiii.) are in fact given by
$$\Sigma_\eta^2 = \frac{1}{N}\{.824{,}785 + .001{,}870 + .004{,}673 - .000{,}472 + .001{,}888\}.$$

† See Pearson: 'The Chances of Death and other Studies of Evolution,' vol. I., pp. 296, 307, 310, 314.

We now turn to the regression curves to see how far the conditions for the different types are satisfied. We have
$$\eta^2 - r^2 = .005{,}312,$$
$$\phi_2(\eta^2 - r^2) - \epsilon^2 = .004{,}030,$$
$$\phi_2(\eta^2 - r^2) - \epsilon^2 - (\zeta\phi_2 - \epsilon\phi_3)^2/(\phi_2\phi_4 - \phi_3^2) = .000{,}604.$$
But the first should be zero, if the regression be linear; the second, if it be parabolic; and the third, if it be cubical. We see increasing approximation to fulfilment of the several conditions. Referred to axes through the mean age and head height, the following are the regression curves:*

(a.) Straight line: $Y_{x_p} = .662{,}979\,X_p$.

(b.) Parabola (from equation (lxv.)): $Y_{x_p} = .055{,}749 + .667{,}570\,X_p - .041{,}001\,X_p^2$.

(c.) Cubic (from equation (lvi.)): $Y_{x_p} = .280{,}194 + .722{,}886\,X_p - .029{,}580\,X_p^2 - .002{,}223\,X_p^3$.

(c'.) Cubic (from equation (lix.)): $Y_{x_p} = .296{,}076 + .812{,}249\,X_p - .028{,}004\,X_p^2 - .005{,}740\,X_p^3$.

(c') will not give as good results as (c), for it depends on a use of the condition (lvii.) which is not absolutely fulfilled. The following table gives the values in the case of the four curves:

Table IV. $y_{x_p}$ = Mean Auricular Height of Girl's Head at Given Age.

x_p = age   Regression line   Regression parabola†   Cubic (c)   Cubic (c')   Observed
  3.5           117.95             114.49              116.90      118.94      115.25
  4.5           118.61             115.87              117.66      118.94      116.96
  5.5           119.27             117.17              118.42      119.16      117.47
  6.5           119.94             118.39              119.24      119.57      119.10
  7.5           120.60             119.52              120.08      120.14      120.30
  8.5           121.26             120.57              120.93      120.84      121.63
  9.5           121.92             121.55              121.78      121.62      121.72
 10.5           122.59             122.43              122.62      122.45      122.82
 11.5           123.25             123.24              123.42      123.26      123.14
 12.5           123.91             123.97              124.18      124.15      123.89
 13.5           124.58             124.61              124.88      124.95      124.86
 14.5           125.24             125.17              125.52      125.65      125.71
 15.5           125.90             125.65              126.07      126.22      126.16
 16.5           126.57             126.05              126.52      126.68      126.53
 17.5           127.23             126.36              126.87      126.93      126.91
 18.5           127.89             126.59              127.09      126.96      127.02
 19.5           128.55             126.75              127.18      126.74      129.56
 20.5           129.22             126.81              127.11      126.22      123.82
 21.5           129.88             126.80              126.88      125.38      126.50
 22.5           130.54             126.71              126.48      124.28      125.25

* $Y_{x_p}$ is here measured in millimetres and $X_p$ in years.

† The maximum ordinate is at vertex of parabola, i.e., $X_p$ = 8.1409, or age 20.84; its magnitude = 126.82.
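The cubic (c) can be retraced from the printed moments. The sketch below assumes the reading of the constants that the printed values bear out ($\epsilon = p_{21}/(\sigma_x^2\sigma_y) - \sqrt{\beta_1}\,r$, $\zeta = p_{31}/(\sigma_x^3\sigma_y) - \beta_2 r$, with $b_2$ and $b_3$ as in (lv.) and (liii.)); the variable names are hypothetical, and heights are converted from 2-millim. units to millimetres at the end.

```python
import math

# Printed constants of Illustration B (age in years, height in 2-millim. units).
nu2, nu3, nu4, nu5, nu6 = 9.393110, 1.051882, 239.157055, 104.298702, 9536.265059
p11, p21, p31 = 3.113712, -1.957022, 74.447616
sigma_y = 3.454125
sigma_x = math.sqrt(nu2)

g3 = nu3 / sigma_x**3        # sqrt(beta_1)
beta2 = nu4 / nu2**2
g5 = nu5 / sigma_x**5        # beta_3 / sqrt(beta_1)
beta4 = nu6 / nu2**3

phi2 = beta2 - g3**2 - 1.0
phi3 = g5 - g3 * beta2 - g3  # (beta_3 - beta_1*beta_2 - beta_1) / sqrt(beta_1)
phi4 = beta4 - beta2**2 - g3**2

r = p11 / (sigma_x * sigma_y)
eps = p21 / (sigma_x**2 * sigma_y) - g3 * r
zeta = p31 / (sigma_x**3 * sigma_y) - beta2 * r

det = phi2 * phi4 - phi3**2
b2 = (eps * phi4 - zeta * phi3) / det
b3 = (zeta * phi2 - eps * phi3) / det

# Y/sigma_y = r*u + b2*(u^2 - g3*u - 1) + b3*(u^3 - beta2*u - g3), with u = X/sigma_x.
# Expand in powers of X (years), with Y in millimetres.
sy_mm = 2.0 * sigma_y
c0 = sy_mm * (-b2 - b3 * g3)
c1 = sy_mm * (r - b2 * g3 - b3 * beta2) / sigma_x
c2 = sy_mm * b2 / sigma_x**2
c3 = sy_mm * b3 / sigma_x**3

print(round(r, 4), round(eps, 4), round(zeta, 4))
print(round(c0, 3), round(c1, 3), round(c2, 4), round(c3, 4))
```

Adding the mean height, 124.0467 millims., to the value of the cubic at $X_p = x_p - 12.7007$ gives the entries of the column headed Cubic (c).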
An examination of this table and the graphs on Diagram II. seems to show:

(i.) That cubic (c) is considerably better than cubic (c').

(ii.) That we do get a sensible betterment in passing from parabola to cubic, and, accordingly, that we must use in this case the cubic to effectively describe the regression within the range of observation.

Probably neither cubic nor parabola would effectively serve for extrapolation even close to the limits of observation. Thus the cubic (c') starting at 3-4 with its point of inflection is clearly inadmissible, and the drop after 20 or 21 years of age, shown by both parabola and cubic, is, of course, only due to the anomalous character of the few girls over 18 left in the schools. Actually the shrinkage of measurements does not begin till at least 26 years, and is then far more gradual than these curves indicate. But, as in all fitting of this kind, we obtain the best fit we can within the range, entirely at the expense of what may occur just outside the range. For this reason, as E. Perrin* has pointed out, a good interpolation curve is usually a bad extrapolation curve.

We might sum up our results for auricular height with age in girls by saying: That the correlation is non-linear, effectively cubic; heteroscedastic, there being increasing variability with growth; that while the total height frequency is not very far from normal the array frequencies are slightly heteroclitic, but so very irregular in sign, that probably we are dealing with a case of isocurtic homoclisy, to which the sparsity of data in the extreme arrays gives an appearance of anomic heteroclisy.

(10.) Illustration C. On the Skew Correlation between Size of Cell and Size of Body in Daphnia magna.

Dr. E. Warren has dealt with this point in a memoir published in 'Biometrika,' vol. II., pp. 255-9.
The resulting regression curve of size of cell for given size of body is very far from linear, and it is quite clear that the correlation is skew. It has already been noted in 'Biometrika' that the relationship is considerably obscured by the irregularities produced by ecdysis. Our object at present, however, is purely theoretical, namely, to show how a certain system of constants and of curves describes the actual correlationship, and for this purpose Dr. Warren's observations form as good material for graduation as we could expect to find. The following Table V. gives the observations with the working scales attached. I must refer to Dr. Warren's paper (p. 256) for the relation between the units of grouping on the working scales and those of the actual measurements on body and cell lengths. As far as correcting the raw moments is concerned, Sheppard's corrections were used for the cell sizes, but not for the body lengths, because the number of individuals in the latter case was perfectly arbitrary and there is no approach to high contact. The product moments were also uncorrected. The product moments were found in both ways (see p. 35, footnote) and the results thus verified.

* 'Biometrika,' vol. III., p. 99.

Table V. gives the means, standard deviations, and third moments of the arrays; the latter are all small and superficially irregular in sign. I think we may say that there is no marked and continuous heteroclisy. On the other hand, I think we may say that while the clitic curve deviates to and fro from a zero base, the scedastic curve would fit better to a parabolic curve than to the straight line which is its mean. In other words, the variability of the cells increases with size of body (i.e., growth) up to a certain stage and then decreases again. This result is obscured by the fall of the variability after each ecdysis.
Roughly the ecdyses produce a rhythm in all three curves, the regression curve, the scedastic curve, and the clitic curve. When the means of the arrays are above the regression cubic, then the ordinates of the scedastic curve are above their mean and those of the clitic curve show positive skewness; when they are below the regression curve, we have lessened variability and negative skewness. In other words, the ecdyses are accompanied by lessened cell variability and negative skewness of distribution. I think we may state that there is a nomic heteroscedasticity due to growth of body, giving first an increased variability with growth and afterwards a decrease with age. There is probably isocurtic homoclisy. Both of these are, however, obscured by a semi-rhythmic heteroscedasticity and heteroclisy introduced by the ecdyses.

We now turn to the constants of the cell and body length distributions, merely noting that all these constants are given in terms of the units of the working scales.

Cell Constants.
Mean cell = 9.268,657, $\sigma_y$ = 2.541,734, $\mu_2$ = 6.460,410, $\mu_3$ = 2.142,362, $\mu_4$ = 123.921,496, $\beta_1'$ = .017,021, $\beta_2'$ = 2.969,111.

Further Cell Constants (symbols not legible in the source).
1.454,600; 2.115,862; 15.142,840; and a ratio equal to .095,615.

Body Length Constants.
Mean body length = 8.502,488, $\sigma_x$ = 3.864,784, $\nu_2$ = 14.936,562, $\nu_3$ = −5.125,806, $\nu_4$ = 432.769,533, $\nu_5$ = −425.276,682, $\nu_6$ = 15192.5375, $\beta_1$ = .007,885, $\beta_2$ = 1.939,793, $\beta_3$ = .043,796, $\beta_4$ = 4.559,091, $\sqrt{\beta_1}$ = −.088,798, $\phi_2$ = .931,908, $\phi_3$ = −.232,167, $\phi_4$ = .788,409.
[Tables not recoverable from the source; the only legible label is "Position of Whorl".]

$p_{11}$ = −8.225,585, whence $r$ = −.708,222,
$p_{21}$ = −21.471,321, $\epsilon$ = −.390,436,
$p_{31}$ = −205.084,042, $\zeta$ = +.029,733,
$p_{41}$ = −917.984,938, $\theta$ = −.960,212.

Further, from $\Sigma_M$, $\eta$ = .850,984.

From the constants for the arrays we deduce

$\chi_1 - 1$ = −.356,367, $\chi_2$ = −.312,952.

We now obtain, showing the contribution of each term of (xxxiii.),

$\Sigma_\eta^2 = \frac{1}{N}\{.076{,}080 - .157{,}932 + .055{,}359 + .079{,}662 + .038{,}579\}$.

Whence probable error of $\eta$ = $.67449\,\Sigma_\eta$ = .0054.

Had we calculated the probable error of $\eta$ from (xxxiv.) we should have found it equal to .0049. The difference .0005 is not of importance for practical purposes. Yet in this case it is clear that the values of $\chi_1 - 1$ and $\chi_2$ are very sensible. Thus we see that a very marked heteroscedastic and heteroclitic system with continuously changing standard deviation and skewness scarcely affects for practical purposes (i.e., to three significant figures) the probable error of $\eta$. All four of our illustrations therefore confirm the conclusion that:

For practical purposes the probable error of the correlation ratio, $\eta$, may be taken as $.67449\,(1-\eta^2)/\sqrt{N}$.

Our Diagram IV. gives the values of the relative standard deviations of the arrays, that is, the array standard deviation divided by $\sigma_y$, the horizontal line giving $\sqrt{1-\eta^2}$ = .5252, or the mean value of the relative standard deviations of the weighted arrays.
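The simple rule for the probable error and the height of the horizontal line on Diagram IV. are easy to check numerically. The following is a minimal sketch and not part of the memoir: the function names are illustrative, and the sample size in the last line is a hypothetical figure, since N for this illustration is not quoted in the passage above.

```python
import math

def probable_error_eta(eta, n):
    """Simple formula .67449 (1 - eta^2) / sqrt(N) for the probable error of eta."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 0.67449 * (1.0 - eta ** 2) / math.sqrt(n)

def mean_relative_sd(eta):
    """sqrt(1 - eta^2): mean relative standard deviation of the weighted arrays."""
    return math.sqrt(1.0 - eta ** 2)

eta = 0.850984  # correlation ratio quoted in the text

# The text quotes .5252 for this quantity.
print(round(mean_relative_sd(eta), 4))

# N = 1000 is a hypothetical sample size, used only to show the call.
print(probable_error_eta(eta, 1000))
```

Because the formula depends on the data only through $\eta$ and N, it can be applied without any of the array constants that the full expression (xxxiii.) requires.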
We have also the clitic curve giving $\tfrac{1}{2}\sqrt{\beta_1}$ for each array.* The remarkable smoothness of these scedastic and clitic curves in this case indicates how far certain types of correlation surfaces diverge from pure normality of distribution, the divergence being obviously nomic.

We now turn to the regression curves and write down the conditions for the different types; the three expressions should be zero for linear, parabolic, and cubical regression respectively:

$\eta^2 - r^2$ = .222,596,
$\phi_2(\eta^2 - r^2) - \epsilon^2$ = [value illegible in the source],
$\phi_2(\eta^2 - r^2) - \epsilon^2 - (\zeta\phi_2 - \epsilon\phi_3)^2/(\phi_2\phi_4 - \phi_3^2)$ = .010,127.

* $\tfrac{1}{2}\sqrt{\beta_1}$ = difference between mode and mean divided by standard deviation = skewness in the case of skew-curves of Type III. ('Phil. Trans.,' A, vol. 186, p. 373), and may be taken as a reasonable measure of the skewness for those cases in which the fuller form involving $\beta_2$ would involve too laborious calculations. If in equation (xii.) of the present memoir we put $\beta_2$ = 3 + a small quantity, and remember that $\beta_1$ is itself a small quantity, we see that the more correct formula for the skewness involving $\beta_2$ reduces, neglecting terms of 2nd order, to $\tfrac{1}{2}\sqrt{\beta_1}$.

We see at once that the straight line is inadmissible, the parabola will not be very good, and the cubic only moderately appropriate. The conditions are not nearly so closely fulfilled as in the cases of woodruff and head heights; the last two are better than in the case of Daphnia cells, but while the deviations in the case of Daphnia were irregular, there being no approximate smoothness in the scedastic or clitic curves, we shall find here more uniform deviations which would probably be partially allowed for by a quartic regression curve.

The following are the regression curves:

(a.) Straight line: $Y_{x_p} = -.655{,}423\,X_p$.

(b.) Parabola from (lxv.): $Y_{x_p} = 1.551{,}307 - .574{,}171\,X_p - .123{,}610\,X_p^2$.
The maximum ordinate is at the position $X_p$ = −2.3225, or $x_p$ = 4.0808, with maximum number of branches $y_{x_p}$ = 9.435.

(c.) Cubic from (lvi.)
: $Y_{x_p} = 1.590{,}413 - .987{,}694\,X_p - .137{,}641\,X_p^2 + .016{,}605\,X_p^3$.

In all cases $X_p$ and $Y_{x_p}$ are measured from the mean position and the mean number of branches, i.e., 6.403,315 and 7.216,851 respectively. The following table contains the calculated and observed results:

Table VIII. Mean Number of Branches to each Whorl in Equisetum.

Position   Regression line   Regression parabola   Regression cubic   Observed   Regression cubic without first whorl
 1         10.758             8.262                 7.506              7.619      [8.207]
 2         10.103             8.900                 9.070              9.294       8.929
 3          9.447             9.291                 9.920              9.627       9.869
 4          8.792             9.434                10.156              9.730      10.161
 5          8.137             9.330                 9.876              9.643       9.911
 6          7.481             8.980                 9.182              9.427       9.224
 7          6.826             8.382                 8.172              8.732       8.205
 8          6.170             7.536                 6.947              7.297       6.962
 9          5.515             6.444                 5.605              5.555       5.599
10          4.859             5.104                 4.247              3.964       4.223
11          4.204             3.517                 2.971              2.443       2.939
12          3.549             1.683                 1.879              1.866       1.854
13          2.893            −0.399                 1.069              1.462       1.072
14          2.238            −2.727                 0.641              1.333       0.700
15          1.582            −5.303                 0.694              1.250       0.844
16          0.927            −8.126                 1.328              1.000       1.610

In the last column I have placed the results of re-working the whole system, omitting the first whorl as largely influenced by the ground condition at the foot of the stem.* The improvement of fit is not sufficiently great to justify a publication of all the constants for the distribution in this modified case. But there is improvement for the higher whorls, which are so few in number as to be wholly insignificant when compared with the weight of the first few low whorls.

It will be noticed at once that the line and the parabola (which gives at the top of the stem negative numbers!) are absolutely unsuitable for representing the facts of the case. The cubic is better and certainly gives the general trend of the observations, but in this our last illustration we have clearly reached the limit of material to which such cubical regression can be satisfactorily applied. See Diagram V.

(12.) Quartic Regression.
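It seemed of some interest in this case of Equisetum to ascertain whether any real improvement in description would be reached by considering the quartic regression curve. I briefly indicate the theory in this case as developed from the general method in the footnote, p. 25. We shall now have

$$Y_{x_p}/\sigma_y = b_0 + b_1 (X_p/\sigma_x) + b_2 (X_p/\sigma_x)^2 + b_3 (X_p/\sigma_x)^3 + b_4 (X_p/\sigma_x)^4.$$

Eliminating $b_0$ and $b_1$, by the processes familiar to us from the case of cubical regression, we have

$$Y_{x_p}/\sigma_y = r (X_p/\sigma_x) + b_2\{(X_p/\sigma_x)^2 - \sqrt{\beta_1}\,(X_p/\sigma_x) - 1\} + b_3\{(X_p/\sigma_x)^3 - \beta_2 (X_p/\sigma_x) - \sqrt{\beta_1}\} + b_4\{(X_p/\sigma_x)^4 - (\beta_3/\sqrt{\beta_1})(X_p/\sigma_x) - \beta_2\} \quad \text{(lxx.)}.$$

Hence as before

$$\epsilon = b_2\phi_2 + b_3\phi_3 + b_4\phi_5, \qquad \zeta = b_2\phi_3 + b_3\phi_4 + b_4\phi_6, \qquad \theta = b_2\phi_5 + b_3\phi_6 + b_4\phi_7 \quad \text{(lxxi.)},$$

where $\phi_2$, $\phi_3$, and $\phi_4$ are given as before by (li. and liv.), while

$$\phi_5 = \beta_4 - \beta_3 - \beta_2 \quad \text{(lxxii.)},$$
$$\phi_6 = (\beta_5 - \beta_2\beta_3 - \beta_1\beta_2)/\sqrt{\beta_1} \quad \text{(lxxiii.)},$$
$$\phi_7 = (\beta_1\beta_6 - \beta_3^2 - \beta_1\beta_2^2)/\beta_1 \quad \text{(lxxiv.)},$$

and

$$\beta_5 = \nu_3\nu_7/\sigma_x^{10}, \qquad \beta_6 = \nu_8/\sigma_x^{8} \quad \text{(lxxv.)}.$$

Solving, we have

$$b_4 = \{\theta(\phi_2\phi_4 - \phi_3^2) - \epsilon(\phi_4\phi_5 - \phi_3\phi_6) - \zeta(\phi_2\phi_6 - \phi_3\phi_5)\}/\Delta \quad \text{(lxxvi.)},$$

$\Delta$ being the determinant of the coefficients in (lxxi.), and

$$b_2 = \frac{\epsilon\phi_4 - \zeta\phi_3 - b_4(\phi_4\phi_5 - \phi_3\phi_6)}{\phi_2\phi_4 - \phi_3^2}, \qquad b_3 = \frac{\zeta\phi_2 - \epsilon\phi_3 - b_4(\phi_2\phi_6 - \phi_3\phi_5)}{\phi_2\phi_4 - \phi_3^2} \quad \text{(lxxvii.)}.$$

* 'Roy. Soc. Proc.,' vol. 71, pp. 308-310.

Substituting in (lxx.), the solution is completed. The advantage of this form is that we see clearly the modifications made in $b_2$ and $b_3$ as we pass from cubical to quartic regression. On the other hand, $\beta_5$ and $\beta_6$, as shown by (lxxv.), involve the 7th and 8th moments of the x-character. These are not only very laborious to calculate, but, as we have already shown, are as a rule very untrustworthy. If we proceed as on p. 26, equation (lvii.), we find

$$\eta^2 - r^2 = b_2\epsilon + b_3\zeta + b_4\theta \quad \text{(lxxviii.)}.$$

Using this and not the third equation of (lxxi.), we replace (lxxvi.) by

$$b_4 = \frac{(\phi_2\phi_4 - \phi_3^2)\left\{\eta^2 - r^2 - \dfrac{\epsilon^2}{\phi_2} - \dfrac{(\zeta\phi_2 - \epsilon\phi_3)^2}{\phi_2(\phi_2\phi_4 - \phi_3^2)}\right\}}{\theta(\phi_2\phi_4 - \phi_3^2) - \epsilon(\phi_4\phi_5 - \phi_3\phi_6) - \zeta(\phi_2\phi_6 - \phi_3\phi_5)} \quad \text{(lxxix.)}.$$

This equation for $b_4$ only involves the 7th and not the 8th moment, but like the corresponding form (lx.) suffers from being a ratio of small quantities. (lxxvii.) completes the solution as before. (lxxvii.) and (lxxix.) in conjunction give us a necessary condition for quartic regression.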
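The three simultaneous equations for $b_2$, $b_3$, $b_4$ form a symmetric linear system in the six $\phi$ constants, with $\epsilon$, $\zeta$, $\theta$ on the right-hand side. As a hedged illustration only, the sketch below solves such a system by Cramer's rule. The function names are invented for this example, the numerical inputs are hypothetical values chosen merely to exercise the code, and nothing here reproduces the author's own arithmetic.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def quartic_b_coefficients(phi, eps, zeta, theta):
    """Solve for (b2, b3, b4) by Cramer's rule.

    phi is the tuple (phi2, phi3, phi4, phi5, phi6, phi7); the system is
        eps   = b2*phi2 + b3*phi3 + b4*phi5
        zeta  = b2*phi3 + b3*phi4 + b4*phi6
        theta = b2*phi5 + b3*phi6 + b4*phi7
    """
    p2, p3, p4, p5, p6, p7 = phi
    matrix = [[p2, p3, p5], [p3, p4, p6], [p5, p6, p7]]
    delta = det3(matrix)
    if delta == 0:
        raise ValueError("singular system: the phi values are degenerate")
    rhs = (eps, zeta, theta)
    solution = []
    for col in range(3):
        # Replace one column by the right-hand side, as Cramer's rule requires.
        replaced = [list(row) for row in matrix]
        for row in range(3):
            replaced[row][col] = rhs[row]
        solution.append(det3(replaced) / delta)
    return tuple(solution)

# Hypothetical inputs: build the right-hand side from known coefficients,
# then check that the solver recovers them.
phi = (1.0, 0.2, 0.8, 3.4, 3.5, 15.0)
b_true = (-0.6, 0.17, 0.04)
eps = b_true[0] * phi[0] + b_true[1] * phi[1] + b_true[2] * phi[3]
zeta = b_true[0] * phi[1] + b_true[1] * phi[2] + b_true[2] * phi[4]
theta = b_true[0] * phi[3] + b_true[1] * phi[4] + b_true[2] * phi[5]
print(quartic_b_coefficients(phi, eps, zeta, theta))
```

Cramer's rule is used because it mirrors the determinant form of the solution for $b_4$; for larger systems a general linear solver would be the more usual choice.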
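We can indeed now write the whole series of conditions as follows:

Linear regression:
$$\eta^2 - r^2 = 0.$$

Parabolic regression:
$$\eta^2 - r^2 - \frac{\epsilon^2}{\phi_2} = 0.$$

Cubical regression:
$$\eta^2 - r^2 - \frac{\epsilon^2}{\phi_2} - \frac{(\zeta\phi_2 - \epsilon\phi_3)^2}{\phi_2(\phi_2\phi_4 - \phi_3^2)} = 0.$$

Quartic regression:
$$\eta^2 - r^2 - \frac{\epsilon^2}{\phi_2} - \frac{(\zeta\phi_2 - \epsilon\phi_3)^2}{\phi_2(\phi_2\phi_4 - \phi_3^2)} - \frac{\{\theta(\phi_2\phi_4 - \phi_3^2) - \epsilon(\phi_4\phi_5 - \phi_3\phi_6) - \zeta(\phi_2\phi_6 - \phi_3\phi_5)\}^2}{(\phi_2\phi_4 - \phi_3^2)(\phi_2\phi_4\phi_7 - \phi_2\phi_6^2 - \phi_4\phi_5^2 - \phi_7\phi_3^2 + 2\phi_3\phi_5\phi_6)} = 0 \quad \text{(lxxx.)}.$$

We now have a third possibility: we can get rid of the fourth product moment $\theta$ from the value of $b_4$ and write it:

$$b_4 = \pm\sqrt{\frac{\phi_2\phi_4 - \phi_3^2}{\Delta}\left\{\eta^2 - r^2 - \frac{\epsilon^2}{\phi_2} - \frac{(\zeta\phi_2 - \epsilon\phi_3)^2}{\phi_2(\phi_2\phi_4 - \phi_3^2)}\right\}} \quad \text{(lxxxi.)},$$

where $\Delta$ denotes the second factor in the denominator of the last term of (lxxx.).

While this value of $b_4$ does not suffer like (lxxix.) from being the ratio of small quantities, and would a priori appear to save the calculation of $\theta$, yet the right sign of the root may not be obvious on inspection, so that an actual determination of $\theta$ to find the sign of $b_4$ may after all be needful. If (lxxx.) were absolutely satisfied, (lxxxi.), (lxxix.) and (lxxvi.) would lead to identical results; but this will rarely be true in practice. In any of the three cases $b_2$ and $b_3$ will be given by (lxxvii.). On the whole, I consider that (lxxxi.) and (lxxvi.) will give the better results, and probably the former the best, but it will generally require as much arithmetic as the latter.

(13.) Illustration E. Calculation of the Quartic Regression Curve in the Case of Equisetum arvense.

The only new constants required are:

$\nu_7$ = 43,207.386, whence $\beta_5$ = 1.144,882,
$\nu_8$ = 507,649.540, whence $\beta_6$ = 20.463,633,

and:

$\phi_5$ = 3.425,069, $\phi_6$ = 3.452,046, $\phi_7$ = 15.015,792.

These lead us to:

$\phi_4\phi_5 - \phi_3\phi_6$ = 2.723,384, $\phi_2\phi_6 - \phi_3\phi_5$ = 1.211,194,

$$\Delta = \begin{vmatrix} \phi_2 & \phi_3 & \phi_5 \\ \phi_3 & \phi_4 & \phi_6 \\ \phi_5 & \phi_6 & \phi_7 \end{vmatrix} = 1.745{,}622.$$

Our successive conditions are therefore:

$\eta^2 - r^2$ = .222,596,
$\eta^2 - r^2 - \epsilon^2/\phi_2$ = .069,266,
$\eta^2 - r^2 - \epsilon^2/\phi_2 - (\zeta\phi_2 - \epsilon\phi_3)^2/\{\phi_2(\phi_2\phi_4 - \phi_3^2)\}$ = .010,186,
$\eta^2 - r^2 - \epsilon^2/\phi_2 - (\zeta\phi_2 - \epsilon\phi_3)^2/\{\phi_2(\phi_2\phi_4 - \phi_3^2)\} - \{\theta(\phi_2\phi_4 - \phi_3^2) - \epsilon(\phi_4\phi_5 - \phi_3\phi_6) - \zeta(\phi_2\phi_6 - \phi_3\phi_5)\}^2/\{(\phi_2\phi_4 - \phi_3^2)\Delta\}$ = [value illegible in the source],

whence we see the successive approximations to the fulfilment of the conditions.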
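The first of these residuals depends only on $\eta$ and $r$, so it can be recomputed from the two values quoted earlier for Equisetum. The short sketch below is an editorial check under that assumption, with an illustrative function name; the later residuals need the $\phi$ constants, which are not all quoted, and are therefore not attempted.

```python
def linear_condition(eta, r):
    """eta^2 - r^2, which should vanish if the regression is linear."""
    return eta ** 2 - r ** 2

eta = 0.850984       # correlation ratio quoted for Equisetum
r_abs = 0.708222     # magnitude of r; only r squared enters the condition

residual = linear_condition(eta, r_abs)
print(residual)      # compare with the quoted value .222,596

# Subtracting the quoted parabolic residual gives the implied eps^2 / phi_2.
print(residual - 0.069266)
```

The recomputed value agrees with the quoted one to about the sixth decimal place, which is as close as the six-figure inputs allow.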
Clearly great gains arise when we pass from linear to parabolic, and from parabolic to cubic regression, but the advance is not so conspicuous when we pass to quartic regression.

We have:

From (lxxvi.): $b_4$ = .044,517, and $b_2$ = −.648,122, $b_3$ = .171,260,
From (lxxix.): $b_4$ = .151,842, and $b_2$ = −.940,410, $b_3$ = .041,981,
From (lxxxi.): $b_4$ = .025,999, and $b_2$ = −.597,691, $b_3$ = .193,688.

The equations to the three corresponding quartics are:

(a). $Y_{x_p} = 1.724{,}611 - .913{,}208\,X_p - .169{,}311\,X_p^2 + .012{,}629\,X_p^3 + .000{,}927\,X_p^4$.
(b). $Y_{x_p} = 2.047{,}717 - .734{,}966\,X_p - .245{,}667\,X_p^2 + .003{,}096\,X_p^3 + .003{,}161\,X_p^4$.
(c). $Y_{x_p} = 1.668{,}788 - .944{,}192\,X_p - .156{,}137\,X_p^2 + .014{,}283\,X_p^3 + .000{,}541\,X_p^4$.

The values of $Y_{x_p}$ and $X_p$ are as before measured from the means, or 7.216,851 and 6.403,315 respectively. The values of the observed and calculated ordinates are given in Table IX., and the graph of the results in the lower half of Diagram V.

Table IX. Mean Number of Branches to Whorl in Equisetum deduced from Quartic Regression.

Position   Quartic (a)   Quartic (b)   Quartic (c)   Observed
 1          7.731         8.269         7.637        7.619
 2          8.950         8.662         9.000        9.294
 3          9.715         9.222         9.800        9.627
 4         10.014         9.674        10.073        9.730
 5          9.858         9.816         9.866        9.643
 6          9.281         9.521         9.240        9.427
 7          8.339         8.740         8.270        8.732
 8          7.109         7.498         7.042        7.297
 9          5.692         5.898         5.656        5.555
10          4.209         4.116         4.225        3.964
11          2.816         2.407         2.875        2.443
12          1.651         1.100         1.745        1.866
13          0.930         0.600         0.987        1.462
14          0.857         1.389         0.766        1.333
15          1.665         4.022         1.259        1.250
16          3.609         9.133         2.657        1.000

From these results we deduce the following conclusions:

(i.) That the use of a quartic instead of a cubic regression curve has not very markedly bettered the fit. The failure to get a closer fit lies largely in the nature of the material. The number of plants with more than 13 whorls is very few, and their contribution allows little weight to the tail of the regression curve. Further, all our
attempts to fit a smooth regression curve show that the observed data are unduly flattened at the top. If we confine ourselves to a homogeneous series of 110 plants with ten whorls apiece, we get a remarkably good fit.* The S-shape of the regression line as indicated in both cubic and quartic does, however, appear to be characteristic of the nature of the plant, and I take it that more ample material would allow of a closer analytical description by a simple cubic. I doubt whether for practical statistics the use of the quartic will often be requisite.

(ii.) The comparative failure of the quartic (b) shows us that a formula like (lxxix.) is of small service. This corresponds fully to our experience in the use of (lx.) in the case of the cubic. In both cases we get rid of a high moment by making a certain constant the ratio of two small quantities, and experience shows us that the result is unsatisfactory. It is accordingly preferable to use formulae involving high moments of one variable in preference to those with a ratio of small quantities.

(iii.) The quartic (c) appears as good as, if not slightly better than, quartic (a). In (c) we have got rid of a high product moment, $\theta$, by supposing the quartic condition (lxxx.) rigidly fulfilled. This of course is not the case. It is clear that product moments like $\theta$ of the 5th order are far from advantageous, and this is the same principle which was in evidence when we found (lxv.) giving better results than (lxiv.) for parabolic regression. Hence we must further conclude that the use of third, fourth or fifth product moments is disadvantageous as compared respectively with fifth to eighth moments of one variable. Or, a moment two degrees higher is preferable to a product moment in calculating correlation values. This is, I think, consonant with our knowledge of the relative magnitude of the probable errors in the two cases.

(14.) General Conclusions.

(i.)
The present paper provides us with a general method of dealing with the regression line and the variability of arrays in the case of skew correlation, without any assumption as to the analytical form of the skew correlation surface.

(ii.) It provides a nomenclature and classification of the types of array variability which may be of service. Arrays are either homoclitic or heteroclitic, according as their skewnesses are of equal magnitude or not. Arrays are further homoscedastic or heteroscedastic, according as their standard deviations are alike or different. Skew arrays are termed allocurtic; if arrays are symmetrical about their mean, they are isocurtic. A heteroclitic system of arrays may be nomic or anomic, according as the skewness of the arrays changes continuously or irregularly with the position of the array. A heteroscedastic system of arrays is also either nomic or anomic, according as the standard deviation of the arrays changes continuously or irregularly with the position of the arrays. Anomic heteroclisy and anomic heteroscedasticity probably only signify that our material is either heterogeneous or too sparse to free us from the large errors of random sampling in the extreme arrays. Still the terms will be found of use in describing the actual data. The curve in which the skewness of the array is plotted to its position is termed the clitic curve; the curve in which the ratio of the standard deviation of the array to the standard deviation of the character in the population at large is plotted to position is termed a scedastic curve.

* 'Roy. Soc. Proc.,' vol. 71, p. 308.

(iii.) The types of regression have been classified into linear, parabolic, cubic and quartic. For most practical purposes the first three suffice. Necessary criteria have been given for each case. But as in the case of the skew frequency of one character, an indefinite number of conditions ought theoretically to be fulfilled.
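The quantities named in this classification can all be read off a grouped correlation table. The following is a minimal sketch under stated assumptions, not the author's procedure: the input layout (a dictionary from array position to a list of value and frequency pairs), the function name, and the toy data are all hypothetical, and the clitic ordinate is taken as half the third moment over the cube of the array standard deviation, i.e. $\tfrac{1}{2}\sqrt{\beta_1}$ with the sign of the third moment.

```python
import math

def array_summary(arrays):
    """Return (eta, rows) for a grouped table.

    arrays maps each array position x to a list of (y value, frequency) pairs.
    rows[x] holds the array mean, its scedastic ordinate (array s.d. divided
    by the s.d. of y in the whole population) and its clitic ordinate.
    """
    pairs = [(y, f) for obs in arrays.values() for y, f in obs]
    n_total = sum(f for _, f in pairs)
    mean_y = sum(y * f for y, f in pairs) / n_total
    var_y = sum(f * (y - mean_y) ** 2 for y, f in pairs) / n_total
    if var_y == 0:
        raise ValueError("y has no variation")
    sd_y = math.sqrt(var_y)

    rows = {}
    between = 0.0
    for x, obs in sorted(arrays.items()):
        n = sum(f for _, f in obs)
        m = sum(y * f for y, f in obs) / n
        m2 = sum(f * (y - m) ** 2 for y, f in obs) / n
        m3 = sum(f * (y - m) ** 3 for y, f in obs) / n
        sd = math.sqrt(m2)
        clitic = 0.5 * m3 / sd ** 3 if sd > 0 else 0.0
        rows[x] = {"n": n, "mean": m, "scedastic": sd / sd_y, "clitic": clitic}
        between += n * (m - mean_y) ** 2

    # Correlation ratio: s.d. of the weighted array means over the s.d. of y.
    eta = math.sqrt(between / (n_total * var_y))
    return eta, rows

# Toy data, hypothetical: two arrays of two observations each.
toy = {1: [(1, 1), (3, 1)], 2: [(2, 1), (6, 1)]}
eta, rows = array_summary(toy)
print(eta)
for x, row in rows.items():
    print(x, row)
```

In this toy table the second array has twice the standard deviation of the first, so the system is heteroscedastic, while both arrays are symmetrical and hence isocurtic. The weighted mean of the squared scedastic ordinates equals $1 - \eta^2$.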
Practically in dealing with frequency, no criteria are absolutely fulfilled, and the probable errors of the expressions used become unmanageable as we ascend in the scale. We must therefore be content to estimate the degree of approximation with which one or two necessary criteria are satisfied. The fundamental test of deviation from the familiar form of linear regression is the inequality of the correlation coefficient $r$ and the newly introduced correlation ratio $\eta$. The probable error of this latter is determined. It is shown that $\sigma_y\sqrt{1-\eta^2}$ is the mean standard deviation of a system of arrays in skew correlation. The ease with which $\eta$ can be calculated suggests that in many cases it should accompany, if not replace, the determination of the correlation coefficient.

In the determination of the constants of the regression curve we must use moments and product moments. The limitations to the order of the curve used depend: (a) on the labour of the arithmetic, (b) on the increasing probable errors of the higher moments and product moments. For these reasons it seems idle to propose going beyond the 6th to 8th moments, or the 3rd to 5th product-moments. Practical experience suggests that little is to be gained by using moments beyond the [order illegible in the source], or product moments beyond the 3rd. A quartic regression curve may be useful occasionally, but it has yet to justify its necessity. As our object is not to reproduce the given data, but to provide a graduation for them, which smooths down the errors of random sampling, we believe that any legitimate and practical theory must discard the high moments and high product moments with which Thiele and Lipps propose to deal.

(iv.) There is one point to which reference ought to be made. Some reader may enquire why the method of my paper on curve fitting* should not be applied to these regression curves in general, as we have in practice once or twice already applied it.
It would seem that that method is the easier, involving in the case of the quartic only quantities analogous to our $r$, $\epsilon$, $\zeta$ and $\theta$. The answer is straightforward: that process supposes every $y_{x_p}$ to have equal weight, or $n_{x_p}$ to be the same for each array. Hence the higher moments of the x-character, which are really involved, can be written down without calculation once and for all.* The complexity of our present investigation arises from the introduction of the weighting into the calculation of the moments of the x-character, as well as into that of the product moments $r$, $\epsilon$, $\zeta$, $\theta$. Our results therefore, although they might not look so good on a graph of the regression curve, would be markedly better, if due weight were given to the frequency of each array. The difference of the two conceptions is comparable to the determination of the regression on the one hand from the correlation coefficient, and on the other from merely striking a line through the plotted means of the arrays. The method of moments in the present case, if we except the use of $\eta$, is identical with that of fitting a curve to a continuum in space by the method of least squares.

* "On the Systematic Fitting of Curves to Observations and Measurements." 'Biometrika,' vol. I., pp. 265-303, and vol. II., pp. 1-23, especially the latter, pp. 11-15.

(v.) No stress whatever is laid on the actual instances here selected for illustration of the methods of this paper. I have merely chosen out of available material cases in which I had come across skew regression of various types. Thus we find:

(a.) The correlation of the number of branches and position of the whorl in Asperula odorata is practically parabolic, homoscedastic and of nomic heteroclisy.

(b.) The correlation between auricular height of head and age in girls is cubical, of nomic heteroscedasticity and of anomic heteroclisy. It is probably really a case of isocurtosis.

(c.)
The correlation of size of cell and size of body in Daphnia magna, allowing for the irregularities produced by the ecdyses, is parabolic or cubic, of nomic heteroscedasticity, and probably, but for the above-mentioned irregularities, of isocurtic homoclisy.

(d.) The correlation of the number of branches and position of the whorl in Equisetum arvense is cubical or possibly even quartic, of markedly nomic heteroscedasticity and markedly nomic heteroclisy.

* 'Biometrika,' vol. II., p. 12.

It is not impossible that slips have occurred in the lengthy arithmetic involved, but every important piece of work has been done independently twice, once by Dr. Alice Lee, whom I have most heartily to thank for her unwearying assistance, and once by myself. To preserve uniformity of working, the constants have in each case been carried to six figures. This involves little or no additional trouble, using as we do mechanical calculators. The final results are of course of no value beyond their probable errors, which will be in the second or third place of figures. No doubt I shall be told that there is a show of accuracy in the number of decimal figures retained, which does not really exist. It does not exist (and I am as fully conscious of its non-existence as any would-be critic) so far as our results fit the actual population, of which we have but a random sample. The figures, however, are of importance, as far as testing accuracy of fit of result to actual sample goes. The cubic or quartic curves may have coefficients insensible before the third or fourth figure of decimals, and these coefficients have to be multiplied occasionally by abscissae of the third or fourth powers of 7 to 9. Hence to get ordinates true, as far as the sample goes, to the second or third figure, we require to work to a fairly high number of figures.
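The point about figures retained can be illustrated with quartic (a) for Equisetum, whose coefficients and means are quoted in the text. The sketch below is an editorial illustration with an invented function name: it evaluates the ordinate at position 16, where the abscissa measured from the mean is about 9.6, first with the six-figure coefficients and then with the same coefficients cut to three decimals.

```python
def ordinate(coeffs, position, mean_position=6.403315, mean_branches=7.216851):
    """Evaluate a regression polynomial whose variables are measured from the means.

    coeffs lists the coefficients in ascending powers of X_p.
    """
    x = position - mean_position
    y = 0.0
    for c in reversed(coeffs):  # Horner's scheme
        y = y * x + c
    return y + mean_branches

# Quartic (a), coefficients as quoted in the text.
quartic_a = [1.724611, -0.913208, -0.169311, 0.012629, 0.000927]

# The same curve with every coefficient kept to three decimals only.
rounded = [round(c, 3) for c in quartic_a]

print(ordinate(quartic_a, 16))  # Table IX. gives 3.609 for quartic (a)
print(ordinate(rounded, 16))
```

With three-decimal coefficients the ordinate at the sixteenth whorl moves by nearly a whole branch, which is the effect the paragraph above describes.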
There is no magic in six figures, four or five would probably satisfy another worker, but they are easily read off the calculator we use, and if the constants had been tabled only to four or five, no reader would have been able to agree exactly, if he wished to test any of our results, even to three figures, with the final ordinates.

[Diagram I. Skew Correlation in Asperula odorata. Scedastic curve, clitic curve, regression line, regression parabola and regression cubic; number of branches to whorl.]

[Diagram II. Regression line, regression parabola and regression cubic, with clitic curve; abscissa: age of girl.]

[Diagram III. Skew Correlation between Sizes of Cell and Body in Daphnia. Regression cubic and regression parabola; abscissa: size of body.]

[Diagram IV. Skew Correlation between Branches and Position of Whorl in Equisetum: Scedastic and Clitic Curves.]

[Diagram V. Skew Correlation between Branches and Position of Whorl in Equisetum: Regression Curves. Regression line, regression parabola, regression cubic, quartics (a), (b) and (c); abscissa: position of whorl.]