HOW TO EXPERIMENT 
 IN EDUCATION 
 
EXPERIMENTAL EDUCATION SERIES 
 Edited by M. V. O’SHEA
 HOW TO EXPERIMENT IN EDUCATION.

 By William A. McCall, Ph.D., Associate Professor of
 Education, Teachers College, Columbia University.
 
 
 
HOW TO EXPERIMENT
 IN EDUCATION
 
 BY 
 WILLIAM A. McCALL, PH.D.
 
 ASSOCIATE PROFESSOR OF EDUCATION, TEACHERS COLLEGE, 
 COLUMBIA UNIVERSITY, NEW YORK CITY 
 
 New York
 THE MACMILLAN COMPANY 
 1926 
 
 All rights reserved 
 
COPYRIGHT, 1923, 
 
 By THE MACMILLAN COMPANY. 
 
 Set up and electrotyped. Published August, 1923. Reprinted 
 November, 1926. 
 
 PRINTED IN THE UNITED STATES OF AMERICA 
 BY BERWICK & SMITH CO. 
 
CONTENTS

 CHAPTER
 I. SELECTION AND FORMULATION OF EXPERIMENTAL PROBLEM
 II. SELECTION OF EXPERIMENTAL METHOD
 III. SELECTION OF EXPERIMENTAL SUBJECTS
 IV. CONTROL OF EXPERIMENTAL CONDITIONS
 V. EXPERIMENTAL MEASUREMENTS
 VI. COMPUTATIONS FOR THE ONE-GROUP EXPERIMENTAL METHOD
 VII. COMPUTATIONS FOR THE EQUIVALENT-GROUPS METHOD
 VIII. COMPUTATIONS FOR THE ROTATION EXPERIMENTAL METHOD
 IX. CAUSAL INVESTIGATIONS
 X. ANALYSES OF EXPERIMENTAL AND CAUSAL INVESTIGATIONS
 APPENDIX
 SUMMARY OF SYMBOLS
 INDEX
 
 
LIST OF TABLES

 TABLE
 1. Chronological ages and mental ages of 43 sixth-grade pupils
 2. Pupils divided into two groups of equivalent mental age
 3. Illustrates computation of composite scores
 4. Illustration of need for equal units of measurement
 5. Relative merits of four commonly used scales
 6. Shows how to construct a T scale
 7. [title illegible]
 8. Shows how to widen the range of a T scale
 9. Age-scale and T-scale equivalents
 10. Shows how to construct a B scale
 11. For converting T scores into B scores
 12. Reliability of test by net difference method
 13. Equating variability in computing net difference
 13A. For converting total points correct into T scores
 13B. For computing B scores
 13C. For computing C scores
 13D. For illustrating the computation of T, B, and C scores
 13E. For interpreting [illegible] scores
 14. One-group computation model I
 15. Illustration of computation model I
 16. Computation of M and SD when N is large
 17. Computation of M and SD in a frequency distribution [remainder illegible]
 18. Computation of the median in special situations
 19. Conversion of experimental coefficients into chances
 20. Illustration of computation model I when EFs is not the [remainder illegible]
 21. Equivalent-groups computation model II for two EF’s and one test type
 22. Illustration of computation model II
 23. Equivalent-groups computation model III for three EF’s and one test type
 24. Equivalent-groups computation model IV for two EF’s and two test types
 25. Illustration of computation model IV
 26. Equivalent-groups computation model V for three EF’s and one test type
 27. Equivalent-groups computation model VI for two sub-groups
 28. Summary of an actual experiment with three sub-groups
 29. Equivalent-groups computation model VII with an intermediate test
 30. Equivalent-groups computation model VIII with three sub-groups and an intermediate test
 31. Rotation computation model IX for two EF’s and one test type
 32. Illustration of computation model IX
 33. Rotation computation model X for three EF’s and one test type
 34. Rotation computation model XI for two EF’s and two test types
 35. Data from a rotation experiment conducted by Weber
 36. Data from Weber’s rotation experiment converted into T scores
 37. Computation of r
 38. Computation of r from a contingency table
 39. Reavis’ r’s between attendance and six hypothetical causes
 40. Reavis’ original and partial r’s between attendance and six hypothetical causes
 
LIST OF DIAGRAMS 
 
 DIAGRAM

 1. Scatter diagram showing rectilinear and curvilinear relationship
 
 
 
 
EDITOR’S INTRODUCTORY NOTE 
 
 Professor McCall has written this book primarily for the 
 purpose of presenting the methodology of educational 
 experimentation in a practical form for the use of 
 teachers and students of education who wish to engage 
 in experimental work, or who desire to understand the great 
 amount of experimental literature which is appearing in 
 magazine and book form. This is the first book on educa- 
 tional experimentation to be published at home or abroad. 
 There are philosophical treatises on scientific methodology, 
 such as Pearson’s “Grammar of Science,” and a few scattered
 suggestions on the method of experimental education
 in books on scientific education; but there has been no 
 adequate treatment of experimental work in the educa- 
 tional field. This fact led the present writer, when he 
 became editor of the Experimental Education Series, to 
 ask Dr. McCall to prepare this volume. Dr. McCall has 
 conducted courses in Teachers College in the field of ex- 
 perimental education, and he has for a number of years 
 been accumulating concrete data to illustrate the experi- 
 mental method of procedure. Probably no one is as well 
 equipped as he is to prepare a book for the guidance of all 
 who desire either to understand or to undertake experi- 
 mental work in education. 
 
 With the aid to be gained from this book, intelligent 
 teachers can engage profitably in research work in educa- 
 tion even if they are not technically trained in experimental 
 methods. The subject is one of permanent worth; and 
 students of education or teachers who wish to gain an in- 
 telligent appreciation of and to keep in touch with American 
 educational progress must be familiar with, and, to some
 extent at least, must be master of the methodology of
 educational experimentation. A large proportion of popular 
 educational doctrines has been derived without due regard 
 to the requirements for securing valid conclusions; and it 
 may be safely predicted that superintendents, principals, 
 and teachers, as well as students of education, who read 
 Professor McCall’s book understandingly will exercise
 greater care than they have done heretofore in promulgating 
 educational principles based upon data that have not been 
 secured in an accurate manner or treated according to a 
 technique designed to control or eliminate disturbing or 
 irrelevant factors. 
 
 “How to Experiment in Education” is not as technical as 
 it might appear to be at first glance. The formulae and
 diagrams as well as the discussion can be easily understood 
 by any reader, even though untrained in experimental 
 methods, if he will begin at the beginning of the work and 
 go through it systematically and leisurely. Concrete ex- 
 amples of experimental problems that have been or that 
 might be successfully studied are described by Professor 
 McCall frequently and clearly enough to illustrate every 
 method of procedure discussed and every diagram presented. 
 Technical terms are sparingly used, and the meaning of 
 those that are employed can be easily gained from the con- 
 text in which they appear. 
 
 M. V. O’SHEA. 
 The University of Wisconsin. 
 
PREFACE 
 
 My initiation into educational research, like most initia- 
 tions, was a rather tragic one with happy consequences. 
 My professors plunged me into practical research situations 
 when my training in experimentation was exceedingly lop- 
 sided. They trusted to my genius to supply the missing half 
 of research methodology. The memory of this mistaken 
 trust constitutes the pleasant after effects. 
 
 The cause of my tragedy and of others like mine was due
 to the fact that, heretofore, chief attention has been directed 
 toward statistical refinements, rather than refinements of 
 pre-statistical procedure. There are excellent books and 
 courses of instruction dealing with the statistical manipula- 
 tion of experimental data, but there is little help to be 
 found on the methods of securing adequate and proper data 
 to which to apply statistical procedure. Training is given
 and books exist only for the last step of a several-step 
 process. As a result, the final step often becomes little more 
 than statistical doctoring for the ills in the data. 
 
 This book, together with its predecessor, “How to Measure
 in Education,” but particularly this book, represents an 
 attempt to assemble or originate a fairly complete methodol- 
 ogy of research from the selection of the problem to the 
 conclusion of the research. Material has been drawn from 
 numerous sources, but the largest single source is that 
 unannounced richest course of instruction taken by me at 
 Teachers College, namely, the frequent privilege of out-of- 
 course association with Professor E. L. Thorndike. 
 
 The encouragement and support given my work by my 
 departmental superiors, Professors M. B. Hillegas and
 Frank M. McMurry, and by Dean James E. Russell have
 been a continuous surprise because they have exceeded every
 expectation. Such encouragement has made it a pleasure to 
 shorten vacations and to lengthen the working day so as to 
 finish this book before departing for a year of service with 
 the Chinese National Association for the Promotion of 
 Education. 
 
 It is fortunate for the future reader that I am in China 
 while this book is being edited and published. As a result, 
 Dr. M. V. O’Shea has given an unusual amount of time to 
 its editing, and in this he has had the technical assistance of 
 Dr. John G. Fowlkes. Miss Harriet Barthelmess, who has 
 a thorough knowledge of the methodology of experimenta- 
 tion, and my wife, Alma McCall, have volunteered to read 
 the proof. I wish to make grateful acknowledgment of 
 their kindness. 
 
 William A. McCall.
 Teachers College 
 Columbia University 
 
 
HOW TO EXPERIMENT IN 
 EDUCATION 
 
 CHAPTER I 
 
 SELECTION AND FORMULATION OF 
 EXPERIMENTAL PROBLEM 
 
 I. VALUE AND PREVALENCE OF EXPERIMENTATION IN 
 EDUCATION 
 
 Prevalence of Experimentation.—Except for sporadic 
 exceptions and for continuous overlapping, the method for 
 the determination of truth has passed through three major 
 stages. The first stage is that of authority. When any 
 question arose as to the truth or falsity of any fact or 
 principle, it was referred by consent or force to the oracle, 
 chief, king, church, state, or other temporarily ascendant 
 individual or group. In the year 1922 the legislature of a 
 certain state decided by vote whether the principle of evolu- 
 tion is true or false. In this same year there were further 
 occasional evidences that vital educational matters were still 
 being decided on the basis of authority and authority alone. 
 
 The second stage is that of speculation. This represents
 a genuine advance. When this stage was reached,
 questions were no longer matters merely to be settled; they 
 were matters to be freely discussed. Broadly speaking, 
 America and American education have now advanced well 
 into this stage. 
 
 The third stage is that of hypothesis and experimentation. 
 This stage is not something perceived only in visions. We
 have seen enough of it to know its aspect and to appraise
 its promise. Since earliest times a tiny stream of scien- 
 tific research has trickled through the ages, now above 
 ground, now below, now a dashing stream, now a desert rill, 
 but always flowing forward toward the future, and, in late 
 years, increasing greatly in volume. Today, educational 
 experimentation is accepted but not achieved. 
 
 These three, authority, speculation, and experimentation, 
 have been described as stages, and in a sense they are. 
 But, in a truer sense, they supplement each other. Specula- 
 tion, unless it becomes an end in itself, is a fruitful source 
 of hypotheses or problems for research. Authority, when 
 founded upon tested knowledge rather than upon pure opin- 
 ion, has an essential function in the scheme of life and 
 education. 
 
 Everywhere there are evidences of an increasing tendency 
 to evaluate educational procedures experimentally. Though 
 measurement alone is not research, the marvelous spread 
 of the movement for scientific measurement of educational 
 products is a symptom of a new attitude which is favorable 
 for research. The establishment of numerous city and
 state bureaus of research is another evidence. Numerous 
 experimental schools have arisen for the purpose of re- 
 search, pseudo-research, or propaganda. Most of the de- 
 partments of the better teachers colleges have become satu- 
 rated with the new point of view. Scientific organizations, 
 research committees, an institute of educational research, 
 and large educational foundations are lending such impetus 
 as make experimental education the most important current 
 movement in education. 
 
 But even with all its growth we have barely entered the 
 stage of experimentation. Most educational theory still
 needs testing. Adequate testing of theory requires a rigid 
 scientific procedure. The technique of experimentation is 
 possessed today, with a few exceptions, mainly by a small 
 group of educational psychologists. Experimental education
 cannot hope to cope with its great task or develop much
 faster so long as superintendents, principals, and supervisors,
 not to mention teachers, are not equipped to solve
 their own problems for themselves. It is but a question of 
 time until educational leaders will be required to have a 
 command of research technique. Then the third stage has a
 chance to arrive. 
 
 Value of Experimentation. — Experimentation has 
 proved its worth by hastening the day when the test of truth 
 will be verification and conformity to our experience rather 
 than revelation and miraculous departure from our expe- 
 rience. Science asks us to believe in such unthinkable 
 things as the reality of ether, the absence of weight and 
 friction for celestial bodies, the existence of the atom, that 
 food makes thought, and the like. But these matters are 
 in conformity with logic or experimental evidence. As 
 Burroughs states, the helium atom has been proved to be an 
 objective entity as truly as that the sun is in heaven. 
 
 The practice of experimentation in a school or school 
 system pays in terms of an altered attitude on the part of 
 the entire staff, willingness to consider new proposals, and 
 an alertness for new methods and devices. Experimenta- 
 tion ploughs up the mental field. Teachers join their pupils 
 in becoming question askers. It is the absence of just such 
 stirrings of the mental soil, which, in all probability, is 
 responsible for the supposed fact that teachers fail to im- 
 prove after a few years of experience. 
 
 Experimentation pays in terms of cash. Three years
 ago an experiment was conducted in a school of five hun- 
 dred pupils. The purpose of the experiment was to evaluate 
 a group of teaching methods. A careful account was kept 
 of the increased ability secured. Careful estimates were 
 made of its financial value. A record was kept of expendi- 
 tures. The value of the increased abilities secured was 
 estimated to be worth $10,000. This estimate was based 
 upon the total cost in previous years of producing each unit 
 of ability. The cost of test material used, and of the special
 supervision required, amounted to $540. The net annual
 saving, not counting future compounding of the abilities,
 was $9,460.
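
 The arithmetic of this example can be set down explicitly. What follows is a minimal sketch in Python that uses only the two dollar figures quoted above; the function name net_annual_saving and the constant names are assumptions made for illustration and do not come from the experiment itself.

```python
def net_annual_saving(value_of_gains, cost_of_experiment):
    """Return the estimated value of the gains minus the cost of securing them."""
    return value_of_gains - cost_of_experiment


# Figures quoted in the text (dollars).
VALUE_OF_INCREASED_ABILITY = 10_000   # estimated from the prior cost per unit of ability
COST_OF_TESTS_AND_SUPERVISION = 540   # test material plus special supervision

saving = net_annual_saving(VALUE_OF_INCREASED_ABILITY, COST_OF_TESTS_AND_SUPERVISION)
print(f"Net annual saving: ${saving:,}")
```

 The same two-line account, value secured less cost incurred, can be kept for any experiment whose outcome is expressed in units to which a cost has already been assigned.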
 
 Recently an experiment has been conducted by Drans- 
 field, principal of a school in West New York, New Jersey, 
 and by Barton, superintendent of schools at Sapulpa, Okla- 
 homa. The purpose of these experiments was to evaluate 
 the plan for the teaching of reading described in “How to 
 Measure in Education.” The total points of A. Q. growth in 
 reading in the control school were 60. The points of growth 
 in the experimental school were 143. Even without taking 
 into account the improvement in history, geography, arith- 
 metic, etc., resulting from increased reading ability, or the 
 cumulative value to the pupils in future years, and even 
 without considering that the teachers have learned a new 
 process to use with other pupils, still the difference between 
 the two groups is worth thousands of dollars. Consider the 
 value to education of this and similar experiments, when 
 their influence shall have spread to the millions of pupils 
 in American schools. 
 
 The foregoing experiments have been described to show 
 that it is not unreasonable to claim that a widespread use 
 of scientific research could so increase the efficiency of 
 instruction as to save a year of instruction. The value 
 of such an achievement in financial terms is shown by the 
 following approximate figures: 
 
 Population of the United States             103,600,000
 Saving to each person through research      1 yr.
 Total saving                                103,600,000 yrs.
 Value of a year                             $1,000
 Saving for U. S.                            $103,600,000,000
 Population engaged in World War             1,300,000,000
 Saving for World War powers                 $1,346,800,000,000
 Saving for 100 generations                  $134,680,000,000,000

 $134,680,000,000,000 = 260 times U. S. Wealth = 790 times cost of World
 War = 395 times cost of all wars in recorded history.
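
 The chain of multiplications behind these approximate figures can be retraced as a check. The sketch below, in Python, uses only the numbers printed in the table; the variable names are illustrative assumptions. It shows that the printed World War saving is exactly thirteen times the saving for the United States, whereas the printed population of 1,300,000,000 at $1,000 each gives a slightly smaller product, which suggests that the printed population is a rounded figure.

```python
# Figures printed in the table.
US_POPULATION = 103_600_000
YEARS_SAVED_PER_PERSON = 1
VALUE_OF_A_YEAR = 1_000  # dollars
PRINTED_WORLD_WAR_POPULATION = 1_300_000_000
PRINTED_WORLD_WAR_SAVING = 1_346_800_000_000
PRINTED_SAVING_100_GENERATIONS = 134_680_000_000_000

# Steps that follow directly from the first four rows.
total_years_saved = US_POPULATION * YEARS_SAVED_PER_PERSON
saving_for_us = total_years_saved * VALUE_OF_A_YEAR

# What the printed World War population yields at the same rate.
saving_from_printed_population = (
    PRINTED_WORLD_WAR_POPULATION * YEARS_SAVED_PER_PERSON * VALUE_OF_A_YEAR
)

# How the printed World War saving relates to the U.S. saving.
multiple_of_us_saving, remainder = divmod(PRINTED_WORLD_WAR_SAVING, saving_for_us)

print(f"Saving for U. S.: ${saving_for_us:,}")
print(f"Printed population x $1,000: ${saving_from_printed_population:,}")
print(f"Printed World War saving / U. S. saving: {multiple_of_us_saving} (remainder {remainder})")
print("100 generations check:",
      PRINTED_WORLD_WAR_SAVING * 100 == PRINTED_SAVING_100_GENERATIONS)
```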
 
 Experimentation will pay the nation, the school system, 
 and the individual school. The time has now arrived when 
 it also pays the individuals who engage in it. If the financial
 reward is not large, the esteem of the profession is.
 There is no denying the fact that those educators who today
 are constructively studying educational problems by scien- 
 tific methods have achieved, or are destined to achieve, 
 positions of recognized leadership in education. They be- 
 come the final arbiters for most educational questions, for 
 the peculiar function of experimentation in education is to 
 be a court of last resort. 
 
 Methodology of Research.—Scientific educational re- 
 search may be grouped conveniently into three major divi- 
 sions,—descriptive investigations, experimental investiga- 
 tions, and causal investigations. The purpose of descriptive 
 investigations is to describe a situation as accurately and 
 objectively and quantitatively as possible. They involve 
 the collection of data, and the quantitative description of the 
 data by the following means: some mass measure, such as a 
 frequency distribution, frequency surface, order distribution, 
 or rank distribution; or some point measure, such as a mode, 
 mean, median, midscore, or percentile; or some variability 
 measure, such as a quartile deviation, median deviation, 
 mean deviation, or standard deviation; or some relationship 
 measure, such as a scatter diagram, contingency table, or co- 
 efficient of correlation; or some reliability measure, such 
 as a standard deviation of the measure, or probable error of 
 the measure; or some other of the standard statistical tech- 
 niques, such as are described in Rugg’s “Application of 
 Statistical Methods to Education,” or Thorndike’s “Mental 
 and Social Measurements.” 
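
 Several of the measures just named can be illustrated on a small set of scores. The following is a minimal sketch in Python, assuming two invented lists of test scores for ten pupils; the data, the function names describe and pearson_r, and the choice of the standard library's default quartile method are all assumptions made for illustration, not procedures taken from the works cited.

```python
import statistics

# Hypothetical scores for the same ten pupils on two tests (invented data).
reading = [42, 45, 45, 48, 50, 52, 55, 57, 60, 66]
arithmetic = [40, 44, 47, 46, 52, 50, 58, 55, 63, 65]


def describe(scores):
    """Return point measures and variability measures for one list of scores."""
    mean = statistics.fmean(scores)
    q1, _median, q3 = statistics.quantiles(scores, n=4)
    return {
        # Point measures
        "mode": statistics.mode(scores),
        "mean": mean,
        "median": statistics.median(scores),
        # Variability measures
        "quartile_deviation": (q3 - q1) / 2,
        "mean_deviation": statistics.fmean([abs(x - mean) for x in scores]),
        "standard_deviation": statistics.pstdev(scores),
    }


def pearson_r(xs, ys):
    """Return the product-moment coefficient of correlation between two lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / (sum_xx * sum_yy) ** 0.5


for name, value in describe(reading).items():
    print(f"{name}: {value:.2f}")
print(f"correlation of reading with arithmetic: {pearson_r(reading, arithmetic):.2f}")
```

 A mass measure such as a frequency distribution would be obtained from the same lists by counting the scores that fall in each step interval.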
 
 The purpose of experimental investigations is to evaluate 
 the methods, materials, and aims of education. It is to de- 
 termine the absolute or relative effects upon some subject 
 or subjects or pupils of one or more experimental factors. 
 
 The purpose of causal investigations is to start with some 
 observed effect and locate the cause or causes; to determine 
 whether hypothetical causes are really causes; or to deter- 
 mine just how much each of several causes contributes to 
 produce the effect. 
 
 McCall’s “How to Measure in Education” has for its
 purpose not only to tell how to use practically and construct
 scientifically mental and educational tests, but also to present
 the measurement, tabular, graphic, and statistical
 techniques required for the conduct of descriptive investi- 
 gations. This book is a sort of companion volume for 
 “How to Measure in Education,” and has for its purpose to 
 complete the presentation of the methodology of research. 
 The first book covers descriptive investigations. This book 
 presents the techniques for experimental and causal investi- 
 gations. 
 
 II. SELECTION OF EXPERIMENTAL PROBLEM 
 
 Planning an Experiment.—An experimenter ought to 
 think through his experiment from the conception of the 
 problem to the formulation of the conclusions and beyond. 
 If he has six months to devote to an experiment he can, with 
 advantage, spend five months in planning the experiment 
 and one month in conducting it. Ideally an experimenter 
 should not start his experiment until he has gone through, 
 mentally at least, every step even down to the smallest 
 statistical detail. Those who do not possess a vivid imagina- 
 tion can advantageously carry a miniature experiment with 
 hypothetical data through the various tabulation and sta- 
 tistical stages. 
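
 Such a miniature experiment with hypothetical data might be rehearsed as follows. This is only a sketch in Python under stated assumptions: the two methods, the five pupils in each group, and every score are invented, and the summary chosen (mean gain, standard deviation of gains, and difference between mean gains) is one plausible tabulation rather than a computation model prescribed by this book.

```python
import statistics

# Hypothetical initial and final test scores, invented solely to rehearse the steps.
method_a = {"initial": [30, 34, 28, 40, 36], "final": [38, 41, 33, 47, 45]}
method_b = {"initial": [31, 33, 29, 39, 37], "final": [35, 36, 33, 44, 40]}


def gains(group):
    """Return each pupil's gain from the initial test to the final test."""
    return [final - initial
            for initial, final in zip(group["initial"], group["final"])]


def summarize(group):
    """Tabulate the number of pupils, the mean gain, and the SD of the gains."""
    pupil_gains = gains(group)
    return {
        "n": len(pupil_gains),
        "mean_gain": statistics.fmean(pupil_gains),
        "sd_gain": statistics.stdev(pupil_gains),
    }


summary_a = summarize(method_a)
summary_b = summarize(method_b)
difference = summary_a["mean_gain"] - summary_b["mean_gain"]

print("Method A:", summary_a)
print("Method B:", summary_b)
print(f"Difference in mean gain (A - B): {difference:.2f}")
```

 Carrying even so small a set of figures through to the final difference forces decisions about what is to be tabulated and how, before any real pupil has been tested.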
 
 The importance of adequate planning cannot easily be 
 exaggerated. There is little justification for the contention 
 that a well-prepared plan is an inflexible plan. A plan can 
 be thorough and yet plastic enough to be altered to meet 
 unexpected emergencies. In fact original adequacy of plan 
 is probably correlated positively with a healthful plasticity. 
 
 Whenever the experimenter can afford the time, an actual- 
 trial experiment is superior to a mental-trial experiment. 
 Even the keenest vision of the most experienced experi- 
 menter cannot always foresee every difficulty which will 
 arise. Hence the theoretically best procedure is to follow 
 the mental-trial experiment with the actual-trial experiment,
 to modify and perfect the plan in the light of the actual
 trial, and, finally, to conduct the real experiment. 
 
 How to Find Experimental Problems.—The best way 
 to find genuine experimental problems is to become a scholar 
 in one or more specialties as early as possible. Thorndike 
 has done a great service for the cause of original research by 
 showing, in a convincing way, that the original mind is the 
 informed mind. The idea that much knowledge hampers a 
 man’s originality has taken deep root in the popular fancy, 
 as a result of its self-deceptive search for some crumb of 
 comfort for stupidity. The essence of originality is high 
 native intelligence plus adequate knowledge. Spencer de- 
 scribes knowledge as a sphere of light floating in an abyss 
 of darkness. As a rule, only those who live their mental 
 life on or in this sphere conceive fruitful problems. 
 
 A second way to discover fruitful problems is to read, 
 listen, and work critically and reflectively. It is well to 
 form the habit of reacting upon every situation with a ques- 
 tion mark, and to consider every untested theory as an hypo- 
 thesis. Between the lines of every worthwhile book are 
 enough problems and enough rich materials to make the 
 finder and utilizer famous. 
 
 A third method of discovering fruitful problems is to con- 
 sider every obstacle an opportunity for the exercise of in- 
 genuity instead of an insuperable barrier. A king once 
 placed a purse full of gold in the middle of a public road. 
 On the purse he placed a large stone. A soldier with his 
 head in the air and whistling a tune chanced that way. He 
 roundly cursed those who drove over that road for not re- 
 moving the stone and hence for the injury to his pride and 
 person. A wagoner, with the expenditure of much emo- 
 tion and considerable skill, maneuvered his wagon past 
 the obstacle. Since no one who passed that way had formed 
 the mental habit of considering every obstacle an opportunity,
 the reward beneath the obstacle went by default to
 the king. 
 
 A fourth method of finding problems is to start a research
 and watch problems bud out of it. The very process of research
 stirs up a hornet’s nest of insistent problems. Spencer
 expressed a profound truth when he said that if we
 enlarge ever so little the sphere of light we increase infinitely 
 its points of contact with the darkness. 
 
 A fifth method of finding problems is not to lose those 
 already found. Almost everyone has probably been given 
 for a moment—probably some odd and unexpected mo- 
 ment—some rare insight. These flashes come, linger for a 
 moment, go, and are forgotten beyond recall. Twiss attri- 
 buted his rise to a university position to one fact. He 
 bought a steel filing case and recorded and filed original 
 ideas and problems before they were forgotten. So vital 
 for professional growth is this matter of finding and record- 
 ing problems, that the worth of an educator can probably 
 be measured by asking him to list in ten minutes as many 
 as he can of worth-while educational problems. 
 
 What Experimental Problem to Select.—It goes with- 
 out saying, and yet it needs to be said, that experimenters 
 should select problems whose solution is not already known. 
 One of the abler men in educational measurement reported, 
 at a recent gathering of scientific workers, the results of a 
 painstaking and exceptionally original research. Unfor- 
 tunately the same problem had already been solved and 
 the results published. Thorndike tells of a student who 
 submitted to him the results of a research which the candi- 
 date hoped would be acceptable for a Ph.D. thesis. In 
 submitting the manuscript the candidate wrote that he 
 knew the research was original for he had been careful to 
 avoid reading anything whatever about the subject. 
 
 As a rule, an experimenter should select and work upon 
 problems in his own specialty. It will be shown later that 
 successful experimentation requires such a detailed knowl- 
 edge of the factors operating in a particular situation, and 
 of the influence of these factors, as only a trained and expe- 
 rienced individual possesses. Recently, some students of 
 experimentation, who were reasonably expert in education
 only, attempted to plan an experiment in chemistry. The
 undertaking was soon abandoned. No one seemed to know 
 the influence of temperature upon certain chemical reactions. 
 This necessity of intimate knowledge probably explains why 
 over 99 per cent of all discoveries are made by experts in 
 the field of discovery. During the World War, the War 
 Department established a clearing house for popular inven- 
 tions. A few valuable suggestions were received, but in 
 the main the bulk of all research had to be done by a mere 
 handful of experts. 
 
 An experimenter should select the relatively more vital 
 problems. There are many problems which are worth
 solving but not relatively worth solving. The number of 
 those willing or competent to undertake research is too 
 small and their time too valuable to expend effort on prob- 
 lems not of vital consequence. 
 
 An experimenter should select a problem whose solution 
 is feasible, and should set up hypotheses capable of proof. 
 However vital the hypothesis, if it is not susceptible of 
 proof it should be discarded, for the present at least. Un- 
 fortunately, the solution of many experimental problems of 
 great worth is often not feasible, because needed tests have 
 not been constructed, or because appropriate subjects are 
 not available, or because the experimenter cannot sufficiently 
 control the situation in which the proposed experiment is to 
 be conducted, or for some other reason. Thus, the excellence 
 of an experimental problem depends upon several factors, 
 and hence it should be selected in the light of these factors. 
 A more comprehensive list of these conditioning factors will 
 be given later. 
 
 III. FORMULATION OF EXPERIMENTAL PROBLEM
 
 Types of Formulation.—There are three types of indi- 
 viduals engaged in educational research, and the types are 
 clearly indicated by the way they formulate their problems. 
 
 The first type of experimenter “flutters in all directions
 and flies in none!” He formulates problems so that their
 scope is scarcely less wide than the universe. Such broad 
 formulations offer little practical aid in planning the details 
 of an experiment. Gazing at the stars, this experimenter 
 steps into every snare at his feet. Just as a teacher cannot 
 teach arithmetic in general, or spelling in general, but, in- 
 stead, must teach particular examples or particular words, 
 so an experimenter is likely to think and act very irrele- 
 vantly if he is guided by a broad formulation only. 
 
 Recently an experimenter came for consultation about 
 a problem which he had formulated thus: What is the 
 effect of various factors upon learning? After a little urging 
 he departed and returned later with this formulation: What 
 are the effects of distribution of time upon learning? He 
 was commended for the improvement made. At a later stage 
 the problem had become: Will a typical fourth-grade class 
 in silent reading, spending three thirty-minute periods per 
 week, accomplish more or less than an equivalent class 
 spending five periods of eighteen minutes each per week? 
 Even this is too broad for a final working formulation. 
 
 The second type may be called the pot-hole type. Near 
 the Cumberland Falls, the Cumberland River has a stone 
 bed pitted with pot-holes. These holes were made by small 
 hard pebbles which lodged in originally slight concavities 
 and which, due to the action of the water, have ground round 
 and round, thereby making the pebbles smaller and the hole 
 wider and deeper. There are indefatigable individuals engaged
 in educational research whose experimental problems
 are admirably specific. They are as narrow as the pebbles 
 in the pot-hole. And, like the pebbles, their problems be- 
 come narrower and narrower as their research proceeds. 
 Such experimenters are experimental drudges. They do 
 much excellent work, but each research is isolated from 
 every other. There is an absence of general plan. There 
 is no mental reaching for the larger implications. They 
 are as lop-sided as the first type. 
 
 The third type of experimenter is the truly admirable one. 
 
 
 He is the scholarly type. He perceives the larger meanings 
 of each minute investigation. This glorifies the drudgery 
 inherent in all careful research. The scholarly experimenter 
 first formulates a broad problem. This gives the larger
 goal and permits perspective. He then breaks up the broad 
 problem into very narrow, specific problems. These are the 
 working units. As the results from the specific investiga- 
 tions come in, he fits the bits together into a beautiful mosaic. 
 The solution of any one specific problem may be of no 
 practical value. It merely contributes to the solution of 
 the larger problem which alone has genuine practical sig- 
 nificance. Hence, it is desirable that there be a hierarchy 
 of formulations from very broad to very specific. 
 
 A working formulation of an experimental problem should 
 clearly describe: (1) the experimental factor or factors 
 whose effect or effects are being studied, (2) the experi- 
 mental subjects or individuals or pupils to whom the experi- 
 mental factor or factors are to be applied, and who are 
 expected to register the effect or effects, (3) the nature of 
 the effects expected and to be measured. In sum, a working 
 formulation requires that the experimenter must have 
 analyzed his problem in rough outline at least. 
 
 Why and When to Survey Bibliography on a Prob- 
 lem.—The time to make a survey of the bibliography on 
 an experimental problem is the opposite of the time when 
 the survey is all too frequently made. Often an investi- 
 gator has completed his experiment and has prepared his 
 manuscript for publication before he hurriedly collects a 
 list of references. The prime function of a bibliographical 
 survey is not to provide a dignified list of references to 
 append to an article, but to serve as a practical guide to the 
 formulation of the subordinate problems, and to the general 
 planning of the investigation. Hence, the survey of the 
 bibliography should immediately follow the formulation of 
 the experimental problem or problems. 
 
 If there were no other reason, self-respect as a scholar 
 should be adequate motivation for surveying a bibliography. 
 
 
 Such a survey will avoid many public humiliations. Pride 
 is not fostered by saying: “This is something never done
 before,” only to discover later that claim to originality is 
 unjustified. Such humiliations will be frequent enough at 
 best without actually inviting them. 
 
 An initial bibliographical survey will prevent repeating an 
 investigation already done. There are few things more
 important than the conservation of the time and effort of 
 scientific men. The importance of avoiding repetition does 
 not, of course, mean that it may not be desirable, on occasion,
 to verify¹ a previous investigation. But it is necessary
 to discriminate between ignorant repetition and conscious
 verification.
 
 Again, a bibliographical survey will often suggest addi- 
 tional incidental problems to be settled. There are few men 
 who have extensively engaged in research who cannot testify 
 to many keen regrets because numerous subsidiary problems 
 were conceived too late to make possible their solution at 
 the time the major problem was being attacked. It fre- 
 quently happens that merely minor modifications in an in- 
 vestigation will make possible the solution of five problems 
 instead of one. The importance of conceiving these prob- 
 lems early can be appreciated when it is recalled that many 
 of the world’s greatest discoveries were by-products rather 
 than major objectives of experimental investigations. 
 
 Again, a bibliographical survey helps by offering sugges- 
 tions of procedure and of errors to be avoided. A bibliog- 
 raphy is the recorded experience of previous investigators. 
 The cleverest investigator is seldom able to make an experimental
 plan so perfect that there will be no subsequent
 regrets. Foresight is never a perfect substitute for expe- 
 rience. The bibliography reveals not only the methods 
 employed and the instruments evolved by others but also 
 criticisms of these on the basis of experience. 
 
 Finally, a bibliographical survey provides material which 
 
 ¹ Wm. A. McCall, “Reliability of a Ph.D. Research Dissertation in Educational
 Psychology,” School and Society, April 13, 1918. 
 
 
 will be needed in describing the experiment conducted. It 
 is desirable to preface an experimental article with a sum- 
 mary of previous related investigations, and to close it with 
 a relevant bibliography. These, as well as all previously 
 mentioned objectives of the bibliographical survey, should be 
 realized at one and the same time. 
 
 Procedure in Making a Bibliographical Survey.—The 
 procedure of the bibliographical survey should be a highly 
 selective one. The experimental problems are the key to 
 this procedure. Throughout the survey, they should be kept 
 in mind constantly. Everything relevant to them should 
 be seized upon and examined for possible aids. Relevancy 
 to the problems is the principle of selection; helpfulness in 
 furthering the experiment, or its description, is the principle 
 of retention. 
 
 Not the principles of selection and retention but the 
 method of discovery is the chief difficulty in surveying a 
 bibliography. The problem is to know where to look for 
 material likely to be relevant. The method pursued will 
 vary somewhat with the problem and the situation of the 
 experimenter. The following general suggestions may, how- 
 ever, be given: (1) Make inquiries of those who may be 
 able to contribute unrecorded information. (2) Make in- 
 quiries of those who may be able to suggest references to 
 be examined. (3) Go to the contents and references in 
 books known to deal with the same or related problems. 
 (4) Consult the same and related topics in the library’s 
 topically indexed card catalog. (5) Consult the Readers’ 
 Guide to Periodicals. (6) Consult the monthly index to 
 educational publications published by the Bureau of Educa- 
 tion at Washington. (7) Consult the Psychological Index 
 and the index volumes for certain periodicals. (8) Consult 
 such summarizing journals as the Psychological Bulletin. 
 (9) Consult the table of contents of special periodicals not 
 indexed in the Readers’ Guide. The discovery of a single 
 relevant reference by the above procedure frequently leads 
 to the discovery of many other references. 
 
CHAPTER II
 SELECTION OF EXPERIMENTAL METHOD 
 
 I. TYPES OF EXPERIMENTAL METHODS
 
 A. One-group Method.—The most frequently used of 
 all types of investigations or experiments is the one-group 
 type, and it occurs as frequently in the physical and social 
 sciences as in the mental. When the physicist subtracts a 
 defined amount of heat from a bar of metal and measures 
 the resulting contraction, he is using the one-group method. 
 When the chemist pours one chemical mixture into another 
 and analyzes the resulting precipitate, he is employing the 
 one-group method. When a psychological examiner fires a 
 pistol behind a candidate for aviation and measures the 
 resulting jump, he is employing the one-group method. 
 When a teacher scolds her class for inadequate preparation 
 and measures the resulting increase or decrease in study, 
 she is employing the one-group method. When a nation like 
 France applies to itself republicanism or a nation like Rus- 
 sia applies to itself bolshevism and observes the result, it, 
 too, is employing the one-group method. Similarly, when 
 a teacher compares the effectiveness of scolding vs. praising, 
 or instruction by one method vs. instruction by another 
 method, she, too, is employing the one-group method, pro- 
 vided the two contrasted factors are tried out upon the 
 identical group. A one-group experiment has been con- 
 ducted when one thing, individual, or group has had applied 
 to it or subtracted from it some experimental factor or fac- 
 tors and the resulting change or changes have been estimated 
 or measured.
 
 
 The one-group method may be represented in formula 
 form as follows: 
 
 One Group — Two EF’s — One Test Type 
 S — (IT — EF1 — FT — C1) — (IT — EF2 — FT — C2)
 
 where S is the experimental subject, thing, or group. 
 
 IT is the initial test or status of S before EF1 and EF2 are,
 in turn, added to or subtracted from S.

 EF1 is one of the two experimental factors.

 EF2 is the other experimental factor.

 FT is the final test or status of S after EF1 and EF2 have, in
 turn, been applied.

 C1 is the change in S produced by EF1, and is found by computing
 the difference between the IT and FT which immediately
 precede and succeed EF1 respectively.

 C2 is the change in S effected by EF2.

 The conclusion is yielded by comparing the amounts of C1
 and C2. If C1 is larger, EF1 has been more effective than
 EF2, and vice versa.
 
 Thus, if a teacher wished to compare the effects of prais- 
 ing vs. scolding, at the beginning of a class period, upon 
 the amount of discussion on the part of pupils during the 
 class period, she would make an initial test (IT) of the 
 amount of discussion which normally occurs. Then she 
 would praise (EF1) the class at the beginning of some class
 period. During the remainder of the class period she would 
 test (FT) the amount of discussion. Then she would com- 
 pute the difference (C1) between the initial test and final 
 test. As soon as the effects, if any, of the praising had worn 
 off, she would make another IT or else assume that it would 
 be identical with the first IT, scold the pupils, make an FT, 
 and compute the amount of alteration (C2) produced by 
 scolding. A comparison of the amount and direction of C1
 and C2 would yield the correct conclusion from this experiment,
 provided proper experimental precautions were taken,
 and provided the effects of the praising really did wear off, 
 as evidenced by the second IT. 
 
 
 Assuming the data to be as shown below, the computa- 
 tions for the praising (EF1) vs. scolding (EF2) experiment 
 are indicated. 
 
 S — (20 — EF1 — 25 — +5) — (20 — EF2 — 18 — -2)
 Difference equals 7 in favor of EF1.
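
The arithmetic of the basic formula can be sketched in a few lines of Python. This is only an illustrative sketch and not part of the original method: the function name `change` and the variable names are assumptions, while the scores are those of the praising vs. scolding example above.

```python
def change(initial_test, final_test):
    """Return C, the change in S: final test minus initial test."""
    return final_test - initial_test


# Scores from the praising (EF1) vs. scolding (EF2) example.
c1 = change(20, 25)   # change under EF1 (praising)
c2 = change(20, 18)   # change under EF2 (scolding)

# A positive difference favors EF1; a negative one favors EF2.
difference = c1 - c2

print(c1, c2, difference)  # 5 -2 7
```

Keeping the sign of each change, rather than its absolute size, preserves the direction of the effect, which the text treats as part of the conclusion.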
 
 The one-group experimental method may be divided upon 
 the basis of the number of experimental factors contrasted. 
 Strictly speaking, there are no one-factor experiments. The 
 nearest approach to such an experiment is where some one 
 factor is added to or subtracted from S. If a teacher makes 
 an IT of her class, adds a good scolding, makes an FT, and 
 computes C, she may be said to have performed an experi- 
 ment with one factor—an experiment which requires only 
 the former or latter half of the above basic formula. On 
 the other hand, it might be argued that she really employed 
 two factors, namely, not scolding or a control EF vs. scold- 
 ing, and that therefore she would require all of the above 
 formula. Since the influence of EF1 (not scolding) would 
 be to leave the pupils unchanged, IT and FT in the former 
 half of the formula would be identical and C1 would be 
 zero. Either approach leads to the same practical con- 
 clusion. 
 
 While half of the formula will suffice when the two fac- 
 tors are really the presence and absence of one identical 
 factor, the entire formula is required when the two EF’s are, 
 not mere presence and absence of one EF, but two EF’s 
 different in nature. Thus, if a teacher wished to compare 
 the effect of praising vs. scolding her class, or of teaching 
 her class by one method vs. another method, C1 could not
 be assumed to be zero. Both praising and scolding, or both 
 methods of teaching might alter the original status of S. 
 Since the longer formula is correct in all one-group experi- 
 ments and is necessary in some, confusion will be avoided 
 by adopting it as the basic formula for one-group experi- 
 ments. 
 
 In certain other situations the basic formula may be 
 
 
 shortened by eliminating both the IT and C, whereupon the 
 formula for the one-group experiment reduces to 
 
 S — (EF1 — FT) — (EF2 — FT)
 
 This plan is very economical and its use in preference to 
 the more laborious basic plan is justifiable when S may be 
 assumed to have an IT of zero, for in this case C becomes 
 identical in amount with FT. When an experimenter wishes, 
 for example, to discover how much a group of pupils can 
 learn of certain new material taught for a defined length 
 of time according to a defined method, he may employ the 
 abbreviated experimental plan, provided the material to be 
 taught is sufficiently new that pupils will start with
 zero knowledge of it. But since all these variations on 
 the basic plan operate in special situations only, whereas 
 the basic plan will operate in any one-group experiment, 
 confusion will be avoided by keeping in mind the basic 
 plan only. 
 
 There remains to consider the formula required to handle 
 more than two EF’s. The basic formula assumes two EF’s. 
 It can be indefinitely extended by lengthening the formula 
 to provide for EF1, EF2, EF3, and so on, with their corre- 
 sponding C1, C2, C3, etc. 
 
 In many one-group experiments the changes produced by 
 each EF are manifold, so that one test cannot measure 
 them. Thus, a certain EF may change not only a pupil’s
 reading ability but his spelling ability also. To measure 
 both these effects will require at least two types of tests, 
 namely, a reading test and a spelling test. Hence, one- 
 group experiments may be divided into those requiring one 
 type of test and those requiring two or more types of tests. 
 The former has already been diagramed; the latter is dia- 
 gramed below. This diagram assumes that two EF’s are 
 employed and two types of tests are required. Observe 
 that S and the two EF’s remain unchanged. C1 vs. C2, and
 C3 vs. C4 show the two conclusions from this experiment. 
 Provision can be made for more EF’s by extending the for- 
 
 
 mula to the right and for more types of tests by extending 
 it downward. 
 
 One Group — Two EF’s — Two Test Types 
 
 S — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
     (IT2 — EF1 — FT2 — C3) — (IT2 — EF2 — FT2 — C4)
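
A rough sketch of how the two-test diagram might be computed follows. The scores and the test names "reading" and "spelling" are hypothetical, the helper `changes_by_test` is an assumed name, and the sketch takes the second initial test to be identical with the first, as the text allows when the earlier effect has worn off.

```python
def changes_by_test(initial, final):
    """Map each test type to its change C = FT - IT."""
    return {test: final[test] - initial[test] for test in initial}


# Hypothetical scores for one group S on two test types.
initial = {"reading": 40, "spelling": 30}
final_after_ef1 = {"reading": 46, "spelling": 31}
final_after_ef2 = {"reading": 43, "spelling": 35}

ef1 = changes_by_test(initial, final_after_ef1)  # C1 (reading), C3 (spelling)
ef2 = changes_by_test(initial, final_after_ef2)  # C2 (reading), C4 (spelling)

# One conclusion per test type: C1 vs. C2, and C3 vs. C4.
# Positive values favor EF1, negative values favor EF2.
conclusions = {test: ef1[test] - ef2[test] for test in initial}
print(conclusions)
```

With these invented numbers the two test types point in opposite directions, which is why the diagram yields two separate conclusions rather than one.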
 
 B. Equivalent-groups Method. — The equivalent- 
 groups method has been devised for experimental situations 
 where, for reasons to be mentioned shortly, the one-group 
 method is inapplicable. Distinctive features of this method 
 are (1) that there are more than one group, or S, and (2) 
 that all groups are equivalent. Normally, there are as many 
 S’s as there are EF’s, and each S is supposed to be equiva- 
 lent to any other. Thus, if a teacher wishes to compare 
 the effect of scolding vs. praising and employs the equivalent- 
 groups method, she selects two equivalent groups. She 
 scolds one group and measures the change, and praises the 
 other group and measures the change. The diagram for an 
 equivalent-groups experiment with one type of test follows. 
 S1 refers to one group and S2 to the other. The conclusion
 from the experiment is yielded by a comparison of C1
 and C2. 
 
 Equivalent Groups — Two EF’s — One Test Type 
 
 S1 — (IT1 — EF1 — FT1 — C1)
 S2 — (IT1 — EF2 — FT1 — C2) 
 
 When two types of tests are used, this formula takes on 
 the form shown below. The two conclusions are yielded by 
 a comparison of C1 with C3, and C2 with C4.
 
 Equivalent Groups — Two EF’s — Two Test Types 
 S1 — (IT1 — EF1 — FT1 — C1)
 (IT2 — EF1 — FT2 — C2) 
 S2 — (IT1 — EF2 — FT1 — C3) 
 (IT2 — EF2 — FT2 — C4) 
 
 The following formula is utilized for three EF’s and two 
 test types. Guided by the principles exemplified in this and 
 
 
 the two preceding formulae, a formula may be constructed 
 for any number of EF’s, and any number of test types. 
 
 Equivalent Groups — Three EF’s — Two Test Types
 S1 — (IT1 — EF1 — FT1 — C1)
      (IT2 — EF1 — FT2 — C2)
 S2 — (IT1 — EF2 — FT1 — C3)
      (IT2 — EF2 — FT2 — C4)
 S3 — (IT1 — EF3 — FT1 — C5)
      (IT2 — EF3 — FT2 — C6)
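
The equivalent-groups diagrams extend to any number of EF’s and test types, and that extension can be sketched as below. The data are hypothetical, and the function names `group_changes` and `best_group` are assumptions made for illustration. The sketch also takes for granted that the groups really are equivalent, which the computation itself cannot check.

```python
def group_changes(scores):
    """Convert {group: {test: (IT, FT)}} into {group: {test: C}}."""
    return {
        group: {test: ft - it for test, (it, ft) in tests.items()}
        for group, tests in scores.items()
    }


def best_group(changes, test):
    """Return the group (and hence the EF) with the largest change on a test."""
    return max(changes, key=lambda group: changes[group][test])


# Hypothetical data: S1 receives EF1, S2 receives EF2, S3 receives EF3.
scores = {
    "S1": {"test1": (50, 58), "test2": (30, 33)},
    "S2": {"test1": (50, 55), "test2": (30, 37)},
    "S3": {"test1": (50, 52), "test2": (30, 31)},
}

changes = group_changes(scores)
for test in ("test1", "test2"):
    print(test, best_group(changes, test))
```

Adding a fourth EF means adding one more group entry, and adding a test type means adding one more key inside each group, mirroring the way the diagrams grow downward.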
 
 C. Rotation Method.—The rotation method is particu- 
 larly useful for solving experimental problems insoluble by 
 other methods. It is a unique combination of two or more 
 one-group methods. When the various groups employed are 
 equivalent, the rotation method is a combination of one- 
 group and equivalent-groups methods. 
 
 As the name implies, the distinctive feature of the rota- 
 tion method is that of rotation—rotation of S’s, or EF’s or 
 irrelevant factors. If a teacher wishes to study, by means 
 of the rotation method, the effect of praising vs. scolding, 
 she first praises S1, and measures the result, and then scolds
 the same S1, and measures the result. This is the one-group
 method thus far. She first scolds S2, and measures the re- 
 sult, and then praises S2, and measures the result. In other 
 words, she rotates the order of the EF’s. She combines the 
 results from praising both groups, and compares the sum so 
 found with the sum of the results from scolding both groups. 
 This comparison shows whether praising has been more or 
 less effective than scolding, how much, and in what direc- 
 tion. The simplest form of rotation method, namely, two 
 EF’s and one type of test, is given below. The conclusion 
 is yielded by a comparison of C1 plus C4 with C2 plus C3. 
 
 Rotation — Two EF’s — One Test Type 
 S1 — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
 S2 — (IT1 — EF2 — FT1 — C3) — (IT1 — EF1 — FT1 — C4)
 EF1 = C1 + C4
 EF2 = C2 + C3
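
The pooling of changes in the two-factor rotation (C1 plus C4 against C2 plus C3) can be sketched as follows. The scores are hypothetical and the record layout is an assumption chosen for illustration.

```python
# Each record: (group, experimental factor, initial test, final test).
# Hypothetical scores; S1 takes EF1 then EF2, S2 takes EF2 then EF1.
trials = [
    ("S1", "EF1", 20, 26),  # C1
    ("S1", "EF2", 20, 23),  # C2
    ("S2", "EF2", 22, 24),  # C3
    ("S2", "EF1", 22, 27),  # C4
]

# Sum the changes belonging to each factor across both groups.
totals = {}
for group, factor, initial, final in trials:
    totals[factor] = totals.get(factor, 0) + (final - initial)

print(totals)
```

Because each factor is applied once in first position and once in second position, any advantage of coming first is shared by both sums.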
 
 
 If a teacher wishes to determine by means of the rota- 
 tion method the effect of praising vs. scolding vs. sarcasm, 
 the formula becomes as shown below. The conclusion is
 derived from a comparison of C1 plus C6 plus C8 with C2
 plus C4 plus C9 with C3 plus C5 plus C7.
 
 Rotation — Three EF’s — One Test Type 
 
 S1 — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
 — (IT1 — EF3 — FT1 — C3)
 S2 — (IT1 — EF2 — FT1 — C4) — (IT1 — EF3 — FT1 — C5)
 — (IT1 — EF1 — FT1 — C6)
 S3 — (IT1 — EF3 — FT1 — C7) — (IT1 — EF1 — FT1 — C8)
 — (IT1 — EF2 — FT1 — C9)

 EF1 = C1 + C6 + C8

 EF2 = C2 + C4 + C9

 EF3 = C3 + C5 + C7
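
One way to build such a rotation for any number of EF’s is a cyclic shift, sketched below. This is an assumption about how the diagram generalizes, not a procedure stated in the text, and the function names are hypothetical. The sketch numbers the changes C1, C2, and so on, row by row, and reports which of them belong to each factor.

```python
def rotation_schedule(factors):
    """Group i receives the factors in cyclic order, starting at factor i."""
    n = len(factors)
    return [[factors[(i + k) % n] for k in range(n)] for i in range(n)]


def changes_per_factor(schedule):
    """Number the changes row by row and list those belonging to each factor."""
    members = {}
    number = 0
    for order in schedule:
        for factor in order:
            number += 1
            members.setdefault(factor, []).append(number)
    return members


schedule = rotation_schedule(["EF1", "EF2", "EF3"])
# schedule[0] is the order for S1, schedule[1] for S2, schedule[2] for S3.
print(schedule)
print(changes_per_factor(schedule))
```

For three factors the result agrees with the sums given in the diagram: C1, C6, and C8 for EF1; C2, C4, and C9 for EF2; C3, C5, and C7 for EF3.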
 
 A diagram for a rotation method with two EF’s and for 
 two types of tests follows. The two conclusions from the 
 experiment are yielded by a comparison of the sum of C1 
 and C6 with the sum of C2 and C5, and by a comparison
 of the sum of C3 and C8 with the sum of C4 and C7. 
 
 Rotation — Two EF’s — Two Test Types 
 
 S1 — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
      (IT2 — EF1 — FT2 — C3) — (IT2 — EF2 — FT2 — C4)
 S2 — (IT1 — EF2 — FT1 — C5) — (IT1 — EF1 — FT1 — C6)
      (IT2 — EF2 — FT2 — C7) — (IT2 — EF1 — FT2 — C8)

 EF1 on test 1 = C1 + C6

 EF2 on test 1 = C2 + C5

 EF1 on test 2 = C3 + C8

 EF2 on test 2 = C4 + C7
 
 This, as well as any other experimental method, can be 
 indefinitely extended by multiplying the number of factors, 
 or tests, or both. The student will do well to stop at this 
 point and prove his mastery of what has preceded by mak- 
 ing a few sample extensions of each method that has been 
 diagramed. 
 
 
 II. CRITERIA FOR SELECTING EXPERIMENTAL METHOD 
 
 A. One-group Method.—When the purpose of an ex- 
 periment is to determine the amount of change due directly 
 to an EF, the one-group method is valid: 
 
 (1) Where the total net change in the trait or traits in 
 question produced by irrelevant factors is negligible, or 
 where the amount of such change is measured and dis- 
 counted by the application of a control EF. 
 
 (2) Where the change produced in S by an EF is not 
 conditioned significantly by any preceding EF. 
 
 (3) Where the change effected by each EF is measurable 
 in equal units. 
 
 Here is an experimental problem which came to the atten- 
 tion of the writer recently: Will the appointment of a 
 physical instructor (EF1) or the establishment of school 
 luncheons (EF2) improve the health (weight, etc.) of ele- 
 mentary school pupils? The purpose of the individual who 
 formulated this problem was to determine whether a phys- 
 ical instructor or school luncheons will alter the weight, etc., 
 of pupils, and if so, how much. 
 
 Even in the case of an inanimate S, it is extraordinarily 
 difficult to create an experimental situation where all irrele- 
 vant factors—disturbing factors—are eliminated. In the 
 case of an animate S like the above, irrelevant factors of 
 considerable magnitude are unavoidable. But irrelevant 
 factors will not invalidate this experiment provided their in- 
 fluence is relatively negligible. Hundreds of influences con- 
 tinuously play upon pupils. Compared to the influence of 
 the EF, most, or sometimes all, of these irrelevant factors 
 exercise a comparatively small influence. 
 
 Even significant irrelevant factors will not invalidate this 
 experiment provided the total net change is negligible.
 Though pupils are continuously registering the effects of a 
 multitude of accidental or chance or uncontrollable in- 
 fluences, some of these tend to facilitate and some to inhibit 
 
 
 progress in the trait in question. No trouble is caused 
 provided these positive and negative influences balance or 
 so nearly balance as to give a negligible net total. 
 
 In the case of our sample problem, will the net total 
 change produced by irrelevant factors be negligible? There 
 are excellent reasons for believing that this net total will 
 be a considerable increase in weight due to, not to mention 
 other possibilities, the significant irrelevant factor of natural 
 maturing. 
 
 But even this significant irrelevant factor of maturing 
 does not invalidate the one-group method provided the 
 amount of its influence can be measured and discounted by 
 the application of a control EF (CEF). Thus, we might 
 measure the amount of increase in weight due to one year of 
 maturing, and then apply a year of school luncheons, and 
 then remove school luncheons and apply a year of a phys- 
 ical instructor. The first year would be a control EF be- 
 cause during this time the pupils would presumably be 
 treated exactly the same as during the two following years, 
 except for the EF’s of school luncheons and physical in- 
 structor. By computing the difference between the increase 
 during the first year and each of the other two years it 
 would be possible to determine the amount of increase attri- 
 butable to each regular EF. 
 
 Where there are a CEF and two regular EF’s the basic 
 formula for the one-group method is shown below. Before 
 C1 and C2 are compared, the amount of CC should be subtracted
 from each.
 
 One Group — CEF and Two EF’s — One Test Type
 S — (IT — CEF — FT — CC) — (IT — EF1 — FT — C1) — (IT — EF2 — FT — C2)
 EF1 = C1 - CC
 EF2 = C2 - CC

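
The discounting of the control change can be sketched as follows. The yearly gains are hypothetical figures invented for illustration, and `net_effect` is an assumed helper name.

```python
def net_effect(change, control_change):
    """Subtract the control change CC from the change observed under an EF."""
    return change - control_change


# Hypothetical yearly weight gains, in pounds, for the same group S.
cc = 6.0   # control year (CEF): maturing and other irrelevant factors only
c1 = 8.5   # year under EF1
c2 = 7.0   # year under EF2

ef1 = net_effect(c1, cc)
ef2 = net_effect(c2, cc)
print(ef1, ef2)  # 2.5 1.0
```

The subtraction is meaningful only if the irrelevant factors contribute about the same amount in each of the three periods, which is the assumption the control EF is meant to support.
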
 Will one EF condition or carry-over to any succeeding 
 EF? Since the control EF may be dispensed with in ex- 
 periments where the net total change produced by irrelevant 
 factors is negligible, and also in certain other experiments, 
 as will be shown later, and since the control EF is really 
 
 
 identical with the preëxperimental factor, these two may be
 considered together. Thus, if an experimenter desires to
 compare the relative effectiveness of teaching pupils subtraction
 by the additive method vs. the subtractive method,
 it is important to inquire whether the pupils are just begin- 
 ning subtraction or whether they have been taught for some 
 time previously by the additive or subtractive or some other 
 method. The additive method, superimposed upon a long 
 training according to the subtractive method, may yield re- 
 sults markedly different from that of an additive method 
 superimposed upon an additive training or no training at 
 all. The function of an initial test is to prevent the first 
 regular EF from getting credit or blame for changes pro- 
 duced by a control EF or, lacking a control EF, the pre- 
 experimental factor. But there may be a carry-over of 
 inhibiting or facilitating purposes, methods of work, or in- 
 formation, or all of these which are not removed by the 
 initial test sieve.
 
 When the amount of this carry-over is significantly large, 
 the experimenter has two alternatives. He may seek an S 
 whose preëxperimental experiences have been such as to
 avoid the carry-over, or he may continue with the original 
 S, and remember to state the final conclusions from the ex- 
 periment in the light of the condition of S antedating the 
 experiment. The experimenter does not have the alternative 
 of selecting another experimental method, for every experimental
 method is handicapped equally by this preëxperimental
 factor.
 
 It is necessary to inquire, not only concerning the carry-over
 from the preëxperimental factor or control EF, but also
 concerning the carry-over from one regular EF to any suc- 
 ceeding EF. Will a physical instructor for a year prior to 
 school luncheons add to or detract from the effectiveness of 
 school luncheons? Or vice versa, will school luncheons add 
 to or detract from the effectiveness of a physical instructor? 
 Will the additive EF, preceding a subtractive EF, facilitate 
 the effectiveness of the subtractive EF, or inhibit it, or vice 
 
 
 versa? Unless there are reasons for believing that any such 
 carry-over will be relatively negligible, the experimenter had 
 better avoid the one-group method. 
 
 If there are reasons for believing that EF1 will condition 
 EF2 but that EF2 will not carry-over to EF1, the one-group 
 method is valid, provided EF2 is applied first, since an EF 
 cannot condition a preceding EF. 
 
 There is this difference between a carry-over from a preëxperimental
 factor or from a control EF to a regular EF,
 and the carry-over from one regular EF to another. In the 
 former situation the experimenter does not have the alterna- 
 tive of selecting another experimental method whereas in 
 the latter situation he does. 
 
 Finally, can the changes effected respectively by the con- 
 trol EF, school luncheons, and physical instructor be meas- 
 ured in equal units? Since all weight changes will be 
 measured in units of pounds, let us say, and since the scale 
 for weight is a uniform scale, it would appear that the units 
 could be called equal. The use throughout the entire ex- 
 periment of a uniform scale with uniform and equal units 
 would seem to be all that could be asked. It is, provided 
 equality of units means equal ease of effecting a unit of 
 change in S at all points on the scale. The units on a scale 
 may be equal in some senses and be quite unequal in an 
 experimental sense. In one sense the interval from ninety- 
 seven to ninety-eight pounds is equal to the interval from 
 one hundred ten to one hundred eleven pounds. In each 
 case the interval is one pound. But it may be more 
 difficult to increase the weight of a particular pupil from 
 one hundred ten to one hundred eleven pounds than 
 from ninety-seven to ninety-eight pounds. Let us assume 
 that it is. Then the EF which came first would show a 
 greater change than the EF which came second, even though 
 both were of exactly equal effectiveness. In sum, objective 
 equality of units does not guarantee experimental equality 
 of units. 
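
The point can be made concrete with a toy model. Everything in the sketch below is an assumption introduced for illustration and is not drawn from the text: each EF supplies the same amount of "effort", and each additional pound costs more effort than the one before it.

```python
def step_cost(weight):
    """Hypothetical effort needed to add one pound at a given weight."""
    return 1 + (weight - 97) * 0.5


def apply_factor(weight, effort):
    """Spend effort one pound at a time; return (new weight, pounds gained)."""
    gained = 0
    while effort >= step_cost(weight):
        effort -= step_cost(weight)
        weight += 1
        gained += 1
    return weight, gained


start = 97
# Two EF's of exactly equal effectiveness (equal effort), applied in turn.
after_first, gain_first = apply_factor(start, 10)
after_second, gain_second = apply_factor(after_first, 10)
print(gain_first, gain_second)
```

Under these assumed costs the first EF shows a gain of five pounds and the second only two, although both supplied the same effort. The scale is uniform in pounds, yet not uniform in the experimental sense described above.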
 
 When the same uniform scale of uniform units measures 
 
 
 the changes produced by all EF’s there is some possibility 
 that the units will be equal experimentally. This possi- 
 bility is practically nil when the scales employed are not 
 uniform. For example, an experimenter may desire to de- 
 termine the effectiveness of two methods of teaching a 
 geography lesson. He might teach a lesson by method A 
 on the question: Why are certain portions of the United 
 States arid? He would construct a measuring instrument 
 on the content of this particular lesson. This instrument 
 could be used for the initial test and final test to measure 
 the change produced by method A. Now if method A had 
 practically taught the content of the above lesson, or even 
 a part of it, method B could not well be used on the same 
 lesson. Method B would have to be employed on another 
 lesson whose topic was, say: Why is more cotton grown 
 in the southern than in the northern part of the United 
 States? This would require a new test on the content of 
 the second lesson. Suppose that method A increased by ten 
 points the score of S, and that method B also increased by
 ten points the score of S. Which is more effective, method 
 A or method B? It is impossible to say, because the ten 
 points in one case are not necessarily equal to the ten points 
 in the other. We cannot even be sure that one point on 
 one test is equal to any other point on the same test. 
 
 When the purpose of an experiment is to determine merely 
 the amount of superiority of one EF over any other EF, the 
 one-group method is valid:
 
 (1) When the amount of change in S under one EF is 
 practically identical with the amount of change under any 
 other EF, except for the difference in effectiveness of the 
 contrasted EF’s. 
 
 (2) Where the change produced in S by an EF is not 
 conditioned significantly by any preceding EF or EF’s. 
 
 (3) Where the change effected by each EF is measured 
 in equal units. 
 
 Since many of the experiments in education are concerned 
 only with the relative effectiveness of two or more EF’s and
 not with a determination of the absolute amount of change
 in S directly attributable to an EF, the more searching 
 fundamental criteria may be simplified as indicated in (1), 
 (2), and (3) immediately above. So far as the above pur- 
 pose is concerned, it makes no difference if pupils are ma- 
 turing or if any other irrelevant factors are operating con- 
 temporaneously with the application of the EF’s, provided 
 they operate alike under each EF. 
 
 There are some situations where inequality of units is 
 certain, and, yet, where the one-group method is practically 
 imperative or has been used by mistake. Stevenson con- 
 ducted an investigation under the auspices of the University 
 of Illinois and the Chicago public schools to determine the 
 relative effectiveness of large classes vs. small classes. Cir- 
 cumstances might have forced the one-group method. If 
 so, one appropriate plan would be to have a teacher teach a
 class of, say, forty-five pupils for the first semester. Initial 
 and final tests would be given. At the beginning of the 
 second semester, thirty of these forty-five pupils would be 
 so selected as to be fairly representative of the whole group. 
 This class of thirty pupils would be taught during the second 
 semester by the same teacher who had taught them during 
 the first semester. Initial and final tests would be given.
 The final tests for the first semester would serve as the
 initial tests for the second semester. C1 and C2 would be
 computed only for the thirty pupils continuing throughout 
 the year. A large number of different classes would be used, 
 but each class would be treated according to the above plan. 
 
 Then, since it is usually more difficult to secure each 
 additional point, the small-class EF would be discriminated 
 against because of inequality of units. Even so, the experi- 
 menter would not have done all his work in vain. There are 
 methods of correcting or approximately correcting for these 
 inequalities. 
 
 One method is to plot the curve of growth for the test in 
 question, using age norms or, lacking age norms, grade norms 
 as the basis of the curve. The curve can be estimated for
 points between the age norms or grade norms. If the norm
 for ten-year-old children is, say, fifty, and for twelve-year- 
 olds is sixty, and for thirteen-year-olds is sixty-five, a growth 
 from fifty to sixty may be considered equal roughly to a 
 growth from sixty to sixty-five. By interpolation, a growth 
 on one portion of the curve may be converted into units of 
 growth on any other portion of the curve, thus making com- 
 parison between EF’s fair. In like manner, the slope of the 
 curve for grade norms may be used to equate units on vari- 
 ous portions of the curve, though the grade-norm curve is 
 subject to a selection error. The fifth-grade norm in June is 
 higher than the fourth-grade norm in June not only because 
 of the year’s growth, but also—and failure to recognize this 
 is the error—because certain of the stupider pupils of a 
 fourth-grade are not allowed to continue with their grade 
 when it becomes a fifth grade. 
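
 The growth-curve correction may be sketched in code. What follows is only a minimal illustration, assuming hypothetical norms at three consecutive ages (not the figures of any published test) and straight-line interpolation between norms. The function name `age_equivalent` is invented for the example. Each raw score is converted into the age at which that score is the norm, so that a gain is expressed in units of normal growth rather than in raw points.

```python
def age_equivalent(score, norms):
    """Return the age at which `score` is the norm, by linear interpolation.

    `norms` is a list of (age, norm_score) pairs sorted by age, with
    strictly increasing norm scores. The norms used here are hypothetical.
    """
    for (age_lo, s_lo), (age_hi, s_hi) in zip(norms, norms[1:]):
        if s_lo <= score <= s_hi:
            return age_lo + (age_hi - age_lo) * (score - s_lo) / (s_hi - s_lo)
    raise ValueError("score lies outside the range covered by the norms")


# Hypothetical age norms: ages 10, 11, 12 with norm scores 50, 60, 65.
norms = [(10, 50), (11, 60), (12, 65)]

# Two gains made on different portions of the curve.
gain_first = age_equivalent(60, norms) - age_equivalent(50, norms)   # raw gain of 10
gain_second = age_equivalent(65, norms) - age_equivalent(60, norms)  # raw gain of 5
print(gain_first, gain_second)
```

 Under these assumed norms a raw gain of ten points low on the curve and a raw gain of five points higher on the curve both amount to one year of normal growth, which is the sense in which the two may be treated as roughly equal.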
 
 For several reasons—because norms are frequently un- 
 available, because of the selection error in grade norms, 
 because the equalization of units by means of growth curves 
 is likely to prove laborious, and because such equalization 
 requires that the same or equivalent tests be used through- 
 out the experiment—another method of equalizing units will 
 be found more serviceable. This is the method of converting
 all units into T’s, in terms of the experimental group
 rather than twelve-year-olds, by the T-scale technique described
 in Chapter V, and illustrated in Table 6 (page 99)
 and Table 36 (page 204). 
 
 If the same or equivalent forms of a test are used throughout
 the entire experiment, it is suggested that the T12 column
 of Table 8, p. 102, become the T scores according to
 the very first initial test of the experiment, and that T16 become
 the T scores according to the last of the final tests of
 the experiment, and that these two columns of T scores be 
 combined according to the procedure illustrated in Table 8. 
 If the T scores were based upon initial test alone, some of the 
 highest scores in the final test could not be scaled. If the 
 T scores were based upon final test alone, some of the lowest
 scores of the initial test could not be scaled. By basing the
 T scores upon both initial and final tests, all scores for all 
 pupils on a particular test can be converted into equivalent 
 T scores by the use of what will correspond to the first and 
 last columns of Table 6, p. 99. 
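
 The advantage of scaling on initial and final tests together can be illustrated with a simplified sketch. It assumes a normalized scale with mean 50 and standard deviation 10, obtained from the percentage of scores falling below each raw score plus half of those reaching it. The exact tabular procedure of Chapter V is not reproduced here, and the function name `t_scale` and all scores are hypothetical.

```python
from collections import Counter
from statistics import NormalDist


def t_scale(scores):
    """Map each distinct raw score to a normalized T value (mean 50, SD 10).

    Simplified assumption: the percentile of a raw score is the proportion
    of scores below it plus half the proportion equal to it.
    """
    n = len(scores)
    counts = Counter(scores)
    table, below = {}, 0
    for raw in sorted(counts):
        p = (below + counts[raw] / 2) / n      # always strictly between 0 and 1
        table[raw] = 50 + 10 * NormalDist().inv_cdf(p)
        below += counts[raw]
    return table


# Hypothetical raw scores for the same six pupils.
initial = [3, 4, 5, 5, 6, 7]
final = [5, 6, 7, 8, 9, 9]

pooled = t_scale(initial + final)   # scale built on both tests together
gains = [pooled[f] - pooled[i] for i, f in zip(initial, final)]
print([round(g, 1) for g in gains])
```

 A scale built on the initial test alone would contain no entry for the final scores of 8 and 9, whereas the pooled scale covers every score that occurs on either test.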
 
 If the initial and final tests for EF1 are neither duplicate 
 nor equivalent forms of the initial and final tests used for 
 EF2, i.e., if the EF1 tests measure information about the
 geography of New York, whereas the EF2 tests measure 
 information about the geography of Pennsylvania, the T 
 scores for EF1 should be based only upon the initial and 
 final tests for EF1, and the T scores for EF2 should be 
 based only upon the initial and final tests for EF2. This 
 means that Table 6 must be worked twice for each test 
 before all scores in a two-EF experiment can be converted 
 into T scores. The general procedure is the same irrespec- 
 tive of the number of EF’s. 
 
 Fortunately, Stevenson selected a better experimental 
 method. He chose the rotation method instead of the one- 
 group method. He had one teacher teach a class of, say, 
 forty-five pupils and another teacher teach an approximately 
 equivalent class of thirty pupils in the same grade. Both 
 the large and the small classes were taught during the first 
 semester. At the end of the first semester, fifteen pupils 
 were taken from the class of forty-five pupils, thus leaving 
 it a class of thirty pupils during the second semester, and 
 given to the class of thirty pupils, thus making the latter a 
 class of forty-five pupils during the second semester. In this 
 way, both the large-class EF and the small-class EF came 
 under identical courses of study, identical portions of the 
 test, identical portions of the growth curve, and so on. 
 
 The probability of satisfying the fundamental criteria for 
 selecting the one-group method is increased: 
 
 (1) Where the EF or EF’s produce a relatively drastic
 effect, for this tends to make the influence of irrelevant factors
 practically negligible.

 (2) Where the experiment is of brief duration, for this
 abbreviates the action of large, constant, cumulative, irrelevant
 factors, such as maturing, for example.

 (3) Where the trait in question does not involve purposes
 or methods of work, for these usually show a larger
 carry-over than specific information.

 (4) Where the tests are scaled on the basis of the same
 unit, for this increases the probability of equality of units.
 
 B. Equivalent-groups Method.—When the purpose of 
 an experiment is to determine the amount of change due 
 directly to an EF or EF’s, the equivalent-groups method is 
 valid: 
 
 (1) Where the total net change in the trait or traits in 
 question produced by irrelevant factors is negligible, or 
 where the amount of such change is measured and discounted 
 by the use of a control EF. 
 
 (2) Where it is really possible to equate groups. 
 
 One peculiar virtue of the equivalent-groups method is 
 that in its use the danger of any carry-over from one EF 
 to another is avoided, by applying each EF to a different S 
 so that no EF follows another with the same group. Of 
 course the equivalent-groups method, like all others, is sub- 
 ject to a possible carry-over from the preexperimental fac- 
 tor. But this does not so much invalidate an experiment as 
 limit the conclusions from the experiment to the particular 
 sort of S employed. 
 
 Another superiority of the equivalent-groups method over 
 the one-group is that the units of measurements used for 
 one EF have a greater probability of being equal to those 
 used for another EF. The equivalent-groups method avoids 
 the doubtful assumption that it is equally easy to produce 
 equal amounts of change at various points of the growth 
 curve of S, for two S’s can be chosen at like positions on the 
 growth curve. Furthermore, it is not necessary to measure 
 the changes produced by the various EF’s by means of dif- 
 ferent incomparable tests based upon different subject mat- 
 ter. Thus it would not be necessary to teach one sort of
 geography lesson according to method A and another sort
 according to method B. The identical lesson could be taught 
 by method A and method B and the identical test could be 
 used to measure the changes produced by each method. 
 We shall see, however, when we come to consider the ques- 
 tion of scaling tests, that the use of identical tests does not 
 guarantee perfect equality of units. But it certainly does 
 tend to increase comparability. 
 
 The one-group method did not prove entirely valid for the 
 illustrative problem of school luncheons vs. physical instruc- 
 tor. How about the equivalent-groups method? Here, as 
 in the case of the one-group method, the total net change 
 produced by irrelevant factors would not be negligible due 
 to the natural maturing of the pupils. But this difficulty 
 could be overcome by employing a control S, to whom the 
 control EF could be applied. Thus one S would be treated 
 as usual (CEF). Another equivalent group would have
 school luncheons (EF1). Still another equivalent group
 would have a physical instructor (EF2). By subtracting
 CC from C1 and C2 the amount of change produced
 by EF1 and EF2 could be accurately determined.
 Hence the equivalent-groups method is applicable to this 
 experimental problem. The method is equally applicable to 
 the praising vs. scolding, or the additive vs. subtractive 
 problems. 
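
 The subtraction of the control change may be sketched as follows. The figures are hypothetical gains for three small equivalent groups, and the helper names `mean` and `net_change` are introduced only for this illustration. The mean change of each group is taken as its C.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def net_change(ef_changes, control_changes):
    """Change attributable to an EF: its mean change less the control change."""
    return mean(ef_changes) - mean(control_changes)


# Hypothetical changes (final test minus initial test) for each pupil.
cc = [2, 3, 1, 2]   # control group, treated as usual (CEF)
c1 = [5, 6, 4, 5]   # group given school luncheons (EF1)
c2 = [4, 3, 4, 5]   # group given a physical instructor (EF2)

ef1 = net_change(c1, cc)
ef2 = net_change(c2, cc)
print(ef1, ef2)
```

 Whatever maturing contributes is present in all three changes alike, so it drops out of each difference, provided the groups are really equivalent.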
 
 When the purpose of an experiment is to determine merely
 the amount of superiority of one EF over any other EF the 
 equivalent-groups method is valid: 
 
 (1) Where the amount of change in S under one EF is 
 practically identical with the amount of change under any 
 other EF, except for the difference in effectiveness of the 
 contrasted EF’s. 
 
 (2) Where it is really possible to equate groups. 
 
 As is the case with the one-group method, the criteria 
 are less stringent when only the relative difference between 
 EF’s is desired. Changes produced by large irrelevant
 factors, like maturing, cause no trouble provided the irrelevant
 factor operates equally under each EF.
 
 In the case of one-group experiments, equal operation of 
 irrelevant factors under each EF is often difficult to secure, 
 particularly when the experiment extends over a consider- 
 able time interval. But equal operation of irrelevant factors 
 is easy to secure when the groups are different groups and 
 equivalent. Hence the above criteria practically reduce to 
 the second one for most situations. 
 
 C. Rotation Method.—When the purpose of an experiment
 is to determine the amount of change due directly to
 an EF or EF’s, the rotation method is valid: 
 
 (1) Where the total net change in the trait or traits in
 question produced by irrelevant factors is negligible, or 
 where the amount of such change is measured and discounted 
 by the application of a control EF. 
 
 (2) Where the change produced in S by an EF is not 
 conditioned significantly by any preceding EF. 
 
 In case the net total effect from irrelevant factors is not 
 negligible, this effect can be measured by a preliminary appli- 
 cation of a control EF to each group employed in the rotation 
 experiment. The amount of change produced by the irrele- 
 vant factors would be combined in the same way, in the 
 same order, and for the same intervals as has been described 
 for the regular EF’s, and the sum would be subtracted from 
 the sum of the corresponding C’s for the regular EF’s. The 
 computation for the control EF is like computing the
 shadow of the rotation experiment for the regular EF’s, for
 there would be a control C1 to be added to a control C4, and
 a control C2 to be added to a control C3. The computation 
 for the control EF’s would be more elaborate if there were 
 more than two regular EF’s, but here, too, the process would 
 duplicate that already given for three or more regular EF’s. 
 The formula for both CEF’s and regular EF’s may be
 written as below, though it is probable that either the CC2
 or CC4 would be assumed to be equivalent to CC1 or CC3
 respectively, or else the two CEF’s which are applied to each
 S would be applied in immediate succession.
 
 Rotation—CEF’s and Two EF’s—One Test Type
 S1—(IT—CEF1—FT—CC1)—(IT—EF1—FT—C1)—(IT—CEF2—FT—CC2)—(IT—EF2—FT—C2)
 S2—(IT—CEF2—FT—CC3)—(IT—EF2—FT—C3)—(IT—CEF1—FT—CC4)—(IT—EF1—FT—C4)

 EF1 = (C1 + C4) - (CC1 + CC4)
 EF2 = (C2 + C3) - (CC2 + CC3)
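
 The two formulas can be worked through with hypothetical figures. In the sketch below the dictionaries `c` and `cc` hold the regular and control changes, keyed by their subscripts. The numbers and the function name `rotation_effects` are assumptions made for the illustration.

```python
def rotation_effects(c, cc):
    """Apply the two-EF rotation formulas with control changes discounted.

    c[k]  is the change under a regular EF (C1 .. C4),
    cc[k] is the change under the matching control EF (CC1 .. CC4).
    """
    ef1 = (c[1] + c[4]) - (cc[1] + cc[4])
    ef2 = (c[2] + c[3]) - (cc[2] + cc[3])
    return ef1, ef2


# Hypothetical changes for the two rotated groups.
c = {1: 9, 2: 7, 3: 8, 4: 10}
cc = {1: 3, 2: 3, 3: 2, 4: 2}

ef1, ef2 = rotation_effects(c, cc)
print(ef1, ef2)
```

 Each EF draws one regular change and one control change from each group, so a group that matures faster adds equally to what is summed and to what is subtracted.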
 
 Even though the rotation method is a combination of one- 
 group methods, the criterion concerning equality of units of 
 measurements has not been restated in connection with the 
 rotation method. This omission is due to the fact that the 
 rotation method brings each EF under each lesson and test, 
 if different lessons with different content are used, and brings 
 each EF under each portion of the growth curve, if the same 
 test is used and the experiment continues over a long period 
 of time. In sum, the rotation tends to rotate out lesson 
 differences, test differences, or position-on-growth-curve 
 differences, thus tending to equalize the units of measure- 
 ments. 
 
 In Weber’s rotation experiment to test the effectiveness 
 of a lesson taught by a teacher followed by a brief review 
 vs. a film or motion picture followed by a lesson vs. a lesson 
 followed by a film, a different content with an appropriate 
 test for each content had to be used for the different EF’s. 
 One lesson had to do with India, another with China, and 
 a third with Japan. The appropriate formula for such an 
 experiment follows. In the formula, ITi means the initial 
 test on India, LR means the lesson-review EF, ITc means 
 initial test on China, FL means the film-lesson EF, ITj
 means initial test on Japan, and LF means lesson-film.

 S1—(ITi—LR—FTi—C1)—(ITc—FL—FTc—C2)—(ITj—LF—FTj—C3)
 S2—(ITi—FL—FTi—C4)—(ITc—LF—FTc—C5)—(ITj—LR—FTj—C6)
 S3—(ITi—LF—FTi—C7)—(ITc—LR—FTc—C8)—(ITj—FL—FTj—C9)

 LR = C1 + C6 + C8
 FL = C2 + C4 + C9
 LF = C3 + C5 + C7
 
 If S1 is a superior group of children, the foregoing plan
 rotates out the superiority, for every EF gets the benefit
 of the group’s superiority, and similarly for other group
 differences. If S2 is taught by a superior teacher, the effect 
 of her superiority is rotated out, for every EF profits equally 
 from her skill, and similarly for other teacher differences. 
 If the lesson or test on India is especially difficult, this dif- 
 ficulty is rotated out, for the lesson and test on India is 
 employed with every factor, and similarly for other lesson 
 or test differences. If the LR or lesson-review EF is more 
 effective than the other two EF’s, this superiority is not 
 rotated out, and should not be rotated out, for the purpose 
 of the plan is to give any such superiority a chance to mani- 
 fest itself, unmasked by irrelevant factors of teacher, group, 
 lesson, or test differences. 
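
 The rotating out of group and lesson differences can be checked numerically. The sketch below assumes a purely additive model, in which each change is the sum of a group effect, a lesson effect, and an EF effect. All the figures are hypothetical, and the additive assumption is made for illustration only. The design is the three-group rotation described above.

```python
# Order in which each group receives the EF's; lessons run India, China, Japan.
design = {
    "S1": ["LR", "FL", "LF"],
    "S2": ["FL", "LF", "LR"],
    "S3": ["LF", "LR", "FL"],
}
lessons = ["India", "China", "Japan"]

# Hypothetical additive effects on the change C.
group_effect = {"S1": 5, "S2": 0, "S3": -2}              # S1 is a superior group
lesson_effect = {"India": -3, "China": 1, "Japan": 0}   # India is the hardest
ef_effect = {"LR": 6, "FL": 4, "LF": 4}                  # LR is the best method

totals = {ef: 0 for ef in ef_effect}
for group, order in design.items():
    for lesson, ef in zip(lessons, order):
        totals[ef] += group_effect[group] + lesson_effect[lesson] + ef_effect[ef]

print(totals)
```

 Every EF total contains each group effect once and each lesson effect once, so the totals differ only by three times the difference in EF effects. The superiority of LR survives, while the superiority of S1 and the difficulty of the India lesson do not.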
 
 The above plan will rotate out any likely irrelevant factor, 
 except (1) uncontrolled bias on the part of the teacher or 
 experimenter for a particular EF; (2) bias on the part of 
 the test for a particular EF; (3) deliberate malingering on 
 the part of the pupils, unless this is uniform throughout the 
 experiment; (4) a carry-over from one EF to another; (5)
 any tendency for one group to learn how to improve more 
 rapidly with the progress of the experiment than any other 
 group; or (6) any tendency for one group to become more 
 fatigued or bored with the progress of the experiment than 
 any other group. 
 
 The last three irrelevant factors are of special interest. 
 If the lesson-review EF were to carry over and benefit the 
 film-lesson EF, C2 would not be an exact measure of the 
 influence of film-lesson. Instead, C2 would be a measure 
 of the effect of film-lesson plus an effect borrowed from 
 lesson-review. In an experiment of this sort, where the 
 entire content of the lessons is changed each time, such 
 carry-over in significant amount is highly improbable. 
 
 If, for some reason, S1 were to learn, as the experiment
 progressed, how better to retain the content so as to make
 a higher score on the FT, the second EF would profit more
 than the first, and the third EF would profit more than the
 second. This would be rotated out provided and only provided
 S2 and S3 each learned the same thing in like amount.
 Again, if S1 were to become fatigued or bored as the experiment
 progressed, relatively more than S2 and S3, this would
 penalize LF most, FL next, and LR least. Such unique 
 fluctuations are not likely to occur in significant amounts 
 unless there are large differences in intelligence, or the like, 
 between the three groups. 
 
 When the purpose of an experiment is merely to deter- 
 mine the amount of superiority of one EF over any other 
 EF, the rotation method is valid: 
 
 (1) Where the amount of change in S under one EF is 
 practically identical with the amount of change under any 
 other EF, except for the difference in effectiveness of the 
 contrasted EF’s. 
 
 (2) Where there is no carry-over from one EF to another,
 or where, in case it occurs, the carry-over is mutual,
 i.e., each EF gains equally from such carry-over.
 
 If, in the case of one S, EF1 preceding EF2 aids EF2 to 
 the extent of, say, two score points, and if EF2, in the case 
 of the other S, aids EF1 to the extent of two score points, 
 the increased change for each EF will be equal, thereby 
 validating the rotation experiment for the purpose of deter- 
 mining relative effectiveness of the EF’s. 
 
 An illustration will make it clear that a mutual carry-over 
 will not disturb a relative rotation experiment. Lacy¹ conducted
 a rotation experiment to evaluate the relative effectiveness
 of telling a story orally to a pupil (Told), having a
 pupil read the story (Read), or having him see it in motion 
 pictures (Movie). Assume that each EF is equally effective, 
 and that each C would be 4 were it not for carry-over. As- 
 sume, further, that each EF carries over to the immediately 
 succeeding EF to the extent of half its own C, and to the 
 next EF to the extent of one-fourth its own C. The follow- 
 ing diagram shows that all EF’s come out equal, according 
 to assumption, regardless of a complicated carry-over. 
 
 ¹ Lacy, John V., “The Relative Value of Motion Pictures as an Educational
 Agency,” Teachers College Record, November, 1919.
 
      4        4+2      4+3+1
    Told      Read      Movie

      4        4+2      4+3+1
    Read      Movie     Told

      4        4+2      4+3+1
    Movie     Told      Read

 Told  = (4) + (4 + 3 + 1) + (4 + 2) = 18
 Read  = (4 + 2) + (4) + (4 + 3 + 1) = 18
 Movie = (4 + 3 + 1) + (4 + 2) + (4) = 18
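
 The same arithmetic may be reproduced in a few lines. The sketch assumes, as the illustration does, that every EF would by itself produce a change of 4, that half of each change carries over to the immediately succeeding EF, and that one-fourth carries over to the next. The function name `changes_with_carry_over` is invented for the example.

```python
BASE = 4  # change each EF would produce were there no carry-over


def changes_with_carry_over(n_periods, base=BASE):
    """Change observed in each successive period under the assumed carry-over."""
    changes = []
    for k in range(n_periods):
        c = base
        if k >= 1:
            c += changes[k - 1] / 2   # half of the immediately preceding C
        if k >= 2:
            c += changes[k - 2] / 4   # one-fourth of the C before that
        changes.append(c)
    return changes


design = [
    ["Told", "Read", "Movie"],
    ["Read", "Movie", "Told"],
    ["Movie", "Told", "Read"],
]

# Since the EF's are assumed equally effective, C depends only on position.
per_period = changes_with_carry_over(3)

totals = {}
for order in design:
    for ef, c in zip(order, per_period):
        totals[ef] = totals.get(ef, 0) + c

print(per_period, totals)
```

 Each EF stands once in the first, once in the second, and once in the third position, and therefore collects the same total carry-over.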
 
 If an experimenter desires to be exceedingly careful to 
 equalize the amount of carry-over, he can improve upon 
 any formula thus far given by using six groups for three 
 EF’s as shown below. 
 
 S1 — Told — Read — Movie 
 S2 — Read — Movie — Told 
 S3 — Movie — Told — Read 
 
 
 S4 — Read — Told — Movie 
 S5 — Told — Movie — Read 
 S6 — Movie — Read — Told
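
 Why six groups improve upon three can be seen by counting how often each EF immediately follows each other EF. The sketch below is an illustration only. The helper name `immediate_successions` is hypothetical, and the six orders are simply all the arrangements of three EF's, which agree with the six rows given above.

```python
from collections import Counter
from itertools import permutations

efs = ["Told", "Read", "Movie"]


def immediate_successions(orders):
    """Count how many times each EF is immediately followed by each other EF."""
    pairs = Counter()
    for order in orders:
        for first, second in zip(order, order[1:]):
            pairs[(first, second)] += 1
    return pairs


three_groups = [
    ("Told", "Read", "Movie"),
    ("Read", "Movie", "Told"),
    ("Movie", "Told", "Read"),
]
six_groups = list(permutations(efs))   # every possible order of three EF's

print(immediate_successions(three_groups))
print(immediate_successions(six_groups))
```

 With three groups Told is followed by Read twice, but Read is never followed by Told. With six groups every EF is followed by every other EF equally often, so a carry-over peculiar to one pair of EF's is shared alike.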
 
 On the whole, the one-group experimental method is the 
 most convenient and, for this reason, should be preferred 
 when some significant irrelevant factors will not invalidate 
 the experiment; but the one-group method is peculiarly sub- 
 ject to constant errors from these sources. The equivalent- 
 groups method is peculiarly free from the influence of dis- 
 turbing irrelevant factors. The only difficulty encountered 
 here is in selecting two or more S’s which are genuinely 
 equivalent. When the number of pupils composing each S
 is small, it becomes extremely difficult to prove that exact 
 equivalence was secured. Due to the practical difficulty at 
 times of establishing this equivalence, the rotation method 
 is frequently used. The rotation method is, of course, just 
 a combination of two or more one-group experiments, but 
 the way in which the one-group methods are combined 
 automatically tends to eliminate some of the objections to 
 the one-group method. Reversing the order of application
 of the EF’s permits each EF to get the advantage or disadvantage
 of a carry-over from the other, and increases comparability
 by having each test used under each EF and by
 having each EF operate on S at approximately similar por- 
 tions of the growth curve. The rotation method is also of 
 value in eliminating special irrelevant factors, such as teach- 
 ing skill of teacher, and difference in ability of groups. 
 
CHAPTER III
 SELECTION OF EXPERIMENTAL SUBJECTS 
 
 Appropriateness of Subjects to Experiment Factors. 
 —The first consideration in selecting experimental subjects 
 requires that these subjects be appropriate to the EF’s. A 
 principal in a nearby school is interested in determining the 
 effect of employing the project method with a particular 
 class in his school which has been taught by an extremely 
 conservative teacher. Here the EF calls for a particular 
 class or, at least, for pupils whose habits have been formed 
 under a very conservative teaching method. Coy has con- 
 ducted an elaborate experiment with children of high in- 
 telligence. The problem especially called for gifted pupils. 
 Others would have been inappropriate. Ogglesby designed 
 a primer for pupils of subnormal intelligence. She desired 
 to test its relative effectiveness. It was necessary to select 
 pupils appropriate to the EF. Hanson has experimented 
 with the effect upon progress in penmanship of excusing 
 pupils from drill when they attain a handwriting quality of 
 12 on the Thorndike Handwriting Scale, as compared with 
 continuance of drill. Pupils whose handwriting is already 
 above quality 12 would be inappropriate, as would pupils 
 so far below quality 12 that this goal would cause little or 
 no motivation. Thus, appropriateness is an essential con- 
 sideration, and what constitutes appropriateness varies with 
 the nature of the problem. 
 
 The determination of appropriateness frequently requires 
 objective measurement. Thus Coy used intelligence tests to 
 pick children of high intelligence. Ogglesby selected her 
 subjects on the basis of intelligence scores determined by
 Metzner. Gray, Gates, and others have experimented with
 pupils who were unable to make satisfactory progress in 
 reading. They employed reading tests to select their ex- 
 perimental subjects. 
 
 Appropriateness of Subjects to Tests.—As a rule, sub- 
 jects should not be subordinated to the tests, but rather tests 
 should be found or constructed which will be appropriate to 
 the subjects. But it sometimes happens that the nature of 
 the problem is such as to permit the experimenter consider- 
 able latitude in the choice of subjects, while at the same 
 time it is not feasible to construct new tests. A few days 
 ago the writer advised an experimenter who was planning 
 his doctor’s dissertation to select no experimental subjects 
 below the third grade. This advice was given because ade- 
 quate tests of the type called for by his problem were not 
 available for pupils in grades below the third. Adequate 
 tests were available for pupils in grades above the second. 
 He could have constructed tests for young children, but 
 this would have left no time for experimenting with the 
 problem in which he was interested. 
 
 Representativeness of Subjects—Selection by Chance. 
 —Sometimes it is possible to employ for the S the total 
 group which has proved appropriate for the EF. Thus 
 the experimenter, who desires to determine the effect of the 
 project method upon a particular fourth grade previously 
 taught by an unusually conservative method, could include 
 the total group in the experiment. Sometimes, as for ex- 
 ample in a very large elementary school, it is not feasible 
 to try the EF’s on all the fourth-grade children in question. 
 Only a selected number can be used. If the conclusion is
 to be generalized for all the pupils, it is necessary that the 
 S be so selected as to be representative of the total group. 
 
 Representativeness can be secured by making a chance 
 selection from the total group, or a chance selection from 
 a chance portion of the total group. One method of making 
 a chance selection is to write upon a slip of paper the name 
 of each pupil in the total group, to place these names in a
 receptacle, to mix them thoroughly, and to draw from the
 receptacle as many slips of paper as there are pupils called 
 for in the experimental plan. This was the general pro- 
 cedure followed by the War Department in selecting men 
 for conscription during the World War. 
 
 Another method of making a chance selection is to write 
 the names of the pupils in alphabetical order. If half the 
 total number of pupils are to be used, alternate pupils can 
 be selected. If one-third the total group are to be used,
 every third pupil can be selected, and similarly for the
 proportions of 25, 75, 90, or other per cents. 
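
 Both procedures are easily sketched. In the example below the list of pupils is hypothetical, the function names `draw_by_lot` and `every_kth` are invented for the illustration, and a fixed seed is used only so that the drawing can be repeated.

```python
import random


def draw_by_lot(names, how_many, seed=None):
    """Draw `how_many` names at random, as slips are drawn from a receptacle."""
    rng = random.Random(seed)
    return rng.sample(names, how_many)


def every_kth(names, k):
    """Take every k-th name from the alphabetical list, beginning with the first."""
    return sorted(names)[::k]


# A hypothetical total group of thirty pupils.
pupils = [f"Pupil {i:02d}" for i in range(1, 31)]

lot = draw_by_lot(pupils, 10, seed=1923)   # ten pupils drawn by lot
half = every_kth(pupils, 2)                # alternate pupils
third = every_kth(pupils, 3)               # one pupil in three

print(lot)
print(len(half), len(third))
```

 The alphabetical method is a chance selection only so far as position in the alphabet is unrelated to the trait under experiment.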
 
 The above methods of selection assume that it is feasible 
 to withdraw the selected pupils from their classes and as- 
 semble them in a new class or classes for experimental pur- 
 poses. This is not, however, always practicable. Fre- 
 quently the experimenter is faced with the necessity of 
 making a chance selection of classes rather than or in 
 addition to a chance selection of pupils. 
 
 Representativeness of Subjects—Selection by Measurement.—If
 1000 pennies be tossed there will be only a
 slight difference between the number of times that heads as 
 contrasted with tails appear. If twenty pennies are tossed 
 there may be a relatively large difference in the number of 
 heads and tails. This illustrates the fact that chance is a
 highly exact method of selecting representative pupils when 
 the number of pupils used as subjects is large, whereas its 
 accuracy decreases as the number of pupils decreases. 
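
 The penny illustration can be put in figures by means of the standard deviation of the proportion of heads, which for fair pennies is the square root of one-fourth divided by the number tossed. The sketch below merely evaluates this standard formula. The function name `sd_of_proportion` is introduced for the example.

```python
import math


def sd_of_proportion(n, p=0.5):
    """Standard deviation of the proportion of heads in n independent tosses."""
    return math.sqrt(p * (1 - p) / n)


for n in (20, 1000):
    print(n, round(sd_of_proportion(n), 3))
```

 With twenty pennies the proportion of heads typically strays from one-half by about eleven hundredths, and with a thousand pennies by less than two hundredths, so the accuracy of chance selection improves with the square root of the number of subjects.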
 
 When the number of pupils or groups is small it is safer 
 to make the selection on the basis of measurement of some 
 sort. Just what sort of measurement will be best depends 
 upon the nature of the experimental problem to be under- 
 taken and the purposes of the experimenter. If the experi- 
 ment has to do with physical efficiency, the tests used may 
 well be tests of physical condition, in order that pupils with 
 all types of physique may be selected. If the experimental 
 trait is reading, selection on the basis of a test of reading 
 ability will usually prove satisfactory. If the experiment
 has to do with general educational or mental development, an
 intelligence test or a combination of several educational tests 
 may be employed. 
 
 Once the measurements are made, the pupils or groups, as 
 the case may be, should be arranged in order according to 
 the size of their scores. If, say, 10 per cent of the pupils 
 or groups are to be selected, every tenth pupil or group 
 should be selected. If 25 per cent of the pupils or groups 
 are to be used, every fourth pupil should be selected. Thus 
 in the latter instance the best, fifth best, ninth best, and 
 so on, should be selected. 
 
 Representativeness can be slightly but only slightly in- 
 creased by employing a modified method of selecting the 
 experimental pupils. Selecting pupils who stand first, third, 
 fifth, and so on, when half the total group is to be used 
 will cause the experimental pupils to average slightly higher 
 than the total group, as will the selection of pupils who stand 
 first, fifth, ninth and so on when 25 per cent of the total 
 group are to be used. This modified method is described 
 farther along, in connection with the technique of equating 
 groups. 
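
 The slight upward bias of selecting the first, fifth, ninth, and so on can be shown with a small example. The twenty scores are hypothetical and evenly spaced, which exaggerates nothing but makes the arithmetic easy to follow. The function name `select_every_kth_by_rank` is invented for the illustration, and the modified method itself is not sketched here.

```python
def select_every_kth_by_rank(scores, k):
    """Rank scores from best to worst and keep the 1st, (k+1)th, (2k+1)th, ..."""
    ranked = sorted(scores, reverse=True)
    return ranked[::k]


# Hypothetical scores of a total group of twenty pupils.
scores = list(range(1, 21))

selected = select_every_kth_by_rank(scores, 4)   # 25 per cent of the group
mean_all = sum(scores) / len(scores)
mean_selected = sum(selected) / len(selected)

print(selected, mean_all, mean_selected)
```

 Since the pupil chosen from each successive set of four is always the best of the four, the selected pupils average somewhat above the total group.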
 
 Appropriateness of Subjects to Experimental 
 Method.—The question of the appropriateness of subjects 
 to the experimental method is most frequently raised in 
 connection with the equivalent-groups method, or the rota- 
 tion method when equivalent groups are to be used. When 
 any experimental method has been decided upon, subjects 
 must be selected who are first, appropriate to EF’s and tests, 
 and second, representative. When the equivalent-groups 
 method has been decided upon, there is the additional re- 
 quirement that subjects be selected and placed in different 
 groups in such a way that the resulting groups will really 
 be equivalent. 
 
 Equivalence of groups does not require that all the sub- 
 jects participating in the experiment be equivalent, but it 
 does mean that all the groups participating be equivalent. 
 To be equivalent the various groups must have like means
 and like variability among the subjects constituting each
 group. To have like means and like variability implies in 
 turn that for every subject in one group there should be an 
 equivalent subject in every other group. While this last 
 will guarantee like means and variability, it is not absolutely 
 required that there be an equal number of subjects in each 
 group. The essential is that the groups be equivalent as to 
 means and variability. 
 
 But equivalent in what? In intelligence? Not neces- 
 sarily. In education? Not necessarily. In the experi- 
 mental trait? Not necessarily. The groups must be equal 
 in their possibilities for growth in the trait in question. 
 They should be so equal in the growth potential or possi- 
 bilities that they will show an equal mean change and an 
 equal variability among the changes of the individual sub- 
 jects in each group, provided all groups are placed under 
 an identical EF for an identical length of time. Various 
 methods have been proposed for securing such an equiva- 
 lence. These will be described next. 
 
 Groups Equated by Chance.—Just as representative- 
 ness can be secured by the method of chance, when the 
 subjects involved are sufficiently numerous, so equivalence 
 may be secured by chance, provided the number of sub- 
 jects to be used is sufficiently numerous. One method of 
 equating by chance is to mix the names of the subjects to 
 be used. Half may be drawn at random. This half will 
 constitute one group while the other half will constitute the 
 other group. If three groups are required, the first third 
 of the drawings will constitute one group, the second third 
 of the drawings another group, and the remaining third 
 still another group. 
 
 Or again, the names may be written in alphabetical order. 
 The even-numbered names will constitute one group and 
 the odd-numbered names the other group, and similarly for 
 a larger number of groups. If classes are being paired off 
 instead of pupils, the same general procedure of drawing, or 
 of alternating will apply. 
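
 The two sample procedures above can be sketched in a few lines of
 Python. This is only an illustration: the function names, the
 `seed` parameter, and the rule for handling a number of names that
 does not divide evenly are assumptions, not part of the text.

```python
import random


def assign_by_chance(names, n_groups=2, seed=None):
    """Mix the names, then let consecutive runs of the drawing form the groups."""
    rng = random.Random(seed)      # seed is only there to make a drawing repeatable
    drawn = list(names)
    rng.shuffle(drawn)             # the "mixing of names"
    size, extra = divmod(len(drawn), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        end = start + size + (1 if g < extra else 0)
        groups.append(drawn[start:end])
        start = end
    return groups


def assign_by_alternation(names, n_groups=2):
    """Write the names alphabetically and deal them to the groups in rotation."""
    ordered = sorted(names)
    return [ordered[g::n_groups] for g in range(n_groups)]
```

 Either function works equally well when the "names" are classes
 rather than pupils.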
 
 The above are merely sample procedures. Any device 
 which will make the selection truly random is satisfactory. 
 Extreme caution should be exercised to avoid any constant 
 tendency for one group to turn out superior to another. 
 When the War Department made the famous drawing to 
 determine the order in which individuals would be con- 
 scripted for military service, numbers were written on 
 paper and enclosed in capsules. Due to the fact that every 
 additional figure in a number added to the weight of the 
 capsule because of the additional ink deposit, there was a 
 constant tendency for the larger-numbered capsules to sift 
 to the bottom where they would be drawn last. If the size 
 of the paper increased with the length of the number this 
 still further prevented a perfectly random drawing. These 
 criticisms are made merely by way of illustration. Any ex- 
 perimenter may count himself lucky if he is able to select 
 subjects by the method of chance with no constant error 
 larger than that caused in this national drawing by a few 
 specks of ink. 
 
 Groups Equated by General Ability.—Measurement, 
 if adequate and accurate, is the best basis for selecting sub- 
 jects irrespective of their number. Chance selection is 
 merely an economical substitute for measurement, and is 
 practicable only where the number of experimental subjects 
 is sufficiently large. The trouble with measurement is that 
 we know so little about just what sort of measurement will 
 yield, as a basis of selection in a particular experimental 
 situation, groups equivalent in their possibilities for prog- 
 ress. Nothing in the general technology of experimentation 
 so much needs to be investigated as this. 
 
 One widespread present practice is to attempt to secure 
 equivalence by equating groups on the basis of general 
 ability. If the experiment is concerned primarily with the 
 physical effects of certain EF’s, the groups are equated on 
 the basis of general physical ability determined by general 
 physical measurements. If the experiment is concerned with 
 the mental effects of the EF’s, groups are equated on the
 basis of general mental ability measured by some intelligence
 test or a series of educational tests.
 
 Thus, if an experimenter were to equate on the basis of 
 an intelligence test, he would select and apply to the pupils, 
 who are otherwise known to be appropriate, some intelligence
 test. If the children are primary pupils, he may
 select and apply to the pupils one or more tests from among 
 such intelligence tests for primary pupils as those by Pres- 
 sey, Franzen, Otis, Haggerty, Dearborn, Trabue, Engel 
 (Detroit), Myers, and others. Or if he can afford the time 
 for testing he may select and apply to the pupils such indi- 
 vidual intelligence tests as those by Goddard, Terman, 
 Herring, Kuhlmann, Yerkes and Bridges, Witmer, and 
 others. If the children are elementary pupils, he may select 
 and apply one or more such group intelligence tests as those 
 by National Research Council, Haggerty, Otis, Dearborn, 
 Pressey, Trabue, Myers, Buckingham and Monroe, and 
 others, or such individual intelligence tests as those by 
 Goddard, Terman, Herring, Kuhlmann, Witmer, Yerkes and 
 Bridges. If the children are in high school he may select 
 and apply such group intelligence tests as those by Otis, 
 Terman, Dearborn, Trabue, Thurstone, and others. Indi- 
 vidual intelligence tests for high school students are not 
 very satisfactory. Group intelligence tests for college stu- 
 dents have been prepared by Thorndike, Thurstone and 
 others. If elementary pupils are foreign, or have a special 
 language handicap, such a group intelligence test as that by 
 Pintner or Liu or such an individual intelligence test as that 
 by Pintner and Paterson, may be used. Thorndike has
 constructed group non-verbal intelligence tests for adults. 
 
 In selecting a series of educational tests to apply to pupils, 
 the experimenter has a large range of choice from such 
 reading tests as those by Thorndike-McCall, Monroe, Ayres- 
 Burgess, Courtis, Gray, and others; from such arithmetic 
 tests as those by Woody, Woody-McCall, Stone, Courtis, 
 Buckingham, Monroe, and others; from such spelling tests 
 as those by Ayres, Ayres-Buckingham, Ashbaugh, Starch,
 Morrison-McCall, Monroe, and others; from such composition
 scales as those by Trabue, Thorndike, Hudelson, Willing,
 Lewis, and others; from such handwriting scales as
 those by Ayres, Thorndike, Starch, Lister, and others; from 
 such English form tests as those by Charters, Briggs, Starch, 
 and others; from such geography scales as those by Courtis, 
 Hahn-Lackey, and others; from such history tests as those 
 by Harlan, Barr, Van Wagenen, Sackett, and others; and 
 so on for other subjects of the elementary and high schools. 
 Or instead, the examiner may use certain test booklets which 
 are combinations in a single booklet of a variety of educa- 
 tional tests or educational and intelligence tests. These 
 omnibus tests frequently yield a single score on the entire 
 booklet, thus avoiding the difficulty of combining separate 
 scores. Illustrations of such omnibus tests are those by 
 Buckingham and Monroe, Pintner, Chapman, Whipple, and 
 others. 
 
 Whatever intelligence test is used, some sort of a score 
 will result. The National Intelligence Test, for example, 
 yields a point score, and the pupil making the largest num- 
 ber of points is considered to have the highest general mental 
 ability. The Stanford Revision of the Binet-Simon Scale, 
 on the other hand, yields a mental-age score, and the pupil 
 making the highest mental age is considered to have the 
 highest mental ability. 
 
 Suppose that forty pupils are to be divided into two 
 equivalent groups on the basis of an intelligence test which 
 yields a mental age. Suppose that the test to be used has 
 been selected, ordered from the bureau which issues it, 
 applied to the forty pupils according to the standardized 
 directions sent with the test, and scored according to the 
 standardized method of scoring. Suppose also that the 
 resulting mental ages, when arranged in order of size, to- 
 gether with the chronological ages, are as shown in Table 1. 
 
 1 Descriptions, price lists, and samples of tests and the standard directions for 
 the tests may be secured from such distributing centers as World Book Company, 
 Yonkers, New York; Bureau of Publications, Teachers College, New York City; 
 
 Russell Sage Foundation, New York City; Public School Publishing Company, 
 Bloomington, Illinois; and C, H. Stoelting Company, Chicago, Illinois. 
 
 TABLE 1
 CHRONOLOGICAL AGES AND MENTAL AGES OF 43 6TH GRADE PUPILS

 Pupil  Chron.  Mental     Pupil  Chron.  Mental     Pupil  Chron.  Mental
         Age     Age               Age     Age               Age     Age
   1     124     153        16     123     127        30     133     114
   2     136     144        17     138     126        31     139     114
   3     135     142        18     134     126        32     130     114
   4     136     140        19     129     126        33     131     113
   5     120     139        20     133     126        34     149     111
   6     119     139        21     140     126        35     133     108
   7     141     139        22     129     126        36     133     105
   8     128     137        23     135     125        37     140     105
   9     135     136        24     134     124        38     151     102
  10     139     135        25     123     124        39     [?]     101
  11     120     132        26     [?]     122        40     159     101
  12     126     129        27     129     122        41     160     100
  13     130     129        28     115     121        42     160      99
  14     133     128        29     136     115        43     149      92
  15     142     128

 Technique of Pairing Pupils.—The division of pupils
 in Table 1 into two equivalent groups on the basis of mental
 age may be done by a common-sense pairing of the pupils.
 Nevertheless certain helpful suggestions and cautions can
 be given. For one thing it will not be fully satisfactory to
 pair the pupils into groups thus:
 
 Group I Group II 
 Pupil 1 — 153 Pupil 2 — 144 
 Pupil 3 — 142 Pupil 4— 140 
 Pupil 5 — 139 Pupil 6 — 139 
 
 Such a procedure operates to give Group I a higher average 
 mental ability than Group II, as may be discovered by 
 trying it. Rather the general procedure for pairing should 
 be thus: 
 
 Group I          Group II
 1 — 153          2 — 144
 4 — 140          3 — 142
 5 — 139          6 — 139
 
 This method of pairing constantly tends to counteract the 
 tendency to give one group a higher average ability than the 
 other. 
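
 The difference between the two pairings can be checked on the first
 six pupils of Table 1. The sketch below is an illustration only; the
 function names are hypothetical and the pupils are assumed to be
 listed in descending order of mental age.

```python
# Mental ages of Pupils 1-6 from Table 1, already in descending order.
mental_age = {1: 153, 2: 144, 3: 142, 4: 140, 5: 139, 6: 139}


def pair_naively(pupils):
    """The higher member of every pair always goes to Group I."""
    return pupils[0::2], pupils[1::2]


def pair_alternately(pupils):
    """Alternate which group receives the higher member of each pair."""
    group_1, group_2 = [], []
    for k in range(0, len(pupils) - 1, 2):
        high, low = pupils[k], pupils[k + 1]
        if (k // 2) % 2 == 0:
            group_1.append(high)
            group_2.append(low)
        else:
            group_1.append(low)
            group_2.append(high)
    return group_1, group_2


def mean_age(group):
    return sum(mental_age[p] for p in group) / len(group)
```

 With only three pairs the alternating rule cannot remove the
 advantage Pupil 1 gives to Group I, but it does narrow the gap
 between the two group means.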
 
 But even when this last procedure is followed, the mean
 of the mental ages for one group may not be identical with
 the mean of the mental ages for the other group. By a
 special juggling of pupils two groups may be constituted
 which have practically identical means. But such juggling
 is seldom advisable. Unless care is exercised, it is likely
 to result in an equivalence secured by pairing a gifted and
 an ungifted pupil with two average pupils. The means will be
 equated to be sure, but the variabilities will be unequal.
 Such special juggling is helpful only when previously paired
 pupils exchange groups.

 TABLE 2
 THE PUPILS OF TABLE 1 DIVIDED INTO TWO GROUPS OF EQUIVALENT MENTAL AGE

         Group I                   Group II
 Pupil    Mental Age       Pupil    Mental Age
   2         144              3        142
   5         139              4        140
   6         139              7        139
   9         136              8        137
  10         135             11        132
  13         129             12        129
  14         128             15        128
  17         126             16        127
  18         126             19        126
  21         126             20        126
  22         126             23        125
  25         124             24        124
  26         122             27        122
  30         114             29        115
  31         114             32        114
  34         111             33        113
  35         108             36        105
  38         102             37        105
  39         101             40        101
  42          99             41        100
 Mean       122.45          Mean       122.5
 
 Certain modifications of the procedure recommended are 
 desirable. These modifications are illustrated in Table 2. 
 Pupil 1 is eliminated from the experiment entirely. His 
 mental age is so high, or rather it is so much above
 that of any other pupil, that he cannot be even approximately
 paired. The next pupil, namely, Pupil 2, is 9 points of 
 mental age below him. If for administrative reasons Pupil 
 1 must be included in the experimental classes he can still 
 be eliminated from this and all subsequent experimental 
 computation. Except for the influence his presence in one 
 of the groups will have, he can become experimentally
 non-existent. Pupil 2 is substituted for Pupil 1. He pairs
 satisfactorily with Pupil 3, so the pairing continues according to
 rule until Pupil 28 is reached. Pupil 28 does not pair well 
 with Pupil 29, hence Pupil 28 does not appear in Table 2. 
 Pupil 29 appears in his place. The pairing continues with- 
 out interruption until Pupil 43 is reached. Partly because 
 he makes an odd number and partly because his inclusion 
 in either group will be distinctly unfair to that group, owing 
 to his low mental age, he does not appear in Table 2. 
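
 The construction of Table 2 can be retraced as follows. This is a
 sketch under stated assumptions: the mental ages are those of Table 1
 in pupil order, the three excluded pupils are the ones named in the
 text, and the names `pair_alternately`, `MENTAL_AGE`, and `EXCLUDED`
 are illustrative only.

```python
from statistics import pstdev

# Mental ages from Table 1, listed in pupil order (Pupil 1 first).
AGES = [
    153, 144, 142, 140, 139, 139, 139, 137, 136, 135, 132, 129, 129, 128, 128,
    127, 126, 126, 126, 126, 126, 126, 125, 124, 124, 122, 122, 121, 115,
    114, 114, 114, 113, 111, 108, 105, 105, 102, 101, 101, 100, 99, 92,
]
MENTAL_AGE = {pupil: age for pupil, age in enumerate(AGES, start=1)}
EXCLUDED = {1, 28, 43}   # pupils the text drops because they cannot be paired


def pair_alternately(pupils):
    """Alternate which group receives the first member of each adjoining pair."""
    group_1, group_2 = [], []
    for k in range(0, len(pupils) - 1, 2):
        first, second = pupils[k], pupils[k + 1]
        if (k // 2) % 2 == 0:
            group_1.append(first)
            group_2.append(second)
        else:
            group_1.append(second)
            group_2.append(first)
    return group_1, group_2


# Pupil numbers run in descending order of mental age, so sorting by number suffices.
retained = [p for p in sorted(MENTAL_AGE) if p not in EXCLUDED]
group_1, group_2 = pair_alternately(retained)

mean_1 = sum(MENTAL_AGE[p] for p in group_1) / len(group_1)
mean_2 = sum(MENTAL_AGE[p] for p in group_2) / len(group_2)
sd_1 = pstdev(MENTAL_AGE[p] for p in group_1)   # variability within Group I
sd_2 = pstdev(MENTAL_AGE[p] for p in group_2)   # variability within Group II
```

 Comparing `sd_1` with `sd_2` as well as `mean_1` with `mean_2` checks
 both of the requirements for equivalence, like means and like
 variability.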
 
 Thus far it has been assumed that the pupils in Table 1
 are to be divided into two equivalent groups only. The 
 procedure for dividing them into three equivalent groups is 
 as follows: 
 
 Group I       Group II      Group III
 2 — 144       3 — 142       4 — 140
 7 — 139       6 — 139       5 — 139
 8 — 137       9 — 136       10 — 135
 
 The procedure for equating four groups follows the same 
 general principle, thus: 
 
 Group I       Group II      Group III     Group IV
 2 — 144       3 — 142       4 — 140       5 — 139
 9 — 136       8 — 137       7 — 139       6 — 139
 10 — 135      11 — 132      12 — 129      13 — 129
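
 The same principle extends to any number of groups: successive rows
 of ranked pupils are dealt across the groups first in one direction
 and then in the other. A minimal sketch follows; the function name
 `deal_serpentine` is an assumption, and the pupils are assumed to be
 supplied in descending order of score.

```python
def deal_serpentine(pupils, n_groups):
    """Deal ranked pupils across the groups, reversing direction on every other row."""
    groups = [[] for _ in range(n_groups)]
    for start in range(0, len(pupils), n_groups):
        row = pupils[start:start + n_groups]
        if (start // n_groups) % 2 == 1:
            row = row[::-1]            # the second, fourth, ... rows run backwards
        for group, pupil in zip(groups, row):
            group.append(pupil)
    return groups
```

 Called with Pupils 2 through 10 and three groups, or Pupils 2 through
 13 and four groups, it gives the assignments listed above.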
 
 Because of inequalities in room space or for other rea- 
 sons, it may not be practicable to have an equal number 
 of pupils in each group. If we assume that one-third of
 the pupils in Table 1 are to be in Group II and the remainder
 in Group I, the procedure for equating would be as shown
 below. This assumption means that of every adjoining 
 group of three pupils, two will go into Group I and one into 
 Group II. The closest equivalence will be secured if the 
 middle pupil of each group of three is placed in Group II, 
 thus: 
 
 Group I          Group II
 2 — 144          3 — 142
 4 — 140
 5 — 139          6 — 139
 7 — 139
 
 When one-fourth of the pupils are to be placed in one 
 group and three-fourths in the other, the pupils come in 
 groups of four instead of three, and hence there is no mid- 
 dle pupil. Of the first group of four pupils, namely, pupils 
 2, 3, 4, and 5, pupils 2, 4, and 5 may be placed in Group I 
 and pupil 3 in Group II, and of the second group of four 
 pupils, namely, pupils 6, 7, 8, and 9, pupils 6, 7, and 9 may 
 be placed in Group I and pupil 8 in Group II. Thus in the
 first pairing, Group I gains a slight advantage, and, in the 
 second pairing, Group II gains an equivalent advantage. 
 This pairing by alternating advantage may be continued 
 similarly for the remaining pupils. 
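
 Both unequal divisions can be sketched briefly. The function names
 are hypothetical, the pupils are assumed to be ranked by score, and
 any pupils left over after the last complete set of three or four
 are simply ignored here, a case the text does not discuss.

```python
def split_one_third(pupils):
    """Of each adjoining three, the middle pupil goes to the smaller group."""
    larger, smaller = [], []
    for i in range(0, len(pupils) - len(pupils) % 3, 3):
        first, middle, last = pupils[i:i + 3]
        larger += [first, last]
        smaller.append(middle)
    return larger, smaller


def split_one_quarter(pupils):
    """Of each adjoining four, take the 2nd and the 3rd pupil by turns."""
    larger, smaller = [], []
    usable = len(pupils) - len(pupils) % 4
    for k, i in enumerate(range(0, usable, 4)):
        block = pupils[i:i + 4]
        pick = 1 if k % 2 == 0 else 2      # alternate the slight advantage
        smaller.append(block[pick])
        larger += block[:pick] + block[pick + 1:]
    return larger, smaller
```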
 
 The technique of equating groups on the basis of mental 
 age has been discussed. The procedure for equating groups 
 on the basis of point scores on an intelligence test is identi- 
 cal. The procedure is the same for equating groups on the 
 basis of a series of educational tests. The only difficulty 
 likely to be met in this last situation, or in any situation 
 where groups are being equated on the basis of more than 
 one test, is the difficulty of properly combining the scores 
 made by each pupil on the separate tests into a single score.
 The procedure required to deal with this difficulty will be
 described later in this chapter. 
 
 Groups Equated by Initial Status in Experimental 
 Trait.—When groups are equated on the basis of measure- 
 ment, the most convenient and perhaps most frequent basis 
 employed by experimenters for equating groups is that of 
 initial status in the experimental trait. This method is 
 convenient because it is necessary in most experiments to 
 give an initial test in order to measure the change produced 
 by the EF. This provides, without additional labor, scores 
 for the experimental subjects which may be used to divide 
 them into two or more groups. The procedure for making
 this pairing is identical with that just described. 
 
 When the division of pupils into groups requires the 
 actual physical shifting of pupils, the division must be 
 made before the EF’s are applied. When such shifting is 
 not necessary, this detailed division is left until the EF’s 
 and FT’s have been applied and the experimental computations
 have been started. Thus Pittman¹ wished to determine
 the relative efficiency of the zone system of supervision
 for rural schools as compared with the conventional
 system. One group was composed of the schools of one 
 rural county and the other group of the schools of another 
 rural county. Here it was not feasible to transfer pupils 
 or schools from one county to another. What Pittman did 
 was to make a rough initial equating by choosing two rural 
 counties that were as nearly identical as possible in wealth, 
 quality of population, quality of teachers, and so on. He 
 applied the IT, appropriate EF, and FT to all the pupils 
 in grades III through VIII in each county. At the conclu- 
 sion of the experiment he arranged the pupils in one county 
 in the order of the size of their scores on the IT. He did 
 likewise with the pupils in the other county. He then elimi- 
 nated from subsequent computations all the pupils in one 
 group who could not be paired with an equivalent pupil in
 the other group. The remaining pupils constituted his two
 equivalent groups, and they were the ones used in computing
 changes produced by the EF’s. Bennett, in a
 Maryland rural county, followed an identical procedure,
 except that he split one county into two roughly equivalent
 parts.

 1 Pittman, M. S., The Value of School Supervision; Warwick and York, Baltimore,
 1921.
 
 It would have been no advantage to Pittman or Bennett 
 to equate groups immediately after the application of the 
 IT. In fact it would have been a slight disadvantage. It 
 would not have been possible to segregate the chosen pupils 
 for the purpose of applying the EF or FT, and thereby 
 save the waste effort of applying EF and FT to all pupils 
 indiscriminately. So there would have been no gain here. 
 On the other hand there would have been a slight disad- 
 vantage in equating at the beginning due to the fact that 
 certain pupils selected for the experimental groups would 
 have been absent at the time of the FT thereby necessitating 
 their ultimate elimination, together with the paired pupil in 
 the other group. The paired pupil in the other group could 
 have been retained only on condition that an equivalent 
 pupil could have been found to take the place of the pupil 
 who was absent for the FT. All this trouble was avoided 
 by delaying the equating of groups until it was definitely 
 determined what pupils remained throughout the experi- 
 ment. In sum, wherever the actual physical shifting of 
 experimental subjects is not to take place, and, in addition, 
 wherever the experimental subjects proper are not to be 
 segregated for purposes of applying EF or FT, delayed 
 equating is preferable to early equating of groups. Initial 
 equating is essential or advisable wherever subjects are to 
 be shifted or segregated. 
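
 Delayed equating can be pictured as a matching step carried out once
 the experiment is over. The sketch below is an assumption throughout:
 the text does not say how close two IT scores had to be to count as
 equivalent, so a hypothetical `tolerance` parameter stands in for
 that judgment, and the simple in-order matching is an illustrative
 choice rather than the procedure Pittman or Bennett used.

```python
def pair_after_the_fact(scores_a, scores_b, tolerance=0):
    """Match pupils of two groups on initial-test (IT) score.

    scores_a, scores_b: dicts mapping pupil id -> IT score, restricted to
    pupils who were present for both the IT and the FT.
    Returns a list of (pupil_a, pupil_b) pairs; unmatched pupils are dropped.
    """
    a = sorted(scores_a.items(), key=lambda item: item[1])
    b = sorted(scores_b.items(), key=lambda item: item[1])
    pairs, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        difference = a[i][1] - b[j][1]
        if abs(difference) <= tolerance:
            pairs.append((a[i][0], b[j][0]))
            i += 1
            j += 1
        elif difference < 0:
            i += 1      # a[i] is too low to match any remaining pupil in b
        else:
            j += 1      # b[j] is too low to match any remaining pupil in a
    return pairs
```

 Because only pupils present for both tests are passed in, no pair has
 to be broken up later on account of an absence at the FT.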
 
 In actual practice the equating of groups is sometimes 
 not so simple as has been described, but the general principle
 is the same. Thus Pittman and Bennett both used
 many types of tests—reading, arithmetic, spelling, and so 
 on—in order to get a rather thorough measurement of all the 
 changes produced by each EF. Each of these dozen or so
 tests was applied both at the beginning and at the end of the
 experiment. Which type of test was used as the basis of 
 equating? Pittman and Bennett employed each type in 
 turn. Thus in comparing the amount of change in reading 
 produced by each EF, the groups were equated on the basis 
 of the initial scores in reading. When comparing the amount 
 of change in arithmetic produced by each EF, the pupils 
 employed were selected on the basis of the initial scores in 
 arithmetic. This procedure meant, of course, that the com- 
 position of the experimental groups changed somewhat with 
 each new equating, but the procedure assured an initial 
 equivalence of groups in the experimental trait under con- 
 sideration. 
 
 One additional suggestion may be given. The EF2 for 
 Pittman’s control group was merely the customary super- 
 vision. Since the application of EF2 involved no particular 
 effort on Pittman’s part, he used and tested many more 
 pupils in his control group than in the other. By doing 
 this he made it easy to find a pair for every pupil in the 
 group to which EF1 was applied, thereby avoiding the necessity
 of discarding any of these pupils because of an inability
 to pair them. 
 
 Groups Equated by Composite of Several Tests.— 
 Sometimes the experimenter desires to equate groups on the 
 basis of more than one test. This requires the experimenter 
 to make a composite of the scores on the various tests. To 
 equate separately for general-ability tests seldom serves any 
 useful purpose. To equate separately for each of several 
 experimental tests does serve a useful purpose, but there is 
 a certain inconvenience in having to alter the composition of 
 the group from time to time during the experimental com- 
 putation. To avoid this objection, some experimenters pre- 
 fer to equate groups on the basis of a composite of the initial 
 scores on all the experimental tests. This gives constancy 
 in the composition of the groups and gives an approximate, 
 if not an exact, equivalence for each experimental test, unless 
 the traits are markedly different in nature. In sum, there
 are situations where equating by a composite of scores on
 several tests is desirable. 
 
 The process of computing a composite is illustrated for a
 small number of pupils in Table 3. The first vertical column
 gives the identification number for each pupil. The
 second, third, and fourth columns show the scores made by
 each pupil on a reading, an arithmetic, and a spelling test
 respectively. Beneath each of these columns appears a measure,
 the standard deviation (S.D.), of the variability among
 the scores of that particular column.

 TABLE 3
 ILLUSTRATING THE COMPUTATION OF A COMPOSITE SCORE WHERE EACH TEST
 RECEIVES EQUAL WEIGHT

 Pupil  Read.  Arith.  Spell.    Read.     Arith.    Spell.    Composite
                                Weighted  Weighted  Weighted
   1      64     13      24        64        65        48        177
   2      68      9      21        68        45        42        155
   3      46      9      17        46        45        34        125
   4      54     14      27        54        70        54        178
   5      54     10      13        54        50        26        130
   6      72     12      20        72        60        40        172
   7      52     13      13        52        65        26        143
   8      43     11      24        43        55        48        146
   9      72     14      22        72        70        44        186
  10      46     12      18        46        60        36        142
  11      50     10      20        50        50        40        140
  12      46     11      21        46        55        42        143
  13      68     13      23        68        65        46        179
  14      61     13      26        61        65        52        178
  15      46      8      12        46        40        24        110
  16      64     11      28        64        55        56        175
  17      46     14      15        46        70        30        146
  18      43      9      15        43        45        30        118
  19      46      8      23        46        40        46        132
  20      56     13      25        56        65        50        171
 S.D.    9.8    2.0     4.8       9.8      10.0       9.6
 Mult.     1      5       2
 
 The first step in the determination of the composite scores 
 shown in Table 3 was to compute some measure of variability,
 in this case S.D. Any other standard measure of
 variability, such as mean deviation, median deviation, or 
 quartile deviation, can be used instead. The computation 
 of the S.D. for a series of scores is illustrated in Table 15 
 and Table 16 and explained in the adjoining text. 
 
 The second step was to select multipliers which would give 
 equal weight to each test. Just what weight should be given 
 each test in determining a composite depends upon the con- 
 ditions encountered in the situation; but once a decision 
 has been reached, the procedure for selecting the multipliers 
 which will effect this weighting should utilize some measure 
 of variability, in this case S.D. That is, tests are weighted 
 according to their variabilities and not, as naive common- 
 sense would indicate, according to their means. For ex- 
 ample, ordinary common-sense would lead us to suppose 
 that Test I below has more influence than Test II in deter- 
 mining a pupil’s relative position in the composite of the 
 two tests, because its mean is relatively much larger. But 
 as a matter of fact, Test II has the more weight because its 
 variability is relatively larger. It has exactly ten times as 
 much weight because its variability is ten times that of 
 Test I. Mere inspection of the composite of the two tests 
 shows that Test II has a large influence upon the composite 
 and that Test I has only a negligible influence. The order 
 of the composite scores is the order of the scores in Test II. 
 
 Pupil    Test I    Test II    Composite
   a       1000        40        1040
   b       1001        30        1031
   c       1002        20        1022
   d       1003        10        1013
   e       1004         0        1004
 Mean      1002        20
 
 
 
 The two tests can be given equal weight either by multi- 
 plying all the scores of Test I by 10 or by dividing all the 
 scores of Test II by 10. Either procedure will make their
 variabilities equivalent. To illustrate this point, the scores
 of Test II are divided by 10 in the following:
 
 
 
 
 
 Pupil    Test I    Test II    Composite
   a       1000         4        1004
   b       1001         3        1004
   c       1002         2        1004
   d       1003         1        1004
   e       1004         0        1004
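
 The two small tables can be verified directly. The sketch below uses
 the scores of pupils a through e as given; the helper names
 `composite` and `ranking` are illustrative assumptions.

```python
from statistics import pstdev

test_1 = {"a": 1000, "b": 1001, "c": 1002, "d": 1003, "e": 1004}
test_2 = {"a": 40, "b": 30, "c": 20, "d": 10, "e": 0}


def composite(scores_1, scores_2, divisor_2=1):
    """Add the two tests, optionally dividing Test II by a constant first."""
    return {p: scores_1[p] + scores_2[p] / divisor_2 for p in scores_1}


def ranking(scores):
    """Pupils ordered from the highest score to the lowest."""
    return sorted(scores, key=scores.get, reverse=True)


# Test II varies ten times as much as Test I, whatever the means may be.
variability_ratio = pstdev(list(test_2.values())) / pstdev(list(test_1.values()))

raw = composite(test_1, test_2)                      # Test II dominates the order
equalized = composite(test_1, test_2, divisor_2=10)  # the two tests now cancel exactly
```

 In the raw composite the pupils stand in the order of Test II; once
 the variabilities are equalized, the opposite orders of the two tests
 offset each other and every composite is the same.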
 
 
 
 All this means that if the three tests in Table 3 are to be 
 given equal weight, such multipliers must be selected and 
 used on the test scores as will make their variabilities equal. 
 A multiplier of 1 for reading, of 5 for arithmetic, and of 
 2 for spelling will alter their S.D.’s to 9.8 for reading, 10.0
 for arithmetic, and 9.6 for spelling, as shown in Table 3. 
 These variabilities are sufficiently equivalent for practical 
 purposes. By the use of fractional multipliers they can be 
 made exactly equivalent. 
 
 The multipliers just selected are not the only possible 
 ones. Equivalence of variability can be secured just as well 
 by multiplying reading by ½, arithmetic by 2½, and spelling
 by 1, or by many other combinations. As a rule it is
 most convenient to select only whole numbers for multipliers 
 or divisors, and to select as small numbers as possible. 
 
 Thus far it has been assumed that the three tests are to
 receive equal weight. This is not necessary. Any desired 
 weight may be given. Thus if it is desired to give reading 
 twice as much weight as spelling and spelling two-and-a-half 
 times as much weight as arithmetic, all the multipliers will 
 be 1, because the variabilities of the three tests are in this 
 ratio originally. If it is desired to give arithmetic twice 
 the weight of reading, and reading twice the weight of 
 spelling, the multiplier for arithmetic will be 10, for reading
 1, and for spelling 1, or other multipliers which will as
 satisfactorily effect the weighting desired.
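
 One way to arrive at such multipliers is to divide the weight desired
 for each test by that test's S.D. and then rescale so that the
 smallest multiplier is 1. This rule is an interpretation of the
 passage rather than a formula stated in it; the function name and the
 rounding to whole numbers are assumptions, and the S.D.'s are those
 of Table 3.

```python
SD = {"reading": 9.8, "arithmetic": 2.0, "spelling": 4.8}   # from Table 3


def multipliers(sds, weights):
    """Whole-number multipliers giving each test the desired relative weight."""
    raw = {test: weights[test] / sds[test] for test in sds}
    smallest = min(raw.values())
    return {test: round(value / smallest) for test, value in raw.items()}


equal = multipliers(SD, {"reading": 1, "arithmetic": 1, "spelling": 1})
# Reading twice spelling, spelling two-and-a-half times arithmetic.
as_found = multipliers(SD, {"reading": 5, "arithmetic": 1, "spelling": 2.5})
# Arithmetic twice reading, reading twice spelling.
arithmetic_heavy = multipliers(SD, {"reading": 2, "arithmetic": 4, "spelling": 1})
```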
 
 The third step in determining a composite is to multiply 
 the respective series of test scores by the multiplier selected
 for that test. Thus, in Table 3, all the reading scores are
 multiplied by 1, all the arithmetic scores by 5, and all the 
 spelling scores by 2. The products are shown in columns 5, 
 6, and 7. 
 
 The final step in computing a composite is to add the 
 weighted scores for the various tests for each pupil. Thus, 
 in Table 3, the addition of weighted scores 64, 65, and 48 
 yields a composite of 177. From this point the procedure 
 for equating groups has already been described. 
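
 The four steps can be retraced on the scores of Table 3. The sketch
 assumes the S.D. is computed over the twenty pupils as a whole
 population (`statistics.pstdev`), which agrees with the tabled values
 of 9.8, 2.0, and 4.8; a few raw scores that are hard to read in the
 printed table are taken here from the weighted columns.

```python
from statistics import pstdev

reading = [64, 68, 46, 54, 54, 72, 52, 43, 72, 46,
           50, 46, 68, 61, 46, 64, 46, 43, 46, 56]
arithmetic = [13, 9, 9, 14, 10, 12, 13, 11, 14, 12,
              10, 11, 13, 13, 8, 11, 14, 9, 8, 13]
spelling = [24, 21, 17, 27, 13, 20, 13, 24, 22, 18,
            20, 21, 23, 26, 12, 28, 15, 15, 23, 25]

# Step 1: a measure of variability for each test.
sd = {
    "reading": round(pstdev(reading), 1),
    "arithmetic": round(pstdev(arithmetic), 1),
    "spelling": round(pstdev(spelling), 1),
}

# Step 2: multipliers chosen so that the weighted S.D.'s are about equal.
mult = {"reading": 1, "arithmetic": 5, "spelling": 2}

# Steps 3 and 4: weight each score and add the weighted scores for each pupil.
composite = [
    r * mult["reading"] + a * mult["arithmetic"] + s * mult["spelling"]
    for r, a, s in zip(reading, arithmetic, spelling)
]
```

 The resulting list of composites can then be ranked and the pupils
 paired exactly as was done with mental ages.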
 
 Groups Equated by Preliminary Rate of Growth.— 
 There are competent experimenters who contend that the 
 best index of future rate of growth, or of possibilities for 
 future growth, is current rate of growth. They advise, there- 
 fore, that the experimenter test his experimental pupils at 
 intervals preceding the experiment in order to determine the 
 rate at which each pupil is developing in the experimental 
 trait. Once this rate has been determined, pupils may be 
 paired on this basis. 
 
 But we cannot be certain that equating by current rate 
 of growth is superior to, say, equating by initial status in 
 the trait in question. The latter is pairing by actual rate 
 of growth as truly as is the former. The former means 
 pairing by rate of growth as determined for a necessarily 
 relatively brief time, whereas the latter means pairing by 
 rate of growth measured from birth to the present. The 
 greater accuracy of the rate-of-growth method of equating 
 is, then, somewhat dubious, and its greater inconvenience is 
 certain. As a result, the method is not likely to come into 
 general use until its superiority has been definitely estab- 
 lished by investigation. The most relevant study thus far 
 conducted, namely, that by Hollingworth, was planned for 
 another purpose. 
 
 1 Hollingworth, H. L. and L. S., Vocational Psychology, D. Appleton and
 Company, New York,

 Besides those already discussed, there are many other
 bases which may or may not be worthy of consideration,
 depending upon the nature of the experiment. Among
 these the following may be mentioned: chronological age,
 physiological age, social age, previous training, and home
 environment in case this last cannot be controlled
 experimentally.
 
 Any one or all of these may exercise an influence in de- 
 termining a pupil’s possibilities for growth in the trait in 
 question. 
 
 Groups Equated by Multiple Bases.—Any one basis 
 for equating groups is bound to fall short of complete satis- 
 faction, because it is necessarily inadequate. A human 
 mechanism is exceptionally complex. Any one basis taps 
 only a phase of this total mechanism. A perfect prophecy 
 can be made only when every phase of this mechanism is 
 properly measured and properly weighted. 
 
 Again, any one basis fails to give complete satisfaction 
 because of the intricate dependence of one basis upon an- 
 other or of one part of the human mechanism upon another. 
 It will be sufficient to cite two simple illustrations of this 
 dependence. An intelligence test shows two pupils, A and 
 B, to have identical mental ages, namely 12 years and 12 
 years, respectively. May they be paired with reasonable 
 assurance that the two will progress at equal rates in the 
 future, except for differences in effectiveness of the EF’s? 
 Perhaps two groups can be equated on this sole basis pro- 
 vided the number of pupils is large. But two pupils cannot 
 be equated without taking other factors into consideration. 
 If, for example, Pupil A is 10 years old chronologically, and 
 Pupil B 12 years old chronologically, they are not equiva- 
 lent pupils. Pupil A has progressed mentally since birth 
 much faster than has Pupil B, for he has progressed in 10 
 years as far as Pupil B in 12 years. The conventional 
 method for expressing this rate of mental growth is the 
 Intelligence Quotient, computed by dividing mental age by 
 chronological age, and by multiplying the quotient by 100. 
 Thus the Intelligence Quotient for Pupil A is (12 ÷ 10) ×
 100, i.e. 120, whereas that for Pupil B is (12 ÷ 12) × 100,
 i.e. 100.
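
 The computation is short enough to state as a one-line function. The
 name `intelligence_quotient` is illustrative; both ages must be
 expressed in the same unit, whether years or months.

```python
def intelligence_quotient(mental_age, chronological_age):
    """Mental age divided by chronological age, multiplied by 100."""
    return 100 * mental_age / chronological_age


iq_a = intelligence_quotient(12, 10)   # Pupil A
iq_b = intelligence_quotient(12, 12)   # Pupil B
```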
 
 But the fact that they cannot be paired because their 
 Intelligence Quotients are different does not mean at all that 
 they can be paired if their Intelligence Quotients are identi- 
 cal. A ten-year-old pupil with a mental age of 10 years may 
 not be equivalent to a fourteen-year-old pupil with a men- 
 tal age of 14 years, even though both have Intelligence 
 Quotients of 100. This means that equating is improved 
 by pairing pupils who are alike both in mental age and 
 Intelligence Quotient or, stated more conveniently, who are 
 alike in both mental age and chronological age. In similar 
 manner, chronological age conditions all the bases for 
 equating groups. 
 
 For a second illustration of this dependency of one basis 
 upon another, we may take the case of the dependence of 
 initial status in the experimental trait upon previous train- 
 ing. Two pupils who have like initial scores in the experi- 
 mental trait may have widely different promise for future 
 rate of growth. One may have attained his initial status 
 after much training and the other after little training. In 
 the case of the former pupil, a low score probably means a 
 low physiological limit of growth and hence little promise 
 for the future. In the latter case a low score probably means 
 a high physiological limit and hence great promise for the 
 future. In similar manner, a high score may mean great 
 promise or little promise, depending upon the amount of 
 training required to produce the high score. 
 
 Wherever feasible, then, groups should be equated on as 
 many bases as possible. Pupils should be paired who are 
 alike in initial status in the experimental trait, in mental age, 
 in chronological age, in home environments, in sex, in race, 
 and so on for all significant bases. In actual practice, pair- 
 ing is seldom done on more than three bases, namely, 
 initial status in experimental trait, mental age, and chrono- 
 logical age. Pairing is usually done on just one basis, in- 
 itial status in the experimental trait or mental age, with the 
 preference for the former. 
 
 Equating is usually done on just one basis, first, because
 every increase in the number of bases employed reduces the
 number of pupils who can be satisfactorily paired from a 
 given total number of pupils; and, second, because equating 
 on one basis tends to make the groups have approximately 
 equivalent means and variabilities on any other basis, even 
 though particular pupils do not pair on all the bases. The 
 existence of this latter tendency is due both to the positive 
 correlation likely to obtain between desirable bases and to 
 the operation of chance. Those who equate on a variety of 
 bases rarely insist that paired pupils be identical on the vari- 
 ous bases. Rough equivalence is all that is ever secured. 
 Even where equating is done on one basis only, it is fre- 
 quently possible to increase the equivalence on some other 
 bases merely by shifting paired pupils from one group to 
 the other. 
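 A minimal sketch of pairing on a single basis may make the procedure concrete. The pupil labels, initial scores, and tolerance below are hypothetical and do not come from the text. Pupils are ranked by initial score, neighbors whose scores differ by no more than the tolerance are paired, and the two members of each pair go to opposite groups. Pupils with no close neighbor are left out, which is the loss of subjects the text describes.

```python
from statistics import mean

def pair_on_one_basis(scores, tolerance):
    """Pair pupils whose initial scores differ by at most `tolerance`.

    `scores` maps a pupil's label to his initial score. Returns the two
    groups (lists of labels) and the list of pupils left unpaired.
    """
    ranked = sorted(scores, key=scores.get)
    group_1, group_2, unpaired = [], [], []
    i = 0
    while i < len(ranked):
        has_neighbor = i + 1 < len(ranked)
        if has_neighbor and scores[ranked[i + 1]] - scores[ranked[i]] <= tolerance:
            first, second = ranked[i], ranked[i + 1]
            # Alternate which group receives the lower scorer of a pair.
            if len(group_1) % 2 == 1:
                first, second = second, first
            group_1.append(first)
            group_2.append(second)
            i += 2
        else:
            unpaired.append(ranked[i])
            i += 1
    return group_1, group_2, unpaired

# Hypothetical initial scores in the experimental trait.
initial = {"A": 31, "B": 32, "C": 40, "D": 41, "E": 55, "F": 47, "G": 48}
g1, g2, left_out = pair_on_one_basis(initial, tolerance=2)
print(g1, g2, left_out)
print(mean(initial[p] for p in g1), mean(initial[p] for p in g2))
```

 Alternating the lower scorer between the groups is one simple way, assumed here, of keeping the group means close. Shifting whole pairs between groups afterward, as the text suggests, could then improve equivalence on a second basis.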
 
 Mason D. Gray has called attention to a unique diffi- 
 culty in equating two groups. Because of the close correla- 
 tion between intelligence and vocabulary, we would expect 
 normally that two groups which have been equated on the 
 basis of intelligence would be found thereby to have been 
 equated, at least approximately, on the basis of vocabulary. 
 But Gray reports that when a group which has elected high- 
 school Latin is equated on the basis of intelligence with a 
 group which has not elected Latin, the Latin group has a 
 higher vocabulary ability than the non-Latin group. It is 
 highly improbable that such would be the case if both groups 
 were indiscriminately mingled and if students were assigned 
 by the experimenter to the Latin EF and the non-Latin EF 
 without regard to students’ preferences. In general, the ex- 
 perimenter needs to be particularly alert in equating groups 
 which have been divided previously on the basis of some 
 intrinsic psychological difference between them. 
 
 Groups Equated by the A. Q. or F Technique.— 
 Whenever possible, groups should be equated. Whenever 
 conditions do not permit this, it is possible to equate pupils 
 statistically by means of the A. Q. or F technique. The
 effect of these techniques is to take a group, no matter what
 its ability, whether high, average, or low, and convert it into
 a standard group. 
 
 The underlying principle of the A. Q. or F techniques
 is that they demand of each pupil a progress commensurate
 with his brightness, and provide a formula for testing
 whether progress has been commensurate with capacity to 
 progress. A class with low capacity is asked to make a 
 defined amount of progress in a defined time. A class with 
 high capacity is asked to make a proportionately greater 
 progress. If each group under its own EF just exactly 
 makes its expected progress, both EF’s may be considered 
 of equal effectiveness. 
 
 Suppose that the experimental trait is reading. Then the 
 equivalent-groups formula becomes: 
 
 
 
 S1 — (Initial A. Q. — EF1 — Final A. Q. — A. Q. Change)
 S2 — (Initial A. Q. — EF2 — Final A. Q. — A. Q. Change)

 Where

 Initial A. Q. = Initial reading age / Initial mental age
 Final A. Q. = Final reading age / Final mental age
 
 The computation of reading age is explained by the direc- 
 tions booklet which accompanies the Thorndike-McCall 
 Reading Scale.¹

 The computation of mental age is explained in Terman’s
 “The Measurement of Intelligence.”²
 
 The final reading age will have to be determined by a 
 retest. The final mental age may be determined statistically 
 without a retest, due to the fact that a pupil’s Intelligence 
 Quotient, i.e. mental age divided by chronological age, is 
 fairly constant. The final mental age may be computed by 
 means of the following formula: 
 
 Final mental age = Initial mental age
   + (Initial mental age / Initial chronological age)
   × the no. of months between initial and final reading tests.

 ¹ Issued by the Bureau of Publications, Teachers College, New York City.
 ² Houghton Mifflin Company, Boston.
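 The formula for final mental age, and the A. Q. change it feeds, can be sketched as follows. The pupil's ages are hypothetical, all ages are taken in months, and the A. Q. is treated here as a plain ratio of reading age to mental age. Whether it should also be multiplied by 100 is not settled by this passage.

```python
def final_mental_age(initial_mental_age, initial_chron_age, months_elapsed):
    """Estimate final mental age in months, assuming a constant I. Q."""
    iq_ratio = initial_mental_age / initial_chron_age
    return initial_mental_age + iq_ratio * months_elapsed

def aq(reading_age, mental_age):
    """A. Q. as a plain ratio of reading age to mental age."""
    return reading_age / mental_age

# Hypothetical pupil; every age is in months.
initial_ma, initial_ca, initial_ra = 132, 120, 126
months = 10
final_ra = 140  # obtained from the reading retest

final_ma = final_mental_age(initial_ma, initial_ca, months)
aq_change = aq(final_ra, final_ma) - aq(initial_ra, initial_ma)
print(final_ma, round(aq_change, 3))
```

 A positive A. Q. change means the pupil's reading grew faster than his estimated mental age; averaging such changes over each group gives the quantities compared in the equivalent-groups formula.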
 
 The computation of mental age presents no difficulty if 
 such tests as the Stanford Revision of the Binet-Simon Scale 
 or the Herring Revision of the Binet-Simon Scale are used. 
 These tests yield a score in terms of mental age. If some 
 other intelligence test which yields point scores is used, 
 these point scores can be transmuted into approximate men- 
 tal ages, provided age norms are available. Tentative age 
 norms for a few ages on the National Intelligence Test, Form 
 A, are given below. A pupil’s score of 90 is equivalent to a 
 mental age of 138. A score of 75 is equivalent to a mental 
 age of 126. A score of 95.5 is equivalent to a mental age
 of 144.

 Chronological age in years......   10½   11½   12½   13½
 Chronological age in months....    126   138   150   162
 National Intelligence Test norms    75    90   101   112
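 The transmutation of point scores into mental ages is a straight-line interpolation between adjacent norms, and a short sketch reproduces the three examples given above. The function name is an assumption of this sketch. Scores beyond the tabled norms are refused rather than extrapolated, since the text gives norms for only these four ages.

```python
def score_to_age(score, norm_ages, norm_scores):
    """Linearly interpolate an age in months for a point score.

    `norm_ages` and `norm_scores` are parallel lists in ascending order.
    """
    if not norm_scores[0] <= score <= norm_scores[-1]:
        raise ValueError("score lies outside the tabled norms")
    lower = zip(norm_ages, norm_scores)
    upper = zip(norm_ages[1:], norm_scores[1:])
    for (a0, s0), (a1, s1) in zip(lower, upper):
        if s0 <= score <= s1:
            return a0 + (a1 - a0) * (score - s0) / (s1 - s0)

# Tentative National Intelligence Test, Form A, norms from the table above.
ages_in_months = [126, 138, 150, 162]
nit_norms = [75, 90, 101, 112]

for s in (90, 75, 95.5):
    print(s, score_to_age(s, ages_in_months, nit_norms))
```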
 
 The computation of reading ages is provided for in the 
 directions which accompany the Thorndike-McCall Reading 
 Scale. Reading ages on other reading tests, spelling ages, 
 arithmetic ages, etc., may be computed, provided age norms 
 are available, by simply transmuting point scores on some 
 reading test, spelling test, or arithmetic test into reading 
 ages, spelling ages, or arithmetic ages respectively, as has 
 just been illustrated for the National Intelligence Test. 
 
 Unfortunately most educational tests report grade norms 
 rather than age norms. Even so, approximate age scores 
 may be computed by substituting for each grade its chrono- 
 logical age equivalent. The first two rows of the data shown 
 below will be the same regardless of the test which appears 
 in the third row. The third row will vary with the test. 
 In the following case, a point score of 37.8 on the Ayres
 Spelling Scale, 10 words each from columns L, O, Q, S, U,
 and W, becomes a spelling age of 141. A point score of 50.3
 becomes a spelling age of 167. A point score of 49 becomes
 a spelling age of 161.

 End of grade                           I    II   III    IV     V    VI   VII  VIII
 Approx. ch. age equivalent of grade   89   102   115   128   141   154   167   180
 Ayres Spelling Test grade norm..      ..    ..  19.6  30.4  37.8  47.7  50.3  54.4
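 When only grade norms are available, the same interpolation applies once each grade has been replaced by its age equivalent. The sketch below uses only the four upper grade norms, whose age equivalents are fixed by the worked examples in the text. The constant and function names are assumptions. A score of 49 interpolates to about 160.5 months, which the text reports as 161.

```python
from bisect import bisect_right

# Age equivalents (months) and Ayres norms for the four upper grades.
GRADE_END_AGE = [141, 154, 167, 180]
AYRES_NORM = [37.8, 47.7, 50.3, 54.4]

def spelling_age(point_score):
    """Interpolate a spelling age in months between adjacent grade norms."""
    if not AYRES_NORM[0] <= point_score <= AYRES_NORM[-1]:
        raise ValueError("score lies outside the norms used in this sketch")
    upper = min(bisect_right(AYRES_NORM, point_score), len(AYRES_NORM) - 1)
    lower = upper - 1
    span = AYRES_NORM[upper] - AYRES_NORM[lower]
    fraction = (point_score - AYRES_NORM[lower]) / span
    return GRADE_END_AGE[lower] + fraction * (GRADE_END_AGE[upper] - GRADE_END_AGE[lower])

for score in (37.8, 50.3, 49):
    print(score, f"{spelling_age(score):.1f}")
```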
 
 The computation and use of reading age, spelling age, men- 
 tal age, A. Q., and the like, when age norms are available 
 and when only grade norms are available, is discussed more 
 fully in “How to Measure in Education.”²
 
 F has the same function and significance as A. Q. 
 
 Tests scaled according to the age-scale system use A. Q., 
 whereas tests scaled according to the T-Scale system use F. 
 These two scale systems will be described in Chapter V. In 
 case F is used in place of A. Q., the equivalent-groups for- 
 mula becomes: 
 
 S1 — (Initial F — EF1 — Final F — F Change) 
 S2 — (Initial F — EF2 — Final F — F Change) 
 
 As will be explained more fully in Chapter V, F, in case 
 the experimental trait is reading, is computed thus: 
 
 Initial F = Initial reading T — initial intelligence T
 Final F = Final reading T — final intelligence T
 
 The initial and final reading T require the application of 
 both an initial and final reading test; whereas the final 
 intelligence T may be computed from the initial intelligence 
 T, through the use of each pupil’s B or brightness score. 
 The steps in the process are: (1) Compute the pupil’s B 
 score. Assume that the pupil’s T score is 38 and that his 
 age is exactly 10 years, 0 months. Then, by Table 11
 (p. 109), his B score is 38 + 12, i.e. 50. (Assume that 
 Table 11 is for the intelligence test in question.) (2) If the 
 experiment continues ten months, locate in Table 11 the B
 correction corresponding to this pupil’s age ten months later.

 ² The Macmillan Company, New York City.

 Ten months later he will be aged 10 years and 10 months.
 The B correction for this age is 8. Were the experiment to 
 run for four months the B correction would be 10. Assume 
 the experiment to run 10 months. (3) Subtract this B cor- 
 rection of 8 from the initial B score of 50. The result is 42, 
 which is the desired final intelligence T, required to compute 
 the final F. The final B correction of 8 is subtracted from 
 the initial B score, even if the caption at the top of Table 11 
 says “add.” In transmuting a T score into a B score, add 
 the B correction when the caption says to add and subtract 
 the B correction when the caption says to subtract. But 
 in transmuting a B score back into a T score reverse the 
 process. 
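 The three steps can be sketched in a few lines. The intelligence figures (a T score of 38 and B corrections of 12 and 8) are those of the worked example above. The reading T scores are hypothetical, and the function names are assumptions of this sketch.

```python
def final_intelligence_t(initial_t, initial_b_correction, final_b_correction):
    """Carry an intelligence T score forward through the B score.

    Corrections are signed as they apply when turning T into B, so the
    final correction is subtracted when turning B back into T.
    """
    b_score = initial_t + initial_b_correction
    return b_score - final_b_correction

def f_score(reading_t, intelligence_t):
    """F is the reading T minus the intelligence T."""
    return reading_t - intelligence_t

# Worked figures from the text: T = 38, corrections 12 (initial), 8 (final).
final_t = final_intelligence_t(38, 12, 8)
print(final_t)

# Hypothetical reading T scores, for illustration only.
initial_f = f_score(40, 38)
final_f = f_score(47, final_t)
print(final_f - initial_f)
```

 Storing each correction with the sign it has in the T-to-B direction keeps the reversal rule in one place: the same number is added going one way and subtracted coming back.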
 
 The Thorndike-McCall Reading Scale yields a T score 
 directly just as certain tests yield an age score directly. The 
 process for utilizing age or grade norms for converting scores 
 on any test into age scores has just been described. The 
 following shows the approximate T-score and B-correction 
 equivalents of age scores for any mental or educational test. 
 The T and B equivalents for intervening ages may be de- 
 termined by simple interpolation. 
 
 Age            6½  7½  8½  9½ 10½ 11½ 12½ 13½ 14½ 15½ 16½ 17½
 T score         0  13  25  32  39  44  50  53  57  63  70  77
 B correction   50  37  25  18  11   6   0  -3  -7 -13 -20 -27
 
 Equating groups through the A. Q. or F technique assumes 
 that rate of growth in the trait in question will be propor- 
 tional to intelligence, except for the differing effects of the 
 two EF’s. This assumption is justified when the trait in 
 question is a general mental function like reading, spelling, 
 arithmetic, geography, etc. The assumption is of doubtful 
 validity for specialized mental functions. Specialized pro- 
 phetic tests may be available some day for such specialized 
 mental functions.
 
CHAPTER IV 
 CONTROL OF EXPERIMENTAL CONDITIONS 
 
 Constant vs. Variable Irrelevant Factors.—In the 
 actual conduct of an experiment an experimenter must con- 
 tend with both constant and variable irrelevant factors. 
 Variable irrelevant factors do not particularly annoy the 
 experimenter. They are chance influences which operate 
 favorably as frequently as they operate unfavorably for a 
 particular EF. A multitude of such factors are unavoid- 
 ably playing upon experimental pupils throughout even the 
 best controlled educational experiments. In the long run, 
 their net effect is zero. The net result of constant irrele- 
 vant factors, on the contrary, is not a zero facilitation or 
 inhibition of a particular EF. They are any undesired 
 influences whose net result is favorable or unfavorable to 
 some EF. 
 
 An experimenter may ignore truly variable irrelevant fac- 
 tors, but he cannot ignore significant constant irrelevant 
 factors. He must either eliminate them, or else determine 
 the amount of their influence and allow for it in computing 
 the amount of change produced by the EF in question. The 
 ability to detect and eliminate constant irrelevant factors 
 is one of the distinguishing marks of a sagacious experi- 
 menter. 
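 The distinction can be illustrated with a small simulation. Every figure in it is hypothetical: a true gain of 5 points from the EF, chance influences scattered evenly about zero, and a constant factor that adds 2 points for every pupil. The variable influences wash out of the mean, while the constant one is carried straight into the result.

```python
import random
from statistics import mean

rng = random.Random(0)
n_pupils = 10_000
true_gain = 5.0        # hypothetical gain produced by the EF itself
constant_factor = 2.0  # hypothetical influence that always favors this EF

# Variable irrelevant factors: chance influences centered on zero.
with_variable_only = [true_gain + rng.gauss(0, 3) for _ in range(n_pupils)]
# The same pupils with a one-sided constant irrelevant factor added.
with_constant = [gain + constant_factor for gain in with_variable_only]

print(round(mean(with_variable_only), 2))
print(round(mean(with_constant), 2))
```

 The first mean stays close to the true gain and the second overstates it by roughly the size of the constant factor, which is why the latter must be eliminated or measured and allowed for.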
 
 This chapter will be devoted to an enumeration of the 
 more common constant irrelevant factors, and to suggested 
 methods of eliminating them. This list should be studied 
 not with the idea that it is complete or that every factor 
 listed would be a constant error in every situation. Mere 
 maturing, for example, introduces a constant error in ex-
 periments whose object is to determine the amount of
 change due directly to an EF, whereas its influence may be
 ignored in experiments whose object is to determine the 
 relative effectiveness of two or more EF’s. 
 
 The purpose of this chapter is the amplification and 
 illustration of the fundamental principle of experimenta- 
 tion—that changes in experimental subjects due to irrele- 
 vant factors should be eliminated, equated, or accurately 
 measured and discounted. The importance of any irrelevant
 factor varies with the amount of its contribution to each 
 EF, where the purpose of the experiment is to determine 
 the amount of change in experimental subjects due directly 
 to each EF, and varies with the difference in amount of its 
 contribution to each EF, where the purpose of the experi- 
 ment is to determine the relative effectiveness of two or 
 more EF’s. 
 
 Errors Due to Bias of Experimenters.—Conscious or 
 unconscious manifestation of bias on the part of an experi- 
 menter is a common constant error. This constant irrele- 
 vant factor is of special significance because there are so 
 many points in an experiment where an experimenter’s bias 
 can influence the final conclusion. Of course anyone who 
 consciously favors unfairly in any way any EF, is mentally 
 incompetent to conduct experiments. He is, to say it less 
 politely, an experimental cheat. He is employing the ap- 
 pearance of experimentation to secure a readier acquiescence 
 on the part of others to his own emotional prejudice. Con- 
 scious bias is so human as to be sometimes unavoidable. 
 But to be biased is one thing; consciously to allow this bias 
 to modify experimental arrangements is quite another. 
 
 A manifestation of unconscious bias is far more likely 
 to occur. It is extremely difficult for an experimenter to 
 remain exactly neutral. With some individuals, conscious 
 bias for a particular EF will cause them to favor it uncon- 
 sciously. Other individuals will be so meticulously careful to 
 avoid favoring a favorite EF as actually to favor the con- 
 trasted EF. Impressed by the conflicting results obtained 
 from various investigations of the amount and nature of sex
 differences, Cattell caustically remarked that the sex dif-
 ferences discovered depended upon the sex of the investi- 
 gator. 
 
 In many experiments it is possible to take certain pre- 
 cautions against manifestations of a possible bias. Thus, 
 Poffenberger, in his experiments to determine the mental 
 effect of doses of strychnine, numbered the capsules. He 
 then proceeded to forget just which did and which did not 
 contain strychnine. He did not refresh his memory until 
 the experiments had been concluded, tests given and scored,
 etc. Pittman, in pairing pupils at the end of his experi- 
 ment with the zone system of supervision, covered up the 
 final scores of pupils, lest he show a possible bias by pairing 
 with knowledge of the amount of change produced by each 
 EF. Another investigator wished to determine whether 
 judges varied more in judging the merits of compositions 
 containing much originality than in judging specimens con- 
 taining little originality. This investigator was careful to 
 choose the specimens containing much and those containing 
 little originality before securing, much less consulting, the 
 judgments of merit. By a system of key numbers and by 
 other devices it is possible in many experiments to reduce 
 the opportunities for bias to manifest itself. 
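 A system of key numbers might be sketched as below. The capsule counts, the seed, and the function name are hypothetical. The conditions are shuffled, each is given a neutral number, and the key connecting numbers to conditions is set aside until every test has been given and scored.

```python
import random

def assign_key_numbers(conditions, seed=None):
    """Shuffle the conditions and label each with a neutral key number.

    Returns (labels, key): `labels` is all the experimenter works with,
    and `key` maps each label to its condition and stays sealed until
    the scoring is finished.
    """
    shuffled = list(conditions)
    random.Random(seed).shuffle(shuffled)
    key = dict(enumerate(shuffled, start=1))
    return list(key), key

# Hypothetical set of ten capsules, half containing the drug.
capsules = ["strychnine"] * 5 + ["blank"] * 5
labels, sealed_key = assign_key_numbers(capsules, seed=1923)

scores = {label: None for label in labels}  # filled in while still blind
# ... administer, test, and score by key number only ...
unblinded = {label: sealed_key[label] for label in scores}
print(labels)
```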
 
 Errors Due to Bias of Assistants.—Skepticism regard-
 ing conclusions where adequate supporting data are not
 produced, and the reverse mental attitude where data are 
 produced, are eminently desirable traits. Such skepticism 
 or enthusiasm is on the increase in education, and this in- 
 crease should receive every encouragement. But there is a 
 lop-sided skepticism or enthusiasm which is really nothing 
 more than irrational prejudice. Many who pride themselves 
 upon their insistence upon proof are really priding them- 
 selves upon an irrational prejudice for one alternative, 
 usually the present practice, and an equally irrational preju- 
 dice against the other alternative. The experimenter, in 
 organizing cooperative experimentation, will meet both varie-
 ties among teachers, supervisors, superintendents, or other
 experimental assistants. There is some hope that the rational
 skeptic or enthusiast will subordinate his preferences to the 
 objects of the experiment. There is little hope that the 
 irrational individual will be able to do so. Neither variety 
 makes an ideal experimental assistant. The ideal assistant 
 is one who is genuinely uncertain as to which EF is superior. 
 
 The way to avoid bias upon the part of assistants depends 
 upon the experiment. But certain common precautions may 
 be listed. One way is to avoid assistants who have a bias, 
 or where they cannot well be avoided they may be elimi-
 nated from all computations. This avoidance or elimina-
 tion may be employed provided the experimenter has some
 objective way to determine which assistants will manifest 
 or have manifested bias. Lacking such objective data the 
 experimental assistants chosen may manifest merely the 
 experimenter’s own bias. Any assistant who confesses to a 
 preference may reasonably be assumed to hold such a pref- 
 erence. 
 
 Another way to avoid bias is to equate it. This can be 
 done, roughly at least, by using as many assistants who are 
 favorable to one EF as there are assistants favorable to the 
 other EF or EF’s. Such an equating may prove satisfac- 
 tory in experiments whose only object is to determine the 
 relative effectiveness of two or more EF’s. The procedure 
 for equating teachers or other assistants is, in general, like 
 that for equating groups of pupils. 
 
 Finally, something may be accomplished by impressing 
 upon assistants the necessity for experimental neutrality in 
 thought and deed, and by providing them with detailed type- 
 written instructions as to what to do. Few realize the 
 extraordinary difficulty of maintaining perfect self-control, 
 particularly where a preference has already developed. The 
 careless assistant is in danger of manifesting the preference 
 and the conscientious assistant of going to the other extreme. 
 The provision of detailed instructions will tend to minimize 
 such manifestations. 
 
 Bound up with this problem of bias is the whole question
 of just how much effort should be expended upon each EF.
 A fundamental principle of experimentation is that there 
 should be an accurate measurement of the amount of the 
 experimental factor. Thus in the physical sciences, a com- 
 mon procedure is to add an EF of defined amount and 
 measure the result, or subtract an EF of defined amount and 
 measure the result, or both add and subtract in succession 
 an EF of defined amount and measure the result, or both 
 add and subtract in succession an EF of varying amounts 
 and measure the changing results with each increase or 
 decrease in the amount of the EF. Probably the greatest 
 defect in educational experimentation is the inability, in 
 most cases, to measure accurately the amount of presence 
 of an EF. Further, there is some, though meager, evidence 
 that maximum effort can be maintained more constantly than 
 any effort lower than maximum. These facts and proba- 
 bilities would lead one to infer that it is better, not only 
 educationally but experimentally, to aim at maximum effort 
 all the time for each EF. 
 
 Though evidence on this question is meager, there is
 some reason to believe that the mere process of experi-
 menting with new methods or materials of instruction at-
 tracts such attention to the traits in question as to cause
 an unconscious concentration, both on the part of teacher
 and pupils, upon progress in these traits. As a result, it is
 supposed that a large temporary effort is called forth, thus
 causing a large but artificial growth, and that this artificial
 effort would evaporate if the novel methods or materials were
 used term after term. Consciousness of the possibility of
 such bias may help the experimenter to avoid it, but the 
 only sure way to determine whether ephemeral effort has 
 been evoked is to continue the experiment for a consider- 
 able period. If each succeeding term shows a flagging of 
 effort and an elimination or reduction of superiority, the 
 existence of such ephemeral effort may be assumed. 
 
 Errors Due to Differences in Teaching Skill.—Re-
 search on a large scale frequently requires cooperation on
 the part of many superintendents, supervisors, and teachers.
 My own experience in such work has been one continuous
 surprise as to the trouble members of the educational pro-
 fession will take to cooperate fully in scientific research.
 Still, one finds occasional instances of unwilling teachers or 
 superior officers. The trouble with such individuals from 
 an experimental standpoint is that they will inadequately 
 apply a particular EF and be careless about maintaining 
 desired experimental conditions in general. 
 
 Again, there are wide differences in teaching skill or 
 supervising skill. If one group is taught by an unskillful 
 teacher according to one EF and another equivalent group 
 is taught by a skillful teacher according to another EF, any 
 difference in the change produced may be due to a differ- 
 ence in teaching skill rather than a difference in effective- 
 ness of the contrasted EF’s. This difference may be due 
 to the operation of special forces or to a real difference in 
 skill. Thus one experimenter grumbles that one of his EF’s 
 did not have a fair chance because so many of the teachers 
 who were assigned to apply this particular EF turned out 
 to be bride-teachers. Another experimenter found that one 
 EF had suffered from more frequent changes of teachers 
 than the other EF. Still another experimenter found that 
 substitute teachers were more frequent under one EF than 
 another. 
 
 The experimenter must attempt, then, to avoid experi- 
 mental errors due to a difference in general unwillingness, 
 and a difference in general capability on the part of 
 assistants. 
 
 He must guard also against errors due to peculiar fitness 
 or unfitness for applying an EF. The general efficiency of 
 two teachers, for example, may be equal. But one may be 
 peculiarly unskilled in the teaching of arithmetic. This 
 special disability makes it unwise to use her for applying 
 some EF whose object is to increase pupils’ ability in arith- 
 metic. The other EF applied by the other teacher has an 
 advantage, or if the same teacher applies both EF’s, it is
 possible that her special abilities and disabilities favor one
 EF and handicap another. 
 
 Five general methods have been employed for avoiding 
 or reducing experimental errors due to a difference in, say, 
 teaching skill. One method is to equate the skill of the 
 teachers assigned to each EF. This pairing of teachers is 
 done on the basis of some pre-experimental measurement of
 each teacher’s efficiency of teaching. These measurements
 may be by means of objective tests or may be judgments of
 supervisory officers.
 
 A second method is to equate teachers by chance. To do 
 this means that the experiment must be conducted in numer- 
 ous classes to insure that chance will provide equivalence 
 in teaching skill. This method is very laborious but it
 increases the probability of securing both equivalence and 
 representativeness of teaching skill. 
 
 A third method is the departmental method, namely, to 
 have the same teacher apply both or all EF’s; then, gen- 
 erally superior teachers will be equally favorable to each 
 EF, and the generally inferior teachers will be equally un- 
 favorable to each EF. 
 
 A fourth method is to have two teachers divide the 
 work of two classes. Thus when the New York State Com- 
 mission on Ventilation was contrasting two EF’s on two 
 equivalent classes in a public school in New York City, 
 the two classes were placed in adjoining rooms, one teacher 
 teaching half the studies to both groups, and the other 
 teacher teaching the other half to both groups. 
 
 A fifth method is to rotate the teachers so that each EF 
 has every teacher. To illustrate how this can be done there 
 is repeated below the formula for a rotation experiment. 
 It may be observed that the teacher of S1 will appear under
 each EF, and the teacher of S2 will appear under each EF, 
 thereby equating any difference in general teaching skill. 
 
 S1 — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
 S2 — (IT1 — EF2 — FT1 — C3) — (IT1 — EF1 — FT1 — C4)
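 Why rotation equates teaching skill can be checked with a little arithmetic. The sketch assumes, purely for illustration, that each change is the sum of an EF effect and the skill of the group's teacher. The effect sizes and skill values are hypothetical. Under that assumption the teachers' contributions appear once under each EF and cancel from the difference.

```python
def rotation_difference(c1, c2, c3, c4):
    """Mean superiority of EF1 over EF2 in a two-group rotation.

    C1 and C4 are the changes under EF1; C2 and C3 are those under EF2.
    """
    return ((c1 + c4) - (c2 + c3)) / 2

# Hypothetical additive model: change = EF effect + teacher's skill.
ef1, ef2 = 6.0, 4.0
for teacher_s1, teacher_s2 in [(0.0, 0.0), (3.0, -1.0), (-2.0, 5.0)]:
    c1 = ef1 + teacher_s1  # S1 under EF1
    c2 = ef2 + teacher_s1  # S1 under EF2
    c3 = ef2 + teacher_s2  # S2 under EF2
    c4 = ef1 + teacher_s2  # S2 under EF1
    print(rotation_difference(c1, c2, c3, c4))
```

 Each line prints the same difference however unequal the two teachers are. The cancellation holds only for general skill. A teacher peculiarly suited to one EF would not fit the additive assumption.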
 
 
 It is useful for the experimenter to distinguish in this 
 connection two varieties of experimental situations. In one 
 variety the teacher applies the EF while giving the gen- 
 eral instruction to her class at the same time. In the 
 other variety the teacher, as before, gives the general in- 
 struction, but the specific EF is applied by some person 
 other than the teacher. If the EF’s contrasted are project 
 method and conventional method of teaching, or one method 
 of teaching spelling and another method of teaching it, it 
 is probable that the teacher will be asked to apply the EF’s. 
 Here unusual care should be exercised to equate or elimi- 
 nate any difference in teachers’ skill. If the EF’s con- 
 trasted are one type of motion picture and another type 
 of motion picture, there is considerable likelihood that the 
 experimenter himself or non-teaching assistants will apply 
 the EF’s. Here again difference in teachers’ skill may be 
 important, particularly if the motion pictures deal with 
 portions of the regular curriculum, but it is much less im- 
 portant than where the teachers apply the EF’s, because the 
 teachers will have relatively less influence upon the changes 
 of the pupils in the experimental trait. But as the teachers’ 
 importance grows less, the experimenter’s or non-teaching 
 assistants’ importance increases, in accordance with the gen- 
 eral principle stated at the opening of this chapter, namely, 
 that the importance of an irrelevant factor varies with the 
 amount of its contribution to each EF, or to the difference 
 in the amount of its contribution to the various EF’s. 
 
 Errors Due to Bias of Subjects.—Bias on the part of 
 experimental subjects is just as disturbing to an experiment 
 as bias on the part of the experimenter or his assistants. 
 Such bias comes about in many ways. A popular teacher 
 will make it known to the pupils that an experiment is under 
 way and consciously or unconsciously reveal her own pref- 
 erence. The pupils, as a consequence, will strive to make 
 the experiment come out happily for their teacher. An 
 unpopular teacher under similar circumstances provokes an 
 antagonism toward the EF which she prefers. 
 
 
 Again, a teacher, an experimenter, or certain circumstances 
 surrounding the experiment will reveal to pupils that two 
 groups are being compared. This information, apart from 
 any preference for or antagonism toward their teacher, may 
 engender an undesired rivalry between the two groups. In 
 case the information leaks out to only one group the result- 
 ing stimulus to this group might well prove decisive. 
 
 The best way for an experimenter to avoid a bias is to 
 keep himself, when possible, in ignorance of just when he 
 is applying a particular EF, or scoring tests for a particular 
 experimental group, and so on for the other experimental 
 processes where his bias would be likely to affect results. 
 The best way to avoid bias on the part of assistants is to 
 keep them in ignorance of the objectives of the experiment. 
 An experiment with two varieties of ventilation was con- 
 ducted in two schoolrooms for a full year without either of 
 the two teachers discovering just what the EF’s were. It 
 is even more important and fortunately easier to keep pupils 
 in ignorance of the nature of the EF’s and, if possible, of the 
 fact that an experiment is in progress. Certainly one group 
 should not be informed and the other kept in ignorance. 
 
 Research is such an eminently individual and original 
 process that it is well-nigh impossible to lay down certain 
 principles of procedure without calling attention to possible 
 exceptions. There are situations where it is really desira- 
 ble that pupils be informed, in a measure, that something 
 unusual is taking place. Pittman, in one of his investiga- 
 tions, went so far as to issue a bulletin to the pupils of one 
 of his two equivalent groups telling them he wished to see 
 just how much progress they could make. In an experi- 
 mental evaluation of the worth of using standard tests in 
 the teaching of reading, the writer set up for one group of 
 the experimental pupils definite objectives in reading, and gave
 them their scores on periodic tests in order that they might 
 see how nearly they were attaining these objectives. This 
 was not done for the other experimental group. And yet 
 neither Pittman nor the writer introduced thereby any con-
 stant irrelevant factor. These were legitimate portions of
 one of the EF’s. The use of a bulletin by Pittman was a 
 portion of his plan for increasing the progress of the pupils. 
 The employment of definite reading objectives and the 
 periodic reporting of scores by the writer were made possible 
 by the use of standard tests, and were some of the advan- 
 tages of the use of standard tests. Objectives and scores 
 could not be reported to the other groups, either because the 
 EF did not call for them or because standard tests were not 
 employed with them. On the other hand, it would not 
 have been legitimate for either of us to tell these same 
 experimental groups that their progress was to be compared 
 with that of another equivalent group and that we hoped 
 they would win in the contest. To do so would be to change 
 the EF by adding features peculiar to the experiment and 
 necessarily temporary. 
 
 Such an EF would not be illegitimate but it would not be 
 particularly practical. The information given certain of 
 the experimental subjects by Pittman and by the writer 
 was a normal advantage of the EF in question and was
 permanently obtainable in a practical school situation with- 
 out assuming the impractical situation of an everlasting 
 experiment. In sum, it is always legitimate to give experi- 
 mental pupils such facts as are the normal concomitants of 
 the EF in question, unless the experimenter desires to limit 
 his experimental conclusions to a narrower EF. As a mat- 
 ter of fact, the writer gave certain standard tests to the 
 pupils in his control group, thereby making it possible, had 
 he so desired, to report to them the scores made as in the 
 case of the other group. This was not done because the 
 EF for this group assumed that in a normal non-experi- 
 mental situation no standard-test scores would be available. 
 
 Errors Due to Difference in Time Allowance.— 
 When the effectiveness of two or more EF’s is being studied, 
 one EF may secure an unfair advantage over another be- 
 cause of a longer teaching or studying time on the part of 
 the pupils, or the application of their EF for a longer
 period. This may occur in many ways. The class period
 may be longer. The study which occurs at the pupil’s home 
 may be longer. Each application of the EF may be longer. 
 The total period during which the EF operates may be 
 longer. Thus, in conducting the experiment to determine 
 the relative effectiveness of employing tests in teaching read- 
 ing, the writer found it necessary to regulate the length of the 
 official reading period both for teaching and for study. In 
 his experiment to determine whether motion-picture presen-
 tation, or printed presentation, or teacher presentation, or
 various combinations of these was the most effective, Weber¹
 exercised extreme care lest the time allowance for one EF 
 exceed the time allowance for another EF. In his experi- 
 ment to determine whether supervision plus standard tests 
 were superior to supervision minus standard tests, Bennett 
 found it impossible to give all the initial tests or all the final 
 tests to all the pupils at the same time. Because of the scat- 
 tered nature of rural schools both testing periods extended 
 over several weeks. All tests were carefully dated in order 
 that the interval between initial and final tests might be kept 
 identical for every pupil. Since instruction toward the 
 close of school may be more effective than toward the be- 
 ginning, he was careful to avoid applying initial tests to 
 one group earlier, on the average, than to the other group. 
 Lacy,² in his experiments with visual, verbal, and printed
 presentation, was careful to see that the few minutes’ interval 
 between the ending of each EF and the application of the 
 final test was kept identical for all EF’s, and that the few 
 weeks’ interval between the final test and a delayed-recall 
 test was kept identical for all EF’s. In every experimental 
 situation where a time variation will favor one EF to 
 the detriment of another, the time should be kept identical, 
 unless such a variation is a desired element in an EF. 

 ¹ Weber, J. J., Relative Effectiveness of Some Visual Aids in Elementary
 Education (to be published soon).

 ² Lacy, John V., “Motion Pictures as an Educational Agency”; Teachers College
 Record, Vol. XX, No. 5.

 There is a special variety of time variation which should
 not escape the attention of the experimenter. The pupils in
 one experimental group may have a poorer attendance record 
 than those in some other group. This may be caused by an 
 excess for one group of poorer roads, longer average dis- 
 tance of homes from school, more inclement weather, more 
 contagious diseases, and the like. Consideration should be 
 given to whether the absence is toward the beginning or 
 end of year, or is continuous or intermittent. When the 
 pupils are sufficiently numerous, average attendance records 
 are usually approximately equivalent for each group. But 
 when the group is small it may be necessary to eliminate 
 from experimental computations pupils whose attendance 
 record is such as to disturb the balance between the two 
 groups. 
 
 Sometimes it is difficult to decide whether a time variation 
 is an irrelevant factor or a consequence of an EF. Pittman 
 found that the pupils in the schools which were under the 
 zone-system-of-supervision EF showed a better attendance 
 record. Instead of discounting this as an irrelevant factor 
 he credited it to the beneficent influence of the EF, because 
 there was no other observable cause. 
 
 The writer found that one method of teaching reading 
 resulted in more reading both in school and out than did 
 another EF. This extra reading was a partial or perhaps 
 entire explanation of the superior growth of these pupils. It 
 was assumed that this was not an irrelevant time variation 
 but a beneficent consequence of the EF. Tests made in 
 other subjects of the curriculum did not show that this in- 
 creased emphasis upon reading had occurred at the expense 
 of other portions of the school work. 
 
 Finally, errors may occur due to the length of time the 
 experiment runs. An experiment may be allowed to run too 
 brief a time or too long a time. It may be so brief that 
 variable errors swamp the effect of the EF’s. This is likely 
 to occur if the trait measured is one in which growth is slow 
 and cumulative. In such a situation the experiment needs to 
 continue over a long period. When the trait measured develops
 rapidly, and when the effect of the EF’s is relatively
 non-cumulative, brief experiments are preferable. The prin- 
 ciple to be kept in mind in deciding upon the time length of 
 the experiment is to secure the maximum effect of experi- 
 mental factors with a minimum effect from disturbing 
 variables. 
 
 Errors Due to Difference in Transfer.—After giving 
 a recent examination to his class in mental measurement, 
 the writer announced to the students that his efficiency as 
 a teacher of mental measurement was only 43 per cent, for 
 on the average the class had mastered only 43 per cent of 
 the procedures he had aimed to teach. One unkind student 
 increased his chagrin by remarking that a portion of that 
 43 per cent was acquired in other classes given by the 
 writer’s colleagues. In other words, there had been a trans- 
 fer from one class to another. This same sort of transfer 
 from one school activity to another is going on all the time. 
 More of it may occur in the case of one group than another, 
 thereby introducing a constant irrelevant factor. Reading 
 ability is liable in a peculiar way to be enhanced by such 
 transfer. The teacher of reading usually has a heavy obliga- 
 tion to all the other teachers, where there is departmental in- 
 struction, or a heavy obligation to all the other phases of 
 her own instruction where she is the sole teacher. Certain 
 teachers or schools give a sum total of more instruction in 
 reading during the periods officially assigned to history, 
 geography, and the like, than during the reading period
 itself. This is equivalent to giving more time to reading.
 The experimenter should not neglect these transfer possi- 
 bilities when standardizing the time allowance for each EF. 
 
 Another disturbing irrelevant factor is the transfer of 
 knowledge of how to do the experimental tests. The writer 
 found this to be of considerable significance in some experi- 
 mentation on young children. All the tests were individual 
 tests, which means that only one child could be tested at a 
 time. As soon as a child was tested he was returned to his 
 class. This gave opportunity for the other children to discover,
 in advance, something as to both the general and
 specific nature of the tests. An effort was made to reduce 
 the amount of this error by employing several examiners 
 so as to reduce the length of the total testing period, by 
 testing first those pupils who, according to the teacher’s 
 judgment, were least competent to make an intelligible re- 
 port of what occurred in the examining room, by applying 
 one test to all pupils before starting another, by urging the 
 teacher to conduct her class while a test was being given 
 so as to reduce opportunities for conferences among pupils, 
 and by condensing the total period for one test between 
 recess periods. An attempt was made to equate any error 
 not avoided by the preceding precautions by testing pupils 
 from the two groups according to the principle of alterna- 
 tion. It is much easier to avoid this irrelevant factor when 
 group tests may be employed. 
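
 The principle of alternation can be illustrated with a short sketch.
 The function name and the pupil labels below are hypothetical and are
 not drawn from the writer's records. The sketch merely shows how drawing
 pupils from the two groups in turn spreads both groups evenly over the
 testing period, so that any leakage of test knowledge falls about
 equally on each.

```python
from itertools import zip_longest


def alternate_testing_order(group_a, group_b):
    """Interleave two groups so both are spread evenly over the testing period."""
    order = []
    for a, b in zip_longest(group_a, group_b):
        # zip_longest pads the shorter group with None; skip the padding.
        if a is not None:
            order.append(a)
        if b is not None:
            order.append(b)
    return order


# Hypothetical pupil labels for two equivalent groups.
order = alternate_testing_order(["A1", "A2", "A3"], ["B1", "B2", "B3"])
print(order)
```

 Strict alternation still tests the first group one place earlier at
 each step; reversing the starting group on alternate days would be one
 way to balance even that small advantage.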
 
 When the equivalent groups are located in the same 
 school, other sorts of transfer may occur. One group may 
 catch a spark of enthusiasm from another. One group 
 may sulk because the other group has a pleasanter or sup- 
 posedly pleasanter EF. The writer is still wondering just 
 what sort of transfer occurred during a year’s experiment in 
 the Horace Mann School, conducted in collaboration with 
 Principal Pearson, Vice-Principal Hunt, and the teachers. 
 Half the teachers and half the pupils continued to teach and 
 study, respectively, a particular subject, as during the pre- 
 ceding year. The other equivalent half of the teachers 
 attempted by concentrated study to invent teaching pro- 
 cedures which would produce, with the same time allowance, 
 a greater growth than usual in their half of the pupils. 
 This program was known to half the teachers only and to 
 none of the pupils. Initial and final tests were given to 
 both groups as had been customary in previous years. To 
 our great surprise both groups had made practically identical 
 progress. Naturally this was a considerable disappointment 
 to us all. It was not until some time later that it occurred 
 to us to compare the usual progress with the progress made
 for an equal period during the experimental year. Both
 groups had made a 50 per cent greater growth than usual! 
 Somehow, some sort of transfer had occurred. 
 
 Errors Due to Bias of Tests.—There is danger that 
 tests used for the initial and final measurements will be 
 partial to one EF. Those who advocate the project method 
 in preference to the conventional method of teaching have 
 certain reservations about experiments which have been 
 conducted to date to evaluate the relative effectiveness of 
 these two educational processes. They claim, and with some 
 justification, that standard tests available for such evalua- 
 tion are partial to the conventional method. Lacy’s con- 
 clusion that verbal instruction is more effective than visual 
 instruction has been questioned by Weber on the ground 
 that Lacy’s verbal tests were partial to the verbal method. 
 To substantiate his criticism Weber devised one test like 
 Lacy’s, another in which the verbal element was reduced 
 to a minimum, and another which, in his judgment, was 
 about half-way between these two. At the time when this 
 is written, his experiments have gone far enough to show, 
 among other things, that the visual group does better on 
 the visual test and the verbal group upon the more verbal 
 test. 
 
 What has been said concerning the nature of the tests em- 
 ployed applies with equal force to the examiner who gives 
 tests, the acquaintance of pupils with the tests, instructions to 
 pupils as to how to take the test, the conditions while tests 
 are in progress, the scoring of the tests, and the statistical 
 treatment of results. In general, the same examiner should 
 give the same tests to all groups in the same way in order 
 that difference in personality of examiners, or in the stimulus 
 given to pupils, may not corrupt results. Uniformity will 
 be increased if the method of applying the test is determined 
 in advance and written down. Sometimes one group has 
 had more experience in taking tests in general. This may 
 be eliminated by supplying the deficiency. Sometimes the 
 experiment calls for intermediate tests of the same experimental
 trait with the same test that is used for the initial
 and final tests. If this applies to one group only it may 
 gain an advantage from increased acquaintance with the 
 test. Such practice effect can be reduced by the use of 
 parallel forms rather than the identical test. 
 
 Sometimes it is desirable to analyze the curriculum content
 and test content to discover the degree of correspondence
 between the two, and this is especially true when the one-group
 experimental method has been employed. It is possible
 that the arithmetic curriculum during the first semester
 may be more akin to the content of the arithmetic test used 
 than is the content of the arithmetic curriculum for the 
 second semester. Analysis of the curriculum may reveal this. 
 
 Finally, a test may be biased because it fails to take 
 account of periods of especially rapid growth, and minor or 
 major plateau periods of especially slow growth. In certain 
 traits, pupils lose during the summer vacation some of the 
 skill acquired the previous year. Usually, this loss is quickly 
 made up in the first few weeks of the fall term. When the 
 initial tests are given on the first day or two of school, the 
 EF will get the benefit, not only of the effect of the EF, 
 but also of the effect of this early spurt. 
 
 Errors Due to Bias of Other Irrelevant Factors.— 
 Various environmental factors which may prove irrelevant 
 factors have already been listed. On occasion, many others 
 may be significant. The experimenter should canvass the 
 general physical environment including such items as tem- 
 perature, humidity, ruralness, playgrounds, and the like, to 
 see if differences in these may not be significant. Thus 
 conclusions from experiments in physical geography might 
 be profoundly affected by whether one group had better 
 contacts with mountains, streams, and the like. The home 
 environment is frequently of very great importance. Some 
 children have home surroundings which encourage study, 
 home facilities which aid study, parents who give moral 
 support to the school, and parents who give actual instruction
 in school subjects in no mean amount and of no small
 worth. All such conditions, if relevant to the experiment in
 question, should be made approximately equivalent or should 
 be discounted in drawing conclusions. 
 
 Then there are errors due to difference in susceptibility 
 of pupils to the EF’s. Conclusions from an experiment 
 conducted by Norsworthy, Hillegas, McCall, and Johnson 
 were made uncertain because one of the two groups was in 
 more robust health than the other. Differences in phys- 
 ical condition, intelligence, previous training, age, sex, race, 
 and all other such personal characteristics which at times 
 condition the susceptibility of pupils are not matters easily 
 or at all subject to control during the application of the 
 EF’s. They should receive attention when experimental 
 pupils are being selected. 
 
 Experimental Log.—One necessity of experimentation 
 is an experimental log or record of dated events, of relevant 
 ideas, of the appearance of variables, and the like. It is 
 seldom safe to trust to memory circumstances which will 
 need to be recalled. Every scrap of experimental record 
 should be labeled and dated. Records should be kept as 
 though the experimental material were to be filed away for 
 several years before experimental computations were made 
 and before the experiment was described. In fact, any 
 one who does much experimentation will need to refer 
 to experimental records long after the conclusion of the 
 experiment. Further, it often becomes necessary to ask 
 others to complete an experiment one has begun. A prop- 
 erly kept experimental log quickly informs the new experi- 
 menter concerning the previous history of the experiment. 
 Norsworthy had just completed an experiment extending 
 over several years when she died. Though the writer knew 
 nothing about the experiment he was able to take up the 
 research where she left off, complete the computations, and 
 describe and publish the results. Without the experimental 
 log this would have been impossible. 
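
 A modern reader might keep such a log in a small program rather than
 on paper. The following is a minimal sketch only; the class names, the
 field names, and the sample entries are all hypothetical. It shows the
 two properties insisted upon above: every entry is labeled and dated,
 and the history can be read back in order by someone new to the
 experiment.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LogEntry:
    when: date   # every record is dated
    label: str   # every record is labeled
    note: str


@dataclass
class ExperimentLog:
    title: str
    entries: list = field(default_factory=list)

    def record(self, when, label, note):
        """Add one dated, labeled entry to the log."""
        self.entries.append(LogEntry(when, label, note))

    def history(self):
        """Return the entries in date order, whatever order they were entered in."""
        return sorted(self.entries, key=lambda e: e.when)


# Hypothetical entries, deliberately entered out of order.
log = ExperimentLog("Reading experiment (hypothetical)")
log.record(date(1922, 10, 3), "initial test", "Form A given to both groups.")
log.record(date(1922, 9, 28), "variable", "Measles reported in one class.")

for entry in log.history():
    print(entry.when.isoformat(), entry.label, "-", entry.note)
```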
 
 In an extensive experiment in the teaching of English to 
 foreigners, Courtis employed a unique device for maintaining
 desired experimental conditions and for recording
 deviations from them. First he met the teachers and gave 
 them typewritten directions concerning and training in how 
 to apply the EF, namely, a particular method of teaching 
 English to foreigners. Then he employed a group of gradu- 
 ate students in education to act as observers, there being 
 one observer for each teacher. Next he devised a form on 
 which the observer could keep a graphic time-record of just 
 what the teacher did during the lesson period. He rotated 
 the observers so that each observer saw each teacher. At 
 the conclusion of the experiment, he did not have to hope 
 that experimental conditions had been maintained. He had 
 an accurate record of the extent to which they had been 
 maintained. As a result, he was able to avoid grave errors, 
 and was able to make a much fuller use of his data. 
 
CHAPTER V 
 EXPERIMENTAL MEASUREMENTS 
 
 I. FUNCTIONS OF EXPERIMENTAL MEASUREMENTS
 
 Amount of Experimental Factors.—The first demand 
 upon experimental measurements is the exact measurement 
 of the amount of the EF’s. 
 
 The amount of certain EF’s may be measured with great 
 exactness. Among the many experiments conducted by the 
 Ventilation Commission of New York, some had for their 
 purpose to determine the mental and physical effects upon 
 school children or adults of various temperatures, humidities, 
 carbon-dioxide contents, and the like. The successful con- 
 duct and interpretation of these experiments required that 
 an exact record be kept of the temperature, humidity, and 
 carbon-dioxide content maintained in the experimental cham- 
 bers. Instruments were installed which made possible a 
 very exact record of the amount of these EF’s. 
 
 The amount of some experimental factors cannot be meas- 
 ured with such accuracy. If, for example, one experimental 
 factor is the project method, it is impossible to secure an 
 exact quantitative record of the amount of this EF, even 
 though we can be reasonably sure that it is an EF which 
 varies in amount of presence. Similarly it is difficult to 
 secure a quantitative record of the amount of a particular 
 method of teaching reading. 
 
 Though such a record is difficult to secure, the experimenter is
 responsible for reporting as best he can the amount of each EF. In
 the case of some EF’s, it may not be possible to be more definite
 than to state roughly the skill and effort of the teacher,
 the degree of coöperation of officials and parents, the adequacy
 of equipment, the amount of time during which each
 EF operated, and similar information, according to the 
 nature of the experiment. 
 
 Amount of Change Produced by Irrelevant Factors. 
 —The second demand upon experimental measurements is 
 the exact measurement of the amount of change produced 
 in the trait in question by irrelevant factors. The purpose 
 of this measurement is to make it possible to discount the 
 corrupting influence of irrelevant factors. 
 
 In certain very specific types of experimentation, it is 
 possible to measure the amount of this influence of irrele- 
 vant factors. But in most educational experimentation, 
 their individual influence is so slight as to be unmeasur- 
 able, or so subtly bound up with the EF’s that the exact 
 amount of their contribution cannot be separated from the 
 influence of the EF’s. Usually, the experimenter will find 
 it easier to eliminate or equate significant irrelevant factors 
 than to measure the amount of their contribution to the trait 
 in question. 
 
 Amount of Change Produced by Experimental Fac- 
 tors.—The third demand upon experimental measurements 
 is the exact measurement of the amount of change in the 
 trait in question produced by the EF’s. In educational ex- 
 perimentation, this is the most common and most important 
 type of experimental measurement. 
 
 II. FUNDAMENTAL CRITERIA 
 
 In common with measurements for any purpose, experi- 
 mental measurements should satisfy certain fundamental 
 criteria. They should be selected or constructed with these 
 criteria in mind. These fundamental criteria are: 
 
 1. Validity. A test is perfectly valid when it measures 
 exactly what it purports to measure. 
 
 2. Accuracy. A test is perfectly accurate when the 
 units of measurement are wholly appropriate and are abso- 
 lutely equal at all points on the scale. 
 
 3. Reliability. A test is perfectly reliable when two 
 applications of equivalent tests to the same pupil yield 
 identical scores. 
 
 4. Objectivity. A test is perfectly objective when two
 examiners using equivalent tests upon identical pupils secure 
 identical scores. 
 
 5. Norms. A test has satisfactory norms when the 
 achievement on this particular test has been determined for 
 age, grade, nationality, and any other groups a knowledge
 of whose achievement would be helpful. 
 
 6. Economy. A test should be as economical as possible 
 of the funds and time of the experimenter and the time of 
 the pupils. 
 
 Detailed suggestions to guide the experimenter in satis- 
 fying these fundamental criteria follow. Not all these sug- 
 gestions are of equal worth, nor do they all apply to a single 
 test. 
 
 III. CRITERIA FOR THE EVALUATION AND CONSTRUCTION 
 OF EXPERIMENTAL MEASUREMENTS 
 
 1. The Test Should Correspond or Correlate Closely 
 with a Valid Criterion. 
 
 A psychologist might undertake to construct a test to 
 measure mechanical ability. He could follow individuals 
 around hour by hour and day by day and score their suc- 
 cess in dealing with life’s mechanical situations. Provided 
 certain precautions were taken, most persons would accept 
 as valid the scores yielded by such an investigation. Such 
 a test may be called a criterion. 
 
 In building up such a criterion an experimenter would 
 discover very early in the process that pupil performance 
 in one practical situation may be far from a perfect index 
 of that same pupil’s performance in any other practical
 situation. One part of the criterion may not show perfect
 correspondence with another part of the criterion. 
 
 This absence of perfect correspondence between perform- 
 ance in different practical situations means that to secure 
 a satisfactory criterion, the psychologist must make a suffi- 
 cient number of observations of a pupil’s performance in a 
 sufficient number of practical situations so that the com- 
 bined results of these records will give a true picture of 
 the pupil’s mechanical ability. This means, in turn, that 
 the psychologist must observe the pupil’s performance in 
 representative mechanical situations, or, lacking any way 
 to determine what are representative situations, in a ran- 
 dom sampling of all mechanical situations. We can be 
 certain that perfect sufficiency in the criterion has been 
 secured when the criterion may be divided into two random 
 halves which show perfect correspondence. Perfect suffi- 
 ciency is rarely, if ever, attained, in the case of any criterion. 
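
 The check of sufficiency by random halves may be sketched as follows.
 This is an illustration under stated assumptions: the function names
 and the scores are hypothetical, each row holds one pupil's scores in
 the several practical situations, and correspondence between the halves
 is expressed by the ordinary product-moment coefficient of correlation.

```python
import math
import random


def pearson(x, y):
    """Product-moment coefficient of correlation between two score series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_correlation(scores, seed=0):
    """Correlate pupils' totals on two random halves of the situations.

    scores[p][s] is the score of pupil p in practical situation s.
    """
    situations = list(range(len(scores[0])))
    random.Random(seed).shuffle(situations)  # random division into halves
    half = len(situations) // 2
    first, second = situations[:half], situations[half:]
    total_a = [sum(row[s] for s in first) for row in scores]
    total_b = [sum(row[s] for s in second) for row in scores]
    return pearson(total_a, total_b)


# Hypothetical scores of five pupils in six mechanical situations.
observed = [
    [3, 4, 2, 5, 3, 4],
    [6, 5, 7, 6, 5, 7],
    [2, 1, 3, 2, 2, 1],
    [8, 7, 9, 8, 9, 7],
    [5, 6, 4, 5, 6, 5],
]
print(round(split_half_correlation(observed), 3))
```

 A coefficient near 1.00 suggests that the sampling of situations is
 approaching sufficiency; a low coefficient indicates that more
 situations, or more observations in each, are needed.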
 
 All this means in turn that most of the lay criticism of 
 mental tests is extremely superficial. The lay individual 
 observes that pupils’ performances fall considerably short 
 of perfect correspondence or even perfect correlation with 
 his observation of their performances in practical situa- 
 tions. He rarely stops to consider that his observation of 
 their performances in these practical situations may not 
 and probably will not correspond perfectly or correlate 
 perfectly with his own observation of these same pupils in 
 other practical situations. Failure of a criterion to show 
 perfect correspondence with performance in a limited num- 
 ber of practical situations may be an argument in favor of 
 the criterion. And similarly, the failure of a test to corre- 
 spond or correlate closely with a particular individual’s 
 limited and fallible observations may be more of a con- 
 demnation of the individual’s observations than of the test. 
 
 Liu¹ gives a detailed exposition of the construction and
 utilization of an intelligence criterion. This criterion has
 two major weighted components, namely, the school success
 of the pupils, and their achievement in a battery of
 previously constructed intelligence tests. The components
 of school success for each pupil are weighted school marks,
 teacher’s estimate, grade reached, and age when attaining
 this grade. The components of test achievement for
 each pupil are weighted scores in the Dearborn, Army
 Beta, Pintner, Myers, and Pressey Non-Verbal Intelligence
 Tests.

 ¹ Liu, H. C., Non-Verbal Intelligence Tests for Use in China; Bureau of
 Publications, Teachers College, Columbia University.
 
 The procedure Liu followed to determine the weight to 
 be assigned to each of these five non-verbal tests was to 
 compute for each test the per cent of third-grade pupils 
 whose scores exceeded the median score of the fourth-grade 
 pupils (grades II, III, and IV were used in Liu’s study). 
 He assumed that that test best measures intelligence which 
 most effectively separates the two grades, and, hence, that 
 the test showing the smallest per cent of overlapping should 
 receive the largest weight. The validity of this assumption 
 should be more carefully tested before we are justified in 
 accepting it finally. The per cent of overlapping and the 
 weight assigned each test were as follows: 
 
 Test          Per Cent of Overlapping   Value or Weight
 Dearborn               9.8                    15
 Army Beta             12.0                    14
 Pintner               15.2                    10
 Myers            [illegible]                   6
 Pressey               27.0                     6
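
 The per cent of overlapping, as defined in the preceding paragraph,
 may be computed as in the sketch below. The function name and the two
 score lists are hypothetical; only the definition (the per cent of
 third-grade pupils whose scores exceed the fourth-grade median) is
 taken from the text.

```python
from statistics import median


def percent_overlap(lower_grade_scores, upper_grade_scores):
    """Per cent of lower-grade pupils scoring above the upper-grade median."""
    cutoff = median(upper_grade_scores)
    above = sum(1 for s in lower_grade_scores if s > cutoff)
    return 100.0 * above / len(lower_grade_scores)


# Hypothetical scores on one test for ten pupils in each grade.
third_grade = [12, 15, 18, 20, 22, 25, 27, 30, 31, 35]
fourth_grade = [20, 24, 28, 30, 32, 34, 36, 38, 40, 44]
print(percent_overlap(third_grade, fourth_grade))
```

 With these figures the fourth-grade median is 33, and only one
 third-grade pupil in ten exceeds it. On Liu's assumption, the smaller
 this per cent, the larger the weight the test should receive.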
 
 According to a technique described in Chapter III, Liu 
 altered the variabilities among the scores for each test so 
 as to make them proportional to the desired weights. He 
 then combined the weighted scores to make the test half 
 of his criterion. 
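
 One plausible reading of this weighting step is sketched below. It is
 an assumption, not a transcription of the Chapter III technique: each
 test's scores are expressed as deviations from the mean in units of
 the standard deviation and then multiplied by the desired weight, so
 that the variability of each test in the sum is proportional to its
 weight. The weights 15 and 6 are those given above for the Dearborn
 and Myers tests; the scores are hypothetical.

```python
from statistics import mean, pstdev


def weighted_composite(scores_by_test, weights):
    """Rescale each test so its standard deviation equals its weight, then sum."""
    n_pupils = len(next(iter(scores_by_test.values())))
    composite = [0.0] * n_pupils
    for test, scores in scores_by_test.items():
        m, sd = mean(scores), pstdev(scores)
        for i, s in enumerate(scores):
            # (s - m) / sd has unit variability; the weight restores the
            # desired relative variability for this test.
            composite[i] += weights[test] * (s - m) / sd
    return composite


# Hypothetical scores of three pupils on two of the five tests.
scores = {"Dearborn": [30, 40, 50], "Myers": [12, 10, 17]}
weights = {"Dearborn": 15, "Myers": 6}
print([round(c, 2) for c in weighted_composite(scores, weights)])
```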
 
 In like manner, the four items provided by the school 
 were weighted and combined to constitute the school’s 
 half of the criterion. 
 
 Credit for grade attained by each pupil was assigned 
 as follows: 
 
 Grade Reached   2B   3A   3B   4A   4B
 Value           [values illegible in source]
 
 Credit for the age of reaching the present grade was 
 assigned as follows: 
 
 [Table of credit values by age of reaching the present grade; illegible in source.]
 
 Credit for regular school marks was assigned thus: 
 
 [Table of credit values by school mark; illegible in source.]
 
 Credit for teacher’s special estimate of pupils was as- 
 signed as follows: 
 
 [Table of credit values by teacher’s estimate; illegible in source.]
 
 Observe that, in assigning credit to the average of school 
 marks and to the teacher’s special estimate, no account 
 was taken of the pupil’s grade. A second-grade pupil 
 making an A was assigned the same number of points of 
 credit as a fourth-grade pupil making an A. This pro- 
 cedure is defensible only when the group is a fairly homo- 
 geneous one, and when the object is to construct a criterion 
 whose sole purpose is to evaluate test elements relative to 
 each other. 
 
 Finally, Liu combined his test criterion and school cri- 
 terion, giving equal weight to each. Then he computed 
 the correlation and partial correlation of each test element 
 in the five non-verbal tests with this criterion. The test 
 elements showing the largest partial correlation with the 
 criterion were selected to constitute a new test. Further- 
 more, the method of scoring the new test took account of 
 the relative value of each element of the test as an inde- 
 pendent measure of intelligence. This was accomplished by 
 the use of the regression equation technique. These techniques
 of correlation, partial correlation, and regression
 equations are discussed in detail in Chapter IX. 
 
 In the actual selection of the best test elements to put 
 into the new test battery for China, Liu was influenced by 
 such non-statistical considerations as adaptability to all 
 races equally, possibility of constructing duplicate forms of 
 each, and the like. Also he short-circuited the laborious par- 
 tial correlation technique by (a) computing the correlation 
 of each test element with the criterion, (b) choosing as basic 
 test elements the two elements which showed the highest 
 correlation with criterion and which appeared to test different 
 mental functions, and (c) selecting other tests which, by 
 trial, showed high correlations with the criterion but low 
 correlations with the basic tests and with each other. 
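
 The short-cut just described may be sketched roughly as follows. The
 sketch rests on several assumptions that should be plainly stated: the
 element names and scores are hypothetical; the two thresholds are
 arbitrary; and the judgment that two elements "test different mental
 functions" is replaced by a purely statistical stand-in, namely a low
 correlation between them. It is therefore an illustration of the
 general idea and not Liu's actual procedure.

```python
import math


def pearson(x, y):
    """Product-moment coefficient of correlation between two score series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def select_elements(elements, criterion, min_validity=0.5, max_overlap=0.5):
    """Keep elements that correlate well with the criterion but not with each other."""
    validity = {name: pearson(s, criterion) for name, s in elements.items()}
    ranked = sorted(validity, key=validity.get, reverse=True)
    chosen = []
    for name in ranked:
        if validity[name] < min_validity:
            continue  # too little correspondence with the criterion
        if all(abs(pearson(elements[name], elements[c])) <= max_overlap
               for c in chosen):
            chosen.append(name)  # adds something the chosen elements lack
    return chosen


# Hypothetical scores of four pupils on four test elements.
elements = {
    "maze": [3, 1, 7, 9],
    "maze_b": [4, 4, 6, 6],   # nearly duplicates "maze"
    "cube": [4, 6, 4, 6],
    "dots": [6, 4, 4, 6],     # low correlation with the criterion
}
criterion = [6, 8, 10, 16]
chosen = select_elements(elements, criterion)
print(chosen)
```

 Here the element that nearly duplicates an element already chosen is
 passed over, though its own correlation with the criterion is high,
 which is the economy the partial correlation technique would
 otherwise be needed to secure.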
 
 2. The Test Should Measure Comprehensively the Trait 
 in Question. 
 
 Perfect validity may be secured by so constructing the 
 test that it duplicates in form, procedure, and content the 
 criterion itself. But almost invariably this means an im- 
 practicably cumbersome test. Hence the psychologist 
 usually sacrifices some validity to convenience. He may 
 construct a test which duplicates the criterion in miniature.¹
 Or, instead of a toy representative, he may select for his
 test an actual sampling of some representative portion of
 the criterion. Or, he may construct an analogy which employs
 material which is not even similar to the material of
 the criterion but which is supposed to exercise the mental
 traits requisite for success in the criterion. Finally, he
 may attempt to find or construct an empirical test, i.e., he
 tries out many tests in the hope of discovering that one of
 these will happen to show a close correspondence with the
 criterion.

 ¹ See Hollingworth, H. L. and L. S., Vocational Psychology; D. Appleton and
 Company, New York, 1916.
 
 This question of adequacy is of particular importance to 
 the experimenter. He wishes to measure and evaluate all 
 the changes produced by each EF and not just a part of 
 them. Bryan and Harter’s ordinary measurements showed 
 that their subjects reached a plateau where a series of 
 measurements showed no further evidence of growth. The 
 use of more adequate tests showed, however, that growth 
 in certain accessory traits was continuous throughout the 
 plateau period. In experiments with project teaching and 
 the like, the adequate measurement of such accessory and 
 concomitant developments becomes a matter of primary 
 importance. It is a good rule in experimentation to test, 
 so far as possible, every aspect of the problem, and score 
 every aspect of the tests. 
 
 Adequacy in content plus practical convenience offers a 
 special problem to the test constructor. Some of those who 
 develop tests attempt to secure adequacy without sacrificing 
 convenience by taking a random sampling of the total ma- 
 terial. Thus, the words in the Starch Spelling Scale were 
 selected at random from all the non-technical words in the 
 dictionary. Others follow the social-worth principle. Thus 
 the words in the Ayres Spelling Scale are the more com- 
 monly used words. Others employ the type principle in 
 selection of test material. Thus the examples in Monroe’s 
 Diagnostic Tests in Arithmetic were so selected as to represent
 all the typical processes in the fundamentals of arithmetic.
 Others follow the statistical-difficulty procedure.
 Thus, the examples in Woody’s Arithmetic Scales were 
 selected because of their statistical behavior, i.e., those examples
 were selected which would make an equal-step ladder
 of difficulty. Various combinations of these bases of selection
 are possible. The basis or bases to be employed will
 vary with the purpose of the test and the nature of the trait 
 to be studied. 
 
 3. The Test Should be Non-coachable. 
 
 The coachability of a test may be reduced by such a selec- 
 tion and arrangement of material as will make it difficult 
 for one pupil to communicate knowledge of how to do the 
 test to another, by increasing the amount of the test ma- 
 terial, by the preparation of several equivalent forms of 
 the test, and by providing that those pupils will be tested 
 first who are least able to report the content of the test. 
 
 4. The Test Should be Free from Ambiguities and 
 Other Irrelevancies. 
 
 Even when the content of a test is satisfactory, the form 
 and procedure of the test require careful scrutiny. All sorts 
 of irrelevancies may subtract from validity. The test
 material may be in question form when greater validity
 might be secured by employing the classification, completion,
 matching, or manipulation form. The general conditions
 under which the test is to be given may detract from valid- 
 ity. The instructions which accompany the test may de- 
 mand too much linguistic ability or may be otherwise 
 unsuitable. The nature of the response demanded of the 
 pupil may require too much writing ability, muscular 
 strength, or the like. The test may be so long as to meas- 
 ure fatigue instead of the trait desired, or so short as to 
 be unreliable or unsuited to measure the speed of adjust- 
 ment to the test. It may be so arranged as to measure 
 the pupil’s honesty rather than his ability. The scoring 
 provided for may be crude, or may concern insignificant 
 phases of the pupil’s performance. Ambiguities or other 
 irrelevancies may appear at various stages. 
 
 5. The Elements of the Test Should Be Weighted in 
 the Optimum Manner. 
 
 In practice, few tests have as yet been validated in any 
 adequate way. The tests are usually assumed to measure
 what they appear to measure. In time every person who 
 proposes a test will be obligated to report the degree of 
 correspondence between test scores and criterion scores. 
 This correspondence is usually determined by computing 
 the coefficient of correlation between these two series of 
 scores. The procedure for computing and interpreting a 
 coefficient of correlation is described in Chapter IX. 
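
 As a rough illustration of the correspondence just described, the
 following sketch computes a product-moment coefficient of correlation
 between a series of test scores and a series of criterion scores. The
 function name and the six pairs of scores are invented for the
 illustration; the full procedure and its interpretation belong to
 Chapter IX.

```python
from math import sqrt


def correlation(test_scores, criterion_scores):
    """Product-moment correlation between two equal-length score series."""
    n = len(test_scores)
    if n != len(criterion_scores) or n < 2:
        raise ValueError("need two series of equal length, two pupils or more")
    mean_t = sum(test_scores) / n
    mean_c = sum(criterion_scores) / n
    dev_t = [t - mean_t for t in test_scores]
    dev_c = [c - mean_c for c in criterion_scores]
    # Sum of cross-products of deviations, divided by the geometric mean
    # of the two sums of squared deviations.
    cross = sum(a * b for a, b in zip(dev_t, dev_c))
    spread = sqrt(sum(a * a for a in dev_t) * sum(b * b for b in dev_c))
    return cross / spread


# Hypothetical scores for six pupils (not taken from any real test).
test_scores = [12, 15, 9, 20, 17, 11]
criterion_scores = [30, 34, 25, 41, 39, 27]
r = correlation(test_scores, criterion_scores)
print(round(r, 3))
```

 The sketch does not guard against a series in which every pupil makes
 the same score; in that case no correlation can be computed.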
 
 It frequently happens, however, that the correspondence 
 between test and criterion can be measurably increased by 
 determining and utilizing in scoring, the optimum weights 
 for the various parts of the total test, especially when the 
 total test is composed of subordinate tests which differ 
 somewhat in nature. These weights may be determined 
 statistically by means of the partial correlation and regression
 equation techniques. These techniques also are
 discussed in Chapter IX.
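
 The idea of optimum weights may be sketched for the simplest case of a
 total test made up of two subordinate tests. The sketch below finds the
 least-squares weights that make the weighted sum of the two subtest
 scores agree as closely as possible with the criterion. It is offered
 only as one plausible reading of "regression equation technique"; the
 function name and the scores are assumptions made for the illustration.

```python
def optimum_weights(subtest_1, subtest_2, criterion):
    """Least-squares weights (w1, w2) for predicting the criterion
    from two subtest scores, found from the two normal equations."""
    n = len(criterion)
    m1 = sum(subtest_1) / n
    m2 = sum(subtest_2) / n
    mc = sum(criterion) / n
    d1 = [x - m1 for x in subtest_1]
    d2 = [x - m2 for x in subtest_2]
    dc = [y - mc for y in criterion]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1c = sum(a * y for a, y in zip(d1, dc))
    s2c = sum(b * y for b, y in zip(d2, dc))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("the two subtests are perfectly correlated")
    w1 = (s1c * s22 - s2c * s12) / det
    w2 = (s2c * s11 - s1c * s12) / det
    return w1, w2


# Hypothetical scores: the criterion is built as 2 * first + 0.5 * second,
# so the recovered weights should be close to 2 and 0.5.
first = [1, 2, 3, 4, 5]
second = [2, 1, 4, 3, 6]
criterion = [2 * a + 0.5 * b for a, b in zip(first, second)]
print(optimum_weights(first, second, criterion))
```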
 
 6. The Test Should Be So Constructed That the Pupil’s 
 Reactions Will Be as Abbreviated as Possible. 
 
 Satisfaction of this criterion makes for economy and 
 objectivity of scoring. Frequently an abbreviated reaction, 
 such as a word, number, or check, will yield as valid¹ a
 measure of the pupil’s ability as a much more complicated
 reaction.
 
 7. The Test Should Be So Constructed That the Pupil’s 
 Abbreviated Answers Will Be Controlled. 
 
 If any one of many different abbreviated answers is 
 correct, or if the spatial location of the pupil’s answers is 
 uncontrolled, the probable result will be uneconomical, in- 
 accurate, and subjective scoring. Furthermore, it will prove 
 difficult in this case to employ mechanical scoring devices. 
 When the nature of the test permits, it is well to have pupils’ 
 answers recorded along the right-hand margin of the test 
 sheet. This permits the experimenter to lay a correctly- 
 filled test sheet beside the pupil’s answers and determine 
 correctness or incorrectness by a simple visual comparison. 
 
 1 Gates, Arthur I., “The True-False Test as a Measure of Achievement in College
 Courses”; Journal of Educational Psychology, May, 1921.
 
 When marginal answers are not feasible, spatial location 
 may be so controlled as to permit the use of a perforated 
 test sheet or a celluloid scoring device. 
 
 8. The Test Should Be So Constructed as to Permit Its 
 Use Both with One Pupil and with a Group of Pupils. 
 
 It is claimed that when a test is given to one pupil at a 
 time the results are more reliable than when a pupil is tested 
 in a group. However, questions of time, economy, and the 
 prevention of the spread among untested pupils of informa- 
 tion as to the nature of the test practically require group 
 testing, for most experimental situations. 
 
 9. Test Instructions Should Be as Brief as Is Consistent 
 with an Adequate Understanding of What Is to Be Done. 
 
 Long instructions tend to produce confusion in the minds 
 of the pupils, and even of experimenters themselves if they 
 are inexperienced. But adequacy should not be sacrificed 
 to brevity. Particular care should be exercised to see that 
 no key points are omitted. 
 
 10. Instructions Should Employ a Demonstration and 
 Preliminary Test. 
 
 It is easier to imitate than to comprehend and follow lin- 
 guistic directions. Both demonstration and preliminary 
 test may be given on the blackboard or may be printed on 
 the test sheet. The latter is preferable. 
 
 11. Instructions Should Be Adapted to and Uniform for 
 All Who Are to Be Tested. 
 
 It is feasible to find words sufficiently simple for young 
 pupils and which are also sufficiently dignified for older 
 pupils. Also it is possible so to prepare instructions that 
 they will be uniform and equally fair to all experimental 
 groups irrespective of their environment. 
 
 The importance of universalizing the test applies with as 
 much force to the test material as to the instructions. In 
 less than a year after their publication, the Thorndike- 
 McCall Reading Scales were in use in England, China, and 
 other foreign countries. Unfortunately, the authors were 
 so provincial in their outlook that minor revisions must be 
 made before they can be used to greatest advantage in
 countries other than the United States. They could have 
 been approximately internationalized from the beginning 
 without impairing their value for this country. 
 
 12. The Order of Instruction Should Be the Order of 
 Execution. 
 
 There are abundant reasons for believing that it is easier 
 for pupils to follow instructions when the sequence of 
 instructions is the sequence of action expected from the 
 pupils. 
 
 13. Instruction Should Be Broken into Action Units. 
 
 As soon as a natural unit of instruction has been given, 
 the pupil should be directed to carry out these directions 
 before another unit is given. This is especially important 
 where the instructions are necessarily long and complicated. 
 Any other procedure taxes too heavily the pupil’s memory. 
 
 14. Instructions Should Equalize Interest. 
 
 Interest should be equalized not only for all experi- 
 mental groups but for the pupils in each group. Probably 
 it is easier to secure this equalization on a high interest 
 plane than on a low plane. As a rule it is best to induce 
 each pupil to do the best he can. 
 
 15. The Test Should Be So Easy That Each Pupil Will 
 Make a Score above Zero. 
 
 Two pupils who make zero scores appear to be of like 
 ability, whereas the amount of instruction required to lift 
 both above zero might be one month in the case of one 
 pupil and twenty-four months in the case of the other. 
 Obviously to call these pupils equivalent and to pair them 
 for experimental purposes would give a special advantage to 
 the experimental group receiving the one-month pupil. For 
 at the final test, this pupil might show marked improvement 
 while the other would be still making zero. With a prop- 
 erly constructed test with equal units at all points on the 
 scale, the twenty-four-month pupil might be shown to have 
 made greater growth than the one-month pupil. 
 
 16. The Test Should Be So Difficult That No Pupil 
 Will Make a Perfect Score.
 
 All perfect-score pupils look alike just as all zero pupils 
 look alike. A properly constructed test might reveal wide 
 differences of ability. Furthermore, a final test, even though 
 it be more difficult than the initial test, cannot reveal cor- 
 rect improvement scores for such perfect-score pupils. 
 
 17. The Test Should Have No Undistributed Scores. 
 
 Besides undistributed zero and perfect scores it is possi- 
 ble to have undistributed intermediate scores. Coarse 
 scoring, or tests which yield a few degrees of merit only, 
 automatically cause undistributed intermediate scores. 
 Pupils are made to appear of like ability when, by a finer 
 scoring or by a finer test, they would appear quite unlike. 
 The number of degrees of merit which a test should reveal 
 depends upon the homogeneity of the group being tested, 
 but, as a rule, tests should be so constructed as to separate 
 the pupils into not less than seven groups of ability and, if 
 the data are to be used for correlation, into not less than 
 thirteen ability groups. 
 
 18. A Test Should Yield a Statistical Score.
 
 It is unfortunate that the custom ever grew up of report- 
 ing scores in terms of letters, words, or phrases. These 
 must be converted into statistical terms before they are 
 susceptible of necessary quantitative treatment. 
 
 19. The Test Should Yield Absolute Rather Than, or in
 Addition to, Relative Scores. 
 
 Teachers’ marks are relative scores, that is, relative to the group
 in question. An able pupil in Grade I will receive a mark 
 of A. When this same pupil reaches Grade VIII, he will 
 be making a score no higher than A. He stands, in fact, 
 a good chance of making a score less than A, even when 
 his absolute ability has markedly increased and his relative 
 status has remained unchanged. Relative tests cannot easily 
 be used to measure improvement. 
 
 20. The Test Should Be Scaled So That Units of Meas- 
 urement Will Be Equal at All Points on the Scale and the 
 Method of Combining Units Will Be Simple and Appro- 
 priate. 
 
 Evaluation of Scaling Methods.—The need for equal- 
 ity of units is shown in Table 4. 
 
 TABLE 4

 SHOWING THE NEED FOR EQUAL UNITS OF MEASUREMENT
 (R = RIGHT. W = WRONG)

 Problem       1    2    3    4    5    6    7    8    Score (Number of
                                                        Problems Solved)
 Difficulty    1    2    3    3.1  3.2  3.3  [?]  4
 Pupil A       R    R    R    W    W    W    W    W    3
 Pupil B       R    R    R    R    R    R    W    W    6
 
 Pupil A solves three problems correctly. His unscaled 
 score is, therefore, 3, as shown in the table. Pupil B solves 
 six problems. His unscaled score is 6, as shown. Employ- 
 ing unscaled units of measurement in this manner makes 
 Pupil B appear much more competent in comparison with 
 Pupil A than he really is. The difficulty of solving six prob- 
 lems, namely 3.3, is only slightly above the difficulty of 
 solving three problems, namely 3. A very small superiority 
 of ability on the part of Pupil B enabled him to double his 
 unscaled score. The use of equal units of difficulty gives 
 Pupil A a score of 3 and Pupil B a score of 3.3. 
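
 The contrast between unscaled and scaled scores in Table 4 may be put in
 a few lines of code. This is only a sketch: the names are invented, the
 scaled score is taken to be the difficulty of the hardest problem
 solved, and the difficulty of problem 7 is omitted because it is not
 legible in the table.

```python
# Difficulty of each problem as read from Table 4 (problem 7 omitted).
DIFFICULTY = {1: 1.0, 2: 2.0, 3: 3.0, 4: 3.1, 5: 3.2, 6: 3.3, 8: 4.0}


def unscaled_score(solved):
    """Number of problems solved, every problem counting one point."""
    return len(solved)


def scaled_score(solved):
    """Difficulty of the hardest problem solved."""
    return max(DIFFICULTY[problem] for problem in solved)


pupil_a = [1, 2, 3]
pupil_b = [1, 2, 3, 4, 5, 6]

# Unscaled, Pupil B appears twice as able as Pupil A (6 against 3);
# scaled, the difference shrinks to 3.3 against 3.
print(unscaled_score(pupil_a), unscaled_score(pupil_b))
print(scaled_score(pupil_a), scaled_score(pupil_b))
```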
 
 Many methods¹ of varying worth have been proposed
 for scaling mental tests. One method—the grade-scale 
 method—is to determine the difficulty of each separate prob- 
 lem, question, or other test element on the basis of the 
 achievement of school grades, and then to compute a pupil’s 
 score by combining the scale values of the test elements done 
 correctly. 
 
 To call a pupil’s score the scale value of the most diffi- 
 cult test element done correctly is subject to the objection 
 that pupils are unable frequently to do correctly test ele- 
 ments of less scale value. Depending as it does upon a single 
 test element, the score would also be rather unreliable. The
 only satisfactory procedure thus far devised to meet these
 two difficulties is too complicated for practical use.

 1 For a detailed evaluation see McCall, Wm. A., How to Measure in Education,
 Chapters IX and X; Macmillan Company, New York, 1922.
 
 On the other hand, to call a pupil’s score the sum of the 
 scale values of the test elements done correctly is somewhat 
 laborious, and, in addition, is subject to the criticism that 
 a score yielded by such a cumulative total shows the num- 
 ber of units of work done rather than the ability level 
 reached. It would be like measuring a man’s lifting strength 
 by adding the weights of a variety of weights lifted. The 
 preceding simple-total procedure appears preferable. The 
 man’s lifting strength, according to the simple-total pro- 
 cedure, would be the weight of the heaviest object the man 
 could barely lift. 
 
 For the foregoing reasons, the drift is away from the 
 scaling of the separate test elements, except in a rough 
 way for the purpose of arranging test elements in an 
 approximate order of difficulty. The drift is in the direction
 of scaling, i.e., determining the difficulty of doing correctly
 a given number of the test elements in a given test.
 Stated differently, the drift is toward scaling total scores 
 instead of test elements. 
 
 The three most promising methods that have been pro- 
 posed for scaling total scores are the percentile scale, age 
 scale, and T scale. 
 
 In the case of the percentile scale, the smallest number 
 of points made on the test in question by any pupil of the 
 group used as the basis for scaling is scored zero, the num- 
 ber of points below which are one per cent of the pupils is 
 scored 1, the number of points below which are two per 
 cent of the pupils is called 2, and so on to the highest num- 
 ber of points made by any pupil which is scored 100. 
 
 This method assumes that the difference in ability between
 a pupil who makes a zero-percentile score and a pupil who
 makes a 10-percentile score is the same as the difference
 between a pupil who makes a 40-percentile score and a pupil
 who makes a 50-percentile score. It is rather generally
 conceded, however, that the former difference is actually
 much greater
 than the latter difference, and that therefore the units are 
 not equal in the truest sense at all parts of the scale. 
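
 The inequality of percentile units can be made concrete if one is
 willing to assume, as the T scale does, that ability is distributed
 normally. Under that assumption the sketch below converts two equal
 percentile steps into S. D. distances. The 1st percentile stands in for
 the zero percentile, which lies infinitely far out on a normal curve;
 the function name is invented for the illustration.

```python
from statistics import NormalDist


def sd_gap(lower_percentile, upper_percentile):
    """S. D. distance between two percentile points of a normal curve."""
    unit_normal = NormalDist()  # mean 0, S. D. 1
    return (unit_normal.inv_cdf(upper_percentile / 100)
            - unit_normal.inv_cdf(lower_percentile / 100))


low_gap = sd_gap(1, 10)    # a step near the bottom of the scale
mid_gap = sd_gap(40, 50)   # a step near the middle of the scale
print(round(low_gap, 2), round(mid_gap, 2))
```

 On these assumptions the step near the bottom of the scale covers
 roughly four times the S. D. distance of the step near the middle.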
 
 In the case of the age scale, the mean number of points 
 made on the test in question by unselected eight-year-old 
 pupils is scored 8. The mean number of points made by 
 nine-year-olds is scored 9, and so on. Intermediate scores 
 are given also. 
 
 A vital defect of this scale is the almost insuperable dif- 
 ficulty of locating and testing unselected pupils below the 
 age of eight or nine and above the age of thirteen or four- 
 teen. Large sections of the former group have not left the 
 social group to enter the school and of the latter group 
 have left the school to return to the social group. Again, 
 growth ceases or actually recedes in some traits after the 
 age of thirteen, fourteen, or thereabouts. Quality of hand- 
 writing, and speed and accuracy of addition are probable 
 illustrations of recessions. No one has proposed a satis- 
 factory way of handling a situation when the mean number 
 of points made by, say, thirteen-year-olds is 20, and that 
 made by fourteen-year-olds is 18. Finally, it is generally 
 believed that the actual growth between ages eight and 
 nine, say, is greater than between thirteen and fourteen. 
 This belief does not have evidential support, for it is 
 impossible to say that the units on one scale are unequal 
 without assuming the equality of units on some other 
 criterion scale. The foregoing criticisms, even excluding 
 the third, mean that the age scale is inappropriate 
 except within a narrow range of ability and for certain 
 mental traits. 
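
 The difficulty created by a recession in the age norms may be seen in a
 small sketch. Age scores are here obtained by straight-line
 interpolation between the mean points of adjacent ages, which is an
 assumption about how "intermediate scores" are given. The means of 20
 for thirteen-year-olds and 18 for fourteen-year-olds are those of the
 illustration above; the mean of 16 for twelve-year-olds and the second
 set of norms are hypothetical.

```python
def age_scores(points, norms):
    """Every age at which the interpolated age norm equals `points`.

    norms maps an age to the mean number of points made at that age.
    """
    ages = sorted(norms)
    found = []
    for younger, older in zip(ages, ages[1:]):
        low, high = norms[younger], norms[older]
        if low == high:
            continue  # flat stretch: no single age can be assigned
        fraction = (points - low) / (high - low)
        if 0 <= fraction <= 1:
            found.append(younger + fraction * (older - younger))
    return found


# Norms that rise steadily give one age score.
print(age_scores(12, {8: 10, 9: 14, 10: 17}))

# Norms that recede after thirteen give two competing age scores
# for the same number of points.
print(age_scores(19, {12: 16, 13: 20, 14: 18}))
```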
 
 The T scale is believed to be superior to any of the pre- 
 viously described methods. It was constructed for the 
 purpose of embodying their virtues and eliminating their 
 defects. It scales the total score. It employs the simple 
 total. It allows each test element done to affect the scale 
 score, thereby increasing reliability. Its units are equal 
 in the generally accepted sense at all points on the scale. 
 It covers a wide range of ability and may be extended if
 necessary. The process of scaling is as simple as any, and 
 so is the computation of a pupil’s scale score. 
 
 The age scale by permitting the computation of quotients 
 such as Intelligence Quotients, Reading Quotients, Accom- 
 plishment Quotients, and the like, has had a decided prac- 
 tical advantage over the T scale, though the age scale may 
 be, and is now being, used as a secondary scale in conjunc- 
 tion with the T scale to permit the computation of quotients. 
 A procedure has just been devised, and will be described in 
 this chapter, whereby the T scale alone can secure these 
 special advantages of the age scale and that in a more eco- 
 nomical way. 
 
 The relative merits of the four most commonly used 
 scaling methods are summarized where they may be seen at 
 a glance in Table 5. This table assumes that the latest 
 improvements on each scaling procedure have been em- 
 ployed. The scoring of the scales is necessarily somewhat 
 subjective. After an elaborate discussion of the various 
 scale systems, a colleague in this field scored the systems 
 and arrived at results closely similar to those given in 
 Table 5. 
 
 The total scores of 29, 23, 22, and 11, give a rough but 
 only a rough index of the relative merits of the four scale 
 systems. Some of the criteria are far more significant than 
 others. The convenience and definiteness of the reference 
 point is so important that the deficiency of the grade scale 
 is very serious. The equality of units is even more impor- 
 tant. The deficiency of the age scale and percentile scale 
 at this point practically means that they cannot well be 
 adopted as permanent scaling systems. The additional de- 
 ficiency of the age scale on width of range of scale is fatal, 
 because both these defects are inherently uncorrectable. 
 Its deficiencies in ease of scaling tests and of computing
 pupil scale scores fatally indict the grade scale for other
 than scientific purposes.
 
 Borrowing and combining as it does the desirable features 
 of the other three scale systems, the T scale satisfactorily
 meets every criterion except one. At the present time it is
 easier for the uninitiated to understand, or at least to think
 they understand, the age-scale or percentile-scale units than
 the T-scale units. This is not, however, a permanent defect.
 When the T scale has come into general use,
 the T will be comprehended almost as easily as an age or 
 a percentile. 
 
 TABLE 5

 SHOWING THE RELATIVE MERITS OF THE FOUR COMMONLY USED SCALE METHODS.
 SATISFACTORY PROVISION FOR A CRITERION = 2. FAIRLY SATISFACTORY = 1.
 UNSATISFACTORY = 0.

                                                          T      Age    Percentile  Grade
 Criteria                                                 Scale  Scale  Scale       Scale
  1. Definiteness and convenience of reference point      2      2      1           0
  2. Equality of units                                    2      0      0           2
  3. Width of range of scale                              2      0      2           2
  4. Reliability of scale scores                          2      1      1           2
  5. Permanence of scale                                  2      2      2           1
  6. Conventionality of scale units                       2      2      2           2
  7. Lay interpretability of scale scores                 1      2      2           0
  8. Internationality of scale units                      2      2      1           0
  9. Comparability of scores on various scales            2      2      1           1
 10. Method of combining units                            2      2      2           0
 11. Ease of computing scores                             2      2      2           0
 12. Permits the quotient techniques                      2      2      0           0
 13. Ease of scaling test                                 2      1      2           0
 14. Utilization of all scaled material                   2      2      2           1
 15. Ease of preparing duplicate scales                   2      1      2           0
     Total                                               29     23     22          11
 
 Construction of T Scale.—The detailed process of constructing
 a T scale has been published.¹ A summary will
 suffice for this book. Table 6 illustrates the process. The
 second column shows the number of unselected 12-year-old
 children answering correctly the number of questions indicated
 in the first column. It is recommended that unselected
 12-year-olds (12.0-13.0) be used for scaling tests which are
 to be used generally. If any other age is used it should be
 indicated by a subscript, thus, T11 or T13 or T16 in all
 publications. For experimental purposes the experimenter
 may use the group or groups upon which he is experimenting.
 The third column shows the number of pupils exceeding
 plus half those reaching each total number of questions
 correct. Thus the number of pupils exceeding 33 is 0. Half
 those reaching 33 is 0.5. The sum of 0 and 0.5 is 0.5 as
 shown in the third column. The number exceeding 32 is 1.
 Half those reaching 32 is 0.5. The sum of 1 and 0.5 is 1.5
 as shown. The number exceeding 31 is 2. Half those
 reaching 31 is 1.5. The sum of 2 and 1.5 is 3.5, and similarly
 for other results shown in the third column. Since
 there are 500 pupils in the group used for scaling, the fourth
 column is obtained by dividing the results in the third
 column by 500 and by expressing the quotients as per cents.
 Were the fourth column inverted the first and fourth columns
 would constitute a percentile scale. The fifth column
 gives the T score, and is found by converting the per cents
 in the fourth column by means of Table 7. Thus a per
 cent of 99.7 corresponds to 22.5 or, for convenience, 23.

 1 See McCall, Wm. A., How to Measure in Education, Chapter X; Macmillan
 Company, New York, 1922.

 TABLE 6
 SHOWING HOW TO SCALE TOTAL SCORES

 Total Number   Number of      Number Exceeding   Per Cent Exceeding   Scale
 of Questions   12-Year-Olds   Plus Half Those    Plus Half Those      Score
 Correct                       Reaching           Reaching
      0              3             498.5               99.7              23
      1              1             496.5               99.3              25
      2              2             495.0               99.0              27
      3              1             493.5               98.7              28
      4              2             492.0               98.4              29
      5              2             490.0               98.0              29
      6              2             488.0               97.6              30
      7              2             486.0               97.2              31
      8              4             483.0               96.6              32
      9              2             480.0               96.0              32
     10              2             478.0               95.6              33
     11             10             472.0               94.4              34
     12              3             465.5               93.1              35
     13              8             460.0               92.0              36
     14              8             452.0               90.4              37
     15             13             441.5               88.3              38
     16             15             427.5               85.5              39
     17             18             411.0               82.2              41
     18             28             388.0               77.6              42
     19             26             361.0               72.2              44
     20             34             331.0               66.2              46
     21             40             294.0               58.8              48
     22             40             254.0               50.8              50
     23             41             213.5               42.7              52
     24             37             174.5               34.9              54
     25             31             140.5               28.1              56
     26             35             107.5               21.5              58
     27             24              78.0               15.6              60
     28             26              53.0               10.6              62
     29             21              29.5                5.9              66
     30             14              12.0                2.4              70
     31              3               3.5                0.7              75
     32              1               1.5                0.3              78
     33              1               0.5                0.1              81
     34              0                                                   85
     35              0                                                   90
 
 The first column in Table 6 shows the number of test 
 elements done correctly, where each element done counts 
 one point. The process of scaling is the same whether each 
 element done correctly gives a credit or penalty of one point, 
 two points, or any number of points, or a different number 
 of points for different elements. Thus in scoring compositions,
 the scorer may wish to penalize one point for each
 error in punctuation, and two points for each error in choice 
 of words. If penalties instead of credits are used the first 
 column should be inverted, i.e., large quantities should ap- 
 pear at the top. 
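
 The scaling process of Table 6 can be sketched in code. The frequencies
 below are those of the second column of Table 6. In place of a lookup in
 Table 7, the sketch assumes a normal distribution and computes
 T = 50 + 10 z directly, z being the S. D. distance above which the given
 per cent of pupils lie; this is an assumption about how Table 7 was
 built, and the unrounded values it yields will not always round exactly
 as the printed scale scores do. The function name is invented.

```python
from statistics import NormalDist

# Number of 12-year-olds answering 0, 1, 2, ... 35 questions correctly.
FREQS = [3, 1, 2, 1, 2, 2, 2, 2, 4, 2, 2, 10, 3, 8, 8, 13, 15, 18, 28, 26,
         34, 40, 40, 41, 37, 31, 35, 24, 26, 21, 14, 3, 1, 1, 0, 0]


def t_scale(freqs):
    """Return rows of (score, frequency, number exceeding plus half those
    reaching, per cent exceeding plus half those reaching, T score)."""
    total = sum(freqs)
    exceeding = total
    rows = []
    for score, frequency in enumerate(freqs):
        exceeding -= frequency            # pupils with a higher score
        count = exceeding + frequency / 2
        per_cent = 100 * count / total
        if 0 < per_cent < 100:
            t = 50 + 10 * NormalDist().inv_cdf(1 - per_cent / 100)
        else:
            t = None  # no pupil reached this score; it must be extrapolated
        rows.append((score, frequency, count, per_cent, t))
    return rows


rows = t_scale(FREQS)
for score, frequency, count, per_cent, t in rows:
    print(score, frequency, count, round(per_cent, 1),
          None if t is None else round(t, 1))
```

 For scores of 34 and 35, which no pupil reached, the sketch returns no
 T score; Table 6 supplies 85 and 90 for these.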
 
 TABLE 7

 SHOWING THE S. D. DISTANCE OF A GIVEN PER CENT ABOVE ZERO. EACH S. D.
 VALUE IS MULTIPLIED BY 10 TO ELIMINATE DECIMALS. THE ZERO
 POINT IS 5 S. D. BELOW THE MEAN. S. D. VALUE EQUALS T.

 S. D.  Per          S. D.  Per      S. D.  Per      S. D.  Per
 Value  Cent         Value  Cent     Value  Cent     Value  Cent
 0      99.999971    25     99.38    50     50.00    75     0.62
 0.5    99.999963    25.5   99.29    50.5   48.01    75.5   0.54
 1      99.999952    26     99.18    51     46.02    76     0.47
 1.5    99.999938    26.5   99.06    51.5   44.04    76.5   0.40
 2      99.99992     27     98.93    52     42.07    77     0.35
 2.5    99.99990     27.5   98.78    52.5   40.13    77.5   0.30
 3      99.99987     28     98.61    53     38.21    78     0.26
 3.5    99.99983     28.5   98.42    53.5   36.32    78.5   0.22
 4      99.99979     29     98.21    54     34.46    79     0.19
 4.5    99.99973     29.5   97.98    54.5   32.64    79.5   0.16
 5      99.99966     30     97.72    55     30.85    80     0.13
 5.5    99.99957     30.5   97.44    55.5   29.12    80.5   0.11
 6      99.99946     31     97.13    56     27.43    81     0.097
 6.5    99.99932     31.5   96.78    56.5   25.78    81.5   0.082
 7      99.99915     32     96.41    57     24.20    82     0.069
 7.5    99.9989      32.5   95.99    57.5   22.66    82.5   0.058
 8      99.9987      33     95.54    58     21.19    83     0.048
 8.5    99.9983      33.5   95.05    58.5   19.77    83.5   0.040
 9      99.9979      34     94.52    59     18.41    84     0.034
 9.5    99.9974      34.5   93.94    59.5   17.11    84.5   0.028
 10     99.9968      35     93.32    60     15.87    85     0.023
 10.5   99.9961      35.5   92.65    60.5   14.69    85.5   0.019
 11     99.9952      36     91.92    61     13.57    86     0.016
 11.5   99.9941      36.5   91.15    61.5   12.51    86.5   0.013
 12     99.9928      37     90.32    62     11.51    87     0.011
 12.5   99.9912      37.5   89.44    62.5   10.56    87.5   0.009
 13     99.989       38     88.49    63     9.68     88     0.007
 13.5   99.987       38.5   87.49    63.5   8.85     88.5   0.0059
 14     99.984       39     86.43    64     8.08     89     0.0048
 14.5   99.981       39.5   85.31    64.5   7.35     89.5   0.0039
 15     99.977       40     84.13    65     6.68     90     0.0032
 15.5   99.972       40.5   82.89    65.5   6.06     90.5   0.0026
 16     99.966       41     81.59    66     5.48     91     0.0021
 16.5   99.960       41.5   80.23    66.5   4.95     91.5   0.0017
 17     99.952       42     78.81    67     4.46     92     0.0013
 17.5   99.942       42.5   77.34    67.5   4.01     92.5   0.0011
 18     99.931       43     75.80    68     3.59     93     0.0009
 18.5   99.918       43.5   74.22    68.5   3.22     93.5   0.0007
 19     99.903       44     72.57    69     2.87     94     0.0005
 19.5   99.886       44.5   70.88    69.5   2.56     94.5   0.00043
 20     99.865       45     69.15    70     2.28     95     0.00034
 20.5   99.84        45.5   67.36    70.5   2.02     95.5   0.00027
 21     99.81        46     65.54    71     1.79     96     0.00021
 21.5   99.78        46.5   63.68    71.5   1.58     96.5   0.00017
 22     99.74        47     61.79    72     1.39     97     0.00013
 22.5   99.70        47.5   59.87    72.5   1.22     97.5   0.00010
 23     99.65        48     57.93    73     1.07     98     0.00008
 23.5   99.60        48.5   55.96    73.5   0.94     98.5   0.000062
 24     99.53        49     53.98    74     0.82     99     0.000048
 24.5   99.46        49.5   51.99    74.5   0.71     99.5   0.000037
                                                     100    0.000029

 Increasing the Range of a T Scale.—The range of a T scale
 based on 12-year-olds is much wider
 than the inexperienced individual would suspect. In a
 continuous function like reading, such a T scale will measure
 first-grade pupils and most university students. Of
 course, these extreme measurements will be more unreliable
 than those nearer the center of the distribution for
 12-year-olds. In certain non-continuously-taught functions like
 algebra, or even in functions like reading, it may be desirable
 to widen the range that 12-year-olds would yield. This
 can be done by repeating the process shown in Table 6 for,
 say, 9-year-olds and 16-year-olds who are in high school
 and elementary school, or just in high school, and by combining
 the results obtained with the results for 12-year-olds.
 Table 8 illustrates a rough method for effecting such a
 combination.
 
 TABLE 8
 SHOWING HOW TO WIDEN THE RANGE OF A T SCALE

 Problems                         Final
 Correct     T9     T12    T16    T Scale
     0       32                     22
     1       36                     26
     2       40                     30
     3       43     33              33
     4       46     35              35
     5       48     38              38
     6       50     40              40
     7       52     43              43
     8       54     45     34       45
     9       58     48     37       48
    10       61     50     40       50
    11       65     53     42       53
    12       70     56     45       56
    13              59     47       59
    14              63     50       63
    15              67     53       67
    16              71     56       71
    17              75     60       75
    18              80     65       80
    19                     70       85
    20                     76       91
 
 Construction of a B Scale.—It remains to explain how 
 the T scale can secure all the advantages of the quotient 
 technique associated with the age scale. To make clear 
 just what is sought, there is given in Table 9 a table of 
 age-scale and T-scale equivalents. A fuller explanation of 
 the age-scale terms may be found in “How to Measure in
 Education.” The symbols B and F have been evolved since 
 the foregoing book was written. 
 
 TABLE 9
 SHOWING AGE-SCALE AND T-SCALE EQUIVALENTS

 Age Scale                                                      T Scale
 C.A. = Chronological Age                                       C.A. = Chronological Age
 M.A. = Mental Age                                              Ti = Total intelligence
 E.A. = Educational Age                                         Te = Total educational ability
 R.A. = Reading Age                                             Tr = Total reading ability
 Ar.A. = Arithmetic Age                                         Ta = Total arithmetical ability
 etc.                                                           etc.

 I.Q. = M.A. / C.A. = Intelligence Quotient                     Bi = Brightness in intelligence
 E.Q. = E.A. / C.A. = Educational Quotient                      Be = Brightness in education
 R.Q. = R.A. / C.A. = Reading Quotient                          Br = Brightness in reading
 Ar.Q. = Ar.A. / C.A. = Arithmetic Quotient                     Ba = Brightness in arithmetic
 etc.                                                           etc.

 A.Q. = E.A. / M.A. = Accomplishment Quotient                   F = Te - Ti = Effort or efficiency
 R.A.Q. = R.A. / M.A. = Reading Accomplishment Quotient         Fr = Tr - Ti = Effort in reading
 Ar.A.Q. = Ar.A. / M.A. = Arithmetic Accomplishment Quotient    Fa = Ta - Ti = Effort in arithmetic
 etc.                                                           etc.
 
 Ti is merely a T score on some intelligence test. Te 
 is the average T score on several educational tests. Tr is 
 the T score on some reading test. Ta is the T score on 
 some arithmetic test. Each F is explained by its formula. 
 When Te - Ti, for example, yields a plus result, the pupil
 or class is making better educational progress than the 
 typical pupil or class of like intelligence, and vice versa. 
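
 The two columns of Table 9 differ chiefly in that the age scale divides
 where the T scale subtracts. A minimal sketch, with invented function
 names and invented scores, sets the Accomplishment Quotient beside F:

```python
from statistics import mean


def accomplishment_quotient(educational_age, mental_age):
    """A.Q. = E.A. / M.A., as defined in Table 9."""
    return educational_age / mental_age


def effort(educational_t_scores, intelligence_t):
    """F = Te - Ti, Te being the average T score on several
    educational tests and Ti the T score on an intelligence test."""
    return mean(educational_t_scores) - intelligence_t


# Hypothetical pupil: E.A. of 11 years and M.A. of 10 years on the age
# scale; T scores of 55 and 57 on two educational tests and a Ti of 50.
print(accomplishment_quotient(11, 10))
print(effort([55, 57], 50))
```

 A plus value of F corresponds to a quotient above 1, and a minus value
 to a quotient below 1.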
 
 The computation of B has not been described before. To 
 make the computation of Bi possible there is needed a T 
 scale for each age group for some intelligence test, i.e., there 
 is needed T8, T9, T10, T11, T12, T13, etc., scales. If a
 pupil is, say, 10 years old his Ti is his T12 score, but his Bi
 is his T10 score. If he is 13 years old, his Ti is his T12
 score but his Bi is his T13 score. If he is 12 years old his 
 Ti is his T12 score and his Bi is also his T12 score. A 
 pupil’s Ti is an absolute score which should increase as he 
 grows older. His Bi is a relative score which should remain 
 unchanged throughout his life, if the assumption that in- 
 herited intellectual brightness is constant is a true assump- 
 tion. If he is an average ten-year-old his B will be 50. 
 When he becomes eleven years old his B will also read 50, 
 provided he has remained average, and so on for the re- 
 mainder of his life. The computation of Br is similar to the 
 computation of Bi, except that some reading test is used. Be 
 is the mean of Br, Ba, etc., or other B’s for educational tests. 
 
 The construction of the separate B scales for each age 
 merely duplicates the process of constructing a T scale, 
 provided it is possible to test unselected pupils for each age 
 group. But here a difficulty arises. Some of the brightest 
 13-, 14-, and 15-year-olds are in high school, or have left the 
 school system entirely. Some of the stupider 7-, 8-, and 
 9-year-olds have not yet entered the first grade, or else they
 are clustered in grades I and II, where it is inconvenient 
 to test with linguistic tests. Most tests designed for the 
 elementary school are not applicable below Grade III. Con- 
 sequently, the construction of a T scale for each age often 
 becomes impracticable. 
 
 What is needed is some other simple procedure that will 
 yield the equivalent of separate T scales for each age. Since 
 the procedure which follows will meet all situations, whereas 
 the procedure of scaling separately is not generally applica- 
 ble, it is suggested that the procedure described below and 
 illustrated in Table 10, p. 108, be used in all situations. 
 
 1. Construct age distributions like those shown in
 Table 10.

 2. Compute the total number of pupils for each age,
 and write it below the appropriate frequency column, as
 shown in Table 10.

 3. Construct a T scale on the basis of the 12-year-olds,
 and write the T-scale value in the second column, as shown
 in Table 10.

 4. Compute half the total number of pupils for the
 youngest age. The half sum or one half the 7-year-olds in
 Table 10 is one half of 35, i.e., 17.5 pupils.
 
 5. Begin at the bottom of the frequency column for the 
 youngest age, and add up the frequencies until the next 
 addition or frequency will exceed the half sum. Take half 
 of this next frequency and add it to the total up to that 
 frequency. The result will be the familiar “number exceed- 
 ing plus half those reaching” the T score shown at the 
 left. To illustrate, the half sum for 7-year-olds is 17.5.
 Counting up the 7-year-old frequency column, we have
 1 + 0 + 3 + 1 + 2 + 0 + 2 + 1 + 1 + ... = 17. This 17 is the
 number exceeding plus half those reaching a T score of 34.
 
 6. Divide the “number exceeding plus half those reaching”
 found in (5) by the total number of 12-year-olds. The
 total number of 12-year-olds is 500, so 17 ÷ 500 gives 3.4
 per cent.
 
 7. Convert this per cent into a T score by means of 
 Table 7. This gives 68, as shown at the bottom of Table 10.
 Had all 7-year-olds been tested, and had a T7 scale been 
 constructed, the T score for 11 questions correct would 
 have been approximately 68. 
 
 The procedure outlined above assumes that there are no 
 7-year-olds who read better than the better half of the 35 
 pupils tested. This assumption is a reasonable one, and 
 becomes more reasonable for ages 8, 9, 10, and 11. The 
 procedure also assumes that, since there are 500 unselected 
 12-year-olds, there must be an equal number of 7-year-olds 
 in the lower grades or community. 
 
 8. Tabulate the corresponding T score for 12-year-olds 
 beneath this T score for 7 years. Thus, Table 10 shows 34 
 beneath 68. 
 
 9. Subtract the T12 score from the T7 score. The
 remainder is 34 and is positive, as shown in Table 10.
 This remainder is the brightness or B scale correction.
 Thus, if a 7-year-old pupil correctly answers 9 questions on
 the test, his T score, according to the second column of
 Table 10, is 32. His B score is 32 plus the correction 34,
 i.e., 66. This B score of 66 tells us that the pupil reads
 better than the average 7-year-old by 16 points, or, as
 shown by Table 7, that he is exceeded by only 5.48 per cent
 of 7-year-olds.
 
 10. Repeat steps 4, 5, 6, 7, 8, and 9 for all other ages
 up to 12. The B correction for 12-year-olds will be zero.
 To give another illustration, the arithmetic of these steps
 for 11-year-olds follows. (a) 426 ÷ 2 = 213. (b) Counting
 up the 11-year-old frequency column, with half of the last
 frequency reached, gives 199.5. (c) 199.5 ÷ 500 = 39.9 per
 cent. (d) 39.9 per cent = 52.5 T11. (e) 52.5 - 48 = 4.5,
 the B correction.
 
 11. The computation of B corrections for ages above 12
 is closely similar to that for ages below 12. The only
 difference is that, for ages above 12, account must be taken of
 the fact that the better readers rather than the poorer
 readers are missing from Table 10. This can be done by
 determining the number of missing pupils, and then by adding
 this number in, after adding up the frequency column to
 find the half-sum. For 13-year-olds the number of pupils
 missing is 500 - 452, i.e., 48. Note how this 48 is utilized
 in the following computations for 13-year-olds. (a) 452 ÷
 2 = 226. (b) Counting up the 13-year-old frequency column,
 with half of the last frequency reached, gives 235. (c) 235 +
 48 = 283. (d) 283 ÷ 500 = 56.6 per cent. (e) 56.6 per cent =
 48.5 T13. (f) 48.5 - 52 = -3.5, the B correction. This means
 that the B, for a 13-year-old pupil whose T12 is, say, 40, is
 40 - 3.5 = 36.5.
 
 The B corrections for all the ages are shown in the last
 row of Table 10. The corrections for ages 7, 16, and 17 are
 quite unreliable due to the small number of cases. This
 general procedure for determining B corrections has been
 checked by (a) counting up the frequency column until the
 quarter-sum, for ages below 12, and the three-quarter-sum,
 for ages above 12, was reached, and by (b) computing the
 estimated true mean score for each age in terms of T12, as
 illustrated in Table 25 of “How to Measure in Education.”
 The first, second, and third rows below give B corrections
 for each age according to the half-sum, one-quarter-three-
 quarter-sum, and the estimated-true-mean methods,
 respectively. The results by the three methods are surprisingly
 close, in view of the small number of pupils for the extreme
 ages.
 
 
 
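 Age                  |  7  |  8   |  9   | 10  | 11  | 12 |  13  | 14 | 15  | 16  | 17
 Half-sum             | 34  | 23.5 | 15.5 |  9  | 4.5 |  0 | -3.5 | -8 | -16 | -24 | -37
 Quarter-sum          | [illegible] | 24.0 | [illegible] | 8.9 | 4.8 | 0 | -3.5 | -7 | -12 | -22 | -37
 Estimated true mean  | [row illegible in source]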
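 Steps 6 to 9, and the adjustment for missing pupils in step 11, can be sketched in a few lines of Python. This is a minimal sketch, not the author's procedure as printed: it assumes that Table 7 is a normal-curve table with a mean of 50 and a standard deviation of 10 (consistent with the text's statement that a score of 66 is exceeded by 5.48 per cent), and the function and parameter names are hypothetical. Because Table 7 is rounded, the computed corrections agree with the text only to within about half a point.

```python
from statistics import NormalDist


def percent_to_t(percent: float) -> float:
    """Convert 'per cent exceeding plus half those reaching' to a T score.

    Assumption: Table 7 is a normal-curve table with mean 50 and SD 10.
    """
    upper_tail = percent / 100.0
    return 50.0 + 10.0 * NormalDist().inv_cdf(1.0 - upper_tail)


def b_correction(half_sum_count: float, t12_score: float,
                 base_total: int = 500, missing_pupils: int = 0) -> float:
    """Steps 6 to 9, plus step 11 when missing_pupils is greater than zero.

    half_sum_count : the count found in step 5 for the age in question
    t12_score      : the T score (12-year-old scale) at which counting stopped
    base_total     : number of unselected 12-year-olds
    missing_pupils : for ages above 12, base_total minus the pupils tested
    """
    percent = 100.0 * (half_sum_count + missing_pupils) / base_total
    return percent_to_t(percent) - t12_score


# Worked examples from the text: ages 7, 11, and 13.
print(round(b_correction(17, 34), 1))
print(round(b_correction(199.5, 48), 1))
print(round(b_correction(235, 52, missing_pupils=500 - 452), 1))
```

 The three calls reproduce the text's corrections of 34, 4.5, and -3.5 approximately; exact agreement would require the rounded entries of Table 7 itself.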
 
 12. The last step is to determine the B corrections for
 ages in between 7 and 8, 8 and 9, 9 and 10, etc. This may
 be done by simple interpolation. If the B correction for 7
 years or 90 months is 34, and the B correction for 8 years
 or 102 months is 23.5, the B correction for any intervening
 month of age may be computed with sufficient accuracy by
 simple interpolation. That is, if 102 - 90 corresponds to
 34 - 23.5, one month’s interval will equal 10.5 ÷ 12, i.e.,
 0.875. If, then, 90 months equals a plus correction of 34,
 91 months will equal a correction of 33.125 or for
 convenience 33, and so on for other months up to 102, when the
 interpolation must be done again for 23.5 to 15.5. In
 accordance with the foregoing procedure, the B corrections
 shown in Table 11, p. 109, were computed. The table may
 be extended by estimation for ages below 7 and above 17.
 Table 11 makes it possible to convert the T score of a pupil
 of any months of chronological age into a B score, by simply
 adding to or subtracting from his T score the amount shown
 at the right of his age.
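 The month-by-month interpolation of step 12 can be sketched as follows. This is a minimal illustration only: the function name and the list of anchor points are hypothetical, and the anchors use the text's convention that the 7-year-old group is centered at 90 months, the 8-year-old group at 102 months, and so on.

```python
def interpolate_b_correction(age_months: float, anchors: list) -> float:
    """Linear interpolation between yearly B corrections.

    anchors is a list of (age_in_months, b_correction) pairs sorted by age.
    """
    for (m0, c0), (m1, c1) in zip(anchors, anchors[1:]):
        if m0 <= age_months <= m1:
            per_month = (c1 - c0) / (m1 - m0)  # change in correction per month
            return c0 + per_month * (age_months - m0)
    raise ValueError("age lies outside the range of the anchors")


# Corrections for ages 7, 8, and 9 from the last row of Table 10.
ANCHORS = [(90, 34.0), (102, 23.5), (114, 15.5)]

print(interpolate_b_correction(91, ANCHORS))   # one month past 90 months
print(interpolate_b_correction(102, ANCHORS))  # an anchor point itself
```

 Rounding each result to the nearest whole number gives entries of the kind shown in Table 11.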
 
TABLE 10
 
 SHOWING THE NUMBER OF PUPILS FOR THE AGES 7 TO 17 ANSWERING CORRECTLY 
 THE NUMBER OF QUESTIONS INDICATED IN THE FIRST COLUMN AND 
 HENCE MAKING THE SCALE SCORES INDICATED 
 IN THE SECOND COLUMN 
 
 Questions correct | T scale score | Number of pupils at each age: 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17
 
 
 
 
 
 o 23 I 3 . 2 I 3 5 
 
 I 25 2 3 3 4 I I ° 
 
 2 27 2 3 2 I I 2 fo) I 
 
 3 28 3 fe) 6 3 I I o) ° 2 
 
 4 29 fo) 5 5 5 I 2 ° ° fe) 
 
 5 29 2 5 9 6 I 2 I 2 fe) I 
 6 30 2 6 6 5 t 2 2 z fe) fe) 
 7 31 OUITO 6 3 5 2 2 ° ° fe) 
 8 32 I 8 9 6 4 4 ° I fe) fe) 
 9 32 2 VLG 5 5 2 2 I fo) fe) ° 
 Io 33 2 6 | 15 8 6 2 3 2 fe) ° 
 ti 34 PRN he pe Wig Zo: 5 AS PITO I ° I ° 
 12 35 2 9 21 12 3 3 6 2 I fe) 
 13 36 ASDA s 12 4 8 3 z I ° 
 14 Lay 4 Daniels 23 Ly othe 8 4 I 3 ° 
 15 38 tals 2 25a ES. els 12 5 2 fe) 
 16 39 Opt eAv2s Te te We 8 lee 6 4 3 fe) 
 Ny 4I CALL ease eek ie SH olen aed, 4 4, 0 
 18 42 I 5 | 20 o5r 120r peoslmelo II 5 I 
 19 44 3 3 20 272 Sanh eOuiEoe 21 3 fe) 
 20 46 fe) 4 | 22 C341 42 saul zo 19 5 I 
 21 48 I Ae aiS 25a as 5 ACH 28 10 2 
 22 50 2 6 207) HAO? AOV Wage 25 6 I 
 23 52 2 6 27 NES 2H CALA MAS 24 9 2 
 24 54 ui 8 TO M20. | a9 as 38 8 I 
 25 56 iB R7orys20\ | STAG 24 16 2 
 26 58 6 OF sLOsy 2357 esd 23 18 I 2 
 27 60 ° II TOT SAT ee 17 8 2 
 28 62 2 Bharat a6ciies 23 5 I 
 29 66 7 Bit 12 ¥ 19 2 5 fe) 
 30 70 2 ANT Aner y 7 2 I 
 31 75 I 6 3 5 4 I 
 32 “8 fo) i I 3 
 33 81 I I 2 
 34 85 
 35 90
 
 Total Pupils..| 35 | 173  | 347  | 399 | 426  | 500 | 452  | 303 | 118 | 16  | 2
 B Scale Score.| 68 | 59.5 | 53.5 | 53  | 52.5 | 50  | 48.5 | 44  | 38  | 28  | 21
 T Scale Score.| 34 | 36.0 | 38.0 | 44  | 48   | 50  | 52.0 | 52  | 54  | 52  | 58
 B Correction..| 34 | 23.5 | 15.5 | 9   | 4.5  | 0   | -3.5 | -8  | -16 | -24 | -37
 
 
 
 
 
 TABLE 11

 SHOWING HOW TO CONVERT A T SCORE INTO A B SCORE FROM KNOWLEDGE
 OF CHRONOLOGICAL AGE

 Ch. Age   Add to  | Ch. Age   Add to  | Ch. Age   Add to  | Ch. Age   Add to
 Yrs-Mos.  T Score | Yrs-Mos.  T Score | Yrs-Mos.  T Score | Yrs-Mos.  T Score

  7- 6      34     | 10- 2      11     | 12- 8      -1     | 15- 2     -13
  7- 8      32     | 10- 4      10     | 12-10      -1     | 15- 4     -15
  7-10      31     | 10- 6       9     | 13- 0      -2     | 15- 6     -16
  8- 0      29     | 10- 8       8     | 13- 2      -2     | 15- 8     -17
  8- 2      27     | 10-10       8     | 13- 4      -3     | 15-10     -19
  8- 4      25     | 11- 0       7     | 13- 6      -4     | 16- 0     -20
  8- 6      24     | 11- 2       6     | 13- 8      -4     | 16- 2     -21
  8- 8      22     | 11- 4       6     | 13-10      -5     | 16- 4     -23
  8-10      21     | 11- 6       5     | 14- 0      -6     | 16- 6     -24
  9- 0      19     | 11- 8       4     | 14- 2      -7     | 16- 8     -26
  9- 2      18     | 11-10       3     | 14- 4      -7     | 16-10     -28
  9- 4      17     | 12- 0       3     | 14- 6      -8     | 17- 0     -31
  9- 6      16     | 12- 2       2     | 14- 8      -9     | 17- 2     -33
  9- 8      14     | 12- 4       1     | 14-10     -11     | 17- 4     -35
  9-10      13     | 12- 6       0     | 15- 0     -12     | 17- 6     -37
 10- 0      12     |
 
 
 
 How to Construct C Scale——The T scale measures 
 total ability in a sort of absolute sense. The B scale meas- 
 ures brightness, i.e., ability relative to age. The purpose 
 of the C scale is to indicate automatically a pupil’s correct 
 classification in school in the trait tested, and to measure 
 ability relative to grade. A pupil may be doing excellent 
 work for his age but poor work for his grade or vice versa. 
 The steps in the process of constructing a C scale follow. 
 
 1. Construct grade distributions similar to the age dis- 
 tribution in Table 10. 
 
 2. Using the T score column and the frequency column 
 for the grade in question, compute the mean T score for 
 each grade or for each half-grade in case the schools tested 
 have half-year promotions. These mean T scores for each 
 grade are grade norms. The grade norms were as follows: 
 
 
 
 Grade . | 2A   | 2B   | 3A   | 3B   | 4A   | 4B   | 5A   | 5B   | 6A   | 6B   | 7A   | 7B
 Norm .  | 26   | 30   | 33.7 | 37.3 | 39.6 | 41.8 | 44.9 | 48.0 | 50.9 | 53.7 | 56.0 | 58.3

 Grade . | 8A   | 8B   | 9A   | 9B   | 10A  | 10B  | 11A  | 11B  | 12A  | 12B
 Norm .  | 59.6 | 60.9 | 61.5 | 62.1 | 62.9 | 63.6 | 64.5 | 65.4 | 66.8 | 68.1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 3. Write the letters in the foregoing 2A, 2B, 3A, etc., as
 decimals which will indicate how much of each grade the
 classes tested have completed. Since the test was given in
 June the 2A classes had completed half of Grade II, the 2B
 classes had completed all of Grade II, and so on. Hence 2A
 above should be changed to 2.5, 2B to 2.99 or 3.0, 3A to 3.5,
 3B to 4.0, 4A to 4.5, 4B to 5.0, etc. If the test has been
 given just after mid-year promotion, 2A should be written
 as 2.0, 2B as 2.5, etc.
 
 4. Interpolate to determine what norm corresponds to 
 each tenth of a grade. Since 2.5 corresponds to 26, and 3.0 
 to 30, 2.6 is found by interpolation to correspond to 26.8, 
 2.7 is found to correspond to 27.6, and so on. The expan- 
 sion by interpolation shown in Table 13C, p. 126, illustrates 
 the process in detail. ‘‘Grade” has been written as ‘“G” 
 (grade status), and “Norm” has been altered to T since 
 it is really a mean T score. The table has been extended 
 downward by common sense estimation, and upward arbi- 
 trarily so that the highest possible score will coincide with 
 a G of 20. 
 
 5. Prepare a C correction table for correcting a G into 
 a C. The C-corrections are given below. They are the 
 same for all tests whether designed for the elementary or 
 the high school, and regardless of the time when the data 
 for scaling the test were collected. 
 
 End of Month  |  1 |  2 |  3 |  4 | 5 |  6  |  7  |  8  |  9  | 10
 C Correction  | .4 | .3 | .2 | .1 | 0 | -.1 | -.2 | -.3 | -.4 | -.5
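 The interpolation in step 4 can be sketched as below. This is a hedged illustration, not the author's Table 13C: the function name is hypothetical, and only the 2A and 2B norms quoted in the text are used as an example.

```python
def norms_by_tenth(g0: float, t0: float, g1: float, t1: float) -> dict:
    """Interpolate linearly between two grade norms at each tenth of a grade.

    g0, g1 are grade statuses (for example 2.5 and 3.0); t0, t1 are the
    mean T scores (norms) observed at those grade statuses.
    """
    steps = round((g1 - g0) * 10)  # number of tenths between the two grades
    table = {}
    for i in range(steps + 1):
        grade = round(g0 + i / 10, 1)
        table[grade] = round(t0 + (t1 - t0) * i / steps, 2)
    return table


# 2A (G = 2.5) has a norm of 26 and 2B (G = 3.0) has a norm of 30.
for grade, norm in norms_by_tenth(2.5, 26, 3.0, 30).items():
    print(grade, norm)
```

 Repeating the call for each adjacent pair of grade norms and merging the results would give a complete G-to-T table of the kind the text describes.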
 
 21. The Test Should Be Long Enough to Yield Reliable
 Scores.
 
 This means that not only the time for, but also the ma- 
 terial of the test should be adequate. We have just seen 
 that calling the pupil’s score the scale difficulty of the single 
 most difficult test element done correctly tends to yield an 
 unreliable score. This is because this procedure in effect 
 
 
 shortens the test, since not every test element plays an 
 intimate part in determining the score. To secure adequate 
 reliability frequently requires that two or more forms of a 
 test be given and the results averaged. Spearman has de- 
 vised a formula in order to determine how many forms of 
 a test must be given to yield a desired reliability—a desired 
 self-correlation coefficient (see Chapter IX). The answer 
 is given by the following formula: 
 
 $$N = \frac{r_x - r_1 r_x}{r_1 - r_1 r_x}$$

 Where N is the number of tests required to yield rx;
 rx is the desired self-correlation coefficient, and
 r1 is the self-correlation coefficient of one form
 with another form of the test.
 
 Thus the number of forms of a test required to yield a
 self-correlation coefficient (rx) of .95, when the coefficient
 of correlation (r1) of one test with a duplicate is .8, may be
 found by substituting in the foregoing formula and solving
 for N, thus:

 $$N = \frac{.95 - .8(.95)}{.8 - .8(.95)} = 4.75 \text{ or } 5.$$
 
 This tells us that the mean of 5 equivalent forms of the test 
 would correlate with the mean of 5 other equivalent forms 
 to the extent of .95. 
 
 Sometimes the information desired is,—what self-correla- 
 tion coefficient would result from correlating the mean of, 
 say, 4 equivalent forms of a test with 4 other equivalent 
 forms, when, say, r1 is .7. Here the formula and substitu- 
 tions are: 
 
 $$r_x = \frac{N r_1}{1 + (N - 1) r_1} = \frac{4 \times .7}{1 + (4 - 1) \times .7} = \frac{2.8}{3.1} \approx .90$$
 
 If r1 in both the above substitutions should be the
 self-correlation coefficient found by correlating the mean of two
 equivalent forms of a test with the mean of two other forms,
 instead of the self-correlation coefficient for one form of
 a test with another form, the foregoing formulae may be
 operated just the same. The N found in the first computation
 would show, however, not 5 forms of the test but 5
 pairs of forms, i.e., 10 forms, or more exactly 9.5 forms.
 Since, in the second computation, 4 forms are equivalent
 to two pairs of forms, 2 should take the place of 4, thus:

 $$r_x = \frac{2 \times .7}{1 + (2 - 1) \times .7} = \frac{1.4}{1.7} \approx .82$$
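 Both uses of the Spearman formula, finding the number of forms needed and finding the reliability of the mean of several forms, can be checked with a short Python sketch. The function names are hypothetical; the arithmetic follows the two formulas in the text, with r1 the self-correlation of one form (or one unit of forms) and rx the reliability of the mean of N such units.

```python
import math


def forms_needed(r_desired: float, r_one: float) -> float:
    """Number of forms whose mean would have reliability r_desired."""
    return (r_desired - r_one * r_desired) / (r_one - r_one * r_desired)


def reliability_of_mean(n_forms: float, r_one: float) -> float:
    """Self-correlation of the mean of n_forms equivalent forms."""
    return n_forms * r_one / (1 + (n_forms - 1) * r_one)


# First example in the text: desired .95, single-form coefficient .8.
n = forms_needed(0.95, 0.8)
print(round(n, 2), "forms, i.e.,", math.ceil(n), "in practice")

# Second example: 4 forms with a single-form coefficient of .7.
print(round(reliability_of_mean(4, 0.7), 3))

# Third example: the .7 belongs to a pair of forms, so 4 forms are 2 pairs.
print(round(reliability_of_mean(2, 0.7), 3))
```

 The two functions are inverses of one another: feeding the output of `forms_needed` back into `reliability_of_mean` returns the desired coefficient.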
 
 How reliable should a test be? A self-correlation coeffi- 
 cient of 1.0 would mean perfect reliability. The best intelli- 
 gence tests have self-correlation coefficients of one form 
 with a duplicate of .9 to .95 as based upon records from 
 unselected pupils of the same chronological age. In grade 
 groups the coefficient would be slightly less. The standard 
 test has a reliability in age groups of about .8. A test with 
 a reliability of .8 will yield a sufficiently reliable mean 
 score for a group of 40 or more pupils. It will not yield a 
 very reliable score for an individual. The experimenter
 should have little confidence in the reliability of individual 
 scores unless his test has a self-correlation of .95 or above, 
 or until he has given enough forms of the test to bring the 
 self-correlation to or above this figure. Fortunately, experi- 
 menters are more concerned, as a rule, with mean scores for 
 groups of pupils than with individual scores. 
 
 Self-correlation coefficients are probably not the most 
 intelligible way to determine and report reliability. Another 
 way is illustrated in miniature in Table 12. The first 
 column indicates the various pupils. The second column 
 shows the scores made on one form of a test. The third 
 column shows the scores made on another form of the test 
 given shortly afterward. The fourth column shows the 
 difference between the two scores. The mean of the differ- 
 ences shows the amount of error on the average to be 
 expected with this test. Were each of the tests perfectly 
 
 
 
 reliable and were there no increase or decrease of the second 
 series of scores over the first series due to (a) difference 
 in difficulty of the two tests, (b) practice on the first test, 
 (c) instruction, coaching, or natural growth in the trait, 
 the second series of scores would then be identical with the 
 first series and the differences in the last column would all 
 be zero. Any difference due to (a), (b), and (c), pro- 
 vided these influences have operated equally upon all pupils, 
 can be eliminated by diminishing the non-algebraic mean 
 
 TABLE 12
 APPROXIMATE METHOD OF DETERMINING A TEST’S RELIABILITY

 Pupil    Form 1    Form 2    Difference

 a          20        22          2
 b          12        15          3
 c          25        24         -1
 d          32        35          3
 e          12        11         -1
 f           6        10          4
 g          28        28          0
 h          15        13         -2
 i          18        20          2
 j          22        20         -2

 Mean difference (non-algebraic) ............ 2.0
 Mean difference (algebraic) ................ 0.8
 Net difference (unreliability) ............. 1.2
 
 
 
 difference by the amount of the algebraic mean difference. 
 The net difference is approximately pure unreliability. To 
 secure an absolutely pure measure of unreliability would 
 require that an allowance be made for the fact that all 
 pupils do not profit equally from practice, instruction, coach- 
 ing, maturing, and the like. 
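 The net-difference computation of Table 12 can be sketched as follows. The function name is hypothetical. One assumption is made explicit in the code: the algebraic mean difference is subtracted by its absolute size, so that a general drop in scores from form 1 to form 2 is discounted in the same way as a general rise.

```python
def net_difference(form1: list, form2: list) -> tuple:
    """Return (mean non-algebraic diff, mean algebraic diff, net diff)."""
    diffs = [second - first for first, second in zip(form1, form2)]
    mean_non_algebraic = sum(abs(d) for d in diffs) / len(diffs)
    mean_algebraic = sum(diffs) / len(diffs)
    # Assumption: the size of the algebraic mean is what is subtracted.
    net = mean_non_algebraic - abs(mean_algebraic)
    return mean_non_algebraic, mean_algebraic, net


# Scores of pupils a to j on the two forms, as given in Table 12.
FORM_1 = [20, 12, 25, 32, 12, 6, 28, 15, 18, 22]
FORM_2 = [22, 15, 24, 35, 11, 10, 28, 13, 20, 20]

non_alg, alg, net = net_difference(FORM_1, FORM_2)
print(round(non_alg, 1), round(alg, 1), round(net, 1))
```

 For the Table 12 scores this gives the three figures reported at the foot of the table.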
 
 The procedure illustrated in Table 12 is quite satisfac- 
 tory provided the variation in scores on form 1 of the test 
 is the same or approximately the same as the variation in 
 scores on form 2. Whether the general size of the scores 
 is the same on both forms is immaterial. Equivalent forms 
 of tests are so constructed, as a rule, that the two series of 
 
 
 scores are alike in both variability and general size. The 
 variability of scores on form 1 of Test A in Table 12 is 
 about the same as that of the scores on form 2. The slight 
 tendency for the scores on form 2 to be larger than those 
 on form 1 is discounted by the use of the mean algebraic 
 difference, namely 0.8. 
 
 Test X in Table 13 illustrates a situation where the
 variabilities are identical, but where the two series of scores differ
 markedly in size. The net difference shows how this process
 
 TABLE 13

 ILLUSTRATING THE NECESSITY FOR EQUATING VARIABILITIES BEFORE COMPUTING
 RELIABILITY BY THE NET-DIFFERENCE METHOD

         |        Test X          |        Test Y          |     Equated Var.
 Pupil   | Form 1  Form 2  Diff.  | Form 1  Form 2  Diff.  | Form 1  Form 2  Diff.
 a       |   22       0     -22   |   10       0     -10   |   10       0     -10
 b       |   24       2     -22   |   14       8      -6   |   14       4     -10
 c       |   26       4     -22   |   18      16      -2   |   18       8     -10
 d       |   28       6     -22   |   22      24       2   |   22      12     -10
 e       |   30       8     -22   |   26      32       6   |   26      16     -10

 Mean Difference (non-algebraic) ...   22   |   5.2   |   10
 Mean Difference (algebraic) .......   22   |   2.0   |   10
 Net Difference (unreliability) ....    0   |   3.2   |    0
 
 eliminates the effect of differences in size. Test Y illustrates 
 a situation where mere inspection shows there is perfect 
 reliability, yet the net difference fails to show perfect relia- 
 bility. It fails to show the true reliability because the varia- 
 tion in scores is not the same for both forms. The variability 
 of the scores on form 2 is exactly twice that of the scores 
 on form 1. The variabilities can be made identical by the 
 simple process of dividing all the scores on form 2 by 2. 
 Once the variabilities are equated the net difference shows 
 the true reliability, as shown in the third portion of the table. 
 
 It is seldom feasible to determine the amount of a test’s 
 variability by inspection as was done for form 2 of Test Y 
 
 
 in Table 13. The usual procedure is to compute for each
 series of scores one of the standard measures of variability,
 such as Q (quartile deviation) or SD (standard deviation),
 and to use these as a basis for equating. The computation
 of the Q and SD is explained in Chapter VI. Suffice it to
 state here that the SD for form 1 of Test Y is 5.66, and
 for form 2 is 11.32. Thus the SD’s show also that the
 variability of scores on form 2 is twice that for form 1. The
 variabilities or SD’s may be equated by dividing all scores
 on form 2 by 2, as was done, or instead, by multiplying all
 scores on form 1 by 2. Had the SD been 5 for form 1 and
 4 for form 2, variabilities could be equated by dividing the
 scores on form 1 by 1.25, or instead, by multiplying the
 scores on form 2 by 1.25. Had the SD’s been 1 and 6 for
 forms 1 and 2, respectively, variabilities could be equated
 by multiplying scores on form 1 by 3, and by dividing
 scores on form 2 by 2. That is, the variability of one form
 may be adjusted to another form or the variability of both
 forms may be adjusted to a third variability different from
 the original variability of both. Sometimes one type of
 adjustment is more convenient and sometimes the other.
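 The equating of variabilities before taking the net difference can be sketched as follows, using the Test Y scores of Table 13. This is an illustrative sketch under two assumptions: the SD is taken as the population standard deviation (which reproduces the 5.66 quoted in the text), and form 2 is rescaled to the variability of form 1, which is only one of the adjustments the text allows. The function name is hypothetical.

```python
from statistics import pstdev


def equated_net_difference(form1: list, form2: list) -> float:
    """Rescale form 2 to the SD of form 1, then take the net difference."""
    ratio = pstdev(form2) / pstdev(form1)       # how much more variable form 2 is
    scaled2 = [score / ratio for score in form2]
    diffs = [b - a for a, b in zip(form1, scaled2)]
    mean_non_algebraic = sum(abs(d) for d in diffs) / len(diffs)
    mean_algebraic = sum(diffs) / len(diffs)
    return mean_non_algebraic - abs(mean_algebraic)


# Test Y of Table 13: form 2 has exactly twice the spread of form 1.
Y_FORM_1 = [10, 14, 18, 22, 26]
Y_FORM_2 = [0, 8, 16, 24, 32]

print(round(pstdev(Y_FORM_1), 2))
print(round(pstdev(Y_FORM_2) / pstdev(Y_FORM_1), 2))
print(round(equated_net_difference(Y_FORM_1, Y_FORM_2), 6))
```

 After equating, the net difference for Test Y comes out at zero apart from floating-point rounding, in agreement with the third portion of Table 13.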
 
 Herring has called attention to the fact that the corre- 
 spondence of scores on one form of a test with scores on 
 another form is not the best measure of reliability. He 
 claims, and rightly so, that scores on one form of a test 
 will correspond more closely with mean scores from an 
 infinite number of forms, than they will with scores on 
 another equally unreliable form. That is, the correct meas- 
 ure of the reliability of a test is some measure of the close- 
 ness of its correspondence with a perfectly reliable deter- 
 mination. 
 
 A better measure of the reliability of a test than that 
 given by self-correlation or self net difference is the corre- 
 lation between a test and the mean of two forms of that 
 test, or the net difference between a test and the mean of 
 two forms of the test. The effect of this last is to make the 
 net difference just exactly half the net difference between 
 
 
 one form and another. The procedure would yield a net 
 difference of 0.6 instead of 1.2 for the data of Table 12. 
 
 But due to the fact that a test has half the influence in
 determining the mean of the two forms against which it is
 checked, the preceding procedure makes the reliability
 appear about as much better than it really is as the
 self-correspondence procedure makes it appear less satisfactory
 than it really is. Otis¹ has determined that the true
 unreliability is .707 of the net difference as computed in Table
 12 and Table 13. The correct measure of unreliability for
 Table 12 is .707 times 1.2, i.e., .8484.
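 The three estimates of unreliability discussed above can be placed side by side in a small sketch. The names are hypothetical; the factor .707 is taken from the text, and it may be noted that .707 is approximately 1 divided by the square root of 2.

```python
import math

OTIS_FACTOR = 0.707  # factor quoted in the text; close to 1 / sqrt(2)


def unreliability_estimates(net_difference: float) -> dict:
    """Return the three estimates discussed in the text for one net difference."""
    return {
        "form against form": net_difference,
        "form against mean of two forms": net_difference / 2,
        "Otis correction": OTIS_FACTOR * net_difference,
    }


for name, value in unreliability_estimates(1.2).items():
    print(name, round(value, 4))

print(round(1 / math.sqrt(2), 3))
```

 For the Table 12 net difference of 1.2 the corrected estimate falls between the other two, as the argument of the text requires.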
 
 22. The Test Should Be Scored Comprehensively 
 Enough to Yield Reliable Scores. 
 
 The failure to score all phases of a pupil’s product while 
 taking a test may be a prolific source of unreliability,
 particularly in the case of rate tests where one phase is
 intimately dependent upon another. Thus a sort of see-saw
 relation exists between speed and quality in a rate test of 
 handwriting. Generally, as speed increases, quality de- 
 creases and vice versa. Unless the method of testing is 
 such as to keep speed, say, constant, the two quality scores 
 for a pupil from two tests might be quite dissimilar, whereas 
 if each quality score were corrected for differences in speed, 
 they might, in reality, be identical. 
 
 The approximate amount of correction for speed may be 
 determined empirically. That correction is best which will 
 produce the maximum possible self-correlation between the 
 two series of corrected scores for quality. Another tech- 
 nique for determining the amount of correction has been 
 proposed by Courtis and Thorndike² and applied to the
 former’s rate tests in arithmetic. 
 
 23. The Test Should Be So Constructed As to Permit 
 Uniformity of Procedure in Applying and Scoring It. 
 
 The key to objectivity and an important key to reliability 
 
 ¹ Otis, Arthur S., “The Reliability of the Binet Scale and of Pedagogical Scales,”
 Journal of Educational Research, September, 192 
 
 ² Courtis, S. A., and Thorndike, E. L., “Correction Formulae for Addition
 Tests,” Teachers College Record, January, 1920.
 
 
 is this matter of uniformity of procedure. If it is not possi- 
 ble to repeat a test in a uniform way, one individual cannot 
 verify his own previous results, and one individual has 
 even less opportunity to verify the results of another. The 
 possibility of uniformity is partly a function of the nature 
 of the test, partly of the detail and accuracy of the directions 
 for applying and scoring the test, and partly of an experi- 
 mental determination and consequent allowance for the 
 amount and direction of each individual’s personal equation. 
 The first two are the most promising. 
 
 24. The Test Should Have Satisfactory Age and Grade 
 Norms. 
 
 The experimenter has less need for norms than other 
 users of tests. The experimenter is more interested, as a 
 rule, in comparing the progress of one experimental group 
 with the progress of an equivalent experimental group. 
 Norms are very convenient, however, where only one experi- 
 mental group is available, for then the progress of the avail- 
 able experimental group may be compared with the progress 
 of the norm group. Proper allowances can be made for any 
 differences of intelligence between the two groups thus 
 compared. 
 
 Norms are most valuable when they are representative of 
 the groups with whom it is most desirable to make com- 
 parisons; when they are based upon enough cases to make 
 them stable; when both the total distribution of scores and 
 the averages are reported; when the number of cases upon 
 which they are based is stated; and when the date of stand- 
 ardization is specified. 
 
 The addition of a B-scale correction to 50 or its subtraction
 from 50 shows the norm for the chronological age
 responding to the particular correction (see Table 11). 
 
 25. The Test Should Be Provided With an Inexpensive 
 Leaflet of Directions, Scoring Devices, and Tabulation and 
 Graph Forms. 
 
 All too frequently it is necessary, in order to use a test, 
 to purchase a monograph. In this monograph it is quite 
 
 
 common to discover after diligent search that the directions 
 for applying the test are in the appendix, that directions for 
 scoring are near the beginning of the book, that the key for 
 scoring is somewhere else, that norms are at still another 
 place in the monograph, and that tabulation forms are lack- 
 ing entirely. Fortunately a strong public opinion is com- 
 pelling a more careful attention to these details. This con- 
 sideration for the time and convenience of test users applies 
 less to experimenters who are constructing tests for tempo- 
 rary purposes than to those who expect a wide distribution 
 of the test which they have prepared. 
 
 IV. SAMPLE TEST AND DIRECTIONS 
 
 In order to give a concrete illustration of how the T, B, 
 C, F scale system will operate in practice there follows an 
 unfinished sample of form 1 of an arithmetic test now in 
 process of construction, and a tentative model direction 
 booklet. All the data in the tables are for another test of 
 35 elements instead of for the arithmetic test of 80 elements. 
 Otherwise the tables may be thought of as applying to the 
 arithmetic test. 
 
 CHINESE FUNDAMENTALS OF ARITHMETIC SCALE 
 
 Do not open this paper until told to do so. As soon as I have 
 
 told you how, fill the blanks below, and then hold up your pencil 
 to show that you have finished. 
 
 Surname ........ First Name ........ Boy, Girl ........
 Age in Years ........ Birth Month ........ Birthday ........
 [illegible] ........ Grade ........
 Date: Year of Republic ........ Month ........ Day ........
 
 Pencils up! 
 
 We want to see how well you can add, subtract, multiply, and
 divide. Do all your work on this paper. Get no help from
 anyone. Answers should be given in decimals and not in fractions.
 See how many examples you can get correct in the time allowed.
 You will be told your score later. As soon as you finish one page,
 do the next.
 
 [score-summary line illegible in source]
 Addition .... Subtraction .... Multiplication .... Division ....

             (1)     (2)     (3)     (4)
              3       6       7       7
 Add          4       2       5       9      Add

             (5)     (6)     (7)     (8)
              6       8       9       8
 Subtract     3       4       5       0      Subtract

             (9)     (10)    (11)    (12)
              5       8
              1       0      24      50
 Add          7       5       4       6      Add

             (13)    (14)    (15)    (16)
             29      74      76      92
 Subtract     6       4      32      21      Subtract

             (17)    (18)    (19)    (20)
              4       3       7       8
 Multiply     2       3       3       6      Multiply

             (21)    (22)    (23)    (24)
 Divide      2)6     4)8     4)36    7)49    Divide

             (25)    (26)    (27)    (28)
             22      72      69      58
 Add         35      26       4       8      Add
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 (29) (30) (31) (32)
 34 44 41 86 
 Subtract 8 7 26 19 Subtract 
 (33) (34) (35) (36) 
 24 20 28 63 
 Multiply 2 4 7 9 Multiply 
 (37) (38) (39) (40) 
 Divide 2)178 4)260 5) 845 7)973 Divide 
 (41) (42) (43) (44)
 984 32 
 75 43 253 571 
 Add oa 89 457 185 Add 
 (49) (50) (51) (52)
 407 350 65 7 
 Multiply 7 8 36 57 Multiply 
 (53) (54) (55) (56)
 Divide 9)54054 8)16200 43)559 27)864 Divide
 (57) (58) (59) (60) 
 72 28 
 46 95 
 53 60 
 98 72 
 28 — 89 
 70 43 6.43 
 69 39 48.19 -78 
 Add 98 39 96.13 70. Add 
 (61) (62) (63) (64) 
 5004 3500 7-32 a 
 Subtract 169 2891 2.59 8.63 Subtract 
 (65) (66) (67) (68) 
 
 Multiply 70 600 8 “7 Multiply 
 
 
 
 
 
 
 
 
 
 
 (69) (70) (71) (72)
 Divide 68)68544 97)1949700 55)198  83)431.6 Divide 
 (73) (74) (75) (76) 
 ,; 58 76 7555 72.3 
 Multiply BT .09 5.98 8.06 Multiply 
 (77) (78) (79) (80) 
 
 Divide .40)2.42 .90)3.59 .03)8.76 .08).46 Divide 
 
 When you finish, close your paper, lay it on your desk with the 
 front page up, and wait quietly until papers are collected. 
 
 DIRECTIONS FOR THE CHINESE FUNDAMENTALS OF 
 ARITHMETIC SCALE 
 
 FORM 1
 I. GENERAL DIRECTIONS FOR APPLYING TEST 
 
 1. Follow the instructions for giving the test with literal exact- 
 ness. No additional help should be given except as hereafter 
 provided for. Avoid unstandardized introductory remarks. 
 Secure rapport by charm of manner rather than felicity of 
 expression. 
 
 2. Give directions distinctly, at moderate speed, with careful 
 attention to emphasis, loudly enough to enable all pupils in the 
 room to hear without difficulty, and confidently enough to secure 
 instant obedience from every pupil. Insist courteously but firmly 
 on this prompt obedience from the start. 
 
 3. Remove all distracting elements from the environment, and 
 make pupils as comfortable as possible. Provide against any dis- 
 turbances while the test is in progress. Preferably there should 
 be no visitors. 
 
 4. Prevent copying. Do this by carefully watching those who 
 act suspiciously or by standing beside them. Do not distract 
 others by oral reprimands in the midst of the test. 
 
 5. In timing the test use a stop-watch if possible. If not, an 
 ordinary watch may be used provided it has a second hand. 
 Where feasible, it is well to have an assistant do the timing. 
 
 6. Clear desks. See that each pupil is provided with a sharp- 
 ened pencil. Have a few extra pencils available. 
 
 
 7. Carefully count enough and just enough test papers for each 
 row and place them on the first desk of that row. Be very careful 
 lest a test paper be left in the possession of the pupils. If pupils 
 are practiced or are permitted to practice themselves on the con- 
 tents of this test, its usefulness as a measuring instrument will be 
 destroyed. 
 
 II. INSTRUCTIONS TO PUPILS
 
 1. Hold up one of the test papers and say:
 
 One of these papers will be placed on each desk. Do not open 
 them until told to do so. Will the pupils in the first row please 
 distribute papers. 
 
 2. When papers are distributed, say: 
 Look at the first page and read silently while I read aloud. 
 
 3. Read the directions with a sufficient pause at the end of each 
 sentence to permit the direction to be followed or the thought to 
 be fully grasped. 
 
 4. When directions have been read, record the time in hours, 
 minutes, and seconds, as you say: Open your paper and begin! 
 
 5. At the end of exactly 10 minutes, say: 
 
 Stop! Draw a large circle around the example you are now 
 working on and then pencils up. (Pause.) Now finish the ex- 
 ample and go right on. 
 
 6. Make sure that each pupil does not forget that as soon as 
 he finishes one page he is to do the next, and that he does not 
 overlook the last page. 
 
 7. At the end of exactly 30 minutes after saying “Begin,” say: 
 Stop! Pencils down! Will pupils in the first row please collect
 papers.
 
 III. HOW TO SCORE TEST
 
 Take a blank test paper and fill it out with the correct answers 
 given below. This scoring stencil may be creased in successive 
 folds, thus making it possible to lay the row of correct answers 
 just below the pupil’s answers. Draw a line through every in- 
 correct or omitted answer and write the number of correct answers 
 in each row to the right of that row. Compute the total number 
 of correct answers made on the entire test by each pupil and write 
 this in the “Examples correct” space provided on the front page 
 of his paper. 
 
 To be counted correct a pupil’s answers must agree exactly with 
 
 
 those given below. Each example is scored as either wholly right 
 or wholly wrong. No partial credits are given. When an answer 
 has been corrected by the pupil, the correction is the answer to be 
 scored. The use of fractions instead of decimals is scored as incor- 
 rect in order to discourage a cumbersome practice. If pupils must 
 meet fractions in their environment, they should be taught how to 
 convert fractions into decimals. Omission or misplacement of a 
 decimal point makes the answer wrong. The presence of zero 
 before an integer or after a decimal does not make an otherwise 
 correct answer incorrect. 
 
 As a rule it will be found quite satisfactory to have pupils 
 exchange papers and do all the scoring themselves, the examiner 
 calling the correct answers. If this is done, at least two pupils 
 should score each paper, and the examiner should check the 
 accuracy of the scoring for some of the papers. 
 
 The list of correct answers follows. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 Example  Form I | Example  Form I | Example  Form I | Example  Form I 
    1        7   |   21        3   |   41      112   |   61     4835 
    2        8   |   22        2   |   42      132   |   62      609 
    3       12   |   23        9   |   43     1694   |   63     4.73 
    4       16   |   24        7   |   44     1084   |   64    66.37 
    5        3   |   25       57   |   45      194   |   65     4200 
    6        4   |   26       98   |   46      286   |   66    30600 
    7        4   |   27       73   |   47      562   |   67     4.72 
    8        8   |   28       66   |   48      299   |   68     6.30 
    9       11   |   29       26   |   49     2849   |   69     1008 
   10       13   |   30       37   |   50     2800   |   70     2010 
   11       28   |   31       15   |   51     2340   |   71      3.6 
   12       56   |   32       67   |   52     4332   |   72      5.2 
   13       23   |   33       48   |   53     6006   |   73    21.46 
   14       79   |   34       80   |   54     2025   |   74     6.84 
   15       44   |   35      196   |   55       13   |   75   451.49 
   16       71   |   36      567   |   56       32   |   76  582.738 
   17        8   |   37       89   |   57      533   |   77     6.05 
   18        9   |   38       65   |   58      465   |   78     15.1 
   19       21   |   39      169   |   59   144.32   |   79      292 
   20       48   |   40      139   |   60    86.21   |   80     5.75 
 
 
 
 iv. How to Compute Pupil Ta (Total Ability 
 in Arithmetic) 
 
 Find the pupil’s total number of examples correct in the first 
 column of Table 13A and read the corresponding Ta. This is the 
 pupil’s T score in arithmetic. Thus the first pupil in Table 13D 
 (p. 127) did 16 examples correctly, which, according to Table 13A 
 corresponds to a Ta of 40. 
 
 TABLE 13A 

 Examples         | Examples         | Examples         | Examples 
 Correct     Ta   | Correct     Ta   | Correct     Ta   | Correct     Ta 

    0        23   |    9        33   |   18        43   |   27        63 
    1        25   |   10        34   |   19        45   |   28        67 
    2        26   |   11        35   |   20        47   |   29        71 
    3    [illeg.] |   12        36   |   21        49   |   30        76 
    4        27   |   13        37   |   22        51   |   31        79 
    5        28   |   14        38   |   23        53   |   32        86 
    6        29   |   15        39   |   24        56   |   33        86 
    7        31   |   16        40   |   25        58   |   34        92 
    8        32   |   17        42   |   26        60   |   35        96 
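
 The look-up in Table 13A can be sketched in a few lines of Python. This is only an illustration: the names TABLE_13A and pupil_ta are not part of the booklet, and the entry for 3 examples correct is left out because it cannot be read in the source.

```python
# Table 13A as printed: number of examples correct -> Ta.
# The entry for 3 examples correct is illegible in the source scan,
# so it is deliberately omitted rather than guessed.
TABLE_13A = {
    0: 23, 1: 25, 2: 26, 4: 27, 5: 28, 6: 29, 7: 31, 8: 32,
    9: 33, 10: 34, 11: 35, 12: 36, 13: 37, 14: 38, 15: 39, 16: 40, 17: 42,
    18: 43, 19: 45, 20: 47, 21: 49, 22: 51, 23: 53, 24: 56, 25: 58, 26: 60,
    27: 63, 28: 67, 29: 71, 30: 76, 31: 79, 32: 86, 33: 86, 34: 92, 35: 96,
}


def pupil_ta(examples_correct: int) -> int:
    """Return the Ta that Table 13A gives for a raw score."""
    if examples_correct not in TABLE_13A:
        raise ValueError(f"no legible Table 13A entry for {examples_correct}")
    return TABLE_13A[examples_correct]
```

 For the first pupil in Table 13D, pupil_ta(16) gives the Ta of 40 quoted in the text.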
 
 v. How to Compute Pupil Ba (Brightness in Arithmetic) 
 
 Find the pupil’s solar age in Table 13B and read the corre- 
 sponding Ba correction. If the Ba correction is plus, add it to 
 the pupil’s Ta. If it is minus, subtract it from his Ta. The result 
 is the Ba. Thus the first pupil in Table 13D is 13 yrs. 2 mos. old, 
 which, according to Table 13B, corresponds to a Ba correction 
 of —2. His Ta of 40 plus the Ba correction of —2 gives a 
 Ba of 38. 
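
 The same rule can be written as a short Python sketch. The dictionary holds only part of Table 13B (ages 11-0 through 14-0), and the names BA_CORRECTION and pupil_ba are illustrative. Ages falling between two tabled ages are not handled, since the booklet does not state a rounding rule for them.

```python
# Part of Table 13B: (years, months) of solar age -> Ba correction.
# Only ages 11-0 through 14-0 are copied here to keep the sketch short.
BA_CORRECTION = {
    (11, 0): 7, (11, 2): 6, (11, 4): 6, (11, 6): 5, (11, 8): 4, (11, 10): 3,
    (12, 0): 3, (12, 2): 2, (12, 4): 1, (12, 6): 0, (12, 8): -1, (12, 10): -1,
    (13, 0): -2, (13, 2): -2, (13, 4): -3, (13, 6): -4, (13, 8): -4,
    (13, 10): -5, (14, 0): -6,
}


def pupil_ba(ta: int, years: int, months: int) -> int:
    """Add the signed Ba correction for the pupil's solar age to his Ta."""
    key = (years, months)
    if key not in BA_CORRECTION:
        # The source gives no rule for ages between tabled entries.
        raise ValueError(f"age {years}-{months} is not in this part of Table 13B")
    return ta + BA_CORRECTION[key]
```

 With the first pupil's figures, pupil_ba(40, 13, 2) returns 38, as in the worked example.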
 
 TABLE 13B 

 Solar Age  Add to  | Solar Age  Add to  | Solar Age  Add to  | Solar Age  Add to 
 Yrs.-Mos.  T Score | Yrs.-Mos.  T Score | Yrs.-Mos.  T Score | Yrs.-Mos.  T Score 

   7-6        34    |  10-2        11    |  12-8        -1    |  15-2       -13 
   7-8        32    |  10-4        10    |  12-10       -1    |  15-4       -15 
   7-10       31    |  10-6         9    |  13-0        -2    |  15-6       -16 
   8-0        29    |  10-8         8    |  13-2        -2    |  15-8       -17 
   8-2     [illeg.] |  10-10        8    |  13-4        -3    |  15-10      -19 
   8-4        25    |  11-0         7    |  13-6        -4    |  16-0       -20 
   8-6        24    |  11-2         6    |  13-8        -4    |  16-2       -21 
   8-8        22    |  11-4         6    |  13-10       -5    |  16-4       -23 
   8-10       21    |  11-6         5    |  14-0        -6    |  16-6       -24 
   9-0        19    |  11-8         4    |  14-2        -7    |  16-8       -26 
   9-2        18    |  11-10        3    |  14-4        -7    |  16-10      -28 
   9-4        17    |  12-0         3    |  14-6        -8    |  17-0       -31 
   9-6        16    |  12-2         2    |  14-8        -9    |  17-2       -33 
   9-8        14    |  12-4         1    |  14-10      -11    |  17-4       -35 
   9-10       13    |  12-6         0    |  15-0       -12    |  17-6       -37 
  10-0        12    | 
 
 
 vi. How to Compute Approximate Solar Age 
 (For Use in China) 
 
 First, determine the pupil’s lunar age and the lunar month of 
 birth. Deduct 1 from his lunar age to get his basal age. Then 
 from the number of the lunar month in which the tests are given, 
 deduct the number of his lunar month of birth. If the resulting 
 number is positive, add that number of months to his basal age to 
 get his approximate solar age. For example, if the pupil is 15 
 yrs. old and was born in the 5th month, and if the tests are given 
 in the 8th month, his basal age is 15 - 1 = 14 yrs., and the number of 
 months is 8 - 5 = 3. Thus his approximate solar age will be 
 14 yrs. 3 mos. 
 
 In case the resulting number is negative, it means that the 
 pupil is not up to the supposed basal age. Then from this age 
 deduct the number of months deficient. Thus if a 15-year-old 
 pupil who was born in the 11th lunar month is tested in the 8th 
 lunar month, his basal age is 14 but he is deficient by 3 months 
 (8 - 11 = -3). So his solar age should be 14 yrs. minus 3 mos., 
 that is, 13 yrs. 9 mos. 
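
 Both cases of this rule reduce to one piece of month arithmetic, sketched below in Python. The function name approximate_solar_age is illustrative. The case where the two lunar months are the same is not mentioned in the booklet; the sketch simply returns the basal age for it, which is an assumption.

```python
def approximate_solar_age(lunar_age: int, birth_lunar_month: int,
                          test_lunar_month: int) -> tuple[int, int]:
    """Return the approximate solar age as (years, months).

    lunar_age is the age in years by lunar reckoning; the two month
    arguments are lunar month numbers (1 to 12).
    """
    basal_years = lunar_age - 1
    # A positive difference adds months; a negative one deducts them.
    month_difference = test_lunar_month - birth_lunar_month
    total_months = basal_years * 12 + month_difference
    return divmod(total_months, 12)
```

 The two worked examples above come out as approximate_solar_age(15, 5, 8) == (14, 3) and approximate_solar_age(15, 11, 8) == (13, 9).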
 
 vii. How to Compute Pupil Ca (Classification in 
 Arithmetic) 
 
 Find the pupil’s Ta in Table 13C and read the corresponding 
 Ga (Grade status in arithmetic). A Ga of 4.0, 4.5, or 4.9 means 
 that the pupil has an ability in arithmetic equal to the average 
 fourth-grade pupil at the beginning, middle, or end of the year 
 respectively. 
 
 To convert a Ga into a Ca add to or subtract from the Ga the 
 Ca correction shown below. Use the correction for the month 
 when the test was applied. Thus the first pupil’s Ta in Table 
 13D is 40. According to Table 13C this Ta is equivalent to a 
 Ga of 4.6. Since the test was applied December 10th, this is 
 nearest to the end of November, i.e., the 3rd month. The 
 correction for the 3rd month is + .2, which added to the Ga yields a 
 Ca of 4.8. Of course the correction is the same for all pupils 
 tested on December 10. For a school starting October 1, 
 December 10 is the 2nd month, and similarly for other starting dates. 
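
 Finding the month whose end lies nearest the test date can be sketched as follows. The function name school_month is illustrative, and the sketch assumes the school year begins on the first day of a calendar month. A September 1 start for the booklet's own example is inferred from its result (3rd month), not stated in the text. A date exactly midway between two month-ends is assigned to the earlier one, which is also an assumption.

```python
import calendar
from datetime import date


def school_month(test_date: date, school_start: date) -> int:
    """Return the number of the school month whose end is nearest test_date.

    Assumes school_start falls on the first day of a calendar month.
    """
    # School months already completed when the test month begins.
    completed = ((test_date.year - school_start.year) * 12
                 + (test_date.month - school_start.month))
    days_in_month = calendar.monthrange(test_date.year, test_date.month)[1]
    days_since_previous_end = test_date.day
    days_until_this_end = days_in_month - test_date.day
    if days_until_this_end < days_since_previous_end:
        return completed + 1
    return completed
```

 For a test given on date(1922, 12, 10), a start of date(1922, 9, 1) gives the 3rd month and a start of date(1922, 10, 1) gives the 2nd, matching the two cases in the text.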
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 End of Month |  1    2    3    4    5    6    7    8    9    10 
 Correction   |  [values illegible in the source, except + .2 for the 3rd month, as stated above] 
 
 
 
 
 
 
 TABLE 13C 
 
 
 
 Ta    Ga | Ta    Ga | Ta    Ga | Ta    Ga | Ta    Ga | Ta    Ga 
 yy ee ay Ph ied ie BB 
 22.0 (1 F2.841) 43.0 
 SHAY © 2:30 (245.0 
 2520 2. 4cadis 
 26,08 m2.52 144A 
 
 MAMNUUN 
 
 20,008 02.0019 4575 
 27.00 a7 eAO. 
 28.4 2.8 | 46.7 
 20.2.0 2.0014 753 
 30.0 3.0 | 48.0 
 
 Attn uw v1 
 
 30.7 V3 PAS ON Ole Ol.0 0.1.):65.7) 22:2.1776.5" 125.11) Boson 
 ST Ai 312 TOAO cup Orca Otel 0.2 166.0 |. 12.2:|/)76.0" P1521 OmOuno 
 32.04)) 3.301 440.5 Ona OtKe 9.3 | 66.3) 12.3 19713 0 X58 Oo eae 
 
 22.0 NMSA iSO. AO ies 0:4.166.6 9 °12.419977-7° 15:41 60,7 ato 
 33.7 3.5110 50.0 0 OLS MLOdns 9.5'}:66.8. )-12.5'| 48.1) “Ex.S OC fh emeaes 
 34.40 '3.0) (51.5 el6-On O50 0.6: )67.5.. 12.01) 78.5) 515.6 OCs amet 
 35: 3.7 |. 52.0 6.7) 01.77) 10.7167-4" 12.7 | 98.9) "15.71 OO aus 
 B05 Ss Ou) 5 27 Oe nO Lue 08) 67:7 32.81, 970.3') -15.2101.5 eee 
 36.5, 3.9 | 53.3. 6.9 ]/61.0) 9:9) 68.0 | 32.9 70.7 15,0 Ol. 7 teng 
 29.2 VA OP S307 CO mOsetan Ol.) OGL 13.0}: 80.1  )26,0}/02. Reo 
 
 87.5. 4.01, 54:2°) 7 Pe 62.3) 10.81 68.5° ) 13.1] 80.5 a akO.5 104s eee ee 
 28:3). 4:2 |) 54.9 1) 7-20-02 nt0,2 t.68,0. | 13:2) 80.0%, 1.10.2) On. mene 
 
 38.3) 4.351 55.2 >. 7.31 1162-7'| (10.3) 60.3°° 13.3) 81-3" 736.3) 03-3 0eaao 
 39-3 44] 55-7 74 |62.8 104/60.7 13.4) 81.7 16.4/03.7 194 
 30.057 4.571'50.00) 7 .588102.0 wa tO-51 70.1 13.5 |°82.5 7126.5) 04, tages 
 40.0) (4.61) 56:5 7.662.005 36.61 470.5 | 12.61) 82,500 10.0 Oa, see oe 
 40.4 4.7.) 57.0 7.7,|/63.1) | 10.7| 70.9, 13.7} 82.9. 16.7) 04.0) gto? 
 40.8) 4:8 S75 0 9.81) 63-2 yetO.8 191.3 6 13.8.1083 300 Oo re 
 41.2) 4:0°)'§80"" 7:0, \.63i4 8 10:01:71.7. 13.0) 82:74.9360.01 65; 70 
 AL. 9005.0 | 58.3. 8.0 
 
 6316 (411.0) 72,1 . 14.0) 84.1 "0° 27.0) OG,Oemeacas 
 
 viii. How to Compute Class Ta, Ba, and Ca 
 
 The Ta for the class, grade, or group is the mean of the pupils’ 
 Ta’s. In Table 13D the class Ta is 48.2. 
 
 To compute the class Ba, first compute the mean solar age for 
 the class, second, convert this into a Ba correction by the use of 
 Table 13B, third, add or subtract the Ba correction to or from 
 the class Ta. Thus the mean solar age for the class in Table 13D 
 is 12 yrs. 2 mos. According to Table 13B, this solar age corre- 
 sponds to a Ba correction of + 2. When 2 is added to the class 
 Ta, the resulting class Ba is 50.2 as shown in Table 13D. 
 
 To compute the class Ca, find the class Ta in Table 13C and 
 read the corresponding Ga. Add to or subtract from the Ga the 
 appropriate correction. Thus the class Ta of 48.2 corresponds 
 to a Ga of 6.0. A Ga of 6.0 plus a correction of .2 for the third 
 month gives a class Ca of 6.2. 
 
 
 
 
 
 
 
 
 
 TABLE 13D 
 CHINESE FUNDAMENTALS OF ARITHMETIC SCALE, FORM I 

 School No. 25        Grade VI        Date: December 10, 1922 

 Solar Age           Name     Ta      Ba      Ca 
 13 yrs. 2 mos.       A       40      38      4.8 
 12 yrs. 6 mos.       B       50      50      6.5 
 10 yrs. 7 mos.       C       53      62      7.1 
 11 yrs. 4 mos.       D       46      52      5.9 
 13 yrs. 5 mos.       E       52      48      6.9 

 12 yrs. 2 mos.               Ta 48.2 

                              Ba 50.2 

                              Ca 6.2 
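
 The class Ta and class Ba in Table 13D can be checked with a short Python sketch. The names PUPILS, class_ta, mean_solar_age, and class_ba are illustrative. Two assumptions are made: pupil E's age is read as 13 yrs. 5 mos. (the month digit is damaged in the source, and 5 is the reading that reproduces the printed mean of 12 yrs. 2 mos.), and the mean age is rounded to the nearest whole month before Table 13B is consulted.

```python
# (years, months, Ta) for pupils A to E in Table 13D.
# Pupil E's month is an assumed reading of a damaged digit.
PUPILS = [(13, 2, 40), (12, 6, 50), (10, 7, 53), (11, 4, 46), (13, 5, 52)]

# The few Table 13B entries needed near the class mean age.
BA_CORRECTION = {(12, 0): 3, (12, 2): 2, (12, 4): 1}


def class_ta(pupils):
    """Mean of the pupils' Ta scores, to one decimal place."""
    return round(sum(ta for _, _, ta in pupils) / len(pupils), 1)


def mean_solar_age(pupils):
    """Mean solar age as (years, months), rounded to the nearest month."""
    total_months = sum(12 * years + months for years, months, _ in pupils)
    return divmod(round(total_months / len(pupils)), 12)


def class_ba(pupils):
    """Class Ta plus the Ba correction for the class mean solar age."""
    return round(class_ta(pupils) + BA_CORRECTION[mean_solar_age(pupils)], 1)
```

 With these figures class_ta(PUPILS) is 48.2, mean_solar_age(PUPILS) is (12, 2), and class_ba(PUPILS) is 50.2, in agreement with the table.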
 
 
 
 
 ix. How to Interpret Pupil Ta and Class Ta 
 
 The number of examples correct is not a satisfactory unit of 
 measurement because the difference in difficulty between 30 and 
 31 examples correct may be greater or less than between 10 and 
 11 examples correct. The difference between 30 T and 31 T or 
 28 T and 29 T always equals the difference between 10 T and 
 11 T or 55 T and 56 T. 
 
 Again T scores make possible such statements as the following. 
 Any pupil or class whose T is 50 has an ability which equals the 
 mean ability of all twelve-year-old pupils. Any pupil or class 
 whose T is 70 has an ability which is 20 T (or 2 S. D.) above the 
 mean ability of twelve-year-olds. Any pupil whose T is 35 is 15 T 
 (or 1.5 S. D.) below the mean ability of twelve-year-olds. 
 
 Again, T scores may be interpreted as shown in Table 13E. 
 
 
 
 TABLE 13E 

               Is Exceeded by the  |               Is Exceeded by the 
               Following Per Cent  |               Following Per Cent 
 A T Score of  of 12-Year-Olds     | A T Score of  of 12-Year-Olds 
      25             99            |      55             31 
      30             98            |      60             16 
      35             93            |      65              7 
      40             84            |      70              2 
      45             69            |      75              1 
      50             50            |      80              0.1 
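
 The per cents in Table 13E agree with the upper tail of a normal curve having a mean of 50 T and an S. D. of 10 T. The sketch below assumes that this is how the table was built; the function name per_cent_exceeding is illustrative.

```python
import math


def per_cent_exceeding(t_score: float) -> float:
    """Per cent of twelve-year-olds expected to exceed a given T score.

    Assumes T scores follow a normal curve with mean 50 and S. D. 10.
    """
    z = (t_score - 50) / 10
    # Upper-tail area of the standard normal curve, as a per cent.
    return 100 * 0.5 * math.erfc(z / math.sqrt(2))
```

 Rounded to whole numbers, per_cent_exceeding(40) gives 84 and per_cent_exceeding(60) gives 16, as in the table.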
 
 
 x. How to Interpret Pupil Ba and Class Ba 
 
 The Ba norm is always 50 for all pupils. If a pupil’s Ba is 
 50, his arithmetic ability equals the mean ability of all pupils of 
 like age. He is of average brightness. If his Ba is 40 he is 10 T 
 (or 1 S. D.) below the mean brightness in arithmetic of his own 
 age group. According to Table 13E he is exceeded by 84 per cent, 
 not of 12-year-olds, but of pupils of like age. If his Ba is 75, he 
 is 25 T (or 2.5 S. D.) above the mean brightness in arithmetic of 
 pupils of like age. According to Table 13E, he is extremely 
 bright, since only 1 per cent of his own age group are brighter. 
 In like manner the mean Ba for a class shows the brightness in 
 arithmetic of that class as a whole as compared with the brightness 
 of all other classes, not of like grade, but of like age. 
 
 Thus both Ta and Ba are needed. Ta gives a measure of total 
 arithmetic ability and incidentally shows how much each pupil or 
 class Ta is above or below the mean Ta of twelve-year-olds. A 
 Ta scale is used primarily for the purpose of measuring growth in 
 ability from month to month and year to year. 
 
 But a nine-year-old pupil or class might have a Ta much below 
 50 and still be doing exceptionally satisfactory work. There is 
 needed some score which makes allowance for the fact that a pupil 
 or class is younger or older than twelve. The Ba correction 
 automatically makes just this allowance, and the Ba shows pupil 
 or class ability in comparison with pupils or classes of the same 
 age. A young pupil may have a small Ta and a large Ba and an 
 old pupil may have a large Ta and a small Ba. A pupil or class 
 Ta grows larger from month to month and year to year, whereas 
 the Ba changes little or not at all. 
 
 xi. How to Interpret Pupil Ca and Class Ca 
 
 For a pupil to have a Ca of 3.5 means that he is an average 
 third-grade pupil in the fundamentals of arithmetic. A Ca of 3.0 
 means that he barely belongs in the third grade. A Ca of 3.9 
 means that he is almost, but not quite, ready to be promoted into 
 fourth-grade work in the fundamentals of arithmetic. A Ca of 
 6.4 means that he just fails of being an average sixth-grade pupil. 
 The class Ca is interpreted similarly. 
 
 Since the pupils in Table 13D are sixth-grade pupils their norm 
 Ca is 6.5 and will continue to be 6.5 so long as they remain in 
 Grade VI. It jumps to 7.5 as soon as a pupil is promoted to the 
 next grade. The first pupil is 1.7 Ca or grade below norm. The 
 second pupil is exactly at the Ca norm. The class is 0.3 Ca below 
 the Ca norm. 
 
 xii. Supplementary Diagnostic Scoring 
 
 On the front page of the test paper, write in the space after 
 “Attempts,” the number of the example circled by the pupil. 
 This may be taken as a measure of his speed of work. Write in 
 the space after “Rights” the number of examples done correctly 
 inclusive of and prior to the example circled. A comparison of 
 Rights and Attempts shows the per cent of accuracy. Some pupils 
 are slow and inaccurate, some slow and accurate, some fast and 
 inaccurate, and some fast and accurate, and some are average. 
 Each type requires different treatment. 
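
 The comparison of Rights and Attempts is a simple ratio, sketched below. The function name per_cent_accuracy is illustrative; the booklet does not give a formula beyond saying that the comparison shows the per cent of accuracy.

```python
def per_cent_accuracy(rights: int, attempts: int) -> float:
    """Per cent of attempted examples that were done correctly.

    attempts is the number of the example circled at the 10-minute
    signal; rights is the number correct up to and including it.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= rights <= attempts:
        raise ValueError("rights must lie between 0 and attempts")
    return 100 * rights / attempts
```

 A pupil who circled example 24 and had 18 right up to that point has per_cent_accuracy(18, 24) == 75.0.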
 
 There are 20 examples for each of the four processes. Count 
 separately the number of examples done correctly on each process, 
 and write these scores in the spaces provided on the front page of 
 the test paper. If the pupil has mastered each of the processes 
 equally well his four separate scores should be approximately 
 equal in size. 
 
 An even more helpful diagnosis can be secured by making out, 
 or having the pupils make out, a table showing just what examples 
 were missed or omitted by each pupil. From this the per cent of 
 pupils missing or omitting each example can be readily deter- 
 mined. Each pair of examples (1 and 2, 3 and 4, etc.) are built 
 to test a pupil’s mastery of a certain type principle or difficulty. 
 As a rule, each pair of examples includes the difficulties of all 
 preceding pairs and one additional difficulty. Two examples of 
 each type are included because a chance error may cause a pupil 
 to miss an example whose principle he has really mastered. 
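
 The table of missed examples can be summarized mechanically. In the sketch below the record format (a mapping from pupil name to the set of example numbers missed or omitted) and both function names are hypothetical; the figure of 80 examples follows from the four processes of 20 examples each.

```python
def per_cent_missing(missed_by_pupil, n_examples=80):
    """Per cent of pupils missing or omitting each example.

    missed_by_pupil maps a pupil's name to the set of example numbers
    he missed or omitted.
    """
    n_pupils = len(missed_by_pupil)
    counts = {example: 0 for example in range(1, n_examples + 1)}
    for missed in missed_by_pupil.values():
        for example in set(missed):
            counts[example] += 1
    return {example: 100 * count / n_pupils for example, count in counts.items()}


def pairs_missed(missed, n_examples=80):
    """First example number of each pair (1-2, 3-4, ...) missed entirely.

    Missing both members of a pair suggests the difficulty itself has
    not been mastered, rather than a chance error.
    """
    missed = set(missed)
    return [first for first in range(1, n_examples, 2)
            if first in missed and first + 1 in missed]
```

 For instance, with four pupils of whom three missed example 4, per_cent_missing reports 75.0 for that example.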
 
 Once each pupil’s need has been discovered in these ways, he 
 can be given training on his specific weaknesses. A specially 
 effective set of practice materials for giving this training is being 
 prepared by the Nanking Committee for publication by the Com- 
 mercial Press, Shanghai. Under no circumstances should a pupil 
 be especially drilled on the particular examples of this test. The 
 teacher who does this destroys the usefulness of the test as a 
 measuring instrument. 
 
 Since diagnostic scores are intended for local use rather than 
 for publication, tables have not been provided for scaling them. 
 
 xiii. Accuracy of Scale Scoring 
 
 The accuracy of scale scores depends upon (1) the way in 
 which pupils to be tested were selected, and (2) the number of 
 pupils tested. The pupils tested were a random sampling from the 
 total population in grades III through VIII in the government 
 schools of Peking and Tientsin. The number tested was ap- 
 proximately 2000. 
 
 xiv. Acknowledgments 
 
 These arithmetic scales were prepared by the Peking Committee 
 consisting of Professors L. C. Cha, C. Y. Chang, Y. C. Chang, 
 T. T. Lew, E. L. Terman, Wm. A. McCall, their students, and 
 Lydia Sherritt, under the auspices of the National Association 
 for the Advancement of Education. 
 
 The units of measurement used in these scales were devised by 
 Dr. Wm. A. McCall and named by him in honor of those whose 
 contribution to scientific mental measurement has been of most 
 fundamental significance. 
 
 T (Total ability) is for Thorndike, the originator and teacher 
 of scientific educational measurement and author of the first 
 College Entrance Intelligence Test, and for Terman the author of 
 the Stanford Revision of the Binet-Simon scale and leading ex- 
 ponent of the age-scale system. 
 
 B (Brightness) is for Binet, the creator, with Simon, of the first 
 intelligence scale, and for Buckingham the creator of the grade- 
 scale system. 
 
 C (Classification) is for Courtis, an early pioneer in educational 
 measurement and originator of practice tests, and for Cattell who 
 with Fullerton laid the foundation built upon by Hillegas in con- 
 structing the first statistically satisfactory product scale and in 
 remembrance of China where this unit was first devised and used 
 as such. 
 
 F (Effort) is for Franzen, Pintner and Monroe, all of whom 
 published at about the same time a practical mechanism for meas- 
 uring achievement as related to capacity to achieve. This unit 
 is used only when both an intelligence and educational test have 
 been given. 
 
 W. T. Tao, General Director of the Association. 
 
 V. SUMMARY OF THE STEPS IN THE PROCESS OF CONSTRUCTING, 
 SCALING, AND STANDARDIZING A TEST 

 I. Difficulty Test 

 1. Decide upon the mental trait to be measured and 
 define it as exactly as possible. 
 
 
 2. Decide upon a test form and general content which 
 will measure this trait and this trait only, which will yield 
 one and only one correct and easily scored pupil response 
 to each test element, and where each element may be scored 
 as either right, wrong, or omitted. 
 
 3. Decide upon the range of ability to be measured. 
 
 4. Consult previous tests of this trait or similar traits 
 to determine how easy and how difficult the test elements 
 must be made, how simple the directions must be, and 
 what is a suitable mechanical arrangement of material for 
 mimeographing or printing. 
 
 5. If no such test exists prepare a tentative set of direc- 
 tions and a few tentative test elements and try them on a 
 few of the ablest and least able pupils ever likely to be 
 tested. 
 
 6. Prepare a test, which is as perfect in every detail 
 as possible, which advances by gradual steps of difficulty 
 from slightly easier to slightly more difficult than will be 
 required in the final test, and which has about one-fourth 
 more content than will be required in the final test (unless 
 the test is for diagnostic purposes in which case only the 
 material to be used finally should be used). 
 
 7. Make provision for the following identification data: 
 (1) First name, (2) Last name, (3) Sex, (4) Age in years, 
 (5) Birth month, (6) Birthday, (7) School, (8) Grade, 
 (9) Section, (10) Date of test. 
 
 8. Prepare sample and directions for pupils. For gen- 
 eral directions to examiner, see Section III of this chapter. 
 
 9. Explain and apply the test to several intelligent 
 adults and correct it in the light of their criticisms. 
 
 10. Apply the test to about 110 pupils scattered over 
 the entire range of ability of pupils for whom the test is 
 designed. Be sure to include some of the ablest and least 
 able pupils ever to be tested with the completed test. Give 
 all the time pupils need to do every test element or to do all 
 they can. Record on his paper the time required by each 
 pupil. 
 
 
 11. Make out a list of correct answers, a mechanical 
 device for scoring, and directions for scoring. 
 
 12. Score each test element, using 1 for correct, x for 
 wrong, and o for omitted. 
 
 13. Eliminate from the test all elements which prove 
 ambiguous, unscorable, or are otherwise unsatisfactory. 
 
 14. Discard enough tests to leave 100. Do not dis- 
 card the best and poorest papers. 
 
 15. Compute the total score made by each pupil on 
 the odd numbered questions and then on the even num- 
 bered questions. 
 
 16. Make a correlation diagram for these two sets of 
 scores. Call in for a conference those pupils who are 
 chiefly responsible for lowering the correlation. Go over 
 each element tried and missed by them to see if some 
 ambiguity or other defect is responsible. Correct or elim- 
 inate test elements if defects are brought to light. 
 
 17. Make a correlation diagram for the total score of 
 each pupil on the total test and the criterion (if such be 
 available). Confer and correct as before. 
 
 18. Call in a few of the most gifted pupils and enquire 
 the reason why various test elements were missed by them. 
 Correct or eliminate elements if defects are brought to 
 light. 
 
 19. Tabulate, by pupils and remaining test elements, the 
 1’s, x’s, and o’s, thus, for the 100 papers: 

                              TEST ELEMENTS 

 Name               1    2    3    4    5    6    7    8    9    10   etc. 
 [illegible]        1    1    1    x    1    1    x   [?]   o    o    etc. 
 [illegible]        1   [?]  [?]  [?]   1    x    x    o    x    o    etc. 
 etc.              etc. etc. etc. etc. etc. etc. etc. etc. etc. etc. 

 Total Correct     __   __   __   __   __   __   __   __   __   __ 
 T Difficulty      __   __   __   __   __   __   __   __   __   __ 
 
 20. Compute, from the preceding tabulation, the number 
 and per cent of pupils doing correctly each test element. 
 Since there are 100 pupils the “Total correct” will also be 
 the per cent required. This will not be true when the 
 pupil has a 50-50 opportunity of getting an element correct 
 by chance. In this case, subtract from the total of 
 1’s on each element, the total x’s, and divide the remainder 
 by 100. The quotient will be the proper per cent 
 correct. 
 
 21. Convert each per cent into an S.D. value or T diffi- 
 culty by means of Table 7. 
 
 22. Arrange test elements in order of T difficulty. 
 
 23. In view of the time records on the test and the 
 time decided upon for the final test, decide upon the number 
 of test elements required in order that the fastest pupil 
 will not quite finish the test before time is called. In 
 deciding upon the time allowance for the final test, due con- 
 sideration should be given to practicality and to reliability. 
 In general do not be satisfied with a reliability (Self r) 
 of less than .85 between the two halves of the test. Other 
 things being equal, an abbreviated test means a low re- 
 liability. Hence if the self r is too low, lengthen the time 
 allowance, and increase the number of test elements or 
 provide for two tests to be averaged instead of one longer 
 test. 
 
 24. Select the number of test elements decided upon. 
 Select in such a way that the successive elements will increase, 
 so far as possible, by equal increments of T difficulty 
 from one done correctly by about 99 per cent of the pupils 
 to one done correctly by about 1 per cent of the pupils. 
 If the elements available are too easy or too difficult try 
 out and incorporate additional elements of the desired diffi- 
 culty. Sometimes diagnostic or other considerations should 
 weigh more heavily than difficulty or time-allowance con- 
 siderations in determining the final content of a test. In 
 this case the test constructor must use his judgment to 
 decide how much alteration of the test content is per- 
 missible. 
 
 25. Improve the mechanical make-up of the test and 
 directions for applying it in any way that experience 
 suggests. 
 
 26. Print the test in final form. 
 
 27. To test the satisfactoriness of the proposed time 
 allowance, apply the test to the ablest class ever likely to 
 be tested. Have pupils circle the number of the test element 
 being worked upon at the end of regular intervals. Stop 
 the test the moment the fastest pupil finishes. Record this 
 time. 
 
 28. Determine the total score made by all pupils com- 
 bined during each of the successive time intervals. 
 
 29. Fix an official final time allowance such that at its 
 expiration the fastest pupil would not quite have finished 
 and the ablest pupil would have done all he could. Adopt 
 for future use the minimum time that would have accom- 
 plished these two objects. 
 
 30. Apply the test to about 2000 pupils in the grades 
 for which the test is designed. The schools selected for 
 testing should approximate as closely as possible a random 
 sampling of all schools. In the schools selected, all pupils 
 in the appropriate grades should be tested. 
 
 31. Score the tests and compute the total score made 
 by each pupil. In scoring it is usually more convenient to 
 give one point for each element done correctly, but this is 
 not imperative. Some prefer to give 2, 1, or 0 credits to an 
 element according to the excellence of the pupil’s answer. 
 The resulting increase in accuracy is seldom worth the 
 extra trouble. Elements of large enough scope to justify 
 extra points can usually be broken into two or more sepa- 
 rate elements. Do not assign points proportional to the 
 difficulty of an element. This involves a cumulative error. 
 
 32. Make a frequency distribution of scores for each 
 grade, and then for each age. Make all frequency distribu- 
 tions in step intervals the size of the smallest scoring unit. 
 This is usually one. 
 
 33. Using 8.0 to 9.0, 12.0 to 13.0, or 16.0 to 17.0 year-olds 
 for primary, higher elementary, or high school, respectively, 
 convert these raw scores into T scores by means of 
 Table 7, and as illustrated in Table 6. 
 
 34. If thought desirable, increase the range of the T 
 scale by a process illustrated in Table 8. 
 
 35. Construct a B scale for the test by a process illustrated 
 in Table 10. 
 
 36. Construct a C scale for the test. 
 
 37. Prepare the official directions booklet to be issued 
 with the test. In order to secure uniformity, a sample direc- 
 tions booklet is given in Section IV of this chapter. 
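
 Steps 15 and 16 (odd and even half scores and their correlation) and steps 20 to 22 (per cent correct and T difficulty) lend themselves to a short Python sketch. All function names are illustrative. Table 7 is not reproduced in this section, so the conversion to T difficulty assumes that it follows the normal curve with mean 50 and S. D. 10, which is what the per cents quoted elsewhere in this chapter (99.865 for 20, 84.13 for 40, 50 for 50) imply. The chance-corrected figure is returned as a per cent; with exactly 100 pupils it equals rights minus wrongs.

```python
from statistics import NormalDist, mean


def split_half_scores(marks):
    """Totals on odd- and even-numbered elements for one pupil.

    marks lists the pupil's results in test order, using "1" for
    correct, "x" for wrong, and "o" for omitted (step 12).
    """
    odd = sum(1 for i, m in enumerate(marks, start=1) if i % 2 == 1 and m == "1")
    even = sum(1 for i, m in enumerate(marks, start=1) if i % 2 == 0 and m == "1")
    return odd, even


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def per_cent_correct(rights, wrongs, n_pupils, two_choice=False):
    """Per cent correct on one element, corrected for chance if two_choice."""
    net = rights - wrongs if two_choice else rights
    return 100 * net / n_pupils


def t_difficulty(pct_correct):
    """T difficulty of an element passed by pct_correct per cent of pupils.

    Assumed normal-curve conversion; undefined for 0 or 100 per cent.
    """
    return 50 + 10 * NormalDist().inv_cdf(1 - pct_correct / 100)
```

 Sorting the elements by t_difficulty then gives the order called for in step 22, and pearson_r applied to the odd and even totals of the 100 pupils gives the self r that step 23 compares with .85.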
 
 II. Rate Test 

 1. Do steps I1, I2, I3, I4 except that all elements of 
 the test should be of uniform or approximately uniform 
 difficulty, I5, I6 except the statement concerning gradually 
 increasing difficulty, I7, I8, I9, I10 except that there should 
 be a fixed time allowance instead of a fixed number of elements 
 to be done, I11, I12, I13, I14, I15, I16, I17, I18, I19 
 for a few representative test elements only to see whether 
 the test elements are on the desired difficulty level, I20, I21, 
 I23, I24 except for all reference to difficulty, I25, I26, I30, 
 I31, I32, I33, I34, I35, I36, and I37. 
 
 2. Since rate tests usually yield two scores, namely num- 
 ber tried and accuracy, T, B, and C scales may be con- 
 structed for both, or for just number right only, or for a 
 properly weighted combination of number tried and number 
 right. 
 
 III. Product Tests Such as Handwriting, Composition, and 
 Drawing 

 1. Do I1, I2 except that product tests are usually scored 
 as a whole rather than by separate elements, I3, I4, I5, I6 
 except for the references to difficulty, I7, I8, I9, I10 except 
 that there should be a fixed time limit, and, in the case of 
 traits like composition and drawing, a warning a few minutes 
 before time is called. 
 
 
 2. Repeat I10 on the same group of pupils so as to 
 secure two measures of the trait. 
 
 3. Do I14 for both sets of products. 
 
 4. Rate 1 the poorest specimen in the first set. Rate 2 
 the next poorest and so on to 100. Have this done by, say, 
 three competent judges. Average the three judgments to 
 get the final rating for each specimen. 
 
 5. Repeat III4 for the second set of specimens. 
 
 6. Do I16 for these two sets of ratings, and I17 for 
 either set or both. If the self r is too low, increase the time 
 allowance or provide for two or more tests to be averaged 
 and treated as one. 
 
 7 DOds sale Omande 20, 
 
 8. Pick out all specimens written by pupils of ages 8.0 
 to 9.0, or 12.0 to 13.0, or 16.0 to 17.0 depending upon the 
 level for which the test is designed. Age 12.0 to 13.0 will 
 serve fairly well for all levels. Write on each specimen a 
 number without regard to its merit. 
 
 9. Separate the papers into ten piles—A (poorest), 
 B (next poorest), C, D, E, F, G, H, I and J (best)— 
 according to the merit of each specimen. 
 
 10. Take pile A and divide it into 5 piles—a (poorest), 
 b, c, d, and e (best )—according to merit. 
 
 11. Do III10 for the other nine piles. 
 
 12. Take pile Aa and arrange the papers in it in order 
 of merit. 
 
 13. Do III12 for Ab, Ac, Ad, Ae, Ba, Bb, and so on for the 
 50 separate piles. 
 
 14. Carefully compare the few best specimens in Aa with 
 the few poorest specimens in Ab. If the order of merit is 
 not correct rearrange across the junction point. Repeat 
 this process for the other 48 junction points. 
 
 15. On a record sheet, write down in order of merit the 
 number of each specimen. After the number of the poorest 
 specimen, mark 1. After the number of the next poorest, 
 mark 2, and so on for all specimens. 
 
 16. Have at least three competent judges do steps III9, 
 III10, III11, III12, III13, III14, and III15 without knowledge 
 of each other’s marks. 
 
 17. Compute the mean of the three marks given each 
 specimen by the three judges. Arrange specimen numbers 
 in order of merit according to these means. 
 
 18. Check that specimen number where the per cent 
 exceeding-plus-half-those-reaching-it in merit is nearest 
 99.865. According to Table 7, this specimen has a merit 
 of 20. Check the one where the per cent is nearest 99.38. 
 This has a merit of 25. The other per cents to check are 
 shown in the first row of the following. The T merit of the 
 specimen checked is shown in the second row. If only half 
 this number of specimens are desired in the final scale, use 
 those per cents whose T merits are 20, 30, 40, 50, 60, 70 
 and 80. If more specimens are desired in the final scale, 
 Table 7 will show which per cents will yield equal intervals 
 of T merit. 
 
 Per cent .....  99.865   99.38   97.72   93.32   84.13   69.15 
 T merit ......    20       25      30      35      40      45 
 Per cent .....    50     30.85   15.87    6.68    2.28     .62     .13 
 T merit ......    50       55      60      65      70      75      80 
 
 19. After checking these 13, say, specimen numbers, 
 check also the five specimens immediately preceding each 
 in merit and the five immediately following each in merit. 
 This will give 13 sets—N, O, P, Q, R, S, T, U, V, W, X, 
 Y, and Z—of eleven specimens each. Mix up the specimens 
 within each set. 
 
 20. Ask a large number of judges to arrange in order 
 of merit the specimens in set N, and record in order the 
 specimen numbers, together with marks 1 through 11. The 
 previous rating by three judges can be utilized. 
 
 21. Repeat III20 for the other twelve sets. 
 
 22. Compute the mean of all these marks given each 
 specimen. 
 
 23. Guided by these means, choose from set N the specimen 
 most central in merit. This is the specimen most 
 entitled to the T merit of 20. Do likewise for sets O, P, Q, 
 etc., and give to each, T merits of 25, 30, 35, etc., respectively. 
 These 13 specimens together with their T merits 
 constitute a product-scoring scale, which may be used to 
 determine the T score in handwriting made by any pupil. 
 All that is necessary is to move the pupil’s specimen along 
 this scale until a scale specimen is found which is like it in 
 merit. The pupil’s T score is the T merit of the scale specimen 
 most like it in merit. 
 
 24. Have at least three competent judges score each of 
 the 2000 specimens originally collected by comparing it with 
 the specimens in this product-scoring scale. Consider that 
 each pupil’s T score is the mean of these three ratings. 
 
 25. Do I32 for each of the grades, and for each of the 
 ages, except age 12.0 to 13.0. 

 26. Do I35, I36, and I37. 
 
 27. A much more laborious and, for purposes of pure 
 research, perhaps more satisfactory method of constructing 
 a product-scoring scale is described in Chapter IX, Sec- 
 tion IV of “How to Measure in Education.” 
 
 If this more laborious method of product-scale construction 
 is used, omit steps III8 through III23. Do III24, 
 III25 not excepting ages 12.0 to 13.0, I33, I34, I35, I36, 
 and I37. 
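
 Steps III17 and III18 (averaging the judges' marks and checking the specimens nearest the tabled per cents) can be sketched as follows. The names TARGET_PER_CENTS, mean_marks, and check_specimens are illustrative, and the sketch assumes no two specimens receive the same mean mark, so that exactly one specimen "reaches" each position and counts as half.

```python
from statistics import mean

# Per cent exceeding-plus-half-those-reaching -> T merit, from step III18.
TARGET_PER_CENTS = {
    20: 99.865, 25: 99.38, 30: 97.72, 35: 93.32, 40: 84.13, 45: 69.15,
    50: 50.0, 55: 30.85, 60: 15.87, 65: 6.68, 70: 2.28, 75: 0.62, 80: 0.13,
}


def mean_marks(judge_marks):
    """Mean mark per specimen; judge_marks maps specimen number to marks."""
    return {specimen: mean(marks) for specimen, marks in judge_marks.items()}


def check_specimens(means):
    """Pick, for each T merit, the specimen nearest the target per cent."""
    ordered = sorted(means, key=means.get)  # poorest specimen first
    n = len(ordered)
    # Specimens ranked above this one, plus half of this one, as a per cent.
    per_cent = {specimen: 100 * (n - rank + 0.5) / n
                for rank, specimen in enumerate(ordered, start=1)}
    return {t_merit: min(ordered, key=lambda s: abs(per_cent[s] - target))
            for t_merit, target in TARGET_PER_CENTS.items()}
```

 With only 100 specimens the extreme targets all fall on the poorest or best paper; the 2000 specimens of step I30 give much finer spacing.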
 
 IV. Battery of Tests 

 1. Prepare each of the difficulty, rate, or product tests 
 entering into the battery up to, but not including, step I26, 
 in so far as these 25 steps apply to the construction of each 
 type. If there are product tests, construct, besides, a 
 product-scoring scale for each, based upon about 1000 specimens 
 collected from 1000 unselected pupils between the ages 
 8.0 and 9.0, 12.0 and 13.0, or 16.0 and 17.0. 
 
 2. Prepare all these component tests from data collected 
 from the same 1oo pupils. If tests are merely being com- 
 piled and were carried through the preliminary stages pre- 
 viously, then apply them all to the same too pupils. 
 
 3. Compute the total score on each test separately made 
 
Experimental Measurements 139 
 
 by these 100 pupils on the basis only of the test elements 
 selected for the final form of the test. 
 
 4. Make a separate frequency distribution of the 100 
 scores on each test. 
 
 5. Compute the SD of each frequency distribution. 
 
 6. If all tests in the battery are to have equal weight, 
 choose a multiplier for each SD such that all SD’s will 
 be made approximately alike in size. For example: 
 
 SD            4     2     8     …
 Multiplier    1     2     ½     3

 If all tests are not to have equal weight, choose multipliers 
 which will bring the SD’s to the desired ratio. Choose 
 multipliers such that the labor of applying them will be the 
 least possible. 
 
 7. Print the tests in booklet form. Insert the multipliers 
 on the front page of the booklet, thus: 
 
 Test      Points      Multiplier      Weighted Points
 1                     1
 2                     2
 3                     ÷2
 4                     3
 Total
 
 8. Do all three of I27, I28, and I29 for each difficulty
 test in the battery.

 9. Do I30 for the battery booklet.

 10. Do I31 for each of the battery tests.

 11. Compute for each pupil the total weighted points as
 indicated in IV7.

 12. Do all of I32, I33, I34, I35, and I36 for the total
 weighted points.

 13. Do I37 for the battery.
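
 The weighting described in steps IV5 through IV11 can be sketched in code. What follows is only an illustration under stated assumptions: the scores are invented, the function names are hypothetical, and the menu of "convenient" multipliers is an assumption standing in for the advice to choose multipliers that are easy to apply.

```python
from statistics import pstdev

# Hypothetical menu of multipliers that are easy to apply by hand.
CONVENIENT_MULTIPLIERS = [0.25, 0.5, 1, 2, 3, 4]


def choose_multipliers(scores_by_test, target_sd):
    """For each test, pick the convenient multiplier that brings its SD nearest the target."""
    multipliers = {}
    for name, scores in scores_by_test.items():
        sd = pstdev(scores)  # SD with N in the denominator, as in the text
        multipliers[name] = min(
            CONVENIENT_MULTIPLIERS, key=lambda m: abs(m * sd - target_sd)
        )
    return multipliers


def weighted_total(points, multipliers):
    """Total weighted points for one pupil (step IV11)."""
    return sum(points[name] * multipliers[name] for name in points)


# Invented scores whose SD's are 4, 2, and 8 respectively.
scores = {
    "Test 1": [6, 14, 6, 14],
    "Test 2": [8, 12, 8, 12],
    "Test 3": [2, 18, 2, 18],
}
multipliers = choose_multipliers(scores, target_sd=4)
pupil_total = weighted_total({"Test 1": 10, "Test 2": 7, "Test 3": 12}, multipliers)
```

 With these invented scores the multipliers come out as 1, 2, and ½, so that every weighted SD equals 4.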
 
CHAPTER VI 
 
 COMPUTATIONS FOR THE ONE-GROUP 
 EXPERIMENTAL METHOD 
 
 Computation Model I.—The purpose of this chapter is 
 to give and explain a series of computation molds into 
 which the experimenter may fit his experimental data. 
 Enough such models are given to provide for all the com- 
 mon varieties of experiments. Thus all the experimenter 
 needs to do is to find the mold which fits his experiment, 
 substitute in it his experimental data, do the computations 
 indicated, and the proper conclusions and the reliability of 
 these conclusions will follow automatically. 
 
 The simplest type of experiment is the one-group experiment,
 where two experimental factors are contrasted, and
 where only one type of test is used to measure the change
 produced by the experimental factors. The computation
 mold for this experimental method is given in Table 14.

 TABLE 14
 COMPUTATION MODEL I

 One Group - Two EF’s - One Test Type

            Group A-EF1                        Group A-EF2
 P    IT1   FT1   C1    x    x²          IT1   FT1   C2    x    x²
 N                M1         Sx²                     M2         Sx²
                  AM                                 AM
                  c                                  c

 For each half of the table:

 $SD = \sqrt{\frac{Sx^2}{N} - c^2}$

 $SDM1 = \frac{SD}{\sqrt{N}}$ (left half), $SDM2 = \frac{SD}{\sqrt{N}}$ (right half)

 SUMMARY

          EF1   EF2   D         SDD                            EC
 Test 1   M1    M2    M1 - M2   $\sqrt{(SDM1)^2 + (SDM2)^2}$   $\frac{D}{2.78\,SDD}$


 Illustration of Computation Model I.—Table 14 is best
 explained by formulating an experimental problem which
 may be solved by means of the one-group experimental
 method, and then substituting sample data in computation
 model I. Assume this problem: What is the effect of a
 defined amount of vigorous physical exercise upon the pulse
 rate of pupils? This problem may be solved by the one-group
 method. There are two EF’s, namely, vigorous
 physical exercise (EF1) and the absence of such exercise
 (EF2).

 TABLE 15

 ILLUSTRATING HOW TO USE COMPUTATION MODEL I WITH SAMPLE DATA, WHEN EF2 IS
 THE MERE ABSENCE OF EF1

 One Group - Two EF’s - One Test Type

            Group A-EF1                        Group A-EF2
 P    IT1   FT1   C1    x    x²          IT1   FT1   C2    x    x²
 a     95   105   10    2     4           95    95    0    0     0
 b    100   105    5    3     9          100   100    0    0     0
 c    101   109    8    0     0          101   101    0    0     0
 d     97   106    9    1     1           97    97    0    0     0
 e    102   109    7    1     1          102   102    0    0     0
 f     96   108   12    4    16           96    96    0    0     0
 g     99   107    8    0     0           99    99    0    0     0
 h     98   107    9    1     1           98    98    0    0     0
 i    100   111   11    3     9          100   100    0    0     0

 N = 9      M1 = 8.8      Sx² = 41        M2 = 0       Sx² = 0
            AM = 8.0                      AM = 0
            c = 0.8                       c = 0

 Left half: $SD = \sqrt{\frac{41}{9} - (0.8)^2} = 2.0$, $SDM1 = \frac{2.0}{\sqrt{9}} = 0.7$

 Right half: $SD = \sqrt{\frac{0}{9} - (0)^2} = 0$, $SDM2 = \frac{0}{\sqrt{9}} = 0$

 SUMMARY

          EF1   EF2   D     SDD                               EC
 Test 1   8.8   0     8.8   $\sqrt{(0.7)^2 + (0)^2} = 0.7$    $\frac{8.8}{2.78 \times 0.7} = 4.6$
 
 Table 15 reproduces model I in statistical form. Unless
 the formula especially demands something else, all computations
 at all stages are done to the nearest first decimal
 only, so as to make it easier for the student to check computations.
 Greater exactness is advised in actual experimental
 computations.
 
 Computation of Changes Produced by EF1.—Since a 
 thorough mastery of the symbols, abbreviations, and com- 
 putations shown in Table 14 and illustrated in Table 15 is 
 essential to an understanding of all subsequent experi- 
 mental computations, the data of these two tables are ex- 
 plained in considerable detail. 
 
 Both Table 14 and Table 15 show the experimental computations
 for any one-group experiment contrasting two
 EF’s and employing only one type of test. The one type
 of test employed in Table 15 is a test or count or determination
 of pulse rate. Of course this test was made more than
 once, but throughout Table 15 only one function is measured.
 Had the effect of vigorous exercise upon both pulse
 rate and, say, blood pressure been studied, two test types
 would have been employed, since two different functions
 would have been measured.

 In the left half of both Table 14 and Table 15 “Group
 A” is the experimental group or subjects used. As indicated,
 Group A has EF1 applied to it. Instead of placing
 EF1 immediately after Group A as shown in the tables it
 might have been placed between IT1 and FT1 to indicate
 that the EF1 is applied to Group A after the IT1 and before
 the FT1.

 In Table 14 “P” represents the pupils who constitute
 Group A. The “N” beneath it means the number of pupils
 in Group A. In Table 15 the pupils used are a, b, c, etc.,
 and N is 9.

 IT means the initial test or scores made on the initial
 test by each pupil. In Table 15, these scores are pulse rates
 of 95, 100, 101, etc. The numeral 1 following IT refers
 to the first type of test. This will be needed more when
 more than one test type is used. The “FT1” refers to the
 final test.
 
 “C1” in both Table 14 and Table 15 means the change
 produced by the EF1, and is found by computing the difference
 between each pupil’s IT and FT. Thus in Table 15
 C1 for Pupil a is 10 points, found by getting the difference
 between 105 and 95. Had the IT1 for Pupil a been 105
 and the FT1 been 95, C1 would still be 10, but should be
 preceded by a minus sign to indicate that the change is a
 10-point loss. In all cases where the FT is smaller than
 the IT a minus should be prefixed to the C, unless the test
 is scored in terms of time or the like where a smaller FT
 than IT clearly means a gain rather than a loss. In cases
 where it is not clear whether a smaller FT than IT is desirable
 or undesirable, the minus should be prefixed. The
 experimenter should remember, however, that the minus in
 such cases does not, as it usually does, mean something
 undesirable.
 
 Computation of Mean, SD, and SDM for EF1.—The 
 “M1” under the C1 is the arithmetic mean of the various
 C1’s. In Table 15 this M1 is 8.8. Had any of the C1’s
 been preceded by a minus the M1 would have been less
 than 8.8, for signs should be regarded in computing M1.
 The “AM” beneath the M1 means the assumed mean.
 The AM is used instead of the M1 for computing x, x²,
 etc., because its use is a great convenience and economy.
 Any convenient number might be used as the assumed mean,
 though it is usually most convenient to assume the nearest
 whole number to the M1. Thus in Table 15, 8.0 is used
 as the AM, which makes the c or correction 0.8. Signs
 are disregarded in determining and using c. The AM of
 8.0 makes a c of 0.8. An AM of 9.0 would make a c
 of 0.2. Had the M1 been 8.0 instead of 8.8, an excellent
 AM would be 8.0, which would make a c of zero.
 
 The symbol x is the traditional symbol for deviation.
 Thus the x for Pupil a is 2, because his C1 of 10 deviates
 or differs from the AM of 8.0 by 2 points. The x for
 Pupil b is 3, because his C1 of 5 deviates from 8.0 by 3
 points. As in the case of c, the direction of the deviation
 is disregarded. Had the C1 for Pupil a been -10 instead
 of +10, the x would be 18 instead of 2, because the difference
 between 8.0 and -10 is 18 points. Had the AM been
 -8.0 and the C1 been -10, the x would have been 2.
 
 The column labeled “x²” is found by squaring all the
 x’s. Sx² means the sum of the x² column. In Table 15,
 Sx² is 41. SD means standard deviation and is one of several
 conventional measures of variability. It is computed
 according to the formula given in Table 14 and illustrated
 in Table 15. No matter whether the AM is larger or
 smaller than the M, the c² is always subtracted from Sx²/N,
 and it is subtracted before the square root of the whole
 quantity is taken. The subtraction of c² corrects for the
 use of 8.0 instead of 8.8 in computing x’s, x²’s, etc. If
 the reader will compute x, x², etc., from 8.8, he will appreciate
 the convenience in the use of 8.0, and correcting for
 its use at the end. The N in the SD formula means the
 number of pupils in the experimental group. The SD in
 Table 15 is 2.0. SDM1 or SD of the M1 is so indicated
 to distinguish it from the preceding SD or SD of the C1’s.
 SDM1 is a conventional measure of the unreliability of
 the M1. It is computed according to the formula shown
 in Table 14, and illustrated in Table 15. The SDM1 for
 Table 15 is 0.7. The reliability of the M1 or 8.8 is shown
 then by its SDM1 of 0.7.
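
 The left half of computation model I can be checked with a short sketch. The function name is hypothetical, the pulse rates are those read from Table 15, and full precision is carried throughout, so intermediate values differ slightly from the one-decimal figures used in the table.

```python
from math import sqrt


def mean_sd_by_assumed_mean(changes, assumed_mean):
    """Return (M, SD, SDM), measuring deviations from an assumed mean (AM)."""
    n = len(changes)
    mean = sum(changes) / n
    c = mean - assumed_mean  # correction; only its square is used below
    sum_x2 = sum((change - assumed_mean) ** 2 for change in changes)  # Sx^2
    sd = sqrt(sum_x2 / n - c ** 2)  # c^2 corrects for using AM instead of M
    sdm = sd / sqrt(n)
    return mean, sd, sdm


initial = [95, 100, 101, 97, 102, 96, 99, 98, 100]     # IT1, pupils a to i
final = [105, 105, 109, 106, 109, 108, 107, 107, 111]  # FT1, pupils a to i
c1 = [ft - it for it, ft in zip(initial, final)]        # signed changes

m1, sd, sdm1 = mean_sd_by_assumed_mean(c1, 8.0)
# The choice of AM does not affect the result once c^2 is subtracted.
_, sd_alt, _ = mean_sd_by_assumed_mean(c1, 9.0)
```

 Rounded to one decimal, `m1`, `sd`, and `sdm1` agree with the 8.8, 2.0, and 0.7 of Table 15, and `sd_alt` equals `sd`, which shows why any convenient AM may be chosen.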
 
 Computations for EF2.—The right half of Table 14
 and Table 15 is headed “Group A-EF2” because EF2 is
 applied to the same group of pupils as experienced EF1.
 Column P is omitted, since the pupils are the same as those
 shown in the first column of the table. The IT, FT, C2,
 M2, AM, c, x, x², etc., shown in the right half of the table
 are interpreted and computed like those shown in the left
 half of the table.
 
 In Table 15 the EF2 is merely the absence of vigorous
 exercise. That is, EF2 is merely a continuation of the
 same restful conditions which obtained when the IT, in the
 left half of the table, was made. The IT, in the right half
 of the table, does not need redetermination, for presumably
 the results would be identical with the IT1 results shown
 in the left half. Since EF2 is a continuation of conditions
 obtaining when the IT1 is made, FT1 will coincide, presumably,
 with the scores on the IT1. This makes zero all
 the C2’s, the M2, the x’s, x²’s, SD and SDM2. In actual
 practice when EF2 is merely the absence of EF1, the experimenter
 will not actually compute the right half of the
 table but will assume all the C2’s and subsequent measures
 to be zero. In case EF2 is not the mere absence of
 EF1, the right half of the table will have to be computed
 in detail.
 
 Computation of M and SD when N Is Large.—The 
 method of computing M and SD, illustrated in Table 15,
 is appropriate and convenient when N is small. It is appropriate,
 but not convenient, when N is, say, 50 or more.
 When N is large it is more convenient to determine the C1
 for each pupil as in Table 15, and then to tabulate these
 C1’s into a frequency distribution.
 
 The procedure for constructing a frequency distribution 
 is as follows: 
 
 (1) Write a column of figures beginning with the smallest
 C1 and increasing by one to the largest C1. (2) Write
 this column in step-intervals of one, extending from five-tenths
 below to five-tenths above the C1. The first column
 of Table 16 illustrates (1) and (2). (3) Look at the
 original C1’s. If the first C1 is 4, place a dot or mark
 just after the step-interval 3.5 to 4.5 in Table 16. If the
 next C1 is -2, place a mark just after the step-interval
 -2.5 to -1.5. If the next C1 is another 4, place another
 mark just after the step-interval 3.5 to 4.5. Continue until
 a mark has been made after the appropriate step-interval
 for every C1. (4) Total the marks placed after each step-interval,
 and write this total just after the step-interval in
 question. When finished, the two resulting columns will be
 a frequency distribution. The first and second columns of
 Table 16 constitute a frequency distribution. Note that
 each zero frequency (f) must be indicated if the data are to be
 used for further computation.
 
 TABLE 16
 SHOWING HOW TO COMPUTE M AND SD WHEN N IS LARGE

 C                  f      x      fx     fx²
 -4.5 to -3.5       1     -8      -8      64
 -3.5 to -2.5       2     -7     -14      98
 -2.5 to -1.5       2     -6     -12      72
 -1.5 to -0.5       3     -5     -15      75
 -0.5 to  0.5       3     -4     -12      48
  0.5 to  1.5       4     -3     -12      36
  1.5 to  2.5       0     -2       0       0
  2.5 to  3.5       5     -1      -5       5
  3.5 to  4.5       6      0       0       0
  4.5 to  5.5       5      1       5       5
  5.5 to  6.5       2      2       4       8
  6.5 to  7.5       0      3       0       0
  7.5 to  8.5       5      4      20      80
  8.5 to  9.5       3      5      15      75
  9.5 to 10.5       3      6      18     108

 AM = 4.0      N = 44            +62     674
 c = -0.36                       -78
                                 -16

 $c = \frac{-16}{44} \times 1 = -0.36$

 $M = 4.0 + (-0.36) = 3.64$

 $SD = \left(\sqrt{\frac{674}{44} - (-0.36)^2}\right) \times (1) = 3.9$

 $SDM = \frac{3.9}{\sqrt{44}} = 0.59$
 
 The steps in the process of computing M and SD follow:
 (1) Some AM is selected at the mid-point of some step-interval
 near the center of the frequency distribution. Any
 AM will do, but it must be at the mid-point of some step-interval.
 AM = 4.0. (2) N is computed. N = 44. (3) Step
 x’s from the AM are computed. Thus the step-interval
 3.5 to 4.5 deviates from 4.0 by zero. Step-interval 2.5 to
 3.5 deviates by -1. Step-interval 4.5 to 5.5 deviates by
 +1, and similarly for other step-intervals. Note that zero
 frequencies are not overlooked. (4) Each x is multiplied by
 its corresponding f to secure the fx column. (5) The positive
 fx are added. The negative fx are added. The difference
 between these two sums is obtained. Positive Sfx = 62.
 Negative Sfx = 78. The difference = -16. (6) The c is
 computed.

 $c = \left(\frac{Sfx}{N}\right) \times (\text{size of step-interval})$

 c = -0.36. Had AM been 3.0 instead of 4.0, the positive
 Sfx would have been larger than the negative Sfx. This
 would have produced a positive instead of a negative c. (7)
 M is computed by the formula: M = (AM) + (c). Had
 c been positive instead of negative, M would have been
 4.36 instead of 3.64. (8) The fx² column is secured by
 squaring each x, and multiplying by the corresponding f.
 It may also be secured by multiplying each fx by the corresponding
 x. (9) The Sfx² is computed. Sfx² = 674. (10)
 The SD is computed by the formula:

 $SD = \left(\sqrt{\frac{Sfx^2}{N} - (c)^2}\right) \times (\text{size of the step-interval})$

 SD = 3.9.

 (11) SDM is computed according to the usual procedure.
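
 The steps above can be sketched as a single routine. The function name and its arguments are hypothetical; the sketch assumes contiguous step-intervals of equal size, and it keeps the correction in step units inside the square root, which is how the formula is read here when the step-interval is larger than one.

```python
from math import sqrt


def grouped_mean_sd(lower_bounds, freqs, step, am_index):
    """Return (N, M, SD, SDM) for a frequency distribution.

    am_index is the position of the step-interval whose mid-point serves as AM.
    """
    n = sum(freqs)
    am = lower_bounds[am_index] + step / 2
    xs = [i - am_index for i in range(len(freqs))]   # deviations in steps
    sfx = sum(f * x for f, x in zip(freqs, xs))       # Sfx, signs regarded
    sfx2 = sum(f * x * x for f, x in zip(freqs, xs))  # Sfx^2
    c_steps = sfx / n                                 # correction in step units
    mean = am + c_steps * step                        # M = AM + c
    sd = sqrt(sfx2 / n - c_steps ** 2) * step
    return n, mean, sd, sd / sqrt(n)


# Frequency distribution of Table 16 (step-intervals of one, AM = 4.0).
lower_16 = [-4.5 + i for i in range(15)]
f_16 = [1, 2, 2, 3, 3, 4, 0, 5, 6, 5, 2, 0, 5, 3, 3]
n, m, sd, sdm = grouped_mean_sd(lower_16, f_16, step=1, am_index=8)
```

 Rounded as in the text, the results are N = 44, M = 3.64, SD = 3.9, and SDM = 0.59. Regrouping the same data in step-intervals of two leaves M unchanged, while SD shifts slightly because of the coarser grouping.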
 
 Sometimes a frequency distribution is so strung out that
 the experimenter prefers to condense it into step-intervals
 of 2, 3, or more instead of 1, or to construct it in step-intervals
 of 2, 3, or more from the beginning. Thus the
 frequency distribution of Table 16 may be grouped as
 shown in Table 17. No matter what the size of the step-interval,
 the process for computing M and SD is the same
 as that already described. That this is so is shown by
 Table 17.

 TABLE 17

 SHOWING HOW TO COMPUTE M AND SD WHEN N IS LARGE AND WHEN FREQUENCY
 DISTRIBUTION IS GROUPED IN STEP-INTERVALS OF TWO (DATA FROM TABLE 16)

 C                  f      x      fx     fx²
 -4.5 to -2.5       3     -3      -9      27
 -2.5 to -0.5       5     -2     -10      20
 -0.5 to  1.5       7     -1      -7       7
  1.5 to  3.5       5      0       0       0
  3.5 to  5.5      11      1      11      11
  5.5 to  7.5       2      2       4       8
  7.5 to  9.5       8      3      24      72
  9.5 to 11.5       3      4      12      48

 AM = 2.5      N = 44            +51     193
 c = 1.14                        -26
 
 The process just described for computing M1, SD, and
 SDM1 may be used for computing M2, SD, and SDM2. It
 may be used, in fact, for computing any M, SD, or SDM.
 
 Computation of Median and SDmedian.—Because of 
 its greater reliability, the M is usually preferable to the 
 median. The only advantage of the median is that it is less 
 influenced by extreme improvements. A few pupils mak- 
 ing relatively large or relatively small improvements will 
 affect the size of the M more than they will affect the 
 size of the median. If these extreme improvements were
 twice as large or half as large, respectively, the
 median would remain unaltered, but not so the M.
 There are as many arguments for their being allowed to
 have their full effect as for a curtailment of their effect.
 But there may be rare occasions on which the experimenter
 will prefer the median to the mean. For this
 reason the steps in the process of computing a median
 and an SDmedian for the frequency distribution of Table
 16 follow.
 
 (1) Compute N. N = 44. (2) Compute ½ N. ½ N
 = 22. (3) Begin at the top of the frequency column and
 add the successive f’s, calling the successive totals until
 ½ N or 22 has been reached, thus: 1 and 2 are 3, and
 2 are 5, and 3 are 8, and 3 are 11, and 4 are 15, and 0 are 15,
 and 5 are 20, and 2 of the 6 are 22. (4) Place this 2 as
 a numerator over this 6, multiply the fraction 2/6 by 1, the
 size of the step-interval, and add the product to the beginning
 point of the step-interval corresponding to the frequency
 of 6, namely 3.5. The result is the median. Median
 = 3.5 + (2/6 × 1) = 3.83.
 
 The reliability of the median 3.83 is found by means of
 the following formula:

 $\mathrm{SDmedian} = \frac{1\frac{1}{4}\,SD}{\sqrt{N}}$

 The SD, in the preceding formula, may be the SD from the
 mean, computed in the usual way, or it may be the SD
 from the median. It will be found more convenient as a
 rule to use SD from the mean. If computed from the
 median, the exact deviations from the exact median must
 be used, because SD from the median must be computed
 by the formula:

 $SD = \sqrt{\frac{Sx^2}{N}}$ instead of $SD = \sqrt{\frac{Sx^2}{N} - c^2}$
 
 The steps in the process of computing a median for Table
 17 follow. (1) N = 44. (2) ½ N = 22. (3) 22 = 3 and
 5 are 8, and 7 are 15, and 5 are 20, and 2 of 11. (4)
 Median = 3.5 + (2/11 × 2) = 3.86.
 
 The experimenter may have difficulty in computing a 
 median for a frequency distribution where the numerator 
 of the fraction is zero and the preceding f or f’s is zero. 
 Table 18 shows how to overcome this difficulty. 
 
 TABLE 18
 SHOWING HOW TO COMPUTE A MEDIAN IN TWO SPECIAL SITUATIONS

 First situation

 C              f
 2.5 to 3.5     1      N = 14
 3.5 to 4.5     0      ½ N = 7
 4.5 to 5.5     2      7 = 1 + 0 + 2 + 4 + 0 and 0 of 5
 5.5 to 6.5     4
 6.5 to 7.5     0
 7.5 to 8.5     5
 8.5 to 9.5     2

 Median = (6.5 + 7.5)/2 + (0/5 × 1) = 7.0

 Second situation

 C               f
 10.5 to 15.5    2     N = 12
 15.5 to 20.5    1     ½ N = 6
 20.5 to 25.5    3     6 = 2 + 1 + 3 + 0 + 0 and 0 of 4
 25.5 to 30.5    0
 30.5 to 35.5    0
 35.5 to 40.5    4
 40.5 to 45.5    2

 Median = (25.5 + 35.5)/2 + (0/4 × 5) = 30.5
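
 The counting procedure for the median, including the special situations of Table 18, can be sketched as follows. The function name is hypothetical, and the branch for empty step-intervals reflects one reading of Table 18, namely that the median is placed midway across the gap when ½ N is reached exactly at the end of a step-interval.

```python
def grouped_median(lower_bounds, freqs, step):
    """Median of a frequency distribution by counting down the f column."""
    half = sum(freqs) / 2
    cumulative = 0
    for i, f in enumerate(freqs):
        if f == 0:
            continue
        if cumulative + f > half:
            # Ordinary case: take the needed fraction of this step-interval.
            return lower_bounds[i] + (half - cumulative) / f * step
        if cumulative + f == half:
            # Special case: half of N is reached exactly at the end of this
            # interval; place the median midway to the next occupied interval.
            upper = lower_bounds[i] + step
            nxt = next(j for j in range(i + 1, len(freqs)) if freqs[j] > 0)
            return (upper + lower_bounds[nxt]) / 2
        cumulative += f
    raise ValueError("frequency distribution is empty")


# Table 16 (step-intervals of one) and Table 17 (step-intervals of two).
median_16 = grouped_median(
    [-4.5 + i for i in range(15)],
    [1, 2, 2, 3, 3, 4, 0, 5, 6, 5, 2, 0, 5, 3, 3],
    step=1,
)
median_17 = grouped_median(
    [-4.5 + 2 * i for i in range(8)], [3, 5, 7, 5, 11, 2, 8, 3], step=2
)
# Second situation of Table 18 (step-intervals of five, two empty intervals).
median_18 = grouped_median(
    [10.5, 15.5, 20.5, 25.5, 30.5, 35.5, 40.5], [2, 1, 3, 0, 0, 4, 2], step=5
)
```

 The three results round to 3.83, 3.86, and 30.5. The small difference between the first two comes only from the coarser grouping of Table 17.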
 
 
 
 The median is sometimes called the 50 percentile. It is
 possible to compute other percentile points according to the
 same process. The 50 percentile is found by counting down
 the frequency column ½ N. The 25 percentile or Q1 is
 found by taking ¼ N. The 75 percentile or Q3 is found
 by taking ¾ N. The 20 percentile is found by taking
 ⅕ N.
 
 A knowledge of Q1 and Q3 enables us to compute Q
 (quartile deviation) by the formula:

 $Q = \frac{Q3 - Q1}{2}$
 
 Q, which is a variability measure like SD and which is 
 approximately .6745 SD, may be used in the place of SD 
 to compute SDmedian. In fact, this is the simplest way to 
 determine SDmedian. The formula is: 
 
 $\mathrm{SDmedian} = \frac{1\frac{1}{4}\,Q}{.6745\,\sqrt{N}}$, or approximately $\frac{1.85\,Q}{\sqrt{N}}$
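
 The relation between the two ways of finding SDmedian can be worked out from the constants given above. This is a derivation offered as a check, resting on the statement that Q is approximately .6745 SD; the coefficient 1.85 is computed here rather than quoted. The last line uses the SD of 3.9 and the N of 44 from Table 16.

```latex
\begin{align*}
\mathrm{SDmedian} &= \frac{1\tfrac{1}{4}\,SD}{\sqrt{N}},
  \qquad Q \approx .6745\,SD \;\Rightarrow\; SD \approx \frac{Q}{.6745} \\
\mathrm{SDmedian} &\approx \frac{1.25\,Q}{.6745\,\sqrt{N}}
  \approx \frac{1.85\,Q}{\sqrt{N}} \\
\text{Table 16:}\quad \mathrm{SDmedian} &= \frac{1.25 \times 3.9}{\sqrt{44}} \approx 0.73
\end{align*}
```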
 
 Computation of D and SDD.—In the “Summary” 
 (Tables 14 and 15) are retabulated certain measures pre- 
 viously computed, and certain additional computations are 
 made. First there appears the mean of the changes pro- 
 duced by EF1, i.e. M1 in Table 14 and 8.8 in Table 15. 
 Next comes the mean of the changes produced by EF2, i.e. 
 M2 in Table 14 and zero in Table 15. 
 
 The next step, namely, “D” or difference, is merely the
 difference between M1 and M2, i.e. M1 - M2, in Table 14,
 or between 8.8 and 0, i.e. 8.8 in Table 15. It is well to form
 the habit of subtracting M2 from M1. Then a plus D will
 mean that EF1 has been more effective than EF2. A minus 
 D will mean always just the reverse. This D is the most 
 significant measure shown in the two tables. It is the chief 
 goal of the experimental computations. It yields the con- 
 clusion from the experiment. Thus the D of 8.8 in Table 
 15 tells us that the C produced by EF1 is 8.8 points larger 
 than that produced by EF2. This is another way of saying 
 that the effect of a defined amount of vigorous physical 
 exercise is to increase the pulse rate 8.8 on the average. 
 
 The next computation, namely, SDD or the SD of the D,
 utilizes the SDM1 and SDM2 as shown in the two tables.
 This SDD shows the reliability of the preceding D just as
 the SDM1 shows the reliability of M1. That is, the D of
 8.8 has a reliability of 0.7.
 
 In case medians have been used instead of M’s, D will be 
 the difference between median 1 and median 2, and SDD 
 will be computed according to the formula: 
 
 $SDD = \sqrt{(\mathrm{SDmedian}\ 1)^2 + (\mathrm{SDmedian}\ 2)^2}$
 
 Though SDM and SDD will be used throughout this
 book, many experiments report reliability in terms of PE.
 Thus the reader of scientific literature frequently sees something
 like this: Mean = 8 ± 0.7, or like this: Difference
 = 4 ± 1.0. Such expressions signify that the PE of
 the mean or PEM is 0.7, and that the PED is 1.0. By
 multiplying any SD, SDM, SDmedian, or SDD by 0.6745,
 it may be transmuted into a PE, PEM, PEmedian, or PED
 respectively. SD and PE tell the same story. In a normal
 frequency distribution ± SD includes the middle 68% of
 the f’s whereas ± PE includes the middle 50% of the f’s.
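
 The conversion from SD to PE, and the two percentages just quoted, can be verified with Python's standard library. The function names are hypothetical; `NormalDist` is the standard normal distribution supplied by the `statistics` module.

```python
from statistics import NormalDist

PE_PER_SD = 0.6745  # one PE expressed as a fraction of one SD


def sd_to_pe(sd):
    """Transmute any SD, SDM, SDmedian, or SDD into the corresponding PE."""
    return PE_PER_SD * sd


def middle_fraction(z):
    """Fraction of a normal distribution lying within z SD's of the mean."""
    normal = NormalDist()
    return normal.cdf(z) - normal.cdf(-z)


pem = sd_to_pe(0.7)                      # PEM for the SDM1 of Table 15
within_sd = middle_fraction(1.0)         # middle fraction inside +/- SD
within_pe = middle_fraction(PE_PER_SD)   # middle fraction inside +/- PE
```

 To two decimals, `within_sd` is 0.68 and `within_pe` is 0.50, and the SDM1 of 0.7 corresponds to a PEM of about 0.47.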
 
 Measures of Variability.—Thus far three sorts of SD’s 
 have been computed, namely, SD, SDC, or SD of the C’s, 
 SDM or SD of the mean of the C’s, and SDD or SD of 
 the difference. All three are measures of variability. The 
 SD or SDC is a measure of the variation or variability 
 among the C’s. Thus the C1’s in Table 15 vary from 5 to
 12, i.e., there is a range of 7. This 7 could be taken as a
 measure of variation; but the reader will easily understand
 that a change in the C1 for one pupil might markedly affect
 such a measure of variability. The SD is better because
 its size is dependent not upon just two pupils but upon
 the records for all pupils. Furthermore, the SD is demanded
 by the formula for SDM. The SD increases in size
 with an increase in the variability of the C’s, and it decreases
 as the variation of the C’s decreases. In sum, it is
 an exceedingly sensitive and stable measure of the variability
 among the C’s. The SD of 2.0 in Table 15 means
 approximately that 68 per cent of all the C1’s fall between
 M1 - 2.0 and M1 + 2.0 or between 8.8 - 2.0 and 8.8 +
 2.0, or between 6.8 and 10.8. The per cent between
 M - SD and M + SD is exactly 68 when the C’s make an
 exactly normal frequency distribution, i.e., when a graph
 of the frequency distribution is approximately bell-shaped.
 
 The SDM is also a measure of variability. It is a meas- 
 ure of the variability among the M’s just as SD is a measure 
 of variability among the C’s. Assume the nine pupils used 
 in Table 15 to be a random sampling from the 10,000 ten- 
 year-old pupils in a certain school system. Imagine this 
 experiment repeated upon another random sampling of nine 
 pupils from the total 10,000, and then upon another 
 sampling, and then upon another sampling, and so on until 
 a great many samplings have been taken and a great many 
 M1’s have been computed. In making these samplings
 certain pupils might be chosen more than once and certain
 ones might never be chosen at all. Not all the M1’s so
 computed would be identical. In fact, no two M1’s might
 be identical. Certainly there would be variation among
 them. The SD of all these M1’s could be computed just as
 the SD of the C1’s was computed. When so computed, the
 result would be SDM1, and, in theory at least, would be
 the same as SDM1 computed by the formula illustrated in
 Table 15, i.e., 0.7. Since it is more probable that all these
 M1’s will center at the obtained M1 of 8.8 than at any
 other point, the SDM1 of 0.7 tells us that most probably
 68 per cent of these M1’s would be between 8.8 - 0.7 and
 8.8 + 0.7, i.e., between 8.1 and 9.5. In sum, SDM1 is a
 measure of variability just as SD is a measure of variability.
 The difference is that SD is computed from actually
 obtained C’s whereas SDM1 is always computed by formula.
 The M1’s whose variability it measures could actually
 be determined as suggested above but in practice their
 existence is only imagined.
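
 The imagined repetition of the experiment can be carried out by simulation. The sketch below rests on invented data: a population of 10,000 changes is drawn from a normal distribution with mean 8.8 and SD 2.0 (an assumption made only for illustration), repeated samplings of nine are taken, and the SD of the resulting means is compared with the value given by the formula.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)  # fixed seed so that the sketch is repeatable

# Invented population of C1's for 10,000 pupils (normality is assumed here).
population = [random.gauss(8.8, 2.0) for _ in range(10_000)]
population_mean = mean(population)
population_sd = pstdev(population)

SAMPLE_SIZE = 9
REPETITIONS = 5_000

# Each repetition draws nine different pupils; a pupil may recur across repetitions.
sample_means = [
    mean(random.sample(population, SAMPLE_SIZE)) for _ in range(REPETITIONS)
]

empirical_sdm = pstdev(sample_means)              # SD of the many M1's
formula_sdm = population_sd / sqrt(SAMPLE_SIZE)   # SD divided by root N

# Share of the M1's lying within one SDM of the population mean.
share_within = sum(
    abs(m - population_mean) <= formula_sdm for m in sample_means
) / REPETITIONS
```

 Under these assumptions `empirical_sdm` and `formula_sdm` should both fall near 0.67, and `share_within` near 68 per cent. In this sketch the formula uses the SD of the whole population, whereas in an actual experiment the SD of the single obtained sample must serve in its place.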
 
 SDD is also a measure of the variability among many
 differences determined from many repetitions of the experiment
 upon different random samplings. As with SDM1,
 SDD is computed always by formula. The SDD of 0.7 in
 Table 15 tells us that most probably 68 per cent of all the
 differences determined from such repetitions of this experiment
 would fall between the obtained difference 8.8 - 0.7 and
 8.8 + 0.7, i.e., between 8.1 and 9.5. M1 and SDM1 will
 not always coincide with D and SDD as they do in this
 experiment.
 
 Measures of Reliability and Randomness of Sampling.—SDM1
 and SDD are measures of reliability as well
 as of variability. They measure the reliability, respectively,
 of M1 and D. The true M1 for the 10,000 pupils in question
 can be determined only by securing the C1 for all
 10,000 pupils. The M1 for any number of pupils less than
 10,000 will not be the true mean exactly except by chance.
 The M1 for the nine pupils in Table 15 may happen to
 be the true M1. On the other hand the M1 from any
 other random sampling of nine pupils has as much chance
 of being the true M1. Any measure which will show the
 amount of variation among all the M1’s from the various
 possible random samplings of nine pupils each will be an
 index of how much a particular obtained M1 may be in
 error. The SDM1, as has been pointed out already, is just
 such a measure of variation. Consequently it tells us how
 probable it is that the obtained M1 diverges from the true
 M1 by a given amount. When the various possible M1’s
 vary little among themselves, there is little chance for any
 one of them to diverge largely from the true M1. In such
 a situation the SDM1 will be small in amount. When
 the SDM1 is large in amount, it means that there is a large
 variation in size among the possible M1’s, which, in turn,
 means that the obtained M1 is not particularly reliable.
 In like manner it can be shown that SDD, because it measures
 the variation among the possible differences, is an index
 of the reliability of the obtained D, and shows the probability
 that it diverges from the true D for all 10,000 by a
 given amount.
 
 SDM1 and SDD, as computed by formula, will coincide
 with SDM1 and SDD as computed from a great many randomly
 determined M1’s and D’s only when an assumption
 underlying these formulae perfectly obtains. That is,
 SDM1 and SDD, as computed by formula, are valid only
 to the extent that the nine pupils used are a genuine random
 sampling of all the 10,000 pupils, or that the obtained C’s
 are a genuine random sampling of all the C’s that would be
 obtained if all 10,000 pupils were experimented upon. That
 is, both reliability formulae assume randomness of sampling.
 
 In actual practice no one would hope to secure a genuine 
 random sampling from 10,000 pupils by selecting only nine 
 pupils. Since this book, however, is concerned with meth- 
 odology rather than results, a ludicrously small amount of 
 data is used in most tables. The purpose of this is econ- 
 omy of space and clearness of presentation rather than to 
 set an example for the reader.
 
 Close attention to the nature of the sampling is neces- 
 sary, not only in order to discover the validity of the re- 
 liability measures computed but also to determine the 
 limitations of the conclusion drawn from the experiment. 
 Thus if the pupils used in the experiment are a random 
 sampling from the ten-year-olds in a particular elementary 
 school, the conclusion should be distinctly limited to the 
 ten-year-olds in this particular school. The experimenter 
 cannot be sure that the results of his experiment apply to 
 all ten-year-olds in the United States, or to all eleven-year- 
 olds in this same school. 
 
 Experimental Coefficient and Chances.—The “EC” or 
 experimental coefficient in Table 14 and Table 15 remains 
 to be explained. The formula for its computation is given 
 in the former table and illustrated in the latter. The experi- 
 mental coefficient has been devised to interpret SDD. The 
 formula for its computation is so constructed that an experimental
 coefficient of 1.0 means that we can be practically
 certain that the true D is somewhere above zero. An EC
 of 0.5 means that we can be only half certain that the true
 D is above zero. An EC of 2.0 means we can be doubly
 certain that the true D is above zero, and similarly for
 other sizes of EC. Since the EC in Table 15 is 4.6 we can
 say that there is 4.6 times practical certainty that the true
 D is above zero.
 
 Since some statisticians wish to state probability in terms 
 of chances that the true D is above or below zero or above 
 or below any defined point, Table 19 permits the con- 
 version of experimental coefficients into statements of 
 chance. This table says, for example, that when the experi- 
 mental coefficient is 0.3 the chances are 3.9 to 1 that the 
 true D is above zero if the obtained D is above zero, or
 below zero if the obtained D is negative.
 
 TABLE 19

 SHOWING HOW TO CONVERT AN EXPERIMENTAL COEFFICIENT INTO A
 STATEMENT OF CHANCES

 Experimental Coefficient      Approximate Chances
          .1                        1.6 to 1
          .2                        2.5 to 1
          .3                        3.9 to 1
          .4                        6.5 to 1
          .5                         11 to 1
          .6                         20 to 1
          .7                         38 to 1
          .8                         75 to 1
          .9                        160 to 1
         1.0                        369 to 1
         1.1                        930 to 1
         1.2                       2350 to 1
         1.3                       6700 to 1
         1.4                      20000 to 1
         1.5                      65000 to 1
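
 The chances of Table 19 can be approximated in code. The sketch assumes that the table rests on the normal probability curve, with 2.78 SDD as the unit of practical certainty; this is an inference from the tabled values rather than something stated in the text, and the function names are hypothetical.

```python
from statistics import NormalDist


def experimental_coefficient(d, sdd):
    """EC = D / (2.78 SDD), the formula of Table 14."""
    return d / (2.78 * sdd)


def chances(ec):
    """Approximate chances (to 1) that the true D lies on the same side
    of zero as the obtained D, assuming a normal distribution."""
    p = NormalDist().cdf(2.78 * abs(ec))
    return p / (1 - p)


# Carried at full precision this is about 4.5; the text, which rounds
# 2.78 x 0.7 to 1.9 before dividing, obtains 4.6.
ec_table_15 = experimental_coefficient(8.8, 0.7)

chances_at_03 = chances(0.3)
chances_at_05 = chances(0.5)
```

 Under this assumption an EC of 0.3 gives chances of about 3.9 to 1 and an EC of 0.5 about 11 to 1, in agreement with the table.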
 
 The formula for EC is constructed to a D of zero as a 
 reference, because the experimenter’s primary concern is to 
 know whether the obtained superiority of one EF over 
 another, or the obtained D in favor of one EF, is sufficiently 
 reliable to justify him in concluding that the true D, if
 known, would continue to favor that same EF. If the
 obtained D is, say, 2.0 in favor of EF1, the experimenter 
 wonders whether the true D may not be zero or even, say, 
 —1.0. For the true D to be zero, would be to make the 
 two EF’s of equal effectiveness. For it to become — 1.0, 
 would be to reverse the conclusion indicated by the obtained 
 D. So whenever the EC is less than 1.0, the experimenter 
 should state that one of his EF’s is probably more effective 
 than the other. The less the EC becomes, the more wary 
 the experimenter should be. This does not mean that the 
 experimenter is justified in advising practical action on the 
 basis of his experiment only when the EC is 1.0 or above. 
 So long as the EC is above zero, the true D more probably 
 lies in the direction of the obtained D than in the opposite 
 direction. Life’s most important considerations, such as 
 marriage, investments, and hope of Heaven, rest upon an 
 EC of less than 1.0! 
 
 Though the EC formula is built to a D of zero, it may 
 be used to measure the probability that an obtained D will 
 be above a defined point, or will be below a given point. 
 Thus if we wish to know the probability that the true D in 
 Table 15 will be above, say, 7.8 we should compute thus: 
 
 8.8 − 7.8 = 1.0.   EC = 1.0 / (2.78 × 0.7) = 0.5.   We can be
 
 only half certain that the true D is above 7.8, whereas we 
 can be 4.6 times practical certainty that it is above zero. 
 Since there is just as much probability that the true D is 
 above as below 8.8, we may wish to determine the proba- 
 bility that the true D is below, say, 10.8. Compute thus: 
 
 10.8 − 8.8 = 2.0.   EC = 2.0 / (2.78 × 0.7) = 1.0.   We can be

 practically certain that the true D is below 10.8. If desired
 these EC’s may be expressed in terms of chances by the use
 of Table 19.
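 The two computations just made differ only in the point of reference. A minimal sketch follows; the function name and the reference argument are illustrative, and the SDD of 0.7 is the rounded value for Table 15 quoted in this chapter.

```python
def experimental_coefficient(d: float, sdd: float, reference: float = 0.0) -> float:
    """EC of an obtained D measured from a chosen reference point.

    With reference = 0 this is the ordinary EC = D / (2.78 * SDD).
    """
    return abs(d - reference) / (2.78 * sdd)


# Certainty that the true D is above 7.8, and that it is below 10.8.
above_7_8 = experimental_coefficient(8.8, 0.7, reference=7.8)
below_10_8 = experimental_coefficient(8.8, 0.7, reference=10.8)
print(round(above_7_8, 1), round(below_10_8, 1))
```

 Whether the EC refers to "above" or "below" the reference point is settled by the side on which the obtained D lies, not by the formula.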
 
 Though to do so would serve no especially useful purpose 
 in connection with experimental computations, the EC 
 formula may be used to help interpret the reliability of an
 M. In this case, the SDD in the denominator of the formula
 should give place to SDM. Thus if we desired to
 know the probability that the true M1 in Table 15
 would be above, say, 5.8, we could proceed as follows:

 8.8 − 5.8 = 3.0.   EC = 3.0 / (2.78 × 0.7) = 1.6.   The probability
 then is 1.6 times practical certainty that the true M1 is
 above 5.8. It happens that in Table 15 the SDM1 is the
 same as the SDD, i.e., 0.7. In similar manner we could
 determine the probability that the true M1 is below a defined
 amount.
 
 How to Increase the Experimental Coefficient.—If 
 the EC is not as large as desired, how can it be increased? 
 An inspection of the EC formula reveals the answer. The 
 EC can be increased by increasing the numerator of the 
 formula, i.e., by increasing D. But D is not subject to con- 
 trol by the experimenter. It is, in fact, illegitimate for him 
 to try consciously to increase D. Then the denominator 
 must be reduced. The 2.78 in the denominator is constant
 so it cannot be reduced. The reduction must be in the
 SDD. To see how it can be reduced we need to inspect the
 formula for computing SDD. This formula shows that the
 only way to reduce the SDD is to reduce one or both of the
 SDM’s upon which the size of the SDD depends. To find
 out how, say, SDM1 can be reduced it is necessary to inspect
 the formula for computing SDM1. This reveals that
 the SDM1 can be reduced by reducing the SD in the
 numerator or by increasing the N in the denominator. 
 Since errors of measurement tend to increase the variability 
 among the C1’s, a refinement of the testing instruments 
 would make a slight but almost negligible reduction in SD. 
 For practical purposes the SD cannot be materially re- 
 duced. Then the N must be increased. The N is subject 
 to the control of the experimenter. Therefore our search 
 has led us to the conclusion that the only practicable plan 
 for increasing the size of the EC is to increase N. 
 
 The experimenter can compute in advance about how 
 
158 How to Experiment in Education 
 
 many pupils he must experiment upon to secure a desired 
 EC. The EC of 4.6 in Table 15 is high enough, but suppose 
 that an EC of 6.0 were desired. The size of the SDD 
 required to yield an EC of 6.0 may be determined by solv- 
 ing the following EC formula for SDD, because, presuma- 
 bly, the D of 8.8 would be altered little or not at all by 
 increases in N. 
 EC = 8.8 / (2.78 × SDD) = 6.0

 SDD = 0.5

 Now the size of the SDM1 required to yield an SDD of
 0.5 may be determined by solving the following SDD formula
 for SDM1. The SDM2 cannot be reduced so it is
 disregarded. When it is reducible, it may be asked to share
 its proportionate part in reducing the SDD.

 √((SDM1)² + (0)²) = 0.5
 SDM1 = 0.5

 Since the SD in the SDM1 formula changes little or not at
 all with changes in N, the N required to yield the needed
 SDM1 of 0.5 may be determined by the solving of the following
 SDM1 formula for N.

 2.0 / √N = 0.5
 N = 16
 
 The answer to our query is, then, that 16 pupils must be 
 used if a desired EC of 6.0 is to be secured. If the neces- 
 sary reduction in SDD is distributed between the two 
 SDM’s, N must be determined for both SDM1 and SDM2.
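 The three steps above can be chained into one computation. The sketch below is illustrative only: the function name and the round_steps switch are ours. With rounding at each step, as in the text, it returns 16; without rounding the same formulas give a slightly smaller N, because the required SDD is nearer 0.53 than 0.5.

```python
from math import ceil


def required_n(d: float, sd: float, desired_ec: float, round_steps: bool = True) -> int:
    """Number of pupils needed for a desired EC, taking SDM2 as zero.

    d          : obtained D, assumed unchanged by increases in N
    sd         : SD of the changes, assumed unchanged by increases in N
    desired_ec : the EC the experimenter wishes to reach
    """
    sdd = d / (2.78 * desired_ec)      # solve EC = D / (2.78 * SDD) for SDD
    if round_steps:
        sdd = round(sdd, 1)            # the text rounds to one decimal here
    sdm1 = sdd                         # sqrt(SDM1^2 + 0^2) = SDD
    if sdm1 <= 0:
        raise ValueError("required SDM1 rounds to zero; use round_steps=False")
    return ceil((sd / sdm1) ** 2)      # solve SD / sqrt(N) = SDM1 for N


print(required_n(8.8, 2.0, 6.0))                      # rounded steps, as in the text
print(required_n(8.8, 2.0, 6.0, round_steps=False))   # unrounded steps
```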
 
 Another Illustration of Computation Model I.—Table 
 20 illustrates the application of computation model I to 
 sample data where EF2 is not the mere absence of EF1. 
 Imagine the data to have been collected in an experiment 
 to determine whether the pulse rate increased more from 
 reading a familiar favorite thrilling short story (EF1) or
 from hearing the story told orally by the teacher (EF2).
 The story used must be an extremely familiar one, otherwise
 the repetition would differ markedly in interest from
 the first presentation, thereby invalidating the experiment
 unless the equivalent-groups method were used.

 TABLE 20

 ILLUSTRATING HOW TO USE COMPUTATION MODEL I WHEN EF2 IS NOT THE
 MERE ABSENCE OF EF1

 One Group — Two EF’s — One Test Type

 [Table body illegible in the scanned source.]
 
 The reader’s attention is directed to the following special 
 features of Table 20. The C1 of — 1.0 deviates from the 
 AM of 1.0 by 2 points. The AM is the same as M1, 
 thereby making c of zero size. As shown by the computa- 
 tion of SD, when the M and AM are identical no correc- 
 tion for the SD is necessary. The M2 is less than the AM, 
 but this in no way alters the usual subsequent procedure. 
 The D is — 1.8 because in this experiment EF2 proved to 
 be more effective than EF1. The EC is only 0.7 which
 means that we can be only 0.7 practically certain that the
 true D, if known, is below zero, i.e., favors EF2.
 
 There are several possible one-group computation models. 
 We could have one computation model for two EF’s and 
 two test types. Substitute Group A for “Group B” in com- 
 putation model IV, Table 24, and the reader will have such 
 a model. Again, we could have a computation model for 
 three EF’s and one test type. Substitute Group A for 
 “Group B” and also for “Group C” in computation model 
 III, Table 23, and the reader will have such a model. 
 Again, we could have a computation model for three EF’s 
 and three test types. Substitute Group A for “Group B” 
 and also for “Group C” in computation model V, Table 26,
 and the reader will have such a model. In sum, every com- 
 putation model listed in the next chapter could have been 
 listed as one-group computation models. Economy of space 
 is the only reason for not doing so. Imagine Group A to 
 run through all these models instead of different groups and 
 they will all be converted automatically into one-group 
 computation models. In like manner the detailed discus- 
 sion and illustration of computation model I in this chapter 
 is applicable to all the computation models in the next 
 chapter. 
 
CHAPTER VII 
 
 COMPUTATIONS FOR THE EQUIVALENT- 
 GROUPS EXPERIMENTAL METHOD 
 
 Computation Model II.—Computation model II given 
 in Table 21 shows the necessary computations for an ex- 
 periment with two equivalent groups, two EF’s and one type 
 of test. Note that “P” appears twice because EF2 is not 
 applied to the same pupils who experience EF1. Note also
 that the detailed formulae for SD and SDM are omitted,
 since the reader is already familiar with them. 
 
 TABLE 21
 COMPUTATION MODEL II

 Two Equivalent Groups — Two EF’s — One Test Type

 Group A — EF1                        Group B — EF2
 P   IT1   FT1   C1   x   x²          P   IT1   FT1   C2   x   x²
 N               M1       Sx²         N               M2       Sx²
                 AM       SD                          AM       SD
                 c        SDM1                        c        SDM2

 SUMMARY
            EF1   EF2   D         SDD                      EC
 Test 1     M1    M2    M1 − M2   √((SDM1)² + (SDM2)²)     D / (2.78 SDD)

 Illustration of Computation Model II.—In order to 
 illustrate computation model II with sample experimental 
 data assume this problem: Which is better for the quality 
 of the penmanship, a penmanship period preceding the 
 gymnasium period (EF1), or following the gymnasium
 (EF2)? This problem may be solved either by the one-
 group or equivalent-groups method. The equivalent-groups 
 method is used. 
 
 The IT for both groups should be made at the same
 period of the day, and at a period different from
 either of the experimental periods, though several other ways
 of working out this experiment would be as feasible and as
 satisfactory. Assume that the IT has been made on both
 groups just before dismissal at the end of the day. The FT
 for Group A should be made, then, just preceding the
 gymnasium period, and the FT for Group B should be made
 just after the gymnasium period. The necessary computations
 are made in Table 22.

 TABLE 22
 SHOWING HOW TO USE COMPUTATION MODEL II

 Two Equivalent Groups — Two EF’s — One Test Type

 Group A — EF1                          Group B — EF2
 P    IT1   FT1   C1    x    x²         P    IT1   FT1   C2    x    x²
 a     7     8     1    0    0          i     7     8     1    2    4
 b     7     6    −1   −2    4          j     8     7    −1    0    0
 c     8    10     2    1    1          k     9     7    −2   −1    1
 d     8     9     1    0    0          l    10     9    −1    0    0
 e     9     9     0   −1    1          N = 4   M2 = −0.8   Sx² = 5
 f     9    12     3    2    4                  AM = −1.0   SD = 1.1
 g    10    11     1    0    0                  c = 0.2     SDM2 = 0.6
 h    10    12     2    1    1
 N = 8   M1 = 1.1    Sx² = 11
         AM = 1.0    SD = 1.2
         c = 0.1     SDM1 = 0.4

 SUMMARY
           EF1    EF2     D     SDD    EC
 Test 1    1.1   −0.8    1.9    0.8    0.9
 
 In Table 22 the pupils are arranged in order of the size of
 their IT1 scores in order that the reader will easily perceive
 that Group A as a whole is really equivalent in initial ability
 in handwriting with Group B as a whole. Table 22 also
 shows that the number of pupils in one group need not
 be identical with the number in the other group. Since
 M2 and its AM are negative, we have here an illustration
 of the computation of x’s from a negative AM. This also
 affords an opportunity to show how to compute D when one
 of the M’s is a negative quantity. Had both M’s been
 negative quantities, i.e., had M1, say, been −1.1, the D
 would have been −0.3 in favor of EF2. Both EF1 and
 EF2 would have produced a loss of handwriting quality, but
 EF1 would have effected a larger loss. The minus is
 prefixed to 0.3 to indicate that EF2 is the favored one. As
 the experiment stands, however, the conclusion is that EF1 
 is better than EF2 for the quality of handwriting of pupils 
 by 1.9 points on the handwriting scale used. We can be 0.9 
 practically certain that this conclusion is true for the whole 
 group from which the experimental pupils are a random 
 sampling. 
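 The computations of computation model II may be retraced step by step as follows. This is a sketch under stated assumptions: the change scores are illustrative values chosen to agree with the summary figures of Table 22, the SD is taken by the assumed-mean shortcut √(Sx²/N − c²), and the SDM as SD/√N. Because the table rounds at each step, its SDD and EC need not agree with the unrounded results to the last decimal.

```python
from math import sqrt


def mean_sd_sdm(changes, assumed_mean):
    """Return (M, SD, SDM) of a list of change scores C,
    taking deviations x from an assumed mean AM."""
    n = len(changes)
    m = sum(changes) / n
    c = m - assumed_mean                                  # correction c = M - AM
    sx2 = sum((v - assumed_mean) ** 2 for v in changes)   # Sx^2
    sd = sqrt(sx2 / n - c ** 2)                           # SD corrected for c
    sdm = sd / sqrt(n)                                    # SD of the mean
    return m, sd, sdm


c1 = [1, -1, 2, 1, 0, 3, 1, 2]   # illustrative changes, Group A (EF1)
c2 = [1, -1, -2, -1]             # illustrative changes, Group B (EF2)

m1, sd1, sdm1 = mean_sd_sdm(c1, assumed_mean=1.0)
m2, sd2, sdm2 = mean_sd_sdm(c2, assumed_mean=-1.0)

d = m1 - m2                          # positive D favors EF1
sdd = sqrt(sdm1 ** 2 + sdm2 ** 2)    # SD of the difference
ec = d / (2.78 * sdd)                # experimental coefficient
print(round(m1, 1), round(m2, 1), round(d, 1), round(sd1, 1), round(sd2, 1))
```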
 
 Practical Certainty and Pre-requisites of Reliability. 
 —Several times thus far the term practical certainty has 
 been used. This needs a fuller explanation. When 100 
 pupils are selected at random from 1000 pupils, we can be
 entirely certain that the experimental results secured for the
 100 are true for those 100. But no matter how large the
 D, we can never be absolutely certain that results secured
 from any sampling less than the entire 1000 are true for the
 1000. Since absolute certainty is never obtainable, except 
 for the particular group used, statisticians have coined the 
 term practical certainty to designate a degree of certainty 
 which is generally acceptable. Practical certainty is defined 
 as plus and minus three times the SD of the measure in
 question. Thus we can be practically certain that the
 true M1 lies between obtained M1 minus 3 SDM1 and obtained
 M1 plus 3 SDM1. If M1 is 1.1 and SDM1 is 0.4, we
 can be practically certain that the true M1 lies between 1.1
 minus 3(0.4) and 1.1 plus 3(0.4), i.e., between −0.1 and
 2.3. Similarly, we can be practically certain that the true
 D lies between obtained D minus 3 SDD and obtained D
 plus 3 SDD, or using the data of Table 22, we can be
 practically certain that the true D is somewhere between 1.9
 minus 3(0.8) and 1.9 plus 3(0.8), i.e., between — 0.5 and 
 4.3. Had such definition of limits been more significant than 
 the definition of a point above which the true D lies, i.e., 
 zero, the denominator in the EC formula would have been 
 3 SDD instead of 2.78 SDD. The 3.0 is reduced to 2.78
 because any chance or probability that the true D is above 
 D plus 3 SDD (when D is positive) or below D minus 3 
 SDD (when D is negative) merely strengthens the conclu- 
 sion yielded by the experiment. The difference between 3.0 
 and 2.78 exactly accounts for this probability. 
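 The relation between 3.0 and 2.78 can be checked numerically. The sketch below assumes the normal probability curve, which is our reading of the passage rather than a computation given in the text: the probability of falling within plus and minus 3 SD is very nearly the probability of falling above minus 2.78 SD.

```python
from math import erf, sqrt


def phi(z: float) -> float:
    """Normal-curve probability of a value below z standard deviations."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


within_3_sd = phi(3.0) - phi(-3.0)   # two-sided limits: between -3 SD and +3 SD
one_sided_278 = phi(2.78)            # one-sided limit: above -2.78 SD
print(round(within_3_sd, 4), round(one_sided_278, 4))
```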
 
 The one-group method is a more convenient method than 
 the equivalent-groups method of solving the experimental 
 problem whose sample data appears in Table 22. But even 
 though the equivalent-groups method be employed, there is 
 a more convenient method of determining D than that shown 
 in Table 22. Both experimental groups could have had 
 their IT1 at one of the EF periods, at, let us say, the period 
 preceding the gymnasium period (EF1). Then the FT1 for
 Group A could be assumed to be identical with the IT1.
 This would have made each of C1, M1, SD and SDM1 zero.
 This would have saved labor and would, in theory, have
 yielded the identical D obtained by giving the IT1 in a
 period other than one of the EF periods. 
 
 But even though the IT1 be made in a non-EF period as 
 shown in Table 22, the same D could have been secured by 
 a single computation, namely, by computing the M of Group 
 A’s FT1, and the M of Group B’s FT1 and by subtracting 
 one M from the other. Experimenters frequently resort to 
 this plan to avoid the necessity of making an IT1. Such an 
 avoidance is not commendable because the experimenter has 
 no right to assume that his two groups are equivalent. He 
 needs the IT1 to prove their equivalence. If he avoids this 
 criticism by using one group only, where he has a right to 
 assume equivalence, or if he proves the equivalence of his
 two groups by means of an IT1, but then proceeds to ignore
 it and work with FT1 only instead of C, he is subject to
 another criticism. His computations will yield the correct 
 D, but will not permit him to determine the EC or reliability 
 of the D. It will not suffice for him to compute the M, SD, 
 and SDM of the FT1 for each group, and to use these two 
 SDM’s to compute SDD just as the SDM’s of the C’s are 
 used to compute SDD. The SDM of the FT1’s tends as a 
 rule, though not always, to be unduly large and thus tends 
 to make the D appear less reliable than it really is. Some 
 distortion will always occur unless the IT1’s are all zero or 
 all identical in size. It is not legitimate to avoid this final 
 criticism by simply omitting altogether the computation of 
 the reliability of the D, for each experimenter is obligated to 
 report the reliability of his conclusion. In sum, C is required 
 to determine the correct reliability of D, and the obtaining 
 of C presupposes both an IT1 and FT1.
 
 There is a way whereby the correct SDD may be secured 
 without the use of C. The steps in this process follow. (1) 
 Compute M of initial scores. (2) Compute M of final 
 scores. (3) Subtract initial M from final M to get M1.
 (4) Compute SD and SDM of initial scores. (5) Compute
 SD and SDM of final scores. (6) Compute SDM1 by
 means of the following formula.

 SDM1 = √[(Initial SDM)² + (Final SDM)² − 2 (r initial with final) (SDM initial) (SDM final)]

 Thus the SDM1, computed in this way, is equal to the
 square root of the following: the square of the SDM of the
 IT scores, plus the square of the SDM of the FT scores,
 minus twice the coefficient of correlation between the IT
 scores and FT scores times the SDM of the IT scores times
 the SDM of the FT scores. The procedure is similar for the
 computation of M2 and SDM2.
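 That this substitute procedure yields the same SDM1 as the direct computation from the C’s may be verified numerically. In the sketch below the scores are illustrative, the helper names are ours, and the last term of the formula is taken with the SDM’s of the two sets of scores, which is the form under which the two routes agree exactly.

```python
from math import sqrt


def sd(values):
    """SD of a list of scores about their own mean."""
    n = len(values)
    m = sum(values) / n
    return sqrt(sum((v - m) ** 2 for v in values) / n)


def correlation(xs, ys):
    """Product-moment coefficient of correlation r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return cov / (sd(xs) * sd(ys))


initial = [7, 7, 8, 8, 9, 9, 10, 10]     # illustrative IT scores
final = [8, 6, 10, 9, 9, 12, 11, 12]     # illustrative FT scores
n = len(initial)

sdm_initial = sd(initial) / sqrt(n)
sdm_final = sd(final) / sqrt(n)
r = correlation(initial, final)

# Route 1: the formula using the two SDM's and r.
sdm_by_formula = sqrt(sdm_initial ** 2 + sdm_final ** 2
                      - 2 * r * sdm_initial * sdm_final)

# Route 2: the direct computation from the changes C.
changes = [f - i for i, f in zip(initial, final)]
sdm_by_changes = sd(changes) / sqrt(n)

print(round(sdm_by_formula, 3), round(sdm_by_changes, 3))
```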
 
 The use of this thoroughly exact but substitute procedure
 for determining M1 and SDM1 is seldom advisable. Some
 time may be saved by its use provided the IT and FT scores
 have been tabulated previously into two frequency distributions,
 respectively. If the experimental data are available
 only in such form, it is impossible to compute C’s. Generally,
 however, the computation of C not only facilitates the
 computation of M1 and SDM1 or M2 and SDM2, but it
 also makes possible a fuller utilization of experimental results
 in that it shows what sub-group made the larger C’s.
 
 TABLE 23
 COMPUTATION MODEL III

 Three Equivalent Groups — Three EF’s — One Test Type

 Group A — EF1               Group B — EF2               Group C — EF3
 P  IT1  FT1  C1  x  x²      P  IT1  FT1  C2  x  x²      P  IT1  FT1  C3  x  x²
 N            M1     Sx²     N            M2     Sx²     N            M3     Sx²
              AM     SD                   AM     SD                   AM     SD
              c      SDM1                 c      SDM2                 c      SDM3

 SUMMARY
           EF1   EF2   EF3   D         SDD                     EC
 Test 1    M1    M2          M1 − M2   √((SDM1)² + (SDM2)²)    D / (2.78 SDD)
 Test 1    M1          M3    M1 − M3   √((SDM1)² + (SDM3)²)    D / (2.78 SDD)
 Test 1          M2    M3    M2 − M3   √((SDM2)² + (SDM3)²)    D / (2.78 SDD)

 
 Recently my attention was attracted to an experiment 
 where some of the pupils had one IT and one FT; whereas 
 others had two or more IT’s and two or more FT’s (as 
 though pupils a, d, and f say in Table 22, had three IT and 
 three FT records each). These records were recorded and
 treated as though they belonged to different individuals. 
 The effect of this is to distort the SD, SDM, and SDD. 
 When more than one record exists for a pupil they should 
 be averaged so that each pupil will have just one IT and 
 one FT for each test. 
 
 Computation Model III.—Computation model III in 
 Table 23 shows the experimental computations necessary 
 when there are three equivalent groups, three EF’s and one
 type of test. If the purpose of the experiment is to deter-
 mine the relative effectiveness of three EF’s, EF1, EF2, and 
 EF3 will be distinctly different EF’s. If the purpose of the 
 experiment is to determine the absolute effectiveness of EF1, 
 and EF2, then, EF3 will be a control EF. It should be 
 understood that in all preceding and succeeding computation 
 models, one of the EF’s must be a control EF whenever 
 knowledge of the absolute effectiveness of one or more of 
 the EF’s is sought. 
 
 Table 23 is practically self-explanatory. The two 
 M1’s under EF1 in the Summary are the same M1, and
 similarly for the two M2’s under EF2 and the M3’s under
 EF3. The first D and SDD under EC are M1 − M2 and
 √((SDM1)² + (SDM2)²) respectively, and similarly for the
 second and third formulae under EC. The first D, namely
 M1 — M2, shows whether EF1 or EF2 is more effective and 
 the first EC shows its reliability. The second D, namely 
 M1 — M3, shows whether EF1 or EF3 is more effective 
 and the second EC shows its reliability, and similarly for 
 the third D and third EC. 
 
 By extending computation model III in Table 23 farther 
 to the right, to provide for a Group D — EF4 and a Group 
 E — EF5 and a Group F — EF6 and so on, the experi-
 menter will have a computation model for any number of 
 groups and EF’s when one test type is used. An extension 
 of the Summary according to the plan exemplified in Table 
 23 will take care of any number of EF’s. 
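 The extension of the Summary to any number of EF’s amounts to taking the groups two at a time. A minimal sketch follows; the group values are hypothetical, the function name is ours, and the EC is given the sign of its D.

```python
from itertools import combinations
from math import sqrt


def pairwise_summary(groups):
    """groups maps an EF label to (M, SDM).
    Returns one row (EF_a, EF_b, D, SDD, EC) for every pair of EF's."""
    rows = []
    for (a, (m_a, sdm_a)), (b, (m_b, sdm_b)) in combinations(groups.items(), 2):
        d = m_a - m_b                          # positive D favors the first EF of the pair
        sdd = sqrt(sdm_a ** 2 + sdm_b ** 2)
        rows.append((a, b, d, sdd, d / (2.78 * sdd)))
    return rows


# Hypothetical M and SDM for three equivalent groups.
groups = {"EF1": (1.1, 0.4), "EF2": (-0.8, 0.6), "EF3": (0.2, 0.5)}
rows = pairwise_summary(groups)
for a, b, d, sdd, ec in rows:
    print(f"{a} vs {b}: D = {d:.1f}, SDD = {sdd:.1f}, EC = {ec:.1f}")
```

 Three EF’s yield three comparisons, four yield six, and so on.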
 
 Computation Model IV.—The computation models so 
 far given show how to take care of any number of EF’s 
 when one test type is used. Computation model IV in Table 
 24 shows how to handle two EF’s and two test types. 
 
 Table 24 shows that additional test types can be provided 
 for by expanding the original computation model downward, 
 just as additional EF’s were provided for by expanding the 
 original computation model to the right. Note that the 
 second test type is indicated by the numeral 2, and that 
 the two new M’s are labeled M3 and M4. The D of
 M1 − M2 shows whether according to Test 1, EF1 or EF2
 is the more effective. The D of M3 − M4 shows whether,
 according to Test 2, EF1 or EF2 is the more effective. The 
 two EC’s show the reliability of these two D’s. 
 
 Equating of Differences.—Table 24 exemplifies a new 
 feature in connection with EC. This new feature requires 
 explanation. Test 1 may favor EF1 by a D of a certain
 amount, whereas Test 2 may favor EF2 by a D of a certain
 amount, or perhaps both tests may favor EF1, or again,
 both tests may favor EF2. At any rate, there is needed
 some way whereby the two D’s may be combined into a
 single number which will show whether, both tests considered,
 EF1 or EF2 is more effective and how much more
 effective.

 TABLE 24
 COMPUTATION MODEL IV

 Two Equivalent Groups — Two EF’s — Two Test Types

 Group A — EF1                      Group B — EF2
 P  IT1  FT1  C1  x  x²             P  IT1  FT1  C2  x  x²
 N            M1     Sx²            N            M2     Sx²
              AM     SD                          AM     SD
              c      SDM1                        c      SDM2
 P  IT2  FT2  C3  x  x²             P  IT2  FT2  C4  x  x²
 N            M3     Sx²            N            M4     Sx²
              AM     SD                          AM     SD
              c      SDM3                        c      SDM4

 SUMMARY
          EF1  EF2  D         SDD                     EC               x   x²    ED               x   x²
 Test 1   M1   M2   M1 − M2   √((SDM1)² + (SDM2)²)    D / (2.78 SDD)             D / (M1 or M2)
 Test 2   M3   M4   M3 − M4   √((SDM3)² + (SDM4)²)    D / (2.78 SDD)             D / (M3 or M4)
                                                      MEC      Sx²               MED      Sx²
                                                      AM       SD                AM       SD
                                                      c        SDMEC             c        SDMED
                                                      ECMEC                      ECMED

 
 But the two D’s cannot be averaged just as they stand. 
 To do so might give far more weight to one test than to the 
 other. To make this clear, assume the following situation:

            EF1    EF2    D
 Test 1     105    100    5
 Test 2      10      5    5

 Now, in all probability, these two D’s are far from equal,
 even though they are numerically the same. The first 5 is,
 in all probability, a much smaller D than is the second 5.
 Before they can be combined they need to be equated. The
 two EC’s are not only indices of the reliability of the two 
 D’s, but they are also at the same time excellent equaters of 
 the two D’s. The EC’s may be averaged. This has been 
 done and “MEC” or mean EC is the result. Before this 
 averaging is done, the sign of each D should be prefixed 
 to its EC. 
 
 The MEC is really a mean difference. The reliability of 
 each of the two D’s is known. The next need is for some 
 way to determine the reliability of the MEC. Such a way 
 is shown in Table 24. SD of the two EC’s and SDMEC or 
 SD of the MEC may be computed just as SDC and SDM1
 are computed.

 In this situation where there are two EC’s the formulae
 become:

 SD = √(Sx²/2 − c²)          SDMEC = SD / √2
 
 The SDMEC is an index of the reliability or trustworthiness 
 of MEC as a true MEC for all the tests from which Test 1 
 and Test 2 are a random sampling, and, to make the state- 
 ment complete, for all the pupils from which the experi- 
 mental pupils are a random sampling. 
 
 Just as SDD needed EC for its interpretation, so SDMEC 
 needs an ECMEC for its interpretation. Since, as was 
 pointed out above, MEC is really a D still, and since 
 SDMEC is really an SDD still, the regular EC formula with 
 its customary interpretation may be used. In this situation
 the formula becomes

 ECMEC = MEC / (2.78 SDMEC)
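 The computation of MEC, SDMEC and ECMEC may be sketched as follows. The two signed EC’s are those reported for the sample data discussed later in this chapter; the function name is ours, and the SD is taken about the MEC itself, which is equivalent to the assumed-mean shortcut. Rounding at each step, as in the tables, may alter the last decimal.

```python
from math import inf, sqrt


def mean_ec(signed_ecs):
    """Return (MEC, SDMEC, ECMEC) for a list of EC's,
    each carrying the sign of its D."""
    n = len(signed_ecs)
    mec = sum(signed_ecs) / n
    sd = sqrt(sum((e - mec) ** 2 for e in signed_ecs) / n)
    sdmec = sd / sqrt(n)
    # If every EC is identical the SDMEC is zero and the ratio is unbounded.
    ecmec = abs(mec) / (2.78 * sdmec) if sdmec > 0 else inf
    return mec, sdmec, ecmec


mec, sdmec, ecmec = mean_ec([-0.7, -2.2])
print(round(mec, 2), round(sdmec, 2), round(ecmec, 2))
```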
 
 The only difficulty with the use of EC and MEC as a 
 method of equating and combining D’s, is the impossibility 
 of making any clear, simple statement as to what an MEC 
 of a given amount means. Therefore the “ED” or equated 
 difference, has been devised to provide a more easily inter- 
 pretable method of equating and combining D’s from two 
 or more test types. While preferable to the MEC from a 
 popular standpoint it is probably less preferable from a 
 technical statistical point of view. 
 
 The ED for the first D is M1 − M2 divided by M1 if it
 is smaller than M2 or by M2 if it is smaller than M1. The
 ED for the second D is M3 − M4 divided by M3 if it is
 smaller than M4 or by M4 if it is smaller than M3. When
 so computed, the ED tells the per cent of the time the
 experiment has run that it would take the backward group to
 catch up with the favored group if the favored group were
 to stop growing until the other catches up. The ED’s for
 each of the two D’s of 5, previously given, become, according
 to the above process, .05 and 1.0 respectively. These ED’s
 interpreted mean respectively that the EF2 group would
 catch the EF1 group in Test 1 in .05 of the time the experiment
 has run, and that the EF2 group would catch the
 EF1 group in Test 2 in a time exactly equal to the time the
 experiment has run.
 
 After explaining the computation of MEC and ECMEC, 
 it will not be necessary to rehearse the process for computing 
 MED and ECMED. In computing MED, the sign of the 
 D should be prefixed to its ED. One other caution is needed. 
 It sometimes happens that the smaller of the two M’s is so 
 close to zero that, when it is divided into the D, the resulting 
 ED becomes an exaggerated and unnatural amount. Thus, 
 if the smaller of the two M’s were exactly zero and if the 
 D were not also zero, the ED would become infinity! The 
 reader does not need to be told what this will do to the MED.
 If this, or anything approaching it, were to happen, the
 MED could not be used. The use of MEC would be com- 
 pulsory. Because of this tendency on the part of ED, the 
 experimenter is advised always to prefer the midscore of 
 the ED’s to the MED, wherever it is possible to compute 
 the midscore, i.e., wherever more than two test types have 
 been used. The midscore of the ED’s may be treated as 
 though it were the MED. 
 
 The computation of the midscore is exceedingly simple. 
 First arrange the ED’s in order of their size, paying 
 due regard to signs. That ED which is middlemost in
 size is the midscore. If there is an even number of ED’s
 and, as a consequence, no middle ED, the mean of the
 two middlemost ED’s may be taken for the midscore and
 MED.
 
 There is no obligation upon the experimenter to give equal 
 weight to each test always. Because of a given test’s greater 
 reliability, because it is more symptomatic of the entire 
 objects of instruction, or for some other reason, the ex- 
 perimenter may desire to weight it more heavily than any 
 other test used. Once the D’s have been equated, weighting 
 becomes a simple matter of multiplying the EC or ED by 
 the weight desired, before averaging. Thus, if there are
 three tests to be averaged, and if it is desired to weight the
 tests, in order, 3, 1, and 2, the experimenter should multiply
 the first EC or ED by 3, the second by 1, and the third by 2.
 Then he should add the products and divide by 3 plus 1
 plus 2, i.e., 6.
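 Both the midscore and the weighted average are readily computed. In the sketch below the three ED’s are hypothetical and the function name weighted_mean is ours; the midscore is what is now commonly called the median.

```python
from statistics import median


def weighted_mean(values, weights):
    """Multiply each EC or ED by its weight, add the products,
    and divide by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


eds = [-0.9, -1.3, 0.4]                     # hypothetical signed ED's for three tests
midscore = median(eds)                      # middlemost ED, signs regarded
weighted = weighted_mean(eds, [3, 1, 2])    # tests weighted 3, 1, and 2

# With an even number of ED's the mean of the two middlemost is taken.
even_midscore = median([-1.3, -0.9])
print(midscore, round(weighted, 2), round(even_midscore, 1))
```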
 
 Illustration of Computation Model IV.—The fore- 
 going discussion of computation model IV will be clarified 
 by the use of sample data. Such data appear in Table 25, 
 where we shall assume the experimental problem to be this: 
 Which is more effective in developing reading (Test 1) and 
 the fundamentals of arithmetic (Test 2), three class periods 
 per week of fifty minutes each (EF1) or five class periods
 per week of thirty minutes each (EF2)? Here we have a
 problem with two EF’s and two test types, requiring the
 equivalent-groups method. Assume the experiment to
 continue for a half year.

 TABLE 25
 SHOWING HOW TO USE COMPUTATION MODEL IV UPON SAMPLE DATA

 Two Equivalent Groups — Two EF’s — Two Test Types

 Group A — EF1                          Group B — EF2
 P   IT1   FT1   C1    x    x²          P   IT1   FT1   C2    x    x²
 a    50    52    2    0    0           g    49    53    4    1    1
 b    40    41    1   −1    1           h    40    45    5    2    4
 c    55    58    3    1    1           i    55    58    3    0    0
 d    48    50    2    0    0           j    49    52    3    0    0
 N = 4   M1 = 2.0   Sx² = 2             N = 4   M2 = 3.8    Sx² = 5
         AM = 2.0   SD = 0.7                    AM = 3.0    SD = 0.8
         c = 0.0    SDM1 = 0.4                  c = 0.8     SDM2 = 0.4

 P   IT2   FT2   C3    x    x²          P   IT2   FT2   C4    x    x²
 a    20    30   10    2    4           g    20    35   15   −2    4
 b    10    18    8    0    0           h    10    30   20    3    9
 c    25    30    5   −3    9           i    25    42   17    0    0
 e    15    24    9    1    1           j    15    37   22    5   25
 N = 4   M3 = 8.0   Sx² = 14            N = 4   M4 = 18.5   Sx² = 38
         AM = 8.0   SD = 1.9                    AM = 17.0   SD = 2.7
         c = 0.0    SDM3 = 1.0                  c = 1.5     SDM4 = 1.4

 SUMMARY
           EF1    EF2     D      SDD    EC      x      x²      ED      x      x²
 Test 1    2.0    3.8    −1.8    0.9   −0.7    0.8    0.6     −0.9    0.2    0.04
 Test 2    8.0   18.5   −10.5    1.7   −2.2   −0.7    0.5     −1.3   −0.2    0.04
                                        MEC = −1.5   Sx² = 1.1      MED = −1.1   Sx² = 0.08
                                        AM = −1.5    SD = 0.8       AM = −1.1    SD = 0.2
                                        c = 0.0      SDMEC = 0.6    c = 0.0      SDMED = 0.1
                                        ECMEC = 0.9                 ECMED = 4.0

 
 The first novel feature of Table 25 is that pupils g and j 
 are not exactly equivalent to pupils a and d in IT1. This
 is partially corrected by the fact that g’s deficiency of one 
 point is balanced by j’s excess of one point. 
 
 The second feature to be noted is that Group A consists 
 of pupils a, b, c, and d for Test 1 and of pupils a, b, c, and
 e for Test 2. This is to illustrate the point made in Chapter
 III that when pairing is not feasible until the experiment is 
 concluded, it may be necessary to alter somewhat the com- 
 position of the group from test to test in order to establish 
 more perfect initial equivalence in each test. Pupil d paired 
 fairly well with Pupil j in reading, but not in arithmetic. 
 But it happens that Pupil e who experienced the same EF 
 as Pupil d pairs well with Pupil j in arithmetic. Conse- 
 quently Pupil e takes the place of Pupil d in Test 2. 
 
 The third feature is the computation of MEC and 
 ECMEC. Test 1 shows a D of — 1.8 with a 0.7 practical 
 certainty. Test 2 shows a D of —10.5 with a 2.2 times 
 practical certainty. Combining these results we get an 
 MEC of —1.5 in favor of EF2. We can be only 0.9 
 practically certain that the true MEC for all such reading 
 and arithmetic tests would favor EF2. 
 
 The fourth feature worth noting is the computation of 
 ED, MED, and ECMED. The ED of —o.9 is found by 
 dividing the D of — 1.8 by the smaller M of 2.0. The ED 
 of — 1.3 is found by dividing the D of — 10.5 by the smaller 
 M of 8.0. The ED of — 0.9 means that it would take Group 
 A nine-tenths of a half year to catch Group B in reading if 
 Group B were to stop growing altogether. The ED of — 1.3 
 means that it would require one and one-third of a half- 
 year’s time for Group A to catch up to where Group B now 
 is, if Group A continues under the EF1. The MED of 
 — 1.1 means that on the average it would take Group A 
 one and one-tenth of the time during which the experiment 
 ran to attain the reading ability and arithmetical ability now
 possessed by Group B. The ECMED of 4.0 is not at all in
 harmony with an ECMEC of 0.9. This discrepancy is ex- 
 plained by the artificiality of the data used, the inexactness 
 of the computations, and the small number of tests used. 
 Because the number of tests used in most experiments is 
 usually small, we seriously considered illustrating the com- 
 putation of MEC and MED and omitting any reference to 
 ECMEC and ECMED. The reader is advised to place little 
 confidence in these last two measures. 
 
 When, as rarely occurs, either or both the M’s from which 
 an ED comes are negative quantities, ED should always be 
 considered infinity in amount. For the group that is behind 
 could never attain the position of the group that is ahead or 
 that lost less. So long as the group that is behind remains 
 under its particular EF it would continue to lose ground and 
 to widen the gap between itself and its more favored 
 competitor. 
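
 The ED rule just stated, together with the rule for negative M's, may be sketched in a few lines of Python. This is an illustrative sketch only. The function name `ed` is hypothetical, and the group means 2.0 and 3.8, and 8.0 and 18.5, are not given in the text; they are simply the pairs implied by a D of −1.8 with a smaller M of 2.0 and a D of −10.5 with a smaller M of 8.0. Giving the infinite ED the sign of D is likewise an assumption.

```python
import math


def ed(m1: float, m2: float) -> float:
    """Return ED for the mean changes M1 (under EF1) and M2 (under EF2).

    ED = D / smaller M, where D = M1 - M2. When either M is negative,
    ED is treated as infinite in amount; the sign of D is kept (assumption).
    A smaller M of exactly zero is not covered by the text and would
    raise ZeroDivisionError here.
    """
    d = m1 - m2
    if m1 < 0 or m2 < 0:
        return math.copysign(math.inf, d)
    return d / min(m1, m2)


# Means implied by the worked example (assumed, see the note above).
ed_reading = ed(2.0, 3.8)        # D = -1.8, smaller M = 2.0
ed_arithmetic = ed(8.0, 18.5)    # D = -10.5, smaller M = 8.0
med = (ed_reading + ed_arithmetic) / 2

print(round(ed_reading, 1), round(ed_arithmetic, 1), round(med, 1))
print(ed(-1.0, 2.0))  # a group that lost ground never catches up
```

 Rounded to one decimal, the three printed values agree with the ED's and the MED quoted above.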
 
 Computation Model V.—The reader who understands 
 computation models I, II, III, and IV will find computation 
 model V in Table 26 self-explanatory. It is for the purpose 
 of showing the necessary computations for three EF’s and 
 three types of tests. By a further extension of model V to 
 the right, any number of EF’s may be accommodated, and 
 by a further extension downward, any number of test types 
 may be accommodated. 
 
 Computation Model VI.—Computation model VI shows 
 the computations needed in connection with an equivalent- 
 groups experiment where there are sub-groups. Bennett 
 faced just such a situation when he set out to determine 
 whether rural supervision based on tests is more effective 
 than supervision unaided by tests. He divided his county 
 into two equivalent groups of schools. He gave initial and 
 final tests to both groups. In the case of one group he 
 made use of the initial-test data in his supervision. In the 
 case of the other group he laid the tests away unscored until 
 the conclusion of the experiment. Otherwise the two groups 
 were treated as nearly alike as possible. 
 
 TABLE 26

 COMPUTATION MODEL V

 Three Equivalent Groups — Three EF’s — Three Test Types

 Group A — EF1              Group B — EF2              Group C — EF3
 P  IT1  FT1  C1  x  x²     P  IT1  FT1  C2  x  x²     P  IT1  FT1  C3  x  x²
 N       M1       Sx²       N       M2       Sx²       N       M3       Sx²
         AM       SD                AM       SD                AM       SD
         c        SDM1              c        SDM2              c        SDM3
 P  IT2  FT2  C4  x  x²     P  IT2  FT2  C5  x  x²     P  IT2  FT2  C6  x  x²
 N       M4       Sx²       N       M5       Sx²       N       M6       Sx²
         AM       SD                AM       SD                AM       SD
         c        SDM4              c        SDM5              c        SDM6
 P  IT3  FT3  C7  x  x²     P  IT3  FT3  C8  x  x²     P  IT3  FT3  C9  x  x²
 N       M7       Sx²       N       M8       Sx²       N       M9       Sx²
         AM       SD                AM       SD                AM       SD
         c        SDM7              c        SDM8              c        SDM9

 SUMMARY
           EF1   EF2   D          SDD                      EC             ED
 Test 1    M1    M2    M1 − M2    √[(SDM1)² + (SDM2)²]     D ÷ 2.78 SDD   D ÷ M1 or M2
 Test 2    M4    M5    M4 − M5    √[(SDM4)² + (SDM5)²]     D ÷ 2.78 SDD   D ÷ M4 or M5
 Test 3    M7    M8    M7 − M8    √[(SDM7)² + (SDM8)²]     D ÷ 2.78 SDD   D ÷ M7 or M8
                                                           MEC            MED
                                                           ECMEC          ECMED

           EF1   EF3   D          SDD                      EC             ED
 Test 1    M1    M3    M1 − M3    √[(SDM1)² + (SDM3)²]     D ÷ 2.78 SDD   D ÷ M1 or M3
 Test 2    M4    M6    M4 − M6    √[(SDM4)² + (SDM6)²]     D ÷ 2.78 SDD   D ÷ M4 or M6
 Test 3    M7    M9    M7 − M9    √[(SDM7)² + (SDM9)²]     D ÷ 2.78 SDD   D ÷ M7 or M9
                                                           MEC            MED
                                                           ECMEC          ECMED

           EF2   EF3   D          SDD                      EC             ED
 Test 1    M2    M3    M2 − M3    √[(SDM2)² + (SDM3)²]     D ÷ 2.78 SDD   D ÷ M2 or M3
 Test 2    M5    M6    M5 − M6    √[(SDM5)² + (SDM6)²]     D ÷ 2.78 SDD   D ÷ M5 or M6
 Test 3    M8    M9    M8 − M9    √[(SDM8)² + (SDM9)²]     D ÷ 2.78 SDD   D ÷ M8 or M9
                                                           MEC            MED
                                                           ECMEC          ECMED
 
 
 In making his experimental computations, he could have 
 thrown all the pupils in one group of schools into one large 
 group, and similarly for all the pupils in the other group of 
 schools. Had he done this, he would have had two equiva- 
 lent groups, two EF’s, and two or more test types, and his
 experimental computations, in this case, would have been
 those of computation model IV.
 
 But he desired to know whether the D between the two 
 EF’s would be in the same direction and of the same amount 
 for Grade III, as for Grade IV, as for Grade V, etc. In 
 like manner, an experimenter may wish to compute separate 
 D’s for each age, or for the brighter half of the two groups 
 as contrasted with the duller half, or for boys vs. girls, or 
 for all of these and more. EF1 may be more effective than
 EF2 for the lower grades, or younger ages, or duller pupils, 
 or boys, whereas the reverse situation may obtain for the 
 upper grades, upper ages, brighter pupils, or girls, respec- 
 tively. Computation by sub-groups has the effect, then, of 
 yielding fuller information, and, sometimes, the most signifi- 
 cant information. 
 
 In Table 27, Grade III and Grade IV are the sub-groups. 
 Were sex, say, the sub-group, “Boys—EF1,” “Boys—EF2,”
 “Girls—EF1,” “Girls—EF2” should take the place, respectively,
 of “Grade III—EF1,” “Grade III—EF2,” “Grade
 IV—EF1,” and “Grade IV—EF2,” and similarly for any
 other sub-group basis. 
 
 An extension to the right of computation model VI will 
 provide for any number of EF’s. An extension downward 
 will provide for any number of sub-groups. An extension 
 downward under each sub-group will provide for any num- 
 ber of test types. 
 
 If the experimenter wishes to know the results for Grade 
 III and Grade IV treated as one group as well as treated 
 separately he can compute the M of the MEC for Grade III 
 and the MEC for Grade IV, or he can compute the M of 
 MED for Grade III and MED for Grade IV. If he wishes 
 to know the results for each test type separately, he can
 compute the M of Grade III’s EC on test 1 and Grade IV’s 
 EC on test 1, and the M of Grade III’s EC on test 2 and 
 Grade IV’s EC on test 2. Or he can compute the M of 
 
 TABLE 27

 COMPUTATION MODEL VI

 Two Equivalent Groups with Two Sub-Groups — Two EF’s — Two Test Types

 Grade III — EF1            Grade III — EF2
 P  IT1  FT1  C1  x  x²     P  IT1  FT1  C2  x  x²
 N       M1       Sx²       N       M2       Sx²
         AM       SD                AM       SD
         c        SDM1              c        SDM2
 P  IT2  FT2  C3  x  x²     P  IT2  FT2  C4  x  x²
 N       M3       Sx²       N       M4       Sx²
         AM       SD                AM       SD
         c        SDM3              c        SDM4

 Grade IV — EF1             Grade IV — EF2
 P  IT1  FT1  C5  x  x²     P  IT1  FT1  C6  x  x²
 N       M5       Sx²       N       M6       Sx²
         AM       SD                AM       SD
         c        SDM5              c        SDM6
 P  IT2  FT2  C7  x  x²     P  IT2  FT2  C8  x  x²
 N       M7       Sx²       N       M8       Sx²
         AM       SD                AM       SD
         c        SDM7              c        SDM8

 SUMMARY

 Grade III
           EF1   EF2   D          SDD                      EC             ED
 Test 1    M1    M2    M1 − M2    √[(SDM1)² + (SDM2)²]     D ÷ 2.78 SDD   (M1 − M2) ÷ M1 or M2
 Test 2    M3    M4    M3 − M4    √[(SDM3)² + (SDM4)²]     D ÷ 2.78 SDD   (M3 − M4) ÷ M3 or M4
                                                           MEC            MED
                                                           ECMEC          ECMED

 Grade IV
           EF1   EF2   D          SDD                      EC             ED
 Test 1    M5    M6    M5 − M6    √[(SDM5)² + (SDM6)²]     D ÷ 2.78 SDD   (M5 − M6) ÷ M5 or M6
 Test 2    M7    M8    M7 − M8    √[(SDM7)² + (SDM8)²]     D ÷ 2.78 SDD   (M7 − M8) ÷ M7 or M8
                                                           MEC            MED
                                                           ECMEC          ECMED
 
 
 Grade III’s ED on test 1 and Grade IV’s ED on test 1, and
 the M of Grade III’s ED on test 2 and Grade IV’s ED on
 test 2.
 
 There are certain possible objections to the foregoing plan 
 for combining Grade III and Grade IV. First, the plan 
 gives an equal weight to each grade irrespective of the num- 
 ber of pupils in each grade. This objection loses its validity 
 if the number of pupils is about the same or, even though 
 
 TABLE 28

 SUMMARY OF AN ACTUAL EXPERIMENT UPON THREE SUB-GROUPS
 (AFTER OGGLESBY)

 Summary — Bright Group
           EF1     EF2     D      SDD    EC
 Test 1    14.11   13.46   0.65   0.27   0.87

 Summary — Normal Group
           EF1     EF2     D      SDD    EC
 Test 1    13.05   12.14   0.91   0.31   1.06

 Summary — Dull Group
           EF1     EF2     D      SDD    EC
 Test 1    11.08    8.64   2.44   0.58   1.51

 
 not the same, if there are special reasons for weighting each 
 grade equally. Second, there is no convenient way to de- 
 termine the reliability of the M’s so computed. 
 
 There is another plan for combining Grade III and Grade 
 IV which takes account of the number of pupils in each 
 grade, and which permits the computation of the reliability 
 of the combined results. This plan is to disregard the sub- 
 groups entirely, and compute from the beginning as though 
 Grade III and Grade IV were one group. In Table 27, this 
 would amount to computing the M of all the C1’s and C5’s
 
 TABLE 29

 COMPUTATION MODEL VII

 Two Equivalent Groups — Two EF’s — Two Test Types — One Intermediate Test

 Group A — EF1                                Group B — EF2
 P  IT1  INT1  FT1  C1     C2     C3          P  IT1  INT1  FT1  C4     C5     C6
 N                  M1     M2     M3          N                  M4     M5     M6
                    SDM1   SDM2   SDM3                           SDM4   SDM5   SDM6
 P  IT2  INT2  FT2  C7     C8     C9          P  IT2  INT2  FT2  C10    C11    C12
 N                  M7     M8     M9          N                  M10    M11    M12
                    SDM7   SDM8   SDM9                           SDM10  SDM11  SDM12

 SUMMARY

 Initial to Intermediate
           EF1   EF2   D            SDD                       EC             ED
 Test 1    M1    M4    M1 − M4      √[(SDM1)² + (SDM4)²]      D ÷ 2.78 SDD   (M1 − M4) ÷ M1 or M4
 Test 2    M7    M10   M7 − M10     √[(SDM7)² + (SDM10)²]     D ÷ 2.78 SDD   (M7 − M10) ÷ M7 or M10
                                                              MEC            MED
                                                              ECMEC          ECMED

 Intermediate to Final
           EF1   EF2   D            SDD                       EC             ED
 Test 1    M2    M5    M2 − M5      √[(SDM2)² + (SDM5)²]      D ÷ 2.78 SDD   (M2 − M5) ÷ M2 or M5
 Test 2    M8    M11   M8 − M11     √[(SDM8)² + (SDM11)²]     D ÷ 2.78 SDD   (M8 − M11) ÷ M8 or M11
                                                              MEC            MED
                                                              ECMEC          ECMED

 Initial to Final
           EF1   EF2   D            SDD                       EC             ED
 Test 1    M3    M6    M3 − M6      √[(SDM3)² + (SDM6)²]      D ÷ 2.78 SDD   (M3 − M6) ÷ M3 or M6
 Test 2    M9    M12   M9 − M12     √[(SDM9)² + (SDM12)²]     D ÷ 2.78 SDD   (M9 − M12) ÷ M9 or M12
                                                              MEC            MED
                                                              ECMEC          ECMED
 
 
 treated together, the M of all the C3’s and C7’s, the M of 
 all C2’s and C6’s, and the M of all the C4’s and C8’s. This 
 will entail for each M so computed an appropriate series of 
 x’s, x²’s, Sx²’s, SD’s, and SDM’s and a “Grade III and
 Grade IV” section in the “Summary.” 
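
 The difference between the two plans for combining sub-groups can be seen in a short Python sketch. The changes listed below are hypothetical figures chosen only for illustration; they are not data from any experiment described in this book. The first plan averages the two grade M's with equal weight; the second pools all pupils and so weights each grade by its number of pupils, and also yields an SDM for the combined M.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical changes (C's) under one EF; illustration only.
grade_3_changes = [4.0, 6.0, 5.0, 5.0]                       # 4 pupils
grade_4_changes = [1.0, 2.0, 3.0, 2.0, 2.0, 2.0, 1.0, 3.0]   # 8 pupils

# Plan 1: the M of the two grade M's, each grade weighted equally.
equal_weight_m = mean([mean(grade_3_changes), mean(grade_4_changes)])

# Plan 2: disregard the sub-groups and compute from all pupils together.
pooled = grade_3_changes + grade_4_changes
pooled_m = mean(pooled)
pooled_sdm = pstdev(pooled) / sqrt(len(pooled))  # SD divided by the square root of N

print(equal_weight_m, pooled_m, round(pooled_sdm, 2))
```

 With unequal numbers of pupils the two plans give different M's, which is the first objection raised above; only the pooled plan supplies a reliability for the combined result.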
 
 A good illustration of the value of being alert for the sub- 
 groups is afforded by an experiment conducted by Eliza F. 
 Ogglesby of Detroit upon 350 experimental and 350 con- 
 trol first-grade pupils. The purpose of the experiment was 
 to discover whether a new reading book she had prepared 
 especially for slow pupils was superior to one previously in 
 use, and, if so, whether it was better for dull pupils than for 
 normal pupils or bright pupils. Miss Ogglesby has furnished 
 the author with the summary of her experiment. This is
 shown in Table 28. There were 100, 150, and 100 pupils
 in each of the bright, normal, and dull groups, respectively.
 EF1 is the new book, EF2 is the usual book. The data show
 that the new book is superior to the old by 0.65 points for
 the bright group, 0.91 points for the normal group, and 2.44
 points for the dull group. This suggests that it is an advan- 
 tage to make books adapted to these different levels of 
 capacity. 
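
 The D and EC columns of Table 28 can be recomputed from the EF1 and EF2 means and the SDD's given there. The following Python sketch does so; the function name `experimental_coefficient` and the dictionary layout are merely illustrative choices.

```python
def experimental_coefficient(d: float, sdd: float) -> float:
    """EC = D divided by 2.78 times SDD."""
    return d / (2.78 * sdd)


# (M under EF1, M under EF2, SDD) for each sub-group, from Table 28.
table_28 = {
    "bright": (14.11, 13.46, 0.27),
    "normal": (13.05, 12.14, 0.31),
    "dull": (11.08, 8.64, 0.58),
}

results = {}
for group, (m_ef1, m_ef2, sdd) in table_28.items():
    d = m_ef1 - m_ef2
    results[group] = (round(d, 2), round(experimental_coefficient(d, sdd), 2))
    print(group, *results[group])
```

 The rounded D's and EC's reproduce the entries of Table 28 for all three sub-groups.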
 
 Computation Model VII.—Another common form of 
 experimentation is one where there is for each group an 
 initial test, one or more intermediate tests, and a final test. 
 In an experiment extending over a school year it is fre- 
 quently desirable to give an intermediate test at the end of 
 the first semester. This tends to strengthen the experiment 
 and fortify the conclusions. 
 
 Computation model VII in Table 29 shows how to treat 
 an experiment of two equivalent groups, two EF’s, two test 
 types, and an intermediate test for each test type. By a 
 horizontal and vertical extension of this table provision could 
 be made, respectively, for more EF’s or intermediate tests, 
 and more test types. 
 
 In Table 29, the usual form has been somewhat abbre- 
 viated to save space. C1 is the change from IT1 to INT1. 
 
 TABLE 30

 COMPUTATION MODEL VIII

 Three Equivalent-groups with Three Sub-groups — Three EF’s — Three Test Types — One Intermediate Test

 Rural Pupils — EF1                       | Rural Pupils — EF2                       | Rural Pupils — EF3
 P  IT1  INT1  FT1  C1     C2     C3      | P  IT1  INT1  FT1  C4     C5     C6      | P  IT1  INT1  FT1  C7     C8     C9
 N                  M1     M2     M3      | N                  M4     M5     M6      | N                  M7     M8     M9
                    SDM1   SDM2   SDM3    |                    SDM4   SDM5   SDM6    |                    SDM7   SDM8   SDM9
 P  IT2  INT2  FT2  C10    C11    C12     | P  IT2  INT2  FT2  C13    C14    C15     | P  IT2  INT2  FT2  C16    C17    C18
 N                  M10    M11    M12     | N                  M13    M14    M15     | N                  M16    M17    M18
                    SDM10  SDM11  SDM12   |                    SDM13  SDM14  SDM15   |                    SDM16  SDM17  SDM18
 P  IT3  INT3  FT3  C19    C20    C21     | P  IT3  INT3  FT3  C22    C23    C24     | P  IT3  INT3  FT3  C25    C26    C27
 N                  M19    M20    M21     | N                  M22    M23    M24     | N                  M25    M26    M27
                    SDM19  SDM20  SDM21   |                    SDM22  SDM23  SDM24   |                    SDM25  SDM26  SDM27

 Suburban Pupils — EF1                    | Suburban Pupils — EF2                    | Suburban Pupils — EF3
 P  IT1  INT1  FT1  C28    C29    C30     | P  IT1  INT1  FT1  C31    C32    C33     | P  IT1  INT1  FT1  C34    C35    C36
 N                  M28    M29    M30     | N                  M31    M32    M33     | N                  M34    M35    M36
                    SDM28  SDM29  SDM30   |                    SDM31  SDM32  SDM33   |                    SDM34  SDM35  SDM36
 P  IT2  INT2  FT2  C37    C38    C39     | P  IT2  INT2  FT2  C40    C41    C42     | P  IT2  INT2  FT2  C43    C44    C45
 N                  M37    M38    M39     | N                  M40    M41    M42     | N                  M43    M44    M45
                    SDM37  SDM38  SDM39   |                    SDM40  SDM41  SDM42   |                    SDM43  SDM44  SDM45
 P  IT3  INT3  FT3  C46    C47    C48     | P  IT3  INT3  FT3  C49    C50    C51     | P  IT3  INT3  FT3  C52    C53    C54
 N                  M46    M47    M48     | N                  M49    M50    M51     | N                  M52    M53    M54
                    SDM46  SDM47  SDM48   |                    SDM49  SDM50  SDM51   |                    SDM52  SDM53  SDM54

 Urban Pupils — EF1                       | Urban Pupils — EF2                       | Urban Pupils — EF3
 P  IT1  INT1  FT1  C55    C56    C57     | P  IT1  INT1  FT1  C58    C59    C60     | P  IT1  INT1  FT1  C61    C62    C63
 N                  M55    M56    M57     | N                  M58    M59    M60     | N                  M61    M62    M63
                    SDM55  SDM56  SDM57   |                    SDM58  SDM59  SDM60   |                    SDM61  SDM62  SDM63
 P  IT2  INT2  FT2  C64    C65    C66     | P  IT2  INT2  FT2  C67    C68    C69     | P  IT2  INT2  FT2  C70    C71    C72
 N                  M64    M65    M66     | N                  M67    M68    M69     | N                  M70    M71    M72
                    SDM64  SDM65  SDM66   |                    SDM67  SDM68  SDM69   |                    SDM70  SDM71  SDM72
 P  IT3  INT3  FT3  C73    C74    C75     | P  IT3  INT3  FT3  C76    C77    C78     | P  IT3  INT3  FT3  C79    C80    C81
 N                  M73    M74    M75     | N                  M76    M77    M78     | N                  M79    M80    M81
                    SDM73  SDM74  SDM75   |                    SDM76  SDM77  SDM78   |                    SDM79  SDM80  SDM81
 
 
 C2 is the change from INT1 to FT1. C3 is the change from 
 IT1 to FT1, and similarly throughout the table. The AM, 
 c, x, x², Sx², and SD involved in the computation of SDM1
 are omitted. The same omission occurs in the case of
 SDM2, SDM3, SDM4, and so on.
 
 Computation Model VIII.—Computation model VIII, 
 shown in Table 30, is a sort of composite computation model 
 or a sort of summary of all the models which have preceded. 
 It illustrates an experiment where there are three EF’s, three 
 sub-groups, three test types, and one intermediate test. This 
 computation model embraces practically all the difficulties 
 in computation ever presented by a regular equivalent-groups 
 experiment. How to handle certain rare forms of the 
 equivalent-groups experiment is considered at the end of 
 the next chapter. 
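
 The numbering of the M's in Table 30 follows a regular pattern, and the whole Summary can be enumerated mechanically. The Python sketch below rests on that pattern as read off the table: the subscript rises by 27 for each sub-group, by 9 for each test type, by 3 for each EF, and by 1 for each interval. The function names `m_index` and `summary_rows` are hypothetical and serve only for illustration.

```python
from itertools import combinations

SUB_GROUPS = ["Rural", "Suburban", "Urban"]
INTERVALS = ["Initial to Intermediate", "Intermediate to Final", "Initial to Final"]
EFS = [1, 2, 3]
TESTS = [1, 2, 3]


def m_index(sub_group: int, test: int, ef: int, interval: int) -> int:
    """Subscript of M in Table 30; every argument is a zero-based position."""
    return 27 * sub_group + 9 * test + 3 * ef + interval + 1


def summary_rows():
    """Yield one row for every D called for in the Summary of Table 30."""
    for s, sub_group in enumerate(SUB_GROUPS):
        for i, interval in enumerate(INTERVALS):
            # Each pair of EF's is compared: EF1-EF2, EF1-EF3, EF2-EF3.
            for e1, e2 in combinations(range(len(EFS)), 2):
                for t, test in enumerate(TESTS):
                    left = m_index(s, t, e1, i)
                    right = m_index(s, t, e2, i)
                    yield (sub_group, interval, (EFS[e1], EFS[e2]), test,
                           f"M{left} - M{right}")


rows = list(summary_rows())
print(len(rows))   # number of D's to be computed
print(rows[0])     # first row of the Summary
print(rows[-1])    # last row of the Summary
```

 Three sub-groups, three intervals, three pairs of EF's, and three test types give 81 D's in all, each with its own SDD, EC, and ED.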
 
 TABLE 30

 SUMMARY

 Rural Pupils — Initial Test to Intermediate Test

           EF1   EF2   D            SDD   EC      ED
 Test 1    M1    M4    M1 − M4      SDD   EC      ED
 Test 2    M10   M13   M10 − M13    SDD   EC      ED
 Test 3    M19   M22   M19 − M22    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF1   EF3   D            SDD   EC      ED
 Test 1    M1    M7    M1 − M7      SDD   EC      ED
 Test 2    M10   M16   M10 − M16    SDD   EC      ED
 Test 3    M19   M25   M19 − M25    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF2   EF3   D            SDD   EC      ED
 Test 1    M4    M7    M4 − M7      SDD   EC      ED
 Test 2    M13   M16   M13 − M16    SDD   EC      ED
 Test 3    M22   M25   M22 − M25    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED
 
 
 Rural Pupils — Intermediate Test to Final Test

           EF1   EF2   D            SDD   EC      ED
 Test 1    M2    M5    M2 − M5      SDD   EC      ED
 Test 2    M11   M14   M11 − M14    SDD   EC      ED
 Test 3    M20   M23   M20 − M23    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF1   EF3   D            SDD   EC      ED
 Test 1    M2    M8    M2 − M8      SDD   EC      ED
 Test 2    M11   M17   M11 − M17    SDD   EC      ED
 Test 3    M20   M26   M20 − M26    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF2   EF3   D            SDD   EC      ED
 Test 1    M5    M8    M5 − M8      SDD   EC      ED
 Test 2    M14   M17   M14 − M17    SDD   EC      ED
 Test 3    M23   M26   M23 − M26    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

 Rural Pupils — Initial Test to Final Test

           EF1   EF2   D            SDD   EC      ED
 Test 1    M3    M6    M3 − M6      SDD   EC      ED
 Test 2    M12   M15   M12 − M15    SDD   EC      ED
 Test 3    M21   M24   M21 − M24    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF1   EF3   D            SDD   EC      ED
 Test 1    M3    M9    M3 − M9      SDD   EC      ED
 Test 2    M12   M18   M12 − M18    SDD   EC      ED
 Test 3    M21   M27   M21 − M27    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF2   EF3   D            SDD   EC      ED
 Test 1    M6    M9    M6 − M9      SDD   EC      ED
 Test 2    M15   M18   M15 − M18    SDD   EC      ED
 Test 3    M24   M27   M24 − M27    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED
 
 
 
 
 Suburban Pupils — Initial Test to Intermediate Test

           EF1   EF2   D            SDD   EC      ED
 Test 1    M28   M31   M28 − M31    SDD   EC      ED
 Test 2    M37   M40   M37 − M40    SDD   EC      ED
 Test 3    M46   M49   M46 − M49    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF1   EF3   D            SDD   EC      ED
 Test 1    M28   M34   M28 − M34    SDD   EC      ED
 Test 2    M37   M43   M37 − M43    SDD   EC      ED
 Test 3    M46   M52   M46 − M52    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF2   EF3   D            SDD   EC      ED
 Test 1    M31   M34   M31 − M34    SDD   EC      ED
 Test 2    M40   M43   M40 − M43    SDD   EC      ED
 Test 3    M49   M52   M49 − M52    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

 Suburban Pupils — Intermediate Test to Final Test

           EF1   EF2   D            SDD   EC      ED
 Test 1    M29   M32   M29 − M32    SDD   EC      ED
 Test 2    M38   M41   M38 − M41    SDD   EC      ED
 Test 3    M47   M50   M47 − M50    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF1   EF3   D            SDD   EC      ED
 Test 1    M29   M35   M29 − M35    SDD   EC      ED
 Test 2    M38   M44   M38 − M44    SDD   EC      ED
 Test 3    M47   M53   M47 − M53    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF2   EF3   D            SDD   EC      ED
 Test 1    M32   M35   M32 − M35    SDD   EC      ED
 Test 2    M41   M44   M41 − M44    SDD   EC      ED
 Test 3    M50   M53   M50 − M53    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED
 
 
 Suburban Pupils — Initial Test to Final Test

           EF1   EF2   D            SDD   EC      ED
 Test 1    M30   M33   M30 − M33    SDD   EC      ED
 Test 2    M39   M42   M39 − M42    SDD   EC      ED
 Test 3    M48   M51   M48 − M51    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF1   EF3   D            SDD   EC      ED
 Test 1    M30   M36   M30 − M36    SDD   EC      ED
 Test 2    M39   M45   M39 − M45    SDD   EC      ED
 Test 3    M48   M54   M48 − M54    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF2   EF3   D            SDD   EC      ED
 Test 1    M33   M36   M33 − M36    SDD   EC      ED
 Test 2    M42   M45   M42 − M45    SDD   EC      ED
 Test 3    M51   M54   M51 − M54    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

 Urban Pupils — Initial Test to Intermediate Test

           EF1   EF2   D            SDD   EC      ED
 Test 1    M55   M58   M55 − M58    SDD   EC      ED
 Test 2    M64   M67   M64 − M67    SDD   EC      ED
 Test 3    M73   M76   M73 − M76    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF1   EF3   D            SDD   EC      ED
 Test 1    M55   M61   M55 − M61    SDD   EC      ED
 Test 2    M64   M70   M64 − M70    SDD   EC      ED
 Test 3    M73   M79   M73 − M79    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF2   EF3   D            SDD   EC      ED
 Test 1    M58   M61   M58 − M61    SDD   EC      ED
 Test 2    M67   M70   M67 − M70    SDD   EC      ED
 Test 3    M76   M79   M76 − M79    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED
 
 
 Urban Pupils — Intermediate Test to Final Test

           EF1   EF2   D            SDD   EC      ED
 Test 1    M56   M59   M56 − M59    SDD   EC      ED
 Test 2    M65   M68   M65 − M68    SDD   EC      ED
 Test 3    M74   M77   M74 − M77    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF1   EF3   D            SDD   EC      ED
 Test 1    M56   M62   M56 − M62    SDD   EC      ED
 Test 2    M65   M71   M65 − M71    SDD   EC      ED
 Test 3    M74   M80   M74 − M80    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF2   EF3   D            SDD   EC      ED
 Test 1    M59   M62   M59 − M62    SDD   EC      ED
 Test 2    M68   M71   M68 − M71    SDD   EC      ED
 Test 3    M77   M80   M77 − M80    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

 Urban Pupils — Initial Test to Final Test

           EF1   EF2   D            SDD   EC      ED
 Test 1    M57   M60   M57 − M60    SDD   EC      ED
 Test 2    M66   M69   M66 − M69    SDD   EC      ED
 Test 3    M75   M78   M75 − M78    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF1   EF3   D            SDD   EC      ED
 Test 1    M57   M63   M57 − M63    SDD   EC      ED
 Test 2    M66   M72   M66 − M72    SDD   EC      ED
 Test 3    M75   M81   M75 − M81    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED

           EF2   EF3   D            SDD   EC      ED
 Test 1    M60   M63   M60 − M63    SDD   EC      ED
 Test 2    M69   M72   M69 − M72    SDD   EC      ED
 Test 3    M78   M81   M78 − M81    SDD   EC      ED
                                          MEC     MED
                                          ECMEC   ECMED
 
 
 
CHAPTER VIII 
 
 COMPUTATIONS FOR THE ROTATION 
 EXPERIMENTAL METHOD 
 
 Computation Model IX.—The nature and functions of 
 the rotation experimental method were discussed in Chapter 
 II. It remains to illustrate the statistical computations necessary
 to yield the conclusion from a rotation experiment,
 together with the reliability of the conclusion. 
 
 Computation model IX is for the simplest type of rota- 
 tion experiment, namely, two groups which may or may not 
 be equivalent, two EF’s, and one type of test. 
 
 TABLE 31

 COMPUTATION MODEL IX — ROTATION METHOD

 Two Groups — Two EF’s — One Test Type

 Group A — EF1            Group B — EF2
 P  IT1  FT1  C1          P  IT1  FT1  C2
 N            M1          N            M2
              SDM1                     SDM2

 Group A — EF2            Group B — EF1
 P  IT1  FT1  C3          P  IT1  FT1  C4
 N            M3          N            M4
              SDM3                     SDM4

 SUMMARY
           EF1        SDS1                     EF2        SDS2
 Test 1    M1 + M4    √[(SDM1)² + (SDM4)²]     M2 + M3    √[(SDM2)² + (SDM3)²]

           D                        SDD                      EC
           (M1 + M4) − (M2 + M3)    √[(SDS1)² + (SDS2)²]     D ÷ 2.78 SDD
 
 
 
 
 
 The first point to note in computation model IX, in Table 
 31, is that Group A has EF1 applied to it first and EF2 
 applied second, whereas the EF’s are applied to Group B 
 in the reverse order. Since both EF1 and EF2 appear first 
 and second any advantage of order is rotated out. 
 
 According to the computation model, Group A experiences 
 in order IT1, EF1, FT1, IT1 again, EF2, and FT1 again. 
 This does not mean that the second IT1 and FT1 will yield 
 identical scores with those yielded by the first IT1 and FT1,
 respectively. It does not even mean that the identical test- 
 ing instrument must be employed. It means merely that the 
 same general mental function is usually tested in both in- 
 stances. In rare cases, however, the similarity between the 
 mental functions tested is slight or non-existent. 
 
 Sample problems will make clear the various possible de- 
 grees of similarity between the first and second pair of tests. 
 Assume EF1 to be a high per cent of recirculated air for a
 classroom, and EF2 to be a continuous supply of wholly
 fresh air. Assume that each EF operates one semester. The
 first IT1 for Group A might be a test of general reading
 ability. The first FT1 could be the identical testing instrument,
 a duplicate test of reading ability, or some other test
 of general reading ability. It must measure the same trait
 as the IT1. The second IT1 for Group A could be the same
 test as that already used, or a duplicate test, or another test
 of general reading ability, or a test of a similar mental function,
 say a vocabulary test, or a totally different sort of
 test, say, a test of fundamentals of arithmetic. The second
 FT1 must test the same trait as its IT1. Furthermore, the
 same tests used for Group A with EF1 and EF2 must be
 used for Group B with EF2 and EF1, respectively. This 
 will prevent penalizing either EF since each EF will have 
 both varieties of tests. 
 
 Consider another sample problem. Assume EF1 to be
 motion-picture presentation of a lesson, and EF2 to be 
 teacher presentation. The subject of the motion picture 
 might be the geography of Alaska. This would require the 
 
Computations for the Rotation Experimental Method 189 
 
 first IT1 and FT1 to be constructed of Alaskan content.
 But the teacher could not well use the identical topic and 
 identical tests a second time. The carry-over would be alto- 
 gether too large. She could choose, instead, say, the geog- 
 raphy of Hawaii. This topic would require that the second 
 IT1 and FT1 have a Hawaiian content. In Group B the
 order of topics would have to be reversed so that EF2 would 
 secure any advantages or disadvantages of the Alaskan topic 
 and tests, and EF1 any advantages or disadvantages of the 
 Hawaiian topic and tests. 
 
 Both the first and second IT’s for both Group A and 
 Group B are often not applied in rotation experiments. In 
 case Alaska and Hawaii are known to be new to the pupils, 
 and if, in addition, the test questions are so highly specific 
 that they could not be answered from general information 
 about the geography of places other than Alaska and Hawaii, 
 the experimenter frequently assumes that the pupils’ knowl- 
 edge is zero and so records it without testing. Even when 
 such an assumption introduces a slight error, it is sometimes 
 an advantage to accept the error and omit applying the IT’s. 
 Sometimes it is an advantage to keep pupils ignorant of that 
 upon which they are to be tested until the EF1 has been 
 applied. The IT1 prevents such concealment unless a duplicate
 test is available.
 
 There is a special situation where the second IT1’s for 
 both Group A and Group B are not applied. If EF2 for 
 Group A follows EF1 immediately, and if EF1 for Group B 
 follows EF2 immediately, and if, in addition, the identical 
 or equivalent test used for the first FT1 is to be used for 
 the second IT1, then the scores made on the first FT1 may 
 be assumed to be identical with those which would result 
 from giving the test again as IT1.
 
 As shown by the Summary, the total C produced by EF1 
 is M1 + M4. The C produced in Group A by EF1 is M1.
 That produced in Group B by EF1 is M4. The sum of
 these gives the C produced in both groups by EF1. In like
 manner, the total C produced by EF2 in both groups is
 M2 + M3. The D between EF1 and EF2 becomes, then,
 (M1 + M4) − (M2 + M3).
 
 To compute the SDD of this last quantity requires us to 
 know the reliability of its two components M1 + M4 and
 M2 + M3. From a knowledge of the reliability of M1 and
 M4 it is possible to compute the reliability of their sum, i.e.,
 it is possible to compute SD of the sum, or SDS or SDS1.
 As shown in the table, the formula for computing the re- 
 liability of the sum of the two M’s is just like the formula 
 for computing the reliability of the difference between two 
 M’s. All preceding computation models have made this 
 latter formula familiar to the reader. Once the SDS1 and 
 SDS2 have been computed SDD and EC are readily deter- 
 mined, as shown. The more detailed formula for EC may 
 be written thus: 
 
 $$\mathrm{EC} = \frac{(\mathrm{M1} + \mathrm{M4}) - (\mathrm{M2} + \mathrm{M3})}{2.78\,\sqrt{(\mathrm{SDS1})^2 + (\mathrm{SDS2})^2}}$$
 
 Reliability Computations in Special Situations.—It
 was stated in the preceding paragraph that the formula for
 the reliability of a sum is identical with the formula for the
 reliability of a difference. In the short form in which these
 formulae are usually used and commonly published, they are
 alike. The complete, long formulae, as given below, are
 not identical.
 
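 $$\mathrm{SDD} = \sqrt{(\mathrm{SDM1})^2 + (\mathrm{SDM2})^2 - 2\,r_{12}\,(\mathrm{SD1})(\mathrm{SD2})}$$
 $$\mathrm{SDS} = \sqrt{(\mathrm{SDM1})^2 + (\mathrm{SDM2})^2 + 2\,r_{12}\,(\mathrm{SD1})(\mathrm{SD2})}$$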
 
 When the sum of three numbers is involved the formula be- 
 comes: 
 
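 $$\mathrm{SDS} = \sqrt{(\mathrm{SDM1})^2 + (\mathrm{SDM2})^2 + (\mathrm{SDM3})^2 + 2\,r_{12}\,(\mathrm{SD1})(\mathrm{SD2}) + 2\,r_{13}\,(\mathrm{SD1})(\mathrm{SD3}) + 2\,r_{23}\,(\mathrm{SD2})(\mathrm{SD3})}$$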
 
 In the preceding chapter, the reader was shown how M1 
 could be computed by getting the difference between the M 
 of the IT and the M of the FT, and how the SDM1 could
 be computed by a formula which utilized the SDM of the
 IT, SDM of the FT, the coefficient of correlation between 
 IT and FT, SD of IT, and SD of FT. The M1, so computed,
 is really a D, and the SDM1 is really an SDD. Consequently
 the above formula for SDD is identical in form
 with the SDM1 formula just referred to. Just as it is possible
 to determine M1 by subtracting M of the IT from M
 of FT, so it is possible to compute MS by adding M of IT 
 and M of FT. If this were needed for some purpose and 
 actually done, the SDMS formula would be identical with 
 the SDS formula given above. 
 
 In the SDS1 formula given in Table 31 it is permissible
 to omit the 2 r12 (SD1)(SD2) portion of the formula because
 the coefficient of correlation between the C1’s and
 C4’s may be assumed to be zero, since the pairing of each
 C1 with some C4 would be by chance, and similarly for the
 SDS2 formula. But in computing the SDM1 or SDMS mentioned
 above, an assumption of zero correlation between IT
 and FT is not permissible. It is far more probable that
 some correlation will exist. To ignore the last portion of
 the formula might lead to a grossly exaggerated SDM1 or
 SDMS. How this exaggeration may occur is shown by the
 following data. Obviously the M1 and SDM1 computed
 through C1 are 5 and zero, respectively. Computed through
 M of IT and M of FT, the M1 likewise comes out 5. Computed
 through M of IT and M of FT, SDM1 comes out
 zero, provided 2 r12 (SD1)(SD2) is utilized in its computation.
 
 Pupil   IT1   FT1   C1
 a       10    15    5
 b       12    17    5
 c       14    19    5
 d       16    21    5
 M       13    18    M1 = 5
                     SDM1 = 0
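
 The two routes to SDM1 for these four pupils may be checked with the following Python sketch. It is offered under two stated assumptions: SD is computed by dividing by N, as in the computation models, and the SD1 and SD2 in the correlation term of the long formula are taken to be the SD's of the two means (that is, SDM of IT and SDM of FT), which is the reading under which the long formula returns zero for these data. The variable names are illustrative only.

```python
from math import sqrt
from statistics import mean, pstdev

it = [10, 12, 14, 16]   # IT1 scores of pupils a, b, c, d
ft = [15, 17, 19, 21]   # FT1 scores of the same pupils
n = len(it)

# Direct route: take each pupil's change C1, then its M and SDM.
changes = [f - i for i, f in zip(it, ft)]
m1_direct = mean(changes)
sdm1_direct = pstdev(changes) / sqrt(n)

# Indirect route: work from M and SDM of IT and of FT.
m_it, m_ft = mean(it), mean(ft)
sd_it, sd_ft = pstdev(it), pstdev(ft)
sdm_it, sdm_ft = sd_it / sqrt(n), sd_ft / sqrt(n)
m1_indirect = m_ft - m_it

# Coefficient of correlation between IT and FT.
cov = mean((i - m_it) * (f - m_ft) for i, f in zip(it, ft))
r = cov / (sd_it * sd_ft)

# Short formula (correlation term omitted) and long formula (term kept).
sdm1_short = sqrt(sdm_it ** 2 + sdm_ft ** 2)
long_variance = sdm_it ** 2 + sdm_ft ** 2 - 2 * r * sdm_it * sdm_ft
sdm1_long = sqrt(max(long_variance, 0.0))  # guard against tiny negative rounding error

print(m1_direct, sdm1_direct)
print(m1_indirect, round(sdm1_short, 2), round(sdm1_long, 2))
```

 With the correlation term kept, the indirect route agrees with the direct one; with the term dropped, SDM1 comes out near 1.6 where the true value is zero, which is the exaggeration described above.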
 
 In computing any SDD or SDS, then, the short form of
 the reliability formula may be employed provided the elements
 that enter into the formula are uncorrelated, or are
 relatively uncorrelated. The SDD in Table 31 may be com- 
 puted by means of the short formula because the C1’s and 
 C2’s come from different groups and hence their correlation 
 may be assumed to be zero. The SDD in the one-group 
 experiment shown in Table 20 has been computed with the 
 short formula, because the C1’s and C2’s do not appear to 
 be at all closely correlated. Usually, however, such correla- 
 tion is more in evidence, due to the fact that the brighter 
 pupils tend to have larger C’s under all EF’s. The one- 
 group method is peculiarly liable to manifest such correla- 
 tion, and hence with it the SDD should usually be computed 
 by the long formula. 
 
 The formula for the computation of SDM as illustrated 
 in all the computation models is appropriate only when N 
 exceeds 30. When N is less than 10 compute SDM thus: 
 
 
 
 SVE a, 
 7 VN—2 
 
 When N is between 10 and 20, compute SDM thus:
 Rh eee ae 
 7 VN=2 
 
 When N is between 20 and 30, compute SDM thus: 
 asf yd a tulle 
 Mavi 
 
 When N is above 30, compute SDM thus:

 $$\mathrm{SDM} = \frac{\mathrm{SD}}{\sqrt{N}}$$
 
 The last formula is used in all computation models and 
 illustrations of such models, irrespective of the number of 
 pupils, because most actual experiments will employ 30 or 
 more cases and because the sample data given merely typify 
 a much larger amount of data. 
 

 
 TABLE 32

 ILLUSTRATING COMPUTATION MODEL IX — ROTATION METHOD

 Two Groups — Two EF’s — One Test Type

 Group A — EF1                              Group B — EF2
 P    IT1   FT1   C1    x    x²             P    IT1   FT1   C2    x    x²
 a    30    34     4    1    1              e    32    35     3    2    4
 b    40    42     2   −1    1              f    38    38     0   −1    1
 c    45    50     5    2    4              g    45    48     3    2    4
 d    50    53     3    0    0              h    49    47    −2   −3    9
 N = 4      M1 = 3.5    Sx² = 6             N = 4      M2 = 1.0    Sx² = 18
 AM = 3.0   SD = √[6/4 − (0.5)²] = 1.1      AM = 1.0   SD = √[18/4 − (0)²] = 2.1
 c = 0.5    SDM1 = 1.1 ÷ √4 = 0.6           c = 0      SDM2 = 2.1 ÷ √4 = 1.1

 Group A — EF2                              Group B — EF1
 P    IT1   FT1   C3    x    x²             P    IT1   FT1   C4    x    x²
 a    34    36     2    1    1              e    35    40     5    3    9
 b    42    40    −2   −3    9              f    38    40     2    0    0
 c    50    52     2    1    1              g    48    49     1   −1    1
 d    53    56     3    2    4              h    47    49     2    0    0
 N = 4      M3 = 1.3    Sx² = 15            N = 4      M4 = 2.5    Sx² = 10
 AM = 1.0   SD = √[15/4 − (0.3)²] = 1.9     AM = 2.0   SD = √[10/4 − (0.5)²] = 1.5
 c = 0.3    SDM3 = 1.9 ÷ √4 = 1.0           c = 0.5    SDM4 = 1.5 ÷ √4 = 0.8

 SUMMARY
           EF1   SDS1   EF2   SDS2   D     SDD   EC
 Test 1    6.0   1.0    2.3   1.5    3.7   1.8   0.7
 
 
 Illustration of Computation Model IX.—Since compu- 
 tation model IX is the basic rotation-experiment model out 
 of which all other rotation models will be constructed, it had 
 better be illustrated with sample data. Assume the problem 
 to be the relative mental effectiveness of recirculated air 
 (EF1) vs. fresh air (EF2). Assume the test used to deter- 
 mine this relative effectiveness to be a reading test. The 
 necessary computations are shown in Table 32. 
 
 Only the Summary in Table 32 needs explanation. The
 EF1 is 3.5 plus 2.5, i.e., 6.0. SDS1 is the √[(0.6)² + (0.8)²],
 i.e., 1.0. EF2 is 1.0 plus 1.3, i.e., 2.3. SDS2 is the
 √[(1.1)² + (1.0)²], i.e., 1.5. D is 6.0 minus 2.3, i.e., 3.7.
 SDD is the √[(1.0)² + (1.5)²], i.e., 1.8. EC is 3.7 divided
 by 2.78 times 1.8, i.e., 0.7. The conclusion from this experiment
 is shown by D, which tells us that recirculated air is
 better than fresh air by 3.7 points for the reading development
 of pupils used in this experiment and for all those from
 whom these pupils are a random sampling. But we can be
 only 0.7 practically certain that this conclusion is true for
 the larger group.
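
The arithmetic of the Summary can be retraced in a few lines. The following is a minimal sketch, not part of the original text: the function names are invented for illustration, and the M and SDM figures are those quoted in the explanation of the Summary, rounded to one decimal at each step as the text does.

```python
import math

def root_sum_squares(*values):
    """Square root of the sum of squares, as used for SDS and SDD."""
    return math.sqrt(sum(v * v for v in values))

def experimental_coefficient(d, sdd):
    """EC as described in the text: D divided by 2.78 times SDD."""
    return d / (2.78 * sdd)

# M's and SDM's quoted in the explanation of the Summary of Table 32.
ef1 = 3.5 + 2.5                               # recirculated air, both halves
ef2 = 1.0 + 1.3                               # fresh air, both halves
sds1 = round(root_sum_squares(0.6, 0.8), 1)
sds2 = round(root_sum_squares(1.1, 1.0), 1)

d = round(ef1 - ef2, 1)
sdd = round(root_sum_squares(sds1, sds2), 1)
ec = round(experimental_coefficient(d, sdd), 1)

print(f"EF1={ef1} SDS1={sds1} EF2={ef2} SDS2={sds2} D={d} SDD={sdd} EC={ec}")
```

Rounding at each step mirrors the hand computation; carrying full precision throughout would change the intermediate figures slightly.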
 
 The data of Table 32 are artificial and inadequate. This 
 experiment was actually conducted by Thorndike and Mc- 
 Call under the auspices of the Ventilation Commission of 
 New York. The EF’s, as here, were washed recirculated 
 air and fresh air. All other conditions of temperature, 
 humidity, and the like were kept constant. Group A was a 
 group of 44 typical sixth-grade public-school pupils. Group 
 B was another similar group of 44 pupils. The two teachers 
 divided the work and both taught both groups. At the mid- 
 dle of the year the EF’s were rotated, as shown in Table 32. 
 A large number of mental and educational tests were used, 
 as were the teachers’ marks. The conclusion from the actual 
 experiment also favored the recirculated air. The experi- 
 ment was repeated a year later by Thorndike and Ruger. 
 The second experiment verified the first. These experiments 
 are described in School and Society for May 6 and August 
 12, 1916. 
 
TABLE 33

COMPUTATION MODEL X: ROTATION METHOD
Three Groups, Three EF’s, One Test Type

Group A, EF1            Group B, EF2            Group C, EF3
P   IT1   FT1   C1      P   IT1   FT1   C2      P   IT1   FT1   C3
N   M1                  N   M2                  N   M3
SDM1                    SDM2                    SDM3

Group A, EF2            Group B, EF3            Group C, EF1
P   IT1   FT1   C4      P   IT1   FT1   C5      P   IT1   FT1   C6
N   M4                  N   M5                  N   M6
SDM4                    SDM5                    SDM6

Group A, EF3            Group B, EF1            Group C, EF2
P   IT1   FT1   C7      P   IT1   FT1   C8      P   IT1   FT1   C9
N   M7                  N   M8                  N   M9
SDM7                    SDM8                    SDM9

SUMMARY
          EF1             SDS1   EF2             SDS2   D                                 SDD   EC
Test 1    M1 + M6 + M8    SDS1   M2 + M4 + M9    SDS2   (M1 + M6 + M8) - (M2 + M4 + M9)   SDD   EC

          EF1             SDS1   EF3             SDS3   D                                 SDD   EC
Test 1    M1 + M6 + M8    SDS1   M3 + M5 + M7    SDS3   (M1 + M6 + M8) - (M3 + M5 + M7)   SDD   EC

          EF2             SDS2   EF3             SDS3   D                                 SDD   EC
Test 1    M2 + M4 + M9    SDS2   M3 + M5 + M7    SDS3   (M2 + M4 + M9) - (M3 + M5 + M7)   SDD   EC
 
 Computation Model X.—The purpose of presenting 
 computation model X, shown in Table 33, is to indicate the 
 computations needed with the rotation method when there 
 are three EF’s, and, consequently, three groups, and one type 
 of test. By an appropriate extension to the right and down- 
 ward, computation model X may be adapted for any num- 
 ber of EF’s. 
 
 The computation of the SDS’s in Table 33 requires ex- 
 planation. The formula for the computation of SDS1 is as 
 follows: 
 
 SDS1 = √[(SDM1)² + (SDM6)² + (SDM8)²]
 
 SDS2 and SDS3 were computed in similar manner. 
 
 In Chapter II, it was stated that the object of the rota- 
 tion experimental method may be to determine the relative 
 effectiveness of two or more EF’s. If this is the object of 
 the experiment, the three EF’s will be distinctly different 
 EF’s. If, however, the object is to determine the absolute 
 effectiveness of EF1 and EF2 as well as their relative effec- 
 tiveness, EF3 must be the mere absence of EF1 and EF2, 
 thereby showing the normal change produced during the 
 experiment by general conditions other than EF1 or EF2. 
 In this case, the first D in Table 33 shows the relative effec- 
 tiveness of EF1 and EF2. The second D shows the absolute 
 change produced by EF1. The third D shows the absolute 
 change produced by EF2. 
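
The three comparisons of computation model X can be sketched as follows. This is an illustration only: every M and SDM value is hypothetical, the names are invented, and the assignment of blocks 2, 4, 9 to EF2 and 3, 5, 7 to EF3 is read from the layout of Table 33 (the text itself states only that EF1 uses blocks 1, 6, and 8).

```python
import math
from itertools import combinations

def root_sum_squares(values):
    """Square root of the sum of squares (used for SDS and SDD)."""
    return math.sqrt(sum(v * v for v in values))

# Hypothetical M and SDM for the nine group-by-EF blocks, keyed by block number.
M = {1: 4.0, 2: 3.0, 3: 1.0, 4: 3.5, 5: 0.5, 6: 4.5, 7: 1.5, 8: 5.0, 9: 2.5}
SDM = {block: 0.5 for block in M}

# Blocks in which each EF was applied.
BLOCKS = {"EF1": (1, 6, 8), "EF2": (2, 4, 9), "EF3": (3, 5, 7)}

totals = {ef: sum(M[b] for b in blocks) for ef, blocks in BLOCKS.items()}
sds = {ef: root_sum_squares(SDM[b] for b in blocks) for ef, blocks in BLOCKS.items()}

def compare(a, b):
    """Return D, SDD and EC for EF a against EF b."""
    d = totals[a] - totals[b]
    sdd = root_sum_squares([sds[a], sds[b]])
    return d, sdd, d / (2.78 * sdd)

# First D: EF1 vs EF2; second D: EF1 vs EF3; third D: EF2 vs EF3.
for a, b in combinations(BLOCKS, 2):
    d, sdd, ec = compare(a, b)
    print(f"{a} vs {b}: D={d:.2f}  SDD={sdd:.2f}  EC={ec:.2f}")
```

If EF3 is the mere absence of EF1 and EF2, the second and third lines printed are the absolute changes described above.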
 
 In none of the computation models has provision been
 made for delayed tests as was done, say, for intermediate
 tests. It frequently happens that an experimenter wishes
 to determine whether the effect of some favorable EF will
 persist. It is conceivable that EF1 may be superior to EF2
 immediately after they have been applied, but that the
 superiority will disappear, or actually turn into an inferiority
 after a month, say, has elapsed. Repetition of the tests a
 month after the FT’s were made will show what effect time
 has had. No special computation model needs to be provided.
 The regular IT’s will serve as the IT’s for the delayed
 test, and the delayed test becomes the FT. From this
 point the computations reproduce the process for the regular
 IT and FT. The final D shows the difference between two
 EF’s plus a defined interval.

TABLE 34

COMPUTATION MODEL XI: ROTATION METHOD
Two Groups, Two EF’s, Two Test Types

Group A, EF1            Group B, EF2
P   IT1   FT1   C1      P   IT1   FT1   C2
N   M1                  N   M2
SDM1                    SDM2
P   IT2   FT2   C3      P   IT2   FT2   C4
N   M3                  N   M4
SDM3                    SDM4

Group A, EF2            Group B, EF1
P   IT1   FT1   C5      P   IT1   FT1   C6
N   M5                  N   M6
SDM5                    SDM6
P   IT2   FT2   C7      P   IT2   FT2   C8
N   M7                  N   M8
SDM7                    SDM8

SUMMARY
          EF1        SDS1   EF2        SDS2   D                       SDD   EC
Test 1    M1 + M6    SDS1   M2 + M5    SDS2   (M1 + M6) - (M2 + M5)   SDD   EC
Test 2    M3 + M8    SDS1   M4 + M7    SDS2   (M3 + M8) - (M4 + M7)   SDD   EC
 
 Computation Model XI.—Computation model XI shows 
 how the computations may be made when two test types are 
 used. By extending this model downward, provision can 
 be made for any number of test types. 
 
 Computation models IX, X, and XI make it clear that 
 computations for rotation experiments are similar funda- 
 mentally to computations for one-group and equivalent- 
 groups methods. With this knowledge, the reader who has 
 mastered the eleven computation models presented will have 
 little difficulty in evolving for himself rotation computation 
 models for any number of EF’s, groups, sub-groups, test 
 types, and intermediate tests. 
 
 Scaling Experimental Tests.—A few pages back it was 
 pointed out that the first IT1’s are not always the same tests 
 as or similar tests to the second IT1’s. Yet all this some- 
 what incomparable data can be combined, and this combina- 
 tion can be combined, in turn, with an equal mixture of 
 rather incomparable data from the IT2’s, provided each test 
 is scaled in comparable units. It is impossible to construct 
 a geography test, say, on Alaska which will be just as diffi- 
 cult as one with a Hawaiian content. Furthermore, it is sel- 
 dom feasible to scale all the tests to be used in advance of 
 and independently of the experiment itself, so as to have 
 comparability of measuring units throughout. 
 
 While conducting some rotation experiments to determine 
 the relative effectiveness of some visual aids, Weber met just 
 this situation, and overcame it economically by using his own 
 experimental data as a basis for scaling the experimental 
 tests. Tests so scaled, while not absolutely required, do add 
 a substantial refinement to experimental computations. 
 
 The following gives the general plan¹ of one of Weber’s
 experiments.

 ¹ Weber, J. J., Comparative Effectiveness of Some Visual Aids in Elementary
 Education (to be published soon).

 Unit I    India
    L-R    Lecture       25 minutes    Review quiz   12 minutes    Group A
    F-L    Film          12 minutes    Lecture       25 minutes    Group B
    L-F    Lecture       25 minutes    Film          12 minutes    Group C

 Unit II   China
    L-R    Lecture       25 minutes    Review quiz   12 minutes    Group C
    F-L    Film          12 minutes    Lecture       25 minutes    Group A
    L-F    Lecture       25 minutes    Film          12 minutes    Group B

 Unit III  Japan
    L-R    Lecture       22 minutes    Review quiz   10 minutes    Group B
    F-L    Film          10 minutes    Lecture       22 minutes    Group C
    L-F    Lecture       22 minutes    Film          10 minutes    Group A
 
 Note that the content of the first experimental unit has to 
 do with India, the second with China, and the third with 
 Japan. Note, further, that EF1 is a lecture followed by a 
 review quiz (L-R), EF2 is a film followed by a lecture on 
 the subject matter of the motion picture, and EF3 is a lec- 
 ture on the material of the motion picture followed by the 
 motion picture. The subject matter of EF1 was drawn from 
 this same motion picture on India. Note, further, that 
 groups A, B, and C, which are approximately equivalent
 seventh-grade classes, are rotated in such a way that each
 group experiences every EF. Note, finally, that the short- 
 ness of the film on Japan required that time allotments be 
 reduced for this unit. 
 
 Since Weber gave no IT’s, the reader should think of his
 FT’s as identical with C. Since seventh-grade pupils started
 this experiment with some knowledge of these lessons on
 India, China, and Japan, as Weber himself proved later,
 he was scarcely justified in treating his FT’s as equivalent
 to C. The effect of doing so is probably to make the SD
 and SDM too large. The error is not serious, and is certainly
 less serious than notifying pupils what to expect in
 the lectures and films by giving tests to the pupils before
 they had had the EF’s applied. After each group had had
 an EF applied, the pupils were given a 60-question test on
 the content of the lesson presented. The scores made by
 each group as a result of each EF are given in Table 35.

TABLE 35

DISTRIBUTION OF SCORES MADE BY 300 SELECTED 7A-GRADE PUPILS IN EACH OF THE
THREE 60-QUESTION TESTS WHICH FOLLOWED LESSONS ON INDIA, CHINA, AND JAPAN,
RESPECTIVELY (ADAPTED FROM WEBER)

[Frequency rows by T score are not legible in this copy.]

              India                      China                      Japan
          L-R     F-L     L-F        L-R     F-L     L-F        L-R     F-L     L-F
Group      A       B       C          C       A       B          B       C       A
M        48.32   52.10   47.38      45.18   51.59   51.84      44.45   51.82   50.42
SD        8.68   10.29    8.93       9.19    7.80   10.20      10.12    7.42    8.21
SDM       .868   1.029    .893       .919    .780   1.020      1.012    .742    .821

SUMMARY: SUM OF M’s
Film-Lecture   SDS     Lecture-Film   SDS     Lecture-Review   SDS       D      SDD     EC
   155.51     1.489       149.64     1.585                              5.87   2.175    .97
                          149.64     1.585        137.95      1.619    11.69   2.265   1.86
   155.51     1.489                               137.95      1.619    17.56   2.200   2.87

SUMMARY: MEAN OF M’s
Film-Lecture   SDS     Lecture-Film   SDS     Lecture-Review   SDS       D      SDD     EC
    51.84      .496        49.88      .528                              1.96    .725    .97
                           49.88      .528         45.98       .540     3.90    .755   1.86
    51.84      .496                                45.98       .540     5.86    .733   2.87
 
 Heretofore, each pupil’s score has been tabulated sepa- 
 rately. Such tabulations become unwieldy when many pupils 
 are used. The conventional economical substitute for indi- 
 vidual tabulation is the frequency distribution, samples of 
 which appear in Table 35. Such frequency distributions, 
 though not absolutely necessary, do permit the employment 
 of various statistical short-cuts. An illustrative reading of 
 Table 35 will make clear the meaning of the frequency dis- 
 tributions. Table 35 is read thus. After a lesson on India, 
 presented by means of a lecture followed by a review quiz, 
 i.e., L-R, a test on India was given to Group A. One pupil 
 made a score of 29, one pupil made a score of 31, four pupils 
 made a score of 33 and so on. After the same lesson on 
 India, presented by means of F-L, the same test on India 
 was given to Group B. Two pupils made a score of 24, three 
 pupils made a score of 31, and so on. In like manner, all 
 six frequency distributions, shown in Table 35, may be read. 
 
 If he so desires, the experimenter can make a frequency 
 distribution of the C1’s, and of the C2’s, etc., in each of the 
 computation models, and can use this as a basis for com- 
 puting M, SD, and SDM by short-cut statistical processes. 
 But there is one thing the experimenter cannot do. He can- 
 not make a frequency distribution of IT’s, and another fre- 
 quency distribution of FT’s, and hope from these to obtain 
 directly a frequency distribution of C’s or even to obtain C’s 
 at all. C’s can be obtained only from individual tabulations. 
 After individual C’s have been so obtained a frequency dis- 
 tribution of them can be made. 
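
Both points of the preceding paragraph can be illustrated with a short sketch. The scores below are hypothetical, the function name is invented, and SD is taken about the mean with N in the divisor and SDM as SD divided by the square root of N, which is an assumption about the conventions intended here.

```python
import math
from collections import Counter

def stats_from_frequencies(freq):
    """M, SD and SDM from a {score: number_of_pupils} frequency distribution."""
    n = sum(freq.values())
    m = sum(score * f for score, f in freq.items()) / n
    sd = math.sqrt(sum(f * (score - m) ** 2 for score, f in freq.items()) / n)
    return m, sd, sd / math.sqrt(n)

# Hypothetical IT scores, and two ways the same FT scores could belong to the pupils.
it = [30, 40, 45, 50]
ft_a = [34, 42, 50, 53]
ft_b = [53, 50, 42, 34]          # identical FT frequency distribution

changes_a = [f - i for i, f in zip(it, ft_a)]
changes_b = [f - i for i, f in zip(it, ft_b)]

# The IT and FT frequency distributions are the same in both cases ...
assert Counter(ft_a) == Counter(ft_b)

# ... yet the C's, and hence their SD and SDM, are quite different.
m_a, sd_a, sdm_a = stats_from_frequencies(Counter(changes_a))
m_b, sd_b, sdm_b = stats_from_frequencies(Counter(changes_b))
print(sorted(changes_a), round(m_a, 2), round(sd_a, 2), round(sdm_a, 2))
print(sorted(changes_b), round(m_b, 2), round(sd_b, 2), round(sdm_b, 2))
```

The mean change is the same either way, since it is merely the difference of the two means; it is the variability of the C's, on which SDM and EC depend, that cannot be recovered without the individual tabulation.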
 
 The Summary for Table 35 is given in two forms. The
 first part is in terms of the sum of the three M’s for each EF.
 It is the form with which the reader is already familiar. The 
 second part is in terms of the mean of the three M’s for 
 each EF, i.e., the sum of the three M’s divided by three. 
 The mean of the M’s has the advantage over the sum of 
 the M’s in that the mean of the M’s is comparable with any 
 of the original M’s from which it comes, and with any 
 original M for any EF. But if the sum of the three M’s 
 is divided by three, the experimenter must be careful to 
 divide each SDS by three also. If this is not done the final 
 EC will be just one-third the size to which it is entitled. 
 As Table 35 shows, the second part of the Summary is one- 
 third the first part except for the EC which is the same. 
 And this is as it should be, for the D from the sum of M’s 
 is neither more nor less reliable than the D from the mean 
 of the M’s. 
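
A short check makes the point concrete. This is a sketch under stated assumptions: the function name is invented, and the four figures are the Film-Lecture and Lecture-Film entries as read from the Summary of Table 35.

```python
import math

def d_sdd_ec(m_a, sds_a, m_b, sds_b):
    """Return D, SDD and EC for two EF's, with EC = D / (2.78 * SDD)."""
    d = m_a - m_b
    sdd = math.sqrt(sds_a ** 2 + sds_b ** 2)
    return d, sdd, d / (2.78 * sdd)

# Sum of the three M's and its SDS for two EF's (read from Table 35's Summary).
fl_sum, fl_sds = 155.51, 1.489     # Film-Lecture
lf_sum, lf_sds = 149.64, 1.585     # Lecture-Film

d_sum, sdd_sum, ec_sum = d_sdd_ec(fl_sum, fl_sds, lf_sum, lf_sds)

# Correct use of means: divide both the M's and the SDS's by three.
d_mean, sdd_mean, ec_mean = d_sdd_ec(fl_sum / 3, fl_sds / 3, lf_sum / 3, lf_sds / 3)

# Mistake warned against: divide the M's by three but leave the SDS's alone.
_, _, ec_wrong = d_sdd_ec(fl_sum / 3, fl_sds, lf_sum / 3, lf_sds)

print(round(ec_sum, 2), round(ec_mean, 2), round(ec_wrong, 2))
```

Dividing D and SDD by the same constant leaves their ratio untouched, which is why the EC is identical in the two forms of the Summary.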
 
 But the unique feature of Weber’s experimental computa- 
 tions is not so much his use of frequency distributions, or 
 his use of means instead of sums. The unique feature is 
 his use of T scores or scale scores instead of the original
 number of questions correct. His use of T scores makes all 
 three tests and the scores from them comparable. To begin 
 with, the test on India may have been the most difficult, 
 and the one on Japan of medium difficulty. After the process 
 of scaling has been completed, these differences in difficulty 
 have been ironed out so that every score, irrespective of 
 the test, is comparable with every other score and every M 
 is comparable with every other M. This makes it profitable 
 to use the mean of the M’s instead of the sum of the M’s 
 in the Summary. Finally, the T scores make the D’s and 
 the EC’s more exact. 
 
 The procedure by which each test was scaled is shown in
 Table 36, which is identical with the India portion of Table
 35 except that 499 pupils instead of 300 pupils are used,
 that the T scores are shown in the last column instead of
 the first, and that three additional columns essential to the
 computation of T scores are added. The first column is the
 number of questions, out of 60 questions on India, answered
 correctly by the indicated number of pupils in each of Group
 A, Group B and Group C. The fifth column is the total
 number of pupils in all three groups answering the number
 of questions shown in the first column. The numbers of
 questions shown in this first column are grouped two
 together instead of each question separately as is usually
 done when scaling. This grouping is not necessary. It
 is, in fact, of doubtful desirability. Its virtue is that it
 saves labor. The sixth column gives the per cent exceeding
 plus half those reaching each number of questions correct.
 This per cent is based on the fifth column. How to compute
 these per cents and transmute them into T scores,
 shown in the last column, is described in Chapter V. Once
 these T scores are known, the first, fifth, and sixth columns
 may be eliminated as no longer useful, and the T scores may
 be moved to the extreme left, thus making a table similar
 to the India portion of Table 35. In like manner, the original
 number of questions correct on the test on China, and
 then the number of questions correct on the test on Japan,
 can be transmuted into T scores. Since all the pupils in
 all three groups are used in each of these three test scalings,
 all scale values, i.e., T scores, are thus made comparable.

TABLE 36

DISTRIBUTION OF SCORES MADE BY 499 7A-GRADE PUPILS IN A 60-QUESTION TEST
WHICH FOLLOWED A LESSON ON INDIA. ORIGINAL STEPS CONVERTED
INTO T-SCALE UNITS (AFTER WEBER)

                                                 Per Cent Exceeding
           Group A   Group B   Group C           Plus Half Those       T
 Score       L-R       F-L       L-F     Total   Reaching              Score
   0          2         2         1         5        99.50              24
  1- 2        1         0         1         2        98.80              27
  3- 4        1         1         2         4        98.20              29
  5- 6        1         4         1         6        97.19              31
  7- 8        4         6         5        15        95.09              33
  9-10        3         5         4        12        92.38              36
 11-12        8         2        11        21        89.08              38
 13-14        5         3         9        17        85.27              40
 15-16        7         9        10        26        80.96              41
 17-18       14         8        12        34        74.95              43
 19-20       17         9        13        39        67.64              45
 21-22        5        11        14        30        60.72              47
 23-24       13         9        20        42        53.51              49
 25-26       11        19         6        36        45.69              51
 27-28       17        13        13        43        37.78              53
 29-30        8        14        14        36        29.86              55
 31-32       16        15        10        41        22.14              58
 33-34       12         8         7        27        15.33              60
 35-36        9         9         5        23        10.32              63
 37-38        4         1         3         8         7.21              65
 39-40        2         8         2        12         5.21              67
 41-42        2         4         2         8         3.21              69
 43-44        1         4         2         7         1.70              71
 45-46                  1         1         2          .80              74
 47-48                  1         1         2          .40              77
 49-50                  1                   1          .10              81
 Total      163       167       169       499
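
The sixth and last columns of Table 36 can be approximated as follows. This is a sketch, not the Chapter V procedure itself: it assumes the T scale has a mean of 50 and a standard deviation of 10 on a normal curve, it substitutes the inverse normal function from Python's standard library for a printed conversion table, and the function name is invented. The frequencies are the totals of the fifth column of Table 36.

```python
from statistics import NormalDist

def t_scale(frequencies):
    """For each step (lowest score first) return the per cent exceeding
    plus half those reaching it, and the corresponding T score."""
    total = sum(frequencies)
    below = 0
    rows = []
    for f in frequencies:
        exceeding = total - below - f
        pct = 100 * (exceeding + f / 2) / total
        # A high per cent exceeding means an easy step, hence a low T score.
        z = NormalDist().inv_cdf(1 - pct / 100)
        rows.append((round(pct, 2), round(50 + 10 * z)))
        below += f
    return rows

# Total column of Table 36 (499 pupils), from score 0 up to scores 49-50.
totals = [5, 2, 4, 6, 15, 12, 21, 17, 26, 34, 39, 30, 42,
          36, 43, 36, 41, 27, 23, 8, 12, 8, 7, 2, 2, 1]

rows = t_scale(totals)
for pct, t in rows[:5]:
    print(pct, t)
```

Values obtained this way may differ by a point here and there from those printed in the table, since a tabled conversion and its rounding need not agree exactly with the computed normal deviate.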
 
 The possibility of scaling experimental tests on the basis 
 of the performance of experimental pupils is not limited to 
 rotation experiments employing three groups and FT’s only. 
 It is possible for any rotation experiment with any number 
 of groups and with or without IT’s. It is equally possible 
 for any one-group or equivalent-groups experiment. In all 
 these cases the scaling may be based upon IT, FT, or C 
 records. The C records are best to use, the FT records are 
 next best. When C records are used the experimenter can 
 be absolutely certain of getting a T score for every need. 
 If IT’s are used, there is a possibility that no pupil at the 
 beginning of the experiment will make as high a record as 
 will be made by some pupil on the FT. This means that 
 extremely high scores on the FT may have to go unscaled. 
 If the scaling is based upon FT scores, there is a possibility 
 that extremely low scores on the IT cannot be scaled. No 
 difficulty need be anticipated if C records are scaled. Chap- 
 ter V shows how both IT and FT may be used to widen the 
 range of the scale so as to include the highest and lowest 
 scores.
 
 But no matter which of the three records is scaled, it is 
 highly important that the scores of every experimental group 
 taking the test be utilized in scaling that test. This does
 not mean that every pupil involved in the experiment has
 to be used. It is required only that those utilized in experi- 
 mental computations be included. Weber scaled his tests 
 on 499 pupils. In his experimental computations he used 
 only 300 of these 499 pupils. It would have been just as 
 satisfactory to have scaled his tests on the 300 finally 
 selected as the basis for his experimental computations. It 
 would not have been quite so satisfactory if, say, Group C 
 were omitted in the scaling. 
 
 Under certain conditions it is permissible to compute 
 51.84 in the Summary of Table 35, by a less laborious pro- 
 cedure. The data which yields the three M’s from which 
 51.84 is derived, may be lumped together so that only one 
 M and one SDM is computed for all of it. In this case, the 
 final M for each of the other two EF’s should be computed 
 in the same way. The conditions required to make the 
 above modification permissible are (a) an equal number of 
 pupils in each group, (b) a uniform test for each group, or 
 else the tests to be scaled upon the experimental groups so 
 as to eliminate inequalities in difficulty and consequent 
 unduly-increased variability and unreliability, and (c) ap- 
 proximate equivalence of ability for the groups so com- 
 bined. 
 
 Special Computation Difficulties.—Since the rotation 
 method is a combination of several one-group methods or 
 several equivalent-groups methods, it is appropriate that this 
 chapter should close with a consideration of special types 
 of statistical computations required for special situations. 
 
 These special difficulties are caused not so much by pecu- 
 liar variations in experimental method as in variation in 
 methods of measuring changes. There are, for example, 
 the following common ways of measuring changes produced 
 in pupils by an EF: 
 
 1. Total points change on test made by each pupil.

 2. Per cent of total possible gain on each test made by
 each pupil.

 3. Time required for each pupil to attain a defined score
 on a test.

 4. Per cent of pupils in each group attaining a perfect
 score or any defined score on a test.

 5. Per cent of pupils in each group making any gain on
 test.

 6. Per cent of pupils in one group whose change exceeds
 the mean change of the other group.
 
 Measuring-method 1 is the most commonly used and 
 should be. Except in very special instances, measuring- 
 methods 2, 3, 4, 5, and 6 should be used merely as supple- 
 mentary to the first method; they yield certain additional 
 information which, on occasion, is valuable. For example, 
 it may be useful to know whether the superiority of a par- 
 ticular EF is due to the large gains of a relatively few pupils 
 only, or whether every pupil has contributed to the superior- 
 ity. Measuring-method 4 tells whether the gains are well- 
 distributed. All the computation models assume measuring- 
 method 1. The experimenter is advised to avoid subsequent 
 statistical difficulty by planning for this method. 
 
 Measuring-methods 1, 2, and 3 yield a score and C for 
 each pupil, thereby permitting the computation of an M and 
 a SDM and ultimately a D, SDD and EC. Measuring- 
 methods 4, 5 and 6 yield a score for the group only, thereby 
 making it difficult, if not impossible, to compute measures 
 of reliability. Since each experimenter is obligated to report 
 the reliability of his conclusions, he should make sure that 
 the measuring-method which he plans to employ will yield 
 a measure of reliability at the end. 
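
The contrast between the two families of measuring-methods can be seen in a small sketch. The changes below are hypothetical and the variable names are invented; the example shows only that method 1 leaves a distribution of individual C's from which an SDM can be computed, whereas methods 5 and 6 leave one figure per group.

```python
from statistics import mean

# Hypothetical changes (C) for the pupils of two groups.
group_a = [4, 2, 5, 3]      # under EF1
group_b = [3, 0, 3, -2]     # under EF2

# Method 1: a change for every pupil, summarized here by the mean of each group.
method_1 = (mean(group_a), mean(group_b))

# Method 5: per cent of pupils in each group making any gain.
method_5 = tuple(100 * sum(c > 0 for c in g) / len(g) for g in (group_a, group_b))

# Method 6: per cent of Group A pupils whose change exceeds Group B's mean change.
method_6 = 100 * sum(c > mean(group_b) for c in group_a) / len(group_a)

print(method_1, method_5, method_6)
```

Each of the last two results is a single number for the group, with no individual scores left from which to estimate its reliability.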
 
CHAPTER IX 
 CAUSAL INVESTIGATIONS 
 
 Methodology of Causal Investigations.—When Darwin
 visited South America, he was surprised to discover an
 outbreak of yellow fever high up in the Andes Mountains. 
 Since he was a born scientist, he began immediately to specu- 
 late and observe to see if he could discover the cause for 
 such an unusual phenomenon. Doubtless he asked himself 
 these two questions: In what respect is this situation dif- 
 ferent from places which are immune from yellow fever? In 
 what respect is this situation like places which are subject 
 to yellow fever? Darwin showed his genius by almost dis- 
 covering the cause of yellow fever. He observed something 
 about the place which was very unusual for high altitudes 
 where yellow fever is unusual, and very much like lowlands 
 where yellow fever is more common,—pools of stagnant 
 water. He therefore suggested the hypothesis that this stag- 
 nant water was responsible for the yellow fever. He was 
 right so far as he went. It was not until long afterward that 
 this investigation was pushed far enough to make it appear 
 highly probable that stagnant water produced the mosquito, 
 which, in turn, caused yellow fever to spread. 
 
 Metchnikoff observed that the Bulgarians were an
 unusually long-lived people. Metchnikoff wished to know 
 why. Doubtless he, too, asked himself these questions: In 
 what respect are the Bulgarians like other peoples who live 
 long? In what respect are they different from other peoples, 
 i.e., what force operates upon the Bulgarians which does not
 operate upon other races? Like Darwin, he proceeded to 
 observe for differences. He concluded that the most striking 
 difference was the extent to which the Bulgarian people drink
 buttermilk. He therefore concluded that the drinking of
 buttermilk was responsible for the long life of the Bul- 
 garian, and that a similar practice on the part of other races 
 would lead to an equally long life. He went beyond Darwin 
 and buttressed his hypothesis by showing that certain organ- 
 isms present in buttermilk are specially beneficial to the 
 action of the alimentary canal. 
 
 Reavis’s recent work¹ is an admirable illustration of a
 causal investigation in the field of education. He set out to
 locate the causes for attendance and non-attendance in
 school. From incidental observation and logical deduction,
 he had arrived at not one but a number of hypotheses as 
 to what factors influenced attendance. He proceeded to 
 collect a large amount of data with a view to testing the 
 truth of his various hypotheses. 
 
 These illustrations of causal investigations, together with 
 many others which will occur to the reader, indicate some 
 interesting inferences. One inference is that different causal
 investigations differ in their starting point and ending point. 
 Darwin’s causal investigation began with a problem and 
 ended with the formulation of a crude hypothesis. The pre- 
 eminent function of causal investigations is to yield sugges- 
 tive hypotheses to be tested by further logical deduction, 
 observations or experimentation. Because of the great value 
 of fruitful hypotheses, causal investigation has constituted 
 the fundamental method of discovery from the beginning of 
 time. Metchnikoff’s causal investigation began with a prob- 
 lem which not only led to the formulation of a hypothesis, 
 but also to the collection of certain subsidiary evidence to 
 show that the hypothesis was not an unreasonable one. But 
 Metchnikoff went no further. Reavis did not conduct an 
 investigation to secure useful hypotheses. Probable causes 
 were more evident. He started his causal investigation well 
 supplied with fruitful hypotheses. But what is more important,
 he carried the investigation very much further than
 was done in the other instances. He carried it far enough
 practically to prove or disprove his various hypotheses.

 ¹ Reavis, George H., Factors Controlling Attendance in Rural Schools, Teachers
 College, Columbia University, 1922.
 
 A second inference from these samples is that the con- 
 clusions yielded by causal investigations are usually less 
 convincing than those yielded by experimentation. Conclu- 
 sions from causal investigations are seldom more than strong 
 hypotheses, which await confirmation by experimentation. 
 This need for confirmation varies with the nature of the 
 investigation and the adequacy of the data which are assembled
 or which it is possible to assemble. Experimentation carries
 greater weight than causal investigations, because an experi- 
 menter can control conditions much better than the investi- 
 gator. The investigator is compelled to accept conditions 
 as they are presented, complicated, as they usually are, by 
 all sorts of irrelevant factors, and providing, as they fre- 
 quently do, insufficient data upon which to base conclusions. 
 
 Darwin’s conclusion concerning the cause of yellow fever 
 was only a good guess, at best. It was a very slender hypo- 
 thesis. He could have greatly strengthened his hypothesis 
 by making a systematic series of observations or collection 
 of data. He could have strengthened it still more by evolv- 
 ing a hypothesis as to the exact mechanism whereby stag- 
 nant water causes yellow fever, and then by conducting an 
 equivalent-groups experiment to test this hypothesis. All 
 are familiar with the famous equivalent-groups experiment, 
 finally conducted, in which a group of healthy men offered 
 their lives to prove conclusively that yellow fever is trans- 
 mitted by a certain variety of mosquito which thrives only 
 where stagnant water is found. 
 
 Metchnikoff’s conclusion as to the efficacy of buttermilk 
 was and remains a hypothesis only, and will continue to re- 
 main so until it is tested experimentally. It is doubtful if it 
 can be tested conclusively by means of a causal investigation 
 because nature apparently does not present the proper con- 
 ditions. 
 
 The nature of Reavis’s research makes it more feasible
 as a causal investigation. By the selection of a relatively
 narrow problem, by the collection of many data readily
 available, by the utilization of recently-developed statistical 
 techniques, and by the exercise of no little ingenuity, he was 
 able to isolate fairly well the factors whose influence he 
 desired to study. 
 
 A third inference is that the methodology of causal investi- 
 gations is the methodology of equivalent-groups experimen- 
 tation. A causal investigation is merely an equivalent-groups 
 experiment conducted backward. The criteria for a valid 
 equivalent-groups experiment are the criteria for a valid 
 causal investigation. To the extent that a causal investiga- 
 tion would be invalid if reversed and conducted forward as 
 an equivalent-groups experiment, just to that extent it is 
 invalid as a causal investigation. A perspective of a correct 
 plan for a causal investigation, viewed from its starting 
 point, is identical with a perspective of an equivalent-groups 
 experimental plan, for the solution of the same problem, 
 viewed from the ending point. If these perspectives are not
 identical, there is a crudity in one of the plans, and the 
 crudity will usually be found in the plan for the causal 
 investigation. An important corollary of the foregoing is 
 that he who has mastered the technique of experimentation 
 is already equipped for causal investigation. Only a few 
 additional techniques need be described. 
 
 In illustration of the foregoing statement that the same 
 criteria hold for both causal investigations and equivalent- 
 groups experimentation, it will suffice to show how these
 criteria apply to Metchnikoff’s causal investigation. To 
 satisfy these criteria, Metchnikoff would have to show that, 
 except for much buttermilk drinking and its reputed good 
 effects, Bulgarians are by nature and environment equiva- 
 lent to other races. This he has not shown. Consequently, 
 critics of his hypothesis have some justification in attributing 
 the long life of the Bulgarians to certain other factors in 
 which the Bulgarians possibly differ from other races. The
 true cause may be, for example, the operation of a
 more rigorous environment than has been operating upon
 other races. The effect of such selective agency would be
 to make the present Bulgarian people a very hardy stock. 
 Combine this possible fact with the assumption that there 
 has been a rapid amelioration of environmental conditions 
 during the last few hundred years, and we have an explana- 
 tion for Bulgarian longevity totally unconnected with but- 
 termilk. Or, again, it may be that the original ancestors 
 of the Bulgarians possessed and transmitted through hered- 
 ity a tendency toward longevity, just as they doubtless 
 possessed and transmitted the physical traits which dis- 
 tinguish them from other races today. Or, finally, their 
 greater longevity may be due to the cooperative contribution 
 of several of these factors rather than to any one of them. 
 All this shows why causal investigations which fail to satisfy 
 perfectly the equivalent-groups experimental criteria yield 
 conclusions which are suggestive hypotheses only. Their 
 validity is no greater and no less than that of the conclusions 
 yielded by an equivalent-groups experiment which fails to 
 satisfy its own criteria to an equal extent. 
 
 Essential Procedure of Simple Causal Investigations. 
 —Causal investigations may be prosecuted in either of two 
 ways. Perhaps the most common, and certainly the most
 simple and elementary, way is the all-or-none procedure. In
 an all-or-none investigation, the effect, whose cause is sought, 
 is either totally present or totally absent, or else the investi- 
 gator arbitrarily ignores any gradations in between, or else 
 he defines a certain minimum amount of the effect, any 
 amounts in excess of which will be considered to constitute 
 its presence, and any amounts less than which will be con- 
 sidered to constitute its absence. 
 
 The preceding discussion of this chapter has made it clear 
 that for this variety of causal investigations the essential 
 steps are as follows: 
 
 1. The investigator searches until he finds objects, indi- 
 viduals, communities or situations which are alike in that 
 they all show a particular effect whose cause is sought. 
 
 2. He inspects these situations to see whether they have
 anything else in common which might possibly be the cause
 of the observed effect. If he finds such a common cause, 
 he formulates the hypothesis that this is the probable cause 
 of the effect. 
 
 3. He continues his collection of cases to discover 
 whether the hypothetical cause is always and without excep- 
 tion present when the effect is present. 
 
 4. He collects cases which are alike except for the pres- 
 ence of the effect in some of the cases and its absence in 
 others. 
 
 5. He observes to see whether the hypothetical cause is 
 present in those cases which show the effect, and absent in 
 those cases which do not show it. 
 
 6. He continues the collection of such instances to dis- 
 cover whether inexplicable exceptions occur. 
 
 7. If in either half of the foregoing process inexplicable 
 exceptions occur, the investigator attempts to find a new 
 and more promising hypothesis as to the cause of the effect. 
 If he is successful in this he starts through the above process 
 again. If he is not successful the causal investigation ends 
 unsuccessfully. 
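
 The search for exceptions in the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cases, their labels, the measured amounts, and the minimum amount of 5.0 are all hypothetical, and an amount in excess of the minimum is taken to constitute presence of the effect.

```python
from dataclasses import dataclass


@dataclass
class Case:
    label: str             # hypothetical identifier for the case
    effect_amount: float   # measured amount of the effect
    cause_present: bool    # whether the hypothetical cause was observed


def find_exceptions(cases, minimum_effect):
    """List the cases in which effect and hypothetical cause fail to go together."""
    exceptions = []
    for case in cases:
        # All-or-none rule: amounts in excess of the minimum count as presence.
        effect_present = case.effect_amount > minimum_effect
        if effect_present != case.cause_present:
            exceptions.append(case.label)
    return exceptions


# Hypothetical cases, invented for illustration only.
cases = [
    Case("A", 9.0, True),
    Case("B", 7.5, True),
    Case("C", 1.0, False),
    Case("D", 8.0, False),  # effect present without the hypothetical cause
]
exceptions = find_exceptions(cases, minimum_effect=5.0)
print(exceptions)
```

 An empty list would leave the hypothesis standing for the cases so far collected. Any label returned marks a case which must be explained, or which sends the investigator back to seek a new hypothesis.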
 
 Essential Procedure of a Complex Causal Investiga- 
 tion. a. Formulation of Hypotheses.—Causal investiga- 
 tions of a complex variety do not treat the effect merely as 
 present or absent, but recognize and take account of grada- 
 tions of effect and gradations of cause. Here the investi- 
 gator determines not only whether the presence of the effect 
 is accompanied by the presence of the hypothetical cause, 
 but also whether increase in the amount of the cause is 
 accompanied by a corresponding increase in the amount of 
 the effect. Furthermore, the investigator may attempt to 
 discover whether the effect is produced by one or more 
 causes, and if produced by several causes he may attempt 
 to determine just how much of the effect each cause con- 
 tributes. 
 
 Reavis’s investigation is an illustration of one which took 
 account of gradations in cause and effect, which found that
 the effect was produced by several cooperating causes, and
 which determined the exact amount of independent contribu- 
 tion of each cause to the effect. A summary of his pro- 
 cedure is given below. The reader is referred to his disserta- 
 tion for details. 
 
 From incidental observation and logical deduction, he 
 formulated numerous hypotheses as to the more probable 
 causes or factors influencing the attendance of rural-school 
 elementary pupils. Some of these factors related to the 
 pupil, some to the school and teacher, and some to the com- 
 munity. Sample questions relating to the pupil were: Does 
 age, sex, distance from school, quality of roads from home 
 to school, distance transported, age-grade position, or quality 
 of school influence a pupil’s attendance record? Sample 
 questions relating to teacher and school were: Does the 
 teacher’s salary, or amount of training, or the school’s mod- 
 ernness of equipment, playground space, or the like influence 
 a pupil’s attendance? Sample questions relating to the com- 
 munity were: Does the community’s wealth, intellectual 
 level, or interest in education influence a pupil’s school 
 attendance? 
 
 b. Collection of Data.—The collection of data is a prob- 
 lem in measurement. The general principles to guide such 
 measurements were given in Chapter V. These principles 
 hold whether the investigator personally makes his own 
 measurements, or secures them from others by means of a 
 questionnaire. The principles apply whether the measure- 
 ments made be tests of mental traits, tests of school build- 
 ings, collection of school records, or the introspections or 
 judgments of judges. 
 
 The following questions¹ will guide the investigator in the
 evaluation and preparation of a questionnaire. Are the
 questions as factual as possible? Do they involve a mini-
 mum of judgment and memory? Are the questions as spe-
 cific as possible? Will the data secured lend themselves to
 tabulation and statistical treatment? Are the questions
 unambiguous? Will all terms used have the same meaning
 to all reporters? Will the questions evoke replies which
 will be unambiguous to the investigator? Is the informa-
 tion called for difficult to obtain? Can the data called for
 be obtained more accurately otherwise? Do the questions
 cover all the data needed for subsequent computations?
 Can the questions be answered by a check, number, Yes,
 No, or brief phrase? Are the questions arranged so that
 none will be overlooked? Is the space sufficient for each
 answer? Are the questions worded and arranged to facili-
 tate tabulation and fit the tabulation form to be used? Will
 the data called for by the questions answer the specific and
 previously worded objects of the investigation? Are the
 questions formulated in the light of a bibliographical survey?
 Is the amount of time required to answer questions so
 excessive as to induce careless responses, omission of items,
 or few replies? Are the questions worded in the light of
 one or more preliminary trials with representative samplings
 of the individuals for whom the questions are designed? Are
 the nature and number of questions such as to secure replies
 from representative individuals and from a sufficient num-
 ber to satisfy the statistical criteria of reliability?

 ¹ See Rugg, Harold O., Application of Statistical Methods to Education, pp. 39-55;
 Houghton Mifflin Company, New York, 1917.
 
 A common form of questionnaire is one which aims to 
 measure the degree of preference for this or that. Thus 
 Lowe sent a questionnaire which gave a comprehensive list 
 of the activities of clergymen. He desired to know how 
 each clergyman evaluated each activity. Several methods 
 have been proposed for meeting just such a situation, i.e.,
 for measuring opinions. 
 
 One method, the rank method, is to ask that the activity 
 which is deemed most important be ranked 1, the one deemed 
 next most important be ranked 2, and so on for the number 
 of activities listed. This method is fairly satisfactory in 
 most cases. It is very time-consuming if the number of 
 items is large. It yields relative evaluations only; it does 
 not show what activities are deemed of no value whatever.
 It does not show which activities are judged to be of equal
 value, but forces the reporter to make a choice. This forc- 
 ing does no harm so far as group results go, but it may do 
 violence to one individual’s opinion. Finally, the rank 
 method forces the reporter to make the same difference be- 
 tween all adjoining activities, namely, a difference of one. 
 
 A second method is the distribution method. Here the 
 reporter is asked to distribute, say, 100 points among the 
 listed activities, thus showing the importance of each activity 
 by the number of points assigned to it. This method per- 
 mits the reporter to indicate just what activities are of no 
 merit, but does not allow him to indicate negative values. 
 It permits the reporter to attach the same value to more 
 than one activity, and to indicate varying differences be- 
 tween activities. It is more time-consuming, however, than 
 the rank method, unless the activities are grouped into head- 
 ings and sub-headings. If they can be so grouped, the re- 
 porter can be asked to distribute his 100 points among the 
 main headings, and, after this is done, to distribute the total 
 points assigned to each heading among its sub-items. Some- 
 times, however, activities do not fall into convenient group- 
 ings which are mutually exclusive as to items and sub-items 
 or where the sub-items completely exhaust their heading. 
 Theoretically, the distribution method requires both such 
 exclusiveness and exhaustion. Finally, the distribution 
 method tends to make the number of points assigned to each 
 activity incomparable from one reporter to another. One 
 clergyman may hold half the activities listed to be of no 
 value; nevertheless he must use up his 100 points. Another 
 clergyman who assigns some points to every activity will be 
 compelled to assign fewer points to an activity which he may 
 evaluate just the same as the previously mentioned indi- 
 vidual. 
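
 The incomparability just described can be shown with a small sketch. The activities and the private valuations below are hypothetical, and the assumption that each reporter simply scales his private valuations to use up the 100 points is a simplification made for illustration, not a description of how any clergyman actually answered.

```python
def distribute_points(valuations, total=100):
    """Scale one reporter's valuations so that they use up exactly `total` points."""
    scale = total / sum(valuations.values())
    return {activity: value * scale for activity, value in valuations.items()}


# Hypothetical private valuations: both reporters value preaching equally (4),
# but the first holds half the listed activities to be of no value.
first = {"preaching": 4, "visiting": 4, "bookkeeping": 0, "janitor work": 0}
second = {"preaching": 4, "visiting": 4, "bookkeeping": 4, "janitor work": 4}

points_first = distribute_points(first)
points_second = distribute_points(second)
print(points_first["preaching"], points_second["preaching"])
```

 Under these assumptions the same private valuation of preaching is reported as twice as many points by the first reporter as by the second, solely because each must exhaust his 100 points.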
 
 A third method is the relative-to-the-items scale method. 
 Here the reporter is asked to rate the activity considered 
 least important as 1, the activity considered most important 
 as 20, or 10, or 5, and to assign a value anywhere from 1 to
 20 inclusive to the other activities, assigning the same value
 more than once if desired. This method has all the virtues 
 previously mentioned as desirable, except that of permitting 
 a report as to just what activities are judged of no worth 
 or negative worth or whether any activities are of greater 
 worth.
 
 A fourth method is the absolute-worth-occupational scale. 
 Here the clergyman is asked to rate any activity equal in 
 value to the most desirable activity in which a clergyman 
 can engage as worth, say, 19 points; to rate any activity 
 zero, which is of just no professional significance; to rate 
 any activity minus 19 which is equal in professional destruc- 
 tiveness to the worst occupational activity in which a clergy- 
 man can engage; and to rate all other activities according 
 to this absolute occupational scale. Thus, mending shoes 
 is above zero in social value, but is probably below zero on 
 a clergyman’s occupational scale. The chief objection to 
 this scale is the great likelihood that the reporter will be 
 unable to avoid confusing this fourth scale with the fifth to 
 be described. 
 
 The fifth method is the absolute-worth-social scale. Here 
 the reporter is asked to construct or think a scale ranging 
 from minus 19 through 0 to plus 19, where minus 19 means
 the worst imaginable human act such as an able-bodied man 
 murdering his defenseless, gifted child to avoid working for 
 its support, where plus 19 means the best conceivable human 
 act, and then to rate the listed activities according to this 
 scale. This scale yields the fullest information of any of 
 the five methods described. Whether it is more or less 
 reliable than the others is not surely known. 
 
 Reavis employed the questionnaire procedure for collect- 
 ing the data used in his investigation. Fortunately, he was 
 in a position of authority where he could secure unusually 
 accurate and adequate returns. He eliminated from con- 
 sideration all transient pupils whose attendance could not 
 possibly be perfect due to the fact that they were not in 
 one district throughout the school year. Then he secured a
 measure of the amount of attendance of each of 5314 pupils
 in 200 country schools in five counties in Maryland. At the 
 same time he determined the amount of presence of each of 
 a large number of hypothetical factors, such as the pupil’s 
 distance from school, the quality of his work at school, the 
 sort of teacher who taught him, the character of the school 
 building and equipment which surrounded him, and the 
 character of the community in which he lived.
 
 Much ingenuity was shown in making these determina- 
 tions, and in securing a comparable quantitative expression 
 for the amount of presence of each factor. To illustrate 
 with only one of the difficulties encountered—consider his 
 method for securing comparable measures of the distance a 
 pupil lives from the school. A pupil who lives a mile from 
 the school and in order to reach it must walk all the way 
 along an unimproved clay dirt road, really lives farther 
 away than another pupil a mile from the school who walks 
 half the way on an unimproved clay dirt road and half the 
 way on a macadam state road. 
 
 To equate these two conditions, Reavis reduced the dis- 
 tance for pupils travelling over state roads so as to make 
 state-road distances equal unimproved-road distances. He
 made various guesses as to the proper subtraction and 
 checked up each guess by computing the coefficient of corre- 
 lation between attendance of all pupils and the distance 
 score for each pupil corrected by his guess. With each 
 improvement in his guess, the coefficient of correlation 
 should go up, due to the fact that errors in measurement 
 reduce the coefficient of correlation toward zero. The corre- 
 lation between uncorrected distances and attendance was 
 .38. A perfect correlation would be 1.0, and no correlation 
 would be zero. Calling each mile of state road equivalent 
 to one-half mile of unimproved road and correcting accord- 
 ingly yielded a coefficient of correlation between corrected 
 distance and attendance of .43. Counting each mile of state 
 road as equal to three-fourths of a mile of unimproved road 
 and correcting accordingly raised the correlation to .54.
 A guess on either side of the last weighting yielded correla-
 tions of .48 and .51, showing that the best basis for correction
 was to call one mile of state road equal to three-fourths of 
 a mile of unimproved road. 
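
 The guess-and-check device described above may be sketched as follows. The pupils, their road mileages, and the candidate weights are hypothetical. The attendance figures are deliberately constructed so that a state-road mile counts as three-fourths of an unimproved mile, in order that the search has a known answer to recover. They are not Reavis's data. Since distance and attendance are negatively related, the guesses are compared by the size of r without regard to sign.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment coefficient of correlation computed from the exact means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical pupils: (miles of unimproved road, miles of state road).
roads = [(1.0, 0.0), (0.0, 2.0), (0.5, 1.0), (2.0, 0.0),
         (0.0, 1.0), (1.5, 2.0), (1.0, 1.0), (0.5, 3.0)]

# Illustrative attendance, built so that the true weighting is 0.75.
attendance = [100 - 20 * (u + 0.75 * s) for u, s in roads]

# Each guess states how many unimproved miles one state-road mile is worth.
candidate_weights = [1.0, 0.5, 0.75, 0.6, 0.9]
results = {}
for w in candidate_weights:
    corrected = [u + w * s for u, s in roads]
    results[w] = pearson_r(corrected, attendance)

# The best guess is the one whose corrected distance correlates most closely.
best_weight = max(results, key=lambda w: abs(results[w]))
print(best_weight, results[best_weight])
```

 With real measurements no guess would yield a perfect correlation. The investigator would merely retain the weighting on either side of which the coefficient falls off.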
 
 But even the correction for the quality of the road does 
 not eliminate all the error in the distance measurements. 
 Some of the pupils were transported all or a part of the way. 
 By employing the same correlation device to check up vari- 
 ous guesses as to the proper weighting, Reavis found the 
 optimum correction for distance transported per number of 
 days transported and per cent of days attended. The rea- 
 son for taking the amount of attendance into consideration 
 will readily occur to the reader. 
 
 c. Determination of Significance of Causes.—The next
 step was to divide the 5314 pupils into two groups of equal 
 numbers. One group was composed of that half of the 
 pupils having the better attendance record. The half with 
 the poorer attendance record composed the other group. 
 Three or more groups representing as many attendance 
 gradations could have been used. From the better-attend- 
 ance groups a smaller group was so selected as to be equiva- 
 lent in every respect, except for the difference in attendance 
 and the factor of distance, to a smaller group selected from 
 the poorer-attendance group. That is, in equating these 
 two groups, the factor of distance was ignored but all other 
 factors were regarded. The technique for equating groups 
 on several bases was discussed in Chapter III. Next, the 
 mean distance from school of each equated group was com- 
 puted. If, when this was done, the mean distance was 
 less for the better-attendance group, the investigator was 
 justified in concluding that a difference in distance was asso- 
 ciated or correlated with a difference in attendance. 
 
 The next step was to equate two groups in every respect 
 except, say, the quality of school work of the pupils and 
 attendance. The difference between the mean quality of 
 school work for the two groups showed the extent to which 
 quality of school work was associated with attendance,
 whether positively correlated, negatively correlated, or
 whether neutral. In similar fashion, the investigator deter- 
 mined whether any other factor relating to the pupil, teacher, 
 school, or community was associated, and to what degree, 
 with the attendance of the pupils. 
 
 If the mean distance for one attendance group was identi- 
 cal with the mean for the other attendance group, a con- 
 clusion that distance affects attendance would be totally 
 unreliable. Since the D between the two M’s would be 
 zero, the EC would be zero. If there were some difference 
 between the two M’s, the significance of this D, or rather 
 how much we could trust its significance, would depend upon 
 the reliability or EC of this D. This reliability could be 
 determined in the usual way. The series of distance scores 
 from which M1 came would permit the computation of SD
 and SDM1. Similarly the series of distance scores which
 yielded M2 would yield SD and SDM2. M1 and M2 would
 yield D. SDM1 and SDM2 would yield SDD. D and SDD
 would yield EC.
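
 The chain of computations from the two series of scores to D and SDD may be sketched as below. The distance scores are hypothetical. The sketch assumes the usual formulas, namely that the SD is computed with N in the denominator, that the SD of a mean is SD divided by the square root of N, and that the SD of the difference between two independent means is the square root of the sum of the squared SD's of the means. The final step from D and SDD to EC is not reproduced here. Only the ratio of D to SDD is shown, on which the EC depends.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Standard deviation with N in the denominator (an assumption of this sketch)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def sd_of_mean(values):
    """SDM: reliability of a mean, taken as SD divided by the square root of N."""
    return sd(values) / sqrt(len(values))


def difference_and_reliability(group_1, group_2):
    """Return D (M1 minus M2) and SDD for two independent series of scores."""
    d = mean(group_1) - mean(group_2)
    sdd = sqrt(sd_of_mean(group_1) ** 2 + sd_of_mean(group_2) ** 2)
    return d, sdd


# Hypothetical distance scores in miles for two equated attendance groups.
better_attendance = [0.5, 1.0, 1.0, 1.5, 2.0]
poorer_attendance = [1.5, 2.0, 2.5, 2.5, 3.0]

d, sdd = difference_and_reliability(better_attendance, poorer_attendance)
ratio = d / sdd  # the larger this ratio in size, the more the D may be trusted
print(d, sdd, ratio)
```

 If the two means were identical, D and therefore the ratio would be zero, which agrees with the statement above that no conclusion as to the influence of distance could then be trusted.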
 
 When two groups equivalent in all respects, except for 
 attendance and the difference in the factor being studied, 
 show the same mean amount of the factor, we can certainly 
 say that the factor under consideration has no influence 
 upon attendance, is not a cause or contributing cause of 
 attendance. When the above procedure is used, and when 
 variations in attendance are accompanied by variations in 
 the factor being studied, we are justified in saying that 
 variations in the factor are associated or are correlated with 
 variations in attendance. But additional considerations are 
 necessary before we are justified in concluding that varia-
 tions in a factor influence or are a cause of variations in
 attendance. It may be that attendance is, instead, a cause 
 of the factor. Or it may be that each is partly effect and 
 partly cause. Or it may be that no direct, definite causal 
 relation exists. 
 
 Judging by Reavis’s findings, distance is associated with 
 attendance. Now since it is easily conceivable that distance
 influences attendance, and since it is highly improbable that
 attendance in a particular year has influenced the distance 
 a pupil lives from school during that year, we are justified 
 in concluding that distance is not only associated with but 
 actually influences attendance. Also the results of Reavis’s 
 study showed that quality of school work was associated 
 or correlated with attendance, but we cannot be quite certain 
 here, whether the quality of school work influenced attend- 
 ance or attendance influenced quality of school work or both. 
 Probably the last is nearest the truth. Poor attendance 
 leads to low quality of work, which leads to loss of interest, 
 which leads to poorer attendance still. In sum, if the investi- 
 gator will follow the procedure outlined above he can con- 
 clude that a correlation exists between factor and attendance, 
 and that sometimes a causal relation exists; but which is 
 cause and which effect rests upon additional logical con- 
 siderations. 
 
 When the cases are as numerous as they were in the study 
 made by Reavis, causal investigators often save themselves 
 trouble by using all the cases in the study of each factor, 
 trusting to luck and to numbers to make the groups equiva- 
 lent in all other factors. Thus, in the sample illustration, 
 they would divide the 5314 pupils into, say, two groups equal 
 in number, those living nearer and those living farther 
 from the school. The investigator would assume, in this 
 case, that since the pupils were divided with an eye to 
 one factor only, the two groups would by chance be
 approximately equivalent with respect to the amount of 
 presence of any other factor. 
 
 If the various factors are independent of each other, i.e., 
 if they are uncorrelated with each other, the foregoing pro- 
 cedure would be fairly satisfactory. But in any complex 
 investigation, the investigator can be practically certain that 
 various factors are correlated and cross correlated in all 
 sorts of bewildering ways. If all pupils are divided regard- 
 less of everything except quality of school work, we can 
 be practically sure that chance would not equate the two
 groups with respect to, say, distance. Long distance from
 school, through its reduction of attendance, affects quality 
 of school work. That is, distance and quality of school 
 work are not independent factors. They are negatively
 correlated. As a result, any division on the basis of quality 
 of school work alone, unavoidably becomes, in part at least, 
 a division on the basis of distance. In like manner, it 
 will become, in part at least, a division on the basis of 
 every other factor which is correlated either positively or 
 negatively with quality of school work. So long as this is 
 the case, the investigator is unable to tell just how much of
 any difference in attendance is attributable to quality of 
 school work, and how much to each of the various factors 
 correlated with quality of school work. All he can conclude 
 is that this total complex is correlated with the attendance 
 record, and may be a cause or an effect of the attendance 
 record. The only safe procedure is to satisfy as completely 
 as possible the equivalent-groups experimental criteria by 
 attempting consciously to equate the groups in every known 
 factor. Even so there will be enough error due to unknown 
 significant factors. 
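
 A small numerical sketch may make the danger concrete. The eight pupils below are hypothetical, and their scores are invented so that distance and quality of school work are negatively correlated. The pupils are divided on quality of school work alone, and the mean distance of each half is then computed.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical pupils: (distance from school in miles, quality of school work).
pupils = [(0.5, 90), (1.0, 85), (1.0, 70), (1.5, 80),
          (2.0, 75), (2.5, 60), (3.0, 65), (3.5, 50)]

# Divide on quality of school work only, ignoring every other factor.
ranked = sorted(pupils, key=lambda pupil: pupil[1], reverse=True)
half = len(ranked) // 2
better_work, poorer_work = ranked[:half], ranked[half:]

# Chance has not equated the halves in distance: the division on quality
# has become, in part, a division on distance as well.
mean_distance_better = mean([distance for distance, _ in better_work])
mean_distance_poorer = mean([distance for distance, _ in poorer_work])
print(mean_distance_better, mean_distance_poorer)
```

 In this invented series the half doing the poorer work lives, on the average, twice as far from school. Any difference in attendance between the halves could therefore not be credited to quality of school work alone.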
 
 d. Preliminary Exploration of Significance of Causes.— 
 Now as a matter of fact, Reavis did not employ the former 
 or more exact method of evaluating the factors. He used 
 instead a modified and rather drastic form of the latter 
 more crude method. But he used this method not for the 
 purpose of evaluating exactly the influence of each factor 
 upon attendance, but rather for the purpose of preliminary 
 exploration to discover which factors appeared promising 
 enough to justify an additional very refined procedure—a 
 procedure more feasible than the exact one already de- 
 scribed. 
 
 His preliminary explorative procedure was to place in one 
 group, not the half of his pupils who had the best attend- 
 ance records, but the topmost 12% in attendance. The 
 other group was composed of the lowest 12% in attendance. 
 Since any factor that varies with attendance should be
 found in different amounts in these two groups, he computed
 the mean distance from school for each group, and then 
 the mean quality of work in school for each group, the 
 per cent of each group found under the better teachers, vs. 
 the per cent found under the poorer teachers, and so on 
 for the large variety of factors whose influence upon attend- 
 ance was under consideration. When there was a pro- 
 nounced difference between the two means or the two per 
 cents for a factor, Reavis considered that factor to be 
 worthy of further study by a more exact procedure. When 
 no pronounced difference appeared he considered that factor 
 to have little or no influence upon attendance and eliminated 
 it from further consideration. While this method is so crude 
 that it will not show the independent contribution of each 
 factor, it is sufficiently exact to show what factors are 
 promising ones for further study and which ones are un- 
 promising. 
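
 The preliminary exploration by extreme groups can be sketched as follows. The twenty-five pupils are hypothetical and are generated by a simple rule for illustration. The names used are not Reavis's. No rule is assumed here for deciding how large a difference is "pronounced," since the text gives none.

```python
def mean(values):
    return sum(values) / len(values)


def extreme_groups(records, fraction=0.12):
    """Return the topmost and lowest `fraction` of records by attendance."""
    ranked = sorted(records, key=lambda record: record["attendance"])
    k = max(1, round(len(ranked) * fraction))
    return ranked[-k:], ranked[:k]


# Hypothetical pupils: attendance in per cent, distance in miles.
pupils = [{"attendance": 4 * i, "distance": round(4.0 - 0.15 * i, 2)}
          for i in range(25)]

top, bottom = extreme_groups(pupils)

# Compare the mean amount of one factor in the two extreme groups.
mean_top = mean([pupil["distance"] for pupil in top])
mean_bottom = mean([pupil["distance"] for pupil in bottom])
difference = mean_top - mean_bottom
print(len(top), mean_top, mean_bottom, difference)
```

 The same comparison would be repeated for each factor in turn, using means for measured factors and per cents for factors which are merely present or absent.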
 
 In this preliminary investigation Reavis determined 
 roughly the significance for attendance of the following 
 factors relating to the child: sex, chronological age, grade 
 in which enrolled, quality of work, and promotion. He 
 studied the following factors relating to the school: training 
 of teacher, salary of teacher, experience of teacher, num- 
 ber of recitations, completeness of teacher’s report, neat- 
 ness of teacher’s report, handwriting of the teacher, teacher’s 
 intention to continue, schools changing teachers, rating of 
 teacher, size of library, kind of blackboard, rating of equip- 
 ment, age of desks, number and kind of pictures on the 
 walls, school enrollment, size of schoolroom, lighting of 
 schoolroom, system of heating and ventilation, rating of 
 school building, suitability of school grounds, play and 
 games, value of school property, cost of running school and 
 distance from children’s homes. He investigated the fol-
 lowing factors relating to the community: money raised,
 number of community meetings, and rating of the com- 
 munity. 
 
 Many of the above factors proved to have little or no
 connection with attendance. Many other factors showed a
 significantly promising relationship. In order to reduce the 
 number of factors for detailed examination, various signifi-
 cant factors were combined where possible. Thus a score
 for distance was determined by combining uncorrected dis- 
 tance, quality of roads, and transportation. A score for 
 the teacher was secured by combining the factors relating 
 to her which proved significant, namely, her rating by the 
 superintendent, her salary, and her training. A score for 
 the school plant was secured by combining the rating on 
 the building, rating on the equipment, and rating on the 
 grounds. In describing the correction of distance, a device 
 was given for determining weights to be assigned to the 
 elements that entered into these various combinations. A 
 like method was employed for computing these composites 
 for teacher, and for school. Three other factors, namely, 
 a pupil’s progress through the grades or age-grade relation- 
 ship, a pupil’s quality of school work, and the quality of 
 the community, were found worthy of additional considera- 
 tion. This means that six factors were selected for detailed 
 examination by the process to be described. 
 
 A seventh factor, namely, chronological age, was found 
 to be significant, but the effect of this factor was taken care 
 of by studying the relationship between attendance and the 
 six selected factors separately for each of three age groups, 
 namely, 5 to 8, 8 to 12, 12 and above. 
 
 e. Correlation and Inter-correlation Between Causes 
 and Effect—The next step was to compute the coefficient 
 of correlation between attendance and each of the six 
 selected factors, and to do this separately for each of the 
 three age sub-groups. 
 
 The coefficient of correlation is a statistical expression 
 for the degree of proportionality or correspondence between 
 two series of measures, and is indicated by the symbol r. 
 When r is +1.0 the correspondence or correlation between the
 two series of measures, say, scores for distance and attend-
 ance is perfect and positive. When r is -1.0 the correlation
 is perfect but it is inverse or negative. When r is zero
 the correlation is nil. An r may be anywhere from -1.0
 through zero to +1.0. We should expect the r between
 attendance and quality of school work to be positive, because 
 we should expect those pupils who have a good attendance 
 record to tend to show high quality of school work, and 
 vice versa we should expect those pupils who have a poor 
 attendance record to tend to show a low quality of work. 
 On the other hand we should expect the r between attend-
 ance and distance to be negative, because we should expect
 those pupils who have a high distance score to tend to
 have a low attendance record, and vice versa.
 
 There are several formulae for the computation of r.
 The standard formula when the relationship is approximately 
 rectilinear (see Diagram 1) is Pearson’s product-moment 
 formula, which may be written thus when the exact mean 
 is used: 
 
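 \[ r = \frac{S\,xy}{\sqrt{S\,x^2}\,\sqrt{S\,y^2}} \]

 or thus, when the assumed mean is used:

 \[ r = \frac{\dfrac{S\,xy}{N} - c_x c_y}{\sqrt{\dfrac{S\,x^2}{N} - c_x^2}\,\sqrt{\dfrac{S\,y^2}{N} - c_y^2}} \]

 where S denotes "the sum of," x and y are deviations from the mean (or, in the second formula, from the assumed mean), N is the number of pairs, and c_x = Sx/N and c_y = Sy/N are the corrections for the assumed means.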
 
 Most educational relationships are rectilinear or are suffi- 
 ciently so to make it permissible to employ the product- 
 moment formula. But it is well to construct and inspect 
 a scatter diagram (see Diagram 1) to determine whether 
 the general drift of the diagram is rectilinear or curvilinear 
 (see Diagram 1). If it is pronouncedly curvilinear the in-
 vestigator is referred to Rugg’s book¹ on statistical methods
 for the appropriate formula.

 ¹ Rugg, Harold O., Application of Statistical Methods to Education; Houghton
 Mifflin Company, New York, 1917.
 
 DIAGRAM 1

 THE CIRCLES SHOW AN APPROXIMATELY RECTILINEAR RELATIONSHIP. THE
 CROSSES SHOW A CURVILINEAR RELATIONSHIP

 (Scatter diagram. Axes: per cent of attendance, 0 to 100; distance in miles.)
 
 Diagram 1 shows in one diagram two sample scatter dia-
 grams for two groups of twenty-five children. The circles
 show the relationship between attendance and distance.
 Each circle indicates one child’s attendance record and
 distance from school. The general drift of the relationship
 is a straight-line or rectilinear drift. The crosses show the
 relationship between attendance and distance for twenty-five
 other pupils. Remember that the diagram is merely for
 illustrative purposes. It is extremely improbable that one
 group of pupils (circles) would show a decided negative
 correlation and another group (crosses) a decided positive
 correlation. But the important point to note about the
 diagram is that the circles show a rectilinear drift whereas
 the crosses show a curvilinear drift.

 TABLE 37. SHOWING HOW TO COMPUTE r FOR THE DATA (CIRCLES) IN DIAGRAM 1
 
 The procedure for computing r is given in Table 37. Note 
 that the x column shows deviations from the AM for attend- 
 ance, and that the y column shows deviations from the AM 
 for distance. Everything else is self-explanatory. 
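
 The computation of r may also be sketched in Python. The six pairs of scores below are hypothetical and are not the data of Diagram 1. The sketch computes the product-moment coefficient twice: once from deviations from the exact means, and once from deviations from assumed means (AM) with the usual corrections, here called cx and cy. The two results should agree whatever assumed means are chosen.

```python
from math import sqrt


def r_exact_mean(xs, ys):
    """r from deviations taken from the exact means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    x = [v - mean_x for v in xs]
    y = [v - mean_y for v in ys]
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def r_assumed_mean(xs, ys, am_x, am_y):
    """r from deviations taken from assumed means, corrected by cx and cy."""
    n = len(xs)
    x = [v - am_x for v in xs]
    y = [v - am_y for v in ys]
    cx, cy = sum(x) / n, sum(y) / n  # corrections for the assumed means
    numerator = sum(a * b for a, b in zip(x, y)) / n - cx * cy
    sd_x = sqrt(sum(a * a for a in x) / n - cx ** 2)
    sd_y = sqrt(sum(b * b for b in y) / n - cy ** 2)
    return numerator / (sd_x * sd_y)


# Hypothetical scores: per cent of attendance and distance in miles.
attendance = [95, 80, 70, 60, 45, 30]
distance = [0.4, 1.0, 1.6, 2.0, 2.8, 3.4]

r_1 = r_exact_mean(attendance, distance)
r_2 = r_assumed_mean(attendance, distance, am_x=60, am_y=2.0)
print(r_1, r_2)
```

 The assumed means are chosen as round numbers near the true means so that the deviations are easy to handle. The corrections then remove the error introduced by the guess.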
 
 When N is large, say 50 or above, it is more economical 
 to tabulate data into a contingency table, such as Table 38. 
 Such a contingency table may be used not only as a starting 
 point for a short-cut method of computing a product-moment 
 coefficient of correlation, but it also makes unnecessary the 
 construction of a scatter diagram, such as Diagram 1. In- 
 spection of the contingency table will show whether the rela- 
 tionship is sufficiently rectilinear to make the product- 
 moment method applicable. 
 
 Table 38 is read thus: There were 3 pupils who lived
 between 3.4 and 4.0 (inclusive) miles distance from school
 whose per cent of attendance was between 0 and 10 inclusive,
 and similarly for the remainder of the contingency
 table.
 
 There is no particular virtue in grouping the per cents in 
 step-intervals of 15, or the miles in step-intervals of 0.8. 
 The per cents could be grouped in step-intervals of 5, 10, 15 
 or any amount that is convenient. Likewise, the miles could 
 be grouped in step-intervals of 0.2, 0.4, 0.6, 0.8 or any 
 amount that is convenient. The size of the step-intervals 
 chosen for Table 38 gives 7 steps for attendance, and 5 
 steps for distance. As a rule it is better to have a step-interval
 of such size as to produce not less than 10 nor more
 than 20 steps in each of the two items. The steps are made
 fewer in Table 38 so as to simplify the presentation of the
 correlation procedure.

TABLE 38

SHOWS HOW TO COMPUTE A COEFFICIENT OF CORRELATION WHEN DATA OF TABLE 37 HAS BEEN TABULATED IN A CONTINGENCY TABLE (AFTER H. L. RIETZ)

                        Per cent of attendance
 Distance      0    15    30    45    60    75    90
 in miles     to    to    to    to    to    to    to                          xy     xy
              10    25    40    55    70    85   100     f     y    fy   fy²  (+)    (−)

 3.4 to 4.0    3     1     1                             5     2    10    20          24
 2.6 to 3.2    1     1           1     1                 4     1     4     4           4
 1.8 to 2.4                3           1     2           6     0     0     0           0
 1.0 to 1.6                1     1     1     1     1     5    −1    −5     5           5
 0.2 to 0.8                                  3     2     5    −2   −10    20          24

 f             4     2     5     2     3     6     3    25          −1    49     0    57
 x            −3    −2    −1     0     1     2     3
 fx          −12    −4    −5     0     3    12     9     3
 fx²          36     8     5     0     3    24    27   103

 Sx² = 103     Sy² = 49     Sxy = 0 − 57 = −57     cx = 3/25 = 0.12     cy = −1/25 = −0.04

$$r = \frac{\dfrac{Sxy}{N} - (cx)(cy)}{\sqrt{\dfrac{Sx^2}{N} - (cx)^2}\,\sqrt{\dfrac{Sy^2}{N} - (cy)^2}} = \frac{\dfrac{-57}{25} - (0.12)(-0.04)}{\sqrt{\dfrac{103}{25} - (0.12)^2}\,\sqrt{\dfrac{49}{25} - (-0.04)^2}} = -.80$$
 
 The steps in the process of computing a coefficient of
 correlation from a contingency table follow.

 (1) Construct the contingency table.

 (2) The total frequencies in the first column are 4. The total frequencies in the second column are 2, and so on for the other columns. The grand total of frequencies is 25.

 (3) The total frequencies for the first row are 5, for the second row, 4, and so on. The grand total of frequencies is 25, thus checking the preceding determination.

 (4) The AM for attendance is 50, as shown by the vertical double ruling. The AM for distance is 2.1, as shown by the horizontal double ruling. Other AM’s might have been taken, though AM’s near the center of each frequency distribution are more convenient.

 (5) The step-deviations from the AM for attendance are shown in the x row. The step-deviations from the AM for distance appear in the y column.

 (6) The product of each x multiplied by its corresponding f appears in the fx row. The algebraic total of the fx’s is shown at the end of the fx row. Sfx = 3.

 (7) The product of each y multiplied by its corresponding f appears in the fy column. The algebraic sum of the fy’s is shown at the bottom of the fy column. Sfy = −1.

 (8) The product of each x² multiplied by its corresponding f appears in the fx² row. Sfx² = 103.

 (9) The product of each y² multiplied by its corresponding f appears in the fy² column. Sfy² = 49.

 (10) The f in the first square in the first column and first row is 3. The x at the bottom of this column is −3. The y at the end of this row is 2. The product of (3) × (−3) × (2) is −18, which is written in the upper right corner of this first square. The f in the second square of the first column is 1. The x at the bottom of this column is −3, and the y at the end of this row is 1. The product of (1) × (−3) × (1) is −3, which is written in the upper right corner of the square in question. The f in the third square of the third column is 3. The x is −1, and the y is 0. The product of (3) × (−1) × (0), namely 0, is written in the upper right corner. The f in the last square of the last row is 2. The x is 3 and the y is −2. The product of (2) × (3) × (−2), namely −12, is written in the upper right corner of this square. The other f’s times the xy products are computed similarly.

 (11) The sum of the xy products in the first row, i.e., the sum of −18, −4, and −2, is −24. This sum is written in the xy column in the minus sub-column. Were this sum positive instead of negative, it would be written in the positive sub-column. In like manner, the sum of the xy products for each row is computed and written in the last column. Positive Sxy = 0. Negative Sxy = 57.

 (12) The cx is computed; cx = 0.12.

 (13) The cy is computed; cy = −0.04. These c’s are not multiplied by the size of the step-interval as is done in Table 17, because Sxy, Sx², and Sy² used in the correlation formula are kept in terms of step-intervals also.

 (14) Sx² = 103. Sy² = 49. Sxy = 0 − 57 = −57.

 (15) The values previously computed are substituted in the correlation formula shown at the bottom of the table. This formula is identical with that used in Table 37, except that all values are in terms of step-intervals. By solving the formula, r is found to be −.80+. The r, when computed by the procedure illustrated in Table 37, is −.81. This is a remarkably close agreement, when we consider the drastic condensation of the data produced by the large step-intervals used in the contingency table.
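 The arithmetic of steps (12) to (15) can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function name `r_from_step_sums` and its argument names are introduced here and are not the author's, and the inputs are simply the sums quoted in the steps above (N = 25, Sfx = 3, Sfy = −1, Sfx² = 103, Sfy² = 49, Sxy = −57).

```python
from math import sqrt


def r_from_step_sums(n, s_fx, s_fy, s_fx2, s_fy2, s_xy):
    """Product-moment r from sums kept in step-interval units.

    All deviations are taken from assumed means (AM), so the
    corrections cx and cy are subtracted as in the text's formula.
    """
    cx = s_fx / n                      # correction for attendance
    cy = s_fy / n                      # correction for distance
    sd_x = sqrt(s_fx2 / n - cx ** 2)   # SD of x in step units
    sd_y = sqrt(s_fy2 / n - cy ** 2)   # SD of y in step units
    return (s_xy / n - cx * cy) / (sd_x * sd_y)


# Sums quoted in steps (6) to (14).
r = r_from_step_sums(n=25, s_fx=3, s_fy=-1, s_fx2=103, s_fy2=49, s_xy=-57)
print(round(r, 2))  # -0.8
```

 Because every quantity stays in step-interval units, the sizes of the intervals (15 per cent and 0.8 mile) never enter the computation, which is the point made in step (13).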
 
 By substituting age-grade scores for distance scores in 
 Table 37 or Table 38, and by recomputing, the r for at- 
 tendance with age-grade relation can be determined. In 
 similar manner, the r between attendance and each of the 
 six selected factors, or between any factor and any other 
 factor, can be computed. The first row of Table 39 shows 
 the coefficients of correlation between attendance and each 
 of the six factors as computed by Reavis for the age group 
 8 to 12 and all five counties combined. Reavis’s original 
 
232 How to Experiment in Education 
 
 table presents the coefficients for the three separate groups 
 and the five separate counties. Additional rows show the 
 correlation between each factor and every other factor. 
 
 For our present purpose the first row of Table 39 is the
 most significant. It tells us that those whose attendance
 records are excellent tend to live near the school to the
 extent of .45, tend to progress rapidly through the grades
 to the extent of .50, tend to make high marks in school to
 the extent of .33, tend to have good teachers to the extent
 of .16, tend to have an excellent school plant to the extent
 of .07, and tend to live in a highly-rated community to the
 extent of .30. So far as these coefficients go, attendance
 appears to be most closely associated with age-grade
 relationship and distance.

 TABLE 39

 SHOWING THE COEFFICIENTS OF CORRELATION BETWEEN ATTENDANCE AND EACH
 OF SIX HYPOTHETICAL CAUSES OF ATTENDANCE, TOGETHER WITH THE
 CORRELATION BETWEEN EACH CAUSE AND EVERY OTHER CAUSE (ADAPTED
 FROM REAVIS)

                            2         3         4         5         6         7
                                     Age     Quality             School     Com-
                        Distance    Grade    of Work   Teacher    Plant    munity

 1. Attendance ........   −.45       .50       .33       .16       .07       .30
 2. Distance ..........             −.20      −.13      −.10      −.06       .02
 3. Age Grade .........                        .24       .01       .08       .08
 4. Quality of Work ...                                  .00       .08       .03
 5. Teacher ...........                                            .25       .35
 6. School Plant ......                                                      .17
 
 Among the inter-correlations of the various factors, the 
 most surprising coefficient is the zero relation between qual- 
 ity of work and the teacher. One would expect better 
 teachers to secure a higher quality of work on the part of 
 the pupils. Had quality of work been measured by standard
 tests, a positive coefficient would almost certainly have
 been found. But the scores for quality of work were the
 teacher’s marks. These marks are strictly relative, which 
 fact effectively covers up any difference in the efficiency 
 of different teachers. 
 
 If the size of any coefficient of correlation in Table 39
 is so small as to cast a doubt upon its significance, there is a
 formula which permits the computation of the reliability
 of an r. It is

$$SD_r = \frac{1 - r^2}{\sqrt{N}}$$

 where r is the coefficient of correlation whose reliability is
 sought, and N is the number of pupils used in computing r.

 The SDr is interpreted like SDM or SDD. If it is desired
 to know the probability that the true r is not zero or below,
 the EC may be computed by means of the following formula:

$$EC = \frac{r}{2.78\,SD_r}$$
 
 Also this EC formula can be used to determine the prob- 
 ability that the true r does not lie below a defined r, or that 
 it does not lie above a defined r. How to use the EC 
 formula for either of these two special purposes has been 
 discussed in connection with its similar use for M or D. 
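 As a small illustration of the two formulas just given, the following Python sketch computes SDr and the EC for a single coefficient. The function names are introduced here for convenience, and the N of 100 is purely hypothetical, since the number of pupils behind each coefficient in Table 39 is not stated in this passage.

```python
from math import sqrt


def sd_r(r, n):
    """Reliability of r: SDr = (1 - r^2) / sqrt(N)."""
    return (1 - r ** 2) / sqrt(n)


def ec_r(r, n):
    """Experimental coefficient for r: EC = r / (2.78 * SDr)."""
    return r / (2.78 * sd_r(r, n))


# r = .16 is the attendance-teacher coefficient of Table 39;
# n = 100 is an assumed sample size used only for illustration.
r, n = 0.16, 100
print(round(sd_r(r, n), 4))
print(round(ec_r(r, n), 2))
```

 A larger N shrinks SDr in proportion to the square root of N, and so raises the EC obtained for the same r.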
 
 f. Final Evaluation of Causes by Partial Correlation.— 
 The crude correlation coefficients in the first row of Table 39 
 may not tell the independent influence of each factor upon 
 attendance or vice versa. We could be certain that they 
 show such independent contribution only in case the inter- 
 correlation coefficients between the various factors were all 
 zero. Were they all zero we should know beyond doubt 
 that the correlation between a particular factor and attend- 
 ance has not been enhanced or diminished, as a result of its 
 correlation with some other of the factors listed. Additional
 evaluation has shown, for example, that the school
 plant has no intrinsic connection with attendance. It has
 a slight positive correlation of .07 as shown in Table 39
 largely because it is correlated with the teacher, who does
 have some genuine connection with attendance. That is, all
 the correlation between school plant and attendance is a
 borrowed correlation. It is possible for a factor to borrow
 in this way from all the other factors. The problem of
 determining the independent correlation of each factor
 with attendance becomes a problem of stripping from
 each the correlation it has borrowed from all the other
 factors. If the borrowing has been small, little will be
 subtracted from the coefficients shown in the first row of
 Table 39.
 
 The crude correlation of a factor with attendance is com- 
 parable to the crude process previously described of dividing 
 all the pupils into a better-attendance and a poorer-attend- 
 ance group, and then averaging the distance each group 
 lives from school without making any attempt to equate 
 groups. We have seen how such a procedure tends to lump 
 the various factors together, depending upon the degree of 
 correlation between them. We have seen, further, that the 
 only way to avoid this confusion of different factors and to 
 determine the independent contribution of each to attend- 
 ance is to equate the two groups with respect to all the 
 factors except the one under investigation. 
 
 Due to the fact that it is difficult to select two groups 
 from the better-attendance and poorer-attendance groups 
 which are exactly equivalent in five different factors, Reavis 
 elected to employ an alternative process which yields com- 
 parable results. He used the method of correlation supple- 
 mented by partial correlation. The effect of partial cor- 
 relation coefficients is to show what the correlation would 
 be between, say, attendance and distance if all pupils were 
 of the same age in the same grade, were doing the same 
 quality of work, were under like teachers, were housed in 
 like school plants, and lived in like communities. The crude 
 coefficients in rows 2, 3, 4, 5, and 6 in Table 39 were computed
 in order to make possible the computation of just such
 partial correlation coefficients.
 
 The operation of the partial correlation formula has for 
 its goal the following independent, isolated, or partial cor- 
 relation coefficients: 
 
 r12.34567
 r13.24567
 r14.23567
 r15.23467
 r16.23457
 r17.23456
 
 The figures 1, 2, 3, 4, 5, 6, and 7 refer respectively to attend- 
 ance, distance, age grade, quality of work, teacher, school 
 plant, and community, as shown in Table 39. The partial 
 correlation coefficient of r12.34567 means the correlation 
 between attendance (1) and distance (2) when freed (.) 
 from the influence of age grade (3), quality of work (4), 
 teacher (5), school plant (6), and community (7). The 
 coefficient, r13.24567, means the correlation between attend- 
 ance and age grade when freed from the influence of the 
 five other factors. 
 
 The computation of r12.34567 requires the investigator to
 operate the partial correlation formula over and over again.
 Each operation takes out the influence of just one factor.
 The total process is shown below, in exactly the reverse
 order in which computations are actually made. Reversing
 the order makes the principle of the process easier to grasp.
 The first series of formulae from the bottom removes the
 influence of 7 from r12, r13, r14, r15, r16, r23, r24, r25,
 r26, r34, r35, r36, r45, r46, and r56. The next series of
 formulae removes, in addition, the influence of 6 from r12,
 r13, r14, r15, r23, r24, r25, r34, r35, and r45. The next
 series removes, in addition, the influence of 5 from r12, r13,
 r14, r23, r24, and r34. The next series removes the influence
 of 4 from r12, r13, and r23. The next series removes
 the influence of 3 from r12. This leaves r12 purified from
 the influence of 3, 4, 5, 6, and 7.
 
236 How to Experiment in Education 
 
 r12.4567 — (113.4567) (123.4567) 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 r12.34569 S= 
 345034 1 — (r13.4567)? VO oe (123.4567)? 
 where 
 ROE Lop ae east ea LD) EL SSN) 
 Vit — (114.567)? */1 — (124.567)? 
 eco 13.567 — (114.567) (134.567) 
 V1 — (414.567)? Wt — (134.567)? 
 Bt aa 123.567 — (124.567) (134.567) 
 ; Vt — (124.567)? 1 — (134.567)? 
 where 
 Ni pt r12.67 — (r15.67) (125.67) 
 anita V1 — (115.67)? V1 — (125.67)? 
 aan 114.67 — (115.67) (145.67) 
 /t — (115.67)? V1 — (145.67)? 
 Ce aseetee 124.67 — (125.67) (145.67) __ 
 V1 — (125.67)? 1 — (145.67)? 
 a ps 113.67 — (r15.67) (135.67) 
 V 1 (015.67)* VV 1 (135.67) 7 
 eta 34.67 — (135.67) (145.67) 
 V1 — (135.67)? V1 — (145.67)? 
 Fe Gye iat Tora AAES 8 A 
 Vt — (125.67)? 1 — (135.67)? 
 where 
 pate Lie /ara) (Et0;7) ede, eee 
 "At — (£16.7)? V1 — (126.7)? 
 one r15.7 — (r16.7) (156.7) 
 V1 — (116.7)? 1 — (156.7)? 
 Awa isons wach aee WYARIAUUHG) 
 VT (ray) Ay tee (Ope 
 relorees TEA AMEL 7) AcAOeg a 
 
 V1 — (116.7)? 1 — (146.7)? 
 
Causal Investigations 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 Oe a ee Ne 
 ad V1 — (146.7)? V1 — (156.7)? 
 67 = ERAT (726.7) (46.7) 
 nay / 1 — (126.7)? V1 —\(r46.7)? 
 r13.7 — (r16.7) (136.7) 
 Ngee ed BAG seer SEO“ IRL OFT te 
 at MCE Or ye Te (136.7)? 
 135.7 — (136.7) (156.7) 
 ena maith as Mie) AUR fy 
 en V1 — (36.7)? Vr — (156.7)? 
 134.7 — (136.7) (146.7) 
 (frp A ars RI gs 
 a V1 — (136.7)? 1 — (146.7)? 
 123.7 — (126.7) (136.7) 
 (AT ET NSN 
 waa Vt — (126.7)? V1 — (136.7)? 
 where 
 pli eee )\ra7) 
 tS Vata va Ga 
 r16 — (17) (167) 
 6. — sO 
 r16.7 Wet (nt envi (TOA) 2 
 26 — (127) (167) 
 6. SEES ees RE ET SE EE aN 
 126.7 Vay vou (107) 
 aaah Tato 7) 
 OTS Va (a7)? VE — (57)? 
 r0.7— Wines Oram (2574107) oie 
 /1— (157)? “1 — (167)? 
 125.7 == — 2S (027) (857) 
 V1 — (127)? Vv 1 — (157)? 
 — __14— (117) (147) 
 SS Ar pan a ne OPE 
 r46 — (r47) (167) 
 Fate eet aca LE a 2S, tel ee dak a AC ae 
 A eas (147)? “1 — (167)? 
 TAs. 76 Ne Ses (147) (557) Nea 
 
 Vt — (t47)? Vt — (57)? 
 
 237 
 
238 How to Experiment in Education 
 
 T24 = ALATA LAG 
 
 47 = 7 (a7)? Vi— Gan)? 
 i Uoiorersgye ee oar 
 peinwicn ear 
 N84 = ee 
 idk ulus ice ad eran 2 i 
 
 V/ 1 — (637)? V1 — (147)? 
 MEE reat 27) AEST) 
 £23-7 = A/t — (127)? V1 — (137)? 
 
 
 
 Beginning at the bottom of the foregoing series of formulae,
 the coefficients of correlation from Table 39 should
 be substituted in the first computation series of formulae.
 As soon as these first partials have been computed, data
 will be available for substitution in the second computation
 series. The computation climb may thus be continued until
 r12.34567 has been determined.

 Once the process has been completed and the size of
 r12.34567 has been determined, the investigator will have
 to construct a similar series of formulae and compute
 r13.24567. Since the principle for the construction of each
 of the six needed series is identical with that for the first
 series, the other five series need not be given here. Furthermore,
 an investigator who is concerned with a larger
 or smaller number of factors than six should have no difficulty
 in extending this series to provide for a larger number
 of factors, or in omitting the upper superfluous portion of
 this series in case of a smaller number of factors.
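 Because every formula in the series has the same form, the whole computation climb can be written as one recursive routine. The Python sketch below is an illustration only, not the procedure Reavis actually used: the function name `partial_r`, the dictionary layout, and the tuple of factors to be removed are assumptions introduced here, and the crude coefficients are entered as read from Table 39 (the age grade and teacher entry is read as .01).

```python
from math import sqrt

# Crude coefficients as read from Table 39.
# 1 = attendance, 2 = distance, 3 = age grade, 4 = quality of work,
# 5 = teacher, 6 = school plant, 7 = community.
TABLE_39 = {
    (1, 2): -0.45, (1, 3): 0.50, (1, 4): 0.33, (1, 5): 0.16, (1, 6): 0.07, (1, 7): 0.30,
    (2, 3): -0.20, (2, 4): -0.13, (2, 5): -0.10, (2, 6): -0.06, (2, 7): 0.02,
    (3, 4): 0.24, (3, 5): 0.01, (3, 6): 0.08, (3, 7): 0.08,
    (4, 5): 0.00, (4, 6): 0.08, (4, 7): 0.03,
    (5, 6): 0.25, (5, 7): 0.35,
    (6, 7): 0.17,
}


def partial_r(table, i, j, removed=()):
    """Correlation of i with j freed from the factors listed in `removed`.

    `removed` gives the order of removal: the first entry is taken out
    first (the bottom series of formulae), the last entry last.
    """
    if not removed:
        return table[(min(i, j), max(i, j))]
    *earlier, k = removed              # k is the factor removed at this step
    r_ij = partial_r(table, i, j, earlier)
    r_ik = partial_r(table, i, k, earlier)
    r_jk = partial_r(table, j, k, earlier)
    return (r_ij - r_ik * r_jk) / (sqrt(1 - r_ik ** 2) * sqrt(1 - r_jk ** 2))


# r12.34567: remove 7 first, then 6, 5, 4 and finally 3, as in the text.
r12_34567 = partial_r(TABLE_39, 1, 2, (7, 6, 5, 4, 3))
print(round(r12_34567, 2))
```

 The other five partials follow by changing the second factor and the list of removed factors, for example `partial_r(TABLE_39, 1, 3, (7, 6, 5, 4, 2))` for r13.24567. Since Table 39 gives the crude coefficients to two decimals only, results may differ slightly in the last place from values computed on unrounded data.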
 
 By operating these formulae in six such series, Reavis
 isolated each of the six factors and determined its independent
 contribution to attendance. That is, he determined
 the significance of the distance pupils live from school,
 regardless of the grades they are in, the quality of the work
 they do, the kind of teachers they have, the character of 
 the school plants, or the type of community in which they 
 live. Similarly, he determined the independent correlation 
 of each factor regardless, not of all conceivable factors, nor 
 even of all factors studied, but of the six other factors 
 which appeared to be most significant and hence most need- 
 ful to be partialled out. 
 
 The final partial coefficients, as computed by Reavis, are
 given in Table 40. For purposes of comparison the partials
 are preceded by the original crude coefficients. Distance
 and community suffered the least reduction. The teacher
 appears to have little to do with attendance, and the school
 plant has nothing to do with it. The outstanding determiners
 of attendance are distance and age-grade relation.
 The quality of school work and type of community come
 next and are about equal in their influence. But the
 reader should remember that the purpose of this chapter is
 to describe a process rather than to present results. Final
 conclusion as to the significance of these factors should take
 into consideration Reavis’s results for the two other age
 sub-groups. To do so would alter somewhat the conclusions
 just stated.

 TABLE 40

 ORIGINAL AND PARTIAL COEFFICIENTS OF CORRELATION BETWEEN ATTENDANCE
 AND SIX HYPOTHETICAL CAUSES (ADAPTED FROM REAVIS)

                            Age     Quality             School     Com-
 Causes        Distance    Grade    of Work   Teacher    Plant    munity

 Attendance
   Original ..   −.45       .50       .33       .16       .07       .30
   Partial ...   −.43       .44       .25       .08      −.01       .28
 
 As has been stated already, correlation does not imply 
 causation. But partial correlation does imply causation in 
 so far as all significant factors are partialled out. But partial
 correlation does not show which is cause and which
 effect. This must be decided from non-statistical considerations.
 Such considerations lead to the conclusions that
 distance, age-grade relation, teacher, and community are 
 clearly causes rather than effects of attendance. Each of 
 these factors was determined at the beginning of the year 
 in which the attendance records were secured. On the 
 other hand it seems much more probable that quality of 
 work partly influences attendance and is partly influenced by 
 attendance, i.e., it is both cause and effect. 
 
 g. Regression Equation.—No further step is required to 
 satisfy the purpose of a causal investigation. But the com- 
 putation of partial correlation coefficients makes possible an 
 additional step, familiarity with which is important not only 
 for the causal investigator but also for those who construct 
 tests. This next step is the derivation of a regression equa- 
 tion or prophecy equation. 
 
 The simplest form of prophecy is where a pupil’s score 
 in one trait is prophesied from a knowledge of his score 
 in one other trait. Since this sort of situation demands 
 only ordinary correlation and the simplest form of regres- 
 sion equation, it makes a good starting point for the explana- 
 tion of a situation which demands partial correlation and a 
 complicated regression equation. 
 
 Suppose that the problem is to secure the best prophecy 
 as to a pupil’s attendance based on knowledge of his dis- 
 tance from school. Assume the correlation between attend- 
 ance and distance to be as shown in Table 37. The regres- 
 sion equation for this purpose is: 
 
$$x = r\,\frac{SD_x}{SD_y}\,y$$

 As shown at the bottom of Table 37, r = −.81, and

$$SD_x = \sqrt{\frac{Sx^2}{N} - (cx)^2}, \qquad SD_y = \sqrt{\frac{Sy^2}{N} - (cy)^2}$$

 [The numerical substitutions printed at this point are not legible in this copy.]

 Assume that the pupil’s distance score is known to be 1.5.
 Then y is the difference between 1.5 and the M of 2.0;
 y = −0.5. This pupil’s most probable position in attendance
 may be found by substituting the preceding values in
 the above formula, thus:

$$x = (-.81)\,\frac{SD_x}{SD_y}\,(-0.5) = 10.8$$

 Since M for attendance is 52.2, the pupil’s most probable
 score in attendance is then 52.2 + 10.8, i.e., 63.0. In like
 manner any y can be transmuted into a most probable x.

 In case x is known and the problem is to prophesy y, the
 regression equation becomes:

$$y = r\,\frac{SD_y}{SD_x}\,x$$
 
 By means of the first of these two regression equations, it 
 is possible for an experimenter to build up a table for trans- 
 muting x values into y values, so that subsequent workers 
 will need to determine only the value of x for each pupil. 
 By using the second equation, he can construct a table for 
 transmuting y values into x values. At this point, it should 
 be pointed out, that one table will not suffice for trans- 
 muting x values into y values, and y values into x values. 
 Two tables are required. 
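 The prophecy of attendance from distance described above can be sketched in Python as follows. This is an illustration under loud assumptions: the function name `most_probable_x` is introduced here, and the values `sd_x = 26.7` and `sd_y = 1.0` are hypothetical placeholders chosen only so that their ratio roughly reproduces the deviation of 10.8 quoted in the text; the SDs actually computed in Table 37 should be substituted for them.

```python
def most_probable_x(y_score, r, sd_x, sd_y, m_x, m_y):
    """Most probable score in x for a known score in y.

    Implements x = r * (SDx / SDy) * y, where x and y are deviations
    from their respective means, and then adds back the mean of x.
    """
    y = y_score - m_y                 # deviation of the known score
    x = r * (sd_x / sd_y) * y         # most probable deviation in x
    return m_x + x


# r, the two means, and the distance score of 1.5 come from the text;
# sd_x and sd_y are assumed placeholder values (see the note above).
attendance = most_probable_x(1.5, r=-0.81, sd_x=26.7, sd_y=1.0, m_x=52.2, m_y=2.0)
print(round(attendance, 1))
```

 Evaluating the function over a range of distance scores would give one of the two transmutation tables mentioned above; the table for the opposite direction requires the second regression equation, with the ratio of the SDs inverted.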
 
 When the problem is to prophesy a pupil’s position in x,
 say, attendance, from knowledge of his scores in several
 other traits, say, distance, age-grade relation, quality of work, etc.,
 partial correlation is required. The regression equation
 combines the pupil’s scores on the various factors, weighting
 each score according to the partial correlation of that
 factor with the criterion, namely, attendance. If the prob- 
 lem is to prophesy a pupil’s intelligence from several tests 
 of this trait, the regression equation combines a pupil’s 
 scores on the several tests, weighting each test according 
 to its partial correlation with some criterion of intelligence, 
 whether the criterion be some standard intelligence test, or 
 teacher’s judgment, or age-grade relation, or something else, 
 or a combination of these to constitute a criterion. Thus, 
 the regression equation will combine any number of ele- 
 ments and weight them so as to yield composite scores 
 which will correspond as closely as possible, considering the 
 elements used, with some criterion. 
 
 All that is needed to make such an equation possible is 
 the partial correlation of each element with the criterion 
 and certain measures of variability, as shown in the follow- 
 ing formula. This formula is the regression equation for 
 attendance, i.e., it combines and weights the scores on the 
 various factors so as to yield the most accurate possible score 
 in attendance from a combination of these six factors, 
 
$$x_1 = \left(r_{12.34567}\,\frac{SD_{1.234567}}{SD_{2.134567}}\right)x_2 + \left(r_{13.24567}\,\frac{SD_{1.234567}}{SD_{3.124567}}\right)x_3 + \left(r_{14.23567}\,\frac{SD_{1.234567}}{SD_{4.123567}}\right)x_4$$

$$\qquad + \left(r_{15.23467}\,\frac{SD_{1.234567}}{SD_{5.123467}}\right)x_5 + \left(r_{16.23457}\,\frac{SD_{1.234567}}{SD_{6.123457}}\right)x_6 + \left(r_{17.23456}\,\frac{SD_{1.234567}}{SD_{7.123456}}\right)x_7$$

 where x1 is the deviation of the pupil’s score from the mean
 of the attendance records, and is determined by the solution
 of the formula,

 x2 is the deviation of the pupil’s score from the mean of the
 scores in distance,

 x3 is the deviation of the pupil’s score from the mean of
 the age-grade relation, and so on for x4, x5, x6, and x7,
 where x2, x3, x4, x5, x6, and x7 are known, and where

$$SD_{1.234567} = SD_1\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13.2}^2}\,\sqrt{1 - r_{14.23}^2}\,\sqrt{1 - r_{15.234}^2}\,\sqrt{1 - r_{16.2345}^2}\,\sqrt{1 - r_{17.23456}^2}$$

$$SD_{2.134567} = SD_2\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23.1}^2}\,\sqrt{1 - r_{24.13}^2}\,\sqrt{1 - r_{25.134}^2}\,\sqrt{1 - r_{26.1345}^2}\,\sqrt{1 - r_{27.13456}^2}$$

$$SD_{3.124567} = SD_3\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23.1}^2}\,\sqrt{1 - r_{34.12}^2}\,\sqrt{1 - r_{35.124}^2}\,\sqrt{1 - r_{36.1245}^2}\,\sqrt{1 - r_{37.12456}^2}$$

$$SD_{4.123567} = SD_4\sqrt{1 - r_{14}^2}\,\sqrt{1 - r_{24.1}^2}\,\sqrt{1 - r_{34.12}^2}\,\sqrt{1 - r_{45.123}^2}\,\sqrt{1 - r_{46.1235}^2}\,\sqrt{1 - r_{47.12356}^2}$$

$$SD_{5.123467} = SD_5\sqrt{1 - r_{15}^2}\,\sqrt{1 - r_{25.1}^2}\,\sqrt{1 - r_{35.12}^2}\,\sqrt{1 - r_{45.123}^2}\,\sqrt{1 - r_{56.1234}^2}\,\sqrt{1 - r_{57.12346}^2}$$

$$SD_{6.123457} = SD_6\sqrt{1 - r_{16}^2}\,\sqrt{1 - r_{26.1}^2}\,\sqrt{1 - r_{36.12}^2}\,\sqrt{1 - r_{46.123}^2}\,\sqrt{1 - r_{56.1234}^2}\,\sqrt{1 - r_{67.12345}^2}$$

$$SD_{7.123456} = SD_7\sqrt{1 - r_{17}^2}\,\sqrt{1 - r_{27.1}^2}\,\sqrt{1 - r_{37.12}^2}\,\sqrt{1 - r_{47.123}^2}\,\sqrt{1 - r_{57.1234}^2}\,\sqrt{1 - r_{67.12345}^2}$$
 
 To illustrate the evolution and use of a regression equa- 
 tion in a simple situation, assume that the problem is to 
 prophesy a pupil’s position in 1 from a knowledge of his 
 position in 2 and 3. Stated in another way, assume that 
 the problem is to combine the scores on 2 and 3 so that the 
 resulting score will be the best possible in 1 which 2 and 3 
 can yield. Assume that 
 1 = Intelligence as measured by the Stanford or Herring
 Revision of the Binet-Simon Intelligence Scale,

 2 = Comprehension score on the Thorndike-McCall Reading
 Scale, and

 3 = Minutes spent on the Thorndike-McCall Reading
 Scale divided by the comprehension score.

 Assume further that

 r12 = .80      SD1 = 4.42     M1 = 120
 r13 = −.40                    M2 = 50
 r23 = −.56                    M3 = 15

 [The assumed values of SD2 and SD3 are not legible in this copy; they are left as symbols below.]

 Then the regression equation is

$$x_1 = \left(r_{12.3}\,\frac{SD_{1.23}}{SD_{2.13}}\right)x_2 + \left(r_{13.2}\,\frac{SD_{1.23}}{SD_{3.12}}\right)x_3$$

 Utilizing the assumed data to compute the required values
 in the regression equation, we have

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} = \frac{.80 - (-.40)(-.56)}{\sqrt{1 - (-.40)^2}\,\sqrt{1 - (-.56)^2}} = .76$$

$$r_{13.2} = \frac{r_{13} - r_{12}\,r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}} = \frac{-.40 - (.80)(-.56)}{\sqrt{1 - (.80)^2}\,\sqrt{1 - (-.56)^2}} = .10$$

$$r_{23.1} = \frac{r_{23} - r_{12}\,r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}} = \frac{-.56 - (.80)(-.40)}{\sqrt{1 - (.80)^2}\,\sqrt{1 - (-.40)^2}} = -.44$$

$$SD_{1.23} = SD_1\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13.2}^2} = 4.42\sqrt{1 - (.80)^2}\,\sqrt{1 - (.10)^2} = 2.63$$

$$SD_{2.13} = SD_2\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23.1}^2} = SD_2\sqrt{1 - (.80)^2}\,\sqrt{1 - (-.44)^2} = .59$$

$$SD_{3.12} = SD_3\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23.1}^2} = SD_3\sqrt{1 - (-.40)^2}\,\sqrt{1 - (-.44)^2} = .70$$

 Substituting the computed values in the regression equation,
 we have

$$x_1 = \left(.76\,\frac{2.63}{.59}\right)x_2 + \left(.10\,\frac{2.63}{.70}\right)x_3 = 3.39\,x_2 + .38\,x_3$$

 Now if a pupil’s score in 2 is 53, x2 = 53 − 50 = 3,
 since M2 is 50. If his score in 3 is 14, x3 = 14 − 15 =
 −1, since M3 is 15. Substituting x2 and x3 in the preceding
 equation,

$$x_1 = 3.39(3) + .38(-1) = 9.79$$

 The 9.79 shows that the pupil’s deviation from M1 is a
 plus 9.79. Since M1 is 120, the pupil’s score in 1 becomes
 120 + 9.79, i.e., 129.79.
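 The worked example can be retraced in Python. The sketch below is illustrative only: the helper name `first_order_partial` is introduced here, the partial SDs 2.63, .59 and .70 are taken as I read them from the example rather than recomputed, and each intermediate value is rounded to two decimals because the text carries its hand computation that way.

```python
from math import sqrt


def first_order_partial(r_ab, r_ac, r_bc):
    """r_ab.c: correlation of a with b freed from the influence of c."""
    return (r_ab - r_ac * r_bc) / (sqrt(1 - r_ac ** 2) * sqrt(1 - r_bc ** 2))


# Assumed data from the text.
r12, r13, r23 = 0.80, -0.40, -0.56
m1, m2, m3 = 120, 50, 15

# Partials, rounded to two places as in the hand computation.
r12_3 = round(first_order_partial(r12, r13, r23), 2)   # .76
r13_2 = round(first_order_partial(r13, r12, r23), 2)   # .10

# Partial SDs as quoted in the worked example (not recomputed here).
sd1_23, sd2_13, sd3_12 = 2.63, 0.59, 0.70

# Regression weights for x2 and x3.
b2 = round(r12_3 * sd1_23 / sd2_13, 2)
b3 = round(r13_2 * sd1_23 / sd3_12, 2)

# Pupil with a score of 53 in trait 2 and 14 in trait 3.
x2, x3 = 53 - m2, 14 - m3
x1 = b2 * x2 + b3 * x3
score_1 = m1 + x1
print(b2, b3, round(score_1, 2))  # 3.39 0.38 129.79
```

 Carrying the unrounded partials through would change the weights slightly in the second decimal place, which is worth remembering when hand results are compared with machine results.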
 
CHAPTER X 
 
 ANALYSES OF EXPERIMENTAL AND CAUSAL 
 INVESTIGATIONS 
 
 The principles and procedures formulated in the preced- 
 ing chapters had to be confined necessarily to the more 
 common types of experiments and investigations. Further- 
 more, the progress of discussion permitted only a limited 
 use of concrete illustrations. The purpose of this closing 
 chapter is twofold, (a) to show the applicability of these 
 principles and procedures to many specific experimental 
 problems and problems for causal investigation, and (b) to 
 suggest a method of attack upon relatively uncommon 
 varieties of problems. The problems used are taken more 
 or less at random from a large number submitted from time 
 to time by graduate students. 
 
 No special effort has been made to make these analyses 
 complete. Space would not permit, nor has an effort been 
 made to make them model analyses. This would require 
 not only a long period of concentrated thinking about each 
 problem but also an actual trial of each experiment to check 
 the thinking done. All that is attempted is to draw up for 
 each problem a rough plan for its solution, in order to point 
 out to the reader the general line of attack. 
 
 PROBLEM 1. Do Rural Children Learn More Rapidly in 
 Consolidated Schools or in One-room Schools? 
 
 EF1 is a consolidated school. EF2 is a one-room school. 
 S is a group or groups of rural pupils. 
 
 This problem may be solved as an equivalent-groups ex- 
 periment very simply but with some delay, or it may be 
 solved without delay by an equivalent-groups causal investigation.
 Since an experiment always gives the experimenter
 more complete control of the situation than does a causal 
 investigation, let us assume that this advantage outweighs 
 the disadvantage of a year’s delay, and that the problem 
 is to be solved by an equivalent-groups experiment. 
 
 The chief problem is to secure genuine equivalence of 
 groups. Pupils should be paired on two bases, at least, 
 namely, mental age and chronological age. 
 
 Having selected two equivalent groups, or else having
 delayed selection until the conclusion of the experiment, a
 series of IT’s or standard tests of school abilities should be 
 applied. At the close of the year these tests or duplicates 
 of them should be applied as FT’s. 
 
 The data from these tests can be fitted into one of the 
 computation molds provided in a preceding chapter. For 
 purposes of computation, all the pupils can be treated to- 
 gether as two equivalent groups or else the two main groups 
 may be broken up into age sub-groups or grade sub-groups, 
 or they may be treated both ways. 
 
 PROBLEM 2. Effect of Exemption from Class Drill in 
 Penmanship when Pupils Attain Quality 12 on the Thorn- 
 dike Handwriting Scale Compared with the Effect of Con- 
 tinuance in Class Drill. 
 
 EF1 is exemption from class drill in penmanship of those
 pupils who attain quality 12 on the Thorndike Handwriting 
 Scale. EF2 is the continuance in class drill, or the absence 
 of such exemption. 
 
 The experimental group (S) is not indicated, though the 
 effectiveness of EF1 is likely to vary with the distance the 
 ability of S is from quality 12. The implication of the 
 student’s formulation is that S has an ability below quality 
 12. The conclusion from the experiment should be stated 
 in terms of whatever S is employed. 
 
 Since the purpose of this experiment is merely to determine
 the amount of superiority of one EF over the other,
 no control EF is required and only the less stringent criteria
 for selecting the experimental method need be considered.
 The one-group method is not entirely satisfactory, because: 
 (a) Even apart from any difference in the effectiveness of 
 EF’s, the amount of change under one EF will not be iden- 
 tical with the amount of change under the other EF. Even 
 under identical conditions the rate of progress in penman- 
 ship as measured by available tests usually shows a slowing 
 up as progress proceeds. ‘To date, no progress scales have 
 been constructed which demonstrably discount this retarda- 
 tion. (b) There is some danger that there will be a signifi- 
 cant carry-over from one EF to the other, particularly if 
 the exemption-from-drill EF precedes the continuance-in- 
 drill EF. (c) The one-group method is more than unsatis- 
 factory; it is completely impossible if the change in S is 
 determined by measuring the amount of time required to 
 attain quality 12. Just as soon as one EF had brought S to 
 quality 12 there would be no opportunity to determine the 
 effect of the other EF because S would already be at quality 
 12. All this means the equivalent-groups method is the best 
 one for this problem. 
 
 The change (C) produced by each EF can be measured 
 by the per cent of pupils in each group who attain quality 
 12, as measured by the Thorndike Handwriting Scale, dur- 
 ing the period of the experiment. The experiment can be 
 stopped when, say, 50% or 85% of the leading group has 
 attained quality 12. This per cent can be compared with 
 the per cent of the other group who have attained quality 12. 
 
 This method of measurement is objectionable because it 
 does not yield a score for each pupil. It yields a score for 
 the group as a whole. This does not permit the computa- 
 tion of SD, SDM, and SDD, and hence does not permit any 
 statement of the reliability of the conclusion. 
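
 The per cent measure described above can be sketched as follows. This is only an illustration under assumed data: the quality scores and the function name are hypothetical and are not taken from the text.

```python
def percent_attaining(scores, goal=12):
    """Per cent of pupils whose handwriting quality is at or above the goal."""
    return 100.0 * sum(s >= goal for s in scores) / len(scores)

# Hypothetical scale qualities for each group at the moment the experiment stops.
ef1_group = [12, 12, 11, 12, 10, 12, 9, 12]
ef2_group = [12, 10, 11, 9, 12, 10, 9, 11]

ef1_pct = percent_attaining(ef1_group)
ef2_pct = percent_attaining(ef2_group)
print(ef1_pct, ef2_pct)
```

 Each group is reduced to a single number, which is why no SD can be computed from this measure alone.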
 
 The C can be measured by the total number of points 
 of growth on the scale during the period of the experiment. 
 There is a fatal objection to this plan. The EF1 pupils 
 are excused from handwriting instruction when they attain 
 quality 12, and are thereby and thereafter encouraged to 
 spend the handwriting drill period in more congenial ways. 
 But no EF2 pupil who attains quality 12 is so excused. 
 Measuring C by points of growth definitely discriminates 
 against EF1. 
 
 The C can be measured by the length of time required 
 by each pupil to attain quality 12. A serious objection to 
 this plan is that it requires the experiment to continue until 
 every pupil of both groups, even the slowest, has attained 
 quality 12. Certain pupils in the group may never attain 
 this level. Except for this practical objection the method 
 is quite satisfactory. If all pupils are within an easy dis- 
 tance of ability 12, this objection disappears. 
 
 Again, the C can be measured by determining the amount 
 of growth per unit of time. Suppose the first EF1 pupil 
 to attain quality 12 does so in one month from the begin- 
 ning of the experiment. To avoid disappointing pupils the 
 experiment will have to continue, but for purposes of com- 
 putation the experiment can stop at that point. The points 
 of growth made by each and all pupils in each group in one 
 month show the relative effectiveness of each EF. The 
 IT1 here may be assumed to be approximately zero for each 
 pupil. The FT1 is the points of growth in a month. The C 
 is then identical with FT1. Further computations follow 
 the computation models already given. 
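
 A minimal sketch of this computation follows, assuming hypothetical one-month growth scores. The formulas used here (SD computed over N, SD of a mean as SD divided by the square root of N, and SD of a difference as the square root of the sum of the squared SDM's) are the conventional ones and are assumed, not quoted, to match the computation models of the earlier chapters.

```python
import math
import statistics

def summarize(changes):
    """Return M, SD, and SDM for one group's changes (C = FT1, since IT1 is taken as zero)."""
    m = statistics.mean(changes)
    sd = statistics.pstdev(changes)          # SD with N in the denominator
    sd_m = sd / math.sqrt(len(changes))      # SD of the mean
    return m, sd, sd_m

# Hypothetical points of growth in the first month for each pupil.
c_ef1 = [2, 3, 1, 4, 2, 3]
c_ef2 = [1, 2, 1, 3, 2, 1]

m1, sd1, sdm1 = summarize(c_ef1)
m2, sd2, sdm2 = summarize(c_ef2)

difference = m1 - m2                          # superiority of EF1 over EF2
sd_d = math.sqrt(sdm1 ** 2 + sdm2 ** 2)       # SD of the difference
print(difference, sd_d)
```

 Because every pupil has a score, the reliability of the difference can be stated, which the per cent measure did not allow.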
 
 It is advisable for the experimenter to check the measur- 
 ing method just recommended by a related method. He 
 can permit the experiment to continue until most or perhaps 
 all of the EF1 pupils have reached quality 12. The instant 
 that an EF1 pupil reaches quality 12, the experimenter 
 should determine and record the attainment of the EF2 
 pupil who is paired with the EF1 pupil. By dividing the 
 points of growth from the initial starting point up to 12 
 by the number of days required to attain 12, the growth 
 per day can be determined for each EF1 pupil who attains 
 quality 12 during the period of the experiment. By divid- 
 ing the points of growth of each EF2 pupil, up to the time 
 his EF1 pair reached quality 12, by the number of days 
 required by his EF1 pair to attain quality 12, measures 
 comparable with the foregoing EF1 measures can be secured 
 for the EF2 pupils who pair with EF1 pupils attaining 
 quality 12. Quite satisfactory and comparable measures 
 can be secured for each EF1 pupil who fails to attain quality 
 12 and for his EF2 pair by dividing the points of growth 
 made by each during the whole time of the experiment by 
 the number of days in the experiment. 
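
 The paired growth-per-day measure can be sketched as below. The records, field names, and the 60-day experiment length are hypothetical; only the rule itself (divide points of growth by the days the EF1 member needed, or by the whole experiment when he never reaches quality 12) comes from the paragraph above.

```python
EXPERIMENT_DAYS = 60   # hypothetical length of the whole experiment

def growth_per_day(start, end, days):
    """Points of growth divided by the number of days over which they were made."""
    return (end - start) / days

# Hypothetical matched pairs. ef1_days is None when the EF1 pupil never reaches quality 12;
# in that case both members are measured at the end of the experiment.
pairs = [
    {"ef1_start": 9, "ef1_days": 30, "ef1_end": 12, "ef2_start": 9, "ef2_at_stop": 11},
    {"ef1_start": 8, "ef1_days": 40, "ef1_end": 12, "ef2_start": 8, "ef2_at_stop": 10},
    {"ef1_start": 7, "ef1_days": None, "ef1_end": 10, "ef2_start": 7, "ef2_at_stop": 10},
]

rates = []
for p in pairs:
    # The EF2 pupil is "stopped" at the same instant as his EF1 pair.
    days = p["ef1_days"] if p["ef1_days"] is not None else EXPERIMENT_DAYS
    ef1_rate = growth_per_day(p["ef1_start"], p["ef1_end"], days)
    ef2_rate = growth_per_day(p["ef2_start"], p["ef2_at_stop"], days)
    rates.append((ef1_rate, ef2_rate))

for ef1_rate, ef2_rate in rates:
    print(round(ef1_rate, 3), round(ef2_rate, 3))
```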
 
 This method of measuring C is suggested as a check upon 
 the preceding one, because there is some possibility that as 
 EF1 pupils approach their goal they are stimulated to added 
 zeal. To stop the experiment as soon as the first EF1 pupil 
 attains the goal means that only a few pupils have come 
 within the sway of this possible facilitating effect. This 
 last method gives all the pupils a chance to feel its effect, 
 in case such an effect exists. And in order to make results 
 entirely comparable, an EF2 pupil is stopped, for computation 
 purposes at least, at the same instant that his EF1 pair 
 stops. For purposes of 
 fitting these data in the computation model, assume IT1 to 
 be zero, and FT1 to be the above scores. 
 
 The careful experimenter will not be satisfied to measure 
 quality of handwriting only. As a minimum he will deter- 
 mine, in similar manner, the effect of each EF upon speed 
 of handwriting. 
 
 PROBLEM 3. What Is the Effect of the Spirit of a Class 
 on Its Achievement? 
 
 EF1 is a spirit of enjoyment, hopefulness, cooperation 
 and the like in a class. EF2 is the opposite sort of spirit. 
 There could be other EF’s representing varying degrees or 
 varieties of spirit. 
 
 The one-group or rotation method may be employed pro- 
 vided the period for each EF does not last more than a few 
 days. A longer period might fix certain attitudes which 
 will transfer to the succeeding EF. Even when the period 
 is brief some transfer is doubtless unavoidable. If the 
 teacher or other agent generates a pleasant spirit, this will 
 tend to aid the succeeding EF. If the unpleasant spirit 
 precedes, it will tend to subtract from the succeeding EF. 
 
 Probably the best method of all is the equivalent-groups 
 method, where S1 and S2 are two equivalent classes. This 
 method does not require a brief application of each EF. 
 
 Both IT’s and FT’s for both groups are needed. These 
 achievement tests will need to cover the abilities being 
 developed while the EF’s are operating. The differences 
 between the M’s of the two C’s in each achievement test 
 give the conclusions from the experiment. 
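
 As an illustration, the difference between the M's of the two C's can be computed per achievement test as sketched here. The test names and all scores are hypothetical; C is taken as FT minus IT for each pupil.

```python
from statistics import mean

def mean_change(initial, final):
    """M of C, where each pupil's C is his final score minus his initial score."""
    return mean(f - i for i, f in zip(initial, final))

# Hypothetical (IT, FT) scores for each class on each achievement test.
scores = {
    "reading": {
        "S1": ([30, 28, 35], [34, 33, 41]),
        "S2": ([31, 27, 35], [33, 30, 39]),
    },
    "arithmetic": {
        "S1": ([12, 15, 10], [15, 18, 13]),
        "S2": ([13, 14, 10], [15, 16, 12]),
    },
}

differences = {}
for test, groups in scores.items():
    m_c1 = mean_change(*groups["S1"])
    m_c2 = mean_change(*groups["S2"])
    differences[test] = m_c1 - m_c2   # positive values favor EF1

print(differences)
```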
 
 PROBLEM 4. Are Nature and Object Drawing and Paint- 
 ing Fundamental to Improve Taste in Selection of Environ- 
 ment, or Are the Principles of Design and Color the Basis 
 for This Response? 
 
 EF1 is nature and object drawing and painting. EF2 is 
 principles of design and color. 
 
 The one-group and rotation methods are inappropriate be- 
 cause of probable carry-over, so the equivalent-groups 
 method must be employed. 
 
 The S is a group of pupils improvable in their taste in 
 selection of environment, and not yet trained in either EF1 
 or EF2. 
 
 Both S1 and S2 should be given an IT to determine 
 initial taste in selection of environment. S1 should have 
 EF1 applied. S2 should have EF2 applied. Both should 
 then be given an FT. The difference between the M’s of 
 the two C’s will show which EF contributes more toward a 
 development of taste in selection of the environment. 
 
 PROBLEM 5. Which Is Better for Pupil Growth, a Tem- 
 perature of 68 degrees and a Humidity of 50 per cent, or a 
 Temperature of 86 degrees and a Humidity of 80 per cent? 
 
 EF1 is a temperature of 68 degrees and a humidity of 
 50 per cent. EF2 is a temperature of 86 degrees and a 
 humidity of 80 per cent. 
 
 Either the rotation or equivalent-groups method may be 
 employed, though the rotation method is preferable perhaps. 
 S1 can be subjected to EF1 and then to EF2. S2 can be 
 subjected to EF2 first, and then to EF1. The length of 
 time each EF is applied should be the same for all four 
 periods, and will depend upon the nature of the tests used. 
 If the tests are of traits in which growth is very rapid, 
 each EF may be applied for a brief time. 
 
 Several test types covering the work of the pupils will be 
 needed. Both IT and FT should be given. These may be 
 tests of general reading ability, arithmetical ability, spelling 
 ability, and the like. In this case, the experiment will need 
 to continue for a considerable period. Or the tests may 
 be based upon the specific lessons being taught. In this 
 case, growth will be rapid, and the experiment, if desired, 
 may be brief. 
 
 The computation will follow the regular rotation com- 
 putation model for two EF's and several test types. 
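
 One way to picture a two-group rotation computation is sketched below. It assumes that the mean changes under each EF are summed across S1 and S2 before the EF's are compared, which is a common way of balancing order effects; the exact model given earlier in the book may differ in detail. All figures and test names are hypothetical.

```python
# Hypothetical M's of C for each group under each EF, per test type.
# S1 receives EF1 then EF2; S2 receives EF2 then EF1.
mean_changes = {
    "reading": {
        ("S1", "EF1"): 5.0, ("S1", "EF2"): 3.0,
        ("S2", "EF2"): 2.5, ("S2", "EF1"): 4.5,
    },
    "arithmetic": {
        ("S1", "EF1"): 3.0, ("S1", "EF2"): 3.5,
        ("S2", "EF2"): 2.0, ("S2", "EF1"): 2.5,
    },
}

def rotation_difference(changes):
    """Sum each EF's mean change over both groups, then take EF1 minus EF2."""
    ef1_total = changes[("S1", "EF1")] + changes[("S2", "EF1")]
    ef2_total = changes[("S1", "EF2")] + changes[("S2", "EF2")]
    return ef1_total - ef2_total

results = {test: rotation_difference(c) for test, c in mean_changes.items()}
print(results)
```

 Summing over both orders means that whatever advantage comes from being first or second is shared equally by the two EF's.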
 
 PROBLEM 6. To Determine the Effect on the Mastery of 
 English of Teaching Technical Grammar from the Fourth 
 to the Eighth Grade. 
 
 EF1 is the teaching of technical grammar from the fourth 
 to the eighth grade. EF2 is the absence of such technical 
 grammar and presumably the presence of other forms of 
 ordinary English instruction instead. 
 
 The equivalent-groups method is required. The formula- 
 tion of the problem does not make it clear whether there 
 are to be five sub-groups—fourth, fifth, sixth, seventh, and 
 eighth grades—with equivalent sub-groups, or whether there 
 are to be two equivalent fourth grades each of which is to 
 have its EF applied for five years in succession. 
 
 In either case IT’s and FT’s of English ability are re- 
 quired. A computation model has been provided for either 
 form of experiment. 
 
 PROBLEM 7. To Determine the Relation of Physical Effi- 
 ciency to School Progress. 
 
 EF1 is physical efficiency of a defined amount. EF2 is 
 physical inefficiency of a defined amount. A variety of 
 EF’s representing different degrees of physical efficiency 
 might be employed. 
 
 The equivalent-groups method is appropriate to this prob- 
 lem. Both groups may start below par physically, or at any 
 stage short of a physical condition which is at the limit of 
 possible improvement. S1 will have its physical efficiency 
 improved by careful attention to diet, etc. S2 will continue 
 on the same physical level. 
 
 Both IT’s and FT’s are needed, covering abilities growth 
 in which constitutes school progress. The difference be- 
 tween the M’s of C1 and C2 shows the effect of improved 
 physical efficiency. 
 
 This problem may be interpreted to mean: Does physical 
 efficiency facilitate school progress? Or it may be inter- 
 preted to mean: Are physical efficiency and school progress 
 associated or correlated? If the latter is the problem, the 
 one-group method is the only satisfactory experimental plan. 
 EF1 is the physical efficiency of the pupil in the best physi- 
 cal condition, EF2, EF3, EF4, etc., are the physical condi- 
 tions of the pupils who are second, third, fourth, and so on, 
 respectively, in physical condition. Each pupil should be 
 measured in both physical efficiency and past school prog- 
 ress. The correlation between these two series of measures 
 is the answer to the problem, for this correlation shows the 
 relationship between various physical conditions and corre- 
 sponding amounts of school progress. Interpretation is 
 facilitated if only those pupils are used whose present physi- 
 cal condition has been about the same throughout the school 
 career of the pupils. 
 
 One difficulty with the foregoing is that positive correla- 
 tion may not indicate a genuine relationship between physi- 
 cal efficiency and school progress. It may be that those 
 selected as more fit are also more intelligent, and that it is 
 intelligence rather than physical fitness which is responsible 
 for the correlation. This possibility may be investigated 
 by equating the fit and the unfit with respect to intelligence, 
 by using only those pupils of like intelligence, or by partial 
 correlation. 
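
 The partial correlation mentioned above can be sketched with the standard first-order formula. The three correlation values are hypothetical and serve only to show how a moderate correlation between fitness and progress can shrink once intelligence is held constant.

```python
import math

def partial_r(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Hypothetical correlations:
r_fitness_progress = 0.5        # physical efficiency with school progress
r_fitness_intelligence = 0.6    # physical efficiency with intelligence
r_progress_intelligence = 0.7   # school progress with intelligence

r_partial = partial_r(r_fitness_progress,
                      r_fitness_intelligence,
                      r_progress_intelligence)
print(round(r_partial, 3))
```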
 
 PROBLEM 8. What Effect Has Previous Training in Type- 
 writing upon Speed and Accuracy in Learning to Use a 
 Comptometer? 
 
 The EF1 is learning to compute with a comptometer plus 
 previous training in typewriting. The EF2 is learning to 
 compute with a comptometer when there has been no pre- 
 vious training in typewriting. 
 
 The one-group method cannot be used because, if for 
 no other reason, there will be a carry-over from one EF to 
 the other. For this same reason the rotation method can- 
 not be employed. The equivalent-groups method is appro- 
 priate. 
 
 S1 should have previous training in typewriting. S2 
 should lack such previous training but should be equivalent 
 in all other respects. No additional control S is required. 
 A unique feature of this experiment is that one group is both 
 an S2 and a control S at the same time, for C1 minus C2 
 shows the exact effect of previous training in typewriting 
 upon learning to use a comptometer. S1 and S2 are not 
 defined by the problem. The inference is that they are two 
 groups of clerical students. 
 
 IT1, FT1, IT2, and FT2 are required both for speed 
 and accuracy in computing with the comptometer. In case 
 both S’s have had no experience at all with the comptometer 
 both IT1 and IT2 may be assumed to be zero. 
 
 This problem may be solved by either an experiment, or 
 a causal investigation, or half investigation and half ex- 
 periment. An experimenter finds two appropriate and 
 equivalent groups. To one he gives training in typewriting 
 and follows it with training on a comptometer. To the 
 other he gives no training in typewriting, but begins train- 
 ing them on the comptometer, after a period has elapsed 
 equivalent to that used in giving his typewriting training to 
 the EF1 group. 
 
 
 The causal investigator proceeds backward rather than 
 forward. He locates two groups, both of whom are learning 
 or have learned to operate a comptometer, who are equiva- 
 lent, except that one has learned typewriting while the other 
 has not. He then investigates their respective records in 
 learning to operate a comptometer. Any differences dis- 
 covered he attributes to typewriting. 
 
 The half-investigator, half-experimenter, locates two 
 groups equivalent in every respect except for typewriting. 
 To these two groups he applies uniform training on the 
 comptometer and measures the progress of each group. 
 
 PROBLEM 9. Given Equivalent Groups of Sales Clerks 
 and Clerical Workers, Is There Any Difference Between 
 Them in Type of Memory? 
 
 This is a causal investigation. The investigator finds the 
 EF’s applied before he assumes control of the situation. 
 The only thing left for him to do is to apply the FT’s and 
 formulate conclusions. 
 
 EF1 is sales clerk, or the inherited or environmental 
 conditions which set sales clerks apart as an occupational 
 group. EF2 is clerical workers or the conditions which 
 selected and differentiated clerical workers as an occupa- 
 tional group. 
 
 S1 is a group of sales clerks, who, except for occupational 
 differentiation and its concomitants and consequences, are 
 equivalent to S2. Unless the two groups are allowed to 
 differ in the possible immediate and direct concomitants and 
 consequences of occupational differentiation the whole in- 
 vestigation loses its point, for its very object is to determine 
 whether such concomitants or consequent differences occur. 
 This means that when the two groups are being equated the 
 probable concomitants and consequences should not be 
 among the bases employed for equating. 
 
 No IT’s can be given since the EF’s have been applied 
 before the investigator takes control of the situation. Even 
 if possible, none would be given, because the psychological 
 factors influential in determining ultimate occupational 
 choice may have been present from birth. Hence all that 
 can be done is to apply FT’s to determine whether the type 
 of memory possessed by S2 differs from that possessed 
 by S1. 
 
 In an investigation of this sort the investigator should 
 be wary about concluding from any difference in memory 
 revealed that this difference has been produced by the occu- 
 pation of a sales clerk as distinguished from the occupation 
 of clerical work. The truth may be instead that the differ- 
 ence discovered merely accompanies the occupation, i.e., is 
 caused directly by a fundamental something which is the 
 cause of occupational differentiation. It may be that the 
 difference revealed is itself the cause of the occupational 
 differentiation. In sum, whenever the investigator is pre- 
 sented with a completed experiment he has no assurance 
 as to whether the EF’s or the difference in FT’s came first 
 and hence is the cause or whether something more funda- 
 mental may not be the cause of both. All the investigator 
 can say is that occupational differentiation is or is not asso- 
 ciated with memory differentiation. 
 
 The FT’s should be tests for various types of memory. 
 No IT’s can be given, but in fitting data into the computation 
 models all IT scores may be assumed to be zero. 
 
 PROBLEM 10. Is Complete Understanding Necessary to 
 the Enjoyment of a Piece of Literature? 
 
 EF1 is incomplete understanding of a piece of literature. 
 EF2 is presumably complete understanding. Since under- 
 standing may vary from complete understanding to com- 
 plete misunderstanding it will be necessary for the experi- 
 menter to define the completeness of EF1 and EF2. He 
 may find it necessary to employ several EF’s of varying de- 
 grees of completeness of understanding. 
 
 Any one of the several experimental plans promises rea- 
 sonably satisfactory results. One plan is to employ the 
 one-group method, to expose S1 to an incompletely under- 
 stood piece of literature and measure the resulting enjoy- 
 ment, and then to expose S1 to the same piece of literature 
 after an understanding of it is taught or while an under- 
 standing of it is being given and measure the resulting enjoy- 
 ment. The difference between these two FT’s gives the 
 desired answer. If it is suspected that the conclusion holds 
 only for the particular type and difficulty of the piece of 
 literature employed, the experiment may be repeated with a 
 variety of pieces of literature. 
 
 Another plan is to employ the one-group method, to select 
 two pieces of literature which are known to be or may be 
 assumed to be equal in their appeal when both are incom- 
 pletely understood or completely understood and equally so 
 in both cases. To S1, however, one of these equated pieces 
 of literature is incompletely understood while the other is 
 completely understood. The difference in amount of enjoy- 
 ment evoked from S1 when these two pieces are presented 
 gives the desired answer. As before, various pairs of speci- 
 mens may be presented. 
 
 Still another plan is to employ equivalent groups. S1 
 may be exposed to a piece of literature which is incompletely 
 understood and the resulting enjoyment measured. S2 may 
 be exposed to the identical piece of literature after under- 
 standing of it has been given or while understanding is 
 being given, and the resulting enjoyment may be measured. 
 As before, various pieces of literature may be used or vari- 
 ous degrees of understanding may be imparted. 
 
 The rotation method is inappropriate. Incomplete under- 
 standing may precede completer understanding without seri- 
 ous carry-over, but to reverse this order of sequence, as 
 required by the rotation method, is impossible. 
 
 No IT’s need be given, for the degree of enjoyment of a 
 piece of literature before the S has been exposed to it may 
 be assumed to be zero. 
 
 No little ingenuity will be required to devise a satisfactory 
 test of enjoyment. Any one of many methods may be em- 
 ployed. Subtle physiological indices of enjoyment may be 
 recorded, or the pupils may be asked to choose between a 
 second exposure to the piece of literature in question and 
 other alternatives of reasonably constant and equal appeal, 
 or the pupils may rate the piece of literature in comparison 
 with the enjoyment derived from other common experiences 
 of varying satisfyingness, or a secret record may be kept 
 of the amount of subsequent use made of the piece of 
 literature when it is in the class library, and so on. 
 
 PROBLEM 11. What Is the Effect upon Teaching Effi- 
 ciency and Length of Service in Teaching of a Sabbatical 
 Year for Public School Teachers? 
 
 EF1 is a Sabbatical year. EF2 is no Sabbatical year. 
 
 The one-group method is not appropriate, because the 
 problem assumes that the EF is to be applied throughout the 
 teaching life of the teacher. Also one of the measurements 
 stipulated, namely, length of service, assumes the entire 
 teaching life. The equivalent-groups method is applicable, 
 and it is the only method which is applicable. 
 
 S1 is a group of public school teachers to whom EF1 is 
 applied and who are otherwise equal to and under conditions 
 comparable with S2. 
 
 Initial, intermediate, and final tests of teaching efficiency 
 are desirable for both S’s. Only FT’s of length of service 
 for both S’s are necessary or possible. The various periodic 
 intermediate tests will reveal whether Sabbatical years have 
 a cumulative effect or a decreasing effect, and whether 
 there comes a time where they no longer contribute to teach- 
 ing efficiency. 
 
 Since few experimenters have the patience or confidence 
 in their own longevity to wait a lifetime for the completion 
 of such an experiment, the investigational rather than the 
 experimental method is likely to be employed. 
 
 PROBLEM 12. How Do Individual Scores Obtained on 
 National Intelligence Scale A Compare with Those on Scale 
 B for the Same Pupils? 
 
 
 EF1 is application of National Intelligence Test, Scale A. 
 EF2 is application of Scale B of the same test. 
 
 The one-group method is required. There is some trans- 
 fer from EF1 to EF2 such as practice effect, but this can- 
 not be avoided. It can be largely eliminated by statistical 
 methods. 
 
 This experiment is unique in that the EF’s and FT’s are 
 identical. No IT’s are required. 
 
 The difference between FT1 and FT2 may be determined 
 by computing the coefficient of correlation between the Scale 
 A and Scale B scores, or by computing the net difference 
 (unreliability) between the two series of scores as was done 
 in Table 13. 
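
 The correlation between the two series of scores can be sketched as follows. The scores are hypothetical. The mean gain from Scale A to Scale B is included only as a rough indication of practice effect; that use is an assumption of this sketch and is not the statistical correction the text has in mind.

```python
import math

def pearson_r(x, y):
    """Product-moment coefficient of correlation between two paired series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores of the same six pupils on the two scales.
scale_a = [95, 110, 102, 88, 120, 105]
scale_b = [99, 112, 108, 90, 121, 111]

r = pearson_r(scale_a, scale_b)
mean_gain = sum(b - a for a, b in zip(scale_a, scale_b)) / len(scale_a)
print(round(r, 3), mean_gain)
```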
 
 Thus this experiment is unique in three ways. The EF’s 
 and FT’s are identical. Transfer from one EF to a succeed- 
 ing EF is eliminated statistically. Novel methods are sug- 
 gested for computing the difference between C1 and C2. 
 
 PROBLEM 13. What Effect in Securing Order Will a Beau- 
 tiful Picture Placed in the Front of a Room Have Upon an 
 Unruly Boy Who Loves Art? 
 
 EF1 is no picture in front of room. EF2 is a beautiful 
 picture in front of room. 
 
 The one-group method or rotation method is the most 
 feasible, owing to the difficulty of equating unruly boys 
 who love art. 
 
 Assuming the one-group method, S is an unruly boy who 
 loves art. S has applied to him, in order, IT1 of unruliness, 
 EF1, FT1 of unruliness, EF2, FT2 of unruliness. FT1 
 may be used as the IT2. This experimental unit may and 
 should be repeated many times to make certain that any 
 differences observed in the C’s are not accidental. 
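
 The repeated experimental unit can be sketched as below, using FT1 as IT2 in each repetition. The unruliness counts (say, disturbances per day) are hypothetical, as is the choice of that particular measure.

```python
from statistics import mean

# Hypothetical repetitions of the unit: (IT1, FT1, FT2) measures of unruliness.
# FT1 closes the no-picture period (EF1) and also serves as IT2 for the picture period (EF2).
units = [
    (10, 10, 7),
    (9, 10, 6),
    (11, 10, 8),
]

c1 = [ft1 - it1 for it1, ft1, ft2 in units]   # change under EF1, no picture
c2 = [ft2 - ft1 for it1, ft1, ft2 in units]   # change under EF2, picture present

m_c1 = mean(c1)
m_c2 = mean(c2)
print(m_c1, m_c2, m_c2 - m_c1)   # a negative difference means less unruliness under EF2
```

 Averaging over many repetitions is what guards against mistaking an accidental fluctuation in one unit for an effect of the picture.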
 
 The foregoing experiment is a particularly difficult one 
 to carry through successfully. The influence of the picture, 
 though real, is likely to be so subtle as to have its effects 
 masked by one of a hundred other influences playing upon 
 the pupil. When S is only one pupil the probability of 
 large changes due to irrelevant influences is especially great. 
 
 PROBLEM 14. To Determine the Relation Between Pla- 
 teaus on the Learning Curve and Recall. 
 
 In its present form the problem is so vaguely stated that 
 an analysis of it is impossible. What is really wanted is to 
 know whether pupils who have plateaus in their learning 
 curves are better able to recall or reproduce what is learned 
 at some later date. 
 
 EF1 is plateau or plateaus in learning curve. EF2 is a 
 learning curve without plateaus. 
 
 This experiment is peculiar in that the experimenter can- 
 not control the application of the EF’s. His only recourse 
 is to have a large group of pupils learn something, to plot 
 their learning curves, to single out those who show a plateau 
 or plateaus in their learning curve, to match them with a 
 group of pupils who show no plateaus in their learning 
 curves but who are otherwise equivalent as shown by tests 
 given prior to the beginning of the experiment, and finally 
 to measure the difference in the ability of these two groups 
 to recall what has been learned. 
 
 No IT’s need be given though it is important to know 
 that the two groups are equivalent in general ability to recall 
 what has been learned. If this is not known, it cannot be 
 said that plateaus have caused the difference in ability to 
 recall. They may be the effect or may merely be asso- 
 ciated with a certain recall ability. 
 
 Since the purpose of the experiment is to learn whether 
 learning curves plus plateaus cause or are correlated with 
 recall which is superior to that caused by or associated 
 with learning curves minus plateaus, no control EF and S 
 are required. For purposes of discussion, however, let us 
 suppose that the problem calls for a knowledge of the exact 
 contribution to recall of learning curves plus plateaus, i.e., 
 of learning plus a period or periods of little or no progress. 
 Still no control EF would be required because the contribu- 
 tion of irrelevant factors to recall will be substantially zero. 
 If the experiment continues over a long period mere matur- 
 ing might contribute some power of recall. In this case a 
 control EF and S could be used to advantage. 
 
 If, however, the purpose of the experiment is to deter- 
 mine the amount of contribution of plateaus rather than 
 learning curves plus plateaus, a control EF, that is, an EF 
 of learning curves with plateaus absent, is required. EF2, 
 above, is just such a control EF. But here is a difficulty. 
 Is EF2 identical with EF1 except for the plateau feature of 
 EF1? Is a plateau merely an addition to a learning curve 
 with a plateau lacking, or is a plateau an integral portion of 
 its curve? If we affirm the latter, then it becomes impos- 
 sible to isolate and measure the effect of plateaus; we must 
 always measure the effect of plateaus-imbedded-in-learning- 
 curves. 
 
 PROBLEM 15. Which Will Give Better Results in Baking, 
 to Put an Angel-food Cake Into a Gas Oven Just Lighted 
 or Into One of Medium Temperature? 
 
 EF1 is a just lighted gas oven. EF2 is a gas oven which 
 has reached a medium temperature. 
 
 The one-group method or rotation method will not do. 
 Since the S is a set of angel-food cake-dough it could 
 not very well be baked twice. The carry-over will be 
 enormous, to say the least. The equivalent-groups method 
 is required, i.e., two sets of angel-food cake-dough made 
 according to identical recipes, or taken from the same 
 mixture. 
 
 The IT’s can be assumed to be zero. The FT’s should be 
 various tests of the appearance, deliciousness, and digesti- 
 bility of the cake baked according to each of the EF’s. 
 
 The only difficulty in this experiment is to identify the S 
 and the EF. It is the cake dough whose change by the two 
 varieties of temperature is of primary concern. The cake 
 dough is to these EF’s just as pupils are to the customary 
 EF’s. 
 
 
 PROBLEM 16. Are Girls More Interested in Learning 
 Manipulative Processes in Junior High School Than in 
 Senior High School? 
 
 EF1 is the junior high school age for girls. EF2 is the 
 senior high school age for girls. 
 
 Either the one-group or equivalent-groups method may 
 be employed. If the one-group method is employed, a group 
 of junior high school girls should be tested, in some way, 
 as to the strength of their interest in learning manipulative 
 processes. When these same girls have reached the senior 
 high school age they can, then, be tested again to see whether 
 their interest in learning manipulative processes has in- 
 creased. 
 
 If the equivalent-groups method is employed, the experi- 
 ment becomes essentially an investigation. A group of 
 senior high school girls and another group of junior high 
 school girls should be selected so as to be equivalent, in all 
 respects, except for the senior and junior high school differ- 
 entiation with all of its concomitant differentiation. Stated 
 more simply, a group of junior high school girls should be 
 so selected that they will be equivalent when they become 
 senior high school girls, to a previously selected group of 
 present senior high school girls. 
 
 Each group can be tested for its interest in learning 
 manipulative processes. The C for each group may be 
 assumed to be the same as the FT. The difference between 
 the M’s of the two series of C’s shows the difference between 
 the EF’s. 
 
 PROBLEM 17. Does Observation of Skilled Teaching Aid 
 Normal School Students to Grasp Facts and Principles of 
 Teaching and to Apply Them? 
 
 EF1 is observation of skilled teaching. EF2 is the 
 absence of such observation. 
 
 Since the one-group and rotation methods cannot be used 
 because of carry-over, the equivalent-groups method is re- 
 quired. One group of normal school students will observe 
 skilled teaching while an equivalent group will forego such 
 observation. 
 
 Both IT’s and FT’s covering all or a random sampling of 
 the facts and principles of teaching will need to be con- 
 structed and applied to both groups. 
 
 All the foregoing is simple enough. The real difficulty is 
 in devising some way to measure each group’s ability to 
 apply facts and principles learned. The only satisfactory 
 way to make the test is to organize an experiment within an 
 experiment, so as to discover just how well the normal school 
 students can actually teach pupils. In sum, the best way 
 for these students to manifest superior changes in them- 
 selves is to show that they can make superior measurable 
 changes in pupils. 
 
 Two groups of equivalent pupils can be selected. The 
 EF1 normal school students can be assigned to teach, in 
 rotation, say, one group of pupils, and the EF2 students can 
 be assigned to teach the other group of pupils. If the pupils 
 are sufficiently numerous each normal school student may 
 be assigned to her own group of pupils exclusively. The 
 specific lessons to be taught may be assigned by the experi- 
 menter and tests for the pupils may be constructed to meas- 
 ure the effect of these lessons. Or the experiment may be 
 permitted to run for a considerable period and general tests 
 may be given. Initial and final tests upon the pupils will 
 show which normal school group has been most successful 
 in applying facts and principles learned to the real task of 
 making desirable changes in pupils. Thus the best way to 
 measure the normal school student is to measure her pupils. 
 
 PROBLEM 18. Is the Per Cent of Failures Higher Among 
 Pupils Who Enter the Senior High School Direct from the 
 Eighth Grade or From the Junior High School? 
 
 EF1 is entrance to senior high school from eighth grade. 
 EF2 is entrance from junior high school. 
 
 This is not so much an experiment as a causal investigation,
 and must of necessity be an equivalent-groups investigation.
 A group of students entering from the junior high
 school must be found who are equivalent, except for con- 
 comitant differentiations, to a group entering from the regu- 
 lar eighth grade. 
 
 The FT is the record of failures for each of these groups 
 during the high school period. In computation, the C may 
 be considered identical with FT. 
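
 The comparison of failure records can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name per_cent_failures and all the counts are hypothetical and do not come from any actual investigation. The per cent of failures in each group is taken directly as that group's FT, and therefore as its C.

```python
def per_cent_failures(failed, enrolled):
    """Per cent of pupils in a group who failed during the high school period."""
    return 100.0 * failed / enrolled


# Hypothetical counts, invented for illustration only.
ft_eighth_grade = per_cent_failures(failed=18, enrolled=120)  # EF1 group
ft_junior_high = per_cent_failures(failed=12, enrolled=120)   # EF2 group

# Since C is treated as identical with FT, the difference between
# the two FT's is the difference attributed to the EF's.
difference = ft_eighth_grade - ft_junior_high
print(ft_eighth_grade, ft_junior_high, difference)
```

 Such a difference is interpretable only in so far as the two groups are really equivalent apart from the EF's and their concomitants.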
 
 PROBLEM 19. At How Much Greater Saving of Time and
 Effort Can a Group of Normal Seven-year-old Children 
 Learn to Read Than a Group of Normal Six-year-old Chil- 
 dren? 
 
 EF1 is normal seven-year-olds. EF2 is normal six-year- 
 olds. 
 
 The one-group and rotation methods are inappropriate. 
 If the six-year-olds and seven-year-olds are truly normal, 
 the six-year-olds will in one year be equivalent to the pres- 
 ent condition of the seven-year-olds. In sum, the conditions 
 of the experiment require equivalent groups except for the 
 EF difference and its concomitants. They also require both
 groups to be equally unable to read at present, though not 
 necessarily of equal capacity to learn to read. 
 
 One or more IT’s and FT’s of reading ability, with the 
 intervening teaching of reading by the same or equated 
 teachers to both groups, will show which group can learn 
 more rapidly. The computation will follow the regular 
 computation model. 
 
 All the foregoing appears quite simple. But there is a 
 hidden difficulty so great as to be well nigh insurmountable. 
 The foregoing plan shows which group learns to read more 
 quickly. Even though the experiment favors the seven-year- 
 olds, it does not show that, in the long run, it is more eco- 
 nomical to delay learning to read until seven years of age. 
 If the six-year-olds learn to read, they can spend the read- 
 ing period during their seventh year learning something 
 else. If the six-year-olds learn to read, even though at some 
 labor, they have an extra year of access to printed material.
 If the six-year-olds do not spend their time learning to read, 
 they may spend their time learning something else which 
 may be proportionately difficult and valuable. There are 
 few abilities which a ten-year-old cannot learn more easily 
 than a six-year-old, but this does not mean that everything 
 should be postponed until pupils are ten years old. Decision 
 as to what to postpone involves a consideration of capacity, 
 interest, need, injury, and the total work of the school. The 
 practical problem cannot be solved by the simple experi- 
 mental plan outlined above. 
 
 PROBLEM 20. What Specific Abilities Are Required for 
 Success as a Telegrapher? 
 
 The EF’s are unknown specific abilities. The problem 
 here is not to determine whether a given specific ability con- 
 tributes or will contribute to success as a telegrapher. The 
 problem is to discover promising specific abilities with which 
 to experiment. In sum, the problem is to discover some 
 hypothesis to be a basis for experimentation. This is always 
 the first step in research. 
 
 One plan of procedure is to study the work of a tele- 
 grapher and logically infer what specific abilities are needed. 
 
 Another plan is to select two groups, one of which is com- 
 posed of successful telegraphers and the other of which is 
 composed of unsuccessful telegraphers, but where both other- 
 wise appear much alike. Observation of the work of the two 
 groups and tests of them may bring to light suggestive 
 differences. 
 
 Another plan is to choose strikingly successful and strikingly
 unsuccessful telegraphers, and to contrast these opposites
 in close proximity. This is the most drastic possible
 method of shaking out into the field of consciousness those 
 differences which spell success or failure as a telegrapher. 
 
 Once specific abilities have been hit upon in such ways, 
 their contribution to success as a telegrapher may be deter- 
 mined experimentally, or by an equivalent-groups causal in- 
 vestigation, or by a partial correlation investigation. 
 
 PROBLEM 21. In a Recitation, Can a Class of Girls Bluff 
 a Teacher More Easily Than a Class of Boys? 
 
 EF1 is a class of girls. EF2 is an equivalent class of boys.
 S is the teacher, or, better, several teachers of both sexes, 
 since an experiment of this sort needs repetition on both 
 men and women teachers. 
 
 The rotation method is most appropriate because it per- 
 mits the experimenter to rotate out differences in nature of 
 lesson, teacher’s experience in teaching it, and the like. Thus 
 the experimenter can request a teacher to teach a specific 
 lesson to a class of girls, and then to teach this same lesson 
 to a class of generally equivalent boys. Next he can ask 
 the teacher to teach another lesson to both boys and girls, 
 only, in this case, the boys should be taught first and the 
 girls second. 
 
 While each lesson is being taught or afterward, the ex- 
 perimenter must measure the amount of bluffing which oc- 
 curs. The C may be treated as identical with this FT, so 
 that a regular rotation computation model will apply. 
 
 PROBLEM 22. To What Extent Are Children in the Upper 
 Grades of the Elementary School Capable of Selecting on 
 Their Own Initiative Statements of Most Worth in Their 
 History Reading? 
 
 EF1 is attainment of upper grade status. EF2 is, if anything,
 the mere absence of such attainment. S is upper
 grade pupils. 
 
 Of necessity the one-group method must be employed. 
 The whole experiment, if such it may be called, is very sim- 
 ple. It merely consists in locating upper grade pupils and 
 in testing the extent to which they can select on their own 
 initiative statements of most worth in their histories. 
 
 IT may be assumed to be zero, so that FT becomes C1.
 Similarly all the C2’s may be considered zero. Thus the 
 effect of upper-gradeness is shown by a straight measure- 
 ment of the present status of upper-grade children in the 
 trait in question. 
 
 PROBLEM 23. What Is the Best Order to Teach Geog- 
 raphy to Fourth-grade Pupils, the Concrete and Then the 
 Abstract, or the Abstract Followed by the Concrete? 
 
 EF1 is concrete followed by abstract. EF2 is abstract
 followed by concrete. S is fourth-grade pupils. 
 
 Owing to the possibility of carry-over, the equivalent- 
 groups method is preferable. One fourth-grade group can 
 be taught according to EF1 and an equivalent fourth grade
 according to EF2. 
 
 IT and FT tests, testing the degree of mastery of geog- 
 raphy lessons at the beginning and end of the experiment, 
 should be applied to both groups.
 
 The general plan for this experiment is quite simple. The 
 actual carrying out of the experiment would involve much 
 careful labor. It is unique in that the two EF’s appear to 
 be rotated when they really are not. The purpose of the 
 experiment is not to evaluate abstract vs. concrete but 
 abstract after concrete vs. concrete after abstract. A simi- 
 larly deceptive problem is this: Which method brings the 
 best results in beginning reading—to teach the printed forms 
 of the words first and follow with the script forms, or the 
 reverse order? Another like deceptive problem is this: 
 What is the best possible order of subjects during the school 
 day? Here the various EF’s are all possible combinations 
 of order of school subjects. As many equivalent groups will 
 be required as there are EF’s. There may be a carry-over 
 from the first subject taught to the second subject, or from 
 the second subject to the third subject, and so on. But 
 carry-over from one part of an EF to another part of an 
 EF is not an irrelevant factor. Carry-over is an irrelevant 
 factor only where there is carry-over from one total EF to 
 another total EF. 
 
 PROBLEM 24. Can Anything Done Well By One Indi- 
 vidual Be So Analyzed That the Ability May Be Imparted 
 to Others? 
 
 For purposes of experimentation, the above problem will
 be clearer if phrased thus: Will a particular person’s analysis
 of what some individual does remarkably well confer that
 remarkable ability upon another? 
 
 Here EF1 is some particular person’s analysis of the
 process by which some gifted person achieves certain ends.
 EF2 is the absence of EF1. S is some individual to whom
 EF1, or the analysis, is to be taught in hopes of endowing
 him with this rare ability.
 
 The one-group method is required, for EF1 must be ap- 
 plied to a particular individual. 
 
 An IT or IT’s showing S’s initial status in the ability in 
 question needs to be followed, after EF1 has been applied, 
 by an FT or FT’s. These FT’s permit the computation of 
 C or C’s and show whether a particular individual can 
 analyze and impart the ability in high degree to another 
 particular individual. To make the experiment conclusive, 
 many individuals will have to attempt to analyze the process 
 and impart the ability to many S’s. 
 
 PROBLEM 25. To See What Projects Second-grade Pupils 
 Will Initiate. 
 
 EF1 is the school environment and internal nature of
 second-grade pupils. EF2 is the mere absence of EF1. S
 is a group of second-grade pupils. 
 
 The problem calls for the one-group method in its most 
 elementary form, for the experiment consists solely in plung- 
 ing pupils with certain natures into a certain medium, and 
 then watching to see what happens. This elementary sort 
 of research is quite fundamental, and, when operated by a 
 keen observer, frequently leads to very valuable conclusions. 
 
 PROBLEM 26. Do Commas After Dependent Clauses Help 
 the Reader in Speed or Accuracy of Reading? 
 
 EF1 is commas after dependent clauses. EF2 is the mere
 absence of EF1, which is to say it is the absence of commas 
 at such places. S is not defined and hence may be any group 
 that can read. 
 
 The equivalent-groups method can be employed but it is 
 not the best method. The one-group method cannot be used, 
 for there will be a carry-over of acquaintance with material, 
 if certain material containing commas is followed by that 
 same material without the commas, and vice versa. This 
 is one of those rare situations where the one-group method 
 is inappropriate, but where the rotation experiment may be 
 used to advantage by alternating the content of the material. 
 The following shows a possible plan: 
 
            Period I                 Period II

 Group A    Material 1—Commas        Material 2—No commas
 Group B    Material 1—No commas     Material 2—Commas
 
 The speed and accuracy scores made by Group A on “Material
 1—Commas” can be combined with the speed and accuracy
 scores, respectively, made by Group B on “Material 2—Commas.”
 These can be compared with the combined speed
 scores and accuracy scores for “Material 1—No commas”
 and “Material 2—No commas.”
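
 The pooling just described can be sketched in Python. This is a minimal illustration, not a prescribed computation model: the speed scores are hypothetical, and the function name combined_mean is an assumption introduced here. Each condition draws once on each group and once on each material, so that differences between groups and between materials are rotated out of the comparison.

```python
from statistics import mean

# Hypothetical speed scores (words per minute), invented for illustration.
# Keys are (group, material, condition), following the plan above.
scores = {
    ("A", "Material 1", "commas"): [112, 98, 105],
    ("A", "Material 2", "no commas"): [101, 95, 99],
    ("B", "Material 1", "no commas"): [104, 97, 100],
    ("B", "Material 2", "commas"): [110, 102, 107],
}


def combined_mean(all_scores, condition):
    """Mean of every score earned under one condition, pooled over groups and materials."""
    pooled = [
        score
        for (_group, _material, cond), values in all_scores.items()
        if cond == condition
        for score in values
    ]
    return mean(pooled)


mean_commas = combined_mean(scores, "commas")
mean_no_commas = combined_mean(scores, "no commas")

# A positive difference would favor commas after dependent clauses.
difference = mean_commas - mean_no_commas
print(round(mean_commas, 2), round(mean_no_commas, 2), round(difference, 2))
```

 Accuracy scores would be pooled in the same way, in a separate table, so that speed and accuracy are never mixed.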
 
 PROBLEM 27. Does Brightness Facilitate Progress Through 
 School? 
 
 EF1 is brightness. EF2 is absence of EF1. The subjects 
 are school pupils. 
 
 The one-group experimental method cannot be employed 
 because it is impossible for pupils to be dull for a period and 
 then become bright or be bright and then become dull. For 
 the same reason, the rotation method cannot be used. The 
 equivalent groups method is the correct one for this problem. 
 
 S1 is a group of pupils who are known or are shown to
 be of a defined brightness. S2 is another group who are
 known to be of a defined dullness. Except for these intelli- 
 gence differences and their concomitants the two groups 
 should be equivalent. They should be equivalent in chrono- 
 logical age, grade position in school, i.e., beginning first 
 grade or kindergarten children, etc. 
 
 Since the measure of C is the rate of progress through 
 school no initial tests, except of brightness, are required. 
 The answer to the problem will be shown by the FT, i.e., the
 number of years required on the average for each group 
 to complete a defined number of school grades. 
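
 As a small illustration, the FT for each group can be computed as below. The figures are hypothetical, and the variable names are assumptions made for this sketch only.

```python
from statistics import mean

# Hypothetical number of years each pupil required to complete
# the same defined number of school grades (invented figures).
years_bright = [5.0, 5.5, 6.0, 5.5]  # S1, the bright group
years_dull = [6.5, 7.0, 6.0, 7.5]    # S2, the dull group

m1 = mean(years_bright)  # average years required by S1
m2 = mean(years_dull)    # average years required by S2

# Years saved, on the average, by the bright group.
years_saved = m2 - m1
print(m1, m2, years_saved)
```

 Fewer years required means more rapid progress, so the sign of the difference must be read accordingly.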
 
 PROBLEM 28. Does Genius Beget Genius? 
 
 EF1 is genius on the part of parents. EF2 is the absence
 of such genius, or a smaller quantity of it. 
 
 The one-group and rotation experimental methods are 
 inappropriate owing to the fact that parents cannot be 
 geniuses for a time and then become non-geniuses or vice 
 versa. Hence the equivalent-groups method must be used. 
 
 S1 is the product of the union of the sperm and ovum of
 genius parents. S2 is the product of the union of these elements
 from non-genius parents.
 
 No IT’s are required except to yield a measure of the 
 amount of each EF. The IT for the subjects may be as- 
 sumed to be zero. As soon as the offspring of each group 
 have sufficiently matured to make measurement practicable 
 an FT of intelligence may be applied. C1 and C2 will be
 identical with the two FT’s. M1 minus M2 will reveal the
 effect upon the intelligence of offspring of genius in the 
 parents. 
 
 To make it possible to separate the influence of germ 
 plasm and environmental influence, all children of both 
 groups should be placed under equally favorable environ- 
 mental influences immediately after conception or after birth, 
 at the latest. The equality of environment should be main- 
 tained until the FT’s are made. 
 
SELECTED REFERENCES FOR FURTHER READING 
 
 I. ONE-GROUP EXPERIMENT

 ARAI, TSURU.—Mental Fatigue; Teachers College, Columbia
 University, New York.

 BALDWIN, BIRD T.—Physical Growth of School Children;
 University of Iowa, Iowa City, 1919.

 BROOKS, F. D.—Changes in Mental Traits With Age; Teachers
 College, Columbia University, New York City.

 COY, GENEVIEVE L.—Interests, Abilities, and Achievements of a
 Special Class for Gifted Children; Teachers College, Columbia
 University, New York, 1922.

 FREEMAN, FRANK N.—Experimental Education; Houghton
 Mifflin Company, New York, 1916.

 JUDD, CHARLES H., AND OTHERS.—Reading: Its Nature and
 Development; University of Chicago, Chicago, 1918.

 RUSK, ROBERT R.—Experimental Education; Longmans, Green
 and Company, London, 1919.

 WHIPPLE, G. M.—Classes for Gifted Children; Public School
 Publishing Company, Bloomington, Illinois, 1919.
 
 II. EQUIVALENT-GROUP EXPERIMENT 
 
 COURTIS, S. A.—Measuring the Effects of Supervision in
 Geography; School and Society, July 19, 1919.

 CUMMINS, R. A.—Improvement and the Distribution of Practice;
 Teachers College, Columbia University, New York.

 FROST, NORMAN.—A Comparative Study of Achievement in Country
 and Town Schools; Teachers College, Columbia University,
 New York.

 KIRBY, T. J.—Practice in the Case of School Children; Teachers
 College, Columbia University, New York.

 PITTMAN, M. S.—The Value of School Supervision; Warwick and
 York, Baltimore, 1921.
 
 III. ROTATION EXPERIMENT
 
 Heck, W. H.—A Study of Mental Fatigue; J. P. Bell Company, 
 Lynchburg, Virginia, 1913. 
 
 THORNDIKE, E. L.; MCCALL, WM. A., AND CHAPMAN, J. C.—
 Ventilation in Relation to Mental Work; Teachers College,
 Columbia University, New York.

 WEBER, J. J.—The Relative Effectiveness of Some Visual Aids in
 Elementary Education (to be published soon).
 
 IV. CAUSAL INVESTIGATION

 DENBURG, J. K. V.—Causes of the Elimination of Students in
 Public Secondary Schools of New York City; Teachers College,
 Columbia University, New York.

 HOLLINGWORTH, L. S., AND WINFORD, C. A.—The Psychology of
 Special Disability in Spelling; Teachers College, Columbia
 University, New York, 1918.

 O’BRIEN, F. P.—A Study of School Records of Pupils Failing in
 Academic or Commercial High School Subjects; Teachers
 College, Columbia University, New York.

 REAVIS, GEORGE H.—Factors Controlling Attendance in Rural
 Schools; Teachers College, Columbia University, New York,
 1920.
 
 V. DESCRIPTIVE INVESTIGATION 
 
 BUCKNER, CHESTER A.—Baltimore School Survey Series; Board 
 of School Commissioners, Baltimore, 1922. Educational 
 Diagnosis of Individual Pupils; Teachers College, Columbia 
 University, New York, 1919. 
 
 Cleveland School Survey Series; Russell Sage Foundation, New 
 York, 1916. 
 
 Gary School Survey Series; General Education Board, New 
 York, 1919. 
 
 KELLY, F. J.—Teachers’ Marks; Their Variability and Standardization;
 Teachers College, Columbia University, New York.

 Kentucky State Educational Survey Series; General Education
 Board, New York, 1922.

 KRUSE, PAUL.—The Overlapping of Attainments in Certain
 Grades; Teachers College, Columbia University, New York,
 1918.

 MCCALL, WM. A.—How to Measure in Education; The Macmillan
 Company, New York, 1922.

 MEAD, C. D.—The Relations of General Intelligence to Certain
 Mental and Physical Traits; Teachers College, Columbia
 University, New York.

 MORRISON, J. C.—Legal Status of City School Superintendents;
 Warwick and York, Baltimore, 1921.

 SIMPSON, B. R.—Correlations of Mental Abilities; Teachers College,
 Columbia University, New York.

 Virginia State School Survey Series; World Book Company,
 Yonkers, New York, 1920.
 
 VI. EXPERIMENTAL MEASUREMENTS 
 
 BURGESS, MAY AYRES.—Measurement of Silent Reading; Russell
 Sage Foundation, New York, 1920.

 BURT, CYRIL.—Mental and Scholastic Tests; P. S. King and Sons,
 2 and 4 Great Smith St., Victoria, Westminster, S. W.,
 England.

 CHAPMAN, J. CROSBY.—Trade Tests; Henry Holt and Company,
 New York, 1921.

 DEWEY, EVELYN, CHILD, EMILY, AND RUML, BEARDSLEY.—
 Methods and Results of Testing School Children; E. P. Dutton
 and Company, New York, 1920.

 HILLEGAS, MILO B.—Scale for the Measurement of Quality in
 English Composition by Young People; Teachers College,
 Columbia University, New York, 1912.

 KUHLMANN, FRED.—Handbook of Mental Tests; A Further Revision
 and Extension of the Binet-Simon Scale; Warwick and
 York, Baltimore, 1922.

 MCCALL, WM. A.—How to Measure in Education; The Macmillan
 Company, New York, 1922.

 MONROE, WALTER S.—Measuring the Results of Teaching;
 Houghton Mifflin Company, New York, 1918.

 MONROE, WALTER S.; DE VOSS, J. C., AND KELLY, F. J.—Educational
 Tests and Measurements; Houghton Mifflin Company,
 New York, 1913.

 PINTNER, RUDOLF, AND PATERSON, DONALD.—A Scale of Performance
 Tests; Warwick and York, Baltimore, 1917.

 TERMAN, LEWIS M.—The Measurement of Intelligence; Houghton
 Mifflin Company, New York, 1916.
 
 TOOPS, H. A.—Trade Tests in Education; Teachers College,
 Columbia University, New York.

 VAN WAGENEN, M. J.—Historical Information and Judgment of
 Elementary School Pupils; Teachers College, Columbia University,
 New York, 1919.

 VOELKER, PAUL F.—Function of Ideals and Attitudes in Social
 Education; Teachers College, Columbia University, New
 York.

 WHIPPLE, G. M.—Manual of Mental and Physical Tests, Vols.
 I and II; Warwick and York, Baltimore, 1910.

 WILSON, G. M., AND HOKE, K. J.—How To Measure; The Macmillan
 Company, New York, 1921.

 WOODY, CLIFFORD.—Measurements of Some Achievements in
 Arithmetic; Teachers College, Columbia University, New
 York, 1916.

 YERKES, R. M., BRIDGES, J. W., AND HARDWICK, ROSE S.—A
 Point Scale for Measuring Mental Ability; Warwick and
 York, Baltimore, 1915.
 
 YOAKUM, CLARENCE S., AND YERKES, R. M.—Army Mental 
 Tests; Henry Holt and Company, New York, 1920. 
 
 VII. STATISTICAL AND GRAPHIC METHODS 
 
 ALEXANDER, CARTER.—School Statistics and Publicity; Silver 
 Burdett and Company, New York, 1919. 
 BRINTON, WILLARD C.—Graphic Methods for Presenting Facts; 
 The Engineering Magazine Company, New York, 1917. 
 BROWN, WILLIAM, AND THOMPSON, G. H.—Essentials of Mental
 Measurement; The Macmillan Company, New York, 1921.

 KELLEY, T. L.—Educational Guidance; An Experimental Study
 in the Analysis and Prediction of Ability of High School
 Pupils; Teachers College, Columbia University, New York,
 1914.

 MCCALL, WM. A.—How to Measure in Education; The Macmillan
 Company, New York, 1922.

 RUGG, HAROLD O.—Application of Statistical Methods to Education;
 Houghton Mifflin Company, New York, 1917.

 THORNDIKE, EDWARD L.—Introduction to the Theory of Mental
 and Social Measurements; Teachers College, Columbia University,
 New York, 1913.

 YULE, G. UDNY.—An Introduction to the Theory of Statistics;
 C. Griffin and Company, London, 1912.
 
 VIII. AIDS IN STATISTICAL COMPUTATIONS

 BARLOW, PETER.—Tables of Squares, Cubes, Square-Roots, Cube-Roots,
 and Reciprocals of all Integers, Numbers up to
 10,000; E. Spon, New York.

 CRELLE, A. L.—Rechentafeln; G. Reimer, Berlin, Germany, 1907.

 PEARSON, KARL.—Tables for Statisticians and Biometricians;
 Cambridge University Press, Cambridge, England, 1914.

 PETERS, J.—Neue Rechentafeln fur Multiplikation und Division;
 G. Reimer, Berlin, Germany.
 
 IX. GENERAL 
 
 DEWEY, JOHN, AND DEWEY, EVELYN.—Bibliography of Tests for
 Use in Schools; World Book Company, Yonkers, New York,
 1921. Schools of Tomorrow; E. P. Dutton Company, New
 York, 1915.

 HOLMES, HENRY W., AND OTHERS.—A Descriptive Bibliography
 of Measurement in Elementary Subjects; Harvard University
 Press, Cambridge, Massachusetts, 1917.

 Journal of Educational Psychology; Warwick and York, Baltimore.

 Journal of Educational Research; Public School Publishing Company,
 Bloomington, Illinois.

 NATIONAL SOCIETY FOR THE STUDY OF EDUCATION.—Year Books;
 Public School Publishing Company, Bloomington, Illinois.

 PEARSON, KARL.—The Grammar of Science; Adam and Charles
 Black, London, 1900.

 RUGER, GEORGIE J.—Bibliography on Psychological Tests;
 Bureau of Educational Experiments, New York, 1918.

 Teachers College Contribution to Education Series; Teachers
 College, Columbia University, New York.

 THORNDIKE, EDWARD L.—Educational Psychology, Vols. I, II and
 III; Teachers College, Columbia University, New York, 1914.

 WARD, GILBERT O.—The Practical Use of Books and Libraries;
 The Boston Book Company, Boston, 1911.
 
SUMMARY OF SYMBOLS AND FORMULAE 
 
 
 
 A.Q. = accomplishment quotient = $\frac{\text{E.A.}}{\text{M.A.}} = \frac{\text{E.Q.}}{\text{I.Q.}}$
 Ar.A. = arithmetic age
 Ar.A.Q. = arithmetic accomplishment quotient = $\frac{\text{Ar.A.}}{\text{M.A.}}$
 Ar.Q. = arithmetic quotient = $\frac{\text{Ar.A.}}{\text{C.A.}}$
 A.M. = assumed mean
 B = brightness = T + B correction
 Ba, Be, Bi, Br = brightness in arithmetic, education, intelligence
 and reading, respectively
 C = (1) change produced by an experimental factor
 (2) pupil classification = G + C correction
 CC = change produced by a control experimental factor
 CEF = control experimental factor
 C.A. = chronological age
 c = correction
 D = difference
 EC = experimental coefficient
 (1) for difference = $\frac{D}{2.78\,SDD}$
 (2) for coefficient of correlation = $\frac{r}{2.78\,SDr}$
 ECMEC = experimental coefficient of the mean experimental
 coefficient = $\frac{MEC}{2.78\,SDMEC}$
 ECMED = experimental coefficient of the mean equated
 difference = $\frac{MED}{2.78\,SDMED}$
 ED = equated difference
 EF = experimental factor
 E.Q. = educational quotient = $\frac{\text{E.A.}}{\text{C.A.}}$
 F = effort or efficiency = Te − Ti
 Fa = effort in arithmetic = Ta − Ti
 Fr = effort in reading = Tr − Ti
 f = frequency
 
 fx = deviation × number of frequencies
 FT = final test
 G = grade status
 INT = intermediate test
 I.Q. = intelligence quotient = $\frac{\text{M.A.}}{\text{C.A.}}$
 IT = initial test
 M = arithmetic mean
 M.A. = mental age
 MEC = mean experimental coefficient
 MED = mean equated difference
 N = total number
 N = $\frac{r_N (1 - r)}{r (1 - r_N)}$, from the Spearman self-correlation coefficient,
 where N is the number of tests required to yield
 a defined correlation $r_N$
 P = pupil
 PE = probable error
 PED = probable error of the difference
 PEM = probable error of the mean
 Q = quartile deviation = $\frac{Q_3 - Q_1}{2}$
 Q1 = 25 percentile
 Q3 = 75 percentile
 R.A. = reading age
 R.A.Q. = reading accomplishment quotient = $\frac{\text{R.A.}}{\text{M.A.}}$
 R.Q. = reading quotient = $\frac{\text{R.A.}}{\text{C.A.}}$
 r = product moment coefficient of correlation
 = $\frac{\Sigma xy}{\sqrt{\Sigma x^2}\,\sqrt{\Sigma y^2}}$
 = $\frac{\frac{\Sigma xy}{N} - c_x c_y}{\sqrt{\frac{\Sigma x^2}{N} - c_x^2}\,\sqrt{\frac{\Sigma y^2}{N} - c_y^2}}$ where assumed
 mean is used
 $r_N = \frac{N r}{1 + (N - 1) r}$ = correlation coefficient resulting
 when N forms of tests are used
 S = experimental subject, thing, or group
 SD or S.D. = standard deviation = $\sqrt{\frac{\Sigma x^2}{N}}$
 
 SDC = standard deviation of the changes
 SDD = standard deviation of the difference
 = $\sqrt{(SDM_1)^2 + (SDM_2)^2 - 2\,r_{12}\,(SDM_1)(SDM_2)}$
 SDM = standard deviation of the mean = $\frac{SD}{\sqrt{N}}$
 SDMEC = standard deviation of the mean experimental
 coefficient
 SDMED = standard deviation of the mean equated difference
 SDr = standard deviation of the coefficient of correlation
 = $\frac{1 - r^2}{\sqrt{N}}$
 SD median = standard deviation of the median
 SDS = standard deviation of the sum
 = $\sqrt{(SDM_1)^2 + (SDM_2)^2 + 2\,r_{12}\,(SDM_1)(SDM_2)}$
 Σfx or Σx = sum of the deviations
 T = 0.1 standard deviation of unselected 12-year-old
 children
 Ta, Te, Ti, etc. = T score in arithmetic, education, intelligence, etc.
 x = deviation
 y = deviation
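
 A worked sketch may help in reading the symbols for SDM, SDD, and EC together. The Python below is an illustration under stated assumptions, not the book's computation model: the changes are hypothetical, the function names are introduced here, the standard deviation is taken about the true mean with N in the denominator, and the correlation between the two groups (the argument r12) is assumed to be zero unless supplied.

```python
from math import sqrt
from statistics import mean, pstdev


def sd_of_mean(values):
    """SDM: the standard deviation divided by the square root of N."""
    return pstdev(values) / sqrt(len(values))


def experimental_coefficient(changes_1, changes_2, r12=0.0):
    """EC for a difference: D divided by 2.78 times SDD."""
    d = mean(changes_1) - mean(changes_2)
    sdm1 = sd_of_mean(changes_1)
    sdm2 = sd_of_mean(changes_2)
    # SDD: standard deviation of the difference between the two means.
    sdd = sqrt(sdm1 ** 2 + sdm2 ** 2 - 2 * r12 * sdm1 * sdm2)
    return d / (2.78 * sdd)


# Hypothetical changes (FT minus IT) for two equivalent groups.
c1 = [8, 10, 12, 10]  # group taught under EF1
c2 = [4, 6, 8, 6]     # group taught under EF2

ec = experimental_coefficient(c1, c2)
print(round(ec, 3))
```

 The function fails with a division by zero when every change within both groups is identical, since SDD is then zero; a real computation would need to guard against that case.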
 
INDEX 
 
 Absolute-worth scales, in question- 
 naires, 215, 216. 
 
 Accomplishment Quotient, 58-61,
 103.

 Age scale, evaluation of, 95-98.

 Army Beta non-verbal intelligence
 test, use of, 85.

 Assumed mean, 143.

 Attendance, Reavis’s investigation
 of, 209, 210, 213, 238, 239.

 B scale, construction of, 102-109. 
 
 Barton, and Dransfield, on teaching 
 of reading, 4. 
 
 Battery of tests, use in Liu’s study, 
 85; construction of, 138, 139. 
 Bennett, on equating of groups, 50,
 51, 73.
 
 Bibliography, making of survey of, 
 11-13; of equivalent groups meth- 
 od, 271; of one-group method, 
 271; of causal investigations, 272; 
 of rotation method, 272; of ex- 
 perimental measurements, 273, 
 274; general, 275. 
 
 Binet-Simon, 60, 130. 
 
 Brian, and Harter, 88. 
 
 Brightness in arithmetic, computa- 
 tion of pupil, 124; of class, 126. 
 
 Buckingham, 130. 
 
 C scale, construction of, 109, 110.
 
 Cattell, 130. 
 
 Causal investigations, methodology 
 of, 207-212; Reavis’s investiga- 
 tion, 209, 210, 213, 238, 239; pro- 
 cedure of, 212-244; analysis of 
 problems, 245-269; bibliography, 
 272.
 
 Chal Garo: 
 
 Chang, C. Y., 130. 
 
 Chang, Y. C., 130. 
 
 Chinese fundamentals of arithmetic 
 scale, 121-130. 
 
 Classification in arithmetic, compu- 
 tation of pupil, 125, 126; of class, 
 126. 
 
 Computation, special difficulties in, 
 206, 207.
 
 Correction, 143. 
 
 Correlation, and test reliability, 111; 
 in causal investigations, 224-244. 
 
 Courtis, and Thorndike, on cor- 
 rection formule, 116, 130. 
 
 Coy, 37. 
 
 Criteria, see Experimental measure- 
 ments. 
 
 Darwin, 208. 
 
 Dearborn non-verbal intelligence
 test, use of, 85.

 Descriptive investigations, bibliography,
 272, 273.

 Difference, computation of, 150.

 Difficulty test, construction of, 131-135.

 Distribution method, in questionnaires,
 215, 216.

 Dransfield, and Barton, on teaching
 of reading, 4.

 Equivalent groups method, descrip- 
 tion of, 18, 19, 40, 44; formule 
 for, 18, 19, 59; criteria for se- 
 lecting, 29-31, 35; computations 
 for, 161-186; bibliography, 271. 
 
 Errors, see Experimental errors. 
 
 Experimental coefficient, 154-158, 
 168, 174. 
 
 Experimental errors, avoidance of, 
 63-80. 
 
 Experimental factors, amount of, 
 81; changes produced by, 82. See 
 also Irrelevant factors. 
 
 Experimental investigations, analyses 
 of problems for, 245-269. 
 
 Experimental measurements, func- 
 tions of, 81; criteria, fundamental, 
 82, 83; for evaluation and con- 
 struction of, 83-93; bibliography, 
 273, 274. 
 
 Experimental methods, see One- 
 group, Equivalent groups and Ro- 
 tation method. 
 
 Experimental subjects, appropriate- 
 ness of, 37-38, 40-44; selection of, 
 38-40. 
 
 Experimentation, in education, prev- 
 alence of, 1, 2; value of, 3-5; 
 selection of problem, 6-9; formu- 
 lation of problem, 9-11. 
 
 Experiments, see Weber’s rotation, 
 Lacy’s rotation, Thorndike and 
 McCall’s rotation. 
 
 Franzen, 130. 
 
 Frequency distribution, construction
 of, 145-148.

 Fullerton, 130.

 Gates, 138. 
 
 Grade scale, evaluation of, 94. 
 
 Graphic methods, see Statistical and 
 graphic methods. 
 
 Gray, 38; on equating two groups, 
 <8, 
 
 Groups, equating of, 41-61. 
 
 Hanson, 37. 
 
 Harter, and Brian, 88. 
 
 Herring Revision of Binet-Simon 
 Scale, 60. 
 
 Hillegas, 130. 
 
 Hollingworth, H. L. and L. S., on 
 equating groups, 55. 
 
 Intelligence Quotient, 56, 59. 
 
 Intelligence tests, classified, 43, 44; 
 battery of, 85. 
 
 Irrelevant factors, constant vs. va- 
 riable, 63, 64; bias of experi- 
 menters, 64, 65; bias of assistants, 
 65-75; transfer, 75, 76; bias of 
 tests, 77, 78; other factors, 78, 79; 
 change produced by, 82. 
 
 Lacy, rotation experiment, 34, 35,
 73.
 
 Lew,/L. 1.0830. 
 
 Liu, H. C., on construction and use 
 of intelligence criterion, 84-87. 
 
 McCall, and Thorndike, reading 
 scale, 59-62; rotation experiment, 
 194. 
 
 Mean, computation of, 143; use of, 
 148. 
 
 Measurement, of changes, 206, 207. 
 
 Median, computation of, 148, 149.
 
 Mental age, computation of, 50, 
 60. 
 
 Metchnikoff, 208. 
 
 Monroe, diagnostic tests in arith- 
 metic, use, 88; measurement of 
 achievement, 130. 
 
 Myers, non-verbal intelligence test, 
 use, 85. 
 
 Norms, 60, 83, 117. 
 
 Ogglesby, 37, 180. 
 
 One-group method, description of, 
 14-17; formula for, 173; cri- 
 teria for selecting, 21-29, 35; 
 computations for, 140-160; bibli- 
 ography, 271. 
 
 Otis, on unreliability, 116. 
 
 Pairing pupils, technique of, 45-49,
 57.
 
 Percentile scale, evaluation of, 95- 
 98; points, computation of, 149- 
 150. 
 
 Pintner, non-verbal intelligence test, 
 use of, 85, 130. 
 
 Pittman, on equating of groups,
 40-51.
 
 Practical certainty, 156, 163. 
 
 Pressey, non-verbal intelligence test, 
 use of, 85. 
 
 Probable error, 151. 
 
 Product-moment formula, 225. 
 
 Product tests, construction of, 135- 
 138. 
 
 QI, 50. 
 
 Os\)nso aie | 
 
 Quartile deviation, computation of, 
 150. 
 
 Questionnaires, methods in causal 
 investigations, 215-217. 
 
 Rank method, in questionnaires, 215,
 216.
 
 Rate test, construction of, 135. 
 
 Reavis, attendance investigation,
 209, 210, 213, 238, 239.
 
 Regression equation, in causal in- 
 vestigations, 240-244. 
 
 Relative-to-the-items scale method, 
 in questionnaires, 216. 
 
 Reliability, of tests, 83; formula 
 for, 111; net-difference method, 
 112-114; practical certainty, 156,
 163; computations in special situations,
 190.
 
 Rotation method, description of, 19,
 20; formula for, 19, 20, 32; cri-
 teria for selecting, 31-36; Steven- 
 son’s experiment, 28; Weber’s 
 experiment, formula, 32, descrip- 
 tion of, 198-207; Lacy’s experi- 
 ment, 34, 35; computations for, 
 187-207; Thorndike and McCall, 
 ventilation experiment, 194; bib- 
 liography, 272. 
 
 Rugg, H. O., 5. 
 
 Scales, adequacy of, 88; evalua- 
 tion of methods, 94-98; for ex- 
 perimental tests, 198. See also 
 Age scale, B scale, C scale, Chi- 
 nese fundamentals of arithmetic 
 scale, Percentile, T scale. 
 
 Scores, point, sample of, 44; men- 
 tal age, sample of, 44. 
 
 Scoring, of Chinese fundamentals 
 of arithmetic test, 122, 123, 129. 
 
 Self-correlation, see Correlation. 
 
 Sherritt 21s., 1130. 
 
 Sigma, see Standard deviation. 
 
 Spearman, self-correlation formula,
 111, 112; product-moment for-
 mula, 225.
 
 Standard deviation, computation of, 
 144; of difference, 151.
 
 Stanford Revision of Binet-Simon 
 scale, 60. 
 
 Starch spelling scale, use of, 88. 
 
 Statistical and graphic methods, 
 bibliography, 274, 275. 
 
 Stevenson, rotation experiment, 26, 
 28. 
 
 T scale, 27; evaluation of, 95-98; 
 construction of, 98-102. 
 
 T scores, Weber’s use of, 203. 
 
 PaO, WWW uke so: 
 
 Terman, on mental age, 59, 130. 
 
 Tests, intelligence, classified, 43, 44; 
 battery of in Liu’s study, 85; 
 summary of steps in constructing, 
 scaling and standardizing, 130-139;
 experimental, scaling of, 198.
 
 Thorndike, 5, and McCall, reading 
 scale, 59-62, 130; rotation experi- 
 ment, 194. 
 
 Total ability in arithmetic, com- 
 putation of pupil, 123, 124; of 
 class, 126. 
 
 Unreliability, see Reliability. 
 Variability, measures of, 151. 
 Weber, rotation experiment, 32, 73,
 198-207.
 Woody, arithmetic scales, use, 88. 
 