HOW TO EXPERIMENT IN EDUCATION

BY WILLIAM A. McCALL, PH.D.
ASSOCIATE PROFESSOR OF EDUCATION, TEACHERS COLLEGE, COLUMBIA UNIVERSITY, NEW YORK CITY

EXPERIMENTAL EDUCATION SERIES
Edited by M. V. O’Shea

New York
THE MACMILLAN COMPANY
1926
All rights reserved

COPYRIGHT, 1923, By THE MACMILLAN COMPANY.
Set up and electrotyped. Published August, 1923. Reprinted November, 1926.
PRINTED IN THE UNITED STATES OF AMERICA BY BERWICK & SMITH CO.

Digitized by the Internet Archive in 2022 with funding from Princeton Theological Seminary Library
https://archive.org/details/nowtoexperimentiOOmcca

CONTENTS

CHAPTER
I. SELECTION AND FORMULATION OF EXPERIMENTAL PROBLEM
II. SELECTION OF EXPERIMENTAL METHOD
III. SELECTION OF EXPERIMENTAL SUBJECTS
IV. CONTROL OF EXPERIMENTAL CONDITIONS
V. EXPERIMENTAL MEASUREMENTS
VI. COMPUTATIONS FOR THE ONE-GROUP EXPERIMENTAL METHOD
VII. COMPUTATIONS FOR THE EQUIVALENT-GROUPS METHOD
VIII. COMPUTATIONS FOR THE ROTATION EXPERIMENTAL METHOD
IX. CAUSAL INVESTIGATIONS
X. ANALYSES OF EXPERIMENTAL AND CAUSAL INVESTIGATIONS
APPENDIX
SUMMARY OF SYMBOLS
INDEX

LIST OF TABLES

TABLE
1. Chronological ages and mental ages of 43 sixth-grade pupils
2. Pupils divided into two groups of equivalent mental age
3. Illustrates computation of composite scores
4. Illustration of need for equal units of measurement
5. Relative merits of four commonly used scales
6. Shows how to construct a T scale
7. ...
8. Shows how to widen the range of a T scale
9. Age-scale and T-scale equivalents
10. Shows how to construct a B scale
11. For converting T scores into B scores
12. Reliability of test by net difference method
13. Equating variability in computing net difference
13A. For converting total points correct into T scores
13B. For computing B scores
13C. For computing C scores
13D. For illustrating the computation of T, B, and C scores
13E. For interpreting ... scores
14. One-group computation model I
15. Illustration of computation model I
16. Computation of M and SD when N is large
17. Computation of M and SD in a frequency distribution ...
18. Computation of the median in special situations
19. Conversion of experimental coefficients into chances
20. Illustration of computation model I when EFs is not the ...
21. Equivalent-groups computation model II for two EF’s and one test type
22. Illustration of computation model II
23. Equivalent-groups computation model III for three EF’s and one test type
24. Equivalent-groups computation model IV for two EF’s and two test types
25. Illustration of computation model IV
26. Equivalent-groups computation model V for three EF’s and ...
27. Equivalent-groups computation model VI for two sub-groups
28. Summary of an actual experiment with three sub-groups
29. Equivalent-groups computation model VII with an intermediate test
30. Equivalent-groups computation model VIII with three sub-groups and an intermediate test
31. Rotation computation model IX for two EF’s and one test type
32. Illustration of computation model IX
33. Rotation computation model X for three EF’s and one test type
34. Rotation computation model XI for two EF’s and two test types
35. Data from a rotation experiment conducted by Weber
36. Data from Weber’s rotation experiment converted into ... scores
37. Computation of r
38. Computation of r from a contingency table
39. Reavis’ r’s between attendance and six hypothetical causes
40. Reavis’ original and partial r’s between attendance and six hypothetical causes

LIST OF DIAGRAMS

DIAGRAM
1. Scatter diagram showing rectilinear and curvilinear relationship
EDITOR’S INTRODUCTORY NOTE

Professor McCall has written this book primarily for the purpose of presenting the methodology of educational experimentation in a practical form for the use of teachers and students of education who wish to engage in experimental work, or who desire to understand the great amount of experimental literature which is appearing in magazine and book form. This is the first book on educational experimentation to be published at home or abroad. There are philosophical treatises on scientific methodology, such as Pearson’s “Grammar of Science,” and a few scattered suggestions on the method of experimental education in books on scientific education; but there has been no adequate treatment of experimental work in the educational field. This fact led the present writer, when he became editor of the Experimental Education Series, to ask Dr. McCall to prepare this volume. Dr. McCall has conducted courses in Teachers College in the field of experimental education, and he has for a number of years been accumulating concrete data to illustrate the experimental method of procedure. Probably no one is as well equipped as he is to prepare a book for the guidance of all who desire either to understand or to undertake experimental work in education.

With the aid to be gained from this book, intelligent teachers can engage profitably in research work in education even if they are not technically trained in experimental methods.
The subject is one of permanent worth; and students of education or teachers who wish to gain an intelligent appreciation of and to keep in touch with American educational progress must be familiar with, and, to some extent at least, must be master of the methodology of educational experimentation.

A large proportion of popular educational doctrines has been derived without due regard to the requirements for securing valid conclusions; and it may be safely predicted that superintendents, principals, and teachers, as well as students of education, who read Professor McCall’s book understandingly will exercise greater care than they have done heretofore in promulgating educational principles based upon data that have not been secured in an accurate manner or treated according to a technique designed to control or eliminate disturbing or irrelevant factors.

“How to Experiment in Education” is not as technical as it might appear to be at first glance. The formulae and diagrams as well as the discussion can be easily understood by any reader, even though untrained in experimental methods, if he will begin at the beginning of the work and go through it systematically and leisurely. Concrete examples of experimental problems that have been or that might be successfully studied are described by Professor McCall frequently and clearly enough to illustrate every method of procedure discussed and every diagram presented. Technical terms are sparingly used, and the meaning of those that are employed can be easily gained from the context in which they appear.

M. V. O’SHEA.
The University of Wisconsin.

PREFACE

My initiation into educational research, like most initiations, was a rather tragic one with happy consequences. My professors plunged me into practical research situations when my training in experimentation was exceedingly lop-sided. They trusted to my genius to supply the missing half of research methodology.
The memory of this mistaken trust constitutes the pleasant after effects.

The cause of my tragedy and of others like mine was due to the fact that, heretofore, chief attention has been directed toward statistical refinements, rather than refinements of pre-statistical procedure. There are excellent books and courses of instruction dealing with the statistical manipulation of experimental data, but there is little help to be found on the methods of securing adequate and proper data to which to apply statistical procedure. Training is given and books exist only for the last step of a several-step process. As a result, the final step often becomes little more than statistical doctoring for the ills in the data.

This book, together with its predecessor, “How to Measure in Education,” but particularly this book, represents an attempt to assemble or originate a fairly complete methodology of research from the selection of the problem to the conclusion of the research. Material has been drawn from numerous sources, but the largest single source is that unannounced richest course of instruction taken by me at Teachers College, namely, the frequent privilege of out-of-course association with Professor E. L. Thorndike.

The encouragement and support given my work by my departmental superiors, Professors M. B. Hillegas and Frank M. McMurry, and by Dean James E. Russell have been a continuous surprise because they have exceeded every expectation. Such encouragement has made it a pleasure to shorten vacations and to lengthen the working day so as to finish this book before departing for a year of service with the Chinese National Association for the Promotion of Education.

It is fortunate for the future reader that I am in China while this book is being edited and published. As a result, Dr. M. V. O’Shea has given an unusual amount of time to its editing, and in this he has had the technical assistance of Dr. John G. Fowlkes.
Miss Harriet Barthelmess, who has a thorough knowledge of the methodology of experimentation, and my wife, Alma McCall, have volunteered to read the proof. I wish to make grateful acknowledgment of their kindness.

WILLIAM A. McCALL.
Teachers College
Columbia University

CHAPTER I

SELECTION AND FORMULATION OF EXPERIMENTAL PROBLEM

I. VALUE AND PREVALENCE OF EXPERIMENTATION IN EDUCATION

Prevalence of Experimentation.—Except for sporadic exceptions and for continuous overlapping, the method for the determination of truth has passed through three major stages. The first stage is that of authority. When any question arose as to the truth or falsity of any fact or principle, it was referred by consent or force to the oracle, chief, king, church, state, or other temporarily ascendant individual or group. In the year 1922 the legislature of a certain state decided by vote whether the principle of evolution is true or false. In this same year there were further occasional evidences that vital educational matters were still being decided on the basis of authority and authority alone.

The second stage is that of speculation. This represents a genuine advance. When this stage was reached, questions were no longer matters merely to be settled; they were matters to be freely discussed. Broadly speaking, America and American education have now advanced well into this stage.

The third stage is that of hypothesis and experimentation. This stage is not something perceived only in visions. We have seen enough of it to know its aspect and to appraise its promise.
Since earliest times a tiny stream of scientific research has trickled through the ages, now above ground, now below, now a dashing stream, now a desert rill, but always flowing forward toward the future, and, in late years, increasing greatly in volume. Today, educational experimentation is accepted but not achieved.

These three, authority, speculation, and experimentation, have been described as stages, and in a sense they are. But, in a truer sense, they supplement each other. Speculation, unless it becomes an end in itself, is a fruitful source of hypotheses or problems for research. Authority, when founded upon tested knowledge rather than upon pure opinion, has an essential function in the scheme of life and education.

Everywhere there are evidences of an increasing tendency to evaluate educational procedures experimentally. Though measurement alone is not research, the marvelous spread of the movement for scientific measurement of educational products is a symptom of a new attitude which is favorable for research. The establishment of numerous city and state bureaus of research is another evidence. Numerous experimental schools have arisen for the purpose of research, pseudo-research, or propaganda. Most of the departments of the better teachers colleges have become saturated with the new point of view. Scientific organizations, research committees, an institute of educational research, and large educational foundations are lending such impetus as make experimental education the most important current movement in education.

But even with all its growth we have barely entered the stage of experimentation. Most educational theory still needs testing. Adequate testing of theory requires a rigid scientific procedure. The technique of experimentation is possessed today, with a few exceptions, mainly by a small group of educational psychologists.
Experimental education cannot hope to cope with its great task or develop much faster so long as superintendents, principals, and supervisors, not to mention teachers, are not equipped to solve their own problems for themselves. It is but a question of time until educational leaders will be required to have a command of research technique. Then the third stage has a chance to arrive.

Value of Experimentation.—Experimentation has proved its worth by hastening the day when the test of truth will be verification and conformity to our experience rather than revelation and miraculous departure from our experience. Science asks us to believe in such unthinkable things as the reality of ether, the absence of weight and friction for celestial bodies, the existence of the atom, that food makes thought, and the like. But these matters are in conformity with logic or experimental evidence. As Burroughs states, the helium atom has been proved to be an objective entity as truly as that the sun is in heaven.

The practice of experimentation in a school or school system pays in terms of an altered attitude on the part of the entire staff, willingness to consider new proposals, and an alertness for new methods and devices. Experimentation ploughs up the mental field. Teachers join their pupils in becoming question askers. It is the absence of just such stirrings of the mental soil, which, in all probability, is responsible for the supposed fact that teachers fail to improve after a few years of experience.

Experimentation pays in terms of cash. Three years ago an experiment was conducted in a school of five hundred pupils. The purpose of the experiment was to evaluate a group of teaching methods. A careful account was kept of the increased ability secured. Careful estimates were made of its financial value. A record was kept of expenditures. The value of the increased abilities secured was estimated to be worth $10,000.
This estimate was based upon the total cost in previous years of producing each unit of ability. The cost of test material used, and of the special supervision required, amounted to $540. The net annual saving, not counting future compounding of the abilities, was $9,460.

Recently an experiment has been conducted by Dransfield, principal of a school in West New York, New Jersey, and by Barton, superintendent of schools at Sapulpa, Oklahoma. The purpose of these experiments was to evaluate the plan for the teaching of reading described in “How to Measure in Education.” The total points of A. Q. growth in reading in the control school were 60. The points of growth in the experimental school were 143. Even without taking into account the improvement in history, geography, arithmetic, etc., resulting from increased reading ability, or the cumulative value to the pupils in future years, and even without considering that the teachers have learned a new process to use with other pupils, still the difference between the two groups is worth thousands of dollars. Consider the value to education of this and similar experiments, when their influence shall have spread to the millions of pupils in American schools.

The foregoing experiments have been described to show that it is not unreasonable to claim that a widespread use of scientific research could so increase the efficiency of instruction as to save a year of instruction. The value of such an achievement in financial terms is shown by the following approximate figures:

Population of the United States ............ 103,600,000
Saving to each person through research ..... 1 yr.
Total saving ............................... 103,600,000 yrs.
Value of a year in ... ..................... $1,000
Saving for U. S. ........................... $103,600,000,000
Population engaged in World War ............ 1,300,000,000
Saving for World War ....................... $1,346,800,000,000
Saving for 100 generations ................. $134,680,000,000,000

$134,680,000,000,000 = 260 times U. S. wealth = 790 times cost of World War = 395 times cost of all wars in recorded history.

Experimentation will pay the nation, the school system, and the individual school. The time has now arrived when it also pays the individuals who engage in it. If the financial reward is not large, the esteem of the profession is. There is no denying the fact that those educators who today are constructively studying educational problems by scientific methods have achieved, or are destined to achieve, positions of recognized leadership in education. They become the final arbiters for most educational questions, for the peculiar function of experimentation in education is to be a court of last resort.

Methodology of Research.—Scientific educational research may be grouped conveniently into three major divisions,—descriptive investigations, experimental investigations, and causal investigations.

The purpose of descriptive investigations is to describe a situation as accurately and objectively and quantitatively as possible.
They involve the collection of data, and the quantitative description of the data by the following means: some mass measure, such as a frequency distribution, frequency surface, order distribution, or rank distribution; or some point measure, such as a mode, mean, median, midscore, or percentile; or some variability measure, such as a quartile deviation, median deviation, mean deviation, or standard deviation; or some relationship measure, such as a scatter diagram, contingency table, or coefficient of correlation; or some reliability measure, such as a standard deviation of the measure, or probable error of the measure; or some other of the standard statistical techniques, such as are described in Rugg’s “Application of Statistical Methods to Education,” or Thorndike’s “Mental and Social Measurements.”

The purpose of experimental investigations is to evaluate the methods, materials, and aims of education. It is to determine the absolute or relative effects upon some subject or subjects or pupils of one or more experimental factors.

The purpose of causal investigations is to start with some observed effect and locate the cause or causes; to determine whether hypothetical causes are really causes; or to determine just how much each of several causes contributes to produce the effect.

McCall’s “How to Measure in Education” has for its purpose not only to tell how to use practically and construct scientifically mental and educational tests, but also to present the measurement, tabular, graphic, and statistical techniques required for the conduct of descriptive investigations. This book is a sort of companion volume for “How to Measure in Education,” and has for its purpose to complete the presentation of the methodology of research. The first book covers descriptive investigations. This book presents the techniques for experimental and causal investigations.

II. SELECTION OF EXPERIMENTAL PROBLEM

Planning an Experiment.—An experimenter ought to think through his experiment from the conception of the problem to the formulation of the conclusions and beyond. If he has six months to devote to an experiment he can, with advantage, spend five months in planning the experiment and one month in conducting it. Ideally an experimenter should not start his experiment until he has gone through, mentally at least, every step even down to the smallest statistical detail. Those who do not possess a vivid imagination can advantageously carry a miniature experiment with hypothetical data through the various tabulation and statistical stages.

The importance of adequate planning cannot easily be exaggerated. There is little justification for the contention that a well-prepared plan is an inflexible plan. A plan can be thorough and yet plastic enough to be altered to meet unexpected emergencies. In fact original adequacy of plan is probably correlated positively with a healthful plasticity.

Whenever the experimenter can afford the time, an actual-trial experiment is superior to a mental-trial experiment. Even the keenest vision of the most experienced experimenter cannot always foresee every difficulty which will arise. Hence the theoretically best procedure is to follow the mental-trial experiment with the actual-trial experiment, to modify and perfect the plan in the light of the actual trial, and, finally, to conduct the real experiment.

How to Find Experimental Problems.—The best way to find genuine experimental problems is to become a scholar in one or more specialties as early as possible. Thorndike has done a great service for the cause of original research by showing, in a convincing way, that the original mind is the informed mind.
The idea that much knowledge hampers a man’s originality has taken deep root in the popular fancy, as a result of its self-deceptive search for some crumb of comfort for stupidity. The essence of originality is high native intelligence plus adequate knowledge. Spencer describes knowledge as a sphere of light floating in an abyss of darkness. As a rule, only those who live their mental life on or in this sphere conceive fruitful problems.

A second way to discover fruitful problems is to read, listen, and work critically and reflectively. It is well to form the habit of reacting upon every situation with a question mark, and to consider every untested theory as an hypothesis. Between the lines of every worthwhile book are enough problems and enough rich materials to make the finder and utilizer famous.

A third method of discovering fruitful problems is to consider every obstacle an opportunity for the exercise of ingenuity instead of an insuperable barrier. A king once placed a purse full of gold in the middle of a public road. On the purse he placed a large stone. A soldier with his head in the air and whistling a tune chanced that way. He roundly cursed those who drove over that road for not removing the stone and hence for the injury to his pride and person. A wagoner, with the expenditure of much emotion and considerable skill, maneuvered his wagon past the obstacle. Since no one who passed that way had formed the mental habit of considering every obstacle an opportunity, the reward beneath the obstacle went by default to the king.

A fourth method of finding problems is to start a research and watch problems bud out of it. The very process of research stirs up a hornet’s nest of insistent problems. Spencer expressed a profound truth when he said that if we enlarge ever so little the sphere of light we increase infinitely its points of contact with the darkness.
A fifth method of finding problems is not to lose those already found. Almost everyone has probably been given for a moment—probably some odd and unexpected moment—some rare insight. These flashes come, linger for a moment, go, and are forgotten beyond recall. Twiss attributed his rise to a university position to one fact. He bought a steel filing case and recorded and filed original ideas and problems before they were forgotten. So vital for professional growth is this matter of finding and recording problems, that the worth of an educator can probably be measured by asking him to list in ten minutes as many as he can of worth-while educational problems.

What Experimental Problem to Select.—It goes without saying, and yet it needs to be said, that experimenters should select problems whose solution is not already known. One of the abler men in educational measurement reported, at a recent gathering of scientific workers, the results of a painstaking and exceptionally original research. Unfortunately the same problem had already been solved and the results published. Thorndike tells of a student who submitted to him the results of a research which the candidate hoped would be acceptable for a Ph.D. thesis. In submitting the manuscript the candidate wrote that he knew the research was original for he had been careful to avoid reading anything whatever about the subject.

As a rule, an experimenter should select and work upon problems in his own specialty. It will be shown later that successful experimentation requires such a detailed knowledge of the factors operating in a particular situation, and of the influence of these factors, as only a trained and experienced individual possesses. Recently, some students of experimentation, who were reasonably expert in education only, attempted to plan an experiment in chemistry. The undertaking was soon abandoned.
No one seemed to know the influence of temperature upon certain chemical reactions. This necessity of intimate knowledge probably explains why over 99 per cent of all discoveries are made by experts in the field of discovery. During the World War, the War Department established a clearing house for popular inventions. A few valuable suggestions were received, but in the main the bulk of all research had to be done by a mere handful of experts.

An experimenter should select the relatively more vital problems. There are many problems which are worth solving but not relatively worth solving. The number of those willing or competent to undertake research is too small and their time too valuable to expend effort on problems not of vital consequence.

An experimenter should select a problem whose solution is feasible, and should set up hypotheses capable of proof. However vital the hypothesis, if it is not susceptible of proof it should be discarded, for the present at least. Unfortunately, the solution of many experimental problems of great worth is often not feasible, because needed tests have not been constructed, or because appropriate subjects are not available, or because the experimenter cannot sufficiently control the situation in which the proposed experiment is to be conducted, or for some other reason.

Thus, the excellence of an experimental problem depends upon several factors, and hence it should be selected in the light of these factors. A more comprehensive list of these conditioning factors will be given later.

III. FORMULATION OF EXPERIMENTAL PROBLEM

Types of Formulation.—There are three types of individuals engaged in educational research, and the types are clearly indicated by the way they formulate their problems.

The first type of experimenter “flutters in all directions and flies in none!” He formulates problems so that their scope is scarcely less wide than the universe.
Such broad formulations offer little practical aid in planning the details of an experiment. Gazing at the stars, this experimenter steps into every snare at his feet. Just as a teacher cannot teach arithmetic in general, or spelling in general, but, instead, must teach particular examples or particular words, so an experimenter is likely to think and act very irrelevantly if he is guided by a broad formulation only.

Recently an experimenter came for consultation about a problem which he had formulated thus: What is the effect of various factors upon learning? After a little urging he departed and returned later with this formulation: What are the effects of distribution of time upon learning? He was commended for the improvement made. At a later stage the problem had become: Will a typical fourth-grade class in silent reading, spending three thirty-minute periods per week, accomplish more or less than an equivalent class spending five periods of eighteen minutes each per week? Even this is too broad for a final working formulation.

The second type may be called the pot-hole type. Near the Cumberland Falls, the Cumberland River has a stone bed pitted with pot-holes. These holes were made by small hard pebbles which lodged in originally slight concavities and which, due to the action of the water, have ground round and round, thereby making the pebbles smaller and the hole wider and deeper. There are indefatigable individuals engaged in educational research whose experimental problems are admirably specific. They are as narrow as the pebbles in the pot-hole. And, like the pebbles, their problems become narrower and narrower as their research proceeds. Such experimenters are experimental drudges. They do much excellent work, but each research is isolated from every other. There is an absence of general plan. There is no mental reaching for the larger implications. They are as lop-sided as the first type.
The third type of experimenter is the truly admirable one. He is the scholarly type. He perceives the larger meanings of each minute investigation. This glorifies the drudgery inherent in all careful research. The scholarly experimenter first formulates a broad problem. This gives the larger goal and permits perspective. He then breaks up the broad problem into very narrow, specific problems. These are the working units. As the results from the specific investigations come in, he fits the bits together into a beautiful mosaic. The solution of any one specific problem may be of no practical value. It merely contributes to the solution of the larger problem which alone has genuine practical significance. Hence, it is desirable that there be a hierarchy of formulations from very broad to very specific.

A working formulation of an experimental problem should clearly describe: (1) the experimental factor or factors whose effect or effects are being studied, (2) the experimental subjects or individuals or pupils to whom the experimental factor or factors are to be applied, and who are expected to register the effect or effects, (3) the nature of the effects expected and to be measured. In sum, a working formulation requires that the experimenter have analyzed his problem in rough outline at least.

Why and When to Survey Bibliography on a Problem.—The time to make a survey of the bibliography on an experimental problem is the opposite of the time when the survey is all too frequently made. Often an investigator has completed his experiment and has prepared his manuscript for publication before he hurriedly collects a list of references. The prime function of a bibliographical survey is not to provide a dignified list of references to append to an article, but to serve as a practical guide to the formulation of the subordinate problems, and to the general planning of the investigation.
Hence, the survey of the bibliography should immediately follow the formulation of the experimental problem or problems.

If there were no other reason, self-respect as a scholar should be adequate motivation for surveying a bibliography. Such a survey will avoid many public humiliations. Pride is not fostered by saying: "This is something never done before," only to discover later that the claim to originality is unjustified. Such humiliations will be frequent enough at best without actually inviting them.

An initial bibliographical survey will prevent repeating an investigation already done. There are few things more important than the conservation of the time and effort of scientific men. The importance of avoiding repetition does not, of course, mean that it may not be desirable, on occasion, to verify¹ a previous investigation. But it is necessary to discriminate between ignorant repetition and conscious verification.

Again, a bibliographical survey will often suggest additional incidental problems to be settled. There are few men who have extensively engaged in research who cannot testify to many keen regrets because numerous subsidiary problems were conceived too late to make possible their solution at the time the major problem was being attacked. It frequently happens that merely minor modifications in an investigation will make possible the solution of five problems instead of one. The importance of conceiving these problems early can be appreciated when it is recalled that many of the world's greatest discoveries were by-products rather than major objectives of experimental investigations.

Again, a bibliographical survey helps by offering suggestions of procedure and of errors to be avoided. A bibliography is the recorded experience of previous investigators. The cleverest investigator is seldom able to make an experimental plan so perfect that there will be no subsequent regrets.
Foresight is never a perfect substitute for experience. The bibliography reveals not only the methods employed and the instruments evolved by others but also criticisms of these on the basis of experience.

Finally, a bibliographical survey provides material which will be needed in describing the experiment conducted. It is desirable to preface an experimental article with a summary of previous related investigations, and to close it with a relevant bibliography. These, as well as all previously mentioned objectives of the bibliographical survey, should be realized at one and the same time.

¹ Wm. A. McCall, "Reliability of a Ph.D. Research Dissertation in Educational Psychology," School and Society, April 13, 1918.

Procedure in Making a Bibliographical Survey.—The procedure of the bibliographical survey should be a highly selective one. The experimental problems are the key to this procedure. Throughout the survey, they should be kept in mind constantly. Everything relevant to them should be seized upon and examined for possible aids. Relevancy to the problems is the principle of selection; helpfulness in furthering the experiment, or its description, is the principle of retention.

Not the principles of selection and retention but the method of discovery is the chief difficulty in surveying a bibliography. The problem is to know where to look for material likely to be relevant. The method pursued will vary somewhat with the problem and the situation of the experimenter. The following general suggestions may, however, be given: (1) Make inquiries of those who may be able to contribute unrecorded information. (2) Make inquiries of those who may be able to suggest references to be examined. (3) Go to the contents and references in books known to deal with the same or related problems. (4) Consult the same and related topics in the library's topically indexed card catalog. (5) Consult the Readers' Guide to Periodicals.
(6) Consult the monthly index to educational publications published by the Bureau of Education at Washington. (7) Consult the Psychological Index and the index volumes for certain periodicals. (8) Consult such summarizing journals as the Psychological Bulletin. (9) Consult the table of contents of special periodicals not indexed in the Readers' Guide. The discovery of a single relevant reference by the above procedure frequently leads to the discovery of many other references.

CHAPTER II

SELECTION OF EXPERIMENTAL METHOD

I. TYPES OF EXPERIMENTAL METHODS

A. One-group Method.—The most frequently used of all types of investigations or experiments is the one-group type, and it occurs as frequently in the physical and social sciences as in the mental. When the physicist subtracts a defined amount of heat from a bar of metal and measures the resulting contraction, he is using the one-group method. When the chemist pours one chemical mixture into another and analyzes the resulting precipitate, he is employing the one-group method. When a psychological examiner fires a pistol behind a candidate for aviation and measures the resulting jump, he is employing the one-group method. When a teacher scolds her class for inadequate preparation and measures the resulting increase or decrease in study, she is employing the one-group method. When a nation like France applies to itself republicanism or a nation like Russia applies to itself bolshevism and observes the result, it, too, is employing the one-group method. Similarly, when a teacher compares the effectiveness of scolding vs. praising, or instruction by one method vs. instruction by another method, she, too, is employing the one-group method, provided the two contrasted factors are tried out upon the identical group.
A one-group experiment has been conducted when one thing, individual, or group has had applied to it or subtracted from it some experimental factor or factors and the resulting change or changes have been estimated or measured.

The one-group method may be represented in formula form as follows:

One Group — Two EF's — One Test Type
S — (IT — EF1 — FT — C1) — (IT — EF2 — FT — C2)

where
S is the experimental subject, thing, or group.
IT is the initial test or status of S before EF1 and EF2 are, in turn, added to or subtracted from S.
EF1 is one of the two experimental factors.
EF2 is the other experimental factor.
FT is the final test or status of S after EF1 and EF2 have, in turn, been applied.
C1 is the change in S produced by EF1, and is found by computing the difference between the IT and FT which immediately precede and succeed EF1 respectively.
C2 is the change in S effected by EF2.

The conclusion is yielded by comparing the amounts of C1 and C2. If C1 is larger, EF1 has been more effective than EF2, and vice versa.

Thus, if a teacher wished to compare the effects of praising vs. scolding, at the beginning of a class period, upon the amount of discussion on the part of pupils during the class period, she would make an initial test (IT) of the amount of discussion which normally occurs. Then she would praise (EF1) the class at the beginning of some class period. During the remainder of the class period she would test (FT) the amount of discussion. Then she would compute the difference (C1) between the initial test and final test. As soon as the effects, if any, of the praising had worn off, she would make another IT or else assume that it would be identical with the first IT, scold the pupils, make an FT, and compute the amount of alteration (C2) produced by scolding.
A comparison of the amount and direction of C1 and C2 would yield the correct conclusion from this experiment, provided proper experimental precautions were taken, and provided the effects of the praising really did wear off, as evidenced by the second IT.

Assuming the data to be as shown below, the computations for the praising (EF1) vs. scolding (EF2) experiment are indicated.

S — (20 — EF1 — 25 — +5) — (20 — EF2 — 18 — -2)
Difference equals 7 in favor of EF1.

The one-group experimental method may be divided upon the basis of the number of experimental factors contrasted. Strictly speaking, there are no one-factor experiments. The nearest approach to such an experiment is where some one factor is added to or subtracted from S. If a teacher makes an IT of her class, adds a good scolding, makes an FT, and computes C, she may be said to have performed an experiment with one factor, an experiment which requires only the former or latter half of the above basic formula. On the other hand, it might be argued that she really employed two factors, namely, not scolding or a control EF vs. scolding, and that therefore she would require all of the above formula. Since the influence of EF1 (not scolding) would be to leave the pupils unchanged, IT and FT in the former half of the formula would be identical and C1 would be zero. Either approach leads to the same practical conclusion.

While half of the formula will suffice when the two factors are really the presence and absence of one identical factor, the entire formula is required when the two EF's are, not mere presence and absence of one EF, but two EF's different in nature. Thus, if a teacher wished to compare the effect of praising vs. scolding her class, or of teaching her class by one method vs. another method, C1 could not be assumed to be zero. Both praising and scolding, or both methods of teaching might alter the original status of S.
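The arithmetic of the praising vs. scolding illustration can be set down as a short sketch in Python. The figures (20, 25, 20, 18) are those given in the text; the names `Trial` and `compare` are merely illustrative labels introduced here, and the rule that the larger change is the favored one is an assumption that holds only where an increase in the trait is desired.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One parenthesized unit of the diagram: IT, then EF, then FT."""
    factor: str          # label of the experimental factor, e.g. "EF1"
    initial_test: float  # IT, status of S before the factor is applied
    final_test: float    # FT, status of S after the factor is applied

    @property
    def change(self) -> float:
        # C is the difference between the final and the initial test.
        return self.final_test - self.initial_test


def compare(first: Trial, second: Trial):
    """Return (favored factor, size of the difference between the changes).

    Assumes that a larger change is the desirable one, as in the
    class-discussion example.
    """
    difference = first.change - second.change
    if difference == 0:
        return ("neither", 0)
    favored = first.factor if difference > 0 else second.factor
    return (favored, abs(difference))


praising = Trial("EF1", initial_test=20, final_test=25)
scolding = Trial("EF2", initial_test=20, final_test=18)

print(praising.change, scolding.change)  # 5 -2
print(compare(praising, scolding))       # ('EF1', 7)
```

The one-factor case discussed above corresponds to a first `Trial` whose initial and final tests are identical, so that its change is zero.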
Since the longer formula is correct in all one-group experiments and is necessary in some, confusion will be avoided by adopting it as the basic formula for one-group experiments.

In certain other situations the basic formula may be shortened by eliminating both the IT and C, whereupon the formula for the one-group experiment reduces to

S — (EF1 — FT) — (EF2 — FT)

This plan is very economical and its use in preference to the more laborious basic plan is justifiable when S may be assumed to have an IT of zero, for in this case C becomes identical in amount with FT. When an experimenter wishes, for example, to discover how much a group of pupils can learn of certain new material taught for a defined length of time according to a defined method, he may employ the abbreviated experimental plan, provided the material to be taught is sufficiently new that pupils will start with zero knowledge of it. But since all these variations on the basic plan operate in special situations only, whereas the basic plan will operate in any one-group experiment, confusion will be avoided by keeping in mind the basic plan only.

There remains to consider the formula required to handle more than two EF's. The basic formula assumes two EF's. It can be indefinitely extended by lengthening the formula to provide for EF1, EF2, EF3, and so on, with their corresponding C1, C2, C3, etc.

In many one-group experiments the changes produced by each EF are manifold, so that one test cannot measure them. Thus, a certain EF may change not only a pupil's reading ability but his spelling ability also. To measure both these effects will require at least two types of tests, namely, a reading test and a spelling test. Hence, one-group experiments may be divided into those requiring one type of test and those requiring two or more types of tests. The former has already been diagramed; the latter is diagramed below.
This diagram assumes that two EF's are employed and two types of tests are required. Observe that S and the two EF's remain unchanged. C1 vs. C2, and C3 vs. C4 show the two conclusions from this experiment. Provision can be made for more EF's by extending the formula to the right and for more types of tests by extending it downward.

One Group — Two EF's — Two Test Types
S — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
    (IT2 — EF1 — FT2 — C3) — (IT2 — EF2 — FT2 — C4)

B. Equivalent-groups Method.—The equivalent-groups method has been devised for experimental situations where, for reasons to be mentioned shortly, the one-group method is inapplicable. Distinctive features of this method are (1) that there is more than one group, or S, and (2) that all groups are equivalent. Normally, there are as many S's as there are EF's, and each S is supposed to be equivalent to any other. Thus, if a teacher wishes to compare the effect of scolding vs. praising and employs the equivalent-groups method, she selects two equivalent groups. She scolds one group and measures the change, and praises the other group and measures the change.

The diagram for an equivalent-groups experiment with one type of test follows. S1 refers to one group and S2 to the other. The conclusion from the experiment is yielded by a comparison of C1 and C2.

Equivalent Groups — Two EF's — One Test Type
S1 — (IT1 — EF1 — FT1 — C1)
S2 — (IT1 — EF2 — FT1 — C2)

When two types of tests are used, this formula takes on the form shown below. The two conclusions are yielded by a comparison of C1 with C3, and C2 with C4.

Equivalent Groups — Two EF's — Two Test Types
S1 — (IT1 — EF1 — FT1 — C1)
     (IT2 — EF1 — FT2 — C2)
S2 — (IT1 — EF2 — FT1 — C3)
     (IT2 — EF2 — FT2 — C4)

The following formula is utilized for three EF's and two test types.
Guided by the principles exemplified in this and the two preceding formulae, a formula may be constructed for any number of EF's, and any number of test types.

Equivalent Groups — Three EF's — Two Test Types
S1 — (IT1 — EF1 — FT1 — C1)
     (IT2 — EF1 — FT2 — C2)
S2 — (IT1 — EF2 — FT1 — C3)
     (IT2 — EF2 — FT2 — C4)
S3 — (IT1 — EF3 — FT1 — C5)
     (IT2 — EF3 — FT2 — C6)

C. Rotation Method.—The rotation method is particularly useful for solving experimental problems insoluble by other methods. It is a unique combination of two or more one-group methods. When the various groups employed are equivalent, the rotation method is a combination of one-group and equivalent-groups methods.

As the name implies, the distinctive feature of the rotation method is rotation: rotation of S's, or EF's, or irrelevant factors. If a teacher wishes to study, by means of the rotation method, the effect of praising vs. scolding, she first praises S1, and measures the result, and then scolds the same S1, and measures the result. This is the one-group method thus far. She first scolds S2, and measures the result, and then praises S2, and measures the result. In other words, she rotates the order of the EF's. She combines the results from praising both groups, and compares the sum so found with the sum of the results from scolding both groups. This comparison shows whether praising has been more or less effective than scolding, how much, and in what direction.

The simplest form of rotation method, namely, two EF's and one type of test, is given below. The conclusion is yielded by a comparison of C1 plus C4 with C2 plus C3.

Rotation — Two EF's — One Test Type
S1 — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
S2 — (IT1 — EF2 — FT1 — C3) — (IT1 — EF1 — FT1 — C4)
EF1 = C1 + C4
EF2 = C2 + C3

If a teacher wishes to determine by means of the rotation method the effect of praising vs. scolding vs.
sarcasm, the formula becomes as shown below. The conclusion is derived from a comparison of C1 plus C6 plus C8 with C2 plus C4 plus C9 with C3 plus C5 plus C7.

Rotation — Three EF's — One Test Type
S1 — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2) — (IT1 — EF3 — FT1 — C3)
S2 — (IT1 — EF2 — FT1 — C4) — (IT1 — EF3 — FT1 — C5) — (IT1 — EF1 — FT1 — C6)
S3 — (IT1 — EF3 — FT1 — C7) — (IT1 — EF1 — FT1 — C8) — (IT1 — EF2 — FT1 — C9)
EF1 = C1 + C6 + C8
EF2 = C2 + C4 + C9
EF3 = C3 + C5 + C7

A diagram for a rotation method with two EF's and for two types of tests follows. The two conclusions from the experiment are yielded by a comparison of the sum of C1 and C6 with the sum of C2 and C5, and by a comparison of the sum of C3 and C8 with the sum of C4 and C7.

Rotation — Two EF's — Two Test Types
S1 — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
     (IT2 — EF1 — FT2 — C3) — (IT2 — EF2 — FT2 — C4)
S2 — (IT1 — EF2 — FT1 — C5) — (IT1 — EF1 — FT1 — C6)
     (IT2 — EF2 — FT2 — C7) — (IT2 — EF1 — FT2 — C8)
EF1 on test 1 = C1 + C6
EF2 on test 1 = C2 + C5
EF1 on test 2 = C3 + C8
EF2 on test 2 = C4 + C7

This, as well as any other experimental method, can be indefinitely extended by multiplying the number of factors, or tests, or both. The student will do well to stop at this point and prove his mastery of what has preceded by making a few sample extensions of each method that has been diagramed.

II. CRITERIA FOR SELECTING EXPERIMENTAL METHOD

A. One-group Method.—When the purpose of an experiment is to determine the amount of change due directly to an EF, the one-group method is valid:
(1) Where the total net change in the trait or traits in question produced by irrelevant factors is negligible, or where the amount of such change is measured and discounted by the application of a control EF.
(2) Where the change produced in S by an EF is not conditioned significantly by any preceding EF.
(3) Where the change effected by each EF is measurable in equal units.
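The second half of criterion (1) amounts to a subtraction: the change observed while only a control EF operates is deducted from the change observed under each regular EF. What follows is a minimal sketch in Python, not a prescribed procedure. All figures are hypothetical and invented for illustration, and the function name `net_change` is likewise an assumption.

```python
def net_change(change_under_ef: float, change_under_control: float) -> float:
    """Change attributable to the EF after discounting irrelevant factors."""
    return change_under_ef - change_under_control


# Hypothetical figures, in score units, invented for illustration only.
control_change = 6.0   # change in S while only the control EF operated
c1 = 8.5               # change in S while EF1 was applied
c2 = 7.0               # change in S while EF2 was applied

ef1_effect = net_change(c1, control_change)
ef2_effect = net_change(c2, control_change)

print(ef1_effect, ef2_effect)              # 2.5 1.0
print(ef1_effect - ef2_effect == c1 - c2)  # True
```

The last line shows that the control change cancels when only the difference between the two EF's is wanted; it matters when the absolute amount of change due to each EF is the object of the experiment.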
Here is an experimental problem which came to the attention of the writer recently: Will the appointment of a physical instructor (EF1) or the establishment of school luncheons (EF2) improve the health (weight, etc.) of elementary school pupils? The purpose of the individual who formulated this problem was to determine whether a physical instructor or school luncheons will alter the weight, etc., of pupils, and if so, how much.

Even in the case of an inanimate S, it is extraordinarily difficult to create an experimental situation where all irrelevant factors, that is, disturbing factors, are eliminated. In the case of an animate S like the above, irrelevant factors of considerable magnitude are unavoidable. But irrelevant factors will not invalidate this experiment provided their influence is relatively negligible. Hundreds of influences continuously play upon pupils. Compared to the influence of the EF, most, or sometimes all, of these irrelevant factors exercise a comparatively small influence.

Even significant irrelevant factors will not invalidate this experiment provided the total net change is negligible. Though pupils are continuously registering the effects of a multitude of accidental or chance or uncontrollable influences, some of these tend to facilitate and some to inhibit progress in the trait in question. No trouble is caused provided these positive and negative influences balance or so nearly balance as to give a negligible net total.

In the case of our sample problem, will the net total change produced by irrelevant factors be negligible? There are excellent reasons for believing that this net total will be a considerable increase in weight due to, not to mention other possibilities, the significant irrelevant factor of natural maturing.
But even this significant irrelevant factor of maturing does not invalidate the one-group method provided the amount of its influence can be measured and discounted by the application of a control EF (CEF). Thus, we might measure the amount of increase in weight due to one year of maturing, and then apply a year of school luncheons, and then remove school luncheons and apply a year of a physical instructor. The first year would be a control EF because during this time the pupils would presumably be treated exactly the same as during the two following years, except for the EF's of school luncheons and physical instructor. By computing the difference between the increase during the first year and each of the other two years it would be possible to determine the amount of increase attributable to each regular EF.

Where there are a CEF and two regular EF's the basic formula for the one-group method is shown below. Before C1 and C2 are compared, the amount of CC should be subtracted from each.

One Group — CEF and Two EF's — One Test Type
S — (IT — CEF — FT — CC) — (IT — EF1 — FT — C1) — (IT — EF2 — FT — C2)
EF1 = C1 - CC
EF2 = C2 - CC

Will one EF condition or carry over to any succeeding EF? Since the control EF may be dispensed with in experiments where the net total change produced by irrelevant factors is negligible, and also in certain other experiments, as will be shown later, and since the control EF is really identical with the preëxperimental factor, these two may be considered together. Thus, if an experimenter desires to compare the relative effectiveness of teaching pupils subtraction by the additive method vs. the subtractive method, it is important to inquire whether the pupils are just beginning subtraction or whether they have been taught for some time previously by the additive or subtractive or some other method.
The additive method, superimposed upon a long training according to the subtractive method, may yield results markedly different from that of an additive method superimposed upon an additive training or no training at all. The function of an initial test is to prevent the first regular EF from getting credit or blame for changes produced by a control EF or, lacking a control EF, the preëxperimental factor. But there may be a carry-over of inhibiting or facilitating purposes, methods of work, or information, or all of these which are not removed by the initial test sieve.

When the amount of this carry-over is significantly large, the experimenter has two alternatives. He may seek an S whose preëxperimental experiences have been such as to avoid the carry-over, or he may continue with the original S, and remember to state the final conclusions from the experiment in the light of the condition of S antedating the experiment. The experimenter does not have the alternative of selecting another experimental method, for every experimental method is handicapped equally by this preëxperimental factor.

It is necessary to inquire, not only concerning the carry-over from the preëxperimental factor or control EF, but also concerning the carry-over from one regular EF to any succeeding EF. Will a physical instructor for a year prior to school luncheons add to or detract from the effectiveness of school luncheons? Or vice versa, will school luncheons add to or detract from the effectiveness of a physical instructor? Will the additive EF, preceding a subtractive EF, facilitate the effectiveness of the subtractive EF, or inhibit it, or vice versa? Unless there are reasons for believing that any such carry-over will be relatively negligible, the experimenter had better avoid the one-group method.
If there are reasons for believing that EF1 will condition EF2 but that EF2 will not carry over to EF1, the one-group method is valid, provided EF2 is applied first, since an EF cannot condition a preceding EF.

There is this difference between a carry-over from a preëxperimental factor or from a control EF to a regular EF, and the carry-over from one regular EF to another. In the former situation the experimenter does not have the alternative of selecting another experimental method whereas in the latter situation he does.

Finally, can the changes effected respectively by the control EF, school luncheons, and physical instructor be measured in equal units? Since all weight changes will be measured in units of pounds, let us say, and since the scale for weight is a uniform scale, it would appear that the units could be called equal. The use throughout the entire experiment of a uniform scale with uniform and equal units would seem to be all that could be asked. It is, provided equality of units means equal ease of effecting a unit of change in S at all points on the scale. The units on a scale may be equal in some senses and be quite unequal in an experimental sense. In one sense the interval from ninety-seven to ninety-eight pounds is equal to the interval from one hundred ten to one hundred eleven pounds. In each case the interval is one pound. But it may be more difficult to increase the weight of a particular pupil from one hundred ten to one hundred eleven pounds than from ninety-seven to ninety-eight pounds. Let us assume that it is. Then the EF which came first would show a greater change than the EF which came second, even though both were of exactly equal effectiveness. In sum, objective equality of units does not guarantee experimental equality of units.
When the same uniform scale of uniform units measures the changes produced by all EF's there is some possibility that the units will be equal experimentally. This possibility is practically nil when the scales employed are not uniform. For example, an experimenter may desire to determine the effectiveness of two methods of teaching a geography lesson. He might teach a lesson by method A on the question: Why are certain portions of the United States arid? He would construct a measuring instrument on the content of this particular lesson. This instrument could be used for the initial test and final test to measure the change produced by method A. Now if method A had practically taught the content of the above lesson, or even a part of it, method B could not well be used on the same lesson. Method B would have to be employed on another lesson whose topic was, say: Why is more cotton grown in the southern than in the northern part of the United States? This would require a new test on the content of the second lesson. Suppose that method A increased by ten points the score of S, and that method B also increased by ten points the score of S. Which is more effective, method A or method B? It is impossible to say, because the ten points in one case are not necessarily equal to the ten points in the other. We cannot even be sure that one point on one test is equal to any other point on the same test.

When the purpose of an experiment is to determine merely the amount of superiority of one EF over any other EF, the one-group method is valid:
(1) When the amount of change in S under one EF is practically identical with the amount of change under any other EF, except for the difference in effectiveness of the contrasted EF's.
(2) Where the change produced in S by an EF is not conditioned significantly by any preceding EF or EF's.
(3) Where the change effected by each EF is measured in equal units.
Since many of the experiments in education are concerned only with the relative effectiveness of two or more EF's and not with a determination of the absolute amount of change in S directly attributable to an EF, the more searching fundamental criteria may be simplified as indicated in (1), (2), and (3) immediately above. So far as the above purpose is concerned, it makes no difference if pupils are maturing or if any other irrelevant factors are operating contemporaneously with the application of the EF's, provided they operate alike under each EF.

There are some situations where inequality of units is certain, and, yet, where the one-group method is practically imperative or has been used by mistake. Stevenson conducted an investigation under the auspices of the University of Illinois and the Chicago public schools to determine the relative effectiveness of large classes vs. small classes. Circumstances might have forced the one-group method. If so, one appropriate plan would be to have a teacher teach a class of, say, forty-five pupils for the first semester. Initial and final tests would be given. At the beginning of the second semester, thirty of these forty-five pupils would be so selected as to be fairly representative of the whole group. This class of thirty pupils would be taught during the second semester by the same teacher who had taught them during the first semester. Initial and final tests would be given. The final tests for the first semester would serve as the initial tests for the second semester. C1 and C2 would be computed only for the thirty pupils continuing throughout the year. A large number of different classes would be used, but each class would be treated according to the above plan. Then, since it is usually more difficult to secure each additional point, the small-class EF would be discriminated against because of inequality of units.
Even so, the experimenter would not have done all his work in vain. There are methods of correcting or approximately correcting for these inequalities.

One method is to plot the curve of growth for the test in question, using age norms or, lacking age norms, grade norms as the basis of the curve. The curve can be estimated for points between the age norms or grade norms. If the norm for ten-year-old children is, say, fifty, and for twelve-year-olds is sixty, and for thirteen-year-olds is sixty-five, a growth from fifty to sixty may be considered equal roughly to a growth from sixty to sixty-five. By interpolation, a growth on one portion of the curve may be converted into units of growth on any other portion of the curve, thus making comparison between EF's fair. In like manner, the slope of the curve for grade norms may be used to equate units on various portions of the curve, though the grade-norm curve is subject to a selection error. The fifth-grade norm in June is higher than the fourth-grade norm in June not only because of the year's growth, but also (and failure to recognize this is the error) because certain of the stupider pupils of a fourth grade are not allowed to continue with their grade when it becomes a fifth grade.

For several reasons another method of equalizing units will be found more serviceable: norms are frequently unavailable, grade norms are subject to the selection error, the equalization of units by means of growth curves is likely to prove laborious, and such equalization requires that the same or equivalent tests be used throughout the experiment. This other method is that of converting all units into T's, in terms of the experimental group rather than twelve-year-olds, by the T-scale technique described in Chapter V, and illustrated in Table 6 (page 99) and Table 36 (page 204).
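The growth-curve correction described above may be sketched in Python as follows. This is an illustration under stated assumptions and not the procedure of any particular test: the age norms are hypothetical figures invented for the purpose, the curve between norms is assumed to be a straight line, and the names `age_equivalent` and `growth_in_years` are illustrative only.

```python
from bisect import bisect_right

# Hypothetical age norms (age in years, norm score); not from any real test.
NORM_AGES = [10.0, 11.0, 12.0, 13.0]
NORM_SCORES = [40.0, 50.0, 58.0, 62.0]


def age_equivalent(score: float) -> float:
    """Convert a raw score into an age by straight-line interpolation."""
    if not NORM_SCORES[0] <= score <= NORM_SCORES[-1]:
        raise ValueError("score lies outside the range covered by the norms")
    # Index of the norm just above the score, capped at the last norm.
    i = min(bisect_right(NORM_SCORES, score), len(NORM_SCORES) - 1)
    low_score, high_score = NORM_SCORES[i - 1], NORM_SCORES[i]
    low_age, high_age = NORM_AGES[i - 1], NORM_AGES[i]
    fraction = (score - low_score) / (high_score - low_score)
    return low_age + fraction * (high_age - low_age)


def growth_in_years(initial_test: float, final_test: float) -> float:
    """Express a raw-score change in units of growth along the norm curve."""
    return age_equivalent(final_test) - age_equivalent(initial_test)


# A raw gain of 10 points low on the curve and a raw gain of 6 points
# higher on the curve come out equal when expressed in years of growth.
print(growth_in_years(45.0, 55.0))  # 1.125
print(growth_in_years(55.0, 61.0))  # 1.125
```

Under these assumed norms the second EF, which appears inferior in raw points, is credited with the same amount of growth as the first. A curve fitted through the norms could replace the straight-line segments where greater refinement is wanted.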
If the same or equivalent forms of a test are used throughout the entire experiment, it is suggested that the T12 column of Table 8, p. 102, become the T scores according to the very first initial test of the experiment, and that T16 become the T scores according to the last of the final tests of the experiment, and that these two columns of T scores be combined according to the procedure illustrated in Table 8. If the T scores were based upon initial test alone, some of the highest scores in the final test could not be scaled. If the T scores were based upon final test alone, some of the lowest scores of the initial test could not be scaled. By basing the T scores upon both initial and final tests, all scores for all pupils on a particular test can be converted into equivalent T scores by the use of what will correspond to the first and last columns of Table 6, p. 99.

If the initial and final tests for EF1 are neither duplicate nor equivalent forms of the initial and final tests used for EF2, i.e., if the EF1 tests measure information about the geography of New York, whereas the EF2 tests measure information about the geography of Pennsylvania, the T scores for EF1 should be based only upon the initial and final tests for EF1, and the T scores for EF2 should be based only upon the initial and final tests for EF2. This means that Table 6 must be worked twice for each test before all scores in a two-EF experiment can be converted into T scores. The general procedure is the same irrespective of the number of EF’s.

Fortunately, Stevenson selected a better experimental method. He chose the rotation method instead of the one-group method. He had one teacher teach a class of, say, forty-five pupils and another teacher teach an approximately equivalent class of thirty pupils in the same grade. Both the large and the small classes were taught during the first semester.
At the end of the first semester, fifteen pupils were taken from the class of forty-five pupils, thus leaving it a class of thirty pupils during the second semester, and given to the class of thirty pupils, thus making the latter a class of forty-five pupils during the second semester. In this way, both the large-class EF and the small-class EF came under identical courses of study, identical portions of the test, identical portions of the growth curve, and so on.

The probability of satisfying the fundamental criteria for selecting the one-group method is increased:

(1) Where the EF or EF’s produce a relatively drastic effect, for this tends to make the influence of irrelevant factors practically negligible.

(2) Where the experiment is of brief duration, for this abbreviates the action of large, constant, cumulative, irrelevant factors such as maturing, for example.

(3) Where the trait in question does not involve purposes or methods of work, for these usually show a larger carry-over than specific information.

(4) Where the tests are scaled on the basis of the same unit, for this increases probability of equality of units.

B. Equivalent-groups Method.—When the purpose of an experiment is to determine the amount of change due directly to an EF or EF’s, the equivalent-groups method is valid:

(1) Where the total net change in the trait or traits in question produced by irrelevant factors is negligible, or where the amount of such change is measured and discounted by the use of a control EF.

(2) Where it is really possible to equate groups.

One peculiar virtue of the equivalent-groups method is that in its use the danger of any carry-over from one EF to another is avoided, by applying each EF to a different S so that no EF follows another with the same group. Of course the equivalent-groups method, like all others, is subject to a possible carry-over from the preexperimental factor.
But this does not so much invalidate an experiment as limit the conclusions from the experiment to the particular sort of S employed.

Another superiority of the equivalent-groups method over the one-group is that the units of measurement used for one EF have a greater probability of being equal to those used for another EF. The equivalent-groups method avoids the doubtful assumption that it is equally easy to produce equal amounts of change at various points of the growth curve of S, for two S’s can be chosen at like positions on the growth curve. Furthermore, it is not necessary to measure the changes produced by the various EF’s by means of different incomparable tests based upon different subject matter. Thus it would not be necessary to teach one sort of geography lesson according to method A and another sort according to method B. The identical lesson could be taught by method A and method B and the identical test could be used to measure the changes produced by each method. We shall see, however, when we come to consider the question of scaling tests, that the use of identical tests does not guarantee perfect equality of units. But it certainly does tend to increase comparability.

The one-group method did not prove entirely valid for the illustrative problem of school luncheons vs. physical instructor. How about the equivalent-groups method? Here, as in the case of the one-group method, the total net change produced by irrelevant factors would not be negligible, due to the natural maturing of the pupils. But this difficulty could be overcome by employing a control S, to whom the control EF could be applied. Thus one S would be treated as usual (CEF). Another equivalent group would have school luncheons (EF1). Still another equivalent group would have a physical instructor (EF2). By subtracting CC from C1 and C2 the amount of change produced by EF1 and EF2 could be accurately determined.
Hence the equivalent-groups method is applicable to this experimental problem. The method is equally applicable to the praising vs. scolding, or the additive vs. subtractive problems.

When the purpose of an experiment is to determine merely the amount of superiority of one EF over any other EF, the equivalent-groups method is valid:

(1) Where the amount of change in S under one EF is practically identical with the amount of change under any other EF, except for the difference in effectiveness of the contrasted EF’s.

(2) Where it is really possible to equate groups.

As is the case with the one-group method, the criteria are less stringent when only the relative difference between EF’s is desired. Changes produced by large irrelevant factors, like maturing, cause no trouble provided the irrelevant factor operates equally under each EF.

In the case of one-group experiments, equal operation of irrelevant factors under each EF is often difficult to secure, particularly when the experiment extends over a considerable time interval. But equal operation of irrelevant factors is easy to secure when the groups are different groups and equivalent. Hence the above criteria practically reduce to the second one for most situations.

C. Rotation Method.—When the purpose of an experiment is to determine the amount of change due directly to an EF or EF’s, the rotation method is valid:

(1) Where the total net change in the trait or traits in question produced by irrelevant factors is negligible, or where the amount of such change is measured and discounted by the application of a control EF.

(2) Where the change produced in S by an EF is not conditioned significantly by any preceding EF.

In case the net total effect from irrelevant factors is not negligible, this effect can be measured by a preliminary application of a control EF to each group employed in the rotation experiment.
The amount of change produced by the irrelevant factors would be combined in the same way, in the same order, and for the same intervals as has been described for the regular EF’s, and the sum would be subtracted from the sum of the corresponding C’s for the regular EF’s. The computation for the control EF is like computing the shadow of the rotation experiment for the regular EF’s, for there would be a control C1 to be added to a control C4, and a control C2 to be added to a control C3. The computation for the control EF’s would be more elaborate if there were more than two regular EF’s, but here, too, the process would duplicate that already given for three or more regular EF’s. The formula for both CEF’s and regular EF’s may be written as below, though it is probable that either the CC2 or CC4 would be assumed to be equivalent to CC1 or CC3 respectively, or else the two CEF’s which are applied to each S would be applied in immediate succession.

Rotation—CEF’s and Two EF’s—One Test Type

S1—(IT—CEF1—FT—CC1)—(IT—EF1—FT—C1)—(IT—CEF2—FT—CC2)—(IT—EF2—FT—C2)
S2—(IT—CEF2—FT—CC3)—(IT—EF2—FT—C3)—(IT—CEF1—FT—CC4)—(IT—EF1—FT—C4)

EF1 = (C1 + C4) − (CC1 + CC4)
EF2 = (C2 + C3) − (CC2 + CC3)

Even though the rotation method is a combination of one-group methods, the criterion concerning equality of units of measurement has not been restated in connection with the rotation method. This omission is due to the fact that the rotation method brings each EF under each lesson and test, if different lessons with different content are used, and brings each EF under each portion of the growth curve, if the same test is used and the experiment continues over a long period of time. In sum, the rotation tends to rotate out lesson differences, test differences, or position-on-growth-curve differences, thus tending to equalize the units of measurement.
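The control-corrected rotation formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the change values are hypothetical, invented to show the arithmetic, and are not results from any experiment described in the text.

```python
def net_rotation_effects(c, cc):
    """Apply the two-EF rotation formula with control EF's.

    c  maps 1..4 to the changes C1..C4 under the regular EF's.
    cc maps 1..4 to the changes CC1..CC4 under the control EF's.
    """
    ef1 = (c[1] + c[4]) - (cc[1] + cc[4])  # EF1 = (C1 + C4) - (CC1 + CC4)
    ef2 = (c[2] + c[3]) - (cc[2] + cc[3])  # EF2 = (C2 + C3) - (CC2 + CC3)
    return ef1, ef2


# Hypothetical changes, chosen only to illustrate the subtraction.
regular = {1: 12.0, 2: 9.0, 3: 10.0, 4: 11.0}
control = {1: 3.0, 2: 3.0, 3: 2.0, 4: 2.0}

ef1_net, ef2_net = net_rotation_effects(regular, control)
print(ef1_net, ef2_net)
```

With these invented figures the control changes remove the portion of each sum attributable to irrelevant factors such as maturing, leaving the change credited to each regular EF.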
In Weber’s rotation experiment to test the effectiveness of a lesson taught by a teacher followed by a brief review vs. a film or motion picture followed by a lesson vs. a lesson followed by a film, a different content with an appropriate test for each content had to be used for the different EF’s. One lesson had to do with India, another with China, and a third with Japan. The appropriate formula for such an experiment follows. In the formula, ITi means the initial test on India, LR means the lesson-review EF, ITc means initial test on China, FL means the film-lesson EF, ITj means initial test on Japan, and LF means lesson-film.

S1—(ITi—LR—FTi—C1)—(ITc—FL—FTc—C2)—(ITj—LF—FTj—C3)
S2—(ITi—FL—FTi—C4)—(ITc—LF—FTc—C5)—(ITj—LR—FTj—C6)
S3—(ITi—LF—FTi—C7)—(ITc—LR—FTc—C8)—(ITj—FL—FTj—C9)

LR = C1 + C6 + C8
FL = C2 + C4 + C9
LF = C3 + C5 + C7

If S1 is a superior group of children, the foregoing plan rotates out the superiority, for every EF gets the benefit of the group’s superiority, and similarly for other group differences. If S2 is taught by a superior teacher, the effect of her superiority is rotated out, for every EF profits equally from her skill, and similarly for other teacher differences. If the lesson or test on India is especially difficult, this difficulty is rotated out, for the lesson and test on India is employed with every factor, and similarly for other lesson or test differences. If the LR or lesson-review EF is more effective than the other two EF’s, this superiority is not rotated out, and should not be rotated out, for the purpose of the plan is to give any such superiority a chance to manifest itself, unmasked by irrelevant factors of teacher, group, lesson, or test differences.
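A short Python sketch may make the rotating-out argument concrete. It assumes, purely for illustration, that each change C is the sum of a common base, a group effect, a lesson effect, and an EF effect; every number below is invented, and the names SCHEDULE and totals_by_ef are hypothetical.

```python
# (group, lesson, EF) for C1..C9, following the three-group formula above.
SCHEDULE = [
    ("S1", "India", "LR"), ("S1", "China", "FL"), ("S1", "Japan", "LF"),
    ("S2", "India", "FL"), ("S2", "China", "LF"), ("S2", "Japan", "LR"),
    ("S3", "India", "LF"), ("S3", "China", "LR"), ("S3", "Japan", "FL"),
]


def totals_by_ef(changes):
    """Sum the nine changes by EF, i.e. LR = C1 + C6 + C8 and so on."""
    totals = {}
    for (_, _, ef), change in zip(SCHEDULE, changes):
        totals[ef] = totals.get(ef, 0) + change
    return totals


# Invented additive effects (assumption: effects simply add together).
BASE = 10
GROUP = {"S1": 3, "S2": 0, "S3": -1}             # S1 is a superior group
LESSON = {"India": -2, "China": 0, "Japan": 1}   # India is a hard lesson
EF = {"LR": 5, "FL": 4, "LF": 4}                 # LR is one point better

changes = [BASE + GROUP[g] + LESSON[l] + EF[e] for g, l, e in SCHEDULE]
totals = totals_by_ef(changes)
print(totals)
```

Because every EF meets every group and every lesson exactly once, the group and lesson effects enter each total in equal amount. Under this additive assumption the totals differ only by three times the difference in EF effects (one point per application, three applications), which is the superiority the plan is meant to reveal.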
The above plan will rotate out any likely irrelevant factor, except (1) uncontrolled bias on the part of the teacher or experimenter for a particular EF; (2) bias on the part of the test for a particular EF; (3) deliberate malingering on the part of the pupils, unless this is uniform throughout the experiment; (4) a carry-over from one EF to another; (5) any tendency for one group to learn how to improve more rapidly with the progress of the experiment than any other group; or (6) any tendency for one group to become more fatigued or bored with the progress of the experiment than any other group.

The last three irrelevant factors are of special interest. If the lesson-review EF were to carry over and benefit the film-lesson EF, C2 would not be an exact measure of the influence of film-lesson. Instead, C2 would be a measure of the effect of film-lesson plus an effect borrowed from lesson-review. In an experiment of this sort, where the entire content of the lessons is changed each time, such carry-over in significant amount is highly improbable. If, for some reason, S1 were to learn, as the experiment progressed, how better to retain the content so as to make a higher score on the FT, the second EF would profit more than the first, and the third EF would profit more than the second. This would be rotated out provided, and only provided, S2 and S3 each learned the same thing in like amount. Again, if S1 were to become fatigued or bored as the experiment progressed, relatively more than S2 and S3, this would penalize LF most, FL next, and LR least. Such unique fluctuations are not likely to occur in significant amounts unless there are large differences in intelligence, or the like, between the three groups.
When the purpose of an experiment is merely to determine the amount of superiority of one EF over any other EF, the rotation method is valid:

(1) Where the amount of change in S under one EF is practically identical with the amount of change under any other EF, except for the difference in effectiveness of the contrasted EF’s.

(2) Where there is no carry-over from one EF to another, or where, in case it occurs, the carry-over is mutual, i.e., each EF gains equally from such carry-over.

If, in the case of one S, EF1 preceding EF2 aids EF2 to the extent of, say, two score points, and if EF2, in the case of the other S, aids EF1 to the extent of two score points, the increased change for each EF will be equal, thereby validating the rotation experiment for the purpose of determining relative effectiveness of the EF’s.

An illustration will make it clear that a mutual carry-over will not disturb a relative rotation experiment. Lacy¹ conducted a rotation experiment to evaluate the relative effectiveness of telling a story orally to a pupil (Told), having a pupil read the story (Read), or having him see it in motion pictures (Movie). Assume that each EF is equally effective, and that each C would be 4 were it not for carry-over. Assume, further, that each EF carries over to the immediately succeeding EF to the extent of half its own C, and to the next EF to the extent of one-fourth its own C. The following diagram shows that all EF’s come out equal, according to assumption, regardless of a complicated carry-over.
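The arithmetic behind these assumptions can also be checked with a brief Python sketch. The function names are hypothetical, and the sketch takes "its own C" to mean the change actually observed for the preceding EF, carry-over included, which is the reading that yields 4, 4 + 2, and 4 + 3 + 1 for the three positions.

```python
def observed_changes(order, base=4.0):
    """Return (EF, observed C) pairs for one group taking the EF's in `order`.

    Each EF adds half of its observed C to the next EF and one-fourth of
    its observed C to the EF after that (the assumptions stated above).
    """
    observed = []
    for position, ef in enumerate(order):
        change = base
        if position >= 1:
            change += 0.5 * observed[position - 1][1]
        if position >= 2:
            change += 0.25 * observed[position - 2][1]
        observed.append((ef, change))
    return observed


def totals_for_plan(plan):
    """Sum the observed changes by EF over all groups in a rotation plan."""
    totals = {}
    for order in plan:
        for ef, change in observed_changes(order):
            totals[ef] = totals.get(ef, 0.0) + change
    return totals


THREE_GROUP_PLAN = [
    ("Told", "Read", "Movie"),   # S1
    ("Read", "Movie", "Told"),   # S2
    ("Movie", "Told", "Read"),   # S3
]

three_totals = totals_for_plan(THREE_GROUP_PLAN)
print(three_totals)
```

Each EF occupies the first, second, and third position exactly once, so each collects 4 + 6 + 8 = 18 and the carry-over leaves the comparison undisturbed.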
        C1            C2              C3
S1 — Told (4) — Read (4 + 2) — Movie (4 + 3 + 1)
S2 — Read (4) — Movie (4 + 2) — Told (4 + 3 + 1)
S3 — Movie (4) — Told (4 + 2) — Read (4 + 3 + 1)

Told = (4) + (4 + 3 + 1) + (4 + 2) = 18
Read = (4 + 2) + (4) + (4 + 3 + 1) = 18
Movie = (4 + 3 + 1) + (4 + 2) + (4) = 18

¹ Lacy, John V., “The Relative Value of Motion Pictures as an Educational Agency,” Teachers College Record, November, 1919.

If an experimenter desires to be exceedingly careful to equalize the amount of carry-over, he can improve upon any formula thus far given by using six groups for three EF’s as shown below.

S1 — Told — Read — Movie
S2 — Read — Movie — Told
S3 — Movie — Told — Read

S4 — Read — Told — Movie
S5 — Told — Movie — Read
S6 — Movie — Read — Told

On the whole, the one-group experimental method is the most convenient and, for this reason, should be preferred when some significant irrelevant factors will not invalidate the experiment; but the one-group method is peculiarly subject to constant errors from these sources. The equivalent-groups method is peculiarly free from the influence of disturbing irrelevant factors. The only difficulty encountered here is in selecting two or more S’s which are genuinely equivalent. When the number of pupils composing each S is small, it becomes extremely difficult to prove that exact equivalence was secured. Due to the practical difficulty at times of establishing this equivalence, the rotation method is frequently used. The rotation method is, of course, just a combination of two or more one-group experiments, but the way in which the one-group methods are combined automatically tends to eliminate some of the objections to the one-group method.
Reversing the order of application of the EF’s permits each EF to get the advantage or disadvantage of a carry-over from the other, and increases comparability by having each test used under each EF and by having each EF operate on S at approximately similar portions of the growth curve. The rotation method is also of value in eliminating special irrelevant factors, such as teaching skill of teacher, and difference in ability of groups.

CHAPTER III

SELECTION OF EXPERIMENTAL SUBJECTS

Appropriateness of Subjects to Experimental Factors.—The first consideration in selecting experimental subjects requires that these subjects be appropriate to the EF’s. A principal in a nearby school is interested in determining the effect of employing the project method with a particular class in his school which has been taught by an extremely conservative teacher. Here the EF calls for a particular class or, at least, for pupils whose habits have been formed under a very conservative teaching method. Coy has conducted an elaborate experiment with children of high intelligence. The problem especially called for gifted pupils. Others would have been inappropriate. Ogglesby designed a primer for pupils of subnormal intelligence. She desired to test its relative effectiveness. It was necessary to select pupils appropriate to the EF. Hanson has experimented with the effect upon progress in penmanship of excusing pupils from drill when they attain a handwriting quality of 12 on the Thorndike Handwriting Scale, as compared with continuance of drill. Pupils whose handwriting is already above quality 12 would be inappropriate, as would pupils so far below quality 12 that this goal would cause little or no motivation. Thus, appropriateness is an essential consideration, and what constitutes appropriateness varies with the nature of the problem.

The determination of appropriateness frequently requires objective measurement.
Thus Coy used intelligence tests to pick children of high intelligence. Ogglesby selected her subjects on the basis of intelligence scores determined by Metzner. Gray, Gates, and others have experimented with pupils who were unable to make satisfactory progress in reading. They employed reading tests to select their experimental subjects.

Appropriateness of Subjects to Tests.—As a rule, subjects should not be subordinated to the tests, but rather tests should be found or constructed which will be appropriate to the subjects. But it sometimes happens that the nature of the problem is such as to permit the experimenter considerable latitude in the choice of subjects, while at the same time it is not feasible to construct new tests. A few days ago the writer advised an experimenter who was planning his doctor’s dissertation to select no experimental subjects below the third grade. This advice was given because adequate tests of the type called for by his problem were not available for pupils in grades below the third. Adequate tests were available for pupils in grades above the second. He could have constructed tests for young children, but this would have left no time for experimenting with the problem in which he was interested.

Representativeness of Subjects—Selection by Chance.—Sometimes it is possible to employ for the S the total group which has proved appropriate for the EF. Thus the experimenter who desires to determine the effect of the project method upon a particular fourth grade previously taught by an unusually conservative method could include the total group in the experiment. Sometimes, as for example in a very large elementary school, it is not feasible to try the EF’s on all the fourth-grade children in question. Only a selected number can be used. If the conclusion is to be generalized for all the pupils, it is necessary that the S be so selected as to be representative of the total group.
Representativeness can be secured by making a chance selection from the total group, or a chance selection from a chance portion of the total group. One method of making a chance selection is to write upon a slip of paper the name of each pupil in the total group, to place these names in a receptacle, to mix them thoroughly, and to draw from the receptacle as many slips of paper as there are pupils called for in the experimental plan. This was the general procedure followed by the War Department in selecting men for conscription during the World War. Another method of making a chance selection is to write the names of the pupils in alphabetical order. If half the total number of pupils are to be used, alternate pupils can be selected. If one-third the total group are to be used, every third pupil can be selected, and similarly for the proportions of 25, 75, 90, or other per cents.

The above methods of selection assume that it is feasible to withdraw the selected pupils from their classes and assemble them in a new class or classes for experimental purposes. This is not, however, always practicable. Frequently the experimenter is faced with the necessity of making a chance selection of classes rather than or in addition to a chance selection of pupils.

Representativeness of Subjects—Selection by Measurement.—If 1000 pennies be tossed there will be only a slight difference between the number of times that heads as contrasted with tails appear. If twenty pennies are tossed there may be a relatively large difference in the number of heads and tails. This illustrates the fact that chance is a highly exact method of selecting representative pupils when the number of pupils used as subjects is large, whereas its accuracy decreases as the number of pupils decreases. When the number of pupils or groups is small it is safer to make the selection on the basis of measurement of some sort.
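The two chance procedures described above, drawing slips from a receptacle and taking names at a fixed interval from an alphabetical list, can be sketched in Python as follows. The roster, the function names, and the seed are hypothetical and serve only as an illustration; a seeded generator stands in for the thorough mixing of the slips.

```python
import random


def draw_by_lot(names, how_many, seed=None):
    """Draw `how_many` names at random without replacement (slips in a receptacle)."""
    rng = random.Random(seed)  # seed is optional; fixing it makes the draw repeatable
    return rng.sample(list(names), how_many)


def every_kth(names, k, start=0):
    """Arrange names alphabetically and take every k-th one, beginning at `start`."""
    ordered = sorted(names)
    return ordered[start::k]


# Hypothetical roster of twelve pupils.
roster = [f"Pupil {number:02d}" for number in range(1, 13)]

drawn = draw_by_lot(roster, 4, seed=1)   # one-third of the roster by lot
third = every_kth(roster, 3)             # one-third by taking every third name
half = every_kth(roster, 2)              # one-half by taking alternate names

print(drawn)
print(third)
print(half)
```

Either device gives every pupil the same chance of selection only if the starting point of the alphabetical method is itself arbitrary. With a roster as small as this one, the sample may still differ noticeably from the total group, which is the caution raised by the penny illustration.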
Just what sort of measurement will be best depends upon the nature of the experimental problem to be undertaken and the purposes of the experimenter. If the experiment has to do with physical efficiency, the tests used may well be tests of physical condition, in order that pupils with all types of physique may be selected. If the experimental trait is reading, selection on the basis of a test of reading ability will usually prove satisfactory. If the experiment has to do with general educational or mental development, an intelligence test or a combination of several educational tests may be employed.

Once the measurements are made, the pupils or groups, as the case may be, should be arranged in order according to the size of their scores. If, say, 10 per cent of the pupils or groups are to be selected, every tenth pupil or group should be selected. If 25 per cent of the pupils or groups are to be used, every fourth pupil should be selected. Thus in the latter instance the best, fifth best, ninth best, and so on, should be selected.

Representativeness can be slightly but only slightly increased by employing a modified method of selecting the experimental pupils. Selecting pupils who stand first, third, fifth, and so on, when half the total group is to be used will cause the experimental pupils to average slightly higher than the total group, as will the selection of pupils who stand first, fifth, ninth, and so on when 25 per cent of the total group are to be used. This modified method is described farther along, in connection with the technique of equating groups.

Appropriateness of Subjects to Experimental Method.—The question of the appropriateness of subjects to the experimental method is most frequently raised in connection with the equivalent-groups method, or the rotation method when equivalent groups are to be used.
When any experimental method has been decided upon, subjects must be selected who are, first, appropriate to EF’s and tests, and second, representative. When the equivalent-groups method has been decided upon, there is the additional requirement that subjects be selected and placed in different groups in such a way that the resulting groups will really be equivalent.

Equivalence of groups does not require that all the subjects participating in the experiment be equivalent, but it does mean that all the groups participating be equivalent. To be equivalent the various groups must have like means and like variability among the subjects constituting each group. To have like means and like variability implies in turn that for every subject in one group there should be an equivalent subject in every other group. While this last will guarantee like means and variability, it is not absolutely required that there be an equal number of subjects in each group. The essential is that the groups be equivalent as to means and variability.

But equivalent in what? In intelligence? Not necessarily. In education? Not necessarily. In the experimental trait? Not necessarily. The groups must be equal in their possibilities for growth in the trait in question. They should be so equal in the growth potential or possibilities that they will show an equal mean change and an equal variability among the changes of the individual subjects in each group, provided all groups are placed under an identical EF for an identical length of time. Various methods have been proposed for securing such an equivalence. These will be described next.

Groups Equated by Chance.—Just as representativeness can be secured by the method of chance, when the subjects involved are sufficiently numerous, so equivalence may be secured by chance, provided the number of subjects to be used is sufficiently numerous.
One method of equating by chance is to mix the names of the subjects to be used. Half may be drawn at random. This half will constitute one group while the other half will constitute the other group. If three groups are required, the first third of the drawings will constitute one group, the second third of the drawings another group, and the remaining third still another group. Or again, the names may be written in alphabetical order. The even-numbered names will constitute one group and the odd-numbered names the other group, and similarly for a larger number of groups. If classes are being paired off instead of pupils, the same general procedure of drawing or of alternating will apply.

The above are merely sample procedures. Any device which will make the selection truly random is satisfactory. Extreme caution should be exercised to avoid any constant tendency for one group to turn out superior to another. When the War Department made the famous drawing to determine the order in which individuals would be conscripted for military service, numbers were written on paper and enclosed in capsules. Due to the fact that every additional figure in a number added to the weight of the capsule because of the additional ink deposit, there was a constant tendency for the larger-numbered capsules to sift to the bottom where they would be drawn last. If the size of the paper increased with the length of the number this still further prevented a perfectly random drawing. These criticisms are made merely by way of illustration. Any experimenter may count himself lucky if he is able to select subjects by the method of chance with no constant error larger than that caused in this national drawing by a few specks of ink.

Groups Equated by General Ability.—Measurement, if adequate and accurate, is the best basis for selecting subjects irrespective of their number.
Chance selection is merely an economical substitute for measurement, and is practicable only where the number of experimental subjects is sufficiently large. The trouble with measurement is that we know so little about just what sort of measurement will yield, as a basis of selection in a particular experimental situation, groups equivalent in their possibilities for progress. Nothing in the general technology of experimentation so much needs to be investigated as this.

One widespread present practice is to attempt to secure equivalence by equating groups on the basis of general ability. If the experiment is concerned primarily with the physical effects of certain EF’s, the groups are equated on the basis of general physical ability determined by general physical measurements. If the experiment is concerned with the mental effects of the EF’s, groups are equated on the basis of general mental ability measured by some intelligence test or a series of educational tests.

Thus, if an experimenter were to equate on the basis of an intelligence test, he would select and apply to the pupils, who are otherwise known to be appropriate, some intelligence test. If the children are primary pupils, he may select and apply to the pupils one or more tests from among such intelligence tests for primary pupils as those by Pressey, Franzen, Otis, Haggerty, Dearborn, Trabue, Engel (Detroit), Myers, and others. Or if he can afford the time for testing he may select and apply to the pupils such individual intelligence tests as those by Goddard, Terman, Herring, Kuhlmann, Yerkes and Bridges, Witmer, and others.
If the children are elementary pupils, he may select and apply one or more such group intelligence tests as those by National Research Council, Haggerty, Otis, Dearborn, Pressey, Trabue, Myers, Buckingham and Monroe, and others, or such individual intelligence tests as those by Goddard, Terman, Herring, Kuhlmann, Witmer, Yerkes and Bridges. If the children are in high school he may select and apply such group intelligence tests as those by Otis, Terman, Dearborn, Trabue, Thurstone, and others. Individual intelligence tests for high school students are not very satisfactory. Group intelligence tests for college students have been prepared by Thorndike, Thurstone, and others. If elementary pupils are foreign, or have a special language handicap, such a group intelligence test as that by Pintner or Liu, or such an individual intelligence test as that by Pintner and Paterson, may be used. Thorndike has constructed group non-verbal intelligence tests for adults.

In selecting a series of educational tests to apply to pupils, the experimenter has a large range of choice from such reading tests as those by Thorndike-McCall, Monroe, Ayres-Burgess, Courtis, Gray, and others; from such arithmetic tests as those by Woody, Woody-McCall, Stone, Courtis, Buckingham, Monroe, and others; from such spelling tests as those by Ayres, Ayres-Buckingham, Ashbaugh, Starch, Morrison-McCall, Monroe, and others; from such composition scales as those by Trabue, Thorndike, Hudelson, Willing, Lewis, and others; from such handwriting scales as those by Ayres, Thorndike, Starch, Lister, and others; from such English form tests as those by Charters, Briggs, Starch, and others; from such geography scales as those by Courtis, Hahn-Lackey, and others; from such history tests as those by Harlan, Barr, Van Wagenen, Sackett, and others; and so on for other subjects of the elementary and high schools.
Or instead, the examiner may use certain test booklets which are combinations in a single booklet of a variety of educational tests or educational and intelligence tests. These omnibus tests frequently yield a single score on the entire booklet, thus avoiding the difficulty of combining separate scores. Illustrations of such omnibus tests are those by Buckingham and Monroe, Pintner, Chapman, Whipple, and others.

Whatever intelligence test is used, some sort of a score will result. The National Intelligence Test, for example, yields a point score, and the pupil making the largest number of points is considered to have the highest general mental ability. The Stanford Revision of the Binet-Simon Scale, on the other hand, yields a mental-age score, and the pupil making the highest mental age is considered to have the highest mental ability.

Suppose that forty pupils are to be divided into two equivalent groups on the basis of an intelligence test which yields a mental age. Suppose that the test to be used has been selected, ordered from the bureau which issues it, applied to the forty pupils according to the standardized directions sent with the test, and scored according to the standardized method of scoring. Suppose also that the resulting mental ages, when arranged in order of size, together with the chronological ages, are as shown in Table 1.

1 Descriptions, price lists, and samples of tests and the standard directions for the tests may be secured from such distributing centers as World Book Company, Yonkers, New York; Bureau of Publications, Teachers College, New York City; Russell Sage Foundation, New York City; Public School Publishing Company, Bloomington, Illinois; and C. H. Stoelting Company, Chicago, Illinois.

Technique of Pairing Pupils.—The division of pupils in Table 1 into two equivalent groups on the basis of mental age may be done by a common-sense pairing of the pupils.
Nevertheless certain helpful suggestions and cautions can be given.

TABLE 1
CHRONOLOGICAL AGES AND MENTAL AGES OF 43 SIXTH-GRADE PUPILS

Pupil  Chron. Age    Mental Age  |  Pupil  Chron. Age    Mental Age  |  Pupil  Chron. Age    Mental Age
  1    124           153         |   16    123           127         |   30    133           114
  2    136           144         |   17    138           126         |   31    139           114
  3    135           142         |   18    134           126         |   32    130           114
  4    136           140         |   19    129           126         |   33    131           113
  5    120           139         |   20    133           126         |   34    149           111
  6    [illegible]   139         |   21    140           126         |   35    133           108
  7    141           139         |   22    129           126         |   36    133           105
  8    128           137         |   23    135           125         |   37    140           105
  9    135           136         |   24    134           124         |   38    151           102
 10    139           135         |   25    123           124         |   39    [illegible]   101
 11    120           132         |   26    [illegible]   122         |   40    159           101
 12    126           129         |   27    129           122         |   41    160           100
 13    130           129         |   28    115           121         |   42    160            99
 14    133           128         |   29    136           115         |   43    149            92
 15    142           128         |

For one thing it will not be fully satisfactory to pair the pupils into groups thus:

Group I          Group II
Pupil 1   153    Pupil 2   144
Pupil 3   142    Pupil 4   140
Pupil 5   139    Pupil 6   139

Such a procedure operates to give Group I a higher average mental ability than Group II, as may be discovered by trying it. Rather the general procedure for pairing should be thus:

Group I          Group II
Pupil 1   153    Pupil 2   144
Pupil 4   140    Pupil 3   142
Pupil 5   139    Pupil 6   139

This method of pairing constantly tends to counteract the tendency to give one group a higher average ability than the other. But even when this last procedure is followed, the mean of the mental ages for one group may not be identical with the mean of the mental ages for the other group.

TABLE 2
THE PUPILS OF TABLE 1 DIVIDED INTO TWO GROUPS OF EQUIVALENT MENTAL AGE

      Group I                 Group II
Pupil   Mental Age      Pupil   Mental Age
  2       144             3       142
  5       139             4       140
  6       139             7       139
  9       136             8       137
 10       135            11       132
 13       129            12       129
 14       128            15       128
 17       126            16       127
 18       126            19       126
 21       126            20       126
 22       126            23       125
 25       124            24       124
 26       122            27       122
 30       114            29       115
 31       114            32       114
 34       111            33       113
 35       108            36       105
 38       102            37       105
 39       101            40       101
 42        99            41       100
Mean      122.45         Mean     122.5

By a special juggling of pupils two groups may be constituted which have practically identical means. But such juggling is seldom advisable.
Unless care is exercised, it is likely to result in an equivalence secured by pairing a gifted and an ungifted pupil with two average pupils. The means will be equated, to be sure, but the variabilities will be unequal. Such special juggling is helpful only when previously paired pupils exchange groups.

Certain modifications of the procedure recommended are desirable. These modifications are illustrated in Table 2. Pupil 1 is eliminated from the experiment entirely. His mental age is so much above that of any other pupil that he cannot be even approximately paired. The next pupil, namely, Pupil 2, is 9 points of mental age below him. If for administrative reasons Pupil 1 must be included in the experimental classes, he can still be eliminated from this and all subsequent experimental computation. Except for the influence his presence in one of the groups will have, he can become experimentally non-existent. Pupil 2 is substituted for Pupil 1. He pairs satisfactorily with Pupil 3, so the pairing continues according to rule until Pupil 28 is reached. Pupil 28 does not pair well with Pupil 29, hence Pupil 28 does not appear in Table 2. Pupil 29 appears in his place. The pairing continues without interruption until Pupil 43 is reached. Partly because he makes an odd number and partly because his inclusion in either group will be distinctly unfair to that group, owing to his low mental age, he does not appear in Table 2.

Thus far it has been assumed that the pupils in Table 1 are to be divided into two equivalent groups only.
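The two-group rule described so far can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pair_into_two_groups` is hypothetical, the mental ages are those of the forty pupils retained in Table 2, and ties in mental age are broken by pupil number, which reproduces the order of Table 1.

```python
from statistics import mean, pstdev

# Mental ages from Table 1, keyed by pupil number, for the forty pupils
# kept in Table 2 (Pupils 1, 28, and 43 are left out, as the text explains).
MENTAL_AGE = {
    2: 144, 3: 142, 4: 140, 5: 139, 6: 139, 7: 139, 8: 137, 9: 136,
    10: 135, 11: 132, 12: 129, 13: 129, 14: 128, 15: 128, 16: 127,
    17: 126, 18: 126, 19: 126, 20: 126, 21: 126, 22: 126, 23: 125,
    24: 124, 25: 124, 26: 122, 27: 122, 29: 115, 30: 114, 31: 114,
    32: 114, 33: 113, 34: 111, 35: 108, 36: 105, 37: 105, 38: 102,
    39: 101, 40: 101, 41: 100, 42: 99,
}


def pair_into_two_groups(scores):
    """Assign pupils, ranked from highest to lowest score, in the
    alternating order I, II, II, I, I, II, II, I, ..."""
    ranked = sorted(scores, key=lambda pupil: (-scores[pupil], pupil))
    group_1, group_2 = [], []
    for rank, pupil in enumerate(ranked):
        # Within every block of four ranks, the first and last go to Group I.
        if rank % 4 in (0, 3):
            group_1.append(pupil)
        else:
            group_2.append(pupil)
    return group_1, group_2


group_1, group_2 = pair_into_two_groups(MENTAL_AGE)
ages_1 = [MENTAL_AGE[p] for p in group_1]
ages_2 = [MENTAL_AGE[p] for p in group_2]

print(mean(ages_1), mean(ages_2))      # 122.45 122.5, the means of Table 2
print(pstdev(ages_1), pstdev(ages_2))  # variabilities should also be compared
```

Printing the standard deviations alongside the means follows the caution above: groups whose means agree may still differ in variability.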
The procedure for dividing them into three equivalent groups is as follows (each entry gives the pupil's number, followed by his mental age in parentheses):

Group I      Group II     Group III
 2 (144)      3 (142)      4 (140)
 7 (139)      6 (139)      5 (139)
 8 (137)      9 (136)     10 (135)

The procedure for equating four groups follows the same general principle, thus:

Group I      Group II     Group III    Group IV
 2 (144)      3 (142)      4 (140)      5 (139)
 9 (136)      8 (137)      7 (139)      6 (139)
10 (135)     11 (132)     12 (129)     13 (129)

Because of inequalities in room space or for other reasons, it may not be practicable to have an equal number of pupils in each group. If we assume that two-thirds of the pupils in Table 1 are to be in Group I and the remaining one-third in Group II, the procedure for equating would be as shown below. This assumption means that of every adjoining group of three pupils, two will go into Group I and one into Group II. The closest equivalence will be secured if the middle pupil of each group of three is placed in Group II, thus:

Group I      Group II
 2 (144)      3 (142)
 4 (140)
 5 (139)      6 (139)
 7 (139)

When one-fourth of the pupils are to be placed in one group and three-fourths in the other, the pupils come in groups of four instead of three, and hence there is no middle pupil. Of the first group of four pupils, namely, pupils 2, 3, 4, and 5, pupils 2, 4, and 5 may be placed in Group I and pupil 3 in Group II, and of the second group of four pupils, namely, pupils 6, 7, 8, and 9, pupils 6, 7, and 9 may be placed in Group I and pupil 8 in Group II. Thus in the first pairing, Group I gains a slight advantage, and, in the second pairing, Group II gains an equivalent advantage. This pairing by alternating advantage may be continued similarly for the remaining pupils.

The technique of equating groups on the basis of mental age has been discussed. The procedure for equating groups on the basis of point scores on an intelligence test is identical. The procedure is the same for equating groups on the basis of a series of educational tests.
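The three-group and four-group layouts above follow one pattern: each successive row of pupils is dealt out in the opposite direction from the row before it. A minimal Python sketch of that pattern follows; the function name `serpentine_groups` is hypothetical, and the sketch covers only groups of equal size, not the one-third or one-fourth divisions.

```python
def serpentine_groups(ranked_pupils, n_groups):
    """Deal pupils, already ranked from highest to lowest score, into
    n_groups, reversing the direction of dealing on every other row."""
    groups = [[] for _ in range(n_groups)]
    for index, pupil in enumerate(ranked_pupils):
        row, column = divmod(index, n_groups)
        if row % 2 == 1:
            # Odd-numbered rows run from the last group back to the first.
            column = n_groups - 1 - column
        groups[column].append(pupil)
    return groups


# Pupils 2 through 13 of Table 1, which are already in order of mental age.
four = serpentine_groups(list(range(2, 14)), 4)
print(four)   # [[2, 9, 10], [3, 8, 11], [4, 7, 12], [5, 6, 13]]

three = serpentine_groups(list(range(2, 11)), 3)
print(three)  # [[2, 7, 8], [3, 6, 9], [4, 5, 10]]
```

With `n_groups=2` the same function reduces to the alternating two-group rule illustrated in Table 2.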
The only difficulty likely to be met in equating on a series of educational tests, or in any situation where groups are being equated on the basis of more than one test, is the difficulty of properly combining the scores made by each pupil on the separate tests into a single score. The procedure required to deal with this difficulty will be described later in this chapter.

Groups Equated by Initial Status in Experimental Trait.—When groups are equated on the basis of measurement, the most convenient and perhaps the most frequent basis employed by experimenters is that of initial status in the experimental trait. This method is convenient because it is necessary in most experiments to give an initial test in order to measure the change produced by the EF. This provides, without additional labor, scores for the experimental subjects which may be used to divide them into two or more groups. The procedure for making this pairing is identical with that just described.

When the division of pupils into groups requires the actual physical shifting of pupils, the division must be made before the EF's are applied. When such shifting is not necessary, this detailed division is left until the EF's and FT's have been applied and the experimental computations have been started. Thus Pittman¹ wished to determine the relative efficiency of the zone system of supervision for rural schools as compared with the conventional system. One group was composed of the schools of one rural county and the other group of the schools of another rural county. Here it was not feasible to transfer pupils or schools from one county to another. What Pittman did was to make a rough initial equating by choosing two rural counties that were as nearly identical as possible in wealth, quality of population, quality of teachers, and so on. He applied the IT, appropriate EF, and FT to all the pupils in grades III through VIII in each county. At the conclusion of the experiment he arranged the pupils in one county in the order of the size of their scores on the IT. He did likewise with the pupils in the other county. He then eliminated from subsequent computations all the pupils in one group who could not be paired with an equivalent pupil in the other group. The remaining pupils constituted his two equivalent groups, and they were the ones used in computing changes produced by the EF's. Bennett, in a Maryland rural county, followed an identical procedure, except that he split one county into two roughly equivalent parts.

¹ Pittman, M. S., The Value of School Supervision; Warwick and York, Baltimore, 1921.

It would have been no advantage to Pittman or Bennett to equate groups immediately after the application of the IT. In fact it would have been a slight disadvantage. It would not have been possible to segregate the chosen pupils for the purpose of applying the EF or FT, and thereby save the waste effort of applying EF and FT to all pupils indiscriminately. So there would have been no gain here. On the other hand, there would have been a slight disadvantage in equating at the beginning, due to the fact that certain pupils selected for the experimental groups would have been absent at the time of the FT, thereby necessitating their ultimate elimination, together with the paired pupil in the other group. The paired pupil in the other group could have been retained only on condition that an equivalent pupil could have been found to take the place of the pupil who was absent for the FT. All this trouble was avoided by delaying the equating of groups until it was definitely determined what pupils remained throughout the experiment.

In sum, wherever the actual physical shifting of experimental subjects is not to take place, and, in addition, wherever the experimental subjects proper are not to be segregated for purposes of applying EF or FT, delayed equating is preferable to early equating of groups. Initial equating is essential or advisable wherever subjects are to be shifted or segregated.

In actual practice the equating of groups is sometimes not so simple as has been described, but the general principle is the same. Thus Pittman and Bennett both used many types of tests (reading, arithmetic, spelling, and so on) in order to get a rather thorough measurement of all the changes produced by each EF. Each of these dozen or so tests was applied both at the beginning and at the end of the experiment. Which type of test was used as the basis of equating? Pittman and Bennett employed each type in turn. Thus in comparing the amount of change in reading produced by each EF, the groups were equated on the basis of the initial scores in reading. When comparing the amount of change in arithmetic produced by each EF, the pupils employed were selected on the basis of the initial scores in arithmetic. This procedure meant, of course, that the composition of the experimental groups changed somewhat with each new equating, but the procedure assured an initial equivalence of groups in the experimental trait under consideration.

One additional suggestion may be given. The EF2 for Pittman's control group was merely the customary supervision. Since the application of EF2 involved no particular effort on Pittman's part, he used and tested many more pupils in his control group than in the other. By doing this he made it easy to find a pair for every pupil in the group to which EF1 was applied, thereby avoiding the necessity of discarding any of these pupils because of an inability to pair them.
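Delayed equating of the kind Pittman used can be sketched as a simple matching routine. The following Python is an illustration only, not a record of Pittman's own computations: the function name `delayed_pairs`, the `tolerance` parameter (how far apart two IT scores may lie and still count as equivalent), the greedy nearest-score rule, and the sample scores are all assumptions. Only pupils who took both the IT and the FT should be passed in.

```python
def delayed_pairs(ef1_scores, ef2_scores, tolerance=0):
    """Pair each EF1 pupil with the unused EF2 pupil whose initial-test (IT)
    score is nearest, provided the two scores differ by at most `tolerance`.

    Both arguments map a pupil identifier to an IT score. EF1 pupils for
    whom no equivalent EF2 pupil remains are simply left unpaired.
    """
    available = dict(ef2_scores)
    pairs = []
    # Work through the EF1 pupils in order of IT score, lowest first.
    for pupil, score in sorted(ef1_scores.items(), key=lambda item: item[1]):
        if not available:
            break
        nearest = min(available, key=lambda c: (abs(available[c] - score), c))
        if abs(available[nearest] - score) <= tolerance:
            pairs.append((pupil, nearest))
            del available[nearest]  # each EF2 pupil is used once only
    return pairs


# Hypothetical IT scores: a small EF1 group and a larger EF2 (control) pool.
ef1 = {"a1": 40, "a2": 52, "a3": 61}
ef2 = {"b1": 39, "b2": 41, "b3": 52, "b4": 53, "b5": 75}

print(delayed_pairs(ef1, ef2, tolerance=1))
# [('a1', 'b1'), ('a2', 'b3')]; pupil a3 finds no equivalent and is dropped
```

A greedy rule of this sort does not guarantee the largest possible number of pairs, but the larger the control pool, the less this matters, which is the practical point of testing many more pupils under EF2.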
Groups Equated by Composite of Several Tests.—Sometimes the experimenter desires to equate groups on the basis of more than one test. This requires the experimenter to make a composite of the scores on the various tests. To equate separately for general-ability tests seldom serves any useful purpose. To equate separately for each of several experimental tests does serve a useful purpose, but there is a certain inconvenience in having to alter the composition of the group from time to time during the experimental computation. To avoid this objection, some experimenters prefer to equate groups on the basis of a composite of the initial scores on all the experimental tests. This gives constancy in the composition of the groups and gives an approximate, if not an exact, equivalence for each experimental test, unless the traits are markedly different in nature. In sum, there are situations where equating by a composite of scores on several tests is desirable.

The process of computing a composite is illustrated for a small number of pupils in Table 3. The first vertical column gives the identification number for each pupil. The second, third, and fourth columns show the scores made by each pupil on a reading, an arithmetic, and a spelling test respectively. Beneath each of these columns appears a measure, the standard deviation (S.D.), of the variability among the scores of that particular column.

TABLE 3
ILLUSTRATING THE COMPUTATION OF A COMPOSITE SCORE WHERE EACH TEST RECEIVES EQUAL WEIGHT

Pupil   Read.   Arith.   Spell.   Read.      Arith.     Spell.     Composite
                                  Weighted   Weighted   Weighted
  1      64      13       24        64         65         48         177
  2      68       9       21        68         45         42         155
  3      46       9       17        46         45         34         125
  4      54      14       27        54         70         54         178
  5      54      10       13        54         50         26         130
  6      72      12       20        72         60         40         172
  7      52      13       13        52         65         26         143
  8      43      11       24        43         55         48         146
  9      72      14       22        72         70         44         186
 10      46      12       18        46         60         36         142
 11      50      10       20        50         50         40         140
 12      46      11       21        46         55         42         143
 13      68      13       23        68         65         46         179
 14      61      13       26        61         65         52         178
 15      46       8       12        46         40         24         110
 16      64      11       28        64         55         56         175
 17      46      14       15        46         70         30         146
 18      43       9       15        43         45         30         118
 19      46       8       23        46         40         46         132
 20      56      13       25        56         65         50         171
S.D.      9.8     2.0      4.8       9.8       10.0        9.6
Mult.     1       5        2

The first step in the determination of the composite scores shown in Table 3 was to compute some measure of variability, in this case S.D. Any other standard measure of variability, such as mean deviation, median deviation, or quartile deviation, can be used instead. The computation of the S.D. for a series of scores is illustrated in Table 15 and Table 16 and explained in the adjoining text.

The second step was to select multipliers which would give equal weight to each test. Just what weight should be given each test in determining a composite depends upon the conditions encountered in the situation; but once a decision has been reached, the procedure for selecting the multipliers which will effect this weighting should utilize some measure of variability, in this case S.D. That is, tests are weighted according to their variabilities and not, as naive common-sense would indicate, according to their means. For example, ordinary common-sense would lead us to suppose that Test I below has more influence than Test II in determining a pupil's relative position in the composite of the two tests, because its mean is relatively much larger. But as a matter of fact, Test II has the more weight because its variability is relatively larger. It has exactly ten times as much weight because its variability is ten times that of Test I. Mere inspection of the composite of the two tests shows that Test II has a large influence upon the composite and that Test I has only a negligible influence. The order of the composite scores is the order of the scores in Test II.
Pupil   Test I   Test II   Composite
  a      1000      40        1040
  b      1001      30        1031
  c      1002      20        1022
  d      1003      10        1013
  e      1004       0        1004
Mean     1002      20

The two tests can be given equal weight either by multiplying all the scores of Test I by 10 or by dividing all the scores of Test II by 10. Either procedure will make their variabilities equivalent. To illustrate this point, the scores of Test II are divided by 10 in the following:

Pupil   Test I   Test II   Composite
  a      1000       4        1004
  b      1001       3        1004
  c      1002       2        1004
  d      1003       1        1004
  e      1004       0        1004

All this means that if the three tests in Table 3 are to be given equal weight, such multipliers must be selected and used on the test scores as will make their variabilities equal. A multiplier of 1 for reading, of 5 for arithmetic, and of 2 for spelling will alter their S.D.'s to 9.8 for reading, 10.0 for arithmetic, and 9.6 for spelling, as shown in Table 3. These variabilities are sufficiently equivalent for practical purposes. By the use of fractional multipliers they can be made exactly equivalent.

The multipliers just selected are not the only possible ones. Equivalence of variability can be secured just as well by multiplying reading by ½, arithmetic by 2½, and spelling by 1, or by many other combinations. As a rule it is most convenient to select only whole numbers for multipliers or divisors, and to select as small numbers as possible.

Thus far it has been assumed that the three tests are to receive equal weight. This is not necessary. Any desired weight may be given. Thus if it is desired to give reading twice as much weight as spelling and spelling two-and-a-half times as much weight as arithmetic, all the multipliers will be 1, because the variabilities of the three tests are in this ratio originally.
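The weighting just described can be checked with a short Python sketch using the scores of Table 3. Two assumptions are made here and should be read as such: the S.D. is taken as the root-mean-square deviation from the mean (Python's `pstdev`), which reproduces the 9.8, 2.0, and 4.8 printed in Table 3, and the helper names `composite` and `equalizing_multipliers` are hypothetical.

```python
from statistics import pstdev

# Raw scores of the twenty pupils in Table 3.
READING = [64, 68, 46, 54, 54, 72, 52, 43, 72, 46,
           50, 46, 68, 61, 46, 64, 46, 43, 46, 56]
ARITHMETIC = [13, 9, 9, 14, 10, 12, 13, 11, 14, 12,
              10, 11, 13, 13, 8, 11, 14, 9, 8, 13]
SPELLING = [24, 21, 17, 27, 13, 20, 13, 24, 22, 18,
            20, 21, 23, 26, 12, 28, 15, 15, 23, 25]
TESTS = (READING, ARITHMETIC, SPELLING)


def composite(tests, multipliers):
    """Multiply each test by its multiplier and add across tests per pupil."""
    return [sum(m * score for m, score in zip(multipliers, pupil_scores))
            for pupil_scores in zip(*tests)]


def equalizing_multipliers(tests, target_sd=10.0):
    """Fractional multipliers that give every test exactly the same S.D."""
    return [target_sd / pstdev(scores) for scores in tests]


raw_sds = [round(pstdev(scores), 1) for scores in TESTS]
print(raw_sds)                      # [9.8, 2.0, 4.8]

table_3 = composite(TESTS, (1, 5, 2))
print(table_3[:3])                  # [177, 155, 125]

exact = equalizing_multipliers(TESTS)
```

The whole-number multipliers 1, 5, and 2 only approximately equalize the variabilities; `equalizing_multipliers` shows how fractional multipliers make them exactly equal, at the cost of less convenient arithmetic.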
If it is desired to give arithmetic twice the weight of reading, and reading twice the weight of spelling, the multiplier for arithmetic will be 10, for reading 1, and for spelling 1, or other multipliers which will as satisfactorily effect the weighting desired.

The third step in determining a composite is to multiply the respective series of test scores by the multiplier selected for that test. Thus, in Table 3, all the reading scores are multiplied by 1, all the arithmetic scores by 5, and all the spelling scores by 2. The products are shown in columns 5, 6, and 7.

The final step in computing a composite is to add the weighted scores for the various tests for each pupil. Thus, in Table 3, the addition of weighted scores 64, 65, and 48 yields a composite of 177. From this point the procedure for equating groups has already been described.

Groups Equated by Preliminary Rate of Growth.—There are competent experimenters who contend that the best index of future rate of growth, or of possibilities for future growth, is current rate of growth. They advise, therefore, that the experimenter test his experimental pupils at intervals preceding the experiment in order to determine the rate at which each pupil is developing in the experimental trait. Once this rate has been determined, pupils may be paired on this basis.

But we cannot be certain that equating by current rate of growth is superior to, say, equating by initial status in the trait in question. The latter is pairing by actual rate of growth as truly as is the former. The former means pairing by rate of growth as determined for a necessarily relatively brief time, whereas the latter means pairing by rate of growth measured from birth to the present. The greater accuracy of the rate-of-growth method of equating is, then, somewhat dubious, and its greater inconvenience is certain.
As a result, the method is not likely to come into general use until its superiority has been definitely established by investigation. The most relevant study thus far conducted, namely, that by Hollingworth,¹ was planned for another purpose.

¹ Hollingworth, H. L. and L. S., Vocational Psychology, D. Appleton and Company, New York.

Besides those already discussed, there are many other bases which may or may not be worthy of consideration, depending upon the nature of the experiment. Among these the following may be mentioned: chronological age, physiological age, social age, previous training, and home environment in case this last cannot be controlled experimentally. Any one or all of these may exercise an influence in determining a pupil's possibilities for growth in the trait in question.

Groups Equated by Multiple Bases.—Any one basis for equating groups is bound to fall short of complete satisfaction, because it is necessarily inadequate. A human mechanism is exceptionally complex. Any one basis taps only a phase of this total mechanism. A perfect prophecy can be made only when every phase of this mechanism is properly measured and properly weighted.

Again, any one basis fails to give complete satisfaction because of the intricate dependence of one basis upon another or of one part of the human mechanism upon another. It will be sufficient to cite two simple illustrations of this dependence. An intelligence test shows two pupils, A and B, to have identical mental ages, namely 12 years each. May they be paired with reasonable assurance that the two will progress at equal rates in the future, except for differences in effectiveness of the EF's? Perhaps two groups can be equated on this sole basis provided the number of pupils is large. But two pupils cannot be equated without taking other factors into consideration.
If, for example, Pupil A is 10 years old chronologically, and Pupil B 12 years old chronologically, they are not equivalent pupils. Pupil A has progressed mentally since birth much faster than has Pupil B, for he has progressed in 10 years as far as Pupil B in 12 years. The conventional method for expressing this rate of mental growth is the Intelligence Quotient, computed by dividing mental age by chronological age, and by multiplying the quotient by 100. Thus the Intelligence Quotient for Pupil A is (12 ÷ 10) × 100, i.e., 120, whereas that for Pupil B is (12 ÷ 12) × 100, i.e., 100.

But the fact that they cannot be paired because their Intelligence Quotients are different does not mean at all that they can be paired if their Intelligence Quotients are identical. A ten-year-old pupil with a mental age of 10 years may not be equivalent to a fourteen-year-old pupil with a mental age of 14 years, even though both have Intelligence Quotients of 100. This means that equating is improved by pairing pupils who are alike both in mental age and Intelligence Quotient or, stated more conveniently, who are alike in both mental age and chronological age. In similar manner, chronological age conditions all the bases for equating groups.

For a second illustration of this dependency of one basis upon another, we may take the case of the dependence of initial status in the experimental trait upon previous training. Two pupils who have like initial scores in the experimental trait may have widely different promise for future rate of growth. One may have attained his initial status after much training and the other after little training. In the case of the former pupil, a low score probably means a low physiological limit of growth and hence little promise for the future. In the latter case a low score probably means a high physiological limit and hence great promise for the future.
In similar manner, a high score may mean great promise or little promise, depending upon the amount of training required to produce the high score.

Wherever feasible, then, groups should be equated on as many bases as possible. Pupils should be paired who are alike in initial status in the experimental trait, in mental age, in chronological age, in home environments, in sex, in race, and so on for all significant bases. In actual practice, pairing is seldom done on more than three bases, namely, initial status in experimental trait, mental age, and chronological age. Pairing is usually done on just one basis, initial status in the experimental trait or mental age, with the preference for the former.

Equating is usually done on just one basis, first, because every increase in the number of bases employed reduces the number of pupils who can be satisfactorily paired from a given total number of pupils; and, second, because equating on one basis tends to make the groups have approximately equivalent means and variabilities on any other basis, even though particular pupils do not pair on all the bases. The existence of this latter tendency is due both to the positive correlation likely to obtain between desirable bases and to the operation of chance. Those who equate on a variety of bases rarely insist that paired pupils be identical on the various bases. Rough equivalence is all that is ever secured. Even where equating is done on one basis only, it is frequently possible to increase the equivalence on some other bases merely by shifting paired pupils from one group to the other.

Mason D. Gray has called attention to a unique difficulty in equating two groups. Because of the close correlation between intelligence and vocabulary, we would expect normally that two groups which have been equated on the basis of intelligence would be found thereby to have been equated, at least approximately, on the basis of vocabulary.
But Gray reports that when a group which has elected high-school Latin is equated on the basis of intelligence with a group which has not elected Latin, the Latin group has a higher vocabulary ability than the non-Latin group. It is highly improbable that such would be the case if both groups were indiscriminately mingled and if students were assigned by the experimenter to the Latin EF and the non-Latin EF without regard to students' preferences. In general, the experimenter needs to be particularly alert in equating groups which have been divided previously on the basis of some intrinsic psychological difference between them.

Groups Equated by the A. Q. or F Technique.—Whenever possible, groups should be equated. Whenever conditions do not permit this, it is possible to equate pupils statistically by means of the A. Q. or F technique. The effect of these techniques is to take a group, no matter what its ability, whether high, average, or low, and convert it into a standard group. The underlying principle of the A. Q. or F techniques is that they demand of each pupil a progress commensurate with his brightness, and provide a formula for testing whether progress has been commensurate with capacity to progress. A class with low capacity is asked to make a defined amount of progress in a defined time. A class with high capacity is asked to make a proportionately greater progress. If each group under its own EF just exactly makes its expected progress, both EF's may be considered of equal effectiveness.

Suppose that the experimental trait is reading. Then the equivalent-groups formula becomes:

S1 — (Initial A. Q. — EF1 — Final A. Q. — A. Q. Change)
S2 — (Initial A. Q. — EF2 — Final A. Q. — A. Q. Change)

where

Initial A. Q. = Initial reading age ÷ Initial mental age
Final A. Q. = Final reading age ÷ Final mental age

The computation of reading age is explained by the directions booklet which accompanies the Thorndike-McCall Reading Scale.¹ The computation of mental age is explained in Terman's "The Measurement of Intelligence."² The final reading age will have to be determined by a retest. The final mental age may be determined statistically without a retest, due to the fact that a pupil's Intelligence Quotient, i.e., mental age divided by chronological age, is fairly constant. The final mental age may be computed by means of the following formula:

Final mental age = Initial mental age + (Initial mental age ÷ Initial chronological age) × the number of months between initial and final reading tests.

¹ Issued by the Bureau of Publications, Teachers College, New York City.
² Houghton Mifflin Company, Boston.

The computation of mental age presents no difficulty if such tests as the Stanford Revision of the Binet-Simon Scale or the Herring Revision of the Binet-Simon Scale are used. These tests yield a score in terms of mental age. If some other intelligence test which yields point scores is used, these point scores can be transmuted into approximate mental ages, provided age norms are available. Tentative age norms for a few ages on the National Intelligence Test, Form A, are given below. A pupil's score of 90 is equivalent to a mental age of 138. A score of 75 is equivalent to a mental age of 126. A score of 95.5 is equivalent to a mental age of 144.

Chronological age in years          10½    11½    12½    13½
Chronological age in months         126    138    150    162
National Intelligence Test norms     75     90    101    112

The computation of reading ages is provided for in the directions which accompany the Thorndike-McCall Reading Scale.
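The transmutation of point scores into mental ages, and the estimate of final mental age, can be sketched in Python as follows. The sketch assumes straight-line interpolation between adjacent norms, which agrees with the three examples given above (75, 90, and 95.5); the function names are hypothetical, and the pupil in the last example is invented for illustration.

```python
# Tentative National Intelligence Test (Form A) age norms quoted in the text.
NIT_NORM_SCORES = [75, 90, 101, 112]
NIT_NORM_AGES_MONTHS = [126, 138, 150, 162]


def interpolate(x, xs, ys):
    """Straight-line interpolation of x within the ascending list xs."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("score lies outside the range of the available norms")
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            fraction = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + fraction * (ys[i + 1] - ys[i])


def mental_age_from_nit(point_score):
    """Approximate mental age in months for a National Intelligence Test score."""
    return interpolate(point_score, NIT_NORM_SCORES, NIT_NORM_AGES_MONTHS)


def final_mental_age(initial_mental_age, initial_chron_age, months_elapsed):
    """Final mental age, assuming the ratio of mental age to chronological
    age stays constant over the course of the experiment."""
    return initial_mental_age + (initial_mental_age / initial_chron_age) * months_elapsed


print(mental_age_from_nit(90))    # 138.0
print(mental_age_from_nit(95.5))  # 144.0

# Invented pupil: mental age 144 months, chronological age 120 months,
# with ten months between the initial and final tests.
print(final_mental_age(144, 120, 10))
```

Scores below the lowest norm or above the highest raise an error rather than being extrapolated, since the text supplies norms for only a few ages.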
Reading ages on other reading tests, spelling ages, arithmetic ages, etc., may be computed, provided age norms are available, by simply transmuting point scores on some reading test, spelling test, or arithmetic test into reading ages, spelling ages, or arithmetic ages respectively, as has just been illustrated for the National Intelligence Test.

Unfortunately most educational tests report grade norms rather than age norms. Even so, approximate age scores may be computed by substituting for each grade its chronological age equivalent. The first two rows of the data shown below will be the same regardless of the test which appears in the third row. The third row will vary with the test. In the following case, a point score of 37.8 on the Ayres Spelling Scale, 10 words each from columns L, O, Q, S, U, and W, becomes a spelling age of 141. A point score of 50.3 becomes a spelling age of 167. A point score of 49 becomes a spelling age of 161.

End of grade                            I     II    III    IV     V      VI     VII    VIII
Approx. ch. age equivalent of grade     89    102   115    128    141    154    167    180
Ayres Spelling Test grade norm                      19.6   30.4   37.8   47.7   50.3   54.4

The computation and use of reading age, spelling age, mental age, A. Q., and the like, when age norms are available and when only grade norms are available, is discussed more fully in "How to Measure in Education."²

² The Macmillan Company, New York City.

F has the same function and significance as A. Q. Tests scaled according to the age-scale system use A. Q., whereas tests scaled according to the T-Scale system use F. These two scale systems will be described in Chapter V. In case F is used in place of A. Q., the equivalent-groups formula becomes:

S1 — (Initial F — EF1 — Final F — F Change)
S2 — (Initial F — EF2 — Final F — F Change)

As will be explained more fully in Chapter V, F, in case the experimental trait is reading, is computed thus:

Initial F = Initial reading T − Initial intelligence T
Final F = Final reading T − Final intelligence T

The initial and final reading T require the application of both an initial and a final reading test; whereas the final intelligence T may be computed from the initial intelligence T, through the use of each pupil's B or brightness score. The steps in the process are: (1) Compute the pupil's B score. Assume that the pupil's T score is 38 and that his age is exactly 10 years, 0 months. Then, by Table 11 (p. 109), his B score is 38 + 12, i.e., 50. (Assume that Table 11 is for the intelligence test in question.) (2) If the experiment continues ten months, locate in Table 11 the B correction corresponding to this pupil's age ten months later. Ten months later he will be aged 10 years and 10 months. The B correction for this age is 8. Were the experiment to run for four months the B correction would be 10. Assume the experiment to run 10 months. (3) Subtract this B correction of 8 from the initial B score of 50. The result is 42, which is the desired final intelligence T, required to compute the final F. The final B correction of 8 is subtracted from the initial B score, even if the caption at the top of Table 11 says "add." In transmuting a T score into a B score, add the B correction when the caption says to add and subtract the B correction when the caption says to subtract. But in transmuting a B score back into a T score reverse the process.

The Thorndike-McCall Reading Scale yields a T score directly just as certain tests yield an age score directly.
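The three steps for obtaining the final intelligence T, and from it F, can be traced in a short Python sketch. Only the three B corrections quoted in the worked example above are used; the remainder of Table 11 is not reproduced, so the lookup `B_CORRECTION` is a hypothetical excerpt, all three corrections are treated as "add" corrections, and the reading T of 45 in the last line is invented for illustration.

```python
# B corrections quoted in the worked example, keyed by age in months
# (10 yr. 0 mo., 10 yr. 4 mo., 10 yr. 10 mo.). A full version would hold
# every age listed in Table 11 for the intelligence test in question.
B_CORRECTION = {120: 12, 124: 10, 130: 8}


def b_score(t_score, age_months):
    """Step 1: transmute a T score into a B score by adding the correction."""
    return t_score + B_CORRECTION[age_months]


def final_intelligence_t(initial_t, initial_age_months, months_elapsed):
    """Steps 2 and 3: take the correction for the pupil's later age and
    subtract it from his initial B score, which reverses the T-to-B step."""
    b = b_score(initial_t, initial_age_months)
    return b - B_CORRECTION[initial_age_months + months_elapsed]


def f_score(reading_t, intelligence_t):
    """F is the reading T minus the intelligence T of the same date."""
    return reading_t - intelligence_t


print(b_score(38, 120))                   # 50
print(final_intelligence_t(38, 120, 10))  # 42
print(final_intelligence_t(38, 120, 4))   # 40
print(f_score(45, final_intelligence_t(38, 120, 10)))  # final F for an invented reading T of 45
```

The sketch treats the B score as fixed for the pupil throughout the experiment, which is the assumption that lets the final intelligence T be computed without a retest.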
The process for utilizing age or grade norms for converting scores on any test into age scores has just been described. The following shows the approximate T-score and B-correction equivalents of age scores for any mental or educational test. The T and B equivalents for intervening ages may be determined by simple interpolation.

Age (years)   T score       B correction
 6½           [illegible]    50
 7½           [illegible]    37
 8½           [illegible]    25
 9½           32             18
10½           39             11
11½           44              6
12½           50              0
13½           53             -3
14½           57             -7
15½           [illegible]   -13
16½           [illegible]   -20
17½           [illegible]   -27

Equating groups through the A. Q. or F technique assumes that rate of growth in the trait in question will be proportional to intelligence, except for the differing effects of the two EF's. This assumption is justified when the trait in question is a general mental function like reading, spelling, arithmetic, geography, etc. The assumption is of doubtful validity for specialized mental functions. Specialized prophetic tests may be available some day for such specialized mental functions.

CHAPTER IV

CONTROL OF EXPERIMENTAL CONDITIONS

Constant vs. Variable Irrelevant Factors.—In the actual conduct of an experiment an experimenter must contend with both constant and variable irrelevant factors. Variable irrelevant factors do not particularly annoy the experimenter. They are chance influences which operate favorably as frequently as they operate unfavorably for a particular EF. A multitude of such factors are unavoidably playing upon experimental pupils throughout even the best controlled educational experiments. In the long run, their net effect is zero. The net result of constant irrelevant factors, on the contrary, is not a zero facilitation or inhibition of a particular EF. They are any undesired influences whose net result is favorable or unfavorable to some EF.

An experimenter may ignore truly variable irrelevant factors, but he cannot ignore significant constant irrelevant factors.
He must either eliminate them, or else determine the amount of their influence and allow for it in computing the amount of change produced by the EF in question. The ability to detect and eliminate constant irrelevant factors is one of the distinguishing marks of a sagacious experimenter. This chapter will be devoted to an enumeration of the more common constant irrelevant factors, and to suggested methods of eliminating them. This list should be studied not with the idea that it is complete or that every factor listed would be a constant error in every situation. Mere maturing, for example, introduces a constant error in experiments whose object is to determine the amount of change due directly to an EF, whereas its influence may be ignored in experiments whose object is to determine the relative effectiveness of two or more EF’s.

The purpose of this chapter is the amplification and illustration of the fundamental principle of experimentation—that changes in experimental subjects due to irrelevant factors should be eliminated, equated, or accurately measured and discounted. The importance of any irrelevant factor varies with the amount of its contribution to each EF, where the purpose of the experiment is to determine the amount of change in experimental subjects due directly to each EF, and varies with the difference in amount of its contribution to each EF, where the purpose of the experiment is to determine the relative effectiveness of two or more EF’s.

Errors Due to Bias of Experimenters.—Conscious or unconscious manifestation of bias on the part of an experimenter is a common constant error. This constant irrelevant factor is of special significance because there are so many points in an experiment where an experimenter’s bias can influence the final conclusion. Of course anyone who consciously favors unfairly in any way any EF, is mentally incompetent to conduct experiments.
He is, to say it less politely, an experimental cheat. He is employing the appearance of experimentation to secure a readier acquiescence on the part of others to his own emotional prejudice. Conscious bias is so human as to be sometimes unavoidable. But to be biased is one thing; consciously to allow this bias to modify experimental arrangements is quite another.

A manifestation of unconscious bias is far more likely to occur. It is extremely difficult for an experimenter to remain exactly neutral. With some individuals, conscious bias for a particular EF will cause them to favor it unconsciously. Other individuals will be so meticulously careful to avoid favoring a favorite EF as actually to favor the contrasted EF. Impressed by the conflicting results obtained from various investigations of the amount and nature of sex differences, Cattell caustically remarked that the sex differences discovered depended upon the sex of the investigator.

In many experiments it is possible to take certain precautions against manifestations of a possible bias. Thus, Poffenberger, in his experiments to determine the mental effect of doses of strychnine, numbered the capsules. He then proceeded to forget just which did and which did not contain strychnine. He did not refresh his memory until the experiments had been concluded, tests given and scored, etc. Pittman, in pairing pupils at the end of his experiment with the zone system of supervision, covered up the final scores of pupils, lest he show a possible bias by pairing with knowledge of the amount of change produced by each EF. Another investigator wished to determine whether judges varied more in judging the merits of compositions containing much originality than in judging specimens containing little originality.
This investigator was careful to choose the specimens containing much and those containing little originality before securing, much less consulting, the judgments of merit. By a system of key numbers and by other devices it is possible in many experiments to reduce the opportunities for bias to manifest itself.

Errors Due to Bias of Assistants.—Skepticism regarding conclusions where adequate supporting data are not produced, and the reverse mental attitude where data are produced, are eminently desirable traits. Such skepticism or enthusiasm is on the increase in education, and this increase should receive every encouragement. But there is a lop-sided skepticism or enthusiasm which is really nothing more than irrational prejudice. Many who pride themselves upon their insistence upon proof are really priding themselves upon an irrational prejudice for one alternative, usually the present practice, and an equally irrational prejudice against the other alternative. The experimenter, in organizing cooperative experimentation, will meet both varieties among teachers, supervisors, superintendents, or other experimental assistants. There is some hope that the rational skeptic or enthusiast will subordinate his preferences to the objects of the experiment. There is little hope that the irrational individual will be able to do so. Neither variety makes an ideal experimental assistant. The ideal assistant is one who is genuinely uncertain as to which EF is superior.

The way to avoid bias upon the part of assistants depends upon the experiment. But certain common precautions may be listed. One way is to avoid assistants who have a bias, or where they cannot well be avoided they may be eliminated from all computations. This avoidance or elimination may be employed provided the experimenter has some objective way to determine which assistants will manifest or have manifested bias.
Lacking such objective data the experimental assistants chosen may manifest merely the experimenter’s own bias. Any assistant who confesses to a preference may reasonably be assumed to hold such a preference.

Another way to avoid bias is to equate it. This can be done, roughly at least, by using as many assistants who are favorable to one EF as there are assistants favorable to the other EF or EF’s. Such an equating may prove satisfactory in experiments whose only object is to determine the relative effectiveness of two or more EF’s. The procedure for equating teachers or other assistants is, in general, like that for equating groups of pupils.

Finally, something may be accomplished by impressing upon assistants the necessity for experimental neutrality in thought and deed, and by providing them with detailed typewritten instructions as to what to do. Few realize the extraordinary difficulty of maintaining perfect self-control, particularly where a preference has already developed. The careless assistant is in danger of manifesting the preference and the conscientious assistant of going to the other extreme. The provision of detailed instructions will tend to minimize such manifestations.

Bound up with this problem of bias is the whole question of just how much effort should be expended upon each EF. A fundamental principle of experimentation is that there should be an accurate measurement of the amount of the experimental factor. Thus in the physical sciences, a common procedure is to add an EF of defined amount and measure the result, or subtract an EF of defined amount and measure the result, or both add and subtract in succession an EF of defined amount and measure the result, or both add and subtract in succession an EF of varying amounts and measure the changing results with each increase or decrease in the amount of the EF.
Probably the greatest defect in educational experimentation is the inability, in most cases, to measure accurately the amount of presence of an EF. Further, there is some, though meager, evidence that maximum effort can be maintained more constantly than any effort lower than maximum. These facts and probabilities would lead one to infer that it is better, not only educationally but experimentally, to aim at maximum effort all the time for each EF.

Though evidence on this question is meagre, there is some reason to believe that the mere process of experimenting with new methods or materials of instruction, attracts such attention to the traits in question as to cause an unconscious concentration, both on the part of teacher and pupils, upon progress in these traits. As a result, it is supposed that a large temporary effort is called forth, thus causing a large but artificial growth, and that this artificial effort will evaporate if the novel methods or materials were used term after term. Consciousness of the possibility of such bias may help the experimenter to avoid it, but the only sure way to determine whether ephemeral effort has been evoked is to continue the experiment for a considerable period. If each succeeding term shows a flagging of effort and an elimination or reduction of superiority, the existence of such ephemeral effort may be assumed.

Errors Due to Differences in Teaching Skill.—Research on a large scale frequently requires cooperation on the part of many superintendents, supervisors, and teachers. My own experience in such work has been one continuous surprise as to the trouble members of the educational profession will take to cooperate fully in scientific research. Still, one finds occasional instances of unwilling teachers or superior officers.
The trouble with such individuals from an experimental standpoint is that they will inadequately apply a particular EF and be careless about maintaining desired experimental conditions in general.

Again, there are wide differences in teaching skill or supervising skill. If one group is taught by an unskillful teacher according to one EF and another equivalent group is taught by a skillful teacher according to another EF, any difference in the change produced may be due to a difference in teaching skill rather than a difference in effectiveness of the contrasted EF’s. This difference may be due to the operation of special forces or to a real difference in skill. Thus one experimenter grumbles that one of his EF’s did not have a fair chance because so many of the teachers who were assigned to apply this particular EF turned out to be bride-teachers. Another experimenter found that one EF had suffered from more frequent changes of teachers than the other EF. Still another experimenter found that substitute teachers were more frequent under one EF than another.

The experimenter must attempt, then, to avoid experimental errors due to a difference in general unwillingness, and a difference in general capability on the part of assistants. He must guard also against errors due to peculiar fitness or unfitness for applying an EF. The general efficiency of two teachers, for example, may be equal. But one may be peculiarly unskilled in the teaching of arithmetic. This special disability makes it unwise to use her for applying some EF whose object is to increase pupils’ ability in arithmetic. The other EF applied by the other teacher has an advantage, or if the same teacher applies both EF’s, it is possible that her special abilities and disabilities favor one EF and handicap another.

Five general methods have been employed for avoiding or reducing experimental errors due to a difference in, say, teaching skill.
One method is to equate the skill of the teachers assigned to each EF. This pairing of teachers is done on the basis of some pre-experimental measurement of each teacher’s efficiency of teaching. These measurements may be by means of objective tests or may be judgments of supervisory officers.

A second method is to equate teachers by chance. To do this means that the experiment must be conducted in numerous classes to insure that chance will provide equivalence in teaching skill. This method is very laborious but it increases the probability of securing both equivalence and representativeness of teaching skill.

A third method is the departmental method, namely, to have the same teacher apply both or all EF’s; then, generally superior teachers will be equally favorable to each EF, and the generally inferior teachers will be equally unfavorable to each EF.

A fourth method is to have two teachers divide the work of two classes. Thus when the New York State Commission on Ventilation was contrasting two EF’s on two equivalent classes in a public school in New York City, the two classes were placed in adjoining rooms, one teacher teaching half the studies to both groups, and the other teacher teaching the other half to both groups.

A fifth method is to rotate the teachers so that each EF has every teacher. To illustrate how this can be done there is repeated below the formula for a rotation experiment. It may be observed that the teacher of S1 will appear under each EF, and the teacher of S2 will appear under each EF, thereby equating any difference in general teaching skill.

S1 — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
S2 — (IT1 — EF2 — FT1 — C3) — (IT1 — EF1 — FT1 — C4)

It is useful for the experimenter to distinguish in this connection two varieties of experimental situations. In one variety the teacher applies the EF while giving the general instruction to her class at the same time.
In the other variety the teacher, as before, gives the general instruction, but the specific EF is applied by some person other than the teacher. If the EF’s contrasted are project method and conventional method of teaching, or one method of teaching spelling and another method of teaching it, it is probable that the teacher will be asked to apply the EF’s. Here unusual care should be exercised to equate or eliminate any difference in teachers’ skill. If the EF’s contrasted are one type of motion picture and another type of motion picture, there is considerable likelihood that the experimenter himself or non-teaching assistants will apply the EF’s. Here again difference in teachers’ skill may be important, particularly if the motion pictures deal with portions of the regular curriculum, but it is much less important than where the teachers apply the EF’s, because the teachers will have relatively less influence upon the changes of the pupils in the experimental trait. But as the teachers’ importance grows less, the experimenter’s or non-teaching assistants’ importance increases, in accordance with the general principle stated at the opening of this chapter, namely, that the importance of an irrelevant factor varies with the amount of its contribution to each EF, or to the difference in the amount of its contribution to the various EF’s.

Errors Due to Bias of Subjects.—Bias on the part of experimental subjects is just as disturbing to an experiment as bias on the part of the experimenter or his assistants. Such bias comes about in many ways. A popular teacher will make it known to the pupils that an experiment is under way and consciously or unconsciously reveal her own preference. The pupils, as a consequence, will strive to make the experiment come out happily for their teacher. An unpopular teacher under similar circumstances provokes an antagonism toward the EF which she prefers.
Again, a teacher, an experimenter, or certain circumstances surrounding the experiment will reveal to pupils that two groups are being compared. This information, apart from any preference for or antagonism toward their teacher, may engender an undesired rivalry between the two groups. In case the information leaks out to only one group the resulting stimulus to this group might well prove decisive.

The best way for an experimenter to avoid a bias is to keep himself, when possible, in ignorance of just when he is applying a particular EF, or scoring tests for a particular experimental group, and so on for the other experimental processes where his bias would be likely to affect results. The best way to avoid bias on the part of assistants is to keep them in ignorance of the objectives of the experiment. An experiment with two varieties of ventilation was conducted in two schoolrooms for a full year without either of the two teachers discovering just what the EF’s were. It is even more important and fortunately easier to keep pupils in ignorance of the nature of the EF’s and, if possible, of the fact that an experiment is in progress. Certainly one group should not be informed and the other kept in ignorance.

Research is such an eminently individual and original process that it is well-nigh impossible to lay down certain principles of procedure without calling attention to possible exceptions. There are situations where it is really desirable that pupils be informed, in a measure, that something unusual is taking place. Pittman, in one of his investigations, went so far as to issue a bulletin to the pupils of one of his two equivalent groups telling them he wished to see just how much progress they could make.
In an experimental evaluation of the worth of using standard tests in the teaching of reading, the writer set up for one group of the experimental pupils definite objectives in reading, gave them their scores on periodic tests in order that they might see how nearly they were attaining these objectives. This was not done for the other experimental group. And yet neither Pittman nor the writer introduced thereby any constant irrelevant factor. These were legitimate portions of one of the EF’s. The use of a bulletin by Pittman was a portion of his plan for increasing the progress of the pupils. The employment of definite reading objectives and the periodic reporting of scores by the writer were made possible by the use of standard tests, and were some of the advantages of the use of standard tests. Objectives and scores could not be reported to the other groups, either because the EF did not call for them or because standard tests were not employed with them.

On the other hand, it would not have been legitimate for either of us to tell these same experimental groups that their progress was to be compared with that of another equivalent group and that we hoped they would win in the contest. To do so would be to change the EF by adding features peculiar to the experiment and necessarily temporary. Such an EF would not be illegitimate but it would not be particularly practical. The information given certain of the experimental subjects by Pittman and by the writer were normal advantages of the EF in question and were permanently obtainable in a practical school situation without assuming the impractical situation of an everlasting experiment. In sum, it is always legitimate to give experimental pupils such facts as are the normal concomitants of the EF in question, unless the experimenter desires to limit his experimental conclusions to a narrower EF.
As a matter of fact, the writer gave certain standard tests to the pupils in his control group, thereby making it possible, had he so desired, to report to them the scores made as in the case of the other group. This was not done because the EF for this group assumed that in a normal non-experimental situation no standard-test scores would be available.

Errors Due to Difference in Time Allowance.—When the effectiveness of two or more EF’s is being studied, one EF may secure an unfair advantage over another because of a longer teaching or studying time on the part of the pupils, or the application of their EF for a longer period. This may occur in many ways. The class period may be longer. The study which occurs at the pupil’s home may be longer. Each application of the EF may be longer. The total period during which the EF operates may be longer. Thus, in conducting the experiment to determine the relative effectiveness of employing tests in teaching reading, the writer found it necessary to regulate the length of the official reading period both for teaching and for study. In his experiment to determine whether motion-picture presentation, or printed presentation, or teacher presentation, or various combinations of these was the most effective, Weber¹ exercised extreme care lest the time allowance for one EF exceed the time allowance for another EF. In his experiment to determine whether supervision plus standard tests were superior to supervision minus standard tests, Bennett found it impossible to give all the initial tests or all the final tests to all the pupils at the same time. Because of the scattered nature of rural schools both testing periods extended over several weeks. All tests were carefully dated in order that the interval between initial and final tests might be kept identical for every pupil.
Since instruction toward the close of school may be more effective than toward the beginning, he was careful to avoid applying initial tests to one group earlier, on the average, than to the other group. Lacy,² in his experiments with visual, verbal, and printed presentation, was careful to see that the few minutes’ interval between the ending of each EF and the application of the final test was kept identical for all EF’s, and that the few weeks’ interval between the final test and a delayed-recall test was kept identical for all EF’s. In every experimental situation where a time variation will favor one EF to the detriment of another, the time should be kept identical, unless such a variation is a desired element in an EF.

¹ Weber, J. J., Relative Effectiveness of Some Visual Aids in Elementary Education; (to be published soon).
² Lacy, John V., “Motion Pictures as an Educational Agency”; Teachers College Record, Vol. XX, No. 5.

There is a special variety of time variation which should not escape the attention of the experimenter. The pupils in one experimental group may have a poorer attendance record than those in some other group. This may be caused by an excess for one group of poorer roads, longer average distance of homes from school, more inclement weather, more contagious diseases, and the like. Consideration should be given to whether the absence is toward the beginning or end of year, or is continuous or intermittent. When the pupils are sufficiently numerous, average attendance records are usually approximately equivalent for each group. But when the group is small it may be necessary to eliminate from experimental computations pupils whose attendance record is such as to disturb the balance between the two groups.

Sometimes it is difficult to decide whether a time variation is an irrelevant factor or a consequence of an EF.
Pittman found that the pupils in the schools which were under the zone-system-of-supervision EF showed a better attendance record. Instead of discounting this as an irrelevant factor he credited it to the beneficent influence of the EF, because there was no other observable cause. The writer found that one method of teaching reading resulted in more reading both in school and out than did another EF. This extra reading was a partial or perhaps entire explanation of the superior growth of these pupils. It was assumed that this was not an irrelevant time variation but a beneficent consequence of the EF. Tests made in other subjects of the curriculum did not show that this increased emphasis upon reading had occurred at the expense of other portions of the school work.

Finally, errors may occur due to the length of time the experiment runs. An experiment may be allowed to run too brief a time or too long a time. It may be so brief that variable errors swamp the effect of the EF’s. This is likely to occur if the trait measured is one in which growth is slow and cumulative. In such a situation the experiment needs to continue over a long period. When the trait measured develops rapidly, and when the effect of the EF’s is relatively non-cumulative, brief experiments are preferable. The principle to be kept in mind in deciding upon the time length of the experiment is to secure the maximum effect of experimental factors with a minimum effect from disturbing variables.

Errors Due to Difference in Transfer.—After giving a recent examination to his class in mental measurement, the writer announced to the students that his efficiency as a teacher of mental measurement was only 43 per cent, for on the average the class had mastered only 43 per cent of the procedures he had aimed to teach.
One unkind student increased his chagrin by remarking that a portion of that 43 per cent was acquired in other classes given by the writer’s colleagues. In other words, there had been a transfer from one class to another. This same sort of transfer from one school activity to another is going on all the time. More of it may occur in the case of one group than another, thereby introducing a constant irrelevant factor. Reading ability is liable in a peculiar way to be enhanced by such transfer. The teacher of reading usually has a heavy obligation to all the other teachers, where there is departmental instruction, or a heavy obligation to all the other phases of her own instruction where she is the sole teacher. Certain teachers or schools give a sum total of more instruction in reading during the periods officially assigned to history, geography, and the like, than during the reading period itself. This is equivalent to giving more time to reading. The experimenter should not neglect these transfer possibilities when standardizing the time allowance for each EF.

Another disturbing irrelevant factor is the transfer of knowledge of how to do the experimental tests. The writer found this to be of considerable significance in some experimentation on young children. All the tests were individual tests, which means that only one child could be tested at a time. As soon as a child was tested he was returned to his class. This gave opportunity for the other children to discover, in advance, something as to both the general and specific nature of the tests. An effort was made to reduce the amount of this error by employing several examiners so as to reduce the length of the total testing period, by testing first those pupils who, according to the teacher’s judgment, were least competent to make an intelligible report of what occurred in the examining room, by applying one test to all pupils before starting another, by urging the teacher to conduct her class while a test was being given so as to reduce opportunities for conferences among pupils, and by condensing the total period for one test between recess periods. An attempt was made to equate any error not avoided by the preceding precautions by testing pupils from the two groups according to the principle of alternation. It is much easier to avoid this irrelevant factor when group tests may be employed.

When the equivalent groups are located in the same school, other sorts of transfer may occur. One group may catch a spark of enthusiasm from another. One group may sulk because the other group has a pleasanter or supposedly pleasanter EF. The writer is still wondering just what sort of transfer occurred during a year’s experiment in the Horace Mann School, conducted in collaboration with Principal Pearson, Vice-Principal Hunt, and the teachers. Half the teachers and half the pupils continued to teach and study, respectively, a particular subject, as during the preceding year. The other equivalent half of the teachers attempted by concentrated study to invent teaching procedures which would produce, with the same time allowance, a greater growth than usual in their half of the pupils. This program was known to half the teachers only and to none of the pupils. Initial and final tests were given to both groups as had been customary in previous years. To our great surprise both groups had made practically identical progress. Naturally this was a considerable disappointment to us all.
It was not until some time later that it occurred to us to compare the usual progress with the progress made for an equal period during the experimental year. Both groups had made a 50 per cent greater growth than usual! Somehow, some sort of transfer had occurred.

Errors Due to Bias of Tests.—There is danger that tests used for the initial and final measurements will be partial to one EF. Those who advocate the project method in preference to the conventional method of teaching have certain reservations about experiments which have been conducted to date to evaluate the relative effectiveness of these two educational processes. They claim, and with some justification, that standard tests available for such evaluation are partial to the conventional method. Lacy’s conclusion that verbal instruction is more effective than visual instruction has been questioned by Weber on the ground that Lacy’s verbal tests were partial to the verbal method. To substantiate his criticism Weber devised one test like Lacy’s, another in which the verbal element was reduced to a minimum, and another which, in his judgment, was about half-way between these two. At the time when this is written, his experiments have gone far enough to show, among other things, that the visual group does better on the visual test and the verbal group upon the more verbal test.

What has been said concerning the nature of the tests employed applies with equal force to the examiner who gives tests, the acquaintance of pupils with the tests, instructions to pupils as to how to take the test, the conditions while tests are in progress, the scoring of the tests, and the statistical treatment of results. In general, the same examiner should give the same tests to all groups in the same way in order that difference in personality of examiners, or in the stimulus given to pupils, may not corrupt results.
Uniformity will be increased if the method of applying the test is determined in advance and written down. Sometimes one group has had more experience in taking tests in general. This may be eliminated by supplying the deficiency. Sometimes the experiment calls for intermediate tests of the same experimental trait with the same test that is used for the initial and final tests. If this applies to one group only it may gain an advantage from increased acquaintance with the test. Such practice effect can be reduced by the use of parallel forms rather than the identical test.

Sometimes it is desirable to analyze the curriculum content and test content to discover the degree of correspondence between the two, and this is especially true when the one-group experimental method has been employed. It is possible that the arithmetic curriculum during the first semester may be more akin to the content of the arithmetic test used than is the content of the arithmetic curriculum for the second semester. Analysis of the curriculum may reveal this.

Finally, a test may be biased because it fails to take account of periods of especially rapid growth, and minor or major plateau periods of especially slow growth. In certain traits, pupils lose during the summer vacation some of the skill acquired the previous year. Usually, this loss is quickly made up in the first few weeks of the fall term. When the initial tests are given on the first day or two of school, the EF will get the benefit, not only of the effect of the EF, but also of the effect of this early spurt.

Errors Due to Bias of Other Irrelevant Factors.—Various environmental factors which may prove irrelevant factors have already been listed. On occasion, many others may be significant.
The experimenter should canvass the general physical environment including such items as temperature, humidity, ruralness, playgrounds, and the like, to see if differences in these may not be significant. Thus conclusions from experiments in physical geography might be profoundly affected by whether one group had better contacts with mountains, streams, and the like. The home environment is frequently of very great importance. Some children have home surroundings which encourage study, home facilities which aid study, parents who give moral support to the school, and parents who give actual instruction in school subjects in no mean amount and of no small worth. All such conditions, if relevant to the experiment in question, should be made approximately equivalent or should be discounted in drawing conclusions.

Then there are errors due to difference in susceptibility of pupils to the EF’s. Conclusions from an experiment conducted by Norsworthy, Hillegas, McCall, and Johnson were made uncertain because one of the two groups was in more robust health than the other. Differences in physical condition, intelligence, previous training, age, sex, race, and all other such personal characteristics which at times condition the susceptibility of pupils are not matters easily or at all subject to control during the application of the EF’s. They should receive attention when experimental pupils are being selected.

Experimental Log. One necessity of experimentation is an experimental log or record of dated events, of relevant ideas, of the appearance of variables, and the like. It is seldom safe to trust to memory circumstances which will need to be recalled. Every scrap of experimental record should be labeled and dated. Records should be kept as though the experimental material were to be filed away for several years before experimental computations were made and before the experiment was described.
In fact, any one who does much experimentation will need to refer to experimental records long after the conclusion of the experiment. Further, it often becomes necessary to ask others to complete an experiment one has begun. A properly kept experimental log quickly informs the new experimenter concerning the previous history of the experiment. Norsworthy had just completed an experiment extending over several years when she died. Though the writer knew nothing about the experiment he was able to take up the research where she left off, complete the computations, and describe and publish the results. Without the experimental log this would have been impossible.

In an extensive experiment in the teaching of English to foreigners, Courtis employed a unique device for maintaining desired experimental conditions and for recording deviations from them. First he met the teachers and gave them typewritten directions concerning, and training in, how to apply the EF, namely, a particular method of teaching English to foreigners. Then he employed a group of graduate students in education to act as observers, there being one observer for each teacher. Next he devised a form on which the observer could keep a graphic time-record of just what the teacher did during the lesson period. He rotated the observers so that each observer saw each teacher. At the conclusion of the experiment, he did not have to hope that experimental conditions had been maintained. He had an accurate record of the extent to which they had been maintained. As a result, he was able to avoid grave errors, and was able to make a much fuller use of his data.

CHAPTER V

EXPERIMENTAL MEASUREMENTS

I. FUNCTIONS OF EXPERIMENTAL MEASUREMENTS

Amount of Experimental Factors. The first demand upon experimental measurements is the exact measurement of the amount of the EF’s. The amount of certain EF’s may be measured with great exactness.
Among the many experiments conducted by the Ventilation Commission of New York, some had for their purpose to determine the mental and physical effects upon school children or adults of various temperatures, humidities, carbon-dioxide contents, and the like. The successful conduct and interpretation of these experiments required that an exact record be kept of the temperature, humidity, and carbon-dioxide content maintained in the experimental chambers. Instruments were installed which made possible a very exact record of the amount of these EF’s.

The amount of some experimental factors cannot be measured with such accuracy. If, for example, one experimental factor is the project method, it is impossible to secure an exact quantitative record of the amount of this EF, even though we can be reasonably sure that it is an EF which varies in amount of presence. Similarly it is difficult to secure a quantitative record of the amount of a particular method of teaching reading.

Though such a record is difficult to secure, the experimenter is responsible for reporting as best he can the amount of each EF. In the case of some EF’s, it may not be possible to be more definite than to state roughly the skill and effort of the teacher, the degree of cooperation of officials and parents, the adequacy of equipment, the amount of time during which each EF operated, and similar information, according to the nature of the experiment.

Amount of Change Produced by Irrelevant Factors. The second demand upon experimental measurements is the exact measurement of the amount of change produced in the trait in question by irrelevant factors. The purpose of this measurement is to make it possible to discount the corrupting influence of irrelevant factors.

In certain very specific types of experimentation, it is possible to measure the amount of this influence of irrelevant factors.
But in most educational experimentation, their individual influence is so slight as to be unmeasurable, or so subtly bound up with the EF’s that the exact amount of their contribution cannot be separated from the influence of the EF’s. Usually, the experimenter will find it easier to eliminate or equate significant irrelevant factors than to measure the amount of their contribution to the trait in question.

Amount of Change Produced by Experimental Factors. The third demand upon experimental measurements is the exact measurement of the amount of change in the trait in question produced by the EF’s. In educational experimentation, this is the most common and most important type of experimental measurement.

II. FUNDAMENTAL CRITERIA

In common with measurements for any purpose, experimental measurements should satisfy certain fundamental criteria. They should be selected or constructed with these criteria in mind. These fundamental criteria are:

1. Validity. A test is perfectly valid when it measures exactly what it purports to measure.
2. Accuracy. A test is perfectly accurate when the units of measurement are wholly appropriate and are absolutely equal at all points on the scale.
3. Reliability. A test is perfectly reliable when two applications of equivalent tests to the same pupil yield identical scores.
4. Objectivity. [Text and table illegible in source.]

Credit for regular school marks was assigned thus: [table of school marks and their credit values illegible in source].

Credit for teacher’s special estimate of pupils was assigned as follows: [table of teacher’s estimates and their credit values illegible in source].

Observe that, in assigning credit to the average of school marks and to the teacher’s special estimate, no account was taken of the pupil’s grade. A second-grade pupil making an A was assigned the same number of points of credit as a fourth-grade pupil making an A. This procedure is defensible only when the group is a fairly homogeneous one, and when the object is to construct a criterion whose sole purpose is to evaluate test elements relative to each other.

Finally, Liu combined his test criterion and school criterion, giving equal weight to each. Then he computed the correlation and partial correlation of each test element in the five non-verbal tests with this criterion. The test elements showing the largest partial correlation with the criterion were selected to constitute a new test. Furthermore, the method of scoring the new test took account of the relative value of each element of the test as an independent measure of intelligence. This was accomplished by the use of the regression equation technique. These techniques of correlation, partial correlation, and regression equations are discussed in detail in Chapter IX.

In the actual selection of the best test elements to put into the new test battery for China, Liu was influenced by such non-statistical considerations as adaptability to all races equally, possibility of constructing duplicate forms of each, and the like. Also he short-circuited the laborious partial correlation technique by (a) computing the correlation of each test element with the criterion, (b) choosing as basic test elements the two elements which showed the highest correlation with the criterion and which appeared to test different mental functions, and (c) selecting other tests which, by trial, showed high correlations with the criterion but low correlations with the basic tests and with each other.

2. The Test Should Measure Comprehensively the Trait in Question.
Perfect validity may be secured by so constructing the test that it duplicates in form, procedure, and content the criterion itself. But almost invariably this means an impracticably cumbersome test. Hence the psychologist usually sacrifices some validity to convenience.
He may construct a test which duplicates the criterion in miniature.¹ Or, instead of a toy representative, he may select for his test an actual sampling of some representative portion of the criterion. Or, he may construct an analogy which employs material which is not even similar to the material of the criterion but which is supposed to exercise the mental traits requisite for success in the criterion. Finally, he may attempt to find or construct an empirical test, i.e., he tries out many tests in the hope of discovering that one of these will happen to show a close correspondence with the criterion.

¹ See Hollingworth, H. L. and L. S., Vocational Psychology; D. Appleton and Company, New York, 1916.

This question of adequacy is of particular importance to the experimenter. He wishes to measure and evaluate all the changes produced by each EF and not just a part of them. Bryan and Harter’s ordinary measurements showed that their subjects reached a plateau where a series of measurements showed no further evidence of growth. The use of more adequate tests showed, however, that growth in certain accessory traits was continuous throughout the plateau period. In experiments with project teaching and the like, the adequate measurement of such accessory and concomitant developments becomes a matter of primary importance. It is a good rule in experimentation to test, so far as possible, every aspect of the problem, and score every aspect of the tests.

Adequacy in content plus practical convenience offers a special problem to the test constructor. Some of those who develop tests attempt to secure adequacy without sacrificing convenience by taking a random sampling of the total material. Thus, the words in the Starch Spelling Scale were selected at random from all the non-technical words in the dictionary. Others follow the social-worth principle. Thus the words in the Ayres Spelling Scale are the more commonly used words.
Others employ the type principle in selection of test material. Thus the examples in Monroe’s Diagnostic Tests in Arithmetic were so selected as to represent all the typical processes in the fundamentals of arithmetic. Others follow the statistical-difficulty procedure. Thus, the examples in Woody’s Arithmetic Scales were selected because of their statistical behavior, i.e., those examples were selected which would make an equal-step ladder of difficulty. Various combinations of these bases of selection are possible. The basis or bases to be employed will vary with the purpose of the test and the nature of the trait to be studied.

3. The Test Should Be Non-coachable.
The coachability of a test may be reduced by such a selection and arrangement of material as will make it difficult for one pupil to communicate knowledge of how to do the test to another, by increasing the amount of the test material, by the preparation of several equivalent forms of the test, and by providing that those pupils will be tested first who are least able to report the content of the test.

4. The Test Should Be Free from Ambiguities and Other Irrelevancies.
Even when the content of a test is satisfactory, the form and procedure of the test require careful scrutiny. All sorts of irrelevancies may subtract from validity. The test material may be in question form when greater validity might be secured by employing the classification, completion, matching, or manipulation form. The general conditions under which the test is to be given may detract from validity. The instructions which accompany the test may demand too much linguistic ability or may be otherwise unsuitable. The nature of the response demanded of the pupil may require too much writing ability, muscular strength, or the like.
The test may be so long as to measure fatigue instead of the trait desired, or so short as to be unreliable or unsuited to measure the speed of adjustment to the test. It may be so arranged as to measure the pupil’s honesty rather than his ability. The scoring provided for may be crude, or may concern insignificant phases of the pupil’s performance. Ambiguities or other irrelevancies may appear at various stages.

5. The Elements of the Test Should Be Weighted in the Optimum Manner.
In practice, few tests have as yet been validated in any adequate way. The tests are usually assumed to measure what they appear to measure. In time every person who proposes a test will be obligated to report the degree of correspondence between test scores and criterion scores. This correspondence is usually determined by computing the coefficient of correlation between these two series of scores. The procedure for computing and interpreting a coefficient of correlation is described in Chapter IX.

It frequently happens, however, that the correspondence between test and criterion can be measurably increased by determining, and utilizing in scoring, the optimum weights for the various parts of the total test, especially when the total test is composed of subordinate tests which differ somewhat in nature. These weights may be determined statistically by means of the partial correlation and regression equation techniques. These techniques also are discussed in Chapter IX.

6. The Test Should Be So Constructed That the Pupil’s Reactions Will Be as Abbreviated as Possible.
Satisfaction of this criterion makes for economy and objectivity of scoring. Frequently an abbreviated reaction, such as a word, number, or check, will yield as valid¹ a measure of the pupil’s ability as a much more complicated reaction.

7. The Test Should Be So Constructed That the Pupil’s Abbreviated Answers Will Be Controlled.
If any one of many different abbreviated answers is correct, or if the spatial location of the pupil’s answers is uncontrolled, the probable result will be uneconomical, inaccurate, and subjective scoring. Furthermore, it will prove difficult in this case to employ mechanical scoring devices. When the nature of the test permits, it is well to have pupils’ answers recorded along the right-hand margin of the test sheet. This permits the experimenter to lay a correctly filled test sheet beside the pupil’s answers and determine correctness or incorrectness by a simple visual comparison. When marginal answers are not feasible, spatial location may be so controlled as to permit the use of a perforated test sheet or a celluloid scoring device.

¹ Gates, Arthur I., “The True-False Test as a Measure of Achievement in College Courses”; Journal of Educational Psychology, May, 1921.

8. The Test Should Be So Constructed as to Permit Its Use Both with One Pupil and with a Group of Pupils.
It is claimed that when a test is given to one pupil at a time the results are more reliable than when a pupil is tested in a group. However, questions of time, economy, and the prevention of the spread among untested pupils of information as to the nature of the test practically require group testing for most experimental situations.

9. Test Instructions Should Be as Brief as Is Consistent with an Adequate Understanding of What Is to Be Done.
Long instructions tend to produce confusion in the minds of the pupils, and even of experimenters themselves if they are inexperienced. But adequacy should not be sacrificed to brevity. Particular care should be exercised to see that no key points are omitted.

10. Instructions Should Employ a Demonstration and Preliminary Test.
It is easier to imitate than to comprehend and follow linguistic directions. Both demonstration and preliminary test may be given on the blackboard or may be printed on the test sheet.
The latter is preferable.

11. Instructions Should Be Adapted to and Uniform for All Who Are to Be Tested.
It is feasible to find words sufficiently simple for young pupils and which are also sufficiently dignified for older pupils. Also it is possible so to prepare instructions that they will be uniform and equally fair to all experimental groups irrespective of their environment.

The importance of universalizing the test applies with as much force to the test material as to the instructions. In less than a year after their publication, the Thorndike-McCall Reading Scales were in use in England, China, and other foreign countries. Unfortunately, the authors were so provincial in their outlook that minor revisions must be made before they can be used to greatest advantage in countries other than the United States. They could have been approximately internationalized from the beginning without impairing their value for this country.

12. The Order of Instruction Should Be the Order of Execution.
There are abundant reasons for believing that it is easier for pupils to follow instructions when the sequence of instructions is the sequence of action expected from the pupils.

13. Instruction Should Be Broken into Action Units.
As soon as a natural unit of instruction has been given, the pupil should be directed to carry out these directions before another unit is given. This is especially important where the instructions are necessarily long and complicated. Any other procedure taxes too heavily the pupil’s memory.

14. Instructions Should Equalize Interest.
Interest should be equalized not only for all experimental groups but for the pupils in each group. Probably it is easier to secure this equalization on a high interest plane than on a low plane. As a rule it is best to induce each pupil to do the best he can.

15. The Test Should Be So Easy That Each Pupil Will Make a Score above Zero.
Two pupils who make zero scores appear to be of like ability, whereas the amount of instruction required to lift both above zero might be one month in the case of one pupil and twenty-four months in the case of the other. Obviously to call these pupils equivalent and to pair them for experimental purposes would give a special advantage to the experimental group receiving the one-month pupil. For at the final test, this pupil might show marked improvement while the other would be still making zero. With a properly constructed test with equal units at all points on the scale, the twenty-four-month pupil might be shown to have made greater growth than the one-month pupil.

16. The Test Should Be So Difficult That No Pupil Will Make a Perfect Score.
All perfect-score pupils look alike just as all zero pupils look alike. A properly constructed test might reveal wide differences of ability. Furthermore, a final test, even though it be more difficult than the initial test, cannot reveal correct improvement scores for such perfect-score pupils.

17. The Test Should Have No Undistributed Scores.
Besides undistributed zero and perfect scores it is possible to have undistributed intermediate scores. Coarse scoring, or tests which yield a few degrees of merit only, automatically cause undistributed intermediate scores. Pupils are made to appear of like ability when, by a finer scoring or by a finer test, they would appear quite unlike. The number of degrees of merit which a test should reveal depends upon the homogeneity of the group being tested, but, as a rule, tests should be so constructed as to separate the pupils into not less than seven groups of ability and, if the data are to be used for correlation, into not less than thirteen ability groups.

18. A Test Should Yield a Statistical Score.
It is unfortunate that the custom ever grew up of reporting scores in terms of letters, words, or phrases.
These must be converted into statistical terms before they are susceptible of necessary quantitative treatment.

19. The Test Should Yield Absolute Rather Than, or in Addition to, Relative Scores.
Teachers’ marks are relative scores, that is, relative to the group in question. An able pupil in Grade I will receive a mark of A. When this same pupil reaches Grade VIII, he will be making a score no higher than A. He stands, in fact, a good chance of making a score less than A, even when his absolute ability has markedly increased and his relative status has remained unchanged. Relative tests cannot easily be used to measure improvement.

20. The Test Should Be Scaled So That Units of Measurement Will Be Equal at All Points on the Scale and the Method of Combining Units Will Be Simple and Appropriate.

Evaluation of Scaling Methods. The need for equality of units is shown in Table 4.

TABLE 4
SHOWING THE NEED FOR EQUAL UNITS OF MEASUREMENT (R = RIGHT. W = WRONG)

Number of Problems Solved    1    2    3    4     5     6     7             8    Score
Difficulty                   1    2    3    3.1   3.2   3.3   [illegible]   4
Pupil A                      R    R    R    W     W     W     W             W    3
Pupil B                      R    R    R    R     R     R     W             W    6

Pupil A solves three problems correctly. His unscaled score is, therefore, 3, as shown in the table. Pupil B solves six problems. His unscaled score is 6, as shown. Employing unscaled units of measurement in this manner makes Pupil B appear much more competent in comparison with Pupil A than he really is. The difficulty of solving six problems, namely 3.3, is only slightly above the difficulty of solving three problems, namely 3. A very small superiority of ability on the part of Pupil B enabled him to double his unscaled score. The use of equal units of difficulty gives Pupil A a score of 3 and Pupil B a score of 3.3.

Many methods¹ of varying worth have been proposed for scaling mental tests.
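The arithmetic behind Table 4 can be sketched in a few lines. This is only an illustrative sketch: the function names are hypothetical, the difficulty values are those legible in Table 4 (the value for seven problems cannot be read in this copy and is omitted), and a pupil's scaled score is taken to be the difficulty of solving the number of problems he solved, as the text describes.

```python
# Difficulty of solving a given number of problems, as read from Table 4.
# The entry for seven problems is illegible in this copy and is left out.
DIFFICULTY = {1: 1.0, 2: 2.0, 3: 3.0, 4: 3.1, 5: 3.2, 6: 3.3, 8: 4.0}


def unscaled_score(answers):
    """Count the problems marked R (right)."""
    return sum(1 for mark in answers if mark == "R")


def scaled_score(answers):
    """Difficulty of solving that many problems; None if not tabulated."""
    return DIFFICULTY.get(unscaled_score(answers))


pupil_a = "RRRWWWWW"  # three problems right
pupil_b = "RRRRRRWW"  # six problems right

for name, answers in (("A", pupil_a), ("B", pupil_b)):
    print(name, unscaled_score(answers), scaled_score(answers))
```

On the unscaled count Pupil B appears twice as able as Pupil A (6 against 3), while on the equal-unit scale the two stand at 3.3 and 3.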
One method, the grade-scale method, is to determine the difficulty of each separate problem, question, or other test element on the basis of the achievement of school grades, and then to compute a pupil’s score by combining the scale values of the test elements done correctly.

To call a pupil’s score the scale value of the most difficult test element done correctly is subject to the objection that pupils are unable frequently to do correctly test elements of less scale value. Depending as it does upon a single test element, the score would also be rather unreliable. The only satisfactory procedure thus far devised to meet these two difficulties is too complicated for practical use.

¹ For a detailed evaluation see McCall, Wm. A., How to Measure in Education, Chapters IX and X; Macmillan Company, New York, 1922.

On the other hand, to call a pupil’s score the sum of the scale values of the test elements done correctly is somewhat laborious, and, in addition, is subject to the criticism that a score yielded by such a cumulative total shows the number of units of work done rather than the ability level reached. It would be like measuring a man’s lifting strength by adding the weights of a variety of objects lifted. The preceding simple-total procedure appears preferable. The man’s lifting strength, according to the simple-total procedure, would be the weight of the heaviest object the man could barely lift.

For the foregoing reasons, the drift is away from the scaling of the separate test elements, except in a rough way for the purpose of arranging test elements in an approximate order of difficulty. The drift is in the direction of scaling, i.e., determining the difficulty of doing correctly a given number of the test elements in a given test. Stated differently, the drift is toward scaling total scores instead of test elements.
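The contrast among these ways of turning element results into a score may be easier to see with numbers. The sketch below is illustrative only: the six scale values and the two pupils are hypothetical, and the function names are not the author's. It computes the scale value of the hardest element done correctly, the cumulative total of scale values, and the plain number of elements done correctly, which is the quantity that total-score scaling starts from.

```python
# Hypothetical scale values (difficulties) for six test elements,
# arranged from easiest to hardest.
SCALE_VALUES = [1.0, 2.0, 3.0, 3.1, 3.2, 3.3]


def hardest_correct(correct):
    """Scale value of the most difficult element done correctly."""
    done = [v for v, ok in zip(SCALE_VALUES, correct) if ok]
    return max(done, default=0.0)


def cumulative_total(correct):
    """Sum of the scale values of all elements done correctly."""
    return sum(v for v, ok in zip(SCALE_VALUES, correct) if ok)


def number_correct(correct):
    """Plain count of elements done correctly, to be scaled as a total."""
    return sum(1 for ok in correct if ok)


# A steady pupil does the four easiest elements; an erratic pupil does
# only the easiest and, once, the hardest.
steady = [True, True, True, True, False, False]
erratic = [True, False, False, False, False, True]

for name, pupil in (("steady", steady), ("erratic", erratic)):
    print(name, hardest_correct(pupil), cumulative_total(pupil),
          number_correct(pupil))
```

With these made-up figures the hardest-element score places the erratic pupil above the steady one on the strength of a single element, which is the unreliability objected to in the text, while the cumulative total grows with the amount of work done rather than with the level reached.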
The three most promising methods that have been proposed for scaling total scores are the percentile scale, age scale, and T scale.

In the case of the percentile scale, the smallest number of points made on the test in question by any pupil of the group used as the basis for scaling is scored zero, the number of points below which are one per cent of the pupils is scored 1, the number of points below which are two per cent of the pupils is called 2, and so on to the highest number of points made by any pupil, which is scored 100.

This method assumes that the difference in ability between a pupil who makes a zero-percentile score and a pupil who makes a 10-percentile score is the same as the difference between a pupil who makes a 40-percentile score and a pupil who makes a 50-percentile score. It is rather generally conceded, however, that the former difference is actually much greater than the latter difference, and that therefore the units are not equal in the truest sense at all parts of the scale.

In the case of the age scale, the mean number of points made on the test in question by unselected eight-year-old pupils is scored 8. The mean number of points made by nine-year-olds is scored 9, and so on. Intermediate scores are given also.

A vital defect of this scale is the almost insuperable difficulty of locating and testing unselected pupils below the age of eight or nine and above the age of thirteen or fourteen. Large sections of the former group have not left the social group to enter the school and of the latter group have left the school to return to the social group. Again, growth ceases or actually recedes in some traits after the age of thirteen, fourteen, or thereabouts. Quality of handwriting, and speed and accuracy of addition are probable illustrations of recessions.
No one has proposed a satisfactory way of handling a situation when the mean number of points made by, say, thirteen-year-olds is 20, and that made by fourteen-year-olds is 18. Finally, it is generally believed that the actual growth between ages eight and nine, say, is greater than between thirteen and fourteen. This belief does not have evidential support, for it is impossible to say that the units on one scale are unequal without assuming the equality of units on some other criterion scale. The foregoing criticisms, even excluding the third, mean that the age scale is inappropriate except within a narrow range of ability and for certain mental traits.

The T scale is believed to be superior to any of the previously described methods. It was constructed for the purpose of embodying their virtues and eliminating their defects. It scales the total score. It employs the simple total. It allows each test element done to affect the scale score, thereby increasing reliability. Its units are equal in the generally accepted sense at all points on the scale. It covers a wide range of ability and may be extended if necessary. The process of scaling is as simple as any, and so is the computation of a pupil’s scale score.

The age scale, by permitting the computation of quotients such as Intelligence Quotients, Reading Quotients, Accomplishment Quotients, and the like, has had a decided practical advantage over the T scale, though the age scale may be, and is now being, used as a secondary scale in conjunction with the T scale to permit the computation of quotients. A procedure has just been devised, and will be described in this chapter, whereby the T scale alone can secure these special advantages of the age scale, and that in a more economical way.

The relative merits of the four most commonly used scaling methods are summarized where they may be seen at a glance in Table 5.
This table assumes that the latest improvements on each scaling procedure have been employed. The scoring of the scales is necessarily somewhat subjective. After an elaborate discussion of the various scale systems, a colleague in this field scored the systems and arrived at results closely similar to those given in Table 5. The total scores of 29, 23, 22, and 11 give a rough, but only a rough, index of the relative merits of the four scale systems. Some of the criteria are far more significant than others. The convenience and definiteness of the reference point is so important that the deficiency of the grade scale is very serious. The equality of units is even more important. The deficiency of the age scale and percentile scale at this point practically means that they cannot well be adopted as permanent scaling systems. The additional deficiency of the age scale on width of range of scale is fatal, because both these defects are inherently uncorrectable. The grade scale's deficiency in ease of scaling tests and of computing pupil scale scores fatally indicts it for other than scientific purposes.

Borrowing and combining as it does the desirable features of the other three scale systems, the T scale satisfactorily meets every criterion except one. At the present time it is easier for the uninitiated to understand, or at least to think they understand, the age-scale or percentile-scale units than the T-scale units. This is not, however, a permanent defect. When the T scale has come into general use, the T will be comprehended almost as easily as an age or a percentile.

TABLE 5
SHOWING THE RELATIVE MERITS OF THE FOUR COMMONLY USED SCALE METHODS. SATISFACTORY PROVISION FOR A CRITERION = 2. FAIRLY SATISFACTORY = 1. UNSATISFACTORY = 0.

                                                          T      Age    Percentile   Grade
Criteria                                                Scale   Scale     Scale      Scale
 1. Definiteness and convenience of reference point       2       2         1          0
 2. Equality of units                                     2       0         0          2
 3. Width of range of scale                               2       0         2          2
 4. Reliability of scale scores                           2       1         1          2
 5. Permanence of scale                                   2       2         2          1
 6. Conventionality of scale units                        2       2         2          2
 7. Lay interpretability of scale scores                  1       2         2          0
 8. Internationality of scale units                       2       2         1          0
 9. Comparability of scores on various scales             2       2         1          1
10. Method of combining units                             2       2         2          0
11. Ease of computing scores                              2       2         2          0
12. Permits the quotient techniques                       2       2         0          0
13. Ease of scaling test                                  2       1         2          0
14. Utilization of all scaled material                    2       2         2          1
15. Ease of preparing duplicate scales                    2       1         2          0
    Total                                                29      23        22         11

Construction of T Scale. The detailed process of constructing a T scale has been published.¹ A summary will suffice for this book. Table 6 illustrates the process. The second column shows the number of unselected 12-year-old children answering correctly the number of questions indicated in the first column. It is recommended that unselected 12-year-olds (12.0-13.0) be used for scaling tests which are to be used generally. If any other age is used it should be indicated by a subscript, thus, T₁₁ or T₁₃ or T₁₆, in all publications. For experimental purposes the experimenter may use the group or groups upon which he is experimenting.

¹ See McCall, Wm. A., How to Measure in Education, Chapter X; Macmillan Company, New York, 1922.

TABLE 6
SHOWING HOW TO SCALE TOTAL SCORES

Total Number                  Number Exceeding    Per Cent Exceeding
of Questions     Number of    Plus Half Those     Plus Half Those      Scale
Correct          Pupils       Reaching            Reaching             Score
     0               3            498.5               99.7               23
     1               1            496.5               99.3               25
     2               2            495.0               99.0               27
     3               1            493.5               98.7               28
     4               2            492.0               98.4               29
     5               2            490.0               98.0               29
     6               2            488.0               97.6               30
     7               2            486.0               97.2               31
     8               4            483.0               96.6               32
     9               2            480.0               96.0               32
    10               2            478.0               95.6               33
    11              10            472.0               94.4               34
    12               3            465.5               93.1               35
    13               8            460.0               92.0               36
    14               8            452.0               90.4               37
    15              13            441.5               88.3               38
    16              15            427.5               85.5               39
    17              18            411.0               82.2               41
    18              28            388.0               77.6               42
    19              26            361.0               72.2               44
    20              34            331.0               66.2               46
    21              40            294.0               58.8               48
    22              40            254.0               50.8               50
    23              41            213.5               42.7               52
    24              37            174.5               34.9               54
    25              31            140.5               28.1               56
    26              35            107.5               21.5               58
    27              24             78.0               15.6               60
    28              26             53.0               10.6               62
    29              21             29.5                5.9               66
    30              14             12.0                2.4               70
    31               3              3.5                0.7               75
    32               1              1.5                0.3               78
    33               1              0.5                0.1               81
    34               0                                                   85
    35               0                                                   90

The third column shows the number of pupils exceeding plus half those reaching each total number of questions correct. Thus the number of pupils exceeding 33 is 0. Half those reaching 33 is 0.5. The sum of 0 and 0.5 is 0.5 as shown in the third column. The number exceeding 32 is 1. Half those reaching 32 is 0.5. The sum of 1 and 0.5 is 1.5 as shown. The number exceeding 31 is 2. Half those reaching 31 is 1.5. The sum of 2 and 1.5 is 3.5, and similarly for other results shown in the third column.

Since there are 500 pupils in the group used for scaling, the fourth column is obtained by dividing the results in the third column by 500 and by expressing the quotients as per cents. Were the fourth column inverted the first and fourth columns would constitute a percentile scale.

The fifth column gives the T score, and is found by converting the per cents in the fourth column by means of Table 7. Thus a per cent of 99.7 corresponds to 22.5 or, for convenience, 23.
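The computation behind Table 6 can be sketched in a few lines of Python. This is a minimal illustration rather than the author's own procedure: it replaces the look-up in Table 7 with the inverse normal curve from the standard library, on the assumption that Table 7 tabulates that curve with its zero point 5 S.D. below the mean. The names `t_scores` and `PUPILS_PER_SCORE` are hypothetical.

```python
from statistics import NormalDist

# Number of pupils reaching each total score 0..35 (second column of Table 6).
PUPILS_PER_SCORE = [3, 1, 2, 1, 2, 2, 2, 2, 4, 2, 2, 10, 3, 8, 8, 13, 15, 18,
                    28, 26, 34, 40, 40, 41, 37, 31, 35, 24, 26, 21, 14, 3, 1, 1,
                    0, 0]


def t_scores(counts):
    """Map each raw score to an unrounded T score.

    counts[i] is the number of pupils whose raw score is i. A score whose
    "per cent exceeding plus half those reaching" is 0 or 100 gets None,
    because the normal curve assigns no finite distance to it.
    """
    total = sum(counts)
    unit_normal = NormalDist()           # mean 0, S.D. 1
    above = total
    table = {}
    for raw, n in enumerate(counts):
        above -= n                       # pupils exceeding this raw score
        share = (above + n / 2) / total  # exceeding plus half those reaching
        if 0 < share < 1:
            # S.D. distance whose upper-tail area equals `share`,
            # times 10, measured from a mean placed at 50.
            table[raw] = 50 + 10 * unit_normal.inv_cdf(1 - share)
        else:
            table[raw] = None
    return table


table = t_scores(PUPILS_PER_SCORE)
print(table[0], table[22], table[33])
```

For a raw score of 0 the sketch gives a value close to 22.5, which rounds to the 23 entered in Table 6. The scores of 34 and 35, which no pupil reached, come back as None.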
The first column in Table 6 shows the number of test elements done correctly, where each element done counts one point. The process of scaling is the same whether each element carries a credit or a penalty of one point, two points, or any number of points, or a different number of points for different elements. Thus in scoring compositions, the scorer may wish to penalize one point for each error in punctuation, and two points for each error in choice of words. If penalties instead of credits are used the first column should be inverted, i.e., large quantities should appear at the top.

Increasing the Range of a T Scale. The range of a T scale based on 12-year-olds is much wider than the inexperienced individual would suspect. In a continuous function like reading, such a T scale will measure first-grade pupils and most university students. Of course, these extreme measurements will be more unreliable

TABLE 7
SHOWING THE S. D. DISTANCE OF A GIVEN PER CENT ABOVE ZERO. EACH S. D. VALUE IS MULTIPLIED BY 10 TO ELIMINATE DECIMALS. THE ZERO POINT IS 5 S. D. BELOW THE MEAN. S. D. VALUE EQUALS T.

S. D.    Per          S. D.    Per       S. D.    Per       S. D.    Per
Value    Cent         Value    Cent      Value    Cent      Value    Cent
 0       99.999971    25       99.38     50       50.00     75       0.62
 0.5     99.999963    25.5     99.29     50.5     48.01     75.5     0.54
 1       99.999952    26       99.18     51       46.02     76       0.47
 1.5     99.999938    26.5     99.06     51.5     44.04     76.5     0.40
 2       99.99992     27       98.93     52       42.07     77       0.35
 2.5     99.99990     27.5     98.78     52.5     40.13     77.5     0.30
 3       99.99987     28       98.61     53       38.21     78       0.26
 3.5     99.99983     28.5     98.42     53.5     36.32     78.5     0.22
 4       99.99979     29       98.21     54       34.46     79       0.19
 4.5     99.99973     29.5     97.98     54.5     32.64     79.5     0.16
 5       99.99966     30       97.72     55       30.85     80       0.13
 5.5     99.99957     30.5     97.44     55.5     29.12     80.5     0.11
 6       99.99946     31       97.13     56       27.43     81       0.097
 6.5     99.99932     31.5     96.78     56.5     25.78     81.5     0.082
 7       99.99915     32       96.41     57       24.20     82       0.069
 7.5     99.9989      32.5     95.99     57.5     22.66     82.5     0.058
 8       99.9987      33       95.54     58       21.19     83       0.048
 8.5     99.9983      33.5     95.05     58.5     19.77     83.5     0.040
 9       99.9979      34       94.52     59       18.41     84       0.034
 9.5     99.9974      34.5     93.94     59.5     17.11     84.5     0.028
10       99.9968      35       93.32     60       15.87     85       0.023
10.5     99.9961      35.5     92.65     60.5     14.69     85.5     0.019
11       99.9952      36       91.92     61       13.57     86       0.016
11.5     99.9941      36.5     91.15     61.5     12.51     86.5     0.013
12       99.9928      37       90.32     62       11.51     87       0.011
12.5     99.9912      37.5     89.44     62.5     10.56     87.5     0.009
13       99.989       38       88.49     63        9.68     88       0.007
13.5     99.987       38.5     87.49     63.5      8.85     88.5     0.0059
14       99.984       39       86.43     64        8.08     89       0.0048
14.5     99.981       39.5     85.31     64.5      7.35     89.5     0.0039
15       99.977       40       84.13     65        6.68     90       0.0032
15.5     99.972       40.5     82.89     65.5      6.06     90.5     0.0026
16       99.966       41       81.59     66        5.48     91       0.0021
16.5     99.960       41.5     80.23     66.5      4.95     91.5     0.0017
17       99.952       42       78.81     67        4.46     92       0.0013
17.5     99.942       42.5     77.34     67.5      4.01     92.5     0.0011
18       99.931       43       75.80     68        3.59     93       0.0009
18.5     99.918       43.5     74.22     68.5      3.22     93.5     0.0007
19       99.903       44       72.57     69        2.87     94       0.0005
19.5     99.886       44.5     70.88     69.5      2.56     94.5     0.00043
20       99.865       45       69.15     70        2.28     95       0.00034
20.5     99.84        45.5     67.36     70.5      2.02     95.5     0.00027
21       99.81        46       65.54     71        1.79     96       0.00021
21.5     99.78        46.5     63.68     71.5      1.58     96.5     0.00017
22       99.74        47       61.79     72        1.39     97       0.00013
22.5     99.70        47.5     59.87     72.5

Yrs.-Mos.  Add to     Yrs.-Mos.  Add to     Yrs.-Mos.  Add to     Yrs.-Mos.  Add to
           T Score               T Score               T Score               T Score
                                            13-10       -5        16-4       -23
 8-10       21        11-6        5         14-0        -6        16-6       -24
 9-0        19        11-8        4         14-2        -7        16-8       -26
 9-2        18        11-10       3         14-4        -7        16-10      -28
 9-4        17        12-0        3         14-6        -8        17-0       -31
 9-6        16        12-2        2         14-8        -9        17-2       -33
 9-8        14        12-4        1         14-10      -11        17-4       -35
 9-10       13        12-6        0         15-0       -12        17-6       -37
10-0        12

How to Construct C Scale. The T scale measures total ability in a sort of absolute sense. The B scale measures brightness, i.e., ability relative to age. The purpose of the C scale is to indicate automatically a pupil's correct classification in school in the trait tested, and to measure ability relative to grade. A pupil may be doing excellent work for his age but poor work for his grade or vice versa. The steps in the process of constructing a C scale follow.

1. Construct grade distributions similar to the age distribution in Table 10.

2. Using the T score column and the frequency column for the grade in question, compute the mean T score for each grade or for each half-grade in case the schools tested have half-year promotions. These mean T scores for each grade are grade norms. The grade norms were as follows:

Grade    2A     2B     3A     3B     4A     4B     5A     5B     6A     6B     7A     7B
Norm     26     30     33.7   37.3   39.6   41.8   44.9   48.0   50.9   53.7   56.0   58.3

Grade    8A     8B     9A     9B     10A    10B    11A    11B    12A    12B
Norm     59.6   60.9   61.5   62.1   62.9   63.6   64.5   65.4   66.8   68.1

3. Write the grade designations in the foregoing, 2A, 2B, 3A, etc., as decimals which will indicate how much of each grade the classes tested have completed. Since the test was given in June the 2A classes had completed half of Grade II, the 2B classes had completed all of Grade II, and so on. Hence 2A above should be changed to 2.5, 2B to 2.99 or 3.0, 3A to 3.5, 3B to 4.0, 4A to 4.5, 4B to 5.0, etc. If the test has been given just after mid-year promotion, 2A should be written as 2.0, 2B as 2.5, etc.

4. Interpolate to determine what norm corresponds to each tenth of a grade.
Since 2.5 corresponds to 26, and 3.0 to 30, 2.6 is found by interpolation to correspond to 26.8, 2.7 is found to correspond to 27.6, and so on. The expansion by interpolation shown in Table 13C, p. 126, illustrates the process in detail. "Grade" has been written as "G" (grade status), and "Norm" has been altered to T since it is really a mean T score. The table has been extended downward by common sense estimation, and upward arbitrarily so that the highest possible score will coincide with a G of 20.

5. Prepare a C correction table for correcting a G into a C. The C corrections are given below. They are the same for all tests whether designed for the elementary or the high school, and regardless of the time when the data for scaling the test were collected.

End of Month      1     2     3     4     5     6     7     8     9     10
C Correction     +.4   +.3   +.2   +.1    0    -.1   -.2   -.3   -.4   -.5

21. The Test Should Be Long Enough to Yield Reliable Scores.

This means that not only the time for, but also the material of the test should be adequate. We have just seen that calling the pupil's score the scale difficulty of the single most difficult test element done correctly tends to yield an unreliable score. This is because this procedure in effect shortens the test, since not every test element plays an intimate part in determining the score. To secure adequate reliability frequently requires that two or more forms of a test be given and the results averaged. Spearman has devised a formula in order to determine how many forms of a test must be given to yield a desired reliability, that is, a desired self-correlation coefficient (see Chapter IX). The answer is given by the following formula:

$$N = \frac{r_x - r_1 r_x}{r_1 - r_1 r_x}$$

where $N$ is the number of forms of the test required to yield $r_x$; $r_x$ is the desired self-correlation coefficient; and $r_1$ is the self-correlation coefficient of one form with another form of the test.
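The formula for N can be sketched in Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the second function is simply the same relation solved for the self-correlation of the mean of N forms.

```python
def forms_needed(r1: float, rx: float) -> float:
    """Number of forms whose mean reaches the desired self-correlation rx,
    given the self-correlation r1 of one form with another form."""
    return (rx - r1 * rx) / (r1 - r1 * rx)


def reliability_of_mean(r1: float, n: float) -> float:
    """Self-correlation expected for the mean of n forms
    (the relation above solved for rx)."""
    return n * r1 / (1 + (n - 1) * r1)


# One form correlates .8 with a duplicate; a self-correlation of .95 is wanted.
print(round(forms_needed(0.8, 0.95), 2))

# Four forms, each correlating .7 with a duplicate.
print(round(reliability_of_mean(0.7, 4), 2))
```

A fractional result from `forms_needed` is rounded up in practice, since only whole forms can be given.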
Thus the number of forms of a test required to yield a self-correlation coefficient ($r_x$) of .95, when the coefficient of correlation ($r_1$) of one test with a duplicate is .8, may be found by substituting in the foregoing formula and solving for $N$, thus:

$$N = \frac{.95 - .8(.95)}{.8 - .8(.95)} = 4.75 \text{ or } 5.$$

This tells us that the mean of 5 equivalent forms of the test would correlate with the mean of 5 other equivalent forms to the extent of .95.

Sometimes the information desired is what self-correlation coefficient would result from correlating the mean of, say, 4 equivalent forms of a test with the mean of 4 other equivalent forms, when, say, $r_1$ is .7. Here the formula and substitutions are:

$$r_x = \frac{N r_1}{1 + (N - 1) r_1} = \frac{4 \times .7}{1 + (4 - 1)(.7)} = .90$$

If $r_1$ in both the above substitutions should be the self-correlation coefficient found by correlating the mean of two equivalent forms of a test with the mean of two other forms, instead of the self-correlation coefficient for one form of a test with another form, the foregoing formulae may be operated just the same. The $N$ found in the first computation would show, however, not 5 forms of the test but 5 pairs of forms, i.e., 10 forms, or more exactly 9.5 forms. Since, in the second computation, 4 forms are equivalent to two pairs of forms, 2 should take the place of 4, thus:

$$r_x = \frac{2 \times .7}{1 + (2 - 1)(.7)} = .82$$

How reliable should a test be? A self-correlation coefficient of 1.0 would mean perfect reliability. The best intelligence tests have self-correlation coefficients of one form with a duplicate of .9 to .95 as based upon records from unselected pupils of the same chronological age. In grade groups the coefficient would be slightly less. The standard test has a reliability in age groups of about .8. A test with a reliability of .8 will yield a sufficiently reliable mean score for a group of 40 or more pupils. It will not yield a very reliable score for an individual.
The experimenter should have little confidence in the reliability of individual scores unless his test has a self-correlation of .95 or above, or until he has given enough forms of the test to bring the self-correlation to or above this figure. Fortunately, experimenters are more concerned, as a rule, with mean scores for groups of pupils than with individual scores.

Self-correlation coefficients are probably not the most intelligible way to determine and report reliability. Another way is illustrated in miniature in Table 12. The first column indicates the various pupils. The second column shows the scores made on one form of a test. The third column shows the scores made on another form of the test given shortly afterward. The fourth column shows the difference between the two scores. The mean of the differences shows the amount of error on the average to be expected with this test.

TABLE 12
APPROXIMATE METHOD OF DETERMINING A TEST'S RELIABILITY

           Score on    Score on
Pupil      Form 1      Form 2      Difference
  a           20          22            2
  b           12          15            3
  c           25          24           -1
  d           32          35            3
  e           12          11           -1
  f            6          10            4
  g           28          28            0
  h           15          13           -2
  i           18          20            2
  j           22          20           -2
Mean difference (non-algebraic)         2.0
Mean difference (algebraic)             0.8
Net difference (unreliability)          1.2

Were each of the tests perfectly reliable and were there no increase or decrease of the second series of scores over the first series due to (a) difference in difficulty of the two tests, (b) practice on the first test, (c) instruction, coaching, or natural growth in the trait, the second series of scores would then be identical with the first series and the differences in the last column would all be zero. Any difference due to (a), (b), and (c), provided these influences have operated equally upon all pupils, can be eliminated by diminishing the non-algebraic mean difference by the amount of the algebraic mean difference. The net difference is approximately pure unreliability.
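The net-difference computation of Table 12 can be sketched as follows. It is a minimal illustration; the function name `net_difference` and the variable names are hypothetical, and the scores are those listed for the ten pupils in Table 12.

```python
FORM_1 = [20, 12, 25, 32, 12, 6, 28, 15, 18, 22]
FORM_2 = [22, 15, 24, 35, 11, 10, 28, 13, 20, 20]


def net_difference(first, second):
    """Return (non-algebraic mean difference, algebraic mean difference,
    net difference) for two series of scores from the same pupils."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    mean_abs = sum(abs(d) for d in diffs) / n   # signs ignored
    mean_alg = sum(diffs) / n                   # signs kept
    # The algebraic mean captures any shift common to all pupils;
    # what remains after removing it is taken as unreliability.
    return mean_abs, mean_alg, mean_abs - abs(mean_alg)


mean_abs, mean_alg, net = net_difference(FORM_1, FORM_2)
print(mean_abs, mean_alg, round(net, 2))
```

Taking the absolute value of the algebraic mean lets the same function serve when the second series runs lower than the first.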
To secure an absolutely pure measure of unreliability would require that an allowance be made for the fact that all pupils do not profit equally from practice, instruction, coaching, maturing, and the like.

The procedure illustrated in Table 12 is quite satisfactory provided the variation in scores on form 1 of the test is the same or approximately the same as the variation in scores on form 2. Whether the general size of the scores is the same on both forms is immaterial. Equivalent forms of tests are so constructed, as a rule, that the two series of scores are alike in both variability and general size. The variability of scores on form 1 of Test A in Table 12 is about the same as that of the scores on form 2. The slight tendency for the scores on form 2 to be larger than those on form 1 is discounted by the use of the mean algebraic difference, namely 0.8.

TABLE 13
ILLUSTRATING THE NECESSITY FOR EQUATING VARIABILITIES BEFORE COMPUTING RELIABILITY BY THE NET-DIFFERENCE METHOD

                              Test X                   Test Y                Equated Var.
                         Form   Form  Differ-     Form   Form  Differ-     Form   Form  Differ-
Pupil                      1      2    ence         1      2    ence         1      2    ence
  a                       22      0    -22         10      0    -10         10      0    -10
  b                       24      2    -22         14      8     -6         14      4    -10
  c                       26      4    -22         18     16     -2         18      8    -10
  d                       28      6    -22         22     24      2         22     12    -10
  e                       30      8    -22         26     32      6         26     16    -10
Mean Difference (non-algebraic)         22                      5.2                       10
Mean Difference (algebraic)             22                      2.0                       10
Net Difference (unreliability)           0                      3.2                        0

Test X in Table 13 illustrates a situation where the variabilities are identical, but where the two series of scores differ markedly in size. The net difference shows how this process eliminates the effect of differences in size. Test Y illustrates a situation where mere inspection shows there is perfect reliability, yet the net difference fails to show perfect reliability. It fails to show the true reliability because the variation in scores is not the same for both forms.
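The Test Y case in Table 13 can be checked numerically. The sketch below is a minimal illustration under stated assumptions: it takes the standard deviation of the listed scores as the measure of variability and rescales form 2 to the variability of form 1. The function and variable names are hypothetical.

```python
from statistics import pstdev

# Test Y scores from Table 13.
FORM_1 = [10, 14, 18, 22, 26]
FORM_2 = [0, 8, 16, 24, 32]


def net_difference(first, second):
    """Return (non-algebraic mean, algebraic mean, net difference)."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    mean_abs = sum(abs(d) for d in diffs) / n
    mean_alg = sum(diffs) / n
    return mean_abs, mean_alg, mean_abs - abs(mean_alg)


def equated_net_difference(first, second):
    """Rescale the second series so that its S.D. matches the first,
    then compute the net difference on the equated scores."""
    factor = pstdev(first) / pstdev(second)
    rescaled = [score * factor for score in second]
    return net_difference(first, rescaled)


print(net_difference(FORM_1, FORM_2))          # before equating
print(equated_net_difference(FORM_1, FORM_2))  # after equating
```

Before equating, the net difference for these scores is about 3.2; once the variabilities are equated it falls to zero, apart from floating-point rounding.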
The variability of the scores on form 2 is exactly twice that of the scores on form 1. The variabilities can be made identical by the simple process of dividing all the scores on form 2 by 2. Once the variabilities are equated the net difference shows the true reliability, as shown in the third portion of the table.

It is seldom feasible to determine the amount of a test's variability by inspection as was done for form 2 of Test Y in Table 13. The usual procedure is to compute for each series of scores one of the standard measures of variability, such as Q (quartile deviation) or SD (standard deviation), and to use these as a basis for equating. The computation of the Q and SD is explained in Chapter VI. Suffice it to state here that the SD for form 1 of Test Y is 5.66, and for form 2 is 11.32. Thus the SD's show also that the variability of scores on form 2 is twice that for form 1. The variabilities or SD's may be equated by dividing all scores on form 2 by 2, as was done, or instead, by multiplying all scores on form 1 by 2. Had the SD been 5 for form 1 and 4 for form 2, variabilities could be equated by dividing the scores on form 1 by 1.25, or instead, by multiplying the scores on form 2 by 1.25. Had the SD's been 1 and 6 for forms 1 and 2, respectively, variabilities could be equated by multiplying scores on form 1 by 3, and by dividing scores on form 2 by 2. That is, the variability of one form may be adjusted to another form or the variability of both forms may be adjusted to a third variability different from the original variability of both. Sometimes one type of adjustment is more convenient and sometimes the other.

Herring has called attention to the fact that the correspondence of scores on one form of a test with scores on another form is not the best measure of reliability.
He claims, and rightly so, that scores on one form of a test will correspond more closely with mean scores from an infinite number of forms, than they will with scores on another equally unreliable form. That is, the correct measure of the reliability of a test is some measure of the closeness of its correspondence with a perfectly reliable determination.

A better measure of the reliability of a test than that given by self-correlation or self net difference is the correlation between a test and the mean of two forms of that test, or the net difference between a test and the mean of two forms of the test. The effect of this last is to make the net difference just exactly half the net difference between one form and another. The procedure would yield a net difference of 0.6 instead of 1.2 for the data of Table 12. But due to the fact that a test has half the influence in determining the mean of the two forms against which it is checked, the preceding procedure makes the reliability appear about as much better than it really is as the self-correspondence procedure makes it appear less satisfactory than it really is. Otis¹ has determined that the true unreliability is .707 of the net difference as computed in Table 12 and Table 13. The correct measure of unreliability for Table 12 is .707 times 1.2, i.e., .8484.

22. The Test Should Be Scored Comprehensively Enough to Yield Reliable Scores.

The failure to score all phases of a pupil's product while taking a test may be a prolific source of unreliability, particularly in the case of rate tests where one phase is intimately dependent upon another. Thus a sort of see-saw relation exists between speed and quality in a rate test of handwriting. Generally, as speed increases, quality decreases and vice versa.
Unless the method of testing is such as to keep speed, say, constant, the two quality scores for a pupil from two tests might be quite dissimilar, whereas if each quality score were corrected for differences in speed, they might, in reality, be identical.

The approximate amount of correction for speed may be determined empirically. That correction is best which will produce the maximum possible self-correlation between the two series of corrected scores for quality. Another technique for determining the amount of correction has been proposed by Courtis and Thorndike² and applied to the former's rate tests in arithmetic.

¹ Otis, Arthur S., "The Reliability of the Binet Scale and of Pedagogical Scales," Journal of Educational Research, September, 1921.
² Courtis, S. A., and Thorndike, E. L., "Correction Formulae for Addition Tests," Teachers College Record, January, 1920.

23. The Test Should Be So Constructed As to Permit Uniformity of Procedure in Applying and Scoring It.

The key to objectivity and an important key to reliability is this matter of uniformity of procedure. If it is not possible to repeat a test in a uniform way, one individual cannot verify his own previous results, and one individual has even less opportunity to verify the results of another. The possibility of uniformity is partly a function of the nature of the test, partly of the detail and accuracy of the directions for applying and scoring the test, and partly of an experimental determination and consequent allowance for the amount and direction of each individual's personal equation. The first two are the most promising.

24. The Test Should Have Satisfactory Age and Grade Norms.

The experimenter has less need for norms than other users of tests. The experimenter is more interested, as a rule, in comparing the progress of one experimental group with the progress of an equivalent experimental group.
Norms are very convenient, however, where only one experimental group is available, for then the progress of the available experimental group may be compared with the progress of the norm group. Proper allowances can be made for any differences of intelligence between the two groups thus compared.

Norms are most valuable when they are representative of the groups with whom it is most desirable to make comparisons; when they are based upon enough cases to make them stable; when both the total distribution of scores and the averages are reported; when the number of cases upon which they are based is stated; and when the date of standardization is specified.

The addition of a B-scale correction to 50 or its subtraction from 50 shows the norm for the chronological age corresponding to the particular correction (see Table 11).

25. The Test Should Be Provided With an Inexpensive Leaflet of Directions, Scoring Devices, and Tabulation and Graph Forms.

All too frequently it is necessary, in order to use a test, to purchase a monograph. In this monograph it is quite common to discover after diligent search that the directions for applying the test are in the appendix, that directions for scoring are near the beginning of the book, that the key for scoring is somewhere else, that norms are at still another place in the monograph, and that tabulation forms are lacking entirely. Fortunately a strong public opinion is compelling a more careful attention to these details. This consideration for the time and convenience of test users applies less to experimenters who are constructing tests for temporary purposes than to those who expect a wide distribution of the test which they have prepared.

IV.
SAMPLE TEST AND DIRECTIONS

In order to give a concrete illustration of how the T, B, C, F scale system will operate in practice there follows an unfinished sample of form 1 of an arithmetic test now in process of construction, and a tentative model direction booklet. All the data in the tables are for another test of 35 elements instead of for the arithmetic test of 80 elements. Otherwise the tables may be thought of as applying to the arithmetic test.

CHINESE FUNDAMENTALS OF ARITHMETIC SCALE

Do not open this paper until told to do so. As soon as I have told you how, fill the blanks below, and then hold up your pencil to show that you have finished.

Surname ........ First Name ........ Boy, Girl ........ Age in Years ........ Birth Month ........ Birthday ........ Grade ........ Date: Year of Republic ........ Month ........ Day ........

Pencils up!

We want to see how well you can add, subtract, multiply, and divide. Do all your work on this paper. Get no help from anyone. Answers should be given in decimals and not in fractions. See how many examples you can get correct in the time allowed. You will be told your score later. As soon as you finish one page, do the next.

Rights: Addition .... Subtraction .... Multiplication .... Division ....
(z) (2) (3) (4) 3 6 7 7 4 2 5 9 Add (5) (6) (7) (8) 6 8 9 8 3 4 5 O Subtract (9) (z0) (77) (12) 5 8 I O 24 50 7 5 4 6 Add (13) (74) (15) (16) 29 74 76 92 6 4 32 21 Subtract (17) (18) (79) (20) 4 3 7 8 2 3 3 6 Multiply (27) (22) (23) (24) 2)6 4)8 4) 36 7)49 Divide (25) (26) (27) (28) 22 72 69 58 ras 26 4 8 Add 120 How to Experiment in Education (29) (30) (32) (32) 34 44 41 86 Subtract 8 7 26 19 Subtract (33) (34) (35) (36) 24 20 28 63 Multiply 2 4 7 9 Multiply (37) (38) (39) (40) Divide 2)178 4)260 5) 845 7)973 Divide (47) (42) (43) (44) 984 32 75 43 253 571 Add oa 89 457 185 Add (49) (50) (57) (52) 407 350 65 7 Multiply 7 8 36 57 Multiply (53) (54) (55) aon Divide 9)54054 §8)16200 43)559 27)864 Divide (57) (58) (59) (60) 72 28 46 95 53 60 98 72 28 — 89 70 43 6.43 69 39 48.19 -78 Add 98 39 96.13 70. Add (61) (62) (63) (64) 5004 3500 7-32 a Subtract 169 2891 2.59 8.63 Subtract (65) (66) (67) (68) Multiply 70 600 8 “7 Multiply Experimental Measurements 121 (69) (OL NG Ae Divide 68)68544 97)1949700 55)198 83)431.6 Divide (73) (74) (75) (76) ,; 58 76 7555 72.3 Multiply BT .09 5.98 8.06 Multiply (77) (78) (79) (80) Divide .40)2.42 .90)3.59 .03)8.76 .08).46 Divide When you finish, close your paper, lay it on your desk with the front page up, and wait quietly until papers are collected. DIRECTIONS FOR THE CHINESE FUNDAMENTALS OF ARITHMETIC SCALE ForRM I I. GENERAL DIRECTIONS FOR APPLYING TEST 1. Follow the instructions for giving the test with literal exact- ness. No additional help should be given except as hereafter provided for. Avoid unstandardized introductory remarks. Secure rapport by charm of manner rather than felicity of expression. 2. Give directions distinctly, at moderate speed, with careful attention to emphasis, loudly enough to enable all pupils in the room to hear without difficulty, and confidently enough to secure instant obedience from every pupil. Insist courteously but firmly on this prompt obedience from the start. 3. 
Remove all distracting elements from the environment, and make pupils as comfortable as possible. Provide against any disturbances while the test is in progress. Preferably there should be no visitors.

4. Prevent copying. Do this by carefully watching those who act suspiciously or by standing beside them. Do not distract others by oral reprimands in the midst of the test.

5. In timing the test use a stop-watch if possible. If not, an ordinary watch may be used provided it has a second hand. Where feasible, it is well to have an assistant do the timing.

6. Clear desks. See that each pupil is provided with a sharpened pencil. Have a few extra pencils available.

7. Carefully count enough and just enough test papers for each row and place them on the first desk of that row. Be very careful lest a test paper be left in the possession of the pupils. If pupils are practiced or are permitted to practice themselves on the contents of this test, its usefulness as a measuring instrument will be destroyed.

II. INSTRUCTIONS TO PUPILS

1. Hold up one of the test papers and say: One of these papers will be placed on each desk. Do not open them until told to do so. Will the pupils in the first row please distribute papers.

2. When papers are distributed, say: Look at the first page and read silently while I read aloud.

3. Read the directions with a sufficient pause at the end of each sentence to permit the direction to be followed or the thought to be fully grasped.

4. When directions have been read, record the time in hours, minutes, and seconds, as you say: Open your paper and begin!

5. At the end of exactly 10 minutes, say: Stop! Draw a large circle around the example you are now working on and then pencils up. (Pause.) Now finish the example and go right on.

6. Make sure that each pupil does not forget that as soon as he finishes one page he is to do the next, and that he does not overlook the last page.

7.
At the end of exactly 30 minutes after saying "Begin," say: Stop! Pencils down! Will pupils in the first row please collect papers.

III. HOW TO SCORE TEST

Take a blank test paper and fill it out with the correct answers given below. This scoring stencil may be creased in successive folds, thus making it possible to lay the row of correct answers just below the pupil's answers. Draw a line through every incorrect or omitted answer and write the number of correct answers in each row to the right of that row. Compute the total number of correct answers made on the entire test by each pupil and write this in the "Examples correct" space provided on the front page of his paper.

To be counted correct a pupil's answers must agree exactly with those given below. Each example is scored as either wholly right or wholly wrong. No partial credits are given. When an answer has been corrected by the pupil, the correction is the answer to be scored. The use of fractions instead of decimals is scored as incorrect in order to discourage a cumbersome practice. If pupils must meet fractions in their environment, they should be taught how to convert fractions into decimals. Omission or misplacement of a decimal point makes the answer wrong. The presence of zero before an integer or after a decimal does not make an otherwise correct answer incorrect.

As a rule it will be found quite satisfactory to have pupils exchange papers and do all the scoring themselves, the examiner calling the correct answers. If this is done, at least two pupils should score each paper, and the examiner should check the accuracy of the scoring for some of the papers. The list of correct answers follows.

Example   Form 1     Example   Form 1     Example   Form 1     Example   Form 1
   1         7          21        3          41       112         61      4835
   2         8          22        2          42       132         62       609
   3        12          23        9          43      1694         63      4.73
   4        16          24        7          44      1084         64     66.37
   5         3          25       57          45       194         65      4200
   6         4          26       98          46       286         66     30600
   7         4          27       73          47       562         67      4.72
   8         8          28       66          48       299         68      6.30
   9        11          29       26          49      2849         69      1008
  10        13          30       37          50      2800         70      2010
  11        28          31       15          51      2340         71       3.6
  12        56          32       67          52      4332         72       5.2
  13        23          33       48          53      6006         73     21.46
  14        79          34       80          54      2025         74      6.84
  15        44          35      196          55        13         75    451.49
  16        71          36      567          56        32         76   582.738
  17         8          37       89          57       533         77      6.05
  18         9          38       65          58       465         78      15.1
  19        21          39      169          59    144.32         79       292
  20        48          40      139          60     86.21         80      5.75

IV. HOW TO COMPUTE PUPIL Ta (TOTAL ABILITY IN ARITHMETIC)

Find the pupil's total number of examples correct in the first column of Table 13A and read the corresponding Ta. This is the pupil's T score in arithmetic. Thus the first pupil in Table 13D (p. 127) did 16 examples correctly, which, according to Table 13A, corresponds to a Ta of 40.

TABLE 13A

Examples           Examples           Examples           Examples
Correct     Ta     Correct     Ta     Correct     Ta     Correct     Ta
   0        23        9        33       18        43       27        63
   1        25       10        34       19        45       28        67
   2        26       11        35       20        47       29        71
   3    [illegible]  12        36       21        49       30        76
   4        27       13        37       22        51       31        79
   5        28       14        38       23        53       32        86
   6        29       15        39       24        56       33        86
   7        31       16        40       25        58       34        92
   8        32       17        42       26        60       35        96

V. HOW TO COMPUTE PUPIL Ba (BRIGHTNESS IN ARITHMETIC)

Find the pupil's solar age in Table 13B and read the corresponding Ba correction. If the Ba correction is plus, add it to the pupil's Ta. If it is minus, subtract it from his Ta. The result is the Ba. Thus the first pupil in Table 13D is 13 yrs. 2 mos. old, which, according to Table 13B, corresponds to a Ba correction of -2. His Ta of 40 plus the Ba correction of -2 gives a Ba of 38.

TABLE 13B

Solar Age   Add to       Solar Age   Add to     Solar Age   Add to     Solar Age   Add to
Yrs.-Mos.   T Score      Yrs.-Mos.   T Score    Yrs.-Mos.   T Score    Yrs.-Mos.   T Score
  7-6         34          10-2         11        12-8         -1        15-2        -13
  7-8         32          10-4         10        12-10        -1        15-4        -15
  7-10        31          10-6          9        13-0         -2        15-6        -16
  8-0         29          10-8          8        13-2         -2        15-8        -17
  8-2     [illegible]     10-10         8        13-4         -3        15-10       -19
  8-4         25          11-0          7        13-6         -4        16-0        -20
  8-6         24          11-2          6        13-8         -4        16-2        -21
  8-8         22          11-4          6        13-10        -5        16-4        -23
  8-10        21          11-6          5        14-0         -6        16-6        -24
  9-0         19          11-8          4        14-2         -7        16-8        -26
  9-2         18          11-10         3        14-4         -7        16-10       -28
  9-4         17          12-0          3        14-6         -8        17-0        -31
  9-6         16          12-2          2        14-8         -9        17-2        -33
  9-8         14          12-4          1        14-10       -11        17-4        -35
  9-10        13          12-6          0        15-0        -12        17-6        -37
 10-0         12

VI. HOW TO COMPUTE APPROXIMATE SOLAR AGE (FOR USE IN CHINA)

First, determine the pupil's lunar age and the lunar month of birth. Deduct 1 from his lunar age to get his basal age. Then from the number of the lunar month in which the tests are given, deduct the number of his lunar month of birth. If the resulting number is positive, add that number of months to his basal age to get his approximate solar age. For example, if the pupil is 15 yrs. old and was born in the 5th month, and if the tests are given in the 8th month, his basal age is 15 - 1 = 14 yrs., and the number of months is 8 - 5 = 3. Thus his approximate solar age will be 14 yrs. 3 mos.

In case the resulting number is negative, it means that the pupil is not up to the supposed basal age. Then from this age deduct the number of months deficient. Thus if a 15-year-old pupil who was born in the 11th lunar month is tested in the 8th lunar month, his basal age is 14 but he is deficient by 3 months (8 - 11 = -3). So his solar age should be 14 yrs. minus 3 mos., that is, 13 yrs. 9 mos.

VII. HOW TO COMPUTE PUPIL Ca (CLASSIFICATION IN ARITHMETIC)

Find the pupil's Ta in Table 13C and read the corresponding Ga (grade status in arithmetic). A Ga of 4.0, 4.5, or 4.9 means that the pupil has an ability in arithmetic equal to the average fourth-grade pupil at the beginning, middle, or end of the year respectively.
To convert a Ga into a Ca add to or subtract from the Ga the Ca correction shown below. Use the correction for the month when the test was applied. Thus the first pupil's Ta in Table 13D is 40. According to Table 13C this Ta is equivalent to a Ga of 4.6. Since the test was applied December 10th this is nearest to the end of November, i.e., the 3rd month. The correction for the 3rd month is + .2 which added to the Ga yields a Ca of 4.8. Of course the correction is the same for all pupils tested on December 10. For a school starting October 1, December 10 is the 2nd month, and similarly for other starting dates.

End of Month    1    2    3    4    5    6    7    8    9    10
Correction      [illegible in this copy, except + .2 for the 3rd month]

TABLE 13C

[Ta to Ga conversion table, printed as six pairs of Ta and Ga columns. Most entries are illegible in this copy. Legible entries include Ta 40.0 = Ga 4.6, Ta 40.4 = Ga 4.7, Ta 40.8 = Ga 4.8, Ta 41.2 = Ga 4.9, and Ta 48.0 = Ga 6.0.]

viii. HOW TO COMPUTE CLASS Ta, Ba, AND Ca

The Ta for the class, grade, or group is the mean of the pupils' Ta's. In Table 13D the class Ta is 48.2.

To compute the class Ba, first compute the mean solar age for the class, second, convert this into a Ba correction by the use of Table 13B, third, add or subtract the Ba correction to or from the class Ta. Thus the mean solar age for the class in Table 13D is 12 yrs. 2 mos. According to Table 13B, this solar age corresponds to a Ba correction of + 2. When 2 is added to the class Ta, the resulting class Ba is 50.2 as shown in Table 13D.

To compute the class Ca, find the class Ta in Table 13C and read the corresponding Ga. Add to or subtract from the Ga the appropriate correction. Thus the class Ta of 48.2 corresponds to a Ga of 6.0. A Ga of 6.0 plus a correction of .2 for the third month gives a class Ca of 6.2.

TABLE 13D
CHINESE FUNDAMENTALS OF ARITHMETIC SCALE, FORM I
School No. 25     Grade VI     Date December 10, 1922

Solar Age          Name    Ta      Ba      Ca
13 yrs. 2 mos.     A       40      38      4.8
12 yrs. 6 mos.     B       50      50      6.5
10 yrs. 7 mos.     C       53      62      7.1
11 yrs. 4 mos.     D       46      52      5.9
13 yrs. 5 mos.     E       52      48      6.9
12 yrs. 2 mos.     Class   48.2    50.2    6.2

ix.
HOW TO INTERPRET PUPIL Ta AND CLASS Ta

The number of examples correct is not a satisfactory unit of measurement because the difference in difficulty between 30 and 31 examples correct may be greater or less than between 10 and 11 examples correct. The difference between 30 T and 31 T or 28 T and 29 T always equals the difference between 10 T and 11 T or 55 T and 56 T.

Again T scores make possible such statements as the following. Any pupil or class whose T is 50 has an ability which equals the mean ability of all twelve-year-old pupils. Any pupil or class whose T is 70 has an ability which is 20 T (or 2 S. D.) above the mean ability of twelve-year-olds. Any pupil whose T is 35 is 15 T (or 1.5 S. D.) below the mean ability of twelve-year-olds.

Again, T scores may be interpreted as shown in Table 13E.

TABLE 13E

           Is Exceeded by the                 Is Exceeded by the
           Following Per Cent                 Following Per Cent
T Score    of 12-year-olds         T Score    of 12-year-olds
  25            99                   55            31
  30            98                   60            16
  35            93                   65             7
  40            84                   70             2
  45            69                   75             1
  50            50                   80             0.1

x. HOW TO INTERPRET PUPIL Ba AND CLASS Ba

The Ba norm is always 50 for all pupils. If a pupil's Ba is 50, his arithmetic ability equals the mean ability of all pupils of like age. He is of average brightness. If his Ba is 40 he is 10 T (or 1 S. D.) below the mean brightness in arithmetic of his own age group. According to Table 13E he is exceeded by 84 per cent, not of 12-year-olds, but of pupils of like age. If his Ba is 75, he is 25 T (or 2.5 S. D.) above the mean brightness in arithmetic of pupils of like age. According to Table 13E, he is extremely bright, since only 1 per cent of his own age group are brighter. In like manner the mean Ba for a class shows the brightness in arithmetic of that class as a whole as compared with the brightness of all other classes, not of like grade, but of like age.

Thus both Ta and Ba are needed.
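The entries of Table 13E above agree with what a normal distribution gives when 50 T is the mean and 10 T equals 1 S. D. The sketch below assumes such a normal distribution (an assumption made for illustration; the function names are hypothetical) and uses only Python's standard library.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def percent_exceeding(t_score):
    """Per cent of the reference group whose ability exceeds a given T score."""
    z = (t_score - 50) / 10          # 10 T corresponds to 1 S. D.
    return 100 * (1 - STANDARD_NORMAL.cdf(z))


def t_from_percent_exceeding(percent):
    """Inverse conversion: the T score that is exceeded by a given per cent."""
    return 50 + 10 * STANDARD_NORMAL.inv_cdf(1 - percent / 100)


# Rebuild the rows of Table 13E, rounded to whole per cents.
table_13e = {t: round(percent_exceeding(t)) for t in range(25, 80, 5)}
```

The inverse function is the same conversion that turns a per cent into a T value, so it may also serve as a check on hand look-ups of that kind.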
Ta gives a measure of total arithmetic ability and incidentally shows how much each pupil or class Ta is above or below the mean Ta of twelve-year-olds. A Ta scale is used primarily for the purpose of measuring growth in ability from month to month and year to year. But a nine-year-old pupil or class might have a Ta much below 50 and still be doing exceptionally satisfactory work. There is needed some score which makes allowance for the fact that a pupil or class is younger or older than twelve. The Ba correction automatically makes just this allowance, and the Ba shows pupil or class ability in comparison with pupils or classes of the same age. A young pupil may have a small Ta and a large Ba and an old pupil may have a large Ta and a small Ba. A pupil or class Ta grows larger from month to month and year to year, whereas the Ba changes little or not at all.

xi. HOW TO INTERPRET PUPIL Ca AND CLASS Ca

For a pupil to have a Ca of 3.5 means that he is an average third-grade pupil in the fundamentals of arithmetic. A Ca of 3.0 means that he barely belongs in the third grade. A Ca of 3.9 means that he is almost, but not quite, ready to be promoted into fourth-grade work in the fundamentals of arithmetic. A Ca of 6.4 means that he just fails of being an average sixth-grade pupil. The class Ca is interpreted similarly.

Since the pupils in Table 13D are sixth-grade pupils their norm Ca is 6.5 and will continue to be 6.5 so long as they remain in Grade VI. It jumps to 7.5 as soon as a pupil is promoted to the next grade. The first pupil is 1.7 Ca or grade below norm. The second pupil is exactly at the Ca norm. The class is 0.3 Ca below the Ca norm.

xii. SUPPLEMENTARY DIAGNOSTIC SCORING

On the front page of the test paper, write in the space after "Attempts," the number of the example circled by the pupil. This may be taken as a measure of his speed of work.
Write in the space after "Rights" the number of examples done correctly inclusive of and prior to the example circled. A comparison of Rights and Attempts shows the per cent of accuracy. Some pupils are slow and inaccurate, some slow and accurate, some fast and inaccurate, some fast and accurate, and some are average. Each type requires different treatment.

There are 20 examples for each of the four processes. Count separately the number of examples done correctly on each process, and write these scores in the spaces provided on the front page of the test paper. If the pupil has mastered each of the processes equally well his four separate scores should be approximately equal in size.

An even more helpful diagnosis can be secured by making out, or having the pupils make out, a table showing just what examples were missed or omitted by each pupil. From this the per cent of pupils missing or omitting each example can be readily determined. Each pair of examples (1 and 2, 3 and 4, etc.) is built to test a pupil's mastery of a certain type, principle, or difficulty. As a rule, each pair of examples includes the difficulties of all preceding pairs and one additional difficulty. Two examples of each type are included because a chance error may cause a pupil to miss an example whose principle he has really mastered.

Once each pupil's need has been discovered in these ways, he can be given training on his specific weaknesses. A specially effective set of practice materials for giving this training is being prepared by the Nanking Committee for publication by the Commercial Press, Shanghai. Under no circumstances should a pupil be especially drilled on the particular examples of this test. The teacher who does this destroys the usefulness of the test as a measuring instrument.

Since diagnostic scores are intended for local use rather than for publication, tables have not been provided for scaling them.

xiii.
ACCURACY OF SCALE SCORING

The accuracy of scale scores depends upon (1) the way in which pupils to be tested were selected, and (2) the number of pupils tested. The pupils tested were a random sampling from the total population in grades III through VIII in the government schools of Peking and Tientsin. The number tested was approximately 2000.

xiv. ACKNOWLEDGMENTS

These arithmetic scales were prepared by the Peking Committee consisting of Professors L. C. Cha, C. Y. Chang, Y. C. Chang, T. T. Lew, E. L. Terman, Wm. A. McCall, their students, and Lydia Sherritt, under the auspices of the National Association for the Advancement of Education.

The units of measurement used in these scales were devised by Dr. Wm. A. McCall and named by him in honor of those whose contribution to scientific mental measurement has been of most fundamental significance.

T (Total ability) is for Thorndike, the originator and teacher of scientific educational measurement and author of the first College Entrance Intelligence Test, and for Terman, the author of the Stanford Revision of the Binet-Simon scale and leading exponent of the age-scale system.

B (Brightness) is for Binet, the creator, with Simon, of the first intelligence scale, and for Buckingham, the creator of the grade-scale system.

C (Classification) is for Courtis, an early pioneer in educational measurement and originator of practice tests, and for Cattell, who with Fullerton laid the foundation built upon by Hillegas in constructing the first statistically satisfactory product scale, and in remembrance of China, where this unit was first devised and used as such.

F (Effort) is for Franzen, Pintner and Monroe, all of whom published at about the same time a practical mechanism for measuring achievement as related to capacity to achieve. This unit is used only when both an intelligence and educational test have been given.

W. T. Tao, General Director of the Association.

V.
SUMMARY OF THE STEPS IN THE PROCESS OF CONSTRUCTING, SCALING, AND STANDARDIZING A TEST

I. Difficulty Test

1. Decide upon the mental trait to be measured and define it as exactly as possible.
2. Decide upon a test form and general content which will measure this trait and this trait only, which will yield one and only one correct and easily scored pupil response to each test element, and where each element may be scored as either right, wrong, or omitted.
3. Decide upon the range of ability to be measured.
4. Consult previous tests of this trait or similar traits to determine how easy and how difficult the test elements must be made, how simple the directions must be, and what is a suitable mechanical arrangement of material for mimeographing or printing.
5. If no such test exists prepare a tentative set of directions and a few tentative test elements and try them on a few of the ablest and least able pupils ever likely to be tested.
6. Prepare a test, which is as perfect in every detail as possible, which advances by gradual steps of difficulty from slightly easier to slightly more difficult than will be required in the final test, and which has about one-fourth more content than will be required in the final test (unless the test is for diagnostic purposes in which case only the material to be used finally should be used).
7. Make provision for the following identification data: (1) First name, (2) Last name, (3) Sex, (4) Age in years, (5) Birth month, (6) Birthday, (7) School, (8) Grade, (9) Section, (10) Date of test.
8. Prepare sample and directions for pupils. For general directions to examiner, see Section III of this chapter.
9. Explain and apply the test to several intelligent adults and correct it in the light of their criticisms.
10. Apply the test to about 110 pupils scattered over the entire range of ability of pupils for whom the test is designed.
Be sure to include some of the ablest and least able pupils ever to be tested with the completed test. Give all the time pupils need to do every test element or to do all they can. Record on his paper the time required by each pupil.
11. Make out a list of correct answers, a mechanical device for scoring, and directions for scoring.
12. Score each test element, using 1 for correct, x for wrong, and o for omitted.
13. Eliminate from the test all elements which prove ambiguous, unscorable, or are otherwise unsatisfactory.
14. Discard enough tests to leave 100. Do not discard the best and poorest papers.
15. Compute the total score made by each pupil on the odd numbered questions and then on the even numbered questions.
16. Make a correlation diagram for these two sets of scores. Call in for a conference those pupils who are chiefly responsible for lowering the correlation. Go over each element tried and missed by them to see if some ambiguity or other defect is responsible. Correct or eliminate test elements if defects are brought to light.
17. Make a correlation diagram for the total score of each pupil on the total test and the criterion (if such be available). Confer and correct as before.
18. Call in a few of the most gifted pupils and enquire the reason why various test elements were missed by them. Correct or eliminate elements if defects are brought to light.
19. Tabulate, by pupils and remaining test elements, the 1's, x's, and o's, thus for the 100 papers.

                              TEST ELEMENTS
Name                1     2     3     4     5     6     7     8     9     10    etc.
[name illegible]    1     1     1     x     1     1     x     ?     o     o     etc.
[name illegible]    1     ?     ?     ?     1     x     x     o     x     o     etc.
etc.                etc.  etc.  etc.  etc.  etc.  etc.  etc.  etc.  etc.  etc.
Total Correct
T Difficulty

20. Compute, from the preceding tabulation, the number and per cent of pupils doing correctly each test element.
Since there are 100 pupils the "Total correct" will also be the per cent required. This will not be true when the pupil has a 50-50 opportunity of getting an element correct by chance. In this case, subtract from the total of 1's on each element, the total x's, and divide the remainder by 100. The quotient will be the proper per cent correct.
21. Convert each per cent into an S.D. value or T difficulty by means of Table 7.
22. Arrange test elements in order of T difficulty.
23. In view of the time records on the test and the time decided upon for the final test, decide upon the number of test elements required in order that the fastest pupil will not quite finish the test before time is called. In deciding upon the time allowance for the final test, due consideration should be given to practicality and to reliability. In general do not be satisfied with a reliability (Self r) of less than .85 between the two halves of the test. Other things being equal, an abbreviated test means a low reliability. Hence if the self r is too low, lengthen the time allowance, and increase the number of test elements or provide for two tests to be averaged instead of one longer test.
24. Select the number of test elements decided upon. Select in such a way that the successive elements will increase, so far as possible, by equal increments of T difficulty from one done correctly by about 99 per cent of the pupils to one done correctly by about 1 per cent of the pupils. If the elements available are too easy or too difficult try out and incorporate additional elements of the desired difficulty. Sometimes diagnostic or other considerations should weigh more heavily than difficulty or time-allowance considerations in determining the final content of a test. In this case the test constructor must use his judgment to decide how much alteration of the test content is permissible.
25.
Improve the mechanical make-up of the test and directions for applying it in any way that experience suggests.
26. Print the test in final form.
27. To test the satisfactoriness of the proposed time allowance, apply the test to the ablest class ever likely to be tested. Have pupils circle the number of the test element being worked upon at the end of regular intervals. Stop the test the moment the fastest pupil finishes. Record this time.
28. Determine the total score made by all pupils combined during each of the successive time intervals.
29. Fix an official final time allowance such that at its expiration the fastest pupil would not quite have finished and the ablest pupil would have done all he could. Adopt for future use the minimum time that would have accomplished these two objects.
30. Apply the test to about 2000 pupils in the grades for which the test is designed. The schools selected for testing should approximate as closely as possible a random sampling of all schools. In the schools selected, all pupils in the appropriate grades should be tested.
31. Score the tests and compute the total score made by each pupil. In scoring it is usually more convenient to give one point for each element done correctly, but this is not imperative. Some prefer to give 2, 1, or 0 credits to an element according to the excellence of the pupil's answer. The resulting increase in accuracy is seldom worth the extra trouble. Elements of large enough scope to justify extra points can usually be broken into two or more separate elements. Do not assign points proportional to the difficulty of an element. This involves a cumulative error.
32. Make a frequency distribution of scores for each grade, and then for each age. Make all frequency distributions in step intervals the size of the smallest scoring unit. This is usually one.
33.
Using 8.0 to 9.0, 12.0 to 13.0, or 16.0 to 17.0 year-olds for primary, higher elementary, or high school, respectively, convert these raw scores into T scores by means of Table 7, and as illustrated in Table 6.
34. If thought desirable, increase the range of the T scale by a process illustrated in Table 8.
35. Construct a B scale for the test by a process illustrated in Table 10.
36. Construct a C scale for the test.
37. Prepare the official directions booklet to be issued with the test. In order to secure uniformity, a sample directions booklet is given in Section IV of this chapter.

II. Rate Test

1. Do steps I1, I2, I3, I4 except that all elements of the test should be of uniform or approximately uniform difficulty, I5, I6 except the statement concerning gradually increasing difficulty, I7, I8, I9, I10 except that there should be a fixed time allowance instead of a fixed number of elements to be done, I11, I12, I13, I14, I15, I16, I17, I18, I19 for a few representative test elements only to see whether the test elements are on the desired difficulty level, I20, I21, I23, I24 except for all reference to difficulty, I25, I26, I30, I31, I32, I33, I34, I35, I36, and I37.
2. Since rate tests usually yield two scores, namely number tried and accuracy, T, B, and C scales may be constructed for both, or for just number right only, or for a properly weighted combination of number tried and number right.

III. Product Tests Such As Handwriting, Composition, and Drawing

1. Do I1, I2 except that product tests are usually scored as a whole rather than by separate elements, I3, I4, I5, I6 except for the references to difficulty, I7, I8, I9, I10 except that there should be a fixed time limit, and, in the case of traits like composition and drawing, a warning a few minutes before time is called.
2. Repeat I10 on the same group of pupils so as to secure two measures of the trait.
3.
Do I14 for both sets of products.
4. Rate 1 the poorest specimen in the first set. Rate 2 the next poorest and so on to 100. Have this done by, say, three competent judges. Average the three judgments to get the final rating for each specimen.
5. Repeat III4 for the second set of specimens.
6. Do I16 for these two sets of ratings, and I17 for either set or both. If the self r is too low, increase the time allowance or provide for two or more tests to be averaged and treated as one.
7. Do [step references illegible in this copy].
8. Pick out all specimens written by pupils of ages 8.0 to 9.0, or 12.0 to 13.0, or 16.0 to 17.0 depending upon the level for which the test is designed. Age 12.0 to 13.0 will serve fairly well for all levels. Write on each specimen a number without regard to its merit.
9. Separate the papers into ten piles, A (poorest), B (next poorest), C, D, E, F, G, H, I and J (best), according to the merit of each specimen.
10. Take pile A and divide it into 5 piles, a (poorest), b, c, d, and e (best), according to merit.
11. Do III10 for the other nine piles.
12. Take pile Aa and arrange the papers in it in order of merit.
13. Do III12 for Ab, Ac, Ad, Ae, Ba, Bb, and so on for the 50 separate piles.
14. Carefully compare the few best specimens in Aa with the few poorest specimens in Ab. If the order of merit is not correct rearrange across the junction point. Repeat this process for the other 48 junction points.
15. On a record sheet, write down in order of merit the number of each specimen. After the number of the poorest specimen, mark 1. After the number of the next poorest, mark 2, and so on for all specimens.
16. Have at least three competent judges do steps III9, III10, III11, III12, III13, III14, and III15 without knowledge of each other's marks.
17. Compute the mean of the three marks given each specimen by the three judges. Arrange specimen numbers in order of merit according to these means.
18.
Check that specimen number where the per cent exceeding-plus-half-those-reaching-it in merit is nearest 99.865. According to Table 7, this specimen has a merit of 20. Check the one where the per cent is nearest 99.38. This has a merit of 25. The other per cents to check are shown in the first row of the following. The T merit of the specimen checked is shown in the second row. If only half this number of specimens are desired in the final scale, use those per cents whose T merits are 20, 30, 40, 50, 60, 70 and 80. If more specimens are desired in the final scale, Table 7 will show which per cents will yield equal intervals of T merit.

Per cent    99.865   99.38   97.72   93.32   84.13   69.15
T merit     20       25      30      35      40      45

Per cent    50       30.85   15.87   6.68    2.28    .62     .13
T merit     50       55      60      65      70      75      80

19. After checking these 13, say, specimen numbers, check also the five specimens immediately preceding each in merit and the five immediately following each in merit. This will give 13 sets (N, O, P, Q, R, S, T, U, V, W, X, Y, and Z) of eleven specimens each. Mix up the specimens within each set.
20. Ask a large number of judges to arrange in order of merit the specimens in set N, and record in order the specimen numbers, together with marks 1 through 11. The previous rating by three judges can be utilized.
21. Repeat III20 for the other twelve sets.
22. Compute the mean of all these marks given each specimen.
23. Guided by these means, choose from set N the specimen most central in merit. This is the specimen most entitled to the T merit of 20. Do likewise for sets O, P, Q, etc., and give to each, T merits of 25, 30, 35, etc., respectively. These 13 specimens together with their T merits constitute a product-scoring scale, which may be used to determine the T score in handwriting made by any pupil.
All that is necessary is to move the pupil's specimen along this scale until a scale specimen is found which is like it in merit. The pupil's T score is the T merit of the scale specimen most like it in merit.
24. Have at least three competent judges score each of the 2000 specimens originally collected by comparing it with the specimens in this product-scoring scale. Consider that each pupil's T score is the mean of these three ratings.
25. Do I32 for each of the grades, and for each of the ages, except age 12.0 to 13.0.
26. Do I35, I36, and I37.
27. A much more laborious and, for purposes of pure research, perhaps more satisfactory method of constructing a product-scoring scale is described in Chapter IX, Section IV of "How to Measure in Education."

If this more laborious method of product-scale construction is used, omit steps III8 through III23. Do III24, III25 not excepting ages 12.0 to 13.0, I33, I34, I35, I36, and I37.

IV. Battery of Tests

1. Prepare each of the difficulty, rate, or product tests entering into the battery up to, but not including, step I26, in so far as these 25 steps apply to the construction of each type. If there are product tests, construct, besides, a product-scoring scale for each, based upon about 1000 specimens collected from 1000 unselected pupils between the ages 8.0 and 9.0, 12.0 and 13.0, or 16.0 and 17.0.
2. Prepare all these component tests from data collected from the same 100 pupils. If tests are merely being compiled and were carried through the preliminary stages previously, then apply them all to the same 100 pupils.
3. Compute the total score on each test separately made by these 100 pupils on the basis only of the test elements selected for the final form of the test.
4. Make a separate frequency distribution of the 100 scores on each test.
5. Compute the SD of each frequency distribution.
6.
If all tests in the battery are to have equal weight, choose a multiplier for each SD such that all SD's will be made approximately alike in size. For example:

SD             Multiplier
4              1
2              2
8              1/2
[illegible]    3

If all tests are not to have equal weight, choose multipliers which will bring the SD's to the desired ratio. Choose multipliers such that the labor of applying them will be the least possible.
7. Print the tests in booklet form. Insert the multipliers on the front page of the booklet, thus:

Test     Points     Multiplier     Weighted Points
1                   1
2                   2
3                   ÷ 2
4                   [illegible]
Total

8. Do all three of I27, I28, and I29 for each difficulty test in the battery.
9. Do I30 for the battery booklet.
10. Do I31 for each of the battery tests.
11. Compute for each pupil the total weighted points as indicated in IV7.
12. Do all of I32, I33, I34, I35, and I36 for the total weighted points.
13. Do I37 for the battery.

CHAPTER VI

COMPUTATIONS FOR THE ONE-GROUP EXPERIMENTAL METHOD

Computation Model I. The purpose of this chapter is to give and explain a series of computation molds into which the experimenter may fit his experimental data. Enough such models are given to provide for all the common varieties of experiments. Thus all the experimenter needs to do is to find the mold which fits his experiment, substitute in it his experimental data, do the computations indicated, and the proper conclusions and the reliability of these conclusions will follow automatically.

The simplest type of experiment is the one-group experiment, where two experimental factors are contrasted, and where only one type of test is used to measure the change produced by the experimental factors. The computation mold for this experimental method is given in Table 14.

TABLE 14
COMPUTATION MODEL I
One Group, Two EF's, One Test Type

         Group A, EF1                          Group A, EF2
P     IT1    FT1    C1     x     x^2        IT1    FT1    C2     x     x^2

N                   M1           Sx^2                     M2           Sx^2
                    AM                                    AM
                    c                                     c

For each half of the table:  $SD = \sqrt{\frac{Sx^2}{N} - c^2}$  and  $SDM = \frac{SD}{\sqrt{N}}$  (SDM1 for EF1, SDM2 for EF2).

SUMMARY
           EF1    EF2    D          SDD                               EC
Test 1     M1     M2     M1 - M2    $\sqrt{(SDM1)^2 + (SDM2)^2}$      $\frac{D}{2.78\,SDD}$

Illustration of Computation Model I. Table 14 is best explained by formulating an experimental problem which may be solved by means of the one-group experimental method, and then to substitute sample data in computation model I.

Assume this problem: What is the effect of a defined amount of vigorous physical exercise upon the pulse rate of pupils? This problem may be solved by the one-group method. There are two EF's, namely, vigorous physical exercise (EF1) and the absence of such exercise (EF2).

TABLE 15
ILLUSTRATING HOW TO USE COMPUTATION MODEL I WITH SAMPLE DATA, WHEN EF2 IS THE MERE ABSENCE OF EF1
One Group, Two EF's, One Test Type

         Group A, EF1                          Group A, EF2
P     IT1    FT1    C1     x     x^2        IT1    FT1    C2     x     x^2
a      95    105    10     2      4          95     95     0     0      0
b     100    105     5     3      9         100    100     0     0      0
c     101    109     8     0      0         101    101     0     0      0
d      97    106     9     1      1          97     97     0     0      0
e     102    109     7     1      1         102    102     0     0      0
f      96    108    12     4     16          96     96     0     0      0
g      99    107     8     0      0          99     99     0     0      0
h      98    107     9     1      1          98     98     0     0      0
i     100    111    11     3      9         100    100     0     0      0

N = 9          M1 = 8.8    Sx^2 = 41                M2 = 0     Sx^2 = 0
               AM = 8.0                             AM = 0
               c = 0.8                              c = 0

EF1:  $SD = \sqrt{\frac{41}{9} - (0.8)^2} = 2.0$,   $SDM1 = \frac{2.0}{\sqrt{9}} = 0.7$
EF2:  $SD = \sqrt{\frac{0}{9} - (0)^2} = 0$,   $SDM2 = \frac{0}{\sqrt{9}} = 0$

SUMMARY
           EF1    EF2    D      SDD                               EC
Test 1     8.8    0      8.8    $\sqrt{(0.7)^2 + (0)^2} = 0.7$    $\frac{8.8}{2.78 \times 0.7} = 4.6$

Table 15 reproduces model I in statistical form.
Unless the formula especially demands something else, all computations at all stages are done to the nearest first decimal only, so as to make it easier for the student to check computations. Greater exactness is advised in actual experimental computations.

Computation of Changes Produced by EF1. Since a thorough mastery of the symbols, abbreviations, and computations shown in Table 14 and illustrated in Table 15 is essential to an understanding of all subsequent experimental computations, the data of these two tables are explained in considerable detail.

Both Table 14 and Table 15 show the experimental computations for any one-group experiment contrasting two EF's and employing only one type of test. The one type of test employed in Table 15 is a test or count or determination of pulse rate. Of course this test was made more than once, but throughout Table 15 only one function is measured. Had the effect of vigorous exercise upon both pulse rate and, say, blood pressure been studied, two test types would have been employed, since two different functions would have been measured.

In the left half of both Table 14 and Table 15 "Group A" is the experimental group or subjects used. As indicated, Group A has EF1 applied to it. Instead of placing EF1 immediately after Group A as shown in the tables it might have been placed between IT1 and FT1 to indicate that the EF1 is applied to Group A after the IT1 and before the FT1.

In Table 14 "P" represents the pupils who constitute Group A. The "N" beneath it means the number of pupils in Group A. In Table 15 the pupils used are a, b, c, etc., and N is 9. IT means the initial test or scores made on the initial test by each pupil. In Table 15, these scores are pulse rates of 95, 100, 101, etc. The numeral 1 following IT refers to the first type of test. This will be needed more when more than one test type is used. The "FT1" refers to the final test.
"C1" in both Table 14 and Table 15 means the change produced by the EF1, and is found by computing the difference between each pupil's IT and FT. Thus in Table 15 C1 for Pupil a is 10 points, found by getting the difference between 105 and 95. Had the IT1 for Pupil a been 105 and the FT1 been 95, C1 would still be 10, but should be preceded by a minus sign to indicate that the change is a 10-point loss. In all cases where the FT is smaller than the IT a minus should be prefixed to the C, unless the test is scored in terms of time or the like, where a smaller FT than IT clearly means a gain rather than a loss. In cases where it is not clear whether a smaller FT than IT is desirable or undesirable, the minus should be prefixed. The experimenter should remember, however, that the minus in such cases does not, as it usually does, mean something undesirable.

Computation of Mean, SD, and SDM for EF1. The "M1" under the C1 is the arithmetic mean of the various C1's. In Table 15 this M1 is 8.8. Had any of the C1's been preceded by a minus the M1 would have been less than 8.8, for signs should be regarded in computing M1.

The "AM" beneath the M1 means the assumed mean. The AM is used instead of the M1 for computing x, x², etc., because its use is a great convenience and economy. Any convenient number might be used as the assumed mean, though it is usually most convenient to assume a whole number near the M1. Thus in Table 15, 8.0 is used as the AM, which makes the c or correction 0.8. Signs are disregarded in determining and using c. The AM of 8.0 makes a c of 0.8. An AM of 9.0 would make a c of 0.2. Had the M1 been 8.0 instead of 8.8, an excellent AM would be 8.0, which would make a c of zero.

The symbol x is the traditional symbol for deviation. Thus the x for Pupil a is 2, because his C1 of 10 deviates or differs from the AM of 8.0 by 2 points.
The x for Pupil b is 3, because his C1 of 5 deviates from 8.0 by 3 points. As in the case of c, the direction of the deviation is disregarded. Had the C1 for Pupil a been -10 instead of +10, the x would be 18 instead of 2, because the difference between 8.0 and -10 is 18 points. Had the AM been -8.0 and the C1 been -10, the x would have been 2.

The column labeled "x²" is found by squaring all the x's. Sx² means the sum of the x² column. In Table 15, Sx² is 41. SD means standard deviation and is one of several conventional measures of variability. It is computed according to the formula given in Table 14 and illustrated in Table 15. No matter whether the AM is larger or smaller than the M, the c² is always subtracted from $\frac{Sx^2}{N}$, and it is subtracted before the square root of the whole quantity is taken. The subtraction of c² corrects for the use of 8.0 instead of 8.8 in computing x's, x²'s, etc. If the reader will compute x, x², etc., from 8.8, he will appreciate the convenience of using 8.0 and correcting for its use at the end. The N in the SD formula means the number of pupils in the experimental group. The SD in Table 15 is 2.0.

SDM1 or SD of the M1 is so indicated to distinguish it from the preceding SD or SD of the C1's. SDM1 is a conventional measure of the unreliability of the M1. It is computed according to the formula shown in Table 14 and illustrated in Table 15. The SDM1 for Table 15 is 0.7. The reliability of the M1 of 8.8 is shown then by its SDM1 of 0.7.

Computations for EF2. The right half of Table 14 and Table 15 is headed "Group A-EF2" because EF2 is applied to the same group of pupils as experienced EF1. Column P is omitted, since the pupils are the same as those shown in the first column of the table. The IT, FT, C2, M2, AM, c, x, x², etc., shown in the right half of the table are interpreted and computed like those shown in the left half of the table.
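The assumed-mean computation just described can be checked with a short Python sketch. This is only an illustration using the Table 15 data; the function name `assumed_mean_stats` and its return order are assumptions made here, not part of the text. Unlike the table, the sketch does not round c to the first decimal before using it, so the intermediate figures differ slightly although the rounded results agree.

```python
import math

def assumed_mean_stats(changes, assumed_mean):
    """Return (M, c, Sx2, SD, SDM) by the assumed-mean shortcut.

    Hypothetical helper for illustration. Deviations x are taken from the
    assumed mean (AM) with signs disregarded, and c squared is subtracted
    from Sx2 / N before the square root is taken.
    """
    n = len(changes)
    mean = sum(changes) / n
    c = abs(mean - assumed_mean)                  # correction, sign disregarded
    x = [abs(ch - assumed_mean) for ch in changes]
    sum_x2 = sum(d * d for d in x)                # Sx2
    sd = math.sqrt(sum_x2 / n - c ** 2)           # SD of the changes
    sdm = sd / math.sqrt(n)                       # SD of the mean
    return mean, c, sum_x2, sd, sdm

# Pulse rates from Table 15 (pupils a to i).
it1 = [95, 100, 101, 97, 102, 96, 99, 98, 100]
ft1 = [105, 105, 109, 106, 109, 108, 107, 107, 111]
c1 = [ft - it for it, ft in zip(it1, ft1)]        # change for each pupil

m1, c, sum_x2, sd, sdm1 = assumed_mean_stats(c1, 8.0)
print(round(m1, 1), sum_x2, round(sd, 1), round(sdm1, 1))
```

Because the correction is exact, any assumed mean gives the same SD as deviations taken from the true mean; the choice of AM affects only the convenience of the arithmetic.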
In Table 15 the EF2 is merely the absence of vigorous exercise. That is, EF2 is merely a continuation of the same restful conditions which obtained when the IT1 in the left half of the table was made. The IT1 in the right half of the table does not need redetermination, for presumably the results would be identical with the IT1 results shown in the left half. Since EF2 is a continuation of the conditions obtaining when the IT1 is made, the FT1 will coincide, presumably, with the scores on the IT1. This makes zero all the C2's, the M2, the x's, x²'s, SD, and SDM2. In actual practice when EF2 is merely the absence of EF1, the experimenter will not actually compute the right half of the table but will assume all the C2's and subsequent measures to be zero. In case EF2 is not the mere absence of EF1, the right half of the table will have to be computed in detail.

Computation of M and SD when N Is Large. The method of computing M and SD illustrated in Table 15 is appropriate and convenient when N is small. It is appropriate, but not convenient, when N is, say, 50 or more. When N is large it is more convenient to determine the C1 for each pupil as in Table 15, and then to tabulate these C1's into a frequency distribution. The procedure for constructing a frequency distribution is as follows:

(1) Write a column of figures beginning with the smallest C1 and increasing by one to the largest C1. (2) Write this column in step-intervals of one, extending from five-tenths below to five-tenths above the C1. The first column of Table 16 illustrates (1) and (2). (3) Look at the original C1's. If the first C1 is 4, place a dot or mark just after the step-interval 3.5 to 4.5 in Table 16. If the next C1 is -2, place a mark just after the step-interval -2.5 to -1.5. If the next C1 is another 4, place another mark just after the step-interval 3.5 to 4.5.
Continue until a mark has been made after the appropriate step-interval for every C1. (4) Total the marks placed after each step-interval, and write this total just after the step-interval in question. When finished, the two resulting columns will be a frequency distribution. The first and second columns of Table 16 constitute a frequency distribution. Note that each zero frequency (f) must be indicated if the data are to be used for further computation.

TABLE 16
SHOWING HOW TO COMPUTE M AND SD WHEN N IS LARGE

      C              f      x      fx     fx²
-4.5 to -3.5         1     -8      -8      64
-3.5 to -2.5         2     -7     -14      98
-2.5 to -1.5         2     -6     -12      72
-1.5 to -0.5         3     -5     -15      75
-0.5 to  0.5         3     -4     -12      48
 0.5 to  1.5         4     -3     -12      36
 1.5 to  2.5         0     -2       0       0
 2.5 to  3.5         5     -1      -5       5
 3.5 to  4.5         6      0       0       0
 4.5 to  5.5         5      1       5       5
 5.5 to  6.5         2      2       4       8
 6.5 to  7.5         0      3       0       0
 7.5 to  8.5         5      4      20      80
 8.5 to  9.5         3      5      15      75
 9.5 to 10.5         3      6      18     108

AM = 4.0        N = 44        Sfx = +62 and -78, difference -16        Sfx² = 674
$c = \frac{-16}{44} \times 1 = -0.36$
$M = 4.0 + (-0.36) = 3.64$
$SD = \sqrt{\frac{674}{44} - (0.36)^2} \times 1 = 3.9$
$SDM = \frac{3.9}{\sqrt{44}} = 0.59$

The steps in the process of computing M and SD follow: (1) Some AM is selected at the mid-point of some step-interval near the center of the frequency distribution. Any AM will do, but it must be at the mid-point of some step-interval. AM = 4.0. (2) N is computed. N = 44. (3) The step x's from the AM are computed. Thus the step-interval 3.5 to 4.5 deviates from 4.0 by zero. Step-interval 2.5 to 3.5 deviates by -1. Step-interval 4.5 to 5.5 deviates by +1, and similarly for other step-intervals. Note that zero frequencies are not overlooked. Each x is then multiplied by its corresponding f to secure the fx column. (4) The positive fx are added. The negative fx are added. The difference between these two sums is obtained. Positive Sfx = 62. Negative Sfx = 78. The difference = -16. (5) The c is computed.

$c = \left(\frac{-16}{44}\right) \times (\text{size of step-interval})$. c = -.36.
Had AM been 3.0 instead of 4.0, the positive Sfx would have been larger than the negative Sfx. This would have produced a positive instead of a negative c. (6) M is computed by the formula: M = (AM) + (c). Had c been positive instead of negative, M would have been 4.36 instead of 3.64. (7) The fx² column is secured by squaring each x and multiplying by the corresponding f. It may also be secured by multiplying each fx by the corresponding x. (8) The Sfx² is computed. Sfx² = 674. (9) The SD is computed by the formula:

$SD = \left(\sqrt{\frac{Sfx^2}{N} - c^2}\right) \times (\text{size of the step-interval})$

$SD = \left(\sqrt{\frac{674}{44} - (0.36)^2}\right) \times (1) = 3.9$

In this formula c is taken in step-interval units, that is, as the difference of the fx sums divided by N, before multiplication by the size of the step-interval. (10) SDM is computed according to the usual procedure.

Sometimes a frequency distribution is so strung out that the experimenter prefers to condense it into step-intervals of 2, 3, or more instead of 1, or to construct it in step-intervals of 2, 3, or more from the beginning. Thus the frequency distribution of Table 16 may be grouped as shown in Table 17. No matter what the size of the step-interval, the process for computing M and SD is the same as that already described. That this is so is shown by Table 17.

TABLE 17
SHOWING HOW TO COMPUTE M AND SD WHEN N IS LARGE AND WHEN THE FREQUENCY DISTRIBUTION IS GROUPED IN STEP-INTERVALS OF TWO (DATA FROM TABLE 16)

      C              f      x      fx     fx²
-4.5 to -2.5         3     -3      -9      27
-2.5 to -0.5         5     -2     -10      20
-0.5 to  1.5         7     -1      -7       7
 1.5 to  3.5         5      0       0       0
 3.5 to  5.5        11      1      11      11
 5.5 to  7.5         2      2       4       8
 7.5 to  9.5         8      3      24      72
 9.5 to 11.5         3      4      12      48

AM = 2.5        N = 44        Sfx = +51 and -26        Sfx² = 193
c = 1.14

The process just described for computing M1, SD, and SDM1 may be used for computing M2, SD, and SDM2. It may be used, in fact, for computing any M, SD, or SDM.

Computation of Median and SDmedian. Because of its greater reliability, the M is usually preferable to the median. The only advantage of the median is that it is less influenced by extreme improvements.
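Returning for a moment to the grouped computation of M, SD, and SDM illustrated in Tables 16 and 17, the ten steps may be sketched in Python as follows. The function name `grouped_stats` and its argument names are assumptions introduced for illustration, and the sketch keeps c in step-interval units inside the SD formula, multiplying by the size of the step-interval only at the end.

```python
import math

def grouped_stats(midpoints, freqs, assumed_mean, step):
    """Return (N, Sfx, Sfx2, M, SD, SDM) for a frequency distribution.

    Hypothetical helper: `midpoints` are the mid-points of the
    step-intervals, `freqs` the frequencies, `assumed_mean` the AM (itself
    a mid-point), and `step` the size of the step-interval.
    """
    n = sum(freqs)
    # Step deviations of each interval from the AM, signs regarded.
    x = [round((m - assumed_mean) / step) for m in midpoints]
    sum_fx = sum(f * xi for f, xi in zip(freqs, x))
    sum_fx2 = sum(f * xi * xi for f, xi in zip(freqs, x))
    c_steps = sum_fx / n                          # correction in step units
    mean = assumed_mean + c_steps * step
    sd = math.sqrt(sum_fx2 / n - c_steps ** 2) * step
    sdm = sd / math.sqrt(n)
    return n, sum_fx, sum_fx2, mean, sd, sdm

# Table 16: step-intervals of one, AM = 4.0.
mid16 = list(range(-4, 11))
f16 = [1, 2, 2, 3, 3, 4, 0, 5, 6, 5, 2, 0, 5, 3, 3]
t16 = grouped_stats(mid16, f16, 4.0, 1)

# Table 17: the same data in step-intervals of two, AM = 2.5.
mid17 = [-3.5, -1.5, 0.5, 2.5, 4.5, 6.5, 8.5, 10.5]
f17 = [3, 5, 7, 5, 11, 2, 8, 3]
t17 = grouped_stats(mid17, f17, 2.5, 2)

print(t16)
print(t17)
```

Both groupings yield the same M, since grouping in intervals of two keeps the same pupils in each pair of adjacent intervals; the SD from the coarser grouping differs slightly because of the grouping itself.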
A few pupils making relatively large or relatively small improvements will affect the size of the M more than they will affect the size of the median. If these extreme improvements were twice as large or half as small respectively, the median would remain unaltered, but not so the M. There are as many arguments for their being allowed to have their full effect as for a curtailment of their effect.

But there may be rare occasions on which the experimenter will prefer the median to the mean. For this reason the steps in the process of computing a median and an SDmedian for the frequency distribution of Table 16 follow. (1) Compute N. N = 44. (2) Compute ½N. ½N = 22. (3) Begin at the top of the frequency column and add the successive f's, calling the successive totals until ½N or 22 has been reached, thus: 1 and 2 are 3, and 2 are 5, and 3 are 8, and 3 are 11, and 4 are 15, and 0 are 15, and 5 are 20, and 2 of the 6 are 22. (4) Place this 2 as a numerator over this 6, multiply the fraction 2/6 by 1, the size of the step-interval, and add the product to the beginning point of the step-interval corresponding to the frequency of 6, namely 3.5. The result is the median.

$\text{Median} = 3.5 + \left(\frac{2}{6} \times 1\right) = 3.83$

The reliability of the median 3.83 is found by means of the following formula:

$SD_{median} = \frac{1.25\,SD}{\sqrt{N}}$

The SD in the preceding formula may be the SD from the mean, computed in the usual way, or it may be the SD from the median. It will be found more convenient as a rule to use the SD from the mean. If computed from the median, the exact deviations from the exact median must be used, because SD from the median must be computed by the formula $SD = \sqrt{\frac{Sx^2}{N}}$ instead of $SD = \sqrt{\frac{Sx^2}{N} - c^2}$.

The steps in the process of computing a median for Table 17 follow. (1) N = 44. (2) ½N = 22. (3) 22 = 3 and 5 are 8, and 7 are 15, and 5 are 20, and 2 of 11.
(4) $\text{Median} = 3.5 + \left(\frac{2}{11} \times 2\right) = 3.86$

The experimenter may have difficulty in computing a median for a frequency distribution where the numerator of the fraction is zero and the preceding f or f's is zero. Table 18 shows how to overcome this difficulty.

TABLE 18
SHOWING HOW TO COMPUTE A MEDIAN IN TWO SPECIAL SITUATIONS

     C            f                               C              f
2.5 to 3.5        1     N = 14              10.5 to 15.5         2     N = 12
3.5 to 4.5        0     ½N = 7              15.5 to 20.5         1     ½N = 6
4.5 to 5.5        2     7 = 1+0+2+4+0       20.5 to 25.5         3     6 = 2+1+3+0+0
5.5 to 6.5        4       and 0 of 5        25.5 to 30.5         0       and 0 of 4
6.5 to 7.5        0                         30.5 to 35.5         0
7.5 to 8.5        5                         35.5 to 40.5         4
8.5 to 9.5        2                         40.5 to 45.5         2

$\text{Median} = \frac{6.5 + 7.5}{2} = 7.0$                   $\text{Median} = \frac{25.5 + 35.5}{2} = 30.5$

The median is sometimes called the 50 percentile. It is possible to compute other percentile points according to the same process. The 50 percentile is found by counting down the frequency column ½N. The 25 percentile or Q1 is found by taking ¼N. The 75 percentile or Q3 is found by taking ¾N. The 20 percentile is found by taking ⅕N. A knowledge of Q1 and Q3 enables us to compute Q (quartile deviation) by the formula:

$Q = \frac{Q3 - Q1}{2}$

Q, which is a variability measure like SD and which is approximately .6745 SD, may be used in the place of SD to compute SDmedian. In fact, this is the simplest way to determine SDmedian. The formula is:

$SD_{median} = \frac{1.85\,Q}{\sqrt{N}}$

Computation of D and SDD. In the "Summary" (Tables 14 and 15) are retabulated certain measures previously computed, and certain additional computations are made. First there appears the mean of the changes produced by EF1, i.e., M1 in Table 14 and 8.8 in Table 15. Next comes the mean of the changes produced by EF2, i.e., M2 in Table 14 and zero in Table 15. The next step, namely, "D" or difference, is merely the difference between M1 and M2, i.e., M1 - M2 in Table 14, or between 8.8 and 0, i.e., 8.8 in Table 15. It is well to form the habit of subtracting M2 from M1. Then a plus D will mean that EF1 has been more effective than EF2.
A minus D will mean always just the reverse. This D is the most significant measure shown in the two tables. It is the chief goal of the experimental computations. It yields the conclusion from the experiment. Thus the D of 8.8 in Table 15 tells us that the C produced by EF1 is 8.8 points larger than that produced by EF2. This is another way of saying that the effect of a defined amount of vigorous physical exercise is to increase the pulse rate 8.8 on the average.

The next computation, namely, SDD or the SD of the D, utilizes the SDM1 and SDM2 as shown in the two tables. This SDD shows the reliability of the preceding D just as the SDM1 shows the reliability of M1. That is, the D of 8.8 has a reliability of 0.7.

In case medians have been used instead of M's, D will be the difference between median 1 and median 2, and SDD will be computed according to the formula:

$SDD = \sqrt{(SD_{median\,1})^2 + (SD_{median\,2})^2}$

Though SDM and SDD will be used throughout this book, many experiments report reliability in terms of PE. Thus the reader of scientific literature frequently sees something like this: Mean = 8 ± 0.7, or like this: Difference = 4 ± 1.0. Such expressions signify that the PE of the mean or PEM is 0.7, and that the PED is 1.0. By multiplying any SD, SDM, SDmedian, or SDD by 0.6745, it may be transmuted into a PE, PEM, PEmedian, or PED respectively. SD and PE tell the same story. In a normal frequency distribution ± SD includes the middle 68% of the f's whereas ± PE includes the middle 50% of the f's.

Measures of Variability. Thus far three sorts of SD's have been computed, namely, SD or SDC, the SD of the C's; SDM, the SD of the mean of the C's; and SDD, the SD of the difference. All three are measures of variability.

The SD or SDC is a measure of the variation or variability among the C's. Thus the C1's in Table 15 vary from 5 to 12, i.e., there is a range of 7.
This 7 could be taken as a measure of variation; but the reader will easily understand that a change in the C1 for one pupil might markedly affect such a measure of variability. The SD is better because its size is dependent not upon just two pupils but upon the records for all pupils. Furthermore, the SD is demanded by the formula for SDM. The SD increases in size with an increase in the variability of the C's, and it decreases as the variation of the C's decreases. In sum, it is an exceedingly sensitive and stable measure of the variability among the C's. The SD of 2.0 in Table 15 means approximately that 68 per cent of all the C1's fall between M1 - 2.0 and M1 + 2.0, or between 8.8 - 2.0 and 8.8 + 2.0, or between 6.8 and 10.8. The per cent between M - SD and M + SD is almost exactly 68 when the C's make an exactly normal frequency distribution, i.e., when a graph of the frequency distribution has the normal bell shape.

The SDM is also a measure of variability. It is a measure of the variability among the M's just as SD is a measure of variability among the C's. Assume the nine pupils used in Table 15 to be a random sampling from the 10,000 ten-year-old pupils in a certain school system. Imagine this experiment repeated upon another random sampling of nine pupils from the total 10,000, and then upon another sampling, and then upon another sampling, and so on until a great many samplings have been taken and a great many M1's have been computed. In making these samplings certain pupils might be chosen more than once and certain ones might never be chosen at all. Not all the M1's so computed would be identical. In fact, no two M1's might be identical. Certainly there would be variation among them. The SD of all these M1's could be computed just as the SD of the C1's was computed.
When so computed, the result would be SDM1, and, in theory at least, would be the same as SDM1 computed by the formula illustrated in Table 15, i.e., 0.7. Since it is more probable that all these M1's will center at the obtained M1 of 8.8 than at any other point, the SDM1 of 0.7 tells us that most probably 68 per cent of these M1's would be between 8.8 - 0.7 and 8.8 + 0.7, i.e., between 8.1 and 9.5. In sum, SDM1 is a measure of variability just as SD is a measure of variability. The difference is that SD is computed from actually obtained C's whereas SDM1 is always computed by formula. The M1's whose variability it measures could actually be determined as suggested above, but in practice their existence is only imagined.

SDD is also a measure of the variability among many differences determined from many repetitions of the experiment upon different random samplings. As with SDM1, SDD is computed always by formula. The SDD of 0.7 in Table 15 tells us that most probably 68 per cent of all the differences determined from such repetitions of this experiment would fall between the obtained difference 8.8 - 0.7 and 8.8 + 0.7, i.e., between 8.1 and 9.5. M1 and SDM1 will not always coincide with D and SDD as they do in this experiment.

Measures of Reliability and Randomness of Sampling. SDM1 and SDD are measures of reliability as well as of variability. They measure the reliability, respectively, of M1 and D. The true M1 for the 10,000 pupils in question can be determined only by securing the C1 for all 10,000 pupils. The M1 for any number of pupils less than 10,000 will not be the true mean exactly except by chance. The M1 for the nine pupils in Table 15 may happen to be the true M1. On the other hand the M1 from any other random sampling of nine pupils has as much chance of being the true M1.
Any measure which will show the amount of variation among all the M1's from the various possible random samplings of nine pupils each will be an index of how much a particular obtained M1 may be in error. The SDM1, as has been pointed out already, is just such a measure of variation. Consequently it tells us how probable it is that the obtained M1 diverges from the true M1 by a given amount. When the various possible M1's vary little among themselves, there is little chance for any one of them to diverge largely from the true M1. In such a situation the SDM1 will be small in amount. When the SDM1 is large in amount, it means that there is a large variation in size among the possible M1's, which, in turn, means that the obtained M1 is not particularly reliable. In like manner it can be shown that SDD, because it measures the variation among the possible differences, is an index of the reliability of the obtained D, and shows the probability that it diverges from the true D for all 10,000 by a given amount.

SDM1 and SDD, as computed by formula, will coincide with SDM1 and SDD as computed from a great many randomly determined M1's and D's only when an assumption underlying these formulae perfectly obtains. That is, SDM1 and SDD, as computed by formula, are valid only to the extent that the nine pupils used are a genuine random sampling of all the 10,000 pupils, or that the obtained C's are a genuine random sampling of all the C's that would be obtained if all 10,000 pupils were experimented upon. That is, both reliability formulae assume randomness of sampling.

In actual practice no one would hope to secure a genuine random sampling from 10,000 pupils by selecting only nine pupils. Since this book, however, is concerned with methodology rather than results, a ludicrously small amount of data is used in most tables.
The purpose of this is economy of space and clearness of presentation rather than to set an example for the reader.

Close attention to the nature of the sampling is necessary, not only in order to discover the validity of the reliability measures computed but also to determine the limitations of the conclusion drawn from the experiment. Thus if the pupils used in the experiment are a random sampling from the ten-year-olds in a particular elementary school, the conclusion should be distinctly limited to the ten-year-olds in this particular school. The experimenter cannot be sure that the results of his experiment apply to all ten-year-olds in the United States, or to all eleven-year-olds in this same school.

Experimental Coefficient and Chances. The "EC" or experimental coefficient in Table 14 and Table 15 remains to be explained. The formula for its computation is given in the former table and illustrated in the latter. The experimental coefficient has been devised to interpret SDD. The formula for its computation is so constructed that an experimental coefficient of 1.0 means that we can be practically certain that the true D is somewhere above zero. An EC of 0.5 means that we can be only half certain that the true D is above zero. An EC of 2.0 means we can be doubly certain that the true D is above zero, and similarly for other sizes of EC. Since the EC in Table 15 is 4.6 we can say that there is 4.6 times practical certainty that the true D is above zero.

Since some statisticians wish to state probability in terms of chances that the true D is above or below zero or above or below any defined point, Table 19 permits the conversion of experimental coefficients into statements of chance. This table says, for example, that when the experimental coefficient is 0.3 the chances are 3.9 to 1 that the true D is above zero if the obtained D is above zero, or below zero if the obtained D is negative.
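The EC formula and the kind of conversion given in Table 19 can be roughly checked in Python. The sketch below assumes, as a working hypothesis not stated in the text, that the chances are the one-sided odds under a normal distribution at a distance of 2.78 × EC standard deviations; the function names are likewise illustrative only. Under that assumption the computed odds come close to the tabled values, though not always to the last digit.

```python
import math

def experimental_coefficient(d, sdd):
    """EC = D / (2.78 * SDD), as in the Summary of Tables 14 and 15."""
    return d / (2.78 * sdd)

def chances_to_one(ec):
    """Approximate chances (to 1) that the true D lies on the same side
    of zero as the obtained D.

    Assumption for illustration: normal distribution of the differences,
    with the obtained D lying 2.78 * EC standard deviations from zero.
    """
    z = 2.78 * abs(ec)
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # normal CDF at z
    return p / (1.0 - p)

ec_table15 = experimental_coefficient(8.8, 0.7)
print(round(ec_table15, 1))
for ec in (0.1, 0.3, 0.5, 1.0):
    print(ec, round(chances_to_one(ec), 1))
```

The unrounded EC for Table 15 comes out slightly below the tabled 4.6, because the table rounds the product 2.78 × 0.7 to the first decimal before dividing.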
TABLE 19
SHOWING HOW TO CONVERT AN EXPERIMENTAL COEFFICIENT INTO A STATEMENT OF CHANCES

Experimental Coefficient        Approximate Chances
        0.1                         1.6 to 1
        0.2                         2.5 to 1
        0.3                         3.9 to 1
        0.4                         6.5 to 1
        0.5                          11 to 1
        0.6                          20 to 1
        0.7                          38 to 1
        0.8                          75 to 1
        0.9                         160 to 1
        1.0                         369 to 1
        1.1                         930 to 1
        1.2                        2350 to 1
        1.3                        6700 to 1
        1.4                       20000 to 1
        1.5                       65000 to 1

The formula for EC is constructed to a D of zero as a reference, because the experimenter's primary concern is to know whether the obtained superiority of one EF over another, or the obtained D in favor of one EF, is sufficiently reliable to justify him in concluding that the true D, if known, would continue to favor that same EF. If the obtained D is, say, 2.0 in favor of EF1, the experimenter wonders whether the true D may not be zero or even, say, -1.0. For the true D to be zero would be to make the two EF's of equal effectiveness. For it to become -1.0 would be to reverse the conclusion indicated by the obtained D. So whenever the EC is less than 1.0, the experimenter should state that one of his EF's is probably more effective than the other. The less the EC becomes, the more wary the experimenter should be.

This does not mean that the experimenter is justified in advising practical action on the basis of his experiment only when the EC is 1.0 or above. So long as the EC is above zero, the true D more probably lies in the direction of the obtained D than in the opposite direction. Life's most important considerations, such as marriage, investments, and hope of Heaven, rest upon an EC of less than 1.0!

Though the EC formula is built to a D of zero, it may be used to measure the probability that an obtained D will be above a defined point, or will be below a given point. Thus if we wish to know the probability that the true D in Table 15 will be above, say, 7.8 we should compute thus: 8.8 - 7.8 = 1.0.
$EC = \frac{1.0}{2.78 \times 0.7} = 0.5$. We can be only half certain that the true D is above 7.8, whereas we can be 4.6 times practically certain that it is above zero.

Since there is just as much probability that the true D is above as below 8.8, we may wish to determine the probability that the true D is below, say, 10.8. Compute thus: 10.8 - 8.8 = 2.0. $EC = \frac{2.0}{2.78 \times 0.7} = 1.0$. We can be practically certain that the true D is below 10.8. If desired these EC's may be expressed in terms of chances by the use of Table 19.

Though to do so would serve no especially useful purpose in connection with experimental computations, the EC formula may be used to help interpret the reliability of an M. In this case, the SDD in the denominator of the formula should give place to SDM. Thus if we desired to know the probability that the true M1 in Table 15 would be above, say, 5.8, we could proceed as follows: 8.8 - 5.8 = 3.0. $EC = \frac{3.0}{2.78 \times 0.7} = 1.6$. The probability then is 1.6 times practical certainty that the true M1 is above 5.8. It happens that in Table 15 the SDM1 is the same as the SDD, i.e., 0.7. In similar manner we could determine the probability that the true M1 is below a defined amount.

How to Increase the Experimental Coefficient. If the EC is not as large as desired, how can it be increased? An inspection of the EC formula reveals the answer. The EC can be increased by increasing the numerator of the formula, i.e., by increasing D. But D is not subject to control by the experimenter. It is, in fact, illegitimate for him to try consciously to increase D. Then the denominator must be reduced. The 2.78 in the denominator is constant, so it cannot be reduced. The reduction must be in the SDD. To see how it can be reduced we need to inspect the formula for computing SDD. This formula shows that the only way to reduce the SDD is to reduce one or both the SDM's upon which the size of the SDD depends.
To find out how, say, SDM1 can be reduced it is necessary to inspect the formula for computing SDM1. This reveals that the SDM1 can be reduced by reducing the SD in the numerator or by increasing the N in the denominator. Since errors of measurement tend to increase the variability among the C1's, a refinement of the testing instruments would make a slight but almost negligible reduction in SD. For practical purposes the SD cannot be materially reduced. Then the N must be increased. The N is subject to the control of the experimenter. Therefore our search has led us to the conclusion that the only practicable plan for increasing the size of the EC is to increase N.

The experimenter can compute in advance about how many pupils he must experiment upon to secure a desired EC. The EC of 4.6 in Table 15 is high enough, but suppose that an EC of 6.0 were desired. The size of the SDD required to yield an EC of 6.0 may be determined by solving the following EC formula for SDD, because, presumably, the D of 8.8 would be altered little or not at all by increases in N.

$\frac{8.8}{2.78 \times SDD} = 6.0$
$SDD = 0.5$

Now the size of the SDM1 required to yield an SDD of 0.5 may be determined by solving the following SDD formula for SDM1. The SDM2 cannot be reduced so it is disregarded. When it is reducible, it may be asked to share its proportionate part in reducing the SDD.

$\sqrt{(SDM1)^2 + (0)^2} = 0.5$
$SDM1 = 0.5$

Since the SD in the SDM1 formula changes little or not at all with changes in N, the N required to yield the needed SDM1 of 0.5 may be determined by solving the following SDM1 formula for N.

$\frac{2.0}{\sqrt{N}} = 0.5$
$N = 16$

The answer to our query is, then, that 16 pupils must be used if a desired EC of 6.0 is to be secured. If the necessary reduction in SDD is distributed between the two SDM's, N must be determined for both SDM1 and SDM2.
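The three solving steps above can be chained in a short Python sketch. The function `required_n` and its parameters are hypothetical names introduced here for illustration; the `decimals` argument reproduces the book's practice of rounding each stage to the first decimal, and passing `None` shows what the unrounded arithmetic would give. The sketch assumes, as the text does, that D and SD are unchanged by increasing N and that the other SDM is fixed.

```python
import math

def required_n(d, sd, desired_ec, other_sdm=0.0, decimals=1):
    """Estimate the N needed for a desired EC.

    Hypothetical helper: solves EC = D / (2.78 * SDD) for SDD, then
    SDD = sqrt(SDM1^2 + SDM2^2) for SDM1, then SDM1 = SD / sqrt(N) for N.
    """
    def rnd(value):
        # Round to the given number of decimals, or not at all.
        return value if decimals is None else round(value, decimals)

    sdd = rnd(d / (2.78 * desired_ec))
    sdm1 = rnd(math.sqrt(sdd ** 2 - other_sdm ** 2))
    return math.ceil((sd / sdm1) ** 2)

n_book = required_n(8.8, 2.0, 6.0)                  # first-decimal rounding
n_exact = required_n(8.8, 2.0, 6.0, decimals=None)  # no intermediate rounding
print(n_book, n_exact)
```

The two answers differ by one pupil, which illustrates the earlier advice that greater exactness than the first decimal is advisable in actual experimental computations.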
Another Illustration of Computation Model I. Table 20 illustrates the application of computation model I to sample data where EF2 is not the mere absence of EF1. Imagine the data to have been collected in an experiment to determine whether the pulse rate increased more from reading a familiar favorite thrilling short story (EF1) or from hearing the story told orally by the teacher (EF2). The story used must be an extremely familiar one, otherwise the repetition would differ markedly in interest from the first presentation, thereby invalidating the experiment unless the equivalent-groups method were used.

TABLE 20
ILLUSTRATING HOW TO USE COMPUTATION MODEL I WHEN EF2 IS NOT THE MERE ABSENCE OF EF1
One Group - Two EF's - One Test Type
[Table body illegible in the source; the columns are P, IT1, FT1, C1, x, x² for Group A-EF1 and IT1, FT1, C2, x, x² for Group A-EF2.]

The reader's attention is directed to the following special features of Table 20. The C1 of -1.0 deviates from the AM of 1.0 by 2 points. The AM is the same as M1, thereby making c of zero size. As shown by the computation of SD, when the M and AM are identical no correction for the SD is necessary. The M2 is less than the AM, but this in no way alters the usual subsequent procedure. The D is -1.8 because in this experiment EF2 proved to be more effective than EF1. The EC is only 0.7, which means that we can be only 0.7 practically certain that the true D, if known, is below zero, i.e., favors EF2.

There are several possible one-group computation models. We could have one computation model for two EF's and two test types.
Substitute Group A for “Group B” in computation model IV, Table 24, and the reader will have such a model. Again, we could have a computation model for three EF’s and one test type. Substitute Group A for “Group B” and also for “Group C” in computation model III, Table 23, and the reader will have such a model. Again, we could have a computation model for three EF’s and three test types. Substitute Group A for “Group B” and also for “Group C” in computation model V, Table 25, and the reader will have such a model. In sum, every computation model listed in the next chapter could have been listed as a one-group computation model. Economy of space is the only reason for not doing so. Imagine Group A to run through all these models instead of different groups and they will all be converted automatically into one-group computation models. In like manner the detailed discussion and illustration of computation model I in this chapter is applicable to all the computation models in the next chapter.

CHAPTER VII
COMPUTATIONS FOR THE EQUIVALENT-GROUPS EXPERIMENTAL METHOD

Computation Model II. Computation model II given in Table 21 shows the necessary computations for an experiment with two equivalent groups, two EF’s and one type of test. Note that “P” appears twice because EF2 is not applied to the same pupils who experience EF1. Note also that the detailed formulae for SD and SDM are omitted, since the reader is already familiar with them.

TABLE 21
COMPUTATION MODEL II
Two Equivalent Groups, Two EF’s, One Test Type

Group A, EF1                    Group B, EF2
P  IT1  FT1  C1  x  x²          P  IT1  FT1  C2  x  x²
N            M1     Sx²         N            M2     Sx²
             AM     SD                       AM     SD
             c      SDM1                     c      SDM2

SUMMARY
          EF1   EF2   D         SDD                            EC
Test 1    M1    M2    M1 - M2   $\sqrt{(SDM1)^2 + (SDM2)^2}$   D ÷ 2.78 SDD

Illustration of Computation Model II. In order to illustrate computation model II with sample experimental data assume this problem: Which is better for the quality of the penmanship, a penmanship period preceding the gymnasium period (EF1), or following the gymnasium period (EF2)? This problem may be solved either by the one-group or the equivalent-groups method. The equivalent-groups method is used here.

The IT for both groups should be made at the same period of the day, and at a period different from either of the experimental periods, though several other ways of working out this experiment would be as feasible and as satisfactory. Assume that the IT has been made on both groups just before dismissal at the end of the day. The FT for Group A should be made, then, just preceding the gymnasium period, and the FT for Group B should be made just after the gymnasium period. The necessary computations are made in Table 22.

TABLE 22
SHOWING HOW TO USE COMPUTATION MODEL II
Two Equivalent Groups, Two EF’s, One Test Type

Group A, EF1                          Group B, EF2
P   IT1  FT1  C1    x    x²           P   IT1  FT1  C2    x    x²
a    7    8    1    0    0            i    7    8    1    2    4
b    *    *   -1   -2    4            j    *    *   -1    0    0
c    8   10    2    1    1            k    9    7   -2   -1    1
d    8    9    1    0    0            l   10    9   -1    0    0
e    9    9    0   -1    1
f    *    *    3    2    4
g   10   11    1    0    0
h   10   12    2    1    1

N = 8   M1 = 1.1   Sx² = 11           N = 4   M2 = -0.8   Sx² = 5
        AM = 1.0   SD = 1.2                   AM = -1.0   SD = 1.1
        c = 0.1    SDM1 = 0.4                 c = 0.2     SDM2 = 0.6

* Entry illegible in the source.

SUMMARY
          EF1    EF2    D     SDD    EC
Test 1    1.1   -0.8   1.9    0.8    0.9

In Table 22 the pupils are arranged in order of the size of their IT1 scores in order that the reader will easily perceive that Group A as a whole is really equivalent in initial ability in handwriting with Group B as a whole. Table 22 also shows that the number of pupils in one group need not be identical with the number in the other group.
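The computations summarized in Table 22 can be sketched in a few lines of Python. This is a minimal sketch, not the author's worksheet. It assumes that SD is the square root of Sx²/N minus c², with x measured from the assumed mean AM, and that SDM is SD divided by the square root of N; both assumptions reproduce the M, SD, and SDM1 printed in the table. The function and variable names are illustrative only, and the C values are those of Table 22 as far as they can be read. Intermediate values are carried unrounded, so SDM2, SDD, and EC need not agree with the rounded figures printed in the Summary.

```python
from math import sqrt

def group_stats(changes, assumed_mean):
    """Return (M, Sx2, SD, SDM) for one group's changes (C = FT - IT)."""
    n = len(changes)
    x = [ch - assumed_mean for ch in changes]   # deviations from the AM
    sx2 = sum(v * v for v in x)                 # Sx2: sum of squared deviations
    c = sum(x) / n                              # correction, i.e. M minus AM
    m = assumed_mean + c
    sd = sqrt(sx2 / n - c * c)                  # assumed short-method SD
    sdm = sd / sqrt(n)                          # assumed SD of the mean
    return m, sx2, sd, sdm

def compare(m_a, sdm_a, m_b, sdm_b):
    """Return (D, SDD, EC) for two group means, with EC = D / (2.78 SDD)."""
    d = m_a - m_b
    sdd = sqrt(sdm_a ** 2 + sdm_b ** 2)
    return d, sdd, d / (2.78 * sdd)

# C columns of Table 22 (Group A, EF1 and Group B, EF2).
group_a = [1, -1, 2, 1, 0, 3, 1, 2]
group_b = [1, -1, -2, -1]

m1, sx2_a, sd_a, sdm1 = group_stats(group_a, assumed_mean=1.0)
m2, sx2_b, sd_b, sdm2 = group_stats(group_b, assumed_mean=-1.0)
d, sdd, ec = compare(m1, sdm1, m2, sdm2)

print("Group A:", m1, sx2_a, round(sd_a, 2), round(sdm1, 2))
print("Group B:", m2, sx2_b, round(sd_b, 2), round(sdm2, 2))
print("D, SDD, EC:", d, round(sdd, 2), round(ec, 2))
```

Working from an assumed mean mirrors the hand method of the table; with a machine the same M and SD could be had directly from the C values.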
Since M2 and AM are negative, we have here an illustration of the computation of x’s from a negative AM. This also affords an opportunity to show how to compute D when one of the M’s is a negative quantity. Had both M’s been negative quantities, i.e., had M1, say, been -1.1, the D would have been -0.3 in favor of EF2. Both EF1 and EF2 would have produced a loss of handwriting quality, but EF1 would have effected a larger loss. The minus is prefixed to 0.3 to indicate that EF2 is the favored one. As the experiment stands, however, the conclusion is that EF1 is better than EF2 for the quality of handwriting of pupils by 1.9 points on the handwriting scale used. We can be 0.9 practically certain that this conclusion is true for the whole group from which the experimental pupils are a random sampling.

Practical Certainty and Pre-requisites of Reliability. Several times thus far the term practical certainty has been used. This needs a fuller explanation. When 100 pupils are selected at random from 1000 pupils, we can be entirely certain that the experimental results secured for the 100 are true for those 100. But no matter how large the D, we can never be absolutely certain that results secured from any sampling less than the entire 1000 are true for the 1000. Since absolute certainty is never obtainable, except for the particular group used, statisticians have coined the term practical certainty to designate a degree of certainty which is generally acceptable. Practical certainty is defined as plus and minus three times the SD of the measure in question. Thus we can be practically certain that the true M1 lies between obtained M1 minus 3 SDM1 and obtained M1 plus 3 SDM1. If M1 is 1.1 and SDM1 is 0.4, we can be practically certain that the true M1 lies between 1.1 minus 3(0.4) and 1.1 plus 3(0.4), i.e., between -0.1 and 2.3.
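The limits of practical certainty just described amount to a single subtraction and addition. The following is a minimal sketch; the function name and the `multiple` parameter are illustrative and do not come from the text.

```python
def practical_certainty_limits(obtained, sd, multiple=3.0):
    """Return (lower, upper) = obtained minus/plus `multiple` times the SD."""
    half_width = multiple * sd
    return obtained - half_width, obtained + half_width

# M1 = 1.1 with SDM1 = 0.4, as in the text.
low, high = practical_certainty_limits(1.1, 0.4)
print(round(low, 1), round(high, 1))  # -0.1 2.3
```

The same function applies to any measure whose SD is known, for instance a D with its SDD.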
Similarly, we can be practically certain that the true D lies between obtained D minus 3 SDD and obtained D plus 3 SDD, or using the data of Table 22, we can be practically certain that the true D is somewhere between 1.9 minus 3(0.8) and 1.9 plus 3(0.8), i.e., between -0.5 and 4.3. Had such definition of limits been more significant than the definition of a point above which the true D lies, i.e., zero, the denominator in the EC formula would have been 3 SDD instead of 2.78 SDD. The 3.0 is reduced to 2.78 because any chance or probability that the true D is above D plus 3 SDD (when D is positive) or below D minus 3 SDD (when D is negative) merely strengthens the conclusion yielded by the experiment. The difference between 3.0 and 2.78 exactly accounts for this probability.

The one-group method is a more convenient method than the equivalent-groups method of solving the experimental problem whose sample data appear in Table 22. But even though the equivalent-groups method be employed, there is a more convenient method of determining D than that shown in Table 22. Both experimental groups could have had their IT1 at one of the EF periods, at, let us say, the period preceding the gymnasium period (EF1). Then the FT1 for Group A could be assumed to be identical with the IT1. This would have made each of C1, M1, SD and SDM1 zero. This would have saved labor and would, in theory, have yielded the identical D obtained by giving the IT1 in a period other than one of the EF periods. But even though the IT1 be made in a non-EF period as shown in Table 22, the same D could have been secured by a single computation, namely, by computing the M of Group A’s FT1 and the M of Group B’s FT1 and by subtracting one M from the other. Experimenters frequently resort to this plan to avoid the necessity of making an IT1. Such an avoidance is not commendable because the experimenter has no right to assume that his two groups are equivalent.
He needs the IT1 to prove their equivalence. If he avoids this criticism by using one group only, where he has a right to assume equivalence, or if he proves the equivalence of his two groups by means of an IT1, but then proceeds to ignore it and work with FT1 only instead of C, he is subject to another criticism. His computations will yield the correct D, but will not permit him to determine the EC or reliability of the D. It will not suffice for him to compute the M, SD, and SDM of the FT1 for each group, and to use these two SDM’s to compute SDD just as the SDM’s of the C’s are used to compute SDD. The SDM of the FT1’s tends as a rule, though not always, to be unduly large and thus tends to make the D appear less reliable than it really is. Some distortion will always occur unless the IT1’s are all zero or all identical in size. It is not legitimate to avoid this final criticism by simply omitting altogether the computation of the reliability of the D, for each experimenter is obligated to report the reliability of his conclusion. In sum, C is required to determine the correct reliability of D, and the obtaining of C presupposes both an IT1 and FT1.

There is a way whereby the correct SDD may be secured without the use of C. The steps in this process follow.

(1) Compute M of initial scores.
(2) Compute M of final scores.
(3) Subtract initial M from final M to get M1.
(4) Compute SD and SDM of initial scores.
(5) Compute SD and SDM of final scores.
(6) Compute SDM1 by means of the following formula.

$$SDM1 = \sqrt{(\text{Initial } SDM)^2 + (\text{Final } SDM)^2 - 2\,r_{\text{initial with final}}\,(SD\ \text{initial})(SD\ \text{final})}$$

Thus the SDM1, computed in this way, is equal to the square root of the following: the square of the SDM of the IT scores, plus the square of the SDM of the FT scores, minus twice the coefficient of correlation between the IT scores and FT scores times the SD of the IT scores times the SD of the FT scores.
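The idea behind this substitute procedure can be checked numerically. The sketch below makes one assumption that should be kept in mind: it applies the correlation relation at the level of the SD's of the scores, and only afterwards divides by the square root of N to reach the SDM of the changes. Under that assumption the SD of the changes built from the two SD's and r agrees with the SD computed directly from the C's. The scores and all function names are hypothetical.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """SD about the mean, dividing by N."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def correlation(u, v):
    """Product-moment coefficient of correlation between paired scores."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)
    return cov / (sd(u) * sd(v))

def sd_of_change_direct(initial, final):
    """SD of the individual changes C = FT - IT."""
    return sd([f - i for i, f in zip(initial, final)])

def sd_of_change_from_parts(initial, final):
    """Same SD rebuilt from the two SD's and r, without forming any C."""
    r = correlation(initial, final)
    s_i, s_f = sd(initial), sd(final)
    return sqrt(s_i ** 2 + s_f ** 2 - 2 * r * s_i * s_f)

# Hypothetical initial and final scores for eight pupils.
initial = [7, 7, 8, 8, 9, 9, 10, 10]
final = [8, 6, 10, 9, 9, 12, 11, 12]

m1 = mean(final) - mean(initial)                 # steps (1) to (3)
direct = sd_of_change_direct(initial, final)
rebuilt = sd_of_change_from_parts(initial, final)
sdm1 = rebuilt / sqrt(len(initial))              # assumed SD / sqrt(N)

print(m1, round(direct, 4), round(rebuilt, 4), round(sdm1, 4))
```

The agreement of `direct` and `rebuilt` is what allows the reliability to be recovered from two frequency distributions and r when the individual C's are not available.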
The procedure is similar for the computation of M2 and SDM2. The use of this thoroughly exact but substitute procedure for determining M1 and SDM1 is seldom advisable. Some time may be saved by its use provided the IT and FT scores have been tabulated previously into two frequency distributions, respectively. If the experimental data are available only in such form, it is impossible to compute C’s. Generally, however, the computation of C not only facilitates the computation of M1 and SDM1 or M2 and SDM2, but it also makes possible a fuller utilization of experimental results in that it shows what sub-group made the larger C’s.

TABLE 23
COMPUTATION MODEL III
Three Equivalent Groups, Three EF’s, One Test Type

Group A, EF1                  Group B, EF2                  Group C, EF3
P  IT1  FT1  C1  x  x²        P  IT1  FT1  C2  x  x²        P  IT1  FT1  C3  x  x²
N            M1     Sx²       N            M2     Sx²       N            M3     Sx²
             AM     SD                     AM     SD                     AM     SD
             c      SDM1                   c      SDM2                   c      SDM3

SUMMARY
          EF1   EF2   EF3   D         SDD                            EC
Test 1    M1    M2          M1 - M2   $\sqrt{(SDM1)^2 + (SDM2)^2}$   D ÷ 2.78 SDD
Test 1    M1          M3    M1 - M3   $\sqrt{(SDM1)^2 + (SDM3)^2}$   D ÷ 2.78 SDD
Test 1          M2    M3    M2 - M3   $\sqrt{(SDM2)^2 + (SDM3)^2}$   D ÷ 2.78 SDD

Recently my attention was attracted to an experiment where some of the pupils had one IT and one FT, whereas others had two or more IT’s and two or more FT’s (as though pupils a, d, and f, say, in Table 22 had three IT and three FT records each). These records were recorded and treated as though they belonged to different individuals. The effect of this is to distort the SD, SDM, and SDD. When more than one record exists for a pupil they should be averaged so that each pupil will have just one IT and one FT for each test.

Computation Model III. Computation model III in Table 23 shows the experimental computations necessary when there are three equivalent groups, three EF’s and one type of test.
If the purpose of the experiment is to determine the relative effectiveness of three EF’s, EF1, EF2, and EF3 will be distinctly different EF’s. If the purpose of the experiment is to determine the absolute effectiveness of EF1 and EF2, then EF3 will be a control EF. It should be understood that in all preceding and succeeding computation models, one of the EF’s must be a control EF whenever knowledge of the absolute effectiveness of one or more of the EF’s is sought.

Table 23 is practically self-explanatory. The two M1’s under EF1 in the Summary are the same M1, and similarly for the two M2’s under EF2 and the M3’s under EF3. The first D and SDD under EC are M1 - M2 and $\sqrt{(SDM1)^2 + (SDM2)^2}$ respectively, and similarly for the second and third formulae under EC. The first D, namely M1 - M2, shows whether EF1 or EF2 is more effective and the first EC shows its reliability. The second D, namely M1 - M3, shows whether EF1 or EF3 is more effective and the second EC shows its reliability, and similarly for the third D and third EC.

By extending computation model III in Table 23 farther to the right, to provide for a Group D with EF4, a Group E with EF5, a Group F with EF6, and so on, the experimenter will have a computation model for any number of groups and EF’s when one test type is used. An extension of the Summary according to the plan exemplified in Table 23 will take care of any number of EF’s.

Computation Model IV. The computation models so far given show how to take care of any number of EF’s when one test type is used. Computation model IV in Table 24 shows how to handle two EF’s and two test types. Table 24 shows that additional test types can be provided for by expanding the original computation model downward, just as additional EF’s were provided for by expanding the original computation model to the right. Note that the second test type is indicated by the numeral 2, and that the two new M’s are labeled M3 and M4.
The D of M1 - M2 shows whether, according to Test 1, EF1 or EF2 is the more effective. The D of M3 - M4 shows whether, according to Test 2, EF1 or EF2 is the more effective. The two EC’s show the reliability of these two D’s.

TABLE 24
COMPUTATION MODEL IV
Two Equivalent Groups, Two EF’s, Two Test Types

Group A, EF1                  Group B, EF2
P  IT1  FT1  C1  x  x²        P  IT1  FT1  C2  x  x²
N            M1     Sx²       N            M2     Sx²
             AM     SD                     AM     SD
             c      SDM1                   c      SDM2
P  IT2  FT2  C3  x  x²        P  IT2  FT2  C4  x  x²
N            M3     Sx²       N            M4     Sx²
             AM     SD                     AM     SD
             c      SDM3                   c      SDM4

SUMMARY
          EF1   EF2   D         SDD                            EC  x  x²       ED  x  x²
Test 1    M1    M2    M1 - M2   $\sqrt{(SDM1)^2 + (SDM2)^2}$   D ÷ 2.78 SDD    D ÷ (M1 or M2)
Test 2    M3    M4    M3 - M4   $\sqrt{(SDM3)^2 + (SDM4)^2}$   D ÷ 2.78 SDD    D ÷ (M3 or M4)
                                                               MEC    Sx²      MED    Sx²
                                                               AM     SD       AM     SD
                                                               c      SDMEC    c      SDMED
                                                               ECMEC           ECMED

Equating of Differences. Table 24 exemplifies a new feature in connection with EC. This new feature requires explanation. Test 1 may favor EF1 by a D of a certain amount, whereas Test 2 may favor EF2 by a D of a certain amount, or perhaps both tests may favor EF1, or again, both tests may favor EF2. At any rate, there is needed some way whereby the two D’s may be combined into a single number which will show whether, both tests considered, EF1 or EF2 is more effective and how much more effective. But the two D’s cannot be averaged just as they stand. To do so might give far more weight to one test than to the other. To make this clear, assume the following situation:

          EF1   EF2   D
Test 1    105   100   5
Test 2     10     5   5

Now, in all probability, these two D’s are far from equal, even though they are numerically the same. The first 5 is, in all probability, a much smaller D than is the second 5. Before they can be combined they need to be equated. The two EC’s are not only indices of the reliability of the two D’s, but they are also at the same time excellent equaters of the two D’s. The EC’s may be averaged.
This has been done and “MEC” or mean EC is the result. Before this averaging is done, the sign of each D should be prefixed to its EC. The MEC is really a mean difference.

The reliability of each of the two D’s is known. The next need is for some way to determine the reliability of the MEC. Such a way is shown in Table 24. SD of the two EC’s and SDMEC or SD of the MEC may be computed just as SDC and SDM1 are computed. In this situation where there are two EC’s the formulae become:

$$SD = \sqrt{\frac{Sx^2}{2} - c^2} \qquad SDMEC = \frac{SD}{\sqrt{2}}$$

The SDMEC is an index of the reliability or trustworthiness of MEC as a true MEC for all the tests from which Test 1 and Test 2 are a random sampling, and, to make the statement complete, for all the pupils from which the experimental pupils are a random sampling.

Just as SDD needed EC for its interpretation, so SDMEC needs an ECMEC for its interpretation. Since, as was pointed out above, MEC is really a D still, and since SDMEC is really an SDD still, the regular EC formula with its customary interpretation may be used. In this situation the formula becomes

$$ECMEC = \frac{MEC}{2.78\,SDMEC}$$

The only difficulty with the use of EC and MEC as a method of equating and combining D’s is the impossibility of making any clear, simple statement as to what an MEC of a given amount means. Therefore the “ED” or equated difference has been devised to provide a more easily interpretable method of equating and combining D’s from two or more test types. While preferable to the MEC from a popular standpoint it is probably less preferable from a technical statistical point of view.

The ED for the first D is M1 - M2 divided by M1 if it is smaller than M2 or by M2 if it is smaller than M1. The ED for the second D is M3 - M4 divided by M3 if it is smaller than M4 or by M4 if it is smaller than M3.
When so computed, the ED tells the per cent of the time the experiment has run that it would take the backward group to catch up with the favored group if the favored group were to stop growing until the other catches up. The ED’s for each of the two D’s of 5, previously given, become, according to the above process, .05 and 1.0 respectively. These ED’s interpreted mean respectively that the EF2 group would catch the EF1 group in Test 1 in .05 of the time the experiment has run, and that the EF2 group would catch the EF1 group in Test 2 in a time exactly equal to the time the experiment has run.

After explaining the computation of MEC and ECMEC, it will not be necessary to rehearse the process for computing MED and ECMED. In computing MED, the sign of the D should be prefixed to its ED. One other caution is needed. It sometimes happens that the smaller of the two M’s is so close to zero that, when it is divided into the D, the resulting ED becomes an exaggerated and unnatural amount. Thus, if the smaller of the two M’s were exactly zero and if the D were not also zero, the ED would become infinity! The reader does not need to be told what this will do to the MED. If this, or anything approaching it, were to happen, the MED could not be used. The use of MEC would be compulsory. Because of this tendency on the part of ED, the experimenter is advised always to prefer the midscore of the ED’s to the MED, wherever it is possible to compute the midscore, i.e., wherever more than two test types have been used. The midscore of the ED’s may be treated as though it were the MED.

The computation of the midscore is exceedingly simple. First arrange the ED’s in order of their size, paying due regard to signs. That ED which is middlemost in size is the midscore. If there is an even number of ED’s and, as a consequence, no middle ED, the mean of the two middlemost ED’s may be taken for the midscore and MED.
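The equating procedures of this section can be sketched as follows. This is a minimal illustration under stated assumptions: the ED function assumes that both M’s are zero or positive, the SD of the EC’s is taken about their own mean (so that c is zero), and the two signed EC’s passed to `combine_ecs` are hypothetical values, not results from the text. All function names are illustrative.

```python
from math import sqrt, inf
from statistics import mean, median

def equated_difference(m_first, m_second):
    """ED = D divided by the smaller M (assumes both M's are >= 0).

    The sign of D carries over to the ED. A smaller M of exactly zero
    gives an infinite ED unless D is also zero.
    """
    d = m_first - m_second
    smaller = min(m_first, m_second)
    if smaller == 0:
        if d == 0:
            return 0.0
        return inf if d > 0 else -inf
    return d / smaller

def combine_ecs(signed_ecs):
    """Return (MEC, SDMEC, ECMEC) for EC's already prefixed with the sign of D."""
    n = len(signed_ecs)
    mec = mean(signed_ecs)
    sd = sqrt(sum((e - mec) ** 2 for e in signed_ecs) / n)
    sdmec = sd / sqrt(n)
    # ECMEC is left undefined (None) when the EC's do not vary at all.
    ecmec = mec / (2.78 * sdmec) if sdmec else None
    return mec, sdmec, ecmec

# The two D's of 5 from the text: 105 vs. 100 and 10 vs. 5.
eds = [equated_difference(105, 100), equated_difference(10, 5)]
med = mean(eds)
midscore = median(eds)   # with an even count: mean of the two middlemost

mec, sdmec, ecmec = combine_ecs([0.9, 0.5])   # hypothetical signed EC's

print(eds, med, midscore)
print(round(mec, 2), round(sdmec, 3), round(ecmec, 2))
```

Because `median` ignores how extreme the outer values are, a single exaggerated ED does not disturb the midscore once three or more test types are available, which is the reason the text gives for preferring it.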
There is no obligation upon the experimenter to give equal weight to each test always. Because of a given test’s greater reliability, because it is more symptomatic of the entire objects of instruction, or for some other reason, the experimenter may desire to weight it more heavily than any other test used. Once the D’s have been equated, weighting becomes a simple matter of multiplying the EC or ED by the weight desired, before averaging. Thus, if there are three tests to be averaged, and if it is desired to weight the tests, in order, 3, 1, and 2, the experimenter should multiply the first EC or ED by 3, the second by 1, and the third by 2. Then he should add the products and divide by 3 plus 1 plus 2, i.e., 6.

Illustration of Computation Model IV. The foregoing discussion of computation model IV will be clarified by the use of sample data. Such data appear in Table 25, where we shall assume the experimental problem to be this: Which is more effective in developing reading (Test 1) and the fundamentals of arithmetic (Test 2), three class periods per week of fifty minutes each (EF1) or five class periods per week of thirty minutes each (EF2)? Here we have a problem with two EF’s and two test types, requiring the

[Table 25 and the pages that follow it are illegible in the source. Table 29, discussed under Computation Model VII below, is the first item that can be recovered.]

TABLE 29
COMPUTATION MODEL VII
Two Equivalent Groups, Two EF’s, Two Test Types, One Intermediate Test

Group A, EF1                              Group B, EF2
P  IT1  INT1  FT1  C1    C2    C3         P  IT1  INT1  FT1  C4     C5     C6
N                  M1    M2    M3         N                  M4     M5     M6
                   SDM1  SDM2  SDM3                          SDM4   SDM5   SDM6
P  IT2  INT2  FT2  C7    C8    C9         P  IT2  INT2  FT2  C10    C11    C12
N                  M7    M8    M9         N                  M10    M11    M12
                   SDM7  SDM8  SDM9                          SDM10  SDM11  SDM12

SUMMARY
Initial to Intermediate
          EF1   EF2   D          SDD                             EC             ED
Test 1    M1    M4    M1 - M4    $\sqrt{(SDM1)^2 + (SDM4)^2}$    D ÷ 2.78 SDD   D ÷ (M1 or M4)
Test 2    M7    M10   M7 - M10   $\sqrt{(SDM7)^2 + (SDM10)^2}$   D ÷ 2.78 SDD   D ÷ (M7 or M10)
                                                                 MEC            MED
                                                                 ECMEC          ECMED
Intermediate to Final
          EF1   EF2   D          SDD                             EC             ED
Test 1    M2    M5    M2 - M5    $\sqrt{(SDM2)^2 + (SDM5)^2}$    D ÷ 2.78 SDD   D ÷ (M2 or M5)

[The remainder of the Summary is illegible in the source.]

treated together, the M of all the C3’s and C7’s, the M of all C2’s and C6’s, and the M of all the C4’s and C8’s. This will entail for each M so computed an appropriate series of x’s, x²’s, Sx²’s, SD’s, and SDM’s and a “Grade III and Grade IV” section in the “Summary.”

A good illustration of the value of being alert for the sub-groups is afforded by an experiment conducted by Eliza F. Ogglesby of Detroit upon 350 experimental and 350 control first-grade pupils. The purpose of the experiment was to discover whether a new reading book she had prepared especially for slow pupils was superior to one previously in use, and, if so, whether it was better for dull pupils than for normal pupils or bright pupils.
Miss Ogglesby has furnished the author with the summary of her experiment. This is shown in Table 28. There were 100, 150, and 100 pupils in each of the bright, normal, and dull groups, respectively. EF1 is the new book, EF2 is the usual book. The data show that the new book is superior to the old by 0.65 points for the bright group, 0.91 points for the normal group, and 2.44 points for the dull group. This suggests that it is an advantage to make books adapted to these different levels of capacity.

Computation Model VII. Another common form of experimentation is one where there is for each group an initial test, one or more intermediate tests, and a final test. In an experiment extending over a school year it is frequently desirable to give an intermediate test at the end of the first semester. This tends to strengthen the experiment and fortify the conclusions.

Computation model VII in Table 29 shows how to treat an experiment of two equivalent groups, two EF’s, two test types, and an intermediate test for each test type. By a horizontal and vertical extension of this table provision could be made, respectively, for more EF’s or intermediate tests, and more test types.

In Table 29, the usual form has been somewhat abbreviated to save space. C1 is the change from IT1 to INT1.
C2 is the change from INT1 to FT1. C3 is the change from IT1 to FT1, and similarly throughout the table. The AM, c, x, x², Sx², and SD involved in the computation of SDM1 are omitted. The same omission occurs in the case of SDM2, SDM3, SDM4, and so on.

Computation Model VIII. Computation model VIII, shown in Table 30, is a sort of composite computation model or a sort of summary of all the models which have preceded. It illustrates an experiment where there are three EF’s, three sub-groups, three test types, and one intermediate test. This computation model embraces practically all the difficulties in computation ever presented by a regular equivalent-groups experiment. How to handle certain rare forms of the equivalent-groups experiment is considered at the end of the next chapter.

TABLE 30
COMPUTATION MODEL VIII
Three Equivalent Groups with Three Sub-groups, Three EF’s, Three Test Types, One Intermediate Test

Rural Pupils, EF1                       Rural Pupils, EF2                       Rural Pupils, EF3
P  IT1  INT1  FT1  C1     C2     C3     P  IT1  INT1  FT1  C4     C5     C6     P  IT1  INT1  FT1  C7     C8     C9
N                  M1     M2     M3     N                  M4     M5     M6     N                  M7     M8     M9
                   SDM1   SDM2   SDM3                      SDM4   SDM5   SDM6                      SDM7   SDM8   SDM9
P  IT2  INT2  FT2  C10    C11    C12    P  IT2  INT2  FT2  C13    C14    C15    P  IT2  INT2  FT2  C16    C17    C18
N                  M10    M11    M12    N                  M13    M14    M15    N                  M16    M17    M18
                   SDM10  SDM11  SDM12                     SDM13  SDM14  SDM15                     SDM16  SDM17  SDM18
P  IT3  INT3  FT3  C19    C20    C21    P  IT3  INT3  FT3  C22    C23    C24    P  IT3  INT3  FT3  C25    C26    C27
N                  M19    M20    M21    N                  M22    M23    M24    N                  M25    M26    M27
                   SDM19  SDM20  SDM21                     SDM22  SDM23  SDM24                     SDM25  SDM26  SDM27

Suburban Pupils, EF1                    Suburban Pupils, EF2                    Suburban Pupils, EF3
P  IT1  INT1  FT1  C28    C29    C30    P  IT1  INT1  FT1  C31    C32    C33    P  IT1  INT1  FT1  C34    C35    C36
N                  M28    M29    M30    N                  M31    M32    M33    N                  M34    M35    M36
                   SDM28  SDM29  SDM30                     SDM31  SDM32  SDM33                     SDM34  SDM35  SDM36

[The remaining rows of the body of this table are illegible in the source.]
TABLE 30
SUMMARY

Rural Pupils, Initial Test to Intermediate Test
          EF1   EF2   D           SDD   EC      ED
Test 1    M1    M4    M1 - M4     SDD   EC      ED
Test 2    M10   M13   M10 - M13   SDD   EC      ED
Test 3    M19   M22   M19 - M22   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF1   EF3   D           SDD   EC      ED
Test 1    M1    M7    M1 - M7     SDD   EC      ED
Test 2    M10   M16   M10 - M16   SDD   EC      ED
Test 3    M19   M25   M19 - M25   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF2   EF3   D           SDD   EC      ED
Test 1    M4    M7    M4 - M7     SDD   EC      ED
Test 2    M13   M16   M13 - M16   SDD   EC      ED
Test 3    M22   M25   M22 - M25   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED

Rural Pupils, Intermediate Test to Final Test
          EF1   EF2   D           SDD   EC      ED
Test 1    M2    M5    M2 - M5     SDD   EC      ED
Test 2    M11   M14   M11 - M14   SDD   EC      ED
Test 3    M20   M23   M20 - M23   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF1   EF3   D           SDD   EC      ED
Test 1    M2    M8    M2 - M8     SDD   EC      ED
Test 2    M11   M17   M11 - M17   SDD   EC      ED
Test 3    M20   M26   M20 - M26   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF2   EF3   D           SDD   EC      ED
Test 1    M5    M8    M5 - M8     SDD   EC      ED
Test 2    M14   M17   M14 - M17   SDD   EC      ED
Test 3    M23   M26   M23 - M26   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED

Rural Pupils, Initial Test to Final Test
          EF1   EF2   D           SDD   EC      ED
Test 1    M3    M6    M3 - M6     SDD   EC      ED
Test 2    M12   M15   M12 - M15   SDD   EC      ED
Test 3    M21   M24   M21 - M24   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF1   EF3   D           SDD   EC      ED
Test 1    M3    M9    M3 - M9     SDD   EC      ED
Test 2    M12   M18   M12 - M18   SDD   EC      ED
Test 3    M21   M27   M21 - M27   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF2   EF3   D           SDD   EC      ED
Test 1    M6    M9    M6 - M9     SDD   EC      ED
Test 2    M15   M18   M15 - M18   SDD   EC      ED
Test 3    M24   M27   M24 - M27   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED

Suburban Pupils, Initial Test to Intermediate Test
          EF1   EF2   D           SDD   EC      ED
Test 1    M28   M31   M28 - M31   SDD   EC      ED
Test 2    M37   M40   M37 - M40   SDD   EC      ED
Test 3    M46   M49   M46 - M49   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF1   EF3   D           SDD   EC      ED
Test 1    M28   M34   M28 - M34   SDD   EC      ED
Test 2    M37   M43   M37 - M43   SDD   EC      ED
Test 3    M46   M52   M46 - M52   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF2   EF3   D           SDD   EC      ED
Test 1    M31   M34   M31 - M34   SDD   EC      ED
Test 2    M40   M43   M40 - M43   SDD   EC      ED
Test 3    M49   M52   M49 - M52   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED

Suburban Pupils, Intermediate Test to Final Test
          EF1   EF2   D           SDD   EC      ED
Test 1    M29   M32   M29 - M32   SDD   EC      ED
Test 2    M38   M41   M38 - M41   SDD   EC      ED
Test 3    M47   M50   M47 - M50   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF1   EF3   D           SDD   EC      ED
Test 1    M29   M35   M29 - M35   SDD   EC      ED
Test 2    M38   M44   M38 - M44   SDD   EC      ED
Test 3    M47   M53   M47 - M53   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF2   EF3   D           SDD   EC      ED
Test 1    M32   M35   M32 - M35   SDD   EC      ED
Test 2    M41   M44   M41 - M44   SDD   EC      ED
Test 3    M50   M53   M50 - M53   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED

Suburban Pupils, Initial Test to Final Test
          EF1   EF2   D           SDD   EC      ED
Test 1    M30   M33   M30 - M33   SDD   EC      ED
Test 2    M39   M42   M39 - M42   SDD   EC      ED
Test 3    M48   M51   M48 - M51   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF1   EF3   D           SDD   EC      ED
Test 1    M30   M36   M30 - M36   SDD   EC      ED
Test 2    M39   M45   M39 - M45   SDD   EC      ED
Test 3    M48   M54   M48 - M54   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF2   EF3   D           SDD   EC      ED
Test 1    M33   M36   M33 - M36   SDD   EC      ED
Test 2    M42   M45   M42 - M45   SDD   EC      ED
Test 3    M51   M54   M51 - M54   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED

Urban Pupils, Initial Test to Intermediate Test
          EF1   EF2   D           SDD   EC      ED
Test 1    M55   M58   M55 - M58   SDD   EC      ED
Test 2    M64   M67   M64 - M67   SDD   EC      ED
Test 3    M73   M76   M73 - M76   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF1   EF3   D           SDD   EC      ED
Test 1    M55   M61   M55 - M61   SDD   EC      ED
Test 2    M64   M70   M64 - M70   SDD   EC      ED
Test 3    M73   M79   M73 - M79   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF2   EF3   D           SDD   EC      ED
Test 1    M58   M61   M58 - M61   SDD   EC      ED
Test 2    M67   M70   M67 - M70   SDD   EC      ED
Test 3    M76   M79   M76 - M79   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED

Urban Pupils, Intermediate Test to Final Test
          EF1   EF2   D           SDD   EC      ED
Test 1    M56   M59   M56 - M59   SDD   EC      ED
Test 2    M65   M68   M65 - M68   SDD   EC      ED
Test 3    M74   M77   M74 - M77   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF1   EF3   D           SDD   EC      ED
Test 1    M56   M62   M56 - M62   SDD   EC      ED
Test 2    M65   M71   M65 - M71   SDD   EC      ED
Test 3    M74   M80   M74 - M80   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF2   EF3   D           SDD   EC      ED
Test 1    M59   M62   M59 - M62   SDD   EC      ED
Test 2    M68   M71   M68 - M71   SDD   EC      ED
Test 3    M77   M80   M77 - M80   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED

Urban Pupils, Initial Test to Final Test
          EF1   EF2   D           SDD   EC      ED
Test 1    M57   M60   M57 - M60   SDD   EC      ED
Test 2    M66   M69   M66 - M69   SDD   EC      ED
Test 3    M75   M78   M75 - M78   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF1   EF3   D           SDD   EC      ED
Test 1    M57   M63   M57 - M63   SDD   EC      ED
Test 2    M66   M72   M66 - M72   SDD   EC      ED
Test 3    M75   M81   M75 - M81   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED
          EF2   EF3   D           SDD   EC      ED
Test 1    M60   M63   M60 - M63   SDD   EC      ED
Test 2    M69   M72   M69 - M72   SDD   EC      ED
Test 3    M78   M81   M78 - M81   SDD   EC      ED
                                        MEC     MED
                                        ECMEC   ECMED

CHAPTER VIII
COMPUTATIONS FOR THE ROTATION EXPERIMENTAL METHOD

Computation Model IX. The nature and functions of the rotation experimental method were discussed in Chapter II. It remains to illustrate the statistical computations necessary to yield the conclusion from a rotation experiment, together with the reliability of the conclusion.

Computation model IX is for the simplest type of rotation experiment, namely, two groups which may or may not be equivalent, two EF’s, and one type of test.

TABLE 31
COMPUTATION MODEL IX, ROTATION METHOD
Two Groups, Two EF’s, One Test Type

Group A, EF1              Group B, EF2
P  IT1  FT1  C1           P  IT1  FT1  C2
N            M1           N            M2
             SDM1                      SDM2

Group A, EF2              Group B, EF1
P  IT1  FT1  C3           P  IT1  FT1  C4
N            M3           N            M4
             SDM3                      SDM4

SUMMARY
          EF1       SDS1                           EF2       SDS2
Test 1    M1 + M4   $\sqrt{(SDM1)^2 + (SDM4)^2}$   M2 + M3   $\sqrt{(SDM2)^2 + (SDM3)^2}$

          D                       SDD                            EC
          (M1 + M4) - (M2 + M3)   $\sqrt{(SDS1)^2 + (SDS2)^2}$   D ÷ 2.78 SDD

The first point to note in computation model IX, in Table 31, is that Group A has EF1 applied to it first and EF2 applied second, whereas the EF’s are applied to Group B in the reverse order. Since both EF1 and EF2 appear first and second, any advantage of order is rotated out. According to the computation model, Group A experiences in order IT1, EF1, FT1, IT1 again, EF2, and FT1 again.
This does not mean that the second IT1 and FT1 will yield identical scores with those yielded by the first IT1 and FT1, respectively. It does not even mean that the identical testing instrument must be employed. It means merely that the same general mental function is usually tested in both instances. In rare cases, however, the similarity between the mental functions tested is slight or non-existent.

Sample problems will make clear the various possible degrees of similarity between the first and second pair of tests. Assume EF1 to be a high per cent of re-circulated air for a classroom, and EF2 to be a continuous supply of wholly fresh air. Assume that each EF operates one semester. The first IT1 for Group A might be a test of general reading ability. The first FT1 could be the identical testing instrument, a duplicate test of reading ability, or some other test of general reading ability. It must measure the same trait as the IT1. The second IT1 for Group A could be the same test as that already used, or a duplicate test, or another test of general reading ability, or a test of a similar mental function, say a vocabulary test, or a totally different sort of test, say, a test of fundamentals of arithmetic. The second FT1 must test the same trait as its IT1. Furthermore, the same tests used for Group A with EF1 and EF2 must be used for Group B with EF2 and EF1, respectively. This will prevent penalizing either EF since each EF will have both varieties of tests.

Consider another sample problem. Assume EF1 to be motion-picture presentation of a lesson, and EF2 to be teacher presentation. The subject of the motion picture might be the geography of Alaska. This would require the first IT1 and FT1 to be constructed of Alaskan content. But the teacher could not well use the identical topic and identical tests a second time. The carry-over would be altogether too large.
She could choose, instead, say, the geography of Hawaii. This topic would require that the second IT1 and FT1 have a Hawaiian content. In Group B the order of topics would have to be reversed so that EF2 would secure any advantages or disadvantages of the Alaskan topic and tests, and EF1 any advantages or disadvantages of the Hawaiian topic and tests.

Both the first and second IT’s for both Group A and Group B are often not applied in rotation experiments. In case Alaska and Hawaii are known to be new to the pupils, and if, in addition, the test questions are so highly specific that they could not be answered from general information about the geography of places other than Alaska and Hawaii, the experimenter frequently assumes that the pupils’ knowledge is zero and so records it without testing. Even when such an assumption introduces a slight error, it is sometimes an advantage to accept the error and omit applying the IT’s. Sometimes it is an advantage to keep pupils ignorant of that upon which they are to be tested until the EF1 has been applied. The IT1 prevents such concealment unless a duplicate test is available.

There is a special situation where the second IT1’s for both Group A and Group B are not applied. If EF2 for Group A follows EF1 immediately, and if EF1 for Group B follows EF2 immediately, and if, in addition, the identical or equivalent test used for the first FT1 is to be used for the second IT1, then the scores made on the first FT1 may be assumed to be identical with those which would result from giving the test again as IT1.

As shown by the Summary, the total C produced by EF1 is M1 + M4. The C produced in Group A by EF1 is M1. That produced in Group B by EF1 is M4. The sum of these gives the C produced in both groups by EF1. In like manner, the total C produced by EF2 in both groups is M2 + M3. The D between EF1 and EF2 becomes, then, (M1 + M4) - (M2 + M3).
To compute the SDD of this last quantity requires us to know the reliability of its two components M1 + M4 and M2 + M3. From a knowledge of the reliability of M1 and M4 it is possible to compute the reliability of their sum, i.e., it is possible to compute SD of the sum, or SDS or SDS1. As shown in the table, the formula for computing the reliability of the sum of the two M’s is just like the formula for computing the reliability of the difference between two M’s. All preceding computation models have made this latter formula familiar to the reader. Once the SDS1 and SDS2 have been computed SDD and EC are readily determined, as shown. The more detailed formula for EC may be written thus:

$EC = \dfrac{(M1 + M4) - (M2 + M3)}{2.78\,\sqrt{(SDS1)^2 + (SDS2)^2}}$

Reliability Computations in Special Situations. It was stated in the preceding paragraph that the formula for the reliability of a sum is identical with the formula for the reliability of a difference. In the short form in which these formulas are usually used and commonly published, they are alike. The complete, long formulas, as given below, are not identical.

$SDD = \sqrt{(SDM1)^2 + (SDM2)^2 - 2\,r_{12}\,(SD1)(SD2)}$

$SDS = \sqrt{(SDM1)^2 + (SDM2)^2 + 2\,r_{12}\,(SD1)(SD2)}$

When the sum of three numbers is involved the formula becomes:

$SDS = \sqrt{(SDM1)^2 + (SDM2)^2 + (SDM3)^2 + 2\,r_{12}\,(SD1)(SD2) + 2\,r_{13}\,(SD1)(SD3) + 2\,r_{23}\,(SD2)(SD3)}$

In the preceding chapter, the reader was shown how M1 could be computed by getting the difference between the M of the IT and the M of the FT, and how the SDM1 could be computed by a formula which utilized the SDM of the IT, SDM of the FT, the coefficient of correlation between IT and FT, SD of IT, and SD of FT. The M1, so computed, is really a D, and the SDM1 is really an SDD. Consequently the above formula for SDD is identical in form with the SDM1 formula just referred to.
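The difference between the long and short forms can be checked numerically. The following is a minimal sketch, not the author's procedure: it uses four hypothetical pupils who each gain exactly 5 points (the same figures the text tabulates a little further on), and it assumes that the product term in the long formula is taken over the standard errors of the two means, which is the reading under which the long form reproduces the SDM obtained directly from the individual changes. The helper names are illustrative only.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def sd(xs):
    """Standard deviation about the mean, dividing by N."""
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def corr(xs, ys):
    """Pearson coefficient of correlation between two paired lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sd(xs) * sd(ys))

def sdm(xs):
    """Standard deviation of the mean, SD / sqrt(N) (the large-N form)."""
    return sd(xs) / sqrt(len(xs))

def sd_of_difference(sdm_a, sdm_b, r=0.0):
    """Long formula for SDD; with r = 0 it reduces to the short formula.
    Assumption: the correlation term uses the two standard errors."""
    variance = sdm_a ** 2 + sdm_b ** 2 - 2 * r * sdm_a * sdm_b
    return sqrt(max(variance, 0.0))  # guard against tiny negative rounding

def sd_of_sum(sdm_a, sdm_b, r=0.0):
    """Long formula for SDS; same as above but with the sign reversed."""
    return sqrt(sdm_a ** 2 + sdm_b ** 2 + 2 * r * sdm_a * sdm_b)

# Hypothetical initial-test and final-test scores: every pupil gains 5.
it = [10, 12, 14, 16]
ft = [15, 17, 19, 21]
changes = [f - i for i, f in zip(it, ft)]

r = corr(it, ft)
direct = sdm(changes)                                   # from the C's themselves
long_form = sd_of_difference(sdm(it), sdm(ft), r)       # correlation included
short_form = sd_of_difference(sdm(it), sdm(ft))         # correlation ignored

print(f"r = {r:.3f}")
print(f"SDM from individual changes: {direct:.3f}")
print(f"long formula:                {long_form:.3f}")
print(f"short formula:               {short_form:.3f}")
```

With perfectly correlated IT and FT scores the long form agrees with the direct computation (essentially zero), while the short form reports a spuriously large value, which is the exaggeration the text warns against.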
Just as it is possible to determine M1 by subtracting M of the IT from M of FT, so it is possible to compute MS by adding M of IT and M of FT. If this were needed for some purpose and actually done, the SDMS formula would be identical with the SDS formula given above.

In the SDS1 formula given in Table 31 it is permissible to omit the r12(SD1)(SD2) portion of the formula because the coefficient of correlation between the C1’s and C4’s may be assumed to be zero, since the pairing of each C1 with some C4 would be by chance, and similarly for the SDS2 formula. But in computing the SDM1 or SDMS mentioned above, an assumption of zero correlation between IT and FT is not permissible. It is far more probable that some correlation will exist. To ignore the last portion of the formula might lead to a grossly exaggerated SDM1 or SDMS. How this exaggeration may occur is shown by the following data. Obviously the M1 and SDM1 computed through C1 are 5 and zero, respectively. Computed through M of IT and M of FT, the M1 likewise comes out 5. Computed through M of IT and M of FT, SDM1 comes out zero, provided r12(SD1)(SD2) are utilized in its computation.

Pupil   IT1   FT1   C1
a       10    15    5
b       12    17    5
c       14    19    5
d       16    21    5
M       13    18    M1 = 5
                    SDM1 = 0

In computing any SDD or SDS, then, the short form of the reliability formula may be employed provided the elements that enter into the formula are uncorrelated, or are relatively uncorrelated. The SDD in Table 31 may be computed by means of the short formula because the C1’s and C2’s come from different groups and hence their correlation may be assumed to be zero. The SDD in the one-group experiment shown in Table 20 has been computed with the short formula, because the C1’s and C2’s do not appear to be at all closely correlated. Usually, however, such correlation is more in evidence, due to the fact that the brighter pupils tend to have larger C’s under all EF’s. The one-group method is peculiarly liable to manifest such correlation, and hence with it the SDD should usually be computed by the long formula.

The formula for the computation of SDM as illustrated in all the computation models is appropriate only when N exceeds 30. When N is less than 10 compute SDM thus: [formula not legible in the source scan]. When N is between 10 and 20, compute SDM thus: [formula not legible in the source scan]. When N is between 20 and 30, compute SDM thus: [formula not legible in the source scan]. When N is above 30, compute SDM thus:

$SDM = \dfrac{SD}{\sqrt{N}}$

The last formula is used in all computation models and illustrations of such models, irrespective of the number of pupils, because most actual experiments will employ 30 or more cases and because the sample data given merely typify a much larger amount of data.

TABLE 32
ILLUSTRATING COMPUTATION MODEL IX - ROTATION METHOD
Two Groups - Two EF’s - One Test Type

[The pupil-by-pupil entries of this table are not recoverable from the source scan. Its Summary row reads: EF1 = 6.0, SDS1 = 1.0, EF2 = 2.3, SDS2 = 1.5, D = 3.7, SDD = 1.8, EC = 0.7.]

Illustration of Computation Model IX. Since computation model IX is the basic rotation-experiment model out of which all other rotation models will be constructed, it had better be illustrated with sample data.
Assume the problem to be the relative mental effectiveness of recirculated air (EF1) vs. fresh air (EF2). Assume the test used to determine this relative effectiveness to be a reading test. The necessary computations are shown in Table 32.

Only the Summary in Table 32 needs explanation. The EF1 is 3.5 plus 2.5, i.e., 6.0. SDS1 is the $\sqrt{(0.6)^2 + (0.8)^2}$, i.e., 1.0. EF2 is 1.0 plus 1.3, i.e., 2.3. SDS2 is the $\sqrt{(1.1)^2 + (1.0)^2}$, i.e., 1.5. D is 6.0 minus 2.3, i.e., 3.7. SDD is the $\sqrt{(1.0)^2 + (1.5)^2}$, i.e., 1.8. EC is 3.7 divided by 2.78 times 1.8, i.e., 0.7. The conclusion from this experiment is shown by D, which tells us that recirculated air is better than fresh air by 3.7 points for the reading development of pupils used in this experiment and for all those from whom these pupils are a random sampling. But we can be only 0.7 practically certain that this conclusion is true for the larger group.

The data of Table 32 are artificial and inadequate. This experiment was actually conducted by Thorndike and McCall under the auspices of the Ventilation Commission of New York. The EF’s, as here, were washed recirculated air and fresh air. All other conditions of temperature, humidity, and the like were kept constant. Group A was a group of 44 typical sixth-grade public-school pupils. Group B was another similar group of 44 pupils. The two teachers divided the work and both taught both groups. At the middle of the year the EF’s were rotated, as shown in Table 32. A large number of mental and educational tests were used, as were the teachers’ marks. The conclusion from the actual experiment also favored the recirculated air. The experiment was repeated a year later by Thorndike and Ruger. The second experiment verified the first. These experiments are described in School and Society for May 6 and August 12, 1916.
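The Summary arithmetic for computation model IX can be sketched in a few lines. This is an illustrative sketch only: the function and variable names are hypothetical, the short (zero-correlation) formulas are used throughout as in Table 31, and the cell values are those read from the worked Summary above (the pairing of each M with a particular SDM, and the split of the EF2 total into 1.0 and 1.3, are our reading of the printed figures).

```python
from math import sqrt

EC_DIVISOR = 2.78  # the constant used in the text's EC formula

def sd_of_sum(sdms):
    """Short formula: SD of a sum of uncorrelated means."""
    return sqrt(sum(s ** 2 for s in sdms))

def rotation_summary(ef1_cells, ef2_cells):
    """Each argument is a list of (M, SDM) pairs, one pair per group
    that received that EF (model IX has two groups per EF)."""
    ef1_total = sum(m for m, _ in ef1_cells)
    ef2_total = sum(m for m, _ in ef2_cells)
    sds1 = sd_of_sum([s for _, s in ef1_cells])
    sds2 = sd_of_sum([s for _, s in ef2_cells])
    d = ef1_total - ef2_total
    sdd = sqrt(sds1 ** 2 + sds2 ** 2)
    return {
        "EF1": ef1_total, "SDS1": sds1,
        "EF2": ef2_total, "SDS2": sds2,
        "D": d, "SDD": sdd, "EC": d / (EC_DIVISOR * sdd),
    }

# (M, SDM) for the two groups under each EF, as read from the text.
summary = rotation_summary(
    ef1_cells=[(3.5, 0.6), (2.5, 0.8)],   # M1, M4
    ef2_cells=[(1.0, 1.1), (1.3, 1.0)],   # M2, M3
)

for name, value in summary.items():
    print(f"{name:5s} {value:.1f}")
```

Because the sketch carries unrounded intermediate values, its SDD and EC may differ from the printed figures in the second decimal place, though they agree to the one decimal the text reports.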
TABLE 33
COMPUTATION MODEL X - ROTATION METHOD
Three Groups - Three EF’s - One Test Type

Group A - EF1            Group B - EF2            Group C - EF3
P   IT1   FT1   C1       P   IT1   FT1   C2       P   IT1   FT1   C3
N               M1       N               M2       N               M3
                SDM1                     SDM2                     SDM3

Group A - EF2            Group B - EF3            Group C - EF1
P   IT1   FT1   C4       P   IT1   FT1   C5       P   IT1   FT1   C6
N               M4       N               M5       N               M6
                SDM4                     SDM5                     SDM6

Group A - EF3            Group B - EF1            Group C - EF2
P   IT1   FT1   C7       P   IT1   FT1   C8       P   IT1   FT1   C9
N               M7       N               M8       N               M9
                SDM7                     SDM8                     SDM9

SUMMARY
Test 1   EF1: M1 + M6 + M8   SDS1   EF2: M2 + M4 + M9   SDS2   D: (M1 + M6 + M8) - (M2 + M4 + M9)   SDD   EC
Test 1   EF1: M1 + M6 + M8   SDS1   EF3: M3 + M5 + M7   SDS3   D: (M1 + M6 + M8) - (M3 + M5 + M7)   SDD   EC
Test 1   EF2: M2 + M4 + M9   SDS2   EF3: M3 + M5 + M7   SDS3   D: (M2 + M4 + M9) - (M3 + M5 + M7)   SDD   EC

Computation Model X. The purpose of presenting computation model X, shown in Table 33, is to indicate the computations needed with the rotation method when there are three EF’s, and, consequently, three groups, and one type of test. By an appropriate extension to the right and downward, computation model X may be adapted for any number of EF’s.

The computation of the SDS’s in Table 33 requires explanation. The formula for the computation of SDS1 is as follows:

$SDS1 = \sqrt{(SDM1)^2 + (SDM6)^2 + (SDM8)^2}$

SDS2 and SDS3 were computed in similar manner.

In Chapter II, it was stated that the object of the rotation experimental method may be to determine the relative effectiveness of two or more EF’s. If this is the object of the experiment, the three EF’s will be distinctly different EF’s. If, however, the object is to determine the absolute effectiveness of EF1 and EF2 as well as their relative effectiveness, EF3 must be the mere absence of EF1 and EF2, thereby showing the normal change produced during the experiment by general conditions other than EF1 or EF2.
In this case, the first D in Table 33 shows the relative effectiveness of EF1 and EF2. The second D shows the absolute change produced by EF1. The third D shows the absolute change produced by EF2.

In none of the computation models has provision been made for delayed tests as was done, say, for intermediate tests. It frequently happens that an experimenter wishes to determine whether the effect of some favorable EF will persist. It is conceivable that EF1 may be superior to EF2 immediately after they have been applied, but that the superiority will disappear, or actually turn into an inferiority after a month, say, has elapsed. Repetition of the tests a month after the FT’s were made will show what effect time has had. No special computation model needs to be provided. The regular IT’s will serve as the IT’s for the delayed test, and the delayed test becomes the FT. From this point the computations reproduce the process for the regular IT and FT. The final D shows the difference between two EF’s plus a defined interval.

Computation Model XI. Computation model XI shows how the computations may be made when two test types are used. By extending this model downward, provision can be made for any number of test types.

TABLE 34
COMPUTATION MODEL XI - ROTATION METHOD
Two Groups - Two EF’s - Two Test Types

Group A - EF1                 Group B - EF2
P    IT1   FT1   C1           P    IT1   FT1   C2
N                M1           N                M2
                 SDM1                          SDM2
P    IT2   FT2   C3           P    IT2   FT2   C4
N                M3           N                M4
                 SDM3                          SDM4

Group A - EF2                 Group B - EF1
P    IT1   FT1   C5           P    IT1   FT1   C6
N                M5           N                M6
                 SDM5                          SDM6
P    IT2   FT2   C7           P    IT2   FT2   C8
N                M7           N                M8
                 SDM7                          SDM8

SUMMARY
         EF1       SDS1   EF2       SDS2   D                       SDD   EC   ED
Test 1   M1 + M6   SDS1   M2 + M5   SDS2   (M1 + M6) - (M2 + M5)   SDD   EC   ED
Test 2   M3 + M8   SDS1   M4 + M7   SDS2   (M3 + M8) - (M4 + M7)   SDD   EC   ED
                                                                   MEC   MED   ECMEC   ECMED
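The extension of the Summary to any number of EF’s, as in computation model X, can be sketched as follows. This is only a sketch under stated assumptions: the data are hypothetical figures invented for illustration, the names `summarize` and `cells_by_ef` are not from the text, and the short formulas are used on the assumption that the C’s of different groups are uncorrelated.

```python
from itertools import combinations
from math import sqrt

EC_DIVISOR = 2.78  # constant from the text's EC formula

def summarize(cells_by_ef):
    """cells_by_ef maps an EF label to a list of (M, SDM) pairs,
    one pair for each group that experienced that EF.
    Returns per-EF (total, SDS) and one comparison row per pair of EF's."""
    totals = {}
    for ef, cells in cells_by_ef.items():
        total = sum(m for m, _ in cells)
        sds = sqrt(sum(s ** 2 for _, s in cells))
        totals[ef] = (total, sds)

    rows = []
    for a, b in combinations(totals, 2):  # EF1-EF2, EF1-EF3, EF2-EF3
        d = totals[a][0] - totals[b][0]
        sdd = sqrt(totals[a][1] ** 2 + totals[b][1] ** 2)
        rows.append((a, b, d, sdd, d / (EC_DIVISOR * sdd)))
    return totals, rows

# Hypothetical (M, SDM) cells; EF3 stands for the absence of EF1 and EF2.
cells_by_ef = {
    "EF1": [(4.0, 0.3), (3.0, 0.4), (5.0, 1.2)],   # M1, M6, M8
    "EF2": [(2.0, 0.2), (3.0, 0.3), (1.0, 0.6)],   # M2, M4, M9
    "EF3": [(1.0, 0.4), (0.5, 0.4), (1.5, 0.7)],   # M3, M5, M7
}

totals, rows = summarize(cells_by_ef)
for a, b, d, sdd, ec in rows:
    print(f"{a} vs {b}: D = {d:.2f}, SDD = {sdd:.2f}, EC = {ec:.2f}")
```

When EF3 is a mere control condition, the second and third rows correspond to the absolute changes attributed to EF1 and EF2, and the first row to their relative effectiveness.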
Computation models IX, X, and XI make it clear that computations for rotation experiments are similar fundamentally to computations for one-group and equivalent-groups methods. With this knowledge, the reader who has mastered the eleven computation models presented will have little difficulty in evolving for himself rotation computation models for any number of EF’s, groups, sub-groups, test types, and intermediate tests.

Scaling Experimental Tests. A few pages back it was pointed out that the first IT1’s are not always the same tests as or similar tests to the second IT1’s. Yet all this somewhat incomparable data can be combined, and this combination can be combined, in turn, with an equal mixture of rather incomparable data from the IT2’s, provided each test is scaled in comparable units. It is impossible to construct a geography test, say, on Alaska which will be just as difficult as one with a Hawaiian content. Furthermore, it is seldom feasible to scale all the tests to be used in advance of and independently of the experiment itself, so as to have comparability of measuring units throughout.

While conducting some rotation experiments to determine the relative effectiveness of some visual aids, Weber met just this situation, and overcame it economically by using his own experimental data as a basis for scaling the experimental tests. Tests so scaled, while not absolutely required, do add a substantial refinement to experimental computations. The following gives the general plan¹ of one of Weber’s experiments.

1 Weber, J. J., Comparative Effectiveness of Some Visual Aids in Elementary Education (to be published soon).

Unit I. India
  L-R   Lecture 25 minutes, Review quiz 12 minutes   Group A
  F-L   Film 12 minutes, Lecture 25 minutes          Group B
  L-F   Lecture 25 minutes, Film 12 minutes          Group C

Unit II. China
  L-R   Lecture 25 minutes, Review quiz 12 minutes   Group C
  F-L   Film 12 minutes, Lecture 25 minutes          Group A
  L-F   Lecture 25 minutes, Film 12 minutes          Group B

Unit III. Japan
  L-R   Lecture 22 minutes, Review quiz 10 minutes   Group B
  F-L   Film 10 minutes, Lecture 22 minutes          Group C
  L-F   Lecture 22 minutes, Film 10 minutes          Group A

Note that the content of the first experimental unit has to do with India, the second with China, and the third with Japan. Note, further, that EF1 is a lecture followed by a review quiz (L-R), EF2 is a film followed by a lecture on the subject matter of the motion picture (F-L), and EF3 is a lecture on the material of the motion picture followed by the motion picture (L-F). The subject matter of EF1 was drawn from this same motion picture on India. Note, further, that Groups A, B, and C, which are approximately equivalent seventh-grade classes, are rotated in such a way that each group experiences every EF. Note, finally, that the shortness of the film on Japan required that time allotments be reduced for this unit.

Since Weber gave no IT’s, the reader should think of his FT’s as identical with C. Since seventh-grade pupils started this experiment with some knowledge of these lessons on India, China, and Japan, as Weber himself proved later, he was scarcely justified in treating his FT’s as equivalent to C. The effect of doing so is probably to make the SD and SDM too large. The error is not serious, and is certainly less serious than notifying pupils what to expect in the lectures and films by giving tests to the pupils before they had had the EF’s applied.

After each group had had an EF applied, the pupils were given a 60-question test on the content of the lesson presented. The scores made by each group as a result of each EF are given in Table 35.

TABLE 35
DISTRIBUTION OF SCORES MADE BY 300 SELECTED 7A-GRADE PUPILS IN EACH OF THE THREE 60-QUESTION TESTS WHICH FOLLOWED LESSONS ON INDIA, CHINA, AND JAPAN, RESPECTIVELY (ADAPTED FROM WEBER)

[The nine frequency distributions of this table, with the M, SD, and SDM of each, are not recoverable from the source scan. The two parts of its Summary follow.]

SUMMARY - SUM OF M’s
Film-Lecture   SDS     Lecture-Film   SDS     Lecture-Review   SDS     D       SDD     EC
155.51         1.489   149.64         1.585                            5.87    2.175   .97
                       149.64         1.585   137.95           1.619   11.69   2.265   1.86
155.51         1.489                          137.95           1.619   17.56   2.200   2.87

SUMMARY - MEAN OF M’s
Film-Lecture   SDS     Lecture-Film   SDS     Lecture-Review   SDS     D       SDD     EC
51.84          .496    49.88          .528                             1.96    .725    .97
                       49.88          .528    45.98            .540    3.90    .755    1.86
51.84          .496                           45.98            .540    5.86    .733    2.87

Heretofore, each pupil’s score has been tabulated separately. Such tabulations become unwieldy when many pupils are used. The conventional economical substitute for individual tabulation is the frequency distribution, samples of which appear in Table 35. Such frequency distributions, though not absolutely necessary, do permit the employment of various statistical short-cuts. An illustrative reading of Table 35 will make clear the meaning of the frequency distributions.

Table 35 is read thus. After a lesson on India, presented by means of a lecture followed by a review quiz, i.e., L-R, a test on India was given to Group A. One pupil made a score of 29, one pupil made a score of 31, four pupils made a score of 33 and so on. After the same lesson on India, presented by means of F-L, the same test on India was given to Group B. Two pupils made a score of 24, three pupils made a score of 31, and so on.
In like manner, all six frequency distributions, shown in Table 35, may be read.

If he so desires, the experimenter can make a frequency distribution of the C1’s, and of the C2’s, etc., in each of the computation models, and can use this as a basis for computing M, SD, and SDM by short-cut statistical processes. But there is one thing the experimenter cannot do. He cannot make a frequency distribution of IT’s, and another frequency distribution of FT’s, and hope from these to obtain directly a frequency distribution of C’s or even to obtain C’s at all. C’s can be obtained only from individual tabulations. After individual C’s have been so obtained a frequency distribution of them can be made.

The Summary for Table 35 is given in two forms. The first part is in terms of the sum of the three M’s for each EF. It is the form with which the reader is already familiar. The second part is in terms of the mean of the three M’s for each EF, i.e., the sum of the three M’s divided by three. The mean of the M’s has the advantage over the sum of the M’s in that the mean of the M’s is comparable with any of the original M’s from which it comes, and with any original M for any EF. But if the sum of the three M’s is divided by three, the experimenter must be careful to divide each SDS by three also. If this is not done the final EC will be just one-third the size to which it is entitled. As Table 35 shows, the second part of the Summary is one-third the first part except for the EC which is the same. And this is as it should be, for the D from the sum of M’s is neither more nor less reliable than the D from the mean of the M’s.

But the unique feature of Weber’s experimental computations is not so much his use of frequency distributions, or his use of means instead of sums. The unique feature is his use of T scores or scale scores instead of the original number of questions correct.
His use of T scores makes all three tests and the scores from them comparable. To begin with, the test on India may have been the most difficult, and the one on Japan of medium difficulty. After the process of scaling has been completed, these differences in difficulty have been ironed out so that every score, irrespective of the test, is comparable with every other score and every M is comparable with every other M. This makes it profitable to use the mean of the M’s instead of the sum of the M’s in the Summary. Finally, the T scores make the D’s and the EC’s more exact.

The procedure by which each test was scaled is shown in Table 36, which is identical with the India portion of Table 35 except that 499 pupils instead of 300 pupils are used, that the T scores are shown in the last column instead of the first, and that three additional columns essential to the computation of T scores are added. The first column is the number of questions, out of 60 questions on India, answered correctly by the indicated number of pupils in each of Group A, Group B and Group C. The fifth column is the total number of pupils in all three groups answering the number of questions shown in the first column. The numbers of questions shown in this first column are grouped two together instead of each question separately as is usually done when scaling. This grouping is not necessary. It is, in fact, of doubtful desirability. Its virtue is that it saves labor. The sixth column gives the per cent exceeding plus half those reaching each number of questions correct. This per cent is based on the fifth column. How to compute these per cents and transmute them into T scores, shown in the last column, is described in Chapter V. Once these T scores are known, the first, fifth, and sixth columns may be eliminated as no longer useful, and the T scores may be moved to the extreme left, thus making a table similar to the India portion of Table 35. In like manner, the original number of questions correct on the test on China, and then the number of questions correct on the test on Japan, can be transmuted into T scores. Since all the pupils in all three groups are used in each of these three test scalings, all scale values, i.e., T scores, are thus made comparable.

TABLE 36
DISTRIBUTION OF SCORES MADE BY 499 7A-GRADE PUPILS IN A 60-QUESTION TEST WHICH FOLLOWED A LESSON ON INDIA. ORIGINAL STEPS CONVERTED INTO T-SCALE UNITS (AFTER WEBER)

          Group A   Group B   Group C           Per Cent Exceeding
Score     L-R       F-L       L-F       Total   Plus Half Those Reaching   T Score
0            2         2         1         5        99.50                    24
1 - 2        1         0         1         2        98.80                    27
3 - 4        1         1         2         4        98.20                    29
5 - 6        1         4         1         6        97.19                    31
7 - 8        4         6         5        15        95.09                    33
9 - 10       3         5         4        12        92.38                    36
11 - 12      8         2        11        21        89.08                    38
13 - 14      5         3         9        17        85.27                    40
15 - 16      7         9        10        26        80.96                    41
17 - 18     14         8        12        34        74.95                    43
19 - 20     17         9        13        39        67.64                    45
21 - 22      5        11        14        30        60.72                    47
23 - 24     13         9        20        42        53.51                    49
25 - 26     11        19         6        36        45.69                    51
27 - 28     17        13        13        43        37.78                    53
29 - 30      8        14        14        36        29.86                    55
31 - 32     16        15        10        41        22.14                    58
33 - 34     12         8         7        27        15.33                    60
35 - 36      9         9         5        23        10.32                    63
37 - 38      4         1         3         8         7.21                    65
39 - 40      2         8         2        12         5.21                    67
41 - 42      2         4         2         8         3.21                    69
43 - 44      1         4         2         7         1.70                    71
45 - 46                1         1         2          .80                    74
47 - 48                1         1         2          .40                    77
49 - 50                1                   1          .10                    81
Total      163       167       169       499
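The conversion from the Total column of Table 36 to the last two columns can be sketched as below. The per cent column follows directly from the definition in the text (pupils exceeding a step plus half those reaching it). The T-score step is an assumption on our part: it supposes that the Chapter V conversion amounts to locating each per cent on a normal curve and expressing the result on a scale with mean 50 and SD 10. The function name `t_scores` is hypothetical.

```python
from statistics import NormalDist

def t_scores(frequencies):
    """frequencies: number of pupils at each score step, lowest step first.
    Returns a list of (per_cent_exceeding_plus_half_reaching, T) pairs."""
    n = sum(frequencies)
    normal = NormalDist()          # standard normal curve
    reached_so_far = 0
    results = []
    for f in frequencies:
        reached_so_far += f
        exceeding = n - reached_so_far
        per_cent = 100 * (exceeding + f / 2) / n
        # Assumed conversion: normal deviate below which (100 - per_cent)
        # per cent of pupils fall, rescaled to mean 50 and SD 10.
        # (inv_cdf needs a value strictly between 0 and 1, which holds
        # whenever every step in the list has at least one pupil.)
        z = normal.inv_cdf(1 - per_cent / 100)
        results.append((round(per_cent, 2), round(50 + 10 * z)))
    return results

# Total column of Table 36 (all three groups combined), steps 0, 1-2, ..., 49-50.
totals = [5, 2, 4, 6, 15, 12, 21, 17, 26, 34, 39, 30, 42,
          36, 43, 36, 41, 27, 23, 8, 12, 8, 7, 2, 2, 1]

scaled = t_scores(totals)
for (per_cent, t), f in zip(scaled, totals):
    print(f"{f:3d} pupils   {per_cent:6.2f} per cent   T = {t}")
```

The per cents reproduce the sixth column of the table. The T values agree with the printed column at most steps; an occasional step may differ by a point, since the printed values were presumably read from the book's own conversion table.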
The possibility of scaling experimental tests on the basis of the performance of experimental pupils is not limited to rotation experiments employing three groups and FT’s only. It is possible for any rotation experiment with any number of groups and with or without IT’s. It is equally possible for any one-group or equivalent-groups experiment. In all these cases the scaling may be based upon IT, FT, or C records. The C records are best to use, the FT records are next best. When C records are used the experimenter can be absolutely certain of getting a T score for every need. If IT’s are used, there is a possibility that no pupil at the beginning of the experiment will make as high a record as will be made by some pupil on the FT. This means that extremely high scores on the FT may have to go unscaled. If the scaling is based upon FT scores, there is a possibility that extremely low scores on the IT cannot be scaled. No difficulty need be anticipated if C records are scaled. Chapter V shows how both IT and FT may be used to widen the range of the scale so as to include the highest and lowest scores.

But no matter which of the three records is scaled, it is highly important that the scores of every experimental group taking the test be utilized in scaling that test. This does not mean that every pupil involved in the experiment has to be used. It is required only that those utilized in experimental computations be included. Weber scaled his tests on 499 pupils. In his experimental computations he used only 300 of these 499 pupils. It would have been just as satisfactory to have scaled his tests on the 300 finally selected as the basis for his experimental computations. It would not have been quite so satisfactory if, say, Group C were omitted in the scaling.

Under certain conditions it is permissible to compute 51.84 in the Summary of Table 35, by a less laborious procedure.
The data which yield the three M’s from which 51.84 is derived may be lumped together so that only one M and one SDM is computed for all of it. In this case, the final M for each of the other two EF’s should be computed in the same way. The conditions required to make the above modification permissible are (a) an equal number of pupils in each group, (b) a uniform test for each group, or else the tests to be scaled upon the experimental groups so as to eliminate inequalities in difficulty and consequent unduly-increased variability and unreliability, and (c) approximate equivalence of ability for the groups so combined.

Special Computation Difficulties. Since the rotation method is a combination of several one-group methods or several equivalent-groups methods, it is appropriate that this chapter should close with a consideration of special types of statistical computations required for special situations. These special difficulties are caused not so much by peculiar variations in experimental method as by variation in methods of measuring changes. There are, for example, the following common ways of measuring changes produced in pupils by an EF:

1. Total points change on test made by each pupil.
2. Per cent of total possible gain on each test made by each pupil.
3. Time required for each pupil to attain a defined score on a test.
4. Per cent of pupils in each group attaining a perfect score or any defined score on a test.
5. Per cent of pupils in each group making any gain on test.
6. Per cent of pupils in one group whose change exceeds the mean change of the other group.

Measuring-method 1 is the most commonly used and should be. Except in very special instances, measuring-methods 2, 3, 4, 5, and 6 should be used merely as supplementary to the first method; they yield certain additional information which, on occasion, is valuable.
For example, it may be useful to know whether the superiority of a par- ticular EF is due to the large gains of a relatively few pupils only, or whether every pupil has contributed to the superior- ity. Measuring-method 4 tells whether the gains are well- distributed. All the computation models assume measuring- method 1. The experimenter is advised to avoid subsequent statistical difficulty by planning for this method. Measuring-methods 1, 2, and 3 yield a score and C for each pupil, thereby permitting the computation of an M and a SDM and ultimately a D, SDD and EC. Measuring- methods 4, 5 and 6 yield a score for the group only, thereby making it difficult, if not impossible, to compute measures of reliability. Since each experimenter is obligated to report the reliability of his conclusions, he should make sure that the measuring-method which he plans to employ will yield a measure of reliability at the end. CHAPTER IX CAUSAL INVESTIGATIONS Methodology of Causal Investigations——When Dar- win visited South America, he was surprised to discover an outbreak of yellow fever high up in the Andes Mountains. Since he was a born scientist, he began immediately to specu- late and observe to see if he could discover the cause for such an unusual phenomenon. Doubtless he asked himself these two questions: In what respect is this situation dif- ferent from places which are immune from yellow fever? In what respect is this situation like places which are subject to yellow fever? Darwin showed his genius by almost dis- covering the cause of yellow fever. He observed something about the place which was very unusual for high altitudes where yellow fever is unusual, and very much like lowlands where yellow fever is more common,—pools of stagnant water. He therefore suggested the hypothesis that this stag- nant water was responsible for the yellow fever. He was right so far as he went. 
It was not until long afterward that this investigation was pushed far enough to make it appear highly probable that stagnant water produced the mosquito, which, in turn, caused yellow fever to spread. -Metchnikoff observed that the Bulgarians were an unusually long-lived people. Metchnikoff wished to know why. Doubtless he, too, asked himself these questions: In what respect are the Bulgarians like other peoples who live long? In what respect are they different from other peoples, 1.e., what force operates upon the Bulgarians which does not operate upon other races? Like Darwin, he proceeded to observe for differences. He concluded that the most striking difference was the extent to which the Bulgarian people drink 208 Causal Investigations 209 buttermilk. He therefore concluded that the drinking of buttermilk was responsible for the long life of the Bul- garian, and that a similar practice on the part of other races would lead to an equally long life. He went beyond Darwin and buttressed his hypothesis by showing that certain organ- isms present in buttermilk are specially beneficial to the action of the alimentary canal. Reavis’s recent work! is an admirable illustration of a causal investigation in the field of education. He set out to locate the causes for attendance and non-attendance in’ school. From incidental observation and logical deduction, he had arrived at not one but a number of hypotheses as to what factors influenced attendance. He proceeded to collect a large amount of data with a view to testing the truth of his various hypotheses. These illustrations of causal investigations, together with many others which will occur to the reader, indicate some interesting inferences. One inference is that different causa! investigations differ in their starting point and ending point. Darwin’s causal investigation began with a problem and ended with the formulation of a crude hypothesis. 
The preeminent function of causal investigations is to yield suggestive hypotheses to be tested by further logical deduction, observations or experimentation. Because of the great value of fruitful hypotheses, causal investigation has constituted the fundamental method of discovery from the beginning of time. Metchnikoff’s causal investigation began with a problem which not only led to the formulation of a hypothesis, but also to the collection of certain subsidiary evidence to show that the hypothesis was not an unreasonable one. But Metchnikoff went no further. Reavis did not conduct an investigation to secure useful hypotheses. Probable causes were more evident. He started his causal investigation well supplied with fruitful hypotheses. But what is more important, he carried the investigation very much further than was done in the other instances. He carried it far enough practically to prove or disprove his various hypotheses.

¹ Reavis, George H., Factors Controlling Attendance in Rural Schools, Teachers College, Columbia University, 1922.

A second inference from these samples is that the conclusions yielded by causal investigations are usually less convincing than those yielded by experimentation. Conclusions from causal investigations are seldom more than strong hypotheses, which await confirmation by experimentation. This need for confirmation varies with the nature of the investigation and the adequacy of the data which are assembled or which it is possible to assemble. Experimentation carries greater weight than causal investigations, because an experimenter can control conditions much better than the investigator. The investigator is compelled to accept conditions as they are presented, complicated, as they usually are, by all sorts of irrelevant factors, and providing, as they frequently do, insufficient data upon which to base conclusions.
Darwin’s conclusion concerning the cause of yellow fever was only a good guess, at best. It was a very slender hypothesis. He could have greatly strengthened his hypothesis by making a systematic series of observations or collection of data. He could have strengthened it still more by evolving a hypothesis as to the exact mechanism whereby stagnant water causes yellow fever, and then by conducting an equivalent-groups experiment to test this hypothesis. All are familiar with the famous equivalent-groups experiment, finally conducted, in which a group of healthy men offered their lives to prove conclusively that yellow fever is transmitted by a certain variety of mosquito which thrives only where stagnant water is found.

Metchnikoff’s conclusion as to the efficacy of buttermilk was and remains a hypothesis only, and will continue to remain so until it is tested experimentally. It is doubtful if it can be tested conclusively by means of a causal investigation because nature apparently does not present the proper conditions.

The nature of Reavis’s research makes it more feasible as a causal investigation. By the selection of a relatively narrow problem, by the collection of many data readily available, by the utilization of recently-developed statistical techniques, and by the exercise of no little ingenuity, he was able to isolate fairly well the factors whose influence he desired to study.

A third inference is that the methodology of causal investigations is the methodology of equivalent-groups experimentation. A causal investigation is merely an equivalent-groups experiment conducted backward. The criteria for a valid equivalent-groups experiment are the criteria for a valid causal investigation. To the extent that a causal investigation would be invalid if reversed and conducted forward as an equivalent-groups experiment, just to that extent it is invalid as a causal investigation.
A perspective of a correct plan for a causal investigation, viewed from its starting point, is identical with a perspective of an equivalent-groups experimental plan, for the solution of the same problem, viewed from the ending point. If these perspectives are not identical, there is a crudity in one of the plans, and the crudity will usually be found in the plan for the causal investigation.

An important corollary of the foregoing is that he who has mastered the technique of experimentation is already equipped for causal investigation. Only a few additional techniques need be described.

In illustration of the foregoing statement that the same criteria hold for both causal investigations and equivalent-groups experimentation, it will suffice to show how these criteria apply to Metchnikoff’s causal investigation. To satisfy these criteria, Metchnikoff would have to show that, except for much buttermilk drinking and its reputed good effects, Bulgarians are by nature and environment equivalent to other races. This he has not shown. Consequently, critics of his hypothesis have some justification in attributing the long life of the Bulgarians to certain other factors in which the Bulgarians possibly differ from other races. The true cause may be, for example, the operation of a more rigorous environment than has been operating upon other races. The effect of such selective agency would be to make the present Bulgarian people a very hardy stock. Combine this possible fact with the assumption that there has been a rapid amelioration of environmental conditions during the last few hundred years, and we have an explanation for Bulgarian longevity totally unconnected with buttermilk.
Or, again, it may be that the original ancestors of the Bulgarians possessed and transmitted through heredity a tendency toward longevity, just as they doubtless possessed and transmitted the physical traits which distinguish them from other races today. Or, finally, their greater longevity may be due to the cooperative contribution of several of these factors rather than to any one of them.

All this shows why causal investigations which fail to satisfy perfectly the equivalent-groups experimental criteria yield conclusions which are suggestive hypotheses only. Their validity is no greater and no less than that of the conclusions yielded by an equivalent-groups experiment which fails to satisfy its own criteria to an equal extent.

Essential Procedure of Simple Causal Investigations. Causal investigations may be prosecuted in either of two ways. Perhaps the most common and certainly the most simple and elementary way, is the all-or-none procedure. In an all-or-none investigation, the effect, whose cause is sought, is either totally present or totally absent, or else the investigator arbitrarily ignores any gradations in between, or else he defines a certain minimum amount of the effect, any amounts in excess of which will be considered to constitute its presence, and any amounts less than which will be considered to constitute its absence. The preceding discussion of this chapter has made it clear that for this variety of causal investigations the essential steps are as follows:

1. The investigator searches until he finds objects, individuals, communities or situations which are alike in that they all show a particular effect whose cause is sought.
2. He inspects these situations to see whether they have anything else in common which might possibly be the cause of the observed effect. If he finds such a common cause, he formulates the hypothesis that this is the probable cause of the effect.
3. He continues his collection of cases to discover whether the hypothetical cause is always and without exception present when the effect is present.
4. He collects cases which are alike except for the presence of the effect in some of the cases and its absence in others.
5. He observes to see whether the hypothetical cause is present in those cases which show the effect, and absent in those cases which do not show it.
6. He continues the collection of such instances to discover whether inexplicable exceptions occur.
7. If in either half of the foregoing process inexplicable exceptions occur, the investigator attempts to find a new and more promising hypothesis as to the cause of the effect. If he is successful in this he starts through the above process again. If he is not successful the causal investigation ends unsuccessfully.

Essential Procedure of a Complex Causal Investigation. a. Formulation of Hypotheses. Causal investigations of a complex variety do not treat the effect merely as present or absent, but recognize and take account of gradations of effect and gradations of cause. Here the investigator determines not only whether the presence of the effect is accompanied by the presence of the hypothetical cause, but also whether increase in the amount of the cause is accompanied by a corresponding increase in the amount of the effect. Furthermore, the investigator may attempt to discover whether the effect is produced by one or more causes, and if produced by several causes he may attempt to determine just how much of the effect each cause contributes.

Reavis’s investigation is an illustration of one which took account of gradations in cause and effect, which found that the effect was produced by several cooperating causes, and which determined the exact amount of independent contribution of each cause to the effect. A summary of his procedure is given below.
The reader is referred to his dissertation for details.

From incidental observation and logical deduction, he formulated numerous hypotheses as to the more probable causes or factors influencing the attendance of rural-school elementary pupils. Some of these factors related to the pupil, some to the school and teacher, and some to the community. Sample questions relating to the pupil were: Does age, sex, distance from school, quality of roads from home to school, distance transported, age-grade position, or quality of school influence a pupil’s attendance record? Sample questions relating to teacher and school were: Does the teacher’s salary, or amount of training, or the school’s modernness of equipment, playground space, or the like influence a pupil’s attendance? Sample questions relating to the community were: Does the community’s wealth, intellectual level, or interest in education influence a pupil’s school attendance?

b. Collection of Data. The collection of data is a problem in measurement. The general principles to guide such measurements were given in Chapter V. These principles hold whether the investigator personally makes his own measurements, or secures them from others by means of a questionnaire. The principles apply whether the measurements made be tests of mental traits, tests of school buildings, collection of school records, or the introspections or judgments of judges.

The following questions¹ will guide the investigator in the evaluation and preparation of a questionnaire.

¹ See Rugg, Harold O., Application of Statistical Methods to Education, pp. 39-55; Houghton Mifflin Company, New York, 1917.

Are the questions as factual as possible? Do they involve a minimum of judgment and memory? Are the questions as specific as possible? Will the data secured lend themselves to tabulation and statistical treatment? Are the questions unambiguous? Will all terms used have the same meaning to all reporters?
Will the questions evoke replies which will be unambiguous to the investigator? Is the information called for difficult to obtain? Can the data called for be obtained more accurately otherwise? Do the questions cover all the data needed for subsequent computations? Can the questions be answered by a check, number, Yes, No, or brief phrase? Are the questions arranged so that none will be overlooked? Is the space sufficient for each answer? Are the questions worded and arranged to facilitate tabulation and fit the tabulation form to be used? Will the data called for by the questions answer the specific and previously worded objects of the investigation? Are the questions formulated in the light of a bibliographical survey? Is the amount of time required to answer questions so excessive as to induce careless responses, omission of items, or few replies? Are the questions worded in the light of one or more preliminary trials with representative samplings of the individuals for whom questions are designed? Are the nature and number of questions such as to secure replies from representative individuals and from a sufficient number to satisfy the statistical criteria of reliability?

A common form of questionnaire is one which aims to measure the degree of preference for this or that. Thus Lowe sent a questionnaire which gave a comprehensive list of the activities of clergymen. He desired to know how each clergyman evaluated each activity. Several methods have been proposed for meeting just such a situation, i.e., for measuring opinions.

One method, the rank method, is to ask that the activity which is deemed most important be ranked 1, the one deemed next most important be ranked 2, and so on for the number of activities listed. This method is fairly satisfactory in most cases. It is very time-consuming if the number of items is large. It yields relative evaluations only; it does not show what activities are deemed of no value whatever.
It does not show which activities are judged to be of equal value, but forces the reporter to make a choice. This forcing does no harm so far as group results go, but it may do violence to one individual’s opinion. Finally, the rank method forces the reporter to make the same difference between all adjoining activities, namely, a difference of one.

A second method is the distribution method. Here the reporter is asked to distribute, say, 100 points among the listed activities, thus showing the importance of each activity by the number of points assigned to it. This method permits the reporter to indicate just what activities are of no merit, but does not allow him to indicate negative values. It permits the reporter to attach the same value to more than one activity, and to indicate varying differences between activities. It is more time-consuming, however, than the rank method, unless the activities are grouped into headings and sub-headings. If they can be so grouped, the reporter can be asked to distribute his 100 points among the main headings, and, after this is done, to distribute the total points assigned to each heading among its sub-items. Sometimes, however, activities do not fall into convenient groupings which are mutually exclusive as to items and sub-items or where the sub-items completely exhaust their heading. Theoretically, the distribution method requires both such exclusiveness and exhaustion. Finally, the distribution method tends to make the number of points assigned to each activity incomparable from one reporter to another. One clergyman may hold half the activities listed to be of no value; nevertheless he must use up his 100 points. Another clergyman who assigns some points to every activity will be compelled to assign fewer points to an activity which he may evaluate just the same as the previously mentioned individual.

A third method is the relative-to-the-items scale method.
Here the reporter is asked to rate the activity considered least important as 1, the activity considered most important as 20, or 10, or 5, and to assign a value anywhere from 1 to 20 inclusive to the other activities, assigning the same value more than once if desired. This method has all the virtues previously mentioned as desirable, except that of permitting a report as to just what activities are judged of no worth or negative worth or whether any activities are of greater worth.

A fourth method is the absolute-worth-occupational scale. Here the clergyman is asked to rate any activity equal in value to the most desirable activity in which a clergyman can engage as worth, say, 19 points; to rate any activity zero, which is of just no professional significance; to rate any activity minus 19 which is equal in professional destructiveness to the worst occupational activity in which a clergyman can engage; and to rate all other activities according to this absolute occupational scale. Thus, mending shoes is above zero in social value, but is probably below zero on a clergyman’s occupational scale. The chief objection to this scale is the great likelihood that the reporter will be unable to avoid confusing this fourth scale with the fifth to be described.

The fifth method is the absolute-worth-social scale. Here the reporter is asked to construct or think a scale ranging from minus 19 through 0 to plus 19, where minus 19 means the worst imaginable human act such as an able-bodied man murdering his defenseless, gifted child to avoid working for its support, where plus 19 means the best conceivable human act, and then to rate the listed activities according to this scale. This scale yields the fullest information of any of the five methods described. Whether it is more or less reliable than the others is not surely known.

Reavis employed the questionnaire procedure for collecting the data used in his investigation.
Fortunately, he was in a position of authority where he could secure unusually accurate and adequate returns. He eliminated from consideration all transient pupils whose attendance could not possibly be perfect due to the fact that they were not in one district throughout the school year. Then he secured a measure of the amount of attendance of each of 5314 pupils in 200 country schools in five counties in Maryland. At the same time he determined the amount of presence of each of a large number of hypothetical factors, such as the pupil’s distance from school, the quality of his work at school, the sort of teacher who taught him, the character of the school building and equipment which surrounded him, and the character of the community in which he lived.

Much ingenuity was shown in making these determinations, and in securing a comparable quantitative expression for the amount of presence of each factor. To illustrate with only one of the difficulties encountered, consider his method for securing comparable measures of the distance a pupil lives from the school. A pupil who lives a mile from the school and in order to reach it must walk all the way along an unimproved clay dirt road, really lives farther away than another pupil a mile from the school who walks half the way on an unimproved clay dirt road and half the way on a macadam state road.

To equate these two conditions, Reavis reduced the distance for pupils travelling over state roads so as to make state-road distances equal unimproved-road distances. He made various guesses as to the proper subtraction and checked up each guess by computing the coefficient of correlation between attendance of all pupils and the distance score for each pupil corrected by his guess. With each improvement in his guess, the coefficient of correlation should go up, due to the fact that errors in measurement reduce the coefficient of correlation toward zero.
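The guess-and-check device for weighting state-road miles can be sketched in Python. This is a minimal illustration under stated assumptions and is not Reavis’s own computation: the function names, the list of candidate weights, and the synthetic pupils are all hypothetical. The strength of the relationship is taken as the absolute value of r, on the assumption that distance and attendance vary inversely and that it is the size of the coefficient which is being compared.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation from deviations about the exact means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def corrected_distance(unimproved_miles, state_miles, state_road_weight):
    """Express a pupil's route in equivalent miles of unimproved road."""
    return unimproved_miles + state_road_weight * state_miles


def best_state_road_weight(pupils, candidate_weights):
    """Return (weight, |r|) for the guess giving the strongest correlation.

    pupils: list of (unimproved_miles, state_miles, attendance) tuples.
    """
    attendance = [p[2] for p in pupils]
    best = None
    for weight in candidate_weights:
        distances = [corrected_distance(u, s, weight) for u, s, _ in pupils]
        strength = abs(pearson_r(distances, attendance))
        if best is None or strength > best[1]:
            best = (weight, strength)
    return best


if __name__ == "__main__":
    # Synthetic pupils for illustration only (hypothetical, not Reavis's data).
    rng = random.Random(0)
    pupils = []
    for _ in range(200):
        u = rng.uniform(0, 3)
        s = rng.uniform(0, 3)
        attendance = 95 - 8 * (u + 0.75 * s) + rng.gauss(0, 3)
        pupils.append((u, s, attendance))
    print(best_state_road_weight(pupils, [0.25, 0.5, 0.75, 1.0]))
```

Each candidate weight is one guess; the guess retained is the one whose corrected distances correlate most strongly with attendance. The same loop could serve for the transportation correction by adding a second weight.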
The correlation between uncorrected distances and attendance was .38. A perfect correlation would be 1.0, and no correlation would be zero. Calling each mile of state road equivalent to one-half mile of unimproved road and correcting accordingly yielded a coefficient of correlation between corrected distance and attendance of .43. Counting each mile of state road as equal to three-fourths of a mile of unimproved road and correcting accordingly raised the correlation to .54. A guess on either side of the last weighting yielded correlations of .48 and .51, showing that the best basis for correction was to call one mile of state road equal to three-fourths of a mile of unimproved road.

But even the correction for the quality of the road does not eliminate all the error in the distance measurements. Some of the pupils were transported all or a part of the way. By employing the same correlation device to check up various guesses as to the proper weighting, Reavis found the optimum correction for distance transported per number of days transported and per cent of days attended. The reason for taking the amount of attendance into consideration will readily occur to the reader.

c. Determination of Significance of Causes. The next step was to divide the 5314 pupils into two groups of equal numbers. One group was composed of that half of the pupils having the better attendance record. The half with the poorer attendance record composed the other group. Three or more groups representing as many attendance gradations could have been used. From the better-attendance group a smaller group was so selected as to be equivalent in every respect, except for the difference in attendance and the factor of distance, to a smaller group selected from the poorer-attendance group. That is, in equating these two groups, the factor of distance was ignored but all other factors were regarded.
The technique for equating groups on several bases was discussed in Chapter III. Next, the mean distance from school of each equated group was computed. If, when this was done, the mean distance was less for the better-attendance group, the investigator was justified in concluding that a difference in distance was associated or correlated with a difference in attendance.

The next step was to equate two groups in every respect except, say, the quality of school work of the pupils and attendance. The difference between the mean quality of school work for the two groups showed the extent to which quality of school work was associated with attendance, whether positively correlated, negatively correlated, or whether neutral. In similar fashion, the investigator determined whether any other factor relating to the pupil, teacher, school, or community was associated, and to what degree, with the attendance of the pupils.

If the mean distance for one attendance group was identical with the mean for the other attendance group, a conclusion that distance affects attendance would be totally unreliable. Since the D between the two M’s would be zero, the EC would be zero. If there were some difference between the two M’s, the significance of this D, or rather how much we could trust its significance, would depend upon the reliability or EC of this D. This reliability could be determined in the usual way. The series of distance scores from which M1 came would permit the computation of SD and SDM1. Similarly the series of distance scores which yielded M2 would yield SD and SDM2. M1 and M2 would yield D. SDM1 and SDM2 would yield SDD. D and SDD would yield EC.
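The chain from the two series of distance scores to EC can be sketched as follows. This is an illustration under assumptions rather than a restatement of the computation models: SD is taken about the group mean with N as divisor, SDM as SD divided by the square root of N, SDD as the square root of the sum of the squared SDM’s, and EC as D divided by 2.78 times SDD. The constant 2.78 is an assumption about the definition of EC used in the earlier chapters and is left as a parameter; the function and key names are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def standard_deviation(values):
    """SD about the group mean, dividing by N (assumed convention)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def sd_of_mean(values):
    """SDM: the SD of the series divided by the square root of N."""
    return standard_deviation(values) / math.sqrt(len(values))


def experimental_coefficient(scores_1, scores_2, certainty_ratio=2.78):
    """Return M1, M2, D, SDD and EC for two equated groups.

    certainty_ratio is the assumed divisor in EC = D / (ratio * SDD).
    """
    m1, m2 = mean(scores_1), mean(scores_2)
    d = m1 - m2
    sdd = math.sqrt(sd_of_mean(scores_1) ** 2 + sd_of_mean(scores_2) ** 2)
    if sdd == 0:
        raise ValueError("SDD is zero; EC cannot be computed for constant series")
    return {"M1": m1, "M2": m2, "D": d, "SDD": sdd,
            "EC": d / (certainty_ratio * sdd)}


if __name__ == "__main__":
    # Hypothetical distance scores (miles) for two equated attendance groups.
    better_attendance = [0.5, 1.0, 1.5, 1.0, 2.0, 0.5]
    poorer_attendance = [1.5, 2.5, 2.0, 3.0, 1.0, 2.5]
    print(experimental_coefficient(better_attendance, poorer_attendance))
```

When the two means coincide, D is zero and so is EC, which agrees with the statement above that such a result gives no ground for concluding that the factor affects attendance.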
When two groups equivalent in all respects, except for attendance and the difference in the factor being studied, show the same mean amount of the factor, we can certainly say that the factor under consideration has no influence upon attendance, is not a cause or contributing cause of attendance. When the above procedure is used, and when variations in attendance are accompanied by variations in the factor being studied, we are justified in saying that variations in the factor are associated or are correlated with variations in attendance. But additional considerations are necessary before we are justified in concluding that variations in a factor influence or are a cause of variations in attendance. It may be that attendance is, instead, a cause of the factor. Or it may be that each is partly effect and partly cause. Or it may be that no direct, definite causal relation exists.

Judging by Reavis’s findings, distance is associated with attendance. Now since it is easily conceivable that distance influences attendance, and since it is highly improbable that attendance in a particular year has influenced the distance a pupil lives from school during that year, we are justified in concluding that distance is not only associated with but actually influences attendance. Also the results of Reavis’s study showed that quality of school work was associated or correlated with attendance, but we cannot be quite certain here, whether the quality of school work influenced attendance or attendance influenced quality of school work or both. Probably the last is nearest the truth. Poor attendance leads to low quality of work, which leads to loss of interest, which leads to poorer attendance still.
In sum, if the investigator will follow the procedure outlined above he can conclude that a correlation exists between factor and attendance, and that sometimes a causal relation exists; but which is cause and which effect rests upon additional logical considerations.

When the cases are as numerous as they were in the study made by Reavis, causal investigators often save themselves trouble by using all the cases in the study of each factor, trusting to luck and to numbers to make the groups equivalent in all other factors. Thus, in the sample illustration, they would divide the 5314 pupils into, say, two groups equal in number, those living nearer and those living farther from the school. The investigator would assume, in this case, that since the pupils were divided with an eye to one factor only, the two groups would by chance be approximately equivalent with respect to the amount of presence of any other factor.

If the various factors are independent of each other, i.e., if they are uncorrelated with each other, the foregoing procedure would be fairly satisfactory. But in any complex investigation, the investigator can be practically certain that various factors are correlated and cross correlated in all sorts of bewildering ways. If all pupils are divided regardless of everything except quality of school work, we can be practically sure that chance would not equate the two groups with respect to, say, distance. Long distance from school, through its reduction of attendance, affects quality of school work. That is, distance and quality of school work are not independent factors. They are negatively correlated. As a result, any division on the basis of quality of school work alone, unavoidably becomes, in part at least, a division on the basis of distance.
In like manner, it will become, in part at least, a division on the basis of every other factor which is correlated either positively or negatively with quality of school work. So long as this is the case, the investigator is unable to tell just how much of any difference in attendance is attributable to quality of school work, and how much to each of the various factors correlated with quality of school work. All he can conclude is that this total complex is correlated with the attendance record, and may be a cause or an effect of the attendance record. The only safe procedure is to satisfy as completely as possible the equivalent-groups experimental criteria by attempting consciously to equate the groups in every known factor. Even so there will be enough error due to unknown significant factors.

d. Preliminary Exploration of Significance of Causes. Now as a matter of fact, Reavis did not employ the former or more exact method of evaluating the factors. He used instead a modified and rather drastic form of the latter more crude method. But he used this method not for the purpose of evaluating exactly the influence of each factor upon attendance, but rather for the purpose of preliminary exploration to discover which factors appeared promising enough to justify an additional very refined procedure, a procedure more feasible than the exact one already described.

His preliminary explorative procedure was to place in one group, not the half of his pupils who had the best attendance records, but the topmost 12% in attendance. The other group was composed of the lowest 12% in attendance. Since any factor that varies with attendance should be found in different amounts in these two groups, he computed the mean distance from school for each group, and then the mean quality of work in school for each group, the per cent of each group found under the better teachers, vs.
the per cent found under the poorer teachers, and so on for the large variety of factors whose influence upon attendance was under consideration. When there was a pronounced difference between the two means or the two per cents for a factor, Reavis considered that factor to be worthy of further study by a more exact procedure. When no pronounced difference appeared he considered that factor to have little or no influence upon attendance and eliminated it from further consideration. While this method is so crude that it will not show the independent contribution of each factor, it is sufficiently exact to show what factors are promising ones for further study and which ones are unpromising.

In this preliminary investigation Reavis determined roughly the significance for attendance of the following factors relating to the child: sex, chronological age, grade in which enrolled, quality of work, and promotion. He studied the following factors relating to the school: training of teacher, salary of teacher, experience of teacher, number of recitations, completeness of teacher’s report, neatness of teacher’s report, handwriting of the teacher, teacher’s intention to continue, schools changing teachers, rating of teacher, size of library, kind of blackboard, rating of equipment, age of desks, number and kind of pictures on the walls, school enrollment, size of schoolroom, lighting of schoolroom, system of heating and ventilation, rating of school building, suitability of school grounds, play and games, value of school property, cost of running school and distance from children’s homes. He investigated the following factors relating to the community: money raised, number of community meetings, and rating of the community.

Many of the above factors proved to have little or no connection with attendance. Many other factors showed a significantly promising relationship.
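The preliminary exploration by extreme groups can be sketched in Python. This is a rough illustration only: the record layout, the function names, and the sample pupils are hypothetical, and what counts as a pronounced difference is left to the investigator’s judgment rather than fixed by the code.

```python
def mean(values):
    return sum(values) / len(values)


def extreme_groups(pupils, fraction=0.12):
    """Return (topmost, lowest) groups by attendance, each `fraction` of all pupils."""
    ranked = sorted(pupils, key=lambda p: p["attendance"])
    size = max(1, round(len(ranked) * fraction))
    return ranked[-size:], ranked[:size]


def screen_factors(pupils, factor_names, fraction=0.12):
    """Compare the mean amount of each factor in the two extreme groups."""
    best, poorest = extreme_groups(pupils, fraction)
    report = {}
    for name in factor_names:
        mean_best = mean([p[name] for p in best])
        mean_poorest = mean([p[name] for p in poorest])
        report[name] = {
            "mean_best": mean_best,
            "mean_poorest": mean_poorest,
            "difference": mean_best - mean_poorest,
        }
    return report


if __name__ == "__main__":
    # Hypothetical records: attendance in per cent, distance in miles,
    # age of desks in years (held constant so that it shows no difference).
    pupils = [
        {"attendance": a, "distance": 10 - a / 10, "desk_age": 5}
        for a in range(1, 101)
    ]
    for factor, row in screen_factors(pupils, ["distance", "desk_age"]).items():
        print(factor, row)
```

A factor recorded as a per cent (for instance, per cent of the group under the better teachers) can be handled by coding each pupil 1 or 0 and taking the mean. As the text notes, a difference found this way does not isolate the independent contribution of the factor; it only marks the factor for closer study.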
In order to reduce the number of factors for detailed examination, various significant factors were combined where possible. Thus a score for distance was determined by combining uncorrected distance, quality of roads, and transportation. A score for the teacher was secured by combining the factors relating to her which proved significant, namely, her rating by the superintendent, her salary, and her training. A score for the school plant was secured by combining the rating on the building, rating on the equipment, and rating on the grounds. In describing the correction of distance, a device was given for determining weights to be assigned to the elements that entered into these various combinations. A like method was employed for computing these composites for teacher, and for school. Three other factors, namely, a pupil’s progress through the grades or age-grade relationship, a pupil’s quality of school work, and the quality of the community, were found worthy of additional consideration. This means that six factors were selected for detailed examination by the process to be described. A seventh factor, namely, chronological age, was found to be significant, but the effect of this factor was taken care of by studying the relationship between attendance and the six selected factors separately for each of three age groups, namely, 5 to 8, 8 to 12, 12 and above.

e. Correlation and Inter-correlation Between Causes and Effect. The next step was to compute the coefficient of correlation between attendance and each of the six selected factors, and to do this separately for each of the three age sub-groups.

The coefficient of correlation is a statistical expression for the degree of proportionality or correspondence between two series of measures, and is indicated by the symbol r. When r is 1.0 the correspondence or correlation between the two series of measures, say, scores for distance and attendance is perfect and positive.
When r is −1.0 the correlation is perfect but it is inverse or negative. When r is zero the correlation is nil. An r may be anywhere from −1.0 through zero to +1.0. We should expect the r between attendance and quality of school work to be positive, because we should expect those pupils who have a good attendance record to tend to show high quality of school work, and vice versa we should expect those pupils who have a poor attendance record to tend to show a low quality of work. On the other hand we should expect the r between attendance and distance to be negative, because we should expect those pupils who have a high distance score to tend to have a low attendance record, and vice versa.

There are several formulae for the computation of r. The standard formula when the relationship is approximately rectilinear (see Diagram 1) is Pearson's product-moment formula, which may be written thus when the exact mean is used:

\[ r = \frac{Sxy}{\sqrt{Sx^2}\,\sqrt{Sy^2}} \]

or thus, when the assumed mean is used:

\[ r = \frac{\dfrac{Sxy}{N} - c_x c_y}{\sqrt{\dfrac{Sx^2}{N} - c_x^2}\;\sqrt{\dfrac{Sy^2}{N} - c_y^2}} \]

where x and y are deviations from the mean (or from the assumed mean), S denotes summation, and c_x and c_y are the corrections Sx/N and Sy/N.

Most educational relationships are rectilinear or are sufficiently so to make it permissible to employ the product-moment formula. But it is well to construct and inspect a scatter diagram (see Diagram 1) to determine whether the general drift of the diagram is rectilinear or curvilinear. If it is pronouncedly curvilinear the investigator is referred to Rugg's book ¹ on statistical methods for the appropriate formula.

¹ Rugg, Harold O., Application of Statistical Methods to Education; Houghton Mifflin Company, New York, 1917.

[Diagram 1. Scatter diagram; axes: per cent of attendance (0 to 100) and distance in miles. The circles show an approximately rectilinear relationship. The crosses show a curvilinear relationship.]

Diagram 1 shows in one diagram two sample scatter diagrams for two groups of twenty-five children. The circles show the relationship between attendance and distance. Each circle indicates one child's attendance record and distance from school. The general drift of the relationship is a straight-line or rectilinear drift. The crosses show the relationship between attendance and distance for twenty-five other pupils. Remember that the diagram is merely for illustrative purposes. It is extremely improbable that one group of pupils (circles) would show a decided negative correlation and another group (crosses) a decided positive correlation. But the important point to note about the diagram is that the circles show a rectilinear drift whereas the crosses show a curvilinear drift.

[Table 37. Showing how to compute r for the data (circles) in Diagram 1.]

The procedure for computing r is given in Table 37. Note that the x column shows deviations from the AM for attendance, and that the y column shows deviations from the AM for distance. Everything else is self-explanatory.
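As a minimal sketch of the two forms of the product-moment formula, the following Python computes r from deviations about the exact mean, from deviations about an assumed mean, and from sums that have already been totalled. The function names and the five pairs of scores are hypothetical illustrations and are not the data of Table 37. The totals passed to `r_from_sums` in the last call are the ones the text reports for the contingency table (Table 38), in step-interval units.

```python
from math import sqrt


def r_exact_mean(xs, ys):
    """Product-moment r using deviations from the exact means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    x = [a - mx for a in xs]
    y = [b - my for b in ys]
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return sxy / (sqrt(sx2) * sqrt(sy2))


def r_from_sums(n, sx, sy, sx2, sy2, sxy):
    """Product-moment r from sums of deviations about assumed means."""
    cx, cy = sx / n, sy / n  # corrections for the assumed means
    numerator = sxy / n - cx * cy
    return numerator / (sqrt(sx2 / n - cx * cx) * sqrt(sy2 / n - cy * cy))


def r_assumed_mean(xs, ys, am_x, am_y):
    """Product-moment r using deviations from assumed means am_x, am_y."""
    x = [a - am_x for a in xs]
    y = [b - am_y for b in ys]
    return r_from_sums(
        len(xs), sum(x), sum(y),
        sum(a * a for a in x), sum(b * b for b in y),
        sum(a * b for a, b in zip(x, y)),
    )


# Hypothetical scores for five pupils (not taken from the book).
attendance = [90, 75, 60, 40, 20]
distance = [0.5, 1.0, 2.0, 3.0, 4.0]

r_exact = r_exact_mean(attendance, distance)
r_assumed = r_assumed_mean(attendance, distance, 50, 2.0)

# Totals reported in the text for Table 38: N, Sfx, Sfy, Sfx^2, Sfy^2, Sxy.
r_table_38 = r_from_sums(25, 3, -1, 103, 49, -57)

print(round(r_exact, 4), round(r_assumed, 4), round(r_table_38, 2))
```

The exact-mean and assumed-mean results agree for any choice of assumed mean, which is why the assumed mean can be picked purely for convenience.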
When N is large, say 50 or above, it is more economical to tabulate data into a contingency table, such as Table 38. Such a contingency table may be used not only as a starting point for a short-cut method of computing a product-moment coefficient of correlation, but it also makes unnecessary the construction of a scatter diagram, such as Diagram 1. Inspection of the contingency table will show whether the relationship is sufficiently rectilinear to make the product-moment method applicable.

Table 38 is read thus: There were 3 pupils who lived between 3.4 and 4.0 (inclusive) miles distance from school whose per cent of attendance was between 0 and 10 inclusive, and similarly for the remainder of the contingency table. There is no particular virtue in grouping the per cents in step-intervals of 15, or the miles in step-intervals of 0.8. The per cents could be grouped in step-intervals of 5, 10, 15 or any amount that is convenient. Likewise, the miles could be grouped in step-intervals of 0.2, 0.4, 0.6, 0.8 or any amount that is convenient. The size of the step-intervals chosen for Table 38 gives 7 steps for attendance, and 5 steps for distance.
As a rule it is better to have a step-interval of such size as to produce not less than 10 nor more than 20 steps in each of the two items. The steps are made fewer in Table 38 so as to simplify the presentation of the correlation procedure.

[Table 38. Shows how to compute a coefficient of correlation when data of Table 37 has been tabulated in a contingency table (after H. L. Rietz). Columns: per cent of attendance, in steps 0 to 10, 15 to 25, 30 to 40, 45 to 55, 60 to 70, 75 to 85, 90 to 100. Rows: distance in miles, in steps 3.4 to 4.0, 2.6 to 3.2, 1.8 to 2.4, 1.0 to 1.6, 0.2 to 0.8. Totals: N = 25, Sfx = 3, Sfy = −1, Sfx² = 103, Sfy² = 49, Sxy = −57.]

The steps in the process of computing a coefficient of correlation from a contingency table follow.

(1) Construct contingency table.
(2) The total frequencies in the first column are 4. The total frequencies in the second column are 2, and so on for the other columns. The grand total of frequencies is 25.
(3) The total frequencies for the first row are 5, for the second row, 4, and so on. The grand total of frequencies is 25, thus checking the preceding determination.
(4) The AM for attendance is 50, as shown by the vertical double ruling. The AM for distance is 2.1, as shown by the horizontal double ruling. Other AM's might have been taken, though AM's near the center of each frequency distribution are more convenient.
(5) The step-deviations from the AM for attendance are shown in the x row. The step-deviations from the AM for distance appear in the y column.
(6) The product of each x multiplied by its corresponding f appears in the fx row. The algebraic total of the fx's is shown at the end of the fx row. Sfx = 3.
(7) The product of each y multiplied by its corresponding f appears in the fy column. The algebraic sum of the fy's is shown at the bottom of the fy column. Sfy = −1.
(8) The product of each x² multiplied by its corresponding f appears in the fx² column. Sfx² = 103.
(9) The product of each y² multiplied by its corresponding f appears in the fy² column. Sfy² = 49.
(10) The f in the first square in the first column and first row is 3. The x at the bottom of this column is −3. The y at the end of this row is 2. The product of (3) × (−3) × (2) is −18, which is written in the upper right corner of this first square. The f in the second square of the first column is 1. The x at the bottom of this column is −3, and y at the end of this row is 1. The product of (1) × (−3) × (1) is −3, which is written in the upper right corner of the square in question. The f in the third square of the third column is 3. The x is −1, and the y is 0. The product of (3) × (−1) × (0) is written in the upper right corner. The f in the last square of the last row is 2. The x is 3 and the y is −2. The product of (2) × (3) × (−2) is written in the upper right corner of this square. The other f's times the xy products are computed similarly.
(11) The sum of the xy products in the first row, i.e., the sum of −18, −4, and −2, is −24. This sum is written in the xy column in the minus sub-column. Were this sum positive instead of negative, it would be written in the positive sub-column. In like manner, the sum of the xy products for each row is computed and written in the last column. Positive Sxy = 0. Negative Sxy = −57.
(12) The cx is computed; cx = 0.12.
(13) The cy is computed; cy = −0.04.
These c's are not multiplied by the size of the step-interval as is done in Table 17, because Sxy, Sx², and Sy² used in the correlation formula are kept in terms of step-intervals also.
(14) Sx² = 103. Sy² = 49. Sxy = 0 − 57 = −57.
(15) The values previously computed are substituted in the correlation formula shown at the bottom of the table. This formula is identical with that used in Table 37, except that all values are in terms of step-intervals. By solving the formula, r is found to be −.80+. The r, when computed by the procedure illustrated in Table 37, is −.81. This is a remarkably close agreement, when we consider the drastic condensation of the data produced by the large step-intervals used in the contingency table.

By substituting age-grade scores for distance scores in Table 37 or Table 38, and by recomputing, the r for attendance with age-grade relation can be determined. In similar manner, the r between attendance and each of the six selected factors, or between any factor and any other factor, can be computed.

The first row of Table 39 shows the coefficients of correlation between attendance and each of the six factors as computed by Reavis for the age group 8 to 12 and all five counties combined. Reavis's original table presents the coefficients for the three separate groups and the five separate counties. Additional rows show the correlation between each factor and every other factor.

TABLE 39
SHOWING THE COEFFICIENTS OF CORRELATION BETWEEN ATTENDANCE AND EACH OF SIX HYPOTHETICAL CAUSES OF ATTENDANCE, TOGETHER WITH THE CORRELATION BETWEEN EACH CAUSE AND EVERY OTHER CAUSE (ADAPTED FROM REAVIS)

                          2          3          4          5          6          7
                       Distance   Age        Quality    Teacher    School     Community
                                  Grade      of Work               Plant
1. Attendance           -.45       .50        .33        .16        .07        .30
2. Distance                       -.20       -.13       -.10       -.06        .02
3. Age Grade                                  .24        .01        .08        .08
4. Quality of Work                                       .00        .08        .03
5. Teacher                                                          .25        .35
6. School Plant                                                                .17

For our present purpose the first row of Table 39 is the most significant. It tells us that those whose attendance records are excellent tend to live near the school to the extent of .45, tend to progress rapidly through the grades to the extent of .50, tend to make high marks in school to the extent of .33, tend to have good teachers to the extent of .16, tend to have an excellent school plant to the extent of .07, and tend to live in a highly-rated community to the extent of .30. So far as these coefficients go, attendance appears to be most closely associated with age-grade relationship and distance.

Among the inter-correlations of the various factors, the most surprising coefficient is the zero relation between quality of work and the teacher. One would expect better teachers to secure a higher quality of work on the part of the pupils. Had quality of work been measured by standard tests, a positive coefficient would almost certainly have been found. But the scores for quality of work were the teacher's marks. These marks are strictly relative, which fact effectively covers up any difference in the efficiency of different teachers.

If the size of any coefficient of correlation in Table 39 is so small as to cast a doubt upon its significance, there is a formula which permits the computation of the reliability of an r. It is

\[ SD_r = \frac{1 - r^2}{\sqrt{N}} \]

where r is the coefficient of correlation whose reliability is sought, and N is the number of pupils used in computing r. The SDr is interpreted like SDM or SDD. If it is desired to know the probability that the true r is not zero or below, the EC may be computed by means of the following formula:

\[ EC = \frac{r}{2.78\,SD_r} \]

Also this EC formula can be used to determine the probability that the true r does not lie below a defined r, or that it does not lie above a defined r. How to use the EC formula for either of these two special purposes has been discussed in connection with its similar use for M or D.

f.
Final Evaluation of Causes by Partial Correlation. The crude correlation coefficients in the first row of Table 39 may not tell the independent influence of each factor upon attendance or vice versa. We could be certain that they show such independent contribution only in case the inter-correlation coefficients between the various factors were all zero. Were they all zero we should know beyond doubt that the correlation between a particular factor and attendance has not been enhanced or diminished as a result of its correlation with some other of the factors listed. Additional evaluation has shown, for example, that the school plant has no intrinsic connection with attendance. It has a slight positive correlation of .07, as shown in Table 39, largely because it is correlated with the teacher, who does have some genuine connection with attendance. That is, all the correlation between school plant and attendance is a borrowed correlation. It is possible for a factor to borrow in this way from all the other factors. The problem of determining the independent correlation of each factor with attendance becomes a problem of stripping from each the correlation it has borrowed from all the other factors. If the borrowing has been small, little will be subtracted from the coefficients shown in the first row of Table 39.

The crude correlation of a factor with attendance is comparable to the crude process previously described of dividing all the pupils into a better-attendance and a poorer-attendance group, and then averaging the distance each group lives from school without making any attempt to equate groups. We have seen how such a procedure tends to lump the various factors together, depending upon the degree of correlation between them.
We have seen, further, that the only way to avoid this confusion of different factors and to determine the independent contribution of each to attendance is to equate the two groups with respect to all the factors except the one under investigation.

Due to the fact that it is difficult to select two groups from the better-attendance and poorer-attendance groups which are exactly equivalent in five different factors, Reavis elected to employ an alternative process which yields comparable results. He used the method of correlation supplemented by partial correlation. The effect of partial correlation coefficients is to show what the correlation would be between, say, attendance and distance if all pupils were of the same age in the same grade, were doing the same quality of work, were under like teachers, were housed in like school plants, and lived in like communities. The crude coefficients in rows 2, 3, 4, 5, and 6 in Table 39 were computed in order to make possible the computation of just such partial correlation coefficients.

The operation of the partial correlation formula has for its goal the following independent, isolated, or partial correlation coefficients:

r12.34567
r13.24567
r14.23567
r15.23467
r16.23457
r17.23456

The figures 1, 2, 3, 4, 5, 6, and 7 refer respectively to attendance, distance, age grade, quality of work, teacher, school plant, and community, as shown in Table 39. The partial correlation coefficient r12.34567 means the correlation between attendance (1) and distance (2) when freed (.) from the influence of age grade (3), quality of work (4), teacher (5), school plant (6), and community (7). The coefficient r13.24567 means the correlation between attendance and age grade when freed from the influence of the five other factors.

The computation of r12.34567 requires the investigator to operate the partial correlation formula over and over again.
Each operation takes out the influence of just one factor. The total process is shown below, in exactly the reverse order in which computations are actually made. Reversing the order makes the principle of the process easier to grasp. The first series of formulae from the bottom removes the influence of 7 from r12, r13, r14, r15, r16, r23, r24, r25, r26, r34, r35, r36, r45, r46, and r56. The next series of formulae removes, in addition, the influence of 6 from r12, r13, r14, r15, r23, r24, r25, r34, r35, and r45. The next series removes, in addition, the influence of 5 from r12, r13, r14, r23, r24, and r34. The next series removes the influence of 4 from r12, r13, and r23. The next series removes the influence of 3 from r12. This leaves r12 purified from the influence of 3, 4, 5, 6, and 7.

\[ r_{12.34567} = \frac{r_{12.4567} - r_{13.4567}\,r_{23.4567}}{\sqrt{1 - r_{13.4567}^2}\,\sqrt{1 - r_{23.4567}^2}} \]

where

\[ r_{12.4567} = \frac{r_{12.567} - r_{14.567}\,r_{24.567}}{\sqrt{1 - r_{14.567}^2}\,\sqrt{1 - r_{24.567}^2}} \]
\[ r_{13.4567} = \frac{r_{13.567} - r_{14.567}\,r_{34.567}}{\sqrt{1 - r_{14.567}^2}\,\sqrt{1 - r_{34.567}^2}} \]
\[ r_{23.4567} = \frac{r_{23.567} - r_{24.567}\,r_{34.567}}{\sqrt{1 - r_{24.567}^2}\,\sqrt{1 - r_{34.567}^2}} \]

where

\[ r_{12.567} = \frac{r_{12.67} - r_{15.67}\,r_{25.67}}{\sqrt{1 - r_{15.67}^2}\,\sqrt{1 - r_{25.67}^2}} \]
\[ r_{14.567} = \frac{r_{14.67} - r_{15.67}\,r_{45.67}}{\sqrt{1 - r_{15.67}^2}\,\sqrt{1 - r_{45.67}^2}} \]
\[ r_{24.567} = \frac{r_{24.67} - r_{25.67}\,r_{45.67}}{\sqrt{1 - r_{25.67}^2}\,\sqrt{1 - r_{45.67}^2}} \]
\[ r_{13.567} = \frac{r_{13.67} - r_{15.67}\,r_{35.67}}{\sqrt{1 - r_{15.67}^2}\,\sqrt{1 - r_{35.67}^2}} \]
\[ r_{34.567} = \frac{r_{34.67} - r_{35.67}\,r_{45.67}}{\sqrt{1 - r_{35.67}^2}\,\sqrt{1 - r_{45.67}^2}} \]
\[ r_{23.567} = \frac{r_{23.67} - r_{25.67}\,r_{35.67}}{\sqrt{1 - r_{25.67}^2}\,\sqrt{1 - r_{35.67}^2}} \]

where

\[ r_{12.67} = \frac{r_{12.7} - r_{16.7}\,r_{26.7}}{\sqrt{1 - r_{16.7}^2}\,\sqrt{1 - r_{26.7}^2}} \]
\[ r_{15.67} = \frac{r_{15.7} - r_{16.7}\,r_{56.7}}{\sqrt{1 - r_{16.7}^2}\,\sqrt{1 - r_{56.7}^2}} \]
\[ r_{25.67} = \frac{r_{25.7} - r_{26.7}\,r_{56.7}}{\sqrt{1 - r_{26.7}^2}\,\sqrt{1 - r_{56.7}^2}} \]
\[ r_{14.67} = \frac{r_{14.7} - r_{16.7}\,r_{46.7}}{\sqrt{1 - r_{16.7}^2}\,\sqrt{1 - r_{46.7}^2}} \]
\[ r_{45.67} = \frac{r_{45.7} - r_{46.7}\,r_{56.7}}{\sqrt{1 - r_{46.7}^2}\,\sqrt{1 - r_{56.7}^2}} \]
\[ r_{24.67} = \frac{r_{24.7} - r_{26.7}\,r_{46.7}}{\sqrt{1 - r_{26.7}^2}\,\sqrt{1 - r_{46.7}^2}} \]
\[ r_{13.67} = \frac{r_{13.7} - r_{16.7}\,r_{36.7}}{\sqrt{1 - r_{16.7}^2}\,\sqrt{1 - r_{36.7}^2}} \]
\[ r_{35.67} = \frac{r_{35.7} - r_{36.7}\,r_{56.7}}{\sqrt{1 - r_{36.7}^2}\,\sqrt{1 - r_{56.7}^2}} \]
\[ r_{34.67} = \frac{r_{34.7} - r_{36.7}\,r_{46.7}}{\sqrt{1 - r_{36.7}^2}\,\sqrt{1 - r_{46.7}^2}} \]
\[ r_{23.67} = \frac{r_{23.7} - r_{26.7}\,r_{36.7}}{\sqrt{1 - r_{26.7}^2}\,\sqrt{1 - r_{36.7}^2}} \]

where

\[ r_{12.7} = \frac{r_{12} - r_{17}\,r_{27}}{\sqrt{1 - r_{17}^2}\,\sqrt{1 - r_{27}^2}} \]
\[ r_{16.7} = \frac{r_{16} - r_{17}\,r_{67}}{\sqrt{1 - r_{17}^2}\,\sqrt{1 - r_{67}^2}} \]
\[ r_{26.7} = \frac{r_{26} - r_{27}\,r_{67}}{\sqrt{1 - r_{27}^2}\,\sqrt{1 - r_{67}^2}} \]
\[ r_{15.7} = \frac{r_{15} - r_{17}\,r_{57}}{\sqrt{1 - r_{17}^2}\,\sqrt{1 - r_{57}^2}} \]
\[ r_{56.7} = \frac{r_{56} - r_{57}\,r_{67}}{\sqrt{1 - r_{57}^2}\,\sqrt{1 - r_{67}^2}} \]
\[ r_{25.7} = \frac{r_{25} - r_{27}\,r_{57}}{\sqrt{1 - r_{27}^2}\,\sqrt{1 - r_{57}^2}} \]
\[ r_{14.7} = \frac{r_{14} - r_{17}\,r_{47}}{\sqrt{1 - r_{17}^2}\,\sqrt{1 - r_{47}^2}} \]
\[ r_{46.7} = \frac{r_{46} - r_{47}\,r_{67}}{\sqrt{1 - r_{47}^2}\,\sqrt{1 - r_{67}^2}} \]
\[ r_{45.7} = \frac{r_{45} - r_{47}\,r_{57}}{\sqrt{1 - r_{47}^2}\,\sqrt{1 - r_{57}^2}} \]
\[ r_{24.7} = \frac{r_{24} - r_{27}\,r_{47}}{\sqrt{1 - r_{27}^2}\,\sqrt{1 - r_{47}^2}} \]
\[ r_{13.7} = \frac{r_{13} - r_{17}\,r_{37}}{\sqrt{1 - r_{17}^2}\,\sqrt{1 - r_{37}^2}} \]
\[ r_{36.7} = \frac{r_{36} - r_{37}\,r_{67}}{\sqrt{1 - r_{37}^2}\,\sqrt{1 - r_{67}^2}} \]
\[ r_{35.7} = \frac{r_{35} - r_{37}\,r_{57}}{\sqrt{1 - r_{37}^2}\,\sqrt{1 - r_{57}^2}} \]
\[ r_{34.7} = \frac{r_{34} - r_{37}\,r_{47}}{\sqrt{1 - r_{37}^2}\,\sqrt{1 - r_{47}^2}} \]
\[ r_{23.7} = \frac{r_{23} - r_{27}\,r_{37}}{\sqrt{1 - r_{27}^2}\,\sqrt{1 - r_{37}^2}} \]

Beginning at the bottom of the foregoing series of formulae, the coefficients of correlation from Table 39 should be substituted in the first computation series of formulae. As soon as these first partials have been computed, data will be available for substitution in the second computation series. The computation climb may thus be continued until r12.34567 has been determined.

Once the process has been completed and the size of r12.34567 has been determined, the investigator will have to construct a similar series of formulae and compute r13.24567. Since the principle for the construction of each of the six needed series is identical with that for the first series, the other five series need not be given here. Furthermore, an investigator who is concerned with a larger or smaller number of factors than six should have no difficulty in extending this series to provide for a larger number of factors, or in omitting the upper superfluous portion of this series in case of a smaller number of factors.
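The repeated operation of the partial correlation formula lends itself to a short recursive routine. The Python sketch below removes one factor per call, in the same order as the series above, and applies the formula to the coefficients of Table 39. The function name `partial_r`, the dictionary layout, and the use of caching are illustrative choices and are not part of Reavis's procedure.

```python
from functools import lru_cache
from math import sqrt

# Upper triangle of Table 39: 1 = attendance, 2 = distance, 3 = age grade,
# 4 = quality of work, 5 = teacher, 6 = school plant, 7 = community.
TABLE_39 = {
    (1, 2): -.45, (1, 3): .50, (1, 4): .33, (1, 5): .16, (1, 6): .07, (1, 7): .30,
    (2, 3): -.20, (2, 4): -.13, (2, 5): -.10, (2, 6): -.06, (2, 7): .02,
    (3, 4): .24, (3, 5): .01, (3, 6): .08, (3, 7): .08,
    (4, 5): .00, (4, 6): .08, (4, 7): .03,
    (5, 6): .25, (5, 7): .35,
    (6, 7): .17,
}


@lru_cache(maxsize=None)
def partial_r(i, j, controls=()):
    """Correlation of factors i and j freed from the factors in `controls`."""
    if not controls:
        return TABLE_39[(min(i, j), max(i, j))]
    # Remove the first listed factor last, so (3, 4, 5, 6, 7) takes out
    # 7 first and 3 last, as in the series of formulae in the text.
    k, rest = controls[0], controls[1:]
    r_ij = partial_r(i, j, rest)
    r_ik = partial_r(i, k, rest)
    r_jk = partial_r(j, k, rest)
    return (r_ij - r_ik * r_jk) / (sqrt(1 - r_ik ** 2) * sqrt(1 - r_jk ** 2))


# r12.34567: attendance and distance freed from the five other factors.
r12_34567 = partial_r(1, 2, (3, 4, 5, 6, 7))

# One partial coefficient per hypothetical cause.
partials = {
    j: partial_r(1, j, tuple(k for k in range(2, 8) if k != j))
    for j in range(2, 8)
}
for j, value in partials.items():
    print(f"r1{j} freed from the other five factors: {value:+.2f}")
```

The order in which the factors are removed does not affect the result, so the order is a matter of bookkeeping only. The printed values may be compared with the partial coefficients that Reavis reports, allowing for rounding in the published coefficients.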
By operating these formulae in six such series, Reavis isolated each of the six factors and determined its independent contribution to attendance. That is, he determined the significance of the distance pupils live from school, regardless of the grades they are in, the quality of the work they do, the kind of teachers they have, the character of the school plants, or the type of community in which they live. Similarly, he determined the independent correlation of each factor regardless, not of all conceivable factors, nor even of all factors studied, but of the six other factors which appeared to be most significant and hence most needful to be partialled out.

The final partial coefficients, as computed by Reavis, are given in Table 40. For purposes of comparison the partials are preceded by the original crude coefficients.

TABLE 40
ORIGINAL AND PARTIAL COEFFICIENTS OF CORRELATION BETWEEN ATTENDANCE AND SIX HYPOTHETICAL CAUSES (ADAPTED FROM REAVIS)

Causes                  Distance   Age        Quality    Teacher    School     Community
                                   Grade      of Work               Plant
Attendance, Original     -.45       .50        .33        .16        .07        .30
Attendance, Partial      -.43       .44        .25        .08       -.01        .28

Distance and community suffered the least reduction. The teacher appears to have little to do with attendance, and the school plant has nothing to do with it. The outstanding determiners of attendance are distance and age-grade relation. The quality of school work and type of community come next and are about equal in their influence. But the reader should remember that the purpose of this chapter is to describe a process rather than to present results. Final conclusion as to the significance of these factors should take into consideration Reavis's results for the two other age sub-groups. To do so would alter somewhat the conclusions just stated.

As has been stated already, correlation does not imply causation. But partial correlation does imply causation in so far as all significant factors are partialled out.
But partial correlation does not show which is cause and which effect. This must be decided from non-statistical considerations. Such considerations lead to the conclusions that distance, age-grade relation, teacher, and community are clearly causes rather than effects of attendance. Each of these factors was determined at the beginning of the year in which the attendance records were secured. On the other hand it seems much more probable that quality of work partly influences attendance and is partly influenced by attendance, i.e., it is both cause and effect.

g. Regression Equation. No further step is required to satisfy the purpose of a causal investigation. But the computation of partial correlation coefficients makes possible an additional step, familiarity with which is important not only for the causal investigator but also for those who construct tests. This next step is the derivation of a regression equation or prophecy equation.

The simplest form of prophecy is where a pupil's score in one trait is prophesied from a knowledge of his score in one other trait. Since this sort of situation demands only ordinary correlation and the simplest form of regression equation, it makes a good starting point for the explanation of a situation which demands partial correlation and a complicated regression equation.

Suppose that the problem is to secure the best prophecy as to a pupil's attendance based on knowledge of his distance from school. Assume the correlation between attendance and distance to be as shown in Table 37. The regression equation for this purpose is:

\[ x = r\,\frac{SD_x}{SD_y}\,y \]

As shown at the bottom of Table 37, r = −.81,

\[ SD_x = \sqrt{\frac{Sx^2}{N} - c_x^2} = \sqrt{\frac{24105}{25} - (0.2)^2} \]

and

\[ SD_y = \sqrt{\frac{Sy^2}{N} - c_y^2} \]

Assume that the pupil's distance score is known to be 1.5. Then y is the difference between 1.5 and the M of 2.0; y = −0.5.
This pupil's most probable position in attendance may be found by substituting the preceding values in the above formula, thus:

\[ x = r\,\frac{SD_x}{SD_y}\,y = (-.81)\,\frac{SD_x}{SD_y}\,(-0.5) = 10.8 \]

Since M for attendance is 52.2, the pupil's most probable score in attendance is then 52.2 + 10.8, i.e., 63.0. In like manner any y can be transmuted into a most probable x.

In case x is known and the problem is to prophesy y, the regression equation becomes:

\[ y = r\,\frac{SD_y}{SD_x}\,x \]

By means of the first of these two regression equations, it is possible for an experimenter to build up a table for transmuting y values into x values, so that subsequent workers will need to determine only the value of y for each pupil. By using the second equation, he can construct a table for transmuting x values into y values. At this point it should be pointed out that one table will not suffice for transmuting x values into y values, and y values into x values. Two tables are required.

When the problem is to prophesy a pupil's position in x, say, attendance, from knowledge of his scores in y, z, etc., say, distance, age-grade relation, quality of work, etc., partial correlation is required. The regression equation combines the pupil's scores on the various factors, weighting each score according to the partial correlation of that factor with the criterion, namely, attendance. If the problem is to prophesy a pupil's intelligence from several tests of this trait, the regression equation combines a pupil's scores on the several tests, weighting each test according to its partial correlation with some criterion of intelligence, whether the criterion be some standard intelligence test, or teacher's judgment, or age-grade relation, or something else, or a combination of these to constitute a criterion. Thus, the regression equation will combine any number of elements and weight them so as to yield composite scores which will correspond as closely as possible, considering the elements used, with some criterion.
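Returning to the two-variable case, the two regression equations can be sketched in Python as below. The attendance and distance figures are hypothetical (they are not the Table 37 data) and the helper names are illustrative. The sketch is meant only to show that prophesying x from y and prophesying y from x use different slopes, which is why two tables are required.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Standard deviation about the mean (divisor N, as in the text)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / (len(xs) * sd(xs) * sd(ys))


def prophesy_x_from_y(y_score, xs, ys):
    """Most probable x score: M_x + r * (SD_x / SD_y) * y."""
    y = y_score - mean(ys)
    return mean(xs) + pearson_r(xs, ys) * sd(xs) / sd(ys) * y


def prophesy_y_from_x(x_score, xs, ys):
    """Most probable y score: M_y + r * (SD_y / SD_x) * x."""
    x = x_score - mean(xs)
    return mean(ys) + pearson_r(xs, ys) * sd(ys) / sd(xs) * x


# Hypothetical scores for five pupils (not taken from the book).
attendance = [90, 75, 60, 40, 20]       # x
distance = [0.5, 1.0, 2.0, 3.0, 4.0]    # y

predicted_attendance = prophesy_x_from_y(1.5, attendance, distance)
# Feeding the prophesied x back through the other equation does not
# return the original 1.5 unless r is exactly +1 or -1.
back_to_distance = prophesy_y_from_x(predicted_attendance, attendance, distance)

print(round(predicted_attendance, 1), round(back_to_distance, 3))
```

The round trip shrinks the original deviation by a factor of r², so the two equations are not inverses of each other except when the correlation is perfect.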
All that is needed to make such an equation possible is the partial correlation of each element with the criterion and certain measures of variability, as shown in the following formula. This formula is the regression equation for attendance, i.e., it combines and weights the scores on the various factors so as to yield the most accurate possible score in attendance from a combination of these six factors.

\[ x_1 = r_{12.34567}\frac{SD_{1.234567}}{SD_{2.134567}}\,x_2 + r_{13.24567}\frac{SD_{1.234567}}{SD_{3.124567}}\,x_3 + r_{14.23567}\frac{SD_{1.234567}}{SD_{4.123567}}\,x_4 + r_{15.23467}\frac{SD_{1.234567}}{SD_{5.123467}}\,x_5 + r_{16.23457}\frac{SD_{1.234567}}{SD_{6.123457}}\,x_6 + r_{17.23456}\frac{SD_{1.234567}}{SD_{7.123456}}\,x_7 \]

where x1 is the deviation of the pupil's score from the mean of the attendance records, and is determined by the solution of the formula, x2 is the deviation of the pupil's score from the mean of the scores in distance, x3 is the deviation of the pupil's score from the mean of the age-grade relation, and so on for x4, x5, x6, and x7, where x2, x3, x4, x5, x6, and x7 are known, and where

\[ SD_{1.234567} = SD_1 \sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13.2}^2}\,\sqrt{1 - r_{14.23}^2}\,\sqrt{1 - r_{15.234}^2}\,\sqrt{1 - r_{16.2345}^2}\,\sqrt{1 - r_{17.23456}^2} \]
\[ SD_{2.134567} = SD_2 \sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23.1}^2}\,\sqrt{1 - r_{24.13}^2}\,\sqrt{1 - r_{25.134}^2}\,\sqrt{1 - r_{26.1345}^2}\,\sqrt{1 - r_{27.13456}^2} \]
\[ SD_{3.124567} = SD_3 \sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23.1}^2}\,\sqrt{1 - r_{34.12}^2}\,\sqrt{1 - r_{35.124}^2}\,\sqrt{1 - r_{36.1245}^2}\,\sqrt{1 - r_{37.12456}^2} \]
\[ SD_{4.123567} = SD_4 \sqrt{1 - r_{14}^2}\,\sqrt{1 - r_{24.1}^2}\,\sqrt{1 - r_{34.12}^2}\,\sqrt{1 - r_{45.123}^2}\,\sqrt{1 - r_{46.1235}^2}\,\sqrt{1 - r_{47.12356}^2} \]
\[ SD_{5.123467} = SD_5 \sqrt{1 - r_{15}^2}\,\sqrt{1 - r_{25.1}^2}\,\sqrt{1 - r_{35.12}^2}\,\sqrt{1 - r_{45.123}^2}\,\sqrt{1 - r_{56.1234}^2}\,\sqrt{1 - r_{57.12346}^2} \]
\[ SD_{6.123457} = SD_6 \sqrt{1 - r_{16}^2}\,\sqrt{1 - r_{26.1}^2}\,\sqrt{1 - r_{36.12}^2}\,\sqrt{1 - r_{46.123}^2}\,\sqrt{1 - r_{56.1234}^2}\,\sqrt{1 - r_{67.12345}^2} \]
\[ SD_{7.123456} = SD_7 \sqrt{1 - r_{17}^2}\,\sqrt{1 - r_{27.1}^2}\,\sqrt{1 - r_{37.12}^2}\,\sqrt{1 - r_{47.123}^2}\,\sqrt{1 - r_{57.1234}^2}\,\sqrt{1 - r_{67.12345}^2} \]

To illustrate the evolution and use of a regression equation in a simple situation, assume that the problem is to prophesy a pupil's position in 1 from a knowledge of his position in 2 and 3. Stated in another way, assume that the problem is to combine the scores on 2 and 3 so that the resulting score will be the best possible in 1 which 2 and 3 can yield. Assume that

1 = Intelligence as measured by the Stanford or Herring Revision of the Binet-Simon Intelligence Scale,
2 = Comprehension score on the Thorndike-McCall Reading Scale, and
3 = Minutes spent on the Thorndike-McCall Reading Scale divided by the comprehension score.

Assume further that

r12 = .80     SD1 = 4.42    M1 = 120
r13 = -.40    SD2 = 1.1     M2 = 50
r23 = -.56    SD3 = .85     M3 = 15

Then the regression equation is

\[ x_1 = r_{12.3}\frac{SD_{1.23}}{SD_{2.13}}\,x_2 + r_{13.2}\frac{SD_{1.23}}{SD_{3.12}}\,x_3 \]

Utilizing the assumed data to compute the required values in the regression equation, we have

\[ r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} = \frac{.80 - (-.40)(-.56)}{\sqrt{1 - (-.40)^2}\,\sqrt{1 - (-.56)^2}} = .76 \]
\[ r_{13.2} = \frac{r_{13} - r_{12}\,r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}} = \frac{-.40 - (.80)(-.56)}{\sqrt{1 - (.80)^2}\,\sqrt{1 - (-.56)^2}} = .10 \]
\[ r_{23.1} = \frac{r_{23} - r_{12}\,r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}} = \frac{-.56 - (.80)(-.40)}{\sqrt{1 - (.80)^2}\,\sqrt{1 - (-.40)^2}} = -.44 \]
\[ SD_{1.23} = SD_1 \sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13.2}^2} = 4.42 \sqrt{1 - (.80)^2}\,\sqrt{1 - (.10)^2} = 2.63 \]
\[ SD_{2.13} = SD_2 \sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23.1}^2} = 1.1 \sqrt{1 - (.80)^2}\,\sqrt{1 - (-.44)^2} = .59 \]
\[ SD_{3.12} = SD_3 \sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23.1}^2} = .85 \sqrt{1 - (-.40)^2}\,\sqrt{1 - (-.44)^2} = .70 \]

Substituting the computed values in the regression equation, we have

\[ x_1 = .76\,\frac{2.63}{.59}\,x_2 + .10\,\frac{2.63}{.70}\,x_3 = 3.39\,x_2 + .38\,x_3 \]

Now if a pupil's score in 2 is 53, x2 = 53 − 50 = 3, since M2 is 50. If his score in 3 is 14, x3 = 14 − 15 = −1, since M3 is 15. Substituting x2 and x3 in the preceding equation

\[ x_1 = 3.39(3) + .38(-1) = 9.79 \]

The 9.79 shows that the pupil's deviation from M1 is a plus 9.79. Since M1 is 120, the pupil's score in 1 becomes 120 + 9.79, i.e., 129.79.
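The arithmetic of this illustration can be checked with a short Python sketch. It uses r12 = .80, r13 = −.40, r23 = −.56, SD1 = 4.42, and the three means stated in the text. The values SD2 = 1.1 and SD3 = .85 are taken here as assumptions consistent with the computed SD2.13 = .59 and SD3.12 = .70. Variable and function names are illustrative. Because the code carries full precision while the text rounds each intermediate value to two places, its final figure differs slightly from 129.79.

```python
from math import sqrt


def first_order_partial(r_ab, r_ac, r_bc):
    """Correlation of a and b freed from the influence of c."""
    return (r_ab - r_ac * r_bc) / (sqrt(1 - r_ac ** 2) * sqrt(1 - r_bc ** 2))


# Assumed data of the illustration (SD2 and SD3 are assumptions, see above).
r12, r13, r23 = .80, -.40, -.56
sd1, sd2, sd3 = 4.42, 1.1, .85
m1, m2, m3 = 120, 50, 15

r12_3 = first_order_partial(r12, r13, r23)
r13_2 = first_order_partial(r13, r12, r23)
r23_1 = first_order_partial(r23, r12, r13)

sd1_23 = sd1 * sqrt(1 - r12 ** 2) * sqrt(1 - r13_2 ** 2)
sd2_13 = sd2 * sqrt(1 - r12 ** 2) * sqrt(1 - r23_1 ** 2)
sd3_12 = sd3 * sqrt(1 - r13 ** 2) * sqrt(1 - r23_1 ** 2)

# Weights of the regression equation x1 = b2 * x2 + b3 * x3.
b2 = r12_3 * sd1_23 / sd2_13
b3 = r13_2 * sd1_23 / sd3_12


def prophesy_score_1(score_2, score_3):
    """Most probable score in 1 from raw scores in 2 and 3."""
    return m1 + b2 * (score_2 - m2) + b3 * (score_3 - m3)


predicted = prophesy_score_1(53, 14)
print(f"b2 = {b2:.2f}, b3 = {b3:.2f}, predicted score in 1 = {predicted:.2f}")
```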
CHAPTER X
ANALYSES OF EXPERIMENTAL AND CAUSAL INVESTIGATIONS

The principles and procedures formulated in the preceding chapters had to be confined necessarily to the more common types of experiments and investigations. Furthermore, the progress of discussion permitted only a limited use of concrete illustrations. The purpose of this closing chapter is twofold, (a) to show the applicability of these principles and procedures to many specific experimental problems and problems for causal investigation, and (b) to suggest a method of attack upon relatively uncommon varieties of problems.

The problems used are taken more or less at random from a large number submitted from time to time by graduate students. No special effort has been made to make these analyses complete. Space would not permit, nor has an effort been made to make them model analyses. This would require not only a long period of concentrated thinking about each problem but also an actual trial of each experiment to check the thinking done. All that is attempted is to draw up for each problem a rough plan for its solution, in order to point out to the reader the general line of attack.

PROBLEM 1. Do Rural Children Learn More Rapidly in Consolidated Schools or in One-room Schools?

EF1 is a consolidated school. EF2 is a one-room school. S is a group or groups of rural pupils.

This problem may be solved as an equivalent-groups experiment very simply but with some delay, or it may be solved without delay by an equivalent-groups causal investigation. Since an experiment always gives the experimenter more complete control of the situation than does a causal investigation, let us assume that this advantage outweighs the disadvantage of a year's delay, and that the problem is to be solved by an equivalent-groups experiment.

The chief problem is to secure genuine equivalence of groups.
Pupils should be paired on two bases, at least, namely, mental age and chronological age. Having selected two equivalent groups, or else having delayed selection until the conclusion of the experiment, a series of IT's or standard tests of school abilities should be applied. At the close of the year these tests or duplicates of them should be applied as FT's. The data from these tests can be fitted into one of the computation models provided in a preceding chapter. For purposes of computation, all the pupils can be treated together as two equivalent groups or else the two main groups may be broken up into age sub-groups or grade sub-groups, or they may be treated both ways.

PROBLEM 2. Effect of Exemption from Class Drill in Penmanship when Pupils Attain Quality 12 on the Thorndike Handwriting Scale Compared with the Effect of Continuance in Class Drill.

EF1 is exemption from class drill in penmanship of those pupils who attain quality 12 on the Thorndike Handwriting Scale. EF2 is the continuance in class drill, or the absence of such exemption. The experimental group (S) is not indicated, though the effectiveness of EF1 is likely to vary with the distance the ability of S is from quality 12. The implication of the student's formulation is that S has an ability below quality 12. The conclusion from the experiment should be stated in terms of whatever S is employed.

Since the purpose of this experiment is merely to determine the amount of superiority of one EF over the other, no control EF is required and only the less stringent criteria for selecting the experimental method need be considered. The one-group method is not entirely satisfactory, because: (a) Even apart from any difference in the effectiveness of EF's, the amount of change under one EF will not be identical with the amount of change under the other EF.
Even under identical conditions the rate of progress in penmanship as measured by available tests usually shows a slowing up as progress proceeds. To date, no progress scales have been constructed which demonstrably discount this retardation. (b) There is some danger that there will be a significant carry-over from one EF to the other, particularly if the exemption-from-drill EF precedes the continuance-in-drill EF. (c) The one-group method is more than unsatisfactory; it is completely impossible if the change in S is determined by measuring the amount of time required to attain quality 12. Just as soon as one EF had brought S to quality 12 there would be no opportunity to determine the effect of the other EF, because S would already be at quality 12. All this means the equivalent-groups method is the best one for this problem.

The change (C) produced by each EF can be measured by the per cent of pupils in each group who attain quality 12, as measured by the Thorndike Handwriting Scale, during the period of the experiment. The experiment can be stopped when, say, 50% or 85% of the leading group has attained quality 12. This per cent can be compared with the per cent of the other group who have attained quality 12. This method of measurement is objectionable because it does not yield a score for each pupil. It yields a score for the group as a whole. This does not permit the computation of SD, SDM, and SDD, and hence does not permit any statement of the reliability of the conclusion.

The C can be measured by the total number of points of growth on the scale during the period of the experiment. There is a fatal objection to this plan. The EF1 pupils are excused from handwriting instruction when they attain quality 12, and are thereby and thereafter encouraged to spend the handwriting drill period in more congenial ways. But no EF2 pupil who attains quality 12 is so excused.
Measuring C by points of growth definitely discriminates against EF1.

The C can be measured by the length of time required by each pupil to attain quality 12. A serious objection to this plan is that it requires the experiment to continue until every pupil of both groups, even the slowest, has attained quality 12. Certain pupils in the group may never attain this level. Except for this practical objection the method is quite satisfactory. If all pupils are within an easy distance of ability 12, this objection disappears.

Again, the C can be measured by determining the amount of growth per unit of time. Suppose the first EF1 pupil to attain quality 12 does so in one month from the beginning of the experiment. To avoid disappointing pupils the experiment will have to continue, but for purposes of computation the experiment can stop at that point. The points of growth made by each and all pupils in each group in one month show the relative effectiveness of each EF. The IT1 here may be assumed to be approximately zero for each pupil. The FT1 is the points of growth in a month. The C is then identical with FT1. Further computations follow the computation models already given.

It is advisable for the experimenter to check the measuring method just recommended by a related method. He can permit the experiment to continue until most or perhaps all of the EF1 pupils have reached quality 12. The instant that an EF1 pupil reaches quality 12, the experimenter should determine and record the attainment of the EF2 pupil who is paired with the EF1 pupil. By dividing the points of growth from the initial starting point up to 12 by the number of days required to attain 12, the growth per day can be determined for each EF1 pupil who attains quality 12 during the period of the experiment.
By dividing the points of growth of each EF2 pupil, up to the time his EF1 pair reached quality 12, by the number of days required by his EF1 pair to attain quality 12, measures comparable with the foregoing EF1 measures can be secured for the EF2 pupils who pair with EF1 pupils attaining quality 12. Quite satisfactory and comparable measures can be secured for each EF1 pupil who fails to attain quality 12 and for his EF2 pair by dividing the points of growth made by each during the whole time of the experiment by the number of days in the experiment.

This method of measuring C is suggested as a check upon the preceding one, because there is some possibility that as EF1 pupils approach their goal they are stimulated to added zeal. To stop the experiment as soon as the first EF1 pupil attains the goal means that only a few pupils have come within the sway of this possible facilitating effect. This last method gives all the pupils a chance to feel its effect, in case such an effect exists. And in order to make results entirely comparable an EF2 pupil is stopped, for computation purposes at least, at the same instant that his EF1 pair stops. For purposes of fitting these data in the computation model, assume IT1 to be zero, and FT1 to be the above scores.

The careful experimenter will not be satisfied to measure quality of handwriting only. As a minimum he will determine, in similar manner, the effect of each EF upon speed of handwriting.

PROBLEM 3. What Is the Effect of the Spirit of a Class on Its Achievement?

EF1 is a spirit of enjoyment, hopefulness, cooperation and the like in a class. EF2 is the opposite sort of spirit. There could be other EF’s representing varying degrees or varieties of spirit.

The one-group or rotation method may be employed provided the period for each EF does not last more than a few days.
A longer period might fix certain attitudes which will transfer to the succeeding EF. Even when the period is brief some transfer is doubtless unavoidable. If the teacher or other agent generates a pleasant spirit, this will tend to aid the succeeding EF. If the unpleasant spirit precedes, it will tend to subtract from the succeeding EF.

Probably the best method of all is the equivalent-groups method, where S1 and S2 are two equivalent classes. This method does not require a brief application of each EF.

Both IT’s and FT’s for both groups are needed. These achievement tests will need to cover the abilities being developed while the EF’s are operating. The differences between the M’s of the two C’s in each achievement test give the conclusions from the experiment.

PROBLEM 4. Are Nature and Object Drawing and Painting Fundamental to Improve Taste in Selection of Environment, or Are the Principles of Design and Color the Basis for This Response?

EF1 is nature and object drawing and painting. EF2 is principles of design and color.

The one-group and rotation methods are inappropriate because of probable carry-over, so the equivalent-groups method must be employed.

The S is a group of pupils improvable in their taste in selection of environment, and not yet trained in either EF1 or EF2.

Both S1 and S2 should be given an IT to determine initial taste in selection of environment. S1 should have EF1 applied. S2 should have EF2 applied. Both should then be given an FT. The difference between the M’s of the two C’s will show which EF contributes more toward a development of taste in selection of the environment.

PROBLEM 5. Which Is Better for Pupil Growth, a Temperature of 68 degrees and a Humidity of 50 per cent, or a Temperature of 86 degrees and a Humidity of 80 per cent?

EF1 is a temperature of 68 degrees and a humidity of 50 per cent. EF2 is a temperature of 86 degrees and a humidity of 80 per cent.
Either the rotation or equivalent-groups method may be employed, though the rotation method is preferable perhaps. S1 can be subjected to EF1 and then to EF2. S2 can be subjected to EF2 first, and then to EF1. The length of time each EF is applied should be the same for all four periods, and will depend upon the nature of the tests used. If the tests are of traits growth in which is very rapid, each EF may be applied for a brief time.

Several test types covering the work of the pupils will be needed. Both IT and FT should be given. These may be tests of general reading ability, arithmetical ability, spelling ability, and the like. In this case, the experiment will need to continue for a considerable period. Or the tests may be based upon the specific lessons being taught. In this case, growth will be rapid, and the experiment, if desired, may be brief.

The computation will follow the regular rotation computation model for two EF’s and several test types.

PROBLEM 6. To Determine the Effect on the Mastery of English of Teaching Technical Grammar from the Fourth to the Eighth Grade.

EF1 is the teaching of technical grammar from the fourth to the eighth grade. EF2 is the absence of such technical grammar and presumably the presence of other forms of ordinary English instruction instead.

The equivalent-groups method is required. The formulation of the problem does not make it clear whether there are to be five sub-groups (fourth, fifth, sixth, seventh, and eighth grades) with equivalent sub-groups, or whether there are to be two equivalent fourth grades each of which is to have its EF applied for five years in succession.

In either case IT’s and FT’s of English ability are required. A computation model has been provided for either form of experiment.

PROBLEM 7. To Determine the Relation of Physical Efficiency to School Progress.

EF1 is physical efficiency of a defined amount.
EF2 is physical inefficiency of a defined amount. A variety of EF’s representing different degrees of physical efficiency might be employed.

The equivalent-groups method is appropriate to this problem. Both groups may start below par physically, or at any stage short of a physical condition which is at the limit of possible improvement. S1 will have its physical efficiency improved by careful attention to diet, etc. S2 will continue on the same physical level.

Both IT’s and FT’s are needed, covering abilities growth in which constitutes school progress. The difference between the M’s of C1 and C2 shows the effect of improved physical efficiency.

This problem may be interpreted to mean: Does physical efficiency facilitate school progress? Or it may be interpreted to mean: Are physical efficiency and school progress associated or correlated? If the latter is the problem, the one-group method is the only satisfactory experimental plan. EF1 is the physical efficiency of the pupil in the best physical condition; EF2, EF3, EF4, etc., are the physical conditions of the pupils who are second, third, fourth, and so on, respectively, in physical condition. Each pupil should be measured in both physical efficiency and past school progress. The correlation between these two series of measures is the answer to the problem, for this correlation shows the relationship between various physical conditions and corresponding amounts of school progress. Interpretation is facilitated if only those pupils are used whose present physical condition has been about the same throughout the school career of the pupils.

One difficulty with the foregoing is that positive correlation may not indicate a genuine relationship between physical efficiency and school progress. It may be that those selected as more fit are also more intelligent, and that it is intelligence rather than physical fitness which is responsible for the correlation.
This possibility may be investigated by equating the fit and the unfit with respect to intelligence, by using only those pupils of like intelligence, or by partial correlation.

PROBLEM 8. What Effect Has Previous Training in Typewriting upon Speed and Accuracy in Learning to Use a Comptometer?

The EF1 is learning to compute with a comptometer plus previous training in typewriting. The EF2 is learning to compute with a comptometer when there has been no previous training in typewriting.

The one-group method cannot be used because, if for no other reason, there will be a carry-over from one EF to the other. For this same reason the rotation method cannot be employed. The equivalent-groups method is appropriate. S1 should have previous training in typewriting. S2 should lack such previous training but should be equivalent in all other respects. No additional control S is required. A unique feature of this experiment is that one group is both an S2 and a control S at the same time, for C1 minus C2 shows the exact effect of previous training in typewriting upon learning to use a comptometer.

S1 and S2 are not defined by the problem. The inference is that they are two groups of clerical students.

IT1, FT1, IT2, and FT2 are required both for speed and accuracy in computing with the comptometer. In case both S’s have had no experience at all with the comptometer both IT1 and IT2 may be assumed to be zero.

This problem may be solved by either an experiment, or a causal investigation, or half investigation and half experiment. An experimenter finds two appropriate and equivalent groups. To one he gives training in typewriting and follows it with training on a comptometer. To the other he gives no training in typewriting, but begins training them on the comptometer, after a period has elapsed equivalent to that used in giving his typewriting training to the EF1 group.
The causal investigator proceeds backward rather than forward. He locates two groups, both of whom are learning or have learned to operate a comptometer, who are equivalent, except that one has learned typewriting while the other has not. He then investigates their respective records in learning to operate a comptometer. Any differences discovered he attributes to typewriting.

The half-investigator, half-experimenter, locates two groups equivalent in every respect except for typewriting. To these two groups he applies uniform training on the comptometer and measures the progress of each group.

PROBLEM 9. Given Equivalent Groups of Sales Clerks and Clerical Workers, Is There Any Difference Between Them in Type of Memory?

This is a causal investigation. The investigator finds the EF’s applied before he assumes control of the situation. The only thing left for him to do is to apply the FT’s and formulate conclusions.

EF1 is sales clerk, or the inherited or environmental conditions which set sales clerks apart as an occupational group. EF2 is clerical workers or the conditions which selected and differentiated clerical workers as an occupational group.

S1 is a group of sales clerks, who, except for occupational differentiation and its concomitants and consequences, are equivalent to S2. Unless the two groups are allowed to differ in the possible immediate and direct concomitants and consequences of occupational differentiation the whole investigation loses its point, for its very object is to determine whether such concomitants or consequent differences occur. This means that when the two groups are being equated the probable concomitants and consequences should not be among the bases employed for equating.

No IT’s can be given since the EF’s have been applied before the investigator takes control of the situation.
Even if possible, none would be given, because the psychological factors influential in determining ultimate occupational choice may have been present from birth. Hence all that can be done is to apply FT’s to determine whether the type of memory possessed by S2 differs from that possessed by S1.

In an investigation of this sort the investigator should be wary about concluding from any difference in memory revealed that this difference has been produced by the occupation of a sales clerk as distinguished from the occupation of clerical work. The truth may be instead that the difference discovered merely accompanies the occupation, i.e., is caused directly by a fundamental something which is the cause of occupational differentiation. It may be that the difference revealed is itself the cause of the occupational differentiation. In sum, whenever the investigator is presented with a completed experiment he has no assurance as to whether the EF’s or the difference in FT’s came first and hence is the cause, or whether something more fundamental may not be the cause of both. All the investigator can say is that occupational differentiation is or is not associated with memory differentiation.

The FT’s should be tests for various types of memory. No IT’s can be given, but in fitting data into the computation models all IT scores may be assumed to be zero.

PROBLEM 10. Is Complete Understanding Necessary to the Enjoyment of a Piece of Literature?

EF1 is incomplete understanding of a piece of literature. EF2 is presumably complete understanding. Since understanding may vary from complete understanding to complete misunderstanding it will be necessary for the experimenter to define the completeness of EF1 and EF2. He may find it necessary to employ several EF’s of varying degrees of completeness of understanding.

Any one of the several experimental plans promises reasonably satisfactory results.
One plan is to employ the one-group method, to expose S1 to an incompletely understood piece of literature and measure the resulting enjoyment, and then to expose S1 to the same piece of literature after an understanding of it is taught or while an understanding of it is being given and measure the resulting enjoyment. The difference between these two FT’s gives the desired answer. If it is suspected that the conclusion holds only for the particular type and difficulty of the piece of literature employed, the experiment may be repeated with a variety of pieces of literature.

Another plan is to employ the one-group method, to select two pieces of literature which are known to be or may be assumed to be equal in their appeal when both are incompletely understood or completely understood and equally so in both cases. To S1, however, one of these equated pieces of literature is incompletely understood while the other is completely understood. The difference in amount of enjoyment evoked from S1 when these two pieces are presented gives the desired answer. As before, various pairs of specimens may be presented.

Still another plan is to employ equivalent groups. S1 may be exposed to a piece of literature which is incompletely understood and the resulting enjoyment measured. S2 may be exposed to the identical piece of literature after understanding of it has been given or while understanding is being given, and the resulting enjoyment may be measured. As before, various pieces of literature may be used or various degrees of understanding may be imparted.

The rotation method is inappropriate. Incomplete understanding may precede completer understanding without serious carry-over, but to reverse this order of sequence, as required by the rotation method, is impossible.

No IT’s need be given, for the degree of enjoyment of a piece of literature before the S has been exposed to it may be assumed to be zero.
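
Under the one-group plans described for this problem, each pupil supplies two enjoyment scores, one under incomplete and one under complete understanding, and the answer is read from the per-pupil differences. The following is a minimal sketch in Python of that arithmetic. The function name and the scores are hypothetical, and the formulas are the conventional mean difference and its standard error (with the modern n - 1 divisor), which are assumed here and may differ in detail from the computation models of the earlier chapters.

```python
import math

def paired_difference(ft1, ft2):
    """Mean of the per-pupil differences (ft2 - ft1) and the standard error of that mean."""
    if len(ft1) != len(ft2) or len(ft1) < 2:
        raise ValueError("need two equal-length score lists with at least two pupils")
    diffs = [b - a for a, b in zip(ft1, ft2)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample SD of the differences; the n - 1 divisor is an assumption of this sketch.
    sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
    return mean_diff, sd_diff / math.sqrt(n)

# Invented enjoyment ratings for five pupils (IT assumed zero, so each FT is also a C).
incomplete = [4, 5, 3, 6, 5]   # FT under EF1, incomplete understanding
complete = [6, 6, 5, 7, 6]     # FT under EF2, complete understanding

mean_diff, se = paired_difference(incomplete, complete)
print(mean_diff, se)
```

A mean difference that is large relative to its standard error would suggest that the two degrees of understanding yield different enjoyment for this group; how large is large enough is left to the reliability criteria given earlier in the book.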
No little ingenuity will be required to devise a satisfactory test of enjoyment. Any one of many methods may be employed. Subtle physiological indices of enjoyment may be recorded, or the pupils may be asked to choose between a second exposure to the piece of literature in question and other alternatives of reasonably constant and equal appeal, or the pupils may rate the piece of literature in comparison with the enjoyment derived from other common experiences of varying satisfyingness, or a secret record may be kept of the amount of subsequent use made of the piece of literature when it is in the class library, and so on.

PROBLEM 11. What Is the Effect upon Teaching Efficiency and Length of Service in Teaching of a Sabbatical Year for Public School Teachers?

EF1 is a Sabbatical year. EF2 is no Sabbatical year. The one-group method is not appropriate, because the problem assumes that the EF is to be applied throughout the teaching life of the teacher. Also one of the measurements stipulated, namely, length of service, assumes the entire teaching life.

The equivalent-groups method is applicable, and it is the only method which is applicable. S1 is a group of public school teachers to whom EF1 is applied and who are otherwise equal to and under conditions comparable with S2.

Initial, intermediate, and final tests of teaching efficiency are desirable for both S’s. Only FT’s of length of service for both S’s are necessary or possible. The various periodic intermediate tests will reveal whether Sabbatical years have a cumulative effect or a decreasing effect, and whether there comes a time where they no longer contribute to teaching efficiency.

Since few experimenters have the patience or confidence in their own longevity to wait a lifetime for the completion of such an experiment, the investigational rather than the experimental method is likely to be employed.

PROBLEM 12. How Do Individual Scores Obtained on National Intelligence Scale A Compare with Those on Scale B for the Same Pupils?

EF1 is application of National Intelligence Test, Scale A. EF2 is application of Scale B of the same test.

The one-group method is required. There is some transfer from EF1 to EF2 such as practice effect, but this cannot be avoided. It can be largely eliminated by statistical methods.

This experiment is unique in that the EF’s and FT’s are identical. No IT’s are required.

The difference between FT1 and FT2 may be determined by computing the coefficient of correlation between the Scale A and Scale B scores, or by computing the net difference (unreliability) between the two series of scores as was done in Table 13.

Thus this experiment is unique in three ways. The EF’s and FT’s are identical. Transfer from one EF to a succeeding EF is eliminated statistically. Novel methods are suggested for computing the difference between C1 and C2.

PROBLEM 13. What Effect in Securing Order Will a Beautiful Picture Placed in the Front of a Room Have Upon an Unruly Boy Who Loves Art?

EF1 is no picture in front of room. EF2 is a beautiful picture in front of room.

The one-group method or rotation method is the most feasible, owing to the difficulty of equating unruly boys who love art.

Assuming the one-group method, S is an unruly boy who loves art. S has applied to him, in order, IT1 of unruliness, EF1, FT1 of unruliness, EF2, FT2 of unruliness. FT1 may be used as the IT2. This experimental unit may and should be repeated many times to make certain that any differences observed in the C’s are not accidental.

The foregoing experiment is a particularly difficult one to carry through successfully. The influence of the picture, though real, is likely to be so subtle as to have its effects masked by one of a hundred other influences playing upon the pupil.
When S is only one pupil the probability of large changes due to irrelevant influences is especially great.

PROBLEM 14. To Determine the Relation Between Plateaus on the Learning Curve and Recall.

In its present form the problem is so vaguely stated that an analysis of it is impossible. What is really wanted is to know whether pupils who have plateaus in their learning curves are better able to recall or reproduce what is learned at some later date.

EF1 is plateau or plateaus in learning curve. EF2 is a learning curve without plateaus.

This experiment is peculiar in that the experimenter cannot control the application of the EF’s. His only recourse is to have a large group of pupils learn something, to plot their learning curves, to single out those who show a plateau or plateaus in their learning curve, to match them with a group of pupils who show no plateaus in their learning curves but who are otherwise equivalent as shown by tests given prior to the beginning of the experiment, and finally to measure the difference in the ability of these two groups to recall what has been learned.

No IT’s need be given, though it is important to know that the two groups are equivalent in general ability to recall what has been learned. If this is not known, it cannot be said that plateaus have caused the difference in ability to recall. They may be the effect or may merely be associated with a certain recall ability.

Since the purpose of the experiment is to learn whether learning curves plus plateaus cause or are correlated with recall which is superior to that caused by or associated with learning curves minus plateaus, no control EF and S are required.

For purposes of discussion, however, let us suppose that the problem calls for a knowledge of the exact contribution to recall of learning curves plus plateaus, i.e., of learning plus a period or periods of little or no progress.
Still no control EF would be required because the contribution of irrelevant factors to recall will be substantially zero. If the experiment continues over a long period mere maturing might contribute some power of recall. In this case a control EF and S could be used to advantage.

If, however, the purpose of the experiment is to determine the amount of contribution of plateaus rather than learning curves plus plateaus, a control EF, that is, an EF of learning curves with plateaus absent, is required. EF2, above, is just such a control EF. But here is a difficulty. Is EF2 identical with EF1 except for the plateau feature of EF1? Is a plateau merely an addition to a learning curve with a plateau lacking, or is a plateau an integral portion of its curve? If we affirm the latter, then it becomes impossible to isolate and measure the effect of plateaus; we must always measure the effect of plateaus-imbedded-in-learning-curves.

PROBLEM 15. Which Will Give Better Results in Baking, to Put an Angel-food Cake Into a Gas Oven Just Lighted or Into One of Medium Temperature?

EF1 is a just lighted gas oven. EF2 is a gas oven which has reached a medium temperature.

The one-group method or rotation method will not do. Since the S is a set of angel-food cake-dough it could not very well be baked twice. The carry-over will be enormous, to say the least. The equivalent-groups method is required, i.e., two sets of angel-food cake-dough made according to identical recipes, or taken from the same mixture.

The IT’s can be assumed to be zero. The FT’s should be various tests of the appearance, deliciousness, and digestibility of the cake baked according to each of the EF’s.

The only difficulty in this experiment is to identify the S and the EF. It is the cake dough whose change by the two varieties of temperature is of primary concern. The cake dough is to these EF’s just as pupils are to the customary EF’s.
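
The arithmetic of the equivalent-groups plan for this problem can be sketched briefly: with every IT taken as zero, each C equals its FT, and the conclusion rests on the difference between the two M’s. The Python sketch below is illustrative only. The function names and the ratings are hypothetical, and the SD of the difference is computed by the standard formula for two independent groups, which is an assumption of this sketch and not necessarily the form used in the computation models of the earlier chapters.

```python
import math
from statistics import mean, pstdev

def group_change(ft_scores, it_scores=None):
    """Change C for each subject; IT is taken as zero when no initial test exists."""
    if it_scores is None:
        it_scores = [0] * len(ft_scores)
    return [ft - it for ft, it in zip(ft_scores, it_scores)]

def difference_of_means(c1, c2):
    """Return (M1 - M2, SD of that difference), treating the groups as independent."""
    sdm1 = pstdev(c1) / math.sqrt(len(c1))  # SD of the mean, group 1
    sdm2 = pstdev(c2) / math.sqrt(len(c2))  # SD of the mean, group 2
    return mean(c1) - mean(c2), math.sqrt(sdm1 ** 2 + sdm2 ** 2)

# Invented judges' ratings of four cakes baked under each EF.
c1 = group_change([7, 8, 6, 7])  # EF1: oven just lighted
c2 = group_change([5, 6, 6, 7])  # EF2: oven at medium temperature

diff, sdd = difference_of_means(c1, c2)
print(diff, sdd)
```

The same two functions would serve any of the problems in this chapter where IT is assumed to be zero; separate runs would be needed for each FT (appearance, deliciousness, digestibility).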
PROBLEM 16. Are Girls More Interested in Learning Manipulative Processes in Junior High School Than in Senior High School?

EF1 is the junior high school age for girls. EF2 is the senior high school age for girls.

Either the one-group or equivalent-groups method may be employed. If the one-group method is employed, a group of junior high school girls should be tested, in some way, as to the strength of their interest in learning manipulative processes. When these same girls have reached the senior high school age they can, then, be tested again to see whether their interest in learning manipulative processes has increased.

If the equivalent-groups method is employed, the experiment becomes essentially an investigation. A group of senior high school girls and another group of junior high school girls should be selected so as to be equivalent, in all respects, except for the senior and junior high school differentiation with all of its concomitant differentiation. Stated more simply, a group of junior high school girls should be so selected that they will be equivalent, when they become senior high school girls, to a previously selected group of present senior high school girls. Each group can be tested for its interest in learning manipulative processes.

The C for each group may be assumed to be the same as the FT. The difference between the M’s of the two series of C’s shows the difference between the EF’s.

PROBLEM 17. Does Observation of Skilled Teaching Aid Normal School Students to Grasp Facts and Principles of Teaching and to Apply Them?

EF1 is observation of skilled teaching. EF2 is the absence of such observation.

Since the one-group and rotation methods cannot be used because of carry-over, the equivalent-groups method is required. One group of normal school students will observe skilled teaching while an equivalent group will forego such observation.
Both IT’s and FT’s covering all or a random sampling of the facts and principles of teaching will need to be constructed and applied to both groups.

All the foregoing is simple enough. The real difficulty is in devising some way to measure each group’s ability to apply facts and principles learned. The only satisfactory way to make the test is to organize an experiment within an experiment, so as to discover just how well the normal school students can actually teach pupils. In sum, the best way for these students to manifest superior changes in themselves is to show that they can make superior measurable changes in pupils.

Two groups of equivalent pupils can be selected. The EF1 normal school students can be assigned to teach, in rotation, say, one group of pupils, and the EF2 students can be assigned to teach the other group of pupils. If the pupils are sufficiently numerous each normal school student may be assigned to her own group of pupils exclusively. The specific lessons to be taught may be assigned by the experimenter and tests for the pupils may be constructed to measure the effect of these lessons. Or the experiment may be permitted to run for a considerable period and general tests may be given. Initial and final tests upon the pupils will show which normal school group has been most successful in applying facts and principles learned to the real task of making desirable changes in pupils. Thus the best way to measure the normal school student is to measure her pupils.

PROBLEM 18. Is the Per Cent of Failures Higher Among Pupils Who Enter the Senior High School Direct from the Eighth Grade or From the Junior High School?

EF1 is entrance to senior high school from eighth grade. EF2 is entrance from junior high school.

This is not so much an experiment as a causal investigation, and must of necessity be an equivalent-groups investigation.
A group of students entering from the junior high school must be found who are equivalent, except for concomitant differentiations, to a group entering from the regular eighth grade.

The FT is the record of failures for each of these groups during the high school period. In computation, the C may be considered identical with FT.

PROBLEM 19. At How Much Greater Saving of Time and Effort Can a Group of Normal Seven-year-old Children Learn to Read Than a Group of Normal Six-year-old Children?

EF1 is normal seven-year-olds. EF2 is normal six-year-olds.

The one-group and rotation methods are inappropriate. If the six-year-olds and seven-year-olds are truly normal, the six-year-olds will in one year be equivalent to the present condition of the seven-year-olds. In sum, the conditions of the experiment require equivalent groups except for the EF difference and its concomitants. They also require both groups to be equally unable to read at present, though not necessarily of equal capacity to learn to read.

One or more IT’s and FT’s of reading ability, with the intervening teaching of reading by the same or equated teachers to both groups, will show which group can learn more rapidly. The computation will follow the regular computation model.

All the foregoing appears quite simple. But there is a hidden difficulty so great as to be well nigh insurmountable. The foregoing plan shows which group learns to read more quickly. Even though the experiment favors the seven-year-olds, it does not show that, in the long run, it is more economical to delay learning to read until seven years of age. If the six-year-olds learn to read, they can spend the reading period during their seventh year learning something else. If the six-year-olds learn to read, even though at some labor, they have an extra year of access to printed material.
If the six-year-olds do not spend their time learning to read, they may spend their time learning something else which may be proportionately difficult and valuable. There are few abilities which a ten-year-old cannot learn more easily than a six-year-old, but this does not mean that everything should be postponed until pupils are ten years old. Decision as to what to postpone involves a consideration of capacity, interest, need, injury, and the total work of the school. The practical problem cannot be solved by the simple experimental plan outlined above.

PROBLEM 20. What Specific Abilities Are Required for Success as a Telegrapher?

The EF’s are unknown specific abilities. The problem here is not to determine whether a given specific ability contributes or will contribute to success as a telegrapher. The problem is to discover promising specific abilities with which to experiment. In sum, the problem is to discover some hypothesis to be a basis for experimentation. This is always the first step in research.

One plan of procedure is to study the work of a telegrapher and logically infer what specific abilities are needed.

Another plan is to select two groups, one of which is composed of successful telegraphers and the other of which is composed of unsuccessful telegraphers, but where both otherwise appear much alike. Observation of the work of the two groups and tests of them may bring to light suggestive differences.

Another plan is to choose strikingly successful and strikingly unsuccessful telegraphers, and to contrast these opposites in close proximity. This is the most drastic possible method of shaking out into the field of consciousness those differences which spell success or failure as a telegrapher.
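The contrast of successful and unsuccessful telegraphers described above can be sketched in a few lines of Python. This is only an illustration: the test names and scores are invented, and ranking tests by the gap between group means in units of the overall SD is an assumed screening device, not a procedure given in the text. Its sole purpose is to suggest which abilities look promising enough to experiment with.

```python
import statistics

def standardized_gap(successful, unsuccessful):
    """Difference between group means, in units of the SD of all scores combined."""
    overall_sd = statistics.pstdev(successful + unsuccessful)
    return (statistics.mean(successful) - statistics.mean(unsuccessful)) / overall_sd

# Hypothetical test scores: (successful telegraphers, unsuccessful telegraphers).
scores = {
    "finger dexterity": ([52, 55, 58, 54], [41, 44, 40, 43]),
    "auditory memory": ([30, 33, 29, 32], [29, 31, 30, 32]),
}

# Tests with the largest gaps are the most suggestive candidates for a later experiment.
ranked = sorted(
    scores,
    key=lambda name: abs(standardized_gap(*scores[name])),
    reverse=True,
)
print(ranked)
```

A large gap on such a screening only furnishes a hypothesis; the contribution of the ability still has to be tested by one of the methods named in the next paragraph.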
Once specific abilities have been hit upon in such ways, their contribution to success as a telegrapher may be determined experimentally, or by an equivalent-groups causal investigation, or by a partial correlation investigation.

PROBLEM 21. In a Recitation, Can a Class of Girls Bluff a Teacher More Easily Than a Class of Boys?

EF1 is a class of girls. EF2 is an equivalent class of boys. S is the teacher, or, better, several teachers of both sexes, since an experiment of this sort needs repetition on both men and women teachers.

The rotation method is most appropriate because it permits the experimenter to rotate out differences in nature of lesson, teacher’s experience in teaching it, and the like. Thus the experimenter can request a teacher to teach a specific lesson to a class of girls, and then to teach this same lesson to a class of generally equivalent boys. Next he can ask the teacher to teach another lesson to both boys and girls, only, in this case, the boys should be taught first and the girls second.

While each lesson is being taught, or afterward, the experimenter must measure the amount of bluffing which occurs. The C may be treated as identical with this FT, so that a regular rotation computation model will apply.

PROBLEM 22. To What Extent Are Children in the Upper Grades of the Elementary School Capable of Selecting on Their Own Initiative Statements of Most Worth in Their History Reading?

EF1 is attainment of upper grade status. EF2 is, if anything, the mere absence of such attainment. S is upper grade pupils.

Of necessity the one-group method must be employed. The whole experiment, if such it may be called, is very simple. It merely consists in locating upper grade pupils and in testing the extent to which they can select on their own initiative statements of most worth in their histories. IT may be assumed to be zero, so that FT becomes C1.
Similarly all the C2’s may be considered zero. Thus the effect of upper-gradeness is shown by a straight measurement of the present status of upper-grade children in the trait in question.

PROBLEM 23. What Is the Best Order to Teach Geography to Fourth-grade Pupils, the Concrete and Then the Abstract, or the Abstract Followed by the Concrete?

EF1 is concrete followed by abstract. EF2 is abstract followed by concrete. S is fourth-grade pupils.

Owing to the possibility of carry-over, the equivalent-groups method is preferable. One fourth-grade group can be taught according to EF1 and an equivalent fourth-grade group according to EF2.

IT and FT tests, testing the degree of mastery of geography lessons at the beginning and end of the experiment, should be applied to both groups.

The general plan for this experiment is quite simple. The actual carrying out of the experiment would involve much careful labor. It is unique in that the two EF’s appear to be rotated when they really are not. The purpose of the experiment is not to evaluate abstract vs. concrete but abstract after concrete vs. concrete after abstract. A similarly deceptive problem is this: Which method brings the best results in beginning reading, to teach the printed forms of the words first and follow with the script forms, or the reverse order?

Another like deceptive problem is this: What is the best possible order of subjects during the school day? Here the various EF’s are all possible combinations of order of school subjects. As many equivalent groups will be required as there are EF’s. There may be a carry-over from the first subject taught to the second subject, or from the second subject to the third subject, and so on. But carry-over from one part of an EF to another part of an EF is not an irrelevant factor. Carry-over is an irrelevant factor only where there is carry-over from one total EF to another total EF.

PROBLEM 24. Can Anything Done Well By One Individual Be So Analyzed That the Ability May Be Imparted to Others?

For purposes of experimentation, the above problem will be clearer if phrased thus: Will a particular person’s analysis of what some individual does remarkably well confer that remarkable ability upon another?

Here EF1 is some particular person’s analysis of the process by which some gifted person achieves certain ends. EF2 is the absence of EF1. S is some individual to whom EF1, or the analysis, is to be taught in hopes of endowing him with this rare ability.

The one-group method is required, for EF1 must be applied to a particular individual. An IT or IT’s showing S’s initial status in the ability in question needs to be followed, after EF1 has been applied, by an FT or FT’s. These FT’s permit the computation of C or C’s and show whether a particular individual can analyze and impart the ability in high degree to another particular individual. To make the experiment conclusive, many individuals will have to attempt to analyze the process and impart the ability to many S’s.

PROBLEM 25. To See What Projects Second-grade Pupils Will Initiate.

EF1 is the school environment and internal nature of second-grade pupils. EF2 is the mere absence of EF1. S is a group of second-grade pupils.

The problem calls for the one-group method in its most elementary form, for the experiment consists solely in plunging pupils with certain natures into a certain medium, and then watching to see what happens. This elementary sort of research is quite fundamental, and, when operated by a keen observer, frequently leads to very valuable conclusions.

PROBLEM 26. Do Commas After Dependent Clauses Help the Reader in Speed or Accuracy of Reading?

EF1 is commas after dependent clauses. EF2 is the mere absence of EF1, which is to say it is the absence of commas at such places.
S is not defined and hence may be any group that can read.

The equivalent-groups method can be employed but it is not the best method. The one-group method cannot be used, for there will be a carry-over of acquaintance with material, if certain material containing commas is followed by that same material without the commas, and vice versa. This is one of those rare situations where the one-group method is inappropriate, but where the rotation experiment may be used to advantage by alternating the content of the material. The following shows a possible plan:

             Period I                  Period II
Group A      Material 1: Commas        Material 2: No commas
Group B      Material 1: No commas     Material 2: Commas

The speed and accuracy scores made by Group A on “Material 1: Commas” can be combined with the speed and accuracy scores, respectively, made by Group B on “Material 2: Commas.” This can be compared with the combined speed scores and accuracy scores for “Material 1: No commas” and “Material 2: No commas.”

PROBLEM 27. Does Brightness Facilitate Progress Through School?

EF1 is brightness. EF2 is absence of EF1. The subjects are school pupils.

The one-group experimental method cannot be employed because it is impossible for pupils to be dull for a period and then become bright, or be bright and then become dull. For the same reason, the rotation method cannot be used. The equivalent-groups method is the correct one for this problem.

S1 is a group of pupils who are known or are shown to be of a defined brightness. S2 is another group who are known to be of a defined dullness. Except for these intelligence differences and their concomitants the two groups should be equivalent. They should be equivalent in chronological age, grade position in school, i.e., beginning first grade or kindergarten children, etc.

Since the measure of C is the rate of progress through school, no initial tests, except of brightness, are required.
The answer to the problem will be shown by the FT, i.e., the number of years required on the average for each group to complete a defined number of school grades.

PROBLEM 28. Does Genius Beget Genius?

EF1 is genius on the part of parents. EF2 is the absence of such genius, or a smaller quantity of it.

The one-group and rotation experimental methods are inappropriate owing to the fact that parents cannot be geniuses for a time and then become non-geniuses, or vice versa. Hence the equivalent-groups method must be used.

S1 is the product of the union of the sperm and ovum of genius parents. S2 is the product of the union of these elements from non-genius parents.

No IT’s are required except to yield a measure of the amount of each EF. The IT for the subjects may be assumed to be zero. As soon as the offspring of each group have sufficiently matured to make measurement practicable, an FT of intelligence may be applied. C1 and C2 will be identical with the two FT’s. M1 minus M2 will reveal the effect upon the intelligence of offspring of genius in the parents.

To make it possible to separate the influence of germ plasm and environmental influence, all children of both groups should be placed under equally favorable environmental influences immediately after conception or after birth, at the latest. The equality of environment should be maintained until the FT’s are made.
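The computation for Problem 28 can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the FT scores are invented, the variable names are not from the text, and the groups are taken to be equivalent as the problem requires. It shows only that, with IT assumed to be zero, each C equals its FT and the result is M1 minus M2.

```python
import statistics

# Hypothetical FT intelligence scores; none of these figures come from the text.
ft_genius_parents = [118, 125, 131, 122, 129]    # S1, offspring under EF1
ft_nongenius_parents = [101, 98, 107, 95, 104]   # S2, offspring under EF2

# IT is assumed to be zero, so each change C is identical with the FT score.
c1 = ft_genius_parents
c2 = ft_nongenius_parents

m1 = statistics.mean(c1)
m2 = statistics.mean(c2)
difference = m1 - m2  # M1 minus M2

print(f"M1 = {m1}, M2 = {m2}, M1 - M2 = {difference}")
```

The difference by itself says nothing about reliability; for that, the standard deviation of the difference and the experimental coefficient would still have to be computed.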
SUMMARY OF SYMBOLS AND FORMULAE

A.Q. = accomplishment quotient = $\frac{\mathrm{E.A.}}{\mathrm{M.A.}} = \frac{\mathrm{E.Q.}}{\mathrm{I.Q.}}$
Ar.A. = arithmetic age
Ar.A.Q. = arithmetic accomplishment quotient = $\frac{\mathrm{Ar.A.}}{\mathrm{M.A.}}$
Ar.Q. = arithmetic quotient = $\frac{\mathrm{Ar.A.}}{\mathrm{C.A.}}$
A.M. = assumed mean
B = brightness = T + B correction
Ba, Be, Bi, Br = brightness in arithmetic, education, intelligence and reading, respectively
C = (1) change produced by an experimental factor; (2) pupil classification = G + C correction
CC = change produced by a control experimental factor
CEF = control experimental factor
C.A. = chronological age
c = correction
D = difference
EC = experimental coefficient: (1) for difference = $\frac{D}{2.78\,\mathrm{SDD}}$; (2) for coefficient of correlation = $\frac{r}{2.78\,\mathrm{SD}r}$
ECMEC = experimental coefficient of the mean experimental coefficient = $\frac{\mathrm{MEC}}{2.78\,\mathrm{SDMEC}}$
ECMED = experimental coefficient of the mean equated difference = $\frac{\mathrm{MED}}{2.78\,\mathrm{SDMED}}$
ED = equated difference
EF = experimental factor
E.Q. = educational quotient = $\frac{\mathrm{E.A.}}{\mathrm{C.A.}}$
F = effort or efficiency = Te − Ti
Fa = effort in arithmetic = Ta − Ti
Fr = effort in reading = Tr − Ti
f = frequency
fx = deviation × number of frequencies
FT = final test
G = grade status
INT = intermediate test
I.Q. = intelligence quotient = $\frac{\mathrm{M.A.}}{\mathrm{C.A.}}$
IT = initial test
M = arithmetic mean
M.A. = mental age
MEC = mean experimental coefficient
MED = mean equated difference
N = total number
N = [formula illegible] = Spearman self-correlation coefficient where N is the number of tests required to yield a defined correlation
P = pupil
PE = probable error
PED = probable error of the difference
PEM = probable error of the mean
Q = quartile deviation = $\frac{Q_3 - Q_1}{2}$
Q1 = 25 percentile
Q3 = 75 percentile
R.A. = reading age
R.A.Q. = reading accomplishment quotient = $\frac{\mathrm{R.A.}}{\mathrm{M.A.}}$
R.Q. = reading quotient = $\frac{\mathrm{R.A.}}{\mathrm{C.A.}}$
r = product moment coefficient of correlation = $\frac{Sxy}{\sqrt{Sx^2}\,\sqrt{Sy^2}}$, or $\frac{\frac{Sxy}{N} - c_x c_y}{\sqrt{\frac{Sx^2}{N} - c_x^2}\,\sqrt{\frac{Sy^2}{N} - c_y^2}}$ where assumed mean is used
$\frac{Nr}{1 + (N - 1)\,r}$ = correlation coefficient resulting when N forms of tests are used
S = experimental subject, thing, or group
SD or S.D. = standard deviation = [formula illegible]
SDC = standard deviation of the changes
SDD = standard deviation of the difference = $\sqrt{(\mathrm{SDM}_1)^2 + (\mathrm{SDM}_2)^2 - 2\,r_{12}\,(\mathrm{SDM}_1)(\mathrm{SDM}_2)}$
SDM = standard deviation of the mean = $\frac{\mathrm{SD}}{\sqrt{N}}$
SDMEC = standard deviation of the mean experimental coefficient
SDMED = standard deviation of the mean equated difference = [formula illegible]
SDr = standard deviation of the coefficient of correlation = $\frac{1 - r^2}{\sqrt{N}}$
SD median = [formula illegible]
SDS = standard deviation of the sum = $\sqrt{(\mathrm{SDM}_1)^2 + (\mathrm{SDM}_2)^2 + 2\,r_{12}\,(\mathrm{SDM}_1)(\mathrm{SDM}_2)}$
Sfx or Sx = sum of the deviations
T = 0.1 standard deviation of unselected twelve-year-old children
Ta, Te, Ti, etc. = T score in arithmetic, education, intelligence, etc.
x = deviation
y = deviation
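The formulae for SDM, SDD, and EC listed above can be chained together in a short computation. What follows is a minimal sketch in Python, assuming two uncorrelated groups (so that the r term in SDD drops out), an SD computed over N rather than N − 1, and made-up change scores; the function names are illustrative and do not come from the text.

```python
import math
import statistics

def sd_of_mean(scores):
    """SDM = SD / sqrt(N), with SD computed over N (an assumption of this sketch)."""
    return statistics.pstdev(scores) / math.sqrt(len(scores))

def sd_of_difference(group1, group2):
    """SDD for uncorrelated groups: sqrt(SDM1^2 + SDM2^2); the r term is taken as zero."""
    return math.sqrt(sd_of_mean(group1) ** 2 + sd_of_mean(group2) ** 2)

def experimental_coefficient(group1, group2):
    """EC = D / (2.78 * SDD), where D is the difference between the two means."""
    d = statistics.mean(group1) - statistics.mean(group2)
    return d / (2.78 * sd_of_difference(group1, group2))

# Hypothetical changes (C) for pupils taught under EF1 and EF2.
changes_ef1 = [12, 15, 11, 14, 13, 16, 12, 15]
changes_ef2 = [10, 11, 9, 12, 10, 13, 9, 12]

ec = experimental_coefficient(changes_ef1, changes_ef2)
print(round(ec, 2))
```

Where the two sets of scores come from paired or otherwise correlated subjects, the cross-product term in SDD would have to be restored, which this sketch deliberately leaves out.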