HOW TO EXPERIMENT
IN EDUCATION
EXPERIMENTAL EDUCATION SERIES
EDITED BY M. V. O’SHEA
HOW TO EXPERIMENT IN EDUCATION.
By William A. McCall, Ph.D., Associate Professor of
Education, Teachers College, Columbia University.
HOW TO EXPERIMENT
IN EDUCATION
BY
WILLIAM A. McCALL, PH.D.
ASSOCIATE PROFESSOR OF EDUCATION, TEACHERS COLLEGE,
COLUMBIA UNIVERSITY, NEW YORK CITY
New York
THE MACMILLAN COMPANY
1926
All rights reserved
COPYRIGHT, 1923,
By THE MACMILLAN COMPANY.
Set up and electrotyped. Published August, 1923. Reprinted
November, 1926.
PRINTED IN THE UNITED STATES OF AMERICA
BY BERWICK & SMITH CO.
CONTENTS
CHAPTER
I. SELECTION AND FORMULATION OF EXPERIMENTAL PROBLEM
II. SELECTION OF EXPERIMENTAL METHOD
III. SELECTION OF EXPERIMENTAL SUBJECTS
IV. CONTROL OF EXPERIMENTAL CONDITIONS
V. EXPERIMENTAL MEASUREMENTS
VI. COMPUTATIONS FOR THE ONE-GROUP EXPERIMENTAL METHOD
VII. COMPUTATIONS FOR THE EQUIVALENT-GROUPS METHOD
VIII. COMPUTATIONS FOR THE ROTATION EXPERIMENTAL METHOD
IX. CAUSAL INVESTIGATIONS
X. ANALYSES OF EXPERIMENTAL AND CAUSAL INVESTIGATIONS
APPENDIX
SUMMARY OF SYMBOLS
INDEX
LIST OF TABLES
TABLE
1. Chronological ages and mental ages of 43 sixth-grade pupils
2. Pupils divided into two groups of equivalent mental age
3. Illustrates computation of composite scores
4. Illustration of need for equal units of measurement
5. Relative merits of four commonly used scales
6. Shows how to construct a T scale
7. [illegible]
8. Shows how to widen the range of a T scale
9. Age-scale and T-scale equivalents
10. Shows how to construct a B scale
11. For converting T scores into B scores
12. Reliability of test by net difference method
13. Equating variability in computing net difference
13A. For converting total points correct into T scores
13B. For computing B scores
13C. For computing C scores
13D. For illustrating the computation of T, B, and C scores
13E. For interpreting [illegible] scores
14. One-group computation model I
15. Illustration of computation model I
16. Computation of M and SD when N is large
17. Computation of M and SD in a frequency distribution [illegible]
18. Computation of the median in special situations
19. Conversion of experimental coefficients into chances
20. Illustration of computation model I when EFs is not the [illegible]
21. Equivalent-groups computation model II for two EF’s and one test type
22. Illustration of computation model II
23. Equivalent-groups computation model III for three EF’s and one test type
24. Equivalent-groups computation model IV for two EF’s and two test types
25. Illustration of computation model IV
26. Equivalent-groups computation model V for three EF’s and one test type
27. Equivalent-groups computation model VI for two sub-groups
28. Summary of an actual experiment with three sub-groups
29. Equivalent-groups computation model VII with an intermediate test
30. Equivalent-groups computation model VIII with three sub-groups and an intermediate test
31. Rotation computation model IX for two EF’s and one test type
32. Illustration of computation model IX
33. Rotation computation model X for three EF’s and one test type
34. Rotation computation model XI for two EF’s and two test types
35. Data from a rotation experiment conducted by Weber
36. Data from Weber’s rotation experiment converted into T scores
37. Computation of r
38. Computation of r from a contingency table
39. Reavis’ r’s between attendance and six hypothetical causes
40. Reavis’ original and partial r’s between attendance and six hypothetical causes
LIST OF DIAGRAMS
DIAGRAM
1. Scatter diagram showing rectilinear and curvilinear relationship
EDITOR’S INTRODUCTORY NOTE
Professor McCall has written this book primarily for the
purpose of presenting the methodology of educational
experimentation in a practical form for the use of
teachers and students of education who wish to engage
in experimental work, or who desire to understand the great
amount of experimental literature which is appearing in
magazine and book form. This is the first book on educa-
tional experimentation to be published at home or abroad.
There are philosophical treatises on scientific methodology,
such as Pearson’s ‘‘Grammar of Science,” and a few scat-
tered suggestions on the method of experimental education
in books on scientific education; but there has been no
adequate treatment of experimental work in the educa-
tional field. This fact led the present writer, when he
became editor of the Experimental Education Series, to
ask Dr. McCall to prepare this volume. Dr. McCall has
conducted courses in Teachers College in the field of ex-
perimental education, and he has for a number of years
been accumulating concrete data to illustrate the experi-
mental method of procedure. Probably no one is as well
equipped as he is to prepare a book for the guidance of all
who desire either to understand or to undertake experi-
mental work in education.
With the aid to be gained from this book, intelligent
teachers can engage profitably in research work in educa-
tion even if they are not technically trained in experimental
methods. The subject is one of permanent worth; and
students of education or teachers who wish to gain an in-
telligent appreciation of and to keep in touch with American
educational progress must be familiar with, and, to some
extent at least, must be master of the methodology of
educational experimentation. A large proportion of popular
educational doctrines has been derived without due regard
to the requirements for securing valid conclusions; and it
may be safely predicted that superintendents, principals,
and teachers, as well as students of education, who read
Professor McCall’s book understandingly will exercise
greater care than they have done heretofore in promulgating
educational principles based upon data that have not been
secured in an accurate manner or treated according to a
technique designed to control or eliminate disturbing or
irrelevant factors.
“How to Experiment in Education” is not as technical as
it might appear to be at first glance. The formulae and
diagrams as well as the discussion can be easily understood
by any reader, even though untrained in experimental
methods, if he will begin at the beginning of the work and
go through it systematically and leisurely. Concrete ex-
amples of experimental problems that have been or that
might be successfully studied are described by Professor
McCall frequently and clearly enough to illustrate every
method of procedure discussed and every diagram presented.
Technical terms are sparingly used, and the meaning of
those that are employed can be easily gained from the con-
text in which they appear.
M. V. O’SHEA.
The University of Wisconsin.
PREFACE
My initiation into educational research, like most initia-
tions, was a rather tragic one with happy consequences.
My professors plunged me into practical research situations
when my training in experimentation was exceedingly lop-
sided. They trusted to my genius to supply the missing half
of research methodology. The memory of this mistaken
trust constitutes the pleasant after effects.
The cause of my tragedy and of others like mine was due
to the fact that, heretofore, chief attention has been directed
toward statistical refinements, rather than refinements of
pre-statistical procedure. There are excellent books and
courses of instruction dealing with the statistical manipula-
tion of experimental data, but there is little help to be
found on the methods of securing adequate and proper data
to which to apply statistical procedure. Training is given
and books exist only for the last step of a several-step
process. As a result, the final step often becomes little more
than statistical doctoring for the ills in the data.
This book, together with its predecessor, “How to Measure
in Education,” but particularly this book, represents an
attempt to assemble or originate a fairly complete methodol-
ogy of research from the selection of the problem to the
conclusion of the research. Material has been drawn from
numerous sources, but the largest single source is that
unannounced richest course of instruction taken by me at
Teachers College, namely, the frequent privilege of out-of-
course association with Professor E. L. Thorndike.
The encouragement and support given my work by my
departmental superiors, Professors M. B. Hillegas and
Frank M. McMurry, and by Dean James E. Russell have
been a continuous surprise because they have exceeded every
expectation. Such encouragement has made it a pleasure to
shorten vacations and to lengthen the working day so as to
finish this book before departing for a year of service with
the Chinese National Association for the Promotion of
Education.
It is fortunate for the future reader that I am in China
while this book is being edited and published. As a result,
Dr. M. V. O’Shea has given an unusual amount of time to
its editing, and in this he has had the technical assistance of
Dr. John G. Fowlkes. Miss Harriet Barthelmess, who has
a thorough knowledge of the methodology of experimenta-
tion, and my wife, Alma McCall, have volunteered to read
the proof. I wish to make grateful acknowledgment of
their kindness.
William A. McCall.
Teachers College
Columbia University
HOW TO EXPERIMENT IN
EDUCATION
CHAPTER I
SELECTION AND FORMULATION OF
EXPERIMENTAL PROBLEM
I. VALUE AND PREVALENCE OF EXPERIMENTATION IN
EDUCATION
Prevalence of Experimentation.—Except for sporadic
exceptions and for continuous overlapping, the method for
the determination of truth has passed through three major
stages. The first stage is that of authority. When any
question arose as to the truth or falsity of any fact or
principle, it was referred by consent or force to the oracle,
chief, king, church, state, or other temporarily ascendant
individual or group. In the year 1922 the legislature of a
certain state decided by vote whether the principle of evolu-
tion is true or false. In this same year there were further
occasional evidences that vital educational matters were still
being decided on the basis of authority and authority alone.
The second stage is that of speculation. This represents
a genuine advance. When this stage was reached,
questions were no longer matters merely to be settled; they
were matters to be freely discussed. Broadly speaking,
America and American education have now advanced well
into this stage.
The third stage is that of hypothesis and experimentation.
This stage is not something perceived only in visions. We
have seen enough of it to know its aspect and to appraise
its promise. Since earliest times a tiny stream of scien-
tific research has trickled through the ages, now above
ground, now below, now a dashing stream, now a desert rill,
but always flowing forward toward the future, and, in late
years, increasing greatly in volume. Today, educational
experimentation is accepted but not achieved.
These three, authority, speculation, and experimentation,
have been described as stages, and in a sense they are.
But, in a truer sense, they supplement each other. Specula-
tion, unless it becomes an end in itself, is a fruitful source
of hypotheses or problems for research. Authority, when
founded upon tested knowledge rather than upon pure opin-
ion, has an essential function in the scheme of life and
education.
Everywhere there are evidences of an increasing tendency
to evaluate educational procedures experimentally. Though
measurement alone is not research, the marvelous spread
of the movement for scientific measurement of educational
products is a symptom of a new attitude which is favorable
for research. The establishment of numerous city and
state bureaus of research is another evidence. Numerous
experimental schools have arisen for the purpose of re-
search, pseudo-research, or propaganda. Most of the de-
partments of the better teachers colleges have become satu-
rated with the new point of view. Scientific organizations,
research committees, an institute of educational research,
and large educational foundations are lending such impetus
as to make experimental education the most important current
movement in education.
But even with all its growth we have barely entered the
stage of experimentation. Most educational theory still
needs testing. Adequate testing of theory requires a rigid
scientific procedure. The technique of experimentation is
possessed today, with a few exceptions, mainly by a small
group of educational psychologists. Experimental educa-
tion cannot hope to cope with its great task or develop much
faster so long as superintendents, principals, and super-
visors, not to mention teachers, are not equipped to solve
their own problems for themselves. It is but a question of
time until educational leaders will be required to have a
command of research technique. Then the third stage has a
chance to arrive.
Value of Experimentation. — Experimentation has
proved its worth by hastening the day when the test of truth
will be verification and conformity to our experience rather
than revelation and miraculous departure from our expe-
rience. Science asks us to believe in such unthinkable
things as the reality of ether, the absence of weight and
friction for celestial bodies, the existence of the atom, that
food makes thought, and the like. But these matters are
in conformity with logic or experimental evidence. As
Burroughs states, the helium atom has been proved to be an
objective entity as truly as that the sun is in heaven.
The practice of experimentation in a school or school
system pays in terms of an altered attitude on the part of
the entire staff, willingness to consider new proposals, and
an alertness for new methods and devices. Experimenta-
tion ploughs up the mental field. Teachers join their pupils
in becoming question askers. It is the absence of just such
stirrings of the mental soil, which, in all probability, is
responsible for the supposed fact that teachers fail to im-
prove after a few years of experience.
Experimentation pays in terms of cash. Three years
ago an experiment was conducted in a school of five hun-
dred pupils. The purpose of the experiment was to evaluate
a group of teaching methods. A careful account was kept
of the increased ability secured. Careful estimates were
made of its financial value. A record was kept of expendi-
tures. The value of the increased abilities secured was
estimated to be worth $10,000. This estimate was based
upon the total cost in previous years of producing each unit
of ability. The cost of test material used, and of the special
supervision required, amounted to $540. The net annual
saving, not counting future compounding of the abilities,
was $9,460.
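The arithmetic behind this estimate can be checked in a few lines. The sketch below uses only the figures reported above; the variable names, and the value-per-dollar ratio at the end, are illustrative additions and not part of the original account.

```python
# Figures reported for the five-hundred-pupil experiment.
estimated_value_of_gains = 10_000  # dollars: estimated worth of the increased abilities
cost_of_experiment = 540           # dollars: test material plus special supervision

# Net annual saving, ignoring any future compounding of the abilities.
net_annual_saving = estimated_value_of_gains - cost_of_experiment

# Illustrative extra (an assumption, not in the original account):
# estimated value returned for each dollar spent on the experiment.
value_per_dollar = estimated_value_of_gains / cost_of_experiment

print(f"Net annual saving: ${net_annual_saving:,}")
print(f"Estimated value per dollar spent: ${value_per_dollar:.2f}")
```

The subtraction reproduces the reported net saving; the ratio is only as trustworthy as the $10,000 estimate on which it rests.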
Recently experiments have been conducted by Dransfield,
principal of a school in West New York, New Jersey,
and by Barton, superintendent of schools at Sapulpa, Oklahoma.
The purpose of these experiments was to evaluate
the plan for the teaching of reading described in “How to
Measure in Education.” The total points of A. Q. growth in
reading in the control school were 60. The points of growth
in the experimental school were 143. Even without taking
into account the improvement in history, geography, arith-
metic, etc., resulting from increased reading ability, or the
cumulative value to the pupils in future years, and even
without considering that the teachers have learned a new
process to use with other pupils, still the difference between
the two groups is worth thousands of dollars. Consider the
value to education of this and similar experiments, when
their influence shall have spread to the millions of pupils
in American schools.
The foregoing experiments have been described to show
that it is not unreasonable to claim that a widespread use
of scientific research could so increase the efficiency of
instruction as to save a year of instruction. The value
of such an achievement in financial terms is shown by the
following approximate figures:
Population of the United States ................. 103,600,000
Saving to each person through research .......... 1 yr.
Total saving .................................... 103,600,000 yrs.
Value of a year ................................. $1,000
Saving for U. S. ................................ $103,600,000,000
Population engaged in World War ................. 1,300,000,000
Saving for World War powers ..................... $1,346,800,000,000
Saving for 100 generations ...................... $134,680,000,000,000
$134,680,000,000,000 = 260 times U. S. wealth = 790 times cost of World
War = 395 times cost of all wars in recorded history.
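The chain of multiplications in these figures can be retraced as follows. This is a minimal sketch: the constant names are hypothetical, and the World War saving is entered as printed rather than recomputed from the population figure.

```python
# Figures as given in the table above (names are illustrative).
US_POPULATION = 103_600_000
YEARS_SAVED_PER_PERSON = 1
DOLLAR_VALUE_OF_A_YEAR = 1_000

# Total years saved, then their dollar value for the United States.
total_years_saved = US_POPULATION * YEARS_SAVED_PER_PERSON
saving_for_us = total_years_saved * DOLLAR_VALUE_OF_A_YEAR

# Entered as printed (an assumption of this sketch), not derived here.
SAVING_WORLD_WAR_AS_PRINTED = 1_346_800_000_000

# The final line of the table multiplies the printed figure by 100 generations.
saving_100_generations = SAVING_WORLD_WAR_AS_PRINTED * 100

print(f"Saving for U. S.: ${saving_for_us:,}")
print(f"Saving for 100 generations: ${saving_100_generations:,}")
```

The United States line and the hundred-generation line both follow from the stated figures. The printed World War saving is exactly thirteen times the United States saving, which is slightly more than 1,300,000,000 persons at $1,000 each would give, so the sketch leaves that figure as printed.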
Experimentation will pay the nation, the school system,
and the individual school. The time has now arrived when
it also pays the individuals who engage in it. If the finan-
cial reward is not large, the esteem of the profession is.
There is no denying the fact that those educators who today
are constructively studying educational problems by scien-
tific methods have achieved, or are destined to achieve,
positions of recognized leadership in education. They be-
come the final arbiters for most educational questions, for
the peculiar function of experimentation in education is to
be a court of last resort.
Methodology of Research.—Scientific educational re-
search may be grouped conveniently into three major divi-
sions,—descriptive investigations, experimental investiga-
tions, and causal investigations. The purpose of descriptive
investigations is to describe a situation as accurately and
objectively and quantitatively as possible. They involve
the collection of data, and the quantitative description of the
data by the following means: some mass measure, such as a
frequency distribution, frequency surface, order distribution,
or rank distribution; or some point measure, such as a mode,
mean, median, midscore, or percentile; or some variability
measure, such as a quartile deviation, median deviation,
mean deviation, or standard deviation; or some relationship
measure, such as a scatter diagram, contingency table, or co-
efficient of correlation; or some reliability measure, such
as a standard deviation of the measure, or probable error of
the measure; or some other of the standard statistical tech-
niques, such as are described in Rugg’s “Application of
Statistical Methods to Education,” or Thorndike’s “Mental
and Social Measurements.”
The purpose of experimental investigations is to evaluate
the methods, materials, and aims of education. It is to de-
termine the absolute or relative effects upon some subject
or subjects or pupils of one or more experimental factors.
The purpose of causal investigations is to start with some
observed effect and locate the cause or causes; to determine
whether hypothetical causes are really causes; or to deter-
mine just how much each of several causes contributes to
produce the effect.
McCall’s “How to Measure in Education” has for its
purpose not only to tell how to use practically and construct
scientifically mental and educational tests, but also to present
the measurement, tabular, graphic, and statistical
techniques required for the conduct of descriptive investi-
gations. This book is a sort of companion volume for
“How to Measure in Education,” and has for its purpose to
complete the presentation of the methodology of research.
The first book covers descriptive investigations. This book
presents the techniques for experimental and causal investi-
gations.
II. SELECTION OF EXPERIMENTAL PROBLEM
Planning an Experiment.—An experimenter ought to
think through his experiment from the conception of the
problem to the formulation of the conclusions and beyond.
If he has six months to devote to an experiment he can, with
advantage, spend five months in planning the experiment
and one month in conducting it. Ideally an experimenter
should not start his experiment until he has gone through,
mentally at least, every step even down to the smallest
statistical detail. Those who do not possess a vivid imagina-
tion can advantageously carry a miniature experiment with
hypothetical data through the various tabulation and sta-
tistical stages.
The importance of adequate planning cannot easily be
exaggerated. There is little justification for the contention
that a well-prepared plan is an inflexible plan. A plan can
be thorough and yet plastic enough to be altered to meet
unexpected emergencies. In fact original adequacy of plan
is probably correlated positively with a healthful plasticity.
Whenever the experimenter can afford the time, an actual-
trial experiment is superior to a mental-trial experiment.
Even the keenest vision of the most experienced experi-
menter cannot always foresee every difficulty which will
arise. Hence the theoretically best procedure is to follow
the mental-trial experiment with the actual-trial experiment,
to modify and perfect the plan in the light of the actual
trial, and, finally, to conduct the real experiment.
How to Find Experimental Problems.—The best way
to find genuine experimental problems is to become a scholar
in one or more specialties as early as possible. Thorndike
has done a great service for the cause of original research by
showing, in a convincing way, that the original mind is the
informed mind. The idea that much knowledge hampers a
man’s originality has taken deep root in the popular fancy,
as a result of its self-deceptive search for some crumb of
comfort for stupidity. The essence of originality is high
native intelligence plus adequate knowledge. Spencer de-
scribes knowledge as a sphere of light floating in an abyss
of darkness. As a rule, only those who live their mental
life on or in this sphere conceive fruitful problems.
A second way to discover fruitful problems is to read,
listen, and work critically and reflectively. It is well to
form the habit of reacting upon every situation with a ques-
tion mark, and to consider every untested theory as an hypo-
thesis. Between the lines of every worthwhile book are
enough problems and enough rich materials to make the
finder and utilizer famous.
A third method of discovering fruitful problems is to con-
sider every obstacle an opportunity for the exercise of in-
genuity instead of an insuperable barrier. A king once
placed a purse full of gold in the middle of a public road.
On the purse he placed a large stone. A soldier with his
head in the air and whistling a tune chanced that way. He
roundly cursed those who drove over that road for not re-
moving the stone and hence for the injury to his pride and
person. A wagoner, with the expenditure of much emo-
tion and considerable skill, maneuvered his wagon past
the obstacle. Since no one who passed that way had formed
the mental habit of considering every obstacle an opportunity,
the reward beneath the obstacle went by default to
the king.
A fourth method of finding problems is to start a research
and watch problems bud out of it. The very process of re-
search stirs up a hornet’s nest of insistent problems. Spen-
cer expressed a profound truth when he said that if we
enlarge ever so little the sphere of light we increase infinitely
its points of contact with the darkness.
A fifth method of finding problems is not to lose those
already found. Almost everyone has probably been given
for a moment—probably some odd and unexpected mo-
ment—some rare insight. These flashes come, linger for a
moment, go, and are forgotten beyond recall. Twiss attri-
buted his rise to a university position to one fact. He
bought a steel filing case and recorded and filed original
ideas and problems before they were forgotten. So vital
for professional growth is this matter of finding and record-
ing problems, that the worth of an educator can probably
be measured by asking him to list in ten minutes as many
as he can of worth-while educational problems.
What Experimental Problem to Select.—It goes with-
out saying, and yet it needs to be said, that experimenters
should select problems whose solution is not already known.
One of the abler men in educational measurement reported,
at a recent gathering of scientific workers, the results of a
painstaking and exceptionally original research. Unfor-
tunately the same problem had already been solved and
the results published. Thorndike tells of a student who
submitted to him the results of a research which the candi-
date hoped would be acceptable for a Ph.D. thesis. In
submitting the manuscript the candidate wrote that he
knew the research was original for he had been careful to
avoid reading anything whatever about the subject.
As a rule, an experimenter should select and work upon
problems in his own specialty. It will be shown later that
successful experimentation requires such a detailed knowl-
edge of the factors operating in a particular situation, and
of the influence of these factors, as only a trained and expe-
rienced individual possesses. Recently, some students of
experimentation, who were reasonably expert in education
only, attempted to plan an experiment in chemistry. The
undertaking was soon abandoned. No one seemed to know
the influence of temperature upon certain chemical reactions.
This necessity of intimate knowledge probably explains why
over 99 per cent of all discoveries are made by experts in
the field of discovery. During the World War, the War
Department established a clearing house for popular inven-
tions. A few valuable suggestions were received, but in
the main the bulk of all research had to be done by a mere
handful of experts.
An experimenter should select the relatively more vital
problems. There are many problems which are worth
solving but not relatively worth solving. The number of
those willing or competent to undertake research is too
small and their time too valuable to expend effort on prob-
lems not of vital consequence.
An experimenter should select a problem whose solution
is feasible, and should set up hypotheses capable of proof.
However vital the hypothesis, if it is not susceptible of
proof it should be discarded, for the present at least. Un-
fortunately, the solution of many experimental problems of
great worth is often not feasible, because needed tests have
not been constructed, or because appropriate subjects are
not available, or because the experimenter cannot sufficiently
control the situation in which the proposed experiment is to
be conducted, or for some other reason. Thus, the excellence
of an experimental problem depends upon several factors,
and hence it should be selected in the light of these factors.
A more comprehensive list of these conditioning factors will
be given later.
III. FORMULATION OF EXPERIMENTAL PROBLEM
Types of Formulation.—There are three types of indi-
viduals engaged in educational research, and the types are
clearly indicated by the way they formulate their problems.
The first type of experimenter “flutters in all directions
and flies in none!” He formulates problems so that their
scope is scarcely less wide than the universe. Such broad
formulations offer little practical aid in planning the details
of an experiment. Gazing at the stars, this experimenter
steps into every snare at his feet. Just as a teacher cannot
teach arithmetic in general, or spelling in general, but, in-
stead, must teach particular examples or particular words,
so an experimenter is likely to think and act very irrele-
vantly if he is guided by a broad formulation only.
Recently an experimenter came for consultation about
a problem which he had formulated thus: What is the
effect of various factors upon learning? After a little urging
he departed and returned later with this formulation: What
are the effects of distribution of time upon learning? He
was commended for the improvement made. At a later stage
the problem had become: Will a typical fourth-grade class
in silent reading, spending three thirty-minute periods per
week, accomplish more or less than an equivalent class
spending five periods of eighteen minutes each per week?
Even this is too broad for a final working formulation.
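One feature of the third formulation is worth verifying: the two schedules appear to allot the same total time per week, so that only the distribution of time differs between the classes. The sketch below checks this; the function and variable names are hypothetical, and reading the schedules as deliberately equated is an interpretation rather than something the text states.

```python
def weekly_minutes(periods_per_week: int, minutes_per_period: int) -> int:
    """Total instructional minutes per week for a given schedule."""
    return periods_per_week * minutes_per_period


# The two schedules named in the third formulation.
longer_periods = weekly_minutes(periods_per_week=3, minutes_per_period=30)
shorter_periods = weekly_minutes(periods_per_week=5, minutes_per_period=18)

# If the totals agree, the amount of time is held constant and
# only its distribution varies between the two classes.
same_total_time = longer_periods == shorter_periods

print(longer_periods, shorter_periods, same_total_time)
```

A check of this kind is a small instance of the mental-trial experiment recommended earlier: it confirms, before any class is taught, that the comparison isolates the factor under study.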
The second type may be called the pot-hole type. Near
the Cumberland Falls, the Cumberland River has a stone
bed pitted with pot-holes. These holes were made by small
hard pebbles which lodged in originally slight concavities
and which, due to the action of the water, have ground round
and round, thereby making the pebbles smaller and the hole
wider and deeper. There are indefatigable individuals engaged
in educational research whose experimental problems
are admirably specific. They are as narrow as the pebbles
in the pot-hole. And, like the pebbles, their problems be-
come narrower and narrower as their research proceeds.
Such experimenters are experimental drudges. They do
much excellent work, but each research is isolated from
every other. There is an absence of general plan. There
is no mental reaching for the larger implications. They
are as lop-sided as the first type.
The third type of experimenter is the truly admirable one.
He is the scholarly type. He perceives the larger meanings
of each minute investigation. This glorifies the drudgery
inherent in all careful research. The scholarly experimenter
first formulates a broad problem. This gives the larger
goal and permits perspective. He then breaks up the broad
problem into very narrow, specific problems. These are the
working units. As the results from the specific investiga-
tions come in, he fits the bits together into a beautiful mosaic.
The solution of any one specific problem may be of no
practical value. It merely contributes to the solution of
the larger problem which alone has genuine practical sig-
nificance. Hence, it is desirable that there be a hierarchy
of formulations from very broad to very specific.
A working formulation of an experimental problem should
clearly describe: (1) the experimental factor or factors
whose effect or effects are being studied, (2) the experi-
mental subjects or individuals or pupils to whom the experi-
mental factor or factors are to be applied, and who are
expected to register the effect or effects, (3) the nature of
the effects expected and to be measured. In sum, a working
formulation requires that the experimenter must have
analyzed his problem in rough outline at least.
Why and When to Survey Bibliography on a Prob-
lem.—The time to make a survey of the bibliography on
an experimental problem is the opposite of the time when
the survey is all too frequently made. Often an investi-
gator has completed his experiment and has prepared his
manuscript for publication before he hurriedly collects a
list of references. The prime function of a bibliographical
survey is not to provide a dignified list of references to
append to an article, but to serve as a practical guide to the
formulation of the subordinate problems, and to the general
planning of the investigation. Hence, the survey of the
bibliography should immediately follow the formulation of
the experimental problem or problems.
If there were no other reason, self-respect as a scholar
should be adequate motivation for surveying a bibliography.
Such a survey will avoid many public humiliations. Pride
is not fostered by saying: “This is something never done
before,” only to discover later that claim to originality is
unjustified. Such humiliations will be frequent enough at
best without actually inviting them.
An initial bibliographical survey will prevent repeating an
investigation already done. There are few things more
important than the conservation of the time and effort of
scientific men. The importance of avoiding repetition does
not, of course, mean that it may not be desirable, on occasion,
to verify¹ a previous investigation. But it is necessary
to discriminate between ignorant repetition and conscious
verification.
Again, a bibliographical survey will often suggest addi-
tional incidental problems to be settled. There are few men
who have extensively engaged in research who cannot testify
to many keen regrets because numerous subsidiary problems
were conceived too late to make possible their solution at
the time the major problem was being attacked. It fre-
quently happens that merely minor modifications in an in-
vestigation will make possible the solution of five problems
instead of one. The importance of conceiving these prob-
lems early can be appreciated when it is recalled that many
of the world’s greatest discoveries were by-products rather
than major objectives of experimental investigations.
Again, a bibliographical survey helps by offering sugges-
tions of procedure and of errors to be avoided. A bibliog-
raphy is the recorded experience of previous investigators.
The cleverest investigator is seldom able to make an experimental
plan so perfect that there will be no subsequent
regrets. Foresight is never a perfect substitute for expe-
rience. The bibliography reveals not only the methods
employed and the instruments evolved by others but also
criticisms of these on the basis of experience.
Finally, a bibliographical survey provides material which
will be needed in describing the experiment conducted. It
is desirable to preface an experimental article with a summary
of previous related investigations, and to close it with
a relevant bibliography. These, as well as all previously
mentioned objectives of the bibliographical survey, should be
realized at one and the same time.
¹ Wm. A. McCall, “Reliability of a Ph. D. Research Dissertation in Educational
Psychology,” School and Society, April 13, 1918.
Procedure in Making a Bibliographical Survey.—The
procedure of the bibliographical survey should be a highly
selective one. The experimental problems are the key to
this procedure. Throughout the survey, they should be kept
in mind constantly. Everything relevant to them should
be seized upon and examined for possible aids. Relevancy
to the problems is the principle of selection; helpfulness in
furthering the experiment, or its description, is the principle
of retention.
Not the principles of selection and retention but the
method of discovery is the chief difficulty in surveying a
bibliography. The problem is to know where to look for
material likely to be relevant. The method pursued will
vary somewhat with the problem and the situation of the
experimenter. The following general suggestions may, how-
ever, be given: (1) Make inquiries of those who may be
able to contribute unrecorded information. (2) Make in-
quiries of those who may be able to suggest references to
be examined. (3) Go to the contents and references in
books known to deal with the same or related problems.
(4) Consult the same and related topics in the library’s
topically indexed card catalog. (5) Consult the Readers’
Guide to Periodicals. (6) Consult the monthly index to
educational publications published by the Bureau of Educa-
tion at Washington. (7) Consult the Psychological Index
and the index volumes for certain periodicals. (8) Consult
such summarizing journals as the Psychological Bulletin.
(9) Consult the table of contents of special periodicals not
indexed in the Readers’ Guide. The discovery of a single
relevant reference by the above procedure frequently leads
to the discovery of many other references.
CHAPTER II
SELECTION OF EXPERIMENTAL METHOD
I. TYPES OF EXPERIMENTAL METHODS
A. One-group Method.—The most frequently used of
all types of investigations or experiments is the one-group
type, and it occurs as frequently in the physical and social
sciences as in the mental. When the physicist subtracts a
defined amount of heat from a bar of metal and measures
the resulting contraction, he is using the one-group method.
When the chemist pours one chemical mixture into another
and analyzes the resulting precipitate, he is employing the
one-group method. When a psychological examiner fires a
pistol behind a candidate for aviation and measures the
resulting jump, he is employing the one-group method.
When a teacher scolds her class for inadequate preparation
and measures the resulting increase or decrease in study,
she is employing the one-group method. When a nation like
France applies to itself republicanism or a nation like Rus-
sia applies to itself bolshevism and observes the result, it,
too, is employing the one-group method. Similarly, when
a teacher compares the effectiveness of scolding vs. praising,
or instruction by one method vs. instruction by another
method, she, too, is employing the one-group method, pro-
vided the two contrasted factors are tried out upon the
identical group. A one-group experiment has been con-
ducted when one thing, individual, or group has had applied
to it or subtracted from it some experimental factor or fac-
tors and the resulting change or changes have been estimated
or measured.
The one-group method may be represented in formula
form as follows:
One Group — Two EF’s — One Test Type
S — (IT — EF1 — FT — C1) — (IT — EF2 — FT — C2)
where S is the experimental subject, thing, or group.
IT is the initial test or status of S before EF1 and EF2 are,
in turn, added to or subtracted from S.
EF1 is one of the two experimental factors.
EF2 is the other experimental factor.
FT is the final test or status of S after EF1 and EF2 have, in
turn, been applied.
C1 is the change in S produced by EF1, and is found by computing
the difference between the IT and FT which immediately
precede and succeed EF1 respectively.
C2 is the change in S effected by EF2.
The conclusion is yielded by comparing the amounts of C1
and C2. If C1 is larger, EF1 has been more effective than
EF2, and vice versa.
Thus, if a teacher wished to compare the effects of prais-
ing vs. scolding, at the beginning of a class period, upon
the amount of discussion on the part of pupils during the
class period, she would make an initial test (IT) of the
amount of discussion which normally occurs. Then she
would praise (EF1) the class at the beginning of some class
period. During the remainder of the class period she would
test (FT) the amount of discussion. Then she would com-
pute the difference (C1) between the initial test and final
test. As soon as the effects, if any, of the praising had worn
off, she would make another IT or else assume that it would
be identical with the first IT, scold the pupils, make an FT,
and compute the amount of alteration (C2) produced by
scolding. A comparison of the amount and direction of C1
and C2 would yield the correct conclusion from this experiment,
provided proper experimental precautions were taken,
and provided the effects of the praising really did wear off,
as evidenced by the second IT.
Assuming the data to be as shown below, the computa-
tions for the praising (EF1) vs. scolding (EF2) experiment
are indicated.
S — (20 — EF1 — 25 — +5) — (20 — EF2 — 18 — −2)
Difference equals 7 in favor of EF1.
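The arithmetic of this example can be sketched in a few lines of Python. This is an illustration only: the function name `change` and the dictionary layout are assumptions made here and are not part of the text's notation; the scores are those given in the example.

```python
# Minimal sketch of the one-group computation (all names are illustrative).

def change(initial_test, final_test):
    """Return C, the change in S: final test minus initial test."""
    return final_test - initial_test

# Scores from the praising (EF1) vs. scolding (EF2) example.
scores = {
    "EF1": {"IT": 20, "FT": 25},  # praising
    "EF2": {"IT": 20, "FT": 18},  # scolding
}

c1 = change(scores["EF1"]["IT"], scores["EF1"]["FT"])
c2 = change(scores["EF2"]["IT"], scores["EF2"]["FT"])
difference = c1 - c2  # positive favors EF1, negative favors EF2

if difference > 0:
    favored = "EF1"
elif difference < 0:
    favored = "EF2"
else:
    favored = "neither factor"

print(f"C1 = {c1:+d}, C2 = {c2:+d}")
print(f"Difference equals {abs(difference)} in favor of {favored}.")
```

Keeping the sign of each change matters here, since scolding lowered the amount of discussion while praising raised it.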
The one-group experimental method may be divided upon
the basis of the number of experimental factors contrasted.
Strictly speaking, there are no one-factor experiments. The
nearest approach to such an experiment is where some one
factor is added to or subtracted from S. If a teacher makes
an IT of her class, adds a good scolding, makes an FT, and
computes C, she may be said to have performed an experi-
ment with one factor—an experiment which requires only
the former or latter half of the above basic formula. On
the other hand, it might be argued that she really employed
two factors, namely, not scolding or a control EF vs. scold-
ing, and that therefore she would require all of the above
formula. Since the influence of EF1 (not scolding) would
be to leave the pupils unchanged, IT and FT in the former
half of the formula would be identical and C1 would be
zero. Either approach leads to the same practical con-
clusion.
While half of the formula will suffice when the two fac-
tors are really the presence and absence of one identical
factor, the entire formula is required when the two EF’s are,
not mere presence and absence of one EF, but two EF’s
different in nature. Thus, if a teacher wished to compare
the effect of praising vs. scolding her class, or of teaching
her class by one method vs. another method, C1 could not
be assumed to be zero. Both praising and scolding, or both
methods of teaching might alter the original status of S.
Since the longer formula is correct in all one-group experi-
ments and is necessary in some, confusion will be avoided
by adopting it as the basic formula for one-group experi-
ments.
In certain other situations the basic formula may be
shortened by eliminating both the IT and C, whereupon the
formula for the one-group experiment reduces to
S — (EF1 — FT) — (EF2 — FT)
This plan is very economical and its use in preference to
the more laborious basic plan is justifiable when S may be
assumed to have an IT of zero, for in this case C becomes
identical in amount with FT. When an experimenter wishes,
for example, to discover how much a group of pupils can
learn of certain new material taught for a defined length
of time according to a defined method, he may employ the
abbreviated experimental plan, provided the material to be
taught is sufficiently new that pupils will start with
zero knowledge of it. But since all these variations on
the basic plan operate in special situations only, whereas
the basic plan will operate in any one-group experiment,
confusion will be avoided by keeping in mind the basic
plan only.
There remains to consider the formula required to handle
more than two EF’s. The basic formula assumes two EF’s.
It can be indefinitely extended by lengthening the formula
to provide for EF1, EF2, EF3, and so on, with their corre-
sponding C1, C2, C3, etc.
In many one-group experiments the changes produced by
each EF are manifold, so that one test cannot measure
them. Thus, a certain EF may change not only a pupil’s
reading ability but his spelling ability also. To measure
both these effects will require at least two types of tests,
namely, a reading test and a spelling test. Hence, one-
group experiments may be divided into those requiring one
type of test and those requiring two or more types of tests.
The former has already been diagramed; the latter is dia-
gramed below. This diagram assumes that two EF’s are
employed and two types of tests are required. Observe
that S and the two EF’s remain unchanged. C1 vs. C2, and
C3 vs. C4 show the two conclusions from this experiment.
Provision can be made for more EF’s by extending the for-
mula to the right and for more types of tests by extending
it downward.
One Group — Two EF’s — Two Test Types
S — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
(IT2 — EF1 — FT2 — C3) — (IT2 — EF2 — FT2 — C4)
B. Equivalent-groups Method. — The equivalent-
groups method has been devised for experimental situations
where, for reasons to be mentioned shortly, the one-group
method is inapplicable. Distinctive features of this method
are (1) that there are more than one group, or S, and (2)
that all groups are equivalent. Normally, there are as many
S’s as there are EF’s, and each S is supposed to be equiva-
lent to any other. Thus, if a teacher wishes to compare
the effect of scolding vs. praising and employs the equivalent-
groups method, she selects two equivalent groups. She
scolds one group and measures the change, and praises the
other group and measures the change. The diagram for an
equivalent-groups experiment with one type of test follows.
S1 refers to one group and S2 to the other. The conclusion
from the experiment is yielded by a comparison of C1
and C2.
Equivalent Groups — Two EF’s — One Test Type
S1 — (IT1 — EF1 — FT1 — C1)
S2 — (IT1 — EF2 — FT1 — C2)
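The equivalent-groups comparison can be sketched as follows. The pupil scores are hypothetical, and taking the change for a group as the difference between its mean final and mean initial scores is an assumption made for this illustration; the detailed computations for this method belong to a later chapter.

```python
# Sketch of an equivalent-groups comparison; all scores are hypothetical.

def mean(values):
    return sum(values) / len(values)

def group_change(initial_scores, final_scores):
    """C for a group: mean final test minus mean initial test (assumed)."""
    return mean(final_scores) - mean(initial_scores)

# S1 receives EF1 (scolding); S2 receives EF2 (praising).
s1_initial, s1_final = [20, 22, 18, 20], [19, 21, 18, 18]
s2_initial, s2_final = [21, 19, 20, 20], [25, 24, 26, 25]

# The groups are supposed to be equivalent at the outset.
groups_equivalent = mean(s1_initial) == mean(s2_initial)

c1 = group_change(s1_initial, s1_final)
c2 = group_change(s2_initial, s2_final)

print(f"Groups equivalent on initial test: {groups_equivalent}")
print(f"C1 = {c1:+.1f}, C2 = {c2:+.1f}, C2 - C1 = {c2 - c1:+.1f}")
```

Equality of initial means is only the crudest check of equivalence; it is used here to keep the sketch short.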
When two types of tests are used, this formula takes on
the form shown below. The two conclusions are yielded by
a comparison of C1 with C3, and C2 with C4.
Equivalent Groups — Two EF’s — Two Test Types
S1 — (IT1 — EF1 — FT1 — C1)
(IT2 — EF1 — FT2 — C2)
S2 — (IT1 — EF2 — FT1 — C3)
(IT2 — EF2 — FT2 — C4)
The following formula is utilized for three EF’s and two
test types. Guided by the principles exemplified in this and
the two preceding formulae, a formula may be constructed
for any number of EF’s, and any number of test types.
Equivalent Groups — Three EF’s — Two Test Types
S1 — (IT1 — EF1 — FT1 — C1)
(IT2 — EF1 — FT2 — C2)
S2 — (IT1 — EF2 — FT1 — C3)
(IT2 — EF2 — FT2 — C4)
S3 — (IT1 — EF3 — FT1 — C5)
(IT2 — EF3 — FT2 — C6)
C. Rotation Method.—The rotation method is particu-
larly useful for solving experimental problems insoluble by
other methods. It is a unique combination of two or more
one-group methods. When the various groups employed are
equivalent, the rotation method is a combination of one-
group and equivalent-groups methods.
As the name implies, the distinctive feature of the rota-
tion method is that of rotation—rotation of S’s, or EF’s or
irrelevant factors. If a teacher wishes to study, by means
of the rotation method, the effect of praising vs. scolding,
she first praises S1, and measures the result, and then scolds
the same S1, and measures the result. This is the one-group
method thus far. She first scolds S2, and measures the re-
sult, and then praises S2, and measures the result. In other
words, she rotates the order of the EF’s. She combines the
results from praising both groups, and compares the sum so
found with the sum of the results from scolding both groups.
This comparison shows whether praising has been more or
less effective than scolding, how much, and in what direc-
tion. The simplest form of rotation method, namely, two
EF’s and one type of test, is given below. The conclusion
is yielded by a comparison of C1 plus C4 with C2 plus C3.
Rotation — Two EF’s — One Test Type
S1 — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
S2 — (IT1 — EF2 — FT1 — C3) — (IT1 — EF1 — FT1 — C4)
EF1 = C1 + C4
EF2 = C2 + C3
If a teacher wishes to determine by means of the rota-
tion method the effect of praising vs. scolding vs. sarcasm,
the formula becomes as shown below. The conclusion is
derived from a comparison of C1 plus C6 plus C8 with C2
plus C4 plus C9 with C3 plus C5 plus C7.
Rotation — Three EF’s — One Test Type
S1 — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
— (IT1 — EF3 — FT1 — C3)
S2 — (IT1 — EF2 — FT1 — C4) — (IT1 — EF3 — FT1 — C5)
— (IT1 — EF1 — FT1 — C6)
S3 — (IT1 — EF3 — FT1 — C7) — (IT1 — EF1 — FT1 — C8)
— (IT1 — EF2 — FT1 — C9)
EF1 = C1 + C6 + C8
EF2 = C2 + C4 + C9
EF3 = C3 + C5 + C7
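The summing of changes in a rotation experiment can be sketched in Python. The layout below follows the order of factors in the three-factor diagram, but the numerical changes are hypothetical and the names `rotation` and `totals_by_factor` are assumptions introduced for the illustration.

```python
from collections import defaultdict

# Each group lists (experimental factor, measured change) in the order
# applied. The numerical changes are hypothetical.
rotation = {
    "S1": [("EF1", 5), ("EF2", -2), ("EF3", -4)],  # C1, C2, C3
    "S2": [("EF2", -1), ("EF3", -3), ("EF1", 4)],  # C4, C5, C6
    "S3": [("EF3", -5), ("EF1", 6), ("EF2", 0)],   # C7, C8, C9
}

def totals_by_factor(rotation):
    """Sum, over all groups, the changes recorded under each factor."""
    totals = defaultdict(int)
    for applications in rotation.values():
        for factor, change in applications:
            totals[factor] += change
    return dict(totals)

def each_factor_in_each_position(rotation):
    """Check that every factor occupies every serial position once."""
    orders = [[factor for factor, _ in apps] for apps in rotation.values()]
    factors = set(orders[0])
    return all(set(column) == factors for column in zip(*orders))

totals = totals_by_factor(rotation)
balanced = each_factor_in_each_position(rotation)

print(f"Rotation balanced: {balanced}")
for factor in sorted(totals):
    print(f"{factor} = {totals[factor]:+d}")
```

The balance check expresses the purpose of rotating: each factor comes first, second, and third exactly once, so that no factor is favored by its place in the series.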
A diagram for a rotation method with two EF’s and for
two types of tests follows. The two conclusions from the
experiment are yielded by a comparison of the sum of C1
and C6 with the sum of C2 and C5, and by a comparison
of the sum of C3 and C8 with the sum of C4 and C7.
Rotation — Two EF’s — Two Test Types
S1 — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
(IT2 — EF1 — FT2 — C3) — (IT2 — EF2 — FT2 — C4)
S2 — (IT1 — EF2 — FT1 — C5) — (IT1 — EF1 — FT1 — C6)
(IT2 — EF2 — FT2 — C7) — (IT2 — EF1 — FT2 — C8)
EF1 on test 1 = C1 + C6
EF2 on test 1 = C2 + C5
EF1 on test 2 = C3 + C8
EF2 on test 2 = C4 + C7
This, as well as any other experimental method, can be
indefinitely extended by multiplying the number of factors,
or tests, or both. The student will do well to stop at this
point and prove his mastery of what has preceded by mak-
ing a few sample extensions of each method that has been
diagramed.
II. CRITERIA FOR SELECTING EXPERIMENTAL METHOD
A. One-group Method.—When the purpose of an ex-
periment is to determine the amount of change due directly
to an EF, the one-group method is valid:
(1) Where the total net change in the trait or traits in
question produced by irrelevant factors is negligible, or
where the amount of such change is measured and dis-
counted by the application of a control EF.
(2) Where the change produced in S by an EF is not
conditioned significantly by any preceding EF.
(3) Where the change effected by each EF is measurable
in equal units.
Here is an experimental problem which came to the atten-
tion of the writer recently: Will the appointment of a
physical instructor (EF1) or the establishment of school
luncheons (EF2) improve the health (weight, etc.) of ele-
mentary school pupils? The purpose of the individual who
formulated this problem was to determine whether a phys-
ical instructor or school luncheons will alter the weight, etc.,
of pupils, and if so, how much.
Even in the case of an inanimate S, it is extraordinarily
difficult to create an experimental situation where all irrele-
vant factors—disturbing factors—are eliminated. In the
case of an animate S like the above, irrelevant factors of
considerable magnitude are unavoidable. But irrelevant
factors will not invalidate this experiment provided their in-
fluence is relatively negligible. Hundreds of influences con-
tinuously play upon pupils. Compared to the influence of
the EF, most, or sometimes all, of these irrelevant factors
exercise a comparatively small influence.
Even significant irrelevant factors will not invalidate this
experiment provided the total net change is negligible.
Though pupils are continuously registering the effects of a
multitude of accidental or chance or uncontrollable in-
fluences, some of these tend to facilitate and some to inhibit
progress in the trait in question. No trouble is caused
provided these positive and negative influences balance or
so nearly balance as to give a negligible net total.
In the case of our sample problem, will the net total
change produced by irrelevant factors be negligible? There
are excellent reasons for believing that this net total will
be a considerable increase in weight due to, not to mention
other possibilities, the significant irrelevant factor of natural
maturing.
But even this significant irrelevant factor of maturing
does not invalidate the one-group method provided the
amount of its influence can be measured and discounted by
the application of a control EF (CEF). Thus, we might
measure the amount of increase in weight due to one year of
maturing, and then apply a year of school luncheons, and
then remove school luncheons and apply a year of a phys-
ical instructor. The first year would be a control EF be-
cause during this time the pupils would presumably be
treated exactly the same as during the two following years,
except for the EF’s of school luncheons and physical in-
structor. By computing the difference between the increase
during the first year and each of the other two years it
would be possible to determine the amount of increase attri-
butable to each regular EF.
Where there are a CEF and two regular EF’s the basic
formula for the one-group method is shown below. Before
C1 and C2 are compared, the amount of CC should be subtracted
from each.
One Group — CEF and Two EF’s — One Test Type
S — (IT — CEF — FT — CC) — (IT — EF1 — FT — C1) — (IT — EF2 — FT — C2)
EF1 = C1 − CC
EF2 = C2 − CC
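The discounting of the control change may be sketched as follows. The weights are hypothetical, and the sketch assumes that the final test of each year serves as the initial test of the next; the variable names are illustrative only.

```python
# Hypothetical mean weights (pounds) at the start of the experiment and
# at the end of each of the three years.
weights = [70.0, 75.0, 83.0, 90.0]
periods = ["control EF", "school luncheons", "physical instructor"]

# Change during each year: final status minus initial status.
changes = {
    name: weights[i + 1] - weights[i] for i, name in enumerate(periods)
}

# CC is the change under the control EF (maturing and other irrelevant
# factors); it is subtracted from the change under each regular EF.
cc = changes["control EF"]
net = {
    name: change - cc
    for name, change in changes.items()
    if name != "control EF"
}

for name, value in net.items():
    print(f"{name}: gross {changes[name]:+.1f}, net of control {value:+.1f}")
```

The subtraction is fair only if the irrelevant factors operate alike in all three years, which is the assumption the control EF rests upon.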
Will one EF condition or carry-over to any succeeding
EF? Since the control EF may be dispensed with in ex-
periments where the net total change produced by irrelevant
factors is negligible, and also in certain other experiments,
as will be shown later, and since the control EF is really
identical with the preéxperimental factor, these two may be
considered together. Thus, if an experimenter desires to
compare the relative effectiveness of teaching pupils sub-
traction by the additive method vs. the subtractive method,
it is important to inquire whether the pupils are just begin-
ning subtraction or whether they have been taught for some
time previously by the additive or subtractive or some other
method. The additive method, superimposed upon a long
training according to the subtractive method, may yield re-
sults markedly different from that of an additive method
superimposed upon an additive training or no training at
all. The function of an initial test is to prevent the first
regular EF from getting credit or blame for changes pro-
duced by a control EF or, lacking a control EF, the pre-
experimental factor. But there may be a carry-over of
inhibiting or facilitating purposes, methods of work, or in-
formation, or all of these which are not removed by the
initial test sieve.
When the amount of this carry-over is significantly large,
the experimenter has two alternatives. He may seek an S
whose preéxperimental experiences have been such as to
avoid the carry-over, or he may continue with the original
S, and remember to state the final conclusions from the ex-
periment in the light of the condition of S antedating the
experiment. The experimenter does not have the alternative
of selecting another experimental method, for every experi-
mental method is handicapped equally by this preéxperi-
mental factor.
It is necessary to inquire, not only concerning the carry-
over from the preéxperimental factor or control EF, but also
concerning the carry-over from one regular EF to any suc-
ceeding EF. Will a physical instructor for a year prior to
school luncheons add to or detract from the effectiveness of
school luncheons? Or vice versa, will school luncheons add
to or detract from the effectiveness of a physical instructor?
Will the additive EF, preceding a subtractive EF, facilitate
the effectiveness of the subtractive EF, or inhibit it, or vice
versa? Unless there are reasons for believing that any such
carry-over will be relatively negligible, the experimenter had
better avoid the one-group method.
If there are reasons for believing that EF1 will condition
EF2 but that EF2 will not carry-over to EF1, the one-group
method is valid, provided EF2 is applied first, since an EF
cannot condition a preceding EF.
There is this difference between a carry-over from a preéxperimental
factor or from a control EF to a regular EF,
and the carry-over from one regular EF to another. In the
former situation the experimenter does not have the alterna-
tive of selecting another experimental method whereas in
the latter situation he does.
Finally, can the changes effected respectively by the con-
trol EF, school luncheons, and physical instructor be meas-
ured in equal units? Since all weight changes will be
measured in units of pounds, let us say, and since the scale
for weight is a uniform scale, it would appear that the units
could be called equal. The use throughout the entire ex-
periment of a uniform scale with uniform and equal units
would seem to be all that could be asked. It is, provided
equality of units means equal ease of effecting a unit of
change in S at all points on the scale. The units on a scale
may be equal in some senses and be quite unequal in an
experimental sense. In one sense the interval from ninety-
seven to ninety-eight pounds is equal to the interval from
one hundred ten to one hundred eleven pounds. In each
case the interval is one pound. But it may be more
difficult to increase the weight of a particular pupil from
one hundred ten to one hundred eleven pounds than
from ninety-seven to ninety-eight pounds. Let us assume
that it is. Then the EF which came first would show a
greater change than the EF which came second, even though
both were of exactly equal effectiveness. In sum, objective
equality of units does not guarantee experimental equality
of units.
When the same uniform scale of uniform units measures
the changes produced by all EF’s there is some possibility
that the units will be equal experimentally. This possi-
bility is practically nil when the scales employed are not
uniform. For example, an experimenter may desire to de-
termine the effectiveness of two methods of teaching a
geography lesson. He might teach a lesson by method A
on the question: Why are certain portions of the United
States arid? He would construct a measuring instrument
on the content of this particular lesson. This instrument
could be used for the initial test and final test to measure
the change produced by method A. Now if method A had
practically taught the content of the above lesson, or even
a part of it, method B could not well be used on the same
lesson. Method B would have to be employed on another
lesson whose topic was, say: Why is more cotton grown
in the southern than in the northern part of the United
States? This would require a new test on the content of
the second lesson. Suppose that method A increased by ten
points the score of S, and that method B also increases by
ten points the score of S. Which is more effective, method
A or method B? It is impossible to say, because the ten
points in one case are not necessarily equal to the ten points
in the other. We cannot even be sure that one point on
one test is equal to any other point on the same test.
When the purpose of an experiment is to determine merely
the amount of superiority of one EF over any other EF, the
one-group method is valid:
(1) When the amount of change in S under one EF is
practically identical with the amount of change under any
other EF, except for the difference in effectiveness of the
contrasted EF’s.
(2) Where the change produced in S by an EF is not
conditioned significantly by any preceding EF or EF’s.
(3) Where the change effected by each EF is measured
in equal units.
Since many of the experiments in education are concerned
only with the relative effectiveness of two or more EF’s and
not with a determination of the absolute amount of change
in S directly attributable to an EF, the more searching
fundamental criteria may be simplified as indicated in (1),
(2), and (3) immediately above. So far as the above pur-
pose is concerned, it makes no difference if pupils are ma-
turing or if any other irrelevant factors are operating con-
temporaneously with the application of the EF’s, provided
they operate alike under each EF.
There are some situations where inequality of units is
certain, and, yet, where the one-group method is practically
imperative or has been used by mistake. Stevenson con-
ducted an investigation under the auspices of the University
of Illinois and the Chicago public schools to determine the
relative effectiveness of large classes vs. small classes. Cir-
cumstances might have forced the one-group method. If
so, one appropriate plan would be to have a teacher teach a
class of, say, forty-five pupils for the first semester. Initial
and final tests would be given. At the beginning of the
second semester, thirty of these forty-five pupils would be
so selected as to be fairly representative of the whole group.
This class of thirty pupils would be taught during the second
semester by the same teacher who had taught them during
the first semester. Initial and final tests would be given.
The final tests for the first semester would serve as the
initial tests for the second semester. C1 and C2 would be
computed only for the thirty pupils continuing throughout
the year. A large number of different classes would be used,
but each class would be treated according to the above plan.
Then, since it is usually more difficult to secure each
additional point, the small-class EF would be discriminated
against because of inequality of units. Even so, the experi-
menter would not have done all his work in vain. There are
methods of correcting or approximately correcting for these
inequalities.
One method is to plot the curve of growth for the test in
question, using age norms or, lacking age norms, grade norms
as the basis of the curve. The curve can be estimated for
points between the age norms or grade norms. If the norm
for ten-year-old children is, say, fifty, and for twelve-year-
olds is sixty, and for thirteen-year-olds is sixty-five, a growth
from fifty to sixty may be considered equal roughly to a
growth from sixty to sixty-five. By interpolation, a growth
on one portion of the curve may be converted into units of
growth on any other portion of the curve, thus making com-
parison between EF’s fair. In like manner, the slope of the
curve for grade norms may be used to equate units on vari-
ous portions of the curve, though the grade-norm curve is
subject to a selection error. The fifth-grade norm in June is
higher than the fourth-grade norm in June not only because
of the year’s growth, but also—and failure to recognize this
is the error—because certain of the stupider pupils of a
fourth-grade are not allowed to continue with their grade
when it becomes a fifth grade.
For several reasons—because norms are frequently un-
available, because of the selection error in grade norms,
because the equalization of units by means of growth curves
is likely to prove laborious, and because such equalization
requires that the same or equivalent tests be used through-
out the experiment—another method of equalizing units will
be found more serviceable. This is the method of convert-
ing all units into T’s, in terms of the experimental group
rather than twelve-year-olds, by the T-scale technique described
in Chapter V, and illustrated in Table 6 (page 99)
and Table 36 (page 204).
If the same or equivalent forms of a test are used through-
out the entire experiment, it is suggested that the T12 col-
umn of Table 8, p. 102, become the T scores according to
the very first initial test of the experiment, and that T16 become
the T scores according to the last of the final tests of
the experiment, and that these two columns of T scores be
combined according to the procedure illustrated in Table 8.
If the T scores were based upon initial test alone, some of the
highest scores in the final test could not be scaled. If the
T scores were based upon final test alone, some of the lowest
scores of the initial test could not be scaled. By basing the
T scores upon both initial and final tests, all scores for all
pupils on a particular test can be converted into equivalent
T scores by the use of what will correspond to the first and
last columns of Table 6, p. 99.
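The pooling of initial and final tests can be sketched in Python. This is only an illustration, not a reproduction of Table 6 or Table 8: it assumes that a T score is a normalized score with mean 50 and standard deviation 10, found from the proportion of pupils below a raw score plus half of those exactly at it. The function name `t_scale` and all scores are hypothetical.

```python
from collections import Counter
from statistics import NormalDist

def t_scale(raw_scores):
    """Return {raw score: T score} for the pooled scores supplied."""
    n = len(raw_scores)
    counts = Counter(raw_scores)
    table, below = {}, 0
    for raw in sorted(counts):
        # Proportion below this score plus half of those exactly at it.
        p = (below + counts[raw] / 2) / n
        # Assumed scaling: normal deviate times 10, plus 50.
        table[raw] = round(50 + 10 * NormalDist().inv_cdf(p))
        below += counts[raw]
    return table

initial = [3, 4, 4, 5, 6]   # hypothetical first initial test
final = [6, 7, 7, 8, 9]     # hypothetical last final test
table = t_scale(initial + final)   # scale built on both tests together
```

Because the table is built on both tests at once, the lowest initial score and the highest final score each receive a T score, which would not be the case if either test alone were used.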
If the initial and final tests for EF1 are neither duplicate
nor equivalent forms of the initial and final tests used for
EF2, i.e., if the EF1 tests measure information about the
geography of New York, whereas the EF2 tests measure
information about the geography of Pennsylvania, the T
scores for EF1 should be based only upon the initial and
final tests for EF1, and the T scores for EF2 should be
based only upon the initial and final tests for EF2. This
means that Table 6 must be worked twice for each test
before all scores in a two-EF experiment can be converted
into T scores. The general procedure is the same irrespec-
tive of the number of EF’s.
Fortunately, Stevenson selected a better experimental
method. He chose the rotation method instead of the one-
group method. He had one teacher teach a class of, say,
forty-five pupils and another teacher teach an approximately
equivalent class of thirty pupils in the same grade. Both
the large and the small classes were taught during the first
semester. At the end of the first semester, fifteen pupils
were taken from the class of forty-five pupils, thus leaving
it a class of thirty pupils during the second semester, and
given to the class of thirty pupils, thus making the latter a
class of forty-five pupils during the second semester. In this
way, both the large-class EF and the small-class EF came
under identical courses of study, identical portions of the
test, identical portions of the growth curve, and so on.
The probability of satisfying the fundamental criteria for
selecting the one-group method is increased:
(1) Where the EF or EF’s produce a relatively drastic
effect, for this tends to make the influence of irrelevant factors
practically negligible.
(2) Where the experiment is of brief duration, for this
abbreviates the action of large, constant, cumulative, irrelevant
factors such as maturing, for example.
(3) Where the trait in question does not involve pur-
poses or methods of work, for these usually show a larger
carry-over than specific information.
(4) Where the tests are scaled on the basis of the same
unit, for this increases the probability of equality of units.
B. Equivalent-groups Method.—When the purpose of
an experiment is to determine the amount of change due
directly to an EF or EF’s, the equivalent-groups method is
valid:
(1) Where the total net change in the trait or traits in
question produced by irrelevant factors is negligible, or
where the amount of such change is measured and discounted
by the use of a control EF.
(2) Where it is really possible to equate groups.
One peculiar virtue of the equivalent-groups method is
that in its use the danger of any carry-over from one EF
to another is avoided, by applying each EF to a different S
so that no EF follows another with the same group. Of
course the equivalent-groups method, like all others, is sub-
ject to a possible carry-over from the preexperimental fac-
tor. But this does not so much invalidate an experiment as
limit the conclusions from the experiment to the particular
sort of S employed.
Another superiority of the equivalent-groups method over
the one-group is that the units of measurements used for
one EF have a greater probability of being equal to those
used for another EF. The equivalent-groups method avoids
the doubtful assumption that it is equally easy to produce
equal amounts of change at various points of the growth
curve of S, for two S’s can be chosen at like positions on the
growth curve. Furthermore, it is not necessary to measure
the changes produced by the various EF’s by means of dif-
ferent incomparable tests based upon different subject mat-
ter. Thus it would not be necessary to teach one sort of
geography lesson according to method A and another sort
according to method B. The identical lesson could be taught
by method A and method B and the identical test could be
used to measure the changes produced by each method.
We shall see, however, when we come to consider the ques-
tion of scaling tests, that the use of identical tests does not
guarantee perfect equality of units. But it certainly does
tend to increase comparability.
The one-group method did not prove entirely valid for the
illustrative problem of school luncheons vs. physical instruc-
tor. How about the equivalent-groups method? Here, as
in the case of the one-group method, the total net change
produced by irrelevant factors would not be negligible due
to the natural maturing of the pupils. But this difficulty
could be overcome by employing a control S, to whom the
control EF could be applied. Thus one S would be treated
as usual (CEF). Another equivalent group would have
school luncheons (EF1). Still another equivalent group
would have a physical instructor (EF2). By subtracting
CC from C1 and C2 the amount of change produced
by EF1 and EF2 could be accurately determined.
Hence the equivalent-groups method is applicable to this
experimental problem. The method is equally applicable to
the praising vs. scolding, or the additive vs. subtractive
problems.
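A minimal sketch of the subtraction just described, using invented scores for three equivalent groups. The helper `change` and every number are assumptions made for illustration only.

```python
from statistics import mean

def change(initial, final):
    """Mean change C for one group: mean final score minus mean initial score."""
    return mean(final) - mean(initial)

# Hypothetical initial and final scores for three equivalent groups.
cc = change([50, 52, 48], [53, 55, 51])   # control group, treated as usual (CEF)
c1 = change([50, 51, 49], [58, 59, 57])   # school luncheons (EF1)
c2 = change([49, 52, 49], [55, 58, 55])   # physical instructor (EF2)

ef1_effect = c1 - cc   # change credited to EF1 after discounting maturing
ef2_effect = c2 - cc   # change credited to EF2 after discounting maturing
```

With these figures the control change CC is 3 points, so the 8-point gain under luncheons is credited with 5 points and the 6-point gain under the physical instructor with 3.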
When the purpose of an experiment is to determine merely
the amount of superiority of one EF over any other EF the
equivalent-groups method is valid:
(1) Where the amount of change in S under one EF is
practically identical with the amount of change under any
other EF, except for the difference in effectiveness of the
contrasted EF’s.
(2) Where it is really possible to equate groups.
As is the case with the one-group method, the criteria
are less stringent when only the relative difference between
EF’s is desired. Changes produced by large irrelevant
factors, like maturing, cause no trouble provided the irrele-
vant factor operates equally under each EF.
In the case of one-group experiments, equal operation of
irrelevant factors under each EF is often difficult to secure,
particularly when the experiment extends over a consider-
able time interval. But equal operation of irrelevant factors
is easy to secure when the groups are different groups and
equivalent. Hence the above criteria practically reduce to
the second one for most situations.
C. Rotation Method.—When the purpose of an experiment
is to determine the amount of change due directly to
an EF or EF’s, the rotation method is valid:
(1) Where the total net change in the trait or traits in
question produced by irrelevant factors is negligible, or
where the amount of such change is measured and discounted
by the application of a control EF.
(2) Where the change produced in S by an EF is not
conditioned significantly by any preceding EF.
In case the net total effect from irrelevant factors is not
negligible, this effect can be measured by a preliminary appli-
cation of a control EF to each group employed in the rotation
experiment. The amount of change produced by the irrele-
vant factors would be combined in the same way, in the
same order, and for the same intervals as has been described
for the regular EF’s, and the sum would be subtracted from
the sum of the corresponding C’s for the regular EF’s. The
computation for the control EF is like computing the
shadow of the rotation experiment for the regular EF’s, for
there would be a control C1 to be added to a control C4, and
a control C2 to be added to a control C3. The computation
for the control EF’s would be more elaborate if there were
more than two regular EF’s, but here, too, the process would
duplicate that already given for three or more regular EF’s.
The formula for both CEF’s and regular EF’s may be
written as below, though it is probable that either the CC2
or CC4 would be assumed to be equivalent to CC1 or CC3
respectively, or else the two CEF’s which are applied to each
S would be applied in immediate succession.
Rotation—CEF’s and Two EF’s—One Test Type
S1—(IT—CEF1—FT—CC1)—(IT—EF1—FT—C1)—(IT—CEF2—FT—CC2)—(IT—EF2—FT—C2)
S2—(IT—CEF2—FT—CC3)—(IT—EF2—FT—C3)—(IT—CEF1—FT—CC4)—(IT—EF1—FT—C4)
EF1 = (C1 + C4) - (CC1 + CC4)
EF2 = (C2 + C3) - (CC2 + CC3)
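The two EF formulas can be checked with a few lines of Python. The sketch takes the control terms to be CC1 through CC4 and the regular terms C1 through C4, as in the schema; the numerical changes and the function name `rotation_effects` are hypothetical.

```python
# Hypothetical changes; keys follow the symbols used in the schema.
C = {"C1": 9, "C2": 7, "C3": 8, "C4": 10}
CC = {"CC1": 2, "CC2": 3, "CC3": 2, "CC4": 3}

def rotation_effects(c, cc):
    """Sum each EF's two changes and subtract the matching control changes."""
    ef1 = (c["C1"] + c["C4"]) - (cc["CC1"] + cc["CC4"])
    ef2 = (c["C2"] + c["C3"]) - (cc["CC2"] + cc["CC3"])
    return ef1, ef2

ef1, ef2 = rotation_effects(C, CC)
```

Each EF is paired with the control changes measured just before it in each group, which is why the control computation mirrors the regular one.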
Even though the rotation method is a combination of one-
group methods, the criterion concerning equality of units of
measurements has not been restated in connection with the
rotation method. This omission is due to the fact that the
rotation method brings each EF under each lesson and test,
if different lessons with different content are used, and brings
each EF under each portion of the growth curve, if the same
test is used and the experiment continues over a long period
of time. In sum, the rotation tends to rotate out lesson
differences, test differences, or position-on-growth-curve
differences, thus tending to equalize the units of measure-
ments.
In Weber’s rotation experiment to test the effectiveness
of a lesson taught by a teacher followed by a brief review
vs. a film or motion picture followed by a lesson vs. a lesson
followed by a film, a different content with an appropriate
test for each content had to be used for the different EF’s.
One lesson had to do with India, another with China, and
a third with Japan. The appropriate formula for such an
experiment follows. In the formula, ITi means the initial
test on India, LR means the lesson-review EF, ITc means
initial test on China, FL means the film-lesson EF, ITj
means initial test on Japan, and LF means lesson-film.
S1—(ITi—LR—FTi—C1)—(ITc—FL—FTc—C2)—(ITj—LF—FTj—C3)
S2—(ITi—FL—FTi—C4)—(ITc—LF—FTc—C5)—(ITj—LR—FTj—C6)
S3—(ITi—LF—FTi—C7)—(ITc—LR—FTc—C8)—(ITj—FL—FTj—C9)
LR = C1 + C6 + C8
FL = C2 + C4 + C9
LF = C3 + C5 + C7
If S1 is a superior group of children, the foregoing plan
rotates out the superiority, for every EF gets the benefit
of the group’s superiority, and similarly for other group
differences. If S2 is taught by a superior teacher, the effect
of her superiority is rotated out, for every EF profits equally
from her skill, and similarly for other teacher differences.
If the lesson or test on India is especially difficult, this dif-
ficulty is rotated out, for the lesson and test on India is
employed with every factor, and similarly for other lesson
or test differences. If the LR or lesson-review EF is more
effective than the other two EF’s, this superiority is not
rotated out, and should not be rotated out, for the purpose
of the plan is to give any such superiority a chance to mani-
fest itself, unmasked by irrelevant factors of teacher, group,
lesson, or test differences.
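The rotating-out argument can be illustrated numerically. The sketch below assumes, purely for illustration, that each change C is the simple sum of an EF effect, a group (or teacher) effect, and a lesson effect; the additive model, the function `ef_totals`, and all the effect sizes are hypothetical and are not taken from Weber's data.

```python
# Order of EF's for each group over the three lessons (India, China, Japan).
SCHEDULE = {
    "S1": ["LR", "FL", "LF"],
    "S2": ["FL", "LF", "LR"],
    "S3": ["LF", "LR", "FL"],
}

def ef_totals(ef_effect, group_effect, lesson_effect):
    """Sum the changes C credited to each EF under a purely additive model."""
    totals = {ef: 0 for ef in ef_effect}
    for group, order in SCHEDULE.items():
        # Lessons are taken in the order in which lesson_effect lists them.
        for lesson, ef in zip(lesson_effect, order):
            totals[ef] += ef_effect[ef] + group_effect[group] + lesson_effect[lesson]
    return totals

totals = ef_totals(
    ef_effect={"LR": 5, "FL": 4, "LF": 4},                # LR assumed more effective
    group_effect={"S1": 0, "S2": 3, "S3": 0},             # S2 assumed superior
    lesson_effect={"India": -2, "China": 0, "Japan": 0},  # India assumed difficult
)
```

Every EF total receives the same group and lesson contributions, so the totals differ only by three times the difference in EF effect: the superiority of LR survives, while the group and lesson differences cancel.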
The above plan will rotate out any likely irrelevant factor,
except (1) uncontrolled bias on the part of the teacher or
experimenter for a particular EF; (2) bias on the part of
the test for a particular EF; (3) deliberate malingering on
the part of the pupils, unless this is uniform throughout the
experiment; (4) a carry-over from one EF to another; (5)
any tendency for one group to learn how to improve more
rapidly with the progress of the experiment than any other
group; or (6) any tendency for one group to become more
fatigued or bored with the progress of the experiment than
any other group.
The last three irrelevant factors are of special interest.
If the lesson-review EF were to carry over and benefit the
film-lesson EF, C2 would not be an exact measure of the
influence of film-lesson. Instead, C2 would be a measure
of the effect of film-lesson plus an effect borrowed from
lesson-review. In an experiment of this sort, where the
entire content of the lessons is changed each time, such
carry-over in significant amount is highly improbable.
If, for some reason, S1 were to learn, as the experiment
progressed, how better to retain the content so as to make
a higher score on the FT, the second EF would profit more
than the first, and the third EF would profit more than the
second. This would be rotated out provided and only pro-
vided S2 and S3 each learned the same thing in like amount.
Again, if S1 were to become fatigued or bored as the experiment
progressed, relatively more than S2 and S3, this would
penalize LF most, FL next, and LR least. Such unique
fluctuations are not likely to occur in significant amounts
unless there are large differences in intelligence, or the like,
between the three groups.
When the purpose of an experiment is merely to deter-
mine the amount of superiority of one EF over any other
EF, the rotation method is valid:
(1) Where the amount of change in S under one EF is
practically identical with the amount of change under any
other EF, except for the difference in effectiveness of the
contrasted EF’s.
(2) Where there is no carry-over from one EF to another,
or where, in case it occurs, the carry-over is mutual,
i.e., each EF gains equally from such carry-over.
If, in the case of one S, EF1 preceding EF2 aids EF2 to
the extent of, say, two score points, and if EF2, in the case
of the other S, aids EF1 to the extent of two score points,
the increased change for each EF will be equal, thereby
validating the rotation experiment for the purpose of deter-
mining relative effectiveness of the EF’s.
An illustration will make it clear that a mutual carry-over
will not disturb a relative rotation experiment. Lacy¹ conducted
a rotation experiment to evaluate the relative effectiveness
of telling a story orally to a pupil (Told), having a
pupil read the story (Read), or having him see it in motion
pictures (Movie). Assume that each EF is equally effective,
and that each C would be 4 were it not for carry-over. As-
sume, further, that each EF carries over to the immediately
succeeding EF to the extent of half its own C, and to the
next EF to the extent of one-fourth its own C. The follow-
ing diagram shows that all EF’s come out equal, according
to assumption, regardless of a complicated carry-over.
1 Lacy, John V., “The Relative Value of Motion Pictures as an Educational
Agency,” Teachers College Record, November, 1919.
   4        4+2      4+3+1
 Told      Read      Movie
   4        4+2      4+3+1
 Read      Movie     Told
   4        4+2      4+3+1
 Movie     Told      Read
Told  = (4) + (4+3+1) + (4+2) = 18
Read  = (4+2) + (4) + (4+3+1) = 18
Movie = (4+3+1) + (4+2) + (4) = 18
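The arithmetic of the diagram can be verified mechanically. The sketch applies the stated assumptions (each C would be 4, half of a C carries to the next EF, one-fourth to the EF after that); the function name `changes_with_carry_over` is hypothetical.

```python
ORDERS = [
    ["Told", "Read", "Movie"],
    ["Read", "Movie", "Told"],
    ["Movie", "Told", "Read"],
]

def changes_with_carry_over(order, base=4):
    """C for each EF in one group, including carry-over from earlier EF's."""
    cs = []
    for position in range(len(order)):
        c = base
        if position >= 1:
            c += cs[position - 1] / 2   # half of the immediately preceding C
        if position >= 2:
            c += cs[position - 2] / 4   # one-fourth of the C before that
        cs.append(c)
    return dict(zip(order, cs))

totals = {}
for order in ORDERS:
    for ef, c in changes_with_carry_over(order).items():
        totals[ef] = totals.get(ef, 0) + c
```

Within every group the three changes are 4, 6, and 8, and because each EF occupies each position once, every total is 18.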
If an experimenter desires to be exceedingly careful to
equalize the amount of carry-over, he can improve upon
any formula thus far given by using six groups for three
EF’s as shown below.
S1 — Told — Read — Movie
S2 — Read — Movie — Told
S3 — Movie — Told — Read
. . . . . . . . . . . . . . .
S4 — Read — Told — Movie
S5 — Told — Movie — Read
S6 — Movie — Read — Told
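The six orders listed are simply all possible arrangements of the three EF's. A short sketch, using only the Python standard library, generates them and counts how often each EF immediately follows each other EF; the variable names are illustrative.

```python
from itertools import permutations

EFS = ["Told", "Read", "Movie"]
orders = list(permutations(EFS))   # 3! = 6 orders, one per group

# Count how often each EF immediately follows each other EF.
follows = {}
for order in orders:
    for first, second in zip(order, order[1:]):
        follows[(first, second)] = follows.get((first, second), 0) + 1
```

With six groups every ordered pair occurs exactly twice. With only the first three groups, Told is followed by Read twice but Read is never followed by Told, so the six-group plan balances carry-over more completely.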
On the whole, the one-group experimental method is the
most convenient and, for this reason, should be preferred
when some significant irrelevant factors will not invalidate
the experiment; but the one-group method is peculiarly sub-
ject to constant errors from these sources. The equivalent-
groups method is peculiarly free from the influence of dis-
turbing irrelevant factors. The only difficulty encountered
here is in selecting two or more S’s which are genuinely
equivalent. When the number of pupils composing each S
is small, it becomes extremely difficult to prove that exact
equivalence was secured. Due to the practical difficulty at
times of establishing this equivalence, the rotation method
is frequently used. The rotation method is, of course, just
a combination of two or more one-group experiments, but
the way in which the one-group methods are combined
automatically tends to eliminate some of the objections to
the one-group method. Reversing the order of application
of the EF’s permits each EF to get the advantage or disadvantage
of a carry-over from the other, and increases comparability
by having each test used under each EF and by
having each EF operate on S at approximately similar por-
tions of the growth curve. The rotation method is also of
value in eliminating special irrelevant factors, such as teach-
ing skill of teacher, and difference in ability of groups.
CHAPTER III
SELECTION OF EXPERIMENTAL SUBJECTS
Appropriateness of Subjects to Experiment Factors.
—The first consideration in selecting experimental subjects
requires that these subjects be appropriate to the EF’s. A
principal in a nearby school is interested in determining the
effect of employing the project method with a particular
class in his school which has been taught by an extremely
conservative teacher. Here the EF calls for a particular
class or, at least, for pupils whose habits have been formed
under a very conservative teaching method. Coy has con-
ducted an elaborate experiment with children of high in-
telligence. The problem especially called for gifted pupils.
Others would have been inappropriate. Ogglesby designed
a primer for pupils of subnormal intelligence. She desired
to test its relative effectiveness. It was necessary to select
pupils appropriate to the EF. Hanson has experimented
with the effect upon progress in penmanship of excusing
pupils from drill when they attain a handwriting quality of
12 on the Thorndike Handwriting Scale, as compared with
continuance of drill. Pupils whose handwriting is already
above quality 12 would be inappropriate, as would pupils
so far below quality 12 that this goal would cause little or
no motivation. Thus, appropriateness is an essential con-
sideration, and what constitutes appropriateness varies with
the nature of the problem.
The determination of appropriateness frequently requires
objective measurement. Thus Coy used intelligence tests to
pick children of high intelligence. Ogglesby selected her
subjects on the basis of intelligence scores determined by
Metzner. Gray, Gates, and others have experimented with
pupils who were unable to make satisfactory progress in
reading. They employed reading tests to select their ex-
perimental subjects.
Appropriateness of Subjects to Tests.—As a rule, sub-
jects should not be subordinated to the tests, but rather tests
should be found or constructed which will be appropriate to
the subjects. But it sometimes happens that the nature of
the problem is such as to permit the experimenter consider-
able latitude in the choice of subjects, while at the same
time it is not feasible to construct new tests. A few days
ago the writer advised an experimenter who was planning
his doctor’s dissertation to select no experimental subjects
below the third grade. This advice was given because ade-
quate tests of the type called for by his problem were not
available for pupils in grades below the third. Adequate
tests were available for pupils in grades above the second.
He could have constructed tests for young children, but
this would have left no time for experimenting with the
problem in which he was interested.
Representativeness of Subjects—Selection by Chance.
—Sometimes it is possible to employ for the S the total
group which has proved appropriate for the EF. Thus
the experimenter, who desires to determine the effect of the
project method upon a particular fourth grade previously
taught by an unusually conservative method, could include
the total group in the experiment. Sometimes, as for ex-
ample in a very large elementary school, it is not feasible
to try the EF’s on all the fourth-grade children in question.
Only a selected number can be used. If the conclusion is
to be generalized for all the pupils, it is necessary that the
S be so selected as to be representative of the total group.
Representativeness can be secured by making a chance
selection from the total group, or a chance selection from
a chance portion of the total group. One method of making
a chance selection is to write upon a slip of paper the name
of each pupil in the total group, to place these names in a
receptacle, to mix them thoroughly, and to draw from the
receptacle as many slips of paper as there are pupils called
for in the experimental plan. This was the general pro-
cedure followed by the War Department in selecting men
for conscription during the World War.
Another method of making a chance selection is to write
the names of the pupils in alphabetical order. If half the
total number of pupils are to be used, alternate pupils can
be selected. If one-third the total group are to be used,
every third pupil can be selected, and similarly for the
proportions of 25, 75, 90, or other per cents.
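The two chance procedures just described might be sketched as follows. The roster, the function names `draw_slips` and `alternate_from_list`, and the seed are all hypothetical; the seed serves only to make the illustration repeatable.

```python
import random

def draw_slips(names, how_many, seed=None):
    """Mix the slips and draw how_many of them without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(names), how_many)

def alternate_from_list(names, step):
    """Write the names alphabetically and keep every step-th one, starting with the first."""
    return sorted(names)[::step]

pupils = [f"Pupil {i:02d}" for i in range(1, 13)]   # hypothetical roster of 12
drawn = draw_slips(pupils, 6, seed=1)    # six slips drawn from the receptacle
half = alternate_from_list(pupils, 2)    # alternate pupils
third = alternate_from_list(pupils, 3)   # every third pupil
```

For proportions above one-half, such as 75 or 90 per cent, the same stepping can be used to pick the pupils to be left out rather than those to be kept.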
The above methods of selection assume that it is feasible
to withdraw the selected pupils from their classes and as-
semble them in a new class or classes for experimental pur-
poses. This is not, however, always practicable. Fre-
quently the experimenter is faced with the necessity of
making a chance selection of classes rather than or in
addition to a chance selection of pupils.
Representativeness of Subjects—Selection by Measurement.—If
1000 pennies be tossed there will be only a
slight difference between the number of times that heads as
contrasted with tails appear. If twenty pennies are tossed
there may be a relatively large difference in the number of
heads and tails. This illustrates the fact that chance is a
highly exact method of selecting representative pupils when
the number of pupils used as subjects is large, whereas its
accuracy decreases as the number of pupils decreases.
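The penny illustration can be made exact. The sketch computes, for fair pennies, the probability that the share of heads misses one-half by more than a chosen margin; the function name and the margin of ten percentage points are assumptions made for illustration.

```python
from math import comb

def prob_share_off_by_more_than(n, margin):
    """Exact probability that the share of heads in n fair tosses misses 1/2 by more than margin."""
    favourable = sum(
        comb(n, heads) for heads in range(n + 1) if abs(heads / n - 0.5) > margin
    )
    return favourable / 2 ** n

p_20 = prob_share_off_by_more_than(20, 0.1)       # twenty pennies
p_1000 = prob_share_off_by_more_than(1000, 0.1)   # one thousand pennies
```

With twenty pennies a miss of that size occurs in roughly a quarter of all tossings, whereas with a thousand pennies it is practically impossible.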
When the number of pupils or groups is small it is safer
to make the selection on the basis of measurement of some
sort. Just what sort of measurement will be best depends
upon the nature of the experimental problem to be under-
taken and the purposes of the experimenter. If the experi-
ment has to do with physical efficiency, the tests used may
well be tests of physical condition, in order that pupils with
all types of physique may be selected. If the experimental
trait is reading, selection on the basis of a test of reading
ability will usually prove satisfactory. If the experiment
has to do with general educational or mental development an
intelligence test or a combination of several educational tests
may be employed.
Once the measurements are made, the pupils or groups, as
the case may be, should be arranged in order according to
the size of their scores. If, say, 10 per cent of the pupils
or groups are to be selected, every tenth pupil or group
should be selected. If 25 per cent of the pupils or groups
are to be used, every fourth pupil should be selected. Thus
in the latter instance the best, fifth best, ninth best, and
so on, should be selected.
Representativeness can be slightly but only slightly in-
creased by employing a modified method of selecting the
experimental pupils. Selecting pupils who stand first, third,
fifth, and so on, when half the total group is to be used
will cause the experimental pupils to average slightly higher
than the total group, as will the selection of pupils who stand
first, fifth, ninth and so on when 25 per cent of the total
group are to be used. This modified method is described
farther along, in connection with the technique of equating
groups.
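Selection by rank, and the slight upward bias just mentioned, can be seen in a small sketch. The twelve pupils, their scores, and the function name `select_by_rank` are hypothetical.

```python
from statistics import mean

def select_by_rank(scores, step):
    """Rank pupils from highest to lowest score and keep every step-th one, starting with the best."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[::step]

# Hypothetical scores: p1 scores 100, p2 scores 98, and so on down to p12 at 78.
scores = {f"p{i}": 102 - 2 * i for i in range(1, 13)}

chosen = select_by_rank(scores, 4)            # 25 per cent: best, fifth best, ninth best
chosen_mean = mean(scores[p] for p in chosen)
total_mean = mean(scores.values())
```

Here the chosen pupils average 92 against 89 for the whole group, because each chosen pupil stands at the top of his block of four.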
Appropriateness of Subjects to Experimental
Method.—The question of the appropriateness of subjects
to the experimental method is most frequently raised in
connection with the equivalent-groups method, or the rota-
tion method when equivalent groups are to be used. When
any experimental method has been decided upon, subjects
must be selected who are first, appropriate to EF’s and tests,
and second, representative. When the equivalent-groups
method has been decided upon, there is the additional re-
quirement that subjects be selected and placed in different
groups in such a way that the resulting groups will really
be equivalent.
Equivalence of groups does not require that all the sub-
jects participating in the experiment be equivalent, but it
does mean that all the groups participating be equivalent.
To be equivalent the various groups must have like means
and like variability among the subjects constituting each
group. To have like means and like variability implies in
turn that for every subject in one group there should be an
equivalent subject in every other group. While this last
will guarantee like means and variability, it is not absolutely
required that there be an equal number of subjects in each
group. The essential is that the groups be equivalent as to
means and variability.
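A simple check of like means and like variability might look as follows. The scores are hypothetical, and the use of the standard deviation as the index of variability is an assumption of this sketch rather than a rule laid down here.

```python
from statistics import mean, pstdev

def summarize(groups):
    """Return {group name: (mean, standard deviation)} for the measure supplied."""
    return {name: (mean(scores), pstdev(scores)) for name, scores in groups.items()}

# Hypothetical scores on whatever measure is used for equating.
groups = {
    "S1": [118, 124, 130, 136, 142],
    "S2": [119, 125, 129, 137, 140],
}
summary = summarize(groups)
```

The two groups here have the same mean and nearly the same spread. The same comparison applies when the groups differ in size, since only the means and variabilities need to agree.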
But equivalent in what? In intelligence? Not neces-
sarily. In education? Not necessarily. In the experi-
mental trait? Not necessarily. The groups must be equal
in their possibilities for growth in the trait in question.
They should be so equal in the growth potential or possi-
bilities that they will show an equal mean change and an
equal variability among the changes of the individual sub-
jects in each group, provided all groups are placed under
an identical EF for an identical length of time. Various
methods have been proposed for securing such an equiva-
lence. These will be described next.
Groups Equated by Chance.—Just as representative-
ness can be secured by the method of chance, when the
subjects involved are sufficiently numerous, so equivalence
may be secured by chance, provided the number of sub-
jects to be used is sufficiently numerous. One method of
equating by chance is to mix the names of the subjects to
be used. Half may be drawn at random. This half will
constitute one group while the other half will constitute the
other group. If three groups are required, the first third
of the drawings will constitute one group, the second third
of the drawings another group, and the remaining third
still another group.
Or again, the names may be written in alphabetical order.
The even-numbered names will constitute one group and
the odd-numbered names the other group, and similarly for
a larger number of groups. If classes are being paired off
instead of pupils, the same general procedure of drawing, or
of alternating will apply.
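Both ways of equating by chance can be sketched briefly. The function names, the roster, and the seed are hypothetical, and placing any leftover names in the last group is merely one convenient choice.

```python
import random

def split_by_drawing(names, n_groups, seed=None):
    """Mix the names; the first share of drawings forms one group, the next share another, and so on."""
    rng = random.Random(seed)
    drawn = list(names)
    rng.shuffle(drawn)
    size = len(drawn) // n_groups
    groups = [drawn[i * size:(i + 1) * size] for i in range(n_groups)]
    groups[-1].extend(drawn[n_groups * size:])   # leftovers, if any, join the last group
    return groups

def split_by_alternation(names, n_groups):
    """Write the names alphabetically and deal them out in turn."""
    ordered = sorted(names)
    return [ordered[i::n_groups] for i in range(n_groups)]

pupils = [f"Pupil {i:02d}" for i in range(1, 13)]   # hypothetical roster of 12
by_drawing = split_by_drawing(pupils, 3, seed=7)
odd_names, even_names = split_by_alternation(pupils, 2)
```

Either device is acceptable only so far as nothing in the ordering of the names is related to the trait under experiment.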
The above are merely sample procedures. Any device
which will make the selection truly random is satisfactory.
Extreme caution should be exercised to avoid any constant
tendency for one group to turn out superior to another.
When the War Department made the famous drawing to
determine the order in which individuals would be con-
scripted for military service, numbers were written on
paper and enclosed in capsules. Due to the fact that every
additional figure in a number added to the weight of the
capsule because of the additional ink deposit, there was a
constant tendency for the larger-numbered capsules to sift
to the bottom where they would be drawn last. If the size
of the paper increased with the length of the number this
still further prevented a perfectly random drawing. These
criticisms are made merely by way of illustration. Any ex-
perimenter may count himself lucky if he is able to select
subjects by the method of chance with no constant error
larger than that caused in this national drawing by a few
specks of ink.
Groups Equated by General Ability.—Measurement,
if adequate and accurate, is the best basis for selecting sub-
jects irrespective of their number. Chance selection is
merely an economical substitute for measurement, and is
practicable only where the number of experimental subjects
is sufficiently large. The trouble with measurement is that
we know so little about just what sort of measurement will
yield, as a basis of selection in a particular experimental
situation, groups equivalent in their possibilities for prog-
ress. Nothing in the general technology of experimentation
so much needs to be investigated as this.
One widespread present practice is to attempt to secure
equivalence by equating groups on the basis of general
ability. If the experiment is concerned primarily with the
physical effects of certain EF’s, the groups are equated on
the basis of general physical ability determined by general
physical measurements. If the experiment is concerned with
the mental effects of the EF’s, groups are equated on the
basis of general mental ability measured by some intelli-
gence test or a series of educational tests.
Thus, if an experimenter were to equate on the basis of
an intelligence test, he would select and apply to the pupils,
who are otherwise known to be appropriate, some intelligence
test. If the children are primary pupils, he may
select and apply to the pupils one or more tests from among
such intelligence tests for primary pupils as those by Pres-
sey, Franzen, Otis, Haggerty, Dearborn, Trabue, Engel
(Detroit), Myers, and others. Or if he can afford the time
for testing he may select and apply to the pupils such indi-
vidual intelligence tests as those by Goddard, Terman,
Herring, Kuhlmann, Yerkes and Bridges, Witmer, and
others. If the children are elementary pupils, he may select
and apply one or more such group intelligence tests as those
by National Research Council, Haggerty, Otis, Dearborn,
Pressey, Trabue, Myers, Buckingham and Monroe, and
others, or such individual intelligence tests as those by
Goddard, Terman, Herring, Kuhlmann, Witmer, Yerkes and
Bridges. If the children are in high school he may select
and apply such group intelligence tests as those by Otis,
Terman, Dearborn, Trabue, Thurstone, and others. Indi-
vidual intelligence tests for high school students are not
very satisfactory. Group intelligence tests for college stu-
dents have been prepared by Thorndike, Thurstone and
others. If elementary pupils are foreign, or have a special
language handicap, such a group intelligence test as that by
Pintner or Liu or such an individual intelligence test as that
by Pintner and Paterson, may be used. Thorndike has
constructed group non-verbal intelligence tests for adults.
In selecting a series of educational tests to apply to pupils,
the experimenter has a large range of choice from such
reading tests as those by Thorndike-McCall, Monroe, Ayres-
Burgess, Courtis, Gray, and others; from such arithmetic
tests as those by Woody, Woody-McCall, Stone, Courtis,
Buckingham, Monroe, and others; from such spelling tests
as those by Ayres, Ayres-Buckingham, Ashbaugh, Starch,
Morrison-McCall, Monroe, and others; from such composi-
tion scales as those by Trabue, Thorndike, Hudelson, Wil-
ling, Lewis, and others; from such handwriting scales as
those by Ayres, Thorndike, Starch, Lister, and others; from
such English form tests as those by Charters, Briggs, Starch,
and others; from such geography scales as those by Courtis,
Hahn-Lackey, and others; from such history tests as those
by Harlan, Barr, Van Wagenen, Sackett, and others; and
so on for other subjects of the elementary and high schools.
Or instead, the examiner may use certain test booklets which
are combinations in a single booklet of a variety of educa-
tional tests or educational and intelligence tests. These
omnibus tests frequently yield a single score on the entire
booklet, thus avoiding the difficulty of combining separate
scores. Illustrations of such omnibus tests are those by
Buckingham and Monroe, Pintner, Chapman, Whipple, and
others.
Whatever intelligence test is used, some sort of a score
will result. The National Intelligence Test, for example,
yields a point score, and the pupil making the largest num-
ber of points is considered to have the highest general mental
ability. The Stanford Revision of the Binet-Simon Scale,
on the other hand, yields a mental-age score, and the pupil
making the highest mental age is considered to have the
highest mental ability.
Suppose that forty pupils are to be divided into two
equivalent groups on the basis of an intelligence test which
yields a mental age. Suppose that the test to be used has
been selected, ordered from the bureau which issues it,
applied to the forty pupils according to the standardized
directions sent with the test, and scored according to the
standardized method of scoring. Suppose also that the
resulting mental ages, when arranged in order of size, to-
gether with the chronological ages, are as shown in Table 1.
1 Descriptions, price lists, and samples of tests and the standard directions for
the tests may be secured from such distributing centers as World Book Company,
Yonkers, New York; Bureau of Publications, Teachers College, New York City;
Russell Sage Foundation, New York City; Public School Publishing Company,
Bloomington, Illinois; and C. H. Stoelting Company, Chicago, Illinois.
TABLE 1
CHRONOLOGICAL AGES AND MENTAL AGES OF 43 6TH GRADE PUPILS
Pupil  Chron.  Mental     Pupil  Chron.  Mental     Pupil  Chron.  Mental
       Age     Age               Age     Age               Age     Age
  1    124     153         16    123     127         30    133     114
  2    136     144         17    138     126         31    139     114
  3    135     142         18    134     126         32    130     114
  4    136     140         19    129     126         33    131     113
  5    120     139         20    133     126         34    149     111
  6    119     139         21    140     126         35    133     108
  7    141     139         22    129     126         36    133     105
  8    128     137         23    135     125         37    140     105
  9    135     136         24    134     124         38    151     102
 10    139     135         25    123     124         39    [illegible] 101
 11    120     132         26    [illegible] 122     40    159     101
 12    126     129         27    129     122         41    160     100
 13    130     129         28    115     121         42    160      99
 14    133     128         29    136     115         43    149      92
 15    142     128
Technique of Pairing Pupils.—The division of pupils
in Table 1 into two equivalent groups on the basis of mental
age may be done by a common-sense pairing of the pupils.
Nevertheless certain helpful suggestions and cautions can
be given. For one thing it will not be fully satisfactory to
pair the pupils into groups thus:
        Group I               Group II
     Pupil 1 — 153         Pupil 2 — 144
     Pupil 3 — 142         Pupil 4 — 140
     Pupil 5 — 139         Pupil 6 — 139
Such a procedure operates to give Group I a higher average
mental ability than Group II, as may be discovered by
trying it. Rather the general procedure for pairing should
be thus:
        Group I               Group II
        1 — 153               2 — 144
        4 — 140               3 — 142
        5 — 139               6 — 139
This method of pairing constantly tends to counteract the
tendency to give one group a higher average ability than the
other.
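The alternating order just described can be sketched in Python. This is only an illustration under stated assumptions: the function name `pair_alternating` is hypothetical, and the six mental ages are those of Pupils 1 to 6 in Table 1.

```python
from statistics import mean

def pair_alternating(ranked):
    """Split scores ranked high to low into two groups in the order I, II, II, I, I, II, ..."""
    group_1, group_2 = [], []
    for pair_no, start in enumerate(range(0, len(ranked), 2)):
        pair = ranked[start:start + 2]
        if pair_no % 2 == 1:
            pair = pair[::-1]  # every second pair gives its higher pupil to Group II
        group_1.append(pair[0])
        if len(pair) == 2:
            group_2.append(pair[1])
    return group_1, group_2

ages = [153, 144, 142, 140, 139, 139]   # Pupils 1-6 of Table 1
naive = (ages[0::2], ages[1::2])        # Group I always takes the higher pupil of a pair
alternating = pair_alternating(ages)

naive_gap = mean(naive[0]) - mean(naive[1])
alternating_gap = mean(alternating[0]) - mean(alternating[1])
print(naive_gap, alternating_gap)       # the alternating gap is the smaller of the two
```

With only six pupils the gap does not vanish, because Pupil 1 stands far above the rest; the alternation merely keeps the advantage from accumulating in one group.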
But even when this last procedure is followed, the mean
of the mental ages for one group may not be identical with
the mean of the mental ages for the other group. By a
special juggling of pupils two groups may be constituted
which have practically identical means. But such juggling
is seldom advisable. Unless care is exercised, it is likely
to result in an equivalence secured by pairing a gifted and
an ungifted pupil with two average pupils. The means will be
equated to be sure, but the variabilities will be unequal.
Such special juggling is helpful only when previously paired
pupils exchange groups.

TABLE 2
THE PUPILS OF TABLE 1 DIVIDED INTO TWO GROUPS OF EQUIVALENT MENTAL AGE
       Group I                  Group II
Pupil    Mental Age      Pupil    Mental Age
  2         144            3         142
  5         139            4         140
  6         139            7         139
  9         136            8         137
 10         135           11         132
 13         129           12         129
 14         128           15         128
 17         126           16         127
 18         126           19         126
 21         126           20         126
 22         126           23         125
 25         124           24         124
 26         122           27         122
 30         114           29         115
 31         114           32         114
 34         111           33         113
 35         108           36         105
 38         102           37         105
 39         101           40         101
 42          99           41         100
Mean      122.45         Mean      122.5
Certain modifications of the procedure recommended are
desirable. These modifications are illustrated in Table 2.
Pupil 1 is eliminated from the experiment entirely. His
mental age is so high, or rather it is so much above that of
any other pupil, that he cannot be even approximately
paired. The next pupil, namely, Pupil 2, is 9 points of
mental age below him. If for administrative reasons Pupil
1 must be included in the experimental classes he can still
be eliminated from this and all subsequent experimental
computation. Except for the influence his presence in one
of the groups will have, he can become experimentally non-
existent. Pupil 2 is substituted for Pupil 1. He pairs satisfactorily
with Pupil 3, so the pairing continues according to
rule until Pupil 28 is reached. Pupil 28 does not pair well
with Pupil 29, hence Pupil 28 does not appear in Table 2.
Pupil 29 appears in his place. The pairing continues with-
out interruption until Pupil 43 is reached. Partly because
he makes an odd number and partly because his inclusion
in either group will be distinctly unfair to that group, owing
to his low mental age, he does not appear in Table 2.
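The construction of Table 2 can be checked with a short Python sketch. It assumes the mental ages as read from Table 1 (Pupil 13 taken as 129, the value that reproduces the printed mean); the names `MENTAL_AGE`, `EXCLUDED`, and `pair_alternating` are hypothetical.

```python
from statistics import mean

# Mental ages of Pupils 1-43 as read from Table 1 (index 0 is Pupil 1).
MENTAL_AGE = [153, 144, 142, 140, 139, 139, 139, 137, 136, 135, 132, 129, 129, 128, 128,
              127, 126, 126, 126, 126, 126, 126, 125, 124, 124, 122, 122, 121, 115,
              114, 114, 114, 113, 111, 108, 105, 105, 102, 101, 101, 100, 99, 92]
EXCLUDED = {1, 28, 43}  # pupils the text drops because they cannot be paired

def pair_alternating(ranked):
    """Assign ranked items to two groups in the order I, II, II, I, I, II, ..."""
    group_1, group_2 = [], []
    for pair_no, start in enumerate(range(0, len(ranked), 2)):
        pair = ranked[start:start + 2]
        if pair_no % 2 == 1:
            pair = pair[::-1]
        group_1.append(pair[0])
        if len(pair) == 2:
            group_2.append(pair[1])
    return group_1, group_2

pupils = [p for p in range(1, 44) if p not in EXCLUDED]
group_1, group_2 = pair_alternating(pupils)
mean_1 = mean([MENTAL_AGE[p - 1] for p in group_1])
mean_2 = mean([MENTAL_AGE[p - 1] for p in group_2])
print(group_1, mean_1)
print(group_2, mean_2)
```

Under these assumptions the two lists of pupil numbers agree with the two columns of Table 2, and the means agree with those printed beneath it.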
Thus far it has been assumed that the pupils in Table 1
are to be divided into two equivalent groups only. The
procedure for dividing them into three equivalent groups is
as follows:
     Group I          Group II         Group III
     2 — 144          3 — 142          4 — 140
     7 — 139          6 — 139          5 — 139
     8 — 137          9 — 136         10 — 135
The procedure for equating four groups follows the same
general principle, thus:
     Group I          Group II         Group III        Group IV
     2 — 144          3 — 142          4 — 140          5 — 139
     9 — 136          8 — 137          7 — 139          6 — 139
    10 — 135         11 — 132         12 — 129         13 — 129
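The same principle extends to any number of groups: the ranked pupils are dealt out left to right on one row and right to left on the next. A minimal Python sketch follows; the function name `serpentine` is hypothetical and the inputs are pupil numbers from Table 1.

```python
def serpentine(ranked, n_groups):
    """Deal ranked items into n_groups, reversing direction on every second row."""
    groups = [[] for _ in range(n_groups)]
    for position, item in enumerate(ranked):
        row, col = divmod(position, n_groups)
        if row % 2 == 1:
            col = n_groups - 1 - col  # odd rows run from the last group back to the first
        groups[col].append(item)
    return groups

three = serpentine(list(range(2, 11)), 3)  # Pupils 2-10 into three groups
four = serpentine(list(range(2, 14)), 4)   # Pupils 2-13 into four groups
print(three)
print(four)
```

With two groups this reduces to the alternating pairing described earlier.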
Because of inequalities in room space or for other rea-
sons, it may not be practicable to have an equal number
of pupils in each group. If we assume that two-thirds of
the pupils in Table 1 are to be in Group I and the remainder
in Group II, the procedure for equating would be as shown
below. This assumption means that of every adjoining
group of three pupils, two will go into Group I and one into
Group II. The closest equivalence will be secured if the
middle pupil of each group of three is placed in Group II,
thus:
        Group I               Group II
        2 — 144               3 — 142
        4 — 140
        5 — 139               6 — 139
        7 — 139
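A minimal Python sketch of this two-to-one division is given here. The function name `split_two_to_one` is hypothetical, and sending any incomplete final trio to Group I is an assumption made only so that the function handles every input; the text does not cover that case.

```python
def split_two_to_one(ranked):
    """Of every adjoining three ranked pupils, send the middle one to Group II."""
    group_1, group_2 = [], []
    for start in range(0, len(ranked), 3):
        trio = ranked[start:start + 3]
        if len(trio) == 3:
            group_1 += [trio[0], trio[2]]
            group_2.append(trio[1])
        else:
            group_1 += trio  # assumption: an incomplete trio goes to the larger group
    return group_1, group_2

group_1, group_2 = split_two_to_one([2, 3, 4, 5, 6, 7])  # pupil numbers from Table 1
print(group_1, group_2)
```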
When one-fourth of the pupils are to be placed in one
group and three-fourths in the other, the pupils come in
groups of four instead of three, and hence there is no mid-
dle pupil. Of the first group of four pupils, namely, pupils
2, 3, 4, and 5, pupils 2, 4, and 5 may be placed in Group I
and pupil 3 in Group II, and of the second group of four
pupils, namely, pupils 6, 7, 8, and 9, pupils 6, 7, and 9 may
be placed in Group I and pupil 8 in Group II. Thus in the
first pairing, Group II gains a slight advantage, and, in the
second pairing, Group I gains an equivalent advantage.
This pairing by alternating advantage may be continued
similarly for the remaining pupils.
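The alternation for a three-to-one division can be sketched as follows. The function name `split_three_to_one` is hypothetical, and returning an incomplete final block as a separate leftover list is an assumption, since the text does not say how such pupils should be placed.

```python
def split_three_to_one(ranked):
    """Of every adjoining four ranked pupils, send one to Group II, alternating
    between the second-ranked and the third-ranked pupil of the block."""
    group_1, group_2, leftover = [], [], []
    for block_no, start in enumerate(range(0, len(ranked), 4)):
        block = ranked[start:start + 4]
        if len(block) < 4:
            leftover = block  # assumption: an incomplete block is left for the experimenter
            break
        pick = 1 if block_no % 2 == 0 else 2
        group_2.append(block[pick])
        group_1 += block[:pick] + block[pick + 1:]
    return group_1, group_2, leftover

group_1, group_2, leftover = split_three_to_one([2, 3, 4, 5, 6, 7, 8, 9])
print(group_1, group_2, leftover)
```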
The technique of equating groups on the basis of mental
age has been discussed. The procedure for equating groups
on the basis of point scores on an intelligence test is identi-
cal. The procedure is the same for equating groups on the
basis of a series of educational tests. The only difficulty
likely to be met in this last situation, or in any situation
where groups are being equated on the basis of more than
one test, is the difficulty of properly combining the scores
made by each pupil on the separate tests into a single score.
The procedure required to deal with this difficulty will be
described later in this chapter.
Groups Equated by Initial Status in Experimental
Trait.—When groups are equated on the basis of measure-
ment, the most convenient and perhaps most frequent basis
employed by experimenters for equating groups is that of
initial status in the experimental trait. This method is
convenient because it is necessary in most experiments to
give an initial test in order to measure the change produced
by the EF. This provides, without additional labor, scores
for the experimental subjects which may be used to divide
them into two or more groups. The procedure for making
this pairing is identical with that just described.
When the division of pupils into groups requires the
actual physical shifting of pupils, the division must be
made before the EF’s are applied. When such shifting is
not necessary, this detailed division is left until the EF’s
and FT’s have been applied and the experimental computations
have been started. Thus Pittman¹ wished to determine
the relative efficiency of the zone system of supervision
for rural schools as compared with the conventional
system. One group was composed of the schools of one
rural county and the other group of the schools of another
rural county. Here it was not feasible to transfer pupils
or schools from one county to another. What Pittman did
was to make a rough initial equating by choosing two rural
counties that were as nearly identical as possible in wealth,
quality of population, quality of teachers, and so on. He
applied the IT, appropriate EF, and FT to all the pupils
in grades III through VIII in each county. At the conclu-
sion of the experiment he arranged the pupils in one county
in the order of the size of their scores on the IT. He did
likewise with the pupils in the other county. He then elimi-
nated from subsequent computations all the pupils in one
group who could not be paired with an equivalent pupil in
the other group. The remaining pupils constituted his two
equivalent groups, and they were the ones used in computing
changes produced by the EF’s. Bennett, in a
Maryland rural county, followed an identical procedure,
except that he split one county into two roughly equivalent
parts.
¹ Pittman, M. S., The Value of School Supervision; Warwick and York, Baltimore, 1921.
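Delayed equating of this kind can be sketched in Python. Everything in the sketch is an assumption made for illustration: the pupil labels and IT scores are invented, the name `match_pairs` is hypothetical, and the greedy rule with a fixed tolerance is only one way of deciding that two pupils are equivalent. The text does not state the rule Pittman used.

```python
def match_pairs(scores_a, scores_b, tolerance=0):
    """Pair each pupil of group A with an unused pupil of group B whose initial-test
    score differs by no more than `tolerance`; unmatched pupils are dropped."""
    remaining_b = sorted(scores_b.items(), key=lambda item: item[1])
    pairs = []
    for pupil_a, score_a in sorted(scores_a.items(), key=lambda item: item[1]):
        for index, (pupil_b, score_b) in enumerate(remaining_b):
            if abs(score_a - score_b) <= tolerance:
                pairs.append((pupil_a, pupil_b))
                del remaining_b[index]  # each pupil of B can be used only once
                break
    return pairs

# Invented IT scores for pupils who were present for both the IT and the FT.
county_a = {"a1": 40, "a2": 52, "a3": 61}
county_b = {"b1": 41, "b2": 52, "b3": 75, "b4": 60}
pairs = match_pairs(county_a, county_b, tolerance=1)
print(pairs)  # b3 finds no partner and drops out of the computations
```

Because the matching is run only on pupils who completed the experiment, no pair has to be broken up afterwards on account of an absence at the FT.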
It would have been no advantage to Pittman or Bennett
to equate groups immediately after the application of the
IT. In fact it would have been a slight disadvantage. It
would not have been possible to segregate the chosen pupils
for the purpose of applying the EF or FT, and thereby
save the waste effort of applying EF and FT to all pupils
indiscriminately. So there would have been no gain here.
On the other hand there would have been a slight disad-
vantage in equating at the beginning due to the fact that
certain pupils selected for the experimental groups would
have been absent at the time of the FT thereby necessitating
their ultimate elimination, together with the paired pupil in
the other group. The paired pupil in the other group could
have been retained only on condition that an equivalent
pupil could have been found to take the place of the pupil
who was absent for the FT. All this trouble was avoided
by delaying the equating of groups until it was definitely
determined what pupils remained throughout the experi-
ment. In sum, wherever the actual physical shifting of
experimental subjects is not to take place, and, in addition,
wherever the experimental subjects proper are not to be
segregated for purposes of applying EF or FT, delayed
equating is preferable to early equating of groups. Initial
equating is essential or advisable wherever subjects are to
be shifted or segregated.
In actual practice the equating of groups is sometimes
not so simple as has been described, but the general principle
is the same. Thus Pittman and Bennett both used
many types of tests—reading, arithmetic, spelling, and so
on—in order to get a rather thorough measurement of all the
changes produced by each EF. Each of these dozen or so
tests was applied both at the beginning and at the end of the
experiment. Which type of test was used as the basis of
equating? Pittman and Bennett employed each type in
turn. Thus in comparing the amount of change in reading
produced by each EF, the groups were equated on the basis
of the initial scores in reading. When comparing the amount
of change in arithmetic produced by each EF, the pupils
employed were selected on the basis of the initial scores in
arithmetic. This procedure meant, of course, that the com-
position of the experimental groups changed somewhat with
each new equating, but the procedure assured an initial
equivalence of groups in the experimental trait under con-
sideration.
One additional suggestion may be given. The EF2 for
Pittman’s control group was merely the customary super-
vision. Since the application of EF2 involved no particular
effort on Pittman’s part, he used and tested many more
pupils in his control group than in the other. By doing
this he made it easy to find a pair for every pupil in the
group to which EF1 was applied, thereby avoiding the necessity
of discarding any of these pupils because of an inability
to pair them.
Groups Equated by Composite of Several Tests.—
Sometimes the experimenter desires to equate groups on the
basis of more than one test. This requires the experimenter
to make a composite of the scores on the various tests. To
equate separately for general-ability tests seldom serves any
useful purpose. To equate separately for each of several
experimental tests does serve a useful purpose, but there is
a certain inconvenience in having to alter the composition of
the group from time to time during the experimental com-
putation. To avoid this objection, some experimenters pre-
fer to equate groups on the basis of a composite of the initial
scores on all the experimental tests. This gives constancy
in the composition of the groups and gives an approximate,
if not an exact, equivalence for each experimental test, unless
the traits are markedly different in nature. In sum, there
are situations where equating by a composite of scores on
several tests is desirable.
The process of computing a composite is illustrated for a
small number of pupils in Table 3. The first vertical column
gives the identification number for each pupil. The
second, third, and fourth columns show the scores made by
each pupil on a reading, an arithmetic, and a spelling test respectively.
Beneath each of these columns appears a measure, the
standard deviation (S.D.), of the variability among
the scores of that particular column.

TABLE 3
ILLUSTRATING THE COMPUTATION OF A COMPOSITE SCORE WHERE EACH TEST
RECEIVES EQUAL WEIGHT
Pupil   Read.   Arith.   Spell.   Read.      Arith.     Spell.     Composite
                                  Weighted   Weighted   Weighted
  1      64      13       24        64         65         48         177
  2      68       9       21        68         45         42         155
  3      46       9       17        46         45         34         125
  4      54      14       27        54         70         54         178
  5      54      10       13        54         50         26         130
  6      72      12       20        72         60         40         172
  7      52      13       13        52         65         26         143
  8      43      11       24        43         55         48         146
  9      72      14       22        72         70         44         186
 10      46      12       18        46         60         36         142
 11      50      10       20        50         50         40         140
 12      46      11       21        46         55         42         143
 13      68      13       23        68         65         46         179
 14      61      13       26        61         65         52         178
 15      46       8       12        46         40         24         110
 16      64      11       28        64         55         56         175
 17      46      14       15        46         70         30         146
 18      43       9       15        43         45         30         118
 19      46       8       23        46         40         46         132
 20      56      13       25        56         65         50         171
S.D.      9.8     2.0      4.8       9.8       10.0        9.6
Mult.     1       5        2
The first step in the determination of the composite scores
shown in Table 3 was to compute some measure of variability,
in this case S.D. Any other standard measure of
variability, such as mean deviation, median deviation, or
quartile deviation, can be used instead. The computation
of the S.D. for a series of scores is illustrated in Table 15
and Table 16 and explained in the adjoining text.
The second step was to select multipliers which would give
equal weight to each test. Just what weight should be given
each test in determining a composite depends upon the con-
ditions encountered in the situation; but once a decision
has been reached, the procedure for selecting the multipliers
which will effect this weighting should utilize some measure
of variability, in this case S.D. That is, tests are weighted
according to their variabilities and not, as naive common-
sense would indicate, according to their means. For ex-
ample, ordinary common-sense would lead us to suppose
that Test I below has more influence than Test II in deter-
mining a pupil’s relative position in the composite of the
two tests, because its mean is relatively much larger. But
as a matter of fact, Test II has the more weight because its
variability is relatively larger. It has exactly ten times as
much weight because its variability is ten times that of
Test I. Mere inspection of the composite of the two tests
shows that Test II has a large influence upon the composite
and that Test I has only a negligible influence. The order
of the composite scores is the order of the scores in Test II.
Pupil    Test I    Test II    Composite
  a       1000       40         1040
  b       1001       30         1031
  c       1002       20         1022
  d       1003       10         1013
  e       1004        0         1004
Mean      1002       20
The two tests can be given equal weight either by multiplying
all the scores of Test I by 10 or by dividing all the
scores of Test II by 10. Either procedure will make their
variabilities equivalent. To illustrate this point, the scores
of Test II are divided by 10 in the following:
Pupil    Test I    Test II    Composite
  a       1000        4         1004
  b       1001        3         1004
  c       1002        2         1004
  d       1003        1         1004
  e       1004        0         1004
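The point that weight follows variability rather than the mean can be verified with a few lines of Python. The sketch uses the five pupils of the illustration and the standard-library `statistics.pstdev`; the variable names are hypothetical.

```python
from statistics import pstdev

test_1 = [1000, 1001, 1002, 1003, 1004]
test_2 = [40, 30, 20, 10, 0]

sd_ratio = pstdev(test_2) / pstdev(test_1)  # how much more variable Test II is
composite = [a + b for a, b in zip(test_1, test_2)]
equalized = [a + b / 10 for a, b in zip(test_1, test_2)]  # Test II divided by 10

print(sd_ratio)   # about 10
print(composite)  # falls from pupil a to pupil e, as Test II does
print(equalized)  # every pupil ties once the variabilities are equal
```

Test I rises from pupil a to pupil e while Test II falls by ten times as much, so the unweighted composite simply follows Test II.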
All this means that if the three tests in Table 3 are to be
given equal weight, such multipliers must be selected and
used on the test scores as will make their variabilities equal.
A multiplier of 1 for reading, of 5 for arithmetic, and of
2 for spelling will alter their S.D.’s to 9.8 for reading, 10.0
for arithmetic, and 9.6 for spelling, as shown in Table 3.
These variabilities are sufficiently equivalent for practical
purposes. By the use of fractional multipliers they can be
made exactly equivalent.
The multipliers just selected are not the only possible
ones. Equivalence of variability can be secured just as well
by multiplying reading by ½, arithmetic by 2½, and spelling
by 1, or by many other combinations. As a rule it is
most convenient to select only whole numbers for multipliers
or divisors, and to select as small numbers as possible.
Thus far it has been assumed that the three tests are to
receive equal weight. This is not necessary. Any desired
weight may be given. Thus if it is desired to give reading
twice as much weight as spelling and spelling two-and-a-half
times as much weight as arithmetic, all the multipliers will
be 1, because the variabilities of the three tests are in this
ratio originally. If it is desired to give arithmetic twice
the weight of reading, and reading twice the weight of
spelling, the multiplier for arithmetic will be 10, for reading
1, and for spelling 1, or other multipliers which will as satis-
factorily effect the weighting desired.
The third step in determining a composite is to multiply
the respective series of test scores by the multiplier selected
for that test. Thus, in Table 3, all the reading scores are
multiplied by 1, all the arithmetic scores by 5, and all the
spelling scores by 2. The products are shown in columns 5,
6, and 7.
The final step in computing a composite is to add the
weighted scores for the various tests for each pupil. Thus,
in Table 3, the addition of weighted scores 64, 65, and 48
yields a composite of 177. From this point the procedure
for equating groups has already been described.
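The four steps can be traced in Python on the scores of Table 3. This is a sketch under stated assumptions: the helper `whole_number_multipliers` and its target S.D. of 10 are hypothetical conveniences for arriving at small whole-number multipliers, not a rule given in the text, and the S.D. is computed with `statistics.pstdev` (division by the number of pupils).

```python
from statistics import pstdev

reading = [64, 68, 46, 54, 54, 72, 52, 43, 72, 46, 50, 46, 68, 61, 46, 64, 46, 43, 46, 56]
arithmetic = [13, 9, 9, 14, 10, 12, 13, 11, 14, 12, 10, 11, 13, 13, 8, 11, 14, 9, 8, 13]
spelling = [24, 21, 17, 27, 13, 20, 13, 24, 22, 18, 20, 21, 23, 26, 12, 28, 15, 15, 23, 25]
tests = [reading, arithmetic, spelling]

# Step 1: a measure of variability for each test.
sds = [round(pstdev(scores), 1) for scores in tests]

# Step 2: whole-number multipliers that bring each S.D. near a common target.
def whole_number_multipliers(score_lists, target_sd=10):
    return [max(1, round(target_sd / pstdev(scores))) for scores in score_lists]

multipliers = whole_number_multipliers(tests)

# Steps 3 and 4: weight each score, then add the weighted scores for each pupil.
composite = [sum(m * score for m, score in zip(multipliers, pupil))
             for pupil in zip(*tests)]

print(sds, multipliers)
print(composite)
```

For unequal weights the multipliers would be chosen so that the weighted S.D.'s stand in the desired ratio instead of being equal.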
Groups Equated by Preliminary Rate of Growth.—
There are competent experimenters who contend that the
best index of future rate of growth, or of possibilities for
future growth, is current rate of growth. They advise, there-
fore, that the experimenter test his experimental pupils at
intervals preceding the experiment in order to determine the
rate at which each pupil is developing in the experimental
trait. Once this rate has been determined, pupils may be
paired on this basis.
But we cannot be certain that equating by current rate
of growth is superior to, say, equating by initial status in
the trait in question. The latter is pairing by actual rate
of growth as truly as is the former. The former means
pairing by rate of growth as determined for a necessarily
relatively brief time, whereas the latter means pairing by
rate of growth measured from birth to the present. The
greater accuracy of the rate-of-growth method of equating
is, then, somewhat dubious, and its greater inconvenience is
certain. As a result, the method is not likely to come into
general use until its superiority has been definitely established
by investigation. The most relevant study thus far
conducted, namely, that by Hollingworth,¹ was planned for
another purpose.
¹ Hollingworth, H. L. and L. S., Vocational Psychology, D. Appleton and Company, New York,
Besides those already discussed, there are many other
bases which may or may not be worthy of consideration,
depending upon the nature of the experiment. Among
these the following may be mentioned: chronological age,
physiological age, social age, previous training, and home
environment in case this last cannot be controlled experi-
mentally.
Any one or all of these may exercise an influence in de-
termining a pupil’s possibilities for growth in the trait in
question.
Groups Equated by Multiple Bases.—Any one basis
for equating groups is bound to fall short of complete satis-
faction, because it is necessarily inadequate. A human
mechanism is exceptionally complex. Any one basis taps
only a phase of this total mechanism. A perfect prophecy
can be made only when every phase of this mechanism is
properly measured and properly weighted.
Again, any one basis fails to give complete satisfaction
because of the intricate dependence of one basis upon an-
other or of one part of the human mechanism upon another.
It will be sufficient to cite two simple illustrations of this
dependence. An intelligence test shows two pupils, A and
B, to have identical mental ages, namely 12 years and 12
years, respectively. May they be paired with reasonable
assurance that the two will progress at equal rates in the
future, except for differences in effectiveness of the EF’s?
Perhaps two groups can be equated on this sole basis pro-
vided the number of pupils is large. But two pupils cannot
be equated without taking other factors into consideration.
If, for example, Pupil A is 10 years old chronologically, and
Pupil B 12 years old chronologically, they are not equiva-
lent pupils. Pupil A has progressed mentally since birth
much faster than has Pupil B, for he has progressed in 10
years as far as Pupil B in 12 years. The conventional
method for expressing this rate of mental growth is the
Intelligence Quotient, computed by dividing mental age by
chronological age, and by multiplying the quotient by 100.
Thus the Intelligence Quotient for Pupil A is (12 ÷ 10) ×
100, i.e. 120, whereas that for Pupil B is (12 ÷ 12) × 100,
i.e. 100.
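The computation is simple enough to state as a one-line Python function; the name `intelligence_quotient` is hypothetical, and ages may be given in years or in months so long as both use the same unit.

```python
def intelligence_quotient(mental_age, chronological_age):
    """Mental age divided by chronological age, multiplied by 100."""
    return 100 * mental_age / chronological_age

iq_a = intelligence_quotient(12, 10)  # Pupil A
iq_b = intelligence_quotient(12, 12)  # Pupil B
print(iq_a, iq_b)
```

Two pupils of equal mental age thus receive different quotients whenever their chronological ages differ.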
But the fact that they cannot be paired because their
Intelligence Quotients are different does not mean at all that
they can be paired if their Intelligence Quotients are identi-
cal. A ten-year-old pupil with a mental age of 10 years may
not be equivalent to a fourteen-year-old pupil with a men-
tal age of 14 years, even though both have Intelligence
Quotients of 100. This means that equating is improved
by pairing pupils who are alike both in mental age and
Intelligence Quotient or, stated more conveniently, who are
alike in both mental age and chronological age. In similar
manner, chronological age conditions all the bases for
equating groups.
For a second illustration of this dependency of one basis
upon another, we may take the case of the dependence of
initial status in the experimental trait upon previous train-
ing. Two pupils who have like initial scores in the experi-
mental trait may have widely different promise for future
rate of growth. One may have attained his initial status
after much training and the other after little training. In
the case of the former pupil, a low score probably means a
low physiological limit of growth and hence little promise
for the future. In the latter case a low score probably means
a high physiological limit and hence great promise for the
future. In similar manner, a high score may mean great
promise or little promise, depending upon the amount of
training required to produce the high score.
Wherever feasible, then, groups should be equated on as
many bases as possible. Pupils should be paired who are
alike in initial status in the experimental trait, in mental age,
in chronological age, in home environments, in sex, in race,
and so on for all significant bases. In actual practice, pair-
ing is seldom done on more than three bases, namely,
initial status in experimental trait, mental age, and chrono-
logical age. Pairing is usually done on just one basis, in-
itial status in the experimental trait or mental age, with the
preference for the former.
Equating is usually done on just one basis, first, because
every increase in the number of bases employed reduces the
number of pupils who can be satisfactorily paired from a
given total number of pupils; and, second, because equating
on one basis tends to make the groups have approximately
equivalent means and variabilities on any other basis, even
though particular pupils do not pair on all the bases. The
existence of this latter tendency is due both to the positive
correlation likely to obtain between desirable bases and to
the operation of chance. Those who equate on a variety of
bases rarely insist that paired pupils be identical on the vari-
ous bases. Rough equivalence is all that is ever secured.
Even where equating is done on one basis only, it is fre-
quently possible to increase the equivalence on some other
bases merely by shifting paired pupils from one group to
the other.
Mason D. Gray has called attention to a unique diffi-
culty in equating two groups. Because of the close correla-
tion between intelligence and vocabulary, we would expect
normally that two groups which have been equated on the
basis of intelligence would be found thereby to have been
equated, at least approximately, on the basis of vocabulary.
But Gray reports that when a group which has elected high-
school Latin is equated on the basis of intelligence with a
group which has not elected Latin, the Latin group has a
higher vocabulary ability than the non-Latin group. It is
highly improbable that such would be the case if both groups
were indiscriminately mingled and if students were assigned
by the experimenter to the Latin EF and the non-Latin EF
without regard to students’ preferences. In general, the ex-
perimenter needs to be particularly alert in equating groups
which have been divided previously on the basis of some
intrinsic psychological difference between them.
Groups Equated by the A. Q. or F Technique.—
Whenever possible, groups should be equated. Whenever
conditions do not permit this, it is possible to equate pupils
statistically by means of the A. Q. or F technique. The
effect of these techniques is to take a group, no matter what
its ability, whether high, average, or low, and convert it into
a standard group.
The underlying principle of the A. Q. or F techniques
is that it demands of each pupil a progress commensurate
with his brightness, and provides a formula for testing
whether progress has been commensurate with capacity to
progress. A class with low capacity is asked to make a
defined amount of progress in a defined time. A class with
high capacity is asked to make a proportionately greater
progress. If each group under its own EF just exactly
makes its expected progress, both EF’s may be considered
of equal effectiveness.
Suppose that the experimental trait is reading. Then the
equivalent-groups formula becomes:
S1 — (Initial A. Q. — EF1 — Final A. Q. — A. Q. Change)
S2 — (Initial A. Q. — EF2 — Final A. Q. — A. Q. Change)
where
Initial A. Q. = Initial reading age ÷ Initial mental age
Final A. Q. = Final reading age ÷ Final mental age
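As a rough illustration, the A. Q. change for one group might be computed as below. The ages are invented for the example, the function names `aq` and `aq_change` are hypothetical, and the quotient is left as a plain ratio exactly as the formula above states it.

```python
def aq(reading_age, mental_age):
    """A. Q. as defined above: reading age divided by mental age."""
    return reading_age / mental_age

def aq_change(initial_reading, initial_mental, final_reading, final_mental):
    """Final A. Q. minus initial A. Q. for one pupil or one group mean."""
    return aq(final_reading, final_mental) - aq(initial_reading, initial_mental)

# Invented mean ages in months for the group taught under EF1.
change_ef1 = aq_change(initial_reading=120, initial_mental=132,
                       final_reading=135, final_mental=144)
print(round(change_ef1, 4))
```

The same computation for the group under EF2 yields the second A. Q. change, and the two changes are then compared.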
The computation of reading age is explained by the direc-
tions booklet which accompanies the Thorndike-McCall
Reading Scale.¹
The computation of mental age is explained in Terman’s
“The Measurement of Intelligence.”²
¹ Issued by the Bureau of Publications, Teachers College, New York City.
² Houghton Mifflin Company, Boston.
The final reading age will have to be determined by a
retest. The final mental age may be determined statistically
without a retest, due to the fact that a pupil’s Intelligence
Quotient, i.e. mental age divided by chronological age, is
fairly constant. The final mental age may be computed by
means of the following formula:
Final mental age = Initial mental age + (Initial mental age ÷ Initial chronological age)
× the no. of months between initial and final reading tests.
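A worked instance of this formula in Python is given below; the function name `final_mental_age` and the pupil's ages are invented for the illustration, with all ages in months.

```python
def final_mental_age(initial_mental_age, initial_chronological_age, months_elapsed):
    """Project mental age forward on the assumption of a constant Intelligence Quotient."""
    growth = initial_mental_age * months_elapsed / initial_chronological_age
    return initial_mental_age + growth

# A pupil with mental age 144 and chronological age 120, retested 10 months later.
projected = final_mental_age(144, 120, 10)
print(projected)  # 156.0
```

The pupil gains 1.2 months of mental age for each calendar month, which is the ratio of his mental age to his chronological age.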
The computation of mental age presents no difficulty if
such tests as the Stanford Revision of the Binet-Simon Scale
or the Herring Revision of the Binet-Simon Scale are used.
These tests yield a score in terms of mental age. If some
other intelligence test which yields point scores is used,
these point scores can be transmuted into approximate men-
tal ages, provided age norms are available. Tentative age
norms for a few ages on the National Intelligence Test, Form
A, are given below. A pupil’s score of 90 is equivalent to a
mental age of 138. A score of 75 is equivalent to a mental
age of 126. A score of 95.5 is equivalent to a mental age
of 144.
Chronological age in years......    10½    11½    12½    13½
Chronological age in months....    126    138    150    162
National Intelligence Test norms    75     90    101    112
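The transmutation of a point score into an approximate mental age amounts to linear interpolation between adjoining norms. A minimal Python sketch follows; the function name `score_to_age` is hypothetical, and refusing scores outside the tabled norms is an assumption, since the text gives no rule for them.

```python
NIT_NORMS = [75, 90, 101, 112]      # National Intelligence Test, Form A
AGES_MONTHS = [126, 138, 150, 162]  # corresponding ages in months

def score_to_age(score, norm_scores, norm_ages):
    """Interpolate linearly between the two norms that bracket the score."""
    for i in range(len(norm_scores) - 1):
        low, high = norm_scores[i], norm_scores[i + 1]
        if low <= score <= high:
            span = norm_ages[i + 1] - norm_ages[i]
            return norm_ages[i] + (score - low) * span / (high - low)
    raise ValueError("score lies outside the tabled norms")

print(score_to_age(90, NIT_NORMS, AGES_MONTHS))
print(score_to_age(75, NIT_NORMS, AGES_MONTHS))
print(score_to_age(95.5, NIT_NORMS, AGES_MONTHS))
```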
The computation of reading ages is provided for in the
directions which accompany the Thorndike-McCall Reading
Scale. Reading ages on other reading tests, spelling ages,
arithmetic ages, etc., may be computed, provided age norms
are available, by simply transmuting point scores on some
reading test, spelling test, or arithmetic test into reading
ages, spelling ages, or arithmetic ages respectively, as has
just been illustrated for the National Intelligence Test.
Unfortunately most educational tests report grade norms
rather than age norms. Even so, approximate age scores
may be computed by substituting for each grade its chrono-
logical age equivalent. The first two rows of the data shown
below will be the same regardless of the test which appears
in the third row. The third row will vary with the test.
In the following case, a point score of 37.8 on the Ayres
Spelling Scale, 10 words each from columns L, O, Q, S, U,
and W becomes a spelling age of 141. A point score of 50.3
becomes a spelling age of 167. A point score of 49 becomes
a spelling age of 161.
End of grade                           I     II    III    IV     V     VI    VII   VIII
Approx. ch. age equivalent of grade   89    102    115    128    141    154    167    180
Ayres Spelling Test grade norm..                  19.6   30.4   37.8   47.7   50.3   54.4
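The same interpolation applies when only grade norms are available. The sketch below assumes that the six Ayres norms belong to the ends of grades III to VIII, which is the alignment implied by the worked figures in the text; the names are hypothetical, and exact fractions are used so that the half month in the last example is not blurred by rounding error.

```python
from fractions import Fraction

GRADE_AGES = [115, 128, 141, 154, 167, 180]  # age equivalents, ends of grades III-VIII
AYRES_NORMS = [Fraction(n) for n in ("19.6", "30.4", "37.8", "47.7", "50.3", "54.4")]

def spelling_age(score):
    """Interpolate a spelling age in months from the grade norms."""
    s = Fraction(str(score))
    for i in range(len(AYRES_NORMS) - 1):
        low, high = AYRES_NORMS[i], AYRES_NORMS[i + 1]
        if low <= s <= high:
            return GRADE_AGES[i] + (s - low) * (GRADE_AGES[i + 1] - GRADE_AGES[i]) / (high - low)
    raise ValueError("score lies outside the tabled norms")

print(spelling_age(37.8))
print(spelling_age(50.3))
print(float(spelling_age(49)))  # 160.5, which the text reports as 161
```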
The computation and use of reading age, spelling age, men-
tal age, A. Q., and the like, when age norms are available
and when only grade norms are available, is discussed more
fully in “How to Measure in Education.”²
F has the same function and significance as A. Q.
Tests scaled according to the age-scale system use A. Q.,
whereas tests scaled according to the T-Scale system use F.
These two scale systems will be described in Chapter V. In
case F is used in place of A. Q., the equivalent-groups for-
mula becomes:
S1 — (Initial F — EF1 — Final F — F Change)
S2 — (Initial F — EF2 — Final F — F Change)
As will be explained more fully in Chapter V, F, in case
the experimental trait is reading, is computed thus:
Initial F = Initial reading T − initial intelligence T
Final F = Final reading T − final intelligence T
The initial and final reading T require the application of
both an initial and final reading test; whereas the final
intelligence T may be computed from the initial intelligence
T, through the use of each pupil’s B or brightness score.
The steps in the process are: (1) Compute the pupil’s B
score. Assume that the pupil’s T score is 38 and that his
age is exactly 10 years, 0 months. Then, by Table 11
(p. 109), his B score is 38 + 12, i.e. 50. (Assume that
Table 11 is for the intelligence test in question.) (2) If the
experiment continues ten months locate in Table 11 the B
correction corresponding to this pupil’s age ten months later.
² The Macmillan Company, New York City.
Ten months later he will be aged 10 years and 10 months.
The B correction for this age is 8. Were the experiment to
run for four months the B correction would be 10. Assume
the experiment to run 10 months. (3) Subtract this B cor-
rection of 8 from the initial B score of 50. The result is 42,
which is the desired final intelligence T, required to compute
the final F. The final B correction of 8 is subtracted from
the initial B score, even if the caption at the top of Table 11
says “add.” In transmuting a T score into a B score, add
the B correction when the caption says to add and subtract
the B correction when the caption says to subtract. But
in transmuting a B score back into a T score reverse the
process.
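The three steps above may be sketched as follows. This is a minimal illustration in Python, assuming that the B correction is entered as a signed number (positive where Table 11 says to add, negative where it says to subtract). The function names and the reading T scores of 40 and 47 are hypothetical; the intelligence figures are those of the example.

```python
def brightness_score(t_score, b_correction):
    """B score: the T score plus the signed B correction for the present age."""
    return t_score + b_correction


def final_intelligence_t(initial_t, initial_correction, final_correction):
    """Estimate the final intelligence T without giving a second test.

    The B score is treated as constant over the experiment, so the final T
    is the B score minus the B correction for the pupil's age at the end.
    """
    return brightness_score(initial_t, initial_correction) - final_correction


def f_score(trait_t, intelligence_t):
    """F: the T score in the experimental trait minus the intelligence T."""
    return trait_t - intelligence_t


# The pupil of the example: T of 38 at 10 years 0 months (correction 12),
# experiment lasting ten months (correction 8 at 10 years 10 months).
initial_t = 38
final_t = final_intelligence_t(initial_t, initial_correction=12, final_correction=8)

# Hypothetical reading T scores, used only to show the F computation.
initial_f = f_score(40, initial_t)
final_f = f_score(47, final_t)
f_change = final_f - initial_f
```

Here `final_t` is 42, as in the text, and a four-month experiment (correction 10) would give 40. With the hypothetical reading scores the F change is 3.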
The Thorndike-McCall Reading Scale yields a T score
directly just as certain tests yield an age score directly. The
process for utilizing age or grade norms for converting scores
on any test into age scores has just been described. The
following shows the approximate T-score and B-correction
equivalents of age scores for any mental or educational test.
The T and B equivalents for intervening ages may be de-
termined by simple interpolation.
Age             6½   7½   8½   9½  10½  11½  12½  13½  14½  15½  16½  17½
T score          0   13   25   32   39   44   50   53   57   63   70   77
B correction    50   37   25   18   11    6    0   -3   -7  -13  -20  -27
Equating groups through the A. Q. or F technique assumes
that rate of growth in the trait in question will be propor-
tional to intelligence, except for the differing effects of the
two EF’s. This assumption is justified when the trait in
question is a general mental function like reading, spelling,
arithmetic, geography, etc. The assumption is of doubtful
validity for specialized mental functions. Specialized pro-
phetic tests may be available some day for such specialized
mental functions.
CHAPTER IV
CONTROL OF EXPERIMENTAL CONDITIONS
Constant vs. Variable Irrelevant Factors.—In the
actual conduct of an experiment an experimenter must con-
tend with both constant and variable irrelevant factors.
Variable irrelevant factors do not particularly annoy the
experimenter. They are chance influences which operate
favorably as frequently as they operate unfavorably for a
particular EF. A multitude of such factors are unavoid-
ably playing upon experimental pupils throughout even the
best controlled educational experiments. In the long run,
their net effect is zero. The net result of constant irrele-
vant factors, on the contrary, is not a zero facilitation or
inhibition of a particular EF. They are any undesired
influences whose net result is favorable or unfavorable to
some EF.
An experimenter may ignore truly variable irrelevant fac-
tors, but he cannot ignore significant constant irrelevant
factors. He must either eliminate them, or else determine
the amount of their influence and allow for it in computing
the amount of change produced by the EF in question. The
ability to detect and eliminate constant irrelevant factors
is one of the distinguishing marks of a sagacious experi-
menter.
This chapter will be devoted to an enumeration of the
more common constant irrelevant factors, and to suggested
methods of eliminating them. This list should be studied
not with the idea that it is complete or that every factor
listed would be a constant error in every situation. Mere
maturing, for example, introduces a constant error in ex-
periments whose object is to determine the amount of
change due directly to an EF, whereas its influence may be
ignored in experiments whose object is to determine the
relative effectiveness of two or more EF’s.
The purpose of this chapter is the amplification and
illustration of the fundamental principle of experimenta-
tion—that changes in experimental subjects due to irrele-
vant factors should be eliminated, equated, or accurately
measured and discounted. The importance of any irrelevant
factor varies with the amount of its contribution to each
EF, where the purpose of the experiment is to determine
the amount of change in experimental subjects due directly
to each EF, and varies with the difference in amount of its
contribution to each EF, where the purpose of the experi-
ment is to determine the relative effectiveness of two or
more EF’s.
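A small numerical sketch may make the principle concrete. It assumes, purely for illustration, that the observed change under an EF is the sum of the change truly due to the EF and the contribution of a single irrelevant factor (here mere maturing). All names and figures are hypothetical.

```python
def observed_change(true_effect, irrelevant_contribution):
    """Assumed additive model: observed change = EF effect + irrelevant factor."""
    return true_effect + irrelevant_contribution


# Hypothetical figures: maturing adds the same 3 points under both EF's.
true_effect = {"EF1": 5.0, "EF2": 2.0}
maturing = {"EF1": 3.0, "EF2": 3.0}

observed = {ef: observed_change(true_effect[ef], maturing[ef]) for ef in true_effect}

# Error when the object is the amount of change due directly to each EF:
# the whole contribution of the irrelevant factor.
error_in_amount = {ef: observed[ef] - true_effect[ef] for ef in true_effect}

# Error when the object is the relative effectiveness of the two EF's:
# only the difference between the two contributions.
error_in_difference = (observed["EF1"] - observed["EF2"]) - (
    true_effect["EF1"] - true_effect["EF2"]
)
```

Under these assumptions each amount is overstated by 3 points, while the difference between the EF's is not disturbed at all, because the factor contributes equally to both.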
Errors Due to Bias of Experimenters.—Conscious or
unconscious manifestation of bias on the part of an experi-
menter is a common constant error. This constant irrele-
vant factor is of special significance because there are so
many points in an experiment where an experimenter’s bias
can influence the final conclusion. Of course anyone who
consciously favors unfairly in any way any EF, is mentally
incompetent to conduct experiments. He is, to say it less
politely, an experimental cheat. He is employing the ap-
pearance of experimentation to secure a readier acquiescence
on the part of others to his own emotional prejudice. Con-
scious bias is so human as to be sometimes unavoidable.
But to be biased is one thing; consciously to allow this bias
to modify experimental arrangements is quite another.
A manifestation of unconscious bias is far more likely
to occur. It is extremely difficult for an experimenter to
remain exactly neutral. With some individuals, conscious
bias for a particular EF will cause them to favor it uncon-
sciously. Other individuals will be so meticulously careful to
avoid favoring a favorite EF as actually to favor the con-
trasted EF. Impressed by the conflicting results obtained
from various investigations of the amount and nature of sex
differences, Cattell caustically remarked that the sex dif-
ferences discovered depended upon the sex of the investi-
gator.
In many experiments it is possible to take certain pre-
cautions against manifestations of a possible bias. Thus,
Poffenberger, in his experiments to determine the mental
effect of doses of strychnine, numbered the capsules. He
then proceeded to forget just which did and which did not
contain strychnine. He did not refresh his memory until
the experiments had been concluded, tests given and scored,
etc. Pittman, in pairing pupils at the end of his experi-
ment with the zone system of supervision, covered up the
final scores of pupils, lest he show a possible bias by pairing
with knowledge of the amount of change produced by each
EF. Another investigator wished to determine whether
judges varied more in judging the merits of compositions
containing much originality than in judging specimens con-
taining little originality. This investigator was careful to
choose the specimens containing much and those containing
little originality before securing, much less consulting, the
judgments of merit. By a system of key numbers and by
other devices it is possible in many experiments to reduce
the opportunities for bias to manifest itself.
Errors Due to Bias of Assistants.—Skepticism regarding
conclusions where adequate supporting data are not
produced, and the reverse mental attitude where data are
produced, are eminently desirable traits. Such skepticism
or enthusiasm is on the increase in education, and this in-
crease should receive every encouragement. But there is a
lop-sided skepticism or enthusiasm which is really nothing
more than irrational prejudice. Many who pride themselves
upon their insistence upon proof are really priding them-
selves upon an irrational prejudice for one alternative,
usually the present practice, and an equally irrational preju-
dice against the other alternative. The experimenter, in
organizing cooperative experimentation, will meet both varie-
ties among teachers, supervisors, superintendents, or other
experimental assistants. There is some hope that the rational
skeptic or enthusiast will subordinate his preferences to the
objects of the experiment. There is little hope that the
irrational individual will be able to do so. Neither variety
makes an ideal experimental assistant. The ideal assistant
is one who is genuinely uncertain as to which EF is superior.
The way to avoid bias upon the part of assistants depends
upon the experiment. But certain common precautions may
be listed. One way is to avoid assistants who have a bias,
or where they cannot well be avoided they may be eliminated
from all computations. This avoidance or elimination
may be employed provided the experimenter has some
objective way to determine which assistants will manifest
or have manifested bias. Lacking such objective data the
experimental assistants chosen may manifest merely the
experimenter’s own bias. Any assistant who confesses to a
preference may reasonably be assumed to hold such a pref-
erence.
Another way to avoid bias is to equate it. This can be
done, roughly at least, by using as many assistants who are
favorable to one EF as there are assistants favorable to the
other EF or EF’s. Such an equating may prove satisfac-
tory in experiments whose only object is to determine the
relative effectiveness of two or more EF’s. The procedure
for equating teachers or other assistants is, in general, like
that for equating groups of pupils.
Finally, something may be accomplished by impressing
upon assistants the necessity for experimental neutrality in
thought and deed, and by providing them with detailed type-
written instructions as to what to do. Few realize the
extraordinary difficulty of maintaining perfect self-control,
particularly where a preference has already developed. The
careless assistant is in danger of manifesting the preference
and the conscientious assistant of going to the other extreme.
The provision of detailed instructions will tend to minimize
such manifestations.
Bound up with this problem of bias is the whole question
of just how much effort should be expended upon each EF.
A fundamental principle of experimentation is that there
should be an accurate measurement of the amount of the
experimental factor. Thus in the physical sciences, a com-
mon procedure is to add an EF of defined amount and
measure the result, or subtract an EF of defined amount and
measure the result, or both add and subtract in succession
an EF of defined amount and measure the result, or both
add and subtract in succession an EF of varying amounts
and measure the changing results with each increase or
decrease in the amount of the EF. Probably the greatest
defect in educational experimentation is the inability, in
most cases, to measure accurately the amount of presence
of an EF. Further, there is some, though meager, evidence
that maximum effort can be maintained more constantly than
any effort lower than maximum. These facts and proba-
bilities would lead one to infer that it is better, not only
educationally but experimentally, to aim at maximum effort
all the time for each EF.
Though evidence on this question is meager, there is
some reason to believe that the mere process of experi-
menting with new methods or materials of instruction, at-
tracts such attention to the traits in question as to cause
an unconscious concentration, both on the part of teacher
and pupils, upon progress in these traits. As a result, it is
supposed that a large temporary effort is called forth, thus
causing a large but artificial growth, and that this artificial
effort would evaporate if the novel methods or materials were
used term after term. Consciousness of the possibility of
such bias may help the experimenter to avoid it, but the
only sure way to determine whether ephemeral effort has
been evoked is to continue the experiment for a consider-
able period. If each succeeding term shows a flagging of
effort and an elimination or reduction of superiority, the
existence of such ephemeral effort may be assumed.
Errors Due to Differences in Teaching Skill.—Research
on a large scale frequently requires coöperation on
the part of many superintendents, supervisors, and teachers.
My own experience in such work has been one continuous
surprise as to the trouble members of the educational profession
will take to coöperate fully in scientific research.
Still, one finds occasional instances of unwilling teachers or
superior officers. The trouble with such individuals from
an experimental standpoint is that they will inadequately
apply a particular EF and be careless about maintaining
desired experimental conditions in general.
Again, there are wide differences in teaching skill or
supervising skill. If one group is taught by an unskillful
teacher according to one EF and another equivalent group
is taught by a skillful teacher according to another EF, any
difference in the change produced may be due to a differ-
ence in teaching skill rather than a difference in effective-
ness of the contrasted EF’s. This difference may be due
to the operation of special forces or to a real difference in
skill. Thus one experimenter grumbles that one of his EF’s
did not have a fair chance because so many of the teachers
who were assigned to apply this particular EF turned out
to be bride-teachers. Another experimenter found that one
EF had suffered from more frequent changes of teachers
than the other EF. Still another experimenter found that
substitute teachers were more frequent under one EF than
another.
The experimenter must attempt, then, to avoid experi-
mental errors due to a difference in general unwillingness,
and a difference in general capability on the part of
assistants.
He must guard also against errors due to peculiar fitness
or unfitness for applying an EF. The general efficiency of
two teachers, for example, may be equal. But one may be
peculiarly unskilled in the teaching of arithmetic. This
special disability makes it unwise to use her for applying
some EF whose object is to increase pupils’ ability in arith-
metic. The other EF applied by the other teacher has an
advantage, or if the same teacher applies both EF’s, it is
possible that her special abilities and disabilities favor one
EF and handicap another.
Five general methods have been employed for avoiding
or reducing experimental errors due to a difference in, say,
teaching skill. One method is to equate the skill of the
teachers assigned to each EF. This pairing of teachers is
done on the basis of some preëxperimental measurement of
each teacher’s efficiency of teaching. These measurements
may be by means of objective tests or may be judgments of
supervisory officers.
A second method is to equate teachers by chance. To do
this means that the experiment must be conducted in numer-
ous classes to insure that chance will provide equivalence
in teaching skill. This method is very laborious but it
increases the probability of securing both equivalence and
representativeness of teaching skill.
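Equating by chance can be sketched as a random assignment of teachers to EF’s. The Python below is only an illustration under stated assumptions: the function name `assign_by_chance`, the teacher labels, and the fixed seed are hypothetical. Chance balances skill only on the average, which is why numerous classes are required.

```python
import random


def assign_by_chance(teachers, efs, seed=None):
    """Shuffle the teachers, then deal them out to the EF's in turn.

    Dealing in turn keeps the number of teachers per EF as nearly equal as
    possible; the shuffle leaves to chance which teachers fall under which EF.
    """
    rng = random.Random(seed)  # a fixed seed makes the assignment repeatable
    shuffled = list(teachers)
    rng.shuffle(shuffled)
    assignment = {ef: [] for ef in efs}
    for position, teacher in enumerate(shuffled):
        assignment[efs[position % len(efs)]].append(teacher)
    return assignment


# Hypothetical example: twenty teachers divided between two EF's.
teachers = [f"teacher_{i}" for i in range(1, 21)]
groups = assign_by_chance(teachers, ["EF1", "EF2"], seed=7)
```

Each EF receives ten teachers in this example. Recording the seed allows the assignment to be reproduced and inspected afterward, which also guards against the experimenter's own bias entering the selection.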
A third method is the departmental method, namely, to
have the same teacher apply both or all EF’s; then, gen-
erally superior teachers will be equally favorable to each
EF, and the generally inferior teachers will be equally un-
favorable to each EF.
A fourth method is to have two teachers divide the
work of two classes. Thus when the New York State Com-
mission on Ventilation was contrasting two EF’s on two
equivalent classes in a public school in New York City,
the two classes were placed in adjoining rooms, one teacher
teaching half the studies to both groups, and the other
teacher teaching the other half to both groups.
A fifth method is to rotate the teachers so that each EF
has every teacher. To illustrate how this can be done there
is repeated below the formula for a rotation experiment.
It may be observed that the teacher of S1 will appear under
each EF, and the teacher of S2 will appear under each EF,
thereby equating any difference in general teaching skill.
S1 — (IT1 — EF1 — FT1 — C1) — (IT1 — EF2 — FT1 — C2)
S2 — (IT1 — EF2 — FT1 — C3) — (IT1 — EF1 — FT1 — C4)
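The rotation can be written out mechanically. The sketch below, in Python, is an assumed generalization of the two-group formula to any number of groups equal to the number of EF’s; the function name `rotation_schedule` is hypothetical.

```python
def rotation_schedule(groups, efs):
    """Return, for each group, the EF it receives in each successive period.

    Group number i begins with EF number i and then moves through the
    remaining EF's in cyclic order, so that every group (and hence its
    teacher) comes under every EF exactly once.
    """
    if len(groups) != len(efs):
        raise ValueError("this sketch needs as many groups as there are EF's")
    n = len(efs)
    return {
        group: [efs[(start + period) % n] for period in range(n)]
        for start, group in enumerate(groups)
    }


schedule = rotation_schedule(["S1", "S2"], ["EF1", "EF2"])
```

For two groups the schedule gives S1 the order EF1, EF2 and S2 the order EF2, EF1, as in the formula. In every period each EF is applied by exactly one teacher, so differences in general teaching skill fall equally upon each EF.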
It is useful for the experimenter to distinguish in this
connection two varieties of experimental situations. In one
variety the teacher applies the EF while giving the gen-
eral instruction to her class at the same time. In the
other variety the teacher, as before, gives the general in-
struction, but the specific EF is applied by some person
other than the teacher. If the EF’s contrasted are project
method and conventional method of teaching, or one method
of teaching spelling and another method of teaching it, it
is probable that the teacher will be asked to apply the EF’s.
Here unusual care should be exercised to equate or elimi-
nate any difference in teachers’ skill. If the EF’s con-
trasted are one type of motion picture and another type
of motion picture, there is considerable likelihood that the
experimenter himself or non-teaching assistants will apply
the EF’s. Here again difference in teachers’ skill may be
important, particularly if the motion pictures deal with
portions of the regular curriculum, but it is much less im-
portant than where the teachers apply the EF’s, because the
teachers will have relatively less influence upon the changes
of the pupils in the experimental trait. But as the teachers’
importance grows less, the experimenter’s or non-teaching
assistants’ importance increases, in accordance with the gen-
eral principle stated at the opening of this chapter, namely,
that the importance of an irrelevant factor varies with the
amount of its contribution to each EF, or to the difference
in the amount of its contribution to the various EF’s.
Errors Due to Bias of Subjects.—Bias on the part of
experimental subjects is just as disturbing to an experiment
as bias on the part of the experimenter or his assistants.
Such bias comes about in many ways. A popular teacher
will make it known to the pupils that an experiment is under
way and consciously or unconsciously reveal her own pref-
erence. The pupils, as a consequence, will strive to make
the experiment come out happily for their teacher. An
unpopular teacher under similar circumstances provokes an
antagonism toward the EF which she prefers.
Again, a teacher, an experimenter, or certain circumstances
surrounding the experiment will reveal to pupils that two
groups are being compared. This information, apart from
any preference for or antagonism toward their teacher, may
engender an undesired rivalry between the two groups. In
case the information leaks out to only one group the result-
ing stimulus to this group might well prove decisive.
The best way for an experimenter to avoid a bias is to
keep himself, when possible, in ignorance of just when he
is applying a particular EF, or scoring tests for a particular
experimental group, and so on for the other experimental
processes where his bias would be likely to affect results.
The best way to avoid bias on the part of assistants is to
keep them in ignorance of the objectives of the experiment.
An experiment with two varieties of ventilation was con-
ducted in two schoolrooms for a full year without either of
the two teachers discovering just what the EF’s were. It
is even more important and fortunately easier to keep pupils
in ignorance of the nature of the EF’s and, if possible, of the
fact that an experiment is in progress. Certainly one group
should not be informed and the other kept in ignorance.
Research is such an eminently individual and original
process that it is well-nigh impossible to lay down certain
principles of procedure without calling attention to possible
exceptions. There are situations where it is really desira-
ble that pupils be informed, in a measure, that something
unusual is taking place. Pittman, in one of his investiga-
tions, went so far as to issue a bulletin to the pupils of one
of his two equivalent groups telling them he wished to see
just how much progress they could make. In an experi-
mental evaluation of the worth of using standard tests in
the teaching of reading, the writer set up for one group of
the experimental pupils definite objectives in reading, gave
them their scores on periodic tests in order that they might
see how nearly they were attaining these objectives. This
was not done for the other experimental group. And yet
neither Pittman nor the writer introduced thereby any con-
stant irrelevant factor. These were legitimate portions of
one of the EF’s. The use of a bulletin by Pittman was a
portion of his plan for increasing the progress of the pupils.
The employment of definite reading objectives and the
periodic reporting of scores by the writer were made possible
by the use of standard tests, and were some of the advan-
tages of the use of standard tests. Objectives and scores
could not be reported to the other groups, either because the
EF did not call for them or because standard tests were not
employed with them. On the other hand, it would not
have been legitimate for either of us to tell these same
experimental groups that their progress was to be compared
with that of another equivalent group and that we hoped
they would win in the contest. To do so would be to change
the EF by adding features peculiar to the experiment and
necessarily temporary.
Such an EF would not be illegitimate but it would not be
particularly practical. The information given certain of
the experimental subjects by Pittman and by the writer
was a normal advantage of the EF in question and was
permanently obtainable in a practical school situation with-
out assuming the impractical situation of an everlasting
experiment. In sum, it is always legitimate to give experi-
mental pupils such facts as are the normal concomitants of
the EF in question, unless the experimenter desires to limit
his experimental conclusions to a narrower EF. As a mat-
ter of fact, the writer gave certain standard tests to the
pupils in his control group, thereby making it possible, had
he so desired, to report to them the scores made as in the
case of the other group. This was not done because the
EF for this group assumed that in a normal non-experi-
mental situation no standard-test scores would be available.
Errors Due to Difference in Time Allowance.—
When the effectiveness of two or more EF’s is being studied,
one EF may secure an unfair advantage over another be-
cause of a longer teaching or studying time on the part of
the pupils, or the application of their EF for a longer
period. This may occur in many ways. The class period
may be longer. The study which occurs at the pupil’s home
may be longer. Each application of the EF may be longer.
The total period during which the EF operates may be
longer. Thus, in conducting the experiment to determine
the relative effectiveness of employing tests in teaching read-
ing, the writer found it necessary to regulate the length of the
official reading period both for teaching and for study. In
his experiment to determine whether motion-picture presentation,
or printed presentation, or teacher presentation, or
various combinations of these was the most effective, Weber¹
exercised extreme care lest the time allowance for one EF
exceed the time allowance for another EF. In his experi-
ment to determine whether supervision plus standard tests
were superior to supervision minus standard tests, Bennett
found it impossible to give all the initial tests or all the final
tests to all the pupils at the same time. Because of the scat-
tered nature of rural schools both testing periods extended
over several weeks. All tests were carefully dated in order
that the interval between initial and final tests might be kept
identical for every pupil. Since instruction toward the
close of school may be more effective than toward the be-
ginning, he was careful to avoid applying initial tests to
one group earlier, on the average, than to the other group.
Lacy,² in his experiments with visual, verbal, and printed
presentation, was careful to see that the few minutes’ interval
between the ending of each EF and the application of the
final test was kept identical for all EF’s, and that the few
weeks’ interval between the final test and a delayed-recall
test was kept identical for all EF’s. In every experimental
situation where a time variation will favor one EF to
the detriment of another, the time should be kept identical,
unless such a variation is a desired element in an EF.
There is a special variety of time variation which should
¹ Weber, J. J., Relative Effectiveness of Some Visual Aids in Elementary
Education (to be published soon).
² Lacy, John V., “Motion Pictures as an Educational Agency,” Teachers College
Record, Vol. XX, No. 5.
not escape the attention of the experimenter. The pupils in
one experimental group may have a poorer attendance record
than those in some other group. This may be caused by an
excess for one group of poorer roads, longer average dis-
tance of homes from school, more inclement weather, more
contagious diseases, and the like. Consideration should be
given to whether the absence is toward the beginning or
end of year, or is continuous or intermittent. When the
pupils are sufficiently numerous, average attendance records
are usually approximately equivalent for each group. But
when the group is small it may be necessary to eliminate
from experimental computations pupils whose attendance
record is such as to disturb the balance between the two
groups.
Sometimes it is difficult to decide whether a time variation
is an irrelevant factor or a consequence of an EF. Pittman
found that the pupils in the schools which were under the
zone-system-of-supervision EF showed a better attendance
record. Instead of discounting this as an irrelevant factor
he credited it to the beneficent influence of the EF, because
there was no other observable cause.
The writer found that one method of teaching reading
resulted in more reading both in school and out than did
another EF. This extra reading was a partial or perhaps
entire explanation of the superior growth of these pupils. It
was assumed that this was not an irrelevant time variation
but a beneficent consequence of the EF. Tests made in
other subjects of the curriculum did not show that this in-
creased emphasis upon reading had occurred at the expense
of other portions of the school work.
Finally, errors may occur due to the length of time the
experiment runs. An experiment may be allowed to run too
brief a time or too long a time. It may be so brief that
variable errors swamp the effect of the EF’s. This is likely
to occur if the trait measured is one in which growth is slow
and cumulative. In such a situation the experiment needs to
continue over a long period. When the trait measured de-
velops rapidly, and when the effect of the EF’s is relatively
non-cumulative, brief experiments are preferable. The prin-
ciple to be kept in mind in deciding upon the time length of
the experiment is to secure the maximum effect of experi-
mental factors with a minimum effect from disturbing
variables.
Errors Due to Difference in Transfer.—After giving
a recent examination to his class in mental measurement,
the writer announced to the students that his efficiency as
a teacher of mental measurement was only 43 per cent, for
on the average the class had mastered only 43 per cent of
the procedures he had aimed to teach. One unkind student
increased his chagrin by remarking that a portion of that
43 per cent was acquired in other classes given by the
writer’s colleagues. In other words, there had been a trans-
fer from one class to another. This same sort of transfer
from one school activity to another is going on all the time.
More of it may occur in the case of one group than another,
thereby introducing a constant irrelevant factor. Reading
ability is liable in a peculiar way to be enhanced by such
transfer. The teacher of reading usually has a heavy obliga-
tion to all the other teachers, where there is departmental in-
struction, or a heavy obligation to all the other phases of
her own instruction where she is the sole teacher. Certain
teachers or schools give a sum total of more instruction in
reading during the periods officially assigned to history,
geography, and the like, than during the reading period
itself. This is equivalent to giving more time to reading.
The experimenter should not neglect these transfer possi-
bilities when standardizing the time allowance for each EF.
Another disturbing irrelevant factor is the transfer of
knowledge of how to do the experimental tests. The writer
found this to be of considerable significance in some experi-
mentation on young children. All the tests were individual
tests, which means that only one child could be tested at a
time. As soon as a child was tested he was returned to his
class. This gave opportunity for the other children to dis-
cover, in advance, something as to both the general and
specific nature of the tests. An effort was made to reduce
the amount of this error by employing several examiners
so as to reduce the length of the total testing period, by
testing first those pupils who, according to the teacher’s
judgment, were least competent to make an intelligible re-
port of what occurred in the examining room, by applying
one test to all pupils before starting another, by urging the
teacher to conduct her class while a test was being given
so as to reduce opportunities for conferences among pupils,
and by condensing the total period for one test between
recess periods. An attempt was made to equate any error
not avoided by the preceding precautions by testing pupils
from the two groups according to the principle of alterna-
tion. It is much easier to avoid this irrelevant factor when
group tests may be employed.
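The principle of alternation amounts to interleaving the two groups in the order of testing, so that neither group is tested, on the average, later than the other. A minimal Python sketch follows; the pupil labels and the function name `alternate` are hypothetical, and each list is assumed to be already arranged in the order in which its pupils are to be called.

```python
from itertools import zip_longest

_MISSING = object()  # marks the gap left when one group is larger than the other


def alternate(group_a, group_b):
    """Interleave two lists of pupils: one from A, one from B, and so on."""
    order = []
    for a, b in zip_longest(group_a, group_b, fillvalue=_MISSING):
        if a is not _MISSING:
            order.append(a)
        if b is not _MISSING:
            order.append(b)
    return order


testing_order = alternate(["a1", "a2"], ["b1", "b2", "b3"])
```

Here the order of testing is a1, b1, a2, b2, b3. Any knowledge of the tests that leaks back to the classroom is thus shared about equally by the two groups, save for the surplus pupils of the larger group.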
When the equivalent groups are located in the same
school, other sorts of transfer may occur. One group may
catch a spark of enthusiasm from another. One group
may sulk because the other group has a pleasanter or sup-
posedly pleasanter EF. The writer is still wondering just
what sort of transfer occurred during a year’s experiment in
the Horace Mann School, conducted in collaboration with
Principal Pearson, Vice-Principal Hunt, and the teachers.
Half the teachers and half the pupils continued to teach and
study, respectively, a particular subject, as during the pre-
ceding year. The other equivalent half of the teachers
attempted by concentrated study to invent teaching pro-
cedures which would produce, with the same time allowance,
a greater growth than usual in their half of the pupils.
This program was known to half the teachers only and to
none of the pupils. Initial and final tests were given to
both groups as had been customary in previous years. To
our great surprise both groups had made practically identical
progress. Naturally this was a considerable disappointment
to us all. It was not until some time later that it occurred
to us to compare the usual progress with the progress made
for an equal period during the experimental year. Both
groups had made a 50 per cent greater growth than usual!
Somehow, some sort of transfer had occurred.
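The two comparisons involved can be written out as simple arithmetic. The figures below are hypothetical, chosen only so that both groups show the 50 per cent excess reported above; the source gives no raw scores, and the helper `percent_excess` is an illustrative name.

```python
def percent_excess(gain, usual_gain):
    """Return how much larger `gain` is than `usual_gain`, in per cent."""
    return 100.0 * (gain - usual_gain) / usual_gain

usual_gain = 8.0          # hypothetical mean gain in an ordinary year
control_gain = 12.0       # hypothetical gain of the half taught as before
experimental_gain = 12.0  # hypothetical gain of the half taught by new procedures

# Comparison between the groups: no difference appears.
between_groups = experimental_gain - control_gain
print(between_groups)  # 0.0

# Comparison with the usual progress: both groups exceed it.
print(percent_excess(control_gain, usual_gain))       # 50.0
print(percent_excess(experimental_gain, usual_gain))  # 50.0
```

A comparison between groups alone would have hidden the gain; only the comparison with previous years reveals it.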
Errors Due to Bias of Tests.—There is danger that
tests used for the initial and final measurements will be
partial to one EF. Those who advocate the project method
in preference to the conventional method of teaching have
certain reservations about experiments which have been
conducted to date to evaluate the relative effectiveness of
these two educational processes. They claim, and with some
justification, that standard tests available for such evalua-
tion are partial to the conventional method. Lacy’s con-
clusion that verbal instruction is more effective than visual
instruction has been questioned by Weber on the ground
that Lacy’s verbal tests were partial to the verbal method.
To substantiate his criticism Weber devised one test like
Lacy’s, another in which the verbal element was reduced
to a minimum, and another which, in his judgment, was
about half-way between these two. At the time when this
is written, his experiments have gone far enough to show,
among other things, that the visual group does better on
the visual test and the verbal group upon the more verbal
test.
What has been said concerning the nature of the tests em-
ployed applies with equal force to the examiner who gives
tests, the acquaintance of pupils with the tests, instructions to
pupils as to how to take the test, the conditions while tests
are in progress, the scoring of the tests, and the statistical
treatment of results. In general, the same examiner should
give the same tests to all groups in the same way in order
that difference in personality of examiners, or in the stimulus
given to pupils, may not corrupt results. Uniformity will
be increased if the method of applying the test is determined
in advance and written down. Sometimes one group has
had more experience in taking tests in general. This may
be eliminated by supplying the deficiency. Sometimes the
experiment calls for intermediate tests of the same experi-
mental trait with the same test that is used for the initial
and final tests. If this applies to one group only it may
gain an advantage from increased acquaintance with the
test. Such practice effect can be reduced by the use of
parallel forms rather than the identical test.
Sometimes it is desirable to analyze the curriculum content
and test content to discover the degree of correspondence
between the two, and this is especially true when the
one-group experimental method has been employed. It is
possible that the arithmetic curriculum during the first semester
may be more akin to the content of the arithmetic test used
than is the content of the arithmetic curriculum for the
second semester. Analysis of the curriculum may reveal this.
Finally, a test may be biased because it fails to take
account of periods of especially rapid growth, and minor or
major plateau periods of especially slow growth. In certain
traits, pupils lose during the summer vacation some of the
skill acquired the previous year. Usually, this loss is quickly
made up in the first few weeks of the fall term. When the
initial tests are given on the first day or two of school, the
EF will get the benefit, not only of the effect of the EF,
but also of the effect of this early spurt.
Errors Due to Bias of Other Irrelevant Factors.—
Various environmental factors which may prove irrelevant
factors have already been listed. On occasion, many others
may be significant. The experimenter should canvass the
general physical environment including such items as tem-
perature, humidity, ruralness, playgrounds, and the like, to
see if differences in these may not be significant. Thus
conclusions from experiments in physical geography might
be profoundly affected by whether one group had better
contacts with mountains, streams, and the like. The home
environment is frequently of very great importance. Some
children have home surroundings which encourage study,
home facilities which aid study, parents who give moral
support to the school, and parents who give actual instruc-
tion in school subjects in no mean amount and of no small
worth. All such conditions, if relevant to the experiment in
question, should be made approximately equivalent or should
be discounted in drawing conclusions.
Then there are errors due to difference in susceptibility
of pupils to the EF’s. Conclusions from an experiment
conducted by Norsworthy, Hillegas, McCall, and Johnson
were made uncertain because one of the two groups was in
more robust health than the other. Differences in phys-
ical condition, intelligence, previous training, age, sex, race,
and all other such personal characteristics which at times
condition the susceptibility of pupils are not matters easily
or at all subject to control during the application of the
EF’s. They should receive attention when experimental
pupils are being selected.
Experimental Log.—One necessity of experimentation
is an experimental log or record of dated events, of relevant
ideas, of the appearance of variables, and the like. It is
seldom safe to trust to memory circumstances which will
need to be recalled. Every scrap of experimental record
should be labeled and dated. Records should be kept as
though the experimental material were to be filed away for
several years before experimental computations were made
and before the experiment was described. In fact, any
one who does much experimentation will need to refer
to experimental records long after the conclusion of the
experiment. Further, it often becomes necessary to ask
others to complete an experiment one has begun. A prop-
erly kept experimental log quickly informs the new experi-
menter concerning the previous history of the experiment.
Norsworthy had just completed an experiment extending
over several years when she died. Though the writer knew
nothing about the experiment he was able to take up the
research where she left off, complete the computations, and
describe and publish the results. Without the experimental
log this would have been impossible.
In an extensive experiment in the teaching of English to
foreigners, Courtis employed a unique device for maintaining
desired experimental conditions and for recording
deviations from them. First he met the teachers and gave
them typewritten directions on, and training in, how
to apply the EF, namely, a particular method of teaching
English to foreigners. Then he employed a group of gradu-
ate students in education to act as observers, there being
one observer for each teacher. Next he devised a form on
which the observer could keep a graphic time-record of just
what the teacher did during the lesson period. He rotated
the observers so that each observer saw each teacher. At
the conclusion of the experiment, he did not have to hope
that experimental conditions had been maintained. He had
an accurate record of the extent to which they had been
maintained. As a result, he was able to avoid grave errors,
and was able to make a much fuller use of his data.
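One way to arrange a rotation in which each observer sees each teacher is a cyclic schedule. The sketch below is an assumption about how such a rotation might be laid out; the source does not describe the actual schedule Courtis used, and the observer and teacher labels are hypothetical.

```python
def rotation_schedule(observers, teachers):
    """Cyclic schedule: in period p, observer i watches teacher (i + p) mod n."""
    n = len(teachers)
    if len(observers) != n:
        raise ValueError("one observer per teacher is assumed")
    return [
        {observers[i]: teachers[(i + p) % n] for i in range(n)}
        for p in range(n)
    ]

observers = ["O1", "O2", "O3"]
teachers = ["T1", "T2", "T3"]
schedule = rotation_schedule(observers, teachers)
for period, assignment in enumerate(schedule, start=1):
    print(period, assignment)
```

With n observers and n teachers the schedule runs for n periods; in every period each teacher has exactly one observer, and over the whole schedule each observer sees each teacher once.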
CHAPTER V
EXPERIMENTAL MEASUREMENTS
I. FUNCTIONS OF EXPERIMENTAL MEASUREMENTS
Amount of Experimental Factors.—The first demand
upon experimental measurements is the exact measurement
of the amount of the EF’s.
The amount of certain EF’s may be measured with great
exactness. Among the many experiments conducted by the
Ventilation Commission of New York, some had for their
purpose to determine the mental and physical effects upon
school children or adults of various temperatures, humidities,
carbon-dioxide contents, and the like. The successful con-
duct and interpretation of these experiments required that
an exact record be kept of the temperature, humidity, and
carbon-dioxide content maintained in the experimental cham-
bers. Instruments were installed which made possible a
very exact record of the amount of these EF’s.
The amount of some experimental factors cannot be meas-
ured with such accuracy. If, for example, one experimental
factor is the project method, it is impossible to secure an
exact quantitative record of the amount of this EF, even
though we can be reasonably sure that it is an EF which
varies in amount of presence. Similarly it is difficult to
secure a quantitative record of the amount of a particular
method of teaching reading.
Though such a record is difficult to secure, the experimenter
is responsible for reporting as best he can the amount of each EF. In
the case of some EF’s, it may not be possible to be more definite
than to state roughly the skill and effort of the teacher,
the degree of coöperation of officials and parents, the adequacy
of equipment, the amount of time during which each
EF operated, and similar information, according to the
nature of the experiment.
Amount of Change Produced by Irrelevant Factors.
—The second demand upon experimental measurements is
the exact measurement of the amount of change produced
in the trait in question by irrelevant factors. The purpose
of this measurement is to make it possible to discount the
corrupting influence of irrelevant factors.
In certain very specific types of experimentation, it is
possible to measure the amount of this influence of irrele-
vant factors. But in most educational experimentation,
their individual influence is so slight as to be unmeasur-
able, or so subtly bound up with the EF’s that the exact
amount of their contribution cannot be separated from the
influence of the EF’s. Usually, the experimenter will find
it easier to eliminate or equate significant irrelevant factors
than to measure the amount of their contribution to the trait
in question.
Amount of Change Produced by Experimental Fac-
tors.—The third demand upon experimental measurements
is the exact measurement of the amount of change in the
trait in question produced by the EF’s. In educational ex-
perimentation, this is the most common and most important
type of experimental measurement.
II. FUNDAMENTAL CRITERIA
In common with measurements for any purpose, experi-
mental measurements should satisfy certain fundamental
criteria. They should be selected or constructed with these
criteria in mind. These fundamental criteria are:
1. Validity. A test is perfectly valid when it measures
exactly what it purports to measure.
2. Accuracy. A test is perfectly accurate when the
units of measurement are wholly appropriate and are abso-
lutely equal at all points on the scale.
3. Reliability. A test is perfectly reliable when two
applications of equivalent tests to the same pupil yield
identical scores.
4. Objectivity.
Credit for regular school marks was assigned thus:
School mark           A   B   C   D   ...
Value                 [values illegible in source]
Credit for teacher’s special estimate of pupils was assigned
as follows:
Teacher’s estimate    A   B   C   D   ...
Value                 12  [remaining values illegible in source]
Observe that, in assigning credit to the average of school
marks and to the teacher’s special estimate, no account
was taken of the pupil’s grade. A second-grade pupil
making an A was assigned the same number of points of
credit as a fourth-grade pupil making an A. This pro-
cedure is defensible only when the group is a fairly homo-
geneous one, and when the object is to construct a criterion
whose sole purpose is to evaluate test elements relative to
each other.
Finally, Liu combined his test criterion and school cri-
terion, giving equal weight to each. Then he computed
the correlation and partial correlation of each test element
in the five non-verbal tests with this criterion. The test
elements showing the largest partial correlation with the
criterion were selected to constitute a new test. Further-
more, the method of scoring the new test took account of
the relative value of each element of the test as an inde-
pendent measure of intelligence. This was accomplished by
the use of the regression equation technique. These techniques
of correlation, partial correlation, and regression
equations are discussed in detail in Chapter IX.
In the actual selection of the best test elements to put
into the new test battery for China, Liu was influenced by
such non-statistical considerations as adaptability to all
races equally, possibility of constructing duplicate forms of
each, and the like. Also he short-circuited the laborious par-
tial correlation technique by (a) computing the correlation
of each test element with the criterion, (b) choosing as basic
test elements the two elements which showed the highest
correlation with criterion and which appeared to test different
mental functions, and (c) selecting other tests which, by
trial, showed high correlations with the criterion but low
correlations with the basic tests and with each other.
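The short-cut in steps (a) to (c) can be sketched as a simple selection loop. This is a minimal illustration, not Liu's actual computation: the element names and scores are hypothetical, the two thresholds are arbitrary, and the judgment that elements "test different mental functions" is approximated here by requiring a low correlation with the elements already chosen.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def select_elements(elements, criterion, min_validity=0.3, max_overlap=0.6):
    """Pick elements that correlate highly with the criterion, little with each other."""
    # (a) correlation of each test element with the criterion
    validity = {name: pearson(scores, criterion) for name, scores in elements.items()}
    ranked = sorted(validity, key=validity.get, reverse=True)
    chosen = []
    for name in ranked:
        if validity[name] < min_validity:
            continue
        # (b), (c) accept only elements that overlap little with those already chosen
        overlaps = [abs(pearson(elements[name], elements[c])) for c in chosen]
        if all(r <= max_overlap for r in overlaps):
            chosen.append(name)
    return chosen

# Hypothetical scores of eight pupils on the criterion and on four test elements.
criterion = [29, 25, 23, 19, 21, 17, 15, 11]
elements = {
    "maze":     [6, 6, 6, 6, 4, 4, 4, 4],
    "maze_b":   [8, 8, 6, 6, 2, 2, 4, 4],   # largely duplicates "maze"
    "pictures": [6, 6, 4, 4, 6, 6, 4, 4],
    "cubes":    [6, 4, 6, 4, 6, 4, 6, 4],
}
chosen = select_elements(elements, criterion)
print(chosen)  # ['maze', 'pictures', 'cubes']
```

In this made-up example "maze_b" has the second highest correlation with the criterion, yet it is rejected because it correlates about .89 with "maze" and so adds little independent information.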
2. The Test Should Measure Comprehensively the Trait
in Question.
Perfect validity may be secured by so constructing the
test that it duplicates in form, procedure, and content the
criterion itself. But almost invariably this means an im-
practicably cumbersome test. Hence the psychologist
usually sacrifices some validity to convenience. He may
construct a test which duplicates the criterion in miniature.¹
Or, instead of a toy representative, he may select for his
test an actual sampling of some representative portion of
the criterion. Or, he may construct an analogy which employs
material which is not even similar to the material of
the criterion but which is supposed to exercise the mental
traits requisite for success in the criterion. Finally, he
may attempt to find or construct an empirical test, i.e., he
tries out many tests in the hope of discovering that one of
these will happen to show a close correspondence with the
criterion.
¹ See Hollingworth, H. L. and L. S., Vocational Psychology; D. Appleton and
Company, New York, 1916.
This question of adequacy is of particular importance to
the experimenter. He wishes to measure and evaluate all
the changes produced by each EF and not just a part of
them. Bryan and Harter’s ordinary measurements showed
that their subjects reached a plateau where a series of
measurements showed no further evidence of growth. The
use of more adequate tests showed, however, that growth
in certain accessory traits was continuous throughout the
plateau period. In experiments with project teaching and
the like, the adequate measurement of such accessory and
concomitant developments becomes a matter of primary
importance. It is a good rule in experimentation to test,
so far as possible, every aspect of the problem, and score
every aspect of the tests.
Adequacy in content plus practical convenience offers a
special problem to the test constructor. Some of those who
develop tests attempt to secure adequacy without sacrificing
convenience by taking a random sampling of the total ma-
terial. Thus, the words in the Starch Spelling Scale were
selected at random from all the non-technical words in the
dictionary. Others follow the social-worth principle. Thus
the words in the Ayres Spelling Scale are the more com-
monly used words. Others employ the type principle in
selection of test material. Thus the examples in Monroe’s
Diagnostic Tests in Arithmetic were so selected as to represent
all the typical processes in the fundamentals of arithmetic.
Others follow the statistical-difficulty procedure.
Thus, the examples in Woody’s Arithmetic Scales were
selected because of their statistical behavior, i.e., those ex-
amples were selected which would make an equal-step ladder
of difficulty. Various combinations of these bases of selec-
tion are possible. The basis or bases to be employed will
vary with the purpose of the test and the nature of the trait
to be studied.
3. The Test Should be Non-coachable.
The coachability of a test may be reduced by such a selec-
tion and arrangement of material as will make it difficult
for one pupil to communicate knowledge of how to do the
test to another, by increasing the amount of the test ma-
terial, by the preparation of several equivalent forms of
the test, and by providing that those pupils will be tested
first who are least able to report the content of the test.
4. The Test Should be Free from Ambiguities and
Other Irrelevancies.
Even when the content of a test is satisfactory, the form
and procedure of the test require careful scrutiny. All sorts
of irrelevancies may subtract from validity. The test
material may be in question form when greater validity
might be secured by employing the classification, completion,
matching, or manipulation form. The general conditions
under which the test is to be given may detract from validity.
The instructions which accompany the test may demand
too much linguistic ability or may be otherwise
unsuitable. The nature of the response demanded of the
pupil may require too much writing ability, muscular
strength, or the like. The test may be so long as to meas-
ure fatigue instead of the trait desired, or so short as to
be unreliable or unsuited to measure the speed of adjust-
ment to the test. It may be so arranged as to measure
the pupil’s honesty rather than his ability. The scoring
provided for may be crude, or may concern insignificant
phases of the pupil’s performance. Ambiguities or other
irrelevancies may appear at various stages.
5. The Elements of the Test Should Be Weighted in
the Optimum Manner.
In practice, few tests have as yet been validated in any
adequate way. The tests are usually assumed to measure
what they appear to measure. In time every person who
proposes a test will be obligated to report the degree of
correspondence between test scores and criterion scores.
This correspondence is usually determined by computing
the coefficient of correlation between these two series of
scores. The procedure for computing and interpreting a
coefficient of correlation is described in Chapter IX.
It frequently happens, however, that the correspondence
between test and criterion can be measurably increased by
determining and utilizing in scoring, the optimum weights
for the various parts of the total test, especially when the
total test is composed of subordinate tests which differ
somewhat in nature. These weights may be determined
statistically by means of the partial correlation and regression
equation techniques. These techniques also are discussed
in Chapter IX.
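As a rough illustration of what determining optimum weights involves, the sketch below fits least-squares weights for a total test made up of two subordinate tests. It is an assumption standing in for the regression equation technique of Chapter IX, not a reproduction of it; the function name and the scores are hypothetical, and the criterion scores were constructed so that the true weights are 2 and 1.

```python
def two_part_weights(x1, x2, y):
    """Least-squares constant a and weights b1, b2 for y ~ a + b1*x1 + b2*x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Sums of squares and cross-products of deviations from the means.
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("the two subtests are perfectly correlated")
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2

# Hypothetical subtest scores and criterion scores for five pupils.
subtest_1 = [1, 2, 3, 4, 5]
subtest_2 = [2, 1, 4, 3, 5]
criterion = [7, 8, 13, 14, 18]   # built as 3 + 2*subtest_1 + 1*subtest_2

a, b1, b2 = two_part_weights(subtest_1, subtest_2, criterion)
print(a, b1, b2)
```

Scoring each pupil as `a + b1*x1 + b2*x2` then gives the weighted total that corresponds most closely, in the least-squares sense, with the criterion.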
6. The Test Should Be So Constructed That the Pupil’s
Reactions Will Be as Abbreviated as Possible.
Satisfaction of this criterion makes for economy and
objectivity of scoring. Frequently an abbreviated reaction,
such as a word, number, or check, will yield as valid¹ a
measure of the pupil’s ability as a much more complicated
reaction.
¹ Gates, Arthur I., “The True-False Test as a Measure of Achievement in College
Courses”; Journal of Educational Psychology, May, 1921.
7. The Test Should Be So Constructed That the Pupil’s
Abbreviated Answers Will Be Controlled.
If any one of many different abbreviated answers is
correct, or if the spatial location of the pupil’s answers is
uncontrolled, the probable result will be uneconomical, inaccurate,
and subjective scoring. Furthermore, it will prove
difficult in this case to employ mechanical scoring devices.
When the nature of the test permits, it is well to have pupils’
answers recorded along the right-hand margin of the test
sheet. This permits the experimenter to lay a correctly-filled
test sheet beside the pupil’s answers and determine
correctness or incorrectness by a simple visual comparison.
When marginal answers are not feasible, spatial location
may be so controlled as to permit the use of a perforated
test sheet or a celluloid scoring device.
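When answers are abbreviated and their position is fixed, scoring reduces to a position-by-position comparison with a correctly filled sheet. The sketch below shows that comparison; the function name, the key, and the pupil's answers are hypothetical.

```python
def score_sheet(answers, key):
    """Count the positions at which the pupil's brief answer matches the key."""
    return sum(
        1
        for given, correct in zip(answers, key)
        if given.strip().lower() == correct.strip().lower()
    )

key = ["7", "12", "b", "true"]      # the correctly filled test sheet
pupil = ["7", "15", "B", "true "]   # one pupil's marginal answers

print(score_sheet(pupil, key))  # 3
```

Because only one answer is accepted at each position, two scorers applying this comparison cannot disagree, which is the objectivity the criterion seeks.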
8. The Test Should Be So Constructed as to Permit Its
Use Both with One Pupil and with a Group of Pupils.
It is claimed that when a test is given to one pupil at a
time the results are more reliable than when a pupil is tested
in a group. However, questions of time, economy, and the
prevention of the spread among untested pupils of informa-
tion as to the nature of the test practically require group
testing, for most experimental situations.
9. Test Instructions Should Be as Brief as Is Consistent
with an Adequate Understanding of What Is to Be Done.
Long instructions tend to produce confusion in the minds
of the pupils, and even of experimenters themselves if they
are inexperienced. But adequacy should not be sacrificed
to brevity. Particular care should be exercised to see that
no key points are omitted.
10. Instructions Should Employ a Demonstration and
Preliminary Test.
It is easier to imitate than to comprehend and follow lin-
guistic directions. Both demonstration and preliminary
test may be given on the blackboard or may be printed on
the test sheet. The latter is preferable.
11. Instructions Should Be Adapted to and Uniform for
All Who Are to Be Tested.
It is feasible to find words sufficiently simple for young
pupils and which are also sufficiently dignified for older
pupils. Also it is possible so to prepare instructions that
they will be uniform and equally fair to all experimental
groups irrespective of their environment.
The importance of universalizing the test applies with as
much force to the test material as to the instructions. In
less than a year after their publication, the Thorndike-
McCall Reading Scales were in use in England, China, and
other foreign countries. Unfortunately, the authors were
so provincial in their outlook that minor revisions must be
made before they can be used to greatest advantage in
countries other than the United States. They could have
been approximately internationalized from the beginning
without impairing their value for this country.
12. The Order of Instruction Should Be the Order of
Execution.
There are abundant reasons for believing that it is easier
for pupils to follow instructions when the sequence of
instructions is the sequence of action expected from the
pupils.
13. Instruction Should Be Broken into Action Units.
As soon as a natural unit of instruction has been given,
the pupil should be directed to carry out these directions
before another unit is given. This is especially important
where the instructions are necessarily long and complicated.
Any other procedure taxes too heavily the pupil’s memory.
14. Instructions Should Equalize Interest.
Interest should be equalized not only for all experi-
mental groups but for the pupils in each group. Probably
it is easier to secure this equalization on a high interest
plane than on a low plane. As a rule it is best to induce
each pupil to do the best he can.
15. The Test Should Be So Easy That Each Pupil Will
Make a Score above Zero.
Two pupils who make zero scores appear to be of like
ability, whereas the amount of instruction required to lift
both above zero might be one month in the case of one
pupil and twenty-four months in the case of the other.
Obviously to call these pupils equivalent and to pair them
for experimental purposes would give a special advantage to
the experimental group receiving the one-month pupil. For
at the final test, this pupil might show marked improvement
while the other would be still making zero. With a prop-
erly constructed test with equal units at all points on the
scale, the twenty-four-month pupil might be shown to have
made greater growth than the one-month pupil.
16. The Test Should Be So Difficult That No Pupil
Will Make a Perfect Score.
All perfect-score pupils look alike just as all zero pupils
look alike. A properly constructed test might reveal wide
differences of ability. Furthermore, a final test, even though
it be more difficult than the initial test, cannot reveal cor-
rect improvement scores for such perfect-score pupils.
17. The Test Should Have No Undistributed Scores.
Besides undistributed zero and perfect scores it is possi-
ble to have undistributed intermediate scores. Coarse
scoring, or tests which yield a few degrees of merit only,
automatically cause undistributed intermediate scores.
Pupils are made to appear of like ability when, by a finer
scoring or by a finer test, they would appear quite unlike.
The number of degrees of merit which a test should reveal
depends upon the homogeneity of the group being tested,
but, as a rule, tests should be so constructed as to separate
the pupils into not less than seven groups of ability and, if
the data are to be used for correlation, into not less than
thirteen ability groups.
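The rule of seven and thirteen groups can be checked mechanically once scores are in hand. The sketch below counts the distinct score values a test actually produced; treating each distinct score as one "group of ability" is an assumption, and the function names and data are hypothetical.

```python
def degrees_of_merit(scores):
    """Number of distinct score values the test separated the pupils into."""
    return len(set(scores))

def fine_enough(scores, for_correlation=False):
    """Apply the rule: at least 7 groups, or at least 13 if correlating."""
    required = 13 if for_correlation else 7
    return degrees_of_merit(scores) >= required

letter_marks = [1, 2, 3, 4, 5, 3, 2, 4]   # five-step marks coded as numbers
point_scores = [12, 15, 17, 18, 21, 22, 25, 27, 30, 31, 33, 36, 40]

print(degrees_of_merit(letter_marks), fine_enough(letter_marks))          # 5 False
print(degrees_of_merit(point_scores), fine_enough(point_scores, True))   # 13 True
```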
18. A Test Should Yield a Statistical Score.
It is unfortunate that the custom ever grew up of report-
ing scores in terms of letters, words, or phrases. These
must be converted into statistical terms before they are
susceptible of necessary quantitative treatment.
19. The Test Should Yield Absolute Rather Than, or in
Addition to, Relative Scores.
Teachers’ marks are relative scores, that is, relative to the group
in question. An able pupil in Grade I will receive a mark
of A. When this same pupil reaches Grade VIII, he will
be making a score no higher than A. He stands, in fact,
a good chance of making a score less than A, even when
his absolute ability has markedly increased and his relative
status has remained unchanged. Relative tests cannot easily
be used to measure improvement.
20. The Test Should Be Scaled So That Units of Meas-
urement Will Be Equal at All Points on the Scale and the
Method of Combining Units Will Be Simple and Appro-
priate.
Evaluation of Scaling Methods.—The need for equal-
ity of units is shown in Table 4.
TABLE 4
SHOWING THE NEED FOR EQUAL UNITS OF MEASUREMENT
(R = RIGHT. W = WRONG)
Number of Problems Solved | 1   2   3   4     5     6     7   8 | Score
Difficulty                | 1   2   3   3.1   3.2   3.3   ?   4 |
Pupil A                   | R   R   R   W     W     W     W   W |   3
Pupil B                   | R   R   R   R     R     R     W   W |   6
Pupil A solves three problems correctly. His unscaled
score is, therefore, 3, as shown in the table. Pupil B solves
six problems. His unscaled score is 6, as shown. Employ-
ing unscaled units of measurement in this manner makes
Pupil B appear much more competent in comparison with
Pupil A than he really is. The difficulty of solving six prob-
lems, namely 3.3, is only slightly above the difficulty of
solving three problems, namely 3. A very small superiority
of ability on the part of Pupil B enabled him to double his
unscaled score. The use of equal units of difficulty gives
Pupil A a score of 3 and Pupil B a score of 3.3.
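The contrast between the unscaled and the scaled score can be reproduced in a few lines. The difficulty values are those of Table 4 as far as they can be read; the value for seven problems is not legible in the source and is left out rather than guessed. The function names are illustrative.

```python
# Difficulty of solving a given number of problems, from Table 4.
# The entry for seven problems is illegible in the source and is omitted.
DIFFICULTY = {1: 1.0, 2: 2.0, 3: 3.0, 4: 3.1, 5: 3.2, 6: 3.3, 8: 4.0}

def unscaled_score(responses):
    """Number of problems marked R (right)."""
    return responses.count("R")

def scaled_score(responses):
    """Difficulty of solving as many problems as the pupil solved."""
    return DIFFICULTY[unscaled_score(responses)]

pupil_a = list("RRRWWWWW")
pupil_b = list("RRRRRRWW")

print(unscaled_score(pupil_a), unscaled_score(pupil_b))  # 3 6
print(scaled_score(pupil_a), scaled_score(pupil_b))      # 3.0 3.3
```

On the unscaled count Pupil B appears twice as able as Pupil A; on the scale of equal units he stands only three tenths of a unit higher.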
Many methods¹ of varying worth have been proposed
for scaling mental tests. One method—the grade-scale
method—is to determine the difficulty of each separate prob-
lem, question, or other test element on the basis of the
achievement of school grades, and then to compute a pupil’s
score by combining the scale values of the test elements done
correctly.
To call a pupil’s score the scale value of the most diffi-
cult test element done correctly is subject to the objection
that pupils are unable frequently to do correctly test ele-
ments of less scale value. Depending as it does upon a single
test element, the score would also be rather unreliable. The
only satisfactory procedure thus far devised to meet these
two difficulties is too complicated for practical use.
¹ For a detailed evaluation see McCall, Wm. A., How to Measure in Education,
Chapters IX and X; Macmillan Company, New York, 1922.
On the other hand, to call a pupil’s score the sum of the
scale values of the test elements done correctly is somewhat
laborious, and, in addition, is subject to the criticism that
a score yielded by such a cumulative total shows the num-
ber of units of work done rather than the ability level
reached. It would be like measuring a man’s lifting strength
by adding the weights of a variety of objects lifted. The
preceding simple-total procedure appears preferable. The
man’s lifting strength, according to the simple-total pro-
cedure, would be the weight of the heaviest object the man
could barely lift.
For the foregoing reasons, the drift is away from the
scaling of the separate test elements, except in a rough
way for the purpose of arranging test elements in an
approximate order of difficulty. The drift is in the direction
of scaling, i.e., determining the difficulty of doing correctly
a given number of the test elements in a given test.
Stated differently, the drift is toward scaling total scores
instead of test elements.
The three most promising methods that have been pro-
posed for scaling total scores are the percentile scale, age
scale, and T scale.
In the case of the percentile scale, the smallest number
of points made on the test in question by any pupil of the
group used as the basis for scaling is scored zero, the num-
ber of points below which are one per cent of the pupils is
scored 1, the number of points below which are two per
cent of the pupils is called 2, and so on to the highest num-
ber of points made by any pupil which is scored 100.
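One simple reading of this definition can be sketched as follows. The treatment of tied scores and the sample norm group are assumptions; the source gives no worked figures, and the function name is hypothetical.

```python
def percentile_score(points, norm_scores):
    """Per cent of the norm group scoring below `points`; the top score is 100."""
    if points >= max(norm_scores):
        return 100.0
    below = sum(1 for s in norm_scores if s < points)
    return 100.0 * below / len(norm_scores)

# Hypothetical points made by the ten pupils used as the basis for scaling.
norm_scores = [10, 12, 15, 15, 18, 20, 22, 25, 27, 30]

print(percentile_score(10, norm_scores))  # 0.0
print(percentile_score(18, norm_scores))  # 40.0
print(percentile_score(20, norm_scores))  # 50.0
print(percentile_score(30, norm_scores))  # 100.0
```

Note that the step from 18 to 20 points and the step from 10 to 15 points do not correspond to equal percentile differences; the percentile unit reflects only how many pupils fall between two scores.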
This method assumes that the difference in ability be-
tween a pupil who makes a zero-percentile score and a pupil
who makes a 10-percentile score is the same as the difference
between a pupil who makes a 40-percentile score and
one who makes a 50-percentile score. It is rather generally
conceded, however, that the former difference is actually much greater
than the latter difference, and that therefore the units are
not equal in the truest sense at all parts of the scale.
In the case of the age scale, the mean number of points
made on the test in question by unselected eight-year-old
pupils is scored 8. The mean number of points made by
nine-year-olds is scored 9, and so on. Intermediate scores
are given also.
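A minimal sketch of the age scale follows. The source does not say how intermediate scores are obtained; straight-line interpolation between adjacent age means is assumed here, the means are hypothetical, and the sketch requires that mean points rise with age.

```python
def age_scale_score(points, mean_by_age):
    """Convert points to an age-scale score by interpolating between age means."""
    ages = sorted(mean_by_age)
    means = [mean_by_age[a] for a in ages]
    if any(later <= earlier for earlier, later in zip(means, means[1:])):
        raise ValueError("this sketch requires means that rise with age")
    for i in range(len(ages) - 1):
        lo, hi = means[i], means[i + 1]
        if lo <= points <= hi:
            fraction = (points - lo) / (hi - lo)
            return ages[i] + fraction * (ages[i + 1] - ages[i])
    raise ValueError("points lie outside the range of the age means")

# Hypothetical mean points made by unselected pupils of each age.
mean_by_age = {8: 10.0, 9: 14.0, 10: 20.0}

print(age_scale_score(10.0, mean_by_age))  # 8.0
print(age_scale_score(12.0, mean_by_age))  # 8.5
print(age_scale_score(17.0, mean_by_age))  # 9.5
```

The two error conditions mark where this simple form of the scale gives no answer: when the means fail to rise from one age to the next, and when a pupil scores outside the ages for which means are available.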
A vital defect of this scale is the almost insuperable dif-
ficulty of locating and testing unselected pupils below the
age of eight or nine and above the age of thirteen or four-
teen. Large sections of the former group have not left the
social group to enter the school and of the latter group
have left the school to return to the social group. Again,
growth ceases or actually recedes in some traits after the
age of thirteen, fourteen, or thereabouts. Quality of hand-
writing, and speed and accuracy of addition are probable
illustrations of recessions. No one has proposed a satis-
factory way of handling a situation when the mean number
of points made by, say, thirteen-year-olds is 20, and that
made by fourteen-year-olds is 18. Finally, it is generally
believed that the actual growth between ages eight and
nine, say, is greater than between thirteen and fourteen.
This belief does not have evidential support, for it is
impossible to say that the units on one scale are unequal
without assuming the equality of units on some other
criterion scale. The foregoing criticisms, even excluding
the third, mean that the age scale is inappropriate
except within a narrow range of ability and for certain
mental traits.
The T scale is believed to be superior to any of the pre-
viously described methods. It was constructed for the
purpose of embodying their virtues and eliminating their
defects. It scales the total score. It employs the simple
total. It allows each test element done to affect the scale
score, thereby increasing reliability. Its units are equal
in the generally accepted sense at all points on the scale.
It covers a wide range of ability and may be extended if
Experimental Measurements 07
necessary. The process of scaling is as simple as any, and
so is the computation of a pupil’s scale score.
The age scale by permitting the computation of quotients
such as Intelligence Quotients, Reading Quotients, Accom-
plishment Quotients, and the like, has had a decided prac-
tical advantage over the T scale, though the age scale may
be, and is now being, used as a secondary scale in conjunc-
tion with the T scale to permit the computation of quotients.
A procedure has just been devised, and will be described in
this chapter, whereby the T scale alone can secure these
special advantages of the age scale and that in a more eco-
nomical way.
The relative merits of the four most commonly used
scaling methods are summarized where they may be seen at
a glance in Table 5. This table assumes that the latest
improvements on each scaling procedure have been em-
ployed. The scoring of the scales is necessarily somewhat
subjective. After an elaborate discussion of the various
scale systems, a colleague in this field scored the systems
and arrived at results closely similar to those given in
Table 5.
The total scores of 29, 23, 22, and 11, give a rough but
only a rough index of the relative merits of the four scale
systems. Some of the criteria are far more significant than
others. The convenience and definiteness of the reference
point is so important that the deficiency of the grade scale
is very serious. The equality of units is even more impor-
tant. The deficiency of the age scale and percentile scale
at this point practically means that they cannot well be
adopted as permanent scaling systems. The additional de-
ficiency of the age scale on width of range of scale is fatal,
because both these defects are inherently uncorrectable.
The grade scale's deficiencies in ease of scaling a test and of
computing pupil scale scores fatally indict it for other than
scientific purposes.
Borrowing and combining as it does the desirable features
of the other three scale systems, the T scale satisfactorily
meets every criterion except one. At the present time it is
easier for the uninitiated to understand, or at least to think
they understand, the age-scale or percentile-scale units bet-
ter than the T-scale units. This is not, however, a perma-
nent defect. When the T scale has come into general use,
the T will be comprehended almost as easily as an age or
a percentile.
TABLE 5
SHOWING THE RELATIVE MERITS OF THE FOUR COMMONLY USED SCALE METHODS.
SATISFACTORY PROVISION FOR A CRITERION = 2. FAIRLY SATISFACTORY = 1.
UNSATISFACTORY = 0.

                                                      T      Age   Percentile  Grade
Criteria                                            Scale   Scale    Scale     Scale
 1. Definiteness and convenience of reference point   2       2        1         0
 2. Equality of units                                 2       0        0         2
 3. Width of range of scale                           2       0        2         2
 4. Reliability of scale scores                       2       1        1         2
 5. Permanence of scale                               2       2        2         1
 6. Conventionality of scale units                    2       2        2         2
 7. Lay interpretability of scale scores              1       2        2         0
 8. Internationality of scale units                   2       2        1         0
 9. Comparability of scores on various scales         2       2        1         1
10. Method of combining units                         2       2        2         0
11. Ease of computing scores                          2       2        2         0
12. Permits the quotient techniques                   2       2        0         0
13. Ease of scaling test                              2       1        2         0
14. Utilization of all scaled material                2       2        2         1
15. Ease of preparing duplicate scales                2       1        2         0
    Total                                            29      23       22        11
Construction of T Scale.—The detailed process of con-
structing a T scale has been published.t A summary will
suffice for this book. Table 6 illustrates the process. The
second column shows the number of unselected 12-year-old
children answering correctly the number of questions indi-
cated in the first column. It is recommended that unselected
12-year-olds (12.0-13.0) be used for scaling tests which are
to be used generally. If any other age is used it should be
1 See McCall, Wm. A., How to Measure in Education, Chapter X; Macmillan
Company, New York, 1922.
TABLE 6
SHOWING HOW TO SCALE TOTAL SCORES

Total Number    Number of    Number Exceeding    Per Cent Exceeding
of Questions    12-Year-     Plus Half Those     Plus Half Those      Scale
Correct         Olds         Reaching            Reaching             Score
     0              3            498.5                99.7              23
     1              1            496.5                99.3              25
     2              2            495.0                99.0              27
     3              1            493.5                98.7              28
     4              2            492.0                98.4              29
     5              2            490.0                98.0              29
     6              2            488.0                97.6              30
     7              2            486.0                97.2              31
     8              4            483.0                96.6              32
     9              2            480.0                96.0              32
    10              2            478.0                95.6              33
    11             10            472.0                94.4              34
    12              3            465.5                93.1              35
    13              8            460.0                92.0              36
    14              8            452.0                90.4              37
    15             13            441.5                88.3              38
    16             15            427.5                85.5              39
    17             18            411.0                82.2              41
    18             28            388.0                77.6              42
    19             26            361.0                72.2              44
    20             34            331.0                66.2              46
    21             40            294.0                58.8              48
    22             40            254.0                50.8              50
    23             41            213.5                42.7              52
    24             37            174.5                34.9              54
    25             31            140.5                28.1              56
    26             35            107.5                21.5              58
    27             24             78.0                15.6              60
    28             26             53.0                10.6              62
    29             21             29.5                 5.9              66
    30             14             12.0                 2.4              70
    31              3              3.5                 0.7              75
    32              1              1.5                 0.3              78
    33              1              0.5                 0.1              81
    34              0                                                   85
    35              0                                                   90
indicated by a subscript, thus, T11 or T13 or T16 in all
publications. For experimental purposes the experimenter
may use the group or groups upon which he is experimenting.
The third column shows the number of pupils exceeding
plus half those reaching each total number of questions
correct. Thus the number of pupils exceeding 33 is 0. Half
those reaching 33 is 0.5. The sum of 0 and 0.5 is 0.5 as
shown in the third column. The number exceeding 32 is 1.
Half those reaching 32 is 0.5. The sum of 1 and 0.5 is 1.5
as shown. The number exceeding 31 is 2. Half those
reaching 31 is 1.5. The sum of 2 and 1.5 is 3.5, and simi-
larly for other results shown in the third column. Since
there are 500 pupils in the group used for scaling, the fourth
column is obtained by dividing the results in the third
column by 500 and by expressing the quotients as per cents.
Were the fourth column inverted the first and fourth col-
umns would constitute a percentile scale. The fifth column
gives the T score, and is found by converting the per cents
in the fourth column by means of Table 7. Thus a per
cent of 99.7 corresponds to 22.5 or, for convenience, 23.
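The construction just described may be sketched in a few lines of Python. This is a minimal illustration and not the author's published procedure: the function name `t_scale` is hypothetical, and the look-up in Table 7 is replaced by the equivalent normal-curve formula T = 50 + 10z, which is an assumption that the table is the ordinary normal probability table with its zero point 5 S. D. below the mean. Scale scores are rounded to the nearest whole number.

```python
from statistics import NormalDist

# Second column of Table 6: pupils (out of 500) answering 0, 1, ..., 33 questions.
FREQUENCIES = [3, 1, 2, 1, 2, 2, 2, 2, 4, 2, 2, 10, 3, 8, 8, 13, 15, 18,
               28, 26, 34, 40, 40, 41, 37, 31, 35, 24, 26, 21, 14, 3, 1, 1]

def t_scale(frequencies):
    """Return, for each raw score, (per cent exceeding plus half reaching, T score)."""
    total = sum(frequencies)
    exceeding = total
    rows = []
    for count in frequencies:
        exceeding -= count  # pupils who scored above this raw score
        per_cent = 100 * (exceeding + count / 2) / total
        # Assumed equivalent of Table 7: T = 50 + 10 z, where per_cent of a
        # normal distribution lies above z standard deviations from the mean.
        t_score = 50 + 10 * NormalDist().inv_cdf(1 - per_cent / 100)
        rows.append((round(per_cent, 1), round(t_score)))
    return rows

table = t_scale(FREQUENCIES)
print(table[0], table[22], table[33])
```

Raw scores which no pupil in the scaling group reached or exceeded (34 and 35 in Table 6) give a per cent of zero and cannot be converted by the formula; the sketch therefore stops at 33.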
The first column in Table 6 shows the number of test
elements done correctly, where each element done counts
one point. The process of scaling is the same whether each
element done correctly gives a credit or penalty of one point,
two points, or any number of points, or a different number
of points for different elements. Thus in scoring composi-
tions, the scorer may wish to penalize one point for each
error in punctuation, and two points for each error in choice
of words. If penalties instead of credits are used the first
column should be inverted, i.e., large quantities should ap-
pear at the top.
Increasing the Range of a T Scale.—The width of
range of a T scale based on 12-year-olds is much wider
than the inexperienced individual would suspect. In a
continuous function like reading, such a T scale will meas-
ure first-grade pupils and most university students. Of
course, these extreme measurements will be more unreliable
TABLE 7
SHOWING THE S. D. DISTANCE OF A GIVEN PER CENT ABOVE ZERO. EACH S. D.
VALUE IS MULTIPLIED BY 10 TO ELIMINATE DECIMALS. THE ZERO
POINT IS 5 S. D. BELOW THE MEAN. S. D. VALUE EQUALS T.

S. D.    Per          S. D.    Per       S. D.    Per       S. D.    Per
Value    Cent         Value    Cent      Value    Cent      Value    Cent
 0       99.999971    25       99.38     50       50.00     75       0.62
 0.5     99.999963    25.5     99.29     50.5     48.01     75.5     0.54
 1       99.999952    26       99.18     51       46.02     76       0.47
 1.5     99.999938    26.5     99.06     51.5     44.04     76.5     0.40
 2       99.99992     27       98.93     52       42.07     77       0.35
 2.5     99.99990     27.5     98.78     52.5     40.13     77.5     0.30
 3       99.99987     28       98.61     53       38.21     78       0.26
 3.5     99.99983     28.5     98.42     53.5     36.32     78.5     0.22
 4       99.99979     29       98.21     54       34.46     79       0.19
 4.5     99.99973     29.5     97.98     54.5     32.64     79.5     0.16
 5       99.99966     30       97.72     55       30.85     80       0.13
 5.5     99.99957     30.5     97.44     55.5     29.12     80.5     0.11
 6       99.99946     31       97.13     56       27.43     81       0.097
 6.5     99.99932     31.5     96.78     56.5     25.78     81.5     0.082
 7       99.99915     32       96.41     57       24.20     82       0.069
 7.5     99.9989      32.5     95.99     57.5     22.66     82.5     0.058
 8       99.9987      33       95.54     58       21.19     83       0.048
 8.5     99.9983      33.5     95.05     58.5     19.77     83.5     0.040
 9       99.9979      34       94.52     59       18.41     84       0.034
 9.5     99.9974      34.5     93.94     59.5     17.11     84.5     0.028
10       99.9968      35       93.32     60       15.87     85       0.023
10.5     99.9961      35.5     92.65     60.5     14.69     85.5     0.019
11       99.9952      36       91.92     61       13.57     86       0.016
11.5     99.9941      36.5     91.15     61.5     12.51     86.5     0.013
12       99.9928      37       90.32     62       11.51     87       0.011
12.5     99.9912      37.5     89.44     62.5     10.56     87.5     0.009
13       99.989       38       88.49     63        9.68     88       0.007
13.5     99.987       38.5     87.49     63.5      8.85     88.5     0.0059
14       99.984       39       86.43     64        8.08     89       0.0048
14.5     99.981       39.5     85.31     64.5      7.35     89.5     0.0039
15       99.977       40       84.13     65        6.68     90       0.0032
15.5     99.972       40.5     82.89     65.5      6.06     90.5     0.0026
16       99.966       41       81.59     66        5.48     91       0.0021
16.5     99.960       41.5     80.23     66.5      4.95     91.5     0.0017
17       99.952       42       78.81     67        4.46     92       0.0013
17.5     99.942       42.5     77.34     67.5      4.01     92.5     0.0011
18       99.931       43       75.80     68        3.59     93       0.0009
18.5     99.918       43.5     74.22     68.5      3.22     93.5     0.0007
19       99.903       44       72.57     69        2.87     94       0.0005
19.5     99.886       44.5     70.88     69.5      2.56     94.5     0.00043
20       99.865       45       69.15     70        2.28     95       0.00034
20.5     99.84        45.5     67.36     70.5      2.02     95.5     0.00027
21       99.81        46       65.54     71        1.79     96       0.00021
21.5     99.78        46.5     63.68     71.5      1.58     96.5     0.00017
22       99.74        47       61.79     72        1.39     97       0.00013
22.5     99.70        47.5     59.87     72.5      1.22

                                                 16- 4   -23
 8-10    21     11- 6     5     14- 0    -6     16- 6   -24
 9- 0    19     11- 8     4     14- 2    -7     16- 8   -26
 9- 2    18     11-10     3     14- 4    -7     16-10   -28
 9- 4    17     12- 0     3     14- 6    -8     17- 0   -31
 9- 6    16     12- 2     2     14- 8    -9     17- 2   -33
 9- 8    14     12- 4     1     14-10   -11     17- 4   -35
 9-10    13     12- 6     0     15- 0           17- 6   -37
10- 0    12
How to Construct C Scale——The T scale measures
total ability in a sort of absolute sense. The B scale meas-
ures brightness, i.e., ability relative to age. The purpose
of the C scale is to indicate automatically a pupil’s correct
classification in school in the trait tested, and to measure
ability relative to grade. A pupil may be doing excellent
work for his age but poor work for his grade or vice versa.
The steps in the process of constructing a C scale follow.
1. Construct grade distributions similar to the age dis-
tribution in Table 10.
2. Using the T score column and the frequency column
for the grade in question, compute the mean T score for
each grade or for each half-grade in case the schools tested
have half-year promotions. These mean T scores for each
grade are grade norms. The grade norms were as follows:
Grade    2A    2B    3A    3B    4A    4B    5A    5B    6A    6B    7A    7B
Norm     26    30    33.7  37.3  39.6  41.8  44.9  48.0  50.9  53.7  56.0  58.3

Grade    8A    8B    9A    9B    10A   10B   11A   11B   12A   12B
Norm     59.6  60.9  61.5  62.1  62.9  63.6  64.5  65.4  66.8  68.1
3. Write the letters in the foregoing 2A, 2B, 3A, etc., as
decimals which will indicate how much of each grade the
classes tested have completed. Since the test was given in
June the 2A classes had completed half of Grade II, the 2B
classes had completed all of Grade II, and so on. Hence 2A
above should be changed to 2.5, 2B to 2.99 or 3.0, 3A to 3.5,
3B to 4.0, 4A to 4.5, 4B to 5.0, etc. If the test has been
given just after mid-year promotion, 2A should be written
as 2.0, 2B as 2.5, etc.
4. Interpolate to determine what norm corresponds to
each tenth of a grade. Since 2.5 corresponds to 26, and 3.0
to 30, 2.6 is found by interpolation to correspond to 26.8,
2.7 is found to correspond to 27.6, and so on. The expan-
sion by interpolation shown in Table 13C, p. 126, illustrates
the process in detail. “Grade” has been written as “G”
(grade status), and “Norm” has been altered to T since
it is really a mean T score. The table has been extended
downward by common sense estimation, and upward arbi-
trarily so that the highest possible score will coincide with
a G of 20.
5. Prepare a C correction table for correcting a G into
a C. The C-corrections are given below. They are the
same for all tests whether designed for the elementary or
the high school, and regardless of the time when the data
for scaling the test were collected.
End of Month     1     2     3     4     5     6     7     8     9    10
C Correction    .4    .3    .2    .1     0   -.1   -.2   -.3   -.4   -.5
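Steps 3 and 4 above, the writing of grade labels as decimals and the interpolation of norms to each tenth of a grade, may be sketched as follows. The sketch assumes simple straight-line interpolation between adjacent norms, as the worked figures in step 4 imply; the function name `expand_norms` is hypothetical, and only the first four norms are used for brevity.

```python
def expand_norms(grade_points, norms):
    """Linearly interpolate mean T scores to every tenth of a grade."""
    table = {}
    pairs = list(zip(grade_points, norms))
    for (g0, t0), (g1, t1) in zip(pairs, pairs[1:]):
        steps = round((g1 - g0) * 10)  # number of tenths between two norms
        for k in range(steps):
            grade = round(g0 + k / 10, 1)
            table[grade] = round(t0 + (t1 - t0) * k / steps, 2)
    table[grade_points[-1]] = norms[-1]  # keep the final norm itself
    return table

# June testing: 2A is written 2.5, 2B is 3.0, 3A is 3.5, 3B is 4.0.
g_table = expand_norms([2.5, 3.0, 3.5, 4.0], [26, 30, 33.7, 37.3])
print(g_table[2.6], g_table[2.7])
```

The same function serves for a test given just after mid-year promotion; only the list of decimal grade points changes (2.0, 2.5, and so on).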
21. The Test Should Be Long Enough to Yield Reliable
Scores.
This means that not only the time for, but also the ma-
terial of the test should be adequate. We have just seen
that calling the pupil’s score the scale difficulty of the single
most difficult test element done correctly tends to yield an
unreliable score. This is because this procedure in effect
shortens the test, since not every test element plays an
intimate part in determining the score. To secure adequate
reliability frequently requires that two or more forms of a
test be given and the results averaged. Spearman has de-
vised a formula in order to determine how many forms of
a test must be given to yield a desired reliability—a desired
self-correlation coefficient (see Chapter IX). The answer
is given by the following formula:
$$N = \frac{r_x - r_1 r_x}{r_1 - r_1 r_x}$$
Where N is the number of tests required to yield rx;
rx is the desired self-correlation coefficient, and
r1 is the self-correlation coefficient of one form
with another form of the test.
Thus the number of forms of a test required to yield a
self-correlation coefficient (rx) of .95, when the coefficient
of correlation (r1) of one test with a duplicate is .8, may be
found by substituting in the foregoing formula and solving
for N, thus:
$$N = \frac{.95 - .8(.95)}{.8 - .8(.95)} = 4.75 \text{ or } 5.$$
This tells us that the mean of 5 equivalent forms of the test
would correlate with the mean of 5 other equivalent forms
to the extent of .95.
Sometimes the information desired is,—what self-correla-
tion coefficient would result from correlating the mean of,
say, 4 equivalent forms of a test with 4 other equivalent
forms, when, say, r1 is .7. Here the formula and substitu-
tions are:
$$r_x = \frac{N r_1}{1 + (N - 1) r_1} = \frac{4 \times .7}{1 + (4 - 1)(.7)} = .90$$
If r1 in both the above substitutions should be the self-
correlation coefficient found by correlating the mean of two
equivalent forms of a test with the mean of two other forms,
instead of the self-correlation coefficient for one form of
a test with another form, the foregoing formulae may be
operated just the same. The N found in the first computa-
tion would show, however, not 5 forms of the test but 5
pairs of forms, i.e., 10 forms, or more exactly 9.5 forms.
Since, in the second computation, 4 forms are equivalent
to two pairs of forms, 2 should take the place of 4, thus:
$$r_x = \frac{2 \times .7}{1 + (2 - 1)(.7)} = .82$$
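The two computations above may be checked with a short Python sketch. The function names `forms_needed` and `reliability_of_mean` are hypothetical labels for the two forms of Spearman's formula as given in the text; nothing beyond those two formulas is assumed.

```python
def forms_needed(desired_r, single_form_r):
    """N = (rx - r1*rx) / (r1 - r1*rx): forms required to reach reliability rx."""
    return (desired_r - single_form_r * desired_r) / (
        single_form_r - single_form_r * desired_r
    )

def reliability_of_mean(n_forms, single_form_r):
    """rx = N*r1 / (1 + (N - 1)*r1): reliability of the mean of N forms."""
    return n_forms * single_form_r / (1 + (n_forms - 1) * single_form_r)

print(forms_needed(0.95, 0.8))        # the worked example in the text: 4.75, i.e. 5 forms
print(reliability_of_mean(4, 0.7))    # four single forms, r1 = .7
print(reliability_of_mean(2, 0.7))    # two pairs of forms, r1 = .7 between pairs
```

When r1 is a coefficient between pairs of forms, the N passed to either function counts pairs and not single forms, as the text explains.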
How reliable should a test be? A self-correlation coeffi-
cient of 1.0 would mean perfect reliability. The best intelli-
gence tests have self-correlation coefficients of one form
with a duplicate of .9 to .95 as based upon records from
unselected pupils of the same chronological age. In grade
groups the coefficient would be slightly less. The standard
test has a reliability in age groups of about .8. A test with
a reliability of .8 will yield a sufficiently reliable mean
score for a group of 40 or more pupils. It will not yield a
very reliable score for an individual. The experimenter
should have little confidence in the reliability of individual
scores unless his test has a self-correlation of .95 or above,
or until he has given enough forms of the test to bring the
self-correlation to or above this figure. Fortunately, experi-
menters are more concerned, as a rule, with mean scores for
groups of pupils than with individual scores.
Self-correlation coefficients are probably not the most
intelligible way to determine and report reliability. Another
way is illustrated in miniature in Table 12. The first
column indicates the various pupils. The second column
shows the scores made on one form of a test. The third
column shows the scores made on another form of the test
given shortly afterward. The fourth column shows the
difference between the two scores. The mean of the differ-
ences shows the amount of error on the average to be
expected with this test. Were each of the tests perfectly
reliable and were there no increase or decrease of the second
series of scores over the first series due to (a) difference
in difficulty of the two tests, (b) practice on the first test,
(c) instruction, coaching, or natural growth in the trait,
the second series of scores would then be identical with the
first series and the differences in the last column would all
be zero. Any difference due to (a), (b), and (c), pro-
vided these influences have operated equally upon all pupils,
can be eliminated by diminishing the non-algebraic mean
TABLE 12
APPROXIMATE METHOD OF DETERMINING A TEST’S RELIABILITY

           Test A     Test A
Pupil      Form 1     Form 2     Difference
  a          20         22            2
  b          12         15            3
  c          25         24           -1
  d          32         35            3
  e          12         11           -1
  f           6         10            4
  g          28         28            0
  h          15         13           -2
  i          18         20            2
  j          22         20           -2
Mean difference (non-algebraic) ...................  2.0
Mean difference (algebraic) .......................  0.8
Net difference (unreliability) ....................  1.2
difference by the amount of the algebraic mean difference.
The net difference is approximately pure unreliability. To
secure an absolutely pure measure of unreliability would
require that an allowance be made for the fact that all
pupils do not profit equally from practice, instruction, coach-
ing, maturing, and the like.
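The net-difference computation of Table 12 may be sketched in Python as follows. The function name `net_difference` is hypothetical; the sketch assumes, as the tables in this chapter suggest, that the algebraic mean is deducted without regard to its sign.

```python
def net_difference(form1, form2):
    """Return (non-algebraic mean, algebraic mean, net difference) of the differences."""
    diffs = [second - first for first, second in zip(form1, form2)]
    non_algebraic = sum(abs(d) for d in diffs) / len(diffs)  # signs ignored
    algebraic = sum(diffs) / len(diffs)                      # signs kept
    # Assumption: the general shift is removed regardless of its direction.
    return non_algebraic, algebraic, non_algebraic - abs(algebraic)

# Scores of pupils a to j on the two forms of Test A (Table 12).
form_1 = [20, 12, 25, 32, 12, 6, 28, 15, 18, 22]
form_2 = [22, 15, 24, 35, 11, 10, 28, 13, 20, 20]
print(net_difference(form_1, form_2))
```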
The procedure illustrated in Table 12 is quite satisfac-
tory provided the variation in scores on form 1 of the test
is the same or approximately the same as the variation in
scores on form 2. Whether the general size of the scores
is the same on both forms is immaterial. Equivalent forms
of tests are so constructed, as a rule, that the two series of
scores are alike in both variability and general size. The
variability of scores on form 1 of Test A in Table 12 is
about the same as that of the scores on form 2. The slight
tendency for the scores on form 2 to be larger than those
on form 1 is discounted by the use of the mean algebraic
difference, namely 0.8.
Test X in Table 13 illustrates a situation where the varia-
bilities are identical, but where the two series of scores differ
markedly in size. The net difference shows how this process
TABLE 13
ILLUSTRATING THE NECESSITY FOR EQUATING VARIABILITIES BEFORE COMPUTING
RELIABILITY BY THE NET-DIFFERENCE METHOD

                  Test X                  Test Y               Equated Var.
Pupil      Form   Form   Differ-   Form   Form   Differ-   Form   Form   Differ-
            1      2      ence      1      2      ence      1      2      ence
  a         22      0     -22       10      0     -10       10      0     -10
  b         24      2     -22       14      8      -6       14      4     -10
  c         26      4     -22       18     16      -2       18      8     -10
  d         28      6     -22       22     24       2       22     12     -10
  e         30      8     -22       26     32       6       26     16     -10
Mean Difference (non-algebraic)      22                 5.2                10
Mean Difference (algebraic)          22                 2.0                10
Net Difference (unreliability)        0                 3.2                 0
eliminates the effect of differences in size. Test Y illustrates
a situation where mere inspection shows there is perfect
reliability, yet the net difference fails to show perfect relia-
bility. It fails to show the true reliability because the varia-
tion in scores is not the same for both forms. The variability
of the scores on form 2 is exactly twice that of the scores
on form 1. The variabilities can be made identical by the
simple process of dividing all the scores on form 2 by 2.
Once the variabilities are equated the net difference shows
the true reliability, as shown in the third portion of the table.
It is seldom feasible to determine the amount of a test’s
variability by inspection as was done for form 2 of Test Y
in Table 13. The usual procedure is to compute for each
series of scores one of the standard measures of variability,
such as Q (quartile deviation) or SD (standard deviation),
and to use these as a basis for equating. The computation
of the Q and SD is explained in Chapter VI. Suffice it to
state here that the SD for form 1 of Test Y is 5.66, and
for form 2 is 11.32. Thus the SD’s show also that the
variability of scores on form 2 is twice that for form 1. The
variabilities or SD’s may be equated by dividing all scores
on form 2 by 2, as was done, or instead, by multiplying all
scores on form 1 by 2. Had the SD been 5 for form 1 and
4 for form 2, variabilities could be equated by dividing the
scores on form 1 by 1.25, or instead, by multiplying the
scores on form 2 by 1.25. Had the SD’s been 1 and 6 for
forms 1 and 2, respectively, variabilities could be equated
by multiplying scores on form 1 by 3, and by dividing
scores on form 2 by 2. That is, the variability of one form
may be adjusted to another form or the variability of both
forms may be adjusted to a third variability different from
the original variability of both. Sometimes one type of
adjustment is more convenient and sometimes the other.
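The equating of variabilities by means of the SD may be sketched in Python. The sketch adjusts form 2 to the variability of form 1, which is only one of the adjustments the text permits; the function name `equated_net_difference` is hypothetical, and the SD is computed about the mean of the scores themselves (`statistics.pstdev`), an assumption which reproduces the figure 5.66 quoted for form 1 of Test Y.

```python
from statistics import pstdev

def equated_net_difference(form1, form2):
    """Scale form 2 to the SD of form 1, then return the net difference."""
    ratio = pstdev(form2) / pstdev(form1)      # 2 for Test Y of Table 13
    scaled = [score / ratio for score in form2]
    diffs = [second - first for first, second in zip(form1, scaled)]
    non_algebraic = sum(abs(d) for d in diffs) / len(diffs)
    algebraic = sum(diffs) / len(diffs)
    return non_algebraic - abs(algebraic)

# Test Y of Table 13.
y_form_1 = [10, 14, 18, 22, 26]
y_form_2 = [0, 8, 16, 24, 32]
print(round(pstdev(y_form_1), 2), round(pstdev(y_form_2), 2))
print(equated_net_difference(y_form_1, y_form_2))
```

With the variabilities equated the net difference for Test Y falls to zero, apart from rounding in the machine arithmetic, agreeing with the third portion of Table 13.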
Herring has called attention to the fact that the corre-
spondence of scores on one form of a test with scores on
another form is not the best measure of reliability. He
claims, and rightly so, that scores on one form of a test
will correspond more closely with mean scores from an
infinite number of forms, than they will with scores on
another equally unreliable form. That is, the correct meas-
ure of the reliability of a test is some measure of the close-
ness of its correspondence with a perfectly reliable deter-
mination.
A better measure of the reliability of a test than that
given by self-correlation or self net difference is the corre-
lation between a test and the mean of two forms of that
test, or the net difference between a test and the mean of
two forms of the test. The effect of this last is to make the
net difference just exactly half the net difference between
one form and another. The procedure would yield a net
difference of 0.6 instead of 1.2 for the data of Table 12.
But due to the fact that a test has half the influence in
determining the mean of the two forms against which it is
checked, the preceding procedure makes the reliability
appear about as much better than it really is as the self-
correspondence procedure makes it appear less satisfactory
than it really is. Otis 1 has determined that the true unre-
liability is .707 of the net difference as computed in Table
12 and Table 13. The correct measure of unreliability for
Table 12 is .707 times 1.2, i.e., .8484.
22. The Test Should Be Scored Comprehensively
Enough to Yield Reliable Scores.
The failure to score all phases of a pupil’s product while
taking a test may be a prolific source of unreliability, par-
ticularly in the case of rate tests where one phase is inti-
mately dependent upon another. Thus a sort of see-saw
relation exists between speed and quality in a rate test of
handwriting. Generally, as speed increases, quality de-
creases and vice versa. Unless the method of testing is
such as to keep speed, say, constant, the two quality scores
for a pupil from two tests might be quite dissimilar, whereas
if each quality score were corrected for differences in speed,
they might, in reality, be identical.
The approximate amount of correction for speed may be
determined empirically. That correction is best which will
produce the maximum possible self-correlation between the
two series of corrected scores for quality. Another tech-
nique for determining the amount of correction has been
proposed by Courtis and Thorndike 2 and applied to the
former’s rate tests in arithmetic.
23. The Test Should Be So Constructed As to Permit
Uniformity of Procedure in Applying and Scoring It.
The key to objectivity and an important key to reliability
1 Otis, Arthur S., “The Reliability of the Binet Scale and of Pedagogical Scales,”
Journal of Educational Research, September, 192
2 Courtis, S. A., and Thorndike, E. L., “Correction Formulae for Addition
Tests,” Teachers College Record, January, 1920.
is this matter of uniformity of procedure. If it is not possi-
ble to repeat a test in a uniform way, one individual cannot
verify his own previous results, and one individual has
even less opportunity to verify the results of another. The
possibility of uniformity is partly a function of the nature
of the test, partly of the detail and accuracy of the directions
for applying and scoring the test, and partly of an experi-
mental determination and consequent allowance for the
amount and direction of each individual’s personal equation.
The first two are the most promising.
24. The Test Should Have Satisfactory Age and Grade
Norms.
The experimenter has less need for norms than other
users of tests. The experimenter is more interested, as a
rule, in comparing the progress of one experimental group
with the progress of an equivalent experimental group.
Norms are very convenient, however, where only one experi-
mental group is available, for then the progress of the avail-
able experimental group may be compared with the progress
of the norm group. Proper allowances can be made for any
differences of intelligence between the two groups thus
compared.
Norms are most valuable when they are representative of
the groups with whom it is most desirable to make com-
parisons; when they are based upon enough cases to make
them stable; when both the total distribution of scores and
the averages are reported; when the number of cases upon
which they are based is stated; and when the date of stand-
ardization is specified.
The addition of a B-scale correction to 50 or its subtrac-
tion from 50 shows the norm for the chronological age cor-
responding to the particular correction (see Table 11).
25. The Test Should Be Provided With an Inexpensive
Leaflet of Directions, Scoring Devices, and Tabulation and
Graph Forms.
All too frequently it is necessary, in order to use a test,
to purchase a monograph. In this monograph it is quite
common to discover after diligent search that the directions
for applying the test are in the appendix, that directions for
scoring are near the beginning of the book, that the key for
scoring is somewhere else, that norms are at still another
place in the monograph, and that tabulation forms are lack-
ing entirely. Fortunately a strong public opinion is com-
pelling a more careful attention to these details. This con-
sideration for the time and convenience of test users applies
less to experimenters who are constructing tests for tempo-
rary purposes than to those who expect a wide distribution
of the test which they have prepared.
IV. SAMPLE TEST AND DIRECTIONS
In order to give a concrete illustration of how the T, B,
C, F scale system will operate in practice there follows an
unfinished sample of form 1 of an arithmetic test now in
process of construction, and a tentative model direction
booklet. All the data in the tables are for another test of
35 elements instead of for the arithmetic test of 80 elements.
Otherwise the tables may be thought of as applying to the
arithmetic test.
CHINESE FUNDAMENTALS OF ARITHMETIC SCALE
Do not open this paper until told to do so. As soon as I have
told you how, fill the blanks below, and then hold up your pencil
to show that you have finished.
Surname ........ First Name ........ Boy, Girl ........
Age in Years ........ Birth Month ........ Birthday ........
het abu 8) BD eta rege Bsr bey apd 25) cpp Grade | 0.0). 0.0 sta ates
Date: Year of Republic ........ Month ........ Day ........
Pencils up!
We want to see how well you can add, subtract, multiply, and
divide. Do all your work on this paper. Get no help from
anyone. Answers should be given in decimals and not in fractions.
See how many examples you can get correct in the time allowed.
You will be told your score later. As soon as you finish one page,
do the next.
Meade no he ee meime ce rec 8) '*).S 8) (8118 Ske 1818 Cel S86 i's: eel ove.a eier e ele Tel oie later ela enote tethered te
Addition
Add
Subtract
Add
Subtract
Multiply
Divide
Add
Moree ts Alem Disha ea tee Rights eee ae eee
.... Subtraction .... Multiplication .... Division ....
  (1)       (2)       (3)       (4)
   3         6         7         7
   4         2         5         9     Add
  (5)       (6)       (7)       (8)
   6         8         9         8
   3         4         5         0     Subtract
  (9)      (10)      (11)      (12)
   5         8
   1         0        24        50
   7         5         4         6     Add
 (13)      (14)      (15)      (16)
  29        74        76        92
   6         4        32        21     Subtract
 (17)      (18)      (19)      (20)
   4         3         7         8
   2         3         3         6     Multiply
 (21)      (22)      (23)      (24)
  2)6       4)8      4)36      7)49    Divide
(25) (26) (27) (28)
22 72 69 58
ras 26 4 8 Add
(29) (30) (31) (32)
34 44 41 86
Subtract 8 7 26 19 Subtract
(33) (34) (35) (36)
24 20 28 63
Multiply 2 4 7 9 Multiply
(37) (38) (39) (40)
Divide 2)178 4)260 5) 845 7)973 Divide
(41) (42) (43) (44)
984 32
75 43 253 571
Add oa 89 457 185 Add
(49) (50) (51) (52)
407 350 65 7
Multiply 7 8 36 57 Multiply
(53) (54) (55) (56)
Divide 9)54054 8)16200 43)559 27)864 Divide
(57) (58) (59) (60)
72 28
46 95
53 60
98 72
28 — 89
70 43 6.43
69 39 48.19 -78
Add 98 39 96.13 70. Add
(61) (62) (63) (64)
5004 3500 7-32 a
Subtract 169 2891 2.59 8.63 Subtract
(65) (66) (67) (68)
Multiply 70 600 8 “7 Multiply
(69) (70) (71) (72)
Divide 68)68544 97)1949700 55)198 83)431.6 Divide
(73) (74) (75) (76)
,; 58 76 7555 72.3
Multiply BT .09 5.98 8.06 Multiply
(77) (78) (79) (80)
Divide .40)2.42 .90)3.59 .03)8.76 .08).46 Divide
When you finish, close your paper, lay it on your desk with the
front page up, and wait quietly until papers are collected.
DIRECTIONS FOR THE CHINESE FUNDAMENTALS OF
ARITHMETIC SCALE
FORM 1
I. GENERAL DIRECTIONS FOR APPLYING TEST
1. Follow the instructions for giving the test with literal exact-
ness. No additional help should be given except as hereafter
provided for. Avoid unstandardized introductory remarks.
Secure rapport by charm of manner rather than felicity of
expression.
2. Give directions distinctly, at moderate speed, with careful
attention to emphasis, loudly enough to enable all pupils in the
room to hear without difficulty, and confidently enough to secure
instant obedience from every pupil. Insist courteously but firmly
on this prompt obedience from the start.
3. Remove all distracting elements from the environment, and
make pupils as comfortable as possible. Provide against any dis-
turbances while the test is in progress. Preferably there should
be no visitors.
4. Prevent copying. Do this by carefully watching those who
act suspiciously or by standing beside them. Do not distract
others by oral reprimands in the midst of the test.
5. In timing the test use a stop-watch if possible. If not, an
ordinary watch may be used provided it has a second hand.
Where feasible, it is well to have an assistant do the timing.
6. Clear desks. See that each pupil is provided with a sharp-
ened pencil. Have a few extra pencils available.
7. Carefully count enough and just enough test papers for each
row and place them on the first desk of that row. Be very careful
lest a test paper be left in the possession of the pupils. If pupils
are practiced or are permitted to practice themselves on the con-
tents of this test, its usefulness as a measuring instrument will be
destroyed.
II. INSTRUCTIONS TO PUPILS
1. Hold up one of the test papers and say:
One of these papers will be placed on each desk. Do not open
them until told to do so. Will the pupils in the first row please
distribute papers.
2. When papers are distributed, say:
Look at the first page and read silently while I read aloud.
3. Read the directions with a sufficient pause at the end of each
sentence to permit the direction to be followed or the thought to
be fully grasped.
4. When directions have been read, record the time in hours,
minutes, and seconds, as you say: Open your paper and begin!
5. At the end of exactly 10 minutes, say:
Stop! Draw a large circle around the example you are now
working on and then pencils up. (Pause.) Now finish the ex-
ample and go right on.
6. Make sure that each pupil does not forget that as soon as
he finishes one page he is to do the next, and that he does not
overlook the last page.
7. At the end of exactly 30 minutes after saying “Begin,” say:
Stop! Pencils down! Will pupils in the first row please collect
papers.
III. HOW TO SCORE TEST
Take a blank test paper and fill it out with the correct answers
given below. This scoring stencil may be creased in successive
folds, thus making it possible to lay the row of correct answers
just below the pupil’s answers. Draw a line through every in-
correct or omitted answer and write the number of correct answers
in each row to the right of that row. Compute the total number
of correct answers made on the entire test by each pupil and write
this in the “Examples correct” space provided on the front page
of his paper.
To be counted correct a pupil’s answers must agree exactly with
those given below. Each example is scored as either wholly right
or wholly wrong. No partial credits are given. When an answer
has been corrected by the pupil, the correction is the answer to be
scored. The use of fractions instead of decimals is scored as incor-
rect in order to discourage a cumbersome practice. If pupils must
meet fractions in their environment, they should be taught how to
convert fractions into decimals. Omission or misplacement of a
decimal point makes the answer wrong. The presence of zero
before an integer or after a decimal does not make an otherwise
correct answer incorrect.
As a rule it will be found quite satisfactory to have pupils
exchange papers and do all the scoring themselves, the examiner
calling the correct answers. If this is done, at least two pupils
should score each paper, and the examiner should check the
accuracy of the scoring for some of the papers.
The list of correct answers follows.
Example | Form I | Example | Form I | Example | Form I | Example | Form I
    1   |    7   |   21    |    3   |   41    |   112  |   61    | 4835
    2   |    8   |   22    |    2   |   42    |   132  |   62    | 609
    3   |   12   |   23    |    9   |   43    |  1694  |   63    | 4.73
    4   |   16   |   24    |    7   |   44    |  1084  |   64    | 66.37
    5   |    3   |   25    |   57   |   45    |   194  |   65    | 4200
    6   |    4   |   26    |   98   |   46    |   286  |   66    | 30600
    7   |    4   |   27    |   73   |   47    |   562  |   67    | 4.72
    8   |    8   |   28    |   66   |   48    |   299  |   68    | 6.30
    9   |   11   |   29    |   26   |   49    |  2849  |   69    | 1008
   10   |   13   |   30    |   37   |   50    |  2800  |   70    | 2010
   11   |   28   |   31    |   15   |   51    |  2340  |   71    | 3.6
   12   |   56   |   32    |   67   |   52    |  4332  |   72    | 5.2
   13   |   23   |   33    |   48   |   53    |  6006  |   73    | 21.46
   14   |   79   |   34    |   80   |   54    |  2025  |   74    | 6.84
   15   |   44   |   35    |  196   |   55    |    13  |   75    | 451.49
   16   |   71   |   36    |  567   |   56    |    32  |   76    | 582.738
   17   |    8   |   37    |   89   |   57    |   533  |   77    | 6.05
   18   |    9   |   38    |   65   |   58    |   465  |   78    | 15.1
   19   |   21   |   39    |  169   |   59    | 144.32 |   79    | 292
   20   |   48   |   40    |  139   |   60    |  86.21 |   80    | 5.75
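The scoring rules of this section can be sketched in code. The sketch below is only an illustration under stated assumptions: the names `ANSWER_KEY`, `is_correct`, and `examples_correct` are hypothetical, the key holds only a handful of entries read from the list above, and each response is taken to be the pupil's final (corrected) answer written as a string.

```python
import re
from decimal import Decimal

# Hypothetical excerpt of the Form I key; only a few examples are included.
ANSWER_KEY = {1: "7", 2: "8", 63: "4.73", 64: "66.37", 68: "6.30"}

# Digits with at most one decimal point. A fraction such as "63/10" fails.
_NUMERIC = re.compile(r"\d+(\.\d*)?|\.\d+")


def is_correct(response, key):
    """Score one example as wholly right (True) or wholly wrong (False)."""
    if response is None:          # omitted answer
        return False
    text = response.strip()
    if not _NUMERIC.fullmatch(text):
        return False              # fractions and stray marks count as wrong
    # Decimal equality ignores a zero before an integer or after a decimal,
    # but an omitted or misplaced decimal point changes the value.
    return Decimal(text) == Decimal(key)


def examples_correct(responses, answer_key=ANSWER_KEY):
    """Total number of examples done correctly; no partial credits."""
    return sum(is_correct(responses.get(n), k) for n, k in answer_key.items())
```

A paper with the responses `{1: "07", 2: "8", 63: "473", 64: "66.370", 68: "63/10"}` would earn credit for examples 1, 2, and 64 only.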
IV. HOW TO COMPUTE PUPIL Ta (TOTAL ABILITY
IN ARITHMETIC)
Find the pupil’s total number of examples correct in the first
column of Table 13A and read the corresponding Ta. This is the
pupil’s T score in arithmetic. Thus the first pupil in Table 13D
(p. 127) did 16 examples correctly, which, according to Table 13A
corresponds to a Ta of 40.
TABLE 13A
Examples           Examples           Examples           Examples
Correct    Ta      Correct    Ta      Correct    Ta      Correct    Ta
   0       23         9       33        18       43        27       63
   1       25        10       34        19       45        28       67
   2       26        11       35        20       47        29       71
   3   [illegible]   12       36        21       49        30       76
   4       27        13       37        22       51        31       79
   5       28        14       38        23       53        32       86
   6       29        15       39        24       56        33       86
   7       31        16       40        25       58        34       92
   8       32        17       42        26       60        35       96
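The look-up just described amounts to a simple table reference. A minimal sketch follows, assuming a hypothetical dictionary `TABLE_13A` that holds only the rows for 9 through 20 examples correct as read from the table above; the function name `pupil_ta` is likewise an illustration, not part of the scale.

```python
# Excerpt of Table 13A: number of examples correct -> Ta.
# Only rows 9 through 20 are included in this sketch.
TABLE_13A = {
    9: 33, 10: 34, 11: 35, 12: 36, 13: 37, 14: 38,
    15: 39, 16: 40, 17: 42, 18: 43, 19: 45, 20: 47,
}


def pupil_ta(examples_correct):
    """Read the Ta that corresponds to a pupil's number of examples correct."""
    if examples_correct not in TABLE_13A:
        raise ValueError("score lies outside the excerpt of Table 13A")
    return TABLE_13A[examples_correct]
```

For the first pupil of Table 13D, `pupil_ta(16)` gives the Ta of 40 quoted in the text.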
V. HOW TO COMPUTE PUPIL Ba (BRIGHTNESS IN ARITHMETIC)
Find the pupil’s solar age in Table 13B and read the corre-
sponding Ba correction. If the Ba correction is plus, add it to
the pupil’s Ta. If it is minus, subtract it from his Ta. The result
is the Ba. Thus the first pupil in Table 13D is 13 yrs. 2 mos. old,
which, according to Table 13B, corresponds to a Ba correction
of -2. His Ta of 40 plus the Ba correction of -2 gives a
Ba of 38.
TABLE 13B
Solar Age   Add to   | Solar Age   Add to   | Solar Age   Add to   | Solar Age   Add to
Yrs.-Mos.   T Score  | Yrs.-Mos.   T Score  | Yrs.-Mos.   T Score  | Yrs.-Mos.   T Score
  7-6         34     |  10-2         11     |  12-8         -1     |  15-2        -13
  7-8         32     |  10-4         10     |  12-10        -1     |  15-4        -15
  7-10        31     |  10-6          9     |  13-0         -2     |  15-6        -16
  8-0         29     |  10-8          8     |  13-2         -2     |  15-8        -17
  8-2     [illegible] |  10-10         8     |  13-4         -3     |  15-10       -19
  8-4         25     |  11-0          7     |  13-6         -4     |  16-0        -20
  8-6         24     |  11-2          6     |  13-8         -4     |  16-2        -21
  8-8         22     |  11-4          6     |  13-10        -5     |  16-4        -23
  8-10        21     |  11-6          5     |  14-0         -6     |  16-6        -24
  9-0         19     |  11-8          4     |  14-2         -7     |  16-8        -26
  9-2         18     |  11-10         3     |  14-4         -7     |  16-10       -28
  9-4         17     |  12-0          3     |  14-6         -8     |  17-0        -31
  9-6         16     |  12-2          2     |  14-8         -9     |  17-2        -33
  9-8         14     |  12-4          1     |  14-10       -11     |  17-4        -35
  9-10        13     |  12-6          0     |  15-0        -12     |  17-6        -37
 10-0         12     |
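The Ba computation is a table look-up followed by one addition. The sketch below assumes a hypothetical dictionary `TABLE_13B` containing only the rows from 12 yrs. 0 mos. to 13 yrs. 6 mos. as read from the table; the booklet does not say how to treat ages falling between tabulated rows, so the sketch simply refuses them rather than guess.

```python
# Excerpt of Table 13B: (years, months) of solar age -> amount to add to Ta.
TABLE_13B = {
    (12, 0): 3, (12, 2): 2, (12, 4): 1, (12, 6): 0,
    (12, 8): -1, (12, 10): -1, (13, 0): -2, (13, 2): -2,
    (13, 4): -3, (13, 6): -4,
}


def pupil_ba(ta, years, months):
    """Add the (possibly negative) Ba correction for the solar age to Ta."""
    age = (years, months)
    if age not in TABLE_13B:
        # Untabulated ages are left to the examiner's judgment here.
        raise ValueError("solar age not in the excerpt of Table 13B")
    return ta + TABLE_13B[age]
```

With the figures of the first pupil in Table 13D, `pupil_ba(40, 13, 2)` returns 38.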
VI. HOW TO COMPUTE APPROXIMATE SOLAR AGE
(FOR USE IN CHINA)
First, determine the pupil’s lunar age and the lunar month of
birth. Deduct 1 from his lunar age to get his basal age. Then
from the number of the lunar month in which the tests are given,
deduct the number of his lunar month of birth. If the resulting
number is positive, add that number of months to his basal age to
get his approximate solar age. For example, if the pupil is 15
yrs. old and was born in the 5th month, and if the tests are given
in the 8th month, his basal age is 15 - 1 = 14 yrs., and the number of
months is 8 - 5 = 3. Thus his approximate solar age will be
14 yrs. 3 mos.
In case the resulting number is negative, it means that the
pupil is not up to the supposed basal age. Then from this age
deduct the number of months deficient. Thus if a 15-year-old
pupil who was born in the 11th lunar month is tested in the 8th
lunar month, his basal age is 14 but he is deficient by 3 months
(8 - 11 = -3). So his solar age should be 14 yrs. minus 3 mos.,
that is, 13 yrs. 9 mos.
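Both cases of this rule, the positive and the negative difference, reduce to one piece of arithmetic in months. The following sketch assumes a hypothetical function name and argument order; it is offered as a check on the two worked examples, not as an official procedure.

```python
def approximate_solar_age(lunar_age_years, birth_lunar_month, test_lunar_month):
    """Return the approximate solar age as (years, months).

    The basal age is the lunar age less one year. The difference between the
    lunar month of testing and the lunar month of birth is then added; a
    negative difference is deducted automatically.
    """
    basal_months = (lunar_age_years - 1) * 12
    total_months = basal_months + (test_lunar_month - birth_lunar_month)
    return divmod(total_months, 12)
```

The two examples in the text give `approximate_solar_age(15, 5, 8) == (14, 3)` and `approximate_solar_age(15, 11, 8) == (13, 9)`.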
VII. HOW TO COMPUTE PUPIL Ca (CLASSIFICATION IN
ARITHMETIC)
Find the pupil’s Ta in Table 13C and read the corresponding
Ga (Grade status in arithmetic). A Ga of 4.0, 4.5, or 4.9 means
that the pupil has an ability in arithmetic equal to the average
fourth-grade pupil at the beginning, middle, or end of the year
respectively.
To convert a Ga into a Ca add to or subtract from the Ga the
Ca correction shown below. Use the correction for the month
when the test was applied. Thus the first pupil’s Ta in Table
13D is 40. According to Table 13C this Ta is equivalent to a
Ga of 4.6. Since the test was applied December 10th this is
nearest to the end of November, i.e., the 3rd month. The cor-
rection for the 3rd month is + .2 which added to the Ga yields a
Ca of 4.8. Of course the correction is the same for all pupils
tested on December 10. For a school starting October 1, Decem-
ber 10 is the 2nd month, and similarly for other starting dates.
End of Month |  1    2    3    4    5    6    7    8    9    10
Correction   |  [values illegible, apart from the + .2 given in the text for the 3rd month]
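The choice of school month and the conversion of Ga into Ca may be sketched as follows. Several things here are assumptions: the function names are hypothetical, the school year is taken to begin on the first day of a calendar month (September 1 in the first call below), a test given exactly at mid-month is counted with the earlier month end, and the corrections are passed in as a dictionary because only the value for the 3rd month (+ .2) is stated in the text.

```python
import calendar
from datetime import date


def nearest_school_month(school_start, test_day):
    """Number of the school month whose end lies nearest to the test date.

    Assumes school_start is the first day of a calendar month.
    """
    full_months = ((test_day.year - school_start.year) * 12
                   + (test_day.month - school_start.month))
    days_in_month = calendar.monthrange(test_day.year, test_day.month)[1]
    # Past mid-month, the end of the current month is the nearer month end.
    if test_day.day > days_in_month / 2:
        full_months += 1
    return full_months


def pupil_ca(ga, school_month, corrections):
    """Add the Ca correction for the school month to the Ga."""
    return round(ga + corrections[school_month], 1)


# Only the 3rd-month correction is quoted in the text.
KNOWN_CORRECTIONS = {3: 0.2}
month = nearest_school_month(date(1922, 9, 1), date(1922, 12, 10))
first_pupil_ca = pupil_ca(4.6, month, KNOWN_CORRECTIONS)
```

Under these assumptions December 10 falls in the 3rd month for a September start and in the 2nd month for an October start, and the first pupil's Ca comes out at 4.8, in agreement with the text.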
TABLE 13C
Ta Ga | Ta Ga | Ta Ga | Ta Ga | Ta Ga | Ta Ga
yy ee ay Ph ied ie BB
22.0 (1 F2.841) 43.0
SHAY © 2:30 (245.0
2520 2. 4cadis
26,08 m2.52 144A
MAMNUUN
20,008 02.0019 4575
27.00 a7 eAO.
28.4 2.8 | 46.7
20.2.0 2.0014 753
30.0 3.0 | 48.0
Attn uw v1
30.7 V3 PAS ON Ole Ol.0 0.1.):65.7) 22:2.1776.5" 125.11) Boson
ST Ai 312 TOAO cup Orca Otel 0.2 166.0 |. 12.2:|/)76.0" P1521 OmOuno
32.04)) 3.301 440.5 Ona OtKe 9.3 | 66.3) 12.3 19713 0 X58 Oo eae
22.0 NMSA iSO. AO ies 0:4.166.6 9 °12.419977-7° 15:41 60,7 ato
33.7 3.5110 50.0 0 OLS MLOdns 9.5'}:66.8. )-12.5'| 48.1) “Ex.S OC fh emeaes
34.40 '3.0) (51.5 el6-On O50 0.6: )67.5.. 12.01) 78.5) 515.6 OCs amet
35: 3.7 |. 52.0 6.7) 01.77) 10.7167-4" 12.7 | 98.9) "15.71 OO aus
B05 Ss Ou) 5 27 Oe nO Lue 08) 67:7 32.81, 970.3') -15.2101.5 eee
36.5, 3.9 | 53.3. 6.9 ]/61.0) 9:9) 68.0 | 32.9 70.7 15,0 Ol. 7 teng
29.2 VA OP S307 CO mOsetan Ol.) OGL 13.0}: 80.1 )26,0}/02. Reo
87.5. 4.01, 54:2°) 7 Pe 62.3) 10.81 68.5° ) 13.1] 80.5 a akO.5 104s eee ee
28:3). 4:2 |) 54.9 1) 7-20-02 nt0,2 t.68,0. | 13:2) 80.0%, 1.10.2) On. mene
38.3) 4.351 55.2 >. 7.31 1162-7'| (10.3) 60.3°° 13.3) 81-3" 736.3) 03-3 0eaao
39-3 44] 55-7 74 |62.8 104/60.7 13.4) 81.7 16.4/03.7 194
30.057 4.571'50.00) 7 .588102.0 wa tO-51 70.1 13.5 |°82.5 7126.5) 04, tages
40.0) (4.61) 56:5 7.662.005 36.61 470.5 | 12.61) 82,500 10.0 Oa, see oe
40.4 4.7.) 57.0 7.7,|/63.1) | 10.7| 70.9, 13.7} 82.9. 16.7) 04.0) gto?
40.8) 4:8 S75 0 9.81) 63-2 yetO.8 191.3 6 13.8.1083 300 Oo re
41.2) 4:0°)'§80"" 7:0, \.63i4 8 10:01:71.7. 13.0) 82:74.9360.01 65; 70
AL. 9005.0 | 58.3. 8.0
6316 (411.0) 72,1 . 14.0) 84.1 "0° 27.0) OG,Oemeacas
VIII. HOW TO COMPUTE CLASS Ta, Ba, AND Ca
The Ta for the class, grade, or group is the mean of the pupils’
Ta’s. In Table 13D the class Ta is 48.2.
To compute the class Ba, first compute the mean solar age for
the class, second, convert this into a Ba correction by the use of
Table 13B, third, add or subtract the Ba correction to or from
the Class Ta. Thus the mean solar age for the class in Table 13D
is 12 yrs. 2 mos. According to Table 13B, this solar age corre-
sponds to a Ba correction of + 2. When 2 is added to the class
Ta, the resulting class Ba is 50.2 as shown in Table 13D.
To compute the class Ca, find the class Ta in Table 13C and
read the corresponding Ga. Add to or subtract from the Ga the
appropriate correction. Thus the class Ta of 48.2 corresponds
to a Ga of 6.0. A Ga of 6.0 plus a correction of .2 for the third
month gives a class Ca of 6.2.
TABLE 13D
CHINESE FUNDAMENTALS OF ARITHMETIC SCALE, FORM I
School No. 25 Grade VI Down December 10, 1922
Solar Age          Name    Ta      Ba      Ca
13 yrs. 2 mos.      A      40      38      4.8
12 yrs. 6 mos.      B      50      50      6.5
10 yrs. 7 mos.      C      53      62      7.1
11 yrs. 4 mos.      D      46      52      5.9
13 yrs. 5 mos.      E      52      48      6.9
12 yrs. 2 mos.             Ta 48.2
                           Ba 50.2
                           Ca 6.2
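The class computations of this section can be checked against Table 13D with a few lines of code. This is a sketch under assumptions: the names are hypothetical, the fifth pupil's age is read as 13 yrs. 5 mos. where the printed figure is unclear, the mean age is rounded to the nearest whole month, and the Ba correction of + 2 is the one the text reads from Table 13B.

```python
from statistics import mean

# Rows of Table 13D as (years, months, Ta). The fifth age is an assumed reading.
PUPILS = [(13, 2, 40), (12, 6, 50), (10, 7, 53), (11, 4, 46), (13, 5, 52)]


def class_ta(pupils):
    """The class Ta is the mean of the pupils' Ta's."""
    return round(mean(ta for _, _, ta in pupils), 1)


def mean_solar_age(pupils):
    """Mean solar age of the class as (years, months), to the nearest month."""
    months = round(mean(y * 12 + m for y, m, _ in pupils))
    return divmod(months, 12)


def class_ba(pupils, ba_correction):
    """Class Ta plus the Ba correction read for the mean solar age."""
    return round(class_ta(pupils) + ba_correction, 1)
```

Here `class_ta(PUPILS)` is 48.2, `mean_solar_age(PUPILS)` is `(12, 2)`, and `class_ba(PUPILS, 2)` is 50.2, matching the bottom of Table 13D.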
IX. HOW TO INTERPRET PUPIL Ta AND CLASS Ta
The number of examples correct is not a satisfactory unit of
measurement because the difference in difficulty between 30 and
31 examples correct may be greater or less than between 10 and
11 examples correct. The difference between 30 T and 31 T or
28 T and 29 T always equals the difference between 10 T and
11 T or 55 T and 56 T.
Again T scores make possible such statements as the following.
Any pupil or class whose T is 50 has an ability which equals the
mean ability of all twelve-year-old pupils. Any pupil or class
whose T is 70 has an ability which is 20 T (or 2 S. D.) above the
mean ability of twelve-year-olds. Any pupil whose T is 35 is 15 T
(or 1.5 S. D.) below the mean ability of twelve-year-olds.
Again, T scores may be interpreted as shown in Table 13E.
TABLE 13E
                Is Exceeded by the                   Is Exceeded by the
                Following Per Cent                   Following Per Cent
A T Score of    of 12-year-olds      A T Score of    of 12-year-olds
     25               99                  55               31
     30               98                  60               16
     35               93                  65                7
     40               84                  70                2
     45               69                  75                1
     50               50                  80                0.1
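The entries of Table 13E agree with a normal distribution having a mean of 50 T and an S. D. of 10 T. On that assumption, which the booklet does not state in so many words, the percentages can be reproduced with the standard library; the name `percent_exceeding` is hypothetical.

```python
from statistics import NormalDist

# Assumed model: T scores of twelve-year-olds are normal, mean 50, S. D. 10.
T_SCALE = NormalDist(mu=50, sigma=10)


def percent_exceeding(t_score):
    """Per cent of twelve-year-olds whose ability exceeds the given T score."""
    return 100 * (1 - T_SCALE.cdf(t_score))
```

Rounded to whole numbers, `percent_exceeding(40)` gives 84 and `percent_exceeding(60)` gives 16, as in the table.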
X. HOW TO INTERPRET PUPIL Ba AND CLASS Ba
The Ba norm is always 50 for all pupils. If a pupil’s Ba is
50, his arithmetic ability equals the mean ability of all pupils of
like age. He is of average brightness. If his Ba is 40 he is 10 T
(or 1 S. D.) below the mean brightness in arithmetic of his own
age group. According to Table 13E he is exceeded by 84 per cent,
not of 12-year-olds, but of pupils of like age. If his Ba is 75, he
is 25 T (or 2.5 S. D.) above the mean brightness in arithmetic of
pupils of like age. According to Table 13E, he is extremely
bright, since only 1 per cent of his own age group are brighter.
In like manner the mean Ba for a class shows the brightness in
arithmetic of that class as a whole as compared with the brightness
of all other classes, not of like grade, but of like age.
Thus both Ta and Ba are needed. Ta gives a measure of total
arithmetic ability and incidentally shows how much each pupil or
class Ta is above or below the mean Ta of twelve-year-olds. A
Ta scale is used primarily for the purpose of measuring growth in
ability from month to month and year to year.
But a nine-year-old pupil or class might have a Ta much below
50 and still be doing exceptionally satisfactory work. There is
needed some score which makes allowance for the fact that a pupil
or class is younger or older than twelve. The Ba correction
automatically makes just this allowance, and the Ba shows pupil
or class ability in comparison with pupils or classes of the same
age. A young pupil may have a small Ta and a large Ba and an
old pupil may have a large Ta and a small Ba. A pupil or class
Ta grows larger from month to month and year to year, whereas
the Ba changes little or not at all.
XI. HOW TO INTERPRET PUPIL Ca AND CLASS Ca
For a pupil to have a Ca of 3.5 means that he is an average
third-grade pupil in the fundamentals of arithmetic. A Ca of 3.0
means that he barely belongs in the third grade. A Ca of 3.9
means that he is almost, but not quite, ready to be promoted into
fourth-grade work in the fundamentals of arithmetic. A Ca of
6.4 means that he just fails of being an average sixth-grade pupil.
The class Ca is interpreted similarly.
Since the pupils in Table 13D are sixth-grade pupils their norm
Ca is 6.5 and will continue to be 6.5 so long as they remain in
Grade VI. It jumps to 7.5 as soon as a pupil is promoted to the
next grade. The first pupil is 1.7 Ca or grade below norm. The
second pupil is exactly at the Ca norm. The class is 0.3 Ca below
the Ca norm.
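The comparison with the Ca norm is a single subtraction, shown here as a small sketch with hypothetical names. It takes the norm for any grade to be the grade number plus .5, as the text states for Grade VI.

```python
def ca_norm(grade):
    """Norm Ca for a grade: the grade number plus .5."""
    return grade + 0.5


def ca_deviation(ca, grade):
    """How far a pupil or class Ca stands above (+) or below (-) the norm."""
    return round(ca - ca_norm(grade), 1)
```

For the sixth-grade pupils of Table 13D, `ca_deviation(4.8, 6)` is -1.7, `ca_deviation(6.5, 6)` is 0.0, and the class value `ca_deviation(6.2, 6)` is -0.3.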
XII. SUPPLEMENTARY DIAGNOSTIC SCORING
On the front page of the test paper, write in the space after
“Attempts,” the number of the example circled by the pupil.
This may be taken as a measure of his speed of work. Write in
the space after “Rights” the number of examples done correctly
inclusive of and prior to the example circled. A comparison of
Rights and Attempts shows the per cent of accuracy. Some pupils
are slow and inaccurate, some slow and accurate, some fast and
inaccurate, and some fast and accurate, and some are average.
Each type requires different treatment.
There are 20 examples for each of the four processes. Count
separately the number of examples done correctly on each process,
and write these scores in the spaces provided on the front page of
the test paper. If the pupil has mastered each of the processes
equally well his four separate scores should be approximately
equal in size.
An even more helpful diagnosis can be secured by making out,
or having the pupils make out, a table showing just what examples
were missed or omitted by each pupil. From this the per cent of
pupils missing or omitting each example can be readily deter-
mined. Each pair of examples (1 and 2, 3 and 4, etc.) are built
to test a pupil’s mastery of a certain type principle or difficulty.
As a rule, each pair of examples includes the difficulties of all
preceding pairs and one additional difficulty. Two examples of
each type are included because a chance error may cause a pupil
to miss an example whose principle he has really mastered.
Once each pupil’s need has been discovered in these ways, he
can be given training on his specific weaknesses. A specially
effective set of practice materials for giving this training is being
prepared by the Nanking Committee for publication by the Com-
mercial Press, Shanghai. Under no circumstances should a pupil
be especially drilled on the particular examples of this test. The
teacher who does this destroys the usefulness of the test as a
measuring instrument.
Since diagnostic scores are intended for local use rather than
for publication, tables have not been provided for scaling them.
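Two of the diagnostic tallies described in this section lend themselves to a short sketch. The names and data layout are hypothetical: each paper is represented as a dictionary from example number to True (correct) or False (wrong), with omitted examples simply absent. Taking the per cent of accuracy to be Rights divided by Attempts is one reading of the "comparison" the text mentions.

```python
def speed_and_accuracy(marks, circled):
    """Return (attempts, rights, per cent of accuracy) for one paper.

    circled is the number of the example circled when "Stop!" was called.
    """
    attempts = circled
    rights = sum(1 for n, ok in marks.items() if ok and circled >= n)
    accuracy = 100 * rights / attempts if attempts else 0.0
    return attempts, rights, accuracy


def percent_missing(class_marks, example_numbers):
    """Per cent of pupils who missed or omitted each listed example."""
    n_pupils = len(class_marks)
    return {
        e: 100 * sum(1 for marks in class_marks if not marks.get(e, False)) / n_pupils
        for e in example_numbers
    }
```

A pupil who circled example 4 and had examples 1, 2, and 4 right would be recorded as 4 Attempts, 3 Rights, and 75 per cent accuracy.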
XIII. ACCURACY OF SCALE SCORING
The accuracy of scale scores depends upon (1) the way in
which pupils to be tested were selected, and (2) the number of
pupils tested. The pupils tested were a random sampling from the
total population in grades III through VIII in the government
schools of Peking and Tientsin. The number tested was ap-
proximately 2000.
XIV. ACKNOWLEDGMENTS
These arithmetic scales were prepared by the Peking Committee
consisting of Professors L. C. Cha, C. Y. Chang, Y. C. Chang,
T. T. Lew, E. L. Terman, Wm. A. McCall, their students, and
Lydia Sherritt, under the auspices of the National Association
for the Advancement of Education.
The units of measurement used in these scales were devised by
Dr. Wm. A. McCall and named by him in honor of those whose
contribution to scientific mental measurement has been of most
fundamental significance.
T (Total ability) is for Thorndike, the originator and teacher
of scientific educational measurement and author of the first
College Entrance Intelligence Test, and for Terman the author of
the Stanford Revision of the Binet-Simon scale and leading ex-
ponent of the age-scale system.
B (Brightness) is for Binet the creator, with Simon of the first
intelligence scale, and for Buckingham the creator of the grade-
scale system.
C (Classification) is for Courtis, an early pioneer in educational
measurement and originator of practice tests, and for Cattell who
with Fullerton laid the foundation built upon by Hillegas in con-
structing the first statistically satisfactory product scale and in
remembrance of China where this unit was first devised and used
as such.
F (Effort) is for Franzen, Pintner and Monroe, all of whom
published at about the same time a practical mechanism for meas-
uring achievement as related to capacity to achieve. This unit
is used only when both an intelligence and educational test have
been given.
W. T. Tao, General Director of the Association.
V. SUMMARY OF THE STEPS IN THE PROCESS OF CON-
STRUCTING, SCALING, AND STANDARDIZING A TEST
I. Difficulty Test
1. Decide upon the mental trait to be measured and
define it as exactly as possible.
2. Decide upon a test form and general content which
will measure this trait and this trait only, which will yield
one and only one correct and easily scored pupil response
to each test element, and where each element may be scored
as either right, wrong, or omitted.
3. Decide upon the range of ability to be measured.
4. Consult previous tests of this trait or similar traits
to determine how easy and how difficult the test elements
must be made, how simple the directions must be, and
what is a suitable mechanical arrangement of material for
mimeographing or printing.
5. If no such test exists prepare a tentative set of direc-
tions and a few tentative test elements and try them on a
few of the ablest and least able pupils ever likely to be
tested.
6. Prepare a test, which is as perfect in every detail
as possible, which advances by gradual steps of difficulty
from slightly easier to slightly more difficult than will be
required in the final test, and which has about one-fourth
more content than will be required in the final test (unless
the test is for diagnostic purposes in which case only the
material to be used finally should be used).
7. Make provision for the following identification data:
(1) First name, (2) Last name, (3) Sex, (4) Age in years,
(5) Birth month, (6) Birthday, (7) School, (8) Grade,
(9) Section, (10) Date of test.
8. Prepare sample and directions for pupils. For gen-
eral directions to examiner, see Section III of this chapter.
9. Explain and apply the test to several intelligent
adults and correct it in the light of their criticisms.
10. Apply the test to about 110 pupils scattered over
the entire range of ability of pupils for whom the test is
designed. Be sure to include some of the ablest and least
able pupils ever to be tested with the completed test. Give
all the time pupils need to do every test element or to do all
they can. Record on his paper the time required by each
pupil.
11. Make out a list of correct answers, a mechanical
device for scoring, and directions for scoring.
12. Score each test element, using 1 for correct, x for
wrong, and o for omitted.
13. Eliminate from the test all elements which prove
ambiguous, unscorable, or are otherwise unsatisfactory.
14. Discard enough tests to leave 100. Do not dis-
card the best and poorest papers.
15. Compute the total score made by each pupil on
the odd numbered questions and then on the even num-
bered questions.
16. Make a correlation diagram for these two sets of
scores. Call in for a conference those pupils who are
chiefly responsible for lowering the correlation. Go over
each element tried and missed by them to see if some
ambiguity or other defect is responsible. Correct or elim-
inate test elements if defects are brought to light.
17. Make a correlation diagram for the total score of
each pupil on the total test and the criterion (if such be
available). Confer and correct as before.
18. Call in a few of the most gifted pupils and enquire
the reason why various test elements were missed by them.
Correct or eliminate elements if defects are brought to
light.
19. Tabulate, by pupils and remaining test elements the
1’s, x’s, and o’s, thus for the 100 papers.
                                   TEST ELEMENTS
Name               1     2     3     4     5     6     7     8     9     10    etc.
[name illegible]   1     1     1     x     1     1     x    [?]    o     o     etc.
[name illegible]   1    [?]   [?]   [?]    1     x     x     o     x     o     etc.
etc.              etc.  etc.  etc.  etc.  etc.  etc.  etc.  etc.  etc.  etc.
Total Correct      __    __    __    __    __    __    __    __    __    __
T Difficulty       __    __    __    __    __    __    __    __    __    __
20. Compute, from the preceding tabulation, the num-
ber and per cent of pupils doing correctly each test element.
Since there are 100 pupils the “Total correct” will also be
the per cent required. This will not be true when the
pupil has a 50-50 opportunity of getting an element cor-
rect by chance. In this case, subtract from the total of
1’s on each element, the total x’s, and divide the re-
mainder by 100. The quotient will be the proper per cent
correct.
21. Convert each per cent into an S.D. value or T diffi-
culty by means of Table 7.
22. Arrange test elements in order of T difficulty.
23. In view of the time records on the test and the
time decided upon for the final test, decide upon the number
of test elements required in order that the fastest pupil
will not quite finish the test before time is called. In
deciding upon the time allowance for the final test, due con-
sideration should be given to practicality and to reliability.
In general do not be satisfied with a reliability (Self r)
of less than .85 between the two halves of the test. Other
things being equal, an abbreviated test means a low re-
liability. Hence if the self r is too low, lengthen the time
allowance, and increase the number of test elements or
provide for two tests to be averaged instead of one longer
test.
24. Select the number of test elements decided upon.
Select in such a way that the successive elements will in-
crease, so far as possible, by equal increments of T difficulty
from one done correctly by about 99 per cent of the pupils
to one done correctly by about 1 per cent of the pupils.
If the elements available are too easy or too difficult try
out and incorporate additional elements of the desired diffi-
culty. Sometimes diagnostic or other considerations should
weigh more heavily than difficulty or time-allowance con-
siderations in determining the final content of a test. In
this case the test constructor must use his judgment to
decide how much alteration of the test content is per-
missible.
25. Improve the mechanical make-up of the test and
directions for applying it in any way that experience
suggests.
26. Print the test in final form.
27. To test the satisfactoriness of the proposed time
allowance, apply the test to the ablest class ever likely to
be tested. Have pupils circle the number of the test element
being worked upon at the end of regular intervals. Stop
the test the moment the fastest pupil finishes. Record this
time.
28. Determine the total score made by all pupils com-
bined during each of the successive time intervals.
29. Fix an official final time allowance such that at its
expiration the fastest pupil would not quite have finished
and the ablest pupil would have done all he could. Adopt
for future use the minimum time that would have accom-
plished these two objects.
30. Apply the test to about 2000 pupils in the grades
for which the test is designed. The schools selected for
testing should approximate as closely as possible a random
sampling of all schools. In the schools selected, all pupils
in the appropriate grades should be tested.
31. Score the tests and compute the total score made
by each pupil. In scoring it is usually more convenient to
give one point for each element done correctly, but this is
not imperative. Some prefer to give 2, 1, or 0 credits to an
element according to the excellence of the pupil’s answer.
The resulting increase in accuracy is seldom worth the
extra trouble. Elements of large enough scope to justify
extra points can usually be broken into two or more sepa-
rate elements. Do not assign points proportional to the
difficulty of an element. This involves a cumulative error.
32. Make a frequency distribution of scores for each
grade, and then for each age. Make all frequency distribu-
tions in step intervals the size of the smallest scoring unit.
This is usually one.
33. Using 8.0 to 9.0, 12.0 to 13.0, or 16.0 to 17.0 year-
olds for primary, higher elementary, or high school, respec-
tively, convert these raw scores into T scores by means of
Table 7, and as illustrated in Table 6.
34. If thought desirable, increase the range of the T
scale by a process illustrated in Table 8.
35. Construct a B scale for the test by a process illus-
trated in Table 10.
36. Construct a C scale for the test.
37. Prepare the official directions booklet to be issued
with the test. In order to secure uniformity, a sample direc-
tions booklet is given in Section IV of this chapter.
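Steps 15, 16, 20, and 23 above involve two small computations: the correlation between the odd-numbered and even-numbered halves of the test (the self r), and the per cent correct on an element, with the deduction of wrongs where the pupil has a 50-50 chance. The sketch below uses hypothetical names and the ordinary product-moment formula for r; papers are lists of 1 (correct) and 0 (wrong or omitted) in test order, and the per cent is expressed on a base of 100 whatever the number of pupils.

```python
from math import sqrt
from statistics import mean


def odd_even_totals(marks):
    """Totals on the odd-numbered and the even-numbered elements of one paper."""
    odd = sum(marks[0::2])    # elements 1, 3, 5, ...
    even = sum(marks[1::2])   # elements 2, 4, 6, ...
    return odd, even


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists of scores."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def self_r(papers):
    """Correlation between the two halves of the test over all papers."""
    halves = [odd_even_totals(p) for p in papers]
    return pearson_r([o for o, _ in halves], [e for _, e in halves])


def percent_correct(rights, wrongs, n_pupils, two_choice=False):
    """Per cent of pupils doing an element correctly.

    For two-choice elements the wrongs are deducted from the rights first.
    """
    net = rights - wrongs if two_choice else rights
    return 100 * net / n_pupils
```

A self r computed in this way may then be compared with the .85 suggested in step 23, for example `self_r(papers) >= 0.85`.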
II. Rate Test
1. Do steps I1, I2, I3, I4 except that all elements of
the test should be of uniform or approximately uniform
difficulty, I5, I6 except the statement concerning gradually
increasing difficulty, I7, I8, I9, I10 except that there should
be a fixed time allowance instead of a fixed number of ele-
ments to be done, I11, I12, I13, I14, I15, I16, I17, I18, I19
for a few representative test elements only to see whether
the test elements are on the desired difficulty level, I20, I21,
I23, I24 except for all reference to difficulty, I25, I26, I30,
I31, I32, I33, I34, I35, I36, and I37.
2. Since rate tests usually yield two scores, namely num-
ber tried and accuracy, T, B, and C scales may be con-
structed for both, or for just number right only, or for a
properly weighted combination of number tried and number
right.
III. Product Tests Such as Handwriting, Composition, and
Drawing
1. Do I1, I2 except that product tests are usually scored
as a whole rather than by separate elements, I3, I4, I5, I6
except for the references to difficulty, I7, I8, I9, I10 except
that there should be a fixed time limit, and, in the case of
traits like composition and drawing, a warning a few min-
utes before time is called.
2. Repeat I10 on the same group of pupils so as to
secure two measures of the trait.
3. Do I14 for both sets of products.
4. Rate 1 the poorest specimen in the first set. Rate 2
the next poorest and so on to 100. Have this done by, say,
three competent judges. Average the three judgments to
get the final rating for each specimen.
5. Repeat III4 for the second set of specimens.
6. Do I16 for these two sets of ratings, and I17 for
either set or both. If the self r is too low, increase the time
allowance or provide for two or more tests to be averaged
and treated as one.
7. Do I25, I26, and I30.
8. Pick out all specimens written by pupils of ages 8.0
to 9.0, or 12.0 to 13.0, or 16.0 to 17.0 depending upon the
level for which the test is designed. Age 12.0 to 13.0 will
serve fairly well for all levels. Write on each specimen a
number without regard to its merit.
9. Separate the papers into ten piles—A (poorest),
B (next poorest), C, D, E, F, G, H, I and J (best)—
according to the merit of each specimen.
10. Take pile A and divide it into 5 piles—a (poorest),
b, c, d, and e (best )—according to merit.
11. Do III10 for the other nine piles.
12. Take pile Aa and arrange the papers in it in order
of merit.
13. Do III12 for Ab, Ac, Ad, Ae, Ba, Bb, and so on for the
50 separate piles.
14. Carefully compare the few best specimens in Aa with
the few poorest specimens in Ab. If the order of merit is
not correct rearrange across the junction point. Repeat
this process for the other 48 junction points.
15. On a record sheet, write down in order of merit the
number of each specimen. After the number of the poorest
specimen, mark 1. After the number of the next poorest,
mark 2, and so on for all specimens.
16. Have at least three competent judges do steps III9,
III10, III11, III12, III13, III14, and III15 without knowl-
edge of each other’s marks.
17. Compute the mean of the three marks given each
specimen by the three judges. Arrange specimen numbers
in order of merit according to these means.
18. Check that specimen number where the per cent
exceeding-plus-half-those-reaching-it in merit is nearest
99.865. According to Table 7, this specimen has a merit
of 20. Check the one where the per cent is nearest 99.38.
This has a merit of 25. The other per cents to check are
shown in the first row of the following. The T merit of the
specimen checked is shown in the second row. If only half
this number of specimens are desired in the final scale, use
those per cents whose T merits are 20, 30, 40, 50, 60, 70
and 80. If more specimens are desired in the final scale,
Table 7 will show which per cents will yield equal intervals
of T merit.
Per cent ...... 99.865   99.38   97.72   93.32   84.13   69.15
T merit .......   20      25      30      35      40      45
Per cent ......   50     30.85   15.87    6.68    2.28     .62     .13
T merit .......   50      55      60      65      70      75      80
19. After checking these 13, say, specimen numbers,
check also the five specimens immediately preceding each
in merit and the five immediately following each in merit.
This will give 13 sets—N, O, P, Q, R, S, T, U, V, W, X,
Y, and Z—of eleven specimens each. Mix up the specimens
within each set.
20. Ask a large number of judges to arrange in order
of merit the specimens in set N, and record in order the
specimen numbers, together with marks 1 through 11. The
previous rating by three judges can be utilized.
21. Repeat III20 for the other twelve sets.
22. Compute the mean of all these marks given each
specimen.
23. Guided by these means, choose from set N the speci-
men most central in merit. This is the specimen most
entitled to the T merit of 20. Do likewise for sets O, P, Q,
etc., and give to each, T merits of 25, 30, 35, etc., respec-
tively. These 13 specimens together with their T merits
constitute a product-scoring scale, which may be used to
determine the T score in handwriting made by any pupil.
All that is necessary is to move the pupil’s specimen along
this scale until a scale specimen is found which is like it in
merit. The pupil’s T score is the T merit of the scale speci-
men most like it in merit.
24. Have at least three competent judges score each of
the 2000 specimens originally collected by comparing it with
the specimens in this product-scoring scale. Consider that
each pupil’s T score is the mean of these three ratings.
25. Do I32 for each of the grades, and for each of the
ages, except age 12.0 to 13.0.
26. Do I35, I36, and I37.
27. A much more laborious and, for purposes of pure
research, perhaps more satisfactory method of constructing
a product-scoring scale is described in Chapter IX, Sec-
tion IV of “How to Measure in Education.”
If this more laborious method of product-scale construc-
tion is used, omit steps III8 through III23. Do III24,
III25 not excepting ages 12.0 to 13.0, I33, I34, I35, I36,
and I37.
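The per cents and T merits paired in step 18 of this section are consistent with the normal curve: a per cent of 99.865 lies 3 S. D. below the mean and so receives a T merit of 20. Since Table 7 is not reproduced here, the sketch below substitutes the inverse normal function of the standard library; this substitution, the function names, and the treatment of each specimen as holding a rank of its own are all assumptions.

```python
from statistics import NormalDist


def percent_position(rank, n_specimens):
    """Per cent exceeding a specimen plus half those reaching it.

    rank 1 is the poorest specimen. Each specimen is assumed to hold a rank
    by itself, so "those reaching it" is the one specimen.
    """
    exceeding = n_specimens - rank
    return 100 * (exceeding + 0.5) / n_specimens


def t_merit(percent):
    """T merit for a per cent exceeding-plus-half-those-reaching-it."""
    z = NormalDist().inv_cdf(1 - percent / 100)
    return 50 + 10 * z
```

Rounded to whole numbers, `t_merit(99.865)` is 20, `t_merit(84.13)` is 40, `t_merit(50)` is 50, and `t_merit(0.13)` is 80, in agreement with the per cents listed in step 18.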
IV. Battery of Tests
1. Prepare each of the difficulty, rate, or product tests
entering into the battery up to, but not including, step I26,
in so far as these 25 steps apply to the construction of each
type. If there are product tests, construct, besides, a
product-scoring scale for each, based upon about 1000 speci-
mens collected from 1000 unselected pupils between the ages -
8.0 and 9.0, 12.0 and 13.0, or 16.0 and 17.0.
2. Prepare all these component tests from data collected
from the same 1oo pupils. If tests are merely being com-
piled and were carried through the preliminary stages pre-
viously, then apply them all to the same too pupils.
3. Compute the total score on each test separately made
by these 100 pupils on the basis only of the test elements
selected for the final form of the test.
4. Make a separate frequency distribution of the 100
scores on each test.
5. Compute the SD of each frequency distribution.
6. If all tests in the battery are to have equal weight,
choose a multiplier for each SD such that all SD’s will
be made approximately alike in size. For example:

    SD            4     2     8     [illegible]
    Multiplier    1     2     ½     3

If all tests are not to have equal weight, choose multipliers
which will bring the SD’s to the desired ratio. Choose
multipliers such that the labor of applying them will be the
least possible.
7. Print the tests in booklet form. Insert the multipliers
on the front page of the booklet, thus:

    Test     Points     Multiplier     Weighted Points
    1                   1
    2                   2
    3                   ½
    4                   3
    Total

8. Do all three of I27, I28, and I29 for each difficulty
test in the battery.
9. Do I30 for the battery booklet.
10. Do I31 for each of the battery tests.
11. Compute for each pupil the total weighted points as
indicated in IV7.
12. Do all of I32, I33, I34, I35, and I36 for the total
weighted points.
13. Do I37 for the battery.
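
The choice of multipliers in step IV6 can be sketched in Python. This is only a minimal illustration under stated assumptions: the function name `choose_multipliers`, the list of "easy" candidate multipliers, and the target SD of 4 are hypothetical and are not part of the procedure as printed.

```python
from fractions import Fraction

# Hypothetical list of multipliers that are easy to apply by hand.
CANDIDATES = [Fraction(1, 4), Fraction(1, 3), Fraction(1, 2), Fraction(2, 3),
              Fraction(1), Fraction(2), Fraction(3), Fraction(4)]

def choose_multipliers(sds, target):
    """For each test SD, pick the candidate that brings SD * multiplier nearest the target."""
    return [min(CANDIDATES, key=lambda m: abs(sd * m - target)) for sd in sds]

# Sample SDs for three tests that are to carry equal weight.
multipliers = choose_multipliers([4, 2, 8], target=4)
weighted_sds = [sd * m for sd, m in zip([4, 2, 8], multipliers)]
print(multipliers, weighted_sds)
```

For unequal weights, the same idea applies with a different target for each test.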
CHAPTER VI
COMPUTATIONS FOR THE ONE-GROUP
EXPERIMENTAL METHOD
Computation Model I.—The purpose of this chapter is
to give and explain a series of computation molds into
which the experimenter may fit his experimental data.
Enough such models are given to provide for all the com-
mon varieties of experiments. Thus all the experimenter
needs to do is to find the mold which fits his experiment,
substitute in it his experimental data, do the computations
indicated, and the proper conclusions and the reliability of
these conclusions will follow automatically.
The simplest type of experiment is the one-group experiment,
where two experimental factors are contrasted, and
where only one type of test is used to measure the change
produced by the experimental factors. The computation
mold for this experimental method is given in Table 14.

TABLE 14
COMPUTATION MODEL I
One Group — Two EF’s — One Test Type

          Group A — EF1                    Group A — EF2
P    IT1    FT1    C1    x    x²      IT1    FT1    C2    x    x²
N                  M1         Sx²                   M2         Sx²
                   AM         SD                    AM         SD
                   c          SDM1                  c          SDM2

In each half of the table:
$$SD = \sqrt{\frac{Sx^2}{N} - c^2} \qquad SDM = \frac{SD}{\sqrt{N}}$$

SUMMARY
          EF1    EF2    D          SDD                             EC
Test 1    M1     M2     M1 − M2    $\sqrt{(SDM1)^2 + (SDM2)^2}$    $\dfrac{D}{2.78\,SDD}$

Illustration of Computation Model I.—Table 14 is best
explained by formulating an experimental problem which
may be solved by means of the one-group experimental
method, and then to substitute sample data in computation
model I. Assume this problem: What is the effect of a
defined amount of vigorous physical exercise upon the pulse
rate of pupils? This problem may be solved by the one-group
method. There are two EF’s, namely, vigorous
physical exercise (EF1) and the absence of such exercise
(EF2).

TABLE 15
ILLUSTRATING HOW TO USE COMPUTATION MODEL I WITH SAMPLE DATA, WHEN EF2 IS
THE MERE ABSENCE OF EF1
One Group — Two EF’s — One Test Type

          Group A — EF1                     Group A — EF2
P    IT1    FT1    C1    x    x²       IT1    FT1    C2    x    x²
a     95    105    10    2     4        95     95     0    0     0
b    100    105     5    3     9       100    100     0    0     0
c    101    109     8    0     0       101    101     0    0     0
d     97    106     9    1     1        97     97     0    0     0
e    102    109     7    1     1       102    102     0    0     0
f     96    108    12    4    16        96     96     0    0     0
g     99    107     8    0     0        99     99     0    0     0
h     98    107     9    1     1        98     98     0    0     0
i    100    111    11    3     9       100    100     0    0     0
N = 9       M1 = 8.8     Sx² = 41             M2 = 0       Sx² = 0
            AM = 8.0                          AM = 0
            c = 0.8                           c = 0

Group A — EF1: $SD = \sqrt{\frac{41}{9} - (0.8)^2} = 2.0$, $SDM1 = \frac{2.0}{\sqrt{9}} = 0.7$
Group A — EF2: $SD = \sqrt{\frac{0}{9} - (0)^2} = 0$, $SDM2 = \frac{0}{\sqrt{9}} = 0$

SUMMARY
          EF1    EF2    D      SDD                              EC
Test 1    8.8    0      8.8    $\sqrt{(0.7)^2 + (0)^2} = 0.7$   $\dfrac{8.8}{2.78 \times 0.7} = 4.6$

Table 15 reproduces model I in statistical form. Unless
the formula especially demands something else, all computations
at all stages are done to the nearest first decimal
only, so as to make it easier for the student to check com-
putations. Greater exactness is advised in actual experi-
mental computations.
Computation of Changes Produced by EF1.—Since a
thorough mastery of the symbols, abbreviations, and com-
putations shown in Table 14 and illustrated in Table 15 is
essential to an understanding of all subsequent experi-
mental computations, the data of these two tables are ex-
plained in considerable detail.
Both Table 14 and Table 15 show the experimental computations
for any one-group experiment contrasting two
EF’s and employing only one type of test. The one type
of test employed in Table 15 is a test or count or determination
of pulse rate. Of course this test was made more than
once, but throughout Table 15 only one function is measured.
Had the effect of vigorous exercise upon both pulse
rate and, say, blood pressure been studied, two test types
would have been employed, since two different functions
would have been measured.
In the left half of both Table 14 and Table 15 “Group
A” is the experimental group or subjects used. As indicated,
Group A has EF1 applied to it. Instead of placing
EF1 immediately after Group A as shown in the tables it
might have been placed between IT1 and FT1 to indicate
that the EF1 is applied to Group A after the IT1 and before
the FT1.
In Table 14 “P” represents the pupils who constitute
Group A. The “N” beneath it means the number of pupils
in Group A. In Table 15 the pupils used are a, b, c, etc.,
and N is 9.
IT means the initial test or scores made on the initial
test by each pupil. In Table 15, these scores are pulse rates
of 95, 100, 101, etc. The numeral 1 following IT refers
to the first type of test. This will be needed more when
more than one test type is used. The “FT1” refers to the
final test.
“C1” in both Table 14 and Table 15 means the change
produced by the EF1, and is found by computing the difference
between each pupil’s IT and FT. Thus in Table 15
C1 for Pupil a is 10 points, found by getting the difference
between 105 and 95. Had the IT1 for Pupil a been 105
and the FT1 been 95, C1 would still be 10, but should be
preceded by a minus sign to indicate that the change is a
10-point loss. In all cases where the FT is smaller than
the IT a minus should be prefixed to the C, unless the test
is scored in terms of time or the like where a smaller FT
than IT clearly means a gain rather than a loss. In cases
where it is not clear whether a smaller FT than IT is desirable
or undesirable, the minus should be prefixed. The
experimenter should remember, however, that the minus in
such cases does not, as it usually does, mean something
undesirable.
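
The sign rule for C can be put in a few lines of Python. This is a sketch only; the function name `change` and the flag `lower_is_better` are assumptions introduced here to stand for tests "scored in terms of time or the like."

```python
def change(it, ft, lower_is_better=False):
    """Signed change C between initial test (IT) and final test (FT).

    By default a smaller FT than IT gives a minus. When the test is scored
    so that a smaller score clearly means a gain (e.g. time), the sign is flipped.
    """
    return (it - ft) if lower_is_better else (ft - it)

c_gain = change(95, 105)                        # Pupil a in Table 15
c_loss = change(105, 95)                        # the reversed case described in the text
c_time = change(105, 95, lower_is_better=True)  # hypothetical timed test
print(c_gain, c_loss, c_time)
```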
Computation of Mean, SD, and SDM for EF1.—The
“M1” under the C1 is the arithmetic mean of the various
C1’s. In Table 15 this M1 is 8.8. Had any of the C1’s
been preceded by a minus the M1 would have been less
than 8.8, for signs should be regarded in computing M1.
The “AM” beneath the M1 means the assumed mean.
The AM is used instead of the M1 for computing x, x²,
etc., because its use is a great convenience and economy.
Any convenient number might be used as the assumed mean,
though it is usually most convenient to assume the nearest
whole number to the M1. Thus in Table 15, 8.0 is used
as the AM, which makes the c or correction 0.8. Signs
are disregarded in determining and using c. The AM of
8.0 makes a c of 0.8. An AM of 9.0 would make a c
of 0.2. Had the M1 been 8.0 instead of 8.8, an excellent
AM would be 8.0, which would make a c of zero.
The symbol x is the traditional symbol for deviation.
Thus the x for Pupil a is 2, because his C1 of 10 deviates
or differs from the AM of 8.0 by 2 points. The x for
Pupil b is 3, because his C1 of 5 deviates from 8.0 by 3
points. As in the case of c, the direction of the deviation
is disregarded. Had the C1 for Pupil a been −10 instead
of +10, the x would be 18 instead of 2, because the difference
between 8.0 and −10 is 18 points. Had the AM been
−8.0 and the C1 been −10, the x would have been 2.
The column labeled “x²” is found by squaring all the
x’s. Sx² means the sum of the x² column. In Table 15,
Sx² is 41. SD means standard deviation and is one of several
conventional measures of variability. It is computed
according to the formula given in Table 14 and illustrated
in Table 15. No matter whether the AM is larger or
smaller than the M, the c² is always subtracted from Sx²/N,
and it is subtracted before the square root of the whole
quantity is taken. The subtraction of c² corrects for the
use of 8.0 instead of 8.8 in computing x’s, x²’s, etc. If
the reader will compute x, x², etc., from 8.8, he will appreciate
the convenience in the use of 8.0, and correcting for
its use at the end. The N in the SD formula means the
number of pupils in the experimental group. The SD in
Table 15 is 2.0. SDM1 or SD of the M1 is so indicated
to distinguish it from the preceding SD or SD of the C1’s.
SDM1 is a conventional measure of the unreliability of
the M1. It is computed according to the formula shown
in Table 14, and illustrated in Table 15. The SDM1 for
Table 15 is 0.7. The reliability of the M1 or 8.8 is shown
then by its SDM1 of 0.7.
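
The assumed-mean computation just described can be checked with a short Python sketch. The function name `mean_sd_sdm` is an assumption introduced here; the nine changes are the C1's of Table 15. Unlike the hand computation, the sketch carries c at full precision rather than rounding it to 0.8, so intermediate figures differ slightly while the rounded results agree.

```python
import math
import statistics

# The nine changes (C1's) of Table 15.
c1 = [10, 5, 8, 9, 7, 12, 8, 9, 11]

def mean_sd_sdm(changes, assumed_mean):
    """Mean, SD, and SDM computed by the assumed-mean (AM) shortcut."""
    n = len(changes)
    m = sum(changes) / n
    c = m - assumed_mean                                 # correction
    sum_x2 = sum((ch - assumed_mean) ** 2 for ch in changes)
    sd = math.sqrt(sum_x2 / n - c ** 2)                  # c squared subtracted before the root
    sdm = sd / math.sqrt(n)
    return m, sd, sdm

m1, sd1, sdm1 = mean_sd_sdm(c1, assumed_mean=8.0)
direct_sd = statistics.pstdev(c1)   # SD taken directly from the true mean
print(round(m1, 1), round(sd1, 1), round(sdm1, 1))
```

The agreement between `sd1` and `direct_sd` illustrates why the choice of AM does not affect the final SD.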
Computations for EF2.—The right half of Table 14
and Table 15 is headed “Group A—EF2” because EF2 is
applied to the same group of pupils as experienced EF1.
Column P is omitted, since the pupils are the same as those
shown in the first column of the table. The IT, FT, C2,
M2, AM, c, x, x², etc., shown in the right half of the table
are interpreted and computed like those shown in the left
half of the table.
In Table 15 the EF2 is merely the absence of vigorous
exercise. That is, EF2 is merely a continuation of the
same restful conditions which obtained when the IT, in the
left half of the table, was made. The IT, in the right half
of the table, does not need redetermination, for presumably
the results would be identical with the IT1 results shown
in the left half. Since EF2 is a continuation of conditions
obtaining when the IT1 is made, FT1 will coincide, presumably,
with the scores on the IT1. This makes zero all
the C2’s, the M2, the x’s, x²’s, SD and SDM2. In actual
practice when EF2 is merely the absence of EF1, the experimenter
will not actually compute the right half of the
table but will assume all the C2’s and subsequent measures
to be zero. In case EF2 is not the mere absence of
EF1, the right half of the table will have to be computed
in detail.
Computation of M and SD when N Is Large.—The
method of computing M and SD, illustrated in Table 15,
is appropriate and convenient when N is small. It is appropriate,
but not convenient, when N is, say, 50 or more.
When N is large it is more convenient to determine the C1
for each pupil as in Table 15, and then to tabulate these
C1’s into a frequency distribution.
The procedure for constructing a frequency distribution
is as follows:
(1) Write a column of figures beginning with the smallest
C1 and increasing by one to the largest C1. (2) Write
this column in step-intervals of one, extending from five-tenths
below to five-tenths above the C1. The first column
of Table 16 illustrates (1) and (2). (3) Look at the
original C1’s. If the first C1 is 4, place a dot or mark
just after the step-interval 3.5 to 4.5 in Table 16. If the
next C1 is −2, place a mark just after the step-interval
−2.5 to −1.5. If the next C1 is another 4, place another
mark just after the step-interval 3.5 to 4.5. Continue until
a mark has been made after the appropriate step-interval
for every C1. (4) Total the marks placed after each step-interval,
and write this total just after the step-interval in
question. When finished, the two resulting columns will be
a frequency distribution. The first and second columns of
Table 16 constitute a frequency distribution. Note that
each zero frequency (f) must be indicated if the data are to be
used for further computation.
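
The tallying procedure can be sketched in Python as follows. The function name `frequency_distribution` and the six sample changes are hypothetical, and the sketch assumes whole-number C1's so that each falls at the mid-point of a step-interval of one.

```python
from collections import Counter

def frequency_distribution(changes):
    """Rows of (lower limit, upper limit, f) in step-intervals of one.

    Every interval from the smallest to the largest change is listed,
    so zero frequencies are kept for later computation.
    """
    counts = Counter(changes)
    return [(c - 0.5, c + 0.5, counts.get(c, 0))
            for c in range(min(changes), max(changes) + 1)]

# Hypothetical changes; the first three follow the example in the text.
rows = frequency_distribution([4, -2, 4, 0, 1, 1])
for lower, upper, f in rows:
    print(f"{lower:5.1f} to {upper:5.1f}   {f}")
```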

TABLE 16
SHOWING HOW TO COMPUTE M AND SD WHEN N IS LARGE

      C               f       x       fx      fx²
−4.5 to −3.5          1      −8       −8       64
−3.5 to −2.5          2      −7      −14       98
−2.5 to −1.5          2      −6      −12       72
−1.5 to −0.5          3      −5      −15       75
−0.5 to  0.5          3      −4      −12       48
 0.5 to  1.5          4      −3      −12       36
 1.5 to  2.5          0      −2        0        0
 2.5 to  3.5          5      −1       −5        5
 3.5 to  4.5          6       0        0        0
 4.5 to  5.5          5       1        5        5
 5.5 to  6.5          2       2        4        8
 6.5 to  7.5          0       3        0        0
 7.5 to  8.5          5       4       20       80
 8.5 to  9.5          3       5       15       75
 9.5 to 10.5          3       6       18      108
AM = 4.0         N = 44              +62      674
                                     −78
                                     −16

$$c = \frac{-16}{44} \times 1 = -0.36 \qquad M = 4.0 - 0.36 = 3.64$$
$$SD = \left(\sqrt{\frac{674}{44} - (-0.36)^2}\right) \times (1) = 3.9 \qquad SDM = \frac{3.9}{\sqrt{44}} = 0.59$$

The steps in the process of computing M and SD follow:
(1) Some AM is selected at the mid-point of some step-interval
near the center of the frequency distribution. Any
AM will do, but it must be at the mid-point of some step-interval.
AM = 4.0. (2) N is computed. N = 44. (3)
Step x’s from the AM are computed. Thus the step-interval
3.5 to 4.5 deviates from 4.0 by zero. Step-interval 2.5 to
3.5 deviates by −1. Step-interval 4.5 to 5.5 deviates by
+1, and similarly for other step-intervals. Note that zero
frequencies are not overlooked. (4) Each x is multiplied by
its corresponding f to secure the fx column. (5) The positive
fx are added. The negative fx are added. The difference
between these two sums is obtained. Positive Sfx = 62.
Negative Sfx = 78. The difference = −16. (6) The c is
computed.
$$c = \left(\frac{Sfx}{N}\right) \times (\text{size of step-interval})$$
c = −.36. Had AM been 3.0 instead of 4.0, the positive
Sfx would have been larger than the negative Sfx. This
would have produced a positive instead of a negative c. (7)
M is computed by the formula: M = (AM) + (c). Had
c been positive instead of negative, M would have been
4.36 instead of 3.64. (8) The fx² column is secured by
squaring each x, and multiplying by the corresponding f.
It may also be secured by multiplying each fx by the corresponding
x. (9) The Sfx² is computed. Sfx² = 674. (10)
The SD is computed by the formula:
$$SD = \left(\sqrt{\frac{Sfx^2}{N} - (c)^2}\right) \times (\text{size of the step-interval})$$
SD = 3.9.
(11) SDM is computed according to the usual procedure.
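
The steps above can be sketched in Python and checked against Table 16. The function name `grouped_stats` is an assumption. One further assumption is flagged in the code: the correction under the square root is taken in step units (Sfx/N), which is identical to c when the step-interval is one, as it is here.

```python
import math

def grouped_stats(lowers, freqs, step, am):
    """M, SD, and SDM from a frequency distribution by the assumed-mean method."""
    mids = [lo + step / 2 for lo in lowers]
    n = sum(freqs)
    xs = [round((mid - am) / step) for mid in mids]     # step deviations from AM
    sfx = sum(f * x for f, x in zip(freqs, xs))
    sfx2 = sum(f * x * x for f, x in zip(freqs, xs))
    c = sfx / n * step                                  # correction in score units
    m = am + c
    # Assumption: correction squared in step units, then scaled by the step size.
    sd = math.sqrt(sfx2 / n - (sfx / n) ** 2) * step
    sdm = sd / math.sqrt(n)
    return n, sfx, sfx2, c, m, sd, sdm

# Table 16: step-intervals of one from -4.5 to 10.5.
lowers = [-4.5 + k for k in range(15)]
freqs = [1, 2, 2, 3, 3, 4, 0, 5, 6, 5, 2, 0, 5, 3, 3]
n, sfx, sfx2, c, m, sd, sdm = grouped_stats(lowers, freqs, step=1, am=4.0)
print(n, sfx, sfx2, round(c, 2), round(m, 2), round(sd, 1), round(sdm, 2))
```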
Sometimes a frequency distribution is so strung out that
the experimenter prefers to condense it into step-intervals
of 2, 3, or more instead of 1, or to construct it in step-intervals
of 2, 3, or more from the beginning. Thus the
frequency distribution of Table 16 may be grouped as
shown in Table 17. No matter what the size of the step-interval,
the process for computing M and SD is the same
as that already described. That this is so is shown by
Table 17.

TABLE 17
SHOWING HOW TO COMPUTE M AND SD WHEN N IS LARGE AND WHEN FREQUENCY
DISTRIBUTION IS GROUPED IN STEP-INTERVALS OF TWO (DATA FROM TABLE 16)

      C               f       x       fx      fx²
−4.5 to −2.5          3      −3       −9       27
−2.5 to −0.5          5      −2      −10       20
−0.5 to  1.5          7      −1       −7        7
 1.5 to  3.5          5       0        0        0
 3.5 to  5.5         11       1       11       11
 5.5 to  7.5          2       2        4        8
 7.5 to  9.5          8       3       24       72
 9.5 to 11.5          3       4       12       48
AM = 2.5         N = 44              +51      193
c = 1.14                             −26

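
Condensing a distribution into wider step-intervals amounts to summing adjacent frequencies. A small Python sketch, with the hypothetical helper name `regroup`, reproduces the f column of Table 17 from that of Table 16:

```python
def regroup(freqs, width):
    """Sum each run of `width` adjacent frequencies into one wider step-interval."""
    return [sum(freqs[i:i + width]) for i in range(0, len(freqs), width)]

# f column of Table 16 (step-intervals of one, from -4.5 to 10.5).
table16_f = [1, 2, 2, 3, 3, 4, 0, 5, 6, 5, 2, 0, 5, 3, 3]
table17_f = regroup(table16_f, 2)
print(table17_f)  # [3, 5, 7, 5, 11, 2, 8, 3]
```

The last wide interval (9.5 to 11.5) contains only one of the original intervals, since Table 16 stops at 10.5.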
The process just described for computing M1, SD, and
SDM1 may be used for computing M2, SD, and SDM2. It
may be used, in fact, for computing any M, SD, or SDM.
Computation of Median and SDmedian.—Because of
its greater reliability, the M is usually preferable to the
median. The only advantage of the median is that it is less
influenced by extreme improvements. A few pupils mak-
ing relatively large or relatively small improvements will
affect the size of the M more than they will affect the
size of the median. If these extreme improvements were
twice as large or half as small respectively, the
median would remain unaltered, but not so the M.
There are as many arguments for their being allowed to
have their full effect as for a curtailment of their effect.
But there may be rare occasions on which the experi-
menter will prefer the median to the mean. For this
reason the steps in the process of computing a median
and an SDmedian for the frequency distribution of Table
16 follow.
(1) Compute N. N = 44. (2) Compute ½N. ½N
= 22. (3) Begin at the top of the frequency column and
add the successive f’s, calling the successive totals, until
½N or 22 has been reached, thus: 1 and 2 are 3, and
2 are 5, and 3 are 8, and 3 are 11, and 4 are 15, and 0 are 15,
and 5 are 20, and 2 of the 6 are 22. (4) Place this 2 as
a numerator over this 6, multiply the fraction 2/6 by 1, the
size of the step-interval, and add the product to the beginning
point of the step-interval corresponding to the frequency
of 6, namely 3.5. The result is the median.
$$\text{Median} = 3.5 + \left(\frac{2}{6} \times 1\right) = 3.83$$
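
The counting-down procedure can be sketched in Python. The function name `grouped_median` is an assumption, and the sketch covers only the ordinary case: it does not handle the special situations in which ½N is reached exactly and is followed by zero frequencies.

```python
def grouped_median(lowers, freqs, step):
    """Median of a frequency distribution by interpolation within a step-interval."""
    half = sum(freqs) / 2
    cum = 0
    for lo, f in zip(lowers, freqs):
        if f > 0 and cum + f >= half:
            # (half - cum) of this interval's f cases are needed to reach half of N.
            return lo + (half - cum) / f * step
        cum += f
    raise ValueError("empty frequency distribution")

# Table 16 (step-intervals of one) and Table 17 (step-intervals of two).
median_16 = grouped_median([-4.5 + k for k in range(15)],
                           [1, 2, 2, 3, 3, 4, 0, 5, 6, 5, 2, 0, 5, 3, 3], step=1)
median_17 = grouped_median([-4.5 + 2 * k for k in range(8)],
                           [3, 5, 7, 5, 11, 2, 8, 3], step=2)
print(round(median_16, 2), round(median_17, 2))
```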
The reliability of the median 3.83 is found by means of
the following formula:
$$\text{SDmedian} = \frac{1\tfrac{1}{4}\,SD}{\sqrt{N}}$$
The SD, in the preceding formula, may be the SD from the
mean, computed in the usual way, or it may be the SD
from the median. It will be found more convenient as a
rule to use SD from the mean. If computed from the
median, the exact deviations from the exact median must
be used, because SD from the median must be computed
by the formula:
$$SD = \sqrt{\frac{Sx^2}{N}} \quad \text{instead of} \quad SD = \sqrt{\frac{Sx^2}{N} - c^2}$$
The steps in the process of computing a median for Table
17 follow. (1) N = 44. (2) ½N = 22. (3) 22 = 3 and
5 are 8, and 7 are 15, and 5 are 20, and 2 of 11. (4)
$$\text{Median} = 3.5 + \left(\frac{2}{11} \times 2\right) = 3.86$$
The experimenter may have difficulty in computing a
median for a frequency distribution where the numerator
of the fraction is zero and the preceding f or f’s is zero.
Table 18 shows how to overcome this difficulty.

TABLE 18
SHOWING HOW TO COMPUTE A MEDIAN IN TWO SPECIAL SITUATIONS

      C           f                          C              f
2.5 to 3.5        1                    10.5 to 15.5         2
3.5 to 4.5        0                    15.5 to 20.5         1
4.5 to 5.5        2                    20.5 to 25.5         3
5.5 to 6.5        4                    25.5 to 30.5         0
6.5 to 7.5        0                    30.5 to 35.5         0
7.5 to 8.5        5                    35.5 to 40.5         4
8.5 to 9.5        2                    40.5 to 45.5         2

N = 14                                 N = 12
½N = 7                                 ½N = 6
7 = 1 + 0 + 2 + 4 + 0 and 0 of 5       6 = 2 + 1 + 3 + 0 + 0 and 0 of 4

$$\text{Median} = \frac{6.5 + 7.5}{2} + \left(\frac{0}{5} \times 1\right) = 7.0 \qquad \text{Median} = \frac{25.5 + 35.5}{2} + \left(\frac{0}{4} \times 5\right) = 30.5$$

The median is sometimes called the 50 percentile. It is
possible to compute other percentile points according to the
same process. The 50 percentile is found by counting down
the frequency column ½N. The 25 percentile or Q1 is
found by taking ¼N. The 75 percentile or Q3 is found
by taking ¾N. The 20 percentile is found by taking
⅕N.
A knowledge of Q1 and Q3 enables us to compute Q
(quartile deviation) by the formula:
$$Q = \frac{Q3 - Q1}{2}$$
Q, which is a variability measure like SD and which is
approximately .6745 SD, may be used in the place of SD
to compute SDmedian. In fact, this is the simplest way to
determine SDmedian. The formula is:
$$\text{SDmedian} = \frac{1\tfrac{1}{4}}{.6745} \cdot \frac{Q}{\sqrt{N}} \approx \frac{1.85\,Q}{\sqrt{N}}$$
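
The relations among Q, SD, and SDmedian can be sketched in Python. The function names are assumptions, the quartile points of 2.0 and 8.0 are hypothetical, and the conversion from Q relies on the text's statement that Q is approximately .6745 SD, which holds for a normal distribution.

```python
import math

def quartile_deviation(q1, q3):
    """Q, half the distance between the 25 and 75 percentile points."""
    return (q3 - q1) / 2

def sd_median_from_sd(sd, n):
    """SDmedian as one and one-fourth times SD over the square root of N."""
    return 1.25 * sd / math.sqrt(n)

def sd_median_from_q(q, n):
    """SDmedian from Q, assuming Q is about .6745 SD."""
    return sd_median_from_sd(q / 0.6745, n)

q_example = quartile_deviation(2.0, 8.0)      # hypothetical quartile points
sd_median_16 = sd_median_from_sd(3.9, 44)     # SD and N of Table 16
print(q_example, round(sd_median_16, 2))
```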
Computation of D and SDD.—In the “Summary”
(Tables 14 and 15) are retabulated certain measures pre-
viously computed, and certain additional computations are
made. First there appears the mean of the changes pro-
duced by EF1, i.e. M1 in Table 14 and 8.8 in Table 15.
Next comes the mean of the changes produced by EF2, i.e.
M2 in Table 14 and zero in Table 15.
The next step, namely, “D” or difference, is merely the
difference between M1 and M2, i.e. M1 − M2, in Table 14,
or between 8.8 and 0, i.e. 8.8 in Table 15. It is well to form
the habit of subtracting M2 from M1. Then a plus D will
mean that EF1 has been more effective than EF2. A minus
D will mean always just the reverse. This D is the most
significant measure shown in the two tables. It is the chief
goal of the experimental computations. It yields the con-
clusion from the experiment. Thus the D of 8.8 in Table
15 tells us that the C produced by EF1 is 8.8 points larger
than that produced by EF2. This is another way of saying
that the effect of a defined amount of vigorous physical
exercise is to increase the pulse rate 8.8 on the average.
The next computation, namely, SDD or the SD of the D,
utilizes the SDM1 and SDM2 as shown in the two tables.
This SDD shows the reliability of the preceding D just as
the SDM1 shows the reliability of M1. That is, the D of
8.8 has a reliability of 0.7.
In case medians have been used instead of M’s, D will be
the difference between median 1 and median 2, and SDD
will be computed according to the formula:
$$SDD = \sqrt{(\text{SDmedian 1})^2 + (\text{SDmedian 2})^2}$$
Though SDM and SDD will be used throughout this
book, many experiments report reliability in terms of PE.
Thus the reader of scientific literature frequently sees something
like this: Mean = 8 ± 0.7, or like this: Difference
= 4 ± 1.0. Such expressions signify that the PE of
the mean or PEM is 0.7, and that the PED is 1.0. By
multiplying any SD, SDM, SDmedian, or SDD by 0.6745,
it may be transmuted into a PE, PEM, PEmedian, or PED
respectively. SD and PE tell the same story. In a normal
frequency distribution ± SD includes the middle 68% of
the f’s whereas ± PE includes the middle 50% of the f’s.
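
The Summary computations and the conversion to PE can be sketched in Python, using the Table 15 figures. The function name `difference_and_sdd` is an assumption.

```python
import math

def difference_and_sdd(m1, sdm1, m2, sdm2):
    """D = M1 - M2 and SDD = square root of (SDM1 squared + SDM2 squared)."""
    return m1 - m2, math.hypot(sdm1, sdm2)

# Table 15: M1 = 8.8, SDM1 = 0.7; EF2 is the mere absence of EF1, so M2 and SDM2 are zero.
d, sdd = difference_and_sdd(8.8, 0.7, 0.0, 0.0)

# Any SD-type measure is turned into the corresponding PE by multiplying by 0.6745.
ped = 0.6745 * sdd
print(d, sdd, round(ped, 2))
```

With medians, the same function applies if the two SDmedians are passed in place of the SDM's.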
Measures of Variability.—Thus far three sorts of SD’s
have been computed, namely, SD, or SDC, or SD of the C’s;
SDM or SD of the mean of the C’s; and SDD or SD of
the difference. All three are measures of variability. The
SD or SDC is a measure of the variation or variability
among the C’s. Thus the C1’s in Table 15 vary from 5 to
12, i.e., there is a range of 7. This 7 could be taken as a
measure of variation; but the reader will easily understand
that a change in the C1 for one pupil might markedly affect
such a measure of variability. The SD is better because
its size is dependent not upon just two pupils but upon
the records for all pupils. Furthermore, the SD is demanded
by the formula for SDM. The SD increases in size
with an increase in the variability of the C’s, and it decreases
as the variation of the C’s decreases. In sum, it is
an exceedingly sensitive and stable measure of the variability
among the C’s. The SD of 2.0 in Table 15 means
approximately that 68 per cent of all the C1’s fall between
M1 − 2.0 and M1 + 2.0 or between 8.8 − 2.0 and 8.8 +
2.0, or between 6.8 and 10.8. The per cent between
M − SD and M + SD is exactly 68 when the C’s make an
exactly normal frequency distribution, i.e., when a graph
of the frequency distribution is approximately bell-shaped.
The SDM is also a measure of variability. It is a measure
of the variability among the M’s just as SD is a measure
of variability among the C’s. Assume the nine pupils used
in Table 15 to be a random sampling from the 10,000 ten-year-old
pupils in a certain school system. Imagine this
experiment repeated upon another random sampling of nine
pupils from the total 10,000, and then upon another
sampling, and then upon another sampling, and so on until
a great many samplings have been taken and a great many
M1’s have been computed. In making these samplings
certain pupils might be chosen more than once and certain
ones might never be chosen at all. Not all the M1’s so
computed would be identical. In fact, no two M1’s might
be identical. Certainly there would be variation among
them. The SD of all these M1’s could be computed just as
the SD of the C1’s was computed. When so computed, the
result would be SDM1, and, in theory at least, would be
the same as SDM1 computed by the formula illustrated in
Table 15, i.e., 0.7. Since it is more probable that all these
M1’s will center at the obtained M1 of 8.8 than at any
other point, the SDM1 of 0.7 tells us that most probably
68 per cent of these M1’s would be between 8.8 − 0.7 and
8.8 + 0.7, i.e., between 8.1 and 9.5. In sum, SDM1 is a
measure of variability just as SD is a measure of variability.
The difference is that SD is computed from actually
obtained C’s whereas SDM1 is always computed by formula.
The M1’s whose variability it measures could actually
be determined as suggested above but in practice their
existence is only imagined.
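
The imagined repetitions can be carried out by simulation. The Python sketch below is illustrative only: the population of 10,000 changes is hypothetical (drawn from a normal distribution with mean 8.8 and SD 2.0), and the function name, seed, and number of repetitions are assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the sketch is repeatable

# Hypothetical population: the changes (C's) of 10,000 pupils.
population = [random.gauss(8.8, 2.0) for _ in range(10_000)]

def sd_of_sample_means(pop, n, repetitions):
    """SD of the means obtained from many random samplings of n pupils each."""
    means = [statistics.fmean(random.sample(pop, n)) for _ in range(repetitions)]
    return statistics.pstdev(means)

empirical_sdm = sd_of_sample_means(population, n=9, repetitions=5000)
formula_sdm = statistics.pstdev(population) / 9 ** 0.5   # SD over the square root of N
print(round(empirical_sdm, 2), round(formula_sdm, 2))
```

The two figures should agree closely, which is the sense in which the formula stands in for the imagined repetitions.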
SDD is also a measure of the variability among many
differences determined from many repetitions of the experiment
upon different random samplings. As with SDM1,
SDD is computed always by formula. The SDD of 0.7 in
Table 15 tells us that most probably 68 per cent of all the
differences determined from such repetitions of this experiment
would fall between obtained difference 8.8 − 0.7 and
8.8 + 0.7, i.e., between 8.1 and 9.5. M1 and SDM1 will
not always coincide with D and SDD as they do in this
experiment.
Measures of Reliability and Randomness of Sampling.—SDM1
and SDD are measures of reliability as well
as of variability. They measure the reliability, respectively,
of M1 and D. The true M1 for the 10,000 pupils in question
can be determined only by securing the C1 for all
10,000 pupils. The M1 for any number of pupils less than
10,000 will not be the true mean exactly except by chance.
The M1 for the nine pupils in Table 15 may happen to
be the true M1. On the other hand the M1 from any
other random sampling of nine pupils has as much chance
of being the true M1. Any measure which will show the
amount of variation among all the M1’s from the various
possible random samplings of nine pupils each will be an
index of how much a particular obtained M1 may be in
error. The SDM1, as has been pointed out already, is just
such a measure of variation. Consequently it tells us how
probable it is that the obtained M1 diverges from the true
M1 by a given amount. When the various possible M1’s
vary little among themselves, there is little chance for any
one of them to diverge largely from the true M1. In such
a situation the SDM1 will be small in amount. When
the SDM1 is large in amount, it means that there is a large
variation in size among the possible M1’s, which, in turn,
means that the obtained M1 is not particularly reliable.
In like manner it can be shown that SDD, because it measures
the variation among the possible differences, is an index
of the reliability of the obtained D, and shows the probability
that it diverges from the true D for all 10,000 by a
given amount.
SDM1 and SDD, as computed by formula, will coincide
with SDM1 and SDD as computed from a great many randomly
determined M1’s and D’s only when an assumption
underlying these formulae perfectly obtains. That is,
SDM1 and SDD, as computed by formula, are valid only
to the extent that the nine pupils used are a genuine random
sampling of all the 10,000 pupils, or that the obtained C’s
are a genuine random sampling of all the C’s that would be
obtained if all 10,000 pupils were experimented upon. That
is, both reliability formulae assume randomness of sampling.
In actual practice no one would hope to secure a genuine
random sampling from 10,000 pupils by selecting only nine
pupils. Since this book, however, is concerned with methodology
rather than results, a ludicrously small amount of
data is used in most tables. The purpose of this is economy
of space and clearness of presentation rather than to
set an example for the reader.
Close attention to the nature of the sampling is neces-
sary, not only in order to discover the validity of the re-
liability measures computed but also to determine the
limitations of the conclusion drawn from the experiment.
Thus if the pupils used in the experiment are a random
sampling from the ten-year-olds in a particular elementary
school, the conclusion should be distinctly limited to the
ten-year-olds in this particular school. The experimenter
cannot be sure that the results of his experiment apply to
all ten-year-olds in the United States, or to all eleven-year-
olds in this same school.
Experimental Coefficient and Chances.—The “EC” or
experimental coefficient in Table 14 and Table 15 remains
to be explained. The formula for its computation is given
in the former table and illustrated in the latter. The experimental
coefficient has been devised to interpret SDD. The
formula for its computation is so constructed that an experimental
coefficient of 1.0 means that we can be practically
certain that the true D is somewhere above zero. An EC
of 0.5 means that we can be only half certain that the true
D is above zero. An EC of 2.0 means we can be doubly
certain that the true D is above zero, and similarly for
other sizes of EC. Since the EC in Table 15 is 4.6 we can
say that there is 4.6 times practical certainty that the true
D is above zero.
Since some statisticians wish to state probability in terms
of chances that the true D is above or below zero or above
or below any defined point, Table 19 permits the con-
version of experimental coefficients into statements of
chance. This table says, for example, that when the experi-
mental coefficient is 0.3 the chances are 3.9 to 1 that the
true D is above zero if the obtained D is above zero, or
below zero if the obtained D is negative.

TABLE 19
SHOWING HOW TO CONVERT AN EXPERIMENTAL COEFFICIENT INTO A
STATEMENT OF CHANCES

Experimental Coefficient      Approximate Chances
          .1                        1.6 to 1
          .2                        2.5 to 1
          .3                        3.9 to 1
          .4                        6.5 to 1
          .5                         11 to 1
          .6                         20 to 1
          .7                         38 to 1
          .8                         75 to 1
          .9                        160 to 1
         1.0                        369 to 1
         1.1                        930 to 1
         1.2                       2350 to 1
         1.3                       6700 to 1
         1.4                      20000 to 1
         1.5                      65000 to 1

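
The conversion from an experimental coefficient to chances can be approximated in Python. This sketch rests on an assumption not stated in the text at this point: that the chances are read from the normal curve, with an EC of 1.0 corresponding to a D of 2.78 SDD. Under that assumption the function, here given the hypothetical name `chances`, gives values close to those tabled, for example about 3.9 to 1 for an EC of .3.

```python
import math

def chances(ec):
    """Approximate chances (to 1) that the true D lies on the same side of zero
    as the obtained D, assuming a normal curve and EC = D / (2.78 SDD)."""
    z = 2.78 * ec
    p = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # normal probability below z
    return p / (1 - p)

for ec in (0.1, 0.3, 0.5, 1.0):
    print(ec, round(chances(ec), 1))
```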
The formula for EC is constructed to a D of zero as a
reference, because the experimenter’s primary concern is to
know whether the obtained superiority of one EF over
another, or the obtained D in favor of one EF, is sufficiently
reliable to justify him in concluding that the true D, if
known, would continue to favor that same EF. If the
obtained D is, say, 2.0 in favor of EF1, the experimenter
wonders whether the true D may not be zero or even, say,
—1.0. For the true D to be zero, would be to make the
two EF’s of equal effectiveness. For it to become — 1.0,
would be to reverse the conclusion indicated by the obtained
D. So whenever the EC is less than 1.0, the experimenter
should state that one of his EF’s is probably more effective
than the other. The less the EC becomes, the more wary
the experimenter should be. This does not mean that the
experimenter is justified in advising practical action on the
basis of his experiment only when the EC is 1.0 or above.
So long as the EC is above zero, the true D more probably
lies in the direction of the obtained D than in the opposite
direction. Life’s most important considerations, such as
marriage, investments, and hope of Heaven, rest upon an
EC of less than 1.0!
Though the EC formula is built to a D of zero, it may
be used to measure the probability that an obtained D will
be above a defined point, or will be below a given point.
Thus if we wish to know the probability that the true D in
Table 15 will be above, say, 7.8 we should compute thus:
8.8 − 7.8 = 1.0. $EC = \dfrac{1.0}{2.78 \times 0.7} = 0.5$. We can be
only half certain that the true D is above 7.8, whereas we
can be 4.6 times practical certainty that it is above zero.
Since there is just as much probability that the true D is
above as below 8.8, we may wish to determine the probability
that the true D is below, say, 10.8. Compute thus:
10.8 − 8.8 = 2.0. $EC = \dfrac{2.0}{2.78 \times 0.7} = 1.0$. We can be
practically certain that the true D is below 10.8. If desired
these EC’s may be expressed in terms of chances by the use
of Table 19.
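
The EC with a reference point other than zero can be sketched in Python. The function name and the `reference` parameter are assumptions, and the sketch uses the absolute distance from the reference point, leaving the direction (above or below) to be stated by the experimenter.

```python
def experimental_coefficient(d, sdd, reference=0.0):
    """EC for the distance between the obtained D and a reference point."""
    return abs(d - reference) / (2.78 * sdd)

ec_above_7_8 = experimental_coefficient(8.8, 0.7, reference=7.8)
ec_below_10_8 = experimental_coefficient(8.8, 0.7, reference=10.8)
ec_zero = experimental_coefficient(8.8, 0.7)
print(round(ec_above_7_8, 1), round(ec_below_10_8, 1), round(ec_zero, 2))
```

Carried at full precision the EC against zero comes to about 4.5; the 4.6 of Table 15 apparently results from rounding 2.78 × 0.7 to 1.9 before dividing, in keeping with the first-decimal rounding used throughout the tables.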
Though to do so would serve no especially useful purpose
in connection with experimental computations, the EC
formula may be used to help interpret the reliability of an
M. In this case, the SDD in the denominator of the formula
should give place to SDM. Thus if we desired to
know the probability that the true M1 in Table 15
would be above, say, 5.8, we could proceed as follows:
8.8 − 5.8 = 3.0. $EC = \dfrac{3.0}{2.78 \times 0.7} = 1.6$. The probability
then is 1.6 times practical certainty that the true M1 is
above 5.8. It happens that in Table 15 the SDM1 is the
same as the SDD, i.e., 0.7. In similar manner we could
determine the probability that the true M1 is below a defined
amount.
How to Increase the Experimental Coefficient.—If
the EC is not as large as desired, how can it be increased?
An inspection of the EC formula reveals the answer. The
EC can be increased by increasing the numerator of the
formula, i.e., by increasing D. But D is not subject to con-
trol by the experimenter. It is, in fact, illegitimate for him
to try consciously to increase D. Then the denominator
must be reduced. The 2.78 in the denominator is constant,
so it cannot be reduced. The reduction must be in the
SDD. To see how it can be reduced we need to inspect the
formula for computing SDD. This formula shows that the
only way to reduce the SDD is to reduce one or both the
SDM’s upon which the size of the SDD depends. To find
out how, say, SDM1 can be reduced it is necessary to in-
spect the formula for computing SDM1. This reveals that
the SDM1 can be reduced by reducing the SD in the
numerator or by increasing the N in the denominator.
Since errors of measurement tend to increase the variability
among the C1’s, a refinement of the testing instruments
would make a slight but almost negligible reduction in SD.
For practical purposes the SD cannot be materially re-
duced. Then the N must be increased. The N is subject
to the control of the experimenter. Therefore our search
has led us to the conclusion that the only practicable plan
for increasing the size of the EC is to increase N.
The experimenter can compute in advance about how
many pupils he must experiment upon to secure a desired
EC. The EC of 4.6 in Table 15 is high enough, but suppose
that an EC of 6.0 were desired. The size of the SDD
required to yield an EC of 6.0 may be determined by solv-
ing the following EC formula for SDD, because, presuma-
bly, the D of 8.8 would be altered little or not at all by
increases in N.

$$\frac{8.8}{2.78 \times \mathrm{SDD}} = 6.0, \qquad \mathrm{SDD} = 0.5$$

Now the size of the SDM1 required to yield an SDD of
0.5 may be determined by solving the following SDD for-
mula for SDM1. The SDM2 cannot be reduced so it is
disregarded. When it is reducible, it may be asked to share
its proportionate part in reducing the SDD.

$$\sqrt{(\mathrm{SDM1})^2 + (0)^2} = 0.5, \qquad \mathrm{SDM1} = 0.5$$

Since the SD in the SDM1 formula changes little or not at
all with changes in N, the N required to yield the needed
SDM1 of 0.5 may be determined by the solving of the fol-
lowing SDM1 formula for N.

$$\frac{2.0}{\sqrt{N}} = 0.5, \qquad N = 16$$

The answer to our query is, then, that 16 pupils must be
used if a desired EC of 6.0 is to be secured. If the neces-
sary reduction in SDD is distributed between the two
SDM’s, N must be determined for both SDM1 and SDM2.
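The three steps of this advance computation chain together readily. The following is a minimal sketch, assuming the SD of 2.0 implied by the worked example and a hypothetical function name `pupils_needed`. The optional `sdd_decimals` argument imitates the text's rounding of the SDD to 0.5; without it the unrounded SDD of about 0.53 gives a somewhat smaller N.

```python
import math


def pupils_needed(d, desired_ec, sd, other_sdm=0.0, sdd_decimals=None):
    """Estimate the N required for a desired EC, following the text's steps."""
    # Step 1: solve EC = D / (2.78 * SDD) for SDD.
    sdd = d / (2.78 * desired_ec)
    if sdd_decimals is not None:
        sdd = round(sdd, sdd_decimals)  # imitate the text's rounding
    # Step 2: solve SDD = sqrt(SDM1^2 + SDM2^2) for SDM1.
    sdm1_squared = sdd ** 2 - other_sdm ** 2
    if sdm1_squared <= 0:
        raise ValueError("the other SDM alone already exceeds the required SDD")
    # Step 3: solve SDM1 = SD / sqrt(N) for N, rounding up to whole pupils.
    return math.ceil(sd ** 2 / sdm1_squared)


# D = 8.8 and desired EC = 6.0 are from the text; SD = 2.0 is an assumption.
print(pupils_needed(8.8, 6.0, 2.0, sdd_decimals=1))  # rounded SDD, as in the text
print(pupils_needed(8.8, 6.0, 2.0))                  # unrounded SDD
```

Passing a nonzero `other_sdm` covers the case where the second SDM cannot be reduced but is not zero.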
Another Illustration of Computation Model I.—Table
20 illustrates the application of computation model I to
sample data where EF2 is not the mere absence of EF1.
Imagine the data to have been collected in an experiment
to determine whether the pulse rate increased more from
reading a familiar favorite thrilling short story (EF1) or
from hearing the story told orally by the teacher (EF2).
The story used must be an extremely familiar one, other-
wise the repetition would differ markedly in interest from
the first presentation, thereby invalidating the experiment
unless the equivalent-groups method were used.

TABLE 20
ILLUSTRATING HOW TO USE COMPUTATION MODEL I WHEN EF2 IS NOT THE MERE ABSENCE OF EF1
One Group — Two EF’s — One Test Type
[The body of Table 20 is not legible in the source.]
The reader’s attention is directed to the following special
features of Table 20. The C1 of — 1.0 deviates from the
AM of 1.0 by 2 points. The AM is the same as M1,
thereby making c of zero size. As shown by the computa-
tion of SD, when the M and AM are identical no correc-
tion for the SD is necessary. The M2 is less than the AM,
but this in no way alters the usual subsequent procedure.
The D is −1.8 because in this experiment EF2 proved to
be more effective than EF1. The EC is only 0.7 which
means that we can be only 0.7 practically certain that the
true D, if known, is below zero, i.e., favors EF2.
There are several possible one-group computation models.
We could have one computation model for two EF’s and
two test types. Substitute Group A for “Group B” in com-
putation model IV, Table 24, and the reader will have such
a model. Again, we could have a computation model for
three EF’s and one test type. Substitute Group A for
“Group B” and also for “Group C” in computation model
III, Table 23, and the reader will have such a model.
Again, we could have a computation model for three EF’s
and three test types. Substitute Group A for “Group B”
and also for ‘““Group C” in computation model V, Table 25,
and the reader will have such a model. In sum, every com-
putation model listed in the next chapter could have been
listed as one-group computation models. Economy of space
is the only reason for not doing so. Imagine Group A to
run through all these models instead of different groups and
they will all be converted automatically into one-group
computation models. In like manner the detailed discus-
sion and illustration of computation model I in this chapter
is applicable to all the computation models in the next
chapter.
CHAPTER VII
COMPUTATIONS FOR THE EQUIVALENT-
GROUPS EXPERIMENTAL METHOD
Computation Model II.—Computation model II given
in Table 21 shows the necessary computations for an ex-
periment with two equivalent groups, two EF’s and one type
of test. Note that “P” appears twice because EF2 is not
applied to the same pupils who experience EF1. Note also
that the detailed formulae for SD and SDM are omitted,
since the reader is already familiar with them.
TABLE 21
COMPUTATION MODEL II
Two Equivalent Groups — Two EF’s — One Test Type

        Group A — EF1              |         Group B — EF2
P   IT1   FT1   C1    x    x²      | P   IT1   FT1   C2    x    x²
N               M1         Sx²     | N               M2         Sx²
                AM         SD      |                 AM         SD
                c          SDM1    |                 c          SDM2

SUMMARY
          EF1   EF2   D          SDD                      EC
Test 1    M1    M2    M1 − M2    √[(SDM1)² + (SDM2)²]     D ÷ 2.78 SDD
Illustration of Computation Model II.—In order to
illustrate computation model II with sample experimental
data assume this problem: Which is better for the quality
of the penmanship, a penmanship period preceding the
gymnasium period (EF1), or following the gymnasium
(EF2)? This problem may be solved either by the one-
group or equivalent-groups method. The equivalent-groups
method is used.
The IT for both groups should be made at the same
period of the day, and at a period different from
either of the experimental periods, though several other ways
of working out this experiment would be as feasible and as
satisfactory. Assume that the IT has been made on both
groups just before dismissal at the end of the day. The FT
for Group A should be made, then, just preceding the
gymnasium period, and the FT for Group B should be made
just after the gymnasium period. The necessary computa-
tions are made in Table 22.

TABLE 22
SHOWING HOW TO USE COMPUTATION MODEL II
Two Equivalent Groups — Two EF’s — One Test Type

        Group A — EF1              |         Group B — EF2
P   IT1   FT1   C1    x    x²      | P   IT1   FT1   C2    x    x²
a    7     8     1    0    0       | i    7     8     1    2    4
b    7     6    −1   −2    4       | j    8     7    −1    0    0
c    8    10     2    1    1       | k    9     7    −2   −1    1
d    8     9     1    0    0       | l   10     9    −1    0    0
e    9     9     0   −1    1       | N = 4    M2 = −0.8    Sx² = 5
f    9    12     3    2    4       |          AM = −1.0    SD = 1.1
g   10    11     1    0    0       |          c = 0.2      SDM2 = 0.6
h   10    12     2    1    1       |
N = 8    M1 = 1.1    Sx² = 11
         AM = 1.0    SD = 1.2
         c = 0.1     SDM1 = 0.4

SUMMARY
          EF1    EF2     D     SDD    EC
Test 1    1.1   −0.8    1.9    0.8    0.9
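The computations of model II can be retraced mechanically. The sketch below assumes that the SD is taken about the mean with N in the denominator (the result to which the AM-and-correction method leads) and that the changes are those read from Table 22; the function names are hypothetical. Carried without rounding, the M’s and D agree with the table to one decimal, while the SDD and EC come out somewhat different from the printed 0.8 and 0.9, which may reflect rounding at intermediate steps in the table.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sd(values):
    # SD about the mean, dividing by N (assumed to match the text's SD).
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def sdm(values):
    # SD of the mean: SD divided by the square root of N.
    return sd(values) / math.sqrt(len(values))


def model_ii(changes_a, changes_b):
    """Summary row of computation model II for two equivalent groups."""
    m1, m2 = mean(changes_a), mean(changes_b)
    d = m1 - m2
    sdd = math.sqrt(sdm(changes_a) ** 2 + sdm(changes_b) ** 2)
    return {"M1": m1, "M2": m2, "D": d, "SDD": sdd, "EC": d / (2.78 * sdd)}


# Changes (FT minus IT) as read from Table 22; treat them as assumptions.
c1 = [1, -1, 2, 1, 0, 3, 1, 2]   # Group A, EF1
c2 = [1, -1, -2, -1]             # Group B, EF2
summary = model_ii(c1, c2)
for name, value in summary.items():
    print(name, round(value, 2))
```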
In Table 22 the pupils are arranged in order of the size of
their IT1 scores in order that the reader will easily perceive
that Group A as a whole is really equivalent in initial ability
in handwriting with Group B as a whole. Table 22 also
shows that the number of pupils in one group need not
be identical with the number in the other group. Since
M2 and AM are negative, we have here an illustration
of the computation of x’s from a negative AM. This also
affords an opportunity to show how to compute D when one
of the M’s is a negative quantity. Had both M’s been
negative quantities, i.e., had M1, say, been −1.1, the D
would have been −0.3 in favor of EF2. Both EF1 and
EF2 would have produced a loss of handwriting quality, but
EF1 would have effected a larger loss. The minus is
prefixed to 0.3 to indicate that EF2 is the favored one. As
the experiment stands, however, the conclusion is that EF1
is better than EF2 for the quality of handwriting of pupils
by 1.9 points on the handwriting scale used. We can be 0.9
practically certain that this conclusion is true for the whole
group from which the experimental pupils are a random
sampling.
Practical Certainty and Pre-requisites of Reliability.—Several
times thus far the term practical certainty has
been used. This needs a fuller explanation. When 100
pupils are selected at random from 1000 pupils, we can be
entirely certain that the experimental results secured for the
100 are true for those 100. But no matter how large the
D, we can never be absolutely certain that results secured
from any sampling less than the entire 1000 are true for the
1000. Since absolute certainty is never obtainable, except
for the particular group used, statisticians have coined the
term practical certainty to designate a degree of certainty
which is generally acceptable. Practical certainty is defined
as plus and minus three times the SD of the measure in
question. Thus we can be practically certain that the
true M1 lies between obtained M1 minus 3 SDM1 and ob-
tained M1 plus 3 SDM1. If M1 is 1.1 and SDM1 is 0.4, we
can be practically certain that the true M1 lies between 1.1
minus 3(0.4) and 1.1 plus 3(0.4), i.e., between −0.1 and
2.3. Similarly, we can be practically certain that the true
D lies between obtained D minus 3 SDD and obtained D
plus 3 SDD, or using the data of Table 22, we can be
practically certain that the true D is somewhere between 1.9
minus 3(0.8) and 1.9 plus 3(0.8), i.e., between −0.5 and
4.3. Had such definition of limits been more significant than
the definition of a point above which the true D lies, i.e.,
zero, the denominator in the EC formula would have been
3 SDD instead of 2.78 SDD. The 3.0 is reduced to 2.78
because any chance or probability that the true D is above
D plus 3 SDD (when D is positive) or below D minus 3
SDD (when D is negative) merely strengthens the conclu-
sion yielded by the experiment. The difference between 3.0
and 2.78 exactly accounts for this probability.
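The limits of practical certainty are simple enough to compute by hand, but a small helper makes the definition explicit. This is a sketch only; the function name is hypothetical, and the rounding to two decimals is a display convenience rather than part of the definition.

```python
def practical_certainty_limits(obtained, sd_of_measure):
    """Return (lower, upper) limits: obtained value minus and plus 3 SD."""
    half_width = 3 * sd_of_measure
    # Round only to suppress floating-point noise in the printed limits.
    return (round(obtained - half_width, 2), round(obtained + half_width, 2))


# Figures quoted in the text for Table 22.
print(practical_certainty_limits(1.1, 0.4))  # limits for the true M1
print(practical_certainty_limits(1.9, 0.8))  # limits for the true D
```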
The one-group method is a more convenient method than
the equivalent-groups method of solving the experimental
problem whose sample data appear in Table 22. But even
though the equivalent-groups method be employed, there is
a more convenient method of determining D than that shown
in Table 22. Both experimental groups could have had
their IT1 at one of the EF periods, at, let us say, the period
preceding the gymnasium period (EF1). Then the FT1 for
Group A could be assumed to be identical with the IT1.
This would have made each of C1, M1, SD and SDM1 zero.
This would have saved labor and would, in theory, have
yielded the identical D obtained by giving the IT1 in a
period other than one of the EF periods.
But even though the IT1 be made in a non-EF period as
shown in Table 22, the same D could have been secured by
a single computation, namely, by computing the M of Group
A’s FT1, and the M of Group B’s FT1 and by subtracting
one M from the other. Experimenters frequently resort to
this plan to avoid the necessity of making an IT1. Such an
avoidance is not commendable because the experimenter has
no right to assume that his two groups are equivalent. He
needs the IT1 to prove their equivalence. If he avoids this
criticism by using one group only, where he has a right to
assume equivalence, or if he proves the equivalence of his
two groups by means of an IT1, but then proceeds to ignore
it and work with FT1 only instead of C, he is subject to
another criticism. His computations will yield the correct
D, but will not permit him to determine the EC or reliability
of the D. It will not suffice for him to compute the M, SD,
and SDM of the FT1 for each group, and to use these two
SDM’s to compute SDD just as the SDM’s of the C’s are
used to compute SDD. The SDM of the FT1’s tends as a
rule, though not always, to be unduly large and thus tends
to make the D appear less reliable than it really is. Some
distortion will always occur unless the IT1’s are all zero or
all identical in size. It is not legitimate to avoid this final
criticism by simply omitting altogether the computation of
the reliability of the D, for each experimenter is obligated to
report the reliability of his conclusion. In sum, C is required
to determine the correct reliability of D, and the obtaining
of C presupposes both an IT1 and FT1.
There is a way whereby the correct SDD may be secured
without the use of C. The steps in this process follow. (1)
Compute M of initial scores. (2) Compute M of final
scores. (3) Subtract initial M from final M to get M1.
(4) Compute SD and SDM of initial scores. (5) Compute
SD and SDM of final scores. (6) Compute SDM1 by
means of the following formula.

$$\mathrm{SDM1} = \sqrt{(\text{Initial SDM})^2 + (\text{Final SDM})^2 - 2\, r_{\text{initial with final}}\, (\text{SD initial})(\text{SD final})}$$

Thus the SDM1, computed in this way, is equal to the
square root of the following: the square of the SDM of the
IT scores, plus the square of the SDM of the FT scores,
minus twice the coefficient of correlation between the IT
scores and FT scores times the SD of the IT scores times
the SD of the FT scores. The procedure is similar for the
computation of M2 and SDM2.
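A numerical check shows how this substitute route agrees with the route through C. One caution is in order: with the raw SD’s in the last term, as the formula is printed, the expression does not reduce to the SDM of the C’s. The sketch below therefore takes the two SDM’s in that term, which is an assumption about the intended reading. The scores are those read for Group A of Table 22, and the function names are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sd(values):
    # SD about the mean with N in the denominator (assumed convention).
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(x, y):
    # Product-moment coefficient of correlation r.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (sd(x) * sd(y))


def sdm_of_change(initial, final):
    """SDM of the mean change, from the two SDM's and r, without any C's."""
    n = len(initial)
    sdm_i = sd(initial) / math.sqrt(n)
    sdm_f = sd(final) / math.sqrt(n)
    r = correlation(initial, final)
    # Assumption: the cross term uses the SDM's, not the raw SD's.
    return math.sqrt(sdm_i ** 2 + sdm_f ** 2 - 2 * r * sdm_i * sdm_f)


it = [7, 7, 8, 8, 9, 9, 10, 10]     # Group A initial scores, Table 22
ft = [8, 6, 10, 9, 9, 12, 11, 12]   # Group A final scores, Table 22
changes = [f - i for i, f in zip(it, ft)]
direct = sd(changes) / math.sqrt(len(changes))  # route through the C's
via_r = sdm_of_change(it, ft)                   # substitute route
print(round(direct, 4), round(via_r, 4))
```

Both routes give the same SDM1, which rounds to the 0.4 shown in Table 22.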
The use of this thoroughly exact but substitute procedure
for determining M1 and SDM1 is seldom advisable. Some
time may be saved by its use provided the IT and FT scores
have been tabulated previously into two frequency distribu-
tions, respectively. If the experimental data are available
only in such form, it is impossible to compute C’s. Gen-
erally, however, the computation of C not only facilitates the
computation of M1 and SDM1 or M2 and SDM2, but it
also makes possible a fuller utilization of experimental re-
sults in that it shows what sub-group made the larger C’s.
TABLE 23
COMPUTATION MODEL III
Three Equivalent Groups — Three EF’s — One Test Type

     Group A — EF1           |      Group B — EF2          |      Group C — EF3
P  IT1  FT1  C1   x   x²     | P  IT1  FT1  C2   x   x²    | P  IT1  FT1  C3   x   x²
N            M1       Sx²    | N            M2       Sx²   | N            M3       Sx²
             AM       SD     |              AM       SD    |              AM       SD
             c        SDM1   |              c        SDM2  |              c        SDM3

SUMMARY
          EF1   EF2   EF3   D          SDD                      EC
Test 1    M1    M2          M1 − M2    √[(SDM1)² + (SDM2)²]     D ÷ 2.78 SDD
Test 1    M1          M3    M1 − M3    √[(SDM1)² + (SDM3)²]     D ÷ 2.78 SDD
Test 1          M2    M3    M2 − M3    √[(SDM2)² + (SDM3)²]     D ÷ 2.78 SDD
Recently my attention was attracted to an experiment
where some of the pupils had one IT and one FT; whereas
others had two or more IT’s and two or more FT’s (as
though pupils a, d, and f say in Table 22, had three IT and
three FT records each). These records were recorded and
treated as though they belonged to different individuals.
The effect of this is to distort the SD, SDM, and SDD.
When more than one record exists for a pupil they should
be averaged so that each pupil will have just one IT and
one FT for each test.
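The averaging of repeated records can be sketched as follows. The record layout (pupil label, IT score, FT score), the sample scores, and the function name are all hypothetical; the point is only that each pupil contributes a single IT and a single FT before any SD is computed.

```python
from collections import defaultdict


def one_record_per_pupil(records):
    """Average repeated (pupil, IT, FT) records so each pupil appears once."""
    grouped = defaultdict(list)
    for pupil, it_score, ft_score in records:
        grouped[pupil].append((it_score, ft_score))
    averaged = {}
    for pupil, scores in grouped.items():
        its = [s[0] for s in scores]
        fts = [s[1] for s in scores]
        averaged[pupil] = (sum(its) / len(its), sum(fts) / len(fts))
    return averaged


# Hypothetical records: pupil a was tested three times, pupil b once.
records = [("a", 7, 8), ("a", 8, 9), ("a", 9, 10), ("b", 7, 6)]
print(one_record_per_pupil(records))
```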
Computation Model III.—Computation model III in
Table 23 shows the experimental computations necessary
when there are three equivalent groups, three EF’s and one
type of test. If the purpose of the experiment is to deter-
mine the relative effectiveness of three EF’s, EF1, EF2, and
EF3 will be distinctly different EF’s. If the purpose of the
experiment is to determine the absolute effectiveness of EF1,
and EF2, then, EF3 will be a control EF. It should be
understood that in all preceding and succeeding computation
models, one of the EF’s must be a control EF whenever
knowledge of the absolute effectiveness of one or more of
the EF’s is sought.
Table 23 is practically self-explanatory. The two
M1’s under EF1 in the Summary are the same M1, and
similarly for the two M2’s under EF2 and the M3’s under
EF3. The first D and SDD under EC are M1 − M2 and
√[(SDM1)² + (SDM2)²] respectively, and similarly for the
second and third formulae under EC. The first D, namely
M1 — M2, shows whether EF1 or EF2 is more effective and
the first EC shows its reliability. The second D, namely
M1 — M3, shows whether EF1 or EF3 is more effective
and the second EC shows its reliability, and similarly for
the third D and third EC.
By extending computation model III in Table 23 farther
to the right, to provide for a Group D — EF4 and a Group
E — EF5 and a Group F — EF6 and so on, the experimenter
will have a computation model for any number of
groups and EF’s when one test type is used. An extension
of the Summary according to the plan exemplified in Table
23 will take care of any number of EF’s.
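The extension to any number of groups amounts to forming every pair of EF’s and applying the same D, SDD, and EC formulae to each pair. A minimal sketch follows; the function name and the figures for EF3 are hypothetical, while the figures for EF1 and EF2 are the rounded M’s and SDM’s of Table 22.

```python
import math
from itertools import combinations


def pairwise_summary(group_stats):
    """Build Summary rows for every pair of EF's.

    group_stats maps an EF label to a tuple (M, SDM).
    Each row is (first EF, second EF, D, SDD, EC).
    """
    rows = []
    for (ef_a, (m_a, sdm_a)), (ef_b, (m_b, sdm_b)) in combinations(
        group_stats.items(), 2
    ):
        d = m_a - m_b
        sdd = math.sqrt(sdm_a ** 2 + sdm_b ** 2)
        rows.append((ef_a, ef_b, d, sdd, d / (2.78 * sdd)))
    return rows


# EF1 and EF2 figures from Table 22; the EF3 figures are hypothetical.
stats = {"EF1": (1.1, 0.4), "EF2": (-0.8, 0.6), "EF3": (0.5, 0.3)}
rows = pairwise_summary(stats)
for ef_a, ef_b, d, sdd, ec in rows:
    print(ef_a, ef_b, round(d, 1), round(sdd, 1), round(ec, 1))
```

Three EF’s give three rows, four give six, and so on, which is the growth the Summary of Table 23 exemplifies.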
Computation Model IV.—The computation models so
far given show how to take care of any number of EF’s
when one test type is used. Computation model IV in Table
24 shows how to handle two EF’s and two test types.
Table 24 shows that additional test types can be provided
for by expanding the original computation model downward,
just as additional EF’s were provided for by expanding the
original computation model to the right. Note that the
second test type is indicated by the numeral 2, and that
the two new M’s are labeled M3 and M4. The D of
M1 − M2 shows whether according to Test 1, EF1 or EF2
is the more effective. The D of M3 − M4 shows whether,
according to Test 2, EF1 or EF2 is the more effective. The
two EC’s show the reliability of these two D’s.
TABLE 24
COMPUTATION MODEL IV
Two Equivalent Groups — Two EF’s — Two Test Types

        Group A — EF1              |         Group B — EF2
P   IT1   FT1   C1    x    x²      | P   IT1   FT1   C2    x    x²
N               M1         Sx²     | N               M2         Sx²
                AM         SD      |                 AM         SD
                c          SDM1    |                 c          SDM2
P   IT2   FT2   C3    x    x²      | P   IT2   FT2   C4    x    x²
N               M3         Sx²     | N               M4         Sx²
                AM         SD      |                 AM         SD
                c          SDM3    |                 c          SDM4

SUMMARY
          EF1   EF2   D          SDD                      EC              x   x²      ED                x   x²
Test 1    M1    M2    M1 − M2    √[(SDM1)² + (SDM2)²]     D ÷ 2.78 SDD                D ÷ (M1 or M2)
Test 2    M3    M4    M3 − M4    √[(SDM3)² + (SDM4)²]     D ÷ 2.78 SDD                D ÷ (M3 or M4)
                                                          MEC             Sx²         MED               Sx²
                                                          AM              SD          AM                SD
                                                          c               SDMEC       c                 SDMED
                                                          ECMEC                       ECMED

Equating of Differences.—Table 24 exemplifies a new
feature in connection with EC. This new feature requires
explanation. Test 1 may favor EF1 by a D of a certain
amount, whereas Test 2 may favor EF2 by a D of a certain
amount, or perhaps both tests may favor EF1, or again,
both tests may favor EF2. At any rate, there is needed
some way whereby the two D’s may be combined into a
single number which will show whether, both tests consid-
ered, EF1 or EF2 is more effective and how much more
effective.
But the two D’s cannot be averaged just as they stand.
To do so might give far more weight to one test than to the
other. To make this clear, assume the following situation:

          EF1   EF2   D
Test 1    105   100   5
Test 2     10     5   5

Now, in all probability, these two D’s are far from equal,
even though they are numerically the same. The first 5 is,
in all probability, a much smaller D than is the second 5.
Before they can be combined they need to be equated. The
two EC’s are not only indices of the reliability of the two
D’s, but they are also at the same time excellent equaters of
the two D’s. The EC’s may be averaged. This has been
done and “MEC” or mean EC is the result. Before this
averaging is done, the sign of each D should be prefixed
to its EC.
The MEC is really a mean difference. The reliability of
each of the two D’s is known. The next need is for some
way to determine the reliability of the MEC. Such a way
is shown in Table 24. SD of the two EC’s and SDMEC or
SD of the MEC may be computed just as SDC and SDM1
are computed.
In this situation where there are two EC’s the formulae
become:

$$\mathrm{SD} = \sqrt{\frac{\mathrm{S}x^2}{2} - c^2}, \qquad \mathrm{SDMEC} = \frac{\mathrm{SD}}{\sqrt{2}}$$

The SDMEC is an index of the reliability or trustworthiness
of MEC as a true MEC for all the tests from which Test 1
and Test 2 are a random sampling, and, to make the state-
ment complete, for all the pupils from which the experi-
mental pupils are a random sampling.
Just as SDD needed EC for its interpretation, so SDMEC
needs an ECMEC for its interpretation. Since, as was
pointed out above, MEC is really a D still, and since
SDMEC is really an SDD still, the regular EC formula with
its customary interpretation may be used. In this situation
the formula becomes

$$\mathrm{ECMEC} = \frac{\mathrm{MEC}}{2.78\ \mathrm{SDMEC}}$$
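The chain from the signed EC’s to the ECMEC can be sketched as below. The two EC’s used are hypothetical (one test favoring EF1, one favoring EF2), the function name is an invention for illustration, and the SD is taken about the mean with N in the denominator, which is assumed to match the text’s SD after correction.

```python
import math


def mean_ec_summary(signed_ecs):
    """Return (MEC, SD, SDMEC, ECMEC) for EC's carrying the signs of their D's."""
    n = len(signed_ecs)
    mec = sum(signed_ecs) / n
    sd = math.sqrt(sum((ec - mec) ** 2 for ec in signed_ecs) / n)
    sdmec = sd / math.sqrt(n)
    if sdmec == 0:
        raise ValueError("identical EC's give a zero SDMEC; ECMEC is undefined")
    ecmec = mec / (2.78 * sdmec)
    return mec, sd, sdmec, ecmec


# Hypothetical EC's: Test 1 favors EF1 (0.9), Test 2 favors EF2 (-0.3).
mec, sd, sdmec, ecmec = mean_ec_summary([0.9, -0.3])
print(round(mec, 2), round(sd, 2), round(sdmec, 2), round(ecmec, 2))
```

The same function serves for the ED’s, yielding MED, SDMED, and ECMED.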
The only difficulty with the use of EC and MEC as a
method of equating and combining D’s, is the impossibility
of making any clear, simple statement as to what an MEC
of a given amount means. Therefore the “ED” or equated
difference, has been devised to provide a more easily inter-
pretable method of equating and combining D’s from two
or more test types. While preferable to the MEC from a
popular standpoint it is probably less preferable from a
technical statistical point of view.
The ED for the first D is M1 − M2 divided by M1 if it
is smaller than M2 or by M2 if it is smaller than M1. The
ED for the second D is M3 − M4 divided by M3 if it is
smaller than M4 or by M4 if it is smaller than M3. When
so computed, the ED tells the per cent of the time the
experiment has run that it would take the backward group to
catch up with the favored group if the favored group were
to stop growing until the other catches up. The ED’s for
each of the two D’s of 5, previously given, become, according
to the above process, .05 and 1.0 respectively. These ED’s
interpreted mean respectively that the EF2 group would
catch the EF1 group in Test 1 in .05 of the time the ex-
periment has run, and that the EF2 group would catch the
EF1 group in Test 2 in a time exactly equal to the time the
experiment has run.
After explaining the computation of MEC and ECMEC,
it will not be necessary to rehearse the process for computing
MED and ECMED. In computing MED, the sign of the
D should be prefixed to its ED. One other caution is needed.
It sometimes happens that the smaller of the two M’s is so
close to zero that, when it is divided into the D, the resulting
ED becomes an exaggerated and unnatural amount. Thus,
if the smaller of the two M’s were exactly zero and if the
D were not also zero, the ED would become infinity! The
reader does not need to be told what this will do to the MED.
If this, or anything approaching it, were to happen, the
MED could not be used. The use of MEC would be com-
pulsory. Because of this tendency on the part of ED, the
experimenter is advised always to prefer the midscore of
the ED’s to the MED, wherever it is possible to compute
the midscore, i.e., wherever more than two test types have
been used. The midscore of the ED’s may be treated as
though it were the MED.
The computation of the midscore is exceedingly simple.
First arrange the ED’s in order of their size, paying
due regard to signs. That ED which is middlemost in
size is the midscore. If there is an even number of ED’s
and, as a consequence, no middle ED, the mean of the
two middlemost ED’s may be taken for the midscore and
MED.
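The midscore rule translates directly into code. The ED values below are hypothetical and the function name is an invention for illustration.

```python
def midscore(eds):
    """Middlemost ED by size; the mean of the two middle ED's if N is even."""
    if not eds:
        raise ValueError("at least one ED is required")
    ordered = sorted(eds)  # sorting on signed values pays due regard to signs
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(midscore([0.05, 1.0, -0.2]))        # odd number of ED's
print(midscore([0.05, 1.0, -0.2, 0.4]))   # even number of ED's
```

An extreme ED, such as one inflated by a near-zero M, moves the mean but leaves the midscore where it was.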
There is no obligation upon the experimenter to give equal
weight to each test always. Because of a given test’s greater
reliability, because it is more symptomatic of the entire
objects of instruction, or for some other reason, the ex-
perimenter may desire to weight it more heavily than any
other test used. Once the D’s have been equated, weighting
becomes a simple matter of multiplying the EC or ED by
the weight desired, before averaging. Thus, if there are
three tests to be averaged, and if it is desired to weight the
tests, in order, 3, 1, and 2, the experimenter should multiply
the first EC or ED by 3, the second by 1, and the third by 2.
Then he should add the products and divide by 3 plus 1
plus 2, i.e., 6.
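The weighting just described is an ordinary weighted mean. In the sketch below the weights 3, 1, and 2 are those of the text, while the three EC’s and the function name are hypothetical.

```python
def weighted_mean(values, weights):
    """Multiply each EC or ED by its weight, add, and divide by the weight sum."""
    if len(values) != len(weights):
        raise ValueError("one weight is needed for each value")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical EC's for three tests, weighted 3, 1, and 2 as in the text.
print(round(weighted_mean([1.0, 2.0, 3.0], [3, 1, 2]), 2))
```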
Illustration of Computation Model IV.—The fore-
going discussion of computation model IV will be clarified
by the use of sample data. Such data appear in Table 25,
where we shall assume the experimental problem to be this:
Which is more effective in developing reading (Test 1) and
the fundamentals of arithmetic (Test 2), three class periods
per week of fifty minutes each (EF1) or five class periods
per week of thirty minutes each (EF2). Here we have a
problem with two EF’s and two test types, requiring the
[Table 25, the sample data for computation model IV, is not legible in the source.]

TABLE 29
COMPUTATION MODEL VII
Two Equivalent Groups — Two EF’s — Two Test Types — One Intermediate Test

           Group A — EF1                   |            Group B — EF2
P   IT1   INT1   FT1   C1     C2     C3    | P   IT1   INT1   FT1   C4      C5      C6
N                      M1     M2     M3    | N                      M4      M5      M6
                       SDM1   SDM2   SDM3  |                        SDM4    SDM5    SDM6
P   IT2   INT2   FT2   C7     C8     C9    | P   IT2   INT2   FT2   C10     C11     C12
N                      M7     M8     M9    | N                      M10     M11     M12
                       SDM7   SDM8   SDM9  |                        SDM10   SDM11   SDM12

SUMMARY
Initial to Intermediate
          EF1   EF2   D           SDD                       EC              ED
Test 1    M1    M4    M1 − M4     √[(SDM1)² + (SDM4)²]      D ÷ 2.78 SDD    (M1 − M4) ÷ (M1 or M4)
Test 2    M7    M10   M7 − M10    √[(SDM7)² + (SDM10)²]     D ÷ 2.78 SDD    (M7 − M10) ÷ (M7 or M10)
                                                            MEC             MED
                                                            ECMEC           ECMED
Intermediate to Final
          EF1   EF2   D           SDD                       EC              ED
Test 1    M2    M5    M2 − M5     √[(SDM2)² + (SDM5)²]      D ÷ 2.78 SDD    (M2 − M5) ÷ (M2 or M5)
Test 2    M8    M11   M8 − M11    √[(SDM8)² + (SDM11)²]     D ÷ 2.78 SDD    (M8 − M11) ÷ (M8 or M11)
                                                            MEC             MED
                                                            ECMEC           ECMED
Initial to Final
          EF1   EF2   D           SDD                       EC              ED
Test 1    M3    M6    M3 − M6     √[(SDM3)² + (SDM6)²]      D ÷ 2.78 SDD    (M3 − M6) ÷ (M3 or M6)
Test 2    M9    M12   M9 − M12    √[(SDM9)² + (SDM12)²]     D ÷ 2.78 SDD    (M9 − M12) ÷ (M9 or M12)
                                                            MEC             MED
                                                            ECMEC           ECMED
treated together, the M of all the C3’s and C7’s, the M of
all C2’s and C6’s, and the M of all the C4’s and C8’s. This
will entail for each M so computed an appropriate series of
x’s, x²’s, Sx²’s, SD’s, and SDM’s and a “Grade III and
Grade IV” section in the “Summary.”
A good illustration of the value of being alert for the sub-
groups is afforded by an experiment conducted by Eliza F.
Ogglesby of Detroit upon 350 experimental and 350 con-
trol first-grade pupils. The purpose of the experiment was
to discover whether a new reading book she had prepared
especially for slow pupils was superior to one previously in
use, and, if so, whether it was better for dull pupils than for
normal pupils or bright pupils. Miss Ogglesby has furnished
the author with the summary of her experiment. This is
shown in Table 28. There were 100, 150, and 100 pupils
in each of the bright, normal, and dull groups, respectively.
EF1 is the new book, EF2 is the usual book. The data show
that the new book is superior to the old by 0.65 points for
the bright group, 0.91 points for the normal group, and 2.44
points for the dull group. This suggests that it is an advan-
tage to make books adapted to these different levels of
capacity.
Computation Model VII.—Another common form of
experimentation is one where there is for each group an
initial test, one or more intermediate tests, and a final test.
In an experiment extending over a school year it is fre-
quently desirable to give an intermediate test at the end of
the first semester. This tends to strengthen the experiment
and fortify the conclusions.
Computation model VII in Table 29 shows how to treat
an experiment of two equivalent groups, two EF’s, two test
types, and an intermediate test for each test type. By a
horizontal and vertical extension of this table provision could
be made, respectively, for more EF’s or intermediate tests,
and more test types.
In Table 29, the usual form has been somewhat abbre-
viated to save space. C1 is the change from IT1 to INT1.
C2 is the change from INT1 to FT1. C3 is the change from
IT1 to FT1, and similarly throughout the table. The AM,
c, x, x², Sx², and SD involved in the computation of SDM1
are omitted. The same omission occurs in the case of
SDM2, SDM3, SDM4, and so on.
Computation Model VIII.—Computation model VIII,
shown in Table 30, is a sort of composite computation model
or a sort of summary of all the models which have preceded.
It illustrates an experiment where there are three EF’s, three
sub-groups, three test types, and one intermediate test. This
computation model embraces practically all the difficulties
in computation ever presented by a regular equivalent-groups
experiment. How to handle certain rare forms of the
equivalent-groups experiment is considered at the end of
the next chapter.

TABLE 30
COMPUTATION MODEL VIII
Three Equivalent Groups — Three Sub-groups — Three EF’s — Three Test Types — One Intermediate Test

Rural Pupils — EF1                        | Rural Pupils — EF2                        | Rural Pupils — EF3
P  IT1  INT1  FT1  C1     C2     C3       | P  IT1  INT1  FT1  C4     C5     C6       | P  IT1  INT1  FT1  C7     C8     C9
N                  M1     M2     M3       | N                  M4     M5     M6       | N                  M7     M8     M9
                   SDM1   SDM2   SDM3     |                    SDM4   SDM5   SDM6     |                    SDM7   SDM8   SDM9
P  IT2  INT2  FT2  C10    C11    C12      | P  IT2  INT2  FT2  C13    C14    C15      | P  IT2  INT2  FT2  C16    C17    C18
N                  M10    M11    M12      | N                  M13    M14    M15      | N                  M16    M17    M18
                   SDM10  SDM11  SDM12    |                    SDM13  SDM14  SDM15    |                    SDM16  SDM17  SDM18
P  IT3  INT3  FT3  C19    C20    C21      | P  IT3  INT3  FT3  C22    C23    C24      | P  IT3  INT3  FT3  C25    C26    C27
N                  M19    M20    M21      | N                  M22    M23    M24      | N                  M25    M26    M27
                   SDM19  SDM20  SDM21    |                    SDM22  SDM23  SDM24    |                    SDM25  SDM26  SDM27

Suburban Pupils — EF1                     | Suburban Pupils — EF2                     | Suburban Pupils — EF3
P  IT1  INT1  FT1  C28    C29    C30      | P  IT1  INT1  FT1  C31    C32    C33      | P  IT1  INT1  FT1  C34    C35    C36
N                  M28    M29    M30      | N                  M31    M32    M33      | N                  M34    M35    M36
                   SDM28  SDM29  SDM30    |                    SDM31  SDM32  SDM33    |                    SDM34  SDM35  SDM36
P  IT2  INT2  FT2  C37    C38    C39      | P  IT2  INT2  FT2  C40    C41    C42      | P  IT2  INT2  FT2  C43    C44    C45
N                  M37    M38    M39      | N                  M40    M41    M42      | N                  M43    M44    M45
                   SDM37  SDM38  SDM39    |                    SDM40  SDM41  SDM42    |                    SDM43  SDM44  SDM45
P  IT3  INT3  FT3  C46    C47    C48      | P  IT3  INT3  FT3  C49    C50    C51      | P  IT3  INT3  FT3  C52    C53    C54
N                  M46    M47    M48      | N                  M49    M50    M51      | N                  M52    M53    M54
                   SDM46  SDM47  SDM48    |                    SDM49  SDM50  SDM51    |                    SDM52  SDM53  SDM54
SUMMARY
Rural Pupils — Initial Test to Intermediate Test
          EF1    EF2    D            SDD    EC       ED
Test 1    M1     M4     M1 − M4      SDD    EC       ED
Test 2    M10    M13    M10 − M13    SDD    EC       ED
Test 3    M19    M22    M19 − M22    SDD    EC       ED
                                            MEC      MED
                                            ECMEC    ECMED
          EF1    EF3    D            SDD    EC       ED
Test 1    M1     M7     M1 − M7      SDD    EC       ED
Test 2    M10    M16    M10 − M16    SDD    EC       ED
Test 3    M19    M25    M19 − M25    SDD    EC       ED
                                            MEC      MED
                                            ECMEC    ECMED
          EF2    EF3    D            SDD    EC       ED
Test 1    M4     M7     M4 − M7      SDD    EC       ED
Test 2    M13    M16    M13 − M16    SDD    EC       ED
Test 3    M22    M25    M22 − M25    SDD    EC       ED
                                            MEC      MED
                                            ECMEC    ECMED
Rural Pupils — Intermediate Test to Final Test
          EF1    EF2    D            SDD    EC       ED
Test 1    M2     M5     M2 − M5      SDD    EC       ED
Test 2    M11    M14    M11 − M14    SDD    EC       ED
Test 3    M20    M23    M20 − M23    SDD    EC       ED
                                            MEC      MED
                                            ECMEC    ECMED
          EF1    EF3    D            SDD    EC       ED
Test 1    M2     M8     M2 − M8      SDD    EC       ED
Test 2    M11    M17    M11 − M17    SDD    EC       ED
Test 3    M20    M26    M20 − M26    SDD    EC       ED
                                            MEC      MED
                                            ECMEC    ECMED
          EF2    EF3    D            SDD    EC       ED
Test 1    M5     M8     M5 − M8      SDD    EC       ED
Test 2    M14    M17    M14 − M17    SDD    EC       ED
Test 3    M23    M26    M23 − M26    SDD    EC       ED
                                            MEC      MED
                                            ECMEC    ECMED
Rural Pupils — Initial Test to Final Test
EFr EF2 D SDD EC ED
Leste Sisal) 1.13 M6 M3 — M6 SDD EC ED
Test 2... Miz Mrs M12—Mr1s5 SDD EC ED
Test 3... M2x M24 M2r1—M24 SDD EC ED
MEC MED
ECMEC | ECMED
EFr EF3 D SDD EC ED
Test 1 M3 Mo M3 —Mo SDD EC ED
Test 2 Miz M18 M12—M18 SDD EC ED
Test 3 M21 M27 M2z1—M27 SDD EC ED
MEC MED
ECMEC | ECMED
EF2 EF3 D SDD EC ED
Test 1 M6 Mo M6o —Mo SDD EC ED
Test 2 Mis M18 M15—Mr18 SDD EC ED
Leste s M24 M27 M24—M27 SDD EC ED
MEC MED
184 How to Experiment in Education
Suburban Pupils — Initial Test to Intermediate Test
EFr EF2 D SDD EC ED
Test x1... M28 M31 M28—M3r1 SDD EC ED
Test 2... M37 M4o M37—Mg4o SDD EC ED
Test 3... M46 M49 M46— M49 SDD EC ED
MEC MED
ECMEC ECMED
EFr EF3 D SDD EC ED
Tester M28 M34 M28—M34 SDD EC ED
Test 2 M37. M43 M37—M43 SDD EC ED
Test 3 M46 Ms2 M46—M52 SDD EC ED
MEC MED
ECMEC ECMED
EF2 EF3 D SDD EC ED
Test 1.... M31 M34 M31—M34 SDD EC ED
Test 2... M40 M43 M4o0— M43 SDD EC ED
Test 3.... M49 M52 M4go—Ms52 SDD EC ED
MEC MED
ECMEC ECMED
Suburban Pupils — Intermediate Test to Final Test
EFr EF2 Dwi esp EC ED
Test 1... M2g M32 M2zg— M32 SDD EC ED
Test 2.... M38 Mgr M38— Mar SDD EC ED
Test 3.... M47 Mso M47—Mp50 SDD EC ED
MEC MED
ECMEC ECMED
EFr1 EF3 D SDD EC ED
Test 1.... M29 M35 M29—M35 SDD EC ED
Test 2.... M38 M44 M38—Ma44 SDD EC ED
Test 3.... M47 Ms3 M47—Ms53 SDD EC ED
MEC MED
ECMEC ECMED
EF2 EF3 D SDD EC ED
Test 1 M32 M35 M32—M35 SDD EC ED
Test 2 Mar M44 M4r—M44 SDD EC ED
Test 3 Mso Ms3 Mso—M53 SDD EC ED
MEC MED
Computations for the Equivalent-groups 185
Suburban Pupils — Initial Test to Final Test
EFr EF2 D SDD EC ED
Test 1....1 M30 M33 M30—M33 SDD EC ED
Test 2... M39 M42 M39—Mgqz2 SDD EC ED
Test 3... M48 Msr Ma4a8—Ms5r1 SDD EC ED
MEC MED
ECMEC ECMED
EFr EF3 D SDD EC ED
Test 1.... M30 M36 M30—M36 SDD EC ED
Test 2... M39 M45 M39—Ma4s5 SDD EC ED
Test 3... M48 M54 M48—M54 SDD EC ED
MEC MED
ECMEC ECMED
EF2 EF3 D SDD EC ED
Test r.... M33 M36 M33—M36 SDD EG ED
Test 2...) M42 Mas M42—Ma45 SDD EC ED
Test 3.... M51 M54 Ms1—M54 SDD BC ED
MEC MED
ECMEC | ECMED
Urban Pupils — Initial Test to Intermediate Test
EFr EF2 D SDD EC ED
Test z.... M55 Ms8 Ms5—Ms58 SDD EC ED
Test 2... M64 M67 M64—M67 SDD EC ED
Test 3... M73 M76 M73—M76 SDD EC ED
MEC MED
ECMEC ECMED
EFr EF3 D SDD EC ED
Test I Mss Mor Ms5—Mé6r1 SDD EC ED
Test 2 M64 Myo M64—My7o SDD EC ED
Test 3 M73 M79 M73—My7o9 SDD EC ED
MEC MED
ECMEC ECMED
EF2 EF3 D SDD EC ED
Test 1.... M58 M6r Ms8—M6r SDD EC ED
Test 2.... M67 Myo M67—My7o SDD EC ED
Test 3... M76 M7q M76—My79 SDD EC ED
MEC MED
186 How to Experiment in Education
Urban Pupils — Intermediate Test to Final Test
EFri- EF2 D SDD EC ED
Test 1.... Ms6 Msg Ms6—Ms50 SDD EC ED
Test 2... M65 M68 M65—M68 SDD EC ED
Test 3... M74 M77 M74—M77 SDD EC ED
MEC MED
ECMEC ECMED
EE. BES D SDD EC ED
Test 1 Ms6 M62 Ms6— M62 SDD EC ED
Test 2 M65 Myr M65—Myz71 SDD EC ED
Test 3 M74 M8 M74—M8o SDD EC ED
MEC MED
ECMEC ECMED
EF2 EF3 D SDD EC ED
Test 1.... M59 M62 Msq9—Mé6z2 SDD EC ED
Test 2.... M68 M71 M68—Mz71 SDD EC ED
Test 3.... M77, M80 M77—M8o SDD EC ED
MEC MED
ECMEC ECMED
Urban Pupils — Initial Test to Final Test
EFr EF2 D SDD EC ED
Test 1...4 M57 M60 Ms7— M60 SDD EC ED
Test 2... M66 M69 M66—Mé69 SDD EC ED
Test 3... M75 M78 M75—M78 SDD EC ED
MEC MED
ECMEC ECMED
EFr EF3 D SDD EC ED
Test I Ms7 M63 Ms7—M63 SDD EC ED
Test 2 M66 M72 M66—M72 SDD EC ED
Test 3 M75 M8r M75—M81 SDD EC ED
MEC MED
ECMEC ECMED
EF2 EF3 D SDD EC ED
Test 1 Mto M63 Mb6o— M63 SDD EC ED
Test 2 M69 M72 Mb69—M72 SDD EC ED
Test 3 M78 M&8&r M7&8—M81 SDD EC ED
MEC MED
ECMEC | ECMED
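The subscripts of the M’s in Table 30 appear to follow a regular pattern: they advance by 1 from one testing interval to the next, by 3 from one EF to the next, by 9 from one test type to the next, and by 27 from one sub-group to the next. The Python sketch below reproduces that numbering. The rule is inferred from the table rather than stated by the author, and the names `m_index` and `summary_rows` and the zero-based sub-group code are illustrative assumptions.

```python
SUB_GROUPS = ["Rural", "Suburban", "Urban"]
INTERVALS = ["Initial to Intermediate", "Intermediate to Final", "Initial to Final"]


def m_index(sub_group, test, ef, interval):
    """Subscript of M for a zero-based sub-group and one-based test, EF, and interval.

    Assumed rule (inferred from Table 30): +1 per interval, +3 per EF,
    +9 per test type, +27 per sub-group.
    """
    return 27 * sub_group + 9 * (test - 1) + 3 * (ef - 1) + interval


def summary_rows(sub_group, interval, ef_a, ef_b):
    """Return (test, subscript for ef_a, subscript for ef_b) for Tests 1 to 3."""
    return [
        (test, m_index(sub_group, test, ef_a, interval), m_index(sub_group, test, ef_b, interval))
        for test in (1, 2, 3)
    ]


# Suburban pupils, Initial Test to Intermediate Test, EF2 against EF3.
for test, a, b in summary_rows(SUB_GROUPS.index("Suburban"), 1, 2, 3):
    print(f"Test {test}: D = M{a} - M{b}")
```

The same two functions generate every block of the Summary by varying the sub-group, the interval, and the pair of EF’s.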
CHAPTER VIII
COMPUTATIONS FOR THE ROTATION
EXPERIMENTAL METHOD
Computation Model IX.—The nature and functions of
the rotation experimental method were discussed in Chapter
II. It remains to illustrate the statistical computations necessary
to yield the conclusion from a rotation experiment,
together with the reliability of the conclusion.
Computation model IX is for the simplest type of rota-
tion experiment, namely, two groups which may or may not
be equivalent, two EF’s, and one type of test.
TABLE 31
COMPUTATION MODEL IX: ROTATION METHOD
Two Groups, Two EF’s, One Test Type

Group A: EF1                        Group B: EF2
P     IT1    FT1    C1              P     IT1    FT1    C2
N                   M1              N                   M2
                    SDM1                                SDM2

Group A: EF2                        Group B: EF1
P     IT1    FT1    C3              P     IT1    FT1    C4
N                   M3              N                   M4
                    SDM3                                SDM4

SUMMARY
          EF1        SDS1                     EF2        SDS2
Test 1    M1 + M4    √[(SDM1)² + (SDM4)²]     M2 + M3    √[(SDM2)² + (SDM3)²]

          D                        SDD                     EC
          (M1 + M4) - (M2 + M3)    √[(SDS1)² + (SDS2)²]    D ÷ 2.78 SDD
The first point to note in computation model IX, in Table
31, is that Group A has EF1 applied to it first and EF2
applied second, whereas the EF’s are applied to Group B
in the reverse order. Since both EF1 and EF2 appear first
and second any advantage of order is rotated out.
According to the computation model, Group A experiences
in order IT1, EF1, FT1, IT1 again, EF2, and FT1 again.
This does not mean that the second IT1 and FT1 will yield
identical scores with those yielded by the first IT1 and FT1,
respectively. It does not even mean that the identical test-
ing instrument must be employed. It means merely that the
same general mental function is usually tested in both in-
stances. In rare cases, however, the similarity between the
mental functions tested is slight or non-existent.
Sample problems will make clear the various possible de-
grees of similarity between the first and second pair of tests.
Assume EF1 to be a high per cent of re-circulated air for a
classroom, and EF2 to be a continuous supply of wholly
fresh air. Assume that each EF operates one semester. The
first IT1 for Group A might be a test of general reading
ability. The first FT1 could be the identical testing instrument,
a duplicate test of reading ability, or some other test
of general reading ability. It must measure the same trait
as the IT1. The second IT1 for Group A could be the same
test as that already used, or a duplicate test, or another test
of general reading ability, or a test of a similar mental func-
tion, say a vocabulary test, or a totally different sort of
test, say, a test of fundamentals of arithmetic. The second
FT1 must test the same trait as its IT1. Furthermore, the
same tests used for Group A with EF1 and EF2 must be
used for Group B with EF2 and EF1, respectively. This
will prevent penalizing either EF since each EF will have
both varieties of tests.
Consider another sample problem. Assume EF1 to be
motion-picture presentation of a lesson, and EF2 to be
teacher presentation. The subject of the motion picture
might be the geography of Alaska. This would require the
first IT1 and FT1 to be constructed of Alaskan content.
But the teacher could not well use the identical topic and
identical tests a second time. The carry-over would be alto-
gether too large. She could choose, instead, say, the geog-
raphy of Hawaii. This topic would require that the second
IT1 and FT1 have a Hawaiian content. In Group B the
order of topics would have to be reversed so that EF2 would
secure any advantages or disadvantages of the Alaskan topic
and tests, and EF1 any advantages or disadvantages of the
Hawaiian topic and tests.
Both the first and second IT’s for both Group A and
Group B are often not applied in rotation experiments. In
case Alaska and Hawaii are known to be new to the pupils,
and if, in addition, the test questions are so highly specific
that they could not be answered from general information
about the geography of places other than Alaska and Hawaii,
the experimenter frequently assumes that the pupils’ knowl-
edge is zero and so records it without testing. Even when
such an assumption introduces a slight error, it is sometimes
an advantage to accept the error and omit applying the IT’s.
Sometimes it is an advantage to keep pupils ignorant of that
upon which they are to be tested until the EF1 has been
applied. The IT1 prevents such concealment unless a duplicate
test is available.
There is a special situation where the second IT1’s for
both Group A and Group B are not applied. If EF2 for
Group A follows EF1 immediately, and if EF1 for Group B
follows EF2 immediately, and if, in addition, the identical
or equivalent test used for the first FT1 is to be used for
the second IT1, then the scores made on the first FT1 may
be assumed to be identical with those which would result
from giving the test again as IT1.
As shown by the Summary, the total C produced by EF1
is M1 + M4. The C produced in Group A by EF1 is M1.
That produced in Group B by EF1 is M4. The sum of
these gives the C produced in both groups by EF1. In like
manner, the total C produced by EF2 in both groups is
M2 + M3. The D between EF1 and EF2 becomes, then,
(M1 + M4) - (M2 + M3).
To compute the SDD of this last quantity requires us to
know the reliability of its two components M1 + M4 and
M2 + M3. From a knowledge of the reliability of M1 and
M4 it is possible to compute the reliability of their sum, i.e.,
it is possible to compute SD of the sum, or SDS or SDS1.
As shown in the table, the formula for computing the re-
liability of the sum of the two M’s is just like the formula
for computing the reliability of the difference between two
M’s. All preceding computation models have made this
latter formula familiar to the reader. Once the SDS1 and
SDS2 have been computed SDD and EC are readily deter-
mined, as shown. The more detailed formula for EC may
be written thus:
$$\mathrm{EC} = \frac{(\mathrm{M1} + \mathrm{M4}) - (\mathrm{M2} + \mathrm{M3})}{2.78\,\sqrt{(\mathrm{SDS1})^2 + (\mathrm{SDS2})^2}}$$
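The Summary arithmetic of computation model IX can be sketched in a few lines of Python. The function name `rotation_summary` and the dictionary layout are illustrative assumptions, not the author's notation; the divisor 2.78 is the one given in the text, and the short reliability formula is used because C’s from different groups are treated as uncorrelated. The sample figures are hypothetical.

```python
import math


def rotation_summary(m, sdm, divisor=2.78):
    """Summary of a two-group, two-EF rotation experiment.

    m and sdm are dicts keyed 1..4 holding M1..M4 and SDM1..SDM4.
    EF1 was applied where M1 and M4 arose; EF2 where M2 and M3 arose.
    """
    ef1 = m[1] + m[4]
    ef2 = m[2] + m[3]
    sds1 = math.hypot(sdm[1], sdm[4])  # sqrt(SDM1^2 + SDM4^2)
    sds2 = math.hypot(sdm[2], sdm[3])  # sqrt(SDM2^2 + SDM3^2)
    d = ef1 - ef2
    sdd = math.hypot(sds1, sds2)       # sqrt(SDS1^2 + SDS2^2)
    return {"EF1": ef1, "EF2": ef2, "SDS1": sds1, "SDS2": sds2,
            "D": d, "SDD": sdd, "EC": d / (divisor * sdd)}


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
example_m = {1: 4.0, 2: 2.0, 3: 1.0, 4: 3.0}
example_sdm = {1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5}
result = rotation_summary(example_m, example_sdm)
print(result)
```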
Reliability Computations in Special Situations.—It
was stated in the preceding paragraph that the formula for
the reliability of a sum is identical with the formula for the
reliability of a difference. In the short form in which these
formulae are usually used and commonly published, they are
alike. The complete, long formulae, as given below, are
not identical.
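$$\mathrm{SDD} = \sqrt{(\mathrm{SDM1})^2 + (\mathrm{SDM2})^2 - 2\,r_{12}\,(\mathrm{SD1})(\mathrm{SD2})}$$
$$\mathrm{SDS} = \sqrt{(\mathrm{SDM1})^2 + (\mathrm{SDM2})^2 + 2\,r_{12}\,(\mathrm{SD1})(\mathrm{SD2})}$$
When the sum of three numbers is involved the formula becomes:
$$\mathrm{SDS} = \sqrt{(\mathrm{SDM1})^2 + (\mathrm{SDM2})^2 + (\mathrm{SDM3})^2 + 2\,r_{12}\,(\mathrm{SD1})(\mathrm{SD2}) + 2\,r_{13}\,(\mathrm{SD1})(\mathrm{SD3}) + 2\,r_{23}\,(\mathrm{SD2})(\mathrm{SD3})}$$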
In the preceding chapter, the reader was shown how M1
could be computed by getting the difference between the M
of the IT and the M of the FT, and how the SDM1 could
be computed by a formula which utilized the SDM of the
IT, SDM of the FT, the coefficient of correlation between
IT and FT, SD of IT, and SD of FT. The M1, so computed,
is really a D, and the SDM1 is really an SDD. Consequently
the above formula for SDD is identical in form
with the SDM1 formula just referred to. Just as it is possible
to determine M1 by subtracting M of the IT from M
of FT, so it is possible to compute MS by adding M of IT
and M of FT. If this were needed for some purpose and
actually done, the SDMS formula would be identical with
the SDS formula given above.
In the SDS1 formula given in Table 31 it is permissible
to omit the 2 r12 (SD1)(SD2) portion of the formula because
the coefficient of correlation between the C1’s and
C4’s may be assumed to be zero, since the pairing of each
C1 with some C4 would be by chance, and similarly for the
SDS2 formula. But in computing the SDM1 or SDMS mentioned
above, an assumption of zero correlation between IT
and FT is not permissible. It is far more probable that
some correlation will exist. To ignore the last portion of
the formula might lead to a grossly exaggerated SDM1 or
SDMS. How this exaggeration may occur is shown by the
following data. Obviously the M1 and SDM1 computed
through C1 are 5 and zero, respectively. Computed through
M of IT and M of FT, the M1 likewise comes out 5. Computed
through M of IT and M of FT, SDM1 comes out
zero, provided 2 r12 (SD1)(SD2) is utilized in its computation.
Pupil    IT1    FT1    C1
a        10     15     5
b        12     17     5
c        14     19     5
d        16     21     5
         13     18     M1 = 5
                       SDM1 = 0
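The exaggeration described above can be checked numerically with the four pupils just tabulated. The sketch below is only illustrative: the helper names are not the author's, SDM is taken as SD divided by the square root of N, and the correlation term is applied to the SDM’s of IT and FT. That last point is an assumption about how SD1 and SD2 in the long formula are to be read; under it the long form gives zero, as the text states, while the short form does not.

```python
import math

it = [10, 12, 14, 16]   # IT1 scores of pupils a, b, c, d
ft = [15, 17, 19, 21]   # FT1 scores of the same pupils


def mean(xs):
    return sum(xs) / len(xs)


def sd(xs):
    """Standard deviation about the mean, dividing by N."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def sdm(xs):
    """SD of the mean, taken here as SD / sqrt(N)."""
    return sd(xs) / math.sqrt(len(xs))


def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sd(xs) * sd(ys))


# Route 1: through each pupil's own change, C1 = FT1 - IT1.
changes = [f - i for i, f in zip(it, ft)]
m1_direct = mean(changes)
sdm1_direct = sdm(changes)

# Route 2: through M of IT and M of FT.
m1_from_means = mean(ft) - mean(it)
r = correlation(it, ft)
short_form = math.sqrt(sdm(it) ** 2 + sdm(ft) ** 2)
long_radicand = sdm(it) ** 2 + sdm(ft) ** 2 - 2 * r * sdm(it) * sdm(ft)
long_form = math.sqrt(max(long_radicand, 0.0))  # guard against tiny negative rounding error

print("M1 by either route:", m1_direct, m1_from_means)
print("SDM1 through C1:", sdm1_direct)
print("SDM1, short formula:", round(short_form, 2))
print("SDM1, long formula:", round(long_form, 2))
```

Because every pupil gains exactly 5 points, IT and FT are perfectly correlated, and dropping the correlation term replaces a true SDM1 of zero with a value of about 1.6.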
In computing any SDD or SDS, then, the short form of
the reliability formula may be employed provided the elements
that enter into the formula are uncorrelated, or are
relatively uncorrelated. The SDD in Table 31 may be com-
puted by means of the short formula because the C1’s and
C2’s come from different groups and hence their correlation
may be assumed to be zero. The SDD in the one-group
experiment shown in Table 20 has been computed with the
short formula, because the C1’s and C2’s do not appear to
be at all closely correlated. Usually, however, such correla-
tion is more in evidence, due to the fact that the brighter
pupils tend to have larger C’s under all EF’s. The one-
group method is peculiarly liable to manifest such correla-
tion, and hence with it the SDD should usually be computed
by the long formula.
The formula for the computation of SDM as illustrated
in all the computation models is appropriate only when N
exceeds 30. When N is less than 10 compute SDM thus:
[formula not legible in the source]
When N is between 10 and 20, compute SDM thus:
[formula not legible in the source]
When N is between 20 and 30, compute SDM thus:
[formula not legible in the source]
When N is above 30, compute SDM thus:
$$\mathrm{SDM} = \frac{\mathrm{SD}}{\sqrt{N}}$$
The last formula is used in all computation models and
illustrations of such models, irrespective of the number of
pupils, because most actual experiments will employ 30 or
more cases and because the sample data given merely typify
a much larger amount of data.
TABLE 32
ILLUSTRATING COMPUTATION MODEL IX: ROTATION METHOD
Two Groups, Two EF’s, One Test Type

Group A: EF1                              Group B: EF2
P    IT1   FT1   C1    x    x²            P    IT1   FT1   C2    x    x²
a    30    34     4    1    1             a    32    35     3    2    4
b    40    42     2   -1    1             b    38    38     0   -1    1
c    45    50     5    2    4             c    45    48     3    2    4
d    50    53     3    0    0             d    49    47    -2   -3    9
N = 4            M1 = 3.5                 N = 4            M2 = 1.0
AM = 3.0   c = 0.5   SD = 1.1             AM = 1.0   c = 0.0   SD = 2.1
SDM1 = 0.6                                SDM2 = 1.1

Group A: EF2                              Group B: EF1
P    IT1   FT1   C3    x    x²            P    IT1   FT1   C4    x    x²
a    34    36     2    1    1             a    35    40     5    3    9
b    42    40    -2   -3    9             b    38    40     2    0    0
c    50    52     2    1    1             c    48    49     1   -1    1
d    53    56     3    2    4             d    47    49     2    0    0
N = 4            M3 = 1.3                 N = 4            M4 = 2.5
AM = 1.0   c = 0.3   SD = 1.9             AM = 2.0   c = 0.5   SD = 1.5
SDM3 = 1.0                                SDM4 = 0.8

SUMMARY
          EF1    SDS1    EF2    SDS2    D      SDD    EC
Test 1    6.0    1.0     2.3    1.5     3.7    1.8    0.7
Illustration of Computation Model IX.—Since compu-
tation model IX is the basic rotation-experiment model out
of which all other rotation models will be constructed, it had
better be illustrated with sample data. Assume the problem
to be the relative mental effectiveness of recirculated air
(EF1) vs. fresh air (EF2). Assume the test used to deter-
mine this relative effectiveness to be a reading test. The
necessary computations are shown in Table 32.
Only the Summary in Table 32 needs explanation. The
EF1 is 3.5 plus 2.5, i.e., 6.0. SDS1 is the √[(0.6)² + (0.8)²],
i.e., 1.0. EF2 is 1.0 plus 1.3, i.e., 2.3. SDS2 is the
√[(1.1)² + (1.0)²], i.e., 1.5. D is 6.0 minus 2.3, i.e., 3.7.
SDD is the √[(1.0)² + (1.5)²], i.e., 1.8. EC is 3.7 divided
by 2.78 times 1.8, i.e., 0.7. The conclusion from this experiment
is shown by D, which tells us that recirculated air is
better than fresh air by 3.7 points for the reading development
of pupils used in this experiment and for all those from
whom these pupils are a random sampling. But we can be
only 0.7 practically certain that this conclusion is true for
the larger group.
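The Summary figures can be retraced step by step. The sketch below starts from the four M’s and SDM’s as read from Table 32 and, like the text, rounds each intermediate result to one decimal place before using it in the next step. The variable names are illustrative, and the rounding convention is an assumption about how the printed figures were obtained.

```python
import math

# M's and SDM's as read from Table 32 (keys are the subscripts).
M = {1: 3.5, 2: 1.0, 3: 1.3, 4: 2.5}
SDM = {1: 0.6, 2: 1.1, 3: 1.0, 4: 0.8}


def one_decimal(x):
    """Round to one decimal place, as the printed Summary does."""
    return round(x, 1)


ef1 = one_decimal(M[1] + M[4])                    # EF1 total
sds1 = one_decimal(math.hypot(SDM[1], SDM[4]))    # reliability of the EF1 total
ef2 = one_decimal(M[2] + M[3])                    # EF2 total
sds2 = one_decimal(math.hypot(SDM[2], SDM[3]))    # reliability of the EF2 total
d = one_decimal(ef1 - ef2)                        # difference between the EF's
sdd = one_decimal(math.hypot(sds1, sds2))         # reliability of the difference
ec = one_decimal(d / (2.78 * sdd))                # experimental coefficient

print(ef1, sds1, ef2, sds2, d, sdd, ec)
```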
The data of Table 32 are artificial and inadequate. This
experiment was actually conducted by Thorndike and Mc-
Call under the auspices of the Ventilation Commission of
New York. The EF’s, as here, were washed recirculated
air and fresh air. All other conditions of temperature,
humidity, and the like were kept constant. Group A was a
group of 44 typical sixth-grade public-school pupils. Group
B was another similar group of 44 pupils. The two teachers
divided the work and both taught both groups. At the mid-
dle of the year the EF’s were rotated, as shown in Table 32.
A large number of mental and educational tests were used,
as were the teachers’ marks. The conclusion from the actual
experiment also favored the recirculated air. The experi-
ment was repeated a year later by Thorndike and Ruger.
The second experiment verified the first. These experiments
are described in School and Society for May 6 and August
12, 1916.
TABLE 33
COMPUTATION MODEL X: ROTATION METHOD
Three Groups, Three EF’s, One Test Type

Group A: EF1               Group B: EF2               Group C: EF3
P    IT1   FT1   C1        P    IT1   FT1   C2        P    IT1   FT1   C3
N                M1        N                M2        N                M3
                 SDM1                       SDM2                       SDM3

Group A: EF2               Group B: EF3               Group C: EF1
P    IT1   FT1   C4        P    IT1   FT1   C5        P    IT1   FT1   C6
N                M4        N                M5        N                M6
                 SDM4                       SDM5                       SDM6

Group A: EF3               Group B: EF1               Group C: EF2
P    IT1   FT1   C7        P    IT1   FT1   C8        P    IT1   FT1   C9
N                M7        N                M8        N                M9
                 SDM7                       SDM8                       SDM9

SUMMARY
          EF1             SDS1    EF2             SDS2    D                                  SDD    EC
Test 1    M1 + M6 + M8    SDS1    M2 + M4 + M9    SDS2    (M1 + M6 + M8) - (M2 + M4 + M9)    SDD    EC

          EF1             SDS1    EF3             SDS3    D                                  SDD    EC
Test 1    M1 + M6 + M8    SDS1    M3 + M5 + M7    SDS3    (M1 + M6 + M8) - (M3 + M5 + M7)    SDD    EC

          EF2             SDS2    EF3             SDS3    D                                  SDD    EC
Test 1    M2 + M4 + M9    SDS2    M3 + M5 + M7    SDS3    (M2 + M4 + M9) - (M3 + M5 + M7)    SDD    EC
Computation Model X.—The purpose of presenting
computation model X, shown in Table 33, is to indicate the
computations needed with the rotation method when there
are three EF’s, and, consequently, three groups, and one type
of test. By an appropriate extension to the right and down-
ward, computation model X may be adapted for any num-
ber of EF’s.
The computation of the SDS’s in Table 33 requires ex-
planation. The formula for the computation of SDS1 is as
follows:
$$\mathrm{SDS1} = \sqrt{(\mathrm{SDM1})^2 + (\mathrm{SDM6})^2 + (\mathrm{SDM8})^2}$$
SDS2 and SDS3 were computed in similar manner.
In Chapter II, it was stated that the object of the rota-
tion experimental method may be to determine the relative
effectiveness of two or more EF’s. If this is the object of
the experiment, the three EF’s will be distinctly different
EF’s. If, however, the object is to determine the absolute
effectiveness of EF1 and EF2 as well as their relative effec-
tiveness, EF3 must be the mere absence of EF1 and EF2,
thereby showing the normal change produced during the
experiment by general conditions other than EF1 or EF2.
In this case, the first D in Table 33 shows the relative effec-
tiveness of EF1 and EF2. The second D shows the absolute
change produced by EF1. The third D shows the absolute
change produced by EF2.
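The three comparisons of computation model X can be sketched as follows. The grouping of M1, M6, and M8 under EF1 comes from the SDS1 formula above; the groupings for EF2 and EF3 are read from Table 33. The function names and the sample M’s and SDM’s are hypothetical, chosen so that EF3 plays the part of the mere absence of EF1 and EF2.

```python
import math

# Which M's (by subscript) arise under each EF in the three-group rotation.
EF_CELLS = {1: (1, 6, 8), 2: (2, 4, 9), 3: (3, 5, 7)}


def ef_total_and_sds(ef, m, sdm):
    """Total change under one EF and the SD of that total (short formula)."""
    cells = EF_CELLS[ef]
    total = sum(m[i] for i in cells)
    sds = math.sqrt(sum(sdm[i] ** 2 for i in cells))
    return total, sds


def compare(ef_a, ef_b, m, sdm, divisor=2.78):
    """Return D, SDD, and EC for ef_a against ef_b."""
    total_a, sds_a = ef_total_and_sds(ef_a, m, sdm)
    total_b, sds_b = ef_total_and_sds(ef_b, m, sdm)
    d = total_a - total_b
    sdd = math.hypot(sds_a, sds_b)
    return d, sdd, d / (divisor * sdd)


# Hypothetical M1..M9 and SDM1..SDM9.
m = dict(enumerate([6.0, 5.0, 2.0, 5.5, 2.5, 6.5, 1.5, 5.5, 4.5], start=1))
sdm = {i: 0.5 for i in range(1, 10)}

relative = compare(1, 2, m, sdm)      # first D: EF1 against EF2
absolute_ef1 = compare(1, 3, m, sdm)  # second D: EF1 against its absence
absolute_ef2 = compare(2, 3, m, sdm)  # third D: EF2 against its absence
print(relative, absolute_ef1, absolute_ef2)
```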
In none of the computation models has provision been
made for delayed tests as was done, say, for intermediate
tests. It frequently happens that an experimenter wishes
to determine whether the effect of some favorable EF will
persist. It is conceivable that EF1 may be superior to EF2
immediately after they have been applied, but that the
superiority will disappear, or actually turn into an inferiority
after a month, say, has elapsed. Repetition of the tests a
month after the FT’s were made will show what effect time
has had. No special computation model needs to be provided.
The regular IT’s will serve as the IT’s for the delayed
test, and the delayed test becomes the FT. From this
point the computations reproduce the process for the regular
IT and FT. The final D shows the difference between two
EF’s plus a defined interval.

TABLE 34
COMPUTATION MODEL XI: ROTATION METHOD
Two Groups, Two EF’s, Two Test Types

Group A: EF1                        Group B: EF2
P     IT1    FT1    C1              P     IT1    FT1    C2
N                   M1              N                   M2
                    SDM1                                SDM2
P     IT2    FT2    C3              P     IT2    FT2    C4
N                   M3              N                   M4
                    SDM3                                SDM4

Group A: EF2                        Group B: EF1
P     IT1    FT1    C5              P     IT1    FT1    C6
N                   M5              N                   M6
                    SDM5                                SDM6
P     IT2    FT2    C7              P     IT2    FT2    C8
N                   M7              N                   M8
                    SDM7                                SDM8

SUMMARY
          EF1        SDS1    EF2        SDS2    D                        SDD    EC       ED
Test 1    M1 + M6    SDS1    M2 + M5    SDS2    (M1 + M6) - (M2 + M5)    SDD    EC       ED
Test 2    M3 + M8    SDS1    M4 + M7    SDS2    (M3 + M8) - (M4 + M7)    SDD    EC       ED
                                                                                MEC      MED
                                                                                ECMEC    ECMED
Computation Model XI.—Computation model XI shows
how the computations may be made when two test types are
used. By extending this model downward, provision can
be made for any number of test types.
Computation models IX, X, and XI make it clear that
computations for rotation experiments are similar funda-
mentally to computations for one-group and equivalent-
groups methods. With this knowledge, the reader who has
mastered the eleven computation models presented will have
little difficulty in evolving for himself rotation computation
models for any number of EF’s, groups, sub-groups, test
types, and intermediate tests.
Scaling Experimental Tests.—A few pages back it was
pointed out that the first IT1’s are not always the same tests
as or similar tests to the second IT1’s. Yet all this some-
what incomparable data can be combined, and this combina-
tion can be combined, in turn, with an equal mixture of
rather incomparable data from the IT2’s, provided each test
is scaled in comparable units. It is impossible to construct
a geography test, say, on Alaska which will be just as diffi-
cult as one with a Hawaiian content. Furthermore, it is sel-
dom feasible to scale all the tests to be used in advance of
and independently of the experiment itself, so as to have
comparability of measuring units throughout.
While conducting some rotation experiments to determine
the relative effectiveness of some visual aids, Weber met just
this situation, and overcame it economically by using his own
experimental data as a basis for scaling the experimental
tests. Tests so scaled, while not absolutely required, do add
a substantial refinement to experimental computations.
The following gives the general plan¹ of one of Weber’s
experiments.
¹ Weber, J. J., Comparative Effectiveness of Some Visual Aids in Elementary
Education (to be published soon).
Unit I. India
EF     Procedure                                        Group
L-R    Lecture, 25 minutes; review quiz, 12 minutes     A
F-L    Film, 12 minutes; lecture, 25 minutes            B
L-F    Lecture, 25 minutes; film, 12 minutes            C

Unit II. China
EF     Procedure                                        Group
L-R    Lecture, 25 minutes; review quiz, 12 minutes     C
F-L    Film, 12 minutes; lecture, 25 minutes            A
L-F    Lecture, 25 minutes; film, 12 minutes            B

Unit III. Japan
EF     Procedure                                        Group
L-R    Lecture, 22 minutes; review quiz, 10 minutes     B
F-L    Film, 10 minutes; lecture, 22 minutes            C
L-F    Lecture, 22 minutes; film, 10 minutes            A
Note that the content of the first experimental unit has to
do with India, the second with China, and the third with
Japan. Note, further, that EF1 is a lecture followed by a
review quiz (L-R), EF2 is a film followed by a lecture on
the subject matter of the motion picture, and EF3 is a lec-
ture on the material of the motion picture followed by the
motion picture. The subject matter of EF1 was drawn from
this same motion picture on India. Note, further, that
groups A, B, and C, which are approximately equivalent
seventh-grade classes, are rotated in such a way that each
group experiences every EF. Note, finally, that the short-
ness of the film on Japan required that time allotments be
reduced for this unit.
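The rotation in Weber's plan can be verified mechanically: within every unit each EF goes to a different group, and across the three units each group meets every EF exactly once. A small Python check, with the plan transcribed from the outline above, might look like this (the names `PLAN` and `each_group_meets_every_ef` are illustrative):

```python
# Which group receives which EF in each unit, as given in the plan above.
PLAN = {
    "India": {"L-R": "A", "F-L": "B", "L-F": "C"},
    "China": {"L-R": "C", "F-L": "A", "L-F": "B"},
    "Japan": {"L-R": "B", "F-L": "C", "L-F": "A"},
}


def each_group_meets_every_ef(plan):
    """True when every unit uses each EF and each group once,
    and every group experiences every EF over the whole plan."""
    efs = {"L-R", "F-L", "L-F"}
    groups = {"A", "B", "C"}
    for assignment in plan.values():
        if set(assignment) != efs or set(assignment.values()) != groups:
            return False
    for group in groups:
        met = {ef for assignment in plan.values()
               for ef, assigned in assignment.items() if assigned == group}
        if met != efs:
            return False
    return True


print(each_group_meets_every_ef(PLAN))
```

Swapping any two groups within a single unit makes the check fail, because some group would then meet the same EF twice.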
Since Weber gave no IT’s, the reader should think of his
FT’s as identical with C. Since seventh-grade pupils started
this experiment with some knowledge of these lessons on
India, China, and Japan, as Weber himself proved later,
he was scarcely justified in treating his FT’s as equivalent
to C. The effect of doing so is probably to make the SD
and SDM too large. The error is not serious, and is certainly
less serious than notifying pupils what to expect in
the lectures and films by giving tests to the pupils before
they had had the EF’s applied. After each group had had
an EF applied, the pupils were given a 60-question test on
the content of the lesson presented. The scores made by
each group as a result of each EF are given in Table 35.

TABLE 35
DISTRIBUTION OF SCORES MADE BY 300 SELECTED 7A-GRADE PUPILS IN EACH OF THE THREE 60-QUESTION TESTS WHICH FOLLOWED
LESSONS ON INDIA, CHINA, AND JAPAN, RESPECTIVELY (ADAPTED FROM WEBER)

[The frequency distributions of T scores, one column for each group under each test, are not recoverable from the scan. The M, SD, and SDM of each distribution and the two Summaries follow.]

              India                        China                        Japan
         L-R      F-L      L-F        L-R      F-L      L-F        L-R      F-L      L-F
Group    A        B        C          C        A        B          B        C        A
M        48.32    52.10    47.38      45.18    51.59    51.84      44.45    51.82    50.42
SD        8.68    10.29     8.93       9.19     7.80    10.20      10.12     7.43     8.21
SDM       .868    1.029     .893       .919     .780    1.020      1.012     .743     .821

SUMMARY: SUM OF M’s
Film-Lecture   SDS      Lecture-Film   SDS      Lecture-Review   SDS      D        SDD      EC
155.51         1.489    149.64         1.585                              5.87     2.175     .97
                        149.64         1.585    137.95           1.619    11.69    2.265    1.86
155.51         1.489                            137.95           1.619    17.56    2.200    2.87

SUMMARY: MEAN OF M’s
Film-Lecture   SDS      Lecture-Film   SDS      Lecture-Review   SDS      D        SDD      EC
51.84          .496     49.88          .528                               1.96     .725      .97
                        49.88          .528     45.98            .540     3.90     .755     1.86
51.84          .496                             45.98            .540     5.86     .733     2.87
Heretofore, each pupil’s score has been tabulated sepa-
rately. Such tabulations become unwieldy when many pupils
are used. The conventional economical substitute for indi-
vidual tabulation is the frequency distribution, samples of
which appear in Table 35. Such frequency distributions,
though not absolutely necessary, do permit the employment
of various statistical short-cuts. An illustrative reading of
Table 35 will make clear the meaning of the frequency dis-
tributions. Table 35 is read thus. After a lesson on India,
presented by means of a lecture followed by a review quiz,
i.e., L-R, a test on India was given to Group A. One pupil
made a score of 29, one pupil made a score of 31, four pupils
made a score of 33 and so on. After the same lesson on
India, presented by means of F-L, the same test on India
was given to Group B. Two pupils made a score of 24, three
pupils made a score of 31, and so on. In like manner, all
six frequency distributions, shown in Table 35, may be read.
If he so desires, the experimenter can make a frequency
distribution of the C1’s, and of the C2’s, etc., in each of the
computation models, and can use this as a basis for com-
puting M, SD, and SDM by short-cut statistical processes.
But there is one thing the experimenter cannot do. He can-
not make a frequency distribution of IT’s, and another fre-
quency distribution of FT’s, and hope from these to obtain
directly a frequency distribution of C’s or even to obtain C’s
at all. C’s can be obtained only from individual tabulations.
After individual C’s have been so obtained a frequency dis-
tribution of them can be made.
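Both points of this paragraph can be illustrated briefly. The sketch below first computes M, SD, and SDM from a frequency distribution of C’s, and then shows why C’s cannot be recovered from separate frequency distributions of IT’s and FT’s: two classes with identical IT and FT distributions can have different C’s. All scores are hypothetical, the function name is illustrative, and SDM is taken as SD divided by the square root of N.

```python
import math
from collections import Counter


def stats_from_frequencies(freq):
    """Return N, M, SD, and SDM from a mapping of score -> number of pupils."""
    n = sum(freq.values())
    m = sum(score * count for score, count in freq.items()) / n
    sd = math.sqrt(sum(count * (score - m) ** 2 for score, count in freq.items()) / n)
    return n, m, sd, sd / math.sqrt(n)


# Two hypothetical classes with the same IT scores and the same set of FT scores,
# but with the FT scores attached to different pupils.
it_scores = {"a": 10, "b": 12}
ft_first = {"a": 15, "b": 17}
ft_second = {"a": 17, "b": 15}

changes_first = Counter(ft_first[p] - it_scores[p] for p in it_scores)
changes_second = Counter(ft_second[p] - it_scores[p] for p in it_scores)

# The FT frequency distributions are identical ...
print(Counter(ft_first.values()) == Counter(ft_second.values()))
# ... yet the distributions of C differ, and so do their SD's.
print(stats_from_frequencies(changes_first))
print(stats_from_frequencies(changes_second))
```

Only the individual tabulation, which keeps each pupil's IT beside his own FT, fixes the C’s.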
The Summary for Table 35 is given in two forms. The
first part is in terms of the sum of the three M’s for each EF.
It is the form with which the reader is already familiar. The
second part is in terms of the mean of the three M’s for
each EF, i.e., the sum of the three M’s divided by three.
The mean of the M’s has the advantage over the sum of
the M’s in that the mean of the M’s is comparable with any
of the original M’s from which it comes, and with any
original M for any EF. But if the sum of the three M’s
is divided by three, the experimenter must be careful to
divide each SDS by three also. If this is not done the final
EC will be just one-third the size to which it is entitled.
As Table 35 shows, the second part of the Summary is one-
third the first part except for the EC which is the same.
And this is as it should be, for the D from the sum of M’s
is neither more nor less reliable than the D from the mean
of the M’s.
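That the EC is unchanged by the passage from sums to means, provided each SDS is divided by three as well, can be confirmed with the Film-Lecture and Lecture-Review figures from the Summary of Table 35. The function name `ec` is illustrative; the figures are as read from the table.

```python
import math

# Sums of the three M's and their SDS's, as read from the Summary of Table 35.
sum_fl, sds_fl = 155.51, 1.489   # Film-Lecture
sum_lr, sds_lr = 137.95, 1.619   # Lecture-Review


def ec(d, sds_a, sds_b, divisor=2.78):
    """Experimental coefficient from a difference and the two SDS's."""
    return d / (divisor * math.hypot(sds_a, sds_b))


d_sum = sum_fl - sum_lr
ec_sum = ec(d_sum, sds_fl, sds_lr)                # first part of the Summary
ec_mean = ec(d_sum / 3, sds_fl / 3, sds_lr / 3)   # second part: everything divided by three
ec_forgot = ec(d_sum / 3, sds_fl, sds_lr)         # the error warned against: SDS's left undivided

print(round(ec_sum, 2), round(ec_mean, 2), round(ec_forgot, 2))
```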
But the unique feature of Weber’s experimental computa-
tions is not so much his use of frequency distributions, or
his use of means instead of sums. The unique feature is
his use of T scores or scale scores instead of the original
number of questions correct. His use of T scores makes all
three tests and the scores from them comparable. To begin
with, the test on India may have been the most difficult,
and the one on Japan of medium difficulty. After the process
of scaling has been completed, these differences in difficulty
have been ironed out so that every score, irrespective of
the test, is comparable with every other score and every M
is comparable with every other M. This makes it profitable
to use the mean of the M’s instead of the sum of the M’s
in the Summary. Finally, the T scores make the D’s and
the EC’s more exact.
TABLE 36
DISTRIBUTION OF SCORES MADE BY 499 7A-GRADE PUPILS IN A 60-QUESTION TEST
WHICH FOLLOWED A LESSON ON INDIA. ORIGINAL STEPS CONVERTED
INTO T-SCALE UNITS (AFTER WEBER)

           Group A    Group B    Group C               Per Cent Exceeding
Score      L-R        F-L        L-F        Total      Plus Half Those Reaching    T Score
0            2          2          1           5        99.50                      24
1-2          1          0          1           2        98.80                      27
3-4          1          1          2           4        98.20                      29
5-6          1          4          1           6        97.19                      31
7-8          4          6          5          15        95.09                      33
9-10         3          5          4          12        92.38                      36
11-12        8          2         11          21        89.08                      38
13-14        5          3          9          17        85.27                      40
15-16        7          9         10          26        80.96                      41
17-18       14          8         12          34        74.95                      43
19-20       17          9         13          39        67.64                      45
21-22        5         11         14          30        60.72                      47
23-24       13          9         20          42        53.51                      49
25-26       11         19          6          36        45.69                      51
27-28       17         13         13          43        37.78                      53
29-30        8         14         14          36        29.86                      55
31-32       16         15         10          41        22.14                      58
33-34       12          8          7          27        15.33                      60
35-36        9          9          5          23        10.32                      63
37-38        4          1          3           8         7.21                      65
39-40        2          8          2          12         5.21                      67
41-42        2          4          2           8         3.21                      69
43-44        1          4          2           7         1.70                      71
45-46                   1          1           2          .80                      74
47-48                   1          1           2          .40                      77
49-50                   1                      1          .10                      81
Total      163        167        169         499

The procedure by which each test was scaled is shown in
Table 36, which is identical with the India portion of Table
35 except that 499 pupils instead of 300 pupils are used,
that the T scores are shown in the last column instead of
the first, and that three additional columns essential to the
computation of T scores are added. The first column is the
number of questions, out of 60 questions on India, answered
correctly by the indicated number of pupils in each of Group
A, Group B and Group C. The fifth column is the total
number of pupils in all three groups answering the number
of questions shown in the first column. The numbers of
questions shown in this first column are grouped two
together instead of each question separately as is usually
done when scaling. This grouping is not necessary. It
is, in fact, of doubtful desirability. Its virtue is that it
saves labor. The sixth column gives the per cent exceeding
plus half those reaching each number of questions correct.
This per cent is based on the fifth column. How to com-
pute these per cents and transmute them into T scores,
shown in the last column, is described in Chapter V. Once
these T scores are known, the first, fifth, and sixth columns
may be eliminated as no longer useful, and the T scores may
be moved to the extreme left, thus making a table similar
to the India portion of Table 35. In like manner, the orig-
inal number of questions correct on the test on China, and
then the number of questions correct on the test on Japan,
can be transmuted into T scores. Since all the pupils in
all three groups are used in each of these three test scalings,
all scale values, i.e., T scores, are thus made comparable.
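The sixth column of Table 36 can be recomputed from the fifth, and the T scores approximated from the sixth. The sketch below assumes that a T score is 50 plus 10 times the normal-curve deviate corresponding to the per cent exceeding plus half those reaching. The book's own conversion is the table described in Chapter V, so a result may differ from the printed T score by a point at some steps. The function names are illustrative.

```python
from statistics import NormalDist

# Fifth column of Table 36: pupils in all three groups at each score step (0, 1-2, ..., 49-50).
totals = [5, 2, 4, 6, 15, 12, 21, 17, 26, 34, 39, 30, 42,
          36, 43, 36, 41, 27, 23, 8, 12, 8, 7, 2, 2, 1]


def percent_exceeding_plus_half(counts):
    """Per cent of pupils exceeding each step plus half of those reaching it."""
    n = sum(counts)
    below = 0
    percents = []
    for count in counts:
        exceeding = n - below - count
        percents.append(100 * (exceeding + count / 2) / n)
        below += count
    return percents


def t_score(percent):
    """Assumed conversion: 50 + 10 z, where z cuts off `percent` of the normal curve above it."""
    z = NormalDist().inv_cdf(1 - percent / 100)
    return 50 + 10 * z


percents = percent_exceeding_plus_half(totals)
for count, pct in zip(totals[:3], percents[:3]):
    print(count, round(pct, 2), round(t_score(pct)))
```

The same two functions applied to the China and Japan totals would put all three tests on one scale, since every scaling rests on the same pupils.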
The possibility of scaling experimental tests on the basis
of the performance of experimental pupils is not limited to
rotation experiments employing three groups and FT’s only.
It is possible for any rotation experiment with any number
of groups and with or without IT’s. It is equally possible
for any one-group or equivalent-groups experiment. In all
these cases the scaling may be based upon IT, FT, or C
records. The C records are best to use, the FT records are
next best. When C records are used the experimenter can
be absolutely certain of getting a T score for every need.
If IT’s are used, there is a possibility that no pupil at the
beginning of the experiment will make as high a record as
will be made by some pupil on the FT. This means that
extremely high scores on the FT may have to go unscaled.
If the scaling is based upon FT scores, there is a possibility
that extremely low scores on the IT cannot be scaled. No
difficulty need be anticipated if C records are scaled. Chap-
ter V shows how both IT and FT may be used to widen the
range of the scale so as to include the highest and lowest
scores.
But no matter which of the three records is scaled, it is
highly important that the scores of every experimental group
taking the test be utilized in scaling that test. This does
not mean that every pupil involved in the experiment has
to be used. It is required only that those utilized in experi-
mental computations be included. Weber scaled his tests
on 499 pupils. In his experimental computations he used
only 300 of these 499 pupils. It would have been just as
satisfactory to have scaled his tests on the 300 finally
selected as the basis for his experimental computations. It
would not have been quite so satisfactory if, say, Group C
were omitted in the scaling.
Under certain conditions it is permissible to compute
51.84 in the Summary of Table 35, by a less laborious pro-
cedure. The data which yields the three M’s from which
51.84 is derived, may be lumped together so that only one
M and one SDM is computed for all of it. In this case, the
final M for each of the other two EF’s should be computed
in the same way. The conditions required to make the
above modification permissible are (a) an equal number of
pupils in each group, (b) a uniform test for each group, or
else the tests to be scaled upon the experimental groups so
as to eliminate inequalities in difficulty and consequent
unduly-increased variability and unreliability, and (c) ap-
proximate equivalence of ability for the groups so com-
bined.
Special Computation Difficulties.—Since the rotation
method is a combination of several one-group methods or
several equivalent-groups methods, it is appropriate that this
chapter should close with a consideration of special types
of statistical computations required for special situations.
These special difficulties are caused not so much by pecu-
liar variations in experimental method as in variation in
methods of measuring changes. There are, for example,
the following common ways of measuring changes produced
in pupils by an EF:
1. Total points change on test made by each pupil.
2. Per cent of total possible gain on each test made by
each pupil.
3. Time required for each pupil to attain a defined score
on a test.
4. Per cent of pupils in each group attaining a perfect
score or any defined score on a test.
5. Per cent of pupils in each group making any gain on
test.
6. Per cent of pupils in one group whose change exceeds
the mean change of the other group.
Measuring-method 1 is the most commonly used and
should be. Except in very special instances, measuring-
methods 2, 3, 4, 5, and 6 should be used merely as supple-
mentary to the first method; they yield certain additional
information which, on occasion, is valuable. For example,
it may be useful to know whether the superiority of a par-
ticular EF is due to the large gains of a relatively few pupils
only, or whether every pupil has contributed to the superior-
ity. Measuring-method 4 tells whether the gains are well-
distributed. All the computation models assume measuring-
method 1. The experimenter is advised to avoid subsequent
statistical difficulty by planning for this method.
Measuring-methods 1, 2, and 3 yield a score and C for
each pupil, thereby permitting the computation of an M and
a SDM and ultimately a D, SDD and EC. Measuring-
methods 4, 5 and 6 yield a score for the group only, thereby
making it difficult, if not impossible, to compute measures
of reliability. Since each experimenter is obligated to report
the reliability of his conclusions, he should make sure that
the measuring-method which he plans to employ will yield
a measure of reliability at the end.
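The distinction between measuring-methods that yield a score for each pupil and those that yield a score for the group only can be seen in a brief sketch. The Python below is illustrative: all function names and figures are hypothetical, and measuring-method 3 is omitted because it requires timed records rather than a pair of test scores.

```python
def point_gains(initial, final):
    """Method 1: total points change for each pupil."""
    return [f - i for i, f in zip(initial, final)]


def percent_of_possible_gain(initial, final, max_score):
    """Method 2: gain as a per cent of the gain still possible for each pupil.

    A pupil already at the maximum has no possible gain; None is returned.
    """
    result = []
    for i, f in zip(initial, final):
        possible = max_score - i
        result.append(100.0 * (f - i) / possible if possible > 0 else None)
    return result


def percent_attaining(final, defined_score):
    """Method 4: per cent of the group reaching a defined score."""
    return 100.0 * sum(f >= defined_score for f in final) / len(final)


def percent_gaining(gains):
    """Method 5: per cent of the group making any gain."""
    return 100.0 * sum(g > 0 for g in gains) / len(gains)


def percent_exceeding_other_mean(gains_a, gains_b):
    """Method 6: per cent of one group whose change exceeds the other's mean."""
    mean_b = sum(gains_b) / len(gains_b)
    return 100.0 * sum(g > mean_b for g in gains_a) / len(gains_a)
```

Methods 1 and 2 return one value per pupil, from which an M and an SDM can be computed. Methods 4, 5 and 6 return a single figure for the group, which is why a measure of reliability is hard to obtain from them.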
CHAPTER IX
CAUSAL INVESTIGATIONS
Methodology of Causal Investigations.—When Darwin
visited South America, he was surprised to discover an
outbreak of yellow fever high up in the Andes Mountains.
Since he was a born scientist, he began immediately to specu-
late and observe to see if he could discover the cause for
such an unusual phenomenon. Doubtless he asked himself
these two questions: In what respect is this situation dif-
ferent from places which are immune from yellow fever? In
what respect is this situation like places which are subject
to yellow fever? Darwin showed his genius by almost dis-
covering the cause of yellow fever. He observed something
about the place which was very unusual for high altitudes
where yellow fever is unusual, and very much like lowlands
where yellow fever is more common,—pools of stagnant
water. He therefore suggested the hypothesis that this stag-
nant water was responsible for the yellow fever. He was
right so far as he went. It was not until long afterward that
this investigation was pushed far enough to make it appear
highly probable that stagnant water produced the mosquito,
which, in turn, caused yellow fever to spread.
Metchnikoff observed that the Bulgarians were an
unusually long-lived people. Metchnikoff wished to know
why. Doubtless he, too, asked himself these questions: In
what respect are the Bulgarians like other peoples who live
long? In what respect are they different from other peoples,
i.e., what force operates upon the Bulgarians which does not
operate upon other races? Like Darwin, he proceeded to
observe for differences. He concluded that the most striking
difference was the extent to which the Bulgarian people drink
buttermilk. He therefore concluded that the drinking of
buttermilk was responsible for the long life of the Bul-
garian, and that a similar practice on the part of other races
would lead to an equally long life. He went beyond Darwin
and buttressed his hypothesis by showing that certain organ-
isms present in buttermilk are specially beneficial to the
action of the alimentary canal.
Reavis’s recent work¹ is an admirable illustration of a
causal investigation in the field of education. He set out to
locate the causes for attendance and non-attendance in
school. From incidental observation and logical deduction,
he had arrived at not one but a number of hypotheses as
to what factors influenced attendance. He proceeded to
collect a large amount of data with a view to testing the
truth of his various hypotheses.
These illustrations of causal investigations, together with
many others which will occur to the reader, indicate some
interesting inferences. One inference is that different causal
investigations differ in their starting point and ending point.
Darwin’s causal investigation began with a problem and
ended with the formulation of a crude hypothesis. The pre-
eminent function of causal investigations is to yield sugges-
tive hypotheses to be tested by further logical deduction,
observations or experimentation. Because of the great value
of fruitful hypotheses, causal investigation has constituted
the fundamental method of discovery from the beginning of
time. Metchnikoff’s causal investigation began with a prob-
lem which not only led to the formulation of a hypothesis,
but also to the collection of certain subsidiary evidence to
show that the hypothesis was not an unreasonable one. But
Metchnikoff went no further. Reavis did not conduct an
investigation to secure useful hypotheses. Probable causes
were more evident. He started his causal investigation well
supplied with fruitful hypotheses. But what is more important,
he carried the investigation very much further than
was done in the other instances. He carried it far enough
practically to prove or disprove his various hypotheses.
¹ Reavis, George H., Factors Controlling Attendance in Rural Schools, Teachers
College, Columbia University, 1922.
A second inference from these samples is that the con-
clusions yielded by causal investigations are usually less
convincing than those yielded by experimentation. Conclu-
sions from causal investigations are seldom more than strong
hypotheses, which await confirmation by experimentation.
This need for confirmation varies with the nature of the
investigation and the adequacy of the data which is assembled
or which it is possible to assemble. Experimentation carries
greater weight than causal investigations, because an experi-
menter can control conditions much better than the investi-
gator. The investigator is compelled to accept conditions
as they are presented, complicated, as they usually are, by
all sorts of irrelevant factors, and providing, as they fre-
quently do, insufficient data upon which to base conclusions.
Darwin’s conclusion concerning the cause of yellow fever
was only a good guess, at best. It was a very slender hypo-
thesis. He could have greatly strengthened his hypothesis
by making a systematic series of observations or collection
of data. He could have strengthened it still more by evolv-
ing a hypothesis as to the exact mechanism whereby stag-
nant water causes yellow fever, and then by conducting an
equivalent-groups experiment to test this hypothesis. All
are familiar with the famous equivalent-groups experiment,
finally conducted, in which a group of healthy men offered
their lives to prove conclusively that yellow fever is trans-
mitted by a certain variety of mosquito which thrives only
where stagnant water is found.
Metchnikoff’s conclusion as to the efficacy of buttermilk
was and remains a hypothesis only, and will continue to re-
main so until it is tested experimentally. It is doubtful if it
can be tested conclusively by means of a causal investigation
because nature apparently does not present the proper con-
ditions.
The nature of Reavis’s research makes it more feasible
as a causal investigation. By the selection of a relatively
narrow problem, by the collection of many data readily
available, by the utilization of recently-developed statistical
techniques, and by the exercise of no little ingenuity, he was
able to isolate fairly well the factors whose influence he
desired to study.
A third inference is that the methodology of causal investi-
gations is the methodology of equivalent-groups experimen-
tation. A causal investigation is merely an equivalent-groups
experiment conducted backward. The criteria for a valid
equivalent-groups experiment are the criteria for a valid
causal investigation. To the extent that a causal investiga-
tion would be invalid if reversed and conducted forward as
an equivalent-groups experiment, just to that extent it is
invalid as a causal investigation. A perspective of a correct
plan for a causal investigation, viewed from its starting
point, is identical with a perspective of an equivalent-groups
experimental plan, for the solution of the same problem,
viewed from the ending point. If these perspectives are not
identical, there is a crudity in one of the plans, and the
crudity will usually be found in the plan for the causal
investigation. An important corollary of the foregoing is
that he who has mastered the technique of experimentation
is already equipped for causal investigation. Only a few
additional techniques need be described.
In illustration of the foregoing statement that the same
criteria hold for both causal investigations and equivalent-groups
experimentation, it will suffice to show how these
criteria apply to Metchnikoff’s causal investigation. To
satisfy these criteria, Metchnikoff would have to show that,
except for much buttermilk drinking and its reputed good
effects, Bulgarians are by nature and environment equiva-
lent to other races. This he has not shown. Consequently,
critics of his hypothesis have some justification in attributing
the long life of the Bulgarians to certain other factors in
which the Bulgarians possibly differ from other races. The
true cause may be, for example, the operation of a
more rigorous environment than has been operating upon
other races. The effect of such selective agency would be
to make the present Bulgarian people a very hardy stock.
Combine this possible fact with the assumption that there
has been a rapid amelioration of environmental conditions
during the last few hundred years, and we have an explana-
tion for Bulgarian longevity totally unconnected with but-
termilk. Or, again, it may be that the original ancestors
of the Bulgarians possessed and transmitted through hered-
ity a tendency toward longevity, just as they doubtless
possessed and transmitted the physical traits which dis-
tinguish them from other races today. Or, finally, their
greater longevity may be due to the cooperative contribution
of several of these factors rather than to any one of them.
All this shows why causal investigations which fail to satisfy
perfectly the equivalent-groups experimental criteria yield
conclusions which are suggestive hypotheses only. Their
validity is no greater and no less than that of the conclusions
yielded by an equivalent-groups experiment which fails to
satisfy its own criteria to an equal extent.
Essential Procedure of Simple Causal Investigations.
—Causal investigations may be prosecuted in either of two
ways. Perhaps the most common and certainly the most
simple and elementary way, is the all-or-none procedure. In
an all-or-none investigation, the effect, whose cause is sought,
is either totally present or totally absent, or else the investi-
gator arbitrarily ignores any gradations in between, or else
he defines a certain minimum amount of the effect, any
amounts in excess of which will be considered to constitute
its presence, and any amounts less than which will be con-
sidered to constitute its absence.
The preceding discussion of this chapter has made it clear
that for this variety of causal investigations the essential
steps are as follows:
1. The investigator searches until he finds objects, indi-
viduals, communities or situations which are alike in that
they all show a particular effect whose cause is sought.
2. He inspects these situations to see whether they have
anything else in common which might possibly be the cause
of the observed effect. If he finds such a common cause,
he formulates the hypothesis that this is the probable cause
of the effect.
3. He continues his collection of cases to discover
whether the hypothetical cause is always and without excep-
tion present when the effect is present.
4. He collects cases which are alike except for the pres-
ence of the effect in some of the cases and its absence in
others.
5. He observes to see whether the hypothetical cause is
present in those cases which show the effect, and absent in
those cases which do not show it.
6. He continues the collection of such instances to dis-
cover whether inexplicable exceptions occur.
7. If in either half of the foregoing process inexplicable
exceptions occur, the investigator attempts to find a new
and more promising hypothesis as to the cause of the effect.
If he is successful in this he starts through the above process
again. If he is not successful the causal investigation ends
unsuccessfully.
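The seven steps above can be sketched as a small routine. The Python below is a minimal illustration, not a prescribed technique: the cases, the factor names, and the function names are all invented, and each case is assumed to be recorded simply as showing or not showing the effect, together with the set of circumstances observed in it.

```python
def common_factors(cases):
    """Step 2: circumstances shared by every case that shows the effect."""
    with_effect = [set(c["factors"]) for c in cases if c["effect"]]
    return set.intersection(*with_effect) if with_effect else set()


def exceptions(cases, cause):
    """Steps 3, 5 and 6: cases in which cause and effect do not go together.

    A case is an exception if the hypothetical cause is present without the
    effect, or the effect is present without the hypothetical cause.
    """
    return [c["name"] for c in cases if (cause in c["factors"]) != c["effect"]]


# Invented cases, loosely modelled on the yellow-fever illustration.
CASES = [
    {"name": "lowland port", "effect": True,
     "factors": {"stagnant water", "heat", "low altitude"}},
    {"name": "river delta", "effect": True,
     "factors": {"stagnant water", "heat", "low altitude"}},
    {"name": "mountain village with pools", "effect": True,
     "factors": {"stagnant water", "high altitude"}},
    {"name": "dry mountain village", "effect": False,
     "factors": {"high altitude"}},
    {"name": "dry lowland town", "effect": False,
     "factors": {"heat", "low altitude"}},
]

if __name__ == "__main__":
    for cause in sorted(common_factors(CASES)):
        print(cause, "exceptions:", exceptions(CASES, cause))
```

A hypothesis that returns an empty list of exceptions survives; one that returns exceptions which cannot be explained sends the investigator back to step 7.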
Essential Procedure of a Complex Causal Investiga-
tion. a. Formulation of Hypotheses.—Causal investiga-
tions of a complex variety do not treat the effect merely as
present or absent, but recognize and take account of grada-
tions of effect and gradations of cause. Here the investi-
gator determines not only whether the presence of the effect
is accompanied by the presence of the hypothetical cause,
but also whether increase in the amount of the cause is
accompanied by a corresponding increase in the amount of
the effect. Furthermore, the investigator may attempt to
discover whether the effect is produced by one or more
causes, and if produced by several causes he may attempt
to determine just how much of the effect each cause con-
tributes.
Reavis’s investigation is an illustration of one which took
account of gradations in cause and effect, which found that
the effect was produced by several cooperating causes, and
which determined the exact amount of independent contribu-
tion of each cause to the effect. A summary of his pro-
cedure is given below. The reader is referred to his disserta-
tion for details.
From incidental observation and logical deduction, he
formulated numerous hypotheses as to the more probable
causes or factors influencing the attendance of rural-school
elementary pupils. Some of these factors related to the
pupil, some to the school and teacher, and some to the com-
munity. Sample questions relating to the pupil were: Does
age, sex, distance from school, quality of roads from home
to school, distance transported, age-grade position, or quality
of school influence a pupil’s attendance record? Sample
questions relating to teacher and school were: Does the
teacher’s salary, or amount of training, or the school’s mod-
ernness of equipment, playground space, or the like influence
a pupil’s attendance? Sample questions relating to the com-
munity were: Does the community’s wealth, intellectual
level, or interest in education influence a pupil’s school
attendance?
b. Collection of Data.—The collection of data is a prob-
lem in measurement. The general principles to guide such
measurements were given in Chapter V. These principles
hold whether the investigator personally makes his own
measurements, or secures them from others by means of a
questionnaire. The principles apply whether the measure-
ments made be tests of mental traits, tests of school build-
ings, collection of school records, or the introspections or
judgments of judges.
The following questions¹ will guide the investigator in the
evaluation and preparation of a questionnaire. Are the
questions as factual as possible? Do they involve a minimum
of judgment and memory? Are the questions as specific
as possible? Will the data secured lend themselves to
tabulation and statistical treatment? Are the questions
unambiguous? Will all terms used have the same meaning
to all reporters? Will the questions evoke replies which
will be unambiguous to the investigator? Is the information
called for difficult to obtain? Can the data called for
be obtained more accurately otherwise? Do the questions
cover all the data needed for subsequent computations?
Can the questions be answered by a check, number, Yes,
No, or brief phrase? Are the questions arranged so that
none will be overlooked? Is the space sufficient for each
answer? Are the questions worded and arranged to facilitate
tabulation and fit the tabulation form to be used? Will
the data called for by the questions answer the specific and
previously worded objects of the investigation? Are the
questions formulated in the light of a bibliographical survey?
Is the amount of time required to answer questions so
excessive as to induce careless responses, omission of items,
or few replies? Are the questions worded in the light of
one or more preliminary trials with representative samplings
of the individuals for whom questions are designed? Are
the nature and number of questions such as to secure replies
from representative individuals and from a sufficient number
to satisfy the statistical criteria of reliability?
¹ See Rugg, Harold O., Application of Statistical Methods to Education, pp. 39-55;
Houghton Mifflin Company, New York,
A common form of questionnaire is one which aims to
measure the degree of preference for this or that. Thus
Lowe sent a questionnaire which gave a comprehensive list
of the activities of clergymen. He desired to know how
each clergyman evaluated each activity. Several methods
have been proposed for meeting just such a situation, i.e.,
for measuring opinions.
One method, the rank method, is to ask that the activity
which is deemed most important be ranked 1, the one deemed
next most important be ranked 2, and so on for the number
of activities listed. This method is fairly satisfactory in
most cases. It is very time-consuming if the number of
items is large. It yields relative evaluations only; it does
not show what activities are deemed of no value whatever.
It does not show which activities are judged to be of equal
value, but forces the reporter to make a choice. This forc-
ing does no harm so far as group results go, but it may do
violence to one individual’s opinion. Finally, the rank
method forces the reporter to make the same difference be-
tween all adjoining activities, namely, a difference of one.
A second method is the distribution method. Here the
reporter is asked to distribute, say, 100 points among the
listed activities, thus showing the importance of each activity
by the number of points assigned to it. This method per-
mits the reporter to indicate just what activities are of no
merit, but does not allow him to indicate negative values.
It permits the reporter to attach the same value to more
than one activity, and to indicate varying differences be-
tween activities. It is more time-consuming, however, than
the rank method, unless the activities are grouped into head-
ings and sub-headings. If they can be so grouped, the re-
porter can be asked to distribute his 100 points among the
main headings, and, after this is done, to distribute the total
points assigned to each heading among its sub-items. Some-
times, however, activities do not fall into convenient group-
ings which are mutually exclusive as to items and sub-items
or where the sub-items completely exhaust their heading.
Theoretically, the distribution method requires both such
exclusiveness and exhaustion. Finally, the distribution
method tends to make the number of points assigned to each
activity incomparable from one reporter to another. One
clergyman may hold half the activities listed to be of no
value; nevertheless he must use up his 100 points. Another
clergyman who assigns some points to every activity will be
compelled to assign fewer points to an activity which he may
evaluate just the same as the previously mentioned indi-
vidual.
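The last objection can be made concrete with a little arithmetic. In the sketch below, written in Python, the activities and the "private worth" figures are invented, and the supposition that each reporter shares out his 100 points in proportion to his private evaluations is an assumption made only for the illustration.

```python
def distribute_points(private_worth, total_points=100):
    """Share total_points among activities in proportion to private worth."""
    whole = sum(private_worth.values())
    return {activity: total_points * worth / whole
            for activity, worth in private_worth.items()}


# Two hypothetical clergymen who evaluate preaching exactly alike (worth 4).
# The first holds two activities to be of no value; the second values all five.
FIRST = {"preaching": 4, "visiting the sick": 4, "teaching": 2,
         "record keeping": 0, "fund raising": 0}
SECOND = {"preaching": 4, "visiting the sick": 4, "teaching": 2,
          "record keeping": 5, "fund raising": 5}

if __name__ == "__main__":
    print(distribute_points(FIRST)["preaching"])
    print(distribute_points(SECOND)["preaching"])
```

Under these assumptions the first clergyman assigns preaching 40 points and the second only 20, although the two evaluate it alike. The points are therefore comparable within one reporter's list but not between reporters.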
A third method is the relative-to-the-items scale method.
Here the reporter is asked to rate the activity considered
least important as 1, the activity considered most important
as 20, or 10, or 5, and to assign a value anywhere from 1 to
20 inclusive to the other activities, assigning the same value
more than once if desired. This method has all the virtues
previously mentioned as desirable, except that of permitting
a report as to just what activities are judged of no worth
or negative worth or whether any activities are of greater
worth.
A fourth method is the absolute-worth-occupational scale.
Here the clergyman is asked to rate any activity equal in
value to the most desirable activity in which a clergyman
can engage as worth, say, 19 points; to rate any activity
zero, which is of just no professional significance; to rate
any activity minus 19 which is equal in professional destruc-
tiveness to the worst occupational activity in which a clergy-
man can engage; and to rate all other activities according
to this absolute occupational scale. Thus, mending shoes
is above zero in social value, but is probably below zero on
a clergyman’s occupational scale. The chief objection to
this scale is the great likelihood that the reporter will be
unable to avoid confusing this fourth scale with the fifth to
be described.
The fifth method is the absolute-worth-social scale. Here
the reporter is asked to construct or think a scale ranging
from minus 19 through 0 to plus 19, where minus 19 means
the worst imaginable human act such as an able-bodied man
murdering his defenseless, gifted child to avoid working for
its support, where plus 19 means the best conceivable human
act, and then to rate the listed activities according to this
scale. This scale yields the fullest information of any of
the five methods described. Whether it is more or less
reliable than the others is not surely known.
Reavis employed the questionnaire procedure for collect-
ing the data used in his investigation. Fortunately, he was
in a position of authority where he could secure unusually
accurate and adequate returns. He eliminated from con-
sideration all transient pupils whose attendance could not
possibly be perfect due to the fact that they were not in
one district throughout the school year. Then he secured a
measure of the amount of attendance of each of 5314 pupils
in 200 country schools in five counties in Maryland. At the
same time he determined the amount of presence of each of
a large number of hypothetical factors, such as the pupil’s
distance from school, the quality of his work at school, the
sort of teacher who taught him, the character of the school
building and equipment which surrounded him, and the
character of the community in which he lived.
Much ingenuity was shown in making these determina-
tions, and in securing a comparable quantitative expression
for the amount of presence of each factor. To illustrate
with only one of the difficulties encountered—consider his
method for securing comparable measures of the distance a
pupil lives from the school. A pupil who lives a mile from
the school and in order to reach it must walk all the way
along an unimproved clay dirt road, really lives farther
away than another pupil a mile from the school who walks
half the way on an unimproved clay dirt road and half the
way on a macadam state road.
To equate these two conditions, Reavis reduced the dis-
tance for pupils travelling over state roads so as to make
state-road distances equal unimproved-road distances. He
made various guesses as to the proper subtraction and
checked up each guess by computing the coefficient of corre-
lation between attendance of all pupils and the distance
score for each pupil corrected by his guess. With each
improvement in his guess, the coefficient of correlation
should go up, due to the fact that errors in measurement
reduce the coefficient of correlation toward zero. The corre-
lation between uncorrected distances and attendance was
.38. A perfect correlation would be 1.0, and no correlation
would be zero. Calling each mile of state road equivalent
to one-half mile of unimproved road and correcting accord-
ingly yielded a coefficient of correlation between corrected
distance and attendance of .43. Counting each mile of state
road as equal to three-fourths of a mile of unimproved road
and correcting accordingly raised the correlation to .54.
A guess on either side of the last weighting yielded correla-
tion of .48 and .51, showing that the best basis for correction
was to call one mile of state road equal to three-fourths of
a mile of unimproved road.
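The trial-and-error weighting just described can be sketched as follows. The Python below is an illustration under stated assumptions: the mileages and attendance figures are invented, the function names are hypothetical, and the size of the correlation is compared without regard to sign, since the text reports the coefficients as magnitudes.

```python
from math import sqrt


def pearson(xs, ys):
    """Product-moment coefficient of correlation between two series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def corrected_distance(unimproved_miles, state_miles, weight):
    """Count each mile of state road as `weight` miles of unimproved road."""
    return [u + weight * s for u, s in zip(unimproved_miles, state_miles)]


def best_weight(unimproved_miles, state_miles, attendance, candidates):
    """Return the candidate weight giving the largest correlation in size."""
    scored = [
        (abs(pearson(corrected_distance(unimproved_miles, state_miles, w),
                     attendance)), w)
        for w in candidates
    ]
    return max(scored)[1]


# Invented records for six pupils: miles of each kind of road, days attended.
UNIMPROVED = [0.5, 1.0, 2.0, 0.0, 1.5, 3.0]
STATE = [2.0, 0.0, 1.0, 3.0, 0.5, 0.0]
ATTENDANCE = [140.0, 160.0, 125.0, 135.0, 142.5, 120.0]

if __name__ == "__main__":
    print(best_weight(UNIMPROVED, STATE, ATTENDANCE,
                      [0.5, 0.625, 0.75, 0.875, 1.0]))
```

The invented attendance figures were constructed so that three-fourths is the best of the candidate weights; with real records the maximum would be less sharply marked, and guesses on either side of the best one would be needed to confirm it, as in the text.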
But even the correction for the quality of the road does
not eliminate all the error in the distance measurements.
Some of the pupils were transported all or a part of the way.
By employing the same correlation device to check up vari-
ous guesses as to the proper weighting, Reavis found the
optimum correction for distance transported per number of
days transported and per cent of days attended. The rea-
son for taking the amount of attendance into consideration
will readily occur to the reader.
c. Determination of Significance of Causes——The next
step was to divide the 5314 pupils into two groups of equal
numbers. One group was composed of that half of the
pupils having the better attendance record. The half with
the poorer attendance record composed the other group.
Three or more groups representing as many attendance
gradations could have been used. From the better-attendance
group a smaller group was so selected as to be equivalent
in every respect, except for the difference in attendance
and the factor of distance, to a smaller group selected from
the poorer-attendance group. That is, in equating these
two groups, the factor of distance was ignored but all other
factors were regarded. The technique for equating groups
on several bases was discussed in Chapter III. Next, the
mean distance from school of each equated group was com-
puted. If, when this was done, the mean distance was
less for the better-attendance group, the investigator was
justified in concluding that a difference in distance was asso-
ciated or correlated with a difference in attendance.
The next step was to equate two groups in every respect
except, say, the quality of school work of the pupils and
attendance. The difference between the mean quality of
school work for the two groups showed the extent to which
quality of school work was associated with attendance,
whether positively correlated, negatively correlated, or
whether neutral. In similar fashion, the investigator deter-
mined whether any other factor relating to the pupil, teacher,
school, or community was associated, and to what degree,
with the attendance of the pupils.
If the mean distance for one attendance group was identi-
cal with the mean for the other attendance group, a con-
clusion that distance affects attendance would be totally
unreliable. Since the D between the two M’s would be
zero, the EC would be zero. If there were some difference
between the two M’s, the significance of this D, or rather
how much we could trust its significance, would depend upon
the reliability or EC of this D. This reliability could be
determined in the usual way. The series of distance scores
from which M1 came would permit the computation of SD
and SDM1. Similarly the series of distance scores which
yielded M2 would yield SD and SDM2. M1 and M2 would
yield D. SDM1 and SDM2 would yield SDD. D and SDD
would yield EC.
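The chain of computations from the two series of distance scores to the EC can be sketched as below. The formulas are assumed from the earlier computation chapters and are not restated in this section: SD computed with N as divisor, SDM = SD / √N, SDD = √(SDM1² + SDM2²), and EC = D / (2.78 × SDD). The reader should verify them there, in particular the constant 2.78, which is passed as a parameter for that reason. The function names and sample scores are hypothetical.

```python
from math import sqrt


def mean(scores):
    return sum(scores) / len(scores)


def sd(scores):
    """Standard deviation with N as divisor (assumed convention)."""
    m = mean(scores)
    return sqrt(sum((s - m) ** 2 for s in scores) / len(scores))


def sdm(scores):
    """Standard deviation of the mean, assumed to be SD / sqrt(N)."""
    return sd(scores) / sqrt(len(scores))


def reliability_of_difference(scores_1, scores_2, ec_divisor=2.78):
    """Return D, SDD and EC for two equated groups.

    ec_divisor is an assumed constant; check it against the earlier chapters.
    """
    d = mean(scores_1) - mean(scores_2)
    sdd = sqrt(sdm(scores_1) ** 2 + sdm(scores_2) ** 2)
    if sdd == 0:
        raise ValueError("SDD is zero; EC is undefined for these scores")
    return d, sdd, d / (ec_divisor * sdd)


if __name__ == "__main__":
    # Invented distances in miles for the better- and poorer-attendance groups.
    better = [1.0, 2.0, 3.0, 2.0]
    poorer = [2.0, 3.0, 4.0, 3.0]
    print(reliability_of_difference(better, poorer))
```

When the two means are identical, D is zero and the function returns an EC of zero, which agrees with the statement at the head of this paragraph.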
When two groups equivalent in all respects, except for
attendance and the difference in the factor being studied,
show the same mean amount of the factor, we can certainly
say that the factor under consideration has no influence
upon attendance, is not a cause or contributing cause of
attendance. When the above procedure is used, and when
variations in attendance are accompanied by variations in
the factor being studied, we are justified in saying that
variations in the factor are associated or are correlated with
variations in attendance. But additional considerations are
necessary before we are justified in concluding that variations
in a factor influence or are a cause of variations in
attendance. It may be that attendance is, instead, a cause
of the factor. Or it may be that each is partly effect and
partly cause. Or it may be that no direct, definite causal
relation exists.
Judging by Reavis’s findings, distance is associated with
attendance. Now since it is easily conceivable that distance
influences attendance, and since it is highly improbable that
attendance in a particular year has influenced the distance
a pupil lives from school during that year, we are justified
in concluding that distance is not only associated with but
actually influences attendance. Also the results of Reavis’s
study showed that quality of school work was associated
or correlated with attendance, but we cannot be quite certain
here, whether the quality of school work influenced attend-
ance or attendance influenced quality of school work or both.
Probably the last is nearest the truth. Poor attendance
leads to low quality of work, which leads to loss of interest,
which leads to poorer attendance still. In sum, if the investi-
gator will follow the procedure outlined above he can con-
clude that a correlation exists between factor and attendance,
and that sometimes a causal relation exists; but which is
cause and which effect rests upon additional logical con-
siderations.
When the cases are as numerous as they were in the study
made by Reavis, causal investigators often save themselves
trouble by using all the cases in the study of each factor,
trusting to luck and to numbers to make the groups equiva-
lent in all other factors. Thus, in the sample illustration,
they would divide the 5314 pupils into, say, two groups equal
in number, those living nearer and those living farther
from the school. The investigator would assume, in this
case, that since the pupils were divided with an eye to
one factor only, the two groups would by chance be
approximately equivalent with respect to the amount or
presence of any other factor.
If the various factors are independent of each other, i.e.,
if they are uncorrelated with each other, the foregoing pro-
cedure would be fairly satisfactory. But in any complex
investigation, the investigator can be practically certain that
various factors are correlated and cross correlated in all
sorts of bewildering ways. If all pupils are divided regard-
less of everything except quality of school work, we can
be practically sure that chance would not equal the two
groups with respect to, say distance. Long distance from
school, through its reduction of attendance, affects quality
of school work. That is, distance and quality of school
work are not independent factors. They are negatively
correlated. As a result, any division on the basis of quality
of school work alone, unavoidably becomes, in part at least,
a division on the basis of distance. In like manner, it
will become, in part at least, a division on the basis of
every other factor which is correlated either positively or
negatively with quality of school work. So long as this is
the case, the investigator is unable to tell just how much of
any difference in attendance is attributable to quality of
school work, and how much to each of the various factors
correlated with quality of school work. All he can conclude
is that this total complex is correlated with the attendance
record, and may be a cause or an effect of the attendance
record. The only safe procedure is to satisfy as completely
as possible the equivalent-groups experimental criteria by
attempting consciously to equate the groups in every known
factor. Even so there will be enough error due to unknown
significant factors.
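The difficulty may be seen in a small sketch. The eight pupils below are invented, and their scores are arranged so that quality of work and distance are negatively correlated; nothing here is taken from Reavis's data. Dividing on quality of work alone then yields two groups that differ markedly in distance as well.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical pupils: (quality of school work, distance in miles).
# The scores are invented so that quality and distance are negatively correlated.
pupils = [(90, 0.4), (85, 0.8), (80, 1.0), (75, 1.4),
          (70, 1.8), (65, 2.2), (60, 2.6), (55, 3.0)]

# Divide on quality of work alone, into an upper and a lower half.
ranked = sorted(pupils, key=lambda p: p[0], reverse=True)
half = len(ranked) // 2
upper, lower = ranked[:half], ranked[half:]

# The division was made with an eye to quality only, yet the two
# groups are far from equivalent in distance.
upper_distance = mean([d for _, d in upper])
lower_distance = mean([d for _, d in lower])
print(round(upper_distance, 2), round(lower_distance, 2))
```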
d. Preliminary Exploration of Significance of Causes.—
Now as a matter of fact, Reavis did not employ the former
or more exact method of evaluating the factors. He used
instead a modified and rather drastic form of the latter
more crude method. But he used this method not for the
purpose of evaluating exactly the influence of each factor
upon attendance, but rather for the purpose of preliminary
exploration to discover which factors appeared promising
enough to justify an additional very refined procedure—a
procedure more feasible than the exact one already de-
scribed.
His preliminary explorative procedure was to place in one
group, not the half of his pupils who had the best attend-
ance records, but the topmost 12% in attendance. The
other group was composed of the lowest 12% in attendance.
Since any factor that varies with attendance should be
found in different amounts in these two groups, he computed
the mean distance from school for each group, and then
the mean quality of work in school for each group, the
per cent of each group found under the better teachers, vs.
the per cent found under the poorer teachers, and so on
for the large variety of factors whose influence upon attend-
ance was under consideration. When there was a pro-
nounced difference between the two means or the two per
cents for a factor, Reavis considered that factor to be
worthy of further study by a more exact procedure. When
no pronounced difference appeared he considered that factor
to have little or no influence upon attendance and eliminated
it from further consideration. While this method is so crude
that it will not show the independent contribution of each
factor, it is sufficiently exact to show what factors are
promising ones for further study and which ones are un-
promising.
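A minimal sketch of this explorative procedure follows. The function name, the record layout, and the data are hypothetical; only the idea of comparing the topmost and lowest 12% in attendance comes from the text.

```python
def mean(values):
    return sum(values) / len(values)

def extreme_group_means(records, fraction=0.12):
    """records holds (attendance, factor) pairs, one per pupil.
    Returns the mean factor score of the topmost and of the lowest
    `fraction` of pupils when ranked by attendance."""
    ranked = sorted(records, key=lambda rec: rec[0])
    k = max(1, round(len(ranked) * fraction))
    lowest, topmost = ranked[:k], ranked[-k:]
    return mean([f for _, f in topmost]), mean([f for _, f in lowest])

# Invented records: attendance 0, 4, ..., 96 per cent, with distance
# falling steadily as attendance rises.
records = [(a, 4.0 - 0.04 * a) for a in range(0, 100, 4)]
top_mean, low_mean = extreme_group_means(records)
print(round(top_mean, 2), round(low_mean, 2))
```

A pronounced difference between the two means would mark the factor as worth further study; how large a difference counts as pronounced is left to the investigator.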
In this preliminary investigation Reavis determined
roughly the significance for attendance of the following
factors relating to the child: sex, chronological age, grade
in which enrolled, quality of work, and promotion. He
studied the following factors relating to the school: training
of teacher, salary of teacher, experience of teacher, num-
ber of recitations, completeness of teacher’s report, neat-
ness of teacher’s report, handwriting of the teacher, teacher’s
intention to continue, schools changing teachers, rating of
teacher, size of library, kind of blackboard, rating of equip-
ment, age of desks, number and kind of pictures on the
walls, school enrollment, size of schoolroom, lighting of
schoolroom, system of heating and ventilation, rating of
school building, suitability of school grounds, play and
games, value of school property, cost of running school and
distance from children’s homes. He investigated the following
factors relating to the community: money raised,
number of community meetings, and rating of the com-
munity.
Many of the above factors proved to have little or no
connection with attendance. Many other factors showed a
significantly promising relationship. In order to reduce the
number of factors for detailed examination, various significant
factors were combined where possible. Thus a score
for distance was determined by combining uncorrected dis-
tance, quality of roads, and transportation. A score for
the teacher was secured by combining the factors relating
to her which proved significant, namely, her rating by the
superintendent, her salary, and her training. A score for
the school plant was secured by combining the rating on
the building, rating on the equipment, and rating on the
grounds. In describing the correction of distance, a device
was given for determining weights to be assigned to the
elements that entered into these various combinations. A
like method was employed for computing these composites
for teacher, and for school. Three other factors, namely,
a pupil’s progress through the grades or age-grade relation-
ship, a pupil’s quality of school work, and the quality of
the community, were found worthy of additional considera-
tion. This means that six factors were selected for detailed
examination by the process to be described.
A seventh factor, namely, chronological age, was found
to be significant, but the effect of this factor was taken care
of by studying the relationship between attendance and the
six selected factors separately for each of three age groups,
namely, 5 to 8, 8 to 12, 12 and above.
e. Correlation and Inter-correlation Between Causes
and Effect—The next step was to compute the coefficient
of correlation between attendance and each of the six
selected factors, and to do this separately for each of the
three age sub-groups.
The coefficient of correlation is a statistical expression
for the degree of proportionality or correspondence between
two series of measures, and is indicated by the symbol r.
When r is + 1.0 the correspondence or correlation between the
two series of measures, say, scores for distance and attendance,
is perfect and positive. When r is − 1.0 the correlation
is perfect but it is inverse or negative. When r is zero
the correlation is nil. An r may be anywhere from − 1.0
through zero to + 1.0. We should expect the r between
attendance and quality of school work to be positive, because
we should expect those pupils who have a good attendance
record to tend to show high quality of school work, and
vice versa we should expect those pupils who have a poor
attendance record to tend to show a low quality of work.
On the other hand we should expect the r between attendance
and distance to be negative, because we should expect
those pupils who have a high distance score to tend to
have a low attendance record, and vice versa.
There are several formulae for the computation of r.
The standard formula when the relationship is approximately
rectilinear (see Diagram 1) is Pearson’s product-moment
formula, which may be written thus when the exact mean
is used:
$$r = \frac{Sxy}{\sqrt{Sx^2}\,\sqrt{Sy^2}}$$
or thus, when the assumed mean is used:
$$r = \frac{\dfrac{Sxy}{N} - c_x c_y}{\sqrt{\dfrac{Sx^2}{N} - c_x^2}\,\sqrt{\dfrac{Sy^2}{N} - c_y^2}}$$
Most educational relationships are rectilinear or are suffi-
ciently so to make it permissible to employ the product-
moment formula. But it is well to construct and inspect
a scatter diagram (see Diagram 1) to determine whether
the general drift of the diagram is rectilinear or curvilinear
(see Diagram 1). If it is pronouncedly curvilinear the investigator
is referred to Rugg’s book * on statistical methods
for the appropriate formula.
* Rugg, Harold O., Application of Statistical Methods to Education; Houghton
Mifflin Company, New York, 1917.
DIAGRAM 1
THE CIRCLES SHOW AN APPROXIMATELY RECTILINEAR RELATIONSHIP. THE
CROSSES SHOW A CURVILINEAR RELATIONSHIP
[Scatter diagram: distance in miles (vertical axis) plotted against per cent of attendance (horizontal axis, 0 to 100 in steps of 5).]
TABLE 37
SHOWING HOW TO COMPUTE r FOR THE DATA (CIRCLES) IN DIAGRAM 1
[Table: per cent of attendance and distance in miles for each of the twenty-five pupils, with the deviations x and y from the assumed means and the products xy. For attendance, AM = 52.0, cx = 0.2, and M = 52.2. The computation yields r = −.81.]
Diagram 1 shows in one diagram two sample scatter diagrams
for two groups of twenty-five children. The circles
show the relationship between attendance and distance.
Each circle indicates one child’s attendance record and
distance from school. The general drift of the relationship
is a straight-line or rectilinear drift. The crosses show the
relationship between attendance and distance for twenty-five
other pupils. Remember that the diagram is merely for
illustrative purposes. It is extremely improbable that one
group of pupils (circles) would show a decided negative
correlation and another group (crosses) a decided positive
correlation. But the important point to note about the
diagram is that the circles show a rectilinear drift whereas
the crosses show a curvilinear drift.
The procedure for computing r is given in Table 37. Note
that the x column shows deviations from the AM for attend-
ance, and that the y column shows deviations from the AM
for distance. Everything else is self-explanatory.
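The two ways of computing r may be compared in a short sketch. The six pairs of scores are invented and are not the data of Table 37. The assumed-mean function takes deviations from an arbitrary AM and corrects by cx and cy, which is assumed here to be the form of the assumed-mean formula. Both routes should give the same r whatever AM is chosen.

```python
import math

def r_exact_mean(xs, ys):
    """Product-moment r from deviations about the exact means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    x = [v - mx for v in xs]
    y = [v - my for v in ys]
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return sxy / (math.sqrt(sx2) * math.sqrt(sy2))

def r_assumed_mean(xs, ys, am_x, am_y):
    """Product-moment r from deviations about assumed means (AM),
    corrected by cx and cy."""
    n = len(xs)
    x = [v - am_x for v in xs]
    y = [v - am_y for v in ys]
    cx, cy = sum(x) / n, sum(y) / n
    numerator = sum(a * b for a, b in zip(x, y)) / n - cx * cy
    sd_x = math.sqrt(sum(a * a for a in x) / n - cx ** 2)
    sd_y = math.sqrt(sum(b * b for b in y) / n - cy ** 2)
    return numerator / (sd_x * sd_y)

# Hypothetical scores for six pupils.
attendance = [95, 80, 70, 55, 40, 20]
distance = [0.4, 1.0, 1.2, 2.0, 2.8, 3.6]
r_exact = r_exact_mean(attendance, distance)
r_assumed = r_assumed_mean(attendance, distance, am_x=50, am_y=2.0)
print(round(r_exact, 3), round(r_assumed, 3))
```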
When N is large, say 50 or above, it is more economical
to tabulate data into a contingency table, such as Table 38.
Such a contingency table may be used not only as a starting
point for a short-cut method of computing a product-moment
coefficient of correlation, but it also makes unnecessary the
construction of a scatter diagram, such as Diagram 1. In-
spection of the contingency table will show whether the rela-
tionship is sufficiently rectilinear to make the product-
moment method applicable.
Table 38 is read thus: There were 3 pupils who lived
between 3.4 and 4.0 (inclusive) miles distance from school
whose per cent of attendance was between 0 and 10 inclusive,
and similarly for the remainder of the contingency
table.
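Tabulating raw scores into such a table can be sketched as below. The step boundaries follow those named in the text for Table 38 (15 points of attendance, 0.8 mile of distance); the function names and the five sample pupils are hypothetical.

```python
from collections import Counter

def attendance_step(per_cent):
    """Index of the 15-point step: 0-10 -> 0, 15-25 -> 1, ..., 90-100 -> 6."""
    return int(per_cent) // 15

def distance_step(miles):
    """Index of the 0.8-mile step: 0.2-0.8 -> 0, 1.0-1.6 -> 1, ..., 3.4-4.0 -> 4."""
    tenths = round(miles * 10)  # work in tenths of a mile to avoid float error
    return (tenths - 2) // 8

def contingency_table(pairs):
    """Count the pupils falling in each (distance step, attendance step) cell.
    Each pair is (per cent of attendance, distance in miles)."""
    return Counter((distance_step(d), attendance_step(a)) for a, d in pairs)

# Invented pupils: three live 3.4 to 4.0 miles away and attend 0 to 10 per cent.
pairs = [(5, 3.6), (10, 4.0), (0, 3.4), (20, 3.0), (95, 0.4)]
table = contingency_table(pairs)
print(table[(4, 0)])  # 3
```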
There is no particular virtue in grouping the per cents in
step-intervals of 15, or the miles in step-intervals of 0.8.
The per cents could be grouped in step-intervals of 5, 10, 15
or any amount that is convenient. Likewise, the miles could
be grouped in step-intervals of 0.2, 0.4, 0.6, 0.8 or any
amount that is convenient. The size of the step-intervals
chosen for Table 38 gives 7 steps for attendance, and 5
steps for distance. As a rule it is better to have a step-interval
of such size as to produce not less than 10 nor more
than 20 steps in each of the two items. The steps are made
fewer in Table 38 so as to simplify the presentation of the
correlation procedure.
TABLE 38
SHOWS HOW TO COMPUTE A COEFFICIENT OF CORRELATION WHEN DATA OF TABLE 37
HAS BEEN TABULATED IN A CONTINGENCY TABLE (AFTER H. L. RIETZ)
[Contingency table: distance in miles in five steps (rows: 3.4 to 4.0, 2.6 to 3.2, 1.8 to 2.4, 1.0 to 1.6, 0.2 to 0.8) against per cent of attendance in seven steps (columns: 0 to 10, 15 to 25, 30 to 40, 45 to 55, 60 to 70, 75 to 85, 90 to 100), with marginal columns f, y, fy, fy², and xy, and marginal rows f, x, fx, and fx².]
N = 25. Sfx = 3. Sfy = −1. Sx² = 103. Sy² = 49. Sxy = 0 − 57 = −57.
$$c_x = \frac{Sfx}{N} = \frac{3}{25} = 0.12 \qquad c_y = \frac{Sfy}{N} = \frac{-1}{25} = -0.04$$
$$r = \frac{\dfrac{Sxy}{N} - c_x c_y}{\sqrt{\dfrac{Sx^2}{N} - c_x^2}\,\sqrt{\dfrac{Sy^2}{N} - c_y^2}} = \frac{\dfrac{-57}{25} - (0.12)(-0.04)}{\sqrt{\dfrac{103}{25} - (0.12)^2}\,\sqrt{\dfrac{49}{25} - (-0.04)^2}} = -.80$$
The steps in the process of computing a coefficient of
correlation from a contingency table follow. (1) Construct
contingency table. (2) The total frequencies in the first
column are 4. The total frequencies in the second column
are 2, and so on for the other columns. The grand total
of frequencies is 25. (3) The total frequencies for the first
row are 5, for the second row, 4, and so on. The grand
total of frequencies is 25, thus checking the preceding de-
termination. (4) The AM for attendance is 50, as shown
by the vertical double ruling. The AM for distance is 2.1,
as shown by the horizontal double ruling. Other AM’s
might have been taken, though AM’s near the center of each
frequency distribution are more convenient. (5) The step-
deviations from the AM for attendance are shown in the x
row. The step-deviations from the AM for distance appear
in the y column. (6) The product of each x multiplied by
its corresponding f appears in the fx row. The algebraic
total of the fx’s is shown at the end of the fx row. Sfx = 3.
(7) The product of each y multiplied by its corresponding f
appears in the fy column. The algebraic sum of the fy’s is
shown at the bottom of the fy column. Sfy = −1. (8)
The product of each x² multiplied by its corresponding f
appears in the fx² row. Sfx² = 103. (9) The product
of each y² multiplied by its corresponding f appears in the
fy² column. Sfy² = 49. (10) The f in the first square in the
first column and first row is 3. The x at the bottom of this
column is — 3. The y at the end of this row is 2. The
product of (3) X (—3) X (2) is — 18, which is written in
the upper right corner of this first square. The f in the
second square of the first column is 1. The x at the bottom
of this column is — 3, and y at the end of this row is 1. The
product of (1) X (—3) X (1) is — 3, which is written in
the upper right corner of the square in question. The f in
the third square of the third column is 3. The x is −1, and
the y is 0. The product of (3) X (−1) X (0) is written
in the upper right corner. The f in the last square of the
last row is 2. The x is 3 and the y is — 2. The product of
(2) X (3) X (—2) is written in the upper right corner of
this square. The other f’s times the xy products are com-
puted similarly. (11) The sum of the xy products in the
first row, i.e., the sum of −18, −4, and −2 is −24.
This sum is written in the xy column in the minus sub-
column. Were this sum positive instead of negative, it
would be written in the positive sub-column. In like man-
ner, the sum of the xy products for each row is computed
and written in the last column. Positive Sxy = 0. Negative
Sxy = 57. (12) The cx is computed; cx = 0.12. (13)
The cy is computed; cy = −0.04. These c’s are not multiplied
by the size of the step-interval as is done in Table 17,
because Sxy, Sx², and Sy² used in the correlation formula
are kept in terms of step-intervals also. (14) Sx² = 103.
Sy² = 49. Sxy = 0 − 57 = −57. (15) The values previously
computed are substituted in the correlation formula
shown at the bottom of the table. This formula is identical
with that used in Table 37, except that all values are in
terms of step-intervals. By solving the formula, r is found
to be −.80+. The r, when computed by the procedure
illustrated in Table 37, is −.81. This is a remarkably
close agreement, when we consider the drastic condensation
of the data produced by the large step-intervals used in the
contingency table.
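The final substitution can be checked with a few lines of code. The sums are those quoted in steps (2) to (14) above. The formula is assumed to have the assumed-mean form, in which Sxy/N is corrected by the product of cx and cy and each SD is corrected by the square of its c; the function name is hypothetical.

```python
import math

def r_from_step_sums(n, sfx, sfy, sfx2, sfy2, sxy):
    """Product-moment r from sums expressed in step-intervals."""
    cx, cy = sfx / n, sfy / n           # corrections for the assumed means
    numerator = sxy / n - cx * cy
    sd_x = math.sqrt(sfx2 / n - cx ** 2)
    sd_y = math.sqrt(sfy2 / n - cy ** 2)
    return numerator / (sd_x * sd_y)

# Sums read from the text: N = 25, Sfx = 3, Sfy = -1,
# Sfx2 = 103, Sfy2 = 49, Sxy = -57.
r = r_from_step_sums(n=25, sfx=3, sfy=-1, sfx2=103, sfy2=49, sxy=-57)
print(round(r, 2))  # -0.8
```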
By substituting age-grade scores for distance scores in
Table 37 or Table 38, and by recomputing, the r for at-
tendance with age-grade relation can be determined. In
similar manner, the r between attendance and each of the
six selected factors, or between any factor and any other
factor, can be computed. The first row of Table 39 shows
the coefficients of correlation between attendance and each
of the six factors as computed by Reavis for the age group
8 to 12 and all five counties combined. Reavis’s original
table presents the coefficients for the three separate groups
and the five separate counties. Additional rows show the
correlation between each factor and every other factor.
For our present purpose the first row of Table 39 is the
most significant. It tells us that those whose attendance
records are excellent tend to live near the school to the
extent of .45, tend to progress rapidly through the grades
to the extent of .50, tend to make high marks in school to
the extent of .33, tend to have good teachers to the extent
of .16, tend to have an excellent school plant to the extent
of .07, and tend to live in a highly-rated community to the
extent of .30. So far as these coefficients go, attendance
appears to be most closely associated with age-grade relationship
and distance.
TABLE 39
SHOWING THE COEFFICIENTS OF CORRELATION BETWEEN ATTENDANCE AND EACH
OF SIX HYPOTHETICAL CAUSES OF ATTENDANCE, TOGETHER WITH THE
CORRELATION BETWEEN EACH CAUSE AND EVERY OTHER CAUSE (ADAPTED
FROM REAVIS)
                          2         3        4         5        6         7
                       Distance    Age    Quality   Teacher   School   Community
                                  Grade   of Work             Plant
1. Attendance            −.45      .50      .33       .16      .07       .30
2. Distance                       −.20     −.13      −.10     −.06       .02
3. Age Grade                                .24       .01      .08       .08
4. Quality of Work                                    .00      .08       .03
5. Teacher                                                     .25       .35
6. School Plant                                                          .17
Among the inter-correlations of the various factors, the
most surprising coefficient is the zero relation between qual-
ity of work and the teacher. One would expect better
teachers to secure a higher quality of work on the part of
the pupils. Had quality of work been measured by stand-
ard tests, a positive coefficient would almost certainly have
been found. But the scores for quality of work were the
teacher’s marks. These marks are strictly relative, which
fact effectively covers up any difference in the efficiency
of different teachers.
If the size of any coefficient of correlation in Table 39
is so small as to cast a doubt upon its significance, there is a
formula which permits the computation of the reliability
of an r. It is
$$SD_r = \frac{1 - r^2}{\sqrt{N}}$$
where r is the coefficient of correlation whose reliability is
sought, and N is the number of pupils used in computing r.
The SDr is interpreted like SDM or SDD. If it is desired
to know the probability that the true r is not zero or below,
the EC may be computed by means of the following formula:
$$EC = \frac{r}{2.78\,SD_r}$$
Also this EC formula can be used to determine the prob-
ability that the true r does not lie below a defined r, or that
it does not lie above a defined r. How to use the EC
formula for either of these two special purposes has been
discussed in connection with its similar use for M or D.
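The two formulas may be tried on a small coefficient. In the sketch below, r = .07 is the coefficient for school plant given in Table 39, but N = 400 is an invented number of pupils, chosen only to make the arithmetic easy; the function names are hypothetical.

```python
import math

def sd_r(r, n):
    """Reliability (SD) of a coefficient of correlation: (1 - r^2) / sqrt(N)."""
    return (1 - r ** 2) / math.sqrt(n)

def ec_r(r, n):
    """Experimental coefficient for r: r / (2.78 * SDr)."""
    return r / (2.78 * sd_r(r, n))

# r from Table 39 (school plant); N is a made-up value for illustration.
r, n = 0.07, 400
print(round(sd_r(r, n), 4), round(ec_r(r, n), 2))
```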
f. Final Evaluation of Causes by Partial Correlation.—
The crude correlation coefficients in the first row of Table 39
may not tell the independent influence of each factor upon
attendance or vice versa. We could be certain that they
show such independent contribution only in case the inter-
correlation coefficients between the various factors were all
zero. Were they all zero we should know beyond doubt
that the correlation between a particular factor and attend-
ance has not been enhanced or diminished, as a result of its
correlation with some other of the factors listed. Addi-
tional evaluation has shown, for example, that the school
plant has no intrinsic connection with attendance. It has
a slight positive correlation of .07 as shown in Table 39
largely because it is correlated with the teacher who does
have some genuine connection with attendance. That is, all
the correlation between school plant and attendance is a
borrowed correlation. It is possible for a factor to borrow
in this way from all the other factors. The problem of
determining the independent correlation of each factor
with attendance becomes a problem of stripping from
each the correlation it has borrowed from all the other
factors. If the borrowing has been small, little will be
subtracted from the coefficients shown in the first row of
Table 39.
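The borrowing can be illustrated with a single first-order partial. The sketch removes only the teacher from the correlation between school plant and attendance, using the three coefficients for these factors in Table 39. It is a partial illustration under that assumption, not the full evaluation, which removes all five other factors.

```python
import math

def first_order_partial(r_ab, r_ac, r_bc):
    """Correlation of a with b after removing the influence of c."""
    return (r_ab - r_ac * r_bc) / (
        math.sqrt(1 - r_ac ** 2) * math.sqrt(1 - r_bc ** 2))

# Table 39: attendance with plant .07, attendance with teacher .16,
# teacher with plant .25.
r_plant = first_order_partial(r_ab=0.07, r_ac=0.16, r_bc=0.25)
print(round(r_plant, 3))  # 0.031
```

On these figures, taking out the teacher alone cuts the coefficient for school plant to less than half of its crude value.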
The crude correlation of a factor with attendance is com-
parable to the crude process previously described of dividing
all the pupils into a better-attendance and a poorer-attend-
ance group, and then averaging the distance each group
lives from school without making any attempt to equate
groups. We have seen how such a procedure tends to lump
the various factors together, depending upon the degree of
correlation between them. We have seen, further, that the
only way to avoid this confusion of different factors and to
determine the independent contribution of each to attend-
ance is to equate the two groups with respect to all the
factors except the one under investigation.
Due to the fact that it is difficult to select two groups
from the better-attendance and poorer-attendance groups
which are exactly equivalent in five different factors, Reavis
elected to employ an alternative process which yields com-
parable results. He used the method of correlation supple-
mented by partial correlation. The effect of partial cor-
relation coefficients is to show what the correlation would
be between, say, attendance and distance if all pupils were
of the same age in the same grade, were doing the same
quality of work, were under like teachers, were housed in
like school plants, and lived in like communities. The crude
coefficients in rows 2, 3, 4, 5, and 6 in Table 39 were computed
in order to make possible the computation of just such
partial correlation coefficients.
The operation of the partial correlation formula has for
its goal the following independent, isolated, or partial cor-
relation coefficients:
r12.34567
r13.24567
r14.23567
r15.23467
r16.23457
r17.23456
The figures 1, 2, 3, 4, 5, 6, and 7 refer respectively to attend-
ance, distance, age grade, quality of work, teacher, school
plant, and community, as shown in Table 39. The partial
correlation coefficient of r12.34567 means the correlation
between attendance (1) and distance (2) when freed (.)
from the influence of age grade (3), quality of work (4),
teacher (5), school plant (6), and community (7). The
coefficient, r13.24567, means the correlation between attend-
ance and age grade when freed from the influence of the
five other factors.
The computation of r12.34567 requires the investigator to
operate the partial correlation formula over and over again.
Each operation takes out the influence of just one factor.
The total process is shown below, in exactly the reverse
order in which computations are actually made. Reversing
the order makes the principle of the process easier to grasp.
The first series of formulae from the bottom removes the
influence of 7 from r12, r13, r14, r15, r16, r23, r24, r25,
r26, r34, r35, r36, r45, r46, and r56. The next series of
formulae removes, in addition, the influence of 6 from r12,
r13, r14, r15, r23, r24, r25, r34, r35, and r45. The next
series removes, in addition, the influence of 5 from r12, r13,
r14, r23, r24, and r34. The next series removes the influence
of 4 from r12, r13, and r23. The next series removes
the influence of 3 from r12. This leaves r12 purified from
the influence of 3, 4, 5, 6, and 7.
236 How to Experiment in Education
r12.4567 — (113.4567) (123.4567)
r12.34569 S=
345034 1 — (r13.4567)? VO oe (123.4567)?
where
ROE Lop ae east ea LD) EL SSN)
Vit — (114.567)? */1 — (124.567)?
eco 13.567 — (114.567) (134.567)
V1 — (414.567)? Wt — (134.567)?
Bt aa 123.567 — (124.567) (134.567)
; Vt — (124.567)? 1 — (134.567)?
where
Ni pt r12.67 — (r15.67) (125.67)
anita V1 — (115.67)? V1 — (125.67)?
aan 114.67 — (115.67) (145.67)
/t — (115.67)? V1 — (145.67)?
Ce aseetee 124.67 — (125.67) (145.67) __
V1 — (125.67)? 1 — (145.67)?
a ps 113.67 — (r15.67) (135.67)
V 1 (015.67)* VV 1 (135.67) 7
eta 34.67 — (135.67) (145.67)
V1 — (135.67)? V1 — (145.67)?
Fe Gye iat Tora AAES 8 A
Vt — (125.67)? 1 — (135.67)?
where
pate Lie /ara) (Et0;7) ede, eee
"At — (£16.7)? V1 — (126.7)?
one r15.7 — (r16.7) (156.7)
V1 — (116.7)? 1 — (156.7)?
Awa isons wach aee WYARIAUUHG)
VT (ray) Ay tee (Ope
relorees TEA AMEL 7) AcAOeg a
V1 — (116.7)? 1 — (146.7)?
Causal Investigations
Oe a ee Ne
ad V1 — (146.7)? V1 — (156.7)?
67 = ERAT (726.7) (46.7)
nay / 1 — (126.7)? V1 —\(r46.7)?
r13.7 — (r16.7) (136.7)
Ngee ed BAG seer SEO“ IRL OFT te
at MCE Or ye Te (136.7)?
135.7 — (136.7) (156.7)
ena maith as Mie) AUR fy
en V1 — (36.7)? Vr — (156.7)?
134.7 — (136.7) (146.7)
(frp A ars RI gs
a V1 — (136.7)? 1 — (146.7)?
123.7 — (126.7) (136.7)
(AT ET NSN
waa Vt — (126.7)? V1 — (136.7)?
where
pli eee )\ra7)
tS Vata va Ga
r16 — (17) (167)
6. — sO
r16.7 Wet (nt envi (TOA) 2
26 — (127) (167)
6. SEES ees RE ET SE EE aN
126.7 Vay vou (107)
aaah Tato 7)
OTS Va (a7)? VE — (57)?
r0.7— Wines Oram (2574107) oie
/1— (157)? “1 — (167)?
125.7 == — 2S (027) (857)
V1 — (127)? Vv 1 — (157)?
— __14— (117) (147)
SS Ar pan a ne OPE
r46 — (r47) (167)
Fate eet aca LE a 2S, tel ee dak a AC ae
A eas (147)? “1 — (167)?
TAs. 76 Ne Ses (147) (557) Nea
Vt — (t47)? Vt — (57)?
237
238 How to Experiment in Education
T24 = ALATA LAG
47 = 7 (a7)? Vi— Gan)?
i Uoiorersgye ee oar
peinwicn ear
N84 = ee
idk ulus ice ad eran 2 i
V/ 1 — (637)? V1 — (147)?
MEE reat 27) AEST)
£23-7 = A/t — (127)? V1 — (137)?
Beginning at the bottom of the foregoing series of formulae,
the coefficients of correlation from Table 39 should
be substituted in the first computation series of formulae.
As soon as these first partials have been computed, data
will be available for substitution in the second computation
series. The computation climb may thus be continued until
r12.34567 has been determined.
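The computation climb lends itself to a recursive sketch, since every formula in the series has the same form. The dictionary below holds the crude coefficients as read from Table 39; the function name and the data layout are hypothetical. Each call removes one factor by applying the partial correlation formula to three coefficients from which the remaining factors have already been removed.

```python
import math

# Crude coefficients as read from Table 39, keyed by the pair of factor numbers.
CRUDE = {
    (1, 2): -0.45, (1, 3): 0.50, (1, 4): 0.33, (1, 5): 0.16, (1, 6): 0.07, (1, 7): 0.30,
    (2, 3): -0.20, (2, 4): -0.13, (2, 5): -0.10, (2, 6): -0.06, (2, 7): 0.02,
    (3, 4): 0.24, (3, 5): 0.01, (3, 6): 0.08, (3, 7): 0.08,
    (4, 5): 0.00, (4, 6): 0.08, (4, 7): 0.03,
    (5, 6): 0.25, (5, 7): 0.35,
    (6, 7): 0.17,
}

def partial_r(i, j, removed, crude=CRUDE):
    """r between factors i and j, freed from every factor in `removed`."""
    if not removed:
        return crude[(min(i, j), max(i, j))]
    k, rest = removed[0], removed[1:]
    r_ij = partial_r(i, j, rest, crude)
    r_ik = partial_r(i, k, rest, crude)
    r_jk = partial_r(j, k, rest, crude)
    return (r_ij - r_ik * r_jk) / (
        math.sqrt(1 - r_ik ** 2) * math.sqrt(1 - r_jk ** 2))

# Attendance (1) with distance (2), freed from factors 3, 4, 5, 6, and 7.
r12_34567 = partial_r(1, 2, (3, 4, 5, 6, 7))
print(round(r12_34567, 2))
```

The same function serves for the other five series by changing the arguments, for example `partial_r(1, 3, (2, 4, 5, 6, 7))`. Because the inputs are coefficients rounded to two places, the results need not agree exactly with figures computed from unrounded data.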
Once the process has been completed and the size of
r12.34567 has been determined, the investigator will have
to construct a similar series of formulae and compute
r13.24567. Since the principle for the construction of each
of the six needed series is identical with that for the first
series, the other five series need not be given here. Fur-
thermore, an investigator who is concerned with a larger
or smaller number of factors than six should have no diffi-
culty in extending this series to provide for a larger number
of factors, or of omitting the upper superfluous portion of
this series in case of a smaller number of factors.
By operating these formulae in six such series, Reavis
isolated each of the six factors and determined its inde-
pendent contribution to attendance. That is, he determined
the significance of the distance pupils live from school,
regardless of the grades they are in, the quality of the work
they do, the kind of teachers they have, the character of
the school plants, or the type of community in which they
live. Similarly, he determined the independent correlation
of each factor regardless, not of all conceivable factors, nor
even of all factors studied, but of the five other factors
which appeared to be most significant and hence most need-
ful to be partialled out.
TABLE 40
ORIGINAL AND PARTIAL COEFFICIENTS OF CORRELATION BETWEEN ATTENDANCE
AND SIX HYPOTHETICAL CAUSES (ADAPTED FROM REAVIS)
Causes                 Distance    Age    Quality   Teacher   School   Community
                                  Grade   of Work             Plant
Attendance
  Original               −.45      .50      .33       .16      .07       .30
  Partial                −.43      .44      .25       .08     −.01       .28
The final partial coefficients, as computed by Reavis, are
given in Table 40. For purposes of comparison the partials
are preceded by the original crude coefficients. Distance
and community suffered the least reduction. The teacher
appears to have little to do with attendance, and the school
plant has nothing to do with it. The outstanding deter-
miners of attendance are distance and age-grade relation.
The quality of school work and type of community come
next and are about equal in their influence. But the
reader should remember that the purpose of this chapter is
to describe a process rather than to present results. Final
conclusion as to the significance of these factors should take
into consideration Reavis’s results for the two other age sub-
groups. To do so would alter somewhat the conclusions
just stated.
As has been stated already, correlation does not imply
causation. But partial correlation does imply causation in
so far as all significant factors are partialled out. But par-
tial correlation does not show which is cause and which
effect. This must be decided from non-statistical consid-
erations. Such considerations lead to the conclusions that
distance, age-grade relation, teacher, and community are
clearly causes rather than effects of attendance. Each of
these factors was determined at the beginning of the year
in which the attendance records were secured. On the
other hand it seems much more probable that quality of
work partly influences attendance and is partly influenced by
attendance, i.e., it is both cause and effect.
g. Regression Equation.—No further step is required to
satisfy the purpose of a causal investigation. But the com-
putation of partial correlation coefficients makes possible an
additional step, familiarity with which is important not only
for the causal investigator but also for those who construct
tests. This next step is the derivation of a regression equa-
tion or prophecy equation.
The simplest form of prophecy is where a pupil’s score
in one trait is prophesied from a knowledge of his score
in one other trait. Since this sort of situation demands
only ordinary correlation and the simplest form of regres-
sion equation, it makes a good starting point for the explana-
tion of a situation which demands partial correlation and a
complicated regression equation.
Suppose that the problem is to secure the best prophecy
as to a pupil’s attendance based on knowledge of his dis-
tance from school. Assume the correlation between attend-
ance and distance to be as shown in Table 37. The regres-
sion equation for this purpose is:
$$x = r\,\frac{SD_x}{SD_y}\,y$$
As shown at the bottom of Table 37, $r = -.81$, and
$$SD_x = \sqrt{\frac{\sum x^2}{N}}, \qquad SD_y = \sqrt{\frac{\sum y^2}{N}},$$
where the sums and N are those of Table 37.
Assume that the pupil’s distance score is known to be 1.5.
Then y is the difference between 1.5 and the M of 2.0;
y = -0.5. This pupil’s most probable position in attend-
ance may be found by substituting the preceding values in
the above formula, thus:
$$x = (-.81)\,\frac{SD_x}{SD_y}\,(-0.5) = +10.8$$
Since M for attendance is 52.2, the pupil’s most probable
score in attendance is then 52.2 + 10.8, i.e., 63.0. In like
manner any y can be transmuted into a most probable x.
In case x is known and the problem is to prophesy y, the
regression equation becomes:
$$y = r\,\frac{SD_y}{SD_x}\,x$$
By means of the first of these two regression equations, it
is possible for an experimenter to build up a table for trans-
muting x values into y values, so that subsequent workers
will need to determine only the value of x for each pupil.
By using the second equation, he can construct a table for
transmuting y values into x values. It should be pointed
out here that one table will not suffice both for trans-
muting x values into y values and y values into x values.
Two tables are required.
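The need for two tables can be checked numerically. The following is a minimal Python sketch only: the attendance and distance figures are hypothetical (they are not the data of Table 37), and the function names are introduced here for illustration. It prophesies x from y, then carries the result back through the second equation, and the original y is not recovered.

```python
from statistics import mean, pstdev


def pearson_r(xs, ys):
    """Product-moment correlation computed from deviations about the means."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5


def prophesy_x_from_y(xs, ys, y_score):
    """Most probable x score for a known y score: x = r (SDx / SDy) y."""
    r = pearson_r(xs, ys)
    x_dev = r * pstdev(xs) / pstdev(ys) * (y_score - mean(ys))
    return mean(xs) + x_dev


def prophesy_y_from_x(xs, ys, x_score):
    """Most probable y score for a known x score: y = r (SDy / SDx) x."""
    r = pearson_r(xs, ys)
    y_dev = r * pstdev(ys) / pstdev(xs) * (x_score - mean(xs))
    return mean(ys) + y_dev


# Hypothetical scores for six pupils (assumed for illustration only).
attendance = [70, 62, 55, 50, 41, 35]        # x
distance = [0.5, 1.0, 2.0, 2.0, 3.0, 3.5]    # y

x_hat = prophesy_x_from_y(attendance, distance, 1.5)
y_back = prophesy_y_from_x(attendance, distance, x_hat)
print(round(x_hat, 2), round(y_back, 3))
```

Unless r is exactly +1 or -1, the second equation returns a y whose deviation is r squared times the original deviation, which is why a single table cannot serve both directions.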
When the problem is to prophesy a pupil’s position in x,
say, attendance, from knowledge of his scores in y, z,
etc., say, distance, age-grade relation, quality of work, etc.,
partial correlation is required. The regression equation
combines the pupil’s scores on the various factors, weighting
each score according to the partial correlation of that
factor with the criterion, namely, attendance. If the prob-
lem is to prophesy a pupil’s intelligence from several tests
of this trait, the regression equation combines a pupil’s
scores on the several tests, weighting each test according
to its partial correlation with some criterion of intelligence,
whether the criterion be some standard intelligence test, or
teacher’s judgment, or age-grade relation, or something else,
or a combination of these to constitute a criterion. Thus,
the regression equation will combine any number of ele-
ments and weight them so as to yield composite scores
which will correspond as closely as possible, considering the
elements used, with some criterion.
All that is needed to make such an equation possible is
the partial correlation of each element with the criterion
and certain measures of variability, as shown in the follow-
ing formula. This formula is the regression equation for
attendance, i.e., it combines and weights the scores on the
various factors so as to yield the most accurate possible score
in attendance from a combination of these six factors,
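$$
\begin{aligned}
x_1 ={} & r_{12.34567}\,\frac{SD_{1.234567}}{SD_{2.134567}}\,x_2
        + r_{13.24567}\,\frac{SD_{1.234567}}{SD_{3.124567}}\,x_3 \\
      & + r_{14.23567}\,\frac{SD_{1.234567}}{SD_{4.123567}}\,x_4
        + r_{15.23467}\,\frac{SD_{1.234567}}{SD_{5.123467}}\,x_5 \\
      & + r_{16.23457}\,\frac{SD_{1.234567}}{SD_{6.123457}}\,x_6
        + r_{17.23456}\,\frac{SD_{1.234567}}{SD_{7.123456}}\,x_7
\end{aligned}
$$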
Where x1 is the deviation of the pupil’s score from the mean
of the attendance records, and is determined by the solution
of the formula,
x2 is the deviation of the pupil’s score from the mean of the
scores in distance,
x3 is the deviation of the pupil’s score from the mean of
the age-grade relation, and so on for x4, x5, x6, and x7,
where x2, x3, x4, x5, x6, and x7 are known, and where
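$$
\begin{aligned}
SD_{1.234567} &= SD_1 \sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13.2}^2}\,\sqrt{1 - r_{14.23}^2}\,\sqrt{1 - r_{15.234}^2}\,\sqrt{1 - r_{16.2345}^2}\,\sqrt{1 - r_{17.23456}^2} \\
SD_{2.134567} &= SD_2 \sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23.1}^2}\,\sqrt{1 - r_{24.13}^2}\,\sqrt{1 - r_{25.134}^2}\,\sqrt{1 - r_{26.1345}^2}\,\sqrt{1 - r_{27.13456}^2} \\
SD_{3.124567} &= SD_3 \sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23.1}^2}\,\sqrt{1 - r_{34.12}^2}\,\sqrt{1 - r_{35.124}^2}\,\sqrt{1 - r_{36.1245}^2}\,\sqrt{1 - r_{37.12456}^2} \\
SD_{4.123567} &= SD_4 \sqrt{1 - r_{14}^2}\,\sqrt{1 - r_{24.1}^2}\,\sqrt{1 - r_{34.12}^2}\,\sqrt{1 - r_{45.123}^2}\,\sqrt{1 - r_{46.1235}^2}\,\sqrt{1 - r_{47.12356}^2} \\
SD_{5.123467} &= SD_5 \sqrt{1 - r_{15}^2}\,\sqrt{1 - r_{25.1}^2}\,\sqrt{1 - r_{35.12}^2}\,\sqrt{1 - r_{45.123}^2}\,\sqrt{1 - r_{56.1234}^2}\,\sqrt{1 - r_{57.12346}^2} \\
SD_{6.123457} &= SD_6 \sqrt{1 - r_{16}^2}\,\sqrt{1 - r_{26.1}^2}\,\sqrt{1 - r_{36.12}^2}\,\sqrt{1 - r_{46.123}^2}\,\sqrt{1 - r_{56.1234}^2}\,\sqrt{1 - r_{67.12345}^2} \\
SD_{7.123456} &= SD_7 \sqrt{1 - r_{17}^2}\,\sqrt{1 - r_{27.1}^2}\,\sqrt{1 - r_{37.12}^2}\,\sqrt{1 - r_{47.123}^2}\,\sqrt{1 - r_{57.1234}^2}\,\sqrt{1 - r_{67.12345}^2}
\end{aligned}
$$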
To illustrate the evolution and use of a regression equa-
tion in a simple situation, assume that the problem is to
prophesy a pupil’s position in 1 from a knowledge of his
position in 2 and 3. Stated in another way, assume that
the problem is to combine the scores on 2 and 3 so that the
resulting score will be the best possible in 1 which 2 and 3
can yield. Assume that
1 = Intelligence as measured by the Stanford or Herring
Revision of the Binet-Simon Intelligence Scale,
2 == Comprehension score on the Thorndike-McCall Read-
ing Scale, and
3 == Minutes spent on the Thorndike-McCall Reading
Scale divided by the comprehension score.
Assume further that
r12 = .80      SD1 = 4.42     M1 = 120
r13 = -.40     SD2 = 1.1      M2 = 50
r23 = -.56     SD3 = .85      M3 = 15
Then the regression equation is
$$x_1 = r_{12.3}\,\frac{SD_{1.23}}{SD_{2.13}}\,x_2 + r_{13.2}\,\frac{SD_{1.23}}{SD_{3.12}}\,x_3$$
Utilizing the assumed data to compute the required values
in the regression equation, we have
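$$
\begin{aligned}
r_{12.3} &= \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}
          = \frac{.80 - (-.40)(-.56)}{\sqrt{1 - (-.40)^2}\,\sqrt{1 - (-.56)^2}} = .76 \\
r_{13.2} &= \frac{r_{13} - r_{12}\,r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}}
          = \frac{-.40 - (.80)(-.56)}{\sqrt{1 - (.80)^2}\,\sqrt{1 - (-.56)^2}} = .10 \\
r_{23.1} &= \frac{r_{23} - r_{12}\,r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}}
          = \frac{-.56 - (.80)(-.40)}{\sqrt{1 - (.80)^2}\,\sqrt{1 - (-.40)^2}} = -.44 \\
SD_{1.23} &= SD_1 \sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13.2}^2}
           = 4.42\,\sqrt{1 - (.80)^2}\,\sqrt{1 - (.10)^2} = 2.63 \\
SD_{2.13} &= SD_2 \sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23.1}^2}
           = 1.1\,\sqrt{1 - (.80)^2}\,\sqrt{1 - (-.44)^2} = .59 \\
SD_{3.12} &= SD_3 \sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23.1}^2}
           = .85\,\sqrt{1 - (-.40)^2}\,\sqrt{1 - (-.44)^2} = .70
\end{aligned}
$$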
Substituting the computed values in the regression equa-
tion, we have
$$x_1 = (.76)\,\frac{2.63}{.59}\,x_2 + (.10)\,\frac{2.63}{.70}\,x_3 = 3.39\,x_2 + .38\,x_3$$
Now if a pupil’s score in 2 is 53, x2 = 53 - 50 = 3,
since M2 is 50. If his score in 3 is 14, x3 = 14 - 15 =
-1, since M3 is 15. Substituting x2 and x3 in the preced-
ing equation
x1 = 3.39(3) + .38(-1) = 9.79
The 9.79 shows that the pupil’s deviation from M1 is a
plus 9.79. Since M1 is 120, the pupil’s score in 1 becomes
120 + 9.79, i.e., 129.79.
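The arithmetic of this illustration can be retraced with a short Python sketch. It is offered only as a check, under stated assumptions: the function names are hypothetical, the three correlations are the assumed values given above, and the regression weights 3.39 and .38 are taken as rounded in the text rather than recomputed to more decimals.

```python
from math import sqrt


def partial_r(r_ab, r_ac, r_bc):
    """First-order partial correlation of a with b, with c held constant."""
    return (r_ab - r_ac * r_bc) / (sqrt(1 - r_ac ** 2) * sqrt(1 - r_bc ** 2))


# Assumed correlations from the illustration.
r12, r13, r23 = 0.80, -0.40, -0.56

r12_3 = partial_r(r12, r13, r23)   # 1 with 2, 3 held constant
r13_2 = partial_r(r13, r12, r23)   # 1 with 3, 2 held constant
r23_1 = partial_r(r23, r12, r13)   # 2 with 3, 1 held constant


def prophesy_score_1(score_2, score_3, b2=3.39, b3=0.38, m1=120, m2=50, m3=15):
    """Apply x1 = b2*x2 + b3*x3 to deviations, then add back the mean of 1.

    b2 and b3 are the rounded weights stated in the text (an assumption of
    this sketch); m1, m2, m3 are the assumed means.
    """
    x2 = score_2 - m2
    x3 = score_3 - m3
    return m1 + b2 * x2 + b3 * x3


print(round(r12_3, 2), round(r13_2, 2), round(r23_1, 2))
print(round(prophesy_score_1(53, 14), 2))
```

Carrying more decimals through every step would shift the final figure slightly, since the text rounds each intermediate value to two places.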
CHAPTER X
ANALYSES OF EXPERIMENTAL AND CAUSAL
INVESTIGATIONS
The principles and procedures formulated in the preced-
ing chapters had to be confined necessarily to the more
common types of experiments and investigations. Further-
more, the progress of discussion permitted only a limited
use of concrete illustrations. The purpose of this closing
chapter is twofold, (a) to show the applicability of these
principles and procedures to many specific experimental
problems and problems for causal investigation, and (b) to
suggest a method of attack upon relatively uncommon
varieties of problems. The problems used are taken more
or less at random from a large number submitted from time
to time by graduate students.
No special effort has been made to make these analyses
complete. Space would not permit, nor has an effort been
made to make them model analyses. This would require
not only a long period of concentrated thinking about each
problem but also an actual trial of each experiment to check
the thinking done. All that is attempted is to draw up for
each problem a rough plan for its solution, in order to point
out to the reader the general line of attack.
PROBLEM 1. Do Rural Children Learn More Rapidly in
Consolidated Schools or in One-room Schools?
EF1 is a consolidated school. EF2 is a one-room school.
S is a group or groups of rural pupils.
This problem may be solved as an equivalent-groups ex-
periment very simply but with some delay, or it may be
solved without delay by an equivalent-groups causal inves-
tigation. Since an experiment always gives the experimenter
more complete control of the situation than does a causal
investigation, let us assume that this advantage outweighs
the disadvantage of a year’s delay, and that the problem
is to be solved by an equivalent-groups experiment.
The chief problem is to secure genuine equivalence of
groups. Pupils should be paired on two bases, at least,
namely, mental age and chronological age.
Having selected two equivalent groups, or else having
delayed selection until the conclusion of the experiment, a
series of IT’s or standard tests of school abilities should be
applied. At the close of the year these tests or duplicates
of them should be applied as FT’s.
The data from these tests can be fitted into one of the
computation molds provided in a preceding chapter. For
purposes of computation, all the pupils can be treated to-
gether as two equivalent groups or else the two main groups
may be broken up into age sub-groups or grade sub-groups,
or they may be treated both ways.
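The equivalent-groups computation just described may be sketched in Python as follows. This is an illustration under assumptions, not the computation model of the earlier chapter: the scores are hypothetical, the function name is introduced here, and the reliability formulas used (SDM = SD divided by the square root of N, and SDD = the square root of the sum of the squared SDM's) are the usual forms and may differ in detail from the model referred to.

```python
from math import sqrt
from statistics import mean, pstdev


def mean_change_difference(it1, ft1, it2, ft2):
    """Return (M of C1 minus M of C2, SD of that difference).

    C is each pupil's change, FT minus IT. Each group's SD of the mean is
    SD / sqrt(N); the SD of the difference is sqrt(SDM1^2 + SDM2^2).
    """
    c1 = [f - i for i, f in zip(it1, ft1)]
    c2 = [f - i for i, f in zip(it2, ft2)]
    sdm1 = pstdev(c1) / sqrt(len(c1))
    sdm2 = pstdev(c2) / sqrt(len(c2))
    return mean(c1) - mean(c2), sqrt(sdm1 ** 2 + sdm2 ** 2)


# Hypothetical initial and final test scores for two groups of four pupils.
consolidated_it = [10, 12, 14, 16]
consolidated_ft = [20, 24, 22, 26]
one_room_it = [10, 12, 14, 16]
one_room_ft = [16, 18, 22, 20]

difference, sd_of_difference = mean_change_difference(
    consolidated_it, consolidated_ft, one_room_it, one_room_ft
)
print(difference, sd_of_difference)
```

Where pupils are strictly paired, the SD of the pair-by-pair differences is a natural alternative; the sketch treats the groups as wholes for simplicity.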
PROBLEM 2. Effect of Exemption from Class Drill in
Penmanship when Pupils Attain Quality 12 on the Thorn-
dike Handwriting Scale Compared with the Effect of Con-
tinuance in Class Drill.
EF1 is exemption from class drill in penmanship of those
pupils who attain quality 12 on the Thorndike Handwriting
Scale. EF2 is the continuance in class drill, or the absence
of such exemption.
The experimental group (S) is not indicated, though the
effectiveness of EF1 is likely to vary with the distance the
ability of S is from quality 12. The implication of the
student’s formulation is that S has an ability below quality
12. The conclusion from the experiment should be stated
in terms of whatever S is employed.
Since the purpose of this experiment is merely to deter-
mine the amount of superiority of one EF over the other
no control EF is required and only the less stringent criteria
for selecting the experimental method need be considered.
The one-group method is not entirely satisfactory, because:
(a) Even apart from any difference in the effectiveness of
EF’s, the amount of change under one EF will not be iden-
tical with the amount of change under the other EF. Even
under identical conditions the rate of progress in penman-
ship as measured by available tests usually shows a slowing
up as progress proceeds. To date, no progress scales have
been constructed which demonstrably discount this retarda-
tion. (b) There is some danger that there will be a signifi-
cant carry-over from one EF to the other, particularly if
the exemption-from-drill EF precedes the continuance-in-
drill EF. (c) The one-group method is more than unsatis-
factory; it is completely impossible if the change in S is
determined by measuring the amount of time required to
attain quality 12. Just as soon as one EF had brought S to
quality 12 there would be no opportunity to determine the
effect of the other EF because S would already be at quality
12. All this means the equivalent-groups method is the best
one for this problem.
The change (C) produced by each EF can be measured
by the per cent of pupils in each group who attain quality
12, as measured by the Thorndike Handwriting Scale, dur-
ing the period of the experiment. The experiment can be
stopped when, say, 50% or 85% of the leading group has
attained quality 12. This per cent can be compared with
the per cent of the other group who have attained quality 12.
This method of measurement is objectionable because it
does not yield a score for each pupil. It yields a score for
the group as a whole. This does not permit the computa-
tion of SD, SDM, and SDD, and hence does not permit any
statement of the reliability of the conclusion.
The C can be measured by the total number of points
of growth on the scale during the period of the experiment.
There is a fatal objection to this plan. The EF1 pupils
are excused from handwriting instruction when they attain
quality 12, and are thereby and thereafter encouraged to
spend the handwriting drill period in more congenial ways.
But no EF2 pupil who attains quality 12 is so excused.
Measuring C by points of growth definitely discriminates
against EF1.
The C can be measured by the length of time required
by each pupil to attain quality 12. A serious objection to
this plan is that it requires the experiment to continue until
every pupil of both groups, even the slowest, has attained
quality 12. Certain pupils in the group may never attain
this level. Except for this practical objection the method
is quite satisfactory. If all pupils are within an easy dis-
tance of ability 12, this objection disappears.
Again, the C can be measured by determining the amount
of growth per unit of time. Suppose the first EF1 pupil
to attain quality 12 does so in one month from the begin-
ning of the experiment. To avoid disappointing pupils the
experiment will have to continue, but for purposes of com-
putation the experiment can stop at that point. The points
of growth made by each and all pupils in each group in one
month show the relative effectiveness of each EF. The
IT1 here may be assumed to be approximately zero for each
pupil. The FT1 is the points of growth in a month. The C
is then identical with FT1. Further computations follow
the computation models already given.
It is advisable for the experimenter to check the measur-
ing method just recommended by a related method. He
can permit the experiment to continue until most or perhaps
all of the EF1 pupils have reached quality 12. The instant
that an EF1 pupil reaches quality 12, the experimenter
should determine and record the attainment of the EF2
pupil who is paired with the EF1 pupil. By dividing the
points of growth from the initial starting point up to 12
by the number of days required to attain 12, the growth
per day can be determined for each EF1 pupil who attains
quality 12 during the period of the experiment. By divid-
ing the points of growth of each EF2 pupil, up to the time
his EF1 pair reached quality 12, by the number of days
required by his EF1 pair to attain quality 12, measures
comparable with the foregoing EF1 measures can be secured
for the EF2 pupils who pair with EF1 pupils attaining
quality 12. Quite satisfactory and comparable measures
can be secured for each EF1 pupil who fails to attain quality
12 and for his EF2 pair by dividing the points of growth
made by each during the whole time of the experiment by
the number of days in the experiment.
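The growth-per-day measures described above reduce to a single division applied under two conditions. The Python sketch below is illustrative only; the function names and all scores are hypothetical.

```python
def growth_per_day(initial, last, days):
    """Points of growth on the handwriting scale per day of the experiment."""
    if days <= 0:
        raise ValueError("days must be positive")
    return (last - initial) / days


def pair_rates(ef1_initial, ef2_initial, ef2_at_stop, days_to_goal, goal=12.0):
    """Rates for a pair whose EF1 pupil reached the goal.

    The EF2 pupil's attainment is recorded at the instant his EF1 pair
    reaches the goal, and both growths are divided by the same number of days.
    """
    ef1_rate = growth_per_day(ef1_initial, goal, days_to_goal)
    ef2_rate = growth_per_day(ef2_initial, ef2_at_stop, days_to_goal)
    return ef1_rate, ef2_rate


# Hypothetical pair: both begin at quality 9; the EF1 pupil reaches 12 in
# 30 days, at which instant the EF2 pupil stands at quality 11.
ef1_rate, ef2_rate = pair_rates(9.0, 9.0, 11.0, 30)

# Hypothetical pair whose EF1 pupil never reaches 12 in a 60-day experiment:
# both growths are divided by the whole length of the experiment.
slow_ef1_rate = growth_per_day(8.0, 11.0, 60)
slow_ef2_rate = growth_per_day(8.0, 10.0, 60)

print(ef1_rate, ef2_rate, slow_ef1_rate, slow_ef2_rate)
```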
This method of measuring C is suggested as a check upon
the preceding one, because there is some possibility that as
EF1 pupils approach their goal they are stimulated to added
zeal. To stop the experiment as soon as the first EF1 pupil
attains the goal means that only a few pupils have come
within the sway of this possible facilitating effect. This
last method gives all the pupils a chance to feel its effect,
in case such an effect exists. And in order to make results
entirely comparable, an EF2 pupil is stopped, for
computation purposes at least, at
the same instant that his EF1 pair stops. For purposes of
fitting these data in the computation model, assume IT1 to
be zero, and FT1 to be the above scores.
The careful experimenter will not be satisfied to measure
quality of handwriting only. As a minimum he will deter-
mine, in similar manner, the effect of each EF upon speed
of handwriting.
PROBLEM 3. What Is the Effect of the Spirit of a Class
on Its Achievement?
EF1 is a spirit of enjoyment, hopefulness, coöperation
and the like in a class. EF2 is the opposite sort of spirit.
There could be other EF’s representing varying degrees or
varieties of spirit.
The one-group or rotation method may be employed pro-
vided the period for each EF does not last more than a few
days. A longer period might fix certain attitudes which
will transfer to the succeeding EF. Even when the period
is brief some transfer is doubtless unavoidable. If the
teacher or other agent generates a pleasant spirit, this will
tend to aid the succeeding EF. If the unpleasant spirit
precedes, it will tend to subtract from the succeeding EF.
Probably the best method of all is the equivalent-groups
method, where S1 and S2 are two equivalent classes. This
method does not require a brief application of each EF.
Both IT’s and FT’s for both groups are needed. These
achievement tests will need to cover the abilities being
developed while the EF’s are operating. The differences
between the M’s of the two C’s in each achievement test
give the conclusions from the experiment.
PROBLEM 4. Are Nature and Object Drawing and Paint-
ing Fundamental to Improve Taste in Selection of Environ-
ment, or Are the Principles of Design and Color the Basis
for This Response?
EF1 is nature and object drawing and painting. EF2 is
principles of design and color.
The one-group and rotation methods are inappropriate be-
cause of probable carry-over, so the equivalent-groups
method must be employed.
The S is a group of pupils improvable in their taste in
selection of environment, and not yet trained in either EF1
or EF2.
Both S1 and S2 should be given an IT to determine
initial taste in selection of environment. S1 should have
EF1 applied. S2 should have EF2 applied. Both should
then be given an FT. The difference between the M’s of
the two C’s will show which EF contributes more toward a
development of taste in selection of the environment.
PROBLEM 5. Which Is Better for Pupil Growth, a Tem-
perature of 68 degrees and a Humidity of 50 per cent, or a
Temperature of 86 degrees and a Humidity of 80 per cent?
EF1 is a temperature of 68 degrees and a humidity of
50 per cent. EF2 is a temperature of 86 degrees and a
humidity of 80 per cent.
Either the rotation or equivalent-groups method may be
employed, though the rotation method is preferable perhaps.
S1 can be subjected to EF1 and then to EF2. S2 can be
subjected to EF2 first, and then to EF1. The length of
time each EF is applied should be the same for all four
periods, and will depend upon the nature of the tests used.
If the tests are of traits growth in which is very rapid,
each EF may be applied for a brief time.
Several test types covering the work of the pupils will be
needed. Both IT and FT should be given. These may be
tests of general reading ability, arithmetical ability, spelling
ability, and the like. In this case, the experiment will need
to continue for a considerable period. Or the tests may
be based upon the specific lessons being taught. In this
case, growth will be rapid, and the experiment, if desired,
may be brief.
The computation will follow the regular rotation com-
putation model for two EF’s and several test types.
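The logic of the rotation comparison for one test type can be sketched in Python. The sketch is an assumption-laden illustration rather than the rotation computation model itself: the changes are hypothetical, the function name is introduced here, and the model referred to may combine the periods differently. Each EF is credited with the mean change it produced in S1 plus the mean change it produced in S2, so that order of application is balanced between the EF's.

```python
from statistics import mean


def rotation_difference(s1_under_ef1, s1_under_ef2, s2_under_ef2, s2_under_ef1):
    """Return the summed mean change under EF1 minus that under EF2.

    Arguments are lists of per-pupil changes (FT minus IT) for each group
    in each period: S1 has EF1 first and EF2 second; S2 has the reverse.
    """
    ef1_total = mean(s1_under_ef1) + mean(s2_under_ef1)
    ef2_total = mean(s1_under_ef2) + mean(s2_under_ef2)
    return ef1_total - ef2_total


# Hypothetical changes in one test type for two pupils per group.
s1_cool, s1_warm = [5, 7], [3, 3]    # S1: EF1 (68 degrees) first, then EF2
s2_warm, s2_cool = [2, 4], [6, 6]    # S2: EF2 (86 degrees) first, then EF1

advantage_of_ef1 = rotation_difference(s1_cool, s1_warm, s2_warm, s2_cool)
print(advantage_of_ef1)
```

With several test types the same comparison would be made for each test in turn.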
PROBLEM 6. To Determine the Effect on the Mastery of
English of Teaching Technical Grammar from the Fourth
to the Eighth Grade.
EF1 is the teaching of technical grammar from the fourth
to the eighth grade. EF2 is the absence of such technical
grammar and presumably the presence of other forms of
ordinary English instruction instead.
The equivalent-groups method is required. The formula-
tion of the problem does not make it clear whether there
are to be five sub-groups—fourth, fifth, sixth, seventh, and
eighth grades—with equivalent sub-groups, or whether there
are to be two equivalent fourth grades each of which is to
have its EF applied for five years in succession.
In either case IT’s and FT’s of English ability are re-
quired. A computation model has been provided for either
form of experiment.
PROBLEM 7. To Determine the Relation of Physical Effi-
ciency to School Progress.
EF1 is physical efficiency of a defined amount. EF2 is
physical inefficiency of a defined amount. A variety of
EF’s representing different degrees of physical efficiency
might be employed.
The equivalent-groups method is appropriate to this prob-
lem. Both groups may start below par physically, or at any
stage short of a physical condition which is at the limit of
possible improvement. S1 will have its physical efficiency
improved by careful attention to diet, etc. S2 will continue
on the same physical level.
Both IT’s and FT’s are needed, covering abilities growth
in which constitutes school progress. The difference be-
tween the M’s of C1 and C2 shows the effect of improved
physical efficiency.
This problem may be interpreted to mean: Does physical
efficiency facilitate school progress? Or it may be inter-
preted to mean: Are physical efficiency and school progress
associated or correlated? If the latter is the problem, the
one-group method is the only satisfactory experimental plan.
EF1 is the physical efficiency of the pupil in the best physi-
cal condition, EF2, EF3, EF4, etc., are the physical condi-
tions of the pupils who are second, third, fourth, and so on,
respectively, in physical condition. Each pupil should be
measured in both physical efficiency and past school prog-
ress. The correlation between these two series of measures
is the answer to the problem, for this correlation shows the
relationship between various physical conditions and corre-
sponding amounts of school progress. Interpretation is
facilitated if only those pupils are used whose present physi-
cal condition has been about the same throughout the school
career of the pupils.
One difficulty with the foregoing is that positive correla-
tion may not indicate a genuine relationship between physi-
cal efficiency and school progress. It may be that those
selected as more fit are also more intelligent, and that it is
intelligence rather than physical fitness which is responsible
for the correlation. This possibility may be investigated
by equating the fit and the unfit with respect to intelligence,
by using only those pupils of like intelligence, or by partial
correlation.
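How partial correlation can alter the picture is shown by the following Python sketch. The data are hypothetical and deliberately constructed so that both physical fitness and school progress follow intelligence; the function names are introduced here for illustration.

```python
from math import sqrt


def pearson_r(a, b):
    """Product-moment correlation between two series of measures."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)


def partial_r(a, b, c):
    """Correlation of a with b when c is partialled out."""
    r_ab, r_ac, r_bc = pearson_r(a, b), pearson_r(a, c), pearson_r(b, c)
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


# Hypothetical measures for six pupils (assumed for illustration only).
intelligence = [90, 95, 100, 105, 110, 115]
fitness = [4, 6, 5, 8, 7, 9]
progress = [11, 10, 13, 12, 15, 14]

raw = pearson_r(fitness, progress)
with_intelligence_constant = partial_r(fitness, progress, intelligence)
print(round(raw, 2), round(with_intelligence_constant, 2))
```

In these constructed data the positive raw correlation between fitness and progress does not survive when intelligence is held constant, which is the possibility the text warns of.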
PROBLEM 8. What Effect Has Previous Training in Type-
writing upon Speed and Accuracy in Learning to Use a
Comptometer?
The EF1 is learning to compute with a comptometer plus
previous training in typewriting. The EF2 is learning to
compute with a comptometer when there has been no pre-
vious training in typewriting.
The one-group method cannot be used because, if for
no other reason, there will be a carry-over from one EF to
the other. For this same reason the rotation method can-
not be employed. The equivalent-groups method is appro-
priate.
S1 should have previous training in typewriting. S2
should lack such previous training but should be equivalent
in all other respects. No additional control S is required.
A unique feature of this experiment is that one group is both
an S2 and a control S at the same time, for C1 minus C2
shows the exact effect of previous training in typewriting
upon learning to use a comptometer. S1 and S2 are not
defined by the problem. The inference is that they are two
groups of clerical students.
IT1, FT1, IT2, and FT2 are required both for speed
and accuracy in computing with the comptometer. In case
both S’s have had no experience at all with the comptometer
both IT1 and IT2 may be assumed to be zero.
This problem may be solved by either an experiment, or
a causal investigation, or half investigation and half ex-
periment. An experimenter finds two appropriate and
equivalent groups. To one he gives training in typewriting
and follows it with training on a comptometer. To the
other he gives no training in typewriting, but begins train-
ing them on the comptometer, after a period has elapsed
equivalent to that used in giving his typewriting training to
the EF1 group.
The causal investigator proceeds backward rather than
forward. He locates two groups, both of whom are learning
or have learned to operate a comptometer, who are equiva-
lent, except that one has learned typewriting while the other
has not. He then investigates their respective records in
learning to operate a comptometer. Any differences dis-
covered he attributes to typewriting.
The half-investigator, half-experimenter, locates two
groups equivalent in every respect except for typewriting.
To these two groups he applies uniform training on the
comptometer and measures the progress of each group.
PROBLEM 9. Given Equivalent Groups of Sales Clerks
and Clerical Workers, Is There Any Difference Between
Them in Type of Memory?
This is a causal investigation. The investigator finds the
EF’s applied before he assumes control of the situation.
The only thing left for him to do is to apply the FT’s and
formulate conclusions.
EF1 is sales clerk, or the inherited or environmental
conditions which set sales clerks apart as an occupational
group. EF2 is clerical workers or the conditions which
selected and differentiated clerical workers as an occupa-
tional group.
S1 is a group of sales clerks, who, except for occupational
differentiation and its concomitants and consequences, are
equivalent to S2. Unless the two groups are allowed to
differ in the possible immediate and direct concomitants and
consequences of occupational differentiation the whole in-
vestigation loses its point, for its very object is to determine
whether such concomitants or consequent differences occur.
This means that when the two groups are being equated the
probable concomitants and consequences should not be
among the bases employed for equating.
No IT’s can be given since the EF’s have been applied
before the investigator takes control of the situation. Even
if possible, none would be given, because the psychological
factors influential in determining ultimate occupational
choice may have been present from birth. Hence all that
can be done is to apply FT’s to determine whether the type
of memory possessed by S2 differs from that possessed
by S1.
In an investigation of this sort the investigator should
be wary about concluding from any difference in memory
revealed that this difference has been produced by the occu-
pation of a sales clerk as distinguished from the occupation
of clerical work. The truth may be instead that the differ-
ence discovered merely accompanies the occupation, i.e., is
caused directly by a fundamental something which is the
cause of occupational differentiation. It may be that the
difference revealed is itself the cause of the occupational
differentiation. In sum, whenever the investigator is pre-
sented with a completed experiment he has no assurance
as to whether the EF’s or the difference in FT’s came first
and hence is the cause or whether something more funda-
mental may not be the cause of both. All the investigator
can say is that occupational differentiation is or is not asso-
ciated with memory differentiation.
The FT’s should be tests for various types of memory.
No IT’s can be given, but in fitting data into the computation
models all IT scores may be assumed to be zero.
PROBLEM 10. Is Complete Understanding Necessary to
the Enjoyment of a Piece of Literature?
EF1 is incomplete understanding of a piece of literature.
EF2 is presumably complete understanding. Since under-
standing may vary from complete understanding to com-
plete misunderstanding it will be necessary for the experi-
menter to define the completeness of EF1 and EF2. He
may find it necessary to employ several EF’s of varying de-
grees of completeness of understanding.
Any one of the several experimental plans promises rea-
sonably satisfactory results. One plan is to employ the
one-group method, to expose S1 to an incompletely under-
stood piece of literature and measure the resulting enjoy-
ment, and then to expose S1 to the same piece of literature
after an understanding of it is taught or while an under-
standing of it is being given and measure the resulting enjoy-
ment. The difference between these two FT’s gives the
desired answer. If it is suspected that the conclusion holds
only for the particular type and difficulty of the piece of
literature employed, the experiment may be repeated with a
variety of pieces of literature.
Another plan is to employ the one-group method, to select
two pieces of literature which are known to be or may be
assumed to be equal in their appeal when both are incom-
pletely understood or completely understood and equally so
in both cases. To S1, however, one of these equated pieces
of literature is incompletely understood while the other is
completely understood. The difference in amount of enjoy-
ment evoked from S1 when these two pieces are presented
gives the desired answer. As before, various pairs of speci-
mens may be presented.
Still another plan is to employ equivalent groups. S1
may be exposed to a piece of literature which is incompletely
understood and the resulting enjoyment measured. S2 may
be exposed to the identical piece of literature after under-
standing of it has been given or while understanding is
being given, and the resulting enjoyment may be measured.
As before, various pieces of literature may be used or vari-
ous degrees of understanding may be imparted.
The rotation method is inappropriate. Incomplete under-
standing may precede completer understanding without seri-
ous carry-over, but to reverse this order of sequence, as
required by the rotation method, is impossible.
No IT’s need be given, for the degree of enjoyment of a
piece of literature before the S has been exposed to it may
be assumed to be zero.
No little ingenuity will be required to devise a satisfactory
test of enjoyment. Any one of many methods may be em-
ployed. Subtle physiological indices of enjoyment may be
recorded, or the pupils may be asked to choose between a
second exposure to the piece of literature in question and
other alternatives of reasonably constant and equal appeal,
or the pupils may rate the piece of literature in comparison
with the enjoyment derived from other common experiences
of varying satisfyingness, or a secret record may be kept
of the amount of subsequent use made of the piece of
literature when it is in the class library, and so on.
PROBLEM 11. What Is the Effect upon Teaching Effi-
ciency and Length of Service in Teaching of a Sabbatical
Year for Public School Teachers?
EF1 is a Sabbatical year. EF2 is no Sabbatical year.
The one-group method is not appropriate, because the
problem assumes that the EF is to be applied throughout the
teaching life of the teacher. Also one of the measurements
stipulated, namely, length of service, assumes the entire
teaching life. The equivalent-groups method is applicable,
and it is the only method which is applicable.
S1 is a group of public school teachers to whom EF1 is
applied and who are otherwise equal to and under conditions
comparable with S2.
Initial, intermediate, and final tests of teaching efficiency
are desirable for both S’s. Only FT’s of length of service
for both S’s are necessary or possible. The various periodic
intermediate tests will reveal whether Sabbatical years have
a cumulative effect or a decreasing effect, and whether
there comes a time where they no longer contribute to teach-
ing efficiency.
Since few experimenters have the patience or confidence
in their own longevity to wait a lifetime for the completion
of such an experiment, the investigational rather than the
experimental method is likely to be employed.
PROBLEM 12. How Do Individual Scores Obtained on
National Intelligence Scale A Compare with Those on Scale
B for the Same Pupils?
EF1 is application of National Intelligence Test, Scale A.
EF2 is application of Scale B of the same test.
The one-group method is required. There is some trans-
fer from EF1 to EF2 such as practice effect, but this can-
not be avoided. It can be largely eliminated by statistical
methods.
This experiment is unique in that the EF’s and FT’s are
identical. No IT’s are required.
The difference between FT1 and FT2 may be determined
by computing the coefficient of correlation between the Scale
A and Scale B scores, or by computing the net difference
(unreliability) between the two series of scores as was done
in Table 13.
Thus this experiment is unique in three ways. The EF’s
and FT’s are identical. Transfer from one EF to a succeeding
EF is eliminated statistically. Novel methods are suggested
for computing the difference between C1 and C2.
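The two comparisons named above can be sketched in a few lines. This is only an illustration under stated assumptions: the scores are hypothetical, and the mean pupil-by-pupil difference stands in for the net-difference computation, whose exact procedure in Table 13 is not reproduced here.

```python
from math import sqrt

def product_moment_r(a, b):
    """Product-moment correlation computed from deviations about each mean."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    x = [v - mean_a for v in a]  # deviations on Scale A
    y = [v - mean_b for v in b]  # deviations on Scale B
    sxy = sum(p * q for p, q in zip(x, y))
    return sxy / (sqrt(sum(p * p for p in x)) * sqrt(sum(q * q for q in y)))

def mean_difference(a, b):
    """Mean of the pupil-by-pupil differences, Scale B minus Scale A."""
    return sum(q - p for p, q in zip(a, b)) / len(a)

# Hypothetical scores for the same five pupils (illustrative only).
scale_a = [98, 105, 110, 120, 127]
scale_b = [101, 104, 115, 121, 131]

r = product_moment_r(scale_a, scale_b)
d = mean_difference(scale_a, scale_b)
print(round(r, 3), round(d, 2))
```

A high r with a mean difference well above zero would suggest that the two scales rank pupils alike while one of them (or the practice effect) yields systematically higher scores.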
PROBLEM 13. What Effect in Securing Order Will a Beautiful
Picture Placed in the Front of a Room Have Upon an
Unruly Boy Who Loves Art?
EF1 is no picture in front of room. EF2 is a beautiful
picture in front of room.
The one-group method or rotation method is the most
feasible, owing to the difficulty of equating unruly boys
who love art.
Assuming the one-group method, S is an unruly boy who
loves art. S has applied to him, in order, IT1 of unruliness,
EF1, FT1 of unruliness, EF2, and FT2 of unruliness. FT1
may be used as the IT2. This experimental unit may and
should be repeated many times to make certain that any
differences observed in the C’s are not accidental.
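The bookkeeping for such repeated units may be sketched as follows. The unruliness scores (say, disturbances counted per day) are hypothetical, and taking C as the final test minus the initial test is an assumption made for illustration.

```python
def one_group_changes(it1, ft1, ft2):
    """Return (C1, C2) for one experimental unit, with FT1 serving as IT2."""
    c1 = ft1 - it1  # change under EF1 (no picture)
    c2 = ft2 - ft1  # change under EF2 (picture); IT2 is FT1
    return c1, c2

# Hypothetical (IT1, FT1, FT2) unruliness scores for four repetitions of the unit.
units = [
    (12, 12, 9),
    (11, 12, 10),
    (13, 12, 10),
    (12, 11, 10),
]

differences = []
for it1, ft1, ft2 in units:
    c1, c2 = one_group_changes(it1, ft1, ft2)
    differences.append(c2 - c1)  # negative means less unruliness with the picture

mean_d = sum(differences) / len(differences)
print(differences, mean_d)
```

The more repetitions that enter the mean, the less a single accidental change can dominate it.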
The foregoing experiment is a particularly difficult one
to carry through successfully. The influence of the picture,
though real, is likely to be so subtle as to have its effects
masked by one of a hundred other influences playing upon
the pupil. When S is only one pupil the probability of
large changes due to irrelevant influences is especially great.
PROBLEM 14. To Determine the Relation Between Pla-
teaus on the Learning Curve and Recall.
In its present form the problem is so vaguely stated that
an analysis of it is impossible. What is really wanted is to
know whether pupils who have plateaus in their learning
curves are better able to recall or reproduce what is learned
at some later date.
EF1 is plateau or plateaus in learning curve. EF2 is a
learning curve without plateaus.
This experiment is peculiar in that the experimenter can-
not control the application of the EF’s. His only recourse
is to have a large group of pupils learn something, to plot
their learning curves, to single out those who show a plateau
or plateaus in their learning curve, to match them with a
group of pupils who show no plateaus in their learning
curves but who are otherwise equivalent as shown by tests
given prior to the beginning of the experiment, and finally
to measure the difference in the ability of these two groups
to recall what has been learned.
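The matching step can be sketched as below. The pupils, their pretest scores, and the two-point tolerance are all hypothetical; the greedy nearest-score pairing is one simple way to match, not a procedure prescribed in the text.

```python
def match_groups(plateau, no_plateau, tolerance=2):
    """Pair each plateau pupil with the closest unused no-plateau pupil.

    Each pupil is a (name, pretest_score) tuple. A pupil whose closest
    partner differs by more than `tolerance` points is left unmatched
    rather than forced into a poor pairing.
    """
    available = list(no_plateau)
    pairs = []
    for name, score in plateau:
        if not available:
            break
        best = min(available, key=lambda p: abs(p[1] - score))
        if abs(best[1] - score) <= tolerance:
            pairs.append((name, best[0]))
            available.remove(best)
    return pairs

# Hypothetical pretest scores.
plateau = [("A", 50), ("B", 62), ("C", 71)]
no_plateau = [("P", 49), ("Q", 60), ("R", 80), ("S", 63)]

pairs = match_groups(plateau, no_plateau)
print(pairs)
```

Only the matched pairs would then be compared on the later recall test; unmatched pupils are set aside.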
No IT’s need be given though it is important to know
that the two groups are equivalent in general ability to recall
what has been learned. If this is not known, it cannot be
said that plateaus have caused the difference in ability to
recall. They may be the effect or may merely be asso-
ciated with a certain recall ability.
Since the purpose of the experiment is to learn whether
learning curves plus plateaus cause or are correlated with
recall which is superior to that caused by or associated
with learning curves minus plateaus, no control EF and S
are required. For purposes of discussion, however, let us
suppose that the problem calls for a knowledge of the exact
contribution to recall of learning curves plus plateaus, i.e.,
of learning plus a period or periods of little or no progress.
Still no control EF would be required because the contribu-
tion of irrelevant factors to recall will be substantially zero.
If the experiment continues over a long period mere matur-
ing might contribute some power of recall. In this case a
control EF and S could be used to advantage.
If, however, the purpose of the experiment is to deter-
mine the amount of contribution of plateaus rather than
learning curves plus plateaus, a control EF, that is, an EF
of learning curves with plateaus absent, is required. EF2,
above, is just such a control EF. But here is a difficulty.
Is EF2 identical with EF1 except for the plateau feature of
EF1? Is a plateau merely an addition to a learning curve
with a plateau lacking, or is a plateau an integral portion of
its curve? If we affirm the latter, then it becomes impos-
sible to isolate and measure the effect of plateaus; we must
always measure the effect of plateaus-imbedded-in-learning-
curves.
PROBLEM 15. Which Will Give Better Results in Baking,
to Put an Angel-food Cake Into a Gas Oven Just Lighted
or Into One of Medium Temperature?
EF1 is a just lighted gas oven. EF2 is a gas oven which
has reached a medium temperature.
The one-group method or rotation method will not do.
Since the S is a set of angel-food cake-dough it could
not very well be baked twice. The carry-over will be
enormous, to say the least. The equivalent-groups method
is required, i.e., two sets of angel-food cake-dough made
according to identical recipes, or taken from the same
mixture.
The IT’s can be assumed to be zero. The FT’s should be
various tests of the appearance, deliciousness, and digesti-
bility of the cake baked according to each of the EF’s.
The only difficulty in this experiment is to identify the S
and the EF. It is the cake dough whose change by the two
varieties of temperature is of primary concern. The cake
dough is to these EF’s just as pupils are to the customary
EF’s.
PROBLEM 16. Are Girls More Interested in Learning
Manipulative Processes in Junior High School Than in
Senior High School?
EF1 is the junior high school age for girls. EF2 is the
senior high school age for girls.
Either the one-group or equivalent-groups method may
be employed. If the one-group method is employed, a group
of junior high school girls should be tested, in some way,
as to the strength of their interest in learning manipulative
processes. When these same girls have reached the senior
high school age they can, then, be tested again to see whether
their interest in learning manipulative processes has in-
creased.
If the equivalent-groups method is employed, the experi-
ment becomes essentially an investigation. A group of
senior high school girls and another group of junior high
school girls should be selected so as to be equivalent, in all
respects, except for the senior and junior high school differentiation
with all of its concomitant differentiation. Stated
more simply, a group of junior high school girls should be
so selected that they will be equivalent when they become
senior high school girls, to a previously selected group of
present senior high school girls.
Each group can be tested for its interest in learning
manipulative processes. The C for each group may be
assumed to be the same as the FT. The difference between
the M’s of the two series of C’s shows the difference between
the EF’s.
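The final comparison can be sketched as below. The interest scores are hypothetical. The sketch assumes the formulas listed in the Summary of Symbols (SDM = SD / √N and EC = D / (2.78 SDD)), takes SD with N in the denominator, and drops the correlation term of SDD on the assumption that the two groups are independent.

```python
from math import sqrt
from statistics import mean, pstdev

def experimental_coefficient(c1, c2):
    """Return (D, SDD, EC) for two independent series of C's."""
    m1, m2 = mean(c1), mean(c2)
    sdm1 = pstdev(c1) / sqrt(len(c1))  # SD of the first mean
    sdm2 = pstdev(c2) / sqrt(len(c2))  # SD of the second mean
    sdd = sqrt(sdm1 ** 2 + sdm2 ** 2)  # correlation term assumed to be zero
    d = m1 - m2
    return d, sdd, d / (2.78 * sdd)

# Hypothetical interest scores (C taken as equal to FT).
junior = [14, 16, 15, 17, 18]
senior = [12, 13, 15, 14, 11]

d, sdd, ec = experimental_coefficient(junior, senior)
print(d, round(sdd, 3), round(ec, 2))
```

With only five pupils per group the SDD is large relative to D, which is one reason larger groups are wanted before a difference is trusted.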
PROBLEM 17. Does Observation of Skilled Teaching Aid
Normal School Students to Grasp Facts and Principles of
Teaching and to Apply Them?
EF1 is observation of skilled teaching. EF2 is the
absence of such observation.
Since the one-group and rotation methods cannot be used
because of carry-over, the equivalent-groups method is re-
quired. One group of normal school students will observe
skilled teaching while an equivalent group will forego such
observation.
Both IT’s and FT’s covering all or a random sampling of
the facts and principles of teaching will need to be con-
structed and applied to both groups.
All the foregoing is simple enough. The real difficulty is
in devising some way to measure each group’s ability to
apply facts and principles learned. The only satisfactory
way to make the test is to organize an experiment within an
experiment, so as to discover just how well the normal school
students can actually teach pupils. In sum, the best way
for these students to manifest superior changes in them-
selves is to show that they can make superior measurable
changes in pupils.
Two groups of equivalent pupils can be selected. The
EF1 normal school students can be assigned to teach, in
rotation, say, one group of pupils, and the EF2 students can
be assigned to teach the other group of pupils. If the pupils
are sufficiently numerous each normal school student may
be assigned to her own group of pupils exclusively. The
specific lessons to be taught may be assigned by the experi-
menter and tests for the pupils may be constructed to meas-
ure the effect of these lessons. Or the experiment may be
permitted to run for a considerable period and general tests
may be given. Initial and final tests upon the pupils will
show which normal school group has been most successful
in applying facts and principles learned to the real task of
making desirable changes in pupils. Thus the best way to
measure the normal school student is to measure her pupils.
PROBLEM 18. Is the Per Cent of Failures Higher Among
Pupils Who Enter the Senior High School Direct from the
Eighth Grade or From the Junior High School?
EF1 is entrance to senior high school from eighth grade.
EF2 is entrance from junior high school.
This is not so much an experiment as a causal investiga-
tion, and must of necessity be an equivalent-groups investi-
gation. A group of students entering from the junior high
school must be found who are equivalent, except for con-
comitant differentiations, to a group entering from the regu-
lar eighth grade.
The FT is the record of failures for each of these groups
during the high school period. In computation, the C may
be considered identical with FT.
PROBLEM 19. At How Much Greater Saving of Time and
Effort Can a Group of Normal Seven-year-old Children
Learn to Read Than a Group of Normal Six-year-old Chil-
dren?
EF1 is normal seven-year-olds. EF2 is normal six-year-
olds.
The one-group and rotation methods are inappropriate.
If the six-year-olds and seven-year-olds are truly normal,
the six-year-olds will in one year be equivalent to the pres-
ent condition of the seven-year-olds. In sum, the conditions
of the experiment require equivalent groups except for the
EF difference and its concomitants. It also requires both
groups to be equally unable to read at present, though not
necessarily of equal capacity to learn to read.
One or more IT’s and FT’s of reading ability, with the
intervening teaching of reading by the same or equated
teachers to both groups, will show which group can learn
more rapidly. The computation will follow the regular
computation model.
All the foregoing appears quite simple. But there is a
hidden difficulty so great as to be well nigh insurmountable.
The foregoing plan shows which group learns to read more
quickly. Even though the experiment favors the seven-year-
olds, it does not show that, in the long run, it is more eco-
nomical to delay learning to read until seven years of age.
If the six-year-olds learn to read, they can spend the read-
ing period during their seventh year learning something
else. If the six-year-olds learn to read, even though at some
labor, they have an extra year of access to printed material.
If the six-year-olds do not spend their time learning to read,
they may spend their time learning something else which
may be proportionately difficult and valuable. There are
few abilities which a ten-year-old cannot learn more easily
than a six-year-old, but this does not mean that everything
should be postponed until pupils are ten years old. Decision
as to what to postpone involves a consideration of capacity,
interest, need, injury, and the total work of the school. The
practical problem cannot be solved by the simple experi-
mental plan outlined above.
PROBLEM 20. What Specific Abilities Are Required for
Success as a Telegrapher?
The EF’s are unknown specific abilities. The problem
here is not to determine whether a given specific ability con-
tributes or will contribute to success as a telegrapher. The
problem is to discover promising specific abilities with which
to experiment. In sum, the problem is to discover some
hypothesis to be a basis for experimentation. This is always
the first step in research.
One plan of procedure is to study the work of a tele-
grapher and logically infer what specific abilities are needed.
Another plan is to select two groups, one of which is com-
posed of successful telegraphers and the other of which is
composed of unsuccessful telegraphers, but where both other-
wise appear much alike. Observation of the work of the two
groups and tests of them may bring to light suggestive
differences.
Another plan is to choose strikingly successful and strikingly
unsuccessful telegraphers, and to contrast these opposites
in close proximity. This is the most drastic possible
method of shaking out into the field of consciousness those
differences which spell success or failure as a telegrapher.
Once specific abilities have been hit upon in such ways,
their contribution to success as a telegrapher may be deter-
mined experimentally, or by an equivalent-groups causal in-
vestigation, or by a partial correlation investigation.
PROBLEM 21. In a Recitation, Can a Class of Girls Bluff
a Teacher More Easily Than a Class of Boys?
EF1 is a class of girls. EF2 is an equivalent class of boys.
S is the teacher, or, better, several teachers of both sexes,
since an experiment of this sort needs repetition on both
men and women teachers.
The rotation method is most appropriate because it per-
mits the experimenter to rotate out differences in nature of
lesson, teacher’s experience in teaching it, and the like. Thus
the experimenter can request a teacher to teach a specific
lesson to a class of girls, and then to teach this same lesson
to a class of generally equivalent boys. Next he can ask
the teacher to teach another lesson to both boys and girls,
only, in this case, the boys should be taught first and the
girls second.
While each lesson is being taught or afterward, the ex-
perimenter must measure the amount of bluffing which oc-
curs. The C may be treated as identical with this FT, so
that a regular rotation computation model will apply.
PROBLEM 22. To What Extent Are Children in the Upper
Grades of the Elementary School Capable of Selecting on
Their Own Initiative Statements of Most Worth in Their
History Reading?
EF1 is attainment of upper grade status. EF2 is, if anything,
the mere absence of such attainment. S is upper
grade pupils.
Of necessity the one-group method must be employed.
The whole experiment, if such it may be called, is very sim-
ple. It merely consists in locating upper grade pupils and
in testing the extent to which they can select on their own
initiative statements of most worth in their histories.
IT may be assumed to be zero, so that FT becomes C1.
Similarly all the C2’s may be considered zero. Thus the
effect of upper-gradeness is shown by a straight measure-
ment of the present status of upper-grade children in the
trait in question.
PROBLEM 23. What Is the Best Order to Teach Geog-
raphy to Fourth-grade Pupils, the Concrete and Then the
Abstract, or the Abstract Followed by the Concrete?
EF1 is concrete followed by abstract. EF2 is abstract
followed by concrete. S is fourth-grade pupils.
Owing to the possibility of carry-over, the equivalent-
groups method is preferable. One fourth-grade group can
be taught according to EF1 and an equivalent fourth grade
according to EF2.
IT and FT tests, testing the degree of mastery of geog-
raphy lessons at the beginning and end of the experiment,
should be applied to both groups.
The general plan for this experiment is quite simple. The
actual carrying out of the experiment would involve much
careful labor. It is unique in that the two EF’s appear to
be rotated when they really are not. The purpose of the
experiment is not to evaluate abstract vs. concrete but
abstract after concrete vs. concrete after abstract. A simi-
larly deceptive problem is this: Which method brings the
best results in beginning reading—to teach the printed forms
of the words first and follow with the script forms, or the
reverse order? Another like deceptive problem is this:
What is the best possible order of subjects during the school
day? Here the various EF’s are all possible combinations
of order of school subjects. As many equivalent groups will
be required as there are EF’s. There may be a carry-over
from the first subject taught to the second subject, or from
the second subject to the third subject, and so on. But
carry-over from one part of an EF to another part of an
EF is not an irrelevant factor. Carry-over is an irrelevant
factor only where there is carry-over from one total EF to
another total EF.
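How quickly the number of EF’s, and hence of equivalent groups, grows with the number of subjects can be seen by enumerating the orders. The four subjects below are hypothetical.

```python
from itertools import permutations
from math import factorial

# Hypothetical school subjects; each ordering of the day is one EF.
subjects = ["reading", "arithmetic", "geography", "spelling"]

orders = list(permutations(subjects))
print(len(orders))   # number of EF's, one equivalent group needed for each
print(orders[0])     # the first ordering, as an example of a single EF

# The count equals the factorial of the number of subjects.
assert len(orders) == factorial(len(subjects))
```

Four subjects already call for twenty-four equivalent groups, so in practice the experimenter would presumably restrict the comparison to a few promising orders.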
PROBLEM 24. Can Anything Done Well By One Indi-
vidual Be So Analyzed That the Ability May Be Imparted
to Others?
For purposes of experimentation, the above problem will
be clearer if phrased thus: Will a particular person’s analy-
sis of what some individual does remarkably well confer that
remarkable ability upon another?
Here the EF1 is some particular person’s analysis of the
process by which some gifted person achieves certain ends.
EF2 is the absence of EF1. S is some individual to whom
EF1 or the analysis is to be taught in hopes of endowing
him with this rare ability.
The one-group method is required, for EF1 must be ap-
plied to a particular individual.
An IT or IT’s showing S’s initial status in the ability in
question needs to be followed, after EF1 has been applied,
by an FT or FT’s. These FT’s permit the computation of
C or C’s and show whether a particular individual can
analyze and impart the ability in high degree to another
particular individual. To make the experiment conclusive,
many individuals will have to attempt to analyze the process
and impart the ability to many S’s.
PROBLEM 25. To See What Projects Second-grade Pupils
Will Initiate.
EF1 is the school environment and internal nature of
second-grade pupils. EF2 is the mere absence of EF1. S
is a group of second-grade pupils.
The problem calls for the one-group method in its most
elementary form, for the experiment consists solely in plung-
ing pupils with certain natures into a certain medium, and
then watching to see what happens. This elementary sort
of research is quite fundamental, and, when operated by a
keen observer, frequently leads to very valuable conclusions.
PROBLEM 26. Do Commas After Dependent Clauses Help
the Reader in Speed or Accuracy of Reading?
EF1 is commas after dependent clauses. EF2 is the mere
absence of EF1, which is to say it is the absence of commas
at such places. S is not defined and hence may be any group
that can read.
The equivalent-groups method can be employed but it is
not the best method. The one-group method cannot be used,
for there will be a carry-over of acquaintance with material,
if certain material containing commas is followed by that
same material without the commas, and vice versa. This
is one of those rare situations where the one-group method
is inappropriate, but where the rotation experiment may be
used to advantage by alternating the content of the material.
The following shows a possible plan:
          Period I                  Period II
Group A   Material 1—Commas         Material 2—No commas
Group B   Material 1—No commas      Material 2—Commas
The speed and accuracy scores made by Group A on “Material
1—Commas” can be combined with the speed and accuracy
scores, respectively, made by Group B on “Material 2—Commas.”
This can be compared with the combined speed
scores and accuracy scores for “Material 1—No commas”
and “Material 2—No commas.”
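The pooling just described may be sketched for the speed scores as follows. The scores (words per minute) are hypothetical, and a simple mean of the pooled scores is assumed as the basis of comparison; accuracy scores would be handled in the same way.

```python
# Hypothetical speed scores keyed by (group, material, condition).
scores = {
    ("A", 1, "commas"): [120, 130, 125],
    ("B", 1, "no commas"): [118, 124, 121],
    ("A", 2, "no commas"): [110, 119, 116],
    ("B", 2, "commas"): [115, 122, 120],
}

def combined_mean(scores, condition):
    """Pool every score made under one condition and return the mean."""
    pooled = [
        s
        for (group, material, cond), values in scores.items()
        if cond == condition
        for s in values
    ]
    return sum(pooled) / len(pooled)

with_commas = combined_mean(scores, "commas")
without_commas = combined_mean(scores, "no commas")
print(with_commas, without_commas, with_commas - without_commas)
```

Because each condition draws once on each group and once on each material, differences between the groups and between the materials fall equally on both sides of the comparison.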
PROBLEM 27. Does Brightness Facilitate Progress Through
School?
EF1 is brightness. EF2 is absence of EF1. The subjects
are school pupils.
The one-group experimental method cannot be employed
because it is impossible for pupils to be dull for a period and
then become bright or be bright and then become dull. For
the same reason, the rotation method cannot be used. The
equivalent-groups method is the correct one for this problem.
S1 is a group of pupils who are known or are shown to
be of a defined brightness. S2 is another group who are
known to be of a defined dullness. Except for these intelli-
gence differences and their concomitants the two groups
should be equivalent. They should be equivalent in chrono-
logical age, grade position in school, i.e., beginning first
grade or kindergarten children, etc.
Since the measure of C is the rate of progress through
school no initial tests, except of brightness, are required.
The answer to the problem will be shown by the FT, i.e., the
number of years required on the average for each group
to complete a defined number of school grades.
PROBLEM 28. Does Genius Beget Genius?
EF1 is genius on the part of parents. EF2 is the absence
of such genius, or a smaller quantity of it.
The one-group and rotation experimental methods are
inappropriate owing to the fact that parents cannot be
geniuses for a time and then become non-geniuses or vice
versa. Hence the equivalent-groups method must be used.
S1 is the product of the union of the sperm and ovum of
genius parents. S2 is the product of the union of these elements
from non-genius parents.
No IT’s are required except to yield a measure of the
amount of each EF. The IT for the subjects may be as-
sumed to be zero. As soon as the offspring of each group
have sufficiently matured to make measurement practicable
an FT of intelligence may be applied. C1 and C2 will be
identical with the two FT’s. M1 minus M2 will reveal the
effect upon the intelligence of offspring of genius in the
parents.
To make it possible to separate the influence of germ
plasm and environmental influence, all children of both
groups should be placed under equally favorable environ-
mental influences immediately after conception or after birth,
at the latest. The equality of environment should be main-
tained until the FT’s are made.
SUMMARY OF SYMBOLS AND FORMULAE
A.Q. = accomplishment quotient = — = i=
Ar.A. = arithmetic age
: ; Ar.A.
Ar.A.Q. = arithmetic accomplishment quotient = TAriAg
‘ : Pe ls eel
Ar.Q. = arithmetic quotient = CA
A.M. = assumed mean
B = brightness = T + B correction
Ba, Be, Bi, Br = brightness in arithmetic, education, intelligence
and reading, respectively
C = (1) change produced by an experimental factor
(2) pupil classification = G+ C correction
CC = change produced by a control experimental factor
CEF = control experimental factor
C.A. = chronological age
C= correction
D = difference
EC = experimental coefficient
ah D
(1) for difference = 2.78 SDD
On SS
(2) for coefficient of correlation 798 SDt
ECMEC = experimental coefficient of the mean experimental
fieienies MEC
Rabie Me oS DILL G
ECMED = experimental coefficient of the mean equated dif-
f i MED
Tie iwnte 78 SUMED
ED = equated difference
EF = experimental factor
E.A.
CAG
F = effort or efficiency = Te — Ti
Fa = effort in arithmetic = Ta— Ti
Fr = effort in reading = Tr — Ti
f = frequency
E.Q. = educational quotient =
276
Summary of Symbols and Formule 277
fx = deviation X number of frequencies
FT = final test
G = grade status
INT = intermediate test
I.Q. = intelligence quotient =
IT = initial test
M = arithmetic mean
M.A. = mental age
MEC = mean experimental coefficient
MED = mean equated difference
N = total number
N = $\frac{r_N (1 - r)}{r (1 - r_N)}$, from the Spearman self-correlation coefficient,
where N is the number of tests required to yield
a defined correlation $r_N$
P = pupil
PE = probable error
PED = probable error of the difference
PEM = probable error of the mean
Q = quartile deviation = $\frac{Q_3 - Q_1}{2}$
$Q_1$ = 25 percentile
$Q_3$ = 75 percentile
R.A. = reading age
R.A.Q. = reading accomplishment quotient = $\frac{\text{R.A.}}{\text{M.A.}}$
R.Q. = reading quotient = $\frac{\text{R.A.}}{\text{C.A.}}$
r = product moment coefficient of correlation
= $\frac{Sxy}{\sqrt{Sx^2}\,\sqrt{Sy^2}}$
= $\frac{\frac{Sxy}{N} - c_x c_y}{\sqrt{\frac{Sx^2}{N} - c_x^2}\,\sqrt{\frac{Sy^2}{N} - c_y^2}}$ where assumed mean is used
$r_N$ = $\frac{N r}{1 + (N - 1) r}$ = correlation coefficient resulting
when N forms of tests are used
S = experimental subject, thing, OF group or BELORD
x, size of
SD or S.D. = standard deviation = $\sqrt{\frac{Sx^2}{N} - c^2}$
SDC = standard deviation of the changes
SDD = standard deviation of the difference
= $\sqrt{(\text{SDM}_1)^2 + (\text{SDM}_2)^2 - 2\, r_{12}\, (\text{SDM}_1)(\text{SDM}_2)}$
SDM = standard deviation of the mean = $\frac{\text{SD}}{\sqrt{N}}$
SDMEC = standard deviation of the mean experimental co-
efficient
SDMED = standard deviation of the mean equated differ-
ence
Mela Sea REO.
ANE VALENS
SDr = standard deviation of the coefficient of correlation
= $\frac{1 - r^2}{\sqrt{N}}$
SD median =~
SDS = standard deviation of the sum
= $\sqrt{(\text{SDM}_1)^2 + (\text{SDM}_2)^2 + 2\, r_{12}\, (\text{SDM}_1)(\text{SDM}_2)}$
Sfx or Sx = sum of the deviations
T = 0.1 standard deviation of unselected 12-year-old
children
Ta, Te, Ti, etc. = T score in arithmetic, education, intelligence, etc.
x = deviation
y = deviation
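The entries above for SDM, SDD, EC and the correlation for N forms of a test can be tried out numerically. The following is only a minimal sketch in Python, not part of the original text. It assumes the readings SDM = SD / sqrt(N), SDD = sqrt(SDM1^2 + SDM2^2 - 2 r12 SDM1 SDM2), EC = D / (2.78 SDD), and n r / (1 + (n - 1) r) for n forms. The function names and the sample figures are hypothetical and chosen for illustration.

```python
import math


def sd_mean(sd: float, n: int) -> float:
    """SDM: standard deviation of a mean, taken here as SD / sqrt(N)."""
    return sd / math.sqrt(n)


def sd_difference(sdm1: float, sdm2: float, r12: float = 0.0) -> float:
    """SDD: standard deviation of the difference between two means.

    r12 is the correlation between the two sets of scores; it is 0
    when the two groups are independent (an assumption of this sketch).
    """
    return math.sqrt(sdm1 ** 2 + sdm2 ** 2 - 2.0 * r12 * sdm1 * sdm2)


def experimental_coefficient(d: float, sdd: float) -> float:
    """EC for a difference: D divided by 2.78 times SDD."""
    return d / (2.78 * sdd)


def correlation_for_n_forms(r: float, n: int) -> float:
    """Self-correlation expected when n forms of a test are combined."""
    return n * r / (1.0 + (n - 1) * r)


if __name__ == "__main__":
    # Hypothetical figures: two groups of 100 pupils each.
    sdm_a = sd_mean(sd=30.0, n=100)
    sdm_b = sd_mean(sd=40.0, n=100)
    sdd = sd_difference(sdm_a, sdm_b)
    print("SDD:", sdd)
    print("EC:", experimental_coefficient(d=13.9, sdd=sdd))
    print("r for 2 forms:", correlation_for_n_forms(r=0.5, n=2))
```

Keeping each symbol as its own small function mirrors the layout of the summary, so that one entry can be checked or replaced without touching the others.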
INDEX
Absolute-worth scales, in question-
naires, 215, 216.
Accomplishment Quotient, 58-61, 103.
Age scale, evaluation of, 95-98.
Army Beta non-verbal intelligence
test, use of, 85.
Assumed mean, 143.
Attendance, Reavis’s investigation
of, 209, 210, 213, 238, 239.
B scale, construction of, 102-109.
Barton, and Dransfield, on teaching
of reading, 4.
Battery of tests, use in Liu’s study,
85; construction of, 138, 139.
Bennett, on equating of groups, 50,
51, 73.
Bibliography, making of survey of,
11-13; of equivalent groups meth-
od, 271; of one-group method,
271; of causal investigations, 272;
of rotation method, 272; of ex-
perimental measurements, 273,
274; general, 275.
Binet-Simon, 60, 130.
Brian, and Harter, 88.
Brightness in arithmetic, computa-
tion of pupil, 124; of class, 126.
Buckingham, 130.
C scale, construction of, 109, 110.
Cattell, 130.
Causal investigations, methodology
of, 207-212; Reavis’s investiga-
tion, 209, 210, 213, 238, 239; pro-
cedure of, 212-244; analysis of
problems, 245-269; bibliography,
272.
Chal Garo:
Chang, C. Y., 130.
Chang, Y. C., 130.
Chinese fundamentals of arithmetic
scale, 121-130.
Classification in arithmetic, compu-
tation of pupil, 125, 126; of class,
126.
Computation, special difficulties in,
206, 207.
Correction, 143.
Correlation, and test reliability, 111;
in causal investigations, 224-244.
Courtis, and Thorndike, on cor-
rection formulae, 116, 130.
Coy, 37.
Criteria, see Experimental measure-
ments.
Darwin, 208.
Dearborn non-verbal intelligence
test, use of, 85.
Descriptive investigations, bibliog-
raphy, 272, 273.
Difference, computation of, 150.
Difficulty test, construction of, 131-
135.
Distribution method, in question-
naires, 215, 216.
Dransfield, and Barton, on teaching
of reading, 4.
Equivalent groups method, descrip-
tion of, 18, 19, 40, 44; formulae
for, 18, 19, 59; criteria for se-
lecting, 29-31, 35; computations
for, 161-186; bibliography, 271.
Errors, see Experimental errors.
Experimental coefficient, 154-158,
168, 174.
Experimental errors, avoidance of,
63-80.
Experimental factors, amount of,
81; changes produced by, 82. See
also Irrelevant factors.
Experimental investigations, analyses
of problems for, 245-269.
Experimental measurements, func-
tions of, 81; criteria, fundamental,
82, 83; for evaluation and con-
struction of, 83-93; bibliography,
273, 274.
Experimental methods, see One-
group, Equivalent groups and Ro-
tation method.
Experimental subjects, appropriate-
ness of, 37-38, 40-44; selection of,
38-40.
Experimentation, in education, prev-
alence of, 1, 2; value of, 3-5;
selection of problem, 6-9; formu-
lation of problem, 9-11.
Experiments, see Weber’s rotation,
Lacy’s rotation, Thorndike and
McCall’s rotation.
Franzen, 130.
Frequency distribution, construc-
tion of, 145-148.
Fullerton, 130.
Gates, 138.
Grade scale, evaluation of, 94.
Graphic methods, see Statistical and
graphic methods.
Gray, 38; on equating two groups,
<8,
Groups, equating of, 41-61.
Hanson, 37.
Harter, and Brian, 88.
Herring Revision of Binet-Simon
Scale, 60.
Hillegas, 130.
Hollingworth, H. L. and L. S., on
equating groups, 55.
Intelligence Quotient, 56, 59.
Intelligence tests, classified, 43, 44;
battery of, 85.
Irrelevant factors, constant vs. va-
riable, 63, 64; bias of experi-
menters, 64, 65; bias of assistants,
65-75; transfer, 75, 76; bias of
tests, 77, 78; other factors, 78, 79;
change produced by, 82.
Lacy, rotation experiment, 34, 35,
73.
Lew,/L. 1.0830.
Liu, H. C., on construction and use
of intelligence criterion, 84-87.
McCall, and Thorndike, reading
scale, 59-62; rotation experiment,
194.
Mean, computation of, 143; use of,
148.
Measurement, of changes, 206, 207.
Median, computation of, 148, 149.
Mental age, computation of, 50,
60.
Metchnikoff, 208.
Monroe, diagnostic tests in arith-
metic, use, 88; measurement of
achievement, 130.
Myers, non-verbal intelligence test,
use, 85.
Norms, 60, 83, 117.
Ogglesby, 37, 180.
One-group method, description of,
14-17; formula for, 173; cri-
teria for selecting, 21-29, 35;
computations for, 140-160; bibli-
ography, 271.
Otis, on unreliability, 116.
Pairing pupils, technique of, 45-49,
57.
Percentile scale, evaluation of, 95-
98; points, computation of, 149-
150.
Pintner, non-verbal intelligence test,
use of, 85, 130.
Pittman, on equating of groups, 40-
51.
Practical certainty, 156, 163.
Pressey, non-verbal intelligence test,
use of, 85.
Probable error, 151.
Product-moment formula, 225.
Product tests, construction of, 135-
138.
Q1, 150.
Q3, 150.
Quartile deviation, computation of,
150.
Questionnaires, methods in causal
investigations, 215-217.
Rank method, in questionnaires, 215,
216.
Rate test, construction of, 135.
Reavis, attendance investigation,
209, 210, 213, 238, 239.
Regression equation, in causal in-
vestigations, 240-244.
Relative-to-the-items scale method,
in questionnaires, 216.
Reliability, of tests, 83; formula
for, 111; net-difference method,
112-114; practical certainty, 156,
163; computations in special situ-
ations, 190.
Rotation method, description of, 19,
20; formula for, 19, 20, 32; cri-
teria for selecting, 31-36; Steven-
son’s experiment, 28; Weber’s
experiment, formula, 32, descrip-
tion of, 198-207; Lacy’s experi-
ment, 34, 35; computations for,
187-207; Thorndike and McCall,
ventilation experiment, 194; bib-
liography, 272.
Rugg, H. O., 5.
Scales, adequacy of, 88; evalua-
tion of methods, 94-98; for ex-
perimental tests, 198. See also
Age scale, B scale, C scale, Chi-
nese fundamentals of arithmetic
scale, Percentile, T scale.
Scores, point, sample of, 44; men-
tal age, sample of, 44.
Scoring, of Chinese fundamentals
of arithmetic test, 122, 123, 129.
Self-correlation, see Correlation.
Sherritt 21s., 1130.
Sigma, see Standard deviation.
Spearman, self-correlation formula,
111, 112; product-moment for-
mula, 225.
Standard deviation, computation of,
144; of difference, 151.
Stanford Revision of Binet-Simon
scale, 60.
Starch spelling scale, use of, 88.
Statistical and graphic methods,
bibliography, 274, 275.
Stevenson, rotation experiment, 26,
28.
T scale, 27; evaluation of, 95-98;
construction of, 98-102.
T scores, Weber’s use of, 203.
PaO, WWW uke so:
Terman, on mental age, 59, 130.
Tests, intelligence, classified, 43, 44;
battery of in Liu’s study, 85;
summary of steps in constructing,
scaling and standardizing, 130-139;
experimental, scaling of, 198.
Thorndike, 5, and McCall, reading
scale, 59-62, 130; rotation experi-
ment, 194.
Total ability in arithmetic, com-
putation of pupil, 123, 124; of
class, 126.
Unreliability, see Reliability.
Variability, measures of, 151.
Weber, rotation experiment, 32, 73,
198-207.
Woody, arithmetic scales, use, 88.