UNIVERSITY OF ILLINOIS
Urbana, Illinois

A Study of Sample Size in Making Decisions About Instructional Materials

Gerald Frincke[1] and Lawrence M. Stolurow

COMPARATIVE STUDIES OF PRINCIPLES FOR PROGRAMING MATHEMATICS IN AUTOMATED INSTRUCTION

Technical Report No. 11
February, 1965

Co-Investigators:

Lawrence M. Stolurow
Professor, Department of Psychology
Training Research Laboratory

Max Beberman
Professor, College of Education
University of Illinois Committee on School Mathematics (UICSM)

Project Sponsor: Educational Media Branch, U.S. Office of Education
Title VII Project No. 711151.01

Problem

The present study was conducted to explicate the sampling problems involved in obtaining data for decisions about the acceptability of frames in self-instructional programs. In this study, data were examined to determine the relative efficiency of various sample sizes in making decisions about retaining or revising the frames of a self-instructional program.

Individual frames are not assumed to be statistically independent of one another. The errors made on one frame are likely to be correlated with those made on another, and the magnitude of the correlations is generally both unknown and variable throughout a program. This state of affairs indicates the need for an empirical study of sampling problems in decision making.

Two indices were used to determine the implications of selection criteria and sampling procedures in deciding about the acceptability of frames. One index was the percent of undesirable frames correctly identified. The other was the percent of rejections made erroneously (type I errors). The study also was conducted to suggest guidelines for determining the sample size to use in developing self-instructional programs in which it is expected that the overall error rates will be low and the distribution of error rates observed for frames within the program will be skewed.

[1] Currently Associate Professor of Psychology at Sacramento State College, California.

Decision Making and Types of Error

In the course of evaluating and revising programed instructional materials, a programer may decide to reject as undesirable those frames in the program which he suspects will lead to a student error rate above a given criterion value. In doing this, the programer is in a position similar to that of a statistician faced with a large number of hypotheses to test. Each frame in the program must be accepted or rejected. In effect, for each frame, the programer must test the null hypothesis that the frame will have an error rate below the value he has chosen as his minimum criterion of "undesirableness" when the program is put into general use. The number of observations on which each test of this hypothesis is based will be equal to the number of students with whom the program is pretested. Usually (though not necessarily) the programer rejects the null hypothesis whenever a frame is found to have an error rate greater than the one chosen as the criterion value.
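Stated compactly, the rule is to reject a frame whenever its observed error rate equals or exceeds the criterion value. A minimal sketch of this rule follows; the routine and its names are illustrative, not taken from the report.

    def reject_frame(errors, n_students, criterion):
        """Reject when the observed error rate meets or exceeds the criterion."""
        return errors / n_students >= criterion

    # Example: 3 of 15 pretest students err on a frame; with a 20% criterion
    # the frame is rejected, since 3/15 = 0.20.
    print(reject_frame(3, 15, 0.20))   # True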
In making each decision concerning the acceptability of a program frame, the programer, as statistician, may make one of two types of errors. He will make a type I error if the null hypothesis is rejected when it is true; he will mislabel an acceptable frame as "unacceptable." He will make a type II error if the null hypothesis is accepted when it actually is false; he will mislabel an unacceptable frame as "acceptable."

The programer definitely wants to avoid type II errors. He wants what the statistician calls a "powerful" test of the null hypothesis, one that has enough "power" to reject each truly unacceptable frame. At the same time, however, the programer does not want to make too many type I errors. Otherwise, he will be needlessly rewriting a large number of the acceptable frames in the program. Extensive rewriting would then require that the revised program be pretested, and thus delay and increase the cost of the finished program.

Power and Sample Size

Given a fixed number of observations (N), the only way to change the probability of a type I error is to shift the rejection criterion. Unfortunately, in this situation a shift of the rejection criterion which reduces the chances of a type I error simultaneously reduces the power of the test. Conversely, a shift of the criterion which increases the power of the test results in an increased probability of a type I error. Only by increasing N can the power of the test be increased without a simultaneous increase in the probability of a type I error. Similarly, only by increasing N can the probability of a type I error be reduced without diminishing the power of the test.

Since each test is based on the error data of the students utilized in pretesting the program, the power of each test will be inversely related to the performance level of the N students comprising the sample, and thus directly related to their overall error rate. If the N students perform quite well, observed error rates will be depressed for all items. Consequently, fewer unacceptable items will be rejected. Conversely, if the N students perform quite poorly, observed error rates will be higher and more frames (both acceptable and not) will be rejected. This suggests the desirability of obtaining measures characterizing the sample of students used in terms of relevant abilities, and of obtaining measures of their representativeness in terms of relevant academic achievement.

True Error Rate

The power of each test made by the programer also depends upon the degree to which the actual state of affairs approaches that stated in the null hypothesis. We can speak of the true error rate of a frame as that error rate which would be found for it if it were given to all of the intended population of students as a part of the finished program. If an unacceptable frame has a very high "true" error rate, the test based on pretesting data will be more likely to reject it than an unacceptable frame whose "true" error rate approaches acceptability.

"True" error rates are generally unknown for program frames. This lack of knowledge is what leads the programer to construct a test for selecting frames in the first place. All the programer can do, therefore, is accept the fact that his test will be most effective in eliminating the extremely unacceptable frames, even if this is at the cost of retaining some "borderline" unacceptable frames. However, the percentage of truly unacceptable frames rejected by tests based on data from a given pretesting sample of size N cannot be expected to be very high in the case of programs which have already gone through successful revision, since revision will already have eliminated the worst frames and left mainly borderline cases.
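Although the present study proceeds empirically, because errors on different frames are intercorrelated, the tradeoff just described can be made concrete under a simplifying assumption: that each of the N pretest students errs on a given frame independently, with probability equal to the frame's true error rate. The following sketch is an illustration under that binomial model, not the report's method.

    from math import comb

    def prob_reject(true_rate, n, criterion):
        """Probability that a frame's observed error rate reaches the
        rejection criterion when each of n students errs on it
        independently with probability true_rate (binomial model)."""
        return sum(comb(n, e) * true_rate**e * (1 - true_rate)**(n - e)
                   for e in range(n + 1)
                   if e / n >= criterion)    # the same rule the programer applies

    # With a 10% rejection criterion, compare the chance of erroneously
    # rejecting an acceptable frame (true rate 5%) with the power to reject
    # a clearly unacceptable frame (true rate 30%) as n grows.
    for n in (5, 15, 100):
        print(n,
              round(prob_reject(0.05, n, 0.10), 3),   # type I error risk
              round(prob_reject(0.30, n, 0.10), 3))   # power

Run for increasing n, the figures show the probability of a type I error on an acceptable frame falling while the power against a clearly unacceptable frame remains high, which is the relationship argued above.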
Method

Materials

The program. UICSM's programed instructional series, Part 110, was selected for study since it teaches relatively difficult concepts, and thus some variability in error rates for frames could be expected. Worksheet error data for 178 of the students who completed Part 110 (pure mode)[2] were recorded on SCRIBE[3] sheets and used with the SCRIBE system[4] to produce IBM cards containing individual error data for each student. Error data from the completed worksheets of the 52 additional pure-mode subjects were punched directly into IBM cards in the SCRIBE output format. Thus, data from 230 students were available for sampling.

Sampling procedure. All sampling of data from the basic pool of 230 students was done without replacement. A random sample of 100 students from all seven classes in the four participating schools was selected to establish criterion error rate measures on the 308 program frames in UICSM PIP Part 110. This sample of 100 was stratified with regard to classes and schools. A similar stratified random sampling procedure was then followed as closely as possible in selecting several other samples of various sizes. In cases where a stratified procedure was clearly impossible, such as in a sample of size one (N=1), ordinary random sampling was utilized. Three samples each of the following sizes were selected: N=1, N=2, N=3, N=4, N=5, N=10, and N=15. These samples were also merged to form a "summation sample" with N=120.

[2] See Beberman, M., and Stolurow, L. M. Comparative studies of principles for programing mathematics for programed instruction: semi-annual report. Urbana, Ill.: Univer. of Ill., 1963, for a description of the modes, schools, and classes involved in the 1962-63 tryout of the UICSM programs.

[3] SCRIBE is a system developed and used by ETS to score multiple-choice answer sheets and to automatically transcribe the data on the answer sheets to IBM cards.

[4] The SCRIBE technique of recording and processing worksheet data is described in Frincke, G. L., and Stolurow, L. M. Three methods of recording worksheet performance. Urbana, Ill.: Univer. of Ill., 1964. USOE Title VII, Technical Report No. 7. This work was done in cooperation with Educational Testing Service, and arrangements for it were made by Dr. Paul Jacobs. The ETS contribution was made possible by a grant from the Carnegie Corporation of New York.

Item analyses. Worksheet error data cards prepared from the worksheets of the subjects who constituted the criterion sample, the summation sample, and the 21 smaller samples were used as the basis for 23 separate item analyses[5] of the 308 items which comprise UICSM Programed Instruction Part 110. Three hundred and eight summary IBM cards were then prepared. Each of these cards contained the results of all 23 analyses with regard to one of the 308 program items. The summary cards were used to determine correlations between the results of item analyses based on the various samples. These cards were also used to determine which frames would be rejected and which accepted by tests based on the data of the various samples, in cases where the criterion for rejection was an observed error rate equal to or greater than 10%, 15%, or 20%.
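The logic of these comparisons can be sketched briefly. The routine below is a hypothetical illustration of the two indices defined earlier (the percent of truly unacceptable frames correctly rejected, and the percent of rejections made erroneously), not the item-analysis program actually used.

    def score_sample(criterion_rates, sample_rates, cutoff):
        """Classify each frame by the criterion sample's error rate, then
        score a smaller sample's rejections against that classification."""
        unacceptable = {f for f, r in enumerate(criterion_rates) if r >= cutoff}
        rejected = {f for f, r in enumerate(sample_rates) if r >= cutoff}
        caught = rejected & unacceptable          # bad frames correctly rejected
        erroneous = rejected - unacceptable       # acceptable frames rejected
        pct_caught = 100.0 * len(caught) / len(unacceptable) if unacceptable else 0.0
        pct_erroneous = 100.0 * len(erroneous) / len(rejected) if rejected else 0.0
        return pct_caught, pct_erroneous

    # Example with five frames and a 20% cutoff: the criterion sample marks
    # frames 3 and 4 unacceptable; a small sample rejects frames 2 and 3.
    crit = [0.02, 0.05, 0.10, 0.25, 0.40]
    small = [0.00, 0.00, 0.20, 0.20, 0.00]
    print(score_sample(crit, small, 0.20))   # (50.0, 50.0)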
Results

Figure 1 is a frequency distribution of the different error rates observed in Part 110 for the criterion sample (N=100). Table 1 shows the distributions of error rates observed in the 21 smaller samples. All of the distributions are quite skewed. Most items are well within the limits of acceptability, and the number of extremely unacceptable items in Part 110 is actually quite low. This is an important factor in interpreting the findings of this study.

[5] These analyses were carried out with the aid of an IBM 1620 computer and a program written by Mr. Scott Krueger, University of Illinois, Training Research Laboratory.

[Figure 1, plotting frequency (number of frames) against observed error rate for the criterion sample, and Table 1 appear here.]

The inconsistency of the smaller samples is readily seen when inspecting the overall error rate estimates presented in Table 2. The failure to obtain consistent results with smaller samples in the present study points up a major objection to the use of small pretesting samples. One cannot be confident that a small sample of students will produce individual and overall error rates consistent with those which would obtain in the population for which the program was intended. This objection, along with the fact that erroneous rejections are quite frequent when small samples are employed, must be seriously considered by the planner of a pretesting program. The cost of failing to reject unacceptable items, of rejecting acceptable items, and of inaccurately estimating the overall error rate for a program must be balanced against the cost of pretesting the program. When these things are considered, the N of the pretesting sample should be set as large as is practical. It should also be chosen so that the product of the desired rejection criterion and N is an integer; since observed error rates can take only values that are multiples of 1/N, a criterion whose product with N is not an integer effectively raises the cutoff to the next attainable rate, making the test stricter, and therefore less powerful, than intended. Choosing N so that the product is an integer thus maximizes the power of the test.

Summary

In spite of the common practice, in developing programed learning materials, of using small samples of students from the target population to accept or reject frames, there has been no examination of the implications of this practice. This study relates the problem to that of the statistician who is testing a large number of hypotheses. The concepts of rejection level, type I and type II errors, and the statistical concept of the power of a test are applied. The empirical nature of the study is important, since it is characteristic of the errors made to be intercorrelated and to form a skewed distribution with a mean that departs substantially from .5. Twenty-one independent samples of seven different sizes, three per size, were drawn from student worksheets used in learning from an algebra program based upon the UICSM curriculum. The hazards of small samples (up to N=15) with rejection criterion levels of 10%, 15%, and 20% were examined. Wide variations in efficiency among samples of a given size were observed, both in terms of (a) rejection of acceptable frames and (b) failure to reject unacceptable ones. Coupled with the inconsistency of small pretesting samples is the high frequency of erroneous rejections. It was recommended that pretest samples be both as large as practical and chosen so that the product of the desired rejection criterion and N is an integer, so as to maximize the power of the test.
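The arithmetic behind this last recommendation can be made concrete. Observed error rates based on N students fall only on multiples of 1/N, so the cutoff actually applied is the smallest attainable rate at or above the criterion. A brief illustrative sketch, not part of the original report:

    from math import ceil

    def effective_cutoff(criterion, n):
        """The attainable error rate at which rejection actually begins,
        given that observed rates come only in steps of 1/n."""
        return ceil(criterion * n) / n

    print(effective_cutoff(0.20, 15))  # 3/15 = 0.20: criterion * n is an integer,
                                       # so the intended cutoff applies exactly
    print(effective_cutoff(0.10, 15))  # 2/15 = 0.133...: the cutoff rises above
                                       # the intended 10%, costing power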