UNIVERSITY OF ILLINOIS
Urbana, Illinois

A Study of Sample Size in Making Decisions About Instructional Materials

Gerald Frincke[1] and Lawrence M. Stolurow

COMPARATIVE STUDIES OF PRINCIPLES FOR PROGRAMING MATHEMATICS IN AUTOMATED INSTRUCTION

Technical Report No. 11
February, 1965

Co-Investigators:

Lawrence M. Stolurow
Professor, Department of Psychology
Training Research Laboratory

Max Beberman
Professor, College of Education
University of Illinois Committee on School Mathematics (UICSM)

Project Sponsor: Educational Media Branch, U.S. Office of Education
Title VII Project No. 711151.01

Problem

The present study was conducted to explicate the sampling problems involved in obtaining data for decisions about the acceptability of frames in self-instructional programs. In this study, data were examined to determine the relative efficiency of various sample sizes in making decisions about retaining or revising the frames of a self-instructional program.

Individual frames are not assumed to be statistically independent of one another. The errors made on one frame are likely to be correlated with those made on another, and the magnitude of the correlations is generally both unknown and variable throughout a program. This state of affairs indicates the need for an empirical study of sampling problems in decision making.

Two indices were used to determine the implications of selection criteria and sampling procedures in deciding about the acceptability of frames. One index was the percent of undesirable frames correctly identified. The other was the percent of rejections made erroneously (type I errors). The study also was conducted to suggest guidelines for determining the sample size to use in developing self-instructional programs in which it is expected that the overall error rates will be low and the distribution of error rates observed for frames within the program will be skewed.

[1] Currently Associate Professor of Psychology at Sacramento State College, California.

Decision Making and Types of Error

In the course of evaluating and revising programed instructional materials, a programer may decide to reject as undesirable those frames in the program which he suspects will lead to a student error rate above a given criterion value. In doing this, the programer is in a position similar to that of a statistician faced with a large number of hypotheses to test. Each frame in the program must be accepted or rejected. In effect, for each frame, the programer must test the null hypothesis that the frame will have an error rate below the value he has chosen as his minimum criterion of "undesirableness" when the program is put into general use. The number of observations on which each test of this hypothesis is based will be equal to the number of students with whom the program is pretested. Usually (though not necessarily) the programer rejects the null hypothesis whenever a frame is found to have an error rate greater than the one chosen as the criterion value.
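Stated compactly, the rule is to reject a frame whenever its observed error rate equals or exceeds the criterion value. A minimal sketch of this rule follows; the routine and its names are illustrative, not taken from the report.

    def reject_frame(errors, n_students, criterion):
        """Reject when the observed error rate meets or exceeds the criterion."""
        return errors / n_students >= criterion

    # Example: 3 of 15 pretest students err on a frame; with a 20% criterion
    # the frame is rejected, since 3/15 = 0.20.
    print(reject_frame(3, 15, 0.20))   # True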
In making each decision concerning the acceptability of a program frame, the programer, as statistician, may make one of two types of errors. He will make a type I error if the null hypothesis is rejected when it is true; he will mislabel an acceptable frame as "unacceptable." He will make a type II error if the null hypothesis is accepted when it actually is false; he will mislabel an unacceptable frame as "acceptable."

The programer definitely wants to avoid type II errors. He wants what the statistician calls a "powerful" test of the null hypothesis, one that has enough "power" to reject each truly unacceptable frame. At the same time, however, the programer does not want to make too many type I errors. Otherwise, he will be needlessly rewriting a large number of the acceptable frames in the program. Extensive rewriting would then require that the revised program be pretested, and thus delay and increase the cost of the finished program.

Power and Sample Size

Given a fixed number of observations (N), the only way to change the probability of a type I error is to shift the rejection criterion. Unfortunately, in this situation a shift of the rejection criterion which reduces the chances of a type I error simultaneously reduces the power of the test. Conversely, a shift of the criterion which increases the power of the test results in an increased probability of a type I error. Only by increasing N can the power of the test be increased without a simultaneous increase in the probability of a type I error. Similarly, only by increasing N can the probability of a type I error be reduced without diminishing the power of the test.

Since each test is based on the error data of the students utilized in pretesting the program, the power of each test will be inversely related to the performance level of the N students comprising the sample, and thus directly related to their overall error rate. If the N students perform quite well, observed error rates will be depressed for all items. Consequently, fewer unacceptable items will be rejected. Conversely, if the N students perform quite poorly, observed error rates will be higher and more frames (both acceptable and not) will be rejected. This suggests the desirability of obtaining measures characterizing the sample of students used in terms of relevant abilities, and of obtaining measures of their representativeness in terms of relevant academic achievement.

True Error Rate

The power of each test made by the programer also depends upon the degree to which the actual state of affairs approaches that stated in the null hypothesis. We can speak of the true error rate of a frame as that error rate which would be found for it if it were given to all of the intended population of students as a part of the finished program. If an unacceptable frame has a very high "true" error rate, the test based on pretesting data will be more likely to reject it than an unacceptable frame whose "true" error rate approaches acceptability.

"True" error rates are generally unknown for program frames. This lack of knowledge is what leads the programer to construct a test for selecting frames in the first place. All the programer can do, therefore, is accept the fact that his test will be most effective in eliminating the extremely unacceptable frames, even if this is at the cost of retaining some "borderline" unacceptable frames. However, the percentage of truly unacceptable frames rejected by tests based on data from a given pretesting sample of size N cannot be expected to be very high in the case of programs which have already gone through successful revision, since revision will already have eliminated the worst frames and left mainly borderline cases.
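Although the present study proceeds empirically, because errors on different frames are intercorrelated, the tradeoff just described can be made concrete under a simplifying assumption: that each of the N pretest students errs on a given frame independently, with probability equal to the frame's true error rate. The following sketch is an illustration under that binomial model, not the report's method.

    from math import comb

    def prob_reject(true_rate, n, criterion):
        """Probability that a frame's observed error rate reaches the
        rejection criterion when each of n students errs on it
        independently with probability true_rate (binomial model)."""
        return sum(comb(n, e) * true_rate**e * (1 - true_rate)**(n - e)
                   for e in range(n + 1)
                   if e / n >= criterion)    # the same rule the programer applies

    # With a 10% rejection criterion, compare the chance of erroneously
    # rejecting an acceptable frame (true rate 5%) with the power to reject
    # a clearly unacceptable frame (true rate 30%) as n grows.
    for n in (5, 15, 100):
        print(n,
              round(prob_reject(0.05, n, 0.10), 3),   # type I error risk
              round(prob_reject(0.30, n, 0.10), 3))   # power

Run for increasing n, the figures show the probability of a type I error on an acceptable frame falling while the power against a clearly unacceptable frame remains high, which is the relationship argued above.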
Method

Materials

The program. UICSM's programed instructional series, Part 110, was selected for study since it teaches relatively difficult concepts, and thus some variability in error rates for frames could be expected. Worksheet error data for 178 of the students who completed Part 110 (pure mode)[2] were recorded on SCRIBE[3] sheets and used with the SCRIBE system[4] to produce IBM cards containing individual error data for each student. Error data from the completed worksheets of the 52 additional pure-mode subjects were punched directly into IBM cards in the SCRIBE output format. Thus, data from 230 students were available for sampling.

Sampling procedure. All sampling of data from the basic pool of 230 students was done without replacement. A random sample of 100 students from all seven classes in the four participating schools was selected to establish criterion error rate measures on the 308 program frames in UICSM PIP Part 110. This sample of 100 was stratified with regard to classes and schools. A similar stratified random sampling procedure was then followed as closely as possible in selecting several other samples of various sizes. In cases where a stratified procedure was clearly impossible, such as in a sample of size one (N=1), ordinary random sampling was utilized. Three samples each of the following sizes were selected: N=1, N=2, N=3, N=4, N=5, N=10, and N=15. These samples were also merged to form a "summation sample" with N=120.

[2] See Beberman, M., and Stolurow, L. M. Comparative studies of principles for programing mathematics for programed instruction: semi-annual report. Urbana, Ill.: Univer. of Ill., 1963, for a description of the modes, schools, and classes involved in the 1962-63 tryout of the UICSM programs.

[3] SCRIBE is a system developed and used by ETS to score multiple-choice answer sheets and to automatically transcribe the data on the answer sheets to IBM cards.

[4] The SCRIBE technique of recording and processing worksheet data is described in Frincke, G. L., and Stolurow, L. M. Three methods of recording worksheet performance. Urbana, Ill.: Univer. of Ill., 1964. USOE Title VII, Technical Report No. 7. This work was done in cooperation with Educational Testing Service, and arrangements for it were made by Dr. Paul Jacobs. The ETS contribution was made possible by a grant from the Carnegie Corporation of New York.

Item analyses. Worksheet error data cards prepared from the worksheets of the subjects who constituted the criterion sample, the summation sample, and the 21 smaller samples were used as the basis for 23 separate item analyses[5] of the 308 items which comprise UICSM Programed Instruction Part 110. Three hundred and eight summary IBM cards were then prepared. Each of these cards contained the results of all 23 analyses with regard to one of the 308 program items. The summary cards were used to determine correlations between the results of item analyses based on the various samples. These cards were also used to determine which frames would be rejected and which accepted by tests based on the data of the various samples, in cases where the criterion for rejection was an observed error rate equal to or greater than 10%, 15%, or 20%.
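The logic of these comparisons can be sketched briefly. The routine below is a hypothetical illustration of the two indices defined earlier (the percent of truly unacceptable frames correctly rejected, and the percent of rejections made erroneously), not the item-analysis program actually used.

    def score_sample(criterion_rates, sample_rates, cutoff):
        """Classify each frame by the criterion sample's error rate, then
        score a smaller sample's rejections against that classification."""
        unacceptable = {f for f, r in enumerate(criterion_rates) if r >= cutoff}
        rejected = {f for f, r in enumerate(sample_rates) if r >= cutoff}
        caught = rejected & unacceptable          # bad frames correctly rejected
        erroneous = rejected - unacceptable       # acceptable frames rejected
        pct_caught = 100.0 * len(caught) / len(unacceptable) if unacceptable else 0.0
        pct_erroneous = 100.0 * len(erroneous) / len(rejected) if rejected else 0.0
        return pct_caught, pct_erroneous

    # Example with five frames and a 20% cutoff: the criterion sample marks
    # frames 3 and 4 unacceptable; a small sample rejects frames 2 and 3.
    crit = [0.02, 0.05, 0.10, 0.25, 0.40]
    small = [0.00, 0.00, 0.20, 0.20, 0.00]
    print(score_sample(crit, small, 0.20))   # (50.0, 50.0)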
Results

Figure 1 is a frequency distribution of the different error rates observed in Part 110 for the criterion sample (N=100). Table 1 shows the distributions of error rates observed in the 21 smaller samples. All of the distributions are quite skewed. Most items are well within the limits of acceptability, and the number of extremely unacceptable items in Part 110 is actually quite low. This is an important factor in interpreting the findings of this study.

[5] These analyses were carried out with the aid of an IBM 1620 computer and a program written by Mr. Scott Krueger, University of Illinois, Training Research Laboratory.

[Figure 1, plotting frequency (number of frames) against observed error rate for the criterion sample, and Table 1 appear here.]

The inconsistency of the smaller samples is readily seen when inspecting the overall error rate estimates presented in Table 2. The failure to obtain consistent results with smaller samples in the present study points up a major objection to the use of small pretesting samples. One cannot be confident that a small sample of students will produce individual and overall error rates consistent with those which would obtain in the population for which the program was intended. This objection, along with the fact that erroneous rejections are quite frequent when small samples are employed, must be seriously considered by the planner of a pretesting program. The cost of failing to reject unacceptable items, of rejecting acceptable items, and of inaccurately estimating the overall error rate for a program must be balanced against the cost of pretesting the program. When these things are considered, the N of the pretesting sample should be set as large as is practical. It should also be chosen so that the product of the desired rejection criterion and N is an integer; since observed error rates can take only values that are multiples of 1/N, a criterion whose product with N is not an integer effectively raises the cutoff to the next attainable rate, making the test stricter, and therefore less powerful, than intended. Choosing N so that the product is an integer thus maximizes the power of the test.

Summary

In spite of the common practice, in developing programed learning materials, of using small samples of students from the target population to accept or reject frames, there has been no examination of the implications of this practice. This study relates the problem to that of the statistician who is testing a large number of hypotheses. The concepts of rejection level, type I and type II errors, and the statistical concept of the power of a test are applied. The empirical nature of the study is important, since it is characteristic of the errors made to be intercorrelated and to form a skewed distribution with a mean that departs substantially from .5. Twenty-one independent samples of seven different sizes, three per size, were drawn from student worksheets used in learning from an algebra program based upon the UICSM curriculum. The hazards of small samples (up to N=15) with rejection criterion levels of 10%, 15%, and 20% were examined. Wide variations in efficiency among samples of a given size were observed, both in terms of (a) rejection of acceptable frames and (b) failure to reject unacceptable ones. Coupled with the inconsistency of small pretesting samples is the high frequency of erroneous rejections. It was recommended that pretest samples be both as large as practical and chosen so that the product of the desired rejection criterion and N is an integer, so as to maximize the power of the test.
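The arithmetic behind this last recommendation can be made concrete. Observed error rates based on N students fall only on multiples of 1/N, so the cutoff actually applied is the smallest attainable rate at or above the criterion. A brief illustrative sketch, not part of the original report:

    from math import ceil

    def effective_cutoff(criterion, n):
        """The attainable error rate at which rejection actually begins,
        given that observed rates come only in steps of 1/n."""
        return ceil(criterion * n) / n

    print(effective_cutoff(0.20, 15))  # 3/15 = 0.20: criterion * n is an integer,
                                       # so the intended cutoff applies exactly
    print(effective_cutoff(0.10, 15))  # 2/15 = 0.133...: the cutoff rises above
                                       # the intended 10%, costing power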