HHH 
 
 Hm I W Hbe« 
 
Faculty Working Papers 
 
 
 INTERTECHNIQUE CROSS-VALIDATION 
 IN CLUSTER ANALYSIS 
 
 A. Marvin Roscoe, Jagdish N. Sheth, 
 and Welling Howell 
 
 \ 
 
 #175 
 
 College of Commerce and Business Administration 
 
 University of Illinois at Urbana-Champaign 
 
FACULTY WORKING PAPERS 
 College of Commerce and Business Administration 
 University of Illinois at Urbana-Champaign 
 April 4, 1974 
 
 INTERTECHNIQUE CROSS-VALIDATION 
 IN CLUSTER ANALYSIS 
 
 A. Marvin Roscoe, Jagdish N. Sheth, 
 and Welling Howell 
 
 #175 
 
'•: 
 
 . 
 
Intertechnique Cross-Validation in Cluster Analysis 
 
 A. MARVIN ROSCOE, 
 JAGDISH N. SHETH, 
 
 and 
 WELLING HOWELL * 
 
 * A. Marvin Roscoe and Welling Howell are Marketing Supervisors in the 
 Market Research Section of the Marketing Department of the AT&T 
 Company. Jagdish N. Sheth is I.B.A. Distinguished Professor and 
 Research Professor at the University of Illinois, Urbana - Champaign, 
 
Digitized by the Internet Archive 
 
 in 2011 with funding from 
 
 University of Illinois Urbana-Champaign 
 
 http://www.archive.org/details/intertechniquecr175rosc 
 
Intertechnique Cross -Va lie 1 it ion in Cluster Analysis 
 
 In view of the fact that in practical marketing research clustering methods 
 are utilized to define homogeneous market segments by empirical research, 
 it is critical to ensure that the derived clusters are, in fact, the true 
 clusters. One procedure of ensuring cluster invariance is replication, 
 but this is not always practical. A second procedure common in 
 psychometrics is one of cross-validating the results by external validation. 
 
 The objective of this paper is to describe a cross-validation procedure 
 which utilized intertechnique comparisons of the clustering results. The 
 procedure is applied to two hierarchical procedures in which the problem 
 involves the determination of geographical heterogeneity of markets for 
 the telephone industry. 
 
Intertechnique Cross-Validation in Cluster Analysis 
 
 Cross-validation among techniques seems essential in cluster analysis 
 because most clustering methods tend to be heuristic algorithms instead of 
 analytically optimal solutions. (See Joyce and Channon [k] and Frank and 
 Green' [2] for a review of the numerous clustering methods available today). 
 As heuristic algorithms, they have no sampling theory for statistical in- 
 ferences about the size and the number of clusters. Also, there are no ex- 
 ternal' validation procedures to ensure that the clusters derived from a 
 specific cluster analysis are in reality the true invariant clusters. The 
 potential statistical problem of obtaining artifacts as clusters is further 
 compounded in some procedures which require a priori assumptions about the 
 size and the number of clusters. Although a number of clustering methods 
 perform statistical tests such as the F ratio or Wilks' Lambda based on 
 analysis of variance principles to guard against obtaining random solutions, 
 no procedure exists which will increase the assurance that a nonrandom cluster 
 
 T i i 
 
 solution is in fact the true cl uster solution. 
 
 Because clustering methods are used in marketing research to identify 
 homogeneous market segments for selective marketing efforts, it is critical 
 that the clusters derived from a heuristic algorithm are the true clusters. 
 One procedure to ensure cluster invariance is replication which, however, is 
 not always practical. Another procedure is the common practice in psychometric^ 
 of cross-validating the results by external validation. 
 
- 2 - 
 
 The objective of this paper is to describe a cross-validation procedure 
 which utilizes intertechnique comparisions of the clustering results. Although 
 the actual study entailed applications of five different clustering techniques, 
 our discussion is limited to two techniques in this paper due to space 
 limitations. A brief description of the large scale research project is provided 
 in which the clustering results were essential to formulating an experimental 
 design for a field experiment. 
 
 DESCRIPTION OF THE STUDY 
 The major research study consisted of a three factorial -6k cell experi- 
 mentation on survey research methods. The three factors were: first, two 
 different lengths of the questionnaire; second, four different follow-up 
 procedures; and, third, the market heterogeneity of geographical areas of the 
 United States with respect to consumer telephone behavior and socioeconomic - 
 demographic characteristics (see [6]). The levels of the first two factors 
 were predetermined based on theory, prior research and practical implications 
 for the ongoing research en a longitudinal national panel of telephone 
 customers. For the third- factor, it was necessary to determine the heterogeneity 
 of the markets by empirical research which utilized clustering methods. 
 
 To define the market heterogeneity, profile data on 20,000 residential 
 telephone customers were used for clustering. These customers are part of a 
 longitudinal consumer panel called the Marketing Research Information System 
 which is maintained for the Bell System by AT&T. The panel members are selected 
 based on a mult i staged stratified sample in which the first stage of the 
 sampling procedure consists of 100 Revenue Accounting Offices (RAOs) representing 
 the entire Bell System. The profile consists of essentially three types of 
 information about each panel member: 
 
- 3 - 
 
 (a) his socioeconomic - demographic status and housing characteristics determi- 
 ned by a survey conducted in early 1970 and matched with the 1970 Census, (b) 
 his monthly- telephone behavior broken down into several categories as determined 
 by the industry practice, and (c) an inventory of his telephone equipment in- 
 cluding number and types of telephones, and additional services. 
 
 Since it was required to empirically investigate the geographical hetero- 
 geneity of the markets, an average profile of the residential telephone 
 customers was determined for each of the 86 RAOs for which detailed and complete 
 information was available. 
 
 A total of 65 customer descriptors were used to represent the total profile 
 of customers. A list of the variables is shown in Table 1. A factor analysis 
 (principal components) solution with orthogonal Varimax rotation was performed 
 on the data for the following reasons: (a) to reduce the multicollinearity 
 among variables so that the profile consisted of orthogonal factor scores which 
 are geometrically essential to calculate Euclidian distances, (b) to equalize 
 the relative weights of each of the underlying dimensions which could otherwise 
 be easily changed by arbitrary dropping or adding of profile variables, and 
 (c) to standardize the diverse scales of measurement common across the socio- 
 economic, demographic and telephone information [5]. Ten significant factors 
 were extracted from the analysis which summarized 92 percent of the total 
 variance. A brief description of the factors is provided in Table 2. 
 
 The number of significant factors was determined using several criteria, 
 both statistical and judgmental, following the recommendations of Rummel [7]. In 
 addition, the stability of the factor structure was also determined by comparing 
 the results with other data analyses to ensure the invariance of the 
 fundamental dimensionality and structure of the profile data. 
 
The standarized rotated factor scores for each RAO were then utilized to 
 compute Euclidian distances "between all combinations of RAOs. The resultant 
 
 86 X 86 distance matrix became the input to the clustering procedures. 
 
 * 
 
 Due to the following distinct advantages, Johnson's Hierarchical Clustering 
 method [3] was chosen as the primary clustering technique for determining the 
 market heterogeneity. First, it is strictly empirical; second, no prior 
 assumptions are required on the part of the researcher; and third, a hierarchical 
 display is provided of the clusters being formed based on a function minimizing 
 the pairwise distances among entities. While the size of the distance matrix 
 is a limitation of the technique, it was not a problem in our case because of 
 the relatively small number of RAOs to be clustered. Due to the structure of the 
 distance matrix and the presumption of the "ultrametric inequality", 
 [3 S p. 2^8-93 the diameter method was chosen instead of the connectedness method 
 in the BE-HICLUST solutions. The results are diagramed in Figure 1. 
 
 While the hierarchical clusters from HICLUST were meaningful, and had strong 
 face validity, it was necessary to cross-validate the results by at least one 
 other technique which was essentially similar ia its input requirements, 
 analytic strategies and the output- format. For this we chose the cluster 
 analysis program developed as part of the BMDP Series which is also a 
 hierarchical clustering routine based on sum of squares distances and the 
 amalgamation principle [l]. In short, BMDP2M amalgamates entities based on the 
 criterion of the smallest distance. Once a cluster is formed, consisting of 
 at least two entities, it calculates the average profile of the cluster and 
 treats it as if it were a new entity which is then clustered with other entities 
 or clusters based on the principle of smallest distances. The process continues 
 until all entities and clusters are hierarchically linked at different levels 
 of distances. The results of the BMDP2M analysis are diagramed in Figure 2. 
 
- 5 - 
 
 As can be seen, the two hierarchical clusters are similar in their 
 structure and hierarchy suggesting that there is a good cross-validation 
 between the two analyses . In order to quantitatively assess the degree of 
 congruence between the two hierarchical clusters, two distinct statistical 
 procedures were utilized. The first procedure consisted of calculating the 
 correlation coefficient for the two distributions of distances at which 
 linkages were made between entities or clusters in each hierarchical analysis. 
 Since the number of linkages is not likely to be identical, we have selected 
 the maximum number of links of one technique and the corresponding number of 
 the other technique. The correlation coefficient between the sequential 
 linkage distances is 0.99^. which is highly positive indicating extreme closeness 
 of the hierarchical structure of the two cluster analyses. 
 
 Another procedure for cross-validation consisted of examining the cluster: 
 developed at some specific levels of distances. Based on the plotting of 
 distances at which linkages were made, for the BE-HICLUST results a distance 01 
 5.00 was indicated as a cutoff point due to the natural break in the curve 
 suggesting a clear truncation. 
 
 The linkage for the BMDP2M results were also plotted and the natural break 
 in the linkages occurred at 3.1. This was at the point where all the clusters 
 had been formed. After this point the BMDP2M analysis indicated 15 unique entitle 
 that were not identified with any of the defined clusters. In order to produce 
 comparable results, the cutoff point for the BE-HICLUST diagram was moved to 
 3.5 for the cross-validation. The clusters could be identified by their 
 geographical orientation and have been labeled Eastern, Southern, Central and 
 Western. Metropolitan has been used for large urban areas not specifically 
 associated with regional areas . The clusters derived from the two techniques 
 are marked in Figures 1 and 2 and are cross-tabulated in Table 3. 
 
- 6 - 
 
 A total of IT clusters are displayed in Table 3, consisting of 13 regional 
 clusters (Eastern, Southern, Central and Western), three metropolitan cities 
 clusters and the last one representing all the unique RAOs which could not be 
 clustered due to their extreme distances from other RAOs, The cross-tabulation 
 
 between HICLU3T and BMDP2M clustering results indicates that 62 out of 86 RAOs 
 fell on the diagonal of the crosstab matrix which represents a hit of 72 
 percent correct classifications in terms of intertechnique results. Futher- 
 more, most of the off-diagonal elements generally fall across clusters within 
 the same geographical region- In Table 4, a cross- tabulation at the regional 
 level is provided which shows that 75 out of 36 RAOs could be correctly 
 classified on an intertechnique basis. This represents a hit of 72 percent. 
 
 While the two results are quite comparable, there are differences in the 
 example worth noting. The BE-HICLUST algorithm appears to provide a more 
 logical structure to the clusters which are grouped by region as indicated in 
 Figure 2. In addition, the BE-HICLUST method seems to work better where large 
 distances are involved, associating 8 of the Ik unique entities with meaningful 
 clusters. Such differences reinforce the need to use several techniques and to 
 understand the advantages of each especially where the researcher's judgement 
 plays such an important role. 
 
- T - 
 
 SIJMMARY AND CONCLUSIONS 
 
 We have pointed out the need for intertechnique cross-validation in 
 cluster analysis due to the heuristic nature of most clustering procedures 
 and the judgemental decisions required to interpret the results. In this 
 paper, we have also presented a concrete application of two statistical 
 procedures which enable the researcher to quantitatively measure the con- 
 gruence of structure and content of clusters across techniques. The first 
 consists of a correlation coefficient index calculated on the distributions 
 of distances at which sequential linkages are made among entities or clusters 
 or both. The second consists of a cross-tabulation of specific clusters 
 derived across two different solutions. In this paper the intertechnique 
 cross-validation procedures have been applied with respect to two 
 hierarchical clustering procedures in which the problem was the determination 
 of geographical heterogeneity of markets for the telephone industry. 
 
Table 1 
 LIST OF VARIABLES 
 
 Housing 
 
 1. 
 
 Own-rent he 
 
 2 . 
 
 Type of residence 
 
 3. 
 
 Number of roc. 
 
 
 Mobility 
 
 4. 
 
 Length oi res 
 
 
 td of Household 
 
 5. 
 
 Sex 
 
 6. 
 
 Age 
 
 ?.; 
 
 Education 
 
 8/ 
 
 Occupation 
 
 
 Family 
 
 9. 
 
 Income 
 
 -i-U . 
 
 Number in family 
 
 xl . 
 
 Average Age 
 
 12. 
 
 Life cycle 
 
 13. 
 
 SES status 
 
 Telephone Service and Equipmant 
 14. Class of service 
 If? , Grade of service 
 16* Number of telephones 
 17, Number of vertical services 
 
 18-29 
 
 30-41 
 4.2 ~ 
 
 Billing Items 12 months 
 Local service 
 Local message 
 Intrastate long distance 
 Interstate long distance 
 
Table 2 
 FACTOR. DIMENSION LABELS 
 
 1. Local service bill : 
 
 2. Local message billing 
 
 3. Intrastate long nee 
 
 4. Family - housii 
 
 i>. Intrastate long distance 
 
 6. Life cycle 
 
 . Service and equipment 
 
 8. Interstate Long Distance 1 * 
 
 9. Interstate Long Distance 2 * 
 10. Socioeconomic Characteristics 
 
 * The two factors for interstate long distance represent different 
 seasonal patterns of calling across geographical areas. 
 

 > 
 
 E-t 
 CO 
 
 S3 
 
 o 
 
 « 
 
 w 
 
 
 o 
 
 ! 
 
 a 
 is 
 
 6 
 
 1 
 
 < 
 o 
 
 M 
 
 O 
 
 n 
 
 A 
 
 £1 
 
 £1 
 
 h 
 
 n 
 
 h 
 
 1 1 1 1 1 n 
 
 h 
 
 i 
 
 n. 
 
 ,1 
 
 j 
 
 JLl 
 
 XI 
 
 J. 
 
 h 
 
 o 
 
 "so 
 
 
 o 
 
 •6 
 
 5 Jr^ 
 
 o s- y 
 
 t> Oft 
 
 < cj p. ;-. '.: i , ,-, 
 
 I jhi • ux^ui 
 
 y v v 
 
 >-5 >> to •-! 
 El to <■><>• <-*■< 
 I-. -i i.i a. <fl fj t j S 
 c-. -> o B E Sti 
 
 MM M 
 M >-1 •< - 
 Op0.bt 
 
 aSLraB 
 
 .<n q 
 
 tip; 
 
 N t \ / \J 
 
 J«9_ 
 gig 
 
 in 
 
 X 
 CIIC 
 
 7-. (-. 
 

 
 
 
 
o 
 
 © 
 
 o 
 
 
 hh -5 H '> «4-< 
 
 \ r 
 
 Sti 
 
 '-■ In I < I - .< 
 
 <•■ 'J iq t.-j • . 
 tt CT £ i fc-. 
 
 K it 
 
 Co u 
 
 HI .-I IN 
 
 Bel 
 
 .'. 'i a * « 
 
 «; i - w rr; H 
 
 " : < hi «< :-. o 
 
 gg H,w s & 
 
 oi Lc v. :-. jti tn —• 
 
 . <'.-. 3 ■ ■■- 
 
 o :. t-. m ,.. tj 
 
 t-l i: SB 9 X >< '--. •< 
 I -. '.- > ii»l. - 
 v, b.^ic ,(>(• w 
 
3 
 3 
 
 o 
 o 
 
 r-t 
 
 O 
 
 S 
 
 I 
 
 o 
 
 ft 
 
 
 • r 
 
 
 
 
 
 1 
 
 or el 
 
 k a; 
 
 - 
 
 
 
 
 H 
 
 3 
 
 H 
 
 M 
 
 £ CM 
 
 
 
 
 • 
 
 CM 
 
 
 
 
 
 
 
 CM 
 
 
 c\ 
 
 
 
 
 r-l ^ 
 
 
 
 i 
 
 
 
 
 
 
 
 H 
 CO CM 
 
 cm h 
 
 
 
 CM 
 
 H 
 
 
 H 
 
 
 
 
 sO f-l 
 
 
 
 >J- 
 
 
 
 c\ c\ 
 
 
 
 
 j c*\ 
 
 
 
 \r\ 
 
 
 
 
 H 
 
 O 
 
 
 
 CM >J- 
 
 
 
 
 rH 
 
 H 
 
 
 c% 
 
 
 
 
 C\ 
 
 
 CM 
 
 H 
 
 
 
 
 
 
 
 
 
 
 
 fc? N 
 
 8 
 
 
 H 
 
 
 
 
 
 H 
 
 c*\ 
 
 ^ 
 
 
 
 
 
 ■ ■ ■ 
 
 C*N 
 
 fV <m 
 
 
 
 
 
 
 g 
 
 
 
 
 
 
 
 
 r> 
 
 
 
 
 
 
 .H 
 
 v > . -. 
 
 
 r^ 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 : " 
 
 H M <"v 
 
 
 
 
 H(^C\ 
 
 H <V i^ 
 
 rl(Vl*\-* 
 
 H t\J CI 
 
 
 
 
 3 
 
 1 
 
 1 a 
 
 1 
 
 w 
 
 C_> 
 
 i 
 
 H 
 ■J 
 
 8 
 
 a 
 
 H 
 
 CO 
 

 
 
 
C) 
 
 
 
 1 
 
 
 t 
 W 
 
 03 
 
 t> CO 
 
 
 — — ■ — , 
 
 - ■ ■■■ 
 
 
 
 
 
 cy o 
 
 
 
 
 
 
 
 H 
 
 .->* 
 
 H < 
 
 
 
 
 
 
 
 
 iH 
 
 
 
 
 
 
 . 
 
 
 
 
 (•'-> 
 
 « 
 
 (, „ -,... 
 
 
 
 
 
 
 CM 
 
 
 
 
 _ 
 
 
 
 
 
 
 a 
 
 
 
 
 
 
 
 
 M 
 
 
 
 
 
 
 
 
 
 hJ 
 
 
 
 
 
 
 
 
 
 s w 
 
 
 
 
 
 
 CN 
 
 
 
 O 
 
 
 
 
 
 
 
 
 
 « 
 
 
 
 
 
 
 
 
 
 H 
 
 
 
 
 
 
 
 
 
 1 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 «H 
 
 
 
 
 
 CM 
 
 
 
 
 
 
 
 ■"""" *"■"< 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 § 
 
 
 
 
 
 
 
 
 
 w 
 
 
 
 
 
 
 
 
 
 w 
 
 
 
 
 
 
 
 
 
 En 
 
 r*\ 
 
 
 
 -4 
 
 <H 
 
 
 
 
 CO 
 
 
 
 
 r-f 
 
 
 
 
 
 B 
 
 
 
 
 
 
 
 
 
 ^ 
 
 .J 
 
 { 
 
 
 
 _ .,,„,, 
 
 
 
 
 J--} 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 i ; 
 
 H 
 
 
 o 
 
 CM 
 
 
 
 
 
 
 M ! 
 
 
 
 
 
 
 
 
 
 O 
 
 .„„„„ „, kw . I1< ( 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 !?• 
 
 
 
 
 
 
 
 
 
 fxj 
 
 
 
 
 
 
 
 
 
 i 
 .1-1 
 
 
 
 
 
 
 
 
 
 e 
 
 " 
 
 O 
 
 H 
 
 
 
 
 
 
 o 
 
 
 
 
 
 
 
 
 
 w 
 
 
 
 
 
 
 
 
 
 52 
 
 
 
 
 
 1 " ' 
 
 
 
 
 
 
 
 
 
 
 
 
 
 H 
 
 H 
 
 
 H 
 
 
 
 
 
 
 Ti 
 
 H 
 
 
 
 
 
 
 
 
 & 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 ' 
 
 
 
 
 
 
 
 
 
 
 
 
 ' 
 
 
 r~* 
 
 CM 
 
 C> 
 
 
 
 
 
 
 
 
 < 
 
 5= 
 
 00 
 
 o 
 
 
 1 
 
 
 
 
 M 
 
 M 
 
 M 
 
 •^ 
 
 
 
 b'-I 
 
 
 
 ►J 
 
 f-3 
 
 b9 
 
 p3 
 
 
 55 
 
 prt 
 
 h3 
 
 53 
 
 c2 
 
 o 
 
 § 
 
 o 
 
 
 
 ce; i 
 
 
 <3 
 
 p., 
 
 P< 
 
 •— > 
 
 
 W ;' 
 
 • i-i 
 
 
 w 
 
 o 
 
 o 
 
 o 
 
 
 H 
 
 | H 
 
 fc-* ;-f 
 
 pel 
 
 K 
 
 trj 
 
 C? 
 
 
 GO 
 
 
 25 to 
 
 i 
 
 JH 
 
 H 
 
 M 
 
 
 W t 
 
 o 
 to 
 
 
 
 1 . 
 
 | 
 
 a 
 
 
REFERENCES 
 
 1. Dixon, W.J. "BMD P -Series Documentation," Health Sciences Computing 
 Facility* Los Angeles: University of California. 
 
 2. Frank, Ronald B. and Paul E. Gi ^en« "Numerical Taxonomy in Marketing 
 Analysis: A Review Article," Journal of Mark eting Research , 5 
 (February 196*8), 83-98. 
 
 3. Johnson, Stephen C. "Hierarchical Clustering Schemes," Psychometrika , 
 32 (September 196?), 2Ul-5*+. 
 
 k. Joyce, Timothy and C. Channon. "Classifying Market Segment Respondents," 
 Applied Stati stics , 15 (Nobember i960), 101-215- 
 
 5. Morrison, Donald G. "Measurement Problems in Cluster Analysis," Management 
 Science, 13 (August 196?), B775-80 
 
 6. Roscoe, A. Marvin, Dorothy Lang and Jagdish W. Sheth. "Experimental 
 Effects of Follow-up Methods, Questionnaire Length, and Market Heterogeniety 
 in Mail Surveys", Manuscript submitted for publication. 
 
 7. Rummel, R.J. Applied Factor Analysis. Evanston: Northwestern University 
 Press, 1970 5 Chapter 15. 
 
UNIVERSfTY OF ILLINOIS-URBANA 
 
 3 0112 060296784 
 
 hAbH 
 
 HH BOB 
 
 Wamm 
 
 mm H 
 
 ff£31§8 HHBkB 
 
 ■uaBHwe HI 
 
 H bWI nnsbbi 
 
 EZS1 KmMM 
 
 BHHbBI 
 ■BBKVHBBp 
 
 SB2HR2 SVcmmB