Data Mining In Excel: Lecture Notes and Cases
Preliminary Draft 2/04

Nitin R. Patel
Peter C. Bruce

(c) Quantlink Corp. 2004

Distributed by:
Resampling Stats, Inc.
612 N. Jackson St.
Arlington, VA 22201
USA
[email protected]
www.xlminer.com

Contents

1 Introduction
  1.1 Who is This Book For?
  1.2 What is Data Mining?
  1.3 Where is Data Mining Used
  1.4 The Origins of Data Mining
  1.5 Terminology and Notation
  1.6 Organization of Data Sets
  1.7 Factors Responsible for the Rapid Growth of Data Mining

2 Overview of the Data Mining Process
  2.1 Core Ideas in Data Mining
    2.1.1 Classification
    2.1.2 Prediction
    2.1.3 Affinity Analysis
    2.1.4 Data Reduction
    2.1.5 Data Exploration
    2.1.6 Data Visualization
  2.2 Supervised and Unsupervised Learning
  2.3 The Steps In Data Mining
  2.4 SEMMA
  2.5 Preliminary Steps
    2.5.1 Sampling from a Database
    2.5.2 Pre-processing and Cleaning the Data
    2.5.3 Partitioning the Data
  2.6 Building a Model - An Example with Linear Regression
    2.6.1 Can Excel Handle the Job?

3 Supervised Learning - Classification & Prediction
  3.1 Judging Classification Performance
    3.1.1 A Two-class Classifier
    3.1.2 Bayes’ Rule for Minimum Error
    3.1.3 Practical Assessment of a Classifier Using Misclassification Error as the Criterion
    3.1.4 Asymmetric Misclassification Costs and Bayes’ Risk
    3.1.5 Stratified Sampling and Asymmetric Costs
    3.1.6 Generalization to More than Two Classes
    3.1.7 Lift Charts
    3.1.8 Example: Boston Housing (Two classes)
    3.1.9 ROC Curve
    3.1.10 Classification using a Triage strategy

4 Multiple Linear Regression
  4.1 A Review of Multiple Linear Regression
    4.1.1 Linearity
    4.1.2 Independence
    4.1.3 Unbiasedness
  4.2 Illustration of the Regression Process
  4.3 Subset Selection in Linear Regression
  4.4 Dropping Irrelevant Variables
  4.5 Dropping Independent Variables With Small Coefficient Values
  4.6 Algorithms for Subset Selection
    4.6.1 Forward Selection
    4.6.2 Backward Elimination
    4.6.3 Step-wise Regression (Efroymson’s method)
    4.6.4 All Subsets Regression
  4.7 Identifying Subsets of Variables to Improve Predictions

5 Logistic Regression
  5.1 Example 1: Estimating the Probability of Adopting a New Phone Service
  5.2 Multiple Linear Regression is Inappropriate
  5.3 The Logistic Regression Model
  5.4 Odds Ratios
  5.5 Probabilities
  5.6 Example 2: Financial Conditions of Banks
    5.6.1 A Model with Just One Independent Variable
    5.6.2 Multiplicative Model of Odds Ratios
    5.6.3 Computation of Estimates
  5.7 Appendix A - Computing Maximum Likelihood Estimates and Confidence Intervals for Regression Coefficients
    5.7.1 Data
    5.7.2 Likelihood Function
    5.7.3 Loglikelihood Function
    5.7.4 Algorithm
  5.8 Appendix B - The Newton-Raphson Method

6 Neural Nets
  6.1 The Neuron (a Mathematical Model)
  6.2 The Neuron (a mathematical model)
    6.2.1 Single Layer Networks
    6.2.2 Multilayer Neural Networks
  6.3 Example 1: Fisher’s Iris data
  6.4 The Backward Propagation Algorithm
    6.4.1 Forward Pass - Computation of Outputs of all the Neurons in the Network
    6.4.2 Backward Pass: Propagation of Error and Adjustment of Weights
  6.5 Adjustment for Prediction
  6.6 Multiple Local Optima and Epochs
  6.7 Overfitting and the choice of training epochs
  6.8 Adaptive Selection of Architecture
  6.9 Successful Applications

7 Classification and Regression Trees
  7.1 Classification Trees
  7.2 Recursive Partitioning
  7.3 Example 1 - Riding Mowers
  7.4 Pruning
  7.5 Minimum Error Tree
  7.6 Best Pruned Tree
  7.7 Classification Rules from Trees
  7.8 Regression Trees

8 Discriminant Analysis
  8.1 Example 1 - Riding Mowers
  8.2 Fisher’s Linear Classification Functions
  8.3 Measuring Distance
  8.4 Classification Error
  8.5 Example 2 - Classification of Flowers
  8.6 Appendix - Mahalanobis Distance

9 Other Supervised Learning Techniques
  9.1 K-Nearest neighbor
    9.1.1 The K-NN Procedure
    9.1.2 Example 1 - Riding Mowers
    9.1.3 K-Nearest Neighbor Prediction
    9.1.4 Shortcomings of k-NN algorithms
  9.2 Naive Bayes
    9.2.1 Bayes Theorem
    9.2.2 The Problem with Bayes Theorem
    9.2.3 Simplify - assume independence
    9.2.4 Example 1 - Saris

10 Affinity Analysis - Association Rules
  10.1 Discovering Association Rules in Transaction Databases
  10.2 Support and Confidence
  10.3 Example 1 - Electronics Sales
  10.4 The Apriori Algorithm
  10.5 Example 2 - Randomly-generated Data
  10.6 Shortcomings

11 Data Reduction and Exploration
  11.1 Dimensionality Reduction - Principal Components Analysis
  11.2 Example 1 - Head Measurements of First Adult Sons
  11.3 The Principal Components
  11.4 Example 2 - Characteristics of Wine
  11.5 Normalizing the Data
  11.6 Principal Components and Orthogonal Least Squares

12 Cluster Analysis
  12.1 What is Cluster Analysis?
  12.2 Example 1 - Public Utilities Data
  12.3 Hierarchical Methods
    12.3.1 Nearest neighbor (Single linkage)
    12.3.2 Farthest neighbor (Complete linkage)
    12.3.3 Group average (Average linkage)
  12.4 Optimization and the k-means algorithm
  12.5 Similarity Measures
  12.6 Other distance measures

13 Cases
  13.1 Charles Book Club
  13.2 German Credit
  13.3 Textile Cooperatives
  13.4 Tayko Software Cataloger
  13.5 IMRB : Segmenting Consumers of Bath Soap

Chapter 1

Introduction

1.1 Who is This Book For?

This book arose out of a data mining course at MIT’s Sloan School of Management. Preparation for the course revealed that there are a number of excellent books on the business context of data mining, but their coverage of the statistical and machine-learning algorithms that underlie data mining is not sufficiently detailed to provide a practical guide if the instructor’s goal is to equip students with the skills and tools to implement those algorithms. On the other hand, there are also a number of more technical books about data mining algorithms, but these are aimed at the statistical researcher, or more advanced graduate student, and do not provide the case-oriented business focus that is successful in teaching business students.

Hence, this book is intended for the business student (and practitioner) of data mining techniques, and its goal is threefold:

1. To provide both a theoretical and practical understanding of the key methods of classification, prediction, reduction and exploration that are at the heart of data mining.

2. To provide a business decision-making context for these methods.

3. Using real business cases, to illustrate the application and interpretation of these methods.

The presentation of the cases is structured so that the reader can follow along and implement the algorithms on his or her own with a very low learning curve.

While the genesis for this book lay in the need for a case-oriented guide to teaching data-mining, analysts and consultants who are considering applying data mining techniques in contexts where they are not currently in use will also find this a useful, practical guide.

An important feature of this book is the use of Excel, an environment familiar to business analysts. All required data mining algorithms (plus illustrative data sets) are provided in an Excel add-in, XLMiner.

1.2 What is Data Mining?

The field of data mining is still relatively new, and in a state of evolution. The first International Conference on Knowledge Discovery and Data Mining (“KDD”) was held in 1995, and there are a variety of definitions of data mining.

A concise definition that captures the essence of data mining is:

“Extracting useful information from large data sets” (Hand, et al: 2001).

A slightly longer version is:

“Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.” (Berry and Linoff: 1997 and 2000)

Berry and Linoff later had cause to regret the 1997 reference to “automatic and semi-automatic means,” feeling it shortchanged the role of data exploration and analysis.

Another definition comes from the Gartner Group, the information technology research firm (from their web site, Jan. 2004):

“Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”

A summary of the variety of methods encompassed in the term “data mining” follows below (“Core Ideas”).

1.3 Where is Data Mining Used

Data mining is used in a variety of fields and applications. The military might use data mining to learn what roles various factors play in the accuracy of bombs. Intelligence agencies might use it to determine which of a huge quantity of intercepted communications are of interest. Security specialists might use these methods to determine whether a packet of network data constitutes a threat. Medical researchers might use them to predict the likelihood of a cancer relapse.

Although data mining methods and tools have general applicability, in this book most examples are chosen from the business world. Some common business questions one might address through data mining methods include:

1. From a large list of prospective customers, which are most likely to respond? We could use classification techniques (logistic regression, classification trees or other methods) to identify those individuals whose demographic and other data most closely match those of our best existing customers. Similarly, we can use prediction techniques to forecast how much individual prospects will spend.

2. Which customers are most likely to commit fraud (or might already have committed it)? We can use classification methods to identify (say) medical reimbursement applications that have a higher probability of involving fraud, and give them greater attention.

3. Which loan applicants are likely to default? We might use classification techniques to identify them (or logistic regression to assign a “probability of default” value).

4. Which customers are more likely to abandon a subscription service (telephone, magazine, etc.)? Again, we might use classification techniques to identify them (or logistic regression to assign a “probability of leaving” value). In this way, discounts or other enticements might be proffered selectively where they are most needed.

1.4 The Origins of Data Mining

Data mining stands at the confluence of the fields of statistics and machine learning (also known as artificial intelligence). A variety of techniques for exploring data and building models have been around for a long time in the world of statistics - linear regression, logistic regression, discriminant analysis and principal components analysis, for example. But the core tenets of classical statistics - computing is difficult and data are scarce - do not apply in data mining applications where both data and computing power are plentiful.
This gives rise to Daryl Pregibon’s description of data mining as “statistics at scale and speed.” A useful extension of this is “statistics at scale, speed, and simplicity.” Simplicity in this case refers not to simplicity of algorithms, but rather to simplicity in the logic of inference. Due to the scarcity of data in the classical statistical setting, the same sample is used to make an estimate, and also to determine how reliable that estimate might be. As a result, the logic of the confidence intervals and hypothesis tests used for inference is elusive for many, and their limitations are not well appreciated. By contrast, the data mining paradigm of fitting a model with one sample and assessing its performance with another sample is easily understood.

Computer science has brought us “machine learning” techniques, such as trees and neural networks, that rely on computational intensity and are less structured than classical statistical models. In addition, the growing field of database management is also part of the picture.

The emphasis that classical statistics places on inference (determining whether a pattern or interesting result might have happened by chance) is missing in data mining. In comparison to statistics, data mining deals with large data sets in open-ended fashion, making it impossible to put the strict limits around the question being addressed that inference would require. As a result, the general approach to data mining is vulnerable to the danger of “overfitting,” where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data, but random peculiarities as well. In engineering terms, the model is fitting the noise, not just the signal.

1.5 Terminology and Notation

Because of the hybrid parentage of data mining, its practitioners often use multiple terms to refer to the same thing.
For example, in the machine learning (artificial intelligence) field, the variable being predicted is the output variable or the target variable. To a statistician, it is the dependent variable. Here is a summary of terms used:

“Algorithm” refers to a specific procedure used to implement a particular data mining technique - classification tree, discriminant analysis, etc.

“Attribute” is also called a “feature,” “variable,” or, from a database perspective, a “field.”

“Case” is a set of measurements for one entity - e.g. the height, weight, age, etc. of one person; also called “record,” “pattern” or “row” (each row typically represents a record, each column a variable).

“Confidence” has a specific meaning in association rules of the type “If A and B are purchased, C is also purchased.” Confidence is the conditional probability that C will be purchased, IF A and B are purchased. “Confidence” also has a broader meaning in statistics (“confidence interval”), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.

“Dependent variable” is the variable being predicted in supervised learning; also called “output variable,” “target variable” or “outcome variable.”

“Estimation” means the prediction of the value of a continuous output variable; also called “prediction.”

“Feature” is also called an “attribute,” “variable,” or, from a database perspective, a “field.”

“Input variable” is a variable doing the predicting in supervised learning; also called “independent variable,” “predictor.”

“Model” refers to an algorithm as applied to a data set, complete with its settings (many of the algorithms have parameters which the user can adjust).

“Outcome variable” is the variable being predicted in supervised learning; also called “dependent variable,” “target variable” or “output variable.”

“Output variable” is the variable being predicted in supervised learning; also called “dependent variable,” “target variable” or “outcome variable.”

“P (A|B)” is read as “the probability that A will occur, given that B has occurred.”

“Pattern” is a set of measurements for one entity - e.g. the height, weight, age, etc. of one person; also called “record,” “case” or “row” (each row typically represents a record, each column a variable).

“Prediction” means the prediction of the value of a continuous output variable; also called “estimation.”

“Record” is a set of measurements for one entity - e.g. the height, weight, age, etc. of one person; also called “case,” “pattern” or “row” (each row typically represents a record, each column a variable).

“Score” refers to a predicted value or class. “Scoring new data” means to use a model developed with training data to predict output values in new data.

“Supervised Learning” refers to the process of providing an algorithm (logistic regression, regression tree, etc.) with records in which an output variable of interest is known and the algorithm “learns” how to predict this value with new records where the output is unknown.

“Test data” refers to that portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on additional data.

“Training data” refers to that portion of data used to fit a model.

“Unsupervised Learning” refers to analysis in which one attempts to learn something about the data other than predicting an output value of interest (whether it falls into clusters, for example).

“Validation data” refers to that portion of the data used to assess how well the model fits, to adjust some models, and to select the best model from among those that have been tried.

“Variable” is also called a “feature,” “attribute,” or, from a database perspective, a “field.”

1.6 Organization of Data Sets

Data sets are nearly always constructed and displayed so that variables are in columns, and records are in rows. In the example below (the Boston Housing data), the values of 14 variables are recorded for a number of census tracts. Each row represents a census tract - the first tract had a per capita crime rate (CRIM) of 0.02729, had 0 of its residential lots zoned for over 25,000 square feet (ZN), etc. In supervised learning situations, one of these variables will be the outcome variable, typically listed at the end or the beginning (in this case it is median value, MEDV, at the end).

1.7 Factors Responsible for the Rapid Growth of Data Mining

Perhaps the most important factor propelling the growth of data mining is the growth of data. The mass retailer Walmart in 2003 captured 20 million transactions per day in a 10-terabyte database. In 1950, the largest companies had only enough data to occupy, in electronic form, several dozen megabytes (a terabyte is 1,000,000 megabytes).

The growth of data themselves is driven not simply by an expanding economy and knowledge base, but by the decreasing cost and increasing availability of automatic data capture mechanisms. Not only are more events being recorded, but more information per event is captured. Scannable bar codes, point of sale (POS) devices, mouse click trails, and global positioning satellite (GPS) data are examples.

The growth of the internet has created a vast new arena for information generation. Many of the same actions that people undertake in retail shopping, exploring a library or catalog shopping have close analogs on the internet, and all can now be measured in the most minute detail.

In marketing, a shift in focus from products and services to a focus on the customer and his or her needs has created a demand for detailed data on customers.

The operational databases used to record individual transactions in support of routine business activity can handle simple queries, but are not adequate for more complex and aggregate analysis. Data from these operational databases are therefore extracted, transformed and exported to a data warehouse - a large integrated data storage facility that ties together the decision support systems of an enterprise. Smaller data marts devoted to a single subject may also be part of the system. They may include data from external sources (e.g. credit rating data).

Many of the exploratory and analytical techniques used in data mining would not be possible without today’s computational power. The constantly declining cost of data storage and retrieval has made it possible to build the facilities required to store and make available vast amounts of data. In short, the rapid and continuing improvement in computing capacity is an essential enabler of the growth of data mining.

Chapter 2

Overview of the Data Mining Process

2.1 Core Ideas in Data Mining

2.1.1 Classification

Classification is perhaps the most basic form of data analysis. The recipient of an offer might respond or not respond. An applicant for a loan might repay on time, repay late or declare bankruptcy. A credit card transaction might be normal or fraudulent. A packet of data traveling on a network might be benign or threatening. A bus in a fleet might be available for service or unavailable. The victim of an illness might be recovered, still ill, or deceased.

A common task in data mining is to examine data where the classification is unknown or will occur in the future, with the goal of predicting what that classification is or will be. Similar data where the classification is known are used to develop rules, which are then applied to the data with the unknown classification.

2.1.2 Prediction

Prediction is similar to classification, except we are trying to predict the value of a variable (e.g. amount of purchase), rather than a class (e.g. purchaser or nonpurchaser). Of course, in classification we are trying to predict a class, but the term “prediction” in this book refers to the prediction of the value of a continuous variable. (Sometimes in the data mining literature, the term “estimation” is used to refer to the prediction of the value of a continuous variable, and “prediction” may be used for both continuous and categorical data.)

2.1.3 Affinity Analysis

Large databases of customer transactions lend themselves naturally to the analysis of associations among items purchased, or “what goes with what.” “Association rules” can then be used in a variety of ways. For example, grocery stores might use such information after a customer’s purchases have all been scanned to print discount coupons, where the items being discounted are determined by mapping the customer’s purchases onto the association rules.

2.1.4 Data Reduction

Sensible data analysis often requires distillation of complex data into simpler data. Rather than dealing with thousands of product types, an analyst might wish to group them into a smaller number of groups. This process of consolidating a large number of variables (or cases) into a smaller set is termed data reduction.

2.1.5 Data Exploration

Unless our data project is very narrowly focused on answering a specific question determined in advance (in which case it has drifted more into the realm of statistical analysis than of data mining), an essential part of the job is to review and examine the data to see what messages it holds, much as a detective might survey a crime scene. Here, full understanding of the data may require a reduction in its scale or dimension to let us see the forest without getting lost in the trees. Similar variables (i.e.
variables that supply similar information) might be aggregated into a single variable incorporating all the similar variables. Analogously, cluster analysis might be used to aggregate records together into groups of similar records.

2.1.6 Data Visualization

Another technique for exploring data to see what information they hold is graphical analysis. For example, combining all possible scatter plots of one variable against another on a single page allows us to quickly visualize relationships among variables.

The Boston Housing data is used to illustrate this. In this data set, each row is a city neighborhood (census tract, actually) and each column is a variable (crime rate, pupil/teacher ratio, etc.). The outcome variable of interest is the median value of a housing unit in the neighborhood. Figure 2.1 takes four variables from this data set and plots them against each other in a series of two-way scatterplots. In the lower left, for example, the crime rate (CRIM) is plotted on the x-axis and the median value (MEDV) on the y-axis. In the upper right, the same two variables are plotted on opposite axes. From the plots in the lower right quadrant, we see that, unsurprisingly, the more lower economic status residents a neighborhood has, the lower the median house value. From the upper right and lower left corners we see (again, unsurprisingly) that higher crime rates are associated with lower median values.

An interesting result can be seen in the upper left quadrant. All the very high crime rates seem to be associated with a specific, mid-range value of INDUS (proportion of non-retail businesses per neighborhood). That a specific, middling level of INDUS is really associated with high crime rates seems dubious. A closer examination of the data reveals that each specific value of INDUS is shared by a number of neighborhoods, indicating that INDUS is measured for a broader area than that of the census tract neighborhood.
The high crime rate associated so markedly with a specific value of INDUS indicates that the few neighborhoods with extremely high crime rates fall mainly within one such broader area.

Figure 2.1 Matrix scatterplot for four variables from the Boston Housing data.

2.2 Supervised and Unsupervised Learning

A fundamental distinction among data mining techniques is between supervised methods and unsupervised methods.

“Supervised learning” algorithms are those used in classification and prediction. We must have data available in which the value of the outcome of interest (e.g. purchase or no purchase) is known. These “training data” are the data from which the classification or prediction algorithm “learns,” or is “trained,” about the relationship between predictor variables and the outcome variable. Once the algorithm has learned from the training data, it is then applied to another sample of data (the “validation data”) where the outcome is known, to see how well it does, in comparison to other models. If many different models are being tried out, it is prudent to save a third sample of known outcomes (the “test data”) to use with the final, selected model to predict how well it will do. The model can then be used to classify or predict the outcome variable of interest in new cases where the outcome is unknown.

Simple linear regression analysis is an example of supervised learning (though rarely called that in the introductory statistics course where you likely first encountered it). The Y variable is the (known) outcome variable. A regression line is drawn to minimize the sum of squared deviations between the actual Y values and the values predicted by this line. The regression line can now be used to predict Y values for new values of X for which we do not know the Y value.

Unsupervised learning algorithms are those used where there is no outcome variable to predict or classify.
Hence, there is no “learning” from cases where such an outcome variable is known. Affinity analysis, data reduction methods and clustering techniques are all unsupervised learning methods.

2.3 The Steps In Data Mining

This book focuses on understanding and using data mining algorithms (steps 4-7 below). However, some of the most serious errors in data analysis result from a poor understanding of the problem - an understanding that must be developed well before we get into the details of algorithms to be used. Here is a list of the steps to be taken in a typical data mining effort:

1. Develop an understanding of the purpose of the data mining project (if it is a one-shot effort to answer a question or questions) or application (if it is an ongoing procedure).

2. Obtain the data set to be used in the analysis. This often involves random sampling from a large database to capture records to be used in an analysis. It may also involve pulling together data from different databases. The databases could be internal (e.g. past purchases made by customers) or external (credit ratings). While data mining deals with very large databases, usually the analysis to be done requires only thousands or tens of thousands of records.

3. Explore, clean, and preprocess the data. This involves verifying that the data are in reasonable condition. How should missing data be handled? Are the values in a reasonable range, given what you would expect for each variable? Are there obvious “outliers”? The data are reviewed graphically - for example, a matrix of scatterplots showing the relationship of each variable with each other variable. We also need to ensure consistency in the definitions of fields, units of measurement, time periods, etc.

4. Reduce the data, if necessary, and (where supervised training is involved) separate it into training, validation and test data sets.
This can involve operations such as eliminating unneeded variables, transforming variables (for example, turning “money spent” into “spent > $100” vs. “spent <= $100”), and creating new variables (for example, a variable that records whether at least one of several products was purchased). Make sure you know what each variable means, and whether it is sensible to include it in the model.

5. Determine the data mining task (classification, prediction, clustering, etc.). This involves translating the general question or problem of step 1 into a more specific statistical question.

6. Choose the data mining techniques to be used (regression, neural nets, Ward’s method of hierarchical clustering, etc.).

7. Use algorithms to perform the task. This is typically an iterative process - trying multiple variants, and often using multiple variants of the same algorithm (choosing different variables or settings within the algorithm). Where appropriate, feedback from the algorithm’s performance on validation data is used to refine the settings.

8. Interpret the results of the algorithms. This involves making a choice as to the best algorithm to deploy, and, where possible, testing our final choice on the test data to get an idea how well it will perform. (Recall that each algorithm may also be tested on the validation data for tuning purposes; in this way the validation data becomes a part of the fitting process and is likely to underestimate the error in the deployment of the model that is finally chosen.)

9. Deploy the model. This involves integrating the model into operational systems and running it on real records to produce decisions or actions. For example, the model might be applied to a purchased list of possible customers, and the action might be “include in the mailing if the predicted amount of purchase is > $10.”

We concentrate in this book on steps 3-8.

2.4 SEMMA

The above steps encompass the steps in SEMMA, a methodology developed by SAS:

Sample from data sets, partition into training, validation and test data sets
Explore data set statistically and graphically
Modify: transform variables, impute missing values
Model: fit predictive models, e.g. regression, tree, collaborative filtering
Assess: compare models using validation data set

SPSS-Clementine also has a similar methodology, termed CRISP-DM (CRoss-Industry Standard Process for Data Mining).

2.5 Preliminary Steps

2.5.1 Sampling from a Database

Quite often, we will want to do our data mining analysis on less than the total number of records that are available. Data mining algorithms will have varying limitations on what they can handle in terms of the numbers of records and variables, limitations that may be specific to computing power and capacity as well as software limitations. Even within those limits, many algorithms will execute faster with smaller data sets.

From a statistical perspective, accurate models can be built with as few as several hundred records (see below). Hence, often we will want to sample a subset of records for model building.

If the event we are interested in is rare, however (e.g. customers purchasing a product in response to a mailing), sampling a subset of records may yield so few events (e.g. purchases) that we have little information on them. We would end up with lots of data on non-purchasers, but little on which to base a model that distinguishes purchasers from non-purchasers. In such cases, we would want our sampling procedure to over-weight the purchasers relative to the non-purchasers so that our sample would end up with a healthy complement of purchasers.

For example, if the purchase rate were 1% and we were going to be working with a sample of 1000 records, unweighted sampling would be expected to yield only 10 purchasers. If, on the other hand, a purchaser has a probability of being selected that is 99 times the probability of selecting a non-purchaser, then the proportions selected for the sample will be more roughly equal.

2.5.2 Pre-processing and Cleaning the Data

2.5.2.1 Types of Variables

There are several ways of classifying variables. Variables can be numeric or text (character). They can be continuous (able to assume any real numeric value, usually in a given range), integer (assuming only integer values), or categorical (assuming one of a limited number of values). Categorical variables can be either numeric (1, 2, 3) or text (payments current, payments not current, bankrupt). Categorical variables can also be unordered (North America, Europe, Asia) or ordered (high value, low value, nil value).

2.5.2.2 Variable Selection

More is not necessarily better when it comes to selecting variables for a model. Other things being equal, parsimony, or compactness, is a desirable feature in a model.

For one thing, the more variables we include, the greater the number of records we will need to assess relationships among the variables. Fifteen records may suffice to give us a rough idea of the relationship between Y and a single predictor variable X. If we now want information about the relationship between Y and fifteen predictor variables X1 ... X15, fifteen records will not be enough (each estimated relationship would have an average of only one record’s worth of information, making the estimate very unreliable).

2.5.2.3 Overfitting

For another thing, the more variables we include, the greater the risk of overfitting the data. What is overfitting? Consider the following hypothetical data about advertising expenditures in one time period, and sales in a subsequent time period:

Advertising   Sales
239           514
364           789
602           550
644           1386
770           1394
789           1440
911           1354

Figure 2.2: X-Y scatterplot for advertising and sales data

We could connect up these points with a smooth and very complex function, one that explains all these data points perfectly and leaves no error (residuals).
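To make the idea of a perfectly fitting curve concrete, here is a minimal Python sketch using the seven advertising and sales pairs from the table. It assumes a degree-6 interpolating polynomial (in Lagrange form) as the "very complex function" and an ordinary least-squares line as the simple alternative; both choices, the function names, and the new advertising value of 1000 are assumptions made for illustration, not the authors' procedure.

```python
advertising = [239, 364, 602, 644, 770, 789, 911]
sales = [514, 789, 550, 1386, 1394, 1440, 1354]


def interpolate(x, xs=advertising, ys=sales):
    """Degree-6 polynomial (Lagrange form) passing through all seven points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = float(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(advertising, sales)


def line(x):
    return intercept + slope * x


def sse(model):
    """Sum of squared residuals of a model on the seven observed points."""
    return sum((model(x) - y) ** 2 for x, y in zip(advertising, sales))


sse_curve = sse(interpolate)
sse_line = sse(line)
print("Squared error on the data, complex curve:", sse_curve)
print("Squared error on the data, straight line:", sse_line)

# Hypothetical future advertising level, slightly beyond the observed range.
NEW_ADVERTISING = 1000
print("Curve predicts sales of", round(interpolate(NEW_ADVERTISING)))
print("Line predicts sales of", round(line(NEW_ADVERTISING)))
```

On the seven observed points the curve leaves no residual at all while the line does not, so judged only on the data at hand the curve looks better. Comparing the two predictions for the new advertising level is a quick way to see how differently the models behave away from the points they were fit to.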
X-Y scatterplot, smoothed

However, we can see that such a curve is unlikely to be that accurate, or even useful, in predicting future sales on the basis of advertising expenditures. A basic purpose of building a model is to describe relationships among variables in such a way that this description will do a good job of predicting future outcome (dependent) values on the basis of future predictor (independent) values. Of course, we want the model to do a good job of describing the data we have, but we are more interested in its performance with data to come. In the above example, a simple straight line might do a better job of predicting future sales on the basis of advertising than the complex function does. In this example, we devised a complex function that fit the data perfectly, and in doing so over-reached. We certainly ended up “explaining” some variation in the data that was nothing more than chance variation. We have mislabeled the noise in the data as if it were a signal.

Similarly, we can add predictors to a model to sharpen its performance with the data at hand. Consider a database of 100 individuals, half of whom have contributed to a charitable cause. Information about income, family size, and zip code might do a fair job of predicting whether or not someone is a contributor. If we keep adding additional predictors, we can improve the performance of the model with the data at hand and reduce the misclassification error to a negligible level. However, this low error rate is misleading, because it likely includes spurious “explanations.” For example, one of the variables might be height. We have no basis in theory to suppose that tall people might contribute more or less to charity, but if there are several tall people in our sample and they just happened to contribute heavily to charity, our model might include a term for height - the taller you are, the more you will contribute.
Of course, when the model is applied to additional data, it is likely that this will not turn out to be a good predictor. If the data set is not much larger than the number of predictor variables, then it is very likely that a spurious relationship like this will creep into the model. Continuing with our charity example, with a small sample just a few of whom are tall, whatever the contribution level of tall people may be, the computer is tempted to attribute it to their being tall. If the data set is very large relative to the number of predictors, this is less likely. In such a case, each predictor must help predict the outcome for a large number of cases, so the job it does is much less dependent on just a few cases, which might be flukes. Overfitting can also result from the application of many different models, from which the best performing is selected (more about this below).

2.5.2.4 How Many Variables and How Much Data?

Statisticians could give us procedures to learn with some precision how many records we would need to achieve a given degree of reliability with a given data set and a given model. Data miners’ needs are usually not so precise, so we can often get by with rough rules of thumb. A good rule of thumb is to have ten records for every predictor variable. Another, used by Delmater and Hancock for classification procedures (2001, p. 68), is to have at least 6*M*N records, where M = number of outcome classes, and N = number of variables.

Even when we have an ample supply of data, there are good reasons to pay close attention to the variables that are included in a model. Someone with domain knowledge (i.e. knowledge of the business process and the data) should be consulted - knowledge of what the variables represent can often help build a good model and avoid errors. For example, “shipping paid” might be an excellent predictor of “amount spent,” but it is not a helpful one.
It will not give us much information about what distinguishes high-paying from low-paying customers that can be put to use with future prospects. In general, compactness or parsimony is a desirable feature in a model. A matrix of X-Y plots can be useful in variable selection. In such a matrix, we can see at a glance x-y plots for all variable combinations. A straight line would be an indication that one variable is exactly correlated with another. Typically, we would want to include only one of them in our model. The idea is to weed out irrelevant and redundant variables from our model.

2.5.2.5 Outliers

The more data we are dealing with, the greater the chance of encountering erroneous values resulting from measurement error, data entry error, or the like. If the erroneous value is in the same range as the rest of the data, it may be harmless. If it is well outside the range of the rest of the data (a misplaced decimal, for example), it may have substantial effect on some of the data mining procedures we plan to use. Values that lie far away from the bulk of the data are called outliers. The term “far away” is deliberately left vague because what is or is not called an outlier is basically an arbitrary decision. Analysts use rules of thumb like “anything over 3 standard deviations away from the mean is an outlier,” but no statistical rule can tell us whether such an outlier is the result of an error. In this statistical sense, an outlier is not necessarily an invalid data point, it is just a distant data point. The purpose of identifying outliers is usually to call attention to data that needs further review. We might come up with an explanation looking at the data - in the case of a misplaced decimal, this is likely. We might have no explanation, but know that the value is wrong - a temperature of 178 degrees F for a sick person.
Or, we might conclude that the value is within the realm of possibility and leave it alone. All these are judgments best made by someone with “domain” knowledge. (Domain knowledge is knowledge of the particular application being considered - direct mail, mortgage finance, etc., as opposed to technical knowledge of statistical or data mining procedures.) Statistical procedures can do little beyond identifying the record as something that needs review. If manual review is feasible, some outliers may be identified and corrected. In any case, if the number of records with outliers is very small, they might be treated as missing data.

How do we inspect for outliers? One technique in Excel is to sort the records by the first column, then review the data for very large or very small values in that column. Then repeat for each successive column. For a more automated approach that considers each record as a unit, clustering techniques could be used to identify clusters of one or a few records that are distant from others. Those records could then be examined.

2.5.2.6 Missing Values

Typically, some records will contain missing values. If the number of records with missing values is small, those records might be omitted. However, if we have a large number of variables, even a small proportion of missing values can affect a lot of records. Even with only 30 variables, if only 5% of the values are missing (spread randomly and independently among cases and variables), then almost 80% of the records would have to be omitted from the analysis. (The chance that a given record would escape having a missing value is 0.95^30 = 0.215.)

An alternative to omitting records with missing values is to replace the missing value with an imputed value, based on the other values for that variable across all records. For example, if, among 30 variables, household income is missing for a particular record, we might substitute instead the mean household income across all records.
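The arithmetic behind the "almost 80%" figure, and the mean-substitution idea, can be sketched as follows. This is a minimal illustration, assuming missing entries are represented by None; the household income figures are hypothetical and chosen only to keep the numbers easy to follow.

```python
import statistics

# Chance that a record with 30 variables has no missing value at all,
# if each value is missing independently with probability 0.05.
n_variables = 30
p_missing = 0.05
p_complete = (1 - p_missing) ** n_variables
share_omitted = 1 - p_complete
print(f"Complete records: {p_complete:.3f}, records omitted: {share_omitted:.3f}")

# Mean imputation for one variable (hypothetical household incomes;
# None marks a missing value).
incomes = [52000, None, 61000, 47000, None, 58000]
observed = [value for value in incomes if value is not None]
mean_income = sum(observed) / len(observed)
imputed = [mean_income if value is None else value for value in incomes]

print("Mean of observed incomes:", mean_income)
print("Std. dev. of observed values:", round(statistics.stdev(observed)))
print("Std. dev. after imputation:", round(statistics.stdev(imputed)))
```

In this sketch the imputed column ends up with a smaller standard deviation than the observed values alone, because every filled-in entry sits exactly at the mean.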
Doing so does not, of course, add any information about how household income affects the outcome variable. It merely allows us to proceed with the analysis and not lose the information contained in this record for the other 29 variables. Note that using such a technique will understate the variability in a data set. However, since we can assess variability, and indeed the performance of our data mining technique, using the validation data, this need not present a major problem. etc. when we use the same data to develop the model then assess its performance. .” To consider why this might be necessary. we try it out on another partition and see how it does..7 Normalizing (Standardizing) the Data Some algorithms require that the data be normalized before the algorithm can be effectively implemented. Data mining software. we might think it best to choose the model that did the best job of classifying or predicting the outcome variable of interest with the data at hand. we are expressing each value as “number of standard deviations away from the mean.2. At first glance. and thus end up overfitting it. With multiple variables. we can measure the residuals (errors) between the predicted values and the actual values. It is an option. changing units from (say) days to hours or months could completely alter the outcome. we simply divide (partition) our data and develop our model using only one of the partitions. However. we introduce bias. different units will be used . typically has an option that normalizes the data in those algorithms where it may be required. The latter is a particularly serious problem with techniques (such as trees and neural nets) that do not impose linear or other structure on the data. we can count the proportion of held-back records that were misclassified. including XLMiner. the dollar variable will come to dominate the distance measure.days.2.3 Partitioning the Data In supervised learning. After we have a model. counts. 
Clustering typically involves calculating a distance measure that reflects how far each record is from a cluster center. We can measure how it does in several ways. we subtract the mean from each value. If the dollars are in the thousands and everything else is in the 10’s. rather than an automatic feature of such algorithms. because there are situations where we want the different variables to contribute to the distance measure in proportion to their scale. To address this problem. this model’s superior performance comes from two sources: • A superior model • Chance aspects of the data that happen to match the chosen model better than other models. This is because when we pick the model that does best with the data. To normalize/indexstandardizing data the data. Moreover. 2. In effect.5 Preliminary Steps 17 2.5. so we can choose the one we think will do the best when it is actually implemented. In a classification model. a key question presents itself: How well will our prediction or classification model perform when we apply it to new data? We are particularly interested in comparing the performance among various models. or from other records.5. We will typically deal with two or three partitions. In a prediction model. and divide by the standard deviation of the resulting deviations from the mean. dollars. consider the case of clustering. each record in the validation set is compared to all the records in the training set to locate its nearest neighbor(s). 2. So the use of two partitions is an essential part of the classification or prediction process. we might use only training and validation partitions. It is possible (though cumbersome) to divide the data into more than 3 partitions by successive partitioning . In a sense. Why have both a validation and a test partition? When we use the validation data to assess multiple models and then pick the model that does best with the validation data. 
we again encounter another (lesser) facet of the overfitting problem – chance aspects of the validation data that happen to match the chosen model better than other models. Overview of the Data Mining 2. The more models we test. Note that with nearest neighbor algorithms for supervised learning.3. we can still interpret the error in the validation data in the same way we would interpret error from any other model. validation and test sets either randomly according to user-set proportions.3. Nonetheless.3.3 Test Partition This partition (sometimes called the “holdout” or “evaluation” partition) is used if we need to assess the performance of the chosen model with new data.2 Validation Partition This partition (sometimes called the “test” partition) is used to assess the performance of each model. the training partition itself is the model . not merely a way to improve or assess it.18 2.5. we may have overestimated the accuracy of our model. In XLMiner. so that you can compare models and pick the best one. will provide an unbiased estimate of how well it will do with new data. The random features of the validation data that enhance the apparent performance of the chosen model will not likely be present in new data to which the model is applied. the user can supply a variable (column) with a value “t” (training). In some algorithms (e. 2. Applying the model to the test data. then take one of those partitions and partition it further. when we are concerned mainly with finding the best model and less with exactly how well it will do).5. the user can ask XLMiner to do the partitioning randomly. The partitioning should be done randomly to avoid getting a biased partition. . “v” (validation) and “s” (test) assigned to each case (row).g.e. these are the data used to build the various models we are examining.any application of the model to new data requires the use of the training data.g. The same training partition is generally used to develop multiple models. 
Sometimes (for example. which it has not seen before. the validation partition may be used in automated fashion to tune and improve the model. divide the initial data into 3 partitions.1 Training Partition Typically the largest partition. Therefore. or on the basis of a variable that denotes which partition a record is to belong to. XLMiner has a utility that can divide the data up into training. the more likely it is that one of them will be particularly effective in explaining the noise in the validation data. Alternatively.5. classification and regression trees). This will help us understand the overall process before we begin tackling new algorithms. clean.0. number of rooms per dwelling. Often variable names are cryptic and their descriptions may be unclear or missing. This data set has 14 variables and a description of each variable is given in the table below.multiple linear regression. CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV Per capita crime rate by town Proportion of residential land zoned for lots over 25.2. We will use the Boston Housing data. These descriptions are available on the “description” tab on the worksheet. The data set in question is small enough that we do not need to sample from it . 0 otherwise) Nitric oxides concentration (parts per 10 million) Average number of rooms per dwelling Proportion of owner-occupied units built prior to 1940 Weighted distances to five Boston employment centers Index of accessibility to radial highways Full-value property-tax rate per $10. as is a web source for the data set. Let’s assume that the purpose of our data mining project is to predict the median house value in small Boston area neighborhoods.we can use it in its entirety. 2. They all seem fairly straightforward..6 Building a Model . and preprocess the data. but this is not always the case. 2. Let’s look first at the description of the variables (crime rate. 
etc.63)2 where Bk is the proportion of blacks by town % Lower status of the population Median value of owner-occupied homes in $1000’s . Explore. using a familiar procedure .ft. Proportion of non-retail business acres per town Charles River dummy variable (= 1 if tract bounds river.) to be sure we understand them all. Purpose. We will illustrate the Excel procedure using XLMiner.000 Pupil-teacher ratio by town 1000(Bk .000 sq. 3. 1.. Obtain the data.6 19 Building a Model .An Example with Linear Regression Let’s go through the steps typical to many data mining tasks. we consider that tax on a home is usually a function of its assessed value. it is possible that at some stage we might want to apply a model to individual homes and. but would it be useful if we wanted to apply our model to homes whose assessed value might not be known? Reflect. Overview of the Data Mining Process The data themselves look like this: It is useful to pause and think about what the variables mean. Consider the variable TAX. We are left with 13 independent (predictor) variables.000 were recorded as $50. TAX might be a very good predictor of home value in a numerical sense. If MEDV >=$30. While the purpose of our inquiry has not been spelled out.000. CATV = 0. If MEDV <=$30. In addition to these variables. CATMEDV. the data set also contains an additional variable. So. For one thing. that the TAX variable. At first glance. not to individual homes. For example. The variable CATMEDV is actually a categorical variable created from MEDV. we do not need CAT MEDV so we will leave it out of the analysis. the top value. which has been created by categorizing median value (MEDV) into two categories – high and low. so there is some circularity in the model . it is quite low.000. If we were trying to categorize the cases into high and low median values. There are a couple of aspects of MEDV − the median house value − that bear noting. pertains to the average in a neighborhood. 
though.we want to predict a home’s value using TAX as a predictor. and whether they should be included in the model.000. there are a lot of 50’s. we would use CAT MEDV instead of MEDV. in such a case. As it is. suppose the RM (# of rooms) column looked like this. yet TAX itself is determined by a home’s value. since it dates from the 1970’s. we will keep TAX in the analysis for now. which can all be used. For another. It could be that median values above $50.20 2. like all the variables. CATV = 1. after sorting the data in descending order based on rooms: . the neighborhood TAX value would be a useful predictor. It is also useful to check for outliers that might be errors. and we want to use that data in developing a model that can then be applied to other data where that value is unknown. These are problems in which we know the class or value of the outcome variable for some data.. We can tell right away that the 79. Probably.6 Building a Model . Reduce the data and partition it into training. This technique is part of the “supervised learning” process in classification and prediction problems..929. All other values are between 3 and 9. and a validation set to see how well the model does. at this stage we might want to apply a variable reduction technique such as Principal Components Analysis to consolidate multiple similar variables into a smaller number of variables.29 is in error . We will partition the data into a training set to build the model. validation and test partitions. select XLMiner → Partition and the following dialog box appears: . In Excel. the decimal was misplaced and the value should be 7. If we had many more variables. Our task is to predict the median house value.no neighborhood is going to have houses that have an average of 79 rooms. (This hypothetical error is not present in the data set supplied with XLMiner. Our data set has only 13 variables. 21 . so data reduction is not required.2.) 4. and then assess how well that prediction does. 
If the partitioning is done randomly. should we need to). we can get an unbiased idea of how it might perform on more such data.g. In this case. we will divide the data into two partitions . a “test” partition might also be used.training and validation. The partitioning can be handled in one of two ways: a) The data set can have a partition variable that governs the division into training and validation partitions (e. the validation partition is used to see how well the model does when applied to new data. Typically. a data mining endeavor involves testing multiple models. 1 = training. we have the option of specifying a seed for randomization (which has the advantage of letting us duplicate the same random partition later. 2 = validation).22 2. We need to specify the percent of the data used in each partition. and which variables are to be included in the partitioned data set. Here we specify which data range is to be partitioned. . Note: Although we are not using it here. or b) The partitioning can be done randomly. Overview of the Data Mining Process . perhaps with multiple settings on each model. The training partition is used to build the model. When we train just one model and try it out on the validation data. it is multiple linear regression.. when validation data are used in the model itself. By playing a role in picking the best model. for example) explicitly factor validation data into the model building algorithm itself (in pruning trees. Thus. In this case. the specific task is to predict the value of MEDV using the 13 predictor variables. will be overly optimistic. the results achieved with the validation data.6 Building a Model . The test data. several algorithms (classification and regression trees. 3. as noted. 1. In this case. then pick the best performing model. when we train lots of models and use the validation data to see how each one does. just as with the training data. 
the validation data no longer provide an unbiased estimate of how the model might do with more data.2. Hence. Determine the data mining task. we can use XLMiner to build a multiple linear regression model with the training data . 23 However. 2. .we want to predict median house price on the basis of all the other values. Choose the technique. Models will almost always perform better with the data they were trained on than fresh data. once we have selected a final model. can give a better estimate of how well the chosen model will do with fresh data. for example). the validation data have become part of the model itself. In fact. or when they are used to select the best model.. Having divided the data into training and validation partitions. which should not be used either in the model building or model selection process. we will apply it to the test data to get an estimate of how well it will actually perform. . In XLMiner. We will ask XLMiner to show us the fitted values on the training data. Overview of the Data Mining Process 4.MEDV is left unused. we select Prediction → Multiple Linear Regression: The variable MEDV is selected as the output (dependent) variable.24 2. and the remaining variables are all selected as input (independent or predictor) variables. Use the algorithm to perform the task. as well as the predicted values (scores) on the validation data. the variable CAT. since they are for the records that the model was fit to. as well as the more advanced options displayed above.2. we will review the predictions themselves. along with the actual values and the residual (prediction error). . Note that these predicted values would often be called the fitted values.6 Building a Model . for more information. XLMiner produces standard regression output. 25 . but we will defer that for now. Here are the predicted values for the first few records in the training data.. Rather.. See the chapter on multiple linear regression. or the user documentation for XLMiner. 
And here are the results for the validation data:

Let's compare the prediction error for the training and validation data:

Prediction error can be measured several ways. Three measures produced by XLMiner are shown above.

On the right is the "average error" - simply the average of the residuals (errors). In both cases, it is quite small, indicating that, on balance, predictions average about right - our predictions are "unbiased." Of course, this simply means that the positive errors and negative errors balance each other out. It tells us nothing about how large those positive and negative errors are.

The "residual sum of squares" on the left adds up the squared errors, so whether an error is positive or negative it contributes just the same. However, this sum does not yield information about the size of the typical error.

The "RMS error" or root mean squared error is perhaps the most useful term of all. It takes the square root of the average squared error, and so gives an idea of the typical error (whether positive or negative) in the same scale as the original data.

As we might expect, the RMS error for the validation data ($5,337), which the model is seeing for the first time in making these predictions, is larger than for the training data ($4,518), which were used in training the model.

Interpret the results. At this stage, we would typically try other prediction algorithms (regression trees, for example) and see how they do, error-wise. We might also try different "settings" on the various models (for example, we could use the "best subsets" option in multiple linear regression to choose a reduced set of variables that might perform better with the validation data). After choosing the best model (typically, the model with the lowest error while also recognizing that "simpler is better"), we then use that model to predict the output variable in fresh data. These steps will be covered in more detail in the analysis of cases.

Deploy the model. After the best model is chosen, it is then applied to new data to predict MEDV for records where this value is unknown. This, of course, was the overall purpose.

2.6.1 Can Excel Handle the Job?

An important aspect of this process to note is that the heavy duty analysis does not necessarily require huge numbers of records. The data set to be analyzed may have millions of records, of course, but in doing multiple linear regression or applying a classification tree the use of a sample of (say) 20,000 is likely to yield as accurate an answer as using the whole data set. The principle involved is the same as the principle behind polling - 2000 voters, if sampled judiciously, can give an estimate of the entire population's opinion within one or two percentage points.

Therefore, in most cases, the number of records required in each partition (training, validation and test) can be accommodated within the rows allowed by Excel.

Of course, we need to get those records into Excel, so the standard version of XLMiner provides an interface for random sampling of records from an external database.

Likewise, we need to apply the results of our analysis to a large database, so the standard version of XLMiner has a facility for scoring the output of the model to an external database. For example, XLMiner would write an additional column (variable) to the database consisting of the predicted purchase amount for each record.

Chapter 3
Supervised Learning - Classification & Prediction

In supervised learning, we are interested in predicting the class (classification) or continuous value (prediction) of an outcome variable. In the previous chapter, we worked through a simple example. Let's now examine the question of how to judge the usefulness of a classifier or predictor and how to compare different ones.

3.1 Judging Classification Performance

Not only do we have a wide choice of different types of classifiers to choose from, but within each type of classifier we have many options, such as how many nearest neighbors to use in a k-nearest neighbors classifier, the minimum number of cases we should require in a leaf node in a tree classifier, which subsets of predictors to use in a logistic regression model, and how many hidden layer neurons to use in a neural net. Before we study these various algorithms in detail and face decisions on how to set these options, we need to know how we will measure success.

3.1.1 A Two-class Classifier

Let us first look at a single classifier for two classes. The two-class situation is certainly the most common and occurs very frequently in practice. We will extend our analysis to more than two classes later.

A natural criterion for judging the performance of a classifier is the probability that it makes a misclassification error. A classifier that makes no errors would be perfect, but we do not expect to be able to construct such classifiers in the real world due to "noise" and to not having all the information needed to precisely classify cases. Is there a minimum probability of misclassification we should require of a classifier?

At a minimum, we hope to do better than the crude rule "classify everything as belonging to the most prevalent class." Imagine that, for each case, we know what the probability is that it belongs to one class or the other. Suppose that the two classes are denoted by $C_0$ and $C_1$. Let $p(C_0)$ and $p(C_1)$ be the a priori probabilities that a case belongs to $C_0$ and $C_1$ respectively. The a priori probability is the probability that a case belongs to a class without any more knowledge about it than that it belongs to a population where the proportion of $C_0$'s is $p(C_0)$ and the proportion of $C_1$'s is $p(C_1)$. In this situation we will minimize the chance of a misclassification error by assigning class $C_1$ to the case if $p(C_1) > p(C_0)$ and to $C_0$ otherwise. The probability of making a misclassification error would be the minimum of $p(C_0)$ and $p(C_1)$. If we are using misclassification rate as our criterion, any classifier that uses predictor variables must have an error rate better than this.

What is the best performance we can expect from a classifier? Clearly the more training data available to a classifier the more accurate it will be. Suppose we had a huge amount of training data; would we then be able to build a classifier that makes no errors? The answer is no. The accuracy of a classifier depends critically on how separated the classes are with respect to the predictor variables that the classifier uses. We can use the well-known Bayes' formula from probability theory to derive the best performance we can expect from a classifier for a given set of predictor variables if we had a very large amount of training data. Bayes' formula uses the distributions of the decision variables in the two classes to give us a classifier that will have the minimum error amongst all classifiers that use the same predictor variables. This classifier uses the Minimum Error Bayes Rule.

3.1.2 Bayes' Rule for Minimum Error

Let us take a simple situation where we have just one continuous predictor variable, say $X$, to use in predicting our two-class outcome variable. Now $X$ is a random variable, since its value depends on the individual case we sample from the population consisting of all possible cases of the class to which the case belongs. Suppose that we have a very large training data set. Then the relative frequency histogram of the variable $X$ in each class would be almost identical to the probability density function (p.d.f.) of $X$ for that class. Let us assume that we have a huge amount of training data and so we know the p.d.f.s accurately. These p.d.f.s are denoted $f_0(x)$ and $f_1(x)$ for classes $C_0$ and $C_1$ in Fig. 1 below.

Figure 1

Now suppose we wish to classify an object for which the value of $X$ is $x_0$. Let us use Bayes' formula to predict the probability that the object belongs to class 1 conditional on the fact that it has an $X$ value of $x_0$. Applying Bayes' formula, the probability, denoted by $p(C_1 \mid X = x_0)$, is given by:

$$p(C_1 \mid X = x_0) = \frac{p(X = x_0 \mid C_1)\,p(C_1)}{p(X = x_0 \mid C_0)\,p(C_0) + p(X = x_0 \mid C_1)\,p(C_1)}$$

Writing this in terms of the density functions, we get

$$p(C_1 \mid X = x_0) = \frac{f_1(x_0)\,p(C_1)}{f_0(x_0)\,p(C_0) + f_1(x_0)\,p(C_1)}$$

Notice that to calculate $p(C_1 \mid X = x_0)$ we need to know the a priori probabilities $p(C_0)$ and $p(C_1)$. Since there are only two possible classes, if we know $p(C_1)$ we can always compute $p(C_0)$ because $p(C_0) = 1 - p(C_1)$. The a priori probability $p(C_1)$ is the probability that an object belongs to $C_1$ without any knowledge of the value of $X$ associated with it. Bayes' formula enables us to update this a priori probability to the a posteriori probability, the probability of the object belonging to $C_1$ after knowing that its $X$ value is $x_0$.

When $p(C_1) = p(C_0) = 0.5$, the formula shows that $p(C_1 \mid X = x_0) > p(C_0 \mid X = x_0)$ if $f_1(x_0) > f_0(x_0)$. This means that if $x_0$ is greater than $a$ (Figure 1), and we classify the object as belonging to $C_1$, we will make a smaller misclassification error than if we were to classify it as belonging to $C_0$. Similarly, if $x_0$ is less than $a$, and we classify the object as belonging to $C_0$, we will make a smaller misclassification error than if we were to classify it as belonging to $C_1$. If $x_0$ is exactly equal to $a$ we have a 50% chance of making an error for either classification.

Figure 2

What if the prior class probabilities were not the same (Figure 2)? Suppose $C_0$ is twice as likely a priori as $C_1$. Then the formula says that $p(C_1 \mid X = x_0) > p(C_0 \mid X = x_0)$ if $f_1(x_0) > 2 \times f_0(x_0)$.
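The posterior formula can be checked numerically. The sketch below assumes, purely for illustration, that the two class densities are Normal with means 0 and 2 and a common standard deviation of 1; these densities, and the names `normal_pdf` and `posterior_c1`, are hypothetical and do not come from the text. With equal priors the posterior is 0.5 where the two densities cross; when $C_0$ is twice as likely a priori, the same point favours $C_0$ and the break-even point moves to where $f_1(x) = 2 f_0(x)$.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_c1(x0, prior_c1, f0, f1):
    """p(C1 | X = x0) from Bayes' formula for two classes."""
    prior_c0 = 1.0 - prior_c1
    numerator = f1(x0) * prior_c1
    return numerator / (f0(x0) * prior_c0 + numerator)

# Hypothetical class densities: C0 ~ Normal(0, 1), C1 ~ Normal(2, 1).
f0 = lambda x: normal_pdf(x, 0.0, 1.0)
f1 = lambda x: normal_pdf(x, 2.0, 1.0)

# Equal priors: the densities cross at x = 1, so the posterior there is 0.5.
equal_at_1 = posterior_c1(1.0, 0.5, f0, f1)

# C0 twice as likely a priori: at x = 1 the posterior for C1 drops to 1/3.
unequal_at_1 = posterior_c1(1.0, 1.0 / 3.0, f0, f1)

# For these densities f1(x) / f0(x) = exp(2x - 2), so f1 = 2 * f0 at:
break_even = 1.0 + math.log(2.0) / 2.0
unequal_at_break_even = posterior_c1(break_even, 1.0 / 3.0, f0, f1)

print(equal_at_1, unequal_at_1, break_even, unequal_at_break_even)
```

In this hypothetical setting the break-even point lies to the right of the equal-prior crossing, so the range of $x$ values assigned to the more likely class $C_0$ grows.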
The new boundary value, $b$, for classification will be to the right of $a$ as shown in Fig. 2. This is intuitively what we would expect. If a class is more likely we would expect the cut-off to move in a direction that would increase the range over which it is preferred.

In general we will minimize the misclassification error rate if we classify a case as belonging to $C_1$ if $p(C_1) \times f_1(x_0) > p(C_0) \times f_0(x_0)$, and to $C_0$ otherwise. This rule holds even when $X$ is a vector consisting of several components, each of which is a random variable. In the remainder of this note we shall assume that $X$ is a vector.

An important advantage of Bayes' Rule is that, as a by-product of classifying a case, we can compute the conditional probability that the case belongs to each class. This has two advantages.

First, we can use this probability as a "score" for each case that we are classifying. The score enables us to rank cases that we have predicted as belonging to a class in order of confidence that we have made a correct classification. This capability is important in developing a lift curve (explained later) that is important for many practical data mining applications.

Second, it enables us to compute the expected profit or loss for a given case. This gives us a better decision criterion than misclassification error when the loss due to error is different for the two classes.

3.1.3 Practical Assessment of a Classifier Using Misclassification Error as the Criterion

In practice, we can estimate $p(C_1)$ and $p(C_0)$ from the data we are using to build the classifier by simply computing the proportion of cases that belong to each class. Of course, these are estimates and they can be incorrect, but if we have a large enough data set and neither class is very rare our estimates will be reliable. Sometimes, we may be able to use public data such as census data to estimate these proportions. However, in most practical business settings we will not know $f_1(x)$ and $f_0(x)$. If we want to apply Bayes' Rule we will need to estimate these density functions in some way. Many classification methods can be interpreted as being methods for estimating such density functions [1]. In practice $X$ will almost always be a vector. This complicates the task because of the curse of dimensionality - the difficulty and complexity of the calculations increases exponentially, not linearly, as the number of variables increases.

[1] There are classifiers that focus on simply finding the boundary between the regions to predict each class without being concerned with estimating the density of cases within each region. For example, Support Vector Machine Classifiers have this characteristic.

To obtain an honest estimate of classification error, let us suppose that we have partitioned a data set into training and validation data sets by random selection of cases. Let us assume that we have constructed a classifier using the training data. When we apply it to the validation data, we will classify each case into $C_0$ or $C_1$. These classifications can be displayed in what is known as a confusion table, with rows and columns corresponding to the true and predicted classes respectively. (Although we can summarize our results in a confusion table for training data as well, the resulting confusion table is not useful for getting an honest estimate of the misclassification rate due to the danger of overfitting.)

Confusion Table (Validation Cases)

True Class   Predicted Class C0                                Predicted Class C1
C0           True Negatives (number of correctly               False Positives (number of cases incorrectly
             classified cases that belong to C0)               classified as C1 that belong to C0)
C1           False Negatives (number of cases incorrectly      True Positives (number of correctly
             classified as C0 that belong to C1)               classified cases that belong to C1)

If we denote the number in the cell at row $i$ and column $j$ by $N_{ij}$, the estimated misclassification rate is $Err = (N_{01} + N_{10})/N_{val}$, where $N_{val} = N_{00} + N_{01} + N_{10} + N_{11}$ is the total number of cases in the validation data set. If $N_{val}$ is reasonably large, our estimate of the misclassification rate is probably reasonably accurate. We can compute a confidence interval using the standard formula for estimating a population proportion from a random sample.

The table below gives an idea of how the accuracy of the estimate varies with $N_{val}$. The column headings are values of the misclassification rate and the rows give the desired accuracy in estimating the misclassification rate as measured by the half-width of the confidence interval at the 99% confidence level. For example, if we think that the true misclassification rate is likely to be around 0.05 and we want to be 99% confident that $Err$ is within ±0.01 of the true misclassification rate, we need to have a validation data set with 3,152 cases.

Half-width    Misclassification rate
               0.01     0.05     0.10     0.15     0.20     0.30     0.40     0.50
± 0.025         250      504      956    1,354    1,699    2,230    2,548    2,654
± 0.010         657    3,152    5,972    8,461   10,617   13,935   15,926   16,589
± 0.005       2,628   12,608   23,889   33,842   42,469   55,741   63,703   66,358

Note that we are assuming that the cost (or benefit) of making correct classifications is zero. At first glance, this may seem incomplete. After all, the benefit (negative cost) of correctly classifying a buyer as a buyer would seem substantial. And, in other circumstances (e.g. scoring our classification algorithm to fresh data to implement our decisions), it will be appropriate to consider the actual net dollar impact of each possible classification (or misclassification). Here, however, we are attempting to assess the value of a classifier in terms of classification error, so it greatly simplifies matters if we can capture all cost/benefit information in the misclassification cells. So, instead of recording the benefit of correctly classifying a buyer, we record the cost of failing to classify him as a buyer. It amounts to the same thing and our goal becomes the minimization of costs, whether the costs are actual costs or foregone benefits (opportunity costs).

3.1.4 Asymmetric Misclassification Costs and Bayes' Risk

Up to this point we have been using the misclassification rate as the criterion for judging the efficacy of a classifier. However, there are circumstances when this measure is not appropriate. Sometimes the error of misclassifying a case belonging to one class is more serious than for the other class. For example, misclassifying a household as unlikely to respond to a sales offer when it belongs to the class that would respond incurs a greater opportunity cost than the converse error. In the former case, you are missing out on a sale worth perhaps tens or hundreds of dollars. In the latter, you are incurring the costs of mailing a letter to someone who will not purchase. In such a scenario using the misclassification rate as a criterion can be misleading. Consider the situation where the sales offer is accepted by 1% of the households on a list. If a classifier simply classifies every household as a non-responder it will have an error rate of only 1% but will be useless in practice. A classifier that misclassifies 30% of buying households as non-buyers and 2% of the non-buyers as buyers would have a higher error rate but would be better if the profit from a sale is substantially higher than the cost of sending out an offer. In these situations, if we have estimates of the cost of both types of misclassification, we can use the confusion table to compute the expected cost of misclassification for each case in the validation data. This enables us to compare different classifiers using overall expected costs as the criterion. However, it does not improve the actual classifications themselves. A better method is to change the classification rules (and hence the misclassification rates) to reflect the asymmetric costs. In fact, there is a Bayes' classifier for this situation which gives rules that are optimal for minimizing the overall expected loss from misclassification (including both actual and opportunity costs). This classifier is known as the Bayes' Risk Classifier and the corresponding minimum expected cost of misclassification is known as the Bayes' Risk. The Bayes' Risk Classifier employs the following classification rule:

Classify a case as belonging to $C_1$ if $p(C_1) \times f_1(x_0) \times C(0|1) > p(C_0) \times f_0(x_0) \times C(1|0)$, and to $C_0$ otherwise.

Here $C(0|1)$ is the cost of misclassifying a $C_1$ case as belonging to $C_0$ and $C(1|0)$ is the cost of misclassifying a $C_0$ case as belonging to $C_1$. Note that the opportunity cost of correct classification for either class is zero. Notice also that this rule reduces to the Minimum Error Bayes Rule when $C(0|1) = C(1|0)$.

Again, as we rarely know $f_1(x_0)$ and $f_0(x_0)$, we cannot construct this classifier in practice. Nonetheless, it provides us with an ideal that the various classifiers we construct for minimizing expected opportunity cost attempt to emulate.

3.1.5 Stratified Sampling and Asymmetric Costs

When classes are not present in roughly equal proportions, stratified sampling is often used to oversample the cases from the rarer class and improve the performance of classifiers. If a class occurs only rarely in the training set, the classifier will have little information to use in learning what distinguishes it from the other classes. The most commonly used weighted sampling scheme is to sample an equal number of cases from each class.

It is often the case that the rarer events are the more interesting or important ones - responders to a mailing, those who commit fraud, defaulters on debt, etc. - and hence the more costly to misclassify. Hence, after oversampling and training a model on a biased sample, two adjustments are required:

• Adjusting the responses for the biased sampling (e.g. if a class was over-represented in the training sample by a factor of 2, its predicted outcomes need to be divided by 2)
• Translating the results (in terms of numbers of responses) into expected gains or losses in a way that accounts for asymmetric costs.

3.1.6 Generalization to More than Two Classes

All the comments made above about two-class classifiers extend readily to classification into more than two classes. Let us suppose we have $k$ classes $C_0, C_1, C_2, \cdots, C_{k-1}$. Then Bayes' formula gives us:

$$p(C_j \mid X = x_0) = \frac{f_j(x_0)\,p(C_j)}{\sum_{i=0}^{k-1} f_i(x_0)\,p(C_i)}$$

The Bayes Rule for Minimum Error is to classify a case as belonging to $C_j$ if

$$p(C_j) \times f_j(x_0) \ge \max_{i = 0, 1, \cdots, k-1} p(C_i) \times f_i(x_0).$$

The confusion table has $k$ rows and $k$ columns. The misclassification cost associated with the diagonal cells is, of course, always zero. If the costs are asymmetric the Bayes Risk Classifier follows the rule: classify a case as belonging to $C_j$ if

$$p(C_j) \times f_j(x_0) \times C(\sim j \mid j) \ge \max_{i \ne j} p(C_i) \times f_i(x_0) \times C(\sim i \mid i)$$

where $C(\sim j \mid j)$ is the cost of misclassifying a case that belongs to $C_j$ to any other class $C_i$, $i \ne j$.

3.1.7 Lift Charts

Often in practice, misclassification costs are not known accurately and decision makers would like to examine a range of possible costs. In such cases, when the classifier gives a probability of belonging to each class and not just a binary classification to $C_1$ or $C_0$, we can use a very useful device known as the lift curve, also called a gains curve or gains chart. The lift curve is a popular technique in direct marketing, and one useful way to think of a lift curve is to consider a data mining model that attempts to identify the likely responders to a mailing by assigning each case a "probability of responding" score. The lift curve helps us determine how effectively we can "skim the cream" by selecting a relatively small number of cases and getting a relatively large portion of the responders. The input required to construct a lift curve is a validation data set that has been "scored" by appending to each case the estimated probability that it will belong to a given class.

3.1.8 Example: Boston Housing (Two classes)

Let us fit a logistic regression model to the Boston Housing data. (We will cover logistic regression in detail later; for now think of it as like linear regression, except the outcome variable being predicted is binary.) We fit a logistic regression model to the training data (304 randomly selected cases) with all the 13 variables available in the data set as predictor variables and with the binary variable CAT.MEDV (1 = median value >= $30,000, 0 = median value < $30,000) as the dependent variable. The model coefficients are applied to the validation data (the remaining 202 cases in the data set). The first three columns of XLMiner output for the first 30 cases in the validation data are shown below.

Case   Predicted Log-odds   Predicted Prob.   Actual Value
       of Success           of Success        of HICLASS
  1         3.5993              0.9734             1
  2        -6.5073              0.0015             0
  3         0.4061              0.6002             0
  4       -14.2910              0.0000             0
  5         4.5273              0.9893             1
  6        -1.2916              0.2156             0
  7       -37.6119              0.0000             0
  8        -1.1157              0.2468             0
  9        -4.3290              0.0130             0
 10       -24.5364              0.0000             0
 11       -21.6854              0.0000             0
 12       -19.8654              0.0000             0
 13       -13.1040              0.0000             0
 14         4.4472              0.9884             1
 15         3.5294              0.9715             1
 16         3.6381              0.9744             1
 17        -2.6806              0.0641             0
 18        -0.0402              0.4900             0
 19       -10.0750              0.0000             0
 20       -10.2859              0.0000             0
 21       -14.6084              0.0000             0
 22         8.9016              0.9999             1
 23         0.0874              0.5218             0
 24        -6.0590              0.0023             1
 25        -1.9183              0.1281             1
 26       -13.2349              0.0000             0
 27        -9.6509              0.0001             0
 28       -13.4562              0.0000             0
 29       -13.9340              0.0000             0
 30         1.7257              0.8489             1
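The two score columns reported for each case are related by the logistic transformation: a log-odds value $z$ corresponds to a probability $1/(1 + e^{-z})$. The short sketch below illustrates this with three log-odds values of the kind reported above. The function names and the default cutoff of 0.5 are assumptions for illustration, not XLMiner settings.

```python
import math

def probability_from_log_odds(log_odds):
    """Invert the log-odds: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(log_odds, cutoff=0.5):
    """Return 1 if the implied probability reaches the cutoff, else 0.

    The default cutoff of 0.5 is an assumption made for this sketch.
    """
    return 1 if probability_from_log_odds(log_odds) >= cutoff else 0

for z in (3.5993, 1.7257, -6.5073):
    print(z, round(probability_from_log_odds(z), 4), classify(z))
```

A log-odds of zero corresponds to a probability of exactly 0.5, so positive log-odds favour class 1 and negative log-odds favour class 0 under that cutoff.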
6084 -19.0000 0.6509 -10.1 Judging Classification Performance 37 The same 30 cases are shown below sorted in descending order of the predicted probability of being a HICLASS=1 case.2156 0.9884 0.3290 -6.4900 0.0000 0.9893 0.5993 3.5073 -9.0590 -6.0000 0.0023 0.9734 0.4061 0.4472 3.0000 0.1157 -1.6119 Predicted Prob.0000 0.2910 -14.1281 0.6806 -4.0000 Actual Value of HICLASS 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 .3.5273 4.5294 1.9744 0.0750 -10.5218 0.6381 3.0130 0.0000 0.6002 0.0000 0.9016 4.9999 0.2916 -1.0001 0. of Success 0.0402 -1.0641 0.0000 0. 22 5 14 16 1 15 30 3 23 18 8 6 25 17 9 24 2 27 19 20 13 26 28 29 4 21 12 11 10 7 Predicted Log-odds of Success 8.7257 0.9340 -14.0000 0.9715 0.2468 0.2349 -13.4562 -13.2859 -13.0000 0.9183 -2.8654 -21.1040 -13.0000 0.5364 -37.8489 0.0015 0.6854 -24.0874 -0. 8489 0.1281 0.9734 0.0641 0. Instead of looking at a large number of confusion tables. we can calculate the appropriate confusion table. we will predict 10 positives (7 true positives and 3 false positives).” and below which we will consider a case to be a negative or “0.0000 0. we can use the sorted table to compute a confusion table for a given cutoff probability.400.Classification & Prediction First. Probability Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Predicted Prob. For each cutoff level.0130 0.0000 0.9999 0.0000 0.5218 0. it is much more convenient to look at the cumulative lift curve (sometimes called a gains chart) which summarizes all the information in these multiple confusion tables into a graph.2156 0.0001 0.4900 0. above which we will consider a case to be a positive or “1.0000 0.9893 0.0023 0. of Success 0.” For any given cutoff level. Supervised Learning .0000 0.2468 0. if we use a cutoff probability level of 0.0000 0. we will also predict 20 negatives (18 true negatives and 2 false negatives).9744 0.0000 0.6002 0.0000 0.0000 0.0000 0.38 3.9715 0.0015 0. 
we need to set a cutoff probability value. For example.0000 0.0000 Actual Value of HICLASS 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Cumulative Actual Value 1 2 3 4 5 6 7 7 7 7 7 7 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 .9884 0. The graph is constructed with the cumulative number of cases (in descending order of probability) on the x axis and the cumulative number of true positives on the y axis as shown below. ”. It is worth mentioning that a curve that captures the same information as the lift curve in a slightly different manner is also popular in data mining applications. . continue with a slope of 1 until it reached 9 successes (all the successes).9) is a reference line. then continue horizontally to the right. The lift will vary with the number of cases we choose to act on. It provides a benchmark against which we can see performance of the model.would overlap the existing curve at the start. The line joining the points (0.a classifier that makes no errors . A good classifier will give us a high lift when we act on only a few cases (i. XLMiner automatically creates lift charts from probabilities predicted by logistic regression for both training and validation data.0) to (30. The charts created for the full Boston Housing data are shown below. the lift curve tells us that we would be right about 7 of them. It uses the same variable on the y axis as the lift curve (but expressed as a percentage of the maximum) and on the x axis it shows the true negatives (also expressed as a percentage of the maximum) for differing cut-off levels. If we simply select 10 cases at random we expect to be right for 10 × 9/30 = 3 cases.1 Judging Classification Performance 39 The cumulative lift chart is shown below. The lift curve for the best possible classifier .e. The model gives us a “lift” in predicting HICLASS of 7/3 = 2. It represents the expected number of positives we would predict if we did not have a model but simply selected cases at random. 
As we include more cases the lift will decrease.3.33. This is the ROC (short for Receiver Operating Characteristic) curve. If we had to choose 10 neighborhoods as HICLASS=1 neighborhoods and used our model to pick the ones most likely to be “1’s. use the prediction for the ones at the top). 40 3. Cases that the classifier cannot classify are subjected to closer scrutiny either by using expert judgment or by enriching the set of predictor variables by gathering additional information that is perhaps more difficult or expensive to obtain. In a two-class situation this means that for a case we can make one of three predictions. those who are too ill to retreat even if medically treated under the prevailing conditions.1.10 Classification using a Triage strategy In some cases it is useful to have a “can’t say” option for the classifier. This is analogous to the strategy of triage that is often employed during retreat in battle. or we cannot make a prediction because there is not enough information to confidently pick C0 or C1 . Since the vast majority of transactions are legitimate. An example is in processing credit card transactions where a classifier may be used to identify clearly legitimate cases and the obviously fraudulent ones while referring the remaining cases to a human decision-maker who may look up a database to form a judgment. Clearly the grey area of greatest doubt in classification is the area around a. such a classifier would substantially reduce the burden on human experts. or the case belongs to C1 . The wounded are classified into those who are well enough to retreat. To gain some insight into forming such a strategy let us revisit the simple two-class single predictor variable classifier that we examined at the beginning of this chapter.Classification & Prediction ROC Curve The ROC curve for our 30 cases example above is shown below. At a the ratio of the conditional probabilities of belonging to the classes is one. 
and those who are likely to become well enough to retreat if given medical attention. A sensible rule way to define the .1. 3.9 3. The case belongs to C0 . Supervised Learning . 05 or 1.2.41 grey area is the set of x values such that: t> p(C1 ) × f1 (x0 ) > 1/t p(C0 ) × f0 (x0 ) where t is a threshold for the ratio. . A typical value of t may in the range 1. Multiple Linear Regression .42 4. Identifying subsets of the independent variables to improve predictions. and the model is: Y = β0 + β1 x1 + β2 x2 + · · · + βp xp + ε (1) where ε. predicting expenditures on vacation travel based on historical frequent flier data. 2. x2 . 4. We also do not know the values of the 43 . predicting the time to failure of equipment based on utilization and environment conditions. There are two important conceptual ideas that we will develop: 1. regressors or covariates) are known quantities for purposes of prediction.1 A Review of Multiple Linear Regression 4. Multiple linear regression is applicable to numerous data mining situations. · · · . Y . predicting staffing requirements at help desks based on historical data and product and sales information. Examples are: predicting customer activity on credit cards from demographics and historical activity patterns. There is a continuous random variable called the dependent variable. Our purpose is to predict the value of the dependent variable (also referred to as the outcome or response variable) using a linear function of the independent variables. Relaxing the assumption that errors follow a Normal distribution. predicting sales from cross selling of products from historical information and predicting the impact of discounts on sales in retail outlets.2 Independence Relaxing the Normal distribution assumption Let us review the typical multiple regression model. is a Normally-distributed random variable with mean = 0 and standard deviation = σ whose value we do not know. xp .1. the “noise” variable. input variables. 
and a number of independent variables.1. The values of the independent variables (also referred to as predictor variables. x1 .Chapter 4 Multiple Linear Regression 4.1 Linearity Perhaps the most popular mathematical model for making predictions is the multiple linear regression model encountered in most introductory statistics classes. · · · . βˆ1 . βˆ2 . βˆ2 . An important and interesting fact for our purposes is that even if we drop the last assumption and allow the noise variables to follow arbitrary distributions. These are our estimates for the unknown values and are called OLS (ordinary least squares) estimates.distributed. · · · . βˆ1 . The data consist of n cases (rows of observations) which give us values yi . · · · . 2. Predictions based on this equation are the best predictions possible in the sense that they will be unbiased (equal to the true values on the average) and will have the smallest expected squared error compared to any unbiased estimates if we make the following assumptions: 1. Normality. β2 . σ. βˆ2 . βˆp we can calculate an unbiased estimate σ 2 for σ using the formula: σ ˆ2 = = 4. Multiple Linear Regression coefficients β0 . We estimate all these (p + 2) unknown values from the available data. n. i = 1. for i = 1. β1 . · · · . · · · . n. n. xi2 . xi1 .1. · · · . More specifically. The expected value of the dependent variable is a linear function of the independent variables. · · · . βˆp . The “noise” random variables εi are independent between all the cases. these estimates are very good for . is computed from the equation Yˆ = βˆ0 + βˆ1 x1 + βˆ2 x2 + · · · + βˆp xp . Yˆ . xp ) = β0 + β1 x1 + β2 x2 + · · · + βp xp . ˆ2 Once we have computed the estimates βˆ0 . 5. E(εi ) = 0 for i = 1. · · · . 4.44 4. The sum of squared differences is given by n (yi − β0 − β1 xi1 − β2 xi2 · · · − βp xip )2 i=1 Let us denote the values of the coefficients that minimize this expression by βˆ0 . 2. 2. x2 . x2 . εi . xip . are Normally. βˆ1 . 
The estimates for the β coefficients are computed so as to minimize the sum of squares of differences between the fitted (predicted) values and the observed Y values in the data. βˆp in the linear regression model (1) to predict the value of the dependent value from known values of the independent values. Homoskedasticity. observations-coefficients Unbiasedness We plug in the values of βˆ0 . E(Y |x1 .3 n 1 (yi − βˆ0 − β1 xi1 − β2 xi2 · · · − βp xip )2 n − p − 1 i=1 Sum of squares of residuals . βp . The predicted value. Here εi is the ”noise” random variable in observation i for i = 1 · · · n 3. 2. x1 . · · · . · · · . The “noise” random variables. xp . The standard deviation of εi equals the same (unknown) value. the model using the least squares estimates. These ratings are answers to survey questions given to a sample of 25 clerks in each of 30 departments. βˆ2 . These coefficient estimates are used to make predictions for each case in the validation data. βˆ1 . will give the smallest value of squared error on the average. In data mining applications we have two distinct sets of data: the training data set and the validation data set that are both representative of the relationship between the dependent and independent variables. βˆp . The independent (predictor) variables are clerical employees’ ratings of these same supervisors on more specific attributes of performance. as defined by equation (1) above. The Normal distribution assumption is required in the classical implementation of multiple linear regression to derive confidence intervals for predictions. In this classical world. We can show that predictions based on these estimates are the best linear predictions in that they minimize the expected squared error. The purpose of the analysis was to explore the feasibility of using a questionnaire for predicting effectiveness of supervisors. 
4.2 Illustration of the Regression Process

Example 1: Supervisor Performance Data (adapted from Chatterjee, Hadi and Price)

The data shown in Tables 1 and 2 are from a large financial organization. The dependent (outcome) variable is an overall measure of supervisor effectiveness. The independent (predictor) variables are clerical employees' ratings of these same supervisors on more specific attributes of performance. All ratings are on a scale of 1 to 5 by 25 clerks reporting to the supervisor. These ratings are answers to survey questions given to a sample of 25 clerks in each of 30 departments. The purpose of the analysis was to explore the feasibility of using a questionnaire for predicting effectiveness of supervisors, thus saving the considerable effort required to directly measure effectiveness. The variables are answers to questions on the survey and are described below.

Y    Measure of effectiveness of supervisor
X1   Handles employee complaints
X2   Does not allow special privileges
X3   Opportunity to learn new things
X4   Raises based on performance
X5   Too critical of poor performance
X6   Rate of advancing to better jobs

Table 1: Training Data (20 departments)

Case   Y  X1  X2  X3  X4  X5  X6
   1  43  51  30  39  61  92  45
   2  63  64  51  54  63  73  47
   3  71  70  68  69  76  86  48
   4  61  63  45  47  54  84  35
   5  81  78  56  66  71  83  47
   6  43  55  49  44  54  49  34
   7  58  67  42  56  66  68  35
   8  71  75  50  55  70  66  41
   9  72  82  72  67  71  83  31
  10  67  61  45  47  62  80  41
  11  64  53  53  58  58  67  34
  12  67  60  47  39  59  74  41
  13  69  62  57  42  55  63  25
  14  68  83  83  45  59  77  35
  15  77  77  54  72  79  77  46
  16  81  90  50  72  60  54  36
  17  74  85  64  69  79  79  63
  18  65  60  65  75  55  80  60
  19  65  70  46  57  75  85  46
  20  50  58  68  54  64  78  52

The multiple linear regression estimates (as computed by XLMiner) are reported below.

            Coefficient   StdError   t-statistic   p-value
Constant      13.182       16.746       0.787       0.445
X1             0.583        0.232       2.513       0.026
X2            -0.044        0.167      -0.263       0.797
X3             0.329        0.219       1.501       0.157
X4            -0.057        0.317      -0.180       0.860
X5             0.112        0.196       0.570       0.578
X6            -0.197        0.247      -0.798       0.439

Multiple R-squared     0.656
Residual SS          738.900
Std. Dev. Estimate     7.539

The equation to predict performance is

$$Y = 13.182 + 0.583 X_1 - 0.044 X_2 + 0.329 X_3 - 0.057 X_4 + 0.112 X_5 - 0.197 X_6$$

Applying this equation to the validation data gives the predictions and errors shown in Table 2.

Table 2: Predictions for validation cases

Case        Y  X1  X2  X3  X4  X5  X6  Prediction  Error=(Pred-Y)
  21       50  40  33  34  43  64  33     44.46        -5.54
  22       64  61  52  62  66  80  41     63.98        -0.02
  23       53  66  52  50  63  80  37     63.91        10.91
  24       40  37  42  58  50  57  49     45.87         5.87
  25       63  54  42  48  66  75  33     56.75        -6.25
  26       66  77  66  63  88  76  72     65.22        -0.78
  27       78  75  58  74  80  78  49     73.23        -4.77
  28       48  57  44  45  51  83  38     58.19        10.19
  29       85  85  71  71  77  74  55     76.05        -8.95
  30       82  82  39  59  64  78  39     76.10        -5.90
Averages:                                 62.38        -0.52
Std Devs:                                 11.30         7.17

We note that the average error in the predictions is small (−0.52) and so the predictions show little bias. Further the errors are roughly Normal so that this model gives prediction errors that are approximately 95% of the time within ±14.34 (two standard deviations) of the true value. (If we want to be very conservative we can use a result known as Tchebychev's inequality which says that the probability that any random variable is more than k standard deviations away from its mean is at most $1/k^2$. This more conservative formula tells us that the chances of our prediction being within ±3 × 7.17 = ±21.51 of the true value are at least 8/9 ≈ 89%.)
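As a check on the arithmetic, the prediction equation can be applied to the ten validation cases directly. The sketch below uses the coefficients as printed (rounded to three decimals), so its predictions may differ from Table 2 in the second decimal place; the variable names are chosen for this illustration only.

```python
# Coefficients as reported in the text: constant, then X1..X6 (rounded to 3 decimals).
COEF = [13.182, 0.583, -0.044, 0.329, -0.057, 0.112, -0.197]

# Validation cases 21-30: (Y, X1, X2, X3, X4, X5, X6).
VALIDATION = [
    (50, 40, 33, 34, 43, 64, 33),
    (64, 61, 52, 62, 66, 80, 41),
    (53, 66, 52, 50, 63, 80, 37),
    (40, 37, 42, 58, 50, 57, 49),
    (63, 54, 42, 48, 66, 75, 33),
    (66, 77, 66, 63, 88, 76, 72),
    (78, 75, 58, 74, 80, 78, 49),
    (48, 57, 44, 45, 51, 83, 38),
    (85, 85, 71, 71, 77, 74, 55),
    (82, 82, 39, 59, 64, 78, 39),
]

def predict(x):
    """Apply the fitted equation to one case's predictor values X1..X6."""
    return COEF[0] + sum(c * v for c, v in zip(COEF[1:], x))

predictions = [predict(row[1:]) for row in VALIDATION]
errors = [p - row[0] for p, row in zip(predictions, VALIDATION)]

n = len(errors)
mean_error = sum(errors) / n
sd_error = (sum((e - mean_error) ** 2 for e in errors) / (n - 1)) ** 0.5

# Tchebychev's inequality: at least 1 - 1/k^2 of the mass lies within k standard deviations.
k = 3
tchebychev_lower_bound = 1 - 1 / k ** 2
```

The mean and standard deviation of `errors` correspond to the "Averages" and "Std Devs" rows of the validation table, and `tchebychev_lower_bound` is the 8/9 figure quoted in the text.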
4.3 Subset Selection in Linear Regression

A frequent problem in data mining is that of using a regression equation to predict the value of a dependent variable when we have many variables available to choose as independent variables in our model. Given the high speed of modern algorithms for multiple linear regression calculations, it is tempting in such a situation to take a kitchen-sink approach: why bother to select a subset, just use all the variables in the model. There are several reasons why this could be undesirable.

• It may be expensive (or not feasible) to collect the full complement of variables for future predictions.
• We may be able to measure fewer variables more accurately (for example in surveys).
• We may need to delete fewer observations in data sets with missing values of observations.
• Parsimony is an important property of good models. We obtain more insight into the influence of regressors in models with a few parameters.
• Estimates of regression coefficients are likely to be unstable due to multicollinearity in models with many variables. (Multicollinearity is the presence of two or more predictor variables sharing the same linear relationship with the outcome variable.) Regression coefficients are more stable for parsimonious models. One rough thumbrule (where n = # of cases and k = # of variables): n ≥ 5(k + 2).
• It can be shown that using independent variables that are uncorrelated with the dependent variable will increase the variance of predictions.
• It can be shown that dropping independent variables that have small (non-zero) coefficients can reduce the average error of predictions.

Let us illustrate the last two points using the simple case of two independent variables. The reasoning remains valid in the general situation of more than two independent variables.

4.4 Dropping Irrelevant Variables

Suppose that the true equation for Y, the dependent variable, is:

$$Y = \beta_1 X_1 + \varepsilon \qquad (2)$$

and suppose that we estimate Y (using an additional variable $X_2$ that is actually irrelevant) with the equation:

$$Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon \qquad (3)$$

This equation is true with $\beta_2 = 0$. We can show that in this situation the least squares estimates $\hat\beta_1$ and $\hat\beta_2$ will have the following expected values and variances:

$$E(\hat\beta_1) = \beta_1, \qquad Var(\hat\beta_1) = \frac{\sigma^2}{(1 - R_{12}^2) \sum x_{i1}^2}$$

$$E(\hat\beta_2) = 0, \qquad Var(\hat\beta_2) = \frac{\sigma^2}{(1 - R_{12}^2) \sum x_{i2}^2}$$

where $R_{12}$ is the correlation coefficient between $X_1$ and $X_2$. We notice that $\hat\beta_1$ is an unbiased estimator of $\beta_1$ and $\hat\beta_2$ is an unbiased estimator of $\beta_2$ since it has an expected value of zero. However the variance of $\hat\beta_1$ is larger than it would have been if we had used equation (2). In that case

$$E(\hat\beta_1) = \beta_1, \qquad Var(\hat\beta_1) = \frac{\sigma^2}{\sum x_{i1}^2}$$

The variance is the expected value of the squared error for an unbiased estimator. So we are worse off using the irrelevant estimator in making predictions. Even if $X_2$ happens to be uncorrelated with $X_1$ so that $R_{12}^2 = 0$ and the variance of $\hat\beta_1$ is the same in both models, we can show that the variance of a prediction based on (3) will be greater than that of a prediction based on (2) due to the added variability introduced by estimation of $\beta_2$. Although our analysis has been based on one useful independent variable and one irrelevant independent variable, the result holds true in general. It is always better to make predictions with models that do not include irrelevant variables.
4.5 Dropping Independent Variables With Small Coefficient Values

Suppose that the situation is the reverse of what we have discussed above, namely that equation (3) is the correct equation but we use equation (2) for our estimates and predictions ignoring variable $X_2$ in our model. To keep our results simple let us suppose that we have scaled the values of $X_1$, $X_2$ and $Y$ so that their variances are equal to 1. In this case the least squares estimate $\hat\beta_1$ has the following expected value and variance:

$$E(\hat\beta_1) = \beta_1 + R_{12} \beta_2, \qquad Var(\hat\beta_1) = \sigma^2$$

Notice that $\hat\beta_1$ is a biased estimator of $\beta_1$ with bias equal to $R_{12} \beta_2$ and its Mean Square Error is given by:

$$MSE(\hat\beta_1) = E[(\hat\beta_1 - \beta_1)^2] = E[\{\hat\beta_1 - E(\hat\beta_1) + E(\hat\beta_1) - \beta_1\}^2] = [Bias(\hat\beta_1)]^2 + Var(\hat\beta_1) = (R_{12} \beta_2)^2 + \sigma^2$$

If we use equation (3), the least squares estimates have the following expected values and variances:

$$E(\hat\beta_1) = \beta_1, \qquad Var(\hat\beta_1) = \frac{\sigma^2}{1 - R_{12}^2}$$

$$E(\hat\beta_2) = \beta_2, \qquad Var(\hat\beta_2) = \frac{\sigma^2}{1 - R_{12}^2}$$

Now let us compare the Mean Square Errors for predicting Y at $X_1 = u_1$, $X_2 = u_2$.

For equation (2),

$$MSE2(\hat Y) = E[(\hat Y - Y)^2] = E[(u_1 \hat\beta_1 - u_1 \beta_1 - \varepsilon)^2] = u_1^2 \, MSE2(\hat\beta_1) + \sigma^2 = u_1^2 (R_{12} \beta_2)^2 + u_1^2 \sigma^2 + \sigma^2$$

(This expression leaves out the term $u_2 \beta_2$ of the true value of Y, so it is exact only when $u_2 = 0$, which is the case used in the example below.)

For equation (3),

$$MSE3(\hat Y) = E[(\hat Y - Y)^2] = E[(u_1 \hat\beta_1 + u_2 \hat\beta_2 - u_1 \beta_1 - u_2 \beta_2 - \varepsilon)^2] = Var(u_1 \hat\beta_1 + u_2 \hat\beta_2) + \sigma^2$$

because now $\hat Y$ is unbiased

$$= u_1^2 Var(\hat\beta_1) + u_2^2 Var(\hat\beta_2) + 2 u_1 u_2 Covar(\hat\beta_1, \hat\beta_2) + \sigma^2 = \frac{u_1^2 + u_2^2 - 2 u_1 u_2 R_{12}}{1 - R_{12}^2} \sigma^2 + \sigma^2$$

Equation (2) can lead to lower mean squared error for many combinations of values for $u_1$, $u_2$, $R_{12}$, and $(\beta_2/\sigma)^2$. For example, if $u_1 = 1$, $u_2 = 0$, then $MSE2(\hat Y) < MSE3(\hat Y)$ when

$$(R_{12} \beta_2)^2 + \sigma^2 < \frac{\sigma^2}{1 - R_{12}^2}, \quad \text{i.e. when} \quad \frac{|\beta_2|}{\sigma} < \frac{1}{\sqrt{1 - R_{12}^2}}$$

If $|\beta_2|/\sigma < 1$ this will be true for all values of $R_{12}^2$; if $R_{12}^2 > .9$ it will also be true for $|\beta_2|/\sigma < 2$.

In general, accepting some bias can reduce MSE. This bias-variance trade-off generalizes to models with several independent variables and is particularly important for large values of p since in that case it is very likely that there are variables in the model that have small coefficients relative to the standard deviation of the noise term and also exhibit at least moderate correlation with other variables. Dropping such variables will improve the predictions as it will reduce the MSE. This type of bias-variance trade-off is a basic aspect of most data mining procedures for prediction and classification.
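The comparison of the two prediction errors in Section 4.5 can be tried numerically. The sketch below simply codes the MSE expressions for equations (2) and (3) under the same scaling assumption (unit variances); the function names and the chosen parameter values are illustrative, and `mse2` is evaluated for the case $u_2 = 0$ used in the example.

```python
def mse2(u1, r12, beta2, sigma):
    """MSE of a prediction at X1 = u1, X2 = 0 when X2 is dropped (equation (2))."""
    bias_sq = (u1 * r12 * beta2) ** 2        # squared bias from omitting X2
    variance = u1 ** 2 * sigma ** 2          # variance of u1 * beta1_hat
    return bias_sq + variance + sigma ** 2   # plus the irreducible noise

def mse3(u1, u2, r12, sigma):
    """MSE of a prediction at X1 = u1, X2 = u2 when both variables are used (equation (3))."""
    return (u1 ** 2 + u2 ** 2 - 2 * u1 * u2 * r12) / (1 - r12 ** 2) * sigma ** 2 + sigma ** 2

# Illustrative values: strong correlation between X1 and X2 (R12^2 = 0.9), sigma = 1.
r12 = 0.9 ** 0.5
small = mse2(1, r12, beta2=2.0, sigma=1.0)   # |beta2|/sigma = 2 is below 1/sqrt(1 - 0.9), about 3.16
large = mse2(1, r12, beta2=3.5, sigma=1.0)   # |beta2|/sigma = 3.5 is above the threshold
full = mse3(1, 0, r12, sigma=1.0)
```

With these values `small < full < large`: dropping $X_2$ helps when its coefficient is small relative to the noise and hurts once $|\beta_2|/\sigma$ passes $1/\sqrt{1 - R_{12}^2}$.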
4.6 Algorithms for Subset Selection

Selecting subsets to improve MSE is a difficult computational problem for large p. The most common procedure for p greater than about 20 is to use heuristics to select "good" subsets rather than to look for the best subset for a given criterion. The heuristics most often used and available in statistics software are step-wise procedures. There are three common procedures: forward selection, backward elimination and step-wise regression.

4.6.1 Forward Selection

Here we keep adding variables one at a time to construct what we hope is a reasonably good subset. The steps are as follows:

1. Start with constant term only in subset (S).
2. Compute the reduction in the sum of squares of the residuals (SSR) obtained by including each variable that is not presently in S. For the variable, say, i, that gives the largest reduction in SSR compute

$$F_i = \max_{i \notin S} \frac{SSR(S) - SSR(S \cup \{i\})}{\hat\sigma^2(S \cup \{i\})}$$

If $F_i > F_{in}$ add i to S.
3. Repeat 2 until no variables can be added.

(Typical values for $F_{in}$ are in the range [2, 4].)

4.6.2 Backward Elimination

1. Start with all variables in S.
2. Compute the increase in the sum of squares of the residuals (SSR) obtained by excluding each variable that is presently in S. For the variable, say, i, that gives the smallest increase in SSR compute

$$F_i = \min_{i \in S} \frac{SSR(S - \{i\}) - SSR(S)}{\hat\sigma^2(S)}$$

If $F_i < F_{out}$ then drop i from S.
3. Repeat 2 until no variable can be dropped.

(Typical values for $F_{out}$ are in the range [2, 4].)

Backward Elimination has the advantage that all variables are included in S at some stage. This gets around a problem of forward selection that will never select a variable that is better than a previously selected variable that is strongly correlated with it. The disadvantage is that the full model with all variables is required at the start and this can be time-consuming and numerically unstable.

4.6.3 Step-wise Regression (Efroymson's method)

This procedure is like Forward Selection except that at each step we consider dropping variables as in Backward Elimination.
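The forward selection steps of Section 4.6.1 can be sketched as follows. This is a minimal illustration, not the XLMiner implementation: it refits each candidate model from scratch with NumPy's least squares solver, assumes there are more cases than coefficients, and uses a small hypothetical data set in which only the first column is related to y.

```python
import numpy as np

def fit_ssr(X, y, subset):
    """Return (SSR, sigma2_hat) for the model with a constant term and the columns in subset."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ssr = float(resid @ resid)
    return ssr, ssr / (n - A.shape[1])   # assumes n exceeds the number of coefficients

def forward_selection(X, y, f_in=3.84):
    """Add variables one at a time while the F statistic of the best candidate exceeds f_in."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    selected, remaining = [], list(range(X.shape[1]))
    ssr_current, _ = fit_ssr(X, y, selected)          # step 1: constant term only
    while remaining:
        # Step 2: try each variable not yet in S and keep the one with the smallest SSR.
        trials = [(fit_ssr(X, y, selected + [j]), j) for j in remaining]
        (ssr_new, sigma2_new), best = min(trials, key=lambda t: t[0][0])
        gain = ssr_current - ssr_new
        f_stat = gain / sigma2_new if sigma2_new > 0 else float("inf")
        if not f_stat > f_in:
            break                                      # step 3: nothing more can be added
        selected.append(best)
        remaining.remove(best)
        ssr_current = ssr_new
    return selected

# Hypothetical data: y depends on column 0 only; the perturbation is orthogonal to both columns.
x0 = np.arange(1.0, 9.0)
x1 = np.array([1, 1, 0, 0, 0, 0, 1, 1], dtype=float)
noise = 0.1 * np.array([1, -1, -1, 1, -1, 1, 1, -1], dtype=float)
X_demo = np.column_stack([x0, x1])
y_demo = 2 * x0 + noise
chosen = forward_selection(X_demo, y_demo)
```

Here `chosen` contains only column 0. Backward elimination follows the same pattern in reverse: start from all columns and drop the variable with the smallest F statistic while it is below $F_{out}$.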
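Convergence is guaranteed if $F_{out} < F_{in}$ (but it is possible for a variable to enter S and then leave S at a subsequent step and even rejoin S at a yet later step).

As stated above these methods pick one best subset. There are straightforward variations of the methods that do identify several close to best choices for different sizes of independent variable subsets. None of the above methods guarantee that they yield the best subset for any criterion such as adjusted $R^2$ (defined later). They are reasonable methods for situations with large numbers of independent variables but for moderate numbers of independent variables the method discussed next is preferable.

4.6.4 All Subsets Regression

The idea here is to evaluate all subsets. Efficient implementations use branch and bound algorithms (of the type used for integer programming) to avoid explicitly enumerating all subsets. (In fact the subset selection problem can be set up as a quadratic integer program.) We compute a criterion such as $R^2_{adj}$, the adjusted $R^2$, for each subset then choose the best one. (This is only feasible if p is less than about 20.)

4.7 Identifying Subsets of Variables to Improve Predictions

The All Subsets regression (as well as modifications of the heuristic algorithms) will produce a number of subsets. Since the number of subsets for even moderate values of p is very large we need some way to examine the most promising subsets and to select from them. An intuitive metric to compare subsets is $R^2$. However since $R^2 = 1 - \frac{SSR}{SST}$ where SST, the Total Sum of Squares, is the Sum of Squared Residuals for the model with just the constant term, if we use it as a criterion we will always pick the full model with all p variables. One approach is therefore to select the subset with the largest $R^2$ for each possible size k, k = 2, ..., p + 1. (The size is the number of coefficients in the model and is therefore one more than the number of variables in the subset to account for the constant term.) We then examine the increase in $R^2$ as a function of k amongst these subsets and choose a subset such that subsets that are larger in size give only insignificant increases in $R^2$.

Another, more automatic, approach is to choose the subset that maximizes $R^2_{adj}$, a modification of $R^2$ that makes an adjustment to account for size. The formula for $R^2_{adj}$ is

$$R^2_{adj} = 1 - \frac{n - 1}{n - k} (1 - R^2)$$

It can be shown that using $R^2_{adj}$ to choose a subset is equivalent to picking the subset that minimizes $\hat\sigma^2$. (Note that it is possible, though rare, for $R^2_{adj}$ to be negative.)

Table 3 gives the results of the subset selection procedures applied to the training data in Example 1.

Table 3: Subset Selection for Example 1

SST = 2149.000    Fin = 3.840    Fout = 2.710

Forward, backward, and all subsets selections

Size      SSR     RSq   RSq (adj)     Cp    Model
  2    874.467   0.593    0.570    -0.615   Constant X1
  3    786.601   0.634    0.591    -0.161   Constant X1 X3
  4    759.413   0.647    0.580     1.361   Constant X1 X3 X6
  5    743.617   0.654    0.562     3.083   Constant X1 X3 X5 X6
  6    740.746   0.655    0.532     5.032   Constant X1 X2 X3 X5 X6
  7    738.900   0.656    0.497     7.000   Constant X1 X2 X3 X4 X5 X6

Stepwise Selection

Size      SSR     RSq   RSq (adj)     Cp    Model
  2    874.467   0.593    0.570    -0.615   Constant X1
  3    786.601   0.634    0.591    -0.161   Constant X1 X3
  4    783.970   0.635    0.567     1.793   Constant X1 X2 X3
  5    781.089   0.637    0.540     3.742   Constant X1 X2 X3 X4
  6    775.094   0.639    0.511     5.637   Constant X1 X2 X3 X4 X5
  7    738.900   0.656    0.497     7.000   Constant X1 X2 X3 X4 X5 X6

Notice that the step-wise heuristic fails to find the best subset for sizes 4, 5, and 6. The Forward and Backward heuristics do find the best subsets of all sizes and so give identical results as the All Subsets algorithm. The best subset of size 3 consisting of {X1, X3} maximizes $R^2_{adj}$ for all the algorithms. This suggests that we may be better off in terms of MSE of predictions if we use this subset rather than the full model of size 7 with all six variables in the model. Using this model on the validation data gives a slightly higher standard deviation of error (7.3) than the full model (7.1) but this may be a small price to pay if the cost of the survey can be reduced substantially by having 2 questions instead of 6. This example also underscores the fact that we are basing our analysis on small (tiny by data mining standards!) training and validation data sets. Small data sets make our estimates of $R^2$ unreliable.

A criterion that is often used for subset selection is known as Mallow's $C_p$. This criterion assumes that the full model is unbiased although it may have variables that, if dropped, would improve the MSE. With this assumption we can show that if a subset model is unbiased, $E(C_p)$ equals the number of parameters k + 1, the size of the subset (in this paragraph k denotes the number of variables in the subset). So a reasonable approach to identifying subset models with small bias is to examine those with values of $C_p$ that are near k + 1. $C_p$ is also an estimate of the sum of MSE (standardized by dividing by $\sigma^2$) for predictions (the fitted values) at the x-values observed in the training set. Thus good models are those that have values of $C_p$ near k + 1 and that have small k (i.e. are of small size). $C_p$ is computed from the formula:

$$C_p = \frac{SSR}{\hat\sigma^2_{Full}} + 2(k + 1) - n$$

where $\hat\sigma^2_{Full}$ is the estimated value of $\sigma^2$ in the full model that includes all the variables. It is important to remember that the usefulness of this approach depends heavily on the reliability of the estimate of $\sigma^2$ for the full model. This requires that the training set contain a large number of observations relative to the number of variables. We note that for our example only the subsets of size 6 and 7 seem to be unbiased as for the other models $C_p$ differs substantially from k + 1. This is a consequence of having too few observations to estimate $\sigma^2$ accurately in the full model.

Finally, a useful point to note is that for a fixed size of subset, $R^2$, $R^2_{adj}$ and $C_p$ all select the same subset. In fact there is no difference between them in the order of merit they ascribe to subsets of a fixed size.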
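The criteria of Section 4.7 are simple functions of SSR, SST, n and the subset size, so the Table 3 entries can be recomputed by hand or with a short script. The sketch below assumes the values printed in the text (n = 20 training cases, SST = 2149.000, full-model SSR = 738.900) and measures size as the number of coefficients including the constant, so that $C_p = SSR/\hat\sigma^2_{Full} + 2(k+1) - n$ becomes `ssr / sigma2_full + 2 * size - n`; the function names are illustrative.

```python
def r_squared(ssr, sst):
    """R^2 = 1 - SSR/SST."""
    return 1 - ssr / sst

def adjusted_r_squared(ssr, sst, n, size):
    """Adjusted R^2, where size counts coefficients including the constant term."""
    return 1 - (n - 1) / (n - size) * (1 - r_squared(ssr, sst))

def mallows_cp(ssr, sigma2_full, n, size):
    """Mallow's Cp, with size = (number of variables in the subset) + 1."""
    return ssr / sigma2_full + 2 * size - n

N_CASES = 20
SST = 2149.000
SSR_FULL, SIZE_FULL = 738.900, 7
sigma2_full = SSR_FULL / (N_CASES - SIZE_FULL)   # estimate of sigma^2 from the full model

# Best subset of size 3, {X1, X3}, as printed in Table 3.
ssr_x1_x3 = 786.601
rsq = r_squared(ssr_x1_x3, SST)
rsq_adj = adjusted_r_squared(ssr_x1_x3, SST, N_CASES, 3)
cp = mallows_cp(ssr_x1_x3, sigma2_full, N_CASES, 3)
```

For this subset the script gives values that round to the 0.634, 0.591 and -0.161 shown in the table, and for the full model $C_p$ equals its size, 7, by construction.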
Chapter 5 Logistic Regression

Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values as 0 and 1). As with multiple linear regression the independent variables $x_1, x_2, \cdots, x_k$ may be categorical or continuous variables or a mixture of these two types. While in multiple linear regression we end up with an estimate of the value of the continuous dependent variable, in logistic regression we end up with an estimate of the probability that the dependent variable is a 1 (as opposed to a 0). We can then use this probability to classify each case as a 0 or as a 1.

Let us take some examples to illustrate:

5.1 Example 1: Estimating the Probability of Adopting a New Phone Service

The data in Table 1 were obtained in a survey conducted by AT & T in the US from a national sample of cooperating households. We are interested in the adoption rate for a new telecommunications service, as a function of education, residential stability and income.

Table 1: Adoption of New Telephone Service

               High School or below                      Some College or above
               No Change in         Change in            No Change in         Change in
               Residence during     Residence during     Residence during     Residence during
               Last five years      Last five years      Last five years      Last five years
Low Income     153/2160 = 0.071     226/1137 = 0.199     61/886 = 0.069       233/1091 = 0.214
High Income    147/1363 = 0.108     139/547 = 0.254      287/1925 = 0.149     382/1415 = 0.270

(For fractions in cells above, the numerator is the number of adopters and the denominator is the number surveyed in that category.)

Note that the overall probability of adoption in the sample is 1628/10524 = 0.155. However, the adoption probability varies depending on the categorical independent variables education, residential stability and income. The lowest value is 0.069 for low-income no-residence-change households with some college education while the highest is 0.270 for high-income residence changers with some college education.

5.2 Multiple Linear Regression is Inappropriate

The standard multiple linear regression model is inappropriate to model this data for the following reasons:

1. The model's predicted probabilities could fall outside the range 0 to 1.
2. The dependent variable (adoption) is not normally-distributed. In fact a binomial model would be more appropriate. For example, if a cell total is 11 then this variable can take on only 12 distinct values 0, 1, 2, ..., 11. Think of the response of the households in a cell being determined by independent flips of a coin with, say, heads representing adoption with the probability of heads varying between cells.
3. We cannot use the expedient of considering the normal distribution as an approximation for the binomial model because the variance of the dependent variable is not constant across all cells: it will be higher for cells where the probability of adoption, p, is near 0.5 than where it is near 0 or 1. It will also increase with the total number of households, n, falling in the cell. The variance equals n(p(1 − p)).

5.3 The Logistic Regression Model

The logistic regression model was developed to account for all these difficulties. It is used in a variety of fields: whenever a structured model is needed to explain or predict binary outcomes. One such application is in describing choice behavior in econometrics, which is useful in the context of the above example.

In the context of choice behavior the logistic model can be shown to follow from the random utility theory developed by Manski as an extension of the standard economic theory of consumer behavior. In essence the consumer theory states that when faced with a set of choices a consumer makes a choice which has the highest utility (a numeric measure of worth with arbitrary zero and scale). It assumes that the consumer has a preference order on the list of choices that satisfies reasonable criteria such as transitivity. The preference order can depend on the individual (e.g. socioeconomic characteristics as in the Example 1 above) as well as attributes of the choice. The random utility model considers the utility of a choice to incorporate a random element. When we model the random element as coming from a "reasonable" distribution, we can logically derive the logistic model for predicting choice behavior.

If we let y = 1 represent choosing an option versus y = 0 for not choosing it, the logistic regression model stipulates:

$$Probability(Y = 1 | x_1, x_2, \cdots, x_k) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}$$

where $\beta_0, \beta_1, \beta_2, \cdots, \beta_k$ are unknown constants analogous to the multiple linear regression model.

The independent variables for our model would be:

x1 ≡ Education: high school or below = 0, some college or above = 1
x2 ≡ Residential stability: no change over past five years = 0, change over past five years = 1
x3 ≡ Income: low = 0, high = 1

The data in Table 1 are shown below in another summary format.

x1  x2  x3   # in sample   # adopters   # Non-adopters   Fraction adopters
 0   0   0       2160          153           2007              .071
 0   0   1       1363          147           1216              .108
 0   1   0       1137          226            911              .199
 0   1   1        547          139            408              .254
 1   0   0        886           61            825              .069
 1   1   0       1091          233            858              .214
 1   0   1       1925          287           1638              .149
 1   1   1       1415          382           1033              .270
Total           10524         1628           8896             1.000

Typical rows in the actual data file might look like this:

Adopt  X1  X2  X3
  0     0   1   0
  1     1   0   1
  0     0   0   0
etc.
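The cell proportions of Example 1 and the binomial variance argument of Section 5.2 are easy to reproduce. The sketch below tabulates the survey counts from Table 1 keyed by (x1, x2, x3) with the 0/1 coding of Section 5.3; the dictionary layout and function name are choices made for this illustration.

```python
# Survey counts from Table 1: (x1, x2, x3) -> (number surveyed, number of adopters).
CELLS = {
    (0, 0, 0): (2160, 153),
    (0, 0, 1): (1363, 147),
    (0, 1, 0): (1137, 226),
    (0, 1, 1): (547, 139),
    (1, 0, 0): (886, 61),
    (1, 1, 0): (1091, 233),
    (1, 0, 1): (1925, 287),
    (1, 1, 1): (1415, 382),
}

def binomial_variance(n, p):
    """Variance n*p*(1-p) of the number of adopters among n households."""
    return n * p * (1 - p)

fractions = {cell: adopters / n for cell, (n, adopters) in CELLS.items()}
total_surveyed = sum(n for n, _ in CELLS.values())
total_adopters = sum(a for _, a in CELLS.values())
overall_rate = total_adopters / total_surveyed

# The variance differs from cell to cell, which is why constant-variance
# linear regression is a poor fit for these data.
variances = {cell: binomial_variance(n, fractions[cell]) for cell, (n, _) in CELLS.items()}
```

The totals reproduce the 1628/10524 = 0.155 overall adoption rate, and `variances` ranges widely across cells, illustrating the third objection to ordinary linear regression.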
5.4 Odds Ratios

The logistic model for this example is:

$$Prob(Y = 1 | x_1, x_2, x_3) = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)}$$

In other words, for given values of $X_1$, $X_2$ and $X_3$, the probability that Y = 1 is estimated by the expression above.

We obtain a useful interpretation for the coefficients $\beta_0$, $\beta_1$, $\beta_2$ and $\beta_3$ by noting that:

$$\exp(\beta_0) = \frac{Prob(Y = 1 | x_1 = x_2 = x_3 = 0)}{Prob(Y = 0 | x_1 = x_2 = x_3 = 0)} = \text{Odds of adopting in the base case } (x_1 = x_2 = x_3 = 0)$$

$$\exp(\beta_1) = \frac{\text{Odds of adopting when } x_1 = 1, x_2 = x_3 = 0}{\text{Odds of adopting in the base case}}$$

$$\exp(\beta_2) = \frac{\text{Odds of adopting when } x_2 = 1, x_1 = x_3 = 0}{\text{Odds of adopting in the base case}}$$

$$\exp(\beta_3) = \frac{\text{Odds of adopting when } x_3 = 1, x_1 = x_2 = 0}{\text{Odds of adopting in the base case}}$$

The logistic model is multiplicative in odds in the following sense:

Odds of adopting for a given $x_1, x_2, x_3$
   = $\exp(\beta_0) * \exp(\beta_1 x_1) * \exp(\beta_2 x_2) * \exp(\beta_3 x_3)$
   = (Odds for base case) * (Factor due to $x_1$) * (Factor due to $x_2$) * (Factor due to $x_3$)

If $x_1 = 1$ the odds of adoption get multiplied by the same "Factor due to $X_1$," regardless of the level of $x_2$ and $x_3$. Similarly the multiplicative factors for $x_2$ and $x_3$ do not vary with the levels of the remaining factors. The factor for a variable gives us the impact of the presence of that factor on the odds of adopting.

If $\beta_i = 0$, the presence of the corresponding factor has no effect (multiplication by one). If $\beta_i < 0$, presence of the factor reduces the odds (and the probability) of adoption, whereas if $\beta_i > 0$, presence of the factor increases the probability of adoption.

The computations required to produce estimates of the beta coefficients require iterations using a computer program. The output of a typical program is shown below:

                                                    95% Conf. Intvl. for odds
Variable    Coeff.   Std. Error   p-Value   Odds    Lower Limit   Upper Limit
Constant    -2.500     0.058       0.000    0.082      0.071         0.095
x1           0.161     0.058       0.006    1.175      1.048         1.316
x2           0.992     0.056       0.000    2.698      2.416         3.013
x3           0.444     0.058       0.000    1.560      1.393         1.746

5.5 Probabilities

From the estimated values of the coefficients, we see that the estimated probability of adoption for a household with values $x_1$, $x_2$ and $x_3$ for the independent variables is:

$$Prob(Y = 1 | x_1, x_2, x_3) = \frac{\exp(-2.500 + 0.161 x_1 + 0.992 x_2 + 0.444 x_3)}{1 + \exp(-2.500 + 0.161 x_1 + 0.992 x_2 + 0.444 x_3)}$$

The estimated number of adopters from this model will be the total number of households with a given set of values $x_1$, $x_2$ and $x_3$ for the independent variables, multiplied by the above probability, then summed over all observed combinations of values for $X_1$, $X_2$, and $X_3$.

The table below shows the estimated number of adopters for the various combinations of the independent variables.

x1  x2  x3   # in sample   # adopters   Estimated      Fraction    Estimated
                                        (# adopters)   Adopters    Prob(Y = 1|x1, x2, x3)
 0   0   0       2160          153          164          0.071        0.076
 0   0   1       1363          147          155          0.108        0.113
 0   1   0       1137          226          206          0.199        0.181
 0   1   1        547          139          140          0.254        0.257
 1   0   0        886           61           78          0.069        0.088
 1   1   0       1091          233          225          0.214        0.206
 1   0   1       1925          287          252          0.149        0.131
 1   1   1       1415          382          408          0.270        0.289

In data mining applications we will have validation data that is a hold-out sample not used in fitting the model. Let us suppose we have the following validation data consisting of 598 households:

x1  x2  x3   # in validation   # adopters in       Estimated      Error
                 sample        validation sample   (# adopters)   (Estimate - Actual)
 0   0   0          29                3                2.200          -0.800
 0   0   1          23                7                2.610          -4.390
 0   1   0         112               25               20.302          -4.698
 0   1   1         143               27               36.705           9.705
 1   0   0          27                2                2.374           0.374
 1   1   0          54               12               11.145          -0.855
 1   0   1         125               13               16.338           3.338
 1   1   1          85               30               24.528          -5.472
Totals             598              119              116.202          -2.798

We can now apply the model to these validation data. The total error is -2.798 adopters or a percentage error in estimating adopters of -2.798/119 = -2.4%.

As with multiple linear regression, we can build more complex models that reflect interactions between independent variables by including factors that are calculated from the interacting factors. For example if we felt that there is an interactive effect between $x_1$ and $x_2$ we would add an interaction term $x_4 = x_1 \times x_2$.
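The estimated probabilities and expected adopter counts follow mechanically from the fitted coefficients. The sketch below uses the coefficients as printed in the program output (rounded to three decimals), so results may differ slightly in the last digit from tables computed with unrounded estimates; the function and variable names are illustrative.

```python
import math

# Fitted coefficients as printed in the text.
B0, B1, B2, B3 = -2.500, 0.161, 0.992, 0.444

def odds_of_adopting(x1, x2, x3):
    """Odds = exp(b0) * exp(b1*x1) * exp(b2*x2) * exp(b3*x3)."""
    return math.exp(B0 + B1 * x1 + B2 * x2 + B3 * x3)

def adoption_probability(x1, x2, x3):
    """Logistic model: probability = odds / (1 + odds)."""
    o = odds_of_adopting(x1, x2, x3)
    return o / (1 + o)

# Validation sample: (x1, x2, x3) -> (number of households, actual adopters).
VALIDATION = {
    (0, 0, 0): (29, 3),
    (0, 0, 1): (23, 7),
    (0, 1, 0): (112, 25),
    (0, 1, 1): (143, 27),
    (1, 0, 0): (27, 2),
    (1, 1, 0): (54, 12),
    (1, 0, 1): (125, 13),
    (1, 1, 1): (85, 30),
}

estimated_total = sum(n * adoption_probability(*cell) for cell, (n, _) in VALIDATION.items())
actual_total = sum(a for _, a in VALIDATION.values())
total_error = estimated_total - actual_total
percent_error = 100 * total_error / actual_total
```

The odds factor for residence change, `math.exp(B2)`, is about 2.7 whatever the values of x1 and x3, which is the multiplicative property described in Section 5.4.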
5.6 Example 2: Financial Conditions of Banks

Table 2 gives data on a sample of banks. The second column records the judgment of an expert on the financial condition of each bank. The last two columns give the values of two ratios used in financial analysis of banks.

Table 2: Financial Conditions of Banks

Columns: Obs (1 to 20); Financial Condition* (y); Total Loans & Leases / Total Assets (x1); Total Expenses / Total Assets (x2).
Obs 1 to 10 have y = 1 and Obs 11 to 20 have y = 0.
[The individual x1 and x2 entries are scrambled in this copy and cannot be matched to observations, so they are not reproduced here.]

* Financial Condition = 1 for financially weak banks; = 0 for financially strong banks.

5.6.1 A Model with Just One Independent Variable

Consider first a simple logistic regression model with just one independent variable. This is analogous to the simple linear regression model in which we fit a straight line to relate the dependent variable, y, to a single independent variable, x.

Let us construct a simple logistic regression model for classification of banks using the Total Loans & Leases to Total Assets ratio as the independent variable in our model. This model would have the following variables:

Dependent variable: Y = 1, if financially distressed, = 0, otherwise.
Independent (or explanatory) variable: x1 = Total Loans & Leases / Total Assets Ratio

The equation relating the dependent variable to the explanatory variable is:

$$Prob(Y = 1 | x_1) = \frac{\exp(\beta_0 + \beta_1 x_1)}{1 + \exp(\beta_0 + \beta_1 x_1)}$$
Equivalently, for the model with the Total Loans & Leases / Total Assets ratio $x_1$ as the single independent variable,

$$\text{Odds (Y = 1 versus Y = 0)} = \exp(\beta_0 + \beta_1 x_1)$$

The maximum likelihood estimates (more on this below) of the coefficients for the model are: $\hat\beta_0 = -6.926$, $\hat\beta_1 = 10.989$.

So that the fitted model is:

$$Prob(Y = 1 | x_1) = \frac{\exp(-6.926 + 10.989 x_1)}{1 + \exp(-6.926 + 10.989 x_1)}$$

Figure 1 displays the data points and the fitted logistic regression model.

5.6.2 Multiplicative Model of Odds Ratios

We can think of the model as a multiplicative model of odds ratios as we did for Example 1. The odds that a bank with a Loan & Leases/Assets Ratio that is zero will be in financial distress = exp(−6.926) = 0.001. These are the base case odds. The odds of distress for a bank with a ratio of 0.6 will increase by a multiplicative factor of exp(10.989 ∗ 0.6) = 730 over the base case, so the odds that such a bank will be in financial distress are about 0.73 (0.001 × 730 using the rounded base case odds; without rounding, exp(−6.926 + 10.989 ∗ 0.6) ≈ 0.72).

Notice that there is a small difference in interpretation of the multiplicative factors for this example compared to Example 1. While the interpretation of the sign of $\beta_i$ remains as before, its magnitude determines the factor, $\exp(\beta_i)$, by which the odds of Y = 1 against Y = 0 are multiplied for a unit change in $x_i$.

If we construct a simple logistic regression model for classification of banks using the Total Expenses/Total Assets ratio as the independent variable we would have the following variables:

Dependent variable: Y = 1, if financially distressed, = 0, otherwise.
Independent (or explanatory) variable: x2 = Total Expenses/Total Assets Ratio

The equation relating the dependent variable with the explanatory variable is:

$$Prob(Y = 1 | x_2) = \frac{\exp(\beta_0 + \beta_2 x_2)}{1 + \exp(\beta_0 + \beta_2 x_2)}$$

or, equivalently,

$$\text{Odds (Y = 1 versus Y = 0)} = \exp(\beta_0 + \beta_2 x_2)$$

The maximum likelihood estimates of the coefficients for the model are: $\hat\beta_0 = -9.587$, $\hat\beta_2 = 94.345$.

Figure 2 displays the data points and the fitted logistic regression model.
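The odds arithmetic for the bank example can be verified with a few lines of Python. The sketch uses the estimates quoted in the text for the loans-and-leases model (−6.926 and 10.989) and for the expenses model (−9.587 and 94.345); the function names and the ratio values 0.6 and 0.10 at which the models are evaluated are chosen for illustration.

```python
import math

def odds(b0, b1, x):
    """Odds of Y = 1 versus Y = 0 in a one-variable logistic model."""
    return math.exp(b0 + b1 * x)

def probability(b0, b1, x):
    """Convert the odds into a probability."""
    o = odds(b0, b1, x)
    return o / (1 + o)

# Model with x1 = Total Loans & Leases / Total Assets.
B0_LOANS, B1_LOANS = -6.926, 10.989
base_odds = odds(B0_LOANS, B1_LOANS, 0.0)      # about 0.001
factor_at_06 = math.exp(B1_LOANS * 0.6)        # about 730
odds_at_06 = base_odds * factor_at_06          # about 0.72 without intermediate rounding
prob_at_06 = probability(B0_LOANS, B1_LOANS, 0.6)

# Model with x2 = Total Expenses / Total Assets, evaluated at an illustrative ratio of 0.10.
B0_EXP, B2_EXP = -9.587, 94.345
prob_expenses_010 = probability(B0_EXP, B2_EXP, 0.10)
```

Multiplying the rounded base odds 0.001 by 730 gives the 0.730 quoted in the text, while carrying full precision gives odds of about 0.72, a probability of distress of roughly 0.42.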
j = 1 · · · n.6.Computing Maximum Likelihood Estimates and Confidence Intervals for Regression Coefficients We denote the coefficients by the p × 1 column vector β with the row element i equal to βi .7. collinearity (strong correlation amongst the independent variables) can lead to computational difficulties.7 Appendix A . n. 5. 2. Computed estimates are generally reliable for wellbehaved datasets where the number of observations with dependent variable values of both 0 and 1 are ‘large’. • Asymptotically Efficient : the variance is the smallest possible among consistent estimators. The n observed values of the dependent variable will be denoted by the n × 1 column vector y with the row element j equal to yj .3 63 Computation of Estimates As illustrated in Examples 1 and 2.. estimation of coefficients is usually carried out based on the principle of maximum likelihood which ensures good asymptotic (large sample) properties for the estimates. 5. Computationally intensive algorithms have been developed recently that circumvent some of these difficulties. x1j .5. their ratio is ‘not too close’ to either zero or one.1 Data yj . L. βˆi . · · · .. .+βi xpj ) n eΣi yj βi xij 1 + eΣi βi xij j=1 = eΣi (Σj yj xij )βi n [1 + eΣi βi xij ] j=1 = eΣi βi ti n [1 + eΣi βi xij ] j=1 where ti = Σj yj xij These are the sufficient statistics for a logistic regression model analogous to yˆ and S in linear regression. 2. i = 1. . 5. p: Σj xij E(Yj ) = Σj xij yj .64 5. we often work with the loglikelihood because it is generally less cumbersome to use for mathematical operations such as differentiation. of βi by maximizing the loglikelihood function for the observed values of yj and xij in our data. . p = ti − Σj xij π = ti − Σ j ˆ j = ti or Σi xij π where π ˆj = eΣi βi xij [1+eΣi βi xij ] = E(Yj ).3 Loglikelihood Function This is the logarithm of the likelihood function. Since maximizing the log of a function is equivalent to maximizing the function. 
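With the data laid out as above (y as a vector of 0/1 outcomes, the $x_{ij}$ as an n × p array), the iterative maximum likelihood computation mentioned in Section 5.6.3 can be sketched in a few lines. This is only a minimal illustration, not XLMiner's implementation: the function name `fit_logistic_newton`, the synthetic data, and the convergence settings are all assumptions made here, and the gradient and information matrix it uses are the ones derived in Sections 5.7.3 and 5.7.4. The last lines simply redo the odds arithmetic of Section 5.6.2 with the reported bank coefficients.

```python
import numpy as np

def fit_logistic_newton(X, y, max_iter=25, tol=1e-10):
    """Newton-Raphson maximum likelihood for logistic regression (a sketch).

    X is an n x p array whose first column is all ones (the constant term);
    y is a length-n array of 0/1 outcomes.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))            # fitted P(Y_j = 1)
        grad = X.T @ (y - pi)                           # t_i - sum_j x_ij * pi_j
        info = X.T @ (X * (pi * (1.0 - pi))[:, None])   # minus the Hessian of l
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    pi = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (pi * (1.0 - pi))[:, None])
    std_err = np.sqrt(np.diag(np.linalg.inv(info)))     # approximate standard errors
    return beta, std_err, pi

# Synthetic, hypothetical data (NOT the bank data of Table 2).
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-2.0, 4.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat, std_err, pi_hat = fit_logistic_newton(X, y)
print("estimates:", beta_hat, "standard errors:", std_err)
print("sum of fitted probabilities:", pi_hat.sum(), "observed successes:", y.sum())

# Odds arithmetic for the fitted bank model reported in Section 5.6.1.
base_odds = np.exp(-6.926)            # odds at x1 = 0
odds_factor = np.exp(10.989 * 0.6)    # multiplicative factor at x1 = 0.6
print("base odds:", base_odds, "factor:", odds_factor, "odds:", base_odds * odds_factor)
```

Because the design matrix includes a constant column, the fitted probabilities should sum to the observed number of successes at convergence. Carrying full precision, the odds at a ratio of 0.6 come out at about 0.72; the 0.730 quoted in the text uses the base odds rounded to 0.001.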
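5.7.2 Likelihood Function

The likelihood function, L, is the probability of the observed data viewed as a function of the parameters ($\beta_i$ in a logistic regression).

\[
L = \prod_{j=1}^{n} \frac{e^{y_j(\beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_p x_{pj})}}{1 + e^{\beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_p x_{pj}}}
= \prod_{j=1}^{n} \frac{e^{\sum_i y_j \beta_i x_{ij}}}{1 + e^{\sum_i \beta_i x_{ij}}}
= \frac{e^{\sum_i (\sum_j y_j x_{ij}) \beta_i}}{\prod_{j=1}^{n} \left[1 + e^{\sum_i \beta_i x_{ij}}\right]}
= \frac{e^{\sum_i \beta_i t_i}}{\prod_{j=1}^{n} \left[1 + e^{\sum_i \beta_i x_{ij}}\right]}
\]

where $t_i = \sum_j y_j x_{ij}$. These are the sufficient statistics for a logistic regression model, analogous to $\hat{y}$ and S in linear regression.

5.7.3 Loglikelihood Function

This is the logarithm of the likelihood function,

\[ l = \sum_i \beta_i t_i - \sum_j \log\left[1 + e^{\sum_i \beta_i x_{ij}}\right]. \]

Since maximizing the log of a function is equivalent to maximizing the function, we often work with the loglikelihood because it is generally less cumbersome to use for mathematical operations such as differentiation.

We find the maximum likelihood estimates, $\hat\beta_i$, of $\beta_i$ by maximizing the loglikelihood function for the observed values of $y_j$ and $x_{ij}$ in our data. Since the loglikelihood function can be shown to be concave, we will find the global maximum of the function (if it exists) by equating the partial derivatives of the loglikelihood to zero and solving the resulting nonlinear equations for $\hat\beta_i$:

\[
\frac{\partial l}{\partial \beta_i} = t_i - \sum_j \frac{x_{ij}\, e^{\sum_i \hat\beta_i x_{ij}}}{1 + e^{\sum_i \hat\beta_i x_{ij}}} = t_i - \sum_j x_{ij} \hat\pi_j = 0, \quad i = 1, 2, \ldots, p
\]

or

\[ \sum_j x_{ij} \hat\pi_j = t_i, \quad \text{where } \hat\pi_j = \frac{e^{\sum_i \hat\beta_i x_{ij}}}{1 + e^{\sum_i \hat\beta_i x_{ij}}} = E(Y_j). \]

An intuitive way to understand these equations is to note that for i = 1, 2, ..., p:

\[ \sum_j x_{ij} E(Y_j) = \sum_j x_{ij} y_j. \]

In words, the maximum likelihood estimates are such that the expected values of the sufficient statistics are equal to their observed values.

Note: If the model includes the constant term $x_{ij} = 1$ for all j then $\sum_j E(Y_j) = \sum_j y_j$, i.e. the expected number of successes (responses of one) using MLE estimates of $\beta_i$ equals the observed number of successes. The $\hat\beta_i$'s are consistent, asymptotically efficient and follow a multivariate Normal distribution (subject to mild regularity conditions).

5.7.4 Algorithm

A popular algorithm for computing $\hat\beta_i$ uses the Newton-Raphson method for maximizing twice differentiable functions of several variables (see Appendix B). The Newton-Raphson method involves computing the following successive approximations to find $\hat\beta_i$, the values that maximize the likelihood function:

\[ \beta^{t+1} = \beta^t + [I(\beta^t)]^{-1} \nabla l(\beta^t), \quad \text{where } I_{ij} = -\frac{\partial^2 l}{\partial \beta_i\, \partial \beta_j}. \]

• On convergence, the diagonal elements of $I(\beta^t)^{-1}$ give squared standard errors (approximate variance) for $\hat\beta_i$.
• Confidence intervals and hypothesis tests are based on the asymptotic Normal distribution of $\hat\beta_i$.

The loglikelihood function is always negative and does not have a maximum when it can be made arbitrarily close to zero. In that case the likelihood function can be made arbitrarily close to one and the first term of the loglikelihood function given above approaches infinity. In this situation the predicted probabilities for observations with $y_j = 0$ can be made arbitrarily close to 0 and those for $y_j = 1$ can be made arbitrarily close to 1 by choosing suitable very large absolute values of some $\beta_i$. This is the situation when we have a perfect model (at least in terms of the training data set)! This phenomenon is more likely to occur when the number of parameters is a large fraction (say > 20%) of the number of observations.

5.8 Appendix B - The Newton-Raphson Method

This method finds the values of $\beta_i$ that maximize a twice differentiable concave function, g(β). If the function is not concave, it finds a local maximum. The method uses successive quadratic approximations to g based on Taylor series. It converges rapidly if the starting value, $\beta^0$, is reasonably close to the maximizing value, $\hat\beta$, of β.

The gradient vector ∇g and the Hessian matrix H, whose definitions are given at the end of this appendix, are used to update an estimate $\beta^t$ to $\beta^{t+1}$. The Taylor series expansion around $\beta^t$ gives us:

\[ g(\beta) \simeq g(\beta^t) + \nabla g(\beta^t)'(\beta - \beta^t) + \tfrac{1}{2}(\beta - \beta^t)' H(\beta^t)(\beta - \beta^t). \]

Provided $H(\beta^t)$ is negative definite, the maximum of this approximation occurs when its derivative is zero:

\[ \nabla g(\beta^t) + H(\beta^t)(\beta - \beta^t) = 0 \quad \text{or} \quad \beta = \beta^t - [H(\beta^t)]^{-1} \nabla g(\beta^t). \]

This gives us a way to compute $\beta^{t+1}$, the next value in our iterations:

\[ \beta^{t+1} = \beta^t - [H(\beta^t)]^{-1} \nabla g(\beta^t). \]

To use this equation H should be non-singular. This is generally not a problem although sometimes numerical difficulties can arise due to collinearity.

Near the maximum the rate of convergence is quadratic, as it can be shown that $|\beta_i^{t+1} - \hat\beta_i| \le c\,|\beta_i^t - \hat\beta_i|^2$ for some c ≥ 0 when $\beta_i^t$ is near $\hat\beta_i$ for all i.

The gradient vector and the Hessian matrix used in these formulas are both evaluated at $\beta^t$:

\[ \nabla g(\beta^t) = \left[ \frac{\partial g}{\partial \beta_i} \right]_{\beta^t} \]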
\[ H(\beta^t) = \left[ \frac{\partial^2 g}{\partial \beta_i\, \partial \beta_k} \right]_{\beta^t} \]

Chapter 6
Neural Nets

6.1 The Neuron (a Mathematical Model)

After going through major development periods in the early 60's and mid 80's, artificial neural networks have emerged as a major paradigm for data mining applications. They were a key development in the field of machine learning. Artificial neural networks were inspired by biological findings relating to the behavior of the brain as a network of units called neurons. The human brain is estimated to have around 10 billion neurons, each connected on average to 10,000 other neurons. Each neuron receives signals through synapses that control the effects of the signal on the neuron. These synaptic connections are believed to play a key role in the behavior of the brain.

The fundamental building block in an artificial neural network is the mathematical model of a neuron as shown in Figure 1. The three basic components of the (artificial) neuron are:

1. The synapses or connecting links that provide weights, $w_j$, to the input values, $x_j$ for j = 1, ..., m;

2. An adder that sums the weighted input values to compute the input to the activation function, $v = w_0 + \sum_{j=1}^{m} w_j x_j$, where $w_0$ is called the bias (not to be confused with statistical bias in prediction or estimation) and is a numerical value associated with the neuron. It is convenient to think of the bias as the weight for an input $x_0$ whose value is always equal to one, so that $v = \sum_{j=0}^{m} w_j x_j$;

3. An activation function g (also called a squashing function) that maps v to g(v), the output value of the neuron. This function is a monotone function.

Figure 1 - A Neuron

While there are numerous different (artificial) neural network architectures that have been studied by researchers, the most successful applications in data mining of neural networks have been multilayer feedforward networks. These are networks in which there is an input layer consisting of nodes that simply accept the input values and successive layers of nodes that are neurons like the one depicted in Figure 1.
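Returning to the single neuron of Figure 1, its three components (weights, adder, activation function) can be sketched directly. The names `neuron_output` and `logistic` and the example weights are assumptions made for illustration; the logistic function is used here only as one example of a monotone activation function g.

```python
import math

def logistic(v):
    """An assumed choice of activation g: maps any v into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(weights, inputs, g=logistic):
    """weights = [w0, w1, ..., wm]; inputs = [x1, ..., xm]."""
    x = [1.0] + list(inputs)                              # x0 = 1 carries the bias w0
    v = sum(w_j * x_j for w_j, x_j in zip(weights, x))    # the adder
    return g(v)                                           # the activation function

# Hypothetical weights and inputs: v = 0.5 + 1.0*3.0 - 2.0*1.0 = 1.5
print(neuron_output([0.5, 1.0, -2.0], [3.0, 1.0]))                  # logistic activation
print(neuron_output([0.5, 1.0, -2.0], [3.0, 1.0], g=lambda v: v))   # identity activation
```

Treating the bias as the weight on a constant input $x_0 = 1$ keeps the adder a single sum over j = 0, ..., m, which is the convention the text adopts.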
6.2 The Multilayer Neural Networks

In a multilayer feedforward network, the outputs of neurons in a layer are inputs to neurons in the next layer. The last layer is called the output layer. Layers between the input and output layers are known as hidden layers. Figure 2 is a diagram for this architecture.

Figure 2 : Multilayer Feed-forward Neural Network

In a supervised setting where a neural net is used to predict a numerical quantity there is one neuron in the output layer and its output is the prediction. When the network is used for classification, the output layer typically has as many nodes as the number of classes, and the output layer node with the largest output value gives the network's estimate of the class for a given input. In the special case of two classes it is common to have just one node in the output layer, the classification between the two classes being made by applying a cut-off to the output value at the node.

6.2.1 Single Layer Networks

Let us begin by examining neural networks with just one layer of neurons (output layer only, no hidden layers). The simplest network consists of just one neuron with the function g chosen to be the identity function, g(v) = v for all v. In this case notice that the output of the network is $\sum_{j=0}^{m} w_j x_j$, a linear function of the input vector x with components $x_j$. Does this seem familiar? It looks similar to multiple linear regression. If we are modeling the dependent variable y using multiple linear regression, we can interpret the neural network as a structure that predicts a value $\hat{y}$ for a given input vector x with the weights being the coefficients. If we choose these weights to minimize the mean square error using observations in a training set, these weights would simply be the least squares estimates of the coefficients.

The weights in neural nets are also often designed to minimize mean square error in a training data set. There is, however, a different orientation in the case of neural nets: the weights are "learned" over time, rather than calculated in one step. The network is presented with cases from the training data one at a time and the weights are revised after each case in an attempt to minimize the mean square error. This process of incremental adjustment of weights is based on the error made on training cases and is known as 'training' the neural net.

The almost universally used dynamic updating algorithm for the neural net version of linear regression is known as the Widrow-Hoff rule or the least-mean-square (LMS) algorithm. It is simply stated. Let x(i) denote the input vector x for the ith case used to train the network, and denote the weights (before this case is presented to the net) by the vector w(i). The updating rule is

\[ w(i+1) = w(i) + \eta\,\big(y(i) - \hat{y}(i)\big)\,x(i), \quad \text{with } w(0) = 0. \]

It can be shown that if the network is trained in this manner by repeatedly presenting training data observations one at a time, then for suitably small (absolute) values of η the network will learn (converge to) the optimal values of w. Note that the training data may have to be presented several times for w(i) to be close to the optimal w. The advantage of dynamic updating is that the network tracks moderate time trends in the underlying linear model quite effectively.

If we consider using the single layer neural net for classification into c classes, we would use c nodes in the output layer. If we think of classical discriminant analysis in neural network terms, the coefficients in Fisher's classification functions give us weights for the network that are optimal if the input vectors come from multivariate Normal distributions with a common covariance matrix.

Maximum likelihood coefficients for logistic regression can also be considered as weights in a neural network that minimize a function of the residuals called the deviance. In this case the logistic function $g(v) = \frac{e^v}{1 + e^v}$ is the activation function for the output node.

6.2.2 Multilayer Neural Networks

Multilayer neural networks are undoubtedly the most popular networks used in applications. While it is possible to consider many activation functions, in practice it has been found that the logistic (also called the sigmoid) function $g(v) = \frac{e^v}{1 + e^v}$ (or minor variants such as the tanh function) works best as the activation function to map the sum of the weighted inputs to the neuron's output. In fact the revival of interest in neural nets was sparked by successes in training neural networks using this function in place of the historically (biologically inspired) step function (the "perceptron"). Notice that using a linear function does not achieve anything in multilayer networks that is beyond what can be done with single layer networks with linear activation functions. The practical value of the logistic function arises from the fact that it has a squashing effect on very small or very large values of v, but is almost linear in the range where g(v) is between 0.1 and 0.9.

In theory it is sufficient to consider networks with two layers of neurons (one hidden and one output layer) and this is certainly the case for most applications. There are, however, a number of situations where three and sometimes four and five layers have been more effective. For prediction the output node is often given a linear activation function to provide forecasts that are not limited to the zero to one range. An alternative is to scale the output to the linear part (0.1 to 0.9) of the logistic function.

Unfortunately there is no clear theory to guide us on choosing the number of nodes in each hidden layer or indeed the number of layers. The common practice is to use trial and error, although there are schemes for combining optimization methods such as genetic algorithms with network training for these parameters.

Since trial and error is a necessary part of neural net applications it is important to have an understanding of the standard method used to train a multilayered network: backpropagation. It is no exaggeration to say that the speed of the backprop algorithm made neural nets a practical tool in the manner that the simplex method made linear optimization a practical tool. The revival of strong interest in neural nets in the mid 80s was in large measure due to the efficiency of the backprop algorithm.

6.3 Example 1: Fisher's Iris data

Let us look at the Iris data that Fisher analyzed using discriminant analysis. Recall that the data consisted of four measurements on three types of iris flowers. There are 50 observations for each class of iris. A part of the data is reproduced below.

OBS#  SPECIES          CLASSCODE  SEPLEN  SEPW  PETLEN  PETW
1     Iris-setosa      1          5.1     3.5   1.4     0.2
2     Iris-setosa      1          4.9     3.0   1.4     0.2
3     Iris-setosa      1          4.7     3.2   1.3     0.2
4     Iris-setosa      1          4.6     3.1   1.5     0.2
5     Iris-setosa      1          5.0     3.6   1.4     0.2
6     Iris-setosa      1          5.4     3.9   1.7     0.4
7     Iris-setosa      1          4.6     3.4   1.4     0.3
8     Iris-setosa      1          5.0     3.4   1.5     0.2
9     Iris-setosa      1          4.4     2.9   1.4     0.2
10    Iris-setosa      1          4.9     3.1   1.5     0.1
...
51    Iris-versicolor  2          7.0     3.2   4.7     1.4
52    Iris-versicolor  2          6.4     3.2   4.5     1.5
53    Iris-versicolor  2          6.9     3.1   4.9     1.5
54    Iris-versicolor  2          5.5     2.3   4.0     1.3
55    Iris-versicolor  2          6.5     2.8   4.6     1.5
56    Iris-versicolor  2          5.7     2.8   4.5     1.3
57    Iris-versicolor  2          6.3     3.3   4.7     1.6
58    Iris-versicolor  2          4.9     2.4   3.3     1.0
59    Iris-versicolor  2          6.6     2.9   4.6     1.3
60    Iris-versicolor  2          5.2     2.7   3.9     1.4
...
101   Iris-virginica   3          6.3     3.3   6.0     2.5
102   Iris-virginica   3          5.8     2.7   5.1     1.9
103   Iris-virginica   3          7.1     3.0   5.9     2.1
104   Iris-virginica   3          6.3     2.9   5.6     1.8
105   Iris-virginica   3          6.5     3.0   5.8     2.2
106   Iris-virginica   3          7.6     3.0   6.6     2.1
107   Iris-virginica   3          4.9     2.5   4.5     1.7
108   Iris-virginica   3          7.3     2.9   6.3     1.8
109   Iris-virginica   3          6.7     2.5   5.8     1.8
110   Iris-virginica   3          7.2     3.6   6.1     2.5

If we use a neural net architecture for this classification problem we will need 4 nodes (not counting the bias node) in the input layer, one for each of the 4 independent variables, and 3 neurons (one for each class) in the output layer. Let us have one hidden layer with 25 neurons. Notice that there will be a total of 25 connections from each node in the input layer, one to each node in the hidden layer. This makes a total of 4 x 25 = 100 connections between the input layer and the hidden layer. In addition there will be a total of 3 connections from each node in the hidden layer, one to each node in the output layer. This makes a total of 25 x 3 = 75 connections between the hidden layer and the output layer.

Using the standard sigmoid (logistic) activation functions, the network was trained with a run consisting of 60,000 iterations. Each iteration consists of presentation to the input layer of the independent variables in a case, followed by successive computations of the outputs of the neurons of the hidden layer and the output layer using the appropriate weights. The output values of neurons in the output layer are used to compute the error. This error is used to adjust the weights of all the connections in the network using the backward propagation ("backprop") to complete the iteration. Since the training data has 150 cases, each case was presented to the network 400 times. Another way of stating this is to say the network was trained for 400 epochs, where an epoch consists of one sweep through the entire training data. The results following the last epoch of training the neural net on this data are shown in Figure 3.
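The bookkeeping for this architecture (connection counts and number of epochs) and the computation carried out when one case is presented to the network can be sketched as follows. This is an illustration only: the function name `forward`, the separate bias vectors `b1` and `b2`, and the random, untrained weights are assumptions, so the class it picks is meaningless until the network has been trained.

```python
import numpy as np

n_inputs, n_hidden, n_outputs = 4, 25, 3
input_to_hidden = n_inputs * n_hidden       # 4 x 25 connections
hidden_to_output = n_hidden * n_outputs     # 25 x 3 connections
epochs = 60_000 // 150                      # iterations / number of training cases
print(input_to_hidden, hidden_to_output, epochs)

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x, W1, b1, W2, b2):
    """Outputs of the 3 output neurons for one case x (4 measurements)."""
    hidden = logistic(W1 @ x + b1)          # 25 hidden-layer outputs
    return logistic(W2 @ hidden + b2)       # 3 output-layer outputs

# Placeholder weights: small random values, NOT trained weights.
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.05, 0.05, size=(n_hidden, n_inputs))
b1 = rng.uniform(-0.05, 0.05, size=n_hidden)
W2 = rng.uniform(-0.05, 0.05, size=(n_outputs, n_hidden))
b2 = rng.uniform(-0.05, 0.05, size=n_outputs)

x = np.array([5.1, 3.5, 1.4, 0.2])          # SEPLEN, SEPW, PETLEN, PETW of one flower
outputs = forward(x, W1, b1, W2, b2)
predicted_class = int(np.argmax(outputs)) + 1   # class codes run from 1 to 3
print(outputs, predicted_class)
```

The node with the largest output gives the network's class estimate, as described in Section 6.2.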
Figure 3 : XL Miner output for neural network for Iris data

Classification Confusion Matrix

                 Computed Class
Desired class    1     2     3     Total
1                50    0     0     50
2                0     49    1     50
3                0     1     49    50
Total            50    50    50    150

Error Report

Class      Patterns   # Errors   % Errors   Std Dev.
1          50         0          0.00       (0.00)
2          50         1          2.00       (1.98)
3          50         1          2.00       (1.98)
Overall    150        2          1.3        (0.92)

The classification error of 1.3% is comparable to the error using discriminant analysis, which was 2% (see section on discriminant analysis). Notice that had we stopped after only one pass of the data (150 iterations) the error would have been much worse, as shown below:

Figure 4 : XL Miner output for neural network for Iris data, after only one epoch

Classification Confusion Matrix

                 Computed Class
Desired Class    1     2     3     Total
1                10    7     2     19
2                13    1     6     20
3                12    5     4     21
Total            35    13    12    60

The classification error rate of 1.3% was obtained by careful choice of key control parameters for the training run by trial and error. If we set the control parameters to poor values we can have terrible results. To understand the parameters involved we need to understand how the backward propagation algorithm works.

6.4 The Backward Propagation (backprop) Algorithm - Classification

The backprop algorithm cycles through two distinct passes, a forward pass followed by a backward pass through the layers of the network. The algorithm alternates between these passes several times as it scans the training data. Typically the training data has to be scanned several times before the network "learns" to make good classifications.

6.4.1 Forward Pass - Computation of Outputs of all the Neurons in the Network

The algorithm starts with the first hidden layer using as input values the independent variables of a case (often called an exemplar) from the training data set. The neuron outputs are computed for all neurons in the first hidden layer by performing the relevant sum and activation function evaluations. These outputs are the inputs for neurons in the second hidden layer. Again the relevant sum and activation function calculations are performed to compute the outputs of second layer neurons. This continues layer by layer until we reach the output layer and compute the outputs for this layer. These output values constitute the neural net's guess at the value of the dependent (output) variable. If we are using the neural net for classification and we have c classes, the activation functions yield c neuron outputs for the c output nodes. The output node with the largest value determines the net's classification. (If c = 2, we can use just one output node with a cut-off value to map a numerical output value to one of the two classes.)

Where do the initial weights come from and how are they adjusted? Let us denote by $w_{ij}$ the weight of the connection from node i to node j. The values of $w_{ij}$ are initialized to small (generally random) numbers in the range 0.00 ± 0.05. These weights are adjusted to new values in the backward pass as described below.

6.4.2 Backward Pass: Propagation of Error and Adjustment of Weights

This phase begins with the computation of error at each neuron in the output layer. A popular error function is the squared difference between $o_k$, the output of node k, and $y_k$, the target value for that node. The target value is just 1 for the output node corresponding to the class of the exemplar and zero for other output nodes. (In practice it has been found better to use values of 0.9 and 0.1 respectively.) For each output layer node compute an adjustment term

\[ \delta_k = o_k (1 - o_k)(y_k - o_k). \]

These terms are used to adjust the weights of the connections between the last-but-one layer of the network and the output layer. The adjustment is similar to the simple Widrow-Hoff rule that we saw earlier. The new value of the weight $w_{jk}$ of the connection from node j to node k is given by:

\[ w_{jk}^{new} = w_{jk}^{old} + \eta\, o_j\, \delta_k. \]

Here η is an important tuning parameter that is chosen by trial and error by repeated runs on the training data. Typical values for η are in the range 0.1 to 0.9. Low values give slow but steady learning; high values give erratic learning and may lead to an unstable network.

The process is repeated for the connections between nodes in the last hidden layer and the last-but-one hidden layer. The weight for the connection between nodes i and j is given by:

\[ w_{ij}^{new} = w_{ij}^{old} + \eta\, o_i\, \delta_j, \quad \text{where } \delta_j = o_j (1 - o_j) \sum_k w_{jk}\, \delta_k, \]

for each node j in the last hidden layer.

The backward propagation of weight adjustments along these lines continues until we reach the input layer. At this time we have a new set of weights on which we can make a new forward pass when presented with a training data observation.

6.5 Adjustment for Prediction

There is a minor adjustment for prediction problems where we are trying to predict a continuous numerical value. In that situation we change the activation function for output layer neurons to the identity function that has output value = input value. (An alternative is to rescale and recenter the logistic function to permit the outputs to be approximately linear in the range of dependent variable values.)

6.6 Multiple Local Optima and Epochs

Due to the complexity of the function and the large numbers of weights that are being "trained" as the network "learns," there is no assurance that the backprop algorithm (or indeed any practical algorithm) will find the optimum weights that minimize error. The procedure can get stuck at a local minimum. It has been found useful to randomize the order of presentation of the cases in a training set between different scans. It is possible to speed up the algorithm by batching, that is updating the weights for several cases in a pass, rather than after each case. However, at least the extreme case of using the entire training data set on each update has been found to get stuck frequently at poor local minima.

A single scan of all cases in the training data is called an epoch. Most applications of feedforward networks and backprop require several epochs before errors are reasonably small. A number of modifications have been proposed to reduce the number of epochs needed to train a neural net. One commonly employed idea is to incorporate a momentum term that injects some inertia in the weight adjustment on the backward pass. This is done by adding a term to the expression for weight adjustment for a connection that is a fraction of the previous weight adjustment for that connection. This fraction is called the momentum control parameter. High values of the momentum parameter will force successive weight adjustments to be in similar directions. Another idea is to vary the adjustment parameter η so that it decreases as the number of epochs increases. Intuitively this is useful because it avoids overfitting that is more likely to occur at later epochs than earlier ones.

6.7 Overfitting and the choice of training epochs

A weakness of the neural network is that it can be easily overfitted, causing the error rate on validation data (and, most importantly, new data) to be too large. It is therefore important to limit the number of training epochs and not to overtrain the data. One system that some algorithms use to set the number of training epochs is to use the validation data set periodically to compute the error rate for it while the network is being trained. The validation error decreases in the early epochs of backprop but after a while it begins to increase. The point of minimum validation error is a good indicator of the best number of epochs for training, and the weights at that stage are likely to provide the best error rate in new data.
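The forward and backward passes of Section 6.4 can be put together in a short sketch for a network with one hidden layer. It is a minimal illustration under stated assumptions, not the XLMiner implementation: the function name `backprop_step`, the tiny 2-3-2 network, the single hypothetical training case and the value of η are all made up here, and bias weights are stored in column 0 of each weight matrix. The adjustment terms follow the formulas for $\delta_k$ and $\delta_j$ given in Section 6.4.2; momentum and validation-based stopping are left out.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, y, W1, W2, eta):
    """One forward pass and one backward pass for a single case.

    W1 has shape (hidden, inputs + 1) and W2 has shape (outputs, hidden + 1);
    column 0 of each matrix holds the bias weights.
    """
    # Forward pass: layer by layer, from the inputs to the output layer.
    o_in = np.concatenate(([1.0], x))
    o_hid = logistic(W1 @ o_in)
    o_hid_b = np.concatenate(([1.0], o_hid))
    o_out = logistic(W2 @ o_hid_b)
    # Backward pass: adjustment terms for the output nodes, then the hidden nodes.
    delta_out = o_out * (1.0 - o_out) * (y - o_out)
    delta_hid = o_hid * (1.0 - o_hid) * (W2[:, 1:].T @ delta_out)
    W2_new = W2 + eta * np.outer(delta_out, o_hid_b)   # w_jk + eta * o_j * delta_k
    W1_new = W1 + eta * np.outer(delta_hid, o_in)      # w_ij + eta * o_i * delta_j
    return W1_new, W2_new, o_out

rng = np.random.default_rng(0)
W1 = rng.uniform(-0.05, 0.05, size=(3, 2 + 1))   # initial weights in 0.00 +/- 0.05
W2 = rng.uniform(-0.05, 0.05, size=(2, 3 + 1))
x = np.array([0.2, 0.7])                         # hypothetical case
y = np.array([0.9, 0.1])                         # targets 0.9 / 0.1 rather than 1 / 0

_, _, o_before = backprop_step(x, y, W1, W2, eta=0.0)   # eta = 0: forward pass only
err_before = float(np.sum((y - o_before) ** 2))
for _ in range(100):                             # present the same case repeatedly
    W1, W2, _ = backprop_step(x, y, W1, W2, eta=0.5)
_, _, o_after = backprop_step(x, y, W1, W2, eta=0.0)
err_after = float(np.sum((y - o_after) ** 2))
print("squared error before:", err_before, "after:", err_after)
```

In a real training run the cases would be cycled through epoch by epoch (ideally in randomized order), and the error on a validation set would be monitored to decide when to stop, as discussed in Sections 6.6 and 6.7.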
6.8 Adaptive Selection of Architecture

One of the time consuming and complex aspects of using backprop is that we need to decide on an architecture before we can use backprop. The usual procedure is to make intelligent guesses using past experience and to do several trial and error runs on different architectures. Algorithms exist that grow the number of nodes selectively during training or trim them in a manner analogous to what we have seen with CART. Research continues on such methods. However, as of now there seems to be no automatic method that is clearly superior to the trial and error approach.

6.9 Successful Applications

There have been a number of very successful applications of neural nets in engineering applications. One of the well known ones is ALVINN, an autonomous vehicle driving application for normal speeds on highways. The neural net uses a 30x32 grid of pixel intensities from a fixed camera on the vehicle as input; the output is the direction of steering. It has 960 input units and a single layer of 4 hidden neurons. It uses 30 output units representing classes such as "sharp left", "straight ahead", and "bear right". The backprop algorithm is used to train ALVINN.

A number of successful applications have been reported in financial applications (see Trippi and Turban, 1996) such as bankruptcy predictions, currency market trading, picking stocks and commodity trading. Credit card and CRM (customer relationship management) applications have also been reported.

Chapter 7
Classification and Regression Trees

If one had to choose a classification technique that performs well across a wide range of situations without requiring much effort from the analyst while being readily understandable by the consumer of the analysis, a strong contender would be the tree methodology developed by Breiman, Friedman, Olshen and Stone (1984). We will discuss this classification procedure first, then in later sections we will show how the procedure can be extended to prediction of a continuous dependent variable. The program that Breiman et al. created to implement these procedures was called CART for Classification And Regression Trees.

7.1 Classification Trees

There are two key ideas underlying classification trees. The first is the idea of recursive partitioning of the space of the independent variables. The second is of pruning using validation data. In the next few sections we describe recursive partitioning; subsequent sections explain the pruning methodology.

7.2 Recursive Partitioning

Let us denote the dependent (outcome) variable by y and the independent (predictor) variables by $x_1, x_2, x_3, \ldots, x_p$. In classification, the outcome variable will be a categorical variable. The x variables here are considered to be continuous, binary or ordinal. Recursive partitioning divides up the p dimensional space of the x variables into non-overlapping multidimensional rectangles. This division is accomplished recursively (i.e. sequentially, operating on the results of prior divisions). First, one of the variables is selected, say $x_i$, and a value of $x_i$, say $s_i$, is chosen to split the p dimensional space into two parts: one part that contains all the points with $x_i \le s_i$ and the other with all the points with $x_i > s_i$. Then one of these two parts is divided in a similar manner by choosing a variable again (it could be $x_i$ or another variable) and a split value for the variable. This results in three (multi-dimensional) rectangular regions. This process is continued so that we get smaller and smaller rectangular regions. The idea is to divide the entire x-space up into rectangles such that each rectangle is as homogenous or 'pure' as possible.
By 'pure' we mean containing points that belong to just one class. (Of course, this is not always possible, as there may be points that belong to different classes but have exactly the same values for every one of the independent variables.) Let us illustrate recursive partitioning with an example.

7.3 Example 1 - Riding Mowers

A riding-mower manufacturer would like to find a way of classifying families in a city into those likely to purchase a riding mower and those not likely to buy one. A pilot random sample of 12 owners and 12 non-owners in the city is undertaken. The data are shown in Table 1 and plotted in Figure 1 below. The independent variables here are Income (x1) and Lot Size (x2). The categorical y variable has two classes: owners and non-owners.

Table 1

Observation   Income ($ 000's)   Lot Size (000's sq. ft.)   Owners=1, Non-owners=2
1             60.0               18.4                       1
2             85.5               16.8                       1
3             64.8               21.6                       1
4             61.5               20.8                       1
5             87.0               23.6                       1
6             110.1              19.2                       1
7             108.0              17.6                       1
8             82.8               22.4                       1
9             69.0               20.0                       1
10            93.0               20.8                       1
11            51.0               22.0                       1
12            81.0               20.0                       1
13            75.0               19.6                       2
14            52.8               20.8                       2
15            64.8               17.2                       2
16            43.2               20.4                       2
17            84.0               17.6                       2
18            49.2               17.6                       2
19            59.4               16.0                       2
20            66.0               18.4                       2
21            47.4               16.4                       2
22            33.0               18.8                       2
23            51.0               14.0                       2
24            63.0               14.8                       2

Figure 1

If we apply the classification tree procedure to this data it will choose x2 for the first split with a splitting value of 19. The (x1, x2) space is now divided into two rectangles, one with the Lot Size variable x2 ≤ 19 and the other with x2 > 19. See Figure 2.

Figure 2

Notice how the split has created two rectangles, each of which is much more homogenous than the rectangle before the split. The upper rectangle contains points that are mostly owners (9 owners and 3 non-owners) while the lower rectangle contains mostly non-owners (9 non-owners and 3 owners).

How was this particular split selected? The procedure examined each variable and all possible split values for each variable to find the best split. What are the possible split values for a variable? They are simply the mid-points between pairs of consecutive values for the variable. The possible split points for x1 are {38.1, 45.3, 50.1, ..., 109.5} and those for x2 are {14.4, 15.4, 16.2, ..., 23}. These split points are ranked according to how much they reduce impurity (heterogeneity of composition). The reduction in impurity is defined as overall impurity before the split minus the sum of the impurities for the two rectangles that result from a split.

There are a number of ways we could measure impurity. One popular measure of impurity is the Gini index. If we denote the classes by k, k = 1, 2, ..., C, where C is the total number of classes for the y variable, the Gini impurity index for a rectangle A is defined by

\[ I(A) = 1 - \sum_{k=1}^{C} p_k^2 \]

where $p_k$ is the fraction of observations in rectangle A that belong to class k. Notice that I(A) = 0 if all the observations belong to a single class, and I(A) is maximized when all classes appear in equal proportions in rectangle A. Its maximum value is (C − 1)/C.

XLMiner uses the delta splitting rule, which is a modified version of the twoing splitting rule. The twoing splitting rule coincides with the Gini splitting rule when the number of classes = 2; see Classification and Regression Trees by Leo Breiman (1984, Sec. 11.8).

The next split is on the Income variable, x1, at the value 84.75. Figure 3 shows that once again the tree procedure has astutely chosen to split a rectangle to increase the purity of the resulting rectangles. The left lower rectangle, which contains data points with x1 ≤ 84.75 and x2 ≤ 19, has all points that are non-owners (with one exception), while the right lower rectangle, which contains data points with x1 > 84.75 and x2 ≤ 19, consists exclusively of owners.

Figure 3

The next split is shown below:

Figure 4

We can see how the recursive partitioning is refining the set of constituent rectangles to become purer as the algorithm proceeds.

Figure 5

Notice that in the final partition of Figure 5 each rectangle is pure: it contains data points from just one of the two classes.
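The Gini index and the list of candidate split points described above can be checked with a short sketch. The function names are hypothetical, and one detail is an assumption rather than something stated in the text: the impurities of the two resulting rectangles are weighted by their share of the observations before being subtracted from the impurity of the original rectangle, which is a common convention. The sketch evaluates the first split reported in the text (Lot Size at 19); it does not attempt to reproduce XLMiner's delta splitting rule.

```python
def gini(counts):
    """Gini impurity 1 - sum_k p_k^2 from a list of class counts in a rectangle."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_points(values):
    """Mid-points between consecutive distinct values of a variable."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

def impurity_reduction(parent, left, right):
    """Parent impurity minus the size-weighted impurities of the two parts
    (the weighting is an assumption of this sketch)."""
    n = sum(parent)
    weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted

# Class counts are [owners, non-owners].
print(gini([12, 12]))        # all 24 families before any split
print(gini([9, 3]))          # upper rectangle after the split at Lot Size = 19
print(gini([2, 0]))          # a pure rectangle
print(impurity_reduction([12, 12], [3, 9], [9, 3]))
print(split_points([14.0, 14.8, 16.0, 16.4]))   # the four smallest lot sizes
```

With two classes the impurity can be at most (C − 1)/C = 0.5, which is the value for the unsplit data; each half of the first split has impurity 0.375.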
The final stage of the recursive partitioning is shown in Figure 5. 82 7. The numbers inside the circle are the splitting values and the name of the variable chosen . Classification and Regression Trees The reason the method is called a classification tree algorithm is that each split can be depicted as a split of a node into two successor nodes. We have represented the nodes that have successors by circles. Figure 6 The tree representing the first three splits is shown in Figure 7 below. The first split is shown as a branching of the root node of a tree in Figure 6. Figure 7 The full tree is shown in Figure 8 below. and corresponds to one of the final rectangles into which the x-space is partitioned. The number below the leaf node is the class with the most votes in the rectangle. When the observation has dropped down all the way to a leaf we can predict a class for it by simply taking a ’vote’ of all the training data that belonged to the leaf when the tree was grown. It is useful to note that the type of trees grown by CART (called binary trees) have the property that the number of leaf nodes is exactly one more than the number of decision nodes. Such terminal nodes are called the leaves of the tree. The % value in a leaf node shows the percentage of the total number of training observations that belonged to that node. Each leaf node is depicted with a rectangle. rather than a circle. The numbers on the left fork at a decision node shows the number of points in the decision node that had values less than or equal to the splitting value while the number on the right fork shows the number that had a greater value.4 Pruning 83 for splitting at that node is shown below the node. 
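The Gini calculation for the first split of the riding-mower example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `gini` and `candidate_splits` are hypothetical, the data are the 24 observations of Table 1, and the code evaluates the split at Lot Size = 19 that the text reports rather than reproducing XLMiner's delta splitting rule.

```python
# Riding-mower data from Table 1, observations 1..24 in order.
INCOME = [60.0, 85.5, 64.8, 61.5, 87.0, 110.1, 108.0, 82.8, 69.0, 93.0, 51.0, 81.0,
          75.0, 52.8, 64.8, 43.2, 84.0, 49.2, 59.4, 66.0, 47.4, 33.0, 51.0, 63.0]
LOT = [18.4, 16.8, 21.6, 20.8, 23.6, 19.2, 17.6, 22.4, 20.0, 20.8, 22.0, 20.0,
       19.6, 20.8, 17.2, 20.4, 17.6, 17.6, 16.0, 18.4, 16.4, 18.8, 14.0, 14.8]
CLASS = [1] * 12 + [2] * 12  # 1 = owner, 2 = non-owner


def gini(labels):
    """Gini impurity I(A) = 1 - sum_k p_k^2 for the class labels in a rectangle."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def candidate_splits(values):
    """Mid-points between consecutive distinct values of one variable."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


# The first split reported in the text: Lot Size (x2) at 19.
upper = [c for lot, c in zip(LOT, CLASS) if lot > 19]
lower = [c for lot, c in zip(LOT, CLASS) if 19 >= lot]

print("impurity before split:", gini(CLASS))
print("upper rectangle:", len(upper), "points, impurity", gini(upper))
print("lower rectangle:", len(lower), "points, impurity", gini(lower))
print("first candidate split points for x2:", candidate_splits(LOT)[:3])
```

With two classes in equal proportions the impurity before the split equals the maximum value (C − 1)/C = 0.5, and each rectangle after the split, holding a 9-to-3 mix, is less impure.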
7.4 Pruning

The second key idea in the classification and regression tree procedure, that of using the validation data to prune back the tree that is grown from training data, was the real innovation. Previously, methods had been developed that were based on the idea of recursive partitioning but they had used rules to prevent the tree from growing excessively and overfitting the training data. For example, CHAID (Chi-Squared Automatic Interaction Detection) is a recursive partitioning method that predates classification and regression tree (CART) procedures by several years and is widely used in database marketing applications to this day. It uses a well-known statistical test (the chi-square test for independence) to assess whether splitting a node improves the purity by a statistically significant amount. If the test does not show a significant improvement the split is not carried out. By contrast, CART and CART-like procedures use validation data to prune back the tree that has been deliberately overgrown using the training data.

The idea behind pruning is to recognize that a very large tree is likely to be overfitting the training data. In our example, the last few splits resulted in rectangles with very few points (indeed four rectangles in the full tree have just one point). We can see intuitively that these last splits are likely to be simply capturing noise in the training set rather than reflecting patterns that would occur in future data such as the validation data. Pruning consists of successively selecting a decision node and re-designating it as a leaf node (lopping off the branches extending beyond that decision node (its "subtree") and thereby reducing the size of the tree). The pruning process trades off misclassification error in the validation data set against the number of decision nodes in the pruned tree to arrive at a tree that captures the patterns but not the noise in the training data. It uses a criterion called the "cost complexity" of a tree to generate a sequence of trees that are successively smaller to the point of having a tree with just the root node. (What is the classification rule for a tree with just one node?) We then pick as our best tree the one tree in the sequence that gives the smallest misclassification error in the validation data.

The cost complexity criterion that classification and regression procedures use is simply the misclassification error of a tree (based on the training data) plus a penalty factor for the size of the tree. The penalty factor is based on a parameter, let us call it α, that is the per node penalty. The cost complexity criterion for a tree is thus

$Err(T) + \alpha |L(T)|$

where Err(T) is the fraction of training data observations that are misclassified by tree T, L(T) is the number of leaves in tree T and α is the per node penalty cost: a number that we will vary upwards from zero. When α = 0 there is no penalty for having too many nodes in a tree and the best tree using the cost complexity criterion is the full-grown unpruned tree. When we increase α to a very large value the penalty cost component swamps the misclassification error component of the cost complexity criterion function and the best tree is simply the tree with the fewest leaves, namely the tree with simply one node. As we increase the value of α from zero, at some value we will first encounter a situation where for some tree T1 formed by cutting off the subtree at a decision node we just balance the extra cost of increased misclassification error (due to fewer leaves) against the penalty cost saved from having fewer leaves. We prune the full tree at this decision node by cutting off its subtree and redesignating this decision node as a leaf node. Let's call this tree T1. We now repeat the logic that we had applied previously to the full tree, with the new tree T1, by further increasing the value of α. Continuing in this manner we generate a succession of trees with diminishing number of nodes all the way to the trivial tree consisting of just one node.

From this sequence of trees it seems natural to pick the one that gave the minimum misclassification error on the validation data set. We call this the Minimum Error Tree.

Let us use the Boston Housing data to illustrate. Shown below is the output that XLMiner generates when it is using the training data in the tree-growing phase of the algorithm. (Note: There are both 2-class and 3-class versions of this problem. The 3-class has median house value splits at $15,000/- and $30,000/-. The 2-class has a split only at $30,000/-.)

Training Log

Growing the Tree
# Nodes   Error
0         38.16
1         15.64
2         5.75
3         3.25
4         2.94
5         1.86
6         1.42
7         1.26
8         1.2
9         0.63
10        0.59
11        0.49
12        0.42
13        0.35
14        0.34
15        0.32
16        0.25
17        0.22
18        0.21
19        0.15
20        0.09
21        0.09
22        0.09
23        0.08
24        0.05
25        0.03
26        0.03
27        0.02
28        0.01
29        0
30        0

Training Misclassification Summary

Classification Confusion Matrix
                 Predicted Class
Actual Class     1      2      3
1                59     0      0
2                0      194    0
3                0      0      51

Error Report
Class     # Cases   # Errors   % Error
1         59        0          0.00
2         194       0          0.00
3         51        0          0.00
Overall   304       0          0.00

(These are cases in the training data)

The top table logs the tree-growing phase by showing in each row the number of decision nodes in the tree at each stage and the corresponding (percentage) misclassification error for the training data applying the voting rule at the leaves. We see that the error steadily decreases as the number of decision nodes increases from zero (where the tree consists of just the root node) to thirty. The error drops steeply in the beginning, going from 36% to 3% with just an increase of decision nodes from 0 to 3. Thereafter the improvement is slower as we increase the size of the tree. Finally we stop at a full tree of 30 decision nodes (equivalently, 31 leaves) with no error in the training data, as is also shown in the confusion table and the error report by class.

The output generated by XLMiner during the pruning phase is shown below.

[Pruning log: for each tree in the pruning sequence, from 30 decision nodes down to 1, XLMiner lists the Training Error and the Validation Error. The tree with 10 decision nodes is flagged "Minimum Error Prune" and the tree with 5 decision nodes is flagged "Best Prune". The reported Std. Err. is 0.02501957.]

Validation Misclassification Summary

Classification Confusion Matrix
                 Predicted Class
Actual Class     1      2      3
1                25     10     0
2                5      120    9
3                0      8      25

Error Report
Class     # Cases   # Errors   % Error
1         35        10         28.57
2         134       14         10.45
3         33        8          24.24
Overall   202       32         15.84

Notice now that as the number of decision nodes decreases, the error in the validation data has a slow decreasing trend (with some fluctuation) up to a 14.85% error rate for the tree with 10 nodes. Thereafter, the error increases, going up sharply when the tree is quite small. This is more readily visible from the graph below. The Minimum Error Tree is selected to be the one with 10 decision nodes (why not the one with 13 decision nodes?).

7.5 Minimum Error Tree

This Minimum Error Tree is shown in Figure 9.

Figure 9

7.6 Best Pruned Tree

You will notice that the XLMiner output from the pruning phase highlights another tree besides the Minimum Error Tree. This is the Best Pruned Tree, the tree with 5 decision nodes. The reason this tree is important is that it is the smallest tree in the pruning sequence that has an error that is within one standard error of the Minimum Error Tree. The estimate of error that we get from the validation data is just an estimate. If we'd had another set of validation data, the minimum error would have been different. The minimum error rate we have computed can be viewed as an observed value of a random variable with standard error (estimated standard deviation) equal to

$\sqrt{E_{min}(1 - E_{min})/N_{val}}$

where $E_{min}$ is the error rate (as a fraction) for the minimum error tree and $N_{val}$ is the number of observations in the validation data set. For our example $E_{min}$ = 0.1485 and $N_{val}$ = 202, so that the standard error is 0.025. The Best Pruned Tree is shown in Figure 10.

Figure 10

We show the confusion table and summary of classification errors for the Best Pruned Tree below.
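The one-standard-error rule behind the Best Pruned Tree can be sketched as follows. The function names `standard_error` and `best_pruned` are hypothetical, and in the example list only the 14.85% error of the 10-node tree comes from the text; the other validation errors are made-up values chosen to illustrate the selection, not XLMiner output.

```python
import math


def standard_error(e_min, n_val):
    """Standard error of the minimum validation error rate (given as a fraction)."""
    return math.sqrt(e_min * (1.0 - e_min) / n_val)


def best_pruned(trees, n_val):
    """Smallest tree whose validation error is within one standard error of the minimum.

    trees: list of (number_of_decision_nodes, validation_error_fraction) pairs.
    """
    e_min = min(err for _, err in trees)
    limit = e_min + standard_error(e_min, n_val)
    return min(nodes for nodes, err in trees if limit >= err)


# Values from the text: E_min = 0.1485 with N_val = 202 validation observations.
se = standard_error(0.1485, 202)
print("standard error:", round(se, 3))

# Illustrative pruning sequence (hypothetical except for the 10-node entry).
example = [(30, 0.158), (10, 0.1485), (5, 0.165), (2, 0.30)]
print("best pruned tree has", best_pruned(example, 202), "decision nodes")
```

The rule deliberately prefers a smaller tree whenever its validation error cannot be distinguished from the minimum, given the sampling variability of the validation estimate.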
7.7 Classification Rules from Trees

One of the reasons tree classifiers are very popular is that they provide easily understandable classification rules (at least if the trees are not too large). Each leaf is equivalent to a classification rule. For example, the upper left leaf in the Best Pruned Tree, above, gives us the rule:

IF (LSTAT ≤ 15.145) AND (ROOM ≤ 6.5545) THEN CLASS = 2.

Compared to the output of other classifiers such as discriminant functions, such rules are easily explained to managers and operating staff. Their logic is certainly far more transparent than that of weights in neural networks!

The tree method is a good off-the-shelf classifier and predictor. We said at the beginning of this chapter that classification trees require relatively little effort from developers. Let us give our reasons for this statement. Trees need no tuning parameters. There is no need for transformation of variables (any monotone transformation of the variables will give the same trees). Variable subset selection is automatic since it is part of the split selection; in our example notice that the Best Pruned Tree has automatically selected just three variables (LSTAT, RM and CRIM) out of the set of thirteen variables available. Trees are also intrinsically robust to outliers, since the choice of a split depends on the ordering of observation values and not on the absolute magnitudes of these values. Finally, trees handle missing data without having to impute values or delete observations with missing values. The method can also be extended to incorporate an importance ranking for the variables in terms of their impact on quality of the classification.

7.8 Regression Trees

Regression trees for prediction operate in much the same fashion as classification trees. The output variable is a continuous variable in this case, but both the principle and the procedure are the same - many splits are attempted and, for each, we measure "impurity" in each branch of the resulting tree (e.g. squared residuals). The tree procedure then selects the split that minimizes the sum of such measures.

Notes:

1. We have not described how categorical independent variables are handled in CART. In principle there is no difficulty. The split choices for a categorical variable are all ways in which the set of categorical values can be divided into two subsets. For example a categorical variable with 4 categories, say {1,2,3,4}, can be split in 7 ways into two subsets: {1} and {2,3,4}; {2} and {1,3,4}; {3} and {1,2,4}; {4} and {1,2,3}; {1,2} and {3,4}; {1,3} and {2,4}; {1,4} and {2,3}. When the number of categories is large the number of splits becomes very large. XLMiner supports only binary categorical variables (coded as numbers). If you have a categorical independent variable that takes more than two values, you will need to replace the variable with several dummy variables each of which is binary, in a manner that is identical to the use of dummy variables in regression.

2. Besides CHAID, another popular tree classification method is ID3 (and its successor C4.5). This method was developed by Quinlan, a leading researcher in machine learning, and is popular with developers of classifiers who come from a background in machine learning.

Chapter 8

Discriminant Analysis

Introduction

Discriminant analysis uses continuous variable measurements on different groups of items to highlight aspects that distinguish the groups and to use these measurements to classify new items. Common uses of the method have been in classifying organisms into species and sub-species, classifying applications for loans, credit cards and insurance into low risk and high risk categories, classifying customers of new products into early adopters, early majority, late majority and laggards, classifying bonds into bond rating categories, classifying skulls of human fossils, as well as in research studies involving disputed authorship, decision on college admission, medical studies involving alcoholics and non-alcoholics, and methods to identify human fingerprints. It will be easier to understand the discriminant analysis method if we first consider an example.

8.1 Example 1 - Riding Mowers

A riding-mower manufacturer would like to find a way of classifying families in a city into those likely to purchase a riding mower and those not likely to buy one. A pilot random sample of 12 owners and 12 non-owners in the city is undertaken. The data are shown in Table 1 and plotted in Figure 1 below:
Table 1

Observation   Income ($ 000's)   Lot Size (000's sq. ft.)   Owners=1, Non-owners=2
1             60                 18.4                       1
2             85.5               16.8                       1
3             64.8               21.6                       1
4             61.5               20.8                       1
5             87                 23.6                       1
6             110.1              19.2                       1
7             108                17.6                       1
8             82.8               22.4                       1
9             69                 20                         1
10            93                 20.8                       1
11            51                 22                         1
12            81                 20                         1
13            75                 19.6                       2
14            52.8               20.8                       2
15            64.8               17.2                       2
16            43.2               20.4                       2
17            84                 17.6                       2
18            49.2               17.6                       2
19            59.4               16                         2
20            66                 18.4                       2
21            47.4               16.4                       2
22            33                 18.8                       2
23            51                 14                         2
24            63                 14.8                       2

Figure 1: Riding mower owners and non-owners, separated by a line placed by hand

We can think of a linear classification rule as a line that separates the x1, x2 region into two parts where most of the owners are in one half-plane and most of the non-owners are in the complementary half-plane. A good classification rule would separate out the data so that the fewest points are misclassified: the line shown in Figure 1 seems to do a good job in discriminating between the two groups as it makes 4 misclassifications out of 24 points. Can we do better?

8.2 Fisher's Linear Classification Functions

Linear classification functions that were suggested by the noted statistician R. A. Fisher can help us do better. We'll learn more about these functions below. First, let's review the output of XLMiner's Discriminant Analysis routine. Figure 2 shows the results of invoking the discriminant routine where we select the option for displaying the classification functions.

Figure 2: Discriminant Analysis output

We note that it is possible to have a misclassification rate (3 in 24) that is lower than that of our initial line (4 in 24) by using the classification functions specified in the output. Here's how these classification functions work: A family is classified into Class 1 of owners if Function 1 is higher than Function 2, and into Class 2 if the reverse is the case. These functions are specified in a way that can be easily generalized to more than two classes. The values given for the functions are simply the weights to be associated with each variable in the linear function in a manner analogous to multiple linear regression. Let us compute these functions for the observations in our data set. The results are shown in Figure 3.

Figure 3: XLMiner output, discriminant analysis classification for Mower data

Notice that observations 1, 13 and 17 are misclassified (summarized in the confusion matrix and error report in Figure 1). Let us describe the reasoning behind Fisher's linear classification rules. Figure 4 depicts the logic.

Figure 4

In the X1, X2 space, consider various directions such as directions D1 and D2 shown in Figure 4. One way to identify a good linear discriminant function is to choose amongst all possible directions the one that has the property that when we project (drop a perpendicular line from) the means of the two groups (owners and non-owners) onto a line in the chosen direction the projections of the group means (feet of the perpendiculars, e.g. P1 and P2 in direction D1) are separated by the maximum possible distance. The means of the two groups are:

                      Income   Lot Size
Mean1 (owners)        79.5     20.3
Mean2 (non-owners)    57.4     17.6

8.3 Measuring Distance

We still need to decide how to measure the distance. We could simply use Euclidean distance. This has two drawbacks. First, the distance would depend on the units we choose to measure the variables. We will get different answers if we decided to measure lot size in, say, square yards instead of thousands of square feet. Second, we would not be taking any account of the correlation structure. This is often a very important consideration, especially when we are using many variables to separate groups. In this case often there will be variables which, by themselves, are useful discriminators between groups but, in the presence of other variables, are practically redundant as they capture the same effects as the other variables.

Fisher's method gets over these objections by using a measure of distance that is a generalization of Euclidean distance known as Mahalanobis distance (see appendix to this chapter for details).

If we had a list of prospective customers with data on income and lot size, we could use the classification functions to identify the sublist of families that are classified as group 1: predicted purchasers of the product.

8.4 Classification Error

What is the accuracy we should expect from our classification functions? We have an error rate of 12.5% in our example. However, this is a biased estimate - it is overly optimistic. This is because we have used the same data for fitting the classification parameters and for estimating the error. Common sense tells us that if we used these same classification functions with a fresh data set, we would get a higher error rate. In data mining applications we would randomly partition our data into training and validation subsets. We would use the training part to estimate the classification functions and hold out the validation part to get a more reliable, unbiased estimate of classification error.

So far we have assumed that our objective is to minimize the classification error. The method as presented to this point also assumes that the chances of encountering an item from either group requiring classification are the same. If the probability of encountering an item for classification in the future is not equal for both groups we should modify our functions to reduce our expected (long run average) error rate. Also, we may not want to minimize the misclassification rate in certain situations. If the cost of mistakenly classifying a group 1 item as group 2 is very different from the cost of classifying a group 2 item as a group 1 item we may want to minimize the expected cost of misclassification rather than the simple error rate (which does not take cognizance of unequal misclassification costs). It is simple to incorporate these situations into our framework. All we need to provide are estimates of the ratio of the chances of encountering an item in class 1 as compared to class 2 in future classification and the ratio of the costs of making the two kinds of classification error. These ratios will alter the constant terms in the linear classification functions to minimize the expected cost of misclassification. XLMiner has choices in its discriminant analysis dialog box to specify the first of these ratios.

The above analysis for two classes is readily extended to more than two classes. Example 2 illustrates this setting.

8.5 Example 2 - Classification of Flowers

This is a classic example used by R. A. Fisher to illustrate his method for computing classification functions. The data consist of four length measurements on different varieties of iris flowers. Fifty different flowers were measured for each species of iris.
A sample of the data is given in Table 3 below:

[Table 3: sample of the Iris data, showing observations 1 to 10 (Iris-setosa, class code 1), 51 to 60 (Iris-versicolor, class code 2) and 101 to 110 (Iris-virginica, class code 3). Columns: OBS #, SPECIES, CLASS CODE, SEPLEN, SEPW, PETLEN, PETW.]

The full data set is available as the XLMiner data set Iris.xls.

The results from applying the discriminant analysis procedure of XLMiner are shown in Figure 5. Again, XLMiner refers by default to the training data when no partitions have been specified.

Figure 5: XLMiner output for discriminant analysis for Iris data

For illustration the computations of the classification function values for observations 40 to 55 and 125 to 135 are shown in Table 4. For observation # 40, the classification function for class 1 had a value of 85.85. The functions for classes 2 and 3 were 40.15 and −5.44. The maximum is 85.85, therefore observation # 40 is classified as class 1 (setosa).

[Table 4: classification function values for observations 40 to 50 (setosa, class code 1), 51 to 55 (versicolor, class code 2) and 125 to 135 (virginica, class code 3). Columns: OBS #, IRIS Species, Class Code, SEP LEN, SEP W, PET LEN, PET W, Fn 1, Fn 2, Fn 3, Max, Pred. Class. Every listed observation is assigned to its own class except observation 134, which is assigned to class 2.]

When the classification variables follow a multivariate Normal distribution with variance matrices that differ substantially among different groups, the linear classification rule is no longer optimal. In that case, the optimal classification function is quadratic in the classification variables. However, in practice this has not been found to be useful except when the difference in the variance matrices is large and the number of observations available for training and testing is large. The reason is that the quadratic model requires many more parameters that are all subject to error to be estimated. If there are c classes and p variables, the number of parameters to be estimated for the different variance matrices is $cp(p+1)/2$. This is an example of the importance of regularization in practice.

8.6 Appendix - Mahalanobis Distance

Mahalanobis distance is defined with respect to a positive definite matrix Σ. The squared Mahalanobis distance between two p-dimensional (column) vectors $y_1$ and $y_2$ is

$(y_1 - y_2)^{T} \Sigma^{-1} (y_1 - y_2)$

where Σ is a symmetric positive definite square matrix with dimension p. Notice that if Σ is the identity matrix the Mahalanobis distance is the same as Euclidean distance. In linear discriminant analysis we use the pooled sample variance matrix of the different groups. If $X_1$ and $X_2$ are the $n_1 \times p$ and $n_2 \times p$ matrices of observations for groups 1 and 2, and the respective sample variance matrices are $S_1$ and $S_2$, the pooled matrix S is equal to

$S = \{(n_1 - 1)S_1 + (n_2 - 1)S_2\}/(n_1 + n_2 - 2)$.

The matrix S defines the optimum direction (actually the eigenvector associated with its largest eigenvalue) that we referred to when we discussed the logic behind Figure 4. This choice of Mahalanobis distance can also be shown to be optimal (see note 1) in the sense of minimizing the expected misclassification error when the variable values of the populations in the two groups (from which we have drawn our samples) follow a multivariate normal distribution with a common covariance matrix. If we have large samples approximate normality is generally adequate for this procedure to be close to optimal.

Note 1: This is true asymptotically, i.e. for large training samples. Large training samples are required for S, the pooled sample variance matrix, to be a good approximation for the population variance matrix.
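The pooled variance matrix and the squared Mahalanobis distance defined in the appendix can be sketched for the two-variable riding-mower data from Example 1. This is a minimal pure-Python illustration, not XLMiner's routine: the helper names (`mean`, `sample_cov`, `pooled_cov`, `inverse_2x2`, `sq_mahalanobis`) are hypothetical, and the 2x2 inverse is written out by hand so that no matrix library is assumed.

```python
# Riding-mower data from Example 1: (income in $000's, lot size in 000's sq. ft.).
OWNERS = [(60.0, 18.4), (85.5, 16.8), (64.8, 21.6), (61.5, 20.8), (87.0, 23.6),
          (110.1, 19.2), (108.0, 17.6), (82.8, 22.4), (69.0, 20.0), (93.0, 20.8),
          (51.0, 22.0), (81.0, 20.0)]
NON_OWNERS = [(75.0, 19.6), (52.8, 20.8), (64.8, 17.2), (43.2, 20.4), (84.0, 17.6),
              (49.2, 17.6), (59.4, 16.0), (66.0, 18.4), (47.4, 16.4), (33.0, 18.8),
              (51.0, 14.0), (63.0, 14.8)]


def mean(rows):
    """Column means of a list of (x1, x2) pairs."""
    n = len(rows)
    return tuple(sum(r[j] for r in rows) / n for j in range(2))


def sample_cov(rows):
    """2x2 sample variance matrix with divisor n - 1."""
    m, n = mean(rows), len(rows)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1)
             for j in range(2)] for i in range(2)]


def pooled_cov(g1, g2):
    """S = {(n1 - 1) S1 + (n2 - 1) S2} / (n1 + n2 - 2)."""
    s1, s2, n1, n2 = sample_cov(g1), sample_cov(g2), len(g1), len(g2)
    return [[((n1 - 1) * s1[i][j] + (n2 - 1) * s2[i][j]) / (n1 + n2 - 2)
             for j in range(2)] for i in range(2)]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def sq_mahalanobis(y1, y2, sigma):
    """Squared Mahalanobis distance (y1 - y2)' Sigma^-1 (y1 - y2)."""
    d = (y1[0] - y2[0], y1[1] - y2[1])
    inv = inverse_2x2(sigma)
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))


S = pooled_cov(OWNERS, NON_OWNERS)
m1, m2 = mean(OWNERS), mean(NON_OWNERS)
print("group means:", m1, m2)
print("squared Mahalanobis distance between group means:", sq_mahalanobis(m1, m2, S))
```

Because the distance is scaled by the inverse of S, changing the units of lot size (say, to square yards) rescales S in step with the data and leaves the distance unchanged, which is the first drawback of Euclidean distance that the chapter raises.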
The idea in k-Nearest Neighbor methods is to dynamically identify k observations in the training data set that are similar to a new observation.Chapter 9 Other Supervised Learning Techniques 9. (We will examine other ways to define distance between points in the space of predictor variables when we discuss clustering methods). Xp ) and (u1 . We have training data in which each observation has a y value which is just the class to which the observation belongs. u2 . First. we look for observations in our training data that are similar or “near” to the observation to be classified. X2 . This is a non-parametric method because it does not involve estimation of parameters in an assumed function form such as the linear form that we encountered in linear regression. Specifically. · · · . The Euclidean distance between the points (X1 . based on the values of the independent variables. · · · . Then. say (u1 . For the moment we will continue ourselves to the most popular measure of distance Euclidean distance.1 K-Nearest neighbor The idea behind the k-Nearest Neighbor algorithm is to build a classification (and prediction) method using no assumptions about the form of the function. y. xp . xp ) relating the dependent variable. · · · . if we have two classes. It is possible to prove that the misclassification error of the 1-NN scheme has a misclassification probability that is no worse than twice that of the situation where we know exactly the probability density functions for 105 . to the independent variables x1 . up ) is (x1 − u1 )2 ) + (x2 − u2 )2 + · · · + (xp − up )2 . we assign a class to the observation we want to classify. · · · . Then. we need a role to assign a class to the observation to be classified. x2 . It is a remarkable fact that this simple. intuitive idea of using a single nearest neighbor to classify observations can be very powerful when we have a large number of observations in our training set. based on the classes of those proximate observations. 
· · · . up ) that we wish to classify. y is a binary variable. based on the classes of the neighbors. x2 . For example. The simplest case is k = 1 where we find the observation that is closest (the nearest neighbor) and set v = y where y is the class of this single nearest neighbor. We then use similar (neighboring) observations to classify the observation into a class. u2 . we would be able to reduce the misclassification error at best to half that of the simple 1-NN rule. ft.1. irrespective of the values of (u1 .2 17.2 Example 1 . Non-owners=2 1 1 1 1 1 1 1 1 1 1 . we will simply assign all observations to the same class as the class that has the majority in the training data. In other words if we have a large amount of data and used an arbitrarily sophisticated classification rule. 9.5 87.8 Owners=1. Notice that if k = n. up ). In typical applications k is in units or tens rather than in hundreds or thousands.Riding Mowers A riding-mower manufacturer would like to find a way of classifying families into those likely to purchase a riding mower and those not likely to buy one. the number of observations in the training data set.6 20.1 The K-NN Procedure For K-NN we extend the idea of 1-NN as follows.4 16.8 21.) 18. Other Supervised Learning Techniques each class.8 61. The advantage is that higher values of k provide smoothing that reduces the risk of overfitting due to noise in the training data. A pilot random sample of 12 owners and 12 non-owners is undertaken.8 23.6 22.4 20.1.0 110.0 20. Find the nearest k neighbors and then use a majority decision rule to classify a new observation.0 82.0 Lot Size (000’s sq.5 64.1 108.8 69. 9. · · · .0 93.0 85. The data are shown in Table I and Figure 1 below: Table 1 Observation 1 2 3 4 5 6 7 8 9 10 Income ($ 000’s) 60.106 9.6 19. This is clearly a case of oversmoothing unless there is no information at all in the independent variables about the dependent variable. 
Table 1 (continued)

Observation   Income ($ 000's)   Lot Size (000's sq. ft.)   Owners=1, Non-owners=2
11             51.0               22.0                       1
12             81.0               20.0                       1
13             75.0               19.6                       2
14             52.8               20.8                       2
15             64.8               17.2                       2
16             43.2               20.4                       2
17             84.0               17.6                       2
18             49.2               17.6                       2
19             59.4               16.0                       2
20             66.0               18.4                       2
21             47.4               16.4                       2
22             33.0               18.8                       2
23             51.0               14.0                       2
24             63.0               14.8                       2

How do we choose k? In data mining we use the training data to classify the cases in the validation data, then compute error rates for various choices of k. For our example we have randomly divided the data into a training set with 18 cases and a validation set of 6 cases. Of course, in a real data mining situation we would have sets of much larger sizes. The validation set consists of observations 6, 7, 12, 14, 19, 20 of Table 1. The remaining 18 observations constitute the training data. Figure 1 displays the observations in both training and validation data sets.

Notice that if we choose k = 1 we will classify in a way that is very sensitive to the local characteristics of our data. On the other hand, if we choose a large value of k we average over a large number of data points and average out the variability due to the noise associated with individual data points. If we choose k = 18 we would simply predict the most frequent class in the data set in all cases. This is a very stable prediction but it completely ignores the information in the independent variables.

[Figure 1: plot of the observations. Legend: Owners (Training), Non-owners (Training), Owners (Validation), Non-owners (Validation)]

Table 2 shows the misclassification error rate for observations in the validation data for different choices of k.

Table 2: Classification Error in the Riding Mower Data

k     Misclassification Error %
 1    33
 3    33
 5    33
 7    33
 9    33
11    17
13    17
15    17
18    50

We would choose k = 11 (or possibly 13) in this case. This choice optimally trades off the variability associated with a low value of k against the oversmoothing associated with a high value of k.
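The procedure described above (find the k nearest training observations by Euclidean distance, take a majority vote, and compare validation error rates across values of k) can be sketched in a few lines. This is only a minimal illustration, not the XLMiner implementation: the function names `knn_classify` and `validation_error` and the toy data are hypothetical, and ties in the vote are broken arbitrarily.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_classify(train, u, k):
    """Classify point u by majority vote among its k nearest training rows.

    train is a list of (predictor_tuple, class_label) pairs.
    """
    neighbors = sorted(train, key=lambda row: dist(row[0], u))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def validation_error(train, valid, k):
    """Fraction of validation rows misclassified by k-NN."""
    wrong = sum(knn_classify(train, x, k) != y for x, y in valid)
    return wrong / len(valid)


# Hypothetical toy data (not the riding-mower data): two predictors, classes 1 and 2.
train = [((1.0, 1.0), 1), ((1.2, 0.9), 1), ((0.8, 1.1), 1),
         ((3.0, 3.0), 2), ((3.1, 2.8), 2)]
valid = [((1.1, 1.0), 1), ((2.9, 3.1), 2)]

for k in (1, 3, 5):
    print(k, validation_error(train, valid, k))
```

With k equal to the size of the toy training set (k = 5), every point is assigned to the majority class, which is the oversmoothing case discussed in the text.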
It is worth remarking that a useful way to think of k is through the concept of "effective number of parameters". The effective number of parameters corresponding to k is n/k, where n is the number of observations in the training data set. Thus a choice of k = 11 has an effective number of parameters of about 2 (18/11 ≈ 1.6) and is roughly similar in the extent of smoothing to a linear regression fit with two coefficients.

9.1.3 K-Nearest Neighbor Prediction

The idea of K-NN can be readily extended to predicting a continuous value (as is our aim with multiple linear regression models). Instead of taking a majority vote of the neighbors to determine class, we use as our predicted value the average value of the dependent variable for the k nearest neighbors. Often this average is a weighted average with the weight decreasing with increasing distance from the point at which the prediction is required.

9.1.4 Shortcomings of k-NN algorithms

There are two difficulties with the practical exploitation of the power of the k-NN approach. First, while there is no time required to estimate parameters from the training data (as would be the case for parametric models such as regression), the time to find the nearest neighbors in a large training set can be prohibitive. A number of ideas have been implemented to overcome this difficulty. The main ideas are:

(1) Reduce the time taken to compute distances by working in a reduced dimension using dimension reduction techniques such as principal components.

(2) Use sophisticated data structures such as search trees to speed up identification of the nearest neighbor. This approach often settles for an "almost nearest" neighbor to improve speed.

(3) Edit the training data to remove redundant or "almost redundant" points in the training set to speed up the search for the nearest neighbor.
An example is to remove observations in the training data set that have no effect on the classification because they are surrounded by observations that all belong to the same class.

Second, the number of observations required in the training data set to qualify as large increases exponentially with the number of dimensions p. This is because the expected distance to the nearest neighbor goes up dramatically with p unless the size of the training data set increases exponentially with p. An illustration of this phenomenon, known as "the curse of dimensionality", is the fact that if the independent variables in the training data are distributed uniformly in a hypercube of dimension p, the probability that a point is within a distance of 0.5 units from the center is

$$\frac{\pi^{p/2}}{2^{p-1}\, p\, \Gamma(p/2)}.$$

The table below is designed to show how rapidly this drops to near zero for different combinations of p and n, the size of the training data set. It shows the expected number of points within 0.5 units of the center of the hypercube.

Table 3: Number of points within 0.5 units of hypercube center.

                                    n
 p     10,000          100,000         1,000,000       10,000,000
 2     7854            78540           785398          7853982
 3     5236            52360           523600          5236000
 4     3084            30843           308425          3084251
 5     1645            16449           164493          1644934
10     25              249             2490            24904
20     0.0002          0.0025          0.0246          0.2461
30     2 × 10^−10      2 × 10^−9       2 × 10^−8       2 × 10^−7
40     3 × 10^−17      3 × 10^−16      3 × 10^−15      3 × 10^−14

The curse of dimensionality is a fundamental issue pertinent to all classification, prediction and clustering techniques. This is why we often seek to reduce the dimensionality of the space of predictor variables through methods such as selecting subsets of the predictor variables for our model or by combining them using methods such as principal components, singular value decomposition and factor analysis. In the artificial intelligence literature dimension reduction is often referred to as factor selection or feature extraction.

9.2 Naive Bayes

9.2.1 Bayes Theorem

In probability theory, Bayes Theorem provides the probability of a prior event, given that a certain subsequent event has occurred. For example, the probability that a subject is HIV-positive given that he or she tested positive on a screening test for HIV.[1]

[1] Here is a brief review of Bayes Theorem: Consider, hypothetically, that 1% of the population is HIV positive (class 1 or $C_1$), and 99% HIV negative (class 0 or $C_0$). That is, in the absence of any other information, the probability that an individual is HIV positive is $P(C_1) = 0.01$. Consider also that an HIV positive person has a 98% chance of testing positive ($T_{pos}$) on a screening test: $P(T_{pos}|C_1) = 0.98$ ("the probability of $T_{pos}$ given $C_1$ is 0.98"), and that an HIV negative person has a 5% chance of triggering a false positive on the test: $P(T_{pos}|C_0) = 0.05$. What is the probability that a person is HIV positive, if they test positive on the test? There are two sources of positive test results: HIV negatives ($C_0$) triggering false positives, and HIV positives ($C_1$) with true positives. The proportion of HIV positives amongst the positive test results is the probability that a person is HIV positive, if they test positive on the test. In notation:

$$P(C_1|T_{pos}) = \frac{P(T_{pos}|C_1) \times P(C_1)}{P(T_{pos}|C_1) \times P(C_1) + P(T_{pos}|C_0) \times P(C_0)} = \frac{0.98 \times 0.01}{0.98 \times 0.01 + 0.05 \times 0.99} = 0.165$$

In this hypothetical example, the false positives swamp the true positives, yielding this surprisingly low probability.

In the context of classification, Bayes theorem provides a formula for updating the probability that a given object belongs to a class, given the object's attributes. Suppose that we have m classes, $C_1, C_2, \ldots, C_m$, and we know that the proportions of objects in these classes are $P(C_1), P(C_2), \ldots, P(C_m)$. We have an object O with n attributes with values $X_1, X_2, \ldots, X_n$. We do not know the class of the object and would like to classify it on the basis of these attribute values. If we know the probability of occurrence of the attribute values $X_1, X_2, \ldots, X_n$ for each class, Bayes theorem gives us the following formula to compute the probability that the object belongs to class $C_i$:

$$P(C_i|X_1, X_2, \ldots, X_n) = \frac{P(X_1, X_2, \ldots, X_n|C_i)\,P(C_i)}{P(X_1, X_2, \ldots, X_n|C_1)\,P(C_1) + \cdots + P(X_1, X_2, \ldots, X_n|C_m)\,P(C_m)}$$

This is known as the posterior probability to distinguish it from $P(C_i)$, the probability of an object belonging to class $C_i$ in the absence of any information about its attributes. For purposes of classification we only need to know which class $C_i$ has the highest probability. Notice that since the denominator is the same for all classes we do not need to calculate it for purposes of classification.

9.2.2 The Problem with Bayes Theorem

The difficulty with using this formula is that if the number of variables, n, is even modestly large, say, 20, and the number of classes, m, is 2, even if all variables are binary, we would need a large data set with several million observations to get reasonable estimates for $P(X_1, X_2, \ldots, X_n|C_i)$, the probability of observing an object with the attribute vector $(X_1, X_2, \ldots, X_n)$. In fact the vector may not be present in our training set for all classes as required by the formula; it may even be missing in our entire data set! For example, in predicting voting, even a sizeable data set may not contain many individuals who are male hispanics with high income from the midwest who voted in the last election, did not vote in the prior election, have 4 children, are divorced, etc.

9.2.3 Simplify - assume independence

If it is reasonable to assume that the attributes are all mutually independent within each class, we can considerably simplify the expression and make it useful in practice. Independence of the attributes within each class gives us the following simplification, which follows from the product rule for probabilities of independent events (the probability of multiple events occurring is the product of their individual probabilities):

$$P(X_1, X_2, \ldots, X_n|C_i) = P(X_1|C_i)\,P(X_2|C_i)\,P(X_3|C_i) \cdots P(X_n|C_i)$$

The terms on the right can be estimated simply from frequency counts, with the estimate of $P(X_j|C_i)$ being equal to the number of occurrences of the value $X_j$ in the training data in class $C_i$ divided by the total number of observations in that class. We would like to have each possible value for each attribute to be available in the training data. If this is not true for a particular attribute value for a class, the estimated probability will be zero for the class for objects with that attribute value. Often this is reasonable so we can relax our requirement of having every possible value for every attribute being present in the training data. In any case the observations required will be far fewer than in the formula without making the independence assumption.

This is a very simplistic assumption since the attributes are very likely to be correlated. Surprisingly this "Naive Bayes" approach, as it is called, does work well in practice where there are many variables and they are binary or categorical with a few discrete levels.
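The frequency-count estimates described above can be sketched as follows. This is a minimal illustration under stated assumptions, not XLMiner's Naive Bayes classifier: the function names, the two categorical attributes and the six training rows are all hypothetical, and no smoothing is applied, so an attribute value unseen in a class gives that class an estimated probability of zero, as the text notes.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Collect class counts and per-class counts of each attribute value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(c, j)][value] += 1
    return class_counts, value_counts


def posterior(model, row):
    """Posterior P(Ci | X1..Xn) under within-class independence."""
    class_counts, value_counts = model
    n = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / n                                  # P(Ci)
        for j, value in enumerate(row):
            p *= value_counts[(c, j)][value] / n_c   # P(Xj | Ci) from counts
        scores[c] = p
    total = sum(scores.values())                     # common denominator
    return {c: (s / total if total else 0.0) for c, s in scores.items()}


# Hypothetical training data: (color, fabric) with class 1 = sale, 0 = no sale.
rows = [("red", "silk"), ("red", "cotton"), ("blue", "silk"),
        ("blue", "cotton"), ("red", "silk"), ("blue", "cotton")]
labels = [1, 1, 1, 0, 0, 0]

model = train_naive_bayes(rows, labels)
post = posterior(model, ("red", "silk"))
print(post)
```

For classification only the numerators need to be compared; the division by `total` is included here just to report probabilities that sum to one.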
9.2.4 Example 1 - Saris

Saris, the primary product of the hand craft textile industry in India, are colorful flowing garments worn by women and made with a single piece of cloth six yards long and one yard wide. A sari has many characteristics; important ones include the shade, color and design of the sari body, border and palav (the most prominent and decorative portion of the sari), the size of the border, type of fabric, and whether the sari is one- or two-sided. The object is to predict the classification of a sari (sale = 1 or no sale = 0) on the basis of these predictor variables. The sample data set has 400 cases, of which 242 (60.5%) are sales. An excerpt from the data follows:

[Excerpt from the sari data]

These data were partitioned into training and validation sets, each with 200 cases. XLMiner's Naive Bayes classifier was trained on the training set and scored to the validation data. An excerpt from the scores on the validation set is shown below:

[Excerpt from the validation scores]

"Row ID" is the row identifier from the main data set. "Predicted class" is the class predicted for this row by XLMiner, applying Bayes classifier to the predictor variable values for that row. As we discussed, it answers the question, "Given that SILKWTCat = 4, ZARIWTCat = 3, BODYCOL = 17, etc., what is the most likely class?" "Actual class" is the actual class for that row, and "Prob. for 1" is the calculated probability that this case will be a "1," based on the application of Bayes Rule. The rest of the columns are the values for the first few variables for these cases.

Chapter 10

Affinity Analysis - Association Rules

Put simply, affinity analysis is the study of "what goes with what." For example, a medical researcher might be interested in learning what symptoms go with what confirmed diagnoses. These methods are also called "market basket analysis," since they originated with the study of customer transactions databases in order to determine correlations between purchases.

10.1 Discovering Association Rules in Transaction Databases

The availability of detailed information on customer transactions has led to the development of techniques that automatically look for associations between items that are stored in the database. An example is data collected using bar-code scanners in supermarkets. Such "market basket" databases consist of a large number of transaction records. Each record lists all items bought by a customer on a single purchase transaction. Managers would be interested to know if certain groups of items are consistently purchased together. They could use this data for store layouts to place items optimally with respect to each other; they could use such information for cross-selling, for promotions, for catalog design and to identify customer segments based on buying patterns. Association rules provide information of this type in the form of "if-then" statements. These rules are computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature.

10.2 Support and Confidence

In addition to the antecedent (the "if" part) and the consequent (the "then" part), an association rule has two numbers that express the degree of uncertainty about the rule. In association analysis the antecedent and consequent are sets of items (called item sets) that are disjoint (do not have any items in common).

The first number is called the support for the rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. (The support is sometimes expressed as a percentage of the total number of records in the database.)

The other number is known as the confidence of the rule. Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent. For example, if a supermarket database has 100,000 point-of-sale transactions, out of which 2,000 include both items A and B and 800 of these include item C, the association rule "If A and B are purchased then C is purchased on the same trip" has a support of 800 transactions (alternatively 0.8% = 800/100,000) and a confidence of 40% (= 800/2,000).

One way to think of support is that it is the probability that a randomly selected transaction from the database will contain all items in the antecedent and the consequent, whereas the confidence is the conditional probability that a randomly selected transaction will include all the items in the consequent given that the transaction includes all the items in the antecedent.

Note: This concept is different from and unrelated to the ideas of confidence intervals and confidence levels used in statistical inference.
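Support and confidence can be computed directly by counting transactions. The sketch below is illustrative only: the function names and the five toy transactions are hypothetical, chosen so that the rule "if A and B then C" has a support count of 2 and a confidence of 2/3.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    return both / support_count(transactions, antecedent)


# Hypothetical transactions, each a set of purchased items.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"},
                {"B", "C"}, {"A", "B", "C"}]

print(support_count(transactions, {"A", "B", "C"}))   # support count of the rule
print(confidence(transactions, {"A", "B"}, {"C"}))    # confidence of "A, B => C"
```

The supermarket figures in the text follow the same arithmetic: a support of 800 out of 100,000 transactions is 0.8%, and 800 out of the 2,000 transactions containing A and B gives a confidence of 40%.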
10.3 Example 1 - Electronics Sales

The manager of the All Electronics retail store would like to know what items sell together. He has a database of transactions as shown below:

Transaction ID   Item Codes
1                1 2 5
2                2 4
3                2 3
4                1 2 4
5                1 3
6                2 3
7                1 3
8                1 2 3 5
9                1 2 3

There are 9 transactions. Each transaction is a record of the items bought together in that transaction. Transaction 1 is a point-of-sale purchase of items 1, 2, and 5. Transaction 2 is a joint purchase of items 2 and 4, etc.

Suppose that we want association rules between items for this database that have a support count of at least 2 (equivalent to a percentage support of 2/9 = 22%). By enumeration we can see that only the following item sets have a count of at least 2:

{1} with support count of 6;
{2} with support count of 7;
{3} with support count of 6;
{4} with support count of 2;
{5} with support count of 2;
{1, 2} with support count of 4;
{1, 3} with support count of 4;
{1, 5} with support count of 2;
{2, 3} with support count of 4;
{2, 4} with support count of 2;
{2, 5} with support count of 2;
{1, 2, 3} with support count of 2;
{1, 2, 5} with support count of 2.

Notice that once we have created a list of all item sets that have the required support, we can deduce the rules that meet the desired confidence ratio by examining all subsets of each item set in the list. Since any subset of a set must occur at least as frequently as the set, each subset will also be in the list. It is then straightforward to compute the confidence as the ratio of the support for the item set to the support for each subset of the item set. We retain the corresponding association rule only if it exceeds the desired cut-off value for confidence. For example, from the item set {1, 2, 5} we get the following association rules:

{1, 2} ⇒ {5} with confidence = support count of {1, 2, 5} divided by support count of {1, 2} = 2/4 = 50%;
{1, 5} ⇒ {2} with confidence = support count of {1, 2, 5} divided by support count of {1, 5} = 2/2 = 100%;
{2, 5} ⇒ {1} with confidence = support count of {1, 2, 5} divided by support count of {2, 5} = 2/2 = 100%;
{1} ⇒ {2, 5} with confidence = support count of {1, 2, 5} divided by support count of {1} = 2/6 = 33%;
{2} ⇒ {1, 5} with confidence = support count of {1, 2, 5} divided by support count of {2} = 2/7 = 29%;
{5} ⇒ {1, 2} with confidence = support count of {1, 2, 5} divided by support count of {5} = 2/2 = 100%.

If the desired confidence cut-off was 70%, we would report only the second, third, and last rules.

We can see from the above that the problem of generating all association rules that meet stipulated support and confidence requirements can be decomposed into two stages. First we find all item sets with the requisite support (these are called frequent or "large" item sets), and then we generate, from each item set so identified, association rules that meet the confidence requirement. For most association analysis data, the computational challenge is the first stage.

10.4 The Apriori Algorithm

Although several algorithms have been proposed for generating association rules, the classic algorithm is the Apriori algorithm of Agrawal and Srikant (1993). The key idea of the algorithm is to begin by generating frequent item sets with just one item (1-item sets) and to recursively generate frequent item sets with 2 items, then frequent 3-item sets and so on until we have generated frequent item sets of all sizes. Without loss of generality we will denote items by unique, consecutive (positive) integers and assume that the items in each item set are in increasing order of this item number. The example above illustrates this notation. When we refer to an item in a computation we actually mean this item number.

It is easy to generate frequent 1-item sets. All we need to do is to count, for each item, how many transactions in the database include the item. These transaction counts are the supports for the 1-item sets. We drop 1-item sets that have support below the desired cut-off value to create a list of the frequent 1-item sets.

The general procedure to obtain k-item sets from (k − 1)-item sets for k = 2, 3, · · ·, is as follows. Create a candidate list of k-item sets by performing a join operation on pairs of (k − 1)-item sets in the list. A pair is combined only if the first (k − 2) items are the same in both item sets. (When k = 2, this simply means that all possible pairs are to be combined.) If this condition is met, the join of the pair is a k-item set that contains the common first (k − 2) items and the two items that are not in common, one from each member of the pair. All frequent k-item sets must be in this candidate list since every subset of size (k − 1) of a frequent k-item set must be a frequent (k − 1)-item set. However, some k-item sets in the candidate list may not be frequent k-item sets. We need to delete these to create the list of frequent k-item sets.

To identify the k-item sets that are not frequent we examine all subsets of size (k − 1) of each candidate k-item set. Notice that we need examine only (k − 1)-item sets that contain the last two items of the candidate k-item set (Why?). If any one of these subsets of size (k − 1) is not present in the frequent (k − 1)-item set list, we know that the candidate k-item set cannot be a frequent item set. We delete such k-item sets from the candidate list. Proceeding in this manner with every item set in the candidate list, we prune the candidate list; the support of each surviving candidate is then counted in the database, and those that meet the cut-off form the list of frequent k-item sets. We repeat the procedure recursively by incrementing k. We stop only when the candidate list is empty.

A critical aspect for efficiency in this algorithm is the data structure of the candidate and frequent item set lists. Hash trees were used in the original version but there have been several proposals to improve on this structure. There are also other algorithms that can be faster than the Apriori algorithm in practice.
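The join, prune and count steps above, followed by rule generation from a frequent item set, can be sketched on the nine Electronics Sales transactions of Example 1. This is a simplified illustration and not the original algorithm's implementation: it uses plain lists and tuples rather than hash trees, it checks all (k − 1)-subsets when pruning, and the function names `apriori` and `rules` are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {item set (sorted tuple): support count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(set(itemset) <= t for t in transactions)

    items = sorted({i for t in transactions for i in t})
    level = [(i,) for i in items if count((i,)) >= min_support]  # frequent 1-item sets
    frequent = {}
    k = 2
    while level:
        for s in level:
            frequent[s] = count(s)
        prev = set(level)
        candidates = []
        # Join: combine pairs of (k-1)-item sets that share their first k-2 items.
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1]:
                c = a + (b[-1],)
                # Prune: every (k-1)-subset of a frequent set must itself be frequent.
                if all(sub in prev for sub in combinations(c, k - 1)):
                    candidates.append(c)
        # Count support of the surviving candidates in the database.
        level = [c for c in candidates if count(c) >= min_support]
        k += 1
    return frequent


def rules(frequent, itemset, min_conf):
    """Rules antecedent => consequent from one frequent item set."""
    out = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            conf = frequent[itemset] / frequent[antecedent]
            if conf >= min_conf:
                consequent = tuple(i for i in itemset if i not in antecedent)
                out.append((antecedent, consequent, conf))
    return out


# The nine transactions of Example 1 (Electronics Sales).
transactions = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
                {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]

frequent = apriori(transactions, min_support=2)
found = rules(frequent, (1, 2, 5), min_conf=0.7)
for antecedent, consequent, conf in found:
    print(antecedent, "=>", consequent, f"{conf:.0%}")
```

With a minimum support count of 2 the sketch recovers the 13 item sets listed in Example 1, and with a 70% confidence cut-off the item set {1, 2, 5} yields the three rules the text reports.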
What about "confidence" in the non-technical sense? How sure can we be that the rules we develop are meaningful? Considering the matter from a statistical perspective, we can ask "Are we finding associations that are really just chance occurrences?"

10.5 Example 2 - Randomly-generated Data

Let us examine the output from an application of this algorithm to a small randomly generated database of 50 records shown in Example 2.

[Table: the 50 transactions, Tr# 1 to 50, each with the list of items purchased]

Association Rules Output

Input Data: $A$5:$E$54
Min. Support: 2 (= 4%)
Min. Conf. %: 70

Rule #   Confidence %   Antecedent (a)      Consequent (c)   Support (a)   Support (c)   Support (a ∪ c)   Confidence % if pr(c|a) = pr(c)   Lift Ratio (conf/prev. col.)
1         80            2               ⇒   9                5             27            4                 54                                1.5
2        100            5, 7            ⇒   9                3             27            3                 54                                1.9
3        100            6, 7            ⇒   8                3             29            3                 58                                1.7
4        100            1, 5            ⇒   8                2             29            2                 58                                1.7
5        100            2, 7            ⇒   9                2             27            2                 54                                1.9
6        100            3, 8            ⇒   4                2             11            2                 22                                4.5
7        100            3, 4            ⇒   8                2             29            2                 58                                1.7
8        100            3, 7            ⇒   9                2             27            2                 54                                1.9
9        100            4, 7            ⇒   9                2             27            2                 54                                1.9

A high value of confidence suggests a strong association rule. However, this can be deceptive because if the antecedent and/or the consequent have a high support, we can have a high value for confidence even when they are independent! (If nearly all customers buy beer and nearly all customers buy ice cream, the confidence level will be high regardless of whether there is an association between the items.)

A better measure to judge the strength of an association rule is to compare the confidence of the rule with the benchmark value where we assume that the occurrence of the consequent item set in a transaction is independent of the occurrence of the antecedent for each rule. We can compute this benchmark from the frequency counts of the frequent item sets. The benchmark confidence value for a rule is the support for the consequent divided by the number of transactions in the database. This enables us to compute the lift ratio of a rule. The lift ratio is the confidence of the rule divided by the confidence assuming independence of consequent from antecedent. A lift ratio greater than 1.0 suggests that there is some usefulness to the rule. The larger the lift ratio, the greater is the strength of the association. (What does a ratio less than 1.00 mean? Can it be useful to know such rules?)

In our example the lift ratios highlight Rule 6 as most interesting in that it suggests purchase of item 4 is almost 5 times as likely when items 3 and 8 are purchased than if item 4 was not associated with the item set {3, 8}.
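The lift ratio defined above is a short calculation from three support counts and the number of transactions. The sketch below is illustrative; the function name `lift_ratio` is hypothetical, and the inputs are the counts reported for Rule 6 and Rule 1 in the output of Example 2 (50 transactions).

```python
def lift_ratio(support_a, support_c, support_ac, n_transactions):
    """Confidence of the rule divided by the benchmark confidence."""
    confidence = support_ac / support_a       # P(consequent | antecedent)
    benchmark = support_c / n_transactions    # P(consequent) if independent
    return confidence / benchmark


# Rule 6: support(a) = 2, support(c) = 11, support(a and c) = 2.
print(round(lift_ratio(2, 11, 2, 50), 1))
# Rule 1: support(a) = 5, support(c) = 27, support(a and c) = 4.
print(round(lift_ratio(5, 27, 4, 50), 1))
```

For Rule 6 the confidence is 100% against a benchmark of 22%, giving a ratio of about 4.5; for Rule 1 the confidence is 80% against 54%, giving about 1.5.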
10.6 Shortcomings

Association rules have not been as useful in practice as one would have hoped. One major shortcoming is that the support confidence framework often generates too many rules. Another is that often most of them are obvious. Insights such as the celebrated "on Friday evenings diapers and beers are bought together" story are not as common as might be expected. There is need for skill in association analysis and it seems likely, as some researchers have argued, that a more rigorous statistical discipline to cope with rule proliferation would be beneficial.

Chapter 11

Data Reduction and Exploration

11.1 Dimensionality Reduction - Principal Components Analysis

In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely that subsets of variables are highly correlated with each other. Including highly correlated variables, or variables that are unrelated to the outcome of interest, in a classification or prediction model can lead to overfitting, and accuracy and reliability can suffer. Large numbers of variables also pose computational problems for some models (aside from questions of correlation). In model deployment, superfluous variables can increase costs due to collection and processing of these variables. The "dimensionality" of a model is the number of independent or input variables used by the model. One of the key steps in data mining, therefore, is finding ways to reduce dimensionality without sacrificing accuracy.

A useful procedure for this purpose is to analyze the "principal components" (illustrated below) of the input variables. It is especially valuable when we have subsets of measurements that are measured on the same scale and are highly correlated. In that case it provides a few (often as few as three) variables that are weighted combinations of the original variables that retain the explanatory power of the full original set.

11.2 Example 1 - Head Measurements of First Adult Sons

The data below give 25 pairs of head measurements for first adult sons in a sample (Elston and Grizzle, 1962).

First Adult Son
Head Length (x1)   Head Breadth (x2)
191                155
195                149
181                148
183                153
176                144
208                157
189                150
197                159
188                152
192                150
179                158
183                147
174                150
190                159
188                151
163                137
195                155
186                153
181                145
175                140
192                154
174                143
176                139
197                167
190                163

For this data the means of the variables $x_1$ and $x_2$ are 185.7 and 151.1 and the covariance matrix is

$$S = \begin{pmatrix} 95.29 & 52.87 \\ 52.87 & 54.36 \end{pmatrix}.$$

11.3 The Principal Components

Figure 1 below shows the scatter plot of points $(x_1, x_2)$. The principal component directions are shown by the axes $z_1$ and $z_2$ that are centered at the means of $x_1$ and $x_2$. The line $z_1$ is the direction of the first principal component of the data. It is the line that captures the most variation in the data if we decide to reduce the dimensionality of the data from two to one. Amongst all possible lines, it is the line for which, if we project the points in the data set orthogonally to get a set of 25 (one dimensional) values using the $z_1$ co-ordinate, the variance of the $z_1$ values will be maximum. It is also the line that minimizes the sum of squared perpendicular distances from the line. (Show why this follows from Pythagoras' theorem. How is this line different from the regression line of $x_2$ on $x_1$?) The $z_2$ axis is perpendicular to the $z_1$ axis.

The directions of the axes are given by the eigenvectors of S. For our example the eigenvalues are 131.5 and 18.14. The eigenvector corresponding to the larger eigenvalue is (0.825, 0.565) and gives us the direction of the $z_1$ axis. The eigenvector corresponding to the smaller eigenvalue is (−0.565, 0.825) and this is the direction of the $z_2$ axis.

The lengths of the major and minor axes of the ellipse that would enclose about 65% of the points (provided the points had a bivariate Normal distribution) are the square roots of the eigenvalues. This corresponds to the rule that about 65% of the data in a univariate Normal distribution lie within one standard deviation of the mean. Similarly, doubling the axes lengths of the ellipse will enclose 95% of the points and tripling them would enclose 99% of the points. In our example, the length of the major axis is $\sqrt{131.5} = 11.47$ and the length of the minor axis is $\sqrt{18.14} = 4.26$. In Figure 1 the inner ellipse has these axes lengths while the outer ellipse has axes with twice these lengths.
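The eigenvalues and eigenvectors quoted above, and the projection of a centered observation onto them, can be checked with a short calculation. The sketch below is an illustration only: it uses the closed-form eigen-decomposition of a symmetric 2 × 2 matrix instead of a general routine, the function names are hypothetical, and the means 185.72 and 151.12 are those computed from the 25 pairs in the table. The sign of an eigenvector is arbitrary, so a projection may come out with the opposite sign to a printed value.

```python
from math import sqrt


def eig_sym_2x2(a, b, d):
    """Eigenvalue/eigenvector pairs of [[a, b], [b, d]], largest first (b != 0)."""
    mid = (a + d) / 2
    half = sqrt(((a - d) / 2) ** 2 + b ** 2)
    pairs = []
    for lam in (mid + half, mid - half):
        vx, vy = b, lam - a                  # solves (S - lam I) v = 0
        norm = sqrt(vx * vx + vy * vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs


def scores(point, means, vectors):
    """Inner products of the centered point with each eigenvector."""
    centered = [p - m for p, m in zip(point, means)]
    return [sum(c * v for c, v in zip(centered, vec)) for vec in vectors]


# Covariance matrix S of the head measurements, as given in the text.
(l1, e1), (l2, e2) = eig_sym_2x2(95.29, 52.87, 54.36)
print(round(l1, 1), round(l2, 2))            # eigenvalues
print(round(sqrt(l1), 2), round(sqrt(l2), 2))  # axis lengths of the inner ellipse

# Projection of the first observation (191, 155) onto the two directions.
z1, z2 = scores((191, 155), (185.72, 151.12), (e1, e2))
print(round(z1, 3), round(z2, 3))
print(round(l1 / (l1 + l2), 2))              # share of total variance on z1
```

The last line gives the proportion of the total variance carried by the first direction, which is how a single principal component can stand in for both original variables.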
[Figure 1: scatter plot of the points $(x_1, x_2)$ with the principal component axes $z_1$ and $z_2$ and the inner and outer ellipses]

The values of $z_1$ and $z_2$ for the observations are known as the principal component scores and are shown below. The scores are computed as the inner products of the data points and the first and second eigenvectors (in order of decreasing eigenvalue).

Principal Component Scores

Observation No.   z1         z2
 1                  6.549      0.216
 2                  6.457     −6.994
 3                 −5.657      0.094
 4                 −1.181      3.088
 5                −12.043     −0.379
 6                 21.703     −7.743
 7                  2.073     −2.778
 8                 13.759      0.125
 9                  2.378     −0.563
10                  4.547     −4.474
11                 −1.655      9.474
12                 −4.573     −1.861
13                −10.301      5.701
14                  7.985      4.081
15                  1.813     −1.388
16                −26.724      1.194
17                  9.848     −2.045
18                  1.294      1.393
19                 −7.353     −2.381
20                −15.129     −3.114
21                  6.808     −1.174
22                −14.258     −0.074
23                −14.869     −4.504
24                 18.281      6.724
25                 10.246      7.381

The means of $z_1$ and $z_2$ are zero. This follows from our choice of the origin for the $(z_1, z_2)$ coordinate system to be the means of $x_1$ and $x_2$. The variances are more interesting: they equal the eigenvalues, 131.5 for $z_1$ and 18.14 for $z_2$, so the first principal component, $z_1$, accounts for 88% of the total variance. Since it captures most of the variability in the data, it seems reasonable to use one variable, the first principal score, to represent the two variables in the original data.

11.4 Example 2 - Characteristics of Wine

The data in Table 2 gives measurements on 13 characteristics of 60 different wines from a region. Let us see how principal component analysis would enable us to reduce the number of dimensions in the data.
The variances of z1 and z2 are 131.474 12 −4.294 1.181 3.759 0.724 1. 93 3.64 0.98 2.94 1.62 2.58 0.62 2.96 1. .81 1.6 0.33 12.79 1.76 0.22 0.41 1.89 0.06 0.14 1.55 1.31 0.17 0.7 12.95 3.43 0.93 1.3 4.5 0.45 7.6 2.18 1.95 4.72 2.25 1.17 12.8 0.36 0.9 2.96 1.83 0.85 5 5.15 1.16 1.4 25 24 30 21 16 16 18 14.2 18.42 9.81 12.32 1.68 3.6 16 16.31 2.6 0.34 0.7 2.36 13.5 127 100 101 113 118 97 98 105 95 89 91 102 112 120 115 108 88 101 100 94 87 104 98 78 78 110 151 103 86 87 139 101 97 86 112 136 122 104 98 106 85 94 89 96 88 101 96 89 97 92 112 102 80 86 92 113 2.6 4.98 1.88 2.17 0.28 2.5 7.75 14.14 0.73 1.8 18 19 19 18.95 3.17 1.12 13.92 3.27 1.96 2.45 2.64 2.84 1.22 1.34 Inten thocya Color Hue sity 0D315 Hue 2.24 3.6 16.35 2.02 1.27 0.5 1.34 12.29 0.25 0.38 1.98 3.97 1.82 0.37 13.3 1.57 OD280/ Proline 3.5 3.98 2.46 2.98 1.65 3.2 7.37 0.91 0.7 5 5.42 2.43 2.67 2.4 7.37 12.17 4.53 1.02 1.84 2.21 2.81 2.2 2.6 2.15 2.3 0.5 18.36 1.49 12.72 1.88 3.49 12.19 0.21 0.58 1.77 0.35 2.94 1.74 0.5 1.37 12.99 1.67 2.87 13.94 0.4 0.3 2.52 2.48 1.35 2.92 1.5 21 20 21.72 2.82 1.54 0.63 0.8 2.82 1.7 1.36 2.79 3.59 2.16 2.64 4.61 2.03 11.8 4.29 1.78 2.27 2.93 13.49 2.94 0.17 2.46 0.75 0.5 10.24 2.37 13.32 2.83 0.63 0.19 12.8 20.48 1.8 1.9 2.14 2.48 2.1 5.06 2.55 2 1.54 2.1 3.9 2.87 2.1 3.58 0.6 0.2 20 20 16.25 2.4 Example 2 .28 0.Characteristics of Wine 127 Table 2 Ash Alc Ash Magnsium alinity Total Alcohol Malic Acid Flava Phenols 14.9 7.95 2.81 0.76 1.1 1.81 2.6 0.59 2.61 1.4 12 17.32 0.19 1.85 2.8 2.34 0.65 0.82 3.03 1.21 0.14 2.4 4.1 0.82 1.88 0.93 0.57 1.09 1.6 5.47 0.38 13.16 2.29 0.86 0.57 5.62 1.85 2.25 13.47 2.05 2.3 1.03 1.5 21 25 19.75 5 5.98 1.75 2.87 1.19 1.98 2.56 1.14 3.76 0.38 2.95 2.25 0.18 3.3 3.21 3.66 13.8 2.51 1.1 1.35 3.68 1.3 13.14 5.12 1.04 4.81 2.53 0.3 0.29 0.04 1.84 12.16 13.05 3.23 1.24 0.75 0.29 13.86 1.96 1.6 11.5 21.32 0. 
For example.92 1.4 8.11.13 1.7 2.17 2.26 0.07 1.47 0.19 0.02 1.23 13.09 1.25 1.24 2.03 0.38 1.64 2.5 2.72 5.4 3.76 3.28 1.7 1.33 0.91 1.53 13.47 0.01 1.45 0.42 1.92 4.32 13.5 24 21 20 23.88 12.45 1.3 2.51 12.23 1.15 1.53 2.6 8.25 1.84 12.15 3.59 0.31 3.05 2 1.3 0.56 3.86 1.24 0.62 1.39 Non Proan noid nins 0.81 0.01 1.76 3.8 21 14 16 18 16.3 3.22 1.71 1.58 0.28 0.8 4.02 2.92 2.32 2.78 3.53 0.29 1.11 12.41 2.08 2.95 2.55 1.88 0.86 1.46 4.55 0.65 1.5 17.4 2.5 20 18.88 3.17 3.57 2.46 1.82 2.72 1.4 3.4 0.27 5.6 12.87 2.66 0.25 12.68 0.23 2.21 12.46 1.4 1.03 2.52 0.57 1.65 2.06 2.17 2.73 3 2.08 1.15 2.29 0.8 18 20 24 21.87 1.05 1.9 1.2 0.99 1.38 2.69 2.33 1.35 2.5 0.85 1.5 0.2 6.02 2.32 2.7 1.41 0.07 1.32 1.28 1.7 0.43 0.91 3.57 2.13 2.6 3.85 2.24 14.37 0.71 2.5 0.1 2.75 3.27 0.89 0.04 1.18 2 1.2 2.4 2.86 1.23 1.73 1.95 1.36 2.89 2.35 4.24 3.6 5.14 1.98 1.21 0.85 1.53 0.43 2.36 1.87 1.3 6.93 2.36 1.88 12.25 1.66 1. The rows of Figure 2 are in the same order as the columns of Table 2.54 1.61 1.69 3.4 5.69 1.36 15.21 1.6 17 16.3 1. 
row 1 for each principal component gives the weight for alcohol and row 13 gives the weight for proline.65 0.65 3.1 15 19.83 14.82 2.3 1.16 14.62 12.75 0.32 5.25 1.85 3.75 2.89 2.48 2.75 14.26 0.12 1.55 3.42 2.74 3.1 14.57 1.51 3.68 7.59 1.48 1.45 1.86 14.45 2.38 3.19 2.04 0.5 21.73 0.8 4.35 0.39 0.95 1.86 12.53 0.22 0.89 0.61 3.39 2.38 1.33 1065 1050 1185 1480 735 1045 1045 1510 1280 1320 1150 1547 1310 1280 1130 1680 520 680 450 630 420 355 678 502 510 750 718 870 410 472 985 886 428 392 500 750 630 530 560 600 650 695 720 515 580 590 600 780 520 550 855 830 415 625 650 550 The output from running a principal components analysis on these data is shown in Figure 2 below.52 13.81 2.63 0.8 16 11.95 2.67 1.11 2.43 0.67 1.42 0.37 13.23 2.78 0.83 13.64 1.24 0.7 2 1.78 0.8 2.99 11.08 1.45 0.65 8.33 12.65 2.62 0.26 0.41 flava noids Phenols 3.15 1.63 14.51 1.09 1.37 0.4 1.22 5.13 1.96 11.8 3.86 13.73 1.31 1.23 1.7 4.2 1.13 0.48 0.05 1.38 1.92 2.51 1.67 12.38 5.55 0.29 1.2 13.66 0.64 13.28 1.6 5.1 3.28 2.28 0.85 3.95 3.2 2.21 4 4.99 2.36 1.27 0.55 0. This is easily accomplished by dividing each variable by its standard deviation.184 -0.775% 87. or if.091 0. In the rare situations where we can give relative weights to variables we would multiply the normalized variables by these weights before doing the principal components analysis.427 3.022 0.056 0.279 0.647 0.785% 7 0.200 0.250 0.085 -0.361% 100.043 0.002 -0. A further advantage of the principal components compared to the original data is that they are uncorrelated (correlation coefficient = 0).034 0.037 -0.111 5.620 0.363 0.776% 70.086 0.072% 8 -0.370 41.939% 11 -0.124 0. and when their scale reflects their importance (sales of jet fuel.111 0.064 -0.181 0.g.100 0.639% 13 -0.014 0.264 0. .364 -0. 
If the variables are measured in quite differing units so that it is unclear how to compare the variability of different variables (e.135 -0.128 11.327 1.045 0.304 -0.045 0.048 -0.191 0.226 1.102 0.631 0.338 0.156 -0.438 0.477% 77.900% 10.068 -0.588 0.394 0.042 0.242 0.118 0.128 0.288 -0.218% 84.454 0.249 0. that changes in units of measurement do not change the principal component weights.196% 10 0.044 0.g.166 -0.241 0.159 0.669 0.417 -0.198 -0. scale does not reflect importance (earnings per share.053 0.182 -0.147 0.011 -0.002 0.335 0.209 0.247 0.491 3.051 -0.177 0. for variables measured in the same units. dollars).5 Normalizing the Data The principal components shown in Figure 2 were computed after replacing each original variable by a standardized version of the variable that has unit variance.076 0.249 1.125 0.185 1.493 -0.402 0.374 0.182 0.107 -0.267 0. When should we normalize the data like this? This depends on the nature of the data.077 0.155 0.529 0.917% 95.281 -0.624 -0.040 0.793% 5 0.065 0. it is probably best not to normalize (rescale the data so that it has unit variance).222 0.) The effect of this normalization (standardization) is to give all variables equal importance in terms of the variability.178 -0.349 0.638 0.045 0.217 -0.213 0. When the units of measurement are common for the variables (e.376 0. gross revenues) it is generally advisable to normalize.122 -0.127 0.009 0.255 0.168 -0.477 -0.364 -0.061 -0.197 -0.207% 93.003 -0.275 0.153 0.040 -0.142 0.660 0.400 -0. In this way.189 0.221 0.037 -0.156 -0.343 -0.133 -0.122 -0. parts per million for others).102 0.104 -0.245 -0.540% 41. If we construct regression models using these principal components as independent variables we will not encounter problems of multicollinearity.409 0.087 -0.166 1.358% 12 -0.025 -0.250 0.023 0.117 0. 
(Sometimes the mean is subtracted as well.824 0.011% 6 -0.157 0.444 2.207 0.294 -0.436 0.287 2.000% Notice that the first five components account for more than 80% of the total variation associated with all 13 of the original variables.172 0.167 -0.598 0. This suggests that we can capture most of the variability in the data with less than half the number of original dimensions in the data.047 0.034 0.392 -0.279% 9 0.420% 98. dollars for some.263 0.144 -0.243 0. 11.287% 91. Data Reduction and Exploration Figure 2: Principal components for the Wine data Principal Components 1 2 3 0.064 -0.876% 17.110 0.041 0.415 0.301 0.037 -0.052 0.085 -0.053 0.259 0.402 -0.225 -0.295 -0.270 -0.152 -0.044 0.745 -0.090 0.306 0.044 0.972 7.154 0.099 0. sales of heating oil).020 0.187 0.122 -0.281% 99.291 0.742% 96.138 0.147 0.070 -0.112 0.485 0.440 0.522 0.876% 59.106 0.808 6.106 0.098 0.323 0.316% 4 0.488 0. 244 0.009 0.001 1.002% 99.021 0.007 -0.060 0. This is because Proline’s standard deviation is 351 compared to the next largest standard deviation of 15 for the variable Magnesium.004 0.002 -0.000 0.006 0.345 11. 11.402 0.830% 99. The standard deviations of all the other variables are about 1% (or less) that of Proline.220 0. The first four components are the four variables with the largest variances in the data and account for almost 100% of the total variance in the data.8 1.7 351.082 0.014 0.6 Principal Components and Orthogonal Least Squares The weights computed by principal components analysis have an interesting alternate interpretation.064 0.014 -0.7 1.424 2.999% 0.002 -0.009 -0. The weights of the first principal component would define the best .6 0.976 0. 
Suppose we want to fit a linear surface (a straight line for 2-dimensions and a plane for 3-dimensions) to the data points where the objective is to minimize the sum of squared errors measured by the squared orthogonal distances (squared lengths of perpendiculars) from the points to the fitted linear surface.987% 99.025 -0.11.3 3.545 -0.002 0.040 0.014 0.001 -0.002 0.998% Std.001 0.013 0.000 123594.004 -0. The first five principal components computed on the raw (non-normalized) data are shown in Table 3.004 1.427 -0.022 0.022 0.261 -0. Dev.006 0.316 0.7 1.009% 0.5 The principal components analysis without normalization is trivial for this data set.031 0.040 -0.001 0.157% 0.214 0.096 -0.167 -0.2 0.804 -0.002 -0.004 0.001% 99. Notice that the first principal component is comprised mainly of the variable Proline.1 0.129 -0. 5 0.453 99.536 0.049 0.001 0.000 0.002 0.1 0.6 Principal Components and Orthogonal Least Squares 129 Example 2 (continued) Normalizing variables in the wine data is important due to the heterogenous nature of the variables.002 0.388 0.015 0.000 -0.030 0.391 0.001 194.2 0.176 -0.054 -0. The second principal component is Magnesium.998 -0. and this component explains almost all the variance in the data.097 -0.031 0.164 0.000 -0.7 0.830% Principal Components 2 3 4 0. Table 3: Principal Components of non-normalized Wine data Alcohol MalicAcid Ash Ash− Alcalinity Magnesium Total Phenols Flavanoids Nonflavanoid− Phenols Proanthocyanins Color Intensity Hue OD280/OD315 Proline Variance % Variance Cumulative % 1 0.6 14.996% 99.045 0. 130 12. would be the portion of the variability explained by the fit in a manner analogous to R2 in multiple linear regression. expressed as a percentage of the total variation in the data. This property can be exploited to find nonlinear structure in high dimensional data by considering perpendicular projections on non-linear surfaces (Hastie and Stuetzle. . The variance of the first principal component. 
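The orthogonal least squares interpretation can be checked numerically in two dimensions. The sketch below is only an illustration under stated assumptions: the five data points and the function names first_principal_axis and orthogonal_sse are hypothetical, and the eigenvalues of the 2 x 2 covariance matrix are obtained from the closed-form formula rather than from any particular software package. It fits the line through the mean in the direction of the first principal component and computes the sum of squared perpendicular distances to it.

```python
import math

def first_principal_axis(points):
    """Mean, unit direction of the first principal component, and both
    eigenvalues of the sample covariance matrix of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    if abs(sxy) > 1e-12:
        vx, vy = sxy, lam1 - sxx          # eigenvector belonging to lam1
    else:                                  # variables already uncorrelated
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (mx, my), (vx / norm, vy / norm), (lam1, lam2)

def orthogonal_sse(points, mean, direction):
    """Sum of squared perpendicular distances from the points to the line
    through `mean` with unit direction `direction`."""
    mx, my = mean
    dx, dy = direction
    return sum((-(x - mx) * dy + (y - my) * dx) ** 2 for x, y in points)

# Hypothetical data, used only to illustrate the calculation.
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]
mean, direction, (lam1, lam2) = first_principal_axis(points)
sse = orthogonal_sse(points, mean, direction)
share = lam1 / (lam1 + lam2)   # analogue of R2 for the orthogonal fit
print(direction, sse, share)
```

In this sketch the sum of squared perpendicular distances equals (n − 1) times the smaller eigenvalue, so the share of variance of the first component plays the role that R2 plays in ordinary regression.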
Chapter 12

Cluster Analysis

12.1 What is Cluster Analysis?

Cluster analysis is concerned with forming groups of similar objects based on several measurements of different kinds made on the objects. The key idea is to classify these clusters in ways that would be useful for the aims of the analysis. This idea has been applied in many areas, including astronomy, archaeology, medicine, chemistry, education, psychology, linguistics and sociology. For example, biologists have made extensive use of classes and sub-classes to organize species. A spectacular success of the clustering idea in chemistry was Mendeleev's periodic table of the elements. In marketing and political forecasting, clustering of neighborhoods using US postal zip codes has been used successfully to group neighborhoods by lifestyles. Claritas, a company that pioneered this approach, grouped neighborhoods into 40 clusters using various measures of consumer expenditure and demographics. Examining the clusters enabled Claritas to come up with evocative names, such as "Bohemian Mix," "Furs and Station Wagons" and "Money and Brains," for the groups that captured the dominant lifestyles in the neighborhoods. Knowledge of lifestyles can be used to estimate the potential demand for products such as sports utility vehicles and services such as pleasure cruises.

The objective of this chapter is to help you to understand the key ideas underlying the most commonly used techniques for cluster analysis and to appreciate their strengths and weaknesses. We cannot aspire to be comprehensive as there are literally hundreds of methods (there is even a journal dedicated to clustering ideas: "The Journal of Classification"!).

Typically, the basic data used to form clusters is a table of measurements on several variables where each column represents a variable and a row represents an object (case). Our goal is to form groups of cases so that similar cases are in the same group. The number of groups may be specified or determined from the data.

12.2 Example 1 - Public Utilities Data

Table 1 below gives corporate data on 22 US public utilities. We are interested in forming groups of similar utilities. The objects to be clustered are the utilities and there are 8 measurements on each utility, described in Table 2 below. An example where clustering would be useful is a study to predict the cost impact of deregulation. To do the requisite analysis economists would need to build a detailed cost model of the various utilities. It would save a considerable amount of time and effort if we could cluster similar types of utilities and build detailed cost models for just one "typical" utility in each cluster and then scale up from these models to estimate results for all utilities.

Before we can use any technique for clustering we need to define a measure for distances between utilities so that similar utilities are a short distance apart and dissimilar ones are far from each other. A popular distance measure based on variables that take on continuous values is to normalize the values by subtracting the mean and dividing by the standard deviation (sometimes other measures such as range are used) and then computing the distance between objects using the Euclidean metric.

The Euclidean distance $d_{ij}$ between two cases, i and j, with normalized variable values $(x_{i1}, x_{i2}, \cdots, x_{ip})$ and $(x_{j1}, x_{j2}, \cdots, x_{jp})$ is defined by:

$$d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$$

All our variables are continuous in this example, so we compute distances using this metric. The result of the calculations is given in Table 3 below.

If we felt that some variables should be given more importance than others we would modify the squared difference terms by multiplying them by weights (positive numbers adding up to one) and use larger weights for the important variables. The weighted Euclidean distance measure is given by:

$$d_{ij} = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \cdots + w_p (x_{ip} - x_{jp})^2}$$

where $w_1, w_2, \cdots, w_p$ are the weights for variables 1, 2, ..., p so that $w_i \ge 0$ and $\sum_{i=1}^{p} w_i = 1$.

Table 1: Public Utilities Data
Companies, numbered 1 to 22: Arizona Public Service; Boston Edison Company; Central Louisiana Electric Co.; Commonwealth Edison Co.; Consolidated Edison Co. (NY); Florida Power and Light; Hawaiian Electric Co.; Idaho Power Co.; Kentucky Utilities Co.; Madison Gas & Electric Co.; Nevada Power Co.; New England Electric Co.; Northern States Power Co.; Oklahoma Gas and Electric Co.; Pacific Gas & Electric Co.; Puget Sound Power & Light Co.; San Diego Gas & Electric Co.; The Southern Co.; Texas Utilities Co.; Wisconsin Electric Power Co.; United Illuminating Co.; Virginia Electric & Power Co.
[Values of X1 to X8 for each company.]

Table 2: Explanation of Variables
X1: Fixed-charge covering ratio (income/debt)
X2: Rate of return on capital
X3: Cost per KW capacity in place
X4: Annual Load Factor
X5: Peak KWH demand growth from 1974 to 1975
X6: Sales (KWH use per year)
X7: Percent Nuclear
X8: Total fuel costs (cents per KWH)
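The normalization and the two distance formulas of this section can be sketched in a few lines. This is a minimal illustration, not the procedure used to produce the tables in these notes: the four cases are hypothetical values on two very different scales, and the function names standardize and weighted_euclidean are invented for the example.

```python
import math

def standardize(rows):
    """Normalize each column: subtract its mean, divide by its sample
    standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[m] for r in rows) / n for m in range(p)]
    sds = [math.sqrt(sum((r[m] - means[m]) ** 2 for r in rows) / (n - 1))
           for m in range(p)]
    return [[(r[m] - means[m]) / sds[m] for m in range(p)] for r in rows]

def weighted_euclidean(x, y, weights=None):
    """Weighted Euclidean distance; with no weights this is the plain
    Euclidean distance."""
    if weights is None:
        weights = [1.0] * len(x)
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

# Hypothetical cases: a ratio near 1 and a sales figure in the thousands.
cases = [[1.0, 9000.0], [1.2, 5000.0], [0.9, 9200.0], [1.5, 6400.0]]
z = standardize(cases)

d01 = weighted_euclidean(z[0], z[1])                 # plain Euclidean
d02 = weighted_euclidean(z[0], z[2])
d01_w = weighted_euclidean(z[0], z[1], [0.5, 0.5])   # weights add up to one
print(d01, d02, d01_w)
```

Without the normalization step the second variable would dominate every distance simply because of its units, which is the reason for normalizing first.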
Table 3: Distances based on standardized variable values
[22 x 22 matrix of distances between utilities 1 to 22.]

Clustering Algorithms

A large number of techniques have been proposed for forming clusters from distance matrices. The most important types are hierarchical techniques, optimization techniques and mixture models. We discuss the first two types here.

12.3 Hierarchical Methods

There are two major types of hierarchical techniques: divisive and agglomerative. Agglomerative hierarchical techniques are the more commonly used. The idea behind this set of techniques is to start with each cluster comprising exactly one object and then progressively agglomerate (combine) the two nearest clusters until there is just one cluster left consisting of all the objects. Nearness of clusters is based on a measure of distance between clusters.

How do we measure distance between clusters? All agglomerative methods require as input a distance measure between all the objects that are to be clustered. This measure of distance between objects is mapped into a metric for the distance between clusters (sets of objects). The only difference between the various agglomerative techniques is the way in which this inter-cluster distance metric is defined. The most popular agglomerative techniques are:

12.3.1 Nearest neighbor (Single linkage)

Here the distance between two clusters is defined as the distance between the nearest pair of objects in the two clusters (one object in each cluster). If cluster A is the set of objects $A_1, A_2, \cdots, A_m$ and cluster B is $B_1, B_2, \cdots, B_n$, the single linkage distance between A and B is

$$\min\{\mathrm{distance}(A_i, B_j) \mid i = 1, 2, \cdots, m;\ j = 1, 2, \cdots, n\}.$$

This method has a tendency to cluster together at an early stage objects that are distant from each other because of a chain of intermediate objects in the same cluster. Such clusters have elongated sausage-like shapes when visualized as objects in space.

The nearest neighbor clusters for the utilities are displayed in Figure 1 below in a useful graphic format called a dendrogram. For example, if we wanted to form 6 clusters we would read the six groups of case numbers off the dendrogram; they include the singleton clusters {5} and {17}. In general all hierarchical methods have clusters that are nested within each other as we decrease the number of clusters we desire. This is a valuable property for interpreting clusters and is essential in certain applications, such as taxonomy of varieties of living organisms.

12.3.2 Farthest neighbor (Complete linkage)

Here the distance between two clusters is defined as the distance between the farthest pair of objects with one object in the pair belonging to a distinct cluster. If cluster A is the set of objects $A_1, A_2, \cdots, A_m$ and cluster B is $B_1, B_2, \cdots, B_n$, the complete linkage distance between A and B is

$$\max\{\mathrm{distance}(A_i, B_j) \mid i = 1, 2, \cdots, m;\ j = 1, 2, \cdots, n\}.$$

This method tends to produce clusters at the early stages that have objects that are within a narrow range of distances from each other. If we visualize them as objects in space, the objects in such clusters would have roughly spherical shapes.

12.3.3 Group average (Average linkage)

Here the distance between two clusters is defined as the average distance between all possible pairs of objects with one object in each pair belonging to a distinct cluster. If cluster A is the set of objects $A_1, A_2, \cdots, A_m$ and cluster B is $B_1, B_2, \cdots, B_n$, the average linkage distance between A and B is

$$\frac{1}{mn} \sum \mathrm{distance}(A_i, B_j),$$

the sum being taken over $i = 1, 2, \cdots, m$ and $j = 1, 2, \cdots, n$.

The average linkage dendrogram is shown in Figure 2. If we want six clusters using average linkage, they again include the singleton clusters {5} and {17}. The clusters tend to group geographically: for example, there is a southern group and an east/west seaboard group.
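The three inter-cluster distances of Sections 12.3.1 to 12.3.3, and the agglomeration step itself, can be sketched as follows. This is a naive illustration under stated assumptions: the objects are hypothetical points on a line, the function names are invented for the example, and the search over all pairs of clusters is written for clarity rather than speed.

```python
from itertools import product

def single_linkage(A, B, dist):
    """Distance between the nearest pair, one object from each cluster."""
    return min(dist(a, b) for a, b in product(A, B))

def complete_linkage(A, B, dist):
    """Distance between the farthest pair, one object from each cluster."""
    return max(dist(a, b) for a, b in product(A, B))

def average_linkage(A, B, dist):
    """Average distance over all m*n pairs, one object from each cluster."""
    return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def agglomerate(objects, dist, linkage, k):
    """Start with one object per cluster and repeatedly merge the two
    nearest clusters until only k clusters remain."""
    clusters = [[o] for o in objects]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def dist(a, b):
    """Distance between two hypothetical objects on a line."""
    return abs(a - b)

A, B = [0.0, 1.0], [3.0, 5.0]
print(single_linkage(A, B, dist), complete_linkage(A, B, dist),
      average_linkage(A, B, dist))

points = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0]
two_clusters = agglomerate(points, dist, single_linkage, 2)
print(two_clusters)
```

Recording the distance at which each merge happens, rather than stopping at k clusters, would give the heights needed to draw a dendrogram.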
For any given number of clusters we can determine the cases in the clusters by sliding a horizontal line up and down the dendrogram until the number of vertical intersections of the horizontal line equals the desired number of clusters. For the single linkage dendrogram, notice that if we wanted 5 clusters they would be the same as for six with the exception that the first two of the six clusters would be merged into one cluster. Notice that both methods identify {5} and {17} as singleton clusters. Note that the results of the single linkage and the complete linkage methods depend only on the order of the inter-object distances and so are invariant to monotonic transformations of the inter-object distances.

Figure 1: Dendrogram - Single Linkage

Figure 2: Dendrogram - Average Linkage between groups

12.4 Optimization and the k-means algorithm

A non-hierarchical approach to forming good clusters is to specify a desired number of clusters, say, k, and to assign each case to one of k clusters so as to minimize a measure of dispersion within the clusters. A very common measure is the sum of distances or sum of squared Euclidean distances from the mean of each cluster. The problem can be set up as an integer programming problem (see Appendix) but because solving integer programs with a large number of variables is time consuming, clusters are often computed using a fast, heuristic method that generally produces good (but not necessarily optimal) solutions. The k-means algorithm is one such method.

The k-means algorithm starts with an initial partition of the cases into k clusters. Subsequent steps modify the partition to reduce the sum of the distances for each case from the mean of the cluster to which the case belongs. The modification consists of allocating each case to the nearest of the k means of the previous partition. This leads to a new partition for which the sum of distances is strictly smaller than before. The means of the new clusters are computed and the improvement step is repeated until the improvement is very small. The method is very fast.

There is a possibility that the improvement step leads to fewer than k partitions. In this situation one of the partitions (generally the one with the largest sum of distances from the mean) is divided into two or more parts to reach the required number of k partitions. The algorithm can be rerun with different randomly generated starting partitions to reduce the chances of the heuristic producing a poor solution.

Generally the number of clusters in the data is not known so it is a good idea to run the algorithm with different values for k that are near the number of clusters one expects from the data to see how the sum of distances reduces with increasing values of k. The clusters developed using different values of k will not be nested (unlike those developed by hierarchical methods).

The results of running the k-means algorithm for Example 1 with k = 6 are shown below.

Output of k-means clustering

Initial Cluster Centers
[Values of X1 to X8 for the six initial centers.]
Minimum distance is between initial centers 3 and 6.

Iteration History
[Change in cluster centers for clusters 1 to 6 over iterations 1 to 3.]

Cluster Membership
Cluster 1: cases 8, 11, 16
Cluster 2: cases 12, 15, 17, 21
Cluster 3: cases 1, 3, 14, 18, 19
Cluster 4: cases 2, 4, 10, 13, 20, 22
Cluster 5: case 5
Cluster 6: cases 6, 7, 9
[The output also lists the distance of each case from its cluster center.]

Final Cluster Centers
[Values of X1 to X8 for the six final centers.]

Distances between Final Cluster Centers
[6 x 6 matrix of distances between the final centers.]

Number of Cases in each Cluster
Cluster 1: 3
Cluster 2: 4
Cluster 3: 5
Cluster 4: 6
Cluster 5: 1
Cluster 6: 3
Valid: 22
Missing: 0

The ratio of the sum of squared distances for a given k to the sum of squared distances to the mean of all the cases (k = 1) is a useful measure for the usefulness of the clustering. If the ratio is near 1.0 the clustering has not been very effective; if it is small we have well separated groups.
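The improvement step described above, together with the ratio used to judge the clustering, can be sketched as follows. This is a simplified illustration, not the software that produced the output for Example 1: the six cases and the deliberately poor starting partition are hypothetical, the function names are invented for the example, and the sketch simply stops with an error if a cluster becomes empty instead of splitting another partition.

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    p = len(points[0])
    return [sum(pt[m] for pt in points) / len(points) for m in range(p)]

def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(cases, labels, max_iter=100):
    """labels gives the initial partition as a cluster index per case."""
    labels = list(labels)
    k = max(labels) + 1
    centers = []
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [x for x, l in zip(cases, labels) if l == c]
            if not members:
                raise ValueError("a cluster became empty; this sketch does not split partitions")
            centers.append(mean(members))
        # Allocate each case to the nearest of the k means.
        new_labels = [min(range(k), key=lambda c: sq_dist(x, centers[c]))
                      for x in cases]
        if new_labels == labels:      # no case moved: stop
            break
        labels = new_labels
    return labels, centers

def within_ss(cases, labels, centers):
    """Sum of squared distances of each case from its cluster mean."""
    return sum(sq_dist(x, centers[l]) for x, l in zip(cases, labels))

# Hypothetical cases forming two obvious groups, with a poor starting partition.
cases = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = k_means(cases, [0, 1, 0, 1, 0, 1])

total_ss = within_ss(cases, [0] * len(cases), [mean(cases)])   # k = 1
ratio = within_ss(cases, labels, centers) / total_ss
print(labels, ratio)
```

Rerunning k_means from several starting partitions and keeping the solution with the smallest within-cluster sum of squares is one simple way to apply the advice about random restarts.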
If the ratio is near 1.861 0.286 4.286 2.927 0.000 3.072 2.720 5.000 2.072 3 3.706 3.024 3.141 0.927 4.024 4.5 Similarity Measures Sometimes it is more natural or convenient to work with a similarity measure between cases rather than distance which measures dissimilarity.087 3.0 the clustering has not been very effective.924 5 5.141 6 4. rij p 2 ≡ rij m=1 p m=1 (xim − xm )(xjm − xm ) (xim − xm )2 p m=1 (xjm − xm )2 Such measures can always be converted to distance measures. 12. In the above example we could define 2.000 4.752 3.295 2 4.706 3.000 4.924 4. if it is small we have well separated groups.720 3. 12. (a + d)/p • Jaquard’s coefficient. d/(b + c + d). This is desirable when we do not want to consider two individuals to be similar simply because they both do not have a large number of characteristics. and S is the covariance matrix for these vectors. This coefficient ignores zero matches. .6 Other distance measures Two useful measures of dissimilarity other than the Euclidean distance that satisfy the triangular inequality and so qualify as distance metrics are: • Mahalanobis distance defined by dij = (xi − xj ) S −1 (xi − xj ) where xi and xj are p-dimensional vectors of the variable values for i and j respectively. This measure takes into account the correlation between variables. It is defined as p wijm sijm m=1 sij = p wijm m=1 with wijm = 1 subject to the following rules: • wijm = 0 when the value of the variable is not known for one of the pair of individuals or to binary variables to remove zero matches. • For non-binary categorical variables sijm = 0 unless the individuals are in the same category in which case sijm = 1 • For continuous variables sijm = 1− | xim − xjm | /(xm : max −xm : min) 12. 
With this measure variables that are highly correlated with other variables do not contribute as much as variables that are uncorrelated or mildly correlated.6 Other distance measures 141 Individual i 0 1 Individual j 0 1 a b c d a+c b+d a+b c+d p The most useful similarity measures in this situation are: • The matching coefficient. When the variables are mixed a similarity coefficient suggested by Gower [2] is very useful. p | xim − xjm | .2.142 12. Cluster Analysis • Manhattan distance defined by dij = p | xim − xjm | m=1 • Maximum co-ordinate distance defined by dij = max m=1.···. scientific and professional books Book clubs and other mail-order books Mass-market paperbound books All other books Book retailing in the US in the 1970’s was characterized by the growth of bookstore chains located in shopping malls. this industry may be segmented as follows: 16% 16% 21% 10% 17% 20% Textbooks Trade books sold in bookstores Technical. Vinni Bhandari from The Bookbinders Club. Under a continuity program. 2002. a reader would sign up by accepting an offer of several books for 1 This case was derived. Historically. book clubs sought out alternative business models that were more responsive to their customers’ individual preferences.000 new titles. giving rise to a $25 billion industry in 20011 In terms of percentage of sales. are published each year in the US. the superstore concept of book retailing gained acceptance and contributed to double-digit growth of the book industry.000 to 80.) In response to these pressures. with the assistance of Ms. prepared by Nissan Levin and Jacob Zahavi. Tel Aviv University. By the 1990’s.1 Charles Book Club THE BOOK INDUSTRY Approximately 50. Superstores applied intense competitive pressure on book clubs and mail-order firms as well as traditional book retailers.Chapter 13 Cases 13. Industry Statistics. book clubs offered their readers different types of membership programs.000 titles. 143 . including new editions. 
a Case Study in Database Marketing. Two common membership programs are “continuity” and “negative option” programs that were extended contractual relationships between the club and its members. (Association of American Publishers. Conveniently situated near large shopping centers. The 1980’s saw increased purchases in bookstores stimulated through the widespread practice of discounting. and employ well-informed sales personnel. superstores maintain large inventories of 30. Permission pending. including media advertising (TV. DATABASE MARKETING AT CHARLES The club The Charles Book Club (“CBC”) was established in December of 1986. but only to specific segments of their customer base that are likely to be receptive to specific offers. CBC focused on selling specialty books by direct marketing through a variety of channels. In a negative option program. Rather than expanding the volume and coverage of mailings. CBC built and maintained a detailed database about its club members. In an attempt to combat these trends.000 readers. readers get to select which and how many additional books they would like to receive. some book clubs have begun to offer books on a “positive option” basis. CBC acquired most of these customers through advertising in specialty magazines.144 13. Negative option programs sometimes result in customer dissatisfaction and always give rise to significant mailing and processing costs. CBC is strictly a distributor and does not publish any of the books that it sells. CBC has created an active database of 500. the club’s selection of the month will be automatically delivered to them unless they specifically mark “no’ by a deadline date on their order form. magazines. Cases just a few dollars (plus shipping and handling) and an agreement to receive a shipment of one or two books each month thereafter at more standard pricing. and much of the club’s prestige depends on the quality of its selections. 
In line with its commitment to understanding its customer base. The continuity program was most common in the children’s books market. some book clubs are beginning to use database-marketing techniques to more accurately target customers. where parents are willing to delegate the rights to the book club to make a selection. Information contained in their databases is used to identify who is most likely to be interested in a specific offer. newspapers) and mailing. This information enables clubs to carefully design special programs tailored to meet their customer segments’ varying needs. on the premise that a book club could differentiate itself through a deep understanding of its customer base and by delivering uniquely tailored offerings. . However. Through this process. readers were required to fill out an insert and mail it to CBC. Upon enrollment. On the surface. The analysis would create and calibrate response models for the current book offering. • Every new book would be offered to the club members before general advertising. CBC looked like they were very successful. 2. The decreasing profits led CBC to revisit their original plan of using database marketing to improve its mailing yields and to stay profitable. Customer acquisition: • New members would be acquired by advertising in specialty magazines. . CBC’s management decided to focus its efforts on the most profitable customers and prospects. • Any information not being collected that is critical would be requested from the customer. The two processes they had in place were: 1. their customer database was increasing. newspapers and TV. To derive intelligence from these processes they decided to use a two-step approach for each new title: (a) Conduct a market test.1 Charles Book Club 145 The problem CBC sent mailings to its club members each month containing its latest offering. compute a score for each customer in the database. • Direct mailing and telemarketing would contact existing club members. 
(b) Based on the response models. their bottom line profits were falling. involving a random sample of 10.13. Data collection: • All customer responses would be recorded and maintained in the database. A possible solution They embraced the idea that deriving intelligence from their data would allow them to know their customer better and enable multiple targeted campaigns where each target audience would receive appropriate mailings. and to design targeted marketing strategies to best reach them. mailing volume was increasing. Use this score and a cut-off value to extract a target customer list for direct mail promotion. book selection was diversifying and growing.000 customers from the database to enable analysis of customer responses. however. 146 13. and Test Data (800 Customers): data only to be used after a final model has been selected to estimate the likely accuracy of the model when it is deployed. inactivity. The customer responses have been collated with past purchase data. below. Cases Targeting promotions was considered to be of prime importance. other opportunities to create successful marketing campaigns based on customer behavior data such as returns. Each row (or case) in the spreadsheet (other than the header) corresponds to one market test customer. is ready for release. CBC has sent a test mailing to a random sample of 4. Art History of Florence A new title.000 customers from its customer base. Each column is a variable with the header row giving the name of the variable. “The Art History of Florence”. complaints. Validation Data (1400 customers): hold-out data used to compare the performance of different response models. There were. The variable names and descriptions are given in Table 1. The data has been randomly partitioned into 3 parts: Training Data (1800 customers): initial data to be used to fit response models. . CBC planned to address these opportunities at a subsequent stage. and compliments. in addition. 
Table 1: List of Variables in Charles Book Club Data Set

Variable Name      Description
Seq#               Sequence number in the partition
ID#                Identification number in the full (unpartitioned) market test data set
Gender             0 = Male, 1 = Female
M                  Monetary - Total money spent on books
R                  Recency - Months since last purchase
F                  Frequency - Total number of purchases
FirstPurch         Months since first purchase
ChildBks           Number of purchases from the category: Child books
YouthBks           Number of purchases from the category: Youth books
CookBks            Number of purchases from the category: Cookbooks
DoItYBks           Number of purchases from the category: Do It Yourself books
RefBks             Number of purchases from the category: Reference books (Atlases, Encyclopedias, Dictionaries)
ArtBks             Number of purchases from the category: Art books
GeoBks             Number of purchases from the category: Geography books
ItalCook           Number of purchases of book title: “Secrets of Italian Cooking”
ItalAtlas          Number of purchases of book title: “Historical Atlas of Italy”
ItalArt            Number of purchases of book title: “Italian Art”
Florence           = 1 if “The Art History of Florence” was bought, = 0 if not
Related purchase   Number of related books purchased

DATA MINING TECHNIQUES

There are various data mining techniques that can be used to mine the data collected from the market test. No one technique is universally better than another. The particular context and the particular characteristics of the data are the major factors in determining which techniques perform better in an application. For this assignment, we will focus on two fundamental techniques:

• K-Nearest Neighbor
• Logistic regression

We will compare them with each other as well as with a standard industry practice known as RFM segmentation.

RFM Segmentation

The segmentation process in database marketing aims to partition customers in a list of prospects into homogeneous groups (segments) that are similar with respect to buying behavior. The homogeneity criterion we need for segmentation is propensity to purchase the offering. But since we cannot measure this attribute, we use variables that are plausible indicators of this propensity.

In the direct marketing business the most commonly used variables are the ‘RFM variables’:

R - Recency - time since last purchase
F - Frequency - the number of previous purchases from the company over a period
M - Monetary - the amount of money spent on the company’s products over a period.

The assumption is that the more recent the last purchase, the more products bought from the company in the past, and the more money spent in the past buying the company’s products, the more likely is the customer to purchase the product offered.

The 1800 observations in the training data and the 1400 observations in the validation data have been divided into Recency, Frequency and Monetary categories as follows:

Recency:
0-2 months (Rcode=1)
3-6 months (Rcode=2)
7-12 months (Rcode=3)
13 months and up (Rcode=4)

Frequency:
1 book (Fcode=1)
2 books (Fcode=2)
3 books and up (Fcode=3)

Monetary:
$0 - $25 (Mcode=1)
$26 - $50 (Mcode=2)
$51 - $100 (Mcode=3)
$101 - $200 (Mcode=4)
$201 and up (Mcode=5)

The tables below display the 1800 customers in the training data, cross tabulated by these categories. The buyers are summarized in the first five tables and all customers (buyers and non-buyers) in the next five tables. These tables are available for Excel computations in the RFM spreadsheet in the data file.
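The category boundaries above translate directly into code. The following Python sketch is an illustration only and is not part of the case materials: the function names (rcode, fcode, mcode, response_rates) and the sample customers are hypothetical. It assumes each category's upper bound is inclusive, so a value that falls between two listed ranges (for example $25.50) lands in the higher category.

```python
from collections import defaultdict


def rcode(months_since_last_purchase):
    """Recency category: 0-2 -> 1, 3-6 -> 2, 7-12 -> 3, 13 and up -> 4."""
    for code, upper in enumerate((2, 6, 12), start=1):
        if months_since_last_purchase <= upper:
            return code
    return 4


def fcode(total_purchases):
    """Frequency category: 1 book -> 1, 2 books -> 2, 3 and up -> 3."""
    return max(1, min(total_purchases, 3))


def mcode(dollars_spent):
    """Monetary category: up to $25 -> 1, $50 -> 2, $100 -> 3, $200 -> 4, above -> 5."""
    for code, upper in enumerate((25, 50, 100, 200), start=1):
        if dollars_spent <= upper:
            return code
    return 5


def response_rates(customers):
    """customers: iterable of (R, F, M, florence) with florence equal to 0 or 1.

    Returns the overall response rate and a dict that maps each observed
    (Rcode, Fcode, Mcode) combination to its own response rate.
    """
    counts = defaultdict(lambda: [0, 0])  # combination -> [buyers, customers]
    for r, f, m, florence in customers:
        cell = counts[(rcode(r), fcode(f), mcode(m))]
        cell[0] += florence
        cell[1] += 1
    total_buyers = sum(buyers for buyers, _ in counts.values())
    total_customers = sum(n for _, n in counts.values())
    rates = {combo: buyers / n for combo, (buyers, n) in counts.items()}
    return total_buyers / total_customers, rates


# Hypothetical customers: (months since last purchase, purchases, dollars, Florence)
sample = [(1, 1, 20, 1), (1, 1, 22, 0), (14, 4, 300, 0), (14, 5, 250, 0)]
overall, by_combination = response_rates(sample)
```

Combinations whose rate exceeds the overall rate are then the ones with `by_combination[combo] > overall`.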
Buyers (Sum of Florence); "-" marks an empty cell

All Rcodes
Fcode          Mcode 1   Mcode 2   Mcode 3   Mcode 4   Mcode 5   Grand Total
1                    2         2        10         7        17            38
2                    -         3         5         9        17            34
3                    -         1         1        15        62            79
Grand Total          2         6        16        31        96           151

Rcode 1
Fcode          Mcode 1   Mcode 2   Mcode 3   Mcode 4   Mcode 5   Grand Total
1                    0         0         0         2         1             3
2                    -         1         0         0         1             2
3                    -         1         0         0         5             6
Grand Total          0         2         0         2         7            11

Rcode 2
Fcode          Mcode 1   Mcode 2   Mcode 3   Mcode 4   Mcode 5   Grand Total
1                    1         0         1         1         5             8
2                    -         0         3         5         5            13
3                    -         -         0         4        10            14
Grand Total          1         0         4        10        20            35

Rcode 3
Fcode          Mcode 1   Mcode 2   Mcode 3   Mcode 4   Mcode 5   Grand Total
1                    1         0         1         2         5             9
2                    -         1         1         2         4             8
3                    -         0         0         4        31            35
Grand Total          1         1         2         8        40            52

Rcode 4
Fcode          Mcode 1   Mcode 2   Mcode 3   Mcode 4   Mcode 5   Grand Total
1                    0         2         8         2         6            18
2                    -         1         1         2         7            11
3                    -         -         1         7        16            24
Grand Total          0         3        10        11        29            53

All customers (buyers and non-buyers) (Count of Florence); "-" marks an empty cell

All Rcodes
Fcode          Mcode 1   Mcode 2   Mcode 3   Mcode 4   Mcode 5   Grand Total
1                   20        40        93       166       219           538
2                    -        32        91       180       247           550
3                    -         2        33       179       498           712
Grand Total         20        74       217       525       964          1800

Rcode 1
Fcode          Mcode 1   Mcode 2   Mcode 3   Mcode 4   Mcode 5   Grand Total
1                    2         2         6        10        15            35
2                    -         3         4        12        16            35
3                    -         1         2        11        45            59
Grand Total          2         6        12        33        76           129

Rcode 2
Fcode          Mcode 1   Mcode 2   Mcode 3   Mcode 4   Mcode 5   Grand Total
1                    3         5        17        28        26            79
2                    -         2        17        30        31            80
3                    -         -         3        34        66           103
Grand Total          3         7        37        92       123           262

Rcode 3
Fcode          Mcode 1   Mcode 2   Mcode 3   Mcode 4   Mcode 5   Grand Total
1                    7        15        24        51        86           183
2                    -        12        29        55        85           181
3                    -         1        17        53       165           236
Grand Total          7        28        70       159       336           600

Rcode 4
Fcode          Mcode 1   Mcode 2   Mcode 3   Mcode 4   Mcode 5   Grand Total
1                    8        18        46        77        92           241
2                    -        15        41        83       115           254
3                    -         -        11        81       222           314
Grand Total          8        33        98       241       429           809

(a) What is the response rate for the training data customers taken as a whole? What is the response rate for each of the 4 × 5 × 3 = 60 combinations of RFM categories? Which combinations have response rates in the training data that are above the overall response in the training data?

(b) Suppose that we decide to send promotional mail only to the RFM combinations identified in part a. Compute the response rate in the validation data using these combinations.

(c) Rework parts a. and b. with three segments: segment 1 consisting of RFM combinations that have response rates that exceed twice the overall response rate, segment 2 consisting of RFM combinations that exceed the overall response rate but do not exceed twice that rate, and segment 3 consisting of the remaining RFM combinations. Draw the cumulative lift curve (consisting of three points for these three segments) showing the number of customers in the validation data set on the x axis and cumulative number of buyers in the validation data set on the y axis.

k-Nearest Neighbor

The k-Nearest Neighbor technique can be used to create segments based on product proximity of the offered products to similar products as well as propensity to purchase (as measured by the RFM variables). For “The Art History of Florence”, a possible segmentation by product proximity could be created using the following variables:

1. M: Monetary - Total money ($) spent on books
2. R: Recency - Months since last purchase
3. F: Frequency - Total number of past purchases
4. FirstPurch: Months since first purchase
5. RelatedPurch: Total number of past purchases of related books, i.e. sum of purchases from Art and Geography categories and of titles “Secrets of Italian Cooking”, “Historical Atlas of Italy”, and “Italian Art”.

(d) Use the k-Nearest Neighbor option under the Classify menu choice in XLMiner to classify cases with k = 1, k = 3 and k = 11. Use normalized data (note the checkbox ‘normalize input data’ in the dialog box) and all five variables.

(e) Use the k-Nearest Neighbor option under the Prediction menu choice in XLMiner to compute a cumulative gains curve for the validation data for k = 1, k = 3 and k = 11. Use normalized data (note the checkbox ‘normalize input data’ in the dialog box) and all five variables. The k-NN prediction algorithm gives a numerical value, which is a weighted sum of the values of the Florence variable for the k nearest neighbors with weights that are inversely proportional to distance.

Logistic Regression

The Logistic Regression model offers a powerful method for modeling response because it yields well-defined purchase probabilities. (The model is especially attractive in consumer choice settings because it can be derived from the random utility theory of consumer behavior, under the assumption that the error term in the customer’s utility function follows a type I extreme value distribution.)

Use the training set data of 1800 observations to construct three logistic regression models with:

• the full set of 15 predictors in the data set as independent variables and “Florence” as the dependent variable,
• a subset that you judge as the best,
• only the R, F, and M variables.

(f) Score the customers in the validation sample and arrange them in descending order of purchase probabilities.

(g) Create a cumulative gains chart summarizing the results from the three logistic regression models created above along with the expected cumulative gains for a random selection of an equal number of customers from the validation data set.

(h) If the cutoff criterion for a campaign is a 30% likelihood of a purchase, find the customers in the validation data that would be targeted and count the number of buyers in this set.

13.2 German Credit

The German Credit data set (available at ftp.ics.uci.edu/pub/machine-learning-databases/statlog/) has 30 variables and 1000 records, each record being a prior applicant for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases). New applicants for credit can also be evaluated on these 30 “predictor” variables and classified as a good credit risk or a bad credit risk, based on the predictor variables. All the variables are explained in Table 1.1. (Note: The original data set had a number of categorical variables, some of which have been transformed into a series of binary variables so that they can be appropriately handled by XLMiner. Several ordered categorical variables have been left as is, to be treated by XLMiner as numerical. The data has been organized in the spreadsheet German Credit.xls)
Table 1.1 Variables for the German Credit data.

Var. #  Variable Name      Description                                              Variable Type  Code Description
1.      OBS#               Observation No.                                          Categorical    Sequence Number in data set
2.      CHK_ACCT           Checking account status                                  Categorical    0: < 0 DM; 1: 0 <= ... < 200 DM; 2: >= 200 DM; 3: no checking account
3.      DURATION           Duration of credit in months                             Numerical
4.      HISTORY            Credit history                                           Categorical    0: no credits taken; 1: all credits at this bank paid back duly; 2: existing credits paid back duly till now; 3: delay in paying off in the past; 4: critical account
5.      NEW_CAR            Purpose of credit                                        Binary         car (new) 0: No, 1: Yes
6.      USED_CAR           Purpose of credit                                        Binary         car (used) 0: No, 1: Yes
7.      FURNITURE          Purpose of credit                                        Binary         furniture/equipment 0: No, 1: Yes
8.      RADIO/TV           Purpose of credit                                        Binary         radio/television 0: No, 1: Yes
9.      EDUCATION          Purpose of credit                                        Binary         education 0: No, 1: Yes
10.     RETRAINING         Purpose of credit                                        Binary         retraining 0: No, 1: Yes
11.     AMOUNT             Credit amount                                            Numerical
12.     SAV_ACCT           Average balance in savings account                       Categorical    0: < 100 DM; 1: 100 <= ... < 500 DM; 2: 500 <= ... < 1000 DM; 3: >= 1000 DM; 4: unknown/no savings account
13.     EMPLOYMENT         Present employment since                                 Categorical    0: unemployed; 1: < 1 year; 2: 1 <= ... < 4 years; 3: 4 <= ... < 7 years; 4: >= 7 years
14.     INSTALL_RATE       Installment rate as % of disposable income               Numerical
15.     MALE_DIV           Applicant is male and divorced                           Binary         0: No, 1: Yes
16.     MALE_SINGLE        Applicant is male and single                             Binary         0: No, 1: Yes
17.     MALE_MAR_WID       Applicant is male and married or a widower               Binary         0: No, 1: Yes
18.     CO-APPLICANT       Application has a co-applicant                           Binary         0: No, 1: Yes
19.     GUARANTOR          Applicant has a guarantor                                Binary         0: No, 1: Yes
20.     PRESENT_RESIDENT   Present resident since - years                           Categorical    0: <= 1 year; 1 < ... <= 2 years; 2 < ... <= 3 years; 3: > 4 years
21.     REAL_ESTATE        Applicant owns real estate                               Binary         0: No, 1: Yes
22.     PROP_UNKN_NONE     Applicant owns no property (or unknown)                  Binary         0: No, 1: Yes
23.     AGE                Age in years                                             Numerical
24.     OTHER_INSTALL      Applicant has other installment plan credit              Binary         0: No, 1: Yes
25.     RENT               Applicant rents                                          Binary         0: No, 1: Yes
26.     OWN_RES            Applicant owns residence                                 Binary         0: No, 1: Yes
27.     NUM_CREDITS        Number of existing credits at this bank                  Numerical
28.     JOB                Nature of job                                            Categorical    0: unemployed/unskilled - non-resident; 1: unskilled - resident; 2: skilled employee/official; 3: management/self-employed/highly qualified employee/officer
29.     NUM_DEPENDENTS     Number of people for whom liable to provide maintenance  Numerical
30.     TELEPHONE          Applicant has phone in his or her name                   Binary         0: No, 1: Yes
31.     FOREIGN            Foreign worker                                           Binary         0: No, 1: Yes
32.     RESPONSE           Credit rating is good                                    Binary         0: No, 1: Yes

Table 1.2, below, shows the values of these variables for the first several records in the case.

Table 1.2 The data (first several rows)

The consequences of misclassification have been assessed as follows: the costs of a false positive (incorrectly saying an applicant is a good credit risk) outweigh the benefits of a true positive (correctly saying an applicant is a good credit risk) by a factor of five. This can be summarized in the following table.

Table 1.3 Opportunity Cost Table (in Deutsche Marks)

                Predicted (Decision)
Actual          Good (Accept)    Bad (Reject)
Good            0                100 DM
Bad             500 DM           0

The opportunity cost table was derived from the average net profit per loan as shown below:

Table 1.4 Average Net Profit

                Predicted (Decision)
Actual          Good (Accept)    Bad (Reject)
Good            100 DM           0
Bad             -500 DM          0

Let us use this table in assessing the performance of the various models because it is simpler to explain to decision-makers who are used to thinking of their decision in terms of net profits.

Assignment

1. Review the predictor variables and guess at what their role might be in a credit decision. Are there any surprises in the data?

2. Divide the data into training and validation partitions, and develop classification models using the following data mining techniques in XLMiner:
• Logistic regression
• Classification trees
• Neural networks.

3. Choose one model from each technique and report the confusion matrix and the cost/gain matrix for the validation data. Which technique has the most net profit?

4. Let’s see if we can improve our performance. Rather than accepting XLMiner’s initial classification of everyone’s credit status, let’s use the “predicted probability of success” in logistic regression (where “success” means “1”) as a basis for selecting the best credit risks first, followed by poorer risk applicants.
a. Sort the validation data on “predicted probability of success.”
b. For each case, calculate the net profit of extending credit.
c. Add another column for cumulative net profit.
d. How far into the validation data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)
e. If this logistic regression model is scored to future applicants, what “probability of success” cutoff should be used in extending credit?
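Steps 4a to 4d amount to a sort followed by a running sum. The Python sketch below is one possible way to carry them out outside a spreadsheet, using the 100 DM and -500 DM figures from the average net profit table. The function name best_cutoff and the six scored applicants are hypothetical, and ties in predicted probability are not treated specially.

```python
def best_cutoff(scored, gain_good=100, loss_bad=-500):
    """scored: list of (predicted probability of success, actual) pairs,
    where actual is 1 for a good credit risk and 0 for a bad one.

    Returns (maximum cumulative net profit, number of applicants accepted,
    lowest accepted probability). The probability is None when accepting
    nobody is the most profitable choice.
    """
    # 4a: sort on predicted probability of success, best risks first
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best_profit, best_n, running = 0, 0, 0
    for n, (_, actual) in enumerate(ranked, start=1):
        # 4b: net profit of extending credit to this applicant
        running += gain_good if actual == 1 else loss_bad
        # 4c and 4d: track the cumulative net profit and where it peaks
        if running > best_profit:
            best_profit, best_n = running, n
    cutoff = ranked[best_n - 1][0] if best_n else None
    return best_profit, best_n, cutoff


# Hypothetical validation records: (predicted probability, actual class)
scored = [(0.9, 1), (0.8, 1), (0.7, 1), (0.6, 0), (0.5, 1), (0.4, 0)]
profit, accepted, cutoff = best_cutoff(scored)
```

On this toy data the cumulative net profit runs 100, 200, 300, -200, -100, -600, so the peak is reached after three applicants. The probability of the last accepted applicant is one candidate answer to step 4e.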
13.3 Textile Cooperatives

Background

The traditional handloom segment of the textile industry in India employs 10 million people and accounts for 30% of total textile production. It has a distinct, non-industrial character. Its products, particularly saris, are created by artisans who have learned the art of making textiles from their forefathers. These products have very individual characteristics.

Each firm in the industry is a cooperative of families, most of which have been engaged in the production of these textiles for generations. Different families within the cooperatives undertake different production tasks such as spinning fiber, making natural colors (now not much in practice), dyeing, and weaving saris or fabric for other uses.

The handloom sector is protected by state governments, which grant exemptions from various types of taxes, and also award subsidies and developmental grants. Most state governments in India have also set up state-run corporations, which undertake training and marketing activities to support the handloom industry. This family-centered cooperative industry competes with large mills that produce a more uniform product.

Saris, the primary product of the handloom sector, are colorful flowing garments worn by women and made with a single piece of cloth six yards long and one yard wide. A sari has many characteristics; important ones include:

• Body (made up of warp and weft) shade - dark, light, medium and shining; colors; and design
• Border shade, color and design
• Border size - broad, medium or narrow
• Palav shade, color and design (the palav is the part of the sari that is most prominent, showing at the top and front when worn, and typically is the most decorated)
• Sari side - one-sided or double-sided
• Fabric - cotton or polyester (or many combinations of natural and synthetic fiber)

Tens of thousands of combinations are possible, and the decentralized nature of the industry means that there is tremendous variety in what is produced. Hand-made saris have a competitive advantage stemming from the very fact that they are a hand-made, craft product (in contrast to the more uniform garments mass produced by textile mills).

The price of a sari varies from 500 rupees to 10,000 rupees depending upon the fabric and intricacies of design. The average price is about 1500 rupees.

Cooperatives selling hand-made saris also face competitive disadvantages, chief among them:

(a) Greater cost of production due to the use of labor intensive technology and a highly decentralized production process. Per capita throughput in handlooms is lower than at mills, leading to higher costs of production. The decentralization process leads to high costs as a result of supervisory and procurement costs.

(b) Difficulty in coordinating demand with supply. Further, decentralization makes the production system unaware of consumer preferences. This results in stockpiling of finished goods at the sales outlets.

In addition, the weavers are tradition bound and lack the desire to either adopt new designs and products or to change to better techniques of production. The combined effect of all these factors is a mismatch between what is produced by the weavers and what is preferred by the consumers. The government, too, has contributed to the sense of complacency in the handlooms sector. Non-moving stocks are disposed of at government-backed rebates of 20 per cent and so there is no pressure to change the status quo.

In spite of these disadvantages, the potential of the handlooms sector is substantial because of its intrinsic ability to produce a highly diversified product range.

A study was initiated at the Indian Institute of Management, Ahmedabad, India to find ways of improving performance of this sector. The following areas were identified for the study:

(a) Develop suitable models for sales forecasting and monitoring based on consumer preferences
(b) Design methods to improve the flow of information between marketing and production subsystems
(c) Analyze the production sub-system to reduce its response time and achieve better quality of output
(d) Develop an optimal product mix for an organization and, if possible, an optimal product mix for the production centers
(e) Analyze the process of procurement, stocking, and distribution of raw materials to reduce cost and improve the quality of inputs.

Matching Supply and Demand

India is a large country with a great diversity of peoples, religions, languages and festivals (holidays), and offers a great diversity of demand for saris. Different categories of saris - shade, color, design and fabric combinations - are in demand in different parts of the country at different times of year coinciding with the local holiday festivals.

If there were just a few types of saris produced, the problem of adjusting supply to demand would be relatively simple - the varieties that sell well would be produced in greater quantity, and those that do not sell well would be reduced or discontinued. With tens of thousands of varieties being sold, however, this is not possible. A simple review of sales records will not reveal much information that is useful for adjusting production. Data mining might help in two ways:

1. Identifying important sari characteristics (shade, color, design and fabric combinations) that influence sales, and
2. Providing a “black box” that can classify proposed designs into likely sellers and less likely sellers.

Inaccurate forecasting results in three types of costs:

• Inventory costs of storing saris for six months to a year until the next festival time
• Transport and handling costs of moving saris to a different location
• Losses from price discounting to clear inventories.

These costs average about 12% of the cost of production (compared to the normal profit margin of 20%).

For forecasting purposes, a set of sari characteristics was identified and a coding system was developed to represent these characteristics as shown in Table 1.

Table 1: Code List

ID           case number
SILKWT       silk weight
ZARIWT       zari weight
SILKWT_Cat   categorical version of SILKWT
ZARIWT_Cat   categorical version of ZARIWT
BODYCOL      body color, series of binary variables, 1 = body is that color
BRDCOL       border color, series of binary variables, 1 = border is that color
BODYSHD      body shade, 1 = pale, 4 = bright
BRDSHD       border shade, 1 = pale, 4 = bright
SARSIDE      1 or 2 sided sari, 1 = 1-sided, 2 = 2-sided
BODYDES      body design, series of binary variables, 1 = body is that design
BRDDES       border design, series of binary variables, 1 = border is that design
PALDES       pallav design, series of binary variables, 1 = pallav is that design
BRDSZ        border size
PALSZ        pallav size
SALE         1 = sale, 0 = no sale

Note: The colors and designs selected for the binary variables were those that were most common.

Using Data Mining Techniques for Sales Forecasting

An experiment was conducted to see how data mining techniques can be used to develop sales forecasts for different categories of saris. A random sample of 3695 items in a store at a market (a town) was selected for study. Data were collected for saris present in the store at the beginning of a 15 day festival season sale period indicating whether the item was sold or unsold during the sale period (see file Textiles.xls).

Assignment

1. Identify the characteristics of saris that are:
a. Most popular
b. Average interest
c. Slow moving
Establish and state your own criteria for the above definitions.

2. Predict the sale of saris, and the gain to the stores, using several data mining techniques.
13.4 Tayko Software Cataloger

Tayko is a software catalog firm that sells games and educational software. It started out as a software manufacturer, and added third party titles to its offerings. It has recently put together a revised collection of items in a new catalog, which it is preparing to roll out in a mailing.

In addition to its own software titles, Tayko’s customer list is a key asset. In an attempt to grow its customer base, it has recently joined a consortium of catalog firms that specialize in computer and software products. The consortium affords members the opportunity to mail catalogs to names drawn from a pooled list of customers. Members supply their own customer lists to the pool, and can “withdraw” an equivalent number of names each quarter. Members are allowed to do predictive modeling on the records in the pool so they can do a better job of selecting names from the pool.

Tayko has supplied its customer list of 200,000 names to the pool, which totals over 5,000,000 names, so it is now entitled to draw 200,000 names for a mailing. Tayko would like to select the names that have the best chance of performing well, so it conducts a test - it draws 20,000 names from the pool and does a test mailing of the new catalog to them.

This mailing yielded 1065 purchasers - a response rate of 0.053. Average spending was $103 for each of the purchasers, or $5.46 per catalog mailed. To optimize the performance of the data mining techniques, it was decided to work with a stratified sample that contained equal numbers of purchasers and non-purchasers. For ease of presentation, the data set for this case includes just 1000 purchasers and 1000 non-purchasers, an apparent response rate of 0.5. Therefore, after using the data set to predict who will be a purchaser, we must adjust the purchase rate back down by multiplying each case’s “probability of purchase” by 0.053/0.5 or 0.107.

There are two response variables in this case. “Purchase” indicates whether or not a prospect responded to the test mailing and purchased something. “Spending” indicates, for those who made a purchase, how much they spent. The overall procedure in this case will be to develop two models. One will be used to classify records as “purchase” or “no purchase.” The second will be used for those cases that are classified as “purchase,” and will predict the amount they will spend.

The following table provides a description of the variables available in this case. A partition variable is used because we will be developing two different models in this case and want to preserve the same partition structure for assessing each model.

Codelist

Var. #   Variable Name          Description                                                                         Variable Type   Code Description
1.       US                     Is it a US address?                                                                 binary          1: yes 0: no
2 - 16   Source_*               Source catalog for the record (15 possible sources)                                 binary          1: yes 0: no
17.      Freq.                  Number of transactions in last year at source catalog                               numeric
18.      last_update_days_ago   How many days ago was last update to cust. record                                   numeric
19.      1st_update_days_ago    How many days ago was 1st update to cust. record                                    numeric
20.      RFM%                   Recency-frequency-monetary percentile, as reported by source catalog (see CBC case) numeric
21.      Web_order              Customer placed at least 1 order via web                                            binary          1: yes 0: no
22.      Gender=mal             Customer is male                                                                    binary          1: yes 0: no
23.      Address_is_res         Address is a residence                                                              binary          1: yes 0: no
24.      Purchase               Person made purchase in test mailing                                                binary          1: yes 0: no
25.      Spending               Amount spent by customer in test mailing ($)                                        numeric
26.      Partition              Variable indicating which partition the record will be assigned to                  alpha           t: training v: validation s: test

The following figures show the first few rows of data (the first figure shows the sequence number plus the first 14 variables, and the second figure shows the remaining 11 variables for the same rows).

Assignment

(1) Each catalog costs approximately $2 to mail (including printing, postage and mailing costs). Estimate the gross profit that the firm could expect from its remaining 180,000 names if it randomly selected them from the pool.

(2) Develop a model for classifying a customer as a purchaser or non-purchaser
(a) Partition the data on the basis of the partition variable, which has 800 “t’s,” 700 “v’s” and 500 “s’s” (standing for training data, validation data and test data, respectively) randomly assigned to cases.
(b) Using the “best subset” option in logistic regression, implement the full logistic regression model, select the best subset of variables, then implement a regression model with just those variables to classify the data into purchasers and non-purchasers. (Logistic regression is used because it yields an estimated “probability of purchase,” which is required later in the analysis.)

(3) Develop a model for predicting spending among the purchasers
(a) Make a copy of the data sheet (call it data2), sort by the “Purchase” variable, and remove the records where Purchase = “0” (the resulting spreadsheet will contain only purchasers).
(b) Partition this data set into training and validation partitions on the basis of the partition variable.
(c) Develop models for predicting spending using
(i) Multiple linear regression (use best subset selection)
(ii) Regression trees
(d) Choose one model on the basis of its performance with the validation data. MLR, with a lift of about 2.7 in the first decile, was chosen.

(4) Return to the original test data partition. Note that this test data partition includes both purchasers and non-purchasers. Note also that, although it contains the scoring of the chosen classification model, we have not used this partition in any of our analysis to this point; thus it will give an unbiased estimate of the performance of our models. It is best to make a copy of the test data portion of this sheet to work with, since we will be adding analysis to it. This copy is called Score Analysis.
(a) Copy the “predicted probability of success” (success = purchase) column from the classification of test data to this sheet.
(b) Score the chosen prediction model to this data sheet.
(c) Arrange the following columns so they are adjacent:
(i) Predicted probability of purchase (success)
(ii) Actual spending $
(iii) Predicted spending $
(d) Add a column for “adjusted prob. of purchase” by multiplying “predicted prob. of purchase” by 0.107. This is to adjust for over-sampling the purchasers (see above).
(e) Add a column for expected spending [adjusted prob. of purchase * predicted spending]
(f) Sort all records on the “expected spending” column
(g) Calculate cumulative lift (= cumulative “actual spending” divided by the average spending that would result from random selection [each adjusted by the .107])

(5) Using this cumulative lift curve, estimate the gross profit that would result from mailing to the 180,000 on the basis of your data mining models.

Note: Tayko is a hypothetical company name, but the data in this case were supplied by a real company that sells software through direct sales, and have been modified slightly for illustrative purposes. While this firm did not participate in a catalog consortium, the concept is based upon the Abacus Catalog Alliance. Details can be found at http://www.doubleclick.com/us/solutions/marketers/database/catalog/.
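Steps (4)(d) through (4)(g) can also be sketched outside a spreadsheet. The Python below is a rough illustration under stated assumptions, not the procedure XLMiner follows: the names ADJUST, expected_spending and cumulative_lift and the four sample records are hypothetical. Lift is read here as cumulative actual spending in the top n records divided by n times the average actual spending over all records. The over-sampling adjustment scales both quantities equally, so it is left out of the ratio.

```python
# Over-sampling adjustment from the case text: 0.053 / 0.5.
# Computed directly this is about 0.106; the case text rounds it to 0.107.
ADJUST = 0.053 / 0.5


def expected_spending(p_purchase, predicted_spend, adjust=ADJUST):
    """Step (4)(e): adjusted probability of purchase times predicted spending."""
    return p_purchase * adjust * predicted_spend


def cumulative_lift(records):
    """records: list of (predicted prob. of purchase, predicted spending, actual spending).

    Sorts on expected spending (step f) and returns one lift value per record
    (step g), where each value compares the top n records with random selection.
    """
    ranked = sorted(
        records,
        key=lambda rec: expected_spending(rec[0], rec[1]),
        reverse=True,
    )
    mean_actual = sum(rec[2] for rec in ranked) / len(ranked)
    if mean_actual == 0:
        raise ValueError("no actual spending in the data, lift is undefined")
    lifts, running = [], 0.0
    for n, (_, _, actual) in enumerate(ranked, start=1):
        running += actual
        lifts.append(running / (n * mean_actual))
    return lifts


# Hypothetical test records: (predicted prob., predicted spending $, actual spending $)
records = [(0.9, 100, 120), (0.2, 50, 0), (0.6, 80, 60), (0.1, 30, 0)]
lifts = cumulative_lift(records)
```

The last lift value is always 1, because selecting every record is the same as random selection of all of them.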
IMRB has both transaction data (each row is a transaction) and household data (each row is a household). IMRB constituted about 50. and brand loyalty). and have been modified slightly for illustrative purposes. The strata are defined on the basis of socio-economic status.13. however. and 2.5 IMRB : Segmenting Consumers of Bath Soap 167 Note : Tayko is a hypothetical company name. the concept is based upon the Abacus Catalog Alliance.5 IMRB : Segmenting Consumers of Bath Soap Business Situation The Indian Market Research Bureau (IMRB) is a leading market research agency that specializes in tracking consumer purchase behavior in consumer goods (both durable and non-durable). but the data in this case were supplied by a real company that sells software through direct sales. While this firm did not participate in a catalog consortium. To track purchase behavior.) and. Purchase behavior (volume. Details can be found at http://www. and the market (a collection of cities). (2) Consumer goods manufacturers who monitor their market share using the IMRB database. selling proposition) Doing so would allow IMRB to gain information about what demographic attributes are associated with different purchase behaviors and degrees of brand loyalty. and more effectively deploy promotion budgets.doubleclick. IMRB has two categories of clients: (1) Advertising agencies who subscribe to the database services..com/us/solutions/marketers/database/catalog/. etc. (In addition to this. about 60 . washing machine. IMRB tracks about 30 product categories (e. Thus. 1: Available 2: Not Available Affluence Index Education of homemaker (1=minimum.3 categories MT SEX Food eating habits (1=vegetarian.4 categories Presence of children in the household CS 1-2 Television available. 0=not specified) Native language (see table in worksheet) 1: male. Member Identification Demographics Member id SEC 1 . 
This would result in a more cost-effective allocation of the promotion budget to different market-segments. 5=low) Sex of homemaker Age of homemaker EDU 1 .168 13. each targeted at different market segments at different times of a year. 2: Female AGE Demographics Unique identifier for each household Socio Economic Class (1=high.. 3=non veg. multiple promotions could be launched. 2=veg.5 categories FEH 1 .each row contains the data for one household.9 categories HS 1-9 CHILD 1. 9 = maximum) Number of members in the household Weighted value of durables possessed . nearly done The better and more effective market segmentation would enable IMRB’s clients to design more cost-effective promotions targeted at appropriate segments. Cases. It would also IMRB to design more effective customer reward systems and thereby increase brand loyalty. Data The data in this sheet profile each household . but eat eggs. volume per transaction Avg. transactions per brand run Vol/Tran Avg.5 IMRB : Segmenting Consumers of Bath Soap 169 Summarized Purchase Data Purchase summary of the house hold over the period Purchase within Promotion No.13. price of purchase Pur Vol No Promo . Multiple brands purchased in a month are counted as separate transactions Value Sum of value Trans / Brand Runs Avg. Price Avg. of Trans Number of purchase transactions.% Percent of volume purchased under no-promotion Pur Vol Promo 6 % Percent of volume purchased under Promotion Code 6 Pur Vol Other Promo % Percent of volume purchased under other promotions . of Brands Number of brands purchased Brand Runs Number of instances of consecutive purchase of brands Total Volume Sum of volume No. Assignments 1. 55.) . The number of different brands purchased by the customer is one measure. 481. However. 144). a consumer who purchases one or two brands in quick succession then settles on a third for a long streak is different from a consumer who constantly switches back and forth among three brands. 
how often customers switch from one brand to another is another measure of loyalty. 286. nearly done Brand wise purchase Price category wise purchase Br. 24. Cd. c.170 13. Cases. 272. It is likely that the marketing efforts would support 2-5 different promotional approaches. brand loyalty and basis-for-purchase) of these clusters. Select what you think is the best segmentation and comment on the characteristics (demographic. All three of these components can be measured with the data in the purchase summary worksheet.a consumer who spends 90% of his or her purchase money on one brand is more loyal than a consumer who spends more equally among several brands. Use k-means clustering to identify clusters of households based on a. 5 and 999 (others) Price Cat 1 to 4 Selling proposition wise purchase Proposition Cat 5 to 15 Percent of volume purchased of the brand Per cent of volume purchased under the price category Percent of volume purchased under the product proposition category Measuring Brand Loyalty Several variables in this case deal measure aspects of brand loyalty. 352. (57. 2. (This information would be used to guide the development of advertising and promotional campaigns. b. The variables that describe purchase behavior (including brand loyalty). The variables that describe basis-for-purchase. The variables that describe both purchase behavior and basis of purchase. Note 1: How should k be chosen? Think about how the clusters would be used. Note 2: How should the percentages of total purchases comprised by various brands be treated? Isn’t a customer who buys all brand A just as loyal as a customer who buys all brand B? What will be the effect on any distance measure of using the brand share variables as is? Consider using a single derived variable. Yet a third perspective on the same issue is the proportion of purchases that go to different brands . So. Develop a model that classifies the data into these segments. 
Multiple rows in this data set corresponding to a single household were consolidated into a single household row in IMRD− Summary− Data. a “0” indicates it is not possessed.5 IMRB : Segmenting Consumers of Bath Soap 171 3. For example. Each row is a household. The Durables sheet in IMRB− Summary−Data contains information used to calculate the affluence index. .13. where each row is a transaction. This value is multiplied by the weight assigned to the durable item. IMRB− Purchase− Data is a transaction database. and each column represents a durable consumer good. two additional data sets are provided that were used in the derivation of the summary data. Since this information would most likely be used in targeting direct mail promotions. APPENDIX Although they are not used in the assignment. The sum of all the weighted values of the durables possessed equals the Affluence Index. a “5” indicates the weighted value of possessing the durable. it would be useful to select a market segment that would be defined as a ”success” in the classification model. A “1” in the column indicates that the durable is possessed by the household. 27 Bayes’ risk. 106 dimensionality. 133 k-Nearest Neighbor algorithm. 127 complete linkage. 4 k-means clustering. 106 disjoint. 59 data marts. 4 feature extraction. 68 estimation. 111 attribute. 47 CART. 8 data warehouse. 43. 106 Bayes’ formula. 4 antecedent. 101 Euclidian distance. 47 factor selection. 7. 4 association rules. 25 cluster analysis. 40 input variable. 111 Apriori algorithm. 79 dendrogram. 106 field. 63 affinity analysis. 106 farthest neighbor clustering. 12 CHAID.Index 2 in regression. 4 dimension reduction. 80 classification. 69 Bayes Theorem. 107 bias. 4 average linkage. 114 artificial intelligence. 131 confidence. 131 dependent variable. 63 bias-variance trade-off. 131 effective number of parameters. 4 Forward selection. 104 epoch. 73 case. 111 algorithm. 7. 128 backward elimination. 131 holdout data. 16 homoskedasticity. 
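The affluence index calculation described in the appendix (possession indicator times item weight, summed over durables) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the durable names, the weights and the two example households are hypothetical, and the actual weights are the ones recorded in the Durables sheet.

```python
# Hypothetical weights per durable; the real values come from the Durables sheet.
WEIGHTS = {"car": 5, "washing_machine": 3, "television": 2}


def affluence_index(possession, weights=WEIGHTS):
    """Sum of (weight * indicator) over all durables.

    `possession` maps a durable name to 1 (possessed) or 0 (not possessed);
    a durable missing from the mapping is treated as not possessed.
    """
    return sum(weights[item] * possession.get(item, 0) for item in weights)


# Two made-up households, one row each, as in the Durables sheet layout.
households = {
    "H1": {"car": 1, "washing_machine": 0, "television": 1},
    "H2": {"car": 0, "washing_machine": 0, "television": 0},
}

scores = {hid: affluence_index(row) for hid, row in households.items()}
print(scores)
```

In a spreadsheet the same result would come from multiplying each 0/1 column by its weight and summing across the row, which is what the weighted values in the Durables sheet represent.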