Discovering Knowledge in Data
An Introduction to Data Mining
By Daniel T. Larose, Ph.D.
Director, Data Mining @CCSU
What is data mining?
Data mining is predicted to be “one of the most revolutionary developments of the next decade”, according to the online technology magazine ZDNET News (February 8, 2001). In fact, the MIT Technology Review chose data mining as one of ten emerging technologies that will change the world.
What is data mining? According to the Gartner Group,
“Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”
Because data mining represents such an important field, Wiley-Interscience and Dr. Daniel T. Larose have teamed up to publish a new series on data mining, initially consisting of three volumes. The first volume in this series, Discovering Knowledge in Data: An Introduction to Data Mining, introduces the reader to this rapidly growing field.
Why is this book needed?
Humans are inundated with data in most fields. Unfortunately, this valuable data, which costs firms millions to collect and collate, is languishing in warehouses and repositories. The problem is that there are not enough trained human analysts available who are skilled at translating all of this data into knowledge, and thence up the taxonomy tree into wisdom. This is why this book is needed.
Discovering Knowledge in Data: An Introduction to Data Mining provides readers with:
Data mining is becoming more widespread every day, because it empowers companies to uncover profitable patterns and trends from their existing databases. Companies and institutions have spent millions of dollars to collect megabytes and terabytes of data, but are not taking advantage of the valuable and actionable information hidden deep within their data repositories. As the practice of data mining becomes more widespread, however, companies that do not apply these techniques are in danger of falling behind and losing market share, because their competitors are applying data mining and thereby gaining the competitive edge.
In Discovering Knowledge in Data, the step-by-step, hands-on solutions to real-world business problems, using widely available data mining techniques applied to real-world data sets, will appeal to managers, CIOs, CEOs, CFOs, and others who need to keep abreast of the latest methods for enhancing ROI.
Danger! Data mining is easy to do badly.
The plethora of new off-the-shelf software platforms for performing data mining has kindled a new kind of danger. The ease with which these GUI-based applications can manipulate data, combined with the power of the formidable data mining algorithms embedded in the black-box software currently available, makes their misuse proportionally more hazardous.
Just as with any new information technology, data mining is easy to do badly. A little knowledge is especially dangerous when it comes to applying powerful models based on large data sets. For example, analyses carried out on unpreprocessed data can lead to erroneous conclusions, or inappropriate analysis may be applied to data sets that call for a completely different approach, or models may be derived that are built upon wholly specious assumptions. If models built on such errors are deployed, they can lead to very expensive failures.
“White Box” Approach:
Understanding the Underlying Algorithmic and Model Structures.
The best way to avoid these costly errors, which stem from a blind black-box approach to data mining, is to instead apply a “white-box” methodology, which emphasizes an understanding of the algorithmic and statistical model structures underlying the software.
Discovering Knowledge in Data applies this white-box approach by:
· Walking the reader through the various algorithms,
· Providing examples of the operation of the algorithm on actual large data sets,
· Testing the reader’s level of understanding of the concepts and algorithms, and
· Providing an opportunity for the reader to do some real data mining on large data sets.
Discovering Knowledge in Data walks the reader through the operations and nuances of the various algorithms, using small sample data sets, so that the reader gets a true appreciation of what is really going on inside the algorithm. For example, in Chapter 8, Hierarchical and K-Means Clustering, we see the cluster centers being updated, moving toward the center of their respective clusters. Also, in Chapter 9, Kohonen Networks, we see just which kind of network weights will result in a particular network node “winning” a particular record.
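To give a flavor of such a walk-through, the following is a minimal sketch of the k-means center update on a tiny data set. The six points and the two starting centers are made up for illustration and are not taken from the book; the helper names closest and update_centers are likewise hypothetical.

```python
# Toy k-means on six 2-D points (made-up data, not from the book).

def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    dists = [(point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2 for c in centers]
    return dists.index(min(dists))

def update_centers(points, centers):
    """One k-means pass: assign each point to its nearest center, then move
    each center to the mean (centroid) of the points assigned to it."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p in points if closest(p, centers) == i]
        if not members:              # an empty cluster keeps its old center
            new_centers.append(old)
            continue
        mean_x = sum(p[0] for p in members) / len(members)
        mean_y = sum(p[1] for p in members) / len(members)
        new_centers.append((mean_x, mean_y))
    return new_centers

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers = [(0.0, 0.0), (5.0, 5.0)]   # arbitrary starting centers (assumption)

history = [centers]                  # centers after each pass, for inspection
while True:
    centers = update_centers(points, centers)
    if centers == history[-1]:       # no center moved: the algorithm has converged
        break
    history.append(centers)

for step, c in enumerate(history):
    print(step, c)
```

Printing the history shows each center jumping from its starting position to the middle of the three points nearest to it, after which a further pass changes nothing.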
Applications of the algorithms to large data sets.
Discovering Knowledge in Data provides examples of the application of the various algorithms on actual large data sets. For example, in Chapter 7, Neural Networks, a classification problem is attacked using a neural network model on a real-world data set. The resulting neural network topology is examined, along with the network connection weights, as reported by the software. These data sets are included on the data disk, so that the reader may follow the analytical steps on their own, using data mining software of their choice.
Chapter exercises: Checking to make sure you understand it.
Discovering Knowledge in Data includes over 90 chapter exercises, which allow readers to assess their depth of understanding of the material, as well as have a little fun playing with numbers and data. These include conceptual exercises, which help to clarify some of the more challenging concepts in data mining, and “Tiny data set” exercises, which challenge the reader to apply the particular data mining algorithm to a small data set, and, step-by-step, to arrive at a computationally sound solution. For example, in Chapter 6, Decision Trees, readers are provided with a small data set and asked to construct – by hand, using the methods shown in the chapter – a C4.5 decision tree model, as well as a classification and regression tree model, and to compare the benefits and drawbacks of each.
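As a rough illustration of the kind of hand calculation a “Tiny data set” exercise involves, the sketch below scores one candidate split in two ways: by entropy reduction (information gain), the idea underlying C4.5, and by reduction in Gini impurity, a measure commonly associated with classification and regression trees. The class labels and the split are invented for the example, and the chapter's own splitting criteria may differ in detail from these two measures.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted(measure, branches):
    """Average of measure over the branches, weighted by branch size."""
    total = sum(len(b) for b in branches)
    return sum(len(b) / total * measure(b) for b in branches)

# Hypothetical node with 4 "good" and 4 "bad" records, and one candidate
# binary split of those records into a left and a right branch.
parent = ["good"] * 4 + ["bad"] * 4
left = ["good", "good", "good", "bad"]
right = ["good", "bad", "bad", "bad"]

info_gain = entropy(parent) - weighted(entropy, [left, right])
gini_reduction = gini(parent) - weighted(gini, [left, right])

print(f"information gain: {info_gain:.4f}")
print(f"Gini reduction:   {gini_reduction:.4f}")
```

Repeating the calculation for every candidate split and keeping the one with the largest reduction is, in outline, how a tree chooses where to branch.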
Hands-On Analysis: Learn data mining by doing data mining.
Chapters 2 – 4 and 6 – 11 provide the reader with hands-on analysis problems, representing an opportunity for the reader to apply his or her newly acquired data mining expertise to solving real problems using large data sets. Many people learn by doing. Discovering Knowledge in Data provides a framework where the reader can learn data mining by doing data mining.
The intention is to mirror the real-world data mining scenario. In the real world, dirty data sets need cleaning; raw data needs to be normalized; outliers need to be checked. So it is here with Discovering Knowledge in Data, where over 70 hands-on analysis problems are provided. In this way, the reader can “ramp up” quickly, and be “up and running” with his or her own data mining analyses in relatively short order.
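Two of the chores just mentioned, normalizing raw data and checking for outliers, can be sketched in a few lines. The income figures are invented, and the choice of min-max scaling and the 1.5 × IQR rule is an assumption made for this illustration; other normalization and outlier-detection methods are equally reasonable.

```python
from statistics import quantiles

def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def iqr_outliers(values):
    """Values lying more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles (Python 3.8+)
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]

# Hypothetical incomes in thousands of dollars; 400 is a suspicious entry.
incomes = [30, 32, 35, 38, 40, 41, 45, 47, 50, 400]

scaled = min_max(incomes)
outliers = iqr_outliers(incomes)

print([round(s, 3) for s in scaled])
print(outliers)
```

The scaled values also show why the order of operations matters: with the extreme value left in, every other income is squeezed into a narrow band near zero.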
For example, in Chapter 10, Association Rules, readers are challenged to uncover high-confidence, high-support rules for predicting which customers will leave the company’s service. In Chapter 11, Model Evaluation Techniques, readers are asked to produce lift charts and gains charts for a set of classification models using a large data set, so that the best model may be identified.
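For readers new to the terms, the support and confidence of a rule can be computed directly from their definitions. The sketch below does so for one hypothetical rule, “international plan implies churn”; the eight customer records and the attribute names are made up and do not come from the book's data sets.

```python
def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    return sum(itemset <= r for r in records) / len(records)

def confidence(antecedent, consequent, records):
    """Of the records containing the antecedent, the fraction that also
    contain the consequent."""
    return support(antecedent | consequent, records) / support(antecedent, records)

# Eight hypothetical customer records, each a set of attributes that apply.
records = [
    {"intl_plan", "churn"},
    {"intl_plan", "churn"},
    {"intl_plan", "voicemail"},
    {"voicemail"},
    {"voicemail"},
    {"intl_plan", "churn", "voicemail"},
    {"churn"},
    {"voicemail"},
]

rule_support = support({"intl_plan", "churn"}, records)
rule_confidence = confidence({"intl_plan"}, {"churn"}, records)
baseline_churn = support({"churn"}, records)

print(f"support:        {rule_support:.3f}")
print(f"confidence:     {rule_confidence:.3f}")
print(f"baseline churn: {baseline_churn:.3f}")
```

Comparing the rule's confidence with the baseline churn rate indicates whether the antecedent tells us anything beyond what is already known about customers in general.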
Data mining as a process.
One of the fallacies associated with data mining implementation is that data mining somehow represents an isolated set of tools, to be applied by some aloof analysis department, and only inconsequentially related to the mainstream business or research endeavor. Organizations that attempt to implement data mining in this way will see their chances of success much reduced. This is because data mining should be viewed as a process.
Discovering Knowledge in Data presents data mining as a well-structured standard process, intimately connected with managers, decision makers, and those involved in deploying the results. Thus, this book is not only for analysts, but for managers as well, who will need to be able to communicate in the language of data mining.
The particular standard process used is the CRISP-DM framework: the Cross-Industry Standard Process for Data Mining. CRISP-DM demands that data mining be seen as an entire process, from communication of the business problem, through data collection and management, data preprocessing, model building, model evaluation, and, finally, model deployment. Therefore, this book is not only for analysts and managers, but also for data management professionals, database analysts, and decision makers.
Graphical approach, emphasizing exploratory data analysis.
Discovering Knowledge in Data emphasizes a graphical approach to data analysis. There are more than 80 screen shots of actual computer output throughout the text, and over 30 other figures. Exploratory data analysis (EDA) represents an interesting and fun way to “feel your way” through large data sets. Using graphical and numerical summaries, the analyst gradually sheds light on the complex relationships hidden within the data. Discovering Knowledge in Data emphasizes an EDA approach to data mining, which goes hand-in-hand with the overall graphical approach.
How the book is structured.
Discovering Knowledge in Data: An Introduction to Data Mining provides a comprehensive introduction to the field. Case studies are provided, showing how data mining has been applied successfully (and not so successfully). Common myths about data mining are debunked, and common pitfalls are flagged, so that new data miners do not have to learn these lessons themselves.
The first three chapters introduce and follow the CRISP-DM standard process, especially the data preparation phase and data understanding phase. The next seven chapters represent the heart of the book, and are associated with the CRISP-DM modeling phase. Each chapter presents data mining methods and techniques for a specific data mining task.
· Chapters 5, 6, and 7 relate to the classification task, examining k-nearest neighbor (Chapter 5), decision trees (Chapter 6), and neural network (Chapter 7) algorithms.
· Chapters 8 and 9 investigate the clustering task, with hierarchical and k-means clustering (Chapter 8) and Kohonen network (Chapter 9) algorithms.
· Chapter 10 handles the association task, examining association rules through the a priori and GRI algorithms.
· Finally, Chapter 11 considers model evaluation techniques, which belong to the CRISP-DM evaluation phase.
Discovering Knowledge in Data as a textbook.
Discovering Knowledge in Data: An Introduction to Data Mining naturally fits the role of textbook for an introductory course in data mining. Instructors may appreciate:
· The presentation of data mining as a process,
· The “White box” approach, emphasizing an understanding of the underlying algorithmic structures,
o Algorithm walk-throughs,
o Application of the algorithms to large data sets,
o Chapter exercises, and
o Hands-on analysis,
· The graphical approach, emphasizing exploratory data analysis, and
· The logical presentation, flowing naturally from the CRISP-DM standard process and the set of data mining tasks.
Discovering Knowledge in Data is appropriate for advanced undergraduate or graduate-level courses. Except for one section in the neural networks chapter, no calculus is required. An introductory statistics course would be nice, but is not required. No computer programming or database expertise is required.
Discovering Knowledge in Data would have remained unwritten without the assistance of Val Moliere, editor, and Kirsten Rohsted, editorial program coordinator, at Wiley-Interscience. Thank you for your guidance and perseverance.
I wish to thank Dr. Chun Jin and Dr. Daniel S. Miller, my colleagues in the Master of Science in Data Mining program at Central Connecticut State University; Dr. Timothy Craine, Chair of the Department of Mathematical Sciences; Dr. Dipak K. Dey, Chair of the Department of Statistics at the University of Connecticut; and Dr. John Judge, Chair of the Department of Mathematics at Westfield State College. Your support was (and is) invaluable.
Thanks to my children Chantal, Tristan, and Ravel for sharing the computer with me. Finally, I would like to thank my wonderful wife, Debra J. Larose, for her patience, understanding, and proofreading skills. But words cannot express …
Daniel T. Larose, Ph.D.
Director, Data Mining @CCSU