Discovering Knowledge in Data: An Introduction to Data Mining
By Daniel T. Larose, Ph.D.
Director, Data Mining @CCSU
Preface
What is data mining?
Data mining is predicted to be “one of the most revolutionary developments
of the next decade”, according to the online technology magazine ZDNET News (February
8, 2001). In fact, the MIT Technology
Review chose data mining as one of ten emerging technologies that will
change the world.
According to the Gartner Group,
“Data
mining is the process of discovering meaningful new correlations, patterns and
trends by sifting through large amounts of data stored in repositories, using
pattern recognition technologies as well as statistical and mathematical
techniques.”
Because data mining represents such an important field, Wiley Interscience
and Dr. Daniel T. Larose have teamed up to publish a new series on data mining,
initially consisting of three volumes.
The first volume in this series, Discovering
Knowledge in Data: An Introduction to Data Mining, introduces
the reader to this rapidly growing field.
Why is this book needed?
Humans are inundated with data in most fields. Unfortunately, these valuable data, which cost
firms millions to collect and collate, are languishing in warehouses and
repositories. The problem is that there are not enough trained human analysts
available who are skilled at translating all of these data into knowledge,
and thence up the taxonomy tree into wisdom.
This is why this book is needed.
Discovering Knowledge
in Data: An Introduction to Data Mining provides
readers with:
Data mining is becoming more widespread every day, because
it empowers companies to uncover profitable patterns and trends from their
existing databases. Companies and
institutions have spent millions of dollars to collect megabytes and terabytes
of data, but are not taking advantage of the valuable and actionable
information hidden deep within their data repositories. However, as the practice of data mining
becomes more widespread, companies that do not apply these techniques are in
danger of falling behind and losing market share, because their competitors
are applying data mining and thereby gaining the competitive edge.
In Discovering
Knowledge in Data, the step-by-step, hands-on solutions of real-world business
problems, using widely available data mining techniques applied to real-world
data sets, will appeal to managers, CIOs, CEOs, CFOs,
and others who need to keep abreast of the latest methods for enhancing ROI.
Danger! Data mining is easy to do badly.
The plethora of new off-the-shelf software platforms for
performing data mining has kindled a new kind of danger. The ease with which these GUI-based
applications can manipulate data, combined with the power of the formidable
data mining algorithms embedded in the black box software currently available,
makes their misuse proportionally more hazardous.
Just as with any new information technology, data mining is easy to do badly. A little knowledge is especially dangerous
when it comes to applying powerful models based on large data sets. For example, analyses carried out on unpreprocessed data can lead to erroneous conclusions, or
inappropriate analysis may be applied to data sets that call for a completely
different approach, or models may be derived that are built upon wholly
specious assumptions. If the resulting models are deployed, these errors in
analysis can lead to very expensive failures.
“White Box” Approach:
Understanding the Underlying Algorithmic and Model Structures.
The best way to avoid these costly errors, which stem from a
blind black-box approach to data mining, is to instead apply a “white-box”
methodology, which emphasizes an understanding of the algorithmic and
statistical model structures underlying the software.
Discovering Knowledge in Data applies this white-box approach by:
· Walking the reader through the various algorithms,
· Providing examples of the operation of the algorithm on actual large data sets,
· Testing the reader’s level of understanding of the concepts and algorithms, and
· Providing an opportunity for the reader to do some real data mining on large data sets.
Algorithm Walk-Throughs
Discovering Knowledge
in Data walks the reader through the operations and nuances of the various
algorithms, using small sample data sets, so that the reader gets a true
appreciation of what is really going on inside the algorithm. For example, in Chapter 8, Hierarchical and K-Means Clustering, we
see the cluster centers being updated, moving toward the center of
their respective clusters. Also, in
Chapter 9, Kohonen Networks, we see just which kind of
network weights will result in a particular network node “winning” a particular
record.
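As a rough illustration of the kind of walk-through described above, the following sketch runs k-means on a tiny one-dimensional data set and prints the cluster centers after each pass, so that their movement can be followed by eye. The data values, the starting centers, and the function name kmeans_1d are invented for illustration and are not taken from the book.

```python
def kmeans_1d(points, centers, max_iter=10):
    """Run k-means on 1-D data, returning the centers after each pass."""
    history = [list(centers)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # no center moved, so the algorithm has converged
        centers = new_centers
        history.append(list(centers))
    return history


# Hypothetical data: two loose groups, with deliberately poor starting centers.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
history = kmeans_1d(points, centers=[1.0, 2.0])
for step, c in enumerate(history):
    print(step, c)
```

With these made-up values, the second center is pulled away from the first group within two passes, and each center settles on the mean of its own group.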
Applications of the algorithms to large data sets.
Discovering Knowledge
in Data provides examples of the application of the various algorithms on
actual large data sets. For example, in
Chapter 7, Neural Networks, a
classification problem is attacked using a neural network model on a real-world
data set. The resulting neural network
topology is examined, along with the network connection weights, as reported by
the software. These data sets are
included on the data disk, so that the reader may follow the analytical steps
on their own, using data mining software of their choice.
Chapter exercises:
Checking to make sure you understand it.
Discovering Knowledge
in Data includes over 90 chapter exercises, which allow readers to assess
their depth of understanding of the material, as well as have a little fun
playing with numbers and data. These
include conceptual exercises, which help to clarify some of the more
challenging concepts in data mining, and “Tiny data set” exercises, which
challenge the reader to apply the particular data mining algorithm to a small
data set, and, step-by-step, to arrive at a computationally sound
solution. For example, in Chapter 6, Decision Trees, readers are provided with a small data set and asked to construct
– by hand, using the methods shown in the chapter – a C4.5
decision tree model, as well as a classification
and regression tree model, and to compare the benefits and drawbacks of
each.
Hands-On Analysis: Learn data mining by doing data mining.
Chapters 2 – 4 and 6 – 11 provide the reader with hands-on analysis problems, representing an opportunity for the reader to apply his
or her newly acquired data mining expertise to solving real problems using
large data sets. Many people learn by
doing. Discovering Knowledge in Data provides a framework where the reader can learn data mining by
doing data mining.
The intention is to mirror the real-world data mining
scenario. In the real world, dirty data
sets need cleaning; raw data needs to be normalized; outliers need to be
checked. So it is here with Discovering Knowledge in Data, where
over 70 hands-on analysis problems are provided. In this way, the reader can “ramp up”
quickly, and be “up and running” with his or her own data mining analyses relatively
soon.
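The preparation steps named above (cleaning, normalizing, checking for outliers) might look something like the following minimal sketch. The sample values, the choice of min-max normalization, and the z-score cutoff of 2.0 are all assumptions made for illustration; the book's own chapters may use different methods and thresholds.

```python
from statistics import mean, stdev


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def flag_outliers(values, threshold=2.0):
    """Return values whose z-score exceeds the threshold in absolute value."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return []
    return [v for v in values if abs(v - m) / s > threshold]


# Hypothetical raw field containing one missing entry and one suspicious value.
raw = [23, 25, None, 24, 26, 22, 25, 24, 95]
clean = [v for v in raw if v is not None]  # cleaning: drop missing entries
outliers = flag_outliers(clean)            # outlier check (assumed cutoff of 2.0)
normalized = min_max_normalize(clean)      # normalization to [0, 1]
print(outliers)
print(normalized)
```

A flagged value is only a candidate for review; whether it is an error or a genuine extreme case is a judgment the analyst still has to make.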
For example, in Chapter 10, Association Rules, readers are challenged to uncover
high-confidence, high-support rules for predicting which customer will be
leaving the company’s service. In
Chapter 11, Model Evaluation Techniques,
readers are asked to produce lift charts and gains charts for a set of
classification models using a large data set, so that the best model may be
identified.
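To make "high-confidence, high-support" concrete, here is a small sketch that computes the two measures for a single candidate rule, using their usual definitions: support is the fraction of all records containing both the antecedent and the consequent, and confidence is the fraction of records containing the antecedent that also contain the consequent. The customer records and attribute names below are hypothetical and do not come from the book's data sets.

```python
def support_and_confidence(records, antecedent, consequent):
    """Compute support and confidence for the rule antecedent -> consequent.

    Each record is a set of items; antecedent and consequent are sets too.
    """
    n = len(records)
    if n == 0:
        return 0.0, 0.0
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / n
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical customer records; the attribute names are invented.
records = [
    {"intl_plan", "many_service_calls", "churn"},
    {"intl_plan", "churn"},
    {"intl_plan"},
    {"many_service_calls", "churn"},
    {"voicemail_plan"},
    {"voicemail_plan", "intl_plan", "churn"},
    {"voicemail_plan"},
    {"many_service_calls"},
]
support, confidence = support_and_confidence(records, {"intl_plan"}, {"churn"})
print(f"support={support:.3f}, confidence={confidence:.2f}")
```

In this toy example, three of the eight records contain both items, and three of the four records with the antecedent also contain the consequent.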
Data mining as a process.
One of the fallacies associated with data mining
implementation is that data mining somehow represents an isolated set of tools,
to be applied by some aloof analysis department, and only inconsequentially
related to the mainstream business or research endeavor. Organizations that attempt to implement data
mining in this way will see their chances of success much reduced. This is because data mining should be viewed as
a process.
Discovering Knowledge
in Data presents data mining as a well-structured standard process, intimately connected with managers, decision
makers, and those involved in deploying the results. Thus, this book is not only for analysts, but
for managers as well, who will need to be able to communicate in the language
of data mining.
The particular standard process used is the CRISP-DM framework: the Cross-Industry Standard Process for Data
Mining. CRISP-DM demands that data mining be seen as an entire process,
from communication of the business problem, through data collection and
management, data preprocessing, model building, model evaluation, and, finally,
model deployment. Therefore, this book
is not only for analysts and managers, but also for data management
professionals, database analysts, and decision makers.
Graphical approach, emphasizing exploratory data analysis.
Discovering Knowledge
in Data emphasizes a graphical approach to data analysis. There are more than 80 screen shots of actual computer
output throughout the text, and over 30 other figures. Exploratory data analysis (EDA) represents an
interesting and fun way to “feel your way” through large data sets. Using graphical and numerical summaries, the
analyst gradually sheds light on the complex relationships hidden within the data. Discovering
Knowledge in Data emphasizes an EDA approach to data mining, which goes
hand-in-hand with the overall graphical approach.
How the book is structured.
Discovering Knowledge
in Data: An Introduction to Data Mining provides a
comprehensive introduction to the field.
Case studies are provided, showing how data mining has been applied
successfully (and not so successfully).
Common myths about data mining are debunked, and common pitfalls are
flagged, so that new data miners do not have to learn these lessons themselves.
The first three chapters introduce and follow the CRISP-DM standard process, especially
the data preparation phase and data understanding phase. The next seven chapters represent the heart
of the book, and are associated with the CRISP-DM
modeling phase. Each chapter presents
data mining methods and techniques for a specific data mining task.
· Chapters 5, 6, and 7 relate to the classification task, examining k-nearest neighbor (Chapter 5), decision trees (Chapter 6), and neural network (Chapter 7) algorithms.
· Chapters 8 and 9 investigate the clustering task, with hierarchical and k-means clustering (Chapter 8) and Kohonen networks (Chapter 9) algorithms.
· Chapter 10 handles the association task, examining association rules through the a priori and GRI algorithms.
· Finally, Chapter 11 considers model evaluation techniques, which belong to the CRISP-DM evaluation phase.
Discovering Knowledge in Data as a textbook.
Discovering Knowledge
in Data: An Introduction to Data Mining naturally
fits the role of textbook for an introductory course in data mining. Instructors may appreciate:
· The presentation of data mining as a process,
· The “White box” approach, emphasizing an understanding of the underlying algorithmic structures,
    o Algorithm walk-throughs,
    o Application of the algorithms to large data sets,
    o Chapter exercises, and
    o Hands-on analysis,
· The graphical approach, emphasizing exploratory data analysis, and
· The logical presentation, flowing naturally from the CRISP-DM standard process and the set of data mining tasks.
Discovering Knowledge in Data is appropriate for advanced undergraduate or graduate-level courses. Except for one section in the neural networks chapter, no calculus is required. An introductory statistics course would be nice, but is not required. No computer programming or database expertise is required.
Acknowledgements
Discovering Knowledge in Data would have remained unwritten without the assistance of Val Moliere, editor, and Kirsten Rohsted, editorial program coordinator, at Wiley Interscience. Thank you for your guidance and perseverance.
I wish to thank Dr. Chun Jin and Dr. Daniel S. Miller, my colleagues in the Master of Science in Data Mining program at Central Connecticut State University; Dr. Timothy Craine, Chair of the Department of Mathematical Sciences; Dr. Dipak K. Dey, Chair of the Department of Statistics at the University of Connecticut; and Dr. John Judge, Chair of the Department of Mathematics at Westfield State College. Your support was (and is) invaluable.
Thanks to my children Chantal, Tristan, and Ravel for sharing the computer with me. Finally, I would like to thank my wonderful wife, Debra J. Larose, for her patience, understanding, and proofreading skills. But words cannot express …
Daniel T. Larose,
Ph.D.
Director, Data Mining
@CCSU