IBM SPSS Modeler Cookbook

Preface

IBM SPSS Modeler is the most comprehensive workbench-style data mining software package. Many of its individual modeling algorithms are available elsewhere, but Modeler has features that are helpful throughout all the phases of the independent, influential Cross Industry Standard Process for Data Mining (CRISP-DM). Considered the de facto standard, CRISP-DM provides the skeleton structure for the IBM SPSS Modeler Cookbook, and the recipes in this book will help you make the most of Modeler's tools for ETL, data preparation, modeling, and deployment.

In this book, we will emphasize the CRISP-DM phases that you are most likely to address while working with Modeler. Other phases, while mentioned, will not be the focus. For instance, the critical business understanding phase is primarily not a software phase; a rich discussion of this phase is included in the Appendix, Business Understanding. Also, the deployment and monitoring phases get a fraction of the attention that data preparation and modeling get, because it is in the latter phases that Modeler is the critical component.

These recipes will address:

  • Nonobvious applications of the basics
  • Tricky operations, work-arounds, and nondocumented shortcuts
  • Best practices for key operations as done by power users
  • Operations that are not available through standard approaches, using scripting, in a chapter dedicated to Modeler scripting recipes

While this book assumes that you already have the level of knowledge one would gain from an introductory course or by working with the user's guides, it will take you well beyond that. It will be valuable from the first time you are the lead on a Modeler project, but it will offer much wisdom even if you are a veteran user. Each of the authors has a decade (or two, or more) of experience; collectively, they cover the gamut of data mining practice in general, and knowledge of Modeler in particular.

What is CRISP-DM?

CRISP-DM is a tool-neutral and industry-nonspecific process model for navigating a data mining project life cycle. It consists of six phases and, within those phases, a total of 24 generic tasks. In the given table, one can see the phases as column headings and the generic tasks in bold. It is the most widely used process model of its kind. This is especially true of users of Modeler, since the software has historically made explicit references to CRISP-DM in the default structure of its project files, but polls have shown that its popularity extends to many other data miners as well. It was written in the 1990s by a consortium of data miners from numerous companies. Its lead authors were from NCR, DaimlerChrysler, and ISL (later bought by SPSS).

This book uses the CRISP-DM process model for its structure but does not address the CRISP-DM content directly. Since the CRISP-DM consortium is nonprofit, the original documents are widely available on the Web, and it would be worthwhile to read them in their entirety as part of your data mining professional development. Naturally, as a cookbook written for users of Modeler, our focus will be on hands-on tasks.

Business understanding, while critical, is not conducive to a recipe-based format; it is so important a topic that it is covered, in prose, in the Appendix, Business Understanding. Data preparation receives the most attention, with four chapters. Modeling is covered, in depth, in its own chapter. Since evaluation and deployment often use Modeler in combination with other tools, we have included them in somewhat fewer recipes, but that does not diminish their importance. The final chapter, Modeler Scripting, is not named after a CRISP-DM phase or task, but it is included at the end because it has the most advanced recipes.

Data mining is a business process

Data mining, by the discovery and interpretation of patterns in data, is:

  • The use of business knowledge
  • To create new knowledge
  • In natural or artificial form

The most important thing for you to know about data mining is that it is a way of using business knowledge.

The process of data mining uses business knowledge to create new knowledge, and this new knowledge may take one of two forms. The first form that data mining can create is "natural knowledge", that is, human knowledge, sometimes referred to as insight. The second form is "artificial knowledge", that is, knowledge in the form of a computer program, sometimes called a predictive model. It is widely recognized that data mining produces these two kinds of results: insight and predictive models.

Both forms of new knowledge are created through a process of discovering and interpreting patterns in data. The most well-known type of data mining technology is called a data mining algorithm. This is a computer program that finds patterns in data and creates a generalized form of those patterns called a "predictive model". What makes these algorithms (and the models they create) useful is their interpretation in the light of business knowledge. The patterns that have been discovered may lead to new human knowledge, or insight, or they may be used to generate new information by using them as computer programs to make predictions. The new knowledge only makes sense in the context of business knowledge, and the predictions are only of value if they can be used (through business knowledge) to improve a business process.

Data mining is a business process, not a technical one. All data mining solutions start from business goals, find relevant data, and then proceed to find patterns in the data that can help to achieve the business goals. The data mining process is described well by the aforementioned CRISP-DM industry standard data mining methodology, but its character as a business process has been shaped by the data mining tools available. Specifically, the existence of data mining workbenches that can be used by business analysts means that data mining can be performed by someone with a great deal of business knowledge, rather than someone whose knowledge is mainly technical. This in turn means that the data mining process can take place within the context of ongoing business processes and need not be regarded as a separate technical development. This leads to a high degree of availability of business knowledge within the data mining process and magnifies the likely benefits to the business.

The IBM SPSS Modeler workbench

This book is about the data mining workbench variously known as Clementine, PASW Modeler, and IBM SPSS Modeler. This and other workbench-style data mining tools have played a crucial role in making data mining what it now is, that is, a business process (rather than a technical one). The importance of the workbench is twofold.

Firstly, the workbench plays down the technical side of data mining. It simplifies the use of technology through a user interface that allows the user almost always to ignore the deep technical details, whether this means the method of data access, the design of a graph, or the mechanism and tuning of data mining algorithms. Technical details are simplified, and where possible, universal default settings are used so that the users often need not see any options that reveal the underlying technology, let alone understand what they mean.

This is important because it allows business analysts to perform data mining—a business analyst is someone with expert business knowledge and general-purpose analytical knowledge. A business analyst need not have deep knowledge of data mining algorithms or mathematics, and it can even be a disadvantage to have this knowledge because technical details can distract from focusing on the business problem.

Secondly, the workbench records and highlights the way in which business knowledge has been used to analyze the data. This is why most data mining workbenches use a "visual workflow" approach; the workflow constitutes a record of the route from raw data to analysis, and it also makes it extremely easy to change this processing and re-use it in part or in full. Data mining is an interactive process of applying business and analytical knowledge to data, and the data mining workbench is designed to make this easy.

A brief history of the Clementine workbench

During the 1980s, the School of Cognitive and Computing Studies at the University of Sussex developed an Artificial Intelligence programming environment called Poplog. Used for teaching and research, Poplog was characterized by containing several different AI programming languages and many other AI-related packages, including machine-learning modules. From 1983, Poplog was marketed commercially by Systems Designers Limited (later SD-Scicon), and in 1989, a management buyout created a spin-off company called Integral Solutions Ltd (ISL) to market Poplog and related products. A stream of business developed within ISL, applying the machine-learning packages in Poplog to organizations' data in order to understand and predict customer behavior.

In 1993, Colin Shearer (the then Development and Research Director at ISL) invented the Clementine data mining workbench, basing his designs around the data mining projects recently executed by the company and creating the first workbench modules using Poplog. ISL created a data mining division, led by Colin Shearer, to develop, productize, and market Clementine and its associated services; the initial members were Colin Shearer, Tom Khabaza, and David Watkins. This team used Poplog to develop the first version of Clementine, which was launched in June 1994.

Clementine Version 1 would be considered limited by today's standards; the only algorithms provided were decision trees and neural networks, and it had very limited access to databases. However, the fundamental design features of a low technical burden on the user and a flexible visual record of the analysis were as much in evidence then as they are today, and Clementine immediately attracted substantial commercial interest. New versions followed, approximately one major version per year, as shown in the table below. ISL was acquired by SPSS Inc. in December 1998, and SPSS Inc. was acquired by IBM in 2009.

Version 13 was renamed as PASW Modeler, and Version 14 as IBM SPSS Modeler. The selection of major new features described earlier is very subjective; every new version of Clementine included a large number of enhancements and new features. In particular, data manipulation, data access and export, visualization, and the user interface received a great deal of attention throughout. Perhaps the most significant new release was Version 7, where the Clementine client was completely rewritten in Java; this was designed by Sheri Gilley and Julian Clinton, and contained a large number of new features while retaining the essential character of the software. Another very important feature of Clementine from Version 6 onwards was database pushback, the ability to translate Clementine operations into SQL so that they could be executed directly by a database engine without extracting the data first; this was primarily the work of Niall McCarroll and Rob Duncan, and it gave Clementine an unusual degree of scalability compared to other data mining software.

In 1996, ISL collaborated with Daimler-Benz, NCR Teradata, and OHRA to form the "CRISP-DM" consortium, partly funded by a European Union R&D grant in order to create a new data mining methodology, CRISP-DM. The consortium consulted many organizations through its Special Interest Group and released CRISP-DM Version 1.0 in 1999. CRISP-DM has been integrated into the workbench since that time and has been very widely used, sufficiently to justify calling it the industry standard.

The core Clementine analytics are designed to handle structured data—numeric, coded, and string data of the sort typically found in relational databases. However, in Clementine Version 4, a prototype text mining module was produced in collaboration with Brighton University, although not released as a commercial product. In 2002, SPSS acquired LexiQuest, a text mining company, and integrated the LexiQuest text mining technology into a product called Text Mining for Clementine, an add-on module for Version 7. Text mining is accomplished in the workbench by extracting structured data from unstructured (free text) data, and then using the standard features of the workbench to analyze this.

Historical introduction to scripting

By the time Clementine Version 4 was released in 1997, the workbench had gained substantial market traction. Its revolutionary visual programming interface had enabled a more business-focused approach to analytics than ever before—all the major families of algorithms were represented in an easy-to-use form, ODBC had enabled integration with a comprehensive range of data, and commercial partners were busy rebadging Clementine to reach a wider audience through new market channels.

The workbench lacked one major kind of functionality, that of automation, to enable the embedding of data mining within other applications. It was therefore decided that automation would form the centrepiece of Version 5, and that it would be provided by two major features: batch mode and scripting. Batch mode enabled running the workbench without the user interface, so that streams could be run in the background, scheduled to run at a given time or at regular intervals, or run as part of a larger application. Scripting gave the user automated control of stream execution, even without the user being present; this was also a prerequisite for any complex operation executed in batch mode.

The motivation behind scripting was to provide a number of capabilities:

  • Gain control of the order of stream execution where this matters, that is, when using the Set Globals node
  • Automate repetitive processes, for example, cross-validation or the exploration of many different sets of fields or options
  • Remove the need for user intervention so that streams could run in the background
  • Manipulate complex streams, for example, if the need arose to create 1000 different Derive nodes

These motives led to an underlying philosophy of scripting, that is, scripts replace the user, not the stream. This means that the operations of scripting should be at the same level as the actions of the user, that is, they would create nodes and link them, control their settings, execute streams, and save streams and models. Scripts would not be used to implement data manipulation or algorithms directly; these would remain in the domain of the stream itself. This reflects a fundamental fact about technologies: they are defined as much by what they cannot do as by what they can. These principles are not inflexible; for example, cross-validation might be considered part of an algorithm but was one of the first scripts to be written. Nevertheless, they guided the design of the scripting language. A consequence of this philosophy was that there could be no interaction between script and data; this restriction was lifted only later with the introduction of access to output objects.
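
To make this philosophy concrete, here is a minimal sketch in the scripting language covered in Chapter 8, CLEM Scripting. It does only the kinds of things a user would do by hand: create a Derive node, set its properties, connect it into the stream, and execute a downstream Table node. The node positions and the field names SPEND and INCOME are illustrative assumptions, not taken from any recipe in this book:

# A minimal illustrative sketch; the field names and node positions
# below are assumptions, not taken from any recipe in this book.
var derive
set derive = create derivenode at 300 100
set ^derive.new_name = "spend_ratio"
set ^derive.formula_expr = "SPEND / INCOME"
# Connect the new node downstream of an existing Type node, then
# create and execute a Table node to view the result.
connect :typenode to ^derive
var table
set table = create tablenode at 450 100
connect ^derive to ^table
execute ^table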

A number of factors influenced the design of the scripting language in addition to the above philosophy:

  • In line with the orientation towards nontechnical users, the language should be simple
  • The timescale for implementation was short, so the language should be easy to implement
  • The language should be familiar, and so should use existing programming concepts and constructs, and not attempt to introduce new ones

These philosophical and practical constraints led to a programming language influenced by BASIC, with structured features taken from POP-11 and an object-oriented approach to nodes taken from Smalltalk and its descendants.

What this book covers

Chapter 1, Data Understanding, provides recipes related to the second phase of CRISP-DM with a focus on exploring the data and data quality. These are recipes that you can apply to data as soon as you acquire the data. Naturally, some of these recipes are also among the more basic, but as always, we seek out the nonobvious tips and tricks that will make this initial assessment of your data efficient.

Chapter 2, Data Preparation – Select, covers just the first task of the data preparation phase. Data preparation is notoriously time-consuming and is incredibly rich in its potential for time-saving recipes. The cookbook will have a total of four chapters on data preparation. The selection of which data rows and which data columns to analyze can be tricky, but it sets the stage for everything that follows.

Chapter 3, Data Preparation – Clean, covers the cleaning challenges that data miners face and is dedicated to just the second generic task of the data preparation phase. Sometimes new data miners assume that if a data warehouse is being used, data cleaning has largely been done up front. Veteran data miners know that there is usually a great deal left to do, since data has to be prepared for a particular use, to answer a specific business question. A couple of the recipes will be basic, but the rest will be quite complex, designed to tackle some of the more difficult cleaning challenges.

Chapter 4, Data Preparation – Construct, covers the third generic task of the data preparation phase. Many data miners find that the final model contains many more constructed variables than variables used in their original form, as found in the original data source. Common methods can be as straightforward as ratios of part to whole, or deltas of last month from the average month, and so on. However, the chapter won't stop there; it will also provide examples of larger-scale variable construction.

Chapter 5, Data Preparation – Integrate and Format, covers the fourth and fifth generic tasks of the data preparation phase. Integration includes actions performed in Modeler with nodes such as Merge, Append, and Aggregate. Formatting is often simply defined as reconfiguring data to meet the needs of the software, in this instance Modeler.

Chapter 6, Selecting and Building a Model, addresses what many novice data miners see as their greatest challenge, that is, mastering data mining algorithms. Data mining, however, is not really all about algorithms, and neither is this chapter. A discussion of algorithms can easily fill a book, and a quick search will reveal that it has done so many times. Here we'll address nonobvious tricks to make your modeling time more effective and efficient.

Chapter 7, Modeling – Assessment, Evaluation, Deployment, and Monitoring, covers terribly important topics that don't get as much attention as they deserve, especially deployment. Deployment deserves more attention still, but this cookbook's attention is clearly and fully focused on IBM SPSS Modeler and not on its sibling products, such as IBM Decision Management or IBM Collaboration and Deployment Services. Their proper use, or some alternative, is part of the complete narrative but beyond the scope of this book. So, ultimately, two CRISP-DM phases and a portion of a third are addressed in one chapter, albeit with a large number of powerful recipes.

Chapter 8, CLEM Scripting, departs from the CRISP-DM format and focuses instead on a particular aspect of the interface: scripting. It is the final chapter and contains the most advanced concepts, but it is still written with the intermediate user in mind.

Appendix, Business Understanding, is a special section: an essay-format discussion of the first, and arguably the most critical, phase of CRISP-DM. Tom Khabaza, Meta Brown, Dean Abbott, and Keith McCormick each contribute an essay, collectively discussing all four subtasks of the phase.

Who this book is for

This book envisions that you are a regular user of IBM SPSS Modeler, albeit perhaps on your first serious project. It assumes that you have taken an introductory course or have equivalent preparation. IBM's Modeler certification would be some indication of this, but the certification focuses on software operations alone and does not address general data mining theory. Some familiarity with that theory would be of considerable assistance in putting these recipes into context. All readers would benefit from a careful review of the CRISP-DM document, which is readily available on the Internet.

This book also assumes that you are using IBM SPSS Modeler for data mining and are interested in all of the software-related phases of CRISP-DM. This premise might seem strange, but since Modeler combines powerful ETL capability with advanced modeling algorithms, some Modeler users use the software primarily for its ETL capabilities alone. This book spends roughly equal time on both. One of the advantages of the cookbook format, however, is that you are invited to skip around: reading out of order, reading some chapters and not others, reading only some of the recipes within chapters, and gleaning only what is needed at the moment.

It does not assume that the reader possesses knowledge of SQL. Such knowledge will not be emphasized, as Modeler considerably reduces the need for knowing SQL, although many data miners have this skill. Nor does this book assume knowledge of statistical theory; such knowledge is always useful to the data miner, but the recipes in this book do not require it. Finally, the book does not assume prior knowledge of data mining algorithms; the recipes simply do not dive deep enough into that aspect of the topic to require it.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "This recipe uses the cup98lrn reduced vars2.txt data set."

A block of code is set as follows:

if length(s) < 3 then '0'
elseif member(s(3),[B P F V]) and c2 /= '1' then '1'
elseif member(s(3),[C S K G J Q X Z]) and c2 /= '2' then '2'
elseif member(s(3),[D T]) and c2 /= '3' then '3'
elseif s(3) = 'L' and c2 /= '4' then '4'
elseif member(s(3),[M N]) and c2 /= '5' then '5'
elseif s(3) = 'R' and c2 /= '6' then '6'
else '' endif

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "clicking the Next button moves you to the next screen".

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to , and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at if you are having a problem with any aspect of the book, and we will do our best to address it.