“Above all else, show the data.” (Lehr, 2019). Better yet, show the process too! Have you heard these common office references before? I sure have. Such comments are often made during the iterative planning phases while supporting major projects. Whether leaders generally just need more data, better quality data, or all-around data transparency to monitor project performance or predict future outcomes, it’s critical to start with a standard framework for reference. Far too often, businesses seek to perform analysis on their data but don’t know where to start. For that reason, a two-part approach is presented below as a resource for your next endeavor.
Part (1) – Major Steps in the Data Mining Process
If we look at data mining as a process of knowledge discovery, different people would likely approach its techniques in different ways. In this case, let’s look at how scholars outline the seven steps of data mining as a process of knowledge discovery (Han, Pei, and Kamber, 2011).
Seven steps of data mining (Han, Pei, and Kamber, 2011):
Data cleaning (to remove noise and inconsistent data)
Data integration (where multiple data sources may be combined)
Data selection (where data relevant to the analysis task are retrieved from the database)
Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
Data mining (an essential process where intelligent methods are applied to extract data patterns)
Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
Source: Han, Pei, and Kamber (2011) | Image: Seven Steps of Data Mining
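To make the seven steps concrete, here is a minimal Python sketch of the whole pipeline. The records, field names, and the 200-sales threshold are all invented for illustration; they are not from Han, Pei, and Kamber.

```python
# Toy walk-through of the seven steps; all data below is fabricated.

# Two hypothetical raw sources
source_a = [{"id": 1, "region": "east", "sales": 100},
            {"id": 2, "region": "east", "sales": None},   # missing value
            {"id": 3, "region": "west", "sales": 300}]
source_b = [{"id": 4, "region": "west", "sales": 250}]

# 1. Data cleaning: remove records with missing values
clean_a = [r for r in source_a if r["sales"] is not None]

# 2. Data integration: combine the two sources
combined = clean_a + source_b

# 3. Data selection: retrieve only the fields relevant to the task
selected = [{"region": r["region"], "sales": r["sales"]} for r in combined]

# 4. Data transformation: consolidate by aggregating sales per region
totals = {}
for r in selected:
    totals[r["region"]] = totals.get(r["region"], 0) + r["sales"]

# 5. Data mining: extract a pattern (regions whose totals exceed a threshold)
patterns = {region: t for region, t in totals.items() if t > 200}

# 6. Pattern evaluation: rank patterns by a simple "interestingness" measure
interesting = dict(sorted(patterns.items(), key=lambda kv: -kv[1]))

# 7. Knowledge presentation: report the mined knowledge
for region, t in interesting.items():
    print(f"{region}: {t}")
```

In practice each step is far more involved (SQL for integration, statistical methods for cleaning, real algorithms for mining), but the flow from raw sources to presented knowledge is the same.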
As outlined in my previous discussion about the knowledge discovery process, steps one through four represent the preprocessing phase; data mining methods are not usually applied until step five, where knowledge discovery becomes highly probable. Although data mining primarily focuses on performing advanced techniques on data such as classification, clustering, regression, prediction, association rules, and sequential patterns (Han, Pei, and Kamber, 2011), it’s important not to forget about all the other activities associated with the overall process of data mining. Preprocessing (steps one through four) is just as important as actually mining the data for knowledge and reporting or using it for decision making. Without clean and reliable data, analysis cannot produce accurate results.
Cross Industry Standard Process for Data Mining (CRISP-DM)
Continuing our knowledge journey, let’s look at the CRISP-DM process. Data scholars designed the Cross-Industry Standard Process for Data Mining (CRISP-DM) to aid data explorers in their journey to better understand data. According to Hunter (2009), CRISP-DM is a process model that provides a framework for carrying out data mining projects, independent of both the industry and the technology used.
Source: Vorhies (2016) | Image: CRISP-DM Process
As outlined in the CRISP model above, the six phases of CRISP-DM include:
Business Understanding – In this phase, hypotheses are established, the goals of the project are defined, project plans are created, left and right project boundaries are outlined, and timelines and expectations are set (Hunter, 2009). Without this step, we lack direction.
Data Understanding – During this phase, the data is collected from its sources and its relationships are assessed, which often requires domain subject-matter expertise (Hunter, 2009). Importantly, Hunter (2009) notes that while exploring the data and its relationships, analysts often discover new insights and further develop their business understanding.
Data Preparation – This phase involves selecting the appropriate data and cleaning it (Hunter, 2009). Data Cleaning, Data Integration, Data Reduction, and Data Transformation are all major tasks associated with data preprocessing, which aid in data preparation (Han, Pei, and Kamber, 2011).
Modeling – During this phase, a variety of data modeling techniques can be used to generate models to assess whether a hypothesis is true or false. This phase also uses advanced algorithms to assess models. According to Hunter (2009), additional data preparation may be necessary to properly use particular algorithms for testing.
Evaluation – In this phase, we determine how to use the model(s). Models created in the previous phase are assessed and a select few are chosen based on their ability to achieve the desired outcomes initially outlined during the business understanding phase (Hunter, 2009).
Deployment – In this phase, the selected models are deployed and monitored, and the results are reported to support iterative production and management efforts. According to Hunter (2009), this is not the end of the project. Instead, it’s when new baseline data is discovered and integrated back into the iterative process for further knowledge discovery.
Why do you think the early phases (understanding of the business and understanding of the data) take the longest in data mining projects?
The early phases of data mining projects often take the longest because data can be quite complex to deal with. For instance, if the quality of the data is poor, then the results of our data mining efforts won’t benefit our business needs. Case in point: according to Gualtieri (2013), many studies have shown that roughly 70-80% of a data scientist’s time is spent assembling and cleaning data, with only 20-30% spent discovering new meaning or use cases in the data using algorithms. Although it’s important to start every project with the best quality data available, how a business intends to use the data will ultimately determine the quality of the data being mined. Regardless of the project, if the business objectives and the data are not completely understood, it will be difficult if not impossible to successfully conduct data mining techniques in order to test one or more hypotheses.
What are the main data preprocessing steps?
Source: Han, Pei, and Kamber (2011) | Image: Data Preprocessing Steps
According to Han, Pei, and Kamber (2011), the main data preprocessing steps include:
Data Cleaning – This initial step consists of routine techniques that include but are not limited to filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies;
Data Integration – During this step, data sources are identified, and connections are made with all of them. According to Han, Pei, and Kamber (2011), analysts often find redundant data and inconsistencies across the databases, which tends to slow down the knowledge discovery process, causing more time to be spent in the data cleaning phase before further data analysis techniques can be applied.
Data Reduction – During this step, a reduced representation of the dataset is produced that is much smaller in volume yet yields the same analytical results as the larger set (Han, Pei, and Kamber, 2011). Some of these techniques include Dimensionality Reduction and Numerosity Reduction.
Data Transformation – In this step, data is transformed using a variety of methods such as normalization, data discretization, attribute construction, smoothing, aggregation, and concept hierarchy generation (Han, Pei, and Kamber, 2011).
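As a rough illustration of cleaning and transformation in plain Python (the sensor readings below are invented for demonstration):

```python
# Hypothetical sensor readings; None marks a missing value.
values = [10.0, None, 35.0, 20.0, None, 50.0]

# Data cleaning: fill missing values with the mean of the observed ones
observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)
cleaned = [v if v is not None else mean for v in values]

# Data transformation: min-max normalization to [0, 1]
lo, hi = min(cleaned), max(cleaned)
normalized = [(v - lo) / (hi - lo) for v in cleaned]

# Data transformation: discretize into three equal-width bins, a very
# simple concept hierarchy (low / medium / high)
def bin_label(x):
    return "low" if x < 1/3 else "medium" if x < 2/3 else "high"

labels = [bin_label(x) for x in normalized]
print(labels)
```

Real preprocessing would also handle outliers, inconsistent encodings, and integration across sources, but these three routines capture the spirit of the steps listed above.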
How does CRISP-DM differ from SEMMA?
According to Software Testing Help (n.d.), SEMMA (i.e., Sample, Explore, Modify, Model, Assess) is another data mining method that can be used in a similar way to the CRISP-DM model.
Source: Software Testing Help (n.d.) | Image: SEMMA Data Mining Model
As outlined in the SEMMA Data Mining Model above, a phased approach helps data explorers through the process. The steps in the SEMMA process include:
“Sample: In this step, a large dataset is extracted and a sample that represents the full data is taken out. Sampling will reduce the computational costs and processing time;
Explore: The data is explored for any outlier and anomalies for a better understanding of the data. The data is visually checked to find out the trends and groupings;
Modify: In this step, manipulation of data such as grouping, and subgrouping is done by keeping in focus the model to be built;
Model: Based on the explorations and modifications, the models that explain the patterns in data are constructed; and
Assess: The usefulness and reliability of the constructed model are assessed in this step. Testing of the model against real data is done here” (Software Testing Help, n.d.).
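The Sample step, for instance, might look like this in Python; the transaction amounts and the 5% sampling rate are hypothetical:

```python
import random

# Hypothetical full dataset: 10,000 transaction amounts
random.seed(42)  # fixed seed so the sketch is reproducible
full_data = [round(random.uniform(5, 500), 2) for _ in range(10_000)]

# Sample: draw a 5% simple random sample to cut computational cost
sample = random.sample(full_data, k=len(full_data) // 20)

# Quick sanity check that the sample resembles the full data: compare means
full_mean = sum(full_data) / len(full_data)
sample_mean = sum(sample) / len(sample)
print(len(sample), round(full_mean, 1), round(sample_mean, 1))
```

In real projects the sampling scheme (simple, stratified, time-based) is chosen so the sample stays representative of the population being mined.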
By comparison, although both the CRISP-DM and SEMMA models can be used as frameworks for data mining projects, they differ in their approaches. CRISP-DM focuses more on front-loading routines for data cleaning, while SEMMA focuses more on assembling the data for analysis via samples during the initial steps and then exploring the data using algorithm-applied modeling. Conversely, they also have similarities, such as using data samples/reduction, creating models based on available datasets, and exploring the data.
Part (2) – Identify at least three of the main data mining methods.
Three main data mining methods include but are not limited to classification, regression, and cluster analysis. To start, let’s check out classification. According to Han, Pei, and Kamber (2011), classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, by which predicted categorical labels are presented as results. As an example, let’s look at Han, Pei, and Kamber’s (2011) classification model, where in the first step they build a classification model based on previous data. In the second step, they determine whether the model’s accuracy is acceptable, and if so, they use the model to classify new data.
“The data classification process: (a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules. (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples” (Han, Pei, and Kamber, 2011).
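A minimal sketch of the two-step process in Python, using invented loan-style tuples. The income levels and decisions are hypothetical, and the “algorithm” here is just a majority-class rule learner rather than a real induction method:

```python
from collections import Counter, defaultdict

# Hypothetical training tuples: (income_level, loan_decision)
train = [("low", "risky"), ("low", "risky"), ("medium", "safe"),
         ("medium", "risky"), ("medium", "safe"), ("high", "safe")]

# (a) Learning: derive one classification rule per attribute value
# (majority class), e.g. "IF income = high THEN loan_decision = safe"
by_value = defaultdict(list)
for income, decision in train:
    by_value[income].append(decision)
rules = {income: Counter(decisions).most_common(1)[0][0]
         for income, decisions in by_value.items()}

# (b) Classification: estimate rule accuracy on held-out test tuples
test = [("low", "risky"), ("medium", "safe"), ("high", "safe"), ("high", "risky")]
correct = sum(rules[income] == decision for income, decision in test)
accuracy = correct / len(test)
print(rules, accuracy)
```

If the accuracy were deemed acceptable, the learned rules would then be applied to classify brand-new tuples, exactly as the quoted passage describes.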
The second data mining method is regression. According to Han, Pei, and Kamber (2011), regression analysis is often used for numeric prediction and, if designed properly, will predict a continuous function (or ordered value). Regression goes hand-in-hand with classification in the knowledge discovery process: where classification helps predict class labels, regression helps predict the numeric values associated with those labeled datasets by analyzing relationships between variables. Han, Pei, and Kamber (2011) also highlight that linear regression involves finding the ‘best’ line to fit two attributes (or variables) so that one attribute can be used to predict the other. For instance, is there a relationship between a home’s size and the price it sells for? If so, how is it related? How strongly? With these questions in mind, we can use a linear regression approach to assess the data. In the example highlighted in the images below from Simplilearn (2017), regression can be applied to the sample dataset.
Source: Simplilearn (2017) | Image: Regression Analysis Example Dataset
According to Simplilearn (2017), the formula for the simple linear regression model follows this blueprint:
Source: Simplilearn (2017) | Image: Simple Regression Model Formula
In addition (as outlined in the image below), Simplilearn also shows the different types of regression, where either simple or multiple regression techniques are applied (2017). The choice depends on whether one or more than one predictor variable is involved.
Source: Simplilearn (2017) | Image: Types of Regression Analysis
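To illustrate the home-size question (with invented sizes and prices, not Simplilearn’s dataset), simple linear regression can be computed from scratch using the closed-form least-squares estimates:

```python
# Hypothetical home sizes (sq ft) and sale prices (thousands of dollars);
# the prices lie on a perfect line so the fit is easy to verify by hand.
sizes  = [1000, 1500, 2000, 2500, 3000]
prices = [250, 375, 500, 625, 750]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Least-squares estimates for the model y = b0 + b1 * x:
#   b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
     / sum((x - mean_x) ** 2 for x in sizes)
b0 = mean_y - b1 * mean_x

# Use the fitted line to predict the price of an 1,800 sq ft home
predicted = b0 + b1 * 1800
print(b1, b0, predicted)
```

Here one attribute (size) predicts the other (price); with more than one predictor, the same idea extends to multiple regression.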
The third data mining method is clustering. According to Han, Pei, and Kamber (2011), cluster analysis can be used to generate class labels for groups of data, and each cluster can be viewed as a class of objects from which rules can be derived. Basically, this helps organize data observations into a hierarchy of classes that groups similar events together. Clustering is an analytics technique that relies on visual approaches to understanding data, and it is often presented using graphics that show how the data is distributed in relation to different metrics.
Source: Advani (2020) | Image: Clustering Algorithms in Machine Learning
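A bare-bones k-means sketch in plain Python shows the idea; the six points below are invented so that the two groups are obvious:

```python
# Minimal k-means on invented 2-D points (two well-separated groups).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),      # group near (1.5, 1.5)
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]      # group near (8.5, 8.5)

def dist2(a, b):
    # Squared Euclidean distance (no need for the square root here)
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

# Start from two of the points as initial centroids
centroids = [points[0], points[3]]
for _ in range(10):                      # a few refinement iterations
    # Assignment step: each point joins its nearest centroid's cluster
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: dist2(p, centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster
    centroids = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                 for c in clusters]

print(centroids)
```

The resulting centroids land in the middle of each group, effectively assigning a class label to every point; production libraries add smarter initialization and convergence checks.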
Two Classification Techniques
During the data mining process, classification techniques can be performed in many ways. The two we’ll focus on here are the decision tree technique and the naïve Bayes technique.
According to AIM (2020), a decision tree produces a sequence of rules that can be used to classify the data. Its advantages include being simple to understand and visualize. Also, it requires little data preparation and can handle both numerical and categorical data.
Disadvantages include instances when small variations in the data might result in a completely different tree being generated. This can be time consuming and often creates generalizations. Below is an example of what the syntax would look like for conducting this technique using a data mining tool.
Source: AIM (2020) | Image: Decision Tree Algorithm
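Since the image above shows tool-specific syntax, here is a language-neutral illustration in plain Python of the simplest possible decision tree, a one-level “decision stump”; the study-hours data is invented:

```python
# Learn a one-rule decision "tree" from invented (hours_studied, outcome)
# pairs by scanning candidate split thresholds.
data = [(1, "fail"), (2, "fail"), (3, "fail"),
        (6, "pass"), (7, "pass"), (9, "pass")]

def stump_error(threshold):
    # Rule: IF hours <= threshold THEN "fail" ELSE "pass"
    return sum((("fail" if x <= threshold else "pass") != label)
               for x, label in data)

# Try a threshold midway between each pair of adjacent values
xs = sorted(x for x, _ in data)
candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
best = min(candidates, key=stump_error)

print(best, stump_error(best))   # any split between 3 and 6 misclassifies nothing
```

A full decision tree repeats this splitting recursively on each branch, which is also why small variations in the data can yield a completely different tree.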
According to AIM (2020), a naïve Bayes algorithm is based on Bayes’ theorem with the assumption of independence between every pair of features. Naïve Bayes classifiers work well in many real-world situations, such as document classification and spam filtering. Their advantages include requiring only a small amount of training data to estimate the necessary parameters, and being fast compared to other methods (AIM, 2020). The disadvantage of naïve Bayes is that it’s known to be a bad estimator (AIM, 2020).
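To illustrate the idea behind the spam-filtering use case, here is a toy from-scratch sketch (the messages and word probabilities are invented, and this is not AIM’s implementation):

```python
from collections import Counter

# Tiny invented "training" messages for each class
spam_msgs = [["win", "money", "now"], ["win", "prize"], ["money", "now"]]
ham_msgs  = [["meeting", "now"], ["project", "update"], ["lunch", "meeting"]]

def word_probs(msgs):
    # Estimate P(word | class) with Laplace smoothing so an unseen
    # word never zeroes out the whole product
    counts = Counter(w for m in msgs for w in m)
    total = sum(counts.values())
    vocab = {w for m in spam_msgs + ham_msgs for w in m}
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

p_spam, p_ham = 0.5, 0.5                  # equal class priors here
spam_p, ham_p = word_probs(spam_msgs), word_probs(ham_msgs)

def score(message, prior, probs):
    # Naive independence assumption: multiply per-word probabilities
    p = prior
    for w in message:
        p *= probs[w]
    return p

msg = ["win", "money"]
label = "spam" if score(msg, p_spam, spam_p) > score(msg, p_ham, ham_p) else "ham"
print(label)
```

The multiplied scores are useful for picking the more likely class, but they are not well-calibrated probabilities, which is one concrete sense in which naïve Bayes is a “bad estimator.”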