Technological innovation is rapidly improving nearly every aspect of our lives in this so-called age of data. A central driving force behind this innovation is the emergence of new and complex streams of data, often referred to as big data. The size (i.e., volume) and speed (i.e., velocity) of this data render traditional methods of data storage, analysis, and reporting largely inadequate. However, with the emergence of data science, companies of all kinds are developing new techniques to extract meaningful insight and actionable information from these large datasets.
Endless streams of data flow from a wide variety of sources, often unnoticed by the human eye. Can we ever expect to make sense of it all? Data mining, the process of using algorithms and analysis techniques to extract meaningful insights from big data, offers some promising solutions. According to Han, J., Pei, J., and Kamber, M. (2011), data mining is “knowledge mining from data.” Through data mining techniques, critical patterns and actionable information can often emerge. These methods are improving decision-making processes across a wide range of industries.
Big data has the potential to improve nearly every aspect of how the global economy operates. However, to provide value, it’s critical that consumers and business owners alike identify opportunities to use big data to support their needs. Businesses that fail to leverage this valuable resource are likely to lose market share because they cannot respond to changing customer demands. So, is it just more hype? I think not! Here are a few examples of how organizations are currently employing data mining and analytical techniques to benefit their markets:
Defense: Big data analysis is helping the defense industry improve acquisition processes, technology development initiatives, logistics, and human resource management efforts.
Business: With data mining and effective analysis, businesses are outperforming their competitors through rapid decision-making and effective target-marketing strategies.
Healthcare: Advanced analytics of large patient datasets yield life-saving diagnoses and help suggest optimal treatment options.
Science, Technology, Engineering, Mathematics (STEM): Almost every STEM discipline is making new and exciting discoveries via the correlations emerging from big data analytics. These patterns and the insight they provide are enabling growth in scientific research and development.
Evolution of Data Mining
The evolution of data mining is quite interesting when viewed from the perspective of the book Data Mining: Concepts and Techniques (Han, J., Pei, J., and Kamber, M., 2011). According to the authors, “data mining can be viewed as a result of the natural evolution of information technology.” With more data comes an increased need to analyze and interpret that data, which naturally generated a global demand for data mining.
Here’s a timeline of that evolution, as outlined in the book Data Mining: Concepts and Techniques (Han, J., Pei, J., and Kamber, M., 2011).
As the timeline suggests, once people learned how to collect and store data, it was only a matter of time before they wanted to examine it more closely, or step back and view it from afar, in order to gain insights. The timeline begins with the Data Collection and Database Creation phase, which covers primitive file processing in the 1960s and earlier (Han, J., Pei, J., and Kamber, M., 2011). Database Management Systems followed from the 1970s to the early 1980s. From the mid-1980s onward, Advanced Database Systems emerged with the ability to manage complex data such as spatial, temporal, multimedia, and sequence data. The authors also place cloud computing in this phase, which extends to the present (Han, J., Pei, J., and Kamber, M., 2011). Beginning in the late 1980s, Advanced Data Analysis took shape with data warehouse solutions, data mining, and knowledge discovery (Han, J., Pei, J., and Kamber, M., 2011). Looking ahead, the future of this data-rich environment relies on those willing to figure out the best ethical ways to use it to improve quality of life. This future outlook is considered the Next Generation of Information Systems.
The seven steps of data mining as a process of knowledge discovery
If we view data mining as a process of knowledge discovery, each practitioner would probably approach it differently, depending on their knowledge and access to resources. With that in mind, let’s look at how scholars outline the seven steps of data mining as a process of knowledge discovery (Han, J., Pei, J., and Kamber, M., 2011).
Seven steps of data mining (Han, J., Pei, J., and Kamber, M., 2011):
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
Source: Han, J., Pei, J., and Kamber, M. (2011) | Image: Seven Steps of Data Mining
Steps one through four make up the preprocessing phase. Data mining methods are not usually applied until step five, where knowledge discovery becomes highly probable. Although data mining primarily focuses on applying advanced techniques to data, such as classification, clustering, regression, prediction, association rules, and sequential patterns (Han, J., Pei, J., and Kamber, M., 2011), it’s important not to forget the other activities in the overall process. Preprocessing (steps one through four) is just as important as mining the data for knowledge and then reporting on it or using it for decision making. Without clean and reliable data, no analysis can be expected to produce accurate results.
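To make the seven steps a little more concrete, here is a minimal Python sketch that walks a tiny shopping-basket example through each one. Everything in it is an assumption made for illustration: the sample records, the function names, and the 0.5 support threshold are invented, and simple pair counting stands in for the far more capable mining algorithms described by Han, J., Pei, J., and Kamber, M. (2011).

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources (illustrative only).
store_sales = [
    {"order": 1, "items": "Bread, milk"},
    {"order": 2, "items": "bread, butter, milk"},
    {"order": 3, "items": ""},             # noisy record: empty basket
]
web_sales = [
    {"order": 4, "items": "milk, bread"},
    {"order": 5, "items": "butter, jam"},
    {"order": 5, "items": "butter, jam"},  # duplicate record
]


def clean(records):
    """Step 1: drop empty baskets and repeated order numbers."""
    seen, cleaned = set(), []
    for rec in records:
        if rec["items"].strip() and rec["order"] not in seen:
            seen.add(rec["order"])
            cleaned.append(rec)
    return cleaned


def integrate(*sources):
    """Step 2: combine several sources into one list of records."""
    return [rec for source in sources for rec in source]


def select(records):
    """Step 3: keep only the field relevant to the analysis task."""
    return [rec["items"] for rec in records]


def transform(item_strings):
    """Step 4: turn each basket into a sorted tuple of lowercase item names."""
    return [
        tuple(sorted({item.strip().lower() for item in s.split(",")}))
        for s in item_strings
    ]


def mine(baskets):
    """Step 5: count how often each pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts


def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Step 6: keep pairs whose support (share of baskets) meets the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }


def present(patterns):
    """Step 7: format the surviving patterns for a human reader."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]


records = integrate(clean(store_sales), clean(web_sales))
baskets = transform(select(records))
report = present(evaluate(mine(baskets), len(baskets)))
print(report)  # ['bread + milk: support 0.75']
```

Notice that four of the seven functions run before any pattern is counted. If the empty basket and the duplicate order were left in, the support values in step six would be computed over the wrong number of baskets.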
Database vs Data Warehouse?
A database is a structured storage repository for data. Some databases are simple flat files (e.g., Excel spreadsheets), while others are more robust and allow relational links to be established, enabling customized reporting, faster query processing, and visualizations (e.g., Microsoft Access). According to Han, J., Pei, J., and Kamber, M. (2011), a relational database is a collection of tables, each of which is assigned a unique name.
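As a small illustration of the relational idea, the sketch below uses Python's built-in sqlite3 module to create two uniquely named tables and link them in a single query. The table names, columns, and sample rows are hypothetical and chosen only for this example.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE purchase (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount REAL NOT NULL
    );
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO purchase VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# Each table in the collection has its own unique name.
table_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]

# The relational link (purchase.customer_id -> customer.id) lets one query
# combine both tables into a customized report.
totals = conn.execute("""
    SELECT c.name, SUM(p.amount)
    FROM customer AS c
    JOIN purchase AS p ON p.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
conn.close()

print(table_names)  # ['customer', 'purchase']
print(totals)       # [('Ada', 25.0), ('Grace', 12.5)]
```

A flat file could hold the same rows, but the customer name would have to be repeated on every purchase line, and the per-customer totals would have to be worked out by hand.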
A data warehouse is a system that aggregates and stores information from a variety of different sources within an organization or network of organizations (Tobin, 2019). Data warehouses are often used specifically for business scenarios. They are usually designed to aid decision-makers by letting users consolidate and analyze data from a variety of sources at the strategic and tactical levels. Data warehouses are useful when a query requires data beyond what’s stored in an individual database. Queries are often entered through common user interfaces that serve as the face and search function of a data warehouse. With on-premises databases, which are normally isolated from other business systems, users can easily run a query that searches a single database. With a data warehouse, users can tap into data from many different databases, given the proper application programming interface (API) connections to outside data sources.
Data warehouses are best suited for larger questions about an organization’s past, present, and future that require a higher level of analysis, such as mining information from multiple databases to uncover hidden insights or correlations. Databases are typically better suited for individual teams that need to store and manage shared data, usually within a limited storage capacity.
Data warehouses, on the other hand, are traditionally designed for the sole purpose of reporting and analysis. Trained users can retrieve information from both current and historical data, enabling a wider range of insights (Tobin, 2019).
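The difference can be sketched in a few lines of Python. In this hypothetical example, two isolated sqlite3 databases stand in for separate operational systems, and a third database plays the role of the warehouse. The names east, west, fact_revenue, and the revenue figures are invented for illustration, and a real warehouse would rely on dedicated loading tools rather than a simple loop.

```python
import sqlite3


def make_source(rows):
    """Create an isolated in-memory database holding one team's orders."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (region TEXT, year INTEGER, revenue REAL)")
    db.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return db


# Two separate operational databases, each holding current and historical rows.
east = make_source([("east", 2018, 100.0), ("east", 2019, 150.0)])
west = make_source([("west", 2018, 80.0), ("west", 2019, 120.0)])

# The warehouse consolidates rows from every source and records their origin.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_revenue (source TEXT, region TEXT, year INTEGER, revenue REAL)"
)
for name, db in (("east_db", east), ("west_db", west)):
    rows = db.execute("SELECT region, year, revenue FROM orders").fetchall()
    warehouse.executemany(
        "INSERT INTO fact_revenue VALUES (?, ?, ?, ?)",
        [(name, *row) for row in rows],
    )

# A question no single source database can answer: company-wide revenue per year.
yearly = warehouse.execute(
    "SELECT year, SUM(revenue) FROM fact_revenue GROUP BY year ORDER BY year"
).fetchall()
print(yearly)  # [(2018, 180.0), (2019, 270.0)]
```

Each source database can only report on its own region, while the consolidated table supports reporting across regions and across years.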