Data Mining and Computing
The emergence of data mining is closely connected to developments in computer technology, particularly the evolution and organization of databases, which have recently made great leaps forward. I am now going to clarify a few terms.
Query and reporting tools are simple and very quick to use; they help us explore business data at various levels. Query tools retrieve the information and reporting tools present it clearly. They allow the results of analyses to be transmitted across a client-server network, an intranet or even the internet. These networks allow sharing, so the data can be analyzed on the most suitable platform.
This makes it possible to exploit the analytical potential of remote servers and receive an analysis report on local PCs. A client-server network must be flexible enough to satisfy all types of remote requests, from a simple reordering of data to ad hoc queries using Structured Query Language (SQL) for extracting and summarizing data in the database.
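To make this concrete, here is a minimal sketch of such an ad hoc extract-and-summarize query, written in Python with its built-in sqlite3 module. The database file, table and column names (company.db, sales, region, amount) are hypothetical placeholders, not taken from any particular system.

```python
import sqlite3

conn = sqlite3.connect("company.db")   # hypothetical local database file

# An ad hoc extract-and-summarize query: sales counts and totals per region.
query = """
    SELECT region,
           COUNT(*)    AS n_sales,
           SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC;
"""

for row in conn.execute(query):
    print(row)
conn.close()
```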
Data retrieval, like data mining, extracts interesting data and information from archives and databases. The difference is that, unlike in data mining, the criteria for extracting the information are decided beforehand, so they are exogenous to the extraction itself. A classic example is a request from the marketing department of a company to retrieve the personal details of all clients who have bought both product A and product B at least once. This request may be based on the idea that there is some connection between buying A and buying B, but without any empirical evidence for it. The names obtained from this extraction could then be the targets of the next publicity campaign. In this way the success percentage (i.e. the proportion of contacted customers who actually buy the advertised products) is likely to be higher than it would be otherwise. Once again, without a preliminary statistical analysis of the data, it is difficult to predict the success percentage, and it is impossible to establish whether better information about the customers' characteristics would give improved results with a smaller campaign effort.
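As an illustration, the marketing request described above could be expressed as a fixed query like the following sketch (again in Python with sqlite3; the clients and purchases tables and their columns are hypothetical). Note that the extraction criterion is written into the query in advance, which is exactly what makes this data retrieval rather than data mining.

```python
import sqlite3

conn = sqlite3.connect("company.db")   # hypothetical database

# The extraction criterion (bought A and bought B, each at least once)
# is fixed in advance: it is exogenous to the extraction itself.
query = """
    SELECT c.client_id, c.name, c.address
    FROM clients AS c
    WHERE EXISTS (SELECT 1 FROM purchases p
                  WHERE p.client_id = c.client_id AND p.product = 'A')
      AND EXISTS (SELECT 1 FROM purchases p
                  WHERE p.client_id = c.client_id AND p.product = 'B');
"""
targets = conn.execute(query).fetchall()   # the publicity campaign's target list
conn.close()
```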
OLAP is an important tool for business intelligence. The query and reporting tools describe what a database contains (in the widest sense this includes the data warehouse), whereas OLAP is used to explain why certain relations exist. The user formulates his own hypotheses about the possible relations between the variables and looks for confirmation by observing the data. Suppose he wants to find out why some debts are not paid back; he might first suppose that people with a low income and a lot of debt are a high-risk category. So that he can check this hypothesis, OLAP gives him a graphical representation (called a multidimensional hypercube) of the empirical relation between the income, debt and insolvency variables. An analysis of the graph can confirm or refute the hypothesis. OLAP therefore also allows the user to extract useful information from business databases. Unlike data mining, the research hypotheses are suggested by the user and are not uncovered from the data. Furthermore, the extraction is a purely computerized procedure; no use is made of the modeling tools or summaries provided by statistical methodology. OLAP can provide useful information for databases with a small number of variables, but problems arise when there are tens or hundreds of variables: it then becomes increasingly difficult and time consuming to formulate a good hypothesis and to analyze the database with OLAP tools to confirm or deny it.
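The following toy sketch, using the pandas library on invented data, mimics one two-way slice of such a hypercube: income and debt are binned into levels and each cell shows the observed insolvency rate.

```python
import pandas as pd

# Invented toy data: income and debt in thousands, insolvency as a 0/1 flag.
df = pd.DataFrame({
    "income":    [15, 60, 22, 80, 18, 45, 12, 25],
    "debt":      [30, 10, 40,  5, 35, 30, 50, 10],
    "insolvent": [ 1,  0,  1,  0,  1,  0,  1,  0],
})

# Bin the continuous variables into levels, as an OLAP dimension would.
df["income_level"] = pd.cut(df["income"], bins=[0, 30, 100], labels=["low", "high"])
df["debt_level"]   = pd.cut(df["debt"],   bins=[0, 25, 100], labels=["low", "high"])

# Each cell of this two-way slice of the (income, debt, insolvency) cube
# is the mean of the 0/1 insolvency flag, i.e. the observed insolvency rate.
cube = df.pivot_table(values="insolvent", index="income_level",
                      columns="debt_level", aggfunc="mean", observed=False)
print(cube)
```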
OLAP is not a substitute for data mining; the two techniques are complementary and used together they can create useful synergies. OLAP can be used in the preprocessing stages of data mining. This makes understanding the data easier, because it becomes possible to focus on the most important data, identifying special cases or looking for principal interrelations. The final data mining results, expressed using specific summary variables, can be easily represented in an OLAP hypercube.
We can summarize what we have said so far in a simple sequence that shows the evolution of business intelligence tools used to extract knowledge from a database:
QUERY AND REPORTING → DATA RETRIEVAL → OLAP → DATA MINING
Query and reporting has the lowest information capacity but is the easiest to implement; data mining has the highest information capacity but is the hardest to implement. This suggests a trade-off between information capacity and ease of implementation. The choice of tool must also consider the specific needs of the business and the characteristics of the company's information system. Lack of information is one of the greatest obstacles to efficient data mining. Very often a database is created for reasons that have nothing to do with data mining, so important information may be missing. Incorrect data is another problem. The creation of a data warehouse can eliminate many of these problems. Efficient organization of the data in a data warehouse, coupled with efficient and scalable data mining, allows the data to be used correctly and efficiently to support company decisions.
Data Mining and Statistics
Statistics has always been about creating methods to analyze data. The main difference between statistical methods and machine learning methods is that statistical methods are usually developed not only in relation to the data being analyzed but also according to a conceptual reference paradigm. Although this has made statistical methods coherent and rigorous, it has also limited their ability to adapt quickly to the new methodologies arising from new information technology and new machine learning applications. Statisticians have recently shown an interest in data mining, and this could help its development. For a long time statisticians saw data mining as synonymous with 'data fishing', 'data dredging' or 'data snooping'. In all these cases data mining had negative connotations. This idea came about because of two main criticisms.
First, there is not just one theoretical reference model but several models in competition with each other, chosen depending on the data being examined. The criticism of this procedure is that it is always possible to find a model, however complex, that adapts well to the data. Second, the great amount of data available may lead to non-existent relations being found among the data. Although these criticisms are worth considering, we shall see that the modern methods of data mining pay great attention to the possibility of generalizing results. This means that when choosing a model, its predictive performance is considered and more complex models are penalized. It is also difficult to ignore the fact that many important findings are not known beforehand and so cannot be used in developing a research hypothesis; this happens in particular when there are large databases. This last aspect is one of the characteristics that distinguish data mining from statistical analysis. Whereas statistical analysis traditionally concerns itself with analyzing primary data that has been collected to check specific research hypotheses, data mining can also concern itself with secondary data collected for other reasons. This is the norm, for example, when analyzing company data that comes from a data warehouse. Furthermore, statistical data can be experimental data (perhaps the result of an experiment which randomly allocates all the statistical units to different kinds of treatment), but in data mining the data is typically observational.
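The earlier point about penalizing complexity can be illustrated with a small simulation: a high-degree polynomial can always be made to fit the observed (training) data well, but its predictive performance on held-out data reveals the overfitting. The sketch below uses NumPy on simulated data where the true relation is linear; all the numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = 1.5 * x + rng.normal(0, 0.4, 60)      # the true relation is linear

x_tr, y_tr = x[:40], y[:40]               # training set
x_te, y_te = x[40:], y[40:]               # held-out test set

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    # Training error can only fall as the degree grows; held-out error
    # typically stops improving, exposing the more complex fits.
    print(f"degree {degree}: train MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")
```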
Berry and Linoff (1997) distinguish two analytical approaches to data mining: top-down analysis (confirmative) and bottom-up analysis (explorative). Top-down analysis aims to confirm or reject hypotheses and tries to widen our knowledge of a partially understood phenomenon; it achieves this principally by using the traditional statistical methods. Bottom-up analysis is where the user looks for useful information previously unnoticed, searching through the data and looking for ways of connecting it to create hypotheses. The bottom-up approach is typical of data mining. In reality the two approaches are complementary.
In fact, the information obtained from a bottom-up analysis, which identifies important relations and tendencies, cannot by itself explain why these discoveries are useful or to what extent they are valid. The confirmative tools of top-down analysis can be used to confirm the discoveries and to evaluate the quality of decisions based on them. There are at least three other aspects that distinguish statistical data analysis from data mining. First, data mining analyzes great masses of data. This implies new considerations for statistical analysis. For many applications it is impossible to analyze, or even access, the whole database for reasons of computational efficiency; it therefore becomes necessary to draw a sample of the data from the database being examined. This sampling must take account of the data mining aims, so it cannot be performed using traditional statistical theory alone. Second, many databases do not lead to the classic forms of statistical data organization; data that comes from the internet is one example. This creates a need for appropriate analytical methods from outside the field of statistics. Third, data mining results must be of some consequence: constant attention must be given to the business results achieved with the data analysis models.
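As a simple illustration of sampling from a source too large to load into memory, the sketch below implements reservoir sampling (Vitter's Algorithm R), which draws a uniform random sample in a single pass over the records. As noted above, sampling for data mining may in practice need to reflect the aims of the analysis, which this plain uniform scheme does not attempt to do.

```python
import random

def reservoir_sample(rows, k, seed=42):
    """Draw a uniform random sample of size k from an iterable in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # keep this row with probability k/(i+1)
            if j < k:
                sample[j] = row
    return sample

# Usage: 'rows' can be any record stream, e.g. a database cursor;
# here a large range stands in for a table too big to load at once.
print(reservoir_sample(range(1_000_000), k=5))
```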
In conclusion, there are reasons for believing that, from a statistical viewpoint, data mining is nothing new. But there are also reasons to support the idea that, because of their nature, statistical methods should be able to study and formalize the methods used in data mining. This means that, on the one hand, we need to look at the problems posed by data mining from a statistical viewpoint that is attentive to their practical utility, while on the other hand we need to develop a conceptual paradigm that allows statisticians to bring data mining methods back into a general and coherent scheme of analysis.