Mining the data-lode
We may be on the slip road to the information superhighway, but unfortunately it is rush hour all day every day. It does not help that the road is clogged with HGVs with "Bulk Data" stencilled on their sides.
Data builds up at an alarming rate; according to one estimate, the amount of information in the world doubles every 20 months. The accumulators of this data range from the exotic to the mundane: satellites scan the earth, and beam back pictures; every cheque clearance, every cash withdrawal, every credit card transaction is logged. In the supermarket, point-of-sale barcode readers record every can of beans purchased.
Every organisation is accumulating mountains of data on operations and performance. Researchers are creating numerous databases for future workers. We are becoming a "data rich" society, but we remain "information poor". Important information is often buried in the obscuring mass of data. Computers are getting bigger and faster. Brute force alone, though, won't find the meaning in the data; dynamiting a mountain is a hard way to look for diamonds.
"Data mining" describes the identification and extraction of high-value "nuggets" of information from high-volume data.
One technology area with particular promise for data mining is "machine learning" -- computer systems which, shown previous cases with known outcomes, can "learn" to replicate these and generalise from them to make judgments on future cases. Some machine learning techniques are used commercially. These include the evocatively-named "neural networks" -- simple simulations of fragments of the nervous system which are "trained" on previous cases -- and "rule induction", in which decision trees discriminate between different outcomes.
These techniques compensate for each other's weaknesses. Although neural networks can be highly accurate, their operation is opaque; trained neural networks are just sets of numbers, and it is not generally possible to work out how they arrive at decisions in any meaningful way.
By contrast, rule induction builds an explicit model of the decision-making process; this can be read, understood and validated. Induction, however, is weaker in areas with numeric outcomes or where data is "noisy" -- where the data contains spurious errors or contradictory information.
Obviously, the sensible thing is to combine the two techniques. Pioneering work in this area has been carried out by Integral Solutions (ISL), a United Kingdom software company.
Since 1989, ISL has carried out data mining projects using sophisticated neural network and rule induction technology. These have predicted television audience sizes for the BBC based on previous viewing figures; forecast the turnover of retail outlets; identified faults in manufacturing equipment based on historical error logs; and built systems to target direct marketing effort by "learning" a profile of existing customers.
The ISL-led Project Clementine was an attempt to create software which would give non-technologist end-users -- the data "owners" -- access to the benefits of machine learning technology.
The proposal won SMART awards in 1992 and 1993 from the Department of Trade and Industry in a competition for new technological developments, judged on innovation and market potential.
Clementine connects users to a wide range of data sources and provides interactive graphics, support for hypothesis testing, and room to experiment with ideas of how the data might behave.
The main challenge of the project was to make it user-friendly.
With input from university researchers in the UK, Hungary and Singapore, ISL built an "expert system" to manage the machine learning modules. Clementine looks at the data and chooses a suitable configuration for the network or rule induction; the user is insulated from technology considerations.
As part of an earlier collaboration with Reading University and GEC Plessey Semiconductors, ISL developed a "visual programming" interface to make on-line plant data accessible to non-IT managers. Similar technology was used in Clementine's user interface; users drive the system simply by selecting icons representing data sources and operations, connecting them to specify the flow of data and editing their attributes to fine-tune their behaviour.
Clementine was launched in June 1994, and is used in areas as diverse as dentistry and foreign exchange trading. In a Clementine session, a user will start by browsing the data, using visualisation and statistics to find promising relationships. Say the user is trying to determine which factors make a patient respond to particular medication; a scatter plot with drug response overlaid may indicate a link between the ratio of two blood components and response to one of the drugs. The user can derive this ratio as a new variable and test the connection statistically.
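To make the "derive and test" step concrete, here is a minimal sketch in Python. It is not how Clementine works internally; the patient records, the units and the choice of an exact permutation test are all assumptions made purely for illustration. It derives a ratio of two blood components as a new variable, then asks how often a gap this large between responders and non-responders would arise if the labels were shuffled.

```python
from itertools import combinations
from statistics import mean

# Hypothetical records: (sodium, potassium, responded) in arbitrary units.
patients = [
    (40.0, 10.0, False), (60.0, 10.0, False), (80.0, 10.0, False),
    (120.0, 10.0, True), (140.0, 10.0, True), (160.0, 10.0, True),
]

# Derive the new variable: the ratio of the two blood components.
ratios = [sodium / potassium for sodium, potassium, _ in patients]
responded = [flag for _, _, flag in patients]

def mean_difference(values, group_indices):
    """Mean of the chosen group minus the mean of everyone else."""
    chosen = set(group_indices)
    inside = [v for i, v in enumerate(values) if i in chosen]
    outside = [v for i, v in enumerate(values) if i not in chosen]
    return mean(inside) - mean(outside)

responder_indices = [i for i, flag in enumerate(responded) if flag]
observed = mean_difference(ratios, responder_indices)

# Exact permutation test: relabel every possible set of "responders" and
# count how often the difference is at least as large as the observed one.
relabelings = list(combinations(range(len(ratios)), len(responder_indices)))
as_extreme = sum(
    1 for group in relabelings
    if mean_difference(ratios, group) >= observed - 1e-12
)
p_value = as_extreme / len(relabelings)

print(f"observed difference in mean ratio: {observed:.1f}")
print(f"one-sided permutation p-value: {p_value:.2f}")
```

With only six invented patients the test can do no better than one relabelling in twenty, which is why real analyses need far more cases than this toy example.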
Next, the user might train a neural network and induce a rule to model drug response, using the original data plus the derived variable. The induced rule will describe what determines response to each drug, for instance "People with a blood sodium to potassium ratio greater than 10 and high blood pressure respond to Drug C".
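The flavour of rule induction can be shown with a deliberately tiny sketch in Python. It learns a single threshold on one numeric variable, sometimes called a "decision stump"; real rule induction builds full decision trees over many variables, and nothing here should be read as ISL's algorithm. The data and the function name induce_threshold_rule are hypothetical.

```python
# Minimal rule induction sketch: learn one threshold rule from known cases.
# Data and names are hypothetical illustrations, not Clementine's method.

def induce_threshold_rule(values, outcomes):
    """Return (threshold, label_above, label_below, accuracy) for the best split."""
    labels = sorted(set(outcomes))
    ordered = sorted(set(values))
    # Candidate thresholds lie midway between adjacent distinct values.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best = None
    for threshold in candidates:
        for above in labels:
            for below in labels:
                # Count the cases this candidate rule classifies correctly.
                correct = sum(
                    1 for v, o in zip(values, outcomes)
                    if (above if v > threshold else below) == o
                )
                accuracy = correct / len(values)
                if best is None or accuracy > best[3]:
                    best = (threshold, above, below, accuracy)
    return best

# Hypothetical sodium-to-potassium ratios and observed drug responses.
ratios = [4.0, 6.0, 8.0, 12.0, 14.0, 16.0]
responses = ["no", "no", "no", "yes", "yes", "yes"]

threshold, above, below, accuracy = induce_threshold_rule(ratios, responses)
print(f"IF ratio > {threshold} THEN {above} ELSE {below} (accuracy {accuracy:.0%})")
```

The point of the exercise is the output: unlike a trained network, the result is a sentence a clinician can read and challenge.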
The rules and nets can be tested on more data, and Clementine will help analyse the performance of the rule and the network. Quantitatively, it may report that the rule is 94 per cent accurate, the network is 91 per cent accurate, and in cases where both agree the accuracy rises to 97 per cent. Qualitatively, it might reveal that people with high blood pressure and normal cholesterol levels, aged over 50, are consistently misclassified as being likely to respond to Drug X when they should be prescribed Drug Y.
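The quantitative comparison can be sketched in a few lines of Python. The predictions below are invented, so the percentages will not match the figures quoted above; the sketch only shows how accuracy for each model, and accuracy restricted to the cases where both agree, might be computed.

```python
# Hypothetical held-out cases: the true drug and each model's prediction.
actual  = ["C", "C", "A", "A", "C", "A", "C", "A"]
rule    = ["C", "C", "A", "A", "C", "C", "C", "A"]
network = ["C", "C", "A", "C", "C", "A", "C", "A"]

def accuracy(predicted, truth):
    """Fraction of cases where the prediction matches the true outcome."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

rule_accuracy = accuracy(rule, actual)
network_accuracy = accuracy(network, actual)

# Keep only the cases where the rule and the network give the same answer.
agree = [i for i in range(len(actual)) if rule[i] == network[i]]
coverage = len(agree) / len(actual)
agree_accuracy = (
    sum(rule[i] == actual[i] for i in agree) / len(agree) if agree else 0.0
)

print(f"rule: {rule_accuracy:.1%}, network: {network_accuracy:.1%}")
print(f"both agree on {coverage:.0%} of cases, accuracy there: {agree_accuracy:.0%}")
```

The trade-off is visible in the coverage figure: agreement buys higher accuracy, but only for the share of cases on which the two models concur.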
Research in most academic disciplines has scope for data mining. In chemistry and related sciences properties of new compounds can be inferred from those already known. Descriptions derived from bodies of text can be used to "learn" to recognise the work of particular authors. New economic models can be derived from historical data. And of course, the techniques can be applied to data on the operation of institutions themselves -- helping, for example, to pinpoint why some students are successful while others are not.
As we accumulate more and more data, extracting meaning from it will become critical.
Colin Shearer is director of ISL's data mining division. email firstname.lastname@example.org. Tel 01256 882028.