I: Data Mining and Analysis

Emping_0.6 User Guide

Interactive Test of a Case

Three Examples of Data Mining with Emping

Emping is a tool for discovery and analysis of predictive relations in tables of nominal data. It is based on a new algorithm and it is implemented in the functional programming language Haskell. These pages present possible analyses of, and results from, two data sets from the UCI Machine Learning Repository. The Mushroom and Audiology spreadsheet readable tables were also imported into SQLite databases, so that results could be expressed in SQL and tested easily. The first example discusses factors which indicate a class of mushroom, edible or poisonous. The second example is about determining the habitat of these mushrooms. We also discovered that those mushrooms which only have a grasses habitat are all edible. The third example is an analysis of relations between patients, symptoms and diagnosis in a medical data set.

Development of Emping started in 2006 with the discovery of a new algorithm for the mining of predictive rules in tables of nominal data. This algorithm was implemented in Haskell, which is a functional programming language. Subsequent versions of Emping 0.1 to the current version 0.6 are all available on Hackage, the open source repository of Haskell libraries and applications.

The major difference between the Emping algorithm and other data mining technologies is that Emping is not statistical, and that it searches from the general to the specific instead of the other way around. Because the model is logical and not statistical all rules and predicates have the same weight, regardless of their frequencies. For example, the rule that ravens are black would be refuted by the existence of just one white raven. This could be considered a disadvantage for data sets where minorities should be ignored, but it is an advantage for cases where they should not be. The logic model also allows for some interesting rule discoveries and facilitates verification using SQL, the relational database (system) query language.

The algorithm itself is based on a few very simple principles.
First of all, if a conjunction (AND) of predicates implies something, say p, and there is no rule which contradicts a single predicate in that conjunction, the disjunction of predicates (OR) is valid. This is the most general solution, the hypothesis. Through matching this hypothesis with all rules that imply something else (not p) we get a possible solution by contradiction. This step is called falsification, and the result is a conjunction of disjunctions. This must be transformed to a disjunction of conjunctions to get the possible rules, and this step is the most costly part of the calculation. Thirdly, each possible reduction must be a subset of an actual rule. This verification is needed to weed out those rules which could have p as well as not p as a consequent.

The end result is a list of all shortest possible rules that imply p. Any additional predicate in the rule is either superfluous or non-existent. Omitting any predicate from a rule results in a contradiction. In Emping this process of deriving all shortest rules is called reduction.

But now, since each reduced rule stands for several original rules, we can find a partial order for reduced rules and express it in a graph. This process is called abduction. We can also find all minimal elements in the poset, those which are not implied by any other, and the maximal elements, those which do not imply any other rules. This set of maximal elements is the set of rules which imply the predicate p best, with the caveat that one or more could still be left out, if the others cover all originals. The most general solution is not a partition of the data set. However, there is no apparent preference of one rule over the other.

Usage of Emping is explained in a separate user guide , but it is almost self explanatory. Here is a screen shot at start up.

screen shot emping at start

Emping has been applied to the Mushroom and the Audiology data sets from the UCI Machine Learning Repository from Asuncion, A. & Newman, D.J. (2007). CA: University of California, School of Information and Computer Science.

For the Class attribute of the first data set, 3635 heuristic rules were found, 1655 for edible and 1980 for poisonous mushrooms. Each of these can now be interactively tested against the source. Links A, C and D describe some tests and two more case studies.

  1. Singleton predictors for edible/poisonous mushrooms
  2. Interactive Test of all 3635 rules
  3. Determining the habitat of mushrooms from a table
  4. Analysing patients, symptoms and diagnoses in medical data

data analytic boolean algebra

Download the example .csv results and data files in a .zip file

Download tar.gz with compiled Emping_0.6 binary for Linux md5sum (digital signature) for the executable: 6ba08169c7f4833a4f69ac7ad14caa27
Download tar.gz with Emping_0.6 source code

WHITE PAPERS


II: IT Authoring and Journalism

All publications are listed on the following pages:

Articles which appeared in Computable are archived there.


III: Haskell Tutorials

An adaptation of the tutorial: Developing Gnome Apps with Glade, by Eddy Ahmed,
for the functional programming language Haskell and the Gtk2Hs graphics toolkit:
Getting Started with Glade and Gtk2Hs . This is now outdated.

The new tutorial for Glade 3 by Alex Tarkovsky is the official
Glade and Gtk2Hs documentation.

A Spanish translation: Español

An extensive Gtk2Hs programming guide, based on the GTK+2.0 tutorial by Tony Gale and Ian Main. The
Gtk2Hs Tutorial consists of 22 chapters, divided into 7 sections, plus an appendix on starting drawing with Cairo.
Note: Some examples might not work in newer Gtk2hs versions (> 0.9.12). Event handling, in particular, appears to have changed!

A Spanish translation: Español

A quite extensive tutorial on Haskell monads in a variation on the for dummies style: The Greenhorn's Guide to becoming a Monad Cowboy



Valid XHTML 1.0 Strict

Valid CSS!

Site Last Modified April 23, 2014