Emping User Guide, Version 0.5.1

Author: Hans van Thiel, April 2008
email: hthiel.char@zonnet.nl

New in Version 0.5

1. Overview

1.1. What

Emping is a utility that derives heuristic rules from nominal data. Nominal data are qualitative and unordered, as in:

Class is actually an ordinal attribute, but when the order is disregarded, it is nominal.

Heuristic rules consist of attribute values (predicates) that together imply another attribute value. For example:

Color: green and Proposition 1: True is Class: B

Heuristic rules are purely empirical, with no foundation in a theory or model. The input of Emping is just a table of nominal facts. The user has to select which attribute is to be the consequent. Then Emping derives all shortest combinations of factors which, in the table, imply the values of the selected consequent.

Each reduced rule is a generalization of one or more original rules, and therefore reduced rules may imply other reduced rules or be equivalant to others. So the reductions are partially ordered, and this partial order can be shown in a directed graph.

1.2. How

Emping reads a file in a comma seperated format (.csv) as produced by the Open Office Calc spreadsheet, and returns the results as .csv files that can be read by OO Calc. The graphs are in Graphviz format, and can be displayed by a Graphviz reader like dotty or ZRG Viewer.

Emping 0.5 has a GUI and you can start it like any other application.

Emping

The general idea is straightforward; see the screenshots for details. First, you mark or unmark in the options menu whether you want to check the data for duplicates. These have no effect on the result, but will slow the program down, and they are automatically removed if the option is selected (but not from the source file itself). If there are duplicates you can see which, with their frequencies, by saving them.

Next you select the consequent attribute from a popup menu. If you have the relevant options menu item checked, emping will look for ambiguous rules for the selected consequent.

Ambiguous rules are rules which are exactly the same, except for the consequent value. For example, an animal name may not be uniquely determined by a given description, only a group of animals. So the result is indecisive for all possible subsets of this description, as well as the longer original. The solution is to merge these rules with a new value, which denotes the union of the ambiguous rules, the group.

However, it is possible to leave ambiguities in the data, as they will just cancel each other out. But if all rules for some consequent value are ambiguous, there will be no rules at all for this value, and this will result in a program error which blocks Emping. (See known bugs and issues.)

Therefore it is recommended to run Emping only with no duplicates and unambiguous rules, though you can omit the checks, if you want, at your own risk.

See the white paper Deriving Heuristic Rules from Facts for more.

Pressing the Reduce button will start the actual reduction.

Reduced rules may themselves imply, or be equivalent to, other reduced rules with the same consequent value. The Top button will produce only the top level of these rules (saved in a .csv file).

The reduced normal form file, the top level rule file and the others (if present), can now be loaded into OO Calc.

To see the partial order in Graphviz graphs you must first get the graph legend. This file, in .csv format, shows the nodes in the first column and the equivalence groups it denotes in the following rows. The legend is useful in itself because, unlike the top level only file, it shows all reduced rules grouped as equals.

Now you can construct the graph for all attributes and also separate graphs for each value of the attribute. The reverse option lets you choose between the most general at the top(default), or the most specific at the top (reversed).

1.3. More

More about the principles on which emping is based can be found in the mentioned white paper Deriving Heuristic Rules from Facts , which is also included in the package mentioned below. The general idea is also illustrated by: Fishing

2. Known bugs and issues

Some Example Screenshots

An Open Office Calc data file:

Mushrooms Data in OO Calc

The reduction of the Mushrooms data set with Class as the consequent.

Mushrooms reduced rules in OO Calc

It turns out there are 21 single attribute values that determine whether a mushroom is poisonous, and 23 whether it is edible. But Emping also finds all combinations of two, of three, etc. It is easy to verify these rules in the original data by using the Data Filter tool in OO Calc.

The most general rules only (Tops) for a data file of zoo animals with Type as the consequent.

most general rules for zoo Type

It turns out the 824 reduced rules all imply 9 equivalence classes. Type 1 is completely determined by the Milk : yes value only. The dependency graph for this value, however, shows some very complicated entailments (not shown here).

The graph legend for a medical file of 200 audiology patients, 71 classes of symptoms and 24 types of diagnosis. Each patient is anonymously identified by a letter and a number.

Audiology Class graph Legend

Each of the 1200 nodes identifies clusters of symptoms that predict a specific diagnosis. The attribute graph is very large and complex, but there are distinct differences for each diagnosis.

This is the value graph for conductive discontinuity, viewed with the dotty viewer and editor which comes with Graphviz.

Audiology value graph conductive discontinuity

The first number is the node, the second the number of equivalences in that node. All graphs (for the same attribute) use the same legend.

The situation is very different, however, for cochlear age and noise .

Audiology Cochlear age and noise ZRG Viewer

This image has been produced by ZRG Viewer, a viewer dat works with Graphviz and produces scaleable vector graphic (.svg) images. This is a zoom in on some nodes in the same graph. Using dotty or ZRG Viewer you can trace paths and analize the relations in detail.

Audiology Cochlear age and noise zoom inZRG Viewer

As before, the first number refers to the node in the legend, the second to the number of equivalent classes. To recall, each member of an equivalence class is a different generalization of the same original rules (rows) from the data table.

Finally, here is the graph for bellspalsy ,this time in dotty again.

Bellspalsy

It turns out there is only one patient with this diagnosis in the data file. This is p77, uniquely identified by the symptom viith_nerve_signs : yes . However, there appear to be 561 more clusters of symptoms that define this patient and diagnosis.

Data Files

The examples shown are from the data files Zoo, Mushrooms and Audiology, all available at the UCI Machine Learning Repository . Thanks to: Asuncion, A. & Newman, D.J. (2007). UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science.

Notes

The Emping utility is written in Haskell, and has been developed and tested on the Fedora Core Linux platform, using the Haskell tools which are available as FC packages. Version 0.5 has been developed on FC8 with GHC 6.8.2 and Gtk2Hs 0.9.12. Earlier versions of GHC will probably not compile, and Gtk2Hs versions earlier than 0.9.12 don't support the implemented menu fields, and will not work.

But GHC and Gtk2Hs are implemented on many platforms, including Windows and Linux versions, so Emping-0.5 should work on those platforms too. Open Office Calc, Graphviz and ZRGViewer (which depends on Graphviz) are open source and widely available.

The time needed to perform a reduction or to produce a graph is hard to predict, since they depend on the complexity as well as the size of the data. In the examples, the shortest time was for the zoo animals (a few seconds) and the longest for the mushrooms attribute graph (45 minutes).

The reduction should be, and appears to be, linearly proportional to the number of rules, but the graph building process tests each reduction against each original rule. The complexity of that is quadratic. The reduction time is, at least, a multiple of the number of values in the consequent class.

The source code of Emping is available in a documented package under GPL license. download Haskell Cabal package

A binary version for Linux is also available. The md5sum (digital signature) for the emping file is: 49ee03da4742369ec16769a9bd5be5f2 . To be absolutely sure the file is OK send me a request for the md5sum checkfile.

Any comments, bug reports, feature requests or remarks will be most welcome.

Emping stands for empirical reasoning or the Indonesian snack with that name.