Documentation

1 Installation and check

Installation of the tools requires Python to be present on the host system. Versions from the 2.5 and 2.6 branches are known to work; others may or may not. Python 3.0 is not supported. Development took place under Debian/Linux and no testing for Windows compatibility was done; all components are supposed to be OS-independent, though. Other requirements include (see also the README file in the tarball):

To run and test Rudify (on Linux boxen):

If everything went fine, you will see some lines of debug messages on the screen. The last line should say something along the lines of:

ptag/DEBUG Using: taggers/brown-3gram-tagger.pickled.

The programme now reads user input from the keyboard. The expected format is a colon-separated line as produced by conceptlist.py (see section 2.1). Press CTRL-D to end the input; the results and a summary are returned. Press CTRL-D again to leave the programme.
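A single input line has the same format as the concept list entries described in section 2.1, e.g.:

living thing::00004258-n: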

2 Overview of the tools

The general workflow of the toolchain is as follows:

The tools and necessary steps are described in detail below. Every programme under ./bin/ has a set of command line options. See e.g. ./bin/rudify.py --help for an overview. Every programme under ./bin/ also makes use of the module configuration file ./lib/Rudify/config.py. All relevant information/documentation is given in that file.

2.1 conceptlist.py

This is used to automatically construct lists of lexical representations for concepts, either from WordNet-3.0 or from an OWL ontology. For ontologies using OntoWordNet labels, these are matched against WordNet. Only the first lemma of a synset is used as a lexical representation; this is based on the assumption that the most common expressions are mentioned first in the synset. Typical calls to conceptlist.py are:

[user@host]$ ./bin/conceptlist.py wordnet -ddebug > concept_list-WordNet.txt

The resulting file concept_list-WordNet.txt contains lines like:

living thing::00004258-n:
organism::00004475-n:
benthos::00005787-n:
dwarf::00005930-n:

where each lexical representation is associated with the synset offset and part-of-speech.
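The format is simple enough to be processed by other scripts as well. The following lines are only an illustrative sketch (not part of the toolchain) of how such a line can be split in Python:

    # Illustrative sketch (not part of the toolchain): split a concept list
    # line of the form "<lexical representation>::<offset>-<pos>:".
    def parse_concept_line(line):
        lemma, sep, rest = line.strip().rstrip(':').partition('::')
        offset, sep, pos = rest.partition('-')
        return lemma, offset, pos

    # parse_concept_line("living thing::00004258-n:")
    # -> ('living thing', '00004258', 'n')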

2.2 rudify.py

Use rudify.py to automatically query Google for lexical patterns. This programme is essentially a filter that reads in a concept list and writes an ARFF file containing the absolute and relative frequencies of the lexical representations in those patterns. The two most important command line switches are:

[user@host]$ ./bin/rudify.py -tr -u [eng|esp|ita|nld] < concept_list > results.arff 2> logfile

where -t restricts the queries to patterns representing certain metaproperties (r=rigidity, u=unity, d=dependence, i=identity; these can be freely combined) and -u is a boolean switch that activates the embedding of hints into the query. However, if no results are obtained when using hinted queries, rudify.py falls back to unhinted queries for that particular concept/lexical representation.
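The hint fallback can be pictured as in the following sketch. This is only an illustration of the behaviour described above; search_hit_count() and the exact way hints are embedded are hypothetical stand-ins, not the actual implementation of rudify.py.

    # Illustrative sketch only: query with a hint first, fall back to the
    # unhinted query if the hinted one returns no results.
    def search_hit_count(query):
        # hypothetical stand-in for the actual search engine call
        raise NotImplementedError

    def pattern_frequency(lexrep, pattern, hint=None):
        # pattern contains the placeholder X, e.g. "is no longer a X"
        phrase = pattern.replace('X', lexrep)
        if hint is not None:
            hits = search_hit_count('"%s" %s' % (phrase, hint))
            if hits > 0:
                return hits
            # hinted query yielded nothing: fall back to the unhinted query
        return search_hit_count('"%s"' % phrase)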

2.3 Supporting tools

2.3.1 mktagger.py

This tool is used to create custom part-of-speech taggers that can be used by rudify.py. As training a tagger is quite time consuming, we use serialised representations that can be easily stored and reused. Several sample taggers are included under ./taggers; see the logfiles of the serialised taggers for further information.
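The name brown-3gram-tagger.pickled suggests an n-gram tagger trained on the Brown corpus and serialised with pickle. The following lines are a rough sketch of how such a tagger could be built with NLTK; the actual behaviour and options of mktagger.py may differ:

    # Rough sketch, assuming NLTK: train a trigram tagger with bigram/unigram
    # backoff on the Brown corpus and serialise it.
    import pickle
    import nltk

    train_sents = nltk.corpus.brown.tagged_sents()
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    t3 = nltk.TrigramTagger(train_sents, backoff=t2)

    f = open('taggers/brown-3gram-tagger.pickled', 'wb')
    pickle.dump(t3, f)
    f.close()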

2.3.2 aggregator.py

The aggregator is a quick (but hopefully not so dirty) hack to join attributes and associated values of an ARFF file given a custom function. This allows for generalisations over formally different patterns (e.g. "is no longer X", "is no longer a X", "is no longer an X"), creating a new attribute (e.g. "is no longer (|a|an) X") that replaces the old ones.

Example call: ./bin/aggregator.py -d 0,1:keep 2-4:max < input.arff > output.arff

Attribute ranges need not be consecutive. They can be lists of column indices or ranges thereof. Counting starts at 1 (in accordance with Weka).

Custom functions so far implemented:

Attributes not mentioned in the aggregate description are discarded. "Unknown" ("?") values are excluded from calculations but are preserved by keep.
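Per data row, the chosen function is applied over the selected columns, skipping unknown values. The following is an illustrative sketch only (the real aggregator.py operates on complete ARFF files and may differ in details):

    # Illustrative sketch only: apply "max" over a group of columns of one
    # data row, ignoring unknown ("?") values.
    def aggregate_max(row, columns):
        # row: list of string values; columns: 1-based indices (Weka style)
        values = [float(row[i - 1]) for i in columns if row[i - 1] != '?']
        if not values:
            return '?'      # all values unknown: the result is unknown as well
        return max(values)

    # Columns 2-4 of a row, as in "2-4:max":
    # aggregate_max(['1', '0.2', '?', '0.5'], [2, 3, 4])  ->  0.5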

3 Training and classification

Weka is used both for training a suitable model and for the actual classification task (class prediction). In principle, every machine learning software capable of dealing with ARFF files should be usable as a drop-in replacement for Weka. As of the time of writing (June 2009), Weka-3.6.1 is the current stable release and known to work. For the workflow described here, only the software archive (weka.jar) is needed. It comprises the implemented filters and classifiers as well as an easy-to-use graphical interface. Weka is started by issuing the following command:

[user@host]$ java -jar weka.jar

The following steps are performed in Weka's Explorer module.

3.1 Training a model

Training a model from a given training set is relatively straightforward. The minimum steps required are:

You can (and should) repeat training with different classifiers and parameters to see how the classification changes, until the results are satisfactory.
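If you prefer the command line over the graphical interface, a model can also be trained and saved with standard Weka commands (this is generic Weka usage, not specific to this toolchain), e.g. using the J48 decision tree learner:

[user@host]$ java -cp weka.jar weka.classifiers.trees.J48 -t results.arff -d j48.model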

3.2 Classification