Documentation
1 Installation and check
Installation of the tools requires Python to be present on the host system. Versions from the 2.5 and 2.6 branches are known to work; others may or may not. Python 3.0 is not supported. Development took place under Debian/Linux and no testing for Windows compatibility was done, but all components are supposed to be OS-independent. Other requirements include (see also the README file in the tarball):
- lxml for XML parsing
- NLTK for accessing WordNet and linguistic processing — this might require additional modules as well. Important note: in order to match against older versions of WordNet, please put the appropriate WordNet database files into ${NLTK_DATA}/corpora (use e.g. ${NLTK_DATA}/corpora/wordnet-1.6 for WordNet 1.6 files; see the example after this list)! Older versions of WordNet are not included in the NLTK corpus collection.
- simplejson for JSON parsing — this is not required when using Python 2.6 or later
- Weka for classification and training — this was written in Java and is known to run on Sun Java 1.6. I had no luck getting it to work with gij (GNU libgcj) version 4.3.2. Use a Weka version from the 3.6.* branch.
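For the WordNet note above, copying an older database into place could look like this (the source path is illustrative and assumes the standard dict/ layout of a WordNet 1.6 installation):

[user@host]$ mkdir -p ${NLTK_DATA}/corpora/wordnet-1.6
[user@host]$ cp /path/to/WordNet-1.6/dict/* ${NLTK_DATA}/corpora/wordnet-1.6/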
To run and test Rudify (on Linux boxen):
- Unpack the archive: [user@host]$ tar xvzf rudify-0.1.6.tar.gz
- Change into the newly created directory: [user@host]$ cd rudify-0.1.6
- Set up your PYTHONPATH environment variable (this is used by Python to locate its libraries) to contain the rudify-0.1.6/lib directory: [user@host]$ export PYTHONPATH=${PYTHONPATH}:./lib (You can also move the rudify-0.1.6/lib/Rudify directory into a system-wide location like /usr/lib/python*/site-packages/, but I don't recommend that, as this should be done by automatic installation routines only.)
- Check if everything works: [user@host]$ ./bin/rudify.py -tr -ddebug eng >/dev/null
If everything went fine, you will see some lines of debug messages on the screen. The last line should say something along the lines of:
ptag/DEBUG Using: taggers/brown-3gram-tagger.pickled.
The programme now reads user input from the keyboard. The expected format is a colon-separated concept list entry as described in section 2 below. After pressing CTRL-D the query results are returned together with a summary of results. Press CTRL-D again (end of input) to leave the programme.
2 Overview of the tools
The general workflow of the toolchain is as follows:
- conceptlist.py — generate a list of concepts that are to be tagged with ontological meta properties. The format is a colon-separated text file with the following fields:
- lexical representation of target concept
- (optional) hint (e.g. hypernym, base concept)
- WordNet ID (unique)
- (optional) Concept label/ID
- rudify.py — generates, on the basis of a concept list, an ARFF file containing numerical feature vectors for each concept's lexical representation
- Training — train a model for the classification of the numerical feature vectors or use a supplied model. You can use Weka directly for training as well. This is the preferred procedure as of rudify-0.1.4.
- Classification — use Weka for classification of feature vectors. This results in an ARFF file with an additional class attribute.
- mktagger.py — create an n-gram tagger based on a given training and evaluation corpus
- arfmangler.py — ARFF file transformer and merger
The tools and necessary steps are described in detail below. Every programme under ./bin/ has a set of command line options (see e.g. ./bin/rudify.py --help for an overview) and also makes use of the module configuration file ./lib/Rudify/config.py; all relevant information/documentation is given in that file.
2.1 conceptlist.py
This is used to automatically construct lists of lexical representations for concepts, either from WordNet-3.0 or from an OWL ontology. For ontologies using OntoWordNet labels, these are matched against WordNet. Only the first lemma of a synset is used as a lexical representation; this is based on the assumption that the most common expressions are mentioned first in the synset. A typical call to conceptlist.py is:
[user@host]$ ./bin/conceptlist.py wordnet -ddebug > concept_list-WordNet.txt
The resulting file concept_list-WordNet.txt contains lines like:
living thing::00004258-n:
organism::00004475-n:
benthos::00005787-n:
dwarf::00005930-n:
where each lexical representation is associated with the synset offset and part-of-speech.
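Such a line is easy to take apart programmatically. The following minimal Python sketch shows the field layout described in section 2; the function name and field handling are illustrative and not part of the toolchain:

def parse_concept_line(line):
    # fields: lexical representation, optional hint, WordNet ID, optional label
    lexrep, hint, wnid, label = line.rstrip("\n").split(":")
    return {"lexrep": lexrep, "hint": hint or None,
            "wnid": wnid, "label": label or None}

entry = parse_concept_line("living thing::00004258-n:")
# entry["lexrep"] == "living thing", entry["hint"] is None,
# entry["wnid"] == "00004258-n", entry["label"] is None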
2.2 rudify.py
Use rudify.py to automatically query Google for lexical patterns. This programme is essentially a filter that reads in a concept list and writes an ARFF file containing the absolute and relative frequencies of the lexical representations in those patterns. The two most important command line switches are:
[user@host]$ ./bin/rudify.py -tr -u [eng|esp|ita|nld] < concept_list > results.arff 2> logfile
where -t restricts the queries to patterns representing certain metaproperties (r=rigidity, u=unity, d=dependence, i=identity; these can be freely combined) and -u is a boolean switch that activates the embedding of hints into the query; the trailing argument selects the query language. However, if no results are obtained when using hinted queries, rudify.py falls back to unhinted queries for that particular concept/lexical representation, as sketched below.
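The fallback can be pictured as follows. This is a hedged Python sketch, not rudify.py's actual code: google_hits is a placeholder for the Google query, and the relative-frequency definition (per-pattern share of the total) is an assumption:

def pattern_frequencies(lexrep, patterns, hint, google_hits):
    """Return {pattern: (absolute, relative)} for one lexical representation."""
    # patterns are templates like "is no longer a %s"
    counts = dict((p, google_hits(p % lexrep, hint)) for p in patterns)
    if hint is not None and not any(counts.values()):
        # no hits at all: fall back to unhinted queries for this concept
        counts = dict((p, google_hits(p % lexrep, None)) for p in patterns)
    total = float(sum(counts.values())) or 1.0   # avoid division by zero
    return dict((p, (n, n / total)) for p, n in counts.items())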
2.3 Supporting tools
2.3.1 mktagger.py
This tool is used to create custom part-of-speech taggers that can be used by rudify.py. As training a tagger is quite time consuming, we use serialised representations that can be easily stored and reused. Several sample taggers are included under ./taggers. See the logfiles of the serialised taggers for further information.
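Building and serialising such a tagger with NLTK boils down to a few lines. The following is a rough sketch of what mktagger.py automates (the corpus slice and default tag are illustrative), reusing the pickled tagger name from the check in section 1:

import pickle
from nltk.corpus import brown
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger

train = brown.tagged_sents()[:50000]     # training portion of the corpus
t0 = DefaultTagger('NN')                 # last-resort guess for unseen tokens
t1 = UnigramTagger(train, backoff=t0)
t2 = BigramTagger(train, backoff=t1)
t3 = TrigramTagger(train, backoff=t2)

out = open('taggers/brown-3gram-tagger.pickled', 'wb')
pickle.dump(t3, out)                     # reuse later via pickle.load()
out.close()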
2.3.2 aggregator.py
The aggregator is a quick (but hopefully not so dirty) hack to join attributes and associated values of an ARFF file given a custom function. This allows for generalisations over formally different patterns (e.g. "is no longer X", "is no longer a X", "is no longer an X"), creating a new attribute (e.g. "is no longer (|a|an) X") that replaces the old ones.
Example call: ./bin/aggregator.py -d 0,1:keep 2-4:max < input.arff > output.arff
Attribute ranges need not be consecutive. They can be lists of column indices or ranges thereof. Counting starts at 1 (in accordance with Weka).
Custom functions so far implemented:
- keep, do nothing but keep the attributes
- min, minimum of attributes' values
- max, maximum of attributes' values
- sum, sum of attributes' values
Attributes not mentioned in the aggregate description are discarded. "Unknown" ("?") values are excluded from calculations but are preserved by keep.
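The function semantics can be illustrated with a small Python sketch (names are illustrative, not aggregator.py's internals); note how unknown values are dropped before calculating but passed through by keep:

def aggregate(values, func):
    known = [float(v) for v in values if v != "?"]
    if func == "keep":
        return values                  # unchanged, "?" preserved
    if not known:
        return ["?"]                   # nothing known to calculate on
    if func == "min":
        return [min(known)]
    if func == "max":
        return [max(known)]
    if func == "sum":
        return [sum(known)]
    raise ValueError("unknown aggregate function: " + func)

print(aggregate(["0.5", "?", "0.25"], "max"))   # [0.5]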
3 Training and classification
Weka is used both for training a suitable model and for the actual classification task (class prediction). In principle, every machine learning software capable of dealing with ARFF files should be usable as a drop-in replacement for Weka. At the time of writing (June 2009), Weka-3.6.1 is the current stable release and is known to work. For the workflow described here, only the software archive (weka.jar) is needed; it comprises the implemented filters and classifiers as well as an easy-to-use graphical interface. Weka is started by issuing the following command:
[user@host]$ java -jar weka.jar
The following steps are performed in Weka's explorer module.
3.1 Training a model
Training a model from a given training set is relatively straightforward. The minimum steps required are:
- In the preprocess panel open the ARFF file containing the training set.
- In the classify section choose a classifier. Many classifiers cannot deal with string attributes, so often weka.classifiers.meta.FilteredClassifier needs to be used. Configure the FilteredClassifier to filter out the offending string attribute for classification (weka.filters.unsupervised.attribute.Remove -R 1) and to call the actual classifier.
- Choose the right class attribute. By default Weka assumes the last attribute defined, which is not the case for Rudify output.
- Start the training run.
- Right click the newly created entry in the results list and save the model to a local file.
You can (and should) iterate training using different classifiers and parameters to see how the classification changes until the results are satisfactory. (A command-line equivalent of a training run is sketched below.)
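Such a training run can also be scripted on the command line. This is a hedged sketch: the classifier choice, file names and the class index are illustrative, so adjust -c to the position of your class attribute:

[user@host]$ java -cp weka.jar weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.Remove -R 1" -W weka.classifiers.trees.J48 -t results.arff -c 2 -d rudify.model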
3.2 Classification
- In the preprocess panel open the ARFF file containing the data set you want to tag.
- Choose the filter weka.filters.supervised.attribute.AddClassification.
- Click on the filter's name in the field next to the Choose button and set up its properties: set remove old class to true, as well as output classification, and specify your serialized classifier file (the locally saved model).
- Select the old class attribute in the attribute list below the filter setup.
- Click apply to run your classification setup on the dataset.
- Have a quick glance at the classification results: In the lower right corner of the window select the newly created classification attribute for visualization.
- Save your tagged data set to a file.
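The classification step, too, has a command-line counterpart. The following sketch is an assumption based on the AddClassification filter's options (option names may differ between Weka versions, so check the filter's -h output; file names and class index are illustrative):

[user@host]$ java -cp weka.jar weka.filters.supervised.attribute.AddClassification -serialized rudify.model -classification -remove-old-class -c 2 -i untagged.arff -o tagged.arff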