MEKA
The MEKA project provides an open-source implementation of methods for multi-label classification. It is based on the WEKA Machine Learning Toolkit from the University of Waikato and grew out of work presented in these publications. It includes the pruned sets and classifier chains methods, several benchmark methods, and a wrapper to the MULAN framework.
MEKA does not yet integrate with the WEKA GUI and is mainly intended to provide implementations of published algorithms. Its command line interface, however, is closely based on and integrated with WEKA's, and is easy to use.
MEKA was created by Jesse Read; current webpage: http://www.tsc.uc3m.es/~jesse/ (contact information can be found there).
*NEW* 4 May, 2012 Version update (meka 1.0). A tutorial is available here.
Download
Download MEKA here. Other versions can be found on the SourceForge download page.
Documentation
A MEKA tutorial is now available, with instructions on getting started and numerous examples of how to run and extend MEKA.
Quick Instructions: download MEKA, then add meka.jar (and weka.jar) to Java's classpath. See the tutorial for examples of how to run it.
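If it is unclear whether the jars were picked up, a small check like the one below can help. This is only a sketch: the class `ClasspathCheck` and its method are hypothetical helpers, not part of MEKA or WEKA. The only library name it assumes is `weka.core.Instances`, which ships in weka.jar; to check meka.jar, pass the name of any MEKA class listed in the Java docs as an argument.

```java
class ClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to a core WEKA class; class names given as arguments replace it.
        String[] names = args.length > 0 ? args : new String[] {"weka.core.Instances"};
        for (String name : names) {
            System.out.println(name + " on classpath: " + isOnClasspath(name));
        }
    }
}
```

Run it with the same classpath setting you intend to use for MEKA; a `false` result suggests the corresponding jar is missing from that setting.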
See the Java docs here.
Publications that introduce MEKA's algorithms, including a PhD thesis, can be found on my webpage at the Universidad Carlos III, Madrid. Much of the work on MEKA, however, was carried out at the University of Waikato, New Zealand.
Datasets
The following datasets have been compiled into WEKA's ARFF format for experiments in publications involving the pruned sets and classifier chains methods.
These are all text datasets, parsed into binary-attribute format using WEKA's StringToWordVector filter.
| Dataset | L | N | LC | PU | Description and Original Source(s) |
| --- | --- | --- | --- | --- | --- |
| Enron | 53 | 1702 | 3.39 | 0.442 | A subset of the Enron Email Dataset, as labelled by the UC Berkeley Enron Email Analysis Project |
| Slashdot | 22 | 3782 | 1.18 | 0.041 | Article titles and partial blurbs mined from Slashdot.org |
| Language Log | 75 | 1460 | 1.18 | 0.208 | Articles posted on the Language Log |
| IMDB | 28 | 95424 | 1.92 | 0.036 | Movie plot text summaries, labelled with genres, sourced from the Internet Movie Database interface. |
| IMDB Updated | 28 | 120919 | 2.00 | 0.037 | An updated version of the above. |
N = The number of examples (training+testing) in the datasets
L = The number of predefined labels relevant to this dataset
LC = Label Cardinality. Average number of labels assigned per document
PU = Proportion of documents with Unique label combinations, given as a fraction of N (e.g. 0.442 corresponds to 44.2%)
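The two statistics can be sketched in a few lines of Java. This is an illustration, not code from MEKA: the class `LabelStats` is hypothetical, the label matrix is toy data, and PU is assumed here to mean the number of distinct label combinations divided by N, which is one reading of the definition above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class LabelStats {

    /** LC: average number of labels set to 1 per example. */
    static double labelCardinality(int[][] y) {
        long total = 0;
        for (int[] row : y) {
            for (int v : row) {
                total += v;
            }
        }
        return (double) total / y.length;
    }

    /** PU (assumed definition): distinct label combinations divided by N. */
    static double proportionUnique(int[][] y) {
        Set<String> combos = new HashSet<>();
        for (int[] row : y) {
            combos.add(Arrays.toString(row)); // one string key per label combination
        }
        return (double) combos.size() / y.length;
    }

    public static void main(String[] args) {
        // Toy label matrix: N = 4 examples, L = 3 labels.
        int[][] y = {
            {1, 0, 0},
            {1, 0, 0},
            {0, 1, 1},
            {1, 1, 0}
        };
        System.out.println("LC = " + labelCardinality(y)); // prints LC = 1.5
        System.out.println("PU = " + proportionUnique(y)); // prints PU = 0.75
    }
}
```

In the toy data the six assigned labels over four examples give LC = 1.5, and three distinct combinations over four examples give PU = 0.75.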
Usage notes: The first L attributes (attributes 1 to L) of each dataset are the labels; the remaining attributes are the input features.
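As a minimal sketch of this layout, the snippet below splits one instance, represented as a plain array of attribute values, into its label part and its feature part. The class `LabelSplit` and the sample values are hypothetical; note that attributes 1 to L in the description correspond to array indices 0 to L-1 in Java.

```java
import java.util.Arrays;

class LabelSplit {

    /** The first L values of an instance are its labels. */
    static double[] labels(double[] instance, int numLabels) {
        return Arrays.copyOfRange(instance, 0, numLabels);
    }

    /** Everything after the first L values is the feature vector. */
    static double[] features(double[] instance, int numLabels) {
        return Arrays.copyOfRange(instance, numLabels, instance.length);
    }

    public static void main(String[] args) {
        // Toy instance with L = 2 labels followed by 4 binary word attributes.
        double[] instance = {1, 0, 0, 1, 1, 0};
        int numLabels = 2;
        System.out.println("labels   = " + Arrays.toString(labels(instance, numLabels)));
        System.out.println("features = " + Arrays.toString(features(instance, numLabels)));
    }
}
```

When loading the ARFF files, the value of L for each dataset can be taken from the table above.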
Other notes: A greater selection of multi-label datasets can be found at the MULAN Website.
The Medical and Ohsumed datasets can be found here.
Links
- WEKA Machine Learning Toolkit
- MOA environment for data streams
- MULAN Framework for Multi-label Classification
- Mulan Multi-label Group from the Machine Learning and Knowledge Discovery Group at the Aristotle University of Thessaloniki.
- My Webpage at the University Carlos III of Madrid.