MEKA

The MEKA project provides an open source implementation of methods for multi-label classification and evaluation. It is based on the WEKA Machine Learning Toolkit. Several benchmark methods are also included, as well as the pruned sets and classifier chains methods, other methods from the scientific literature, and a wrapper to the MULAN framework.

Main developers:


NEW Sep 25, 2014: Meka 1.7.3 has been released. Main changes include:

  • Evaluation code is now more efficient at working with large ARFF files
  • Several methods (e.g., PS, RandomSubspaceML) and tools (e.g., PSUtils) rewritten to be more scalable for datasets having a large labelset
  • Classifier Chains (CC) based methods (CC, PCC, BCC, MCC) consolidated to share common code
  • Classifiers added (RAkEL, RAkELd)
Code and classifiers from this release were used to help get 1st place in the LSHTC4 2014 challenge and 2nd place in the 2014 WISE challenge (in combination with Antti Puurula's SGM toolkit).

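May 27, 2014: Meka 1.6.2 is now on Maven Central. UPDATE Dec 15, 2014: Meka 1.7.3 is now available there as well. To include it in your projects, add the following dependency, setting the version to the release you want: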

<dependency>
<groupId>net.sf.meka</groupId>
<artifactId>meka</artifactId>
<version>1.6.2</version>
</dependency>

Download

Download MEKA here.

Or check out the code with Subversion:
svn checkout svn://svn.code.sf.net/p/meka/code/trunk meka-code

Or get a nightly snapshot.

Documentation

Getting Started: download MEKA and run bash run.sh (run.bat on Windows) to launch the GUI.

The MEKA tutorial (pdf) has numerous examples of how to run and extend MEKA.

A list of methods available in MEKA, with examples of how to use them.

For developers: The API reference.

MEKA originated from implementations of work from several publications, including a PhD thesis; they can be found here.

Have a specific problem or query? Post to MEKA's Mailing List (please avoid contacting developers directly for MEKA-related help).

Datasets

The following datasets have been created or compiled in WEKA's ARFF format. They are all text datasets, parsed into binary-attribute format using WEKA's StringToWordVector filter. Also available are train/test splits and the original raw prefiltered text.

Dataset         L    N        LC     PU      Description and Original Source(s)
Enron           53   1702     3.39   0.442   A subset of the Enron Email Dataset, as labelled by the UC Berkeley Enron Email Analysis Project
Slashdot        22   3782     1.18   0.041   Article titles and partial blurbs mined from Slashdot.org
Language Log    75   1460     1.18   0.208   Articles posted on the Language Log
IMDB (updated)  28   120919   2.00   0.037   Movie plot text summaries sourced from the Internet Movie Database interface, labelled with genres

N = The number of examples (training+testing) in the datasets

L = The number of predefined labels relevant to this dataset

LC = Label Cardinality. Average number of labels assigned per document

PU = Proportion of documents with Unique label combinations (given as a fraction, e.g. 0.442 = 44.2%)
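As an illustration of the LC and PU statistics defined above, here is a minimal Java sketch that computes both from a binary label matrix. It does not use the MEKA or WEKA API: the class name LabelStats, its method names and the toy data are hypothetical. It also assumes one reading of PU, namely the fraction of examples whose label combination occurs exactly once in the dataset.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of MEKA: computes LC and PU for a label matrix
// where labels[i][j] == 1 means example i carries label j.
public class LabelStats {

    /** Label cardinality (LC): average number of labels assigned per example. */
    static double labelCardinality(int[][] labels) {
        long total = 0;
        for (int[] row : labels) {
            for (int v : row) {
                total += v;
            }
        }
        return (double) total / labels.length;
    }

    /**
     * PU under the assumption stated in the text: the fraction of examples
     * whose label combination appears exactly once in the dataset.
     */
    static double proportionUnique(int[][] labels) {
        // Count how often each label combination occurs.
        Map<String, Integer> counts = new HashMap<>();
        for (int[] row : labels) {
            counts.merge(Arrays.toString(row), 1, Integer::sum);
        }
        // A combination seen once corresponds to exactly one example.
        int unique = 0;
        for (int c : counts.values()) {
            if (c == 1) {
                unique++;
            }
        }
        return (double) unique / labels.length;
    }

    public static void main(String[] args) {
        // Toy data: 4 examples, 3 labels.
        int[][] labels = {
            {1, 0, 0},
            {1, 0, 0},
            {0, 1, 1},
            {1, 1, 1},
        };
        System.out.println("LC = " + labelCardinality(labels));
        System.out.println("PU = " + proportionUnique(labels));
    }
}
```

On real data the label matrix would be read from the first L attributes of each ARFF instance rather than hard-coded.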

Usage notes: Attributes 1-L of these datasets represent the label space, and the remaining attributes represent the attribute space.
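The attribute layout described in the usage notes can be sketched in plain Java as follows. This is only an illustration under stated assumptions: each instance is taken to be a flat array of values already read from the ARFF file, and the class SplitRow with its two methods is hypothetical, not part of MEKA or WEKA.

```java
import java.util.Arrays;

// Hypothetical helper, not part of MEKA: splits one instance into its label
// part (attributes 1..L) and its feature part (all remaining attributes).
public class SplitRow {

    /** The first numLabels values of the row, i.e. the label space. */
    static double[] labelsOf(double[] row, int numLabels) {
        return Arrays.copyOfRange(row, 0, numLabels);
    }

    /** The values after the first numLabels, i.e. the attribute space. */
    static double[] featuresOf(double[] row, int numLabels) {
        return Arrays.copyOfRange(row, numLabels, row.length);
    }

    public static void main(String[] args) {
        // Toy instance with L = 2 labels followed by 3 binary word attributes.
        double[] row = {1, 0, 0, 1, 1};
        int numLabels = 2;
        System.out.println("labels   = " + Arrays.toString(labelsOf(row, numLabels)));
        System.out.println("features = " + Arrays.toString(featuresOf(row, numLabels)));
    }
}
```

The value of L for each dataset is the one given in the table above (for example, 53 for Enron).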

Other notes: A greater selection of multi-label datasets can be found at the MULAN Website.
The Medical and Ohsumed datasets can be found here.

Links