Java Data Mining Package

A Library for Machine Learning and Big Data Analytics

About

The Java Data Mining Package (JDMP) is an open source Java library for data analysis and machine learning. It facilitates access to data sources and machine learning algorithms (e.g. clustering, regression, classification, graphical models, optimization) and provides visualization modules. JDMP ships with a number of its own algorithms and tools, and also provides interfaces to other machine learning and data mining packages (Weka, LibLinear, Elasticsearch, LibSVM, Mallet, Lucene, Octave).

Screenshot

This screenshot shows the visualization of the Iris flower data set in JDMP. JDMP's Naive Bayes classifier was trained on the data and then used to predict the target class of the samples. Since the data set is not linearly separable, some errors remain and the accuracy is less than 100%. The confusion matrix shows how some Samples have been classified incorrectly.

JDMP DataSet

Description

The main focus of JDMP lies on a consistent data representation. Maybe you’ve heard that, for Linux, everything is a file. For JDMP, everything is a Matrix! Well, not everything, but many objects do have a matrix representation. For example, you can combine several matrices to form a Variable, e.g. for a time series. You can access these matrices one by one or as a single big matrix, whichever is more suitable for your task. Several Variables are combined into a Sample, like the samples with input and target values in a classification task. Many Samples form a DataSet, which may be sorted or split for a cross-validation test. The DataSet can be accessed either sample by sample or as one big matrix for the input features and one for the target values.

Algorithms can manipulate Variables, Samples or DataSets, e.g. to perform pre-processing or a classification task. It has to be emphasized that, in JDMP, data processing methods are separated from data sources, so that algorithms and data may reside on different computers and parallel processing becomes possible. However, distributed computing is not yet fully implemented and exists in a “proof of concept” version only.

While some parts are fairly stable by now, a lot of development is still going on in other parts, which is why JDMP must be considered experimental and not yet ready for production use.

Universal Java Matrix Package

JDMP uses the Universal Java Matrix Package (UJMP) as a mathematical back-end for matrix calculations. UJMP provides most of JDMP’s import and export filters and is used for visualization. Since most of JDMP’s objects can be converted into a “matrix view”, UJMP is a very important building block in JDMP. It helps to keep the code simple and can handle very large matrices even when they do not fit into memory. Import and export interfaces are provided for JDBC databases, TXT, CSV, Excel, Matlab, LaTeX, MTX, HTML, WAV, BMP and other file formats.

This screenshot shows the visualization of a matrix in UJMP, which was created from calculating the cosine similarity between weakly correlated data samples.

UJMP Matrix

Tutorial

1. Download

First, download the jar files you need from our SourceForge download page. You should download jdmp-complete.jar, which contains all modules including visualization and interfaces to other machine learning packages such as Weka or LibSVM. If you want to use those packages, you will also need the jar files from the respective projects. Pay attention to the licensing terms of these third-party packages. Take a look at the package overview for more details.

2. Installation

You need a recent Java Development Kit (JDK) to use JDMP. JDK 6 or higher is required, because JDMP will not work with Java 5 or lower.

Make sure to include the necessary jar files from above (and their dependencies) in your Java classpath. If you are unsure how to install Java or add the libraries to your classpath, please take a look at the Java tutorials out there. Also consider using an integrated development environment (IDE) such as Eclipse for your own Java applications.
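If you are unsure whether the installation worked, a small check program can help. The following is only a sketch in plain Java and not part of JDMP: the class name JdmpInstallCheck and the helper isOnClasspath are made up for this illustration, while org.jdmp.gui.JDMP is the GUI entry point used in the next step of this tutorial.

```java
public class JdmpInstallCheck {

    // Returns true if the named class can be located on the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, JdmpInstallCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // the class was found but one of its dependencies is missing
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Java version: " + System.getProperty("java.version"));

        // org.jdmp.gui.JDMP is the class whose main method starts the GUI
        String mainClass = "org.jdmp.gui.JDMP";
        if (isOnClasspath(mainClass)) {
            System.out.println(mainClass + " found on the classpath");
        } else {
            System.out.println(mainClass + " NOT found, please check your classpath");
        }
    }
}
```

Run it with the same classpath you intend to use for your own application. If the class is reported as missing, the jar files are not on the classpath yet.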

3. A Quick GUI Tour

Ok, so now you want to see JDMP in action? Well then, let’s get started! The best way to get acquainted with JDMP and its object types is through the graphical user interface. Try to execute the main method in org.jdmp.gui.JDMP. This should bring up the JDMP GUI, which is similar to a workspace in Matlab, Octave or FreeMat. If you cannot see the window, make sure that you are using the correct Java version. JDMP requires Java 6 and will not work with lower versions.

JDMP Module

The workspace in JDMP is called Module and you can use a syntax similar to Matlab to execute basic calculations, e.g.:

a=[1;2;3]
b=[1,2,3]
c=a'+b
c=a*b

Modules can store Variables, Samples, DataSets, Algorithms and other Modules, as you can see on the left side. You have just created three Variables, a, b and c. If you double-click on the c variable, it will bring up the visualization of this object.
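To make clear what the four console lines compute, here is the same calculation as a plain Java sketch. It does not use JDMP or UJMP; the class ModuleCalcSketch and its helper methods are made up for this illustration, and it assumes that the console follows the Matlab convention where ";" separates rows and "," separates columns.

```java
import java.util.Arrays;

public class ModuleCalcSketch {

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++)
                t[j][i] = m[i][j];
        return t;
    }

    static double[][] add(double[][] x, double[][] y) {
        if (x.length != y.length || x[0].length != y[0].length)
            throw new IllegalArgumentException("matrices must have the same size");
        double[][] r = new double[x.length][x[0].length];
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < x[0].length; j++)
                r[i][j] = x[i][j] + y[i][j];
        return r;
    }

    static double[][] multiply(double[][] x, double[][] y) {
        if (x[0].length != y.length)
            throw new IllegalArgumentException("inner dimensions must agree");
        double[][] r = new double[x.length][y[0].length];
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < y[0].length; j++)
                for (int k = 0; k < y.length; k++)
                    r[i][j] += x[i][k] * y[k][j];
        return r;
    }

    public static void main(String[] args) {
        double[][] a = { { 1 }, { 2 }, { 3 } }; // a=[1;2;3], a 3x1 column vector
        double[][] b = { { 1, 2, 3 } };         // b=[1,2,3], a 1x3 row vector

        // c=a'+b gives a 1x3 matrix
        System.out.println(Arrays.deepToString(add(transpose(a), b)));
        // c=a*b gives the 3x3 outer product
        System.out.println(Arrays.deepToString(multiply(a, b)));
    }
}
```

Under that assumption, the first assignment to c yields a 1x3 matrix and the second one a 3x3 matrix, so the two entries in the history of c have different sizes.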

Remember, we first calculated c=a'+b and after that c=a*b. As you can see, a Variable can remember previous matrices (top right), which is useful when you want to monitor how values are changing over time. When a Variable is visualized, all matrices in it are concatenated together to form a view on the whole data. There are different views:

  • an editor (bottom right) to change single values,
  • an image view, which gives an overview of the whole matrix and scales automatically when it is very large (bottom left),
  • and a 2d plot which displays the values of each column as a line in different colors (top left, the brown lines are the running averages).

If you double-click on a single matrix, you can inspect the values of it:

TODO: Visualization of a Matrix

If you have Octave, R, Matlab or GnuPlot installed, additional tabs are available which allow plotting a matrix in those software packages. The following screenshot shows a histogram plot in Octave:

TODO: Histogram in Octave from JDMP

If the installation is not found automatically on your computer, you must specify its path using a system property, e.g. Octave="/usr/bin/octave". You can use System.setProperty() to set it in your code. Note that this functionality may cause problems on some computers, because it uses input and output streams to communicate with the programs. For this reason, Matlab is not supported under Windows. Other operating systems such as Mac OS have not been tested at all, so feedback is welcome!
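Setting the property from code could look like the following sketch. The property name "Octave" and the example path are taken from the paragraph above; the class OctavePathSetup and the method configureOctave are made up for this illustration. It is an assumption that the property has to be set before the GUI is started.

```java
public class OctavePathSetup {

    // Stores the location of the Octave executable in the "Octave" system property.
    static void configureOctave(String executablePath) {
        System.setProperty("Octave", executablePath);
    }

    public static void main(String[] args) {
        // example path, adjust it to your installation
        configureOctave("/usr/bin/octave");
        System.out.println("Octave = " + System.getProperty("Octave"));

        // from here on you would start JDMP, e.g. via the main method
        // of org.jdmp.gui.JDMP (not called here to keep the sketch self-contained)
    }
}
```

Alternatively, any system property can be passed on the command line when the JVM is started, e.g. with -DOctave=/usr/bin/octave.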

Now click on Examples in the module window and choose Iris DataSet. This creates the famous iris flower data set, which is often used to test classification algorithms. It appears in the DataSet list, and you can visualize it with a double-click.

TODO: Visualization of the Iris Flower DataSet

A DataSet in JDMP contains a list of Samples, which are the instances used for training and prediction. In addition, there are Variables to measure classification accuracy or store other global properties. The Variable input is a view on the data of all Samples. Now check out what a Sample looks like:

TODO: Visualization of a Sample

It is a collection of Variables, which store input features, target value and label. If you run a classifier on the DataSet, each sample will also contain Variables with the prediction and the root mean squared error compared to the desired result.

Well, so much for the GUI tour. Feel free to click around a little more to get familiar with JDMP’s objects.

Finished? OK, now you should have an understanding of the basic object types of JDMP. In short:

  • Variables contain Matrices
  • Samples contain Variables
  • DataSets contain Samples (and Variables)
  • Modules are logical “workspaces” to hold a set of Variables, Samples, DataSets, Algorithms and other Modules
  • Algorithms can manipulate DataSets, Variables or Samples
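The containment relations listed above can be sketched in a few lines of plain Java. This is not JDMP's actual class design: all class and method names below (ObjectModelSketch, asMatrix and so on) are made up for this illustration, and the numbers are arbitrary example values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ObjectModelSketch {

    /** A Variable keeps a history of matrices, e.g. one per time step. */
    static class Variable {
        final List<double[][]> matrices = new ArrayList<>();

        void add(double[][] matrix) { matrices.add(matrix); }

        double[][] latest() { return matrices.get(matrices.size() - 1); }
    }

    /** A Sample groups named Variables, e.g. "Input" and "Target". */
    static class Sample {
        final Map<String, Variable> variables = new LinkedHashMap<>();

        Variable variable(String name) {
            Variable v = variables.get(name);
            if (v == null) {
                v = new Variable();
                variables.put(name, v);
            }
            return v;
        }
    }

    /** A DataSet holds Samples and can show one Variable of all Samples as a single matrix. */
    static class DataSet {
        final List<Sample> samples = new ArrayList<>();

        void add(double[] input, double[] target) {
            Sample s = new Sample();
            s.variable("Input").add(new double[][] { input });
            s.variable("Target").add(new double[][] { target });
            samples.add(s);
        }

        // one row per Sample, taken from the first row of the Variable's latest matrix
        double[][] asMatrix(String variableName) {
            double[][] rows = new double[samples.size()][];
            for (int i = 0; i < samples.size(); i++) {
                rows[i] = samples.get(i).variable(variableName).latest()[0];
            }
            return rows;
        }
    }

    static DataSet demo() {
        DataSet ds = new DataSet();
        ds.add(new double[] { 5.1, 3.5 }, new double[] { 1, 0 }); // arbitrary example values
        ds.add(new double[] { 6.2, 2.9 }, new double[] { 0, 1 });
        return ds;
    }

    public static void main(String[] args) {
        DataSet ds = demo();
        System.out.println("Input:  " + Arrays.deepToString(ds.asMatrix("Input")));
        System.out.println("Target: " + Arrays.deepToString(ds.asMatrix("Target")));
    }
}
```

The point of the sketch is the double access path: each Sample can be inspected on its own, while asMatrix offers the "one big matrix" view on the same data.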

But wait, I did not describe the Algorithms. Don’t they have a visualization? Yes, they do, but if you want to use Algorithms, you must write Java code. JDMP is primarily a machine learning library which you can use in your applications. The GUI console with the scripting language is still at a very early stage of development and serves as a demonstration tool until more functionality is integrated.

But don’t be disappointed: you can still visualize all objects from your Java code by calling myVariable.showGUI() or myDataset.showGUI().

4. Main Concepts

Still interested? Ok, then it’s now time to introduce you to the main concepts of JDMP:

4.1. JDMP Philosophy #1: Everything is a Matrix!

Maybe you’ve heard that, for Linux, everything is a file. To many people it might sound strange at first to think of a printer or a screen as a file. But it simplifies programming in Linux, because the programmer only has to know how to read from files and how to write to files.

In JDMP, the central statement is very similar:

Any type of data is a Matrix (or can be converted into a Matrix).

So, if you want to use JDMP, you will have to know how to read data from a Matrix and how to write data to it (which is really simple), and how to convert your data into a Matrix in the first place (which should also be very easy in most cases). This matrix-centric view might even sound more intuitive than the file-centric view of Linux, because many data sources really come in matrix format:

  • CSV files are arranged in matrix format (lines x columns)
  • Excel sheets are arranged in matrix format (rows x columns)
  • Tables in a database have matrix format (rows x columns)
  • After decoding, pictures have matrix format on the screen (height x width)
  • Some image formats like BMP or TIFF are also stored in matrix format on disk (height x width)

So far so good, but now comes the harder part:

  • Lists have matrix format (n entries x one column)
  • Mappings (java.util.Map interface) have matrix format (n entries x two columns for key and value)
  • Graphs can be represented in matrix format as an adjacency matrix (n nodes x n nodes)
  • Graphs can be represented in matrix format as an incidence matrix (n nodes x m edges)
  • Trees are a special form of graphs and are represented in the same way e.g. as an adjacency matrix (n objects x n objects)
  • Text sentences can be converted into matrix format, e.g. as a bag of words (n sentences x m words in dictionary)
  • File listings in one directory are represented in matrix format (n files x columns for file name, size, permissions, modification date)
  • Audio files are represented in matrix format (n samples x 2 columns for left and right channel)
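A few of the less obvious conversions from this list can be sketched in plain Java. The code below does not use JDMP or UJMP; the class MatrixViewsSketch and its methods are made up for this illustration, graphs are assumed to be undirected, and the dictionary for the bag of words is assumed to be given in lower case.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class MatrixViewsSketch {

    // Mapping -> n entries x 2 columns (key, value)
    static Object[][] mapToMatrix(Map<?, ?> map) {
        Object[][] m = new Object[map.size()][2];
        int row = 0;
        for (Map.Entry<?, ?> e : map.entrySet()) {
            m[row][0] = e.getKey();
            m[row][1] = e.getValue();
            row++;
        }
        return m;
    }

    // Undirected graph -> adjacency matrix (n nodes x n nodes)
    static int[][] adjacency(int nodeCount, int[][] edges) {
        int[][] a = new int[nodeCount][nodeCount];
        for (int[] e : edges) {
            a[e[0]][e[1]] = 1;
            a[e[1]][e[0]] = 1;
        }
        return a;
    }

    // Undirected graph -> incidence matrix (n nodes x m edges)
    static int[][] incidence(int nodeCount, int[][] edges) {
        int[][] inc = new int[nodeCount][edges.length];
        for (int k = 0; k < edges.length; k++) {
            inc[edges[k][0]][k] = 1;
            inc[edges[k][1]][k] = 1;
        }
        return inc;
    }

    // Sentences -> bag of words (n sentences x m words in dictionary)
    static int[][] bagOfWords(List<String> sentences, List<String> dictionary) {
        int[][] counts = new int[sentences.size()][dictionary.size()];
        for (int s = 0; s < sentences.size(); s++) {
            for (String token : sentences.get(s).toLowerCase(Locale.ROOT).split("\\W+")) {
                int col = dictionary.indexOf(token);
                if (col >= 0) counts[s][col]++; // words outside the dictionary are ignored
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> ages = new LinkedHashMap<>();
        ages.put("alice", 31);
        ages.put("bob", 27);
        System.out.println(Arrays.deepToString(mapToMatrix(ages)));

        int[][] edges = { { 0, 1 }, { 1, 2 } }; // a path 0 - 1 - 2
        System.out.println(Arrays.deepToString(adjacency(3, edges)));
        System.out.println(Arrays.deepToString(incidence(3, edges)));

        System.out.println(Arrays.deepToString(bagOfWords(
                Arrays.asList("The cat saw the dog", "A dog!"),
                Arrays.asList("the", "cat", "dog"))));
    }
}
```

Once the data is in one of these array forms, the same storage and processing code can be applied to all of them, which is the idea behind the matrix-centric view.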

As you can see, sometimes there is more than one suitable matrix representation for an object, like for graphs. Another point to note is that all of the examples so far are 2-dimensional, but, of course, it is also possible to have higher dimensional data:

  • Computer tomography scans can be stored in a 3D-matrix (height x width x depth)
  • Videos can be represented as a sequence of images in a 3D-matrix (n frames x height x width)

Beyond three dimensions, it gets harder to think of simple examples, but rest assured that JDMP can also handle high-dimensional data. Needless to say, one-dimensional data such as feature vectors can also be stored in a matrix, namely one that has only one column or only one row.
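How a matrix with an arbitrary number of dimensions can be stored is easiest to see in code. The following plain Java sketch keeps all values in one flat array and computes the position from the coordinates in row-major order. It is only an illustration of the principle, not the way UJMP stores its matrices; the class DenseNdArraySketch is made up for this example.

```java
public class DenseNdArraySketch {

    private final int[] size;
    private final double[] values;

    DenseNdArraySketch(int... size) {
        this.size = size.clone();
        int total = 1;
        for (int s : size) total *= s;
        this.values = new double[total];
    }

    // Row-major offset: the last coordinate varies fastest.
    int offset(int... coords) {
        if (coords.length != size.length)
            throw new IllegalArgumentException("expected " + size.length + " coordinates");
        int index = 0;
        for (int d = 0; d < size.length; d++) {
            if (coords[d] < 0 || coords[d] >= size[d])
                throw new IndexOutOfBoundsException("coordinate " + d + " out of range");
            index = index * size[d] + coords[d];
        }
        return index;
    }

    double get(int... coords) { return values[offset(coords)]; }

    void set(double value, int... coords) { values[offset(coords)] = value; }

    public static void main(String[] args) {
        // a tiny "video": 10 frames x 4 pixels height x 6 pixels width
        DenseNdArraySketch video = new DenseNdArraySketch(10, 4, 6);
        video.set(0.5, 2, 1, 3); // frame 2, row 1, column 3
        System.out.println(video.get(2, 1, 3));
        System.out.println(video.offset(2, 1, 3)); // (2 * 4 + 1) * 6 + 3

        // a feature vector with 5 entries as a matrix with a single row
        DenseNdArraySketch features = new DenseNdArraySketch(1, 5);
        features.set(1.0, 0, 4);
        System.out.println(features.get(0, 4));
    }
}
```

The same class handles two, three or more dimensions without any change, because the number of dimensions is just the length of the size array.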

The benefit of this everything-is-a-matrix view is that the classes for storage and processing have to be implemented only once. A JDMP class that can store an n-dimensional Matrix in a cache file on disk can also store an image, an object list, a graph, a WAV file or a CT scan. The same applies to a JDMP class that can connect to a JDBC database. You could even make a backup of a remote database to a local Excel file, just by executing:

MatrixFactory.linkToJDBC(uriToYourDatabase).exportToFile(FileFormat.XLS, backupExcelFile);

To summarize the workflow in JDMP:

  • 1. Import your data into a Matrix, e.g. from CSV, Excel, JDBC (raw data -> matrix)
  • 2. Analyze your data, e.g. classify, cluster, make predictions and get the results also as matrices (matrix -> matrix)
  • 3. Export results to the format you need, e.g. JPG, Excel, Latex, GnuPlot (matrix -> desired format)
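The three steps can be illustrated with a deliberately tiny plain Java sketch. It does not use JDMP's import and export filters; the class MatrixWorkflowSketch and its methods are made up for this illustration, CSV stands in for both the input and the output format, and the "analysis" is just the mean of each column.

```java
import java.util.Arrays;

public class MatrixWorkflowSketch {

    // Step 1: raw data -> matrix (simple CSV without quoting or header line)
    static double[][] importCsv(String csv) {
        String[] lines = csv.trim().split("\\r?\\n");
        double[][] m = new double[lines.length][];
        for (int r = 0; r < lines.length; r++) {
            String[] cells = lines[r].split(",");
            m[r] = new double[cells.length];
            for (int c = 0; c < cells.length; c++) {
                m[r][c] = Double.parseDouble(cells[c].trim());
            }
        }
        return m;
    }

    // Step 2: matrix -> matrix (a 1 x columns matrix holding the column means)
    static double[][] columnMeans(double[][] m) {
        double[][] means = new double[1][m[0].length];
        for (double[] row : m) {
            for (int c = 0; c < row.length; c++) {
                means[0][c] += row[c] / m.length;
            }
        }
        return means;
    }

    // Step 3: matrix -> desired format (here: CSV again)
    static String exportCsv(double[][] m) {
        StringBuilder sb = new StringBuilder();
        for (double[] row : m) {
            for (int c = 0; c < row.length; c++) {
                if (c > 0) sb.append(',');
                sb.append(row[c]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[][] data = importCsv("1,2\n3,4\n");
        System.out.println(Arrays.deepToString(data));
        double[][] result = columnMeans(data);
        System.out.print(exportCsv(result));
    }
}
```

Note that the result of the analysis is again a matrix, so it can be passed to the same export code as the raw data.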

For the second step, JDMP provides wrapper classes to access the matrix data in a format that is more convenient for humans. For example, in a classification task, you would prefer dealing with a ClassificationDataSet, ClassificationSamples and a Classifier algorithm instead of a number of data and parameter matrices, right?

Here is an example:

// remember to provide more memory to your JVM.
// the default will probably not be enough.
// use the switch -Xmx512M or something similar
 
// load example dataset into a matrix
URL url = new URL("http://www.jdmp.org/example.csv");
String delimiter = ",";
Matrix data = MatrixFactory.importFromURL(FileFormat.CSV, url, delimiter);
 
// add meta data to the matrix
data.setLabel("Raw Data");
data.setColumnLabel(0, "Feature 1");
data.setColumnLabel(1, "Feature 2");
data.setColumnLabel(2, "Feature 3");
data.setColumnLabel(3, "Feature 4");
data.setColumnLabel(4, "Class Label");
 
// visualize
data.showGUI();
 
// the first 4 columns contain the features
// copy those features into a separate matrix
Matrix input = data.selectColumns(Ret.NEW, 0, 1, 2, 3);
input.setLabel("Input");
 
// normalization across the samples cannot hurt
// i.e. zero mean and unit variance for each feature
input = input.standardize(Ret.NEW, Matrix.ROW);
input.setLabel("Standardized Input");
 
// alternative: scale values between 0 and 1:
// input = input.normalize(Ret.NEW, Matrix.ROW);
 
// let's see what we've got
input.showGUI();
 
// maybe we also want to discretize the data
int numberOfBins = 5;
int dim = Matrix.ROW;
DiscretizationMethod method = DiscretizationMethod.STANDARDBINNING;
input = input.discretize(Ret.NEW, dim, method, numberOfBins);
input = input.toDoubleMatrix();
input.setLabel("Discretized Input");
 
// looks different?
input.showGUI();
 
// great, now let's extract the class labels
Matrix labels = data.selectColumns(Ret.NEW, 4);
labels.setLabel("Labels");
labels.showGUI();
 
// we cannot work with strings, create columns for each value
Matrix target = labels.discretizeToColumns(0);
target.setLabel("Target");
target.showGUI();
 
// target is now a matrix with three columns.
// if there is a "1" in the first column, it means "class 1"
// if there is one in the second it's "class 2" and so on
 
// enough working with matrices in UJMP
// let's copy input and target into a DataSet in JDMP
ClassificationDataSet ds = DataSetFactory.importFromMatrix(input, target, labels);
ds.setLabel("JDMP Demo DataSet");
 
// let's take a look at this dataset
ds.showGUI();
 
// splitting into training and test set is a good practice
boolean shuffle = true;
double percentInTheFirstDataSet = 0.7;
List split = ds.splitByPercent(shuffle, percentInTheFirstDataSet);
ClassificationDataSet trainingSet = (ClassificationDataSet) split.get(0);
ClassificationDataSet testSet = (ClassificationDataSet) split.get(1);
trainingSet.setLabel("Training Set");
testSet.setLabel("Test Set");
 
// the training set contains 70% of the samples
// the test set contains 30% of the samples
 
// train a classifier using the training set
Classifier simple = new LinearRegression();
simple.train(trainingSet);
 
// try to predict the data in the test set
simple.predict(testSet);
 
// notice the additional columns: Predicted, Difference and RMSE
testSet.showGUI();
 
// accuracy smaller than 85%
System.out.println("Accuracy LinRegression: " + testSet.getAccuracy());
 
// what if we were just guessing?
Classifier guessing = new RandomClassifier();
 
// we still have to train it
// it has to know the number of classes
guessing.train(trainingSet);
guessing.predict(testSet);
 
// accuracy near chance level (about 33% if three classes are guessed uniformly)
System.out.println("Accuracy Random: " + testSet.getAccuracy());
 
// Let's use a classifier from Weka
boolean dataIsDiscrete = false;
Classifier weka = new WekaClassifier(WekaClassifierType.AdaBoostM1, dataIsDiscrete);
weka.train(trainingSet);
weka.predict(testSet);
System.out.println("Accuracy Weka: " + testSet.getAccuracy());
 
// LIBSVM also works well
Classifier svm = new LibSVMClassifier();
svm.train(trainingSet);
svm.predict(testSet);
System.out.println("Accuracy SVM: " + testSet.getAccuracy());
 
// we can also do a ten-times-ten-fold cross-validation
// on the whole dataset
CrossValidation.run(weka, ds);
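To show what some of these steps do to the numbers, here is a plain Java sketch of standardization, of turning labels into one column per class, and of computing the accuracy. It is not JDMP's implementation: the class PreprocessingSketch and its methods are made up for this illustration, and details such as using the population variance or counting the largest predicted value as the predicted class are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PreprocessingSketch {

    // Zero mean and unit variance for each column (feature), using the population variance.
    static double[][] standardizeColumns(double[][] x) {
        int rows = x.length, cols = x[0].length;
        double[][] out = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            double mean = 0;
            for (double[] row : x) mean += row[c];
            mean /= rows;
            double var = 0;
            for (double[] row : x) var += (row[c] - mean) * (row[c] - mean);
            double std = Math.sqrt(var / rows);
            for (int r = 0; r < rows; r++) {
                // constant columns are mapped to 0 to avoid a division by zero
                out[r][c] = std == 0 ? 0 : (x[r][c] - mean) / std;
            }
        }
        return out;
    }

    // One column per distinct label, in order of first appearance.
    static int[][] oneHot(String[] labels) {
        List<String> classes = new ArrayList<>();
        for (String label : labels) {
            if (!classes.contains(label)) classes.add(label);
        }
        int[][] target = new int[labels.length][classes.size()];
        for (int r = 0; r < labels.length; r++) {
            target[r][classes.indexOf(labels[r])] = 1;
        }
        return target;
    }

    // Fraction of rows where the largest predicted value is in the column marked with 1.
    static double accuracy(double[][] predicted, int[][] target) {
        int correct = 0;
        for (int r = 0; r < target.length; r++) {
            int best = 0;
            for (int c = 1; c < predicted[r].length; c++) {
                if (predicted[r][c] > predicted[r][best]) best = c;
            }
            if (target[r][best] == 1) correct++;
        }
        return (double) correct / target.length;
    }

    public static void main(String[] args) {
        double[][] input = { { 1, 10 }, { 3, 30 } }; // made-up values
        System.out.println(Arrays.deepToString(standardizeColumns(input)));

        int[][] target = oneHot(new String[] { "a", "b", "a", "c" });
        System.out.println(Arrays.deepToString(target));

        double[][] predicted = { // made-up classifier output
                { 0.9, 0.1, 0.0 }, { 0.2, 0.7, 0.1 }, { 0.3, 0.6, 0.1 }, { 0.0, 0.2, 0.8 } };
        System.out.println("Accuracy: " + accuracy(predicted, target));
    }
}
```

Working directly on raw arrays like this quickly becomes tedious, which is exactly what the wrapper classes in the JDMP listing above are meant to spare you.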

As you can see, it is easier to deal with the data when we use informative class names to access it and, even better, when we have additional information like labels, names etc. available. This brings us to our second philosophy:

4.2. JDMP Philosophy #2: Every Matrix can have annotations

Now what exactly does that mean? It’s easiest to explain using an example. Consider the following 2-dimensional data matrix:

1   Holger      Arndt      arndt (at) jdmp (dot) org        2004
2   Andreas     Naegele    naegele (at) jdmp (dot) org      2007
3   Markus      Bundschus  bundschus (at) jdmp (dot) org    2008

This matrix (without annotations) contains just the actual data (the Matrix entries), which might be numbers, Strings or any other type of object. It looks like a list of people, but to really understand the meaning of the data, we need more information, the Matrix annotations:

ID  First Name  Last Name  Email                            Developer Since
1   Holger      Arndt      arndt (at) jdmp (dot) org        2004
2   Andreas     Naegele    naegele (at) jdmp (dot) org      2007
3   Markus      Bundschus  bundschus (at) jdmp (dot) org    2008

In this example, we have introduced column labels to clarify the meaning of the data. In JDMP, however, we can also provide row labels, axis labels or matrix annotations, as illustrated in the next table:

MatrixAnnotation                          AxisAnnotation(1): Column Label for the whole Axis
                                          AxisAnnotation(1,0):  AxisAnnotation(1,1):  AxisAnnotation(1,2):  AxisAnnotation(1,3):  AxisAnnotation(1,4):
                                          Column Label (0)      Column Label (1)      Column Label (2)      Column Label (3)      Column Label (4)
AxisAnnotation(0):  AxisAnnotation(0,0):  Matrix Entry (0,0)    Matrix Entry (0,1)    Matrix Entry (0,2)    Matrix Entry (0,3)    Matrix Entry (0,4)
Row Label for Axis  Row Label (0)
                    AxisAnnotation(0,1):  Matrix Entry (1,0)    Matrix Entry (1,1)    Matrix Entry (1,2)    Matrix Entry (1,3)    Matrix Entry (1,4)
                    Row Label (1)
                    AxisAnnotation(0,2):  Matrix Entry (2,0)    Matrix Entry (2,1)    Matrix Entry (2,2)    Matrix Entry (2,3)    Matrix Entry (2,4)
                    Row Label (2)

These annotations are not restricted to Strings or numbers; instead, any object type (also intermixed) is allowed.
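The annotation scheme from the table can be sketched as a small wrapper class in plain Java. This is not UJMP's annotation API: the class AnnotatedMatrixSketch and all of its methods are made up for this illustration. Only the numbering of the axes (0 for rows, 1 for columns) follows the table above.

```java
import java.util.HashMap;
import java.util.Map;

public class AnnotatedMatrixSketch {

    static final int ROW = 0;
    static final int COLUMN = 1;

    private final Object[][] entries;
    private Object matrixAnnotation;                // annotation for the whole matrix
    private final Object[] axisLabels = new Object[2]; // one label per axis
    private final Map<Integer, Object> rowLabels = new HashMap<>();
    private final Map<Integer, Object> columnLabels = new HashMap<>();

    AnnotatedMatrixSketch(Object[][] entries) { this.entries = entries; }

    Object get(int row, int column) { return entries[row][column]; }

    void setMatrixAnnotation(Object annotation) { matrixAnnotation = annotation; }

    Object getMatrixAnnotation() { return matrixAnnotation; }

    // label for a whole axis, e.g. AxisAnnotation(1)
    void setAxisAnnotation(int axis, Object label) { axisLabels[axis] = label; }

    Object getAxisAnnotation(int axis) { return axisLabels[axis]; }

    // label for one position on an axis, e.g. AxisAnnotation(1,3)
    void setAxisAnnotation(int axis, int position, Object label) {
        labelsFor(axis).put(position, label);
    }

    Object getAxisAnnotation(int axis, int position) {
        return labelsFor(axis).get(position);
    }

    private Map<Integer, Object> labelsFor(int axis) {
        if (axis == ROW) return rowLabels;
        if (axis == COLUMN) return columnLabels;
        throw new IllegalArgumentException("this sketch only supports axis 0 and 1");
    }

    public static void main(String[] args) {
        AnnotatedMatrixSketch m = new AnnotatedMatrixSketch(new Object[][] {
                { 1, "Holger", "Arndt", 2004 },
                { 2, "Andreas", "Naegele", 2007 } });
        m.setMatrixAnnotation("Developers");
        m.setAxisAnnotation(COLUMN, 0, "ID");
        m.setAxisAnnotation(COLUMN, 1, "First Name");
        m.setAxisAnnotation(COLUMN, 2, "Last Name");
        m.setAxisAnnotation(COLUMN, 3, "Developer Since");

        System.out.println(m.getAxisAnnotation(COLUMN, 1) + ": " + m.get(0, 1));
    }
}
```

Because the labels are stored as Object, Strings, numbers or any other type can be mixed freely, as described above.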

The following sections of the Documentation must still be written ;-)

4.3. JDMP Philosophy #3: Every object can be visualized

5. The Universal Java Matrix Package

5.1. Basic Matrix Operations

5.2. Importing data

5.3. Exporting data

5.4. Basic Interfaces

  • Disposable
  • HasId
  • HasDescription
  • HasLabel
  • HasToolTip

5.5. Collection Classes

  • ArrayIndexList
  • CachedMap
  • HashMapList
  • MapToListWrapper
  • SerializedObjectMap
  • SoftHashMap
  • SoftHashMapList

5.6. Writing your own Matrix Implementation

5.7. Writing your own Calculations

6. Variable

7. Sample

8. DataSet

9. Algorithm

10. Module

Workflow

  • Create a DataSet
  • Split into training and test set
  • Data normalization
  • Create a Classifier
  • Training
  • Prediction
  • Evaluation
  • Cross Validation
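Two of these steps, splitting and cross validation, come down to dividing the sample indices into groups. The following plain Java sketch shows one possible way to do that. It is not JDMP's implementation; the class SplitSketch and its methods are made up for this illustration, and the fixed seed only serves to make the shuffling repeatable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SplitSketch {

    // Returns two lists of sample indices, e.g. 70% for training and 30% for testing.
    static List<List<Integer>> splitIndices(int sampleCount, boolean shuffle,
            double fractionInFirst, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < sampleCount; i++) indices.add(i);
        if (shuffle) Collections.shuffle(indices, new Random(seed));

        int cut = (int) Math.round(sampleCount * fractionInFirst);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<Integer>(indices.subList(0, cut)));
        parts.add(new ArrayList<Integer>(indices.subList(cut, sampleCount)));
        return parts;
    }

    // Distributes the sample indices over k folds. In k-fold cross validation each
    // fold is used once as test set while the remaining folds form the training set.
    static List<List<Integer>> folds(int sampleCount, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<Integer>());
        for (int i = 0; i < sampleCount; i++) folds.get(i % k).add(i);
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> split = splitIndices(150, true, 0.7, 42L);
        System.out.println("training samples: " + split.get(0).size());
        System.out.println("test samples:     " + split.get(1).size());

        for (List<Integer> fold : folds(10, 3)) {
            System.out.println("fold: " + fold);
        }
    }
}
```

For a repeated cross validation, such as the ten-times-ten-fold variant, the indices would be shuffled anew before each repetition.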

Support JDMP

You can use JDMP free of charge. However, if you like it, we would appreciate a small donation. Please support open source software. Thank you very much!
