Tutorial
1. Download
First, you must download the jar files you need from our SourceForge download page. You should download jdmp-complete.jar, which contains all modules including visualization and interfaces to other machine learning packages such as Weka or LibSVM. If you want to use those, you will also need other jar files from these projects. Be careful about the licensing terms of these third party packages. Take a look at the package overview for more details.
2. Installation
You need a recent Java Development Kit (JDK) to use JDMP. JDK 6 or higher is required; JDMP will not work with Java 5 or lower.
Make sure to include the necessary jar files from above (and their dependencies) in your Java classpath. If you are unsure how to install Java or add the libraries to your classpath, please take a look at the Java tutorials out there. Also consider using an integrated development environment (IDE) such as Eclipse for your own Java applications.
3. A Quick GUI Tour
Ok, so now you want to see JDMP in action? Well then, let’s get started! The best way to get acquainted with JDMP and its object types is through the graphical user interface. Try to execute the main method in org.jdmp.gui.JDMP. This should bring up the JDMP GUI, which is similar to a workspace in Matlab, Octave or FreeMat. If you cannot see the window, make sure that you are using the correct Java version. JDMP requires Java 6 or higher and will not work with lower versions.
The workspace in JDMP is called Module and you can use a syntax similar to Matlab to execute basic calculations, e.g.:
a=[1;2;3]
b=[1,2,3]
c=a'+b
c=a*b
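To make clear what the four console statements above compute, here is a minimal sketch in plain Java. It assumes the console follows Matlab semantics, where `'` is the transpose and `*` is the matrix product; it uses ordinary arrays and does not call the JDMP API. The class and method names are made up for illustration.

```java
import java.util.Arrays;

public class ConsoleExampleSketch {

    // Transpose a rows x cols array into a cols x rows array.
    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[0].length; c++) {
                t[c][r] = m[r][c];
            }
        }
        return t;
    }

    // Element-wise sum of two arrays of identical size.
    static double[][] plus(double[][] x, double[][] y) {
        double[][] s = new double[x.length][x[0].length];
        for (int r = 0; r < x.length; r++) {
            for (int c = 0; c < x[0].length; c++) {
                s[r][c] = x[r][c] + y[r][c];
            }
        }
        return s;
    }

    // Matrix product; x must have as many columns as y has rows.
    static double[][] times(double[][] x, double[][] y) {
        double[][] p = new double[x.length][y[0].length];
        for (int r = 0; r < x.length; r++) {
            for (int c = 0; c < y[0].length; c++) {
                for (int k = 0; k < y.length; k++) {
                    p[r][c] += x[r][k] * y[k][c];
                }
            }
        }
        return p;
    }

    public static void main(String[] args) {
        double[][] a = {{1}, {2}, {3}}; // a=[1;2;3], a 3x1 column
        double[][] b = {{1, 2, 3}};     // b=[1,2,3], a 1x3 row
        // c=a'+b gives a 1x3 row: [[2.0, 4.0, 6.0]]
        System.out.println(Arrays.deepToString(plus(transpose(a), b)));
        // c=a*b gives a 3x3 matrix: [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
        System.out.println(Arrays.deepToString(times(a, b)));
    }
}
```

Under these assumptions the second assignment to c replaces a 1x3 result with a 3x3 result.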
Modules can store Variables, Samples, DataSets, Algorithms and other Modules, as you can see on the left side. You have just created three Variables, a, b and c. If you double-click on the c variable, it will bring up the visualization of this object:
Remember, we first calculated c=a'+b and after that c=a*b. As you can see, a Variable can remember previous matrices (top right), which is useful when you want to monitor how values are changing over time. When a Variable is visualized, all matrices in it are concatenated together to form a view on the whole data. There are different views:
- an editor (bottom right) to change single values,
- an image view, which gives an overview of the whole matrix and scales automatically when it is very large (bottom left),
- and a 2D plot which displays the values of each column as a line in different colors (top left, the brown lines are the running averages).
If you double-click on a single matrix, you can inspect its values:
If you have Octave, R, Matlab or GnuPlot installed, additional tabs are available which allow plotting a matrix in those software packages. The following screenshot shows a histogram plot in Octave:
If the installation is not found automatically on your computer, you must specify its location using a system property, e.g. Octave="/usr/bin/octave". You can use System.setProperty() to set it in your code. Note that this functionality may cause problems on some computers because it uses input and output streams to communicate with the programs. For this reason, Matlab is not supported under Windows. Other operating systems such as Mac OS have not been tested at all, so feedback is welcome!
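As a minimal sketch, the property from the example above could be set like this. The key "Octave" and the path are taken from the text; the class and method names are made up, and setting the property before the GUI is started is an assumption rather than documented behavior.

```java
public class OctavePathSketch {

    // Sets the "Octave" system property and returns the value read back.
    static String configureOctave(String path) {
        System.setProperty("Octave", path);
        return System.getProperty("Octave");
    }

    public static void main(String[] args) {
        // Assumption: JDMP reads the property when it looks for the installation,
        // so it is set here before any JDMP code runs.
        System.out.println(configureOctave("/usr/bin/octave"));
    }
}
```

Alternatively, any Java system property can be passed on the command line with the standard -D option, e.g. java -DOctave=/usr/bin/octave followed by the usual arguments.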
Now click on Examples in the module window and choose Iris DataSet. This creates the famous iris flower data set, which is often used to test classification algorithms. It appears in the DataSet list, and you can visualize it with a double-click.
A DataSet in JDMP contains a list of Samples, which are the instances used for training and prediction. In addition, there are Variables to measure classification accuracy or store other global properties. The Variable input is a view on the data of all Samples. Now check out what a Sample looks like:
It is a collection of Variables, which store input features, target value and label. If you run a classifier on the DataSet, each sample will also contain Variables with the prediction and the root mean squared error compared to the desired result.
Well, so much for the GUI tour, you can click around a little more to get familiar with JDMP’s objects.
Finished? OK, now you should have an understanding of the basic object types of JDMP, in short:
- Variables contain Matrices
- Samples contain Variables
- DataSets contain Samples (and Variables)
- Modules are logical “workspaces” to hold a set of Variables, Samples, DataSets, Algorithms and other Modules
- Algorithms can manipulate DataSets, Variables or Samples
But wait, I did not describe the Algorithms. Don’t they have visualization? Yes, they do, but if you want to use Algorithms, you must write Java code. JDMP is primarily a machine learning library which you can use in your applications. The GUI console with the scripting language is still in a very early stage of development and serves as a demonstration tool until more functionality is integrated.
But don’t be disappointed, you can still visualize all objects in your Java code by calling myVariable.showGUI() or myDataset.showGUI().
4. Main Concepts
Still interested? Ok, then it’s now time to introduce you to the main concepts of JDMP:
4.1. JDMP Philosophy #1: Everything is a Matrix!
Maybe you’ve heard that in Linux, everything is a file. For many people, it might sound strange at first to think of a printer or a screen as a file. But it facilitates programming in Linux, because the programmer only has to know how to read from files and how to write to files.
In JDMP, the central statement is very similar:
Any type of data is a Matrix (or can be converted into a Matrix).
So, if you want to use JDMP, you will have to know how to read data from a Matrix and how to write data to it (which is really simple), and how to convert your data into a Matrix at the beginning (which should also be very easy in most cases). This matrix-centric view might sound more intuitive than the file-centric view from Linux, because many data sources really come in matrix format:
- CSV files are arranged in matrix format (lines x columns)
- Excel sheets are arranged in matrix format (rows x columns)
- Tables in a database have matrix format (rows x columns)
- After decoding, pictures have matrix format on the screen (height x width)
- Some image formats like BMP or TIFF are also stored in matrix format on disk (height x width)
So far so good, but now comes the harder part:
- Lists have matrix format (n entries x one column)
- Mappings (java.util.Map interface) have matrix format (n entries x two columns for key and value)
- Graphs can be represented in matrix format as an adjacency matrix (n nodes x n nodes)
- Graphs can be represented in matrix format as an incidence matrix (n nodes x m edges)
- Trees are a special form of graphs and are represented in the same way e.g. as an adjacency matrix (n objects x n objects)
- Text sentences can be converted into matrix format, e.g. as a bag of words (n sentences x m words in dictionary)
- File listings in one directory are represented in matrix format (n files x columns for file name, size, permissions, modification date)
- Audio files are represented in matrix format (n samples x 2 columns for left and right channel)
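To illustrate the two graph representations from the list above, the following sketch builds both matrices for the same small undirected graph. It uses plain Java arrays rather than JDMP classes, and the class and method names are hypothetical.

```java
import java.util.Arrays;

public class GraphMatrixSketch {

    // Adjacency matrix (n nodes x n nodes): entry (i, j) is 1 if an edge joins i and j.
    static int[][] adjacency(int nodes, int[][] edges) {
        int[][] m = new int[nodes][nodes];
        for (int[] e : edges) {
            m[e[0]][e[1]] = 1;
            m[e[1]][e[0]] = 1; // undirected graph: mirror the entry
        }
        return m;
    }

    // Incidence matrix (n nodes x m edges): entry (i, k) is 1 if node i is an endpoint of edge k.
    static int[][] incidence(int nodes, int[][] edges) {
        int[][] m = new int[nodes][edges.length];
        for (int k = 0; k < edges.length; k++) {
            m[edges[k][0]][k] = 1;
            m[edges[k][1]][k] = 1;
        }
        return m;
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}}; // the path 0 - 1 - 2
        // prints [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
        System.out.println(Arrays.deepToString(adjacency(3, edges)));
        // prints [[1, 0], [1, 1], [0, 1]]
        System.out.println(Arrays.deepToString(incidence(3, edges)));
    }
}
```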
As you can see, sometimes there is more than one suitable matrix representation for an object, like for graphs. Another point to note is that all of the examples so far are 2-dimensional, but, of course, it is also possible to have higher dimensional data:
- Computer tomography scans can be stored in a 3D-matrix (height x width x depth)
- Videos can be represented as a sequence of images in a 3D-matrix (n frames x height x width)
With more than three dimensions, it gets harder to think of simple examples, but rest assured that JDMP can also handle high-dimensional data. Needless to say, one-dimensional data like feature vectors can also be stored in a matrix, namely one that has only one column or only one row.
The benefit of this everything-is-a-matrix view is that the classes for storage and processing have to be implemented only once. A JDMP class that can store an n-dimensional Matrix in a cache file on disk can also store an image, an object list, a graph, a WAV file or a CT scan. The same applies to a JDMP class that can connect to a JDBC database. You could even make a backup of a remote database to a local Excel file, just by executing:
MatrixFactory.linkToJDBC(uriToYourDatabase).exportToFile(Format.XLS, backupExcelFile);
Or, as a more practical example: Once you have transformed the log file of your web-server into a matrix, you are able to analyze it with a variety of text mining algorithms!
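How text such as log lines might become a matrix can be sketched with the bag-of-words representation mentioned earlier (n sentences x m words in dictionary). This is a plain-Java illustration with made-up input lines and hypothetical names; it is not how JDMP itself imports text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class BagOfWordsSketch {

    // Sorted list of the distinct lower-case words over all sentences.
    static List<String> dictionary(String[] sentences) {
        TreeSet<String> words = new TreeSet<>();
        for (String s : sentences) {
            words.addAll(Arrays.asList(s.toLowerCase().split("\\s+")));
        }
        return new ArrayList<>(words);
    }

    // Count matrix: entry (i, j) is how often dictionary word j occurs in sentence i.
    static int[][] bagOfWords(String[] sentences, List<String> dictionary) {
        int[][] m = new int[sentences.length][dictionary.size()];
        for (int i = 0; i < sentences.length; i++) {
            for (String w : sentences[i].toLowerCase().split("\\s+")) {
                m[i][dictionary.indexOf(w)]++;
            }
        }
        return m;
    }

    public static void main(String[] args) {
        String[] lines = {"GET index GET", "POST index"}; // made-up input
        List<String> dict = dictionary(lines);
        System.out.println(dict);                                       // [get, index, post]
        System.out.println(Arrays.deepToString(bagOfWords(lines, dict))); // [[2, 1, 0], [0, 1, 1]]
    }
}
```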
To summarize the workflow in JDMP:
- 1. Import your data into a Matrix, e.g. from CSV, Excel, JDBC (raw data -> matrix)
- 2. Analyze your data, e.g. classify, cluster, make predictions and get the results also as matrices (matrix -> matrix)
- 3. Export results to the format you need, e.g. JPG, Excel, Latex, GnuPlot (matrix -> desired format)
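The three steps above can be sketched end to end in plain Java. This is only an illustration of the raw data -> matrix -> matrix -> desired format idea under simple assumptions (comma-separated numeric input without quoting, column means as the "analysis", LaTeX table rows as the output); all names are hypothetical and none of them belong to the JDMP API.

```java
import java.util.Locale;

public class WorkflowSketch {

    // Step 1: import raw CSV text into a numeric matrix.
    static double[][] importCsv(String csv) {
        String[] lines = csv.trim().split("\\R");
        double[][] m = new double[lines.length][];
        for (int r = 0; r < lines.length; r++) {
            String[] cells = lines[r].split(",");
            m[r] = new double[cells.length];
            for (int c = 0; c < cells.length; c++) {
                m[r][c] = Double.parseDouble(cells[c].trim());
            }
        }
        return m;
    }

    // Step 2: analyze; here simply the mean of each column, returned as a 1 x n matrix.
    static double[][] columnMeans(double[][] m) {
        double[][] means = new double[1][m[0].length];
        for (double[] row : m) {
            for (int c = 0; c < row.length; c++) {
                means[0][c] += row[c] / m.length;
            }
        }
        return means;
    }

    // Step 3: export the result matrix as rows of a LaTeX tabular body.
    static String exportLatex(double[][] m) {
        StringBuilder sb = new StringBuilder();
        for (double[] row : m) {
            for (int c = 0; c < row.length; c++) {
                sb.append(c == 0 ? "" : " & ");
                sb.append(String.format(Locale.ROOT, "%.2f", row[c]));
            }
            sb.append(" \\\\\n"); // LaTeX row terminator followed by a newline
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[][] data = importCsv("1,2\n3,4\n");
        System.out.print(exportLatex(columnMeans(data)));
    }
}
```

Because every step consumes and produces a matrix, any step can be swapped for another without touching the rest.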
For the second step, JDMP provides wrapper classes to access the matrix data in a format that is more convenient for humans. For example, in a classification task, you would prefer dealing with a ClassificationDataSet, ClassificationSamples, and a Classifier algorithm instead of a number of data and parameter matrices, right?
As you can see, it is easier to deal with the data when we use informative class names to access it and, even better, when we have additional information like labels, names etc. available. This brings us to our second philosophy:
4.2. JDMP Philosophy #2: Every Matrix can have annotations
Now what exactly does that mean? It’s easiest to explain using an example. Consider the following 2-dimensional data matrix:
| 1 | Holger | Arndt | arndt (at) jdmp (dot) org | 2004 |
| 2 | Andreas | Naegele | naegele (at) jdmp (dot) org | 2007 |
| 3 | Markus | Bundschus | bundschus (at) jdmp (dot) org | 2008 |
This matrix (without annotations) contains just the actual data (the Matrix entries), which might be numeric values, Strings or any other type of object. It looks like a list of people, but to really understand the meaning of the data, we need more information, the Matrix annotations:
| ID | First Name | Last Name | E-mail | Developer Since |
|---|---|---|---|---|
| 1 | Holger | Arndt | arndt (at) jdmp (dot) org | 2004 |
| 2 | Andreas | Naegele | naegele (at) jdmp (dot) org | 2007 |
| 3 | Markus | Bundschus | bundschus (at) jdmp (dot) org | 2008 |
In this example, we have introduced column labels to clarify the meaning of the data. In JDMP, however, we can also provide row labels, axis labels or matrix annotations, as illustrated in the next table:
| MatrixAnnotation | | AxisAnnotation(1): Column Label for the whole Axis | | | | |
|---|---|---|---|---|---|---|
| | | AxisAnnotation(1,0): Column Label (0) | AxisAnnotation(1,1): Column Label (1) | AxisAnnotation(1,2): Column Label (2) | AxisAnnotation(1,3): Column Label (3) | AxisAnnotation(1,4): Column Label (4) |
| AxisAnnotation(0): Row Label for Axis | AxisAnnotation(0,0): Row Label (0) | Matrix Entry (0,0) | Matrix Entry (0,1) | Matrix Entry (0,2) | Matrix Entry (0,3) | Matrix Entry (0,4) |
| | AxisAnnotation(0,1): Row Label (1) | Matrix Entry (1,0) | Matrix Entry (1,1) | Matrix Entry (1,2) | Matrix Entry (1,3) | Matrix Entry (1,4) |
| | AxisAnnotation(0,2): Row Label (2) | Matrix Entry (2,0) | Matrix Entry (2,1) | Matrix Entry (2,2) | Matrix Entry (2,3) | Matrix Entry (2,4) |
These annotations are not restricted to Strings or numbers; instead, any object type (also intermixed) is allowed.
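The idea of annotations can be sketched with a small hypothetical container in plain Java. The class AnnotatedMatrix below is not a JDMP class and its fields and methods are assumptions made for illustration; it only shows how labels of arbitrary object type can sit next to the entries and be used for lookups. The data comes from the developer table above (without the e-mail column), and the matrix annotation "Developers" is made up.

```java
public class AnnotatedMatrixSketch {

    // Hypothetical container, not a JDMP class: matrix entries plus annotations.
    static final class AnnotatedMatrix {
        final Object[][] entries;
        final Object[] rowLabels;     // one label per row (axis 0)
        final Object[] columnLabels;  // one label per column (axis 1)
        Object matrixAnnotation;      // describes the matrix as a whole

        AnnotatedMatrix(Object[][] entries, Object[] rowLabels, Object[] columnLabels) {
            this.entries = entries;
            this.rowLabels = rowLabels;
            this.columnLabels = columnLabels;
        }

        // Looks up an entry by row index and column label instead of column index.
        Object get(int row, Object columnLabel) {
            for (int c = 0; c < columnLabels.length; c++) {
                if (columnLabel.equals(columnLabels[c])) {
                    return entries[row][c];
                }
            }
            throw new IllegalArgumentException("unknown column label: " + columnLabel);
        }
    }

    static AnnotatedMatrix developers() {
        Object[][] entries = {
            {1, "Holger", "Arndt", 2004},
            {2, "Andreas", "Naegele", 2007},
            {3, "Markus", "Bundschus", 2008}
        };
        Object[] rowLabels = {0, 1, 2}; // labels may be numbers, Strings or other objects
        Object[] columnLabels = {"ID", "First Name", "Last Name", "Developer Since"};
        AnnotatedMatrix m = new AnnotatedMatrix(entries, rowLabels, columnLabels);
        m.matrixAnnotation = "Developers"; // made-up label for illustration
        return m;
    }

    public static void main(String[] args) {
        System.out.println(developers().get(1, "Last Name")); // prints Naegele
    }
}
```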
The following sections of the documentation have yet to be written:
4.3. JDMP Philosophy #3: Every object can be visualized
5. The Universal Java Matrix Package
5.1. Basic Matrix Operations
5.2. Importing data
5.3. Exporting data
5.4. Basic Interfaces
Disposable HasId HasDescription HasLabel HasToolTip Wrapper
5.5. Collection Classes
ArrayIndexList CachedMap HashMapList MapToListWrapper SerializedObjectMap SoftHashMap SoftHashMapList
5.6. Writing your own Matrix Implementation
5.7. Writing your own Calculations
6. Variable
7. Sample
8. DataSet
9. Algorithm
10. Module
Workflow
Create a DataSet.
Split into training and test set.
Data normalization.
Create a Classifier.
Training.
Prediction.
Evaluation.
Cross Validation.
Please take a look at the JDMP Forum in the meantime