Tuesday, April 5, 2011

Weka on Java through Eclipse - Getting Started

I wanted to use Weka in my Java code. Weka is a useful toolkit for machine learning. It has all the basic classifiers, clustering techniques and much more, all of which can be easily trained and used. I use Eclipse to develop Java code, so I wanted to learn how to integrate Weka into my code in Eclipse. I had to learn the following:

  • How do I make the functionality of Weka available in my Java code? I had to find the jar file for Weka and include it in my project. I found Weka with Java - Getting Started very helpful for this.

  • How do I load a dataset (stored as a csv or arff file), filter the dataset by providing options, train a classifier with certain specified options, and so on? These were very nicely explained in Weka with Java - Essentials with examples.

  • How do I save a classifier/clusterer/associator model and load it again for prediction? The model file is usually saved as <name>.model (e.g. J48.model, Naive.model). Somehow, the "-d <model file name>" option for saving a model and the "-l <model file name>" option for loading one did not seem to work for me. However, the following works (Source):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

import weka.classifiers.bayes.NaiveBayes;

// Lines of code corresponding to the building of the model
// (naiveBayesClassifier is the trained NaiveBayes, modelFileName is the path of the model file)

// Save the model file; try-with-resources flushes and closes the stream
try (ObjectOutputStream objectOutputStream =
         new ObjectOutputStream(new FileOutputStream(modelFileName))) {
    objectOutputStream.writeObject(naiveBayesClassifier);
}

// Read the model file back into a classifier
try (ObjectInputStream objectInputStream =
         new ObjectInputStream(new FileInputStream(modelFileName))) {
    naiveBayesClassifier = (NaiveBayes) objectInputStream.readObject();
}
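
The save/load pattern above is plain Java serialization, so it can be tried out without Weka on the classpath. Below is a minimal sketch under that assumption: ToyModel is a hypothetical stand-in for a trained classifier (it is not a Weka class), and ModelFileSketch, save and load are names made up for this illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a trained classifier. Any Serializable object
// can be written and read back in the same way.
final class ToyModel implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final double threshold;

    ToyModel(String name, double threshold) {
        this.name = name;
        this.threshold = threshold;
    }

    // Toy decision rule: true when the value reaches the threshold.
    boolean predict(double value) {
        return value >= threshold;
    }
}

final class ModelFileSketch {
    // Write a Serializable model to a file; the stream is closed on exit.
    static void save(Serializable model, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a model back; the caller casts it to the expected type.
    static Object load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Model class is not on the classpath", e);
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("toy", ".model");
        file.deleteOnExit();
        save(new ToyModel("toy", 0.5), file);
        ToyModel restored = (ToyModel) load(file);
        System.out.println(restored.name + " predicts " + restored.predict(0.7));
    }
}
```

The class that was saved must be available when the file is read back, which is why load reports a missing class separately from an I/O failure.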


Tuesday, January 4, 2011

Twinkle tool

Now that we understand RDF and SPARQL, we can try to build our own database in RDF format and query it using SPARQL. To query such a database, which might be offline, we can use the Twinkle SPARQL query tool available at http://www.ldodds.com/projects/twinkle/. I downloaded the Twinkle 2.0 binary and tried it. The examples below were run with this version.

STARTING UP (Taken from Twinkle's page)

You'll need to have Java 1.5 or higher installed to use Twinkle. Download the distribution and unzip it into a new directory. Open a command prompt and change the current directory to this newly created directory, i.e.

cd <path>
 
and execute the following:
 
java -jar twinkle.jar

Below is the screenshot of the Twinkle tool obtained on executing the above command. In Twinkle, the result can be viewed in two formats: text or table. The format can be selected by clicking the relevant tab in the bottom region of the window.

[Screenshot: Twinkle window]

EXAMPLE 1

Let us use john.xml or john.n3 as the data source. Save either file somewhere on your computer. In Twinkle's window, next to 'Data URL', click the 'File' button to browse your computer and select the file (john.xml or john.n3).

Once the data source is set, we can write a SPARQL query to access specific information in it.

Query 1
If we want to get John's mother's name, the query looks as below:

SPARQL query:
Explanation:
Result:

Query 2
Similarly, if we want to get John's father's name, the query is as below:

SPARQL query:

Explanation:
Result:

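To make concrete what such a query asks for, here is a minimal sketch in plain Java. It is not Twinkle's implementation and not the original query: it only illustrates how a triple pattern with one variable, such as { ns:john ns:hasMother ?x }, selects matching statements. The class TriplePatternSketch and the method objectsOf are hypothetical names, and the data is the content of john.n3 kept as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TriplePatternSketch {
    // One RDF statement: subject, predicate, object (prefixed names kept as plain strings).
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // The four statements from john.n3.
    static final List<Triple> JOHN = Arrays.asList(
            new Triple("ns:john", "a", "ns:Person"),
            new Triple("ns:john", "ns:hasMother", "ns:susan"),
            new Triple("ns:john", "ns:hasFather", "ns:richard"),
            new Triple("ns:richard", "ns:hasBrother", "ns:luke"));

    // Return the objects of all triples with the given subject and predicate,
    // i.e. the bindings of ?x in the pattern { subject predicate ?x }.
    static List<String> objectsOf(List<Triple> data, String subject, String predicate) {
        List<String> bindings = new ArrayList<>();
        for (Triple t : data) {
            if (t.subject.equals(subject) && t.predicate.equals(predicate)) {
                bindings.add(t.object);
            }
        }
        return bindings;
    }

    public static void main(String[] args) {
        System.out.println(objectsOf(JOHN, "ns:john", "ns:hasMother")); // prints [ns:susan]
        System.out.println(objectsOf(JOHN, "ns:john", "ns:hasFather")); // prints [ns:richard]
    }
}
```

A real SPARQL engine works on full IRIs and supports variables in any position; the sketch fixes the subject and predicate only to keep the matching step visible.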

EXAMPLE 2

Let's try writing SPARQL queries on a slightly bigger dataset. For this, let's consider the foaf.rdf file published by Dave Beckett - http://www.dajobe.org/foaf.rdf. Save this file on your computer and supply it as the 'Data URL' for Twinkle.

Query 1 (Taken from http://librdf.org/query)

SPARQL query:


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name
WHERE {
  ?x rdf:type foaf:Person .
  ?x foaf:name ?name
}
ORDER BY ?name


Explanation:

This query uses the rdf and foaf namespaces. It returns the distinct names (duplicate entries are not listed) of the persons listed in the database, sorted in alphabetical order.

Result: (text format)

------------------------
| name                 |
========================
| "Bijan Parsia"       |
| "Damian Steer"       |
| "Dan Brickley"       |
| "Dan Connolly"       |
| "Dave Beckett"       |
| "Edd Dumbill"        |
| "Eric Miller"        |
| "Jan Grant"          |
| "Jim Hendler"        |
| "Jo Walsh"           |
| "Libby Miller"       |
| "Matt Biddulph"      |
| "Morten Frederiksen" |
| "Norm Walsh"         |
| "Phil McCarthy"      |
| "Sean B. Palmer"     |
| "Tim Berners-Lee"    |
------------------------
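
The query above combines two triple patterns that share the variable ?x, then removes duplicates and sorts. A rough Java sketch of those three steps follows. It is only an illustration under stated assumptions: the sample data is made up (it is not the content of foaf.rdf), DistinctNamesSketch and its methods are hypothetical names, and sorting uses Java's natural String order as a stand-in for ORDER BY.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class DistinctNamesSketch {
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Made-up sample data: three persons (two sharing a name) and one document.
    static List<Triple> sampleData() {
        return Arrays.asList(
                new Triple("_:p1", "rdf:type", "foaf:Person"),
                new Triple("_:p1", "foaf:name", "Libby Miller"),
                new Triple("_:p2", "rdf:type", "foaf:Person"),
                new Triple("_:p2", "foaf:name", "Dave Beckett"),
                new Triple("_:p3", "rdf:type", "foaf:Person"),
                new Triple("_:p3", "foaf:name", "Dave Beckett"),
                new Triple("_:d1", "rdf:type", "foaf:Document"),
                new Triple("_:d1", "foaf:name", "Some Document"));
    }

    static List<String> distinctPersonNames(List<Triple> data) {
        // Pattern 1: every ?x with ?x rdf:type foaf:Person.
        Set<String> persons = new HashSet<>();
        for (Triple t : data) {
            if (t.predicate.equals("rdf:type") && t.object.equals("foaf:Person")) {
                persons.add(t.subject);
            }
        }
        // Pattern 2, joined on ?x: ?x foaf:name ?name.
        // The TreeSet drops duplicates (DISTINCT) and keeps names sorted (ORDER BY).
        TreeSet<String> names = new TreeSet<>();
        for (Triple t : data) {
            if (t.predicate.equals("foaf:name") && persons.contains(t.subject)) {
                names.add(t.object);
            }
        }
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        System.out.println(distinctPersonNames(sampleData())); // prints [Dave Beckett, Libby Miller]
    }
}
```

The document named "Some Document" is left out because it fails the first pattern, which is what the shared variable ?x enforces.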


RDF & SPARQL

RDF

RDF is used to describe metadata (data about data). In RDF, data is represented as {subject, predicate, object} triples. As in English, the subject is the entity that the data is about, the predicate is the property of the subject that is being described, and the object is the value of that property. E.g. in the sentence "Agra is in India", "Agra" is the subject, "is in" can be considered the predicate and "India" is the object. The equivalent RDF graph can be written as follows:

Agra --[is in]--> India

RDF data can be stored in N3 format or XML format.

Example: (Taken from Quick Intro to RDF - http://www.rdfabout.com/quickintro.xpd)

The data graph representing relations of John can be written in N3 format as follows:
@prefix ns: <http://www.example.org/> .
ns:john    a             ns:Person .
ns:john    ns:hasMother  ns:susan .
ns:john    ns:hasFather  ns:richard .
ns:richard ns:hasBrother ns:luke .

File Name: john.n3

In the above example, @prefix specifies the namespace used in the RDF. Here, the namespace http://www.example.org/ is abbreviated as 'ns' and used in the rest of the data description. A namespace gives the names used in the data an unambiguous meaning: it lets us name the different entities of the data and their properties. In the example, john, richard, susan and luke are resources and hasMother, hasFather and hasBrother are properties defined in the namespace 'ns', allowing us to define relations between them.

Further, it is also required to specify how the entities of the namespace are related to each other. This is done using RDF Schema. The ontology language OIL helps us to extend RDF Schema to define more specific relations and properties. I have not gone into much detail on RDF Schema or OIL, so please verify that what I have said about them is correct.

The same data graph can be written in XML format as follows:
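
As a small illustration of what the abbreviation means, the sketch below expands a prefixed name such as ns:john into its full form by replacing the prefix with the namespace. This is a simplified picture in plain Java, not a real N3 parser; PrefixSketch and expand are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

final class PrefixSketch {
    // Replace a known prefix with its namespace; leave anything else unchanged.
    static String expand(Map<String, String> prefixes, String name) {
        int colon = name.indexOf(':');
        if (colon < 0) {
            return name; // no prefix, e.g. the keyword "a"
        }
        String namespace = prefixes.get(name.substring(0, colon));
        if (namespace == null) {
            return name; // unknown prefix: returned as given
        }
        return namespace + name.substring(colon + 1);
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = new HashMap<>();
        prefixes.put("ns", "http://www.example.org/");
        System.out.println(expand(prefixes, "ns:john"));      // prints http://www.example.org/john
        System.out.println(expand(prefixes, "ns:hasMother")); // prints http://www.example.org/hasMother
    }
}
```

The keyword "a" in john.n3 has no prefix; in N3 it is shorthand for the rdf:type property.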

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:ns="http://www.example.org/#">
  <ns:Person rdf:about="http://www.example.org/#john">
    <ns:hasMother rdf:resource="http://www.example.org/#susan" />
    <ns:hasFather>
      <rdf:Description rdf:about="http://www.example.org/#richard">
        <ns:hasBrother rdf:resource="http://www.example.org/#luke" />
      </rdf:Description>
    </ns:hasFather>
  </ns:Person>
</rdf:RDF>

File Name: john.xml

In the XML version, <rdf:RDF> marks the XML document as an RDF document. It also contains a reference to the RDF namespace. The <rdf:Description> element contains elements that describe the resource. For more details, refer to the w3schools material on RDF (http://www.w3schools.com/rdf/rdf_main.asp).


SPARQL

As SPARQL is used to query RDF data, a SPARQL query also describes data as {subject, predicate, object} triples. The general format of a SPARQL query is given below (Taken from a presentation available online - Sorry, forgot the specific details):

# prefix declarations for namespaces
PREFIX foo: <http://example.com/resources/>
......
# result clause
SELECT ...

# dataset definitions for specifying the database to use
FROM <...>

# query pattern
WHERE {
.....
}

# query modifiers
ORDER BY ...

PREFIX serves the same purpose in SPARQL as @prefix does in N3.
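
As a rough illustration of how these clauses fit together, the Java sketch below assembles a query string in the order the SPARQL grammar uses: PREFIX, SELECT, an optional FROM, WHERE, and then modifiers such as ORDER BY. SparqlTemplateSketch and select are hypothetical names, and the query in main is only an example built on the 'ns' namespace of john.n3.

```java
import java.util.ArrayList;
import java.util.List;

final class SparqlTemplateSketch {
    // Build a SELECT query with one prefix; fromIri and orderBy may be null.
    static String select(String prefix, String namespace, String variables,
                         String fromIri, String pattern, String orderBy) {
        List<String> lines = new ArrayList<>();
        lines.add("PREFIX " + prefix + ": <" + namespace + ">");
        lines.add("SELECT " + variables);
        if (fromIri != null) {
            lines.add("FROM <" + fromIri + ">");
        }
        lines.add("WHERE { " + pattern + " }");
        if (orderBy != null) {
            lines.add("ORDER BY " + orderBy);
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        // No FROM clause: the data source is assumed to be chosen in the query tool.
        System.out.println(select("ns", "http://www.example.org/", "?mother",
                null, "ns:john ns:hasMother ?mother", "?mother"));
    }
}
```

The sketch does no escaping or validation, so it is suitable only for fixed, trusted inputs like the ones shown.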

For complete details on the SPARQL query language with respect to RDF, refer to SPARQL Query Language for RDF. Below are my comments on this document.
  1. This document is easy to read with the knowledge gained from the previous post: RDF, SPARQL and DBPedia Basics - My Journey
  2. One can start directly from Section 1.2 (Document Conventions) of the document.
  3. For the first read, I stayed on this page and did not follow any of the links given on it. It was still easy to follow.
  4. The examples given in Section 2 and Section 3 can be run and tested using the Twinkle tool. The data given for the examples is in N3 format, so store it as an N3 file and provide this file to the tool for querying. However, I am somehow getting the result entries on Twinkle in the reverse of the order shown as the expected result in the document. I don't know how to fix it. I will come back and look at it later. Also, I need to figure out how to get the result on Twinkle in N3 format when the CONSTRUCT query form is used. By default, Twinkle shows it in RDF/XML format.
  5. Section 4.1.1 went over my head, as I am not very clear about URIs and URLs. So I just read through it without understanding much. I hope it doesn't affect my learning of SPARQL much. I will come back to this part later. However, I could understand the Relative IRIs part better through the example given in Section 4.2.
  6. Section 4.1.2 - Grammar Rules just gives the definitions of different literals like integer, decimal, string etc. as regular expressions. We generally know what we mean by an integer, a decimal or a string, so I think one can skip it if it's a bit hard to follow as regular expressions. If you are coming across these terms for the first time, search for their textual definitions on Google and go through the examples and the introductory text. That's sufficient. The same is true of Section 4.1.3 and wherever Grammar Rules are given.
  7. Section 8 talks about RDF Datasets. I feel it is a must-read section to be able to tap into information from the whole dataset.
Some useful links:
  1. Difference between RDF and OWL