My Share Point: 2015

Motivation

I found a couple of tutorials on Machine Learning techniques I was interested in, but they were implemented in Python, whose syntax was new to me. Also, scikit-learn and Spark both offer Python APIs and have become popular tools for Machine Learning. So, I thought, why not learn a bit of Python? Thus, my exploration began.

Setting up Python

The first thing I noticed about Python is that scripts written for a newer version may not run on an older one, even when the script does nothing fancy. The best-known case is Python 2 versus Python 3: print is a statement in Python 2 but a function in Python 3. So, you need to get the right version. I will be using Python 2.7.10, which comes as part of Anaconda, for all my explorations. Anaconda bundles many useful Python packages along with the Python distribution and is very easy to install.

Download the .sh file
Run bash <.sh file downloaded>
Accept the Terms and Conditions, set path for installation and you are done.

Once you have it installed, you can enter the python prompt by running <Path where anaconda is installed ending in /anaconda>/bin/python2.7

You can exit from the python prompt using quit() or Ctrl-D

Unique to Python

In python,

Declaration of variables is simpler: one need not mention the type of the variable
A colon (:) together with indentation (spaces or tabs) is used for scoping (in other languages like Java or C, curly braces are used for scoping and indentation only enhances the readability of the code)
Lines need not end in ; unlike in other languages
# is used for single-line comments; a string enclosed in """ (three ") at the start and end is commonly used as a multi-line comment
One can create two types of objects: mutable and immutable. The value of an immutable object cannot be changed once it is created.

Let us compare declaring an array and printing its values with a for loop, to understand the difference between the Java and Python ways of coding. In Java (presence or absence of indentation doesn't matter):

int[] a = new int[]{1,2,3,4,5};
System.out.println("Start");
for(int i=0;i<5;i++){
    System.out.println("Value of a["+i+"] is "+a[i]);
}
System.out.println("End");

Coding the same task in Python (variables need no type declaration, but numbers must be converted with str() before being concatenated to a string; indentation decides the scope of the for loop):

a = [1,2,3,4,5]
print("Start")
for i in range(0, 5):
    print("Value of a["+str(i)+"] is "+str(a[i]))
print("End")

Another way of printing in Python 2:

for i in range(0, 5):
    print "Value of a[%d] is %d" % (i,a[i])

When the first line of the for loop is typed at the Python prompt and the Enter key is pressed, one sees ... To enter the loop body, one needs to indent with spaces or a tab and then type the line that is part of the loop. To end the loop and execute it, one needs to press Enter twice.

Handling Arrays (Lists) and Tuples

Since arrays are commonly used, let us look at how to create an array and access its elements. In Python, arrays are referred to as lists and the indices start from 0.

Defining at once: a = [1,2,3,4,5]

Starting with an empty array and filling it up:

a = []
for i in range(0, 5):
    a.append(i+1)

Another way of creating array with for loop: a = [i+1 for i in range(5)] # The lower limit of range is 0 by default

Creating an array with same value, say 10, in each slot: a = [10]*5

Size of an array: len(a)

Accessing element at a particular index i: a[i]

Accessing all elements from index 2 till the end:

for e in a[2:]:
    print e

Accessing all elements, except the last 3 elements:

for e in a[:-3]:
    print e

Concatenate two arrays a and b: a + b
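As a quick check of the slicing and concatenation operations above, here is a minimal sketch. The list b and the variable names tail, head and joined are only for illustration. print is called with a single argument in parentheses so the snippet runs on Python 2.7 as well as Python 3.

```python
a = [1, 2, 3, 4, 5]
b = [6, 7]          # hypothetical second list

tail = a[2:]        # elements from index 2 till the end
head = a[:-3]       # everything except the last 3 elements
joined = a + b      # a new list; a and b themselves are unchanged

print(tail)    # [3, 4, 5]
print(head)    # [1, 2]
print(joined)  # [1, 2, 3, 4, 5, 6, 7]
print(len(a))  # 5
```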

While a list (array) is a mutable object, a tuple is immutable and uses parentheses instead of square brackets, e.g. t = (1,2,3). The contents of a tuple must be given when it is created. However, accessing elements in a tuple is similar to arrays.
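The difference between mutable and immutable objects is easiest to see by trying to change an element. The sketch below uses illustrative names (a, t, changed) and should behave the same on Python 2.7 and Python 3.

```python
a = [1, 2, 3]      # list: mutable
a[0] = 10          # allowed, the list is changed in place

t = (1, 2, 3)      # tuple: immutable
try:
    t[0] = 10      # raises TypeError, the tuple is left untouched
    changed = True
except TypeError:
    changed = False

print(a)        # [10, 2, 3]
print(t[1:])    # (2, 3)  slicing a tuple works just like a list
print(changed)  # False
```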

Accessing corresponding elements in two arrays (pairs are produced until the shorter of the two arrays is exhausted):

for e,f in zip(a,b):
    print e,f #Prints e and f values with space in between
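To see where zip stops when the two lists differ in length, here is a small sketch with made-up lists. list(...) is wrapped around zip so the result can be printed on Python 3 too, where zip returns an iterator rather than a list.

```python
a = [1, 2, 3, 4, 5]
b = ['x', 'y', 'z']      # shorter list, chosen for illustration

pairs = list(zip(a, b))  # stops after the 3rd pair
print(pairs)             # [(1, 'x'), (2, 'y'), (3, 'z')]
```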

Handling 2D Matrices (List of Lists)

Defining at once: m = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]

Starting with an empty matrix and filling it up:

m = []
for row in range(5):
    a = []
    for col in range(4):
        a.append(col+row*4)
    m.append(a)

Creating a matrix of 5X4 with values from 0 to 19: m = [[col+row*4 for col in range(4)] for row in range(5)]

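Creating a matrix of 5X4 with all elements having 0: m = [[0]*4 for row in range(5)] #Avoid 5*[4*[0]]: it repeats the same row object 5 times, so changing one row changes all of them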
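A word of caution on the 5*[4*[0]] shortcut: multiplying the outer list copies references to one inner list rather than creating five separate rows. The sketch below (variable names shared and independent are illustrative) shows the effect and a safer alternative.

```python
shared = 5 * [4 * [0]]                      # 5 references to the SAME row
shared[0][0] = 99                           # shows up in every row

independent = [[0] * 4 for _ in range(5)]   # 5 separate rows
independent[0][0] = 99                      # only row 0 changes

print(shared[1][0])            # 99
print(independent[1][0])       # 0
print(shared[0] is shared[1])  # True
```

Multiplying the inner list, as in [0]*4, is fine because integers are immutable; the problem only arises when the repeated element is itself mutable.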

Number of rows: len(m)

Number of columns: len(m[0])

Accessing i-th row and j-th column of a matrix: m[i][j]

Accessing i-th row of a matrix: m[i]

Accessing j-th column of a matrix: zip(*m)[j] #Need to transpose the matrix to access the column as an array
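Note that zip(*m)[j] relies on zip returning a list, which holds in Python 2 but not in Python 3. A sketch of two ways to get a column that work on both versions follows; the names j, col_a and col_b are just for illustration.

```python
m = [[col + row * 4 for col in range(4)] for row in range(5)]
j = 1

col_a = [row[j] for row in m]   # pick the j-th entry of every row
col_b = list(zip(*m))[j]        # transpose first, then take "row" j

print(col_a)  # [1, 5, 9, 13, 17]
print(col_b)  # (1, 5, 9, 13, 17)  a tuple, not a list
```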

Concatenate or stack two matrices m and n (Need not have same number of columns): m + n #5X4 matrix concatenated with 3X2 matrix yields matrix with 5+3=8 rows with first 5 rows having 4 entries and next 3 rows having 2 entries

Accessing corresponding rows in two matrices (pairs are produced until the matrix with fewer rows is exhausted):

for e,f in zip(m,n):
    print e,f #Prints row e and row f with space in between

Handling Dictionaries

Defining at once: d = {'name':'XYZ','age':25} #Python 2.7 does not guarantee any particular order of keys

Starting with an empty dictionary and filling it up:

d = {}

keys = ['name','age']

values = ['XYZ',25]

for i in range(2):
    d[keys[i]] = values[i]

Using two parallel lists, keys and values, to get dictionary at once: d = dict(zip(keys,values))

Accessing value corresponding to a key in the dictionary: d[key]

Accessing value corresponding to a key in the dictionary safely (without producing error if key is not present in the dictionary): d.get(key)

Concatenate two dictionaries d and e: dict(d.items() + e.items()) #Works in Python 2; if a key is in both, the value from e is kept

Clear dictionary: d.clear()

Delete an entry in the dictionary with specified key: del d[key]

Delete entire dictionary: del d
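The safe lookup and merge operations above can be sketched as follows. The dictionary e and the 'city' key are made up for the example. The merge uses a copy plus update() instead of adding items() lists, so it also runs on Python 3.

```python
d = {'name': 'XYZ', 'age': 25}
e = {'age': 26, 'city': 'ABC'}   # hypothetical second dictionary

print(d.get('age'))              # 25
print(d.get('city'))             # None, no KeyError is raised
print(d.get('city', 'unknown'))  # unknown (explicit default value)

merged = dict(d)     # shallow copy, so d itself is not modified
merged.update(e)     # values from e win for keys present in both
print(merged['age'])   # 26
print(d['age'])        # 25

del merged['city']           # remove a single entry
print('city' in merged)      # False
```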

More possible operations on dictionaries can be found in this link.

Writing another post after a long time. Recently, I had to try out the Supervised Latent Dirichlet Allocation (slda) package by Chong Wang, whose adviser was David M. Blei. I thought of sharing how to set it up and use it. Here we go.

Using slda

It is for multi-class classification (binary classification is a special case of it). It is not for multi-label classification, where a data point can belong to more than one class.
When there are C classes, each label should be one value among {0,1,...,C-1}. Non-numeric labels need to be converted to numeric values. However, slda does not assume any order among the labels, so this conversion does not change the nature of the problem and one need not worry.
Before creating the train and test files, the words need to be indexed; the train and test files must contain only the word indices and the word counts.
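Converting non-numeric labels to values in {0,1,...,C-1} can be sketched as below. This is only one possible approach, not part of slda itself: the function encode_labels and the label names 'vehicle' and 'nature' are hypothetical, and labels are numbered in order of first appearance.

```python
def encode_labels(labels):
    """Map arbitrary labels to 0..C-1 in order of first appearance."""
    label_to_id = {}
    encoded = []
    for label in labels:
        if label not in label_to_id:
            label_to_id[label] = len(label_to_id)  # next unused id
        encoded.append(label_to_id[label])
    return encoded, label_to_id

raw = ['vehicle', 'vehicle', 'nature', 'nature', 'vehicle']  # made-up labels
encoded, mapping = encode_labels(raw)

print(encoded)             # [0, 0, 1, 1, 0]
print(mapping['nature'])   # 1
```

Keep the mapping around: it is needed to translate the numeric labels that slda infers back to the original names.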

Setting up slda

These steps were tried on an Ubuntu machine.

slda depends on another package called gsl (GNU Scientific Library). So, we need to install that first (Thanks to this link): sudo apt-get install gsl-bin libgsl0-dev libgsl0ldbl
Now, it is time to set up slda. Download the package i.e. slda.tgz. Unzip it using tar -zxvf slda.tgz.
Go inside the slda folder created. Try running "make". If it gives an error like it did in my case, add the line "#include <cstddef>" before any other #includes in corpus.h (Thanks to this link). Then, run "make" again. If you don't get any error and find an executable slda file created, you are ready to use it.

Example Run

Download the example dataset i.e. data.tgz from the slda site where you downloaded the package. Unzip it using tar -zxvf data.tgz.

One can find details on the commands run below by looking at readme.txt

Assuming the unzipped data is in the same folder as the slda executable, you can train a model using the following command (the settings.txt file comes with slda.tgz):

./slda est train-data.dat train-label.dat settings.txt 0.01 50 random ./

Top few lines of the command run:

reading data from train-data.dat
number of docs : 800
number of terms : 158
number of total words : 1920800
...
The run takes some time (>30mins).

After the training is done, you can infer labels for the test data using the following command:

./slda inf test-data.dat test-label.dat settings.txt final.model ./

Last few lines of the command run:

results will be saved in ./
document 0
document 100
document 200
document 300
document 400
document 500
document 600
document 700
average accuracy: 0.738

So, the slda model gave an accuracy of 73.8% on the test data. The inferred labels are in inf-labels.dat file.

Example: Data Preparation to Output Examination

Data

Consider 5 documents as below:

car vehicle wheels bus break car wheels
bus driver car road crossing
forest deer lion fruits prey
grass deer summer spring
tour bus forest lion

Create dictionary

Construct a mapping from word to index like below:

0 <- car
1 <- vehicle
2 <- wheels
3 <- bus
4 <- break
5 <- driver
6 <- road
7 <- crossing
8 <- forest
9 <- deer
10 <- lion
11 <- fruits
12 <- prey
13 <- grass
14 <- summer
15 <- spring
16 <- tour

Create slda understandable train data and label files

Refer to this link for details. Training files need to have as many lines as there are documents: One document per line.

sample.data

5 0:2 1:1 2:2 3:1 4:1
5 3:1 5:1 0:1 6:1 7:1
5 8:1 9:1 10:1 11:1 12:1
4 13:1 9:1 14:1 15:1
4 16:1 3:1 8:1 10:1

First column: Number of unique words in the document (M)

Rest of the columns: [word index]:[number of occurrences of that word in the document].

So, the data format is

[M] [term_1]:[count] [term_2]:[count] ...  [term_N]:[count]
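The dictionary creation and the conversion to this data format can be sketched in a few lines of Python. This is a minimal sketch, not something shipped with slda: the helper names build_vocab and to_slda_line are made up, words are split on whitespace, and indices are assigned in order of first appearance, which reproduces the mapping and the sample.data lines shown above.

```python
docs = [
    "car vehicle wheels bus break car wheels",
    "bus driver car road crossing",
    "forest deer lion fruits prey",
    "grass deer summer spring",
    "tour bus forest lion",
]

def build_vocab(docs):
    """Assign each new word the next free index (0, 1, 2, ...)."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def to_slda_line(doc, vocab):
    """Return '[M] [term]:[count] ...' for one document."""
    counts = {}
    order = []                      # word indices in order of first appearance
    for word in doc.split():
        idx = vocab[word]
        if idx not in counts:
            counts[idx] = 0
            order.append(idx)
        counts[idx] += 1
    pairs = ["%d:%d" % (idx, counts[idx]) for idx in order]
    return " ".join([str(len(order))] + pairs)

vocab = build_vocab(docs)
lines = [to_slda_line(doc, vocab) for doc in docs]
for line in lines:
    print(line)
# First line printed: 5 0:2 1:1 2:2 3:1 4:1
```

Writing the lines to a file such as sample.data is then a matter of joining them with newlines.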

sample.label

0
0
1
1
0

Each row is the label of the corresponding row in sample.data file. The label starts with 0.

Learn an slda model

./slda est sample.data sample.label settings.txt 0.01 2 random ./

The learned model is stored as final.model. word-assignments.dat (given below) shows which topic each word in a document was assigned to. The format is as follows: each row corresponds to one document.

First column: Number of unique words in the document (M)

Rest of the columns: [word index]:[index of the topic the word belongs to with the index starting from 0].

005 0000:00 0001:00 0002:00 0003:00 0004:00
005 0003:00 0005:00 0000:00 0006:00 0007:00
005 0008:01 0009:01 0010:01 0011:01 0012:01
004 0013:01 0009:01 0014:01 0015:01
004 0016:00 0003:00 0008:00 0010:00

From this, we can infer that Topic 0 is related to vehicles and Topic 1 is related to nature.
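To read the topics off word-assignments.dat without doing it by eye, one can group the words by the topic index they were assigned. The sketch below is an assumption-laden illustration: it pastes the five rows shown above into a string instead of reading the file, and index_to_word simply restates the dictionary built earlier.

```python
index_to_word = ['car', 'vehicle', 'wheels', 'bus', 'break', 'driver',
                 'road', 'crossing', 'forest', 'deer', 'lion', 'fruits',
                 'prey', 'grass', 'summer', 'spring', 'tour']

rows = """005 0000:00 0001:00 0002:00 0003:00 0004:00
005 0003:00 0005:00 0000:00 0006:00 0007:00
005 0008:01 0009:01 0010:01 0011:01 0012:01
004 0013:01 0009:01 0014:01 0015:01
004 0016:00 0003:00 0008:00 0010:00""".splitlines()

topics = {}   # topic index -> set of words assigned to it
for row in rows:
    fields = row.split()
    for pair in fields[1:]:          # skip the first column (M)
        word_idx, topic_idx = [int(x) for x in pair.split(':')]
        topics.setdefault(topic_idx, set()).add(index_to_word[word_idx])

for topic_idx in sorted(topics):
    print("Topic %d: %s" % (topic_idx, ", ".join(sorted(topics[topic_idx]))))
```

With such a tiny corpus a word can land in both topics (forest and lion do here, because of the last document), so the topic descriptions are only rough.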

Infer the label for test data

Since this is only an example run, I will use the training data itself for inference as well.

./slda inf sample.data sample.label settings.txt final.model ./

The command above reports the accuracy of inference as well. If the word assignment was as above, accuracy will be 100% i.e. 1.000. The inferred labels are stored in inf-labels.dat.
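The reported accuracy can be cross-checked by comparing the inferred labels with the true ones. The sketch below assumes that inf-labels.dat holds one integer label per line, like sample.label; that layout is an assumption, so check the file before relying on it. The inferred list is typed in by hand here as a hypothetical result rather than read from disk.

```python
def accuracy(true_labels, inferred_labels):
    """Fraction of documents whose inferred label matches the true label."""
    if len(true_labels) != len(inferred_labels):
        raise ValueError("label lists differ in length")
    if not true_labels:
        raise ValueError("no labels given")
    matches = sum(1 for t, p in zip(true_labels, inferred_labels) if t == p)
    return float(matches) / len(true_labels)

true_labels = [0, 0, 1, 1, 0]   # contents of sample.label
inferred = [0, 0, 1, 1, 0]      # hypothetical contents of inf-labels.dat

print("%.3f" % accuracy(true_labels, inferred))  # 1.000
```

To use real files, each list could be built with something like [int(line) for line in open(path) if line.strip()].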

Saturday, August 29, 2015

Python Tutorial

Tuesday, July 7, 2015

slda by Blei et.al.: Setup and Example Run
