Writing another post after a long time. Recently, I had to try out the Supervised Latent Dirichlet Allocation (slda) package by Chong Wang, whose adviser was David M. Blei. I thought I would share how to set it up and use it. Here we go.
Using slda
- It is for multi-class classification (binary classification is a special case of it). It is not for multi-label classification, where a data point can belong to more than one class.
- When there are C classes, each label must be one value from {0,1,...,C-1}. Non-numeric labels need to be converted to numeric values. slda does not assume any order among the labels, so this conversion does not change the nature of the problem and one need not worry about it.
- Before creating the train and test files, the words need to be indexed; the train and test files must contain only the word indices and the word counts.
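The label conversion mentioned above can be sketched in a few lines of Python. This is only an illustration and not part of slda; the function name encode_labels and the string labels "vehicles" and "nature" are made up for the example.

```python
def encode_labels(raw_labels):
    """Map arbitrary labels to integers 0..C-1 in order of first appearance."""
    label_to_id = {}
    encoded = []
    for label in raw_labels:
        if label not in label_to_id:
            # The next unused integer becomes the id of a newly seen label.
            label_to_id[label] = len(label_to_id)
        encoded.append(label_to_id[label])
    return encoded, label_to_id


# Hypothetical non-numeric labels for five documents.
raw = ["vehicles", "vehicles", "nature", "nature", "vehicles"]
encoded, label_to_id = encode_labels(raw)
print(encoded)
print(label_to_id)
```

Since slda does not assume an order among labels, which class gets which integer should not matter, as long as the same mapping is used for the train and test files.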
Setting up slda
These steps were tried on an Ubuntu machine.
- slda depends on another package called gsl (GNU Scientific Library), so we need to install that first (thanks to this link):
sudo apt-get install gsl-bin libgsl0-dev libgsl0ldbl
- Now it is time to set up slda. Download the package, i.e. slda.tgz, and unpack it using tar -zxvf slda.tgz.
- Go inside the slda folder that was created and try running "make". If it gives an error, as it did in my case, add the line "#include <cstddef>" before any other #include in corpus.h (thanks to this link), then run "make" again. If you don't get any error and find an executable slda file created, you are ready to use it.
Example Run
Download the example dataset, i.e. data.tgz, from the same slda site where you downloaded the package. Unpack it using tar -zxvf data.tgz.
Details of the commands can be found in readme.txt.
Assuming the unpacked data is in the same folder as the slda executable, you can train a model using the following command (the settings.txt file comes with slda.tgz):
./slda est train-data.dat train-label.dat settings.txt 0.01 50 random ./
The first few lines of the command output:
reading data from train-data.dat
number of docs : 800
number of terms : 158
number of total words : 1920800
...
The run takes some time (more than 30 minutes).
After the training is done, you can infer labels for the test data using the following command:
./slda inf test-data.dat test-label.dat settings.txt final.model ./
The last few lines of the command output:
results will be saved in ./
document 0
document 100
document 200
document 300
document 400
document 500
document 600
document 700
average accuracy: 0.738
So the slda model gave an accuracy of 73.8% on the test data. The inferred labels are in the inf-labels.dat file.
Example: Data Preparation to Output Examination
Data
Consider the 5 documents below:
car vehicle wheels bus break car wheels
bus driver car road crossing
forest deer lion fruits prey
grass deer summer spring
tour bus forest lion
The last four documents are then repeated, so the data set has 9 documents in total:
bus driver car road crossing
forest deer lion fruits prey
grass deer summer spring
tour bus forest lion
Create dictionary
Construct a mapping from each word to an index, as below:
0 <- car
1 <- vehicle
2 <- wheels
3 <- bus
4 <- break
5 <- driver
6 <- road
7 <- crossing
8 <- forest
9 <- deer
10 <- lion
11 <- fruits
12 <- prey
13 <- grass
14 <- summer
15 <- spring
16 <- tour
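A mapping like the one above can be built by numbering words in the order they first appear. The following is a minimal Python sketch of that idea; the helper name build_vocab is my own and is not something slda provides.

```python
docs = [
    "car vehicle wheels bus break car wheels",
    "bus driver car road crossing",
    "forest deer lion fruits prey",
    "grass deer summer spring",
    "tour bus forest lion",
]


def build_vocab(documents):
    """Assign each distinct word the next free index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


vocab = build_vocab(docs)
for word, index in vocab.items():
    print(index, "<-", word)
```

Any other consistent numbering would work equally well, provided the same dictionary is used for both the train and test files.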
Create train data and label files that slda understands
Refer to this link for details. The training files need to have as many lines as there are documents: one document per line.
sample.data
5 0:2 1:1 2:2 3:1 4:1
5 3:1 5:1 0:1 6:1 7:1
5 8:1 9:1 10:1 11:1 12:1
4 13:1 9:1 14:1 15:1
4 16:1 3:1 8:1 10:1
5 3:1 5:1 0:1 6:1 7:1
5 8:1 9:1 10:1 11:1 12:1
4 13:1 9:1 14:1 15:1
4 16:1 3:1 8:1 10:1
First column: Number of unique words in the document (M)
Rest of the columns: [word index]:[number of occurrences of that word in the document].
So, the data format is
[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]
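To make the format concrete, here is a small Python sketch that turns a raw document into one line of sample.data. The function name to_slda_line is hypothetical, and the vocabulary is simply the word list from the dictionary above, numbered from 0.

```python
from collections import Counter

# Vocabulary from the dictionary step: position in the list is the word index.
words = ("car vehicle wheels bus break driver road crossing forest deer "
         "lion fruits prey grass summer spring tour").split()
vocab = {word: index for index, word in enumerate(words)}


def to_slda_line(doc, vocab):
    """Format one document as '[M] [term]:[count] ...'."""
    # Counter keeps keys in first-seen order (Python 3.7+).
    counts = Counter(vocab[word] for word in doc.split())
    pairs = " ".join(f"{index}:{count}" for index, count in counts.items())
    # M is the number of unique words, not the total number of words.
    return f"{len(counts)} {pairs}"


print(to_slda_line("car vehicle wheels bus break car wheels", vocab))
print(to_slda_line("bus driver car road crossing", vocab))
```

Writing one such line per document, in the same order as the labels, should give a file in the layout shown in sample.data.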
sample.label
0
0
1
1
0
0
1
1
0
Each row is the label of the corresponding row in the sample.data file. Labels start from 0.
Learn an slda model
./slda est sample.data sample.label settings.txt 0.01 2 random ./
The learned model is stored as final.model. The word-assignments.dat file (given below) shows which topic each word in a document was assigned to. The format is as follows: each row corresponds to a document.
First column: Number of unique words in the document (M)
Rest of the columns: [word index]:[index of the topic the word is assigned to, with topic indices starting from 0].
005 0000:00 0001:00 0002:00 0003:00 0004:00
005 0003:00 0005:00 0000:00 0006:00 0007:00
005 0008:01 0009:01 0010:01 0011:01 0012:01
004 0013:01 0009:01 0014:01 0015:01
004 0016:00 0003:00 0008:00 0010:00
005 0003:00 0005:00 0000:00 0006:00 0007:00
005 0008:01 0009:01 0010:01 0011:01 0012:01
004 0013:01 0009:01 0014:01 0015:01
004 0016:00 0003:00 0008:00 0010:00
From this, we can infer that Topic 0 is related to vehicles and Topic 1 is related to nature.
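Reading topics off the indices by eye gets tedious for larger vocabularies. Below is a rough Python sketch that groups words by assigned topic, using the first five rows shown above as input. The function name topics_from_assignments is my own; it only assumes the row layout described in this post.

```python
from collections import defaultdict

words = ("car vehicle wheels bus break driver road crossing forest deer "
         "lion fruits prey grass summer spring tour").split()
index_to_word = dict(enumerate(words))

# First five rows of word-assignments.dat as shown above.
rows = [
    "005 0000:00 0001:00 0002:00 0003:00 0004:00",
    "005 0003:00 0005:00 0000:00 0006:00 0007:00",
    "005 0008:01 0009:01 0010:01 0011:01 0012:01",
    "004 0013:01 0009:01 0014:01 0015:01",
    "004 0016:00 0003:00 0008:00 0010:00",
]


def topics_from_assignments(lines, index_to_word):
    """Collect, for each topic, the set of words assigned to it in any document."""
    topic_words = defaultdict(set)
    for line in lines:
        # Skip the first field (M); the rest are word_index:topic pairs.
        for pair in line.split()[1:]:
            word_index, topic = pair.split(":")
            topic_words[int(topic)].add(index_to_word[int(word_index)])
    return topic_words


topic_words = topics_from_assignments(rows, index_to_word)
for topic in sorted(topic_words):
    print(topic, sorted(topic_words[topic]))
```

Note that a word can show up under both topics, since the assignment is made per document: here "forest" and "lion" are assigned to Topic 1 in the third document but to Topic 0 in the fifth.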
Infer the label for test data
Since this is only an example run, we will use the training data itself for inference as well.
./slda inf sample.data sample.label settings.txt final.model ./
The command above also reports the accuracy of inference. If the word assignments are as above, the accuracy will be 100%, i.e. 1.000. The inferred labels are stored in inf-labels.dat.
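If you want to double-check the reported accuracy yourself, the inferred labels can be compared with the true labels. The Python sketch below assumes that inf-labels.dat, like sample.label, holds one integer label per line; I have not verified this against the slda source, so treat read_labels as an assumption. The helper names are my own.

```python
def read_labels(path):
    """Read one integer label per line (assumed file layout), skipping blank lines."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]


def accuracy(true_labels, predicted_labels):
    """Fraction of documents whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists differ in length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# In-memory example using the labels from sample.label.
true_labels = [0, 0, 1, 1, 0, 0, 1, 1, 0]
predicted = [0, 0, 1, 1, 0, 0, 1, 1, 0]
print(f"{accuracy(true_labels, predicted):.3f}")
```

With real files, something like accuracy(read_labels("sample.label"), read_labels("inf-labels.dat")) would be the corresponding call.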