Posts filed under 'machine learning'
sax vs. dom performance
I just switched my “python naive bayes code”:http://www.khakipants.org/archives/2002/12/libsvm_python.html from using xml.dom.minidom to xml.sax and have seen an impressive performance boost.
Parsing 1000 small XML documents now takes only 8.5 seconds (it used to take 32), which is roughly a 3.8x speedup. After some other improvements (mentioned below), parsing had come to account for 70 percent of the processing time. This will come in handy since I am trying to process 810,000 documents.
I have also realized that some of my concerns about performance were a bit confounded by some naive assumptions about the nature of my categories (flat vs. hierarchical). Anyhow, I am now in a position to run tests across the entire document collection (hopefully it will complete in under five days)…
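For anyone curious what the SAX side of this looks like, here is a minimal sketch (not my actual code). It assumes each document keeps its body inside a hypothetical `text` element; the class and function names are made up for illustration. The point is that the handler only holds on to the character data it needs instead of building a whole DOM tree per document.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collect character data found inside one element type.

    The default tag name "text" is an assumption for this sketch.
    """

    def __init__(self, tag="text"):
        super().__init__()
        self.tag = tag
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []    # pieces of character data seen so far

    def startElement(self, name, attrs):
        if name == self.tag:
            self.depth += 1

    def endElement(self, name):
        if name == self.tag:
            self.depth -= 1

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate them.
        if self.depth > 0:
            self.chunks.append(content)


def extract_words(xml_bytes, tag="text"):
    """Parse one small XML document and return its whitespace-split words."""
    handler = TextCollector(tag)
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks).split()
```

Calling `extract_words(b"<doc><text>oil prices rise</text></doc>")` gives back the list of words in the body, which can then be fed straight to a classifier.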
My small object-oriented python classifier library now supports the Naive Bayes (pure python) and “Support Vector Machine”:http://www.support-vector.net/ (courtesy of “lib_svm”:http://www.csie.ntu.edu.tw/~cjlin/libsvm/) algorithms. Once I polish it a bit more I will post the code.
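Until the real code is posted, here is a rough sketch of what a pure-python multinomial Naive Bayes classifier can look like. This is not the library itself: the class name, method names, and the add-one (Laplace) smoothing are assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per category
        self.word_counts = defaultdict(Counter)   # word frequencies per category
        self.vocab = set()                        # every word seen in training

    def train(self, words, category):
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocab.update(words)

    def classify(self, words):
        """Return the category with the highest log-probability, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best
```

Working in log space avoids underflow when documents are long, and the smoothing keeps a single unseen word from zeroing out a category.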
3 comments February 16, 2003
Latent Semantic Indexing patent
This page contains some more information about LSI including a note about a patent:
Users should also be aware of the Telcordia Technologies (Bellcore) Patent: Computer information retrieval using latent semantic structure (U. S. Patent No. 4,839,853, June 13, 1989) before initiating any commercial product development based on LSI.
1 comment January 16, 2003
Latent-Semantic Indexing tutorial
The guy who has been looking for exported MT content co-authored a good overview of Latent Semantic Indexing.
Add comment January 15, 2003
libsvm python
I just compiled the libsvm python interface library on debian/python2.2… Here are the things I had to do:
- compile libsvm – this required that I change the Makefile to point to g++-3.0
- download and install swig and python2.2-dev (apt-get install swig python2.2-dev)
- edit ./python/Makefile to remove a couple of ?=’s and fix the python includedir path so that it points to /usr/include/python2.2
- change the compiler for the python interface code to g++-3.0 (I was getting an error at import time, ImportError: ./svmc.so: undefined symbol: _Znaj, when I tried to run a test…)
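As far as I can tell, `_Znaj` is the mangled name of the C++ `operator new[]`, which is why building the interface with the matching g++ makes the error go away. A quick way to check whether a rebuilt extension actually loads is a small import probe like the sketch below. The helper name is made up, and it is written for current Python rather than 2.2.

```python
import importlib


def extension_loads(module_name):
    """Try to import a module; return (ok, message).

    On failure, message carries the ImportError text, for example an
    "undefined symbol" complaint from a badly linked shared object.
    """
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        return False, str(exc)
    return True, ""
```

Running `extension_loads("svmc")` from the directory that holds the built `svmc.so` would then report either success or the linker's complaint, without a traceback.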
I am currently integrating this SVM code into my personal little classification hack on some new data from Reuters.
Mostly I am trying to get a good feel for performance characteristics since trying to run on the full data set using Naive Bayes and a bad implementation took more than 5 days (eek).
Add comment December 16, 2002
blog categorization
Add comment December 12, 2002
sources of weblog data
- Weblogs.Com: Recently Changed Weblogs
- organica
- blogtree
- blogstreet
- daypop
- blogdex
- syndic8
- myelin: blogging ecosystem
I am going to spend the next few days looking over these and trying to figure out what they all mean…
Note: some of these links were grabbed from organica’s links section.
Add comment December 6, 2002
an interesting source of data
o r g a n i c a maintains in- and out-links for lots of weblogs… I need to think about what this means about god and religion and stuff.
Add comment December 3, 2002
opennlp project
The opennlp project has some stuff that may help with my PhD.
Add comment September 9, 2002
Research Links… categorization
Add comment September 9, 2002




