Posts filed under 'machine learning'

sax vs. dom performance

I just switched my “python naive bayes code”:http://www.khakipants.org/archives/2002/12/libsvm_python.html from using xml.dom.minidom to xml.sax and have seen an impressive performance boost.

Parsing 1000 small XML documents now takes only 8.5 seconds, down from 32. After some other improvements (mentioned below), parsing had come to account for 70 percent of the processing time, so this speedup will come in handy: I am trying to process 810,000 documents.
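For anyone curious what the switch looks like, here is a minimal sketch of the two approaches, not my actual code. The sample document, the function names, and the `WordHandler` class are all made up for illustration. The DOM version builds the whole tree in memory before touching any text, while the SAX version just reacts to events as the parser streams through the document.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical sample document; the real collection is not shown here.
SAMPLE = b"<doc><title>Grain prices</title><body>Wheat rose sharply today.</body></doc>"


def words_with_dom(data):
    """Build the full tree in memory, then walk it to collect words."""
    tree = xml.dom.minidom.parseString(data)
    words = []

    def walk(node):
        if node.nodeType == node.TEXT_NODE:
            words.extend(node.data.split())
        for child in node.childNodes:
            walk(child)

    walk(tree)
    tree.unlink()  # release the tree's node references
    return words


class WordHandler(xml.sax.ContentHandler):
    """Collect words from character events; no tree is ever built."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._chunks = []

    def _flush(self):
        # characters() may deliver one text run in several pieces,
        # so buffer the pieces and split only at element boundaries.
        self.words.extend("".join(self._chunks).split())
        self._chunks = []

    def startElement(self, name, attrs):
        self._flush()

    def endElement(self, name):
        self._flush()

    def characters(self, content):
        self._chunks.append(content)


def words_with_sax(data):
    """Stream through the document, collecting words as events arrive."""
    handler = WordHandler()
    xml.sax.parseString(data, handler)
    return handler.words
```

Both functions return the same word list for the sample, so the SAX version can be dropped in wherever the DOM version was used. Timing each over a batch of documents (for example with `time.perf_counter`) is the easy way to check the difference on your own data.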

I have also realized that some of my concerns about performance were a bit confounded by some naive assumptions about the nature of my categories (flat vs. hierarchical). Anyhow, I am now in a position to run tests across the entire document collection (hopefully it will complete in under five days)…

My small object-oriented python classifier library now supports the Naive Bayes (pure python) and “Support Vector Machine”:http://www.support-vector.net/ (courtesy of “libsvm”:http://www.csie.ntu.edu.tw/~cjlin/libsvm/) algorithms. Once I polish it a bit more I will post the code.
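Until the real code is posted, here is a rough sketch of what a pure python Naive Bayes classifier can look like. This is not the library itself: the `NaiveBayes` class, its method names, and the choice of a multinomial model with add-one smoothing are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per category
        self.word_counts = defaultdict(Counter)   # word frequencies per category
        self.vocabulary = set()                   # every word seen in training

    def train(self, words, category):
        """Record one training document, given as a list of words."""
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocabulary.update(words)

    def classify(self, words):
        """Return the category with the highest log posterior score."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Toy usage with made-up documents and categories.
classifier = NaiveBayes()
classifier.train(["wheat", "corn", "harvest"], "grain")
classifier.train(["bank", "rate", "loan"], "money")
prediction = classifier.classify(["wheat", "harvest"])
```

Working in log space avoids underflow when documents are long, and the add-one smoothing keeps a single unseen word from zeroing out a category.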

3 comments February 16, 2003

Latent Semantic Indexing patent

This page contains some more information about LSI including a note about a patent:

Users should also be aware of the Telcordia Technologies (Bellcore) patent: Computer information retrieval using latent semantic structure (U.S. Patent No. 4,839,853, June 13, 1989) before initiating any commercial product development based on LSI.

1 comment January 16, 2003

Latent-Semantic Indexing tutorial

Add comment January 15, 2003

libsvm python

I just compiled the libsvm python interface library on debian/python2.2… Here are the things I had to do:

  1. compile libsvm – this required that I change the Makefile to point to g++-3.0
  2. download and install swig and python2.2-dev (apt-get install swig python2.2-dev)
  3. edit ./python/Makefile to remove a couple of ?=’s and correct the incorrect python includedir path to /usr/include/python2.2
  4. change the compiler for the python interface code to g++-3.0 as well (I was getting ImportError: ./svmc.so: undefined symbol: _Znaj when I tried to run a test…)

I am currently integrating this SVM code into my personal little classification hack on some new data from Reuters.

Mostly I am trying to get a good feel for performance characteristics since trying to run on the full data set using Naive Bayes and a bad implementation took more than 5 days (eek).

Add comment December 16, 2002

blog categorization

Add comment December 12, 2002

sources of weblog data

I am going to spend the next few days looking over these and trying to figure out what they all mean…

Note: some of these links were grabbed from organica’s links section.

Add comment December 6, 2002

an interesting source of data

o r g a n i c a maintains in- and out-links for lots of weblogs… I need to think about what this means about god and religion and stuff.

Add comment December 3, 2002

opennlp project

The opennlp project has some stuff that may help with my PhD.

Add comment September 9, 2002

Research Links… categorization

Add comment September 9, 2002


About me

Hello, I'm Nathan Jacobs and you are looking at my blog. I am a doctoral candidate in Computer Science at Washington University in St. Louis focusing on Computer Vision. My research is in algorithms to improve the ability of computers to reason about the natural world. I also really like to make attractive and informative visualizations of complex data.

I currently update my flickr site much more frequently than this blog.
