Posts filed under 'information retrieval'
.NET Information Retrieval
A quick search for .NET related information retrieval and text classification sites resulted in the following list:
- nlucene : a seemingly dead (maybe never alive) lucene rewrite
- lucene.net
- visual basic data mining
It looks like there is not much out there… can anyone point me to a few more sites?
update: added lucene.net
5 comments October 25, 2003
content management systems etc.
For the last week I have been evaluating bids for a content management system that will be deployed across campus. The plan is for the central IT shop to act as a content management ASP (similar to atomz).
My first impression is that most of these systems are designed for deployment to a single location. Or possibly that is just the most common deployment scenario.
Gosh that was really boring… in other news… I twisted my ankle this weekend at Summer Soaker 3. It was my first ultimate frisbee tournament in about two years and I played well and had fun. Because of the sprain I have been mooching rides off of everyone since I can’t ride my bike. Oooohhh and guess what else… last night I played in the sandbox with my son. I buried his toes in slightly wet sand and then he made them peek out. We also made smoothies (yum) and ate popcorn.
I am also reading Human Behavior and the Principle of Least Effort (the origin of zipf’s law) by G. Zipf. I have seen it cited in many places and figured it was time to read this classic. I really love when the fields of sociology, economics, biology and computer science come together (sorry requires IE). If you want to read it for yourself I suggest your local library because $250 is the best price I found on the web.
1 comment July 17, 2003
even more blogdex co-citation
As I have mentioned before, I have been playing around with blogdex data and co-citation.
I have now begun to crawl the top 200 stories of the day (instead of just the top 50) based on Alex’s suggestion to see if it improves results (the daily updates should begin to reflect this starting tomorrow). The first instance of the resulting graph does not look too promising so I might tweak the layout generation script.
Also: I have solved a problem with neato not terminating by placing an upper bound on the number of iterations for the layout algorithm (by using -Gmaxiter=10000 on the command line).
update: I have tweaked the graph format so that the nodes are smaller. This improves structure because the layout algorithm does not have to deal with node overlap.
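For anyone who wants to try the iteration cap, here is a minimal sketch of calling neato from Python with -Gmaxiter. The file names and the function names are made up for illustration; this is not the actual layout generation script, and it assumes Graphviz is installed with neato on the PATH.

```python
import subprocess


def neato_command(dot_path, png_path, max_iter=10000):
    """Build a neato command line that caps the number of layout iterations."""
    return [
        "neato",
        f"-Gmaxiter={max_iter}",  # upper bound so the layout always terminates
        "-Tpng",                  # render to PNG
        "-o", png_path,
        dot_path,
    ]


def render(dot_path, png_path, max_iter=10000):
    """Run neato; raises CalledProcessError if neato exits with an error."""
    subprocess.run(neato_command(dot_path, png_path, max_iter), check=True)


if __name__ == "__main__":
    # Hypothetical file names, only printed here so nothing is executed.
    print(" ".join(neato_command("blogdex.dot", "blogdex.png")))
```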
2 comments June 3, 2003
nutch overpowers google
“Nutch”:http://www.nutch.org, “as I speculated”:http://www.khakipants.org/archives/2003/05/nutch_an_open_source_web_search_engine.html, is slowly climbing up the charts (it has surpassed “google”:http://www.google.com and is now the 7th most frequent UserAgent accessing my site):
| Rank | Hits | UserAgent |
| 7 | 127 | NutchOrg/0.03-dev (Nutch; http://www.nutch.org/docs/bot.html; |
| 8 | 115 | Googlebot/2.1 (+http://www.googlebot.com/bot.html) |
Amount of traffic coming from nutch = 0%
Amount of traffic coming from google = 8%
Maybe it is time to “block nutch too”:http://diveintomark.org/archives/2003/02/26/how_to_block_spambots_ban_spybots_and_tell_unwanted_robots_to_go_to_hell.html.
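A ranking like the one above can be produced with a few lines of Python. This is only a sketch: it assumes the access log is in the Apache combined format (where the UserAgent is the last quoted field), and the function name is hypothetical.

```python
from collections import Counter


def rank_user_agents(log_lines):
    """Return (rank, hits, user_agent) tuples, most frequent agent first."""
    counts = Counter()
    for line in log_lines:
        # Assumption: combined log format, ending with "referer" "user-agent".
        parts = line.rstrip().rsplit('"', 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return [
        (rank, hits, agent)
        for rank, (agent, hits) in enumerate(counts.most_common(), start=1)
    ]


if __name__ == "__main__":
    # Made-up sample lines, not real log data.
    sample = [
        '1.2.3.4 - - [09/May/2003:10:00:00 -0700] "GET / HTTP/1.0" 200 512 "-" "NutchOrg/0.03-dev"',
        '1.2.3.5 - - [09/May/2003:10:00:01 -0700] "GET / HTTP/1.0" 200 512 "-" "Googlebot/2.1"',
        '1.2.3.4 - - [09/May/2003:10:00:02 -0700] "GET /a HTTP/1.0" 200 512 "-" "NutchOrg/0.03-dev"',
    ]
    for rank, hits, agent in rank_user_agents(sample):
        print(f"| {rank} | {hits} | {agent} |")
```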
Add comment May 9, 2003
more blogdex co-citation
As I mentioned a few days ago I am “playing with blogdex data”:http://www.khakipants.org/archives/2003/05/blogdex_cocitation.html… now I have “automated the process”:http://www.khakipants.org/log/projects/pyblogdex/.
Add comment May 7, 2003
nutch – an open source web search engine
“nutch”:http://www.nutch.org/docs/index.html (an effort to create an open-source Internet search engine) is a must read for lovers of information retrieval.
You should check out the:
* “nice javadoc”:http://www.nutch.org/docs/api/index.html and
* the “cvs access”:http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/nutch/nutch/.
The project objectives seem to indicate that they “fear google”:http://www.khakipants.org/archives/2003/02/google_may_be_evil.html just like me:
bq. Nutch is a nascent effort to implement an open-source web search engine. Web search is a basic requirement for internet navigation, yet the number of web search engines is decreasing. Today’s oligopoly could soon be a monopoly, with a single company controlling nearly all web search for its commercial gain. That would not be good for the users of internet. Nutch aims to enable anyone to easily and cost-effectively deploy a world-class web search engine.
I plan on keeping tabs on this project… and I doubt I will be able to forget about it because it is going to begin to dominate my access logs.
Anyone aware of a realistic Internet simulator that could be used for such an effort (similar to “honeyd”:http://niels.xtdnet.nl/honeyd/challenge.html)?
Add comment May 7, 2003
blogdex co-citation
Over the last few days I have decided to play around with co-citation analysis… using data from blogdex.
Using 104 lines of Python code, a Makefile and GraphViz I have been able to generate the following cool picture.
I have not looked around to see if anyone else has done this yet… don’t reinvent the wheel unless you want to learn about the wheel… in this case I want to learn about the wheel.
update: I just realized that the picture needs more explanation.
- the circles represent stories
- each story is labeled with the title that blogdex provides
- each story is selected by blogdex by virtue of its hipness (not sure of the actual formula – although you can get a good idea by looking at the blogdex news).
- the stories are positioned (by GraphViz) based on how frequently they are cited together.
- the lines represent stories that are strongly correlated (I have hidden the lines of weakly correlated stories).
- if you look closely you will notice that stories about apple iTunes are grouped on the right, random humorous stories are on the bottom and war/news items are on the top.
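The basic idea can be sketched roughly as follows. This is not the actual 104-line script: the input structure (a dict mapping each blog to the set of stories it links to), the function names and the threshold for a "strong" link are all assumptions made for illustration. Stories that are cited together more often get a shorter preferred edge length, and weak edges are kept for the layout but hidden in the picture.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count how many blogs cite each pair of stories together.

    citations: dict mapping a blog name to the set of story titles it links to
    (a hypothetical structure, not the blogdex data format).
    """
    pairs = Counter()
    for stories in citations.values():
        for a, b in combinations(sorted(stories), 2):
            pairs[(a, b)] += 1
    return pairs


def to_dot(pairs, strong=2):
    """Emit an undirected GraphViz graph suitable for neato."""
    lines = ["graph cocitation {", "  node [shape=circle];"]
    for (a, b), n in sorted(pairs.items()):
        # Weakly correlated pairs still pull on the layout but are not drawn.
        style = "solid" if n >= strong else "invis"
        # Shorter preferred length for pairs that are cited together often.
        lines.append(f'  "{a}" -- "{b}" [len={1.0 / n:.2f}, style={style}];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example data.
    demo = {
        "blogA": {"iTunes launch", "iTunes review"},
        "blogB": {"iTunes launch", "iTunes review", "war news"},
    }
    print(to_dot(cocitation_counts(demo)))
```

Story titles containing double quotes would need escaping before being written into the graph file; the sketch skips that.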
Add comment May 1, 2003
google may be evil
Lately I have been trying to make up for past sins.
Currently I am trying to read more fiction. During high school and college I was able to slip by without reading much (if any) fiction (unless you count watching the scarlet letter on VHS).
As part of this effort, last night I finished reading George Orwell’s _1984_…
The folks at google watch are trying to warn us that “our”:http://www.ttabor.com/ramblings/archives/000011.html “beloved”:http://googlefan.com/ “google”:http://www.google.com may be “evil”:http://www.google-watch.org/bigbro.html.
I think this is a little silly considering the ease of using “the alternatives”:http://www.teoma.com but I did “delete my google cookie”:http://www.mozilla.org/projects/security/pki/psm/help_21/using_priv_help.html and prevented it from being set again (for the near future anyhow). Thanks “mozilla”:http://www.mozilla.org.
Add comment February 14, 2003
Latent Semantic Indexing patent
This page contains some more information about LSI including a note about a patent:
Users should also be aware of the Telcordia Technologies (Bellcore) Patent : Computer information retrieval using latent semantic structure (U. S. Patent No. 4,839,853, June 13, 1989) before initiating any commercial product development based on LSI.
1 comment January 16, 2003
zeal directory
Lately I have become interested in how to maintain a large scale web directory… here are the user guidelines for zeal.
It looks like they are using a karma-like mechanism similar to slashdot.
update: I think they might be evil.
Add comment January 15, 2003




