Posts filed under 'information retrieval'

.NET Information Retrieval

A quick search for .NET-related information retrieval and text classification sites resulted in the following list:

It looks like there is not much out there… can anyone point me to a few more sites?

update: added lucene.net

5 comments October 25, 2003

content management systems etc.

For the last week I have been evaluating bids for a content management system that will be deployed across campus. The plan is for the central IT shop to act as a content management ASP (similar to atomz).

My first impression is that most of these systems are designed for deployment to a single location. Or possibly that is just the most common deployment scenario.

Gosh that was really boring… in other news… I twisted my ankle this weekend at Summer Soaker 3. It was my first ultimate frisbee tournament in about two years and I played well and had fun. Because of the sprain I have been mooching rides off of everyone since I can’t ride my bike. Oooohhh and guess what else… last night I played in the sandbox with my son. I buried his toes in slightly wet sand and then he made them peek out. We also made smoothies (yum) and ate popcorn.

I am also reading Human Behavior and the Principle of Least Effort (the origin of Zipf’s law) by G. Zipf. I have seen it cited in many places and figured it was time to read this classic. I really love it when the fields of sociology, economics, biology and computer science come together (sorry requires IE). If you want to read it for yourself I suggest your local library because $250 is the best price I found on the web.

1 comment July 17, 2003

even more blogdex co-citation

As I have mentioned before, I have been playing around with blogdex data and co-citation.

I have now begun to crawl the top 200 stories of the day (instead of just the top 50) based on Alex’s suggestion to see if it improves results (the daily updates should begin to reflect this starting tomorrow). The first instance of the resulting graph does not look too promising so I might tweak the layout generation script.

Also: I have solved a problem with neato not terminating by placing an upper bound on the number of iterations for the layout algorithm (by using -Gmaxiter=10000 on the command line).

update: I have tweaked the graph format so that the nodes are smaller. This improves structure because the layout algorithm does not have to deal with node overlap.
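
For anyone scripting the same fix, here is a minimal sketch of how the iteration cap could be passed to neato from Python. The function names, file names and the extra timeout are my own assumptions for illustration; only the -Gmaxiter=10000 flag comes from the post above.

```python
import subprocess


def neato_command(dot_path, out_path, max_iter=10000, fmt="png"):
    """Build a neato command line that caps the layout iterations.

    -Gmaxiter sets the graph attribute "maxiter", so the layout
    algorithm stops after at most max_iter iterations.
    """
    return [
        "neato",
        f"-Gmaxiter={max_iter}",
        f"-T{fmt}",
        "-o", out_path,
        dot_path,
    ]


def render(dot_path, out_path, timeout_s=300):
    """Run neato (hypothetical wrapper; requires GraphViz to be installed).

    The timeout is an additional safety net, not something from the post.
    """
    subprocess.run(neato_command(dot_path, out_path),
                   check=True, timeout=timeout_s)
```

Calling render("blogdex.dot", "blogdex.png") would then produce the image, assuming those (made up) file names.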

2 comments June 3, 2003

nutch overpowers google

“Nutch”:http://www.nutch.org, “as I speculated”:http://www.khakipants.org/archives/2003/05/nutch_an_open_source_web_search_engine.html, is slowly climbing up the charts (it has surpassed “google”:http://www.google.com and is now the 7th most frequent UserAgent accessing my site):

Rank Hits UserAgent
7 127 NutchOrg/0.03-dev (Nutch; http://www.nutch.org/docs/bot.html;
8 115 Googlebot/2.1 (+http://www.googlebot.com/bot.html)

Amount of traffic coming from nutch = 0%
Amount of traffic coming from google = 8%

Maybe it is time to “block nutch too”:http://diveintomark.org/archives/2003/02/26/how_to_block_spambots_ban_spybots_and_tell_unwanted_robots_to_go_to_hell.html.
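
The ranking and the traffic percentages above came out of my stats package, but the same numbers can be approximated by hand. This is a rough sketch that assumes the access log is in Apache's combined log format (referrer and UserAgent as the last two quoted fields); the function names are hypothetical and the real log format may differ.

```python
import re
from collections import Counter

# Matches the last two quoted fields of a combined-format log line:
# "referer" "user-agent"
TAIL_RE = re.compile(r'"(?P<referer>[^"]*)"\s+"(?P<agent>[^"]*)"\s*$')


def rank_user_agents(lines):
    """Return (user_agent, hits) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        match = TAIL_RE.search(line)
        if match:
            counts[match.group("agent")] += 1
    return counts.most_common()


def referrer_share(lines, domain):
    """Percentage of parsed requests whose referrer contains `domain`."""
    total = 0
    hits = 0
    for line in lines:
        match = TAIL_RE.search(line)
        if not match:
            continue
        total += 1
        if domain in match.group("referer"):
            hits += 1
    return 100.0 * hits / total if total else 0.0
```

A crawler shows up in the first function (lots of hits under its UserAgent) while real visitors sent by a search engine show up in the second (their referrer contains the search engine's domain), which is why a bot can rank high and still account for 0% of traffic.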

Add comment May 9, 2003

more blogdex co-citation

As I mentioned a few days ago I am “playing with blogdex data”:http://www.khakipants.org/archives/2003/05/blogdex_cocitation.html… now I have “automated the process”:http://www.khakipants.org/log/projects/pyblogdex/.

Add comment May 7, 2003

nutch – an open source web search engine

“nutch”:http://www.nutch.org/docs/index.html (an effort to create an open-source Internet search engine) is a must read for lovers of information retrieval.

You should check out:

* “nice javadoc”:http://www.nutch.org/docs/api/index.html and
* the “cvs access”:http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/nutch/nutch/.

The project objectives seem to indicate that they “fear google”:http://www.khakipants.org/archives/2003/02/google_may_be_evil.html just like me:

bq. Nutch is a nascent effort to implement an open-source web search engine. Web search is a basic requirement for internet navigation, yet the number of web search engines is decreasing. Today’s oligopoly could soon be a monopoly, with a single company controlling nearly all web search for its commercial gain. That would not be good for the users of internet. Nutch aims to enable anyone to easily and cost-effectively deploy a world-class web search engine.

I plan on keeping tabs on this project… and I doubt I will be able to forget about it because it is going to begin to dominate my access logs.

Anyone aware of a realistic Internet simulator that could be used for such an effort (similar to “honeyd”:http://niels.xtdnet.nl/honeyd/challenge.html)?

Add comment May 7, 2003

blogdex co-citation

Over the last few days I have decided to play around with co-citation analysis… using data from blogdex.

Using 104 lines of Python code, a Makefile and GraphViz I have been able to generate the following cool picture.

I have not looked around to see if anyone else has done this yet… don’t reinvent the wheel unless you want to learn about the wheel… in this case I want to learn about the wheel.

update: I just realized that the picture needs more explanation.

  • the circles represent stories
  • each story is labeled with the title that blogdex provides
  • each story is selected by blogdex by virtue of its hipness (not sure of the actual formula – although you can get a good idea by looking at the blogdex news).
  • the stories are positioned (by GraphViz) based on how frequently they are cited together.
  • the lines represent stories that are strongly correlated (I have hidden the lines of weakly correlated stories).
  • if you look closely you will notice that stories about Apple iTunes are grouped on the right, random humorous stories are on the bottom and war/news items are on the top.
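
To make the bullets above a bit more concrete, here is a small sketch of the general idea, not the actual 104-line script. It assumes the blogdex data has already been reduced to a mapping from each blog to the set of stories it links to; that input shape, the function names and the threshold are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count, for every pair of stories, how many blogs cite both.

    citations: dict mapping a blog name to the set of story ids it links to.
    """
    counts = Counter()
    for stories in citations.values():
        for a, b in combinations(sorted(stories), 2):
            counts[(a, b)] += 1
    return counts


def to_dot(counts, titles, strong=2, max_len=3.0):
    """Emit an undirected GraphViz graph suitable for neato.

    Stories cited together more often get a shorter preferred edge
    length ("len"), so neato pulls them closer. Edges below the
    `strong` threshold still influence the layout but are drawn
    invisible (style=invis).
    """
    lines = ["graph cocitation {", "  node [shape=circle];"]
    for story, title in sorted(titles.items()):
        label = title.replace('"', '\\"')
        lines.append(f'  "{story}" [label="{label}"];')
    for (a, b), n in sorted(counts.items()):
        style = "" if n >= strong else ", style=invis"
        lines.append(f'  "{a}" -- "{b}" [len={max_len / n:.2f}{style}];')
    lines.append("}")
    return "\n".join(lines)
```

Writing the result of to_dot() to a file and running neato over it gives a picture of the same flavor: clusters form wherever a group of stories keeps being linked by the same blogs.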

Add comment May 1, 2003

google may be evil

Lately I have been trying to make up for past sins.

Currently I am trying to read more fiction. During high school and college I was able to slip by without reading much (if any) fiction (unless you count watching The Scarlet Letter on VHS).

As part of this effort, last night I finished reading George Orwell’s _1984_…

The folks at google watch are trying to warn us that “our”:http://www.ttabor.com/ramblings/archives/000011.html “beloved”:http://googlefan.com/ “google”:http://www.google.com may be “evil”:http://www.google-watch.org/bigbro.html.

I think this is a little silly considering the ease of using “the alternatives”:http://www.teoma.com but I did “delete my google cookie”:http://www.mozilla.org/projects/security/pki/psm/help_21/using_priv_help.html and prevented it from being set again (for the near future anyhow). Thanks “mozilla”:http://www.mozilla.org.

Add comment February 14, 2003

Latent Semantic Indexing patent

This page contains some more information about LSI including a note about a patent:

Users should also be aware of the Telcordia Technologies (Bellcore) Patent: Computer information retrieval using latent semantic structure (U. S. Patent No. 4,839,853, June 13, 1989) before initiating any commercial product development based on LSI.

1 comment January 16, 2003

zeal directory

Lately I have become interested in how to maintain a large scale web directory… here are the user guidelines for zeal.

It looks like they are using a karma-like mechanism similar to slashdot.

update: I think they might be evil.

Add comment January 15, 2003


About me

Hello, I'm Nathan Jacobs and you are looking at my blog. I am a doctoral candidate in Computer Science at Washington University in St. Louis focusing on Computer Vision. My research is in algorithms to improve the ability of computers to reason about the natural world. I also really like to make attractive and informative visualizations of complex data.

I currently update my flickr site much more frequently than this blog.
