even more blogdex co-citation

June 3, 2003

As I have mentioned before, I have been playing around with blogdex data and co-citation.

Following Alex’s suggestion, I have begun crawling the top 200 stories of the day (instead of just the top 50) to see whether it improves the results; the daily updates should begin to reflect this starting tomorrow. The first resulting graph does not look too promising, so I might tweak the layout generation script.
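
For readers new to the idea, here is a minimal sketch of co-citation counting. It assumes the crawl yields a mapping from each blog to the stories it links to; the function name `cocitation_counts` and the sample data are made up for illustration, and this is not the actual pyblogdex code.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(links_by_blog):
    """Count how often each pair of stories is linked from the same blog."""
    counts = Counter()
    for stories in links_by_blog.values():
        # Deduplicate and sort so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(stories)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical crawl result: blog -> stories it links to.
links_by_blog = {
    "blog-a": ["story-1", "story-2", "story-3"],
    "blog-b": ["story-1", "story-2"],
    "blog-c": ["story-2", "story-3"],
}
counts = cocitation_counts(links_by_blog)
```

Pairs with a high count would become the edges of the graph; crawling 200 stories instead of 50 simply gives the counter more pairs to work with.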

Also: I have solved a problem with neato not terminating by placing an upper bound on the number of iterations for the layout algorithm (by using -Gmaxiter=10000 on the command line).
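
A rough sketch of how a script might invoke neato with that bound, assuming Graphviz is installed and on the PATH. The helper names, file names, and PNG output format are assumptions for illustration; only the `-Gmaxiter=10000` setting comes from the post.

```python
import subprocess


def neato_command(dot_path, out_path, max_iter=10000):
    """Build a neato command line that caps the layout iterations."""
    return [
        "neato",
        f"-Gmaxiter={max_iter}",  # graph attribute: upper bound on iterations
        "-Tpng",                  # output format (an assumption here)
        dot_path,
        "-o",
        out_path,
    ]


def run_neato(dot_path, out_path, max_iter=10000):
    """Run neato; raises CalledProcessError if it exits with an error."""
    subprocess.run(neato_command(dot_path, out_path, max_iter), check=True)


# Example (hypothetical file names):
# run_neato("cocitation.dot", "cocitation.png")
```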

Update: I have tweaked the graph format so that the nodes are smaller. This improves the structure of the layout because the algorithm no longer has to deal with node overlap.
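
One way to get small nodes in a Graphviz file is to set default node attributes at the top of the graph. The sketch below is a guess at the general shape, not the actual format used here: the function name `to_dot`, the `point` shape, and the 0.1 inch size are all assumptions.

```python
def to_dot(edges, node_size=0.1):
    """Render undirected edges as DOT text with small point-shaped nodes."""
    lines = [
        "graph cocitation {",
        # Defaults applied to every node: a small dot instead of a labeled box.
        f"  node [shape=point, width={node_size}, height={node_size}];",
    ]
    for a, b in edges:
        lines.append(f'  "{a}" -- "{b}";')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot([("story-1", "story-2"), ("story-2", "story-3")])
```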

Filed under: information retrieval.

2 Comments

  • 1. Henry  |  September 11, 2003 at 11:11 am

    Can you change the permissions on the pyblogdex directory? I’m very curious to see what you did.

  • 2. Nathan Jacobs  |  September 11, 2003 at 11:52 am

    Done.

    The HTML files had been generating a lot of traffic from Google (before the access control), so I decided to encode them with gzip to try to reduce the load.

    I stopped the automated updates of this several months ago, but I have made some changes to the code (I basically implemented SimRank: http://citeseer.nj.nec.com/539641.html). An older copy of the source code (along with a few other projects) is available at http://www.khakipants.org/log/projects/builds/.



About me

Hello, I'm Nathan Jacobs and you are looking at my blog. I am a doctoral candidate in Computer Science at Washington University in St. Louis focusing on Computer Vision. My research is in algorithms to improve the ability of computers to reason about the natural world. I also really like to make attractive and informative visualizations of complex data.
