
Archive for the ‘IR’ Category

Nuances of Live Blog Search

Ever wondered if the “Published Last Hour” link on Google Blog Search actually works? Let me tell you: it works, and it doesn’t take an hour for your updated post to come up in the results. After my last post, I did some ego-searching on Google Blog Search. The query “pramodp” brought up this blog as expected, but here is the unexpected part: a search for blogs with “pramodp” published in the last hour brought up my post “Snapshot of the web”, which I had published less than 10 minutes earlier!

To really appreciate the wow of this, consider the enormity of the task. Blogs are the most active part of the web (apart from news sources); you may call them the “ever-changing surface” of the web, where freshness is of huge importance. According to Technorati, the authority on all things “live web”, there are over 175,000 new blogs every day, and about 1.6 million new posts per day. That works out to roughly 18 posts per second. How do you accommodate this torrent of content? How do you return a post that was published 5 minutes ago for a search query?
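As a quick sanity check on those Technorati figures (my own back-of-the-envelope arithmetic, not anything from their report):

```python
# Back-of-the-envelope check of the Technorati figures quoted above.
posts_per_day = 1_600_000
seconds_per_day = 24 * 60 * 60          # 86,400
print(posts_per_day / seconds_per_day)  # ~18.5 posts per second
```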

Here is how I would do it. I might be wrong, and there may be more efficient methods, but here goes. First task: how would Google know that you have written a blog entry? Simple: just as your Google Reader knows when your favourite blogs are updated, the Google spider learns through feeds that a blog has been updated.

Clearly the most difficult task would be rebuilding the index for every post update (or every few minutes). Even with very efficient incremental indexing, how would you rebuild an index spanning millions of blogs every minute?! The answer is: you don’t. You maintain a separate index for the posts published in the last hour. An indexer takes content from a store server populated by the crawlers, parses the searchable terms in each document, and puts the documents into the inverted index, which maintains the ranked list of documents for every term (example: pramodp :: http://students.iiit.ac.in/~pramodp -> pramodp.wordpress.com -> …).

What is missing is the rank of the post we are trying to add. Note that this recent post is not competing for a rank against every post ever written; its competition space is limited to the posts written in the last hour. Normally, ranking a page requires information about the distribution of terms across the whole search subspace (in this case, the blogs published in the last hour) along with the distribution of terms within the page being indexed. We might not be able to afford to calculate this for every blog update, so we might initially add the post to the inverted index with the PageRank of its blog. In other words, take the post at the face value of the blog’s importance and worry about the actual ranking later. You could then set up a process that rebuilds the Last One Hour Index every 10 minutes, by which time there will be about 10,000 new posts (at our estimate of 18 posts per second), and ranks the posts based on the statistical distribution of their terms. A cleaning process periodically retires posts older than an hour from the index.
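Putting the idea together in code, here is a minimal sketch of such a last-hour index in Python. Everything here is hypothetical: the names (LastHourIndex, add_post, rerank, expire), the TF-IDF-style reranking weight, and the toy whitespace tokenizer. It illustrates the scheme described above, not Google’s actual design.

```python
import math
import time
from collections import defaultdict

class LastHourIndex:
    """Hypothetical separate index for posts published in the last hour."""

    MAX_AGE = 3600  # seconds; posts older than this are retired

    def __init__(self):
        self.posts = {}                   # url -> (timestamp, terms, blog pagerank)
        self.postings = defaultdict(set)  # term -> set of urls (the inverted index)
        self.scores = {}                  # url -> current rank score

    def add_post(self, url, text, blog_pagerank, ts=None):
        """Index a post the moment the feed reports it. Computing term
        statistics over the whole last-hour corpus on every insert is too
        expensive, so the post is taken at the face value of its blog:
        its provisional score is simply the blog's PageRank."""
        ts = time.time() if ts is None else ts
        terms = text.lower().split()  # toy tokenizer
        self.posts[url] = (ts, terms, blog_pagerank)
        for term in set(terms):
            self.postings[term].add(url)
        self.scores[url] = blog_pagerank  # provisional rank

    def rerank(self):
        """Run every ~10 minutes: rescore posts using the term distribution
        across the last-hour corpus (a TF-IDF-style weight here, standing in
        for whatever ranking function is really used)."""
        n = max(len(self.posts), 1)
        for url, (_, terms, blog_pagerank) in self.posts.items():
            weight = sum(math.log(1 + n / len(self.postings[t]))
                         for t in set(terms))
            self.scores[url] = blog_pagerank * weight

    def expire(self, now=None):
        """The cleaning process: retire posts older than an hour."""
        now = time.time() if now is None else now
        stale = [u for u, (ts, _, _) in self.posts.items()
                 if now - ts > self.MAX_AGE]
        for url in stale:
            _, terms, _ = self.posts.pop(url)
            for term in set(terms):
                self.postings[term].discard(url)
            del self.scores[url]

    def search(self, term):
        """Return urls of matching posts, best score first."""
        hits = self.postings.get(term.lower(), set())
        return sorted(hits, key=lambda u: self.scores[u], reverse=True)

# The ego-search scenario: a fresh post is searchable seconds after ingest.
index = LastHourIndex()
index.add_post("pramodp.wordpress.com/snapshot-of-the-web",
               "snapshot of the web pramodp", blog_pagerank=0.7)
print(index.search("pramodp"))
```

The point of the split is that add_post stays cheap enough to run on every feed ping, while the expensive rerank and expire passes run off the query path on their own schedule.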

It might turn out that Google does have the raw computing power to calculate the PageRank of the posts on the fly; we know they have the most finely tuned hardware there is. What is certain is that the indexing process would run more frequently than the regular web indexer, with far greater stress on the freshness of the content. Let me publish this post and see if it comes up in the “Published Anytime” blog search.

Update: Damn it! Google upset my little theory and surfaced this post in the all-time blog search within 2 minutes of my writing it! Now is the time to put to use the “learn, unlearn and relearn” lesson you heard so often.


Snapshot of the web!

Yesterday I badly needed a PDF file that showed up in the Google Search results but was a broken link. The text-only version was not good enough (the research paper was figure-heavy), and googling for alternate locations of the PDF did not help. Finally I figured out what I wanted: if there were a site that took snapshots of the web at regular intervals, I could access a previous edition of the PDF. The Internet Archive’s WayBack Machine is exactly that, a time machine for the web. It lets you “Browse through 85 billion web pages archived from 1996 to a few months ago”.
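For what it’s worth, the Internet Archive also exposes a small availability API for exactly this rescue job. A minimal sketch, assuming a made-up placeholder URL for the dead link (the API endpoint is real; the paper URL is not the one from my story):

```python
import json
import urllib.parse
import urllib.request

# Ask the WayBack Machine for the closest archived snapshot of a dead link.
# dead_url is a made-up placeholder, not the actual paper from the post.
dead_url = "http://example.edu/papers/figure-heavy-paper.pdf"
api = ("https://archive.org/wayback/available?url="
       + urllib.parse.quote(dead_url, safe=""))

with urllib.request.urlopen(api) as resp:
    data = json.load(resp)

closest = data.get("archived_snapshots", {}).get("closest")
if closest and closest.get("available"):
    print("Archived copy:", closest["url"])  # a web.archive.org/web/<timestamp>/... link
else:
    print("No snapshot archived for this URL.")
```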

Apart from retrieving broken links, you can use the WayBack Machine for some fun stuff. How about checking what the Google homepage looked like in 1998? Surprisingly, the first archived version of www.google.com is very simplistic (too simplistic for one’s liking, because it has just two links). Another funny page is this one, a collection of the initial Google stickers.

3 posts in 5 days! Not bad! Not bad at all!!
