Welcome to Professional ASP.NET - Chris Love's Official Blog

Chris Love's Official ASP.NET Blog

Chris Love's Helpful tips, tricks and pragmatic development knowledge for the ASP.NET world.


Colorado Woman sues Archive.org - Hopefully technology will prevail

I thought about adding this one to my links of the week, but this is a little more important than just a brief mention. If you are not familiar with Archive.org, it is a web site that retains and makes public snapshots of websites at various points in time. Go ahead and check out the original version of Extreme Web Works (please do not laugh, though). It is great to see how far I have come, but I do not know if I really want to remember this time in my career.

Sites like Archive.org, Google.com and Live.com all use spiders to crawl sites and retrieve content that users can then search. For traditional search engines, this is the loss leader that attracts you to their site to view and, hopefully, click the contextual ads they display. In return, those lucky enough to be on the first page of results get extra traffic. For Archive.org it is also the loss leader that gets you to the site, though for what type of revenue I am not very sure. These sites actually store the content to be indexed. In Archive.org's case, they keep a lasting record of the content and republish it. I believe traditional search engines purge the data after a while because they always want the freshest data.

Recently a Colorado woman decided to sue Archive.org for spidering her site. Archive.org got all but one of her charges dismissed; the one that remains is for breach of contract. Evidently the woman posted on her site that no one could spider her content, and the judge has granted a hearing on that charge.

Archive.org should have no problem winning the case, at least you would hope not. Judges do have a serious history of not understanding technology and ruling against logic, protocols and standards. In this case it should be pretty cut and dried. All Internet spiders, at least for good-guy sites like Archive.org, obey a protocol built around the robots.txt file. This protocol has been around as long as I can remember and basically allows a site owner to tell the spiders what they can and cannot read. If this woman did not want her site spidered, she could have simply defined her robots.txt file to say so, and all the well-behaved spiders would have left her alone. She did not and still has not, and if you read her site you will see she obviously does not get it. The protocol even allows you to define rules per spider.
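To make the robots.txt idea concrete, here is a minimal sketch of per-spider rules and of how a well-behaved spider might check them before fetching a page. It uses Python's standard `urllib.robotparser` module. The rules, the `www.example.com` host, and the `can_fetch` helper are made up for illustration, and treating `ia_archiver` as the user-agent name for Archive.org's spider is an assumption here, not something taken from the case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named spider entirely and keep
# every other spider out of a single directory.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow user_agent to read url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative spider names and paths only.
    for agent in ("ia_archiver", "Googlebot"):
        for path in ("/index.html", "/private/notes.html"):
            url = "http://www.example.com" + path
            print(agent, path, can_fetch(ROBOTS_TXT, agent, url))
```

With rules like these, the named spider is refused everywhere while every other spider is only kept out of `/private/`. Keep in mind the file is a request, not an enforcement mechanism: it only works for spiders that choose to honor it.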

The reality is that most site owners do not know or care about this public standard and generally omit this little file. It is not the only standard rule of the web that your typical site owner omits; there are literally thousands of little things every site owner needs to be aware of to operate successfully on the web. I generally do not create robots.txt files for my sites unless I need to block something from a spider, mostly because I want all the content of all my sites spidered.

If the judge is worth their seat they will rule against this woman and let the Internet continue to thrive.

Posted: Sunday, March 18, 2007 8:50 AM

by Chris Love