robots.txt Adventure
Introduction.txt
Last October I got bored and set my spider loose on the robots.txt files of the world. Having had a good deal of positive feedback on my HTTP Headers survey, I had decided to poke around in robots.txt files and see what sorts of interesting things I could find.
Since then, I’ve taken 6 weeks of vacation and gotten to be very busy at work, so I’m just now getting around to analyzing all the data I gathered. These are some of the results of that analysis.
Robots?
To those of you completely unaware of what this post is about, here’s a brief primer. Google is a search engine. You probably use it. If not, odds are you use one of MSN Search (now called “Live Search”), Ask Jeeves (now Ask.com), or Yahoo! Search. How do those search engines grab web pages to search? Well, they use robots, also called spiders. Now, these aren’t the giant metal machines you see chasing tweaked out English factory workers through the streets of London, nor are they the giant eight-legged creatures you find lurking behind clocks. Rather, they’re pieces of software that surf around the web grabbing web pages. Since they’re software, they can surf the web much faster than humans, as well as find things most humans might overlook. As such, there arose a need for a standard for advising robots on what they should and shouldn’t look at.
The Un-Standard
The Robots Exclusion Protocol arose in June 1994 by consensus among a number of web spider developers. The original protocol description from 1994 describes the basic syntax of a robots.txt file to be placed at the root of a web site. So, for example, Google would place their robots.txt file at:
http://www.google.com/robots.txt
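Since the file always lives at the root of the host, a spider can derive its location from any page URL on the site. Here is a minimal sketch of that step; the helper name `robots_url` is made up for illustration and is not taken from the spider used in this survey.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.google.com/search?q=spiders"))
```

Note that each host (and each scheme) gets its own file, so a rule set fetched for one hostname says nothing about another.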
The basic format goes something like this. First, the file specifies a User-agent (the name of the robot) that is to follow the subsequent rules (until the next User-agent line):
User-agent: SuperHappyRobot
This line tells “SuperHappyRobot” that it needs to pay attention to the next few lines. Any other robot will ignore these rules. The next line might look something like:
Disallow: /tmp/
Which would mean SuperHappyRobot shouldn’t download any pages that start with the path “/tmp/” from this server. Variations on these lines are that * will match any robot name (in other words, “User-agent: *” should tell all the robots to pay attention), and blank Disallow statements mean anything goes. So, Apple’s robots.txt file of:
# robots.txt for http://www.apple.com/
User-agent: *
Disallow:
means, essentially, that any robot is free to grab any page it can get its hands on, at least for the “www.apple.com” website.
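To see these basic rules in action, here is a small sketch using Python's standard-library `urllib.robotparser`. It is only an illustration of the semantics described above, not the parser used for this survey; the `example.com` URLs and the `parse_rules` helper are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


happy = parse_rules("User-agent: SuperHappyRobot\nDisallow: /tmp/\n")
apple = parse_rules(
    "# robots.txt for http://www.apple.com/\nUser-agent: *\nDisallow:\n"
)

# The named robot is kept out of /tmp/ but nothing else.
print(happy.can_fetch("SuperHappyRobot", "http://example.com/tmp/cache.html"))
print(happy.can_fetch("SuperHappyRobot", "http://example.com/index.html"))
# A robot that is not named in the file is not bound by the record.
print(happy.can_fetch("OtherRobot", "http://example.com/tmp/cache.html"))
# A blank Disallow under the wildcard agent means anything goes.
print(apple.can_fetch("AnyRobot", "http://www.apple.com/anything"))
```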
So, that was all well and good, but around 1996 there was a push to try to get robots.txt standardized, and an IETF draft (http://www.robotstxt.org/wc/norobots-rfc.html) was produced that clarified and added to the robots.txt syntax. The primary addition was a new “Allow” rule, which allowed a little more fine-grained control over which pages could be retrieved. For example, with the following set of rules:
User-agent: *
Disallow: /apache/
Allow: /apache/02/03/11/2228242.shtml
All documents except “/apache/02/03/11/2228242.shtml” in the “/apache/” path would be excluded from spidering. There was also a provision for “extensions” to the protocol, such that a rule line like “Crawl-delay: 10” could be added. Spiders that didn’t support that extension would ignore it, while spiders that did might delay 10 seconds between page fetches.
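One wrinkle worth sketching is rule precedence. As I read the 1996 draft, rules are checked in file order and the first match is used, while some large search engines instead prefer the most specific (longest) matching path. With the Disallow line listed first, as above, the two strategies can disagree about the excepted page. The function below is a hypothetical illustration of both strategies, not any particular robot's implementation.

```python
def is_allowed(rules, path, strategy="first"):
    """Decide access for path.

    rules is a list of (allow, prefix) pairs in file order.
    strategy is "first" (first matching rule wins) or
    "longest" (the longest matching prefix wins).
    """
    hits = [(allow, prefix) for allow, prefix in rules if path.startswith(prefix)]
    if not hits:
        return True  # no rule matches, so the path is unrestricted
    if strategy == "first":
        return hits[0][0]
    return max(hits, key=lambda hit: len(hit[1]))[0]


RULES = [
    (False, "/apache/"),                       # Disallow: /apache/
    (True, "/apache/02/03/11/2228242.shtml"),  # Allow: /apache/02/03/11/2228242.shtml
]
PAGE = "/apache/02/03/11/2228242.shtml"

print(is_allowed(RULES, PAGE, "longest"))
print(is_allowed(RULES, PAGE, "first"))
```

Under these assumptions, listing the Allow line before the Disallow line gives the same answer with either strategy, which is the safer ordering for site owners.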
Around the same time the IETF draft was being discussed, Sean “Captain Napalm” Conner proposed his own extension to the Robots Exclusion Protocol, which included Allow rules as well as regular expression syntax for rules, and new Robot-version, Visit-time, Request-rate, and Comment rules. Fewer than 100 of the sites I visited use rules unique to this spec.
Since none of these three documents have ever been ratified or adopted by a standards body, there has been a bit of persistent confusion over what constitutes a valid robots.txt document. The most definitive document is certainly the original 1994 document. Most commercial robots today, however, attempt to conform to the IETF draft document. And, given the large number of Allow rules around, it would be remiss of a robot not to try.
A Touch of Controversy
This de-facto standard has had its share of controversy over the years. Many webmasters object to having to opt-out of spiders crawling their site. Given that I found 47,738 sites that disallow spidering the root of their site with the wildcard (*) user-agent match, it appears that that viewpoint still has many adherents, and many just want to be left alone by the bulk of spiders. See the comments in this thread for some examples of this opinion from some relatively tech-savvy webmasters. Among them is the well-known IncrediBILL:
Lack of a robots.txt file should mean just that, they don’t know about robots so robots should STAY THE HELL OUT!
I’ll come back to this later.
Others have objected to the idea of putting up a roadmap to secret pages on their sites. Bertrand Meyer, the designer of Eiffel (the programming language, not the Tower) and a Very Smart Person even holds this viewpoint. To quote:
If you are just a bit absent-minded, isn’t it natural
to use this mechanism to exclude stuff from being indexed and hence believe
no one will find it? “Stupid”, maybe — but not unlikely.
Indeed, scanning through the robots.txt files I pulled down, I find disallow rules for 3,000+ “phpMyAdmin” paths, 40,000+ “stats” paths, 31,000+ “log” paths, 400+ “secret” paths, 100,000+ “admin” paths, and a host of other interesting looking entries. Even if the vast majority of these are properly secured with authentication, the chances of a few people being absent-minded, as Bertrand might say, are pretty good.
On the flip side of these opinions, there are those who have always viewed, and want to continue to view, robots.txt as a merely advisory standard. As courts and legislative bodies have begun to apply the force of law to this loose consensus protocol, some have spoken out in favor of information transparency and the essential openness of the Internet, including Martijn Koster, the creator of the protocol:
“I don’t think that’s in the spirit of free information exchange,” Koster says. Some robots may have legitimate reasons to ignore robot exclusion directives. For example, he says, a company might use robots to hunt for copyright infringing content.
Methodology
Having written a spider for my HTTP headers survey and run it against all of the domains in the Open Directory, I already had a large collection of web sites, and a decent spider. I further added to my list of domains by extracting links from the pages I’d downloaded for that project. Then, I ran my spider (written in Python, using PycURL) against this expanded list of domains, attempting to retrieve the robots.txt file at each. The HTTP headers and full body of the response were stored in a MySQL database. This database was then dumped via a custom “Big File” implementation, which amounted to a bit more than 12GB on disk. Then, I wrote an analyzer which could run through this logical file, processing the records, recording interesting statistics about the entries and reporting the results. This analyzer takes about half an hour to run on the dataset. In total, I received responses from about 4.6 million unique domains.
Status Codes
HTTP status codes (aka response codes) tell web browsers and robots both what kind of response they’re getting when they download a page. For example, “200” means everything is okay and “404” means the web server couldn’t find the file the browser requested. The IETF robots.txt spec says that a 404 response for robots.txt means the site is unrestricted for robots, and a 2XX response means the robot must respect the returned robots.txt content. Other status codes have recommended behaviors, but they’re not required.
Status codes are interesting primarily because they give a quick count of how many sites have a robots.txt file. I got responses from 4.6 million sites, so by tallying the response codes of different types, I can tell who has a robots.txt file and who doesn’t:
Status Code | Count |
---|---|
404 | 3,008,767 |
200 | 1,217,303 |
302 | 276,106 |
301 | 72,674 |
403 | 15,675 |
400 | 5,570 |
401 | 3,856 |
500 | 2,841 |
410 | 1,450 |
303 | 1,319 |
503 | 890 |
304 | 529 |
501 | 280 |
502 | 227 |
307 | 218 |
204 | 215 |
300 | 100 |
504 | 60 |
406 | 58 |
419 | 45 |
550 | 36 |
202 | 34 |
999 | 17 |
100 | 12 |
418 | 10 |
201 | 7 |
405 | 6 |
423 | 6 |
666 | 3 |
402 | 3 |
415 | 3 |
407 | 2 |
510 | 2 |
490 | 1 |
505 | 1 |
509 | 1 |
900 | 1 |
409 | 1 |
408 | 1 |
Total: | 4,608,330 |
Broken down by class, we get:
Class | Count | % of Total |
---|---|---|
5xx | 4,338 | 0.09 |
4xx | 3,035,454 | 65.86 |
3xx | 350,946 | 7.61 |
2xx | 1,217,559 | 26.42 |
1xx | 12 | 0.00 |
invalid | 21 | 0.00 |
As we can see above, about 66% of sites return a 4XX status code, which generally indicates they don’t have a robots.txt file. Another 7.6% redirect to a different URL, usually either the home page or an error page. This means, essentially, that about 26% of sites are attempting to serve up a valid robots.txt file. Of course, some sites may improperly return an error page with a 2xx status code, so this is only useful as a quick estimate.
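The class breakdown can be reproduced from the per-code counts with a few lines of Python. This is just a sketch of the arithmetic, using the numbers from the first table; treating anything outside 100 to 599 as "invalid" is my reading of how the classes were assigned.

```python
from collections import Counter

# Per-status counts copied from the table above.
STATUS_COUNTS = {
    404: 3_008_767, 200: 1_217_303, 302: 276_106, 301: 72_674, 403: 15_675,
    400: 5_570, 401: 3_856, 500: 2_841, 410: 1_450, 303: 1_319, 503: 890,
    304: 529, 501: 280, 502: 227, 307: 218, 204: 215, 300: 100, 504: 60,
    406: 58, 419: 45, 550: 36, 202: 34, 999: 17, 100: 12, 418: 10, 201: 7,
    405: 6, 423: 6, 666: 3, 402: 3, 415: 3, 407: 2, 510: 2, 490: 1, 505: 1,
    509: 1, 900: 1, 409: 1, 408: 1,
}


def status_class(code):
    """Map a status code to its class, e.g. 404 -> '4xx'."""
    return f"{code // 100}xx" if 100 <= code <= 599 else "invalid"


def tally_by_class(counts):
    by_class = Counter()
    for code, count in counts.items():
        by_class[status_class(code)] += count
    return by_class


by_class = tally_by_class(STATUS_COUNTS)
total = sum(STATUS_COUNTS.values())
for name, count in by_class.most_common():
    print(f"{name:8} {count:>10,} {100 * count / total:6.2f}%")
```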
MIME Types
MIME types (aka content types) are returned in the headers of HTTP responses by web servers to tell clients what the document’s type is. They consist of a type (text, image, etc), a subtype (like html or jpeg) and some other optional parameters (like the character encoding). So, for example, an HTML file usually has a MIME type like “text/html” and a text file a type like “text/plain”. An image file might have a MIME type like “image/gif” or “image/jpeg”. The IANA keeps an official list of registered MIME types at http://www.iana.org/assignments/media-types/.
The only MIME type that should be returned for a valid robots.txt file is text/plain. True, the specs don’t specifically mention MIME types, but sites like Google follow the general HTTP rule of “if it’s not text/*, it’s not really plain text”. Of the robots.txt files I got back, 109,780 had MIME types other than text/plain. So, it should be no surprise that the big 3 search engines (Yahoo!, Google, and MSN) will all attempt to parse any text/* robots.txt file they get back from the server. For example, Digg.com serves up their robots.txt file as “text/html; charset=UTF-8”. Google, MSN, and Yahoo! all obey the rules in the file.
Besides text/html and text/plain, some of the more common MIME types I got back were application/octet-stream, application/x-httpd-php, text/x-perl (mostly error pages), video/x-ms-asf, application/x-httpd-cgi, image/gif, and image/jpeg.
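A lenient spider might split the Content-Type header into its parts and accept anything in the text/* family, roughly like the sketch below. The function names are invented for illustration, and the parsing is deliberately simple (it does not handle every quoting rule in the HTTP spec).

```python
def parse_content_type(header):
    """Split a Content-Type value into (type, subtype, charset)."""
    media_type, _, params = header.partition(";")
    main, _, sub = media_type.strip().lower().partition("/")
    charset = None
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            charset = value.strip().strip("\"'").lower()
    return main, sub, charset


def is_text_family(header):
    """True for well-formed text/* types such as text/plain or text/html."""
    main, sub, _ = parse_content_type(header)
    return main == "text" and sub != ""


print(parse_content_type("text/html; charset=UTF-8"))
print(is_text_family("plain/text"))
```

Whether to reject a non-text type outright or to sniff the body anyway is a policy choice; given how many servers mislabel the file, sniffing is probably the more forgiving option.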
Even among files ostensibly marked as text, there were a wide variety of questionable MIME types:
Count | Content Type |
---|---|
2 | application/txt |
5 | application/x-txt |
2 | file/txt |
1 | internal-gopher-text |
30 | plain/text |
12 | text |
13 | text/R*ch |
2 | text/aleph_save |
2 | text/ascii |
6 | text/asp |
36 | text/css |
2 | text/dhtml |
73 | text/enriched |
1 | text/htm |
1 | text/illegal |
1 | text/javascript |
2 | text/octet-stream |
1 | text/plane |
4 | text/rtf |
1 | text/ssi html |
3 | text/svg |
3 | text/text |
9 | text/txt |
20 | text/vnd.wap.wml |
5 | text/x-component |
87 | text/x-invalid |
1 | text/x-log |
386 | text/x-perl |
2 | text/x-python |
40 | text/x-server-parsed-html |
23 | text/xml |
11 | txt |
No, Really, Robots Dot TEXT
An error similar to using the wrong content type is uploading a robots.txt file in a format other than plain text. Popular mistakes here include Word documents (examples: 1, 2, 3), RTF documents (examples: 1, 2, 3), and HTML. I even found LaTeX and KOffice documents.
One piece of server software (called Cougar, which looks, as near as I can tell, to be either Microsoft Small Business Server or IIS), even spits out ASF streaming video files when asked for a robots.txt file (examples: 1, 2). Fun.
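Most of these formats are easy to spot from the first few bytes of the response, whatever the server claims in its headers. Here is a rough sketch of that kind of sniffing; the signature list is a small hand-picked sample rather than a complete detector, and the labels are my own.

```python
# (magic bytes, label) pairs for a few formats mentioned above.
SIGNATURES = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # container used by legacy Word .doc
    (b"{\\rtf", "rtf"),
    (b"\x30\x26\xb2\x75\x8e\x66\xcf\x11", "asf"),   # start of the ASF header GUID
    (b"GIF8", "gif"),
    (b"\xff\xd8\xff", "jpeg"),
]
HTML_HINTS = (b"<!doctype", b"<html", b"<head", b"<body")


def sniff(body):
    """Return a format label for obviously non-plain-text bodies, else None."""
    for magic, label in SIGNATURES:
        if body.startswith(magic):
            return label
    head = body[:512].lstrip().lower()
    if head.startswith(HTML_HINTS):
        return "html"
    return None


print(sniff(b"{\\rtf1\\ansi User-agent: *"))
print(sniff(b"  <HTML><body>Not found</body></HTML>"))
print(sniff(b"User-agent: *\nDisallow:\n"))
```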
Invalid Encodings
Character encodings specify which letters and other characters correspond to which specific bits. Sites specify what character set a response is in within the Content-type header. Some sites serve up robots.txt files in little-used encodings, such as UTF-16. UTF-16 is tricky for a number of reasons, not the least of which are the different endian encodings. Of the 463 UTF-16 files I found, approximately 10% were not valid UTF-16, even though they included a UTF-16 BOM.
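A forgiving decoder might trust a byte order mark over the server's claim, and fall back to something that cannot fail when the bytes turn out to be invalid. The sketch below shows one such approach; the fallback to Latin-1 and the returned labels are assumptions made for the example, not what any particular spider does.

```python
import codecs


def decode_robots(body):
    """Decode robots.txt bytes leniently; returns (text, encoding_label)."""
    if body.startswith(codecs.BOM_UTF8):
        return body.decode("utf-8-sig", errors="replace"), "utf-8"
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            # The "utf-16" codec reads the BOM to pick the byte order.
            return body.decode("utf-16"), "utf-16"
        except UnicodeDecodeError:
            # BOM present but the payload is not valid UTF-16.
            return body.decode("latin-1"), "latin-1 (invalid utf-16)"
    try:
        return body.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return body.decode("latin-1"), "latin-1"


print(decode_robots("User-agent: *".encode("utf-16")))
print(decode_robots(codecs.BOM_UTF16_LE + b"U")[1])
```

One known gap: a UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM, so this sketch would misread such a file.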
Otherwise, I saw close to 300 unique character sets claimed by servers, even discarding obviously incorrect ones and making them all lower case. These included some ones I hadn’t seen before, like “nf_z_62-010”, “ibm-939”, and “fi_fi.iso-8859-15@euro”.
Comments
robots.txt files have one and only one proper way to comment, which is to put comments after a hash mark (#). However, I found HTML comments (<!-- -->), C++ style comments (//), and a variety of others, including simple inline comments.
Totally Confused
Some people seem rather befuddled as to what constitutes a robots.txt file. For example, the most common confusion I’ve found is people using the raw text dump of the Web Robots Database as their robots.txt file. I’m not just talking about a couple of sites, either. Approximately 1 in every 1,000 websites I looked at does this. It’s really quite bizarre. This seems to be part of a more general mistake wherein people copy instructions on how to set up a robots.txt file into the contents of robots.txt files. For example, here are a few: www.cooljobscanada.com, www.numis.co.uk, www.volubilis2000.com, www.kickapoo-orchard.com, www.aplussupply.com.
Then there are just the random things you find. Religious texts and descriptions of churches. A catalog for MIDI tracks.
ASCII art, both pornography and otherwise.
A list of videogames. Several .htaccess files. Access logs. Lists of keywords and website descriptions, including an actual keyword stuffing example. Bash scripts, PHP pages, and everything in between.
I even found image files being served for robots.txt. Not to mention e-mail messages and newsgroup postings.
There’s even a description of a swimming pool. In German.
And, of course, plenty of human-readable instructions to robots which can’t read them: http://www.corsicamania.com/robots.txt.
info.txt
Apparently there’s another protocol, similar to robots.txt, for advertising the contact information for a site. A file called info.txt is supposed to be placed in the root of the site, which sites like Alexa will look for when trying to find out who owns the domain. I found a lot of these records in the robots.txt files.
Someday I’ll have to see how many of these there are in the wild.
Wildcards
There are no wildcards (also known as pattern matching) in the official robots.txt specs, but various search engines have added extensions to support this.
For example, Google, MSN Search, and Yahoo! allow an asterisk (*) to match any sequence of characters, and a dollar sign ($) to match the end of the URL. So, to block spiders from downloading any JPEG image files, one might use:
User-agent: *
Disallow: /*.jpg$
Indeed, blocking spidering of certain file types is the most popular use for wildcards. Most people who are using wildcards for anything else are doing so entirely unnecessarily. For example, a lot of sites have the following rule:
Disallow: /RealEstateTips/*
The use of the non-standard wildcard above is useless, as this rule is equivalent to:
Disallow: /RealEstateTips/
This is because rules are by default partial paths, and will match any path beginning with that string. It’s also worth noting that of all the sites which have the above rule with the wildcard, none of them have the rule without the wildcard. So, a spider which didn’t support pattern matching would treat the asterisk as a literal character, and would be free to download URLs that start with “/RealEstateTips/”, so long as they didn’t have an asterisk after the second slash.
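To make the difference concrete, here is a small sketch of the wildcard extension as described above (`*` matches any sequence of characters, a trailing `$` anchors the end of the URL), next to the plain prefix matching a spider without the extension would do. The helper names and sample paths are invented for the example.

```python
import re


def wildcard_match(pattern, path):
    """Match path against a rule using the * and trailing-$ extensions."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each * match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    return re.match(regex, path) is not None


def prefix_match(pattern, path):
    """Matching as done by a spider with no wildcard support."""
    return path.startswith(pattern)


print(wildcard_match("/*.jpg$", "/images/cat.jpg"))
print(wildcard_match("/*.jpg$", "/images/cat.jpg.html"))
# With wildcard support, the trailing * adds nothing:
print(wildcard_match("/RealEstateTips/*", "/RealEstateTips/buying.html"))
print(wildcard_match("/RealEstateTips/", "/RealEstateTips/buying.html"))
# Without it, the rule only matches a literal asterisk:
print(prefix_match("/RealEstateTips/*", "/RealEstateTips/buying.html"))
```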
Common Syntax Errors
So, besides the above, what are some of the common errors? The spec says that records are separated by blank lines, and the most common errors center around that. The most common is putting a blank line between a User-agent line and the rules that should apply to it, with 74,043 files doing this. Next up is the placement of a Disallow or Allow rule with no User-agent or Disallow/Allow rule immediately before it, with 64,921 files making this mistake. The next is placing a User-agent line immediately after a Disallow/Allow line, with no blank line in between; 32,656 files did this. Finally, lines which were neither comments, nor blank, nor rules showed up in 22,269 files.
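A checker for these four mistakes can be sketched as a single pass that remembers what kind of line came before. This is a hypothetical linter written for illustration, not the analyzer behind the counts above; in particular, the category names, the choice to skip comment-only lines, and lumping every other "Field: value" line in with the rules are all my own assumptions.

```python
import re
from collections import Counter

FIELD = re.compile(r"^([A-Za-z][A-Za-z-]*)\s*:")


def classify(line):
    """Label a line as comment, blank, agent, rule, or other."""
    text = line.split("#", 1)[0].strip()
    if not text:
        return "comment" if "#" in line else "blank"
    match = FIELD.match(text)
    if not match:
        return "other"
    return "agent" if match.group(1).lower() == "user-agent" else "rule"


def lint(robots_txt):
    """Count record-structure problems in robots.txt text."""
    problems = Counter()
    prev = prev_nonblank = "start"
    for line in robots_txt.splitlines():
        kind = classify(line)
        if kind == "comment":
            continue  # comment-only lines do not affect record structure
        if kind == "rule":
            if prev == "blank" and prev_nonblank == "agent":
                problems["blank_after_agent"] += 1
            elif prev not in ("agent", "rule"):
                problems["orphan_rule"] += 1
        elif kind == "agent" and prev == "rule":
            problems["agent_after_rule"] += 1
        elif kind == "other":
            problems["unknown_line"] += 1
        prev = kind
        if kind != "blank":
            prev_nonblank = kind
    return problems


SAMPLE = (
    "User-agent: *\n\nDisallow: /tmp/\n"
    "User-agent: Foo\nDisallow: /x/\nhello world\n"
)
print(lint(SAMPLE))
```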
Crawl-delay
The IETF robots.txt draft spec includes a provision for extensions to the robots.txt format. Basically, along with “Allow” and “Disallow” lines, spiders can optionally support extensions for enhanced control over the robot’s behavior. The most widely-deployed of these is the Crawl-delay extension.
MSN Search, Yahoo!, and Ask all support Crawl-delay, which is used to insert a delay between successive accesses of a web server. A typical Crawl-delay might look something like this:
User-agent: *
Crawl-delay: 5
Which spiders that support Crawl-delay would interpret as meaning they should wait 5 seconds between requests to the site. I found tens of thousands of these entries.
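Python's standard-library `urllib.robotparser` happens to understand this extension, so reading the delay is short. The sketch below only extracts the value; the helper name and the default of zero seconds are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser


def crawl_delay_for(robots_txt, agent, default=0):
    """Return the Crawl-delay (in seconds) that applies to agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)
    return default if delay is None else delay


delay = crawl_delay_for("User-agent: *\nCrawl-delay: 5\n", "SuperHappyRobot")
print(delay)
```

A polite spider would then sleep for that many seconds (for instance with `time.sleep(delay)`) between successive requests to the same host.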
Typos!
I found a LOT of typos in these files. You wouldn’t think it would be very hard to spell the limited vocabulary of “User-agent” and “Disallow” correctly, but you’d be wrong. For example, I found 69 typos of Disallow. 69! That’s not even counting the ones I found with weird characters in the middle of the word.
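A lenient parser can recover many of these misspellings with fuzzy matching. Here is a sketch using `difflib` from the standard library; the list of known fields and the 0.8 similarity cutoff are arbitrary choices for the example, and a real spider would want to tune both.

```python
import difflib

KNOWN_FIELDS = ["user-agent", "disallow", "allow", "crawl-delay"]


def normalize_field(name, cutoff=0.8):
    """Map a possibly misspelled field name to a known one, or None."""
    name = name.strip().lower()
    if name in KNOWN_FIELDS:
        return name
    close = difflib.get_close_matches(name, KNOWN_FIELDS, n=1, cutoff=cutoff)
    return close[0] if close else None


for raw in ["Disallow", "disallwo", "dissallow", "sitemap"]:
    print(raw, "->", normalize_field(raw))
```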
Fingerprinting Using robots.txt
Sometimes, we can use robots.txt file contents for fingerprinting the sites that serve them up. For example, we can fingerprint the sites designed by Moriah.com by looking for robots.txt files with the contents:
this file placed here so you don't fill up my error log looking for it :-)
Similarly, we can find the more than 7,000 real estate sites designed by Advanced Access by looking for the rule:
Disallow: /RealEstateTips/*
More usefully, we can identify one Korean domain squatter by looking for robots.txt files that contain only a meta tag like:
meta http-equiv=refresh content='0;url=http://www.hiplayer.com'
(brackets excluded because of a bug in WordPress).
At the time I spidered, we could identify another domain squatter by looking for a robots.txt file like:
User-agent: *
Disallow: /pixel/
Disallow: /library/
Disallow: /results_monitor.asp
They’ve since switched to a more generic, but still easily-identifiable robots.txt file.
Using similar methods, it’s easy to find a lot more domain squatters, mass-hosted websites, etc. A search engine could potentially maintain a list of such signatures and, based solely on the robots.txt file, not bother indexing the page. Or, more generally, it could increase or decrease the relevance and ranking of the site in its search results.
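One simple way to find such signatures in bulk is to hash a normalized copy of each file and look for hashes shared by many domains. The sketch below illustrates the idea; the normalization rules (trim, lower-case, drop blank lines) and the `.example` domains are made up for the demonstration.

```python
import hashlib
from collections import defaultdict


def fingerprint(body):
    """Hash robots.txt text after light normalization."""
    lines = [line.strip().lower() for line in body.splitlines()]
    normalized = "\n".join(line for line in lines if line)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def shared_fingerprints(files):
    """Group domains whose robots.txt files hash to the same value."""
    groups = defaultdict(list)
    for domain, body in files.items():
        groups[fingerprint(body)].append(domain)
    return {fp: sorted(ds) for fp, ds in groups.items() if len(ds) > 1}


FILES = {
    "a.example": "User-agent: *\nDisallow: /RealEstateTips/*\n",
    "b.example": "user-agent: *\n\nDisallow: /RealEstateTips/*",
    "c.example": "User-agent: *\nDisallow:\n",
}
print(list(shared_fingerprints(FILES).values()))
```

Exact hashing only catches identical templates; files that vary by a domain name or a path would need a looser comparison.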
Conclusions
Okay, so what conclusions can we draw from this mess of data? The primary conclusion, I think, is that the Robots Exclusion Protocol is more complicated than it seems. As a spider, in order to properly parse the variety of robots.txt files you’ll find in the wild, you’ll need to write an extremely lenient parser (following the Robustness Principle), mostly ignore content types, handle a variety of character encodings (and in many cases ignore those returned by the server), detect HTML and other content returned in the guise of robots.txt files, and potentially implement multiple extensions to the accepted standard.
How about the position, discussed above, that spiders shouldn’t spider or download content without the explicit permission of the webmaster? Belgium has certainly come down on the side of requiring explicit permission. However, the evidence shows that Google is in the right on this one:
“Given the vast size of the Internet, it is impossible for a search engine to contact personally each owner of a web page to determine whether the owner desires its web page to be searched, indexed or cached… If such advanced permission was required, the internet would promptly grind to a halt,” Google’s senior counsel and head of public policy Andrew McLaughlin told the Senate Legal and Constitutional Affairs Committee.
As seen in the status codes section, if this were to happen, nearly three quarters of domains on the web would go “dark” for search engines. If these sites went dark for search engines, they would essentially be offline for the majority of web users. Such an action would be in nobody’s best interest: not the site owners’, and certainly not that of the web-using public at large.
On a less serious note, it’s always interesting to see just how vast the Internet really is. Few things drive that home for me as much as seeing how varied the content people generate on the web can be.
So, until next time, I leave you with a quote from one of the robots.txt files I came across:
are you searching something??? 🙂
Yes. Yes I am. And so far, every time I look, I find it.
More resources:
- Alexa robots.txt Search example
- Google Webmaster Tools – Includes a robots.txt validator.
- Sun’s amusing robots.txt file
- Google Blog on the Robots Exclusion Protocol: First Post, Second.
March 13th, 2007 at 12:37 AM
Haha, the things you find on the internets. I was particularly amused by the religious texts, the ASCII art, and the various misspellings of the word “disallow”.
Keep these entries coming; they’re really fun to read. 🙂 Hmm, let’s see… what can I change on my website that you’ll catch on your next internet survey…
March 19th, 2007 at 11:02 AM
This is the eight stop in YesBut’s tour of blog land. I arrived here by entering the key words “quick estimate” in Google Blog search.
I think most bloggers love to have spiders crawling all over their blog. But now I must move on using the keyword chosen at random from your blog “English factory”. If you want to know where I have come from and where the key word takes me check my blog
http://grumpyandfarting.blogspot.com on Tuesday 20th March
August 2nd, 2007 at 1:16 PM
Cougar is Windows Media Server.
August 14th, 2007 at 1:35 AM
Awesome. Just found out your HTTP headers survey and this article, and I completely love them. I completely dig this kind of rather useless and fun statistics. More or these !
Unsollicited suggestions for future surveys ? HTML meta tags, favicon.ico?, unprotected .htaccess ?
September 21st, 2007 at 7:00 PM
Yay, original research.
I’m sure many of us did at least one of these things wrong, like typing “disallwo” [sic] and not checking the spelling.
September 21st, 2007 at 9:28 PM
How many sites got everything – the response code, the mime-type, the syntax – “right”?
September 22nd, 2007 at 12:09 AM
Interesting post Andrew, I learned something new today. Coincidentally, I looked at the robots file on your site, it reads “# Nothing to see here. :-)”.
September 22nd, 2007 at 5:46 AM
I can’t believe people think sites shouldn’t be spidered unless they asked for it.
Did it ever occur to them that if they want to keep something private, maybe they shouldn’t publish it on a world-wide, public computer network?
It’s the internet. You have no privacy. Get over it.
September 22nd, 2007 at 6:36 AM
You missed the blog in http://www.webmasterworld.com/robots.txt
September 22nd, 2007 at 5:30 PM
[…] nextthing.org » robots.txt Adventure (tags: web robots.txt http search spider robots standards internet google genius analysis **) […]
September 23rd, 2007 at 3:45 AM
Hey, I’ve translated this article into Russian (of course you’ve got some more links :)).
This is great. I’m surprised how many sites from Dmoz have such stupid errors. They are likely to be good sites, aren’t they? It’s so hard to get into dmoz now…
Good job.
September 23rd, 2007 at 11:26 AM
I think you’re missing some important rules for robots to follow:
# A robot may not injure a human being or through inaction allow a human being to come to harm.
# A robot must obey the orders given it by human beings, except where such orders would conflict with the First Law
# A robot must protect its own existence, as long as such protection does not conflict with the First or Second Laws.
September 24th, 2007 at 1:54 AM
Just one small quibble: it’s “c-o-n-n-E-r”.
September 24th, 2007 at 9:51 AM
Sorry about that Sean. I’ve fixed it in the article.
September 24th, 2007 at 11:56 AM
[…] Interesting web surveys: robots.txt and http headers (via Simon Willison). […]
September 24th, 2007 at 6:37 PM
Oh that’s amusing, interesting and useful too.
I frequently do health checks on other peoples’ websites but it hadn’t occured to me to check they’ve written their robots file correctly. I’ll add that to my list of checks.
September 26th, 2007 at 1:08 AM
[…] An interesting study by Andrew Wooster. He had a home-made spider visit 4.6 million domains to collect and analyze the robots.txt file from each. It turned up not only statistics on status codes and MIME types, but also all sorts of oddities that suggest a peculiar understanding of the file. Texts of every kind, keywords, logs, lists, and even ASCII art turn up in a file that is aimed exclusively at bots and spiders. […]
February 24th, 2009 at 5:39 PM
Can I use a wildcard in a filename? e.g.
http://www.domain.com/admin*.php
http://www.domain.com/*.txt