Tangled Webs

At Home with the Robots

On a typical website, up to half of all visitors aren’t human at all. They’re robots of all kinds: the good, the bad, the ugly. Or, if you prefer, There’s three ways that robots can go: that’s good, bad or mediocre . . . .

This group of pages is based on a long post I’ve put together, off and on, for the Webmaster World forums since 2012. Robots come and robots go, so I’ll keep updating it periodically. Watch for notations like “at time of writing” to see how current any given description is. Some robots will also have a “Last Seen” date.

Good, Bad or Mediocre?

There are various ways of classifying robots. But for me the Great Divide is robots.txt. Does the robot first check that it has permission to enter, learn which parts of the site are off limits, and restrict its visits to authorized areas? If it doesn’t, it had better have a darn good excuse for being here.

What counts as “darn good” is purely an individual decision. A robot barging in uninvited from social media might lead to human visitors who would otherwise not have known what a great site you’ve got. If a robot is connected with something you consider a worthy cause, such as checking up on academic plagiarism, you might decide to turn a blind eye even though there’s no direct benefit to you. I, personally, consider archiving (as at the Wayback Machine) a great thing; there’s material I would never have found otherwise. But some people loathe it.

Within the “good” group, there are further subdivisions: search engine spiders; robots that follow RSS feeds; social media; and the ever-popular “I dunno, but they don’t seem to be causing any trouble”.

robots.txt

Before anything else: Make sure everyone can see the file. I’ve found that some robots get confused if robots.txt requests are redirected, whether to HTTPS or to the canonical form of the hostname (with or without “www”, depending on site). So on my sites, robots.txt requests are exempt from all redirection and all access-control rules. That leaves them with no excuse to say But I tried to read it, honest I did.
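If you want to spot-check this from the outside, here is a rough sketch using Python’s standard library; the hostname is a placeholder, so substitute your own.

import urllib.error
import urllib.request

# Refuse to follow redirects, so we see exactly what a robot would see.
class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)
try:
    reply = opener.open("http://example.com/robots.txt", timeout=10)
    print(reply.status)   # want a plain 200 here
except urllib.error.HTTPError as err:
    print(err.code)       # 301/302 means robots.txt requests are being redirected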

Once that’s taken care of: the next rule is that you can’t penalize robots for not following every last syllable of your long, convoluted robots.txt file. To be considered “compliant”, a robot only has to understand this:

User-Agent: *
Disallow: /private
Disallow: /personal

meaning “no matter who you are, keep out of these areas”, and conversely this:

User-Agent: your-name-here
Disallow: /
 
User-Agent: and-yours-too
Disallow: /

meaning “keep out, subject closed, not negotiable, I just don’t like your face”. Note the blank line between the two name-and-disallow pairs. This is not just to make it more readable; in robots.txt a blank line has syntactic meaning.

If you don’t want any robots, ever, that gives you the minimalist two-line robots.txt file:

User-Agent: *
Disallow: /

meaning “everyone keep out of everywhere all the time”.

A combined listing is a great time-saver:

User-Agent: this-means-you
User-Agent: and-you
User-Agent: and-also-you
Disallow: /

It isn’t obligatory, though. Some of the robots on these pages moved from the “bad” to the “good” side after I experimented and found that they only understand the single-name format.
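Incidentally, if you’re curious how a compliant robot digests files like the ones above, Python’s standard library happens to include a small robots.txt parser, so you can watch the rules being applied. A sketch (the rules and robot names here are made up for illustration):

import urllib.robotparser

rules = """\
User-Agent: *
Disallow: /private
Disallow: /personal

User-Agent: somebot
User-Agent: otherbot
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("anybot", "/private/diary.html"))   # False: off limits to everyone
print(rp.can_fetch("anybot", "/fun/robots/"))          # True: not disallowed
print(rp.can_fetch("otherbot", "/fun/robots/"))        # False: banned outright

As you’d hope, this parser treats a combined listing exactly like separate name-and-disallow pairs.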

And, come to that, what does “your name” mean? It obviously doesn’t mean the robot’s full name:

User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Any right-thinking robot ought to know what you mean when you say simply

User-Agent: Googlebot
User-Agent: bingbot

homing in on the part that distinguishes one from another. If a robot can’t do this, it risks being kicked into the “noncompliant” bin: “Oh, duh, I had no idea you meant me when you said ‘Knowledge’, I’ve only been taught to recognize my full name, ‘The Knowledge AI’.” Lesson: As we all know, never attribute to malice that which can be adequately explained by stupidity.
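For what it’s worth, here is a toy sketch of the matching a well-behaved robot is expected to do: treat the name in robots.txt as a case-insensitive token and look for it inside its own full name. (This is the spirit of the thing, not a formal specification.)

# Toy illustration only: does a robots.txt token name this robot?
def rule_applies(robots_txt_token, full_robot_name):
    return robots_txt_token.lower() in full_robot_name.lower()

print(rule_applies("Googlebot",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True
print(rule_applies("Knowledge", "The Knowledge AI"))  # True, if the robot has any sense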

Or take the Crawl-Delay timer, which can accompany your “Disallow:” lines:

Crawl-Delay: 3

meaning “Please space your requests at least three seconds apart.” Simple, straightforward, reasonable. The Googlebot knowingly ignores it—they’d prefer that you make your preferences known in Search Console—so you can hardly hold it against other robots if they disregard it as well.
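If you happen to be writing a polite robot of your own, the standard-library parser shown earlier will hand you the delay as well. A sketch (the robot name and URL are placeholders; remember that Crawl-Delay is a de-facto extension, so not every parser knows about it):

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()
delay = rp.crawl_delay("examplebot") or 1   # be polite even if no delay is set

for path in ("/fun/", "/fun/robots/"):
    if rp.can_fetch("examplebot", path):
        pass                                # fetch the page here
    time.sleep(delay)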

There’s one more common-sense rule that robots have to follow if they expect to be allowed in: have a name. There’s not much point in asking for robots.txt if you’re going around calling yourself Firefox/10 or Chrome/41.

A final note: Often people talk about “blocking” some robot in robots.txt. This is wrong. As the .txt extension should tell you, robots.txt is purely an informational text file. It has no enforcement power; it’s just like pinning a sign to a door saying “No Admittance” or “Employees Only”. If you really need to prevent someone from entering, you need to install a deadbolt. Details will depend on your server type and—as always—on your individual needs and preferences.
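Just to make the idea concrete, here is a rough sketch of a user-agent deadbolt for a site that happens to run as a Python WSGI application. On Apache or Nginx you would reach for the server’s own access-control directives instead, and the blocked names below are placeholders, not recommendations.

# A "deadbolt": a WSGI wrapper that answers 403 to any request whose
# User-Agent contains a banned fragment. The names are placeholders.
BANNED = ("badbot", "nastyscraper")

def deadbolt(app):
    def guarded(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name in ua for name in BANNED):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return guarded

The same idea, matching on UA, IP or referer and answering 403, carries over to whatever server you actually run.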

Vocabulary

If you are just joining us . . . these are the terms I assume you know:

IP
The “address” of the visitor: a set of numbers that might represent a human ISP, or a wireless phone provider . . . or a search engine in Mountain View, or a server in Ukraine. Unlike all other pieces of information the visitor sends, you can normally assume the IP address is real.
IP addresses can be falsified—but it’s not like forging the sender’s address on email (or paper mail), or putting a fake number into Caller ID. Faking the IP address on an Internet request is the equivalent of ordering something by mail and giving a false shipping address. You’d only do it if you want someone to receive an embarrassing package, or if you’re trying to make the vendor go broke by sending out things the recipient didn’t order. If you want to receive the package, you have to give your real address.
A distributed robot doesn’t always come from the same address; it shares space on a lot of different servers, and might crawl from any of them. This can make it hard to tell if the robot is who it says it is, but there are usually other identifiers. Major operators like Bing and Google have known addresses that they use consistently. Smaller robots—and even some surprisingly big ones—are often distributed. Or then again, in the case of malign robots, perhaps they didn’t pay their bills and have to find a new host every time.
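One way to check whether a visitor claiming to be, say, Googlebot is the real thing (both Google and Bing document a variant of this): take the visiting IP, look up its reverse-DNS name, make sure that name belongs to the claimed operator, then resolve it forward again and confirm it points back to the same IP. A sketch in Python, using the googlebot.com / google.com domains Google publishes at time of writing; the IP below is a placeholder:

import socket

def looks_like_googlebot(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]             # reverse lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]  # forward lookup must agree
    except socket.gaierror:
        return False

print(looks_like_googlebot("192.0.2.1"))               # placeholder address: False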
UA
Short for “User Agent”. With human visitors, that means their browser—the exact version number, along with the operating system, or the exact model of their smartphone.
The UA can easily be faked; even ordinary browsers often have a User-Agent Switching option. So you will get a robot pretending to be the latest Chrome or Firefox, or calling itself Googlebot, because it thinks it will get better treatment that way. (Spoiler: It won’t.)
Referer
Don’t look at me; I didn’t codify the spelling. Loosely, the “Referer” is who sent you. If the visitor is human, that normally means the link they clicked to get to a page, whether in some other page or in a search engine. If they typed your address straight into the browser’s address bar, or they have your page bookmarked, there’s no referer.
Most robots just read pages, like “panda.html”. Humans also need supporting files: images, stylesheets, scripts. But you don’t have to ask for all these things individually. The browser does it for you; that’s why it’s called a “User Agent”. In general, a browser will give the name of the original page as referer for all supporting files.
Humans can choose not to send a referer, though this may cause trouble on some sites. You can also send a fake referer. When a robot does this, inserting the name of some site they’ve been paid to advertise, it’s called “Referer Spam”.
An “auto-referer” is my own term for giving the requested page itself as referer. The idea is probably to avoid suspicion by making it look as if you’re already on the site. Instead, of course, it only makes it more obvious that you’re a robot.
logs
Server access logs, available to all website administrators except those on the cheapest of shared-hosting plans. (If that’s you, stop reading this page and go look for a new host.) Logs show all requests sent in to the server, along with the response. If you, as a human, are reading this page, my access logs might say something like:

11.22.33.44 - - [01/Mar/2020:12:23:45 -0700] "GET /fun/robots/ HTTP/1.1" 200 11555 "https://www.duckduckgo.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

11.22.33.44 - - [01/Mar/2020:12:23:46 -0700] "GET /sharedstyles.css HTTP/1.1" 200 2067 "https://fiftywordsforsnow.com/fun/robots/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

11.22.33.44 - - [01/Mar/2020:12:23:46 -0700] "GET /fun/miststyles.css HTTP/1.1" 200 1567 "https://fiftywordsforsnow.com/fun/robots/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

11.22.33.44 - - [01/Mar/2020:12:23:46 -0700] "GET /fun/robots/robotstyles.css HTTP/1.1" 200 1567 "https://fiftywordsforsnow.com/fun/robots/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

11.22.33.44 - - [01/Mar/2020:12:23:47 -0700] "GET /favicon.ico HTTP/1.1" 200 661 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

and so on.
If, on the other hand, you are a malign robot and I don’t want you to see the page, logs might say curtly:

11.22.33.44 - - [01/Feb/2020:12:23:46 -0700] "GET /fun/robots/ HTTP/1.1" 403 3443 "http://disgusting-spammy-site.ru/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0"
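If you ever want to pull those fields apart programmatically instead of squinting at them, here is a small sketch in Python, assuming the combined log format shown above:

import re

# One combined-log-format line: IP, identity, user, timestamp, request,
# status, bytes, referer, user-agent.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

sample = ('11.22.33.44 - - [01/Feb/2020:12:23:46 -0700] "GET /fun/robots/ HTTP/1.1" '
          '403 3443 "http://disgusting-spammy-site.ru/" '
          '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0"')

hit = LOG_LINE.match(sample)
print(hit.group("ip"), hit.group("status"), hit.group("ua"))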

robots.txt
A text file tucked away in most websites, including this one. (Don’t get too excited though; what you see is not necessarily what the average robot will see.) It has to live at the top level of the site, reachable by the exact name robots.txt, so visitors know what to ask for. It tells robots which areas of the site are off limits, and may set special rules for some robots by name. Good, law-abiding robots will consult robots.txt before asking for anything else, and will respect its rules.