MiSTings and More

At Home with the Robots

On a typical website, up to half of all visitors aren’t human at all. They’re robots of all kinds: the good, the bad, the ugly. Or, if you prefer, There’s three ways that robots can go: that’s good, bad and mediocre . . . .

This group of pages is based on a long post I’ve put together for the Webmaster World forums every year or two since 2012, most recently in March 2019. Robots come, robots go, so I’ll continue updating it now and then.

There are various ways of classifying robots. But for me the Great Divide is robots.txt. Does the robot first check that it has permission to enter, learn which parts of the site are off limits, and restrict its visits to authorized areas? If it doesn’t, it had better have a darn good excuse for being here.

What counts as “darn good” is purely an individual decision. A robot barging in uninvited from social media might lead to human visitors who would otherwise not have known what a great site you’ve got. If a robot is connected with something you consider a worthy cause, such as checking up on academic plagiarism, you might decide to turn a blind eye even though there’s no direct benefit to you. I, personally, consider archiving (as at the Wayback Machine) a great thing; there’s material I would never have found otherwise. But some people loathe it.

Within the “good” group, there are further subdivisions: search engine spiders; robots that follow RSS feeds; social media; and the ever-popular “I dunno, but they don’t seem to be causing any trouble”.

Search Engines

Targeted Robots

Miscellaneous Robots

Malign Robots

robots.txt

Suppose you’ve decided that your visitors have to heed robots.txt. If so, the one thing you really can’t do is penalize them for not following every last syllable of your long, convoluted robots.txt file. To be considered “compliant”, a robot only has to understand this:

User-Agent: *
Disallow: /private
Disallow: /personal

meaning “no matter who you are, keep out of these areas”, and conversely this:

User-Agent: your-name-here
Disallow: /

meaning “keep out, subject closed, not negotiable, I just don’t like your face”.

If you don’t want any robots, ever, that gives you the minimalist two-line robots.txt file:

User-Agent: *
Disallow: /

meaning “everyone keep out of everywhere all the time”.

A combined listing is a great time-saver:

User-Agent: this-means-you
User-Agent: and-you
User-Agent: and-also-you
Disallow: /

It isn’t obligatory, though. Some of the robots on these pages moved from the “bad” to the “good” side after I experimented and found that they only understand the single-name format.
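As a rough illustration of how a compliant robot might read these rules, here is a minimal sketch using Python’s standard urllib.robotparser module. The sample file simply stitches together the snippets above; “some-polite-bot” and the two paths are made-up names for the example, and real robots may well parse things differently.

```python
from urllib.robotparser import RobotFileParser

# Sample rules assembled from the snippets above (placeholder robot names).
ROBOTS_TXT = """\
User-Agent: *
Disallow: /private
Disallow: /personal

User-Agent: this-means-you
User-Agent: and-you
User-Agent: and-also-you
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A robot with no section of its own falls under the "*" rules.
print(rules.can_fetch("some-polite-bot", "/fun/robots/"))         # True
print(rules.can_fetch("some-polite-bot", "/private/diary.html"))  # False

# Every name in the combined listing shares the same "Disallow: /".
print(rules.can_fetch("and-you", "/fun/robots/"))                 # False
```

Note that this parser happens to understand the combined listing. A robot that doesn’t would see only one of the three names, which is exactly the situation described above.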

And, come to that, what does “your name” mean? It obviously doesn’t mean the robot’s full name:

User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Any right-thinking robot ought to know what you mean when you say simply

User-Agent: Googlebot
User-Agent: bingbot

homing in on the part that distinguishes one from another. If a robot can’t do this, it risks being kicked into the “noncompliant” bin: “Oh, duh, I had no idea you meant me when you said ‘Knowledge’, I’ve only been taught to recognize my full name, ‘The Knowledge AI’.” Lesson: As we all know, never attribute to malice that which can be adequately explained by stupidity.
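The kind of name matching a right-thinking robot could do is sketched below. This is only one plausible approach (a case-insensitive substring test), not a description of how any particular robot actually works; the function name rule_names_me is invented for the example.

```python
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BINGBOT_UA = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"


def rule_names_me(robots_txt_name, my_full_user_agent):
    """Return True if a robots.txt name should be read as 'this means me'."""
    name = robots_txt_name.strip().lower()
    if name == "*":
        return True  # the wildcard addresses everyone
    # Case-insensitive substring test: "googlebot" is found inside the full UA.
    return bool(name) and name in my_full_user_agent.lower()


print(rule_names_me("Googlebot", GOOGLEBOT_UA))        # True
print(rule_names_me("bingbot", GOOGLEBOT_UA))          # False
print(rule_names_me("Knowledge", "The Knowledge AI"))  # True
```

With a test like this, “Knowledge” is enough to reach a robot whose full name is “The Knowledge AI”.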

Or take the timer, which can accompany your “Disallow:” lines:

Crawl-Delay: 3

meaning “Please space your requests at least three seconds apart.” Simple, straightforward, reasonable. The Googlebot knowingly ignores it (they’d prefer that you make your preferences known in Search Console), so you can hardly hold it against other robots if they disregard it as well.
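For a robot that does want to honor the timer, the logic might look roughly like this. It is a sketch under assumptions: the crawl_delay() method of Python’s urllib.robotparser reads the value, while crawl_politely, the fetch callback, and “some-polite-bot” are hypothetical names made up for the example.

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /private
Crawl-Delay: 3
"""


def crawl_politely(paths, fetch, delay_seconds, sleep=time.sleep):
    """Call fetch(path) for each path, pausing delay_seconds between requests."""
    results = []
    for index, path in enumerate(paths):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(path))
    return results


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns None when no delay is set for this robot.
delay = parser.crawl_delay("some-polite-bot") or 0
print(delay)  # 3
```

The sleep function is passed in as a parameter only so the pacing can be tried out without actually waiting.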

There’s one more common-sense rule that robots have to follow if they expect to be allowed in: have a name. There’s not much point to asking for robots.txt if you’re going around calling yourself Firefox/10 or Chrome/41.

A final note: Often people talk about “blocking” some robot in robots.txt. This is wrong. The robots.txt file is, as the .txt extension should tell you, purely an informational text file. It has no enforcement power; it’s just like pinning a sign to a door saying “No Admittance” or “Employees Only”. If you really need to prevent someone from entering, you need to install a deadbolt. Details will depend on your server type and, as always, on your individual needs and preferences.
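To make the sign-versus-deadbolt difference concrete, here is one hypothetical deadbolt, written as a small Python WSGI wrapper. It is purely an illustration and assumes your site runs as a WSGI application; on most servers you would do the equivalent in the server’s own configuration instead. The names in BLOCKED_UA_FRAGMENTS are the placeholders from the examples above, not real robots.

```python
# Hypothetical denylist; these are placeholder names, not real robots.
BLOCKED_UA_FRAGMENTS = ("this-means-you", "and-you")


def is_blocked(user_agent):
    """True if the User-Agent header contains any denylisted fragment."""
    ua = user_agent.lower()
    return any(fragment in ua for fragment in BLOCKED_UA_FRAGMENTS)


def deadbolt(app):
    """Wrap a WSGI app so denylisted user agents get a 403 instead of the page."""
    def guarded(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else passes through to the real application.
        return app(environ, start_response)
    return guarded
```

Unlike robots.txt, this doesn’t depend on the robot’s cooperation: the request is refused whether or not the robot ever read the sign.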

Vocabulary

If you are just joining us . . . these are the terms I assume you know:

IP
The “address” of the visitor: a set of numbers that might represent an ISP serving human customers, or a wireless phone provider . . . or a search engine in Mountain View, or a server in Ukraine. Unlike all other pieces of information the visitor sends, you can normally assume the IP address is real.
IP addresses can be falsified, but it’s not like forging the sender’s address on email (or paper mail), or putting a fake number into Caller ID. Faking the IP address on an Internet request is the equivalent of ordering something by mail and giving a fake shipping address. You’d only do it if you want someone to receive an embarrassing package, or if you’re trying to make the vendor go broke by sending out things the recipient didn’t order. If you want to receive the package, you have to give your real address.
A distributed robot doesn’t always come from the same address; it shares space on a lot of different servers, and might crawl from any of them. This can make it hard to tell if the robot is who it says it is, but there are usually other identifiers. Major operators like Bing and Google have known addresses that they use consistently. Smaller robots—and even some surprisingly big ones—are often distributed. Or then again, in the case of malign robots, perhaps they didn’t pay their bills and have to find a new host every time.
UA
Short for “User Agent”. With human visitors, that means their browser—the exact version number, along with the operating system, or the exact model of their smartphone.
The UA can easily be faked; even ordinary browsers often have a User-Agent Switching option. So you will get a robot pretending to be the latest Chrome or Firefox, or calling itself Googlebot, because it thinks it will get better treatment that way. (Spoiler: It won’t.)
Referer
Don’t look at me; I didn’t codify the spelling. Loosely, the “Referer” is who sent you. If the visitor is human, that normally means the link they clicked to get to a page, whether in some other page or in a search engine. If they typed your address straight into the browser’s address bar, or they have your page bookmarked, there’s no referer.
Most robots just read pages, like “panda.html”. Humans also need supporting files: images, stylesheets, scripts. But you don’t have to ask for all these things individually. The browser does it for you; that’s why it’s called a “User Agent”. In general, a browser will give the name of the original page as referer for all supporting files.
Humans can choose not to send a referer, though this may cause trouble on some sites. You can also send a fake referer. When a robot does this, inserting the name of some site they’ve been paid to advertise, it’s called “Referer Spam”.
An “auto-referer” is my own term for giving the requested page itself as referer. The idea is probably to avoid suspicion by making it look as if you’re already on the site. Instead, of course, it only makes it more obvious that you’re a robot.
logs
Server access logs, available to all website administrators except those on the cheapest of shared-hosting plans. (If that’s you, stop reading this page and go look for a new host.) Logs show all requests sent in to the server, along with the response. If you, as a human, are reading this page, my access logs might say something like:

11.22.33.44 - - [01/Mar/2019:12:23:45 -0700] "GET /fun/robots/ HTTP/1.1" 200 11555 "https://www.duckduckgo.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

11.22.33.44 - - [01/Mar/2019:12:23:46 -0700] "GET /sharedstyles.css HTTP/1.1" 200 2067 "http://fiftywordsforsnow.com/fun/robots/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

11.22.33.44 - - [01/Mar/2019:12:23:46 -0700] "GET /fun/miststyles.css HTTP/1.1" 200 1567 "http://fiftywordsforsnow.com/fun/robots/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

11.22.33.44 - - [01/Mar/2019:12:23:46 -0700] "GET /fun/robots/robotstyles.css HTTP/1.1" 200 1567 "http://fiftywordsforsnow.com/fun/robots/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

11.22.33.44 - - [01/Mar/2019:12:23:47 -0700] "GET /favicon.ico HTTP/1.1" 200 661 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

and so on.
If, on the other hand, you are a malign robot and I don’t want you to see the page, logs might say curtly:

11.22.33.44 - - [01/May/2017:12:23:46 -0700] "GET /fun/robots/ HTTP/1.1" 403 3443 "http://disgusting-spammy-site.ru/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0"

robots.txt
A text file tucked away in most websites, including this one. (Don’t get too excited though; what you see is not what the average robot will see.) It has to be reachable by the exact name robots.txt so visitors know what to ask for. It tells robots which areas of the site are off limits, and may set special rules for some robots by name. Good, law-abiding robots will consult robots.txt before asking for anything else, and will respect its rules.
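To tie the vocabulary together: the log lines shown under “logs” contain the IP, the request, the response code, the referer and the UA, in that order. The sketch below picks them apart with a regular expression and then checks for the “auto-referer” described above. It assumes the layout of the sample lines on this page; other servers may log in a different format, and the function names are invented for the example.

```python
import re
from urllib.parse import urlsplit

# Matches the layout of the sample log lines above (an assumption, not a universal format).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def parse_log_line(line):
    """Split one access-log line into named fields, or return None if it does not fit."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def is_auto_referer(entry, site_host):
    """True when the referer is the very page being requested."""
    referer = urlsplit(entry["referer"])
    return referer.hostname == site_host and referer.path == entry["path"]


blocked = parse_log_line(
    '11.22.33.44 - - [01/May/2017:12:23:46 -0700] "GET /fun/robots/ HTTP/1.1" '
    '403 3443 "http://disgusting-spammy-site.ru/" '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0"'
)
print(blocked["ip"], blocked["status"], blocked["referer"])
print(is_auto_referer(blocked, "fiftywordsforsnow.com"))  # False
```

The malign robot in the sample sent a spam referer rather than an auto-referer, so the check comes back False; a request for /fun/robots/ that named /fun/robots/ on this site as its own referer would come back True.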