At Home with the Robots

Unwelcome Robots

And then there are the robots that are not welcome. Will never be welcome. Could never be welcome.

As noted elsewhere, my standards are not exacting: ask for robots.txt, find your name, stay away if disallowed. If you don’t see your name, at least stay out of the disallowed directories. But for some robots, even this is just too difficult and complicated.

In order to crash the gates successfully, it isn’t enough for a robot to have a thoroughly human User-Agent; it also has to send plausible humanoid headers. Thankfully, most robots are either too stupid or too lazy. As of early 2020, about a tenth of them—down from a fifth just one or two years ago—send no User-Agent at all: instant lockout. About one-quarter—including a number of major search engines—don’t send an Accept header: instant lockout unless whitelisted. Other header anomalies are, of course, for me to know and you to find out. Or not.
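For illustration only, a pair of mod_rewrite rules along these lines would handle both lockouts. This is a minimal sketch of what a root .htaccess might contain, not the actual rule set, and the whitelist address is a documentation placeholder:

# No User-Agent header at all: instant lockout.
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* - [F]

# No Accept header: lockout unless the IP has been whitelisted.
# (192.0.2.x is a placeholder standing in for a real whitelist.)
RewriteCond %{HTTP:Accept} ^$
RewriteCond %{REMOTE_ADDR} !^192\.0\.2\.
RewriteRule .* - [F]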

It’s not as bad as it could be, though. At rough count, less than 2% of unwanted robots are getting in—and most of those just pick up the front page and go on their way. (Not long ago, while looking up something else, I was stunned to discover that fully two-thirds of root / requests are blocked. As you may have noticed, this is not a front-driven site.) It’s been several years since I’ve seen a really devastating crawl coming out of nowhere.

Fun fact: Malign robots must really like reading about themselves. On this site, more than 10% of all blocked requests are for this very directory, especially its front page.

Robotic User-Agents

Alongside those that can’t be bothered to send a User-Agent at all, there are always a few brand-new robots who go around for months with the equivalent of Insert Name Here. And then there are the computer-science class assignments who never do figure out what to call themselves. (“Am I ‘test1’? No, wait, I think I was ‘TestOne’.”) On the other hand, the fake Googlebot seems to have all but disappeared in recent years. Maybe the people who write robot scripts have figured out that

Googlebot UA + non-Google IP = automatic lockout

so it ends up being worse than useless.
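Expressed as a rule, the equation might look like this. It is only a sketch, assuming the well-known 66.249.64.0/19 crawl range; a real check would cover more ranges, or verify by reverse DNS as Google recommends:

# A Googlebot User-Agent arriving from outside Google's address space is a fake.
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteRule .* - [F]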

By now, at least 85% of all robots are smart enough to start their names with “Mozilla” followed by some more-or-less-plausible humanoid user-agent. In the early part of 2019, around 70% of blocked robots claimed to be human. Firefox/40.1 seems to be in fashion just now; at least it’s a little more believable than the assorted one-digit Firefoxes. And a visitor of any species, human or robot, calling itself MSIE 6 . . . can only inspire pity.

Some of them, though, aren’t even trying:

null
Go-http-client/1.1
Dorado WAP-Browser/1.0.0

Sometimes humans wear the “Dorado” face too, but more often it’s a robot.

Thanks but No Thanks

It isn’t enough to ask for robots.txt. You also have to do what it says. That means you:

DomainStatsBot

IP: 136.243.17.142, ..59.237, ..222.140; 148.251.121.91

These four IPs, exact down to the last digit, have been used interchangeably since 2019.

DomainStatsBot/1.0 (http://domainstats.io/our-bot)

DomainStatsBot/1.0 (https://domainstats.com/pages/our-bot)

The first User-Agent showed up very briefly in 2016; when next seen, in 2019, it had settled on the second form.

Thanks to some inattentiveness on my part, this robot was authorized for quite a while before I noticed that it had started requesting pages in one roboted-out directory. Hmph. Once could be a glitch; recurring over more than a year is misbehavior.

LightspeedSystemsCrawler

IP: 207.200.8.180; 52.36.125.174, 52.36.251.200

LightspeedSystemsCrawler Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)

It is possible this is really two robots. But I don’t like either of them, so it makes little difference. From 207.200 it requests the root followed by all pages linked therefrom. (Or, if you want to be hairsplittingly accurate: all pages linked from the 403 page. They just happen to be the same list.) From 52.36 it requests robots.txt—where it has been Disallowed for years—generally followed by a few requests for the root.

linkdex

Mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)

In the past it has been absent for long stretches. Currently it shows its face every month or so.

ltx71

IP: 52.2.10.94, 104.154.58.95

ltx71 - (http://ltx71.com/)

I first became aware of this robot when it showed up requesting new ebooks. I later discovered that it also likes the page linked from my profile at Webmaster World; I just never noticed because it was always blocked for one reason or another. Considering how often this robot asks for robots.txt—on some visits, it’s all it requests—it is surprising that it has not yet got around to reading the thing.

MegaIndex

IP: 176.9.50.244, 176.9.28.137

Earlier they lived at 88.198.48.46. After taking the year 2016 off, they returned with two new addresses in random alternation.

Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)

Last seen: May 2020

Some robots request robots.txt only after getting the front page; this is one of them. (“Oh! I just assumed I was allowed to crawl everywhere. Gee, I guess I was wrong.”) But, since they proceeded to ask for the entire contents of a roboted-out directory, it becomes pretty academic.

Qwantify

The peril of putting a robot in the Ignore bin is that you don’t notice when it starts misbehaving—in this case by repeatedly crawling directories that are clearly and unambiguously disallowed in robots.txt. In years past, you would have found this robot in the Miscellaneous list. No more.

IP: 91.242.162, 194.187.169-171, 162.19.101

Very occasionally I have seen it from other IPs, but those are so rare I can’t be sure they are not fakers. The 162.19 range is a newer one, first seen in mid-2023. You might expect the full 194.187.168-171 (in CIDR terms, 194.187.168.0/22) but I have never seen it from 168.

Mozilla/5.0 (compatible; Qwantify/1.0; +https://www.qwant.com/)

Mozilla/5.0 (compatible; Qwantify/2.3w; +https://www.qwant.com/)/2.3w

Mozilla/5.0 (compatible; Qwantify/2.4w; +https://www.qwant.com/)/2.4w

Qwantify/1.0

The minimalist “Qwantify/1.0” and-that’s-all appeared sporadically until March 2021. For the first few years it requested only the favicon. It then took a prolonged vacation from mid-2018 to December 2019. When it returned, it changed to root / followed by favicon. Around April 2021 it was replaced by the full 1.0 version, with the same root-plus-favicon request pattern.

You might reasonably think that the Qwantify/2 versions came after Qwantify/1.0. In fact they are older, going back to 2015. The change from 2.3w to 2.4w happened in the latter part of April 2017, with no overlap. Qwantify/2.4 crawled from 91.242, requesting only robots.txt. Can’t think why it bothered, since it persistently ignored the Disallow it found there. With an isolated exception, I last saw it in 2022.

Mozilla/5.0 (compatible; Qwantify-dev/1.0; +https://help.qwant.com/bot/)

Mozilla/5.0 (compatible; Qwantify-prod/1.0; +https://help.qwant.com/bot/)

Mozilla/5.0 (compatible; Qwantify-prod524/1.0; +https://help.qwant.com/bot/)

Do not ask why Qwantify suddenly ventured into (presumably) “development” and “production” many years into its existence. “Qwantify-dev” first showed up in November 2022, “Qwantify-prod” in July 2023. The third UA is just an example. Beginning around December 2023 “dev” came interspersed with “dev188”; from around November 2023, “prod” sometimes appended a number of its own. To date (January 2024) I’ve seen 112, 147, and 524.

Incidentally, the “help.qwant.com” URL leads to an information page listing half a dozen different User-Agents—four of which I don’t think I have ever set eyes on. The page also says that the robot “respects the robots rules standard”. This is, to use an arcane technical term, a lie.

Mozilla/5.0 (compatible; Qwantify/Bleriot/1.1; +https://help.qwant.com/bot)

Mozilla/5.0 (compatible; Qwantify/Bleriot/1.2.1; +https://help.qwant.com/bot)

Mozilla/5.0 (compatible; Qwantify/Mermoz/0.1; +https://www.qwant.com/; +https://www.github.com/QwantResearch/mermoz)

These versions seem to have been short-lived experiments. While they existed, they were used in random alternation. With one exception, I never saw Mermoz before 2018, and haven’t seen it at all since March 2019. It was replaced by Bleriot/1.1—always from the 91.242 IP—which was last seen in March 2020. Bleriot/1.2.1 made a brief, overlapping appearance near the end of 2019.

SemanticScholarBot

IP: 54.70.40.11

Mozilla/5.0 (compatible) SemanticScholarBot (+https://www.semanticscholar.org/crawler)

Initially I thought this was a compliant robot, because it stayed away when it found its name in robots.txt. But after I authorized it, it headed straight for a roboted-out directory. Nice try, but no cigar.

vuhuvBot

IP: 185.93.54.51

Mozilla/5.0 (compatible; vuhuvBot/1.0; +http://vuhuv.com/bot.html)

Last seen: June 2020

This is a fairly recent arrival to the unwanted-robots collection. It requests robots.txt only after its (blocked) page request—always a sure-fire way to continue being blocked.

We don’t want your kind around here

If they never even ask for robots.txt, how will they know what it says?

Blackboard Safeassign

IP: 34.231.5.82, 34.202.93.213

Blackboard Safeassign

For several years I gave this robot a free pass because it’s doing more-or-less-legitimate work, checking for plagiarism. On any one visit it might pick up anywhere from one to a dozen pages, most often in the /ebooks/ directory. Requests always came in pairs: first HEAD, then GET.

When I moved to HTTPS, Blackboard Safeassign racked up a disproportionate number of redirects because it came in one day and asked for the same group of pages over, and over, and over again. Did they think the text was going to change every three seconds? When it reached the point of up to two hundred requests for the same page within a couple of minutes, I threw in the towel. So far they haven’t seemed to get the message; blocked robots seldom do.

For editorial comments on plagiarism-checkers in general, see TurnitinBot under “Former Robots”.

PocketParser

PocketParser/2.0 (+https://getpocket.com/pocketparser_ua)

I originally put this with the Targeted Robots, but its requests don’t always fit the pattern. Happily, it has never been frequent; once it was absent for a whole year. And then it returned, darn it.

Seekport Crawler

IP: distributed

Each individual visit—typically six requests—used a single IP, but they varied all over the map.

Mozilla/5.0 (compatible; Seekport Crawler; http://seekport.com/)

Last seen: June 2022

First seen in August 2019, this robot was mercifully rare until 2020. And then, in June 2022, it must have come under new management, to be replaced with the compliant SeekportBot.

Chrome/50 with referer

IP: assorted OVH ranges, especially 94.23, 158.69.248, 188.165

Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36

Referer: assorted long complicated strings from assorted sites

This pattern may finally be on its way out; I’ve only seen it once since October 2019. Fortunately it is easy to block, so I can ignore it. The referer is always different—not referer spam but various superficially plausible searches, often from reputable sites. In former years, a robot with the same UA but no referer was common.

Others

OrgProbe/0.9.4 (+http://www.blocked.org.uk)

Xenu Link Sleuth/1.3.8

I don’t know and don’t especially care whether this is the actual Xenu; all I know is, I didn’t order it. (I don’t know about Xenu’s ordinary behavior. The w3 link checker requests robots.txt on each site that it visits, and goes away weeping if it doesn’t find authorization.)

Request Patterns

Many malign robots can be identified by their behavior rather than by name or address. For starters, there is the tried-and-true

“Your root page sent me”

Referer: fiftywordsforsnow.com (with or without final /)

Claiming that you got to some deep interior page directly from the root is a good way to get yourself blocked—especially if you also got the www and/or the https wrong. Look, robots, I made this site. I know which pages are linked from where. Sending an auto-referer for the root itself is similarly effective. Happily, most of them don’t even get as far as the RewriteRules—because, uhm, reasons.

A huge proportion of this site’s blocked requests are for /webs/robots/. And the overwhelming majority of those requests give the front page as referer—which would have been enough to get them blocked, regardless of other factors. That’s why these pages aren’t linked directly from the front page; if they were, the server would have no way to tell that you’re lying.
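The corresponding check is simple, because a front-page referer on anything in this directory is bogus by definition. A sketch only, using /fun/robots/ as the example path, not the rule actually in use:

# Nothing under /fun/robots/ is linked from the front page,
# so a front-page referer on these requests is always a lie.
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?fiftywordsforsnow\.com/?$ [NC]
RewriteRule ^fun/robots/ - [F]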

That leads us to . . .

Contact Six

This pattern showed up towards the end of 2022, and continues to be seen a few times a month. Each visit consists of—on this site—exactly six requests from a single IP and UA:

/boilerplate/contact.html without referer

immediately followed by five requests, all with the abovenamed /contact.html as referer:

/ (root)
/boilerplate/search.html
/boilerplate/contact.html
/boilerplate/legal.html
/boilerplate/about.html

Now and then I see the same pattern on my personal site. Since that site isn’t big enough for a search page, “Six” becomes “Five”.

This is but one manifestation of what I know collectively as the—

“Contact” Botnets

At any given time, there will be two or three variations of what I’ve labeled the “contact” botnet. The IP varies; the UA varies; the headers vary. What they all have in common is that the request will include /boilerplate/contact.html at some point. (Clever robots! They have been taught the word “contact”, and know to look for it, no matter where on the site the contact form may happen to be located.) The exact pattern tends to run for a few years before they get bored and change the script.

Four Requests

GET /some-inner-page

(always a different one) giving root as referer, as above

GET / (root) with auto-referer
GET /boilerplate/contact.html giving root as referer
GET / (root) with auto-referer

Like the “one plus three”, this exact pattern has been around for years. Currently they are all blocked—but only thanks to headers. IP alone won’t do it; the “Contact botnet” plagued me for years. A detail of this pattern is that the first referer has a final / slash, but the other three don’t. And, of course, they must have friends inside, because the first request is for some page that exists but is not visible from the front page.

Contact At Home

The original version of the “At Home with the Robots” group of pages was a single page called athome.html. It did not take robots long to find it—“Look! Someone’s talking about me!”—leading to a fixed pair of requests:

GET /fun/athome.html
GET /boilerplate/contact.html giving /fun/athome.html as referer

Before long, they figured out that they didn’t need to make the first request, but could proceed directly to the contact page, claiming that athome.html had sent them.

It has also not taken them long to learn that the old page has a new URL. Within days I was finding the request pattern:

GET /boilerplate/contact.html giving /fun/robots/ as referer

Contact POST

Once you’re on the contact page, the obvious next step is to, well, make contact. Or, better yet, use the page’s built-in POST mechanism to attempt to do something evil. By mid-2018 this led to the pattern:

GET some-random-interior-page

GET /boilerplate/contact.html giving the previously requested page as referer

POST /boilerplate/contact.html

This variation first showed up in May 2018, though it didn’t really become common until September. Fortunately, the POST request was always blocked because reasons. (I can’t give away everything, after all.) Still more fortunately, the script very quickly settled on a single user-agent:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36

This is distressingly up-to-date—it is still in occasional use by humans—but where there’s a will there’s a way.
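If one did want to block on it, combining the method, the target and the UA would keep innocent Chrome/69 stragglers out of the blast radius. A sketch only; the block actually in place here works differently:

# Refuse form POSTs from the one UA this script settled on.
RewriteCond %{REQUEST_METHOD} ^POST$
RewriteCond %{HTTP_USER_AGENT} Chrome/69\.0\.3497\.100
RewriteRule ^boilerplate/contact\.html$ - [F]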

Other patterns include

Two Plus One

This pattern showed up near the end of 2019 and shows no sign of going away, though it has been less frequent in recent months. Each visit consists of exactly three requests from a single IP:

/fun/robots/ with / (root) referer
/ with root referer without the final / (slash)

and then, anywhere from a few minutes to—rarely—over an hour later,

/ without referer, using a different UA

Sometimes the first request is for some individual page in the /robots/ directory—the directory you’re reading right now—but the front page is most common.

For example:

aa.bb.cc.dd - - [11/Feb/2020:08:21:31 -0800] "GET /fun/robots/ HTTP/1.1" 403 6691 "https://fiftywordsforsnow.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.69 Freeu/61.0.3163.69 MRCHROME SOC Safari/537.36"

aa.bb.cc.dd - - [11/Feb/2020:08:21:32 -0800] "GET / HTTP/1.1" 403 6691 "https://fiftywordsforsnow.com" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.69 Freeu/61.0.3163.69 MRCHROME SOC Safari/537.36"

aa.bb.cc.dd - - [11/Feb/2020:08:26:46 -0800] "GET / HTTP/1.1" 200 10415 "-" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.6) Gecko/20070817 IceWeasel/2.0.0.6-g3"

And repeat, anywhere from one to seven times in the course of a day. So far the record is eleven sets—but even one is one too many.

The first two requests are blocked because of the obviously bogus referer; only the third request gets through. So far I haven’t found a unifying feature that would allow me to block them all, though many do come from bad neighborhoods. Others come from seemingly human ISPs, especially in Asia, Africa or Latin America—parts of the world where, for a variety of reasons, people’s personal computers are especially vulnerable.

The UA in that third, not-yet-blocked request has a final quirk. It’s always different from the opening pair, and always plausibly humanoid. But it will then be reused on later visits, up to half a dozen times before yielding to a new one. This points to craftiness on the botrunner’s part: it’s no use blocking a specific UA, because tomorrow it will be replaced by something else.

But that’s not all. Other perennial robotic favorites include:

Eight Plus One

This pattern is very much like “Two Plus One” . . . except that here the first two requests come in sets of four, for a total of eight blocked requests to one non-blocked. And, just like “Two Plus One”, the ninth and last request deploys a different UA. Happily, this configuration is less common than the shorter version.

One Plus Three

GET /some-inner-page

(always a different one) giving the root as referer, although the requested page is not actually linked from the root, and then

GET / (root), giving the previously failed request as referer
GET / (root), with auto-referer
GET / (root), with auto-referer

This pattern was active for several years. Can’t think why they bothered, since everything was blocked.

Multiples

Only one page is requested—but they ask for it repeatedly, anywhere from 3 to 7 times, most often 4. Sometimes they throw in some referer spam. Though almost all of them are blocked, they keep coming, day in and day out, many many times a day.

The Worst of the Worst

When you walk in and demand to see /wp-login.php, it’s all over. There is zero chance that you have any legitimate, law-abiding purpose. (Well, all right. I suppose if you actually have a WordPress site, you might invite security-testing robots to look around and confirm that nobody can get in where they don’t belong. That is not what my visitors are up to.) The last few times I’ve counted, some 10% of a random month’s robotic visits—25% of blocked visits—ask for wp-login, wp-admin, Fckeditor, xmlrpc.php and similar. What they receive, instead, is the 403 page—or, at worst, a “no such file” 404.
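One rule covers the whole family; this is a sketch, and the real list of poke-and-hope filenames is considerably longer:

# There is no WordPress here, so anyone asking for its furniture gets the 403 page.
RewriteRule (wp-login\.php|wp-admin|xmlrpc\.php|fckeditor) - [F,NC]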

A surprising number ask for robots.txt, or the meaningless “/blog/robots.txt”. I’m not clear about the purpose of this request. They do not rush over and ask for any and all roboted-out directories, while they do go ahead and ask for the standard WP list. Are they hoping to find something like “Disallow: /wp-admin876” that isn’t worth asking for unless you already know it exists?

The main feature of malign robots is miscellaneousness. Some IP shows up, requests two or six or a dozen files from a standard list, and never shows its face again. Repeat visits are a feature of named robots from known addresses.

For several years, the overwhelming majority of requests for wp-login.php have come in with the exact UA

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1

suggesting that they all started with the same script. Happily, this UA is no longer used by humans, so they can now be blocked at the gate. Unhappily, the UA has lately been joined by the slightly more plausible

Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0
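The first of those can go straight into the kill file on the strength of the UA alone (a sketch); the Firefox/62 version takes more thought, since a human just might still be running it:

# Firefox 40 has not been a human browser for years.
RewriteCond %{HTTP_USER_AGENT} Firefox/40\.1
RewriteRule .* - [F]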

The Current Faker

Calling yourself Googlebot seems to be going out of fashion. The only one I’ve seen lately didn’t even use the full UA string; it thought it could get away with the magic word alone:

Googlebot (gocrawl v0.4)

Even then, it hedged its bets; the “gocrawl” version alternated with

Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2

which counts as “could be better, could be worse” among humanoid UAs. All blocked, so no skin off my nose in any case.

And the Winner Is . . .

In the “Extra Stupid” category:

User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31

And your point is . . .?

Look again. The element “User-Agent:” is part of the UA string. This exact UA first showed up in April 2017, and could be seen sporadically until September 2019. Others with the leading “User-Agent:” have rattled the doorknob as recently as October 2021.

And similarly:

Mozilla/5.0 (compatible; Please Name Your robot; +http://192.168.1.33:23481/yioop/bot.php)

our winner for June 2017. (Disclaimer: It’s no relation to the real Yioopbot, currently on hiatus; they just lifted its code.) On top of everything else, the 192.168 IP is in a private-use address range, meaning that you can’t get there from here.

Another entertaining one, first seen in December 2020, is:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/[WEBKIT_VERSION] (KHTML, like Gecko, Mediapartners-Google) Chrome/[CHROME_VERSION] Safari/[WEBKIT_VERSION]

and similarly

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x

A current favorite, active since May 2019, is:

Mozlila/5.0 (Linux; Android 7.0; SM-G892A Bulid/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/60.0.3112.107 Moblie Safari/537.36

In case the three separate typos aren’t enough to ensure that the door will be resoundingly slammed in their face, this one tends to come with the referer

www.google.com

and-that’s-all.
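Any one of the three typos, or that bare scheme-less referer, is enough to trip a rule like this sketch:

# "Mozlila", "Bulid", "Moblie": each misspelling gives the game away,
# as does a referer consisting of nothing but a naked hostname.
RewriteCond %{HTTP_USER_AGENT} (Mozlila|Bulid|Moblie) [NC,OR]
RewriteCond %{HTTP_REFERER} ^www\.google\.com$ [NC]
RewriteRule .* - [F]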

All these must have failed to read the Your New Robot instructions carefully enough. But the most impressive has to be:

/?Connection=keep-alive&User-Agent=Mozilla%2F5.0+%28Linux%3B+U%3B+Android+4.0.3%3B+zh-cn%3B+M032+Build%2FIML74K%29+AppleWebKit%2F534.30+%28KHTML%2C+like+Gecko%29+Version%2F4.0+Mobile+Safari%2F534.30

This technically doesn’t qualify . . . because that wasn’t the User-Agent, it was the request. Clearly someone hit the wrong button. The stated User-Agent, in case anyone wondered, was the tried-and-true python-requests/2.14.2. The percent signs represent various punctuation—slash, semicolon, parentheses—while the plusses would be spaces: Mozilla/5.0 (Linux; U; Android and so on.