At Home with the Robots

Targeted Robots

Back in 2015 I started submitting to the surpassingly useful Online Books Page. This is a curated directory (meaning that everything on it is reviewed by a human) which includes both an RSS feed and a “New Listings” page. Ebook-seeking robots tend to be truthful about where they learned about the page they’re requesting, even if they didn’t literally click on a link to get there, the way a human would.

These robots’ normal pattern is to request robots.txt (the compliant ones, that is) and then a single copy of the latest page. Rarely they will then engage in other behavior, but the directory listing is always the trigger.

IP: Where I don’t give an address, it means the robot is spread across The Usual Suspects: 3, 18, 34, 35, 52, 54, et cetera, et cetera, you know the drill. AWS, Google Cloud, assorted other big server ranges. The robots in question may be distributed, or they may simply move around a lot.

Referer: Throughout this page, “RSS” means the Online Books Page’s RSS feed. There is also “new.html”, but this is more often used by humans.

Last seen: When I give a “last seen” date, it means that the robot hasn’t shown itself lately, but I’ve not yet transferred it to the Past Robots page.

Okay by Me

Admittedly my standards are not exacting. Ask for robots.txt before you meet your first 403 (ahem, not several seconds after) and give some indication that you mean to follow it.
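The first half of that standard is easy to test mechanically. What follows is only a minimal sketch: it assumes each request in a visit has already been reduced to a time and a path, and the names Hit and asked_first are made up for illustration rather than taken from any real log tooling.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """One logged request (hypothetical shape, not a real log format)."""
    time: float  # seconds since the start of the visit
    path: str    # requested path, e.g. "/robots.txt"


def asked_first(hits: Iterable[Hit]) -> bool:
    """Return True if the earliest request in the visit was for robots.txt."""
    ordered = sorted(hits, key=lambda hit: hit.time)
    return bool(ordered) and ordered[0].path == "/robots.txt"
```

Whether the robot then means to follow what it read is a judgment call that no one-liner will make for you.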

Magpie Crawler

IP: 185.25.32, 185.25.35

magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net)

robots

magpie-crawler/1.1 (robots-txt-checker; +http://www.brandwatch.net)

Referer: RSS

The “robots” and-that’s-all user-agent was used for robots.txt requests until February 2021, at which time it was replaced by the more informative version. The IP changes periodically.

MBCrawler

IP: 174.129.1.66; 54.83.5.6

MBCrawler/1.0 (https://monitorbacklinks.com)

Last seen: September 2022

Any given visit will use either one IP or the other, at random. Behavioral quirk: Every request, including robots.txt, is preceded by a HEAD request for the same file. (This strikes me as more trouble than it’s worth.)
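A quirk like this can be confirmed from a list of requests in arrival order. The sketch below assumes each request is a (method, path) pair; the function name head_preceded is hypothetical.

```python
from typing import List, Optional, Tuple

Request = Tuple[str, str]  # (method, path), in arrival order


def head_preceded(requests: List[Request]) -> bool:
    """True if every non-HEAD request directly follows a HEAD for the same path."""
    previous: Optional[Request] = None
    saw_other = False
    for method, path in requests:
        if method != "HEAD":
            saw_other = True
            if previous != ("HEAD", path):
                return False
        previous = (method, path)
    # A visit with no GETs at all does not count as showing the pattern.
    return saw_other
```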

omgili

IP: 62.90.131.202, 82.166.195.66

The 62.90 IP dates only from December 2018. Before that, it was always 82.166.195.65 (not 66).

omgili/0.5 +http://omgili.com

Sad but true: The name is short for “oh my god I love it”. And there’s not a thing we can do about it. Tends to make multiple pairs of requests over several days.

trendiction

IP: 144.76

Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.0; trendictionbot0.5.0; trendiction search; http://www.trendiction.de/bot; please let us know of any problems; web at trendiction.com) Gecko/20071127 Firefox/50.0

Referer: as if human

Until June 2018, the UA ended in Firefox/3.0.0.11 (really) but wiser counsels must have prevailed.

Distinctive behavior: After getting robots.txt, new page, and the first 11 (eleven) images belonging to that page, it then gets the site’s front page (https://fiftywordsforsnow.com/) plus all authorized pages linked from it, except the page that started its visit. The initial page request has a truthful referer; image requests name the relevant page as referer; the subsequent spider-type requests have no referer.

Moving to HTTPS revealed another behavioral quirk. Its initial new-page request is always HTTPS, because it’s following a link. But the follow-up requests, beginning with the front page, start as HTTP. Worse, it gives the originally requested page as referer for the correct form—a behavior that flatly contradicts rules about how a referer is supposed to work.

umBot

IP: 94.130.67.180, 138.201.248.12

These IPs have been used interchangeably since early 2020. There have been others in the past.

Mozilla/5.0 (compatible; um-FC/1.0; mailto: techinfo@ubermetrics-technologies.com)

This first UA was used consistently through 2018 with a succession of IPs. Requests followed a steady pattern of robots.txt (where it was disallowed) plus one page (blocked).

With the beginning of 2019 came an abrupt change in both UA and behavior. The first of the two paired UAs—“um-FC” with Firefox 40.1—was used for GET requests; the second—“um-LN” but otherwise identical—for HEAD. Most requests were for robots.txt, except a few HEAD requests for pages.

Mozilla/5.0 (compatible; um-FC/1.0; mailto: techinfo@ubermetrics-technologies.com; Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1

Mozilla/5.0 (compatible; um-LN/1.0; mailto: techinfo@ubermetrics-technologies.com; Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1

Then, early in 2020, the robot seems to have come under new management, abandoning “um-LN” and no longer requesting anything but robots.txt. So I removed the Disallow, poked some holes . . . and sat back and watched as it continued to request nothing but robots.txt, as if it had entirely forgotten what it originally came for.

Little did I suspect that, beginning in late 2021—more than a year after I authorized it—“um-FC” had been joined by

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36

Note the complete absence of anything like “um” or “ubermetrics”, which would have triggered the hole-poking that lets selected robots get past access controls. Since this new User-Agent is used only for page requests, while robots.txt remains the province of “um-FC”, it took me until summer 2022 before I realized that the robot had been asking for pages (one per visit, as in years past), each duly preceded by robots.txt.
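To see why the disguised User-Agent falls straight through, here is a rough sketch of name-based hole-poking. It is not the actual access-control setup, which the text does not show; the token list and the function name gets_through are assumptions made for illustration.

```python
# Assumed tokens: the text only says "anything like um or ubermetrics".
ALLOWED_TOKENS = ("um-fc", "um-ln", "ubermetrics")


def gets_through(user_agent: str) -> bool:
    """True if the User-Agent names itself with one of the allowed tokens."""
    ua = user_agent.lower()
    return any(token in ua for token in ALLOWED_TOKENS)
```

Under these assumptions the “um-FC” string passes, while the Chrome/91 string above contains none of the tokens and gets the same treatment as any other unannounced visitor.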

I see no reason why I should modify my access-control rules for a robot that goes around under a false name. Instead I fired off a snippy email to the contact address.

There is a lesson in all this, but I’m not certain what the lesson is, or who is to learn it. Watch This Space.

yacybot

The IP is almost never the same from one visit to the next.

yacybot (/global; amd64 {variable-part-here}) http://yacy.net/bot.html

What I’ve given as {variable-part-here} can be absolutely anything. The most recent, for example, was

yacybot (/global; amd64 Linux 4.4.74-minimal; java 1.8.0; Europe/en) http://yacy.net/bot.html
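Since only the middle of the string varies, one pattern can cover every variant. This is a sketch that assumes the fixed prefix and suffix are exactly as shown above; the names YACY and yacy_variable_part are illustrative.

```python
import re
from typing import Optional

# Fixed prefix and suffix taken from the UA template above;
# whatever sits between them is captured as "variable".
YACY = re.compile(
    r"^yacybot \(/global; amd64 (?P<variable>.*)\) http://yacy\.net/bot\.html$"
)


def yacy_variable_part(user_agent: str) -> Optional[str]:
    """Return the variable middle of a yacybot UA, or None if it is not one."""
    match = YACY.match(user_agent)
    return match.group("variable") if match else None
```

For the most recent example this returns "Linux 4.4.74-minimal; java 1.8.0; Europe/en".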

Referer: often but not always “new”; there are many others

This robot comes and goes. At one point it was gone for over a year, before returning in June 2019.

No Thanks

And then there are the robots that don’t meet my admittedly lax criteria, most of the time because They Didn’t Even Ask. The astute reader will notice that this list is longer than the preceding one.

In some cases I had to consult logs to see what certain robots have been up to, because after a time I get tired of tracking blocked requests and just ignore them. Like the man said, unwanted robots ye shall always have with you. But sometimes they do get bored and go away.

AppEngine

IP: 107.178.194-195

AppEngine-Google; (+http://code.google.com/appengine; appid: s~feedly-social)

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7 AppEngine-Google; (+http://code.google.com/appengine; appid: s~feedly-social)

Feedly/1.0 AppEngine-Google; (+http://code.google.com/appengine; appid: s~feedly-nikon3)

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 AppEngine-Google; (+http://code.google.com/appengine; appid: s~feedly-nikon3)

Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US) AppEngine-Google; (+http://code.google.com/appengine; appid: s~virustotalcloud)

There are probably more User-Agents than this; since it is consistently blocked, I stopped paying attention back around 2017. It’s still active, though. Requests each new page 10-12 times and then goes away.

Feedspot

IP: 54.186.248.49

Sometimes it uses other IPs in the Usual Suspects range, but this is by far the most common.

Mozilla/5.0 (compatible; Feedspot/1.0 (+https://www.feedspot.com/fs/fetcher; like FeedFetcher-Google)

Since the beginning of 2020 its requests have come in pairs: HEAD immediately followed by GET for the same page. The current UA goes back to late 2018; before that it was

Mozilla/5.0 (compatible; Feedspotbot/1.0; +http://www.feedspot.com/fs/bot)

Could be worse. When it first showed up, in mid-2015, it claimed to be Firefox/2 (two), putting its own name in the Referer slot:

Referer: Feedspotbot: http://www.feedspot.com

Googlebot-Compatible

IP: various, but mostly 54; 23.21.191.239

Mozilla/5.0 (X11; U; Linux x86_64; de; rv:1.9.2.8) Googlebot-Compatible Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8

Last seen: May 2021

Most easily recognized by its pattern: a single HEAD request from 23.21 with the “lucid” UA, followed by two GET requests: one from some other IP with the same UA, and then one from that second IP with some vaguely humanoid UA, down to and including Firefox 4.

Grammarly

Grammarly/1.0 (http://www.grammarly.com)

Requests new pages up to half a dozen times, alternating with older pages it has met in the past. Whether new or old, it always requests the same page twice, a second or so apart. In spite of years of having the door slammed in its face, it remains extremely active, not just with recent ebooks but all over the site.
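The request-it-twice habit is simple to spot in a log. The sketch below assumes requests arrive as (time in seconds, path) pairs and treats “a second or so” as a 2-second window, which is a guess; doubled_requests is a hypothetical name.

```python
from typing import List, Tuple


def doubled_requests(hits: List[Tuple[float, str]], window: float = 2.0) -> List[str]:
    """Return paths requested twice in a row within `window` seconds."""
    ordered = sorted(hits)  # sort by time, then path
    doubles = []
    for (t1, p1), (t2, p2) in zip(ordered, ordered[1:]):
        if p1 == p2 and t2 - t1 <= window:
            doubles.append(p1)
    return doubles
```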

heritrix

IP: 104.various

Mozilla/5.0 (compatible; heritrix/3.3.0-SNAPSHOT-20150302-2206 +http://127.0.0.1)

It makes a show of good behavior by asking for robots.txt . . . which it proceeds to ignore.

Metadataparser

IP: mostly 52 and 54

metadataparser/1.1.0 (https://github.com/bloglovin/metadataparser)

Last seen: August 2021

Well, at least it only requests each page once.

Miniflux

IP: 116.203.93.224

Mozilla/5.0 (compatible; Miniflux/5f487e8; +https://miniflux.app)

A fairly recent arrival, first seen in July 2019.

Moreover

IP: 70.39.246.37

Mozilla/5.0 Moreover/5.1 (+http://www.moreover.com)

Denied in robots.txt, but rarely asks. Curiously, its robots.txt requests often come from some entirely unrelated IP—and many seconds after the (blocked) page request, which kinda defeats the purpose.

PaperLiBot

IP: 37.59.19; 37.187.162-167

Mozilla/5.0 (compatible; PaperLiBot/2.1; https://support.paper.li/entries/20023257-what-is-paper-li)

Mozilla/5.0 (compatible; PaperLiBot/2.1; https://support.paper.li/hc/en-us/articles/360006695637-PaperLiBot)

Like so many of us, it changed from http to https in its UA string a few years back. In April 2021 there was a further change, to the form involving /articles/. Before then, the web page said:

“Paper.li is a content curation service that let's you turn socially shared content into beautiful online newspapers and newsletters.”

It would, however, be a lie to say that I block them purely because of the grocer’s apo’strophe. The new version of the web page says, among other things,

PaperLiBot is the generic name of Paper.li's web crawler.
Paper.li is a content curation service which lets you turn socially shared content into beautiful online newspapers and newsletters.

(do not ask what they mean by “generic name”) and, much further down the page,

If you want to prevent PaperLiBot from crawling content on your site please get in touch with us

Or they could, y’know, teach their robot to read robots.txt. Honestly, botrunners, it isn’t that difficult.

Slackbot

Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)

Last seen: December 2021

Every now and then it takes a few months’ vacation, but it always comes back. After one of those vacations, from December 2018 to March 2019, it changed its pattern. Instead of trying for current ebooks, it now goes for assorted other pages on the site, including, appropriately enough, the “Malign Robots” page.

One recent visit tells me it has been trading notes with some other robot. Although it has never seen a redirect, it asked for the new URL /webs/robots/ rather than the former /fun/robots/.

Winds

Winds: Open Source RSS & Podcast app: https://getstream.io/winds/

Last seen: October 2021

In the relatively short time it has been around, it has used so many different IPs, I stopped keeping track. By now it has probably hit all the Usual Suspects. I would love to think the frequent changes are because it didn’t pay its bills.

YaK/linkfluence

IP: 54.39 (mostly)

Mozilla/5.0 (compatible; YaK/1.0; http://linkfluence.com/; bot@linkfluence.com)

A fairly recent arrival, first seen in November 2019. It asks for robots.txt periodically, but has not yet got around to reading it. Although I list it among Targeted Robots, and most of its requests are for the most recent ebook, it has also been known to request the sitemap.

Humanoids

Once upon a time, webmasters could check for a leading “Mozilla” in the UA string, and be confident that the visitor was human. A decade later, the Upgrade-Insecure-Requests header was a useful diagnostic. That was then. This is now.

Humanoid robots like to latch on to some particular browser-and-OS combo as their preferred disguise. After a while, when humans have moved on to a much later version, the UA can be safely flagged as robotic (my personal label is botnet_agent). And some of them fall into the “just for shits and giggles” category: MSIE 6? Really?

Recent fakers include:

Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19

Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36

By the time you read this, there may be a whole new group.
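The “humans have moved on” test can be roughed out for Chrome-style strings. This sketch is not the actual botnet_agent rule, which the text does not spell out; the cutoff oldest_plausible is something you would have to pick and keep updating yourself, and both function names are invented here.

```python
import re
from typing import Optional

CHROME = re.compile(r"Chrome/(\d+)\.")


def chrome_major(user_agent: str) -> Optional[int]:
    """Return the claimed Chrome major version, or None if there is none."""
    match = CHROME.search(user_agent)
    return int(match.group(1)) if match else None


def looks_stale(user_agent: str, oldest_plausible: int) -> bool:
    """True if the UA claims a Chrome version older than the chosen cutoff."""
    major = chrome_major(user_agent)
    return major is not None and major < oldest_plausible
```

A string with no Chrome token at all, such as the Trident one above, comes back False and needs some other rule.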