At Home with the Robots

Miscellaneous Robots

When deciding whether to let a robot come in, some sites use a criterion of “Will this robot benefit me?” But I think this is short-sighted. Maybe it won’t benefit you (or your site) personally, but maybe it will benefit someone else engaged in a legitimate activity. And then in turn I might derive benefit from robots that visit other people’s sites. Scratching someone else’s back doesn’t directly and immediately help you; it’s the exchange that’s to everyone’s advantage.

A handful of robots on this list—notably TIA’s “special_archiver” and SafeDNSBot—still use the HTTP/1.0 protocol exclusively. What will they do when the Internet moves up to HTTP/2?

Social Media

Facebook

IPv4: 31.13.64-127 (Ireland), 66.220.144-159, 69.63.128-191, 69.171.224-255, 173.252.64-127
IPv6: 2a03:2880::/29

facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

facebookexternalhit/1.1

cortex/1.0

adreview/1.0

 

The blank space at the bottom of the User Agent list is not a mistake. For a couple of years, from June 2017 to May 2019, Facebook occasionally came in with no User-Agent at all, recognizable only by its IP address. This required some hole-poking, so I hope they have abandoned the idea for good.
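Recognizing a visitor by address alone comes down to a range check. Here is a minimal Python sketch, assuming the shorthand used on this page, where “31.13.64-127” means everything from 31.13.64.0 through 31.13.127.255. The function names are made up for illustration; this is not the rule actually running on my server.

```python
import ipaddress

# Ranges copied from the Facebook listing on this page.
FACEBOOK_V4 = ["31.13.64-127", "66.220.144-159", "69.63.128-191",
               "69.171.224-255", "173.252.64-127"]
FACEBOOK_V6 = [ipaddress.ip_network("2a03:2880::/29")]


def shorthand_to_networks(shorthand):
    """Expand e.g. '31.13.64-127' into the CIDR blocks it covers."""
    low, high = [], []
    for part in shorthand.split("."):
        first, _, last = part.partition("-")
        low.append(first)
        high.append(last or first)
    # Octets left off the end mean "the whole range".
    while len(low) < 4:
        low.append("0")
        high.append("255")
    start = ipaddress.IPv4Address(".".join(low))
    end = ipaddress.IPv4Address(".".join(high))
    return list(ipaddress.summarize_address_range(start, end))


FACEBOOK_NETWORKS = FACEBOOK_V6 + [
    net for item in FACEBOOK_V4 for net in shorthand_to_networks(item)
]


def is_facebook_address(address):
    """True if the address falls inside any of the listed ranges."""
    ip = ipaddress.ip_address(address)
    # An IPv4 address is never "in" an IPv6 network, and vice versa.
    return any(ip in net for net in FACEBOOK_NETWORKS)
```

The same helper works for the shorter forms used elsewhere on the page, such as “54.236.1” or “62.210”, since missing octets are treated as the full 0-255 span.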

The shorter “facebookexternalhit” version made its first appearance in the latter part of 2015. It often picked up isolated pages without images. I don’t know what this translates to in the FB user experience. Whatever it was, it now seems to be retired; I last saw it in October 2018.

The minimalist cortex/1.0 is a comparatively recent arrival, first seen in February 2019. The equally concise adreview/1.0 is still newer; I first saw it in June 2019. The formerly common visionutils hasn’t been around since April 2016.

Protocol: mixed

When my server became HTTP/2-compatible, Facebook immediately started making HTTP/2.0 requests—but only some of the time. Currently (summer 2021) it shows a 5:1 preference for HTTP/1.1.
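A ratio like that can be pulled straight from the access logs. The sketch below assumes the common “combined” log format, with the protocol as the last token of the quoted request line; the regular expression and the function name are mine, and your log layout may differ.

```python
import re
from collections import Counter

# Request line, status, size, referer and user-agent of a combined-format entry.
LOG_LINE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>HTTP/[\d.]+)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def protocol_counts(lines, agent_fragment):
    """Count requests per protocol for user-agents containing agent_fragment."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and agent_fragment in match["agent"]:
            counts[match["protocol"]] += 1
    return counts
```

Calling protocol_counts(open_log_file, "facebookexternalhit") gives one tally per protocol; dividing the HTTP/1.1 count by the HTTP/2.0 count gives the preference.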

Facebook’s behavior changes every couple of years. Currently new links start with a page request, followed by all image files belonging to the page, giving the page as referer. The human user then presumably selects one image, which will get re-requested every time your original visitor’s friends view the page that linked to you. These image-only requests come in with no referer. (Pro tip: Do your pages include a non-visible image file, such as piwik’s noscript dot? I rewrite this for facebook to a little banner giving the site name. So if a page happens not to have any good pictures—some ebooks don’t—there’s still something for the human user to select.)
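The substitution itself is simple enough to sketch in a few lines of Python. This is not the actual rule in use on this site: both file paths are placeholders, and matching on “facebookexternalhit” in the User-Agent is an assumption.

```python
def path_to_serve(requested_path, user_agent,
                  hidden_image="/piwik/noscript.gif",      # placeholder path
                  banner="/images/site-banner.png"):       # placeholder path
    """Swap the invisible tracking image for a banner, for Facebook only."""
    if requested_path == hidden_image and "facebookexternalhit" in user_agent:
        return banner
    # Everyone else, and every other file, is served unchanged.
    return requested_path
```

On a real server this would more likely live in the rewrite rules than in application code, but the condition is the same: one specific path, one specific visitor.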

Facebook, incidentally, seems to be responsible for those /ebooks/kleinschmidt without-final-slash requests discussed under the Bingbot Evil Twin. All it takes is one person to “like” a page and misspell its URL.

Twitter

IP: 199.16.156-159, 199.59.148-151

Twitterbot/1.0

Uncharacteristically for a social-media-based robot, the Twitterbot asks for and obeys robots.txt. And, like the Googlebot, it never forgets. One recent month’s requests included a URL that I remember seeing on my old site’s Redirect lists, meaning that they first learned about it no later than December 2013.

Pinterest

IP: 54.236.1

Mozilla/5.0 (compatible; Pinterestbot/1.0; +http://www.pinterest.com/bot.html)

Like Twitter, the Pinterestbot honors robots.txt. Generally it requests pages and the occasional icon. What it does with them is anyone’s guess; it doesn’t seem to have anything to do with the quasi-hotlinking that used to garner so much dislike.

. . . and a Faker

In the category of Most Unlikely User-Agent:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0

This robot is supposed to be associated with the “iOS Messages” app. Since I don’t personally use this, or know anyone who does, I haven’t been able to investigate further.

Requests one page plus apple-touch-icon-precomposed.png (a file that happens to exist, so it need not plow through the whole list of potential apple-touch-icons). The IP varies, but it’s always the identical UA string from beginning to end. I’ve been seeing it sporadically for years.

Further Afield

Other law-abiding robots, in alphabetical order. Many of them don’t have a home IP; instead they are distributed, generally among the usual suspects of 3, 18, 34, 35, 52, 54 and so on.

Adsbot

IP: 173.231.60.198

Mozilla/5.0 (compatible; Adsbot/3.1; +https://seostar.co/robot/)

Here’s something you don’t see every day: an authorized robot that is, nevertheless, blocked three times out of four. Why? Because it consistently lies about its referer, claiming to have been sent by the root when requesting a deep-interior page. Sorry, Adsbot, but I’m not going to poke a hole just because you don’t understand the difference between the Referer: and Origin: headers. Not that the latter would be any more truthful in the case of a robot.
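One way to express that kind of referer test, as a hedged Python sketch: the request is suspect when the referer claims the front page but the front page has no link to the requested path. The set of front-page links and the function name are hypothetical, and this is only an illustration of the idea, not the rule that actually does the blocking here.

```python
from urllib.parse import urlsplit


def referer_is_implausible(path, referer, site_host, root_links):
    """True if the referer is the site root but the root does not link to path."""
    if not referer or referer == "-":
        return False  # no referer claimed, so nothing to disbelieve
    ref = urlsplit(referer)
    claims_root = ref.hostname == site_host and ref.path in ("", "/")
    return claims_root and path not in root_links
```

A request for /ebooks/title/page.html with a referer of https://example.com/ would be flagged if the front page only links to /ebooks/ and a handful of top-level pages.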

AhrefsBot

IP: distributed, but especially 54.36.148-150

Mozilla/5.0 (compatible; AhrefsBot/2.0; +http://ahrefs.com/robot/)

. . .

Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)

Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)

Protocol: HTTP/2.0

Worth noting: When my server became HTTP/2-compatible, Ahrefs was the first ordinary crawler to make HTTP/2 requests. (Facebook also uses HTTP/2, but I don’t consider it a crawler.) Just like the bingbot, the AhrefsBot continues to use HTTP/1.1—but only for robots.txt.

Although the robot overall seems to be distributed, almost all robots.txt requests come from just two IPs, 54.36.150.49 and 151.80.39.207. This may explain why their website says that robots.txt changes can take up to a week to be recognized. In spite of this, they’ve never asked for anything in a roboted-out directory. Requests are mostly pages, with a few seemingly random images mixed in.

All User-Agent strings are identical except the version number, which changes periodically. The earliest version I find is 2.0, in use through early 2012. (There was probably a 1.x, but this predates my active logging.) 3.1 took over in the latter part of 2012, but didn’t last long:

There doesn’t seem to have been a 6.0; I don’t know what the story is. Unlike the earlier version changes, 6.1 and 7.0 overlapped for about half a year.

Amazonbot

IP: 3.224.220.101, 52.70.240.171

These are not particularly savory neighborhoods, what with the ongoing AWS association, but where else would Amazon crawl from?

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)

This robot first showed itself in April 2021. A while back I tested for robots.txt compliance—and then promptly forgot all about it. Watch for an update, as I’ve yet to see how it behaves when authorized.

Applebot

IP: 17.58.101.57, 17.58.96-103

In the past they have come from other addresses in 17—the whole thing still belongs to Apple—but 17.58.101.57 is by far their current favorite. Addresses outside of 17.58.96.0/21 are extremely rare.

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)

Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4 (Applebot/0.1; +http://www.apple.com/go/applebot)

The iPhone version was always rare, and must now have been retired. I haven’t set eyes on it since mid-2017.

For a spell in the middle of 2019 the Applebot had a real problem with filepaths. Starting abruptly in March, there was a flurry of requests for /ebooks/pagename.html where the correct URL is /ebooks/title/pagename.html. One of those titles was Chapman’s Iliad, which I know they have no trouble locating, because they requested one of its pages several thousand times—really—during a few months in mid-2018. Fortunately, they seem to have got a grip on the filepath issue around September-October of 2019.

That leaves directory slashes. As it happens, I have a lot of URLs in the form /ebooks/title/ and-that’s-all. If it’s a long book, the directory will contain further files; most of them don’t. (In case anyone wondered: It is consistently /ebooks/title/ rather than /ebooks/title.html because each directory also contains that book’s illustrations.)

Somewhere along the line, the Applebot decided these real, physical directories are extensionless URLs in disguise, and will persistently request /ebooks/title without that final slash. The server redirects them to the right place, but it’s still a superfluous request to deal with. At one time, up to half of all initial requests—meaning a third of all requests when you include the redirected correct form—left off that final / slash. Looking from the other side: fully three-quarters of all slashless requests come from the Applebot. This strikes me as excessive.
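The step from “half of initial requests” to “a third of all requests” is plain arithmetic: every slashless request earns a redirect, and each redirect produces one extra request for the correct URL. A quick check in Python, assuming exactly one follow-up per redirect (the function name is mine):

```python
from fractions import Fraction


def slashless_share_of_all(share_of_initial):
    """Share of ALL requests that are slashless, given their share of
    initial requests, when each one is redirected and then re-requested."""
    initial = Fraction(1)
    slashless = initial * share_of_initial
    total = initial + slashless  # one follow-up request per redirect
    return slashless / total


print(slashless_share_of_all(Fraction(1, 2)))
```

With half the initial requests missing their slash, the total grows to one and a half times the initial count, and the slashless half-share becomes one third of that.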

If anyone has ever figured out what the Applebot does, they haven’t shared the knowledge with me. Some webmasters may remember

“If robots instructions don’t mention Applebot but do mention Googlebot, the Apple robot will follow Googlebot instructions.”

The Applebot is not the only robot to adhere to this quaint misapprehension.
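To see what the quoted rule means in practice, here is a small Python sketch built on the standard library’s urllib.robotparser. It only imitates the behavior described in the quotation; the helper names are invented, and nothing here comes from Apple’s own code.

```python
import urllib.robotparser


def named_agents(robots_lines):
    """Lower-cased names from every User-agent line."""
    names = set()
    for line in robots_lines:
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            names.add(value.split("#")[0].strip().lower())
    return names


def applebot_may_fetch(robots_lines, url):
    """Apply the quoted rule: with no Applebot group, borrow Googlebot's."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    names = named_agents(robots_lines)
    if "applebot" not in names and "googlebot" in names:
        agent = "Googlebot"
    else:
        agent = "Applebot"
    return parser.can_fetch(agent, url)
```

Given a file that disallows /private/ for Googlebot and allows everything for “*”, a robot reading robots.txt in the usual way would conclude that Applebot may fetch /private/page.html, while applebot_may_fetch says it may not.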

archive.org_bot

Some webmasters loathe archivers; I approve of them. I don’t have much choice, since there are references on this very site that survive only because of archived copies.

IP: 207.241.229-233, 37.187.150
IPv6: 2607:f298:5:105b:

It’s really 224-239 (the full /20), but these are the only addresses I’ve seen in the last few years. It may be random, or they may use the rest of the range for non-crawling purposes. The 37.187 range hasn’t been around since February 2019.

Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot)

Mozilla/5.0 (compatible; special_archiver/3.1.1 +http://www.archive.org/details/archive.org_bot)

The special_archiver is comparatively rare; it only gets supporting files. Both user-agents are rare in continuing to use HTTP/1.0; the first HTTP/1.1 requests I find are from as recently as July 2021.

There was formerly a Wayback Machine user-agent:

Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; http://archive.org/details/archive.org_bot)

but it seems to have folded up its tents and gone home at the end of 2018.

AwarioBot

IP: 94.130.237, especially 94.130.237.100;
116.202.246; 136.243.70

The robot’s web page says it doesn’t crawl from a fixed IP, but since late January 2023 I have never seen it from anything but 94.130.237.100.

Mozilla/5.0 (compatible; AwarioBot/1.0; +https://awario.com/bots.html)

AwarioSmartBot/1.0 (+https://awario.com/bots.html; bots@awario.com)

Although the two robots are sent by the same entity, their behavior is different. AwarioSmartBot first showed up in late 2019, and only on this site. On any given visit it asked for either robots.txt or pages, never both; as a result it spent several years in the “out of sight, out of mind” category. This UA last appeared in April 2023. If it had been the only Awario robot I ever saw, you would have found it on the “Unwanted Robots” page.

AwarioBot, by that name, first appeared in January 2023, overlapping SmartBot. Unlike the SmartBot, each individual visit is introduced with a robots.txt request. This eventually got it authorized after it had demonstrated compliance.
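Checking whether every visit opens with robots.txt is easy to automate. A sketch in Python, assuming the requests for one user-agent are already in time order and that a pause of more than half an hour marks a new visit; the threshold and the function names are my own choices, and the sample data is invented.

```python
from datetime import datetime, timedelta


def split_into_visits(requests, gap=timedelta(minutes=30)):
    """Group time-ordered (timestamp, path) pairs into separate visits."""
    visits = []
    previous = None
    for when, path in requests:
        if previous is None or when - previous > gap:
            visits.append([])  # long pause: start a new visit
        visits[-1].append(path)
        previous = when
    return visits


def every_visit_opens_with_robots(requests):
    """True if the first request of each visit is /robots.txt."""
    return all(v[0] == "/robots.txt" for v in split_into_visits(requests))


# Invented sample: two visits, three hours apart.
start = datetime(2023, 1, 20, 10, 0)
sample = [
    (start, "/robots.txt"),
    (start + timedelta(seconds=5), "/ebooks/title/"),
    (start + timedelta(hours=3), "/robots.txt"),
    (start + timedelta(hours=3, seconds=9), "/index.html"),
]
print(every_visit_opens_with_robots(sample))
```

A robot that asks for either robots.txt or pages on a given visit, never both, would fail this test on every page visit.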

There is also an AwarioRssBot, but I’ve never met it.

Barkrowler

IP: 62.210

Barkrowler/0.7 (+http://www.exensa.com/crawl)

Barkrowler/0.9 (+http://www.exensa.com/crawl)

Phase One (through mid-2019): Barkrowler graduated from 0.7 to 0.9—apparently skipping 0.8—in mid-January 2019. When it visits, it is a long visit, picking up everything it can lawfully get. Any one visit sticks with the same IP from beginning to end, though it may come back from a different IP later the same day. This version disappeared in July 2019.

Barkrowler/0.9 (+https://babbar.tech/crawler)

Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)

Phase Two (from 2020): After a long absence, Barkrowler reappeared in January 2020, now with “babbar.tech” instead of “exensa.com” in its UA. The final form, beginning with Mozilla, took over in June 2020. Its 2020 reappearance also marks a slight change in behavior: requests can now be anywhere from a single page to several dozen, always beginning with robots.txt.

At the moment, I am not absolutely certain these are the same robot. Thanks to that six-month absence, I can’t even compare headers.

BLEXBot

IP: distributed, especially 46.4 and 94.130

In years past, it consistently crawled from 148.251.244.204. Currently it ranges all over the map.

Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)

“BLEXBot assists internet marketers to get information on the link structure”

Crawling behavior is unpredictable, anywhere from 3 pages to over a hundred. It is possible the different IPs are actually different robots with different behaviors, but I’ve never looked that closely.

In any case, I am not absolutely certain it is wise to put the words “BLEXBot” and “link structure” into the same sentence. At one point, about one-sixth of their requests were 404s caused by appending other people’s URLs to my paths, a pervasive problem that others have noticed too. (The “one-sixth” figure surprised me. I would have guessed closer to 95%.) It looks as if the problem arose in December 2016 and got progressively worse. Then, in early May 2017, it stopped as suddenly as it had started. Whew.

At the moment, they’re busily updating their database: crawling old HTTP URLs, getting redirected to HTTPS, and re-crawling. It will not take them long.

CCBot

CCBot/2.0 (http://commoncrawl.org/faq/)

May be following outside links, though they did once look at the sitemap. Their web page says they’re Nutch-based, which may explain their compliance with robots.txt—not something you see every day from their favorite hangout, the 54 neighborhood. They were slower than some to adopt HTTP/1.1, but finally changed over in mid-2018.

DotBot

IP: 216.244.66.243 and .248

This is one of those robots that likes to pick an IP and stick with it. In years past it was 216.244.66.229; currently it toggles between .243 and .248.

Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)

Psst! DotBot! It’s tidier when the URL in your UA string doesn’t redirect—especially to an entirely different domain. (Admittedly, they are doing better than the Googlebot, which redirects to a 404.)

The DotBot has a number of distinguishing features:

First, its extraordinary appetite for robots.txt. In the last year or two it has even surpassed the former record-holder, bing, by a factor of three or more.

Second, it spends an inordinate amount of its crawl budget asking for directories without final / (see notes on the Applebot, above). It’s as if it is so accustomed to extensionless URLs, it simply assumes that is the correct form and the / slash sneaked in by mistake. Set aside the Applebot, and two-thirds of all slashless requests are from the DotBot.

Third, it refuses to default to HTTPS. It is almost the only robot that will request pages at HTTP even if the page was only created after this site moved to HTTPS.
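Putting the slashless-request figures for the Applebot and the DotBot together gives a rough division of the whole pile. A quick check in Python, assuming the two fractions describe the same period, which the text does not guarantee; the function name is mine.

```python
from fractions import Fraction


def slashless_shares(applebot=Fraction(3, 4), dotbot_of_rest=Fraction(2, 3)):
    """Split all slashless requests among Applebot, DotBot and the rest."""
    dotbot = (1 - applebot) * dotbot_of_rest  # DotBot's share of the total
    others = 1 - applebot - dotbot
    return {"Applebot": applebot, "DotBot": dotbot, "everyone else": others}


for name, share in slashless_shares().items():
    print(name, share)
```

On those assumptions, the DotBot accounts for one-sixth of all slashless requests, and every other visitor combined for one-twelfth.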

Protocol: HTTP/2.0

In spite of its resistance to HTTPS, DotBot is one of the very few crawlers to use HTTP/2.0.

fluid

IP: 194.93.0.40

Mozilla/4.0 (compatible; fluid/0.0; +http://www.leak.info/bot.html)

The wording of the info page strongly suggests that the robot comes from a country whose language does not use definite articles. Say, for example, Russian; information is scanty, but the server lives in Moscow.

Protocol: HTTP/1.0

This robot has been coming around since July 2021, and has never requested anything but robots.txt. Given the HTTP/1.0 protocol, it is possible it wouldn’t be able to receive pages—but since it has never made a redirected request, how would it know?

GrapeshotCrawler

IP: 148.64.56

Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +http://www.grapeshot.co.uk/crawler.php)

Generally picks up some random ebook. It’s following links from somewhere, but not any of the RSS feeds or “new” links that would put it in the Targeted Robots class. A further behavioral quirk is that, as recently as August 2023, it was still starting every request at HTTP, whether or not the page ever existed in that form.

This robot was absent for so long—from May 2019—that I moved it to the “past robots” page. And then in July-August 2020 it rematerialized, only to disappear again until April 2022. Make up your mind, willya?

ia_archiver

IP: various

Over the years, this robot has used assorted IPs from assorted Usual Suspects ranges. This makes it well-nigh impossible to tell if it is the real thing, which is admissible, or an impostor, which isn’t.

ia_archiver

Last seen: January 2022

Like the archive.org_bot, further up the page, this clearly goes in the YMMV category.

This robot was slow to move up to HTTP/1.1; it finally started in July 2020, with sporadic 1.0 requests through September.

INETDEX-BOT

IP: 168.119.91.226 (at first); 51.75.241-245 (later)

INETDEX-BOT/1.5 (Mozilla/5.0; https://inetdex.com/; info at inetdex dot com)

This is a fairly recent arrival, first seen in February 2022. It has humanoid headers, and appears to be fully robots.txt compliant, so I’ve never bothered to look into it more deeply.

But wait, there’s more. After a few months they must have decided to let the interns run wild. In addition to the original User-Agent, which is still the most common, I have met:

INETDEX-BOT/1.5 (Mozilla/5.0 (Crawling for WebSearchEngine.org); https://inetdex.com/; info at inetdex dot com)

INETDEX-BOT/1.5 (Mozilla/5.0 (compatible; indexing for WebSearchEngine.org); https://inetdex.com/; info at inetdex dot com)

INETDEX-BOT/1.5 (Mozilla/5.0; Search Engine Bot; https://inetdex.com/; info at inetdex dot com)

INETDEX-BOT/1.5 (Mozilla/5.0 (Search Engine Bot); https://inetdex.com/; info at inetdex dot com)

INETDEX-BOT/1.5 (Mozilla/5.0; https://inetdex.com/bot.html; info at inetdex dot com)

INETDEX-BOT/1.5 (Mozilla/5.0; https://inetdex.com/bot.html)

The first two showed up one day in May 2022, never to be seen again. The third came near the end of May. The fourth (note the added parentheses) popped up in early June. The fifth, identical to the original UA except for the added /bot.html, appeared in October, to be replaced a few days later by the sixth. All in 2022. We’ll see if they eventually settle on one name.
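With this many variations, any rule that recognizes the robot has to key on the part that stays put. A Python sketch, assuming the leading “INETDEX-BOT/version (” and the closing parenthesis are the stable parts; the pattern and function name are my own.

```python
import re

# Only the product token and the outer parentheses are treated as fixed.
INETDEX = re.compile(r"^INETDEX-BOT/(?P<version>[\d.]+) \((?P<detail>.*)\)$")

# The seven forms listed on this page.
SEEN = [
    "INETDEX-BOT/1.5 (Mozilla/5.0; https://inetdex.com/; info at inetdex dot com)",
    "INETDEX-BOT/1.5 (Mozilla/5.0 (Crawling for WebSearchEngine.org); https://inetdex.com/; info at inetdex dot com)",
    "INETDEX-BOT/1.5 (Mozilla/5.0 (compatible; indexing for WebSearchEngine.org); https://inetdex.com/; info at inetdex dot com)",
    "INETDEX-BOT/1.5 (Mozilla/5.0; Search Engine Bot; https://inetdex.com/; info at inetdex dot com)",
    "INETDEX-BOT/1.5 (Mozilla/5.0 (Search Engine Bot); https://inetdex.com/; info at inetdex dot com)",
    "INETDEX-BOT/1.5 (Mozilla/5.0; https://inetdex.com/bot.html; info at inetdex dot com)",
    "INETDEX-BOT/1.5 (Mozilla/5.0; https://inetdex.com/bot.html)",
]


def inetdex_version(user_agent):
    """Return the version string if this looks like INETDEX-BOT, else None."""
    match = INETDEX.match(user_agent)
    return match["version"] if match else None
```

All seven forms come back as version “1.5”, so a single rule covers the lot, whatever the interns come up with inside the parentheses.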

InternetArchiveBot

IP: 185.15.56.22

IABot/2.0 (+https://meta.wikimedia.org/wiki/InternetArchiveBot/FAQ_for_sysadmins) (Checking if link from Wikipedia is broken and needs removal)

Protocol: HTTP/2

Method: HEAD

In spite of its name, this robot appears to work for Wikipedia. I don’t see it very often, but it does give a solid hint about which pages on this site are linked from Wikipedia articles—including which ones were linked before the site went HTTPS, as it can take a while before things get updated. Since late 2020, it has been using HTTP/2.0.

The Knowledge AI

IP: 64.62.252, 66.160.140

The Knowledge AI

As far as I can make out, this robot simply doesn’t know how to get to HTTPS pages. On my personal site—which was secured years ago because that’s where my analytics live, and browsers tend to yap if you try to log in to a non-secure site—it racked up a steady stream of redirects on the HTTP side. Same on the present site when it went HTTPS a few years later. To date, The Knowledge AI has never made an HTTPS request . . . which means it has effectively barred itself from an increasing part of the internet.
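Finding robots of this kind is easy if the logs record which side a request arrived on. A sketch in Python, assuming each log entry has already been reduced to an (agent, scheme) pair; that layout and the function name are hypothetical.

```python
from collections import defaultdict


def http_only_agents(records):
    """List agents whose every request arrived over plain HTTP."""
    schemes = defaultdict(set)
    for agent, scheme in records:
        schemes[agent].add(scheme.lower())
    return sorted(agent for agent, seen in schemes.items() if seen == {"http"})
```

An agent that makes even one HTTPS request drops off the list, which separates robots that merely start at HTTP and follow the redirect from those that never arrive at all.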

MJ12bot

Is it a search engine? Is it something else? You decide. Either way, MJ12 is one of the best-known distributed crawlers. In addition to a wide array of IPv4 addresses, since the second half of 2019 I have also seen them from a variety of IPv6, representing at least three different ranges.

Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)

v1.4.8 was introduced in late February 2017. This robot has an unusually wide overlap in version numbers. Others seen on this site:

Although it has been using primarily HTTP/1.1 for years, there were sporadic HTTP/1.0 requests as recently as mid-2019.

MojeekBot

IP: 5.102.173.71

Mozilla/5.0 (compatible; MojeekBot/0.9; +https://www.mojeek.com/bot.html)

I’m told Mojeek is a search engine, but I’m darned if I can remember ever in my life seeing anyone they sent. The current version number was introduced around July 2020; before that it was 0.7, skipping 0.8. Earlier still, it was 0.6 for quite a while.

Neevabot

IP: 54.161.41.102, 100.26.127.17

Mozilla/5.0 (compatible; Neevabot/1.0; +https://neeva.com/neevabot)

Protocol: HTTP/2

This is a relative newcomer, first showing up in mid-2021. It claims to have plans for a search engine, so its description may eventually move over to that page. One sign of its youth and vigor is that all its requests, without exception, have come in as HTTPS using the HTTP/2.0 protocol.

On a typical visit, it only picks up three or four pages. If it receives a redirect, expect it to wait a day or so before asking for the updated URL.

SeekportBot

IP: various

Mozilla/5.0 (compatible; SeekportBot; +https://bot.seekport.com)

This close relative of the Seekport Crawler took over in June 2022. Thanks to the similarity in name and ownership, it took a while to realize that this one was going to be robots.txt compliant. The information page opens cheerfully with

Bot Type: Good (always identifies itself)
Obeys robots.txt: Yes

—which indeed appears to be true—and goes on to tell us it is associated with SISTRIX, a once-familiar name.

SemrushBot

IP: 46.229.168; 85.208.96; 213.174.146, .147, .152; 192.243.53

Mozilla/5.0 (compatible; SemrushBot/1.0~bm; +http://www.semrush.com/bot.html)

Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)

Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)

Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html)

Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)

Mozilla/5.0 (compatible; SemrushBot-BA; +http://www.semrush.com/bot.html)

Mozilla/5.0 (compatible; SemrushBot-SI/0.97; +http://www.semrush.com/bot.html)

The IP addresses I’ve listed are the ones I have personally seen. The overwhelming majority of requests have come from 46.229.168, with 213.174.146 a distant second, and the others still rarer. On paper, Semrush ranges are considerably wider: 46.229.160-175, 192.243.48-63, 213.174.128-159.

There are many, many variations of this robot; just to make things difficult, each one sends different headers, requiring extensive hole-poking. They have a number of other versions I’ve never set eyes on, including “SemrushBot-SA” and “ContentAnalyzerBot/1.0”.

Version 2~bl replaced the older 1.2~bl in June 2018, only to be replaced in its turn by 3~bl in mid-December of the same year, and then by 6~bl in June 2019. (They seem to have bypassed 4 and 5 entirely.) Version 1.0~bm is a parallel form, overlapping all the “bl”s; I’ve seen it as recently as November 2019.
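Tracking version changes like these means pulling the pieces out of the User-Agent string. A Python sketch based only on the forms listed above; the labels “variant”, “version” and “suffix” are my own names for the three parts, not Semrush’s.

```python
import re

# Optional "-XX" variant, optional "/number" version, optional "~xx" suffix.
SEMRUSH = re.compile(
    r"SemrushBot(?:-(?P<variant>[A-Za-z]+))?"
    r"(?:/(?P<version>[\d.]+))?"
    r"(?:~(?P<suffix>[a-z]+))?"
)


def parse_semrush(user_agent):
    """Return (variant, version, suffix), or None if it is not a SemrushBot."""
    match = SEMRUSH.search(user_agent)
    if match is None:
        return None
    return match.group("variant", "version", "suffix")
```

Parts that are absent come back as None, so “SemrushBot/6~bl” yields (None, "6", "bl") while “SemrushBot-BA” yields ("BA", None, None).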

This is an extremely active robot, visiting almost as many times as the bingbot, and more than the Googlebot. It would be very interesting to know what they’re looking for, since several requests were for obscure interior non-English-language pages that, to the best of my knowledge, are not linked from anywhere in the known universe.

SEOkicks

IP: 95.216.96.170, 138.201.30.66, 136.243.89.157

Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html)

I don’t think this robot has ever requested anything but robots.txt, where it is not disallowed. Maybe it is simply checking whether the site is accessible, and has intelligently decided that a successful robots.txt fetch will convey this information just as handily as any other file.

special_archiver

Described under archive.org_bot, above.

Where Are They Now?

In years past, some authorized robots were frequent enough to make their way onto my Ignore list. In some cases, it turns out there hasn’t been anything to ignore in ages. If I haven’t set eyes on them in many months, they’re listed here. If it’s been over a year, they will be moved to the Former Robots page.