When deciding whether to let a robot come in, some sites use a criterion of “Will this robot benefit me?” But I think this is short-sighted. Maybe it won’t benefit you (or your site) personally—but maybe it will benefit someone else engaged in a legitimate activity. And then in turn I might derive benefit from robots that visit other people’s sites. Scratching someone else’s back doesn’t directly and immediately help you; it’s the exchange that’s to everyone’s advantage.
A handful of robots on this list—notably TIA’s “special_archiver” and SafeDNSBot—still use the HTTP/1.0 protocol exclusively. What will they do when the Internet moves up to HTTP/2?
Social Media
IPv4: 31.13.64-127 (Ireland), 66.220.144-159, 69.63.128-191, 69.171.224-255, 173.252.64-127
IPv6: 2a03:2880::/29
facebookexternalhit/1.1 (+http://www.facebook.com/
facebookexternalhit/1.1
cortex/1.0
adreview/1.0
The blank space at the bottom of the User Agent list is not a mistake. For a couple of years, from June 2017 to May 2019, Facebook occasionally came in with no User-Agent at all, recognizable only by its IP address. This required some hole-poking, so I hope they have abandoned the idea for good.
The shorter “facebookexternalhit” version made its first appearance in the latter part of 2015. It often picked up isolated pages without images. I don’t know what this translates to in the FB user experience. Whatever it was, it now seems to be retired; I last saw it in October 2018.
The minimalist cortex/1.0 is a comparatively recent arrival, first seen in February 2019. The equally concise adreview/1.0 is still newer; I first saw it in June 2019. The formerly common visionutils hasn’t been around since April 2016.
Protocol: mixed
When my server became HTTP/2-compatible, Facebook immediately started making HTTP/2.0 requests—but only some of the time. Gradually it shifted; currently (spring 2025) it shows a ten-to-one preference for HTTP/2.0.
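A ratio like that ten-to-one figure can be tallied straight from the access log. What follows is only a rough sketch: it assumes the log uses the common "combined" layout (quoted request line, with the quoted User-Agent as the last field), and the function name `protocol_counts` is made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: "METHOD path HTTP/x.y" ... "referer" "user-agent" at end of line,
# as in the widely used "combined" access-log format.
LOG_LINE = re.compile(
    r'"[A-Z]+ \S+ (?P<protocol>HTTP/[\d.]+)" .* "(?P<agent>[^"]*)"$'
)

def protocol_counts(log_lines, agent_substring):
    """Count requests per protocol for user-agents containing agent_substring."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line.strip())
        if match and agent_substring in match.group("agent"):
            counts[match.group("protocol")] += 1
    return counts
```

Calling `protocol_counts(lines, "facebookexternalhit")` returns a `Counter` keyed by strings such as `HTTP/2.0` and `HTTP/1.1`; dividing one count by the other gives the preference ratio.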
Facebook’s behavior changes every couple of years. For quite a while, new links started with a page request, followed by all image files belonging to the page, giving the page as referer. The human user then presumably selected one image, which would get re-requested every time your original visitor’s friends viewed the page that linked to you. These image-only requests came in with no referer. (Pro tip: Do your pages include a non-visible image file, such as piwik’s noscript dot? I rewrite this for facebook to a little banner giving the site name. So if a page happens not to have any good pictures—some ebooks don’t—there’s still something for the human user to select.)
Facebook, incidentally, seems to be responsible for those /ebooks/kleinschmidt without-final-slash requests discussed under the Bingbot Evil Twin. All it takes is one person to “like” a page and misspell its URL.
Update: Somewhere along the line, facebook started absolutely flooding my site with requests for everything under the sun, ignoring robots.txt if it even requested the file in the first place. I’ve been blocking it since, I think, 2023—and will continue to do so until it remembers what robots.txt is for.
IP: 199.16.156-159, 199.59.148-151
Twitterbot/1.0
Uncharacteristically for a social-media-based robot, the Twitterbot asks for and obeys robots.txt. And, like the Googlebot, it never forgets. One recent month’s requests included a URL that I remember seeing on my old site’s Redirect lists, meaning that they first learned about it no later than December 2013.
Perhaps unsurprisingly, I’ve seen fewer Twitter-related visits since they came under new ownership in 2022.
IP: 54.236.1
Mozilla/5.0 (compatible; Pinterestbot/1.0; +http://www.pinterest.com/bot.html)
Like Twitter, the Pinterestbot honors robots.txt. Generally it requests pages and the occasional icon. What it does with them is anyone’s guess; it doesn’t seem to have anything to do with the quasi-hotlinking that used to garner so much dislike.
. . . and a Faker
In the category of Most Unlikely User-Agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0
This robot is supposed to be associated with the “iOS Messages” app. Since I don’t personally use this, or know anyone who does, I haven’t been able to investigate further.
Requests one page plus apple-touch-icon-precomposed.png (a file that happens to exist, so it need not plow through the whole list of potential apple-touch-icons). The IP varies, but it’s always the identical UA string from beginning to end. I’ve been seeing it sporadically for years.
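One way to spot strings like this in a log is to look for a browser-style UA that also names more than one crawler. The sketch below is an illustration only: the token lists are taken from the user-agent strings quoted in this section, the function names are hypothetical, and a match is a reason to look closer, not proof of anything.

```python
# Crawler tokens that appear in the user-agent strings quoted in this section.
CRAWLER_TOKENS = ("facebookexternalhit", "Facebot", "Twitterbot", "Pinterestbot")
# Tokens that a Safari-style browser UA normally carries.
BROWSER_TOKENS = ("AppleWebKit", "Safari")

def claimed_crawlers(user_agent):
    """Return the crawler tokens named in the UA (case-insensitive match)."""
    lowered = user_agent.lower()
    return [token for token in CRAWLER_TOKENS if token.lower() in lowered]

def is_mixed_identity(user_agent):
    """True when a browser-style UA also claims to be two or more crawlers."""
    has_browser = all(token in user_agent for token in BROWSER_TOKENS)
    return has_browser and len(claimed_crawlers(user_agent)) > 1
```

The long string above trips `is_mixed_identity`, while a plain `Twitterbot/1.0` or `facebookexternalhit/1.1` does not.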
Further Afield
Other law-abiding robots, in alphabetical order. Many of them don’t have a home IP; instead they are distributed, generally among the usual suspects of 3, 18, 34, 35, 52, 54 and so on.
AhrefsBot
IP: distributed, but especially 54.36.148-150
Mozilla/5.0 (compatible; AhrefsBot/2.0; +http://ahrefs.com/robot/)
. . .
Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)
Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)
Protocol: HTTP/2.0
Worth noting: When my server became HTTP/2-compatible, Ahrefs was the first ordinary crawler to make HTTP/2 requests. Just like the bingbot, the AhrefsBot continues to use HTTP/1.1 for robots.txt, rarely for other files.
Although the robot overall seems to be distributed, almost all robots.txt requests come from just two IPs, 54.36.150.49 and 151.80.39.207. This may explain why their website says that robots.txt changes can take up to a week to be recognized. In spite of this, they’ve never asked for anything in a roboted-out directory. Requests are mostly pages, with a few seemingly random images mixed in.
All User-Agent strings are identical except the version number, which changes periodically. The earliest version I find is 2.0, in use through early 2012. (There was probably a 1.x, but this predates my active logging.) 3.1 took over in the latter part of 2012, but didn’t last long:
- 4.0: October 2012 - July 2013
- 5.0: July 2013 - March 2016
- 5.1: April 2016 - January 2017
- 5.2: December 2016 - December 2018, briefly returning May 2019
- 6.1: December 2018 - March 2021
- 7.0: August 2020 to present
There doesn’t seem to have been a 6.0; I don’t know what the story is. Unlike the earlier version changes, 6.1 and 7.0 overlapped for about half a year.
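Since the strings differ only in the version number, a version timeline like the one above can be built by pulling that number out of each logged UA. This is a minimal sketch, not a description of how the list was actually compiled; `bot_version` and `version_counts` are hypothetical helper names.

```python
import re
from collections import Counter

def bot_version(user_agent, bot_name):
    """Return the version that follows 'bot_name/' in the UA, or None.

    An optional leading 'v' (as in 'MJ12bot/v1.4.8') is skipped.
    """
    pattern = re.escape(bot_name) + r"/v?(\d+(?:\.\d+)*)"
    match = re.search(pattern, user_agent)
    return match.group(1) if match else None

def version_counts(user_agents, bot_name):
    """Count how many of the given UA strings carry each version of bot_name."""
    versions = (bot_version(ua, bot_name) for ua in user_agents)
    return Counter(v for v in versions if v is not None)
```

Run over one month of logged user-agents at a time, `version_counts(uas, "AhrefsBot")` shows when a new version first appears and when an old one drops out, including overlaps such as the one between 6.1 and 7.0.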
Amazonbot
IP: 3.224.220.101, 52.70.240.171
These are not particularly savory neighborhoods, what with the ongoing AWS association, but where else would Amazon crawl from?
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
This robot first showed itself in April 2021. A while back I tested for robots.txt compliance—and then promptly forgot all about it. Watch for an update, as I’ve yet to see how it behaves when authorized.
Applebot
IP: 17.58.101.57, 17.58.96-103
In the past they have come from other addresses in 17—the whole thing still belongs to Apple—but 17.58.101.57 is by far their current favorite. Addresses outside of 17.58.96.0/21 are extremely rare.
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)
Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4 (Applebot/0.1; +http://www.apple.com/go/applebot)
The iPhone version was always rare, and must now have been retired. I haven’t set eyes on it since mid-2017.
For a spell in the middle of 2019 the Applebot had a real problem with filepaths. Starting abruptly in March, there was a flurry of requests for /ebooks/
That leaves directory slashes. As it happens, I have a lot of URLs in the form /ebooks/title/ and-that’s-all. If it’s a long book, the directory will contain further files; most of them don’t. (In case anyone wondered: It is consistently /ebooks/
Somewhere along the line, the Applebot decided these real, physical directories are extensionless URLs in disguise, and will persistently request /ebooks/title without that final slash. The server redirects them to the right place, but it’s still a superfluous request to deal with. At one time, up to half of all initial requests—meaning a third of all requests when you include the redirected correct form—left off that final / slash. Looking from the other side: fully three-quarters of all slashless requests come from the Applebot. This strikes me as excessive.
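The step from "half of initial requests" to "a third of all requests" is simple arithmetic, sketched here under the assumption that every slashless request is redirected and then followed by exactly one request for the correct URL. The function name is made up for the example.

```python
from fractions import Fraction

def slashless_share_of_all(initial_share):
    """Share of ALL requests that are slashless.

    initial_share is the slashless share of initial requests. Each slashless
    request adds one follow-up request for the redirected URL, so the total
    grows from 1 to 1 + initial_share while the slashless count stays put.
    """
    initial_share = Fraction(initial_share)
    return initial_share / (1 + initial_share)
```

With `Fraction(1, 2)` as input the result is `Fraction(1, 3)`: out of every three requests, one is slashless, one is its redirected repeat, and one was right the first time.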
If anyone has ever figured out what the Applebot does, they haven’t shared the knowledge with me. Some webmasters may remember
“If robots instructions don’t mention Applebot but do mention Googlebot, the Apple robot will follow Googlebot instructions.”
The Applebot is not the only robot to adhere to this quaint misapprehension.
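The fallback rule in that quotation can be sketched as a group lookup: use your own group if there is one, otherwise Googlebot's, otherwise the wildcard. This is an illustration of the quoted rule only, not Apple's actual code; the parser is deliberately minimal (it keeps only Allow and Disallow lines) and both function names are hypothetical.

```python
def parse_groups(robots_txt):
    """Map each lowercased User-agent token to its list of (field, value) rules."""
    groups = {}
    current_agents = []
    reading_agents = False  # True while consecutive User-agent lines pile up
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new group starts here
            reading_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups

def select_group(groups, own_token, fallback_token="googlebot"):
    """Pick the robot's own group, else the fallback's, else the wildcard."""
    for token in (own_token.lower(), fallback_token, "*"):
        if token in groups:
            return token, groups[token]
    return None, []
```

Given a robots.txt that names Googlebot and `*` but not Applebot, `select_group(groups, "Applebot")` returns the Googlebot rules, which is exactly the behavior the quotation describes.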
archive.org_bot
Some webmasters loathe archivers; I approve of them. I don’t have much choice, since there are references on this very site that survive only because of archived copies.
IP: 207.241.229-233, 37.187.150
IPv6: 2607:f298:5:105b:
It’s really 224-239 (the full /20), but these are the only addresses I’ve seen in the last few years. It may be random, or they may use the rest of the range for non-crawling purposes. The 37.187 range hasn’t been around since February 2019.
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot)
Mozilla/5.0 (compatible; special_archiver/3.1.1 +http://www.archive.org/details/archive.org_bot)
Protocol: HTTP/1.0
The special_archiver is comparatively rare; it only gets supporting files. Both user-agents are anomalous in continuing to use HTTP/1.0; the first HTTP/1.1 requests I find are from as recently as July 2021.
At some time when I wasn’t paying attention, the familiar archive.org_bot started alternating with longer versions:
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/e4420d1 warc/v0.8.28
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/78c9471 warc/v0.8.33
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/8818597 warc/v0.8.43
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/76f39f7 warc/v0.8.53
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/7597a01 warc/v0.8.54
The first of these, v0.8.28, showed up in May 2023; the others were scattered through 2024: March-April; August; November; and December, respectively.
The original, shorter UA finally packed up and went home near the end of 2024. Since then, longer versions have continued (January, February and March 2025):
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/d83a4f6 warc/v0.8.64
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/08548af warc/v0.8.70
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/2865beb warc/v0.8.73
. . . and so on.
There was formerly a Wayback Machine user-agent:
Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; http://archive.org/details/archive.org_bot)
but it seems to have folded up its tents and gone home at the end of 2018.
AwarioBot
IP: 94.130.237 (especially 94.130.237.100); 116.202.246; 136.243.70
The robot’s web page says it doesn’t crawl from a fixed IP, but since late January 2023 I have never seen it from anything but 94.130.237.100.
Mozilla/5.0 (compatible; AwarioBot/1.0; +https://awario.com/bots.html)
AwarioSmartBot/1.0 (+https://awario.com/bots.html; bots@awario.com)
Although the two robots are sent by the same entity, their behavior is different. AwarioSmartBot first showed up in late 2019, and only on this site. On any given visit it asked for either robots.txt or pages, never both; as a result it spent several years in the “out of sight, out of mind” category. This UA last appeared in April 2023. If it had been the only Awario robot I ever saw, you would have found it on the “Unwanted Robots” page.
AwarioBot, by that name, first appeared in January 2023, overlapping SmartBot. Unlike the SmartBot, each individual visit is introduced with a robots.txt request. This eventually got it authorized after it had demonstrated compliance.
There is also an AwarioRssBot, but I’ve never met it.
Barkrowler
IP: 62.210
Barkrowler/0.7 (+http://www.exensa.com/crawl)
Barkrowler/0.9 (+http://www.exensa.com/crawl)
Phase One (through mid-2019): Barkrowler graduated from 0.7 to 0.9—apparently skipping 0.8—in mid-January 2019. When it visits, it is a long visit, picking up everything it can lawfully get. Any one visit sticks with the same IP from beginning to end, though it may come back from a different IP later the same day. This version disappeared in July 2019.
IP: 62.210, 154.54.249, 217.113.194
Barkrowler/0.9 (+https://babbar.tech/crawler)
Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)
Phase Two (from 2020): After a longish absence, Barkrowler reappeared in January 2020, now with “babbar.tech” instead of “exensa.com” in its UA. The final form, beginning in Mozilla, took over in June 2020, making a clean break from the shorter version. Its 2020 reappearance also marks a slight change in behavior: requests can now be anywhere from a single page to several dozen, always beginning with robots.txt.
A little later, there was a change in IP. The original 62.210 was last used in December 2021, overlapping with 154.54.249 which first showed up in September 2020. Since September 2022 this has alternated with 217.113.194; unlike the earlier IPs, this one belongs to Babbar by name.
BLEXBot
IP: distributed, especially 46.4 and 94.130
In years past, it consistently crawled from 148.251.244.204. Currently it ranges all over the map.
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
“BLEXBot assists internet marketers to get information on the link structure”
Crawling behavior is unpredictable, anywhere from 3 pages to over a hundred. It is possible the different IPs are actually different robots with different behaviors, but I’ve never looked that closely.
In any case, I am not absolutely certain it is wise to put the word “BLEXBot” and “link structure” into the same sentence. At one point, about one-sixth of their requests were 404s caused by appending other people’s URLs to my paths—
At the moment, they’re busily updating their database: crawling old HTTP URLs, getting redirected to HTTPS, and re-crawling. It will not take them long.
CCBot
CCBot/2.0 (http://commoncrawl.org/faq/)
May be following outside links, though they did once look at the sitemap. Their web page says they’re Nutch-based, which may explain their compliance with robots.txt—not something you see every day from their favorite hangout, the 54 neighborhood. They were slower than some to adopt HTTP/1.1, but finally changed over in mid-2018.
ChatGPT
IP: various, including The Usual Suspects
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
Protocol: HTTP/2
Do not ask what this robot—first seen in December 2023—is all about. In particular, do not ask about its relationship to OAI-SearchBot, below. I’m sure there is a dissertation on it somewhere.
DotBot
IP: 216.244.66.243 and .248
This is one of those robots that likes to pick an IP and stick with it. In years past it was 216.244.66.229; currently it toggles between .243 and .248.
Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensite
Psst! DotBot! It’s tidier when the URL in your UA string doesn’t redirect—especially to an entirely different domain. (Admittedly, they are doing better than the Googlebot, which redirects to a 404.)
The DotBot has a number of distinguishing features:
First, its extraordinary appetite for robots.txt. In the last year or two it has even surpassed the former record-holder, bing—by a factor of three or more.
Second, it spends an inordinate amount of its crawl budget asking for directories without final / (see notes on the Applebot, above). It’s as if it is so accustomed to extensionless URLs, it simply assumes that is the correct form and the / slash sneaked in by mistake. Set aside the Applebot, and two-thirds of all slashless requests are from the DotBot.
Third, it refuses to default to HTTPS. It is almost the only robot that will request pages at HTTP even if the page was only created after this site moved to HTTPS. In fact, this eventually got so annoying, I’ve now blocked DotBot requests for everything but the root.
Protocol: HTTP/2.0
In spite of its resistance to HTTPS, DotBot was one of the first crawlers to use HTTP/2.0.
fluid
IP: 194.93.0.40
Mozilla/4.0 (compatible; fluid/0.0; +http://www.leak.info/bot.html)
The wording of the info page strongly suggests that the robot comes from a country whose language does not use definite articles. Say, for example, Russian; information is scanty, but the server lives in Moscow.
Protocol: HTTP/1.0
This robot has been coming around since July 2021, and has never requested anything but robots.txt. Given the HTTP/1.0 protocol, it is possible it wouldn’t be able to receive pages—but since it has never made a redirected request, how would it know?
GrapeshotCrawler
IP: 148.64.56
Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +http://www.grapeshot.co.uk/crawler.php)
Last seen: August 2023
Generally picks up some random ebook. It’s following links from somewhere, but not any of the RSS feeds or “new” links that would put it in the Targeted Robots class. A further behavioral quirk: as recently as August 2023 it was still starting every request at HTTP, whether or not the page ever existed in that form.
This robot was absent for so long—from May 2019 to July 2020—that I moved it to the “past robots” page. And then it rematerialized, only to disappear again a month later. There have been further visits in April 2022, October 2022 and, most recently, August 2023. This time I’m giving it a full two years before considering it truly gone.
ia_archiver
IP: various
Over the years, this robot has used assorted IPs from assorted Usual Suspects ranges. This makes it well-nigh impossible to tell if it is the real thing, which is admissible, or an impostor, which isn’t.
ia_archiver
Last seen: January 2022
Like the archive.org_bot, further up the page, this clearly goes in the YMMV category.
This robot was slow to move up to HTTP/1.1; it finally started in July 2020, with sporadic 1.0 requests through September.
InternetArchiveBot
IP: 185.15.56.22
IABot/2.0 (+https://meta.wikimedia.org/wiki/
Protocol: HTTP/2
Method: HEAD
In spite of its name, this robot appears to work for Wikipedia. I don’t see it very often, but it does give a solid hint about which pages on this site are linked from Wikipedia articles—including which ones were linked before the site went HTTPS, as it can take a while before things get updated. Since late 2020, it has been using HTTP/2.0.
MJ12bot
Is it a search engine? Is it something else? You decide. Either way, MJ12 is one of the best-known distributed crawlers. In addition to a wide array of IPv4 addresses, since the second half of 2019 I have also seen them from a variety of IPv6, representing at least three different ranges.
Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
v1.4.8 was introduced in late February 2017. This robot has an unusually wide overlap in version numbers. Others seen on this site:
- 1.4.4: through February 2015
- 1.4.5: April 2014 - January 2017
- 1.4.6: October 2016 (only)
- 1.4.7: November 2016 - August 2019
Although it has been using primarily HTTP/1.1 for years, there were sporadic HTTP/1.0 requests as recently as mid-2019.
MojeekBot
IP: 5.102.173.71
Mozilla/5.0 (compatible; MojeekBot/0.9; +https://www.mojeek.com/bot.html)
I’m told Mojeek is a search engine, but I’m darned if I can remember ever in my life seeing anyone they sent. The current version number was introduced around July 2020; before that it was 0.7, skipping 0.8. Earlier still, it was 0.6 for quite a while.
OAI-SearchBot
IP: 20.42.10, 51.8.102, 173.203.190
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
Protocol: HTTP/2
The first UA showed up around August 2024, and was last seen towards the end of January 2025. The second, with Chrome, made a few brief appearances in December 2024, only to be replaced by the third version, with added Macintosh-and-Safari element.
As with ChatGPT-User, above, do not ask what this robot does, or what the relationship is between the two.
SafeDNSBot
IP: various
With rare exceptions, no two visits come from the same IP.
SafeDNSBot (https://www.safedns.com/searchbot)
Protocol: HTTP/1.0
Last seen: June 2024
Rarely requests more than one page on a visit—but when it does, they tend to be deep interior pages, implying that it is getting a shopping list from somewhere else.
This robot likes to take long vacations; to date it has been moved to Former Robots twice, only to be moved back again. Its first absence lasted from September 2019 to January 2021, when it made an unexpected reappearance. After that, it remained absent until January 2024, and then stopped by periodically during the first half of the year.
On its most recent visits, it continued to use HTTP/1.0. Coincidentally or otherwise, its (current) most recent visit happened to come from a blocked range. Did it go off to sulk?
SeekportBot
IP: various
Mozilla/5.0 (compatible; SeekportBot; +https://bot.seekport.com)
This close relative of the Seekport Crawler took over in June 2022. Thanks to the similarity in name and ownership, it took a while to realize that this one was going to be robots.txt compliant. The information page opens cheerfully with
Bot Type: Good (always identifies itself)
Obeys robots.txt: Yes
—which indeed appears to be true—and goes on to tell us it is associated with SISTRIX, a once-familiar name.
SemrushBot
IP: 46.229.168; 85.208.96; 213.174.146, .147, .152; 192.243.53
Mozilla/5.0 (compatible; SemrushBot/1.0~bm; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot/2~bl; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot-BA; +http://www.semrush.com/bot.html)
Mozilla/5.0 (compatible; SemrushBot-SI/0.97; +http://www.semrush.com/bot.html)
The IP addresses I’ve listed are the ones I have personally seen. The overwhelming majority of requests have come from 46.229.168, with 213.174.146 a distant second, and the others still rarer. On paper, Semrush ranges are considerably wider: 46.229.160-175, 192.243.48-63, 213.174.128-159.
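The range shorthand used throughout this page (three octets, with an optional span in the third) translates directly into CIDR blocks. Here is a small sketch using Python's standard `ipaddress` module; the function name and the assumption that the shorthand always has exactly three octets are mine.

```python
import ipaddress

def shorthand_to_cidrs(shorthand):
    """Turn '46.229.160-175' (or '54.236.1') into a list of CIDR strings.

    Assumes three octets, the last optionally written as 'first-last';
    the fourth octet is taken to run over the whole 0-255 range.
    """
    prefix, _, span = shorthand.rpartition(".")
    first, _, last = span.partition("-")
    last = last or first  # a single third octet covers one /24
    start = ipaddress.IPv4Address(f"{prefix}.{first}.0")
    end = ipaddress.IPv4Address(f"{prefix}.{last}.255")
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]
```

For the three on-paper Semrush ranges this gives `46.229.160.0/20`, `192.243.48.0/20` and `213.174.128.0/19`, one block each because the ranges happen to be aligned on power-of-two boundaries.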
There are many, many variations of this robot; just to make things difficult, each one sends different headers, requiring extensive hole-poking. They have a number of other versions I’ve never set eyes on, including “SemrushBot-SA” and “Content
Version 2~bl replaced the older 1.2~bl in June 2018, only to be replaced in its turn by 3~bl in mid-December of the same year, and then by 6~bl in June 2019 and 7~bl in late November 2020. (They seem to have bypassed 4 and 5 entirely.) Version 1.0~bm was a parallel form, overlapping all the “bl”s; I saw it as recently as mid-2020.
This is an extremely active robot, visiting almost as many times as the bingbot, and more than the Googlebot. It would be very interesting to know what they’re looking for, since several requests were for obscure interior non-English-language pages that, to the best of my knowledge, are not linked from anywhere in the known universe.
SEOkicks
IP: 95.216.96.170, 138.201.30.66, 136.243.89.157
Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html)
I don’t think this robot has ever requested anything but robots.txt, where it is not disallowed. Maybe it is simply checking whether the site is accessible, and has intelligently decided that a successful robots.txt fetch will convey this information just as handily as any other file.
special_archiver
Described under archive.org_bot, above.
Where Are They Now?
In years past, some authorized robots were frequent enough to make their way onto my Ignore list. In some cases, it turns out there hasn’t been anything to ignore in ages. If I haven’t set eyes on them in several months, they’re listed here. If it’s been over a year, they will be moved to the Former Robots page.