On a typical website, up to half of all visitors aren’t human at all. They’re robots of all kinds: the good, the bad, the ugly. Or, if you prefer, There’s three ways that robots can go: that’s good, bad and mediocre . . . .
This article is based on a long post I’ve put together for the Webmaster World forums every year or two since 2012, most recently in May 2017. Robots come, robots go, so I’ll continue updating it now and then.
There are various ways of classifying robots. But the Great Divide is robots.txt. Does the robot first check that it has permission to enter, learn which parts of the site are off limits, and restrict its visits to authorized areas? If it doesn’t, it had better have a darn good excuse for being there.
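For readers who want to see the Great Divide in concrete terms, here is a minimal Python sketch of the check a well-behaved robot makes before requesting anything. It uses the standard library's `urllib.robotparser`; the robots.txt content, the robot name `ExampleBot` and the paths are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real robot would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def may_fetch(robots_txt: str, robot_name: str, path: str) -> bool:
    """Return True if robots.txt allows robot_name to request path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(robot_name, path)

print(may_fetch(ROBOTS_TXT, "ExampleBot", "/ebooks/index.html"))   # True
print(may_fetch(ROBOTS_TXT, "ExampleBot", "/private/notes.html"))  # False
```

A robot that skips this step, or runs it and then ignores the answer, lands on the wrong side of the divide.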
Sad but true: On the Internet, everything is assumed to be American unless it explicitly states otherwise. Hence .gov = the US government; .mil = the US military; .edu = institutions of higher learning in . . . well, North America at least. Almost everywhere you go, the most popular search engines are based in the US.
US Search Engines
Still Number One
On most sites in most countries, you can expect the single largest robotic visitor to be:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
As the name indicates, “Googlebot-Image” is strictly for image files. In years past, the ordinary Googlebot has also been known to request images—usually with a referer, like a human—but it seems to have given up the habit.
It took until April 2016 for Google to realize that they had a perfectly good mobile OS of their own, and didn’t need to go calling themselves “iPhone”. For several years before that, probably starting in late 2011, the mobile Googlebot went through a succession of iPhone names:
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Can you spot the difference between the second (October 2013) and third (February 2014) versions? By 2014, smartphones had become mainstream; there was no longer a need to label yourself “Mobile”. The name “Googlebot-Mobile” did hold on a while longer with two older UAs:
DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/188.8.131.52.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
DoCoMo has been around since at least June 2011; SAMSUNG showed up a little later, in March 2012. Both disappeared around the end of October 2016.
As I write this, all Googlebot requests for supporting files (.css, .js) include a referer—the page the file “belongs” to. They have always done this sporadically; it looks as if it became standard practice in mid-March 2017. On the other hand, they no longer find it useful to send a referer with image requests; the last case I can find is November 2015.
To Say Nothing Of . . .
But wait, there’s more. Alongside the true Googlebot, there’s an ever-expanding list of other Googloid functions—including some I’ve never personally set eyes on, like the assorted AdSense-related crawlers.
IP: 66.102.6-7 and 66.249.80-95
I don’t know what they do with the rest of 66.102.0-63. I have only once—ever—seen them outside 6-9, and rarely outside 6-7.
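Ranges written in this shorthand translate directly into CIDR blocks: 66.102.6-7 is 66.102.6.0/23 and 66.249.80-95 is 66.249.80.0/20. A minimal Python sketch of a range check follows, using the standard `ipaddress` module; the function and variable names are illustrative, not taken from any real configuration.

```python
import ipaddress

# The two ranges quoted above, rewritten in CIDR form.
GOOGLE_OTHER_RANGES = [
    ipaddress.ip_network("66.102.6.0/23"),   # 66.102.6-7
    ipaddress.ip_network("66.249.80.0/20"),  # 66.249.80-95
]

def in_ranges(ip: str, networks) -> bool:
    """Return True if ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

print(in_ranges("66.102.7.200", GOOGLE_OTHER_RANGES))  # True
print(in_ranges("66.102.8.1", GOOGLE_OTHER_RANGES))    # False
```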
In alphabetical order:
Mozilla/5.0 (compatible; GoogleDocs; apps-presentations; +http://docs.google.com)
Confession: I have no idea what this does. It only fetches images, and it’s very rare. Their web page leaves me none the wiser.
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon
This UA has certainly matured over the years. Originally they sent no UA at all; later they called themselves Firefox 6, and since March of 2016 they’ve gone by Chrome/49. Unlike some search engines, Google doesn’t display a favicon next to each result. The favicon does show up whenever you list your sites in a Google property such as Google Search Console (the former Webmaster Tools) or Profile, and quite possibly others that I don’t know about.
Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)
Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:184.108.40.206; Google-SearchByImage) Gecko/2009021910 Firefox/3.0.7
Confession: I never knew this UA existed. Thanks to that Firefox/3, they have never seen anything but a 403. The UA, complete with “de” (barring a few ebooks, I have no German-language content), has existed since at least 2015. If they hadn’t come from a Google IP, I’d have assumed they were just another unwanted robot.
This doesn’t have a UA of its own; it just appends “,gzip(gfe)” (with comma, without space) to the human visitor’s UA string. The referer will be something involving “translate.
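Because the marker is a fixed suffix, it is easy to spot in logs. The following Python sketch separates the visitor's own UA from the appended marker; the sample UA string is made up, and the function name is hypothetical.

```python
GFE_SUFFIX = ",gzip(gfe)"  # with comma, without space

def split_translate_ua(user_agent: str):
    """Return (base_ua, via_proxy) for a logged User-Agent string."""
    if user_agent.endswith(GFE_SUFFIX):
        return user_agent[: -len(GFE_SUFFIX)], True
    return user_agent, False

# Invented example UA, not a real browser string.
print(split_translate_ua("Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0,gzip(gfe)"))
```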
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/27.0.1453 Safari/537.36
Once again, color me puzzled. I remember a few years ago the Google Search results always had an option for Preview, but I haven’t seen it in yoincks, so I have no idea what this UA currently does.
Formerly Known as Webmaster Tools
Several of the bigger search engines have Webmaster Tools so you can learn a little more about what the search engine is up to, and maybe even have some say in its behavior. Just to be different, Google renamed its version Google Search Console.
Mozilla/5.0 (compatible; Google-Site-Verification/1.0)
Shows up periodically on any site that has a GSC account.
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Search Console) Chrome/27.0.1453 Safari/537.36
I first saw this UA in May 2016. I don’t know exactly how old it is, because it comes only in response to a specific action on your part: “Fetch and Render” in the Fetch As Googlebot section of GSC. Like most googloid functions it is not subject to robots.txt; casual experimentation shows that if you request a page in a roboted-out directory, it will do the fetch with this UA, but won’t show the “What a Human Sees” render. Probably they don’t want to rub your nose in the fact that they’ve just fetched something robots aren’t supposed to see.
If that UA seems familiar, it’s because Preview (above) is identical; only the names have been changed. They must have a sentimental fondness for Chrome 27.
We Try Harder
IP: 40.77.167, 157.55.39, 207.46.13
Obviously these are not Bing’s full ranges; in fact about half the Internet—at least on the IPv4 side—seems to be registered to Microsoft. But in the most recent month I looked at, requests were evenly divided between these three /24 sectors.
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Yup, you read that right: Bing is using an iPhone UA. Ha. Ha. Ha.
Unlike Google, Bing uses the same robot for both pages and images. No request had a referer. The mobile bingbot similarly makes requests of all kinds—except that it never asks for robots.txt in its own name. That job is left to the ordinary bingbot, which ought to be quite good at it; for many years, Bing was the Abou ben Adhem of robots.txt requests. Recently they seem to have cut back on their appetite, although they still request robots.txt far more often than the Googlebot.
About 10% of bing requests came from The Robot That Will Never Die:
In spite of the “media” in the name, recent requests have been exclusively for pages.
But Also . . .
And then there’s Bing Preview. In addition to the three bing/msn crawl ranges, it also shows up from
IP: 65.55.210, 131.253.25-27, 199.30.24-25
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b
Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 BingPreview/1.0b
I’m not clear what this UA actually does. I don’t believe it is a true preview; the requests don’t come in packages (page, supporting files, images) like a human. It may be Bing’s version of a Mobile-Friendliness tester.
Unlike Google’s Site Verification, Bing’s wears plain clothes:
IP: 40.77, 131.253
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246
It never requests anything but /BingSiteAuth.xml.
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
In March 2016, Yahoo! Slurp suddenly started requesting stylesheets, always with the appropriate page as referer (like the Googlebot). On the other hand, as of December 2016 they seem to have entirely stopped asking for images. Either they’ve got a very long visual memory or they’ve been sneaking in under an alias, because Yahoo! image search is still sending humans.
Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)
DuckDuckGo does not crawl websites; it uses other robots’ crawl data and applies their own algorithm. If a page comes up in a DuckDuckGo search, the Favicons-Bot comes by to pick up the site’s favicon in order to display it next to the search result. In spite of the name, they request the root (the site’s front page) first; if the page is blocked they won’t request the favicon. If your access-control rules involve looking at the referer, this may entail some hole-poking.
National Search Engines
Most parts of the world are content to google and be done with it. But some countries have their own search engines that get most of the business. I’ve listed them in order of frequency at this site, which may or may not have anything to do with the search engine’s overall size and distribution.
Czech Republic: Seznam
I read recently that the country has officially changed its name to Czechia—but everyone who lives there hates it, which would seem to be a drawback. Seznam’s website says Czech Republic, so we’ll stick with that.
Google Translate, incidentally, says that “seznam” means “list”. They have always been fond of this site; not sure why, since human visitors sent by Seznam can be counted on your fingers.
As you might expect from a robot living in RIPE territory—where IPv4 addresses are now being rationed out in /22 segments—Seznam was quick to grab an IPv6 range when it became available. To date I’ve only met them from 2a02:598:2 and 2a02:598:a, in equal amounts, but they own the whole /32 because why wouldn’t they.
Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)
This UA was rolled out in May 2016; for many years before that, it was
followed (from February 2014) by
Mozilla/5.0 (compatible; SeznamBot/3.2; +http://fulltext.sblog.cz/)
I can only conjecture that, like Apache, Seznam doesn’t care for odd numbers. In any case the new version must be considered an improvement, since they now link to a page in English.
Like Bing and Google, there’s a preview:
Mozilla/5.0 PhantomJS (compatible; Seznam screenshot-generator 2.1; +http://fulltext.sblog.cz/screenshot/)
This version has been around since mid-2015; before that, it was
Mozilla/5.0 (compatible; Seznam screenshot-generator 2.1; +http://fulltext.sblog.cz/screenshot/)
Maybe one of these years they will notice that the preview’s User-Agent still references the old “fulltext.sblog” page, even while the regular robot has updated to “seznambot-intro”.
Yandex is a Russian company, but they’re also big in Turkey. Sometimes they’ll come in expressing a preference for material in Turkish, instead of the more common Russian. As we speak, Yandex’s distinguishing trait is their sheer range of IPs. They’re not distributed; they just own a whole lot of small, widely separated ranges. Aside from their hands-down favorite, 220.127.116.11, within the last few years they have used:
130.193.32-71 (i.e. 32-63 and 64-71)
199.21 is Yandex’s ARIN range (North America). On my sites they use it randomly alongside their various European addresses; it may behave differently on sites that have region-based access controls.
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)
The imagebot was busy in April 2017, accounting for about 2/3 of all requests.
Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4 (compatible; YandexMobileBot/
The mobile UA was rare. It asks for pages and supporting files (css, js) but no images.
Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)
Linking to a webpage in your UA string is generally considered A Good Thing—but, er, it only works if people can read Russian. Faute de mieux, I’ve always assumed they are a search engine. Rather a low-budget one: they do a biggish crawl every few months, at which point they show up on my Redirects lists requesting old pages that everyone else has already got sorted to their correct URL. Requests are almost exclusively pages. (Exceptions are interesting, but only if you know the site.)
Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine)
Mozilla/5.0 (compatible; coccocbot-image/1.0; +http://help.coccoc.com/searchengine)
That, at least, is their all-ASCII Internet name; it’s really Cốc Cốc with plentiful diacritics. Although they call themselves a search engine, they’ve never done a top-to-bottom spidering. Instead, when they learn of the existence of some particular page, they come in and ask for it, along with all its associated images.
Mozilla/5.0 (compatible; Daum/4.1; +http://cs.daum.net/faq/15/4118.html?faqId=28966)
Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) Safari/538.1 Daum/4.1
Daumoa first showed up in response to an RSS feed; since then they have sometimes wandered further. They’ve got a few other user-agents, but the “faqID” one is their current favorite. Among other things it handles all robots.txt requests for the whole family.
Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
The search engine is called, inexplicably, Exalead. Since this site has no French content, they don’t come around much. Although they have periodically looked at the xml sitemap, I don’t think they’ve ever done a full spidering; they come in and ask for specific pages.
From the same address comes Exalead’s version of a Preview:
Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Exabot-Thumbnails)
This form has been around, unchanged, since at least 2011.
Not necessarily search engines, but still legitimate in my book:
Mozilla/5.0 (compatible; Qwantify/2.3w; +https://www.qwant.
Mozilla/5.0 (compatible; Qwantify/2.4w; +https://www.qwant.
The third, minimalist UA is only for the favicon. The change from 2.3 to 2.4 happened in the latter part of April 2017, with no overlap.
Is it a search engine? Is it something else? You decide. Either way, MJ12 is one of the best-known distributed crawlers, so I won’t bother to list its IPs.
Mozilla/5.0 (compatible; MJ12bot/v1.4.7; http://mj12bot.com/)
Qwantify and MJ12 are among the few law-abiding robots that still use HTTP/1.0. What will they do when the Internet moves up to HTTP/2?
Mozilla/5.0 (compatible; Cliqzbot/1.0; +http://cliqz.com/company/cliqzbot)
“Was genau ist Cliqzbot?” Another of those targeted searches, I think. They are either distributed, or they sprawl so widely across 52 that there’s no telling where they really live.
Mozilla/5.0 (compatible; DeuSu/5.0.2; +https://deusu.de/robot.html)
DeuSu only understands Disallow if they’re given a section to themselves in robots.txt.
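One way to cater to robots like this is to repeat the shared rules in a block of their own. The Python sketch below generates such a file; the disallowed path is invented, and the tokens `DeuSu` and `SiteExplorer` are assumed to be the names those robots look for in robots.txt.

```python
def build_robots_txt(disallowed, dedicated_robots):
    """Emit a '*' block plus one identical block per named robot."""
    blocks = []
    for name in ["*", *dedicated_robots]:
        lines = [f"User-agent: {name}"]
        lines += [f"Disallow: {path}" for path in disallowed]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"

# Hypothetical path; the robot tokens are assumptions.
print(build_robots_txt(["/private/"], ["DeuSu", "SiteExplorer"]))
```

The output is longer than a single shared block, but every robot finds its rules under its own name.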
Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/bot)
April 2017 marks the first Yeti sighting since . . . drumroll . . . July of 2014. That was over at my old site, Lucy’s Worlds, which they used to visit all the time; they’ve never before set foot on the present site. They used to change IP every year or so, but this one’s been the same since mid-2013.
After that passing visit in April—robots.txt and one page—they didn’t show up again until September 2017. That time they picked up two individual pages with all supporting files. What they plan to do with them remains a mystery.
Bing and Yandex don’t have any connection that I know of—but they’re both associated with one near-identical behavior:
Visitor comes in with generally humanoid headers and requests a page, sometimes giving the appropriate search engine as referer. They request scripts, stylesheets, fonts—in short, all supporting files except images and favicon. All three entities pick up piwik.js (the script that tells them what my analytics is looking for), though only one of them acts on it by requesting piwik.php (the actual analytics). It may or may not be relevant that my piwik installation lives on a different site. If I kept my fonts on a separate site—or used third-party fonts such as google’s—would they still be requested?
The range belongs to an outfit called Drake Holdings, hence my identifier for the robot.
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; Trident/5.0)
IP: 65.55 and 131.253
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; Trident/5.0)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; Trident/5.0)
(i.e. identical to the two Drake Holdings forms)
Both of these have been getting 302 redirected to a custom page that has served different purposes over the years; one of its functions is to intercept humans who accidentally behave like robots.
IP and UA: as expected for humans in Russia
. . . et cetera, et cetera, with an enormously long string of garbage that appears to be identical to a genuine Yandex referer. Each request is followed or preceded within 24 hours by an apparently human visit (with images and piwik, without favicon) to the same page. Different IP and UA, but Yandex referers to this site are infrequent enough that I can easily pick them out.
Bingbot’s Evil Twin
Alongside the ordinary plainclothes Bingbot, there’s this further agent that I never noticed until fine-tooth-combing logs:
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
I had no idea this UA existed; they got an automatic 403 on the grounds of “non-bingbot from Bing/MSN range”. They made the same set of three requests every few days: first /ebooks/bourquin/ and then, several hours later, /ebooks/kleinschmidt (without slash) immediately redirected to /ebooks/kleinschmidt/ with slash. (Paradoxically, the malformed URL is what prevented it from being blocked at the outset; it’s a very tightly constrained RewriteRule.) Turns out this has been going on—always the same set of three—since mid-September 2016, continuing through the beginning of May 2017.
And then on 2 May there was a final lone request for /ebooks/bourquin/ . . . and no more, as if someone had abruptly pulled a plug.
Other Law-Abiding Robots
In alphabetical order:
Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)
Probably distributed; I counted eight different ranges in one month. But all robots.txt requests come from just two IPs, 18.104.22.168 and 22.214.171.124, which may explain why their website says that robots.txt changes can take up to a week to be recognized. In spite of this, they’ve never asked for anything in a roboted-out directory. Requests are mostly pages, with a few seemingly random images mixed in.
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Applebot/0.1; +http://www.apple.com/go/applebot)
Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4 (Applebot/0.1; +http://www.apple.com/go/applebot)
though the iPhone version is rare.
If anyone has ever figured out what this robot does, they haven’t shared the knowledge with me. Some webmasters may remember
“If robots instructions don’t mention Applebot but do mention Googlebot, the Apple robot will follow Googlebot instructions.”
The Applebot is not the only robot to adhere to this quaint misapprehension.
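The quoted rule amounts to a lookup order. Here is a minimal Python sketch of it, assuming (this is a guess, not something the quote states) that a Googlebot block is preferred over the catch-all `*` block; the data structure and paths are invented.

```python
def pick_rule_block(blocks: dict, robot: str, fallback: str = "Googlebot"):
    """blocks maps lowercase user-agent tokens to lists of Disallow paths.

    Try the robot's own block, then the fallback robot's, then '*'.
    """
    for name in (robot.lower(), fallback.lower(), "*"):
        if name in blocks:
            return name, blocks[name]
    return None, []

# Hypothetical rules: no Applebot block, so the Googlebot block wins.
rules = {"googlebot": ["/no-google/"], "*": ["/private/"]}
print(pick_rule_block(rules, "Applebot"))  # ('googlebot', ['/no-google/'])
```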
Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
“BLEXBot assists internet marketers to get information on the link structure”
I am not absolutely certain it is wise to put the words “BLEXBot” and “link structure” into the same sentence. At one point, about one-sixth of their requests were 404s caused by appending other people’s URLs to my paths—a pervasive problem that others have noticed too. (The “one-sixth” figure surprised me. I would have guessed closer to 95%.) It looks as if the problem arose in December 2016, got progressively worse—and then, in early May 2017, it stopped as suddenly as it had started. Whew.
IP: 64.62.252, 126.96.36.199
BUbiNG is a scalable, fully distributed crawler, currently under development and that supersedes UbiCrawler.
(UbiCrawler must have been before my time; I find no record of it.) Although the two IP ranges belong to different hosts, there are no major differences in their behavior. I put them in the “No skin off my nose” category.
IP: 188.8.131.52, 184.108.40.206 (they pick a fresh IP for each crawl, and don’t come around very often)
May be following outside links, though they did once look at the sitemap. Their web page says they’re Nutch-based, which may explain their compliance with robots.txt—not something you see every day from the 54 neighborhood.
Mozilla/5.0 (compatible; DotBot/1.1; http://www.
Psst! DotBot! It’s tidier when the URL in your UA string doesn’t redirect—especially not to an entirely different domain. (They are not the only ones.)
IP: mostly 220.127.116.11
The exceptions may be fakers, but why would you pretend to be the SemrushBot?
Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.
It would be very interesting to know what they’re looking for, since several requests were for obscure interior non-English-language pages that, to the best of my knowledge, are not linked from anywhere in the known universe.
Mozilla/5.0 (compatible; SEOkicks-Robot; +http://www.seokicks.de/robot.html)
Mozilla/5.0 (compatible; SiteExplorer/1.1b; +http://siteexplorer.info/Backlink-Checker-Spider/)
SiteExplorer is another of the rare robots that only understand Disallow if they’re given a section to themselves in robots.txt, so I initially thought they were non-compliant. Later evidence suggests that they’re just very, very slow on the uptake: in the course of one month they picked up robots.txt eleven times, but never screwed up the courage to ask for a page until almost the end of the month.
IP: 18.104.22.168, 22.214.171.124 (two full crawls)
Mozilla/5.0 (compatible; spbot/5.0.3; +http://
I think their crawling happens on the fly: robots.txt, two forms of root—one of which gets a 301—and then all other pages, from top to bottom, with the same referer a human would send. In the rare case that a page is linked from widely separated directories on the same site, the referer is whichever one the robot saw first. Since they don’t come in with a shopping list, there are never any 301s or 410s. This makes it useful for record-keeping purposes: Count the number of spbot requests, subtract two, and that’s how many visible URLs you’ve got.
IP: 126.96.36.199 (always)
I did say I didn’t have very high standards when it comes to authorizing robots. They were very active in the latter months of 2016; in April they only showed up once. They’re only interested in the /ebooks/ directory: mostly pages, but the occasional stylesheet, and sometimes the first image on a page—regardless of whether it’s a full-color frontispiece or a little icon from the navigation banner.
Mozilla/5.0 (compatible; Uptimebot/1.0; +http://www.
Like SiteExplorer, UptimeBot is exceedingly slow when it comes to robots.txt. But if you keep denying them for a month or two they will eventually get the message and stop requesting pages. At that point you may choose to whitelist them. Most of the time they just do a HEAD / (“Does the root page exist? OK, we’re good”) which is hardly a server-intensive request.
Most of these first showed themselves in the latter part of 2015 when I started submitting to the surpassingly useful Online Books Page. This is a curated directory (meaning that everything on it is reviewed by a human) which includes both an RSS feed and a “New Listings” page. Ebook-seeking robots tend to be truthful about where they learned about the page they’re requesting, even if they didn’t literally click on a link to get there, the way a human would.
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0 (FlipboardProxy/1.2; +http://flipboard.com/browserproxy)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0 (FlipboardProxy/1.6; +http://flipboard.com/browserproxy)
Flipboard 1.2 is for robots.txt and pages, 1.6 is for images. For each new file, they request the HTML a few times, and the associated images just once. They don’t seem to be interested in stylesheets.
The name is short for OMG I Love It. Unfortunately, I am not making this up. Check for yourself. This is a recent arrival; I only started seeing it early in 2017. It picks up a page when it first learns about it, and then comes back every week or so for the same page.
Less frequent visitors include:
rogerbot (Nutch-based robot from Moz)
IPv4: 31.13.64-127 (Ireland), 66.220.144-159, 69.63.128-191, 69.171.224-255, 173.252.64-127
The shorter UA made its first appearance in the latter part of 2015. It often picks up isolated pages without images; I don’t know what this translates to in the FB user experience. The formerly common visionutils hasn’t been around since April 2016.
Facebook’s behavior changes every couple of years. Currently new links start with a page request, followed by all image files belonging to the page, giving the page as referer. The human user then presumably selects one image, which will get re-requested every time your original visitor’s friends view the page that linked to you. These image-only requests come in with no referer. (Pro tip: Do your pages include a non-visible image file, such as piwik’s noscript dot? I rewrite this for facebook to a little banner giving the site name. So if a page happens not to have any good pictures—some books don’t—there’s still something for the human user to select.)
Facebook, incidentally, seems to be responsible for those /ebooks/kleinschmidt without-final-slash requests, now limited to the Bingbot Evil Twin. All it takes is one person to “like” a page and misspell its URL . . .
IP: 199.16.156-159, 199.59.148-151
Uncharacteristically for a social-media-based robot, the Twitterbot asks for and obeys robots.txt. And, like the Googlebot, it never forgets. This month’s requests included a URL that I remember seeing on my old site’s Redirect lists, meaning that they first learned about it no later than December 2013.
. . . and a Faker
In the category of Most Unlikely User-Agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0
The IP varies, but it’s always the identical UA string from beginning to end, so it’s not just an extra bit tacked on to the end of a human browser.
Before we bid farewell to the Probably-Good side, here’s a nod to those who show pretensions to robotitude by asking for—and honoring—robots.txt, and by having an identifiable name:
The name is crucial; there’s not much point to asking for robots.txt if you’re going around calling yourself Firefox 10, or even Chrome 54.
Well, not that bad. Currently about 3% of unwanted robots get in—and most of those just pick up the front page and go on their way. It’s been a few years since I’ve seen a really devastating crawl coming out of nowhere. In order to crash the gates successfully, it isn’t enough to have a thoroughly human UA; they also have to send plausible humanoid headers. Thankfully, most robots are either too stupid or too lazy. About 3% of them sent no User-Agent at all: instant lockout. About one-quarter—including a number of major search engines—did not send an Accept header: instant lockout unless whitelisted. Other header anomalies are, of course, for me to know and you to find out. Or not.
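The two header rules that are spelled out here (no User-Agent, no Accept) can be sketched in a few lines of Python. This covers only those two checks; the function name, the whitelist flag and the status codes chosen are illustrative assumptions, not the site's actual rules.

```python
def gate(headers: dict, whitelisted: bool = False) -> int:
    """Return an HTTP status: 403 to lock out, 200 to let through."""
    # Header names are case-insensitive, so normalize them first.
    names = {k.lower(): v for k, v in headers.items()}
    if not names.get("user-agent", "").strip():
        return 403  # no User-Agent at all: instant lockout
    if not names.get("accept", "").strip() and not whitelisted:
        return 403  # no Accept header: lockout unless whitelisted
    return 200

print(gate({"User-Agent": "ExampleBot/1.0"}))                    # 403
print(gate({"User-Agent": "ExampleBot/1.0"}, whitelisted=True))  # 200
```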
Along with the small group that can’t be bothered to send a UA header at all, there are always a few brand-new crawlers who go around for months with the equivalent of Insert Name Here. And then there are the computer-science class assignments who never do figure out what to call themselves. (“Am I ‘test1’? No, wait, I think I was ‘TestOne’.”) On the other hand, the fake Googlebot seems to have all but disappeared in recent months. Maybe the people who write robot scripts have figured out that Googlebot UA + non-Google IP = automatic lockout, so it ends up being worse than useless.
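The "Googlebot UA + non-Google IP" rule could be sketched in Python as follows. The trusted network below is a placeholder assumption; substitute whatever ranges you actually trust for Googlebot.

```python
import ipaddress

# Assumption: a placeholder for the ranges you trust for the real Googlebot.
TRUSTED_GOOGLE = [ipaddress.ip_network("66.249.64.0/19")]

def is_fake_googlebot(user_agent: str, ip: str, trusted=TRUSTED_GOOGLE) -> bool:
    """True if the UA claims to be Googlebot but the IP is not trusted."""
    if "googlebot" not in user_agent.lower():
        return False  # makes no Googlebot claim, so this rule doesn't apply
    addr = ipaddress.ip_address(ip)
    return not any(addr in net for net in trusted)

ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(is_fake_googlebot(ua, "66.249.70.1"))  # False
print(is_fake_googlebot(ua, "203.0.113.5"))  # True
```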
As of April 2017, about 85% of all robots were smart enough to start their names with “Mozilla” followed by some more-or-less-plausible humanoid user-agent. Firefox/40.1 seems to be in fashion just now; at least it’s a little more believable than the assorted one-digit Firefoxes. And a visitor of any species, human or robot, calling itself MSIE 6 . . . can only inspire pity.
Some of them, though, aren’t even trying:
Sometimes humans wear the “Dorado” face too, but more often it’s a robot.
Not Welcome Here
On some sites, Chinese search engines might be perfectly legitimate and even desirable. Me, I want no part of ’em—but try getting this message through.
Both Baidu and Sogou request robots.txt on a regular basis—only to ignore it. This remains true even after I gave each one a Disallow block to itself, on the off chance that they were simply too primitive to understand a continuous listing.
IP: 123.125.71, 180.76
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.
Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2
In April 2017, the Baiduspider by that name never showed its face at all. (It isn’t gone; I’ve seen it later.) It visited once or twice a day from 180.76, but only to request robots.txt under the Firefox 6 alias. There were also a slew of blocked root requests throughout the month from 123.125.67—an IP that Baidu has used in the past—claiming to be Chrome/45.
What I did see was a robot from an ARIN range professing to be
[sic] What’s better than faking your UA? Claiming to be something that would be banned in its own right.
IP: 36.110, 106.38, 106.120, 123.126, 220.181.124
Sogou web spider/4.0
Currently it doesn’t seem to be interested in much but /ebooks/, which strongly suggests it is picking up links from some outside source.
The Worst of the Worst
When you walk in and demand to see /wp-login.php, it’s all over. There is zero chance that you have any legitimate, law-abiding purpose. (I suppose if you actually have a WordPress site, you might invite security-testing robots to look around and confirm that nobody can get in where they don’t belong. That is not what my visitors are up to.) Close to 10% of the month’s robotic visits asked for wp-login, wp-admin, Fckeditor, xmlrpc.php and similar. What they received, instead, was the 403 page—or, at worst, a “no such file” 404.
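Requests like these are easy to flag by path alone. A minimal Python sketch, using only the names mentioned above (the "and similar" part is left out, and the function name is hypothetical):

```python
# Markers taken from the list above; matched case-insensitively.
PROBE_MARKERS = ("wp-login", "wp-admin", "fckeditor", "xmlrpc.php")

def is_probe(path: str) -> bool:
    """True if the requested path looks like a CMS break-in probe."""
    lowered = path.lower()
    return any(marker in lowered for marker in PROBE_MARKERS)

print(is_probe("/wp-login.php"))       # True
print(is_probe("/ebooks/bourquin/"))   # False
```

On a site that really does run WordPress, a blanket rule like this would lock out the owner too, so it only suits sites where those paths don't exist.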
A surprising number asked for robots.txt, or the meaningless “/blog/robots.txt”. I’m not clear about the purpose of this request. They did not rush over and ask for any and all roboted-out directories, while they did go ahead and ask for the standard WP list. Are they hoping to find something like “Disallow: /wp-admin876” that isn’t worth asking for unless you already know it exists?
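On a site with no WordPress install, spotting these visitors is mostly string matching. Here is a minimal Python sketch, assuming you already have the requested path in hand; the fragment list is simply the names mentioned above, and the function name is made up for illustration.

```python
# Fragments taken from the requests described above; extend to taste.
PROBE_FRAGMENTS = ("wp-login", "wp-admin", "fckeditor", "xmlrpc.php")


def is_probe(path: str) -> bool:
    """Return True if the requested path contains a known probe fragment."""
    lowered = path.lower()  # probes vary in capitalization (Fckeditor, FCKeditor)
    return any(fragment in lowered for fragment in PROBE_FRAGMENTS)


if __name__ == "__main__":
    for sample in ("/wp-login.php", "/blog/xmlrpc.php", "/ebooks/paston/"):
        print(sample, is_probe(sample))
```

If you do run WordPress, this test would obviously catch your own legitimate logins too, so treat it as a starting point only.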
The main feature of malign robots is miscellaneousness. Some IP shows up, requests two or six or a dozen files from a standard list, and never shows its face again. Repeat visits are a feature of named robots from known addresses.
Interestingly, about half of the group—from every possible IP—came in with the exact UA
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
suggesting that they all started with the same script. Unfortunately, this identical UA is still in use by humans. But many others had no UA at all, making for a convenient Shoot To Kill twofer.
Most malign robots can be identified by their behavior rather than by name and address. Current patterns:
One Plus Three
(always a different one) giving the root as referer, although the requested page is not actually linked from the root, and then
GET / (root), giving the previously failed request as referer
GET / (root), with auto-referer
GET / (root), with auto-referer
This month—Hurrah!—every last one of this pattern was blocked.
(always a different one) giving root as referer, as above
GET / (root) with auto-referer
GET /boilerplate/contact.html giving root as referer
GET / (root) with auto-referer
These requests are always blocked—but only thanks to headers. IP alone won’t do it; the “Contact botnet” plagued me for years. A further quirk of this pattern is that the first referer has a final / slash, but the other three don’t. And, of course, they must have friends inside, because the first request is for some page that exists but is not visible from the front page.
Only one page is requested—but they ask for it repeatedly, anywhere from 3 to 7 times, most often 4. Sometimes they throw in some referer spam. Almost all of them are blocked.
One is enough:
Request: one page from the /ebooks/paston/ directory . . . which happens to consist of unusually large HTML files. I would prefer not to send out 300K if I don’t have to, so now they get 302 redirected to a detour page at about 1/200 the weight. (The page also catches the rare human by mistake, but they’re given enough information to proceed normally.) To make these go away I’ve been reduced to blocking a handful of IP ranges—which is exactly what I’d hoped to get away from when I changed to header-based access controls. Sometimes there’s just no other way, darn it.
“Your root page sent me”
Referer: fiftywordsforsnow.com (with or without final /)
Claiming that you got to some deep inner page directly from the root is a good way to get yourself blocked—especially if you also get the www wrong. Sending an auto-referer for the root itself is similarly effective. Most of them don’t even get as far as the RewriteRules, though.
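The referer tests described here can be sketched in a few lines of Python. This is an illustration, not my actual access-control setup: it assumes the canonical hostname has no www, that referers arrive as full URLs with a scheme, and that you can supply your own set of pages genuinely linked from the root. All function names are invented for the example.

```python
from urllib.parse import urlsplit

SITE_HOST = "fiftywordsforsnow.com"          # canonical host, without www
OUR_HOSTS = (SITE_HOST, "www." + SITE_HOST)  # the www form is the "wrong" one


def _host_and_path(referer):
    """Split a referer URL into (lowercased host, path); an empty path counts as '/'."""
    parts = urlsplit(referer)
    return (parts.hostname or "").lower(), parts.path or "/"


def is_auto_referer(request_path, referer):
    """True when the referer is the requested page itself."""
    if not referer or referer == "-":
        return False
    host, path = _host_and_path(referer)
    return host in OUR_HOSTS and path == request_path.split("?", 1)[0]


def claims_root_sent_it(request_path, referer, linked_from_root=frozenset()):
    """True when the root is given as referer for a page the root does not link to."""
    if not referer or referer == "-" or request_path == "/":
        return False
    host, path = _host_and_path(referer)
    return host in OUR_HOSTS and path == "/" and request_path not in linked_from_root


def has_wrong_www(referer):
    """True when the referer names this site with a www it does not use."""
    host, _ = _host_and_path(referer or "")
    return host == "www." + SITE_HOST
```

Note that the root with and without its final slash is treated as the same page, so both forms of the referer are caught.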
The Current Faker
Calling yourself Googlebot seems to be going out of fashion. The one I saw most often didn’t even use the full UA string; it thought it could get away with the magic word alone:
Googlebot (gocrawl v0.4)
Even then, it hedged its bets; the “gocrawl” version alternated with
Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2
which counts as “could be better, could be worse” among humanoid UAs. All blocked, so no skin off my nose in any case.
Targeted Robots: Unauthorized
I talked earlier about robots that read an RSS feed, or the New Releases department of the Online Books Page, and come in asking for a file from that list. The Great Divide is whether they first ask for robots.txt. These don’t:
Mozilla/5.0 (compatible; PaperLiBot/2.1; http://support.
NewsBlur Content Fetcher - 61 subscribers - http://www.
Yes, it really says “61 subscribers” right there in the UA string. At least this month.
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 AppEngine-Google; (+http://code.google.com/appengine; appid: s~feedly-nikon3)
In the past they have had other, similar UAs, but for now they seem to have settled on this form. Each time they are presented with a new title, they request it over and over for a week or so and then lose interest.
Thanks but No Thanks
On second thought, it isn’t enough to ask for robots.txt. You also have to do what it says. That means:
IP: 188.8.131.52 (formerly at 184.108.40.206; they seem to have taken 2016 off, and returned at a new address)
Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://
Some robots request robots.txt only after getting the front page; this is one of them. But, since they proceeded to ask for the entire contents of a roboted-out directory, it becomes pretty academic.
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 - James BOT - WebCrawler http://cognitiveseo.com/bot.html
Mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.
IP: 22.214.171.124 and 126.96.36.199 (exactly)
ltx71 - (http://ltx71.com/)
I first became aware of this robot when it showed up requesting new ebooks. I later discovered that it also likes the page linked from my profile at Webmaster World; I just never noticed because it was always blocked for one reason or another.
Mozilla/5.0 (Windows NT 6.3; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0 (NetShelter ContentScan, contact email@example.com for information)
Mozilla/5.0 (compatible; WBSearchBot/1.1; +http://www.
We don’t want your kind around here
If they never even ask for robots.txt, how will they know what it says?
Gluten Free Crawler
Mozilla/5.0 (compatible; Gluten Free Crawler/1.0; +http://glutenfreepleasure.com/)
Crawls URLs it finds listed on other sites—including but not limited to the one given in the same profile ltx71 keeps following. As far as I can tell, the name is meant as a joke, not as referer spam.
Ancient history: Someone from this exact IP, though using a different name, asked for robots.txt on 24 November 2014. Apparently they’re still assimilating its contents; they’ve never asked for a fresh copy.
IP: 188.8.131.52 and 184.108.40.206 in alternation
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Request: HEAD /
(always with incorrect www) They’re persistent, I’ll give them that; every few days they’re rattling the doorknob again.
Xenu Link Sleuth/1.3.8
I don’t know and don’t especially care whether this is the actual Xenu; all I know is, I didn’t order it. (I don’t know about Xenu’s ordinary behavior. The w3 link checker requests robots.txt on each site that it visits, and goes away weeping if it doesn’t find authorization.)
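Sorting visitors into “asked first”, “asked eventually” and “never asked” is easy to do from logs. A rough Python sketch follows, assuming the requests have already been reduced to (IP, path) pairs in log order. Keying on IP alone is a simplification, since a distributed robot may fetch robots.txt from one address and crawl from another. The function name and labels are my own invention.

```python
def robots_txt_habits(requests):
    """Classify each IP by when, if ever, it requested /robots.txt.

    `requests` is an iterable of (ip, path) pairs in chronological order.
    Returns a dict mapping ip -> "first", "late" or "never".
    """
    asked_first = {}   # ip -> was its very first request for robots.txt?
    ever_asked = set()
    for ip, path in requests:
        is_robots = (path == "/robots.txt")
        if ip not in asked_first:
            asked_first[ip] = is_robots
        if is_robots:
            ever_asked.add(ip)

    habits = {}
    for ip, first in asked_first.items():
        if first:
            habits[ip] = "first"
        elif ip in ever_asked:
            habits[ip] = "late"    # e.g. front page first, robots.txt afterward
        else:
            habits[ip] = "never"
    return habits
```

The addresses you feed it are up to you; when testing, the reserved documentation ranges such as 192.0.2.x are a safe choice.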
And the Winner Is . . .
In the “Extra Stupid” category for April 2017:
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/
And your point is . . .?
Look again. The element “User-Agent:” is part of the UA string. The runner-up for the month is
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x
which similarly failed to read the Your New Robot instructions carefully enough.
On a similar note, I’ve recently met
Mozilla/5.0 (compatible; Please Name Your robot; +http://192.168.1.33:23481/yioop/bot.php)
Barring a last-minute upset, that looks like the winner for June 2017. (Disclaimer: It’s no relation to the real Yioopbot, currently on hiatus; they just lifted its code.) On top of everything else, the 192.168 address is in a private range that is not routable over the public Internet, meaning that you can’t get there from here.
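All three contestants can be caught mechanically. The following Python sketch checks a UA string for the blunders described above: the header name pasted into the value, version placeholders nobody filled in, a default robot name, and a contact URL pointing at a private address. The function name and the wording of the messages are invented for the example, and the placeholder test is deliberately crude.

```python
import ipaddress
import re
from urllib.parse import urlsplit


def ua_blunders(ua):
    """Return a list of human-readable problems found in a User-Agent string."""
    problems = []
    if ua.lower().startswith("user-agent:"):
        problems.append("header name inside UA")
    if re.search(r"\bx\.x\b", ua):  # matches "rv:x.x.x" and "Firefox/x.x"
        problems.append("unfilled version placeholder")
    if "please name your robot" in ua.lower():
        problems.append("default robot name")
    for url in re.findall(r"https?://[^\s;)]+", ua):
        host = urlsplit(url).hostname
        try:
            if host and ipaddress.ip_address(host).is_private:
                problems.append("private-range address in contact URL")
        except ValueError:
            pass  # the host is a name, not an IP literal
    return problems
```

A well-formed UA such as the standard Googlebot string comes back with an empty list.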
I said at the beginning that if a robot chooses to disregard robots.txt, it had better have a darn good excuse. What counts as a “darn good excuse” is purely an individual decision. A robot barging in uninvited from social media might lead to human visitors who would otherwise not have known what a great site you’ve got. If a robot is connected with something you consider a worthy cause, such as checking up on academic plagiarism, you might decide to turn a blind eye even though there’s no direct benefit to you. I, personally, consider archiving (as at the Wayback Machine) a great thing; there’s material I would never have found otherwise. Some people loathe it.
The one thing you really can’t do is penalize robots for not following every last syllable of your long, convoluted robots.txt file. To be considered “compliant”, it only has to understand this:
Everything else is gravy. A combined listing is a great time-saver:
Some of the robots on this page moved from the “bad” to the “good” side after I experimented and found that they only understand the single-name format.
Or take the timer, which can accompany your “Disallow:” lines: “Crawl-delay: 3”, meaning “Please space your requests at least three seconds apart.” Simple, straightforward, reasonable. The Googlebot knowingly ignores it—they’d prefer that you make your preferences known in Search Console—so you can hardly hold it against other robots if they disregard it as well.
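To see what a compliant robot would make of these three forms (the bare minimum, the combined listing and the timer), you can feed a sample file to the robots.txt parser in Python’s standard library. The file below is made up for illustration and is not this site’s real robots.txt; “SomeBot” and the /private/ directory are likewise hypothetical.

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt: a catch-all block with a timer,
# plus a combined listing that shuts out two robots by name.
SAMPLE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3

User-agent: Baiduspider
User-agent: Sogou
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# An unnamed robot falls under the catch-all block.
anyone_ebooks = parser.can_fetch("SomeBot", "/ebooks/")
anyone_private = parser.can_fetch("SomeBot", "/private/page.html")
anyone_delay = parser.crawl_delay("SomeBot")

# A robot named in the combined listing is shut out everywhere.
sogou_ebooks = parser.can_fetch("Sogou web spider", "/ebooks/")

if __name__ == "__main__":
    print(anyone_ebooks, anyone_private, anyone_delay, sogou_ebooks)
```

This only tells you what the file says. Whether a given robot actually behaves that way is a question for the logs.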
If you are just joining us . . . these are the words I assume you know:
- IP: The “address” of the visitor, a set of numbers that might represent a human ISP, or a wireless phone provider . . . or a search engine in Mountain View, or a server in Ukraine. Unlike all other pieces of information the visitor sends, you can normally assume the IP address is real.
- IP addresses can be falsified—but it’s not like forging the sender’s address on email (or paper mail), or putting a fake number into Caller ID. Faking the IP address on an Internet request is equivalent to ordering something by mail and giving a made-up shipping address. You’d only do it if you want someone to receive an embarrassing package, or if you’re trying to make the company go broke by sending out things the recipient didn’t order. (On the Internet, there are no gifts.) In order to receive the package, you have to give your real address.
- UA: Short for “User Agent”. With human visitors, that means their browser—the exact version number, along with the operating system, or the exact model of their smartphone.
- The UA can easily be faked; even ordinary browsers often have a User-Agent Switching option. So you will get a robot pretending to be the latest Chrome or Firefox, or calling itself Googlebot, because it thinks it will get better treatment that way.
- Referer: Don’t look at me; I didn’t codify the spelling. Loosely, the “Referer” is who sent you. If the visitor is human, that normally means the link they clicked to get to a page, whether in some other page or in a search engine. If they typed your address straight into the browser’s address bar, or they have your page bookmarked, there’s no referer.
- Most robots just read pages, like “athome.html”. Humans also need supporting files: images, stylesheets, scripts. But you don’t have to ask for all these things individually. The browser does it for you; that’s why it’s called a “User Agent”. In general, a browser will give the name of the original page as referer for all supporting files.
- Humans can choose not to send a referer, though this may cause trouble on some sites. You can also send a fake referer. When a robot does this, inserting the name of some site they’ve been paid to advertise, it’s called “Referer Spam”.
- An “auto-referer” is my own term for giving the requested page itself as referer. The idea is probably to avoid suspicion by making it look as if you’re already on the site. Instead, of course, it only makes it more obvious that you’re a robot.
- Logs: Server access logs, available to all website administrators except those on the cheapest of shared-hosting plans. (If that’s you, stop reading this page and go look for a new host.) Logs show all requests sent in to the server, along with the response. If you, as a human, are reading this page, my access logs might say something like:
220.127.116.11 - - [01/May/2017:12:23:45 -0700] "GET /fun/athome.html HTTP/1.1" 200 11555 "https://www.duckduckgo.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
18.104.22.168 - - [01/May/2017:12:23:46 -0700] "GET /sharedstyles.css HTTP/1.1" 200 2067 "http://fiftywordsforsnow.com/fun/athome.html" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
22.214.171.124 - - [01/May/2017:12:23:46 -0700] "GET /fun/miststyles.css HTTP/1.1" 200 1567 "http://fiftywordsforsnow.com/fun/athome.html" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
126.96.36.199 - - [01/May/2017:12:23:47 -0700] "GET /favicon.ico HTTP/1.1" 200 661 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
and so on.
- If, on the other hand, you are a malign robot and I don’t want you to see the page, logs might say curtly:
188.8.131.52 - - [01/May/2017:12:23:46 -0700] "GET /fun/athome.html HTTP/1.1" 403 3557 "http://disgusting-spammy-site.ru/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0"
- Distributed: A “distributed” robot doesn’t always come from the same address; it shares space on a lot of different servers, and might crawl from any of them. This can make it hard to tell if the robot is who it says it is, but there are usually other identifiers. Major operators like Bing and Google have known addresses that they use consistently.
- robots.txt: A text file tucked away in most websites, including this one. It has to be reachable by the exact name robots.txt so visitors know what to ask for. It tells robots which areas of the site are off limits, and may set special rules for some robots by name. Good, law-abiding robots will consult robots.txt before asking for anything else, and will respect its rules.
- A robots.txt file has no physical force. It’s the same as putting up a sign that says “Employees Only” or “No Admittance”. For people who can’t or won’t follow instructions, there are locks and barricades.
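If you want to pull the IP, request, response code, referer and UA out of log lines like the samples in the list above, a regular expression will do for a first pass. This Python sketch assumes the server writes the common Apache “combined” layout, which is what those samples appear to follow; the pattern name and field names are my own, and the address in the usage example is a reserved documentation address, not a real visitor.

```python
import re

# One log line: ip, two unused fields, [time], "request", status, size, "referer", "ua"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line doesn't fit the pattern."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        '203.0.113.7 - - [01/May/2017:12:23:45 -0700] '
        '"GET /fun/athome.html HTTP/1.1" 200 11555 '
        '"https://www.duckduckgo.com/" '
        '"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"'
    )
    print(parse_log_line(sample))
```

All fields come back as strings, including the response code, and malformed requests that don’t fit the pattern simply return None.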