At Home with the Robots

Search Engines

Sad but true: On the Internet, everything is assumed to be American unless they explicitly state otherwise. Hence .gov = the US government; .mil = the US military; .edu = institutions of higher learning in . . . well, North America at least. Almost everywhere you go, the most popular search engines are based in the US.

As you eyeball the various User-Agents, pause for a moment to look at the link contained in the UA string. This information is generally a sign of an honorable robot; it lets webmasters get more information about the would-be visitor. It is far less useful when the said link redirects to an entirely different page—especially when this in turn is a 404 Not Found.
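If you would rather pull that link out of a log than eyeball it, something like the following Python sketch should do. The function name `ua_link` and the regular expression are my own illustration, not anything a robot documents; it simply takes the first http or https URL in the string, with or without the leading plus sign.

```python
import re
from typing import Optional

# First http(s) URL in the string; stops at whitespace, ";" or ")".
LINK_RE = re.compile(r"\+?(https?://[^\s;)]+)")


def ua_link(user_agent: str) -> Optional[str]:
    """Return the first URL embedded in a User-Agent string, or None."""
    match = LINK_RE.search(user_agent)
    return match.group(1) if match else None


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(ua_link(ua))
```

Whether the link then redirects, or lands on a 404, is something you still have to check by hand.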

US Search Engines

Still Number One: Google

On most sites in most countries, you can expect the single largest robotic visitor to be:

IP: 66.249.64-79

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Googlebot-Image/1.0

Googlebot/2.1 (+http://www.google.com/bot.html)

As the name indicates, “Googlebot-Image” is strictly for image files. Now and then, the ordinary Googlebot will also request images—always with a referer, like a human.

The shorter Googlebot version may have been a mistake, or may be a failed experiment. Aside from one isolated visit in March 2019, I first saw it in November 2019. After that it visited sporadically for about a year, making its last appearance in December 2020. During that time, its sole job seems to have been requesting PDFs.

Protocol: HTTP/2 coming soon

In my youth there was a saying: Brazil is the country of the future . . . and always will be. As far as the Googlebot is concerned, HTTP/2.0 will always be the protocol of the future. To test the waters, a number of requests in late 2020, and a handful more in spring 2021, were made with 2.0. Since then they seem to have back-burnered the idea; I saw none at all in 2022.

Mobile Googlebots

It took until April 2016 for Google to realize that they had a perfectly good mobile OS of their own, and didn’t need to go calling themselves “iPhone”. This led to:

Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

. . .

Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

For several years, the mobile Googlebot clung firmly to that first Chrome version. Then, starting in early 2020, it jumped up to the current Chrome or something like it, starting with Chrome/78.0.3904.74. Since then it has moved forward periodically; at time of writing (April 2022) it is up to Chrome/99.
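To watch that version number creep upward in your own logs, a small Python sketch along these lines may help. The name `googlebot_chrome_major` is hypothetical, and the code assumes only what the strings above show: a "Googlebot/" token plus a "Chrome/NN." element.

```python
import re
from typing import Optional

# Major version is the digits between "Chrome/" and the first dot.
CHROME_RE = re.compile(r"Chrome/(\d+)\.")


def googlebot_chrome_major(user_agent: str) -> Optional[int]:
    """Return the Chrome major version claimed by a Googlebot UA, else None."""
    if "Googlebot/" not in user_agent:
        return None
    match = CHROME_RE.search(user_agent)
    return int(match.group(1)) if match else None
```

A UA that claims to be Googlebot but has no Chrome element (the plain "compatible; Googlebot/2.1" form) yields None rather than an error.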

Before the Android User-Agent came along, the mobile Googlebot went through a succession of iPhone names, probably starting in late 2011:

Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Can you spot the difference between the second (October 2013) and third (February 2014) versions? By 2014, smartphones had become mainstream; there was no longer a need to label yourself “Mobile”. The name “Googlebot-Mobile” did hold on a while longer with two older UAs:

DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

DoCoMo has been around since at least June 2011; SAMSUNG showed up a little later, in March 2012. Both disappeared around the end of October 2016.

Googlebots with Style

Around May 2018, Google quietly introduced a new robot, whose sole function seems to be requesting scripts and stylesheets, a job it now shares with the traditional Googlebot. The original UA was an unnumbered Safari, in use through January 2020.

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Safari/537.36

(If you meet any of this group of user-agents requesting a page file, or showing its face later than the beginning of 2020, it’s a faker, readily identified by its IP.)
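Identifying a faker by IP can be sketched in a few lines of Python. This assumes the crawl range listed earlier, 66.249.64-79, which works out to the single block 66.249.64.0/20; that is only the range seen on this site, not an official Google list, and the function name is made up for illustration.

```python
import ipaddress

# Assumption: 66.249.64-79 as observed here, i.e. 66.249.64.0/20.
GOOGLEBOT_NET = ipaddress.ip_network("66.249.64.0/20")


def looks_like_fake_googlebot(ip: str, user_agent: str) -> bool:
    """True if the UA claims to be Googlebot but the IP is outside the range."""
    claims_googlebot = "Googlebot" in user_agent
    in_range = ipaddress.ip_address(ip) in GOOGLEBOT_NET
    return claims_googlebot and not in_range
```

Note that the other Google ranges mentioned on this page (66.102 and 66.249.80-95) fall outside this block, so the check is only meant for things calling themselves Googlebot.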

But before too long, the scripts-and-stylesheets function shifted over to Chrome. It seems to follow the current Chrome release, starting with Chrome/78 in January 2020. As I write this (April 2022), it is up to Chrome/99:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/78.0.3904.74 Safari/537.36

. . .

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/99.0.4844.84 Safari/537.36

Also like the ordinary Googlebot, the stylebot eventually went mobile. In fact it looks exactly like the ordinary mobile Googlebot: unlike the non-mobile version, the mobile stylebot has no separate User-Agent.

Although stylesheets and scripts are by far the most common request, I’ve seen it picking up other kinds of non-page files—images, fonts, sounds, you name it—always giving the appropriate page as referer.

Behavior

For the last few years, all Googlebot requests for supporting files (.css, .js) have included a referer: the page the file “belongs” to. They have always done this sporadically; it looks as if it became standard practice in mid-March 2017.

Sending a referer with image requests—always using one of the non-image Googlebot variants—is less common. For a while this behavior almost vanished, dropping down to less than one in 20 requests. But 2021 saw a resurgence, with up to one in five image requests giving a referer.

When a site moves to HTTPS, Google will do a full, top-to-bottom spidering, exactly as if it were a brand-new site. This is in addition to, not a substitute for, any and all individual page requests that were redirected from their earlier HTTP shopping list.

To Say Nothing Of . . .

But wait, there’s more. Alongside the true Googlebot, there’s an ever-expanding list of other Googloid functions—including some I’ve never personally set eyes on, like the assorted AdSense-related crawlers. As I write, some of them include a link to a page with the complete list. Here I’ll only talk about the ones I have personally met.

IP: 66.102.6-7 and 66.249.80-95

I don’t know what they do with the rest of 66.102.0-63. Barring a handful of visits from 66.102.16, I have never seen them outside 66.102.6-9; until recently I rarely even saw them outside 66.102.6-7.

In alphabetical order:

Docs:

Mozilla/5.0 (compatible; GoogleDocs; apps-presentations; +http://docs.google.com)

Confession: I have no idea what this does. It only fetches images, and it’s very rare. Their web page leaves me none the wiser.

Favicon:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36 Google Favicon

This UA has certainly matured over the years. Originally they sent no User-Agent at all; later they called themselves Firefox 6, and since March of 2016 they’ve gone by Chrome/49. Unlike some search engines, Google doesn’t display a favicon next to each result. The favicon does show up whenever you list your sites in a Google property such as Google Search Console (the former Webmaster Tools), and quite possibly others that I don’t know about.

Image Proxy:

Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko Firefox/11.0 (via ggpht.com GoogleImageProxy)

Read-Aloud:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36 (compatible; Google-Read-Aloud; +https://support.google.com/webmasters/answer/1061943)

At some time in the distant past, this robot called itself “google-speakr” and-that’s-all. By any name, I never set eyes on it before late 2019 and have no idea how long it’s been around. Like other User-Agents containing the “Chrome/41” element, it is not subject to robots.txt rules, because it visits only in response to human activity.

SearchByImage:

Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.7; Google-SearchByImage) Gecko/2009021910 Firefox/3.0.7

Confession: I never knew this UA existed. Thanks to that Firefox/3, they have never seen anything but a 403. The UA, complete with “de” (barring a few ebooks, I have no German-language content), has existed since at least 2015. If they hadn’t come from a Google IP, I’d have assumed they were just another unwanted robot. The good news is that it is extremely rare: I count just ten visits in the past year.

Snippet:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 Google (+https://developers.google.com/+/web/snippet/)

This minor head-scratcher first showed up in June of 2018, generally requesting the same obscure ebook page, and always getting the door slammed in its face. Unlike everything else listed here, it may also show up from the 66.249.80-95 area, a non-crawl Google range.

Translate:

This doesn’t have a UA of its own; it just appends “,gzip(gfe)” (with comma, without space) to the human visitor’s UA string. The referer will be something involving “translate.googleusercontent.com”.
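If you analyze logs by browser, you may want the human's real UA back. A minimal Python sketch, assuming the suffix is always exactly ",gzip(gfe)" at the very end of the string as described above (the function name is my own):

```python
from typing import Tuple

# Suffix appended to the human visitor's UA: comma, no space.
GFE_SUFFIX = ",gzip(gfe)"


def split_translate_ua(user_agent: str) -> Tuple[str, bool]:
    """Return (UA without the suffix, whether the suffix was present)."""
    if user_agent.endswith(GFE_SUFFIX):
        return user_agent[: -len(GFE_SUFFIX)], True
    return user_agent, False
```

The second element of the result tells you whether the visit came through the translator, which you could cross-check against the referer.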

Web Preview:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/27.0.1453 Safari/537.36

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/41.0.2272.118 Safari/537.36

Once again, color me puzzled. I remember a few years ago the Google Search results always had an option for Preview, but I haven’t seen it in yonks, so I have no idea what this UA currently does. Around May of 2017, Google belatedly realized that “Chrome/27” was getting to be pretty implausible, and upgraded to Chrome/41, a version number that was abandoned by most humans in 2016.

The “Chrome/41” form remained in use until around May 2019, at which point it disappeared without a trace . . . until August 2020, when it showed up in brand-new togs:

Mozilla/5.0 (X11; Linux x86_64)  AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview)  Chrome/84.0.4147.108 Safari/537.36

Note the double spaces. This new version has been wildly active—thousands of visits as against previous years’ scant dozens—but it is no longer static. Much like the new script-and-stylesheet robot, it updates the Chrome version regularly. As I write this (spring 2022) it is up to

Mozilla/5.0 (X11; Linux x86_64)  AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview)  Chrome/99.0.4844.74 Safari/537.36

still with those distinctive double spaces.
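Those double spaces make the new form easy to match exactly. Here is a hedged Python sketch; the pattern is built only from the two strings shown above, the function name is hypothetical, and I spell the doubled space as " {2}" so it cannot quietly collapse in an editor.

```python
import re
from typing import Optional

# " {2}" means exactly two literal spaces, as in the post-2020 UA.
NEW_PREVIEW_RE = re.compile(
    r"^Mozilla/5\.0 \(X11; Linux x86_64\) {2}AppleWebKit/537\.36 "
    r"\(KHTML, like Gecko; Google Web Preview\) {2}"
    r"Chrome/(\d+)[\d.]* Safari/537\.36$"
)


def new_preview_chrome_major(user_agent: str) -> Optional[int]:
    """Chrome major version if this is the double-spaced Web Preview UA."""
    match = NEW_PREVIEW_RE.match(user_agent)
    return int(match.group(1)) if match else None
```

The older single-spaced Chrome/27 and Chrome/41 forms deliberately do not match.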

Formerly Known as Webmaster Tools

Several of the bigger search engines have Webmaster Tools so you can learn a little more about what the search engine is up to, and maybe even have some say in its behavior. Just to be different, Google renamed its version Google Search Console. Anything involving a voluntary action on your part is also listed here.

Page Speed Insights:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Page Speed Insights) Chrome/41.0.2272.118 Safari/537.36

Note that Chrome version; you’ll see it again.

Site Verification:

Mozilla/5.0 (compatible; Google-Site-Verification/1.0)

Shows up periodically on any site that has a GSC account.

Search Console:

Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Google Search Console)

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Search Console) Chrome/41.0.2272.118 Safari/537.36

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Search Console) Chrome/27.0.1453 Safari/537.36

I first saw this UA in May 2016. I don’t know exactly how old it is, because it comes only in response to a specific action on your part: “Fetch and Render” in the Fetch As Googlebot section of GSC. Like most googloid functions it is not subject to robots.txt; casual experimentation shows that if you request a page in a roboted-out directory, it will do the fetch with this UA, but won’t show the “What a Human Sees” render. Probably they don’t want to rub your nose in the fact that they’ve just fetched something robots aren’t supposed to see.

If that UA seems familiar, it’s because Preview and Page Speed Insights (above) are identical; only the names have been changed. They must have a sentimental fondness for elderly Chrome builds; as I write this, human Chrome is about to roll out 81. Like Preview, the Search Console moved up from 27 to 41 around the beginning of 2017.

The Android version is even newer. I first saw it in January 2019, and don’t yet know if it is meant as a replacement or only an alternative.

And Finally...

For a brief time in April-May of 2019, someone hit the wrong button, resulting in scattered visits from:

Mozilla/5.0 (Linux; Android 9.0.0; en-us; Pixel 3 XL Build/PD1A.180621.003) AppleWebKit/[WEBKIT_VERSION] (KHTML, like Gecko) Chrome/[CHROME_VERSION] Mobile Safari/[WEBKIT_VERSION] (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Really.

We Try Harder: Bing

IP: 40.77.167, 157.55.39, 207.46.13; 13.66.139; 23.103

These are not Bing’s full ranges; in fact about half the Internet—at least on the IPv4 side—seems to be registered to Microsoft. But the first three are their hands-down favorites; the rest are far less common. Still rarer are:

IP: 52.162.161, 52.240; 65.55.210; 131.253; 191.232.136; 199.30.16-31

Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Yup, you read that right: Bing is using an iPhone UA. Ha. Ha. Ha.

In mid-2022 Bing decided to follow Google’s example and add a “Chrome” element, complete with version number, to their UA string:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/100.0.4896.127 Safari/537.36

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36

The first was seen sporadically in May-July 2022, the second from August 2022 on. The change is described in an April 2022 blog entry which says, among other things,

By regularly updating our web page rendering engine to the most recent stable version of Microsoft Edge we will be making the above user agent strings evergreen. . . . We will stop using our historical user-agent by Fall 2022.

Do not ask what Chrome has to do with Edge, but note the utter absence of the element “Edg”. In any case the final sentence is, er, factually incorrect, since the “historical user-agent” remains the most commonly seen, as of February 2023. To date, I have only seen the two exact version numbers shown above. And I have yet to see—ever—the touted Android version, given in the blog entry as:

Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

One final snicker before we leave the user-agent: note the URL given in all versions of the UA string. Not only does it redirect to https, it redirects to an entirely different URL, currently https://www.bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0.

Behavior

Unlike Google, Bing uses the same robot for both pages and images. I have never seen a request with referer; apparent exceptions turn out on closer inspection to be fakers. The mobile bingbot similarly makes requests of all kinds—except that it never asks for robots.txt in its own name. That job is left to the ordinary bingbot, which ought to be quite good at it; for many years, Bing was the Abou ben Adhem of robots.txt requests. Recently they seem to have cut back on their appetite, although they still request robots.txt far more than the Googlebot.

Protocol: HTTP/2

While Google continues to talk about moving to HTTP/2 at some time in the unspecified future, Bing quietly went out and did it. After a trial run near the end of 2020, the bingbot started making regular HTTP/2.0 requests in March 2021. By the end of May they had entirely changed over, reserving HTTP/1.1 for robots.txt alone.
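To check a split like this in your own logs, you can tally protocol against request type. The Python sketch below assumes an Apache-style combined log, where the request appears as "GET /path HTTP/x.y" in quotes and the UA is somewhere on the same line; the function name and the log lines in the test are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# The quoted request field: method, path, protocol.
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) (HTTP/[\d.]+)"')


def protocol_tally(log_lines: Iterable[str], ua_fragment: str) -> Counter:
    """Count (kind, protocol) pairs for lines containing ua_fragment."""
    tally: Counter = Counter()
    for line in log_lines:
        if ua_fragment not in line:
            continue
        match = REQUEST_RE.search(line)
        if match:
            path, protocol = match.groups()
            kind = "robots.txt" if path == "/robots.txt" else "other"
            tally[(kind, protocol)] += 1
    return tally
```

Calling it with "bingbot/2.0" as the fragment should show whether HTTP/1.1 really is confined to robots.txt on your site.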

An ongoing Bing quirk is to request URLs in all-lower-case form: The page is PageName.html but they ask for pagename.html. They’ve been doing it for years—I checked back to 2016 before getting bored with the exercise—so I don’t suppose they will stop any time soon. Now and then they give up for a few months, but they always come back.
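If you want to know which real page a lower-cased request was aiming at, a case-insensitive lookup is enough. This is only a sketch in Python with hypothetical names; what you do with the answer (redirect, serve, or shrug) is up to you.

```python
from typing import Iterable, Optional


def canonical_path(requested: str, known_paths: Iterable[str]) -> Optional[str]:
    """Find the real path matching a request, ignoring letter case.

    An exact match wins; otherwise the first known path whose
    lower-cased form equals the lower-cased request is returned.
    """
    known = list(known_paths)
    if requested in known:
        return requested
    wanted = requested.lower()
    for path in known:
        if path.lower() == wanted:
            return path
    return None
```

For example, a request for /pagename.html against a site that has /PageName.html comes back as /PageName.html.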

As noted above, the link in the bingbot’s user-agent string redirects. From a past version of this page, I learned that there exists—or once existed—a second mobile bingbot:

Mozilla/5.0 (Windows Phone 8.1; ARM; Trident/7.0; Touch; rv:11.0; IEMobile/11.0; NOKIA; Lumia 530) like Gecko (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

I’ll have to take their word for it, since I have never personally met this version. But the page does seem to have been updated within the present geological era, since the most recent version no longer lists . . .

Requiescat in Pace

As recently as a few years ago, The Robot That Will Never Die was responsible for around 10% of bingbot requests:

msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)

In spite of the “media” in the name, requests in its final years were exclusively for pages. But it has gone to its well-earned rest; I haven’t set eyes on it since May 2017.

Back from the Dead

. . . and then, in October 2019, an even older Bing entity rematerialized:

msnbot/2.0b (+http://search.msn.com/msnbot.htm)

Requests covered all possible filetypes, including pages. It was surprisingly active for about half a year, before folding up its tents and disappearing again at the end of April 2020.

But Also . . .

And then there’s Bing Preview. In addition to the various bing/msn crawl ranges, it also shows up from

IP: 65.55.210, 131.253.25-27, 199.30.24-25

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534+ (KHTML, like Gecko) BingPreview/1.0b

Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 BingPreview/1.0b

I’m not clear what this UA actually does. I don’t believe it is a true preview; the requests don’t come in packages (page, supporting files, images) like a human. It may be Bing’s version of a Mobile-Friendliness tester.

BingSiteAuth

Unlike Google’s Site Verification, Bing’s wears plain clothes:

IP: 40.77, 131.253

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246

It never requests anything but /BingSiteAuth.xml. Its visits may be linked to interactions with Bing Webmaster Tools, which explains why I haven’t set eyes on it since early 2019. On that final visit, it set aside the plain clothes and instead wore the Bing Preview user-agent.

Yahoo! Slurp

IP: 68.180.228-230; 72.30; 74.6

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

In a pattern that should by now be familiar, that link in the user-agent string redirects before arriving at “Why is Slurp crawling my page?”

In March 2016, Yahoo! Slurp suddenly started requesting stylesheets, always with the appropriate page as referer (like the Googlebot); in May of 2017 they stopped as suddenly as they had started. Meanwhile, by the end of 2016 they had entirely stopped asking for images. Either they’ve got a very long visual memory or they’ve been sneaking in under an alias, because to this day—I’m typing this in December 2019—Yahoo image search is still sending humans.

Currently the robot seems to be on vacation. I haven’t seen Slurp since October of 2018, barring a few brief visits in August 2019. Odder still, their only requests those times were for images. They must have changed their headers, because all requests were roundly blocked. And still the image search referers keep coming.

DuckDuckGo

IP: 54.208.102.37, 107.21.1.8

Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)

For many years, they used only the 107.21 address. Now that they’re becoming more popular, they’ve had to expand—always to one of the two, down to the last digit.

DuckDuckGo does not crawl websites; it uses other robots’ crawl data and applies their own algorithm. If a page comes up in a DuckDuckGo search, the Favicons-Bot comes by to pick up the site’s favicon in order to display it next to the search result. In spite of the name, they request the root (the site’s front page) first; if the page is blocked they won’t request the favicon. If your access-control rules involve looking at the referer, this may entail some hole-poking.

DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)

I can’t absolutely swear that that’s what the real DuckDuckBot looks like. For some reason, a group of referer-spam fakers have decided to use this UA string; as far as I know, I’ve never met the real thing.

National Search Engines

Most parts of the world are content to google and be done with it. But some countries have their own search engines that get most of the business. How often each one visits this site may or may not have anything to do with the search engine’s overall size and distribution.

This section is alphabetized by country.

Czech Republic: Seznam

I read recently that the country has officially changed its name to Czechia—but everyone who lives there hates it, which would seem to be a drawback. Seznam’s website says Czech Republic, so we’ll stick with that.

Google Translate, incidentally, says that the Czech word seznam means “list”. They have always been fond of this site; not sure why, since I don’t have a single word of Czech-language content, and human visitors sent by Seznam can be counted on your fingers.

IPv4: 77.75.76-79
IPv6: 2a02:598

As you might expect from a robot living in RIPE territory—where IPv4 addresses were rationed out in /22 segments and are now entirely used up—Seznam was quick to grab an IPv6 range when it became available. To date I’ve only met them from 2a02:598:2 and 2a02:598:a, in equal amounts, but they own the whole /32 because why wouldn’t they. A behavioral quirk is that they always use the IPv4 address to request robots.txt, even if the rest of the request will come from IPv6.

Mozilla/5.0 (compatible; SeznamBot/3.2; +http://napoveda.seznam.cz/en/seznambot-intro/)

This UA was rolled out in May 2016; for many years before that, it was

SeznamBot/3.0 (+http://fulltext.sblog.cz/)

followed (from February 2014) by

Mozilla/5.0 (compatible; SeznamBot/3.2; +http://fulltext.sblog.cz/)

I can only conjecture that, like Apache, Seznam doesn’t care for odd numbers. In any case the new version must be considered an improvement, since they now link to a page in English.

Rarely, there are experimental variants:

Mozilla/5.0 (compatible; SeznamBot/3.2-test1; +http://napoveda.seznam.cz/en/seznambot-intro/)

Mozilla/5.0 (compatible; SeznamBot/3.2-test1-1; +http://napoveda.seznam.cz/en/seznambot-intro/)

Like Bing and Google, Seznam has a preview:

Mozilla/5.0 PhantomJS (compatible; Seznam screenshot-generator 2.1; +http://fulltext.sblog.cz/screenshot/)

This version has been around since mid-2015; before that, it was

Mozilla/5.0 (compatible; Seznam screenshot-generator 2.1; +http://fulltext.sblog.cz/screenshot/)

Maybe one of these years they will notice that the preview’s User-Agent still references the old “fulltext.sblog” page, even while the regular robot has updated to “seznambot-intro”.

Japan: Yahoo

IP: 182.22.30

This is a brand-new robot, first seen in late July 2023. In fact I had to pore over archived logs to confirm that I’d never seen it before, because “Yahoo Japan” sounds so familiar.

Requests follow a distinctive pattern: each page is followed by no more than one script or stylesheet. If the first stylesheet associated with a given page is one the robot has already seen, it continues down the list to see if anything is unfamiliar. As usual with search-engine crawlers in recent years, any requests for a supporting file give the original page as referer.

Mozilla/5.0 (compatible; Y!J-WSC/1.0; +https://yahoo.jp/3BSZgF)

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Y!J-WSC/1.0; +https://yahoo.jp/3BSZgF) Chrome/113.0.0.0 Safari/537.36

The shorter UA is for pages and robots.txt; the longer one is for scripts and stylesheets. So far I haven’t seen it requesting images.

Korea: Daumoa

IP: 203.133.168-171, rarely 203.133.174

They may occasionally crawl from other addresses; the whole 203.133.160-191 sector belongs to Daum.

Mozilla/5.0 (compatible; Daum/4.1; +http://cs.daum.net/faq/15/4118.html?faqId=28966)

Mozilla/5.0 (Unknown; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) Safari/538.1 Daum/4.1

Daumoa first showed up in response to an RSS feed; since then they have sometimes wandered further. They’ve got a few other user-agents, but the “faqId” one is their current favorite. Among other things it handles all robots.txt requests for the whole family.

Daum’s particular quirk—every robot has one—is a reluctance to accept redirects. I will often find up to half a dozen HTTP requests for the same page on a single day, all duly redirected . . . but when I counter-check the HTTPS logs, there will just be one request, and sometimes none at all.

Korea: Yeti and Linespider

IP: 125.209.235

Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/bot)

Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.0 Safari/537.36 (compatible; Yeti/1.1; +http://naver.me/spd)

April 2017 marked the first Yeti sighting since . . . drumroll . . . July of 2014. That was over at my old site, Lucy’s Worlds, which they used to visit all the time; they’d never before set foot on the present site. They used to change IP every year or so, but this one’s been the same since mid-2013.

The shorter UA string last visited in October 2019 . . . but watch as the plot thickens.

Introducing Linespider

Near the end of October 2019, a brand-new and seemingly unrelated entity started showing up:

IP: 203.104.154

Mozilla/5.0 (compatible;Linespider/1.1;+https://lin.ee/4dwXkTH)

(Yup, no space after either of the semicolons. In some quarters, that’s enough to get yourself blocked.) They requested robots.txt and assorted pages, and made no trouble.
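For the curious, a rule of that kind can be as simple as the following Python sketch. It flags any semicolon followed directly by a non-space character; the name is hypothetical, and I am not claiming any particular firewall works this way.

```python
import re

# A semicolon immediately followed by something other than whitespace.
TIGHT_SEMICOLON_RE = re.compile(r";(?=\S)")


def has_tight_semicolons(user_agent: str) -> bool:
    """True if any semicolon in the UA lacks a following space."""
    return TIGHT_SEMICOLON_RE.search(user_agent) is not None
```

The Linespider string trips it; the ordinary Yeti string does not.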

And then, shortly afterward, from the same IP on the same visits, came the new, expanded Yeti. These two crawlers work in tandem. Linespider gets pages; Yeti—the long version—gets stylesheets belonging to the pages.

Russia: Yandex

Yandex is a Russian company, but they’re also big in Turkey. Sometimes they’ll come in expressing a preference for material in Turkish, instead of the more common Russian.

As we speak, Yandex’s distinguishing trait is their sheer range of IPs. They’re not distributed; they just own a whole lot of small, widely separated ranges. For the last few years their hands-down favorite—accounting for at least two-thirds of all requests—has been exactly

IP: 141.8.144.18

But they have also used:

5.45.192-255
5.255.192-255
37.140.128-191
77.88.0-63
84.201.128-191
87.250.224-255
93.158.128-191
95.108.128-255
100.43.64-95
130.193.32-71 (i.e. 32-63 and 64-71)
141.8.128-191
178.154.128-255
199.21.96-99
199.36.240-243

as well as

IPv6: 2a02:6b8:b000::/52

199.21 is Yandex’s ARIN range (North America). On my sites they use it randomly alongside their various European addresses; it may behave differently on sites that have region-based access controls. The IPv6 address is extremely rare; I’ve only seen it requesting a handful of images, most recently in June 2019.
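The list above gives each range as a span of third octets, while access-control rules generally want CIDR blocks. Python's standard `ipaddress` module can do the conversion; the helper name below is my own, and the ranges themselves are only the ones seen on this site.

```python
import ipaddress
from typing import List


def third_octet_range(prefix: str, low: int, high: int) -> List[str]:
    """Turn e.g. ("130.193", 32, 71) into the CIDR blocks covering
    130.193.32.0 through 130.193.71.255."""
    first = ipaddress.ip_address(f"{prefix}.{low}.0")
    last = ipaddress.ip_address(f"{prefix}.{high}.255")
    return [str(net) for net in ipaddress.summarize_address_range(first, last)]


if __name__ == "__main__":
    print(third_octet_range("130.193", 32, 71))
```

The 130.193 entry is the only one in the list that needs two blocks, which is what the "(i.e. 32-63 and 64-71)" note is getting at.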

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)

Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B411 Safari/600.1.4 (compatible; YandexMobileBot/3.0; +http://yandex.com/bots)

Like many search engines, it consistently sends a referer when asking for supporting files such as scripts and stylesheets.

The mobile bot is comparatively rare. It asks for pages and supporting files (css, js, fonts) but no images. Notably, it persists in asking for piwik.js even though this file lives in a roboted-out directory. Recently it has become rarer still; in the whole year 2023 I only saw it three times.

Another Yandex User-Agent showed up in March 2019. (It may be older; this is when I first saw it.)

Mozilla/5.0 (compatible; YandexAccessibilityBot/3.0; +http://yandex.com/bots)

At some time when I wasn’t paying attention—probably in late 2019—Yandex introduced an alternative user-agent involving Chrome. Unlike Google, the version number isn’t continuously updated. So far there have been three:

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.268

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0

The first, with Chrome/51, was in use until around October 2020; Chrome/81 then took over until March 2023, when it was replaced by the current Chrome/108. Confession: I only noticed it at the end of 2023, because “Chrome/108.0.0.0” is unfortunately a favorite among certain humanoid robots.

Yandex seems to like HTTPS. Once they see that a given site is available securely, they tend to change their entire shopping list. Even pages that were redirected years ago, pages that never existed as HTTPS, will generally be requested as HTTPS. With rare exceptions, HTTP requests are limited to the front page, probably just to confirm that the redirect is in place.

Make that: HTTP page requests. Aside from sitemap.xml, I also see the occasional HTTP image request; these will be preceded by the YandexBot requesting robots.txt on the HTTP side.

And yes, the link in their UA string redirects to HTTPS. Ahem.

Russia: Mail.RU

IP: 95.163.248-255, 217.69.133

In years past, 217.69 was the only IP they used; more recently, 95.163 has been their favorite. At time of writing (summer 2022) they seem to have narrowed in on 95.163.255.

Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)

Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/Robots/2.0; +https://help.mail.ru/webmaster/indexing/robots)

Mozilla/5.0 (compatible; Linux x86_64; Mail.RU_Bot/2.0; +https://help.mail.ru/webmaster/indexing/robots)

Linking to a web page in your UA string is generally considered A Good Thing, but, er, it only works if people can read it. Faute de mieux, I’ve always assumed they are a search engine. (The element /indexing/ in the newer UAs does reinforce the hypothesis.) Rather a low-budget one: they do a biggish crawl every few months, at which point they show up on my Redirects lists requesting ancient pages that everyone else has long since got sorted to their correct URLs. Requests are almost exclusively pages. Exceptions are interesting, but only if you know the site.

The first listed UA was in regular use through March 2022, with sporadic reappearances into May. The second one seems to have been a goof on their part—note the extra /Robots/ in the middle—showing up occasionally from April to June 2022. Since the beginning of May 2022, the third form has been the clear preference.
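For anyone who wants to sort log entries by which of the three forms turned up, here is a minimal Python sketch. The function name and the labels "old", "robots-variant" and "new" are made up for illustration; the pattern is built only from the three UA strings quoted above and may well miss variants that never visited this site.

```python
import re

# Pattern derived from the three Mail.RU UA strings listed above.
# The optional "Robots/" element is the one seen only in the second form.
MAILRU_RE = re.compile(
    r"Mail\.RU_Bot/(?P<robots>Robots/)?2\.0; \+(?P<link>[^)]+)\)"
)

def classify_mailru_ua(ua):
    """Return a hypothetical label for a Mail.RU UA string, or None."""
    m = MAILRU_RE.search(ua)
    if m is None:
        return None  # not one of the three known forms
    if m.group("robots"):
        return "robots-variant"  # the apparent goof with /Robots/ in the middle
    if m.group("link").startswith("http://go.mail.ru/"):
        return "old"  # first listed UA, with the go.mail.ru link
    return "new"  # third listed UA, with the help.mail.ru link
```

Tallying these labels by month would be one way to reproduce the timeline described above.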

The URL in the old UA string redirects to
https://help.mail.ru/webmaster/indexing/robots/types_robots
which, as the directory structure suggests, is linked from the page given in the new UA string. Not that it makes an awfully big difference when the content is all in Russian and all I can do is spell out роботы (“r-o-b-o-t plus some vowel or other”).

A quirk of protocol: Starting in early 2015 and continuing until the present, Mail.RU has used HTTP/1.0 for robots.txt requests, 1.1 for everything else. Like Yandex, they otherwise use HTTPS for everything, including long-gone pages that never existed in that form. Well, if I were a Russian robot, I’d probably try to keep my contacts secure too.

Vietnam: Coccoc

IP: 103.131.71

Their former IP, 123.30.175, seems to have been discontinued.

Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine)

Mozilla/5.0 (compatible; coccocbot-image/1.0; +http://help.coccoc.com/searchengine)

That, at least, is their all-ASCII Internet name; it’s really Cốc Cốc with plentiful diacritics. Although they call themselves a search engine, they’ve never done a top-to-bottom spidering. Instead, when they learn of the existence of some particular page, they come in and ask for it, along with all its associated images.

The last time I did a systematic count, fully 65% (roughly two-thirds) of Coccoc’s requests were for robots.txt, implying that sometimes they never got around to asking for a page at all. On closer inspection, the number dropped below 50% on this site—typical for a small robot that requests just one or two files per visit, always accompanied by robots.txt. The statistical weirdness is due to my personal site, where they will typically make from four to six robots.txt requests on the HTTP side, and never make it over to HTTPS at all. Only time will tell if they start doing the same thing on this site, now that it too has gone HTTPS.
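The arithmetic behind that kind of count is easy to sketch in Python. The function name is hypothetical, and the request lists in the usage note are invented to show the two patterns described above; they are not taken from any real log.

```python
def robots_share(paths):
    """Return the fraction of requested paths that were for /robots.txt."""
    if not paths:
        return 0.0
    # Ignore any query string when comparing the path.
    hits = sum(1 for p in paths if p.split("?", 1)[0] == "/robots.txt")
    return hits / len(paths)
```

A visit consisting of robots.txt plus a single page gives robots_share(["/robots.txt", "/page.html"]) = 0.5, while a visit that asks for robots.txt five times and nothing else gives 1.0. A mix of the two kinds of visit is enough to push an overall figure well past one half.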

Not Welcome Here

On some sites, Chinese search engines might be perfectly legitimate and even desirable. Me, I want no part of ’em.

Both Baidu and Sogou request robots.txt on a regular basis—but until quite recently they ignored what it said. This remained true even after I gave each one a Disallow block to itself, on the off chance that they were simply too primitive to understand a continuous listing.
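To make the two layouts concrete, here is a small Python sketch using the standard library's urllib.robotparser. The robots.txt fragments are reconstructions for illustration, not the site's actual file, and the tokens "Baiduspider" and "Sogou web spider" are assumptions based on the UA strings listed further down. A compliant parser treats both layouts the same way, which is rather the point.

```python
from urllib.robotparser import RobotFileParser

# One Disallow block per robot (illustrative, not the site's real file).
SEPARATE = """\
User-agent: Baiduspider
Disallow: /

User-agent: Sogou web spider
Disallow: /
"""

# The same rules as a continuous listing: several robots, one Disallow.
CONTINUOUS = """\
User-agent: Baiduspider
User-agent: Sogou web spider
Disallow: /
"""

def allowed(robots_txt, agent, path="/"):
    """Ask the standard-library parser whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

Note that can_fetch matches on the product token, so it should be given "Baiduspider" rather than the full "Mozilla/5.0 (compatible; ...)" string.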

Baidu

IP: 180.76.15; 123.125.71, 220.181.108

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2

Even when requesting robots.txt alone, the Firefox 6 alias is more common than the Baiduspider name. This does the robot no favors, since it means it will be served an alternative robots.txt that simply disallows everyone.

At some point in 2017, Baidu finally saw the error of its ways and largely stopped asking for pages. Whew. The only thing better than a blocked request is a request that is not made in the first place. Less work for the server; less aggravation for the webmaster. Currently they forget themselves every third visit or so, and request pages even after finding themselves comprehensively denied.

Like the Googlebot, Baiduspider is popular with spoofers—for all the good it does them. Once I even saw a robot from an ARIN (North America) range professing to be

compatible;Baiduspider/2.0; +http://www.baidu.com/search/spider.html

[sic] What’s better than faking your UA? Claiming to be something that would be banned in its own right.
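A rough way to flag this kind of visitor, sketched in Python: compare the claimed name against the IP prefixes listed above. The function name is hypothetical, and the prefix list only reflects addresses seen on this site, so a mismatch is a hint rather than proof.

```python
# Prefixes taken from the Baidu IP list above; not an authoritative list.
BAIDU_PREFIXES = ("180.76.15.", "123.125.71.", "220.181.108.")

def is_probable_baidu_spoof(ip, ua):
    """True if the UA claims to be Baiduspider but the IP is not a listed one."""
    claims_baidu = "baiduspider" in ua.lower()
    return claims_baidu and not ip.startswith(BAIDU_PREFIXES)
```

The address 192.0.2.10 used in testing this sketch is from a range reserved for documentation; it stands in for the unnamed ARIN address.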

AspiegelBot and PetalBot

IP: 114.119.128-167

Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; AspiegelBot)

Mozilla/5.0 (compatible;AspiegelBot)

Mozilla/5.0 (compatible;PetalBot;+https://aspiegel.com/petalbot)

Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)

The listed IP is part of a wider range, 114.119.128-191, belonging to Huawei Singapore. At first they used mainly 160-167; lately they seem to have moved down the block, extending all the way down to 128, but still not going above 167.
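Expressed as code, the wider range 114.119.128-191 is the single CIDR block 114.119.128.0/18, and the part actually seen in use stops at a third octet of 167. A minimal Python sketch, with a hypothetical function name and made-up labels:

```python
import ipaddress

# Third octet 128-191 is exactly one /18 block.
HUAWEI_BLOCK = ipaddress.ip_network("114.119.128.0/18")
HIGHEST_OBSERVED_OCTET = 167  # per the observations above

def petalbot_zone(ip):
    """Classify an IPv4 address relative to the range described above."""
    addr = ipaddress.ip_address(ip)
    if addr not in HUAWEI_BLOCK:
        return "outside"
    third_octet = addr.packed[2]
    return "observed" if third_octet <= HIGHEST_OBSERVED_OCTET else "unobserved"
```

An address that lands in "unobserved" still belongs to the same owner; it simply falls in the part of the block this robot has not been seen using.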

The second UA was rare; both AspiegelBot versions disappeared entirely in May 2020. But it was only a name change: they were replaced by PetalBot, coming from the same IP range. (Note the fondness for missing spaces after a semicolon.) The last version, dropping the “aspiegel” element entirely, showed up in April 2021.

An unexpected feature of this robot is that it appears to be fully robots.txt compliant. That doesn’t mean it likes what it sees; I’ve seen up to 40 robots.txt requests in a single day, perhaps hoping that if they keep asking they will get a different response.

Sogou

IP: 36.110, 106.38.241, 106.120.173, 123.126.113, 220.181.124

Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)

Currently Sogou doesn’t seem to be interested in much but the /ebooks/ directory, which strongly suggests it is picking up links from some outside source. Unlike Baidu, it continues not to understand what “Disallow” means. Or rather, it understands but pretends not to, since its page requests always involve some humanoid UA such as an Android.

Yisou

IP: 42.120.160-161, 42.156.136-139, 106.11.152-159

All these ranges seem to belong to Alibaba.

YisouSpider

Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36

The short version is used only for robots.txt requests, where it finds itself disallowed. Not that this affects its future behavior. In addition to pages—where it is comprehensively blocked—it always requests the stylesheet belonging to the site’s 403 page.

Three Headscratchers

Bing and Yandex don’t have any connection that I know of—but they’re both associated with one near-identical behavior:

Visitor comes in with generally humanoid headers and requests a page, sometimes giving the appropriate search engine as referer. They request scripts, stylesheets, fonts—in short, all supporting files except images and favicon. All three entities pick up piwik.js (the script that tells them what my analytics is looking for), though only one of them acts on it by requesting piwik.php (the actual analytics). It may or may not be relevant that my piwik installation lives on a different site. If I kept my fonts on a separate site—or used third-party fonts such as Google’s—would they still be requested?

Although I haven’t set eyes on any of this trio in over a year, I’m keeping them on this page pending further developments.

Drake Holdings

IPv4: 204.79.180
IPv6: 2607:f298:5:105b::99a:f41e

The range belongs to an outfit called Drake Holdings which has some nebulous connection to Microsoft, hence my identifier for the robot.

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0;  Trident/5.0)

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;  Trident/5.0)

Last seen: June 2018

That’s two  spaces before the second “Trident”. Any given visit uses one UA or the other, but apart from that they’re random. Some HTML requests have no referer; some have Bing search, with a plausible search string for the requested page. Not always right, but always understandable. Requests for supporting files have human-type referers. Unlike the other two members of the head-scratcher trio, it requests piwik.php, meaning that it acts on javascript. In fact, its IPv6 address was only used for piwik requests.

Visitors from this IP were active for about two and a half years, beginning in November 2015 and ending abruptly in mid-June 2018. But read on.

Plainclothes Bingbot

I gave it this name because it comes from assorted known Bing/MSN ranges, but always with a humanoid user-agent and headers.

IP: 65.55.211-218
131.253.24-36
23.101.169.3, 23.100.232.233
52.162.211.179, 52.162.213.79; 52.162.161.148

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;  Trident/5.0)

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0;  Trident/5.0)

Last seen: October 2019

(i.e. identical to the two Drake Holdings user-agents, right down to the double space, its distinguishing feature).
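Since the double space is the giveaway, it is worth matching on it explicitly. Below is a minimal Python sketch; the function name is hypothetical and the pattern covers only the two UA strings quoted above. It assumes the log keeps the UA exactly as sent, since anything that collapses runs of whitespace would erase the one feature being tested.

```python
import re

# " {2}" spells out the two spaces before the second "Trident".
DOUBLE_TRIDENT_RE = re.compile(
    r"Mozilla/5\.0 \(compatible; MSIE 9\.0; Windows NT 6\.[01]; "
    r"Trident/5\.0; {2}Trident/5\.0\)"
)

def has_double_trident(ua):
    """True if ua is exactly one of the two double-space UAs listed above."""
    return DOUBLE_TRIDENT_RE.fullmatch(ua) is not None
```

A UA with Trident/5.0 twice but only a single space between them fails the check.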

For many years the plainclothes bingbot used only the first two listed IP ranges, especially 131.253.25. In June 2018 they added the 23.101 address—always the same, down to the last digit—apparently replacing the former Drake addresses. In September 2018 came the 52.162 group. Most recently, in March 2019, came the 23.100 address—again, down to the last digit—apparently replacing the former 23.101.

All of these have been getting 302 redirected to a custom page that has served different purposes over the years; one of its functions is to intercept humans who accidentally behave like robots. Or, as the case may be, robots who intentionally behave like humans.

Final oddity: Some requests come in with a referer, suggesting that they’re checking up on Bing search—providing me with ongoing fodder for the Search sections of The Panda Page. With-referer searches always come from the same IP: initially 23.101.169.3, later 23.100.232.233. (The reverse is not true. These same IPs also make referer-less requests, often on the same visit.)

At one time, I could expect to see the plainclothes bingbot almost daily, often several times a day. Then, near the end of October 2019, they stopped as suddenly as they had started. Coincidentally or otherwise, they disappeared just days before the site changed to HTTPS.

Yandex Referers

IP and UA: various, as expected for humans in Russia

Referer: http://yandex.ru/clck/jsredir?from=yandex.ru%3Bsearch%3Bweb%3B%3B&text=&etext=

Last seen: February 2020

. . . et cetera, et cetera, with an enormously long string of garbage that appears to be identical to a genuine Yandex referer. Each request is followed or preceded within 24 hours by an apparently human visit (with images and piwik, without favicon) to the same page. Different IP and UA, but Yandex referers are infrequent enough that I can easily pick them out.
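Picking these referers out of a log can be sketched with Python's urllib.parse. The function name is hypothetical, and the check uses only the visible start of the referer quoted above (host, path, and the "from" parameter); whatever follows "etext=" is ignored. It cannot tell a genuine Yandex referer from an imitation, which is exactly the difficulty described here.

```python
from urllib.parse import urlsplit, parse_qs

def looks_like_yandex_redirect(referer):
    """True if the referer has the yandex.ru/clck/jsredir shape quoted above."""
    parts = urlsplit(referer)
    if parts.hostname != "yandex.ru" or parts.path != "/clck/jsredir":
        return False
    # keep_blank_values so that empty "text=" and "etext=" are not dropped.
    query = parse_qs(parts.query, keep_blank_values=True)
    # %3B decodes to ";", giving "yandex.ru;search;web;;".
    return query.get("from") == ["yandex.ru;search;web;;"]
```

Pairing each match with the apparently human visit to the same page within 24 hours would still need timestamps from the log, which this sketch does not handle.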

This entity became much less common towards the end of 2019, and now seems to have disappeared entirely.