Crawling
Why crawling?
A search engine needs to know all the relevant information on a website in order to understand its level of importance, so that it can rank it accordingly.
In order to get the information from the website, the search engine uses a technique called crawling.
The crawler is basically a piece of software that, at its best, pretends to be a human browsing the site. But in contrast to a real human, who on average only visits 5-10 pages before leaving the site, it tries to visit every single page of the website.
All this information is then sent back to the search engine and its search index.
In short, the crawler is simply the transport guy: it is clearly instructed by the search engine to visit a specific website and to bring back all the relevant information it finds by following the links on the website.
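To make this a little more concrete, here is a minimal sketch in Python of what such a crawler does. The start page https://example.com, the User-Agent name «ExampleBot/1.0» and the page limit are made up for illustration - real crawlers are of course far more sophisticated, respecting robots.txt, scheduling revisits, rendering JavaScript and much more:

    # A minimal sketch of what a crawler does. The start page and the
    # User-Agent name are made up for illustration.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import Request, urlopen

    class LinkParser(HTMLParser):
        """Collects the href of every <a> tag found on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        """Visit pages breadth-first, staying on the start domain."""
        domain = urlparse(start_url).netloc
        queue, seen, fetched = [start_url], set(), {}
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            # Identify ourselves with a User-Agent, just like real crawlers do.
            request = Request(url, headers={"User-Agent": "ExampleBot/1.0"})
            html = urlopen(request).read().decode("utf-8", errors="replace")
            fetched[url] = html               # "bring back" the page content
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:         # follow the links on the page
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == domain:
                    queue.append(absolute)
        return fetched

    pages = crawl("https://example.com")
    print(f"Fetched {len(pages)} page(s)")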
How the crawler identifies itself to the website
When someone - or something - visits a website, it «introduces itself» to the website using what is called a User-Agent.
When you as a real person visit a particular website, you always do this through a web browser like Chrome, Firefox or Safari. The User-Agent reported to the website when using Safari on an iPad would look something like this:
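    Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Mobile/15E148 Safari/604.1

(The exact version numbers depend on the iOS and Safari releases in use; the string above is a representative example.)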
This information tells the web server, and in turn the website, which browser (among other details) is being used.
When the crawler visits a website, it is similarly required to identify itself.
By convention, the suffix «bot» is used in the names of search engine crawlers.
Google's general crawler is called Googlebot, Microsoft Bing's crawler is called bingbot, while the privacy-focused search engine DuckDuckGo identifies its crawler as «DuckDuckBot». Other search engines follow a similar naming convention for their crawlers.
The User-Agent name of Googlebot is:
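    Googlebot/2.1 (+http://www.google.com/bot.html)

(This is the simplest form; Google also documents longer variants of the string, for example for its smartphone crawler.)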
As seen, the User-Agent name used by Google doesn't carry all the additional details that the User-Agent of the Safari browser had. This is mostly because the search engine crawler doesn't use the same specifications (operating system, device type, etc.) every time it visits/crawls a website, and because a simple User-Agent name is much cleaner and makes the crawler easier to identify.
Some search engines, like Bing and Yahoo, add a prefix, «Mozilla/5.0», to their User-Agent strings. Bing's crawler, for example, identifies itself like this:
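    Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)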
Fun fact: The reason for the added «Mozilla/5.0» is that during the first browser war in the late 1990s, many web servers were configured to serve pages that used then-advanced features, like frames, only to clients identifying themselves as some version of Mozilla (the browser engine that the era's most popular browser, Netscape Navigator, was built upon), which supported these features. Older browsers, like Mosaic and Cello, which didn't support these features, would be given a more basic version of the website. The search engine crawlers, which did support these features, therefore had to clearly tell the web servers that they were as capable as the Mozilla-based browsers in order to get the full details of the websites they visited.
The User-Agent names of the most important search engines are listed in the reference section of this book.
Most websites today have many different crawlers visiting them every single day, and most of these crawlers belong to search engines.
Although many call the crawlers «bots», the term used throughout this book will be crawlers, as it is more specific to search engines, while the term «bot» is used for so much more these days.
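As a small illustration of how a website could recognize the best-known crawlers from the User-Agent string it receives, here is a sketch in Python. The list of tokens and the helper function are purely illustrative, not any official API:

    # Sketch: recognizing well-known search engine crawlers from the
    # User-Agent string. The token list is illustrative, not exhaustive.
    KNOWN_CRAWLER_TOKENS = ("googlebot", "bingbot", "duckduckbot")

    def is_search_engine_crawler(user_agent: str) -> bool:
        """Return True if the User-Agent contains a known crawler token."""
        ua = user_agent.lower()
        return any(token in ua for token in KNOWN_CRAWLER_TOKENS)

    print(is_search_engine_crawler(
        "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"))  # True
    print(is_search_engine_crawler(
        "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) ... Safari/604.1"))            # False

Note that a User-Agent string can be spoofed, so a check like this recognizes well-behaved crawlers rather than proving who the visitor really is.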
First time visiting a new website
The crawler, or spider as some call it, crawls a new website either when someone has first initiated a direct search for that domain, when a sitemap for the website has been submitted through a search engine tool like Google Search Console, or when the website owner has registered the website with the Google Search Console tool.
During the first crawl, the search engine tries to determine how the website should be handled, most importantly what kind of website it is and how often it should be re-crawled.
Simulating a real human
In order for the crawler to be as similar to a real person using a real browser as possible, it tries to simulate as many aspects of real user behaviour as it can. For example, it uses the bandwidth that is currently the most common among real users.
In 2019 this is still 3G on a global basis. It is therefore assumed that crawlers like Googlebot visit your website over a 3G connection, even though 4G and broadband usage is much higher in many countries and regions.
It is therefore very important to look at the size of every page on your website, and to reduce the size of every object you can, which you will learn all about in chapters 3 and 4.
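As a rough illustration of why page size matters at 3G speeds, the following back-of-the-envelope sketch in Python estimates the time needed just to transfer a page. The 1.6 Mbit/s throughput is a figure commonly used by performance-testing tools when simulating 3G, and the page sizes are made up for illustration:

    # Rough estimate of the time needed just to transfer a page over 3G.
    # The 1.6 Mbit/s figure is a common 3G throttling assumption in
    # performance tools; real-world speeds vary widely.
    THREE_G_MBIT_PER_S = 1.6

    def transfer_seconds(page_size_kb: float) -> float:
        """Seconds needed to transfer page_size_kb at the assumed 3G speed."""
        bits = page_size_kb * 1024 * 8
        return bits / (THREE_G_MBIT_PER_S * 1_000_000)

    for size_kb in (500, 2000, 5000):   # hypothetical page sizes in kB
        print(f"{size_kb} kB -> {transfer_seconds(size_kb):.1f} s")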
Controlling the crawler
The crawler
When the search engine crawls a website, it fetches every single word it finds and saves it in its search index. A website with a total of 100 words would thus have 100 records (one per word) in the search index, each referencing the related website.
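A tiny, heavily simplified sketch in Python of such an index, with made-up pages and text, showing how every word is mapped back to the pages it appears on:

    # A tiny, heavily simplified search index: every word found on a page
    # is stored with a reference back to that page. The pages and their
    # text are made up for illustration.
    from collections import defaultdict

    pages = {
        "https://example.com/":      "welcome to our little example site",
        "https://example.com/about": "a little page about this example site",
    }

    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.split():
            index[word].add(url)      # one record per word, per page

    # Looking up a word returns the pages that contain it.
    print(sorted(index["example"]))   # both pages
    print(sorted(index["about"]))     # only the /about page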