How Google crawls a web page

Whenever we talk about digital marketing, one of the key pieces is search positioning. For our page to appear among the top results on Google, the first step is for Google to find, crawl and index our site. The foundation of an effective online strategy is that first visit from a web crawler, and that it happens thoroughly and without obstacles. It is the starting point of our SEO journey, ensuring that Internet users find us when they use the search engine; if we get this first contact wrong, our whole strategy suffers and our online visibility may end up incomplete or erroneous. To avoid wasting time and resources, it pays to know the mechanisms that come into play when Google crawls our website, so we can simplify everything and benefit from the process.

Crawling Basics

A crawler is a small program that travels across the Internet by following the links between web pages, jumping from one to another while collecting information to upload to Google's servers; in Google's case its name is Googlebot. Crawlers are also known as bots, robots or spiders, and they are the most common method search engines use to collect data about new or recently updated content and build an index of it on their servers: this is what is known as indexing. Once the content is indexed, those same servers are responsible for ranking it by relevance for the many searches a user can make from the search box; this is what we familiarly call positioning.
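The link-following idea described above can be sketched in a few lines. This is a minimal illustration, not how Googlebot actually works: it only extracts the links a crawler would queue up from a single (hypothetical) page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects href targets from <a> tags, resolving them against
    the page URL -- the core of what a crawler does on each page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

# A tiny page a bot might have downloaded (hypothetical content)
page = '<p>See <a href="/about">about</a> and <a href="https://other.example/">a friend</a>.</p>'
collector = LinkCollector("https://example.com/")
collector.feed(page)
print(collector.links)
# ['https://example.com/about', 'https://other.example/']
```

A real crawler would then download each collected URL in turn, repeating the process and feeding the pages to the indexer.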

Googlebot only crawls pages on your site if it has permission to do so, which makes permissions a double-edged sword: they let you select the content you want to appear in search results, but if you configure them incorrectly you can even disappear from Google's listings. The cornerstone of online marketing is getting your website crawled and positively indexed by Google, as it is the most widely used search engine in our country.

Since crawlers are programs, what they do is scan the source code of the web page, so that code has to be recognizable (no obsolete technology like Flash) and properly structured (if a human can navigate the site from start to finish without problems, so can Googlebot). If the bot cannot visit the site properly, it cannot crawl it, much less index it.

How Google crawls a website

To understand the process and take action to ensure that your website ends up indexed, first take a look at how Google crawls pages:

- Be clear that, if nothing stops it, Googlebot is continually accessing your site. Google's crawl rate refers to the speed of Googlebot's requests, which can reach one every few seconds; this does not mean that your whole site is being fully crawled at all times. The requests check for new or recently updated content that needs crawling, so the bot does not need to crawl every page of the website all the time; as you can see, it is a precise system that gives Google accurate information whenever there is "fresh" content on a website.

- First, Googlebot reads the information written in the robots.txt file, as this is where we indicate which content on our site can be crawled and indexed; if a page or directory is denied access by this exclusion protocol, it will not be indexed. The file lives in the root directory of the website and is a simple text file built on two directives: User-agent to name the bots, and Disallow to name the content.

With robots.txt you can block known search-engine spiders that you do not want indexing your content (for whatever reason). For example:

User-agent: Googlebot
Disallow: /

This would block Google's crawler, which is not what we want (for a start, you would not be reading this post). And that only applies to "legitimate" bots: fraudulent bots (the ones dedicated to hunting for security vulnerabilities or harvesting email addresses for spam) face no barrier at all, since they do not even stop to read the file. And as robots.txt is public and accessible, anyone can view its contents, so do not use the file to hide information on your server that could be used against you. A robots.txt that lets all bots in would be:

User-agent: *
Disallow:

Leaving the file blank, or not uploading one at all, has the same effect.

For positioning purposes, keep your robots.txt under control at all times; if you make changes to the site and do not update the file, you may be surprised to find content not indexed (or obsolete content that reappears in search results). With a technical analysis of your website and/or hosting server you can edit the simple robots.txt code and resolve any surprises, so the spiders know exactly what to do during their (frequent) visits.
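You can check how a given bot would interpret your robots.txt with Python's standard library, which implements the same exclusion protocol. Here is a small sketch using the Googlebot-blocking example from above (the URLs are hypothetical):

```python
from urllib import robotparser

# The same hypothetical robots.txt shown above, blocking only Googlebot
rules = """\
User-agent: Googlebot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot is denied everything; other bots fall through to "allowed"
print(rp.can_fetch("Googlebot", "https://example.com/page"))  # False
print(rp.can_fetch("Bingbot", "https://example.com/page"))    # True
```

Running a quick check like this after every robots.txt change is an easy way to avoid accidentally de-indexing content.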

- After learning what permissions it has and how far it can go, Google then proceeds to review the sitemap.xml file. You should know that search engines do not always need the information in this file to uncover the skeleton of a website, and they are becoming ever more effective without it. But it is still very important, especially given the vast diversity of sites with different structures and configurations, which can sometimes make it hard to reach every part that needs to be indexed. A complete, well-built sitemap is also a help when indexing dynamic content such as images, videos, PDF files, etc.

In addition, each sitemap reveals important information as metadata: the date of the last update, how often the listed pages change, or the relationships between internal links that are created as the site evolves over time. Again, crawlers can detect almost everything on their own, but if the site is very large there is a chance they will overlook some segment, especially pages that are not naturally linked to; listing them in the sitemap makes them more visible.

If the site is newly created and does not yet receive any external links, do not leave anything to chance: make sure Google can follow the trail your own website leaves, and build the sitemap.xml. At the beginning it is better to stay alert and keep it updated, since Google is "lazier" with newcomers: it assigns its bot a time allowance, the "crawl budget", for crawling each website. Depending on the speed, authority, accessibility and quality of the site, Google adjusts the time its spiders spend crawling its pages. Improve those characteristics and update content frequently and you will receive more visits from crawlers, which shows up in your positioning and increases traffic to your website.
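For reference, a minimal sitemap.xml carrying the metadata mentioned above looks like this (the domain and dates are placeholders; the `changefreq` and `priority` fields are optional hints, not commands):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2016-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/services</loc>
    <lastmod>2016-01-10</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
```

Each `<url>` entry names a page with `<loc>` and, optionally, when it last changed and how often it tends to change.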

Steps to facilitate crawling


So: we want Google to find our site, index it and rank it in search results. We have already seen the files that should make Googlebot's job easier, and with them in place we can be confident that a well-structured site with a minimum of quality will have no trouble appearing quickly in search results.

If the site is new, fresh out of the oven, from Google Search Console we can provide the URL or even submit the sitemap directly. Of course, the site must already be accessible online and, obvious as it seems, it must be operational 24 hours a day: if it can only be accessed intermittently, or spends long periods down, do not expect Google to take note of it and show it in its results. Create a clear conceptual hierarchy of pages, with a home page and the main sections branching off it, so that any page can be reached by following links and, above all, there are no dead ends.

Think about how someone would try to find you and which words they would type to locate your pages; all those words must appear somewhere on your website, so develop your content to use them naturally (always describing the content clearly and accurately). Titles and alt attributes for images have to specify and clarify what the visitor is reading or viewing: this helps users reach the site and helps Google understand the content. Do not trip up crawling robots with session IDs in URLs, as they can lead to incomplete indexing of the site.
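As a small illustration of the alt-attribute advice above (the file name and text are invented), a descriptive alt is worth far more to users and to Googlebot than a generic one:

```html
<!-- Poor: tells neither users nor crawlers anything -->
<img src="img1234.jpg" alt="photo">

<!-- Better: describes what the image actually shows -->
<img src="red-running-shoes.jpg"
     alt="Red lightweight running shoes, side view">
```

The same principle applies to page titles: describe the content, do not just label it.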

Be aware at all times of your site's loading speed; quick access improves the user experience and helps Google assess your site positively, increasing the reputation for quality that will be reflected in your positioning.
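A very rough first check of loading speed can be done from the command line. This sketch only times the raw download of the page body (real tools such as Google's PageSpeed Insights also measure rendering, scripts and images); the URL in the comment is a placeholder:

```python
import time
from urllib.request import urlopen

def measure_load_time(url, timeout=10):
    """Return the seconds taken to download the full response body.

    A crude proxy for server speed -- it ignores rendering time,
    but it will catch a slow or overloaded server.
    """
    start = time.perf_counter()
    with urlopen(url, timeout=timeout) as resp:
        resp.read()  # download the entire body
    return time.perf_counter() - start

# Hypothetical usage -- substitute your own site's URL:
# print(measure_load_time("https://example.com/"))
```

If the number is consistently high, look at your hosting, page weight and caching before worrying about finer optimizations.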

In short, if the content of your website is well optimized, your pages will be crawled and indexed without you having to devote extra resources to it. And as you expand the site's content, everything will mesh spontaneously and the basic rules will remain intact. Optimizing for flawless crawling by Google requires special attention at the start of our SEO strategy; if we fail at this first step, all further efforts and resources may fall on deaf ears. To get indexed and appear in search results, we must ensure the site is easy for the crawlers that visit us to crawl.