If the Internet is a giant library, search engines are its superfast employees who can quickly guide the reader (the Internet user) through the boundless ocean of information. What helps them do this is a systematic card index: their own database.
When a user enters a keyword phrase, the search engine shows results from that database. In other words, search engines store copies of documents on their servers and access those copies when a user submits a query. For a particular page to appear in the results, it must first be added to the database (index). That is why newly created websites that search engines do not yet know about are not shown in the results.
The search engine sends its robot (aka spider, aka crawler) to look for new pages, which appear on the web every second. The spiders collect data by following links from one page to another and pass it to the database, where the information is then processed by other mechanisms.
How to check web indexing?
You can check the indexing in three basic ways:
- Make a query in a search engine using special operators.
- Use webmaster tools (Google Search Console).
- Use the Linkbox service.
Search operators
You can quickly and easily determine the approximate number of indexed pages by using the site: operator.
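For example, entering the following query into the search box shows roughly how many pages of a site are already in the index (example.com is a placeholder domain):

site:example.com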
Linkbox index checker service
An index check in the Linkbox service is performed in two simple steps: import your links into a campaign and apply the indexing check action to them.
Webmaster panel
Google Search Console provides detailed information about indexing, straight from the primary source, so to speak.
How to control indexing?
Search engines view sites differently than we do. Unlike an ordinary user, the search robot sees the whole underbelly of a site. If you do not stop it in time, it will scan every page it can reach, including those that should not be exposed in search results.
However, keep in mind that the robot's resources are limited: there is a certain quota (the crawl budget), the number of pages the crawler can visit in a given amount of time. If your site has a huge number of pages, chances are that the robot will spend most of its budget on junk pages and leave the important ones for later.
This is why indexing can and should be managed. To do this, you can use certain tools, which we’ll look at below.
Robots.txt
Robots.txt is a simple text file (as you might have guessed from the extension) where you can use special words and symbols to write the rules that search engines understand.
Directives used in robots.txt:
- User-agent (addresses a specific robot)
- Allow (allows indexing)
- Disallow (disallows indexing)
- Sitemap (the sitemap address)
- Crawl-delay (the delay between page downloads)
User-agent indicates which search engine the rules that follow apply to. You can address a specific robot or, if the rules should apply to any search engine, use an asterisk:
User-agent: GoogleBot
User-agent: Bingbot
User-agent: Slurp (Yahoo! search robot)
User-agent: *
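One file can also contain separate groups of rules for different robots. A small sketch (the paths are purely illustrative):

User-agent: GoogleBot
Disallow: /drafts/

User-agent: *
Disallow: /tmp/

Here GoogleBot gets its own rule, while every other robot follows the group under the asterisk.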
The most commonly used directive is Disallow. It is used to prevent the indexing of pages, files, or directories.
Among the pages that should be blocked are:
- System files and folders. Admin panel, CMS files, personal account, recycle bin, etc.
- Low-informative auxiliary pages that don’t need to be promoted. Like blog authors’ biographies, for example.
- Various duplicates of the main pages.
Let's look at duplicates in more detail. Imagine you have a blog page with an article. You then promote that article on another website, adding a UTM tag to the existing URL to track conversions. Although the address has changed slightly, it still leads to the same page with the same content. This is a duplicate page that must be excluded from indexing.
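For example, these two addresses (both illustrative) open the same content, but for a search engine they are two different pages:

http://site.ru/blog/article/
http://site.ru/blog/article/?utm_source=partner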
Statistics systems are not the only ones to blame for duplicate pages. Duplicates can appear when products are searched or sorted, when the same product sits in several categories, and so on. Website engines themselves often create plenty of duplicates as well (WordPress and Joomla in particular).
Apart from full duplicates, there are also partial ones. The classic example is a blog's main page with post announcements. These announcements are usually excerpts from the articles, so such pages contain no unique content. In this case, you can either make the announcements unique or remove them entirely.
Pages of this kind (lists of articles, product catalogs, etc.) also have page navigation (pagination), which divides the list into several pages. Google has detailed what to do with such pages in its Help section.
Duplicates can do a lot of harm to rankings. When there are too many of them, the search engine may show a page other than the one you planned to promote and optimized (for example, you built links to a specific product page, but the search engine displays a completely different one). To avoid this, indexing should be set up properly, and one way to deal with duplicates is the robots.txt file.
When writing robots.txt, you can look at other websites for guidance: simply add robots.txt after the slash at the end of the home page address of the website of your choice. Remember, however, that websites differ in functionality, so you cannot simply copy the directives of top competitors and call it a day. If you download a ready-made robots.txt for your CMS, you will still have to adapt it to your needs.
Let’s look into the symbols used in making the rules.
The path to a file or folder is written with a slash (/). If you specify a folder (e.g. /wp-admin/), all files in that folder will be excluded from indexing. To target a specific file, you must enter its name and extension in full (along with the directory).
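A small illustration (the folder and file names are made up):

Disallow: /wp-admin/
Disallow: /docs/internal-report.pdf

The first rule hides the entire folder; the second hides only one specific file.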
If you need to disallow indexing of certain types of files or a page containing a parameter, you can use asterisks (*):
Disallow: /*/openstat=
Disallow: /*/?utm_source=
Disallow: /*/price=
Disallow: /*/gclid=*
The Allow directive permits the indexing of individual directories, pages, or files. Suppose, for example, you need to hide the entire contents of the uploads folder from search engines except for one PDF file. Here is how it can be done:
Disallow: /wp-content/uploads/
Allow: /wp-content/uploads/book.pdf
Another important directive is Sitemap. Here you specify the address (if any) where your sitemap can be found. We will talk later about what it is for and how to create it.
We should mention that search engines treat robots.txt in different ways. Google treats it more as a recommendation and may ignore some directives.
After the file is created, it must be placed in the root directory of the site, i.e. site.com/robots.txt.
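Putting the pieces together, a complete robots.txt for a small WordPress site might look something like this (the paths and domain are illustrative, not a ready-made template):

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/uploads/
Allow: /wp-content/uploads/book.pdf
Disallow: /*?utm_source=
Sitemap: https://site.com/sitemap.xml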
How to disallow website indexing?
If you need to hide the entire website from all search engines for some reason, here is how to do it:
User-agent: *
Disallow: /
It is advisable to do this while the website is under development. To allow indexing again, you only need to remove the slash (just don't forget to do it when you launch the website).
Nofollow and noindex
You can also fine-tune indexing with special HTML tags.
Let's take a closer look at the robots meta tag. Like the robots.txt file, it allows you to control indexing, but in a more flexible way. To understand how it works, let's look at the possible instructions:
<meta name="robots" content="index,follow" />
index the content and follow the links
<meta name="robots" content="noindex,nofollow" />
do not index the content and do not follow the links
<meta name="robots" content="noindex,follow" />
do not index the content, but follow the links
<meta name="robots" content="index,nofollow" />
index the content, but do not follow the links
These are not all possible uses of this meta tag, because there are other directives besides nofollow and noindex. For example, there is noimageindex, which disallows indexing of the images on a page. You can learn more about this meta tag and its use in Google's documentation.
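For instance, a page whose text should be indexed but whose images should not could use something like this (a minimal illustration):

<meta name="robots" content="noimageindex" />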
Rel="canonical"
Another way to combat duplicates is the rel="canonical" attribute. For each page, you can set the canonical (preferred) address that will be displayed in search results. By adding the attribute to a duplicate's code, you bind it to the main page, so there is no confusion about which version of the page should appear. If the duplicate has accumulated link weight, it is transferred to the main page.
The tag is implemented as follows:
<link rel="canonical" href="http://site.ru/" />
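For example, the UTM-tagged duplicate from the earlier example could point to its original like this (the URLs are illustrative):

<!-- on the page http://site.ru/blog/article/?utm_source=partner -->
<link rel="canonical" href="http://site.ru/blog/article/" />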
Sitemap (site map)
While the robots.txt file tells the robot which pages should not be indexed, the sitemap, on the contrary, lists all the links that should be indexed.
A major advantage of the sitemap is that it contains not only the list of pages but also data the robot needs: the date and frequency of updates of each page and its crawl priority.
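For reference, a minimal sitemap.xml with a single entry might look like this (the URL, date, and values are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://site.com/blog/article/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>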
The sitemap.xml file can be generated automatically by specialized online services. One of the best and simplest plugins for WordPress is Google XML Sitemaps. It has many different settings but they are not difficult to understand.
In the end, you get a simple and convenient sitemap in the form of a table, which becomes accessible as soon as the plugin is activated.
The sitemap is extremely useful for indexing because robots often pay too much attention to old pages and ignore new ones. With a sitemap, the robot sees which pages have changed and visits them first when it accesses the website.
If you developed a sitemap using third-party services, you should upload the finished file, along with the robots.txt file, into the folder on your hosting server where the site is located. Again, in the root folder: site.com/sitemap.xml.
For convenience, it is also advisable to submit the sitemap address in the corresponding section of Google Search Console.
Conclusion
Web indexing is a complex process that search engines cannot always handle on their own. Since indexing directly affects a website's position in search results, it makes sense to take control and make the search robots' job easier. Yes, it may take a lot of effort and time, but even something as unpredictable as a search bot can still be managed.