By Gino Cosme on 2005/05/20
I often meet people who don't understand how search engines gather their information. They know what search engines are and understand the importance of being indexed and listed on them - well, some do - but the minute you start talking about spiders and the like, they freeze up.
Freeze no more. This article aims to clear up some of the uncertainty you may have about search engines. After all, if you want to benefit from being listed on search engines, you'd better know how they work.
Three That Are One
Crawler-based search engines are made up of three major elements: the spider, the index, and the software. Each has its own function and together they produce what we have come to trust (or distrust) on the SERPs (Search Engine Results Pages).
The Hungry Spider
Also known as a web crawler or robot, a search engine spider is an automated program that reads web pages and follows any links to other pages within the site. This is often referred to as a site being "spidered" or "crawled". There are three very hungry and active spiders on the Net. Their names are Googlebot (Google), Slurp (Yahoo!) and MSNBot (MSN Search).
Spiders start their journeys with a list of page URLs that have previously been added to their index (database). As a spider visits these pages, crawling the code and copy, it adds any new pages (links) that it finds to the index. As such, one could think of a spider as feeding an evolving index, which is discussed below.
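To make the idea concrete, here is a minimal sketch of that loop in Python. It is not how any real spider is written: the TOY_WEB dictionary, the crawl function and the toy_fetch_links helper are all made up for illustration, and the "web" is an in-memory table rather than real pages fetched over the network.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.com/": ["http://example.com/about", "http://example.com/news"],
    "http://example.com/about": ["http://example.com/"],
    "http://example.com/news": ["http://example.com/news/1", "http://example.com/about"],
    "http://example.com/news/1": [],
}


def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])


def crawl(seed_urls, fetch_links):
    """Visit pages breadth-first from the seed list and return them in visit order."""
    frontier = deque(seed_urls)   # pages waiting to be visited
    seen = set(seed_urls)         # pages already queued, so none is visited twice
    visited = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # a newly discovered page joins the queue
                seen.add(link)
                frontier.append(link)
    return visited


pages = crawl(["http://example.com/"], toy_fetch_links)
```

Starting from a single known URL, the sketch ends up visiting all four pages, because each visit feeds newly found links back into the queue.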
The spider returns to the sites in its index on a regular basis, scanning for any changes. How often the spider returns is up to the search engines to decide. Website owners do have some control over which parts of their site a spider visits, and with some spiders how often, by making use of a robots.txt file. Search engines look for this file first, before crawling a site any further.
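As a rough illustration, Python's standard library includes a parser for robots.txt files, which shows how a well-behaved spider might read one. The file contents, the "ExampleBot" name and the example.com URLs below are invented for this sketch. Note that Crawl-delay is a non-standard directive that only some spiders honour.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: keep all spiders out of /private/ and ask them
# to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from text, no network access needed

# "ExampleBot" is a hypothetical spider name.
allowed_public = parser.can_fetch("ExampleBot", "http://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "http://example.com/private/report.html")
delay = parser.crawl_delay("ExampleBot")
```

Here the public page may be crawled, the page under /private/ may not, and the requested delay is read back as a number of seconds.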
The Growing Index
An index is like a giant catalogue or inventory of websites containing a copy of every web page and file that the spider finds. If a web page changes, this catalogue is updated with the new information. To give you an idea of the size of these indexes, the latest figure released by Google is 8 billion pages.
It sometimes takes a while for new pages or changes that the spider finds to be added to its index. Thus, a web page may have been "spidered" but not yet "indexed." Until a page is indexed - added to the index - it will not be available to those searching with the search engine.
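The gap between "spidered" and "indexed" can be sketched with a toy index in Python. The ToyIndex class and its method names are hypothetical, and real indexes are vastly more elaborate. The only point here is that a crawled page sits in a waiting area and cannot be found until the index is rebuilt.

```python
class ToyIndex:
    """A toy catalogue: stored page copies plus a word-to-pages lookup table."""

    def __init__(self):
        self.pending = {}   # spidered but not yet indexed: url -> text
        self.pages = {}     # indexed copies: url -> text
        self.postings = {}  # word -> set of urls containing that word

    def record_crawl(self, url, text):
        """Called by the spider; the page is NOT searchable yet."""
        self.pending[url] = text

    def build(self):
        """Move every pending page into the searchable index."""
        for url, text in self.pending.items():
            old_text = self.pages.get(url)
            if old_text is not None:
                # The page changed: drop the words recorded for the old copy.
                for word in set(old_text.lower().split()):
                    self.postings[word].discard(url)
            self.pages[url] = text
            for word in set(text.lower().split()):
                self.postings.setdefault(word, set()).add(url)
        self.pending.clear()

    def lookup(self, word):
        """Return the indexed pages containing the word, sorted by URL."""
        return sorted(self.postings.get(word.lower(), set()))


index = ToyIndex()
index.record_crawl("http://example.com/", "Welcome to our film reviews")
before = index.lookup("film")  # spidered, but not yet indexed
index.build()
after = index.lookup("film")   # now indexed and searchable
```

Before build() runs, a search for "film" finds nothing. Afterwards it finds the page.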
The Performing Search Engine
At the end of the day a search engine is a software program designed to sift through billions of pages recorded in its index to find matches to a search query and rank them in an order that it believes is most relevant. Quite a mouthful.
How do search engines go about determining relevancy, when confronted with hundreds of millions of web pages to sort through? Each search engine has developed a set of rules and mathematical equations, known as an algorithm, which it uses to set the order of its rankings.
Exactly how a particular search engine's algorithm works is a closely kept secret, but some general rules are clear, and these are often used to improve a website's ranking performance. This practice is referred to as search engine optimisation.
In a nutshell, search engines use on-page and off-page copy to group related pages into vertical themes. If we take a page relating to the film industry, these themes or groups could be entertainment, movie entertainment, movie star entertainment, etc. Each theme has common words and phrases that best describe the pages the group contains. Some pages may belong to more than one group. For instance, a page relating to movie profits could belong to both financial and entertainment groups.
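A very rough Python sketch of the grouping idea follows. The theme names and word lists are invented for the example, and matching single words is a deliberate oversimplification. Real engines keep their methods secret, so this illustrates the concept only, not anyone's actual algorithm.

```python
# Hypothetical themes, each described by a handful of common words.
THEMES = {
    "entertainment": {"movie", "film", "actor", "cinema"},
    "financial": {"profits", "revenue", "shares", "earnings"},
}


def themes_for(text, themes, min_matches=1):
    """Return the names of every theme sharing at least min_matches words with the text."""
    words = set(text.lower().split())
    return sorted(
        name for name, vocabulary in themes.items()
        if len(words & vocabulary) >= min_matches
    )


groups = themes_for("Movie profits rose sharply this year", THEMES)
```

The movie profits page shares a word with both lists, so it lands in both the entertainment and the financial group.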
The SERP (or Search Engine Results Page)
After applying its algorithm to its index of sites, a search engine comes up with a list of the most relevant results for the search conducted.
To simplify an otherwise complex process, when a user enters a search query, the search engine analyses the query and searches its index for the pages it considers relevant. Once it has a shortlist of relevant pages, it calculates the order in which they are presented to the user, based on further algorithmic factors. These could include a user's location and possibly even their search history.
The algorithm differs between engines, which is why different search engines may produce different results for the same query. Each search engine has its niche. It is, however, not uncommon for a user to use more than one search engine. This further demonstrates how important it is for website owners to be indexed and ranked well on all search engines.
The aim of a search engine is to put itself in its users' shoes. It therefore wants to deliver appropriate, relevant, information-rich sites that will satisfy users the first time round.
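The shortlist-then-rank idea can be sketched in a few lines of Python. The PAGES table and the search function are made up for this example, and the scoring rule (simply counting how often the query words appear) is an assumption chosen for simplicity. Real ranking algorithms weigh far more factors and are not public.

```python
from collections import Counter

# Hypothetical indexed pages: url -> page copy.
PAGES = {
    "http://example.com/a": "movie reviews and movie news",
    "http://example.com/b": "movie star interviews",
    "http://example.com/c": "stock market news",
}


def search(query, pages):
    """Shortlist pages containing every query word, then order them by a simple score."""
    terms = query.lower().split()
    if not terms:
        return []
    scored = []
    for url, text in pages.items():
        counts = Counter(text.lower().split())
        # Step 1, shortlist: the page must contain every query word.
        if all(counts[term] > 0 for term in terms):
            # Step 2, score: total occurrences of the query words on the page.
            score = sum(counts[term] for term in terms)
            scored.append((score, url))
    # Highest score first; ties are broken by URL to keep the order stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [url for _, url in scored]


results = search("movie", PAGES)
```

For the query "movie", page a is listed ahead of page b because the word appears on it twice, while page c never makes the shortlist.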
An impossible task? I like to think of it as a very exciting challenge.