Wednesday, May 18, 2011

Google Crawling Technology Overview

Before you even enter your query in the search box, Google is continuously traversing the web in real time with software programs called crawlers, or “Googlebots”. A crawler visits a page, copies the content and follows the links on that page to other pages, repeating this process over and over until it has crawled billions of pages on the web.
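To make that crawl loop concrete, here is a minimal sketch in Python using only the standard library. It illustrates the general fetch-and-follow technique, not Googlebot itself: the page limit and the LinkExtractor helper are made up for the example.

    # A minimal sketch of the crawl loop described above (illustrative only).
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen


    class LinkExtractor(HTMLParser):
        """Collects href values from <a> tags on a fetched page."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)


    def crawl(seed_url, max_pages=10):
        """Fetch pages breadth-first, storing their HTML and queuing new links."""
        seen, queue, pages = {seed_url}, deque([seed_url]), {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue  # skip pages that fail to load
            pages[url] = html  # store a copy of the content for indexing
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages

Calling crawl("http://example.com") would return a small dictionary of fetched pages keyed by URL, which an indexing step like the one sketched below could consume.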

Next Google processes these pages and creates an index, much like the index in the back of a book. If you think of the web as a massive book, then Google’s index is a list of all the words on those pages and where they’re located, as well as information about the links from those pages, and so on. The index is parceled into manageable sections and stored across a large network of computers around the world.
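Following the book analogy, the structure behind this is commonly an inverted index: for each word, a record of which pages contain it and where it appears. The toy sketch below uses two made-up sample pages; a production index also stores link data, formatting and much more.

    # A toy inverted index in the spirit of the book-index analogy: for each
    # word, record which pages contain it and at which word positions.
    # The sample pages are invented for illustration.
    from collections import defaultdict


    def build_index(pages):
        """Map each word to {page: [positions]} across all crawled pages."""
        index = defaultdict(lambda: defaultdict(list))
        for url, text in pages.items():
            for position, word in enumerate(text.lower().split()):
                index[word][url].append(position)
        return index


    pages = {
        "example.com/a": "crawlers follow links between pages",
        "example.com/b": "an index lists words and where they are located",
    }
    index = build_index(pages)
    print(dict(index["links"]))   # {'example.com/a': [2]}
    print(dict(index["words"]))   # {'example.com/b': [3]}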

When you type a query into the Google search box, your query is sent to Google machines and compared with all the documents stored in our index to identify the most relevant matches. In a split second, our system prepares a list of the most relevant pages and also determines the relevant sections and bits of text, images, videos and more. What you get is a list of search results with relevant information excerpted in “snippets” (short text summaries) beneath each result.
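Continuing the toy example, a rough sketch of the serving step might look like the following. It reuses the pages and index variables from the indexing sketch above, keeps only pages containing every query term, and excerpts a short snippet around the first match; real ranking and snippet generation are far more sophisticated.

    # A rough sketch of the serving step, assuming the build_index() toy
    # index above. Ranking here is simply "contains every term".
    def search(query, pages, index, snippet_words=8):
        terms = query.lower().split()
        # Keep only pages that contain every term in the query.
        matching = set(pages)
        for term in terms:
            matching &= set(index.get(term, {}))
        results = []
        for url in matching:
            words = pages[url].split()
            first_hit = min(index[term][url][0] for term in terms)
            start = max(0, first_hit - 2)
            snippet = " ".join(words[start:start + snippet_words])
            results.append((url, snippet))
        return results


    print(search("index words", pages, index))
    # [('example.com/b', 'an index lists words and where they are')]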
Describing the basic crawling, indexing and serving processes of a search engine is just part of the story. The other key ingredients of Google search are: 

Relevance. As Larry said long ago, we want to give you back “exactly what you want.” When Google was founded, one key innovation was PageRank, a technology that determined the “importance” of a webpage by looking at what other pages link to it, as well as other data. Today we use more than 200 signals, including PageRank, to order websites, and we update these algorithms on a weekly basis. For example, we offer personalized search results based on your web history and location.  
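For readers curious how a link-based importance score can be computed at all, below is a simplified power-iteration sketch of the PageRank idea: each page repeatedly passes a share of its score to the pages it links to. The three-page link graph, damping factor and iteration count are arbitrary illustration values, and, as noted above, real ranking combines this kind of signal with hundreds of others.

    # Simplified power-iteration sketch of the PageRank idea; the link graph
    # below is made up for illustration.
    def pagerank(links, damping=0.85, iterations=50):
        """links maps each page to the list of pages it links out to."""
        all_pages = list(links)
        rank = {page: 1.0 / len(all_pages) for page in all_pages}
        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / len(all_pages) for page in all_pages}
            for page, outgoing in links.items():
                if not outgoing:
                    continue  # dangling pages simply keep their base share here
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            rank = new_rank
        return rank


    links = {
        "home": ["about", "news"],
        "about": ["home"],
        "news": ["home", "about"],
    }
    print(sorted(pagerank(links).items(), key=lambda kv: -kv[1]))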

Comprehensiveness. Google launched in 1998 with just 25 million pages, which even then was a small fraction of the web. Today we index billions and billions of webpages, and our index is roughly 100 million gigabytes. We continue investing to expand the comprehensiveness of our services. In 2007 we introduced Universal Search, which made search more comprehensive by integrating images, videos, news, books and more into our main search results.   

Freshness. In the early days, Googlebots crawled the web every three or four months, which meant that the information you found on Google was typically out of date. Today we’re continually crawling the web, ensuring that you can find the latest news, blogs and status updates minutes or even seconds after they’re posted. With Realtime Search, we’re able to serve up breaking topics from a comprehensive set of sources just moments after events occur.

Speed. Our average query response time is roughly one-fourth of a second. In comparison, the average blink of an eye is one-tenth of a second. Speed is a major search priority, which is why in general we don’t turn on new features if they will slow our services down. Instead, search engineers are always working not just on new features, but also on ways to make search even faster. In addition to smart coding, on the back end we’ve developed distributed computing systems around the globe that ensure you get fast response times. With technologies like autocomplete and Google Instant, we help you find the search terms and results you’re looking for before you’re even finished typing.
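As a small illustration of why suggestions can appear while you are still typing (and not a description of Google’s production systems), the sketch below completes a query prefix against a sorted list of previously seen queries using a binary search, so each keystroke costs only a logarithmic lookup plus a short scan.

    # Illustrative prefix completion over a sorted list of stored queries.
    import bisect

    queries = sorted([
        "weather today", "web crawler", "web search", "world cup", "wikipedia",
    ])


    def autocomplete(prefix, limit=3):
        """Return up to `limit` stored queries that start with `prefix`."""
        start = bisect.bisect_left(queries, prefix)
        results = []
        for query in queries[start:]:
            if not query.startswith(prefix):
                break
            results.append(query)
            if len(results) == limit:
                break
        return results


    print(autocomplete("web"))  # ['web crawler', 'web search']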
