

A research project on SEARCH ENGINE

SUBMITTED BY

SATHISH KOTHA 108-00-0746

University Of Northern Virginia

CSCI 587 SEC 1220, SPECIAL TOPICS IN INFORMATION TECHNOLOGY-1

6/20/2010

Abstract of the Project

A web search engine is designed to search for information on the World Wide Web. The
search results are usually presented in a list of results and are commonly called hits. The
information may consist of web pages, images, and other types of files. Some search
engines also mine data available in databases or open directories. Unlike Web directories, which
are maintained by human editors, search engines operate algorithmically or are a mixture of
algorithmic and human input.

In this project, we discuss the types of search engines, how a search engine works and
finds information for the user, and what processes go on behind the scenes. We also cover how
many search engines exist today to provide information to computer users, the history of search
engines, the different stages a search engine goes through when searching for information, and
the features of Web searching, along with related topics such as the Advanced Research Projects
Agency Network and what a bot really is.

Other topics include the types of search queries we use when seeking information with a
search engine, web directories, well-known search engines such as Google and Yahoo! and how
they operate as search engines, challenges in language processing, and the characteristics of
search engines.

I chose this topic for my project because I find the working of search engines interesting, and I
want everyone to come across this topic and learn from it. Many people use search engines, but
they do not know what really goes on behind the scenes.

At the end of the project I have also given the references from which I selected the material for
this discussion. I hope you like this project and accept it as my topic for this course.

ACKNOWLEDGEMENT

The project entitled “SEARCH ENGINE” is entirely my own effort. It is my duty to bring
forward each and every one who was either directly or indirectly involved with this project and
without whom it would not have gained its structure.

Accordingly, sincere thanks to PROF. SOUROSHI for the support, valuable suggestions, and
timely advice without which the project would not have been completed in time.
I also thank the many others who helped throughout the project and made it
successful.

PROJECT ASSOCIATES

CONTENTS

PRELIMINARIES

Acknowledgement

1. History of search engine page 1 - 15

Types of search queries

World Wide Web wanderer

ALIWEB

Primitive web search

2. Working of a search engine page 15 – 32

Web crawling

Indexing

Searching

3. New features for web searching page 33 – 35

4. Conclusion page 36

5. References page 37

1. Early Technology

2. Directories

3. Vertical Search

4. Search Engine Marketing

History of Search Engines: From 1945 to Google 2007

As We May Think (1945):

The concept of hypertext and a memory extension really came to life in July of 1945 when,
after the scientific camaraderie that was a side effect of WWII, Vannevar Bush's
As We May Think was published in The Atlantic Monthly.

He urged scientists to work together to help build a body of knowledge for all mankind.
Here are a few selected sentences and paragraphs that drive his point home.

Specialization becomes increasingly necessary for progress, and the effort to bridge
between disciplines is correspondingly superficial.

A record, if it is to be useful to science, must be continuously extended, it must be stored,
and above all it must be consulted.

He not only was a firm believer in storing data, but he also believed that if the data source was to
be useful to the human mind we should have it represent how the mind works to the best of our
abilities.

Our ineptitude in getting at the record is largely caused by the artificiality of the systems of
indexing. ... Having found one item, moreover, one has to emerge from the system and re-enter
on a new path.

The human mind does not work this way. It operates by association. ... Man cannot hope
fully to duplicate this mental process artificially, but he certainly ought to be able to learn from
it. In minor ways he may even improve, for his records have relative permanency.

He then proposed the idea of a virtually limitless, fast, reliable, extensible, associative memory
storage and retrieval system. He named this device a memex.

Gerard Salton (1960s - 1990s):

Gerard Salton, who died on August 28th of 1995, was the father of modern search technology.
His teams at Harvard and Cornell developed the SMART information retrieval system.
Salton’s Magic Automatic Retriever of Text included important concepts like the vector space
model, Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values,
and relevancy feedback mechanisms.
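
To make the vector space model, term frequency (TF), and inverse document frequency (IDF) ideas
concrete, the following is a minimal illustrative sketch in Python. It is not Salton's actual SMART
implementation; the toy documents, query, and function names are invented for the example.

    import math
    from collections import Counter

    def build_tf_idf(docs):
        """Compute TF-IDF term vectors for a list of tokenized documents."""
        n = len(docs)
        df = Counter()
        for doc in docs:
            df.update(set(doc))                        # document frequency of each term
        idf = {t: math.log(n / df[t]) for t in df}     # inverse document frequency
        vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
        return vectors, idf

    def cosine(a, b):
        """Cosine similarity between two sparse term vectors (a relevance score)."""
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    docs = [["web", "search", "engine"], ["web", "crawler", "index"], ["cooking", "recipes"]]
    vectors, idf = build_tf_idf(docs)
    query = {t: c * idf.get(t, 0.0) for t, c in Counter(["search", "engine"]).items()}
    ranking = sorted(range(len(docs)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    print(ranking)   # documents ordered by estimated relevance to the query

Relevancy feedback, in this picture, would simply adjust the query vector toward documents the
user marks as relevant before re-ranking.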

Ted Nelson:

Ted Nelson created Project Xanadu in 1960 and coined the term hypertext in 1963. His goal with
Project Xanadu was to create a computer network with a simple user interface that solved many
social problems like attribution.

While Ted was against complex markup code, broken links, and many other problems associated
with traditional HTML on the WWW, much of the inspiration to create the WWW was drawn
from Ted's work.

There is still conflict surrounding the exact reasons why Project Xanadu failed to take off.

Advanced Research Projects Agency Network:

ARPANet is the network which eventually led to the Internet. Wikipedia has a good
background article on ARPANet, and Google Video has a free and interesting video about ARPANet
from 1972.

Archie (1990):

The first few hundred web sites appeared in 1993, and most of them were at colleges, but long
before most of them existed came Archie, the first search engine, created in
1990 by Alan Emtage, a student at McGill University in Montreal. The original intent of the
name was "archives," but it was shortened to Archie.

Archie helped solve the problem of data scattered across FTP servers by combining a script-based
data gatherer with a regular expression matcher for retrieving file names matching a user query.
Essentially, Archie became a database of filenames which it would match against users' queries.
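
As a rough illustration of that idea (not Archie's actual code), the sketch below treats a user's
query as a regular expression and matches it against a gathered list of filenames; the sample
filenames and the function name are invented for the example.

    import re

    # A toy "database" of filenames gathered from FTP sites (invented examples).
    filenames = ["xlisp-2.1.tar.Z", "gnuplot-3.5.tar.gz", "rfc959.txt", "xv-3.10a.tar.gz"]

    def archie_like_search(pattern, names):
        """Return the filenames that match the user's regular expression."""
        query = re.compile(pattern, re.IGNORECASE)
        return [name for name in names if query.search(name)]

    print(archie_like_search(r"gnu.*\.tar\.gz$", filenames))   # ['gnuplot-3.5.tar.gz']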

Bill Slawski has more background on Archie here.

Veronica & Jughead:

As word of mouth about Archie spread, it started to become word of computer and Archie had
such popularity that the University of Nevada System Computing Services group developed
Veronica. Veronica served the same purpose as Archie, but it worked on plain text files. Soon
another user interface named Jughead appeared with the same purpose as Veronica; both of these
were used for files sent via Gopher, which was created as an Archie alternative by Mark
McCahill at the University of Minnesota in 1991.

File Transfer Protocol:

Tim Berners-Lee was around at this point, but there was no World Wide Web yet. The main way
people shared data back then was via File Transfer Protocol (FTP).

If you had a file you wanted to share you would set up an FTP server. If someone was interested
in retrieving the data they could do so using an FTP client. This process worked effectively in small
groups, but the data became fragmented as quickly as it was collected.

Tim Berners-Lee & the WWW (1991):

From Wikipedia:

While an independent contractor at CERN from June to December 1980, Berners-Lee proposed a
project based on the concept of hypertext, to facilitate sharing and updating information among
researchers. With help from Robert Cailliau he built a prototype system named Enquire.

After leaving CERN in 1980 to work at John Poole's Image Computer Systems Ltd., he returned
in 1984 as a fellow. In 1989, CERN was the largest Internet node in Europe, and Berners-Lee
saw an opportunity to join hypertext with the Internet. In his words, "I just had to take the
hypertext idea and connect it to the TCP and DNS ideas and — ta-da! — the World Wide Web".

He used similar ideas to those underlying the Enquire system to create the World Wide Web, for
which he designed and built the first web browser and editor (called WorldWideWeb and
developed on NeXTSTEP) and the first Web server called httpd (short for HyperText Transfer
Protocol daemon).

The first Web site built was at http://info.cern.ch/ and was first put online on August 6, 1991. It
provided an explanation about what the World Wide Web was, how one could own a browser
and how to set up a Web server. It was also the world's first Web directory, since Berners-Lee
maintained a list of other Web sites apart from his own.

In 1994, Berners-Lee founded the World Wide Web Consortium (W3C) at the Massachusetts
Institute of Technology.

Tim also created the Virtual Library, the oldest catalogue of the web, and wrote a
book about creating the web, titled Weaving the Web.

What is a Bot?

Computer robots are simply programs that automate repetitive tasks at speeds impossible for
humans to reproduce. The term bot on the internet is usually used to describe anything that
interfaces with the user or that collects data.

Another example is the chatterbot, which is resource-heavy and focused on a specific topic. These
bots attempt to act like a human and communicate with humans on that topic.

Types of Search Queries:

Andrei Broder authored A Taxonomy of Web Search [PDF], which notes that most searches fall
into the following 3 categories:

• Informational - seeking static information about a topic
• Transactional - shopping at, downloading from, or otherwise interacting with the result
• Navigational - send me to a specific URL

Nancy Blachman's Google Guide offers searchers free Google search tips, and Greg R. Notess's
Search Engine Showdown offers a search engine features chart.

There are also many popular smaller vertical search services. For example, Del.icio.us allows
you to search URLs that users have bookmarked, and Technorati allows you to search blogs.

World Wide Web Wanderer:

Soon the web's first robot came. In June 1993 Matthew Gray introduced the World Wide Web
Wanderer. He initially wanted to measure the growth of the web and created this bot to count
active web servers. He soon upgraded the bot to capture actual URLs. His database became
known as the Wandex.

The Wanderer was as much of a problem as it was a solution because it caused system lag by
accessing the same page hundreds of times a day. It did not take long for him to fix this software,
but people started to question the value of bots.

ALIWEB:

In October of 1993 Martijn Koster created Archie-Like Indexing of the Web, or ALIWEB in
response to the Wanderer. ALIWEB crawled meta information and allowed users to submit the
pages they wanted indexed, along with their own page descriptions. This meant it needed no bot to
collect data and was not using excessive bandwidth. The downside of ALIWEB is that many
people did not know how to submit their site.

Robots Exclusion Standard:

Martijn Koster also hosts the web robots page, which established standards for how search engines
should or should not index content. This allows webmasters to block bots from their site on a
site-wide or page-by-page basis.
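
For illustration, a hypothetical robots.txt file might look like the following; the bot name and
paths are invented, and real files vary from site to site.

    # Block one hypothetical bot from the entire site
    User-agent: BadBot
    Disallow: /

    # All other bots may crawl, but should stay out of one directory and one page
    User-agent: *
    Disallow: /private/
    Disallow: /drafts/old-page.html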

By default, if information is on a public web server and people link to it, search engines
will generally index it.

In 2005 Google led a crusade against blog comment spam, creating a nofollow attribute that can
be applied at the individual link level. After this was pushed through, Google quickly changed the
stated purpose of nofollow, claiming it was for any link that was sold or not under
editorial control.

Primitive Web Search:

By December of 1993, three full fledged bot fed search engines had surfaced on the web:
JumpStation, the World Wide Web Worm, and the Repository-Based Software Engineering
(RBSE) spider. JumpStation gathered info about the title and header from Web pages and
retrieved these using a simple linear search. As the web grew, JumpStation slowed to a stop. The
WWW Worm indexed titles and URLs. The problem with JumpStation and the World Wide
Web Worm is that they listed results in the order they found them, and provided no
discrimination. The RBSE spider did implement a ranking system.

Since early search algorithms did not do adequate link analysis or cache full page content, it was
extremely hard to find something if you did not know the exact name of what you were looking for.

Excite:

Excite came from the project Architext, which was started in February 1993 by six Stanford
undergraduate students. They had the idea of using statistical analysis of word relationships to make
searching more efficient. They were soon funded, and in mid 1993 they released copies of their
search software for use on web sites.

Excite was bought by a broadband provider named @Home in January, 1999 for $6.5 billion,
and was named Excite@Home. In October, 2001 Excite@Home filed for bankruptcy. InfoSpace
bought Excite from bankruptcy court for $10 million.

Web Directories:

VLib:

When Tim Berners-Lee set up the web he created the Virtual Library, which became a
loose confederation of topical experts maintaining relevant topical link lists.

EINet Galaxy

The EINet Galaxy web directory was born in January of 1994. It was organized similar to how
web directories are today. The biggest reason the EINet Galaxy became a success was that it also
contained Gopher and Telnet search features in addition to its web search feature. The web size
in early 1994 did not really require a web directory; however, other directories soon did follow.

Yahoo! Directory

In April 1994 David Filo and Jerry Yang created the Yahoo! Directory as a collection of their
favorite web pages. As their number of links grew they had to reorganize and become a
searchable directory. What set these directories apart from The Wanderer was that they provided a
human-compiled description with each URL. As time passed and the Yahoo! Directory grew, Yahoo!
began charging commercial sites for inclusion. As time passed the inclusion rates for listing a
commercial site increased. The current cost is $299 per year. Many informational sites are still
added to the Yahoo! Directory for free.

Open Directory Project

In 1998 Rich Skrenta and a small group of friends created the Open Directory Project, which is a
directory which anybody can download and use in whole or part. The ODP (also known as
DMOZ) is the largest internet directory, almost entirely run by a group of volunteer editors. The
Open Directory Project grew out of the frustration webmasters faced while waiting to be included in
the Yahoo! Directory. Netscape bought the Open Directory Project in November 1998. Later
that same month AOL announced the intention of buying Netscape in a $4.5 billion all stock
deal.

LII

Google offers a librarian newsletter to help librarians and other web editors help make
information more accessible and categorize the web. The second Google librarian newsletter
came from Karen G. Schneider, who is the director of Librarians' Internet Index. LII is a high
quality directory aimed at librarians. Her article explains what she and her staff look for when
looking for quality credible resources to add to the LII. Most other directories, especially those
which have a paid inclusion option, hold lower standards than selected limited catalogs created
by librarians.

The Internet Public Library is another well kept directory of websites.

Business.com

Due to the time-intensive nature of running a directory and the general lack of scalability of the
business model, the quality and size of directories drops off sharply after the first
half dozen or so general directories. There are also numerous smaller industry-, vertical-, or
locally oriented directories. Business.com, for example, is a directory of business websites.

Looksmart

Looksmart was founded in 1995. They competed with the Yahoo! Directory by frequently
increasing their inclusion rates back and forth. In 2002 Looksmart transitioned into a pay per
click provider, which charged listed sites a flat fee per click. That caused the demise of any good
faith or loyalty they had built up, although it allowed them to profit by syndicating those paid
listings to some major portals like MSN. The problem was that Looksmart became too dependent
on MSN, and in 2003, when Microsoft announced they were dumping Looksmart, it basically
killed their business model.

In March of 2002, Looksmart bought a search engine by the name of WiseNut, but it never
gained traction. Looksmart also owns a catalog of content articles organized in vertical sites, but
due to limited relevancy Looksmart has lost most (if not all) of their momentum. In 1998
Looksmart tried to expand their directory by buying the non-commercial Zeal directory for $20
million, but on March 28, 2006 Looksmart shut down the Zeal directory, hoping to drive
traffic using Furl, a social bookmarking program.

WebCrawler:

Brian Pinkerton of the University of Washington released WebCrawler on April 20, 1994. It was
the first crawler which indexed entire pages. Soon it became so popular that during daytime
hours it could not be used. AOL eventually purchased WebCrawler and ran it on their network.
Then in 1997, Excite bought out WebCrawler, and AOL began using Excite to power its
NetFind. WebCrawler opened the door for many other services to follow suit. Within a year of
its debut came Lycos, Infoseek, and OpenText.

Lycos:

Lycos was the next major search development, having been designed at Carnegie Mellon
University around July of 1994. Michael Mauldin was responsible for this search engine and
remained the chief scientist at Lycos Inc.

On July 20, 1994, Lycos went public with a catalog of 54,000 documents. In addition to
providing ranked relevance retrieval, Lycos provided prefix matching and word proximity
bonuses. But Lycos' main difference was the sheer size of its catalog: by August 1994, Lycos
had identified 394,000 documents; by January 1995, the catalog had reached 1.5 million
documents; and by November 1996, Lycos had indexed over 60 million documents -- more than
any other Web search engine. In October 1994, Lycos ranked first on Netscape's list of search
engines by finding the most hits on the word ‘surf’.

Infoseek:

Infoseek also started out in 1994, claiming to have been founded in January. They really did not
bring a whole lot of innovation to the table, but they offered a few add-ons, and in December
1995 they convinced Netscape to use them as their default search, which gave them major
exposure. One popular feature of Infoseek was allowing webmasters to submit a page to the
search index in real time, which was a search spammer's paradise.

AltaVista:

AltaVista's debut online came during this same month. AltaVista brought many important
features to the web scene. They had nearly unlimited bandwidth (for that time), they were the
first to allow natural language queries and advanced searching techniques, and they allowed users
to add or delete their own URL within 24 hours. They even allowed inbound link checking. AltaVista
also provided numerous search tips and advanced search features.

Due to mismanagement, a fear of result manipulation, and portal-related clutter, AltaVista
was largely driven into irrelevancy around the time Inktomi and Google started becoming
popular.
popular. On February 18, 2003, Overture signed a letter of intent to buy AltaVista for $80
million in stock and $60 million cash. After Yahoo! bought out Overture they rolled some of the
AltaVista technology into Yahoo! Search, and occasionally use AltaVista as a testing platform.

Inktomi:

The Inktomi Corporation came about on May 20, 1996 with its search engine
HotBot. Two Cal Berkeley cohorts created Inktomi from the improved
technology gained from their research. HotWired listed this site, and it quickly became
hugely popular.

In October of 2001 Danny Sullivan wrote an article titled Inktomi Spam Database Left Open To
Public, which highlights how Inktomi accidentally allowed the public to access their database of
spam sites, which listed over 1 million URLs at that time.

Although Inktomi pioneered the paid inclusion model it was nowhere near as efficient as the pay
per click auction model developed by Overture. Licensing their search results also was not
profitable enough to pay for their scaling costs. They failed to develop a profitable business
model, and sold out to Yahoo! for approximately $235 million, or $1.65 a share,
in December of 2003.

Ask.com (Formerly Ask Jeeves):

In April of 1997 Ask Jeeves was launched as a natural language search engine. Ask Jeeves used
human editors to try to match search queries. Ask was powered by DirectHit for a while, which
aimed to rank results based on their popularity, but that technology proved too easy to spam to
serve as the core algorithm component. In 2000 the Teoma search engine was released, which
uses clustering to organize sites by Subject Specific Popularity, which is another way of saying
they tried to find local web communities. In 2001 Ask Jeeves bought Teoma to replace the
DirectHit search technology.

Jon Kleinberg's Authoritative Sources in a Hyperlinked Environment [PDF] was a source of
inspiration that led to the eventual creation of Teoma. Mike Grehan's Topic Distillation [PDF]
also explains how subject-specific popularity works.

AllTheWeb

AllTheWeb was a search technology platform launched in May of 1999 to showcase Fast's
search technologies. They had a sleek user interface with rich advanced search features, but on
February 23, 2003, AllTheWeb was bought by Overture for $70 million. After Yahoo! bought
out Overture they rolled some of the AllTheWeb technology into Yahoo! Search, and
occasionally use AllTheWeb as a testing platform.

Google also has a Scholar search program which aims to make scholarly research easier to do.

On November 15, 2005 Google launched a product called Google Base, which is a database of
just about anything imaginable. Users can upload items and title, describe, and tag them as they
see fit. Based on usage statistics this tool can help Google understand which vertical search
products they should create or place more emphasis on. They believe that owning other verticals
will allow them to drive more traffic back to their core search service. They also believe that
targeted measured advertising associated with search can be carried over to other mediums. For
example, Google bought dMarc, a radio ad placement firm. Yahoo! has also tried to extend their
reach by buying other high traffic properties, like the photo sharing site Flickr, and the social
bookmarking site del.icio.us.

Google AdSense

On March 4, 2003 Google announced their content targeted ad network. In April 2003, Google
bought Applied Semantics, which had CIRCA technology that allowed them to drastically
improve the targeting of those ads. Google adopted the name AdSense for the new ad program.

AdSense allows web publishers large and small to automate the placement of relevant ads on
their content. Google initially started off by allowing textual ads in numerous formats, but
eventually added image ads and video ads. Advertisers could choose which keywords they wanted
to target and which ad formats they wanted to market.

To help grow the network and make the market more efficient Google added a link which allows
advertisers to sign up for an AdWords account from content websites, and Google allowed
advertisers to buy ads targeted to specific websites, pages, or demographic categories. Ads
targeted on websites are sold on a cost per thousand impression (CPM) basis in an ad auction
against other keyword targeted and site targeted ads.

Google also allows some publishers to place AdSense ads in their feeds, and some select
publishers can place ads in emails.

To prevent the erosion of value of search ads Google allows advertisers to opt out of placing
their ads on content sites, and Google also introduced what they called smart pricing. Smart
pricing automatically adjusts the click cost of an ad based on what Google perceives a click from
that page to be worth. A click from a digital camera review page would typically be worth more
than a click from a page of pictures.

Yahoo! Search Marketing

Yahoo! Search Marketing is the rebranded name for Overture after Yahoo! bought them out. As
of September 2006 their platform is generally the exact same as the old Overture platform, with
the same flaws - ad CTR not factored into click cost, it's hard to run local ads, and it is just
generally clunky.

Microsoft AdCenter

Microsoft AdCenter was launched on May 3, 2006. While Microsoft has limited market share,
they intend to increase it by baking search into Internet Explorer 7. On the
features front, Microsoft added demographic targeting and dayparting features to the pay per
click mix. Microsoft's ad algorithm includes both cost per click and ad clickthrough rate.

Microsoft also created the XBox game console, and on May 4, 2006 announced they bought a
video game ad targeting firm named Massive Inc. Eventually video game ads will be sold from
within Microsoft AdCenter.

Early Years

Google's corporate history page has a pretty strong background on Google, starting from when
Larry met Sergey at Stanford right up to present day. In 1995 Larry Page met Sergey Brin at
Stanford.

By January of 1996, Larry and Sergey had begun collaboration on a search engine called
BackRub, named for its unique ability to analyze the "back links" pointing to a given website.
Larry, who had always enjoyed tinkering with machinery and had gained some notoriety for
building a working printer out of Lego™ bricks, took on the task of creating a new kind of server
environment that used low-end PCs instead of big expensive machines. Afflicted by the
perennial shortage of cash common to graduate students everywhere, the pair took to haunting
the department's loading docks in hopes of tracking down newly arrived computers that they
could borrow for their network.

A year later, their unique approach to link analysis was earning BackRub a growing reputation
among those who had seen it. Buzz about the new search technology began to build as word
spread around campus.

BackRub ranked pages using citation notation, a concept which is popular in academic circles. If
someone cites a source they usually think it is important. On the web, links act as citations. In the
PageRank algorithm links count as votes, but some votes count more than others. Your ability to
rank and the strength of your ability to vote for others depends upon your authority: how many
people link to you and how trustworthy those links are.
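
As a rough sketch of the idea that links act as weighted votes, the following is a simplified
power-iteration version of PageRank in Python. It is not Google's actual implementation, and
the tiny four-page link graph is invented for the example.

    # Simplified PageRank by power iteration over an invented link graph.
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    def pagerank(graph, damping=0.85, iterations=50):
        pages = list(graph)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outlinks in graph.items():
                if not outlinks:                       # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / len(pages)
                else:
                    for target in outlinks:            # each link passes on a share of the page's rank
                        new_rank[target] += damping * rank[page] / len(outlinks)
            rank = new_rank
        return rank

    print(pagerank(links))   # "C" scores highest: it receives the most (and strongest) votes

The point of the example is that a page's score depends not just on how many links point to it,
but on the scores of the pages casting those votes.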

In 1998, Google was launched. Sergey tried to shop their PageRank technology, but nobody was
interested in buying or licensing their search technology at that time.

Winning the Search War



Later that year Andy Bechtolsheim gave them $100,000 in seed funding, and Google received $25
million from Sequoia Capital and Kleiner Perkins Caufield & Byers the following year. In 1999 AOL
selected Google as a search partner, and Yahoo! followed suit a year later. In 2000 Google also
launched their popular Google Toolbar. Google has gained search market share year over year ever
since.

In 2000 Google relaunched their AdWords program to sell ads on a CPM basis. In 2002 they
retooled the service, selling ads in an auction which would factor in bid price and ad
clickthrough rate. On May 1, 2002, AOL announced they would use Google to deliver their
search related ads, which was a strong turning point in Google's battle against Overture.

In 2003 Google also launched their AdSense program, which allowed them to expand their ad
network by selling targeted ads on other websites.

Going Public

Google used a two class stock structure, decided not to give earnings guidance, and offered
shares of their stock in a Dutch auction. They received virtually limitless negative press for the
perceived hubris they expressed in their "AN OWNER'S MANUAL" FOR GOOGLE'S
SHAREHOLDERS. After some controversy surrounding an interview in Playboy, Google
dropped their IPO offer range from $85 to $95 per share from $108 to $135. Google went public
at $85 a share on August 19, 2004 and its first trade was at 11:56 am ET at $100.01.

Verticals Galore!

In addition to running the world's most popular search service, Google also runs a large number
of vertical search services, including:

• Google News: Google News launched in beta in September 2002. On September 6, 2006,
Google announced an expanded Google News Archive Search that goes back over 200
years.
• Google Book Search: On October 6, 2004, Google launched Google Book Search.
• Google Scholar: On November 18, 2004, Google launched Google Scholar, an academic
search program.
• Google Blog Search: On September 14, 2005, Google announced Google Blog Search.
• Google Base: On November 15, 2005, Google announced the launch of Google Base, a
database of uploaded information describing online or offline content, products, or
services.
• Google Video: On January 6, 2006, Google announced Google Video.
• Google Universal Search: On May 16, 2007 Google began mixing many of their vertical
results into their organic search results.

Just Search, We Promise!


Microsoft

In 1998 MSN Search was launched, but Microsoft did not get serious about search until after
Google proved the business model. Until Microsoft saw the light they primarily relied on partners
like Overture, Looksmart, and Inktomi to power their search service.

They launched their technology preview of their search engine around July 1st of 2004. They
formally switched from Yahoo! organic search results to their own in house technology on
January 31st, 2005. MSN announced they dumped Yahoo!'s search ad program on May 4th,
2006.

On September 11, 2006, Microsoft announced they were launching their Live Search product.

2. Working of a search engine

A search engine operates in the following order:

1. Web crawling
2. Indexing
3. Searching

Web search engines work by storing information about many web pages, which they retrieve
from the WWW itself. These pages are retrieved by a Web crawler (sometimes also known as a
spider) — an automated Web browser which follows every link it sees. Exclusions can be made by
the use of robots.txt. The contents of each page are then analyzed to determine how it should be
indexed (for example, words are extracted from the titles, headings, or special fields called meta
tags). Data about web pages are stored in an index database for use in later queries. Some search
engines, such as Google, store all or part of the source page (referred to as a cache) as well as
information about the web pages, whereas others, such as AltaVista, store every word of every
page they find. This cached page always holds the actual search text since it is the one that was
actually indexed, so it can be very useful when the content of the current page has been updated
and the search terms are no longer in it.

When a user enters a query into a search engine (typically by using key words), the engine
examines its index and provides a listing of best-matching web pages according to its criteria,
usually with a short summary containing the document's title and sometimes parts of the text. Most
search engines support the use of the Boolean operators AND, OR and NOT to further specify the
search query. Some search engines provide an advanced feature called proximity search which
allows users to define the distance between keywords.

A web crawler (also known as a web spider, web robot, or—especially in the FOAF
community—web scutter) is a program or automated script which browses the World
Wide Web in a methodical, automated manner. Other less frequently used names for web
crawlers are ants, automatic indexers, bots, and worms.

This process is called web crawling or spidering. Many sites, in particular search engines, use
spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy
of all the visited pages for later processing by a search engine that will index the downloaded
pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a
website, such as checking links or validating HTML code. Also, crawlers can be used to gather
specific types of information from Web pages, such as harvesting e-mail addresses (usually for
spam).

A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to
visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the
page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier
are recursively visited according to a set of policies.
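
A minimal sketch of this seed-and-frontier loop in Python is shown below. It assumes the
third-party requests and beautifulsoup4 packages, uses a hypothetical seed URL, and omits the
robots.txt handling, politeness delays, and error handling that a real crawler needs.

    from collections import deque
    from urllib.parse import urljoin
    import requests                      # third-party package
    from bs4 import BeautifulSoup        # third-party package (beautifulsoup4)

    def crawl(seeds, max_pages=50):
        """Visit pages breadth-first, starting from the seed URLs."""
        frontier = deque(seeds)          # the crawl frontier
        visited = set()
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            try:
                html = requests.get(url, timeout=5).text
            except requests.RequestException:
                continue
            visited.add(url)
            # Extract hyperlinks from the page and add unseen ones to the frontier.
            for tag in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, tag["href"])
                if link not in visited:
                    frontier.append(link)
        return visited

    # Example (hypothetical seed): crawl(["https://example.com/"])

The policies discussed next (selection, re-visit, politeness, parallelization) decide which URLs
enter this frontier, in what order, and how fast they are fetched.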

Crawling policies

There are three important characteristics of the Web that make crawling it very difficult:

• its large volume,
• its fast rate of change, and
• dynamic page generation,

which combine to produce a wide variety of possible crawlable URLs.

The large volume implies that the crawler can only download a fraction of the web pages within
a given time, so it needs to prioritize its downloads. The high rate of change implies that by the
time the crawler is downloading the last pages from a site, it is very likely that new pages have
been added to the site, or that pages have already been updated or even deleted.

The recent increase in the number of pages being generated by server-side scripting languages
has also created difficulty in that endless combinations of HTTP GET parameters exist, only a
small selection of which will actually return unique content. For example, a simple online photo
gallery may offer three options to users, as specified through HTTP GET parameters. If there
exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to
disable user-provided contents, then that same set of content can be accessed with forty-eight
different URLs, all of which will be present on the site. This mathematical combination creates a
problem for crawlers, as they must sort through endless combinations of relatively minor scripted
changes in order to retrieve unique content.

As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor
free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some
reasonable measure of quality or freshness is to be maintained." A crawler must carefully choose
at each step which pages to visit next.

The behavior of a web crawler is the outcome of a combination of policies:

• A selection policy that states which pages to download.
• A re-visit policy that states when to check for changes to the pages.
• A politeness policy that states how to avoid overloading websites.
• A parallelization policy that states how to coordinate distributed web crawlers.

Selection policy

Given the current size of the Web, even large search engines cover only a portion of the publicly
available Internet. As a crawler always downloads just a fraction of the Web pages, it is highly
desirable that the downloaded fraction contains the most relevant pages, and not just a random
sample of the Web.

This requires a metric of importance for prioritizing Web pages. The importance of a page is a
function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the
latter is the case of vertical search engines restricted to a single top-level domain, or search
engines restricted to a fixed Web site). Designing a good selection policy has an added difficulty:
it must work with partial information, as the complete set of Web pages is not known during
crawling.

Abiteboul (Abiteboul et al., 2003) designed a crawling strategy based on an algorithm called
OPIC (On-line Page Importance Computation). In OPIC, each page is given an initial sum of
"cash" which is distributed equally among the pages it points to.

Boldi et al. (Boldi et al., 2004) used simulation on subsets of the Web of 40 million pages from
the .it domain and 100 million pages from the WebBase crawl, testing breadth-first against
depth-first, random ordering and an omniscient strategy.

Baeza-Yates et al. used simulation on two subsets of the Web of 3 million pages from the .gr and
.cl domains, testing several crawling strategies. They showed that the OPIC strategy and a
strategy that uses the length of the per-site queues are both better than breadth-first crawling, and
that it is also very effective to use a previous crawl, when it is available, to guide the current one.

Daneshpajouh et al. designed a community-based algorithm for discovering good seeds. Their
method crawls web pages with high PageRank from different communities in fewer iterations than
a crawl starting from random seeds. Using this method, one can extract good seeds from a previously
crawled web graph, and a new crawl seeded with them can be very effective.

A crawler may only want to seek out HTML pages and avoid all other MIME types. In order to
request only HTML resources, a crawler may make an HTTP HEAD request to determine a Web
resource's MIME type before requesting the entire resource with a GET request. To avoid
making numerous HEAD requests, a crawler may alternatively examine the URL and only
request the resource if the URL ends with .html, .htm or a slash. This strategy may cause
numerous HTML Web resources to be unintentionally skipped. A similar strategy compares the
extension of the web resource to a list of known HTML-page types: .html, .htm, .asp, .aspx, .php,
and a slash.
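
A hedged sketch of both tactics in Python is given below; it uses the third-party requests package,
and the example URL and function names are hypothetical.

    import requests   # third-party package

    HTML_EXTENSIONS = (".html", ".htm", ".asp", ".aspx", ".php", "/")

    def looks_like_html(url):
        """Cheap check based on the URL's ending alone (may skip some HTML resources)."""
        return url.lower().endswith(HTML_EXTENSIONS)

    def is_html_resource(url):
        """More reliable check: request only the headers, then inspect the MIME type."""
        try:
            response = requests.head(url, allow_redirects=True, timeout=5)
        except requests.RequestException:
            return False
        return response.headers.get("Content-Type", "").startswith("text/html")

    # Example (hypothetical URL): fetch the full page with GET only if one of the checks passes.
    # if looks_like_html(url) or is_html_resource(url):
    #     page = requests.get(url, timeout=5)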

Some crawlers intend to download as many resources as possible from a particular Web site.
Cothey (Cothey, 2004) introduced a path-ascending crawler that would ascend to every path in
each URL that it intends to crawl. For example, when given a seed URL of
http://llama.org/hamster/monkey/page.html, it will attempt to crawl /hamster/monkey/, /hamster/,
and /. Cothey found that a path-ascending crawler was very effective in finding isolated
resources, or resources for which no inbound link would have been found in regular crawling.

Many path-ascending crawlers are also known as harvester software, because they're used to
"harvest" or collect all the content - perhaps the collection of photos in a gallery - from a specific
page or host.
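
A small sketch of how a path-ascending crawler could derive those ancestor URLs from the example
above; the helper name is ours, not Cothey's.

    from urllib.parse import urlsplit, urlunsplit

    def ancestor_urls(url):
        """Yield every ancestor path of a URL, up to the site root."""
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        for depth in range(len(segments) - 1, -1, -1):
            path = "/" + "/".join(segments[:depth]) + ("/" if depth else "")
            yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))

    print(list(ancestor_urls("http://llama.org/hamster/monkey/page.html")))
    # ['http://llama.org/hamster/monkey/', 'http://llama.org/hamster/', 'http://llama.org/']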

Focused crawling

A focused crawler tries to download only pages that are relevant to a pre-defined topic or query.
The main problem in focused crawling is that, in the context of a web crawler, we would like to
be able to predict the similarity of the text of a given page to the query before actually
downloading the page. A possible predictor is the anchor text of links; this was the approach
taken by Pinkerton in a crawler developed in the early days of the Web. Diligenti et al. propose
to use the complete content of the pages already visited to infer the similarity between the
driving query and the pages that have not been visited yet. The performance of focused
crawling depends mostly on the richness of links in the specific topic being searched; focused
crawling usually relies on a general Web search engine to provide starting points.

Web 3.0 defines advanced technologies and new principles for next-generation search, currently
summarized in the Semantic Web and Website Parse Template concepts. Web 3.0 crawling and
indexing technologies are expected to be based on clever human-machine associations.

Re-visit policy

The Web has a very dynamic nature, and crawling a fraction of the Web can take a really long
time, usually measured in weeks or months. By the time a web crawler has finished its crawl,
many events could have happened. These events can include creations, updates and deletions.

From the search engine's point of view, there is a cost associated with not detecting an event, and
thus having an outdated copy of a resource. The most used cost functions, introduced in (Cho
and Garcia-Molina, 2000), are freshness and age.

Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The
freshness of a page p in the repository at time t is F_p(t) = 1 if the local copy of p is identical
to the live copy at time t, and 0 otherwise.

Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the
repository at time t is A_p(t) = 0 if p has not been modified since it was last crawled, and
otherwise t minus the time at which p was last modified.

[Figure: Evolution of freshness and age in Web crawling]

The objective of the crawler is to keep the average freshness of pages in its collection as high as
possible, or to keep the average age of pages as low as possible. These objectives are not
equivalent: in the first case, the crawler is just concerned with how many pages are out-dated,
while in the second case, the crawler is concerned with how old the local copies of pages are.

Two simple re-visiting policies were studied by Cho and Garcia-Molina:

Uniform policy: This involves re-visiting all pages in the collection with the same frequency,
regardless of their rates of change.

Proportional policy: This involves re-visiting more often the pages that change more frequently.
The visiting frequency is directly proportional to the (estimated) change frequency.

(In both cases, the repeated crawling order of pages can be done either at random or with a fixed
order.)

To improve freshness, we should penalize the elements that change too often (Cho and Garcia-
Molina, 2003a). The optimal re-visiting policy is neither the uniform policy nor the proportional
policy. The optimal method for keeping average freshness high includes ignoring the pages that
change too often, and the optimal method for keeping average age low is to use access frequencies
that monotonically (and sub-linearly) increase with the rate of change of each page.

Politeness policy

Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can
have a crippling impact on the performance of a site. Needless to say if a single crawler is
performing multiple requests per second and/or downloading large files, a server would have a
hard time keeping up with requests from multiple crawlers. The costs of using web crawlers include:

• Network resources, as crawlers require considerable bandwidth and operate with a high
degree of parallelism during a long period of time.
• Server overload, especially if the frequency of accesses to a given server is too high.
• Poorly written crawlers, which can crash servers or routers, or which download pages
they cannot handle.
• Personal crawlers that, if deployed by too many users, can disrupt networks and Web
servers.

A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt
protocol that is a standard for administrators to indicate which parts of their Web servers should
not be accessed by crawlers. This standard does not include a suggestion for the interval of visits
to the same server, even though this interval is the most effective way of avoiding server
overload. Recently, commercial search engines like Ask Jeeves, MSN and Yahoo! have begun to support
an extra "Crawl-delay:" parameter in the robots.txt file to indicate the number of seconds to
delay between requests.
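
For example, a polite crawler written in Python might use the standard library's urllib.robotparser
to honor both the Disallow rules and any Crawl-delay directive; the site, URLs, and user-agent name
below are hypothetical.

    import time
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")   # hypothetical site
    rp.read()

    delay = rp.crawl_delay("MyCrawler") or 1.0     # fall back to a 1-second pause

    for url in ["https://example.com/", "https://example.com/private/page.html"]:
        if rp.can_fetch("MyCrawler", url):
            print("fetching", url)                  # a real crawler would download the page here
            time.sleep(delay)                       # politeness pause between requests
        else:
            print("disallowed by robots.txt:", url)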

An early proposal for the interval between connections was 60 seconds.
However, if pages were downloaded at this rate from a website with more than 100,000 pages
over a perfect connection with zero latency and infinite bandwidth, it would take more than 2
months to download that entire website alone; also, only a fraction of the resources from that Web
server would be used. This does not seem acceptable.

For those using web crawlers for research purposes, a more detailed cost-benefit analysis is
needed and ethical considerations should be taken into account when deciding where to crawl
and how fast to crawl.

Anecdotal evidence from access logs shows that access intervals from known crawlers vary
between 20 seconds and 3–4 minutes. It is worth noticing that even when being very polite, and
taking all the safeguards to avoid overloading web servers, some complaints from Web server
administrators are received. Brin and Page note that: "... running a crawler which connects to
more than half a million servers (...) generates a fair amount of email and phone calls. Because of
the vast number of people coming on line, there are always those who do not know what a
crawler is, because this is the first one they have seen."

Parallelization policy

A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize
the download rate while minimizing the overhead from parallelization and to avoid repeated
downloads of the same page. To avoid downloading the same page more than once, the crawling
system requires a policy for assigning the new URLs discovered during the crawling process, as
the same URL can be found by two different crawling processes.

Web crawler architectures

[Figure: High-level architecture of a standard Web crawler]

A crawler must not only have a good crawling strategy, as noted in the previous sections, but it
should also have a highly optimized architecture.

Web crawlers are a central part of search engines, and details on their algorithms and architecture
are kept as business secrets. When crawler designs are published, there is often an important lack
of detail that prevents others from reproducing the work. There are also emerging concerns about
"search engine spamming", which prevent major search engines from publishing their ranking
algorithms.

URL normalization

Crawlers usually perform some type of URL normalization in order to avoid crawling the same
resource more than once. The term URL normalization, also called URL canonicalization, refers
to the process of modifying and standardizing a URL in a consistent manner. There are several
types of normalization that may be performed including conversion of URLs to lowercase,
removal of "." and ".." segments, and adding trailing slashes to the non-empty path component
(Pant et al., 2004).
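
A minimal sketch of these normalization steps in Python, covering only the transformations named
above (lowercasing, dot-segment removal, trailing slashes) rather than a complete canonicalizer;
the directory heuristic and example URL are our own.

    import posixpath
    from urllib.parse import urlsplit, urlunsplit

    def normalize_url(url):
        """Lowercase the scheme and host, resolve '.' and '..' segments, add a trailing slash."""
        parts = urlsplit(url)
        scheme = parts.scheme.lower()
        host = parts.netloc.lower()
        path = posixpath.normpath(parts.path) if parts.path else "/"
        if not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
            path += "/"    # crude heuristic: last segment has no extension, so treat it as a directory
        return urlunsplit((scheme, host, path, parts.query, ""))

    print(normalize_url("HTTP://Example.COM/a/b/../c"))   # http://example.com/a/c/

Applying the same normalization to every discovered URL lets the crawler recognize that two
differently written URLs refer to the same resource.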

Crawler identification

Web crawlers typically identify themselves to a web server by using the User-agent field of an
HTTP request. Web site administrators typically examine their web servers’ log and use the user
agent field to determine which crawlers have visited the web server and how often. The user
agent field may include a URL where the Web site administrator may find out more information
about the crawler. Spambots and other malicious Web crawlers are unlikely to place identifying
information in the user agent field, or they may mask their identity as a browser or other well-
known crawler.

It is important for web crawlers to identify themselves so Web site administrators can contact the
owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap or they
may be overloading a web server with requests, and the owner needs to stop the crawler.

Identification is also useful for administrators that are interested in knowing when they may
expect their Web pages to be indexed by a particular search engine.

• RBSE was the first published web crawler. It was based on two programs: the first
program, "spider" maintains a queue in a relational database, and the second program
"mite", is a modified www ASCII browser that downloads the pages from the Web.
• WebCrawler was used to build the first publicly-available full-text index of a subset of
the Web. It was based on lib-WWW to download pages, and another program to parse
and order URLs for breadth-first exploration of the Web graph. It also included a real-
time crawler that followed links based on the similarity of the anchor text with the
provided query.

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information
retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive
psychology, mathematics, informatics, physics and computer science. An alternate name for the
process in the context of search engines designed to find web pages on the Internet is Web
indexing.

Popular engines focus on the full-text indexing of online, natural language documents. Media
types such as video, audio, and graphics are also searchable.

Meta search engines reuse the indices of other services and do not store a local index, whereas
cache-based search engines permanently store the index along with the corpus. Unlike full-text
indices, partial-text services restrict the depth indexed to reduce index size. Larger services
typically perform indexing at a predetermined time interval due to the required time and
processing costs, while agent-based search engines index in real time.

Indexing

The purpose of storing an index is to optimize speed and performance in finding relevant
documents for a search query. Without an index, the search engine would scan every document
in the corpus, which would require considerable time and computing power. For example, while
an index of 10,000 documents can be queried within milliseconds, a sequential scan of every
word in 10,000 large documents could take hours. The additional computer storage required to
store the index, as well as the considerable increase in the time required for an update to take
place, are traded off for the time saved during information retrieval.

Index Design Factors

Major factors in designing a search engine's architecture include:

Merge factors
How data enters the index, or how words or subject features are added to the index during
text corpus traversal, and whether multiple indexers can work asynchronously. The
indexer must first check whether it is updating old content or adding new content.
Traversal typically correlates to the data collection policy. Search engine index merging
is similar in concept to the SQL Merge command and other merge algorithms.
Storage techniques
How to store the index data, that is, whether information should be data compressed or
filtered.
Index size
How much computer storage is required to support the index.
Lookup speed
How quickly a word can be found in the inverted index. The speed of finding an entry in
a data structure, compared with how quickly it can be updated or removed, is a central
focus of computer science.
Maintenance
How the index is maintained over time.
Fault tolerance
How important it is for the service to be reliable. Issues include dealing with index
corruption, determining whether bad data can be treated in isolation, dealing with bad
hardware, partitioning, and schemes such as hash-based or composite partitioning, as well
as replication.

Index Data Structures

Search engine architectures vary in the way indexing is performed and in methods of index
storage to meet the various design factors. The main types of indices used, the forward index and
the inverted index, are discussed below.

Challenges in Parallelism

A major challenge in the design of search engines is the management of parallel computing
processes. There are many opportunities for race conditions and coherent faults. For example, a
new document is added to the corpus and the index must be updated, but the index
simultaneously needs to continue responding to search queries. This is a collision between two
competing tasks. Consider that authors are producers of information, and a web crawler is the
consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward
index is the consumer of the information produced by the corpus, and the inverted index is the
consumer of information produced by the forward index. This is commonly referred to as a
producer-consumer model. The indexer is the producer of searchable information and users are
the consumers that need to search. The challenge is magnified when working with distributed
storage and distributed processing. In an effort to scale with larger amounts of indexed
information, the search engine's architecture may involve distributed computing, where the
search engine consists of several machines operating in unison. This increases the possibilities
for incoherency and makes it more difficult to maintain a fully-synchronized, distributed, parallel
architecture.

Inverted indices

Many search engines incorporate an inverted index when evaluating a search query to quickly
locate documents containing the words in a query and then rank these documents by relevance.
Because the inverted index stores a list of the documents containing each word, the search engine
can use direct access to find the documents associated with each word in the query in order to
retrieve the matching documents quickly. The following is a simplified illustration of an inverted
index:

Inverted Index
Word Documents
the Document 1, Document 3, Document 4, Document 5
cow Document 2, Document 3, Document 4
says Document 5
moo Document 7

This index can only determine whether a word exists within a particular document, since it stores
no information regarding the frequency and position of the word; it is therefore considered to be
a boolean index. Such an index determines which documents match a query but does not rank
matched documents. In some designs the index includes additional information such as the
frequency of each word in each document or the positions of a word in each document. Position
information enables the search algorithm to identify word proximity to support searching for
phrases; frequency can be used to help in ranking the relevance of documents to the query. Such
topics are the central research focus of information retrieval.
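
A minimal sketch of a boolean inverted index in Python, illustrating how the postings for each
query word are intersected to find matching documents; the toy documents echo the table above
and are invented for the example.

    from collections import defaultdict

    documents = {
        1: "the cow says moo",
        2: "the cat and the hat",
        3: "the dish ran away with the spoon",
    }

    # Build the inverted index: each word maps to the set of documents containing it.
    inverted_index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            inverted_index[word].add(doc_id)

    def search(query):
        """Boolean AND query: return the documents containing every query word."""
        postings = [inverted_index.get(word, set()) for word in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    print(search("the cow"))    # {1}
    print(search("the"))        # {1, 2, 3}

A positional index would additionally record where each word occurs in each document, which is
what makes phrase and proximity queries possible.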

The inverted index is a sparse matrix, since not all words are present in each document. To
reduce computer storage memory requirements, it is stored differently from a two dimensional
array. The index is similar to the term document matrices employed by latent semantic analysis.
The inverted index can be considered a form of a hash table. In some cases the index is a form of
a binary tree, which requires additional storage but may reduce the lookup time. In larger indices
the architecture is typically a distributed hash table. Inverted indices can be programmed in
several computer programming languages.

Index Merging

The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first
deletes the contents of the inverted index. The architecture may be designed to support
incremental indexing, where a merge identifies the document or documents to be added or
updated and then parses each document into words. For technical accuracy, a merge conflates
newly indexed documents, typically residing in virtual memory, with the index cache residing on
one or more computer hard drives.

After parsing, the indexer adds the referenced document to the document list for the appropriate
words. In a larger search engine, the process of finding each word in the inverted index (in order
to report that it occurred within a document) may be too time consuming, and so this process is
commonly split up into two parts, the development of a forward index and a process which sorts
the contents of the forward index into the inverted index. The inverted index is so named because
it is an inversion of the forward index.

The Forward Index

The forward index stores a list of words for each document. The following is a simplified form
of the forward index:

Forward Index
Document Words
Document 1 the,cow,says,moo
Document 2 the,cat,and,the,hat
Document 3 the,dish,ran,away,with,the,spoon

The rationale behind developing a forward index is that as documents are parsed, it is better to
immediately store the words per document. The delineation enables asynchronous system
processing, which partially circumvents the inverted index update bottleneck. The forward index
is sorted to transform it to an inverted index. The forward index is essentially a list of pairs
consisting of a document and a word, collated by the document. Converting the forward index to
an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted
index is a word-sorted forward index.
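
A short sketch of that transformation in Python: the forward index is flattened into (word,
document) pairs, sorted by word, and grouped back into an inverted index. The toy data mirrors
the table above and is invented for the example.

    from itertools import groupby

    forward_index = {
        "Document 1": ["the", "cow", "says", "moo"],
        "Document 2": ["the", "cat", "and", "the", "hat"],
        "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
    }

    # Flatten into (word, document) pairs and sort by word: the heart of the inversion.
    pairs = sorted((word, doc) for doc, words in forward_index.items() for word in words)

    # Group the sorted pairs by word to obtain the inverted index.
    inverted_index = {word: sorted({doc for _, doc in group})
                      for word, group in groupby(pairs, key=lambda pair: pair[0])}

    print(inverted_index["the"])   # ['Document 1', 'Document 2', 'Document 3']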

Compression

Generating or maintaining a large-scale search engine index represents a significant storage and
processing challenge. Many search engines utilize a form of compression to reduce the size of
the indices on disk. Consider the following scenario for a full text, Internet search engine.

• An estimated 2,000,000,000 different web pages existed as of the year 2000.
• Suppose there are 250 words on each web page (based on the assumption that they are
similar to the pages of a novel).
• It takes 8 bits (or 1 byte) to store a single character; some encodings use 2 bytes per
character.
• The average number of characters in any given word on a page may be estimated at 5.
• The average personal computer comes with 100 to 250 gigabytes of usable space.

Given this scenario, an uncompressed index (assuming a non-conflated, simple, index) for 2
billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5
bytes per word, this would require 2500 gigabytes of storage space alone, more than the average
free disk space of 25 personal computers. This space requirement may be even larger for a fault-
tolerant distributed storage architecture. Depending on the compression technique chosen, the
index can be reduced to a fraction of this size. The tradeoff is the time and processing power
required to perform compression and decompression.
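
The arithmetic behind this estimate can be reproduced directly; the figures below are the same
back-of-the-envelope assumptions listed above, not measurements:

    pages = 2_000_000_000          # estimated web pages (year-2000 figure)
    words_per_page = 250           # assumed words per page
    bytes_per_word = 5             # ~5 characters per word at 1 byte per character

    word_entries = pages * words_per_page            # 500,000,000,000 word entries
    index_bytes = word_entries * bytes_per_word      # 2,500,000,000,000 bytes
    print(index_bytes / 10**9)                       # 2500.0 gigabytes
    print(index_bytes / 10**9 / 100)                 # free space of ~25 PCs at 100 GB each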

Notably, large scale search engine designs incorporate the cost of storage as well as the costs of
electricity to power the storage. Thus compression is ultimately a cost trade-off.

Document Parsing

Document parsing breaks apart the components (words) of a document or other form of media
for insertion into the forward and inverted indices. The words found are called tokens, and so, in
the context of search engine indexing and natural language processing, parsing is more
commonly referred to as tokenization. It is also sometimes called word boundary
disambiguation, tagging, text segmentation, content analysis, text analysis, text mining,
concordance generation, speech segmentation, lexing, or lexical analysis. The terms 'indexing',
'parsing', and 'tokenization' are used interchangeably in corporate slang.

Natural language processing, as of 2006, is the subject of continuous research and technological
improvement. Tokenization presents many challenges in extracting the necessary information
from documents for indexing to support quality searching. Tokenization for indexing involves
multiple technologies, the implementations of which are commonly kept as corporate secrets.

Challenges in Natural Language Processing

Word Boundary Ambiguity
Native English speakers may at first consider tokenization to be a straightforward task,
but this is not the case with designing a multilingual indexer. In digital form, the texts of
other languages such as Chinese, Japanese or Arabic represent a greater challenge, as
words are not clearly delineated by whitespace. The goal during tokenization is to
identify words for which users will search. Language-specific logic is employed to
properly identify the boundaries of words, which is often the rationale for designing a
parser for each language supported (or for groups of languages with similar boundary
markers and syntax).
Language Ambiguity
To assist with properly ranking matching documents, many search engines collect
additional information about each word, such as its language or lexical category (part of
speech). These techniques are language-dependent, as the syntax varies among
languages. Documents do not always clearly identify the language of the document or
represent it accurately. In tokenizing the document, some search engines attempt to
automatically identify the language of the document.
Diverse File Formats
In order to correctly identify which bytes of a document represent characters, the file
format must be correctly handled. Search engines which support multiple file formats
must be able to correctly open and access the document and be able to tokenize the
characters of the document.
Faulty Storage
The quality of the natural language data may not always be perfect. An unspecified
number of documents, particularly on the Internet, do not closely obey proper file protocol.
Binary characters may be mistakenly encoded into various parts of a document. Without
recognition of these characters and appropriate handling, the index quality or indexer
performance could degrade.

Tokenization

Unlike literate humans, computers do not understand the structure of a natural language
document and cannot automatically recognize words and sentences. To a computer, a document
is only a sequence of bytes. Computers do not 'know' that a space character separates words in a
document. Instead, humans must program the computer to identify what constitutes an individual
or distinct word, referred to as a token. Such a program is commonly called a tokenizer or parser
or lexer. Many search engines, as well as other natural language processing software, incorporate
specialized programs for parsing, such as YACC or Lex.

During tokenization, the parser identifies sequences of characters which represent words and
other elements, such as punctuation, which are represented by numeric codes, some of which are
non-printing control characters. The parser can also identify entities such as email addresses,
phone numbers, and URLs. When identifying each token, several characteristics may be stored,
such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category
(part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and
line number.
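
A drastically simplified tokenizer of the kind described here might look like the following
sketch (a regular-expression toy, not a production lexer; real systems rely on generated lexers
and language-specific rules):

    import re

    # Recognize a few token classes: e-mail addresses, URLs, plain words.
    TOKEN_PATTERN = re.compile(
        r"(?P<email>[\w.+-]+@[\w-]+\.[\w.]+)"
        r"|(?P<url>https?://\S+)"
        r"|(?P<word>\w+)")

    def tokenize(text):
        for position, match in enumerate(TOKEN_PATTERN.finditer(text)):
            value = match.group()
            yield {"token": value,
                   "type": match.lastgroup,        # which named group matched
                   "position": position,
                   "case": "upper" if value.isupper()
                           else "lower" if value.islower() else "mixed"}

    for tok in tokenize("Contact us at info@example.com or visit https://example.com today"):
        print(tok)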

Language Recognition

If the search engine supports multiple languages, a common initial step during tokenization is to
identify each document's language; many of the subsequent steps are language dependent (such
as stemming and part of speech tagging). Language recognition is the process by which a
computer program attempts to automatically identify, or categorize, the language of a document.
Other names for language recognition include language classification, language analysis,
language identification, and language tagging. Automated language recognition is the subject of
ongoing research in natural language processing. Finding which language the words belong to
may involve the use of a language recognition chart.
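
As a toy illustration of the idea, the sketch below guesses a language from a handful of
hand-picked stop words; real language identification uses far richer statistical models:

    # Guess a document's language by counting very common ("stop") words.
    STOPWORDS = {
        "english": {"the", "and", "of", "to", "is"},
        "french":  {"le", "la", "et", "de", "est"},
        "german":  {"der", "die", "und", "von", "ist"},
    }

    def guess_language(text):
        words = set(text.lower().split())
        scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
        return max(scores, key=scores.get)

    print(guess_language("the cat and the hat is here"))   # english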

Format Analysis

If the search engine supports multiple document formats, documents must be prepared for
tokenization. The challenge is that many document formats contain formatting information in
addition to textual content. For example, HTML documents contain HTML tags, which specify
formatting information such as new line starts, bold emphasis, and font size or style. If the
search engine were to ignore the difference between content and 'markup', extraneous
information would be included in the index, leading to poor search results. Format analysis is the
identification and handling of the formatting content embedded within documents which controls
the way the document is rendered on a computer screen or interpreted by a software program.

Format analysis is also referred to as structure analysis, format parsing, tag stripping, format
stripping, text normalization, text cleaning, and text preparation. The challenge of format
analysis is further complicated by the intricacies of various file formats. Certain file formats are
proprietary with very little information disclosed, while others are well documented. Common,
well-documented file formats that many search engines support include:

• Microsoft Word
• Microsoft Excel
• Microsoft PowerPoint
• IBM Lotus Notes
• HTML
• ASCII text files (a text document without any formatting)
• Adobe's Portable Document Format (PDF)
• PostScript (PS)
• LaTeX
• The UseNet archive (NNTP) and other deprecated bulletin board formats
• XML and derivatives like RSS
• SGML (this is more of a general protocol)
• Multimedia meta data formats like ID3

Options for dealing with various formats include using a publicly available commercial parsing
tool that is offered by the organization which developed, maintains, or owns the format, and
writing a custom parser.
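
For HTML, a minimal custom parser can separate textual content from markup using Python's
standard library; the sketch below only illustrates the idea, and production format analysis is
far more involved:

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collects the text content of an HTML document, discarding the markup."""
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            if data.strip():
                self.chunks.append(data.strip())
        def text(self):
            return " ".join(self.chunks)

    extractor = TextExtractor()
    extractor.feed("<html><body><h1>Search</h1><p>Engines index <b>text</b>.</p></body></html>")
    print(extractor.text())   # Search Engines index text .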

Some search engines support inspection of files that are stored in a compressed or encrypted file
format. When working with a compressed format, the indexer first decompresses the document;
this step may result in one or more files, each of which must be indexed separately (see the
sketch after the following list). Commonly supported compressed file formats include:

• ZIP - Zip file
• RAR - Archive file
• CAB - Microsoft Windows Cabinet file
• Gzip - Gzip file
• BZIP - Bzip file
• TAR, and TAR.GZ - Unix Gzip'ped archives
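
As noted above, a compressed archive is expanded before indexing. The sketch below uses Python's
standard zipfile module on a made-up archive name, and index_document stands in for whatever
indexing routine would actually be called:

    import zipfile

    def index_zip_archive(path, index_document):
        # Each file inside the archive is decompressed and indexed on its own.
        with zipfile.ZipFile(path) as archive:
            for name in archive.namelist():
                with archive.open(name) as member:
                    text = member.read().decode("utf-8", errors="replace")
                    index_document(name, text)

    # index_zip_archive("reports.zip", lambda name, text: print(name, len(text)))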

Format analysis can involve quality improvement methods to avoid including 'bad information'
in the index. Content authors can manipulate the formatting information to include additional
content. Examples of abusing document formatting for spamdexing:

• Including hundreds or thousands of words in a section which is hidden from view on the
computer screen, but visible to the indexer, by use of formatting (e.g. hidden "div" tag in
HTML, which may incorporate the use of CSS or Javascript to do so).
• Setting the foreground font color of words to the same as the background color, making
words hidden on the computer screen to a person viewing the document, but not hidden
to the indexer.

Section Recognition

Some search engines incorporate section recognition, the identification of major parts of a
document, prior to tokenization. Not all the documents in a corpus read like a well-written book,
divided into organized chapters and pages. Many documents on the web, such as newsletters and
corporate reports, contain erroneous content and side-sections which do not contain primary
material (that which the document is about). For example, many web pages display a side menu with
links to other web pages. Some file formats, like HTML or PDF, allow for content to be
displayed in columns. Even though the content is displayed, or rendered, in different areas of the
view, the raw markup content may store this information sequentially. Words that appear
sequentially in the raw source content are indexed sequentially, even though these sentences and
paragraphs are rendered in different parts of the computer screen. If search engines index this
content as if it were normal content, the quality of the index and search quality may be degraded
due to the mixed content and improper word proximity. Two primary problems are noted:

• Content in different sections is treated as related in the index, when in reality it is not
• Organizational 'side bar' content is included in the index, but the side bar content does not
contribute to the meaning of the document, and the index is filled with a poor
representation of its documents.

Section analysis may require the search engine to implement the rendering logic of each
document, essentially an abstract representation of the actual document, and then index the
representation instead. For example, some content on the Internet is rendered via Javascript. If
the search engine does not render the page and evaluate the Javascript within the page, it would
not 'see' this content in the same way and would index the document incorrectly. Given that some
search engines do not bother with rendering issues, many web page designers avoid displaying
content via Javascript or use the Noscript tag to ensure that the web page is indexed properly. At
the same time, this fact can also be exploited to cause the search engine indexer to 'see' different
content than the viewer.
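
A crude version of section recognition for HTML can be sketched by skipping text that appears
inside navigational elements; the tag names chosen below are an illustrative heuristic, not a
standard:

    from html.parser import HTMLParser

    BOILERPLATE_TAGS = {"nav", "aside", "footer", "script", "style"}

    class MainContentExtractor(HTMLParser):
        """Keeps text outside side bars, menus and scripts; drops the rest."""
        def __init__(self):
            super().__init__()
            self.skip_depth = 0
            self.chunks = []
        def handle_starttag(self, tag, attrs):
            if tag in BOILERPLATE_TAGS:
                self.skip_depth += 1
        def handle_endtag(self, tag):
            if tag in BOILERPLATE_TAGS and self.skip_depth:
                self.skip_depth -= 1
        def handle_data(self, data):
            if not self.skip_depth and data.strip():
                self.chunks.append(data.strip())

    parser = MainContentExtractor()
    parser.feed("<body><nav>Home | About</nav><p>Primary material of the page.</p></body>")
    print(" ".join(parser.chunks))   # Primary material of the page.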

Meta Tag Indexing

Specific documents often contain embedded meta information such as author, keywords,
description, and language. For HTML pages, the meta tag contains keywords which are also
included in the index. Earlier Internet search engine technology would only index the keywords
in the meta tags for the forward index; the full document would not be parsed. At that time full-
text indexing was not as well established, nor was the hardware able to support such technology.
The design of the HTML markup language initially included support for meta tags for the very
purpose of being properly and easily indexed, without requiring tokenization.
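
Extracting such meta keywords and descriptions is straightforward; the following sketch
(illustrative only) collects name/content pairs with Python's standard HTML parser:

    from html.parser import HTMLParser

    class MetaTagReader(HTMLParser):
        """Collects name/content pairs from <meta> tags in an HTML head."""
        def __init__(self):
            super().__init__()
            self.meta = {}
        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attrs = dict(attrs)
                if "name" in attrs and "content" in attrs:
                    self.meta[attrs["name"].lower()] = attrs["content"]

    reader = MetaTagReader()
    reader.feed('<head><meta name="keywords" content="search, index"></head>')
    print(reader.meta)   # {'keywords': 'search, index'}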

As the Internet grew through the 1990s, many brick-and-mortar corporations went 'online' and
established corporate websites. The keywords used to describe webpages (many of which were
corporate-oriented webpages similar to product brochures) changed from descriptive to
marketing-oriented keywords designed to drive sales by placing the webpage high in the search
results for specific search queries. The fact that these keywords were subjectively specified led
to spamdexing, which drove many search engines to adopt full-text indexing
technologies in the 1990s. Search engine designers and companies could only place so many
'marketing keywords' into the content of a webpage before draining it of all interesting and useful
information. Given that conflict of interest with the business goal of designing user-oriented
websites which were 'sticky', the customer lifetime value equation was changed to incorporate
more useful content into the website in hopes of retaining the visitor. In this sense, full-text
indexing was more objective and increased the quality of search engine results, as it was one
more step away from subjective control of search engine result placement, which in turn
furthered research of full-text indexing technologies.

In Desktop search, many solutions incorporate meta tags to provide a way for authors to further
customize how the search engine will index content from various files that is not evident from
the file content. Desktop search is more under the control of the user, while Internet search
engines must focus more on the full-text index.

Web Search Queries

A web search query is a query that a user enters into a web search engine to satisfy his or her
information needs. Web search queries are distinctive in that they are unstructured and often
ambiguous; they vary greatly from standard query languages which are governed by strict syntax
rules.

Types

There are three broad categories that cover most web search queries:

• Informational queries – Queries that cover a broad topic (e.g., colorado or trucks) for
which there may be thousands of relevant results.

• Navigational queries – Queries that seek a single website or web page of a single entity
(e.g., youtube or delta airlines).

• Transactional queries – Queries that reflect the intent of the user to perform a particular
action, like purchasing a car or downloading a screen saver.

Search engines often support a fourth type of query that is used far less frequently:

• Connectivity queries – Queries that report on the connectivity of the indexed web graph
(e.g., Which links point to this URL?, and How many pages are indexed from this
domain name?).

Characteristics

Most commercial web search engines do not disclose their search logs, so information about
what users are searching for on the Web is difficult to come by. Nevertheless, a 2001 study that
analyzed the queries from the Excite search engine revealed some interesting characteristics of
web search:

• The average length of a search query was 2.4 terms.
• About half of the users entered a single query while a little less than a third of users
entered three or more unique queries.
• Close to half of the users examined only the first one or two pages of results (10 results
per page).
• Less than 5% of users used advanced search features (e.g., Boolean operators like AND,
OR, and NOT).
• The top three most frequently used terms were and, of, and sex.

A study of the same Excite query logs revealed that 19% of the queries contained a geographic
term (e.g., place names, zip codes, geographic features, etc.).

A 2005 study of Yahoo's query logs revealed 33% of the queries from the same user were repeat
queries and that 87% of the time the user would click on the same result. This suggests that many
users use repeat queries to revisit or re-find information.

In addition, much research has shown that query term frequency distributions conform to the
power law, or long tail distribution curves. That is, a small portion of the terms observed in a
large query log (e.g. > 100 million queries) are used most often, while the remaining terms are
used less often individually. This example of the Pareto principle (or 80-20 rule) allows search
engines to employ optimization techniques such as index or database partitioning, caching and
pre-fetching.
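
Because a small set of queries accounts for a large share of traffic, even a modest result cache
is effective. The sketch below is an illustrative least-recently-used cache, not the design of
any actual engine:

    from collections import OrderedDict

    class QueryCache:
        """Keeps results of the most recently used queries; popular queries stay cached."""
        def __init__(self, capacity=1000):
            self.capacity = capacity
            self.entries = OrderedDict()
        def get(self, query):
            if query in self.entries:
                self.entries.move_to_end(query)     # mark as recently used
                return self.entries[query]
            return None
        def put(self, query, results):
            self.entries[query] = results
            self.entries.move_to_end(query)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)    # evict the least recently used query

    cache = QueryCache(capacity=2)
    cache.put("weather", ["result A"])
    print(cache.get("weather"))   # ['result A']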

3. New Features for Web Searching

The incredible development of Web resources and services has motivated many studies and has
led companies to invest in developing new search engines or adding new features and abilities
to their existing ones. By looking at the papers published in the major Web conferences,
journals and seminars, we can track several expected specifications and shifts in the future.
Ma (2004) from Microsoft Research Asia reported features of the next generation of search
engines at WISE04. The deep Web, with its structured information, is a potential resource that
search companies are trying to capture. Meanwhile, researchers have focused on Web page
structure to increase the quality of search. Microsoft has entered the competition in Web
searching by working on Web page blocks, the deep Web and mobile search. MSN's new
ranking model will be based on object-level rather than document-level ranking.

3.1 Page Structure Analysis: the first search engines concentrated on Web page contents.
AltaVista and other old search engines were made based on indexing the content of Web pages.
They built huge centralized indices and this is still a part of every popular search engine.
However, it was clear that the contents of a Web page could not be sufficient for capturing the
huge amount of information. In 1996-1997 Google was designed based on a novel idea that the
link structure of the Web is an important resource to improve the results of search engines.
Backlink analysis, related to the Hyperlink-Induced Topic Search (HITS) algorithm, was applied to
billions of Web pages. Google not only used this approach to capture the largest number of Web
pages but also established PageRank - the ranking system that improved the search results (Brin
& Page, 1998). After content-based indexing and link analysis the new area of study is page and
layout structures. HTML and XML are important in this approach. It is thought that Web page
layout is a good resource for improving search results. For example, the value of information
presented in <heading> tags can be more than information in <paragraph> tags. We can
also imagine that a link in the middle of a Web page is more important than a link in a footnote.
Web Graph algorithms such as HITS might be implemented to a sub-section of Web pages to
improve search result ranking models. The automatic thesaurus construction method is a page
structure method, which extracts term relationships from the link structure of Websites. It is able
to identify new terms and reflect the latest relationship between terms as the Web evolves.
Experimental results have shown that the constructed thesaurus, when applied to query
expansion, outperforms a traditional association thesaurus (Chen et al., 2003).
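
The core of the PageRank idea can be written down in a few lines. The sketch below runs a
simplified power iteration on a made-up three-page link graph; real systems handle dangling
pages, personalization and scale very differently:

    # links: page -> list of pages it links to
    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outgoing in links.items():
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            rank = new_rank
        return rank

    print(pagerank(links))   # page C collects the most rank in this toy graph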

3.2 Deep Search: current search engines can only crawl and capture a small part of the Web,
which is called the "visible" or "indexable" Web. A huge amount of scientific and other valuable
information is behind closed doors. It is believed that the size of invisible or deep Web is several
times bigger than the size of the surface Web. Different databases, library catalogues, digital
books and journals, patents, research reports and governmental archives are examples of
resources that usually cannot be crawled and indexed by current search engines. Web content
providers are moving toward Semantic Web by applying technologies such as XML and RDF
(Resource Description Framework) in order to create more structured Web resources. New
search engines are trying to find suitable methods for penetrating the database barriers.
BrightPlanet's "differencing" algorithm is designed to transfer queries across multiple deep Web
resources at once, aggregating the results and letting users compare changes to those results over
time. Google, MSN and many other popular search engines are competing to find solution for the
invisible Web. Recently, Yahoo has developed a paid service for searching the deep Web that is
called the Content Aggregation Program (CAP). The method is secret but the company does
acknowledge that its Content Aggregation Program will give paying customers a more direct
pipeline into its search database (Wright, 2004).

3.3 Structured Data: the World Wide Web is considered a huge collection of unstructured data
presented in billions of Web pages. As we already mentioned, the amazing size and valuable
resources of the deep Web have affected the industry of search engines and the next generation
of search engines are supposed to be able to investigate deep Web information. As a part of both
surface and deep Web, structured data resources are very important and valuable. In many cases,
data is stored in tables and separated files. The concept of structured searching is different from
the way search engines currently operate. Most of search engines just save a copy of Web pages
in their repository and then make several indexes from the content of these pages. Most
documents available on the Web are unstructured resources. So, search engines can just judge
them based on the keyword occurrence. As Rein (1997) says a search engine supporting XML-
based queries can be programmed to search structured resources. Such an engine would rank
words based on their location in a document, and their relation to each other, rather than just the
number of times they appear. Traditional information retrieval and database management
techniques have been used to extract data from different tables and resources and combine them
to respond to users' queries. Current search engines cannot resolve this problem efficiently, but in
the future an intelligent search engine will be able to distinguish different structured resources
and combine their data to find a high quality response for a complicated query.
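
To hint at what searching structured resources looks like, the sketch below uses Python's
standard ElementTree on a made-up XML fragment; the query is restricted to a specific element
rather than matching keywords anywhere in the text:

    import xml.etree.ElementTree as ET

    xml_data = """
    <catalog>
      <book><title>Web Search Engines</title><year>2004</year></book>
      <book><title>Databases</title><year>1999</year></book>
    </catalog>"""

    root = ET.fromstring(xml_data)
    # Structured query: only <title> elements are searched, not the whole document text.
    matches = [b.findtext("title") for b in root.findall("book")
               if "search" in b.findtext("title").lower()]
    print(matches)   # ['Web Search Engines']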

3.4 Recommending Group Ranking: while many search engines are able to crawl and index
billions of Web pages, sorting the results of each query is still an issue. Page ranking algorithms
have been utilized to present a better ranked result. The idea is simple: more relevant pages must
take a higher rank. Basic ranking algorithms are based on the occurrence rate of index terms in
each page. Simply, if the search term is mathematics then a page that has the word mathematics
20 times must be ranked before a page which has mathematics 10 times. As we already
mentioned, this alone is not sufficient; recently link information and page structure
information have been used to improve rank quality. These methods are automatic and are done
by machines. However, it is believed that the best judgement about the importance and quality of
Web pages is acquired when they are reviewed and recommended by human experts. Discussion
thread recommendation or peer reviews are expected to be used by search engines to improve
their results. In the future, search results will be ranked not only based on the automatic ranking
algorithms but also by using the ideas of scholars and scientific recommending groups.
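
The occurrence-count ranking described above amounts to the following toy term-frequency
scorer (before any link or page structure information is added):

    def rank_by_term_frequency(query_term, documents):
        # Score each document by how many times the query term occurs in it.
        scores = {doc_id: text.lower().split().count(query_term.lower())
                  for doc_id, text in documents.items()}
        return sorted(scores, key=scores.get, reverse=True)

    docs = {"page1": "mathematics " * 20, "page2": "mathematics " * 10}
    print(rank_by_term_frequency("mathematics", docs))   # ['page1', 'page2']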

3.5 Federated Search: also known as parallel search, metasearch or broadcast search, it
aggregates multiple channels of information into a single searchable point. Federated search
engines are different from metasearch engines. Metasearch engine services are free for users,
while federated search engines are sold to libraries and other interested information service
providers. Federated search mostly covers subscription-based databases that are usually part of
the invisible Web and ignored by Web-oriented metasearch engines. Usually there is no overlap
between databases covered by federated search engines. Federated searching has several
advantages for users. It reduces the time that is needed for searching several databases and also
users do not need to know how to search through different interfaces (Fryer, 2004). One of the
important reasons for the growing interest in federated searching is the complexity of the online
materials environment such as the increasing number of electronic journals and online full-text
databases. Webster (2004) maintains that although federated searching tools offer some real
immediate advantages today, they cannot overcome the underlying problem of growing
complexity and lack of uniformity. We need an open, interoperable and uniform e-content
environment to fully provide the interconnected, accessible environment that librarians are
seeking from metasearching. One of the disadvantages of federated search engines is that they
cannot be used for sophisticated search commands and queries, and are limited to basic Boolean
search.
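
In essence, a federated search tool fans a query out to several sources and merges the answers.
The sketch below uses made-up backend functions standing in for subscription databases; it only
illustrates the pattern:

    from concurrent.futures import ThreadPoolExecutor

    def search_database_a(query):
        return [f"A: article about {query}"]          # placeholder backend

    def search_database_b(query):
        return [f"B: report on {query}"]              # placeholder backend

    def federated_search(query, backends):
        # Query every backend in parallel and merge the result lists.
        with ThreadPoolExecutor() as pool:
            result_lists = list(pool.map(lambda backend: backend(query), backends))
        return [item for results in result_lists for item in results]

    print(federated_search("digital libraries", [search_database_a, search_database_b]))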

3.6 Mobile Search: the number of people who have a cell phone seems to be more than the
number of people who have a PC. Also many other mobile technologies such as GPS devices are
used widely. Search engine companies have focused on the big market of mobile phones and
wireless telecommunication devices. In the future everyone will have access to the Web
information and services through his/her wireless phone without necessarily having a computer.
Recently, Yahoo developed its mobile Web search system and mobile phone users can have
access to Yahoo Local, Image and Web search, as well as quick links to stocks, sports scores and
weather for a fee. The platform also includes a modified Yahoo Instant Messaging client and
Yahoo Mobile Games (Singer, 2004).

4. Conclusion

The World Wide Web with its short history has experienced significant changes. While the first
search engines were established based on the traditional database and information retrieval
methods, many other algorithms and methods have since been added to them to improve their
results. The gigantic size of the Web and vast variety of the users' needs and interests as well as
the big potential of the Web as a commercial market have brought about many changes and a
great demand for better search engines. In this article, we reviewed the history of Web search
tools and techniques and mentioned some big shifts in this field. Google utilized Web graph or
link structure of the Web to make one of the most comprehensive and reliable search engines.
Local services and the personalization of search tools are two major ideas that have been studied
for several years.

By looking at papers published in popular conferences on Web and information management, we
see not only a considerable increase in the quantity of Web search research papers since 2001,
but also that Web search and information retrieval topics such as ranking, filtering
and query formulation are still hot topics. This reveals that search engines still present many
unsolved problems and interesting research areas.

We mentioned several important issues for the future of search engines. The next generations of
search tools are expected to be able to extract structured data to offer high quality responses to
users' questions. The structure of Web pages seems to be a good resource with which search
engines can improve their results. As well, there will be a shift towards providing specialised
search facilities for the scholarly part of the Web that encompasses a considerable part of the
deep Web. With the beta version of Google Scholar (http://scholar.google.com) released in
November 2004, other major players in the search engine industry are expected to invest in
rivals for this new service. Search engines are trying to incorporate the recommendations of
special-interest groups into their search techniques. Limited funds have forced libraries and
other major information user organizations to share their online resources. Federated search is
an example of future cooperative search and information retrieval facilities. Finally, we
addressed the efforts of
search engine companies in breaking their borders through making search possible for mobile
phones and other wireless information and communication devices.

The World Wide Web will be more usable in the future. The Web's security and privacy are two
important issues for the coming years. The Web search industry is opening new horizons for the
global village. Meanwhile, many issues remain unsolved or incomplete. Information
extraction, ambiguity in addresses and names, personalization and multimedia searching, among
others, are major issues for the next few years.

References

• Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine.
Proceedings of the 7th International WWW Conference, Brisbane, Australia, 107-117.
• Chen, Z., Liu, S., Wenyin, L., Pu, G. & Ma, W. (2003). Building a web thesaurus from
web link structure. Proceedings of the 26th annual international ACM SIGIR conference
on Research and development in information retrieval. Toronto, 48–55.
• Fryer, D. (2004). Federated search engines. Online, 28(2), 16-19.
• Gromov, G. R. (2002). History of Internet and WWW: the roads and crossroads of
Internet history. Retrieved December 5, 2004, from
http://www.netvalley.com/intvalstat.html
• Holzschlag, M. E. (2001). How specialization limited the Web. Retrieved December 4,
2004, from http://www.webtechniques.com/archives/2001/09/desi/
• Jansen, B. J., Spink, A. & Pedersen, J. (2003). An analysis of multimedia searching on
AltaVista. Proceedings of the 5th ACM SIGMM international workshop on Multimedia
information retrieval, 186-192.
• Kherfi, M. L., Ziou, D. & Bernardi, A. (2004). Image retrieval from the World Wide
Web: issues, techniques and systems. ACM Computing Surveys, 36(14), 35-67.
• Liu, F., Yu, C. & Meng, W. (2002). Personalized web search by mapping user queries to
categories. Proceedings of the eleventh international conference on Information and
knowledge management CIKM’02, McLean, Virginia, USA, 558-565.
• Ma, W., Zhang, H. and Hon, H. (2004). Towards Next Generation Web Information
Retrieval. Web Information Systems – WISE04: Proceedings of the fifth international
Conference on Web Information Systems Engineering, Brisbane, Australia, 17.
• Perez, C. (2004). Google offers new local search service. Retrieved December 2, 2004,
from http://www.infoworld.com/article/04/03/17/HNgooglelocal_1.html
• Poulter, A. (1997). The design of World Wide Web search engines: a critical review,
Program, 31(2), 131-145.
• Rein, L. (1997). XML Ushers in Structured Web Searches. Retrieved November 20,
2004, from http://www.wired.com/news/technology/0,1282,7751,00.html
• Schwartz, C. (1998). Web search engines. Journal of the American Society for
Information Science, 49(11), 973-982.
• Singer, M. (2004, October 27). Yahoo sends search aloft. Retrieved November 28, 2004,
from http://www.internetnews.com/bus-news/article.php/3427831
• Sullivan, D. (2000, June 2). Survey reveals search habits. The Search Engine Report.
Retrieved December 1, 2004, from http://www.searchenginewatch.com/sereport/00/06-
realnames.html
• Wall, A. (2004). History of search engines & web history. Retrieved December 3, 2004,
from http://www.search-marketing.info/search-engine-history/
• Watters, C. & Amoudi, G. (2003). GeoSearcher: location-based ranking of search engine
results. Journal of the American Society for Information Science and Technology, 54(2),
140-151.
• Webster, P. (2004). Metasearching in an academic environment. Online, 28(2), 20-23.