Search Engine Technology
by Xiaopeng Wang
What are search engines?
A search engine is a software-based technology used in conjunction
with the Internet to assist users in finding information stored
on the World Wide Web. Search engines are automated computer programs
designed to facilitate searching through Web pages. When Internet
users input keywords representing the types of information they
are looking for, the search engine will compile a record of Web
locations matching the keywords. [1] Users can then
visit those locations via the hyperlinks the search engine provides
and check these Web sites for the sought-after information. Three
of the most widely used search engines are Google, Yahoo!, and
Ask Jeeves. [2]
How do search engines work?
In spite of their diverse designs and functions, most search engines
follow a similar operating process, including word matching, data
collecting, and indexing.
Word matching is the core of search engine development. [3]
This feature exploits computers’ capacity for finding the desired
text terms quickly and accurately among thousands of Web pages,
a task that far exceeds the capabilities of human
beings.
To produce word matches, search engines must have a large collection
of data or documents to look through, just as librarians need
to work in a space with many books. Given the vast information
sources connected to the Internet, search engines must make use
of automated shortcuts; that is, bots, spiders, or crawlers are
used to comb the Internet for categories of information that enable
them to build lists of relevant documents found on the Web. [4]
Instead of going out onto the Internet itself upon receiving a
request, search engines search through the more limited databases
they have compiled.
Spider programs reside on local computers and send out requests
for information to Web addresses at other computers and servers,
similar to the process one uses when visiting Web sites via one’s
own computer. The spiders usually begin with popular Web pages,
following every hyperlink of those pages to other pages. By this
means, the spiders can systematically and rapidly travel across
the Web. [5] Take Google as an example.
At its peak performance, Google’s early system could crawl more than
100 Web pages and generate about 600 Kbytes of data per second using
only four spiders. [6]
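The link-following behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the miniature "Web" is a hypothetical in-memory dictionary (TOY_WEB), the URLs are made up, and the function names crawl and toy_fetch_links are invented for this example. A real spider would fetch pages over the network and would need politeness rules that are omitted here.

```python
from collections import deque

# Hypothetical miniature "Web": each URL maps to the hyperlinks on that page.
TOY_WEB = {
    "http://popular.example/": ["http://a.example/", "http://b.example/"],
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://popular.example/"],
    "http://c.example/": [],
}


def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return TOY_WEB.get(url, [])


def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first from the seeds, following each link once."""
    queue = deque(seed_urls)   # pages waiting to be visited
    seen = set(seed_urls)      # pages already queued, to avoid loops
    visited = []               # pages in the order they were crawled
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl(["http://popular.example/"], toy_fetch_links)
print(order)
```

Starting from one popular page, the sketch reaches every page in the toy graph exactly once, even though two of the pages link back to pages that were already seen.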
As they travel, spiders collect at minimum a Web site’s
title, URL, and a brief description. Some search engines send
their spiders to scan the entire text of documents to create full-text
indexes, including all significant words, hyperlinks and metadata.
[7] Google even stores full Web pages as “cached”
pages, through which users can retrieve those Web pages even when
the originals no longer exist.
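As a rough illustration of the minimum record a spider might collect from one page, the following sketch uses Python's standard html.parser module. The class name PageSummaryParser, the function summarize, the sample page, and its URL are all hypothetical; the article does not say how any particular engine parses pages.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Collects the title, named meta tags, and hyperlinks of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            # Metadata is not displayed in the browser but is visible here.
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(url, html):
    """Return the URL, title, description, and outgoing links of a page."""
    parser = PageSummaryParser()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "links": parser.links,
    }


# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """<html><head>
<title>Wolf Spiders</title>
<meta name="description" content="Notes on wolf spiders.">
</head><body><a href="http://a.example/">more</a></body></html>"""

record = summarize("http://spiders.example/", SAMPLE_HTML)
print(record)
```

The links gathered this way are what a spider would follow next, while the title and description are what a results page would typically show.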
Search engines save the collected information in an organized
way so that it can be used efficiently. The process of organizing
the information is called indexing. Among the ways of building
an index, the hash table is one of the most effective. A hash table
applies a formula that attaches a numerical value to each word,
and that value determines where the word’s entry is stored, so the
entry can be found quickly without scanning the whole index.
Separately, every search engine has a distinctive formula for
weighting the words in its index, which is one of the reasons that
a search for the same word on different search engines will
generate different lists. [8]
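A minimal sketch of such an index, assuming Python's built-in dict (which is a hash table) as the underlying structure, might look like the following. The names tokenize, build_index, and search, the sample documents, and the scoring rule are assumptions made for illustration: matches are ordered by a simple word count, which merely stands in for whatever weighting formula a real engine applies.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Return ids of documents containing every query word, best first."""
    words = tokenize(query)
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)
    # Illustrative weighting only: total occurrences of the query words.
    return sorted(matches, key=lambda d: (-sum(p[d] for p in postings), d))


# Hypothetical documents, used only for illustration.
DOCS = {
    "doc1": "Spiders crawl the Web and follow hyperlinks.",
    "doc2": "A search engine builds an index of Web pages.",
    "doc3": "The index maps each word to the Web pages, and the index is searched.",
}

index = build_index(DOCS)
print(search(index, "index web"))
```

Looking up a word costs roughly the same no matter how many documents were indexed, which is why the engine can answer from its own database instead of rereading the Web for every request.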
Generally speaking, a good search engine will host a rich collection
of data as well as a reasonable and effective indexing system.
Figure 1. Spiders take a Web page's content, create key words, and
store the information in a database. Source: Howstuffworks.com.
Search engines’ background and proponents
In 1990, Alan Emtage, a student at McGill University in Montreal,
created an indexing search tool called Archie for the File Transfer
Protocol (FTP) system. One year later, Mark McCahill introduced
Gopher, and in 1993 the University of Nevada System Computing Services
group developed the Veronica search service for Gopher. In
the same year, with the advent of the Mosaic Web browser,
Matthew Gray launched the first Web spider, called World Wide
Web Wanderer, to capture Web information. [9]
In April 1994, Jerry Yang and David Filo of Stanford University
started Yahoo! to provide a better organized collection of Web
pages. Meanwhile, at the University of Washington, Brian Pinkerton
released his WebCrawler project, which made full-text searches possible
on the Internet. Dr. Michael Mauldin introduced Lycos, named after
Lycosidae, the wolf spider family, with a catalog of 54,000 documents
in July 1994. [10]
More and more search engines appeared in the second half of the 1990s.
For instance, Excite (1995), AltaVista (1995), Inktomi (1996),
and Ask Jeeves were all popular among Internet users until late
1998, when Lawrence Page and Sergey Brin improved indexing technology
and launched a much more successful search engine called Google.
[11]
What problems are search engines designed to solve?
The Internet is considered to be an ocean of information and search
engines are the tools for identifying quickly what is there. Rather
than a paucity of data, the critical problem of the Web is information
overload. The Web is so overwhelming that ordinary people have
difficulties finding what they really want on it. Search engines
help users to locate desired information and lead them to the
places where they can retrieve it. Originally, search engines
were designed to facilitate academic research. With the commercialization
of the Internet, search engines have come to represent the single
most important source of new Web users. [12] By 2005,
about 84 percent of adult Internet users, or roughly 108 million Americans,
had used search engines to find information on the Web, and about
38 million Americans were using a search engine every day. These
users send out about 4 billion queries every month. [13]
Search engines are used for various purposes, both important and
trivial. Web users seek legal documents and medical information
using search engines, but they also look for cooking recipes and
tabloid stories. [14] Moreover, research indicates that
80 percent of online shoppers begin at a search engine for product
information while planning a purchase. [15] The Web has
become the greatest library in the world where students and researchers
use search engines to dig up any topic on earth. [16]
How search engines interconnect with other media
Devices connected to the Internet can also take advantage of search
engine services. These include Personal Digital Assistants (PDAs),
3G cellular phones, and broadband satellite services. For instance,
the University of Maryland developed Rover, a system that allows users
to search for desired text, audio or video information through
their PDA handsets. [17] Google has also introduced wireless
services for cell phone and PDA users.
Figure 2. Google introduces wireless search engine services. Source:
Google.com
Search engines are also used in database management, including
the FTP system, peer-to-peer networks, library catalogs, e-mail
inboxes and even personal computers. Indeed, search engines have
become a necessity because locating and retrieving desired data
is a problem common to all database systems.
There is a tendency for Web search engines to be combined with
the desktop operating systems of PCs. In 2004, Microsoft declared
that its new generation of operating system, code-named Longhorn,
would offer Web searching functions. [18] Around the same time,
Google released its desktop search engine, which enables users
to search a hard disk, the Web, or both
through a Web browser. [19]
Constraints
No matter how sophisticated the design of a search engine is,
or how good a user is at taking advantage of search engines, there
are valuable resources that search engines cannot find.
Significant information remains unconnected to the Web, and great
quantities of information are never picked up by search engines
because no one has created a Web page to contain them. [20] Even
among the information that does exist on the Web, whose size is
approximately 500 times the content that search engines have found
to date, a great amount remains invisible to search engines. [21]
There are several reasons. Search engines sometimes do not capture
Web pages that lie too deep within a site, and they sometimes
do not include content in non-HTML formats, such as video, audio,
or other files. In addition, some Web sites require passwords,
so spiders may not be able to collect information from them. [22]
Hence, although the World Wide Web is the greatest library ever invented,
it still does not represent the whole of human knowledge. A large
amount of knowledge remains offline.
The quality of search results is another concern. People want
the most relevant results to be listed at the top of the results
pages when performing a search. To achieve this goal, search engines
have to apply a sound technique for determining the
ranking of search results. Google relies on PageRank, an algorithm
based on the assumption that important pages are the ones that
receive many references from other pages. This is one reason Google is
frequently perceived to be the best search engine in the world.
[23] Nevertheless, from the users’ perspective,
the pages with the greatest number of links may not offer the highest
quality information.
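The idea behind PageRank can be illustrated with a simplified power-iteration sketch. This follows the commonly described formulation and is not Google's production algorithm. The link graph, the function name pagerank, the damping value of 0.85 (a commonly used value that does not come from this article), and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from a {page: [linked pages]} mapping."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal importance
    for _ in range(iterations):
        # Every page keeps a small baseline share of importance.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance evenly to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its share to all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: pages A, B, and D all link to C; C links to A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, C ranks highest because it receives the most references. A ranks second even though only one page links to it, because that one link comes from the important page C. The score therefore measures how well referenced a page is, not the quality of what it says.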
Economic factors have an extraordinary impact on the ranking of
search results. Search engines usually identify advertisements
by labeling them as “sponsored” in the results. However,
a few companies and some Web sites now use the tactic of editing
metadata to manipulate the search results. By embedding popular
keywords, such as “MP3” and “Britney Spears,”
into their metadata [24], these Web sites have found
they can increase their chances of appearing in the search results.
A different approach is to pay a small fee to guarantee that one’s
sites will be included in the search results. [25] A
couple of well-known search engines, Yahoo! and Ask Jeeves, employ
this “pay for play” strategy. Under such circumstances,
users sometimes cannot distinguish advertisements from real results.
Copyright and privacy are other critical issues. Search engine
providers sometimes link the search words entered by users to
specific advertisements and sponsored sites. Also, some trademark
holders have complained that searches for a specific trademarked
term, such as “Playboy,” return links to competitors.
[27] People have also begun to worry about whether the introduction
of desktop search engines will threaten the security of the information
on their personal computers’ hard drives.
Search Engine Applications
Google (www.google.com) Launched in 1998, Google has become the
favorite search engine for the majority of Web users. Its success
is based on its large database, which holds about 8 billion records,
and its fast and effective ranking algorithm. Google additionally
provides searches of cached pages, newsgroups, images, PDF and
other non-HTML files. [28] Google also offers technical
support to other Web search engines, including AOL, Netscape,
and EarthLink. [30]
Ask Jeeves (www.ask.com) Ask Jeeves started in 1996. It was the
first search engine to use natural language searching, which remains
its distinguishing feature. Ask Jeeves also feeds other search engines
such as About.com and Mama.com. [31]
AltaVista (www.altavista.com) AltaVista began in 1995. It claimed
a number of firsts: the first full-text index, the first multiple-language
searching capability, the first multi-language translation, and
the first spell-checking function. It was also one of the first to
search images and non-text documents. Although it no longer feeds
any other Web sites, AltaVista has been feeding results to Yahoo! since
Yahoo! acquired the company along with Overture.com.
Yahoo! (www.yahoo.com) Initiated in 1994, Yahoo! was the first
big search directory, and it used to receive search results from Google
and other search feeders. Since the purchase of AltaVista, Yahoo!
has been a direct competitor in search engine development.
[32]
Some other important search engines include AOL.com, MSN.com,
InfoSpace.com, Netscape.com, EarthLink.com, Hotbot.com and Teoma.
More personalized and specialized search engines are on the horizon.
There is no doubt that the search engine is a promising business.
Microsoft’s involvement, Google’s great success on
NASDAQ, and international consolidation have been heating up the
competition.
Figure 3. Four major search engines: Google, Yahoo!, AltaVista,
and Ask Jeeves. Edited by Xiaopeng Wang
Notes:
[1] Peter Kent, Search engine optimization for dummies, Indianapolis,
IN: Wiley, 2004, p.10.
[2] Greg R. Notess, “Internet search engine update,” Online,
Jan/Feb. 2005, 29(1), p.12.
[3] Ned L. Fielden & Lucy Kuntz, Search engines handbook,
Jefferson, NC: McFarland & Company, 2002, p.14.
[4] Curt Franklin, “How Internet search engines work,”
Howstuffworks.com, http://computer.howstuffworks.com/search-engine.htm,
visited on February 20, 2005.
[5] Ned L. Fielden & Lucy Kuntz, Search engines handbook, p.16.
[6] Curt Franklin, “How Internet search engines work,”
Howstuffworks.com.
[7] Peter Kent, Search engine optimization for dummies, p.82.
[8] Curt Franklin, “How Internet search engines work,”
Howstuffworks.com.
[9] Wes Sonnenreich, A history of search engines.
[10] Wes Sonnenreich, A history of search engines.
[11] “How Google works,” Economist, September
18, 2004, Vol. 372(8393), p.32-35.
[12] Peter Kent, Search engine optimization for dummies, p.13.
[13] Deborah Fallows, Search engine users, Washington, D.C.:
Pew Internet & American Life Project, 2005.
[14] Deborah Fallows, Search engine users.
[15] Peter Kent, Search engine optimization for dummies, p.14.
[16] Angelo Fernando, “Google intelligence! Sure, search engines
deliver, but what about the off-line world?” Communication
World, January/February 2005, p.12-13.
[17] Javed Mostafa, “Seeking better Web searches,”
Scientific American, February 2005, Vol. 292(2), p.66-68.
[18] John Markoff, “Microsoft unveils its Internet search engines,
quietly,” The New York Times, November 11, 2004.
[19] Javed Mostafa, “Seeking better Web searches.”
[20] Angelo Fernando, “Google intelligence!”
[21] Javed Mostafa, “Seeking better Web searches.”
[22] Randolph Hock, The extreme searcher’s Internet
book: A guide for the serious searcher, Medford, NJ: CyberAge
Books, 2004, p.21.
[23] “How Google works,” Economist.
[24] Metadata: The portion of the HTML coding for a Web page that allows
the person creating the page to enter text describing the content
of the page. The content of metadata is not shown on the page
itself when the page is viewed in a browser window. However, spider
programs usually pick up metadata when scanning Web pages.
[25] Tom Spring, “Search tangle,” News & Trends,
www.pcworld.com, August 2004.
[26] Catherine P. Taylor, “The engine that could,” MediaWeek,
March 15, 2004, Vol.14(1), p.18-22.
[27] George H. Pike, “All Google, All the time,” Information
Today, February 2005, p.15.
[28] Javed Mostafa, “Seeking better Web searches.”
[29] Randolph Hock, The extreme searcher’s Internet
book, p.91.
[30] Peter Kent, Search engine optimization for dummies, p.17.
[31] Ned L. Fielden & Lucy Kuntz, Search engines handbook, p.78.
[32] Peter Kent, Search engine optimization for dummies, p.30.
