Ohio University School of Telecommunications, ITS

Search Engines

Search Engine Technology

by Xiaopeng Wang

What are search engines?

A search engine is a software-based technology used in conjunction with the Internet to help users find information stored on the World Wide Web. Search engines are automated computer programs designed to facilitate searching through Web pages. When Internet users enter keywords representing the information they are looking for, the search engine compiles a list of Web locations matching the keywords. [1] Users can then visit those locations via hyperlinks provided by the search engine and check the Web sites for the sought-after information. Three of the most widely used search engines are Google, Yahoo!, and Ask Jeeves. [2]

How do search engines work?

In spite of their diverse designs and functions, most search engines follow a similar operating process, including word matching, data collecting, and indexing.

Word matching is the core of search engine development. [3] It exploits a computer’s capacity to find the desired terms quickly and accurately among thousands of Web pages, a task far beyond the capabilities of human beings.
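
Word matching can be illustrated with a minimal Python sketch. It is not how any particular search engine is implemented; the function names, the sample pages, and the rule that a page must contain every query word are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match_documents(query, documents):
    """Return the URLs of documents that contain every word of the query."""
    wanted = set(tokenize(query))
    return [url for url, text in documents.items()
            if wanted <= set(tokenize(text))]

# Hypothetical miniature collection (URL -> page text), for illustration only.
PAGES = {
    "http://example.com/a": "Wolf spiders hunt at night.",
    "http://example.com/b": "Web spiders follow hyperlinks.",
    "http://example.com/c": "Libraries hold many books.",
}

print(match_documents("web spiders", PAGES))
```

Scanning every page for every query, as this sketch does, would be far too slow at Web scale, which is why search engines build an index in advance.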

To produce word matches, search engines must have a large collection of data or documents to look through, just as librarians need to work in a space with many books. Given the vast information sources connected to the Internet, search engines must rely on automation: programs called bots, spiders, or crawlers comb the Internet and gather the information needed to build lists of documents found on the Web. [4] Instead of going out onto the Internet upon receiving each request, search engines search through the more limited databases they have compiled.

Spider programs reside on local computers and send requests for information to Web addresses on other computers and servers, much as a person’s own computer does when visiting Web sites. The spiders usually begin with popular Web pages and follow every hyperlink on those pages to other pages. In this way, the spiders can travel systematically and rapidly across the Web. [5] Take Google as an example. At peak performance, Google’s early system could crawl more than 100 Web pages and generate about 600 Kbytes of data per second using only four spiders. [6]
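
The link-following behavior of a spider can be sketched as a breadth-first traversal. The Python example below is a simplified illustration, not the design of any real crawler: the in-memory `FAKE_WEB` dictionary and the `fetch` helper are hypothetical stand-ins for real HTTP requests, so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical in-memory "Web" (URL -> HTML), for illustration only.
FAKE_WEB = {
    "http://example.com/": '<a href="/news">News</a> <a href="/about">About</a>',
    "http://example.com/news": '<a href="/">Home</a> <a href="/news/today">Today</a>',
    "http://example.com/about": "<p>No links here.</p>",
    "http://example.com/news/today": '<a href="/news">Back</a>',
}

def fetch(url):
    """Stand-in for an HTTP request; returns None for unknown pages."""
    return FAKE_WEB.get(url)

def crawl(seeds, max_pages=100):
    """Visit the seed pages first, then every page they link to, and so on."""
    queue = deque(seeds)
    seen = set(seeds)        # URLs already queued, so no page is fetched twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited

print(crawl(["http://example.com/"]))
```

A production spider would also need politeness delays, robots.txt handling, and error recovery, all of which are omitted here.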

As they travel, spiders collect at minimum a Web site’s title, URL, and a brief description. Some search engines have their spiders scan the entire text of documents to create full-text indexes that include all significant words, hyperlinks, and metadata. [7] Google even stores full Web pages as “cached” pages, through which users can retrieve those pages even when the originals no longer exist.
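
As a rough illustration of the minimum record a spider might keep, the following Python sketch pulls the title, description, and keyword metadata out of a page. The class name, the field names, and the sample page are assumptions; real spiders collect considerably more.

```python
from html.parser import HTMLParser

class PageSummary(HTMLParser):
    """Records the <title> text and the named <meta> tags of a page."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name:
            self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def summarize(url, html):
    """Build the record stored for one page: URL, title, description, keywords."""
    parser = PageSummary()
    parser.feed(html)
    keywords = parser.meta.get("keywords", "")
    return {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
    }

# Hypothetical sample page, for illustration only.
SAMPLE = """<html><head>
<title>Wolf Spiders</title>
<meta name="description" content="Facts about wolf spiders.">
<meta name="keywords" content="spiders, wolf spider, arachnids">
</head><body><p>Body text.</p></body></html>"""

print(summarize("http://example.com/spiders", SAMPLE))
```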

Search engines store the collected information in an organized way so that it can be used efficiently. The process of organizing the information is called indexing. Among the methods used for building an index, the hash table is one of the most effective. A hash table applies a formula that attaches a numerical value to each word; that value determines where the word’s entry is stored in the index, so the entry can be found quickly. Separately, every search engine has a distinctive formula for weighting words and ranking the documents in which they are found, which is one of the reasons that a search for the same word on different search engines will generate different lists. [8] Generally speaking, a good search engine combines a rich collection of data with a reasonable and effective indexing system.
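
A small Python sketch can make the idea of a hash-based index concrete. Python’s built-in `dict` is itself a hash table, so it stands in here for the index structure. Ranking documents by raw word counts is an assumption chosen for simplicity; it is not the formula of any actual search engine, and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to {url: number of occurrences in that document}."""
    index = defaultdict(dict)
    for url, text in documents.items():
        for word in tokenize(text):
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Return documents containing every query word, highest count first."""
    words = tokenize(query)
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    common = set(postings[0]).intersection(*postings[1:])
    scores = {url: sum(p[url] for p in postings) for url in common}
    # Sort by descending score; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical miniature collection, for illustration only.
DOCS = {
    "http://example.com/a": "spider spider web",
    "http://example.com/b": "web crawler and spider",
    "http://example.com/c": "library books",
}

INDEX = build_index(DOCS)
print(search(INDEX, "web spider"))
```

Because the index is built once in advance, answering a query only requires looking up each query word rather than rereading every document.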

Figure 1. Spiders take a Web page's content, create key words and store information into a database. Source: Howstuffworks.com.

Search engines’ background and proponents

In 1990, Alan Emtage, a student at McGill University in Montreal, created Archie, an indexing search tool for the File Transfer Protocol (FTP) system. One year later, Mark McCahill introduced Gopher, and in 1993 the University of Nevada System Computing Services group developed the Veronica search service for Gopher. In the same year, with the advent of the Mosaic Web browser, Matthew Gray launched the first Web spider, called World Wide Web Wanderer, to capture Web information. [9]

In April 1994, Jerry Yang and David Filo of Stanford University started Yahoo! to provide a better organized collection of Web pages. Meanwhile, at the University of Washington, Brian Pinkerton released his WebCrawler project, which made full-text searches possible on the Internet. In July 1994, Dr. Michael Mauldin introduced Lycos, whose name derives from the Latin name for the wolf spider, with a catalog of 54,000 documents. [10]

More search engines appeared in the mid-to-late 1990s. For instance, Excite (1995), AltaVista (1995), Inktomi (1996), and Ask Jeeves were all popular among Internet users until late 1998, when Lawrence Page and Sergey Brin improved indexing technology and launched a much more successful search engine called Google. [11]

What problems are search engines designed to solve?

The Internet is often described as an ocean of information, and search engines are the tools for quickly identifying what is there. The critical problem of the Web is not a paucity of data but information overload: the Web is so overwhelming that ordinary people have difficulty finding what they really want on it. Search engines help users locate desired information and lead them to the places where they can retrieve it. Originally, search engines were designed to facilitate academic research. With the commercialization of the Internet, search engines have come to represent the single most important source of new Web users. [12] By 2005, about 84 percent of adult Internet users, or roughly 108 million Americans, had used search engines to find information on the Web, and about 38 million Americans were using a search engine every day. These users send out about 4 billion queries every month. [13]

Search engines are used for various purposes, both important and trivial. Web users seek legal documents and medical information using search engines, but they also look for cooking recipes and tabloid stories. [14] Moreover, research indicates that 80 percent of online shoppers begin at a search engine for product information when planning a purchase. [15] The Web has become the greatest library in the world, where students and researchers use search engines to dig up any topic on earth. [16]

How search engines interconnect with other media

Devices connected to the Internet can also take advantage of search engine services. These include personal digital assistants (PDAs), 3G cellular phones, and broadband satellite services. For instance, the University of Maryland developed Rover, a system that allows users to search for desired text, audio, or video information through their PDA handsets. [17] Google has also introduced wireless services for cell phone and PDA users.

Figure 2. Google introduces wireless search engine services. Source: Google.com

Search engines are also used in database management, including FTP systems, peer-to-peer networks, library catalogs, e-mail boxes, and even personal computers. Search engines have become a necessity because locating and retrieving desired data is a problem common to all database systems.

Web search engines are increasingly being combined with the desktop operating systems of PCs. In 2004, Microsoft announced that its next-generation operating system, code-named Longhorn, would offer Web searching functions. [18] At about the same time, Google released its desktop search engine, which enables users to search a hard disk, the Web, or both through a Web browser. [19]

Constraints

No matter how sophisticated the design of a search engine is, or how skilled a user is at taking advantage of it, there are valuable resources that search engines cannot find. Significant information remains unconnected to the Web: great quantities of information are never picked up by search engines because no one has created a Web page to contain them. [20] Even among the information that exists on the Web, whose size is approximately 500 times the content that search engines have found to date, a great amount remains invisible to search engines. [21] There are several reasons. Sometimes search engines do not capture Web pages that are too deep within a site; sometimes they do not include content in non-HTML formats, such as video, audio, or other files. In addition, some Web sites require passwords, so spiders may not be able to collect information from those sites. [22]

Hence, although the World Wide Web is the greatest library ever invented, it still does not represent the whole of human intelligence. There is a large amount of offline knowledge.

The quality of search results is another concern. When performing a search, people want the most relevant results listed at the top of the results pages. To achieve this, search engines must apply a sound technique for ranking search results. Google relies on PageRank, an algorithm based on the assumption that important pages are the ones that receive many links from other pages. This approach is one reason Google is frequently perceived to be the best search engine in the world. [23] Nevertheless, from the users’ perspective, the pages with the greatest number of links may not offer the highest-quality information.
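
The link-based idea behind PageRank can be sketched in a few lines of Python. This is a simplified textbook-style version, not Google’s production system: the damping factor of 0.85, the fixed iteration count, the treatment of pages without outgoing links, and the four-page sample graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from a {page: [pages it links to]} mapping."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}   # start with equal importance
    for _ in range(iterations):
        # Every page receives a small baseline share of importance.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked from A, B, and D; nothing links to D.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph, page C ends up with the highest score because it receives the most links, and page D, which receives none, ends up with the lowest. The sketch also shows the limitation noted above: the score measures how heavily a page is linked, not how good its content is.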

Economic factors have a considerable impact on the ranking of search results. Search engines usually identify advertisements by labeling them as “sponsored” in the results. However, some companies and Web sites now edit their metadata to manipulate the search results. By embedding popular keywords, such as “MP3” and “Britney Spears,” into their metadata [24], these Web sites have found they can increase their chances of appearing in the search results. A different approach is to pay a small fee to guarantee that one’s sites will be included in the search results. [25] A couple of well-known search engines, Yahoo! and Ask Jeeves, employ this “pay for play” strategy. Under such circumstances, users sometimes cannot distinguish advertisements from real results.

Copyright and privacy are other critical issues. Search engine providers sometimes link the search words entered by users to specific advertisements and sponsored sites. Some trademark holders have complained that searches for a specific trademarked term, such as “Playboy,” resulted in links to competitors. [27] People have also begun to worry that desktop search engines will threaten the security of the information on their personal computers’ hard drives.

Search Engine Applications

Google (www.google.com) Launched in 1998, Google has become the favorite search engine for the majority of Web users. Its success is based on its large database, which holds about 8 billion records, and on its fast and effective ranking algorithm. Google additionally provides searches of cached pages, newsgroups, images, PDF and other non-HTML files. [28] Google also offers technical support to other Web search engines, including AOL, Netscape, and EarthLink. [30]

Ask Jeeves (www.ask.com) Ask Jeeves started in 1996. It was the first search engine to use natural language searching, which remains its most distinctive feature. Ask Jeeves also feeds other search engines such as About.com and Mama.com. [31]

AltaVista (www.altavista.com) AltaVista began in 1995. It was first in a number of categories: the first full-text index, the first with multiple-language searching, multi-language translation, and a spell-checking function, and one of the first to search images and non-text documents. Although it no longer feeds any other Web sites, AltaVista has been feeding results to Yahoo! since the company was bought by Overture.com.

Yahoo! (www.yahoo.com) Initiated in 1994, Yahoo! was the first big search directory, and it used to receive search results from Google and other search feeders. Since the purchase of AltaVista, Yahoo! has been a strong competitor in search engine development. [32]

Other important search engines include AOL.com, MSN.com, InfoSpace.com, Netscape.com, EarthLink.com, Hotbot.com, and Teoma. More personalized and specialized search engines are on the horizon. There is no doubt that the search engine is a promising business. Microsoft’s involvement, Google’s great success on the NASDAQ, and international consolidation have been heating up the competition.

Figure 3. Four major search engines: Google, Yahoo!, AltaVista, and Ask Jeeves. Edited by Xiaopeng Wang.

Notes:

[1] Peter Kent, Search engine optimization for dummies, Indianapolis IN: Wiley, 2004, p.10.

[2] Greg R. Notess, “Internet search engine update.” Online, Jan/Feb. 2005, 29(1), p.12.

[3] Ned L. Fielden & Lucy Kuntz, Search engines handbook, Jefferson, NC: McFarland & Company, 2002, p.14.

[4] Curt Franklin, “How Internet search engines work.” Howstuffworks.com, http://computer.howstuffworks.com/search-engine.htm, visited on February 20, 2005.

[5] Ned L. Fielden & Lucy Kuntz, Search engines handbook, p.16.

[6] Curt Franklin, “How Internet search engines work.” Howstuffworks.com.

[7] Peter Kent, Search engine optimization for dummies, p.82.

[8] Curt Franklin, “How Internet search engines work.” Howstuffworks.com.

[9] Wes Sonnenreich, A history of search engines.

[10] Wes Sonnenreich, A history of search engines.

[11] “How Google works,” Economist, September 18, 2004, Vol. 372(8393), p.32-35.

[12] Peter Kent, Search engine optimization for dummies, p.13.

[13] Deborah Fallows, Search engine users, Washington D.C.: Pew Internet & American Life Project, 2005.

[14] Deborah Fallows, Search engine users.

[15] Peter Kent, Search engine optimization for dummies, p.14.

[16] Angelo Fernando, “Google intelligence! Sure, search engines deliver, but what about the off-line world?” Communication World, January/February 2005, p.12-13.

[17] Javed Mostafa, “Seeking better Web searches,” Scientific American, February 2005, Vol. 292(2), p.66-68.

[18] John Markoff, “Microsoft unveils its Internet search engines, quietly,” The New York Times, November 11, 2004.

[19] Javed Mostafa, “Seeking better Web searches”.

[20] Angelo Fernando, “Google intelligence!”

[21] Javed Mostafa, “Seeking better Web searches”.

[22] Randolph Hock, The extreme searcher’s Internet book: A guide for the serious searcher, Medford, NJ: CyberAge Books, 2004, p.21.

[23] “How Google works,” Economist.

[24] Metadata: The portion of the HTML coding for a Web page that allows the person creating the page to enter text describing the content of the page. The content of metadata is not shown on the page itself when the page is viewed in a browser window, but spider programs usually pick up metadata when scanning Web pages.

[25] Tom Spring, “Search tangle,” News & Trends, www.pcworld.com, August 2004.

[26] Catherine P. Taylor, “The engine that could,” MediaWeek, March 15, 2004, Vol. 14(1), p.18-22.

[27] George H. Pike, “All Google, All the time,” Information Today, February 2005, p.15.

[28] Javed Mostafa, “Seeking better Web searches”.

[29] Randolph Hock, The extreme searcher’s Internet book, p.91.

[30] Peter Kent, Search engine optimization for dummies, p.17.

[31] Ned L. Fielden & Lucy Kuntz, Search engines handbook, p.78.

[32] Peter Kent, Search engine optimization for dummies, p.30.

 
© 2004 Institute for Telecommunications Studies. All Rights Reserved.
This page was last updated on December 9, 2004