Search engine technology

A search engine is an information retrieval software program that discovers, crawls, transforms and stores information for retrieval and presentation in response to user queries.

A search engine normally consists of four components: a search interface, a crawler (also known as a spider or bot), an indexer, and a database. The crawler traverses a document collection, deconstructs document text, and assigns surrogates for storage in the search engine index. Online search engines additionally store images, link data, and metadata for each document.
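
As a rough illustration of how these four components fit together, here is a minimal Python sketch. The function names and data structures are invented for the example and do not correspond to any particular engine; a real crawler fetches documents over the network, and a real index is far more elaborate.

    from collections import defaultdict

    def crawl(seed_documents):
        """Stand-in for the crawler: yields (doc_id, text) pairs."""
        for doc_id, text in seed_documents.items():
            yield doc_id, text

    def build_index(documents):
        """Indexer: builds an inverted index (term -> set of doc ids), the 'database'."""
        inverted = defaultdict(set)
        for doc_id, text in documents:
            for term in text.lower().split():
                inverted[term].add(doc_id)
        return inverted

    def search(inverted, query):
        """Search interface: returns the ids of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        results = set(inverted.get(terms[0], set()))
        for term in terms[1:]:
            results &= inverted.get(term, set())
        return results

    docs = {"d1": "search engine technology", "d2": "engine repair manual"}
    index = build_index(crawl(docs))
    print(search(index, "engine technology"))   # {'d1'}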

The Memex

The concept of hypertext and of a memory extension originates from an article by Vannevar Bush titled "As We May Think", published in The Atlantic Monthly in July 1945. In this article Bush urged scientists to work together to help build a body of knowledge for all mankind. He then proposed the idea of a virtually limitless, fast, reliable, extensible, associative memory storage and retrieval system. He named this device a memex.

Bush regarded the notion of “associative indexing” as his key conceptual contribution. As he explained, this was “a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing.” This “linking” (as we now say) constituted a “trail” of documents that could be named, coded, and found again. Moreover, after the original two items were coupled, “numerous items” could be “joined together to form a trail”; they could be “reviewed in turn, rapidly or slowly, by deflecting a lever like that used for turning the pages of a book. It is exactly as though the physical items had been gathered together from widely separated sources and bound together to form a new book.”

All of the documents used in the memex would be in the form of microfilm, either acquired as such or, in the case of personal records, transformed to microfilm by the machine itself. The memex would also employ new retrieval techniques based on a new kind of associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another, creating personal "trails" through linked documents. Bush anticipated that these new procedures for storing and retrieving information would lead to the development of wholly new forms of encyclopedia.

The most important mechanism conceived by Bush, and the one considered closest to modern hypertext systems, is the associative trail. It would be a way to create a new linear sequence of microfilm frames across any arbitrary sequence of frames by building a chained sequence of links in the way just described, along with personal comments and side trails. As Bush put it: “The essential feature of the memex [is] the process of tying two items together… When the user is building a trail, he names it in his code book, and taps it out on his keyboard. Before him are the two items to be joined, projected onto adjacent viewing positions. At the bottom of each there are a number of blank code spaces, and a pointer is set to indicate one of these on each item. The user taps a single key, and the items are permanently joined… Thereafter, at any time, when one of these items is in view, the other can be instantly recalled merely by tapping a button below the corresponding code space.”

Bush's article describes neither any automatic search nor any universal metadata scheme such as a standard library classification or a hypertext element set. Instead, when the user made an entry, such as a new or annotated manuscript or image, he was expected to index and describe it in his personal code book. Later, by consulting his code book, the user could retrace annotated and generated entries.

In 1965 Bush took part in MIT's project INTREX, which aimed to develop technology for mechanizing the processing of information for library use. In his 1967 essay titled "Memex Revisited", he pointed out that the development of the digital computer, the transistor, video, and other similar devices had heightened the feasibility of such mechanization, but that costs would delay its achievement. He was right again.

Ted Nelson, who later did pioneering work on the first practical hypertext system and coined the term "hypertext" in the 1960s, credited Bush as his main influence.

SMART

Gerard Salton, who died on August 28, 1995, was the father of modern search technology. His teams at Harvard and Cornell developed the SMART information retrieval system. SMART, Salton's Magic Automatic Retriever of Text, included important concepts such as the vector space model, inverse document frequency (IDF), term frequency (TF), term discrimination values, and relevance feedback mechanisms.
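
As a rough illustration of two of those concepts, the following Python sketch computes TF-IDF weights and a cosine similarity in the spirit of the vector space model. The exact weighting formulas used in SMART varied over the years; this is only one common variant, written for clarity rather than fidelity.

    import math
    from collections import Counter

    def tf_idf_vectors(documents):
        """Compute TF-IDF weight vectors for a list of documents (one common variant)."""
        doc_terms = [Counter(doc.lower().split()) for doc in documents]
        n_docs = len(documents)
        df = Counter()                      # document frequency of each term
        for terms in doc_terms:
            df.update(terms.keys())
        vectors = []
        for terms in doc_terms:
            total = sum(terms.values())
            vectors.append({t: (count / total) * math.log(n_docs / df[t])
                            for t, count in terms.items()})
        return vectors

    def cosine(a, b):
        """Cosine similarity between two sparse term-weight vectors."""
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    vecs = tf_idf_vectors(["the quick brown fox", "the lazy dog"])
    print(cosine(vecs[0], vecs[1]))   # 0.0: the only shared term, "the", has an IDF of zero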

He authored a 56-page book called A Theory of Indexing, which explained many of his tests, upon which search is still largely based.

String Search Engines

In 1987 an article was published detailing the development of a character string search engine (SSE) for rapid text retrieval, implemented as a double-metal 1.6-μm n-well CMOS solid-state circuit with 217,600 transistors laid out on an 8.62 x 12.76 mm die. The SSE accommodated a novel string-search architecture that combined a 512-stage finite-state automaton (FSA) logic with a content-addressable memory (CAM) to achieve an approximate string comparison rate of 80 million strings per second. The CAM cell consisted of four conventional static RAM (SRAM) cells and a read/write circuit. Concurrent comparison of 64 stored variable-length strings was achieved in 50 ns for an input text stream of 10 million characters/s, maintaining performance despite the presence of single-character errors in the character codes. Furthermore, the chip allowed nonanchored string search and variable-length "don't care" (VLDC) string search.
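
The chip itself is long obsolete, but the core operation it performed, comparing stored strings against an input text while tolerating a single character error, can be illustrated in a few lines of software. The sketch below is a naive scan written for clarity; it is not the FSA/CAM design described in the article.

    def matches_with_one_error(pattern, text):
        """Return the positions where `pattern` occurs in `text` with at most
        one mismatched character (a simple form of approximate string matching)."""
        hits = []
        m = len(pattern)
        for start in range(len(text) - m + 1):
            window = text[start:start + m]
            mismatches = sum(1 for a, b in zip(pattern, window) if a != b)
            if mismatches <= 1:
                hits.append(start)
        return hits

    print(matches_with_one_error("search", "a searsh engine"))   # [2]: one character is wrong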

Archie

The first Internet search engine was Archie, created in 1990 by Alan Emtage, a student at McGill University in Montreal. The author originally wanted to call the program "archives," but had to shorten it to comply with the Unix world standard of assigning programs and files short, cryptic names such as grep, cat, troff, sed, awk, perl, and so on.

The primary method of storing and retrieving files was via the File Transfer Protocol (FTP). This was (and still is) a system that specified a common way for computers to exchange files over the Internet. It works like this: Some administrator decides that he wants to make files available from his computer. He sets up a program on his computer, called an FTP server. When someone on the Internet wants to retrieve a file from this computer, he or she connects to it via another program called an FTP client. Any FTP client program can connect with any FTP server program as long as the client and server programs both fully follow the specifications set forth in the FTP protocol.
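
For a concrete illustration of that client/server exchange, Python's standard ftplib module can play the part of the FTP client. The host and file names below are placeholders rather than a real archive, and a live server is of course required for the calls to succeed.

    from ftplib import FTP

    # Connect to an (illustrative) anonymous FTP server and list its files.
    ftp = FTP("ftp.example.org")      # placeholder host name
    ftp.login()                       # anonymous login
    print(ftp.nlst())                 # file names in the current directory

    # Retrieve one file, much as an FTP client of the era would.
    with open("README", "wb") as fh:
        ftp.retrbinary("RETR README", fh.write)
    ftp.quit()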

Initially, anyone who wanted to share a file had to set up an FTP server in order to make the file available to others. Later, "anonymous" FTP sites became repositories for files, allowing all users to post and retrieve them.

Even with archive sites, many important files were still scattered on small FTP servers. Unfortunately, these files could be located only by the Internet equivalent of word of mouth: Somebody would post an e-mail to a message list or a discussion forum announcing the availability of a file.

Archie changed all that. It combined a script-based data gatherer, which fetched site listings of anonymous FTP files, with a regular expression matcher for retrieving file names matching a user query. (4) In other words, Archie's gatherer scoured FTP sites across the Internet and indexed all of the files it found. Its regular expression matcher provided users with access to its database.
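
A toy stand-in for those two pieces, a gathered index of anonymous FTP listings and a regular expression matcher over the file names, might look like the following. The sites and file names are invented for the example; this is in no way Archie's actual code.

    import re

    # Output of the "gatherer": (site, path) pairs harvested from FTP listings.
    gathered_index = [
        ("ftp.site-a.example", "pub/games/xtetris.tar.Z"),
        ("ftp.site-b.example", "pub/docs/rfc1035.txt"),
        ("ftp.site-c.example", "pub/games/nethack-3.1.tar.Z"),
    ]

    def archie_like_search(pattern, index):
        """Return the (site, path) entries whose path matches the regular expression."""
        rx = re.compile(pattern)
        return [(site, path) for site, path in index if rx.search(path)]

    print(archie_like_search(r"tetris", gathered_index))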

Veronica

In 1993, the University of Nevada System Computing Services group developed Veronica. It was created as a type of searching device similar to Archie but for Gopher files. Another Gopher search service, called Jughead, appeared a little later, probably for the sole purpose of rounding out the comic-strip triumvirate. Jughead is an acronym for Jonzy's Universal Gopher Hierarchy Excavation and Display, although, like Veronica, it is probably safe to assume that the creator backed into the acronym. Jughead's functionality was pretty much identical to Veronica's, although it appears to be a little rougher around the edges.

The Lone Wanderer

The World Wide Web Wanderer, developed by Matthew Gray in 1993, was the first robot on the Web and was designed to track the Web's growth. Initially, the Wanderer counted only Web servers, but shortly after its introduction, it started to capture URLs as it went along. The database of captured URLs became the Wandex, the first web database.

Matthew Gray's Wanderer created quite a controversy at the time, partially because early versions of the software ran rampant through the Net and caused a noticeable netwide performance degradation. This degradation occurred because the Wanderer would access the same page hundreds of times a day. The Wanderer soon amended its ways, but the controversy over whether robots were good or bad for the Internet remained.

In response to the Wanderer, Martijn Koster created Archie-Like Indexing of the Web, or ALIWEB, in October 1993. As the name implies, ALIWEB was the HTTP equivalent of Archie, and because of this, it is still unique in many ways.

ALIWEB does not have a web-searching robot. Instead, webmasters of participating sites post their own index information for each page they want listed. The advantage of this method is that users get to describe their own site, and a robot doesn't run about eating up Net bandwidth. Unfortunately, the disadvantages of ALIWEB are more of a problem today. The primary disadvantage is that a special indexing file must be submitted. Most users do not understand how to create such a file, and therefore they don't submit their pages. This leads to a relatively small database, which means that users are less likely to search ALIWEB than one of the large bot-based sites. This Catch-22 has been somewhat offset by incorporating other databases into the ALIWEB search, but it still does not have the mass appeal of search engines such as Yahoo! or Lycos.

Excite

Excite, initially called Architext, was started by six Stanford undergraduates in February 1993. Their idea was to use statistical analysis of word relationships in order to provide more efficient searches through the large amount of information on the Internet. Their project was fully funded by mid-1993. Once funding was secured, they released a version of their search software for webmasters to use on their own web sites. At the time, the software was called Architext, but it now goes by the name of Excite for Web Servers.

Excite, launched in 1995, was the first serious commercial search engine. It was developed at Stanford and was purchased by @Home for $6.5 billion. In 2001 Excite and @Home went bankrupt, and InfoSpace bought Excite for $10 million.

Some of the first analysis of web searching was conducted on search logs from Excite.

Yahoo!

In April 1994, two Stanford University Ph.D. candidates, David Filo and Jerry Yang, created some pages that became rather popular. They called the collection of pages Yahoo! Their official explanation for the name choice was that they considered themselves to be a pair of yahoos.

As the number of links grew and their pages began to receive thousands of hits a day, the team created ways to better organize the data. In order to aid in data retrieval, Yahoo! (www.yahoo.com) became a searchable directory. The search feature was a simple database search engine. Because Yahoo! entries were entered and categorized manually, Yahoo! was not really classified as a search engine. Instead, it was generally considered to be a searchable directory. Yahoo! has since automated some aspects of the gathering and classification process, blurring the distinction between engine and directory.

The Wanderer captured only URLs, which made it difficult to find things that weren’t explicitly described by their URL. Because URLs are rather cryptic to begin with, this didn’t help the average user. Searching Yahoo! or the Galaxy was much more effective because they contained additional descriptive information about the indexed sites.

Lycos

At Carnegie Mellon University during July 1994, Michael Mauldin, on leave from CMU, developed the Lycos search engine.

Types of Web Search Engines

Search engines on the web are sites enriched with the facility to search the content stored on other sites. There are differences in the way various search engines work, but they all perform three basic tasks:

  1. Finding and selecting full or partial content based on the keywords provided.
  2. Maintaining an index of the content and referencing the locations where it is found.
  3. Allowing users to look for words or combinations of words found in that index.

The process begins when a user enters a query statement into the system through the interface provided.
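
Tasks 2 and 3 can be illustrated with a small positional index that records where each word was found and then answers word lookups against it. The structure below is a generic sketch invented for this example, not any particular engine's implementation.

    from collections import defaultdict

    def build_positional_index(pages):
        """Task 2: maintain an index of the content, referencing the locations
        (page id and word offset) where each word was found."""
        index = defaultdict(list)
        for page_id, text in pages.items():
            for position, word in enumerate(text.lower().split()):
                index[word].append((page_id, position))
        return index

    def lookup(index, query):
        """Task 3: let users look up words, or combinations of words, in that index."""
        return {word: index.get(word, []) for word in query.lower().split()}

    pages = {"page1": "web search engines index the web",
             "page2": "a database search engine indexes structured data"}
    index = build_positional_index(pages)
    print(lookup(index, "search engine"))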

There are basically three types of search engines: those that are powered by robots (called crawlers, ants, or spiders), those that are powered by human submissions, and those that are a hybrid of the two.

Crawler-based search engines use automated software agents (called crawlers) that visit a Web site, read the information on the actual site, read the site's meta tags, and follow the links that the site connects to, performing indexing on all linked Web sites as well. The crawler returns all that information to a central repository, where the data is indexed. The crawler periodically returns to the sites to check for any information that has changed; the frequency with which this happens is determined by the administrators of the search engine.
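
A heavily simplified crawler of this kind might look like the sketch below. It uses only the Python standard library, pulls out links and meta descriptions with crude regular expressions, and ignores robots.txt, politeness delays, duplicate content, and everything else a production crawler must handle.

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def mini_crawl(seed_url, max_pages=10):
        """Toy breadth-first crawler: fetch a page, record its meta description,
        and follow the links it contains."""
        seen, queue, collected = set(), deque([seed_url]), {}
        while queue and len(collected) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
            except OSError:
                continue
            meta = re.search(r'<meta\s+name="description"\s+content="([^"]*)"', html, re.I)
            collected[url] = meta.group(1) if meta else ""
            for link in re.findall(r'href="([^"#]+)"', html, re.I):
                queue.append(urljoin(url, link))
        return collected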

Human-powered search engines rely on humans to submit information that is subsequently indexed and catalogued. Only information that is submitted is put into the index.

In both cases, when you query a search engine to locate information, you're actually searching through the index that the search engine has created; you are not actually searching the Web. These indices are giant databases of information that is collected, stored, and subsequently searched. This explains why a search on a commercial search engine, such as Yahoo! or Google, will sometimes return results that are, in fact, dead links. Since the search results are based on the index, if the index hasn't been updated since a Web page became invalid, the search engine treats the page as still being an active link even though it no longer is. It will remain that way until the index is updated.

So why will the same search on different search engines produce different results? Part of the answer is that not all indices are exactly the same; it depends on what the spiders find or what the humans submit. But more importantly, not every search engine uses the same algorithm to search through the indices. The algorithm is what a search engine uses to determine the relevance of the information in the index to what the user is searching for.

One of the elements that a search engine algorithm scans for is the frequency and location of keywords on a Web page. Those with higher frequency are typically considered more relevant. But search engine technology is becoming sophisticated in its attempt to discourage what is known as keyword stuffing, or spamdexing.
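
A toy scoring function along these lines might count keyword occurrences and weight those appearing in the page title more heavily than those in the body. The weights below (3 for the title, 1 for the body) are arbitrary illustrations, not any engine's actual values.

    def keyword_score(query_terms, title, body):
        """Toy relevance score based on keyword frequency and location."""
        title_words = title.lower().split()
        body_words = body.lower().split()
        score = 0
        for term in query_terms:
            term = term.lower()
            score += 3 * title_words.count(term) + body_words.count(term)
        return score

    print(keyword_score(["search", "engine"],
                        "Search engine technology",
                        "A search engine stores information for retrieval"))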

Another common element that algorithms analyze is the way that pages link to other pages in the Web. By analyzing how pages link to each other, an engine can both determine what a page is about (if the keywords of the linked pages are similar to the keywords on the original page) and whether that page is considered "important" and deserving of a boost in ranking. Just as the technology is becoming increasingly sophisticated to ignore keyword stuffing, it is also becoming more savvy to Web masters who build artificial links into their sites in order to build an artificial ranking.

Modern web search engines are highly intricate software systems that employ technology that has evolved over the years. There are a number of sub-categories of search engine software that are separately applicable to specific 'browsing' needs. These include web search engines (e.g. Google), database or structured data search engines (e.g. Dieselpoint), and mixed search engines or enterprise search. The more prevalent search engines, such as Google and Yahoo!, utilize hundreds of thousands of computers to process trillions of web pages in order to return fairly well-targeted results. Due to this high volume of queries and text processing, the software is required to run in a highly distributed environment with a high degree of redundancy.

Web search engines

Search engines that are expressly designed for searching web pages, documents, and images were developed to facilitate searching through a large, nebulous blob of unstructured resources. They are engineered to follow a multi-stage process: crawling the vast stockpile of pages and documents to extract the salient terms from their contents, indexing those terms in a semi-structured form (typically a database), and finally resolving user queries to return mostly relevant results and links to the indexed documents or pages.

Crawl

In the case of a wholly textual search, the first step in classifying web pages is to find an 'index item' that might relate expressly to the 'search term.' In the past, search engines began with a small list of URLs as a so-called seed list, fetched the content, and parsed the links on those pages for relevant information; those links in turn provided new links. The process was highly cyclical and continued until enough pages were found for the searcher's use. These days, a continuous crawl method is employed rather than an incidental discovery based on a seed list. The continuous crawl is an extension of the aforementioned discovery method, except that there is no fixed stopping point, because the system never stops crawling.

Most search engines use sophisticated scheduling algorithms to "decide" when to revisit a particular page and reassess its relevance. These algorithms range from a constant visit interval, with higher priority for more frequently changing pages, to an adaptive visit interval based on several criteria such as frequency of change, popularity, and overall quality of the site. The speed of the web server running the page, as well as resource constraints such as the amount of hardware or bandwidth, also figure in.
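
An adaptive revisit policy of that kind could be sketched as follows. The formula and its constants are invented purely for illustration; real schedulers weigh many more signals.

    def next_visit_interval(base_hours, change_rate, popularity):
        """Toy adaptive revisit interval: pages that change often, or that are
        popular, are revisited sooner.

        change_rate: fraction of past visits on which the page had changed (0..1)
        popularity:  e.g. a normalized inbound-link count (0..1)
        """
        factor = 1.0 / (1.0 + 3.0 * change_rate + 1.0 * popularity)
        return max(1.0, base_hours * factor)

    print(next_visit_interval(48, change_rate=0.9, popularity=0.5))  # busy, popular page: ~11 hours
    print(next_visit_interval(48, change_rate=0.0, popularity=0.1))  # static, obscure page: ~44 hours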

The pages discovered by web crawls are often distributed and fed into another computer that creates a map of the resources uncovered. This cluster looks a little like a graph, on which the different pages are represented as small nodes connected by the links between them. The data is stored in multiple data structures that permit quick access by algorithms that compute a popularity score for pages on the web based on how many links point to a given page; this is what determines, for instance, whether entering 'Egypt' as a search term surfaces pages about Mohamed Morsi before pages about the best attractions to visit in Cairo. One such algorithm, PageRank, proposed by Google founders Larry Page and Sergey Brin, is well known and has attracted a lot of attention.

The idea of doing link analysis to compute a popularity rank is older than PageRank, and other variants of the same idea are currently in use. These ideas can be grouped into two main categories: the rank of individual pages and the nature of a web site's content. Search engines often differentiate between internal links and external links, because web masters are not strangers to shameless self-promotion. Link map data structures typically store the anchor text embedded in the links as well, because anchor text can often provide a "very good quality" summary of a web page's content.
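
A simplified version of PageRank's power iteration, ignoring dangling pages and the other refinements of the published algorithm, can be written in a few lines; the link map and damping factor below are only illustrative.

    def pagerank(links, damping=0.85, iterations=50):
        """Simplified PageRank over a link map {page: [pages it links to]}."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                if not outlinks:
                    continue
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    if target in new_rank:
                        new_rank[target] += share
            rank = new_rank
        return rank

    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(links))   # C, with the most inbound links, ends up with the highest score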

Database Search Engines

Searching for text-based content in databases presents a few special challenges, around which a number of specialized search engines flourish. Databases can be slow when solving complex queries (with multiple logical or string-matching arguments). Databases allow pseudo-logical queries, which full-text searches do not use. There is no crawling necessary for a database, since the data is already structured. However, it is often necessary to index the data in a more economical form to allow a more expeditious search.
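
The following sketch illustrates that idea with SQLite from the Python standard library: rather than scanning a text column with LIKE for every query, the rows are read once and a small external inverted index is built over them. It is a toy example, not how any particular database search engine is implemented.

    import sqlite3
    from collections import defaultdict

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO articles (body) VALUES (?)",
                     [("search engine technology",), ("database indexing basics",)])

    # Build the 'more economical form': term -> set of row ids.
    inverted = defaultdict(set)
    for row_id, body in conn.execute("SELECT id, body FROM articles"):
        for term in body.lower().split():
            inverted[term].add(row_id)

    print(inverted["indexing"])   # row ids containing the term, found without rescanning the table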

Mixed Search Engines

Sometimes, the data being searched contains both database content and web pages or documents. Search engine technology has developed to respond to both sets of requirements. Most mixed search engines are large Web search engines, like Google, that search both structured and unstructured data sources. Take, for example, the word 'ball': in its simplest terms, it returns more than 40 variations on Wikipedia alone. Did you mean a ball, as in the social gathering or dance? A soccer ball? The ball of the foot? Pages and documents are crawled and indexed in a separate index. Databases are also indexed from various sources. Search results are then generated for users by querying these multiple indices in parallel and compounding the results according to "rules."
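
A toy version of that final merging step might query two separate indices and combine their scored results under a simple weighting rule. The indices, scores, and weights below are invented for the example; real engines apply far more elaborate rules.

    def federated_search(query, web_index, db_index, web_weight=1.0, db_weight=0.8):
        """Toy 'mixed' search: query two indices and merge their scored results."""
        merged = {}
        for doc, score in web_index.get(query, []):
            merged[doc] = merged.get(doc, 0.0) + web_weight * score
        for doc, score in db_index.get(query, []):
            merged[doc] = merged.get(doc, 0.0) + db_weight * score
        return sorted(merged.items(), key=lambda item: item[1], reverse=True)

    web_index = {"ball": [("wiki/Ball_(dance)", 0.9), ("wiki/Football", 0.7)]}
    db_index = {"ball": [("products/soccer-ball", 0.95)]}
    print(federated_search("ball", web_index, db_index))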

References

Search engine technology Wikipedia