Web Search Engine

Introduction

Jiawei Han, ... Jian Pei, in Data Mining (Third Edition), 2012

1.6.2 Web Search Engines

A Web search engine is a specialized computer server that searches for information on the Web. The search results of a user query are often returned as a list (sometimes called hits). The hits may consist of web pages, images, and other types of files. Some search engines also search and return data available in public databases or open directories. Search engines differ from web directories in that web directories are maintained by human editors whereas search engines operate algorithmically or by a mixture of algorithmic and human input.

Web search engines are essentially very large data mining applications. Various data mining techniques are used in all aspects of search engines, ranging from crawling (e.g., deciding which pages should be crawled and at what frequency) and indexing (e.g., selecting the pages to be indexed and deciding the extent to which the index should be constructed) to searching (e.g., deciding how pages should be ranked, which advertisements should be added, and how the search results can be personalized or made "context aware").
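
To make the indexing and searching aspects concrete, the following is a minimal sketch of an inverted index with naive term-frequency ranking; the toy pages, the tokenization, and the scoring scheme are illustrative assumptions only and are not taken from the chapter.

from collections import defaultdict, Counter

# Toy corpus standing in for crawled pages (illustrative only).
pages = {
    "p1": "apple releases a new laptop computer",
    "p2": "apple pie recipe with fresh apples",
    "p3": "laptop buying guide and computer reviews",
}

# Indexing: map each term to the pages containing it, with term counts.
index = defaultdict(Counter)
for page_id, text in pages.items():
    for term in text.lower().split():
        index[term][page_id] += 1

def search(query):
    """Searching: rank pages by the summed term frequency of the query terms."""
    scores = Counter()
    for term in query.lower().split():
        for page_id, tf in index[term].items():
            scores[page_id] += tf
    return scores.most_common()

print(search("apple laptop"))  # e.g. [('p1', 2), ('p2', 1), ('p3', 1)]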

Search engines pose grand challenges to data mining. First, they have to handle a huge and ever-growing amount of data. Typically, such data cannot be processed using one or a few machines. Instead, search engines often need to use computer clouds, which consist of thousands or even hundreds of thousands of computers that collaboratively mine the huge amount of data. Scaling up data mining methods over computer clouds and large distributed data sets is an area for further research.

Second, Web search engines often have to deal with online data. A search engine may be able to afford constructing a model offline on huge data sets. For example, it may construct a query classifier that assigns a search query to predefined categories based on the query topic (e.g., whether the search query "apple" is meant to retrieve information about a fruit or a brand of computers). Even if a model is constructed offline, applying the model online must be fast enough to answer user queries in real time.
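
As an illustration of such a query classifier, here is a minimal sketch using invented toy queries and scikit-learn as one possible toolkit; it is not the book's implementation. The expensive fitting step runs offline, and only the cheap predict call sits on the online query path.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Offline: train on a tiny, invented set of labeled queries.
queries = ["apple pie recipe", "apple macbook price", "banana smoothie",
           "laptop with retina display", "fruit salad ideas", "computer store near me"]
labels = ["food", "computers", "food", "computers", "food", "computers"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(queries, labels)  # the expensive step, done offline

# Online: classifying an incoming query is a fast, lightweight operation.
print(model.predict(["cheap laptop deals"]))  # likely ['computers']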

Another challenge is maintaining and incrementally updating a model on fast-growing data streams. For example, a query classifier may need to be maintained and updated incrementally and continuously, since new queries keep emerging while the predefined categories and the data distribution may change. Most existing model training methods are offline and static, and thus cannot be used in such a scenario.
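
A sketch of such incremental maintenance, assuming scikit-learn's partial_fit interface and invented example queries: each new batch of labeled queries is folded into the existing model instead of triggering a full retraining.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

categories = ["food", "computers"]
vectorizer = HashingVectorizer(n_features=2**18)  # stateless, so it never needs refitting
classifier = SGDClassifier()                      # linear model trained with stochastic gradient descent

def update(query_batch, label_batch):
    """Fold a new batch of labeled queries into the existing model."""
    X = vectorizer.transform(query_batch)
    classifier.partial_fit(X, label_batch, classes=categories)

# New queries keep arriving; the model is updated without retraining from scratch.
update(["apple watch bands", "apple crumble recipe"], ["computers", "food"])
update(["gaming laptop sale"], ["computers"])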

Third, Web search engines often have to deal with queries that are asked only a very small number of times. Suppose a search engine wants to provide context-aware query recommendations. That is, when a user poses a query, the search engine tries to infer the context of the query using the user's profile and his query history in order to return more customized answers within a small fraction of a second. However, although the total number of queries asked can be huge, most of the queries may be asked only once or a few times. Such severely skewed data are challenging for many data mining and machine learning methods.

URL: https://www.sciencedirect.com/science/article/pii/B9780123814791000010

Foundation

Sudhanshu Chauhan, Nutan Kumar Panda, in Hacking Web Intelligence, 2015

Web search engine

A web search engine is a software application that crawls the web to index it and provides information based on the user's search query. Some search engines go beyond that and also extract information from various open databases. Usually the search engines provide real-time results based upon the backend crawling and data analysis algorithms they use. The results of a search engine are usually presented in the form of URLs with an abstract.
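
A hypothetical sketch of the crawl-and-index loop described above; the seed URL, the page limit, and the regex-based link extraction are simplifying assumptions for illustration, not how production crawlers work.

import re
from collections import defaultdict, deque
import requests

def crawl_and_index(seed_url, max_pages=10):
    """Breadth-first crawl from a seed URL, building a term -> set-of-URLs index."""
    index = defaultdict(set)
    queue, seen, fetched = deque([seed_url]), {seed_url}, 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        fetched += 1
        # Index: record which page each word occurred on (very crude tokenization).
        for word in re.findall(r"[a-z]{3,}", html.lower()):
            index[word].add(url)
        # Extract absolute links and enqueue unseen ones.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

# Example usage (network access required):
# index = crawl_and_index("https://example.com")
# print(index.get("example", set()))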

Apart from the usual web search engines, some search engines also index data from various forums and other closed portals (which require login). Some search engines also collect search results from several different search engines and provide them in a single interface.

URL: https://www.sciencedirect.com/science/article/pii/B978012801867500001X

Professional news search services

Lars Våge, Lars Iselid, in News Search, Blogs and Feeds, 2010

Depth and scope of coverage, back archives and permanence of access

In web search engines you are normally not able to search printed versions of news articles or recent articles that were not published on the web. Using the professional services you can search both, and you will also get access to news articles that predate the web. Today several of the big newspapers, like the New York Times, The Times, and Die Zeit, have digitized large amounts of their older output. In these cases they typically provide search tools that are based on optical character recognition (OCR) and will let you read some articles for free, while you have to pay for others. In the fee-based systems, searching is not based on OCR processes, which means that the results of searching within article texts are always very accurate.

The crawler programs of the web search engines visit an unbelievable number of web pages every day and the size of their indexes exceeds several billion web pages. While this may lead you to think that they must surely find everything available on the web, this is in fact not the case at all. A large portion of the web, some say the largest by far, cannot be captured by the crawler programs of Google and the rest. Huge amounts of information remain in what is commonly referred to as the Invisible Web and cannot be found using the search engines.

The fee-based search services, on the other hand, primarily do not index web resources; rather, they index proprietary sources, which supply the material directly to their systems. The number of news sources found in these databases ranges from a couple of hundred to several tens of thousands. This may seem very small when compared to the number of websites that the search engines index, but the fact is that the coverage of the professional online services is much more complete within their field. The reason for this is that they offer near-complete coverage of all the most important source publications, with archives in some cases stretching back several decades.

If you want to find a press release that was issued through one of the news wires about the launch of some company at the beginning of the 1990s, then you will find it here. If you want to read the first news stories commenting on the 9/11 attacks in the US, you will also find them here. This is what you pay for – the knowledge that the information is there, somewhere, in the database and that it will stay there, that you can find it and you will be able to find it again if you lose it. It won't go away just because a company goes bankrupt or is bought up by a competitor that then removes all the information on the original company from the web.

However, there are gaps in the coverage, which is why we used the phrase 'near complete coverage'. Most importantly, there are all the articles by freelance writers that were removed from these databases as a result of the case New York Times vs. Tasini before the U.S. Supreme Court in 2001. Jonathan Tasini was the President of the US National Writers Union. The court ruled that newspapers may not license electronic copies of articles written by freelance journalists to online news databases without first securing permission from the authors or offering economic compensation.

This quickly led to the removal of a great many articles from the online services and is the reason why this kind of material, unfortunately, cannot be found on them. There may be other omissions, such as the non-inclusion of very short articles. The bottom line is that the coverage, while exceptional, is not complete. You may sometimes have to go to a library and read articles on microfilm – not an altogether unpleasant experience if you have the time, but if you don't have the time it can be a major obstacle.

URL: https://www.sciencedirect.com/science/article/pii/B9781843346029500034

Data Retrieval: Search

Susan Fowler, ... FAST CONSULTING, in Web Application Design Handbook, 2004

Searches Are Shallow (but Don't Have to Be)

Most users look at no more than 10 to 20 documents (one or two pages) from the results list (Jansen and Pooch, 2001; Nielsen, 2001).

Although most web search engines show 10 hits per page, a study by Michael Bernard et al. (2002, p. 5) found that their research subjects preferred 50 links per page and scanned and found information more quickly with 50 links. Also, the Flamenco system from the University of California, Berkeley, retrieves hundreds of search results in a three-column matrix with seemingly no ill effects—see the Flamenco Search Interface Project at http://bailando.sims.berkeley.edu/flamenco.html for more information.

URL: https://www.sciencedirect.com/science/article/pii/B9781558607521500050

Cross-Vertical Search Ranking

Bo Long, Yi Chang, in Relevance Ranking for Vertical Search Engines, 2014

Abstract

A traditional Web search engine conducts ranking mainly in a single domain; that is, it focuses on one type of data source, and effective modeling relies on a sufficiently large number of labeled examples, which require an expensive and time-consuming labeling process. On the other hand, it is very common for a vertical search engine to conduct ranking tasks in various verticals, which presents a more challenging ranking problem: cross-domain ranking. Although in this book our focus is on cross-vertical ranking, the proposed approaches can be applied to more general cases, such as cross-language ranking. Therefore, we use the more general term cross-domain ranking in this book. For cross-domain ranking, in some domains we may have a relatively large amount of training data, whereas in other domains we can collect only very little. Therefore, finding ways to leverage labeled information from related heterogeneous domains to improve ranking in a target domain has become a problem of great interest. In this chapter, we propose a novel probabilistic model, the pairwise cross-domain factor (PCDF) model, to address this problem. The proposed model learns latent factors (features) for multidomain data in partially overlapped heterogeneous feature spaces. It is capable of learning homogeneous feature correlation, heterogeneous feature correlation, and pairwise preference correlation for cross-domain knowledge transfer. We also derive two PCDF variations to address two important special cases. Under the PCDF model, we derive a stochastic gradient-based algorithm, which facilitates distributed optimization and is flexible enough to adopt various loss functions and regularization functions to accommodate different data distributions. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed model and algorithm.
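
The code below is only a generic sketch of two ingredients mentioned in the abstract, a pairwise preference loss optimized by stochastic gradient descent over a shared weight vector, using invented toy features; it is not the authors' PCDF model.

import numpy as np

rng = np.random.default_rng(0)

def sgd_pairwise_rank(pairs, dim, epochs=50, lr=0.1, reg=0.01):
    """Learn w so that score(x_pref) > score(x_other) for each preference pair,
    using a hinge-style pairwise loss and stochastic gradient updates."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_pref, x_other in pairs:
            margin = w @ (x_pref - x_other)
            if margin < 1.0:                       # violated or too-small margin
                w += lr * ((x_pref - x_other) - reg * w)
            else:
                w -= lr * reg * w                  # regularization-only step
    return w

# Toy preference pairs (preferred document features, less relevant document features);
# in a cross-domain setting, pairs from several domains would share part of this feature space.
pairs = [(rng.normal(1.0, 0.5, 4), rng.normal(0.0, 0.5, 4)) for _ in range(20)]
w = sgd_pairwise_rank(pairs, dim=4)
print("learned weights:", np.round(w, 2))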

URL: https://www.sciencedirect.com/science/article/pii/B9780124071711000095

Evaluating news search tools

Lars Våge, Lars Iselid, in News Search, Blogs and Feeds, 2010

Frequency of spidering

When using a web search engine, a good way to test the spidering frequency is to use the cached copy that's made available on several web search engines. (Spidering frequency refers to the frequency with which the search engine's crawler program visits websites for updates.) In Google the link to the cached copy can be found near the URL in the results list. Click on it, and you will see a copy of the web page as it appeared the last time the spider visited it. You could say that a search engine index consists of asynchronous snapshots of the internet. The front pages of most news sources are typically updated several times every hour, and of course it's difficult for the spiders to keep up with this.

However, when evaluating news search engines you can't use the cached copy to determine how recently a page was captured, because news search engines normally don't offer one. This means that you'll have to do your evaluation more or less manually. An interesting experiment can be done by watching a sports event on TV. Choose some really big event, like a high-profile soccer match of international significance. You don't have to watch the whole game, just the last part, so that you know who won and maybe who scored the winning goal. Then, as soon as the match ends, go to your computer and start searching your favourite news search engines using keywords such as the names of the teams and of the player who scored the most crucial goal. Repeat the search every 15 minutes in the two or three news search engines that you want to check up on, and note each time how many of the hits are relevant to the just-finished match and cite the match score. Give it an hour, or maybe one and a half hours. The results will reflect how effective their spidering and indexing are.
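
A rough sketch of automating this experiment, assuming hypothetical news search endpoints that return a JSON hit count; the URLs, the query parameter, and the total_hits field are invented placeholders, and judging which hits are actually relevant remains a manual step.

import time
import requests

# Hypothetical endpoints; substitute the news search engines you want to evaluate.
ENGINES = {
    "engine_a": "https://news-search-a.example/api?q={query}",
    "engine_b": "https://news-search-b.example/api?q={query}",
}

def poll_hit_counts(query, rounds=6, interval_minutes=15):
    """Query each engine every interval_minutes and log the reported hit counts."""
    for i in range(rounds):
        for name, url_template in ENGINES.items():
            try:
                response = requests.get(url_template.format(query=query), timeout=10)
                hits = response.json().get("total_hits", 0)  # invented response field
            except requests.RequestException:
                hits = None
            print(f"round {i + 1}, {name}: {hits} hits")
        time.sleep(interval_minutes * 60)

# poll_hit_counts("TeamA TeamB winning goal")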

Some news search engines may crawl the news sites very frequently, but perhaps not very deeply. By this we mean that while they may reindex the content of the home page and the start pages of major news sections, they don't go very deep each time they visit the websites. It's difficult to determine just how often a full reindexing of a source site is done. It could actually turn out to be very irregular. A test over a couple of days may give you a hint as to the typical spidering frequency.

It's important to check these things, so as to find out how fresh the search engine index normally is and whether this is good enough for your needs. In commercial news search databases this test won't be very helpful because they don't index web pages other than as a complement to the indexing of articles that they receive in other ways. Many, but not all, news monitoring services (NMS) provide you with news both from the internet and from the printed versions. There are lots of monitoring services around the world and some of them were established long before the internet became a popular publishing platform. Even here, it's important to try to check the updating frequency of both the printed news stories and the stories published on the websites that are being monitored.

URL: https://www.sciencedirect.com/science/article/pii/B9781843346029500058

Formulating the Query

Tony Russell-Rose, Tyler Tate, in Designing the Search Experience, 2013

Related searches

All the major web search engines offer support for related searches. Bing, for example, shows them in a panel to the left of the main results (Figure 5.35).

Figure 5.35. Related searches at Bing.

Google, by contrast, shows them on demand (via a link in the sidebar) as a panel above the main search results (Figure 5.36). Like the Yahoo example seen earlier, they both emphasize extensions to the query by highlighting the nonmatching elements.

Figure 5.36. Related searches at Google.

Apart from providing inspiration, related searches can be used to help clarify an ambiguous query (see Chapter 7 for the significance of this within faceted search). For example, a query on Bing for "apple" returns results associated mainly with the computer manufacturer, but the related searches clearly indicate a number of other interpretations (Figure 5.37).

Figure 5.37. Query disambiguation via related searches at Bing.

Related searches can also be used to articulate associated concepts in a taxonomy. At eBay, for example, a query for "acoustic guitar" returns a number of related searches at varying levels of specificity. These include subordinate (child) concepts, such as "yamaha acoustic guitar" and "fender acoustic guitar," along with sibling concepts such as "electric guitar," and superordinate (parent) concepts such as "guitar." These taxonomic signposts offer a subtle form of guidance, helping us understand better the conceptual space in which our query belongs (Figure 5.38).

Figure 5.38. Taxonomic signposting via related searches at eBay.
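
A small sketch of how such taxonomic signposting could be backed by a simple parent/child structure; the guitar taxonomy below is an invented toy example, not eBay's.

# Toy taxonomy: each concept maps to its parent (None for the root).
parents = {
    "guitar": None,
    "acoustic guitar": "guitar",
    "electric guitar": "guitar",
    "yamaha acoustic guitar": "acoustic guitar",
    "fender acoustic guitar": "acoustic guitar",
}

def related_concepts(concept):
    """Return superordinate (parent), subordinate (children), and sibling concepts."""
    parent = parents.get(concept)
    children = [c for c, p in parents.items() if p == concept]
    siblings = [c for c, p in parents.items() if p == parent and c != concept and p is not None]
    return {"parent": parent, "children": children, "siblings": siblings}

print(related_concepts("acoustic guitar"))
# {'parent': 'guitar', 'children': ['yamaha acoustic guitar', 'fender acoustic guitar'],
#  'siblings': ['electric guitar']}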

Although related searches offer us a way to open our minds to new directions, they are not the only source of inspiration. Sometimes it is the results themselves that provide the stimulus. When we find a particularly good match for our information need, we try to find more of the same: a process that Peter Morville refers to as "pearl growing" (Morville, 2010). Google's image search, for example, offers us the opportunity to find images similar to a particular result (Figure 5.39).

Figure 5.39. Find similar images at Google.

For image search, the results certainly appear impressive, with a single click returning a remarkably homogeneous set of results. But that feature is perhaps also its biggest shortcoming: because the details of the similarity calculation are hidden, the user has no control over what is returned and cannot see why certain items are deemed similar when others are not. For this type of search, a faceted approach may be preferable, in which the user has control over exactly which dimensions are considered as part of the similarity calculation (see Chapter 7).

Google shows how we can actively seek similar results, but sometimes we may prefer to have related content presented to us. Recommender systems such as Last.fm and Netflix rely heavily on attributes, ratings, and collaborative filtering data to suggest content we're likely to enjoy. And from just a single item in our music collection, iTunes Genius can recommend many more for us to listen to as part of a playlist (Figure 5.40).

Figure 5.40. Genius playlist creates "more like this" from a single item.
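
As a rough illustration of the "more like this" idea, the sketch below computes item-to-item similarity from co-occurring user ratings; the ratings are invented and stand in for the much richer signals that services such as Last.fm, Netflix, or iTunes Genius actually use.

from math import sqrt

# Invented user -> {item: rating} data.
ratings = {
    "u1": {"song_a": 5, "song_b": 4, "song_c": 1},
    "u2": {"song_a": 4, "song_b": 5},
    "u3": {"song_b": 2, "song_c": 5},
}

def item_vector(item):
    """The item's ratings, keyed by the users who rated it."""
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine(a, b):
    """Cosine similarity of two sparse rating vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    return dot / (sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values())))

def more_like_this(item):
    """Rank the other items by similarity of their user-rating vectors."""
    items = {i for r in ratings.values() for i in r} - {item}
    return sorted(items, key=lambda other: cosine(item_vector(item), item_vector(other)), reverse=True)

print(more_like_this("song_a"))  # 'song_b' should rank above 'song_c'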

URL: https://www.sciencedirect.com/science/article/pii/B9780123969811000057

The future of academic libraries in the digital age

Lili Li, in Trends, Discovery, and People in the Digital Age, 2013

Abstract

Impacted by evolving web search engines and innovative information media, academic libraries are facing more unexpected competition in today's networked information society. Published by the Chronicle of Higher Education in January 2011, Brian T. Sullivan's 'Academic Library Autopsy Report, 2050' triggered another round of debate about the demise of academic libraries. In this chapter, the author analyses the six key factors Sullivan used to declare the death of the academic library. After examining current academic libraries from four different aspects, the author highlights six driving forces that will impact the infrastructure and operations of academic libraries in the future. Utilising the web-based library information technology architecture, he outlines a number of features that future academic libraries may have.

URL: https://www.sciencedirect.com/science/article/pii/B9781843347231500164

Sloppy programming

Greg Little, ... Allen Cypher, in No Code Required, 2010

Introduction

When a user enters a query into a Web search engine, they do not expect it to return a syntax error. Imagine a user searching for "End-User Programing" and getting an error like: "Unexpected token 'Programing'". Not only would they not expect to see such an error, but they would expect the search engine to suggest the proper spelling of "Programing". The burden is on the search engine to make the user right.

People have come to expect this behavior from search engines, but they do not expect this behavior from program compilers or interpreters. When a novice programmer enters "print hello world" into a modern scripting language, and the computer responds with "SyntaxError: invalid syntax", the attitude of many programmers is that the novice did something wrong, rather than that the computer did something wrong. In this case, the novice may have forgotten to put quotes and parentheses around "hello world", depending on the underlying formal language. Programmers do not often think that the computer forgot to search for a syntactically correct expression that most closely resembled "print hello world".

This attitude may make sense when thinking about code that the computer will run without supervision. In these cases, it is important for the programmer to know in advance exactly how each statement will be interpreted.

However, programming is also a way for people to communicate with computers daily, in the form of scripting interfaces to applications. In these cases, commands are typically executed under heavy supervision. Hence, it is less important for the programmer to know in advance precisely how a command will be interpreted, since they will see the results immediately and can make corrections. Unfortunately, scripting interfaces and command prompts typically use formal languages, requiring users to cope with rigid and seemingly arbitrary syntax rules for forming expressions. One canonical example is the semicolon required at the end of each statement in C, but even modern scripting languages like Python and JavaScript have similarly inscrutable syntax requirements.

Not only do scripting interfaces use formal languages, but different applications often use different formal languages: Python, Ruby, JavaScript, sh, csh, Visual Basic, and ActionScript, to name a few. These languages are often similar – many are based on C syntax – but they are different enough to cause problems. For instance, JavaScript assumes variables are global unless explicitly declared to be local with "var", whereas Python assumes variables are local unless declared with "global".

In addition to learning the language, users must also learn the Application Programming Interface (API) for the application they want to script. This can be challenging, since APIs are often quite large, and it can be difficult to isolate the portion of the API relevant to the current task.

We propose that instead of returning a syntax error, an interpreter should act more like a Web search engine. It should first search for a syntactically valid expression over the scripting language and API. Then it should present this result to the user for inspection, or simply execute it, if the user is "feeling lucky." We call this approach sloppy programming, a term coined by Tessa Lau at IBM.
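
A toy sketch of this idea, assuming a small invented command set: instead of rejecting "print hello world", the interpreter scores every known command pattern by keyword overlap and proposes the best match (or runs it if the user is "feeling lucky").

# Invented command patterns standing in for a real scripting API.
COMMANDS = {
    "print(<text>)": {"print", "show", "display"},
    "open_url(<url>)": {"open", "url", "go", "browse"},
    "click(<button>)": {"click", "press", "button"},
}

def interpret_sloppy(user_input):
    """Rank command patterns by how many input words they share, then suggest the best."""
    words = set(user_input.lower().split())
    scored = sorted(((len(words & kw), pattern) for pattern, kw in COMMANDS.items()), reverse=True)
    best_score, best_pattern = scored[0]
    if best_score == 0:
        return "no plausible interpretation found"
    leftover = sorted(words - COMMANDS[best_pattern])  # words treated as arguments
    return f"did you mean: {best_pattern}  (args: {leftover})"

print(interpret_sloppy("print hello world"))
# did you mean: print(<text>)  (args: ['hello', 'world'])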

This chapter continues with related work, followed by a deeper explanation of sloppy programming. We then present several prototype systems which use the sloppy programming paradigm, and discuss what we learned from them. Before concluding, we present a high-level description of some of the algorithms that make sloppy programming possible, along with some of their tradeoffs in different domains.

URL: https://www.sciencedirect.com/science/article/pii/B9780123815415000158

Data Mining Trends and Research Frontiers

Jiawei Han, ... Jian Pei, in Data Mining (Third Edition), 2012

Similarity Search and OLAP in Information Networks

Similarity search is a primitive operation in database and web search engines. A heterogeneous information network consists of multityped, interconnected objects. Examples include bibliographic networks and social media networks, where two objects are considered similar if they are linked in a similar way with multityped objects. In general, object similarity within a network can be determined based on network structures and object properties, and with similarity measures. Moreover, network clusters and hierarchical network structures help organize objects in a network and identify subcommunities, as well as facilitate similarity search. Furthermore, similarity can be defined differently per user. By considering different linkage paths, we can derive various similarity semantics in a network, which is known as path-based similarity.
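
As a small illustration of path-based similarity, the sketch below compares two authors in an invented toy bibliographic network by counting author-paper-author path instances; the symmetric normalization used here is one common choice, not necessarily the measure intended in the text.

# Toy bibliographic network: which authors wrote which papers (invented data).
writes = {
    "alice": {"p1", "p2", "p3"},
    "bob": {"p2", "p3"},
    "carol": {"p3", "p4"},
}

def meta_path_count(a, b):
    """Number of author-paper-author path instances between authors a and b."""
    return len(writes[a] & writes[b])

def path_similarity(a, b):
    """Normalized, symmetric path-based similarity between two authors."""
    return 2 * meta_path_count(a, b) / (meta_path_count(a, a) + meta_path_count(b, b))

print(path_similarity("alice", "bob"))    # 2*2 / (3 + 2) = 0.8
print(path_similarity("alice", "carol"))  # 2*1 / (3 + 2) = 0.4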

By organizing networks based on the notion of similarity and clusters, we can generate multiple hierarchies within a network. Online analytical processing (OLAP) can then be performed. For example, we can drill down or dice information networks based on different levels of abstraction and different angles of view. OLAP operations may generate multiple, interrelated networks. The relationships among such networks may disclose interesting hidden semantics.

URL: https://www.sciencedirect.com/science/article/pii/B9780123814791000137