Full-Text and Search Engines

When you are searching a traditional bibliographic tool, you are not searching the full text of books and journals; you are searching only the summaries (bibliographic records) created by librarians or other specialists. With the invention of computers, it became possible to put whole books and articles into electronic form, where the entire text could be searched. How is this done?

Boolean Searching
It turns out that the basic method for searching and retrieving text out of electronic files was discovered in the 19th century!

George Boole discovered a type of algebra that seemed unimportant at the time, but turned out to solve many of the problems of computer retrieval. Boolean algebra essentially reduces all values to either true or false, and then relates these sets of true or false values in different ways. Let's see an example:

We can search a database of text for the terms Trajan and Portraits. The majority of materials do not have either of these terms and return a False result, but some documents have one of the words, and a smaller number have both words.

From this result, the computer can relate the true sets in various ways, through the operators:
AND, OR and NOT

Trajan AND Portraits
AND lowers the number of records to only those that contain both terms.

Trajan OR Portraits
OR increases the number of records to include those that contain either word.

Trajan NOT Portraits
NOT lowers the number of records by excluding those that have the second word.
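The three operators correspond directly to operations on sets of records. The following is a minimal sketch in Python; the four "documents" and the helper function matching are invented for illustration and do not come from any real database.

```python
# Hypothetical mini-database: document number -> text (invented for illustration).
documents = {
    1: "portraits of the emperor trajan in rome",
    2: "trajan and the dacian wars",
    3: "roman portraits of the republic",
    4: "gardens of pompeii",
}

def matching(term):
    """Return the set of document numbers whose text contains the term."""
    return {doc_id for doc_id, text in documents.items() if term in text.split()}

trajan = matching("trajan")        # documents 1 and 2
portraits = matching("portraits")  # documents 1 and 3

print(trajan & portraits)  # AND: only documents with both terms
print(trajan | portraits)  # OR: documents with either term
print(trajan - portraits)  # NOT: documents with trajan but without portraits
```

Document 4 contains neither term, so it is "False" for both and never appears in any of the three results.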

In this way, any number of TRUE sets can be combined, e.g. trajan and portraits and rome. Since this is based on mathematics, it can get rather tricky if you mix the operators together, for example, if you want to add "rome or roman" to the search:
trajan and portraits and rome or roman
will give you different results depending on how they are grouped: do you want:
trajan and portraits and (rome or roman)
which limits the previous search to all records with either of the words Rome or Roman

or
(trajan and portraits and rome) or roman
which opens the search to include anything with the word Roman in it
As we see, the results are quite different. The situation is similar to 1 + 2 x 3: there may be two answers, depending on how the terms are grouped.
(1 + 2) x 3 = 3 x 3 = 9
1 + (2 x 3) = 1 + 6 = 7
If you want to do these kinds of searches, you should either do separate searches, e.g. "trajan AND portraits AND rome" and then "trajan AND portraits AND roman," or ask for help.
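The effect of grouping can be tried out with sets, in the same way as the arithmetic above. This is a small sketch in Python; the record numbers in each set are invented for illustration.

```python
# Hypothetical sets of record numbers that contain each word.
trajan = {1, 2, 5}
portraits = {1, 3, 5}
rome = {1, 4}
roman = {3, 5, 6}

# trajan and portraits and (rome or roman)
narrow = trajan & portraits & (rome | roman)

# (trajan and portraits and rome) or roman
broad = (trajan & portraits & rome) | roman

print(sorted(narrow))  # records with trajan, portraits, and rome or roman
print(sorted(broad))   # the first group, plus anything with roman
```

Note that each system decides for itself how to group an ungrouped search. Python happens to evaluate & before |, so writing the line without parentheses gives the broad result; a library database may behave differently, which is why explicit grouping or separate searches are safer.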

Some refinements have been added to these operators, such as exact phrase searching, e.g. "white house" does not retrieve "the house is white," and proximity searching, e.g. "dante NEAR3 comedy" would limit the search to records where the words dante and comedy occur within three words of one another.
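Phrase and proximity searching both depend on the positions of words, not just on their presence. The sketch below is in Python; the function names are hypothetical, and "within three words" is assumed here to mean that the word positions differ by at most three, which real systems may define differently.

```python
def contains_phrase(text, phrase):
    """True if the words of the phrase occur together, in order, in the text."""
    words = text.lower().split()
    target = phrase.lower().split()
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )

def near(text, first, second, distance):
    """True if the two words occur within `distance` positions of each other."""
    words = text.lower().split()
    first_positions = [i for i, word in enumerate(words) if word == first]
    second_positions = [i for i, word in enumerate(words) if word == second]
    return any(
        abs(i - j) <= distance
        for i in first_positions
        for j in second_positions
    )

print(contains_phrase("the white house press office", "white house"))
print(contains_phrase("the house is white", "white house"))
print(near("the comedy of dante alighieri", "dante", "comedy", 3))
print(near("dante was born in florence long before the comedy", "dante", "comedy", 3))
```

The second and fourth searches fail even though both words are present, which is exactly the narrowing effect these operators are meant to have.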

There are various types of truncation, that is, a way of searching for several forms of a word at once (for example, fascis* retrieves fascist, fascists, fascism, etc.), and fuzzy searching, i.e. inexact searches that retrieve information that comes close to the desired search (this allows for spelling errors).
  • Not all computer systems offer all of these searches.
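Truncation and fuzzy searching can be imitated with Python's standard library, as a rough sketch. The vocabulary list and the function names are invented for illustration, and the similarity measure used by difflib is only one of many possible ways to define "close"; real search systems use their own methods.

```python
import difflib
import fnmatch

# Hypothetical index vocabulary, invented for illustration.
vocabulary = ["fascism", "fascist", "fascists", "fashion", "portraits", "trajan"]

def truncation_search(pattern, words):
    """Return the words that match a truncated pattern such as 'fascis*'."""
    return [word for word in words if fnmatch.fnmatchcase(word, pattern)]

def fuzzy_search(term, words, cutoff=0.8):
    """Return up to three words that are close to the term, best match first."""
    return difflib.get_close_matches(term, words, n=3, cutoff=cutoff)

print(truncation_search("fascis*", vocabulary))  # fashion is not retrieved
print(fuzzy_search("portriats", vocabulary))     # a misspelling of portraits
```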

Arrangement of Boolean Results
Something rather surprising happened when computer scientists began to use Boolean algebra. Search and retrieval through the Boolean operators turned out to be quite simple for a computer to do, but the results were unsatisfactory: although a full-text search could be done very quickly, it would routinely retrieve thousands of results, which were more of a hindrance than a help to the searcher. The real problem, then, was to make the results useful, and so efforts turned to how best to arrange the search results.

There have been many attempts to do this and many failures. Lately, there have been some successes with modern search engines, such as Yahoo and Google. What are the differences between the results from a search engine and the results from traditional bibliographic tools?

As we saw in the previous section, the purpose of a traditional library catalog is to enable people to find everything in a collection in certain ways. Therefore, in a catalog, search results allow people to find materials by their authors, titles, and subjects. This method allows for concept searching by using special forms found through authority files.

The modern search engines have completely different goals from traditional library tools. One of the main differences is that there is no concept searching available: search engines can only search text. Therefore, you cannot search the concept World War I; you can only search the words that may be about World War I, so you search "World War I," "wwi," "ww1," "World War One," "First World War," "1st World War," and so on.
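In practice, searching text for a concept means combining every variant you can think of with OR. The following Python sketch does this for the variants listed above; the four titles are invented for illustration.

```python
import re

# The variant forms listed above, combined into one OR search.
variants = [
    "world war i", "wwi", "ww1",
    "world war one", "first world war", "1st world war",
]
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(variant) for variant in variants) + r")\b"
)

# Hypothetical titles, invented for illustration.
titles = [
    "A history of the First World War",
    "WWI trench poetry",
    "The Great War and modern memory",
    "World War II in the Pacific",
]

found = [title for title in titles if pattern.search(title.lower())]
print(found)
```

The third title is about the same concept but uses none of the listed words, so a text search misses it. Each new variant has to be thought of and added by the searcher.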

How does a search engine work? What is the purpose of a search engine? How is this purpose different from a traditional bibliographic tool?

To discuss this, we will focus on two of the most popular search engines: Google and Yahoo.

Google adds information to its database using web crawlers that automatically go out, scan the web, and bring back links and text for people to search. As everyone knows, searching Google (using Boolean operators) can be very fast, and the results can make the users very happy. But a search routinely returns 100,000 or more results, and not too many people will look at all 100,000. How are these results arranged?

Google uses a method called "Relevance Ranking," which is determined by a mathematical algorithm that Google keeps secret. Why is it secret?

It turns out that for many businesses, selling their wares on the web is a matter of life and death. Many institutions also want users to find the information that they have gone to great expense to place on the web. In any case, research has shown that few people look beyond the "top 10" hits, and even fewer people go beyond the first page of results. Therefore, there is tremendous pressure to get into the top 10 results of a search.

If Google were to reveal precisely how their algorithm worked, people would be able to manipulate the results to get to be #1, regardless of how "reliable" the search result turned out to be for a user. This has happened, and we will take a look at an example later.

Google's relevance ranking is explained in the following way:
"PageRank Explained: PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at considerably more than the sheer volume of votes, or links a page receives; for example, it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important." Using these and other factors, Google provides its views on pages' relative importance."

In essence, Google uses the power of "citations" to help arrange the results. A page that has more links to it, i.e. is cited more often, appears higher in the results.
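The idea of links as weighted votes can be sketched with the simplified, publicly described form of PageRank. This is not Google's actual ranking algorithm, which is secret and uses many other factors; the damping value of 0.85, the function name, and the seven-page "web" below are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: each page shares its score among the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base score.
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its vote equally among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no links spreads its vote over all pages.
                share = damping * rank[page] / len(pages)
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical web of seven pages: page -> pages it links to.
links = {
    "A": ["C"],
    "B": ["C"],
    "D": ["C"],
    "C": ["E", "A"],
    "F": ["G"],
}

rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda item: item[1], reverse=True):
    print(page, round(score, 3))
```

Page C, with three incoming links, comes out on top. Pages E and G each have only one incoming link, but E's link comes from the "important" page C, so E ranks above G.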

Yahoo also has what Google has: an area that is automatically created by web crawlers, with items arranged according to its own algorithm. But Yahoo is unique in that it also has a Directory that is made by human editors. People submit a web site that they would like to be included in the Directory. When they do this, they can suggest categories, which the Yahoo editors may change. Yahoo editors may also decide not to add a site to the Directory.

Users can search Yahoo similarly to searching Google, or they can opt to browse and search the Directory.

For more information, see the Yahoo Help pages for Suggest a site.

Money
Money is an unavoidable topic whenever someone discusses the internet. Google and Yahoo make a lot of money. How do they do that? A lot of it has to do with making it easier for users to find specific materials in their databases.

Yahoo has two programs: the first is Directory Submit, which allows people to pay a fee (currently $299 each year) to have their sites included more quickly. There is also Search Submit of various types, in which people can pay a fee and/or a "pay per click" fee.

Google has a similar program, called AdWords, in which people can pay to make their sites more visible. When people search a word that is also used as an AdWord, additional links appear.

There is also a program called AdSense, where a person can be paid for letting Google advertisements appear automatically on their own websites. For an example, see ApartmentRatings.com.

Can the Search Results be Manipulated?

It should be very clear by now that many people would like to manipulate the search results. People are doing this, and this is how it is done. Since we have seen that the algorithms work by citations, it only makes sense that if someone put up enough webpages that link to a specific webpage, then the latter webpage would come up much higher in the results. This has happened many times, and can lead to some very strange results.

Political examples of this are known as "Google-bombing," and the most famous example is the result for "miserable failure." (This link searches Yahoo). The result is very strange: the #1 link goes to the official White House page of President George W. Bush. Yet, if you search the page at the White House, you will not find either word in the page. Why does this happen?

As we saw above, the Google search engine gathers millions and millions of web pages on the World Wide Web, and counts each link as a "vote" for a specific page. In doing so, it uses the text that links to a page as a description of that page. If many pages use the same text to describe a single page, the search engine begins to add everything up.

If, for example, five pages all link to the same page using the text "miserable failure," the search engine treats that text as a description of the page they link to. Therefore, the text "miserable failure" equals the page.
It doesn't matter what other pages say.
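The mechanism can be sketched in a few lines of Python. This is only an illustration of counting link text as described above, not any search engine's real code; the page addresses, page texts, and function name are all invented.

```python
from collections import Counter, defaultdict

# Hypothetical pages and their own text (invented for illustration).
page_text = {
    "biography.example/president": "official biography of the president",
    "films.example/director": "documentary films and books",
}

# Hypothetical links: (linking page, text of the link, page linked to).
links = [
    ("blog1.example", "miserable failure", "biography.example/president"),
    ("blog2.example", "miserable failure", "biography.example/president"),
    ("blog3.example", "miserable failure", "biography.example/president"),
    ("blog4.example", "miserable failure", "biography.example/president"),
    ("blog5.example", "miserable failure", "biography.example/president"),
    ("blog6.example", "miserable failure", "films.example/director"),
]

# Add up the "votes": link text -> how often it points to each page.
votes = defaultdict(Counter)
for source, link_text, target in links:
    votes[link_text.lower()][target] += 1

def search_link_text(phrase):
    """Return the pages described by this link text, most votes first."""
    return [page for page, _ in votes.get(phrase.lower(), Counter()).most_common()]

results = search_link_text("miserable failure")
print(results)
print("miserable" in page_text[results[0]])  # the page itself never uses the word
```

Five links outvote one, so the biography page comes first even though neither word appears anywhere in its own text.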

If we return to the miserable failure result in Yahoo and scroll down the page, we will see links to other pages as well. As of this writing, #5 is President Jimmy Carter, #10 is Michael Moore and #17 is Senator Hillary Clinton. Obviously, we are seeing a battle taking place in the search engines among backers of different political opponents to see who can be the #1 "miserable failure."

(Google's search result was recently changed. For more information, see Google bomb in Wikipedia)

This may be funny or appalling depending on your own point of view, but it should make us extremely suspicious about the results in a search engine. If people can do this in such an obvious way, how are people and institutions manipulating other results in more subtle ways? Many people and businesses are spending a lot of money and intellectual labor on this effort. Anyway, what does all of this have to do with serious research?

Goals of Search Engines vs. Traditional Bibliographic Tools
The goal of the search engines is that of any business: to make money. There is nothing wrong with that, but it must be understood that this is different from the goals of the traditional bibliographic tools, which are to be as objective as possible, and whose creators must follow a code of professional ethics. Such a code of ethics does not even come into play in the search engine scenario described above. (See: What is a Library?)

Here are a few other examples of the new information environment and what the internet companies will do. Again, this is not to find fault, but to demonstrate how the goals and actions of the internet companies are different from those of the traditional information professionals.

The following quote is taken from the article China's model for a censored Internet by Kathleen McLaughlin (Christian Science Monitor, Sept. 22, 2005):
"Part of the Chinese success [i.e. of censoring the Internet] has been co-opting American tech companies with the lure of its lucrative consumer market. Microsoft blocks bloggers from posting politically sensitive words in Chinese; Google shuts down for several minutes when a user in China looks too many times for forbidden words like 'Falun Gong;' and Yahoo recently admitted turning over private e-mail information that helped lead to the jailing of a Chinese journalist. [our emphasis]

'I do not like the outcome,' Yahoo chief Jerry Yang said of the incident. But it's a decision he said he had to make when he decided to do business here."

A report by Privacy International: A Race to the Bottom - Privacy Ranking of Internet Service Companies claims that Google is the worst search engine at protecting privacy.

Another example is "A False Wikipedia Biography" by John Seigenthaler (USA Today, Nov. 29, 2005), in which the author describes how he was defamed by an unknown person on Wikipedia. In this article, the author said that there was no way to find the person who wrote it.

But, just a few days later, the person was discovered. See: A Little Sleuthing Unmasks Writer of Wikipedia Prank by Katharine Seelye (New York Times, Dec. 11, 2005).

In any case, different people and organizations change Wikipedia entries to suit their own needs. There is an interesting Wired article on this, and among other information, they provide an example of an e-voting machine vendor who deleted 15 long paragraphs detailing the problems of using electronic voting machines. There is now a tool to track these changes, called WikiScanner.

Something else happened that should make us skeptical of what we read in Wikipedia: one of its creators, Larry Sanger, said that Wikipedia is "broken beyond repair" and left Wikipedia. Here is the story from the Times Online. Finally, there is the article Wikipedia and the Meaning of Truth (Technology Review, Nov./Dec. 2008), which brings up the concept of Wikitruth and discusses Wikipedia's core content policies of verifiability, no original research, and neutral point of view, which have certain interesting consequences.

Censorship
Search results can also be censored: Google has allowed its searches to be censored in China. For an excellent and detailed discussion of this, see the roundtable discussion The Struggle to Control Information, and more specifically, You Can't Get There From Here, which are additional webpages for PBS Frontline's documentary The Tank Man. (watch online)

There is another term, Googlewashing, which refers to a word being "hijacked" and changed by people on the web. [If you are interested in these phenomena, for two sides of the issue, see: Anti-war slogan coined, repurposed and Googlewashed … in 42 days by Andrew Orlowski (The Register, April 3, 2003), and Stop Worrying and Learn to Love the Google-Bomb by Séamus Byrne (Fibreculture, issue 3, 2004).]

This is not to say that search engines are bad and should not be used. Use them, but use them wisely!


To summarize, remember:

INFORMATION IS NOT A VOTE!

If 20 webpages have Information A and 1 webpage has Information B, it does not follow that Information A is correct!

The single webpage could be correct and everyone else is just copying the wrong information!

Example:

Nicolai Lenin, the founder of the Soviet Union

Nicolai Lenin: there are many articles, textbooks, and webpages that exist about this man.

HE DIDN'T EXIST

His name was Vladimir Ilich Lenin. He never used the name Nicolai Lenin, but the name has been around for a long time in the U.S. Everybody copied the wrong information. On the web, if you search "Nicolai Lenin," you may never learn this, but if you search the LC Authority File...

you discover the truth.

Should you trust a page that has a biography of Nicolai Lenin?




Latest page update: made by j.weinheimer, Nov 25 2008, 8:37 AM EST