For the last several years,
Google has maintained the largest index of pages on the World Wide Web
(see SearchEngineWatch.com).
Today, in the latter half of 2003, it is estimated that Google's database
indexes over 3.3 billion web pages.
This document is about searching
Google. As you will soon find out, Google has its own
system of search features and devices. You must know them in order
to make effective use of Google for your Internet searching
needs.
Search Engines vs. Web Directories
First, understand that Google
is a search engine. In other words, it uses a "spider"
to "crawl" the World Wide Web, looking for pages (and other documents,
like Word files and PDF files) that are new or that have changed since the
last time it visited a particular URL. It adds that new data to its
already huge database, which indexes the World Wide Web.
It is not, in and of itself, a
web directory, as the very popular Yahoo! site was when it started in the
1990s.
Web directories are much, much smaller datasets: anywhere from a few
thousand web sites up to a couple of million in a very large
web directory like the Open Directory
Project (see below), as opposed to the over 3 billion web pages
referred to in a search engine like Google. Web directories,
however, are very useful finding tools because they are filled with
web sites that have been classified by human beings who are experts in
some area of information. Search engine indexes are built by programs
that scan the publicly accessible World Wide Web for new or updated
pages; web directories are organized, carefully chosen lists of the
better web sites on a wide range of topics.
Indeed, web directories are so
important that many search engines offer web directory services
at their sites alongside their all-web search engines.
Google, for example, has a separate tab or button labeled
"Directory" which lays out 16 general categories under which, altogether,
over 3 million web sites are organized and pointed to. Google
didn't create this information; Google licensed the right to list
the web directory services of the Open
Directory Project (ODP):
The ODP is also known as
DMOZ, an acronym for Directory Mozilla. This name reflects its
loose association with Netscape's Mozilla project, an Open Source browser
initiative. The ODP was developed in the spirit of Open Source,
where development and maintenance are done by net-citizens, and
results are made freely available for all
net-citizens.
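The "spider" behaviour described at the start of this section can be pictured with a small sketch. This is only an illustration, not Google's actual crawler: the TOY_WEB dictionary, the example URLs, and the crawl function are all hypothetical, and a real spider would fetch pages over the network rather than from memory. The sketch follows links from page to page and records a page in its index only when the page is new or its content has changed since the last visit.

```python
import hashlib

# Hypothetical in-memory stand-in for the Web: URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("Welcome to A", ["http://b.example/"]),
    "http://b.example/": ("All about B", ["http://a.example/", "http://c.example/"]),
    "http://c.example/": ("C is for cat", []),
}


def crawl(start_url, web, index):
    """Follow links from start_url and record new or changed pages in index.

    index maps each URL to a fingerprint of the text last seen there.
    Returns the list of URLs that were added or updated on this pass.
    """
    updated = []
    queue = [start_url]
    seen = {start_url}
    while queue:
        url = queue.pop(0)
        if url not in web:              # dead link: nothing to index
            continue
        text, links = web[url]
        fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index.get(url) != fingerprint:   # new page, or changed since last visit
            index[url] = fingerprint
            updated.append(url)
        for link in links:
            if link not in seen:        # visit each page only once per pass
                seen.add(link)
                queue.append(link)
    return updated
```

Running crawl twice over an unchanged web adds nothing the second time; only pages whose text has changed are picked up again, which is the "new or updated pages" behaviour the text describes.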
Basic Search Mechanisms
The default page brought up at
Google's main address, http://google.com/, shown above, presents the
input box for searching the World Wide Web.
There are three main types of web
searches that Google facilitates:
-
searching the Web itself--the
first tab (and the default condition)
-
browsing through a Directory of
web sites made available by Open Directory Project
(ODP)
-
an advanced search
page (the Advanced Search text link)
You should also notice that
Google has five tabs above its search box: Web, Images, Groups,
Directory, and News. One of
them, Directory (circled item #2), has already been mentioned; it is the
Google presentation of the Open Directory Project. The
other tabs are just as easy to use: Images lets the user locate
images that appear on web pages on the Internet; Groups lets the user
search through the millions of messages posted to newsgroups since the
early 1980s; News lets users get current news in a variety of areas
and search for past news stories.
The area to the right of the search
box contains three text links:
Advanced Search
Preferences
Language Tools
We will return to Preferences and
Language Tools later; Advanced Search will be taken up below.
Basic Indexing Features
-
Case Doesn't Matter. Capitalization of search
words has no effect on the results one gets from
Google: all words are stored in Google's index as
lower case, so a search using any combination of upper and
lower case letters obtains the same results:
Homer
HOMER
homer
-
Stop Words. There are a small number of very common words
that Google does not use for indexing purposes. Called
"stop words" or "delete words," these thirty or so words occur
frequently in English text but are, for that very reason, of
little retrieval value: they are common language "placeholders" that
don't allow us to discriminate effectively for search purposes between
pages that have them and pages that don't. It makes
little sense to say you wish to search for only those web pages that
contain the word "in," for example.
If a glossary of all the words
appearing in all the WWW's pages were produced, most of the occurrences
would be of words like "of," "for," "by," "with," "to," etc. Since we
have evidence that these words are of almost no search significance, the
producers of search engine indexes like Google's save extraordinary
amounts of database space by not indexing the occurrences
of these almost insignificant words. There are, however, sometimes
special circumstances under which a user might wish to search
on these words, and you should know that special procedures
are available to force Google to search for the
occurrence of a word such as "the" in a search result ("the Who,"
for example).
Google stop words
a | at | in | that | when
about | be | is | the | where
an | by | it | this | which
and | for | of | to | who
are | from | on | was | will
as | I | or | what | with
-
To Force a Stop Word
to be Searched by Google. Although stop words are
normally not searched for by Google, they can be forced into a
Google search specification in one of two ways.
First, one
may place a plus sign (+) directly in front of the word that
Google would usually leave out of a search: the plus means that the
following word must be found in the search results.
already paid +for
Second, one may simply include the
usually excluded word in a bound phrase (enclosed in quotes):
"already paid for"
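The indexing rules above (case folding, stop words, and the two ways of forcing a stop word) can be summarized in a short sketch. It is a toy model written for this tutorial, not Google's implementation: the effective_terms function and the TOKEN pattern are hypothetical names, the stop word set is simply the table above, and the logical operators discussed in the next section are not handled here.

```python
import re

# The stop words from this tutorial's table, stored in lower case.
STOP_WORDS = {
    "a", "about", "an", "and", "are", "as", "at", "be", "by", "for",
    "from", "i", "in", "is", "it", "of", "on", "or", "that", "the",
    "this", "to", "was", "what", "when", "where", "which", "who",
    "will", "with",
}

# A token is either a quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"[^"]*"|\S+')


def effective_terms(query):
    """Return the terms a toy engine would actually search for."""
    terms = []
    for token in TOKEN.findall(query.lower()):    # case doesn't matter
        if token.startswith('"'):
            terms.append(token)                   # quoted phrase: kept whole
        elif token.startswith("+") and len(token) > 1:
            terms.append(token[1:])               # plus sign forces the word
        elif token not in STOP_WORDS:
            terms.append(token)                   # ordinary search word
        # anything else is an unforced stop word and is dropped
    return terms
```

With this sketch, "already paid for" reduces to the two words "already" and "paid", while "already paid +for" and the quoted phrase both keep the word "for".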
Basic Search Features
-
Default logical AND operation. When the user puts more than one word
into the search box, the engine assumes that the user wishes the
words to be ANDed together. In other words, in the absence of any
other specification, Google applies the logical AND operator.
Please also note that the user must always type the logical
operators in capital letters.
cats dogs
. . . will retrieve the same results as . . .
cats AND dogs
-
Logical OR condition. When the user seeks to expand search results by
giving Google several alternative words, such as a list of
synonyms or near-synonyms, the user should place the OR logical
operator between each of the words. Unlike the logical AND, the OR
operator is never implied. One must indicate the logical OR by
placing it, in capital letters, between two words:
cats OR dogs
-
Logical
negation. The last logical operation--negation--is
performed in Google by placing a minus sign directly in front of
the word (no space between the word and the minus sign!). Therefore, were we searching for web pages that were
about cats, but not about dogs, we would use this formulation:
cats -dogs
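-
Phrase searching. Enclosing several words in quotation marks binds
them into a phrase, so that only pages containing those words together,
in that order, are returned. As shown above in the discussion
of stop words, this phrase-binding process can include stop words that
Google normally leaves out of a search:
"gone with the wind"
"vitamin a"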
This phrase searching technique is
particularly useful for finding odd or uncommon phraseology in
pieces of well-known text. Were you searching for a copy of
Lincoln's Gettysburg Address, for example, you could go ahead and
specify . . .
"four score and seven years ago our fathers"
Punctuation, by the way, is
ignored by Google, so the same phrase with its comma included will
return the same result:
"four score and seven years ago, our fathers"
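To see how the default AND, the OR operator, negation, and phrase searching fit together, here is a minimal sketch that evaluates such queries against a handful of made-up pages. Everything in it is an assumption made for illustration: the PAGES dictionary, the helper names, and the matching logic are hypothetical and say nothing about how Google works internally. For simplicity the sketch does not drop stop words.

```python
import re

# Hypothetical sample pages standing in for a search engine's index.
PAGES = {
    "page1": "Cats and dogs living together",
    "page2": "All about cats",
    "page3": "Dogs need daily walks",
    "page4": "Four score and seven years ago, our fathers brought forth",
}


def words(text):
    """Lower-case the text and drop punctuation, leaving a list of words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains(page_words, term):
    """True if the word (or quoted phrase) occurs as a consecutive run."""
    needle = words(term)
    size = len(needle)
    return any(page_words[i:i + size] == needle
               for i in range(len(page_words) - size + 1))


def search(query, pages=PAGES):
    """Return the names of the pages that satisfy the query."""
    required = []    # list of OR-groups; every group must match (AND)
    excluded = []    # terms that must not appear (negation)
    join_with_previous = False
    for token in re.findall(r'-?"[^"]*"|\S+', query):
        if token == "AND":           # explicit AND is the same as the default
            continue
        if token == "OR":            # next term is an alternative to the last
            join_with_previous = True
            continue
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        elif join_with_previous and required:
            required[-1].append(token)
        else:
            required.append([token])
        join_with_previous = False
    hits = []
    for name, text in pages.items():
        page_words = words(text)
        wanted = all(any(contains(page_words, t) for t in group)
                     for group in required)
        unwanted = any(contains(page_words, t) for t in excluded)
        if wanted and not unwanted:
            hits.append(name)
    return hits
```

In this sketch, search("cats dogs") and search("cats AND dogs") both return only page1, search("cats OR dogs") returns the first three pages, search("cats -dogs") returns only page2, and the Gettysburg phrase finds page4 with or without the comma.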
Advanced Search Mechanisms
Advanced Search Page
-
Google gives users a menu-driven
means of using some of its advanced features; search words can also be
qualified directly in the Basic Search box. We will first deal with
Google's menu page of advanced search features.
By
clicking on the "Advanced Search" text link on the right side of the
main page's search box (Circle 3 in the illustration below)
. . .one gets this Advanced
Search page:
-
Find results area: The four boxes in this area correspond to the
basic search operations:
with all of the words: comparable to the AND search
with the exact phrase: comparable to phrase searching
with at least one of the words: comparable to the OR search
without the words: comparable to negation of a term
-
Language area: The row just under the Find results area lets you
specify the language of the web page content. By default, it
is set to "any language," but it can be set to any one of over 30
specific languages.
-
File Format area: By default, Google searches for your
specification in files of any format,
including, of course, HTML. However, here are the file
formats that you can specifically ask Google to search
for:
any format
Adobe PDF (pdf)
Adobe Postscript (ps)
Microsoft Word (doc)
Microsoft Excel (xls)
Microsoft Powerpoint (ppt)
Rich Text Format (rtf)
A number of other file formats can be
searched on if one uses the term qualification technique, which
will be presented later on in this tutorial.
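-
Date area: Here you can limit your results to web pages updated
within a given period:
anytime (default)
past 3 months
past 6 months
past year
-
Occurrences area: Here you can specify where on the page your
search terms must occur:
anywhere in the page
in the title of the page
in the text of the page
in the URL of the page
in links to the page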
-
Domain area: Here you can specify that your
results must come from, or must not come from, a
particular domain (com, edu, gov, mil, net, . . .).
-
Safesearch area: This area tells Google to
exclude from your search results sites
that contain pornography and explicit sexual content. Through a
Safesearch Preferences page (reached from the Preferences link to the
right of the main page's search box), you may set the strictness of
Google's exclusionary criteria to one of three levels:
exclusions turned off
moderate strictness
very strict
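Because each of the four Find results boxes is comparable to one basic search operation, a filled-in form can be rewritten as a single Basic Search line. The sketch below shows one way to do that. The build_query function and its parameter names are hypothetical, chosen to echo the box labels, and the exact query that Google itself produces from the form may differ.

```python
def build_query(all_words="", exact_phrase="", any_words="", without_words=""):
    """Combine the four Find results boxes into one basic search string."""
    parts = []
    parts.extend(all_words.split())                        # implied AND
    if exact_phrase.strip():
        parts.append('"' + exact_phrase.strip() + '"')     # phrase searching
    alternatives = any_words.split()
    if alternatives:
        parts.append(" OR ".join(alternatives))            # logical OR
    parts.extend("-" + word for word in without_words.split())  # negation
    return " ".join(parts)
```

For example, build_query(all_words="pet care", any_words="cats dogs", without_words="birds") produces the line pet care cats OR dogs -birds, which could be typed directly into the Basic Search box.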
Term Qualification in Basic Search
The original usability
philosophy of Google was "make Google easy to search, and
give the user advanced options in a menu-driven format." So, instead of
requiring users to know how to use the logical operators,
Google's leadership team decided to offer a default AND logical
operator, and to let users combine terms in other ways through a
menu of specific alternatives (the Advanced Search menu page). In
the background, however, Google had a list of term qualification
devices that could do the same thing (and sometimes more). That
is what this section is about: search syntax that is not actively
promoted to Google's user community.
A frustrating characteristic of Google for proficient,
professional searchers is the existence of these
"undocumented" search techniques, which Google only slowly and
cautiously brings to the attention of its users for retrieval purposes.
But then again, Google's management team probably doesn't want to complicate
the simple searching done by the vast majority of its users (the "80/20
rule" applies: 80% of all Google searches are performed using only 20%
of its search features).
However, there are a number of command-line term qualifiers that work in
Google's search syntax (and are being written about by authors of
recent trade books). These qualifiers are described in this
section.
Please note that the form of
all Google qualifiers is
qualifier_word:search_words_or_phrase
as in
intitle:admissions
(For further examples of the
special, advanced term qualification features discussed below, see
Google's own explanations on its
Advanced Search
Operators page.)
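The qualifier_word:search_words_or_phrase form is easy to generate mechanically. The following sketch is offered only as an illustration: the qualify and search_url helpers are hypothetical, and the "http://www.google.com/search?q=" address pattern is an assumption that does not come from this tutorial. It formats a qualified term, adds quotes when the value is a phrase, and encodes the result so it can be placed in a web address.

```python
from urllib.parse import quote_plus


def qualify(qualifier, value):
    """Format one search term as qualifier:value, quoting multi-word values."""
    value = value.strip()
    if " " in value and not value.startswith('"'):
        value = '"' + value + '"'       # bind a multi-word value into a phrase
    return qualifier + ":" + value


def search_url(query):
    """Build a search address for the query (URL pattern is an assumption)."""
    return "http://www.google.com/search?q=" + quote_plus(query)
```

Here qualify("intitle", "admissions") gives intitle:admissions, and qualify("intitle", "help page") gives intitle:"help page". Note that there is no space on either side of the colon.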
-
intitle
qualifier: Use this search qualifier to restrict the search to
the words in a web page's title field.
(Before using this qualifier, please remember that the contents of a web
page's title field probably do not appear on the web page itself. The
title field is part of a page's standard HTML coding, but it is
not necessarily also displayed in the actual contents of the page that
are viewable on the web.)
intitle:"help page"
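-
inurl
qualifier: Use this qualifier to restrict the search to words that
occur in a page's URL.
inurl:ibm
-
site
qualifier: Use this qualifier to restrict the search to pages from a
particular site or domain.
"customer service" site:www.ibm.com
Another example of the site search is searching a
university's domain (say, www.ou.edu) for something about the Sociology
Department:
site:www.ou.edu "sociology department"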
info:www.ibm.com
related:www.nbc.com
Indeed, the related qualifier produces the same results as clicking on
the "Similar pages" link on the last line of an entry in a Google search
response set:
What you would expect to get in response
to this query is other "entities" (in this case, broadcast companies) in
the same category. On the next-to-last line of the
entry shown above, the breakdown of the category is Television >
Networks > NBC. NBC is a specific entity within the category
Networks: you would therefore expect the
related qualifier (or the "Similar pages" link at the end of
a search response entry) to return other specific entities within the
Networks category.
-
link
qualifier: Interestingly, this qualifier returns a list of web
pages that link to the web site you specify. In other words, if
you would like to see who is linking to a website, use this qualifier:
link:www.ou.edu
cache:www.cnn.com
Like "Similar pages," this information is
also available from a link on the last line of a response entry (see
illustration above)
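Going the other way, a search line can be taken apart into its qualified and unqualified pieces. The sketch below is a rough illustration only: the split_query function and the QUALIFIERS set are hypothetical, the set contains just the qualifier words that appear in this section's examples, and Google's own parsing rules are not known to this tutorial.

```python
import re

# Qualifier words that appear in this section's examples.
QUALIFIERS = {"intitle", "inurl", "site", "info", "related", "link", "cache"}

# An optional "word:" prefix followed by a quoted phrase or a single word.
TOKEN = re.compile(r'(?:(\w+):)?("[^"]*"|\S+)')


def split_query(query):
    """Split a search line into (qualifier, value) pairs and plain terms."""
    qualified, plain = [], []
    for match in TOKEN.finditer(query):
        name, value = match.group(1), match.group(2)
        if name in QUALIFIERS:
            qualified.append((name, value.strip('"')))
        else:
            plain.append(match.group(0))    # not a known qualifier: keep as typed
    return qualified, plain
```

For the example "customer service" site:www.ibm.com, the sketch reports one qualified term (site, with the value www.ibm.com) and one plain term (the quoted phrase).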
[stopped here, Sept 24,
2003]
Google's Results Layout
How to Interpret Your Search Result
Other Features in the Background
-
Phone numbers
-
Addresses
-
Street maps
-
Stock quotes
-
Translations
-
Dictionary definitions
-
Synonyms
-
Calculator
Still Farther in the Background: the Labs
http://labs.google.com/