Britannica (Web), Cont. . . . Chorus « Electronic Research

Natural language searching

Searching is possible using Boolean operators and natural language. Natural language searching may bring an overwhelming set of results, as it retrieves every article that includes, anywhere, any of the terms entered by the user. The query death rites in Asian cultures yields 19,873 documents (Figure 11). That is a tad excessive; of course, not even this encyclopedia can have that many articles about any one topic. The search engine picks every article and article section that includes the words death OR rite OR rites OR Asian OR culture OR cultures. The word in is not searched because it is a stopword, along with many other terms (definite and indefinite articles, pronouns, and prepositions) such as where, when, who, as, it, is, was, the, on, from, etc. The software automatically stems certain terms such as regular plurals (although not other suffixed versions, such as Asia from Asian), which further broadens the search (Figure 12).
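
The query handling described above can be approximated in a few lines of Python. This is only a sketch under assumptions: the stopword set is the handful of examples listed above, the plural rule is a guess that happens to reproduce the expansion shown in Figure 12, and the function names are hypothetical rather than anything from Britannica's software.

```python
# Hypothetical sketch: Britannica's real stopword list and stemmer are not public.
STOPWORDS = {"in", "where", "when", "who", "as", "it", "is", "was", "the", "on", "from"}


def expand_plural(term):
    """Return the singular form alongside a regular plural; leave other words alone."""
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return [term[:-1], term]
    return [term]


def natural_language_to_or(query):
    """Drop stopwords, expand regular plurals, and join everything with OR."""
    terms = []
    for word in query.split():
        if word.lower() in STOPWORDS:
            continue
        for variant in expand_plural(word):
            if variant not in terms:
                terms.append(variant)
    return " OR ".join(terms)


print(natural_language_to_or("death rites in Asian cultures"))
# prints: death OR rite OR rites OR Asian OR culture OR cultures
```

Every term joined by OR widens the net, which is why the result count balloons so quickly.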

Figure 11: Response to a complex query using implicit OR operation yielding mostly irrelevant results

Figure 12: A behind-the-scenes look: The stemming of query words

Relevance ranking

Getting so many results is like drinking from a fire hose, but the results are presented in relevance rank order, and the software does a very good job in this process. It recognizes phrasal expressions and gives them top priority, especially if they occur in article titles or subheadings. The proximity and frequency of the query terms in the documents are also evaluated for the ranking. The relative infrequency of one or more query terms in the entire database also raises the relevance of the documents found to include the term. The combination of these criteria makes it possible for the query New England to list the article with that title first, even if other articles include more occurrences of the words new and England.
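
To illustrate how such criteria can interact, here is a toy scoring function. It is a sketch only: the weights are invented, the rarity formula is a common textbook one rather than Britannica's, proximity is left out for brevity, and case is folded to keep the example short (which Britannica Online itself does not do).

```python
import math


def relevance(query, doc, corpus):
    """Toy score: phrase-in-title bonus, phrase-in-body bonus, frequency weighted by rarity."""
    terms = query.lower().split()
    phrase = " ".join(terms)
    title = doc["title"].lower()
    body_words = doc["body"].lower().split()

    score = 0.0
    if phrase in title:
        score += 10.0  # hypothetical weight for a phrase match in the title
    if phrase in " ".join(body_words):
        score += 3.0   # hypothetical weight for a phrase match in the text

    for term in terms:
        frequency = body_words.count(term)
        containing = sum(
            1 for d in corpus
            if term in (d["title"] + " " + d["body"]).lower().split()
        )
        # Terms found in fewer documents count for more.
        rarity = math.log((1 + len(corpus)) / (1 + containing)) + 1.0
        score += frequency * rarity
    return score


corpus = [
    {"title": "England", "body": "England has new towns and new roads and England has new laws"},
    {"title": "New England", "body": "region of the northeastern United States"},
]
ranked = sorted(corpus, key=lambda d: relevance("New England", d, corpus), reverse=True)
print([d["title"] for d in ranked])
# prints: ['New England', 'England']
```

The second article wins on the title phrase alone, although the first one mentions both words far more often.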

As shown in Figure 11 above, the software offers an option to refine the search by limiting it to articles that include ALL the terms present in the query (except for the stopwords). This option should perhaps be displayed more prominently, with a button that calls for action. Such a restriction dramatically reduces the result set, in our case to 13 perfect items. The software eliminates the stopwords and puts the Boolean AND operator between the query terms (Figure 13). In the case of only two terms this may not limit the search enough. In the case of a longer query that consists of several terms (art, dance, music, language, culture of Tibet), the swing from the automatic OR operation (yielding 23,507 items) to the automatic AND operation may limit the search to the extent that no articles, or too few, are found with all the terms (2 in our case, as shown in Figure 14). That's when the advanced search options come in handy.
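
The swing between the two automatic modes is easy to see on a toy inverted index, where OR is a set union and AND is a set intersection. The index contents and the search function below are made up for illustration; this is a sketch of the general technique, not of Britannica's engine.

```python
def search(index, terms, mode="OR"):
    """Combine the posting sets of all terms with union (OR) or intersection (AND)."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


# Hypothetical index: term -> set of article numbers containing it.
index = {
    "art": {1, 2, 3},
    "dance": {2, 3},
    "music": {3, 4},
    "Tibet": {3, 5},
}
terms = ["art", "dance", "music", "Tibet"]
print(len(search(index, terms, "OR")))   # prints: 5
print(len(search(index, terms, "AND")))  # prints: 1
```

Each additional AND-ed term can only shrink the intersection, so long queries quickly run out of matching articles.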

Figure 13: Same complex query as in Figure 11 but with Boolean AND operator yields few highly relevant hits

Figure 14: Too narrow query of many AND-ed search terms

Advanced search options of Britannica Online

Users do not have to accept the above two options. They may use the truncation symbol (*) to retrieve variations of a word, such as Tibet* for Tibetan or Tibetic, or danc* for dance, dances, dancer, dancers, dancing, etc. Even better, they may use AND and OR operators as appropriate for the query, such as Tibet* AND (art OR danc* OR music* OR language OR cultur*), which increases the results to 390 items (Figure 15). The truncation of the word Tibet in itself yields many additional records in combination with the other words. Regular plural forms are retrieved without explicit truncation, i.e., art will find arts. It may be necessary to truncate art to also retrieve artist, artists, and artistic.
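
A minimal sketch of how truncation can be combined with Boolean operators, assuming truncation is implemented as prefix matching against the index vocabulary. The index, the article numbers, and the helper names are hypothetical.

```python
def expand(pattern, vocabulary):
    """Expand a truncated pattern such as danc* into the matching vocabulary terms."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return sorted(term for term in vocabulary if term.startswith(prefix))
    return [pattern] if pattern in vocabulary else []


def lookup(index, pattern):
    """Union of the posting sets of every term the pattern expands to."""
    docs = set()
    for term in expand(pattern, index):
        docs |= index[term]
    return docs


# Hypothetical index: term -> set of article numbers containing it.
index = {
    "Tibet": {1},
    "Tibetan": {2, 3},
    "art": {5},
    "dance": {2},
    "dancing": {4},
    "music": {3},
}

# Tibet* AND (art OR danc* OR music*)
hits = lookup(index, "Tibet*") & (
    lookup(index, "art") | lookup(index, "danc*") | lookup(index, "music*")
)
print(sorted(hits))  # prints: [2, 3]
```

Note that the plain term Tibet would have found only article 1 here, and nothing at all once AND-ed with the other terms.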

Such a search can be further refined by using the proximity operator ADJ instead of AND. The ADJ operator limits the search to those articles in which the terms before and after the operator appear no more than 15 characters apart, in the specified order. The proximity cannot be increased or decreased, although that would be useful. The specified order of the terms deserves attention, as the wrong sequencing may yield significantly different results, omitting important articles. The query sovereignty ADJ nations would yield a puny 2 records that have the term sovereignty of nations. A much better query, (nation* ADJ sovereign*) OR (sovereign* ADJ nation*), uses all the advanced features and retrieves national sovereignty, sovereign nations, and sovereignty of nations. It would be useful to have a proximity operator, such as NEAR, without the word order restriction. This would simplify the query formulation to nation* NEAR sovereign*. Word order specification may still be needed to distinguish police state from state police, of course.
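
The behavior of ADJ, and of the NEAR operator wished for above, can be sketched with regular expressions. How Britannica measures the 15 characters is not documented here; the sketch assumes the gap runs from the end of the first term to the start of the second. NEAR does not exist in Britannica Online, and both function names are hypothetical.

```python
import re


def term_pattern(term):
    """Turn a query term into a regex; a trailing * matches any word ending."""
    if term.endswith("*"):
        return r"\b" + re.escape(term[:-1]) + r"\w*"
    return r"\b" + re.escape(term) + r"\b"


def adj(text, first, second, max_gap=15):
    """True if `second` starts at most max_gap characters after `first` ends."""
    second_starts = [m.start() for m in re.finditer(term_pattern(second), text)]
    for match in re.finditer(term_pattern(first), text):
        if any(0 <= start - match.end() <= max_gap for start in second_starts):
            return True
    return False


def near(text, a, b, max_gap=15):
    """Order-independent variant: ADJ tried in both directions."""
    return adj(text, a, b, max_gap) or adj(text, b, a, max_gap)


print(adj("the sovereignty of nations", "sovereignty", "nations"))  # prints: True
print(adj("national sovereignty", "sovereignty", "nations"))        # prints: False
print(near("sovereign nations", "nation*", "sovereign*"))           # prints: True
print(adj("state police", "police", "state"))                       # prints: False
```

Exposing max_gap as a parameter is the adjustable proximity the fixed 15-character window lacks.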

Figure 15: Combination of Boolean AND and OR operators yields much larger result set

Figure 16: Seemingly odd but highly relevant article about Ferlinghetti in response to the impeachment search

None of these advanced operators is needed for single-word queries, which are also much more efficient than in the print version. The query sovereignty yields 992 items. The printed Index volume provides references to only 15 articles, scattered across a dozen volumes. Undoubtedly, among the nearly thousand articles there are hundreds that mention sovereignty only in passing, but there are also hundreds that are indeed relevant to the topic. Limiting the search to the index reduces the results drastically but still has the advantage of listing articles that don't appear in the printed Index volume under sovereignty, such as constitutional sovereignty, pluralistic sovereignty, or popular sovereignty, i.e., compound terms that include the single term of the query. All of these are highly relevant articles, obviously. Such single-term queries may also bring up articles that add an interesting dimension to a topical search. The term impeachment has merely six references in the printed Index. The Index search in Britannica Online yields the same references. However, the Article search yields 120 highly relevant articles, including one about Ferlinghetti. It is quite an unlikely hit, but displaying the article more than justifies its retrieval (Figure 16).

Case sensitivity

Britannica Online is utterly case sensitive. This is unusual, as most search programs don't distinguish between great lakes and Great Lakes. In Britannica Online the difference is enormous: 24,853 hits versus 7,759. Even with the strictest proximity operator there is a substantial difference: Great ADJ Lakes (373) and great ADJ lakes (504). In the era of electronic mail, where we often use lower case even in signatures (that would make e. e. cummings very happy), this case sensitivity will baffle many users. I don't think it is bad; it is just unusual.
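
The effect is easy to reproduce on a toy collection. The sketch below counts documents under an implicit OR, once with exact-case matching and once with case folding, which is what most search programs do. The three documents and the function name are invented for the example.

```python
def count_hits(docs, query, fold_case=False):
    """Count documents containing any query word (implicit OR)."""
    normalize = str.lower if fold_case else (lambda s: s)
    terms = {normalize(word) for word in query.split()}
    return sum(1 for doc in docs if terms & {normalize(word) for word in doc.split()})


docs = [
    "the Great Lakes of North America",
    "a great many lakes",
    "Alexander the Great",
]
print(count_hits(docs, "great lakes"))                  # prints: 1
print(count_hits(docs, "Great Lakes"))                  # prints: 2
print(count_hits(docs, "great lakes", fold_case=True))  # prints: 3
```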

Stopwords, British and American variants and Spelling

Britannica Online uses a large number of stopwords, far more than the dozen or so in typical bibliographic databases. This is reasonable, and the stopwords are well chosen. The same cannot be said of the handling of British versus American spelling variations. How it works is somewhat enigmatic. Most British variants are automatically picked up by the software using an internal American-Anglo/Anglo-American dictionary, but it is not always so.

The query colour NOT color shows 17 records and the query color NOT colour finds 66 records. The help file warns the user to use both forms of such terms. Oddly, the compiler of the help file seems to be enamored of the topic and belabors it, even though there are no articles with the American spelling belabor, only one with belabour, and none with either enamored or enamoured. Furthermore, such terms as kiloliter or kilolitre do not exist in either American or British spelling in the encyclopedia (and to my knowledge they are not used) except in the help file (Figures 17a and 17b). Ironically, the extensive list does not include the encyclopedia/encyclopaedia pair. This really backfires when the user is searching for the term encyclopedia. The Index search retrieves 18 items (Figures 18a and 18b), including all the competitors of Britannica (Collier, Americana), but Encyclopaedia Britannica is absent from the list. This also confirms that there is no comprehensive automatic switching between British and American spelling, which would be much needed. Oddly, the first item in response to the query online ADJ encyclopedia in the Index shows encyclopedia see encyclopaedia [Cross ref]. I wonder why there is no similar cross reference for the simple query encyclopedia.

Figure 17a: British/American word pairs

Figure 17b: British/American word pairs

Figure 18a: Results for a search on encyclopedia do not show Encyclopaedia Britannica among the first few records ...

Figure 18b: ... and not even in the rest of the records

Help with misspellings and search operators

Misspellings are common in formulating a query, and Britannica Online offers an excellent solution for cases when no article is found with the term. It lists words that have the same or a similar letter combination as the one typed, words that sound similar, and the alphabetically closest terms (Figure 19). This gives the user the opportunity to pick the right term from the list and launch the search again. Unfortunately, when one word in a compound term is misspelled (as in national sovereignity) and the user picks the correct term from the list of recommended close terms, only that term alone is searched. The 1999 version may offer the better option of replacing the misspelled term by plugging the chosen one into the query cell.
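
A rough sketch of such a suggestion list, and of the wished-for substitution into the original query, using Python's standard difflib and bisect modules. The vocabulary is a made-up sample, the similarity cutoff is arbitrary, and the sound-alike matching is omitted; none of this is Britannica's actual method.

```python
import bisect
import difflib


def suggest(word, vocabulary, n=3):
    """Return (similarly spelled terms, alphabetically neighboring terms)."""
    vocab = sorted(vocabulary)
    similar = difflib.get_close_matches(word, vocab, n=n, cutoff=0.8)
    position = bisect.bisect_left(vocab, word)
    neighbors = vocab[max(0, position - 1):position + 1]
    return similar, neighbors


def replace_term(query, wrong, right):
    """Swap the misspelled word for the chosen one, keeping the rest of the query."""
    return " ".join(right if word == wrong else word for word in query.split())


vocabulary = ["nation", "national", "sovereign", "sovereignty", "soviet"]
similar, neighbors = suggest("sovereignity", vocabulary)
print(similar[0])  # prints: sovereignty
print(replace_term("national sovereignity", "sovereignity", similar[0]))
# prints: national sovereignty
```

Re-running the repaired query, rather than the single picked term, keeps the compound search intact.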

Figure 19: Words that sound similar or are closest to the one entered (and most likely misspelled)

Figure 20: Options for displaying records

Sorting and displaying results

Results can be sorted by relevance or by Spectrum (Propaedia) categories. The latter is an unusual but useful option to collocate articles by major disciplines (religion, art, etc.). With the removal of Propaedia in 1999 this option will no longer be available. Users can choose to display results in groups of 3, 10, 20, 40, 100, or 200 per page. It is odd that the choices are not 5, 10, 50, 100, and 200, but this is a minor issue (Figure 20).

Looking up the items referenced in the printed Index is the drudgery of using Encyclopaedia Britannica. Britannica Online brings the items to your screen instantly and lets you do the cherry picking. The search terms are highlighted, and the text is easier to read than the small type in the printed volumes. The hits can be displayed showing the first paragraph of the article, the first paragraph where the term occurs, or parts of every paragraph where the term occurs in a one-liner format, similar to the traditional KWIC (Keyword In Context) format (Figures 21a-c). The latter can be an efficient format for scanning the results, especially articles from the Micropaedia, but can make the page excessively long for search terms that appear with very high frequency in the long Macropaedia articles. In the case of compound terms like national sovereignty, listing every sentence where either of the terms occurs grossly dilutes the KWIC list. This could be easily corrected by restricting the generation of KWIC entries to occurrences as specified by the user in the query, i.e., national sovereignty in the case of a query like national ADJ sovereignty. The context is limited to 5-7 words around the search terms. It would be far more informative to increase the context to 12-15 words, which would still fit on one line.
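
The correction suggested above is simple to sketch: generate a KWIC line only where the whole phrase occurs, with an adjustable number of context words on each side. The function name, the bracket highlighting, and the sample text are illustrative assumptions, not a description of Britannica's display code.

```python
PUNCTUATION = ".,;:!?()\"'"


def kwic(text, phrase, context=6):
    """One line per occurrence of `phrase`, with `context` words on each side."""
    words = text.split()
    clean = [word.strip(PUNCTUATION) for word in words]
    target = phrase.split()
    n = len(target)
    lines = []
    for i in range(len(words) - n + 1):
        if clean[i:i + n] == target:
            left = words[max(0, i - context):i]
            hit = "[" + " ".join(words[i:i + n]) + "]"
            right = words[i + n:i + n + context]
            lines.append(" ".join(left + [hit] + right))
    return lines


text = (
    "The doctrine of national sovereignty shaped the modern state. "
    "Each national government guards its sovereignty."
)
print(kwic(text, "national sovereignty", context=2))
# prints: ['doctrine of [national sovereignty] shaped the']
print(len(kwic(text, "national", context=2)))  # prints: 2
```

Matching the full phrase yields one line, while matching either word separately would list both sentences.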

I would also like to see a format where only the title of the article or article section is displayed. The vertical layout of the result list could also be made tighter with smaller vertical spacing, as the blue color of the hotlinked article titles separates the entries very well. This could significantly reduce the number of clicks needed to scroll down the page and/or to move to the next page.

Figure 21a: Efficient KWIC format for short article

Figure 21b: Inefficient KWIC format for first part of long article

Figure 21c: Efficient KWIC format for second part of long article

Inline images and external links

Britannica Online has over 7,000 images that appear as top-notch quality thumbnails already in the short result list, and can be enlarged by a single click both there and from the full article. In the upcoming 1999 version the images will appear within the text of the articles, and their number will increase. Even so, Britannica Online offers many external links to Web sites, primarily for the sake of more images. The article about Sandro Botticelli includes only two of his paintings, but there are fifty hyperlinks to large collections, leading directly to the painting(s) of Botticelli (Figures 22a-c). This is an exemplary solution for broadening the options of a very rich resource in a very efficient manner.

Figure 22a: Partial list of external links to Botticelli image collections

Figure 22b: External link to Carol L. Gerten's Botticelli thumbnails ...

Figure 22c: ... and full-screen images

Britannica Online - Big Bang for Your Buck

Britannica Online is one of the few commercial Web sites that make a profit. It is definitely one that deserves to do so. According to a price change in early October, individuals can have a yearly subscription for $60 for unlimited use. Universities and schools can have the same for less than $1 per student. The actual fee varies depending on the number of students, and it is not directly proportional but stratified into ranges. The academic subscriptions extend to the faculty and staff as well, irrespective of their number. This benefits private schools even more, as their faculty/student ratio is higher, but at that price every educational institution should have a subscription to this reference gem.

The online version does not have as many images as the printed edition but runs circles around it in terms of access. It may not have the audio and video elements and the awesome statistical data visualization program of the CD-ROM version, but it has more images and it is available anywhere there is an Internet connection and a decent browser. Britannica offers a one-week free trial period without asking for your credit card number first. Once you realize what you get for a few cents a day, you will not hesitate to subscribe.

The once ultraconservative Britannica has turned into a technology trailblazer, and offers an awesome model for using the Web to cultivate one's mind instead of plowing through sources of dubious value. Amidst the millions of dollars wasted on questionable educational projects promising to be a panacea, this worthiest of reference and educational sources is available for pennies, now.

About the author

Péter Jacsó is an associate professor at the Department of Information and Computer Sciences of the University of Hawaii. His columns are published in Database, Information Today and Computers in Libraries. In 1998 he won the Pratt-Severn Faculty Innovation Award of the Association for Library and Information Science Education, and the Louis Shores-Oryx Press Award of the American Library Association for his discerning database reviews.


Updated December 8, 1998
Copyright © 1998 Péter Jacsó