Report No. UIUCDCS-R-79-956
UILU-ENG 79 1707
NSF-0CA-MCS77-27910-000039

USER'S GUIDE TO EUREKA AND EURUP

by

Thomas G. Burket
Perry A. Emrath

February 1979

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

Supported in part by the National Science Foundation under Grant No. US NSF MCS77-27910


TABLE OF CONTENTS

INTRODUCTION

PART 1 - EUREKA

1. Basic EUREKA features
2. EUREKA Query Language
    2.0 Command Summary
    2.1 Signing on and off EUREKA
    2.2 Find command
        2.2.1 Find contexts - the "in" clause
        2.2.2 Search sets - the "from" clause
        2.2.3 Naming sets using "="
        2.2.4 Commenting sets
        2.2.5 Find command summary
    2.3 Print command
        2.3.1 Print query
        2.3.2 Print context
        2.3.3 Browse mode
    2.4 Change command
    2.5 Comment command
    2.6 DB command
    2.7 Delete command
    2.8 Enter command
    2.9 Freq and term commands
    2.10 Guide command
    2.11 Help command
    2.12 LP command
    2.13 Make command
    2.14 Message command
    2.15 T command
    2.16 Users command
    2.17 Words command
    2.18 Thesaurus feature
        2.18.1 Enter
        2.18.2 Change
        2.18.3 Print
        2.18.4 T command
        2.18.5 Delete
        2.18.6 Sample EUREKA user session
3. Comments on EUREKA
    3.0 General search ideas
    3.1 Search strategy
    3.2 Suffixing
    3.3 Search failures
4. Glossary of terms

PART 2 - EURUP

1. Introduction
2. Database Structures
    2.0 Database Directory
    2.1 Database Index
    2.2 Text
    2.3 Vocabularies
    2.4 Context Definitions
    2.5 Stop and Save Lists
3. EURUP Inputs and Outputs
    3.0 Document Input File Format
    3.1 Statistics and Message Output Files
    3.2 Temporary Files
4. EURUP Functions and Command Language
    4.0 Command Mode
    4.1 Update Mode
    4.2 Database Creation, Extension and Deletion
        4.2.1 Create
        4.2.2 Extend
        4.2.3 Detach
        4.2.4 Zero
    4.3 Document and Vocabulary Updating
        4.3.1 Insert
        4.3.2 Replace
        4.3.3 Delete
    4.4 Extent Reorganization and Garbage Collection
        4.4.1 Dump
        4.4.2 Load
        4.4.3 Reorganize
    4.5 Miscellaneous
        4.5.1 Open
        4.5.2 Exit
        4.5.3 Directory
        4.5.4 Change
        4.5.5 Select
        4.5.6 Move
        4.5.7 Numbers
        4.5.8 Stop
        4.5.9 Unlock
5. Comments, Suggestions and Examples on the Use of EURUP
    5.0 Some Database Characteristics
    5.1 Constructing an Average Database
    5.2 Using EURUP and EUREKA to Check Spelling
RELATED REFERENCES


INTRODUCTION

EUREKA is an experimental system designed for research in the field of information retrieval. Part 1 of this report describes the current state of EUREKA and how to use it to conduct searches of a database. It is divided into four sections - basic features, query language description, general comments about EUREKA use and a glossary of terms and rules.

EURUP, for EUREKA UPDATE, is the system which builds and maintains databases for use by EUREKA. Part 2 discusses the logical organization of a database, as well as how to construct or modify the component files. It has five sections - introduction, database structure, EURUP input and output formats, EURUP functions and command language, and general comments about efficient EURUP use.

Both EURUP and EUREKA run on a PDP 11/40 minicomputer.


PART 1 - EUREKA

1. Basic EUREKA features

As a package, EUREKA is unique, although it shares many features with other existing retrieval systems. The major points of its structure are:

    1) It is interactive.
    2) The full text of each document is available.
    3) The index contains every word that appears in the documents (instead of a controlled vocabulary).
    4) A document is divided into several "contexts", such as author, title and sentence, and searches may be confined to particular contexts.
    5) Results of queries are stored for later use.
    6) A user-defined thesaurus may be used.
    7) Several databases are potentially available.

Before giving more details, let's examine a typical user's approach to EUREKA. She has a particular area to investigate, presumably the type that can be handled by reading the text of relevant documents. EUREKA is not a question-answer system where someone might ask, "How many cities in Illinois have over 100,000 people?", and receive the answer "3". Instead, the user must formulate her request as a set of terms that EUREKA can find within the text, such as "city", "population", "residents" and so on.
Armed with her request, she "signs on" to the system and the correct database and enters a search command. EUREKA responds with statistics on the number of documents containing the given terms and (we hope) the desired information within them. She probably would then scan and evaluate some of the documents. After this reading, the user has three choices:

    1) The question has been answered, so exit the system or go on to another query.
    2) Finding the answer appears hopeless, so give up.
    3) The result is not quite right but the answer appears within reach, so modify the original query for a more accurate search.

These steps can become very complex and require the subsidiary features of EUREKA. We will begin the details by briefly describing the system command formats and functions. The glossary at the end of the manual defines various crucial terms appearing from now on.

2. EUREKA Query Language

2.0 Command Summary

    bye     - disconnects the user from EUREKA
    change  - changes the name of a query set or changes a thesaurus entry
    comment - attaches a comment to an existing query set
    db      - allows the user to access a different database
    delete  - removes a query set, comments or a thesaurus entry
    enter   - defines a thesaurus entry
    find    - does searches of the database for terms in some context
    freq    - allows scanning the database index to see what terms exist and to see statistics about them
    guide   - enters the lesson facilities
    help    - displays the key ideas about EUREKA commands or some other area to help the user who is confused or has forgotten some details
    lp      - causes certain output to go to a line printer instead of the terminal
    make    - forms a new query set out of some combination of existing sets and documents
    message - leaves a comment or message for "The Management"
    print   - displays document text, query set or thesaurus information
    t       - sets flags for thesaurus use
    term    - like freq only without statistics
    users   - displays a list of signed-on EUREKA users
    words   - displays keywords from a document set

Detailed command descriptions follow. Keep in mind that EUREKA is not static and that the formats may change. The help files (see help command) are maintained as up to date as possible so that information about changes is available.

Note on notation: when defining a command format, the notation <...> means the user must enter something of that type in that location. The notation distinguishes command parameters from the keywords that must appear.

2.1 Signing on and off EUREKA

Naturally, the first step in using EUREKA is to get connected to it and the desired database. When EUREKA is ready for someone, it types on the terminal:

    Please type your name:

and then waits for someone to sign on. Prior to the first encounter, a person should be assigned a unique user name, possibly matching his real name. When given a user name, each person is also given a default database that he is automatically connected to at signon. Getting on then is as simple as typing that user name and hitting the return key. To override the default database, the user name may be followed by a comma and some other database name.

If EUREKA responds with a line similar to "Query #1" then he is on the system. However, if a message like "Database not found . . ." appears, then either that data is not available or he may have just mistyped the database name. In that case, the user is connected to the system as a whole, but not to any data. Most commands will not work without first entering a db request (see Section 2.6) to perform the data attachment.

Examples:

    Please type your name: tom          (user "tom" is logged on
    Query #7                             to his default database)
    #

    Please type your name: tom, west    ("tom" is logged on to the
    Query #7                             database named "west")
    #

The simplest command of all is BYE (or BY), which disconnects the user from EUREKA. Anytime the user is shown a "#", as above following the Query #7, typing "bye" signs him off and has EUREKA ask for another person.
For example,

    Please type your name: tom
    Query #1
    #bye
    (a message appears about how long "tom" was on system)
    Please type your name:              (ready for another user)

2.2 Find command

The find command is the most important one in EUREKA, as it causes the database to be searched. It can get quite complex, so we will start with its basic form of

    find <expression>
or
    f <expression>

where <expression> is a combination of terms. For example, "f tax" means search the entire database for documents containing the word "tax". This has limited power, so combinations of words are permitted, as in

    f income or tax
    f income and tax
    f tax and income or sales
    f tax and (income or sales).

The first one means find documents that have either "tax" or "income" or both, while the second requires that both "income" and "tax" must be in the document for it to be retrieved by EUREKA. The third example shows that these operators (AND/OR) may appear together in the same expression, but an ambiguity does arise. In EUREKA, the request would be interpreted as

    - find all documents that contain "tax" and "income" together, or any document having "sales"

instead of

    - find all documents that contain "tax" and either "income" or "sales".

Whenever ANDs and ORs appear together, the AND is always evaluated first because of its higher precedence. If the user had wanted the other meaning in the example, then the fourth case above would provide that request. Parentheses allow grouping of terms to override the natural or default usage. This is analogous to algebraic expressions where A+B*C might mean either (A+B)*C or A+(B*C).

Abbreviations do exist for both operators, "*" and "&" for AND, "+" for OR, so "f a and b or c" could be entered as "f a & b+c". The symbols are completely equivalent to the words, except they need not be surrounded by spaces as the words do. That is, "f income and tax" could not be typed as "f incomeandtax", but "f income&tax" is acceptable.
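The precedence rule above can be illustrated with a small sketch. This is not EUREKA's implementation: the index contents and the names `INDEX`, `tokenize` and `evaluate` are hypothetical, chosen only to show AND binding more tightly than OR, with parentheses overriding the default.

```python
import re

# Hypothetical inverted index: term -> set of document numbers (made-up data).
INDEX = {
    "tax": {1, 2, 3},
    "income": {2, 3, 4},
    "sales": {5},
}

def tokenize(query):
    # Words, the symbolic operators & * +, and parentheses.
    return re.findall(r"[a-z0-9]+|[&*+()]", query.lower())

def evaluate(query, index=INDEX):
    """Evaluate a boolean query; AND is applied before OR."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        # Lowest precedence: OR combines the results of AND groups.
        result = parse_and()
        while peek() in ("or", "+"):
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        # Higher precedence: AND is applied before any surrounding OR.
        result = parse_primary()
        while peek() in ("and", "&", "*"):
            advance()
            result = result & parse_primary()
        return result

    def parse_primary():
        token = advance()
        if token == "(":
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return result
        return set(index.get(token, set()))

    return parse_or()

print(sorted(evaluate("tax and income or sales")))
print(sorted(evaluate("tax and (income or sales)")))
```

With this sample index, the unparenthesized request includes document 5 (which has only "sales"), while the parenthesized one does not. Note that the tokenizer reads "incomeandtax" as a single word, mirroring the rule that the word operators must be surrounded by spaces.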
Notice that "f income and tax" will select a document that has "income" anywhere in it and "tax" also somewhere within. What if we wanted to find "income tax", where the words are next to each other? EUREKA permits the grouping of several words together into one term by surrounding them by single quotes. For example,

    f 'income tax'

would solve our problem, and similarly,

    f 'income tax' or 'internal revenue service'.

What if we want to find tax, taxes, taxpayer and so on? There is no need to type all the possible words. By putting a "?" at the end, as in "f tax?", all words that start with "tax" are searched for. However, if all the endings aren't desired, such as "taxicab" or "taxation", then we must use a list of the acceptable endings. For example, "f tax or taxes or taxpayer". The discussion of term and thesaurus commands gives ways to help handle this problem (Sections 2.9 and 2.18).

Just as suffixing is allowed, putting the question mark as the first character asks EUREKA to find all words with the given ending. Thus, "f ?ation" could produce "taxation", "representation", "sensation", etc. EUREKA checks the entire index in this case, so prefixing is relatively slow and is not recommended. To help prevent careless use, the ending must be at least three characters. "?y" fails, but "?ogy" passes. A "?" may appear at both ends of the same word. "?ize?" may not be very useful, but it would match terms "socialize" and "computerized". Prefixing and suffixing work for quoted phrases, too, as with 'income tax?' looking for "income taxes" and other similar strings. However, a "?" inside a term is treated as an actual question mark. The exact rules for what constitutes a term appear in the glossary under "term".

The results displayed after a search request are a compact representation of the documents retrieved. For example, this is a sample set.

    f mathematics
    66 documents responded to index search
    66 documents are in this set
    The 20 documents with the most occurrences are:
       1( 2)    5( 3)   10( 1)   42( 1)  229( 4)
     252( 1)  428( 1)  527( 1)  703( 5)  710( 4)
     715( 2)  776( 2)  780( 9)  825( 2)  837( 2)
    1107( 3) 1168( 4) 1412( 2) 1723( 2) 1774( 7)

The search found 66 documents containing "mathematics" somewhere in the text. The table of numbers gives those documents with the most frequent use of the term. The first number in each pair is the document number and the second is the number of occurrences for the term. Thus document 229 had "mathematics" in it four times, and 1723 had it twice. The list is sorted in document number order, except that only the top 20 matches are listed. We see that document 780 had nine occurrences, 1774 had seven and so on down to several with just one. By this ordering, we know that no other documents could have had more than one occurrence, for otherwise they would have been in the table. When many documents have the same lowest value, the ones with the lower document numbers are given. In this example 66-15=51 documents must have had just one occurrence of "mathematics", but only those five with the smaller document numbers appear.

As another example, consider

    f 'computer architecture'
    44 documents responded to index search
    Do you wish to abort this query? n
    Do you wish to abort this query? n
    44 documents required full-text search
    12 documents are in this set
       4( 1)  271( 2)  651( 1) 1650( 1) 1671( 1)
    1708( 7) 1713( 1) 1722( 3) 1764( 1) 1839( 2)
    1877( 1) 1892( 1)

Several new ideas show up in this request for a phrase. EUREKA checked the index and found that 44 documents had both "computer" and "architecture" in them. However, the user asked for "computer architecture", forcing the words together and in that order. This forces EUREKA to do a "full-text search", scanning all 44 documents to see if that requirement is met.
In this example, 12 of the 44 are retrieved and the other 32 do not have the exact pair "computer architecture".

The inquiry about aborting the query allows the user to exit a request that may be time-consuming. Full-text searching is relatively slow because of the considerable text processing needed, and the user may not want to wait for its completion. This is especially handy if many documents are retrieved. EUREKA prompts the user after every 16 documents, so the above queries occurred after 16 and then 32 different articles had been scanned.

2.2.1 Find contexts - the "in" clause

The above discussion on the find command dealt only with its basic form. The command has several options that substantially increase its power. One of these is the "in" clause, or context option, which appears as

    f <expression> in <context list>.

The context clause tells EUREKA to confine its searching to only those specified regions of the document. A context is a predefined division of a document, such as author, title, body and references. Each document in a database has at least some subset of the contexts contained within it. Each document within the database has the same contexts defined, but those specific names depend on the database. Only sentence and paragraph are common to every database.

The context names may be abbreviated, the actual shortening permitted again being dependent on the database. The usual limit is one or two characters, with two required whenever some other context name begins with the same letter and has been given a higher priority. For example, "sentence" abbreviates as "s", so if a context "source" existed, its shortest form would be "so".

A context list is one or more context names separated by commas. A comma implies OR, so "in author, title" means find it in either the author or the title section, or both.

Expressions with "in" clauses can get arbitrarily complicated, since "in" may appear many times. We'll start with cases of only one clause.
The two key ideas to remember are that the clause is actually part of the expression and that it applies to everything to the left (up to the left end of its parenthesis level). Examples are

    f hemingway or frost in author
    f (hemingway in au) or (frost in au)
    f hemingway in author or frost
    f hemingway or (frost in author)

The first case asks for those documents in which either "hemingway" or "frost" is the author, ignoring ones where either name appears elsewhere in the document, such as in a paragraph discussing the virtues of Frost's poems. Of course, if a document has "hemingway" in the author section and also somewhere else, it will still be retrieved. Note that "in author" affects everything to the left.

In the second case, we have exactly the same interpretation. The parentheses around the "frost" term cut off the rightmost "in au", requiring an additional clause for "hemingway". However, the parentheses around the "hemingway" clause are redundant, since nothing is to the left of it.

With "in author" moved to the left term in the third example, the request means find documents with "hemingway" in the author section or "frost" anywhere. The fourth example again demonstrates context cutoff by a parenthesis. "Frost" must appear in the author region, but "hemingway" may be anywhere, so this is the opposite of number three.

To search for "hemingway" in either the author or the title, try

    f hemingway in author, title

but to concentrate only on the most famous Hemingway, use

    f ernest and hemingway in au,ti.

In the example

    f (tax or taxes) and income in sentence

the "in sentence" clause applies to the entire expression. The parentheses do not affect the context, since the "in" is not inside the parentheses. It "invades" parentheses to the left.

Stepping up to two clauses brings in considerable power as well as added complexity. The simplest case is two in a row, as with

    f hemingway or frost in author in title

which is the same as "... in author, title".
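The single-clause cases can be mimicked with a small sketch. It is only an illustration under stated assumptions: the documents, the context names and the helper `find` are hypothetical, and each "in" clause is applied by hand to the terms it governs rather than parsed from the command.

```python
# Hypothetical documents: document number -> {context name: text} (made-up data).
DOCS = {
    1: {"author": "ernest hemingway", "title": "the old man and the sea"},
    2: {"author": "carlos baker", "title": "hemingway the writer as artist"},
    3: {"author": "robert frost", "title": "north of boston"},
}

def find(term, contexts=None, docs=DOCS):
    """Return documents containing `term`; a context list acts as OR."""
    hits = set()
    for number, sections in docs.items():
        # No context list means the whole document is searched.
        names = contexts if contexts is not None else list(sections)
        if any(term in sections.get(name, "").split() for name in names):
            hits.add(number)
    return hits

# f hemingway or frost in author      ("in author" covers both terms)
case1 = find("hemingway", ["author"]) | find("frost", ["author"])

# f hemingway in author or frost      ("frost" may be anywhere)
case3 = find("hemingway", ["author"]) | find("frost")

# f hemingway or (frost in author)    ("hemingway" may be anywhere)
case4 = find("hemingway") | find("frost", ["author"])

# f hemingway in author, title        (comma means either context)
either = find("hemingway", ["author", "title"])

print(sorted(case1), sorted(case3), sorted(case4), sorted(either))
```

In this sample data, document 2 mentions "hemingway" only in its title, so it is retrieved only by requests whose context for "hemingway" includes the title or is unrestricted.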
A more typical example and its effective interpretation is

    f hemingway in au or frost in title
    f (hemingway in au,ti) or (frost in ti).

The "title" affects both its neighboring term and all else to the left, thus combining with "author". This example is also ILLEGAL, breaking a conflict rule described later. The interpretation and the illegality are probably not what the reader expected!

To get independent operands, parentheses are needed. Continuing with the literary example, we might want to find all documents with "hemingway" in the title and "baker" as the author. The dual "in" clause allows this request to be handled in one try with

    f (hemingway in title) and (baker in author).

The parentheses confine the context to the desired term. More examples are

    f (income & tax in s) and (Illinois and state in p)
    f 'cook county' and (prison or jail) in sentence
    f (1976 and may in date) and (database or 'data base' in ti)
    f (knuth in au)+(cacm in source)&(1978 in da).

An "in" may appear after every term in a search expression, although such use tends to be confusing or cause errors with misplaced parentheses. The user might think that he can sprinkle "in" clauses without restrictions on meanings. Consider

    f (a & b in p) and c in s.

This should be interpreted as "((a&b in p,s) and (c in s))". However, "p" and "s" conflict, since sentences are subsets of paragraphs. The attempt to put a tighter constraint at an outer level is illegal. The reverse case of "(a & b in s) & c in p" is ok, and the more general paragraph restriction will apply at only the outer level, as was probably expected. That is, "a" and "b" must still be in the same sentence, plus "c" must exist somewhere in the same paragraph.

The other main error comes from inconsistency in section names (author, title, date, etc., but not s or p). A specified name must agree exactly with any outer level context lists affecting it and having a name in common.
For example, these two cases fail:

    f a in au,ti and b and c in au
    f a in au and b and c in au,ti.

The reason is that the value of "a" is modified by both "au" and "au,ti". Which is meant? The rule is this: if a term is modified by an "in" clause and then another "in" at an outer level, then the contexts at each level must agree exactly. The case of "hemingway in author and frost in title" fits this rule. "Hemingway" is affected by an inner level (author) and an outer context (title), but since the two don't match - error! Any differences in contexts must be confined to parenthesized terms, such as "hemingway in au or (frost in ti)", which cuts off the influence of "ti". An alternative is "hemingway in au,ti or (frost in ti)", which cuts off the "ti" again and makes the combination of "au,ti" explicit and thus legal. One note: this specific matching rule does not apply to sentence or paragraph, so the matching check is done after excluding any s or p specifiers.

The most common use of contexts comes through the sentence and paragraph divisions. Many requests want a group of words in the same general area, such as only a few words apart. EUREKA does not have the capability to find two words, say, within three words of each other, but restricting a search to the same sentence or paragraph is a powerful approximation. Actually, this method is often better than a distance measurement because of large variations in English grammar and sentence structures across writing styles and document contents.

2.2.2 Search sets - the "from" clause

Another useful option for the find command is the "from" clause, which specifies what set of documents EUREKA should search. When no "from" clause exists, the entire database is examined. Its most fruitful use is to narrow the results of a previous query to a smaller set of documents.
If the user knows the desired answer is somewhere within a set, she may request that searching be confined only to that set, thus producing a better result in less time. Its basic form is:

    f <expression> from <set expression>

or if the "in" clause is also used,

    f <expression> in <context list> from <set expression>

<set expression> specifies a combination of existing query sets and documents to use as the search set. As with <expression>, the set expression is a series of terms separated by operators. For sets, these are

    1) and, &, *     e.g. 2 and 3
    2) or, +         e.g. 2 + 3
    3) minus, -      e.g. 2 - 3

AND and OR carry their original meanings here, with the addition of MINUS. This new operator means "but not", as in "documents in query set 2 but not those which are also in set 3". Each term in the set expression must be one of

    1) Query set number.
    2) Query set name.
    3) List of document numbers.

All three are converted into a list of documents and subjected to some combination with the lists from other terms. Some examples are

    f tax from last
    f tax from irs
    f tax from 3
    f tax from [1,2,3,4]
    f tax from irs -3
    f tax from irs+3-[10,11,22].

The first case is the most common use of "from", where the search is conducted only over those documents retrieved in the most recent query. "Last" is a special keyword known to EUREKA. The second example assumes the user had given the name "irs" to an existing set, causing its documents to be fetched for searching. Each set with a name still has a query number assigned at the time of the find command. If "irs" were the name of query set number three, the two successive examples would be equivalent.

A list of numbers surrounded by square brackets is always interpreted by EUREKA as a list of specific documents. In the fourth example, searching would be confined to documents one through four in the database.

The fifth case would search documents from set "irs", except those also in query #3. If "irs" and three were the same, the search set would be empty and nothing would result.
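The three set operators map directly onto ordinary set operations, as the sketch below shows. The query sets and their document numbers are made up for illustration, and the left-to-right reading of the combined expression is how this sketch treats "irs+3-[10,11,22]", not a statement about EUREKA's internals.

```python
# Hypothetical stored query sets (made-up document numbers).
query_sets = {
    3: {2, 5, 10, 11},          # query set number 3
    "irs": {1, 2, 10, 22, 40},  # a query set the user named "irs"
}

irs = query_sets["irs"]
q3 = query_sets[3]

both = irs & q3        # "irs and 3": documents in both sets
either = irs | q3      # "irs + 3":   documents in either set
but_not = irs - q3     # "irs - 3":   documents in irs but not in set 3
empty = irs - irs      # a set minus itself leaves nothing to search

# "irs+3-[10,11,22]", read left to right in this sketch:
# the union first, then removal of the listed documents.
combined = (irs | q3) - {10, 11, 22}

print(sorted(both), sorted(but_not), sorted(combined))
```

A search "from" any of these results would then examine only the documents left in the resulting set.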
The more complex final example demonstrates all three types of set forms, generating a set containing documents in either "irs" or #3, but removing documents 10, 11 and 22 if they are in either of those two sets.

When the "in" clause is used, the "from" must appear after the context if it is to be used. Note that "from all" is the same as no "from" at all, and that parentheses may not be used. See the make command (Section 2.13) for techniques in constructing complex set expressions.

2.2.3 Naming sets using "="

Remembering query numbers can be difficult, so EUREKA allows labels on queries. One way to attach a name is to do it when the set is formed with the find command. The form needed is

    f <expression> = <name>

or in full form,

    f <expression> in <context list> from <set expression> = <name>.

The name supplied

    1) Must not already exist.
    2) Must start with a letter.
    3) Can be up to ten letters and/or digits.
    4) Cannot be a keyword, like "last".

For example,

    Query #3
    #f tax = irs

searches for "tax" in the whole database and forms query set #3 out of the retrieved documents. The label "irs" is attached to the set, so "f blah from irs" would be valid as long as the set and the name existed.

2.2.4 Commenting sets

Our final embellishment permits commenting the query set. A string of characters surrounded by double quotes (" and not ') and appearing at the end of the command defines a comment. This note to oneself may be viewed later and could serve as a reminder of some point about the query. For example,

    f evasion from irs "tax evasion penalties"

does a search of the "irs" set, creates a new query set and includes the quoted string as a comment.

2.2.5 Find command summary

An example of a full form find command may now be given.

    f tax and income in sent from last = taxation "income tax laws"

has all the optional clauses, and ordered properly. Anything beyond "income" is optional, but, if used, must follow the strict rules. Most of the options reappear in other commands, such as DELETE and CHANGE, where more details will be explained.
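The clause ordering of the full form can be made concrete by peeling the optional parts off from the right. The function below is a hypothetical sketch, not EUREKA's parser: `split_find` and its regular expressions are assumptions, any "in" clause is simply left inside the expression, and a quoted phrase that itself contains the word "from" would be split incorrectly.

```python
import re

def split_find(command):
    """Split a find command into expression, search set, name and comment."""
    text = command.strip()
    parts = {"expression": None, "from": None, "name": None, "comment": None}

    # 1. A trailing comment in double quotes.
    m = re.search(r'\s*"([^"]*)"\s*$', text)
    if m:
        parts["comment"] = m.group(1)
        text = text[:m.start()]

    # 2. "= name": starts with a letter, at most ten letters and/or digits.
    m = re.search(r'\s*=\s*([A-Za-z][A-Za-z0-9]{0,9})\s*$', text)
    if m:
        parts["name"] = m.group(1)
        text = text[:m.start()]

    # 3. "from <set expression>".
    m = re.search(r'\s+from\s+(.+)$', text)
    if m:
        parts["from"] = m.group(1).strip()
        text = text[:m.start()]

    # 4. What remains is "find" or "f" followed by the search expression,
    #    including any "in" clauses, which are part of the expression.
    m = re.match(r'(?:find|f)\s+(.+)$', text)
    if not m:
        raise ValueError("not a find command: " + command)
    parts["expression"] = m.group(1).strip()
    return parts

full = split_find('f tax and income in sent from last = taxation "income tax laws"')
print(full)
```

For the full-form example this yields the expression "tax and income in sent", the search set "last", the name "taxation" and the comment "income tax laws"; clauses that are omitted come back as None.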
Browsing the documents retrieved by FIND is the function of PRINT, which will be discussed next.

Following the search, EUREKA stores the search expression and the documents retrieved. The expression saved is the effective one actually used to perform the search, although without thesaurus expansion. Thus, for the sequence

    f a

and then

    f b from last

the effective second request is "f b and a". "B & A" would be stored as the search expression for that query. In general, any time a "from" clause appears, the net expression is "<expression> and <expression from existing set>". Theoretically, the fully expanded thesaurus terms should be saved instead of just the user-typed terms, but the amount of storage required is prohibitive. Therefore, to get consistent results when printing from a set, the thesaurus must be in the same state as when the search was done.

2.3 Print command

PRINT displays document text, query sets, and thesaurus entries. See Section 2.18.3 for thesaurus information. Printing query sets is simplest, so we'll begin there.

2.3.1 Print query

This command form recalls data about previous queries, the data being:

    1) Query number and name, if any.
    2) The search expression used for generating the query (without thesaurus expansion).
    3) The number of documents in set.
    4) All document-frequency pairs (not just the top 20 as with FIND).
    5) Any existing comments.

The command format is one of:

    print query <# or name>
    print query <# or name> to <# or name>
    print query <# or name>,<# or name>,...,<# or name>

with "p q" a valid abbreviation for "print query". The first fetches and displays the single query specified, while the second displays a series of them. For example,

    p q 3

would show query set #3, but

    p q 3 to 10

would display #3 and go on through #4, #5 and up through #10. To print a group of queries not in the same range, a list is given. Then only those specified query sets appear.
Two examples are

    p q 1,4,last
    p q irs,6

EUREKA allows interaction with the display in many cases, depending on the amount of data being shown. Printing is stopped every 20 lines, if necessary, and the user is prompted with a "*". She then may

    1) Hit <return> to continue.
    2) Type "k" to kill the printing.
    3) Type "h" to get help.
    4) Type "d" to stop printing and delete this set.

If the command includes the "to" clause or a set list, printing also pauses between queries. The same four options apply there, too.

Two alternatives exist to typing "to". A dash ("-") or double dots ("..") between the range bounds is treated just as "to". More examples are

    p q irs..last
    print q 1-50
    p q 27 to 300.

The final case could represent a mistyping. The user need not worry about causing too much output by having the system go all the way to 300 instead of the intended 30. She may always exit by typing "k" at any prompt between queries.

2.3.2 Print context

EUREKA's other print command retrieves and displays text from documents. Its form is

    print <context list> from <set>.

The contexts permitted here match those of FIND, with a list of contexts separated by commas acceptable. The set rules, however, are more restrictive by not allowing an expression. It must be a single query set number or name, or a document list in square brackets. Both the context list and the "from" clause may be left off. If there is no "from", then "from last" is assumed, while if no context is given, "document" is the assumed value. Therefore, "p", "p document", "p from last" and "p document from last" all have the same meaning.

If <set> is a query set and not just a document list, EUREKA displays the contexts from each document in the set, flagging those lines that contain one of the search terms from the original query. For example, consider this sequence:

    Query #3
    #f tax
      . . .
    Query #4
    #p sent from last

EUREKA would begin showing sentences from the documents retrieved by query three.
The only sentences would be those having "tax" somewhere in them. The first sentences would be from the document with the most occurrences of "tax". When that document is finished, EUREKA would continue with the document containing the second most occurrences of "tax", and so on to the last document in the set.

The specific rules for what portions of a document are displayed get complex, so they will only be outlined here. The most important condition is whether the print context list contains "sentence" or "paragraph" ("s" or "p"). If it does not, then the specified contexts are displayed, with any line flagged if it has one of the search terms. Any context specifiers from the FIND are ignored. The FIND restricts the document selection process only and does not affect what happens when printing. For example, given

    f money in title
    p body

the entire body of each document is shown, with any occurrence of "money" marked. Titles are not displayed.

Assume the print context does contain "s". The discussion also applies similarly for "p". Then any additional contexts are first displayed, as above, with any search terms flagged. After that, a second pass is made over the text, looking for selected sentences to show. If no "s" appeared in the search expression, then any sentences with an occurrence of a search term are displayed, even if they already appeared during the first pass. The sentences must be within the contexts of the search expression, however. If "sentence" was in the search expression, the above applies except a sentence shown on the second pass must have an instance of the entire expression, not just one of the terms. As an example,

    f digital and computer in body
    p s, title

prints the titles, flagging those lines that happen to have "digital" or "computer". It then displays any sentences from the body that have either term.
As an alternative,

    f digital and computer in body,s
    p title, s

also prints titles, but then shows only sentences from the body with both "digital" and "computer". The number of different cases gets large, and they are not always handled in the obvious way, so it is best to experiment and keep this idea in mind: with FIND, use "in" to restrict the search to specific contexts; with PRINT, use contexts to show specific parts of the documents in a query set, regardless of the search expression.

2.3.3 Browse mode

The user may break the pattern of simple text display and interact with EUREKA during the display by using "browse mode". Some of the capabilities are skipping to the next document, viewing other contexts within a document, and killing the whole display process. Browsing is implemented by remembering where the user was in the main print stream before entering browse mode. So, if the above example of printing sentences with "tax" were interrupted by browsing, exiting browse mode would return EUREKA to printing more "tax" sentences.

Whenever a context is completed, or after 20 lines, EUREKA prompts the user. He may continue the current state by hitting <return> or use one of the browse commands. They are:

    p         - print next paragraph
    s         - print next sentence
    5p        - skip to the fifth paragraph from the current one. The general form is np, where n is some number. -5p means back up five paragraphs. The same format works for sentences, such as -4s and so on. If the end of the context is reached (like the end of body), EUREKA will not advance without a context request (see below).
    skip      - go to the next document in the set
    kill      - stop the entire printing process
    end       - leave browse mode and continue with original printing
    help      - get help on browsing
    <context> - by typing any one of the contexts legal in the database, that part of the document is entered and display is begun. For example, typing "references" would move to the references division (if it exists).
While in that section, any "p" or "s" browse request stays confined to the references. Leaving the section is possible only by entering another context or using one of "skip", "kill" or "end".

The current mode (print vs. browse) is always identified at the top of the output screen. Browse mode begins whenever a context name (including p, s) is input at a prompt.

Browse mode can get very confusing! We recommend extensive experimenting to get the feel of its capabilities and functions. Many useful tools for browsing do not exist, such as looking for a particular term while browsing or skipping back to an earlier document. We hope to add some of these features in the future.

2.4 Change command

CHANGE assigns a name to a query set or changes an existing name. The name may then appear in FIND "from" clauses, PRINT requests, and so on, in place of the query number. The command form is

    change <set> to <name>

The new name must follow the rules specified in the FIND description and in the glossary under query set name. Examples are

    change 3 to taxation
    change prisons to jails

Assuming query #3 had no name, the first command assigns "taxation" as its name. If it did have a name, the old label would be deleted. In the second case, "jails" replaces "prisons" as the set name.

CHANGE also has a version for modifying synonyms, entered as "change syn <name>". The thesaurus section describes this command.

2.5 Comment command

The comment command attaches comments to a query set. This is similar to the "comments" clause of FIND. The form needed is

    comment <set> "some string"

as in

    comment 3 "income tax rules"
    com jails "joliet state prison escapes"

The comments are displayed during PRINT QUERY requests. If the set already has comments, the new string is added and the existing comments are preserved.

2.6 DB command

A user wishing to switch to another database may avoid signing off and then signing on to the other one by using the db command. This simple statement is just "db <name>".
For example, if a user were in database "west" and wanted to use database "cr" for a while, he could follow this sequence:

    Query #10 #db cr
    Query #10 #              (commands while in "cr")
       .
       .
       .
    Query #17 #db west       (back to original database "west")

If a bad name is given, the user is left attached to no database and must either try again with a new, accurate name or else exit EUREKA with BYE.

Section 2.1 mentioned a case where DB must appear. Consider this:

    Please type your name: tom,xxx
    DATABASE UNAVAILABLE OR ERROR IN NAME
    Query #12 #f something
    You must pick a database (use DB command)
    Query #12 #db west

In this example, no database "xxx" exists, so "tom" was left without searchable data. Giving the db command with "west" solved the problem and completed the signon process (assuming, of course, that "west" is a valid database).

2.7 Delete command

DELETE has four forms:

    delete query <set list>
    delete comments <set list>
    delete query all except <set list>
    delete syn <name>

The first form removes query sets from the user's collection. A query that no longer has value should be removed to keep from cluttering up the user's space. The set list in this command is a series of query numbers and/or names separated by commas, or a range specification using "to". Any query names and comments are also purged with the set. Examples are

    delete query last
    delete q 3
    del q 3, 5, 11, jails, irs
    del q 1,2,3,4
    del q 1..4              (these 2 are the same)
    del q jails to irs
    del q jails-irs         (these 2 are the same)

Note the format is just as with PRINT QUERY.

DELETE COMMENTS only removes the comments from the set(s) specified. All query set information survives. For example,

    delete comments last
    del com 3, 5, jails
    del com 10-20

One simplification for deleting everything is available: the "delete q all" and "delete query all except <set list>" phrases. The first removes all query sets, comments, and names, and returns the user to query #1. If a few queries should be preserved, then "except" is used, with either a list or range.
Thus,

    del query all except 5, jails

deletes every query except #5 and the one named "jails".

    del comments all

also affects every query, but just clears out all the comments from them. Again, the sets survive.

The keyword "query" is actually optional in all the above examples. Its use is encouraged for consistency with PRINT QUERY.

DELETE SYN purges synonyms from the user's thesaurus. See Section 2.18.

2.8 Enter command - see thesaurus, Section 2.18

2.9 Freq and term commands

These two commands are closely related. They provide access to the actual database index and show the exact words available for search terms. Each command has one parameter: a string used to compare against terms in the index. Any word that begins with the same series of letters as the string is shown. The display pauses every 20 lines, if necessary, and the user may then hit <return> to continue or type "k" to quit. An example from a possible database is

    Query #6 #term tax
    tax
    taxable
    taxation
    taxed
    taxes
    taxi
    taxicab
    taxpayer
    taxpayers

Notice that each word starts with "tax". What if you want to match on the end of a word? EUREKA provides prefixing also, using a "?" as with search terms. For example, "?able" would list terms like "sociable" and "taxable".

FREQ outputs more detailed information about the terms. For each one, the number of documents in which the word appears is given, along with its total frequency of occurrence. A third number, used by the system people, is to be ignored. The FREQ form of the above example is

    Query #10 #freq tax
    tax          33   313   017306
    taxable      11    45   017453
    taxation      3     3   017620
    taxed         4     6   017765
    taxes        19    95   020132
    taxi          1     1   020277
    taxicab       1     1   020444
    taxpayer     17   166   020611
    taxpayers     5    38   020756

We see that "taxpayer" occurs in 17 documents a total of 166 times. This would match the result found by a "find taxpayer" command. FREQ permits prefixing, as TERM does.
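The matching rule behind TERM and FREQ can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not EUREKA's implementation: the names INDEX, term and freq are hypothetical, the "tax" counts are copied from the FREQ example above, and the "sociable" entry is made up so that the "?able" case has something to match.

```python
from bisect import bisect_left

# Hypothetical stand-in for the database index:
# term -> (number of documents, total frequency).
# The "tax" rows come from the FREQ example; "sociable" is invented.
INDEX = {
    "tax": (33, 313), "taxable": (11, 45), "taxation": (3, 3),
    "taxed": (4, 6), "taxes": (19, 95), "taxi": (1, 1),
    "taxicab": (1, 1), "taxpayer": (17, 166), "taxpayers": (5, 38),
    "sociable": (2, 2),
}
SORTED_TERMS = sorted(INDEX)


def term(pattern):
    """Return the index words matched by a TERM-style argument."""
    if pattern.startswith("?"):
        # A leading "?" asks for words with the given ending.
        ending = pattern[1:]
        return [w for w in SORTED_TERMS if w.endswith(ending)]
    # Otherwise match on the beginning of the word. Words sharing a
    # beginning are adjacent in a sorted list, so scan from the first hit.
    matches = []
    for w in SORTED_TERMS[bisect_left(SORTED_TERMS, pattern):]:
        if not w.startswith(pattern):
            break
        matches.append(w)
    return matches


def freq(pattern):
    """Like term(), but add document count and total frequency."""
    return [(w, *INDEX[w]) for w in term(pattern)]


for row in freq("taxpayer"):
    print(row)
```

Keeping the terms sorted means every word with a common beginning sits in one contiguous run, which is why the scan can stop at the first word that fails to match.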
2.10 Guide command

EUREKA has a feature that provides lessons on particular subjects, such as how to use the system. The user is stepped through the subject matter and any exercises provided by the developer of the lesson. Eventually he returns to the normal EUREKA environment. When a lesson is available, typing "guide" puts the person into the lesson programs. Currently, GUIDE needs the "cai" database, so EUREKA switches the user to "cai" and back to the original database on exit from GUIDE. If CAI is not available, GUIDE cannot be entered.

2.11 Help command

Sometimes a user forgets a command format or a detail about how the command works. For times like these, EUREKA offers a help feature. Whenever the user is prompted for a new query by the "#", help may be requested simply by typing "help" or "h". The system then displays some general help information and a list of topics. After scanning this list the user selects the topic of interest, types the name, and then reads what EUREKA has to say. When the display is finished, EUREKA goes back to prompting for another query.

The help-getting process may be shortened by following "help" with the name of a topic. If the name is legal, then EUREKA bypasses the general step and goes right to the requested data. For example,

    help find

begins displaying the data on FIND. EUREKA often pauses at various times during the display and prompts with a "*". The user may then hit <return> to continue, type "k" to stop the printing, or give the name of another topic. In the last case the current topic is stopped, and EUREKA moves on to the new one. This is handy when the information in one topic triggers some thought covered in another.

In addition, help is available at many other times, such as when a prompt is given in browse mode. The specifics depend on the particular command.

2.12 LP command

This command sets a flag telling EUREKA to direct future output to a printer for hard copy instead of to the terminal.
Using it again returns the system to normal output, making the command a "toggle switch". The command, simply "lp", affects PRINT, TERM and FREQ. LP should not be used by the typical person, as it can generate lots of output. The printer may not be by the terminal, either, making the copy unavailable.

2.13 Make command

Query sets are usually formed by find commands. Explicit construction of new sets out of existing sets and documents is accomplished using MAKE. These new sets are given query numbers and are equivalent to ones built by FIND. The full make command syntax is

    make <expression> = <name> comments <string>

Only "make <expression>" is required, with the rest optional, as it is with FIND. The expression rules match those discussed in Section 2.2.2. Simple examples are

    Query #10 #make 3 or 4
       .
       .
       .
    Query #11 #make 2 and 5 minus [2,10]
       .
       .
       .
    Query #12 #make last-10 = pets

The first constructs query set #10, combining documents which are in either set #3 or #4. In the second case (notice the higher query number), only those documents which are in both query sets #2 and #5 are included. Documents 2 and 10 are not included, even if they are in both sets #2 and #5. The third makes query set #12, assigns the name "pets" and includes those documents in set "last" (that is, set #11), except for those also in #10. The net expression is the same as "(2 and 5)-[2,10]-(3 or 4)", but parentheses are invalid here. Otherwise, the operator precedence of search expressions applies, with MINUS at the same level as OR. Thus A+B-C*D means ((A+B)-(C*D)) and not (((A+B)-C)*D) or some other combination.

MAKE can build complex sets, if terms and operators are arranged properly. Sometimes splitting into two or more commands is still needed. To get the equivalent meaning of query #12, one request of

    make 2*5-3-4-[2,10]

would do the job.

Query sets formed by MAKE are not exactly the same as FIND results because MAKE produces no search terms.
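The three MAKE examples and the precedence rule can be checked with ordinary set arithmetic. The Python sketch below is illustrative only: the document numbers in each query set are invented, and the dictionary q is a hypothetical stand-in for a user's query sets. One caution for anyone trying this: Python's own precedence differs from EUREKA's (in Python "-" binds tighter than "&" and "|"), so the EUREKA grouping is written out with explicit parentheses.

```python
# Hypothetical query sets: query number -> set of document numbers.
q = {
    2: {1, 2, 7, 9, 10, 12},
    3: {7, 20},
    4: {9, 30},
    5: {2, 7, 9, 10, 12, 15},
}

q[10] = q[3] | q[4]                 # make 3 or 4
q[11] = (q[2] & q[5]) - {2, 10}     # make 2 and 5 minus [2,10]
q[12] = q[11] - q[10]               # make last-10 = pets

# make 2*5-3-4-[2,10]: AND first, then the MINUS steps left to right.
one_step = (((q[2] & q[5]) - q[3]) - q[4]) - {2, 10}
print(q[12], one_step)


def a_or_b_minus_c_and_d(a, b, c, d):
    """EUREKA reading of A+B-C*D: ((A+B)-(C*D))."""
    return (a | b) - (c & d)


a, b, c, d = {1}, {2, 3}, {1, 2}, {1, 2}
print(a_or_b_minus_c_and_d(a, b, c, d))   # EUREKA grouping
print(a | b - c & d)                      # Python's own grouping differs
```

With these made-up sets, the single-command form and the three-command sequence select the same documents, which is the equivalence the text claims for query #12.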
Because a MAKE set carries no search terms, a "PRINT" request using a MAKE set will not output anything, since PRINT would have no terms to search for and display! MAKE, then, is useful for making sets to feed to FIND through the "from" clause. At the time of this writing, this weakness is being removed. A search expression will be constructed out of the expressions from the sets used in the make request.

2.14 Message command

A user with a comment or question about EUREKA may express it with the message command. The EUREKA system people will read the messages and leave replies in the help file "reply" (i.e. access using "help reply"). MESSAGE has both a long and short form. The longer one is used by simply typing "message". EUREKA then gives brief instructions and prompts with "*". The user types the first line of his message and is then prompted again. He may continue to enter more lines, if desired. Hitting <return> right after the prompt gives an empty line and causes MESSAGE to exit. The shorter form skips the instruction phase and allows the user to type his first line on the same line as "message". Examples of each type are

    Query #9 #message
    *this is a message
    *with no meaning
    *
    Query #9 #message can I change my user
    *name to something else?
    *

2.15 T command - see thesaurus, Section 2.18

2.16 Users command

USERS is a command for satisfying curiosity about who is on EUREKA. By typing "users", the user sees a list of all the people currently signed on to the system, along with each person's terminal number and the time of his last command.

2.17 Words command

Evaluating a set of documents is time-consuming and difficult. The words command exists to provide aid in this area. WORDS extracts and displays keywords from a document set. The words should carry considerable information about what the documents really discuss. By viewing these words, we hope the user can evaluate her set more easily, or at least get some feedback on possible terms to incorporate in her next query attempt.
A command looks like

    words <query set>
    words <document list>
    words

If no argument appears, the "last" set is assumed. EUREKA produces a list of documents from the parameter and submits the list to the word extractor. Some sample commands are

    words irs                  (look up docs in set "irs")
    words 4                    (look up docs in query #4)
    words [4,8,17,33,48]       (use these 5 specific docs)
    words [10..20]             (the 11 docs from 10 to 20)
    words                      (same as "words last")

WORDS does much more than just display words, but let's start there. The words are divided into two classes:

    1) Those with high document counts (i.e. appear in many documents within the set).
    2) Those with high average frequencies. The average is the total frequency of the word within the set divided by the number of the set's documents that contain the word.

Suppose WORDS must analyze some set with 10 documents and that "tax" appears 6, 3, 5, 2, 6 and 8 times in six of the documents and never in the other four. Its document count is six and its average frequency is five: (6+3+5+2+6+8)/6 = 30/6, not 30/10.

Studies show that the words with high document counts and/or high average frequencies tend to be the content-bearing words in a set, provided that "noise" like prepositions is deleted. In fact, WORDS assumes that this modification of the input source was done previously.

The user will probably notice that few of the words displayed have suffixes; most are in singular, root form. While checking for "noise", the database constructor also tries to collapse existing words to root form. The idea is that multiple forms of a word still represent the same word, and that combining them gives a more accurate frequency picture of the root word's use. For example, "manage", "manages" and "managing" would be combined as three occurrences of "manage". This allows a word with several forms to add the respective frequencies together and make the keyword list when the individual forms might not have.
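The two statistics described above are easy to reproduce. The following Python sketch is an assumption-laden illustration, not the WORDS program itself: the function name word_statistics is hypothetical, and the ROOTS table is a made-up stand-in for the root-form collapsing, whose actual rules this guide does not give.

```python
from collections import Counter

# Hypothetical root table; the real suffix-collapsing rules are not
# described here, so only the "manage" example is covered.
ROOTS = {"manages": "manage", "managing": "manage"}


def word_statistics(documents):
    """documents: one list of words per document, noise words removed.

    Returns {root: (document count, average frequency)}, where the
    average divides by the documents containing the word, not by the
    size of the whole set.
    """
    totals = Counter()       # total occurrences of each root in the set
    doc_counts = Counter()   # number of documents containing each root
    for words in documents:
        counts = Counter(ROOTS.get(w, w) for w in words)
        for root, n in counts.items():
            totals[root] += n
            doc_counts[root] += 1
    return {r: (doc_counts[r], totals[r] / doc_counts[r]) for r in totals}


# The "tax" example: 6, 3, 5, 2, 6 and 8 occurrences in six of ten documents.
docs = [["tax"] * n for n in (6, 3, 5, 2, 6, 8)] + [["other"]] * 4
stats = word_statistics(docs)
print(stats["tax"])   # document count and average frequency for "tax"
```

Sorting the resulting dictionary by either number and taking the first 25 entries would give lists of the kind WORDS displays.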
EUREKA shows the words sorted by count, not alphabetically, and up to 25 at a time. Therefore, the 25 most common words appear, as well as the 25 words with the highest average frequencies. (A word may be in both lists.) A sample request with a query set of 32 documents is

    Query #10 #words 8                       (query set #8)
    documents: 32                            (set has 32 documents)
    words that appeared in many documents:
    SYSTEM-22        HOSPITAL-17     MEDICAL-13     PROGRAM-13     DATA-12
    PATIENT-11       AUTOMATE-10     BASE-10        LANGUAGE-10    PROGRAMMING-10
    ADMINISTRATE-9   CLINIC-9        DESIGN-9       PROCESS-9      TEXT-9
    MANAGE-8         DEVELOPMENT-7   HEALTH-7       SONS-7         COMPONENT-6
    DOCUMENT-6       PHYSICIAN-6     STYLE-6        TECHNOLOGY-6   APPROXIMATE-5
    words with high average frequencies:
    ANTIBIOTIC-8     PROGRAMMING-7   EPISODE-5      MEDICAL-5      MYCIN-5
    PROGRAM-5        SEGMENT-5       STYLE-5        ALGORITHM-4    CLINIC-4
    HOSPITAL-4       NETWORK-4       ORGANISM-4     SYSTEM-4       BRAIN-3
    NEIGHBORHOOD-3   PATIENT-3       PROVERBS-3     SORT-3         SPECIFICATION-3
    STRUNK-3         ACQUIRE-2       AMHT-2         APPROXIMATE-2  AUTO-2
    *                                        (<return> was pressed)
    Choose one of the options:
    0 = exit (<return> also means exit)
    1 = exit and save doc list
    2 = change the document list
    3 = see more words (if possible)
    4 = see original word list again
    5 = get help
    *2                                       (to be continued later)

This initial output shows that "system" was the most common word, appearing in 22 of the 32 documents. Note it also made the frequency list with an average of four occurrences per document for the 22 it was in. The highest average goes to "antibiotic", which could not have been in more than four documents or it would have made the other list, too.

A user presented with this output might notice that the set contains two distinct subsets: medicine and computer programming. We will use this fact shortly, after explaining what the choices mean.

Choice 0 - leave WORDS and return for another query.

Choice 1 - save document list. If a user entered a document list by hand or produced a narrow set using option 2, he may want to save the documents as a query set.
This allows studying the set with FIND and PRINT. WORDS builds a query set equivalent to one generated by a search, only the search expression will be empty.

Choice 2 - change list. This option can be extremely useful in narrowing the list closer to the target answer. By selecting a subset of the documents that contain particular words, the user removes unwanted documents. (Choice 1 is then needed to save the freshly made list.) The selection functions as either AND or OR, depending on a later decision. For AND, a document survives only if it has all the words chosen, while OR means take the document if it has any of the words. An example of this option is below.

Choice 3 - see more words. The original display includes only the top words in each category. WORDS does have more than that available, so this option fetches words from farther down each list. The categories and interpretation remain unchanged.

Choice 4 - reset. After using option 3, a user may want to see the "best" words again without re-doing everything. This choice just backs the display up to the beginning.

Choice 5 - help. This prints a brief message. To get more complete help, the user should exit WORDS and give a "help words" command.

Continuing with the example, suppose the user wants to select the documents discussing hospitals or medicine, throwing out the computer articles. To get that, also suppose he decides to accept one of the 32 documents if it has "hospital" or "medicine" or "antibiotic" in it. Remember that only the given set gets involved; the rest of the database is ignored. To start, he already entered option 2 above. The following output would be generated:

    Now select the documents to include in the new list
    but first, type 0 to "and" the words
    or type 1 to "or" the words
    or type 2 to back up to previous choices (i.e. change mind)
    *1                     (we want to OR the three words)
    The words displayed above will be reshown.
    For each word, type y to include it, <return> to discard it,
    or k to exit the section
    SYSTEM      *          (<return> was pressed)
    HOSPITAL    *y
    MEDICAL     *y
    PROGRAM     *k         (we got the 2 words from this part. K here means skip to the high average frequency section)
    ANTIBIOTIC  *y
    PROGRAMMING *k         (we don't need any more words from this section. K now means exit word selection and resubmit the new document list for processing.)
    documents: 19          (19 of the 32 had at least one of the 3 words)
    words that appeared in many documents:
    SYSTEM-19 HOSPITAL-17 MEDICAL-13 PATIENT-11 DATA-10
    (and so on . . .)

After the second display is finished, the user gets to choose one of the six options again. The identity of the original 32 documents is gone as far as WORDS is concerned. It only knows the 19 it received on the second pass.

If a query set is submitted to WORDS and a user decides to narrow the set, the actual set is not deleted. WORDS merely "forgets" what the original documents are.

2.18 Thesaurus feature

Many sections preceding this one referenced the "thesaurus". This complex feature will now be discussed. The user should remember that this entire section may be ignored if desired, but the thesaurus can be extremely helpful in making more efficient use of EUREKA. We do recommend not employing the thesaurus until the user is comfortable with most of the rest of EUREKA.

English is a diverse language with many words very similar in meaning to other words. Knowing these synonyms greatly increases the effectiveness of search expressions. By defining a class of synonyms, the user can arrange that putting one member in a find command fetches the entire class, so that each word within it is also searched for.

This optional feature is controlled completely by the user. Each one gets his own personal thesaurus to build, modify and use as he wishes. The commands available are ENTER, CHANGE, DELETE, PRINT and T. Before discussing the specifics, we will define a synonym and give examples.
1) A synonym class has up to three sections: names, terms and an expression.

2) The sections must appear in one of these combinations:
    - name(s) and term(s)
    - name(s) and expression
    - name(s) and term(s) and expression
    - term(s)
    - term(s) and expression

3) A term is a EUREKA search term (see glossary) which may be used to reference the synonym. This is correct for names, too, but when a synonym is expanded in a search expression, terms are included and names are not. The terms are ORed together, while names serve only as labels. These terms/names may include quoted phrases, prefixing and suffixing as usual. Using these additions is recommended, except for prefixing. When looking for matches between terms and thesaurus words, quotes and universal characters (i.e. "?") are ignored.

4) An expression is included in any synonym expansion. It does not get expanded itself, and neither do the terms mentioned above. The expression may be a full expression as in find commands, complete with "in" parts. Because of this power, and the interpretation of "in" by the system, the expression is surrounded by parentheses. The parens "cut off" the influence of the "in", keeping it confined to the thesaurus expression part. The words within the expression are not stored individually in the thesaurus, so a class that has an expression must have at least a name or term so that it can be referenced.

Examples

Now for some examples, assuming the user is doing FIND commands with the thesaurus turned on.

Synonym 1:
    names: x
    terms: y, z
    expr: a + b and c

The commands "f x", "f y" and "f z" would each result in an effective command of "f y or z or (a or b and c)".

. . . and thus a null line, means no definition for that part is desired. For names and terms, the user enters a list of words separated by spaces and/or commas. After all the parts are typed in, the legality of the new definition is checked. Next, each of the names and terms is examined for existing use.
Should at least one exist already, the user is asked whether he still wants the new definition entered. If so, or if no conflicts existed, the new class is added to the thesaurus. On a conflict the previous entries are not displayed, but the user may easily cancel the request, check them using PRINT SYN and then decide later what to do. Some examples of ENTER and the other commands will be given later.

2.18.2 Change

Once a particular class has been defined and the user has evaluated it, he may want to modify parts of the class. The change command exists for this purpose. Its format is "change syn <name or term>", such as "change syn x".

If the name/term given exists in the thesaurus, its definition is displayed and the user is prompted for action. If the term has multiple meanings, then the various ones are displayed until the user picks the one he wants to modify. When prompted for a response following the display, the user must choose one of the subcommands listed below. If the subcommand is legal, the necessary action is performed and the modified synonym is displayed. Prompting is done again, and this cycle continues until either the user exits with a null command or commits a major error. The subcommands are:

    - <return>                : exit the change command
    - help                    : display information on the change command
    - delete <word>           : remove the given word from the class
    - delete expr             : remove the expression from the class
    - add name <name>         : add another name to the class
    - add term <term>         : add another term to the class
    - add expr <expr>         : add or replace the expression
    - change <word> to <word> : a substitution of one by another
    - change expr to <expr>   : same as add expr

These commands preserve the legality of the class. For example, if the synonym has one name and an expression, then deletion of either part would violate class rules. Thus the deletion would not be performed.

2.18.3 Print

Many thesaurus requests display definitions of synonyms to the user. The print command allows explicit requests to see particular classes.
Its format is simply "print syn <name or term>". If the name or term exists, all of its definitions are shown, with prompting after every two. This prompting makes viewing easier for those few classes that may have many definitions, and for the second case of the command, "print syn all". The "all" keyword dumps the entire thesaurus on the screen, in the same format and pausing after every pair.

2.18.4 T command

The t (as in thesaurus) command is only for setting states, such as turning thesaurus use on and off. "T on" and "T off" do just that, and when the thesaurus is off, search expression terms are not looked up. When the thesaurus is on, lookup and expansion are automatic and require user action only when a term has multiple meanings. He must then choose the desired class.

Even when the thesaurus is on, it can temporarily be overruled within the search expression. Suppose a user has an expression of two words, one of which he wants checked in the thesaurus, and one he doesn't. Turning the thesaurus on assumes the user wants both, but if he types "§" as the first character of the term not to look up, the thesaurus will not be consulted for that term. Thus "f tax and §income", entered with the thesaurus on, would look up "tax" and not "income". Note that with the thesaurus off, the "§" is treated as a normal character.

T has one other form, also related to find commands. If the user wants to see expansion of terms within search expressions, "t display on" is needed. The condition stays on until either a "t display off" command is entered or until he exits the system. When DISPLAY is not on, nothing is printed beyond a normal FIND, so the thesaurus use is transparent, unless the user has to resolve a conflict. Of course, the thesaurus must also be on for DISPLAY to work.

2.18.5 Delete

Our final command is DELETE, which removes entire classes from the thesaurus. Its format is "delete syn <list of names or terms>", or "delete syn <name or term>". That is, deletion of several synonyms at once is possible in the first case.
The definition of the class is displayed and the user is asked if he still wants it deleted. This display helps to avoid mistakes, allows for changes of heart and resolves conflicts among those terms with many definitions. A multiple definition case may end up with no changes, all the definitions deleted or with some removed and some kept. When a class is deleted, each name/term in it is also looked up and reference to the now non-existent synonym is removed. If this class is the only one it is in, then that term is likewise purged.

By using "delete syn all", the entire thesaurus is cleared: all names, terms, classes and space are freed.

2.18.6 Sample EUREKA user session

System responses are underlined.

    Query #1 #enter                           (example 1)
    names: *
    terms: *tax taxes taxpayer? income irs 'internal revenue service'
    expr: *return and file
    Query #1 #print syn irs                   (ex. 2)
    term(s): TAX TAXES TAXPAYER? INCOME IRS 'INTERNAL REVENUE SERVICE'
    expression: (RETURN AND FILE)
    Query #1 #enter                           (ex. 3)
    names: *income
    terms: *wages salary pay employ?
    expr: *<return>
    INCOME has a definition already
    Should the new synonym be added?
    *y
    Query #1 #t on                            (turn thesaurus on, ex. 4)
    Query #1 #t display on                    (doing a FIND shows expansions, ex. 5)
    Query #1 #f salary                        (ex. 6)
    SALARY expands to: WAGES+SALARY+PAY+EMPLOY?
    . . . result of find command . . .
    Query #2 #f income                        (income has 2 definitions, ex. 7)
    INCOME has a multiple definition. Please pick one
    ... the definition with tax (example 1) is shown
    Is this the one? (y or n)
    INCOME expands to: TAX+TAXES+TAXPAYER?+INCOME+IRS+'INTERNAL REVENUE SERVICE'+(RETURN AND FILE)
    . . . result of find command . . .
    Query #3 #change syn tax                  (ex. 8)
    changes will be made to:
    ... the class definition with tax (ex. 1,2) is shown
    Please enter request-help available
    *change expr to return and file? and fail? in paragraph
    synonym now:
    terms: TAX TAXES TAXPAYER? INCOME IRS 'INTERNAL REVENUE SERVICE'
    expression: (RETURN AND FILE? AND FAIL? IN PARAGRAPH)
    Please enter request-help available
    *delete income
    synonym now:
    terms: TAX TAXES TAXPAYER? IRS 'INTERNAL REVENUE SERVICE'
    expression: (RETURN AND FILE? AND FAIL? IN PARAGRAPH)
    Please enter request-help available
    *                                         (exit change sequence)
    Query #3 #delete syn income               (ex. 9)
    (Income did have a multiple definition (ex. 1,3), but income was deleted from one class (ex. 8), leaving only one definition.)
    name(s): INCOME
    term(s): WAGES SALARY PAY EMPLOY?
    Do you really want it deleted?
    Query #3 #
    . . . and so on for the rest of the session

3. Comments on EUREKA

3.0 General search ideas

In broad terms, many things influence the effectiveness of EUREKA. These are

    1) The user's understanding of what the system can do.
    2) How well the user can define his problem to himself.
    3) How well he can express the problem to the system.
    4) How good the database is in the topic of interest (thorough coverage, up-to-date, etc.).
    5) How big the database is, both the number of documents and the size of the documents.

The first three cover the man-machine interface, affecting input to EUREKA. The other two consider the basic capability of EUREKA to answer the question completely and efficiently.

3.1 Search strategy

Because EUREKA is interactive, it gives quick response to requests. With such interaction available, the user need not try to formulate a detailed query that gives perfect results in one try. One should aim for an iterative strategy, gradually narrowing down the set of documents to get an answer.

When beginning investigation into a subject, the first query should aim for high recall, which means trying to retrieve as many of the potentially relevant documents as possible. This first set should give an idea of what the user faces. If the set is large, the query should be modified for a more restricted retrieval. If the set is not too big, the user should browse through some or all of the documents and evaluate the contents. Usually the set is too broad.
In this case, any evaluation (e.g. reading the text of a few of the documents) should have provided feedback about what changes in the search expression are needed. Since the target information is probably contained in that query set, the "from last" clause is recommended. By using "from" in a new query, the document set should be narrowed closer to the answer. By repeating the process as often as needed, the user should eventually discover the results hoped for.

Sometimes a few searches can be carried out at once. Say the user wants data about some general subject "X", but not when it also discusses "Y". MAKE is useful here. The first search series is performed and then the second, independent of the first. A make command, such as "make x-y", does the selective deletion of subject "Y" documents. Generalizing to more complex make statements provides even more power.

3.2 Suffixing

Using a "?" at the end of words is usually very handy, especially in early queries where high recall is preferred. The term command can show what words will be searched for through the use of suffixing. Sometimes a few undesirable words are included, as with "taxicab" in the Section 2.9 example. If so, conduct a search for those specific "bad" terms within the final result set, creating a new set. By then using a MAKE and deleting those new documents, their influence is removed. Of course, if one of the "good" words also appears in those documents, its occurrences will be lost. Therefore, this type of strategy requires caution.

3.3 Search failures

Even if the database has the needed information, that data can still elude diligent searchers. These failures come from several contributing areas:

1) Forgetting proper search terms. Since EUREKA indexes every word, the user may have to think of every word needed to cover a topic. Failure to include a major word in the search expression may remove the only chance to access relevant documents.
As a trivial example, a search for articles on pigs could fail if a user forgets that an article might reference "hogs" instead of specifically "pigs". A related problem is case 2.

2) No specific term available. In systems with a small controlled vocabulary, the indexer can think up a term suitable for representing the document. Continuing with the above example, suppose several hog diseases are discussed by name, without ever using a general term like "disease". In a controlled system, an index term like "hog diseases" might be used, but in EUREKA, where only exact words from the data are indexed, the user would have to think of particular diseases to list or try other words that mean disease. More likely, though, is that the document won't be retrieved.

3) Incorrect search expression. Even when the right terms are found, some searches fail because of improper logical use of AND, OR, IN, FROM and parentheses.

4) Poor analysis of the problem. This is more common when someone conducts a search for another person. Key ideas can easily be lost in the discussion. In a self-search, the user should carefully analyze exactly what he wants and define the target logical areas. For example, a search for "pollution caused by disposal of nuclear power plant wastes" would first be analyzed as having three distinct subtopics - nuclear power plants, waste disposal, and pollution. This may seem like a trivial step, but the same topic expressed as "land and water contamination by radioactive wastes" may not be split up the same way, even though the subject is effectively the same (assuming the user implied nuclear power with radioactivity). By the time the topic is translated into a search expression, an error in this conceptual phase can seriously weaken the search.

5) Recall versus precision (see glossary). A search usually involves a tradeoff between these two measures.
Before formulating the search request, the user should choose a strategy for either high recall or high precision. Using many general search terms with suffixing, but without restriction of contexts and ANDs, tends to increase recall at the expense of precision. If a small number of very relevant documents is satisfactory, then a strategy of very specific terms, quoted phrases, contexts, and so on, may be appropriate. Potentially relevant documents will probably be lost. With the iterative searching of EUREKA, trying for a high recall first and then narrowing down to improve precision usually works best.

6) Suffixing. Failure to include suffixing hurts recall when a term appears in a document only in some suffixed form and not in the root form given. However, trying for all endings often hurts precision by dragging in improper words with the same root.

7) Homographs. This hopeless problem comes from words with the same spelling but different meanings. Fortunately, it is not very serious, but an example is "bow". In a broad database, it could refer to ships or to arrows. Including another relevant term in the expression can provide the necessary restriction, such as "bow and arrow". That phrase is unlikely in an article on boats.

8) Failure to include peripherally related words. This idea is more esoteric than others. Consider including "detroit" as a term for a search on an auto industry topic, or "mental health" for a search on "mental illness" since illness is a condition of health.

9) False coordination. In a search for "tax", any document with "tax" is retrieved, even one with a single occurrence in an incidental passage. The problem, then, is words that aren't really relevant to the topic of the article, but are found anyway.

10) Incorrect term relationships. This problem is a serious one, since the appearance of two terms in a document does not mean the occurrences are related.
An expression like "a and b" assumes the dual appearance implies a relationship. Suppose "tax and rate and auto?" was an expression. A document about changing import tax rates on automobiles probably would be found, as would an article covering tax money wasted on government cars with poor mileage rates. If the first case is what the user had in mind, then the second would not be relevant. This problem of terms not appearing the way expected is the major force behind the use of contexts and quoted phrases. The closer relationship implied by them removes many of these improper matches. The remainder just cannot be handled without a more sophisticated system.

4. Glossary of terms

Context - each document is divided into several parts, such as author, title, references and so on. Such a division is called a context. All documents within a particular database have the same contexts defined, although some parts might be empty. For example, a document might have no references. The exact name of each context depends on the database. The contexts "sentence" and "paragraph" may appear arbitrarily many times within a document, but all others may appear only once.

Document - a database is separated into a series of documents, each assigned a number. What constitutes a document, such as one journal article or one chapter of a book, varies between databases. Every word within a document is added to the index and can be searched for.

Index - a search for a term consults the index to see if that word exists. The index contains every word that appears anywhere in the database, plus information about how often and in which documents.

Precision - a measure of search effectiveness. Mathematically defined as the number of relevant documents retrieved divided by the total number retrieved. Intuitively, it is the proportion of material found that is actually useful.

Query - a command to EUREKA to execute some request.
Each query has a number, but only those queries that generate a set of documents save the result and change the query number.

Query name - a label can be attached to the query results. It is then equivalent to the number. The name must start with a letter and may have up to nine more characters, each being a letter or digit.

Query set - the list of documents retrieved by a FIND or built with MAKE. A query set print includes the search expression, query name, the documents in the set and comments. Each set has a number to distinguish it from others.

Recall - a measure of search effectiveness. It is defined as the number of relevant documents retrieved divided by the total number of relevant documents in the database, ignoring how much extra junk was also retrieved. It measures the proportion of the good documents actually found. Getting high recall often forces low precision.

Search expression - FIND examines its search expression to determine what terms to search for and in what combination. The expression is a series of terms separated by the operators AND and OR. AND has higher priority than OR.

Search set - when a search is conducted, the set of documents actually examined is called the search set. The "from" clause allows a combination of query sets and documents to be specified as the search set. This confines searching to the documents of the search set and not the entire database.

Synonym class - an entry in the thesaurus, composed of terms and an expression. A class should represent a set of words similar in meaning, or can be considered shorthand for some commonly used search expression.

Term - an operand in a search expression, also what EUREKA can explicitly search for. A term follows a strict set of rules:

1) a maximum of 32 characters.

2) if the term contains more than one word, i.e. has spaces within it, it should be surrounded by single quotes. To include quotes within a quoted term, two consecutive quotes are needed.
For example, "Murphy's Law" would be entered as 'murphy''s law'.

3) if the term is unquoted, it is terminated by the appearance of a blank, "+", «««, "&", ")" or the end of the line. Note that single quotes can be embedded in unquoted terms. The terminators like "+" may appear inside quoted terms as part of the term.

4) a "?" at the beginning of a term signals prefixing, and one at the end means suffixing.

5) when the thesaurus is on, a "0" at the beginning of a term means don't look up this term in the thesaurus.

6) If a term is quoted or if it contains characters that are not just letters and digits (e.g. 27,610 has a comma), EUREKA extracts the substrings containing only alphanumeric characters. A document containing all such substrings is then searched for an exact match with the complete search term (e.g. 27,610 is indexed as 27 once and 610 once).

PART 2 - EURUP

1. Introduction

EURUP is a system for constructing, updating and maintaining databases for EUREKA. It is operational and useful, although incomplete, and still undergoing development. This part of the report documents the structure of EUREKA databases and the working portions of EURUP, as well as a few commands that are being finished or modified. It is meant to be a guide to the use of EURUP for constructing and maintaining databases.

Some of the dynamic properties of databases used by information retrieval systems such as EUREKA include steady or erratic growth over an extended period of time, occasional or periodic removal of obsolete data, and correction of random errors in the data (e.g. typographical). Hence there is a need to be able to modify or extend a database. But updates cause the database to become disorganized, which results in EUREKA taking longer to access desired data because it is not in the best location. Therefore, there must also be the ability to restructure the database, perhaps employing new hardware, as the database grows or improved techniques are applied.
The primary reasons for implementing EURUP are to provide easy procedures for constructing a database from raw text and to allow easy database updating. Listed below are other considerations in the design of the database structures and EURUP.

1) Unique, well-defined declarations and data accessing procedures will be provided.

2) The integrity of database structures will be maintained at all times.

3) The possibility of on-line updates should not be ruled out. This is an interesting problem and ideally, a large information retrieval system should be available 24 hours a day. Maintaining data integrity is a first step in this direction.

4) Reorganization of the various database components, the index in particular, is provided so that search times can be minimized.

5) Databases are capable of being split and stored on more than one volume (i.e. disk pack) in order to handle databases larger than a single volume.

6) Some of the data files in a database will have variable block sizes.

7) The updating system should make future data structure, format and procedure changes easier and less disruptive to the user environment.

The next section details the structure of a database, discussing the various system files needed and the database components stored in these files. Some knowledge of the DOS filing system is assumed and will undoubtedly be necessary in order to use EURUP and this manual. Section 3 talks about the inputs and outputs that EURUP uses and generates. In particular, the format of text to be input to a database is completely defined. Section 4 describes the functions of EURUP and gives the syntax and operation of the command language. The final section gives advice and some examples of using EURUP to build or modify a database.

2. Database Structures

A EUREKA database is a collection of "documents". Each document contains nothing but text and is broken up into a fixed number of sections, such as title, author, abstract, body, references, etc.
As will be seen, the number of sections and their names are database parameters, selected to suit the data and/or the database manager. The documents are assigned unique numbers for addressing purposes. Associated with each document is an optional "vocabulary file" which contains a list of words and the number of times they appear in the document.

The "database index" is a complete list of every word that appears in any document of the database. Each "term" in the database index has an associated "postings file" of document numbers, indicating which documents contain that term. With each document number in a postings file are the term frequency and context flags. These indicate which sections of the document the word appears in and how many times altogether.

In satisfying a "find" request, EUREKA first consults the database index, to determine which documents contain the desired logical combination of search terms. In some cases, the database index will not yield a final answer to the search request, but will give a list of documents that may satisfy the request, guaranteeing that no other documents in the database do. EUREKA scans the complete text of those documents to determine which ones actually satisfy the search request. For example, to find "income tax", a document with both "income" and "tax" must be checked to see if the terms are adjacent.

Databases are stored in several interrelated files kept in user code [1,5]. The "database directory" contains symbolic pointers to the other "extents" of the database. Each extent is a contiguous file, and there are three types: index, text, and vocabulary. Other components of a database are stored in files that have the same name as the database directory and an identifying file name extension.

2.0 Database Directory

The database directory is a relatively small contiguous file which is used to locate the other extents of the database and also contains the document and vocabulary directories.
The file name extension for the database directory is "DB". The database directory always takes an odd number of physical blocks (say 2n+1). The first block of the file (block 0) contains the "extent directory", which is a list of symbolic pointers to the other extents of the database. Each pointer indicates the volume (i.e. device and unit) where the extent is stored and the name of the file. These pointers are set by the create command when the database is created, and can be modified with the change command.

Blocks 1 thru n contain the "document directory". The jth entry in the document directory points to the start of document j, beginning with j = 0. Similarly, blocks n+1 thru 2n contain the "vocabulary directory", and the kth entry points to the start of the vocabulary file for document k. Note that the vocabulary directory exists even if the database contains none of the optional vocabulary files. At database creation, a value for the maximum number of documents in the database is input. It determines the sizes of both the document and vocabulary directories, and hence the size of the database directory file.

2.1 Database Index

Every database has exactly one database index extent, referenced through the database directory. This extent contains the "terms file" and the "postings files". The terms file is indexed using ISAM techniques for quick word searches. Block 0 of the extent is a header of control information and statistics for the extent. The extent also contains a bitmap so that space for terms and postings can be dynamically allocated.

The terms in the index are restricted to being alphanumeric strings of any length less than 32 characters. Special characters are not indexed. Each term is assigned a unique 24-bit integer value in lexicographic order. Actually, the first character of a term forms the high 8 bits of the term number.
The low 16 bits are selected with more or less even spacing over all the terms that begin with a particular letter or digit. This leaves gaps, which usually allow new terms to be inserted without term number conflicts. The use of term numbers and some of the associated problems are discussed in subsequent sections.

Normally, each term entry contains a pointer to the start of its associated postings file. Each posting entry represents a single document and holds a term frequency for how many times the term appears in that document. The value is limited to a 7-bit field, and so it is set to 127 if the term appears more than 126 times. Truncation causes inaccuracy only for very frequent terms, which typically bear no information anyway.

In normal English there are always a few words that appear many times and in many documents, like "the". These words have little information content but have large postings files. If a postings file gets to be a certain size it is deleted, but the term entry remains with a special nil postings file pointer. This is taken to mean that the term appears in every document (even though it may not). The deletion threshold is currently fixed at 500 postings, but in the future should be made a parameter of the database.

2.2 Text

The text of each document is stored in a "document file". Each document file contains a "document context index" and the text of the document. The document context index is stored in the first text block of the document and contains a pointer to the beginning of each section of the document as well as a pointer to the end of it. The text blocks are doubly linked so that EUREKA browse mode can scroll forward or backward through the text.

Document files are stored in text extents, which are large contiguous files. A database can have a variable number of text extents, but must have at least one. The size of each text extent is set when it is allocated. All text extents have the same structure.
The first block contains control and statistical information. A bitmap is stored at the end of the extent. This is used for dynamically allocating text blocks from the remaining space. If all the text extents fill up, either they may be enlarged and reorganized, or an additional text extent may be allocated.

The extent directory (part of the database directory) points to each of the text extents in the database. The document directory (also in the database directory) is used to look up arbitrary document files. Each entry in the document directory specifies in which text extent the document file is stored, as well as the relative address of its first block of text within the extent. Note that a document is completely contained in one text extent. It cannot be split across extent boundaries.

2.3 Vocabularies

Each document's vocabulary file is a sorted list of term numbers and frequency values for selected words from the document. A vocabulary file may be created during document insertion, or it may be (re)generated for an existing document at any time.

Vocabulary files are stored in vocabulary extents, which like the text extents are large contiguous files. A database can have a variable number of vocabulary extents, including none at all, in which case no vocabulary files may be created. The size of each vocabulary extent is set when it is allocated. They have the same structure as the text extents, but contain data blocks for the vocabulary files rather than text blocks. Vocabulary extents can be enlarged, reorganized and allocated just like the text extents.

The extent directory (in the database directory) points to each vocabulary extent in the database. The vocabulary directory (also in the database directory) is used to look up arbitrary vocabulary files. Each entry in the vocabulary directory specifies which vocabulary extent the vocabulary file resides in, as well as the relative address of its first data block.
Like document text, a vocabulary file cannot be split across extent boundaries.

Vocabulary files contain term numbers instead of the actual term text because of the storage savings and the quick processing made possible for some EUREKA programs. After processing vocabularies, the term numbers are used to search the terms file in order to display the actual words.

Originally a vocabulary file was to represent a document exactly, having all and only those words that are in the text. However, the needs of EUREKA commands dictated otherwise. During vocabulary construction, the document text now passes through a "stemmer", which converts most suffixed terms into root forms. The frequency count for the root is a sum of the counts for any suffixed forms in the document. For example, suppose a document contains "manage", "managing" and "managed". Its vocabulary file will contain only "manage", but the frequency field will include the occurrences of all three.

A vocabulary entry is always a "real word", meaning a word in the database index. A root produced by the stemmer is rejected if the stem cannot be found in the index. Instead, the original word form is entered in the vocabulary. One major reason for this decision is that each vocabulary entry needs a term number, but only indexed words are assigned numbers. Another reason is that EUREKA users can read vocabulary contents indirectly through the words command, so the entries should be real.

Further vocabulary modification comes through use of the "stop" and "save" lists. The vocabulary builder throws out all terms under four characters and those that begin with a digit or are on the stop list. The save list prevents some words from being deleted and prevents unwise stemming in other cases. See Section 2.5 for more details.

Stemming, stopping and saving can be avoided if a vocabulary in the original scheme is desired. The controls exist as program source code switches, though, making necessary a new assembly, etc.
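The vocabulary rules of this section, together with the stop and save list rules of Section 2.5, can be sketched in a few lines of Python. This is only an illustration, not the EURUP code: the function name build_vocabulary, the lookup-table stemmer and the exact order of the checks are assumptions, and words rather than term numbers are used as keys for readability.

```python
from collections import Counter

def build_vocabulary(words, index, stem, stop=frozenset(), save=frozenset()):
    """Return {term: frequency} for one document (illustrative sketch only)."""
    vocab = Counter()
    for word in words:
        word = word.lower()
        if not word or word[0].isdigit():
            continue                  # digit-initial terms go; the save list cannot help
        term = word
        if word not in save:          # the save list is consulted before stemming
            root = stem(word)
            if root in index:         # a root that is not an indexed word is rejected
                term = root
        if term in stop:
            continue                  # stop list is checked after stemming
        if len(term) < 4 and term not in save:
            continue                  # short terms survive only via the save list
        vocab[term] += 1
    return dict(vocab)

# Hypothetical stand-in for the stemmer: a plain lookup table.
STEMS = {"managing": "manage", "managed": "manage",
         "taxes": "tax", "automatic": "automate"}

index = {"manage", "managing", "managed", "tax", "taxes",
         "the", "1979", "automatic", "automate"}
text = "manage managing managed tax taxes the 1979 automatic".split()
vocab = build_vocabulary(text, index, lambda w: STEMS.get(w, w),
                         stop={"the"}, save={"tax", "automatic"})
print(vocab)
```

With the sample input, the three forms of "manage" collapse into one entry, "tax" survives the length rule because it is on the save list, and "automatic" is kept from becoming "automate".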
2.4 Context Definitions

The context names for a database are defined in a special file. This file has the name of the database and an extension of "CTX". It is a simple text file, created and modified with the editor.

As mentioned before, a document can consist of up to 12 sections. Any subset of sections can be given a context name and any section can be included in any number of context definitions. The definitions are used by EUREKA for two purposes. One is to translate a user's context specifier into a representation for the actual sections of the documents which are to be searched or printed. For example, "cite" might be defined as sections 1,2,3 and 5. A EUREKA command of "find knuth in cite" would cause EUREKA to search for all documents that contain "knuth" in any of sections 1,2,3 or 5. Anytime EUREKA expects a context name in a command, it will only accept a name that is in the context definition file. Obviously, EUREKA keywords, such as "from" or "last", cannot be used as context names.

The other purpose of the context definition file is to define labels to be printed as the various sections of a document are displayed for the user. For example, if "title" is defined as section 1, then "title" will be used as a label anytime the text of section 1 is displayed.

How the text of a document is broken into 12 sections will be described in Section 3, but it should be noted that all of the documents in a database will use the same context definitions, so they should all have the same logical organization. If the organization of the documents in a database only requires a few sections, then any of the 12 can be used and the remainder left empty. The empty sections need never be referenced in any of the context definitions. However, care must be taken to ensure that no data is put into a document section for which there is no context definition, or else that data will not be searchable or printable.
There is one context name definition per line in the CTX file. Each line contains three items separated by single spaces or commas, the format being:
The context name must be alphabetic upper case. The section flags item is an octal word of bit flags which indicate the sections represented by the context name, interpreted as follows:

    bit 15  paragraph        bit 7  section 6
    bit 14  sentence         bit 6  section 7
    bit 13  ignored          bit 5  section 8
    bit 12  section 1        bit 4  section 9
    bit 11  section 2        bit 3  section 10
    bit 10  section 3        bit 2  section 11
    bit 9   section 4        bit 1  section 12
    bit 8   section 5        bit 0  ignored

The paragraph flag indicates that searching (printing) is to be restricted to individual subdivisions of the sections being searched (printed). The sentence flag increases restriction to subdivisions of paragraphs. For example, if the context file contains the following entries:

    SUBSECTION 100000
    HEADNOTES 20 9
    NOTES 100020

then a context specification of "in subsection in headnotes" is equivalent to "in notes", and means that subdivisions of section 9 should be searched or printed, because bit 15 (paragraph) is on and so is bit 4 (section 9).

The section number item identifies that the name of this entry shall serve as a label when printing the section. Typically there is a context name for each individual section of a document, and this name is also used as the label for printing. However, this is not mandatory. It is valid to create an entry which is to be used only as a label by setting the section flags to 0. In this case, the context name will not be acceptable in a search or print request. Also, entries that include more than one section usually are not used as section labels, and so the section number of these entries is set to 0. In the above example, HEADNOTES will be printed as a label any time text from section 9 of a document is displayed.

The order of the entries in the context definition file is only important for the determination of abbreviations. EUREKA does a linear scan on the entries, looking for the first one with a prefix that matches the user's context specification.
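The flag word and the abbreviation scan described above can be sketched as follows. This is a minimal Python illustration and not EUREKA's own code: the function names are made up, and treating a missing section number as 0 is an assumption based on how the entries in this section are written.

```python
PARAGRAPH = 1 << 15   # bit 15
SENTENCE = 1 << 14    # bit 14

def parse_ctx(lines):
    """Parse context definition lines into (name, flags, label_section) tuples."""
    entries = []
    for line in lines:
        parts = line.replace(",", " ").split()
        if len(parts) < 2:
            continue
        label = int(parts[2]) if len(parts) > 2 else 0   # assumed default of 0
        entries.append((parts[0].upper(), int(parts[1], 8), label))
    return entries

def sections_of(flags):
    """Section s (1..12) is carried by bit 13 - s of the octal flag word."""
    return [s for s in range(1, 13) if flags & (1 << (13 - s))]

def resolve(entries, spec):
    """Linear scan: the first entry whose name starts with spec wins."""
    spec = spec.upper()
    for name, flags, _label in entries:
        if name.startswith(spec):
            return name, flags
    return None

sample = parse_ctx(["DATA 17600", "DATE 1000 4", "TEXT 36", "HEADNOTES 20 9"])
print(resolve(sample, "dat"))             # scan stops at DATA, before DATE
print(sections_of(0o100000 | 0o20))       # "in subsection in headnotes"
```

Combining two context names amounts to OR-ing their flag words, which is why SUBSECTION together with HEADNOTES gives the same word as NOTES.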
For example, if the file contains:

    DATA 17600
    DATE 1000 4

then only "d" is required to specify the DATA context, but "date" is required to indicate the DATE context. "dat" will match DATA before DATE is found, and so it will be interpreted as DATA. This example assumes that there are no entries preceding DATA which begin with D.

The context file is read by EUREKA when a user signs onto a database. The file need not be present, in which case EUREKA reads standard context definitions from file "STD.CTX[1,5]". The actual contents of this file are:

    PARAGRAPH 100000
    SENTENCE 40000
    DOCUMENT 17776
    DATA 17600
    AUTHOR 10000 1
    TITLE 4000 2
    SOURCE 2000 3
    DATE 1000 4
    PAGES 400 5
    MISCELLANEOUS 200 6
    INDEX 100 7
    TEXT 36
    WORDS 40 8
    ABSTRACT 20 9
    BODY 10 10
    FOOTNOTES 4 11
    REFERENCES 2 12

2.5 Stop and Save Lists

Each database has optional "stop" and "save" files of text, with names equal to the database name and extensions of "STL" and "SVL", respectively. The two files also have encoded versions, but that discussion is deferred to the stop command in Section 4.5.8.

Section 2.3 mentioned EUREKA programs that use vocabularies. They produce more meaningful results if "noise" is purged from the vocabularies, namely those high frequency words of low meaning, like prepositions, articles and common names. The stop list contains terms to delete during vocabulary construction. The list is checked after stemming, so "noise" that comes in many suffixed forms need be entered only once.

The stop list has a counterpart for saving words. Terms under four characters long (after stemming) are automatically pitched, unless they are on the save list (e.g. "tax"). Without question, most short words are useless, but this allows the few useful ones to survive. Terms beginning with a digit are also deleted, but the save list cannot prevent that.
The stemmer always consults the save list first, so that specific suffixed terms can be kept and to prevent occasional unwise decisions on the part of the stemmer. For example, if one would really like to keep "automatic" from becoming "automate", then an entry of "automatic" would do so.

Detailed discussion of what should actually be stopped or saved is beyond this manual, but the list format is not. The two files have identical formats of upper case text produced by the editor or automatically through a version of the stop command. Any number of words (starting with a letter) may be on a line, separated by blanks and/or tabs. The lists must be sorted by first letter only, although full sorting is certainly allowed. For example, all the "A"s must appear before the "B"s, but any ordering within the "A" group is acceptable. The first entry on a line must start in column 1, and all words on a line must start with the same letter.

3. EURUP Inputs and Outputs

EURUP is an interactive system because it reads commands from a terminal. It hardly acts like one, though, since many of the commands cause the system to execute for minutes or even hours. Commands can be entered to update a database with a batch of documents or to reorganize a large database extent, requiring considerable I/O. Besides reading commands, the terminal is used for logging error, warning and statistics messages. The EURUP command language is described in Section 4. The rest of this section discusses the various input and output data files which EURUP uses and generates.

3.0 Document Input File Format

One of EURUP's primary functions is to update a database by inserting, replacing or deleting whole documents. When inserting or replacing a document, EURUP gets the text for a document from a "document input file". This is a standard text file which contains special embedded "markup" codes. A big problem, which cannot be entirely handled by EURUP, is the generation of these document input files.
Typically, text for a database comes from some external source, such as a publisher. The complete text is written in a single stream, containing markup codes used by the database supplier. These markup codes generally define various section boundaries, character code translations and/or provide formatting information for use by a typesetting system. A special formatting program must be written to break the text into individual document input files, which can then be processed by EURUP. The document file names for a given database should all start with a prefix of one or two letters, followed by a document number. They do not have to be in user code [1,5] as do the database extents. See the insert command (Section 4.3.1) for more details. This formatter must do appropriate conversions on the supplied markup codes in the original text.

The format of a document input file has been designed to simplify the formatting programs as much as possible. The EURUP markup language contains a few codes that delimit sections of text and some that only supply format information used when displaying the text. The text of the document is format free, that is, all format information for the display of the text is contained in the markup codes. The normal ASCII format control characters, such as carriage return and line feed, are treated as blanks. Strings of blanks are quashed into a single blank.

The sections of a document need not appear in numerical order and can be interleaved. For example, a document might first contain text for section 5, followed by text for section 1, and then more text to add to section 5. In addition, there is a special markup code that causes the document input processor to go back to the previous section. Paragraph markers can be placed to subdivide a section, and sentence markers to subdivide a paragraph.

An asterisk indicates that a markup code follows. Below is a list of all the codes that are used in the EURUP markup language.
For the alphabetic codes, case is ignored. Some codes accept a numeric parameter, represented by n.

Document Markup Codes

    *Cn   Specifies which section the following text is to be inserted in. n can be 0 thru 12. If 0, the following text is discarded, otherwise it is put into section n. If n is greater than 12, the markup code is ignored and a warning message is logged.

    *C-   Leave the current section and place the following text in the section that was previously used (i.e. being processed just before the current one).

    *P    Paragraph boundary. Any adjacent blanks, empty sentences or empty paragraphs are discarded.

    *S    Sentence boundary. Any adjacent blanks or empty sentences are discarded.

    *#n   Formatting command which forces n blanks in the text. Any surrounding blanks in the text are discarded. n can be 0, which forces the surrounding text to be adjacent even if a blank appears in the text.

    *+    Formatting command which forces a new line. Text is normally formatted for display by filling output lines. A new line can be started at any point with this command.

    **    Used to force an asterisk into the document text.

The document input processor starts a new document in section 0. In other words, all text up to the first *C code is discarded. Here is an example input document to clarify the use of the section codes.

    alpha *c7 beta*c2gamma *c- delta *p *s rho *s omega *p*c5 sigma *c5 epsilon *c- omicron *c7 zeta

    text      location
    alpha     discarded
    gamma     section 2
    sigma     sect 5
    epsilon   sect 5 (continued)
    omicron   sect 5 (continued)
    beta      sect 7, paragraph 1
    delta     sect 7, para 1 (continued)
    rho       sect 7, para 2, sentence 1
    omega     sect 7, para 2, sent 2
    zeta      sect 7, para 3

3.1 Statistics and Message Output Files

Many of the EURUP functions generate statistics about their data processing phases. Such statistics are normally formatted and output to standard text files. The updating commands in EURUP generally provide for the specification of a log file which is opened for extension.
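Before continuing with the log files, the section codes of Section 3.0 can be made concrete with a short Python sketch that routes the example input document to its sections, paragraphs and sentences. It is an illustration under assumptions, not the EURUP input processor: the function name route is invented, the formatting codes *#n and *+ are simply skipped, and *C- is assumed to swap the current and previous sections.

```python
import re

# One alternative per markup code: *Cn or *C-, *P or *S, *#n, *+, **
CODE = re.compile(r"\*(?:[cC](\d+|-)|([pPsS])|#(\d+)|(\+)|(\*))")

def route(text):
    """Return {section: [paragraph, ...]}; a paragraph is a list of sentences,
    and a sentence is a list of words."""
    sections = {}
    current, previous = 0, 0          # a new document starts in section 0
    pos = 0

    def add_words(chunk):
        words = chunk.split()         # control characters and blank runs collapse
        if current == 0 or not words:
            return                    # text in section 0 is discarded
        sections.setdefault(current, [[[]]])[-1][-1].extend(words)

    for m in CODE.finditer(text):
        add_words(text[pos:m.start()])
        pos = m.end()
        sect, brk, _blanks, _newline, star = m.groups()
        if sect == "-":
            current, previous = previous, current     # assumed swap on *C-
        elif sect is not None:
            if int(sect) <= 12:                       # n > 12: code is ignored
                current, previous = int(sect), current
        elif brk and current in sections:
            paras = sections[current]
            if brk.lower() == "p":
                if any(paras[-1]):                    # skip empty paragraphs
                    paras.append([[]])
            elif paras[-1][-1]:                       # skip empty sentences
                paras[-1].append([])
        elif star:
            add_words("*")                            # ** is a literal asterisk
    add_words(text[pos:])
    # Drop any empty sentences or paragraphs left at the ends.
    return {s: [[sent for sent in p if sent] for p in paras if any(p)]
            for s, paras in sections.items()}

example = ("alpha *c7 beta*c2gamma *c- delta *p *s rho *s omega "
           "*p*c5 sigma *c5 epsilon *c- omicron *c7 zeta")
result = route(example)
for section in sorted(result):
    print(section, result[section])
```

Run on the example input document, the sketch places each word in the location given in the table of Section 3.0.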
While processing a command, all statistics and certain warning or other informative messages are written to the log file. If no log file is specified, the default is to log this information at the terminal. Errors severe enough to abort the current command generate messages at the terminal as well as in the log file. Other commands will be available for formatting and printing information about what is stored in the database. The actual information written by the various commands is given in Section 4 with the command descriptions.

It is conventional to specify a log file of "database name.LOG", so as to develop a file with a fairly complete history of the updates made to a database. As with the document input files, the log files are not restricted to user code [1,5].

3.2 Temporary Files

There are a few instances in EURUP where the CPU's memory is not large enough to hold all the data being processed. When necessary, EURUP will generate appropriate temporary disk files. Usually the temporary files will be deleted when no longer needed, but the program code may sometimes be conditionalized to save them. In either case, temporary files will have an extension of "TMP" and will not be protected, so that PIP can be used to delete any garbage that may accrue. Sometimes the user may supply a name for a temporary file and keep it for future purposes. The use and format of temporary files will be explained further as they are encountered in the command descriptions of Section 4.

4. EURUP Functions and Command Language

EURUP runs under the same operating system as EUREKA itself and may eventually be able to run concurrently with EUREKA. It cannot do so now because there is no interlocking mechanism between EUREKA and EURUP. A database which is being used by a EUREKA user should not be updated by EURUP, and vice versa. EURUP has been designed, however, with such mechanisms in mind, and such features may be included later.
The terminal is used to enter commands and to log error messages and statistics. EURUP does extensive error checking, and with knowledge of the database structures and operation of EURUP, the error messages should be self-explanatory.

There are two command levels in EURUP, called command mode and update mode. EURUP starts in command mode and prompts the user with a "#". In command mode, a CSI string is read which specifies a database name, a command switch, and possibly some parameters. The commands at this level operate on database extents (i.e. system files) and usually don't require access to the whole database. One of the command mode commands causes the specified database to be "opened", which involves setting up communication between EURUP and all the extents of the database. Update mode is then entered and the user is prompted with an "*". Update mode is concerned with operating on a database as an entity. At this level, commands can be entered for updating or listing information about the database.

The notation used in the command descriptions is fairly simple. Commands and keywords are shown with the first few letters capitalized. The capitals indicate the minimal abbreviation accepted by EURUP. Any letters following this abbreviation are simply ignored. Lower case items are parameters supplied by the user. An item in square brackets is optional or only applies to certain commands. Wherever a space is shown separating words, any number of spaces can be used, and in most instances one comma may also be inserted (in which case no spaces are necessary). All other special characters behave like keywords and must appear as shown.

4.0 Command Mode

Command mode commands are expected when EURUP prompts with a "#". These commands have the general form:

    database name /command [ : parameters ][ , file parameters]

The database name is a file specification for the database directory. Some commands prompt the user interactively to obtain additional parameters.
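The abbreviation rule and the general command form above can be illustrated with a short Python sketch. It is not EURUP's own command reader. The function names and the database name MYDB are hypothetical, matching is assumed to ignore case (the guide states this only for markup codes), and everything after the command word is kept as uninterpreted text because the parameter syntax depends on the individual command. The notation "CReate" is taken from Section 4.2.1.

```python
import re


def minimal_abbreviation(notation):
    """Return the leading capitals of a notation such as 'CReate' ('CR')."""
    return re.match(r"[A-Z]*", notation).group(0)


def matches(notation, typed):
    """True if the typed word begins with the minimal abbreviation.

    Letters after the abbreviation are ignored, as the guide describes.
    Case-insensitive comparison is an assumption of this sketch.
    """
    abbreviation = minimal_abbreviation(notation)
    return bool(abbreviation) and typed.upper().startswith(abbreviation)


def split_command(line):
    """Split 'database/command...' into (database, command word, remainder)."""
    database, _, rest = line.partition("/")
    word = re.match(r"[A-Za-z]*", rest).group(0)
    return database.strip(), word, rest[len(word):].strip()


# Hypothetical usage: decide whether a typed line is a create command.
database, word, remainder = split_command("MYDB/CR")
is_create = matches("CReate", word)
```

Under these assumptions "CR", "cre" and "CREATE" are all accepted for CReate, while "C" alone is not, since it is shorter than the minimal abbreviation.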
Some commands generate information that is sent to a log file, or to the terminal if no file is specified. The commands available and the sections where details can be found are:

    4.2.1  CREATE      creates a new database.
    4.2.2  EXTEND      attaches an additional extent to an existing
                       database or enlarges the database directory.
    4.2.3  DETACH      removes an existing extent from a database.
    4.2.4  ZERO        re-initializes a database, deleting all data
                       from it.
    4.5.1  OPEN        opens a database and causes EURUP to enter
                       update mode.
    4.4.3  REORGANIZE  reorganizes a database extent.
    4.4.1  DUMP        converts an extent to a linear form by dumping
                       it to a sequential file.
    4.4.2  LOAD        reloads an extent from a previously dumped
                       sequential file.
    4.5.3  DIRECTORY   provides a listing of information in the
                       database directory.
    4.5.4  CHANGE      provides the ability to change the extent
                       pointers in the database directory.

4.1 Update Mode

Update mode commands are expected when EURUP prompts with a "*". All update commands operate on the currently open database. The commands available in update mode are:

    4.3.1  INSERT      inserts documents or vocabularies into the
                       database.
    4.3.2  REPLACE     replaces documents or vocabularies in the
                       database.
    4.3.3  DELETE      deletes documents or vocabularies from the
                       database.
    4.5.2  EXIT        closes the database and returns to command
                       mode.
    4.5.5  SELECT      specifies which text (vocabulary) extent
                       document (vocabulary) files are to be inserted
                       in.
    4.5.6  MOVE        allows the user to move document or vocabulary
                       files from one extent to another.
    4.5.7  NUMBERS     controls term number assignments.
    4.5.8  STOP        makes a compact, coded stop list from text
                       input, or semi-automatically builds the text
                       file.
    4.5.9  UNLOCK      recovers a locked database.

4.2 Database Creation, Extension and Deletion

Creating a database involves allocating and initializing the various files, including the index, text and (optionally) vocabulary extents, the context definition file, and the stop and save lists. The database is initially empty.
Update mode is then used to insert documents into it one by one. In its simplest form, a database consists only of the database directory, index and text extents. If vocabularies are to be generated, then a vocabulary extent is needed. Additionally, stop and save list files can be created for purposes of tailoring the vocabularies.

Once created, a database has a fixed size, and hence only so much data can be inserted. However, if it becomes full, there are ways to make it larger without having to completely reconstruct it from the document input files. If the text or vocabulary extents become full, an additional extent can be allocated and attached to the database. If the index fills up, it must be reorganized and copied to a larger file.

If enough documents are deleted from a database to empty a text or vocabulary extent, that extent can be detached from the database and deleted. It may also be possible to move document or vocabulary files from one extent to another in order to empty an extent so that it can be deleted. It is always possible to use system utilities to simply delete all of the files that comprise a database. There is also a convenient command that deletes all the documents and vocabularies of a database, leaving it empty as if it were just created.

4.2.1 Create

The create command allocates a database directory file and initializes extents for a database. The form of a create command is: database/CReate