
Using SMART

This section covers technical details about using SMART. It describes the file formats, the information stored in each file, and the commands available for viewing these files.

When SMART builds a database (an indexed collection) from a collection, it creates a series of files describing the index and the contents of the documents in the collection. The main files generated when SMART indexes a collection are the dictionary files ("dict files"), which contain information about each token (or "concept") of the original documents, and the inverted files ("inv files"), which give the list of document numbers for a given concept number.

Inverted files are an instance of "direct relational objects". Without going into much detail (does anyone have something to contribute to this section?), they seem to represent a serializable data structure that supports cross-referencing of information when loaded in core. A direct relational object always has a corresponding file on disk inside the database directory, and is always accompanied by a ".var" file. For example, "inv" and "inv.var" are two files that together represent a single direct relational object of type "inv file".
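
The pairing rule described above lends itself to a quick sanity check on a database directory. The sketch below is only an illustration and is not part of SMART: the function name missing_var_files and the default list of object names are assumptions chosen for the example.

```python
import os
import tempfile

def missing_var_files(db_dir, objects=("inv", "textloc")):
    """Return the names in `objects` that exist in db_dir without a '.var' companion.

    `objects` lists file names assumed to be direct relational objects;
    adjust it to match the files actually present in your database.
    """
    missing = []
    for name in objects:
        main = os.path.join(db_dir, name)
        var = main + ".var"
        if os.path.exists(main) and not os.path.exists(var):
            missing.append(name)
    return missing

# Demonstration on a throwaway directory rather than a real SMART database.
with tempfile.TemporaryDirectory() as db:
    for fname in ("inv", "inv.var", "textloc"):
        open(os.path.join(db, fname), "w").close()
    demo_missing = missing_var_files(db)
    print(demo_missing)
```

In the demonstration, "textloc" is created without "textloc.var", so it is the only name reported.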

SMART normally produces the following files while indexing a new collection:

dict: Dictionary file. This file holds the concept number assigned to each token of the collection's documents. When loaded in memory, it is accessed through a hash table. A record in a dict file contains the following information:

        con info token
Where:
    - con (long) is the concept number assigned to the token
    - info (long) is a usage-dependent field
    - token (string) is the actual token.
A dict entry is defined by the struct DICT_ENTRY, found in the file src/h/dict.h. The SMART action that outputs a dict entry is the function print.obj.dict from src/libprint/p_dict.c.
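
If dict entries are dumped to text in the con, info, token order shown above, they can be read back with a few lines of Python. This is a sketch under assumptions: the whitespace separator, the sample lines, and the names DictEntry and parse_dict_line are illustrative and are not taken from SMART's sources.

```python
from typing import NamedTuple

class DictEntry(NamedTuple):
    con: int     # concept number assigned to the token
    info: int    # usage-dependent field
    token: str   # the token itself

def parse_dict_line(line):
    """Parse one 'con info token' record, assuming whitespace-separated fields."""
    con, info, token = line.strip().split(None, 2)
    return DictEntry(int(con), int(info), token)

# Made-up sample records, for illustration only.
sample = ["1021 0 retrieval\n", "2210 0 index\n"]
entries = [parse_dict_line(line) for line in sample]

# Map each token to its concept number, mimicking a hash-table lookup.
con_by_token = {e.token: e.con for e in entries}
print(con_by_token["retrieval"])
```

The dictionary con_by_token plays the role of the in-memory hash table: given a token, it returns the concept number.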

doc_loc: Document locations file. This file is actually an input to SMART rather than an output; it lists the documents that make up the collection, one file name per line with its full path. It is usually created in the database directory (under indexed_colls/$database) by the script used for indexing (e.g. make_cacm, make_docsmart).
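
Since doc_loc is a plain list of full paths, any script can generate it. The following is a minimal sketch, assuming the source documents all sit in one directory; the function name write_doc_loc and the demo directory layout are hypothetical, and the real make_cacm or make_docsmart scripts may proceed differently.

```python
import os
import tempfile

def write_doc_loc(source_dir, doc_loc_path):
    """Write one absolute file path per line for every regular file in source_dir."""
    paths = sorted(
        os.path.abspath(os.path.join(source_dir, name))
        for name in os.listdir(source_dir)
        if os.path.isfile(os.path.join(source_dir, name))
    )
    with open(doc_loc_path, "w") as out:
        for path in paths:
            out.write(path + "\n")
    return paths

# Demonstration with temporary directories standing in for a real collection.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as db:
    for name in ("b.txt", "a.txt"):
        open(os.path.join(src, name), "w").close()
    doc_loc = os.path.join(db, "doc_loc")
    written = write_doc_loc(src, doc_loc)
    with open(doc_loc) as f:
        doc_loc_lines = f.read().splitlines()
```

Sorting the paths is a design choice made here so that repeated runs produce the same file; nothing in the text above requires a particular order.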

inv, inv.var: Inverted word list. This file (with its associated .var file, since it is a direct relational object) lists the concept numbers found in the collection and, for each concept, the document ids (did) in which the concept is found. A record in an inv file contains the following information:

        list 0 id_num wt
Where:
    - list (long) document identifier
    - 0 (long) the constant zero
    - id_num (long) concept number
    - wt (float) weight associated with the concept
An inv entry is defined by the struct INV, found in the file src/h/inv.h. The SMART action that outputs an inv entry is the function print.obj.inv from src/libprint/p_inv.c.
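
To illustrate how such records answer the question "which documents contain this concept?", the sketch below groups records by concept number. It assumes text records with whitespace-separated fields in the order of the layout line (list, 0, id_num, wt), reading list as the document identifier and id_num as the concept number. The sample data and the names parse_inv_line and build_postings are made up for the example.

```python
from collections import defaultdict

def parse_inv_line(line):
    """Parse one 'list 0 id_num wt' record, assuming whitespace-separated fields."""
    doc_id, _zero, con, wt = line.split()
    return int(doc_id), int(con), float(wt)

def build_postings(lines):
    """Group records by concept number: con -> [(document id, weight), ...]."""
    postings = defaultdict(list)
    for line in lines:
        doc_id, con, wt = parse_inv_line(line)
        postings[con].append((doc_id, wt))
    return dict(postings)

# Made-up sample records, for illustration only.
sample = [
    "1 0 1021 0.5",
    "3 0 1021 0.25",
    "3 0 2210 1.0",
]
postings = build_postings(sample)
print(postings[1021])
```

Looking up a concept number in postings then yields the documents (and weights) in which that concept occurs.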

textloc, textloc.var: Text location file. The textloc file contains the file location of each document in the collection. Since the documents are left in place when they are indexed, SMART uses the textloc file to locate the original text of an indexed document. Since many documents can be placed in a single file of the collection (delimited by new-document markers), the textloc file also records where each document begins and ends within its file. A record in a textloc file contains the following information:

        id_num title file_name begin_text end_text doc_type
Where:
    - id_num (long) document id
    - title (string) title of the document
    - file_name (string) file in which the document is found
    - begin_text (long) offset in file for the beginning of document
    - end_text (long) offset in file for the end of document
    - doc_type (long) collection-dependent type of document
A textloc entry is defined by the struct TEXTLOC, found in the file src/h/textloc.h. The SMART action that outputs a textloc entry is the function print.obj.textloc from src/libprint/p_textloc.c.
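
The begin_text and end_text fields are what make it possible to pull one document out of a file that holds several. The sketch below shows the idea under stated assumptions: offsets are taken to be byte offsets with end_text exclusive, which the text above does not confirm, and the names TextLoc and read_document are hypothetical rather than SMART's own.

```python
import os
import tempfile
from typing import NamedTuple

class TextLoc(NamedTuple):
    id_num: int       # document id
    title: str        # title of the document
    file_name: str    # file in which the document is found
    begin_text: int   # offset of the beginning of the document
    end_text: int     # offset of the end of the document
    doc_type: int     # collection-dependent type of document

def read_document(loc):
    """Return the raw bytes of one document, assuming byte offsets with end_text exclusive."""
    with open(loc.file_name, "rb") as f:
        f.seek(loc.begin_text)
        return f.read(loc.end_text - loc.begin_text)

# Demonstration: two documents stored back to back in one temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "collection.txt")
    with open(path, "wb") as f:
        f.write(b"first document\nsecond document\n")
    second = TextLoc(2, "Second", path, 15, 31, 0)
    second_text = read_document(second)
    print(second_text)
```

The file is opened in binary mode so that the offsets are applied to bytes, independent of any text encoding.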

In some cases, the structure of the queries dictates the use of other files (for example, if you wish to distinguish queries on different structural parts of the documents). Once you have built the docsmart collection, take a look at it to see an example of this.

The technical aspects of indexing a query, using both an interactive session and an experimental collection, should be discussed in this section. Any contributors?


Christian Meunier
1999-05-02