Tokens and Spaces as the Basis of Information Storage

What Are Tokens

As friend and local poet Carrie Harper Hechtman once illuminated:

the spaces between the words, the pauses, may convey as much meaning as the words themselves.

Tokens are terms, whitespace, punctuation, Arabic numerals, and any structurally supportive information, from simple section markings to denormalized keys and contextual information.  In order to conserve information and remain unambiguous (described further here), tokens should always be considered case-sensitive.
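To make case-sensitivity concrete, here is a minimal sketch in Python, assuming a simple regex-based splitter rather than Eureka’s actual tokenizer: words, punctuation, and whitespace runs each become tokens, and case is preserved.

import re
from collections import Counter

# Split into word, punctuation, and whitespace tokens.
text = "The cat saw the other cat."
tokens = re.findall(r"\w+|[^\w\s]|\s+", text)

# "The" and "the" remain distinct tokens, so no information is lost
# and lookups stay unambiguous.
print(Counter(tokens))
# Counter({' ': 5, 'cat': 2, 'The': 1, 'saw': 1, 'the': 1, 'other': 1, '.': 1})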

Eureka does not currently support any out-of-stream information, but this can easily be achieved today with magic (as in distinct and unique) prefixes, and doing so is generally necessary in one form or another to take full advantage of the query language and statistics.  Future mapping of out-of-stream data would be just a matter of mapping input streams to spaces.  The concept of a space is discussed further in choosing tokens well and normalization examples.  Magic prefixes do not require additional resources with repeated occurrences, since tokens are stored separately from the corpus.
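As a sketch of the prefix idea, out-of-stream metadata such as a source name can be folded into the stream under a magic prefix; the ~src~ prefix and the helper below are hypothetical, not a Eureka convention.

def with_metadata(tokens, source):
    # The prefixed token may repeat across many records, but since each
    # distinct token is stored once, separately from the corpus, the
    # repetition costs nothing extra in token storage.
    return ["~src~" + source] + list(tokens)

print(with_metadata(["John", "Doe"], "crm_export"))
# ['~src~crm_export', 'John', 'Doe']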

Spaces

The definition of the lexicon is in itself a definition of distinct spaces, and Eureka uses this distinction to better discover and compose the information within them.

As described further in A more precise definition of tokens, and shown by way of a small Shakespeare excerpt, the corpus is partitioned by token rules into separate spaces, allowing for higher compression, cleaner integration within queries, and an overall performance boost.  Spaces can be used to distinguish or disambiguate tokens, or to distinguish sections or formats.

As an example of using higher-priority credit_card_number and phone_number lexical rules alongside a lower-priority numeric rule, the following record would be split such that number:1881 recalls only the 1881 that is part of the address.  You could still find all instances of “1881” in any space with a sub-key search, but while the other two rules are in place, 1881 counts as only one instance on this record (see the sketch after the record).

John Doe
4012 8888 8888 1881
(650) 555 1881
1881 Doe Court, Arnold California, USA 95223
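
A minimal sketch of this prioritized matching in Python, assuming illustrative regular expressions for the three rules (Eureka’s actual rule definitions may differ):

import re

# Ordered by priority: the first rule to match claims that span of text.
RULES = [
    ("credit_card_number", re.compile(r"\d{4} \d{4} \d{4} \d{4}")),
    ("phone_number",       re.compile(r"\(\d{3}\) \d{3} \d{4}")),
    ("number",             re.compile(r"\d+")),
]

def tokenize(record):
    """Return (space, token) pairs; higher-priority rules claim spans first."""
    claimed = []  # (start, end, space, token)
    for space, pattern in RULES:
        for m in pattern.finditer(record):
            # Skip any span already claimed by a higher-priority rule.
            if any(s < m.end() and m.start() < e for s, e, _, _ in claimed):
                continue
            claimed.append((m.start(), m.end(), space, m.group()))
    return [(space, tok) for _, _, space, tok in sorted(claimed)]

record = ("John Doe\n"
          "4012 8888 8888 1881\n"
          "(650) 555 1881\n"
          "1881 Doe Court, Arnold California, USA 95223")

for space, token in tokenize(record):
    print(space + ":" + token)
# credit_card_number:4012 8888 8888 1881
# phone_number:(650) 555 1881
# number:1881
# number:95223

Only the address’s 1881 lands in the number space; the other two occurrences are claimed by the higher-priority rules.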

We’ll expand the above example to include a primary key, as an example of sectioning by space.  If you captured ^#CUSTOMER as a beginning_of_record mark, you would now have many new views on a particular datastream.  For example, the customer primary key is simply the token following each beginning_of_record mark.  The number of records of this type in the corpus is the number of entries for the beginning_of_record mark, minus one.  Likewise, the number of tokens in a record, or the relative position of a token, is just the distance or the difference, respectively, within the region defined as [previous, next beginning_of_record), as sketched after the record below.

#CUSTOMER aj^%12jkhd65b93f0
John Doe
4012 8888 8888 1881
(650) 555 1881
1881 Doe Court, Arnold California, USA 95223
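
A sketch of those derived views, assuming a pre-tokenized stream and working directly from the positions of the mark (the variable names are illustrative):

tokens = [
    "#CUSTOMER", "aj^%12jkhd65b93f0", "John", "Doe",
    "4012 8888 8888 1881", "(650) 555 1881",
    "1881", "Doe", "Court", "Arnold", "California", "USA", "95223",
]

# The positions of the beginning_of_record mark define half-open
# regions [previous mark, next mark) that section the stream.
marks = [i for i, t in enumerate(tokens) if t == "#CUSTOMER"]
regions = list(zip(marks, marks[1:] + [len(tokens)]))

for start, end in regions:
    primary_key = tokens[start + 1]  # the token following the mark
    length = end - start             # record length: a distance
    position = tokens.index("95223", start, end) - start  # a difference
    print(primary_key, length, position)
# aj^%12jkhd65b93f0 13 12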

Mix in the standard token classes like space, punctuation, and paragraph marks, and you are likely to get, directly and without a lot of specialization, all the natural-language features for closeness and relationship between different words in a query.  Likewise, putting HTML tags in their own space would make query conditionals such as in_title, in_body, or is_emphasized easy to construct, and you can just as easily avoid confusing “head” with <head> or “body” with <body>, etc.
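
As a sketch of that last point, here is a toy splitter that routes tags and words into separate spaces; the in_title check is hand-built for illustration, not a built-in Eureka conditional.

import re

TAG = re.compile(r"</?\w+[^>]*>")

def split_spaces(text):
    """Return (space, token) pairs, routing tags and words separately."""
    out = []
    for part in re.split(r"(</?\w+[^>]*>)", text):
        if TAG.fullmatch(part):
            out.append(("html_tag", part))
        else:
            out.extend(("word", w) for w in part.split())
    return out

def in_title(tokens, word):
    """True if word occurs between <title> and </title> tags."""
    inside = False
    for space, tok in tokens:
        if space == "html_tag":
            inside = tok.lower() == "<title>"
        elif inside and tok == word:
            return True
    return False

doc = "<head><title>Moby Dick</title></head><body>Call me Ishmael</body>"
toks = split_spaces(doc)
print(in_title(toks, "Moby"))     # True
print(in_title(toks, "Ishmael"))  # False

Because “head” the word and <head> the tag live in different spaces, the two can never collide in a query.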