Within the Eureka data analysis, storage, and random-access retrieval system, tokens must be unambiguous islands of information that exist independently. Each token is the product of a regular expression rule, an arbitrary class name, and a priority. The priority is simply the order of attempted application, one rule tried and, on failure, the next rule in succession, which prevents multiple regular expression rules from capturing the same bytes of the original corpus in more than one token. The class name (class as in classification, not data structure) serves in queries to distinguish tokens as members of different spaces. This prioritized application of regular expressions is called lexing. In the larger field of data analytics and processing, regular expression matching is generally applied to whole lines of text; within Eureka a match more often covers a single word, number, run of whitespace, or some piece of structural formatting. Note also that, by the above definitions, the same token cannot reside in more than one class.
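A minimal sketch of such prioritized lexing, assuming a toy rule set; the rules, class names, and the lex function here are hypothetical illustrations rather than Eureka's actual specification. First match in priority order wins, so every byte of the corpus lands in exactly one token.

```python
import re

# Hypothetical lex specification: (regular expression, class name),
# listed in priority order.
LEX_RULES = [
    (re.compile(r"[0-9]+"), "number"),
    (re.compile(r"[A-Za-z]+"), "word"),
    (re.compile(r"\s+"), "whitespace"),
    (re.compile(r"."), "punct"),  # lowest priority: any remaining character
]

def lex(corpus: str):
    """Yield (class_name, token) pairs covering every byte exactly once."""
    pos = 0
    while pos < len(corpus):
        for pattern, class_name in LEX_RULES:  # try rules in priority order
            m = pattern.match(corpus, pos)
            if m:
                yield class_name, m.group(0)
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")

print(list(lex("Call 911, now!")))
# [('word', 'Call'), ('whitespace', ' '), ('number', '911'),
#  ('punct', ','), ('whitespace', ' '), ('word', 'now'), ('punct', '!')]
```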
These lex specifications {regular expression, class name}, together with the priority implied by their order, are integral to the resulting Eureka digest. Tokenization is the first step in transforming the input corpus: it partitions the corpus's tokens into k spaces, where k is the number of rules applied. The spaces are largely independent; only cross-space positional joining and aggregate byte histogram statistics reside outside the concept of a space.
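Continuing the same hypothetical sketch, partitioning the token stream into per-class spaces while retaining byte offsets (so cross-space positional joins remain possible) could look like this:

```python
from collections import defaultdict

def build_spaces(corpus: str):
    """Partition tokens into per-class spaces, keyed by class name.

    Each space records (byte_offset, token) so that cross-space
    positional joins remain possible after partitioning.
    """
    spaces = defaultdict(list)
    offset = 0
    for class_name, token in lex(corpus):  # lex() from the sketch above
        spaces[class_name].append((offset, token))
        offset += len(token)
    return spaces

spaces = build_spaces("Call 911, now!")
# spaces["word"]   -> [(0, 'Call'), (10, 'now')]
# spaces["number"] -> [(5, '911')]
```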
The emphasis on precision and disambiguation here is deliberate. In contrast to the majority of metadata generation and information retrieval applications, every Eureka transformation must be data-conserving and unambiguous so that the back transformation can function.
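The back-transformation requirement can be stated concretely: concatenating every token in corpus order must reproduce the input byte for byte. A sketch of that check, again assuming the hypothetical lexer above:

```python
def roundtrip_ok(corpus: str) -> bool:
    """Back transformation check: the token stream, concatenated in
    corpus order, must reproduce the input exactly (no bytes dropped,
    merged, or rewritten by the lexing transform)."""
    return "".join(token for _, token in lex(corpus)) == corpus

assert roundtrip_ok("Call 911, now!")
```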
Standard indexes heavily transform or normalize their tokens: they often discard stop words outright or glom them onto the preceding or following token, normalize capitalization, stem suffixes, and otherwise transform the text. These techniques are used to shrink the key space, reduce the storage dedicated to posting lists, or improve recall by coalescing multiple words together. Eureka has no such opportunity, because none of these operations is data-conserving. Its internal representation is nonetheless built so that both full fidelity and full specificity are achieved (e.g. a query can join on an upper-cased term, a lower-cased term, or on the set of all terms that match case-insensitively). Performance, in both speed and memory, is examined in detail elsewhere [key space performance]; in general the differences are imperceptible, and processing automatically degenerates to the most restrictive combination of the query specification and what is present in the corpus.
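As one illustration of full specificity without destructive normalization (not Eureka's internal representation), a query layer could store word tokens verbatim and derive the case-insensitive match set at query time:

```python
def join_terms(spaces, term: str, case_sensitive: bool = True):
    """Return positions of matching word tokens.

    Tokens are stored verbatim, so exact-case joins cost nothing extra;
    the case-insensitive set is derived at query time rather than by
    destructive normalization at index time.
    """
    if case_sensitive:
        return [pos for pos, tok in spaces["word"] if tok == term]
    needle = term.casefold()
    return [pos for pos, tok in spaces["word"] if tok.casefold() == needle]

spaces = build_spaces("Call call CALL")
join_terms(spaces, "Call")                        # [0]
join_terms(spaces, "call", case_sensitive=False)  # [0, 5, 10]
```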