Introduction To Eureka

token data

Given large to extremely large amounts of free-form, semi-structured, structured, or mixed data, Eureka renders the corpus into token form, organizing those tokens recursively around frequency and relational closeness while retaining each token's original value and ordering.  The resulting format is encoded with attention to memory proximity, compression, and data conservation, providing near-direct access while remaining flexible and general.
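As a rough illustration only (not the actual Eureka encoding), the sketch below shows one way a small corpus could be rendered into tokens while tallying frequencies and retaining each token's original offset; the tokenizer, the sample corpus, and the data structures are assumptions made for this example.

    # Illustrative sketch: naive tokenization that keeps each token's
    # original value and offset while tallying frequencies.
    import re
    from collections import Counter, defaultdict

    def tokenize(corpus: str):
        """Split a corpus into (token, offset) pairs, preserving order."""
        return [(m.group(), m.start()) for m in re.finditer(r"\w+", corpus)]

    corpus = 'id,name\n1,alice\n2,bob'
    tokens = tokenize(corpus)

    frequencies = Counter(tok for tok, _ in tokens)   # token -> count
    positions = defaultdict(list)                     # token -> original offsets
    for tok, off in tokens:
        positions[tok].append(off)

    print(frequencies.most_common(3))
    print(positions['alice'])   # original value and ordering remain recoverable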

Treating your data as wholly composed of tokens means that scope, context, and meaning can be elucidated rather than specified up front.  In a data system capable of real-time, general access, the discovery and application of that structure can be dynamic.

Eureka does not require a schema: if any such structure exists, it can be quickly determined and intuitively utilized.  Likewise, in contrast to the general practice of indexing in databases or search engines, even the concept of context or type (record or field) is unnecessary.  For example, from the set of all hits for any token, the context of field or record can be derived from the relative positions of other tokens (e.g. a whitespace delimiter marking a record or field, a JSON key, an XML tag) and joined to dynamically reproduce the same constraints a schema would impose.
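A minimal sketch of that idea, assuming a CSV-like corpus where newlines delimit records and commas delimit fields; the context helper and the sample data are hypothetical stand-ins for whatever delimiter tokens a real corpus provides.

    # Illustrative sketch: deriving record/field context from token
    # positions and delimiter tokens, rather than from a declared schema.
    import bisect

    corpus = 'id,name\n1,alice\n2,bob'
    record_breaks = [i for i, c in enumerate(corpus) if c == '\n']

    def context(offset: int):
        """Map a token hit's byte offset to a (record, field) coordinate."""
        record = bisect.bisect_left(record_breaks, offset)
        start = record_breaks[record - 1] + 1 if record else 0
        field = corpus.count(',', start, offset)   # delimiters since record start
        return record, field

    hit = corpus.index('alice')
    print(context(hit))   # (1, 1): second record, second field, i.e. the "name" column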

In the Eureka perspective, all data is equal.  Tokens are classified but otherwise undifferentiated, and all tokens are indexed (including sub-token indexing).  Every token is directly relatable to every other token via lexical closeness, statistics, and/or position.
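As a toy illustration of sub-token indexing, the sketch below enumerates every substring of each token so that a fragment query still resolves to full-token hits; a real index would use a far more compact structure, and the sample tokens and offsets are carried over from the earlier example by assumption.

    # Illustrative sketch: index every substring of each token so that a
    # fragment such as "lic" resolves to the hits for "alice".
    from collections import defaultdict

    tokens = {'alice': [10], 'bob': [18]}   # token -> offsets (assumed sample)

    subindex = defaultdict(set)
    for tok in tokens:
        for i in range(len(tok)):
            for j in range(i + 1, len(tok) + 1):
                subindex[tok[i:j]].add(tok)

    print(subindex['lic'])                                            # {'alice'}
    print(sorted(off for t in subindex['b'] for off in tokens[t]))    # [18]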

binary data

Tokens are of course made up of binary bytes, and data the lex rules cannot make sense of is processed and integrated just the same, though purely coincidental matches may arise from lexical ambiguity.  Bookending binary data with magic tokens can provide a practical means of retrieval, but allow for the potential collisions that result.
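A sketch of that bookending idea follows, with hypothetical sentinel byte sequences; a coincidental occurrence of the same bytes elsewhere in the stream would produce exactly the kind of collision mentioned above.

    # Illustrative sketch: bookend an opaque binary blob with magic tokens
    # so it can be located later.  The sentinel values are hypothetical.
    BEGIN, END = b'\x00EUREKA_BIN\x00', b'\x00NIB_AKERUE\x00'

    def wrap(blob: bytes) -> bytes:
        return BEGIN + blob + END

    def extract(stream: bytes):
        """Yield every bookended blob found in a byte stream."""
        start = 0
        while (i := stream.find(BEGIN, start)) != -1:
            j = stream.find(END, i + len(BEGIN))
            if j == -1:
                break
            yield stream[i + len(BEGIN):j]
            start = j + len(END)

    data = b'header' + wrap(b'\x89PNG...') + b'trailer'
    print(list(extract(data)))   # [b'\x89PNG...']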

This introduction will proceed with a