Eureka data is broadly broken down into three views: Token, Join, and Corpus. Below I enumerate some of the features of each view and their costs of access in the Eureka format.
- The Token View contains all the information about the tokens in your corpus, including:
- Direct access to how many tokens you have.
- Direct access to how many unique tokens you have.
- Direct access to the total count of each byte in the Corpus.
- The ability to look up a token directly.
- The ability to look up all tokens directly matching a simple regular expression or transformation. Given no variability in the expansion of the regular expression (none of the {'+', '*', '{i,j}'} operators), the transformation is direct. Each letter position of each word is indexed, so a simple transformation such as a case-insensitive match is just a test of 'X' | 'x' at every position, for x being any letter. This can also be optimized back-to-front, since the fixed size of the word eliminates off the bat all words that are too long, or bidirectionally, eliminating all words whose 0th, 1st, 2nd, … letters do not match. The total count of all unique words and their respective occurrences is available at each letter, so the bidirectional strategy can quickly collapse into forward-only or reverse-only depending on the evaluation prospect. (A sketch of this per-position index follows this list.)
- The ability to match a regular expression with expansion ({'+', '*', '{i,j}'} operators) is inherently a higher-order problem. Even this, however, is aided by the word table being fully indexed and bidirectional. The concept of any, as in '.*', should be directly interpreted (e.g. Count('.*') is just a direct lookup of the count of all unique tokens) or flow through the Token View, and thus remain direct. The concept of not should be dynamically rewritten if evaluation or an estimate indicates a highly biased selection.
- For a looked-up token, direct access to how many times it appears in the Corpus View.
- For a looked-up token, a direct listing of the relative weight of each byte as it contributes to the count of all possible tokens in the Token View.
- Direct lookup of the lexically adjacent neighbors of that token. (These last two features support suggestion and auto-typing.)
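A minimal sketch of how the per-position index above might support direct, fixed-shape matching. `PositionIndex`, its fields, and the example tokens are all hypothetical names, not part of the Eureka format; the point is only that a fixed-length pattern reduces to set intersections, most selective positions first.

```python
from collections import defaultdict

class PositionIndex:
    """Index every token by (position, byte) so that fixed-shape
    patterns resolve to set intersections rather than scans."""

    def __init__(self, counts):
        self.counts = dict(counts)            # token -> occurrences in the corpus
        self.by_length = defaultdict(set)     # length -> tokens of that length
        self.by_pos = defaultdict(set)        # (position, byte) -> tokens
        for tok in counts:
            self.by_length[len(tok)].add(tok)
            for i, ch in enumerate(tok):
                self.by_pos[(i, ch)].add(tok)

    def match_fixed(self, pattern):
        """Match a fixed-length pattern of per-position alternatives, e.g.
        [('c', 'C'), ('a', 'A'), ('t', 'T')] for a case-insensitive 'cat'.
        The fixed length eliminates all wrong-length tokens off the bat."""
        candidates = self.by_length[len(pattern)]
        # Visit the most selective positions first; choosing between
        # forward and reverse traversal reduces to this same ordering.
        order = sorted(range(len(pattern)),
                       key=lambda i: sum(len(self.by_pos[(i, c)])
                                         for c in pattern[i]))
        for i in order:
            alternatives = set().union(*(self.by_pos[(i, c)] for c in pattern[i]))
            candidates = candidates & alternatives
            if not candidates:
                break
        return candidates

index = PositionIndex({"cat": 12, "Cat": 3, "cart": 1, "dog": 9})
print(index.match_fixed([("c", "C"), ("a", "A"), ("t", "T")]))  # cat and Cat
```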
- The Join View connects the Token View with the Corpus View:
- Given a specific token, direct access to each position in the corpus where that token can be found.
- Given a specific position from the Corpus View, direct access to the token associated with it.
- Given a specific token frequency, direct access to where that frequency ranks among all token frequencies.
- For any given token frequency, direct enumeration of the unique tokens that match that frequency. (Both frequency lookups are sketched after this list.)
- All the usual set work: intersection, union, exclusion. While these cannot be constant-time operations, they are highly optimized in implementation (a compressed, near-random-access iterator with log_k, k >> 2, search). The set work also has opportunities for short-circuiting, e.g. A ∩ B, where λ(B) << λ(A), is limited to the magnitude of B (see the galloping-search sketch after this list).
- All set work that can be limited to a confidence interval can potentially be done with small-size sampling, e.g. λ(A ∩ B) at CI = 0.92 might be computed in a very small number of steps, since λ(A) and λ(B) are directly known quantities and pseudorandom testing is enabled (also sketched below).
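A minimal sketch of the two frequency lookups above, assuming token counts are already known from the Token View; `tokens_at` and `rank_of` are illustrative names, not Eureka API.

```python
import bisect
from collections import defaultdict

counts = {"cat": 12, "Cat": 3, "cart": 1, "dog": 9}  # token -> frequency

tokens_at = defaultdict(list)         # frequency -> unique tokens at it
for tok, freq in counts.items():
    tokens_at[freq].append(tok)

freqs = sorted(tokens_at)             # distinct frequencies, ascending

def rank_of(freq):
    """1-based rank of `freq` among all distinct token frequencies,
    with the highest frequency ranked first."""
    return len(freqs) - bisect.bisect_left(freqs, freq)

print(rank_of(12), tokens_at[12])     # 1 ['cat']
print(rank_of(3), tokens_at[3])       # 3 ['Cat']
```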
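A minimal sketch of the bounded intersection above: galloping (exponential) search over sorted position lists makes A ∩ B cost on the order of λ(B) · log λ(A) when λ(B) << λ(A). This is one standard way to realize the "limited to the magnitude of B" claim; the compressed iterator itself is not modeled here.

```python
import bisect

def gallop_intersect(small, large):
    """Intersect two sorted lists; probe `large` by doubling steps, then
    binary-search the bracketed window, so total work is bounded by the
    magnitude of `small` times a logarithmic factor."""
    out, lo = [], 0
    for x in small:
        step = 1
        while lo + step < len(large) and large[lo + step] < x:
            step *= 2                 # gallop until we overshoot x
        lo = bisect.bisect_left(large, x, lo, min(lo + step + 1, len(large)))
        if lo < len(large) and large[lo] == x:
            out.append(x)
    return out

A = list(range(0, 1_000_000, 3))      # λ(A) = 333,334
B = [3, 999, 500_000, 999_999]        # λ(B) << λ(A)
print(gallop_intersect(B, A))         # [3, 999, 999999]
```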
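A minimal sketch of the sampled estimate above: since λ(A) and λ(B) are directly known and membership tests are direct, sampling from the smaller set yields λ(A ∩ B) with a normal-approximation confidence interval. The z-value 1.75 matches the CI = 0.92 example; the sample size and all names are assumptions.

```python
import random

def estimate_intersection(small, large, samples=400, z=1.75, seed=7):
    """Estimate λ(small ∩ large) by sampling `small` and testing membership
    in `large`; z = 1.75 gives roughly a 0.92 two-sided interval."""
    rng = random.Random(seed)                   # pseudorandom testing
    universe = list(small)
    hits = sum(rng.choice(universe) in large for _ in range(samples))
    p = hits / samples                          # sampled overlap fraction
    half = z * (p * (1 - p) / samples) ** 0.5   # normal-approx half-width
    lam = len(small)
    return lam * p, lam * (p - half), lam * (p + half)

A = set(range(0, 1_000_000, 3))
B = set(range(0, 10_000, 7))
est, low, high = estimate_intersection(B, A)
print(f"~{est:.0f} in [{low:.0f}, {high:.0f}] (exact: {len(A & B)})")
```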
- The Corpus View is a representation of the stream of the original text itself and includes:
- Direct access to an arbitrary position in the stream.
- Direct distance between two arbitrary positions.
- Linear-cost cross product between two subsets of the corpus.
- (with a join to the Join View) Linear-cost, frequency-normalized cross product between two subsets. (Both are sketched below.)
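A minimal sketch of the cross products above, under one reading of "cross product": counting position pairs from two corpus subsets that fall within a window. A two-pointer sweep over sorted position lists keeps the cost linear, and dividing by the frequency product (obtained via the Join View) gives the normalized variant. The window size and all names are illustrative assumptions.

```python
def windowed_cross(pos_a, pos_b, window=5):
    """Count (a, b) pairs with |a - b| <= window over two sorted position
    lists; both cursors only move forward, so cost is O(len_a + len_b)."""
    total, lo, hi = 0, 0, 0
    for a in pos_a:
        while lo < len(pos_b) and pos_b[lo] < a - window:
            lo += 1                   # drop positions too far behind a
        while hi < len(pos_b) and pos_b[hi] <= a + window:
            hi += 1                   # extend to positions still in reach
        total += hi - lo              # everything in [a - window, a + window]
    return total

pos_cat = [4, 90, 312]                # positions of one token subset
pos_dog = [7, 95, 400]                # positions of another
raw = windowed_cross(pos_cat, pos_dog)
print(raw)                            # 2: the pairs (4, 7) and (90, 95)
print(raw / (len(pos_cat) * len(pos_dog)))  # frequency-normalized variant
```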