In this example we’ll focus on the first sonnet of William Shakespeare. The six rules, applied in priority order, lex the corpus into six separate Eureka spaces. Note that the catch-all rule ‘.’ has the lowest priority; although it would otherwise capture every remaining single character, its space will remain empty here.
|Priority||Regular Expression Rule||Space|
And the resulting lexing, color coded to match the rules above, would look something like:
by William Shakespeare␤
From fairest creatures we desire increase,␤
That thereby beauty’s rose might never die,␤
But as the riper should by time decease,␤
His tender heir might bear his memory:␤
But thou contracted to thine own bright eyes,␤
Feed’st thy light’s flame with self-substantial fuel,␤
Making a famine where abundance lies,␤
Thy self thy foe, to thy sweet self too cruel:␤
Thou that art now the world’s fresh ornament,␤
And only herald to the gaudy spring,␤
Within thine own bud buriest thy content,␤
And tender churl mak’st waste in niggarding:␤
Pity the world, or else this glutton be,␤
To eat the world’s due, by the grave and thee.␤
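The priority-ordered lexing described above can be sketched in Python. The six rules in the table are not reproduced here, so the rule set and space names below are assumptions chosen to roughly fit the sonnet; only the priority mechanism and the empty catch-all space reflect the text.

```python
import re

# Hypothetical rule set standing in for the six rules in the table above.
# Listed in priority order; the catch-all '.' comes last.
RULES = [
    ("word",  r"[A-Za-z]+"),
    ("apos",  r"[’‘']"),
    ("punct", r"[,.:;]"),
    ("dash",  r"[-–]"),
    ("white", r"[ \t\n]+"),
    ("any",   r"."),           # lowest priority: stays empty for this corpus
]

def lex(corpus):
    """Assign each match to the space of the highest-priority matching rule."""
    spaces = {name: [] for name, _ in RULES}
    pos = 0
    while pos < len(corpus):
        for name, pattern in RULES:          # try rules in priority order
            m = re.match(pattern, corpus[pos:])
            if m:
                spaces[name].append(m.group(0))
                pos += len(m.group(0))
                break
    return spaces

spaces = lex("From fairest creatures we desire increase,\n")
```

With this rule set the punctuation lands in its own space and the catch-all space remains empty, mirroring the behavior described above.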
Internally the corpus would be stored in the six spaces with inter-space join tables. Queries could search for consecutive tokens in one space, with the intervening tokens matching a specific pattern of spaces, or any space at all:
word:token1 [⊕[ .*:.*]+ ⊕ word:tokenYi]*
word:PC(token1) [⊕[ .*:.*]+ ⊕ word:tokenYi]*
where PC would permute over and include every existing capitalization of token1 in the word space, or
word:token1 [⊕[ white:"\t"]+ ⊕ word:tokenYi]*
where the white space separator(s) are explicitly constrained to a single tab.
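The consecutive-token queries above can be sketched against a toy materialization of the inter-space joins. The flat list of (space, token) pairs and the function name are illustrative assumptions, not Eureka internals:

```python
def find_word_pairs(joined, first, second, sep_ok=lambda space, tok: True):
    """Find positions where word `first` is followed by word `second`,
    with every intervening (separator) token satisfying sep_ok."""
    hits = []
    for i, (space, tok) in enumerate(joined):
        if space != "word" or tok != first:
            continue
        j = i + 1
        # walk over separator tokens until the next word-space token
        while j < len(joined) and joined[j][0] != "word" and sep_ok(*joined[j]):
            j += 1
        if j < len(joined) and joined[j] == ("word", second):
            hits.append(i)
    return hits

# Toy materialized join over three spaces (word, white, punct).
joined = [("word", "desire"), ("white", " "), ("word", "increase"),
          ("punct", ","), ("white", "\n"), ("word", "That")]

any_sep = find_word_pairs(joined, "desire", "increase")
# Analog of the white:"\t" constraint: only tab separators allowed.
tab_only = find_word_pairs(joined, "increase", "That",
                           sep_ok=lambda space, tok: tok == "\t")
```

The first query allows any separator pattern, like `.*:.*` above; the second rejects the match because the comma and newline fail the tab-only constraint.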
Fully indexing every letter position in every token, indexing every token (no normalization or dropping of stop words), and storing all frequency statistics, from letter frequencies to the count of unique frequencies present in each space, costs only a small multiple (i.e. less than 3×) of the original corpus size, even in this small case. As we scale documents up to achieve statistical significance this ratio trends toward one.
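As a rough sketch of what full positional indexing entails, the following builds a letter-position index and an unnormalized token frequency table; the data structures are illustrative, not Eureka’s internal layout.

```python
from collections import defaultdict, Counter

def build_indexes(tokens):
    """Index every letter position in every token, plus full token
    frequencies; no normalization, no stop-word dropping."""
    letter_pos = defaultdict(list)   # (letter, position) -> token ids
    freq = Counter()
    for tid, tok in enumerate(tokens):
        freq[tok] += 1
        for pos, ch in enumerate(tok):
            letter_pos[(ch, pos)].append(tid)
    return letter_pos, freq

tokens = "From fairest creatures we desire increase".split()
letter_pos, freq = build_indexes(tokens)
# Every (letter, position) pair is directly queryable, e.g. tokens whose
# third letter is 'e':
third_e = [tokens[t] for t in letter_pos[("e", 2)]]
```

Each index entry references token ids rather than copying tokens, which is one reason the total cost stays within a small multiple of the corpus itself.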
To emphasize both the minuscule nature of this corpus and the Eureka approach of storing all relevant metadata directly, the count of frequencies for the word space is displayed in the following table.
Note that due to the minuscule size this space is very sparse and the numeric progressions are likewise discontinuous:
|Count of words sharing λ||Frequency(λ)|
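The count-of-frequencies statistic tabulated above is a frequency table of a frequency table; this sketch computes it for a fragment of the sonnet:

```python
from collections import Counter

def count_of_frequencies(words):
    """For each frequency λ, count how many distinct words occur exactly
    λ times, as tabulated for the word space above."""
    freq = Counter(words)           # word -> λ
    return Counter(freq.values())   # λ -> count of words sharing λ

words = "thy self thy foe to thy sweet self too cruel".split()
cof = count_of_frequencies(words)
# 'thy' occurs 3 times, 'self' twice, and five words occur once,
# so cof maps λ 3 -> 1 word, 2 -> 1 word, 1 -> 5 words.
```

On a corpus this small most words occur exactly once, which is the sparsity and discontinuity noted above.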
As the size of the corpus increases, the relatively costly insertions of unique terms into the terms table occur less and less often, and all tables benefit from the accumulated statistics. This improves overall per-corpus efficiency, but even for our minuscule corpus the numbers look reasonably good.
|Space||Eureka Internal Size||Original Size (in bytes)||Ratio of Sizes|
Eureka is designed around direct joining, so utilizing multiple spaces results in very natural access without performance degradation. The demarcation of genuinely different kinds of data strongly enhances performance, as each space exhibits disparate frequencies, a limited token set, and natural association. The definition of the spaces is a natural deconvolution of the original data, reducing the overall size and entropy and hence increasing performance.
Though Eureka is general, in that it is more about discovering structure than imposing structure on a corpus, it is wise, for both convenience of access and performance, to consider the definitions of token rules and their priorities carefully, as they serve the dual purpose of defining your data and optimizing access.