In this example we’ll focus on a single document: the first sonnet of the William Shakespeare dataset. Eureka will be instructed to lex on the following rules, and in doing so split the corpus into six separate spaces, though in this case the catch-all, lowest-priority "any" space will be left empty.
| Priority | Regular Expression Rule | Space |
|---|---|---|
And the resulting lexing would look something like …
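(The lexed output itself isn’t reproduced here.) To ground the mechanics, here is a minimal sketch of prioritized regex lexing into named spaces, in Python rather than Eureka itself. The six rules below are hypothetical stand-ins for the rule table above, but the shape is the same: rules are tried in priority order, and the lowest-priority "any" space catches whatever is left.

```python
import re

# Hypothetical rule set standing in for the table above; rules are tried
# in priority order, and the lowest-priority "any" space is the catch-all.
RULES = [
    ("word",    re.compile(r"[A-Za-z']+")),
    ("number",  re.compile(r"[0-9]+")),
    ("punct",   re.compile(r"[.,;:!?]")),
    ("white",   re.compile(r"[ \t]+")),
    ("newline", re.compile(r"\n")),
    ("any",     re.compile(r".", re.S)),   # lowest priority, catch-all
]

def lex(text):
    """Split text into (space, token) pairs; every character lands in a space."""
    pos, tokens = 0, []
    while pos < len(text):
        for space, rx in RULES:
            m = rx.match(text, pos)
            if m:
                tokens.append((space, m.group()))
                pos = m.end()
                break
    return tokens

print(lex("From fairest creatures we desire increase,\n")[:4])
# [('word', 'From'), ('white', ' '), ('word', 'fairest'), ('white', ' ')]
```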
Internally, the document would be stored in the six spaces with inter-space join tables, so that queries can search for consecutive tokens in one space (emulating search-engine behavior) or match patterns like
word:token₁ [⊕ [.*:.*]+ ⊕ word:tokenᵢ]*
Some variations are:
word:PC(token₁) [⊕ [.*:.*]+ ⊕ word:tokenᵢ]*
where PC permutes the capitalization of token₁, expanding it to every capitalized form that actually exists in the word space, or
word:token₁ [⊕ [white:"\t"]+ ⊕ word:tokenᵢ]*
where each whitespace separator is explicitly constrained to a single tab.
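To make those pattern semantics concrete, here is a minimal sketch in Python, not Eureka’s query language: match word-space tokens in order, with each gap made of one or more tokens satisfying a predicate. The token stream and gap predicates below are illustrative assumptions.

```python
def find_phrase(tokens, words, gap_ok=lambda space, text: True):
    """Start indices where `words` occur in order in the word space, each
    consecutive pair separated by one or more tokens satisfying gap_ok.
    The gap ends at the first occurrence of the next wanted word (a
    leftmost-match simplification of [.*:.*]+)."""
    hits = []
    for start in range(len(tokens)):
        if tokens[start] != ("word", words[0]):
            continue
        i, ok = start + 1, True
        for want in words[1:]:
            gaps = 0
            while i < len(tokens) and tokens[i] != ("word", want):
                if not gap_ok(*tokens[i]):
                    ok = False
                    break
                i += 1
                gaps += 1
            if not ok or gaps == 0 or i == len(tokens):
                ok = False
                break
            i += 1                      # step past the matched word
        if ok:
            hits.append(start)
    return hits

toks = [("word", "thy"), ("white", " "), ("word", "self"),
        ("word", "thy"), ("white", "\t"), ("word", "self")]
print(find_phrase(toks, ["thy", "self"]))                        # gaps: [.*:.*]+  -> [0, 3]
print(find_phrase(toks, ["thy", "self"],
      gap_ok=lambda sp, tx: (sp, tx) == ("white", "\t")))        # gaps: [white:"\t"]+ -> [3]
```

The PC(token₁) variation would then amount to widening the first comparison from one exact token to the set of its capitalizations present in the word space, e.g. accepting both "thy" and "Thy".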
Keeping in mind that this is truly a minuscule example, I would counter the misperception that splitting a corpus into a handful of cross-joined spaces while keeping all metadata is inherently expensive. Fully indexing every letter position in every token, indexing every token (no normalization, no dropping of stop-words), and storing all frequency statistics, from letter frequencies to the count of unique frequencies present in each space, costs only a small multiple (i.e. <3×) of the original corpus size in this case. As we scale the document count up [as reported elsewhere] to statistically significant sizes, this ratio drops significantly toward one.
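To make the accounting concrete, here is a minimal sketch of what "indexing every letter position in every token" actually stores. The crude tokenizer, the JSON serialization, and the sample line are all stand-ins; a naive JSON blob like this lands well above the <3× figure, which depends on Eureka’s packed internal representation, but it shows exactly which tables are being paid for.

```python
import json
import re
from collections import defaultdict

corpus = "From fairest creatures we desire increase,\n"   # sample line, not the full sonnet

# Crude stand-in tokenizer: word runs vs. everything else, one char at a time.
tokens = [("word" if t[0].isalpha() else "other", t)
          for t in re.findall(r"[A-Za-z']+|[^A-Za-z']", corpus)]

postings = defaultdict(list)   # "space:token" -> list of token positions
letters  = defaultdict(list)   # character     -> [token position, offset] pairs
for pos, (space, text) in enumerate(tokens):
    postings[f"{space}:{text}"].append(pos)
    for off, ch in enumerate(text):
        letters[ch].append([pos, off])

blob = json.dumps({"postings": postings, "letters": letters})
print(f"naive index / corpus size ratio ~ {len(blob.encode()) / len(corpus.encode()):.1f}")
```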
First, just to emphasize both the small size of this corpus and the Eureka approach of storing all relevant metadata directly, the count of frequencies displayed for the word space shows that this space is very sparse and discontinuous:
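That display isn’t reproduced here, but the statistic itself is easy to state: count how often each word occurs, then count how many words share each frequency. A minimal sketch, with the word list standing in for the sonnet’s actual word space:

```python
from collections import Counter

# Stand-in word-space tokens (the sonnet's opening words, lower-cased).
words = ("from fairest creatures we desire increase "
         "that thereby beauty's rose might never die").split()

token_freqs   = Counter(words)                 # token     -> frequency
freq_of_freqs = Counter(token_freqs.values())  # frequency -> number of tokens with it
print(sorted(freq_of_freqs.items()))           # [(1, 13)] here: every word unique
```

In a tiny corpus almost every word occurs once or twice, so most frequency values between one and the maximum never appear at all, which is exactly the sparseness and discontinuity noted above.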
As the size of the corpus increases, fewer costly insertions of unique terms are incurred, and the statistics smooth out across all tables. This has the effect of reducing the overall per-corpus size cost significantly. Even in the case of our minuscule corpus, it’s still looking pretty good …
| Space | Eureka Internal Size (bytes) | Original Size (bytes) | Ratio of Sizes |
|---|---|---|---|