Shakespeare Example


In this example we’ll focus on the first sonnet of William Shakespeare.  Six rules, applied in priority order, lex the corpus into six separate Eureka spaces.  Note that the catch-all rule ‘.’ has the lowest priority; although it would otherwise capture any remaining single characters, the higher-priority rules consume everything, so its space remains empty.


Priority    Regular Expression Rule    Space
(([ \t]+[:;/\.,!@#$%&*()_\+\-={}\^\[\]'()"(<>]*)+)
(([:;/\.,!@#$%&*()_\+\-={}\^\[\]')"\(<>]+[ \t]*)+)

The resulting lexing, color coded to match the rules above, would look something like:

by William Shakespeare

 From fairest creatures we desire increase,
 That thereby beautys rose might never die,
 But as the riper should by time decease,
 His tender heir might bear his memory:
 But thou contracted to thine own bright eyes,
 Feedst thy lights flame with selfsubstantial fuel,
 Making a famine where abundance lies,
 Thy self thy foe, to thy sweet self too cruel:
 Thou that art now the worlds fresh ornament,
 And only herald to the gaudy spring,
 Within thine own bud buriest thy content,
 And tender churl makst waste in niggarding:
 Pity the world, or else this glutton be,
 To eat the worlds due, by the grave and thee.
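
The priority-ordered lexing shown above can be sketched in a few lines of Python. This is only an illustration, not Eureka's implementation: the simplified patterns, the `RULES` list, the `lex` function and the space name `other` for the catch-all rule are all assumptions made for this example (the other five space names are borrowed from the size table later in this section), and "the first matching rule wins at each position" is just one plausible reading of priority by order.

```python
import re
from collections import defaultdict

# Hypothetical, simplified rules listed from highest to lowest priority.
# They stand in for the real rule table and are NOT Eureka's actual rules.
RULES = [
    ("para", re.compile(r"\n+")),               # line and paragraph breaks
    ("num", re.compile(r"[0-9]+")),             # runs of digits
    ("word", re.compile(r"[A-Za-z]+")),         # runs of letters
    ("white", re.compile(r"[ \t]+")),           # spaces and tabs
    ("punct", re.compile(r"[^\sA-Za-z0-9]+")),  # runs of punctuation
    ("other", re.compile(r".", re.DOTALL)),     # catch-all, lowest priority
]


def lex(text):
    """Split text into (space, token) pairs; the first matching rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        for space, pattern in RULES:
            match = pattern.match(text, pos)
            if match:
                tokens.append((space, match.group()))
                pos = match.end()
                break  # the catch-all guarantees that some rule matches
    return tokens


# Group the tokens of one line of the sonnet by the space they fall into.
sample = " From fairest creatures we desire increase,\n"
spaces = defaultdict(list)
for space, token in lex(sample):
    spaces[space].append(token)
```

Because every character of this sample is claimed by a higher-priority rule, nothing is ever stored under `other`, which mirrors the empty catch-all space described earlier.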

Internally the corpus would be stored in the six spaces, linked by inter-space join tables.  Queries could search for consecutive tokens in one space, with the intervening tokens matching either a specific pattern of spaces or any space at all:

word:token1 [⊕[ .*:.*]+ ⊕ word:tokenYi]*

A variation

word:PC(token1) [⊕[ .*:.*]+ ⊕ word:tokenYi]*

where PC would expand to every capitalization of token1 that exists in word space, or

word:token1 [⊕[ white:"\t"]+ ⊕ word:tokenYi]*

where the white space separator(s) are explicitly constrained to a single tab.
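
The three query forms above can be approximated over a flat list of (space, token) pairs. The sketch below is an assumption-laden illustration rather than Eureka's query engine: `find_sequence`, `sep_ok` and `fold_first` are hypothetical names, separators are taken to be one or more non-word tokens, and PC is approximated by a case-insensitive comparison on the first token only.

```python
def find_sequence(tokens, words, sep_ok=None, fold_first=False):
    """Return start indexes where `words` occur as consecutive word tokens.

    tokens     -- list of (space, text) pairs in document order
    sep_ok     -- optional predicate (space, text) -> bool that every
                  separator token between two words must satisfy
    fold_first -- if True, match the first word in any capitalization
                  (a rough stand-in for PC(token1))
    """
    hits = []
    for start, (space, text) in enumerate(tokens):
        if space != "word":
            continue
        first_ok = (text.lower() == words[0].lower()) if fold_first else (text == words[0])
        if not first_ok:
            continue
        i = start + 1
        matched = True
        for target in words[1:]:
            # Consume one or more separator tokens from non-word spaces.
            j = i
            while (j < len(tokens) and tokens[j][0] != "word"
                   and (sep_ok is None or sep_ok(*tokens[j]))):
                j += 1
            if j == i or j >= len(tokens) or tokens[j] != ("word", target):
                matched = False
                break
            i = j + 1
        if matched:
            hits.append(start)
    return hits


# "Thy self thy foe, to thy sweet self" as a hand-built token stream.
line = [("word", "Thy"), ("white", " "), ("word", "self"), ("white", " "),
        ("word", "thy"), ("white", " "), ("word", "foe"), ("punct", ","),
        ("white", " "), ("word", "to"), ("white", " "), ("word", "thy"),
        ("white", " "), ("word", "sweet"), ("white", " "), ("word", "self")]

any_separator = find_sequence(line, ["foe", "to"])
any_case = find_sequence(line, ["thy", "self"], fold_first=True)
tab_only = find_sequence(line, ["foe", "to"],
                         sep_ok=lambda space, text: space == "white" and text == "\t")
```

Here `any_separator` finds the pair across the comma and space, `any_case` finds the capitalized "Thy self", and `tab_only` finds nothing because the separators in this line are not tabs.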

Fully indexing every letter position in every token, indexing every token (with no normalization or dropping of stop words), and storing all frequency statistics, from letter frequencies to the count of unique frequencies present in each space, costs only a small multiple (less than 3 times) of the original corpus size, even in this small case.  As we scale documents up to achieve statistical significance, this ratio trends toward one.
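
To make the kinds of indexes mentioned above concrete, here is a minimal in-memory sketch for a single space. The function name `build_word_index` and the dictionary layout are assumptions chosen for illustration; the source does not describe Eureka's actual table design, and this sketch says nothing about its storage cost.

```python
from collections import Counter, defaultdict


def build_word_index(words):
    """Build simple statistics and a letter-position index for one space."""
    # Occurrences of each unique token (no normalization, no stop words).
    term_freq = Counter(words)
    # Letter frequencies counted over every occurrence of every token.
    letter_freq = Counter("".join(words))
    # (offset, letter) -> set of unique tokens with that letter at that offset.
    positions = defaultdict(set)
    for term in term_freq:
        for offset, letter in enumerate(term):
            positions[(offset, letter)].add(term)
    return term_freq, letter_freq, positions


words = ["thy", "self", "thy", "foe", "to", "thy", "sweet", "self"]
term_freq, letter_freq, positions = build_word_index(words)

# All unique tokens whose first letter is "s".
starts_with_s = positions[(0, "s")]
```

Each unique token is indexed once, however often it occurs, which is one reason the overhead of such an index would be expected to shrink relative to the corpus as repeated tokens accumulate.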

To emphasize both how small this corpus is and the Eureka approach of storing all relevant metadata directly, the count of frequencies for the word space is displayed in the following table.

Note that, due to the minuscule size of the corpus, this space is very sparse and the numeric progressions are likewise discontinuous:

Count of words sharing λ    Frequency (λ)
92 (total)
For the words in Shakespeare's first sonnet, the distribution of frequencies shows that only one word has six occurrences, two words have four occurrences, four words have three occurrences, and a whopping seventy-eight words are present only once in this short text. How's that for diversity!
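
The count of frequencies described above is a frequency-of-frequencies table: for each frequency λ, how many distinct words occur exactly λ times. A minimal sketch using Python's standard `collections.Counter` follows; the function name is hypothetical, and the input is a single line of the sonnet rather than the full corpus, so the numbers differ from those quoted above.

```python
from collections import Counter


def frequency_of_frequencies(words):
    """Map each frequency to the number of distinct words that share it."""
    term_freq = Counter(words)          # word -> number of occurrences
    return Counter(term_freq.values())  # frequency -> count of words


# Word tokens of "Thy self thy foe, to thy sweet self too cruel:",
# compared case-sensitively, so "Thy" and "thy" are different words.
line_words = ["Thy", "self", "thy", "foe", "to", "thy", "sweet", "self", "too", "cruel"]
fof = frequency_of_frequencies(line_words)

for freq, count in sorted(fof.items(), reverse=True):
    print(f"{count} word(s) occur {freq} time(s)")
# prints:
# 2 word(s) occur 2 time(s)
# 6 word(s) occur 1 time(s)
```

Even on one line the table is sparse: only two frequencies appear, and most words occur once.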

As the size of the corpus increases, the relatively costly insertions of unique terms into the terms table occur less often, and all tables benefit from the inherent statistics.  This improves the overall per-corpus efficiency, but even in the case of our minuscule corpus the numbers look reasonably good.

Space   Eureka Internal Size   Original Size (in bytes)   Ratio of Sizes
total   1.9 kbytes             711                        2.6
word    1.4 kbytes             510                        2.9
num     59 bytes               5                          12
white   252 bytes              145                        1.7
punct   146 bytes              28                         5.4
para    90 bytes               24                         3.7


Eureka is designed around direct joining, so using multiple spaces results in very natural access without performance degradation.  The demarcation of genuinely different kinds of data will strongly enhance performance, as each space exhibits distinct frequencies, a limited set of tokens and natural associations.  The spatial definition is a natural deconvolution of the original data, reducing the overall size and entropy and hence increasing performance.

Though Eureka is general, in that it is more about discovering structure than imposing structure on a corpus, it is wise, for convenience of access and for performance, to consider the definitions of token rules and priorities carefully, as they serve the dual purpose of defining your data and optimizing access.