Shakespeare Example


In this example we'll focus on the first sonnet of the William Shakespeare dataset.  Eureka will be instructed to lex on the following rules, and in doing so split the corpus into six separate spaces, though in this case the catch-all, lowest-priority any space will be left empty.


Priority  Regular Expression Rule                               Space
1         ([A-Za-z]+[0-9]*)+                                    word
2         [+-~]?[0-9]+([\\.][0-9]*)*                            num
3         (([ \t]+[:;/\.,!@#$%&*()_\+\-={}\^\[\]'()"(<>]*)+)    white
4         (([:;/\.,!@#$%&*()_\+\-={}\^\[\]')"\(<>]+[ \t]*)+)    punct
5         \n\n?                                                 para
6         .                                                     any
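
Eureka's own lexer isn't shown here, but a minimal sketch of the highest-priority-rule-wins behaviour described above might look like the following. This is an illustration only: the regexes are simplified stand-ins for the rules in the table, and all names are invented.

import re

# Rule set mirroring the table above: (priority, pattern, space).
RULES = [
    (1, re.compile(r"([A-Za-z]+[0-9]*)+"), "word"),
    (2, re.compile(r"[-+]?[0-9]+(\.[0-9]*)*"), "num"),
    (3, re.compile(r"[ \t]+"), "white"),
    (4, re.compile(r"[:;/.,!@#$%&*()_+\-={}^\[\]'\"<>]+"), "punct"),
    (5, re.compile(r"\n\n?"), "para"),
    (6, re.compile(r".", re.DOTALL), "any"),  # catch-all, lowest priority
]

def lex(text):
    """Yield (space, token, offset); the highest-priority rule that matches
    at the current position wins, and the cursor advances past the match."""
    pos = 0
    while pos < len(text):
        for _priority, pattern, space in RULES:   # already in priority order
            m = pattern.match(text, pos)
            if m and m.end() > pos:               # non-empty match
                yield space, m.group(0), pos
                pos = m.end()
                break

if __name__ == "__main__":
    for space, token, offset in lex("From fairest creatures we desire increase,\n"):
        print(f"{space:5s} {offset:3d} {token!r}")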

And the resulting lexing would look something like …

1609
THE SONNETS
by William Shakespeare

 1
 From fairest creatures we desire increase,
 That thereby beautys rose might never die,
 But as the riper should by time decease,
 His tender heir might bear his memory:
 But thou contracted to thine own bright eyes,
 Feedst thy lights flame with selfsubstantial fuel,
 Making a famine where abundance lies,
 Thy self thy foe, to thy sweet self too cruel:
 Thou that art now the worlds fresh ornament,
 And only herald to the gaudy spring,
 Within thine own bud buriest thy content,
 And tender churl makst waste in niggarding:
 Pity the world, or else this glutton be,
 To eat the worlds due, by the grave and thee.

and internally the document would be stored in the six spaces with inter-space join tables, such that queries could search for consecutive tokens in one space (emulating search-engine behavior), or match cross-space patterns such as

word:token_1 [⊕ [.*:.*]+ ⊕ word:token_i]*

Some variations are

word:PC(token_1) [⊕ [.*:.*]+ ⊕ word:token_i]*

where PC would expand token_1 to every capitalization of it that already exists in the word space, or

word:token_1 [⊕ [white:"\t"]+ ⊕ word:token_i]*

where the whitespace separators are explicitly constrained to tab characters.
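
How these queries are evaluated is, of course, internal to Eureka, but as a rough sketch: if every token carries a global sequence number shared across all six spaces (serving as the inter-space join key, an assumption made here for illustration), a consecutive-token query reduces to a positional join across the per-space postings. All class and function names below are invented.

from collections import defaultdict

# Hypothetical storage: one posting list per space, mapping
# token -> global sequence numbers shared across all spaces.
class SpaceIndex:
    def __init__(self):
        self.postings = defaultdict(list)   # token -> [sequence numbers]

    def add(self, token, seq):
        self.postings[token].append(seq)

def index_tokens(token_stream):
    """token_stream yields (space, token); returns {space: SpaceIndex}."""
    spaces = defaultdict(SpaceIndex)
    for seq, (space, token) in enumerate(token_stream):
        spaces[space].add(token, seq)
    return spaces

def phrase_query(spaces, steps):
    """steps is a list of (space, token); return the start positions where
    the tokens occur consecutively in the global sequence."""
    space0, tok0 = steps[0]
    hits = []
    for start in spaces[space0].postings.get(tok0, []):
        if all((start + i) in spaces[sp].postings.get(tok, [])
               for i, (sp, tok) in enumerate(steps[1:], 1)):
            hits.append(start)
    return hits

# Tiny usage example with a hand-tokenized fragment of the sonnet.
stream = [("word", "thy"), ("white", " "), ("word", "sweet"),
          ("white", " "), ("word", "self"), ("punct", ","),
          ("white", " "), ("word", "too")]
spaces = index_tokens(stream)
print(phrase_query(spaces, [("word", "sweet"), ("white", " "), ("word", "self")]))
# -> [2]

The PC variation would then simply expand the first step into every capitalization of token_1 already present in the word space's postings before running the same join.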

Performance

Keeping in mind that this is truly a minuscule example, I would counter the misperception that splitting a corpus up into a handful of cross-joined spaces and keeping all meta-data is inherently expensive.  Fully indexing every letter position in every token, indexing every token (no normalization or dropping of stop-words), and storing all frequency statistics, from letter frequencies to the count of unique frequencies present in each space, costs only a small multiple (i.e. less than 3 times) of the original corpus size in this case.  As we scale the number of documents up [ as reported elsewhere ] and the statistics become significant, this ratio drops substantially toward one.
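
As a back-of-the-envelope illustration of that claim (a toy sketch, not Eureka's storage format; the file name and index layout are invented), one can build a fully inverted word-space index, letter positions and all, and compare its serialized size to the raw text it came from:

import json
import re
from collections import Counter, defaultdict

def build_word_index(text):
    """Toy fully-inverted index for the 'word' space: every token position
    and every letter position inside every token, plus letter frequencies."""
    postings = defaultdict(list)          # token  -> [token positions]
    letter_positions = defaultdict(list)  # letter -> [(token pos, offset in token)]
    letter_freq = Counter()               # letter -> total count
    for pos, m in enumerate(re.finditer(r"[A-Za-z]+[0-9]*", text)):
        token = m.group(0)
        postings[token].append(pos)
        for off, ch in enumerate(token):
            letter_positions[ch].append((pos, off))
            letter_freq[ch] += 1
    return {"postings": postings,
            "letter_positions": letter_positions,
            "letter_freq": letter_freq}

if __name__ == "__main__":
    text = open("sonnet1.txt").read()     # hypothetical file name
    idx = build_word_index(text)
    raw = len(text.encode())
    stored = len(json.dumps(idx).encode())
    print(f"raw: {raw} bytes, toy index: {stored} bytes, "
          f"ratio: {stored / raw:.1f}")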

First, to emphasize both the small size of this corpus and the Eureka approach of storing all relevant meta-data directly, the count-of-frequencies table for the word space shows how sparse and discontinuous this space is:

Count   λ
1       6
2       4
4       3
7       2
78      1
92      (total)
For words in Shakespeare's first sonnet, the distribution of frequencies shows that only one word has six occurrences, two words have four occurrences, four words have three occurrences, seven words have two occurrences, and a whopping seventy-eight words are present only once in this short text. How's that for diversity!
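
The count-of-frequencies statistic itself is cheap to derive; a minimal sketch, assuming the word-space tokens have already been extracted (for example with the lex sketch above):

from collections import Counter

def count_of_frequencies(tokens):
    """Map each occurrence count λ to the number of distinct tokens that
    occur exactly λ times (the Count column in the table above)."""
    token_freq = Counter(tokens)           # token -> λ
    return Counter(token_freq.values())    # λ -> Count

# Hypothetical usage with the word-space tokens of the sonnet:
# words = [tok for space, tok, _ in lex(open("sonnet1.txt").read()) if space == "word"]
# for lam, count in sorted(count_of_frequencies(words).items(), reverse=True):
#     print(count, lam)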

As the size of the corpus increases, fewer costly insertions of unique terms are incurred and the statistics transition smoothly across all tables. This has the effect of reducing the overall per-corpus size cost significantly.  Even in the case of our minuscule corpus it's still looking pretty good …

Space   Eureka Internal Size   Original Size (in bytes)   Ratio of Sizes
total   1.9 kbytes             711                        2.6
word    1.4 kbytes             510                        2.9
num     59 bytes               5                          12
white   252 bytes              145                        1.7
punct   146 bytes              28                         5.4
para    90 bytes               24                         3.7
any     0                      0                          0/0