Dual Space Basis

For each space (e.g. tokens = {words, numbers, whitespace, …}) there is a separate basis. For word:‘cat’ you therefore have a given frequency (λ) and a set of positions expressed either in the basis of all tokens or in the basis of only those tokens that are words. Looking up ‘cat’ thus lets you see that it occurs at positions 45, 87, … of the words encountered, or at positions 95, 188, … of all tokens encountered.
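To make the two position sets concrete, here is a minimal sketch of such an index. It is not Eureka's implementation: the tokenizer regex, the space names, the 0-based positions and the function name build_dual_index are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: each named group stands for one space.
TOKEN_RE = re.compile(
    r"(?P<word>[A-Za-z]+)|(?P<number>\d+)|(?P<whitespace>\s+)|(?P<other>.)",
    re.DOTALL,
)

def build_dual_index(text):
    """Record every token's position in two bases (0-based, an assumption)."""
    all_positions = defaultdict(list)    # (space, value) -> positions among all tokens
    space_positions = defaultdict(list)  # (space, value) -> positions within its own space
    seen_in_space = defaultdict(int)     # running count of tokens seen per space
    for global_pos, match in enumerate(TOKEN_RE.finditer(text)):
        space = match.lastgroup          # name of the group that matched
        key = (space, match.group())
        all_positions[key].append(global_pos)
        space_positions[key].append(seen_in_space[space])
        seen_in_space[space] += 1
    return all_positions, space_positions

all_pos, word_pos = build_dual_index("the cat sat, the cat ran")
print(all_pos[("word", "cat")])        # [2, 9]  positions among all tokens
print(word_pos[("word", "cat")])       # [1, 4]  positions among words only
print(len(word_pos[("word", "cat")]))  # 2       the frequency of 'cat'
```

The same lookup key gives both views, so a query can pick whichever basis suits it.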

As expected, space_type:any is also valid, so you can mix and match among the different spaces without specifying what the actual value is.
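One way to picture this mixing is a query made of (space, value) pairs in which either part may be left open. The sketch below is an illustration only: the ANY marker, the tuple representation of tokens and the function names are hypothetical, not Eureka's query syntax.

```python
ANY = None  # hypothetical wildcard: matches any space or any value

def matches(token, pattern):
    """True if a (space, value) token satisfies a (space, value) pattern."""
    space, value = token
    want_space, want_value = pattern
    return (want_space is ANY or want_space == space) and (
        want_value is ANY or want_value == value
    )

def find_sequences(tokens, patterns):
    """Return the start positions (all-token basis) where the patterns match in order."""
    n = len(patterns)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if all(matches(tokens[i + j], patterns[j]) for j in range(n))
    ]

tokens = [
    ("word", "the"), ("whitespace", " "), ("word", "cat"),
    ("whitespace", "\t"), ("number", "9"), ("whitespace", " "), ("word", "cat"),
]

# 'cat', then any whitespace, then a token from any space at all.
query = [("word", "cat"), ("whitespace", ANY), (ANY, ANY)]
print(find_sequences(tokens, query))  # [2]
```

Only the first pattern names a value; the second fixes the space but not the value, and the third leaves both open.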

Eureka has this dual basis enabled for two reasons. First, as is generally the case, you may only want to know the set of occurrences of two directly adjacent words, with no concern for what the separator is, what type the separator is, how many times that separator is applied, or how often its type changes. Likewise, given two references to words, you can directly tell how many words, how many spaces, how many space:‘\t’ tokens, or how many tokens of any kind lie between them.
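Both uses can be sketched on a small token list. In the word basis, two words are adjacent when their word positions differ by one, whatever separators sit between them. The representation below (one token per whitespace character, (space, value) tuples, the function names) is an assumption for illustration, not Eureka's actual storage.

```python
def adjacent_pairs(tokens, first, second):
    """Find `first` directly followed by `second` in the word basis.

    Returns pairs of all-token positions; separators between the two words are ignored.
    """
    words = [(pos, value) for pos, (space, value) in enumerate(tokens) if space == "word"]
    return [
        (words[i][0], words[i + 1][0])
        for i in range(len(words) - 1)
        if words[i][1] == first and words[i + 1][1] == second
    ]

def counts_between(tokens, start, end):
    """Count what lies strictly between two all-token positions, per basis."""
    between = tokens[start + 1 : end]
    return {
        "tokens": len(between),
        "words": sum(1 for space, _ in between if space == "word"),
        "whitespace": sum(1 for space, _ in between if space == "whitespace"),
        "tabs": sum(1 for space, value in between if space == "whitespace" and value == "\t"),
    }

tokens = [
    ("word", "black"), ("whitespace", " "), ("whitespace", "\t"), ("word", "cat"),
    ("whitespace", " "), ("word", "sat"), ("whitespace", "\n"),
    ("word", "black"), ("whitespace", " "), ("word", "cat"),
]

print(adjacent_pairs(tokens, "black", "cat"))  # [(0, 3), (7, 9)]
print(counts_between(tokens, 0, 9))
# {'tokens': 8, 'words': 3, 'whitespace': 5, 'tabs': 1}
```

The first pair is separated by a space and a tab, the second by a single space, yet both are found by the same query. A real index would presumably answer the counts by subtracting stored positions rather than scanning the tokens, which is the point of keeping both bases.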

The second reason for this dual basis is that, while not certain, analysts would generally want to track types of tokens that have genuine meaning in the real world. If, for instance, they choose a space of {words, whitespace, roman numerals, numbers} and their data falls within a particular domain (or is even naturally ordered, e.g. math followed by history, science and literature documents), then the Eureka system would be able to pick out the distinct patterns, or shifts in patterns and frequencies, that are present, and so spend less space encoding the information comprehensively and completely.