Terms and Concepts

The following terms are used frequently throughout this documentation.

  • Source of Truth refers to the origin of the data: the authority on which all derivatives, digests, and metadata are based. Though not necessarily a single system, nor necessarily an immutable store, the source of truth is distinguished from all other collections of data as the arbiter when multiple copies of essentially the same data disagree, and as the resource for recovering from or correcting such discrepancies. Though the concept is present in most big data systems, Eureka largely downplays the need for it by minimizing the need for multiple, frequently conflicting derivatives.
  • Corpus is a collection of works or text. When ingested into a Eureka data system, it is distilled down to a set of statistics that conserves all the information in that collection of text while also exposing the relevant statistics and join tables for the quickest operations.
  • Join(ing) means the direct connection of different data concepts. Internally, it is as though you have a random or direct reference to any one value, to one of a set of values, or to a range within a set of elements.
  • Lexing, Token, and Space relate to the process of ingesting data and differentiating it into classes or types. Lexing reads a byte stream of input and transforms it into tokens with annotations such as word, space, or delimiter. The space concept reflects that it is convenient at times to constrain tokens to one class of types, such as only words or only delimiters. Tokens are such a central concept in Eureka that a large section of this documentation is dedicated to them.
  • Direct(ly) or Random Access refers to the computational complexity (or order) of a lookup. Random access is literally constant time, O(1). Eureka’s access is O(log_k(n)), where k >> 2. While greater than O(1), it is approximately and practically equivalent, even with large n.
  • Magic, as in special, refers to the special, frequently unique, character sequences that are integrated into the data stream that is the corpus. Quite probably a temporary crutch, this mechanism permits the current design to proceed until a better solution, possibly multi-channel data or out-of-band information, is provided.
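
The lexing and space concepts described above can be illustrated with a minimal Python sketch. This is not Eureka's lexer: the class names (word, space, delimiter), the Token type, and the helper functions lex and in_space are all hypothetical, chosen only to show a byte stream being split into annotated tokens and then constrained to one class.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical token classes for illustration only; Eureka's actual
# classes and annotations are not specified in this section.
TOKEN_PATTERN = re.compile(
    r"(?P<word>\w+)|(?P<space>\s+)|(?P<delimiter>[^\w\s])"
)


@dataclass(frozen=True)
class Token:
    kind: str  # the class of the token: "word", "space", or "delimiter"
    text: str  # the exact characters the token covers


def lex(data: bytes) -> List[Token]:
    """Decode a byte stream and split it into annotated tokens."""
    text = data.decode("utf-8")
    # Every character is a word character, whitespace, or neither,
    # so the three alternatives together cover the whole input.
    return [Token(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(text)]


def in_space(tokens: List[Token], kind: str) -> List[Token]:
    """Constrain a token list to a single class (a 'space' of tokens)."""
    return [t for t in tokens if t.kind == kind]


tokens = lex(b"Hello, big world!")
words = in_space(tokens, "word")
print([t.text for t in words])
```

Because every input character lands in exactly one token, concatenating the token texts reproduces the original input, which mirrors the idea that ingestion conserves the information in the corpus.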
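
The claim in the Direct(ly) or Random Access entry, that a logarithmic lookup with a large base is practically close to constant time, can be checked with simple arithmetic. The sketch below counts how many steps a lookup needs when each step narrows the search by a factor of k. The function name lookup_steps and the fan-out of 256 are assumptions made for illustration; this section does not state Eureka's actual value of k.

```python
def lookup_steps(n: int, k: int) -> int:
    """Return the smallest d with k**d >= n, i.e. ceil(log_k(n)) for n >= 1."""
    if n < 1 or k < 2:
        raise ValueError("need n >= 1 and k >= 2")
    steps, reach = 0, 1
    # Each step multiplies the number of reachable elements by k.
    while reach < n:
        reach *= k
        steps += 1
    return steps


# Compare a binary structure (k = 2) with an assumed fan-out of k = 256.
for n in (10**6, 10**9, 10**12):
    print(n, lookup_steps(n, 2), lookup_steps(n, 256))
```

Under these assumptions, growing n from a million to a trillion elements raises the step count from 3 to 5 at k = 256, compared with 20 to 40 at k = 2.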