Eureka Data System


Eureka is a blue-sky project: a generalized random-access data store in which aggregates, statistics, and relational information are optimized for near-constant-time access and high join performance.  The underlying data-driven, human-guided format emphasizes direct data access; lexical, positional, or frequency proximity; analytics and discovery; and efficient or statistically enabled hypothesis testing.


Information formats share the basic constraint of all natural languages: context must be accessed sequentially.  When the amount of data is small enough that a single reader can hold all of it in mind, this constraint is negligible.  As the collection of information grows, however, meta-data (information about information) becomes critical for supplying the context needed to comprehend, manipulate, and make the best use of that information efficiently.

The collapse of the costs that previously limited the collection, fidelity, transfer, storage, and sharing of information has resulted in a rapid increase in the quantity of information available.  This quantity of data exacerbates the inherent linguistic constraint, that context must be accessed sequentially, and favors statistics as a method for understanding and investigating the data.


Eureka addresses this context challenge by decomposing a corpus into its constituent parts to such a degree that the result is a complete, conservative set of meta-information optimized for recall and discovery.  By guiding the user or algorithm through the landscape formed by this meta-data representation of the original information, insight and application may be achieved.
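
As a rough illustration of this kind of decomposition, the Python sketch below breaks a toy corpus into per-token positions and frequency counts, then answers a proximity query from that meta-data alone.  It is not Eureka's actual format or implementation: the names build_index, near, and reconstruct, the whitespace tokenization, and the reading of "conservative" as "the original text can be rebuilt from the meta-data" are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(documents):
    """Decompose documents into token positions and frequencies (hypothetical layout)."""
    positions = defaultdict(lambda: defaultdict(list))  # token -> doc_id -> [positions]
    frequency = defaultdict(int)                         # token -> total occurrences
    for doc_id, text in documents.items():
        # Assumption: tokens are whitespace-separated words.
        for pos, token in enumerate(text.split()):
            positions[token][doc_id].append(pos)
            frequency[token] += 1
    return positions, frequency


def near(positions, a, b, window):
    """Return ids of documents where tokens a and b occur within `window` positions."""
    docs_a = positions.get(a, {})
    docs_b = positions.get(b, {})
    hits = []
    for doc_id in docs_a.keys() & docs_b.keys():
        if any(abs(i - j) <= window
               for i in docs_a[doc_id] for j in docs_b[doc_id]):
            hits.append(doc_id)
    return sorted(hits)


def reconstruct(positions, doc_id):
    """Rebuild a document from the index, showing that no tokens or ordering were lost."""
    slots = {}
    for token, docs in positions.items():
        for pos in docs.get(doc_id, []):
            slots[pos] = token
    return " ".join(slots[i] for i in range(len(slots)))


# Toy corpus, invented for this example.
documents = {
    "d1": "metadata supports context for large corpora",
    "d2": "statistics help investigate large amounts of data",
    "d3": "context must be accessed sequentially",
}
positions, frequency = build_index(documents)
```

In this sketch a frequency lookup such as frequency["large"] is a single dictionary access, and near(positions, "context", "large", 2) finds co-occurrences without rescanning the source text; both operate only on statistics and ordering information.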

On the basis of statistics and ordering information alone, corpora of enormous size could be organized in a distributed system and made accessible in near real time to human or machine analysis.

The resulting system would outperform conventional big data systems in both resource consumption and speed.  Eureka would serve as a repository, a random-access data store, and a real-time query and machine analytics resource, speeding training and allowing for finer, faster, and deeper natural language processing.