Eureka Data System


Eureka is a blue-sky project: a generalized random-access data store in which aggregates, statistics, and relational information have been optimized for near-constant-time access and high join performance.  The underlying data-driven, human-guided format emphasizes direct data access; lexical, positional, and frequency proximity; analytics and discovery; and efficient, statistically grounded hypothesis testing.


Persistence format has been the basic constraint of all natural languages: information and its surrounding context must be accessed sequentially.  When processing or comprehending a small amount of data, small enough that any one reader can hold the entirety in mind, this constraint is negligible.  As a collection of information grows, however, meta-data (information about information) becomes critical in order to efficiently comprehend, manipulate, and optimally utilize that information.

With the advent of the information age, the collapse of the costs that previously limited the collection, fidelity, transfer, storage, and sharing of information resulted in a rapid increase in the quantity of information available.  Today's and tomorrow's data volumes exacerbate the inherent linguistic constraint (context as a sequential-access challenge) and favor statistically based methods for understanding and utilizing data.


Eureka addresses the context challenge by decomposing a corpus into its constituent parts, to such a degree that the result is a complete, conservative set of meta-information optimized for recall and discovery.  By guiding the user or algorithm through the landscape of this meta-data representation, insight and utility may be achieved.
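The decomposition described above might be sketched as a positional and frequency meta-index.  This is a minimal, hypothetical illustration, not the Eureka implementation: the function name, the whitespace tokenizer, and the index layout are all assumptions made for the example.

```python
from collections import defaultdict

def build_meta_index(corpus):
    """Decompose a corpus into positional and frequency meta-information.

    Hypothetical sketch: maps token -> {doc_id: [positions]} and keeps
    corpus-wide token counts, approximating a 'complete, conservative
    set of meta-information' that supports direct access as well as
    lexical, positional, and frequency proximity queries.
    """
    postings = defaultdict(dict)   # token -> {doc_id: [positions]}
    frequency = defaultdict(int)   # token -> corpus-wide count
    for doc_id, text in corpus.items():
        for pos, token in enumerate(text.lower().split()):
            postings[token].setdefault(doc_id, []).append(pos)
            frequency[token] += 1
    return postings, frequency

corpus = {"d1": "data about data", "d2": "meta data"}
postings, frequency = build_meta_index(corpus)
print(frequency["data"])       # corpus-wide count of "data"
print(postings["data"]["d1"])  # positions of "data" within d1
```

Because the index conserves token positions rather than discarding them, proximity and ordering questions can be answered directly from the meta-data without re-reading the corpus sequentially.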

On the basis of statistics and ordering information alone, corpora of enormous size could be organized in a distributed system and made accessible in near real time for human and algorithmic analysis.
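One way such meta-data could be spread across a distributed system is by stable hashing, so that any node can locate a token's statistics without coordination.  This is a sketch under assumed conventions; the function name and the SHA-1-based sharding scheme are illustrative, not part of the Eureka design.

```python
import hashlib

def shard_for(token, n_shards):
    """Hypothetical sketch: assign a token's meta-data to a shard by
    stable hash, so a lookup touches a single node in near real time."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_shards

# Every participant computes the same shard for the same token,
# so statistics and ordering information can be fetched directly.
print(shard_for("data", 8))
```

A deterministic placement function like this lets queries go straight to the node holding a token's statistics, rather than scanning the corpus or consulting a central directory.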

The resulting system would outperform conventional big-data systems, built on traditional approaches, in both resource consumption and speed.  Eureka would excel in multiple roles: as a repository, a random-access datastore, a real-time query and machine-analytics engine, and a resource that speeds training, allowing finer-grained and faster processing of structured and unstructured data.