Eureka Data System

About

Eureka is a blue-sky project: a generalized random-access data store in which aggregates, statistics, and relational information are optimized for near-constant-time access and high join performance.  The underlying data-driven, human-guided format emphasizes direct data access; lexical, positional, and frequency proximity; analytics and discovery; and efficient, statistically grounded hypothesis testing.

Why

Information formats share the basic constraint of all natural languages: context must be accessed sequentially.  When the amount of data is small enough that a single reader can hold all of it in mind, this constraint is negligible.  As a collection of information grows, however, meta-data (information about information) becomes critical for supplying context, so that the information can be comprehended, manipulated, and optimally utilized.

The collapse of the costs that previously limited the collection, fidelity, transfer, storage, and sharing of information has produced a rapid increase in the quantity of information available.  Today's and tomorrow's volumes of data exacerbate the inherent linguistic constraint of sequential access to context, and favor a statistically based method for understanding and utilizing data.

How

Eureka addresses the context challenge by decomposing a corpus into its constituent parts, to such a degree that the result is a complete, conservative set of meta-information optimized for recall and discovery.  By guiding the user or algorithm through this landscape, the meta-data representation of the original information, insight and application may be achieved.
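One way to picture this decomposition is as building per-term meta-information, frequencies and exact positions, directly from the raw text, so that later lookups are dictionary accesses rather than sequential scans.  The Python sketch below is purely illustrative; the function and field names are invented here and are not part of any actual Eureka format:

```python
from collections import defaultdict

def decompose(corpus):
    """Decompose a corpus into per-term meta-information:
    total frequency plus exact positions per document.
    Subsequent lookup is a dict access, near constant time."""
    index = defaultdict(lambda: {"freq": 0, "postings": defaultdict(list)})
    for doc_id, text in corpus.items():
        for pos, term in enumerate(text.lower().split()):
            entry = index[term]
            entry["freq"] += 1
            entry["postings"][doc_id].append(pos)
    return dict(index)

corpus = {
    "d1": "eureka organizes data for discovery",
    "d2": "data discovery drives data insight",
}
index = decompose(corpus)

# Direct access: no sequential scan of the corpus is needed.
print(index["data"]["freq"])              # -> 3
print(sorted(index["data"]["postings"]))  # -> ['d1', 'd2']
```

Because positions are preserved, the same structure also answers lexical, positional, and frequency-proximity questions without revisiting the original text.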

On the basis of statistics and ordering information alone, corpora of enormous size could be organized in a distributed system and made accessible in near real time to human and algorithmic analysis.
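One way such a distributed organization might work, offered here as an assumption rather than a specification, is to route each term's meta-data to a shard by a stable hash, so that any node can locate the owner without coordination and a lookup stays a single hop plus a constant-time access:

```python
import hashlib

def shard_for(term, num_shards):
    """Route a term's meta-data to a shard by stable hash.
    sha256 gives the same answer on every node, so no
    central directory is needed to find a term's owner."""
    digest = hashlib.sha256(term.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Four in-process dicts stand in for four remote nodes.
shards = [dict() for _ in range(4)]

def put(term, meta):
    shards[shard_for(term, len(shards))][term] = meta

def get(term):
    return shards[shard_for(term, len(shards))].get(term)

put("eureka", {"freq": 1})
print(get("eureka"))  # -> {'freq': 1}
```

Real systems layer replication and rebalancing on top of this, but the core routing idea is the same.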

The resulting system would outperform conventional big-data systems in both resource consumption and speed.  Eureka would excel in multiple roles: as a repository, as a random-access data store, as a real-time query and machine-analytics engine, and as a resource that accelerates training, allowing finer and faster processing of structured and unstructured data.