The Illusion of Organization

The standard model of large data applications is about providing order or organization around discrete pieces of information. This perspective has many consequences, and many opportunities open up once it is shed.

The consequence of the discrete-pieces orientation is that somewhere deep down in some data store resides your datum; that datum is replicated all over the place (possibly with transformation, but again each instance lives in one place); and those data are referenced through a mechanism that reflects the storage organization more than the data stored. As evidence that this might not be the most efficient way, consider that in all natural brains the contrary mechanisms prevail. In brains the signal is distributed, and connections form or strengthen around the signals instead of the signals themselves being replicated all over.

In the standard model the emphasis is on what is seen and not on what is important. It is as though a comprehensibly small piece of data can simply be scaled up to an incomprehensibly large piece of data, and that proves to be a false tautology! Incomprehensibly large data is incomprehensible, at least to our native human perspective, and that should raise the question: does the current model provide any kind of order at all?

Let us perform a brief thought experiment. Imagine what row 81985529216486895 looks like in an HTTP access log table of 1152921504606846975 rows.

81985529216486895:1152921504606846975 - - [xx/xxx/xxxx:xx:xx:xx -xxxx] "xxxx /xx-xxxxx/ xxxx/x.x" xxx xxx "xxxx://

It is very likely that you came up with a concrete example. You are likely to do this because a fixed or concrete example seems in your mind a better representation, but how reasonable were you in coming up with that placeholder for actual data? Even when guided by a good amount of expertise, your chance of a match is practically nil. It is really not much better than a Monte Carlo approach of constructing a random-length string with random values. Consider also that you have answered the wrong question. Instead of what line 81985529216486895 looks like, you have likely answered the much more challenging question of what line 81985529216486895 is. The former question is really much more useful and comes up more often than the latter.

An alternative to a fixed or concrete instance is to come up with statistics, which addresses the challenge more naturally. For example, you could have reasonably deduced that the log line in question follows a specific distribution in byte length, field count, or average field length. You may deduce as well that each kind of field has its own distribution in length. The line is split into first-order fields by a single space, and subfields are themselves split from those first-order fields by spaces or other delimiters. The values, as they fall in the entire log, in the line, or in any given field or subfield of the line, have particular frequencies. Even a mild effort and minimal expertise can reveal much utility as you try this approach.
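The statistics named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three sample lines are invented placeholders in the style of a common access log, not real traffic, and the function name profile and its result keys are hypothetical names chosen for this sketch.

```python
from collections import Counter

# Invented sample lines (documentation-range addresses); placeholders only.
SAMPLE_LINES = [
    '192.0.2.1 - - [01/Jan/2020:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '192.0.2.2 - - [01/Jan/2020:00:00:02 +0000] "GET /about.html HTTP/1.1" 200 1024',
    '192.0.2.1 - - [01/Jan/2020:00:00:03 +0000] "POST /login HTTP/1.1" 302 0',
]


def profile(lines):
    """Describe what a log line looks like, rather than what any one line is."""
    byte_lengths = Counter()   # line length in bytes -> number of lines
    field_counts = Counter()   # number of first-order fields -> number of lines
    field_lengths = {}         # field position -> Counter of field lengths
    value_freq = {}            # field position -> Counter of field values

    for line in lines:
        byte_lengths[len(line.encode("utf-8"))] += 1
        fields = line.split(" ")  # first-order fields: split on a single space
        field_counts[len(fields)] += 1
        for pos, value in enumerate(fields):
            field_lengths.setdefault(pos, Counter())[len(value)] += 1
            value_freq.setdefault(pos, Counter())[value] += 1

    # Average length of each field position, derived from its length distribution.
    mean_field_length = {
        pos: sum(length * n for length, n in dist.items()) / sum(dist.values())
        for pos, dist in field_lengths.items()
    }
    return {
        "byte_lengths": byte_lengths,
        "field_counts": field_counts,
        "field_lengths": field_lengths,
        "mean_field_length": mean_field_length,
        "value_freq": value_freq,
    }
```

Calling profile(SAMPLE_LINES) returns the distributions without ever singling out one row. Subfields (for example, splitting the bracketed timestamp on ":" or "/") could be profiled the same way by applying the same counters one level down.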

Coming up with a concrete example is like living in the world of small data and imposing that perspective on the large. The alternative approach is to apply statistics to the world of large data, and the combination of big and statistics makes a lot of intuitive sense. So just where do statistics fall in the organization and order of large data in the standard large data methods? Statistics, to the degree that they are available, are appended to the outside of the source of truth instead of being viewed as an integral part of the data itself.
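One way to picture statistics as an integral part of the data, rather than something appended afterwards, is a store whose summary is updated at the moment each record arrives. The sketch below is an assumption-laden illustration, not a description of any particular system: the class names RunningStats and StatLog are hypothetical, and the running mean and variance use Welford's online algorithm, which is a technique chosen here for the example.

```python
class RunningStats:
    """Hypothetical helper: count, mean and variance updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        # Welford's online update: no past values need to be kept.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 when nothing has been added yet.
        return self._m2 / self.count if self.count else 0.0


class StatLog:
    """Hypothetical store: appending a line updates its statistics in place."""

    def __init__(self):
        self.line_length = RunningStats()
        self.field_count = RunningStats()

    def append(self, line):
        # The statistics are the stored form here; the raw line is not retained.
        self.line_length.add(len(line))
        self.field_count.add(len(line.split(" ")))
```

In this sketch every appended line participates in every statistic the store keeps, while the line itself is never held in one pristine spot. A fuller design would need far richer statistics than two running moments.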

Where the sheer magnitude of the data should intuitively drive us to the value of statistics, the standard methods for organizing information are more attuned to the world of small data, a world where statistics, by virtue of small sample size, can offer little for order and organization. Let us revisit this fictitious log line 81985529216486895 from the perspective of it being recalled for an application or service. First, consider that barring some monitor or specialty tool, it is not likely that any application would use the whole piece of data. Second, there is certainly a lot of data here (the other 1152921504606846974 rows), and even when hundreds or thousands of events roll up together, it is very unlikely that this one row would have a direct connection with a result or output, even in a service that is invoked billions of times each day. Consider that row in a statistical sense, however, and this one row could affect an unfathomable number of features derived from the data set. In a sense it is likely to come up (perhaps even multiple times) as a participant in the results of each service call.

As statistics are the natural form of the large data system, it should be very intuitive to optimize data around a statistical basis. Support for direct access to an instance of original data is a secondary or tertiary priority of such a system. Eureka Data was designed around such a principle; consequently it is less sequential and less beholden to the original form of the data, and consequently it presents fewer of the artifacts we add with our standard repositories.

Paradoxically, compared with other big data systems and their challenges of linking together internal references to get to a copy of the pristine representation of the original data, the Eureka Data system is likely to recall a single instance of data more quickly, even if that means recalling the data not from one pristine spot but as the result of a number of statistics and relationships.