{"id":55,"date":"2017-06-29T14:06:08","date_gmt":"2017-06-29T14:06:08","guid":{"rendered":"http:\/\/eurekadata.net\/?page_id=55"},"modified":"2017-10-05T17:06:19","modified_gmt":"2017-10-05T17:06:19","slug":"performance","status":"publish","type":"page","link":"https:\/\/eurekadata.net\/index.php\/performance\/","title":{"rendered":"Performance"},"content":{"rendered":"<h3>Its Not at all About Storage Space &#8230;<\/h3>\n<p>Aside from the general re-architecture from a central repository to a plethora of general purpose meta-stores and digests on that architecture to increasingly specific met-store and digests to\u00a0<em>just one place,<\/em> there are but two themes to Eureka. \u00a0First, reduce the overall wasted space when storing and accessing data and secondly, reduce the overall access time. \u00a0This should really should read<\/p>\n<p style=\"text-align: center;\"><strong>Reduce the overall access time.<\/strong><br \/>\n<strong><em>Reduce the overall access time.<\/em><br \/>\n<strong><em>!!! Reduce the overall access time !!!<\/em><\/strong><br \/>\n<\/strong><\/p>\n<p>The functionality and statistics are a means to the end, which are enabled by and improved with larger data, and that is tolerable with quicker processes, quicker turnaround and faster <em>overall access time<\/em>. \u00a0Quicker turnaround time results in more analysis, results in better understanding of the data, results in a better applications, utilizing larger data, more insights and so on.<\/p>\n<p>Consider space, if you have one server at capacity and you want to double the space of that capacity in almost all cases you could easily partition that job on two servers now at twice the relative speed as before. \u00a0The problem is that this is only the best outcome, soon to be disturbed if for any reason the now split job in such a way that the work largely independently processed across the servers. 
\u00a0The upside is a 2x improvement; the downside is a million-fold degradation, as every context level of the program is broken right down to the Ethernet packet if and when that server just a single meter away must synchronize.<\/p>\n<p>Buying more machines is cheap, but using them incorrectly can be very expensive. \u00a0Compression in information retrieval, though perhaps originally about economizing on quicker, more expensive memory, has for a very long time been about minimizing the relative slowness of memory movement and improving the locality of data. \u00a0Even though the CPU clock rate has been relatively stagnant, the transistor count marches forward, driving at least two different dimensions of opportunity: the CPU core count and the number of instructions per clock cycle. \u00a0Maintaining the highest utilization of concurrency within the CPU, by reducing starvation caused by cache misses and memory bus limits and by reducing resource contention, will be the biggest concern in a large data system now and into the short-term future at least.<\/p>\n<p>Consider also the virtuous cycle that occurs as data density increases: offsets get smaller, which means that references can get smaller, which in turn makes offsets smaller again. \u00a0Densification acts to lessen cache misses and improves data locality, meaning again that concurrent operations are more likely to happen.<\/p>\n<h3>About Space<\/h3>\n<p>The amount of space required to store a sampling of William Shakespeare can be summed up in a single graph. \u00a0It appears to be settling at roughly a 75% overhead on top of the original text itself.<\/p>\n<p><strong>Note: \u00a0The following graph is out of date.
\u00a0The current Eureka ratio appears to converge to 1.44.<\/strong><\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-219 size-full\" src=\"http:\/\/eurekadata.net\/wp-content\/uploads\/2017\/06\/size_eureka_v_shakespeare.png\" alt=\"\" width=\"712\" height=\"450\" srcset=\"https:\/\/eurekadata.net\/wp-content\/uploads\/2017\/06\/size_eureka_v_shakespeare.png 712w, https:\/\/eurekadata.net\/wp-content\/uploads\/2017\/06\/size_eureka_v_shakespeare-300x190.png 300w\" sizes=\"(max-width: 712px) 100vw, 712px\" \/><\/p>\n<p>The horizontal axis indicates lines of Shakespeare ingested, and the blue and green data points indicate the size in kilobytes of the Eureka dataset and the original text respectively. \u00a0The golden colored line is the ratio of the blue line over the green line, and its flattening out as we approach 1000 lines of text suggests that Eureka is likely to require an approximately 75% overhead on top of the size of the text, though it would be interesting to re-evaluate with documents many orders of magnitude larger and of many different types.<\/p>\n<p>I should note here that Eureka is limited most in ingesting information; however, the paltry size of the corpus tested is explained by a lack of resources more than anything else. There are no limiting constraints in the design aside from the relatively expensive, potentially one-time transformation of the original stream to an internal format.<\/p>\n<h3>Comparisons With Popular Search Indexers<\/h3>\n<p>As was explained previously, Eureka Data stands to replace a vast amount of intermediate metadata and derivatives with a general purpose framework. \u00a0Any one of these may easily exceed the 75% overhead above, and as a collection they would generally exceed it by a great multiple.<\/p>\n<p>Consider just search engine indexing, a popular metadata service for data and fields that are unstructured.
\u00a0The popular search tools Lucene and the more modern Elasticsearch seem to have a baseline index (or overhead) cost of 2-3 times and 0.5-1.5 times respectively with modest indexing. \u00a0You can see that it is not even really reasonable to compare these frameworks with Eureka, as they are mainly crafted around selecting which parts of the original dataset you would like to index, how deeply you would index them and\u00a0what kinds of ancillary data you would also like to derive in order to best utilize the resulting index.<\/p>\n<p>In contrast the Eureka system is definitively complete and comprehensive. \u00a0As the\u00a0<em>source of truth<\/em>\u00a0nothing can remain unindexed, nor can any statistic necessary for recovery be left uncollected. \u00a0If anything were either unindexed or uncollected, the Eureka method would not be able to recover all or part of the data. \u00a0Still, Eureka at 75% overhead is extremely competitive and seems superior to state of the art techniques even from their own perspective.<\/p>\n<h3>A Necessary Perspective Change<\/h3>\n<p>How poorly these concepts fit together accentuates the problems of perspective that Eureka attempts to overcome. \u00a0The concept of overhead must be addressed, as it is problematic. \u00a0William Shakespeare with full indexes and aggregate statistics <em><strong>is<\/strong><\/em> a lot more information than William Shakespeare without.<\/p>\n<p>The rationale follows. William Shakespeare in ASCII format\u00a0<em><strong>is<\/strong><\/em> much more information than the same text after a completely lossless pass of bzip2 (in its best mode it compresses to 27% of its original size). \u00a0With the bzip2 version, however, one could make little intelligence of the data directly, nor could one skip an arbitrary distance into the text and make any intelligence of it at all.
\u00a0That one can know very little about the data directly, without first decompressing it, indicates that there is actually less information there.<\/p>\n<p>Likewise William Shakespeare in ASCII format\u00a0<strong><em>is<\/em><\/strong>\u00a0less information than an indexed version of the same text. \u00a0While you can make intelligence of some part of William Shakespeare in ASCII format directly, you cannot make intelligence of, or derive any information about, all of William Shakespeare directly. \u00a0You could also skip some arbitrary number of bytes into the text and make intelligence of what you then read, but you cannot skip directly to a specific part intelligently without an added index to tell you where to go.<\/p>\n<p>Scale this line of reasoning up with the following thought experiment. \u00a0What if you had all the human readable information that there ever was in one place? \u00a0Clearly this would be unintelligible. \u00a0An objective analyst dropped into an arbitrary part of this enormous corpus would be many orders of magnitude more likely to land somewhere whose context they could little understand than at, say, a tweet that they published earlier in the day. \u00a0If you were facing all the information that there ever was, you would more likely than not never stumble across a context that you fully understood.<\/p>\n<p>If the information were organized, however, you would find your tweet directly. \u00a0To compile a list of similar tweets, or even to understand the context of that one 140 character message, would take considerable organization to aid the process. \u00a0A program, or even a cluster of coordinated programs, would be a help, but only in a limited way. \u00a0One program searching multiple orders of magnitude more quickly than the analyst, or a cluster of programs with multiple orders of magnitude faster search than the one program, would both be quickly bogged down. \u00a0The key is that\u00a0the data itself must be organized.
That really is the only way to have an opportunity to understand even one aspect of the data.<\/p>\n<p>Organization therefore is not\u00a0<em>overhead<\/em> so much as creating data from data, or, as per the earlier example, creating it from the great bard himself. \u00a0With this concept in mind the contrast between Lucene, Elasticsearch and Eureka increases. \u00a0Thinking from this perspective, Lucene and Elasticsearch are creating new information from their corpus, but insofar as they are incomplete, non-comprehensive and less general, they are bringing in much less information than Eureka and hence are truly much less efficient.<\/p>\n<h3>Efficiency<\/h3>\n<p>Luckily one should be able to emulate Eureka on Lucene or Elasticsearch. \u00a0You need only create a complete and full index with no denormalization and all statistics possible. \u00a0This would be an exercise in duplicating data in many formats; for example, to achieve direct keyword regular expression matching, some Lucene users play tricks such as indexing each word recursively at each letter in the word. \u00a0You would have to invent a lot of mechanisms to pour in statistics and, for instance, multiple mechanisms to mark different levels of context such as sentence, paragraph, section and document. \u00a0The result would build up. \u00a0The result would be\u00a0<strong>huge<\/strong>!<\/p>\n<p>I have some pretty clever codecs as part of Eureka, but more significant than those, or rather building upon those, is that the efficiency is the result of all information being complete and self-referential. \u00a0Eureka couldn&#8217;t save space by not indexing something such as stop words, nor can it normalize all terms to UPPERCASE or stem terms. There is no statistic that can be discarded or even truncated in order to save space.
\u00a0The user is welcome to do that with their data stream if they like, and for some applications it may actually make sense and save resources to discard or normalize their data; however, doing any of these things within the Eureka format itself would break\u00a0the concept of self-consistency and completeness, and ultimately cost more in space than a design that stores it all.<\/p>\n<p>[ To some degree stemming and transformation are more about increasing recall (being less specific) than about operational performance. \u00a0Eureka can help here, as the specificity can be guided interactively. \u00a0Further, with mixed case, only permutations present in the token table get added to a mixed-case join; searching for such permutations costs only one operation per letter (in contrast to a hash table, which would cost 2<sup>letters<\/sup>), and managing them during a join is done through a lightweight cursor. \u00a0Unless a sizable number (as in hundreds or thousands) of permutation cursors exists, perceptible performance degradation is unlikely. \u00a0]<\/p>\n<h3>Overall Access<\/h3>\n<p>The information in Eureka is organized for direct access, where <em>direct access<\/em> is shorthand for\u00a0<em><strong>O<\/strong><\/em>(log<sub>k<\/sub>(n)), where k &gt;&gt; 2. \u00a0This is not truly random access but approximately so. \u00a0Possibly more important than the\u00a0<em><strong>O<\/strong><\/em>(log<sub>k<\/sub>(n)), k &gt;&gt; 2 and\u00a0<strong><em>O<\/em><\/strong>(<em>k<\/em>) access is that wherever possible data locality has been preserved. \u00a0For example the bits representing one word are directly adjacent to the bits representing the following word.
\u00a0The uncapitalized word is logically connected to its capitalized equivalent, and so on.<\/p>\n<p>The efficient packing of information within the architecture should be able to scale quite large with near constant time performance for many of its operations.<\/p>\n<p>Direct access to both local and corpus wide statistics allows for both dynamic query optimization and optimized <a href=\"http:\/\/eurekadata.net\/index.php\/advanced-analytics\/\">hypothesis testing<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It&#8217;s Not at All About Storage Space &#8230; Aside from the general re-architecture from a central repository to a plethora of general purpose meta-stores and digests on that architecture to increasingly specific meta-stores and digests to\u00a0just one place, there are but two themes to Eureka. \u00a0First, reduce the overall wasted space when storing and accessing &hellip; <a href=\"https:\/\/eurekadata.net\/index.php\/performance\/\" class=\"more-link\">Continue reading <span
class=\"screen-reader-text\">Performance<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/pages\/55"}],"collection":[{"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/comments?post=55"}],"version-history":[{"count":26,"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/pages\/55\/revisions"}],"predecessor-version":[{"id":403,"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/pages\/55\/revisions\/403"}],"wp:attachment":[{"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/media?parent=55"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}