{"id":109,"date":"2017-07-04T21:34:07","date_gmt":"2017-07-04T21:34:07","guid":{"rendered":"http:\/\/eurekadata.net\/?page_id=109"},"modified":"2017-10-05T16:55:27","modified_gmt":"2017-10-05T16:55:27","slug":"shakespeare-example","status":"publish","type":"page","link":"https:\/\/eurekadata.net\/index.php\/shakespeare-example\/","title":{"rendered":"Shakespeare Example"},"content":{"rendered":"<p>&nbsp;<\/p>\n<p>In this example we&#8217;ll focus on the first sonnet of <a href=\"http:\/\/eurekadata.net\/index.php\/shakespeare-redux\/\">William Shakespeare<\/a>. \u00a0The six rules, applied in priority order, will lex the corpus into six separate Eureka spaces. \u00a0Note that the\u00a0<i>catch all<\/i>\u00a0rule\u00a0<i>\u2018.\u2019<\/i>\u00a0has the lowest priority, and though it would otherwise capture all remaining single characters, it will remain\u00a0empty.<\/p>\n<h2 class=\"tablepress-table-name tablepress-table-name-id-1\">priority_regex_name_table<\/h2>\n\n<table id=\"tablepress-1\" class=\"tablepress tablepress-id-1\">\n<thead>\n<tr class=\"row-1 odd\">\n\t<th class=\"column-1\">Priority<\/th><th class=\"column-2\">Regular Expression Rule<\/th><th class=\"column-3\">Space<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-hover\">\n<tr class=\"row-2 even\">\n\t<td class=\"column-1\">1<\/td><td class=\"column-2\"><pre style=\"font-size: smaller\">([A-Za-z]+[0-9]*)+<\/pre><\/td><td class=\"column-3\"><span class=\"word\">word<\/span><\/td>\n<\/tr>\n<tr class=\"row-3 odd\">\n\t<td class=\"column-1\">2<\/td><td class=\"column-2\"><pre style=\"font-size: smaller\">[+-~]?[0-9]+([\\\\.][0-9]*)*<\/pre><\/td><td class=\"column-3\"><span class=\"num\">num<\/span><\/td>\n<\/tr>\n<tr class=\"row-4 even\">\n\t<td class=\"column-1\">3<\/td><td class=\"column-2\"><pre style=\"font-size: smaller\">(([ \\t]+[:;\/\\.,!@#$%&amp;*()_\\+\\-={}\\^\\[\\]'()\"(<>]*)+)<\/pre><\/td><td class=\"column-3\"><span class=\"white\">white<\/span><\/td>\n<\/tr>\n<tr class=\"row-5 odd\">\n\t<td 
class=\"column-1\">4<\/td><td class=\"column-2\"><pre style=\"font-size: smaller\">(([:;\/\\.,!@#$%&amp;*()_\\+\\-={}\\^\\[\\]')\"\\(<>]+[ \\t]*)+)<\/pre><\/td><td class=\"column-3\"><span class=\"punct\">punct<\/span><\/td>\n<\/tr>\n<tr class=\"row-6 even\">\n\t<td class=\"column-1\">5<\/td><td class=\"column-2\"><pre style=\"font-size: smaller\">\\n\\n?<\/pre><\/td><td class=\"column-3\"><span class=\"para\">para<\/span><\/td>\n<\/tr>\n<tr class=\"row-7 odd\">\n\t<td class=\"column-1\">6<\/td><td class=\"column-2\"><pre style=\"font-size: smaller\">.<\/pre><\/td><td class=\"column-3\">any<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<!-- #tablepress-1 from cache -->\n<p>And the resulting lexing, color coded to the above would look something like:<\/p>\n<p style=\"border: 2px; border-style: solid; border-color: #101010; padding: 1em; background-color: #808080;\"><span class=\"num\">1609<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"word\">THE<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">SONNETS<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"word\">by<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">William<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">Shakespeare<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"num\">1<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">From<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">fairest<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">creatures<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">we<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">desire<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">increase<\/span><span class=\"punct\">,<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span 
class=\"white\">\u00a0<\/span><span class=\"word\">That<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">thereby<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">beauty<\/span><span class=\"punct\">&#8216;<\/span><span class=\"word\">s<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">rose<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">might<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">never<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">die<\/span><span class=\"punct\">,<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">But<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">as<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">the<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">riper<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">should<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">by<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">time<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">decease<\/span><span class=\"punct\">,<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">His<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">tender<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">heir<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">might<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">bear<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">his<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">anamory<\/span><span class=\"punct\">:<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">But<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">thou<\/span><span class=\"white\">\u00a0<\/span><span 
class=\"word\">contracted<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">to<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">thine<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">own<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">bright<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">eyes<\/span><span class=\"punct\">,<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">Feed<\/span><span class=\"punct\">&#8216;<\/span><span class=\"word\">st<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">thy<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">light<\/span><span class=\"punct\">&#8216;<\/span><span class=\"word\">s<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">flame<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">with<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">self<\/span><span class=\"punct\">&#8211;<\/span><span class=\"word\">substantial<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">fuel<\/span><span class=\"punct\">,<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">Making<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">a<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">famine<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">where<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">abundance<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">lies<\/span><span class=\"punct\">,<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">Thy<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">self<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">thy<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">foe<\/span><span 
class=\"punct\">, <\/span><span class=\"word\">to<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">thy<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">sweet<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">self<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">too<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">cruel<\/span><span class=\"punct\">:<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">Thou<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">that<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">art<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">now<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">the<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">world<\/span><span class=\"punct\">&#8216;<\/span><span class=\"word\">s<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">fresh<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">ornament<\/span><span class=\"punct\">,<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">And<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">only<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">herald<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">to<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">the<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">gaudy<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">spring<\/span><span class=\"punct\">,<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">Within<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">thine<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">own<\/span><span class=\"white\">\u00a0<\/span><span 
class=\"word\">bud<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">buriest<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">thy<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">content<\/span><span class=\"punct\">,<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">And<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">tender<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">churl<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">mak<\/span><span class=\"punct\">&#8216;<\/span><span class=\"word\">st<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">waste<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">in<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">niggarding<\/span><span class=\"punct\">:<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">Pity<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">the<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">world<\/span><span class=\"punct\">, <\/span><span class=\"word\">or<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">else<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">this<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">glutton<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">be<\/span><span class=\"punct\">,<\/span><span class=\"para\">\u2424<br \/>\n<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">To<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">eat<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">the<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">world<\/span><span class=\"punct\">&#8216;<\/span><span class=\"word\">s<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">due<\/span><span class=\"punct\">, 
<\/span><span class=\"word\">by<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">the<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">grave<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">and<\/span><span class=\"white\">\u00a0<\/span><span class=\"word\">thee<\/span><span class=\"punct\">.<\/span><span class=\"para\">\u2424<br \/>\n<\/span><\/p>\n<p>Internally, the corpus would be stored in the six spaces with inter-space join tables. \u00a0Queries could search for consecutive tokens in one space, with the intervening tokens matching a specific pattern of spaces or\u00a0<em>any<\/em>.<\/p>\n<pre>word:token1 [\u2295[ .*:.*]+ \u2295 word:tokenYi]*<\/pre>\n<p>A variation:<\/p>\n<pre>word:<em><strong>PC<\/strong><\/em>(token1) [\u2295[ .*:.*]+ \u2295 word:tokenYi]*<\/pre>\n<p>where <em><strong>PC<\/strong><\/em> would permute and include all existing capitalizations of token1 in the word space, or<\/p>\n<pre>word:token1 [\u2295[ white:\"\\t\"]+ \u2295 word:tokenYi]*<\/pre>\n<p>where the white space separator(s) are explicitly constrained to a single tab.<\/p>\n<p>Fully indexing every letter position in every token, indexing every token (no normalization or dropping of stop-words), and storing all frequency statistics, from letter frequencies to the count of unique frequencies present in each space, costs only a small multiple (i.e. 
<strong>&lt;3<\/strong> times) of the original corpus size even in this small case. 
\u00a0As we scale documents up to achieve statistical significance this ratio trends toward one.<\/p>\n<p>To emphasize how small this corpus is, and to illustrate the Eureka data approach of storing all relevant metadata directly, the following table displays the count of frequencies for the <span class=\"word\">word<\/span> space.<\/p>\n<p>Note that due to the minuscule size this space is very sparse and the numeric progressions are likewise discontinuous:<\/p>\n\n<table id=\"tablepress-3\" class=\"tablepress tablepress-id-3\">\n<thead>\n<tr class=\"row-1 odd\">\n\t<th class=\"column-1\">Count of words sharing \u03bb <\/th><th class=\"column-2\">Frequency(\u03bb)<\/th>\n<\/tr>\n<\/thead>\n<tbody class=\"row-hover\">\n<tr class=\"row-2 even\">\n\t<td class=\"column-1\">1<\/td><td class=\"column-2\">6<\/td>\n<\/tr>\n<tr class=\"row-3 odd\">\n\t<td class=\"column-1\">2<\/td><td class=\"column-2\">4<\/td>\n<\/tr>\n<tr class=\"row-4 even\">\n\t<td class=\"column-1\">4<\/td><td class=\"column-2\">3<\/td>\n<\/tr>\n<tr class=\"row-5 odd\">\n\t<td class=\"column-1\">7<\/td><td class=\"column-2\">2<\/td>\n<\/tr>\n<tr class=\"row-6 even\">\n\t<td class=\"column-1\">78<\/td><td class=\"column-2\">1<\/td>\n<\/tr>\n<tr class=\"row-7 odd\">\n\t<td class=\"column-1\">92 (total)<\/td><td class=\"column-2\"><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<span class=\"tablepress-table-description tablepress-table-description-id-3\">For words in Shakespeare's first sonnet, the distribution of frequencies shows that only one word has six occurrences, two words have four occurrences, four words have three occurrences, seven words have two occurrences and a whopping seventy-eight words are present only once in this short text.  How's that for diversity!<\/span>\n<!-- #tablepress-3 from cache -->\n<p>As the size of the corpus increases, the relatively costly insertions of unique terms into the terms table occur less often and all tables benefit from inherent statistics. 
\u00a0This has the effect of improving overall per-corpus efficiency, but even in the case of our minuscule corpus the ratios already look reasonably good.<\/p>\n\n<table id=\"tablepress-2\" class=\"tablepress tablepress-id-2\">\n<thead>\n<tr class=\"row-1 odd\">\n\t<th class=\"column-1\">Space<\/th><th class=\"column-2\">Eureka Internal Size<\/th><th class=\"column-3\">Original Size (in bytes)<\/th><th class=\"column-4\">Ratio of Sizes<\/th>\n<\/tr>\n<\/thead>\n<tfoot>\n<tr class=\"row-8 even\">\n\t<th class=\"column-1\">total<\/th><th class=\"column-2\">1.9 kbytes<\/th><th class=\"column-3\">711<\/th><th class=\"column-4\">2.6<\/th>\n<\/tr>\n<\/tfoot>\n<tbody class=\"row-hover\">\n<tr class=\"row-2 even\">\n\t<td class=\"column-1\"><span class=\"word\">word<\/span><\/td><td class=\"column-2\">1.4 kbytes<\/td><td class=\"column-3\">510<\/td><td class=\"column-4\">2.9<\/td>\n<\/tr>\n<tr class=\"row-3 odd\">\n\t<td class=\"column-1\"><span class=\"num\">num<\/span><\/td><td class=\"column-2\">59 bytes<\/td><td class=\"column-3\">5<\/td><td class=\"column-4\">12<\/td>\n<\/tr>\n<tr class=\"row-4 even\">\n\t<td class=\"column-1\"><span class=\"white\">white<\/span><\/td><td class=\"column-2\">252 bytes<\/td><td class=\"column-3\">145<\/td><td class=\"column-4\">1.7<\/td>\n<\/tr>\n<tr class=\"row-5 odd\">\n\t<td class=\"column-1\"><span class=\"punct\">punct<\/span><\/td><td class=\"column-2\">146 bytes<\/td><td class=\"column-3\">27<\/td><td class=\"column-4\">5.4<\/td>\n<\/tr>\n<tr class=\"row-6 even\">\n\t<td class=\"column-1\"><span class=\"para\">para<\/span><\/td><td class=\"column-2\">90 bytes<\/td><td class=\"column-3\">24<\/td><td class=\"column-4\">3.7<\/td>\n<\/tr>\n<tr class=\"row-7 odd\">\n\t<td class=\"column-1\"><span class=\"any\">any<\/span><\/td><td class=\"column-2\">0<\/td><td class=\"column-3\">0<\/td><td class=\"column-4\">0\/0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<!-- #tablepress-2 from cache -->\n<h3>Conclusion<\/h3>\n<p>Eureka is designed around direct 
joining, hence utilizing the concept of multiple spaces results in very natural access without performance degradation. \u00a0The demarcation of <em>genuinely<\/em>\u00a0different\u00a0kinds of data will strongly enhance performance as<em>\u00a0<\/em>each space exhibits disparate frequencies, limited tokens and natural association. \u00a0The spatial definition is a natural deconvolution of the original data, reducing the overall size and entropy and hence increasing performance.<\/p>\n<p>Though Eureka is general in that it is more about discovering structure than imposing structure on a corpus, it is wise, for convenience of access and performance, to consider the definitions of token rules and priorities, as they serve the dual purpose of defining your data and optimizing access.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>&nbsp; In this example we&#8217;ll focus on the first sonnet of William Shakespeare. \u00a0The six rules, applied in priority order, will lex the corpus into six separate Eureka spaces. \u00a0Note that the\u00a0catch all\u00a0rule\u00a0\u2018.\u2019\u00a0has the lowest priority, and though it would otherwise capture all remaining single characters, it will remain\u00a0empty. 
And the resulting lexing, color &hellip; <a href=\"https:\/\/eurekadata.net\/index.php\/shakespeare-example\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Shakespeare Example<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/pages\/109"}],"collection":[{"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/comments?post=109"}],"version-history":[{"count":62,"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/pages\/109\/revisions"}],"predecessor-version":[{"id":401,"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/pages\/109\/revisions\/401"}],"wp:attachment":[{"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/media?parent=109"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}