{"id":108,"date":"2017-07-04T21:34:07","date_gmt":"2017-07-04T21:34:07","guid":{"rendered":"http:\/\/eurekadata.net\/?page_id=108"},"modified":"2017-10-09T21:01:55","modified_gmt":"2017-10-09T21:01:55","slug":"choosing_tokens","status":"publish","type":"page","link":"https:\/\/eurekadata.net\/index.php\/choosing_tokens\/","title":{"rendered":"Choosing Tokenization Well"},"content":{"rendered":"<p>Eureka, the entirely automatic, recursive process of distilling data into its component statistics, can make use of a format specification but does not require one, and it places no restrictions or limitations on its input data.\u00a0 It would even be possible to inject binary information into plain text and robustly recall it, though that might lead to some confusion with coincidental tokens both inside and outside the binary sections (see normalization for mechanisms to make this easy)[].<\/p>\n<p>Eureka does, however, require human interaction in the construction of lex rules.\u00a0 As a component of the Eureka digest itself, the lex rules define how the original data is partitioned into different statistical spaces.\u00a0 This has two practical effects. First, the consistency of the resulting partition materially affects how effectively Eureka compresses the data, and hence how quickly all operations within that space occur. 
Second, the query language is expressed through the different classes of tokens, so the choice of token classes determines how easily operational efficiency is achieved.\u00a0 Recall queries are an interactive process[\u2026], hence poor choices or mismatched regular expression rules would result in feedback that might be misleading or confusing.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Eureka, the entirely automatic, recursive process of distilling data into its component statistics, can make use of a format specification but does not require one, and it places no restrictions or limitations on its input data.\u00a0 It would even be possible to inject binary information into plain text and robustly recall it, though that might lead to some &hellip; <a href=\"https:\/\/eurekadata.net\/index.php\/choosing_tokens\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Choosing Tokenization Well<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/pages\/108"}],"collection":[{"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/comments?post=108"}],"version-history":[{"count":5,"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/pages\/108\/revisions"}],"predecessor-version":[{"id":189,"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/pages\/108\/revisions\/189"}],"wp:attachment":[{"href":"https:\/\/eurekadata.net\/index.php\/wp-json\/wp\/v2\/media?parent=108"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}