<h1>A More Precise Definition of Tokens</h1>
<p><em>eurekadata.net, July 4, 2017</em></p>
<p>For the purposes of the Eureka data analysis, storage, and random-access retrieval system, tokens must be unambiguous islands of information that exist independently. Each token is the product of a <a href="https://en.wikipedia.org/wiki/Regular_expression">regular expression rule</a>, an arbitrary class name, and a priority. The priority is simply the order of application: one rule is attempted and, on failure, the next rule is tried in succession. This prohibits any possibility of multiple regular expression rules capturing the same bytes of the original corpus, so no byte belongs to more than one token. The class name (class as in classification, not data structure) serves in queries to distinguish tokens as members of different spaces.</p>
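The first-matching-rule-wins behavior described above can be sketched as a tiny prioritized lexer. This is a hypothetical illustration only; the rule set and class names are invented for the example and are not Eureka's actual lex specification:

```python
import re

# Ordered (priority) list of {regular expression, class name} lex rules.
# Earlier rules win at each position, so no byte is ever captured twice.
RULES = [
    (re.compile(r"\d+"), "number"),
    (re.compile(r"[A-Za-z]+"), "word"),
    (re.compile(r"\s+"), "whitespace"),
    (re.compile(r"."), "other"),  # catch-all keeps the lexing data conserving
]

def lex(corpus):
    """Yield (class_name, token) pairs covering every byte exactly once."""
    pos = 0
    while pos < len(corpus):
        for pattern, cls in RULES:
            m = pattern.match(corpus, pos)
            if m:
                yield cls, m.group(0)
                pos = m.end()
                break

tokens = list(lex("magna carta 1215"))

# Each token lands in exactly one class, partitioning the corpus
# into independent spaces keyed by class name.
spaces = {}
for cls, tok in tokens:
    spaces.setdefault(cls, []).append(tok)
```

Because every rule's capture is kept verbatim, concatenating the tokens in order reproduces the original corpus exactly, which is the data-conserving property the article requires.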
\u00a0 The process of prioritized application of different regular expression is called <a href=\"https:\/\/en.wikipedia.org\/wiki\/Lexical_analysis\">lexing<\/a>.\u00a0 In the larger field of data analytics and processing while regular expression matching is generally applied to lines of text however within Eureka it more frequently refers to a mere word, number, whitespace or some kind of structural formatting.\u00a0 Note also by the above definitions it is impossible for the same token to reside in different classes.<\/p>\n<p>These lex specifications {regular expression, class name} and implicitly by order the priority are an integral to the resulting Eureka digest.\u00a0 Tokenization is the first step in transforming the input corpus by partitioning tokens of the corpus into <b><i>k<\/i><\/b> independent spaces, where <b><i>k<\/i><\/b> is the number of rules applied and the spaces are largely independent with only cross space positional joining and aggregate byte histogram statistics residing outside of the concept of space.<\/p>\n<p>The emphasis on the precision and disambiguation here is deliberate.\u00a0 Distinguishing itself from the majority of meta data generation and applications in information retrieval all Eureka transformational processes must be datawise conservative and\u00a0 unambiguous for back transformation function.<\/p>\n<p>Standard indexes greatly transform or normalize their tokens, often discarding stop words outright or glomming them onto the token before or after them, normalizing capitalization, stemming suffixes and transforming text.\u00a0 These general techniques are used to reduce the size of the key space, reduce the size dedicated to posting lists or to improve recall by coalescing multiple words together.\u00a0 Eureka has no such opportunity, for each of these operations are not data conserving, however the internal representation is done in such a way that full representation can be achieved, full specificity (e.g. 
can join on an upper-cased term, on a lower-cased term, or on the set of all terms that match case-insensitively). Performance, in both speed and memory, is drilled into elsewhere [key space performance]; in general the differences are imperceptible, and processing automatically degenerates to the most restrictive combination of the query specification and what is present in the corpus.</p>
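The data-conserving alternative to case folding described above can be illustrated by keeping every surface form intact and deriving the case-insensitive view at query time, rather than normalizing destructively at index time. This is a hypothetical sketch under invented names; the structures shown are illustrative, not Eureka's internal representation:

```python
from collections import defaultdict

# Surface forms are stored exactly as they appear in the corpus,
# so the tokenization remains reversible (data conserving).
corpus_terms = ["Token", "TOKEN", "token", "Lexer"]

# A derived, non-destructive view: case-insensitive key -> original forms.
ci_index = defaultdict(set)
for term in corpus_terms:
    ci_index[term.lower()].add(term)

# Full specificity is preserved: join on an exact-case term ...
exact = [t for t in corpus_terms if t == "TOKEN"]

# ... or on the set of all case-insensitive matches of the same key.
insensitive = sorted(ci_index["token"])
```

Nothing is lost in building `ci_index`: the original forms survive inside the value sets, so a query can choose exact-case or case-insensitive joining without the index having committed to one normalization up front.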