Token Normalization

Token normalization is an example of a mechanism to

  • impose a lexical ordering on numerics or other data types
  • introduce many orders of periodicity; for example, a date stored as Unix seconds, normalized or padded to 10 base-10 digits, produces many fewer features than a date string with day of week, day of month, day of year, week of month, week of year, month of year, etc. (see the sketch after this list), and
  • denormalize structured data such that compound keys can represent hierarchical key definitions and state.

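As a rough illustration of the first two points, the sketch below zero-pads a numeric value so that lexical order matches numeric order, and expands a Unix timestamp into tokens of several periodicities. This is a minimal Python sketch; the helper names (normalize_numeric, expand_timestamp) are hypothetical and do not reflect Eureka's actual tokenizer.

    from datetime import datetime, timezone

    def normalize_numeric(value: int, width: int = 10) -> str:
        # Zero-pad a non-negative integer so lexical (string) order
        # matches numeric order, e.g. "0000000042" < "0000001000".
        return str(value).zfill(width)

    def expand_timestamp(unix_seconds: int) -> dict:
        # Expand a Unix timestamp into tokens with different periodicities.
        dt = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
        return {
            "unix_seconds":  normalize_numeric(unix_seconds),   # one totally ordered token
            "day_of_week":   dt.strftime("%a"),                 # period 7
            "day_of_month":  f"{dt.day:02d}",                   # period ~30
            "day_of_year":   f"{dt.timetuple().tm_yday:03d}",   # period ~365
            "week_of_year":  f"{dt.isocalendar()[1]:02d}",      # period ~52
            "month_of_year": f"{dt.month:02d}",                 # period 12
        }

    print(expand_timestamp(1_700_000_000))

The single padded token preserves numeric order under lexical comparison, while the expanded form trades compactness for explicitly periodic features.
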
Such denormalization may not expand the size of the data as much as one might think.  For instance, keys of arbitrarily large size add no additional space in the store after the first time they are encountered.  Also, if something such as a date follows a reasonably predictable pattern, Eureka may exploit that pattern to handle the additional data load more efficiently.  If 1/7th of the days of the week are Mondays, and consequently this token appears in approximately the same position in 1/7th of the records, then adding it may require only a few extra bits of data per record.
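
To make the "few extra bits" estimate concrete: assuming the day-of-week token is roughly uniform across records and the store's encoding approaches the entropy limit, the marginal cost is about log2(7) bits per record. The back-of-the-envelope calculation below is an illustration only, not a description of Eureka's actual encoding.

    import math

    # Information content of a roughly uniform 7-way token (day of week):
    # an entropy-approaching encoding needs about log2(7) bits per record.
    bits_per_record = math.log2(7)
    print(f"~{bits_per_record:.2f} extra bits per record")  # ~2.81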