Designs for efficient low power cardinality and similarity sketches by Two-Step Hashing (TSH) Articles uri icon

authors

  • Li, Jie
  • REVIRIEGO VASALLO, PEDRO
  • SHANSHAN, LIU
  • Xiao, Liyi
  • Lombardi, Fabrizio

publication date

  • November 2021

start page

  • 246

end page

  • 253

volume

  • 81

abstract

  • In many computing applications, there is a need to estimate different features of large data sets, such as the number of distinct elements or the similarity with other data sets. This can be efficiently implemented using probabilistic data structures known as sketches; for example, Hyperloglog is widely used for cardinality estimate and Minhash for similarity estimate. Many of these sketches rely on computing hash functions on the elements and counting the number of consecutive leading zeros (or ones) to identify the minimum (or maximum) hash value for data updating. This functionality usually accounts for the most demanding part of the sketch implementation; an efficient and low power hash computation and leading zero counting are therefore required for high performance. A novel but important observation is exploited in this paper; namely the trailing hash bits are rarely used in these sketches (e.g., the leading zeros counting stops once the first one is found, which usually occurs after a few bits). Based on this feature, a so-called Two-Step Hashing (TSH) scheme is proposed in this paper to significantly reduce the power consumption. Therefore rather than computing all hash bits (as in the traditional scheme), TSH computes a few leading hash bits in a first step and the trailing hash bits in the second step, that is only required when the leading bits are all zeros (or ones). Hence, in most cases only a fraction of the hash bits is computed, so significantly reducing the power dissipation of hash computations when they are used to check the number of consecutive leading zeros, or the minimum value. The use of TSH for sketches with both single and multiple pipeline implementations are considered; evaluation results based on HyperLogLog (HLL) as a sketch with a Cyclic Redundancy Check hash function show that TSH saves up to 75.9 % (78.7 %) power dissipation in a single (multiple) pipeline implementation.

subjects

  • Industrial Engineering

keywords

  • cardinality; cyclic redundancy check (crc); hyperloglog; minhash; similarity