linux-kernel - Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	1 May 2016 05:43:58 -0400
From:	"George Spelvin" <linux@...izon.com>
To:	linux@...izon.com, tglx@...utronix.de
Cc:	eric.dumazet@...il.com, linux-kernel@...r.kernel.org,
	riel@...hat.com, torvalds@...ux-foundation.org
Subject: Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism

> Sorry I did not express myself clear enough.

> hash64 (the single multiply with the adjusted golden ratio) is
> slightly faster than the modulo one which has two mutiplications.

Yes, I figured out that was probably what you were talking about,
and benchmarked it similarly.

But I noticed a much greater difference.

 Wang      * PHI    % 4093   Wang      * PHI    % 4093
 13599486  3494816  5238442  13584166  3460266  5239463
 13589552  3466764  5237327  13582381  3422304  5276253
 13569409  3407317  5236239  13566738  3393215  5267909
 13575257  3413736  5237708  13596428  3379811  5280118
 13583607  3540416  5325609  13650964  3380301  5265210

At 3.7 GHz, that's 

* PHI:     1059 M ops/second
* Modulo:   706 M ops/second
* Wang:     271 M ops/second

Of course, that's a tight loop hashing; I presume your test case
has more overhead.

Anyway, my recommendation (I'll write the patches if you like) is:

* Modify the multiplicative constants to be
  #define COLDEN_RATIO_32 0x61C88647
  #define COLDEN_RATIO_64 0x61C8864680B583EB

* Change the prototype of hash_64 to return u32.

* Create a separate 32-bit implementation of hash_64() for the
  BITS_PER_LONG < 64 case.  This should not be Wang's or anything
  similar because 64-bit shifts are slow and awkward on 32-bit
  machines.
  One option is something like __jhash_final(), but I think
  it will suffice to do:

  static __always_inline u32 hash_64(u64 val, unsigned int bits)
  {
	u32 hash = (u32)(val >> 32) * GOLDEN_RATIO_32 + (u32)val;
	hash *= GOLDEN_RATIO_32;
        return hash >> (32 - bits);
  }
  (S-o-b on that if you want it, of course.)

  (If you want it out of line, make a helper function that
  computes the 32-bit hash and have an inline wrapper that
  does the shifting.)

* Eliminate the !CONFIG_ARCH_HAS_FAST_MULTIPLIER code path.  Once we've
  got rid of the 32-bit processors, we're left with 64-bit processors,
  and they *all* have hardware multipliers.  Even the MIPS R4000 (first
  64-bit processor ever) and Alpha 21064 had one.

  (Admittedly, the R4000 was 20 cycles, which is slower than Wang's hash,
  but 18 of them could overlap other instructions.)

  The worst multiply support is SPARCv9, which has 64x64-bit
  multiply with a 64-bit result, but no widening multiply.

* If you feel ambitious, add a 32-bit CONFIG_ARCH_HAS_SLOW_MULTIPLIER
  exception path.