[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5p5sypt3y643rr7kp66lhmgksgtuvdgijrryh53mqiiqkrgyty@d4zcnya22owg>
Date: Sun, 25 Feb 2024 00:51:06 -0500
From: Kent Overstreet <kent.overstreet@...ux.dev>
To: Matthew Wilcox <willy@...radead.org>
Cc: David Laight <David.Laight@...lab.com>,
'Herbert Xu' <herbert@...dor.apana.org.au>, "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Thomas Graf <tgraf@...g.ch>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"maple-tree@...ts.infradead.org" <maple-tree@...ts.infradead.org>, "rcu@...r.kernel.org" <rcu@...r.kernel.org>
Subject: Re: [PATCH 0/1] Rosebush, a new hash table
On Sun, Feb 25, 2024 at 05:01:19AM +0000, Matthew Wilcox wrote:
> On Sat, Feb 24, 2024 at 10:18:31PM -0500, Kent Overstreet wrote:
> > On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> > > I remember playing around with the elf symbol table for a browser
> > > and all its shared libraries.
> > > While the hash function is pretty trivial, it really didn't matter
> > > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash
> > > chains were always long.
> >
> > that's a pretty bad hash, even golden ratio hash would be better, but
> > still bad; you really should be using at least jhash.
>
> There's a "fun" effect; essentially the "biased observer" effect which
> leads students to erroneously conclude that the majority of classes are
> oversubscribed. As somebody observed in this thread, for some usecases
> you only look up hashes which actually exist.
>
> Task a trivial example where you have four entries unevenly distributed
> between two buckets, three in one bucket and one in the other. Now 3/4
> of your lookups hit in one bucket and 1/4 in the other bucket.
> Obviously it's not as pronounced if you have 1000 buckets with 1000
> entries randomly distributed between the buckets. But that distribution
> is not nearly as even as you might expect:
>
> $ ./distrib
> 0: 362
> 1: 371
> 2: 193
> 3: 57
> 4: 13
> 5: 4
>
> That's using lrand48() to decide which bucket to use, so not even a
> "quality of hash" problem, just a "your mathematical intuition may not
> be right here".
well, golden ratio hash - hash_32(i, 32)
0: 368
1: 264
2: 368
3: 0
but your distribution actually is accurate in general, golden ratio hash
is relly nice for sequential integers. the actual problem with your test
is that you're testing 100% occupancy - no one does that.
75% occupancy, siphash:
0: 933
1: 60
2: 6
3: 1
4: 0
that looks about right to me.
Powered by blists - more mailing lists