Message-ID: <ZdrJn0lkFeYGuYIC@casper.infradead.org>
Date: Sun, 25 Feb 2024 05:01:19 +0000
From: Matthew Wilcox <willy@...radead.org>
To: Kent Overstreet <kent.overstreet@...ux.dev>
Cc: David Laight <David.Laight@...lab.com>,
'Herbert Xu' <herbert@...dor.apana.org.au>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Thomas Graf <tgraf@...g.ch>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"maple-tree@...ts.infradead.org" <maple-tree@...ts.infradead.org>,
"rcu@...r.kernel.org" <rcu@...r.kernel.org>
Subject: Re: [PATCH 0/1] Rosebush, a new hash table
On Sat, Feb 24, 2024 at 10:18:31PM -0500, Kent Overstreet wrote:
> On Sat, Feb 24, 2024 at 10:10:27PM +0000, David Laight wrote:
> > I remember playing around with the elf symbol table for a browser
> > and all its shared libraries.
> > While the hash function is pretty trivial, it really didn't matter
> > whether you divided 2^n, 2^n-1 or 'the prime below 2^n' some hash
> > chains were always long.
>
> that's a pretty bad hash, even golden ratio hash would be better, but
> still bad; you really should be using at least jhash.
There's a "fun" effect; essentially the "biased observer" effect which
leads students to erroneously conclude that the majority of classes are
oversubscribed. As somebody observed in this thread, for some use cases
you only look up hashes which actually exist.
Take a trivial example where you have four entries unevenly distributed
between two buckets, three in one bucket and one in the other. Now 3/4
of your lookups hit in one bucket and 1/4 in the other bucket.
Obviously it's not as pronounced if you have 1000 buckets with 1000
entries randomly distributed between the buckets. But that distribution
is not nearly as even as you might expect:
$ ./distrib
0: 362
1: 371
2: 193
3: 57
4: 13
5: 4
That's using lrand48() to decide which bucket to use, so not even a
"quality of hash" problem, just a "your mathematical intuition may not
be right here".
To put this data another way, 371 entries are in a bucket with a single
entry, 386 are in a bucket with two entries, 171 are in a 3-entry
bucket, 52 are in a 4-entry bucket and 20 are in a 5-entry bucket.
$ cat distrib.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

int bucket[1000];
int freq[10];

int main(int argc, char **argv)
{
	int i;

	for (i = 0; i < 1000; i++)
		bucket[lrand48() % 1000]++;
	for (i = 0; i < 1000; i++)
		freq[bucket[i]]++;
	for (i = 0; i < 10; i++)
		printf("%d: %d\n", i, freq[i]);
	return 0;
}
(ok, quibble about "well, 1000 doesn't divide INT_MAX evenly so your
random number generation is biased", but I maintain that will not
materially affect these results due to it affecting only 0.00003% of
numbers generated)