[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20141115191219.GA19060@thin>
Date: Sat, 15 Nov 2014 11:12:20 -0800
From: Josh Triplett <josh@...htriplett.org>
To: Thomas Graf <tgraf@...g.ch>
Cc: Herbert Xu <herbert@...dor.apana.org.au>, netdev@...r.kernel.org,
eric.dumazet@...il.com, paulmck@...ux.vnet.ibm.com
Subject: Re: [PATCH 4/4] rhashtable: Add parent argument to mutex_is_held
On Sat, Nov 15, 2014 at 11:16:26AM +0000, Thomas Graf wrote:
> On 11/15/14 at 11:25am, Herbert Xu wrote:
> > On Thu, Nov 13, 2014 at 06:43:43PM +0800, Herbert Xu wrote:
> > > On Thu, Nov 13, 2014 at 10:41:24AM +0000, Thomas Graf wrote:
> > > >
> > > > Never mind. You did fix it. I looked at the wrong patch.
> > >
> > > OK. Now comes the fun part of shoehorning the xfrm_policy bydst
> > > hash into rhashtable :)
> >
> > Thomas, it appears that rhashtable as it stands cannot handle
> > changing the random seed because of the way it constructs the
> > new hash table without degrading into a linked list. Is there
> > something I'm missing?
> >
> > FWIW my hashtable in net/bridge/br_multicast.c handles rehashing
> > correctly. Any objections to me converting rhashtable to use my
> > scheme instead?
>
> Can you elaborate a bit?
>
> The point of rhashtable is to not require two sets of linked list
> pointers as done by MDB or OVS flow tables to work around the
> increased cache footprint of that approach. The difference of the
> two algos is dicussed in this paper [0].
>
> The disadvantage of rhashtable is that, AFAIK, the hash function
> cannot change while resizing as it would break the mutual linked
> lists.
You can handle hash function changes with rhashtable without needing a
second set of linked-list pointers, actually. You'd have to add ~1
unlikely() conditional to the reader common case, and you'd make readers
*during* a hash algorithm change (which I'd hope happens as rarely as a
resize) somewhat less efficient, but you wouldn't use any more memory
than a resize currently does, and you wouldn't use the memory of extra
linked-list pointers in the common case.
Rather than the current approach of switching out the hash table pointer
and having readers only search the new table (which will have
valid-but-crosslinked buckets during a resize), instead keep two hash
table pointers (each with their own hash parameters) and a single
toggleable old/new indicator. Readers check both hash table pointers,
and if valid, check both tables, first old then new (order important to
make sure they don't miss a node). Then, the rehashing algorithm can
incrementally move nodes from the old buckets to the new buckets,
*without* disrupting concurrent readers.
Rough rehashing algorithm sketch:
- Set up the new empty table with the new set of hash parameters.
- synchronize_rcu(). Readers will now search both old and new tables.
- Peel nodes off the ends of the old hash table and add them to the new
hash table (similar to the unzip step in the resize algorithm).
Because readers search old-then-new, and writers make each node appear
in the new table before making it disappear from the old, you don't
need a synchronize_rcu() here, just an smp_wmb() after linking the
node into the new table and before unlinking the node from the old
table. Also, because you remove nodes from the *ends* of old-table
buckets, you don't have to worry about a reader's linked-list walk
getting dragged over to the new table and missing nodes from the old
one.
- Once all nodes have moved to the new table, mark the pointer to the
old table as NULL, synchronize_rcu(), and free the old table.
- Toggle the old/new flag.
Ordering principles in this algorithm:
- You want readers to see the changes made by the rehasher in order.
- If the reader reads location A then B, and the rehasher writes
location B then A, the rehasher just needs an smp_wmb() in between.
- If the reader reads location A then B, and the rehasher writes
location A then B, the rehasher needs a synchronize_rcu() in between.
- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists