Message-ID: <ZcKmP3zVdBwJVxGd@dread.disaster.area>
Date: Wed, 7 Feb 2024 08:35:59 +1100
From: Dave Chinner <david@...morbit.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: JonasZhou-oc <JonasZhou-oc@...oxin.com>, viro@...iv.linux.org.uk,
	brauner@...nel.org, jack@...e.cz, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, CobeChen@...oxin.com,
	LouisQi@...oxin.com, JonasZhou@...oxin.com
Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false
 sharing with i_mmap.

On Mon, Feb 05, 2024 at 11:28:29PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 05, 2024 at 02:22:18PM +1100, Dave Chinner wrote:
> > Intuition tells me that what the OP is seeing is the opposite case
> > to above: there is significant contention on the lock. In that case,
> > optimal "contention performance" comes from separating the lock and
> > the objects it protects into different cachelines.
> > 
> > The reason for this is that if the lock and objects it protects are
> > on the same cacheline, lock contention affects both the lock and the
> > objects being manipulated inside the critical section. i.e. attempts
> > to grab the lock pull the cacheline away from the CPU that holds the
> > lock, and then accesses to the object that are protected by the lock
> > then have to pull the cacheline back.
> > 
> > i.e. the cost of the extra memory fetch from an uncontended
> > cacheline is less than the cost of having to repeatedly fetch the
> > memory inside a critical section on a contended cacheline.
> > 
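As a concrete illustration of that separation - a minimal userspace
sketch with made-up names, assuming 64-byte cachelines; nothing like
the real address_space layout:

#include <pthread.h>
#include <stdalign.h>

/* lock and data share a cacheline: every attempt to grab the lock
 * pulls away the line the holder needs to modify 'head' */
struct obj_shared {
        pthread_mutex_t lock;
        void *head;
};

/* lock on its own cacheline: contention bounces the lock's line,
 * but the data being modified in the critical section stays put
 * in the holder's cache */
struct obj_separate {
        alignas(64) pthread_mutex_t lock;
        alignas(64) void *head;
};
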
> > I consider optimisation attempts like this the canary in the mine:
> > it won't be long before these or similar workloads report
> > catastrophic lock contention on the lock in question.  Moving items
> > in the structure is equivalent to re-arranging the deck chairs
> > whilst the ship sinks - we might keep our heads above water a
> > little longer, but the ship is still sinking and we're still going
> > to have to fix the leak sooner rather than later...
> 
> So the fundamental problem is our data structure.  It's actually two
> problems wrapped up in one bad idea.
> 
> i_mmap is a struct rb_root_cached:
> 
> struct rb_root_cached {
>         struct rb_root rb_root;
>         struct rb_node *rb_leftmost;
> };
> 
> struct rb_root {
>         struct rb_node *rb_node;
> };
> 
> so it's two pointers into the tree; one to the leftmost node, one
> to the topmost node.  That means we're modifying one or both of these
> pointers frequently.  I imagine it's the rb_root, which is being modified
> frequently because we're always ... appending to the right?  And so
> we're rotating frequently.  Worst case would be if we're appending to
> the left and modifying both pointers.
> 
> There are things we could do to address this by making rotations less
> frequent for the common case, which I suspect is mapping the entire file.
> And perhaps we should do these things as a stopgap.
> 
> The larger problem is that rbtrees suck.  They suck the same way that
> linked lists suck; the CPU can't prefetch very far ahead (or in this
> case down).  It can do a little more work in that it can prefetch both
> the left & right pointers, but it can't fetch the grandchildren until
> the child fetches have finished, so the dependent reads have us stumped.

Yes, pretty much all balanced binary search trees have this problem.
I pointed this out many years ago when someone tried to implement
range locks with an rbtree to track the locked ranges. On modern
CPUs, trees with array-based nodes (btrees, the Linux radix tree,
etc.) are far more efficient. But....
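
To make the array-node advantage concrete, here's a toy node - made-up
fanout and field names, not the layout of any real kernel tree. All
the keys land in two sequential cachelines, and every child pointer is
known up front, so a lower level can be prefetched while we are still
comparing keys in this one; an rbtree descent has to resolve a new
pointer before it can even start the next compare:

#include <stdint.h>

#define FANOUT 16

struct toy_node {
        uint64_t keys[FANOUT];          /* 128 bytes: 2 cachelines */
        struct toy_node *kids[FANOUT + 1];
};

static struct toy_node *descend(struct toy_node *n, uint64_t key)
{
        int i;

        /* all children are visible before we finish comparing, so
         * we can speculatively prefetch one while we scan */
        __builtin_prefetch(n->kids[FANOUT / 2]);

        /* sequential walk of the sorted key array: no pointer
         * chasing until we step down a level */
        for (i = 0; i < FANOUT && n->keys[i] <= key; i++)
                ;
        return n->kids[i];
}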

> The solution to this problem is to change the interval tree data structure
> from a Red-Black tree to a B-tree, or something similar where we use
> an array of pointers instead of a single pointer.

... B-trees are not immune to pointer chasing problems, either.
Most use binary searches within the nodes, so we have the problem
of unpredictable cacheline misses within the nodes as well as
being unable to do depth-based prefetch, just as with rbtrees.
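
A toy comparison of the two in-node search strategies, assuming a fat
node of 256 sorted keys spanning 32 cachelines (made-up sizes):

#include <stdint.h>

#define NKEYS 256       /* 256 * 8 bytes = 32 cachelines */

/* binary search: ~8 probes in a data-dependent order that the
 * hardware prefetcher cannot predict */
static int find_binary(const uint64_t keys[NKEYS], uint64_t key)
{
        int lo = 0, hi = NKEYS;

        while (lo < hi) {
                int mid = lo + (hi - lo) / 2;
                if (keys[mid] < key)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        return lo;              /* index of first key >= key */
}

/* linear scan: touches more keys in total, but streams through
 * the node sequentially, so the prefetcher stays ahead of it */
static int find_linear(const uint64_t keys[NKEYS], uint64_t key)
{
        int i;

        for (i = 0; i < NKEYS && keys[i] < key; i++)
                ;
        return i;               /* same answer as find_binary() */
}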

Perhaps we should be looking at something like this:

https://lore.kernel.org/linux-fsdevel/20240126220655.395093-2-kent.overstreet@linux.dev/

-Dave.
-- 
Dave Chinner
david@...morbit.com
