linux-kernel - Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZcFvHfYGH7EwGBRK@casper.infradead.org>
Date: Mon, 5 Feb 2024 23:28:29 +0000
From: Matthew Wilcox <willy@...radead.org>
To: Dave Chinner <david@...morbit.com>
Cc: JonasZhou-oc <JonasZhou-oc@...oxin.com>, viro@...iv.linux.org.uk,
	brauner@...nel.org, jack@...e.cz, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, CobeChen@...oxin.com,
	LouisQi@...oxin.com, JonasZhou@...oxin.com
Subject: Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false
 sharing with i_mmap.

On Mon, Feb 05, 2024 at 02:22:18PM +1100, Dave Chinner wrote:
> Intuition tells me that what the OP is seeing is the opposite case
> to above: there is significant contention on the lock. In that case,
> optimal "contention performance" comes from separating the lock and
> the objects it protects into different cachelines.
> 
> The reason for this is that if the lock and objects it protects are
> on the same cacheline, lock contention affects both the lock and the
> objects being manipulated inside the critical section. i.e. attempts
> to grab the lock pull the cacheline away from the CPU that holds the
> lock, and then accesses to the object that are protected by the lock
> then have to pull the cacheline back.
> 
> i.e. the cost of the extra memory fetch from an uncontended
> cacheline is less than the cost of having to repeatedly fetch the
> memory inside a critical section on a contended cacheline.
> 
> I consider optimisation attempts like this the canary in the mine:
> it won't be long before these or similar workloads report
> catastrophic lock contention on the lock in question.  Moving items
> in the structure is equivalent to re-arranging the deck chairs
> whilst the ship sinks - we might keep our heads above water a
> little longer, but the ship is still sinking and we're still going
> to have to fix the leak sooner rather than later...

So the fundamental problem is our data structure.  It's actually two
problems wrapped up in one bad idea.

i_mmap is a struct rb_root_cached:

struct rb_root_cached {
        struct rb_root rb_root;
        struct rb_node *rb_leftmost;
};

struct rb_root {
        struct rb_node *rb_node;
};

so it's two pointers into the tree; one to the leftmost node, one
to the topmost node.  That means we're modifying one or both of these
pointers frequently.  I imagine it's the rb_root, which is being modified
frequently because we're always ... appending to the right?  And so
we're rotating frequently.  Worst case would be if we're appending to
the left and modifying both pointers.

There are things we could do to address this by making rotations less
frequent for the common case, which I suspect is mapping the entire file.
And perhaps we should do these things as a stopgap.

The larger problem is that rbtrees suck.  They suck the same way that
linked lists suck; the CPU can't prefetch very far ahead (or in this
case down).  It can do a little more work in that it can prefetch both
the left & right pointers, but it can't fetch the grandchildren until the
children fetches have finished, so the dependent reads have us stumped.

The solution to this problem is to change the interval tree data structure
from an Red-Black tree to a B-tree, or something similar where we use
an array of pointers instead of a single pointer.  Not the Maple Tree;
that is designed for non-overlapping ranges.  One could take inspiration
from the Maple Tree and design a very similar data structure that can
store and iterate over overlapping ranges.  I can't understand the macros
this late at night, so I don't fully understand what the interval tree
is doing, but I can probably make a more fully informed observation by
the end of the week.