Message-ID: <20200914033131.GK6583@casper.infradead.org>
Date: Mon, 14 Sep 2020 04:31:31 +0100
From: Matthew Wilcox <willy@...radead.org>
To: Dave Chinner <david@...morbit.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Amir Goldstein <amir73il@...il.com>,
Hugh Dickins <hughd@...gle.com>,
Michael Larabel <Michael@...haellarabel.com>,
Ted Ts'o <tytso@...gle.com>,
Andreas Dilger <adilger.kernel@...ger.ca>,
Ext4 Developers List <linux-ext4@...r.kernel.org>,
Jan Kara <jack@...e.cz>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: Kernel Benchmarking
On Mon, Sep 14, 2020 at 09:45:03AM +1000, Dave Chinner wrote:
> I have my doubts that complex page cache manipulation operations
> like ->migrate_page that rely exclusively on page and internal mm
> serialisation are really safe against ->fallocate based invalidation
> races. I think they probably also need to be wrapped in the
> MMAPLOCK, but I don't understand all the locking and constraints
> that ->migrate_page has and there's been no evidence yet that it's a
> problem so I've kinda left that alone. I suspect that "no evidence"
> thing comes from "filesystem people are largely unable to induce
> page migrations in regression testing" so it has pretty much zero
> test coverage....
Maybe we can get someone who knows the page migration code to give
us a hack to induce pretty much constant migration?
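One crude userspace approximation -- not the kernel-side hack being
asked for, and assuming a machine with at least two NUMA nodes plus
libnuma's move_pages(2) -- is to mmap a test file and bounce its pages
between nodes in a tight loop, something like:

/* Illustrative only: bounce a file's pages between NUMA nodes 0 and 1
 * forever, to generate near-constant page migration traffic.
 * Assumes >= 2 NUMA nodes and a test file of at least NPAGES pages.
 * Build: cc -O2 -o bounce bounce.c -lnuma
 */
#include <fcntl.h>
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES	256

int main(int argc, char **argv)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	void *pages[NPAGES];
	int nodes[NPAGES], status[NPAGES];
	int target = 0;
	volatile char sum = 0;
	char *map;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	map = mmap(NULL, NPAGES * pagesize, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (int i = 0; i < NPAGES; i++) {
		sum += map[i * pagesize];	/* fault the page cache in */
		pages[i] = map + i * pagesize;
	}

	for (;;) {
		for (int i = 0; i < NPAGES; i++)
			nodes[i] = target;
		if (move_pages(0, NPAGES, pages, nodes, status,
			       MPOL_MF_MOVE) < 0)
			perror("move_pages");
		target ^= 1;			/* node 0 <-> node 1 */
	}
}

Whether that looks anything like compaction-driven migration is
debatable, but it would at least exercise ->migrate_page on file-backed
pages continuously.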
> Stuff like THP splitting hasn't been an issue for us because the
> file-backed page cache does not support THP (yet!). That's
> something I'll be looking closely at in Willy's upcoming patchset.
One of the things I did was fail every tenth I/O to a THP. That causes
us to split the THP when we come to try to make use of it. Far more
effective than using dm-flakey because I know that failing a readahead
I/O should not cause any test to fail, so any newly-failing test is
caused by the THP code.
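The injection logic is nothing fancy -- roughly the shape below (a
userspace stand-in with made-up names, just to show the every-tenth
counter and why failing only readahead I/O is safe):

/*
 * Minimal sketch (userspace, names hypothetical) of the injection
 * pattern described above: deterministically fail one in every ten
 * "THP reads" so the caller exercises its split/fallback path.
 * Because only readahead-style reads are failed, a correct caller
 * never sees a user-visible error -- any new test failure points at
 * the fallback (THP splitting) code.
 */
#include <stdbool.h>
#include <stdio.h>

static unsigned long thp_read_count;

/* Return true if this THP readahead I/O should be forced to fail. */
static bool fail_this_thp_read(void)
{
	return (++thp_read_count % 10) == 0;
}

/* Stand-in for reading one huge page; split and retry on failure. */
static void read_thp(unsigned long index)
{
	if (fail_this_thp_read()) {
		printf("index %lu: injected failure -> split THP, retry as small pages\n",
		       index);
		return;
	}
	printf("index %lu: THP read ok\n", index);
}

int main(void)
{
	for (unsigned long i = 0; i < 30; i++)
		read_thp(i);
	return 0;
}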
I've probably spent more time looking at the page splitting and
truncate/hole-punch/invalidate/invalidate2 paths than anything else.
It's definitely an area where more eyes are welcome, and just having
more people understand it would be good. split_huge_page_to_list and
its various helper functions are about 400 lines of code and, IMO,
a little too complex.
> The other issue here is that serialisation via individual cache
> object locking just doesn't scale in any way to the sizes of
> operations that fallocate() can run. fallocate() has 64 bit
> operands, so a user could ask us to lock down a full 8EB range of
> file. Locking that page by page, even using 1GB huge page Xarray
> slot entries, is just not practical... :/
FWIW, there's not currently a "lock down this range" mechanism in
the page cache. If there were, it wouldn't be restricted to 4k/2M/1G
sizes -- with the XArray today, it's fairly straightforward to
lock ranges which are m * 64^n entries in size (for 1 <= m <= 63, n >= 0).
In the next year or two, I hope to be able to offer a "lock arbitrary
page range" feature which is as cheap to lock 8EiB as it is 128KiB.
It would still be page-ranges, not byte-ranges, so I don't know how well
that fits your needs. It doesn't solve the DIO vs page cache problems
at all, since we want DIOs to different byte ranges that happen to fall
within the same page not to conflict with each other.
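For anyone wondering where m * 64^n comes from: each XArray node has
64 slots, so a lock covering m slots at level n spans m * 64^n indices.
The toy helper below -- not an XArray API, just the rounding
arithmetic -- shows how an arbitrary entry count maps onto such a range:

/* Round an entry count up to the smallest m * 64^n covering range.
 * Purely illustrative of the arithmetic; not part of the XArray API.
 */
#include <stdio.h>

static unsigned long long round_to_xa_range(unsigned long long entries,
					    unsigned int *m, unsigned int *n)
{
	unsigned long long unit = 1;	/* 64^n */

	*n = 0;
	/* Find the level where the range fits in at most 63 slots. */
	while (entries > 63 * unit) {
		unit *= 64;
		(*n)++;
	}
	*m = (entries + unit - 1) / unit;	/* round up to whole slots */
	return *m * unit;
}

int main(void)
{
	unsigned long long sizes[] = { 1, 100, 4096, 1000000 };
	unsigned int m, n;

	for (int i = 0; i < 4; i++) {
		unsigned long long r = round_to_xa_range(sizes[i], &m, &n);
		printf("%llu entries -> %u * 64^%u = %llu entries locked\n",
		       sizes[i], m, n, r);
	}
	return 0;
}

So, for example, a request for 1000000 entries rounds up to
4 * 64^3 = 1048576 entries locked.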