Message-ID: <Zbcv8tKISFYRaMye@dread.disaster.area>
Date: Mon, 29 Jan 2024 15:56:18 +1100
From: Dave Chinner <david@...morbit.com>
To: Mike Snitzer <snitzer@...hat.com>
Cc: Mike Snitzer <snitzer@...nel.org>, Matthew Wilcox <willy@...radead.org>,
Ming Lei <ming.lei@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Don Dutile <ddutile@...hat.com>,
Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>
Subject: Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops
in willneed range
On Sun, Jan 28, 2024 at 09:12:12PM -0500, Mike Snitzer wrote:
> On Sun, Jan 28, 2024 at 8:48 PM Dave Chinner <david@...morbit.com> wrote:
> >
> > On Sun, Jan 28, 2024 at 07:39:49PM -0500, Mike Snitzer wrote:
> > > On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@...radead.org> wrote:
> > > >
> > > > On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > > > > On Sun, Jan 28 2024 at 5:02P -0500,
> > > > > Matthew Wilcox <willy@...radead.org> wrote:
> > > > Understood. But ... the application is asking for as much readahead as
> > > > possible, and the sysadmin has said "Don't readahead more than 64kB at
> > > > a time". So why will we not get a bug report in 1-15 years time saying
> > > > "I put a limit on readahead and the kernel is ignoring it"? I think
> > > > typically we allow the sysadmin to override application requests,
> > > > don't we?
> > >
> > > The application isn't knowingly asking for readahead. It is asking to
> > > mmap the file (and the reporter wants it done as quickly as possible,
> > > as it did before).
> >
> > .. which we do within the constraints of the given configuration.
> >
> > > This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
> > > request size based on read-ahead setting") -- same logic, just applied
> > > to the call chain that ends up using madvise(MADV_WILLNEED).
> >
> > Not really. There is a difference between performing a synchronous
> > read IO here that we must complete, and optimistic asynchronous
> > read-ahead which we can fail or toss away without the user ever
> > seeing the data the IO returned.
> >
> > We want required IO to be done in as few, and as large, IOs as
> > possible, and not be limited by constraints placed on background
> > optimistic IOs.
> >
> > madvise(WILLNEED) is optimistic IO - there is no requirement that it
> > complete the data reads successfully. If the data is actually
> > required, we'll guarantee completion when the user accesses it, not
> > when madvise() is called. IOWs, madvise is async readahead, and so
> > really should be constrained by readahead bounds and not user IO
> > bounds.
> >
> > We could change this behaviour for madvise of large ranges that we
> > force into the page cache by ignoring device readahead bounds, but
> > I'm not sure we want to do this in general.
> >
> > Perhaps fadvise/madvise(willneed) can fiddle the file f_ra.ra_pages
> > value in this situation to override the device limit for large
> > ranges (for some definition of large - say 10x bdi->ra_pages) and
> > restore it once the readahead operation is done. This would make it
> > behave less like readahead and more like a user read from an IO
> > perspective...
>
> I'm not going to pretend like I'm an expert in this code or all the
> distinctions that are in play. BUT, if you look at the high-level
> java reproducer: it is requesting mmap of a finite size, starting from
> the beginning of the file:
> FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
> MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
Mapping an entire file does not mean "we are going to access the
entire file". Lots of code will do this, especially code that does
random accesses within the file.
> Yet you're talking about the application like it is stabbingly
> triggering unbounded async reads that can get dropped, etc, etc. I
I don't know what the application actually does. All I see is a
microbenchmark that mmap()s a file and walks it sequentially, on a
system where readahead has been tuned to de-prioritise sequential IO
performance.
> just want to make sure the subtlety of (ab)using madvise(WILLNEED)
> like this app does isn't incorrectly attributed to something it isn't.
> The app really is effectively requesting a user read of a particular
> extent in terms of mmap, right?
madvise() is an -advisory- interface that does not guarantee any
specific behaviour. The man page says:

    MADV_WILLNEED
           Expect access in the near future.  (Hence, it might be a good
           idea to read some pages ahead.)

It says nothing about guaranteeing that all the data is brought into
memory, or that if it does get brought into memory it will remain
there until the application accesses it. It doesn't even imply that
the IO will be done immediately. Any application relying on madvise()
to fully populate the page cache range before returning is expecting
behaviour that is neither documented nor guaranteed.
Similarly, the fadvise64() man page does not say that WILLNEED will
bring the entire file into cache:

    POSIX_FADV_WILLNEED
           The specified data will be accessed in the near future.
           POSIX_FADV_WILLNEED initiates a nonblocking read of the
           specified region into the page cache.  The amount of data
           read may be decreased by the kernel depending on virtual
           memory load.  (A few megabytes will usually be fully
           satisfied, and more is rarely useful.)
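
Both interfaces are hints, nothing more. A trivial way to see that
from userspace (an illustrative sketch only - error handling is
minimal, and how much of the range ends up resident is entirely up to
the kernel) is to issue the hints and then ask mincore() what actually
made it into the page cache:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct stat st;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Advisory only: a zero return says nothing about how much of
	 * the range is, or ever will be, resident in the page cache. */
	if (madvise(p, st.st_size, MADV_WILLNEED))
		perror("madvise");

	/* The fd-based hint; it returns an error number directly
	 * rather than setting errno. */
	if (posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED))
		fprintf(stderr, "posix_fadvise failed\n");

	/* See how much of the file is actually resident right now. */
	long pgsz = sysconf(_SC_PAGESIZE);
	size_t npages = (st.st_size + pgsz - 1) / pgsz;
	unsigned char *vec = malloc(npages);
	size_t resident = 0;

	if (vec && !mincore(p, st.st_size, vec)) {
		for (size_t i = 0; i < npages; i++)
			resident += vec[i] & 1;
	}
	printf("%zu of %zu pages resident after WILLNEED\n",
	       resident, npages);
	return 0;
}

Run that against a large file on a cold cache and the resident count
can lag well behind the full range - the reads are asynchronous and
may be trimmed under memory pressure, exactly as the man pages allow.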
> BTW, your suggestion to have this app fiddle with ra_pages and then
No, I did not suggest that the app fiddle with anything. I was
talking about the in-kernel FADV_WILLNEED implementation changing
file->f_ra.ra_pages, similar to how FADV_RANDOM and FADV_SEQUENTIAL
already change readahead IO behaviour. That then allows subsequent
readahead on that vma->file to use a larger value than the default
value pulled in off the device.
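
Something like this in the fadvise WILLNEED path is what I had in
mind - completely untested, just to show the shape of it. The 10x
threshold is an arbitrary example, the variable names assume the
context of the existing fadvise code, and when/how to restore the
old window once the readahead completes is hand-waved away:

	case POSIX_FADV_WILLNEED:
		/*
		 * Sketch only: if the range is much larger than the
		 * device readahead window, let this file's readahead
		 * state use a bigger window so the WILLNEED IO isn't
		 * chopped into bdi->ra_pages sized chunks.
		 */
		if (endbyte - offset >=
		    ((loff_t)bdi->ra_pages << PAGE_SHIFT) * 10)
			file->f_ra.ra_pages = bdi->ra_pages * 10;
		/* ... existing forced readahead of the range goes here ... */
		break;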
Largely, I think the problem is that the readahead limit on this
device has been set too low for optimal sequential performance.
Complaining that readahead is slow when it has been explicitly tuned
to be slow doesn't really seem like a problem we can fix with code.
-Dave.
--
Dave Chinner
david@...morbit.com