linux-kernel - Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAH6w=aw_46Ker0w8HmSA41vUUDKGDGC3gxBFWAhd326+kEtrNg@mail.gmail.com>
Date: Sun, 28 Jan 2024 19:39:49 -0500
From: Mike Snitzer <snitzer@...nel.org>
To: Matthew Wilcox <willy@...radead.org>
Cc: Mike Snitzer <snitzer@...nel.org>, Ming Lei <ming.lei@...hat.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, linux-fsdevel@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Don Dutile <ddutile@...hat.com>, Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>, 
	Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner <brauner@...nel.org>
Subject: Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in
 willneed range

On Sun, Jan 28, 2024 at 7:22 PM Matthew Wilcox <willy@...radead.org> wrote:
>
> On Sun, Jan 28, 2024 at 06:12:29PM -0500, Mike Snitzer wrote:
> > On Sun, Jan 28 2024 at  5:02P -0500,
> > Matthew Wilcox <willy@...radead.org> wrote:
> >
> > > On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote:
> > > > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
> > > > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
> > > > only tries to readahead 512 pages, and the remained part in the advised
> > > > range fallback on normal readahead.
> > >
> > > Does the MAINTAINERS file mean nothing any more?
> >
> > "Ming, please use scripts/get_maintainer.pl when submitting patches."
>
> That's an appropriate response to a new contributor, sure.  Ming has
> been submitting patches since, what, 2008?  Surely they know how to
> submit patches by now.
>
> > I agree this patch's header could've worked harder to establish the
> > problem that it fixes.  But I'll now take a crack at backfilling the
> > regression report that motivated this patch be developed:
>
> Thank you.
>
> > Linux 3.14 was the last kernel to allow madvise (MADV_WILLNEED)
> > allowed mmap'ing a file more optimally if read_ahead_kb < max_sectors_kb.
> >
> > Ths regressed with commit 6d2be915e589 (so Linux 3.15) such that
> > mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024
> > and running this reproducer against a 2G file will take ~5x longer
> > (depending on the system's capabilities), mmap_load_test.java follows:
> >
> > import java.nio.ByteBuffer;
> > import java.nio.ByteOrder;
> > import java.io.RandomAccessFile;
> > import java.nio.MappedByteBuffer;
> > import java.nio.channels.FileChannel;
> > import java.io.File;
> > import java.io.FileNotFoundException;
> > import java.io.IOException;
> >
> > public class mmap_load_test {
> >
> >         public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
> >               if (args.length == 0) {
> >                       System.out.println("Please provide a file");
> >                       System.exit(0);
> >               }
> >               FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
> >               MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
> >
> >               System.out.println("Loading the file");
> >
> >               long startTime = System.currentTimeMillis();
> >               mem.load();
> >               long endTime = System.currentTimeMillis();
> >               System.out.println("Done! Loading took " + (endTime-startTime) + " ms");
> >
> >       }
> > }
>
> It's good to have the original reproducer.  The unfortunate part is
> that being at such a high level, it doesn't really show what syscalls
> the library makes on behalf of the application.  I'll take your word
> for it that it calls madvise(MADV_WILLNEED).  An strace might not go
> amiss.
>
> > reproduce with:
> >
> > javac mmap_load_test.java
> > echo 64 > /sys/block/sda/queue/read_ahead_kb
> > echo 1024 > /sys/block/sda/queue/max_sectors_kb
> > mkfs.xfs /dev/sda
> > mount /dev/sda /mnt/test
> > dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000
> >
> > echo 3 > /proc/sys/vm/drop_caches
>
> (I prefer to unmount/mount /mnt/test; it drops the cache for
> /mnt/test/2G_file without affecting the rest of the system)
>
> > java mmap_load_test /mnt/test/2G_file
> >
> > Without a fix, like the patch Ming provided, iostat will show rareq-sz
> > is 64 rather than ~1024.
>
> Understood.  But ... the application is asking for as much readahead as
> possible, and the sysadmin has said "Don't readahead more than 64kB at
> a time".  So why will we not get a bug report in 1-15 years time saying
> "I put a limit on readahead and the kernel is ignoring it"?  I think
> typically we allow the sysadmin to override application requests,
> don't we?

The application isn't knowingly asking for readahead.  It is asking to
mmap the file (and reporter wants it done as quickly as possible..
like occurred before).

This fix is comparable to Jens' commit 9491ae4aade6 ("mm: don't cap
request size based on read-ahead setting") -- same logic, just applied
to callchain that ends up using madvise(MADV_WILLNEED).

> > > > @@ -972,6 +974,7 @@ struct file_ra_state {
> > > >   unsigned int ra_pages;
> > > >   unsigned int mmap_miss;
> > > >   loff_t prev_pos;
> > > > + struct maple_tree *need_mt;
> > >
> > > No.  Embed the struct maple tree.  Don't allocate it.
> >
> > Constructive feedback, thanks.
> >
> > > What made you think this was the right approach?
> >
> > But then you closed with an attack, rather than inform Ming and/or
> > others why you feel so strongly, e.g.: Best to keep memory used for
> > file_ra_state contiguous.
>
> That's not an attack, it's a genuine question.  Is there somewhere else
> doing it wrong that Ming copied from?  Does the documentation need to
> be clearer?  I can't fix what I don't know.

OK