Date: Sun, 28 Jan 2024 18:12:29 -0500
From: Mike Snitzer <snitzer@...nel.org>
To: Matthew Wilcox <willy@...radead.org>
Cc: Ming Lei <ming.lei@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, Don Dutile <ddutile@...hat.com>,
	Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Christian Brauner <brauner@...nel.org>
Subject: Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops
 in willneed range

On Sun, Jan 28 2024 at  5:02P -0500,
Matthew Wilcox <willy@...radead.org> wrote:

> On Sun, Jan 28, 2024 at 10:25:22PM +0800, Ming Lei wrote:
> > Since commit 6d2be915e589 ("mm/readahead.c: fix readahead failure for
> > memoryless NUMA nodes and limit readahead max_pages"), ADV_WILLNEED
> > only tries to read ahead 512 pages, and the remaining part of the
> > advised range falls back on normal readahead.
> 
> Does the MAINTAINERS file mean nothing any more?

"Ming, please use scripts/get_maintainer.pl when submitting patches."

(I've cc'd accordingly with this email).

> > If bdi->ra_pages is set small, readahead will not perform efficiently
> > enough.  Increasing readahead may not be an option since the workload
> > may have mixed random and sequential I/O.
> 
> I think there needs to be a lot more explanation than this about what's
> going on before we jump to "And therefore this patch is the right
> answer".

The patch is "RFC". Ming didn't declare his RFC is "the right answer".
All ideas for how best to fix this issue are welcome.

I agree this patch's header could've worked harder to establish the
problem that it fixes.  But I'll now take a crack at backfilling the
regression report that motivated this patch's development:

Linux 3.14 was the last kernel where madvise(MADV_WILLNEED) allowed an
mmap'd file to be read more optimally if read_ahead_kb < max_sectors_kb.

This regressed with commit 6d2be915e589 (so Linux 3.15), such that
mounting XFS on a device with read_ahead_kb=64 and max_sectors_kb=1024
and running this reproducer against a 2G file takes ~5x longer
(depending on the system's capabilities).  mmap_load_test.java follows:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;

public class mmap_load_test {

	public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
		if (args.length == 0) {
			System.out.println("Please provide a file");
			System.exit(0);
		}
		FileChannel fc = new RandomAccessFile(new File(args[0]), "rw").getChannel();
		MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());

		System.out.println("Loading the file");

		long startTime = System.currentTimeMillis();
		mem.load();
		long endTime = System.currentTimeMillis();
		System.out.println("Done! Loading took " + (endTime-startTime) + " ms");
	}
}

reproduce with:

javac mmap_load_test.java
echo 64 > /sys/block/sda/queue/read_ahead_kb
echo 1024 > /sys/block/sda/queue/max_sectors_kb
mkfs.xfs /dev/sda
mount /dev/sda /mnt/test
dd if=/dev/zero of=/mnt/test/2G_file bs=1024k count=2000

echo 3 > /proc/sys/vm/drop_caches
java mmap_load_test /mnt/test/2G_file

Without a fix, like the patch Ming provided, iostat will show a rareq-sz
of 64 rather than ~1024.
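
FWIW, for anyone who'd rather not build the Java reproducer, a rough C
analogue of what mem.load() exercises (mmap the file read-only,
madvise(MADV_WILLNEED) over the whole mapping, then touch each page)
would look something like the following -- purely illustrative, not part
of the original report:

/*
 * willneed_load.c -- rough C analogue of mmap_load_test.java (sketch only):
 * mmap the file read-only, hint the whole range with MADV_WILLNEED, then
 * touch one byte per page so every page gets faulted in.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	struct stat st;
	if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

	char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);

	/* Ask for readahead across the whole advised range. */
	if (madvise(p, st.st_size, MADV_WILLNEED) < 0)
		perror("madvise");

	/* Fault in every page, roughly what MappedByteBuffer.load() does. */
	long pagesz = sysconf(_SC_PAGESIZE);
	volatile char sum = 0;
	for (off_t off = 0; off < st.st_size; off += pagesz)
		sum += p[off];
	(void)sum;

	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("Done! Loading took %ld ms\n",
	       (long)((t1.tv_sec - t0.tv_sec) * 1000 +
		      (t1.tv_nsec - t0.tv_nsec) / 1000000));

	munmap(p, st.st_size);
	close(fd);
	return 0;
}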

> > @@ -972,6 +974,7 @@ struct file_ra_state {
> >  	unsigned int ra_pages;
> >  	unsigned int mmap_miss;
> >  	loff_t prev_pos;
> > +	struct maple_tree *need_mt;
> 
> No.  Embed the struct maple tree.  Don't allocate it.

Constructive feedback, thanks.

> What made you think this was the right approach?

But then you closed with an attack, rather than informing Ming and/or
others why you feel so strongly, e.g.: best to keep the memory used for
file_ra_state contiguous.
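
FWIW, the embedded form being suggested would look roughly like this
(sketch only, limited to the fields visible in the quoted hunk; the init
helper name is made up):

struct file_ra_state {
	/* ... other existing fields ... */
	unsigned int ra_pages;
	unsigned int mmap_miss;
	loff_t prev_pos;
	struct maple_tree need_mt;	/* embedded, not a pointer */
};

/* Hypothetical init helper (name made up): with the tree embedded there
 * is no separate allocation to make or to fail, just initialize it in
 * place.
 */
static inline void file_ra_willneed_init(struct file_ra_state *ra)
{
	mt_init(&ra->need_mt);
}

That keeps the readahead state in one contiguous allocation and drops a
kmalloc()/kfree() pair (and a possible allocation-failure path) per ra
state.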
