linux-kernel - Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in willneed range

Message-ID: <a754add2-de29-4c91-b4f4-cbd7eb888cb6@redhat.com>
Date: Tue, 30 Jan 2024 11:43:21 +0100
From: David Hildenbrand <david@...hat.com>
To: Mike Snitzer <snitzer@...nel.org>, Dave Chinner <david@...morbit.com>
Cc: Ming Lei <ming.lei@...hat.com>, Matthew Wilcox <willy@...radead.org>,
 Andrew Morton <akpm@...ux-foundation.org>, linux-fsdevel@...r.kernel.org,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org,
 Don Dutile <ddutile@...hat.com>,
 Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>,
 Alexander Viro <viro@...iv.linux.org.uk>,
 Christian Brauner <brauner@...nel.org>, linux-block@...r.kernel.org
Subject: Re: [RFC PATCH] mm/readahead: readahead aggressively if read drops in
 willneed range

On 29.01.24 23:46, Mike Snitzer wrote:
> On Mon, Jan 29 2024 at  5:12P -0500,
> Dave Chinner <david@...morbit.com> wrote:
> 
>> On Mon, Jan 29, 2024 at 12:19:02PM -0500, Mike Snitzer wrote:
>>> While I'm sure this legacy application would love to not have to
>>> change its code at all, I think we can all agree that we need to just
>>> focus on how best to advise applications that have mixed workloads
>>> to accomplish efficient mmap+read of both sequential and random.
>>>
>>> To that end, I heard Dave clearly suggest 2 things:
>>>
>>> 1) update MADV/FADV_SEQUENTIAL to set file->f_ra.ra_pages to
>>>     bdi->io_pages, not bdi->ra_pages * 2
>>>
>>> 2) Have the application first issue MADV_SEQUENTIAL to convey that
>>>     the following MADV_WILLNEED is for sequential file load (so it is
>>>     desirable to use larger ra_pages)
>>>
>>> This overrides the default of bdi->ra_pages and _should_ provide the
>>> required per-file duality of control for readahead, correct?
>>
>> I just discovered MADV_POPULATE_READ - see my reply to Ming
>> up-thread about that. The application should use that instead of
>> MADV_WILLNEED because it gives cache population guarantees that
>> WILLNEED doesn't. Then we can look at optimising the performance of
>> MADV_POPULATE_READ (if needed) as there is constrained scope we can
>> optimise within in ways that we cannot do with WILLNEED.
> 
> Nice find! Given commit 4ca9b3859dac ("mm/madvise: introduce
> MADV_POPULATE_(READ|WRITE) to prefault page tables"), I've cc'd David
> Hildenbrand just so he's in the loop.

Thanks for CCing me.

MADV_POPULATE_READ is indeed different; it doesn't give hints (not 
"might be a good idea to read some pages" like MADV_WILLNEED documents), 
it forces swapin/read/.../.

In a sense, MADV_POPULATE_READ is similar to simply reading one byte
from the page behind each PTE to trigger page faults, but without
actually reading from the target pages.
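
A minimal sketch of that comparison, assuming a Linux system; it is not
taken from the original message. The helper names populate_read() and
touch_each_page() are hypothetical, and the fallback value 22 for
MADV_POPULATE_READ is the Linux UAPI constant for libc headers that do
not define it yet.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Older libc headers may lack the constant; 22 is the Linux UAPI value. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif

/* Hypothetical helper: fault in every page by reading one byte per page. */
static void touch_each_page(const void *addr, size_t len)
{
    const volatile unsigned char *p = addr;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    for (size_t off = 0; off < len; off += page)
        (void)p[off];   /* volatile read, so the access is not optimised out */
}

/*
 * Hypothetical helper: prefault [addr, addr + len) for reading.
 * addr is assumed to be page-aligned and to lie inside a valid mapping.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int populate_read(void *addr, size_t len)
{
    assert(len > 0);

    /* One call that tells the kernel the exact range of interest. */
    if (madvise(addr, len, MADV_POPULATE_READ) == 0)
        return 0;
    if (errno != EINVAL)
        return -1;      /* e.g. ENOMEM or EFAULT: report to the caller */

    /* Assumption: EINVAL here means the kernel lacks MADV_POPULATE_READ
     * (added in Linux 5.14), so fall back to the page-touching loop. */
    touch_each_page(addr, len);
    return 0;
}
```

With the madvise() path the kernel sees the whole range in a single
call; with the fallback loop it only sees one page fault at a time.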

MADV_POPULATE_READ has a conceptual benefit: we know exactly how much 
memory user space wants to have populated (which range). In contrast, 
page faults contain no such hints and we have to guess based on 
historical behavior. One could use that range information to *not* do 
any faultaround/readahead when we come via MADV_POPULATE_READ, and 
really only populate the range of interest.

Further, one can use that range information to allocate larger folios, 
without having to guess where placement of a large folio is reasonable, 
and which size we should use.

> 
> FYI, I proactively raised feedback and questions to the reporter of
> this issue:
>   
> CONTEXT: madvise(WILLNEED) doesn't convey the nature of the access,
> sequential vs random, just the range that may be accessed.

Indeed. The "problem" with MADV_SEQUENTIAL/MADV_RANDOM is that it will 
fragment/split VMAs. So applying it to smaller chunks (like one would do 
with MADV_WILLNEED) is likely not a good option.
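
The splitting can be observed from user space. The following is a rough
sketch, not part of the original message: it applies MADV_SEQUENTIAL to
a chunk in the middle of a mapping and compares the number of lines in
/proc/self/maps before and after. The helper names are hypothetical,
and an anonymous mapping stands in for the file mapping discussed in
this thread.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: count lines (one per VMA) in /proc/self/maps. */
static int count_vmas(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    int c, lines = 0;

    if (!f)
        return -1;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            lines++;
    fclose(f);
    return lines;
}

/*
 * Hypothetical demo: advise only the middle quarter of a 64-page mapping
 * and return how many VMAs that added, or -1 on error.
 */
static int vmas_added_by_chunk_advice(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = 64 * page;
    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int before, after;

    if (p == MAP_FAILED)
        return -1;

    (void)count_vmas();     /* warm up stdio so its allocations settle */
    before = count_vmas();
    if (before < 0 ||
        madvise(p + 16 * page, 16 * page, MADV_SEQUENTIAL) != 0) {
        munmap(p, len);
        return -1;
    }
    after = count_vmas();
    munmap(p, len);

    return after < 0 ? -1 : after - before;
}
```

On a typical Linux kernel one would expect a positive result, because
the advised chunk gets different VMA flags from its neighbours; the
exact number depends on how the kernel merges adjacent VMAs.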

-- 
Cheers,

David / dhildenb