linux-kernel - Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW instructions

Message-ID: <9a9e0a83-59a8-46ac-ae7a-8f2a65b48e1e@arm.com>
Date: Tue, 2 Jul 2024 14:26:53 +0100
From: Ryan Roberts <ryan.roberts@....com>
To: David Hildenbrand <david@...hat.com>,
 Catalin Marinas <catalin.marinas@....com>,
 Yang Shi <yang@...amperecomputing.com>
Cc: "Christoph Lameter (Ampere)" <cl@...two.org>, will@...nel.org,
 anshuman.khandual@....com, scott@...amperecomputing.com,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
 Jinjiang Tu <tujinjiang@...wei.com>
Subject: Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW
 instructions

On 02/07/2024 13:58, David Hildenbrand wrote:
> On 02.07.24 14:36, Ryan Roberts wrote:
>> On 02/07/2024 12:22, David Hildenbrand wrote:
>>> On 02.07.24 12:26, Ryan Roberts wrote:
>>>> On 01/07/2024 20:43, Catalin Marinas wrote:
>>>>> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote:
>>>>>> On 6/28/24 10:24 AM, Catalin Marinas wrote:
>>>>>>> This patch does feel a bit like working around a non-optimal user choice
>>>>>>> in kernel space. Who knows, madvise() may even be quicker if you do a
>>>>>>> single call for a larger VA vs touching each page.
>>>>>>
>>>>>> IMHO, I don't think so. I view this patch as solving or working around an ISA
>>>>>> inefficiency in the kernel. Two faults are not necessary if we know we are
>>>>>> definitely going to write the memory very soon, right?
>>>>>
>>>>> I agree the Arm architecture behaviour is not ideal here and any
>>>>> timelines for fixing it in hardware, if they do happen, are far into the
>>>>> future. Purely from a kernel perspective, what I want though is to make
>>>>> sure that longer term (a) we don't create additional maintenance burden
>>>>> and (b) we don't keep dead code around.
>>>>>
>>>>> Point (a) could be mitigated if the architecture is changed so that any
>>>>> new atomic instructions added to this range would also come with
>>>>> additional syndrome information so that we don't have to update the
>>>>> decoding patterns.
>>>>>
>>>>> Point (b), however, depends on the OpenJDK and the kernel versions in
>>>>> distros. Nick Gasson kindly provided some information on the OpenJDK
>>>>> changes. The atomic_add(0) change happened in early 2022, about 5-6
>>>>> months after MADV_POPULATE_WRITE support was added to the kernel. What's
>>>>> interesting is Ampere already contributed MADV_POPULATE_WRITE support to
>>>>> OpenJDK a few months ago:
>>>>>
>>>>> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a
>>>>>
>>>>> The OpenJDK commit lacks explanation but what I gathered from the diff
>>>>> is that this option is the preferred one in the presence of THP (which
>>>>> most/all distros enable by default). If we merge your proposed kernel
>>>>> patch, it will take time before it makes its way into distros. I'm
>>>>> hoping that by that time, distros would have picked a new OpenJDK
>>>>> version already that doesn't need the atomic_add(0) pattern. If that's
>>>>> the case, we end up with some dead code in the kernel that's almost
>>>>> never exercised.
>>>>>
>>>>> I don't follow OpenJDK development but I heard that updates are dragging
>>>>> quite a lot. I can't tell whether people have picked up the
>>>>> atomic_add(0) feature and whether, by the time a kernel patch would make
>>>>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
>>>>>
>>>>> There's a point (c) as well on the overhead of reading the faulting
>>>>> instruction. I hope that's negligible but I haven't measured it.
>>>>>
>>>>
>>>> Just to add to this, I note the existing kernel behaviour is that if a write
>>>> fault happens in a region that has a (RO) huge zero page mapped at PMD level,
>>>> then the PMD is shattered, the PTE of the fault address is populated with a
>>>> writable page and the remaining PTEs are populated with order-0 zero pages
>>>> (read-only).
>>>
>>> That also recently popped up in [1]. CCing Jinjiang. Ever since I
>>> replied there, I also thought some more about that handling in regard to the
>>> huge zeropage.
>>>
>>>>
>>>> This seems like odd behaviour to me. Surely it would be less effort and more
>>>> aligned with the app's expectations to notice the huge zero page in the PMD,
>>>> remove it, and install a THP, as would have been done if pmd_none() was true? I
>>>> don't think there is a memory bloat argument here because, IIUC, with the
>>>> current behaviour, khugepaged would eventually upgrade it to a THP anyway?
>>>
>>> One detail: depending on the setting of khugepaged_max_ptes_none. zeropages
>>> are treated like pte_none. But in the common case, that setting is left alone.
>>
>> Ahh, got it. So in the common case, khugepaged won't actually collapse
>> unless/until a bunch more write faults occur in the 2M region, and in that case
>> there is a risk that changing this behaviour could lead to a memory bloat
>> regression.
>>
>>>
>>>>
>>>> Changing to this new behaviour would only be a partial solution for your use
>>>> case, since you would still have 2 faults. But it would remove the cost of the
>>>> shattering and ensure you have a THP immediately after the write fault. But I
>>>> can't think of a reason why this wouldn't be a generally useful change
>>>> regardless? Any thoughts?
>>>
>>> The "let's read before we write" as used by QEMU migration code is the desire
>>> to not waste memory by populating the zeropages. Deferring consuming memory
>>> until really required.
>>>
>>>      /*
>>>       * We read one byte of each page; this will preallocate page tables if
>>>       * required and populate the shared zeropage on MAP_PRIVATE anonymous
>>>       * memory where no page was populated yet. This might require adaption
>>>       * when supporting other mappings, like shmem.
>>>       */
>>
>> So QEMU is concerned with preallocating page tables? I would have thought you
>> could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE
>> call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip
>> through the allocator, fewer pud/pmd lock/unlocks, etc).
> 
> I think we are only concerned about the "shared zeropage" part. Everything else
> is just unnecessary detail that adds confusion here :) One requires the other.

Sorry I don't quite follow your comment. As I understand it, the zeropage
concept is intended as a memory-saving mechanism for applications that read
memory but never write it. I don't think that really applies in your migration
case, because you are definitely going to write all the memory eventually, I
think? So I guess you are not interested in the "memory-saving" property, but in
the side-effect, which is the pre-allocation of pagetables? (If you just wanted
the memory-saving property, why not just skip the "read one byte of each page"
op?) It's not important though, so let's not go down the rabbit hole.

> 
> Note that this is from migration code where we're supposed to write a single
> page we received from the migration source right now (not more). And we want to
> avoid allocating memory if it can be avoided (usually for overcommit).
> 
> 
> 
>>
>> TBH I always assumed in the past that the huge zero page is only useful because
>> it's a placeholder for a real THP that would be populated on write. But that's
>> obviously not the case at the moment. So other than a hack to preallocate the
>> pgtables with only 1 fault per 2M, what other benefits does it have?
> 
> I don't quite understand that question. [2] has some details why the huge
> zeropage was added -- because we would have never otherwise received huge
> zeropages with THP enabled but always anon THP directly on read.
> 
>>
>>>
>>>
>>> Without THP this works as expected. With THP this currently also works as
>>> expected, but of course with the price [1] of not getting anon THP
>>> immediately, which usually we don't care about. As you note, khugepaged might
>>> fix this up later.
>>>
>>> If we disable the huge zeropage, we would get anon THPs when reading instead of
>>> small zeropages.
>>
>> I wasn't aware of that behaviour either. Although that sounds like another
>> reason why allocating a THP over the huge zero page on write fault should be the
>> "more consistent" behaviour.
> 
> Reading [2] I think the huge zeropage was added to avoid the allocation of THP
> on read. Maybe for really only large readable regions, not sure why exactly.

I might raise this on the THP call tomorrow and get Kyril's view, if he joins.

> 
>>
>>>
>>> As reply to [1], I suggested using preallocation (using MADV_POPULATE_WRITE)
>>> when we really care about that performance difference, which would also
>>> avoid the huge zeropage completely, but it's also not quite optimal in some
>>> cases.
>>
>> I could imagine some cases could benefit from a MADV_POPULATE_WRITE_ON_FAULT,
>> which would just mark the VMA so that any read fault is upgraded to write.
>>
>>>
>>>
>>> I don't really know what to do here: changing the handling for the huge zeropage
>>> only unconditionally does not sound too wrong, but the change in behavior
>>> might (or might not) be desired for some use cases.
>>>
>>> Reading from unpopulated memory can be a clear sign that really the shared
>>> zeropage is desired (as for QEMU), and concurrent memory
>>> preallocation/population should ideally use MADV_POPULATE_WRITE. Maybe there
>>> are some details buried in [2] regarding the common use cases for the huge
>>> zeropage back then.
>>
>> The current huge zero page behavior on write fault sounds wonky to me. But I
>> agree there are better and more complete solutions to the identified use cases.
>> So unless something pops up where the change is a clear benefit, I guess better
>> to be safe and leave as is.
> 
> We've had that behavior for quite a while ... so it's rather surprising to see
> multiple people reporting this right now.
> 
> I guess most use cases don't read from uninitialized memory barely write to it
> and care about getting THPs immediately.
> 
> For preallocation, MADV_POPULATE_WRITE is better. For QEMU migration? Not
> sure what's really better. Maybe replacing the huge zeropage by a THP would be
> faster in some cases, but result in more memory consumption (and more page
> zeroing?) during migration in other cases.
>
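
For reference, the two userspace patterns contrasted in this thread can be
sketched roughly as below. This is only an illustration, not code from QEMU or
OpenJDK: the helper names touch_read() and populate_write() are made up, and
the fallback #define assumes the uapi value of MADV_POPULATE_WRITE (23) for
toolchains whose headers predate it. The per-page fallback is a plain
read/write-back and is not atomic, so unlike the atomic_add(0) pattern it is
only safe while no other thread writes the range.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23 /* assumed uapi value; older libc headers lack it */
#endif

/*
 * "Read before write": read one byte of each page. On untouched MAP_PRIVATE
 * anonymous memory the read fault can be satisfied by the shared (huge)
 * zeropage, so no new memory is consumed. Returns the OR of the bytes read.
 */
int touch_read(const void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    const volatile unsigned char *p = addr;
    unsigned char sink = 0;

    for (size_t off = 0; off < len; off += page)
        sink |= p[off];
    return sink;
}

/*
 * Preallocation: ask the kernel to populate writable pages for the whole
 * range in one call. If the kernel rejects the advice with EINVAL (for
 * example because it predates MADV_POPULATE_WRITE), fall back to touching
 * each page with a content-preserving, non-atomic write.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int populate_write(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
        return 0;
    if (errno != EINVAL)
        return -1;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    volatile unsigned char *p = addr;

    for (size_t off = 0; off < len; off += page)
        p[off] = p[off]; /* volatile load then store: forces a write fault */
    return 0;
}
```

With THP enabled, touch_read() is the case that may leave the huge zeropage
mapped, and populate_write() is the case that avoids it entirely.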