Message-ID: <35f70ba6-5305-4268-b7ba-81545cacd83f@os.amperecomputing.com>
Date: Tue, 9 Jul 2024 10:56:55 -0700
From: Yang Shi <yang@...amperecomputing.com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: "Christoph Lameter (Ampere)" <cl@...two.org>, will@...nel.org,
anshuman.khandual@....com, david@...hat.com, scott@...amperecomputing.com,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW
instructions
On 7/4/24 3:03 AM, Catalin Marinas wrote:
> On Tue, Jul 02, 2024 at 03:21:41PM -0700, Yang Shi wrote:
>> On 7/1/24 12:43 PM, Catalin Marinas wrote:
>>> I don't follow OpenJDK development but I heard that updates are dragging
>>> quite a lot. I can't tell whether people have picked up the
>>> atomic_add(0) feature and whether, by the time a kernel patch would make
>>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
>> As Christoph said, there may be similar uses of atomics in other
>> applications, so I don't worry too much about the dead code problem
>> IMHO. OpenJDK is just the use case that we know of; there may be
>> unknown unknowns. And distros typically backport patches from the
>> mainline kernel to their kernels, so there will be combos like old
>> kernel + backported patch + old OpenJDK.
> That's a somewhat valid argument I heard internally as well. People tend
> to change or patch kernel versions more often than OpenJDK versions
> because of the risk of breaking their Java stack. But, arguably, one can
> backport the madvise() OpenJDK change since it seems to have other
> benefits on x86 as well.
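
(As a concrete reference for the pattern under discussion, here is a
minimal sketch of pre-faulting a heap with MADV_POPULATE_WRITE; the
mapping size is illustrative, and it assumes Linux 5.14+ with a libc
that exposes the flag:)

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 64UL << 20;        /* illustrative heap size */
        void *heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (heap == MAP_FAILED)
                return 1;
        /*
         * Populate the whole range writable up front: every page gets
         * a writable PTE in one syscall, so the first store (or atomic
         * RMW) later takes no page fault on any architecture.
         */
        if (madvise(heap, len, MADV_POPULATE_WRITE))
                perror("madvise");
        return 0;
}
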
>
>> AFAICT, users do expect behavior similar to x86's (one fault instead
>> of two). We actually noticed this problem through a customer report.
> It's not a correctness problem, only a performance one. Big part of that
> could be mitigated by some adjustment to how THP pages are allocated on
> a write fault (though we'd still have an unnecessary read fault and some
> TLBI). See Ryan's sub-thread.
Yes, the TLBI is still sub-optimal and it is quite noticeable.
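
(To make the fault sequence concrete: a hypothetical pre-touch loop in
the spirit of OpenJDK's atomic_add(0) pre-touch. On arm64 without this
patch, each page first takes a read fault that installs the zero page
read-only, then a write fault with the CoW and TLBI mentioned above; on
x86 the atomic RMW is treated as a write from the start, so it takes a
single write fault:)

#include <stdatomic.h>
#include <sys/mman.h>
#include <unistd.h>

static void pretouch(char *base, size_t len, size_t pagesz)
{
        for (size_t off = 0; off < len; off += pagesz) {
                _Atomic long *p = (_Atomic long *)(base + off);

                /*
                 * Adding 0 leaves memory unchanged but is still an
                 * atomic RMW (e.g. LDADD on arm64 with LSE), i.e. a
                 * write for permission-fault purposes.
                 */
                atomic_fetch_add_explicit(p, 0, memory_order_relaxed);
        }
}

int main(void)
{
        size_t len = 16UL << 20;        /* illustrative */
        void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (base == MAP_FAILED)
                return 1;
        pretouch(base, len, (size_t)sysconf(_SC_PAGESIZE));
        return 0;
}
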
>
>>> There's a point (c) as well on the overhead of reading the faulting
>>> instruction. I hope that's negligible but I haven't measured it.
>> I think I showed the benchmark data requested by Anshuman in the
>> earlier email discussion.
> Do you mean this:
>
> https://lore.kernel.org/r/328c4c86-96c8-4896-8b6d-94f2facdac9a@os.amperecomputing.com
>
> I haven't figured out what the +24% case is in there; it seems pretty
> large.
I think I ran the test for many more iterations and didn't see such an
outlier anymore.
>> To rule out the worst case, I also ran the test for 100 iterations
>> with 160 threads, then compared the worst case:
>>
>>   N    Min    Max  Median       Avg     Stddev
>> 100  34770  84979   65536   63537.7  10358.873
>> 100  38077  87652   65536  63119.02  8792.7399
>
> What you haven't benchmarked (I think) is the case where the
> instruction is in an exec-only mapping. The subsequent instruction
> read will fault and add to the overhead. Currently exec-only mappings
> are not widespread, but I've heard of people planning to move in this
> direction as a default build configuration.
I tested exec-only on QEMU TCG, but I don't have hardware that supports
EPAN. I don't think a performance benchmark on QEMU TCG makes sense
since it is quite slow; such a small overhead is unlikely to be
measurable there.
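
(For anyone who wants to reproduce the exec-only case on hardware, a
sketch of how an execute-only mapping is typically set up; code and
code_len are assumed to hold valid position-independent machine code,
and dropping PROT_READ only yields true execute-only on kernels/CPUs
with EPAN support:)

#include <string.h>
#include <sys/mman.h>

static void *map_exec_only(const void *code, size_t code_len)
{
        void *p = mmap(NULL, code_len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;
        memcpy(p, code, code_len);
        /* Keep the I-cache coherent with the freshly written code
         * (GCC/Clang builtin). */
        __builtin___clear_cache((char *)p, (char *)p + code_len);
        /* Drop read permission: PROT_EXEC alone means execute-only. */
        if (mprotect(p, code_len, PROT_EXEC)) {
                munmap(p, code_len);
                return NULL;
        }
        return p;
}

With the proposed patch applied, the instruction read in the fault
handler would itself fault on such a mapping, which is the extra
overhead being discussed here.
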
>
> It could be worked around with a new flavour of get_user() that uses the
> non-T LDR instruction and the user mapping is readable by the kernel
> (that's the case with EPAN, prior to PIE and I think we can change this
> for PIE configurations as well). But it adds to the complexity of this
> patch when the kernel already offers a MADV_POPULATE_WRITE solution.
>