[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e8497b6b-c85a-4b11-9c0f-01528c82cb92@os.amperecomputing.com>
Date: Tue, 13 Aug 2024 10:09:04 -0700
From: Yang Shi <yang@...amperecomputing.com>
To: Catalin Marinas <catalin.marinas@....com>
Cc: "Christoph Lameter (Ampere)" <cl@...two.org>, will@...nel.org,
anshuman.khandual@....com, david@...hat.com, scott@...amperecomputing.com,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [v5 PATCH] arm64: mm: force write fault for atomic RMW
instructions
On 7/11/24 11:17 AM, Yang Shi wrote:
>
>
> On 7/11/24 10:43 AM, Catalin Marinas wrote:
>> On Wed, Jul 10, 2024 at 11:43:18AM -0700, Yang Shi wrote:
>>> On 7/10/24 2:22 AM, Catalin Marinas wrote:
>>>>> diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
>>>>> index 642bdf908b22..d30265d424e4 100644
>>>>> --- a/arch/arm64/mm/mmap.c
>>>>> +++ b/arch/arm64/mm/mmap.c
>>>>> @@ -19,7 +19,7 @@ static pgprot_t protection_map[16]
>>>>> __ro_after_init = {
>>>>> [VM_WRITE] = PAGE_READONLY,
>>>>> [VM_WRITE | VM_READ] = PAGE_READONLY,
>>>>> /* PAGE_EXECONLY if Enhanced PAN */
>>>>> - [VM_EXEC] =
>>>>> PAGE_READONLY_EXEC,
>>>>> + [VM_EXEC] =
>>>>> PAGE_EXECONLY,
>>>>> [VM_EXEC | VM_READ] = PAGE_READONLY_EXEC,
>>>>> [VM_EXEC | VM_WRITE] = PAGE_READONLY_EXEC,
>>>>> [VM_EXEC | VM_WRITE | VM_READ] = PAGE_READONLY_EXEC,
>>>> In theory you'd need to change the VM_SHARED | VM_EXEC entry as well.
>>>> Otherwise it looks fine.
>>> Thanks. I just ran the same benchmark. Ran the modified
>>> page_fault1_thread
>>> (trigger read fault) in 100 iterations with 160 threads on 160
>>> cores. This
>>> should be the worst contention case and collected the max data (worst
>>> latency). It shows the patch may incur ~30% overhead for exec-only
>>> case. The
>>> overhead should just come from the permission fault.
>>>
>>> N Min Max Median Avg Stddev
>>> x 100 163840 219083 184471 183262 12593.229
>>> + 100 211198 285947 233608 238819.98 15253.967
>>> Difference at 95.0% confidence
>>> 55558 +/- 3877
>>> 30.3161% +/- 2.11555%
>>>
>>> This is a very extreme benchmark, I don't think any real life
>>> workload will
>>> spend that much time (sys vs user) in page fault, particularly read
>>> fault.
>>>
>>> With my atomic fault benchmark (populate 1G memory with atomic
>>> instruction
>>> then manipulate the value stored in the memory in 100 iterations so
>>> the user
>>> time is much longer than sys time), I saw around 13% overhead on sys
>>> time
>>> due to the permission fault, but no noticeable change for user and real
>>> time.
>> Thanks for running these tests.
>>
>>> So the permission fault does incur noticeable overhead for read
>>> fault on
>>> exec-only, but it may be not that bad for real life workloads.
>> So you are saying 3 faults is not that bad for real life workloads but
>> the 2 faults behaviour we have currently is problematic for OpenJDK. For
>> the OpenJDK case, I don't think the faulting is the main issue affecting
>> run-time performance but, over a longer run (or with more iterations in
>> this benchmark after the faulting in), you'd notice the breaking up of
>> the initial THP pages.
Gently ping...
>
> I meant the extra permission fault for exec-only should be ok since
> the current implementation can't force write fault for exec-only
> anyway. It does incur noticeable overhead for read fault, but I'm not
> aware of any real life workloads are sensitive to read fault. The
> benchmark is for a very extreme worst case.
>
>>
>> Do you have any OpenJDK benchmarks that show the problem? It might be
>> worth running them with this patch:
>>
>> https://lore.kernel.org/r/rjudrmg7nkkwfgviqeqluuww6wu6fdrgdsfimtmpjee7lkz2ej@iosd2f6pk4f7
>>
>>
>> Or, if not, do you see any difference in the user time in your benchmark
>> with the above mm patch? In subsequent iterations, linear accesses are
>> not ideal for testing. Better to have some random accesses this 1GB of
>> memory (after the initial touching). That would mimic heap accesses a
>> bit better.
>
> I didn't try that patch. I think we discussed this before. This patch
> can remove the THP shattering, we should be able to see some
> improvement, but TBLI is still needed and its cost should be still
> noticeable because the write protection fault still happens.
>
>>
>> Anyway, as it stands, I don't think we should merge this patch since it
>> does incur an additional penalty with exec-only mappings and it would
>> make things even worse for OpenJDK if distros change their default
>> toolchain flags at some point to generate exec-only ELF text sections.
>> While we could work around this by allowing the kernel to read the
>> exec-only user mapping with a new flavour of get_user() (toggling PAN as
>> well), I don't think it's worth it. Especially with the above mm change,
>> I think the benefits of this patch aren't justified. Longer term, I hope
>> that customers upgrade to OpenJDK v22 or, for proprietary tools, they
>> either follow the madvise() approach or wait for the Arm architects to
>> change the hardware behaviour.
>
> If the overhead for exec-only is a concern, I think we can just skip
> exec-only segment for now, right? The exec-only is not that popular
> yet. And if the users prefer security, typically they may be not that
> sensitive to performance.
>
>>
>
Powered by blists - more mailing lists