Message-ID: <4794885d-2e17-4bd8-bdf3-8ac37047e8ee@os.amperecomputing.com>
Date: Thu, 10 Apr 2025 15:00:22 -0700
From: Yang Shi <yang@...amperecomputing.com>
To: Ryan Roberts <ryan.roberts@....com>, will@...nel.org,
catalin.marinas@....com, Miko.Lenczewski@....com,
scott@...amperecomputing.com, cl@...two.org
Cc: linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block
mapping when rodata=full
Hi Ryan,
I know you may have a lot of things to follow up on after LSF/MM. This is
just a gentle ping; hopefully we can resume the review soon.
Thanks,
Yang
On 3/13/25 10:40 AM, Yang Shi wrote:
>
>
> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>> On 13/03/2025 17:28, Yang Shi wrote:
>>> Hi Ryan,
>>>
>>> I saw Miko posted a new spin of his patches. There are some slight changes
>>> that affect my patches (basically checking the new boot parameter). Do you
>>> prefer that I rebase my patches on top of his new spin right now and restart
>>> the review from there, or that you review the current patches and I then
>>> address the review comments and rebase onto Miko's new spin together?
>> Hi Yang,
>>
>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>
>> I'm happy to review against v3 as it is. I'm familiar with Miko's series and
>> am not too bothered about the integration with that; I think it's pretty
>> straightforward. I'm more interested in how you are handling the splitting,
>> which I think is the bulk of the effort.
>
> Yeah, sure, thank you.
>
>>
>> I'm hoping to get to this next week before heading out to LSF/MM the
>> following week (might I see you there?)
>
> Unfortunately I can't make it this year. Have fun!
>
> Thanks,
> Yang
>
>>
>> Thanks,
>> Ryan
>>
>>
>>> Thanks,
>>> Yang
>>>
>>>
>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>> Changelog
>>>> =========
>>>> v3:
>>>>   * Rebased to v6.14-rc4.
>>>>   * Based on Miko's BBML2 cpufeature patch
>>>>     (https://lore.kernel.org/linux-arm-kernel/20250228182403.6269-3-miko.lenczewski@....com/).
>>>>     Also included in this series in order to have the complete patchset.
>>>>   * Enhanced __create_pgd_mapping() to handle split as well, per Ryan.
>>>>   * Supported CONT mappings, per Ryan.
>>>>   * Supported asymmetric systems by splitting the kernel linear mapping if
>>>>     such a system is detected, per Ryan. I don't have such a system to test,
>>>>     so the testing was done by hacking the kernel to call linear mapping
>>>>     repainting unconditionally. The linear mapping doesn't have any block
>>>>     or cont mappings after booting.
>>>>
>>>> RFC v2:
>>>>   * Used an allowlist to advertise BBM lv2 on the CPUs which can handle TLB
>>>>     conflicts gracefully, per Will Deacon.
>>>>   * Rebased onto v6.13-rc5.
>>>>   * https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1-yang@...amperecomputing.com/
>>>>
>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-yang@...amperecomputing.com/
>>>>
>>>> Description
>>>> ===========
>>>> When rodata=full is specified, the kernel linear mapping is mapped at PTE
>>>> granularity because of Arm's break-before-make rule.
>>>>
>>>> A number of performance issues arise when the kernel linear map uses
>>>> PTE entries only:
>>>> - performance degradation
>>>> - more TLB pressure
>>>> - memory waste for kernel page table
>>>>
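
To give a rough sense of why a PTE-only linear map adds TLB pressure, the
sketch below counts how many leaf descriptors are needed to map a region at
each mapping size available with a 4K granule. The 1 GiB region and the names
(LEAF_SIZES, leaf_entries) are illustrative assumptions and are not taken from
the series; only the mapping sizes themselves are architectural.

```python
# Leaf mapping sizes with a 4K translation granule on arm64.
# The contiguous hint groups 16 adjacent entries at the PTE and PMD levels.
KIB, MIB, GIB = 2**10, 2**20, 2**30

LEAF_SIZES = {
    "PTE (4K)": 4 * KIB,
    "CONT PTE (64K)": 64 * KIB,
    "PMD block (2M)": 2 * MIB,
    "CONT PMD (32M)": 32 * MIB,
    "PUD block (1G)": 1 * GIB,
}


def leaf_entries(region_bytes, leaf_bytes):
    """Number of leaf descriptors needed to map a suitably aligned region."""
    return region_bytes // leaf_bytes


REGION = 1 * GIB  # hypothetical region size, chosen only for illustration

for name, size in LEAF_SIZES.items():
    print(f"{name:>15}: {leaf_entries(REGION, size)} entries")
```

The entry count is only a proxy: how many TLB entries a CPU actually uses for
contiguous or block mappings is implementation specific.
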
>>>> These issues can be avoided by specifying rodata=on on the kernel command
>>>> line, but this disables the alias checks on page table permissions and
>>>> therefore compromises security somewhat.
>>>>
>>>> With FEAT_BBM level 2 support it is no longer necessary to invalidate the
>>>> page table entry when changing page sizes. This allows the kernel to
>>>> split large mappings after boot is complete.
>>>>
>>>> This patch adds support for splitting large mappings when FEAT_BBM level 2
>>>> is available and rodata=full is used. This functionality will be used
>>>> when modifying page permissions for individual page frames.
>>>>
>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using PTEs
>>>> only.
>>>>
>>>> If the system is asymmetric, the kernel linear mapping may be repainted
>>>> once the BBML2 capability is finalized on all CPUs. See patch #6 for
>>>> more details.
>>>>
>>>> We saw significant performance increases in some benchmarks with
>>>> rodata=full without compromising the security features of the kernel.
>>>>
>>>> Testing
>>>> =======
>>>> The test was done on an AmpereOne machine (192 cores, 1P) with 256GB of
>>>> memory, 4K page size and 48-bit VA.
>>>>
>>>> Function test (4K/16K/64K page size)
>>>>   - Kernel boot. The kernel needs to change kernel linear mapping
>>>>     permissions at boot stage; if the patch didn't work, the kernel
>>>>     typically didn't boot.
>>>>   - Module stress from stress-ng. Kernel module loading changes
>>>>     permissions for the linear mapping.
>>>>   - A test kernel module which allocates 80% of total memory via
>>>>     vmalloc(), then changes the vmalloc area permission to RO (this also
>>>>     changes the linear mapping permission to RO), then changes it back
>>>>     before vfree(). Then launch a VM which consumes almost all physical
>>>>     memory.
>>>>   - VM with the patchset applied in the guest kernel too.
>>>>   - Kernel build in a VM with a guest kernel which has this series
>>>>     applied.
>>>>   - rodata=on. Make sure the other rodata mode is not broken.
>>>>   - Boot on a machine which doesn't support BBML2.
>>>>
>>>> Performance
>>>> ===========
>>>> Memory consumption
>>>> Before:
>>>> MemTotal: 258988984 kB
>>>> MemFree: 254821700 kB
>>>>
>>>> After:
>>>> MemTotal: 259505132 kB
>>>> MemFree: 255410264 kB
>>>>
>>>> Around 500MB more memory is free to use. The larger the machine, the
>>>> more memory is saved.
>>>>
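
As a rough cross-check of the "around 500MB" figure, the sketch below computes
the deltas from the meminfo numbers quoted above and compares them with the
size of the last-level page tables needed to map 256 GiB with 4K PTEs. It
assumes 8-byte descriptors and 512 entries per table (the 4K granule layout);
the helper names are hypothetical, and treating the linear-map page tables as
the main source of the difference is an assumption, not something measured
here.

```python
# Figures quoted in the cover letter, in kB as reported by /proc/meminfo.
BEFORE = {"MemTotal": 258988984, "MemFree": 254821700}
AFTER = {"MemTotal": 259505132, "MemFree": 255410264}


def delta_mib(key):
    """Increase of a meminfo field between the two runs, in MiB."""
    return (AFTER[key] - BEFORE[key]) / 1024


# Assumption: 4K granule, so each 4 KiB PTE table holds 512 entries
# and therefore maps 2 MiB of the linear map.
PAGE_SIZE = 4096
ENTRIES_PER_TABLE = 512
RAM_BYTES = 256 * 2**30


def pte_table_bytes(ram_bytes):
    """Memory used by last-level tables when all of RAM is mapped by PTEs."""
    span_per_table = PAGE_SIZE * ENTRIES_PER_TABLE
    tables = ram_bytes // span_per_table
    return tables * PAGE_SIZE


print(f"MemTotal delta: {delta_mib('MemTotal'):.0f} MiB")
print(f"MemFree delta:  {delta_mib('MemFree'):.0f} MiB")
print(f"PTE tables for 256 GiB: {pte_table_bytes(RAM_BYTES) // 2**20} MiB")
```

The estimate ignores the much smaller upper-level tables and whatever block
mappings remain after splitting, so it is only an order-of-magnitude check.
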
>>>> Performance benchmarking
>>>> * Memcached
>>>> We saw performance degradation when running the Memcached benchmark with
>>>> rodata=full vs rodata=on. Our profiling pointed to kernel TLB pressure.
>>>> With this patchset, ops/sec increased by around 3.5% and P99 latency
>>>> dropped by around 9.6%.
>>>> The gain mainly came from reduced kernel TLB misses. The kernel TLB
>>>> MPKI is reduced by 28.5%.
>>>>
>>>> The benchmark data is now on par with rodata=on too.
>>>>
>>>> * Disk encryption (dm-crypt) benchmark
>>>> Ran the fio benchmark with the below command on a 128G ramdisk (ext4) with
>>>> disk encryption (by dm-crypt).
>>>>     fio --directory=/data --random_generator=lfsr --norandommap \
>>>>         --randrepeat 1 --status-interval=999 --rw=write --bs=4k \
>>>>         --loops=1 --ioengine=sync --iodepth=1 --numjobs=1 \
>>>>         --fsync_on_close=1 --group_reporting --thread \
>>>>         --name=iops-test-job --eta-newline=1 --size 100G
>>>>
>>>> The IOPS is increased by 90% - 150% (the variance is high, but the worst
>>>> number in the good case is around 90% higher than the best number in the
>>>> bad case).
>>>> The bandwidth is increased and the avg clat is reduced proportionally.
>>>>
>>>> * Sequential file read
>>>> Read a 100G file sequentially on XFS (xfs_io read with page cache
>>>> populated).
>>>> The bandwidth is increased by 150%.
>>>>
>>>>
>>>> Mikołaj Lenczewski (1):
>>>> arm64: Add BBM Level 2 cpu feature
>>>>
>>>> Yang Shi (5):
>>>> arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>> arm64: mm: make __create_pgd_mapping() and helpers non-void
>>>> arm64: mm: support large block mapping when rodata=full
>>>> arm64: mm: support split CONT mappings
>>>> arm64: mm: split linear mapping if BBML2 is not supported on secondary CPUs
>>>>
>>>>  arch/arm64/Kconfig                  |  11 +++++
>>>>  arch/arm64/include/asm/cpucaps.h    |   2 +
>>>>  arch/arm64/include/asm/cpufeature.h |  15 ++++++
>>>>  arch/arm64/include/asm/mmu.h        |   4 ++
>>>>  arch/arm64/include/asm/pgtable.h    |  12 ++++-
>>>>  arch/arm64/kernel/cpufeature.c      |  95 +++++++++++++++++++++++++++++++++++++
>>>>  arch/arm64/mm/mmu.c                 | 397 +++++++++++++++++++++++++++++++++++-----
>>>>  arch/arm64/mm/pageattr.c            |  37 ++++++++++++---
>>>>  arch/arm64/tools/cpucaps            |   1 +
>>>>  9 files changed, 518 insertions(+), 56 deletions(-)
>>>>
>>>>
>