Message-ID: <8e3ad39e-de59-48bf-b776-b27dc784a8ef@os.amperecomputing.com>
Date: Thu, 13 Feb 2025 13:27:37 -0800
From: Yang Shi <yang@...amperecomputing.com>
To: Ryan Roberts <ryan.roberts@....com>, catalin.marinas@....com,
 will@...nel.org
Cc: cl@...two.org, scott@...amperecomputing.com,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC v2 PATCH 0/2] arm64: support FEAT_BBM level 2 and large
 block mapping when rodata=full




On 2/11/25 3:36 AM, Ryan Roberts wrote:
> Sorry managed to send this to the list only. Resending with original recipients
> added back in...
>
>
> On 11/02/2025 11:34, Ryan Roberts wrote:
>> Hi Yang,
>>
>> Thanks for putting this together; I'm hoping to piggyback on this and use BBML2
>> to reduce the cost of contpte_convert().

Thanks for sharing another use case.

>>
>> review incoming...
>>
>>
>> On 03/01/2025 01:17, Yang Shi wrote:
>>> When rodata=full, the kernel linear mapping is mapped at PTE level due to
>>> Arm's break-before-make rule.
>>>
>>> This results in a couple of problems:
>>>    - performance degradation
>>>    - more TLB pressure
>>>    - memory waste for kernel page table
>>>
>>> There are some workarounds to mitigate the problems, for example, using
>>> rodata=on, but this compromises the security protection.
>>>
>>> With FEAT_BBM level 2 support, splitting a large block mapping into
>>> smaller ones no longer requires making the page table entry invalid.
>>> This allows the kernel to split large block mappings on the fly.
>>>
>>> Add kernel page table split support and use large block mapping by
>>> default when FEAT_BBM level 2 is supported for rodata=full.  When
>>> changing permissions for the kernel linear mapping, the page table will
>>> be split to PTE level.
>>>
>>> Machines without FEAT_BBM level 2 will fall back to a PTE-mapped kernel
>>> linear mapping when rodata=full.
>>>
>>> With this we saw a significant performance boost in some benchmarks
>>> while keeping the rodata=full security protection.
>>>
>>> The test was done on an AmpereOne machine (192 cores, 1P) with 256GB memory
>>> and 4K page size + 48 bit VA.
>>>
>>> Function test (4K/16K/64K page size)
>>>    - Kernel boot.  The kernel needs to change the kernel linear mapping
>>>      permission at boot stage; if the patch didn't work, the kernel
>>>      typically didn't boot.
>>>    - Module stress from stress-ng. Kernel module loading changes the
>>>      permission for module sections.
>>>    - A test kernel module which allocates 80% of total memory via vmalloc(),
>>>      then changes the vmalloc area permission to RO, then changes it back
>>>      before vfree(). Then launch a VM which consumes almost all physical
>>>      memory.
>> I don't really understand how vmalloc is relevant here? vmalloc can already map
>> huge pages if you use vmalloc_huge(), and changing the permissions of a vmalloc
>> mapping will only affect the ptes pertaining to that mapping; I don't see why
>> that would cause permissions to be changed on the linear map or for huge pages
>> in the linear map to be split?

I just use the vmalloc() API to emulate what module loading does: allocate
memory via vmalloc(), then change the permission to, for example,
read-only by calling set_memory_ro(). That way I can stress the page
split by doing it on most of memory, for example, 80% of memory. It is
more efficient than loading real modules.

It is implemented by a patch against test_vmalloc. I didn't include the
patch in this series; if you think it is useful, I can include it in v3.
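
To illustrate why this stresses the split path, here is a toy userspace
model. It is not the test_vmalloc patch and not kernel code. It assumes
a 4K granule, a linear map built only from 2 MiB blocks (1 GiB blocks
and contiguous mappings are ignored), and that changing a vmalloc area
to read-only under rodata=full also updates the linear-map alias of each
backing page, which is my understanding of the arm64 set_memory_ro()
path. The page addresses are made up.

```python
# Toy model only: counts which 2 MiB linear-map blocks would need to be
# split to PTE level when individual 4K pages change permission.
PAGE = 4 * 1024
PMD_BLOCK = 2 * 1024 * 1024            # one level-2 block maps 2 MiB
PTES_PER_TABLE = PMD_BLOCK // PAGE     # 512 entries per PTE table


def split_for_pages(page_addrs):
    """Return base addresses of the 2 MiB blocks containing the pages."""
    return {addr - (addr % PMD_BLOCK) for addr in page_addrs}


def extra_table_bytes(page_addrs):
    """Each split block needs one new PTE table, i.e. one 4K page."""
    return len(split_for_pages(page_addrs)) * PAGE


# Hypothetical physical pages backing one vmalloc() allocation.
pages = [0x4000_0000, 0x4000_1000, 0x4020_0000, 0x7FE0_3000]
blocks = split_for_pages(pages)

print(sorted(hex(b) for b in blocks))
print(extra_table_bytes(pages), "bytes of new PTE tables")
```

Because vmalloc() pages are generally scattered in physical memory,
touching 80% of memory this way should hit nearly every block in the
linear map.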

>>
>>>    - VM with the patchset applied in guest kernel too.
>>>    - Kernel build in VM with patched guest kernel.
>>>
>>> Memory consumption
>>> Before:
>>> MemTotal:       258988984 kB
>>> MemFree:        254821700 kB
>>>
>>> After:
>>> MemTotal:       259505132 kB
>>> MemFree:        255410264 kB
>>>
>>> Around 500MB more memory is free to use.  The larger the machine, the
>>> more memory is saved.
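
As a sanity check on the numbers above, a back-of-the-envelope sketch.
It assumes the 4K granule and 8-byte page table entries used on this
machine, and that all 256 GiB would otherwise be mapped at PTE level;
the variable names are just for illustration.

```python
KIB = 1024
PAGE_SIZE = 4 * KIB        # 4K granule
PTE_SIZE = 8               # bytes per page table entry
ram_bytes = 256 * KIB**3   # 256 GiB of RAM

# One PTE per 4K page: the last-level tables needed to PTE-map all RAM.
pte_table_bytes = ram_bytes // PAGE_SIZE * PTE_SIZE
print(pte_table_bytes // KIB**2, "MiB of PTE tables")  # 512 MiB

# /proc/meminfo figures quoted above, in kB.
mem_total_before_kb, mem_total_after_kb = 258988984, 259505132
mem_free_before_kb, mem_free_after_kb = 254821700, 255410264

total_gain_kb = mem_total_after_kb - mem_total_before_kb
free_gain_kb = mem_free_after_kb - mem_free_before_kb
print(round(total_gain_kb / KIB), "MiB more MemTotal")
print(round(free_gain_kb / KIB), "MiB more MemFree")
```

The MemTotal gain is close to the 512 MiB that PTE tables would cost,
which is consistent with the saving coming from the kernel page table.
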
>>>
>>> Performance benchmarking
>>> * Memcached
>>> We saw performance degradation when running the Memcached benchmark with
>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
>>> With this patchset, ops/sec increased by around 3.5% and P99 latency
>>> dropped by around 9.6%.
>>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>>> MPKI is reduced by 28.5%.
>>>
>>> The benchmark data is now on par with rodata=on too.
>>>
>>> * Disk encryption (dm-crypt) benchmark
>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with disk
>>> encryption (by dm-crypt).
>>> fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
>>>      --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
>>>      --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
>>>      --name=iops-test-job --eta-newline=1 --size 100G
>>>
>>> IOPS increased by 90% - 150% (the variance is high, but the worst
>>> result with the patchset is around 90% higher than the best result
>>> without it).
>>> The bandwidth increased and the avg clat dropped proportionally.
>>>
>>> * Sequential file read
>>> Read 100G file sequentially on XFS (xfs_io read with page cache populated).
>>> The bandwidth is increased by 150%.
>> The performance gains definitely look worthwhile!

Yeah, thanks for taking the time to review the patches. I think the
feedback is positive enough so far to get rid of the "RFC" tag.

Yang


>>
>> Thanks,
>> Ryan
>>
>>> RFC v2:
>>>    * Used allowlist to advertise BBM lv2 on the CPUs which can handle TLB
>>>      conflict gracefully per Will Deacon
>>>    * Rebased onto v6.13-rc5
>>>
>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-yang@os.amperecomputing.com/
>>>
>>> Yang Shi (2):
>>>        arm64: cpufeature: detect FEAT_BBM level 2
>>>        arm64: mm: support large block mapping when rodata=full
>>>
>>>   arch/arm64/include/asm/cpufeature.h |  19 ++++++++++++
>>>   arch/arm64/include/asm/pgtable.h    |   7 ++++-
>>>   arch/arm64/kernel/cpufeature.c      |  11 +++++++
>>>   arch/arm64/mm/mmu.c                 |  32 ++++++++++++++++++--
>>>   arch/arm64/mm/pageattr.c            | 173 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>>   arch/arm64/tools/cpucaps            |   1 +
>>>   6 files changed, 234 insertions(+), 9 deletions(-)
>>>
>>>

