linux-kernel - Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full


Message-ID: <d2f2d48e-bfa5-4fd8-a6fe-dc75b89ffe9e@os.amperecomputing.com>
Date: Mon, 2 Dec 2024 15:39:57 -0800
From: Yang Shi <yang@...amperecomputing.com>
To: catalin.marinas@....com, will@...nel.org
Cc: cl@...two.org, scott@...amperecomputing.com,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block
 mapping when rodata=full

Gentle ping...


Any comments on this RFC? Looking forward to discussing them.

Thanks,
Yang


On 11/18/24 10:16 AM, Yang Shi wrote:
> When rodata=full, the kernel linear mapping is mapped at PTE level because
> of Arm's break-before-make rule.
>
> This results in several problems:
>    - performance degradation
>    - more TLB pressure
>    - memory waste for kernel page table
>
> There are some workarounds to mitigate these problems, for example using
> rodata=on, but this weakens the security protection.
>
> With FEAT_BBM level 2 support, splitting a large block mapping into
> smaller ones no longer requires making the page table entry invalid first.
> This allows the kernel to split large block mappings on the fly.
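
To make the feature check concrete, here is a minimal sketch of decoding the
FEAT_BBM level from an ID_AA64MMFR2_EL1 value. It assumes the BBM field sits
in bits [55:52] as described in the Arm ARM; the function names are
hypothetical and this is not the detection code from the patches. The series
also carries an AmpereOne-specific patch, so the register check alone should
not be read as the complete detection logic.

```python
# Illustrative only: decode the BBM field of ID_AA64MMFR2_EL1.
# Assumption: BBM occupies bits [55:52] (0 = level 0, 1 = level 1, 2 = level 2).
BBM_SHIFT = 52
BBM_MASK = 0xF


def bbm_level(id_aa64mmfr2: int) -> int:
    """Return the FEAT_BBM level encoded in an ID_AA64MMFR2_EL1 value."""
    return (id_aa64mmfr2 >> BBM_SHIFT) & BBM_MASK


def can_split_live_blocks(id_aa64mmfr2: int) -> bool:
    """Hypothetical helper: level 2 is what the cover letter relies on."""
    return bbm_level(id_aa64mmfr2) >= 2
```

Usage: can_split_live_blocks(2 << 52) is True, while a value with BBM set
to 1 yields False.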
>
> Add kernel page table split support and use large block mappings by
> default when FEAT_BBM level 2 is supported and rodata=full.  When
> changing permissions for the kernel linear mapping, the page table will be
> split down to PTE level.
>
> Machines without FEAT_BBM level 2 will fall back to a PTE-mapped kernel
> linear mapping when rodata=full.
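
The split described above can be pictured with a toy model. This is a sketch
under stated assumptions, not the code in pageattr.c: it assumes a 4K granule
and a 2 MiB block, the Entry type and both function names are hypothetical,
and it does not model the TLB or the table-entry update that FEAT_BBM level 2
permits without an intermediate invalid entry.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096                          # 4K granule, as in the test setup
PTES_PER_TABLE = 512                      # 8-byte entries in one 4K table
BLOCK_SIZE = PAGE_SIZE * PTES_PER_TABLE   # 2 MiB block


@dataclass(frozen=True)
class Entry:
    pa: int      # physical address mapped by this entry
    prot: str    # "RW" or "RO"


def split_block(block: Entry) -> list:
    """Build 512 PTEs that map exactly what the 2 MiB block mapped."""
    assert block.pa % BLOCK_SIZE == 0, "block must be 2 MiB aligned"
    return [Entry(block.pa + i * PAGE_SIZE, block.prot)
            for i in range(PTES_PER_TABLE)]


def set_prot(ptes: list, first: int, count: int, prot: str) -> None:
    """Change permissions on `count` pages starting at index `first`."""
    for i in range(first, first + count):
        ptes[i] = Entry(ptes[i].pa, prot)


# Split one block, then make two pages read-only; the rest stay RW.
ptes = split_block(Entry(pa=0x4000_0000, prot="RW"))
set_prot(ptes, first=3, count=2, prot="RO")
```

The point of the model is that the PTEs produced by the split describe the
same memory with the same attributes as the block, so only the pages whose
permissions actually change end up differing from the original mapping.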
>
> With this we saw a significant performance boost in some benchmarks while
> keeping the rodata=full security protection.
>
> The test was done on an AmpereOne machine (192 cores, 1P) with 256GB memory,
> 4K page size and 48-bit VA.
>
> Functional tests (4K/16K/64K page size)
>    - Kernel boot.  The kernel needs to change kernel linear mapping
>      permissions at boot, so if the patch didn't work the kernel typically
>      didn't boot.
>    - Module stress from stress-ng.  Kernel module loading changes permissions
>      for module sections.
>    - A test kernel module which allocates 80% of total memory via vmalloc(),
>      then changes the vmalloc area permission to RO, then changes it back
>      before vfree().  Then launch a VM which consumes almost all physical
>      memory.
>    - VM with the patchset applied in the guest kernel too.
>    - Kernel build in a VM with a patched guest kernel.
>
> Memory consumption
> Before:
> MemTotal:       258988984 kB
> MemFree:        254821700 kB
>
> After:
> MemTotal:       259505132 kB
> MemFree:        255410264 kB
>
> Around 500MB more memory is free to use.  The larger the machine, the
> more memory is saved.
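
As a rough cross-check of the ~500MB figure, the last-level page table cost
can be estimated directly. This is a back-of-the-envelope sketch: it assumes
8-byte entries, that all 256 GiB is covered by the linear map, and 2 MiB
blocks in the "after" case (1 GiB blocks and the upper table levels are
ignored). The helper name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

PAGE_SIZE = 4 * KIB
ENTRY_SIZE = 8                                  # bytes per page table entry
ENTRIES_PER_TABLE = PAGE_SIZE // ENTRY_SIZE     # 512


def leaf_table_bytes(mapped_bytes: int, leaf_size: int) -> int:
    """Memory used by the tables holding the leaf entries of a mapping."""
    entries = mapped_bytes // leaf_size
    tables = -(-entries // ENTRIES_PER_TABLE)   # ceiling division
    return tables * PAGE_SIZE


ram = 256 * GIB
pte_tables = leaf_table_bytes(ram, PAGE_SIZE)   # everything mapped by 4K PTEs
pmd_tables = leaf_table_bytes(ram, 2 * MIB)     # everything mapped by 2M blocks

# MemTotal delta taken from the numbers quoted above, in kB.
reported_gain_kb = 259505132 - 258988984

print(pte_tables // MIB, "MiB of PTE tables")
print(pmd_tables // MIB, "MiB of tables holding 2M block entries")
print(round(reported_gain_kb / 1024, 1), "MiB reported MemTotal gain")
```

The estimate gives 512 MiB of PTE tables versus 1 MiB for 2M blocks, which is
in line with the reported MemTotal gain of 516148 kB (about 504 MiB).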
>
> Performance benchmarking
> * Memcached
> We saw performance degradation when running the Memcached benchmark with
> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
> With this patchset, ops/sec increased by around 3.5% and P99 latency
> dropped by around 9.6%.
> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
> MPKI is reduced by 28.5%.
>
> The benchmark data is now on par with rodata=on too.
>
> * Disk encryption (dm-crypt) benchmark
> Ran the fio benchmark with the command below on a 128G ramdisk (ext4) with
> disk encryption (dm-crypt).
> fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
>      --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
>      --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
>      --name=iops-test-job --eta-newline=1 --size 100G
>
> IOPS increased by 90% - 150% (the variance is high, but the worst result
> with the patchset is still around 90% higher than the best result without
> it).  Bandwidth increased and the avg clat dropped proportionally.
>
> * Sequential file read
> Read a 100G file sequentially on XFS (xfs_io read with page cache populated).
> The bandwidth increased by 150%.
>
>
> Yang Shi (3):
>        arm64: cpufeature: detect FEAT_BBM level 2
>        arm64: mm: support large block mapping when rodata=full
>        arm64: cpufeature: workaround AmpereOne FEAT_BBM level 2
>
>   arch/arm64/include/asm/cpufeature.h |  24 ++++++++++++++++++
>   arch/arm64/include/asm/pgtable.h    |   7 +++++-
>   arch/arm64/kernel/cpufeature.c      |  11 ++++++++
>   arch/arm64/mm/mmu.c                 |  31 +++++++++++++++++++++--
>   arch/arm64/mm/pageattr.c            | 173 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>   arch/arm64/tools/cpucaps            |   1 +
>   6 files changed, 238 insertions(+), 9 deletions(-)
>
>