Message-ID: <85e50475-7d2c-49df-924e-90e0b915a4d3@os.amperecomputing.com>
Date: Sun, 2 Nov 2025 16:47:52 -0800
From: Yang Shi <yang@...amperecomputing.com>
To: Ryan Roberts <ryan.roberts@....com>, Guenter Roeck <linux@...ck-us.net>
Cc: catalin.marinas@....com, will@...nel.org, akpm@...ux-foundation.org,
 david@...hat.com, lorenzo.stoakes@...cle.com, ardb@...nel.org,
 dev.jain@....com, scott@...amperecomputing.com, cl@...two.org,
 linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, nd@....com
Subject: Re: [PATCH v8 3/5] arm64: mm: support large block mapping when
 rodata=full



On 11/2/25 4:11 AM, Ryan Roberts wrote:
> On 02/11/2025 10:31, Ryan Roberts wrote:
>> On 01/11/2025 16:14, Guenter Roeck wrote:
>>> Hi,
>>>
>>> On Wed, Sep 17, 2025 at 12:02:09PM -0700, Yang Shi wrote:
>>>> When rodata=full is specified, kernel linear mapping has to be mapped at
>>>> PTE level since large page table can't be split due to break-before-make
>>>> rule on ARM64.
>>>>
>>>> This resulted in a couple of problems:
>>>>    - performance degradation
>>>>    - more TLB pressure
>>>>    - memory waste for kernel page table
>>>>
>>>> With FEAT_BBM level 2 support, splitting large block page table to
>>>> smaller ones doesn't need to make the page table entry invalid anymore.
>>>> This allows kernel split large block mapping on the fly.
>>>>
>>>> Add kernel page table split support and use large block mapping by
>>>> default when FEAT_BBM level 2 is supported for rodata=full.  When
>>>> changing permissions for kernel linear mapping, the page table will be
>>>> split to smaller size.
>>>>
>>>> The machine without FEAT_BBM level 2 will fallback to have kernel linear
>>>> mapping PTE-mapped when rodata=full.
>>>>
>>>> With this we saw significant performance boost with some benchmarks and
>>>> much less memory consumption on my AmpereOne machine (192 cores, 1P)
>>>> with 256GB memory.
>>>>
>>>> * Memory use after boot
>>>> Before:
>>>> MemTotal:       258988984 kB
>>>> MemFree:        254821700 kB
>>>>
>>>> After:
>>>> MemTotal:       259505132 kB
>>>> MemFree:        255410264 kB
>>>>
>>>> Around 500MB more memory are free to use.  The larger the machine, the
>>>> more memory saved.
>>>>
>>>> * Memcached
>>>> We saw performance degradation when running Memcached benchmark with
>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
>>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>>> latency is reduced by around 9.6%.
>>>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>>>> MPKI is reduced by 28.5%.
>>>>
>>>> The benchmark data is now on par with rodata=on too.
>>>>
>>>> * Disk encryption (dm-crypt) benchmark
>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with
>>>> disk encryption (by dm-crypt).
>>>> fio --directory=/data --random_generator=lfsr --norandommap            \
>>>>      --randrepeat 1 --status-interval=999 --rw=write --bs=4k --loops=1  \
>>>>      --ioengine=sync --iodepth=1 --numjobs=1 --fsync_on_close=1         \
>>>>      --group_reporting --thread --name=iops-test-job --eta-newline=1    \
>>>>      --size 100G
>>>>
>>>> The IOPS is increased by 90% - 150% (the variance is high, but the worst
>>>> number of good case is around 90% more than the best number of bad
>>>> case). The bandwidth is increased and the avg clat is reduced
>>>> proportionally.
>>>>
>>>> * Sequential file read
>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>> populated). The bandwidth is increased by 150%.
>>>>
>>> With lock debugging enabled, we see a large number of "BUG: sleeping
>>> function called from invalid context at kernel/locking/mutex.c:580"
>>> and "BUG: Invalid wait context:" backtraces when running v6.18-rc3.
>>> Please see example below.
>>>
>>> Bisect points to this patch.
>>>
>>> Please let me know if there is anything I can do to help tracking
>>> down the problem.
>> Thanks for the report - ouch!
>>
>> I expect you're running on a system that supports BBML2_NOABORT, based on the
>> stack trace, I expect you have CONFIG_DEBUG_PAGEALLOC enabled? That will cause
>> permission tricks to be played on the linear map at page allocation and free
>> time, which can happen in non-sleepable contexts. And with this patch we are
>> taking pgtable_split_lock (a mutex) in split_kernel_leaf_mapping(), which is
>> called as a result of the permission change request.
>>
>> However, when CONFIG_DEBUG_PAGEALLOC is enabled we always force-map the linear map
>> by PTE so split_kernel_leaf_mapping() is actually unnecessary and will return
>> without actually having to split anything. So we could add an early "if
>> (force_pte_mapping()) return 0;" to bypass the function entirely in this case,
>> and I *think* that should solve it.
>>
>> But I'm also concerned about KFENCE. I can't remember its exact semantics off
>> the top of my head, so I'm concerned we could see similar problems there (where
>> we only force pte mapping for the KFENCE pool).
>>
>> I'll investigate fully tomorrow and hopefully provide a fix.

Hi Ryan,

Thanks a lot for the quick fix. I have some comments about kfence below.

> Here's a proposed fix, although I can't get access to a system with BBML2 until
> tomorrow at the earliest. Guenter, I wonder if you could check that this
> resolves your issue?
>
> ---8<---
> commit 602ec2db74e5abfb058bd03934475ead8558eb72
> Author: Ryan Roberts <ryan.roberts@....com>
> Date:   Sun Nov 2 11:45:18 2025 +0000
>
>      arm64: mm: Don't attempt to split known pte-mapped regions
>      
>      It has been reported that split_kernel_leaf_mapping() is trying to sleep
>      in non-sleepable context. It does this when acquiring the
>      pgtable_split_lock mutex, when either CONFIG_DEBUG_PAGEALLOC or
>      CONFIG_KFENCE are enabled, which change linear map permissions within
>      softirq context during memory allocation and/or freeing.
>      
>      But it turns out that the memory for which these features may attempt to
>      modify the permissions is always mapped by pte, so there is no need to
>      attempt to split the mapping. So let's exit early in these cases and
>      avoid attempting to take the mutex.
>      
>      Closes: https://lore.kernel.org/all/f24b9032-0ec9-47b1-8b95-c0eeac7a31c5@roeck-us.net/
>      Fixes: a166563e7ec3 ("arm64: mm: support large block mapping when rodata=full")
>      Signed-off-by: Ryan Roberts <ryan.roberts@....com>
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index b8d37eb037fc..6e26f070bb49 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -708,6 +708,16 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>   	return ret;
>   }
>   
> +static inline bool force_pte_mapping(void)
> +{
> +	bool bbml2 = system_capabilities_finalized() ?
> +		system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();
> +
> +	return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
> +			   is_realm_world())) ||
> +		debug_pagealloc_enabled();
> +}
> +
>   static DEFINE_MUTEX(pgtable_split_lock);
>   
>   int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
> @@ -723,6 +733,16 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>   	if (!system_supports_bbml2_noabort())
>   		return 0;
>   
> +	/*
> +	 * If the region is within a pte-mapped area, there is no need to try to
> +	 * split. Additionally, CONFIG_DEBUG_PAGEALLOC and CONFIG_KFENCE may
> +	 * change permissions from softirq context so for those cases (which are
> +	 * always pte-mapped), we must not go any further because taking the
> +	 * mutex below may sleep.
> +	 */
> +	if (force_pte_mapping() || is_kfence_address((void *)start))

IIUC this may break kfence late init? kfence_init_late() allocates the
pool pages from the buddy allocator, then protects them (marks them
invalid). Protecting them requires splitting the page table, but
__kfence_pool is already set by the time the protection is applied, so
is_kfence_address() returns true and this check returns before the
split. So there is a kind of circular dependency.

Would the below fix work?

if (force_pte_mapping() ||
    (READ_ONCE(kfence_enabled) && is_kfence_address((void *)start)))

kfence_enabled is not set until the protection is done. So if it is
set, we know the kfence address must be mapped by PTE.
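
To make the ordering concrete, here is a minimal userspace sketch, not
kernel code. It assumes the late-init sequence described above (pool
pointer set first, protection applied next, enabled flag set last).
struct kfence_model and the three helpers are hypothetical names used
only for illustration, and force_pte_mapping() is left out because it
does not affect the kfence case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the KFENCE state that the check consults. */
struct kfence_model {
	unsigned long pool;	/* stands in for __kfence_pool, 0 if unset */
	unsigned long size;	/* pool size in bytes */
	bool enabled;		/* stands in for kfence_enabled */
};

/* Model of is_kfence_address(): true once the pool pointer is set. */
static bool model_is_kfence_address(const struct kfence_model *k,
				    unsigned long addr)
{
	assert(k != NULL);
	/* Unsigned subtraction wraps for addr < pool, so one compare suffices. */
	return k->pool != 0 && addr - k->pool < k->size;
}

/* Check as posted: skip the split for any address inside the pool. */
static bool skip_split_pool_only(const struct kfence_model *k,
				 unsigned long addr)
{
	return model_is_kfence_address(k, addr);
}

/* Suggested check: skip only once the enabled flag has been set. */
static bool skip_split_if_enabled(const struct kfence_model *k,
				  unsigned long addr)
{
	return k->enabled && model_is_kfence_address(k, addr);
}
```

In the late-init window (pool set, enabled still false),
skip_split_pool_only() returns true for a pool address, so the split
would be skipped, while skip_split_if_enabled() returns false and the
split can go ahead.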

Thanks,
Yang

> +		return 0;
> +
>   	/*
>   	 * Ensure start and end are at least page-aligned since this is the
>   	 * finest granularity we can split to.
> @@ -1009,16 +1029,6 @@ static inline void arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp) {
>   
>   #endif /* CONFIG_KFENCE */
>   
> -static inline bool force_pte_mapping(void)
> -{
> -	bool bbml2 = system_capabilities_finalized() ?
> -		system_supports_bbml2_noabort() : cpu_supports_bbml2_noabort();
> -
> -	return (!bbml2 && (rodata_full || arm64_kfence_can_set_direct_map() ||
> -			   is_realm_world())) ||
> -		debug_pagealloc_enabled();
> -}
> -
>   static void __init map_mem(pgd_t *pgdp)
>   {
>   	static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
> ---8<---
>
> Thanks,
> Ryan
>
>> Yang Shi, Do you have any additional thoughts?
>>
>> Thanks,
>> Ryan
>>
>>> Thanks,
>>> Guenter
>>>
>>> ---
>>> Example log:
>>>
>>> [    0.537499] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
>>> [    0.537501] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
>>> [    0.537502] preempt_count: 1, expected: 0
>>> [    0.537504] 2 locks held by swapper/0/1:
>>> [    0.537505]  #0: ffffb60b01211960 (sched_domains_mutex){+.+.}-{4:4}, at: sched_domains_mutex_lock+0x24/0x38
>>> [    0.537510]  #1: ffffb60b01595838 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x0/0x40
>>> [    0.537516] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.0-dbg-DEV #1 NONE
>>> [    0.537517] Call trace:
>>> [    0.537518]  show_stack+0x20/0x38 (C)
>>> [    0.537520]  __dump_stack+0x28/0x38
>>> [    0.537522]  dump_stack_lvl+0xac/0xf0
>>> [    0.537525]  dump_stack+0x18/0x3c
>>> [    0.537527]  __might_resched+0x248/0x2a0
>>> [    0.537529]  __might_sleep+0x40/0x90
>>> [    0.537531]  __mutex_lock_common+0x70/0x1818
>>> [    0.537533]  mutex_lock_nested+0x34/0x48
>>> [    0.537534]  split_kernel_leaf_mapping+0x74/0x1a0
>>> [    0.537536]  update_range_prot+0x40/0x150
>>> [    0.537537]  __change_memory_common+0x30/0x148
>>> [    0.537538]  __kernel_map_pages+0x70/0x88
>>> [    0.537540]  __free_frozen_pages+0x6e4/0x7b8
>>> [    0.537542]  free_frozen_pages+0x1c/0x30
>>> [    0.537544]  __free_slab+0xf0/0x168
>>> [    0.537547]  free_slab+0x2c/0xf8
>>> [    0.537549]  free_to_partial_list+0x4e0/0x620
>>> [    0.537551]  __slab_free+0x228/0x250
>>> [    0.537553]  kfree+0x3c4/0x4c0
>>> [    0.537555]  destroy_sched_domain+0xf8/0x140
>>> [    0.537557]  cpu_attach_domain+0x17c/0x610
>>> [    0.537558]  build_sched_domains+0x15a4/0x1718
>>> [    0.537560]  sched_init_domains+0xbc/0xf8
>>> [    0.537561]  sched_init_smp+0x30/0x98
>>> [    0.537562]  kernel_init_freeable+0x148/0x230
>>> [    0.537564]  kernel_init+0x28/0x148
>>> [    0.537566]  ret_from_fork+0x10/0x20
>>> [    0.537569] =============================
>>> [    0.537569] [ BUG: Invalid wait context ]
>>> [    0.537571] 6.18.0-dbg-DEV #1 Tainted: G        W
>>> [    0.537572] -----------------------------
>>> [    0.537572] swapper/0/1 is trying to lock:
>>> [    0.537573] ffffb60b011f3830 (pgtable_split_lock){+.+.}-{4:4}, at: split_kernel_leaf_mapping+0x74/0x1a0
>>> [    0.537576] other info that might help us debug this:
>>> [    0.537577] context-{5:5}
>>> [    0.537578] 2 locks held by swapper/0/1:
>>> [    0.537579]  #0: ffffb60b01211960 (sched_domains_mutex){+.+.}-{4:4}, at: sched_domains_mutex_lock+0x24/0x38
>>> [    0.537582]  #1: ffffb60b01595838 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x0/0x40
>>> [    0.537585] stack backtrace:
>>> [    0.537585] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G        W           6.18.0-dbg-DEV #1 NONE
>>> [    0.537587] Tainted: [W]=WARN
>>> [    0.537588] Call trace:
>>> [    0.537589]  show_stack+0x20/0x38 (C)
>>> [    0.537591]  __dump_stack+0x28/0x38
>>> [    0.537593]  dump_stack_lvl+0xac/0xf0
>>> [    0.537596]  dump_stack+0x18/0x3c
>>> [    0.537598]  __lock_acquire+0x980/0x2a20
>>> [    0.537600]  lock_acquire+0x124/0x2b8
>>> [    0.537602]  __mutex_lock_common+0xd8/0x1818
>>> [    0.537604]  mutex_lock_nested+0x34/0x48
>>> [    0.537605]  split_kernel_leaf_mapping+0x74/0x1a0
>>> [    0.537607]  update_range_prot+0x40/0x150
>>> [    0.537608]  __change_memory_common+0x30/0x148
>>> [    0.537609]  __kernel_map_pages+0x70/0x88
>>> [    0.537610]  __free_frozen_pages+0x6e4/0x7b8
>>> [    0.537613]  free_frozen_pages+0x1c/0x30
>>> [    0.537615]  __free_slab+0xf0/0x168
>>> [    0.537617]  free_slab+0x2c/0xf8
>>> [    0.537619]  free_to_partial_list+0x4e0/0x620
>>> [    0.537621]  __slab_free+0x228/0x250
>>> [    0.537623]  kfree+0x3c4/0x4c0
>>> [    0.537625]  destroy_sched_domain+0xf8/0x140
>>> [    0.537627]  cpu_attach_domain+0x17c/0x610
>>> [    0.537628]  build_sched_domains+0x15a4/0x1718
>>> [    0.537630]  sched_init_domains+0xbc/0xf8
>>> [    0.537631]  sched_init_smp+0x30/0x98
>>> [    0.537632]  kernel_init_freeable+0x148/0x230
>>> [    0.537633]  kernel_init+0x28/0x148
>>> [    0.537635]  ret_from_fork+0x10/0x20
>>>
>>> ---
>>> bisect:
>>>
>>> # bad: [3a8660878839faadb4f1a6dd72c3179c1df56787] Linux 6.18-rc1
>>> # good: [e5f0a698b34ed76002dc5cff3804a61c80233a7a] Linux 6.17
>>> git bisect start 'v6.18-rc1' 'v6.17'
>>> # bad: [58809f614e0e3f4e12b489bddf680bfeb31c0a20] Merge tag 'drm-next-2025-10-01' of https://gitlab.freedesktop.org/drm/kernel
>>> git bisect bad 58809f614e0e3f4e12b489bddf680bfeb31c0a20
>>> # bad: [a8253f807760e9c80eada9e5354e1240ccf325f9] Merge tag 'soc-newsoc-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
>>> git bisect bad a8253f807760e9c80eada9e5354e1240ccf325f9
>>> # bad: [4b81e2eb9e4db8f6094c077d0c8b27c264901c1b] Merge tag 'timers-vdso-2025-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
>>> git bisect bad 4b81e2eb9e4db8f6094c077d0c8b27c264901c1b
>>> # bad: [f1004b2f19d7e9add9d707f64d9fcbc50f67921b] Merge tag 'm68k-for-v6.18-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
>>> git bisect bad f1004b2f19d7e9add9d707f64d9fcbc50f67921b
>>> # good: [a9401710a5f5681abd2a6f21f9e76bc9f2e81891] Merge tag 'v6.18-rc-part1-smb3-common' of git://git.samba.org/ksmbd
>>> git bisect good a9401710a5f5681abd2a6f21f9e76bc9f2e81891
>>> # good: [fe68bb2861808ed5c48d399bd7e670ab76829d55] Merge tag 'microblaze-v6.18' of git://git.monstr.eu/linux-2.6-microblaze
>>> git bisect good fe68bb2861808ed5c48d399bd7e670ab76829d55
>>> # bad: [f2d64a22faeeecff385b4c91fab5fe036ab00162] Merge branch 'for-next/perf' into for-next/core
>>> git bisect bad f2d64a22faeeecff385b4c91fab5fe036ab00162
>>> # good: [30f9386820cddbba59b48ae0670c3a1646dd440e] Merge branch 'for-next/misc' into for-next/core
>>> git bisect good 30f9386820cddbba59b48ae0670c3a1646dd440e
>>> # good: [43de0ac332b815cf56dbdce63687de9acfd35d49] drivers/perf: hisi: Relax the event ID check in the framework
>>> git bisect good 43de0ac332b815cf56dbdce63687de9acfd35d49
>>> # good: [5973a62efa34c80c9a4e5eac1fca6f6209b902af] arm64: map [_text, _stext) virtual address range non-executable+read-only
>>> git bisect good 5973a62efa34c80c9a4e5eac1fca6f6209b902af
>>> # good: [b3abb08d6f628a76c36bf7da9508e1a67bf186a0] drivers/perf: hisi: Refactor the event configuration of L3C PMU
>>> git bisect good b3abb08d6f628a76c36bf7da9508e1a67bf186a0
>>> # good: [6d2f913fda5683fbd4c3580262e10386c1263dfb] Documentation: hisi-pmu: Add introduction to HiSilicon V3 PMU
>>> git bisect good 6d2f913fda5683fbd4c3580262e10386c1263dfb
>>> # good: [2084660ad288c998b6f0c885e266deb364f65fba] perf/dwc_pcie: Fix use of uninitialized variable
>>> git bisect good 2084660ad288c998b6f0c885e266deb364f65fba
>>> # bad: [77dfca70baefcb988318a72fe69eb99f6dabbbb1] Merge branch 'for-next/mm' into for-next/core
>>> git bisect bad 77dfca70baefcb988318a72fe69eb99f6dabbbb1
>>> # first bad commit: [77dfca70baefcb988318a72fe69eb99f6dabbbb1] Merge branch 'for-next/mm' into for-next/core
>>>
>>> ---
>>> bisect into branch:
>>>
>>> - git checkout -b testing 77dfca70baefcb988318a72fe69eb99f6dabbbb1
>>> - git rebase 77dfca70baefcb988318a72fe69eb99f6dabbbb1~1
>>>    [ fix minor conflict similar to the conflict resolution in 77dfca70baefc]
>>> - git diff 77dfca70baefcb988318a72fe69eb99f6dabbbb1
>>>    [ confirmed that there are no differences ]
>>> - confirm that the problem is still seen at the tip of the rebase
>>> - git bisect start HEAD 77dfca70baefcb988318a72fe69eb99f6dabbbb1~1
>>> - run bisect
>>>
>>> Results:
>>>
>>> # bad: [47fc25df1ae3ae8412f1b812fb586c714d04a5e6] arm64: map [_text, _stext) virtual address range non-executable+read-only
>>> # good: [30f9386820cddbba59b48ae0670c3a1646dd440e] Merge branch 'for-next/misc' into for-next/core
>>> git bisect start 'HEAD' '77dfca70baefcb988318a72fe69eb99f6dabbbb1~1'
>>> # good: [805491d19fc21271b5c27f4602f8f66b625c110f] arm64/Kconfig: Remove CONFIG_RODATA_FULL_DEFAULT_ENABLED
>>> git bisect good 805491d19fc21271b5c27f4602f8f66b625c110f
>>> # bad: [13c7d7426232cc4489df7cd2e1f646a22d3f6172] arm64: mm: support large block mapping when rodata=full
>>> git bisect bad 13c7d7426232cc4489df7cd2e1f646a22d3f6172
>>> # good: [a4d9c67e503f2b73c2d89d8e8209dfd241bdc8d8] arm64: Enable permission change on arm64 kernel block mappings
>>> git bisect good a4d9c67e503f2b73c2d89d8e8209dfd241bdc8d8
>>> # first bad commit: [13c7d7426232cc4489df7cd2e1f646a22d3f6172] arm64: mm: support large block mapping when rodata=full

