linux-kernel - Re: [PATCH] arm64: clear_user: align __arch_clear

Message-ID: <CAL2CeBzYkr-i_WTF_pF9QrqO4K8BE+Z7D8tUY_M-HJGwZeSUUw@mail.gmail.com>
Date: Mon, 24 Nov 2025 22:45:01 -0500
From: Luke Yang <luyang@...hat.com>
To: Will Deacon <will@...nel.org>
Cc: Catalin Marinas <catalin.marinas@....com>, Jirka Hladky <jhladky@...hat.com>, 
	linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org, 
	Joe Mario <jmario@...hat.com>
Subject: Re: [PATCH] arm64: clear_user: align __arch_clear_user() to 128B for
 I-cache efficiency

On Mon, 24 Nov 2025 13:38:25 +0000, Will Deacon wrote:
> Hmm, but what's special about __arch_clear_user()? If we make this
> change, anybody could surely make similar arguments for other
> functions on their hot paths?

Hi Will,

Thanks for the feedback.

I agree that the precedent question matters. In this case, though, the
irqbypass change introduced roughly a 30% regression in /dev/zero read
throughput. That is a fundamental primitive that many workloads rely on,
and the regression stems from an unintended shift of __arch_clear_user()
so that its tight zeroing loop now crosses an I-cache line boundary. This has
also persisted as a deterministic change in performance in all
subsequent kernel builds we have tested since its appearance in 6.17.

The proposed ".p2align 6" is not adding a new micro-optimisation. It
restores the previous instruction-cache locality that the function had
before the irqbypass reshuffle. The cost is very small (at most 63 bytes
of padding in one place), and the bar for applying this kind of fix is
correspondingly high: did an unrelated change cause a significant
performance regression in a widely used core primitive?
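
As a rough illustration of the arithmetic only: the sketch below assumes a
64-byte I-cache line (matching ".p2align 6", i.e. 2**6) and uses a made-up
address and loop size, not values taken from a real kernel build. The helper
names are hypothetical.

```python
LINE = 64  # assumed I-cache line size in bytes; equals 2**6 as in ".p2align 6"

def padding_for(addr, align=LINE):
    """Bytes of padding needed to round addr up to a multiple of align."""
    return (-addr) % align

def lines_touched(start, size, line=LINE):
    """Number of cache lines covered by the byte range [start, start + size)."""
    return (start + size - 1) // line - start // line + 1

# Hypothetical placement: a 32-byte loop that starts 48 bytes into a line.
loop_start, loop_size = 0x1030, 32
print(lines_touched(loop_start, loop_size))   # the loop straddles two lines

# After rounding the start up to the next 64-byte boundary it fits in one.
aligned = loop_start + padding_for(loop_start)
print(padding_for(loop_start), lines_touched(aligned, loop_size))
```

Note that the directive aligns the function entry, not the loop itself, so
whether the loop ends up within a single line still depends on its offset
inside the function.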

I am open to any solution that reliably restores the lost IPC, so please
let me know if you have something else in mind.

Thanks,
Luke