[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241128162618.GA3653@willie-the-truck>
Date: Thu, 28 Nov 2024 16:26:19 +0000
From: Will Deacon <will@...nel.org>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: LKML <linux-kernel@...r.kernel.org>,
Catalin Marinas <catalin.marinas@....com>,
Marc Zyngier <maz@...nel.org>,
Oliver Upton <oliver.upton@...ux.dev>,
Ard Biesheuvel <ardb@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH 11/21] arm64: Keep first mismatched 32bits el0 capable
CPU online through its callbacks
Hi Frederic,
On Tue, Nov 12, 2024 at 03:22:35PM +0100, Frederic Weisbecker wrote:
> The first mismatched 32bits el0 capable CPU is designated as the last
> resort CPU for compat 32 bits tasks. As such this CPU is forbidden to
> go offline.
>
> However this restriction is applied to the device object of the CPU,
> which is not easy to revert later if needed because other components may
> have forbidden the target to be offline and they are not tracked.
>
> But the task cpu possible mask is going to be made aware of housekeeping
> CPUs. In that context, a better 32 bits el0 last resort CPU may be found
> later on boot. When that happens, the old fallback can be made
> offlineable again.
>
> To make this possible and more flexible, drive the offlineable decision
> from the cpuhotplug callbacks themselves.
Sadly, I don't think this can work.
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index 718728a85430..53ee8ce38d5b 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -3591,15 +3591,15 @@ void __init setup_user_features(void)
> minsigstksz_setup();
> }
>
> -static int enable_mismatched_32bit_el0(unsigned int cpu)
> -{
> - /*
> - * The first 32-bit-capable CPU we detected and so can no longer
> - * be offlined by userspace. -1 indicates we haven't yet onlined
> - * a 32-bit-capable CPU.
> - */
> - static int lucky_winner = -1;
> +/*
> + * The first 32-bit-capable CPU we detected and so can no longer
> + * be offlined by userspace. -1 indicates we haven't yet onlined
> + * a 32-bit-capable CPU.
> + */
> +static int cpu_32bit_unofflineable = -1;
>
> +static int mismatched_32bit_el0_online(unsigned int cpu)
> +{
> struct cpuinfo_arm64 *info = &per_cpu(cpu_data, cpu);
> bool cpu_32bit = id_aa64pfr0_32bit_el0(info->reg_id_aa64pfr0);
>
> @@ -3611,7 +3611,7 @@ static int enable_mismatched_32bit_el0(unsigned int cpu)
> if (cpumask_test_cpu(0, cpu_32bit_el0_mask) == cpu_32bit)
> return 0;
>
> - if (lucky_winner >= 0)
> + if (cpu_32bit_unofflineable < 0)
nit: I think you meant '>=' here, but it doesn't matter. See below..
> +static int mismatched_32bit_el0_offline(unsigned int cpu)
> +{
> + return cpu == cpu_32bit_unofflineable ? -EBUSY : 0;
> +}
I think this is far too late. The reason we prevent hot-unplug of the
last 32-bit CPU is because 32-bit tasks need somewhere to run. By the
time our offline notifier runs, those tasks have already been migrated.
On my setup, this explodes because that migration fails (as expected):
[ 125.954586] ------------[ cut here ]------------
[ 125.955661] kernel BUG at kernel/sched/core.c:3501!
[ 125.957585] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[ 125.959371] Modules linked in:
[ 125.961850] CPU: 2 UID: 0 PID: 27 Comm: migration/2 Not tainted 6.12.0-00001-ge7689036c862-dirty #10
[ 125.963711] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
[ 125.964859] Stopper: __balance_push_cpu_stop+0x0/0x134 <- balance_push+0x118/0x1ac
[ 125.968507] pstate: 614000c9 (nZCv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 125.969648] pc : select_fallback_rq+0x2f4/0x2f8
[ 125.970477] lr : select_fallback_rq+0x198/0x2f8
[ 125.971273] sp : ffff800080153d20
[ 125.971897] x29: ffff800080153d30 x28: 0000000000000002 x27: ffffddbd31d79d30
[ 125.973416] x26: 0000000000000001 x25: ffffddbd31d79c70 x24: 0000000000000008
[ 125.974436] x23: ffffddbd31d79000 x22: ffffddbd31a77190 x21: 0000000000000004
[ 125.975452] x20: ffff0000c506a280 x19: 0000000000000000 x18: ffffddbd30fb1df0
[ 125.976467] x17: 0000000000000000 x16: 0000000000000001 x15: 000000000018132d
[ 125.977490] x14: 00000000ffffffe0 x13: 0000040000000000 x12: 00000c0000000000
[ 125.978499] x11: 000000002ae00002 x10: 000000000000000c x9 : 0000000000000040
[ 125.979507] x8 : 0000000000000000 x7 : 000000000000000c x6 : 000000000000000c
[ 125.980501] x5 : ffff0000c506a578 x4 : ffffddbd2fe53eb0 x3 : 0000000000000010
[ 125.981508] x2 : 0000000000000004 x1 : 0000000000000004 x0 : 0000000000000004
[ 125.982671] Call trace:
[ 125.983065] select_fallback_rq+0x2f4/0x2f8
[ 125.983550] __balance_push_cpu_stop+0x94/0x134
[ 125.983983] cpu_stopper_thread+0xbc/0x174
[ 125.984352] smpboot_thread_fn+0x1e4/0x24c
[ 125.984732] kthread+0xfc/0x184
[ 125.985065] ret_from_fork+0x10/0x20
[ 125.985741] Code: 9000d9c0 91306000 9441667a 17ffffef (d4210000)
[ 125.986445] ---[ end trace 0000000000000000 ]---
As I mentioned before, I think we need to turn this the other way around
so that the non-unpluggable 32-bit core is forced to be a housekeeping]
CPU. You said it was hard to revert CPUs from being treated as nohz_full,
but I'm wondering whether we can prevent it from being added in the first
place. The arch code has fingers in all the early boot pies.
Yet another option (which I'm not hugely fond of, but may be ok) would
be to treat 32-bit-capable nohz_full CPUs as being 64-bit-only when
'allow_mismatched_32bit_el0' is enabled (and documenting this as a
limitation).
Will
Powered by blists - more mailing lists