linux-kernel - Re: [PATCH] powerpc/smp: poll cpu_callin_map more aggressively in __cpu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <87letfmk8e.fsf@linux.ibm.com>
Date:   Wed, 29 Jun 2022 12:51:13 -0500
From:   Nathan Lynch <nathanl@...ux.ibm.com>
To:     Michael Ellerman <mpe@...erman.id.au>
Cc:     linux-kernel@...r.kernel.org, npiggin@...il.com,
        brking@...ux.ibm.com, srikar@...ux.vnet.ibm.com,
        linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCH] powerpc/smp: poll cpu_callin_map more aggressively in
 __cpu_up()

Michael Ellerman <mpe@...erman.id.au> writes:

> Nathan Lynch <nathanl@...ux.ibm.com> writes:
>> Replace the outdated iteration and timeout calculations here with
>> indefinite spin_until_cond()-wrapped poll of cpu_callin_map. __cpu_up()
>> already does this when waiting for the cpu to set its online bit before
>> returning, so this change is not really making the function more brittle.
>
> I'm not sure I agree that this doesn't make the code more brittle.
>
> The existing indefinite wait you mention is later in the function, and
> happens after the CPU has successfully come into the kernel.
>
> I think it's more common that a stuck/borked CPU doesn't come into the
> kernel at all, rather than comes in and then fails to online.
>
> So I think the bail out when the CPU fails to call in is useful, I would
> guess I see that "Processor x is stuck" message multiple times a year
> while debugging various things.

Yeah I can see how my claim is too strong here.

>> Removing the msleep(1) in the hotplug path here reduces the time it takes
>> to online a CPU on a P9 PowerVM LPAR from about 30ms to 1ms when exercised
>> via thaw_secondary_cpus().
>
> That is a nice improvement.
>
> Can we do something that returns quickly in the happy case and still has
> a timeout when things go wrong? Seems like a busy loop with a
> time_after() check would do the trick.

Yes, I'll rework it like that. Thanks.