Date:   Thu, 21 Jan 2021 15:55:51 +0100
From:   Thomas Gleixner <tglx@...utronix.de>
To:     David Woodhouse <dwmw2@...radead.org>,
        Andy Lutomirski <luto@...nel.org>,
        "shenkai \(D\)" <shenkai8@...wei.com>,
        "Schander\, Johanna 'Mimoja' Amelie" <mimoja@...zon.com>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        X86 ML <x86@...nel.org>, "H. Peter Anvin" <hpa@...or.com>,
        hewenliang4@...wei.com, hushiyuan@...wei.com,
        luolongjun@...wei.com, hejingxian@...wei.com
Subject: Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

David,

On Tue, Jan 19 2021 at 12:12, David Woodhouse wrote:
> On Tue, 2020-12-15 at 22:20 +0100, Thomas Gleixner wrote:
> We've been playing with this a little. There's a proof-of-concept hack
> below; don't look too hard because it's only really for figuring out
> the timing etc.
>
> Basically we ripped out the 'wait' parts of the x86 do_boot_cpu() into
> a separate function do_wait_cpu(). There are four phases to the wait.
>
>  • Wait for the AP to turn up in cpu_initialized_mask, then set its bit
>    in cpu_callout_mask to allow it to run the AP thread.
>  • Wait for it to finish init and show up in cpu_callin_mask.
>  • check_tsc_sync_source()
>  • Wait for cpu_online(cpu)
>
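A side note for readers without arch/x86/kernel/smpboot.c open: those
four phases are essentially the tail end of today's do_boot_cpu() plus
native_cpu_up(), so the split looks roughly like the sketch below. This
only illustrates the shape; it is not the actual proof-of-concept, and
timeouts and error handling are omitted.

static int do_wait_cpu(unsigned int cpu)
{
        unsigned long flags;

        /* 1) Wait for the AP to mark itself initialized ... */
        while (!cpumask_test_cpu(cpu, cpu_initialized_mask))
                cpu_relax();

        /* ... then let it continue with the rest of its init. */
        cpumask_set_cpu(cpu, cpu_callout_mask);

        /* 2) Wait for the AP to finish init and call in. */
        while (!cpumask_test_cpu(cpu, cpu_callin_mask))
                schedule();

        /* 3) Check TSC synchronization against the new CPU. */
        local_irq_save(flags);
        check_tsc_sync_source(cpu);
        local_irq_restore(flags);

        /* 4) Wait for the AP to set itself online. */
        while (!cpu_online(cpu))
                cpu_relax();

        return 0;
}

Phase 3 is the check_tsc_sync_source() call that dominates the numbers
below.
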
> There's an EARLY_INIT macro which controls whether the early bringup
> call actually *does* anything, or whether it's left until bringup_cpu()
> as the current code does. It allows a simple comparison of the two.
>
> First we tested under qemu (on a Skylake EC2 c5.metal instance). The
> part of do_boot_cpu() that actually sends the IPIs took ~300k cycles by itself.
> Without EARLY_INIT we see timing for the four wait phases along the
> lines of:
>
> [    0.285312] CPU#10 up in     192950,    952898,  60014786,        28 (  61160662)
> [    0.288311] CPU#11 up in     181092,    962704,  60010432,        30 (  61154258)
> [    0.291312] CPU#12 up in     386080,    970416,  60013956,        28 (  61370480)
> [    0.294311] CPU#13 up in     372782,    964506,  60010564,        28 (  61347880)
> [    0.297312] CPU#14 up in     389602,    976280,  60013046,        28 (  61378956)
> [    0.300312] CPU#15 up in     213132,    968148,  60012138,        28 (  61193446)
>
> If we define EARLY_INIT then that first phase of waiting for the CPU
> to add itself is pretty much instantaneous, which is precisely what we
> were hoping for. We also seem to save about 300k cycles on the AP
> bringup. It's just that it *all* pales into insignificance next to
> whatever it's doing to synchronise the TSC for 60M cycles.

Yes, that's annoying, but it can be avoided. The host could tell the
guest that the TSC is perfectly synced.
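
One existing way to plumb that through is for the hypervisor platform
setup in the guest to force X86_FEATURE_TSC_RELIABLE, which should make
check_tsc_sync_source()/check_tsc_sync_target() skip the warp test
entirely. A minimal sketch, where my_hv_tsc_is_synced() stands in for
whatever enlightenment the host would expose (that helper is made up;
the cpu cap and setup_force_cpu_cap() are not):

static void __init my_hv_platform_setup(void)
{
        /*
         * If the host guarantees a synchronized TSC across all vCPUs,
         * mark it reliable so the per-CPU sync check becomes a no-op.
         */
        if (my_hv_tsc_is_synced())      /* hypothetical host query */
                setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
}

Booting the guest with tsc=reliable should have the same effect and is a
quick way to confirm that the 60M cycles disappear from the third column.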

> [    0.338829] CPU#10 up in        600,    689054,  60025522,        28 (  60715204)
> [    0.341829] CPU#11 up in        610,    635346,  60019390,        28 (  60655374)
> [    0.343829] CPU#12 up in        632,    619352,  60020728,        28 (  60640740)
> [    0.346829] CPU#13 up in        602,    514234,  60025402,        26 (  60540264)
> [    0.348830] CPU#14 up in        608,    621058,  60025952,        26 (  60647644)
> [    0.351829] CPU#15 up in        600,    624690,  60021526,       410 (  60647226)
>
> Testing on real hardware has been more interesting and less useful so
> far. We started with the CPUHP_BRINGUP_KICK_CPU state being
> *immediately* before CPUHP_BRINGUP_CPU. On my 28-thread Haswell box,
> that didn't come up at all even without actually *doing* anything in
> the pre-bringup phase. Merely bringing all the AP threads up through
> the various CPUHP_PREPARE_foo stages before actually bringing them
> online was enough to break it. I have no serial port on this box, so we
> haven't yet worked out why; I've resorted to putting the
> CPUHP_BRINGUP_KICK_CPU state before CPUHP_WORKQUEUE_PREP instead.

Hrm.
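
For concreteness, the new state would be an entry in the
cpuhp_hp_states[] table in kernel/cpu.c along these lines.
CPUHP_BRINGUP_KICK_CPU is the name from the mail; kick_cpu() and
"cpu:kick" are made up here, and where the state goes in enum
cpuhp_state is exactly the open question above.

        /* Sketch: a control-CPU step that only fires INIT/SIPI for the
         * target CPU and returns; the actual waiting stays in
         * bringup_cpu() later in the sequence. */
        [CPUHP_BRINGUP_KICK_CPU] = {
                .name                   = "cpu:kick",
                .startup.single         = kick_cpu,
                .teardown.single        = NULL,
                .cant_stop              = true,
        },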

> That lets it boot without the EARLY_INIT at least (so it's basically a
> no-op), and I get these timings. Looks like there's 3-4M cycles to be
> had by the parallel SIPI/INIT, but the *first* thread of each core is
> also taking another 8M cycles and it might be worth doing *those* in
> parallel too. And Thomas, I think the waiting for the AP to bring
> itself up is the part you meant was pointlessly reimplemented
> differently across architectures? So the way forward there might be
> to offer a generic CPUHP state for that, for architectures to plug into
> and ditch their own tracking.

Yes. The whole wait for alive and callin and online can be generic.
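
Roughly, the generic code could own the wait loop and ask the
architecture where the AP has got to, something like the sketch below.
Every name in it (arch_cpuhp_sync_state(), enum cpu_sync_state) is
invented purely to show the shape; x86's cpu_initialized_mask /
cpu_callin_mask bookkeeping could then hide behind that one hook.

/* Sketch of a generic bringup-wait step; not an existing kernel API. */
enum cpu_sync_state {
        CPU_SYNC_ALIVE,         /* AP has entered the kernel */
        CPU_SYNC_CALLIN,        /* AP finished its early per-CPU init */
        CPU_SYNC_ONLINE,        /* AP has set itself online */
};

/* Provided by the architecture: has @cpu reached @state yet? */
bool arch_cpuhp_sync_state(unsigned int cpu, enum cpu_sync_state state);

static int cpuhp_wait_for_ap(unsigned int cpu)
{
        enum cpu_sync_state state;

        for (state = CPU_SYNC_ALIVE; state <= CPU_SYNC_ONLINE; state++) {
                while (!arch_cpuhp_sync_state(cpu, state))
                        cpu_relax();
        }
        return 0;
}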

> When I enabled EARLY_INIT it didn't boot; I need to hook up some box
> with a serial port to make more meaningful progress there, but figured
> it was worth sharing the findings so far.
>
> Here's the hack we're testing with, for reference. It's kind of ugly
> but you can see where it's going. Note that the CMOS mangling for the
> warm reset vector is going to need to be lifted out of the per-cpu
> loop, and done *once* at startup and torn down once in smp_cpus_done.
> Except that it also needs to be done before/after a hotplug cpu up;
> we'll have to come back to that but we've just shifted it to
> native_smp_cpus_done() for testing for now.

Right. It's at least a start.
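
On the warm reset vector: the helpers for that already exist in
smpboot.c, so lifting the CMOS mangling out of the per-cpu loop should
boil down to roughly the two calls below, done once around the whole
bringup instead of per CPU. Where exactly to restore it so that CPU
hotplug keeps working is the part still to be sorted out.

        /* Once, before any AP is kicked (e.g. in native_smp_prepare_cpus()):
         * point the BIOS warm-reset path at the SMP trampoline. */
        smpboot_setup_warm_reset_vector(real_mode_header->trampoline_start);

        /* ... INIT/SIPI and bring up all APs ... */

        /* Once, after bringup is done (the PoC currently does this in
         * native_smp_cpus_done()): undo the CMOS / 0x467 mangling. */
        smpboot_restore_warm_reset_vector();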

Thanks,

        tglx

