Message-ID: <97e1f2aa-d823-1aea-a41f-8024ba5075aa@bytedance.com>
Date: Thu, 23 Feb 2023 19:24:26 +0000
From: Usama Arif <usama.arif@...edance.com>
To: David Woodhouse <dwmw2@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>, kim.phillips@....com
Cc: arjan@...ux.intel.com, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, x86@...nel.org,
pbonzini@...hat.com, paulmck@...nel.org,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
rcu@...r.kernel.org, mimoja@...oja.de, hewenliang4@...wei.com,
thomas.lendacky@....com, seanjc@...gle.com, pmenzel@...gen.mpg.de,
fam.zheng@...edance.com, punit.agrawal@...edance.com,
simon.evans@...edance.com, liangma@...ngbit.com
Subject: Re: [External] Re: [PATCH v9 0/8] Parallel CPU bringup for x86_64
On 23/02/2023 11:07, David Woodhouse wrote:
> On Wed, 2023-02-22 at 17:42 +0100, Thomas Gleixner wrote:
>> David!
>>
>> On Wed, Feb 22 2023 at 10:11, David Woodhouse wrote:
>>> On Wed, 2023-02-15 at 14:54 +0000, Usama Arif wrote:
>>> So the next thing that might be worth looking at is allowing the APs
>>> all to be running their hotplug thread simultaneously, bringing
>>> themselves from CPUHP_BRINGUP_CPU to CPUHP_AP_ONLINE. This series eats
>>> the initial INIT/SIPI/SIPI latency, but if there's any significant time
>>> in the AP hotplug thread, that could be worth parallelising.
>>
>> On a 112 CPU machine (56 cores, HT enabled) the bringup takes
>>
>> Setup and SIPIs sent: 49 ms
>> Bringup each CPU: 516 ms
>>
>> That's about 500 ms faster than a non-parallel bringup!
>>
>> Now looking at the 516 ms, which is ~4.7 ms/CPU. The vast majority of the
>> time is spent on the APs in
>>
>> cpu_init() -> ucode_cpu_init()
>>
>> for the primary threads of each core. The secondary threads are quickly
>> (1us) out of ucode_cpu_init() because the primary thread already loaded
>> it.
>>
>> A microcode load on that machine takes ~7.5 ms per primary thread on
>> average which sums up to 7.5 * 55 = 412.5 ms
>>
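As a quick sanity check of the microcode figure quoted above, here is a small sketch. The 7.5 ms, 112 CPU and 516 ms figures come from the thread; the assumption (mine, not stated by Thomas) is two threads per core and that the boot CPU's core needs no AP-side load, which is what yields 55 primary threads.

```python
# Figures quoted in the thread.
total_cpus = 112
ucode_ms_per_primary = 7.5
bringup_ms = 516

# Assumption: 2 threads per core, and the boot CPU's core is not counted.
threads_per_core = 2
cores = total_cpus // threads_per_core   # 56
primary_aps = cores - 1                  # 55

ucode_total_ms = ucode_ms_per_primary * primary_aps
print(f"microcode total: {ucode_total_ms} ms")
print(f"share of AP bringup: {ucode_total_ms / bringup_ms:.0%}")
```
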
>> The threaded bringup after CPUHP_AP_ONLINE takes about 100us per CPU.
>
> Nice analysis; thanks!
>
>> identify_secondary_cpu() is one of the longer functions which takes
>> ~125us / CPU summing up to 13ms
>
> Hm, shouldn't that one already be parallelised by my 'part 2' patch?
>
> It's called from smp_store_cpu_info(), from smp_callin(), which is
> called from somewhere in the middle of start_secondary().
>
> And if the comments I helpfully added to that function for the benefit
> of our future selves are telling the truth, the AP is free to get that
> far once the BSP has set its bit in cpu_callout_mask, which happens in
> do_wait_cpu_initialized().
>
> So
> https://git.infradead.org/users/dwmw2/linux.git/commitdiff/4b5731e05b0#patch3
> ought to parallelise that. But Usama empirically reported that 'part 2'
> didn't add any noticeable benefit, not even those 13ms? On a *larger*
> machine.
>
I am using a machine similar to Thomas', with 128 CPUs (64 cores, HT
enabled). I have the microcode config disabled, so I guess my numbers are
comparable to Thomas' without the microcode load, i.e. ~100 ms (516 - 412).
I do see a difference of ~3 ms with part 2, which I thought was within the
margin of error for the measurement, but I guess it isn't. After seeing the
~70 ms that is cut by reusing the timer calibration, I didn't focus much on
part 2. I guess that ~70 ms is the "rest" from Thomas' table below?

Thanks,
Usama
>
>> The TSC sync check for the first CPU on the second socket consumes
>> 20ms. That's only once per socket, intra socket is using MSR_TSC_ADJUST,
>> which is more or less free.
>>
>> So the 516 ms are wasted here:
>>
>>     total                       516 ms
>>     ucode_cpu_init()            412 ms
>>     identify_secondary_cpu()     13 ms
>>     2ndsocket_tsc_sync           20 ms
>>     threaded bringup             12 ms
>>     rest                         59 ms
>>
>> So the rest is about 530us per CPU, which is just the sum of many small
>> functions, lock contentions...
>>
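The breakdown above can be cross-checked with a few lines. This is only a sketch over the numbers quoted in the thread; the one assumption is that the per-CPU averages are taken over the 111 APs (112 CPUs minus the boot CPU).

```python
# Breakdown as quoted in the thread, in milliseconds.
breakdown_ms = {
    "ucode_cpu_init()": 412,
    "identify_secondary_cpu()": 13,
    "2ndsocket_tsc_sync": 20,
    "threaded bringup": 12,
    "rest": 59,
}
total_ms = 516
aps = 112 - 1  # assumption: every CPU except the boot CPU is brought up

# The individual items should add up to the reported total.
items_sum_ms = sum(breakdown_ms.values())

# Per-CPU averages, to compare with "~4.7 ms/CPU" and "about 530us per CPU".
per_cpu_ms = total_ms / aps
rest_us_per_cpu = breakdown_ms["rest"] * 1000 / aps

print(f"sum of items: {items_sum_ms} ms")
print(f"per CPU: {per_cpu_ms:.2f} ms, rest per CPU: {rest_us_per_cpu:.0f} us")
```
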
>> Getting rid of the microcode overhead is possible. There is no reason
>> to serialize that between the cores. But it needs serialization vs. HT
>> siblings, which requires moving identify_secondary_cpu() and its caller
>> smp_store_cpu_info() ahead of the synchronization point and then have
>> serialization between the siblings. That's going to be a major surgery
>> and inspection effort to ensure that there are no hidden assumptions
>> about global hotplug serialization.
>>
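The idea of serialising only between HT siblings, while letting cores proceed in parallel, can be illustrated with a toy model. This is purely a sketch and not kernel code: the function name, the per-core lock list and the "loaded" flag are all hypothetical, and Python threads merely stand in for APs.

```python
import threading

def load_microcode_parallel(num_cores, threads_per_core=2):
    """Toy model: cores load in parallel, siblings are serialized per core."""
    core_locks = [threading.Lock() for _ in range(num_cores)]
    loaded = [False] * num_cores   # per-core "microcode already loaded" flag
    loads = []                     # (core, thread) pairs that did a real load
    loads_lock = threading.Lock()

    def ap_entry(core, thread):
        # Siblings of one core contend only on that core's lock;
        # different cores never wait for each other.
        with core_locks[core]:
            if not loaded[core]:
                loaded[core] = True   # stands in for the expensive load
                with loads_lock:
                    loads.append((core, thread))

    workers = [threading.Thread(target=ap_entry, args=(c, t))
               for c in range(num_cores) for t in range(threads_per_core)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return loads

loads = load_microcode_parallel(num_cores=8)
print(f"{len(loads)} loads for 8 cores")
```

Whichever sibling reaches the lock first does the load and the other returns almost immediately, which mirrors the "primary thread already loaded it" behaviour described earlier in the thread.
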
>> So that would cut the total cost down to ~100ms plus the
>> preparatory/SIPI stage of 60ms which sums up to about 160ms and about
>> 1.5ms per CPU total.
>>
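The projected total works out as follows. This is a sketch using only the figures quoted in the thread (516 ms, 412 ms, the 60 ms preparatory/SIPI stage and 112 CPUs); the rounding to "~100ms" and "about 160ms" is Thomas', and the variable names are mine.

```python
total_ms = 516       # measured AP bringup
ucode_ms = 412       # microcode share that could be parallelised away
prep_sipi_ms = 60    # preparatory/SIPI stage as used in the estimate
cpus = 112

after_ucode_fix_ms = total_ms - ucode_ms            # "~100ms" in the thread
projected_ms = after_ucode_fix_ms + prep_sipi_ms    # "about 160ms"
per_cpu_ms = projected_ms / cpus                    # "about 1.5ms per CPU"

print(f"remaining bringup: {after_ucode_fix_ms} ms")
print(f"projected total: {projected_ms} ms ({per_cpu_ms:.2f} ms/CPU)")
```
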
>> Further optimization starts to be questionable IMO. It's surely possible
>> somehow, but then you really have to go and inspect each and every
>> function in those code paths, add local locking, etc. Not to talk about
>> the required mess in the core code to support that.
>>
>> The low hanging fruit which brings most is the identification/topology
>> muck and the microcode loading. That needs to be addressed first anyway.
>
> Agreed, thanks.
>