linux-kernel - Re: [patch 00/37] cpu/hotplug, x86: Reworked parallel CPU bringup

Message-ID: <cda9bf38-0c40-4658-65aa-fbca1b3577e8@suse.com>
Date:   Mon, 17 Apr 2023 10:35:41 +0200
From:   Juergen Gross <jgross@...e.com>
To:     Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>
Cc:     x86@...nel.org, David Woodhouse <dwmw@...radead.org>,
        Andrew Cooper <andrew.cooper3@...rix.com>,
        Brian Gerst <brgerst@...il.com>,
        Arjan van de Veen <arjan@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Paul McKenney <paulmck@...nel.org>,
        Tom Lendacky <thomas.lendacky@....com>,
        Sean Christopherson <seanjc@...gle.com>,
        Oleksandr Natalenko <oleksandr@...alenko.name>,
        Paul Menzel <pmenzel@...gen.mpg.de>,
        "Guilherme G. Piccoli" <gpiccoli@...lia.com>,
        Piotr Gorski <lucjan.lucjanov@...il.com>,
        David Woodhouse <dwmw@...zon.co.uk>,
        Usama Arif <usama.arif@...edance.com>,
        Boris Ostrovsky <boris.ostrovsky@...cle.com>,
        xen-devel@...ts.xenproject.org,
        Russell King <linux@...linux.org.uk>,
        Arnd Bergmann <arnd@...db.de>,
        linux-arm-kernel@...ts.infradead.org,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>, Guo Ren <guoren@...nel.org>,
        linux-csky@...r.kernel.org,
        Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
        linux-mips@...r.kernel.org,
        "James E.J. Bottomley" <James.Bottomley@...senPartnership.com>,
        Helge Deller <deller@....de>, linux-parisc@...r.kernel.org,
        Paul Walmsley <paul.walmsley@...ive.com>,
        Palmer Dabbelt <palmer@...belt.com>,
        linux-riscv@...ts.infradead.org,
        Mark Rutland <mark.rutland@....com>,
        Sabin Rapan <sabrapan@...zon.com>
Subject: Re: [patch 00/37] cpu/hotplug, x86: Reworked parallel CPU bringup

On 15.04.23 01:44, Thomas Gleixner wrote:
> Hi!
> 
> This is a complete rework of the parallel bringup patch series (V17)
> 
>      https://lore.kernel.org/lkml/20230328195758.1049469-1-usama.arif@bytedance.com
> 
> to address the issues which were discovered in review:
> 
>   1) The X86 microcode loader serialization requirement
> 
>      https://lore.kernel.org/lkml/87v8iirxun.ffs@tglx
> 
>      Microcode loading on HT enabled X86 CPUs requires that the microcode is
>      loaded on the primary thread. The sibling thread(s) must be in
>      quiescent state; either looping in a place which is aware of potential
>      changes by the microcode update (see late loading) or in fully quiescent
>      state, i.e. waiting for INIT/SIPI.
> 
>      This is required by hardware/firmware on Intel. Aside from that it's a
>      vendor independent software correctness issue. Assume the following
>      sequence:
> 
>      CPU1.0	  	      CPU1.1
>      			      CPUID($A)
>      Load microcode.
>      Changes CPUID($A, $B)
>      			      CPUID($B)
> 
>      CPU1.1 makes a decision on $A and $B which might be inconsistent due
>      to the microcode update.
> 
>      The solution for this is to bring up the primary threads first and after
>      that the siblings. Loading microcode on the siblings is a NOOP on Intel
>      and on AMD it is guaranteed to only modify thread local state.
> 
>      This ensures that the APs can load microcode before reaching the alive
>      synchronization point w/o doing any further x86 specific
>      synchronization between the core siblings.
> 
>   2) The general design issues discussed in V16
> 
>      https://lore.kernel.org/lkml/87pm8y6yme.ffs@tglx
> 
>      The previous parallel bringup patches just glued this mechanism into
>      the existing code without a deeper analysis of the synchronization
>      mechanisms and without generalizing it so that the control logic is
>      mostly in the core code and not made an architecture specific tinker
>      space.
> 
>      Much of that had been pointed out 2 years ago in the discussions about
>      the early versions of parallel bringup already.
> 
> 
> The series is based on:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip x86/apic
> 
> and also available from git:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git hotplug
> 
> 
> Background
> ----------
> 
> The reason why people are interested in parallel bringup is to shorten
> the (kexec) reboot time of cloud servers to reduce the downtime of the
> VM tenants. There are obviously other interesting use cases for this
> like VM startup time, embedded devices...
> 
> The current fully serialized bringup does the following per AP:
> 
>      1) Prepare callbacks (allocate, initialize, create threads)
>      2) Kick the AP alive (e.g. INIT/SIPI on x86)
>      3) Wait for the AP to report alive state
>      4) Let the AP continue through the atomic bringup
>      5) Let the AP run the threaded bringup to full online state
> 
> There are two significant delays:
> 
>      #3 The time for an AP to report alive state in start_secondary() on x86
>         has been measured in the range between 350us and 3.5ms depending on
>         vendor and CPU type, BIOS microcode size etc.
> 
>      #4 The atomic bringup does the microcode update. This has been measured
>         to take up to ~8ms on the primary threads depending on the microcode
>         patch size to apply.
> 
> On a two socket SKL server with 56 cores (112 threads) the boot CPU spends
> on current mainline about 800ms busy waiting for the APs to come up and
> apply microcode. That's more than 80% of the actual onlining procedure.
> 
> By splitting the actual bringup mechanism into two parts this can be
> reduced to waiting for the first AP to report alive or if the system is
> large enough the first AP is already waiting when the boot CPU finished the
> wake-up of the last AP.
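
A rough way to picture the effect of the split is a toy model of the time the
control CPU spends blocked in step #3. This is only a sketch under stated
assumptions: every AP has the same alive latency, the per-AP kick cost
(kick_us) is a made-up parameter, and the microcode update of step #4 is
ignored. The function names are hypothetical and not taken from the series.

```python
def serialized_wait_us(n_aps, alive_us):
    """Control CPU busy-waits for every AP's alive report in turn."""
    return n_aps * alive_us


def split_wait_us(n_aps, alive_us, kick_us):
    """Control CPU kicks all APs first, then waits for the first one.

    While the remaining (n_aps - 1) kicks are issued, the first AP is
    already starting up, so only the residual latency is spent waiting.
    """
    elapsed_since_first_kick = (n_aps - 1) * kick_us
    return max(0, alive_us - elapsed_since_first_kick)


if __name__ == "__main__":
    # alive_us = 3500 is the upper end of the range quoted above;
    # kick_us = 50 is an invented figure for illustration only.
    for n in (8, 111):
        print(n, serialized_wait_us(n, 3500), split_wait_us(n, 3500, 50))
```

With enough APs the residual wait drops to zero in this model, which is the
"first AP is already waiting" case described above.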
> 
> 
> The actual solution comes in several parts
> ------------------------------------------
> 
>   1) [P 1-2] General cleanups (init annotations, kernel doc...)
> 
>   2) [P 3] The obvious
> 
>      Avoid pointless delay calibration when TSC is synchronized across
>      sockets. That removes a whopping 100ms delay for the first CPU of a
>      socket. This is an improvement independent of parallel bringup and had
>      been discussed two years ago already.
> 
>   2) [P 3-6] Removal of the CPU0 hotplug hack.
> 
>      This was added 11 years ago with the promise to make this a real
>      hardware mechanism, but that never materialized. As physical CPU
>      hotplug is not really supported and the physical unplugging of CPU0
>      never materialized there is no reason to keep this cruft around. It's
>      just maintenance ballast for no value and the removal makes
>      implementing the parallel bringup feature way simpler.
> 
>   3) [P 7-16] Cleanup of the existing bringup mechanism:
> 
>       a) Code reorganisation so that the general hotplug specific code is
>          in smpboot.c and not sprinkled all over the place
> 
>       b) Decouple MTRR/PAT initialization from smp_callout_mask to prepare
>          for replacing that mask with a hotplug core code synchronization
>          mechanism.
> 
>       c) Make TSC synchronization function call based so that the control CPU
>          does not have to busy wait for nothing if synchronization is not
>          required.
> 
>       d) Remove the smp_callin_mask synchronization point as it's no longer
>          required due to #3c.
> 
>       e) Rework the sparse_irq_lock held region in the core code so that the
>          next polling synchronization point in the x86 code can be removed too.
> 
>       f) Due to #3e it's no longer required to spin wait for the AP to set
>          its online bit.  Remove wait_cpu_online() and the XENPV
>          counterpart. So the control CPU can directly wait for the online
>          idle completion by the AP and free the control CPU up for other
>          work.
> 
>       This reduces the synchronization points in the x86 code to one, which
>       is the AP alive one. This synchronization will be moved to core
>       infrastructure in the next section.
> 
>   4) [P 17-27] Replace the disconnected CPU state tracking
> 
>      The extra CPU state tracking which is used by a few architectures is
>      completely separate from the CPU hotplug core code.
> 
>      Replacing it with a variant integrated in the core hotplug machinery
>      allows reducing architecture specific code and provides a generic
>      synchronization mechanism for (parallel) CPU bringup/teardown.
> 
>      - Convert x86 over and replace the AP alive synchronization on x86 with
>        the core variant which removes the remaining x86 hotplug
>        synchronization masks.
> 
>      - Convert the other architectures usage and remove the old interface
>        and code.
> 
>   5) [P 28-30] Split the bringup into two steps
> 
>      First step invokes the wakeup function on the BP, e.g. SIPI/STARTUP on
>      x86. The second one waits on the BP for the AP to report alive and
>      releases it for the complete onlining.
> 
>      As the hotplug state machine allows partial bringup this allows later
>      to kick all APs alive in a first iteration and then bring them up
>      completely one by one afterwards.
> 
>   6) [P 31] Switch the primary thread detection to a cpumask
> 
>      This makes the parallel bringup a simple cpumask based mechanism
>      without tons of conditionals and checks for primary threads.
> 
>   7) [P 32] Implement the parallel bringup core code
> 
>      The parallel bringup looks like this:
>      
>        1) Bring up the primary SMT threads to the CPUHP_KICK_AP_ALIVE step
>        	 one by one
> 
>        2) Bring up the primary SMT threads to the CPUHP_ONLINE step one by
>        	 one
> 
>        3) Bring up the secondary SMT threads to the CPUHP_KICK_AP_ALIVE
>        	 step one by one
> 
>        4) Bring up the secondary SMT threads to the CPUHP_ONLINE
>        	 step one by one
> 
>      In case that SMT is not supported this is obviously reduced to step #1
>      and #2.
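
The four steps above amount to a fixed ordering over two CPU groups. A
minimal sketch of that ordering, assuming the groups are handed in as plain
lists (the series uses a cpumask for the primary threads). The state names
are taken from the text; the numeric values and the bringup_plan() helper are
placeholders for illustration, not kernel code.

```python
from enum import IntEnum


class Step(IntEnum):
    # Names follow the cover letter; the numeric values are placeholders.
    CPUHP_KICK_AP_ALIVE = 1
    CPUHP_ONLINE = 2


def bringup_plan(primary_cpus, secondary_cpus):
    """Return (cpu, target_state) pairs in the order described above."""
    plan = []
    for group in (primary_cpus, secondary_cpus):
        # First kick every CPU of the group alive, then online them fully.
        for target in (Step.CPUHP_KICK_AP_ALIVE, Step.CPUHP_ONLINE):
            for cpu in group:
                plan.append((cpu, target))
    return plan


if __name__ == "__main__":
    for cpu, target in bringup_plan([1, 2], [3, 4]):
        print(f"cpu{cpu} -> {target.name}")
```

Without SMT the secondary list is empty and the plan collapses to the first
two steps.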
> 
>   8) [P 33-37] Prepare X86 for parallel bringup and enable it
> 
> 
> Caveats
> -------
> 
> The non X86 changes have been all compile tested. Boot and runtime
> testing has only been done on a few real hardware platforms and qemu as
> available. That definitely needs some help from the people who have
> these systems at their fingertips.
> 
> 
> Results and analysis
> --------------------
> 
> Here are numbers for a dual socket SKL 56 cores/ 112 threads machine.  All
> numbers in milliseconds. The time measured is the time which the cpu_up()
> call takes for each CPU and phase. It's not exact as the system is already
> scheduling, handling interrupts and soft interrupts, which is obviously
> skewing the picture slightly.
> 
> Baseline tip tree x86/apic branch.
> 
> 		total      avg/CPU          min          max
> total  :      912.081        8.217        3.720      113.271
> 
> The max of 100ms is due to the silly delay calibration for the second
> socket which takes 100ms and was eliminated first. Also the other initial
> cleanups and improvements take some time away.
> 
> So the real baseline becomes:
> 
> 		total      avg/CPU          min          max
> total  :      785.960        7.081        3.752       36.098
> 
> The max here is on the first CPU of the second socket. 20ms of that is due
> to TSC synchronization and an extra 2ms to react on the SIPI.
> 
> With parallel bootup enabled this becomes:
> 
> 		total      avg/CPU          min          max
> prepare:       39.108        0.352        0.238        0.883
> online :       45.166        0.407        0.170       20.357
> total  :       84.274        0.759        0.408       21.240
> 
> That's a factor ~9.3 reduction on average.
> 
> Looking at the 27 primary threads of socket 0 then this becomes even more
> interesting:
> 
> 		total      avg/CPU          min          max
> total  :      325.764       12.065       11.981       14.125
> 
> versus:
> 		total      avg/CPU          min          max
> prepare:        8.945        0.331        0.238        0.834
> online :        4.830        0.179        0.170        0.212
> total  :       13.775        0.510        0.408        1.046
> 
> So the reduction factor is ~23.5 here. That's mostly because the 20ms TSC
> sync is not skewing the picture.
> 
> For all 55 primaries, i.e. with the 20ms TSC sync extra for socket 1 this
> becomes:
> 
>                  total      avg/CPU          min          max
> total  :      685.489       12.463       11.975       36.098
> 
> versus:
> 
>                  total      avg/CPU          min          max
> prepare:       19.080        0.353        0.238        0.883
> online :       30.283        0.561        0.170       20.357
> total  :       49.363        0.914        0.408       21.240
> 
> The TSC sync reduces the win to a factor of ~13.8
> 
> With 'tsc=reliable' on the command line the socket sync is disabled which
> brings it back to the socket 0 numbers:
> 
>                  total      avg/CPU          min          max
> prepare:       18.970        0.351        0.231        0.874
> online :       10.328        0.191        0.169        0.358
> total  :       29.298        0.543        0.400        1.232
> 
> Now looking at the secondary threads only:
> 
>                  total      avg/CPU          min          max
> total  :      100.471        1.794        0.375        4.745
> 
> versus:
>                  total      avg/CPU          min          max
> prepare:       19.753        0.353        0.257        0.512
> online :       14.671        0.262        0.179        3.461
> total  :       34.424        0.615        0.436        3.973
> 
> Still a factor of ~3.
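
The reduction factors can be re-derived from the "total" columns of the
tables above. A small sketch; the dictionary keys are just labels chosen
here, the numbers are copied from the tables. The ratios land close to the
quoted factors (rounding differs slightly for the two primary thread cases).

```python
# (serialized total in ms, parallel total in ms), copied from the tables
TOTALS_MS = {
    "all 111 APs": (785.960, 84.274),
    "27 primaries, socket 0": (325.764, 13.775),
    "55 primaries": (685.489, 49.363),
    "secondary threads": (100.471, 34.424),
}


def reduction_factor(serialized_ms, parallel_ms):
    """How many times faster the parallel bringup is for one data set."""
    return serialized_ms / parallel_ms


if __name__ == "__main__":
    for name, (serialized, parallel) in TOTALS_MS.items():
        print(f"{name:24s} {reduction_factor(serialized, parallel):5.1f}x")
```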
> 
> The average on the secondaries for the serialized bringup is significantly
> lower than for the primaries because the SIPI response time is shorter and
> the microcode update takes no time.
> 
> This varies wildly with the system, whether microcode in BIOS is already up
> to date, how big the microcode patch is and how long the INIT/SIPI response
> time is. On an AMD Zen3 machine INIT/SIPI response time is amazingly fast
> (350us), but then it lacks TSC_ADJUST and does a two millisecond TSC sync
> test for _every_ AP. All of this sucks...
> 
> 
> Possible further enhancements
> -----------------------------
> 
> It's definitely worthwhile to look into reducing the cross socket TSC sync
> test time. It's probably safe enough to use 5ms or even 2ms instead of 20ms
> on systems with TSC_ADJUST and a few other 'TSC is sane' indicators. Moving
> it out of the hotplug path is eventually possible, but that needs some deep
> thoughts.
> 
> Let's take the TSC sync out of the picture by adding 'tsc=reliable' to the
> kernel command line. So the bringup of 111 APs takes:
> 
>                  total      avg/CPU          min          max
> prepare:       38.936        0.351        0.231        0.874
> online :       25.231        0.227        0.169        3.465
> total  :       64.167        0.578        0.400        4.339
> 
> Some of the outliers are not necessarily in the state callbacks as the
> system is already scheduling and handles interrupts and soft
> interrupts. Haven't analyzed that yet in detail.
> 
> In the prepare stage which runs on the control CPU the larger steps are:
> 
>    smpcfd:prepare           16us  avg/CPU
>    threads:prepare          98us  avg/CPU
>    workqueue:prepare        43us  avg/CPU
>    trace/RB:prepare	  135us	 avg/CPU
> 
> The trace ringbuffer initialization allocates 354 pages and 354 control
> structures one by one. That probably should allocate a large page and an
> array of control structures and work from there. I'm sure that would reduce
> this significantly. Steven?
> 
> smpcfd does just a percpu allocation. No idea why that takes that long.
> 
> Vs. threads and workqueues. David thought about spreading out the
> preparation work and do it really in parallel. That's a nice idea, but the
> threads and workqueue prepare steps are self serializing. The workqueue one
> has a global mutex and aside of that both steps create kernel threads which
> implicitly serialize on kthreadd. alloc_percpu(), which is used by
> smpcfd:prepare, is also globally serialized.
> 
> The rest of the prepare steps is pretty much in the single digit
> microseconds range.
> 
> On the AP side it should be possible to move some of the initialization
> steps before the alive synchronization point, but that really needs a lot
> of analysis whether the functions are safe to invoke that early and outside
> of the cpu_hotplug_lock held region for the case of two stage parallel
> bringup; see below.
> 
> The largest part is:
> 
>      identify_secondary_cpu()	99us avg/CPU
>     
>      Inside of identify_secondary_cpu() the largest offender:
> 
>        mcheck_init()		73us avg/CPU
> 
>      This part is definitely worth looking at to see whether it can be at least
>      partially moved to the early startup code before the alive
>      synchronization point. There's a lot of deep analysis required and
>      ideally we just rewrite the whole CPUID evaluation trainwreck
>      completely.
> 
> The rest of the AP side is low single digit microseconds except for:
> 
>      perf/x86:starting		14us avg/CPU
> 
>      smpboot/threads:online	13us avg/CPU
>      workqueue:online		17us avg/CPU
>      mm/vmstat:online		17us avg/CPU
>      sched:active		30us avg/CPU
> 
> sched:active is special. Onlining the first secondary HT thread on the
> second socket creates a 3.2ms outlier which skews the whole picture. That's
> caused by enabling the static key sched_smt_present which patches the world
> and some more. For all other APs this is really in the 1us range. This
> definitely could be postponed during bootup like the scheduler domain
> rebuild is done after the bringup. But that's still fully serialized and
> single threaded and obviously could be done later in the context of async
> parallel init. It's unclear why this is different with the fully serialized
> bringup where it takes significantly less time, but that's something which
> needs to be investigated.
> 
> 
> Is truly parallel bringup feasible?
> -----------------------------------
> 
> In theory yes, realistically no. Why?
> 
>     1) The preparation phase
> 
>        Allocating memory, creating threads for the to be brought up CPU must
>        obviously happen on an already online CPU.
> 
>        While it would be possible to bring up a subset of CPUs first and let
>        them do the preparation steps for groups of still offline CPUs
>        concurrently, the actual benefit of doing so is dubious.
> 
>        The prime example is kernel thread creation, which is implicitly
>        serialized on kthreadd.
> 
>        A simple experiment shows that 4 concurrent workers on 4 different
>        CPUs where each is creating 14 * 5 = 70 kernel threads are 5% slower
>        than a single worker creating 4 * 14 * 5 = 280 threads.
> 
>        So we'd need to have multiple kthreadd instances to handle that,
>        which would then serialize on tasklist lock and other things.
> 
>        That aside the preparation phase is also affected by the problem
>        below.
> 
>     2) Assumptions about hotplug serialization
> 
>        a) There are quite some assumptions about CPU bringup being fully
>           serialized across state transitions.  A lot of state callbacks rely
>           on that and would require local locking.
> 
> 	 Adding that local locking is surely possible, but that has several
> 	 downsides:
> 
>            - It adds complexity and makes it harder for developers to get
> 	    this correct. The subtle bugs resulting out of that are going
> 	    to be interesting
> 
>            - Fine grained locking has a charm, but only if the time spent
> 	    for the actual work is larger than the time required for
> 	    serialization and synchronization.
> 
> 	    Serializing a callback which takes less than a microsecond and
> 	    then having a large number of CPUs contending on the lock will
> 	    not make it any faster at all. That's a well known issue of
> 	    parallelizing and neither made up nor kernel specific.
> 
>        b) Some operations definitely require to be protected by the
>           cpu_hotplug_lock, especially those which affect cpumasks as the
>           masks are guaranteed to be stable in a cpus_read_lock()'ed region.
> 
>         	 As this lock cannot be taken in atomic contexts, it's required
>         	 that the control CPU holds the lock write locked across these
>         	 state transitions. And no, we are not making this a spinlock just
>         	 for that and we even can't.
> 
>         	 Just slapping a lock into the x86 specific part of the cpumask
>         	 update function does not solve anything. The relevant patch in V17
>         	 is completely useless as it only serializes the actual cpumask/map
>         	 modifications, but all read side users are hosed if the update
>         	 would be moved before the alive synchronization point, i.e. into a
>         	 non hotplug lock protected region.
> 
>         	 Even if the hotplug lock would be held across the whole parallel
>         	 bringup operation then this would still expose all usage of these
>         	 masks and maps in the actual hotplug state callbacks to concurrent
>         	 modifications.
> 
>         	 And no, we are not going to expose an architecture specific raw
>         	 spinlock to the hotplug state callbacks, especially not to those
>         	 in generic code.
> 
>        c) Some cpus_read_lock()'ed regions also expect that there is no CPU
>        	 state transition happening which would modify their local
>        	 state. This would again require local serialization.
> 
>      3) The amount of work and churn:
> 
>         - Analyze the per architecture low level startup functions plus their
>           descendant functions and make them ready for concurrency if
>         	 necessary.
> 
>         - Analyze ~300 hotplug state callbacks and their descendant functions
>           and make them ready for concurrency if necessary.
> 
>         - Analyze all cpus_read_lock()'ed regions and address their
>           requirements.
>        
>         - Rewrite the core code to handle the cpu_hotplug_lock requirements
>           only in distinct phases of the state machine.
> 
>         - Rewrite the core code to handle state callback failure and the
>           related rollback in the context of the new rules.
> 
>        - ...
> 
>     Even if some people are dedicated enough to do that, it's very
>     questionable whether the resulting complexity is justified.
> 
>     We've spent a serious amount of time to sanitize hotplug and bring it
>     into a state where it is correct. This also made it reasonably simple
>     for developers to implement hotplug state callbacks without having to
>     become hotplug experts.
> 
>     Breaking this completely up will result in a flood of hard to diagnose
>     subtle issues for sure. Who is going to deal with them?
> 
>     The experience with this series so far does not make me comfortable
>     about that thought in any way.
> 
> 
> Summary
> -------
> 
> The obvious and low hanging fruits have to be solved first:
> 
>    - The CPUID evaluation and related setup mechanisms
> 
>    - The trace/ringbuffer oddity
> 
>    - The sched:active oddity for the first sibling on the second socket
>    
>    - Some other expensive things which I'm not seeing in my test setup due
>      to lack of hardware or configuration.
> 
> Anything else is pretty much wishful thinking in my opinion.
> 
>    To be clear. I'm not standing in the way if there is a proper solution,
>    but that requires to respect the basic engineering rules:
> 
>      1) Correctness first
>      2) Keep it maintainable
>      3) Keep it simple
> 
>    So far this stuff failed already at #1.
> 
> I completely understand why this is important for cloud people, but
> the real question to ask here is what are the actual requirements.
> 
>    As far as I understand the main goal is to make a (kexec) reboot
>    almost invisible to VM tenants.
> 
>    Now let's look at how this works:
> 
>       A) Freeze VMs and persist state
>       B) kexec into the new kernel
>       C) Restore VMs from persistent memory
>       D) Thaw VMs
> 
>    So the key problem is how long it takes to get from #B to #C and finally
>    to #D.
> 
>    As far as I understand #C takes a serious amount of time and cannot be
>    parallelized for whatever reasons.
> 
>    At the same time the number of online CPUs required to restore the VMs
>    state is less than the number of online CPUs required to actually
>    operate them in #D.
> 
>    That means it would be good enough to return to userspace with a
>    limited number of online CPUs as fast as possible. A certain amount of
>    CPUs are going to be busy with restoring the VMs state, i.e. one CPU
>    per VM. Some remaining non-busy CPU can bring up the rest of the system
>    and the APs in order to be functional for #D, i.e. the restore of VM
>    operation.
> 
>    Trying to optimize this purely in kernel space by adding complexity of
>    dubious value is simply bogus in my opinion.
> 
>    It's already possible today to limit the number of CPUs which are
>    initially onlined and online the rest later from user space.
> 
>    There are two issues there:
> 
>      a) The death by MCE broadcast problem
> 
>         Quite some (contemporary) x86 CPU generations are affected by
>         this:
> 
>           - MCE can be broadcast to all CPUs and not only issued locally
>             to the CPU which triggered it.
> 
>           - Any CPU which has CR4.MCE == 0, even if it sits in a wait
>             for INIT/SIPI state, will cause an immediate shutdown of the
>             machine if a broadcast MCE is delivered.
> 
>      b) Do the parallel bringup via sysfs control knob
> 
>         The per CPU target state interface allows to do that today one
>         by one, but it's awkward and has quite some overhead.
> 
>         A knob to online the rest of the not yet onlined present CPUs
>         with the benefit of the parallel bringup mechanism is
>         missing.
> 
>      #a) That's a risk to take by the operator.
> 
>          Even the regular serialized bringup does not protect against this
>       	issue up to the point where all present CPUs have at least
>       	initialized CR4.
> 
> 	Limiting the number of APs to online early via the kernel command
> 	line widens that window and increases the risk further by
> 	executing user space before all APs have CR4 initialized.
> 
> 	But the same applies to a deferred online mechanism implemented in
> 	the kernel where some worker brings up the not yet online APs while
> 	the early online CPUs are already executing user space code.
> 
>      #b) Is a no brainer to implement on top of this.
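
For reference, the one-by-one onlining from user space that is possible today
could look roughly like the sketch below. It uses the per-CPU "online" files
under /sys/devices/system/cpu instead of the target state interface, needs
root, and brings the CPUs up serially. It is not the missing parallel knob
from #b. The helper names are made up for this example.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0-3,8,10-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def online_remaining(sysfs_cpu=Path("/sys/devices/system/cpu")):
    """Online every present but offline CPU, one at a time."""
    sysfs_cpu = Path(sysfs_cpu)
    present = parse_cpu_list((sysfs_cpu / "present").read_text())
    online = parse_cpu_list((sysfs_cpu / "online").read_text())
    brought_up = []
    for cpu in sorted(present - online):
        # Writing "1" asks the kernel to bring this CPU fully online.
        (sysfs_cpu / f"cpu{cpu}" / "online").write_text("1")
        brought_up.append(cpu)
    return brought_up
```

Booting with a limited CPU count (e.g. via maxcpus= on the kernel command
line) and calling online_remaining() later from an init script would match
the flow described above, with the MCE caveat from #a still applying.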
> 
> 
> Conclusion
> ----------
> 
> Adding the basic parallel bringup mechanism as provided by this series
> makes a lot of sense. Improving particular issues as pointed out in the
> analysis makes sense too.
> 
> But trying to solve an application specific problem fully in the kernel
> with tons of complexity, without exploring straight forward and simple
> approaches first, does not make any sense at all.
> 
> Thanks,
> 
> 	tglx
> 
> ---
>   Documentation/admin-guide/kernel-parameters.txt |   20
>   Documentation/core-api/cpu_hotplug.rst          |   13
>   arch/Kconfig                                    |   23 +
>   arch/arm/Kconfig                                |    1
>   arch/arm/include/asm/smp.h                      |    2
>   arch/arm/kernel/smp.c                           |   18
>   arch/arm64/Kconfig                              |    1
>   arch/arm64/include/asm/smp.h                    |    2
>   arch/arm64/kernel/smp.c                         |   14
>   arch/csky/Kconfig                               |    1
>   arch/csky/include/asm/smp.h                     |    2
>   arch/csky/kernel/smp.c                          |    8
>   arch/mips/Kconfig                               |    1
>   arch/mips/cavium-octeon/smp.c                   |    1
>   arch/mips/include/asm/smp-ops.h                 |    1
>   arch/mips/kernel/smp-bmips.c                    |    1
>   arch/mips/kernel/smp-cps.c                      |   14
>   arch/mips/kernel/smp.c                          |    8
>   arch/mips/loongson64/smp.c                      |    1
>   arch/parisc/Kconfig                             |    1
>   arch/parisc/kernel/process.c                    |    4
>   arch/parisc/kernel/smp.c                        |    7
>   arch/riscv/Kconfig                              |    1
>   arch/riscv/include/asm/smp.h                    |    2
>   arch/riscv/kernel/cpu-hotplug.c                 |   14
>   arch/x86/Kconfig                                |   45 --
>   arch/x86/include/asm/apic.h                     |    5
>   arch/x86/include/asm/cpu.h                      |    5
>   arch/x86/include/asm/cpumask.h                  |    5
>   arch/x86/include/asm/processor.h                |    1
>   arch/x86/include/asm/realmode.h                 |    3
>   arch/x86/include/asm/sev-common.h               |    3
>   arch/x86/include/asm/smp.h                      |   26 -
>   arch/x86/include/asm/topology.h                 |   23 -
>   arch/x86/include/asm/tsc.h                      |    2
>   arch/x86/kernel/acpi/sleep.c                    |    9
>   arch/x86/kernel/apic/apic.c                     |   22 -
>   arch/x86/kernel/callthunks.c                    |    4
>   arch/x86/kernel/cpu/amd.c                       |    2
>   arch/x86/kernel/cpu/cacheinfo.c                 |   21
>   arch/x86/kernel/cpu/common.c                    |   50 --
>   arch/x86/kernel/cpu/topology.c                  |    3
>   arch/x86/kernel/head_32.S                       |   14
>   arch/x86/kernel/head_64.S                       |  121 +++++
>   arch/x86/kernel/sev.c                           |    2
>   arch/x86/kernel/smp.c                           |    3
>   arch/x86/kernel/smpboot.c                       |  508 ++++++++----------------
>   arch/x86/kernel/topology.c                      |   98 ----
>   arch/x86/kernel/tsc.c                           |   20
>   arch/x86/kernel/tsc_sync.c                      |   36 -
>   arch/x86/power/cpu.c                            |   37 -
>   arch/x86/realmode/init.c                        |    3
>   arch/x86/realmode/rm/trampoline_64.S            |   27 +
>   arch/x86/xen/enlighten_hvm.c                    |   11
>   arch/x86/xen/smp_hvm.c                          |   16
>   arch/x86/xen/smp_pv.c                           |   56 +-
>   drivers/acpi/processor_idle.c                   |    4
>   include/linux/cpu.h                             |    4
>   include/linux/cpuhotplug.h                      |   17
>   kernel/cpu.c                                    |  397 +++++++++++++++++-
>   kernel/smp.c                                    |    2
>   kernel/smpboot.c                                |  163 -------
>   62 files changed, 953 insertions(+), 976 deletions(-)
> 
> 

Tested with a Xen PV dom0 on an 8 cpu system, no issues found.

Tested-by: Juergen Gross <jgross@...e.com>


Juergen
