Message-ID: <Z3l_2CiDgmDmAktE@csail.mit.edu>
Date: Sun, 5 Jan 2025 00:07:12 +0530
From: "Srivatsa S. Bhat" <srivatsa@...il.mit.edu>
To: Costa Shulyupin <costa.shul@...hat.com>
Cc: Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
Yury Norov <yury.norov@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Valentin Schneider <vschneid@...hat.com>,
Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <neeraj.upadhyay@...nel.org>,
linux-kernel@...r.kernel.org, Waiman Long <longman@...hat.com>,
x86@...nel.org, paulmck@...nel.org
Subject: Re: [RFC PATCH v1] stop_machine: Add stop_housekeeping_cpuslocked()

Hi Costa,

On Wed, Dec 18, 2024 at 07:15:31PM +0200, Costa Shulyupin wrote:
> CPU hotplug interferes with CPU isolation and introduces latency to
> real-time tasks.
>
> The test:
>
> rtla timerlat hist -c 1 -a 500 &
> echo 0 > /sys/devices/system/cpu/cpu2/online
>
> The RTLA tool reveals the following blocking thread stack trace:
>
> -> multi_cpu_stop
> -> cpu_stopper_thread
> -> smpboot_thread_fn
>
> This happens because multi_cpu_stop() disables interrupts for EACH online
> CPU since takedown_cpu() indirectly invokes take_cpu_down() through
> stop_machine_cpuslocked(). I'm omitting the detailed description of the
> call chain.
>

I had explored removing stop-machine from the CPU hotplug offline path
a very long time ago:

https://lore.kernel.org/all/20130218123714.26245.61816.stgit@srivatsabhat.in.ibm.com/

Towards the tail end of that patchset is the actual change that
replaces the call to __stop_machine() with stop_one_cpu():

https://lore.kernel.org/all/20130218124431.26245.10956.stgit@srivatsabhat.in.ibm.com/

But before that, there were ~45 patches in the series to make sure
that all the existing CPU hotplug callbacks (at the time, in that
kernel version) that relied on implicit guarantees provided by
stop_machine() were adequately addressed with an alternative scheme
before switching over to stop_one_cpu() for CPU offlining.
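
To make the difference in scope concrete, here is a toy user-space
model (not kernel code) of which CPUs have their normal work preempted
by a stopper thread in each case. All names in it (struct cpu_model,
model_init(), model_stop_machine(), model_stop_one_cpu(),
MODEL_NR_CPUS) are hypothetical and exist only for this sketch. It
deliberately ignores the synchronization guarantees that the hotplug
callbacks rely on, which is the hard part discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 8	/* hypothetical CPU count for this model */

/* Toy model only: tracks which CPUs get preempted by a stopper. */
struct cpu_model {
	bool online[MODEL_NR_CPUS];
	bool stalled[MODEL_NR_CPUS];
};

/* Mark CPUs 0..nr_online-1 as online and clear all stall flags. */
static void model_init(struct cpu_model *m, size_t nr_online)
{
	assert(nr_online <= MODEL_NR_CPUS);
	for (size_t i = 0; i < MODEL_NR_CPUS; i++) {
		m->online[i] = i < nr_online;
		m->stalled[i] = false;
	}
}

/* Models stop_machine(): every online CPU runs the stopper. */
static size_t model_stop_machine(struct cpu_model *m)
{
	size_t n = 0;

	for (size_t i = 0; i < MODEL_NR_CPUS; i++) {
		if (m->online[i]) {
			m->stalled[i] = true;
			n++;
		}
	}
	return n;	/* number of CPUs disturbed */
}

/* Models stop_one_cpu(): only the target CPU runs the stopper. */
static size_t model_stop_one_cpu(struct cpu_model *m, size_t cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	if (!m->online[cpu])
		return 0;
	m->stalled[cpu] = true;
	return 1;
}
```

With four online CPUs, as in your test where timerlat runs on CPU 1
and CPU 2 is taken offline, model_stop_machine() stalls all four CPUs
(including CPU 1), whereas model_stop_one_cpu(&m, 2) stalls only CPU 2.
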
> Proposal: Limit the stop operation to housekeeping CPUs.
>
> take_cpu_down() invokes with cpuhp_invoke_callback_range_nofail:
> - tick_cpu_dying()
> - hrtimers_cpu_dying()
> - smpcfd_dying_cpu()
> - x86_pmu_dying_cpu()
> - rcutree_dying_cpu()
> - sched_cpu_dying()
> - cache_ap_offline()
>
> Which synchronizations do these functions require instead of stop_machine?
>

I'd recommend taking a look at that prior attempt to remove
stop_machine from CPU hotplug (linked above) for reference, as you
begin your analysis for the current kernel.

Regards,
Srivatsa
Microsoft Linux Systems Group