Message-ID: <CADDUTFzK0FNS_mJ=S2_FH2vS2c5a+gW_qsjf3Hb9k=zzjB4JmA@mail.gmail.com>
Date: Mon, 9 Dec 2024 09:10:35 +0200
From: Costa Shulyupin <costa.shul@...hat.com>
To: Thomas Gleixner <tglx@...utronix.de>, Waiman Long <longman@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Valentin Schneider <vschneid@...hat.com>, 
	Peter Zijlstra <peterz@...radead.org>
Cc: open list <linux-kernel@...r.kernel.org>
Subject: Interference of CPU hotplug on CPU isolation and Real-Time tasks

Hello

Simplified test:
rtla timerlat hist -c 1 -a 500 &
echo 0 >  /sys/devices/system/cpu/cpu11/online
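
A minimal sketch of how the two commands above could be wrapped so that the CPU is brought back online after the test. The helper names (cpu_online_file, cycle_cpu) and the SYSFS_CPU override are hypothetical and exist only for illustration. Writing to the real sysfs file requires root.

```shell
#!/bin/sh
# Hypothetical helpers around the test above; names are illustrative only.
# SYSFS_CPU can be pointed at a scratch directory for a dry run.
SYSFS_CPU=${SYSFS_CPU:-/sys/devices/system/cpu}

# Print the path of the "online" control file for CPU number $1.
cpu_online_file() {
    echo "$SYSFS_CPU/cpu$1/online"
}

# Offline CPU $1, wait $2 seconds (default 1), then bring it back online.
cycle_cpu() {
    f=$(cpu_online_file "$1")
    echo 0 > "$f" && sleep "${2:-1}" && echo 1 > "$f"
}

# Example run (not executed here):
#   rtla timerlat hist -c 1 -a 500 &
#   sleep 5          # let timerlat settle before the hotplug event
#   cycle_cpu 11 1
#   wait
```

Re-onlining the CPU keeps the machine in its original state between runs, so the test can be repeated without a reboot.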

RTLA reveals the blocking thread's stack trace:
...
               -> multi_cpu_stop
               -> cpu_stopper_thread
               -> smpboot_thread_fn
...

I've found that multi_cpu_stop() disables interrupts on every online
CPU, not only on the CPU being taken down, because takedown_cpu()
indirectly invokes take_cpu_down() through stop_machine_cpuslocked().
I'm omitting the detailed description of the call chain.
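
To see which CPUs would be affected, the online mask can be expanded into individual CPU numbers. This is a rough sketch; expand_cpulist is a hypothetical helper, and it assumes the plain "a-b,c" list format that /sys/devices/system/cpu/online prints (no stride syntax).

```shell
#!/bin/sh
# Hypothetical helper: expand a kernel cpulist string such as "0-3,8,10-11"
# into a space-separated list of CPU numbers.
# Assumes plain ranges and single values only.
expand_cpulist() {
    list=$1
    out=""
    old_ifs=$IFS
    IFS=,
    for range in $list; do
        first=${range%-*}   # text before the dash, or the whole value
        last=${range#*-}    # text after the dash, or the whole value
        i=$first
        while [ "$i" -le "$last" ]; do
            out="$out $i"
            i=$((i + 1))
        done
    done
    IFS=$old_ifs
    echo "${out# }"
}

# Example (not executed here):
#   expand_cpulist "$(cat /sys/devices/system/cpu/online)"
```

Every CPU in that list, including an isolated one running timerlat, would execute multi_cpu_stop() during the hotplug operation according to the observation above.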

Using stop_one_cpu() instead of stop_machine_cpuslocked() could
potentially solve the problem:

@@ -1335,7 +1339,7 @@ static int takedown_cpu(unsigned int cpu)
       /*
        * So now all preempt/rcu users must observe !cpu_active().
        */
-       err = stop_machine_cpuslocked(take_cpu_down, NULL, cpumask_of(cpu));
+       err = stop_one_cpu(cpu, take_cpu_down, NULL);

The original stop_machine code was introduced 20 years ago:
Author: rusty <rusty>
Date:   Fri Mar 19 16:02:28 2004 +0000

   [PATCH] Hotplug CPUs: cpu_down()

   Implement cpu_down(): uses stop_machine to freeze the machine, then
   uses (arch-specific) __cpu_disable() and migrate_all_tasks().

   Whole thing under CONFIG_HOTPLUG_CPU, so doesn't break archs which
   don't define that.

https://github.com/jeffmahoney/linux-pre-git/commit/864a81b15223552102124656a012ac6de6947499#diff-52e4b09f63a029f319f95a60ddc0a09c31de0e172f8a2802ce39294569e60587R122

Additionally, take_cpu_down() relies on local_irq_save() and
hard_irq_disable(). However, I am omitting this patch to concentrate
solely on stop_one_cpu().

Questions:
1. Why is stop_machine() used during CPU hotplug?
2. Is it worth testing using stop_one_cpu(), or would that be the
wrong approach?
3. Do you have any additional recommendations?

Thanks
Costa

