linux-kernel - Re: [RFC PATCH] kernel/cpu: restart cpu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <877d7itszm.ffs@tglx>
Date:   Thu, 21 Apr 2022 16:23:57 +0200
From:   Thomas Gleixner <tglx@...utronix.de>
To:     David Hildenbrand <david@...hat.com>,
        Joel Savitz <jsavitz@...hat.com>, linux-kernel@...r.kernel.org
Cc:     Valentin Schneider <valentin.schneider@....com>,
        Peter Zijlstra <peterz@...radead.org>,
        Frederic Weisbecker <frederic@...nel.org>,
        Mark Rutland <mark.rutland@....com>,
        Yuan ZhaoXiong <yuanzhaoxiong@...du.com>,
        Baokun Li <libaokun1@...wei.com>,
        "Jason A. Donenfeld" <Jason@...c4.com>,
        YueHaibing <yuehaibing@...wei.com>,
        Randy Dunlap <rdunlap@...radead.org>,
        David Hildenbrand <dhildenb@...hat.com>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "Rafael J. Wysocki" <rafael@...nel.org>
Subject: Re: [RFC PATCH] kernel/cpu: restart cpu_up when hotplug is disabled

On Tue, Apr 19 2022 at 14:34, David Hildenbrand wrote:
> On 18.04.22 21:54, Joel Savitz wrote:
>> The cpu hotplug path may be utilized while hotplug is disabled for a
>> brief moment leading to failures. As an example, attempts to perform
>> cpu hotplug by userspace soon after boot may race with pci_device_probe
>> leading to inconsistent results.
>
> You might want to extend a bit in which situation we observed that issue
> fairly reliably.
>
> When restricting the number of boot cpus on the kernel cmdline, e.g.,
> via "maxcpus=2", udev will find the offline cpus when enumerating all
> cpus and try onlining them. Due to the race, onlining of some cpus fails
> e.g., when racing with pci_device_probe().

maxcpus is a horrible hack and broken vs. MCE broadcasting on x86.

> While teaching udev to not online coldplugged CPUs when "maxcpus" was
> specified ("policy"), it revealed the underlying issue that onlining a
> CPU can fail with -EBUSY in corner cases when cpu hotplug is temporarily
> disabled.

Right. It can fail with -EBUSY and because userspace fails to handle it
gracefully we need to hack around it?

>> Proposed idea:
>> Call restart_syscall instead of returning -EBUSY since
>> cpu_hotplug_disabled seems to only have a positive value
>> for short, temporary amounts of time.
>> 
>> Does anyone see any serious problems with this?

Yes. It's a horrible hack and wrong...
 
>>  	if (cpu_hotplug_disabled) {
>> -		err = -EBUSY;
>> +		/* avoid busy looping (5ms of sleep should be enough) */
>> +		msleep(5);
>> +		err = restart_syscall();

... as it sleeps with cpu_add_remove_lock held, which protects
cpu_hotplug_disabled. IOW, cpu_hotplug_enable() is blocked until
msleep() returns.

> It's worth noting that we use the same approach in
> lock_device_hotplug_sysfs().

That does not make it any better, really. 

> It's far from perfect I would say, but we really wanted to avoid
> letting user space having to deal with retry logic.

What's so hard with retry logic in user space? 

If you can come up with a reasonable argument why user space cannot be
fixed, then there is certainly a better solution than slapping a
msleep(5) at some random place into the code.

Thanks,

        tglx