linux-kernel - Re: [patch 4/6] Fix SMP poweroff hangs

Message-ID: <47063E93.9030105@rtr.ca>
Date:	Fri, 05 Oct 2007 09:39:31 -0400
From:	Mark Lord <lkml@....ca>
To:	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>
Cc:	"Rafael J. Wysocki" <rjw@...k.pl>,
	"Brown, Len" <len.brown@...el.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>, simon.derr@...l.net,
	Alexey Starikovskiy <astarikovskiy@...e.de>
Subject: Re: [patch 4/6] Fix SMP poweroff hangs

Reposting this.. somehow left LKML off the TO/CC list.. (thanks Andrew!)

Eric W. Biederman wrote:
> Mark Lord <lkml@....ca> writes:
>
>> Eric W. Biederman wrote:
>>> Mark Lord <lkml@....ca> writes:
>>>>> - It could be that we changed the initialization order.
>>>>> - It could be that we changed the set of sysdev devices.
>>>>> - It could be that cpu_down is doing something extra that we are not
>>>>>   doing.
>>>>>
>>>>> My guess is that Mark Lord, and Thomas Gleixner are sharp enough to
>>>>> have checked their command line parameters before submitting a kernel
>>>>> bugfix patch.  Although it would not hurt to have confirmation of
>>>>> that.
>>>> Right.  No fancy command line overrides here.
>>> Ok.
>>>
>>> Simple questions.
>>> - 32bit or 64bit x86?
>>> - Is this new failure mode or a regression?
>>> - By hang on power off I assume you mean that the power does
>>>   not go off?
>>> - Is there anything else you can tell me about this failure mode?
>> You could search kernel archives from last week,
>> when lots more info was posted in the original trouble reports.
>
> Sure.  I will. I was starting off a bit lazy.
>
>> My system is Core2Duo DualCore 1.86GHz, pure 32bit kernel/user.
>> System prints "Power Down." message, but power stays on.
>> Broken in 2.6.17, and in 2.6.23-rc*.  No other versions attempted.
>
> Ok.  So apparently not a regression.
>
>>> Mark, Thomas given the difficulty in reproducing the failure if
>>> someone has time I would love to see what happens if for testing:
>>> set_cpus_allowed(current, cpumask_of_cpu(1)); (instead of
>>> disable_nonboot_cpus)
>>
>> System behaved fine on six consecutive poweroffs with that.
>> This could be due to subtle timing changes induced by the
>> context switches this produces, though.  I also tried it
>> as the very first line in the same function, and saw no change.
>
> Thanks.  This confirms something.  This is not an issue of which
> cpu you are running on (or else by forcing you onto the second
> core it would always fail).   At most it may be something
> happening on one cpu and then getting switched to another cpu.
>
> So this looks like something extra that cpu_down is doing.
>
>> I then removed that one line (and left out the disable_nonboot_cpus as well),
>> and it failed to power off on the very next attempt (got lucky).
>
> Ok.
>
>> Changed it to set_cpus_allowed(current, cpumask_of_cpu(0)),
>> and it survived six consecutive poweroffs.
>>
>> Changed it back to disable_nonboot_cpus(..),
>> and it also survived another four consecutive poweroffs.
>>
>> I also verified that both CPUs were enabled on entry to machine_power_off().
>
> That both CPUs are enabled on entry to machine_power_off is expected,
> and normal.
>
> The code path on i386 should be:
> machine_power_off
>   native_machine_power_off
>     machine_shutdown(); (which disables the other cpus)
>       smp_call_function
>         stop_this_cpu (on each cpu to be stopped)
>     pm_power_off(); (which turns off the power)
..
> This does sound like a race of some sort.

Mmm... thanks for the tour.

The cpu hotplug code appears to take great precautions against internal races
(dunno if it succeeds or not, though), but the corresponding code in
native_smp_send_stop() looks a bit iffy by comparison.  I wonder if that's
where it gets stuck?

static void native_smp_send_stop(void)
{
       /* Don't deadlock on the call lock in panic */
       int nolock = !spin_trylock(&call_lock);
       unsigned long flags;

       local_irq_save(flags);
       __smp_call_function(stop_this_cpu, NULL, 0, 0);
       if (!nolock)
               spin_unlock(&call_lock);
       disable_local_APIC();
       local_irq_restore(flags);
}

So basically, it tries to avoid races by grabbing the call_lock,
but if the trylock fails it ignores that lock and proceeds anyway.
Recipe for a race?
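
To make the worry concrete, here is a rough user-space sketch of the same
"trylock, then carry on regardless" shape.  It is not kernel code: a pthread
mutex stands in for call_lock, and send_stop_sketch() and
run_while_lock_held() are made-up names for illustration only.  It assumes
the default (non-recursive) mutex type, where trylock on a held mutex fails.

```c
/* User-space illustration only; a pthread mutex stands in for the
 * kernel's call_lock.  All function names below are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t call_lock = PTHREAD_MUTEX_INITIALIZER;

/* Same shape as native_smp_send_stop(): try the lock, but proceed
 * whether or not it was obtained.  Returns true when the work ran
 * WITHOUT holding the lock. */
static bool send_stop_sketch(void)
{
	bool nolock = (pthread_mutex_trylock(&call_lock) != 0);

	/* ... the work that call_lock is meant to serialize goes here ... */

	if (!nolock)
		pthread_mutex_unlock(&call_lock);
	return nolock;
}

/* Simulates another code path already holding call_lock while the
 * stop path runs.  Returns what send_stop_sketch() reported. */
static bool run_while_lock_held(void)
{
	bool unprotected;

	pthread_mutex_lock(&call_lock);
	unprotected = send_stop_sketch();
	pthread_mutex_unlock(&call_lock);
	return unprotected;
}
```

With the lock free, send_stop_sketch() takes and releases it as intended.
With the lock already held, it still runs, just unprotected, and that is the
window I am wondering about.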

-ml