linux-kernel - RE: for_each_cpu() is buggy for UP kernel?

Message-ID: <SG2P15301MB0015F23FF0E44BE991E8C38EBF930@SG2P15301MB0015.APCP153.PROD.OUTLOOK.COM>
Date:   Tue, 15 May 2018 03:02:27 +0000
From:   Dexuan Cui <decui@...rosoft.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
CC:     Ingo Molnar <mingo@...nel.org>,
        Alexey Dobriyan <adobriyan@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Rakib Mullick <rakib.mullick@...il.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: RE: for_each_cpu() is buggy for UP kernel?

> From: Linus Torvalds <torvalds@...ux-foundation.org>
> Sent: Sunday, May 13, 2018 11:22
> On Tue, May 8, 2018 at 11:24 PM Dexuan Cui <decui@...rosoft.com> wrote:
> 
> > Should we fix the for_each_cpu() in include/linux/cpumask.h for UP?
> 
> As Thomas points out, this has come up before.
> 
> One of the issues is historical - we tried very hard to make the SMP code
> not cause code generation problems for UP, and part of that was just that
> all these loops were literally designed to entirely go away under UP. It
> still *looks* syntactically like a loop, but an optimizing compiler will
> see that there's nothing there, and "for_each_cpu(...) x" essentially just
> turns into "x" on UP.  An empty mask simply generally doesn't make sense,
> since on UP you also don't have any masking of CPU ops, so the mask is
> ignored, and that helps the code generation immensely.
> 
> If you have to load and test the mask, you immediately lose out badly in
> code generation.
Thank you all for the insights and the detailed background introduction!
 
> So honestly, I'd really prefer to keep our current behavior. Perhaps with a
> debug option that actually tests (on SMP - because that's what every
> developer is actually _using_ these days) that the mask isn't empty. But
> I'm not sure that would find this case, since presumably on SMP it might
> never be empty.
I agree.

> Now, there is likely a fairly good argument that UP is getting _so_
> uninteresting that we shouldn't even worry about code generation. But the
> counter-argument to that is that if people are using UP in this day and
> age, they probably are using some really crappy hardware that needs all the
> help it can get.
FWIW, I happened to find this issue in an SMP virtual machine, but the kernel
from a customer was built with CONFIG_SMP disabled. After spending a day
debugging the strange boot-up delay, which was caused by an unexpected
PIT interrupt storm, I finally tracked it down to the UP version of for_each_cpu().

The function exposing the issue is kernel/time/tick-broadcast.c:
tick_handle_oneshot_broadcast().
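
To make the behaviour concrete, here is a minimal userspace sketch (not the
kernel code itself). It assumes the UP for_each_cpu() is a loop that runs
exactly once for CPU 0 and only evaluates the mask without testing it; the
names struct cpumask_model, model_cpumask_empty(), up_for_each_cpu() and
count_iterations() are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct cpumask on a one-CPU build. */
struct cpumask_model {
	unsigned long bits;
};

/* Returns true when no CPU bit is set in the model mask. */
static inline bool model_cpumask_empty(const struct cpumask_model *mask)
{
	assert(mask != NULL);
	return mask->bits == 0;
}

/*
 * Assumed shape of the UP for_each_cpu(): the mask expression is
 * evaluated (cast to void) but never tested, so the body always
 * runs once with cpu == 0.
 */
#define up_for_each_cpu(cpu, mask) \
	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)(mask))

/* Counts how many times the loop body runs for the given mask. */
static inline int count_iterations(const struct cpumask_model *mask)
{
	int cpu, n = 0;

	up_for_each_cpu(cpu, mask)
		n++;
	return n;
}
```

With this model, count_iterations() returns 1 for an empty mask as well as
for a mask with CPU 0 set, which is consistent with the loop body in
tick_handle_oneshot_broadcast() running for CPU 0 even when the broadcast
mask is empty.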

If you're OK with the below fix (not tested yet), I'll submit a patch for it:

--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -616,6 +616,10 @@ static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
        now = ktime_get();
        /* Find all expired events */
        for_each_cpu(cpu, tick_broadcast_oneshot_mask) {
+#ifndef CONFIG_SMP
+               if (cpumask_empty(tick_broadcast_oneshot_mask))
+                       break;
+#endif
                td = &per_cpu(tick_cpu_device, cpu);
                if (td->evtdev->next_event <= now) {
                        cpumask_set_cpu(cpu, tmpmask); 

> At least for now, I'd rather have this inconsistency, because it really
> makes a surprisingly *big* difference in code generation.  From the little
> test I just did, adding that mask testing to a *single* case of
> for_each_cpu() added 20 instructions.  I didn't look at exactly why that
> happened (because the code generation was so radically different), but it
> was very noticeable. I used your macro replacement in kernel/taskstats.c in
> case you want to try to dig into what happened, but I'm not surprised. It
> really turns an unconditional trivial loop into a much more complex thing
> that needs to look at and test a value that we didn't care about before.
I agree.

 
> Maybe we should introduce a "for_each_cpu_maybe_empty()" helper for
> cases like this?
>                     Linus
Sounds like a good idea.
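
A rough userspace sketch of what such a helper could look like on UP, only
to illustrate the idea: it is not an existing kernel API, and every name
here (struct cpumask_sketch, sketch_cpumask_empty(),
sketch_for_each_cpu_maybe_empty(), count_visited()) is hypothetical. The
only difference from the unconditional UP loop is that the mask is
actually tested, so only callers that can see an empty mask would pay for
the extra load and test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct cpumask on a one-CPU build. */
struct cpumask_sketch {
	unsigned long bits;
};

/* On a one-CPU build only bit 0 is meaningful. */
static inline bool sketch_cpumask_empty(const struct cpumask_sketch *mask)
{
	assert(mask != NULL);
	return (mask->bits & 1UL) == 0;
}

/*
 * Hypothetical UP variant of the proposed helper: visit CPU 0 only
 * if the mask is not empty, otherwise skip the body entirely.
 */
#define sketch_for_each_cpu_maybe_empty(cpu, mask) \
	for ((cpu) = 0; (cpu) < 1 && !sketch_cpumask_empty(mask); (cpu)++)

/* Counts how many CPUs the loop visits for the given mask. */
static inline int count_visited(const struct cpumask_sketch *mask)
{
	int cpu, n = 0;

	sketch_for_each_cpu_maybe_empty(cpu, mask)
		n++;
	return n;
}
```

Here count_visited() returns 0 for an empty mask and 1 when CPU 0 is set,
while the plain for_each_cpu() could stay as it is for all other callers.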

Thanks,
-- Dexuan