Message-ID: <CA+55aFx5_qcUHBQ50RfOnH+pS=jqA_Wg2YYCcm=fTXRFJjs3rg@mail.gmail.com>
Date: Sun, 13 May 2018 11:21:31 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Dexuan Cui <decui@...rosoft.com>
Cc: Ingo Molnar <mingo@...nel.org>,
Alexey Dobriyan <adobriyan@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Rakib Mullick <rakib.mullick@...il.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: for_each_cpu() is buggy for UP kernel?
On Tue, May 8, 2018 at 11:24 PM Dexuan Cui <decui@...rosoft.com> wrote:
> Should we fix the for_each_cpu() in include/linux/cpumask.h for UP?
As Thomas points out, this has come up before.
One of the issues is historical - we tried very hard to make the SMP code
not cause code generation problems for UP, and part of that was just that
all these loops were literally designed to entirely go away under UP. It
still *looks* syntactically like a loop, but an optimizing compiler will
see that there's nothing there, and "for_each_cpu(...) x" essentially just
turns into "x" on UP. An empty mask generally doesn't make sense,
since on UP you also don't have any masking of CPU ops, so the mask is
ignored, and that helps the code generation immensely.
If you have to load and test the mask, you immediately lose out badly in
code generation.
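To make the point above concrete, here is a minimal userspace sketch of a UP-style loop that never loads the mask. It is an illustration only: `struct cpumask_sketch`, `up_for_each_cpu()` and `count_visits()` are hypothetical names invented for this example, not the kernel's definitions, and the macro body is an assumption about what "the mask is ignored" looks like in practice.

```c
/* Hypothetical one-word stand-in for the kernel's struct cpumask. */
struct cpumask_sketch {
	unsigned long bits;
};

/*
 * Assumed UP-style iteration: the mask expression is evaluated but its
 * contents are never loaded or tested, so the body always runs exactly
 * once, for cpu 0. A compiler can reduce this to just the body.
 */
#define up_for_each_cpu(cpu, mask) \
	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)(mask))

/* Hypothetical caller: counts how many times the loop body executes. */
static int count_visits(const struct cpumask_sketch *mask)
{
	int cpu, visits = 0;

	up_for_each_cpu(cpu, mask)
		visits++;

	return visits;
}
```

With this sketch, `count_visits()` returns 1 even when every bit of the mask is clear, which is the inconsistency the thread is about: the loop is cheap precisely because it never looks at the mask.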
So honestly, I'd really prefer to keep our current behavior. Perhaps with a
debug option that actually tests (on SMP - because that's what every
developer is actually _using_ these days) that the mask isn't empty. But
I'm not sure that would find this case, since presumably on SMP it might
never be empty.
Now, there is likely a fairly good argument that UP is getting _so_
uninteresting that we shouldn't even worry about code generation. But the
counter-argument to that is that if people are using UP in this day and
age, they probably are using some really crappy hardware that needs all the
help it can get.
At least for now, I'd rather have this inconsistency, because it really
makes a surprisingly *big* difference in code generation. From the little
test I just did, adding that mask testing to a *single* case of
for_each_cpu() added 20 instructions. I didn't look at exactly why that
happened (because the code generation was so radically different), but it
was very noticeable. I used your macro replacement in kernel/taskstats.c in
case you want to try to dig into what happened, but I'm not surprised. It
really turns an unconditional trivial loop into a much more complex thing
that needs to look at and test a value that we didn't care about before.
Maybe we should introduce a "for_each_cpu_maybe_empty()" helper for cases
like this?
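One possible shape for such a helper on UP, sketched in userspace C. Only the name `for_each_cpu_maybe_empty()` comes from this message; the macro body, `struct cpumask_sketch` and `count_cpus_in()` are assumptions made for illustration and do not come from any kernel tree.

```c
/* Hypothetical one-word stand-in for the kernel's struct cpumask. */
struct cpumask_sketch {
	unsigned long bits;
};

/*
 * Assumed UP-only body for the proposed helper: visit cpu 0 only if
 * bit 0 of the mask is set. Unlike a loop that ignores the mask, this
 * has to load and test the mask word before running the body.
 */
#define for_each_cpu_maybe_empty(cpu, mask) \
	for ((cpu) = 0; (cpu) < 1 && ((mask)->bits & 1UL); (cpu)++)

/* Hypothetical caller whose mask may legitimately be empty. */
static int count_cpus_in(const struct cpumask_sketch *mask)
{
	int cpu, visited = 0;

	for_each_cpu_maybe_empty(cpu, mask)
		visited++;

	return visited;
}
```

Keeping the test in a separate, opt-in helper would let the few callers that can see an empty mask pay for the extra load and branch, while every other for_each_cpu() user keeps the trivial loop.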
Linus