Message-ID: <CA+55aFy0RnpxjK7e3ScHpvwesyuXz8AWaPaTy=4QbcPV7dRKqw@mail.gmail.com>
Date:	Mon, 29 Feb 2016 09:57:13 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Michael Matz <matz@...e.de>
Cc:	Markus Trippelsdorf <markus@...ppelsdorf.de>,
	Paul McKenney <paulmck@...ux.vnet.ibm.com>,
	"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
	"gcc@....gnu.org" <gcc@....gnu.org>, parallel@...ts.isocpp.org,
	llvm-dev@...ts.llvm.org, Will Deacon <will.deacon@....com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	David Howells <dhowells@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ramana Radhakrishnan <Ramana.Radhakrishnan@....com>,
	Luc Maranget <luc.maranget@...ia.fr>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Jade Alglave <j.alglave@....ac.uk>,
	Ingo Molnar <mingo@...nel.org>
Subject: Re: [isocpp-parallel] Proposal for new memory_order_consume definition
On Mon, Feb 29, 2016 at 9:37 AM, Michael Matz <matz@...e.de> wrote:
>
> The important part is with induction variables controlling
> loops:
>
>   short i;  for (i = start; i < end; i++)
> vs.
>   unsigned short u; for (u = start; u < end; u++)
>
> For the former you're allowed to assume that the loop will terminate, and
> that its iteration count is easily computable.  For the latter you get
> modulo arithmetic and (if start/end are of larger type than u, say 'int')
> it might not even terminate at all.  That has direct consequences of
> vectorizability of such loops (or profitability of such transformation)
> and hence quite important performance implications in practice.
Stop bullshitting me.

It would generally force the compiler to add a few extra checks when
you do vectorize (or, more generally, do any kind of loop unrolling),
and yes, it would make things slightly more painful. You might, for
example, need to add code to handle the wraparound and have a more
complex non-unrolled head/tail version for that case.

In theory you could do a whole "restart the unrolled loop around the
index wraparound" if you actually cared about the performance of such
a case - but since nobody would ever care about that, it's more likely
that you'd just do it with a non-unrolled fallback (which would likely
be identical to the tail fixup).
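
As a rough illustration of the check-plus-fallback shape described above, here is a minimal C sketch. It is not taken from any real compiler: `sum_plain`, `sum_unrolled`, and `TABLE_SIZE` are hypothetical names, the loop uses `u != end` rather than `u < end` so that the wrapping case terminates and can actually be run, and it assumes `table` has one entry for every `unsigned short` value.

```c
#include <limits.h>
#include <stddef.h>

/* Number of distinct values an unsigned short index can take. */
#define TABLE_SIZE ((size_t)USHRT_MAX + 1)

/* Reference semantics: the plain loop. The index u wraps modulo
 * TABLE_SIZE, so start > end means "run off the top and come back". */
long sum_plain(const int *table, unsigned short start, unsigned short end)
{
    long sum = 0;
    for (unsigned short u = start; u != end; u++)
        sum += table[u];
    return sum;
}

/* Hypothetical hand-written equivalent of what a compiler could emit:
 * one cheap check up front, a 4x unrolled body when the index cannot
 * wrap, and the plain non-unrolled loop as the fallback. */
long sum_unrolled(const int *table, unsigned short start, unsigned short end)
{
    long sum = 0;
    size_t i, n;

    if (start > end)                /* index would wrap: take the slow path */
        return sum_plain(table, start, end);

    i = start;
    n = end;
    for (; n - i >= 4; i += 4)      /* unrolled body, no wraparound possible */
        sum += (long)table[i] + table[i + 1] + table[i + 2] + table[i + 3];
    for (; i < n; i++)              /* tail fixup */
        sum += table[i];
    return sum;
}
```

Calling `sum_unrolled(table, 10, 1000)` takes the unrolled path, while `sum_unrolled(table, 65530, 5)` takes the fallback and visits indices 65530..65535 and then 0..4, exactly as the plain loop does (assuming a 16-bit `unsigned short`).
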
It would be painful, yes.

But it wouldn't be fundamentally hard, or hurt actual performance fundamentally.

It would be _inconvenient_ for compiler writers, and the bad ones
would argue vehemently against it.

.. and it's how a "go fast" mode would be implemented by a compiler
writer initially as a compiler option, for those HPC people. Then you
have a use case and implementation example, and can go to the
standards body and say "look, we have people who use this already, it
breaks almost no code, and it makes our compiler able to generate much
faster code".

Which is why the standard was written to be good for compiler writers,
not actual users.

Of course, in real life HPC performance is often more about doing the
cache blocking etc, and I've seen people move to more parameterized
languages rather than C to get best performance. Generate the code
from a much higher-level description, and be able to do a much better
job, and leave C to do the low-level job, and let people do the
important part.

But no. Instead the C compiler people still argue for bad features
that were a misdesign and a wart on the language.

At the very least it should have been left as a "go unsafe, go fast"
option, and standardize *that*, instead of screwing everybody else
over.

The HPC people end up often using those anyway, because it turns out
that they'll happily get rid of proper rounding etc if it buys them a
couple of percent on their workload.  Things like "I really want you
to generate multiply-accumulate instructions because I don't mind
having intermediates with higher precision" etc.
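
For the multiply-accumulate point, a small C sketch of why the answer changes when the intermediate product is not rounded. This is only an emulation under stated assumptions: `mac_strict` and `mac_wide` are hypothetical names, and widening `float` operands to `double` stands in for what a fused instruction does in hardware (the standard C interface for the real thing is `fma()` in `<math.h>`). It assumes IEEE 754 `float` and `double`.

```c
/* Multiply, round the product to float, then add.
 * The volatile temporary forces the product to be rounded to float
 * before the addition happens. */
float mac_strict(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* Keep the product in double, where it is exact for float inputs
 * (two 24-bit significands multiply into at most 48 bits, which fits
 * in double's 53), add in double, and convert back to float at the end. */
float mac_wide(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

With a = 1 + 2^-13, b = 1 - 2^-13 and c = -1, the exact product is 1 - 2^-26. `mac_strict` rounds that product to 1 and returns 0, while `mac_wide` keeps it and returns -2^-26.
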
             Linus