Message-ID: <20250222221248.772b4bf6@pumpkin>
Date: Sat, 22 Feb 2025 22:12:48 +0000
From: David Laight <david.laight.linux@...il.com>
To: Kent Overstreet <kent.overstreet@...ux.dev>
Cc: "H. Peter Anvin" <hpa@...or.com>, Linus Torvalds
<torvalds@...ux-foundation.org>, Ventura Jack <venturajack85@...il.com>,
Gary Guo <gary@...yguo.net>, airlied@...il.com, boqun.feng@...il.com,
ej@...i.de, gregkh@...uxfoundation.org, hch@...radead.org,
ksummit@...ts.linux.dev, linux-kernel@...r.kernel.org,
miguel.ojeda.sandonis@...il.com, rust-for-linux@...r.kernel.org
Subject: Re: C aggregate passing (Rust kernel policy)
On Sat, 22 Feb 2025 16:22:08 -0500
Kent Overstreet <kent.overstreet@...ux.dev> wrote:
> On Sat, Feb 22, 2025 at 12:54:31PM -0800, H. Peter Anvin wrote:
> > VLIW and OoO might seem orthogonal, but they aren't – because they are
> > trying to solve the same problem, combining them either means the OoO
> > engine can't do a very good job because of false dependencies (if you
> > are scheduling molecules) or you have to break the instructions down
> > into atoms, at which point it is just a (often quite inefficient) RISC
> > encoding. In short, VLIW *might* make sense when you are statically
> > scheduling a known pipeline, but it is basically a dead end for
> > evolution – so unless you can JIT your code for each new chip
> > generation...
>
> JITing for each chip generation would be a part of any serious new VLIW
> effort. It's plenty doable in the open source world and the gains are
> too big to ignore.
Doesn't most code get 'dumbed down' to whatever 'normal' ABI compilers
can easily handle?
A few hot loops might get optimised, but most code won't be.
Of course AI/GPU code is going to spend a lot of time in some tight loops.
But no one is going to go through the TCP stack and optimise the source
so that a compiler can do a better job of it for 'this year's' cpu.
For various reasons I ended up writing a simple 32-bit cpu last year (in VHDL for an FPGA).
The ALU is easy - just a big MUX.
The difficulty is feeding the result of one instruction into the next.
Normal code needs to do that all the time, and you can't afford a stall
(never mind the 3 clocks writing to/from the register 'memory' would take).
In fact the ALU dependencies [1] ended up being slower than the instruction fetch
code, so I managed to take predicted and unconditional branches without a stall.
So there was no point in having the 'branch delay slot' of sparc32.
[1] Multiply was the issue, even with a pipeline stall if the result was needed.
In any case it only had to run at 62.5MHz (related to the PCIe speed).
Was definitely an interesting exercise.
David