Message-ID: <20250222221248.772b4bf6@pumpkin>
Date: Sat, 22 Feb 2025 22:12:48 +0000
From: David Laight <david.laight.linux@...il.com>
To: Kent Overstreet <kent.overstreet@...ux.dev>
Cc: "H. Peter Anvin" <hpa@...or.com>, Linus Torvalds
<torvalds@...ux-foundation.org>, Ventura Jack <venturajack85@...il.com>,
Gary Guo <gary@...yguo.net>, airlied@...il.com, boqun.feng@...il.com,
ej@...i.de, gregkh@...uxfoundation.org, hch@...radead.org,
ksummit@...ts.linux.dev, linux-kernel@...r.kernel.org,
miguel.ojeda.sandonis@...il.com, rust-for-linux@...r.kernel.org
Subject: Re: C aggregate passing (Rust kernel policy)
On Sat, 22 Feb 2025 16:22:08 -0500
Kent Overstreet <kent.overstreet@...ux.dev> wrote:
> On Sat, Feb 22, 2025 at 12:54:31PM -0800, H. Peter Anvin wrote:
> > VLIW and OoO might seem orthogonal, but they aren't – because they are
> > trying to solve the same problem, combining them either means the OoO
> > engine can't do a very good job because of false dependencies (if you
> > are scheduling molecules) or you have to break the instructions down
> > into atoms, at which point it is just a (often quite inefficient) RISC
> > encoding. In short, VLIW *might* make sense when you are statically
> > scheduling a known pipeline, but it is basically a dead end for
> > evolution – so unless you can JIT your code for each new chip
> > generation...
>
> JITing for each chip generation would be a part of any serious new VLIW
> effort. It's plenty doable in the open source world and the gains are
> too big to ignore.
Doesn't most code get 'dumbed down' to whatever 'normal' ABI compilers
can easily handle?
A few hot loops might get optimised, but most code won't be.
Of course AI/GPU code is going to spend a lot of time in some tight loops.
But no one is going to go through the TCP stack and optimise the source
so that a compiler can do a better job of it for 'this year's' cpu.
For various reasons I ended up writing a simple 32-bit cpu last year (in VHDL for an FPGA).
The ALU is easy - just a big MUX.
The difficulty is feeding the result of one instruction into the next.
Normal code needs to do that all the time, and you can't afford a stall
(never mind the 3 clocks writing to/from the register 'memory' would take).
In fact the ALU dependencies [1] ended up being slower than the instruction fetch
code, so I managed to take predicted and unconditional branches without a stall.
So there was no point in having the 'branch delay slot' of sparc32.
[1] Multiply was the issue, even with a pipeline stall if the result was needed.
In any case it only had to run at 62.5MHz (related to the PCIe speed).
Was definitely an interesting exercise.
David