Message-ID: <CAHk-=wgLL5sA_GAEbFn+zELAxz9Jjf0umpORV4N1sQtRhDeZ2Q@mail.gmail.com>
Date: Mon, 19 Aug 2024 10:45:57 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Christian Brauner <brauner@...nel.org>
Cc: NeilBrown <neilb@...e.de>, Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 0/9 RFC] Make wake_up_{bit,var} less fragile

On Mon, 19 Aug 2024 at 01:16, Christian Brauner <brauner@...nel.org> wrote:
>
> The problem is that we currently use wait_on_bit() and wake_up_bit() in
> various places on i_state and all of these functions require an unsigned
> long (probably because some architectures only handle atomic ops on
> unsigned long).

It's actually mostly because of endianness, not atomicity.

The whole "flags in one single unsigned long" is a very traditional
Linux pattern. Even originally, when "unsigned", "unsigned int", and
"unsigned long" were all the same 32-bit thing, the pattern for the
kernel was "unsigned long" for the native architecture accesses.

It may not be a *great* pattern, and arguably it should always have
been "flags in a single unsigned int" (which was the same thing back
in the day). But hey, hindsight is 20/20.

[ And I say "arguably", not "obviously".

  The kernel basically takes the approach that "unsigned long" is the
size of GP registers, and so "array of unsigned long" is in some
respect fundamentally more efficient than "array of unsigned int",
because you can very naturally do operations in bigger chunks.

  So bitops working on some more architecture-neutral size like just
bytes or "u32" or "unsigned int" would have advantages, but "unsigned
long" in many ways is also a real advantage, and can have serious
alignment advantages for the simple cases ]
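
To make the "bigger chunks" point concrete, here is a minimal sketch
of a find-first-set scan over a bitmap (a hypothetical helper, not
the kernel's find_first_bit()) that examines a whole register-sized
word per loop iteration instead of going byte by byte:

     #include <limits.h>
     #include <stddef.h>

     #define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

     /* Assumes any bits past 'nbits' in the last word are zero.
      * Each iteration covers 64 bits on a 64-bit machine. */
     static size_t first_set(const unsigned long *map, size_t nbits)
     {
         for (size_t w = 0; w * BITS_PER_LONG < nbits; w++)
             if (map[w])
                 return w * BITS_PER_LONG +
                        (size_t)__builtin_ctzl(map[w]);
         return nbits;    /* no bit set */
     }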

And it turns out byte order matters. You absolutely want to be able
to mix the bitops with simple initializers, ie

     unsigned long flags = 1ul << BITPOS;
     ...
     clear_bit(BITPOS, &flags);

and that means clear_bit() really _fundamentally_ works only on
arrays of "unsigned long", because on big-endian machines the
physical location of a given bit position depends on the chunk size.
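
You can see that dependency with a small standalone program (plain
userspace C, nothing kernel-specific) that prints which byte "bit 0"
lands in for an unsigned long versus an unsigned int. On little-endian
both answers are byte 0; on a 64-bit big-endian machine they are
bytes 7 and 3 respectively, so a bitop that assumes unsigned long
would poke the wrong byte of a smaller type:

     #include <stdio.h>
     #include <string.h>

     int main(void)
     {
         unsigned long lval = 1UL;    /* "bit 0" set, as unsigned long */
         unsigned int  ival = 1U;     /* "bit 0" set, as unsigned int  */
         unsigned char b[sizeof(lval)];
         size_t i;

         memcpy(b, &lval, sizeof(lval));
         for (i = 0; i < sizeof(lval); i++)
             if (b[i])
                 printf("unsigned long: bit 0 is in byte %zu\n", i);

         memcpy(b, &ival, sizeof(ival));
         for (i = 0; i < sizeof(ival); i++)
             if (b[i])
                 printf("unsigned int:  bit 0 is in byte %zu\n", i);
         return 0;
     }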

And yes, big-endianness is a disease (and yes, bit ordering is
literally one of the reasons), and it's happily mostly gone, but sadly
"mostly" is not "entirely".

So we are more or less stuck with "bit operations fundamentally work
on arrays of unsigned long", and no, you *cannot* use them on smaller
types because of the horror that is big-endianness.

Could we do "u32 bitops"? Yeah, but because of all the endianness
problems, it really would end up having to be a whole new set of
interfaces.
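
Just to sketch what "a whole new set of interfaces" would mean in
practice: a hypothetical set_bit_u32() (this is not a kernel API,
purely illustrative, assuming 64-bit unsigned long) that forwards to
the long-based bitops would have to fix up the bit number based on
where the u32 sits inside the containing long, and the fixup differs
by endianness:

     /* Kernel-style sketch, not a real interface. */
     static inline void set_bit_u32(unsigned int nr, u32 *addr)
     {
         unsigned long p = (unsigned long)addr;
         unsigned long *word = (unsigned long *)(p & ~7UL);

     #ifdef __BIG_ENDIAN
         nr += (p & 4) ? 0 : 32;   /* BE: first u32 is the high half */
     #else
         nr += (p & 4) ? 32 : 0;   /* LE: first u32 is the low half  */
     #endif
         set_bit(nr, word);        /* the normal long-based bitop */
     }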

We do, btw, have a special case: we support the notion of
"little-endian bitmaps".  When you actually have data structures that
are binary objects across architectures, you need to have a sane
*portable* bitmap representation, and the only sane model is the
little-endian one that basically is the same across any type size (ie
"bit 0" is the same physical bit in a byte array as it is in a word
array and in an unsigned long array).

So you have things like filesystems using this model with test_bit_le()
and friends. But note that that does add extra overhead on BE
machines. And while we probably *should* have just said that the
normal bitops are always little-endian, we didn't, because by the time
BE machines were an issue, we already had tons of those simple
initializers (that would now need to use some helper macro to do the
bit swizzling on big-endian HW).
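
The little-endian convention itself is trivial to state in code: bit
N lives in byte N/8, at position N%8 within that byte, no matter what
the host endianness is. A minimal sketch of the semantics (not the
kernel's actual test_bit_le(), which operates on unsigned long arrays
but with this same byte-level meaning):

     #include <stdbool.h>
     #include <stddef.h>

     /* Portable little-endian bitmap test: the answer is the same
      * whether 'addr' is viewed as bytes, u32s, or unsigned longs. */
     static bool test_bit_le_sketch(size_t nr, const void *addr)
     {
         const unsigned char *bytes = addr;
         return (bytes[nr / 8] >> (nr % 8)) & 1;
     }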

So I suspect we're kind of stuck with it. If you use the bit
operations - including the wait/wake_bit stuff - you *have* to use
"unsigned long".

Note that the "var" versions (wake_up_var() and friends) don't
actually care about the size of the variable. They never look at the
value in memory, so they basically just treat the address of the
variable as a cookie for waiting. So you can use the "var" versions
with absolutely anything.
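
For example, something like this kernel-style sketch (hypothetical
code, but wait_var_event()/wake_up_var() are the real interfaces)
works fine even though the variable is an atomic_t rather than an
unsigned long:

     static atomic_t pending_io;

     static void io_done(void)
     {
         /* The address of 'pending_io' is only a hash key for the
          * waitqueue; the wake path never dereferences it. */
         if (atomic_dec_and_test(&pending_io))
             wake_up_var(&pending_io);
     }

     static void wait_for_io(void)
     {
         wait_var_event(&pending_io, atomic_read(&pending_io) == 0);
     }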

[ Side note: the wake_up_bit() interface is broken garbage. It uses
"void *word" for the word. That's very dangerous because it allows
type mis-use without warnings. I didn't notice until I checked.
Thankfully wait_on_bit() itself has the right signature, so hopefully
nobody ever does that ]
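
To spell out the hole: with the signatures as of this thread
(wake_up_bit(void *word, int bit) versus wait_on_bit(unsigned long
*word, ...)), a fragment like the following (illustrative, not real
code anywhere) compiles the wake side silently while the wait side at
least gets a pointer-type warning:

     static unsigned int small_flags;   /* wrong type for bitops */

     static void example(void)
     {
         /* Compiles silently: "void *" accepts any pointer at all. */
         wake_up_bit(&small_flags, 0);

         /* At least warns (-Wincompatible-pointer-types), because
          * wait_on_bit() takes "unsigned long *". */
         wait_on_bit(&small_flags, 0, TASK_UNINTERRUPTIBLE);
     }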

            Linus
