linux-kernel - Re: [GIT PULL] bcachefs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wie4U8hwRN+nYRwV4G51qXPJKr0DpjbxO1XSMZnPA_LTw@mail.gmail.com>
Date:   Thu, 10 Aug 2023 09:40:08 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Kent Overstreet <kent.overstreet@...ux.dev>
Cc:     linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-bcachefs@...r.kernel.org, djwong@...nel.org,
        dchinner@...hat.com, sandeen@...hat.com, willy@...radead.org,
        josef@...icpanda.com, tytso@....edu, bfoster@...hat.com,
        jack@...e.cz, andreas.gruenbacher@...il.com, brauner@...nel.org,
        peterz@...radead.org, akpm@...ux-foundation.org,
        dhowells@...hat.com, snitzer@...nel.org, axboe@...nel.dk
Subject: Re: [GIT PULL] bcachefs

On Thu, 10 Aug 2023 at 08:55, Kent Overstreet <kent.overstreet@...ux.dev> wrote:
>
> Heh, I liked the bitfields - I prefer that to open coding structs, which
> is a major pet peeve of mine. But the text size went down a lot a lot
> without them (would like to know why the compiler couldn't constant fold
> all that stuff out, but... not enough to bother).

Bitfields are actually absolutely *horrioble* for many many reasons.
The bit ordering being undefined is just one of them.

Yes, they are convenient syntax, but the downsides of them means that
you should basically only use them for code that has absolutely zero
subtle issues.

Avoid them like the plague with any kind of "data transfer issues", so
in the kernel avoid using them for user interfaces unless you are
willing to deal with the duplication and horror of
__LITTLE_ENDIAN_BITFIELD etc.

Also avoid them if there is any chance of "raw format" issues: either
saving binary data formats, or - as in your original code - using
unions.

As I pointed out your code was actively buggy, because you thought it
was little-endian. That's not even true on little-endian machines (ie
POWERPC is big-endian in bitfields, even when being little-endian in
bytes!).

Finally, as you found out, it also easily generates horrid code. It's
just _harder_ for compilers to do the right thing, particularly when
it's not obvious that other parts of the structure may be "don't care"
because they got initialized earlier (or will be initialized later).
Together with threading requirements, compilers might do a bad job
either because of the complexity, or simply because of subtle
consistency rules.

End result: by all means use bitfields for the *simple* cases where
they are used purely for internal C code with no form of external
exposure, but be aware that even then the syntax convenience easily
comes at a cost.

> > On x86, you'd never see that as an issue, since all writes are
> > releases, so the 'barrier()' compiler ordering ends up forcing the
> > right magic.
>
> Yep, agreed.

But you should realize that on other architectures, I think that
"barrier() + plain write" is actively buggy. On x86 it's safe, but on
arm (and in fact pretty much anything but s390), the barrier() does
nothing in SMP. Yes, it constrains the compiler, but the earlier
writes to remove the entry from the list may happen *later* as far as
other CPUs are concerned.

Which can be a huge problem if the "struct six_lock_waiter" is on the
stack - which I assume it is - and the waiter is just spinning on
w->lock_acquired. The data structure may be re-used as regular stack
space by the time the list removal code happens.

Debugging things like that is a nightmare. And you'll never see it on
x86, and it doesn't look possible when looking at the code, and the
oopses on other architectures will be completely random stack
corruption some time after it got the lock.

So this is kind of why I worry about locking. It's really easy to
write code that works 99.9% of the time, but then breaks when you are
unlucky. And depending on the pattern, the "when you are unlucky" may
or may not be possible on x86. It's not like x86 has total memory
ordering either, it's just stronger than most.

> > Some of the other oddity is around the this_cpu ops, but I suspect
> > that is at least partly then because we don't have acquire/release
> > versions of the local cpu ops that the code looks like it would want.
>
> You mean using full barriers where acquire/release would be sufficient?

Yes.

That code looks like it should work, but be hugely less efficient than
it might be. "smp_mb()" tends to be expensive everywhere, even x86.

Of course, I might be missing some other cases. That percpu reader
queue worries me a bit just because it ends up generating ordering
based on two different things - the lock word _and_ the percpu word.

And I get very nervous if the final "this gets the lock" isn't some
obvious "try_cmpxchg_acquire()" or similar, just because we've
historically had *so* many very subtle bugs in just about every single
lock we've ever had.

> Matthew was planning on sending the iov_iter patch to you - right around
> now, I believe, as a bugfix, since right now
> copy_page_from_iter_atomic() silently does crazy things if you pass it a
> compound page.
>
> Block layer patches aside, are there any _others_ you really want to go
> via maintainers?

It was mainly just the iov and the block layer.

The superblock cases I really don't understand why you insist on just
being different from everybody else.

Your exclusivity arguments make no sense to me. Just open the damn
thing. No other filesystem has ever had the fundamental problems you
describe. You can do any exclusivity test you want in the
"test()/set()" functions passed to sget().

You say that it's a problem because of a "single spinlock", but it
hasn't been a problem for anybody else.

I don't understand why you are so special. The whole problem seems made-up.

> More broadly, it would make me really happy if we could get certain
> people to take a more constructive, "what do we really care about here
> and how do we move forward" attitude instead of turning every
> interaction into an opportunity to dig their heels in on process and
> throw up barriers.

Honestly, I think one huge problem here is that you've been working on
this for a long time (what - a decade by now?) and you've made all
these decisions that you explicitly wanted to be done independently
and intentionally outside the kernel.

And then you feel that "now it's ready to be included", and think that
all those decisions you made outside of the mainline kernel now *have*
to be done that way, and basically sent your first pull request as a
fait-accompli.

The six-locks showed some of that, but as long as they are
bcachefs-internal, I don't much care.

The sget() thing really just smells like "this is how I designed
things, and that's it".

                Linus