linux-kernel - Re: [GIT PULL] bcachefs fixes for 6.16-rc3

Message-ID: <CALJXSJrWjsAgN8HDUAhr5WYB97_YS57PuAhwpRctpNFU6=4AKQ@mail.gmail.com>
Date: Sat, 21 Jun 2025 17:07:51 -0400
From: Jérôme Poulin <jeromepoulin@...il.com>
To: linux-bcachefs@...r.kernel.org
Cc: Kent Overstreet <kent.overstreet@...ux.dev>, "Theodore Ts'o" <tytso@....edu>, 
	Martin Steigerwald <martin@...htvoll.de>, Jani Partanen <jiipee@...apeli.fi>, 
	Linus Torvalds <torvalds@...ux-foundation.org>, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [GIT PULL] bcachefs fixes for 6.16-rc3

As a bcachefs user who has been following this discussion, I'd like to
share my perspective on the current state of the filesystem and the
path forward.

I'm currently using this filesystem for a backup staging server, so it
is easy for me to make sure data isn't getting lost, and I can verify
checksums at the application level from time to time. The solution
uses snapshots extensively as well as replication, reflinks and
background compression.

I really like this filesystem for multiple reasons: it fills the gap
left by features missing from traditional filesystems, it allows
integrating cache devices almost seamlessly, and it allows keeping
metadata on local devices while still pushing data to slow HDD, SMR or
network devices without having to set up something like Ceph or a
stack like Btrfs, mdadm and nbd/iSCSI.  I've seen all the features
appear one by one in Bcachefs and it is growing fast.

I migrated from Btrfs after an incident with a RAID controller losing
its cache that caused the filesystem to be unmountable and
unrepairable.  Btrfs restore was able to recover *most* of the files
on that server except a couple subvolumes which had to be recreated by
the backup system.  And again, since this is a staging area for
backups, I don't need 100% uptime or a guarantee that my files won't
be lost so I felt pretty confident in using Bcachefs to speed up
operations there.

Bcachefs was able to triple the speed of the backup system by having
metadata stored in NVMe + passively caching all writes to NVMe.  The
last part of the backup is now blazing fast since everything is in
NVMe.

At this point in time, I do believe Bcachefs has solid foundations. As
of now, the only data corruption that lost me some files was related
to a snapshot deletion bug in a feature that was not yet published to
mainline.

It hasn't been without its downsides. Many times I had to take the
filesystem offline for repair, and Kent was always able to figure out
the root cause of the issues keeping the FS from mounting read-write
and issue a patch for the FS and for fsck.  We found many weird bugs
together: ARM-specific bugs, reflink causing corruption, resize not
allocating buckets, many races and lockups, upgrades not finishing
correctly, corruption from weird interactions, and data not staying
cached when there's no promote_target.  All of this was fixed without
much more damage than the last operations being lost, and most were
fixed really quickly from cat'ing a couple of diagnostic files, using
perf or, in the worst case, a metadata image.

The filesystem is very resilient to being rebooted anywhere, anytime.
It went through many random resets during fsck repairs, fsck
rebuilding the btree from scratch, upgrades, snapshot operations and
journal replay.  It just always recovers, even at places where I
wouldn't expect to be able to hit the power switch. Worst case, it
mounted read-only and needed fsck, but it could always be mounted
read-only.

It also went through losing 6 devices and the write-back cache (that
defective controller, again).  Fsck could repair it with minimal loss
related to recent data. A lot of scary messages in fsck, but it
finished and I could run scrub+rereplicate to finish it off (which
fixed a couple more files).

Where things get a bit more touchy is when combining all those
features together: operations tend to be a bit "racy" with each
other and tend to lock up when multiple features are running or being
used in parallel.  I think this is where we get to the "move fast
break things" part of the filesystem.  The foundation is solid: read,
write, inode creation/deletion, bucket management, all basic POSIX
operations, checksums, scrub, device addition. Many of the
bcachefs-specific operations are stable; being able to set compression,
replication level and data target per folder is awesome stuff and
works well.

From my experience, the less polished parts are snapshots and snapshot
operations, reflink, nocow and multiprocess heavy workloads; those seem
to be where the "experimental" part of the filesystem goes into the
spotlight.  I've been running rotating snapshots on many machines; it
works well until it doesn't and I need to reboot or fsck. Reflink
before 6.14 seemed a bit hacky and could result in errors. Nocow tends
to lock up but isn't really useful with bcachefs anyway. Maybe
casefolding too, which might not be fully tested yet. Those are the
true experimental features and aren't really labelled as such.

We can always say "yes, this is fixed in master, this is fixed in
6.XX-rc4" but it is still experimental and tends to be what causes the
most pain right now.  I think this needs to be communicated more
clearly. If the filesystem loses its experimental label, I think a
subset of features should be gated by filesystem options to reduce the
need for big and urgent rc patches.

The problem is...  when the experimental label is removed, it needs to
be very clear that users aren't expected to be running the latest rc
or the master branch.  All the features marked as stable should have
settled enough that there won't be 6 users requiring a developer to
mount their filesystem read-write or recover files from a catastrophic
race condition.

This is where communication needs to be clear: the bcachefs website,
tools and options should all clearly label features that might require
asking a developer for help, or running the latest release candidate
or a debug version of the kernel.

Bcachefs has very nice unit and integration testing with ktest, but it
isn't enough to represent real-world usage yet, and that's why I think
some features should still be marked as experimental, just like erasure
coding.  The Bcachefs filesystems where I do not use reflink, snapshots
or anything wild, only multiple devices with foreground/promote_target,
replication and compression, have not experienced weird issues or
lockups for many kernel versions now.  Mind you, I'm not using bcachefs
on any rootfs yet, only for specific use cases and patterns that can be
documented.

I care about the future and success of bcachefs as my go-to
filesystem for anything that requires CoW features, robust repair
tools, caching and flexible RAID-like features.  I just don't want it
to get kicked out of the kernel because of huge changesets to fix bugs
on features that shouldn't be used by someone who expects the
filesystem to behave.

It might slow down development a bit to mark some features as
experimental, but it'll remove the pressure of having to push so many
critical bug fixes to make sure users don't experience serious
failures or blindly try to repair their FS using fsck -y without
reporting issues. It reduces the experimental surface to a subset of
features, and it also makes users aware of what they should do if
those features are enabled, e.g.: contact a developer before running
fsck -y, run a recent kernel at all times, etc.

One more thing that I think is missing: many of the submitted patches,
even if it doesn't show, should have Reported-by and Tested-by tags to
help show how many people in the community are working and helping to
make Bcachefs great. It would also make people on the ML aware that
patches aren't just thrown in there; each usually stems from a bug
reported by a community member, who then had to test the resulting
patch.

Anyway, this message is longer than I expected, and I hope it sheds
some light on how I perceive bcachefs from a user standpoint.

Have a great weekend!

On Fri, Jun 20, 2025 at 8:15 PM Kent Overstreet
<kent.overstreet@...ux.dev> wrote:
>
> On Fri, Jun 20, 2025 at 07:35:04PM -0400, Kent Overstreet wrote:
> > So it's hard to fathom what's going on here.
>
> I also need to add that this kind of drama, and these responses to pull
> requests - second guessing technical decisions, outright trash talk -
> have done an incredible amount of damage, and I think it's time to make
> you guys aware of that since it's directly relevant to the story of this
> pull request.
>
> I've put a lot of work into building a real community around bcachefs,
> because that's critical to making it the rock solid, dependable
> filesystem, for everyone, that I intend it to be: building a community
> where people feel free to share observations, bug reports, and where
> people trust that those will be acted on responsibly.
>
> That all gets set back whenever drama like this happens. Last time, the
> casefolding bugfix pull request, ignited a whole vi vs. emacs holy war.
> Every time this happens, the calm, thoughtful people pull back, and all
> I hear from are the angry, dramatic voices.
>
> More than that, I lost a hire because of Linus's constant,
> every-other-pull-request "I'm thinking about removing bcachefs from the
> kernel". It turns out, smart, thoughtful engineers with stable jobs
> become very hesitant about leaving those jobs when that happens, and
> that's all their co-workers are seeing.
>
> And the first thing that got cancelled/put aside because of that - work
> that was in progress, and hasn't been completed - was tooling for
> comprehensive programmatic fault injection for on disk format errors.
> IOW - the tooling and test coverage that would have caught the subvolume
> deletion bug.
>
> That's a really painful loss right now.
>
> Even despite that, bcachefs development has been going incredibly
> smoothly, and it's shaping up fast. Like I mentioned before, 100+ TB
> filesystems are commonplace, users are commenting every release on how
> much smoother it is getting. We are, I hope, only a year or less from being
> able to take the experimental label off, based on the decline in
> critical bug reports I'm seeing.
>
> The only area that gives me cause for concern - and it causes a _lot_ of
> concern - is upstream.
>