Message-ID: <CAH5Ym4j4SbR7DMOZaLWD4v5HOj6Eejv07pcyE3TsWb9R_jFLcA@mail.gmail.com>
Date: Tue, 6 Jan 2026 16:05:09 -0800
From: Sam Edwards <cfsworks@...il.com>
To: Viacheslav Dubeyko <Slava.Dubeyko@....com>
Cc: Xiubo Li <xiubli@...hat.com>, "brauner@...nel.org" <brauner@...nel.org>, 
	"ceph-devel@...r.kernel.org" <ceph-devel@...r.kernel.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "jlayton@...nel.org" <jlayton@...nel.org>, 
	Milind Changire <mchangir@...hat.com>, "idryomov@...il.com" <idryomov@...il.com>, 
	"stable@...r.kernel.org" <stable@...r.kernel.org>
Subject: Re: [PATCH 5/5] ceph: Fix write storm on fscrypted files

On Tue, Jan 6, 2026 at 3:11 PM Viacheslav Dubeyko <Slava.Dubeyko@....com> wrote:
>
> On Mon, 2026-01-05 at 22:53 -0800, Sam Edwards wrote:
> > On Mon, Jan 5, 2026 at 2:34 PM Viacheslav Dubeyko <Slava.Dubeyko@....com> wrote:
> > >
> > > On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > > > CephFS stores file data across multiple RADOS objects. An object is the
> > > > atomic unit of storage, so the writeback code must clean only folios
> > > > that belong to the same object with each OSD request.
> > > >
> > > > CephFS also supports RAID0-style striping of file contents: if enabled,
> > > > each object stores multiple unbroken "stripe units" covering different
> > > > portions of the file; if disabled, a "stripe unit" is simply the whole
> > > > object. The stripe unit is (usually) reported as the inode's block size.
> > > >
> > > > Though the writeback logic could, in principle, lock all dirty folios
> > > > belonging to the same object, its current design is to lock only a
> > > > single stripe unit at a time. Ever since this code was first written,
> > > > it has determined this size by checking the inode's block size.
> > > > However, the relatively new fscrypt support needed to reduce the block
> > > > size for encrypted inodes to the crypto block size (see the 'Fixes' commit),
> > > > which causes an unnecessarily high number of write operations (~1024x as
> > > > many, with 4 MiB objects) and grossly degraded performance.
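
For concreteness: with the default CephFS layout, striping is effectively
disabled, so the stripe unit equals the 4 MiB object size. Assuming the
4 KiB fscrypt crypto block size, writing back a fully dirty object then
goes from one OSD request to roughly

    4 MiB / 4 KiB = 1024 requests

which is where the ~1024x figure above comes from; the exact factor
depends on the actual layout and crypto block size.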
> >
> > Hi Slava,
> >
> > > Do you have any benchmarking results that prove your point?
> >
> > I haven't done any "real" benchmarking for this change. On my setup
> > (closer to a home server than a typical Ceph deployment), sequential
> > write throughput increased from ~1.7 to ~66 MB/s with this patch
> > applied. I don't consider this single datapoint representative, so
> > rather than presenting it as a general benchmark in the commit
> > message, I chose the qualitative wording "grossly degraded
> > performance." Actual impact will vary depending on workload, disk
> > type, OSD count, etc.
> >
> > Those curious about the bug's performance impact in their environment
> > can find out without enabling fscrypt, using: mount -o wsize=4096
> >
> > However, the core rationale for my claim is based on principles, not
> > on measurements: batching writes into fewer operations necessarily
> > spreads per-operation overhead across more bytes. So this change
> > removes an artificial per-op bottleneck on sequential write
> > performance. The exact impact varies, but the patch does improve
> > (fscrypt-enabled) write throughput in nearly every case.
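
A back-of-envelope version of that argument, with illustrative numbers
rather than measurements: if writeback issues its OSD requests serially
and each request carries a fixed per-request cost t_op on top of the
transfer time, sequential throughput is bounded above by wsize / t_op.
Assuming t_op = 1 ms for illustration:

    wsize = 4 KiB  ->  at most ~4 MB/s
    wsize = 4 MiB  ->  the same 1 ms is amortized over 1024x more bytes

The real per-request cost and degree of parallelism vary by environment,
which is exactly why no absolute speedup is promised.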
> >
>

Hi Slava,

> If you claim in the commit message that "this patch fixes performance
> degradation", then you MUST share the evidence (benchmarking results).
> Otherwise, you cannot make such statements in the commit message. Yes, it
> could be a good fix, but please don't promise a performance improvement,
> because end-users have very different environments and workloads, and what
> works on your side may not work for other use-cases and environments.

I agree with the underlying concern: I do not have a representative
environment, and it would be irresponsible to promise or quantify a
specific speedup. For that reason, the commit message does not claim
any particular improvement factor.

What it does say is that this patch fixes a known performance degradation
caused by an artificial limit on the write batch size. That statement is
about correctness relative to the previous behavior, not about guaranteeing
any specific performance outcome for end users.

> Potentially, you could describe your environment and workload, and share
> your estimation/expectation of the potential performance improvement.

I don’t think that would be useful here. As you pointed out, any such
numbers would be highly workload- and environment-specific and would
not be representative or actionable. The purpose of this patch is
simply to remove an unintentional limit, not to advertise or promise
measurable gains.

Best,
Sam

>
> Thanks,
> Slava.
>
> > Warm regards,
> > Sam
> >
> >
> > >
> > > Thanks,
> > > Slava.
> > >
> > > >
> > > > Fix this (and clarify intent) by using i_layout.stripe_unit directly in
> > > > ceph_define_write_size() so that encrypted inodes are written back with
> > > > the same number of operations as if they were unencrypted.
> > > >
> > > > Fixes: 94af0470924c ("ceph: add some fscrypt guardrails")
> > > > Cc: stable@...r.kernel.org
> > > > Signed-off-by: Sam Edwards <CFSworks@...il.com>
> > > > ---
> > > >  fs/ceph/addr.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > > index b3569d44d510..cb1da8e27c2b 100644
> > > > --- a/fs/ceph/addr.c
> > > > +++ b/fs/ceph/addr.c
> > > > @@ -1000,7 +1000,8 @@ unsigned int ceph_define_write_size(struct address_space *mapping)
> > > >  {
> > > >       struct inode *inode = mapping->host;
> > > >       struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> > > > -     unsigned int wsize = i_blocksize(inode);
> > > > +     struct ceph_inode_info *ci = ceph_inode(inode);
> > > > +     unsigned int wsize = ci->i_layout.stripe_unit;
> > > >
> > > >       if (fsc->mount_options->wsize < wsize)
> > > >               wsize = fsc->mount_options->wsize;
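
For reference, a sketch of how the patched function reads, reconstructed
from the hunk's context lines above (not the verbatim upstream body; the
rest of the function is elided):

    unsigned int ceph_define_write_size(struct address_space *mapping)
    {
            struct inode *inode = mapping->host;
            struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
            struct ceph_inode_info *ci = ceph_inode(inode);
            /* Stripe unit from the file layout (typically 4 MiB), rather
             * than i_blocksize(), which is the 4 KiB crypto block size on
             * fscrypt-enabled inodes. */
            unsigned int wsize = ci->i_layout.stripe_unit;

            /* The mount-time wsize= option still caps the request size,
             * which is why "mount -o wsize=4096" reproduces the pre-patch
             * behavior without enabling fscrypt. */
            if (fsc->mount_options->wsize < wsize)
                    wsize = fsc->mount_options->wsize;

            /* ... remainder unchanged and not shown in the hunk ... */
    }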
