linux-kernel - Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length

Message-ID: <CAH5Ym4gTKZue4r8URmgo+BBLJcQ+xKzEm7_P4xo1=XEwfUuv1A@mail.gmail.com>
Date: Tue, 13 Jan 2026 17:28:22 -0800
From: Sam Edwards <cfsworks@...il.com>
To: Ilya Dryomov <idryomov@...il.com>
Cc: Xiubo Li <xiubli@...hat.com>, Jeff Layton <jlayton@...nel.org>, ceph-devel@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length

On Tue, Jan 13, 2026 at 12:15 PM Ilya Dryomov <idryomov@...il.com> wrote:
>
> On Tue, Jan 13, 2026 at 8:04 PM Sam Edwards <cfsworks@...il.com> wrote:
> >
> > On Tue, Jan 13, 2026 at 9:27 AM Ilya Dryomov <idryomov@...il.com> wrote:
> > >
> > > On Tue, Jan 13, 2026 at 4:31 AM Sam Edwards <cfsworks@...il.com> wrote:
> > > >
> > > > When the OSD replies to a sparse-read request, but no extents matched
> > > > the read (because the object is empty, the read requested a region
> > > > backed by no extents, ...) it is expected to reply with two 32-bit
> > > > zeroes: one indicating that there are no extents, the other that the
> > > > total bytes read is zero.
> > > >
> > > > In certain circumstances (e.g. on Ceph 19.2.3, when the requested object
> > > > is in an EC pool), the OSD sends back only one 32-bit zero. The
> > > > sparse-read state machine will end up reading something else (such as
> > > > the data CRC in the footer) and get stuck in a retry loop like:
> > > >
> > > >   libceph:  [0] got 0 extents
> > > >   libceph: data len 142248331 != extent len 0
> > > >   libceph: osd0 (1)...:6801 socket error on read
> > > >   libceph: data len 142248331 != extent len 0
> > > >   libceph: osd0 (1)...:6801 socket error on read
> > > >
> > > > This is probably a bug in the OSD, but even so, the kernel must handle
> > > > it to avoid misinterpreting replies and entering a retry loop.
> > >
> > > Hi Sam,
> > >
> >
> > Hey Ilya,
> >
> > > Yes, this is definitely a bug in the OSD (and I also see another
> > > related bug in the userspace client code above the OSD...).  The
> > > triggering condition is a sparse read beyond the end of an existing
> > > object on an EC pool.  19.2.3 isn't the problem -- main branch is
> > > affected as well.
> > >
> > > If this was one of the common paths, I'd support adding some sort of
> > > a workaround to "handle" this in the kernel client.  However, sparse
> > > reads are pretty useless on EC pools because they just get converted
> > > into regular thick reads.  Sparse reads offer potential benefits only
> > > on replicated pools, but the kernel client doesn't use them by default
> > > there either.  The sparseread mount option that is necessary for the
> > > reproducer to work isn't documented and was added purely for testing
> > > purposes.
> >
> > Note that the kernel client forces sparse reads when using fscrypt
> > (see linux-6.18/fs/ceph/addr.c:361) and I encountered this problem
> > organically as a result. It may still make sense to apply a kernel
> > workaround.
> >
> > On the other hand, it sounds like fscrypt+EC is a niche corner case,
> > we've now established that the OSD is definitely not following the
> > protocol, and working around this client-side is more involved than
> > just fixing this in the OSD. So I think simply telling affected users
> > to update their OSDs is also a reasonable way to handle this.
>
> fscrypt and EC can't be mixed -- fscrypt+EC doesn't really work.  The
> reason sparse reads are forced for fscrypt is that the client relies on
> the sparseness metadata to be able to tell if a given 4K block in the
> encrypted file is a hole (in the PUNCH_HOLE sense) or not.  If it's
> a hole, POSIX dictates that a read should return zeroes.  On an EC pool
> where sparse reads are degraded into regular thick reads by the OSD,
> a hole in the middle of an object wouldn't ever be signaled.  Instead,
> the OSD would synthesize a bunch of zeroes and pass them to the client.
> The client would then run them through the crypto engine (believing
> it's a bona fide ciphertext) and return the resulting gibberish to the
> user, thus violating POSIX and widespread assumptions about generic
> filesystem behavior.

Oof, thanks for the heads-up! Fortunately my workload tolerates
garbage in holes... with the occasional (now-explained) warning, that
is. :)

I don't see the fscrypt+EC limitation mentioned in the kernel or Ceph
docs, so I'm guessing this is more a "known major limitation" than an
out-of-scope use case. The CephFS client already blocks PUNCH_HOLE for
encrypted inodes, but by writing into the middle of an empty object, I
was able to form a hole organically and reproduce the garbage you
describe.

EC is complex, so I wouldn't have been surprised if it simply didn't
have a way to store objects with holes at all. But I was caught off
guard to learn that the hard part of this problem is communicating the
hole to the client. My intuition was that the read path must already
be detecting "no data here" in order to synthesize filler zeroes, but
it sounds like that information doesn't survive as explicit metadata.
Clearly I have more to learn about the EC read pipeline.

Cheers,
Sam
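
Illustrative sketch (not part of the original message): the hole problem
described above can be modelled in a few lines. This is a toy model only.
The SHA-256-based XOR keystream stands in for fscrypt's real cipher, and
the names toy_keystream, toy_decrypt, read_block and is_hole are all
hypothetical. It shows why the client needs the sparseness metadata: when
a hole is delivered as synthesized zeroes, decrypting them yields
gibberish instead of the zeroes POSIX requires.

```python
import hashlib

BLOCK_SIZE = 4096  # the 4K block granularity mentioned in the thread


def toy_keystream(key: bytes, block_no: int) -> bytes:
    """Derive a per-block keystream (toy stand-in for a real cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < BLOCK_SIZE:
        seed = key + block_no.to_bytes(8, "little") + counter.to_bytes(4, "little")
        out += hashlib.sha256(seed).digest()
        counter += 1
    return bytes(out[:BLOCK_SIZE])


def toy_decrypt(key: bytes, block_no: int, ciphertext: bytes) -> bytes:
    """XOR the block with its keystream (encryption is the same operation)."""
    ks = toy_keystream(key, block_no)
    return bytes(c ^ k for c, k in zip(ciphertext, ks))


def read_block(key: bytes, block_no: int, on_wire: bytes, is_hole: bool) -> bytes:
    """Return the plaintext for one block.

    is_hole models the sparseness metadata: True means the OSD reported
    no extent covering this block, so the client returns zeroes without
    touching the cipher.
    """
    if is_hole:
        return bytes(BLOCK_SIZE)
    return toy_decrypt(key, block_no, on_wire)
```

With is_hole=True the reader sees zeroes. If the same hole arrives as
BLOCK_SIZE synthesized zero bytes with is_hole=False, which is how the
thread describes sparse reads degrading on an EC pool, the reader gets
the raw keystream back, i.e. garbage.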

>
> >
> > I'll defer to you.
> >
> > >
> > > >
> > > > Detect this condition when the extent count is zero by checking the
> > > > `payload_len` field of the op reply. If it is only big enough for the
> > > > extent count, conclude that the data length is omitted and skip to the
> > > > next op (which is what the state machine would have done immediately
> > > > upon reading and validating the data length, if it were present).
> > > >
> > > > ---
> > > >
> > > > Hi list,
> > > >
> > > > RFC: This patch is submitted for comment only. I've tested it for about
> > > > 2 weeks now and am satisfied that it prevents the hang, but the current
> > > > approach decodes the entire op reply body while still in the
> > > > data-gathering step, which is suboptimal; feedback on cleaner
> > > > alternatives is welcome!
> > > >
> > > > I have not searched for nor opened a report with Ceph proper; I'd like a
> > > > second pair of eyes to confirm that this is indeed an OSD bug before I
> > > > proceed with that.
> > >
> > > Let me know if you want me to file a Ceph tracker ticket on your
> > > behalf.  I have a draft patch for the bug in the OSD and would link it
> > > in the PR, crediting you as a reporter.
> >
> > Please do! I'm also interested in seeing the patch -- the OSD code is
> > pretty dense and I couldn't find the EC sparse read handler.
>
> https://github.com/ceph/ceph/pull/66912
>
> Thanks,
>
>                 Ilya
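
Illustrative sketch (not part of the original thread): the detection idea
from the RFC description can be shown with a small userspace parser. It
is not the kernel patch. The layout assumed here is a little-endian
32-bit extent count, then one (offset, length) pair of 64-bit values per
extent, then a 32-bit data length, then the data. The two 32-bit fields
follow the description in the thread; the extent encoding and the name
parse_sparse_read are assumptions made for illustration.

```python
import struct

U32 = struct.Struct("<I")      # extent count and data length fields
EXTENT = struct.Struct("<QQ")  # assumed extent encoding: offset, length


def parse_sparse_read(payload: bytes):
    """Parse one sparse-read reply payload into (extents, data).

    Tolerates the quirk discussed in the thread: a reply with zero
    extents whose payload is only large enough for the extent count.
    """
    if len(payload) < U32.size:
        raise ValueError("payload too short for extent count")
    (count,) = U32.unpack_from(payload, 0)
    pos = U32.size

    # Quirk: zero extents and nothing after the count. Treat the data
    # length as omitted rather than reading whatever bytes follow.
    if count == 0 and len(payload) == U32.size:
        return [], b""

    extents = []
    for _ in range(count):
        if pos + EXTENT.size > len(payload):
            raise ValueError("payload truncated inside extent list")
        extents.append(EXTENT.unpack_from(payload, pos))
        pos += EXTENT.size

    if pos + U32.size > len(payload):
        raise ValueError("payload too short for data length")
    (data_len,) = U32.unpack_from(payload, pos)
    pos += U32.size

    # Same consistency check that produces the log line in the report.
    extent_len = sum(length for _, length in extents)
    if data_len != extent_len:
        raise ValueError(f"data len {data_len} != extent len {extent_len}")

    data = payload[pos:pos + data_len]
    if len(data) != data_len:
        raise ValueError("payload truncated inside data")
    return extents, data
```

A well-formed empty reply (two 32-bit zeroes) and the short reply (one
32-bit zero) both parse to no extents and no data. Without the quirk
branch, the short reply would fail the "too short for data length"
check here; the kernel state machine instead reads the next bytes on
the wire, which is where the bogus data length in the log comes from.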