Message-ID: <CAH5Ym4gHUG326s8XoBxVExo1ZspSd0n+x3t=+rJ8N9qgjxgGHg@mail.gmail.com>
Date: Tue, 13 Jan 2026 11:04:36 -0800
From: Sam Edwards <cfsworks@...il.com>
To: Ilya Dryomov <idryomov@...il.com>
Cc: Xiubo Li <xiubli@...hat.com>, Jeff Layton <jlayton@...nel.org>, ceph-devel@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] libceph: Handle sparse-read replies lacking data length
On Tue, Jan 13, 2026 at 9:27 AM Ilya Dryomov <idryomov@...il.com> wrote:
>
> On Tue, Jan 13, 2026 at 4:31 AM Sam Edwards <cfsworks@...il.com> wrote:
> >
> > When the OSD replies to a sparse-read request but no extents matched
> > the read (because the object is empty, the read requested a region
> > backed by no extents, ...), it is expected to reply with two 32-bit
> > zeroes: one indicating that there are no extents, the other that the
> > total number of bytes read is zero.
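For illustration, here is a minimal userspace sketch of decoding that
reply tail. It assumes the layout described above -- a le32 extent
count, count {le64 offset, le64 length} extent pairs, then a le32 total
data length -- and the struct/function names are mine, not libceph's:

    #include <endian.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* One extent descriptor as carried in the sparse-read reply. */
    struct sr_extent {
            uint64_t off;
            uint64_t len;
    };

    /*
     * Decode the sparse-read payload: a le32 extent count, that many
     * {le64 off, le64 len} extents, then a le32 total data length.
     * Returns 0 on success, -1 if the buffer is shorter than the format
     * requires -- which is exactly the situation when the OSD omits the
     * trailing data length.
     */
    static int decode_sparse_read_hdr(const uint8_t *p, size_t avail,
                                      uint32_t *count, uint32_t *data_len)
    {
            uint32_t tmp;
            size_t need;

            if (avail < sizeof(tmp))
                    return -1;
            memcpy(&tmp, p, sizeof(tmp));
            *count = le32toh(tmp);                  /* number of extents */

            need = sizeof(tmp) + (size_t)*count * sizeof(struct sr_extent) +
                   sizeof(tmp);
            if (avail < need)
                    return -1;                      /* truncated reply */

            memcpy(&tmp, p + need - sizeof(tmp), sizeof(tmp));
            *data_len = le32toh(tmp);               /* total bytes read */
            return 0;
    }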
> >
> > In certain circumstances (e.g. on Ceph 19.2.3, when the requested object
> > is in an EC pool), the OSD sends back only one 32-bit zero. The
> > sparse-read state machine will end up reading something else (such as
> > the data CRC in the footer) and get stuck in a retry loop like:
> >
> > libceph: [0] got 0 extents
> > libceph: data len 142248331 != extent len 0
> > libceph: osd0 (1)...:6801 socket error on read
> > libceph: data len 142248331 != extent len 0
> > libceph: osd0 (1)...:6801 socket error on read
> >
> > This is probably a bug in the OSD, but even so, the kernel must handle
> > it to avoid misinterpreting replies and entering a retry loop.
>
> Hi Sam,
>
Hey Ilya,
> Yes, this is definitely a bug in the OSD (and I also see another
> related bug in the userspace client code above the OSD...). The
> triggering condition is a sparse read beyond the end of an existing
> object on an EC pool. 19.2.3 isn't the problem -- main branch is
> affected as well.
>
> If this was one of the common paths, I'd support adding some sort of
> a workaround to "handle" this in the kernel client. However, sparse
> reads are pretty useless on EC pools because they just get converted
> into regular thick reads. Sparse reads offer potential benefits only
> on replicated pools, but the kernel client doesn't use them by default
> there either. The sparseread mount option that is necessary for the
> reproducer to work isn't documented and was added purely for testing
> purposes.
Note that the kernel client forces sparse reads when using fscrypt
(see linux-6.18/fs/ceph/addr.c:361) and I encountered this problem
organically as a result. It may still make sense to apply a kernel
workaround.
On the other hand: fscrypt+EC sounds like a niche corner case, we've
now established that the OSD is definitely not following the protocol,
and working around this client-side is more involved than just fixing
it in the OSD. So I think simply telling affected users to update their
OSDs is also a reasonable way to handle this.
I'll defer to you.
>
> >
> > Detect this condition when the extent count is zero by checking the
> > `payload_len` field of the op reply. If it is only big enough for the
> > extent count, conclude that the data length is omitted and skip to the
> > next op (which is what the state machine would have done immediately
> > upon reading and validating the data length, if it were present).
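Roughly the shape of that check, as a standalone sketch -- payload_len
is the field named above, but the helper name is illustrative, not the
actual code in the patch:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * The OSD reported zero extents.  If the op's payload is only large
     * enough to hold the 32-bit extent count itself, the trailing data
     * length was omitted: treat it as zero and move on to the next op
     * instead of consuming unrelated bytes (e.g. the footer CRC).
     */
    static bool sparse_read_data_len_omitted(uint32_t extent_count,
                                             uint32_t payload_len)
    {
            return extent_count == 0 && payload_len <= sizeof(uint32_t);
    }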
> >
> > ---
> >
> > Hi list,
> >
> > RFC: This patch is submitted for comment only. I've tested it for about
> > 2 weeks now and am satisfied that it prevents the hang, but the current
> > approach decodes the entire op reply body while still in the
> > data-gathering step, which is suboptimal; feedback on cleaner
> > alternatives is welcome!
> >
> > I have not searched for nor opened a report with Ceph proper; I'd like a
> > second pair of eyes to confirm that this is indeed an OSD bug before I
> > proceed with that.
>
> Let me know if you want me to file a Ceph tracker ticket on your
> behalf. I have a draft patch for the bug in the OSD and would link it
> in the PR, crediting you as a reporter.
Please do! I'm also interested in seeing the patch -- the OSD code is
pretty dense and I couldn't find the EC sparse read handler.
>
> >
> > Reproducer (Ceph 19.2.3, CephFS with an EC pool already created):
> > mount -o sparseread ... /mnt/cephfs
> > cd /mnt/cephfs
> > mkdir ec/
> > setfattr -n ceph.dir.layout.pool -v 'cephfs-data-ecpool' ec/
> > echo 'Hello world' > ec/sparsely-packed
> > truncate -s 1048576 ec/sparsely-packed
> > # Read from a hole-backed region via sparse read
> > dd if=ec/sparsely-packed bs=16 skip=10000 count=1 iflag=direct | xxd
> > # The read hangs and triggers the retry loop described in the patch
> >
> > Hope this works,
> > Sam
> >
> > PS: I would also like to write a pair of patches to our messenger v1/v2
> > clients to check explicitly that sparse reads consume exactly the number
> > of bytes in the data section, as I see there have already been previous
> > bugs (including CVE-2023-52636) where the sparse-read machinery gets out
> > of sync with the incoming TCP stream. Has this already been proposed?
>
> Not that I'm aware of. An additional safety net would be welcome as
> long as it doesn't end up too invasive, of course.
Time permitting, I'll see about fixing read_partial_message() to use
con->v1.in_base_pos consistently, to count the data bytes consumed by
sparse reads with it, and to fail with a more specific error_msg when a
length mismatch is detected. (I do not have a plan for messenger v2
yet.)
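A very rough sketch of that safety net; consumed/data_len/error_msg
here only stand in for whatever the messenger actually tracks (e.g. via
in_base_pos), not the real fields:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Once the sparse-read state machine claims to be done, compare the
     * number of data-section bytes it actually consumed against the data
     * length advertised in the message header.  Any mismatch means the
     * parser has lost sync with the TCP stream, so fail the connection
     * with a specific error instead of reinterpreting footer bytes.
     */
    static int check_sparse_read_consumed(uint32_t consumed,
                                          uint32_t data_len,
                                          const char **error_msg)
    {
            if (consumed == data_len)
                    return 0;

            *error_msg = "sparse read consumed wrong number of data bytes";
            fprintf(stderr, "sparse read: consumed %u of %u data bytes\n",
                    consumed, data_len);
            return -1;      /* caller resets the session, as on a bad CRC */
    }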
Regards,
Sam
>
> Thanks,
>
> Ilya