Message-ID: <CANubcdVjXbKc88G6gzHAoJCwwxxHUYTzexqH+GaWAhEVrwr6Dg@mail.gmail.com>
Date: Sat, 22 Nov 2025 14:38:59 +0800
From: Stephen Zhang <starzhangzsd@...il.com>
To: Christoph Hellwig <hch@...radead.org>
Cc: linux-kernel@...r.kernel.org, linux-block@...r.kernel.org,
nvdimm@...ts.linux.dev, virtualization@...ts.linux.dev,
linux-nvme@...ts.infradead.org, gfs2@...ts.linux.dev, ntfs3@...ts.linux.dev,
linux-xfs@...r.kernel.org, zhangshida@...inos.cn
Subject: Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
Christoph Hellwig <hch@...radead.org> wrote on Fri, 21 Nov 2025 at 18:37:
>
> On Fri, Nov 21, 2025 at 04:17:39PM +0800, zhangshida wrote:
> > We have captured four instances of this corruption in our production
> > environment.
> > In each case, we observed a distinct pattern:
> > The corruption starts at an offset that aligns with the beginning of
> > an XFS extent.
> > The corruption ends at an offset that is aligned to the system's
> > `PAGE_SIZE` (64KB in our case).
> >
> > Corruption Instances:
> > 1. Start: `0x73be000`,   End: `0x73c0000`   (Length: 8KB)
> > 2. Start: `0x10791a000`, End: `0x107920000` (Length: 24KB)
> > 3. Start: `0x14535a000`, End: `0x145b70000` (Length: 8280KB)
> > 4. Start: `0x370d000`,   End: `0x3710000`   (Length: 12KB)
>
> Do you have a somewhat isolated reproducer for this?
>
=====The background=====
Sorry, we do not have a consistent way to reproduce this issue, as the
data was collected from a production environment run by database
providers. However, the I/O model is straightforward:
t0: A -> B
About 100 minutes later...
t1: A -> C
A, B, and C are three separate physical machines.
A is the primary writer. It runs a MySQL instance where a single thread
appends data to a log file.
B is a reader. At time t0, this server concurrently reads the log file from
A to perform a CRC check for backup purposes. The CRC check confirms
that the data on both A and B is identical and correct.
C is another reader. At time t1 (about 100 minutes after the A -> B backup
completed), this server also reads the log file from A for a CRC check.
However, this check on C indicates that the data is corrupt.
The most unusual part is that upon investigation, the data on the primary
server A was also found to be corrupt at this later time, differing from the
correct state it was in at t0.
Another factor to consider is that memory reclamation is active, as the
environment uses cgroups to restrict resources, including memory.
After inspecting the corrupted data, we believe it did not originate from
any existing file. Instead, it appears to be raw data that was on the disk
before the intended I/O write was successfully completed.
This raises the question: how could a write I/O fail to make it to the disk?
A hardware problem seems unlikely, as different RAID cards and disks
are in use across the systems.
Most importantly, the corruption exhibits a distinct pattern:
It starts at an offset that aligns with the beginning of an XFS extent.
It ends at an offset that aligns with the system's PAGE_SIZE.
The starting address of an extent is an internal concept within the
filesystem, transparent to both user-space applications and lower-level
kernel modules. This makes it highly suspicious that the corruption
always begins at this specific boundary. This suggests a potential bug in
the XFS logic for handling append-writes to a file.
All we can do now is some code analysis, to see if we can catch the bug.
=====Code analysis=====
In kernel version 4.19, XFS handles extent I/O using the ioend structure,
which appears to represent a block of I/O to a contiguous range of disk
space. This is broken down into multiple bio structures, since a single
bio cannot cover a very large I/O range:
| page 1 | page 2 |  ...  | page N |
|<--------------ioend------------->|
|   bio 1   |   bio 2   |  bio 3   |
To manage a large write, a chain of bio structures is used:
bio 1 -> bio 2 -> bio 3
All bios in this chain share a single end_io callback, which should only
be triggered after all I/O operations in the chain have completed.
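As a point of reference, such a chain is typically built with bio_chain()
before each bio is submitted. The helper below is only an illustrative
sketch of that pattern (the name chain_and_submit() is made up here, it is
not the actual XFS code), roughly in the spirit of what the 4.19-era
writeback path does when an ioend needs more than one bio:
---c code----
#include <linux/bio.h>

/*
 * Illustrative sketch only: chain the current bio to a freshly allocated
 * one, submit the current bio, and keep filling the new one.  bio_chain()
 * redirects prev's completion to bio_chain_endio() and bumps next's
 * remaining counter, so the real end_io ends up on the last bio of the
 * chain and should only run once every earlier bio has completed too.
 */
static struct bio *chain_and_submit(struct bio *prev, struct bio *next)
{
        bio_chain(prev, next);
        submit_bio(prev);
        return next;
}
----c code----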
The kernel tracks completion with each bio's bi_remaining atomic counter;
for the chain above, the counters start out as:
1 -> 2 -> 2
If bio 1 completes, the counters become:
1 -> 1 -> 2
If bio 2 completes:
1 -> 1 -> 1
If bio 3 completes:
1 -> 1 -> 0
At that point the end_io callback is triggered, since all I/O in the chain is done.
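To make that counting concrete, here is a minimal userspace model of the
in-order case (plain C, purely illustrative: struct model_bio,
chain_complete() and the rest are invented for this sketch and are not
kernel code). Each entry carries a remaining counter; when a counter drops
to zero, the completion is handed on to the parent bio, and the shared
callback fires only once the last bio's counter reaches zero:
---c code----
#include <stdio.h>

/* Toy stand-in for a chained bio: just a remaining counter and a parent. */
struct model_bio {
        int remaining;
        int parent;     /* index of the parent bio, -1 for the last one */
};

/* Stand-in for the shared end_io callback carried by the last bio. */
static void chain_end_io(void)
{
        printf("end_io fired: the whole chain is considered done\n");
}

/* Model completion: drop our counter and, once it reaches zero, hand the
 * completion on to the parent bio. */
static void chain_complete(struct model_bio *chain, int i)
{
        while (i >= 0 && --chain[i].remaining == 0) {
                if (chain[i].parent < 0) {
                        chain_end_io();
                        return;
                }
                i = chain[i].parent;
        }
}

int main(void)
{
        /* bio1 -> bio2 -> bio3 with the starting counters 1 -> 2 -> 2. */
        struct model_bio chain[3] = {
                { .remaining = 1, .parent = 1 },
                { .remaining = 2, .parent = 2 },
                { .remaining = 2, .parent = -1 },
        };

        chain_complete(chain, 0);       /* bio 1 completes: nothing fires */
        chain_complete(chain, 1);       /* bio 2 completes: nothing fires */
        chain_complete(chain, 2);       /* bio 3 completes: end_io fires  */
        return 0;
}
----c code----
Run in that order, the callback fires only on the third completion, which
is the intended behaviour.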
But how does it handle out-of-order completions?
If bio 3 completes first, the counters become:
1 -> 2 -> 1
If bio 2 then completes, the chain completion seems to forget to check
whether the first bio has reached 0 and goes straight to the next bio:
---c code----
static struct bio *__bio_chain_endio(struct bio *bio)
{
        struct bio *parent = bio->bi_private;

        if (bio->bi_status && !parent->bi_status)
                parent->bi_status = bio->bi_status;
        bio_put(bio);
        return parent;
}

static void bio_chain_endio(struct bio *bio)
{
        bio_endio(__bio_chain_endio(bio));
}
----c code----
so the counters become:
1 -> 2 -> 0
At that point the end_io callback is triggered as if all I/O were done,
which is not actually the case: bio 1 is still in flight, in an unknown state.
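To spell out what that would mean, the snippet below models the sequence
hypothesized above in plain C (again purely illustrative: model_bio,
complete_normally() and complete_skipping_own_check() are invented names,
and this is a model of the suspected failure mode, not of what bio_endio()
actually does). bio 3 completes first and only drops its own counter;
bio 2 then completes via a path that jumps straight to its parent without
checking its own counter, which drives bio 3's counter to zero and fires
the shared callback while bio 1 is still outstanding:
---c code----
#include <stdio.h>

/* Toy model of the out-of-order sequence described above. */
struct model_bio {
        int remaining;
        int parent;     /* index of the parent bio, -1 for the last one */
        int done;       /* has this bio actually finished?              */
};

static void chain_end_io(struct model_bio *chain, int n)
{
        int i;

        printf("end_io fired\n");
        for (i = 0; i < n; i++)
                if (!chain[i].done)
                        printf("  ... but bio %d is still in flight!\n", i + 1);
}

/* The suspected buggy step: go straight to the parent's counter without
 * checking whether this bio's own counter has reached zero. */
static void complete_skipping_own_check(struct model_bio *chain, int n, int i)
{
        chain[i].done = 1;
        i = chain[i].parent;
        if (i >= 0 && --chain[i].remaining == 0)
                chain_end_io(chain, n);
}

/* A completion that only touches the bio's own counter. */
static void complete_normally(struct model_bio *chain, int n, int i)
{
        chain[i].done = 1;
        if (--chain[i].remaining == 0 && chain[i].parent < 0)
                chain_end_io(chain, n);
}

int main(void)
{
        struct model_bio chain[3] = {
                { .remaining = 1, .parent = 1 },        /* bio 1 */
                { .remaining = 2, .parent = 2 },        /* bio 2 */
                { .remaining = 2, .parent = -1 },       /* bio 3 */
        };

        complete_normally(chain, 3, 2);           /* bio 3 first: 1 -> 2 -> 1 */
        complete_skipping_own_check(chain, 3, 1); /* bio 2 next:  1 -> 2 -> 0 */
        /* chain_end_io() fires above even though bio 1 never completed. */
        return 0;
}
----c code----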
> > After analysis, we believe the root cause is in the handling of chained
> > bios, specifically related to out-of-order io completion.
> >
> > Consider a bio chain where `bi_remaining` is decremented as each bio in
> > the chain completes.
> > For example,
> > if a chain consists of three bios (bio1 -> bio2 -> bio3) with
> > bi_remaining count:
> > 1->2->2
> > if the bios complete in reverse order, there will be a problem.
> > if bio 3 completes first, it will become:
> > 1->2->1
> > then bio 2 completes:
> > 1->1->0
> >
> > Because `bi_remaining` has reached zero, the final `end_io` callback
> > for the entire chain is triggered, even though not all bios in the
> > chain have actually finished processing. This premature completion can
> > lead to stale data being exposed, as seen in our case.
>
> It sounds like there is a problem because bi_remaining is only
> incremented after already submitting a bio. Which code path do you
> see this with? iomap doesn't chain bios, so is this the buffer cache
> or log code? Or is there a remapping driver involved?
>
Yep. The commit below changed that logic:

commit ae5535efd8c445ad6033ac0d5da0197897b148ea
Author: Christoph Hellwig <hch@....de>
Date:   Thu Dec 7 08:27:05 2023 +0100

    iomap: don't chain bios

Since there are still many code paths that use bio_chain, I am including
those cleanups together with the fix. That is also why I CCed all the
related communities: if this analysis and fix turn out to be right,
developers who monitor those lists can help identify similar problems
when someone asks for help in the future.
Thanks,
Shida