Message-Id: <20251121081748.1443507-1-zhangshida@kylinos.cn>
Date: Fri, 21 Nov 2025 16:17:39 +0800
From: zhangshida <starzhangzsd@...il.com>
To: linux-kernel@...r.kernel.org
Cc: linux-block@...r.kernel.org,
nvdimm@...ts.linux.dev,
virtualization@...ts.linux.dev,
linux-nvme@...ts.infradead.org,
gfs2@...ts.linux.dev,
ntfs3@...ts.linux.dev,
linux-xfs@...r.kernel.org,
zhangshida@...inos.cn,
starzhangzsd@...il.com
Subject: Fix potential data loss and corruption due to incorrect BIO chain handling
From: Shida Zhang <zhangshida@...inos.cn>
Hello everyone,
We have recently encountered a severe data loss issue on kernel version 4.19,
and we suspect the same underlying problem may exist in the latest kernel versions.
Environment:
* **Architecture:** arm64
* **Page Size:** 64KB
* **Filesystem:** XFS with a 4KB block size
Scenario:
The issue occurs while running a MySQL instance where one thread appends data
to a log file, and a separate thread concurrently reads that file to perform
CRC checks on its contents.
Problem Description:
Occasionally, the reading thread detects data corruption. Specifically, it finds
that stale data has been exposed in the middle of the file.
We have captured four instances of this corruption in our production environment.
In each case, we observed a distinct pattern:
* The corruption starts at an offset aligned to the beginning of an XFS extent.
* The corruption ends at an offset aligned to the system's `PAGE_SIZE` (64KB in our case).
Corruption Instances:
1. Start: `0x73be000`, End: `0x73c0000` (Length: 8KB)
2. Start: `0x10791a000`, End: `0x107920000` (Length: 24KB)
3. Start: `0x14535a000`, End: `0x145b70000` (Length: 8280KB)
4. Start: `0x370d000`, End: `0x3710000` (Length: 12KB)
After analysis, we believe the root cause lies in the handling of chained bios, specifically
in what happens when the bios in a chain complete out of order.
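For reference, chaining ties a bio to its parent and bumps the parent's remaining count.
The helper looks roughly like this in current mainline (paraphrased from block/bio.c; the
4.19 code is equivalent for the purpose of this discussion):

void bio_chain(struct bio *bio, struct bio *parent)
{
        BUG_ON(bio->bi_private || bio->bi_end_io);

        bio->bi_private = parent;
        bio->bi_end_io  = bio_chain_endio;
        bio_inc_remaining(parent);      /* parent->__bi_remaining++ */
}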
Consider a bio chain where each bio's `bi_remaining` is decremented as the bios in the chain
complete. For example, take a chain of three bios (bio1 -> bio2 -> bio3): bio1 is chained to
bio2 and bio2 to bio3, so each parent's count has been bumped once, giving the following
`bi_remaining` counts:

1 -> 2 -> 2

If the bios complete in reverse order, there is a problem. When bio3 completes first, the
counts become:

1 -> 2 -> 1

Then bio2 completes:

1 -> 1 -> 0

Because bio3's `bi_remaining` has reached zero, the final `end_io` callback for the entire
chain is triggered, even though not all bios in the chain (here, bio1) have actually finished
processing. This premature completion can expose stale data, which is exactly what we observed.
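To make the counting above concrete, here is a minimal userspace simulation of the scenario.
This is not kernel code; the struct and function names are made up, and the "unchecked" path
only models the behaviour we believe we are seeing:

/* Userspace sketch only: fake_bio and both complete_* helpers are invented. */
#include <stdbool.h>
#include <stdio.h>

struct fake_bio {
        const char *name;
        int remaining;                  /* stands in for __bi_remaining */
        struct fake_bio *parent;        /* stands in for bi_private of a chained bio */
        bool final;                     /* carries the end_io of the whole chain */
};

/* Mirrors the discipline of bio_endio(): check the count before walking up. */
static void complete_checked(struct fake_bio *bio)
{
        while (bio) {
                if (--bio->remaining > 0)
                        return;         /* something else still pends on this bio */
                if (bio->final) {
                        printf("final end_io fired on %s\n", bio->name);
                        return;
                }
                bio = bio->parent;      /* walk up the chain */
        }
}

/* Models a completion that drops the count but never checks it. */
static void complete_unchecked(struct fake_bio *bio)
{
        bio->remaining--;               /* decremented... */
        complete_checked(bio->parent);  /* ...but the parent is completed regardless */
}

int main(void)
{
        /* bio1 -> bio2 -> bio3 with remaining counts 1 -> 2 -> 2 */
        struct fake_bio bio3 = { "bio3", 2, NULL,  true  };
        struct fake_bio bio2 = { "bio2", 2, &bio3, false };
        struct fake_bio bio1 = { "bio1", 1, &bio2, false };

        complete_checked(&bio3);        /* bio3 finishes first: 1 -> 2 -> 1 */
        complete_unchecked(&bio2);      /* bio2 finishes: 1 -> 1 -> 0, final fires */
        /* bio1 is still in flight, yet the chain's end_io has already run */
        (void)bio1;
        return 0;
}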
The core issue appears to be that `bio_chain_endio` does not check whether the current bio's
`bi_remaining` count has reached zero before moving on to complete the parent bio in the chain.
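For context, the completion path we are referring to looks roughly like this (again
paraphrased from mainline block/bio.c):

static struct bio *__bio_chain_endio(struct bio *bio)
{
        struct bio *parent = bio->bi_private;

        if (bio->bi_status && !parent->bi_status)
                parent->bi_status = bio->bi_status;
        bio_put(bio);
        return parent;
}

static void bio_chain_endio(struct bio *bio)
{
        /* moves straight to the parent, without consulting bio's own __bi_remaining */
        bio_endio(__bio_chain_endio(bio));
}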
Proposed Fix:
Removing `__bio_chain_endio` and allowing the standard `bio_endio` to handle the completion
logic should resolve this issue, as `bio_endio` correctly manages the `bi_remaining` counter.
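For comparison, `bio_endio` already applies the counting discipline we want every completion
in the chain to go through (paraphrased, unrelated details omitted):

void bio_endio(struct bio *bio)
{
again:
        if (!bio_remaining_done(bio))   /* drops __bi_remaining, stops while it is still > 0 */
                return;

        /* ... */

        if (bio->bi_end_io == bio_chain_endio) {
                bio = __bio_chain_endio(bio);
                goto again;             /* the parent gets the same check */
        }

        /* ... */

        if (bio->bi_end_io)
                bio->bi_end_io(bio);
}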
Shida Zhang (9):
block: fix data loss and stale data exposure problems during append
write
block: export bio_chain_and_submit
gfs2: use bio_chain_and_submit for simplification
xfs: use bio_chain_and_submit for simplification
block: use bio_chain_and_submit for simplification
fs/ntfs3: use bio_chain_and_submit for simplification
zram: use bio_chain_and_submit for simplification
nvmet: fix a potential bug and use bio_chain_and_submit for
simplification
nvdimm: use bio_chain_and_submit for simplification
block/bio.c | 3 ++-
drivers/block/zram/zram_drv.c | 3 +--
drivers/nvdimm/nd_virtio.c | 3 +--
drivers/nvme/target/io-cmd-bdev.c | 3 +--
fs/gfs2/lops.c | 3 +--
fs/ntfs3/fsntfs.c | 12 ++----------
fs/squashfs/block.c | 3 +--
fs/xfs/xfs_bio_io.c | 3 +--
fs/xfs/xfs_buf.c | 3 +--
fs/xfs/xfs_log.c | 3 +--
10 files changed, 12 insertions(+), 27 deletions(-)
--
2.34.1