Message-ID: <20260107210139.40554-1-CFSworks@gmail.com>
Date: Wed, 7 Jan 2026 13:01:33 -0800
From: Sam Edwards <cfsworks@...il.com>
To: Xiubo Li <xiubli@...hat.com>,
Ilya Dryomov <idryomov@...il.com>
Cc: Viacheslav Dubeyko <Slava.Dubeyko@....com>,
Christian Brauner <brauner@...nel.org>,
Milind Changire <mchangir@...hat.com>,
Jeff Layton <jlayton@...nel.org>,
ceph-devel@...r.kernel.org,
linux-kernel@...r.kernel.org,
Sam Edwards <CFSworks@...il.com>
Subject: [PATCH v2 0/6] ceph: CephFS writeback correctness and performance fixes

Hello list,

This is v2 of my series that addresses interrelated issues in CephFS writeback,
fixing crashes, improving robustness, and correcting performance behavior,
particularly for fscrypted files. [1]

Changes v1->v2:
- Clarify patch #1's commit message to establish that failures on the first
  folio are not possible.
- Add another patch to move the "clean up page array on abort" logic to a new
  ceph_discard_page_array() function. (Thanks Slava!)
- Change the wording "grossly degraded performance" to instead read
  "correspondingly degraded performance." This makes the causal relationship
  clearer (that write throughput is limited much more significantly by write
  op/s due to the bug) without making any claims (qualitative or otherwise)
  about significance. (Thanks Slava!)
- Reset locked_pages = 0 immediately when the page array is discarded,
  simplifying patch #5 ("ceph: Assert writeback loop invariants")
- Reword "as evidenced by the previous two patches which fix oopses" to
  "as evidenced by two recent patches which fix oopses" and refer to the
  patches by subject (being in the same series, I cannot refer to them by
  hash)

I received several items of feedback on v1 that I have chosen not to adopt --
mostly because I believe they run contrary to kernel norms around strong
contracts, avoiding redundancy, not masking bugs, and not introducing
regressions. (It is possible that I am mistaken about these norms; I may
still adopt these suggestions in a v3 if someone makes good points in their
favor or consensus overrules me.)

Feedback on v1 not adopted:

- "Patch #1, which fixes a crash in unreachable code, should be reordered
  after patch #6 (#5 in v1), which fixes a bug that makes the code
  unreachable, in order to simplify crash reproduction in review"

The order of the patchset represents the canonical commit order of the
series. Committing patch #6 before patch #1 would therefore introduce a
regression, in direct violation of longstanding kernel policy.

- "Patch #1 should not swallow errors from move_dirty_folio_in_page_array()
  if they happen on the first folio."

It is not possible for move_dirty_folio_in_page_array() to encounter an error
on the first folio, and this is not likely to change in the future. Even if
such an error were to occur, the caller already expects
ceph_process_folio_batch() to "successfully" lock zero pages from time to
time. Swallowing errors in ceph_process_folio_batch() is consistent with its
design.

- "Patch #1 should include the call trace and reproduction steps in the
  commit message."

The commit message already explains the execution path to the failure, which
is what people really care about; the call trace is just a means to that end.
Because of inlining, the raw call trace is completely opaque about why the
oops happens. Reproduction steps are not particularly valuable for posterity,
but a curious party in the future can always refer back to mailing list
discussions to find them.

- "Patch #2 should not exist: it removes the return code from
  ceph_process_folio_batch(), which is essential to indicate failures to the
  caller."

The caller of ceph_process_folio_batch() only cares about success/failure as
a boolean: the exact error code is discarded. The return code is also
redundant with ceph_wbc.locked_pages, which indicates not only
success/failure but also the degree of success (how many pages were locked).
That makes ceph_wbc.locked_pages the more appealing single source of truth.

Further, the error return mechanism is misleading to future programmers: it
implies that it is acceptable for ceph_process_folio_batch() to abort the
operation after already selecting some folios for write. The caller cannot
handle that. Removing the return code altogether makes the contract explicit,
which is the central point of the patch.

- "Patch #5 (#4 in v1) should not introduce BUG_ON() in the writeback path."

The writeback path already contains BUG_ON() checks, which is consistent with
kernel norms: check invariants, fail fast, and don't try to tolerate
ambiguity. Since several invariants are already asserted there, patch #5 does
not add new failure possibilities to the writeback loop; it catches invariant
violations sooner. Its purpose is to tighten up the code and prevent
regressions, not to fix any particular bug.

- "Patch #6 (#5 in v1) should include benchmark results to support its claim
  of improved performance."

My environment is not very representative of a typical Ceph deployment, and
benchmarking is tough to get right; I am not prepared to stand behind any
particular estimated/expected speedup factor. Rather, the rationale for this
patch is a simple computer science principle: increasing the amount of useful
work done per operation reduces total overhead. I have changed the phrasing
"grossly degraded performance" to "correspondingly degraded performance" to
emphasize that the performance degradation follows from the bottleneck,
without implying any claim about magnitude.


Warm regards,
Sam

[1]: https://lore.kernel.org/ceph-devel/20251231024316.4643-1-CFSworks@gmail.com/T/

Sam Edwards (6):
  ceph: Do not propagate page array emplacement errors as batch errors
  ceph: Remove error return from ceph_process_folio_batch()
  ceph: Free page array when ceph_submit_write fails
  ceph: Split out page-array discarding to a function
  ceph: Assert writeback loop invariants
  ceph: Fix write storm on fscrypted files

 fs/ceph/addr.c | 82 +++++++++++++++++++++++++++++---------------------
 1 file changed, 48 insertions(+), 34 deletions(-)

--
2.51.2