Message-ID: <20250212204546.3751645-1-csander@purestorage.com>
Date: Wed, 12 Feb 2025 13:45:44 -0700
From: Caleb Sander Mateos <csander@...estorage.com>
To: Jens Axboe <axboe@...nel.dk>,
Pavel Begunkov <asml.silence@...il.com>
Cc: Riley Thomasson <riley@...estorage.com>,
io-uring@...r.kernel.org,
linux-kernel@...r.kernel.org,
Caleb Sander Mateos <csander@...estorage.com>
Subject: [PATCH 0/2] uring_cmd SQE corruptions

In our application issuing NVMe passthru commands, we have observed
nvme_uring_cmd fields being corrupted between when userspace initializes
the io_uring SQE and when nvme_uring_cmd_io() processes it.

We hypothesized that the uring_cmds were executing asynchronously after
the io_uring_enter() syscall returned, yet were still reading the SQE in
the userspace-mapped SQ. Since io_uring_enter() had already incremented
the SQ head index, userspace reused the SQ slot for a new SQE once the
SQ wrapped around to it.
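
To make the lifetime concrete, the userspace side of the race looks
roughly like this (our sketch using liburing-style ring field names, not
our application's actual code):

    /* Sketch: deciding an SQ slot is free for reuse. Ring field names
     * (khead, ring_mask, sqes) follow the mmap()ed SQ layout and are
     * illustrative here. */
    unsigned head = *sq.khead; /* needs load-acquire; advanced by
                                * io_uring_enter() at submit time */

    if (local_tail - head < sq.ring_entries) {
        struct io_uring_sqe *sqe = &sq.sqes[local_tail & sq.ring_mask];

        /* Userspace now initializes a new SQE in this slot. But if the
         * previous uring_cmd submitted from this slot went async without
         * copying its SQE, the kernel is still reading the old
         * nvme_uring_cmd through a pointer into this same mapped memory. */
        init_new_sqe(sqe); /* hypothetical helper */
    }
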
We confirmed this hypothesis by "poisoning" all SQEs up to the SQ head
index in userspace upon return from io_uring_enter(). By overwriting the
nvme_uring_cmd nsid field with a known garbage value, we were able to
trigger the error message in nvme_validate_passthru_nsid(), which logged
the garbage nsid value.
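
The poisoning looked roughly like this (a reconstruction for
illustration, not the exact diagnostic; 0xDEADBEEF stands in for
whatever garbage value is chosen):

    /* After io_uring_enter() returns, clobber every SQE the kernel
     * claims to have consumed. If a uring_cmd is still reading its SQ
     * slot, the garbage nsid shows up in the log line from
     * nvme_validate_passthru_nsid(). */
    unsigned head = *sq.khead; /* needs load-acquire */

    while (poisoned_tail != head) {
        struct io_uring_sqe *sqe = &sq.sqes[poisoned_tail & sq.ring_mask];
        struct nvme_uring_cmd *cmd = (struct nvme_uring_cmd *)sqe->cmd;

        cmd->nsid = 0xDEADBEEF; /* known garbage value */
        poisoned_tail++;
    }
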
The issue is caused by commit 5eff57fa9f3a ("io_uring/uring_cmd: defer
SQE copying until it's needed"). With this commit reverted, the poisoned
values in the SQEs are no longer seen by nvme_uring_cmd_io().

Prior to the commit, each uring_cmd SQE was unconditionally memcpy()ed
to async_data at prep time. The commit deferred this memcpy() to the 2
cases where the request goes async:
- If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
- If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue

This patch set fixes a bug in the EAGAIN case where the uring_cmd's sqe
pointer is not updated to point to async_data after the memcpy(),
as it correctly is in the REQ_F_FORCE_ASYNC case.
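
The shape of that fix is roughly the following (a sketch of the idea
rather than the literal patch text, assuming io_uring_cmd_data keeps its
SQE copy in a sqes[] member):

    /* In io_uring_cmd(), when a non-blocking issue wants to go async: */
    ret = file->f_op->uring_cmd(ioucmd, issue_flags);
    if (ret == -EAGAIN) {
        struct io_uring_cmd_data *cache = req->async_data;

        if (ioucmd->sqe != cache->sqes) {
            /* Copy the SQE out of the userspace-mapped SQ... */
            memcpy(cache->sqes, ioucmd->sqe, uring_sqe_size(req->ctx));
            /* ...and switch the pointer to the stable copy, as the
             * REQ_F_FORCE_ASYNC path already does. */
            ioucmd->sqe = cache->sqes;
        }
        return -EAGAIN;
    }
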
However, uring_cmds can be issued async in other cases not enumerated
by 5eff57fa9f3a, also leading to SQE corruption. These include requests
besides the first in a linked chain, which are only issued once prior
requests complete. Requests waiting for a drain to complete would also
be initially issued async.
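
For example, in a liburing-style chain like the following, the second
uring_cmd is only issued after the first completes, from async context
(an illustrative sketch; the passthru prep details are elided):

    /* Two linked NVMe passthru commands; requires IORING_SETUP_SQE128.
     * io_uring_submit() returns before sqe2 is issued, so userspace may
     * reuse sqe2's SQ slot while the kernel still needs its contents. */
    struct io_uring_sqe *sqe1 = io_uring_get_sqe(&ring);
    io_uring_prep_rw(IORING_OP_URING_CMD, sqe1, nvme_fd, NULL, 0, 0);
    sqe1->cmd_op = NVME_URING_CMD_IO;
    sqe1->flags |= IOSQE_IO_LINK; /* defer the next SQE until completion */
    /* fill in struct nvme_uring_cmd at sqe1->cmd */

    struct io_uring_sqe *sqe2 = io_uring_get_sqe(&ring);
    io_uring_prep_rw(IORING_OP_URING_CMD, sqe2, nvme_fd, NULL, 0, 0);
    sqe2->cmd_op = NVME_URING_CMD_IO;
    /* fill in struct nvme_uring_cmd at sqe2->cmd */

    io_uring_submit(&ring); /* returns before sqe2 is issued */
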
While it's probably possible for io_uring_cmd_prep_setup() to check for
each of these cases and avoid deferring the SQE memcpy(), we feel it
might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
As discussed recently in regard to the ublk zero-copy patches[1], new
async paths added in the future could break these delicate assumptions.
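
Concretely, reverting would restore the unconditional prep-time copy,
roughly as follows (sketched from the pre-5eff57fa9f3a behavior; the
allocator helper name here is assumed):

    /* io_uring_cmd_prep_setup() before 5eff57fa9f3a, sketched: every
     * uring_cmd SQE is copied into async_data while the SQ slot is
     * still guaranteed valid, so no async path can see a reused slot. */
    static int io_uring_cmd_prep_setup(struct io_kiocb *req,
                                       const struct io_uring_sqe *sqe)
    {
        struct io_uring_cmd *ioucmd =
            io_kiocb_to_cmd(req, struct io_uring_cmd);
        struct io_uring_cmd_data *cache;

        cache = io_uring_cmd_cache_alloc(req); /* allocator name assumed */
        if (!cache)
            return -ENOMEM;

        memcpy(cache->sqes, sqe, uring_sqe_size(req->ctx));
        ioucmd->sqe = cache->sqes; /* always point at the stable copy */
        return 0;
    }
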
Thoughts?
[1]: https://lore.kernel.org/io-uring/7c2c2668-4f23-41d9-9cdf-c8ddd1f13f7c@gmail.com/

Caleb Sander Mateos (2):
io_uring/uring_cmd: don't assume io_uring_cmd_data layout
io_uring/uring_cmd: switch sqe to async_data on EAGAIN
io_uring/uring_cmd.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)
--
2.45.2