[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADUfDZr2GnkOK5uKwj2Tm+1v-Hk4k0md203nyReqHmrfmJZ7Dw@mail.gmail.com>
Date: Mon, 27 Jan 2025 10:06:28 -0800
From: Caleb Sander <csander@...estorage.com>
To: Hannes Reinecke <hare@...e.de>
Cc: Keith Busch <kbusch@...nel.org>, Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@....de>,
Sagi Grimberg <sagi@...mberg.me>, Maurizio Lombardi <mlombard@...hat.com>, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] nvme-tcp: fix connect failure on receiving partial
ICResp PDU
On Mon, Jan 27, 2025 at 12:09 AM Hannes Reinecke <hare@...e.de> wrote:
>
> On 1/27/25 08:37, Hannes Reinecke wrote:
> > On 1/24/25 19:43, Caleb Sander Mateos wrote:
> >> nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
> >> checks that the return value from recvmsg() is non-negative. If the
> >> sender closes the TCP connection or sends fewer than 128 bytes, this
> >> check will pass even though the full PDU wasn't received.
> >>
> >> Ensure the full ICResp PDU is received by checking that recvmsg()
> >> returns the expected 128 bytes.
> >>
> >> Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could
> >> split the ICResp over multiple TCP frames. Without MSG_WAITALL,
> >> recvmsg() could return prematurely with only part of the PDU.
> >>
> >> Signed-off-by: Caleb Sander Mateos <csander@...estorage.com>
> >> Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
> >> ---
> >> v4: keep recvmsg() error return value
> >> v3: fix return value to indicate error
> >> v2: add Fixes tag
> >>
> >> drivers/nvme/host/tcp.c | 5 ++++-
> >> 1 file changed, 4 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> >> index e9ff6babc540..56679eb8c0d6 100644
> >> --- a/drivers/nvme/host/tcp.c
> >> +++ b/drivers/nvme/host/tcp.c
> >> @@ -1446,15 +1446,18 @@ static int nvme_tcp_init_connection(struct
> >> nvme_tcp_queue *queue)
> >> iov.iov_len = sizeof(*icresp);
> >> if (nvme_tcp_queue_tls(queue)) {
> >> msg.msg_control = cbuf;
> >> msg.msg_controllen = sizeof(cbuf);
> >> }
> >> + msg.msg_flags = MSG_WAITALL;
> >> ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
> >> iov.iov_len, msg.msg_flags);
> >
> > But won't we have to wait for a TCP timeout now if the sender sends less
> > than 128 bytes? With this patch we always wait for 128 bytes, and
> > possibly wait for TCP timeout if not.
> > Testcase for this would be nice ...
> >
> > And I need to check if secure concatenation is affected here; with
> > secure concatenation we need to peek at the first packet to check
> > if it's an ICRESP or a TLS negotiation.
> >
> And wouldn't this patch be sufficient here?
If the controller sends the ICResp across multiple TCP packets, the
recvmsg() will only receive the first one. There are 2 different
problematic outcomes:
- The validation of the icresp buffer may fail because it hasn't been
fully initialized from the PDU
- Even if the validation passes, the remaining bytes of the PDU are
still in the socket buffer. Subsequent receive operations will return
those bytes instead of the following PDUs they meant to receive.
It is important to receive the full ICResp to avoid connection
establishment failure, and to ensure there are no remaining bytes in
the socket buffer.
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 841238f38fdd..85b1328a8757 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1451,12 +1451,13 @@ static int nvme_tcp_init_connection(struct
> nvme_tcp_queue *queue)
> }
> ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
> iov.iov_len, msg.msg_flags);
> + if (ret == 0)
> + ret = -ENOTCONN;
It is possible for ret to be positive but less than sizeof(ICResp).
For example, a controller is allowed to send the 8-byte Common Header
and the 120-byte ICResp PDU Specific Header in separate TCP packets.
In that case, recvmsg() would only receive the Common Header and
return 8. We need to receive the full 128 bytes of the ICResp PDU
before processing it and receiving subsequent PDUs from the socket.
> if (ret < 0) {
> pr_warn("queue %d: failed to receive icresp, error %d\n",
> nvme_tcp_queue_id(queue), ret);
> goto free_icresp;
> }
> - ret = -ENOTCONN;
> if (nvme_tcp_queue_tls(queue)) {
> ctype = tls_get_record_type(queue->sock->sk,
> (struct cmsghdr *)cbuf);
>
> icresp validity is checked later in the code, so if we haven't received
> a full icresp we _should_ fail those tests. And then we don't really
> have to check how many bytes we've received.
The ICResp is only validated up to the MAXH2CDATA field. The
validation would fail if that field is 0, so it is true that we would
catch a recvmsg() that received 12 bytes or fewer. However, the
remaining 115 bytes of the ICResp PDU are not validated, so a passing
validation does not ensure all 128 bytes were received from the
socket.
Thanks,
Caleb
Powered by blists - more mailing lists