Message-ID: <CADUfDZrvHmYQgqu5GiQDrAg_MOEiRigCLCmBSO6jeyd1z+Q-JQ@mail.gmail.com>
Date: Tue, 28 Jan 2025 14:02:34 -0800
From: Caleb Sander <csander@...estorage.com>
To: Hannes Reinecke <hare@...e.de>
Cc: Keith Busch <kbusch@...nel.org>, Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@....de>,
Sagi Grimberg <sagi@...mberg.me>, Maurizio Lombardi <mlombard@...hat.com>, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] nvme-tcp: fix connect failure on receiving partial
ICResp PDU
On Tue, Jan 28, 2025 at 12:28 AM Hannes Reinecke <hare@...e.de> wrote:
>
> On 1/27/25 18:38, Caleb Sander wrote:
> > On Sun, Jan 26, 2025 at 11:37 PM Hannes Reinecke <hare@...e.de> wrote:
> >>
> >> On 1/24/25 19:43, Caleb Sander Mateos wrote:
> >>> nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
> >>> checks that the return value from recvmsg() is non-negative. If the
> >>> sender closes the TCP connection or sends fewer than 128 bytes, this
> >>> check will pass even though the full PDU wasn't received.
> >>>
> >>> Ensure the full ICResp PDU is received by checking that recvmsg()
> >>> returns the expected 128 bytes.
> >>>
> >>> Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could
> >>> split the ICResp over multiple TCP frames. Without MSG_WAITALL,
> >>> recvmsg() could return prematurely with only part of the PDU.
> >>>
> >>> Signed-off-by: Caleb Sander Mateos <csander@...estorage.com>
> >>> Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
> >>> ---
> >>> v4: keep recvmsg() error return value
> >>> v3: fix return value to indicate error
> >>> v2: add Fixes tag
> >>>
> >>> drivers/nvme/host/tcp.c | 5 ++++-
> >>> 1 file changed, 4 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> >>> index e9ff6babc540..56679eb8c0d6 100644
> >>> --- a/drivers/nvme/host/tcp.c
> >>> +++ b/drivers/nvme/host/tcp.c
> >>> @@ -1446,15 +1446,18 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
> >>>  	iov.iov_len = sizeof(*icresp);
> >>>  	if (nvme_tcp_queue_tls(queue)) {
> >>>  		msg.msg_control = cbuf;
> >>>  		msg.msg_controllen = sizeof(cbuf);
> >>>  	}
> >>> +	msg.msg_flags = MSG_WAITALL;
> >>>  	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
> >>>  			iov.iov_len, msg.msg_flags);
> >>
> >> But won't we have to wait for a TCP timeout now if the sender sends fewer
> >> than 128 bytes? With this patch we always wait for 128 bytes, and
> >> possibly wait for the TCP timeout if they never arrive.
> >
> > Yes, if the NVMe/TCP controller sends less than 128 bytes, we need to
> > wait for it to send the remainder of the ICResp PDU. That's just how
> > the NVMe/TCP protocol works. If we want to protect against
> > buggy/malicious controllers that don't send a full ICResp, we need a
> > timeout mechanism. That's the purpose of the existing
> > `queue->sock->sk->sk_rcvtimeo = 10 * HZ;` in nvme_tcp_alloc_queue().
> > Note that recvmsg() timing out was already possible in the original
> > code if the controller didn't send anything on the TCP connection
> > after accepting it.
> >
> Hmm. Checking the code, 'rcvtimeo' is only evaluated if MSG_WAITALL
> is _not_ set. Makes me wonder why we do set it...
> But that's beside the point.
I am not seeing where sk_rcvtimeo is ignored when MSG_WAITALL is used;
can you point me to the code?
It is certainly ignored for *non-blocking* recvmsg() (MSG_DONTWAIT),
but I don't see why it would be ignored for MSG_WAITALL. man 7 socket
also suggests it applies to all blocking socket I/O.
static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
{
	return noblock ? 0 : sk->sk_rcvtimeo;
}

In tcp_recvmsg_locked():

	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
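For illustration, here is a minimal user-space analogue (a sketch only, not
the driver code; the connected socket 'fd' and the recv_icresp() helper name
are just assumptions): with SO_RCVTIMEO set, a blocking recvmsg(MSG_WAITALL)
for the 128-byte ICResp returns a short count (or fails with EAGAIN if
nothing arrived) once the timeout expires, rather than blocking forever.

#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * User-space analogue of the connect-path receive (sketch only): set a
 * receive timeout on the socket, then ask for the full 128-byte ICResp
 * with MSG_WAITALL. If the peer sends fewer bytes, recvmsg() returns the
 * partial count (or -1 with EAGAIN if nothing arrived) once the timeout
 * expires, so the receive timeout still applies.
 *
 * 'fd' is assumed to be a connected TCP socket to a test endpoint.
 */
static int recv_icresp(int fd)
{
	unsigned char icresp[128];
	struct timeval tv = { .tv_sec = 10 };	/* mirrors sk_rcvtimeo = 10 * HZ */
	struct iovec iov = { .iov_base = icresp, .iov_len = sizeof(icresp) };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
	ssize_t ret;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
		return -1;

	ret = recvmsg(fd, &msg, MSG_WAITALL);
	if (ret != (ssize_t)sizeof(icresp)) {
		/* short read, timeout, or connection closed */
		fprintf(stderr, "partial ICResp: got %zd of %zu bytes\n",
			ret, sizeof(icresp));
		return -1;
	}
	return 0;
}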
>
> >> Testcase for this would be nice ...
> >>
> >> And I need to check if secure concatenation is affected here; with
> >> secure concatenation we need to peek at the first packet to check
> >> if it's an ICRESP or a TLS negotiation.
> >
> > Are you saying that with secure concatenation we don't know in advance
> > whether the connection is using TLS between the TCP and NVMe/TCP
> > protocol layers? Wouldn't the host already need to know that when it
> > sent its ICReq PDU?
>
> No, the host doesn't need to know. TLS is enabled by the lower
> layers.
>
> But upon further checking, I guess it'll be okay with secure
> concatenation.
>
> Nevertheless, I would vastly prefer to have a receive loop here
> instead of waiting to receive the full amount as per MSG_WAITALL.
> The entire tcp code is using nonblocking calls, so I'd rather
> keep it that way and implement a receive loop here.
The driver uses non-blocking socket I/O in the I/O path, but not the
connect path. nvme_tcp_init_connection() is already using blocking
socket I/O to send the ICReq PDU and receive the ICResp. I suppose we
could change the connect path to register poll interest and use
non-blocking operations, but that seems like a more involved code
change and orthogonal to the issue of receiving the full ICResp.
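For comparison, the receive-loop alternative you describe would look roughly
like this in the same user-space sketch (again just an illustration under the
same assumptions, not a proposed kernel change): poll for readability and
accumulate non-blocking reads until the full 128 bytes are in hand or the
wait for more data times out.

#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Sketch of the explicit receive-loop alternative: non-blocking reads
 * plus poll(), accumulating until the full 128-byte ICResp is in hand
 * or the wait for more data times out.
 *
 * 'fd' is assumed to be a connected TCP socket to a test endpoint.
 */
static int recv_icresp_loop(int fd)
{
	unsigned char icresp[128];
	size_t off = 0;
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	while (off < sizeof(icresp)) {
		ssize_t n = recv(fd, icresp + off, sizeof(icresp) - off,
				 MSG_DONTWAIT);
		if (n > 0) {
			off += n;
			continue;
		}
		if (n == 0)
			return -1;	/* peer closed before sending a full ICResp */
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;	/* real socket error */
		/* No data yet: wait up to 10 s for more (overall deadline
		 * tracking omitted to keep the sketch short). */
		if (poll(&pfd, 1, 10 * 1000) <= 0)
			return -1;	/* timeout or poll error */
	}
	return 0;
}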
Best,
Caleb