Message-Id: <76689FA1-68C1-4825-AB9C-804966ABC34F@amacapital.net>
Date: Mon, 5 Feb 2024 15:22:15 -0800
From: Andy Lutomirski <luto@...capital.net>
To: Network Development <netdev@...r.kernel.org>
Cc: Linux API <linux-api@...r.kernel.org>
Subject: Re: The sk_err mechanism is infuriating in userspace
> On Feb 5, 2024, at 3:03 PM, Andy Lutomirski <luto@...capital.net> wrote:
>
> Hi all-
>
> I encounter this issue every couple of years, and it still seems to be
> an issue, and it drives me nuts every time I see it.
>
> I write software that uses unconnected datagram-style sockets. Errors
> happen for all kinds of reasons, and my software knows it. My
> software even handles the errors and moves on with its life. I use
> MSG_ERRQUEUE to understand the errors. But the kernel fights back:
>
> struct sk_buff *__skb_try_recv_datagram(struct sock *sk,
> struct sk_buff_head *queue,
> unsigned int flags, int *off, int *err,
> struct sk_buff **last)
> {
> struct sk_buff *skb;
> unsigned long cpu_flags;
> /*
> * Caller is allowed not to check sk->sk_err before skb_recv_datagram()
> */
> int error = sock_error(sk);
>
> if (error)
> goto no_packet;
> ^^^^^^^^^^ <----- EXCUSE ME?
>
> The kernel even fights back on the *send* path?!?
>
> static long sock_wait_for_wmem(struct sock *sk, long timeo)
> {
> DEFINE_WAIT(wait);
>
> sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
> for (;;) {
> if (!timeo)
> break;
> if (signal_pending(current))
> break;
> set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> ...
> if (READ_ONCE(sk->sk_err))
> break; <-- KERNEL HATES UNCONNECTED SOCKETS!
>
> This is IMO just broken. I realize it's legacy behavior, but it's
> BROKEN legacy behavior. sk_err does not (at least for an unconnected
> socket) indicate that anything is wrong with the socket. It indicates
> that something is worthy of notice, and it wants to tell me.
>
> So:
>
> 1. sock_wait_for_wmem should IMO just not do that on an unconnected
> socket. AFAICS it's simply a bug.
>
> 2. How, exactly, am I supposed to call recvmsg() and, unambiguously,
> find out whether recvmsg() actually failed? There are actual errors
> (something that indicates that the kernel malfunctioned or the socket
> is broken), errors indicating that the packet being received is busted
> (skb_copy_datagram_msg, for example), and also errors indicating that
> there's an error queued up.
>
> I would like to know that there's an error queued up. That's what
> poll and epoll are for, right? Or a hint from recvmsg() that I should
> call MSG_RECVERR too. Or it could have a mode where it returns a
> normal datagram *or* an error as appropriate. But the current state
> of affairs is just brittle and racy.
>
> Are there any reasonably implementable, non-breaking ways to improve
> the API so that programs that understand socket errors can actually
> function fully correctly without gnarly retry loops in userspace and
> silly heuristics about what errors are actually errors?
Contemplating this, recvmsg() can send status information back via msg_flags. Maybe we could characterize a recvmsg() call as doing one of the following things:
1. Actually fails, via -EFAULT or otherwise. Userspace can get an errno but doesn’t know beyond that what actually went wrong. Should never happen in a correct program. ENOMEM is not in this category.
2. There is nothing to receive. This is -EAGAIN.
3. Received an sk_err error. This is a *success*, and it comes with an error code. Users of RECVERR can’t reliably correlate this with an ERRQUEUE message. Maybe they don’t care.
4. Received a datagram.
5. Received a queued error message a la ERRQUEUE.
6. Dequeued a datagram (or ERRQUEUE) but did *not* receive it due to a checksum error or other error. There should be a clear indication of whether the call succeeded but something was wrong with the message, or whether the call *failed* for an unexpected reason but the offending message was nonetheless removed from the socket’s queue.
Maybe 7: Received a message (or ERRQUEUE), and the checksum was wrong, but the data is being returned anyway.
I suppose that a flag could enable this mode and then all but #1 would return a *success* code from the syscall. And msg_flags would contain an indication as to what actually happened.
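For contrast, here is a rough sketch of the guesswork userspace has to do today to sort a recvmsg() result into the cases above. The enum and function names are hypothetical, and the split between "hard failure" errnos and "probably a pending sk_err" errnos is my own assumption, not anything the kernel promises. Cases 6 and 7 cannot be told apart from the others with what the syscall returns now, so they are absent.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical names; they mirror the numbered cases above. */
enum recv_outcome {
    RECV_HARD_FAILURE,  /* case 1: EFAULT and friends */
    RECV_NOTHING,       /* case 2: EAGAIN */
    RECV_MAYBE_SK_ERR,  /* case 3, by guesswork: errno might be a pending sk_err */
    RECV_DATAGRAM,      /* case 4 */
    RECV_ERRQUEUE_MSG,  /* case 5 */
};

/*
 * ret and err are the return value and errno of a recvmsg() call,
 * msg is the msghdr that was passed to it.
 */
static enum recv_outcome classify_recvmsg(ssize_t ret, int err,
                                          const struct msghdr *msg)
{
    if (ret >= 0) {
        /* The kernel sets MSG_ERRQUEUE in msg_flags when the message
         * came from the error queue rather than the receive queue. */
        return (msg->msg_flags & MSG_ERRQUEUE) ? RECV_ERRQUEUE_MSG
                                               : RECV_DATAGRAM;
    }

    switch (err) {
    case EAGAIN:
#if EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:
#endif
        return RECV_NOTHING;
    case EFAULT:
    case EBADF:
    case EINVAL:
    case ENOTSOCK:
        /* Assumption: these indicate a caller bug, not a socket event. */
        return RECV_HARD_FAILURE;
    default:
        /* Anything else (ECONNREFUSED, EHOSTUNREACH, ...) could be a
         * reported sk_err or a genuine failure; the caller cannot tell. */
        return RECV_MAYBE_SK_ERR;
    }
}
```

A caller would use it as `classify_recvmsg(n, errno, &msg)` right after `n = recvmsg(fd, &msg, flags)`. The `default:` branch is exactly the ambiguity a new mode would remove.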
Thoughts? Does io_uring affect any of this?
>
> Grumpily,
> Andy