netdev - Re: traceroute failure in kernel 6.1 and 6.2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iKxoJEGf+renKFXO+vajyJTKvBu1zaTiuFhKhDwcMny-g@mail.gmail.com>
Date:   Tue, 24 Jan 2023 17:14:17 +0100
From:   Eric Dumazet <edumazet@...gle.com>
To:     Pavel Begunkov <asml.silence@...il.com>
Cc:     Mantas Mikulėnas <grawity@...il.com>,
        netdev@...r.kernel.org
Subject: Re: traceroute failure in kernel 6.1 and 6.2

On Tue, Jan 24, 2023 at 4:28 PM Pavel Begunkov <asml.silence@...il.com> wrote:
>
> On 1/24/23 08:57, Eric Dumazet wrote:
> > On Tue, Jan 24, 2023 at 7:03 AM Eric Dumazet <edumazet@...gle.com> wrote:
> [...]
> >>> It doesn't solve the issue; I tried bumping it from the default of
> >>> 212992 to 4096-times-that, with exactly the same results.
> >>>
> >>> The amount of packets it's able to send is variable, For example, right
> >>> now, on my regular VM (which is smaller than the PC that yesterday's
> >>> trace was done on), the program very consistently fails on the *second*
> >>> sendto() call -- I don't think two packets is an unreasonable amount.
> >>>
> >>> The program has -q and -N options to reduce the number of simultaneous
> >>> probes, but the only effect it has is if I reduce the packets all the
> >>> way down to just one at a time.
> >>
> >> Problem is : if we revert the patch, unpriv users can trivially crash a host.
> >>
> >> Also, sent ICMP packets  look just fine to me, and the patch is
> >> changing tx path.
> >>
> >> The reported issue seems more like rx path related to me.
> >> Like IP_RECVERR being not handled correctly.
> >>
> >> I think more investigations are needed. Maybe contact Pavel Begunkov
> >> <asml.silence@...il.com>
> >> because the initial crash issue came with
> >> 47cf88993c91 ("net: unify alloclen calculation for paged requests")
> >
> > I am reasonably confident this is a bug in this traceroute binary.
> >
> > It sets
> >   setsockopt(3, SOL_IP, IP_RECVERR, [1], 4) = 0
> >
> > So a sendto() can absolutely return the error set by last received
> > ICMP (cf ping_err()) on the socket,
> > as per RFC1122 4.1.3.3
> >
> >   4.1.3.3  ICMP Messages
> >
> >              UDP MUST pass to the application layer all ICMP error
> >              messages that it receives from the IP layer.  Conceptually
> >              at least, this may be accomplished with an upcall to the
> >              ERROR_REPORT routine (see Section 4.2.4.1).
> >
> >              DISCUSSION:
> >                   Note that ICMP error messages resulting from sending a
> >                   UDP datagram are received asynchronously.  A UDP-based
> >                   application that wants to receive ICMP error messages
> >                   is responsible for maintaining the state necessary to
> >                   demultiplex these messages when they arrive; for
> >                   example, the application may keep a pending receive
> >                   operation for this purpose.  The application is also
> >                   responsible to avoid confusion from a delayed ICMP
> >                   error message resulting from an earlier use of the same
> >
> >
> > Fix would be
> >
> > diff traceroute/traceroute.c.orig traceroute/traceroute.c
> > 1657c1657
> > <     if (errno == EMSGSIZE)
> > ---
> >>      if (errno == EMSGSIZE || errno == EHOSTUNREACH)
> >
> > or to collect a pending socket error (but that would be racy), using
> > SO_ERROR getsockopt()
>
> If it doesn't help I'll take a look, perfectly reproducible for me.
>

My fix has the following consequence:

Instead of pretending ICMP packet has no 'transhdrlen',
it now calls sock_alloc_send_skb() (instead of alloc_skb()),
and thus is able to sleep/schedule (unless application
provides MSG_DONTWAIT), and is also
sensible to a prior setting of sk->sk_err (from ping_err())

This was probably not an intended behavior of initial
ping implementation (which was ignoring sk->sk_err until a recvmsg() or poll())

__ip_append_data()

if (transhdrlen) {
    skb = sock_alloc_send_skb(sk, alloclen,
                       (flags & MSG_DONTWAIT), &err);
} else {
     skb = NULL;
     if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=
                      2 * sk->sk_sndbuf)
                 skb = alloc_skb(alloclen,
                                          sk->sk_allocation);
                   if (unlikely(!skb))
                       err = -ENOBUFS;
}
if (!skb)
           goto error;

The intent for this code is to sleep/schedule only for the first skb,
and uses alloc_skb() for the following skbs, to increase chance
of not breaking the generation of the datagram in the middle.

Downside is that ICMP now reacts like UDP, thus the application
must correctly handle the sendmsg() returning the stored sk->sk_err.

Note 1: ipv6 ping implementation forces a MSG_DONTWAIT
while calling ip6_append_data(), for no apparent reason.

Note 2:
 It seems strange that both udp_err() and ping_err()
 call ip_icmp_error() when a matching socket is found (and used IP_RECVERR)
 _and_ also set sk->sk_err

 Perhaps ip_icmp_error() should return a bool and
 we should not set sk->sk_err if a proper error skb has been
 queued to sk->sk_error_queue