netdev - Re: [PATCH net v2] ipv4, ipv6: Fix handling of transhdrlen in __ip{,6}_append

Message-ID: <CAF=yD-K07q_ygjRrsau3fPWX4==WPjEtZN1y3eZUTABYaG0vWg@mail.gmail.com>
Date: Wed, 20 Sep 2023 21:41:38 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: David Howells <dhowells@...hat.com>, netdev@...r.kernel.org
Cc: syzbot+62cbf263225ae13ff153@...kaller.appspotmail.com, 
	Eric Dumazet <edumazet@...gle.com>, "David S. Miller" <davem@...emloft.net>, 
	David Ahern <dsahern@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Jakub Kicinski <kuba@...nel.org>, 
	bpf@...r.kernel.org, syzkaller-bugs@...glegroups.com, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH net v2] ipv4, ipv6: Fix handling of transhdrlen in __ip{,6}_append_data()

On Wed, Sep 20, 2023 at 9:54 AM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
>
> David Howells wrote:
> > Including the transhdrlen in length is a problem when the packet is
> > partially filled (e.g. something like send(MSG_MORE) happened previously)
> > when appending to an IPv4 or IPv6 packet as we don't want to repeat the
> > transport header or account for it twice.  This can happen under some
> > circumstances, such as splicing into an L2TP socket.
> >
> > The symptom observed is a warning in __ip6_append_data():
> >
> >     WARNING: CPU: 1 PID: 5042 at net/ipv6/ip6_output.c:1800 __ip6_append_data.isra.0+0x1be8/0x47f0 net/ipv6/ip6_output.c:1800
> >
> > that occurs when MSG_SPLICE_PAGES is used to append more data to an already
> > partially occupied skbuff.  The warning occurs when 'copy' is larger than
> > the amount of data in the message iterator.  This is because the requested
> > length includes the transport header length when it shouldn't.  This can be
> > triggered by, for example:
> >
> >         sfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP);
> >         bind(sfd, ...); // ::1
> >         connect(sfd, ...); // ::1 port 7
> >         send(sfd, buffer, 4100, MSG_MORE);
> >         sendfile(sfd, dfd, NULL, 1024);
> >
> > Fix this by deducting transhdrlen from length in ip{,6}_append_data() right
> > before we clear transhdrlen if there is already a packet that we're going
> > to try appending to.
> >
> > Reported-by: syzbot+62cbf263225ae13ff153@...kaller.appspotmail.com
> > Link: https://lore.kernel.org/r/0000000000001c12b30605378ce8@google.com/
> > Signed-off-by: David Howells <dhowells@...hat.com>
> > cc: Eric Dumazet <edumazet@...gle.com>
> > cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>
> > cc: "David S. Miller" <davem@...emloft.net>
> > cc: David Ahern <dsahern@...nel.org>
> > cc: Paolo Abeni <pabeni@...hat.com>
> > cc: Jakub Kicinski <kuba@...nel.org>
> > cc: netdev@...r.kernel.org
> > cc: bpf@...r.kernel.org
> > cc: syzkaller-bugs@...glegroups.com
> > Link: https://lore.kernel.org/r/75315.1695139973@warthog.procyon.org.uk/ # v1
> > ---
> >  net/ipv4/ip_output.c  |    1 +
> >  net/ipv6/ip6_output.c |    1 +
> >  2 files changed, 2 insertions(+)
> >
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index 4ab877cf6d35..9646f2d9afcf 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -1354,6 +1354,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
> >               if (err)
> >                       return err;
> >       } else {
> > +             length -= transhdrlen;
> >               transhdrlen = 0;
> >       }
> >
> > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> > index 54fc4c711f2c..6a4ce7f622e9 100644
> > --- a/net/ipv6/ip6_output.c
> > +++ b/net/ipv6/ip6_output.c
> > @@ -1888,6 +1888,7 @@ int ip6_append_data(struct sock *sk,
> >               length += exthdrlen;
> >               transhdrlen += exthdrlen;
> >       } else {
> > +             length -= transhdrlen;
> >               transhdrlen = 0;
> >       }
> >
>
> Definitely a much simpler patch, thanks.
>
> So the current model is that callers with non-zero transhdrlen always
> pass to __ip_append_data payload length + transhdrlen.
>
> I do see that udp does this: ulen += sizeof(struct udphdr); This calls
> ip_make_skb if not corked, but directly ip_append_data if corked.
>
> Then __ip_append_data will use transhdrlen in its packet calculations,
> and reset that to zero after allocating the first new skb.
>
> So if corked *and* fragmentation, which would cause a new skb to be
> allocated, the next skb would incorrectly reserve udp header space,
> because the second __ip_append_data call will again pass transhdrlen.
> If so, then this patch fixes that. But that has never been reported,
> so I'm most likely misreading some part..

This works today because udp only includes transhdrlen if not corked.
In udpv6_sendmsg:

        if (up->pending) {
                       ...
                       goto do_append_data;
        }
        ulen += sizeof(struct udphdr);

So once data is pending, ip6_append_data is called with ulen == len.
Subtracting transhdrlen (which is still sizeof(udphdr)) at that point
would not be correct.
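
To make that accounting concrete, here is a minimal userspace sketch. It
is not kernel code: the struct and function names are hypothetical, and it
assumes every send is corked so that the first call marks the socket as
pending.

```c
#include <assert.h>
#include <stddef.h>

/* Size of a UDP header in bytes, i.e. sizeof(struct udphdr). */
#define MODEL_UDP_HDR_LEN 8

/* Hypothetical stand-in for the UDP socket's cork state (up->pending). */
struct model_udp_sock {
    int pending;
};

/*
 * Length a udpv6_sendmsg-like caller hands to ip6_append_data().
 * Simplification: every send is treated as corked, so the first call
 * marks the socket as pending.
 */
static size_t model_udp_append_len(struct model_udp_sock *up, size_t len)
{
    size_t ulen = len;

    if (up->pending)
        return ulen;            /* goto do_append_data: header not added */

    ulen += MODEL_UDP_HDR_LEN;  /* first call: header counted once */
    up->pending = 1;
    return ulen;
}

/* What deducting transhdrlen on a non-empty queue would do to that length. */
static size_t model_deduct_transhdrlen(size_t length, size_t transhdrlen)
{
    return length - transhdrlen;
}

/* Usage: two corked sends of 100 and 50 payload bytes. */
static void model_udp_demo(void)
{
    struct model_udp_sock up = { 0 };

    assert(model_udp_append_len(&up, 100) == 108);
    assert(model_udp_append_len(&up, 50) == 50);
    /* The extra deduction would drop 8 bytes of real payload. */
    assert(model_deduct_transhdrlen(50, MODEL_UDP_HDR_LEN) == 42);
}
```

In this model the second call already passes the bare payload length, so
a further deduction in ip6_append_data would undercount it by the header
size.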

l2tp_ip6_sendmsg more or less follows udpv6_sendmsg, but it
unconditionally sets ulen = len + transhdrlen. So maybe the fix is in
L2TP:

+++ b/net/l2tp/l2tp_ip6.c
@@ -507,7 +507,6 @@ static int l2tp_ip6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
         */
        if (len > INT_MAX - transhdrlen)
                return -EMSGSIZE;
-       ulen = len + transhdrlen;

        /* Mirror BSD error message compatibility */
        if (msg->msg_flags & MSG_OOB)
@@ -628,6 +627,7 @@ static int l2tp_ip6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)

 back_from_confirm:
        lock_sock(sk);
+       ulen = len + (skb_queue_empty(&sk->sk_write_queue) ? transhdrlen : 0);

As said, only raw, udp and l2tp can pass MSG_MORE and so cause
secondary invocations of ip6_append_data for the same send. With raw
passing transhdrlen 0, and udp handled as discussed above, we only
have to consider l2tp.
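
As a rough userspace model of the L2TP-side idea (again not kernel code;
the function names and the 4-byte header size are illustrative
assumptions), the header is added to the length only when nothing is
queued yet. The conditional has to be parenthesized, since + binds
tighter than ?: in C.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative transport header size; an assumption for this sketch. */
#define MODEL_TRANSHDRLEN 4

/*
 * Length an l2tp_ip6_sendmsg-like caller would pass down: the header is
 * included only for the first chunk, i.e. when the write queue is empty.
 */
static size_t model_l2tp_ulen(size_t len, int write_queue_empty)
{
    return len + (write_queue_empty ? MODEL_TRANSHDRLEN : 0);
}

/* Usage: the reproducer's send(4100, MSG_MORE) followed by 1024 more bytes. */
static void model_l2tp_demo(void)
{
    /* First call: queue is empty, header counted once. */
    assert(model_l2tp_ulen(4100, 1) == 4104);
    /* Second call: data already queued, only the payload is counted. */
    assert(model_l2tp_ulen(1024, 0) == 1024);
}
```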