netdev - Re: [PATCH net-next v4] net: ip: avoid OOM kills with large UDP sends over loopback

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+FuTSdoP37qsk0mCvhmiVOGwXtXEktKfcR2PMSEcTQtRBrv7A@mail.gmail.com>
Date:   Thu, 24 Jun 2021 12:45:18 -0400
From:   Willem de Bruijn <willemdebruijn.kernel@...il.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        davem@...emloft.net, netdev@...r.kernel.org,
        eric.dumazet@...il.com, dsahern@...il.com, yoshfuji@...ux-ipv6.org,
        brouer@...hat.com, Dave Jones <dsj@...com>
Subject: Re: [PATCH net-next v4] net: ip: avoid OOM kills with large UDP sends
 over loopback

On Thu, Jun 24, 2021 at 12:28 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Wed, 23 Jun 2021 22:21:11 -0400 Willem de Bruijn wrote:
> > On Wed, Jun 23, 2021 at 5:44 PM Jakub Kicinski <kuba@...nel.org> wrote:
> > >
> > > Dave observed number of machines hitting OOM on the UDP send
> > > path. The workload seems to be sending large UDP packets over
> > > loopback. Since loopback has MTU of 64k kernel will try to
> > > allocate an skb with up to 64k of head space. This has a good
> > > chance of failing under memory pressure. What's worse if
> > > the message length is <32k the allocation may trigger an
> > > OOM killer.
> > >
> > > This is entirely avoidable, we can use an skb with page frags.
> > >
> > > af_unix solves a similar problem by limiting the head
> > > length to SKB_MAX_ALLOC. This seems like a good and simple
> > > approach. It means that UDP messages > 16kB will now
> > > use fragments if underlying device supports SG, if extra
> > > allocator pressure causes regressions in real workloads
> > > we can switch to trying the large allocation first and
> > > falling back.
> > >
> > > v4: pre-calculate all the additions to alloclen so
> > >     we can be sure it won't go over order-2
> > >
> > > Reported-by: Dave Jones <dsj@...com>
> > > Signed-off-by: Jakub Kicinski <kuba@...nel.org>
> > > ---
> > >  net/ipv4/ip_output.c  | 32 ++++++++++++++++++--------------
> > >  net/ipv6/ip6_output.c | 32 +++++++++++++++++---------------
> > >  2 files changed, 35 insertions(+), 29 deletions(-)
> > >
> > > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > > index c3efc7d658f6..8d8a8da3ae7e 100644
> > > --- a/net/ipv4/ip_output.c
> > > +++ b/net/ipv4/ip_output.c
> > > @@ -1054,7 +1054,7 @@ static int __ip_append_data(struct sock *sk,
> > >                         unsigned int datalen;
> > >                         unsigned int fraglen;
> > >                         unsigned int fraggap;
> > > -                       unsigned int alloclen;
> > > +                       unsigned int alloclen, alloc_extra;
> >
> > Separate line?
>
> But why? What makes it preferable to have logically connected variables
> declared on separate lines? The function is already 300 LoC. I've been
> meaning to ask someone about this preference for a while :)

Reverse christmas tree is the norm in netdev. Pointing out for
consistency only. I have no particular opinion on the rule.

Agreed that in this function, multiple entries per line would be preferable.


> > >                         unsigned int pagedlen;
> > >                         struct sk_buff *skb_prev;
> > >  alloc_new_skb:
> > > @@ -1074,35 +1074,39 @@ static int __ip_append_data(struct sock *sk,
> > >                         fraglen = datalen + fragheaderlen;
> > >                         pagedlen = 0;
> > >
> > > +                       alloc_extra = hh_len + 15;
> > > +                       alloc_extra += exthdrlen;
> > > +
> > > +                       /* The last fragment gets additional space at tail.
> > > +                        * Note, with MSG_MORE we overallocate on fragments,
> > > +                        * because we have no idea what fragment will be
> > > +                        * the last.
> > > +                        */
> > > +                       if (datalen == length + fraggap)
> > > +                               alloc_extra += rt->dst.trailer_len;
> > > +
> > >                         if ((flags & MSG_MORE) &&
> > >                             !(rt->dst.dev->features&NETIF_F_SG))
> > >                                 alloclen = mtu;
> > > -                       else if (!paged)
> > > +                       else if (!paged &&
> > > +                                (fraglen + alloc_extra < SKB_MAX_ALLOC ||
> > > +                                 !(rt->dst.dev->features & NETIF_F_SG)))
> >
> > This perhaps deserves a comment. Something like this?
> >
> >   /* avoid order-3 allocations where possible: replace with frags if
> > allowed (sg) */
>
> Here I thought comparing skb alloc size to SKB_MAX_ALLOC is explanatory
> enough ;)

Yeah, I guess you're right. The comment only rewords *what* the code
does, so not super informative. Never mind that suggestion.

> In the middle of the test, like this, right?
>
>                  else if (!paged &&
>                           /* avoid order-3 allocations if device
>                            * can handle skb frags (sg)
>                            */
>                           (fraglen + alloc_extra < SKB_MAX_ALLOC ||
>                           !(rt->dst.dev->features & NETIF_F_SG)))
>
> I should make it less-equal while at it.
>
> > >                                 alloclen = fraglen;
> > >                         else {
> > >                                 alloclen = min_t(int, fraglen, MAX_HEADER);
> > >                                 pagedlen = fraglen - alloclen;
> > >                         }
> > >
> > > -                       alloclen += exthdrlen;
> > > -
> > > -                       /* The last fragment gets additional space at tail.
> > > -                        * Note, with MSG_MORE we overallocate on fragments,
> > > -                        * because we have no idea what fragment will be
> > > -                        * the last.
> > > -                        */
> > > -                       if (datalen == length + fraggap)
> > > -                               alloclen += rt->dst.trailer_len;
> > > +                       alloclen += alloc_extra;
> > >
> > >                         if (transhdrlen) {
> > > -                               skb = sock_alloc_send_skb(sk,
> > > -                                               alloclen + hh_len + 15,
> > > +                               skb = sock_alloc_send_skb(sk, alloclen,
> > >                                                 (flags & MSG_DONTWAIT), &err);
> > >                         } else {
> > >                                 skb = NULL;
> > >                                 if (refcount_read(&sk->sk_wmem_alloc) + wmem_alloc_delta <=
> > >                                     2 * sk->sk_sndbuf)
> > > -                                       skb = alloc_skb(alloclen + hh_len + 15,
> > > +                                       skb = alloc_skb(alloclen,
> > >                                                         sk->sk_allocation);
> > >                                 if (unlikely(!skb))
> > >                                         err = -ENOBUFS;
> >
> > Is there any risk of regressions? If so, would it be preferable to try
> > regular alloc and only on failure, just below here, do the size and SG
> > test and if permitted jump back to the last of the three alloc_len
> > options?
>
> There is, that's what I tried in v1, Eric pointed out that we can't
> modify sk->sk_allocation here because UDP fast path doesn't take the
> lock, and pointed out that UNIX code has to handle similar problem.
> So I decided to just copy what AF_UNIX does. In practical terms
> MTU > 16k is highly unlikely on physical devices (AFAIK) and with
> messages that large hopefully the trip thru the memory allocator won't
> be all that noticeable? If we were capping at one page that'd be a
> problem, but my gut feeling was that order-2 cap is unlikely to hurt.
>
> But I can go back, I'd have to refactor sock_alloc_send_pskb() to pass
> gfp_t explicitly. Probably by creating another layer of helpers
> (__sock_alloc_send_pskb()?). sock_alloc_send_pskb() already takes 6
> params so I was also thinking of converting it to ERR_PTR() return
> (instead of passing the error pointer) (6 is max for register passing).
>
> Should I go back to retry?

For __GFP_NOWARN? Sorry, I missed that.

Okay, then I understand why this approach is preferable. And LGTM. Thanks!