Message-ID: <20250325175408.32112-1-kuniyu@amazon.com>
Date: Tue, 25 Mar 2025 10:53:57 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <willemdebruijn.kernel@...il.com>
CC: <davem@...emloft.net>, <dsahern@...nel.org>, <edumazet@...gle.com>,
	<horms@...nel.org>, <kuba@...nel.org>, <kuni1840@...il.com>,
	<kuniyu@...zon.com>, <netdev@...r.kernel.org>, <pabeni@...hat.com>
Subject: Re: [PATCH v1 net 1/3] udp: Fix multiple wraparounds of sk->sk_rmem_alloc.

From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Date: Tue, 25 Mar 2025 10:18:30 -0400
> Kuniyuki Iwashima wrote:
> > From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
> > Date: Mon, 24 Mar 2025 15:44:40 -0400
> > > Kuniyuki Iwashima wrote:
> > > > From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
> > > > Date: Mon, 24 Mar 2025 10:59:49 -0400
> > > > > Kuniyuki Iwashima wrote:
> > > > > > __udp_enqueue_schedule_skb() has the following condition:
> > > > > > 
> > > > > >   if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> > > > > >           goto drop;
> > > > > > 
> > > > > > sk->sk_rcvbuf is initialised from net.core.rmem_default and can
> > > > > > later be configured by SO_RCVBUF, which is capped by
> > > > > > net.core.rmem_max, or by SO_RCVBUFFORCE.
> > > > > > 
> > > > > > If we set sk->sk_rcvbuf to INT_MAX, the condition is always false,
> > > > > > as sk->sk_rmem_alloc is also a signed int.
> > > > > > 
> > > > > > Then, the size of the incoming skb is added to sk->sk_rmem_alloc
> > > > > > unconditionally.
> > > > > > 
> > > > > > This results in integer overflow (possibly multiple times) on
> > > > > > sk->sk_rmem_alloc and allows a single socket to hold skbs up to
> > > > > > the global limit, net.core.udp_mem[1].
> > > > > > 
> > > > > > For example, if we set udp_mem[1] to a large value, set
> > > > > > sk->sk_rcvbuf to INT_MAX, and flood the socket with packets,
> > > > > > we can see multiple overflows:
> > > > > > 
> > > > > >   # cat /proc/net/sockstat | grep UDP:
> > > > > >   UDP: inuse 3 mem 7956736  <-- (7956736 << 12) bytes > INT_MAX * 15
> > > > > >                                              ^- PAGE_SHIFT
> > > > > >   # ss -uam
> > > > > >   State  Recv-Q      ...
> > > > > >   UNCONN -1757018048 ...    <-- flipping the sign repeatedly
> > > > > >          skmem:(r2537949248,rb2147483646,t0,tb212992,f1984,w0,o0,bl0,d0)
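(For scale: sockstat's mem column counts PAGE_SIZE = 4 KiB pages, so
the numbers above work out to

  7956736 pages << 12 = 32,590,790,656 bytes  (~32.6 GB charged)
  INT_MAX * 15        = 32,212,254,705 bytes

i.e. the 31-bit counter has wrapped well over a dozen times.)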
> > > > > > 
> > > > > > Previously, we had a boundary check for INT_MAX, which was removed by
> > > > > > commit 6a1f12dd85a8 ("udp: relax atomic operation on sk->sk_rmem_alloc").
> > > > > > 
> > > > > > A complete fix would be to revert it and cap the right operand by
> > > > > > INT_MAX:
> > > > > > 
> > > > > >   rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
> > > > > >   if (rmem > min(size + (unsigned int)sk->sk_rcvbuf, INT_MAX))
> > > > > >           goto uncharge_drop;
> > > > > > 
> > > > > > but we do not want to add the expensive atomic_add_return() back just
> > > > > > for the corner case.
> > > > > > 
> > > > > > So, let's perform the first check as unsigned int to detect the
> > > > > > integer overflow.
> > > > > > 
> > > > > > Note that we still allow a single wraparound, which can be observed
> > > > > > from userspace, but that is acceptable: it is unlikely that recv()
> > > > > > goes uncalled for a long period, and the negative value will soon
> > > > > > flip back to positive after a few recv() calls.
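For illustration, a minimal standalone sketch (a userspace demo with
invented values, not the kernel code) of why the signed compare can
never fire when rcvbuf is INT_MAX, and how redoing the compare as
unsigned int catches the wrapped counter:

  /* Demo only: the wraparound is performed via unsigned arithmetic
   * to avoid signed-overflow UB; atomic_add() in the kernel wraps
   * the same way.
   */
  #include <limits.h>
  #include <stdio.h>

  int main(void)
  {
          int rcvbuf = INT_MAX;      /* sk_rcvbuf via SO_RCVBUFFORCE */
          int rmem = INT_MAX - 100;  /* sk_rmem_alloc, almost full   */
          int size = 1500;           /* incoming skb->truesize       */

          if (rmem > rcvbuf)         /* old signed check: never true */
                  puts("signed check drops");

          /* charge the skb; the counter wraps negative */
          rmem = (int)((unsigned int)rmem + (unsigned int)size);
          printf("rmem after charge: %d\n", rmem);

          /* checking as unsigned: the wrapped value reads as a huge
           * number and exceeds rcvbuf, so the flood is detected */
          if ((unsigned int)rmem > (unsigned int)rcvbuf)
                  puts("unsigned check drops");

          return 0;
  }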
> > > > > 
> > > > > Can we do better than this?
> > > > 
> > > > Another approach I had in mind was to restore the original validation
> > > > under the recvq lock but without atomic ops like
> > > > 
> > > >   1. add another u32 as union of sk_rmem_alloc (only for UDP)
> > > >   2. access it with READ_ONCE() or under the recvq lock
> > > >   3. perform the validation under the lock
> > > > 
> > > > But it requires more changes around the error-queue handling and
> > > > the general socket implementation, so it would be too invasive for
> > > > net.git, but maybe worth a try for net-next?
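If it helps, a rough userspace model of that idea (all names below
are invented for illustration; this is not an actual patch):

  /* Model of steps 1-3: the counter is a plain u32 and both the
   * bound check and the charge happen inside the queue lock, so no
   * atomic read-modify-write is needed and the counter can never
   * silently wrap past rcvbuf.
   */
  #include <pthread.h>
  #include <stdint.h>

  struct udp_queue {
          pthread_mutex_t lock;  /* stands in for the recvq lock */
          uint32_t rmem_alloc;   /* u32 twin of sk_rmem_alloc    */
          uint32_t rcvbuf;       /* snapshot of sk->sk_rcvbuf    */
  };

  /* Charge an skb of 'size' bytes; returns 0 on success, -1 on drop. */
  static int udp_charge(struct udp_queue *q, uint32_t size)
  {
          int ret = -1;

          pthread_mutex_lock(&q->lock);
          if (q->rmem_alloc + size <= q->rcvbuf) {
                  q->rmem_alloc += size;
                  ret = 0;
          }
          pthread_mutex_unlock(&q->lock);
          return ret;
  }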
> > > 
> > > Definitely not net material. Adding more complexity here would
> > > probably also need some convincing benchmark data.
> > > 
> > > > 
> > > > > Is this because of the "Always allow at least one packet" below, and
> > > > > due to testing the value of the counter without skb->truesize added?
> > > > 
> > > > Yes, that's the reason, although we never receive a single packet
> > > > larger than INT_MAX.
> > > 
> > > I was surprised that we don't take the current skb size into
> > > account when doing this calculation.
> > > 
> > > Turns out that this code used to do that.
> > > 
> > > commit 363dc73acacb ("udp: be less conservative with sock rmem
> > > accounting") made this change:
> > > 
> > > -       if (rmem && (rmem + size > sk->sk_rcvbuf))
> > > +       if (rmem > sk->sk_rcvbuf)
> > >                 goto drop;
> > > 
> > > The special consideration to allow one packet is to avoid starvation
> > > with small rcvbuf, judging also from this review comment:
> > > 
> > > https://lore.kernel.org/netdev/1476938622.5650.111.camel@edumazet-glaptop3.roam.corp.google.com/
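(Concretely: with a tiny rcvbuf, smaller than a single skb's
truesize, "rmem + size > rcvbuf" drops every packet even on an empty
queue, whereas "rmem > rcvbuf" still admits one.)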
> > 
> > Interesting, thanks for the info!
> > 
> > Now rcvbuf may be exceeded by the size of one incoming skb on each
> > CPU, and a user may notice rmem > rcvbuf via ss, but I guess that is
> > accepted because fast recovery is expected.
> > 
> > 
> > > 
> > > That clearly doesn't apply when rcvbuf is near INT_MAX.
> > > Can we separate out the tiny-budget case, and hard-drop (including
> > > the skb->truesize) for normal buffer sizes?
> > 
> > Maybe like this?
> > 
> >         if (rcvbuf < UDP_MIN_RCVBUF) {
> >                 if (rmem > rcvbuf)
> >                         goto drop;
> >         } else {
> >                 if (rmem + size > rcvbuf)
> >                         goto drop;
> >         }
> > 
> > SOCK_MIN_RCVBUF has been 2K + sizeof(struct sk_buff) since 2013, but
> > the regression was reported after that, in 2016, so UDP_MIN_RCVBUF
> > would need to be larger?
> 
> Since the only issue is the overflow, could use a higher bound like
> INT_MAX >> 1.
>  
> > But I wonder whether adding new branches in the fast path is worth it
> > for the corner case; that's why I chose to integrate the cast into
> > the existing branch, allowing a small overflow that is observable
> > only when no thread calls recv() and more than INT_MAX bytes of skbs
> > are queued.
> 
> Okay. Though it can probably be structured so that the likely path
> does not even see this?

Ah, exactly. I will use this and update the commit message in v2.

Thanks!

> 
>     if (rmem + size > rcvbuf) {
>             if (rcvbuf > INT_MAX >> 1)
>                     goto drop;
>             if (rmem > rcvbuf)
>                     goto drop;
>     }
> 
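For reference, here is the suggested structure as a minimal
compilable sketch. The unsigned locals are an assumption of this
sketch (so that rmem + size itself cannot overflow), not necessarily
what v2 will do:

  #include <limits.h>
  #include <stdbool.h>

  /* The fast path costs a single compare; the overflow guard for
   * near-INT_MAX rcvbuf and the allow-one-packet rule for tiny
   * rcvbuf run only on the unlikely full-buffer path.
   */
  static bool udp_should_drop(unsigned int rmem, unsigned int size,
                              unsigned int rcvbuf)
  {
          if (rmem + size > rcvbuf) {
                  /* Huge rcvbuf: hard drop before the counter wraps. */
                  if (rcvbuf > INT_MAX >> 1)
                          return true;
                  /* Small rcvbuf: still admit one packet to avoid
                   * starvation (see the 2016 thread linked above).
                   */
                  if (rmem > rcvbuf)
                          return true;
          }
          return false;
  }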
