Date:   Tue, 1 Oct 2019 13:45:27 -0700
From:   John Ousterhout <ouster@...stanford.edu>
To:     Eric Dumazet <eric.dumazet@...il.com>
Cc:     netdev@...r.kernel.org
Subject: Re: BUG: sk_backlog.len can overestimate

On Tue, Oct 1, 2019 at 11:34 AM Eric Dumazet <eric.dumazet@...il.com> wrote:
>
> On 10/1/19 10:25 AM, John Ousterhout wrote:
> > On Tue, Oct 1, 2019 at 9:19 AM Eric Dumazet <eric.dumazet@...il.com> wrote:
> >> ...
> >> Sorry, I have no idea what problem you are seeing.
> >
> > OK, let me try again from the start. Consider two values:
> > * sk->sk_backlog.len
> > * The actual number of bytes in buffers in the current backlog list
> >
> > Now consider a series of propositions:
> >
> > 1. These two are not always the same. As packets get processed by
> > calling sk_backlog_rcv, they are removed from the backlog list, so the
> > actual amount of memory consumed by the backlog list drops. However,
> > sk->sk_backlog.len doesn't change until the entire backlog is cleared,
> > at which point it is reset to zero. So, there can be periods of time
> > where sk->sk_backlog.len overstates the actual memory consumption of
> > the backlog.
>
> Yes, this is done on purpose (and documented in __release_sock()).
>
> Otherwise you could have a livelock situation, with the user thread
> trapped forever in the kernel, never returning to user land.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8eae939f1400326b06d0c9afe53d2a484a326871
>
>
> >
> > 2. The gap between sk->sk_backlog.len and actual backlog size can grow
> > quite large. This happens if new packets arrive while sk_backlog_rcv
> > is working. The socket is locked, so these new packets will be added
> > to the backlog, which will increase sk->sk_backlog.len. Under high
> > load, this could continue indefinitely: packets keep arriving, so the
> > backlog never empties, so sk->sk_backlog.len never gets reset.
> > However, packets are actually being processed from the backlog, so
> > it's possible that the actual size of the backlog isn't changing, yet
> > sk->sk_backlog.len continues to grow.
> >
> > 3. Eventually, the growth in sk->sk_backlog.len will be limited by the
> > "limit" argument to sk_add_backlog. When this happens, packets will be
> > dropped.
>
> _Exactly_ WAI
>
> >
> > 4. Now suppose I pass a value of 1000000 as the limit to
> > sk_add_backlog. It's possible that sk_add_backlog will reject my
> > request even though the backlog only contains a total of 10000 bytes.
> > The other 990000 bytes were present on the backlog at one time (though
> > not necessarily all at the same time), but they have been processed
> > and removed; __release_sock hasn't gotten around to updating
> > sk->sk_backlog.len, because it hasn't been able to completely clear
> > the backlog.
>
> WAI
>
> >
> > 5. Bottom line: under high load, a socket can be forced to drop
> > packets even though it never actually exceeded its memory budget. This
> > isn't a case of a sender trying to fool us; we fooled ourselves,
> > because of the delay in resetting sk->sk_backlog.len.
> >
> > Does this make sense?
>
> Yes, just increase your socket limits. setsockopt(...  SO_RCVBUF ...),
> and risk user threads having bigger socket syscall latencies, obviously.

OK, I think I understand where you are coming from now. However, I
have a comment and a suggestion.

Comment: the terminology here confused me. The problem is that
sk->sk_backlog.len isn't actually a "length" (i.e., the number of bytes
currently stored in the backlog). It's actually "the total number of bytes added
to the backlog since the last time it was completely cleared."
Unfortunately, this isn't documented. Same thing for the limit: the
goal isn't to limit the amount of data in the backlog, it's to
limit the amount of data processed in one call to __release_sock.
Although I saw the comment in __release_sock, it didn't have quite
enough info to fully apprise me of the problem being solved.
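
To make the accounting concrete, here is a toy user-space model of how I
understand the counter to behave. None of this is kernel code; all of the
names (backlog_len, queued, packet_arrives, release_sock_pass, PKT, LIMIT)
are made up for the illustration, and it only mimics the "charge on add,
reset only on a complete drain" behavior described above:

/* Toy model (user space, not kernel code): the counter is only ever
 * charged on add and reset on a complete drain. */
#include <stdio.h>

#define LIMIT 1000000L  /* stand-in for the "limit" passed to sk_add_backlog */
#define PKT   1500L     /* bytes charged per packet (truesize stand-in) */

static long backlog_len;  /* analogue of sk->sk_backlog.len */
static long queued;       /* bytes actually sitting in the backlog right now */
static long max_queued;   /* high-water mark of the real queue */
static long dropped;      /* packets rejected by the limit check */

/* analogue of sk_add_backlog(): the counter only ever goes up */
static void packet_arrives(void)
{
    if (backlog_len + PKT > LIMIT) {   /* limit test uses the counter */
        dropped++;
        return;
    }
    backlog_len += PKT;
    queued += PKT;
    if (queued > max_queued)
        max_queued = queued;
}

/* analogue of one __release_sock() call: drain what was queued at entry,
 * but one new packet arrives for each one processed, so the queue never
 * empties and the counter is never reset */
static void release_sock_pass(void)
{
    long batch = queued;

    for (; batch > 0; batch -= PKT) {
        queued -= PKT;     /* real memory is freed here ...              */
        packet_arrives();  /* ... but arrivals keep charging the counter */
    }
    if (queued == 0)       /* reset happens only on a complete drain */
        backlog_len = 0;
}

int main(void)
{
    int i;

    for (i = 0; i < 10; i++)       /* initial burst of 10 packets */
        packet_arrives();
    for (i = 0; i < 1000; i++)
        release_sock_pass();

    printf("limit=%ld max bytes ever queued=%ld packets dropped=%ld\n",
           LIMIT, max_queued, dropped);
    return 0;
}

With these numbers the queue never holds more than 15000 bytes, yet drops
still occur once the counter reaches the 1000000-byte limit -- which is
point 5 above.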

Suggestion: although the code does solve the problem you mentioned, it
has the unpleasant side effect that it can cause packet drops. This is
unfortunate because the drops are most likely to happen in periods of
overload, which is when you'd really like not to waste work that has
already been done. Did you consider the possibility of returning from
__release_sock without processing the entire backlog? For example,
once __release_sock has processed a batch of packets, why not take the
rest and feed them back through the NAPI mechanism, so they'll be passed
to the net_protocol.handler again?
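
In those same toy-model terms (and only there; I'm not claiming this is
what the kernel code would look like), the suggestion amounts to something
like the following drop-in addition: bound the work done per pass, uncharge
bytes as they are processed, and hand whatever is left back for later
delivery instead of looping until the backlog is empty:

#define BUDGET (64 * PKT)  /* made-up per-pass processing budget */

/* Bounded variant of release_sock_pass() from the sketch above: stop
 * after BUDGET bytes and keep the counter equal to what is really queued. */
static void release_sock_pass_bounded(void)
{
    long done = 0;

    while (queued > 0 && done < BUDGET) {
        queued -= PKT;
        backlog_len -= PKT;  /* counter now tracks the real queue size */
        done += PKT;
        packet_arrives();    /* arrivals still charge it as before */
    }
    /* anything still queued would be handed back to the normal receive
     * path (re-fed via NAPI, say) rather than processed here */
}

With that accounting, the limit check only fires when the real queue
exceeds the limit, while the per-pass budget still bounds the time spent
under the socket lock.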

> > By the way, I have actually observed this phenomenon in an
> > implementation of the Homa transport protocol.
> >
>
> Maybe this transport protocol should size its socket limits correctly :)

But this isn't really about socket resource limits (though that is
conflated in the implementation); it's about limiting the time spent
in a single call to __release_sock, no?
