[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+FuTSdi=tw=N4X2f+paFNM7KHqBgNkV_se-ykZ0+WoA7q0AhQ@mail.gmail.com>
Date: Tue, 3 Sep 2019 13:55:37 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: unlisted-recipients:; (no To-header on input)
Cc: Eric Dumazet <eric.dumazet@...il.com>,
Steve Zabele <zabele@...cast.net>,
Network Development <netdev@...r.kernel.org>,
shum@...ndrew.org, vladimir116@...il.com, saifi.khan@...ikr.in,
Daniel Borkmann <daniel@...earbox.net>, on2k16nm@...il.com,
Stephen Hemminger <stephen@...workplumber.org>
Subject: Re: Is bug 200755 in anyone's queue??
On Fri, Aug 30, 2019 at 4:30 PM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
>
> On Fri, Aug 30, 2019 at 4:54 AM Eric Dumazet <eric.dumazet@...il.com> wrote:
> >
> >
> >
> > On 8/29/19 9:26 PM, Willem de Bruijn wrote:
> >
> > > SO_REUSEPORT was not intended to be used in this way. Opening
> > > multiple connected sockets with the same local port.
> > >
> > > But since the interface allowed connect after joining a group, and
> > > that is being used, I guess that point is moot. Still, I'm a bit
> > > surprised that it ever worked as described.
> > >
> > > Also note that the default distribution algorithm is not round robin
> > > assignment, but hash based. So multiple consecutive datagrams arriving
> > > at the same socket is not unexpected.
> > >
> > > I suspect that this quick hack might "work". It seemed to on the
> > > supplied .c file:
> > >
> > > score = compute_score(sk, net, saddr, sport,
> > > daddr, hnum, dif, sdif);
> > > if (score > badness) {
> > > - if (sk->sk_reuseport) {
> > > + if (sk->sk_reuseport && !sk->sk_state !=
> > > TCP_ESTABLISHED) {
>
> This won't work for a mix of connected and connectionless sockets, of
> course (even ignoring the typo), as it only skips reuseport on the
> connected sockets.
>
> > >
> > > But a more robust approach, that also works on existing kernels, is to
> > > swap the default distribution algorithm with a custom BPF based one (
> > > SO_ATTACH_REUSEPORT_EBPF).
> > >
> >
> > Yes, I suspect that reuseport could still be used by to load-balance incoming packets
> > targetting the same 4-tuple.
> >
> > So all sockets would have the same score, and we would select the first socket in
> > the list (if not applying reuseport hashing)
>
> Can you elaborate a bit?
>
> One option I see is to record in struct sock_reuseport if any port in
> the group is connected and, if so, don't return immediately on the
> first reuseport_select_sock hit, but continue the search for a higher
> scoring connected socket.
>
> Or do return immediately, but do this refined search in
> reuseport_select_sock itself, as it has a reference to all sockets in the
> group in sock_reuseport->socks[]. Instead of the straightforward hash.
That won't work, as reuseport_select_sock does not have access to
protocol specific data, notably inet_dport.
Unfortunately, what I've come up with so far is not concise and slows
down existing reuseport lookup in a busy port table slot. Note that it
is needed for both ipv4 and ipv6.
Do not break out of the port table slot early, but continue to search
for a higher scored match even after matching a reuseport:
"
@@ -413,28 +413,39 @@ static struct sock *udp4_lib_lookup2(struct net *net,
struct udp_hslot *hslot2,
struct sk_buff *skb)
{
+ struct sock *reuseport_result = NULL;
struct sock *sk, *result;
+ int reuseport_score = 0;
int score, badness;
u32 hash = 0;
result = NULL;
badness = 0;
udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
score = compute_score(sk, net, saddr, sport,
daddr, hnum, dif, sdif);
if (score > badness) {
- if (sk->sk_reuseport) {
+ if (sk->sk_reuseport &&
+ sk->sk_state != TCP_ESTABLISHED &&
+ !reuseport_result) {
hash = udp_ehashfn(net, daddr, hnum,
saddr, sport);
- result = reuseport_select_sock(sk, hash, skb,
+ reuseport_result =
reuseport_select_sock(sk, hash, skb,
sizeof(struct udphdr));
- if (result)
- return result;
+ if (reuseport_result)
+ reuseport_score = score;
+ continue;
}
badness = score;
result = sk;
}
}
+
+ if (badness < reuseport_score)
+ result = reuseport_result;
+
return result;
"
To break out after the first reuseport hit when it is safe, i.e., when
it holds no connected sockets, requires adding this state to struct
reuseport_sock at __ip4_datagram_connect. And modify
reuseport_select_sock to read this. At least, I have not found a more
elegant solution.
> Steve, Re: your point on a scalable QUIC server. That is an
> interesting case certainly. Opening a connected socket per flow adds
> both memory and port table pressure. I once looked into an SO_TXONLY
> udp socket option that does not hash connected sockets into the port
> table. In effect receiving on a small set of listening sockets (e.g.,
> one per cpu) and sending over separate tx-only sockets. That still
> introduces unnecessary memory allocation. OTOH it amortizes some
> operations, such as route lookup.
>
> Anyway, that does not fix the immediate issue you reported when using
> SO_REUSEPORT as described.
As for the BPF program: good point on accessing the udp port when
skb->data is already beyond the header.
Programs of type sk_filter can use bpf_skb_load_bytes(_relative).
Which I think will work, but have not tested.
As of kernel 4.19 programs of type BPF_PROG_TYPE_SK_REUSEPORT can be
attached (with CAP_SYS_ADMIN). See
tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c for an
example that parses udp headers with bpf_skb_load_bytes.
Powered by blists - more mailing lists