Message-ID: <CAJwJo6YcPt5+9uQt4yuYS_7o+O8ubjEgOBrq9RmH+b8OpJxdGA@mail.gmail.com>
Date: Wed, 20 Nov 2024 00:19:50 +0000
From: Dmitry Safonov <0x7f454c46@...il.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Dmitry Safonov via B4 Relay <devnull+0x7f454c46.gmail.com@...nel.org>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>, David Ahern <dsahern@...nel.org>,
Ivan Delalande <colona@...sta.com>, Matthieu Baerts <matttbe@...nel.org>,
Mat Martineau <martineau@...nel.org>, Geliang Tang <geliang@...nel.org>,
John Fastabend <john.fastabend@...il.com>, Davide Caratti <dcaratti@...hat.com>,
Kuniyuki Iwashima <kuniyu@...zon.com>, netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
mptcp@...ts.linux.dev, Johannes Berg <johannes@...solutions.net>
Subject: Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
On Tue, 19 Nov 2024 at 00:12, Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Sat, 16 Nov 2024 03:52:47 +0000 Dmitry Safonov wrote:
> > Kind of agree. But then, it seems to be quite rare. Even on a
> > purposely created selftest it fires not each time (maybe I'm not
> > skilful enough). Yet somewhat sceptical about a re-try in the kernel:
> > the need for it is caused by another thread manipulating keys, so we
> > may need another re-try after the first re-try... So, then we would
> > have to introduce a limit on retries :D
>
> Wouldn't be the first time ;)
> But I'd just retry once with a "very large" buffer.
>
> > Hmm, what do you think about a kind of middle-ground/compromise
> > solution: keeping this NLM_F_DUMP_INTR flag and logic, but making it
> > hardly ever/never happen by purposely allocating larger skb. I don't
> > want to set some value in stone as one day it might become not enough
> > for all different socket infos, but maybe just add 4kB more to the
> > initial allocation? So, for it to reproduce, another thread would have
> > to add 4kB/sizeof(tcp_diag_md5sig) = 4kB/100 ~= 40 MD5 keys on the
> > socket between this thread's skb allocation and filling of the info
> > array. I'd call it "attempting to be nice to a user, but not at their
> > busylooping expense".
>
> The size of the retry buffer should be larger than any valid size.
> We can add a warning if calculated size >= 32kB.
Currently, md5/ao keys are limited by sock_kmalloc(), which uses
optmem_max sysctl limit. The default nowadays is 128KB.
From [1] I see that the current in-kernel struct tcp_md5sig_key hits
optmem_max when adding the 655th key:
# ok 38 optmem limit was hit on adding 655 key
IOW, with the default limit and sizeof(struct tcp_diag_md5sig) = 100,
the maximum skb size would be ~= 65kB (655 * 100 bytes). Sounds a
little too big for kmemcache allocation.
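To spell out the arithmetic, here is a minimal userspace sketch. The constants are copied from the numbers quoted in this thread (100 bytes per entry, 655 keys from the selftest output in [1]) rather than read from kernel headers, and the helper names are hypothetical.

```c
#include <stddef.h>

/* Assumed values, taken from this thread and not from kernel headers. */
#define SKETCH_MD5SIG_ENTRY_SIZE 100u /* sizeof(struct tcp_diag_md5sig) as quoted */
#define SKETCH_MAX_KEYS          655u /* keys added before optmem_max was hit in [1] */

/* Payload bytes needed to dump nkeys entries in a single attribute. */
static size_t md5_diag_payload_size(size_t nkeys)
{
	return nkeys * SKETCH_MD5SIG_ENTRY_SIZE;
}

/* Number of whole entries that fit in a buffer of buf_size bytes. */
static size_t md5_diag_keys_fitting(size_t buf_size)
{
	return buf_size / SKETCH_MD5SIG_ENTRY_SIZE;
}
```

With these assumptions, md5_diag_payload_size(SKETCH_MAX_KEYS) is 65500 bytes, md5_diag_keys_fitting(4096) is 40 and md5_diag_keys_fitting(32 * 1024) is 327, ignoring netlink header and other per-socket info overhead.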
Initially, my idea was to limit this old version of tcp-md5-diag by
U16_MAX. Now I'm thinking of adopting your idea by always allocating
32kB skb for single-message and marking it somehow, if it's not big
enough to fit all the keys on a socket (NLM_F_DUMP_INTR or any other
alternative for userspace to get a clue that the single message wasn't
enough).
Then, as I planned, teach the multi-message dump iterator to stop
between recvmsg() on N-th md5/ao key and continue the dump from that
key on the next recvmsg().
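A rough userspace sketch of that resumable iteration, only to illustrate the idea. The cursor struct and function below are hypothetical and are not the kernel's dump callback API; a real implementation would also need a stable way to identify the N-th key if keys are added or removed between recvmsg() calls.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cursor that would be kept between recvmsg() calls. */
struct key_dump_cursor {
	size_t next_key; /* index of the first key not yet dumped */
};

/*
 * Account for as many fixed-size entries as fit into one message of
 * buf_size bytes, starting at cur->next_key. Returns the number of
 * entries placed in this message and sets *more when keys remain.
 * Note: a buffer smaller than one entry makes no progress.
 */
static size_t dump_keys_chunk(struct key_dump_cursor *cur, size_t total_keys,
			      size_t entry_size, size_t buf_size, bool *more)
{
	size_t room, left, n;

	/* Nothing left, e.g. keys were removed since the last call. */
	if (entry_size == 0 || cur->next_key >= total_keys) {
		*more = false;
		return 0;
	}

	room = buf_size / entry_size;
	left = total_keys - cur->next_key;
	n = left < room ? left : room;

	cur->next_key += n;
	*more = cur->next_key < total_keys;
	return n;
}
```

Assuming 1000 keys of 100 bytes each and a 32kB message, this yields three messages of 327 entries followed by one of 19.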
> If we support an inf number of md5 keys we need to cap it.
Yeah, unfortunately, we have some customers with 1000 peers (and
because of that we internally test BGP with even more peers).
And that's with an assumption of one key per peer, which is not
necessarily true for AO.
> Eric is back later this week, perhaps we should wait for his advice.
Sure, I will be glad to have advice from you both, thanks!
> > > Right, the table based parsing doesn't work well with multi-attr,
> > > but other table formats aren't fundamentally better. Or at least
> > > I never came up with a good way of solving this. And the multi-attr
> > > at least doesn't suffer from the u16 problem.
> >
> > Yeah, also an array of structs that makes it impossible to extend such
> > an ABI with new members.
> >
> > And with regards to u16, I was thinking of this diff for net-next, but
> > was not sure if it's worth it:
> >
> > diff --git a/lib/nlattr.c b/lib/nlattr.c
> > index be9c576b6e2d..01c5a49ffa34 100644
> > --- a/lib/nlattr.c
> > +++ b/lib/nlattr.c
> > @@ -903,6 +903,9 @@ struct nlattr *__nla_reserve(struct sk_buff *skb,
> > int attrtype, int attrlen)
> > {
> > struct nlattr *nla;
> >
> > + DEBUG_NET_WARN_ONCE(attrlen >= U16_MAX,
> > + "requested nlattr::nla_len %d >= U16_MAX", attrlen);
> > +
> > nla = skb_put(skb, nla_total_size(attrlen));
> > nla->nla_type = attrtype;
> > nla->nla_len = nla_attr_size(attrlen);
>
> I'm slightly worried that this can be triggered already from user
> space, but we can try DEBUG_NET_* and see. Here and in nla_nest_end().
Yeah, I thought that CONFIG_DEBUG_NET is not enabled on generic
distros, but the description is:
: Enable extra sanity checks in networking.
: This is mostly used by fuzzers, but is safe to select.
so I'm not sure anything guards production users from enabling it.
Still, it would be interesting to see whether netdev produces any
warnings with those new additions.
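For reference, a small userspace sketch of the u16 problem the warning is meant to catch. It assumes a 4-byte attribute header and a 16-bit nla_len, mirroring struct nlattr; the SKETCH_/sketch_ names are made up for illustration and are not kernel helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed attribute header size (aligned sizeof(struct nlattr)). */
#define SKETCH_NLA_HDRLEN 4

/* The value that ends up in a 16-bit nla_len for a given payload length. */
static uint16_t sketch_stored_nla_len(int attrlen)
{
	return (uint16_t)(attrlen + SKETCH_NLA_HDRLEN);
}

/* The condition used by the proposed DEBUG_NET_WARN_ONCE() in the diff. */
static bool sketch_would_warn(int attrlen)
{
	return attrlen >= UINT16_MAX;
}
```

Under these assumptions, a 65500-byte payload (655 keys) still fits: the stored length is 65504. A payload of 65532 bytes would already wrap the stored length to 0 without tripping the attrlen >= U16_MAX condition, so the check might want to account for the header as well.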
[1] https://netdev-3.bots.linux.dev/vmksft-tcp-ao/results/867500/14-setsockopt-closed-ipv4/stdout
Thanks,
Dmitry