Message-ID: <02b50aae-f0e9-47a4-8365-a977a85975d3@ovn.org>
Date: Thu, 4 Apr 2024 00:52:15 +0200
From: Ilya Maximets <i.maximets@....org>
To: Jakub Kicinski <kuba@...nel.org>, Eric Dumazet <edumazet@...gle.com>
Cc: Stefano Brivio <sbrivio@...hat.com>, davem@...emloft.net,
netdev@...r.kernel.org, pabeni@...hat.com, jiri@...nulli.us,
idosch@...sch.org, johannes@...solutions.net, fw@...len.de,
pablo@...filter.org, Martin Pitt <mpitt@...hat.com>,
Paul Holzinger <pholzing@...hat.com>,
David Gibson <david@...son.dropbear.id.au>, i.maximets@....org
Subject: Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
On 3/19/24 18:40, Jakub Kicinski wrote:
> On Tue, 19 Mar 2024 18:17:47 +0100 Eric Dumazet wrote:
>>> Hi Stefano! I was worried this may happen :( I think we should revert
>>> offending commits, but I'd like to take it on case by case basis.
>>> I'd imagine majority of netlink is only exercised by iproute2 and
>>> libmnl-based tools. Does passt hang specifically on genetlink family
>>> dump? Your commit also mentions RTM_GETROUTE. This is not the only
>>> commit which removed DONE:
>>>
>>> $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges --oneline
>>>
>>> 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
>>> 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
>>> 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
>>> 6647b338fc5c netlink: fix netlink_diag_dump() return value
>>
>> Lets not bring back more RTNL locking please for the handlers that
>> still require it.
>
> Definitely. My git log copy/paste is pretty inaccurate, these two are
> better examples:
>
> 5d9b7cb383bb nexthop: Simplify dump error handling
> 02e24903e5a4 netlink: let core handle error cases in dump operations
>
> I was trying to point out that we merged a handful of DONE "coalescing"
> patches, and if we need to revert - let's only do that for the exact
> commands needed. The comment was raised on my genetlink patch while
> the discussion in the link points to RTM_GETROUTE.
>
>> The core can generate an NLMSG_DONE by itself, if we decide this needs
>> to be done.
>
> Exactly.
FWIW, it seems that Libreswan is suffering from the same issue on an
RTM_GETROUTE dump.
On 6.9.0-rc1 I see:
/usr/sbin/ipsec auto --config ipsec.conf --ctlsocket pluto.ctl \
--start --asynchronous tun-in-1
recvfrom(7,
[
[{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
...
[{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, ...]
], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])
recvfrom(7, <-- Stuck here forever
On 6.8.0 the output is as follows:
recvfrom(7,
[
[{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
...
[{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...]
], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])
recvfrom(7,
[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI,}, 0],
40728, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])
close(7)
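So it seems that Libreswan explicitly waits for NLMSG_DONE to arrive in a
separate message, i.e. in its own recvfrom() rather than in the same buffer
as the last RTM_NEWROUTE entries.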
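For illustration only, here is a minimal sketch of a dump reader that copes with both layouts shown above. It is not Libreswan's code. The names parse_dump_buffer, read_dump and make_msg are hypothetical, and the buffers are built by hand instead of being read from a real AF_NETLINK socket. The header layout (16 bytes, native byte order) and the constant values are assumed to match linux/netlink.h and linux/rtnetlink.h.

```python
import struct

# struct nlmsghdr: nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid
NLMSG_HDR = struct.Struct("=IHHII")
NLMSG_DONE = 3      # assumed value from linux/netlink.h
NLM_F_MULTI = 0x2   # assumed value from linux/netlink.h
RTM_NEWROUTE = 24   # assumed value from linux/rtnetlink.h


def nlmsg_align(n):
    """Round n up to the 4-byte netlink alignment."""
    return (n + 3) & ~3


def make_msg(mtype, payload=b"", flags=NLM_F_MULTI, seq=1, pid=0):
    """Build one netlink message (hypothetical helper for this demo)."""
    length = NLMSG_HDR.size + len(payload)
    pad = b"\0" * (nlmsg_align(length) - length)
    return NLMSG_HDR.pack(length, mtype, flags, seq, pid) + payload + pad


def parse_dump_buffer(buf):
    """Walk every message in one receive buffer.

    Returns (messages, done). done is True as soon as NLMSG_DONE is seen,
    whether it is the only message in the buffer or follows other entries.
    """
    msgs = []
    off = 0
    while off + NLMSG_HDR.size <= len(buf):
        length, mtype, _flags, _seq, _pid = NLMSG_HDR.unpack_from(buf, off)
        if length < NLMSG_HDR.size or off + length > len(buf):
            raise ValueError("truncated or malformed netlink message")
        if mtype == NLMSG_DONE:
            return msgs, True
        msgs.append((mtype, bytes(buf[off + NLMSG_HDR.size:off + length])))
        off += nlmsg_align(length)
    return msgs, False


def read_dump(recv):
    """Call recv() until a buffer containing NLMSG_DONE has been parsed."""
    entries = []
    while True:
        msgs, done = parse_dump_buffer(recv())
        entries.extend(msgs)
        if done:
            # Do not issue another read here: nothing more will arrive.
            return entries


if __name__ == "__main__":
    route = make_msg(RTM_NEWROUTE, b"\0" * 36)  # nlmsg_len=52 as in the traces
    done = make_msg(NLMSG_DONE, b"\0" * 4)      # nlmsg_len=20 as in the traces
    for name, bufs in (("coalesced", [route + route + done]),
                       ("separate", [route + route, done])):
        it = iter(bufs)
        print(name, len(read_dump(lambda: next(it))))
```

The point of the sketch is that the loop decides whether to read again based on what it parsed, not on which read it is in, so it does not matter whether the kernel coalesces NLMSG_DONE with the last batch of entries.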
I reported the issue to Libreswan:
https://github.com/libreswan/libreswan/issues/1675
but just wanted to let you know as well.
I found this because it breaks the IPsec system tests in Open vSwitch.
Best regards, Ilya Maximets.