lists.openwall.net — Open Source and information security mailing list archives
Date: Sun, 2 Jun 2024 12:00:09 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: David Ahern <dsahern@...nel.org>
Cc: Jakub Kicinski <kuba@...nel.org>, Stephen Hemminger <stephen@...workplumber.org>, davem@...emloft.net, 
	netdev@...r.kernel.org, pabeni@...hat.com, 
	Jaroslav Pulchart <jaroslav.pulchart@...ddata.com>
Subject: Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()

On Sun, Jun 2, 2024 at 4:23 AM David Ahern <dsahern@...nel.org> wrote:
>
> On 6/1/24 5:48 PM, Jakub Kicinski wrote:
> > On Sat, 1 Jun 2024 16:10:13 -0700 Stephen Hemminger wrote:
> >> Sorry, I disagree.
> >>
> >> You can't just fix the problem areas. The split was an ABI change, and there could
> >> be a problem in any dump. This is the ABI version of the old argument:
> >>   If a tree falls in a forest and no one is around to hear it, does it make a sound?
> >>
> >> All dumps must behave the same. You are stuck with the legacy behavior.
>
> I don't agree with such a hard line stance. Mistakes made 20 years ago
> cannot hold Linux back from moving forward. We have to continue
> searching for ways to allow better or more performant behavior.
>
> >
> > The dump partitioning is up to the family. Multiple families
> > coalesce NLM_DONE from day 1. "All dumps must behave the same"
> > is saying we should convert all families to be poorly behaved.
> >
> > Admittedly changing the most heavily used parts of rtnetlink is very
> > risky. And there are a couple more corner cases which I'm afraid someone
> > will hit. I'm adding this helper to clearly annotate "legacy"
> > callbacks, so we don't regress again. At the same time nobody should
> > use this in new code or "just to be safe" (read: because they don't
> > understand netlink).
>
> What about a socket option that says "I am a modern app and can handle
> the new way" - similar to the strict mode option that was added? Then
> the decision of requiring a separate message for NLM_DONE can be based
> on the app. Could even throw a `pr_warn_once("modernize app %s/%d\n")`
> to help old apps understand they need to move forward.
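
(For context: a userspace dump consumer that copes with either delivery style — NLM_DONE coalesced into the last data-carrying recv(), or arriving alone in a later recv() — is short to write. A minimal sketch, not code from this thread; RTM_GETADDR shown, error handling trimmed:)

```c
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Dump all addresses and consume messages until NLMSG_DONE,
 * regardless of which recv() carries it. */
static int dump_addrs(int fd)
{
	struct {
		struct nlmsghdr nlh;
		struct ifaddrmsg ifa;
	} req = {
		.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifaddrmsg)),
		.nlh.nlmsg_type = RTM_GETADDR,
		.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
		.ifa.ifa_family = AF_UNSPEC,
	};
	struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
	char buf[32768] __attribute__((aligned(4)));

	if (sendto(fd, &req, req.nlh.nlmsg_len, 0,
		   (struct sockaddr *)&kernel, sizeof(kernel)) < 0)
		return -1;

	for (;;) {
		ssize_t len = recv(fd, buf, sizeof(buf), 0);

		if (len < 0)
			return -1;
		/* One recv() may carry many messages; NLMSG_DONE may be
		 * the last of them, or the only one in a later recv(). */
		for (struct nlmsghdr *nh = (struct nlmsghdr *)buf;
		     NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
			if (nh->nlmsg_type == NLMSG_DONE)
				return 0;
			if (nh->nlmsg_type == NLMSG_ERROR)
				return -1;
			/* process RTM_NEWADDR payload here */
		}
	}
}
```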

The main motivation for me was to avoid re-grabbing RTNL just to get NLM_DONE.

Avoiding the two system calls was really secondary.

I think we could make a generic change in netlink_dump() to force NLM_DONE
into an empty message _and_ avoid a useless call to the dump method, which
might still take RTNL or another contended mutex.

In prior feedback I suggested a sysctl, which Jakub disliked,
but really we do not care yet, as long as we avoid RTNL as much as we can.

Jakub, what about the following generic change, instead of ad-hoc changes?

I tested it; I can send it with the minimal change (the alloc_skb
optimization will reach net-next).

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index fa9c090cf629e6e92c097285b262ed90324c7656..0a58e5d13b8e68dd3fbb2b3fb362c3399fa29909 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2289,15 +2289,20 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
         * ever provided a big enough buffer.
         */
        cb = &nlk->cb;
-       alloc_min_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);
-
-       max_recvmsg_len = READ_ONCE(nlk->max_recvmsg_len);
-       if (alloc_min_size < max_recvmsg_len) {
-               alloc_size = max_recvmsg_len;
-               skb = alloc_skb(alloc_size,
+       if (nlk->dump_done_errno) {
+               alloc_min_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);
+               max_recvmsg_len = READ_ONCE(nlk->max_recvmsg_len);
+               if (alloc_min_size < max_recvmsg_len) {
+                       alloc_size = max_recvmsg_len;
+                       skb = alloc_skb(alloc_size,
                                (GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) |
                                __GFP_NOWARN | __GFP_NORETRY);
+               }
+       } else {
+               /* Allocate the space needed for NLMSG_DONE alone. */
+               alloc_min_size = nlmsg_total_size(sizeof(nlk->dump_done_errno));
        }
+
        if (!skb) {
                alloc_size = alloc_min_size;
                skb = alloc_skb(alloc_size, GFP_KERNEL);
@@ -2350,8 +2355,7 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
                cb->extack = NULL;
        }

-       if (nlk->dump_done_errno > 0 ||
-           skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
+       if (skb->len) {
                mutex_unlock(&nlk->nl_cb_mutex);

                if (sk_filter(sk, skb))
