Message-ID: <CAF=yD-KoUX=suZaaV3xzixSM2Kds65YRxCR4HyXXkKp8=aN+XQ@mail.gmail.com>
Date:   Mon, 26 Nov 2018 14:49:07 -0500
From:   Willem de Bruijn <willemdebruijn.kernel@...il.com>
To:     Paolo Abeni <pabeni@...hat.com>
Cc:     Network Development <netdev@...r.kernel.org>,
        David Miller <davem@...emloft.net>,
        Willem de Bruijn <willemb@...gle.com>
Subject: Re: [PATCH net-next v2 1/2] udp: msg_zerocopy

On Mon, Nov 26, 2018 at 1:19 PM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
>
> On Mon, Nov 26, 2018 at 1:04 PM Paolo Abeni <pabeni@...hat.com> wrote:
> >
> > On Mon, 2018-11-26 at 12:59 -0500, Willem de Bruijn wrote:
> > > The callers of this function do flush the queue of the other skbs on
> > > error, but only after the call to sock_zerocopy_put_abort.
> > >
> > > sock_zerocopy_put_abort depends on total rollback to revert the
> > > sk_zckey increment and suppress the completion notification (which
> > > must not happen on return with error).
> > >
> > > I don't immediately have a fix. Need to think about this some more..
> >
> > [still out of sheer ignorance] How about taking a refcnt for the whole
> > ip_append_data() scope, like in the tcp case? That will add an atomic
> > op per loop (likely, hitting the cache) but will remove some code hunk
> > in sock_zerocopy_put_abort() and sock_zerocopy_alloc().
>
> The atomic op pair is indeed what I was trying to avoid. But I also need
> to solve the problem that the final decrement will happen from the freeing
> of the other skbs in __ip_flush_pending_frames, and will not suppress
> the notification.
>
> Freeing the entire queue inside __ip_append_data, effectively making it
> a true no-op on error, is one approach. But that is invasive, also to
> non-zerocopy codepaths, so I would rather avoid that.
>
> Perhaps I need to handle the abort logic in udp_sendmsg directly,
> after both __ip_append_data and __ip_flush_pending_frames.

Actually,

(1) the notification suppression is handled correctly, as .._abort
decrements uarg->len. If now zero, this suppresses the notification
in sock_zerocopy_callback, regardless of whether that callback is
called right away or from a later kfree_skb.

(2) if skb_zcopy_set is moved below getfrag, then no kfree_skb
will be called on a zerocopy skb inside __ip_append_data. So on
abort the refcount is exactly the number of zerocopy skbs on the
queue, each of which will call sock_zerocopy_put later. Abort then
only needs to handle the special case of zero, and call
sock_zerocopy_put right away.

Tentative fix on top of v2 (I'll squash into v3):

---

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2179ef84bb44..4b21a58329d1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1099,12 +1099,13 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)

-               if (sk->sk_type != SOCK_STREAM && !refcount_read(&uarg->refcnt))
+               if (sk->sk_type != SOCK_STREAM &&
+                   !refcount_read(&uarg->refcnt)) {
                        refcount_set(&uarg->refcnt, 1);
-
-               sock_zerocopy_put(uarg);
+                       sock_zerocopy_put(uarg);
+               }
        }
 }

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7504da2f33d6..a19396e21b35 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1014,13 +1014,6 @@ static int __ip_append_data(struct sock *sk,
                        skb->csum = 0;
                        skb_reserve(skb, hh_len);

-                       /* only the initial fragment is time stamped */
-                       skb_shinfo(skb)->tx_flags = cork->tx_flags;
-                       cork->tx_flags = 0;
-                       skb_shinfo(skb)->tskey = tskey;
-                       tskey = 0;
-                       skb_zcopy_set(skb, uarg);
-
                        /*
                         *      Find where to start putting bytes.
                         */
@@ -1053,6 +1046,13 @@ static int __ip_append_data(struct sock *sk,
                        exthdrlen = 0;
                        csummode = CHECKSUM_NONE;

+                       /* only the initial fragment is time stamped */
+                       skb_shinfo(skb)->tx_flags = cork->tx_flags;
+                       cork->tx_flags = 0;
+                       skb_shinfo(skb)->tskey = tskey;
+                       tskey = 0;
+                       skb_zcopy_set(skb, uarg);
+
                        if ((flags & MSG_CONFIRM) && !skb_prev)
                                skb_set_dst_pending_confirm(skb, 1);

---

This patch moves all the skb_shinfo operations to after the copy,
to avoid touching that cacheline twice.

Instead of the refcnt trick, I could also refactor sock_zerocopy_put
and call __sock_zerocopy_put directly from the abort path:

---

-void sock_zerocopy_put(struct ubuf_info *uarg)
+static void __sock_zerocopy_put(struct ubuf_info *uarg)
 {
-       if (uarg && refcount_dec_and_test(&uarg->refcnt)) {
-               if (uarg->callback)
-                       uarg->callback(uarg, uarg->zerocopy);
-               else
-                       consume_skb(skb_from_uarg(uarg));
-       }
+       if (uarg->callback)
+               uarg->callback(uarg, uarg->zerocopy);
+       else
+               consume_skb(skb_from_uarg(uarg));
+}
+
+void sock_zerocopy_put(struct ubuf_info *uarg)
+{
+       if (uarg && refcount_dec_and_test(&uarg->refcnt))
+               __sock_zerocopy_put(uarg);
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put);

---
