Message-ID: <CAF=yD-L4bBm_obmsW5m7aTGwFv8uC-hDA7zo9qsAUr0y32Rr0Q@mail.gmail.com>
Date: Wed, 28 Nov 2018 18:50:15 -0500
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Network Development <netdev@...r.kernel.org>,
David Miller <davem@...emloft.net>,
Willem de Bruijn <willemb@...gle.com>
Subject: Re: [PATCH net-next v2 1/2] udp: msg_zerocopy
On Mon, Nov 26, 2018 at 2:49 PM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
>
> On Mon, Nov 26, 2018 at 1:19 PM Willem de Bruijn
> <willemdebruijn.kernel@...il.com> wrote:
> >
> > On Mon, Nov 26, 2018 at 1:04 PM Paolo Abeni <pabeni@...hat.com> wrote:
> > >
> > > On Mon, 2018-11-26 at 12:59 -0500, Willem de Bruijn wrote:
> > > > The callers of this function do flush the queue of the other skbs on
> > > > error, but only after the call to sock_zerocopy_put_abort.
> > > >
> > > > sock_zerocopy_put_abort depends on total rollback to revert the
> > > > sk_zckey increment and suppress the completion notification (which
> > > > must not happen on return with error).
> > > >
> > > > I don't immediately have a fix. Need to think about this some more..
> > >
> > > [still out of sheer ignorance] How about tacking on a refcnt for
> > > the whole ip_append_data() scope, like in the TCP case? That will
> > > add an atomic op per loop (likely hitting the cache), but will
> > > remove some code hunks in sock_zerocopy_put_abort() and
> > > sock_zerocopy_alloc().
> >
> > The atomic op pair is indeed what I was trying to avoid. But I also need
> > to solve the problem that the final decrement will happen from the freeing
> > of the other skbs in __ip_flush_pending_frames, and will not suppress
> > the notification.
> >
> > Freeing the entire queue inside __ip_append_data, effectively making
> > it a true noop on error, is one approach. But that is invasive, also
> > to non-zerocopy codepaths, so I would rather avoid that.
> >
> > Perhaps I need to handle the abort logic in udp_sendmsg directly,
> > after both __ip_append_data and __ip_flush_pending_frames.
>
> Actually,
>
> (1) the notification suppression is handled correctly, as .._abort
> decrements uarg->len. If that is now zero, the notification is
> suppressed in sock_zerocopy_callback, regardless of whether that
> callback is called right away or from a later kfree_skb.
>
> (2) if skb_zcopy_set is moved below getfrag, then no kfree_skb
> will be called on a zerocopy skb inside __ip_append_data. So on
> abort the refcount is exactly the number of zerocopy skbs on the
> queue that will call sock_zerocopy_put later. Abort then only
> needs to handle the special case of zero, and call
> sock_zerocopy_put right away.
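For context, the suppression in (1) hinges on the early return in
sock_zerocopy_callback when uarg->len has dropped back to zero.
Roughly (paraphrasing the existing code in net/core/skbuff.c, not
part of this patch):

	/* if !len, there was only one call, and it was aborted,
	 * so do not queue a completion notification
	 */
	if (!uarg->len || sock_flag(sk, SOCK_DEAD))
		goto release;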
An additional issue is that a refcount_t cannot be initialized to
zero and then incremented for each skb, as the original atomic_t
based patch did: refcount_inc warns on an increment from zero.
I did revert to the basic implementation using an extra ref
for the function call, similar to TCP, as you suggested.
On top of that, as a separate optimization patch, I have a variant
that keeps the refcnt at zero by replacing refcount_inc with
refcount_set(.., refcount_read(..) + 1). Not very pretty.
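Concretely, that variant open-codes the hold roughly like this
(sketch only; the helper name here is made up):

	/* Sketch: refcount_t saturates and warns on increment-from-zero,
	 * so open-code the increment. The read-then-set is not atomic,
	 * so this is only safe while the caller is the sole owner of
	 * uarg, e.g. under the socket lock in __ip_append_data.
	 */
	static void uarg_hold(struct ubuf_info *uarg)
	{
		refcount_set(&uarg->refcnt, refcount_read(&uarg->refcnt) + 1);
	}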
An alternative way to elide the cost of sock_zerocopy_put in the
fast path is to add a static branch on SO_ZEROCOPY. That would
also patch it out at runtime for TCP in the common case.
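With the jump label API that would look something like this (sketch
only; the key name is made up and exact placement is hand-waved):

	DEFINE_STATIC_KEY_FALSE(zerocopy_used_key);

	/* flipped once, in the SO_ZEROCOPY setsockopt handler: */
	static_branch_enable(&zerocopy_used_key);

	/* in the transmit path: a runtime-patched nop unless some
	 * socket has ever enabled SO_ZEROCOPY
	 */
	if (static_branch_unlikely(&zerocopy_used_key))
		sock_zerocopy_put(uarg);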
Anyway, I do intend to send a v3; it is just taking a bit longer
to find a clean solution.
>
> Tentative fix on top of v2 (I'll squash into v3):
>
> ---
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 2179ef84bb44..4b21a58329d1 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1099,12 +1099,13 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
>
> -		if (sk->sk_type != SOCK_STREAM && !refcount_read(&uarg->refcnt))
> +		if (sk->sk_type != SOCK_STREAM &&
> +		    !refcount_read(&uarg->refcnt)) {
>  			refcount_set(&uarg->refcnt, 1);
> -
> -		sock_zerocopy_put(uarg);
> +			sock_zerocopy_put(uarg);
> +		}
>  	}
>  }
>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 7504da2f33d6..a19396e21b35 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1014,13 +1014,6 @@ static int __ip_append_data(struct sock *sk,
>  			skb->csum = 0;
>  			skb_reserve(skb, hh_len);
>
> -			/* only the initial fragment is time stamped */
> -			skb_shinfo(skb)->tx_flags = cork->tx_flags;
> -			cork->tx_flags = 0;
> -			skb_shinfo(skb)->tskey = tskey;
> -			tskey = 0;
> -			skb_zcopy_set(skb, uarg);
> -
>  			/*
>  			 * Find where to start putting bytes.
>  			 */
> @@ -1053,6 +1046,13 @@ static int __ip_append_data(struct sock *sk,
>  			exthdrlen = 0;
>  			csummode = CHECKSUM_NONE;
>
> +			/* only the initial fragment is time stamped */
> +			skb_shinfo(skb)->tx_flags = cork->tx_flags;
> +			cork->tx_flags = 0;
> +			skb_shinfo(skb)->tskey = tskey;
> +			tskey = 0;
> +			skb_zcopy_set(skb, uarg);
> +
>  			if ((flags & MSG_CONFIRM) && !skb_prev)
>  				skb_set_dst_pending_confirm(skb, 1);
>
> ---
>
> This patch moves all the skb_shinfo operations to after the copy,
> to avoid touching the shinfo area twice.
>
> Instead of the refcnt trick, I could also refactor sock_zerocopy_put
> and have the abort path call __sock_zerocopy_put directly:
>
> ---
>
> -void sock_zerocopy_put(struct ubuf_info *uarg)
> +static void __sock_zerocopy_put(struct ubuf_info *uarg)
>  {
> -	if (uarg && refcount_dec_and_test(&uarg->refcnt)) {
> -		if (uarg->callback)
> -			uarg->callback(uarg, uarg->zerocopy);
> -		else
> -			consume_skb(skb_from_uarg(uarg));
> -	}
> +	if (uarg->callback)
> +		uarg->callback(uarg, uarg->zerocopy);
> +	else
> +		consume_skb(skb_from_uarg(uarg));
> +}
> +
> +void sock_zerocopy_put(struct ubuf_info *uarg)
> +{
> +	if (uarg && refcount_dec_and_test(&uarg->refcnt))
> +		__sock_zerocopy_put(uarg);
>  }
>  EXPORT_SYMBOL_GPL(sock_zerocopy_put);
>
> ---
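With that refactoring, the abort path could then invoke the callback
directly when no skb took a reference, instead of faking a reference
with refcount_set() first. A sketch of the combined result:

	void sock_zerocopy_put_abort(struct ubuf_info *uarg)
	{
		if (uarg) {
			struct sock *sk = skb_from_uarg(uarg)->sk;

			atomic_dec(&sk->sk_zckey);
			uarg->len--;

			/* refcnt zero: no zerocopy skb on the queue will
			 * call sock_zerocopy_put later, so call it here
			 */
			if (sk->sk_type != SOCK_STREAM &&
			    !refcount_read(&uarg->refcnt))
				__sock_zerocopy_put(uarg);
			else
				sock_zerocopy_put(uarg);
		}
	}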