Date:   Thu, 18 Jan 2018 06:40:24 -0500
From:   Sowmini Varadhan <sowmini.varadhan@...cle.com>
To:     Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc:     Network Development <netdev@...r.kernel.org>,
        David Miller <davem@...emloft.net>, rds-devel@....oracle.com,
        santosh.shilimkar@...cle.com, sowmini.varadhan@...cle.com
Subject: Re: [PATCH RFC net-next 5/6] rds: support for zcopy completion
 notification

On (01/17/18 19:23), Willem de Bruijn wrote:
> > +static void rds_rm_zerocopy_callback(struct rds_sock *rs)
     :
> > +       skb = alloc_skb(ncookies * sizeof(u32), GFP_ATOMIC);
> 
> TCP zerocopy avoids this issue by allocating the notification skb when the
> zerocopy packet is created, which would be rds_message_copy_from_user.

right, I could allocate the skb when we set up the zcopy data structures.

> This does not add an allocation, if using the same trick of stashing
> the intermediate notification object (op_mmp_znotifier) in skb->cb. Though,
> alloc_skb is probably more expensive than that kzalloc. If nothing else,
> because of more zeroing.

I would have liked to reuse skb->cb, but fell back to alloc_skb in order
to support a flexible number (and size) of cookie notifications.

If we choose the "max number of cookies" option discussed in the previous
thread, I think I should be able to do this with the existing 48-byte cb,
packing eight 32-bit cookies after the sock_extended_err; maybe that's
sufficient.
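
Rough sketch of the cb layout I have in mind (the struct name is just
illustrative, nothing here is final):

        /* hypothetical layout, packed into the 48-byte skb->cb;
         * sizeof(struct sock_extended_err) == 16, leaving 32 bytes,
         * i.e. room for eight u32 cookies
         */
        struct rds_zcopy_cb {
                struct sock_extended_err zee;   /* 16 bytes */
                u32 zcookies[8];                /* 8 * 4 = 32 bytes */
        };

with ee_data counting how many of the zcookies[] slots are valid.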

> > +       serr = SKB_EXT_ERR(skb);
> > +       serr->ee.ee_errno = 0;
> > +       serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
> > +       serr->ee.ee_data = ncookies;
> 
> This changes the semantics of these fields. Please add a new SO_EE_CODE flag,
> even if the semantics can be derived from the packet family (for now).

sounds good, I'll take care of this (and other review comments)
for the next rev.
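
Concretely, something like this at the place the quoted hunk touches
(the SO_EE_CODE_* name below is a placeholder I'm making up, not an
existing uapi flag):

        serr->ee.ee_errno = 0;
        serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
        serr->ee.ee_code = SO_EE_CODE_ZEROCOPY_COOKIES; /* placeholder name */
        serr->ee.ee_data = ncookies;

so that recipients can tell cookie-carrying notifications apart from
the existing tcp-style [ee_info, ee_data] range notifications.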

> Even better would be if we can avoid the cookies completely. I understand
> the issue with concurrent send threads racing on obtaining a zckey value. If
> the sender could learn which zckey was chosen for a call, would that suffice?

I'm not sure I understand the proposal: you want sendmsg() to return
the cookie ("zckey") to the caller? How?

even if we mangled sendmsg() to return something other than the existing
POSIX semantics, how would the application be asynchronously notified
that a send has completed, on a per-message/per-datagram basis?
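
(to be clear, the async channel I'm assuming we keep is the error queue,
i.e. the application today does something like the following; error
handling elided, and fd is the zerocopy send socket:)

        #include <sys/socket.h>
        #include <linux/errqueue.h>

        char control[64];
        struct msghdr msg = { .msg_control = control,
                              .msg_controllen = sizeof(control) };
        struct cmsghdr *cm;
        struct sock_extended_err *serr;

        recvmsg(fd, &msg, MSG_ERRQUEUE);        /* poll/epoll says EPOLLERR */
        cm = CMSG_FIRSTHDR(&msg);
        serr = (struct sock_extended_err *)CMSG_DATA(cm);
        if (serr->ee.ee_origin == SO_EE_ORIGIN_ZEROCOPY)
                /* serr->ee identifies which sends completed */;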

> I suspect that even with concurrent senders, notifications arrive largely
> in order, in which case we could just maintain the existing semantics and
> even reuse that implementation.

not so. rds-stress [1] with -d8 -t8 quickly disproves this on my 10G ixgbe
connection. 

When you have multiple threads writing to a socket, you cannot know the
"order of send" unless you bottleneck all the threads through a single
point of send.  rds-stress is an example of this sort of usage (and
fairly typical of our applications).

[1] http://public-yum.oracle.com/repo/OracleLinux/OL6/ofed_UEK/x86_64//getPackageSource/rds-tools-2.0.7-1.12.el6.src.rpm

Consider two RDS sockets, fd1 and fd2, each sending to the same pair of
IP addresses: if fd1 gets bound to TCP sock1 and fd2 to TCP sock2, it's
very possible that send completions do not arrive in the same order as
the sends. The application cannot know which socket gets bound to which
TCP connection (or even whether, or how many, TCP connections are
involved: mprds strives to make this transparent to the application).

The same problem exists for pure-datagram sockets like PF_PACKET, UDP
etc.: an application may send buf1 (datagram1) and buf2 (datagram2), and
buf2 may make it to the destination before buf1.  "Notifications arrive
largely in order" may be mostly true for a single-stream TCP connection,
but not for the datagram models (or even for threaded TCP apps like
iperf?).

> > +       tail = skb_peek_tail(q);
> > +       if (!tail ||
> > +           SKB_EXT_ERR(tail)->ee.ee_origin != SO_EE_ORIGIN_ZEROCOPY)  {
> > +               __skb_queue_tail(q, skb);
> > +               skb = NULL;
> > +       }
> 
> This drops the packet if the branch is not taken. In the TCP case
> this condition
> means that we can try to coalesce packets, but that is not the case here.

good point, I'll check into this and fix it (the same applies to the
comments on patches 4/6 and 6/6).
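
Probably something along these lines: queue unconditionally, and only
free the new skb if its cookies were actually folded into the tail.
rds_zcookie_coalesce() below is a made-up helper, not existing code:

        tail = skb_peek_tail(q);
        if (tail && rds_zcookie_coalesce(tail, skb)) {
                consume_skb(skb);               /* cookies folded into tail */
        } else {
                __skb_queue_tail(q, skb);       /* never silently drop */
                skb = NULL;
        }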

> >                 } rdma;
> >                 struct rm_data_op {
> >                         unsigned int            op_active:1;
> > -                       unsigned int            op_notify:1;
> > +                       unsigned int            op_notify:1,
> > +                                               op_zcopy:1,
                                                     :
> > +                       struct rds_znotifier    *op_mmp_znotifier;
> 
> not necessary if op_mmp_znotifier is NULL unless set in
> rds_message_copy_from_user

To make sure I don't misunderstand: you are suggesting that we don't
need op_zcopy, and can just check op_mmp_znotifier for NULL-ness (yes,
true, I agree)? Or something else?
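
If that's the suggestion, the callers would simply do (sketch, assuming
the callback keeps the signature from this patch):

        if (rm->data.op_mmp_znotifier)  /* non-NULL only for zcopy sends */
                rds_rm_zerocopy_callback(rs);

and the op_zcopy bit goes away.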

--Sowmini
