Message-Id: <cover.1516147540.git.sowmini.varadhan@oracle.com>
Date: Wed, 17 Jan 2018 04:19:58 -0800
From: Sowmini Varadhan <sowmini.varadhan@...cle.com>
To: netdev@...r.kernel.org, willemdebruijn.kernel@...il.com
Cc: davem@...emloft.net, rds-devel@....oracle.com,
sowmini.varadhan@...cle.com, santosh.shilimkar@...cle.com
Subject: [PATCH RFC net-next 0/6] rds: zerocopy support
This patch series provides support for MSG_ZEROCOPY
on a PF_RDS socket, based on the APIs and infrastructure added
by f214f915e7db ("tcp: enable MSG_ZEROCOPY").
For single-threaded rds-stress testing of rds-tcp with the
ixgbe driver and 1M message sizes (-a 1M -q 1M), preliminary
results show a significant reduction in latency: about
90 usec with zerocopy, compared with 200 usec without zerocopy.
Additional testing/debugging is ongoing, but I am sharing
the current patchset to get feedback on the API design choices,
especially the send-completion notification for multi-threaded
datagram socket applications.
Brief RDS Architectural overview: PF_RDS sockets implement
message-bounded datagram semantics over a reliable transport.
The RDS socket layer tracks message boundaries and uses
an underlying transport like TCP to segment/reassemble the
message into MTU sized frames. In addition to the reliable,
ordered delivery semantics provided by the transport, the
RDS layer also retains the datagram in its retransmit queue,
to be resent in case of transport failure/restart events.
This patchset modifies the above for zerocopy in the following manner.
- if the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
- if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
application pages sent down with rds_sendmsg() are pinned. The pinning
uses the accounting infrastructure added by a91dbff551a6 ("sock: ulimit
on MSG_ZEROCOPY pages").
The message is unpinned after we get back an ACK (TCP ACK, in the
case of rds-tcp) indicating that the RDS module at the receiver
has received the datagram, and it is safe for the sender to free
the message from its (RDS) retransmit queue.
The payload bytes in the message must not be modified for as
long as the message is pinned. A multi-threaded
application using this infrastructure therefore needs to be notified
about send-completion, and that notification must uniquely
identify the message to the application so that the application
buffers may be freed/reused.
Unique identification of the message in the completion notification
is done in the following manner:
- application passes down a 32 bit cookie as ancillary data with
rds_sendmsg. The ancillary data in this case has cmsg_level == SOL_RDS
and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.
- upon send-completion, the rds module passes up a batch of cookies
on the sk_error_queue associated with the PF_RDS socket. The message
thus received will have a batch of N cookies in the data, with the
number of cookies (N) specified in the ancillary data passed with
recvmsg(). The current patchset sets up the ancillary data as a
sock_extended_err with ee_origin == SO_EE_ORIGIN_ZEROCOPY, and
ee_data == N based on 52267790ef52 ("sock: add MSG_ZEROCOPY"), and
alternate suggestions for designing this API are invited. The
important point here is that the notification would need to be able
to contain an arbitrary number of cookies, where each cookie
would allow the application to uniquely identify a buffer used with
sendmsg().
Note that cookie-batching on send-completion notification means
that the application may not know the buffering requirements
a priori, and the buffer sent down with recvmsg() on the MSG_ERRQUEUE
may be smaller than the size required for the notifications to be
sent. To accommodate this case, sk_error_queue has been enhanced
to support MSG_PEEK semantics, so that the application
can retry with a larger buffer.
Work in progress
- additional testing: when we test this with rds-stress with 8 sockets
and a send depth of 64 (i.e., each socket can have at most 64 outstanding
requests), rds-stress reports some data corruption. We are working
on drilling down to the root cause.
- optimizing the send-completion notification API: our use-cases are
multi-threaded, and we want to be able to reuse buffers as soon
as possible (instead of waiting for the request-response transaction to
complete). Sub-optimal design of the completion notification can
actually cause a performance regression (system-call overhead to
reap notifications, and throughput can drop because the application does
not send "fast enough", even though latency is small), so this area
needs to be optimized carefully.
- additional test results beyond the rds-stress micro-benchmarks.
Sowmini Varadhan (6):
sock: MSG_PEEK support for sk_error_queue
skbuff: export mm_[un]account_pinned_pages for other modules
rds: hold a sock ref from rds_message to the rds_sock
sock: permit SO_ZEROCOPY on PF_RDS socket
rds: support for zcopy completion notification
rds: zerocopy Tx support.
drivers/net/tun.c | 2 +-
include/linux/skbuff.h | 3 +
include/net/sock.h | 2 +-
include/uapi/linux/rds.h | 1 +
net/core/skbuff.c | 6 ++-
net/core/sock.c | 14 +++++-
net/packet/af_packet.c | 3 +-
net/rds/af_rds.c | 3 +
net/rds/message.c | 119 ++++++++++++++++++++++++++++++++++++++++++++-
net/rds/rds.h | 16 +++++-
net/rds/recv.c | 3 +
net/rds/send.c | 41 ++++++++++++----
12 files changed, 192 insertions(+), 21 deletions(-)