Message-Id: <cover.1516147540.git.sowmini.varadhan@oracle.com>
Date: Wed, 17 Jan 2018 04:19:58 -0800
From: Sowmini Varadhan <sowmini.varadhan@...cle.com>
To: netdev@...r.kernel.org, willemdebruijn.kernel@...il.com
Cc: davem@...emloft.net, rds-devel@....oracle.com,
sowmini.varadhan@...cle.com, santosh.shilimkar@...cle.com
Subject: [PATCH RFC net-next 0/6] rds: zerocopy support
This patch series provides support for MSG_ZEROCOPY
on a PF_RDS socket, based on the APIs and infrastructure added
by f214f915e7db ("tcp: enable MSG_ZEROCOPY").
For single-threaded rds-stress testing of rds-tcp with the
ixgbe driver and 1M message sizes (-a 1M -q 1M), preliminary
results show a significant reduction in latency: about
90 usec with zerocopy, compared with 200 usec without zerocopy.
Additional testing/debugging is ongoing, but I am sharing
the current patchset to get feedback on the API design choices,
especially the send-completion notification for multi-threaded
datagram socket applications.
Brief RDS Architectural overview: PF_RDS sockets implement
message-bounded datagram semantics over a reliable transport.
The RDS socket layer tracks message boundaries and uses
an underlying transport like TCP to segment/reassemble the
message into MTU sized frames. In addition to the reliable,
ordered delivery semantics provided by the transport, the
RDS layer also retains the datagram in its retransmit queue,
to be resent in case of transport failure/restart events.
This patchset modifies the above for zerocopy in the following manner.
- if the MSG_ZEROCOPY flag is specified with rds_sendmsg(), and,
- if the SO_ZEROCOPY socket option has been set on the PF_RDS socket,
application pages sent down with rds_sendmsg() are pinned. The pinning
uses the accounting infrastructure added by a91dbff551a6 ("sock: ulimit
on MSG_ZEROCOPY pages").
The message is unpinned after we get back an ACK (TCP ACK, in the
case of rds-tcp) indicating that the RDS module at the receiver
has received the datagram, and it is safe for the sender to free
the message from its (RDS) retransmit queue.
The payload bytes in the message must not be modified for as
long as the message is pinned. A multi-threaded
application using this infrastructure therefore needs to be notified
about send-completion, and that notification must uniquely
identify the message to the application so that the application
buffers may be freed/reused.
Unique identification of the message in the completion notification
is done in the following manner:
- application passes down a 32 bit cookie as ancillary data with
rds_sendmsg. The ancillary data in this case has cmsg_level == SOL_RDS
and cmsg_type == RDS_CMSG_ZCOPY_COOKIE.
- upon send-completion, the rds module passes up a batch of cookies
on the sk_error_queue associated with the PF_RDS socket. The message
thus received will have a batch of N cookies in the data, with the
number of cookies (N) specified in the ancillary data passed with
recvmsg(). The current patchset sets up the ancillary data as a
sock_extended_err with ee_origin == SO_EE_ORIGIN_ZEROCOPY, and
ee_data == N based on 52267790ef52 ("sock: add MSG_ZEROCOPY"), and
alternate suggestions for designing this API are invited. The
important point here is that the notification would need to be able
to contain an arbitrary number of cookies, where each cookie
would allow the application to uniquely identify a buffer used with
sendmsg().
Note that cookie-batching on send-completion notification means
that the application may not know the buffering requirements
a priori, and the buffer sent down with recvmsg() on the MSG_ERRQUEUE
may be smaller than the size required for the notifications to be
sent. To accommodate this case, sk_error_queue has been enhanced
to support MSG_PEEK semantics, so that the application
can retry with a larger buffer.
Work in progress
- additional testing: when we test this with rds-stress with 8 sockets
and a send depth of 64 (i.e., each socket can have at most 64 outstanding
requests), rds-stress reports some data corruption. We are working
on drilling down to the root cause.
- optimizing the send-completion notification API: our use-cases are
multi-threaded, and we want to be able to reuse buffers as soon
as possible (instead of waiting for the request-response transaction to
complete). Sub-optimal design of the completion notification can
actually cause a performance regression (system-call overhead to
reap notifications, and throughput can drop because the application does
not send "fast enough", even though latency is small), so this area
needs to be optimized carefully.
- additional test results beyond the rds-stress micro-benchmarks.
Sowmini Varadhan (6):
sock: MSG_PEEK support for sk_error_queue
skbuff: export mm_[un]account_pinned_pages for other modules
rds: hold a sock ref from rds_message to the rds_sock
sock: permit SO_ZEROCOPY on PF_RDS socket
rds: support for zcopy completion notification
rds: zerocopy Tx support.
drivers/net/tun.c | 2 +-
include/linux/skbuff.h | 3 +
include/net/sock.h | 2 +-
include/uapi/linux/rds.h | 1 +
net/core/skbuff.c | 6 ++-
net/core/sock.c | 14 +++++-
net/packet/af_packet.c | 3 +-
net/rds/af_rds.c | 3 +
net/rds/message.c | 119 ++++++++++++++++++++++++++++++++++++++++++++-
net/rds/rds.h | 16 +++++-
net/rds/recv.c | 3 +
net/rds/send.c | 41 ++++++++++++----
12 files changed, 192 insertions(+), 21 deletions(-)