Message-Id: <20170222163901.90834-1-willemdebruijn.kernel@gmail.com>
Date: Wed, 22 Feb 2017 11:38:49 -0500
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: netdev@...r.kernel.org
Cc: Willem de Bruijn <willemb@...gle.com>
Subject: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
From: Willem de Bruijn <willemb@...gle.com>
RFCv2:
I have received a few requests for status and rebased code of this
feature. We have been running this code internally, discovering and
fixing various bugs. With net-next closed, now seems like a good time
to share an updated patchset with fixes. The rebase from RFCv1/v4.2
was mostly straightforward: mainly iov_iter changes. Full changelog:
RFC -> RFCv2:
- review comment: do not loop skb with zerocopy frags onto rx:
    add skb_orphan_frags_rx to orphan even refcounted frags
    call this in __netif_receive_skb_core, deliver_skb and tun:
    the same as 1080e512d44d ("net: orphan frags on receive")
- fix: hold an explicit sk reference on each notification skb.
    previously relied on the reference (or wmem) held by the
    data skb that would trigger notification, but this breaks
    on skb_orphan.
- fix: when aborting a send, do not inc the zerocopy counter
    this caused gaps in the notification chain
- fix: in packet with SOCK_DGRAM, pull ll headers before calling
    zerocopy_sg_from_iter
- fix: if sock_zerocopy_realloc does not allow coalescing,
    do not fail, just allocate a new ubuf
- fix: in tcp, check return value of second allocation attempt
- chg: allocate notification skbs from optmem
    to avoid affecting tcp write queue accounting (TSQ)
- chg: limit #locked pages (ulimit) per user instead of per process
- chg: grow notification ids from 16 to 32 bit
    - pass range [lo, hi] through 32 bit fields ee_info and ee_data
- chg: rebased to davem-net-next on top of v4.10-rc7
- add: limit notification coalescing
    sharing ubufs limits overhead, but delays notification until
    the last packet is released, possibly unbounded. Add a cap.
- tests: add snd_zerocopy_lo pf_packet test
- tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)
The change to allocate notification skbuffs from optmem requires
ensuring that net.core.optmem_max is at least a few hundred KB. To
experiment, run
sysctl -w net.core.optmem_max=1048576
The snd_zerocopy_lo benchmarks reported in the individual patches were
rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were
replaced with skb_orphan_frags to allow looping to local sockets. The
netperf results below are also rerun with v2.
In application load, copy avoidance shows a roughly 5% systemwide
reduction in cycles when streaming large flows and a 4-8% reduction in
wall clock time on early tensorflow test workloads.
Overview (from original RFC):
Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
Implement the feature for TCP, UDP, RAW and packet sockets. This is
a generalization of a previous packet socket RFC patch
http://patchwork.ozlabs.org/patch/413184/
On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
creates skbuff fragments directly from these pages. On tx completion,
it notifies the socket owner that it is safe to modify memory by
queuing a completion notification onto the socket error queue.
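For illustration only, a user-space send call could look like the sketch
below. The fallback MSG_ZEROCOPY define is an assumption: its value must
match whatever the patched include/linux/socket.h exports, so treat the
numeric constant as a placeholder for unpatched headers.

  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>

  #ifndef MSG_ZEROCOPY
  #define MSG_ZEROCOPY 0x4000000  /* assumed value: take it from the patched headers */
  #endif

  /* Queue buf for transmission without copying. The pages backing buf
   * must not be modified until the completion notification for this
   * send has been read from the socket error queue.
   */
  static int send_zerocopy(int fd, const void *buf, size_t len)
  {
          if (send(fd, buf, len, MSG_ZEROCOPY) == -1) {
                  fprintf(stderr, "send: %s\n", strerror(errno));
                  return -1;
          }
          return 0;
  }

The only contract is the one stated above: the pages behind buf stay
shared with the kernel until the matching notification has been read.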
The kernel already implements such copy avoidance with vmsplice plus
splice and with ubuf_info for tun and virtio. Extend the second
with features required by TCP and others: reference counting to
support cloning (retransmit queue) and shared fragments (GSO) and
notification coalescing to handle corking.
Notifications are queued onto the socket error queue as a range
[N, N+m], where N is a per-socket counter incremented on each
successful zerocopy send call.
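A minimal sketch of draining one such notification follows. It assumes
an IPv4 socket, on which the error is delivered as a SOL_IP/IP_RECVERR
control message, and it assumes the SO_EE_ORIGIN_ZEROCOPY origin added
to errqueue.h by this series; the fallback numeric value is a guess and
the patched headers should be used instead.

  #include <errno.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <netinet/in.h>
  #include <linux/errqueue.h>
  #include <sys/socket.h>

  #ifndef SO_EE_ORIGIN_ZEROCOPY
  #define SO_EE_ORIGIN_ZEROCOPY 5  /* assumed value: take it from the patched headers */
  #endif

  /* Read one zerocopy completion from the error queue. On success,
   * store the range [lo, hi] of send calls whose pages may be reused.
   */
  static int recv_zerocopy_notification(int fd, uint32_t *lo, uint32_t *hi)
  {
          struct sock_extended_err *serr;
          struct msghdr msg = {0};
          struct cmsghdr *cm;
          char control[128];

          msg.msg_control = control;
          msg.msg_controllen = sizeof(control);

          if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
                  fprintf(stderr, "recvmsg: %s\n", strerror(errno));
                  return -1;
          }

          cm = CMSG_FIRSTHDR(&msg);
          if (!cm || cm->cmsg_level != SOL_IP || cm->cmsg_type != IP_RECVERR)
                  return -1;

          serr = (struct sock_extended_err *)CMSG_DATA(cm);
          if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
                  return -1;

          *lo = serr->ee_info;
          *hi = serr->ee_data;
          return 0;
  }

Since the counter increments once per successful zerocopy send call, a
caller maps [lo, hi] back to its own buffers by counting its sends.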
* Performance
The below table shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The second three columns
show time spent systemwide (-a -C A,B) on the two cpus that run the
process and interrupt handler. Reported is the median of at least 3
runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
Netperf is pinned to cpu 2, network interrupts to cpu 3, rps and rfs
are disabled, and the kernel is booted with idle=halt.
NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size
perf stat -e cycles $NETPERF
perf stat -C 2,3 -a -e cycles $NETPERF
            --process cycles--    ----cpu cycles----
               std      zc   %       std      zc   %
  4K        27,609  11,217  41    49,217  39,175  79
  16K       21,370   3,823  18    43,540  29,213  67
  64K       20,557   2,312  11    42,189  26,910  64
  256K      21,110   2,134  10    43,006  27,104  63
  1M        20,987   1,610   8    42,759  25,931  61
Perf record indicates the main source of these differences. Below are
process cycles only, at 1M writes (perf record; perf report -n):
std:
 Samples: 42K of event 'cycles', Event count (approx.): 21258597313
    79.41%  33884  netperf          [kernel.kallsyms]  [k] copy_user_generic_string
     3.27%   1396  netperf          [kernel.kallsyms]  [k] tcp_sendmsg
     1.66%    694  netperf          [kernel.kallsyms]  [k] get_page_from_freelist
     0.79%    325  netperf          [kernel.kallsyms]  [k] tcp_ack
     0.43%    188  netperf          [kernel.kallsyms]  [k] __alloc_skb

zc:
 Samples: 1K of event 'cycles', Event count (approx.): 1439509124
    30.36%    584  netperf.zerocop  [kernel.kallsyms]  [k] gup_pte_range
    14.63%    284  netperf.zerocop  [kernel.kallsyms]  [k] __zerocopy_sg_from_iter
     8.03%    159  netperf.zerocop  [kernel.kallsyms]  [k] skb_zerocopy_add_frags_iter
     4.84%     96  netperf.zerocop  [kernel.kallsyms]  [k] __alloc_skb
     3.10%     60  netperf.zerocop  [kernel.kallsyms]  [k] kmem_cache_alloc_node
* Safety
The number of pages that can be pinned on behalf of a user with
MSG_ZEROCOPY is bound by the locked memory ulimit.
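An application can check (and, within the hard limit, raise) that
ulimit with the standard getrlimit/setrlimit interface; the sketch
below is generic POSIX usage, not part of this patchset.

  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/resource.h>

  /* Report the locked-memory limit that bounds MSG_ZEROCOPY page pinning. */
  int main(void)
  {
          struct rlimit rl;

          if (getrlimit(RLIMIT_MEMLOCK, &rl)) {
                  fprintf(stderr, "getrlimit: %s\n", strerror(errno));
                  return 1;
          }

          printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu bytes\n",
                 (unsigned long long)rl.rlim_cur,
                 (unsigned long long)rl.rlim_max);

          /* Raising the soft limit up to the hard limit needs no privilege;
           * raising the hard limit requires CAP_SYS_RESOURCE.
           */
          return 0;
  }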
While the kernel holds process memory pinned, a process cannot safely
reuse those pages for other purposes. Packets looped onto the receive
stack and queued to a socket can be held indefinitely. Avoid unbounded
notification latency by restricting user pages to egress paths only.
skb_orphan_frags_rx() will create a private copy of pages even for
refcounted packets when these are looped, as did skb_orphan_frags for
the original tun zerocopy implementation.
Pages are not remapped read-only. Processes can modify packet contents
while packets are in flight in the kernel path. Bytes on which kernel
control flow depends (headers) are copied to avoid TOCTTOU attacks.
Datapath integrity does not otherwise depend on payload, with three
exceptions: checksums, optional sk_filter/tc u32/.. and device +
driver logic. The effect of wrong checksums is limited to the
misbehaving process. TC filters that access contents may have to be
excluded by adding an skb_orphan_frags_rx.
Processes can also safely avoid OOM conditions by bounding the number
of bytes passed with MSG_ZEROCOPY and by removing shared pages after
transmission from their own memory map.
* Limitations / Known Issues
- PF_INET6 is not yet supported.
- TCP does not build max GSO packets, especially for
small send buffers (< 4 KB)
Willem de Bruijn (12):
sock: allocate skbs from optmem
sock: skb_copy_ubufs support for compound pages
sock: add generic socket zerocopy
sock: enable sendmsg zerocopy
sock: sendmsg zerocopy notification coalescing
sock: sendmsg zerocopy ulimit
sock: sendmsg zerocopy limit bytes per notification
tcp: enable sendmsg zerocopy
udp: enable sendmsg zerocopy
raw: enable sendmsg zerocopy with IP_HDRINCL
packet: enable sendmsg zerocopy
test: add sendmsg zerocopy tests
drivers/net/tun.c | 2 +-
drivers/vhost/net.c | 1 +
include/linux/sched.h | 2 +-
include/linux/skbuff.h | 94 +++-
include/linux/socket.h | 1 +
include/net/sock.h | 4 +
include/uapi/linux/errqueue.h | 1 +
net/core/datagram.c | 35 +-
net/core/dev.c | 4 +-
net/core/skbuff.c | 327 ++++++++++++--
net/core/sock.c | 29 ++
net/ipv4/ip_output.c | 34 +-
net/ipv4/raw.c | 27 +-
net/ipv4/tcp.c | 37 +-
net/packet/af_packet.c | 52 ++-
tools/testing/selftests/net/.gitignore | 2 +
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/snd_zerocopy.c | 354 +++++++++++++++
tools/testing/selftests/net/snd_zerocopy_lo.c | 596 ++++++++++++++++++++++++++
19 files changed, 1536 insertions(+), 67 deletions(-)
create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c
--
2.11.0.483.g087da7b7c-goog