Message-Id: <1440081408-12302-1-git-send-email-willemb@google.com>
Date: Thu, 20 Aug 2015 10:36:39 -0400
From: Willem de Bruijn <willemb@...gle.com>
To: netdev@...r.kernel.org
Cc: mst@...hat.com, jasowang@...hat.com,
Willem de Bruijn <willemb@...gle.com>
Subject: [PATCH net-next RFC 00/10] socket sendmsg MSG_ZEROCOPY
Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
Implement the feature for TCP, UDP, RAW and packet sockets. This is
a generalization of a previous packet socket RFC patch
http://patchwork.ozlabs.org/patch/413184/
On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
creates skbuff fragments directly from these pages. On tx completion,
it notifies the socket owner that it is safe to modify memory by
queuing a completion notification onto the socket error queue.
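From userspace, the send side might look like the sketch below. It
assumes the interface as it later appeared in mainline Linux (4.14+),
where a socket must first opt in with SO_ZEROCOPY; that opt-in step is
not described in this cover letter. The helper names zc_enable() and
zc_send() are hypothetical, and the fallback constants are the
asm-generic mainline values.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Fallbacks for older libc headers. These are the mainline (4.14+)
 * asm-generic values and are an assumption relative to this RFC. */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Opt the socket in to zerocopy sends (mainline requirement).
 * Returns 0 on success, -1 with errno set on failure. */
int zc_enable(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/* Send without copying. The caller must leave buf untouched until the
 * matching completion notification arrives on the error queue. */
ssize_t zc_send(int fd, const void *buf, size_t len)
{
	return send(fd, buf, len, MSG_ZEROCOPY);
}
```

A caller would run zc_enable() once per socket and then treat every
buffer passed to zc_send() as owned by the kernel until its
notification has been read.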
The kernel already implements such copy avoidance with vmsplice plus
splice and with ubuf_info for tun and virtio. Extend the latter
with features required by TCP and others: reference counting to
support cloning (retransmit queue) and shared fragments (GSO), and
notification coalescing to handle corking.
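As a rough model of notification coalescing, the sketch below merges a
newly completed range of send ids into a pending notification when the
two are contiguous. It is a simplified userspace illustration, not the
code from these patches: struct zc_notify and zc_notify_extend() are
hypothetical names, and 32-bit ids that wrap around are an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one queued notification: an inclusive range
 * of send-call ids, which may wrap around the 32-bit id space. */
struct zc_notify {
	uint32_t lo;
	uint32_t hi;
};

/* Try to fold the completed range [lo, hi] into the pending
 * notification. Returns true if merged, false if the caller has to
 * queue a separate notification. */
bool zc_notify_extend(struct zc_notify *pending, uint32_t lo, uint32_t hi)
{
	uint64_t pending_len = (uint64_t)(uint32_t)(pending->hi - pending->lo) + 1;
	uint64_t new_len = (uint64_t)(uint32_t)(hi - lo) + 1;

	/* Only merge ranges that directly follow the pending one. */
	if (lo != (uint32_t)(pending->hi + 1))
		return false;

	/* A range covering the whole id space would be ambiguous. */
	if (pending_len + new_len >= (1ULL << 32))
		return false;

	pending->hi = hi;
	return true;
}
```

With corking, many small sends complete together, so merging keeps
the error queue short: one entry can acknowledge a whole burst.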
Notifications are queued onto the socket error queue as a
range [N, N+m], where N is a per-socket counter incremented on each
successful zerocopy send call.
* Performance
The table below shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The last three columns
show Mcycles spent systemwide (-a -C A,B) on the two cpus that run the
process and the interrupt handler. Reported is the median of 3 runs. std
is standard netperf, zc uses zerocopy and % is the ratio of zc to std.
NETPERF="./netperf -t TCP_STREAM -H $host -T 2 -- -m $size"
perf stat -e cycles $NETPERF
perf stat -C 2,3 -a -e cycles $NETPERF
      --process cycles--    ----cpu cycles----
size     std      zc    %      std      zc    %
4K    11,060   5,615   51   20,517  19,694   96
16K    8,706   2,045   23   17,913  15,549   86
64K    8,105   1,152   14   17,592  12,167   69
256K   8,087     926   11   16,953  11,279   66
1M     7,955     826   10   17,228  10,655   62
Perf record indicates the main source of these differences. Process
cycles only (perf record; perf report -n):
std:
Samples: 15K of event 'cycles', Event count (approx.): 7967793182
73.02% 11564 netperf [kernel.kallsyms] [k] copy_user_generic_string
4.73% 746 netperf [kernel.kallsyms] [k] __memset
2.73% 433 netperf [kernel.kallsyms] [k] tcp_sendmsg
2.41% 383 netperf [kernel.kallsyms] [k] get_page_from_freelist
0.90% 143 netperf [kernel.kallsyms] [k] copy_from_iter
zc:
Samples: 1K of event 'cycles', Event count (approx.): 858290585
17.11% 182 netperf.zc.aug2 [kernel.kallsyms] [k] gup_pte_range
9.31% 100 netperf.zc.aug2 [kernel.kallsyms] [k] __memset
7.79% 81 netperf.zc.aug2 [kernel.kallsyms] [k] __zerocopy_sg_from_iter
3.87% 44 netperf.zc.aug2 [kernel.kallsyms] [k] __alloc_skb
3.75% 18 netperf.zc.aug2 netperf.zc.aug2015 [.] allocate_buffer_ring
The individual patches report additional micro-benchmark results.
* Safety
The number of pages that can be pinned on behalf of a process with
MSG_ZEROCOPY is bounded by the locked memory ulimit.
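A process can estimate its budget by querying RLIMIT_MEMLOCK, as in
the sketch below. This is only an upper estimate under the assumption
that nothing else is charged against the same limit; the helper names
are hypothetical.

```c
#include <sys/resource.h>
#include <unistd.h>

/* Whole pages that fit under limit_bytes.
 * Returns -1 to mean "unlimited" for RLIM_INFINITY. */
long zc_pages_for_limit(rlim_t limit_bytes, long page_size)
{
	if (limit_bytes == RLIM_INFINITY)
		return -1;
	return (long)(limit_bytes / (rlim_t)page_size);
}

/* Upper estimate of pages this process may have pinned at once.
 * Returns -1 for unlimited and -2 if the limit cannot be read. */
long zc_max_pinned_pages(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -2;
	return zc_pages_for_limit(rl.rlim_cur, sysconf(_SC_PAGESIZE));
}
```

For example, a 64 KB limit with 4 KB pages allows at most 16 pinned
pages, so larger zerocopy sends would need a raised limit.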
Pages are not mapped read-only. Processes can modify packet contents
while packets are in flight in the kernel path. Bytes on which kernel
control flow depends (headers) are copied to avoid TOCTTOU attacks.
Datapath integrity does not otherwise depend on payload, with three
exceptions: checksums, optional sk_filter/tc u32/.. and device +
driver logic. The effect of wrong checksums is limited to the
misbehaving process. Filters may have to be addressed by inserting a
preventative skb_copy_ubufs(). Device drivers can be whitelisted,
similar to scatter-gather support (NETIF_F_SG).
Conversely, while the kernel holds process memory pinned, a process
cannot safely reuse those pages for other purposes. Some protocols,
notably TCP, may hold data for an unbounded length of time. Tun and
virtio bound latency by calling skb_copy_ubufs() before cloning and
before injecting packets into unbounded latency paths. This approach
is not feasible for TCP.
Processes can safely avoid OOM conditions by bounding the number of
bytes passed with MSG_ZEROCOPY and by removing shared pages after
transmission from their own memory map -- for instance, depending on
the type of page, by calling munmap() or madvise() with
MADV_SOFT_OFFLINE or MADV_DONTNEED. Long-lived kernel references are
an anomaly and this operation should be rare. The mechanism was
suggested in the earlier zerocopy packet socket patch.
* Limitations / Known Issues
- PF_INET6 and PF_UNIX are not yet supported.
- UDP/RAW/PACKET should sleep on ubuf_info alloc failure;
  they currently return ENOBUFS immediately.
- TCP does not build max GSO packets, especially for
small send buffers (< 4 KB)
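Until the ENOBUFS case sleeps, a caller may want to handle it
explicitly. One possible policy, sketched below, is to retry the send
as a regular copying send. This is an illustration and not part of
the patch set: zc_send_or_copy() is a hypothetical helper and the
MSG_ZEROCOPY fallback value is the mainline one.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Mainline (4.14+) value; an assumption relative to this RFC. */
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Try a zerocopy send and fall back to a copying send on ENOBUFS.
 * *used_zerocopy is set to 1 only if the zerocopy send succeeded, so
 * the caller knows whether to expect a completion notification.
 * Returns bytes sent, or -1 with errno set. */
ssize_t zc_send_or_copy(int fd, const void *buf, size_t len,
			int *used_zerocopy)
{
	ssize_t ret = send(fd, buf, len, MSG_ZEROCOPY);

	if (ret >= 0) {
		*used_zerocopy = 1;
		return ret;
	}

	*used_zerocopy = 0;
	if (errno != ENOBUFS)
		return -1;

	/* Out of notification state: pay for the copy this time. */
	return send(fd, buf, len, 0);
}
```

The alternative is to drain pending completions from the error queue
and retry with MSG_ZEROCOPY, which keeps the copy-free path at the
cost of added latency.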
Willem de Bruijn (10):
sock: skb_copy_ubufs support for compound pages
sock: add generic socket zerocopy
sock: enable generic socket zerocopy
sock: zerocopy coalesce support
tcp: enable MSG_ZEROCOPY
udp: enable MSG_ZEROCOPY
raw: enable MSG_ZEROCOPY with hdrincl
packet: enable MSG_ZEROCOPY
sock: RLIMIT number of pinned pages with MSG_ZEROCOPY
test: add zerocopy tests
 drivers/vhost/net.c                           |   1 +
 include/linux/mm_types.h                      |   1 +
 include/linux/skbuff.h                        |  72 +++-
 include/linux/socket.h                        |   1 +
 include/net/sock.h                            |   2 +
 include/uapi/linux/errqueue.h                 |   1 +
 net/core/datagram.c                           |  37 +-
 net/core/skbuff.c                             | 297 ++++++++++++--
 net/core/sock.c                               |   2 +
 net/ipv4/ip_output.c                          |  30 +-
 net/ipv4/raw.c                                |  27 +-
 net/ipv4/tcp.c                                |  31 +-
 net/packet/af_packet.c                        |  44 ++-
 tools/testing/selftests/net/Makefile          |   2 +-
 tools/testing/selftests/net/snd_zerocopy.c    | 353 +++++++++++++++++
 tools/testing/selftests/net/snd_zerocopy_lo.c | 535 ++++++++++++++++++++++++++
 16 files changed, 1372 insertions(+), 64 deletions(-)
 create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
 create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c
--
2.5.0.276.gf5e568e