Message-Id: <cover.1653992701.git.asml.silence@gmail.com>
Date:   Tue, 28 Jun 2022 19:56:22 +0100
From:   Pavel Begunkov <asml.silence@...il.com>
To:     io-uring@...r.kernel.org, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org
Cc:     "David S . Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        Willem de Bruijn <willemb@...gle.com>,
        Jens Axboe <axboe@...nel.dk>, kernel-team@...com,
        Pavel Begunkov <asml.silence@...il.com>
Subject: [RFC net-next v3 00/29] io_uring zerocopy send

The third iteration of patches for zerocopy io_uring sends. I fixed
all known issues since the previous version and reshuffled io_uring
patches, but the net/ code didn't change much. I think it's ready
and will send it as a non-RFC soon.

All tests below were run using io_uring with all relevant performance
options turned on. The numbers look good: send + flush per request, which
is the worst case, is on par with non-zerocopy at payload sizes below
600 bytes with the dummy netdev and between 1200 and 1500 bytes in the
NIC tests. Without any "buffer-free" notification flushing, zerocopy is
on par with non-zerocopy on the NIC at around 600 bytes.

dummy:
IO size | non-zc (tx/s) | zc (tx/s)      | zc + flush (tx/s)
8000    | 1299916       | 2396600 (+84%) | 2224219 (+71%)
4000    | 1869230       | 2344146 (+25%) | 2170069 (+16%)
1200    | 2071617       | 2361960 (+14%) | 2203052 (+6%)
600     | 2106794       | 2381527 (+13%) | 2195295 (+4%)

NIC:
IO size | non-zc (tx/s) | zc (tx/s)      | zc + flush (tx/s)
4000    | 495134        | 606420 (+22%)  | 558971 (+12%)
1500    | 551808        | 577116 (+4.5%) | 565803 (+2.5%)
1000    | 584677        | 592088 (+1.2%) | 560885 (-4%)
600     | 596292        | 598550 (+0.4%) | 555366 (-6.7%)
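
For reference, the percentages in the tables read as the relative change
in tx/s against the non-zc column. A minimal sketch of that arithmetic,
assuming this reading is correct; pct_change() is a hypothetical helper
name and is not part of the series or the benchmark:

```c
#include <assert.h>

/*
 * Relative change of `value` against `base`, in percent.
 * Hypothetical helper for illustration only.
 */
static double pct_change(double base, double value)
{
	assert(base > 0.0);
	return (value / base - 1.0) * 100.0;
}
```

For the 8000-byte dummy row, pct_change(1299916, 2396600) comes out at
a bit over 84, which matches the "+84%" shown for the zc column.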

Apart from zerocopy, the series also removes page referencing for
registered buffers (used in all zc tests). I'm experimenting with
notification optimisation, which should improve the 3rd column, but that
will go separately from this series. I've also seen a good CPU usage
reduction for TCP compared to non-zc, but I'm not posting numbers as I
had problems saturating the CPU.

Links:

  RFC v1:
  https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/

  RFC v2:
  https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/

  liburing (copy of the benchmark + some tests):
  https://github.com/isilence/liburing/tree/zc_v3

  kernel repo:
  https://github.com/isilence/linux/tree/zc_v3

API design overview:

  First we take an internal zerocopy handler, aka struct ubuf_info, and let
  io_uring pass it into the network layer in struct msghdr. io_uring
  stores them wrapped in struct io_notif.

  It also has an array of so-called notification slots; each keeps one and
  only one active notifier at a time, to which userspace can bind requests
  by specifying the slot index. Userspace can then request to flush a
  notifier, so that once all buffers and requests used with this notifier
  are completed/freed, it posts one CQE.

  Userspace can't bind new requests to a flushed notifier; however, it can
  keep using the slot, as flushing automatically replaces the notifier with
  a new one.
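
  The slot semantics above can be pictured with a small userspace toy
  model. This is only a sketch of the described behaviour, not the kernel
  code and not the uapi: every toy_* name, the fixed-size pool, and the
  cqe_posted flag are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_SLOTS	4
#define TOY_POOL_SIZE	16

/* Toy stand-in for a notifier; NOT the kernel's struct io_notif. */
struct toy_notif {
	unsigned int refs;	/* requests/buffers still using it */
	bool flushed;		/* set by flush; no new binds reach it */
	bool cqe_posted;	/* the single completion was posted */
};

struct toy_ctx {
	struct toy_notif pool[TOY_POOL_SIZE];	/* backing storage */
	size_t next;				/* next unused pool entry */
	struct toy_notif *slot[TOY_NR_SLOTS];	/* one active notifier each */
};

static struct toy_notif *toy_alloc(struct toy_ctx *c)
{
	struct toy_notif *n;

	assert(c->next < TOY_POOL_SIZE);
	n = &c->pool[c->next++];
	n->refs = 0;
	n->flushed = false;
	n->cqe_posted = false;
	return n;
}

static void toy_init(struct toy_ctx *c)
{
	size_t i;

	c->next = 0;
	for (i = 0; i < TOY_NR_SLOTS; i++)
		c->slot[i] = toy_alloc(c);
}

/* Post the one completion once flushed and no users remain. */
static void toy_maybe_post(struct toy_notif *n)
{
	if (n->flushed && n->refs == 0 && !n->cqe_posted)
		n->cqe_posted = true;
}

/* Bind a request to whatever notifier is currently active in the slot. */
static struct toy_notif *toy_bind(struct toy_ctx *c, size_t idx)
{
	assert(idx < TOY_NR_SLOTS);
	c->slot[idx]->refs++;
	return c->slot[idx];
}

/* A bound request completed or its buffer was freed. */
static void toy_put(struct toy_notif *n)
{
	assert(n->refs > 0);
	n->refs--;
	toy_maybe_post(n);
}

/* Flush: detach the active notifier and install a fresh one in the slot. */
static struct toy_notif *toy_flush(struct toy_ctx *c, size_t idx)
{
	struct toy_notif *old;

	assert(idx < TOY_NR_SLOTS);
	old = c->slot[idx];
	old->flushed = true;
	c->slot[idx] = toy_alloc(c);
	toy_maybe_post(old);
	return old;
}
```

  In this model, two toy_bind() calls on slot 0 followed by toy_flush()
  leave the old notifier pending until both users call toy_put(), at which
  point it posts exactly one completion. A toy_bind() issued after the
  flush lands on the replacement notifier in the same slot.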

Changelog:

  RFC v2 -> RFC v3:
    TCP support
    accounting for normal (non-registered) buffers
    allow to combine reg and normal requests within a notifier
    notification flushing via IORING_OP_RSRC_UPDATE
    overriding io_uring notification tag/user_data
    add ubuf_info submission side reference caching/batching
    fix buffer indexing
    fix io-wq ->uring_lock locking
    fix bugs when mixing with MSG_ZEROCOPY
    fix managed refs bugs in skbuff.c
    numerous cleanups

  RFC -> RFC v2:
    remove additional overhead for non-zc from skb_release_data()
    avoid msg propagation, hide extra bits of non-zc overhead
    task_work based "buffer free" notifications
    improve io_uring's notification refcounting
    added 5/19, (no pfmemalloc tracking)
    added 8/19 and 9/19 preventing small copies with zc
    misc small changes

Pavel Begunkov (29):
  ipv4: avoid partial copy for zc
  ipv6: avoid partial copy for zc
  skbuff: add SKBFL_DONT_ORPHAN flag
  skbuff: carry external ubuf_info in msghdr
  net: bvec specific path in zerocopy_sg_from_iter
  net: optimise bvec-based zc page referencing
  net: don't track pfmemalloc for managed frags
  skbuff: don't mix ubuf_info of different types
  ipv4/udp: support zc with managed data
  ipv6/udp: support zc with managed data
  tcp: support zc with managed data
  tcp: kill extra io_uring's uarg refcounting
  net: let callers provide extra ubuf_info refs
  io_uring: opcode independent fixed buf import
  io_uring: add zc notification infrastructure
  io_uring: cache struct io_notif
  io_uring: complete notifiers in tw
  io_uring: add notification slot registration
  io_uring: rename IORING_OP_FILES_UPDATE
  io_uring: add zc notification flush requests
  io_uring: wire send zc request type
  io_uring: account locked pages for non-fixed zc
  io_uring: allow to pass addr into sendzc
  io_uring: add rsrc referencing for notifiers
  io_uring: sendzc with fixed buffers
  io_uring: flush notifiers after sendzc
  io_uring: allow to override zc tag on flush
  io_uring: batch submission notif referencing
  selftests/io_uring: test zerocopy send

 fs/io_uring.c                                 | 566 +++++++++++++++-
 include/linux/skbuff.h                        |  59 +-
 include/linux/socket.h                        |   8 +
 include/uapi/linux/io_uring.h                 |  43 +-
 net/compat.c                                  |   2 +
 net/core/datagram.c                           |  53 +-
 net/core/skbuff.c                             |  35 +-
 net/ipv4/ip_output.c                          |  66 +-
 net/ipv4/tcp.c                                |  56 +-
 net/ipv6/ip6_output.c                         |  65 +-
 net/socket.c                                  |   6 +
 tools/testing/selftests/net/Makefile          |   1 +
 .../selftests/net/io_uring_zerocopy_tx.c      | 605 ++++++++++++++++++
 .../selftests/net/io_uring_zerocopy_tx.sh     | 131 ++++
 14 files changed, 1613 insertions(+), 83 deletions(-)
 create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
 create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh

-- 
2.36.1
