lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250806203705.2560493-1-dhowells@redhat.com>
Date: Wed,  6 Aug 2025 21:36:21 +0100
From: David Howells <dhowells@...hat.com>
To: Steve French <sfrench@...ba.org>
Cc: David Howells <dhowells@...hat.com>,
	Paulo Alcantara <pc@...guebit.org>,
	Shyam Prasad N <sprasad@...rosoft.com>,
	Tom Talpey <tom@...pey.com>,
	Wang Zhaolong <wangzhaolong@...weicloud.com>,
	Stefan Metzmacher <metze@...ba.org>,
	Mina Almasry <almasrymina@...gle.com>,
	linux-cifs@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: [RFC PATCH 00/31] netfs: [WIP] Allow the use of MSG_SPLICE_PAGES and use netmem allocator

Hi Steve et al,

[!] NOTE: These patches are NOT FULLY WRITTEN, won't necessarily compile
    all the way through and haven't been fully tested and this is intended
    as a preview of what I'm working on.  Basic SMB1 and SMB2+ work as far
    as "cifs: Rewrite base TCP transmission".  Encrypted, signed and RDMA
    may work but haven't been tested; compressed is disabled.  Assume that
    anything beyond the specified point won't work.

I'm trying to work towards the network filesystems being able to use
zerocopy facilities such as MSG_SPLICE_PAGES and, in doing so, I think the
cifs driver can be simplified quite a bit.

However, making it possible to use the zerocopy facilities is probably
going to mean moving away from storing the data to be transmitted in slab
pages as slab pages don't have refcounts and their lifetime rules are
unsuitable for splicing.

So what I'd like to do is to allocate protocol bufferage from the netmem
allocator.  The netmem allocator, if I understand it correctly, manages DMA
mapping and IOMMU wrangling in bulk, and works as a bump/page fragment
allocator - which has the potential to improve performance a bit.

The aim is to build up a list of fragments for each request using a bvecq.
These form a segmented list and can be spliced together when assembling a
compound request.  The segmented list can then be passed to sendmsg() with
MSG_SPLICE_PAGES in a single call, thereby only having a single loop (in
the TCP stack) to shovel data, not loops-over-loops.  Possibly we can
dispense with corking also, provided we can tell TCP to flush the record
boundaries.  (Note that this also simplifies smbd_send() for RDMA).

To make this easier, I want to introduce a "request descriptor", which I'm
calling "struct smb_message" and allocate it at a higher level, notably the
PDU encoding routines in cifssmb.c and smb2pdu.c and then hand that down
into the transport.  It will contain the list of fragments that form the
message.

smb_message then replaces mid_q_struct, absorbing its contents.  The
transport then doesn't allocate these, but uses the ones that it is given
and the I/O thread gets to simplify its refcounting and do less of it.  The
rule is that smb_message gets an extra ref when it is enqueued and whoever
dequeues it gets this ref and either puts it or hands it on.  The PDU
encoding routines get a ref when allocating them and keep the refs until
they complete.

smb_message is then given a next pointer to allow compounds to be trivially
assembled, with the protocol wrangling being done in the transport.  This
next pointer also allows a bunch of fixed-size arrays to be got rid of
(which were imposing weird restrictions like reducing the maximum component
count of a compound if we stole a kvec[] slot for the transform header).

To this end, I make the following significant changes.  Note that some of
the changes are a way to transit to a later stage.

 (1) Make SMB1 transport use the SMB2 transport rather than having parallel
     dispatch code.

     (a) Get rid of the separation of the message buffer into two kvecs,
     	 one for the rfc1002 header and one for the rest.  This seems to be
     	 only to simplify the signing code - but at the expense of making a
     	 bunch of other places more complex.

     (b) Make SendReceive() wrap cifs_send_recv() as does SendReceive2().

     (c) Get rid of SendReceiveBlockingLock() in favour of using
     	 SendReceive() and a couple of flags to request interruptible
     	 waiting and to indicate Windows-type unlock rather than lock
     	 cancellation.

 (2) Replace mid_q_struct with smb_message and also include credits and
     smb_rqst therein.

 (3) Rewrite the base TCP transmission to be able to use MSG_SPLICE_PAGES.

     (a) Copy all the data involved in a message into a big buffer formed
     	 of a sequence of pages attached to a bvecq.

     (b) If encrypting the message just encrypt this buffer.  Converting
     	 this to a scatterlist is much simpler (and uses less memory) than
     	 encrypting from the protocol elements.

     (c) As the pages in the bvecq are just that, they have refcounts and
     	 can be passed to MSG_SPLICE_PAGES - thereby avoiding the copy in
     	 TCP.

     (d) Compression should be a matter of vmap()'ing these pages to form
     	 the source buffer, allocating a second buffer of pages to form a
     	 dest buffer, also in a bvecq, vmapping that and then doing the
     	 compression.  The first buffer can then just be replaced by the
     	 second.

     (e) __smb_send_rqst() can then do a single sendmsg() with
     	 MSG_SPLICE_PAGES() from an ITER_BVECQ-type iterator.

     (f) smbd_send() can push the same buffer to smbd_post_send_iter() from
     	 the same iterator.

 (4) Clean up mid->callback_data.  Replace it with a waitqueue in
     smb_message (for most commands) and a cifs_io_subrequest pointer (for
     read and write).  Make request completion wait on the smb_message
     waitqueue rather than on server->response_q to avoid thundering herd
     issues.

     (Also, I note that under some circumstances, cifs just wakes up the
     first thing on server->response_q without any reference to *what* it
     is waking up).

 (5) Add some more bits to smb_message to hold the buffers in a bvecq with
     the intent of killing of the smb_rqst struct.

     (a) The PDU encoders will have to work out how much memory they need
     	 for the request protocol bits in advance and tell the smb_message
     	 allocator their requirements.  This will get the requested amount
     	 from the netmem allocator, so it needs to be correctly sized.  A
     	 pointer is then set in smb->request to the buffer.

     (b) The smb_message is given a pointer (->next) to chain to another
     	 message to be compounded after it.

     (c) smb_send_recv_messages() will be used to dispatch a synchronous
     	 request.  If the head smb_message's ->next pointer is not NULL, it
     	 will set the appropriate compound chaining stuff and insert
     	 appropriate padding.  Then it will link the bvecq structs of those
     	 messages together.

 (6) Convert PDU encoders to allocate and use smb_message and pass it down.

     (a) So far, SMB2 Negotiate Protocol, Session Setup, Logoff, Tree
     	 Connect and Tree Disconnect have been done - and though they build
     	 if SMB1 and compression are disabled, they won't work yet and so
     	 haven't been tested.

     (b) SMB2 Posix Mkdir has been attempted and will compile, but is
     	 likely to need rejigging as it's a close associate of Create.

     (c) SMB2 Create/Open is partially done and won't compile.  This gets
     	 complicated because it's used in a lot of places and also gets
     	 compounded - so anything that gets compounded with it must also be
     	 converted.

The patches can be found here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=cifs-experimental

Thanks,
David

David Howells (31):
  iov_iter: Move ITER_DISCARD and ITER_XARRAY iteration out-of-line
  iov_iter: Add a segmented queue of bio_vec[]
  netfs: Provide facility to alloc buffer in a bvecq
  cifs, nls: Provide unicode size determination func
  cifs: Introduce an ALIGN8() macro
  cifs: Move the SMB1 transport code out of transport.c
  cifs: Rename mid_q_entry to smb_message
  cifs: Keep the CPU-endian command ID around
  cifs: Rename SMB2_xxxx_HE to SMB2_xxxx
  cifs: Make smb1's SendReceive() wrap cifs_send_recv()
  cifs: Fix SMB1 to not require separate kvec for the rfc1002 header
  cifs: Replace SendReceiveBlockingLock() with SendReceive() plus flags
  cifs: Institute message managing struct
  cifs: Split crypt_message() into encrypt and decrypt variants
  cifs: Use netfs_alloc/free_folioq_buffer()
  cifs: Rewrite base TCP transmission
  cifs: Rework smb2 decryption
  cifs: Pass smb_message structs down into the transport layer
  cifs: Clean up mid->callback_data and kill off mid->creator
  cifs: Don't need state locking in smb2_get_mid_entry()
  cifs: [DEBUG] smb_message refcounting
  cifs: Add netmem allocation functions
  cifs: Add more pieces to smb_message
  cifs: Convert SMB2 Negotiate Protocol request
  cifs: Convert SMB2 Session Setup request
  cifs: Convert SMB2 Logoff request
  cifs: Convert SMB2 Tree Connect request
  cifs: Convert SMB2 Tree Disconnect request
  cifs: Rearrange Create request subfuncs
  cifs: Convert SMB2 Posix Mkdir request
  cifs: Convert SMB2 Open request

 fs/netfs/Makefile             |    1 +
 fs/netfs/buffer.c             |  102 ++
 fs/nls/nls_base.c             |   33 +
 fs/smb/client/Makefile        |    2 +-
 fs/smb/client/cached_dir.c    |   41 +-
 fs/smb/client/cifs_debug.c    |   15 +-
 fs/smb/client/cifs_unicode.c  |   39 +
 fs/smb/client/cifs_unicode.h  |    2 +
 fs/smb/client/cifsencrypt.c   |   77 +-
 fs/smb/client/cifsfs.c        |   31 +-
 fs/smb/client/cifsglob.h      |  254 ++--
 fs/smb/client/cifsproto.h     |   91 +-
 fs/smb/client/cifssmb.c       |  123 +-
 fs/smb/client/cifstransport.c |  251 ++++
 fs/smb/client/compress.c      |    4 +-
 fs/smb/client/compress.h      |    6 +-
 fs/smb/client/connect.c       |  158 +-
 fs/smb/client/netmisc.c       |   15 +-
 fs/smb/client/ntlmssp.h       |   20 +-
 fs/smb/client/reparse.c       |    2 +-
 fs/smb/client/sess.c          |  271 ++--
 fs/smb/client/smb1ops.c       |  101 +-
 fs/smb/client/smb2file.c      |    2 +-
 fs/smb/client/smb2inode.c     |    6 +-
 fs/smb/client/smb2misc.c      |   80 +-
 fs/smb/client/smb2ops.c       |  570 +++----
 fs/smb/client/smb2pdu.c       | 2616 +++++++++++++++++----------------
 fs/smb/client/smb2proto.h     |   65 +-
 fs/smb/client/smb2transport.c |  229 ++-
 fs/smb/client/smbdirect.c     |   88 +-
 fs/smb/client/smbdirect.h     |    5 +-
 fs/smb/client/trace.h         |   61 +
 fs/smb/client/transport.c     | 1372 +++++++----------
 fs/smb/common/smb2pdu.h       |   92 +-
 fs/smb/server/connection.c    |    2 +-
 fs/smb/server/oplock.c        |    4 +-
 fs/smb/server/smb2misc.c      |   14 +-
 fs/smb/server/smb2ops.c       |   38 +-
 fs/smb/server/smb2pdu.c       |   66 +-
 fs/smb/server/smb_common.c    |    2 +-
 include/linux/bvec.h          |   13 +
 include/linux/iov_iter.h      |   77 +-
 include/linux/netfs.h         |    4 +
 include/linux/nls.h           |    1 +
 include/linux/uio.h           |   11 +
 lib/iov_iter.c                |  406 ++++-
 lib/tests/kunit_iov_iter.c    |  196 +++
 47 files changed, 4177 insertions(+), 3482 deletions(-)
 create mode 100644 fs/netfs/buffer.c
 create mode 100644 fs/smb/client/cifstransport.c


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ