linux-kernel - Re: How to convert I/O iterators to iterators, sglists and RDMA lists

Message-ID: <Y1ISWla50g5gHax6@iweiny-desk3>
Date:   Thu, 20 Oct 2022 20:30:34 -0700
From:   Ira Weiny <ira.weiny@...el.com>
To:     David Howells <dhowells@...hat.com>
CC:     Christoph Hellwig <hch@...radead.org>,
        Al Viro <viro@...iv.linux.org.uk>, <willy@...radead.org>,
        <dchinner@...hat.com>, Steve French <smfrench@...il.com>,
        Shyam Prasad N <nspmangalore@...il.com>,
        "Rohith Surabattula" <rohiths.msft@...il.com>,
        Jeff Layton <jlayton@...nel.org>,
        <torvalds@...ux-foundation.org>, <linux-cifs@...r.kernel.org>,
        <linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: How to convert I/O iterators to iterators, sglists and RDMA lists

On Thu, Oct 20, 2022 at 03:03:56PM +0100, David Howells wrote:
> Christoph Hellwig <hch@...radead.org> wrote:
> 
> > >  (1) Async direct I/O.
> > > 
> > >      In the async direct I/O case, we cannot hold on to the iterator when we
> > >      return, even if the operation is still in progress (ie. we return
> > >      EIOCBQUEUED), as it is likely to be on the caller's stack.
> > > 
> > >      Also, simply copying the iterator isn't sufficient as virtual userspace
> > >      addresses cannot be trusted and we may have to pin the pages that
> > >      comprise the buffer.
> > 
> > This is closely related to the discussion we are having about pinning
> > for O_DIRECT with Ira and Al.
> 
> Do you have a link to that discussion?  I don't see anything obvious on
> fsdevel including Ira.

I think Christoph meant to say John Hubbard.

> 
> I do see a discussion involving iov_iter_pin_pages, but I don't see Ira
> involved in that.

This one?

https://lore.kernel.org/all/20220831041843.973026-5-jhubbard@nvidia.com/

I've been casually reading it but not directly involved.

Ira

> 
> > What block file systems do is to take the pages from the iter and some flags
> > on what is pinned.  We can generalize this to store all extra state in a
> > flags word, or bite the bullet and allow cloning of the iter in one form or
> > another.
> 
> Yeah, I know.  A list of pages is not an ideal solution.  It can only handle
> contiguous runs of pages, possibly with a partial page at either end.  A bvec
> iterator would be of more use as it can handle a series of partial pages.
> 
> Note also that I would need to turn the pages *back* into an iterator in order
> to commune with sendmsg() in the nether reaches of some network filesystems.
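
To illustrate the point above about partial pages, here is a minimal
userspace sketch in C (not kernel code) that splits a byte range into
per-page (page, offset, len) segments, which is the shape a bvec entry
carries.  The names model_seg and model_split_range and the 4096-byte page
size are assumptions made for this illustration; they do not come from the
patches under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this model */

/* Hypothetical stand-in for a bvec entry: one (possibly partial) page. */
struct model_seg {
	uint64_t page;		/* page index (addr / MODEL_PAGE_SIZE) */
	uint32_t offset;	/* byte offset within that page */
	uint32_t len;		/* bytes covered within that page */
};

/*
 * Split the byte range [addr, addr + len) into per-page segments.
 * Returns the number of segments written; returns 0 if the range is
 * empty or if 'max' entries are not enough to hold the result.
 */
static size_t model_split_range(uint64_t addr, size_t len,
				struct model_seg *segs, size_t max)
{
	size_t n = 0;

	assert(segs != NULL);

	while (len > 0) {
		uint32_t off = (uint32_t)(addr % MODEL_PAGE_SIZE);
		size_t part = MODEL_PAGE_SIZE - off;	/* room left in page */

		if (part > len)
			part = len;
		if (n == max)
			return 0;	/* output array too small */

		segs[n].page = addr / MODEL_PAGE_SIZE;
		segs[n].offset = off;
		segs[n].len = (uint32_t)part;
		n++;

		addr += part;
		len -= part;
	}
	return n;
}
```

In this model a 4128-byte range starting 16 bytes before a page boundary
comes out as three segments: a 16-byte tail of the first page, one full
page, and a 16-byte head of the third.  A plain page list cannot describe
a series of such partial pages without extra offset/length bookkeeping.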
> 
> > >  (2) Crypto.
> > > 
> > >      The crypto interface takes scatterlists, not iterators, so we need to
> > >      be able to convert an iterator into a scatterlist in order to do
> > >      content encryption within netfslib.  Doing this in netfslib makes it
> > >      easier to store content-encrypted files encrypted in fscache.
> > 
> > Note that the scatterlist is generally a pretty bad interface.  We've
> > been talking for a while to have an interface that takes a page array
> > as an input and return an array of { dma_addr, len } tuples.  Thinking
> > about it taking in an iter might actually be an even better idea.
> 
> It would be nice to be able to pass an iterator to the crypto layer.  I'm not
> sure what the crypto people think of that.
> 
> > >  (3) RDMA.
> > > 
> > >      To perform RDMA, a buffer list needs to be presented as a QPE array.
> > >      Currently, cifs converts the iterator it is given to lists of pages,
> > >      then each list to a scatterlist and thence to a QPE array.  I have
> > >      code to pass the iterator down to the bottom, using an intermediate
> > >      BVEC iterator instead of a page list if I can't pass down the
> > >      original directly (eg. an XARRAY iterator on the pagecache), but I
> > >      still end up converting it to a scatterlist, which is then converted
> > >      to a QPE.  I'm trying to go directly from an iterator to a QPE array,
> > >      thus avoiding the need to allocate an sglist.
> > 
> > I'm not sure what you mean by QPE.  The fundamental low-level
> > interface in RDMA is the ib_sge.
> 
> Sorry, yes. ib_sge array.  I think it appears as QPs on the wire.
> 
> > If you feed it to RDMA READ/WRITE requests the interface for that is the
> > RDMA R/W API in drivers/infiniband/core/rw.c, which currently takes a
> > scatterlist but to which all of the above remarks on DMA interface apply.
> > For RDMA SEND the ULP has to do a dma_map_single/page to fill it, which is
> > a quite horrible layering violation and should move into the driver, but
> > that is going to be a massive change to the whole RDMA subsystem, so unlikely
> > to happen anytime soon.
> 
> In cifs, as it is upstream, in RDMA transmission, the iterator is converted
> into a clutch of pages at the top, which is converted back into iterators
> (smbd_send()) and those into scatterlists (smbd_post_send_data()), thence into
> sge lists (see smbd_post_send_sgl()).
> 
> I have patches that pass an iterator (which it decants to a bvec if async) all
> the way down to the bottom layer.  Snippets are then converted to scatterlists
> and those to sge lists.  I would like to skip the scatterlist intermediate and
> convert directly to sge lists.
> 
> On the other hand, if you think the RDMA API should be taking scatterlists
> rather than sge lists, that would be fine.  Even better if I can just pass an
> iterator in directly - though neither scatterlist nor iterator has a place to
> put the RDMA local_dma_key - though I wonder if that's actually necessary for
> each sge element, or whether it could be handed through as part of the request
> as a whole.
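
A rough userspace sketch of that last idea, handing the key through once
per request rather than storing it with each segment.  The model_sge
fields mirror those of struct ib_sge (addr, length, lkey); everything else
(model_mapped_seg, model_fill_sges, and the assumption that the addresses
have already been DMA-mapped) is hypothetical and only illustrates going
from segments to an SGE array without an intermediate scatterlist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical input: a segment whose address is assumed already mapped. */
struct model_mapped_seg {
	uint64_t dma_addr;
	uint32_t len;
};

/* Model of an SGE; field names mirror struct ib_sge. */
struct model_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/*
 * Fill 'sges' directly from 'segs', stamping the same per-request 'lkey'
 * on every element.  Zero-length segments are skipped.  Returns the
 * number of SGEs written, or 0 if 'max' entries are not enough.
 */
static size_t model_fill_sges(const struct model_mapped_seg *segs,
			      size_t nsegs, uint32_t lkey,
			      struct model_sge *sges, size_t max)
{
	size_t n = 0;
	size_t i;

	assert(segs != NULL && sges != NULL);

	for (i = 0; i < nsegs; i++) {
		if (segs[i].len == 0)
			continue;	/* nothing to transfer */
		if (n == max)
			return 0;	/* output array too small */

		sges[n].addr = segs[i].dma_addr;
		sges[n].length = segs[i].len;
		sges[n].lkey = lkey;	/* supplied once by the caller */
		n++;
	}
	return n;
}
```

The segments themselves carry no key here; the caller supplies it once,
so neither a scatterlist nor an iterator would need a slot for it.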
> 
> > Neither case has anything to do with what should be in common iov_iter
> > code, all this needs to live in the RDMA subsystem as a consumer.
> 
> That's fine in principle.  However, I have some extraction code that can
> convert an iterator to another iterator, an sglist or an rdma sge list, using
> a common core of code to do all three.
> 
> I can split it up if that is preferable.
> 
> Do you have code that's ready to be used?  I can make immediate use of it.
> 
> David
>