[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y1ISWla50g5gHax6@iweiny-desk3>
Date: Thu, 20 Oct 2022 20:30:34 -0700
From: Ira Weiny <ira.weiny@...el.com>
To: David Howells <dhowells@...hat.com>
CC: Christoph Hellwig <hch@...radead.org>,
Al Viro <viro@...iv.linux.org.uk>, <willy@...radead.org>,
<dchinner@...hat.com>, Steve French <smfrench@...il.com>,
Shyam Prasad N <nspmangalore@...il.com>,
"Rohith Surabattula" <rohiths.msft@...il.com>,
Jeff Layton <jlayton@...nel.org>,
<torvalds@...ux-foundation.org>, <linux-cifs@...r.kernel.org>,
<linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: How to convert I/O iterators to iterators, sglists and RDMA lists
On Thu, Oct 20, 2022 at 03:03:56PM +0100, David Howells wrote:
> Christoph Hellwig <hch@...radead.org> wrote:
>
> > > (1) Async direct I/O.
> > >
> > > In the async case direct I/O, we cannot hold on to the iterator when we
> > > return, even if the operation is still in progress (ie. we return
> > > EIOCBQUEUED), as it is likely to be on the caller's stack.
> > >
> > > Also, simply copying the iterator isn't sufficient as virtual userspace
> > > addresses cannot be trusted and we may have to pin the pages that
> > > comprise the buffer.
> >
> > This is very related to the discussion we are having related to pinning
> > for O_DIRECT with Ira and Al.
>
> Do you have a link to that discussion? I don't see anything obvious on
> fsdevel including Ira.
I think Christoph meant to say John Hubbard.
>
> I do see a discussion involving iov_iter_pin_pages, but I don't see Ira
> involved in that.
This one?
https://lore.kernel.org/all/20220831041843.973026-5-jhubbard@nvidia.com/
I've been casually reading it but not directly involved.
Ira
>
> > What block file systems do is to take the pages from the iter and some flags
> > on what is pinned. We can generalize this to store all extra state in a
> > flags word, or byte the bullet and allow cloning of the iter in one form or
> > another.
>
> Yeah, I know. A list of pages is not an ideal solution. It can only handle
> contiguous runs of pages, possibly with a partial page at either end. A bvec
> iterator would be of more use as it can handle a series of partial pages.
>
> Note also that I would need to turn the pages *back* into an iterator in order
> to commune with sendmsg() in the nether reaches of some network filesystems.
>
> > > (2) Crypto.
> > >
> > > The crypto interface takes scatterlists, not iterators, so we need to
> > > be able to convert an iterator into a scatterlist in order to do
> > > content encryption within netfslib. Doing this in netfslib makes it
> > > easier to store content-encrypted files encrypted in fscache.
> >
> > Note that the scatterlist is generally a pretty bad interface. We've
> > been talking for a while to have an interface that takes a page array
> > as an input and return an array of { dma_addr, len } tuples. Thinking
> > about it taking in an iter might actually be an even better idea.
>
> It would be nice to be able to pass an iterator to the crypto layer. I'm not
> sure what the crypto people think of that.
>
> > > (3) RDMA.
> > >
> > > To perform RDMA, a buffer list needs to be presented as a QPE array.
> > > Currently, cifs converts the iterator it is given to lists of pages,
> > > then each list to a scatterlist and thence to a QPE array. I have
> > > code to pass the iterator down to the bottom, using an intermediate
> > > BVEC iterator instead of a page list if I can't pass down the
> > > original directly (eg. an XARRAY iterator on the pagecache), but I
> > > still end up converting it to a scatterlist, which is then converted
> > > to a QPE. I'm trying to go directly from an iterator to a QPE array,
> > > thus avoiding the need to allocate an sglist.
> >
> > I'm not sure what you mean with QPE. The fundamental low-level
> > interface in RDMA is the ib_sge.
>
> Sorry, yes. ib_sge array. I think it appears as QPs on the wire.
>
> > If you feed it to RDMA READ/WRITE requests the interface for that is the
> > RDMA R/W API in drivers/infiniband/core/rw.c, which currently takes a
> > scatterlist but to which all of the above remarks on DMA interface apply.
> > For RDMA SEND that ULP has to do a dma_map_single/page to fill it, which is
> > a quite horrible layering violation and should move into the driver, but
> > that is going to a massive change to the whole RDMA subsystem, so unlikely
> > to happen anytime soon.
>
> In cifs, as it is upstream, in RDMA transmission, the iterator is converted
> into a clutch of pages in the top, which is converted back into iterators
> (smbd_send()) and those into scatterlists (smbd_post_send_data()), thence into
> sge lists (see smbd_post_send_sgl()).
>
> I have patches that pass an iterator (which it decants to a bvec if async) all
> the way down to the bottom layer. Snippets are then converted to scatterlists
> and those to sge lists. I would like to skip the scatterlist intermediate and
> convert directly to sge lists.
>
> On the other hand, if you think the RDMA API should be taking scatterlists
> rather than sge lists, that would be fine. Even better if I can just pass an
> iterator in directly - though neither scatterlist nor iterator has a place to
> put the RDMA local_dma_key - though I wonder if that's actually necessary for
> each sge element, or whether it could be handed through as part of the request
> as a hole.
>
> > Neither case has anything to do with what should be in common iov_iter
> > code, all this needs to live in the RDMA subsystem as a consumer.
>
> That's fine in principle. However, I have some extraction code that can
> convert an iterator to another iterator, an sglist or an rdma sge list, using
> a common core of code to do all three.
>
> I can split it up if that is preferable.
>
> Do you have code that's ready to be used? I can make immediate use of it.
>
> David
>
Powered by blists - more mailing lists