linux-kernel - Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aTL09kE6y9A0gLSn@lizhi-Precision-Tower-5810>
Date: Fri, 5 Dec 2025 10:06:30 -0500
From: Frank Li <Frank.li@....com>
To: Koichiro Den <den@...inux.co.jp>
Cc: ntb@...ts.linux.dev, linux-pci@...r.kernel.org,
	dmaengine@...r.kernel.org, linux-kernel@...r.kernel.org,
	mani@...nel.org, kwilczynski@...nel.org, kishon@...nel.org,
	bhelgaas@...gle.com, corbet@....net, vkoul@...nel.org,
	jdmason@...zu.us, dave.jiang@...el.com, allenbh@...il.com,
	Basavaraj.Natikar@....com, Shyam-sundar.S-k@....com,
	kurt.schwemmer@...rosemi.com, logang@...tatee.com,
	jingoohan1@...il.com, lpieralisi@...nel.org, robh@...nel.org,
	jbrunet@...libre.com, fancer.lancer@...il.com, arnd@...db.de,
	pstanner@...hat.com, elfring@...rs.sourceforge.net
Subject: Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA
 backed transport mode

On Fri, Dec 05, 2025 at 12:04:24PM +0900, Koichiro Den wrote:
> On Thu, Dec 04, 2025 at 03:16:25PM -0500, Frank Li wrote:
> > On Fri, Dec 05, 2025 at 12:42:03AM +0900, Koichiro Den wrote:
> > > On Wed, Dec 03, 2025 at 11:14:43AM -0500, Frank Li wrote:
> > > > On Wed, Dec 03, 2025 at 05:53:03PM +0900, Koichiro Den wrote:
> > > > > On Tue, Dec 02, 2025 at 10:42:29AM -0500, Frank Li wrote:
> > > > > > On Tue, Dec 02, 2025 at 03:43:10PM +0900, Koichiro Den wrote:
> > > > > > > On Mon, Dec 01, 2025 at 04:41:05PM -0500, Frank Li wrote:
> > > > > > > > On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> > > > > > > > > Add a new transport backend that uses a remote DesignWare eDMA engine
> > > > > > > > > located on the NTB endpoint to move data between host and endpoint.
> > > > > > > > >
> > > > > > ...
> > > > > > > > > +#include "ntb_edma.h"
> > > > > > > > > +
> > > > > > > > > +/*
> > > > > > > > > + * The interrupt register offsets below are taken from the DesignWare
> > > > > > > > > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > > > > > > > > + * backend currently only supports this layout.
> > > > > > > > > + */
> > > > > > > > > +#define DMA_WRITE_INT_STATUS_OFF   0x4c
> > > > > > > > > +#define DMA_WRITE_INT_MASK_OFF     0x54
> > > > > > > > > +#define DMA_WRITE_INT_CLEAR_OFF    0x58
> > > > > > > > > +#define DMA_READ_INT_STATUS_OFF    0xa0
> > > > > > > > > +#define DMA_READ_INT_MASK_OFF      0xa8
> > > > > > > > > +#define DMA_READ_INT_CLEAR_OFF     0xac
> > > > > > > >
> > > > > > > > Not sure why need access EDMA register because EMDA driver already export
> > > > > > > > as dmaengine driver.
> > > > > > >
> > > > > > > These are intended for EP use. In my current design I intentionally don't
> > > > > > > use the standard dw-edma dmaengine driver on the EP side.
> > > > > >
> > > > > > why not?
> > > > >
> > > > > Conceptually I agree that using the standard dw-edma driver on both sides
> > > > > would be attractive for future extensibility and maintainability. However,
> > > > > there are a couple of concerns for me, some of which might be alleviated by
> > > > > your suggestion below, and some which are more generic safety concerns that
> > > > > I tried to outline in my replies to your other comments.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +#define NTB_EDMA_NOTIFY_MAX_QP		64
> > > > > > > > > +
> > > > > > ...
> > > > > > > > > +
> > > > > > > > > +	virq = irq_create_fwspec_mapping(&fwspec);
> > > > > > > > > +	of_node_put(parent);
> > > > > > > > > +	return (virq > 0) ? virq : -EINVAL;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > > > > > > > > +{
> > > > > > > >
> > > > > > > > Not sue why dw_edma_interrupt_write/read() does work for your case. Suppose
> > > > > > > > just register callback for dmeengine.
> > > > > > >
> > > > > > > If we ran dw_edma_probe() on both the EP and RC sides and let the dmaengine
> > > > > > > callbacks handle int_status/int_clear, I think we could hit races. One side
> > > > > > > might clear a status bit before the other side has a chance to see it and
> > > > > > > invoke its callback. Please correct me if I'm missing something here.
> > > > > >
> > > > > > You should use difference channel?
> > > > >
> > > > > Do you mean something like this:
> > > > > - on EP side, dw_edma_probe() only set up a dedicated channel for notification,
> > > > > - on RC side, do not set up that particular channel via dw_edma_channel_setup(),
> > > > >   but do other remaining channels for DMA transfers.
> > > >
> > > > Yes, it may be simple overall. Of course this will waste a channel.
> > >
> > > So, on the EP side I see two possible approaches:
> > >
> > > (a) Hide "dma" [1] as in [RFC PATCH v2 26/27] and call dw_edma_probe() with
> > >     hand-crafted settings (chip->ll_rd_cnt = 1, chip->ll_wr_cnt = 0).
> > > (b) Or, teach this special-purpose policy (i.e. configuring only a single
> > >     notification channel) to the SoC glue driver's dw_pcie_ep_init_registers(),
> > >     for example via Kconfig. I don't think DT is a good place to describe
> > >     such a policy.
> > >
> > > There is also another option, which do not necessarily run dw_edma_probe()
> > > by ourselves:
> > >
> > > (c) Leave the default initialization by the SoC glue as-is, and override the
> > >     per-channel role via some new dw-edma interface, with the guarantee
> > >     that all channels except the notification channel remain unused on its
> > >     side afterwards. In this model, the EP side builds the LL locations
> > >     for data transfers and the RC configures all channels, but it sets up
> > >     the notification channel in a special manner.
> > >
> > > [1] https://github.com/jonmason/ntb/blob/68113d260674/Documentation/devicetree/bindings/pci/snps%2Cdw-pcie-ep.yaml#L83
> > >
> > > >
> > > > >
> > > > > Also, is it generically safe to have dw_edma_probe() executed from both ends on
> > > > > the same eDMA instance, as long as the channels are carefully partitioned
> > > > > between them?
> > > >
> > > > Channel register MMIO space is sperated. Some channel register shared
> > > > into one 32bit register.
> > > >
> > > > But the critical one, interrupt status is w1c. So only write BIT(channel)
> > > > is safe.
> > > >
> > > > Need careful handle irq enable/disable.
> > >
> > > Yeah, I agree it is unavoidable in this model.
> > >
> > > >
> > > > Or you can defer all actual DMA transfer to EP side, you can append
> > > > MSI write at last item of link to notify RC side about DMA done. (actually
> > > > RIE should do the same thing)
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > To avoid that, in my current implementation, the RC side handles the
> > > > > > > status/int_clear registers in the usual way, and the EP side only tries to
> > > > > > > suppress needless edma_int as much as possible.
> > > > > > >
> > > > > > > That said, I'm now wondering if it would be better to set LIE=0/RIE=1 for
> > > > > > > the DMA transfer channels and LIE=1/RIE=0 for the notification channel.
> > > > > > > That would require some changes on dw-edma core.
> > > > > >
> > > > > > If dw-edma work as remote DMA, which should enable RIE. like
> > > > > > dw-edma-pcie.c, but not one actually use it recently.
> > > > > >
> > > > > > Use EDMA as doorbell should be new case and I think it is quite useful.
> > > > > >
> > > > > > > >
> > > > > > > > > +	struct ntb_edma_interrupt *v = data;
> > > > > > > > > +	u32 mask = BIT(EDMA_RD_CH_NUM);
> > > > > > > > > +	u32 i, val;
> > > > > > > > > +
> > > > > > ...
> > > > > > > > > +	ret = dw_edma_probe(chip);
> > > > > > > >
> > > > > > > > I think dw_edma_probe() should be in ntb_hw_epf.c, which provide DMA
> > > > > > > > dma engine support.
> > > > > > > >
> > > > > > > > EP side, suppose default dwc controller driver already setup edma engine,
> > > > > > > > so use correct filter function, you should get dma chan.
> > > > > > >
> > > > > > > I intentionally hid edma for EP side in .dts patch in [RFC PATCH v2 26/27]
> > > > > > > so that RC side only manages eDMA remotely and avoids the potential race
> > > > > > > condition I mentioned above.
> > > > > >
> > > > > > Improve eDMA core to suppport some dma channel work at local, some for
> > > > > > remote.
> > > > >
> > > > > Right, Firstly I experimented a bit more with different LIE/RIE settings and
> > > > > ended up with the following observations:
> > > > >
> > > > > * LIE=0/RIE=1 does not seem to work at the hardware level. When I tried this for
> > > > >   DMA transfer channels, the RC side never received any interrupt. The databook
> > > > >   (5.40a, 8.2.2 "Interrupts and Error Handling") has a hint that says
> > > > >   "If you want a remote interrupt and not a local interrupt then: Set LIE and
> > > > >   RIE [...]", so I think this behaviour is expected.
> > > >
> > > > Actually, you can append MSI write at last one of DMA descriptor link. So
> > > > it will not depend on eDMA's IRQ at all.
> > >
> > > For RC->EP interrupts on R-Car S4 in EP mode, using ITS_TRANSLATER as the
> > > IB iATU target did not appear to work in practice. Indeed that was the
> > > motivation for the RFC v1 series [2]. I have not tried using ITS_TRANSLATER
> > > as the eDMA read transfer DAR.
> > >
> > > But in any case, simply masking the local interrupt is sufficient here. I
> > > mainly wanted to point out that my naive idea of LIE=0/RIE=1 is not
> > > implementable with this hardware. This whole LIE/RIE topic is a bit
> > > off-track, sorry for the noise.
> > >
> > > [2] For the record, RFC v2 is conceptually orthogonal and introduces a
> > >     broader concept ie. remote eDMA model, but I reused many of the
> > >     preparatory commits from v1, which is why this is RFC v2 rather than a
> > >     separate series.
> > >
> > > >
> > > > > * LIE=1/RIE=0 does work at the hardware level, but is problematic for my current
> > > > >   design, where the RC issues the DMA transfer for the notification via
> > > > >   ntb_edma_notify_peer(). With RIE=0, the RC never calls
> > > > >   dw_edma_core_handle_int() for that channel, which means that internal state
> > > > >   such as dw_edma_chan.status is never managed correctly.
> > > >
> > > > If you append on MSI write at DMA link, you needn't check status register,
> > > > just check current LL pos to know which descrptor already done.
> > > >
> > > > Or you also enable LIE and disable related IRQ line(without register
> > > > irq handler), so Local IRQ will be ignore by GIC, you can safe handle at
> > > > RC side.
> > >
> > > What I was worried about here is that, with RIE=0 the current dw-edma
> > > handling of struct dw_edma_chan::status field (not status register) would
> > > not run for that channel, which could affect subsequent tx submissions. But
> > > your suggestion also makes sense, thank you.
> > >
> > > --8<--
> > >
> > > So anyway the key point seems that we should avoid such hard-coded register
> > > handling in [RFC PATCH v2 20/27] and rely only on the standard dw-edma
> > > interfaces (possibly with some extensions to the dw-edma core). From your
> > > feedback, I feel this is the essential direction.
> > >
> > > From that perspective, I'm leaning toward (b) (which I wrote above in a
> > > reply comment) with a Kconfig guard, i.e. in dw_pcie_ep_init_registers(),
> > > if IS_ENABLED(CONFIG_DW_REMOTE_EDMA) we only configure the notification
> > > channel. In practice, a DT-based variant of (b) (for example a new property
> > > such as "dma-notification-channel = <N>;" and making
> > > dw_pcie_ep_init_registers() honour it) would be very handy for users, but I
> > > suspect putting this kind of policy into DT is not acceptable.
> > >
> > > Assuming careful handling, (c) might actually be the simplest approach. I
> > > may need to add a small hook for the notification channel in
> > > dw_edma_done_interrupt(), via a new API such as
> > > dw_edma_chan_register_notify().
> >
> > I reply everything here for overall design
> >
> > EDMA actually can access all memory at both EP and RC side regardless PCI
> > map windows. NTB defination is that only access part of both system memory,
> > so anyway need once memcpy. Although NTB can't take 100% eDMA advantage, it
> > is still easiest path now. I have a draft idea without touch NTB core code
> > (most likley).
> >
> > EP side                          RC side
> >              1:  Control bar
> >              2:  Doorbell bar
> >              3:  WM1
> >
> > MW1 is fixed sized array [ntb_payload_header + data]. Current NTB built
> > queue in system memory, transfer data (RW) to this array.
> >
> > Use EDMA only one side, RC/EP. use EP as example.
> >
> > In 1 (control bar, resever memory space, which call B)
> >
> > In ntb_hw_epf.c driver, create a simple 'fake' DMA memcpy driver, which
> > just implement device_prep_dma_memcpy(). That just put src\dest\size info
> > to memory space B, then push doorbell.
> >
> > in EP side's a workqueue, fetch info from B, the send to EDMA queue to
> > do actual transfer, after EP DMA finish, mark done at B, then raise msi irq,
> > 'fake' DMA memcpy driver will be triggered.
> >
> > Futher, 3 WM1 is not necessary existed at all, because both side don't
> > access it directly.
> >
> > For example:
> >
> > case RC TX, EP RX
> >
> > RC ntb_async_tx_submit() use device_prep_dma_memcpy() copy user space
> > memory (0xRC_1000 to PCI_1000, size 0x1000), put into share bar0 position
> >
> >             0xRC_1000 -> 0xPCI_1000 0x1000
> >
> > EP side, there RX request ntb_async_rx_submit(),  from 0xPCI_1000 to
> > 0xEP_8000 size 0x20000.
> >
> > so setup eDMA transfer form 0xRC_1000 -> 0xEP_8000 size 1000. After complete
> > mark both side done, then trigger related callback functions.
> >
> > You can see 0xPCI_1000 is not used at all. Actually 0xPCI_1000 is trouble
> > maker,  RC and EP system PCI space is not necesary the same as CPU space,
> > PCI controller may do address convert.
>
> Thanks for the detailed explanation.
>
> Just to clarify, regarding your comments about the number of memcpy
> operations and not using the 0xPCI_1000 window for data path, I think RFC
> v2 is already similar to what you're describing.
>
> To me it seems the key differences in your proposal are mainly two-fold:
> (1) the layering, and (2) local eDMA use rather than remote.

Not big difference between remote and local DMA. My major means just use
oneside is enough. If eDMA handle in remote, EP side need virtual memcpy
and RC side to handle actual transfer.

I use EP as example, just because some logic R/W is reverted between EP/RC.
RC's write is EP's read.

>
> For (1), instead of adding more eDMA-specific handling into ntb_transport
> layer, your approach would keep changes to ntb_transport minimal and
> encapsulate the eDMA usage inside the "fake DMA memcpy driver" as much as
> possible. In that design, would the MW1 layout change? Leaving the existing
> layout as-is would waste the space (so RFC v2 had introduced a new layout).

It is fine if NTB maintainer agree it.

>
> Also, one point I'm still unsure about is the opposite direction (ie.
> EP->RC). In that case, do you also expect the EP to trigger its local eDMA
> engine? If yes, then, similar to the RC->EP direction in RFC v2, the EP
> would need to know the RC-side receive buffer address (e.g. 0xRC_1000) in
> advance.

'fake DMA memcpy driver' already put 0xRC_1000 to one shared memory place.

>
> You also mentioned that you already have some draft. Are you planning to
> post that as a patch series? If not, I can of course try to
> implement/prototype this approach based on your suggestion.

Sorry, I have not actually worked for ntb eDMA before. My work base on RDMA
framework. Idealy, RDMA can do user-space(EP) to user space (RC) data
transfer with zero copy.

But I think NTB is also a good path since RDMA is over complexed.

Frank

>
> Please let me know if the above understanding does not match what you had
> in mind.
>
> Thank you,
> Koichiro
>
>
> >
> > Frank
> > >
> > > Thank you for your time and review,
> > > Koichiro
> > >
> > > >
> > > > Frank
> > > > >
> > > > > >
> > > > > > Frank
> > > > > > >
> > > > > > > Thanks for reviewing,
> > > > > > > Koichiro
> > > > > > >
> > > > > > > >
> > > > > > > > Frank
> > > > > > > >
> > > > > > > > > +	if (ret) {
> > > > > > > > > +		dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > > > > > > > > +		return ret;
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	return 0;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > ...
> > > > > >
> > > > > > > > > +{
> > > > > > > > > +	spin_lock_init(&qp->ep_tx_lock);
> > > > > > > > > +	spin_lock_init(&qp->ep_rx_lock);
> > > > > > > > > +	spin_lock_init(&qp->rc_lock);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > > > > > > > > +	.setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > > > > > > > > +	.tx_free_entry = ntb_transport_edma_tx_free_entry,
> > > > > > > > > +	.tx_enqueue = ntb_transport_edma_tx_enqueue,
> > > > > > > > > +	.rx_enqueue = ntb_transport_edma_rx_enqueue,
> > > > > > > > > +	.rx_poll = ntb_transport_edma_rx_poll,
> > > > > > > > > +	.debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > > > > > > > > +};
> > > > > > > > > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > > > > > > > > +
> > > > > > > > >  /**
> > > > > > > > >   * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > > > > > > > >   * @qp: NTB transport layer queue to be enabled
> > > > > > > > > --
> > > > > > > > > 2.48.1
> > > > > > > > >