linux-kernel - Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA backed transport mode

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <nbjr7ovjgvrvcr7sntrgcjyui5tukp6utd5bvqc3hsdopsl3vi@or4vjl3owblf>
Date: Fri, 5 Dec 2025 12:04:24 +0900
From: Koichiro Den <den@...inux.co.jp>
To: Frank Li <Frank.li@....com>
Cc: ntb@...ts.linux.dev, linux-pci@...r.kernel.org, 
	dmaengine@...r.kernel.org, linux-kernel@...r.kernel.org, mani@...nel.org, 
	kwilczynski@...nel.org, kishon@...nel.org, bhelgaas@...gle.com, corbet@....net, 
	vkoul@...nel.org, jdmason@...zu.us, dave.jiang@...el.com, allenbh@...il.com, 
	Basavaraj.Natikar@....com, Shyam-sundar.S-k@....com, kurt.schwemmer@...rosemi.com, 
	logang@...tatee.com, jingoohan1@...il.com, lpieralisi@...nel.org, robh@...nel.org, 
	jbrunet@...libre.com, fancer.lancer@...il.com, arnd@...db.de, pstanner@...hat.com, 
	elfring@...rs.sourceforge.net
Subject: Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA
 backed transport mode

On Thu, Dec 04, 2025 at 03:16:25PM -0500, Frank Li wrote:
> On Fri, Dec 05, 2025 at 12:42:03AM +0900, Koichiro Den wrote:
> > On Wed, Dec 03, 2025 at 11:14:43AM -0500, Frank Li wrote:
> > > On Wed, Dec 03, 2025 at 05:53:03PM +0900, Koichiro Den wrote:
> > > > On Tue, Dec 02, 2025 at 10:42:29AM -0500, Frank Li wrote:
> > > > > On Tue, Dec 02, 2025 at 03:43:10PM +0900, Koichiro Den wrote:
> > > > > > On Mon, Dec 01, 2025 at 04:41:05PM -0500, Frank Li wrote:
> > > > > > > On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> > > > > > > > Add a new transport backend that uses a remote DesignWare eDMA engine
> > > > > > > > located on the NTB endpoint to move data between host and endpoint.
> > > > > > > >
> > > > > ...
> > > > > > > > +#include "ntb_edma.h"
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > + * The interrupt register offsets below are taken from the DesignWare
> > > > > > > > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > > > > > > > + * backend currently only supports this layout.
> > > > > > > > + */
> > > > > > > > +#define DMA_WRITE_INT_STATUS_OFF   0x4c
> > > > > > > > +#define DMA_WRITE_INT_MASK_OFF     0x54
> > > > > > > > +#define DMA_WRITE_INT_CLEAR_OFF    0x58
> > > > > > > > +#define DMA_READ_INT_STATUS_OFF    0xa0
> > > > > > > > +#define DMA_READ_INT_MASK_OFF      0xa8
> > > > > > > > +#define DMA_READ_INT_CLEAR_OFF     0xac
> > > > > > >
> > > > > > > Not sure why need access EDMA register because EMDA driver already export
> > > > > > > as dmaengine driver.
> > > > > >
> > > > > > These are intended for EP use. In my current design I intentionally don't
> > > > > > use the standard dw-edma dmaengine driver on the EP side.
> > > > >
> > > > > why not?
> > > >
> > > > Conceptually I agree that using the standard dw-edma driver on both sides
> > > > would be attractive for future extensibility and maintainability. However,
> > > > there are a couple of concerns for me, some of which might be alleviated by
> > > > your suggestion below, and some which are more generic safety concerns that
> > > > I tried to outline in my replies to your other comments.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > +
> > > > > > > > +#define NTB_EDMA_NOTIFY_MAX_QP		64
> > > > > > > > +
> > > > > ...
> > > > > > > > +
> > > > > > > > +	virq = irq_create_fwspec_mapping(&fwspec);
> > > > > > > > +	of_node_put(parent);
> > > > > > > > +	return (virq > 0) ? virq : -EINVAL;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > > > > > > > +{
> > > > > > >
> > > > > > > Not sue why dw_edma_interrupt_write/read() does work for your case. Suppose
> > > > > > > just register callback for dmeengine.
> > > > > >
> > > > > > If we ran dw_edma_probe() on both the EP and RC sides and let the dmaengine
> > > > > > callbacks handle int_status/int_clear, I think we could hit races. One side
> > > > > > might clear a status bit before the other side has a chance to see it and
> > > > > > invoke its callback. Please correct me if I'm missing something here.
> > > > >
> > > > > You should use difference channel?
> > > >
> > > > Do you mean something like this:
> > > > - on EP side, dw_edma_probe() only set up a dedicated channel for notification,
> > > > - on RC side, do not set up that particular channel via dw_edma_channel_setup(),
> > > >   but do other remaining channels for DMA transfers.
> > >
> > > Yes, it may be simple overall. Of course this will waste a channel.
> >
> > So, on the EP side I see two possible approaches:
> >
> > (a) Hide "dma" [1] as in [RFC PATCH v2 26/27] and call dw_edma_probe() with
> >     hand-crafted settings (chip->ll_rd_cnt = 1, chip->ll_wr_cnt = 0).
> > (b) Or, teach this special-purpose policy (i.e. configuring only a single
> >     notification channel) to the SoC glue driver's dw_pcie_ep_init_registers(),
> >     for example via Kconfig. I don't think DT is a good place to describe
> >     such a policy.
> >
> > There is also another option, which do not necessarily run dw_edma_probe()
> > by ourselves:
> >
> > (c) Leave the default initialization by the SoC glue as-is, and override the
> >     per-channel role via some new dw-edma interface, with the guarantee
> >     that all channels except the notification channel remain unused on its
> >     side afterwards. In this model, the EP side builds the LL locations
> >     for data transfers and the RC configures all channels, but it sets up
> >     the notification channel in a special manner.
> >
> > [1] https://github.com/jonmason/ntb/blob/68113d260674/Documentation/devicetree/bindings/pci/snps%2Cdw-pcie-ep.yaml#L83
> >
> > >
> > > >
> > > > Also, is it generically safe to have dw_edma_probe() executed from both ends on
> > > > the same eDMA instance, as long as the channels are carefully partitioned
> > > > between them?
> > >
> > > Channel register MMIO space is sperated. Some channel register shared
> > > into one 32bit register.
> > >
> > > But the critical one, interrupt status is w1c. So only write BIT(channel)
> > > is safe.
> > >
> > > Need careful handle irq enable/disable.
> >
> > Yeah, I agree it is unavoidable in this model.
> >
> > >
> > > Or you can defer all actual DMA transfer to EP side, you can append
> > > MSI write at last item of link to notify RC side about DMA done. (actually
> > > RIE should do the same thing)
> > >
> > > >
> > > > >
> > > > > >
> > > > > > To avoid that, in my current implementation, the RC side handles the
> > > > > > status/int_clear registers in the usual way, and the EP side only tries to
> > > > > > suppress needless edma_int as much as possible.
> > > > > >
> > > > > > That said, I'm now wondering if it would be better to set LIE=0/RIE=1 for
> > > > > > the DMA transfer channels and LIE=1/RIE=0 for the notification channel.
> > > > > > That would require some changes on dw-edma core.
> > > > >
> > > > > If dw-edma work as remote DMA, which should enable RIE. like
> > > > > dw-edma-pcie.c, but not one actually use it recently.
> > > > >
> > > > > Use EDMA as doorbell should be new case and I think it is quite useful.
> > > > >
> > > > > > >
> > > > > > > > +	struct ntb_edma_interrupt *v = data;
> > > > > > > > +	u32 mask = BIT(EDMA_RD_CH_NUM);
> > > > > > > > +	u32 i, val;
> > > > > > > > +
> > > > > ...
> > > > > > > > +	ret = dw_edma_probe(chip);
> > > > > > >
> > > > > > > I think dw_edma_probe() should be in ntb_hw_epf.c, which provide DMA
> > > > > > > dma engine support.
> > > > > > >
> > > > > > > EP side, suppose default dwc controller driver already setup edma engine,
> > > > > > > so use correct filter function, you should get dma chan.
> > > > > >
> > > > > > I intentionally hid edma for EP side in .dts patch in [RFC PATCH v2 26/27]
> > > > > > so that RC side only manages eDMA remotely and avoids the potential race
> > > > > > condition I mentioned above.
> > > > >
> > > > > Improve eDMA core to suppport some dma channel work at local, some for
> > > > > remote.
> > > >
> > > > Right, Firstly I experimented a bit more with different LIE/RIE settings and
> > > > ended up with the following observations:
> > > >
> > > > * LIE=0/RIE=1 does not seem to work at the hardware level. When I tried this for
> > > >   DMA transfer channels, the RC side never received any interrupt. The databook
> > > >   (5.40a, 8.2.2 "Interrupts and Error Handling") has a hint that says
> > > >   "If you want a remote interrupt and not a local interrupt then: Set LIE and
> > > >   RIE [...]", so I think this behaviour is expected.
> > >
> > > Actually, you can append MSI write at last one of DMA descriptor link. So
> > > it will not depend on eDMA's IRQ at all.
> >
> > For RC->EP interrupts on R-Car S4 in EP mode, using ITS_TRANSLATER as the
> > IB iATU target did not appear to work in practice. Indeed that was the
> > motivation for the RFC v1 series [2]. I have not tried using ITS_TRANSLATER
> > as the eDMA read transfer DAR.
> >
> > But in any case, simply masking the local interrupt is sufficient here. I
> > mainly wanted to point out that my naive idea of LIE=0/RIE=1 is not
> > implementable with this hardware. This whole LIE/RIE topic is a bit
> > off-track, sorry for the noise.
> >
> > [2] For the record, RFC v2 is conceptually orthogonal and introduces a
> >     broader concept ie. remote eDMA model, but I reused many of the
> >     preparatory commits from v1, which is why this is RFC v2 rather than a
> >     separate series.
> >
> > >
> > > > * LIE=1/RIE=0 does work at the hardware level, but is problematic for my current
> > > >   design, where the RC issues the DMA transfer for the notification via
> > > >   ntb_edma_notify_peer(). With RIE=0, the RC never calls
> > > >   dw_edma_core_handle_int() for that channel, which means that internal state
> > > >   such as dw_edma_chan.status is never managed correctly.
> > >
> > > If you append on MSI write at DMA link, you needn't check status register,
> > > just check current LL pos to know which descrptor already done.
> > >
> > > Or you also enable LIE and disable related IRQ line(without register
> > > irq handler), so Local IRQ will be ignore by GIC, you can safe handle at
> > > RC side.
> >
> > What I was worried about here is that, with RIE=0 the current dw-edma
> > handling of struct dw_edma_chan::status field (not status register) would
> > not run for that channel, which could affect subsequent tx submissions. But
> > your suggestion also makes sense, thank you.
> >
> > --8<--
> >
> > So anyway the key point seems that we should avoid such hard-coded register
> > handling in [RFC PATCH v2 20/27] and rely only on the standard dw-edma
> > interfaces (possibly with some extensions to the dw-edma core). From your
> > feedback, I feel this is the essential direction.
> >
> > From that perspective, I'm leaning toward (b) (which I wrote above in a
> > reply comment) with a Kconfig guard, i.e. in dw_pcie_ep_init_registers(),
> > if IS_ENABLED(CONFIG_DW_REMOTE_EDMA) we only configure the notification
> > channel. In practice, a DT-based variant of (b) (for example a new property
> > such as "dma-notification-channel = <N>;" and making
> > dw_pcie_ep_init_registers() honour it) would be very handy for users, but I
> > suspect putting this kind of policy into DT is not acceptable.
> >
> > Assuming careful handling, (c) might actually be the simplest approach. I
> > may need to add a small hook for the notification channel in
> > dw_edma_done_interrupt(), via a new API such as
> > dw_edma_chan_register_notify().
> 
> I reply everything here for overall design
> 
> EDMA actually can access all memory at both EP and RC side regardless PCI
> map windows. NTB defination is that only access part of both system memory,
> so anyway need once memcpy. Although NTB can't take 100% eDMA advantage, it
> is still easiest path now. I have a draft idea without touch NTB core code
> (most likley).
> 
> EP side                          RC side
>              1:  Control bar
>              2:  Doorbell bar
>              3:  WM1
> 
> MW1 is fixed sized array [ntb_payload_header + data]. Current NTB built
> queue in system memory, transfer data (RW) to this array.
> 
> Use EDMA only one side, RC/EP. use EP as example.
> 
> In 1 (control bar, resever memory space, which call B)
> 
> In ntb_hw_epf.c driver, create a simple 'fake' DMA memcpy driver, which
> just implement device_prep_dma_memcpy(). That just put src\dest\size info
> to memory space B, then push doorbell.
> 
> in EP side's a workqueue, fetch info from B, the send to EDMA queue to
> do actual transfer, after EP DMA finish, mark done at B, then raise msi irq,
> 'fake' DMA memcpy driver will be triggered.
> 
> Futher, 3 WM1 is not necessary existed at all, because both side don't
> access it directly.
> 
> For example:
> 
> case RC TX, EP RX
> 
> RC ntb_async_tx_submit() use device_prep_dma_memcpy() copy user space
> memory (0xRC_1000 to PCI_1000, size 0x1000), put into share bar0 position
> 
>             0xRC_1000 -> 0xPCI_1000 0x1000
> 
> EP side, there RX request ntb_async_rx_submit(),  from 0xPCI_1000 to
> 0xEP_8000 size 0x20000.
> 
> so setup eDMA transfer form 0xRC_1000 -> 0xEP_8000 size 1000. After complete
> mark both side done, then trigger related callback functions.
> 
> You can see 0xPCI_1000 is not used at all. Actually 0xPCI_1000 is trouble
> maker,  RC and EP system PCI space is not necesary the same as CPU space,
> PCI controller may do address convert.

Thanks for the detailed explanation.

Just to clarify, regarding your comments about the number of memcpy
operations and not using the 0xPCI_1000 window for data path, I think RFC
v2 is already similar to what you're describing.

To me it seems the key differences in your proposal are mainly two-fold:
(1) the layering, and (2) local eDMA use rather than remote.

For (1), instead of adding more eDMA-specific handling into ntb_transport
layer, your approach would keep changes to ntb_transport minimal and
encapsulate the eDMA usage inside the "fake DMA memcpy driver" as much as
possible. In that design, would the MW1 layout change? Leaving the existing
layout as-is would waste the space (so RFC v2 had introduced a new layout).

Also, one point I'm still unsure about is the opposite direction (ie.
EP->RC). In that case, do you also expect the EP to trigger its local eDMA
engine? If yes, then, similar to the RC->EP direction in RFC v2, the EP
would need to know the RC-side receive buffer address (e.g. 0xRC_1000) in
advance.

You also mentioned that you already have some draft. Are you planning to
post that as a patch series? If not, I can of course try to
implement/prototype this approach based on your suggestion.

Please let me know if the above understanding does not match what you had
in mind.

Thank you,
Koichiro


> 
> Frank
> >
> > Thank you for your time and review,
> > Koichiro
> >
> > >
> > > Frank
> > > >
> > > > >
> > > > > Frank
> > > > > >
> > > > > > Thanks for reviewing,
> > > > > > Koichiro
> > > > > >
> > > > > > >
> > > > > > > Frank
> > > > > > >
> > > > > > > > +	if (ret) {
> > > > > > > > +		dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > > > > > > > +		return ret;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > ...
> > > > >
> > > > > > > > +{
> > > > > > > > +	spin_lock_init(&qp->ep_tx_lock);
> > > > > > > > +	spin_lock_init(&qp->ep_rx_lock);
> > > > > > > > +	spin_lock_init(&qp->rc_lock);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > > > > > > > +	.setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > > > > > > > +	.tx_free_entry = ntb_transport_edma_tx_free_entry,
> > > > > > > > +	.tx_enqueue = ntb_transport_edma_tx_enqueue,
> > > > > > > > +	.rx_enqueue = ntb_transport_edma_rx_enqueue,
> > > > > > > > +	.rx_poll = ntb_transport_edma_rx_poll,
> > > > > > > > +	.debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > > > > > > > +};
> > > > > > > > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > > > > > > > +
> > > > > > > >  /**
> > > > > > > >   * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > > > > > > >   * @qp: NTB transport layer queue to be enabled
> > > > > > > > --
> > > > > > > > 2.48.1
> > > > > > > >