[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aTHsGerE5phzLrgk@lizhi-Precision-Tower-5810>
Date: Thu, 4 Dec 2025 15:16:25 -0500
From: Frank Li <Frank.li@....com>
To: Koichiro Den <den@...inux.co.jp>
Cc: ntb@...ts.linux.dev, linux-pci@...r.kernel.org,
dmaengine@...r.kernel.org, linux-kernel@...r.kernel.org,
mani@...nel.org, kwilczynski@...nel.org, kishon@...nel.org,
bhelgaas@...gle.com, corbet@....net, vkoul@...nel.org,
jdmason@...zu.us, dave.jiang@...el.com, allenbh@...il.com,
Basavaraj.Natikar@....com, Shyam-sundar.S-k@....com,
kurt.schwemmer@...rosemi.com, logang@...tatee.com,
jingoohan1@...il.com, lpieralisi@...nel.org, robh@...nel.org,
jbrunet@...libre.com, fancer.lancer@...il.com, arnd@...db.de,
pstanner@...hat.com, elfring@...rs.sourceforge.net
Subject: Re: [RFC PATCH v2 20/27] NTB: ntb_transport: Introduce remote eDMA
backed transport mode
On Fri, Dec 05, 2025 at 12:42:03AM +0900, Koichiro Den wrote:
> On Wed, Dec 03, 2025 at 11:14:43AM -0500, Frank Li wrote:
> > On Wed, Dec 03, 2025 at 05:53:03PM +0900, Koichiro Den wrote:
> > > On Tue, Dec 02, 2025 at 10:42:29AM -0500, Frank Li wrote:
> > > > On Tue, Dec 02, 2025 at 03:43:10PM +0900, Koichiro Den wrote:
> > > > > On Mon, Dec 01, 2025 at 04:41:05PM -0500, Frank Li wrote:
> > > > > > On Sun, Nov 30, 2025 at 01:03:58AM +0900, Koichiro Den wrote:
> > > > > > > Add a new transport backend that uses a remote DesignWare eDMA engine
> > > > > > > located on the NTB endpoint to move data between host and endpoint.
> > > > > > >
> > > > ...
> > > > > > > +#include "ntb_edma.h"
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * The interrupt register offsets below are taken from the DesignWare
> > > > > > > + * eDMA "unrolled" register map (EDMA_MF_EDMA_UNROLL). The remote eDMA
> > > > > > > + * backend currently only supports this layout.
> > > > > > > + */
> > > > > > > +#define DMA_WRITE_INT_STATUS_OFF 0x4c
> > > > > > > +#define DMA_WRITE_INT_MASK_OFF 0x54
> > > > > > > +#define DMA_WRITE_INT_CLEAR_OFF 0x58
> > > > > > > +#define DMA_READ_INT_STATUS_OFF 0xa0
> > > > > > > +#define DMA_READ_INT_MASK_OFF 0xa8
> > > > > > > +#define DMA_READ_INT_CLEAR_OFF 0xac
> > > > > >
> > > > > > Not sure why need access EDMA register because EMDA driver already export
> > > > > > as dmaengine driver.
> > > > >
> > > > > These are intended for EP use. In my current design I intentionally don't
> > > > > use the standard dw-edma dmaengine driver on the EP side.
> > > >
> > > > why not?
> > >
> > > Conceptually I agree that using the standard dw-edma driver on both sides
> > > would be attractive for future extensibility and maintainability. However,
> > > there are a couple of concerns for me, some of which might be alleviated by
> > > your suggestion below, and some which are more generic safety concerns that
> > > I tried to outline in my replies to your other comments.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > +
> > > > > > > +#define NTB_EDMA_NOTIFY_MAX_QP 64
> > > > > > > +
> > > > ...
> > > > > > > +
> > > > > > > + virq = irq_create_fwspec_mapping(&fwspec);
> > > > > > > + of_node_put(parent);
> > > > > > > + return (virq > 0) ? virq : -EINVAL;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static irqreturn_t ntb_edma_isr(int irq, void *data)
> > > > > > > +{
> > > > > >
> > > > > > Not sue why dw_edma_interrupt_write/read() does work for your case. Suppose
> > > > > > just register callback for dmeengine.
> > > > >
> > > > > If we ran dw_edma_probe() on both the EP and RC sides and let the dmaengine
> > > > > callbacks handle int_status/int_clear, I think we could hit races. One side
> > > > > might clear a status bit before the other side has a chance to see it and
> > > > > invoke its callback. Please correct me if I'm missing something here.
> > > >
> > > > You should use difference channel?
> > >
> > > Do you mean something like this:
> > > - on EP side, dw_edma_probe() only set up a dedicated channel for notification,
> > > - on RC side, do not set up that particular channel via dw_edma_channel_setup(),
> > > but do other remaining channels for DMA transfers.
> >
> > Yes, it may be simple overall. Of course this will waste a channel.
>
> So, on the EP side I see two possible approaches:
>
> (a) Hide "dma" [1] as in [RFC PATCH v2 26/27] and call dw_edma_probe() with
> hand-crafted settings (chip->ll_rd_cnt = 1, chip->ll_wr_cnt = 0).
> (b) Or, teach this special-purpose policy (i.e. configuring only a single
> notification channel) to the SoC glue driver's dw_pcie_ep_init_registers(),
> for example via Kconfig. I don't think DT is a good place to describe
> such a policy.
>
> There is also another option, which do not necessarily run dw_edma_probe()
> by ourselves:
>
> (c) Leave the default initialization by the SoC glue as-is, and override the
> per-channel role via some new dw-edma interface, with the guarantee
> that all channels except the notification channel remain unused on its
> side afterwards. In this model, the EP side builds the LL locations
> for data transfers and the RC configures all channels, but it sets up
> the notification channel in a special manner.
>
> [1] https://github.com/jonmason/ntb/blob/68113d260674/Documentation/devicetree/bindings/pci/snps%2Cdw-pcie-ep.yaml#L83
>
> >
> > >
> > > Also, is it generically safe to have dw_edma_probe() executed from both ends on
> > > the same eDMA instance, as long as the channels are carefully partitioned
> > > between them?
> >
> > Channel register MMIO space is sperated. Some channel register shared
> > into one 32bit register.
> >
> > But the critical one, interrupt status is w1c. So only write BIT(channel)
> > is safe.
> >
> > Need careful handle irq enable/disable.
>
> Yeah, I agree it is unavoidable in this model.
>
> >
> > Or you can defer all actual DMA transfer to EP side, you can append
> > MSI write at last item of link to notify RC side about DMA done. (actually
> > RIE should do the same thing)
> >
> > >
> > > >
> > > > >
> > > > > To avoid that, in my current implementation, the RC side handles the
> > > > > status/int_clear registers in the usual way, and the EP side only tries to
> > > > > suppress needless edma_int as much as possible.
> > > > >
> > > > > That said, I'm now wondering if it would be better to set LIE=0/RIE=1 for
> > > > > the DMA transfer channels and LIE=1/RIE=0 for the notification channel.
> > > > > That would require some changes on dw-edma core.
> > > >
> > > > If dw-edma work as remote DMA, which should enable RIE. like
> > > > dw-edma-pcie.c, but not one actually use it recently.
> > > >
> > > > Use EDMA as doorbell should be new case and I think it is quite useful.
> > > >
> > > > > >
> > > > > > > + struct ntb_edma_interrupt *v = data;
> > > > > > > + u32 mask = BIT(EDMA_RD_CH_NUM);
> > > > > > > + u32 i, val;
> > > > > > > +
> > > > ...
> > > > > > > + ret = dw_edma_probe(chip);
> > > > > >
> > > > > > I think dw_edma_probe() should be in ntb_hw_epf.c, which provide DMA
> > > > > > dma engine support.
> > > > > >
> > > > > > EP side, suppose default dwc controller driver already setup edma engine,
> > > > > > so use correct filter function, you should get dma chan.
> > > > >
> > > > > I intentionally hid edma for EP side in .dts patch in [RFC PATCH v2 26/27]
> > > > > so that RC side only manages eDMA remotely and avoids the potential race
> > > > > condition I mentioned above.
> > > >
> > > > Improve eDMA core to suppport some dma channel work at local, some for
> > > > remote.
> > >
> > > Right, Firstly I experimented a bit more with different LIE/RIE settings and
> > > ended up with the following observations:
> > >
> > > * LIE=0/RIE=1 does not seem to work at the hardware level. When I tried this for
> > > DMA transfer channels, the RC side never received any interrupt. The databook
> > > (5.40a, 8.2.2 "Interrupts and Error Handling") has a hint that says
> > > "If you want a remote interrupt and not a local interrupt then: Set LIE and
> > > RIE [...]", so I think this behaviour is expected.
> >
> > Actually, you can append MSI write at last one of DMA descriptor link. So
> > it will not depend on eDMA's IRQ at all.
>
> For RC->EP interrupts on R-Car S4 in EP mode, using ITS_TRANSLATER as the
> IB iATU target did not appear to work in practice. Indeed that was the
> motivation for the RFC v1 series [2]. I have not tried using ITS_TRANSLATER
> as the eDMA read transfer DAR.
>
> But in any case, simply masking the local interrupt is sufficient here. I
> mainly wanted to point out that my naive idea of LIE=0/RIE=1 is not
> implementable with this hardware. This whole LIE/RIE topic is a bit
> off-track, sorry for the noise.
>
> [2] For the record, RFC v2 is conceptually orthogonal and introduces a
> broader concept ie. remote eDMA model, but I reused many of the
> preparatory commits from v1, which is why this is RFC v2 rather than a
> separate series.
>
> >
> > > * LIE=1/RIE=0 does work at the hardware level, but is problematic for my current
> > > design, where the RC issues the DMA transfer for the notification via
> > > ntb_edma_notify_peer(). With RIE=0, the RC never calls
> > > dw_edma_core_handle_int() for that channel, which means that internal state
> > > such as dw_edma_chan.status is never managed correctly.
> >
> > If you append on MSI write at DMA link, you needn't check status register,
> > just check current LL pos to know which descrptor already done.
> >
> > Or you also enable LIE and disable related IRQ line(without register
> > irq handler), so Local IRQ will be ignore by GIC, you can safe handle at
> > RC side.
>
> What I was worried about here is that, with RIE=0 the current dw-edma
> handling of struct dw_edma_chan::status field (not status register) would
> not run for that channel, which could affect subsequent tx submissions. But
> your suggestion also makes sense, thank you.
>
> --8<--
>
> So anyway the key point seems that we should avoid such hard-coded register
> handling in [RFC PATCH v2 20/27] and rely only on the standard dw-edma
> interfaces (possibly with some extensions to the dw-edma core). From your
> feedback, I feel this is the essential direction.
>
> From that perspective, I'm leaning toward (b) (which I wrote above in a
> reply comment) with a Kconfig guard, i.e. in dw_pcie_ep_init_registers(),
> if IS_ENABLED(CONFIG_DW_REMOTE_EDMA) we only configure the notification
> channel. In practice, a DT-based variant of (b) (for example a new property
> such as "dma-notification-channel = <N>;" and making
> dw_pcie_ep_init_registers() honour it) would be very handy for users, but I
> suspect putting this kind of policy into DT is not acceptable.
>
> Assuming careful handling, (c) might actually be the simplest approach. I
> may need to add a small hook for the notification channel in
> dw_edma_done_interrupt(), via a new API such as
> dw_edma_chan_register_notify().
I reply everything here for overall design
EDMA actually can access all memory at both EP and RC side regardless PCI
map windows. NTB defination is that only access part of both system memory,
so anyway need once memcpy. Although NTB can't take 100% eDMA advantage, it
is still easiest path now. I have a draft idea without touch NTB core code
(most likley).
EP side RC side
1: Control bar
2: Doorbell bar
3: WM1
MW1 is fixed sized array [ntb_payload_header + data]. Current NTB built
queue in system memory, transfer data (RW) to this array.
Use EDMA only one side, RC/EP. use EP as example.
In 1 (control bar, resever memory space, which call B)
In ntb_hw_epf.c driver, create a simple 'fake' DMA memcpy driver, which
just implement device_prep_dma_memcpy(). That just put src\dest\size info
to memory space B, then push doorbell.
in EP side's a workqueue, fetch info from B, the send to EDMA queue to
do actual transfer, after EP DMA finish, mark done at B, then raise msi irq,
'fake' DMA memcpy driver will be triggered.
Futher, 3 WM1 is not necessary existed at all, because both side don't
access it directly.
For example:
case RC TX, EP RX
RC ntb_async_tx_submit() use device_prep_dma_memcpy() copy user space
memory (0xRC_1000 to PCI_1000, size 0x1000), put into share bar0 position
0xRC_1000 -> 0xPCI_1000 0x1000
EP side, there RX request ntb_async_rx_submit(), from 0xPCI_1000 to
0xEP_8000 size 0x20000.
so setup eDMA transfer form 0xRC_1000 -> 0xEP_8000 size 1000. After complete
mark both side done, then trigger related callback functions.
You can see 0xPCI_1000 is not used at all. Actually 0xPCI_1000 is trouble
maker, RC and EP system PCI space is not necesary the same as CPU space,
PCI controller may do address convert.
Frank
>
> Thank you for your time and review,
> Koichiro
>
> >
> > Frank
> > >
> > > >
> > > > Frank
> > > > >
> > > > > Thanks for reviewing,
> > > > > Koichiro
> > > > >
> > > > > >
> > > > > > Frank
> > > > > >
> > > > > > > + if (ret) {
> > > > > > > + dev_err(&ndev->dev, "dw_edma_probe failed: %d\n", ret);
> > > > > > > + return ret;
> > > > > > > + }
> > > > > > > +
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +
> > > > ...
> > > >
> > > > > > > +{
> > > > > > > + spin_lock_init(&qp->ep_tx_lock);
> > > > > > > + spin_lock_init(&qp->ep_rx_lock);
> > > > > > > + spin_lock_init(&qp->rc_lock);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static const struct ntb_transport_backend_ops edma_backend_ops = {
> > > > > > > + .setup_qp_mw = ntb_transport_edma_setup_qp_mw,
> > > > > > > + .tx_free_entry = ntb_transport_edma_tx_free_entry,
> > > > > > > + .tx_enqueue = ntb_transport_edma_tx_enqueue,
> > > > > > > + .rx_enqueue = ntb_transport_edma_rx_enqueue,
> > > > > > > + .rx_poll = ntb_transport_edma_rx_poll,
> > > > > > > + .debugfs_stats_show = ntb_transport_edma_debugfs_stats_show,
> > > > > > > +};
> > > > > > > +#endif /* CONFIG_NTB_TRANSPORT_EDMA */
> > > > > > > +
> > > > > > > /**
> > > > > > > * ntb_transport_link_up - Notify NTB transport of client readiness to use queue
> > > > > > > * @qp: NTB transport layer queue to be enabled
> > > > > > > --
> > > > > > > 2.48.1
> > > > > > >
Powered by blists - more mailing lists