Message-ID: <20251217151609.3162665-1-den@valinux.co.jp>
Date: Thu, 18 Dec 2025 00:15:34 +0900
From: Koichiro Den <den@...inux.co.jp>
To: Frank.Li@....com,
	dave.jiang@...el.com,
	ntb@...ts.linux.dev,
	linux-pci@...r.kernel.org,
	dmaengine@...r.kernel.org,
	linux-renesas-soc@...r.kernel.org,
	netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org
Cc: mani@...nel.org,
	kwilczynski@...nel.org,
	kishon@...nel.org,
	bhelgaas@...gle.com,
	corbet@....net,
	geert+renesas@...der.be,
	magnus.damm@...il.com,
	robh@...nel.org,
	krzk+dt@...nel.org,
	conor+dt@...nel.org,
	vkoul@...nel.org,
	joro@...tes.org,
	will@...nel.org,
	robin.murphy@....com,
	jdmason@...zu.us,
	allenbh@...il.com,
	andrew+netdev@...n.ch,
	davem@...emloft.net,
	edumazet@...gle.com,
	kuba@...nel.org,
	pabeni@...hat.com,
	Basavaraj.Natikar@....com,
	Shyam-sundar.S-k@....com,
	kurt.schwemmer@...rosemi.com,
	logang@...tatee.com,
	jingoohan1@...il.com,
	lpieralisi@...nel.org,
	utkarsh02t@...il.com,
	jbrunet@...libre.com,
	dlemoal@...nel.org,
	arnd@...db.de,
	elfring@...rs.sourceforge.net,
	den@...inux.co.jp
Subject: [RFC PATCH v3 00/35] NTB transport backed by endpoint DW eDMA

Hi,

This is RFC v3 of the NTB/PCI series that introduces NTB transport backed
by DesignWare PCIe integrated eDMA.

  RFC v2: https://lore.kernel.org/all/20251129160405.2568284-1-den@valinux.co.jp/
  RFC v1: https://lore.kernel.org/all/20251023071916.901355-1-den@valinux.co.jp/

The goal is to improve performance between a host and an endpoint over
ntb_transport (typically with ntb_netdev on top). On R-Car S4, preliminary
iperf3 results show a 10-20x throughput improvement. Latency improvements
are also observed.
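
As a rough cross-check of that ratio, the sketch below divides the [SUM]
receiver bitrates quoted in the "Performance measurement" section of this
letter. The variable and function names are illustrative only. The inputs
are single preliminary runs taken with different -P settings and with
significant datagram loss, so the ratios are indicative at best.

```python
# Receiver-side [SUM] bitrates in Mbit/s, copied from the iperf3 output in
# the "Performance measurement" section of this letter.
before = {"RC->EP": 575, "EP->RC": 558}        # before this change, -P 2
after = {"RC->EP": 10_100, "EP->RC": 12_200}   # use_remote_edma=1, -P 4


def speedup(direction):
    """Ratio of after/before receiver bitrate for one direction."""
    return after[direction] / before[direction]


for direction in before:
    print(f"{direction}: {speedup(direction):.1f}x")
```

With these inputs the ratios come out at roughly 18x (RC->EP) and 22x
(EP->RC).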

In this approach, the payload is transferred by DMA directly between host
and endpoint address spaces, and the NTB Memory Window is primarily used as
a control/metadata window (and to expose the eDMA register/LL regions).
Compared to the memcpy-based transport, this avoids extra copies, enables
deeper rings, and scales out to multiple queue pairs.

Compared to RFC v2, the data plane now works in a symmetric manner in both
directions (host-to-endpoint and endpoint-to-host). The host side drives
remote read channels for its TX transfers, while the endpoint drives local
write channels.

Again, I recognize that this is quite a large series. Sorry for the volume,
but for the RFC stage I believe presenting the full picture in a single set
helps with reviewing the overall architecture (detailed feedback is of
course appreciated as well). Once the direction is agreed on, I will respin
it split by subsystem and topic.

Many thanks for all the reviews and feedback from multiple perspectives.


Data flow overview
==================

    Figure 1. RC->EP traffic via ntb_netdev+ntb_transport
                     backed by Remote eDMA

          EP                                   RC
       phys addr                            phys addr
         space                                space
          +-+                                  +-+
          | |                                  | |
          | |                ||                | |
          +-+-----.          ||                | |
 EDMA REG | |      \    [A]  ||                | |
          +-+----.  '---+-+  ||                | |
          | |     \     | |<---------[0-a]----------
          +-+-----------| |<----------[2]----------.
  EDMA LL | |           | |  ||                | | :
          | |           | |  ||                | | :
          +-+-----------+-+  ||  [B]           | | :
          | |                ||  ++            | | :
       ---------[0-b]----------->||----------------'
          | |            ++  ||  ||            | |
          | |            ||  ||  ++            | |
          | |            ||<----------[4]-----------
          | |            ++  ||                | |
          | |           [C]  ||                | |
       .--|#|<------------------------[3]------|#|<-.
       :  |#|                ||                |#|  :
      [5] | |                ||                | | [1]
       :  | |                ||                | |  :
       '->|#|                                  |#|--'
          |#|                                  |#|
          | |                                  | |


    Figure 2. EP->RC traffic via ntb_netdev+ntb_transport
                     backed by EP-Local eDMA

          EP                                   RC
       phys addr                            phys addr
         space                                space
          +-+                                  +-+
          | |                                  | |
          | |                ||                | |
          +-+                ||                | |
 EDMA REG | |                ||                | |
          +-+                ||                | |
^         | |                ||                | |
:         +-+                ||                | |
: EDMA LL | |                ||                | |
:         | |                ||                | |
:         +-+                ||  [C]           | |
:         | |                ||  ++            | |
:      -----------[4]----------->||            | |
:         | |            ++  ||  ||            | |
:         | |            ||  ||  ++            | |
'----------------[2]-----||<--------[0-b]-----------
          | |            ++  ||                | |
          | |           [B]  ||                | |
       .->|#|--------[3]---------------------->|#|--.
       :  |#|                ||                |#|  :
      [1] | |                ||                | | [5]
       :  | |                ||                | |  :
       '--|#|                                  |#|<-'
          |#|                                  |#|
          | |                                  | |


      0-a. configure Remote eDMA
      0-b. DMA-map and produce DAR
      1.   memcpy while building skb in ntb_netdev case
      2.   consume DAR, DMA-map SAR and kick DMA read transfer
      3.   DMA transfer
      4.   consume (commit)
      5.   memcpy to application side

      [A]: MemoryWindow that aggregates eDMA regs and LL.
           IB iATU translations (Address Match Mode).
      [B]: Control plane ring buffer (for "produce")
      [C]: Control plane ring buffer (for "consume")

  Note:
    - Figure 1 is unchanged from RFC v2.
    - Figure 2 differs from the one depicted in RFC v2 cover letter.
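
The produce/consume handshake in steps 0-b to 5 can be modelled in a few
lines. This is only a toy sketch under stated assumptions: the Link class,
BUF_SIZE and the fake DAR values are hypothetical, a plain byte copy stands
in for the eDMA transfer, step 1 (skb building) is omitted, and buffers are
not re-posted after use. It is not the code in this series.

```python
from collections import deque

BUF_SIZE = 2048  # illustrative buffer size, not taken from the series


class Link:
    """Toy model of one queue pair; all names here are hypothetical."""

    def __init__(self, n_bufs):
        self.produce_ring = deque()  # [B]: receiver -> sender, carries DARs
        self.consume_ring = deque()  # [C]: sender -> receiver, completions
        self.rx_mem = {}             # receiver memory, keyed by fake DAR
        for i in range(n_bufs):
            dar = 0x1000 + i * BUF_SIZE
            self.rx_mem[dar] = bytearray(BUF_SIZE)
            self.produce_ring.append(dar)              # step 0-b: produce DAR

    def send(self, payload):
        """Sender side, steps 2-4. Returns False if no DAR is available."""
        if not self.produce_ring or len(payload) > BUF_SIZE:
            return False
        dar = self.produce_ring.popleft()              # step 2: consume DAR
        self.rx_mem[dar][:len(payload)] = payload      # step 3: "DMA" copy
        self.consume_ring.append((dar, len(payload)))  # step 4: commit
        return True

    def receive(self):
        """Receiver side, step 5. Returns payloads in completion order."""
        out = []
        while self.consume_ring:
            dar, length = self.consume_ring.popleft()
            out.append(bytes(self.rx_mem[dar][:length]))  # step 5: copy out
        return out


link = Link(n_bufs=2)
print(link.send(b"hello"), link.send(b"world"), link.send(b"dropped"))
# True True False  (the third send finds no DAR left in ring [B])
print(link.receive())
# [b'hello', b'world']
```

In both figures the receiving side produces DARs and the sending side
consumes them and kicks the transfer; only the side that owns the eDMA
channel differs between Figure 1 and Figure 2.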


Changes since RFC v2
====================

RFCv2->RFCv3 changes:
  - Architecture
    - Have EP side use its local write channels, while leaving RC side to
      use remote read channels.
    - Improved abstraction and encapsulation of HW-specific details.
  - Added control/config region versioning for the vNTB/EPF control region
    so that mismatched RC/EP kernels fail early instead of silently using an
    incompatible layout.
  - Reworked BAR subrange / multi-region mapping support:
    - Dropped the v2 approach that added new inbound mapping ops in the EPC
      core.
    - Introduced `struct pci_epf_bar.submap` and extended DesignWare EP to
      support BAR subrange inbound mapping via Address Match Mode IB iATU.
    - pci-epf-vntb now provides a subrange mapping hint to the EPC driver
      when offsets are used.
  - Changed .get_pci_epc() to .get_private_data()
  - Dropped two commits from RFC v2 that should be submitted separately:
    (1) ntb_transport debugfs seq_file conversion
    (2) DWC EP outbound iATU MSI mapping/cache fix (will be re-posted separately)
  - Added documentation updates.
  - Addressed assorted review nits from the RFC v2 thread (naming/structure).
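
To illustrate the "fail early" intent of the control region versioning
item above, here is a minimal sketch. The header layout (a little-endian
u32 version followed by a u32 num_mws), the version value and all names are
made up for the example; the real layout is whatever the patches define.

```python
import struct

# Hypothetical header: little-endian u32 version, then u32 num_mws.
CTRL_HDR = struct.Struct("<II")
SUPPORTED_VERSION = 1  # made-up value for illustration


class CtrlVersionMismatch(Exception):
    """Raised when the peer exposes an unexpected control region version."""


def parse_ctrl_header(raw):
    """Return num_mws, refusing to trust any field on a version mismatch."""
    if len(raw) < CTRL_HDR.size:
        raise ValueError("control region too short")
    version, num_mws = CTRL_HDR.unpack_from(raw)
    if version != SUPPORTED_VERSION:
        raise CtrlVersionMismatch(
            f"peer uses v{version}, expected v{SUPPORTED_VERSION}")
    return num_mws


print(parse_ctrl_header(CTRL_HDR.pack(SUPPORTED_VERSION, 2)))
```

Checking the version before reading anything else is what turns a silent
layout mismatch into an explicit error at setup time.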

RFCv1->RFCv2 changes:
  - Architecture
    - Drop the generic interrupt backend + DW eDMA test-interrupt backend
      approach and instead adopt the remote eDMA-backed ntb_transport mode
      proposed by Frank Li. The BAR-sharing / mwN_offset / inbound
      mapping (Address Match Mode) infrastructure from RFC v1 is largely
      kept, with only minor refinements and code motion where necessary
      to fit the new transport-mode design.
  - For Patch 01
    - Rework the array_index_nospec() conversion to address review
      comments on "[RFC PATCH 01/25]".

RFCv2: https://lore.kernel.org/all/20251129160405.2568284-1-den@valinux.co.jp/
RFCv1: https://lore.kernel.org/all/20251023071916.901355-1-den@valinux.co.jp/


Patch layout
============

  Patch 01-25 : preparation for Patch 26
                - 01-07: support multiple MWs in a BAR
                - 08-25: other misc preparations
  Patch 26    : main and most important patch, adds eDMA-backed transport
  Patch 27-28 : multi-queue support; thanks to the remote eDMA,
                performance scales out across multiple queue pairs
  Patch 29-33 : handle several SoC-specific issues so that remote eDMA
                mode ntb_transport works on R-Car S4
  Patch 34-35 : kernel doc updates


Tested on
=========

* 2x Renesas R-Car S4 Spider (RC<->EP connected with OcuLink cable)
* Kernel base: next-20251216 + [1] + [2] + [3]

  [1]: https://lore.kernel.org/all/20251210071358.2267494-2-cassel@kernel.org/
       (this is a spin-out patch from
        https://lore.kernel.org/linux-pci/20251129160405.2568284-20-den@valinux.co.jp/)
  [2]: https://lore.kernel.org/all/20251208-dma_prep_config-v1-0-53490c5e1e2a@nxp.com/
       (which still appears to be under active discussion)
  [3]: https://lore.kernel.org/all/20251217081955.3137163-1-den@valinux.co.jp/
       (this is a spin-out patch from
        https://lore.kernel.org/all/20251129160405.2568284-14-den@valinux.co.jp/)


Performance measurement
=======================

No serious measurements yet, because:
  * For "before the change", even use_dma/use_msi does not work on the
    upstream kernel unless we apply some patches for R-Car S4. With an
    unmerged patch series I posted earlier (now superseded by this RFC),
    about 7 Gbps was observed in the RC->EP direction. A pure upstream
    kernel achieves only around 500 Mbps.
  * For "after the change", the measurements are immature because this
    RFC v3 patch series has not been performance-optimized yet.

Here are the rough measurements showing the achievable performance on
the R-Car S4:

- Before this change:

  * ping
    64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=12.3 ms
    64 bytes from 10.0.0.11: icmp_seq=2 ttl=64 time=6.58 ms
    64 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=1.26 ms
    64 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=7.43 ms
    64 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=1.39 ms
    64 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=7.38 ms
    64 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=1.42 ms
    64 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=7.41 ms

  * RC->EP (`sudo iperf3 -ub0 -l 65480 -P 2`)
    [ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
    [  5]   0.00-10.01  sec   344 MBytes   288 Mbits/sec  3.483 ms  51/5555 (0.92%)  receiver
    [  6]   0.00-10.01  sec   342 MBytes   287 Mbits/sec  3.814 ms  38/5517 (0.69%)  receiver
    [SUM]   0.00-10.01  sec   686 MBytes   575 Mbits/sec  3.648 ms  89/11072 (0.8%)  receiver

  * EP->RC (`sudo iperf3 -ub0 -l 65480 -P 2`)
    [  5]   0.00-10.03  sec   334 MBytes   279 Mbits/sec  3.164 ms  390/5731 (6.8%)  receiver
    [  6]   0.00-10.03  sec   334 MBytes   279 Mbits/sec  2.416 ms  396/5741 (6.9%)  receiver
    [SUM]   0.00-10.03  sec   667 MBytes   558 Mbits/sec  2.790 ms  786/11472 (6.9%)  receiver

    Note: with `-P 2`, the best total bitrate (receiver side) was achieved.

- After this change (use_remote_edma=1):

  * ping
    64 bytes from 10.0.0.11: icmp_seq=1 ttl=64 time=1.42 ms
    64 bytes from 10.0.0.11: icmp_seq=2 ttl=64 time=1.38 ms
    64 bytes from 10.0.0.11: icmp_seq=3 ttl=64 time=1.21 ms
    64 bytes from 10.0.0.11: icmp_seq=4 ttl=64 time=1.02 ms
    64 bytes from 10.0.0.11: icmp_seq=5 ttl=64 time=1.06 ms
    64 bytes from 10.0.0.11: icmp_seq=6 ttl=64 time=0.995 ms
    64 bytes from 10.0.0.11: icmp_seq=7 ttl=64 time=0.964 ms
    64 bytes from 10.0.0.11: icmp_seq=8 ttl=64 time=1.49 ms

  * RC->EP (`sudo iperf3 -ub0 -l 65480 -P 4`)
    [  5]   0.00-10.02  sec  3.00 GBytes  2.58 Gbits/sec  0.437 ms  33053/82329 (40%)  receiver
    [  6]   0.00-10.02  sec  3.00 GBytes  2.58 Gbits/sec  0.174 ms  46379/95655 (48%)  receiver
    [  9]   0.00-10.02  sec  2.88 GBytes  2.47 Gbits/sec  0.106 ms  47672/94924 (50%)  receiver
    [ 11]   0.00-10.02  sec  2.87 GBytes  2.46 Gbits/sec  0.364 ms  23694/70817 (33%)  receiver
    [SUM]   0.00-10.02  sec  11.8 GBytes  10.1 Gbits/sec  0.270 ms  150798/343725 (44%)  receiver

  * EP->RC (`sudo iperf3 -ub0 -l 65480 -P 4`)
    [  5]   0.00-10.01  sec  3.28 GBytes  2.82 Gbits/sec  0.380 ms  38578/92355 (42%)  receiver
    [  6]   0.00-10.01  sec  3.24 GBytes  2.78 Gbits/sec  0.430 ms  14268/67340 (21%)  receiver
    [  9]   0.00-10.01  sec  2.92 GBytes  2.51 Gbits/sec  0.074 ms  0/47890 (0%)  receiver
    [ 11]   0.00-10.01  sec  4.76 GBytes  4.09 Gbits/sec  0.037 ms  0/78073 (0%)  receiver
    [SUM]   0.00-10.01  sec  14.2 GBytes  12.2 Gbits/sec  0.230 ms  52846/285658 (18%)  receiver

  * configfs settings:
      # modprobe pci_epf_vntb
      # cd /sys/kernel/config/pci_ep/
      # mkdir functions/pci_epf_vntb/func1
      # echo 0x1912 >   functions/pci_epf_vntb/func1/vendorid
      # echo 0x0030 >   functions/pci_epf_vntb/func1/deviceid
      # echo 32 >       functions/pci_epf_vntb/func1/msi_interrupts
      # echo 16 >       functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_count
      # echo 128 >      functions/pci_epf_vntb/func1/pci_epf_vntb.0/spad_count
      # echo 2 >        functions/pci_epf_vntb/func1/pci_epf_vntb.0/num_mws
      # echo 0xe0000 >  functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1
      # echo 0x20000 >  functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2
      # echo 0xe0000 >  functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_offset
      # echo 0x1912 >   functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_vid
      # echo 0x0030 >   functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_pid
      # echo 0x10 >     functions/pci_epf_vntb/func1/pci_epf_vntb.0/vbus_number
      # echo 0 >        functions/pci_epf_vntb/func1/pci_epf_vntb.0/ctrl_bar
      # echo 4 >        functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_bar
      # echo 2 >        functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1_bar
      # echo 2 >        functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw2_bar
      # ln -s controllers/e65d0000.pcie-ep functions/pci_epf_vntb/func1/primary/
      # echo 1 > controllers/e65d0000.pcie-ep/start
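
In the settings above, mw1 and mw2 both live in BAR 2, with mw2 placed at
mw2_offset. The sketch below checks that such a layout is non-overlapping
and computes the span the BAR has to cover. The helper name is
hypothetical, and mw1's offset is not set above, so it is assumed to be 0
here. This is not the validation performed by pci-epf-vntb.

```python
# Sizes, offsets and BAR numbers copied from the configfs settings above;
# mw1's offset is an assumption (not set above, taken as 0).
mws = [
    {"name": "mw1", "bar": 2, "offset": 0x00000, "size": 0xe0000},
    {"name": "mw2", "bar": 2, "offset": 0xe0000, "size": 0x20000},
]


def bar_layout(mws, bar):
    """Return the span needed by a BAR, raising if two windows overlap."""
    regions = sorted((m for m in mws if m["bar"] == bar),
                     key=lambda m: m["offset"])
    end = 0
    for m in regions:
        if m["offset"] < end:
            raise ValueError(f"{m['name']} overlaps the previous window")
        end = m["offset"] + m["size"]
    return end


span = bar_layout(mws, bar=2)
print(hex(span))               # 0x100000
print(span & (span - 1) == 0)  # True: the span is a power of two
```

The two windows tile BAR 2 back to back and add up to 1 MiB, which fits
the PCI requirement that BAR sizes be powers of two.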



Thank you for reviewing,


Koichiro Den (35):
  PCI: endpoint: pci-epf-vntb: Use array_index_nospec() on mws_size[]
    access
  NTB: epf: Add mwN_offset support and config region versioning
  PCI: dwc: ep: Support BAR subrange inbound mapping via address match
    iATU
  NTB: Add offset parameter to MW translation APIs
  PCI: endpoint: pci-epf-vntb: Propagate MW offset from configfs when
    present
  NTB: ntb_transport: Support partial memory windows with offsets
  PCI: endpoint: pci-epf-vntb: Hint subrange mapping preference to EPC
    driver
  NTB: core: Add .get_private_data() to ntb_dev_ops
  NTB: epf: vntb: Implement .get_private_data() callback
  dmaengine: dw-edma: Fix MSI data values for multi-vector IMWr
    interrupts
  NTB: ntb_transport: Move TX memory window setup into setup_qp_mw()
  NTB: ntb_transport: Dynamically determine qp count
  NTB: ntb_transport: Introduce get_dma_dev() helper
  NTB: epf: Reserve a subset of MSI vectors for non-NTB users
  NTB: ntb_transport: Move internal types to ntb_transport_internal.h
  NTB: ntb_transport: Introduce ntb_transport_backend_ops
  dmaengine: dw-edma: Add helper func to retrieve register base and size
  dmaengine: dw-edma: Add per-channel interrupt routing mode
  dmaengine: dw-edma: Poll completion when local IRQ handling is
    disabled
  dmaengine: dw-edma: Add notify-only channels support
  dmaengine: dw-edma: Add a helper to retrieve LL (Linked List) region
  dmaengine: dw-edma: Serialize RMW on shared interrupt registers
  NTB: ntb_transport: Split core into ntb_transport_core.c
  NTB: ntb_transport: Add additional hooks for DW eDMA backend
  NTB: hw: Introduce DesignWare eDMA helper
  NTB: ntb_transport: Introduce DW eDMA backed transport mode
  NTB: epf: Provide db_vector_count/db_vector_mask callbacks
  ntb_netdev: Multi-queue support
  NTB: epf: Add per-SoC quirk to cap MRRS for DWC eDMA (128B for R-Car)
  iommu: ipmmu-vmsa: Add PCIe ch0 to devices_allowlist
  iommu: ipmmu-vmsa: Add support for reserved regions
  arm64: dts: renesas: Add Spider RC/EP DTs for NTB with remote DW PCIe
    eDMA
  NTB: epf: Add an additional memory window (MW2) barno mapping on
    Renesas R-Car
  Documentation: PCI: endpoint: pci-epf-vntb: Update and add mwN_offset
    usage
  Documentation: driver-api: ntb: Document remote eDMA transport backend

 Documentation/PCI/endpoint/pci-vntb-howto.rst |  16 +-
 Documentation/driver-api/ntb.rst              |  58 +
 arch/arm64/boot/dts/renesas/Makefile          |   2 +
 .../boot/dts/renesas/r8a779f0-spider-ep.dts   |  37 +
 .../boot/dts/renesas/r8a779f0-spider-rc.dts   |  52 +
 drivers/dma/dw-edma/dw-edma-core.c            | 233 ++++-
 drivers/dma/dw-edma/dw-edma-core.h            |  13 +-
 drivers/dma/dw-edma/dw-edma-v0-core.c         |  39 +-
 drivers/iommu/ipmmu-vmsa.c                    |   7 +-
 drivers/net/ntb_netdev.c                      | 341 ++++--
 drivers/ntb/Kconfig                           |  12 +
 drivers/ntb/Makefile                          |   4 +
 drivers/ntb/hw/amd/ntb_hw_amd.c               |   6 +-
 drivers/ntb/hw/edma/ntb_hw_edma.c             | 754 +++++++++++++
 drivers/ntb/hw/edma/ntb_hw_edma.h             |  76 ++
 drivers/ntb/hw/epf/ntb_hw_epf.c               | 187 +++-
 drivers/ntb/hw/idt/ntb_hw_idt.c               |   3 +-
 drivers/ntb/hw/intel/ntb_hw_gen1.c            |   6 +-
 drivers/ntb/hw/intel/ntb_hw_gen1.h            |   2 +-
 drivers/ntb/hw/intel/ntb_hw_gen3.c            |   3 +-
 drivers/ntb/hw/intel/ntb_hw_gen4.c            |   6 +-
 drivers/ntb/hw/mscc/ntb_hw_switchtec.c        |   6 +-
 drivers/ntb/msi.c                             |   6 +-
 .../{ntb_transport.c => ntb_transport_core.c} | 482 ++++-----
 drivers/ntb/ntb_transport_edma.c              | 987 ++++++++++++++++++
 drivers/ntb/ntb_transport_internal.h          | 220 ++++
 drivers/ntb/test/ntb_perf.c                   |   4 +-
 drivers/ntb/test/ntb_tool.c                   |   6 +-
 .../pci/controller/dwc/pcie-designware-ep.c   | 198 +++-
 drivers/pci/controller/dwc/pcie-designware.c  |  25 +
 drivers/pci/controller/dwc/pcie-designware.h  |   2 +
 drivers/pci/endpoint/functions/pci-epf-vntb.c | 246 ++++-
 drivers/pci/endpoint/pci-epc-core.c           |   2 +-
 include/linux/dma/edma.h                      | 106 ++
 include/linux/ntb.h                           |  38 +-
 include/linux/ntb_transport.h                 |   5 +
 include/linux/pci-epf.h                       |  27 +
 37 files changed, 3716 insertions(+), 501 deletions(-)
 create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-ep.dts
 create mode 100644 arch/arm64/boot/dts/renesas/r8a779f0-spider-rc.dts
 create mode 100644 drivers/ntb/hw/edma/ntb_hw_edma.c
 create mode 100644 drivers/ntb/hw/edma/ntb_hw_edma.h
 rename drivers/ntb/{ntb_transport.c => ntb_transport_core.c} (91%)
 create mode 100644 drivers/ntb/ntb_transport_edma.c
 create mode 100644 drivers/ntb/ntb_transport_internal.h

-- 
2.51.0

