Message-ID: <20250120062112.GL89233@linux.alibaba.com>
Date: Mon, 20 Jan 2025 14:21:12 +0800
From: Dust Li <dust.li@...ux.alibaba.com>
To: Andrew Lunn <andrew@...n.ch>, Niklas Schnelle <schnelle@...ux.ibm.com>
Cc: Alexandra Winter <wintera@...ux.ibm.com>,
Julian Ruess <julianr@...ux.ibm.com>,
Wenjia Zhang <wenjia@...ux.ibm.com>,
Jan Karcher <jaka@...ux.ibm.com>, Gerd Bayer <gbayer@...ux.ibm.com>,
Halil Pasic <pasic@...ux.ibm.com>,
"D. Wythe" <alibuda@...ux.alibaba.com>,
Tony Lu <tonylu@...ux.alibaba.com>,
Wen Gu <guwen@...ux.alibaba.com>,
Peter Oberparleiter <oberpar@...ux.ibm.com>,
David Miller <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Eric Dumazet <edumazet@...gle.com>,
Andrew Lunn <andrew+netdev@...n.ch>,
Thorsten Winkler <twinkler@...ux.ibm.com>, netdev@...r.kernel.org,
linux-s390@...r.kernel.org, Heiko Carstens <hca@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Sven Schnelle <svens@...ux.ibm.com>,
Simon Horman <horms@...nel.org>
Subject: Re: [RFC net-next 0/7] Provide an ism layer

On 2025-01-17 21:29:09, Andrew Lunn wrote:
>On Fri, Jan 17, 2025 at 05:57:10PM +0100, Niklas Schnelle wrote:
>> On Fri, 2025-01-17 at 17:33 +0100, Andrew Lunn wrote:
>> > > Conceptually kind of but the existing s390 specific ISM device is a bit
>> > > special. But let me start with some background. On s390 aka Mainframes
>> > > OSs including Linux run in so-called logical partitions (LPARs) which
>> > > are machine hypervisor VMs which use partitioned non-paging memory. The
>> > > fact that memory is partitioned is important because this means LPARs
>> > > can not share physical memory by mapping it.
>> > >
>> > > Now at a high level an ISM device allows communication between two such
>> > > Linux LPARs on the same machine. The device is discovered as a PCI
>> > > device and allows Linux to take a buffer called a DMB, map that in the
>> > > IOMMU and generate a token specific to another LPAR which also sees an
>> > > ISM device sharing the same virtual channel identifier (VCHID). This
>> > > token can then be transferred out of band (e.g. as part of an extended
>> > > TCP handshake in SMC-D) to that other system. With the token the other
>> > > system can use its ISM device to securely (authenticated by the token,
>> > > LPAR identity and the IOMMU mapping) write into the original system's
>> > > DMB at throughput and latency similar to doing a memcpy() via a
>> > > syscall.
>> > >
>> > > On the implementation level the ISM device is actually a piece of
>> > > firmware and the write to a remote DMB is a special case of our PCI
>> > > Store Block instruction (no real MMIO on s390, instead there are
>> > > special instructions). Sadly there are a few more quirks but in
>> > > principle you can think of it as redirecting writes to a part of the
>> > > ISM PCI device's BAR to the DMB in the peer system if that makes sense.
>> > > There's of course also a mechanism to cause an interrupt on the
>> > > receiver as the write completes.
>> >
>> > So the s390 details are interesting, but as you say, it is
>> > special. Ideally, all the special should be hidden away inside the
>> > driver.
>>
>> Yes and it will be. There are some exceptions e.g. for vfio-pci pass-
>> through but that's not unusual and why there is already the concept of
>> vfio-pci extension module.
>>
>> >
>> > So please take a step back. What is the abstract model?
>>
>> I think my high level description may be a good start. The abstract
>> model is the ability to share a memory buffer (DMB) for writing by a
>> communication partner, authenticated by a DMB Token. Plus stuff like
>> triggering an interrupt on write or explicit trigger. Then Alibaba
>> added optional support for what they called attaching the buffer which
>> means it becomes truly shared between the peers but which IBM's ISM
>> can't support. Plus a few more optional pieces such as VLANs and PNETIDs,
>> don't ask. The idea for the new layer then is to define this interface
>> with operations and documentation.
>>
>> >
>> > Can the abstract model be mapped onto CXL? Could it be used with a GPU
>> > vRAM? SoC with real shared memory between a pool of CPUs.
>> >
>> > Andrew
>>
>> I'd think that yes, one could implement such a mechanism on top of CXL
>> as well as on SoC. Or even with no special hardware between a host and
>> a DPU (e.g. via PCIe endpoint framework). Basically anything that can
>> DMA and IRQs between two OS instances.
>
>Is DMA part of the abstract model? That would suggest a true shared
>memory system is excluded, since that would not require DMA.
>
>Maybe take a look at subsystems like USB, I2C.
>
>usb_submit_urb(struct urb *urb, gfp_t mem_flags)
>
>An URB is a data structure with a block of memory associated with it, and
>contains the details to pass to the USB device.
>
>i2c_transfer(struct i2c_adapter *adap, struct i2c_msg *msgs, int num)
>
>*msgs points to num of messages which get transferred to/from the I2C
>device.
>
>Could the high level API look like this? No DMA, no IRQ, no concept of
>a somewhat shared memory. Just an API which asks for a message to be
>sent to the other end? struct urb has some USB concepts in it, struct
>i2c_msg has some I2C concepts in it. A struct ism_msg would follow the
>same pattern, but does it need to care about the DMA, the IRQ, the
>memory which is semi shared?

I don’t have a clear picture of what the API should look like yet, but I
believe it’s possible to avoid DMA and IRQ. In fact, the current data
transfer API, ops->move_data() in include/linux/ism.h, already abstracts
away the DMA and IRQ details.
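
As a rough userspace illustration of that kind of abstraction (all names
below are hypothetical and this is not the actual layout or signature in
include/linux/ism.h), a caller could go through an ops table and never see
how the bytes reach the peer:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names; NOT the definitions from include/linux/ism.h. */
struct demo_dev;

struct demo_ops {
	/* Write 'size' bytes from 'data' into the peer buffer at 'offset'. */
	int (*move_data)(struct demo_dev *dev, size_t offset,
			 const void *data, size_t size);
};

struct demo_dev {
	const struct demo_ops *ops;
	unsigned char *peer_buf;	/* stands in for the remote DMB */
	size_t peer_len;
};

/*
 * One possible backend: a plain memcpy(). Another backend could use DMA
 * or a special instruction instead; the caller would not notice.
 */
static int demo_memcpy_move(struct demo_dev *dev, size_t offset,
			    const void *data, size_t size)
{
	if (offset > dev->peer_len || size > dev->peer_len - offset)
		return -1;	/* write would run past the peer buffer */
	memcpy(dev->peer_buf + offset, data, size);
	return 0;
}

static const struct demo_ops demo_memcpy_ops = {
	.move_data = demo_memcpy_move,
};

/* Caller-facing helper: only the ops table is visible here. */
static int demo_send(struct demo_dev *dev, size_t offset,
		     const void *data, size_t size)
{
	return dev->ops->move_data(dev, offset, data, size);
}
```

Swapping demo_memcpy_ops for a different table would change the transport
without touching demo_send() or its callers.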

One thing we cannot hide, however, is whether the operation is zero-copy
or copy. The distinction matters because it determines when the data can
be reused, and that point differs between copy mode and zero-copy mode.
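
To make the reuse point concrete, here is a small userspace sketch. It
assumes a made-up channel type (demo_chan and its helpers are not kernel
APIs) and assumes that "zero-copy" means the channel keeps a reference to
the sender's buffer until the peer has consumed it:

```c
#include <stddef.h>
#include <string.h>

enum demo_mode { DEMO_COPY, DEMO_ZERO_COPY };

/* Hypothetical channel; not a kernel structure. */
struct demo_chan {
	enum demo_mode mode;
	unsigned char staging[64];	/* copy mode: private copy of the data */
	const unsigned char *borrowed;	/* zero-copy mode: sender's buffer */
	size_t len;
};

/*
 * Returns 0 if the caller may reuse 'data' as soon as this returns,
 * 1 if 'data' must stay untouched until demo_chan_consume() has run,
 * and -1 if the message does not fit.
 */
static int demo_chan_send(struct demo_chan *ch, const unsigned char *data,
			  size_t len)
{
	if (len > sizeof(ch->staging))
		return -1;
	ch->len = len;
	if (ch->mode == DEMO_COPY) {
		memcpy(ch->staging, data, len);
		ch->borrowed = NULL;
		return 0;
	}
	ch->borrowed = data;	/* no copy: the channel now refers to 'data' */
	return 1;
}

/*
 * The peer reads the pending message into 'out' (at least 64 bytes).
 * Only after this may a zero-copy sender reuse its buffer.
 */
static size_t demo_chan_consume(struct demo_chan *ch, unsigned char *out)
{
	const unsigned char *src = ch->borrowed ? ch->borrowed : ch->staging;
	size_t n = ch->len;

	memcpy(out, src, n);
	ch->borrowed = NULL;
	ch->len = 0;
	return n;
}
```

If the sender overwrites its buffer between send and consume, the copy-mode
peer still reads the original bytes while the zero-copy peer reads the
overwritten ones, which is why the caller has to know which mode is in use.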

Best regards,
Dust