Message-ID: <CAMvTesCbNYHjiwaZC4EJopErZhW+vM0d87zJ54RT_AKXb-2yjw@mail.gmail.com>
Date: Wed, 3 Dec 2025 22:21:18 +0800
From: Tianyu Lan <ltykernel@...il.com>
To: Michael Kelley <mhklinux@...look.com>, Christoph Hellwig <hch@...radead.org>,
Robin Murphy <robin.murphy@....com>
Cc: "kys@...rosoft.com" <kys@...rosoft.com>, "haiyangz@...rosoft.com" <haiyangz@...rosoft.com>,
"wei.liu@...nel.org" <wei.liu@...nel.org>, "decui@...rosoft.com" <decui@...rosoft.com>,
"longli@...rosoft.com" <longli@...rosoft.com>, "vdso@...bites.dev" <vdso@...bites.dev>,
Tianyu Lan <tiala@...rosoft.com>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH] Drivers: hv: Confidential VMBus exernal memory support
On Sat, Nov 29, 2025 at 1:47 AM Michael Kelley <mhklinux@...look.com> wrote:
>
> From: Tianyu Lan <ltykernel@...il.com> Sent: Monday, November 24, 2025 10:29 AM
> >
> > In CVM(Confidential VM), system memory is encrypted
> > by default. Device drivers typically use the swiotlb
> > bounce buffer for DMA memory, which is decrypted
> > and shared between the guest and host. Confidential
> > Vmbus, however, supports a confidential channel
>
Hi Michael,
Thanks for your review. I spent some time testing, and I still need the
paravisor (HCL) team to double-check.
> s/Vmbus/VMBus/ [and elsewhere in this commit msg]
>
Will update.
> > that employs encrypted memory for the Vmbus ring
> > buffer and external DMA memory. The support for
> > the confidential ring buffer has already been
> > integrated.
> >
> > In CVM, device drivers usually employ the standard
> > DMA API to map DMA memory with the bounce buffer,
> > which remains transparent to the device driver.
> > For external DMA memory support,
>
> The "external memory" terminology is not at all clear in the context
> of Linux. Presumably the terminology came from Hyper-V or the
> paravisor, but I never understood what the "external" refers to. Maybe
> it is memory that is "external" with respect to the hypervisor, and therefore
> not shared between the hypervisor and guest? But that meaning won't be
> evident to other reviewers in the Linux kernel community. In any case, some
> explanation of "external memory" is needed. And even consider changing the
> field names in the code to be something that makes more sense to Linux.
>
Agree. For VMBus devices, there are two major methods of communication
with the host:
* VMBus ring buffer
* Dynamic DMA transactions
The confidential ring buffer support has already been upstreamed by Roman
and merged into Linux kernel 6.18. Additionally, the confidential DMA
transaction support, also known as "external memory", allows the use of
private/encrypted memory as DMA memory.
> > Hyper-V specific
> > DMA operations are introduced, bypassing the bounce
> > buffer when the confidential external memory flag
> > is set.
>
> This commit message never explains why it is important to not do bounce
> buffering. There's the obvious but unstated reason of improving performance,
> but that's not the main reason. The main reason is a confidentiality leak.
> When available, Confidential VMBus would be used because it keeps the
> DMA data private (i.e., encrypted) and confidential to the guest. Bounce
> buffering copies the data to shared (i.e., decrypted) swiotlb memory, where
> it is exposed to the hypervisor. That's a confidentiality leak, and is the
> primary reason the bounce buffering must be eliminated. This requirement
> was pointed out by Robin Murphy in the discussion of Roman Kisel's
> original code to eliminate the bounce buffering.
>
For a CVM on Hyper-V, there are typically two layers: VTL0 and VTL2. VTL2
plays a role similar to the L1 hypervisor of a normal guest in VTL0.
Communication between these two layers can use private or encrypted
memory directly, as they both reside within the same CVM. This differs
from communication between the host and VTL0/VTL2, where shared or
decrypted memory (the bounce buffer) is required.
For Confidential VMBus devices in a CVM, the device module (emulated
device code) runs in VTL2 and interacts with the normal guest in VTL0. As
a result, the Confidential VMBus device's ring buffer and DMA
transactions may use private or encrypted memory.
> Separately, I have major qualms about using an approach with Hyper-V specific
> DMA operations. As implemented here, these DMA operations bypass all
> the kernel DMA layer logic when the VMBus synthetic device indicates
> "external memory". In that case, the supplied physical address is just used
> as the DMA address and everything else is bypassed. While this actually works
> for VMBus synthetic devices because of their simple requirements, it also
> is implicitly making some assumptions. These assumptions are true today
> (as far as I know) but won't necessarily be true in the future. The assumptions
> include:
>
> * There's no vIOMMU in the guest VM. IOMMUs in CoCo VMs have their
> own issues to work out, but an implementation that bypasses any IOMMU
> logic in the DMA layer is very short-term thinking.
>
> * There are no PCI pass-thru devices in the guest. Since the implementation
> changes the global dma_ops, the drivers for such PCI pass-thru devices would
> use the same global dma_ops, and would fail.
For PCI pass-thru devices, this may be reused with additional changes
(I am working with Dexuan on that support), and we may rework the current
callbacks to meet the future requirements.
>
> The code also has some other problems that I point out further down.
>
> In any case, an approach with Hyper-V specific DMA operations seems like
> a very temporary fix (hack?) at best. Further down, I'll propose an alternate
> approach that preserves all the existing DMA layer functionality.
>
> > These DMA operations might also be reused
> > for TDISP devices in the future, which also support
> > DMA operations with encrypted memory.
> >
> > The DMA operations used are global architecture
> > DMA operations (for details, see get_arch_dma_ops()
> > and get_dma_ops()), and there is no need to set up
> > for each device individually.
> >
> > Signed-off-by: Tianyu Lan <tiala@...rosoft.com>
> > ---
> > drivers/hv/vmbus_drv.c | 90 +++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 89 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> > index 0dc4692b411a..ca31231b2c32 100644
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -39,6 +39,9 @@
> > #include <clocksource/hyperv_timer.h>
> > #include <asm/mshyperv.h>
> > #include "hyperv_vmbus.h"
> > +#include "../../kernel/dma/direct.h"
> > +
> > +extern const struct dma_map_ops *dma_ops;
> >
> > struct vmbus_dynid {
> > struct list_head node;
> > @@ -1429,6 +1432,88 @@ static int vmbus_alloc_synic_and_connect(void)
> > return -ENOMEM;
> > }
> >
> > +
> > +static bool hyperv_private_memory_dma(struct device *dev)
> > +{
> > + struct hv_device *hv_dev = device_to_hv_device(dev);
>
> device_to_hv_device() works only when "dev" is for a VMBus device. As noted above,
> if a CoCo VM were ever to have a PCI pass-thru device doing DMA, "dev" would
> be some PCI device, and device_to_hv_device() would return garbage.
Yes, nice catch. PCI device support is not included in this patch; it is
still in progress, so I am posting this with Confidential VMBus device
support for review first.
>
> > +
> > + if (hv_dev && hv_dev->channel && hv_dev->channel->co_external_memory)
> > + return true;
> > + else
> > + return false;
> > +}
> > +
> > +static dma_addr_t hyperv_dma_map_page(struct device *dev, struct page *page,
> > + unsigned long offset, size_t size,
> > + enum dma_data_direction dir,
> > + unsigned long attrs)
> > +{
> > + phys_addr_t phys = page_to_phys(page) + offset;
> > +
> > + if (hyperv_private_memory_dma(dev))
> > + return __phys_to_dma(dev, phys);
> > + else
> > + return dma_direct_map_phys(dev, phys, size, dir, attrs);
>
> This code won't build when the VMBus driver is built as a module.
> dma_direct_map_phys() is in inline function that references several other
> DMA functions that aren't exported because they aren't intended to be
> used by drivers. Same problems occur with dma_direct_unmap_phys()
> and similar functions used below.
Yes, nice catch! We may add a function in a built-in file to register the
DMA ops, and call it from the VMBus driver.
>
> > +}
> > +
> > +static void hyperv_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
> > + size_t size, enum dma_data_direction dir, unsigned long attrs)
> > +{
> > + if (!hyperv_private_memory_dma(dev))
> > + dma_direct_unmap_phys(dev, dma_handle, size, dir, attrs);
> > +}
> > +
> > +static int hyperv_dma_map_sg(struct device *dev, struct scatterlist *sgl,
> > + int nelems, enum dma_data_direction dir,
> > + unsigned long attrs)
> > +{
> > + struct scatterlist *sg;
> > + dma_addr_t dma_addr;
> > + int i;
> > +
> > + if (hyperv_private_memory_dma(dev)) {
> > + for_each_sg(sgl, sg, nelems, i) {
> > + dma_addr = __phys_to_dma(dev, sg_phys(sg));
> > + sg_dma_address(sg) = dma_addr;
> > + sg_dma_len(sg) = sg->length;
> > + }
> > +
> > + return nelems;
> > + } else {
> > + return dma_direct_map_sg(dev, sgl, nelems, dir, attrs);
> > + }
> > +}
> > +
> > +static void hyperv_dma_unmap_sg(struct device *dev, struct scatterlist *sgl,
> > + int nelems, enum dma_data_direction dir, unsigned long attrs)
> > +{
> > + if (!hyperv_private_memory_dma(dev))
> > + dma_direct_unmap_sg(dev, sgl, nelems, dir, attrs);
> > +}
> > +
> > +static int hyperv_dma_supported(struct device *dev, u64 mask)
> > +{
> > + dev->coherent_dma_mask = mask;
> > + return 1;
> > +}
> > +
> > +static size_t hyperv_dma_max_mapping_size(struct device *dev)
> > +{
> > + if (hyperv_private_memory_dma(dev))
> > + return SIZE_MAX;
> > + else
> > + return swiotlb_max_mapping_size(dev);
> > +}
> > +
> > +const struct dma_map_ops hyperv_dma_ops = {
> > + .map_page = hyperv_dma_map_page,
> > + .unmap_page = hyperv_dma_unmap_page,
> > + .map_sg = hyperv_dma_map_sg,
> > + .unmap_sg = hyperv_dma_unmap_sg,
> > + .dma_supported = hyperv_dma_supported,
> > + .max_mapping_size = hyperv_dma_max_mapping_size,
> > +};
> > +
> > /*
> > * vmbus_bus_init -Main vmbus driver initialization routine.
> > *
> > @@ -1479,8 +1564,11 @@ static int vmbus_bus_init(void)
> > * doing that on each VP while initializing SynIC's wastes time.
> > */
> > is_confidential = ms_hyperv.confidential_vmbus_available;
> > - if (is_confidential)
> > + if (is_confidential) {
> > + dma_ops = &hyperv_dma_ops;
>
> arm64 doesn't have the global variable dma_ops, so this patch
> won't build on arm64, even if Confidential VMBus isn't yet available
> for arm64.
Yes, there should be a stub function for both arm64 and x86 to implement it.
>
> > pr_info("Establishing connection to the confidential VMBus\n");
> > + }
> > +
> > hv_para_set_sint_proxy(!is_confidential);
> > ret = vmbus_alloc_synic_and_connect();
> > if (ret)
> > --
> > 2.50.1
> >
>
> Here's my idea for an alternate approach. The goal is to allow use of the
> swiotlb to be disabled on a per-device basis. A device is initialized for swiotlb
> usage by swiotlb_dev_init(), which sets dev->dma_io_tlb_mem to point to the
> default swiotlb memory. For VMBus devices, the calling sequence is
> vmbus_device_register() -> device_register() -> device_initialize() ->
> swiotlb_dev_init(). But if vmbus_device_register() could override the
> dev->dma_io_tlb_mem value and put it back to NULL, swiotlb operations
> would be disabled on the device. Furthermore, is_swiotlb_force_bounce()
> would return "false", and the normal DMA functions would not force the
> use of bounce buffers. The entire code change looks like this:
>
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2133,11 +2133,15 @@ int vmbus_device_register(struct hv_device *child_device_obj)
> child_device_obj->device.dma_mask = &child_device_obj->dma_mask;
> dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
>
> + device_initialize(&child_device_obj->device);
> + if (child_device_obj->channel->co_external_memory)
> + child_device_obj->device.dma_io_tlb_mem = NULL;
> +
> /*
> * Register with the LDM. This will kick off the driver/device
> * binding...which will eventually call vmbus_match() and vmbus_probe()
> */
> - ret = device_register(&child_device_obj->device);
> + ret = device_add(&child_device_obj->device);
> if (ret) {
> pr_err("Unable to register child device\n");
> put_device(&child_device_obj->device);
>
> I've only compile tested the above since I don't have an environment where
> I can test Confidential VMBus. You would need to verify whether my thinking
> is correct and this produces the intended result.
Thanks Michael. I tested it and it seems to hit an issue. I will
double-check with the HCL/paravisor team.
We considered such a change before. From Roman's previous patches, it
seems we also need to change phys_to_dma() and force_dma_unencrypted().
>
> Directly setting dma_io_tlb_mem to NULL isn't great. It would be better
> to add an exported function swiotlb_dev_disable() to swiotlb code that sets
> dma_io_tlb_mem to NULL, but you get the idea.
>
> Other reviewers may still see this approach as a bit of a hack, but it's a lot
> less of a hack than introducing Hyper-V specific DMA functions.
> swiotlb_dev_disable() is conceptually needed for TDISP devices, as TDISP
> devices must similarly protect confidentiality by not allowing use of the swiotlb.
> So adding swiotlb_dev_disable() is a step in the right direction, even if the
> eventual TDISP code does it slightly differently. Doing the disable on a
> per-device basis is also the right thing in the long run.
>
Agree. Including Christoph and Robin here.
Now we have three options to disable the swiotlb bounce buffer on a
per-device basis:
1) Add platform DMA ops and check the platform condition (e.g.,
Confidential VMBus or TDISP device) in the platform callback before
calling the standard DMA API, which may use a bounce buffer.
2) Directly set dma_io_tlb_mem to NULL in the device/bus-level driver.
This seems insufficient for a Confidential VM, which may use a high
address bit to identify memory as encrypted or decrypted (for details,
see Roman's previous patches):
https://lore.kernel.org/linux-hyperv/20250409000835.285105-6-romank@linux.microsoft.com/
https://lore.kernel.org/linux-hyperv/20250409000835.285105-7-romank@linux.microsoft.com/
3) Add a new standard swiotlb function to disable the use of the bounce
buffer for a device.
--
Thanks
Tianyu Lan