Message-ID:
<SN6PR02MB4157DAE6D8CC6BA11CA87298D4DCA@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Fri, 28 Nov 2025 17:47:05 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Tianyu Lan <ltykernel@...il.com>, "kys@...rosoft.com" <kys@...rosoft.com>,
"haiyangz@...rosoft.com" <haiyangz@...rosoft.com>, "wei.liu@...nel.org"
<wei.liu@...nel.org>, "decui@...rosoft.com" <decui@...rosoft.com>,
"longli@...rosoft.com" <longli@...rosoft.com>, "vdso@...bites.dev"
<vdso@...bites.dev>
CC: Tianyu Lan <tiala@...rosoft.com>, "linux-hyperv@...r.kernel.org"
<linux-hyperv@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: RE: [RFC PATCH] Drivers: hv: Confidential VMBus external memory
support
From: Tianyu Lan <ltykernel@...il.com> Sent: Monday, November 24, 2025 10:29 AM
>
> In CVM(Confidential VM), system memory is encrypted
> by default. Device drivers typically use the swiotlb
> bounce buffer for DMA memory, which is decrypted
> and shared between the guest and host. Confidential
> Vmbus, however, supports a confidential channel
s/Vmbus/VMBus/ [and elsewhere in this commit msg]
> that employs encrypted memory for the Vmbus ring
> buffer and external DMA memory. The support for
> the confidential ring buffer has already been
> integrated.
>
> In CVM, device drivers usually employ the standard
> DMA API to map DMA memory with the bounce buffer,
> which remains transparent to the device driver.
> For external DMA memory support,
The "external memory" terminology is not at all clear in the context
of Linux. Presumably the terminology came from Hyper-V or the
paravisor, but I never understood what the "external" refers to. Maybe
it is memory that is "external" with respect to the hypervisor, and therefore
not shared between the hypervisor and guest? But that meaning won't be
evident to other reviewers in the Linux kernel community. In any case, some
explanation of "external memory" is needed. Also consider changing the
field names in the code to something that makes more sense in a Linux context.
> Hyper-V specific
> DMA operations are introduced, bypassing the bounce
> buffer when the confidential external memory flag
> is set.
This commit message never explains why it is important to not do bounce
buffering. There's the obvious but unstated reason of improving performance,
but that's not the main reason. The main reason is a confidentiality leak.
When available, Confidential VMBus would be used because it keeps the
DMA data private (i.e., encrypted) and confidential to the guest. Bounce
buffering copies the data to shared (i.e., decrypted) swiotlb memory, where
it is exposed to the hypervisor. That's a confidentiality leak, and it is the
primary reason bounce buffering must be eliminated. This requirement
was pointed out by Robin Murphy in the discussion of Roman Kisel's
original code to eliminate the bounce buffering.
Separately, I have major qualms about using an approach with Hyper-V specific
DMA operations. As implemented here, these DMA operations bypass all
the kernel DMA layer logic when the VMBus synthetic device indicates
"external memory". In that case, the supplied physical address is just used
as the DMA address and everything else is bypassed. While this actually works
for VMBus synthetic devices because of their simple requirements, it also
is implicitly making some assumptions. These assumptions are true today
(as far as I know) but won't necessarily be true in the future. The assumptions
include:
* There's no vIOMMU in the guest VM. IOMMUs in CoCo VMs have their
own issues to work out, but an implementation that bypasses any IOMMU
logic in the DMA layer is very short-term thinking.
* There are no PCI pass-thru devices in the guest. Since the implementation
changes the global dma_ops, the drivers for such PCI pass-thru devices would
use the same global dma_ops, and would fail.
The code also has some other problems that I point out further down.
In any case, an approach with Hyper-V specific DMA operations seems like
a very temporary fix (hack?) at best. Further down, I'll propose an alternate
approach that preserves all the existing DMA layer functionality.
> These DMA operations might also be reused
> for TDISP devices in the future, which also support
> DMA operations with encrypted memory.
>
> The DMA operations used are global architecture
> DMA operations (for details, see get_arch_dma_ops()
> and get_dma_ops()), and there is no need to set up
> for each device individually.
>
> Signed-off-by: Tianyu Lan <tiala@...rosoft.com>
> ---
> drivers/hv/vmbus_drv.c | 90 +++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 89 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 0dc4692b411a..ca31231b2c32 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -39,6 +39,9 @@
> #include <clocksource/hyperv_timer.h>
> #include <asm/mshyperv.h>
> #include "hyperv_vmbus.h"
> +#include "../../kernel/dma/direct.h"
> +
> +extern const struct dma_map_ops *dma_ops;
>
> struct vmbus_dynid {
> struct list_head node;
> @@ -1429,6 +1432,88 @@ static int vmbus_alloc_synic_and_connect(void)
> return -ENOMEM;
> }
>
> +
> +static bool hyperv_private_memory_dma(struct device *dev)
> +{
> + struct hv_device *hv_dev = device_to_hv_device(dev);
device_to_hv_device() works only when "dev" is for a VMBus device. As noted above,
if a CoCo VM were ever to have a PCI pass-thru device doing DMA, "dev" would
be some PCI device, and device_to_hv_device() would return garbage.
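If the Hyper-V specific DMA ops were pursued at all, the helper would at
minimum need a bus-type check so that a non-VMBus device falls through to
the normal path. A rough sketch, assuming the hv_bus symbol is visible at
this point in vmbus_drv.c (this is my suggestion, not code from the patch):

```c
static bool hyperv_private_memory_dma(struct device *dev)
{
	struct hv_device *hv_dev;

	/*
	 * Only VMBus devices embed a struct hv_device. Anything else,
	 * such as a PCI pass-thru device, must take the normal
	 * direct-map path.
	 */
	if (dev->bus != &hv_bus)
		return false;

	hv_dev = device_to_hv_device(dev);
	return hv_dev->channel && hv_dev->channel->co_external_memory;
}
```

That only papers over the misidentification problem, though; the global
dma_ops would still be wrong for the PCI device itself.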
> +
> + if (hv_dev && hv_dev->channel && hv_dev->channel->co_external_memory)
> + return true;
> + else
> + return false;
> +}
> +
> +static dma_addr_t hyperv_dma_map_page(struct device *dev, struct page *page,
> + unsigned long offset, size_t size,
> + enum dma_data_direction dir,
> + unsigned long attrs)
> +{
> + phys_addr_t phys = page_to_phys(page) + offset;
> +
> + if (hyperv_private_memory_dma(dev))
> + return __phys_to_dma(dev, phys);
> + else
> + return dma_direct_map_phys(dev, phys, size, dir, attrs);
This code won't build when the VMBus driver is built as a module.
dma_direct_map_phys() is an inline function that references several other
DMA functions that aren't exported because they aren't intended to be
used by drivers. Same problems occur with dma_direct_unmap_phys()
and similar functions used below.
> +}
> +
> +static void hyperv_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
> + size_t size, enum dma_data_direction dir, unsigned long attrs)
> +{
> + if (!hyperv_private_memory_dma(dev))
> + dma_direct_unmap_phys(dev, dma_handle, size, dir, attrs);
> +}
> +
> +static int hyperv_dma_map_sg(struct device *dev, struct scatterlist *sgl,
> + int nelems, enum dma_data_direction dir,
> + unsigned long attrs)
> +{
> + struct scatterlist *sg;
> + dma_addr_t dma_addr;
> + int i;
> +
> + if (hyperv_private_memory_dma(dev)) {
> + for_each_sg(sgl, sg, nelems, i) {
> + dma_addr = __phys_to_dma(dev, sg_phys(sg));
> + sg_dma_address(sg) = dma_addr;
> + sg_dma_len(sg) = sg->length;
> + }
> +
> + return nelems;
> + } else {
> + return dma_direct_map_sg(dev, sgl, nelems, dir, attrs);
> + }
> +}
> +
> +static void hyperv_dma_unmap_sg(struct device *dev, struct scatterlist *sgl,
> + int nelems, enum dma_data_direction dir, unsigned long attrs)
> +{
> + if (!hyperv_private_memory_dma(dev))
> + dma_direct_unmap_sg(dev, sgl, nelems, dir, attrs);
> +}
> +
> +static int hyperv_dma_supported(struct device *dev, u64 mask)
> +{
> + dev->coherent_dma_mask = mask;
> + return 1;
> +}
> +
> +static size_t hyperv_dma_max_mapping_size(struct device *dev)
> +{
> + if (hyperv_private_memory_dma(dev))
> + return SIZE_MAX;
> + else
> + return swiotlb_max_mapping_size(dev);
> +}
> +
> +const struct dma_map_ops hyperv_dma_ops = {
> + .map_page = hyperv_dma_map_page,
> + .unmap_page = hyperv_dma_unmap_page,
> + .map_sg = hyperv_dma_map_sg,
> + .unmap_sg = hyperv_dma_unmap_sg,
> + .dma_supported = hyperv_dma_supported,
> + .max_mapping_size = hyperv_dma_max_mapping_size,
> +};
> +
> /*
> * vmbus_bus_init -Main vmbus driver initialization routine.
> *
> @@ -1479,8 +1564,11 @@ static int vmbus_bus_init(void)
> * doing that on each VP while initializing SynIC's wastes time.
> */
> is_confidential = ms_hyperv.confidential_vmbus_available;
> - if (is_confidential)
> + if (is_confidential) {
> + dma_ops = &hyperv_dma_ops;
arm64 doesn't have the global variable dma_ops, so this patch
won't build on arm64, even if Confidential VMBus isn't yet available
for arm64.
> pr_info("Establishing connection to the confidential VMBus\n");
> + }
> +
> hv_para_set_sint_proxy(!is_confidential);
> ret = vmbus_alloc_synic_and_connect();
> if (ret)
> --
> 2.50.1
>
Here's my idea for an alternate approach. The goal is to allow use of the
swiotlb to be disabled on a per-device basis. A device is initialized for swiotlb
usage by swiotlb_dev_init(), which sets dev->dma_io_tlb_mem to point to the
default swiotlb memory. For VMBus devices, the calling sequence is
vmbus_device_register() -> device_register() -> device_initialize() ->
swiotlb_dev_init(). But if vmbus_device_register() could override the
dev->dma_io_tlb_mem value and put it back to NULL, swiotlb operations
would be disabled on the device. Furthermore, is_swiotlb_force_bounce()
would return "false", and the normal DMA functions would not force the
use of bounce buffers. The entire code change looks like this:
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2133,11 +2133,15 @@ int vmbus_device_register(struct hv_device *child_device_obj)
child_device_obj->device.dma_mask = &child_device_obj->dma_mask;
dma_set_mask(&child_device_obj->device, DMA_BIT_MASK(64));
+ device_initialize(&child_device_obj->device);
+ if (child_device_obj->channel->co_external_memory)
+ child_device_obj->device.dma_io_tlb_mem = NULL;
+
/*
* Register with the LDM. This will kick off the driver/device
* binding...which will eventually call vmbus_match() and vmbus_probe()
*/
- ret = device_register(&child_device_obj->device);
+ ret = device_add(&child_device_obj->device);
if (ret) {
pr_err("Unable to register child device\n");
put_device(&child_device_obj->device);
I've only compile tested the above since I don't have an environment where
I can test Confidential VMBus. You would need to verify whether my thinking
is correct and this produces the intended result.
Directly setting dma_io_tlb_mem to NULL isn't great. It would be better
to add an exported function swiotlb_dev_disable() to swiotlb code that sets
dma_io_tlb_mem to NULL, but you get the idea.
Other reviewers may still see this approach as a bit of a hack, but it's a lot
less of a hack than introducing Hyper-V specific DMA functions.
swiotlb_dev_disable() is conceptually needed for TDISP devices, as TDISP
devices must similarly protect confidentiality by not allowing use of the swiotlb.
So adding swiotlb_dev_disable() is a step in the right direction, even if the
eventual TDISP code does it slightly differently. Doing the disable on a
per-device basis is also the right thing in the long run.
Michael