[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <718eaa5a-a021-469d-9053-f622a53422b9@linux.microsoft.com>
Date: Thu, 8 May 2025 08:34:00 -0700
From: Roman Kisel <romank@...ux.microsoft.com>
To: Michael Kelley <mhklinux@...look.com>,
Saurabh Singh Sengar <ssengar@...rosoft.com>,
Naman Jain <namjain@...ux.microsoft.com>, KY Srinivasan <kys@...rosoft.com>,
Haiyang Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>,
Dexuan Cui <decui@...rosoft.com>
Cc: Anirudh Rayabharam <anrayabh@...ux.microsoft.com>,
Saurabh Sengar <ssengar@...ux.microsoft.com>,
Stanislav Kinsburskii <skinsburskii@...ux.microsoft.com>,
Nuno Das Neves <nunodasneves@...ux.microsoft.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>
Subject: Re: [PATCH] Drivers: hv: Introduce mshv_vtl driver
On 5/7/2025 9:03 PM, Michael Kelley wrote:
> From: Roman Kisel <romank@...ux.microsoft.com> Sent: Wednesday, May 7, 2025 12:21 PM
>>
>> On 5/7/2025 6:02 AM, Saurabh Singh Sengar wrote:
>>>
>> [..]
>>
>>>> + }
>>>> +
>>>> + local_irq_save(flags);
>>>> + in = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> + out = *this_cpu_ptr(hyperv_pcpu_output_arg);
>>>> +
>>>> + if (copy_from_user(in, (void __user *)hvcall.input_ptr,
>>>> hvcall.input_size)) {
>>>
>>> Here is an issue related to usage of user copy functions when interrupt are disabled.
>>> It was reported by Michael K here:
>>>
>>> https://github.com/microsoft/OHCL-Linux-Kernel/issues/33
>>
>> From the practical point of view, that memory will be touched by the
>> user mode by virtue of Rust requiring initialization so the a possible
>> page fault would be resolved before the IOCTL. OpenHCL runs without swap
>> so the the memory will not be paged out to require page faults to be
>> brought in back.
>>
>> I do agree that might be turned into a footgun by the user land if
>> they malloc a page w/o prefaulting (so it's just a VA range, not backed
>> with the physical page), and then send its address straight over here
>> right after w/o writing any data to it. Perhaps likelier with the output
>> data. Anyway, yes, relying on the user land doing sane things isn't
>> the best approach to the kernel programming.
>>
>> If we're inclined to fix this, I'd encourage to take an approach that
>> works for the confidential VMs as well so we don't have to fix that
>> again when start upstreaming what we have for SNP and TDX. The
>> allocation *must* be visible to the hypervisor in the confidential
>> scenarios.
>>
>> Or, maybe we could avoid the allocations by reading the first byte
>> of the user land buffer to "pre-fault" the page outside of the
>> scope that disables interrupts. Why allocate if we can avoid that?
>> Could set up also the SMP remote calls to run this on the desired
>> CPU.
>>
>> Summarizing for the case you want to change this:
>>
>> 1. Keep interrupts disabled when reading/writing to/from the Hyper-V
>> driver allocated input and output pages.
>> 2. If you decide to allocate separate pages, make sure they are
>> visible to the hypervisor in the confidential scenarios. I know
>> we're not talking SNP and TDX here just yet but it would be
>> a waste of time imho to build something here and scrape that
>> later. The issues with allocations are:
>> a) If allocating on-demand, we might fail the hypercall
>> because of OOM. That's certainly bad as the whole VM
>> will break down.
>> b) If allocating for the whole lifetime of the VM,
>> let us remember that we avoid using hypercalls
>> due to their runtime cost. We'll be keeping around
>> 2 pages per CPU for the few times we need them.
>> 3. Consider reading a byte from the user land buffers to make the page
>> fault happen outside of disabling interrupts. There is no
>> outswap (maybe could have disabling swap in Kconfig) so the page
>> will stay in the memory.
>>
>> If you're not changing this, feel free to keep my "Reviewed-by".
>>
>
> Regardless of what might be done to prevent a page fault, I don't
> see an option to not fix this. copy_from_user() contains a call to
> might_fault(), which in turn calls might_sleep(). The intent of these
> runtime "annotations" is precisely for the kernel to check for such
> errors and complain about them. The complaining is suppressed unless
> CONFIG_DEBUG_ATOMIC_SLEEP is set, but we want to be able to
> set that option for debugging purposes and not have this code
> generating complaints.
I outlined the practical point of view above to emphasize that the
existing code works very well (millions of VMs, and there is health
monitoring in place) thanks to running as a firmware and the tooling.
The VTL2 driver is run and useful only in that environment.
I agree that this approach might raise eyebrows and, as evidenced,
entails some arguing. If you must insist, perhaps for the sake of that
code looking more conventional for the reader we could tweak this.
If that's the choice, keeping this path lean should be the goal in
my opinion.
>
> Michael
--
Thank you,
Roman
Powered by blists - more mailing lists