[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7e4bdbf988addfa811e3317e6da7ef691bd46c3a.camel@intel.com>
Date: Thu, 04 Aug 2022 22:06:13 +1200
From: Kai Huang <kai.huang@...el.com>
To: Dave Hansen <dave.hansen@...el.com>, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org
Cc: seanjc@...gle.com, pbonzini@...hat.com, len.brown@...el.com,
tony.luck@...el.com, rafael.j.wysocki@...el.com,
reinette.chatre@...el.com, dan.j.williams@...el.com,
peterz@...radead.org, ak@...ux.intel.com,
kirill.shutemov@...ux.intel.com,
sathyanarayanan.kuppuswamy@...ux.intel.com,
isaku.yamahata@...el.com
Subject: Re: [PATCH v5 07/22] x86/virt/tdx: Implement SEAMCALL function
On Thu, 2022-08-04 at 10:35 +1200, Kai Huang wrote:
> On Wed, 2022-08-03 at 07:20 -0700, Dave Hansen wrote:
> > On 8/2/22 19:37, Kai Huang wrote:
> > > On Thu, 2022-07-21 at 13:52 +1200, Kai Huang wrote:
> > > > Also, if I understand correctly above, your suggestion is we want to prevent any
> > > > CMR memory going offline so it won't be hot-removed (assuming we can get CMRs
> > > > during boot). This looks contradicts to the requirement of being able to allow
> > > > moving memory from core-mm to driver. When we offline the memory, we cannot
> > > > know whether the memory will be used by driver, or later hot-removed.
> > > Hi Dave,
> > >
> > > The high level flow of device hot-removal is:
> > >
> > > acpi_scan_hot_remove()
> > > -> acpi_scan_try_to_offline()
> > > -> acpi_bus_offline()
> > > -> device_offline()
> > > -> memory_subsys_offline()
> > > -> acpi_bus_trim()
> > > -> acpi_memory_device_remove()
> > >
> > >
> > > And memory_subsys_offline() can also be triggered via /sysfs:
> > >
> > > echo 0 > /sys/devices/system/memory/memory30/online
> > >
> > > After the memory block is offline, my understanding is kernel can theoretically
> > > move it to, i.e. ZONE_DEVICE via memremap_pages().
> > >
> > > As you can see memory_subsys_offline() is the entry point of memory device
> > > offline (before it the code is generic for all ACPI device), and it cannot
> > > distinguish whether the removal is from ACPI event, or from /sysfs, so it seems
> > > we are unable to refuse to offline memory in memory_subsys_offline() when it is
> > > called from ACPI event.
> > >
> > > Any comments?
> >
> > I suggest refactoring the code in a way that makes it possible to
> > distinguish the two cases.
> >
> > It's not like you have some binary kernel. You have the source code for
> > the whole thing and can propose changes *ANYWHERE* you need. Even better:
> >
> > $ grep -A2 ^ACPI\$ MAINTAINERS
> > ACPI
> > M: "Rafael J. Wysocki" <rafael@...nel.org>
> > R: Len Brown <lenb@...nel.org>
> >
> > The maintainer of ACPI works for our employer. Plus, he's a nice
> > helpful guy that you can go ask how you might refactor this or
> > approaches you might take. Have you talked to Rafael about this issue?
>
> Rafael once also suggested to set hotplug.enabled to 0 as your code shows below,
> but we just got the TDX architecture behaviour of memory hotplug clarified from
> Intel TDX guys recently.
>
> > Also, from a two-minute grepping session, I noticed this:
> >
> > > static acpi_status acpi_bus_offline(acpi_handle handle, u32 lvl, void *data,
> > > void **ret_p)
> > > {
> > ...
> > > if (device->handler && !device->handler->hotplug.enabled) {
> > > *ret_p = &device->dev;
> > > return AE_SUPPORT;
> > > }
> >
> > It looks to me like if you simply set:
> >
> > memory_device_handler->hotplug.enabled = false;
> >
> > you'll get most of the behavior you want. ACPI memory hotplug would not
> > work and the changes would be confined to the ACPI world. The
> > "lower-level" bus-based hotplug would be unaffected.
> >
> > Now, I don't know what kind of locking would be needed to muck with a
> > global structure like that. But, it's a start.
>
> This has two problems:
>
> 1) This approach cannot distinguish non-CMR memory hotplug and CMR memory
> hotplug, as it disables ACPI memory hotplug for all. But this is fine as we
> want to reject non-CMR memory hotplug anyway. We just need to explain clearly
> in changelog.
>
> 2) This won't allow the kernel to speak out "BIOS bug" when CMR memory hotplug
> actually happens. Instead, we can only print out "hotplug is disabled due to
> TDX is enabled by BIOS." when we set hotplug.enable to false.
>
> Assuming above is OK, I'll explore this option. I'll also do some research to
> see if it's still possible to speak out "BIOS bug" in this approach but it's not
> a mandatory requirement to me now.
>
> Also, if print out "BIOS bug" for CMR memory hotplug isn't mandatory, then we
> can just detect TDX during kernel boot, and disable hotplug when TDX is enabled
> by BIOS, but don't need to use "winner-take-all" approach. The former is
> clearer and easier to implement. I'll go with the former approach if I don't
> hear objection from you.
>
> And ACPI CPU hotplug can also use the same way.
>
> Please let me know any comments. Thanks!
>
One more reason why "winner-take-all" approach doesn't work:
If we allow ACPI memory hotplug to happen but choose to disable it in the
handler using "winner-take-all", then at the beginning the ACPI code will
actually create a /sysfs entry for hotplug.enabled to allow userspace to change
it:
/sys/firmware/acpi/hotplug/memory/enabled
Which means even we set hotplug.enabled to false at some point, userspace can
turn it on again. The only way is to not create this /sysfs entry at the
beginning.
With "winner-take-all" approach, I don't think we should avoid creating the
/sysfs entry. Nor we should introduce arch-specific hook to, i.e. prevent
/sysfs entry being changed by userspace.
So instead of "winner-take-all" approach, I'll introduce a new kernel command
line to allow user to choose between ACPI CPU/memory hotplug vs TDX. This
command line should not impact the "software" CPU/memory hotplug even when user
choose to use TDX. In this case, this is similar to "winner-take-all" anyway.
Powered by blists - more mailing lists