linux-kernel - RE: [PATCH v2 2/2] drivers/hv: add CPU offlining support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
 <SN6PR02MB4157906A4E40E416D61AFEEBD4182@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Tue, 14 Jan 2025 02:43:33 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Hamza Mahfooz <hamzamahfooz@...ux.microsoft.com>,
	"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>
CC: Boqun Feng <boqun.feng@...il.com>, Wei Liu <wei.liu@...nel.org>, "K. Y.
 Srinivasan" <kys@...rosoft.com>, Haiyang Zhang <haiyangz@...rosoft.com>,
	Dexuan Cui <decui@...rosoft.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2 2/2] drivers/hv: add CPU offlining support

From: Hamza Mahfooz <hamzamahfooz@...ux.microsoft.com> Sent: Friday, January 10, 2025 2:00 PM
> 
> Currently, it is effectively impossible to offline CPUs. Since, most
> CPUs will have vmbus channels attached to them. So, as made mention of
> in commit d570aec0f2154 ("Drivers: hv: vmbus: Synchronize
> init_vp_index() vs. CPU hotplug"), rebind channels associated with CPUs
> that a user is trying to offline to a new "randomly" selected CPU.

Let me provide some additional context and thoughts about the new
functionality proposed in this patch set.

1. I would somewhat challenge the commit message statement that
"it is effectively impossible to offline CPUs". VMBus device interrupts
can be assigned to a different CPU via a /sys interface and the code
in target_cpu_store().  So a CPU *can* be taken offline by first reassigning
any VMBus device interrupts, and then the offlining operation will succeed.
That reassigning requires manual sysadmin actions or some scripting,
which isn't super easy or automatic, but it's not "effectively impossible".

2. As background, when a CPU goes offline, the Linux kernel already has
functionality to reassign unmanaged IRQs that are assigned to the CPU
going offline.  (Managed IRQs are just shut down.)  See fixup_irqs().
Unfortunately, VMBus device interrupts are not modelled as Linux IRQs,
so the existing mechanism is not applied to VMBus devices.

3. In light of #2 and for other reasons, I submitted a patch set in June 2024
that models VMBus device interrupts as Linux IRQs. See [1]. This patch set
got feedback from Thomas Gleixner about how to demultiplex the IRQs, but
no one from Microsoft gave feedback on the overall idea. I think it would
be worthwhile to pursue these patches, but I would like to get some
macro-level thoughts from the Microsoft folks. There are implications for
things such as irqbalance.

4. As the cover letter in my patch set notes, there's still a problem with
the automatic Linux IRQ reassignment mechanism for the new VMBus IRQs.
The cover letter doesn't give full details, but the problem is ultimately due
to needing to get an ack from Hyper-V that the change in VMBus device
interrupt assignment has been completed. I have investigated alternatives
for making it work, but they are all somewhat convoluted. Nevertheless,
if we want to move forward with the patch set, further work on these
alternatives would be warranted.

5. In May 2020, Andrea Parri worked on a patch set that does what this
patch set does -- automatically reassign VMBus device interrupts when
a CPU tries to go offline. That patch set took a broader focus on making a
smart decision about the CPU to which to assign the interrupt in several
different circumstances, one of which was offlining a CPU. It was
somewhat complex and posted as an RFC [2]. I think Andrea ended up
having to work on some other things, and the patch set was not pursued
after the initial posting. It might be worthwhile to review it for comparison
purposes, or maybe it's worth reviving.

All of that is to say that I think there are two paths forward:

A. The quicker fix is to take the approach of this patch set and continue
handling VMBus device interrupts outside of the Linux IRQ mechanism.
Do the automatic reassignment when taking a CPU offline, as coded
in this patch. Andrea Parri's old patch set might have something to add
to this approach, if just for comparison purposes.

B. Take a broader look at the problem by going back to my patch set
that models VMBus device interrupts as Linux IRQs. Work to get
the existing Linux IRQ reassignment mechanism to work for the new
VMBus IRQs. This approach will probably take longer than (A).

I lean toward (B) because it converges with standard Linux IRQs, but I
don't know what's driving doing (A). If there's need to do (A) sooner,
see my comments in the code below. I'm less inclined to add the
complexity of Andrea Parri's old patch set because I think it takes
us even further down the path of doing custom VMBus-related
work when we would do better to converge toward existing Linux
IRQ mechanisms.

[1] https://lore.kernel.org/linux-hyperv/20240604050940.859909-1-mhklinux@outlook.com/
[2] https://lore.kernel.org/linux-hyperv/20200526223218.184057-1-parri.andrea@gmail.com/

> 
> Cc: Boqun Feng <boqun.feng@...il.com>
> Cc: Wei Liu <wei.liu@...nel.org>
> Signed-off-by: Hamza Mahfooz <hamzamahfooz@...ux.microsoft.com>
> ---
> v2: remove cpus_read_{un,}lock() from hv_pick_new_cpu() and add
>     lockdep_assert_cpus_held().
> ---
>  drivers/hv/hv.c | 56 ++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 41 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index 36d9ba097ff5..9fef71403c86 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -433,13 +433,39 @@ static bool hv_synic_event_pending(void)
>  	return pending;
>  }
> 
> +static int hv_pick_new_cpu(struct vmbus_channel *channel,
> +			   unsigned int current_cpu)
> +{
> +	int ret = 0;
> +	int cpu;
> +
> +	lockdep_assert_cpus_held();
> +	lockdep_assert_held(&vmbus_connection.channel_mutex);
> +
> +	/*
> +	 * We can't assume that the relevant interrupts will be sent before
> +	 * the cpu is offlined on older versions of hyperv.
> +	 */
> +	if (vmbus_proto_version < VERSION_WIN10_V5_3)
> +		return -EBUSY;

I'm not sure why this test is here.  The function vmbus_set_channel_cpu()
tests the vmbus_proto_version against V4_1 and returns an appropriate
error. Do we *need* to filter against V5_3 instead of V4_1?

> +
> +	cpu = cpumask_next(get_random_u32_below(nr_cpu_ids), cpu_online_mask);
> +
> +	if (cpu >= nr_cpu_ids || cpu == current_cpu)
> +		cpu = VMBUS_CONNECT_CPU;

Picking a random CPU like this seems to have some problems:

1. The selected CPU might be an isolated CPU, in which case the
call to vmbus_channel_set_cpu() will return an error, and the
attempt to take the CPU offline will eventually fail. But if you try
again to take the CPU offline, a different random CPU may be
chosen that isn't an isolated CPU, and taking the CPU offline
will succeed. Such inconsistent behavior should be avoided.

2. I wonder if we should try to choose a CPU in the same NUMA node
as "current_cpu".  The Linux IRQ mechanism has the concept of CPU
affinity for an IRQ, which can express the NUMA affinity. The normal
Linux reassignment mechanism obeys the IRQ's affinity if possible,
and so would do the right thing for NUMA. So we need to consider
whether to do that here as well.

3. The handling of the current_cpu feels a bit hacky. There's
also no wrap-around in the mask search. Together, I think that
creates a small bias toward choosing the VMBUS_CONNECT_CPU,
which is arguably already somewhat overloaded because all the
low-speed devices use it. I haven't tried to look for alternative
approaches to suggest.

Michael

> +
> +	ret = vmbus_channel_set_cpu(channel, cpu);
> +
> +	return ret;
> +}
> +
>  /*
>   * hv_synic_cleanup - Cleanup routine for hv_synic_init().
>   */
>  int hv_synic_cleanup(unsigned int cpu)
>  {
>  	struct vmbus_channel *channel, *sc;
> -	bool channel_found = false;
> +	int ret = 0;
> 
>  	if (vmbus_connection.conn_state != CONNECTED)
>  		goto always_cleanup;
> @@ -456,31 +482,31 @@ int hv_synic_cleanup(unsigned int cpu)
> 
>  	/*
>  	 * Search for channels which are bound to the CPU we're about to
> -	 * cleanup.  In case we find one and vmbus is still connected, we
> -	 * fail; this will effectively prevent CPU offlining.
> -	 *
> -	 * TODO: Re-bind the channels to different CPUs.
> +	 * cleanup.
>  	 */
>  	mutex_lock(&vmbus_connection.channel_mutex);
>  	list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
>  		if (channel->target_cpu == cpu) {
> -			channel_found = true;
> -			break;
> +			ret = hv_pick_new_cpu(channel, cpu);
> +
> +			if (ret) {
> +				mutex_unlock(&vmbus_connection.channel_mutex);
> +				return ret;
> +			}
>  		}
>  		list_for_each_entry(sc, &channel->sc_list, sc_list) {
>  			if (sc->target_cpu == cpu) {
> -				channel_found = true;
> -				break;
> +				ret = hv_pick_new_cpu(channel, cpu);
> +
> +				if (ret) {
> +					mutex_unlock(&vmbus_connection.channel_mutex);
> +					return ret;
> +				}
>  			}
>  		}
> -		if (channel_found)
> -			break;
>  	}
>  	mutex_unlock(&vmbus_connection.channel_mutex);
> 
> -	if (channel_found)
> -		return -EBUSY;
> -
>  	/*
>  	 * channel_found == false means that any channels that were previously
>  	 * assigned to the CPU have been reassigned elsewhere with a call of
> @@ -497,5 +523,5 @@ int hv_synic_cleanup(unsigned int cpu)
> 
>  	hv_synic_disable_regs(cpu);
> 
> -	return 0;
> +	return ret;
>  }
> --
> 2.47.1
>