[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID:
<SN6PR02MB41579818278E2548450A37DAD4E1A@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Wed, 8 Oct 2025 18:24:34 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Long Li <longli@...rosoft.com>, "longli@...ux.microsoft.com"
<longli@...ux.microsoft.com>, KY Srinivasan <kys@...rosoft.com>, Haiyang
Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>, Dexuan Cui
<decui@...rosoft.com>, "James E.J. Bottomley"
<James.Bottomley@...senPartnership.com>, "Martin K. Petersen"
<martin.petersen@...cle.com>, James Bottomley <JBottomley@...n.com>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU
as on the I/O issuing CPU
From: Long Li <longli@...rosoft.com> Sent: Wednesday, October 8, 2025 10:20 AM
>
> > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the
> > same CPU as on the I/O issuing CPU
> >
> > From: Long Li <longli@...rosoft.com> Sent: Tuesday, October 7, 2025 5:56 PM
> > >
> > > > -----Original Message-----
> > > > From: Michael Kelley <mhklinux@...look.com>
> > > > Sent: Tuesday, October 7, 2025 8:42 AM
> > > > To: longli@...ux.microsoft.com; KY Srinivasan <kys@...rosoft.com>;
> > > > Haiyang Zhang <haiyangz@...rosoft.com>; Wei Liu
> > > > <wei.liu@...nel.org>; Dexuan Cui <decui@...rosoft.com>; James E.J.
> > > > Bottomley <James.Bottomley@...senPartnership.com>; Martin K.
> > > > Petersen <martin.petersen@...cle.com>; James Bottomley
> > > > <JBottomley@...n.com>; linux-hyperv@...r.kernel.org;
> > > > linux-scsi@...r.kernel.org; linux- kernel@...r.kernel.org
> > > > Cc: Long Li <longli@...rosoft.com>
> > > > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning
> > > > channel with the same CPU as on the I/O issuing CPU
> > > >
> > > > From: longli@...ux.microsoft.com <longli@...ux.microsoft.com> Sent:
> > > > Wednesday, October 1, 2025 10:06 PM
> > > > >
> > > > > When selecting an outgoing channel for I/O, storvsc tries to
> > > > > select a channel with a returning CPU that is not the same as
> > > > > issuing CPU. This worked well in the past, however it doesn't work
> > > > > well when the Hyper-V exposes a large number of channels (up to
> > > > > the number of all CPUs). Use a different CPU for returning channel is not efficient on Hyper-V.
> > > > >
> > > > > Change this behavior by preferring to the channel with the same
> > > > > CPU as the current I/O issuing CPU whenever possible.
> > > > >
> > > > > Tests have shown improvements in newer Hyper-V/Azure environment,
> > > > > and no regression with older Hyper-V/Azure environments.
> > > > >
> > > > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@...rosoft.com>
> > > > > Signed-off-by: Long Li <longli@...rosoft.com>
> > > > > ---
> > > > > drivers/scsi/storvsc_drv.c | 96
> > > > > ++++++++++++++++++--------------------
> > > > > 1 file changed, 45 insertions(+), 51 deletions(-)
> > > > >
> > > > > diff --git a/drivers/scsi/storvsc_drv.c
> > > > > b/drivers/scsi/storvsc_drv.c index d9e59204a9c3..092939791ea0
> > > > > 100644
> > > > > --- a/drivers/scsi/storvsc_drv.c
> > > > > +++ b/drivers/scsi/storvsc_drv.c
> > > > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> > > > > }
> > > > >
> > > > > /*
> > > > > - * Our channel array is sparsley populated and we
> > > > > + * Our channel array could be sparsley populated and we
> > > > > * initiated I/O on a processor/hw-q that does not
> > > > > * currently have a designated channel. Fix this.
> > > > > * The strategy is simple:
> > > > > - * I. Ensure NUMA locality
> > > > > - * II. Distribute evenly (best effort)
> > > > > + * I. Prefer the channel associated with the current CPU
> > > > > + * II. Ensure NUMA locality
> > > > > + * III. Distribute evenly (best effort)
> > > > > */
> > > > >
> > > > > + /* Prefer the channel on the I/O issuing processor/hw-q */
> > > > > + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> > > > > + return stor_device->stor_chns[q_num];
> > > > > +
> > > >
> > > > Hmmm. When get_og_chn() is called, we know that stor_device-
> > > > >stor_chns[q_num] is NULL since storvsc_do_io() has already handled
> > > > >the non-
> > > > NULL case. And the checks are all done with stor_device->lock held,
> > > > so the stor_chns array can't change.
> > > > Hence the above code will return NULL, which will cause a NULL
> > > > reference when
> > > > storvsc_do_io() sends out the VMBus packet.
> > > >
> > > > My recollection is that get_og_chan() is called when there is no
> > > > channel that interrupts the current CPU (that's what it means for
> > > > stor_device-
> > > > >stor_chns[<current CPU>] to be NULL). So the algorithm must pick a
> > > > >channel
> > > > that interrupts some other CPU, preferably a CPU in the current NUMA node.
> > > > Adding code to prefer the channel associated with the current CPU
> > > > doesn't make sense in get_og_chn(), as get_og_chn() is only called
> > > > when it is already known that there is no such channel.
> > >
> > > The initial values for stor_chns[] and alloced_cpus are set in
> > > storvsc_channel_init() (for primary channel) and handle_sc_creation() (for
> > subchannels).
> >
> > OK, I agree that if the CPU bit in alloced_cpus is set, then the corresponding entry
> > in stor_chns[] will not be NULL. And if the entry in stor_chns[] is NULL, the CPU
> > bit in alloced_cpus is *not* set. All the places that manipulate these fields update
> > both so they are in sync with each other. Hence I'll agree the code you've added
> > in get_og_chn() will never return a NULL value.
> >
> > (However, FWIW the reverse is not true: If the entry in stor_chns[] is not NULL,
> > the corresponding CPU bit in alloced_cpus may or may not be set.)
> >
> > >
> > > As a result, the check for cpumask_test_cpu(q_num,
> > > &stor_device->alloced_cpus) will guarantee we are getting a channel.
> > > If the check fails, the code follows the old behavior to find a channel.
> > >
> > > This check is needed because storvsc supports
> > > change_target_cpu_callback() callback via vmbus.
> >
> > But look at the code in storvsc_do_io() where get_og_chn() is called. I've copied
> > the code here for discussion purposes. This is the only place that get_og_chn() is
> > called:
> >
> > spin_lock_irqsave(&stor_device->lock, flags);
> > outgoing_channel = stor_device->stor_chns[q_num];
> > if (outgoing_channel != NULL) {
> > spin_unlock_irqrestore(&stor_device->lock, flags);
> > goto found_channel;
> > }
> > outgoing_channel = get_og_chn(stor_device, q_num);
> > spin_unlock_irqrestore(&stor_device->lock, flags);
> >
> > The code gets the spin lock, then reads the stor_chns[] entry. If the entry is non-
> > NULL, then we've found a suitable channel and get_og_chn() is *not* called. The
> > only time get_og_chan() is called is when the stor_chn[] entry
> > *is* NULL, which also means that the CPU bit in alloced_cpus is *not* set.
> > So the check you've added in get_og_chn() can never be true and the check is not
> > needed. You said the check is needed because of change_target_cpu_callback(),
> > which will invoke storvsc_change_target_cpu(). But I don't see how that matters
> > given the checks done before get_og_chn() is called. The spin lock synchronizes
> > with any changes made by storvsc_change_target_cpu().
>
> I agree this check may not be necessary. Given the current patch is correct, how about I
> submit another patch to remove this check after more testing are done?
That's OK with me.
>
> >
> > Related, there are a couple of occurrences of code like this:
> >
> > for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus,
> > q_num + 1) {
> > channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
> > if (!channel)
> > continue;
> >
> > Given your point that the alloced_cpus and stor_chns[] entries are always in sync
> > with each other, the check for channel being NULL is not necessary.
>
> I don't' agree with removing the check for NULL. This code follows the existing storvsc behavior
> on checking allocated CPUs. Looking at storvsc_change_target_cpu(), it changes stor_chns
> before changing alloced_cpus. It can run at the same time as this code. I think we may need
> to add some memory barriers in storvsc_change_target_cpu(), but that is for another topic.
Yes, you are right. The spin lock isn't held in these cases, so the checks for NULL
should stay.
Michael
Powered by blists - more mailing lists