Message-ID:
<DS3PR21MB573597BFA650534306FCA5CACEE1A@DS3PR21MB5735.namprd21.prod.outlook.com>
Date: Wed, 8 Oct 2025 17:20:28 +0000
From: Long Li <longli@...rosoft.com>
To: Michael Kelley <mhklinux@...look.com>, "longli@...ux.microsoft.com"
<longli@...ux.microsoft.com>, KY Srinivasan <kys@...rosoft.com>, Haiyang
Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>, Dexuan Cui
<decui@...rosoft.com>, "James E.J. Bottomley"
<James.Bottomley@...senPartnership.com>, "Martin K. Petersen"
<martin.petersen@...cle.com>, James Bottomley <JBottomley@...n.com>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] scsi: storvsc: Prefer returning channel with the same CPU
as on the I/O issuing CPU
> Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning channel with the
> same CPU as on the I/O issuing CPU
>
> From: Long Li <longli@...rosoft.com> Sent: Tuesday, October 7, 2025 5:56 PM
> >
> > > -----Original Message-----
> > > From: Michael Kelley <mhklinux@...look.com>
> > > Sent: Tuesday, October 7, 2025 8:42 AM
> > > To: longli@...ux.microsoft.com; KY Srinivasan <kys@...rosoft.com>;
> > > Haiyang Zhang <haiyangz@...rosoft.com>; Wei Liu
> > > <wei.liu@...nel.org>; Dexuan Cui <decui@...rosoft.com>; James E.J.
> > > Bottomley <James.Bottomley@...senPartnership.com>; Martin K.
> > > Petersen <martin.petersen@...cle.com>; James Bottomley
> > > <JBottomley@...n.com>; linux-hyperv@...r.kernel.org;
> > > linux-scsi@...r.kernel.org; linux-kernel@...r.kernel.org
> > > Cc: Long Li <longli@...rosoft.com>
> > > Subject: [EXTERNAL] RE: [PATCH] scsi: storvsc: Prefer returning
> > > channel with the same CPU as on the I/O issuing CPU
> > >
> > > From: longli@...ux.microsoft.com <longli@...ux.microsoft.com> Sent:
> > > Wednesday, October 1, 2025 10:06 PM
> > > >
> > > > When selecting an outgoing channel for I/O, storvsc tries to
> > > > select a channel whose returning CPU is not the same as the
> > > > issuing CPU. This worked well in the past, but it doesn't work
> > > > well when Hyper-V exposes a large number of channels (up to the
> > > > number of all CPUs). Using a different CPU for the returning
> > > > channel is not efficient on Hyper-V.
> > > >
> > > > Change this behavior by preferring the channel on the same CPU
> > > > as the current I/O issuing CPU whenever possible.
> > > >
> > > > Tests have shown improvements in newer Hyper-V/Azure environments,
> > > > and no regression with older Hyper-V/Azure environments.
> > > >
> > > > Tested-by: Raheel Abdul Faizy <rabdulfaizy@...rosoft.com>
> > > > Signed-off-by: Long Li <longli@...rosoft.com>
> > > > ---
> > > >  drivers/scsi/storvsc_drv.c | 96 ++++++++++++++++++--------------------
> > > >  1 file changed, 45 insertions(+), 51 deletions(-)
> > > >
> > > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > > > index d9e59204a9c3..092939791ea0 100644
> > > > --- a/drivers/scsi/storvsc_drv.c
> > > > +++ b/drivers/scsi/storvsc_drv.c
> > > > @@ -1406,14 +1406,19 @@ static struct vmbus_channel *get_og_chn(struct storvsc_device *stor_device,
> > > > }
> > > >
> > > > /*
> > > > - * Our channel array is sparsley populated and we
> > > > + * Our channel array could be sparsley populated and we
> > > > * initiated I/O on a processor/hw-q that does not
> > > > * currently have a designated channel. Fix this.
> > > > * The strategy is simple:
> > > > - * I. Ensure NUMA locality
> > > > - * II. Distribute evenly (best effort)
> > > > + * I. Prefer the channel associated with the current CPU
> > > > + * II. Ensure NUMA locality
> > > > + * III. Distribute evenly (best effort)
> > > > */
> > > >
> > > > + /* Prefer the channel on the I/O issuing processor/hw-q */
> > > > + if (cpumask_test_cpu(q_num, &stor_device->alloced_cpus))
> > > > + return stor_device->stor_chns[q_num];
> > > > +
> > >
> > > Hmmm. When get_og_chn() is called, we know that
> > > stor_device->stor_chns[q_num] is NULL since storvsc_do_io() has
> > > already handled the non-NULL case. And the checks are all done with
> > > stor_device->lock held, so the stor_chns array can't change.
> > > Hence the above code will return NULL, which will cause a NULL
> > > reference when storvsc_do_io() sends out the VMBus packet.
> > >
> > > My recollection is that get_og_chn() is called when there is no
> > > channel that interrupts the current CPU (that's what it means for
> > > stor_device->stor_chns[<current CPU>] to be NULL). So the algorithm
> > > must pick a channel that interrupts some other CPU, preferably a CPU
> > > in the current NUMA node.
> > > Adding code to prefer the channel associated with the current CPU
> > > doesn't make sense in get_og_chn(), as get_og_chn() is only called
> > > when it is already known that there is no such channel.
> >
> > The initial values for stor_chns[] and alloced_cpus are set in
> > storvsc_channel_init() (for the primary channel) and
> > handle_sc_creation() (for subchannels).
>
> OK, I agree that if the CPU bit in alloced_cpus is set, then the corresponding entry
> in stor_chns[] will not be NULL. And if the entry in stor_chns[] is NULL, the CPU
> bit in alloced_cpus is *not* set. All the places that manipulate these fields update
> both so they are in sync with each other. Hence I'll agree the code you've added
> in get_og_chn() will never return a NULL value.
>
> (However, FWIW the reverse is not true: If the entry in stor_chns[] is not NULL,
> the corresponding CPU bit in alloced_cpus may or may not be set.)
>
> >
> > As a result, the check for cpumask_test_cpu(q_num,
> > &stor_device->alloced_cpus) will guarantee we are getting a channel.
> > If the check fails, the code follows the old behavior to find a channel.
> >
> > This check is needed because storvsc supports the
> > change_target_cpu_callback() callback via VMBus.
>
> But look at the code in storvsc_do_io() where get_og_chn() is called. I've copied
> the code here for discussion purposes. This is the only place that get_og_chn() is
> called:
>
> 	spin_lock_irqsave(&stor_device->lock, flags);
> 	outgoing_channel = stor_device->stor_chns[q_num];
> 	if (outgoing_channel != NULL) {
> 		spin_unlock_irqrestore(&stor_device->lock, flags);
> 		goto found_channel;
> 	}
> 	outgoing_channel = get_og_chn(stor_device, q_num);
> 	spin_unlock_irqrestore(&stor_device->lock, flags);
>
> The code gets the spin lock, then reads the stor_chns[] entry. If the entry is
> non-NULL, then we've found a suitable channel and get_og_chn() is *not* called.
> The only time get_og_chn() is called is when the stor_chns[] entry
> *is* NULL, which also means that the CPU bit in alloced_cpus is *not* set.
> So the check you've added in get_og_chn() can never be true and the check is not
> needed. You said the check is needed because of change_target_cpu_callback(),
> which will invoke storvsc_change_target_cpu(). But I don't see how that matters
> given the checks done before get_og_chn() is called. The spin lock synchronizes
> with any changes made by storvsc_change_target_cpu().
I agree this check may not be necessary. Given that the current patch is correct, how about
I submit a follow-up patch to remove this check after more testing is done?
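
To make that concrete, here is a rough sketch of the precondition (for
discussion only, based on the storvsc_do_io() code quoted above; the
restructuring into an "if" and the comment are mine, not code from the patch):

	spin_lock_irqsave(&stor_device->lock, flags);
	if (stor_device->stor_chns[q_num] == NULL) {
		/*
		 * Every writer updates stor_chns[] and alloced_cpus together,
		 * and the spin lock synchronizes with
		 * storvsc_change_target_cpu(), so a NULL entry here implies
		 * !cpumask_test_cpu(q_num, &stor_device->alloced_cpus) and
		 * the early return added in get_og_chn() cannot fire.
		 */
		outgoing_channel = get_og_chn(stor_device, q_num);
	}
	spin_unlock_irqrestore(&stor_device->lock, flags);
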
>
> Related, there are a couple of occurrences of code like this:
>
> 	for_each_cpu_wrap(tgt_cpu, &stor_device->alloced_cpus, q_num + 1) {
> 		channel = READ_ONCE(stor_device->stor_chns[tgt_cpu]);
> 		if (!channel)
> 			continue;
>
> Given your point that the alloced_cpus and stor_chns[] entries are always in sync
> with each other, the check for channel being NULL is not necessary.
I don't agree with removing the NULL check. This code follows the existing storvsc behavior
of checking allocated CPUs. Looking at storvsc_change_target_cpu(), it changes stor_chns[]
before changing alloced_cpus, and it can run concurrently with this code. I think we may need
to add some memory barriers in storvsc_change_target_cpu(), but that is a topic for another patch.
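
For reference, the ordering I have in mind would look roughly like this (a
sketch only, with made-up variable names; not part of this patch):

	/*
	 * Hypothetical write ordering in storvsc_change_target_cpu():
	 * publish the new channel pointer before setting its CPU bit in
	 * alloced_cpus, so a reader that sees the bit set can also see a
	 * non-NULL entry; the old CPU's entry is NULLed last, which is
	 * exactly the window the READ_ONCE()/NULL check above handles.
	 */
	WRITE_ONCE(stor_device->stor_chns[new_cpu], channel);
	smp_wmb();	/* would pair with an smp_rmb() on the lookup side */
	cpumask_set_cpu(new_cpu, &stor_device->alloced_cpus);
	cpumask_clear_cpu(old_cpu, &stor_device->alloced_cpus);
	WRITE_ONCE(stor_device->stor_chns[old_cpu], NULL);
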
Thanks,
Long