Message-ID: <9529ddd0-28f7-fa74-e56f-39de84321a22@redhat.com>
Date: Mon, 26 Oct 2020 09:35:52 -0400
From: Nitesh Narayan Lal <nitesh@...hat.com>
To: Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>
Cc: Marcelo Tosatti <mtosatti@...hat.com>, helgaas@...nel.org,
linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
linux-pci@...r.kernel.org, intel-wired-lan@...ts.osuosl.org,
frederic@...nel.org, sassmann@...hat.com,
jesse.brandeburg@...el.com, lihong.yang@...el.com,
jeffrey.t.kirsher@...el.com, jacob.e.keller@...el.com,
jlelli@...hat.com, hch@...radead.org, bhelgaas@...gle.com,
mike.marciniszyn@...el.com, dennis.dalessandro@...el.com,
thomas.lendacky@....com, jiri@...dia.com, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
lgoncalv@...hat.com
Subject: Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping
CPUs
On 10/23/20 5:00 PM, Thomas Gleixner wrote:
> On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote:
>> On 10/23/20 4:58 AM, Peter Zijlstra wrote:
>>> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:
>>> So shouldn't we then fix the drivers / interface first, to get rid of
>>> this inconsistency?
>>>
>> Considering that we agree that excess vectors are a problem that needs to
>> be solved across all the drivers, and that you are comfortable with the
>> other three patches in the set, may I suggest the following:
>>
>> - We can pick those three patches for now, as that will at least fix a
>> driver that is currently impacting RT workloads. Is that a fair
>> expectation?
> No. Blindly reducing the maximum vectors to the number of housekeeping
> CPUs is patently wrong. The PCI core _cannot_ just willy-nilly decide
> what the right number of interrupts for this situation is.
>
> Many of these drivers need more than queue interrupts: admin and error
> interrupts, and some operate best with separate RX/TX interrupts per
> queue. They all can "work" with a single PCI interrupt of course, but
> the price you pay is performance.
>
> An isolated setup, which I'm familiar with, has two housekeeping
> CPUs. So far I restricted the number of network queues with a module
> argument to two, which allocates two management interrupts for the
> device and two interrupts (RX/TX) per queue, i.e. a total of six.
Does it somehow take num_online_cpus() into consideration while deciding
the number of interrupts to create?
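
The kind of calculation I am thinking of looks roughly like this; a
simplified, hypothetical example of how several drivers size their MSI-X
vectors (DRV_NONQ_VECS is made up for this sketch):

#include <linux/kernel.h>
#include <linux/cpumask.h>

/* Hypothetical non-queue (admin/misc) vectors, made up for this sketch. */
#define DRV_NONQ_VECS	1

static unsigned int drv_calc_msix_vectors(unsigned int requested_queues)
{
	/* Queue vectors are typically capped by the number of online CPUs. */
	unsigned int qvecs = min_t(unsigned int, requested_queues,
				   num_online_cpus());

	return qvecs + DRV_NONQ_VECS;
}
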
>
> Now I reduced the number of available interrupts to two according to
> your hack, which makes it use one queue RX/TX combined and one
> management interrupt. Guess what happens? Network performance tanks to
> the point that it breaks a carefully crafted setup.
>
> The same applies to a device which is application specific and wants one
> channel including an interrupt per isolated application core. Today I
> can isolate 8 out of 12 CPUs and let the device create 8 channels and
> set one interrupt and channel affine to each isolated CPU. With your
> hack, I get only 4 interrupts and channels. Fail!
>
> You cannot declare that all this is perfectly fine, just because it does
> not matter for your particular use case.
I agree that does sound wrong.
>
> So without information from the driver which tells what the best number
> of interrupts is with a reduced number of CPUs, this cutoff will cause
> more problems than it solves. Regressions guaranteed.
Indeed.
I think one commonality among the drivers at the moment is the usage of
num_online_cpus() to determine the vectors to create.
So, maybe instead of applying this kind of restriction in a generic-level
API, it would make more sense to do this on a per-device basis by replacing
the number of online CPUs with the number of housekeeping CPUs?
This is what I have done in the i40e patch.
But that still sounds hackish and will impact performance.
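
A rough sketch of the per-driver change I mean (simplified and hypothetical,
not the exact i40e code):

#include <linux/kernel.h>
#include <linux/cpumask.h>
#include <linux/sched/isolation.h>

/* Size the queue vectors on the housekeeping CPUs instead of all online
 * CPUs. When no CPUs are isolated, housekeeping_cpumask() covers all
 * possible CPUs, so the min_t() falls back to num_online_cpus().
 */
static unsigned int drv_num_queue_vectors(void)
{
	unsigned int hk_cpus =
		cpumask_weight(housekeeping_cpumask(HK_FLAG_MANAGED_IRQ));

	return min_t(unsigned int, hk_cpus, num_online_cpus());
}
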
>
> Managed interrupts base their interrupt allocation and spreading on
> information which is handed in by the individual driver and not on crude
> assumptions. They are not imposing restrictions on the use case.
Right, FWIU it is irq_do_set_affinity() that prevents the spreading of
managed interrupts to isolated CPUs when HK_FLAG_MANAGED_IRQ is enabled,
isn't it?
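
For reference, from memory the relevant part of irq_do_set_affinity() looks
roughly like the below (a simplified paraphrase with the locking, NULL-chip
check, and return-value handling dropped; not the exact upstream code):

#include <linux/irq.h>
#include <linux/cpumask.h>
#include <linux/sched/isolation.h>

int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
			bool force)
{
	struct irq_chip *chip = irq_data_get_irq_chip(data);
	const struct cpumask *prog_mask = mask;
	struct cpumask tmp_mask;	/* the real code uses a static mask under a lock */

	/*
	 * For a managed interrupt with HK_FLAG_MANAGED_IRQ enabled, trim the
	 * requested mask down to the housekeeping CPUs, as long as at least
	 * one of those housekeeping CPUs is online.
	 */
	if (irqd_affinity_is_managed(data) &&
	    housekeeping_enabled(HK_FLAG_MANAGED_IRQ)) {
		const struct cpumask *hk_mask =
			housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);

		cpumask_and(&tmp_mask, mask, hk_mask);
		if (cpumask_intersects(&tmp_mask, cpu_online_mask))
			prog_mask = &tmp_mask;
	}

	return chip->irq_set_affinity(data, prog_mask, force);
}
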
>
> It's perfectly fine for isolated work to save a data set to disk after
> computation has finished and that just works with the per-cpu I/O queue
> which is otherwise completely silent. All isolated workers can do the
> same in parallel without trampling on each other's toes by competing for a
> reduced number of queues which are affine to the housekeeper CPUs.
>
> Unfortunately network multi-queue is substantially different from block
> multi-queue (as I learned in this conversation), so the concept cannot
> be applied one-to-one to networking as is. But there are certainly parts
> of it which can be reused.
So this is one of the areas that I don't understand well yet and will
explore now.
>
> This needs a lot more thought than just these crude hacks.
Got it.
I am indeed not in favor of pushing a solution that has a possibility of
regressing/breaking things afterward.
>
> Especially under the aspect that there are talks about making isolation
> runtime switchable. Are you going to rmmod/insmod the i40e network
> driver to do so? That's going to work fine if you do that
> reconfiguration over network...
That's an interesting point. However, for some of the drivers that use
something like cpu_online/possible_mask while creating their affinity masks,
reloading will still associate jobs with the isolated CPUs.
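
For example, a (hypothetical, not taken from any specific driver) pattern
like the below spreads the affinity hints over the online CPUs via
cpumask_local_spread(), so reloading the module simply re-runs the same code
and re-creates the association with the isolated CPUs:

#include <linux/cpumask.h>
#include <linux/interrupt.h>
#include <linux/pci.h>

/* Hypothetical sketch: hint each vector to a CPU picked from the online
 * mask, which includes isolated CPUs as well.
 */
static void drv_set_affinity_hints(struct pci_dev *pdev, unsigned int nvecs,
				   int node)
{
	unsigned int i;

	for (i = 0; i < nvecs; i++) {
		unsigned int cpu = cpumask_local_spread(i, node);

		irq_set_affinity_hint(pci_irq_vector(pdev, i),
				      cpumask_of(cpu));
	}
}
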
--
Thanks
Nitesh