linux-kernel - Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87ft6464jf.fsf@nanos.tec.linutronix.de>
Date:   Fri, 23 Oct 2020 23:00:52 +0200
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Nitesh Narayan Lal <nitesh@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     Marcelo Tosatti <mtosatti@...hat.com>, helgaas@...nel.org,
        linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
        linux-pci@...r.kernel.org, intel-wired-lan@...ts.osuosl.org,
        frederic@...nel.org, sassmann@...hat.com,
        jesse.brandeburg@...el.com, lihong.yang@...el.com,
        jeffrey.t.kirsher@...el.com, jacob.e.keller@...el.com,
        jlelli@...hat.com, hch@...radead.org, bhelgaas@...gle.com,
        mike.marciniszyn@...el.com, dennis.dalessandro@...el.com,
        thomas.lendacky@....com, jiri@...dia.com, mingo@...hat.com,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        lgoncalv@...hat.com
Subject: Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote:
> On 10/23/20 4:58 AM, Peter Zijlstra wrote:
>> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:
>> So shouldn't we then fix the drivers / interface first, to get rid of
>> this inconsistency?
>>
> Considering we agree that excess vector is a problem that needs to be
> solved across all the drivers and that you are comfortable with the other
> three patches in the set. If I may suggest the following:
>
> - We can pick those three patches for now, as that will atleast fix a
>   driver that is currently impacting RT workloads. Is that a fair
>   expectation?

No. Blindly reducing the maximum vectors to the number of housekeeping
CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide
what the right number of interrupts for this situation is.

Many of these drivers need more than queue interrupts, admin, error
interrupt and some operate best with seperate RX/TX interrupts per
queue. They all can "work" with a single PCI interrupt of course, but
the price you pay is performance.

An isolated setup, which I'm familiar with, has two housekeeping
CPUs. So far I restricted the number of network queues with a module
argument to two, which allocates two management interrupts for the
device and two interrupts (RX/TX) per queue, i.e. a total of six.

Now I reduced the number of available interrupts to two according to
your hack, which makes it use one queue RX/TX combined and one
management interrupt. Guess what happens? Network performance tanks to
the points that it breaks a carefully crafted setup.

The same applies to a device which is application specific and wants one
channel including an interrupt per isolated application core. Today I
can isolate 8 out of 12 CPUs and let the device create 8 channels and
set one interrupt and channel affine to each isolated CPU. With your
hack, I get only 4 interrupts and channels. Fail!

You cannot declare that all this is perfectly fine, just because it does
not matter for your particular use case.

So without information from the driver which tells what the best number
of interrupts is with a reduced number of CPUs, this cutoff will cause
more problems than it solves. Regressions guaranteed.

Managed interrupts base their interrupt allocation and spreading on
information which is handed in by the individual driver and not on crude
assumptions. They are not imposing restrictions on the use case.

It's perfectly fine for isolated work to save a data set to disk after
computation has finished and that just works with the per-cpu I/O queue
which is otherwise completely silent. All isolated workers can do the
same in parallel without trampling on each other toes by competing for a
reduced number of queues which are affine to the housekeeper CPUs.

Unfortunately network multi-queue is substantially different from block
multi-queue (as I learned in this conversation), so the concept cannot
be applied one-to-one to networking as is. But there are certainly part
of it which can be reused.

This needs a lot more thought than just these crude hacks.

Especially under the aspect that there are talks about making isolation
runtime switchable. Are you going to rmmod/insmod the i40e network
driver to do so? That's going to work fine if you do that
reconfiguration over network...

Thanks,

        tglx