Message-ID: <alpine.DEB.2.20.1709031642030.2351@nanos>
Date: Sun, 3 Sep 2017 20:17:04 +0200 (CEST)
From: Thomas Gleixner <tglx@...utronix.de>
To: Chen Yu <yu.c.chen@...el.com>
cc: x86@...nel.org, Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, Rui Zhang <rui.zhang@...el.com>,
LKML <linux-kernel@...r.kernel.org>,
"Rafael J. Wysocki" <rjw@...ysocki.net>,
Len Brown <lenb@...nel.org>,
Dan Williams <dan.j.williams@...el.com>,
Christoph Hellwig <hch@....de>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing
the idlest CPU
On Fri, 1 Sep 2017, Chen Yu wrote:
> This is the major logic to spread the vectors across different CPUs.
> The main idea is to choose the 'idlest' CPU, i.e. the one with the
> fewest vectors assigned, as the candidate/hint for the vector
> allocation domain, in the hope that the vector allocation domain
> can leverage this hint to generate the corresponding cpumask.
>
> One of the requirements for this vector spreading work comes from the
> following hibernation problem found on a 16-core server:
>
> CPU 31 disable failed: CPU has 62 vectors assigned and there
> are only 0 available.
>
> Two issues were found after investigation:
>
> 1. The network driver has reserved many vector resources via
>    pci_enable_msix_range(); for example, if the driver reserves
>    6 vectors per logical CPU, then on 32 logical CPUs there would
>    be 192 of them.
> 2. Besides, most of the vectors reserved by this driver are assigned
>    on CPU0 due to the current allocation strategy, so there are
>    insufficient slots left on CPU0 to receive any migrated IRQs
>    during CPU offline.
>
> In theory, after commit c5cb83bb337c ("genirq/cpuhotplug: Handle
> managed IRQs on CPU hotplug") issue 1 has been solved by the
> managed interrupt mechanism for multi-queue devices, which no longer
> migrates these interrupts. It simply shuts them down and restarts
> them when the CPU comes back online.
>
> However, due to the way this network driver is implemented,
> the commit mentioned above does not fully fix issue 1.
> Here's the framework of the network driver:
>
> step 1. Reserve enough IRQ vectors and corresponding IRQs.
> step 2. If the network is activated, invoke request_irq() to
>         register the handler.
> step 3. Invoke set_affinity() to spread the IRQs onto different
>         CPUs, thus spreading the vectors too.
>
> The problem is that if the network cable is not connected, steps 2
> and 3 are never invoked, so the IRQ vectors are not spread across
> different CPUs and all remain allocated on CPU0. As a result the
> CPU offline operation still fails.
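Condensed into code, that pattern looks roughly like this (a sketch only;
the driver names and the queue count are made up):

#include <linux/interrupt.h>
#include <linux/netdevice.h>
#include <linux/pci.h>

#define NETDRV_NQUEUES  32      /* hypothetical: one queue per logical CPU */

static struct msix_entry netdrv_msix[NETDRV_NQUEUES];
static irqreturn_t netdrv_handler(int irq, void *dev_id);

/* step 1: reserve vectors for all queues at probe time */
static int netdrv_reserve_vectors(struct pci_dev *pdev)
{
        int i;

        for (i = 0; i < NETDRV_NQUEUES; i++)
                netdrv_msix[i].entry = i;
        return pci_enable_msix_range(pdev, netdrv_msix, 1, NETDRV_NQUEUES);
}

/* steps 2 and 3: only run when the interface comes up */
static int netdrv_open(struct net_device *ndev)
{
        int i;

        for (i = 0; i < NETDRV_NQUEUES; i++) {
                request_irq(netdrv_msix[i].vector, netdrv_handler, 0,
                            "netdrv", ndev);
                irq_set_affinity_hint(netdrv_msix[i].vector,
                                      cpumask_of(i % num_online_cpus()));
        }
        return 0;
}

If the cable stays unplugged, netdrv_open() never runs and all the
reserved vectors stay piled up on CPU0.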
So the proper solution for this network driver is to switch to managed
interrupts instead of trying to work around it in some other place. It's
using the wrong mechanism - end of story.
Why are you insisting on implementing a band-aid for this particular driver
instead of fixing the underlying problem of that driver, which requires
having 32 queues and interrupts open even if only a single CPU is online?
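With managed interrupts the spreading is done by the core at allocation
time, and on CPU offline the affected vectors are shut down instead of
migrated. A minimal sketch, assuming a PCI device (the min/max vector
counts are made up):

#include <linux/interrupt.h>
#include <linux/pci.h>

static int netdrv_setup_irqs(struct pci_dev *pdev)
{
        struct irq_affinity affd = { 0 };

        /*
         * PCI_IRQ_AFFINITY spreads the vectors over the CPUs and marks
         * the interrupts as managed. A managed interrupt is shut down
         * when its CPU goes offline and restarted when the CPU comes
         * back, so nothing needs to be migrated to CPU0.
         */
        return pci_alloc_irq_vectors_affinity(pdev, 1, 32,
                                              PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
                                              &affd);
}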
> Previously there was some discussion in the thread [1] about
> vector spreading, and here are the suggestions from Thomas:
>
> Q1:
> Rewrite that allocation code from scratch, use per cpu bitmaps,
> so we can do fast search over masks and keep track of
> the vectors which are used/free per cpu.
> A1:
> A per-CPU bitmap was once considered, but it might not be
> as fast as the solution proposed here. That is, if we introduce a
> per-CPU bitmap, we need to count the number of vectors assigned on
> each CPU via cpumask_weight() and sort the results in order to find
> the CPUs that have the least number of vectors assigned. By contrast,
> if we introduce a bitmap per vector count, we can find the
> 'idlest' CPU at a cost of nearly O(1).
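In code, the per-vector-count bitmaps from A1 amount to something like
this (a sketch with invented names, not the actual patch):

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <asm/irq_vectors.h>

/* vector_mask[n]: the set of CPUs which have exactly n vectors assigned */
static struct cpumask vector_mask[NR_VECTORS];

static int pick_idlest_cpu(const struct cpumask *allowed)
{
        int n, cpu;

        /* Scan from the lowest count upwards; the first hit is an
         * allowed CPU with the fewest vectors assigned. */
        for (n = 0; n < NR_VECTORS; n++) {
                cpu = cpumask_first_and(&vector_mask[n], allowed);
                if (cpu < nr_cpu_ids)
                        return cpu;
        }
        return -ENOSPC;
}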
The weight accounting with the cpumasks is an orthogonal issue to the
per-CPU vector bitmaps.
Once you have selected a CPU, the current code still loops in circles
instead of just searching a bitmap.
And if the required affinity mask contains more than one CPU, then this
is still not giving you the right answer. You perhaps assign a vector to
the least busy CPU, but the other CPUs which happen to be in the mask
are still selected randomly.
I'm not against making the vector allocation better, but certainly not by
adding yet more duct tape to something which is well known as one of the
dumbest search algorithms on the planet, with a worst case of

    nvectors * nr_online_cpus * nr_online_cpus_in_affinity_mask

and your mechanism nests another loop of potentially NR_VECTORS into
that. Which is pointless, as the actual assignable vector space is
smaller. While x86 has 256 vectors, the assignable vector space is way
smaller and non-contiguous:
    Vectors   0- 31 are reserved for CPU traps and exceptions
    Vectors 239-255 are reserved by the kernel for IPIs, local timer etc.
    Vector       32 can be reserved by the kernel for the reboot interrupt
    Vector      128 can be reserved by the kernel for INT80
So the actually usable vector space is between 205 and 207 entries.
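For reference, in terms of the constants from
arch/x86/include/asm/irq_vectors.h (values of that era):

#define FIRST_EXTERNAL_VECTOR   0x20    /*  32: 0-31 are traps/exceptions */
#define IA32_SYSCALL_VECTOR     0x80    /* 128: int80 (compat syscalls)   */
#define LOCAL_TIMER_VECTOR      0xef    /* 239: first of the system       */
                                        /*      vectors: IPIs, timer etc. */
#define NR_VECTORS              256

That leaves 239 - 32 = 207 externally assignable vectors, minus up to two
conditionally reserved ones, hence 205 to 207.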
Aside from that, the code is incorrect vs. the per-CPU accounting. Remove
the limit in the update function and see what happens. While the limit is
necessary in general, the code should at least warn and yell whenever the
accounting goes out of bounds. And if that happens, then the whole thing
is completely useless, simply because the numbers are just wrong.
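I.e. the update path wants something like this (sketch, invented names):

#include <linux/bug.h>
#include <linux/percpu.h>
#include <asm/irq_vectors.h>

static DEFINE_PER_CPU(int, assigned_vectors);

static void account_vectors(int cpu, int delta)
{
        int cnt = per_cpu(assigned_vectors, cpu) + delta;

        /* Yell instead of silently clamping: once this triggers, every
         * decision based on the accounting is wrong anyway. */
        if (WARN_ON_ONCE(cnt < 0 || cnt >= NR_VECTORS))
                return;
        per_cpu(assigned_vectors, cpu) = cnt;
}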
> One scenario of this patch can be illustrated below:
>
> a) After initialization, vector_mask[0] is set to {CPU0...CPU31}; all
>    other vector_mask[i] remain empty, which means that CPU0...CPU31
>    have zero vectors assigned.
> b) During driver probe, CPU0 is chosen to be assigned one vector.
>    Thus vector_mask[0] becomes {CPU1...CPU31}, and vector_mask[1]
>    becomes {CPU0}.
> c) The next time the system tries to assign another vector, it
>    searches from vector_mask[0]; because that mask is non-empty,
>    CPU1 is chosen to place the vector. Thus vector_mask[0] becomes
>    {CPU2...CPU31}, and vector_mask[1] becomes {CPU0, CPU1}.
> d) And so on: the vector assignments spread across the CPUs.
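The bookkeeping behind those transitions is a bucket move (sketch,
matching the hypothetical vector_mask[] arrangement above):

/* Move a CPU from the "n vectors" bucket to the "n + 1 vectors"
 * bucket when one more vector is assigned to it. */
static void vector_mask_inc(int cpu, int n)
{
        cpumask_clear_cpu(cpu, &vector_mask[n]);
        cpumask_set_cpu(cpu, &vector_mask[n + 1]);
}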
So this works for your particular network driver scenario, which is the
wrong example, as this driver just needs to be reworked so it does not
expose that issue.
The reality of affinity requirements is not that simple. They are very
much device-dependent, and making spreading mandatory can have negative
side effects.
> Q2:
> "respect the allocation domains"
> A2:
> This solution provides a CPU as the parameter for the vector allocation domains,
Solution? This is a crude hack which "solves" one particular problem in the
wrong way.
> in the hope that the domains can leverage this hint to generate the cpumask
Hope is never a good engineering principle.
> with fewer vectors assigned. But yes, the CPU we select upfront gets
> fed into apic->vector_allocation_domain(), which might just say no
> because the CPU does not belong to it; but anyway this should somehow
> spread the vectors, although I cannot find a graceful solution to this.
> BTW, the vector allocation domain is default_vector_allocation_domain()
> in our case, thus it works.
Now try that with the logical delivery modes which allow multi-CPU
assignments, and with the cluster mode thereof.
> Q3:
> "set a preference for the node on which the interrupt was allocated"
> A3:
> This can be achieved by further optimization in the following way:
> a) Also record the number of vectors used per node.
> b) Each time __assign_irq_vector() is invoked, adjust the input mask
>    via cpumask_and(mask, mask, node_mask_with_least_vectors())
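Spelled out, that heuristic would look something like this (sketch;
node_mask_with_least_vectors() is the hypothetical helper named in the
quote):

#include <linux/cpumask.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

static unsigned int node_vectors[MAX_NUMNODES];

static const struct cpumask *node_mask_with_least_vectors(void)
{
        int node, best = first_online_node;

        /* Yet another loop: scan all online nodes for the one with
         * the fewest vectors assigned and return its cpumask. */
        for_each_online_node(node)
                if (node_vectors[node] < node_vectors[best])
                        best = node;
        return cpumask_of_node(best);
}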
I'm not looking forward to that heuristic and the next loop inside the
loops of loops.
So no, this is not going to happen. The current implementation sucks and
needs a rewrite.
I already started to look into that, but began at the low-level end of
it. See the IDT cleanup series in tip:x86/apic. There is much more voodoo
in the vector management code to clean up before we can actually talk
about making the assignment algorithm smarter.
I'll send out a separate mail raising a few points to discuss before we
actually start looking into magic spreading functionality.
Thanks,
tglx