Date:   Wed, 5 Jul 2017 22:02:15 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     "Williams, Mitch A" <mitch.a.williams@...el.com>,
        "Ramamurthy, Harshitha" <harshitha.ramamurthy@...el.com>,
        Ingo Molnar <mingo@...hat.com>,
        "Van De Ven, Arjan" <arjan.van.de.ven@...el.com>,
        "Anvin, H Peter" <h.peter.anvin@...el.com>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Len Brown <lenb@...nel.org>, Rui Zhang <rui.zhang@...el.com>,
        Mika Westerberg <mika.westerberg@...ux.intel.com>,
        linux-kernel@...r.kernel.org, x86@...nel.org
Subject: Re: [PATCH][RFC] x86/irq: Spread vectors on different CPUs

Hi Thomas,
On Wed, Jun 28, 2017 at 09:03:19PM +0200, Thomas Gleixner wrote:
> On Sat, 13 May 2017, Chen Yu wrote:
> > This is because:
> > 1. One of the drivers has declared many vector resources via
> >    pci_enable_msix_range(); for example, this driver might want
> >    to reserve 6 per logical CPU, so there would be 192 of them.
> 
> This has been solved with the managed interrupt mechanism for multi-queue
> devices, which no longer migrates these interrupts. It simply shuts
> them down and restarts them when the CPU comes back online.
> 
I think the driver has already enabled multi-queue:
ethtool -l enp187s0f0
Channel parameters for enp187s0f0:
Pre-set maximums:
RX:		0
TX:		0
Other:		1
Combined:	128
Current hardware settings:
RX:		0
TX:		0
Other:		1
Combined:	32
> > 2. Besides, most of the vectors for this driver are allocated
> >    on CPU0 due to the current code strategy, so there would be
> >    insufficient slots left on CPU0 to receive any migrated
> >    IRQs from the other CPUs during CPU offline.
> 
> That's driver silliness. I assume it's a multi-queue device, so it should
> use the managed interrupt affinity code.
> 
Let me Cc my colleagues Mitch and Harshitha, who are currently working
on this driver.
> > 3. Furthermore, many vectors this driver reserved do not have
> >    any IRQ handler attached.
> 
> That's silly. Why does a driver allocate more resources than required? Just
> because it can?
> 
The reason the driver reserves these resources without using them might
be (this is just my guess) that no network cable is plugged into the
interface at the moment, even though the driver has been loaded. I will
confirm whether this is the case first.
> > As a result, all vectors on CPU0 were used up and the last alive
> > CPU (31) failed to migrate its IRQs to CPU0.
> > 
> > As we might have difficulty reducing the number of vectors reserved
> > by that driver,
> 
> Why so? If the vectors are not used, then why are they allocated in the
> first place? Can you point at the driver source please?
> 
We are testing the net/ethernet/intel/i40e driver now. Previously we
introduced a mechanism to release some vectors before S4 to work around
this issue, but as you mentioned, if these are managed multi-queue
interrupts, the release is done automatically during CPU hotplug. I wonder
whether the reserved vectors will also be released even when no handler is
attached?
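For my own understanding, I sketched what I think the managed-affinity
request would look like on the driver side (illustrative only: made-up
function name and vector counts, not actual i40e code):

#include <linux/pci.h>
#include <linux/interrupt.h>

/*
 * Illustrative sketch: ask the PCI/MSI core for managed vectors so the
 * core spreads them across the CPUs and parks them on CPU offline
 * instead of migrating them to the remaining CPUs.
 */
static int example_request_managed_vectors(struct pci_dev *pdev,
					    unsigned int nr_queues)
{
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* e.g. one non-queue (misc) vector */
	};
	int nvecs;

	nvecs = pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
					       PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					       &affd);
	if (nvecs < 0)
		return nvecs;

	/* request_irq(pci_irq_vector(pdev, i), ...) for each vector follows */
	return nvecs;
}

Is that roughly the interface the i40e driver should be moving to?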
> > there could be a compromise solution: spread
> > the vector allocation across different CPUs rather than always choosing
> > the *first* CPU in the cpumask. In this way, there would be a balanced
> > vector distribution. Because the many vectors that are reserved but
> > unused (point 3 above) are not counted during CPU offline, and they
> > would now sit on nonboot CPUs, this problem would be solved.
> 
> You are tackling the problem at the wrong end to some extent. I'm not
> against sanely spreading interrupts, but we only want to do that if there
> is a compelling reason. A broken driver does not count.
>   
Understood.
> > +static int pick_leisure_cpu(const struct cpumask *mask)
> > +{
> > +	int cpu, vector;
> > +	int min_nr_vector = NR_VECTORS;
> > +	int target_cpu = cpumask_first_and(mask, cpu_online_mask);
> > +
> > +	for_each_cpu_and(cpu, mask, cpu_online_mask) {
> > +		int nr_vectors = 0;
> > +
> > +		for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
> > +			if (!IS_ERR_OR_NULL(per_cpu(vector_irq, cpu)[vector]))
> > +				nr_vectors++;
> > +		}
> > +		if (nr_vectors < min_nr_vector) {
> > +			min_nr_vector = nr_vectors;
> > +			target_cpu = cpu;
> > +		}
> 
> That's beyond silly. Why would we have to count the available vectors over
> and over? We can keep track of the vectors available per cpu continuously
> when we allocate and free them.
> 
Yes, that would be much more efficient :)
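Something along these lines, I assume (just a rough sketch with made-up
names, not an actual patch; it relies on the same vector_lock
serialization as the existing code):

/*
 * Illustrative only: keep a per-CPU count of used vectors, updated at
 * allocation/free time, so the least loaded CPU can be picked without
 * rescanning the whole vector_irq table on every allocation.
 */
static DEFINE_PER_CPU(unsigned int, vectors_in_use);

static void account_vector_alloc(int cpu)
{
	per_cpu(vectors_in_use, cpu)++;
}

static void account_vector_free(int cpu)
{
	per_cpu(vectors_in_use, cpu)--;
}

static int pick_leisure_cpu(const struct cpumask *mask)
{
	unsigned int min_used = UINT_MAX;
	int cpu, target_cpu = cpumask_first_and(mask, cpu_online_mask);

	for_each_cpu_and(cpu, mask, cpu_online_mask) {
		unsigned int used = per_cpu(vectors_in_use, cpu);

		if (used < min_used) {
			min_used = used;
			target_cpu = cpu;
		}
	}
	return target_cpu;
}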
> > +	}
> > +	return target_cpu;
> > +}
> > +
> >  static int __assign_irq_vector(int irq, struct apic_chip_data *d,
> >  			       const struct cpumask *mask)
> >  {
> > @@ -131,7 +152,7 @@ static int __assign_irq_vector(int irq, struct apic_chip_data *d,
> >  	/* Only try and allocate irqs on cpus that are present */
> >  	cpumask_clear(d->old_domain);
> >  	cpumask_clear(searched_cpumask);
> > -	cpu = cpumask_first_and(mask, cpu_online_mask);
> > +	cpu = pick_leisure_cpu(mask);
> >  	while (cpu < nr_cpu_ids) {
> >  		int new_cpu, offset;
> 
> I'm really not fond of this extra loop. That function is a huge nested loop
> mess already. The cpu we select upfront gets fed into
> apic->vector_allocation_domain() which might just say no because the CPU
> does not belong to it. So yes, it kinda works for your purpose, but it's
> just a bolted on 'solution'.
> 
> We really need to sit down and rewrite that allocation code from scratch,
> use per cpu bitmaps, so we can do fast search over masks and keep track of
> the vectors which are used/free per cpu.
> 
> That needs some thought because we need to respect the allocation domains,
> but there must be a saner way than looping with interrupts disabled for a
> gazillion hoops.
> 
> What the current code misses completely is to set a preference for the node
> on which the interrupt was allocated. We just blindly take something out of
> the default affinity mask which is all online cpus.
> 
> So there are a lot of shortcomings in the current implementation and we
> should be able to replace it with something better, faster and while at it
> make that spreading stuff work.
Agreed. That sounds like a good improvement; let me check the i40e driver
first and then see whether the allocation code can be revised along these
lines.
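To make sure I understand the direction: per-CPU bitmaps of used
vectors, so the free-vector search becomes a bitmap operation? A rough
sketch of what I have in mind (illustrative names only, allocation
domains ignored for now, vector_lock assumed held):

/*
 * Illustrative only: one bitmap of used vectors per CPU, so finding a
 * free vector is a find_next_zero_bit() instead of a nested scan over
 * vector_irq.
 */
typedef unsigned long vector_bitmap_t[BITS_TO_LONGS(NR_VECTORS)];
static DEFINE_PER_CPU(vector_bitmap_t, used_vector_bitmap);

static int alloc_vector_on_cpu(int cpu)
{
	unsigned long *bm = per_cpu(used_vector_bitmap, cpu);
	int vec = find_next_zero_bit(bm, NR_VECTORS, FIRST_EXTERNAL_VECTOR);

	if (vec >= NR_VECTORS)
		return -ENOSPC;
	__set_bit(vec, bm);
	return vec;
}

static void free_vector_on_cpu(int cpu, int vec)
{
	__clear_bit(vec, per_cpu(used_vector_bitmap, cpu));
}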
> 
> That does not mean, that we blindly support drivers allocating a gazillion
> vectors just for nothing.
> 
Thanks,
	Yu
> Thanks,
> 
> 	tglx
> 
