Message-ID: <598457c6-4bea-50f5-efe9-6a2af3405ff5@akamai.com>
Date: Mon, 19 Nov 2018 14:35:07 -0800
From: Josh Hunt <johunt@...mai.com>
To: tglx@...utronix.de
Cc: saeedm@...lanox.com, linux-kernel@...r.kernel.org,
"Ozen, Gurhan" <guozen@...mai.com>
Subject: vector space exhaustion on 4.14 LTS kernels
Hi Thomas,
We have a class of machines that appear to be exhausting the vector
space on CPUs 0 and 1, which causes breakage later on when trying to
set IRQ affinity. The boxes are running the 4.14 LTS kernel.
I instrumented 4.14 and here's what I see:
[ 28.328849] __assign_irq_vector: irq:512 cpu:0 mask:ff,ffffffff
onlinemask:ff,ffffffff vector:0
[ 28.329847] __assign_irq_vector: irq:512 cpu:2 vector:222 cfgvect:0
off:14 old_domain:00,00000000 domain:00,00000000
vector_search:00,00000004 update
[ 28.329847] default_cpu_mask_to_apicid: irq:512 mask:00,00000004
...
[ 31.729154] __assign_irq_vector: irq:512 cpu:0 mask:ff,ffffffff
onlinemask:ff,ffffffff vector:222
[ 31.729154] __assign_irq_vector: irq:512 cpu:0 mask:ff,ffffffff
vector_cpumask:00,00000001 vector:222
...
[ 31.729154] __assign_irq_vector: irq:512 cpu:2 vector:00,00000004
domain:00,00000004 success
[ 31.729154] default_cpu_mask_to_apicid: irq:512 hwirq:512
mask:00,00000004
[ 31.729154] apic_set_affinity: irq:512 mask:ff,ffffffff err:0
...
[ 32.818152] mlx5_irq_set_affinity_hint: 0: irq:512 mask:00,00000001
...
[ 39.531242] __assign_irq_vector: irq:512 cpu:0 mask:00,00000001
onlinemask:ff,ffffffff vector:222
[ 39.531244] __assign_irq_vector: irq:512 cpu:0 mask:00,00000001
vector_cpumask:00,00000001 vector:222
[ 39.531245] __assign_irq_vector: irq:512 cpu:0 vector:00,00000001
domain:00,00000004
...
[ 39.531384] __assign_irq_vector: irq:512 cpu:0 vector:37
current_vector:37 next_cpu2
[ 39.531385] __assign_irq_vector: irq:512 cpu:128 searched:00,00000001
vector:00,00000000 continue
[ 39.531386] apic_set_affinity: irq:512 mask:00,00000001 err:-28
The affinity values:
root@....25.48.208:/proc/irq/512# grep . *
affinity_hint:00,00000001
effective_affinity:00,00000004
effective_affinity_list:2
grep: mlx5_comp0@pci:0000:65:00.1: Is a directory
node:0
smp_affinity:ff,ffffffff
smp_affinity_list:0-39
spurious:count 3
spurious:unhandled 0
spurious:last_unhandled 0 ms
I noticed your change, a0c9259dc4e1 ("irq/matrix: Spread interrupts on
allocation"), and this sounds like what we're hitting. Booting 4.19
shows the problem is gone there. I haven't booted 4.15 yet, but can do
so to confirm that the above commit is what resolves this.
Since 4.14 doesn't have the matrix allocator, it's not a trivial
backport. I was wondering a) if you agree with my assessment, and b) if
there are any plans to resolve this in the 4.14 allocator. If not, I
can attempt to backport the idea to 4.14 to spread the interrupts
around on allocation.
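For what it's worth, here is a toy model of the difference as I understand it (the err:-28 above is -ENOSPC). This is just an illustrative sketch with made-up CPU/vector counts, not the actual kernel code in kernel/irq/matrix.c: the 4.14 search takes the lowest-numbered CPU in the mask that still has a free vector, while the spread-on-allocation idea picks the CPU with the fewest vectors in use, so nothing piles up on CPU 0:

```python
# Toy model of per-CPU IRQ vector allocation. NCPUS and NVECTORS are
# deliberately tiny, illustrative numbers, not real hardware limits.
NCPUS = 4
NVECTORS = 8

def alloc_first_fit(used):
    """Pre-a0c9259dc4e1 behaviour: take the lowest-numbered CPU that
    still has a free vector, filling CPU 0 before touching CPU 1."""
    for cpu in range(NCPUS):
        if used[cpu] < NVECTORS:
            used[cpu] += 1
            return cpu
    return -1  # corresponds to -ENOSPC (-28) in the kernel

def alloc_spread(used):
    """Spread-on-allocation: pick the CPU with the fewest vectors in
    use (ties broken toward the lowest CPU number)."""
    candidates = [cpu for cpu in range(NCPUS) if used[cpu] < NVECTORS]
    if not candidates:
        return -1
    cpu = min(candidates, key=lambda c: used[c])
    used[cpu] += 1
    return cpu

used = [0] * NCPUS
# First-fit exhausts CPU 0 entirely before moving on to CPU 1.
first = [alloc_first_fit(used) for _ in range(NVECTORS + 1)]
print(first)

used = [0] * NCPUS
# Spreading keeps the per-CPU counts balanced instead.
spread = [alloc_spread(used) for _ in range(NCPUS * 2)]
print(used)
```

Under first-fit every early allocation lands on CPU 0, which matches the behavior we're seeing with the mlx5 vectors; the spread version leaves each CPU with the same count.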
Thanks
Josh