Message-ID: <2b8e02cd-2f2e-4bf1-9332-6dde502c22b1@oracle.com>
Date: Mon, 13 May 2024 10:43:48 -0700
From: Dongli Zhang <dongli.zhang@...cle.com>
To: Thomas Gleixner <tglx@...utronix.de>, x86@...nel.org
Cc: mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
hpa@...or.com, joe.jin@...cle.com, linux-kernel@...r.kernel.org,
virtualization@...ts.linux.dev
Subject: Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline
Hi Thomas,
On 5/13/24 5:44 AM, Thomas Gleixner wrote:
> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>> The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
>> interrupt affinity reconfiguration via procfs. Instead, the change is
>> deferred until the next instance of the interrupt being triggered on the
>> original CPU.
>>
>> When the interrupt next triggers on the original CPU, the new affinity is
>> enforced within __irq_move_irq(). A vector is allocated from the new CPU,
>> but the old vector on the original CPU remains and is not immediately
>> reclaimed. Instead, apicd->move_in_progress is flagged, and the
>> reclaiming process is delayed until the next trigger of the interrupt on
>> the new CPU.
>>
>> Upon the subsequent triggering of the interrupt on the new CPU,
>> irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
>> remains online. Subsequently, the timer on the old CPU iterates over its
>> vector_cleanup list, reclaiming vectors.
>>
>> However, if the old CPU is offline before the interrupt triggers again on
>> the new CPU, irq_complete_move() simply resets both apicd->move_in_progress
>> and apicd->prev_vector to 0. Consequently, the vector remains unreclaimed
>> in vector_matrix, resulting in a CPU vector leak.
>
> I doubt that.
>
> Any interrupt which is affine to an outgoing CPU is migrated and
> eventually pending moves are enforced:
>
> cpu_down()
> ...
> cpu_disable_common()
> fixup_irqs()
> irq_migrate_all_off_this_cpu()
> migrate_one_irq()
> irq_force_complete_move()
> free_moved_vector();
>
> No?
I noticed this, but finally abandoned the idea of fixing it in migrate_one_irq(),
because:

1. The objective of migrate_one_irq()-->irq_force_complete_move() appears to be
to clean up before irq_do_set_affinity().

2. irq_needs_fixup() may return false, so irq_force_complete_move() never gets
the chance to run.

3. Even if irq_force_complete_move() is triggered, it exits early if
apicd->prev_vector == 0.

apicd->prev_vector can be cleared by __vector_schedule_cleanup(), because
cpu_disable_common() releases the vector_lock after the CPU is flagged offline:
void cpu_disable_common(void)
{
        int cpu = smp_processor_id();

        remove_siblinginfo(cpu);

        /* It's now safe to remove this processor from the online map */
        lock_vector_lock();
        remove_cpu_from_maps(cpu);   ---> CPU is flagged offline
        unlock_vector_lock();        ---> release the vector_lock here!
        fixup_irqs();
        lapic_offline();
}
Therefore, the bugfix may become something like the below (just to demo the idea):
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 185738c72766..247a53fe9ada 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -1035,8 +1035,6 @@ static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
                         cl->timer.expires = jiffies + 1;
                         add_timer_on(&cl->timer, cpu);
                 }
-        } else {
-                apicd->prev_vector = 0; // or print a warning
         }
         raw_spin_unlock(&vector_lock);
 }
diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 1ed2b1739363..5ecd072a34fe 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
                 return false;
         }
 
+        /*
+         * Complete an eventually pending irq move cleanup. If this
+         * interrupt was moved in hard irq context, then the vectors need
+         * to be cleaned up. It can't wait until this interrupt actually
+         * happens and this CPU was involved.
+         */
+        irq_force_complete_move(desc);
+
         /*
          * No move required, if:
          * - Interrupt is per cpu
@@ -87,14 +95,6 @@ static bool migrate_one_irq(struct irq_desc *desc)
                 return false;
         }
 
-        /*
-         * Complete an eventually pending irq move cleanup. If this
-         * interrupt was moved in hard irq context, then the vectors need
-         * to be cleaned up. It can't wait until this interrupt actually
-         * happens and this CPU was involved.
-         */
-        irq_force_complete_move(desc);
-
         /*
          * If there is a setaffinity pending, then try to reuse the pending
          * mask, so the last change of the affinity does not get lost. If
That's why I modified only __vector_schedule_cleanup() in this patch, as it
looked simpler. I will fix this in the CPU hotplug interrupt migration code
instead.
>
> In fact irq_complete_move() should never see apicd->move_in_progress
> with apicd->prev_cpu pointing to an offline CPU.
I think it is possible. The fact that a CPU is flagged offline doesn't mean
fixup_irqs() has already run for it: the vector_lock is released right after the
CPU is flagged offline, so an interrupt arriving on the new CPU in that window
can see the old CPU as offline before fixup_irqs() reaches the descriptor.
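
To make the ordering concrete, below is a small standalone userspace mock (not
kernel code: the mock_* helpers and the arrays are simplified stand-ins for
apic_chip_data and the vector matrix, just to demo the sequence). It replays:
CPU=11 is flagged offline and the vector_lock is dropped, the interrupt then
fires on the new CPU and the cleanup path clears prev_vector because the old
CPU looks offline, and a later irq_force_complete_move() equivalent finds
nothing to free:

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS     12
#define NR_VECTORS  256

/* Simplified stand-ins, NOT the kernel's data structures */
static bool cpu_online_mask[NR_CPUS];
static bool vector_allocated[NR_CPUS][NR_VECTORS];   /* mock vector_matrix */

struct mock_apic_chip_data {
        unsigned int vector, cpu;
        unsigned int prev_vector, prev_cpu;
        bool move_in_progress;
};

/* Mimics __vector_schedule_cleanup(): runs when the irq fires on the new CPU */
static void mock_vector_schedule_cleanup(struct mock_apic_chip_data *apicd)
{
        apicd->move_in_progress = false;

        if (cpu_online_mask[apicd->prev_cpu]) {
                /* the per-CPU cleanup timer would reclaim the vector here */
                printf("cleanup scheduled on CPU%u\n", apicd->prev_cpu);
        } else {
                /* old CPU already flagged offline: the vector is forgotten */
                apicd->prev_vector = 0;
        }
}

/* Mimics irq_force_complete_move(): called later via fixup_irqs() */
static void mock_force_complete_move(struct mock_apic_chip_data *apicd)
{
        if (!apicd->prev_vector)
                return;                 /* nothing to do -> leak */

        vector_allocated[apicd->prev_cpu][apicd->prev_vector] = false;
        apicd->prev_vector = 0;
}

int main(void)
{
        struct mock_apic_chip_data apicd = {
                .vector = 34, .cpu = 10,                /* new vector on CPU10 */
                .prev_vector = 33, .prev_cpu = 11,      /* old vector on CPU11 */
                .move_in_progress = true,
        };

        for (unsigned int i = 0; i < NR_CPUS; i++)
                cpu_online_mask[i] = true;
        vector_allocated[11][33] = true;
        vector_allocated[10][34] = true;

        /* cpu_disable_common(): CPU11 flagged offline, vector_lock dropped */
        cpu_online_mask[11] = false;

        /* irq fires on CPU10 before fixup_irqs() reaches this descriptor */
        mock_vector_schedule_cleanup(&apicd);

        /*
         * Even if fixup_irqs() later forces the move (as in the demo diff
         * above), prev_vector is already 0 and nothing is reclaimed.
         */
        mock_force_complete_move(&apicd);

        printf("vector 33 on CPU11 still allocated: %s\n",
               vector_allocated[11][33] ? "yes (leaked)" : "no");
        return 0;
}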
>
> The CPU offline case in __vector_schedule_cleanup() should not even
> exist or at least just emit a warning.
>
> If you can trigger that case, then there is something fundamentally
> wrong with the CPU hotplug interrupt migration code and that needs to be
> investigated and fixed.
>
I can easily reproduce the issue:

1. Create a QEMU VM (12 vCPUs) with virtio-net and 8 RX/TX queues (16 vectors),
and run an active iperf3 multiqueue workload.
2. Affine all 16 vectors to CPU=11, and sleep 1 second.
3. Affine all 16 vectors to CPU=10.
4. Offline CPU=11, and sleep 2 seconds.
5. Online CPU=11, and go back to step 2 in a loop.

After hours, CPU=11 has accumulated many unreclaimed vectors.
# cat /sys/kernel/debug/irq/domains/VECTOR
name:   VECTOR
 size:   0
 mapped: 47
 flags:  0x00000103
Online bitmaps:       12
Global available:   2302
Global reserved:       7
Total allocated:     122
System: 38: 0-19,21,50,128,236,240-242,244,246-255
 | CPU | avl | man | mac | act | vectors
     0   198     1     1     4  32-34,48
     1   198     1     1     4  32-35
     2   198     1     1     4  32-35
     3   199     1     1     3  32-34
     4   199     1     1     3  32-34
     5   199     1     1     3  32-34
     6   198     1     1     4  32-35
     7   200     1     1     2  32-33
     8   199     1     1     3  32,34-35
     9   198     1     1     4  32-33,36-37
    10   198     1     1     4  32-33,35,41
    11   118     1     1    84  32-49,51-110,113-115,117,119,121
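
FWIW, to watch the per-CPU 'act' column while the loop runs, a quick helper
could look like the below (just a rough sketch against the debugfs output
format shown above; it needs root and a mounted debugfs, and only prints the
per-CPU rows):

#include <stdio.h>

int main(void)
{
        const char *path = "/sys/kernel/debug/irq/domains/VECTOR";
        char line[512];
        FILE *fp = fopen(path, "r");

        if (!fp) {
                perror(path);
                return 1;
        }

        while (fgets(line, sizeof(line), fp)) {
                int cpu, avl, man, mac, act;

                /* The per-CPU rows are the only lines starting with five integers */
                if (sscanf(line, " %d %d %d %d %d", &cpu, &avl, &man, &mac, &act) == 5)
                        printf("CPU %2d: act=%d avl=%d\n", cpu, act, avl);
        }

        fclose(fp);
        return 0;
}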
I will fix this in the interrupt migration code.
Thank you very much!
Dongli Zhang