Open Source and information security mailing list archives
 
Date:   Tue, 30 May 2023 11:24:00 +1000
From:   Imran Khan <imran.f.khan@...cle.com>
To:     paulmck@...nel.org
Cc:     peterz@...radead.org, jgross@...e.com, vschneid@...hat.com,
        yury.norov@...il.com, tglx@...utronix.de,
        linux-kernel@...r.kernel.org, akpm@...ux-foundation.org
Subject: Re: [RESEND PATCH 2/2] smp: Reduce NMI traffic from CSD waiters to
 CSD destination.

Hello Paul,

On 16/5/2023 10:09 pm, Paul E. McKenney wrote:
> On Tue, May 09, 2023 at 08:31:24AM +1000, Imran Khan wrote:
>> On systems with hundreds of CPUs, if a few hundred or most of the CPUs
>> detect a CSD hang, then all of these waiters end up sending an NMI to
>> the destination CPU to dump its backtrace.
>> Depending on the number of such NMIs, the destination CPU can spend
>> a significant amount of time handling them, making it harder for
>> that CPU to address the pending CSDs in a timely manner.
>> In the worst case, by the time the destination CPU has finished
>> handling all of the above-mentioned backtrace NMIs, the CSD wait time
>> may have elapsed, so all of the waiters send backtrace NMIs again,
>> and this behaviour continues in a loop.
>>
>> To avoid the above-mentioned scenario, issue the backtrace NMI only
>> from the first waiter. The other waiters to the same CSD destination
>> can make use of the backtrace obtained via the first waiter's NMI.
>>
>> Signed-off-by: Imran Khan <imran.f.khan@...cle.com>
> 
> Reviewed-by: Paul E. McKenney <paulmck@...nel.org>
> 

Thanks a lot for reviewing this and [1]. Could you kindly let me know
if you plan to pick these up in your tree at some point?

Thanks,
Imran

[1]:
https://lore.kernel.org/all/088edfa0-c1b7-407f-8b20-caf0fecfbb79@paulmck-laptop/

>> ---
>>  kernel/smp.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index b7ccba677a0a0..a1cd21ea8b308 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -43,6 +43,8 @@ static DEFINE_PER_CPU_ALIGNED(struct call_function_data, cfd_data);
>>  
>>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue);
>>  
>> +static DEFINE_PER_CPU(atomic_t, trigger_backtrace) = ATOMIC_INIT(1);
>> +
>>  static void __flush_smp_call_function_queue(bool warn_cpu_offline);
>>  
>>  int smpcfd_prepare_cpu(unsigned int cpu)
>> @@ -242,7 +244,8 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
>>  			 *bug_id, !cpu_cur_csd ? "unresponsive" : "handling this request");
>>  	}
>>  	if (cpu >= 0) {
>> -		dump_cpu_task(cpu);
>> +		if (atomic_cmpxchg_acquire(&per_cpu(trigger_backtrace, cpu), 1, 0))
>> +			dump_cpu_task(cpu);
>>  		if (!cpu_cur_csd) {
>>  			pr_alert("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n", *bug_id, raw_smp_processor_id(), cpu);
>>  			arch_send_call_function_single_ipi(cpu);
>> @@ -423,9 +426,14 @@ static void __flush_smp_call_function_queue(bool warn_cpu_offline)
>>  	struct llist_node *entry, *prev;
>>  	struct llist_head *head;
>>  	static bool warned;
>> +	atomic_t *tbt;
>>  
>>  	lockdep_assert_irqs_disabled();
>>  
>> +	/* Allow waiters to send backtrace NMI from here onwards */
>> +	tbt = this_cpu_ptr(&trigger_backtrace);
>> +	atomic_set_release(tbt, 1);
>> +
>>  	head = this_cpu_ptr(&call_single_queue);
>>  	entry = llist_del_all(head);
>>  	entry = llist_reverse_order(entry);
>> -- 
>> 2.34.1
>>
