linux-kernel - Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq

Message-ID: <dc71306f-2693-0e02-8886-5daf96cfa11d@linux.intel.com>
Date:   Mon, 13 Jul 2020 10:23:31 +0800
From:   "Li, Aubrey" <aubrey.li@...ux.intel.com>
To:     Joel Fernandes <joel@...lfernandes.org>
Cc:     Vineeth Remanan Pillai <vpillai@...italocean.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        Julien Desfossez <jdesfossez@...italocean.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Tim Chen <tim.c.chen@...ux.intel.com>, mingo@...nel.org,
        tglx@...utronix.de, pjt@...gle.com, torvalds@...ux-foundation.org,
        linux-kernel@...r.kernel.org, subhra.mazumdar@...cle.com,
        fweisbec@...il.com, keescook@...omium.org, kerrnel@...gle.com,
        Phil Auld <pauld@...hat.com>, Aaron Lu <aaron.lwe@...il.com>,
        Aubrey Li <aubrey.intel@...il.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>, vineethrp@...il.com,
        Chen Yu <yu.c.chen@...el.com>,
        Christian Brauner <christian.brauner@...ntu.com>,
        Tim Chen <tim.c.chen@...el.com>,
        "Paul E . McKenney" <paulmck@...nel.org>
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of
 IRQ and softirq

On 2020/7/10 21:21, Joel Fernandes wrote:
> On Fri, Jul 10, 2020 at 08:19:24PM +0800, Li, Aubrey wrote:
>> Hi Joel/Vineeth,
>>
>>
>> The problem is gone when we reverted this patch. We are running multiple
>> uperf threads(equal to cpu number) in a cgroup with coresched enabled.
>> This is 100% reproducible on our side.
> 
> Interesting. I am guessing you are not doing any hotplug since those fixes
> were removed from v6 to expose those hotplug issues..
> 
> The last known lockups with this patch were fixed. Appreciate if you can dig
> in more and provide logs/traces. The last one I remember was:
> 
> HT1                                  HT2
>                                      irq_enter()
> 				     	- sets the core-wide flag
> <softirq running>                    
>       acquires a lock.
>   <gets irq>
>   irq_enter() - do nothing.
>   irq_exit() - busy wait on flag.
>                                      irq_exit()
> 				       <softirq running>
> 				       acquire a lock and deadlock.
> 
> The fix was to call sched_core_irq_enter() when you enter a softirq
> from paths other than irq_exit().
> 
> Other than this one, we have not seen lockups in heavy testing over the last
> 2 months since we redesigned this patch to enter the 'private state' on the
> outer-most core-wide sched_core_irq_enter().

When the first soft lockup panic hit on CPU75, that CPU was waiting on a TLB flush IPI.

[  170.641645] CPU: 75 PID: 5393 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
[  170.641651] RIP: 0010:smp_call_function_many_cond+0x2b1/0x2e0
----snip----
[  170.641660] Call Trace:
[  170.641666]  ? flush_tlb_func_common.constprop.10+0x220/0x220
[  170.641668]  ? x86_configure_nx+0x50/0x50
[  170.641669]  ? flush_tlb_func_common.constprop.10+0x220/0x220
[  170.641670]  on_each_cpu_cond_mask+0x2f/0x80
[  170.641671]  flush_tlb_mm_range+0xab/0xe0
[  170.641677]  change_protection+0x18a/0xca0
[  170.641682]  ? __switch_to_asm+0x34/0x70
[  170.641685]  change_prot_numa+0x15/0x30
[  170.641689]  task_numa_work+0x1aa/0x2c0
[  170.641694]  task_work_run+0x76/0xa0
[  170.641698]  exit_to_usermode_loop+0xeb/0xf0
[  170.641700]  do_syscall_64+0x1aa/0x1d0
[  170.641701]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

If I read the code correctly, some CPU is stuck in irq_exit(), so the IPI
cannot complete and return to CPU75. I found that CPU to be CPU91:

[  170.652257] CPU: 91 PID: 5401 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
[  170.652257] RIP: 0010:sched_core_irq_exit+0xcc/0x110
----snip----
[  170.652261] Call Trace:
[  170.652262]  <IRQ>
[  170.652262]  irq_exit+0x6a/0xb0
[  170.652262]  smp_apic_timer_interrupt+0x74/0x130
[  170.652262]  apic_timer_interrupt+0xf/0x20

Then I checked the stack of CPU91's sibling, CPU19, and found it spinning on a spin lock:

[  170.643678] CPU: 19 PID: 5385 Comm: uperf Kdump: loaded Not tainted 5.7.6+ #3
[  170.643679] RIP: 0010:native_queued_spin_lock_slowpath+0x137/0x1e0
[  170.643684] Call Trace:
[  170.643684]  <IRQ>
[  170.643684]  _raw_spin_lock+0x1b/0x20
[  170.643685]  tcp_delack_timer+0x2c/0xf0
[  170.643685]  ? tcp_delack_timer_handler+0x170/0x170
[  170.643685]  call_timer_fn+0x2d/0x130
[  170.643685]  run_timer_softirq+0x420/0x450
[  170.643686]  ? enqueue_hrtimer+0x39/0x90
[  170.643686]  ? __hrtimer_run_queues+0x138/0x290
[  170.643686]  __do_softirq+0xed/0x2f0
[  170.643686]  irq_exit+0xad/0xb0
[  170.643686]  smp_apic_timer_interrupt+0x74/0x130
[  170.643687]  apic_timer_interrupt+0xf/0x20
----snip----
[  170.643738]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

So I guess the problem is,

CPU91					CPU19
(1)hold a bh_lock_sock(sk)
(2)<gets irq>
					(3) <gets irq>
(4) irq_exit()
    -> sched_core_irq_exit()
       - not outermost, wait()
					(5) invoke softirq
					(6) acquire bh_lock_sock() and deadlock
					(7) sched_core_irq_exit()
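
The interleaving above can be read as a wait-for graph, where an edge A -> B
means "A cannot make progress until B does". The sketch below is only a
user-space model built from the three traces in this mail, not kernel code;
the function name find_cycle and the dictionary layout are assumptions made
for illustration.

```python
# Wait-for graph model of the suspected deadlock (illustrative only).
# An edge A -> B means "A cannot make progress until B does".

def find_cycle(wait_for):
    """Return a list of nodes forming a cycle, or None if there is none."""
    visiting, done = set(), set()
    path = []

    def dfs(node):
        visiting.add(node)
        path.append(node)
        for nxt in wait_for.get(node, ()):
            if nxt in visiting:
                # Found a back edge: slice out the loop and close it.
                return path[path.index(nxt):] + [nxt]
            if nxt not in done:
                cycle = dfs(nxt)
                if cycle:
                    return cycle
        visiting.discard(node)
        done.add(node)
        path.pop()
        return None

    for start in wait_for:
        if start not in done:
            cycle = dfs(start)
            if cycle:
                return cycle
    return None


# Edges taken from the traces: who is blocked on whom.
wait_for = {
    "CPU75": ["CPU91"],  # waits for the TLB flush IPI to be handled
    "CPU91": ["CPU19"],  # busy-waits in sched_core_irq_exit() on its sibling
    "CPU19": ["CPU91"],  # spins on the socket lock that CPU91 holds
}

print(find_cycle(wait_for))
```

In this model CPU91 and CPU19 form the cycle, and CPU75 is a victim that hangs
off it, which matches the order in which the soft lockups were reported.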

In case I have misunderstood anything, I have attached the full dmesg.

IMHO, could we let the IRQ exit and instead wait before returning to user mode?
I think we can trust anything running in the kernel.
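
To illustrate why moving the wait might help, here is a rough dependency model
of the two policies. It is a sketch under strong assumptions: the step names,
the wait_point knob, and the helpers model() and stuck_steps() are all
hypothetical, and the model ignores everything except the lock, the core-wide
wait, and the IPI seen in the traces.

```python
def stuck_steps(steps):
    """Return the set of steps that can never complete.

    steps maps each step to the steps it must wait for. A step completes
    once all of its dependencies have completed.
    """
    completed = set()
    changed = True
    while changed:
        changed = False
        for step, deps in steps.items():
            if step not in completed and all(d in completed for d in deps):
                completed.add(step)
                changed = True
    return set(steps) - completed


def model(wait_point):
    """Build the dependency map for a given (hypothetical) wait location."""
    steps = {
        "CPU91:irq_exit": [],                        # leave the timer IRQ
        "CPU91:unlock": ["CPU91:irq_exit"],          # release bh_lock_sock()
        "CPU19:softirq_lock": ["CPU91:unlock"],      # tcp_delack_timer takes it
        "CPU19:irq_exit": ["CPU19:softirq_lock"],    # sibling leaves IRQ
        "CPU75:ipi_done": ["CPU91:irq_exit"],        # TLB flush IPI is handled
    }
    if wait_point == "irq_exit":
        # Current behaviour in the traces: wait for the sibling inside irq_exit().
        steps["CPU91:irq_exit"] = ["CPU19:irq_exit"]
    else:
        # Proposed behaviour: only wait right before going back to user mode.
        steps["CPU91:user_return"] = ["CPU91:unlock", "CPU19:irq_exit"]
    return steps


print(sorted(stuck_steps(model("irq_exit"))))
print(sorted(stuck_steps(model("user_return"))))
```

With the wait in irq_exit() every step in the model is stuck; with the wait
deferred to the return to user mode, CPU91 can release the lock first and
nothing is stuck. Whether the real code paths behave this way would still need
to be confirmed against the patch.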

Thanks,
-Aubrey

View attachment "dmesg.txt" of type "text/plain" (216893 bytes)