Message-ID: <YjIKQBIbJR/kRR+N@linutronix.de>
Date:   Wed, 16 Mar 2022 17:03:12 +0100
From:   Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: sched_core_balance() releasing interrupts with pi_lock held

On 2022-03-15 17:46:06 [-0400], Steven Rostedt wrote:
> On Tue, 8 Mar 2022 16:14:55 -0500
> Steven Rostedt <rostedt@...dmis.org> wrote:
> 
> > Hi Peter,
> 
> Have you had time to look into this?

yes, I can confirm that it is a problem ;) So I did this:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 33ce5cd113d8..56c286aaa01f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5950,7 +5950,6 @@ static bool try_steal_cookie(int this, int that)
 	unsigned long cookie;
 	bool success = false;
 
-	local_irq_disable();
 	double_rq_lock(dst, src);
 
 	cookie = dst->core->core_cookie;
@@ -5989,7 +5988,6 @@ static bool try_steal_cookie(int this, int that)
 
 unlock:
 	double_rq_unlock(dst, src);
-	local_irq_enable();
 
 	return success;
 }
@@ -6019,7 +6017,7 @@ static void sched_core_balance(struct rq *rq)
 
 	preempt_disable();
 	rcu_read_lock();
-	raw_spin_rq_unlock_irq(rq);
+	raw_spin_rq_unlock(rq);
 	for_each_domain(cpu, sd) {
 		if (need_resched())
 			break;
@@ -6027,7 +6025,7 @@ static void sched_core_balance(struct rq *rq)
 		if (steal_cookie_task(cpu, sd))
 			break;
 	}
-	raw_spin_rq_lock_irq(rq);
+	raw_spin_rq_lock(rq);
 	rcu_read_unlock();
 	preempt_enable();
 }


which looked right, but RT still falls apart:

| =====================================
| WARNING: bad unlock balance detected!
| 5.17.0-rc8-rt14+ #10 Not tainted
| -------------------------------------
| gcc/2608 is trying to release lock ((lock)) at:
| [<ffffffff8135a150>] folio_add_lru+0x60/0x90
| but there are no more locks to release!
| 
| other info that might help us debug this:
| 4 locks held by gcc/2608:
|  #0: ffff88826ea6efe0 (&sb->s_type->i_mutex_key#12){++++}-{3:3}, at: xfs_ilock+0x90/0xd0
|  #1: ffff88826ea6f1a0 (mapping.invalidate_lock#2){++++}-{3:3}, at: page_cache_ra_unbounded+0x8e/0x1f0
|  #2: ffff88852aba8d18 ((lock)#3){+.+.}-{2:2}, at: folio_add_lru+0x2a/0x90
|  #3: ffffffff829a5140 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5/0xe0
| 
| stack backtrace:
| CPU: 18 PID: 2608 Comm: gcc Not tainted 5.17.0-rc8-rt14+ #10
| Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.03.0003.041920141333 04/19/2014
| Call Trace:
|  <TASK>
|  dump_stack_lvl+0x4a/0x62
|  lock_release.cold+0x32/0x37
|  rt_spin_unlock+0x17/0x80
|  folio_add_lru+0x60/0x90
|  filemap_add_folio+0x53/0xa0
|  page_cache_ra_unbounded+0x1c3/0x1f0
|  filemap_get_pages+0xe3/0x5b0
|  filemap_read+0xc5/0x2f0
|  xfs_file_buffered_read+0x6b/0x1a0
|  xfs_file_read_iter+0x6a/0xd0
|  new_sync_read+0x11b/0x1a0
|  vfs_read+0x134/0x1d0
|  ksys_read+0x68/0xf0
|  do_syscall_64+0x59/0x80
|  entry_SYSCALL_64_after_hwframe+0x44/0xae
| RIP: 0033:0x7f3feab7310e

It is always the local lock that breaks. Based on the "locks held" list
and the lock it tries to release, it looks like the lock was acquired on
CPU-A and released on CPU-B.

> Thanks,
> 
> -- Steve

Sebastian