Message-ID: <20241113093826.667c4918@imladris.surriel.com>
Date: Wed, 13 Nov 2024 09:38:26 -0500
From: Rik van Riel <riel@...riel.com>
To: Borislav Petkov <bp@...en8.de>
Cc: linux-kernel@...r.kernel.org, dave.hansen@...ux.intel.com,
 luto@...nel.org, peterz@...radead.org, tglx@...utronix.de,
 mingo@...hat.com, x86@...nel.org, kernel-team@...a.com, hpa@...or.com,
 bigeasy@...utronix.de
Subject: Re: [PATCh 0/3] x86,tlb: context switch optimizations

On Wed, 13 Nov 2024 10:55:50 +0100
Borislav Petkov <bp@...en8.de> wrote:

> On Fri, Nov 08, 2024 at 07:27:47PM -0500, Rik van Riel wrote:
> > While profiling switch_mm_irqs_off with several workloads,
> > it appears there are two hot spots that probably don't need
> > to be there.
> 
> One of those three is causing the below here, zapping them from tip.
> 

This is interesting, and unexpected.

> [    3.186469] ------------[ cut here ]------------
> [    3.186469] WARNING: CPU: 16 PID: 97 at kernel/smp.c:807
> smp_call_function_many_cond+0x188/0x720

This is the lockdep_assert_irqs_enabled() from this branch:

        if (cpu_online(this_cpu) && !oops_in_progress &&
            !early_boot_irqs_disabled)
                lockdep_assert_irqs_enabled();

> [    3.186469] Call Trace:
> [    3.186469]  <TASK>
> [    3.186469]  on_each_cpu_cond_mask+0x50/0x90
> [    3.186469]  flush_tlb_mm_range+0x1a8/0x1f0
> [    3.186469]  __text_poke+0x366/0x5d0

... and sure enough, it looks like __text_poke() calls
flush_tlb_mm_range() with IRQs disabled!

> [    3.186469]  text_poke_bp_batch+0xa1/0x3d0
> [    3.186469]  text_poke_finish+0x1b/0x30
> [    3.186469]  arch_jump_label_transform_apply+0x18/0x30
> [    3.186469]  static_key_slow_inc_cpuslocked+0x55/0xa0
...

I have no good explanation for why that lockdep_assert_irqs_enabled()
would not be firing without my patches applied.

We obviously should not be sending out any IPIs with IRQs disabled.

However, __text_poke() has been sending IPIs with interrupts disabled
for 4 years now! No wonder we see deadlocks involving __text_poke()
on a semi-regular basis.
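
To illustrate why a synchronous IPI sent with IRQs disabled can
deadlock, here is a rough userspace toy model, not kernel code. The
struct, the function name cross_flush_completes() and the two-CPU
setup are all made up for illustration. Each modelled CPU sends a
flush request to the other and spins until it is acknowledged; a CPU
can only service the incoming request if its interrupts are enabled.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS     2
#define MAX_STEPS 100

/* Hypothetical per-CPU state for the toy model. */
struct toy_cpu {
	bool irqs_enabled;	/* can this CPU take an incoming IPI? */
	bool ipi_pending;	/* the other CPU asked us to flush */
	bool waiting_ack;	/* we sent a synchronous IPI and are spinning */
};

/*
 * Both CPUs send a synchronous flush IPI to each other at the same time.
 * Returns true if both senders get their acknowledgement, false if the
 * model gets stuck with nobody able to make progress.
 */
static bool cross_flush_completes(bool irqs_enabled_while_waiting)
{
	struct toy_cpu cpus[NCPUS];
	int step, i;

	assert(NCPUS == 2);	/* the "1 - i" trick below relies on this */

	for (i = 0; i < NCPUS; i++) {
		cpus[i].irqs_enabled = irqs_enabled_while_waiting;
		cpus[i].ipi_pending = true;
		cpus[i].waiting_ack = true;
	}

	for (step = 0; step < MAX_STEPS; step++) {
		bool progress = false;

		for (i = 0; i < NCPUS; i++) {
			int sender = 1 - i;

			/* An IPI is only handled if IRQs are enabled. */
			if (cpus[i].ipi_pending && cpus[i].irqs_enabled) {
				cpus[i].ipi_pending = false;
				cpus[sender].waiting_ack = false;
				progress = true;
			}
		}

		if (!cpus[0].waiting_ack && !cpus[1].waiting_ack)
			return true;	/* both senders were acknowledged */
		if (!progress)
			return false;	/* everyone spins forever: deadlock */
	}
	return false;
}
```

With irqs_enabled_while_waiting set to true the model completes in a
single step; with it set to false neither CPU ever services the
other's request. The real IPI paths are considerably more involved;
this only captures the circular wait.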

Should we move the local_irq_restore() in __text_poke() up a few lines,
like in the patch below?

Alternatively, should we explicitly clear the mm_cpumask in
unuse_temporary_mm(), to make sure that mm never has any bits set
in mm_cpumask?

Or, since we do not flush the TLB for the poking_mm until AFTER we
have switched back to the prev mm, should we simply always switch to
the poking_mm in a way that involves flushing the TLB? That way we
won't even have to flush the entry after unuse...

What is the best approach here?
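
For reference, the mm_cpumask dependency can be sketched in userspace
as below. This is a toy model under stated assumptions, not the kernel
implementation: toy_cpumask_t and flush_needs_ipi() are hypothetical
names, and a 64-bit integer stands in for the real cpumask. The idea
it models is that a flush only has to send IPIs when some CPU other
than the current one still has its bit set in the mask, so stale bits
left behind by lazy clearing turn a local flush into a remote one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for mm_cpumask(): bit N set means CPU N may use the mm. */
typedef uint64_t toy_cpumask_t;

/*
 * Returns true if flushing this mm from this_cpu would need to reach
 * at least one other CPU, i.e. would have to send an IPI.
 */
static bool flush_needs_ipi(toy_cpumask_t mm_mask, unsigned int this_cpu)
{
	toy_cpumask_t others;

	assert(this_cpu < 64);	/* limit of the 64-bit toy mask */

	/* Ignore our own bit; only remote CPUs require an IPI. */
	others = mm_mask & ~((toy_cpumask_t)1 << this_cpu);
	return others != 0;
}
```

In this model, a mask with only the current CPU's bit set (or no bits
at all) never needs an IPI, which corresponds to the second option
above: keeping the mask empty after unuse_temporary_mm() would keep
the flush local.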
---8<---
From a2e7c517bbd2cf108fc14c51449bf8e53e314b53 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@...riel.com>
Date: Wed, 13 Nov 2024 09:19:39 -0500
Subject: [PATCH] x86,alternatives: re-enable interrupts before sending TLB flush IPI

__text_poke() calls flush_tlb_mm_range() to flush the mapping of
the text poke address. However, it does so with interrupts disabled,
which can cause a deadlock.

We do occasionally observe deadlocks involving __text_poke(), but
not frequently enough to spend much time debugging them.

Borislav triggered this bug while testing a different patch, which
lazily clears bits from the mm_cpumask, resulting in more bits being
set when __text_poke() calls flush_tlb_mm_range(), which in turn
triggered the lockdep_assert_irqs_enabled() in smp_call_function_many_cond().

Avoid sending IPIs with IRQs disabled by re-enabling IRQs earlier.

Signed-off-by: Rik van Riel <riel@...riel.com>
Reported-by: Borislav Petkov <bp@...en8.de>
Cc: stable@...nel.org
Fixes: 7cf494270424 ("x86: expand irq-off region in text_poke()")
---
 arch/x86/kernel/alternative.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index d17518ca19b8..f71d84249f6e 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1940,6 +1940,9 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
 	 */
 	unuse_temporary_mm(prev);
 
+	/* Re-enable interrupts before sending an IPI. */
+	local_irq_restore(flags);
+
 	/*
 	 * Flushing the TLB might involve IPIs, which would require enabled
 	 * IRQs, but not if the mm is not used, as it is in this point.
@@ -1956,7 +1959,6 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
 		BUG_ON(memcmp(addr, src, len));
 	}
 
-	local_irq_restore(flags);
 	pte_unmap_unlock(ptep, ptl);
 	return addr;
 }
-- 
2.45.2



