linux-kernel - Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <74ffd154-5d92-4303-9977-6ddc9accdf01@linux.ibm.com>
Date: Tue, 23 Apr 2024 20:51:58 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Ankur Arora <ankur.a.arora@...cle.com>,
        Thomas Gleixner <tglx@...utronix.de>
Cc: peterz@...radead.org, torvalds@...ux-foundation.org, paulmck@...nel.org,
        akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
        dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
        juri.lelli@...hat.com, vincent.guittot@...aro.org, willy@...radead.org,
        mgorman@...e.de, jpoimboe@...nel.org, mark.rutland@....com,
        jgross@...e.com, andrew.cooper3@...rix.com, bristot@...nel.org,
        mathieu.desnoyers@...icios.com, geert@...ux-m68k.org,
        glaubitz@...sik.fu-berlin.de, anton.ivanov@...bridgegreys.com,
        mattst88@...il.com, krypton@...ich-teichert.org, rostedt@...dmis.org,
        David.Laight@...LAB.COM, richard@....at, mjguzik@...il.com,
        jon.grimm@....com, bharata@....com, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        LKML <linux-kernel@...r.kernel.org>,
        Michael Ellerman <mpe@...erman.id.au>,
        Nicholas Piggin <npiggin@...il.com>
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling



On 2/13/24 11:25 AM, Ankur Arora wrote:
> Hi,
> 
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.
> 
> The series is based on Thomas' original proposal which he outlined
> in [1], [2] and in his PoC [3].
> 
> An earlier RFC version is at [4].
> 

Hi Ankur/Thomas. 

Thank you for this series and previous ones. 
These are very interesting patch series and the even more interesting 
discussions. I have been trying go through to get different bits of it. 


Tried this patch on PowerPC by defining LAZY similar to x86. The change is below. 
Kept it at PREEMPT=none for PREEMPT_AUTO. 

Running into soft lockup on large systems (40Cores, SMT8) and seeing close to 100%
regression on small system ( 12 Cores, SMT8). More details are after the patch. 

Are these the only arch bits that need to be defined? am I missing something very 
basic here? will try to debug this further. Any inputs?

---
 arch/powerpc/Kconfig                   | 1 +
 arch/powerpc/include/asm/thread_info.h | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..11e7008f5dd3 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -268,6 +268,7 @@ config PPC
 	select HAVE_PERF_EVENTS_NMI		if PPC64
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
+	select HAVE_PREEMPT_AUTO
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE
 	select HAVE_RSEQ
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 15c5691dd218..c28780443b3b 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -117,11 +117,13 @@ void arch_setup_new_exec(void);
 #endif
 #define TIF_POLLING_NRFLAG	19	/* true if poll_idle() is polling TIF_NEED_RESCHED */
 #define TIF_32BIT		20	/* 32 bit binary */
+#define TIF_NEED_RESCHED_LAZY	21	/* Lazy rescheduling */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)
 #define _TIF_NOTIFY_SIGNAL	(1<<TIF_NOTIFY_SIGNAL)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_32BIT		(1<<TIF_32BIT)
@@ -144,7 +146,7 @@ void arch_setup_new_exec(void);
 #define _TIF_USER_WORK_MASK	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
 				 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
 				 _TIF_RESTORE_TM | _TIF_PATCH_PENDING | \
-				 _TIF_NOTIFY_SIGNAL)
+				 _TIF_NOTIFY_SIGNAL | _TIF_NEED_RESCHED_LAZY)
 #define _TIF_PERSYSCALL_MASK	(_TIF_RESTOREALL|_TIF_NOERROR)
 
 /* Bits in local_flags */

---------------- Smaller system ---------------------------------

NUMA:                     
  NUMA node(s):           5
  NUMA node2 CPU(s):      0-7
  NUMA node3 CPU(s):      8-31
  NUMA node5 CPU(s):      32-39
  NUMA node6 CPU(s):      40-47
  NUMA node7 CPU(s):      48-95

Hackbench 			6.9		+preempt_auto (=none)
(10 iterations, 10000 loops)

Process 10 groups          :       3.00,       3.07(  -2.33)
Process 20 groups          :       5.47,       5.81(  -6.22)
Process 30 groups          :       7.78,       8.52(  -9.51)
Process 40 groups          :      10.16,      11.28( -11.02)
Process 50 groups          :      12.37,      13.90( -12.37)
Process 60 groups          :      14.58,      16.68( -14.40)
Thread  10 groups          :       3.24,       3.28(  -1.23)
Thread  20 groups          :       5.93,       6.16(  -3.88)
Process(Pipe) 10 groups    :       1.94,       2.96( -52.58)
Process(Pipe) 20 groups    :       2.91,       5.44( -86.94)
Process(Pipe) 30 groups    :       4.23,       7.83( -85.11)
Process(Pipe) 40 groups    :       5.35,      10.61( -98.32)
Process(Pipe) 50 groups    :       6.64,      13.18( -98.49)
Process(Pipe) 60 groups    :       7.88,      16.69(-111.80)
Thread(Pipe)  10 groups    :       1.92,       3.02( -57.29)
Thread(Pipe)  20 groups    :       3.25,       5.36( -64.92)

------------------- Large systems -------------------------

NUMA:                     
  NUMA node(s):           4
  NUMA node2 CPU(s):      0-31
  NUMA node3 CPU(s):      32-127
  NUMA node6 CPU(s):      128-223
  NUMA node7 CPU(s):      224-319


watchdog: BUG: soft lockup - CPU#278 stuck for 26s! [hackbench:7137]
Modules linked in: bonding tls rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink pseries_rng vmx_crypto drm drm_panel_orientation_quirks xfs libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 278 PID: 7137 Comm: hackbench Kdump: loaded Tainted: G             L     6.9.0-rc1+ #42
Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1050.00 (NM1050_052) hv:phyp pSeries
NIP:  c000000000037fbc LR: c000000000038324 CTR: c0000000001a8548
REGS: c0000003de72fbb8 TRAP: 0900   Tainted: G             L      (6.9.0-rc1+)
MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28002222  XER: 20040000
CFAR: 0000000000000000 IRQMASK: 0
GPR00: c000000000038324 c0000003de72fb90 c000000001973e00 c0000003de72fb88
GPR04: 0000000000240080 0000000000000007 0010000000000000 c000000002220090
GPR08: 4000000000000002 0000000000000049 c0000003f1dcff00 0000000000002000
GPR12: c0000000001a8548 c000001fff72d080 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000002002000
GPR24: 0000000000000001 0000000000000000 0000000002802000 0000000000000002
GPR28: 0000000000000003 fcffffffffffffff fcffffffffffffff c0000003f1dcff00
NIP [c000000000037fbc] __replay_soft_interrupts+0x3c/0x154
LR [c000000000038324] arch_local_irq_restore.part.0+0x1cc/0x214
Call Trace:
[c0000003de72fb90] [c000000000038020] __replay_soft_interrupts+0xa0/0x154 (unreliable)
[c0000003de72fd40] [c000000000038324] arch_local_irq_restore.part.0+0x1cc/0x214
[c0000003de72fd90] [c000000000030268] interrupt_exit_user_prepare_main+0x19c/0x274
[c0000003de72fe00] [c0000000000304e0] syscall_exit_prepare+0x1a0/0x1c8
[c0000003de72fe50] [c00000000000cee8] system_call_vectored_common+0x168/0x2ec


+mpe, nick