Message-ID: <aEiKCqoaEWnZvlCI@swahl-home.5wahls.com>
Date: Tue, 10 Jun 2025 14:39:54 -0500
From: Steve Wahl <steve.wahl@....com>
To: Leon Romanovsky <leon@...nel.org>
Cc: K Prateek Nayak <kprateek.nayak@....com>, Steve Wahl <steve.wahl@....com>,
Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
linux-kernel@...r.kernel.org, Vishal Chourasia <vishalc@...ux.ibm.com>,
samir <samir@...ux.ibm.com>, Naman Jain <namjain@...ux.microsoft.com>,
Saurabh Singh Sengar <ssengar@...ux.microsoft.com>,
srivatsa@...il.mit.edu, Michael Kelley <mhklinux@...look.com>,
Russ Anderson <rja@....com>, Dimitri Sivanich <sivanich@....com>
Subject: Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed

On Tue, Jun 10, 2025 at 04:09:52PM +0300, Leon Romanovsky wrote:
>
>
> On Tue, Jun 10, 2025, at 15:36, Leon Romanovsky wrote:
> > On Tue, Jun 10, 2025 at 05:03:14PM +0530, K Prateek Nayak wrote:
> >> Hello Leon,
> >>
> >> On 6/10/2025 4:37 PM, Leon Romanovsky wrote:
> >>
> >> [..snip..]
> >>
> >> > > + if (WARN_ON(!topology_span_sane(cpu_map)))
> >> > > + goto error;
> >> >
> >> > Hi,
> >> >
> >> > This WARN_ON() generates the following splat in our regression over VMs.
> >> >
> >> > [ 0.408379] ------------[ cut here ]------------
> >> > [ 0.409097] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
> >> > [ 0.410797] Modules linked in:
> >> > [ 0.411453] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE
> >> > [ 0.413353] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> >> > [ 0.415440] RIP: 0010:build_sched_domains+0xe67/0x13a0
> >> > [ 0.416458] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
> >> > [ 0.417662] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
> >> > [ 0.418686] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
> >> > [ 0.419982] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
> >> > [ 0.421166] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
> >> > [ 0.422514] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
> >> > [ 0.423820] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
> >> > [ 0.425164] FS: 0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
> >> > [ 0.426751] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> > [ 0.427832] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
> >> > [ 0.428818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> > [ 0.430131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >> > [ 0.431429] Call Trace:
> >> > [ 0.431983] <TASK>
> >> > [ 0.432500] sched_init_smp+0x32/0xa0
> >> > [ 0.433069] ? stop_machine+0x2c/0x40
> >> > [ 0.433821] kernel_init_freeable+0xf5/0x260
> >> > [ 0.434682] ? rest_init+0xc0/0xc0
> >> > [ 0.435399] kernel_init+0x16/0x120
> >> > [ 0.436140] ret_from_fork+0x5e/0xd0
> >> > [ 0.436817] ? rest_init+0xc0/0xc0
> >> > [ 0.437526] ret_from_fork_asm+0x11/0x20
> >> > [ 0.438335] </TASK>
> >> > [ 0.438841] ---[ end trace 0000000000000000 ]---
> >>
> >> Would it be possible for you to boot the guest with "sched_verbose" in
> >> kernel cmdline and attach the full dmesg? Thanks in advance.
> >
> > > I'll try, but can't promise due to how this kernel is being run in
> > > our systems.
>
>
>
> [ 0.032233] [mem 0xc0000000-0xfed1bfff] available for PCI devices
> [ 0.032237] Booting paravirtualized kernel on KVM
> [ 0.032238] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
> [ 0.036921] setup_percpu: NR_CPUS:512 nr_cpumask_bits:10 nr_cpu_ids:10 nr_node_ids:5
> [ 0.038074] percpu: Embedded 53 pages/cpu s177240 r8192 d31656 u1048576
> [ 0.038108] Kernel command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 root=UUID=49650207-5673-41e8-9f3b-5572de97a271 ro selinux=0 kasan_multi_shot net.ifnames=0 biosdevname=0 console=tty0 console=ttyS1,115200 audit=0 systemd.unified_cgroup_hierarchy=0 sched_verbose
> [ 0.038222] Unknown kernel command line parameters "kasan_multi_shot BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 selinux=0 biosdevname=0 audit=0", will be passed to user space.
> [ 0.038235] random: crng init done
> [ 0.038235] printk: log_buf_len individual max cpu contribution: 4096 bytes
> [ 0.038236] printk: log_buf_len total cpu_extra contributions: 36864 bytes
> [ 0.038237] printk: log_buf_len min size: 65536 bytes
> [ 0.038330] printk: log buffer data + meta data: 131072 + 458752 = 589824 bytes
> [ 0.038331] printk: early log buf free: 56792(86%)
> [ 0.038452] software IO TLB: area num 16.
> [ 0.049552] Fallback order for Node 0: 0 4 3 2 1
> [ 0.049556] Fallback order for Node 1: 1 4 3 2 0
> [ 0.049559] Fallback order for Node 2: 2 4 3 0 1
> [ 0.049561] Fallback order for Node 3: 3 4 1 0 2
> [ 0.049563] Fallback order for Node 4: 4 0 1 2 3
> [ 0.049569] Built 5 zonelists, mobility grouping on. Total pages: 3932026
> [ 0.049570] Policy zone: Normal
> [ 0.049571] mem auto-init: stack:off, heap alloc:off, heap free:off
> [ 0.073214] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=10, Nodes=5
> [ 0.082959] ftrace: allocating 46168 entries in 182 pages
> [ 0.082961] ftrace: allocated 182 pages with 5 groups
> [ 0.083102] rcu: Hierarchical RCU implementation.
> [ 0.083102] rcu: RCU restricting CPUs from NR_CPUS=512 to nr_cpu_ids=10.
> [ 0.083104] Rude variant of Tasks RCU enabled.
> [ 0.083104] Tracing variant of Tasks RCU enabled.
> [ 0.083105] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
> [ 0.083106] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=10
> [ 0.083115] RCU Tasks Rude: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
> [ 0.083117] RCU Tasks Trace: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
> [ 0.089643] NR_IRQS: 33024, nr_irqs: 504, preallocated irqs: 16
> [ 0.089831] rcu: srcu_init: Setting srcu_struct sizes based on contention.
> [ 0.100835] Console: colour VGA+ 80x25
> [ 0.100838] printk: legacy console [tty0] enabled
> [ 0.132452] printk: legacy console [ttyS1] enabled
> [ 0.221725] ACPI: Core revision 20250404
> [ 0.222382] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
> [ 0.223635] APIC: Switch to symmetric I/O mode setup
> [ 0.224298] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
> [ 0.225262] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
> [ 0.226436] kvm-guest: setup PV IPIs
> [ 0.227740] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 0.228537] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2563bd843df, max_idle_ns: 440795257314 ns
> [ 0.229871] Calibrating delay loop (skipped) preset value.. 5187.80 BogoMIPS (lpj=10375616)
> [ 0.231044] x86/cpu: User Mode Instruction Prevention (UMIP) activated
> [ 0.234092] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
> [ 0.234805] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
> [ 0.235598] Speculative Store Bypass: Vulnerable
> [ 0.236229] GDS: Unknown: Dependent on hypervisor status
> [ 0.236955] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
> [ 0.237871] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> [ 0.238713] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> [ 0.239535] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
> [ 0.240439] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
> [ 0.241219] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
> [ 0.242085] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
> [ 0.242927] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
> [ 0.243794] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
> [ 0.244595] x86/fpu: xstate_offset[3]: 832, xstate_sizes[3]: 64
> [ 0.245401] x86/fpu: xstate_offset[4]: 896, xstate_sizes[4]: 64
> [ 0.246078] x86/fpu: xstate_offset[5]: 960, xstate_sizes[5]: 64
> [ 0.249871] x86/fpu: xstate_offset[6]: 1024, xstate_sizes[6]: 512
> [ 0.250683] x86/fpu: xstate_offset[7]: 1536, xstate_sizes[7]: 1024
> [ 0.251500] x86/fpu: Enabled xstate features 0xff, context size is 2560 bytes, using 'compacted' format.
> [ 0.253380] Freeing SMP alternatives memory: 48K
> [ 0.253876] pid_max: default: 32768 minimum: 301
> [ 0.254516] LSM: initializing lsm=capability
> [ 0.255115] stackdepot: allocating hash table of 1048576 entries via kvcalloc
> [ 0.262981] Dentry cache hash table entries: 2097152 (order: 12, 16777216 bytes, vmalloc hugepage)
> [ 0.265481] Inode-cache hash table entries: 1048576 (order: 11, 8388608 bytes, vmalloc hugepage)
> [ 0.266233] Mount-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
> [ 0.267255] Mountpoint-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
> [ 0.268594] smpboot: CPU0: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (family: 0x6, model: 0x55, stepping: 0x7)
> [ 0.269870] Performance Events: Skylake events, full-width counters, Intel PMU driver.
> [ 0.269870] ... version: 2
> [ 0.269870] ... bit width: 48
> [ 0.269870] ... generic registers: 4
> [ 0.269873] ... value mask: 0000ffffffffffff
> [ 0.270548] ... max period: 00007fffffffffff
> [ 0.271220] ... fixed-purpose events: 3
> [ 0.271763] ... event mask: 000000070000000f
> [ 0.272574] signal: max sigframe size: 3216
> [ 0.273155] rcu: Hierarchical SRCU implementation.
> [ 0.273773] rcu: Max phase no-delay instances is 1000.
> [ 0.274097] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
> [ 0.275329] smp: Bringing up secondary CPUs ...
> [ 0.276031] smpboot: x86: Booting SMP configuration:
> [ 0.276689] .... node #0, CPUs: #1
> [ 0.277528] .... node #1, CPUs: #2 #3
> [ 0.278084] .... node #2, CPUs: #4 #5
> [ 0.279023] .... node #3, CPUs: #6 #7
> [ 0.279946] .... node #4, CPUs: #8 #9
> [ 0.313886] smp: Brought up 5 nodes, 10 CPUs
> [ 0.315058] smpboot: Total of 10 processors activated (51878.08 BogoMIPS)
> [ 0.316713] ------------[ cut here ]------------
> [ 0.316713] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
> [ 0.318187] Modules linked in:
> [ 0.318619] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE
> [ 0.319928] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> [ 0.321286] RIP: 0010:build_sched_domains+0xe67/0x13a0
> [ 0.321873] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
> [ 0.324099] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
> [ 0.324779] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
> [ 0.325659] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
> [ 0.326109] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
> [ 0.326989] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
> [ 0.327868] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
> [ 0.328743] FS: 0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
> [ 0.329772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.330069] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
> [ 0.330973] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 0.331858] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 0.332740] Call Trace:
> [ 0.333111] <TASK>
> [ 0.333453] sched_init_smp+0x32/0xa0
> [ 0.333877] ? stop_machine+0x2c/0x40
> [ 0.334382] kernel_init_freeable+0xf5/0x260
> [ 0.334954] ? rest_init+0xc0/0xc0
> [ 0.335423] kernel_init+0x16/0x120
> [ 0.335907] ret_from_fork+0x5e/0xd0
> [ 0.336396] ? rest_init+0xc0/0xc0
> [ 0.336866] ret_from_fork_asm+0x11/0x20
> [ 0.337409] </TASK>
> [ 0.337755] ---[ end trace 0000000000000000 ]---
> [ 0.338089] Memory: 15307024K/15728104K available (14320K kernel code, 2394K rwdata, 9212K rodata, 1668K init, 1272K bss, 371220K reserved, 0K cma-reserved)
> [ 0.340215] devtmpfs: initialized
> [ 0.341149] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
> [ 0.342235] posixtimers hash table entries: 8192 (order: 5, 131072 bytes, vmalloc)
> [ 0.343256] futex hash table entries: 512 (32768 bytes on 5 NUMA nodes, total 160 KiB, linear).
> [ 0.346367] NET: Registered PF_NETLINK/PF_ROUTE protocol family
> [ 0.347279] thermal_sys: Registered thermal governor 'step_wise'
> [ 0.347288] cpuidle: using governor ladder
> [ 0.348603] cpuidle: using governor menu
> [ 0.349254] PCI: ECAM [mem 0xb0000000-0xbfffffff] (base 0xb0000000) for domain 0000 [bus 00-ff]
> [ 0.350190] PCI: ECAM [mem 0xb0000000-0xbfffffff] reserved as E820 entry
> [ 0.351025] PCI: Using configuration type 1 for base access
> [ 0.351822] kprobes: kprobe jump-optimization is enabled. All kprobes are optimized if possible.
> [ 0.381999] HugeTLB: allocation took 0ms with hugepage_allocation_threads=2
> [ 0.393902] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
> [ 0.394769] HugeTLB: 28 KiB vmemmap can be freed for a 2.00 MiB page
> [ 0.402159] ACPI: Added _OSI(Module Device)
> [ 0.402744] ACPI: Added _OSI(Processor Device)
> [ 0.403326] ACPI: Added _OSI(Processor Aggregator Device)
> [ 0.404648] ACPI: 1 ACPI AML tables successfully acquired and loaded
> [ 0.405807] ACPI: Interpreter enabled
>
> Thanks

I don't think that's the full dmesg output; maybe it's a console capture
with a reduced log level? I'm not finding the output of sched_domain_debug()
and sched_domain_debug_one() here.

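As background for why the per-CPU span output matters: the property
topology_span_sane() is meant to enforce on non-NUMA topology levels is that
the masks of any two CPUs at a given level are either identical or completely
disjoint. Below is a minimal user-space sketch of that property only. It
assumes plain 64-bit masks instead of struct cpumask and a naive pairwise
comparison; span_sane() and its parameters are hypothetical names for
illustration, and this is not the faster implementation the patch introduces.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for one topology level: mask[i] is the set of
 * CPUs (one bit per CPU) that CPU i reports sharing this level with.
 * Returns false if any two masks partially overlap, i.e. they are
 * neither identical nor disjoint.
 */
static bool span_sane(const uint64_t *mask, int ncpus)
{
	for (int i = 0; i < ncpus; i++) {
		for (int j = i + 1; j < ncpus; j++) {
			bool equal = mask[i] == mask[j];
			bool overlap = (mask[i] & mask[j]) != 0;

			/* Partial overlap: the level is inconsistent. */
			if (!equal && overlap)
				return false;
		}
	}
	return true;
}
```

With four CPUs, masks {0x3, 0x3, 0xc, 0xc} pass the check, while
{0x3, 0x7, 0xc, 0xc} fail it because CPU 0 and CPU 1 overlap without being
equal. Seeing the real spans in the sched_verbose output would show which
level and which CPUs disagree in this guest.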
Thanks,
Steve Wahl
> >
> > Thanks
> >
> >>
> >> --
> >> Thanks and Regards,
> >> Prateek
> >>
> >> >
> >> > Thanks
> >> >
> >> > > +
> >> > > /* Build the groups for the domains */
> >> > > for_each_cpu(i, cpu_map) {
> >> > > for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> >> > > --
> >> > > 2.26.2
> >> > >
> >>
--
Steve Wahl, Hewlett Packard Enterprise