Message-Id: <4110e533-6cab-4845-bd11-11279ebc9150@app.fastmail.com>
Date: Tue, 10 Jun 2025 16:09:52 +0300
From: "Leon Romanovsky" <leon@...nel.org>
To: "K Prateek Nayak" <kprateek.nayak@....com>
Cc: "Steve Wahl" <steve.wahl@....com>, "Ingo Molnar" <mingo@...hat.com>,
"Peter Zijlstra" <peterz@...radead.org>,
"Juri Lelli" <juri.lelli@...hat.com>,
"Vincent Guittot" <vincent.guittot@...aro.org>,
"Dietmar Eggemann" <dietmar.eggemann@....com>,
"Steven Rostedt" <rostedt@...dmis.org>, "Ben Segall" <bsegall@...gle.com>,
"Mel Gorman" <mgorman@...e.de>, "Valentin Schneider" <vschneid@...hat.com>,
linux-kernel@...r.kernel.org, "Vishal Chourasia" <vishalc@...ux.ibm.com>,
samir <samir@...ux.ibm.com>, "Naman Jain" <namjain@...ux.microsoft.com>,
"Saurabh Singh Sengar" <ssengar@...ux.microsoft.com>, srivatsa@...il.mit.edu,
"Michael Kelley" <mhklinux@...look.com>, "Russ Anderson" <rja@....com>,
"Dimitri Sivanich" <sivanich@....com>
Subject: Re: [PATCH v4 1/2] sched/topology: improve topology_span_sane speed

On Tue, Jun 10, 2025, at 15:36, Leon Romanovsky wrote:
> On Tue, Jun 10, 2025 at 05:03:14PM +0530, K Prateek Nayak wrote:
>> Hello Leon,
>>
>> On 6/10/2025 4:37 PM, Leon Romanovsky wrote:
>>
>> [..snip..]
>>
>> > > + if (WARN_ON(!topology_span_sane(cpu_map)))
>> > > + goto error;
>> >
>> > Hi,
>> >
>> > This WARN_ON() generates the following splat in our regression runs on VMs.
>> >
>> > [ 0.408379] ------------[ cut here ]------------
>> > [ 0.409097] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
>> > [ 0.410797] Modules linked in:
>> > [ 0.411453] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE
>> > [ 0.413353] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>> > [ 0.415440] RIP: 0010:build_sched_domains+0xe67/0x13a0
>> > [ 0.416458] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
>> > [ 0.417662] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
>> > [ 0.418686] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
>> > [ 0.419982] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
>> > [ 0.421166] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
>> > [ 0.422514] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
>> > [ 0.423820] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
>> > [ 0.425164] FS: 0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
>> > [ 0.426751] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [ 0.427832] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
>> > [ 0.428818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > [ 0.430131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> > [ 0.431429] Call Trace:
>> > [ 0.431983] <TASK>
>> > [ 0.432500] sched_init_smp+0x32/0xa0
>> > [ 0.433069] ? stop_machine+0x2c/0x40
>> > [ 0.433821] kernel_init_freeable+0xf5/0x260
>> > [ 0.434682] ? rest_init+0xc0/0xc0
>> > [ 0.435399] kernel_init+0x16/0x120
>> > [ 0.436140] ret_from_fork+0x5e/0xd0
>> > [ 0.436817] ? rest_init+0xc0/0xc0
>> > [ 0.437526] ret_from_fork_asm+0x11/0x20
>> > [ 0.438335] </TASK>
>> > [ 0.438841] ---[ end trace 0000000000000000 ]---
>>
>> Would it be possible for you to boot the guest with "sched_verbose" in
>> kernel cmdline and attach the full dmesg? Thanks in advance.
>
> I'll try, but I can't promise, given how this kernel is run in
> our systems.

[ 0.032233] [mem 0xc0000000-0xfed1bfff] available for PCI devices
[ 0.032237] Booting paravirtualized kernel on KVM
[ 0.032238] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[ 0.036921] setup_percpu: NR_CPUS:512 nr_cpumask_bits:10 nr_cpu_ids:10 nr_node_ids:5
[ 0.038074] percpu: Embedded 53 pages/cpu s177240 r8192 d31656 u1048576
[ 0.038108] Kernel command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 root=UUID=49650207-5673-41e8-9f3b-5572de97a271 ro selinux=0 kasan_multi_shot net.ifnames=0 biosdevname=0 console=tty0 console=ttyS1,115200 audit=0 systemd.unified_cgroup_hierarchy=0 sched_verbose
[ 0.038222] Unknown kernel command line parameters "kasan_multi_shot BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 selinux=0 biosdevname=0 audit=0", will be passed to user space.
[ 0.038235] random: crng init done
[ 0.038235] printk: log_buf_len individual max cpu contribution: 4096 bytes
[ 0.038236] printk: log_buf_len total cpu_extra contributions: 36864 bytes
[ 0.038237] printk: log_buf_len min size: 65536 bytes
[ 0.038330] printk: log buffer data + meta data: 131072 + 458752 = 589824 bytes
[ 0.038331] printk: early log buf free: 56792(86%)
[ 0.038452] software IO TLB: area num 16.
[ 0.049552] Fallback order for Node 0: 0 4 3 2 1
[ 0.049556] Fallback order for Node 1: 1 4 3 2 0
[ 0.049559] Fallback order for Node 2: 2 4 3 0 1
[ 0.049561] Fallback order for Node 3: 3 4 1 0 2
[ 0.049563] Fallback order for Node 4: 4 0 1 2 3
[ 0.049569] Built 5 zonelists, mobility grouping on. Total pages: 3932026
[ 0.049570] Policy zone: Normal
[ 0.049571] mem auto-init: stack:off, heap alloc:off, heap free:off
[ 0.073214] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=10, Nodes=5
[ 0.082959] ftrace: allocating 46168 entries in 182 pages
[ 0.082961] ftrace: allocated 182 pages with 5 groups
[ 0.083102] rcu: Hierarchical RCU implementation.
[ 0.083102] rcu: RCU restricting CPUs from NR_CPUS=512 to nr_cpu_ids=10.
[ 0.083104] Rude variant of Tasks RCU enabled.
[ 0.083104] Tracing variant of Tasks RCU enabled.
[ 0.083105] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[ 0.083106] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=10
[ 0.083115] RCU Tasks Rude: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
[ 0.083117] RCU Tasks Trace: Setting shift to 4 and lim to 1 rcu_task_cb_adjust=1 rcu_task_cpu_ids=10.
[ 0.089643] NR_IRQS: 33024, nr_irqs: 504, preallocated irqs: 16
[ 0.089831] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[ 0.100835] Console: colour VGA+ 80x25
[ 0.100838] printk: legacy console [tty0] enabled
[ 0.132452] printk: legacy console [ttyS1] enabled
[ 0.221725] ACPI: Core revision 20250404
[ 0.222382] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
[ 0.223635] APIC: Switch to symmetric I/O mode setup
[ 0.224298] kvm-guest: APIC: send_IPI_mask() replaced with kvm_send_ipi_mask()
[ 0.225262] kvm-guest: APIC: send_IPI_mask_allbutself() replaced with kvm_send_ipi_mask_allbutself()
[ 0.226436] kvm-guest: setup PV IPIs
[ 0.227740] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.228537] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2563bd843df, max_idle_ns: 440795257314 ns
[ 0.229871] Calibrating delay loop (skipped) preset value.. 5187.80 BogoMIPS (lpj=10375616)
[ 0.231044] x86/cpu: User Mode Instruction Prevention (UMIP) activated
[ 0.234092] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[ 0.234805] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[ 0.235598] Speculative Store Bypass: Vulnerable
[ 0.236229] GDS: Unknown: Dependent on hypervisor status
[ 0.236955] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[ 0.237871] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[ 0.238713] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[ 0.239535] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[ 0.240439] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[ 0.241219] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[ 0.242085] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[ 0.242927] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[ 0.243794] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.244595] x86/fpu: xstate_offset[3]: 832, xstate_sizes[3]: 64
[ 0.245401] x86/fpu: xstate_offset[4]: 896, xstate_sizes[4]: 64
[ 0.246078] x86/fpu: xstate_offset[5]: 960, xstate_sizes[5]: 64
[ 0.249871] x86/fpu: xstate_offset[6]: 1024, xstate_sizes[6]: 512
[ 0.250683] x86/fpu: xstate_offset[7]: 1536, xstate_sizes[7]: 1024
[ 0.251500] x86/fpu: Enabled xstate features 0xff, context size is 2560 bytes, using 'compacted' format.
[ 0.253380] Freeing SMP alternatives memory: 48K
[ 0.253876] pid_max: default: 32768 minimum: 301
[ 0.254516] LSM: initializing lsm=capability
[ 0.255115] stackdepot: allocating hash table of 1048576 entries via kvcalloc
[ 0.262981] Dentry cache hash table entries: 2097152 (order: 12, 16777216 bytes, vmalloc hugepage)
[ 0.265481] Inode-cache hash table entries: 1048576 (order: 11, 8388608 bytes, vmalloc hugepage)
[ 0.266233] Mount-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
[ 0.267255] Mountpoint-cache hash table entries: 32768 (order: 6, 262144 bytes, vmalloc)
[ 0.268594] smpboot: CPU0: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (family: 0x6, model: 0x55, stepping: 0x7)
[ 0.269870] Performance Events: Skylake events, full-width counters, Intel PMU driver.
[ 0.269870] ... version: 2
[ 0.269870] ... bit width: 48
[ 0.269870] ... generic registers: 4
[ 0.269873] ... value mask: 0000ffffffffffff
[ 0.270548] ... max period: 00007fffffffffff
[ 0.271220] ... fixed-purpose events: 3
[ 0.271763] ... event mask: 000000070000000f
[ 0.272574] signal: max sigframe size: 3216
[ 0.273155] rcu: Hierarchical SRCU implementation.
[ 0.273773] rcu: Max phase no-delay instances is 1000.
[ 0.274097] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
[ 0.275329] smp: Bringing up secondary CPUs ...
[ 0.276031] smpboot: x86: Booting SMP configuration:
[ 0.276689] .... node #0, CPUs: #1
[ 0.277528] .... node #1, CPUs: #2 #3
[ 0.278084] .... node #2, CPUs: #4 #5
[ 0.279023] .... node #3, CPUs: #6 #7
[ 0.279946] .... node #4, CPUs: #8 #9
[ 0.313886] smp: Brought up 5 nodes, 10 CPUs
[ 0.315058] smpboot: Total of 10 processors activated (51878.08 BogoMIPS)
[ 0.316713] ------------[ cut here ]------------
[ 0.316713] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2486 build_sched_domains+0xe67/0x13a0
[ 0.318187] Modules linked in:
[ 0.318619] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0-rc1_for_upstream_min_debug_2025_06_09_14_44 #1 NONE
[ 0.319928] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 0.321286] RIP: 0010:build_sched_domains+0xe67/0x13a0
[ 0.321873] Code: ff ff 8b 6c 24 08 48 8b 44 24 68 65 48 2b 05 60 24 d0 01 0f 85 03 05 00 00 48 83 c4 70 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 65 fe ff ff 48 c7 c7 28 fb 08 82 4c 89 44 24 28 c6 05 e4
[ 0.324099] RSP: 0000:ffff8881002efe30 EFLAGS: 00010202
[ 0.324779] RAX: 00000000ffffff01 RBX: 0000000000000002 RCX: 00000000ffffff01
[ 0.325659] RDX: 00000000fffffff6 RSI: 0000000000000300 RDI: ffff888100047168
[ 0.326109] RBP: 0000000000000000 R08: ffff888100047168 R09: 0000000000000000
[ 0.326989] R10: ffffffff830dee80 R11: 0000000000000000 R12: ffff888100047168
[ 0.327868] R13: 0000000000000002 R14: ffff888100193480 R15: ffff888380030f40
[ 0.328743] FS: 0000000000000000(0000) GS:ffff8881b9b76000(0000) knlGS:0000000000000000
[ 0.329772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.330069] CR2: ffff88843ffff000 CR3: 000000000282c001 CR4: 0000000000370eb0
[ 0.330973] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 0.331858] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 0.332740] Call Trace:
[ 0.333111] <TASK>
[ 0.333453] sched_init_smp+0x32/0xa0
[ 0.333877] ? stop_machine+0x2c/0x40
[ 0.334382] kernel_init_freeable+0xf5/0x260
[ 0.334954] ? rest_init+0xc0/0xc0
[ 0.335423] kernel_init+0x16/0x120
[ 0.335907] ret_from_fork+0x5e/0xd0
[ 0.336396] ? rest_init+0xc0/0xc0
[ 0.336866] ret_from_fork_asm+0x11/0x20
[ 0.337409] </TASK>
[ 0.337755] ---[ end trace 0000000000000000 ]---
[ 0.338089] Memory: 15307024K/15728104K available (14320K kernel code, 2394K rwdata, 9212K rodata, 1668K init, 1272K bss, 371220K reserved, 0K cma-reserved)
[ 0.340215] devtmpfs: initialized
[ 0.341149] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[ 0.342235] posixtimers hash table entries: 8192 (order: 5, 131072 bytes, vmalloc)
[ 0.343256] futex hash table entries: 512 (32768 bytes on 5 NUMA nodes, total 160 KiB, linear).
[ 0.346367] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[ 0.347279] thermal_sys: Registered thermal governor 'step_wise'
[ 0.347288] cpuidle: using governor ladder
[ 0.348603] cpuidle: using governor menu
[ 0.349254] PCI: ECAM [mem 0xb0000000-0xbfffffff] (base 0xb0000000) for domain 0000 [bus 00-ff]
[ 0.350190] PCI: ECAM [mem 0xb0000000-0xbfffffff] reserved as E820 entry
[ 0.351025] PCI: Using configuration type 1 for base access
[ 0.351822] kprobes: kprobe jump-optimization is enabled. All kprobes are optimized if possible.
[ 0.381999] HugeTLB: allocation took 0ms with hugepage_allocation_threads=2
[ 0.393902] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
[ 0.394769] HugeTLB: 28 KiB vmemmap can be freed for a 2.00 MiB page
[ 0.402159] ACPI: Added _OSI(Module Device)
[ 0.402744] ACPI: Added _OSI(Processor Device)
[ 0.403326] ACPI: Added _OSI(Processor Aggregator Device)
[ 0.404648] ACPI: 1 ACPI AML tables successfully acquired and loaded
[ 0.405807] ACPI: Interpreter enabled

Thanks
>
> Thanks
>
>>
>> --
>> Thanks and Regards,
>> Prateek
>>
>> >
>> > Thanks
>> >
>> > > +
>> > > /* Build the groups for the domains */
>> > > for_each_cpu(i, cpu_map) {
>> > > for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>> > > --
>> > > 2.26.2
>> > >
>>