[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d9a3eb80-abaa-4453-a1ec-cc1200c58481@alu.unizg.hr>
Date: Sat, 9 Mar 2024 17:13:43 +0100
From: Mirsad Todorovac <mirsad.todorovac@....unizg.hr>
To: Sean Anderson <sean.anderson@...o.com>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [BUG] KCSAN: data-race in tick_nohz_idle_stop_tick /
tick_nohz_next_event
Hi, Sean,
I am reducing Cc: so if you could please attach the decode_stacktrace.sh output of your build.
Thanks,
Mirsad
On 3/7/24 20:43, Sean Anderson wrote:
> On 8/18/23 10:15, Mirsad Todorovac wrote:
>>
>>
>> On 8/17/23 21:32, Mirsad Todorovac wrote:
>>> Hi,
>>>
>>> This is your friendly bug reporter. :-)
>>>
>>> The environment is the vanilla torvalds kernel on Ubuntu 22.04 LTS and a Ryzen 7950X assembled box.
>>>
>>> [ 7145.500079] ==================================================================
>>> [ 7145.500109] BUG: KCSAN: data-race in tick_nohz_idle_stop_tick / tick_nohz_next_event
>>>
>>> [ 7145.500139] write to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 26:
>>> [ 7145.500155] tick_nohz_idle_stop_tick+0x3b1/0x4a0
>>> [ 7145.500173] do_idle+0x1e3/0x250
>>> [ 7145.500186] cpu_startup_entry+0x20/0x30
>>> [ 7145.500199] start_secondary+0x129/0x160
>>> [ 7145.500216] secondary_startup_64_no_verify+0x17e/0x18b
>>>
>>> [ 7145.500245] read to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 16:
>>> [ 7145.500261] tick_nohz_next_event+0xe7/0x1e0
>>> [ 7145.500277] tick_nohz_get_sleep_length+0xa7/0xe0
>>> [ 7145.500293] menu_select+0x82/0xb90
>>> [ 7145.500309] cpuidle_select+0x44/0x60
>>> [ 7145.500324] do_idle+0x1c2/0x250
>>> [ 7145.500336] cpu_startup_entry+0x20/0x30
>>> [ 7145.500350] start_secondary+0x129/0x160
>>> [ 7145.500367] secondary_startup_64_no_verify+0x17e/0x18b
>>>
>>> [ 7145.500392] value changed: 0x0000001a -> 0xffffffff
>>>
>>> [ 7145.500411] Reported by Kernel Concurrency Sanitizer on:
>>> [ 7145.500421] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G L 6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
>>> [ 7145.500439] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>>> [ 7145.500449] ==================================================================
>>>
>>> Please find dmesg log (what was available in the ring buffer) and the lshw output attached.
>>>
>>> NOTE: There are no proprietary drivers loaded and Canonical livepatch isn't working for this kernel,
>>> probably previous ACPI KCSAN BUGs turn this taint on.
>>>
>>> Thank you very much for your time evaluating this report.
>>
>> P.S.
>>
>> According to Mr. Heo, I will add the decoded stacktrace to this bug report, which ought to have been done
>> in the first place :-/
>>
>> [ 7145.500079] ==================================================================
>> [ 7145.500109] BUG: KCSAN: data-race in tick_nohz_idle_stop_tick / tick_nohz_next_event
>>
>> [ 7145.500139] write to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 26:
>> [ 7145.500155] tick_nohz_idle_stop_tick (kernel/time/tick-sched.c:904 kernel/time/tick-sched.c:1130)
>> [ 7145.500173] do_idle (kernel/sched/idle.c:215 kernel/sched/idle.c:282)
>> [ 7145.500186] cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
>> [ 7145.500199] start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
>> [ 7145.500216] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:441)
>>
>> [ 7145.500245] read to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 16:
>> [ 7145.500261] tick_nohz_next_event (kernel/time/tick-sched.c:868)
>> [ 7145.500277] tick_nohz_get_sleep_length (kernel/time/tick-sched.c:1250)
>> [ 7145.500293] menu_select (drivers/cpuidle/governors/menu.c:283)
>> [ 7145.500309] cpuidle_select (drivers/cpuidle/cpuidle.c:359)
>> [ 7145.500324] do_idle (kernel/sched/idle.c:208 kernel/sched/idle.c:282)
>> [ 7145.500336] cpu_startup_entry (kernel/sched/idle.c:378 (discriminator 1))
>> [ 7145.500350] start_secondary (arch/x86/kernel/smpboot.c:210 arch/x86/kernel/smpboot.c:294)
>> [ 7145.500367] secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:441)
>>
>> [ 7145.500392] value changed: 0x0000001a -> 0xffffffff
>>
>> [ 7145.500411] Reported by Kernel Concurrency Sanitizer on:
>> [ 7145.500421] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G L 6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
>> [ 7145.500439] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>> [ 7145.500449] ==================================================================
>>
>> The code seems to be:
>>
>>
>> /*
>> * If this CPU is the one which updates jiffies, then give up
>> * the assignment and let it be taken by the CPU which runs
>> * the tick timer next, which might be this CPU as well. If we
>> * don't drop this here the jiffies might be stale and
>> * do_timer() never invoked. Keep track of the fact that it
>> * was the one which had the do_timer() duty last.
>> */
>> if (cpu == tick_do_timer_cpu) {
>> tick_do_timer_cpu = TICK_DO_TIMER_NONE;
>> → ts->do_timer_last = 1;
>> } else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
>> ts->do_timer_last = 0;
>>
>> }
>>
>> but I am not sure why this happens.
>>
>> Maybe a WRITE_ONCE(ts->do_timer_last, 1) is required?
>>
>> Or is it another KCSAN false positive ...
>
> I have seen a few more KCSAN reports regarding this variable:
>
> [ 153.285199] ==================================================================
> [ 153.292529] BUG: KCSAN: data-race in tick_nohz_next_event+0x15c/0x29c
> [ 153.299118]
> [ 153.300685] race at unknown origin, with read to 0xffffffc082aa1048 of 4 bytes by task 0 on cpu 2:
> [ 153.309760] tick_nohz_next_event+0x15c/0x29c
> [ 153.314256] tick_nohz_get_sleep_length+0xd8/0x118
> [ 153.319186] menu_select+0x1e0/0xc08
> [ 153.322875] cpuidle_select+0x4c/0x60
> [ 153.326686] do_idle+0x1e0/0x2b4
> [ 153.330062] cpu_startup_entry+0x34/0x3c
> [ 153.334133] secondary_start_kernel+0x1e8/0x21c
> [ 153.338821] __secondary_switched+0xb8/0xbc
> [ 153.343143]
> [ 153.344706] no locks held by swapper/2/0.
> [ 153.348802] irq event stamp: 129586
> [ 153.352370] hardirqs last enabled at (129585): [<ffffffc081acaa2c>] el1_interrupt+0x40/0x50
> [ 153.360964] hardirqs last disabled at (129586): [<ffffffc0800d8fec>] do_idle+0xc8/0x2b4
> [ 153.369132] softirqs last enabled at (129560): [<ffffffc08001071c>] __do_softirq+0x47c/0x500
> [ 153.377786] softirqs last disabled at (129549): [<ffffffc080017134>] ____do_softirq+0x10/0x1c
> [ 153.386457]
> [ 153.388019] value changed: 0xffffffff -> 0x00000001
> [ 153.392993]
> [ 153.394555] Reported by Kernel Concurrency Sanitizer on:
> [ 153.399947] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.6.20+ #102
> [ 153.406595] Hardware name: xlnx,zynqmp (DT)
> [ 153.410866] ==================================================================
>
> and:
>
> [ 75.362055] ==================================================================
> [ 75.369461] BUG: KCSAN: data-race in tick_sched_do_timer / tick_sched_do_timer
> [ 75.376854]
> [ 75.378426] write to 0xffffffc082aa1048 of 4 bytes by interrupt on cpu 1:
> [ 75.385335] tick_sched_do_timer+0x124/0x168
> [ 75.389744] tick_sched_timer+0x48/0xc8
> [ 75.393720] __hrtimer_run_queues+0x200/0x5f0
> [ 75.398198] hrtimer_interrupt+0x1b0/0x3f4
> [ 75.402417] arch_timer_handler_phys+0x40/0x54
> [ 75.406983] handle_percpu_devid_irq+0xf8/0x20c
> [ 75.411679] generic_handle_domain_irq+0x48/0x64
> [ 75.416435] gic_handle_irq+0xc4/0x100
> [ 75.420298] call_on_irq_stack+0x24/0x4c
> [ 75.424360] do_interrupt_handler+0xc0/0xc4
> [ 75.428692] el1_interrupt+0x34/0x50
> [ 75.432415] el1h_64_irq_handler+0x18/0x24
> [ 75.436668] el1h_64_irq+0x64/0x68
> [ 75.440184] percpu_counter_add_batch+0xd8/0x1b8
> [ 75.444967] set_pte_range+0x14c/0x2c8
> [ 75.448838] filemap_map_pages+0x2dc/0xa38
> [ 75.453091] __handle_mm_fault+0x1304/0x17dc
> [ 75.457492] handle_mm_fault+0xa0/0x374
> [ 75.461450] do_page_fault+0x1f8/0x620
> [ 75.465365] do_translation_fault+0x118/0x138
> [ 75.469887] do_mem_abort+0x5c/0xe4
> [ 75.473516] el0_da+0x5c/0xac
> [ 75.476623] el0t_64_sync_handler+0xe4/0x158
> [ 75.481050] el0t_64_sync+0x190/0x194
> [ 75.484834]
> [ 75.486405] 5 locks held by (d-logind)/293:
> [ 75.490693] #0: ffffff88076753c0 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x15c/0x5c0
> [ 75.500441] #1: ffffffc082b042f0 (rcu_read_lock){....}-{1:2}, at: __handle_mm_fault+0x1260/0x17dc
> [ 75.509755] #2: ffffffc082b042f0 (rcu_read_lock){....}-{1:2}, at: filemap_map_pages+0x0/0xa38
> [ 75.518748] #3: ffffffc082b042f0 (rcu_read_lock){....}-{1:2}, at: __pte_offset_map+0x0/0x2a0
> [ 75.527662] #4: ffffff880762b018 (ptlock_ptr(ptdesc)){+.+.}-{2:2}, at: __pte_offset_map_lock+0xa4/0x264
> [ 75.537488] irq event stamp: 1761
> [ 75.540890] hardirqs last enabled at (1760): [<ffffffc080bc1f78>] percpu_counter_add_batch+0x160/0x1b8
> [ 75.550465] hardirqs last disabled at (1761): [<ffffffc081aca990>] el1_interrupt+0x24/0x50
> [ 75.558893] softirqs last enabled at (1534): [<ffffffc08001071c>] __do_softirq+0x47c/0x500
> [ 75.567382] softirqs last disabled at (1529): [<ffffffc080017134>] ____do_softirq+0x10/0x1c
> [ 75.575889]
> [ 75.577460] read to 0xffffffc082aa1048 of 4 bytes by interrupt on cpu 2:
> [ 75.584273] tick_sched_do_timer+0x48/0x168
> [ 75.588596] tick_sched_timer+0x48/0xc8
> [ 75.592572] __hrtimer_run_queues+0x200/0x5f0
> [ 75.597050] hrtimer_interrupt+0x1b0/0x3f4
> [ 75.601269] arch_timer_handler_phys+0x40/0x54
> [ 75.605835] handle_percpu_devid_irq+0xf8/0x20c
> [ 75.610531] generic_handle_domain_irq+0x48/0x64
> [ 75.615287] gic_handle_irq+0xc4/0x100
> [ 75.619150] call_on_irq_stack+0x24/0x4c
> [ 75.623204] do_interrupt_handler+0xc0/0xc4
> [ 75.627535] el1_interrupt+0x34/0x50
> [ 75.631258] el1h_64_irq_handler+0x18/0x24
> [ 75.635512] el1h_64_irq+0x64/0x68
> [ 75.639027] lock_acquire+0x13c/0x2f0
> [ 75.642820] count_memcg_event_mm.part.0+0x70/0x2a8
> [ 75.647829] handle_mm_fault+0x1d8/0x374
> [ 75.651874] do_page_fault+0x1f8/0x620
> [ 75.655789] do_mem_abort+0x5c/0xe4
> [ 75.659417] el0_da+0x5c/0xac
> [ 75.662524] el0t_64_sync_handler+0xe4/0x158
> [ 75.666951] el0t_64_sync+0x190/0x194
> [ 75.670736]
> [ 75.672307] 2 locks held by systemd/1:
> [ 75.676152] #0: ffffff88042627b0 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x15c/0x5c0
> [ 75.685900] #1: ffffffc082b042f0 (rcu_read_lock){....}-{1:2}, at: count_memcg_event_mm.part.0+0x0/0x2a8
> [ 75.695726] irq event stamp: 576347
> [ 75.699302] hardirqs last enabled at (576346): [<ffffffc08001af70>] local_daif_restore+0x38/0x54
> [ 75.708338] hardirqs last disabled at (576347): [<ffffffc081aca990>] el1_interrupt+0x24/0x50
> [ 75.716940] softirqs last enabled at (576161): [<ffffffc08001071c>] __do_softirq+0x47c/0x500
> [ 75.725602] softirqs last disabled at (576154): [<ffffffc080017134>] ____do_softirq+0x10/0x1c
> [ 75.734283]
> [ 75.735854] value changed: 0xffffffff -> 0x00000000
> [ 75.740836]
> [ 75.742407] Reported by Kernel Concurrency Sanitizer on:
> [ 75.747808] CPU: 2 PID: 1 Comm: systemd Not tainted 6.6.20+ #101
> [ 75.754290] Hardware name: xlnx,zynqmp (DT)
> [ 75.758563] ==================================================================
>
> The justification seems to be in tick_sched_do_timer:
>
> /*
> * Check if the do_timer duty was dropped. We don't care about
> * concurrency: This happens only when the CPU in charge went
> * into a long sleep. If two CPUs happen to assign themselves to
> * this duty, then the jiffies update is still serialized by
> * jiffies_lock.
> *
> * If nohz_full is enabled, this should not happen because the
> * tick_do_timer_cpu never relinquishes.
> */
>
> with the other assignment being in tick_nohz_stop_tick. I'm not familiar
> enough with this code to say whether we should be using READ/WRITE_ONCE
> or maybe just data_race (as would be implied by the comment above).
>
> --Sean
>
> [Embedded World 2024, SECO SpA]<https://www.messe-ticket.de/Nuernberg/embeddedworld2024/Register/ew24517689>
Powered by blists - more mailing lists