Message-ID: <47563570-7339-43da-af15-4acf7b93075c@samsung.com>
Date: Tue, 23 Sep 2025 08:31:26 +0200
From: Marek Szyprowski <m.szyprowski@...sung.com>
To: John Stultz <jstultz@...gle.com>
Cc: linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org, "Peter
 Zijlstra (Intel)" <peterz@...radead.org>, x86@...nel.org, Linux Samsung SOC
	<linux-samsung-soc@...r.kernel.org>, Krzysztof Kozlowski <krzk@...nel.org>
Subject: Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck

On 23.09.2025 01:46, John Stultz wrote:
> On Mon, Sep 22, 2025 at 2:57 PM Marek Szyprowski
> <m.szyprowski@...sung.com> wrote:
>> This patch landed in today's linux-next as commit 077e1e2e0015
>> ("sched/deadline: Fix dl_server getting stuck"). In my tests I found
>> that it breaks CPU hotplug on some of my systems. On the 64-bit
>> Exynos5433-based TM2e board I've captured the following lockdep warning
>> (which unfortunately doesn't really look related to CPU hotplug):
>>
> Huh. Nor does it really look related to the dl_server change. Interesting...
>
>
>> # for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
>> Detected VIPT I-cache on CPU7
>> CPU7: Booted secondary processor 0x0000000101 [0x410fd031]
>> ------------[ cut here ]------------
>> WARNING: CPU: 7 PID: 0 at kernel/rcu/tree.c:4329 rcutree_report_cpu_starting+0x1e8/0x348
>> Modules linked in: brcmfmac_wcc cpufreq_powersave cpufreq_conservative
>> brcmfmac brcmutil sha256 snd_soc_wm5110 cfg80211 snd_soc_wm_adsp cs_dsp
>> snd_soc_tm2_wm5110 snd_soc_arizona arizona_micsupp phy_exynos5_usbdrd
>> s5p_mfc typec arizona_ldo1 hci_uart btqca s5p_jpeg max77693_haptic btbcm
>> s3fwrn5_i2c exynos_gsc bluetooth s3fwrn5 nci v4l2_mem2mem nfc
>> snd_soc_i2s snd_soc_idma snd_soc_hdmi_codec snd_soc_max98504
>> snd_soc_s3c_dma videobuf2_dma_contig videobuf2_memops ecdh_generic
>> snd_soc_core ir_spi videobuf2_v4l2 ecc snd_compress ntc_thermistor
>> panfrost videodev snd_pcm_dmaengine snd_pcm rfkill drm_shmem_helper
>> panel_samsung_s6e3ha2 videobuf2_common backlight pwrseq_core gpu_sched
>> mc snd_timer snd soundcore ipv6
>> CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Not tainted 6.17.0-rc6+ #16012 PREEMPT
>> Hardware name: Samsung TM2E board (DT)
>> Hardware name: Samsung TM2E board (DT)
>> Detected VIPT I-cache on CPU7
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 6.17.0-rc6+ #16012 Not tainted
>> ------------------------------------------------------
>> swapper/7/0 is trying to acquire lock:
>> ffff000024021cc8 (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x5c/0x9c
>>
>> but task is already holding lock:
>> ffff800083e479c0 (&port_lock_key){-.-.}-{3:3}, at: s3c24xx_serial_console_write+0x80/0x268
>>
>> which lock already depends on the new lock.
>>
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #2 (&port_lock_key){-.-.}-{3:3}:
>>          _raw_spin_lock_irqsave+0x60/0x88
>>          s3c24xx_serial_console_write+0x80/0x268
>>          console_flush_all+0x304/0x49c
>>          console_unlock+0x70/0x110
>>          vprintk_emit+0x254/0x39c
>>          vprintk_default+0x38/0x44
>>          vprintk+0x28/0x34
>>          _printk+0x5c/0x84
>>          register_console+0x3ac/0x4f8
>>          serial_core_register_port+0x6c4/0x7a4
>>          serial_ctrl_register_port+0x10/0x1c
>>          uart_add_one_port+0x10/0x1c
>>          s3c24xx_serial_probe+0x34c/0x6d8
>>          platform_probe+0x5c/0xac
>>          really_probe+0xbc/0x298
>>          __driver_probe_device+0x78/0x12c
>>          driver_probe_device+0xdc/0x164
>>          __device_attach_driver+0xb8/0x138
>>          bus_for_each_drv+0x80/0xdc
>>          __device_attach+0xa8/0x1b0
>>          device_initial_probe+0x14/0x20
>>          bus_probe_device+0xb0/0xb4
>>          deferred_probe_work_func+0x8c/0xc8
>>          process_one_work+0x208/0x60c
>>          worker_thread+0x244/0x388
>>          kthread+0x150/0x228
>>          ret_from_fork+0x10/0x20
>>
>> -> #1 (console_owner){..-.}-{0:0}:
>>          console_lock_spinning_enable+0x6c/0x7c
>>          console_flush_all+0x2c8/0x49c
>>          console_unlock+0x70/0x110
>>          vprintk_emit+0x254/0x39c
>>          vprintk_default+0x38/0x44
>>          vprintk+0x28/0x34
>>          _printk+0x5c/0x84
>>          exynos_wkup_irq_set_wake+0x80/0xa4
>>          irq_set_irq_wake+0x164/0x1e0
>>          arizona_irq_set_wake+0x18/0x24
>>          irq_set_irq_wake+0x164/0x1e0
>>          regmap_irq_sync_unlock+0x328/0x530
>>          __irq_put_desc_unlock+0x48/0x4c
>>          irq_set_irq_wake+0x84/0x1e0
>>          arizona_set_irq_wake+0x5c/0x70
>>          wm5110_probe+0x220/0x354 [snd_soc_wm5110]
>>          platform_probe+0x5c/0xac
>>          really_probe+0xbc/0x298
>>          __driver_probe_device+0x78/0x12c
>>          driver_probe_device+0xdc/0x164
>>          __driver_attach+0x9c/0x1ac
>>          bus_for_each_dev+0x74/0xd0
>>          driver_attach+0x24/0x30
>>          bus_add_driver+0xe4/0x208
>>          driver_register+0x60/0x128
>>          __platform_driver_register+0x24/0x30
>>          cs_exit+0xc/0x20 [cpufreq_conservative]
>>          do_one_initcall+0x64/0x308
>>          do_init_module+0x58/0x23c
>>          load_module+0x1b48/0x1dc4
>>          init_module_from_file+0x84/0xc4
>>          idempotent_init_module+0x188/0x280
>>          __arm64_sys_finit_module+0x68/0xac
>>          invoke_syscall+0x48/0x110
>>          el0_svc_.common.c
>>
>> (system is frozen at this point).
> So I've seen issues like this when testing scheduler changes,
> particularly when I've added debug printks or WARN_ONs that trip while
> we're deep in the scheduler core and hold various locks. I reported
> something similar here:
> https://lore.kernel.org/lkml/CANDhNCo8NRm4meR7vHqvP8vVZ-_GXVPuUKSO1wUQkKdfjvy20w@mail.gmail.com/
>
> Now, usually I'll see the lockdep warning, and the hang is much more rare.
>
> But I don't see right off how the dl_server change would affect this,
> other than just changing the timing of execution such that you manage
> to trip over the existing issue.
>
> So far I don't see anything similar testing hotplug on x86 qemu.  Do
> you get any other console messages or warnings prior?

Nope. The most suspicious message is the 'CPU7: Booted secondary
processor 0x0000000101' line, which appeared while I was offlining all
CPUs other than CPU0.
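A quick way to confirm what the 'Booted secondary processor' line suggests (a CPU coming back online while the loop is offlining CPUs) is to compare /sys/devices/system/cpu/online with the expected set once the loop finishes. A minimal Python sketch, assuming that file uses the plain cpulist range format (for example `0` or `0-7`); the helper names are hypothetical, and the check is only useful if the system stays responsive long enough to run it:

```python
"""Hypothetical check: which CPUs are still online after a hotplug loop?"""
from __future__ import annotations

from pathlib import Path

# sysfs file listing online CPUs in cpulist format, e.g. "0" or "0-7".
ONLINE_PATH = Path("/sys/devices/system/cpu/online")


def parse_cpulist(text: str) -> set[int]:
    """Parse a cpulist such as "0", "0-7" or "0,2-3" into a set of CPU ids.

    Assumes plain comma-separated numbers and ranges (no stride syntax).
    """
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list means no CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def unexpected_online(text: str, expected: set[int]) -> set[int]:
    """Return CPUs reported online that were not expected to be online."""
    return parse_cpulist(text) - expected


if __name__ == "__main__":
    if not ONLINE_PATH.exists():
        print(f"{ONLINE_PATH} not found (not a Linux sysfs system?)")
    else:
        # After offlining cpu1..cpuN, only CPU0 should remain online.
        extra = unexpected_online(ONLINE_PATH.read_text(), expected={0})
        print("unexpectedly online:", sorted(extra) if extra else "none")
```

Running it right after the `echo 0 >$i/online` loop would report `[7]` if CPU7 had come back, instead of relying on spotting the boot message in the console log.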


> Looking at the backtrace, I wonder if changing the pr_info() in
> exynos_wkup_irq_set_wake() to printk_deferred() might avoid this?


I've removed that pr_info() from exynos_wkup_irq_set_wake() completely,
and now I get the following warnings:

# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; done
# Detected VIPT I-cache on CPU7
  CPU7: Booted secondary processor 0x0000000101 [0x410fd031]
  ------------[ cut here ]------------
  WARNING: CPU: 7 PID: 0 at kernel/rcu/tree.c:4329 rcutree_report_cpu_starting+0x1e8/0x348
  Modules linked in: brcmfmac_wcc brcmfmac brcmutil sha256 
cpufreq_powersave cpufreq_conservative cfg80211 snd_soc_tm2_wm5110 
hci_uart btqca btbcm s3fwrn5_i2c snd_soc_wm5110 bluetooth 
arizona_micsupp phy_exynos5_usbdrd s3fwrn5 s5p_mfc nci typec 
snd_soc_wm_adsp s5p_jpeg cs_dsp nfc ecdh_generic max77693_haptic 
snd_soc_arizona arizona_ldo1 ecc rfkill snd_soc_i2s snd_soc_idma 
snd_soc_max98504 snd_soc_hdmi_codec snd_soc_s3c_dma pwrseq_core 
snd_soc_core exynos_gsc ir_spi v4l2_mem2mem videobuf2_dma_contig 
videobuf2_memops snd_compress snd_pcm_dmaengine videobuf2_v4l2 videodev 
ntc_thermistor snd_pcm panfrost videobuf2_common drm_shmem_helper 
gpu_sched snd_timer mc panel_samsung_s6e3ha2 backlight snd soundcore ipv6
  CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Not tainted 6.17.0-rc6+ #16014 PREEMPT
  Hardware name: Samsung TM2E board (DT)
  Hardware name: Samsung TM2E board (DT)
  Detected VIPT I-cache on CPU7
  CPU7: Booted secondary processor 0x0000000103 [0x410fd031]

  ================================
  WARNING: inconsistent lock state
  6.17.0-rc6+ #16014 Not tainted
  --------------------------------
  inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
  swapper/7/0 [HC0[0]:SC0[0]:HE0:SE1] takes:
  ffff800083e479c0 (&port_lock_key){?.-.}-{3:3}, at: s3c24xx_serial_console_write+0x80/0x268
  {IN-HARDIRQ-W} state was registered at:
    lock_acquire+0x1c8/0x354
    _raw_spin_lock+0x48/0x60
    s3c64xx_serial_handle_irq+0x6c/0x164
    __handle_irq_event_percpu+0x9c/0x2d8
    handle_irq_event+0x4c/0xac
    handle_fasteoi_irq+0x108/0x198
    handle_irq_desc+0x40/0x58
    generic_handle_domain_irq+0x1c/0x28
    gic_handle_irq+0x40/0xc8
    call_on_irq_stack+0x30/0x48
    do_interrupt_handler+0x80/0x84
    el1_interrupt+0x34/0x64
    el1h_64_irq_handler+0x18/0x24
    el1h_64_irq+0x6c/0x70
    default_idle_call+0xac/0x26c
    do_idle+0x220/0x284
    cpu_startup_entry+0x38/0x3c
    rest_init+0xf4/0x184
    start_kernel+0x70c/0x7d4
    __primary_switched+0x88/0x90
  irq event stamp: 63878
  hardirqs last  enabled at (63877): [<ffff800080121d2c>] do_idle+0x220/0x284
  hardirqs last disabled at (63878): [<ffff80008132f3a4>] el1_brk64+0x1c/0x54
  softirqs last  enabled at (63812): [<ffff8000800c1164>] handle_softirqs+0x4c4/0x4dc
  softirqs last disabled at (63807): [<ffff800080010690>] __do_softirq+0x14/0x20

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&port_lock_key);
    <Interrupt>
      lock(&port_lock_key);

   *** DEADLOCK ***

  5 locks held by swapper/7/0:
   #0: ffff800082d0aa98 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x150/0x39c
   #1: ffff800082d0aaf0 (console_srcu){....}-{0:0}, at: console_flush_all+0x78/0x49c
   #2: ffff800082d0acb0 (console_owner){+.-.}-{0:0}, at: console_lock_spinning_enable+0x48/0x7c
   #3: ffff800082d0acd8 (printk_legacy_map-wait-type-override){+...}-{4:4}, at: console_flush_all+0x2b0/0x49c
   #4: ffff800083e479c0 (&port_lock_key){?.-.}-{3:3}, at: s3c24xx_serial_console_write+0x80/0x268

  stack backtrace:
  CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Not tainted 6.17.0-rc6+ #16014 PREEMPT
  Hardware name: Samsung TM2E board (DT)
  Call trace:
   show_stack+0x18/0x24 (C)
   dump_stack_lvl+0x90/0xd0
   dump_stack+0x18/0x24
   print_usage_bug.part.0+0x29c/0x358
   mark_lock+0x7bc/0x960
   mark_held_locks+0x58/0x90
   lockdep_hardirqs_on_prepare+0x104/0x214
   trace_hardirqs_on+0x58/0x1d8
   secondary_start_kernel+0x134/0x160
   __secondary_switched+0xc0/0xc4
  ------------[ cut here ]------------
  WARNING: CPU: 7 PID: 0 at kernel/context_tracking.c:127 ct_kernel_exit.constprop.0+0x120/0x184
  Modules linked in: brcmfmac_wcc brcmfmac brcmutil sha256 
cpufreq_powersave cpufreq_conservative cfg80211 snd_soc_tm2_wm5110 
hci_uart btqca btbcm s3fwrn5_i2c snd_soc_wm5110 bluetooth 
arizona_micsupp phy_exynos5_usbdrd s3fwrn5 s5p_mfc nci typec 
snd_soc_wm_adsp s5p_jpeg cs_dsp nfc ecdh_generic max77693_haptic 
snd_soc_arizona arizona_ldo1 ecc rfkill snd_soc_i2s snd_soc_idma 
snd_soc_max98504 snd_soc_hdmi_c

(no more messages, system frozen)

It looks like offlining CPUs 1-7 succeeded (the shell prompt character
appears at the start of the second line), but then CPU7 somehow got
onlined again, which causes this freeze.
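The `{...}` annotation after each lock name in these reports records how lockdep has seen that lock class used. Per the lockdep design documentation, the four characters cover hardirq, hardirq-read, softirq and softirq-read, left to right, and each is one of `.`, `-`, `+` or `?`. A small Python sketch for decoding them while reading a report; the function names are hypothetical, and the trailing `-{N:N}` wait-type suffix is deliberately ignored:

```python
"""Decode lockdep usage annotations such as "(&port_lock_key){?.-.}".

Hypothetical helper for reading lockdep reports; not part of the kernel tree.
"""
from __future__ import annotations

import re

# Character meanings, paraphrasing the lockdep design documentation.
CHAR_MEANING = {
    ".": "acquired with irqs disabled and not in irq context",
    "-": "acquired in irq context",
    "+": "acquired or held with irqs enabled",
    "?": "acquired in irq context and also with irqs enabled",
}

# Bit positions, left to right: each state followed by its read variant.
POSITIONS = ("hardirq", "hardirq-read", "softirq", "softirq-read")

# Matches "(lock name){xxxx}"; the "-{N:N}" wait-type suffix is not captured.
USAGE_RE = re.compile(r"\(([^()]+)\)\{([.+?-]{4})\}")


def decode_usage(annotation: str) -> dict[str, str]:
    """Map "{?.-.}" (braces optional) to {"hardirq": "?", ...}."""
    chars = annotation.strip().strip("{}")
    if len(chars) != len(POSITIONS) or any(c not in CHAR_MEANING for c in chars):
        raise ValueError(f"unexpected usage annotation: {annotation!r}")
    return dict(zip(POSITIONS, chars))


def find_usage(line: str) -> tuple[str, dict[str, str]] | None:
    """Extract (lock name, decoded usage) from one line of a lockdep report."""
    match = USAGE_RE.search(line)
    if match is None:
        return None
    return match.group(1), decode_usage(match.group(2))


def inconsistent_states(usage: dict[str, str]) -> list[str]:
    """Write states where the lock was both used in irq context and with irqs on."""
    return [state for state in ("hardirq", "softirq") if usage[state] == "?"]


if __name__ == "__main__":
    line = "ffff800083e479c0 (&port_lock_key){?.-.}-{3:3}, at:"
    name, usage = find_usage(line)
    for state, char in usage.items():
        print(f"{name} {state}: '{char}' = {CHAR_MEANING[char]}")
    print("inconsistent:", inconsistent_states(usage))
```

For `&port_lock_key`, the first trace shows `{-.-.}` and the second shows `{?.-.}`: the hardirq position moves from `-` (taken in hardirq context, from s3c64xx_serial_handle_irq()) to `?` (additionally held with hardirqs enabled, via trace_hardirqs_on() in secondary_start_kernel()), which is the `{IN-HARDIRQ-W} -> {HARDIRQ-ON-W}` inconsistency named in the report header.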

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

