[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <544c8626-8330-6dd9-0076-c7cb73ec6054@linux.alibaba.com>
Date: Fri, 7 Jan 2022 09:15:53 +0800
From: luanshi <zhangliguang@...ux.alibaba.com>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: Frederic Weisbecker <fweisbec@...il.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: tick/sched: iowait_sleeptime resides in tick_cpu_sched structure
is a negative value
Hi Frederi,
在 2022/1/6 21:40, Frederic Weisbecker 写道:
> On Thu, Jan 06, 2022 at 05:14:02PM +0800, luanshi wrote:
>> Hi All,
>>
>> We encounted a hardlockup problem on ARM server, after some debug found
>> that:
>>
>> PID: 0 TASK: ffff0400064de300 CPU: 126 COMMAND: "swapper/126"
>> #0 [ffff8000250f3c90] __crash_kexec at ffff80001013a064
>> #1 [ffff8000250f3e30] panic at ffff800010afd028
>> #2 [ffff8000250f3f10] nmi_panic at ffff80001004a9e0
>> #3 [ffff8000250f3f20] watchdog_hardlockup_check at ffff80001017b7f0
>> #4 [ffff8000250f3f40] sdei_watchdog_callback at ffff80001003c9a4
>> #5 [ffff8000250f3f50] sdei_event_handler at ffff8000108816b8
>> #6 [ffff8000250f3f80] _sdei_handler at ffff800010b1c2e4
>> #7 [ffff8000250f3fd0] __sdei_handler at ffff800010b139b0
>> #8 [ffff8000250f3ff0] __sdei_asm_handler at ffff800010014c18
>> --- <IRQ stack> ---
>> #9 [ffff800013973f10] __cpu_do_idle at ffff800010b13764
>> #10 [ffff800013973f20] arch_cpu_idle at ffff800010b137ec
>> #11 [ffff800013973f30] default_idle_call at ffff800010b1b5cc
>> #12 [ffff800013973f50] cpuidle_idle_call at ffff80001009648c
>> #13 [ffff800013973f90] do_idle at ffff8000100965b0
>> #14 [ffff800013973fc0] cpu_startup_entry at ffff8000100967d4
>> #15 [ffff800013973fe0] secondary_start_kernel at ffff800010026bb0
>>
>>
>> per_cpu(tick_cpu_sched, 126) = $1 = {
>> sched_timer = {
>> node = {
>> node = {
>> __rb_parent_color = 18446603337117384112,
>> rb_right = 0x0,
>> rb_left = 0x0
>> },
>> expires = 6108564000000
>> },
>> _softexpires = 6108564000000,
>> function = 0xffff800010122ec0 <tick_sched_timer>,
>> base = 0xffff04473bbcc780,
>> state = 1 '\001',
>> is_rel = 0 '\000',
>> is_soft = 0 '\000',
>> is_hard = 1 '\001'
>> },
>> check_clocks = 0,
>> nohz_mode = NOHZ_MODE_INACTIVE,
>> inidle = 1,
>> tick_stopped = 0,
>> idle_active = 1,
>> do_timer_last = 0,
>> got_idle_tick = 1,
>> last_tick = 0,
>> next_tick = 0,
>> idle_jiffies = 0,
>> idle_calls = 0,
>> idle_sleeps = 0,
>> idle_entrytime = 5012087709249,
>> idle_waketime = 0,
>> idle_exittime = 0,
>> idle_sleeptime = 4936136669951,
>> iowait_sleeptime = -1942739704,
>> last_jiffies = 0,
>> timer_expires = 0,
>> timer_expires_base = 0,
>> next_timer = 0,
>> idle_expires = 0,
>> tick_dep_mask = {
>> counter = 0
>> }
>> }
>>
>> iowait_sleeptime = -1942739704,
>> ----------
>> iowait_sleeptime is monotonically increasing, under what circumstances iowait_sleeptime
>> can be a negative value?
>>
>> For detailed information:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215458
>>
>>
>> Can you give me some suggestions for debugging.
> Some racy updates can happen if cpufreq or "cat /proc/stat" do concurrent
> updates. But for that nohz needs to be running and I see your CPU clock has
> NOHZ_MODE_INACTIVE. Perhaps it's only for that CPU though.
>
> What is the value of tick_nohz_active in your dump?
nohz=off was configured in the startup parameters
crash> tick_nohz_enabled
tick_nohz_enabled = $1 = false
crash> tick_nohz_active
tick_nohz_active = $2 = 0
and current no cpufreq driver was used:
crash> cpufreq_driver
cpufreq_driver = $4 = (struct cpufreq_driver *) 0x0
Thanks,
Liguang
>
>
>>
>>
>> Thanks,
>> Liguang
>>
Powered by blists - more mailing lists