Message-ID: <6dab6e564e43c952f63f83ef868da6ed829fc1a8.camel@mediatek.com>
Date: Tue, 6 Sep 2022 20:54:58 +0800
From: Kuyo Chang <kuyo.chang@...iatek.com>
To: <mingo@...hat.com>, <peterz@...radead.org>,
<juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
<dietmar.eggemann@....com>, <rostedt@...dmis.org>,
<bsegall@...gle.com>, <mgorman@...e.de>, <bristot@...hat.com>
CC: <linux-kernel@...r.kernel.org>, <wsd_upstream@...iatek.com>,
<linux-mediatek@...ts.infradead.org>, <jing-ting.wu@...iatek.com>,
<yt.chang@...iatek.com>, <jonathan.jmchen@...iatek.com>
Subject: BUG: list_add corruption while doing migrate_swap -> balance_push
Hi,
[Symptom]
We hit a list_add corruption error on kernel 5.15. The log shows:
list_add corruption. prev->next should be next (ffffff81a6f08ba0), but
was 0000000000000000. (prev=ffffff81a6f05930).
The call trace is as follows:
ipanic_die
notify_die
die
bug_handler
brk_handler
do_debug_exception
el1_dbg
el1h_64_sync_handler
el1h_64_sync
__list_add_valid
cpu_stop_queue_work
stop_one_cpu_nowait
balance_push
__schedule
schedule
do_sched_yield
__arm64_sys_sched_yield
invoke_syscall
el0_svc_common
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync
[Analysis]
From the memory dump and an analysis of the stopper->works list, the
code flow leading to the error is as follows:
migrate_swap
->stop_two_cpus
->cpu_stop_queue_two_works
->__cpu_stop_queue_work (adds each work->list to its stopper->works)
->list_add_tail(&work->list, &stopper->works);
->wake_up_q(&wakeq);
->wait_for_completion(&done.completion);
->wait_for_common
->schedule_timeout
->schedule
At this point, CPU hotplug is triggered.
It registers the balance callback through the following flow:
cpu_down(cpuid)
->_cpu_down
->cpuhp_set_state()
->set_cpu_dying(cpuid, true)
->sched_cpu_deactivate
->balance_push_set(cpuid, true)
->rq->balance_callback = &balance_push_callback;
Finally,
->__schedule
->__balance_callbacks
->do_balance_callbacks(rq, __splice_balance_callbacks(rq, false));
->balance_push
->stop_one_cpu_nowait
*work_buf = (struct cpu_stop_work){ .fn = fn, .arg = arg,
.caller = _RET_IP_, };
At this point the struct assignment sets the list_head's next and prev
pointers to NULL.
->cpu_stop_queue_work
->__list_add_valid
So it hits this check in __list_add_valid, which matches the log above:
CHECK_DATA_CORRUPTION(prev->next != next,
"list_add corruption. prev->next should be next (%px), but was %px. (prev=%px).\n",
next, prev->next, prev)
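To illustrate the failure mode, here is a minimal userspace sketch, not kernel code. It assumes the work item is overwritten with a compound literal while it is still linked on the list. Whether this is exactly what happens on the affected system is not established here. struct work, list_add_valid(), list_add_tail_checked() and overwrite_while_queued_is_detected() are hypothetical names that only model struct cpu_stop_work and the list helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical, reduced stand-in for struct cpu_stop_work. */
struct work {
	struct list_head list;
	int (*fn)(void *arg);
	void *arg;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Modeled on the two link checks in __list_add_valid(); it only reports. */
static bool list_add_valid(struct list_head *prev, struct list_head *next)
{
	if (next->prev != prev) {
		printf("next->prev should be prev (%p), but was %p. (next=%p)\n",
		       (void *)prev, (void *)next->prev, (void *)next);
		return false;
	}
	if (prev->next != next) {
		printf("prev->next should be next (%p), but was %p. (prev=%p)\n",
		       (void *)next, (void *)prev->next, (void *)prev);
		return false;
	}
	return true;
}

/* list_add_tail() with validation; the list is left untouched on failure. */
static bool list_add_tail_checked(struct list_head *entry,
				  struct list_head *head)
{
	struct list_head *prev = head->prev;

	if (!list_add_valid(prev, head))
		return false;
	head->prev = entry;
	entry->next = head;
	entry->prev = prev;
	prev->next = entry;
	return true;
}

static int dummy_fn(void *arg)
{
	(void)arg;
	return 0;
}

/* Returns true if queuing after the overwrite is rejected as corruption. */
static bool overwrite_while_queued_is_detected(void)
{
	struct list_head works;
	struct work first, second = { .fn = dummy_fn };
	bool queued;

	init_list_head(&works);
	first = (struct work){ .fn = dummy_fn };
	queued = list_add_tail_checked(&first.list, &works);
	assert(queued);
	(void)queued;

	/*
	 * Same pattern as "*work_buf = (struct cpu_stop_work){ ... }":
	 * members not named are zeroed, so first.list becomes {NULL, NULL}
	 * although works.prev still points at it.
	 */
	first = (struct work){ .fn = dummy_fn };

	return !list_add_tail_checked(&second.list, &works);
}
```

Calling overwrite_while_queued_is_detected() from a main() function reports a NULL prev->next for the overwritten entry and returns true, which is the same shape as the message in the log.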
Do you have any suggestions for this issue?
Thank you.