Message-ID: <ec925126-2756-4b3a-b311-5f50ffee58c1@arm.com>
Date: Tue, 30 Sep 2025 14:30:13 +0200
From: Pierre Gondois <pierre.gondois@....com>
To: Han Guangjiang <gj.han@...mail.com>
Cc: hanguangjiang@...iang.com, fanggeng@...iang.com,
Mel Gorman <mgorman@...e.de>, Ben Segall <bsegall@...gle.com>,
Steven Rostedt <rostedt@...dmis.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
"open list:SCHEDULER" <linux-kernel@...r.kernel.org>,
Valentin Schneider <vschneid@...hat.com>, yangchen11@...iang.com
Subject: Re: [PATCH] sched/fair: Fix DELAY_DEQUEUE issue related to cgroup
throttling

Hello Han,

On 9/4/25 03:51, Han Guangjiang wrote:
> From: Han Guangjiang <hanguangjiang@...iang.com>
>
> When both the CPU and memory cgroup controllers are enabled, and the
> parent cgroup's resource limits are much smaller than the child
> cgroup's, the system frequently hangs with a NULL pointer dereference:

Is it happening while running a specific workload?
Would it be possible to provide a reproducer?
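
Something along the lines of the untested sketch below is roughly what I
have in mind (cgroup v2 assumed; all paths, limits and worker counts are
invented for illustration). On a PREEMPT_RT kernel, memory pressure in a
throttled group should make reclaim take the rt locks seen in your trace:

/*
 * Untested sketch, not a known-good reproducer: a tightly limited parent
 * group, an unconstrained child group, and workers generating memory
 * pressure so reclaim runs while the parent's cpu.max throttles them.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

int main(void)
{
	char buf[32];
	int i;

	/* Parent limited hard on both cpu and memory. */
	mkdir("/sys/fs/cgroup/parent", 0755);
	write_str("/sys/fs/cgroup/parent/cgroup.subtree_control", "+cpu +memory");
	write_str("/sys/fs/cgroup/parent/cpu.max", "10000 100000"); /* 10% of one CPU */
	write_str("/sys/fs/cgroup/parent/memory.max", "67108864");  /* 64M */
	mkdir("/sys/fs/cgroup/parent/child", 0755);

	/* Move ourselves (and the forked workers) into the child group. */
	snprintf(buf, sizeof(buf), "%d", getpid());
	write_str("/sys/fs/cgroup/parent/child/cgroup.procs", buf);

	/*
	 * Keep dirtying memory so reclaim runs under the memory limit
	 * (file-backed I/O may be needed instead to match the
	 * rmap_walk_file part of the trace).
	 */
	for (i = 0; i < 8; i++) {
		if (fork() == 0) {
			for (;;) {
				void *p = malloc(1 << 20);

				if (p)
					memset(p, 1, 1 << 20);
			}
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}
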
> Unable to handle kernel NULL pointer dereference
> at virtual address 0000000000000051
> Internal error: Oops: 0000000096000006 [#1] PREEMPT_RT SMP
> pc : pick_task_fair+0x68/0x150
> Call trace:
> pick_task_fair+0x68/0x150
> pick_next_task_fair+0x30/0x3b8
> __schedule+0x180/0xb98
> preempt_schedule+0x48/0x60
> rt_mutex_slowunlock+0x298/0x340
> rt_spin_unlock+0x84/0xa0
> page_vma_mapped_walk+0x1c8/0x478
> folio_referenced_one+0xdc/0x490
> rmap_walk_file+0x11c/0x200
> folio_referenced+0x160/0x1e8
> shrink_folio_list+0x5c4/0xc60
> shrink_lruvec+0x5f8/0xb88
> shrink_node+0x308/0x940
> do_try_to_free_pages+0xd4/0x540
> try_to_free_mem_cgroup_pages+0x12c/0x2c0
>
> The issue can be mitigated by increasing the parent cgroup's CPU
> resources, or completely resolved by disabling the DELAY_DEQUEUE feature:
>
> SCHED_FEAT(DELAY_DEQUEUE, false)
>
> With CONFIG_SCHED_DEBUG enabled, the following warning appears:
>
> WARNING: CPU: 1 PID: 27 at kernel/sched/fair.c:704 update_entity_lag+0xa8/0xd0
> !se->on_rq
> Call trace:
> update_entity_lag+0xa8/0xd0
> dequeue_entity+0x90/0x538
> dequeue_entities+0xd0/0x490
> dequeue_task_fair+0xcc/0x230
> rt_mutex_setprio+0x2ec/0x4d8
> rtlock_slowlock_locked+0x6c8/0xce8
>
> The warning indicates se->on_rq is 0, meaning dequeue_entity() was
> entered at least twice and executed update_entity_lag().
>
> Root cause analysis:
> In rt_mutex_setprio(), there are two dequeue_task() calls:
> 1. First call: dequeues the task immediately if it is delay-dequeued
> 2. Second call: dequeues running tasks
>
> Through debugging, we observed that for the same task, both dequeue_task()
> calls are actually executed. The task is a sched_delayed task on cfs_rq,
> which confirms our analysis that dequeue_entity() is entered at least
> twice.
>
> Semantically, rt_mutex handles scheduling and priority inheritance, and
> should only dequeue/enqueue running tasks. A sched_delayed task is
> essentially non-running, so the second dequeue_task() should not execute.
>
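
To make sure I follow the analysis, the shape I have in mind is roughly
(a paraphrase of the description above, not a quote of the actual
rt_mutex_setprio() code; the flag names are only indicative):

	/* 1) a delay-dequeued task is dequeued immediately */
	if (p->se.sched_delayed)
		dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);

	queued = task_on_rq_queued(p);
	running = task_current(rq, p);
	if (queued)
		/* 2) meant for tasks that are genuinely queued/running */
		dequeue_task(rq, p, queue_flag);

If that matches the actual code, the second call indeed relies on the
first one having fully cleared the delayed/queued state beforehand.
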
> Further analysis of dequeue_entities() shows multiple cfs_rq_throttled()
> checks. At the function's end, __block_task() updates sched_delayed
> tasks to a non-running state. However, when cgroup throttling occurs,
> the function returns early without executing __block_task(), leaving the
> sched_delayed task in a running state. This causes the unexpected second
> dequeue_task() in rt_mutex_setprio(), leading to the crash above.
>
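
To restate the control flow (a heavily simplified sketch based on the
description above and the second hunk below, not the exact code):

	static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
	{
		...
		for_each_sched_entity(se) {
			dequeue_entity(cfs_rq_of(se), se, flags);

			/* end evaluation on encountering a throttled cfs_rq */
			if (cfs_rq_throttled(cfs_rq_of(se)))
				return 0;   /* <- skips the fix-up below */
			...
		}
		...
		if (p && task_delayed) {
			...
			__block_task(rq, p);   /* skipped on the throttled path */
		}

		return 1;
	}

So, if I read it right, any early return taken before the tail of the
function leaves p->on_rq set for a task whose entity is already off the
cfs_rq, which matches the !se->on_rq warning above.
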
> We initially tried modifying the two cfs_rq_throttled() return points in
> dequeue_entities() to jump to the __block_task() condition check, which
> resolved the issue completely.
>
> This patch takes a cleaner approach by moving the __block_task()
> operation from dequeue_entities() to finish_delayed_dequeue_entity(),
> ensuring sched_delayed tasks are properly marked as non-running
> regardless of cgroup throttling status.
>
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Signed-off-by: Han Guangjiang <hanguangjiang@...iang.com>
> ---
> kernel/sched/fair.c | 21 ++++++---------------
> 1 file changed, 6 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c..d6c2a604358f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5373,6 +5373,12 @@ static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
>  	clear_delayed(se);
>  	if (sched_feat(DELAY_ZERO) && se->vlag > 0)
>  		se->vlag = 0;
> +
> +	if (entity_is_task(se)) {
> +		struct task_struct *p = task_of(se);
> +
> +		__block_task(task_rq(p), p);
> +	}
>  }
>
>  static bool
> @@ -7048,21 +7054,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
>  		rq->next_balance = jiffies;
>
> -	if (p && task_delayed) {
> -		WARN_ON_ONCE(!task_sleep);
> -		WARN_ON_ONCE(p->on_rq != 1);
> -
> -		/* Fix-up what dequeue_task_fair() skipped */
> -		hrtick_update(rq);
> -
> -		/*
> -		 * Fix-up what block_task() skipped.
> -		 *
> -		 * Must be last, @p might not be valid after this.
> -		 */
> -		__block_task(rq, p);
> -	}
> -
>  	return 1;
>  }
>