Message-ID: <20250207-petite-eminent-husky-7d1704@leitao>
Date: Fri, 7 Feb 2025 02:07:18 -0800
From: Breno Leitao <leitao@...ian.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...nel.org, vincent.guittot@...aro.org,
linux-kernel@...r.kernel.org, juri.lelli@...hat.com,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, bristot@...hat.com, corbet@....net,
qyousef@...alina.io, chris.hyser@...cle.com,
patrick.bellasi@...bug.net, pjt@...gle.com, pavel@....cz,
qperret@...gle.com, tim.c.chen@...ux.intel.com, joshdon@...gle.com,
timj@....org, kprateek.nayak@....com, yu.c.chen@...el.com,
youssefesmat@...omium.org, joel@...lfernandes.org, efault@....de,
tglx@...utronix.de
Subject: Re: [PATCH 03/15] sched/fair: Add lag based placement
Hello Peter,
On Wed, May 31, 2023 at 01:58:42PM +0200, Peter Zijlstra wrote:
>
> place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> {
<snip>
> - vruntime -= thresh;
> + lag *= load + se->load.weight;
> + if (WARN_ON_ONCE(!load))
I have 6.13 running on some hosts, and in some cases where the system
is hitting OOMs, I see the following stack:
WARNING: CPU: 29 PID: 593474 at kernel/sched/fair.c:5250 place_entity+0x199/0x1b0
Call Trace:
<TASK>
? __warn+0xd1/0x1b0
? place_entity+0x199/0x1b0
? report_bug+0x140/0x1c0
? handle_bug+0x5e/0x90
? exc_invalid_op+0x16/0x40
? asm_exc_invalid_op+0x16/0x20
? place_entity+0x199/0x1b0
reweight_entity+0x188/0x200
enqueue_task_fair.llvm.15448040313737105663+0x28c/0x560
enqueue_task+0x30/0x120
ttwu_do_activate+0x99/0x230
try_to_wake_up+0x25a/0x4a0
? hrtimer_dummy_timeout+0x10/0x10
hrtimer_wakeup+0x25/0x30
__hrtimer_run_queues+0xf1/0x250
hrtimer_interrupt+0xfb/0x220
__sysvec_apic_timer_interrupt+0x47/0x140
sysvec_apic_timer_interrupt+0x35/0x80
asm_sysvec_apic_timer_interrupt+0x16/0x20
I am sorry for not decoding the stack, but I am having a hard time
doing it properly. The values I got were misleading, and I am still
working to understand what is happening.
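That said, my (possibly wrong) reading of the code around that warning
in 6.13 is roughly the following. This is a simplified paraphrase of
the PLACE_LAG branch in place_entity(), not the exact upstream code:

        /*
         * Simplified paraphrase of the PLACE_LAG branch in
         * place_entity(), based on my reading of 6.13.
         */
        if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
                struct sched_entity *curr = cfs_rq->curr;
                unsigned long load;
                s64 lag = se->vlag;

                /* total weight used to scale the lag */
                load = cfs_rq->avg_load;
                if (curr && curr->on_rq)
                        load += scale_load_down(curr->load.weight);

                lag *= load + scale_load_down(se->load.weight);
                if (WARN_ON_ONCE(!load))   /* presumably fair.c:5250 */
                        load = 1;
                lag = div_s64(lag, load);
        }

If that reading is correct, hitting the warning from reweight_entity()
would mean cfs_rq->avg_load was zero and there was no on_rq current
entity at that point, even though the cfs_rq was not empty. I have not
confirmed that yet, though.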
Anyway, I don't have a reproducer, and the problem doesn't happen very
frequently: I have 1K hosts running 6.13 and I saw it 5 times in the
last week.
Also, this is happening in 6.13.1.
Thanks
--breno