linux-kernel - Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZhzfNU0rsscFnTm4@chenyu5-mobl2>
Date: Mon, 15 Apr 2024 16:03:01 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: Abel Wu <wuyun.abel@...edance.com>, Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli
	<juri.lelli@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Tiwei Bie
	<tiwei.btw@...group.com>, Honglei Wang <wanghonglei@...ichuxing.com>, "Aaron
 Lu" <aaron.lu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Breno Leitao
	<leitao@...ian.org>, <linux-kernel@...r.kernel.org>, kernel test robot
	<oliver.sang@...el.com>
Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf()
 if no eligible entity is found

On 2024-04-15 at 09:22:51 +0200, Peter Zijlstra wrote:
> On Tue, Apr 09, 2024 at 11:21:04AM +0200, Peter Zijlstra wrote:
> 
> > Is there any sane way to reproduce this, and how often does it happen?
> 
> This, how do you all make it go bang?

It was reproduced in lkp's environment, and originally reported here:
https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@intel.com/

It is a trinity test on a vm guest, and seems like be triggered after some futex
test.  And it was reproduced at a rate of 23/999 according to that report.
Previously I could not reproduce it locally, so lkp has helped test my debug
patch in their environment and got the clue that it was broken by s64 overflow.

Breno told me that he has reproduce this issue with KASAN on and using:
'stress-ng -a 20', but I can not reproduce it locally neither.

I'm thinking of creating a debug patch to trace all the changes related to
cfs_rq->avg_vruntime in avg_vruntime_add()/sub().
To see how cfs_rq->avg_vruntime gets far behind the cfs_rq->min_vruntime, which
caused the overflow. My understanding is that, the se's vruntime is got from
cfs_rq->avg_vruntime in place_entity(), if the se's vruntime gets an extrem
smaller value than cfs_rq->min_vruntime, then it might indicate that there could
be something wrong with the update of cfs_rq->avg_vruntime.
Then lkp could help us to further debug.

Or do you have any suggestion/suspect that how to narrow down this, I could
try as you suggest.

thanks,
Chenyu