Date: Mon, 15 Apr 2024 16:03:01 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: Abel Wu <wuyun.abel@...edance.com>, Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli
	<juri.lelli@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Tiwei Bie
	<tiwei.btw@...group.com>, Honglei Wang <wanghonglei@...ichuxing.com>, "Aaron
 Lu" <aaron.lu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Breno Leitao
	<leitao@...ian.org>, <linux-kernel@...r.kernel.org>, kernel test robot
	<oliver.sang@...el.com>
Subject: Re: [RFC PATCH] sched/eevdf: Return leftmost entity in pick_eevdf()
 if no eligible entity is found

On 2024-04-15 at 09:22:51 +0200, Peter Zijlstra wrote:
> On Tue, Apr 09, 2024 at 11:21:04AM +0200, Peter Zijlstra wrote:
> 
> > Is there any sane way to reproduce this, and how often does it happen?
> 
> This, how do you all make it go bang?

It was reproduced in lkp's environment, and originally reported here:
https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@intel.com/

It is a trinity test in a VM guest, and it seems to be triggered after some futex
test.  It was reproduced at a rate of 23/999 according to that report.
Previously I could not reproduce it locally, so lkp helped test my debug
patch in their environment, which gave the clue that it was caused by an s64
overflow.

Breno told me that he has reproduced this issue with KASAN enabled, using
'stress-ng -a 20', but I cannot reproduce it locally either.

I'm thinking of creating a debug patch to trace all the changes to
cfs_rq->avg_vruntime in avg_vruntime_add()/sub(), to see how
cfs_rq->avg_vruntime falls so far behind cfs_rq->min_vruntime that it causes
the overflow. My understanding is that the se's vruntime is derived from
cfs_rq->avg_vruntime in place_entity(), so if the se's vruntime becomes
extremely small compared with cfs_rq->min_vruntime, it might indicate that
something is wrong with the update of cfs_rq->avg_vruntime.
lkp could then help us debug further.
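
To illustrate the kind of overflow in question, here is a rough userspace
sketch. It is not kernel code: key_times_load_overflows() is a hypothetical
helper, and it assumes the problematic term is a signed 64-bit product of
(vruntime - min_vruntime) and a load weight. It relies on the GCC/Clang
builtin __builtin_mul_overflow().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t s64;
typedef uint64_t u64;

/*
 * Hypothetical userspace model, not the kernel implementation.
 * key is the entity's vruntime relative to min_vruntime, taken as a
 * signed 64-bit difference; load stands in for a (scaled-down) weight.
 * Returns true if key * load does not fit in s64. *product receives
 * the result, which is wrapped when an overflow occurred.
 */
static bool key_times_load_overflows(u64 vruntime, u64 min_vruntime,
				     unsigned long load, s64 *product)
{
	s64 key = (s64)(vruntime - min_vruntime);

	assert(load > 0);
	/* GCC/Clang builtin: returns true when the multiplication overflows. */
	return __builtin_mul_overflow(key, (s64)load, product);
}
```

With a load of 1024, a vruntime that is 1000 behind min_vruntime gives a
product of -1024000 and no overflow, while a vruntime that is 2^62 behind
min_vruntime makes the product overflow. A debug patch could apply a similar
check wherever such a product is computed.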

Or, if you have any suggestions or suspicions on how to narrow this down, I
could try them.

thanks,
Chenyu
