linux-kernel - Re: [bug-report] possible s64 overflow in max

Message-ID: <Y9LG5vkf/4ufJb35@u40bc5e070a0153.ant.amazon.com>
Date:   Thu, 26 Jan 2023 19:31:02 +0100
From:   Roman Kagan <rkagan@...zon.de>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     Zhang Qiao <zhangqiao22@...wei.com>,
        Waiman Long <longman@...hat.com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        "Vincent Guittot" <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        "Daniel Bristot de Oliveira" <bristot@...hat.com>,
        lkml <linux-kernel@...r.kernel.org>
Subject: Re: [bug-report] possible s64 overflow in max_vruntime()

On Thu, Jan 26, 2023 at 01:49:43PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 25, 2023 at 08:45:32PM +0100, Roman Kagan wrote:
> 
> > The calculation is indeed safe against the overflow of the vruntimes
> > themselves.  However, when the two vruntimes are more than 2^63 apart,
> > their comparison gets inverted due to that s64 overflow.
> 
> Yes, but that's a whole different issue. vruntime are not expected to be
> *that* far apart.
> 
> That is surely the abnormal case. The normal case is wrap around, and
> that happens 'often' and should continue working.
> 
> > And this is what happens here: one scheduling entity has accumulated a
> > vruntime more than 2^63 ahead of another.  Now the comparison is
> > inverted due to s64 overflow, and the latter can't get to the cpu,
> > because it appears to have vruntime (much) bigger than that of the
> > former.
> 
> If it can be 2^63 ahead, it can also be 2^(64+) ahead and nothing will
> help.
> 
> > This situation is reproducible e.g. when one scheduling entity is a
> > multi-cpu hog, and the other is woken up from a long sleep.  Normally
> 
> A very low weight CPU hog?

Right.  In our case this weight was due to the task group consuming
all 448 cpus on the machine; presumably one can achieve this on a smaller
machine by tweaking shares of the cgroup.
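
To make the inversion discussed above concrete, here is a minimal userspace
sketch. The helper is modeled on the signed-delta comparison in
max_vruntime(); the typedefs and the chosen values are assumptions for
illustration, not kernel code. It relies on the u64-to-s64 conversion
wrapping modulo 2^64, which gcc and clang guarantee.

```c
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

/*
 * Pick the "later" of two vruntimes by looking at the sign of their
 * difference. This survives u64 wrap-around as long as the two values
 * are less than 2^63 apart.
 */
static u64 max_vruntime(u64 max_vruntime, u64 vruntime)
{
	/* Out-of-range u64 -> s64 conversion: modular on gcc/clang. */
	s64 delta = (s64)(vruntime - max_vruntime);

	if (delta > 0)
		max_vruntime = vruntime;

	return max_vruntime;
}

/*
 * Normal case (wrap-around):
 *   max_vruntime(UINT64_MAX - 5, 4) returns 4, because 4 is only
 *   10 ahead of UINT64_MAX - 5 once the counter has wrapped.
 *
 * Abnormal case (more than 2^63 apart):
 *   max_vruntime(0, (1ULL << 63) + 1) returns 0, because the difference
 *   turns negative when viewed as s64, so the entity that is far
 *   ahead appears to be behind.
 */
```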

> > when a task is placed on a cfs_rq, its vruntime is pulled to
> > min_vruntime, to avoid boosting the woken up task.  However in this case
> > the task is so much behind in vruntime that it appears ahead instead,
> > its vruntime is not adjusted in place_entity(), and then it loses the
> > cpu to the current scheduling entity.
> 
> What I think might be a way out here is passing the sleep wall-time
> (cfs_rq_clock_pelt() time I suppose) to place_entity() and simply skip
> the magic if 'big'.
> 
> All that only matters for small sleeps anyway.
> 
> Something like:
> 
>         sleep_time = U64_MAX;
>         if (se->avg.last_update_time)
>           sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;

Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
others were suggesting?  It appears to better match the notion of sleep
wall-time, no?

Thanks,
Roman.

> 
>         if (sleep_time > 60*NSEC_PER_SEC) { // 1 minute is huge
>           se->vruntime = cfs_rq->min_vruntime;
>           return;
>         }
> 
>         // ... rest of place_entity()
> 
> Hmm... ?
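
For reference, the proposal quoted above can be sketched as a standalone
userspace model. Everything named toy_* is hypothetical: the structs only
mimic the fields involved, `clock` stands in for whichever clock is chosen
(cfs_rq_clock_pelt() or rq_clock_task()), `last_seen` stands in for
se->avg.last_update_time or se->exec_start, and the short-sleep branch is a
heavy simplification of the rest of place_entity().

```c
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

#define NSEC_PER_SEC	1000000000ULL
/* "1 minute is huge", as in the quoted proposal. */
#define LONG_SLEEP_NS	(60 * NSEC_PER_SEC)

/* Hypothetical stand-ins for the kernel structures. */
struct toy_cfs_rq {
	u64 min_vruntime;
	u64 clock;	/* current time in ns on the chosen clock */
};

struct toy_se {
	u64 vruntime;
	u64 last_seen;	/* time the entity last ran; 0 means "never" */
};

/* Signed-delta comparison, valid only for values < 2^63 apart. */
static u64 toy_max_vruntime(u64 a, u64 b)
{
	return (s64)(b - a) > 0 ? b : a;
}

static void toy_place_entity(const struct toy_cfs_rq *cfs_rq,
			     struct toy_se *se)
{
	u64 sleep_time = UINT64_MAX;

	if (se->last_seen)
		sleep_time = cfs_rq->clock - se->last_seen;

	/*
	 * After a long sleep the old vruntime is not trusted at all, so
	 * the possibly inverted comparison below is never reached.
	 */
	if (sleep_time > LONG_SLEEP_NS) {
		se->vruntime = cfs_rq->min_vruntime;
		return;
	}

	/* Short sleep (simplified): never move the entity backwards. */
	se->vruntime = toy_max_vruntime(se->vruntime, cfs_rq->min_vruntime);
}
```

In this model an entity that slept for two minutes with a vruntime more than
2^63 behind min_vruntime is reset to min_vruntime, whereas the plain
comparison alone would leave it where it was.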
