linux-kernel - AW: Crash in fair scheduler


Message-ID: <ad7d2769ae3b4d8a88e4f67d5fb800cf@SVR-IES-MBX-03.mgc.mentorg.com>
Date:   Thu, 5 Dec 2019 10:56:13 +0000
From:   "Schmid, Carsten" <Carsten_Schmid@...tor.com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     "mingo@...hat.com" <mingo@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "walken@...gle.com" <walken@...gle.com>,
        "dave@...olabs.net" <dave@...olabs.net>
Subject: AW: Crash in fair scheduler

> From: Peter Zijlstra [mailto:peterz@...radead.org]

> 
> Exactly.
> 
> 
> I suppose one approach is to add code to both __enqueue_entity() and
> __dequeue_entity() that compares ->rb_leftmost to the result of
> rb_first(). That'd incur some overhead but it'd double check the logic.
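A minimal userspace sketch of that double check, assuming a plain unbalanced binary search tree in place of the kernel rbtree. The names `struct node`, `struct cached_tree`, `tree_first()`, `leftmost_cache_ok()` and `tree_insert()` are hypothetical stand-ins, not kernel APIs; `tree_first()` only plays the role that rb_first() has in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the kernel's rb_node / rb_root_cached. */
struct node {
    long key;
    struct node *left, *right;
};

struct cached_tree {
    struct node *root;
    struct node *leftmost;  /* cached pointer to the smallest-key node */
};

/* Walk left from the root, the way rb_first() finds the first node. */
static struct node *tree_first(const struct cached_tree *t)
{
    struct node *n = t->root;

    if (!n)
        return NULL;
    while (n->left)
        n = n->left;
    return n;
}

/* The double check: the cache must agree with a fresh walk of the tree. */
static bool leftmost_cache_ok(const struct cached_tree *t)
{
    return t->leftmost == tree_first(t);
}

/* Insert without rebalancing; update the cache if we only ever went left. */
static void tree_insert(struct cached_tree *t, struct node *n)
{
    struct node **link = &t->root;
    bool is_leftmost = true;

    n->left = n->right = NULL;
    while (*link) {
        if (n->key < (*link)->key) {
            link = &(*link)->left;
        } else {
            link = &(*link)->right;
            is_leftmost = false;
        }
    }
    *link = n;
    if (is_leftmost)
        t->leftmost = n;
}
```

Calling `leftmost_cache_ok()` after every insert (and after every removal, which this sketch omits) costs one walk down the left edge of the tree per call, which is the overhead mentioned above.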

As this happened only once and there is no reproducer, I would prefer an
approach that checks for exactly this case in the code path where it crashed.
Something like this (in pseudo-code):

simple:
....

do {
  se = pick_next_entity(..)
  if (unlikely(!se)) { /* here we check for the issue */
    write warning and some useful data to dmesg
    if (cfs_rq->rb_leftmost == NULL) { /* our case */
      set cfs_rq->rb_leftmost to itself as mentioned in the discussion
      se = pick_next_entity(..)       /* should now return a valid pointer */
    } else { /* another case happened, unknown */
      write warning to dmesg UNKNOWN
      panic() /* not known what to do here, would crash anyway. */
    }
  }
  set_next_entity(se, ..)
  cfs_rq = group_cfs_rq(...)
} while (cfs_rq);

This will definitely not fix the cause of rb_leftmost being NULL, but we
can't tell where that happened at all, so it's digging in the dark.
Maybe the data written to dmesg will help to diagnose this further if the
issue happens again.
This will also not affect performance much, which I have to take care of as well.
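To make the idea concrete, here is a rough userspace model of the warn-and-repair step, not kernel code. All names (`struct entity`, `struct runqueue`, `walk_first()`, `pick_cached()`, `pick_checked()`) are hypothetical, and as an assumption the repair recomputes the cached pointer by walking the tree, which is a different repair than the one in the pseudo-code above.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model types; NOT the kernel's sched_entity / cfs_rq. */
struct entity {
    long vruntime;
    struct entity *left, *right;
};

struct runqueue {
    struct entity *root;      /* tree of runnable entities */
    struct entity *leftmost;  /* cached smallest-vruntime entity */
};

/* Find the first entity by walking left from the root. */
static struct entity *walk_first(const struct runqueue *rq)
{
    struct entity *se = rq->root;

    while (se && se->left)
        se = se->left;
    return se;
}

/* Fast path: trust the cache, as the crashing code path did. */
static struct entity *pick_cached(const struct runqueue *rq)
{
    return rq->leftmost;
}

/*
 * Checked pick: on a NULL result, log what we know, then repair the
 * cache if the tree is not actually empty. *repaired is set to 1 when
 * the cache had to be rebuilt. Returns NULL only if nothing is queued;
 * the caller decides what to do then (the pseudo-code would panic).
 */
static struct entity *pick_checked(struct runqueue *rq, int *repaired)
{
    struct entity *se = pick_cached(rq);

    *repaired = 0;
    if (se)
        return se;

    fprintf(stderr, "warning: pick returned NULL (root=%p leftmost=%p)\n",
            (void *)rq->root, (void *)rq->leftmost);

    if (rq->root) {           /* our case: stale cache, non-empty tree */
        rq->leftmost = walk_first(rq);
        *repaired = 1;
        return pick_cached(rq);
    }
    return NULL;              /* unknown case: nothing to pick at all */
}
```

The extra work only runs when the fast path returns NULL, so the normal path pays for a single pointer test.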

Thanks for all your suggestions.
Carsten