lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ad7d2769ae3b4d8a88e4f67d5fb800cf@SVR-IES-MBX-03.mgc.mentorg.com>
Date:   Thu, 5 Dec 2019 10:56:13 +0000
From:   "Schmid, Carsten" <Carsten_Schmid@...tor.com>
To:     Peter Zijlstra <peterz@...radead.org>
CC:     "mingo@...hat.com" <mingo@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "walken@...gle.com" <walken@...gle.com>,
        "dave@...olabs.net" <dave@...olabs.net>
Subject: AW: Crash in fair scheduler

> Von: Peter Zijlstra [mailto:peterz@...radead.org]

> 
> Exatly.
> 
> 
> I suppose one approach is to add code to both __enqueue_entity() and
> __dequeue_entity() that compares ->rb_leftmost to the result of
> rb_first(). That'd incur some overhead but it'd double check the logic.

As this is a ONCE without reproducer, i would prefer to use an approach
to exactly check for this case in the code path where it crashed.
Something like this (with pseudo-code):

simple:
....

do {
  se = pick_next_entity(..)
  if (unlikely(!se)) { /* here we check for the issue */
     write warning and some useful data to dmesg
     if (cur_rq->rb_leftmost == NULL) { /* our case */
       set cur_rq->rb_leftmost to itself as mentioned in the discussion
       se = pick_next_entity(..)       /* should now return a valid pointer */
     } else { /* another case happened, unknown */
        write warning to dmesg UNKNOWN
        panic() /* not known what to do here, would crash anyway. */
     }
  set_next_entity(se, ..)
  cfs_rq = group_cfs_rq(...)
} while (cfs_rq);

This will definitely not fix the rb_leftmost being NULL, but we can't tell
where this happened at all, so it's digging in the dark.
Maybe the data written to dmesg will help to diagnose further, if the issue
will happen again.
And, this will not affect performance much, as i have to take care of this too.

Thanks for all your suggestions.
Carsten

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ