linux-kernel - Re: INFO: rcu detected stall in do

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20181024120335.GE29272@localhost.localdomain>
Date:   Wed, 24 Oct 2018 14:03:35 +0200
From:   Juri Lelli <juri.lelli@...hat.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     luca abeni <luca.abeni@...tannapisa.it>,
        Thomas Gleixner <tglx@...utronix.de>,
        Juri Lelli <juri.lelli@...il.com>,
        syzbot <syzbot+385468161961cee80c31@...kaller.appspotmail.com>,
        Borislav Petkov <bp@...en8.de>,
        "H. Peter Anvin" <hpa@...or.com>,
        LKML <linux-kernel@...r.kernel.org>, mingo@...hat.com,
        nstange@...e.de, syzkaller-bugs@...glegroups.com, henrik@...tad.us,
        Tommaso Cucinotta <tommaso.cucinotta@...tannapisa.it>,
        Claudio Scordino <claudio@...dence.eu.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>
Subject: Re: INFO: rcu detected stall in do_idle

On 19/10/18 22:50, luca abeni wrote:
> On Fri, 19 Oct 2018 13:39:42 +0200
> Peter Zijlstra <peterz@...radead.org> wrote:
> 
> > On Thu, Oct 18, 2018 at 01:08:11PM +0200, luca abeni wrote:
> > > Ok, I see the issue now: the problem is that the "while
> > > (dl_se->runtime <= 0)" loop is executed at replenishment time, but
> > > the deadline should be postponed at enforcement time.
> > > 
> > > I mean: in update_curr_dl() we do:
> > > 	dl_se->runtime -= scaled_delta_exec;
> > > 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> > > 		...
> > > 		enqueue replenishment timer at dl_next_period(dl_se)
> > > But dl_next_period() is based on a "wrong" deadline!
> > > 
> > > 
> > > I think that inserting a
> > >         while (dl_se->runtime <= -pi_se->dl_runtime) {
> > >                 dl_se->deadline += pi_se->dl_period;
> > >                 dl_se->runtime += pi_se->dl_runtime;
> > >         }
> > > immediately after "dl_se->runtime -= scaled_delta_exec;" would fix
> > > the problem, no?  
> > 
> > That certainly makes sense to me.
> 
> Good; I'll try to work on this idea in the weekend.

So, we (me and Luca) managed to spend some more time on this and found a
few more things worth sharing. I'll try to summarize what we have got so
far (including what already discussed) because impression is that each
point might deserve a fix or at least consideration (just amazing how a
simple random fuzzer thing can highlight all that :). Apologies for the
long email.

Reproducer runs on a CONFIG_HZ=100, CONFIG_IRQ_TIME_ACCOUNTING kernel
and does something like this (only the bits that seems to matter here)

int main(void)
{
  [...]
  [setup stuff at 0x2001d000]
  syscall(__NR_perf_event_open, 0x2001d000, 0, -1, -1, 0);
  *(uint32_t*)0x20000000 = 0;
  *(uint32_t*)0x20000004 = 6;
  *(uint64_t*)0x20000008 = 0;
  *(uint32_t*)0x20000010 = 0;
  *(uint32_t*)0x20000014 = 0;
  *(uint64_t*)0x20000018 = 0x9917; <-- ~40us
  *(uint64_t*)0x20000020 = 0xffff; <-- ~65us (~60% bandwidth)
  *(uint64_t*)0x20000028 = 0;
  syscall(__NR_sched_setattr, 0, 0x20000000, 0);
  [busy loop]
  return 0;
}

And this causes problems because the task is actually never throttled.

Pain points:

 1. Granularity of enforcement (at each tick) is huge compared with
    the task runtime. This makes starting the replenishment timer,
    when runtime is depleted, always to fail (because old deadline
    is way in the past). So, the task is fully replenished and put
    back to run.

    - Luca's proposal should help here, since the deadline is postponed
      at throttling time, and replenishment timer set to that (and it
      should be in the future)

 1.1 Even if we fix 1. in a configuration like this, the task would
     still be able to run for ~10ms (worst case) and potentially starve
     other tasks. It doesn't seem a too big interval maybe, but there
     might be other very short activities that might miss an occasion
     to run "quickly".

     - Might be fixed by imposing (via sysctl) reasonable defaults for
       minimum runtime (w.r.t. HZ, like HZ/2) and maximum for period
       (as also a very small bandwidth task can have a big runtime if
       period is big as well)

 (1.2) When runtime becomes very negative (because delta_exec was big)
       we seem to spend lot of time inside the replenishment loop.

       - Not sure it's such a big problem, might need more profiling.
         Feeling is that once the other points will be addressed this
	 won't matter anymore

 2. This is related to perf_event_open syscall reproducer does before
    becoming DEADLINE and entering the busy loop. Enabling of perf
    swevents generates lot of hrtimers load that happens in the
    reproducer task context. Now, DEADLINE uses rq_clock() for setting
    deadlines, but rq_clock_task() for doing runtime enforcement.
    In a situation like this it seems that the amount of irq pressure
    becomes pretty big (I'm seeing this on kvm, real hw should maybe do
    better, pain point remains I guess), so rq_clock() and
    rq_clock_task() might become more a more skewed w.r.t. each other.
    Since rq_clock() is only used when setting absolute deadlines for
    the first time (or when resetting them in certain cases), after a
    bit the replenishment code will start to see postponed deadlines
    always in the past w.r.t. rq_clock(). And this brings us back to the
    fact that the task is never stopped, since it can't keep up with
    rq_clock().

    - Not sure yet how we want to address this [1]. We could use
      rq_clock() everywhere, but tasks might be penalized by irq
      pressure (theoretically this would mandate that irqs are
      explicitly accounted for I guess). I tried to use the skew between
      the two clocks to "fix" deadlines, but that puts us at risks of
      de-synchronizing userspace and kernel views of deadlines.

 3. HRTICK is not started for new entities.
 
    - Already got a patch for it.

This should be it, I hope. Luca (thanks a lot for your help) and please
add or correct me if I was wrong.

Thoughts?

Best,

- Juri

1 - https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L1162