Date:	Thu, 4 Feb 2016 09:54:48 +0000
From:	Juri Lelli <juri.lelli@....com>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Clark Williams <williams@...hat.com>,
	John Kacur <jkacur@...hat.com>,
	Daniel Bristot de Oliveira <bristot@...hat.com>,
	Juri Lelli <juri.lelli@...il.com>
Subject: Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets

Hi Steve,

First of all, thanks a lot for your detailed report; if only all bug
reports were like this. :)

On 03/02/16 13:55, Steven Rostedt wrote:
> There's an accounting issue with the SCHED_DEADLINE and the creation of
> cpusets. If a SCHED_DEADLINE task already exists and a new root domain
> is created, the calculation of the bandwidth among the root domains
> gets corrupted.
> 
> For the reproducer, I downloaded Juri's tests:
> 
>   https://github.com/jlelli/tests.git 
> 
>     For his burn.c file.
> 
>   https://github.com/jlelli/schedtool-dl.git
> 
>     For his modified schedtool utility.
> 
> 
> I have a kernel with my patches that show the bandwidth:
> 
>  # grep dl /proc/sched_debug        
> dl_rq[0]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[1]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[2]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[3]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[4]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[5]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[6]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[7]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> 
> 
> Note: the sched_rt runtime and period are 950000 and 1000000
> respectively, and the bw ratio is (95/100) << 20 == 996147.
> 
> This isn't the way I first discovered the issue, but it appears to be
> the quickest way to reproduce it.
> 
> Make sure there are no other cpusets. As libvirt created some, I had to
> remove them first:
> 
>  # rmdir /sys/fs/cgroup/cpuset/libvirt/{qemu,}
> 
> 
>  # burn&
>  # schedtool -E -t 2000000:20000000 $!
> 
>  # grep dl /proc/sched_debug
> dl_rq[0]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 104857
> dl_rq[1]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 104857
> dl_rq[2]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 104857
> dl_rq[3]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 104857
> dl_rq[4]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 104857
> dl_rq[5]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 104857
> dl_rq[6]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 104857
> dl_rq[7]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 104857
> 
> Note: (2/20) << 20 == 104857
> 
>  # echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
> 
>  # grep dl /proc/sched_debug                                       
> dl_rq[0]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[1]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[2]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[3]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[4]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[5]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[6]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[7]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> 
> Notice that after disabling load balancing in the main cpuset, all the
> totals went to zero.
> 

Right. I think this is the same thing that happens after hotplug; IIRC
the code paths are actually the same. The problem is that hotplug and
cpuset reconfiguration operations are destructive w.r.t. root_domains,
so we lose bandwidth information when they run: we only store cumulative
bandwidth information in the root_domain, while the information about
which task belongs to which cpuset is stored in the cpuset data
structures.
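
To make that concrete, this is roughly what the relevant pieces look
like (paraphrasing, details may differ slightly in your tree): the only
DEADLINE accounting a root_domain carries is the aggregate dl_bw, so
there is nothing to rebuild it from once the root_domain goes away.

struct dl_bw {
        raw_spinlock_t  lock;
        u64             bw;             /* max allowed bandwidth, e.g. 996147 */
        u64             total_bw;       /* sum of the admitted tasks' bandwidth */
};

struct root_domain {
        /* ... other fields omitted ... */
        struct dl_bw    dl_bw;          /* cumulative only, lost on rebuild */
};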

I tried to fix this a while back, but my attempt was broken: I failed
to get the locking right and, even though it seemed to fix the issue for
me, it was prone to race conditions. You might still want to have a look
at it for reference: https://lkml.org/lkml/2015/9/2/162
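
As a side note, the numbers in your dumps are exactly what I'd expect:
to_ratio() turns runtime/period into a fixed-point fraction with a
20-bit shift, roughly (abridged, from memory):

unsigned long to_ratio(u64 period, u64 runtime)
{
        if (runtime == RUNTIME_INF)
                return 1ULL << 20;

        if (period == 0)
                return 0;

        return div64_u64(runtime << 20, period);
}

so 950000/1000000 gives (950000 << 20) / 1000000 == 996147 and
2000000/20000000 gives 104857, as in your traces.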

> Let's see what happens when we kill the task.
> 
>  # killall burn
> 
>  # grep dl /proc/sched_debug
> dl_rq[0]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : -104857
> dl_rq[1]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : -104857
> dl_rq[2]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : -104857
> dl_rq[3]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : -104857
> dl_rq[4]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : -104857
> dl_rq[5]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : -104857
> dl_rq[6]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : -104857
> dl_rq[7]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : -104857
> 
> They all went negative!
> 

Yes, that's because we remove the task's bw from the root_domain
unconditionally in task_dead_dl(), as you also found out below.
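
For reference, task_dead_dl() is roughly just (abridged):

static void task_dead_dl(struct task_struct *p)
{
        struct dl_bw *dl_b = dl_bw_of(task_cpu(p));

        raw_spin_lock_irq(&dl_b->lock);
        /* XXX we should retain the bw until 0-lag */
        dl_b->total_bw -= p->dl.dl_bw;
        raw_spin_unlock_irq(&dl_b->lock);
}

Since the root_domain rebuild already zeroed total_bw, subtracting the
task's bandwidth here is what drives it negative.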

> Not good, but we can recover...
> 
>  # echo 1 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance 
> 
>  # grep dl /proc/sched_debug
> dl_rq[0]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[1]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[2]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[3]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[4]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[5]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[6]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> dl_rq[7]:
>   .dl_nr_running                 : 0
>   .dl_bw->bw                     : 996147
>   .dl_bw->total_bw               : 0
> 
> Playing with this a bit more, I found that setting load_balance to 1
> in the toplevel cpuset always resets the deadline bandwidth, whether or
> not it should. At least that's a way to recover from things not working
> anymore, but I still believe this is a bug.
> 

It's good that we can recover, but yes, that's still a bug :/.
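
The reason flipping sched_load_balance back to 1 "recovers" is that it
triggers a full rebuild of the sched domains, and a freshly built
root_domain starts from a clean slate; init_rootdomain() ends up doing
something like (abridged, from memory):

void init_dl_bw(struct dl_bw *dl_b)
{
        raw_spin_lock_init(&dl_b->lock);
        dl_b->bw = to_ratio(global_rt_period(), global_rt_runtime());
        dl_b->total_bw = 0;     /* previously admitted bandwidth is forgotten */
}

so the negative totals get thrown away together with any legitimately
admitted bandwidth.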

I'll try to see if my broken patch makes what you are seeing disappear,
so that we can at least confirm that we are looking at the same problem;
you could do the same if you want, I pushed it here:

 git://linux-arm.org/linux-jl.git upstream/fixes/dl-hotplug

Anyway, I'm not sure if my approach can be fixed or if we have to solve
this some other way. I'll have to get back to looking at this.

Best,

- Juri
