Message-ID: <20160204122745.GC29586@e106622-lin>
Date:	Thu, 4 Feb 2016 12:27:45 +0000
From:	Juri Lelli <juri.lelli@....com>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Clark Williams <williams@...hat.com>,
	John Kacur <jkacur@...hat.com>,
	Daniel Bristot de Oliveira <bristot@...hat.com>,
	Juri Lelli <juri.lelli@...il.com>
Subject: Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets

On 04/02/16 12:04, Juri Lelli wrote:
> On 04/02/16 09:54, Juri Lelli wrote:
> > Hi Steve,
> > 
> > first of all, thanks a lot for your detailed report; if only all bug
> > reports were like this.. :)
> > 
> > On 03/02/16 13:55, Steven Rostedt wrote:
> 
> [...]
> 
> > 
> > Right. I think this is the same thing that happens after hotplug. IIRC
> > the code paths are actually the same. The problem is that hotplug or
> > cpuset reconfiguration operations are destructive w.r.t. root_domains,
> > so we lose bandwidth information when that happens: we only store
> > cumulative bandwidth information in the root_domain, while information
> > about which task belongs to which cpuset is stored in the cpuset data
> > structures.
> > 
> > I tried to fix this a while back, but my attempt was broken: I failed
> > to get the locking right and, even though it seemed to fix the issue
> > for me, it was prone to race conditions. You might still want to have
> > a look at it for reference: https://lkml.org/lkml/2015/9/2/162
> > 
> 
> [...]
> 
> > 
> > It's good that we can recover, but that's still a bug, yes :/.
> > 
> > I'll try to see if my broken patch makes what you are seeing
> > disappear, so that we can at least confirm that we are seeing the same
> > problem; you could do the same if you want, I pushed that here
> > 
> 
> No, it doesn't solve this :/. I placed the restoring code in the hotplug
> workfn, so updates generated by toggling sched_load_balance don't get
> caught, of course. But this at least tells us that we should solve this
> somewhere else.
> 

Well, if I call an unlocked version of my cpuset_hotplug_update_rd()
from kernel/cpuset.c:update_flag(), the issue seems to go away. But we
end up overcommitting the default null domain (try toggling
sched_load_balance multiple times). I updated the branch, but I still
think we should solve this differently.
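
To make the idea a bit more concrete, the kind of restore I have in mind
looks roughly like the sketch below. Take it as a simplified, unlocked
illustration only, not the actual code on the branch: the helper name,
the way the root_domain is looked up and the locking are approximations,
and it still has the races mentioned above.

/*
 * Rough sketch, not the actual code on the branch: walk the tasks
 * attached to a cpuset and re-add their -deadline bandwidth to the
 * root_domain their runqueues now point to, since the cumulative
 * total_bw was lost when the root_domain got rebuilt.
 */
static void cpuset_restore_dl_bw(struct cpuset *cs)
{
	struct css_task_iter it;
	struct task_struct *p;

	css_task_iter_start(&cs->css, &it);
	while ((p = css_task_iter_next(&it))) {
		struct dl_bw *dl_b;
		unsigned long flags;

		if (!dl_task(p))
			continue;

		dl_b = &task_rq(p)->rd->dl_bw;

		raw_spin_lock_irqsave(&dl_b->lock, flags);
		/* p->dl.dl_bw is the task's relative bandwidth (runtime/period) */
		__dl_add(dl_b, p->dl.dl_bw);
		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
	}
	css_task_iter_end(&it);
}

Nothing here prevents a task from migrating or the root_domain from
being rebuilt again while we iterate, which is exactly the kind of race
the old attempt suffered from, and another reason why I think this needs
a different solution.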

Best,

- Juri
