linux-kernel - Re: [RFC] [PATCH] Pre-emption control for userspace

Message-ID: <53189D92.8080404@oracle.com>
Date:	Thu, 06 Mar 2014 09:08:50 -0700
From:	Khalid Aziz <khalid.aziz@...cle.com>
To:	Peter Zijlstra <peterz@...radead.org>
CC:	Andi Kleen <andi@...stfloor.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	One Thousand Gnomes <gnomes@...rguk.ukuu.org.uk>,
	"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...nel.org>,
	akpm@...ux-foundation.org, viro@...iv.linux.org.uk,
	oleg@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [RFC] [PATCH] Pre-emption control for userspace

On 03/06/2014 02:57 AM, Peter Zijlstra wrote:
> On Wed, Mar 05, 2014 at 12:58:29PM -0700, Khalid Aziz wrote:
>> Looking at the current problem I am trying to
>> solve with databases and JVM, I run into the same issue I described in my
>> earlier email. Proxy execution is a post-contention solution. By the time
>> proxy execution can do something for my case, I have already paid the price
>> of contention and a context switch which is what I am trying to avoid. For a
>> critical section that is very short compared to the size of execution
>> thread, which is the case I am looking at, avoiding preemption in the middle
>> of that short critical section helps much more than dealing with lock
>> contention later on.
>
> Like others have already stated; it's likely still cheaper than the
> pile-up you get now. It might not be optimally fast, but it sure takes
> out the worst case you have now.
>
>> The goal here is to avoid lock contention and
>> associated cost. I do understand the cost of dealing with lock contention
>> poorly and that can easily be much bigger cost, but I am looking into
>> avoiding even getting there.
>
> The thing is; unless userspace is an RT program or practises the same
> discipline to such an extent that it makes no practical difference,
> there's always going to be the case where you fail to cover the entire
> critical section, at which point you're back to your pile-up fail.
>
> So while the limited preemption guard helps the best case, it doesn't
> help the worst case at all.

That is true. I am breaking this problem into two parts: (1) avoid the
pile-up, and (2) if a pile-up happens, deal with it efficiently. The
worst-case scenario you point out is the second part of the problem.
Solutions for that can be the PTHREAD_PRIO_PROTECT protocol, for threads
that use POSIX threads, or proxy execution. Once a pile-up has happened,
the cost of a system call to boost thread priority becomes a much
smaller part of the overall cost of handling the pile-up.

Part (1) of this problem is what my patch attempts to solve. Here the
cost of a system call to boost priority, or to do anything else, is too
high. The mechanism to avoid a pile-up has to be very lightweight to be
of any use.

>
> So supposing we went with this now; you (or someone else) will come back
> in a year's time and tell us that if we only just stretch this window a
> little, their favourite workload will also benefit.
>
> Where's the end of that?
>
> And what about CONFIG_HZ; suppose you compile your kernel with HZ=100
> and your 1 extra tick is sufficient. Then someone compiles their kernel
> with HZ=1000 and it all comes apart.
>
>

My goal here is to help the cases where the critical section is short
and executes quickly, as it should for well-designed critical sections
in threads that want to run under CFS. I see this as an incremental
improvement over the current situation. With CFS, the timeslice is
adaptive and depends upon the workload, so it is not directly tied to
CONFIG_HZ. But you are right, CONFIG_HZ does have a bearing on this.

I see a workload whose critical section can easily run past a single
timeslice and cause a pile-up as a workload designed to create these
problems. Such a workload needs to use SCHED_FIFO or the deadline
scheduler with properly designed yield points and priorities, or live
with the pile-ups caused by using CFS. Trying to help such cases with
CFS is not beneficial and will make CFS more and more complex.

What I am trying to do is help the cases where a short critical section
ends up being pre-empted simply because execution reached the critical
section only towards the end of the current timeslice, resulting in an
unintended pile-up. So give these cases a tool to avoid pile-ups, but
make use of the tool come with restrictions (yield the processor as soon
as you can if you got amnesty, and pay a penalty if you don't). At this
point, the two workloads I know of that fit this group are databases and
the JVM, both of which are in significant use.
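
To make the "amnesty, then yield" contract concrete, here is a purely
hypothetical userspace sketch. The struct preempt_hint layout, its field
names, and the helpers critical_enter and critical_exit are invented for
this illustration; they are not the interface of the patch under
discussion, and nothing in this sketch is set by a real kernel. The only
real system interface used is sched_yield().

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/*
 * HYPOTHETICAL shared state between a thread and the scheduler.
 * Not the ABI of the patch; names and layout are assumptions.
 */
struct preempt_hint {
    atomic_int want_delay;   /* thread sets this before a short critical section */
    atomic_int was_granted;  /* scheduler would set this if it granted extra time */
};

/* Entering the critical section costs one store, not a system call. */
void critical_enter(struct preempt_hint *h)
{
    assert(h != NULL);
    atomic_store(&h->want_delay, 1);
}

/*
 * Leaving the critical section: withdraw the request and, if amnesty
 * was granted, give the processor back straight away.
 * Returns 1 if the thread yielded, 0 otherwise.
 */
int critical_exit(struct preempt_hint *h)
{
    assert(h != NULL);
    atomic_store(&h->want_delay, 0);

    /* Read and clear the grant flag in one step. */
    if (atomic_exchange(&h->was_granted, 0)) {
        sched_yield();
        return 1;
    }
    return 0;
}
```

In the common case the thread pays for two memory writes and one atomic
exchange. The yield happens only when extra time was actually used,
which is the restriction described above.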

Makes sense?

Thanks,
Khalid