linux-kernel - Re: CPU POSIX timers livelock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20080502152126.GA3130@atjola.homenet>
Date:	Fri, 2 May 2008 17:21:26 +0200
From:	Björn Steinbrink <B.Steinbrink@....de>
To:	Petr Tesarik <ptesarik@...e.cz>
Cc:	Thomas Gleixner <tglx@...utronix.de>,
	Oleg Nesterov <oleg@...sign.ru>, linux-kernel@...r.kernel.org,
	Roland McGrath <roland@...hat.com>,
	Frank Mayhar <fmayhar@...gle.com>
Subject: Re: CPU POSIX timers livelock

[Added Roland McGrath and Frank Mayhar to Cc:, as this sounds similar
enough to what has been discussed here http://lkml.org/lkml/2008/2/6/505]

On 2008.05.02 17:05:41 +0200, Petr Tesarik wrote:
> Hello,
> 
> there seems to be a possible livelock in the CPU timers (actually, we've
> experienced it once already). The analysis lead to the conclusion that
> fixing this properly will be a bit harder.
> 
> Problem Statement
> 
>  1. Any process can set ITIMER_PROF as short as 1 timer tick
>  2. If the process is CPU-bound (e.g. a number-crunching application),
>     the timer will expire with every timer tick
>  3. If the process has N threads and each thread runs on its own
>     CPU, the timer will expire N times per timer tick
> 
> Now, this can become quite interesting on a larger system. For instance,
> with 128 CPUs and a (conservative) setting of HZ 300, any process can
> effectively ask to be sent a SIGPROF every 26 usec (1/300/128 s). Pretty
> fast, but it can get worse if you consider the pitfalls of a NUMA
> architecture.
> 
> Whenever the timer expires, a new expiration time must be computed for
> each thread in the thread group. The values are stored in task_struct of
> each thread, and IIRC those are kept local to the CPU where the thread
> is running (quite logically). Put in another way, walking all threads
> means touching remote memory except for the few task_struct which are
> local to the recomputing CPU. In the 128-CPU example, if each node has 2
> CPUs (think Altix), only 2 accesses are to the local node and 126
> accesses are to remote memory. These are already quite expensive, but it
> gets even worse. The timer interrupt is generated locally for each CPU,
> so if the process is spread over all of them (see above), all CPUs
> recompute the timer (hint: write access), thus constantly invalidating
> local caches of the remote memory.
> 
> And those computations cannot run in parallel, of course, because
> everything is serialized on sighand->siglock (see run_posix_cpu_timers).
> So, each time one CPU finishes walking the threads, there's already a
> queue of all the other CPUs waiting to do the same again. For this kind
> of load, those 26 usec (approx. 41600 CPU cycles on a 1.6 GHz CPU) may
> be well insufficient. Even more so, if somebody else occasionally takes
> write lock on tasklist_lock. It may so happen that the system spends all
> CPU cycles re-computing the profiling timers over and over again and
> never make any progress. Note this is partly caused by the fact that the
> time needed to handle the profiling timer is itself also counted to the
> profiling time.
> 
> This simply does not scale for large systems (and I was not even
> considering those really large 1024p ones).
> 
> Ideas
> 
> Any ideas for improvement? Obviously, storing per-process expiration
> times in one place instead of dividing the interval by the number of
> currently running threads would help here. But it would spoil the fast
> path (when there are no timers at all). Avoiding that multiple threads
> re-compute the same values might also help (although I've got no idea
> how to implement this).
> 
> Any more comments (before I start hacking something you'll NAK)?
> 
> BTW did POSIX really mean to allow an indefinitely small interval for
> the SIGPROF?
> 
> Petr Tesarik
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/