[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1432738411.4060.389.camel@edumazet-glaptop2.roam.corp.google.com>
Date:	Wed, 27 May 2015 07:53:31 -0700
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Paul McKenney <paulmck@...ux.vnet.ibm.com>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Eric Dumazet <edumazet@...gle.com>,
	Viresh Kumar <viresh.kumar@...aro.org>,
	John Stultz <john.stultz@...aro.org>,
	Joonwoo Park <joonwoop@...eaurora.org>,
	Wenbo Wang <wenbo.wang@...blaze.com>
Subject: Re: [patch 0/7] timers: Footprint diet and NOHZ overhead mitigation
On Tue, 2015-05-26 at 22:50 +0000, Thomas Gleixner wrote:
> I still have a couple of patches against the timer code in my review
> list, but the more I look at them the more horrible I find them.
> 
> All these patches are related to the NOHZ stuff and try to mitigate
> shortcomings with even more bandaids. And of course these bandaids
> cost cycles and are a burden for timer heavy use cases like
> networking.
> 
> Sadly enough the NOHZ crowd is happy to throw more and more crap at
> the kernel and none of these people is even thinking about whether
> this stuff could be solved in a different way.
> 
> After Eric mentioned in one of the recent discussions that the
> timer_migration sysctl is not a lot of gain, I tried to mitigate at
> least that issue. That caused me to look deeper and I came up with the
> following patch series which:
> 
>   - Clean up the sloppy catchup timer jiffies stuff
> 
>   - Replace the hash bucket lists by hlists, which shrinks the wheel
>     footprint by 50%
> 
>   - Replaces the timer base pointer in struct timer_list by a cpu
>     index, which shrinks struct timer_list by 4 - 8 bytes depending on
>     alignement and architecture.
> 
>   - Caches nohz_active and timer_migration in the timer per cpu bases,
>     so we can avoid going through loops and hoops to figure that out.
> 
> This series applies against
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timer/core
> 
> With a pretty naive timer test module and a sched other cpu hog on an
> isolated core I verified that I did not wreckage anything. The
> following table shows the resulting CPU time of the hog for the
> various scenarios.
> 
> 	     nohz=on	      	nohz=on			nohz=off
> 	     timer_migration=on	timer_migration=off	    
> 
> Unpatched:   46.57%		46.87%			46.64%
> 
> Patched:     47.76%		48.20%			48.73%
> 
> Though, I do not trust my numbers, so I really hope that Eric or any
> other power timer wheel user can throw a bunch of tests at it.
> 
> 
> Now some more rant about the whole NOHZ issues. The timer wheel is
> fundamentally unfriendly for NOHZ because:
> 
>   - it's impossible to keep reliably track of the next expiring timer
> 
>   - finding the next expiring timer is in the worst case O(N) and
>     involves touching a gazillion of cache lines
> 
> The other disaster area is the nohz timer migration mechanism. I think
> it's fundamentally broken. 
> 
>  - We decide at timer enqueue time, on which CPU we queue the timer,
>    based on cpu idle heuristics which even fails to recognize that a
>    cpu is really busy with interrupt and softirq processing (reported
>    by Eric).
> 
>  - When moving a timer to some "non-idle" CPU we speculate about the
>    system situation in the future (at timer expiry time). This kinda
>    works for limited scenarios (NOHZ_FULL and constraint power saving
>    on mobile devices). But it is pretty much nonsensical for
>    everything else. For network heavy workloads it can be even
>    outright wrong and dangerous as Eric explained.
> 
> So we really need to put a full stop at all this NOHZ tinkering and
> think about proper solutions. I'm not going to merge any NOHZ related
> features^Whacks before the existing problems are not solved. In
> hindsight I should have done that earlier, but it's never too late.
> 
> So we need to address two issues:
> 
> 1) Keeping track of the first expiring timer in the wheel.
> 
>    Don't even think about rbtrees or such, it just wont work, but I'm
>    willing to accept prove of the contrary.
> 
>    One of the ideas I had a long time ago is to have a wheel
>    implementation, which does never cascade and therefor provides
>    different levels of granularity per wheel level.
> 
>    LVL0	   1  jiffy granularity
>    LVL1	   8  jiffies granularity
>    LVL1	   64 jiffies granularity
>    ....
> 
>    This requires more levels than the classic timer wheel, so its not
>    feasible from a memory consumption POV.
> 
>    But we can have a compromise and put all 'soon' expiring timers
>    into these fancy new wheels and stick all 'far out' timers into the
>    last level of the wheel and cascade them occasionally.
> 
>    That should work because:
> 
>      - Most timers are short term expiry (< 1h)
>      - Most timers are canceled before expiry
> 
>    So we need a sensible upper bound of levels and get the following
>    properties:
> 
>    	- Natural batching (no more slack magic). This might also help
>           networking to avoid rearming of timers.
> 
> 	- Long out timers are queued in the slowest wheel and
>           ocassionally with the granularity of the slowest wheel
>           cascaded
> 
> 	- Tracking the next expiring timer can be done with a bitmap,
>           so the 'get next expiring timer' becomes O(1) without
>           touching anything else than the bitmap, if we accept that
>           the upper limit of what we can retrieve O(1) is the
>           granularity of the last level, i.e. we treat timers which
>           need recascading simple as normal inhabitants of the last
>           wheel level.
> 	  
>         - The expiry code (run_timers) gets more complicated as it
>           needs to walk through the different levels more often, but
>           with the bitmap that shouldnt be a too big issue.
> 
>         - Seperate storage for non-deferrable and deferrable timers.
> 
>     I spent some time in coding that up. It barely boots and has
>     certainly a few bugs left and right, but I will send it out
>     nevertheless as a reply to this mail to get the discussion going.
Thanks a lot Thomas !
I finally got back two hosts with 72 cpus where I can try your patches.
I have one comment about socket TCP timers : They are re-armed very
often, but they are never canceled (unless the socket is dismantled)
I am too lazy to search for the commit adding this stuff (before git),
but I guess it helped a lot RPC sessions : No need to cancel a timer if
we're likely to rearm it in the following 10 usec ;)
$ git grep -n INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:29:#undef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:199:#ifdef INET_CSK_CLEAR_TIMERS
include/net/inet_connection_sock.h:204:#ifdef INET_CSK_CLEAR_TIMERS
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Powered by blists - more mailing lists
 
