Message-ID: <YIh/QubidJcE5IIv@cork>
Date:   Tue, 27 Apr 2021 14:16:50 -0700
From:   Jörn Engel <joern@...estorage.com>
To:     linux-kernel@...r.kernel.org
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>
Subject: sched: wakeup setting TIF_NEED_RESCHED too frequently
I came across something that looks wrong and is certainly inefficient.
Here is a trace of what happened.  Kernel was 5.4, so not too ancient.
 + nanoseconds before present
 |       + 0 for voluntary (going to sleep), 1 for involuntary (TASK_RUNNING)
 |       |    + PID of prev task
 |       |    |      + PID of next task
 |       |    |      |    + RT priority
 |       |    |      |    |  + comm
 v       v    v      v    v  v
8729751  0   6893   6894  0 one
8723802  0   6894 106051  0 one
8718671  1 106051   6899  0 other
8711411  0   6899   6900  0 one
8705643  0   6900   6902  0 one
8699444  0   6902   6903  0 one
8693987  0   6903   6906  0 one
8687832  0   6906   6904  0 one
8681188  0   6904   6901  0 one
8674083  0   6901 106051  0 one
8639249  1 106051   6893  0 other
8631961  0   6893   6894  0 one
8626684  0   6894 106051  0 one
8623375  1 106051   6899  0 other
8617434  0   6899   6900  0 one
8612341  0   6900   6901  0 one
8607215  0   6901   6902  0 one
8600824  0   6902   6903  0 one
8595214  0   6903   6906  0 one
8590161  0   6906   6904  0 one
8584825  0   6904 106051  0 one
8465395  1 106051   6893  0 other
8358696  0   6893   6903  0 one
The "one" process seems to frequently wake a bunch of threads that
do very little work before going back to sleep.  Clearly inefficient,
but we are not interested in userspace problems here.  The "other"
process has a thread that would like to continue running, but only gets
to run for very short periods, 3.3µs in one case.  That also seems
inefficient and looks like either a scheduler bug or a debatable policy.
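For anyone who wants to check the numbers, here is a small sketch that recomputes how long a given PID stayed on the CPU from the trace rows above. It assumes each row records a switch from prev to next at the given timestamp, so a task runs from the row where it is "next" until the following row where it is "prev". The names TRACE and run_lengths_ns are made up for this example.

```python
# Rows copied from the trace above:
# (ns_before_present, involuntary, prev_pid, next_pid)
TRACE = [
    (8729751, 0, 6893, 6894),
    (8723802, 0, 6894, 106051),
    (8718671, 1, 106051, 6899),
    (8711411, 0, 6899, 6900),
    (8705643, 0, 6900, 6902),
    (8699444, 0, 6902, 6903),
    (8693987, 0, 6903, 6906),
    (8687832, 0, 6906, 6904),
    (8681188, 0, 6904, 6901),
    (8674083, 0, 6901, 106051),
    (8639249, 1, 106051, 6893),
    (8631961, 0, 6893, 6894),
    (8626684, 0, 6894, 106051),
    (8623375, 1, 106051, 6899),
    (8617434, 0, 6899, 6900),
    (8612341, 0, 6900, 6901),
    (8607215, 0, 6901, 6902),
    (8600824, 0, 6902, 6903),
    (8595214, 0, 6903, 6906),
    (8590161, 0, 6906, 6904),
    (8584825, 0, 6904, 106051),
    (8465395, 1, 106051, 6893),
    (8358696, 0, 6893, 6903),
]


def run_lengths_ns(trace, pid):
    """Return how long `pid` ran each time it was switched in, in ns.

    Timestamps count down towards the present, so the duration is the
    switch-in timestamp minus the switch-out timestamp.
    """
    lengths = []
    for (t_in, _, _, next_pid), (t_out, _, prev_pid, _) in zip(trace, trace[1:]):
        if next_pid == pid and prev_pid == pid:
            lengths.append(t_in - t_out)
    return lengths


if __name__ == "__main__":
    runs = run_lengths_ns(TRACE, 106051)
    print("run lengths (ns):", runs)
    print("shortest run (ns):", min(runs))
```

On the rows above this yields four run lengths for PID 106051, the shortest being 3309 ns, which is the 3.3µs case mentioned.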
My rule of thumb for task switch overhead is 1-3µs.  Probably 1µs for
the raw switch, plus another 2µs for cold cache effects.  If we want to
amortize that overhead a bit, time slices should be in the region of
1ms, give or take.  I would also be happy if the "one" process would
have to wait 1ms before the woken threads actually get to run.  So in my
book the observed behaviour is wrong, but opinions may differ.
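The amortization argument can be put into numbers with a short sketch. The 1-3µs switch cost is the rule of thumb from above, not a measurement, and overhead_fraction is a name invented for this example.

```python
def overhead_fraction(switch_cost_us, slice_us):
    """Fraction of wall time spent on the task switch itself,
    assuming one switch (cost in µs) per time slice (length in µs)."""
    return switch_cost_us / (switch_cost_us + slice_us)


if __name__ == "__main__":
    # Rule-of-thumb cost of 3µs (1µs raw switch + 2µs cold cache).
    for slice_us in (3.3, 1000.0):
        frac = overhead_fraction(3.0, slice_us)
        print(f"slice {slice_us:8.1f} us -> overhead {frac:.2%}")
```

With a 3µs cost, a 3.3µs slice spends close to half of the time on switching, while a 1ms slice brings that down to roughly 0.3%.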
Anyway, trying to find a cause, I noticed the following call chain:
	set_nr_if_polling()
	ttwu_queue_remote()
	ttwu_queue()
	try_to_wake_up()
	default_wake_function()
	curr->func()
	__wake_up_common()
	__wake_up_common_lock()
	__wake_up()
	wake_up()
Call chain above is manually created from source code.  Closest sample I
caught with instrumentation is missing the leaf calls after
try_to_wake_up():
	_raw_spin_unlock_irqrestore+0x1f/0x40
	try_to_wake_up+0x425/0x5e0
	wake_up_q+0x3f/0x80
	futex_wake+0x159/0x180
	do_futex+0xcd/0xba0
Afaics, the result is us setting TIF_NEED_RESCHED on any wakeup, unless
wake_list is already populated.  Is that actually intentional?  And is
that useful for performance or latency?  I think it isn't, but I am
probably missing something here.
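To make the question concrete, here is a toy userspace model of the behaviour as read above: the reschedule flag is set whenever a wakeup is queued on an empty wake list, and not set when the list is already populated. This is only an illustration of that reading, not kernel code, and whether the kernel really behaves this way is exactly the open question. ToyRunqueue and all of its members are hypothetical names.

```python
class ToyRunqueue:
    """Toy model of a per-CPU wake list plus a need-resched flag.

    Hypothetical illustration only; it mirrors the reading in the
    mail above, not the actual kernel implementation.
    """

    def __init__(self):
        self.wake_list = []
        self.need_resched = False
        self.flag_sets = 0  # how often the flag was newly requested

    def queue_remote_wakeup(self, task):
        """Queue a wakeup; request a reschedule only if the list was empty."""
        was_empty = not self.wake_list
        self.wake_list.append(task)
        if was_empty:
            self.need_resched = True
            self.flag_sets += 1
        return was_empty

    def drain(self):
        """Target CPU processes all pending wakeups and clears the flag."""
        woken, self.wake_list = self.wake_list, []
        self.need_resched = False
        return woken


if __name__ == "__main__":
    rq = ToyRunqueue()
    for task in ("a", "b", "c"):
        rq.queue_remote_wakeup(task)
    print("flag sets for one burst of three wakeups:", rq.flag_sets)
    rq.drain()
    rq.queue_remote_wakeup("d")
    print("flag sets after a later, separate wakeup:", rq.flag_sets)
```

In this model a burst of wakeups costs one flag set, but every isolated wakeup costs one as well, which is the pattern that would produce the frequent preemptions seen in the trace.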
Jörn
--
With a PC, I always felt limited by the software available. On Unix,
I am limited only by my knowledge.
-- Peter J. Schoenster