Message-ID: <702546fc-5d5e-2d62-561a-5c323452f5be@oracle.com>
Date: Wed, 13 Jul 2022 12:34:22 -0700
From: Libo Chen <libo.chen@...cle.com>
To: Tim Chen <tim.c.chen@...ux.intel.com>, peterz@...radead.org,
vincent.guittot@...aro.org, mgorman@...e.de, 21cnbao@...il.com,
dietmar.eggemann@....com
Cc: linux-kernel@...r.kernel.org, tglx@...utronix.de
Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context
linux-kernel@...r.kernel.org complains about the HTML subpart... I am
resending it in plain text only
so everybody on the mailing list can see it.
On 7/13/22 12:17, Libo Chen wrote:
>
>
> On 7/13/22 09:40, Tim Chen wrote:
>> On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
>>> Barry Song first pointed out that replacing sync wakeups with regular wakeups
>>> seems to reduce overeager wakeup pulling and shows a noticeable performance
>>> improvement [1].
>>>
>>> This patch argues that allowing sync for wakeups from interrupt context
>>> is a bug and fixing it can improve performance even when irq/softirq is
>>> evenly spread out.
>>>
>>> For wakeups from an ISR, the waking CPU is simply the CPU the ISR runs on, and
>>> the so-called waker can be any random task that happens to be running on that
>>> CPU when the interrupt comes in. This is completely different from other
>>> wakeups where the task running on the waking CPU is the actual waker, for
>>> example two tasks communicating through a pipe or multiple tasks accessing
>>> the same critical section. This difference matters because with sync we assume
>>> the waker will get off the runqueue and go to sleep immediately after the
>>> wakeup. The assumption is built into wake_affine(), which discounts the
>>> waker's presence from the runqueue when sync is true. A random waker from
>>> interrupt context bears no relation to the wakee and doesn't usually go to
>>> sleep immediately afterwards unless wakeup granularity is reached. Plus, the
>>> scheduler no longer enforces preemption of the waker for sync wakeups as it
>>> did before commit f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature").
>>> Enforcing sync wakeup preemption for wakeups from interrupt context doesn't
>>> seem appropriate either, but at least then sync wakeup would do what it's
>>> supposed to do.
>> Will there be scenarios where you do want the task being woken up to be pulled
>> to the CPU where the interrupt happened, because the data that needs to be accessed
>> is local to the CPU/NUMA node where the interrupt happened? For example, an interrupt
>> associated with received network packets. Sync still seems desirable, at least if
>> there is no task currently running on the CPU where the interrupt happened. So maybe
>> we should take the load on the CPU/NUMA node into consideration before deciding
>> whether to do a sync wake for such an interrupt.
>>
> There are only two places where sync wakeup matters:
> wake_affine_idle() and wake_affine_weight().
> In wake_affine_idle(), sync makes it consider pulling if there is exactly one
> runnable task on the waking CPU, in the belief that this runnable will
> voluntarily get off the runqueue. In wake_affine_weight(), it basically takes
> the waker's load off the waking CPU, again assuming the waker goes to sleep
> after the wakeup.
> My argument is that this assumption doesn't really hold for wakeups from
> interrupt context when the waking CPU is non-idle. Wakeups from task context?
> Sure, it seems to be a reasonable assumption. For your idle case, I totally
> agree, but I don't think having sync or not will actually have any impact
> there given what the code does. The real impact comes from Mel's patch
> 7332dec055f2457c3, which makes it less likely to pull tasks when the waking
> CPU is idle. I believe we should consider reverting 7332dec055f2 because a
> significant RDS latency regression has been spotted recently on our system
> due to that patch.
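>
> For reference, the sync handling in wake_affine_idle() looks roughly like the
> below. This is a simplified paraphrase of my reading of the v5.18 code, not a
> verbatim copy:
>
>     static int
>     wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>     {
>             /* Wakeup from an idle CPU: only pull if cache is shared. */
>             if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
>                     return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
>
>             /*
>              * Sync wakeup with a single runnable on the waking CPU: assume that
>              * runnable (the waker) is about to sleep, so pulling the wakee here
>              * is considered cheap. This is the assumption that breaks down when
>              * the "waker" is just whatever task the interrupt happened to hit.
>              */
>             if (sync && cpu_rq(this_cpu)->nr_running == 1)
>                     return this_cpu;
>
>             if (available_idle_cpu(prev_cpu))
>                     return prev_cpu;
>
>             return nr_cpumask_bits;
>     }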
>
>>> Add a check to make sure that sync can only be set when in_task() is true. If
>>> a wakeup is from interrupt context, sync flag will be 0 because in_task()
>>> returns 0.
>>>
>>> Tested in two scenarios: wakeups from 1) task contexts, expected to see no
>>> performance changes; 2) interrupt contexts, expected to see better performance
>>> under low/medium load and no regression under heavy load.
>>>
>>> Hackbench is used for scenario 1 and pgbench for scenario 2, both from mmtests,
>>> on a 2-socket Xeon E5-2699v3 box with 256G of memory in total, running a 5.18
>>> kernel with SELinux disabled. Scheduler/MM tunables are all at their defaults.
>>> The irqbalance daemon is active.
>>>
>>> Hackbench (config-scheduler-unbound)
>>> =========
>>> process-pipes:
>>> Baseline Patched
>>> Amean 1 0.4300 ( 0.00%) 0.4420 ( -2.79%)
>>> Amean 4 0.8757 ( 0.00%) 0.8774 ( -0.20%)
>>> Amean 7 1.3712 ( 0.00%) 1.3789 ( -0.56%)
>>> Amean 12 2.3541 ( 0.00%) 2.3714 ( -0.73%)
>>> Amean 21 4.2229 ( 0.00%) 4.2439 ( -0.50%)
>>> Amean 30 5.9113 ( 0.00%) 5.9451 ( -0.57%)
>>> Amean 48 9.3873 ( 0.00%) 9.4898 ( -1.09%)
>>> Amean 79 15.9281 ( 0.00%) 16.1385 ( -1.32%)
>>> Amean 110 22.0961 ( 0.00%) 22.3433 ( -1.12%)
>>> Amean 141 28.2973 ( 0.00%) 28.6209 ( -1.14%)
>>> Amean 172 34.4709 ( 0.00%) 34.9347 ( -1.35%)
>>> Amean 203 40.7621 ( 0.00%) 41.2497 ( -1.20%)
>>> Amean 234 47.0416 ( 0.00%) 47.6470 ( -1.29%)
>>> Amean 265 53.3048 ( 0.00%) 54.1625 ( -1.61%)
>>> Amean 288 58.0595 ( 0.00%) 58.8096 ( -1.29%)
>>>
>>> process-sockets:
>>> Baseline Patched
>>> Amean 1 0.6776 ( 0.00%) 0.6611 ( 2.43%)
>>> Amean 4 2.6183 ( 0.00%) 2.5769 ( 1.58%)
>>> Amean 7 4.5662 ( 0.00%) 4.4801 ( 1.89%)
>>> Amean 12 7.7638 ( 0.00%) 7.6201 ( 1.85%)
>>> Amean 21 13.5335 ( 0.00%) 13.2915 ( 1.79%)
>>> Amean 30 19.3369 ( 0.00%) 18.9811 ( 1.84%)
>>> Amean 48 31.0724 ( 0.00%) 30.6015 ( 1.52%)
>>> Amean 79 51.1881 ( 0.00%) 50.4251 ( 1.49%)
>>> Amean 110 71.3399 ( 0.00%) 70.4578 ( 1.24%)
>>> Amean 141 91.4675 ( 0.00%) 90.3769 ( 1.19%)
>>> Amean 172 111.7463 ( 0.00%) 110.3947 ( 1.21%)
>>> Amean 203 131.6927 ( 0.00%) 130.3270 ( 1.04%)
>>> Amean 234 151.7459 ( 0.00%) 150.1320 ( 1.06%)
>>> Amean 265 171.9101 ( 0.00%) 169.9751 ( 1.13%)
>>> Amean 288 186.9231 ( 0.00%) 184.7706 ( 1.15%)
>>>
>>> thread-pipes:
>>> Baseline Patched
>>> Amean 1 0.4523 ( 0.00%) 0.4535 ( -0.28%)
>>> Amean 4 0.9041 ( 0.00%) 0.9085 ( -0.48%)
>>> Amean 7 1.4111 ( 0.00%) 1.4146 ( -0.25%)
>>> Amean 12 2.3532 ( 0.00%) 2.3688 ( -0.66%)
>>> Amean 21 4.1550 ( 0.00%) 4.1701 ( -0.36%)
>>> Amean 30 6.1043 ( 0.00%) 6.2391 ( -2.21%)
>>> Amean 48 10.2077 ( 0.00%) 10.3511 ( -1.40%)
>>> Amean 79 16.7922 ( 0.00%) 17.0427 ( -1.49%)
>>> Amean 110 23.3350 ( 0.00%) 23.6522 ( -1.36%)
>>> Amean 141 29.6458 ( 0.00%) 30.2617 ( -2.08%)
>>> Amean 172 35.8649 ( 0.00%) 36.4225 ( -1.55%)
>>> Amean 203 42.4477 ( 0.00%) 42.8332 ( -0.91%)
>>> Amean 234 48.7117 ( 0.00%) 49.4042 ( -1.42%)
>>> Amean 265 54.9171 ( 0.00%) 55.6551 ( -1.34%)
>>> Amean 288 59.5282 ( 0.00%) 60.2903 ( -1.28%)
>>>
>>> thread-sockets:
>>> Baseline Patched
>>> Amean 1 0.6917 ( 0.00%) 0.6892 ( 0.37%)
>>> Amean 4 2.6651 ( 0.00%) 2.6017 ( 2.38%)
>>> Amean 7 4.6734 ( 0.00%) 4.5637 ( 2.35%)
>>> Amean 12 8.0156 ( 0.00%) 7.8079 ( 2.59%)
>>> Amean 21 14.0451 ( 0.00%) 13.6679 ( 2.69%)
>>> Amean 30 20.0963 ( 0.00%) 19.5657 ( 2.64%)
>>> Amean 48 32.4115 ( 0.00%) 31.6001 ( 2.50%)
>>> Amean 79 53.1104 ( 0.00%) 51.8395 ( 2.39%)
>>> Amean 110 74.0929 ( 0.00%) 72.4391 ( 2.23%)
>>> Amean 141 95.1506 ( 0.00%) 93.0992 ( 2.16%)
>>> Amean 172 116.1969 ( 0.00%) 113.8307 ( 2.04%)
>>> Amean 203 137.4413 ( 0.00%) 134.5247 ( 2.12%)
>>> Amean 234 158.5395 ( 0.00%) 155.2793 ( 2.06%)
>>> Amean 265 179.7729 ( 0.00%) 176.1099 ( 2.04%)
>>> Amean 288 195.5573 ( 0.00%) 191.3977 ( 2.13%)
>>>
>>> Pgbench (config-db-pgbench-timed-ro-small)
>>> =======
>>> Baseline Patched
>>> Hmean 1 68.54 ( 0.00%) 69.72 ( 1.71%)
>>> Hmean 6 27725.78 ( 0.00%) 34119.11 ( 23.06%)
>>> Hmean 12 55724.26 ( 0.00%) 63158.22 ( 13.34%)
>>> Hmean 22 72806.26 ( 0.00%) 73389.98 ( 0.80%)
>>> Hmean 30 79000.45 ( 0.00%) 75197.02 ( -4.81%)
>>> Hmean 48 78180.16 ( 0.00%) 75074.09 ( -3.97%)
>>> Hmean 80 75001.93 ( 0.00%) 70590.72 ( -5.88%)
>>> Hmean 110 74812.25 ( 0.00%) 74128.57 ( -0.91%)
>>> Hmean 142 74261.05 ( 0.00%) 72910.48 ( -1.82%)
>>> Hmean 144 75375.35 ( 0.00%) 71295.72 ( -5.41%)
>>>
>>> For hackbench, +-3% fluctuation is normal on this two-socket box, so it's safe
>>> to conclude that there are no performance differences for wakeups from task
>>> context. For pgbench, after many runs, 10~30% gains are very consistent at
>>> lower client counts (< #cores per socket).
>> Can you provide some further insight into why pgbench is helped in the low-load
>> case? Is it because the woken tasks tend to stay put rather than get moved around
>> by interrupts, and so maintain cache warmth?
> Yes, and for read-only database workloads, the cache on the interrupt CPU
> (whether it holds the incoming packet or not) isn't as performance critical as
> the cache on the previous CPU where the db task ran to process data.
> To give you an example, consider a db client/server case: a client sends a
> request for a select query over the network, and the server accepts the query
> request, does all the heavy lifting, and sends the result back. For the
> server, the incoming packet is just a line of query, whereas the CPU (and its
> L3) that the db process previously ran on has all the warm db caches; pulling
> the task away from them is a crime :) This may seem a little contradictory to
> what I said earlier about the idle case and Mel's patch, but ¯\_(ツ)_/¯ it's
> hard to make all the workloads out there happy. Anyway, like I said earlier,
> this patch doesn't affect the idle case.
>
> At higher load, sync in wake_affine_idle() doesn't really matter because the
> waking CPU could easily have more than one runnable task. Sync in
> wake_affine_weight() also doesn't matter much as both sides have work to do,
> and the cache benefit of not pulling decreases simply because there are a lot
> more db processes under the same L3 competing for the same cachelines.
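>
> To make the wake_affine_weight() part concrete, the sync discount is roughly
> the following (again a simplified paraphrase of v5.18, not verbatim):
>
>     /* Inside wake_affine_weight(): sync discounts the waker's own load. */
>     this_eff_load = cpu_load(cpu_rq(this_cpu));
>     if (sync) {
>             unsigned long current_load = task_h_load(current);
>
>             /* If the "waker" accounts for all of the waking CPU's load, pull. */
>             if (current_load > this_eff_load)
>                     return this_cpu;
>
>             /* Otherwise pretend the waker's load is already gone. */
>             this_eff_load -= current_load;
>     }
>
> Once the waking CPU has several runnable db processes, this discount is only a
> small fraction of this_eff_load, so whether sync is set barely changes the
> comparison.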
>
> Hope my explanation helps!
>
> Libo
>> Tim
>>
>>> For higher client counts, both kernels are
>>> very close; +-5% swings are quite common. Also, NET_TX|RX softirq load
>>> does spread out across both NUMA nodes in this test, so there is no need for
>>> any explicit RPS/RFS.
>>>
>>> [1]: https://lkml.org/lkml/2021/11/5/234
>>>
>>> Signed-off-by: Libo Chen <libo.chen@...cle.com>
>>> ---
>>> kernel/sched/fair.c | 4 +++-
>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 794c2cb945f8..59b210d2cdb5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6704,7 +6704,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>> static int
>>> select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>> {
>>> - int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>>> + /* Don't set sync for wakeup from irq/soft ctx */
>>> + int sync = in_task() && (wake_flags & WF_SYNC)
>>> + && !(current->flags & PF_EXITING);
>>> struct sched_domain *tmp, *sd = NULL;
>>> int cpu = smp_processor_id();
>>> int new_cpu = prev_cpu;
>>> --
>>> 2.31.1
>>>
>