Date:   Wed, 13 Jul 2022 12:34:22 -0700
From:   Libo Chen <libo.chen@...cle.com>
To:     Tim Chen <tim.c.chen@...ux.intel.com>, peterz@...radead.org,
        vincent.guittot@...aro.org, mgorman@...e.de, 21cnbao@...il.com,
        dietmar.eggemann@....com
Cc:     linux-kernel@...r.kernel.org, tglx@...utronix.de
Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context

linux-kernel@...r.kernel.org complains about the HTML subpart... I am
resending it in plain text only so that everybody on the mailing list can
see it.

On 7/13/22 12:17, Libo Chen wrote:
>
>
> On 7/13/22 09:40, Tim Chen wrote:
>> On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
>>> Barry Song first pointed out that replacing sync wakeup with regular wakeup
>>> seems to reduce overeager wakeup pulling and shows a noticeable performance
>>> improvement. [1]
>>>
>>> This patch argues that allowing sync for wakeups from interrupt context
>>> is a bug and fixing it can improve performance even when irq/softirq is
>>> evenly spread out.
>>>
>>> For wakeups from an ISR, the waking CPU is simply the CPU the ISR runs on, and
>>> the so-called waker can be any random task that happens to be running on that
>>> CPU when the interrupt comes in. This is completely different from other
>>> wakeups where the task running on the waking CPU is the actual waker, e.g. two
>>> tasks communicating through a pipe or multiple tasks accessing the same
>>> critical section. This difference matters because with sync we assume the
>>> waker will get off the runqueue and go to sleep immediately after the wakeup.
>>> The assumption is built into wake_affine(), which discounts the waker's
>>> presence from the runqueue when sync is true. A random waker from interrupt
>>> context bears no relation to the wakee and doesn't usually go to sleep
>>> immediately afterwards unless wakeup granularity is reached. Plus, the
>>> scheduler no longer enforces preemption of the waker for sync wakeups as it
>>> did before commit f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature").
>>> Enforcing sync wakeup preemption for wakeups from interrupt context doesn't
>>> seem appropriate either, but at least then sync wakeup would do what it's
>>> supposed to do.
>> Will there be scenarios where you do want the task being woken up to be pulled
>> to the CPU where the interrupt happened, because the data that needs to be accessed is
>> on the local CPU/NUMA node where the interrupt happened?  For example, interrupts associated with network
>> packets received.  Sync still seems desirable, at least if there is no task currently
>> running on the CPU where the interrupt happened.  So maybe we should have some consideration
>> of the load on the CPU/NUMA node before deciding whether we should do a sync wakeup for such
>> an interrupt.
>>
> There are only two places where sync wakeup matters: wake_affine_idle() and
> wake_affine_weight(). In wake_affine_idle(), it considers pulling if there is
> exactly one runnable task on the waking CPU, in the belief that this runnable
> task will voluntarily get off the runqueue. In wake_affine_weight(), it
> basically takes the waker's load off again, assuming the waker goes to sleep
> after the wakeup. My argument is that this assumption doesn't really hold for
> wakeups from interrupt context when the waking CPU is non-idle. For wakeups
> from task context, sure, it seems to be a reasonable assumption. For your idle
> case, I totally agree, but I don't think having sync or not actually has any
> impact here given what the code does. The real impact comes from Mel's patch
> 7332dec055f2457c3 which makes it less likely to pull tasks when the waking CPU
> is idle. I believe we should consider reverting 7332dec055f2 because a
> significant RDS latency regression has been spotted recently on our system due
> to that patch.
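> 
> To make that concrete, the sync handling in question looks roughly like this
> in kernel/sched/fair.c as of v5.18 (paraphrased excerpt; details vary between
> kernel versions):
> 
>         /* wake_affine_idle(): a single runnable task on the waking CPU is
>          * treated as a waker about to sleep, so pulling the wakee is allowed */
>         if (sync && cpu_rq(this_cpu)->nr_running == 1)
>                 return this_cpu;
> 
>         /* wake_affine_weight(): the waker's own load is discounted from the
>          * waking CPU before its load is compared against prev_cpu's */
>         if (sync) {
>                 unsigned long current_load = task_h_load(current);
> 
>                 if (current_load > this_eff_load)
>                         return this_cpu;
> 
>                 this_eff_load -= current_load;
>         }
> 
> Both branches only make sense if current is a waker that will go to sleep
> shortly, which is exactly the assumption that breaks for interrupt-context
> wakeups.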
>
>>> Add a check to make sure that sync can only be set when in_task() is true. If
>>> a wakeup is from interrupt context, the sync flag will be 0 because in_task()
>>> returns 0.
>>>
>>> Tested in two scenarios: wakeups from 1) task contexts, expected to see no
>>> performance changes; 2) interrupt contexts, expected to see better performance
>>> under low/medium load and no regression under heavy load.
>>>
>>> Hackbench is used for scenario 1 and pgbench for scenario 2, both from mmtests,
>>> on a 2-socket Xeon E5-2699v3 box with 256G of memory in total, running a 5.18
>>> kernel with SELinux disabled. Scheduler/MM tunables are all at their defaults.
>>> The irqbalance daemon is active.
>>>
>>> Hackbench (config-scheduler-unbound)
>>> =========
>>> process-pipes:
>>>                  Baseline                Patched
>>> Amean   1       0.4300  ( 0.00%)        0.4420  ( -2.79%)
>>> Amean   4       0.8757  ( 0.00%)        0.8774  ( -0.20%)
>>> Amean   7       1.3712  ( 0.00%)        1.3789  ( -0.56%)
>>> Amean   12      2.3541  ( 0.00%)        2.3714  ( -0.73%)
>>> Amean   21      4.2229  ( 0.00%)        4.2439  ( -0.50%)
>>> Amean   30      5.9113  ( 0.00%)        5.9451  ( -0.57%)
>>> Amean   48      9.3873  ( 0.00%)        9.4898  ( -1.09%)
>>> Amean   79      15.9281 ( 0.00%)        16.1385 ( -1.32%)
>>> Amean   110     22.0961 ( 0.00%)        22.3433 ( -1.12%)
>>> Amean   141     28.2973 ( 0.00%)        28.6209 ( -1.14%)
>>> Amean   172     34.4709 ( 0.00%)        34.9347 ( -1.35%)
>>> Amean   203     40.7621 ( 0.00%)        41.2497 ( -1.20%)
>>> Amean   234     47.0416 ( 0.00%)        47.6470 ( -1.29%)
>>> Amean   265     53.3048 ( 0.00%)        54.1625 ( -1.61%)
>>> Amean   288     58.0595 ( 0.00%)        58.8096 ( -1.29%)
>>>
>>> process-sockets:
>>>                  Baseline                Patched
>>> Amean   1       0.6776   ( 0.00%)       0.6611   ( 2.43%)
>>> Amean   4       2.6183   ( 0.00%)       2.5769   ( 1.58%)
>>> Amean   7       4.5662   ( 0.00%)       4.4801   ( 1.89%)
>>> Amean   12      7.7638   ( 0.00%)       7.6201   ( 1.85%)
>>> Amean   21      13.5335  ( 0.00%)       13.2915  ( 1.79%)
>>> Amean   30      19.3369  ( 0.00%)       18.9811  ( 1.84%)
>>> Amean   48      31.0724  ( 0.00%)       30.6015  ( 1.52%)
>>> Amean   79      51.1881  ( 0.00%)       50.4251  ( 1.49%)
>>> Amean   110     71.3399  ( 0.00%)       70.4578  ( 1.24%)
>>> Amean   141     91.4675  ( 0.00%)       90.3769  ( 1.19%)
>>> Amean   172     111.7463 ( 0.00%)       110.3947 ( 1.21%)
>>> Amean   203     131.6927 ( 0.00%)       130.3270 ( 1.04%)
>>> Amean   234     151.7459 ( 0.00%)       150.1320 ( 1.06%)
>>> Amean   265     171.9101 ( 0.00%)       169.9751 ( 1.13%)
>>> Amean   288     186.9231 ( 0.00%)       184.7706 ( 1.15%)
>>>
>>> thread-pipes:
>>>                  Baseline                Patched
>>> Amean   1       0.4523  ( 0.00%)        0.4535  ( -0.28%)
>>> Amean   4       0.9041  ( 0.00%)        0.9085  ( -0.48%)
>>> Amean   7       1.4111  ( 0.00%)        1.4146  ( -0.25%)
>>> Amean   12      2.3532  ( 0.00%)        2.3688  ( -0.66%)
>>> Amean   21      4.1550  ( 0.00%)        4.1701  ( -0.36%)
>>> Amean   30      6.1043  ( 0.00%)        6.2391  ( -2.21%)
>>> Amean   48      10.2077 ( 0.00%)        10.3511 ( -1.40%)
>>> Amean   79      16.7922 ( 0.00%)        17.0427 ( -1.49%)
>>> Amean   110     23.3350 ( 0.00%)        23.6522 ( -1.36%)
>>> Amean   141     29.6458 ( 0.00%)        30.2617 ( -2.08%)
>>> Amean   172     35.8649 ( 0.00%)        36.4225 ( -1.55%)
>>> Amean   203     42.4477 ( 0.00%)        42.8332 ( -0.91%)
>>> Amean   234     48.7117 ( 0.00%)        49.4042 ( -1.42%)
>>> Amean   265     54.9171 ( 0.00%)        55.6551 ( -1.34%)
>>> Amean   288     59.5282 ( 0.00%)        60.2903 ( -1.28%)
>>>
>>> thread-sockets:
>>>                  Baseline                Patched
>>> Amean   1       0.6917   ( 0.00%)       0.6892   ( 0.37%)
>>> Amean   4       2.6651   ( 0.00%)       2.6017   ( 2.38%)
>>> Amean   7       4.6734   ( 0.00%)       4.5637   ( 2.35%)
>>> Amean   12      8.0156   ( 0.00%)       7.8079   ( 2.59%)
>>> Amean   21      14.0451  ( 0.00%)       13.6679  ( 2.69%)
>>> Amean   30      20.0963  ( 0.00%)       19.5657  ( 2.64%)
>>> Amean   48      32.4115  ( 0.00%)       31.6001  ( 2.50%)
>>> Amean   79      53.1104  ( 0.00%)       51.8395  ( 2.39%)
>>> Amean   110     74.0929  ( 0.00%)       72.4391  ( 2.23%)
>>> Amean   141     95.1506  ( 0.00%)       93.0992  ( 2.16%)
>>> Amean   172     116.1969 ( 0.00%)       113.8307 ( 2.04%)
>>> Amean   203     137.4413 ( 0.00%)       134.5247 ( 2.12%)
>>> Amean   234     158.5395 ( 0.00%)       155.2793 ( 2.06%)
>>> Amean   265     179.7729 ( 0.00%)       176.1099 ( 2.04%)
>>> Amean   288     195.5573 ( 0.00%)       191.3977 ( 2.13%)
>>>
>>> Pgbench (config-db-pgbench-timed-ro-small)
>>> =======
>>>                  Baseline            Patched
>>> Hmean   1       68.54    ( 0.00%)   69.72    ( 1.71%)
>>> Hmean   6       27725.78 ( 0.00%)   34119.11 ( 23.06%)
>>> Hmean   12      55724.26 ( 0.00%)   63158.22 ( 13.34%)
>>> Hmean   22      72806.26 ( 0.00%)   73389.98 ( 0.80%)
>>> Hmean   30      79000.45 ( 0.00%)   75197.02 ( -4.81%)
>>> Hmean   48      78180.16 ( 0.00%)   75074.09 ( -3.97%)
>>> Hmean   80      75001.93 ( 0.00%)   70590.72 ( -5.88%)
>>> Hmean   110     74812.25 ( 0.00%)   74128.57 ( -0.91%)
>>> Hmean   142     74261.05 ( 0.00%)   72910.48 ( -1.82%)
>>> Hmean   144     75375.35 ( 0.00%)   71295.72 ( -5.41%)
>>>
>>> For hackbench, +-3% fluctuation is normal on this two-socket box, so it's safe
>>> to conclude that there are no performance differences for wakeups from task
>>> context. For pgbench, after many runs, 10~30% gains are very consistent at
>>> lower client counts (< #cores per socket).
>> Can you provide some further insight on why pgbench is helped in the low load
>> case?  Is it because the woken tasks tend to stay put and not get moved around by interrupts,
>> and so maintain cache warmth?
> Yes, and for read-only database workloads, the cache on the interrupt CPU
> (whether it holds the incoming packet or not) isn't as performance critical as
> the cache on the previous CPU where the db task ran to process data. To give
> you an example, consider a db client/server case: a client sends a request for
> a select query over the network, the server accepts the query request, does
> all the heavy lifting and sends the result back. For the server, the incoming
> packet is just a line of query, whereas the CPU (and its L3) that the db
> process previously ran on has all the warm db caches; pulling it away from
> them is a crime :)  This may seem a little contradictory to what I said
> earlier about the idle case and Mel's patch, but ¯\_(ツ)_/¯ it's hard to make
> all the workloads out there happy. Anyway, like I said earlier, this patch
> doesn't affect the idle case.
>
> At higher load, sync in wake_affine_idle() doesn't really matter because the
> waking CPU can easily have more than one runnable task. Sync in
> wake_affine_weight() also doesn't matter much as both sides have work to do,
> and the cache benefit of not pulling decreases simply because there are a lot
> more db processes under the same L3 competing for the same cachelines.
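> 
> As a side note on where these interrupt-context sync wakeups come from in the
> first place: the network receive path is a typical source. Roughly (paraphrased
> from net/core/sock.c and include/linux/preempt.h as of v5.18; details vary by
> version):
> 
>         /* Runs in NET_RX softirq context, so "current" is whatever task
>          * happened to be running on this CPU when the packet arrived. The
>          * _sync wakeup variant is what ends up passing WF_SYNC down to the
>          * scheduler. */
>         void sock_def_readable(struct sock *sk)
>         {
>                 ...
>                 wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN | EPOLLPRI |
>                                                 EPOLLRDNORM | EPOLLRDBAND);
>                 ...
>         }
> 
>         /* in_task() is roughly !(in_nmi() | in_hardirq() | in_serving_softirq()),
>          * so the in_task() check in the patch clears sync exactly for such
>          * wakeups while leaving task-context sync wakeups alone. */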
>
> Hope my explanation helps!
>
> Libo
>> Tim
>>
>>> For higher client counts, both kernels are
>>> very close; +-5% swings are quite common. Also, NET_TX|RX softirq load
>>> does spread out across both NUMA nodes in this test, so there is no need to do
>>> any explicit RPS/RFS.
>>>
>>> [1]: https://lkml.org/lkml/2021/11/5/234
>>>
>>> Signed-off-by: Libo Chen <libo.chen@...cle.com>
>>> ---
>>>   kernel/sched/fair.c | 4 +++-
>>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 794c2cb945f8..59b210d2cdb5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6704,7 +6704,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>>   static int
>>>   select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>>>   {
>>> -	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>>> +	/* Don't set sync for wakeup from irq/soft ctx */
>>> +	int sync = in_task() && (wake_flags & WF_SYNC)
>>> +		   && !(current->flags & PF_EXITING);
>>>   	struct sched_domain *tmp, *sd = NULL;
>>>   	int cpu = smp_processor_id();
>>>   	int new_cpu = prev_cpu;
>>> --
>>> 2.31.1
>>>
>
