linux-kernel - Re: [PATCH] sched/fair: no sync wakeup from interrupt context

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0c0dab83-807d-317e-af1a-7bd898626b86@oracle.com>
Date:   Thu, 14 Jul 2022 13:21:14 -0700
From:   Libo Chen <libo.chen@...cle.com>
To:     Mel Gorman <mgorman@...e.de>
Cc:     Tim Chen <tim.c.chen@...ux.intel.com>, peterz@...radead.org,
        vincent.guittot@...aro.org, 21cnbao@...il.com,
        dietmar.eggemann@....com, linux-kernel@...r.kernel.org,
        tglx@...utronix.de, Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context



On 7/14/22 07:18, Mel Gorman wrote:
> On Wed, Jul 13, 2022 at 12:17:33PM -0700, Libo Chen wrote:
>>
>> On 7/13/22 09:40, Tim Chen wrote:
>>> On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
>>>> Barry Song first pointed out that replacing sync wakeup with regular wakeup
>>>> seems to reduce overeager wakeup pulling and shows noticeable performance
>>>> improvement.[1]
>>>>
>>>> This patch argues that allowing sync for wakeups from interrupt context
>>>> is a bug and fixing it can improve performance even when irq/softirq is
>>>> evenly spread out.
>>>>
>>>> For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
>>>> waker can be any random task that happens to be running on that CPU when the
>>>> interrupt comes in. This is completely different from other wakups where the
>>>> task running on the waking CPU is the actual waker. For example, two tasks
>>>> communicate through a pipe or mutiple tasks access the same critical section,
>>>> etc. This difference is important because with sync we assume the waker will
>>>> get off the runqueue and go to sleep immedately after the wakeup. The
>>>> assumption is built into wake_affine() where it discounts the waker's presence
>>>> from the runqueue when sync is true. The random waker from interrupts bears no
>>>> relation to the wakee and don't usually go to sleep immediately afterwards
>>>> unless wakeup granularity is reached. Plus the scheduler no longer enforces the
>>>> preepmtion of waker for sync wakeup as it used to before
>>>> patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
>>>> wakeup preemption for wakeups from interrupt contexts doesn't seem to be
>>>> appropriate too but at least sync wakeup will do what it's supposed to do.
>>> Will there be scenarios where you do want the task being woken up be pulled
>>> to the CPU where the interrupt happened, as the data that needs to be accessed is
>>> on local CPU/NUMA that interrupt happened?  For example, interrupt associated with network
>>> packets received.  Sync still seems desirable, at least if there is no task currently
>>> running on the CPU where interrupt happened.  So maybe we should have some consideration
>>> of the load on the CPU/NUMA before deciding whether we should do sync wake for such
>>> interrupt.
>>>
>> There are only two places where sync wakeup matters: wake_affine_idle() and
>> wake_affine_weight().
>> In wake_affine_idle(), it considers pulling if there is one runnable on the
>> waking CPU because
>> of the belief that this runnable will voluntarily get off the runqueue. In
>> wake_affine_weight(),
>> it basically takes off the waker's load again assuming the waker goes to
>> sleep after the wakeup.
>> My argument is that this assumption doesn't really hold for wakeups from the
>> interrupt contexts
>> when the waking CPU is non-idle. Wakeups from task context? sure, it seems
>> to be a reasonable
>> assumption. For your idle case, I totally agree but I don't think having
>> sync or not will actually
>> have any impacts here giving what the code does. Real impact comes fromMel's
>> patch 7332dec055f2457c3
>> which makes it less likely to pull tasks when the waking CPU is idle. I
>> believe we should consider
>> reverting 7332dec055f2 because a significant RDS latency regression has been
>> spotted recently on our
>> system due to this patch.
>>
> The intent of 7332dec055f2 was to prevent harmful cross-node accesses.
> It still allowed cache-local migrations on the assumption that the incoming
> data was critical enough to justify losing any other cache-hot data. You
I am not too against cache-local migrations here because they are still 
under the
same LLC. We really focus on cross-node migrations that sometimes hurt us,
sometimes don't because this is where you need to determine who has the 
warmer L3
cache.
> state explicitly that "the interrupt CPU isn't as performance critical as
> cache from its previous CPU" so that assumption was incorrect, at least
> in your case. I don't have a counter example where the interrupt data *is*
> more important than any other cache-hot data so the check can go.
>
> I think a revert would not achieve what you want as a plain revert would
> still allow an interrupt to pull a task from an arbitrary location as sync
This is the tricky part, I didn't explain it well. For rds-stress, it's a
lot (~30%) better to allow pulling across nodes when the waking CPU is 
idle.
I think this may be an example of interrupt data being more important. 
Something
like below will help a lot for this particular benchmark (rds-stress):

if (available_idle_cpu(this_cpu))
         return this_cpu;

But for db workloads, yes, in general, pulling does more damages than 
goods as
said before esp. under light load. But I think pulling to an idle CPU 
across nodes
is okay, because this usually happens at the beginning of a workload and 
once a
task is pulled it stays there or at least in the same LLC sched domain 
for a while.
What is not okay is when the waking CPU already has a task and the 
wakeup still pulls
the wakee task to that CPU across nodes irrespective of its previous 
CPU. And that's
what this patch tries to address.

Mel, I am thinking about a follow-up patch like below then we can 
continue the discussion
there since this is kinda a separate issue:

-	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
-		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
-

+       if (available_idle_cpu(this_cpu))
+               if (cpus_share_cache(this_cpu, prev_cpu))
+                       return available_idle_cpu(prev_cpu) ? prev_cpu : 
this_cpu;
+       else
+               return this_cpu;

> is not checked. A follow-up to your patch or an updated version should not
> check available_idle_cpu at all in wake_affine_idle as it's only idle the
> wake is from interrupt context and vcpu_is_preempted is not necessarily
> justification for pulling a task due to an interrupt.
>
> Something like this but needs testing with your target loads, particularly
> the RDS (Relational Database Service?) latency regression;
Sorry, it's Reliable Datagram Sockets. We run rds-stress to measure rds
bandwidth and latency.

Libo
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b7b275672c89..e55a3a67a442 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5975,8 +5975,8 @@ static int wake_wide(struct task_struct *p)
>    * soonest. For the purpose of speed we only consider the waking and previous
>    * CPU.
>    *
> - * wake_affine_idle() - only considers 'now', it check if the waking CPU is
> - *			cache-affine and is (or	will be) idle.
> + * wake_affine_idle() - only considers 'now', it checks if the waker task is a
> + *			sync wakeup from a CPU that should be idle soon.
>    *
>    * wake_affine_weight() - considers the weight to reflect the average
>    *			  scheduling latency of the CPUs. This seems to work
> @@ -5985,21 +5985,6 @@ static int wake_wide(struct task_struct *p)
>   static int
>   wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>   {
> -	/*
> -	 * If this_cpu is idle, it implies the wakeup is from interrupt
> -	 * context. Only allow the move if cache is shared. Otherwise an
> -	 * interrupt intensive workload could force all tasks onto one
> -	 * node depending on the IO topology or IRQ affinity settings.
> -	 *
> -	 * If the prev_cpu is idle and cache affine then avoid a migration.
> -	 * There is no guarantee that the cache hot data from an interrupt
> -	 * is more important than cache hot data on the prev_cpu and from
> -	 * a cpufreq perspective, it's better to have higher utilisation
> -	 * on one CPU.
> -	 */
> -	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> -		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
> -
>   	if (sync && cpu_rq(this_cpu)->nr_running == 1)
>   		return this_cpu;
>