Message-ID: <40e0cd03-4c97-86f6-0b79-c1b212f9c0dd@bytedance.com>
Date:   Thu, 11 Feb 2021 20:31:19 +0800
From:   Chengming Zhou <zhouchengming@...edance.com>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, rostedt@...dmis.org,
        linux-kernel@...r.kernel.org, songmuchun@...edance.com
Subject: Re: [External] Re: [PATCH] psi: Use ONCPU state tracking machinery to
 detect reclaim

Hello Johannes,

On 2021/2/11 4:37 AM, Johannes Weiner wrote:
> On Wed, Feb 10, 2021 at 12:06:05PM +0800, Chengming Zhou wrote:
>> Move the reclaim detection from the timer tick to the task state
>> tracking machinery using the recently added ONCPU state. And we
>> also add memstall state changes checking in the psi_task_switch()
>> optimization to update the parents properly.
>>
>> Thanks to Johannes Weiner for pointing out the psi_task_switch()
>> optimization things and the clearer changelog.
>>
>> Signed-off-by: Muchun Song <songmuchun@...edance.com>
>> Signed-off-by: Chengming Zhou <zhouchengming@...edance.com>
> Thanks for the update, Chengming.
>
> It would be good to include a rationale in the changelog that explains
> why we're doing this. Performance and cost are a bit questionable, IMO
> because it doesn't look cheaper in aggregate...

Yes, there may be no obvious win in terms of performance and cost. But I think
the accounted time may be a little more accurate, and the code becomes smaller. : )

>> ---
>>  include/linux/psi.h  |  1 -
>>  kernel/sched/core.c  |  1 -
>>  kernel/sched/psi.c   | 52 ++++++++++++++++------------------------------------
>>  kernel/sched/stats.h |  9 ---------
>>  4 files changed, 16 insertions(+), 47 deletions(-)
> ...but the code is simpler and shorter this way: fewer lines, and
> we're removing one of the hooks into the scheduler. So it's a
> maintainability win, I would say.
>
> I did some testing with perf bench. The new code appears to have
> slightly more overhead (which is expected), although the error bars
> overlap to a point where I don't think it would matter on real loads.
>
> I tested an additional version of the code that adds unlikely()
> annotations to move the pressure state branches out of line - since
> they are after all exceptional situations. It seems to help -
> especially the pipe bench actually looks better than on vanilla.

Thank you so much for these tests. I didn't go that far myself, and I learned
that unlikely() can be used like this.

> Attached the incremental patch below.
>
> ---
>
> perf stat -r 3 -- perf sched bench messaging -l 10000
>
> vanilla
>         125,833.65 msec task-clock:u              #   22.102 CPUs utilized            ( +-  1.94% )
>                  0      context-switches:u        #    0.000 K/sec                  
>                  0      cpu-migrations:u          #    0.000 K/sec                  
>             69,526      page-faults:u             #    0.553 K/sec                    ( +-  0.79% )
>      8,189,667,649      cycles:u                  #    0.065 GHz                      ( +-  1.49% )  (83.31%)
>      2,184,284,296      stalled-cycles-frontend:u #   26.67% frontend cycles idle     ( +-  4.37% )  (83.32%)
>      1,152,703,719      stalled-cycles-backend:u  #   14.08% backend cycles idle      ( +-  0.56% )  (83.37%)
>      2,483,312,679      instructions:u            #    0.30  insn per cycle         
>                                                   #    0.88  stalled cycles per insn  ( +-  0.15% )  (83.33%)
>        781,332,174      branches:u                #    6.209 M/sec                    ( +-  0.13% )  (83.35%)
>        159,531,476      branch-misses:u           #   20.42% of all branches          ( +-  0.17% )  (83.32%)
>
>             5.6933 +- 0.0911 seconds time elapsed  ( +-  1.60% )
>
> patched
>         129,756.92 msec task-clock:u              #   22.243 CPUs utilized            ( +-  1.92% )
>                  0      context-switches:u        #    0.000 K/sec                  
>                  0      cpu-migrations:u          #    0.000 K/sec                  
>             69,904      page-faults:u             #    0.539 K/sec                    ( +-  0.67% )
>      8,518,161,670      cycles:u                  #    0.066 GHz                      ( +-  2.19% )  (83.30%)
>      2,337,165,666      stalled-cycles-frontend:u #   27.44% frontend cycles idle     ( +-  5.47% )  (83.32%)
>      1,148,789,343      stalled-cycles-backend:u  #   13.49% backend cycles idle      ( +-  0.05% )  (83.35%)
>      2,483,527,911      instructions:u            #    0.29  insn per cycle         
>                                                   #    0.94  stalled cycles per insn  ( +-  0.18% )  (83.38%)
>        782,138,388      branches:u                #    6.028 M/sec                    ( +-  0.09% )  (83.33%)
>        160,131,311      branch-misses:u           #   20.47% of all branches          ( +-  0.16% )  (83.31%)
>
>              5.834 +- 0.106 seconds time elapsed  ( +-  1.81% )
>
> patched-unlikely
>         127,437.78 msec task-clock:u              #   22.184 CPUs utilized            ( +-  0.74% )
>                  0      context-switches:u        #    0.000 K/sec                  
>                  0      cpu-migrations:u          #    0.000 K/sec                  
>             70,063      page-faults:u             #    0.550 K/sec                    ( +-  0.53% )
>      8,453,581,973      cycles:u                  #    0.066 GHz                      ( +-  1.49% )  (83.34%)
>      2,327,192,242      stalled-cycles-frontend:u #   27.53% frontend cycles idle     ( +-  2.43% )  (83.32%)
>      1,146,196,558      stalled-cycles-backend:u  #   13.56% backend cycles idle      ( +-  0.35% )  (83.34%)
>      2,486,920,732      instructions:u            #    0.29  insn per cycle         
>                                                   #    0.94  stalled cycles per insn  ( +-  0.10% )  (83.34%)
>        781,067,666      branches:u                #    6.129 M/sec                    ( +-  0.15% )  (83.34%)
>        160,104,212      branch-misses:u           #   20.50% of all branches          ( +-  0.10% )  (83.33%)
>
>             5.7446 +- 0.0418 seconds time elapsed  ( +-  0.73% )
>
> ---
>
> perf stat -r 3 -- perf bench sched pipe
>
> vanilla
>          14,086.14 msec task-clock:u              #    1.009 CPUs utilized            ( +-  6.52% )
>                  0      context-switches:u        #    0.000 K/sec                  
>                  0      cpu-migrations:u          #    0.000 K/sec                  
>              1,467      page-faults:u             #    0.104 K/sec                    ( +-  0.06% )
>        306,181,835      cycles:u                  #    0.022 GHz                      ( +-  2.13% )  (83.41%)
>         43,975,811      stalled-cycles-frontend:u #   14.36% frontend cycles idle     ( +- 14.45% )  (83.05%)
>         52,429,386      stalled-cycles-backend:u  #   17.12% backend cycles idle      ( +-  0.28% )  (83.58%)
>         93,097,176      instructions:u            #    0.30  insn per cycle         
>                                                   #    0.56  stalled cycles per insn  ( +-  0.36% )  (83.23%)
>         35,351,661      branches:u                #    2.510 M/sec                    ( +-  0.21% )  (83.37%)
>          6,124,932      branch-misses:u           #   17.33% of all branches          ( +-  0.51% )  (83.36%)
>
>             13.955 +- 0.164 seconds time elapsed  ( +-  1.17% )
>
> patched
>          14,574.69 msec task-clock:u              #    1.040 CPUs utilized            ( +-  0.87% )
>                  0      context-switches:u        #    0.000 K/sec                  
>                  0      cpu-migrations:u          #    0.000 K/sec                  
>              1,469      page-faults:u             #    0.101 K/sec                    ( +-  0.13% )
>        302,769,739      cycles:u                  #    0.021 GHz                      ( +-  1.19% )  (83.17%)
>         37,638,522      stalled-cycles-frontend:u #   12.43% frontend cycles idle     ( +-  0.31% )  (83.47%)
>         46,206,055      stalled-cycles-backend:u  #   15.26% backend cycles idle      ( +-  6.56% )  (83.34%)
>         92,566,358      instructions:u            #    0.31  insn per cycle         
>                                                   #    0.50  stalled cycles per insn  ( +-  0.51% )  (83.45%)
>         35,667,707      branches:u                #    2.447 M/sec                    ( +-  0.67% )  (83.23%)
>          6,224,587      branch-misses:u           #   17.45% of all branches          ( +-  2.24% )  (83.35%)
>
>             14.010 +- 0.245 seconds time elapsed  ( +-  1.75% )
>
> patched-unlikely
>          13,470.99 msec task-clock:u              #    1.024 CPUs utilized            ( +-  3.10% )
>                  0      context-switches:u        #    0.000 K/sec                  
>                  0      cpu-migrations:u          #    0.000 K/sec                  
>              1,477      page-faults:u             #    0.110 K/sec                    ( +-  0.09% )
>        310,752,740      cycles:u                  #    0.023 GHz                      ( +-  1.32% )  (83.35%)
>         44,894,078      stalled-cycles-frontend:u #   14.45% frontend cycles idle     ( +- 13.24% )  (83.47%)
>         52,540,903      stalled-cycles-backend:u  #   16.91% backend cycles idle      ( +-  0.36% )  (82.96%)
>         92,296,178      instructions:u            #    0.30  insn per cycle         
>                                                   #    0.57  stalled cycles per insn  ( +-  0.48% )  (83.44%)
>         35,316,802      branches:u                #    2.622 M/sec                    ( +-  0.06% )  (83.32%)
>          6,173,049      branch-misses:u           #   17.48% of all branches          ( +-  0.66% )  (83.47%)
>
>             13.161 +- 0.293 seconds time elapsed  ( +-  2.22% )
>
>> @@ -833,7 +827,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>>  		 */
>>  		iter = NULL;
>>  		while ((group = iterate_groups(next, &iter))) {
>> -			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
>> +			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU] &&
>> +			    next->in_memstall == prev->in_memstall) {
>>  				common = group;
>>  				break;
> It'd be better to compare psi_flags instead of just in_memstall: it's
> clearer and also more robust against future changes (even though it's
> somewhat unlikely we grow more states). It's also an invariant
> throughout the loop, so we should move it out.
Ok, that makes a lot of sense.
>
> The comment above the loop is now stale too.
>
> Can you fold the following into your patch?

Of course, I will fold the change below into my patch and send a v2 for further review.

Thank you.

>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 8735d5f291dc..6d4a246ef131 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -809,18 +809,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  	void *iter;
>  
>  	if (next->pid) {
> +		bool identical_state;
> +
>  		psi_flags_change(next, 0, TSK_ONCPU);
>  		/*
> -		 * When moving state between tasks, the group that
> -		 * contains them both does not change: we can stop
> -		 * updating the tree once we reach the first common
> -		 * ancestor. Iterate @next's ancestors until we
> -		 * encounter @prev's state.
> +		 * When switching between tasks that have an identical
> +		 * runtime state, the cgroup that contains both tasks
> +		 * does not change: we can stop updating the tree once
> +		 * we reach the first common ancestor. Iterate @next's
> +		 * ancestors only until we encounter @prev's ONCPU.
>  		 */
> +		identical_state = prev->psi_flags == next->psi_flags;
>  		iter = NULL;
>  		while ((group = iterate_groups(next, &iter))) {
> -			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU] &&
> -			    next->in_memstall == prev->in_memstall) {
> +			if (identical_state &&
> +			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
>  				common = group;
>  				break;
>  			}
>
> Otherwise, this looks good to me. Peter, what do you think?
>
> ---
>
> From cf36f1dde714a2c543f5947e47138de32468f33a Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@...xchg.org>
> Date: Wed, 10 Feb 2021 13:38:34 -0500
> Subject: [PATCH] psi: pressure states are unlikely
>
> Move the unlikely branches out of line. This eliminates undesirable
> jumps during wakeup and sleeps for workloads that aren't under any
> sort of resource pressure.
>
> Signed-off-by: Johannes Weiner <hannes@...xchg.org>
> ---
>  kernel/sched/psi.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 8735d5f291dc..7fbacd6347a6 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -216,15 +216,15 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
>  {
>  	switch (state) {
>  	case PSI_IO_SOME:
> -		return tasks[NR_IOWAIT];
> +		return unlikely(tasks[NR_IOWAIT]);
>  	case PSI_IO_FULL:
> -		return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
> +		return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
>  	case PSI_MEM_SOME:
> -		return tasks[NR_MEMSTALL];
> +		return unlikely(tasks[NR_MEMSTALL]);
>  	case PSI_MEM_FULL:
> -		return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
> +		return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
>  	case PSI_CPU_SOME:
> -		return tasks[NR_RUNNING] > tasks[NR_ONCPU];
> +		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
>  	case PSI_NONIDLE:
>  		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
>  			tasks[NR_RUNNING];
> @@ -721,7 +721,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
>  	 * task in a cgroup is in_memstall, the corresponding groupc
>  	 * on that cpu is in PSI_MEM_FULL state.
>  	 */
> -	if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
> +	if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
>  		state_mask |= (1 << PSI_MEM_FULL);
>  
>  	groupc->state_mask = state_mask;
