Message-ID: <20211013230201.0f777564@oasis.local.home>
Date: Wed, 13 Oct 2021 23:02:01 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Yang Jihong <yangjihong1@...wei.com>
Cc: <mingo@...hat.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] tracing: save cmdline only when task does not exist in
savecmd for optimization
On Mon, 11 Oct 2021 19:50:18 +0800
Yang Jihong <yangjihong1@...wei.com> wrote:
> Commit 85f726a35e504418 uses strncpy instead of memcpy when copying comm;
> on an ARM64 machine, this change causes a performance degradation.
>
> For a task that already exists in savedcmd, it is unnecessary to call
> set_cmdline (and thus run strncpy) again; call set_cmdline only if the
> task does not yet exist in savedcmd.
>
> I have written an example (an extreme case) in which trace_sched_switch
> is invoked 1000 times, as shown in the following:
>
> 	for (int i = 0; i < 1000; i++) {
> 		trace_sched_switch(true, current, current);
> 	}
Well, that's a pretty unrealistic benchmark.
>
> On an ARM64 machine, comparing the data before and after the optimization:
> +---------------------+------------------------------+------------------------+
> | | Total number of instructions | Total number of cycles |
> +---------------------+------------------------------+------------------------+
> | Before optimization | 1107367 | 658491 |
> +---------------------+------------------------------+------------------------+
> | After optimization | 869367 | 520171 |
> +---------------------+------------------------------+------------------------+
> As shown above, there is nearly a 26% performance improvement (the
> unoptimized run spends (658491 - 520171) / 520171 ≈ 26.6% more cycles
> than the optimized one).
I'd prefer to see a more realistic benchmark.
>
> Signed-off-by: Yang Jihong <yangjihong1@...wei.com>
> ---
> kernel/trace/trace.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 7896d30d90f7..a795610a3b37 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2427,8 +2427,11 @@ static int trace_save_cmdline(struct task_struct *tsk)
>  		savedcmd->cmdline_idx = idx;
>  	}
>  
> -	savedcmd->map_cmdline_to_pid[idx] = tsk->pid;
> -	set_cmdline(idx, tsk->comm);
> +	/* save cmdline only when task does not exist in savedcmd */
> +	if (savedcmd->map_cmdline_to_pid[idx] != tsk->pid) {
> +		savedcmd->map_cmdline_to_pid[idx] = tsk->pid;
> +		set_cmdline(idx, tsk->comm);
> +	}
I'm not against adding this. Just for kicks I ran the following before
and after this patch:
# trace-cmd start -e sched
# perf stat -r 100 hackbench 50
Before:
 Performance counter stats for '/work/c/hackbench 50' (100 runs):

          6,261.26 msec task-clock                #    6.126 CPUs utilized            ( +-  0.12% )
            93,519      context-switches          #   14.936 K/sec                    ( +-  1.12% )
            13,725      cpu-migrations            #    2.192 K/sec                    ( +-  1.16% )
            47,266      page-faults               #    7.549 K/sec                    ( +-  0.54% )
    22,911,885,026      cycles                    #    3.659 GHz                      ( +-  0.11% )
    15,171,250,777      stalled-cycles-frontend   #   66.22% frontend cycles idle     ( +-  0.13% )
    18,330,841,604      instructions              #    0.80  insn per cycle
                                                  #    0.83  stalled cycles per insn  ( +-  0.11% )
     4,027,904,559      branches                  #  643.306 M/sec                    ( +-  0.11% )
        31,327,782      branch-misses             #    0.78% of all branches          ( +-  0.20% )

           1.02201 +- 0.00158 seconds time elapsed  ( +-  0.15% )
After:
 Performance counter stats for '/work/c/hackbench 50' (100 runs):

          6,216.47 msec task-clock                #    6.124 CPUs utilized            ( +-  0.10% )
            93,311      context-switches          #   15.010 K/sec                    ( +-  0.91% )
            13,719      cpu-migrations            #    2.207 K/sec                    ( +-  1.09% )
            47,085      page-faults               #    7.574 K/sec                    ( +-  0.49% )
    22,746,703,318      cycles                    #    3.659 GHz                      ( +-  0.09% )
    15,012,911,121      stalled-cycles-frontend   #   66.00% frontend cycles idle     ( +-  0.11% )
    18,275,147,949      instructions              #    0.80  insn per cycle
                                                  #    0.82  stalled cycles per insn  ( +-  0.08% )
     4,017,673,788      branches                  #  646.295 M/sec                    ( +-  0.08% )
        31,313,459      branch-misses             #    0.78% of all branches          ( +-  0.17% )

           1.01506 +- 0.00150 seconds time elapsed  ( +-  0.15% )
Really it's all in the noise, so adding this doesn't seem to hurt.
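
For anyone who wants to see the effect of the guard in isolation, here's a
minimal user-space sketch of the savedcmd logic. The names mirror
kernel/trace/trace.c, but the map_pid_to_cmdline indirection, slot eviction
and the cmdline_lock are elided, and the pid % SAVED_CMDLINES slot choice is
an assumption made purely for illustration, so treat it as a model of why
the equality check skips the strncpy, not as the real implementation:

	#include <stdio.h>
	#include <string.h>

	#define TASK_COMM_LEN	16
	#define SAVED_CMDLINES	128	/* stand-in for savedcmd->cmdline_num */

	static int  map_cmdline_to_pid[SAVED_CMDLINES];
	static char saved_cmdlines[SAVED_CMDLINES][TASK_COMM_LEN];
	static unsigned long strncpy_calls;

	static void save_cmdline(int pid, const char *comm)
	{
		/* Simplified slot choice; the kernel uses a pid->idx map. */
		unsigned int idx = pid % SAVED_CMDLINES;

		/* The patch: copy comm only when the slot holds a different task. */
		if (map_cmdline_to_pid[idx] != pid) {
			map_cmdline_to_pid[idx] = pid;
			strncpy(saved_cmdlines[idx], comm, TASK_COMM_LEN);
			saved_cmdlines[idx][TASK_COMM_LEN - 1] = '\0';
			strncpy_calls++;
		}
	}

	int main(void)
	{
		/* Same task traced 1000 times, as in the loop above. */
		for (int i = 0; i < 1000; i++)
			save_cmdline(1234, "hackbench");

		printf("strncpy ran %lu time(s) for 1000 events\n", strncpy_calls);
		return 0;
	}

With the guard, the 1000-iteration loop pays for the copy once; without it,
every event does, which is what the ARM64 instruction counts above measure.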
-- Steve
>
>  	arch_spin_unlock(&trace_cmdline_lock);
>