linux-kernel - Re: [PATCH 1/3] audit: cache ctx->major in audit_filter

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8171459.NyiUUSuA9g@x2>
Date:   Fri, 30 Sep 2022 13:45:26 -0400
From:   Steve Grubb <sgrubb@...hat.com>
To:     linux-audit@...hat.com
Cc:     boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        eparis@...hat.com, linux-kernel@...r.kernel.org,
        Ankur Arora <ankur.a.arora@...cle.com>
Subject: Re: [PATCH 1/3] audit: cache ctx->major in audit_filter_syscall()

Hello,

Thanks for the detailed notes on this investigation. It really is  a lot of 
good information backing this up. However, there will come a day when someone 
sees this "major = ctx->major" and they will send a patch to "fix" this 
unnecessary assignment. If you are sending a V2 of this set, I would suggest 
adding some comment in the code that this is for a performance improvement 
and to see the commit message for additional info.

Thanks,
-Steve

On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote:
> ctx->major contains the current syscall number. This is, of course, a
> constant for the duration of the syscall. Unfortunately, GCC's alias
> analysis cannot prove that it is not modified via a pointer in the
> audit_filter_syscall() loop, and so always loads it from memory.
> 
> In and of itself the load isn't very expensive (ops dependent on the
> ctx->major load are only used to determine the direction of control flow
> and have short dependence chains and, in any case the related branches
> get predicted perfectly in the fastpath) but still cache ctx->major
> in a local for two reasons:
> 
> * ctx->major is in the first cacheline of struct audit_context and has
>   similar alignment as audit_entry::list audit_entry. For cases
>   with a lot of audit rules, doing this reduces one source of contention
>   from a potentially busy cache-set.
> 
> * audit_in_mask() (called in the hot loop in audit_filter_syscall())
>   does cast manipulation and error checking on ctx->major:
> 
>      audit_in_mask(const struct audit_krule *rule, unsigned long val):
>              if (val > 0xffffffff)
>                      return false;
> 
>              word = AUDIT_WORD(val);
>              if (word >= AUDIT_BITMASK_SIZE)
>                      return false;
> 
>              bit = AUDIT_BIT(val);
> 
>              return rule->mask[word] & bit;
> 
>   The clauses related to the rule need to be evaluated in the loop, but
>   the rest is unnecessarily re-evaluated for every loop iteration.
>   (Note, however, that most of these are cheap ALU ops and the branches
>    are perfectly predicted. However, see discussion on cycles
>    improvement below for more on why it is still worth hoisting.)
> 
> On a Skylakex system change in getpid() latency (aggregated over
> 12 boot cycles):
> 
>              Min     Mean  Median     Max       pstdev
>             (ns)     (ns)    (ns)    (ns)
> 
>  -        201.30   216.14  216.22  228.46      (+- 1.45%)
>  +        196.63   207.86  206.60  230.98      (+- 3.92%)
> 
> Performance counter stats for 'bin/getpid' (3 runs) go from:
>     cycles               836.89  (  +-   .80% )
>     instructions        2000.19  (  +-   .03% )
>     IPC                    2.39  (  +-   .83% )
>     branches             430.14  (  +-   .03% )
>     branch-misses          1.48  (  +-  3.37% )
>     L1-dcache-loads      471.11  (  +-   .05% )
>     L1-dcache-load-misses  7.62  (  +- 46.98% )
> 
>  to:
>     cycles               805.58  (  +-  4.11% )
>     instructions        1654.11  (  +-   .05% )
>     IPC                    2.06  (  +-  3.39% )
>     branches             430.02  (  +-   .05% )
>     branch-misses          1.55  (  +-  7.09% )
>     L1-dcache-loads      440.01  (  +-   .09% )
>     L1-dcache-load-misses  9.05  (  +- 74.03% )
> 
> (Both aggregated over 12 boot cycles.)
> 
> instructions: we reduce around 8 instructions/iteration because some of
> the computation is now hoisted out of the loop (branch count does not
> change because GCC, for reasons unclear, only hoists the computations
> while keeping the basic-blocks.)
> 
> cycles: improve by about 5% (in aggregate and looking at individual run
> numbers.) This is likely because we now waste fewer pipeline resources
> on unnecessary instructions which allows the control flow to
> speculatively execute further ahead shortening the execution of the loop
> a little. The final gating factor on the performance of this loop
> remains the long dependence chain due to the linked-list load.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
> ---
>  kernel/auditsc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 79a5da1bc5bb..533b087c3c02 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct
> *tsk, {
>  	struct audit_entry *e;
>  	enum audit_state state;
> +	unsigned long major = ctx->major;
> 
>  	if (auditd_test_task(tsk))
>  		return;
> 
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT], 
list) {
> -		if (audit_in_mask(&e->rule, ctx->major) &&
> +		if (audit_in_mask(&e->rule, major) &&
>  		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
>  				       &state, false)) {
>  			rcu_read_unlock();