linux-kernel - [PATCH 0/3] improve audit syscall-exit latency

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <20220927225944.2254360-1-ankur.a.arora@oracle.com>
Date:   Tue, 27 Sep 2022 15:59:41 -0700
From:   Ankur Arora <ankur.a.arora@...cle.com>
To:     linux-audit@...hat.com
Cc:     paul@...l-moore.com, eparis@...hat.com,
        linux-kernel@...r.kernel.org, boris.ostrovsky@...cle.com,
        konrad.wilk@...cle.com
Subject: [PATCH 0/3] improve audit syscall-exit latency

This series attempts to reduce syscall-exit latency in the audit path,
especially for cases where there are a lot of audit exit rules.

>From profiling, audit_filter_syscall() takes the largest chunk of time,
specifically in this list-walk while processing the AUDIT_FILTER_EXIT
list:

      list = &audit_filter_list[AUDIT_FILTER_EXIT];

      list_for_each_entry_rcu(e, list, list) {
          if (audit_in_mask(&e->rule, ctx->major) &&
              audit_filter_rules(tsk, &e->rule, ctx, NULL,
                                 &state, false, x)) {
                  rcu_read_unlock();
                  ctx->current_state = state;
                  return state;
          }
      }

(Note that almost identical logic exists in audit_filter_uring(),
audit_filter_inode_name().)

Comparing baseline performance with audit=0/audit=1 with a user-space
getpid() loop (executes 10^7 times.) For audit=1, this uses an
audit-rule set where the audit_filter_syscall() loop iterates over
42 AUDIT_FILTER_EXIT rules which, for getpid(), calls audit_filter_rules()
for 5 of them (we use this set of rules in production.)

Test system:
 Server: ORACLE SERVER X8-2L
 CPU: Intel Skylakex (6:85:6)
 Microcode: 0x400320a

 # v6.0.0-rc5.baseline, audit=0
 Performance counter stats for 'bin/getpid' (3 runs):

            734.10 msec task-clock                       #    0.999 CPUs utilized            ( +-  0.03% )
                 1      context-switches                 #    1.361 /sec                     ( +- 66.67% )
                 0      cpu-migrations                   #    0.000 /sec
                53      page-faults                      #   72.152 /sec                     ( +-  0.63% )
     2,838,869,156      cycles                           #    3.865 GHz                      ( +-  0.01% )  (62.40%)
     4,174,224,305      instructions                     #    1.47  insn per cycle           ( +-  0.01% )  (74.93%)
       890,798,133      branches                         #    1.213 G/sec                    ( +-  0.01% )  (74.93%)
         5,015,118      branch-misses                    #    0.56% of all branches          ( +-  0.05% )  (74.93%)
     1,231,150,558      L1-dcache-loads                  #    1.676 G/sec                    ( +-  0.01% )  (74.94%)
           418,297      L1-dcache-load-misses            #    0.03% of all L1-dcache accesses  ( +-  0.68% )  (75.07%)
             3,937      LLC-loads                        #    5.360 K/sec                    ( +-  3.76% )  (50.13%)
               510      LLC-load-misses                  #   13.39% of all LL-cache accesses  ( +- 79.89% )  (50.00%)

          0.735018 +- 0.000275 seconds time elapsed  ( +-  0.04% )

 # v6.0.0-rc5.baseline, audit=1
 Performance counter stats for 'bin/getpid' (3 runs):

          2,158.81 msec task-clock                       #    0.998 CPUs utilized            ( +-  0.13% )
                 2      context-switches                 #    0.925 /sec                     ( +- 28.87% )
                 0      cpu-migrations                   #    0.000 /sec
                52      page-faults                      #   24.056 /sec                     ( +-  0.64% )
     8,364,119,898      cycles                           #    3.869 GHz                      ( +-  0.11% )  (62.48%)
    19,996,521,678      instructions                     #    2.39  insn per cycle           ( +-  0.01% )  (74.98%)
     4,302,068,252      branches                         #    1.990 G/sec                    ( +-  0.00% )  (74.98%)
        15,135,694      branch-misses                    #    0.35% of all branches          ( +-  0.16% )  (74.99%)
     4,714,694,841      L1-dcache-loads                  #    2.181 G/sec                    ( +-  0.01% )  (74.99%)
        66,407,857      L1-dcache-load-misses            #    1.41% of all L1-dcache accesses  ( +-  1.50% )  (75.01%)
             6,785      LLC-loads                        #    3.139 K/sec                    ( +- 12.49% )  (50.03%)
             3,235      LLC-load-misses                  #   41.29% of all LL-cache accesses  ( +-  6.08% )  (50.01%)

           2.16213 +- 0.00288 seconds time elapsed  ( +-  0.13% )

perf stat numbers for each getpid() iteration:

		  baseline      baseline 
                   audit=0       audit=1 

  cycles               283           836 
  instructions         417          1999 
  IPC                 1.47          2.39 
  branches              89           430 
  branch-misses       0.50          1.51 
  L1-loads             123           471 
  L1-load-misses        ~0          ~6.6*

  * the L1-load-misses are largely stable for runs across a single
    boot, but vary across boots due to vagaries of the SLAB allocator.

baseline audit=1 spends a significant amount of time executing in audit
code and incurs a new branch-miss and a few new L1-load-misses. Also
note that computed audit-only IPC is 2.86 so the baseline is not
ill-performant code.

Patches
==

Patch 1 "audit: cache ctx->major in audit_filter_syscall()", caches
ctx->major in a local variable. This gets rid of a persistent entry
from L1-dcache  (audit_context::major) that had similar alignment
constraints as a potentially busy cache-set (audit_entry::list) and
allows some of the error checking in audit_in_mask() to be hoisted out
of the loop.

Patch 2: "audit: annotate branch direction for audit_in_mask()", so
the compiler can pessimize the call to audit_filter_rules().

Patch 3, "audit: unify audit_filter_{uring(),inode_name(),syscall()}"
centralizes this logic in a common function.

with these changes:

 Performance counter stats for 'bin/getpid' (3 runs):

          1,750.21 msec task-clock                       #    0.994 CPUs utilized            ( +-  0.45% )
                 3      context-switches                 #    1.705 /sec                     ( +- 11.11% )
                 0      cpu-migrations                   #    0.000 /sec
                52      page-faults                      #   29.548 /sec                     ( +-  0.64% )
     6,770,976,590      cycles                           #    3.848 GHz                      ( +-  0.40% )  (27.74%)
    16,588,372,024      instructions                     #    2.44  insn per cycle           ( +-  0.03% )  (33.34%)
     4,322,555,829      branches                         #    2.456 G/sec                    ( +-  0.02% )  (33.40%)
         2,803,286      branch-misses                    #    0.06% of all branches          ( +- 26.45% )  (33.46%)
     4,449,426,336      L1-dcache-loads                  #    2.528 G/sec                    ( +-  0.01% )  (27.71%)
        63,612,108      L1-dcache-load-misses            #    1.43% of all L1-dcache accesses  ( +-  0.50% )  (27.71%)
             6,123      LLC-loads                        #    3.479 K/sec                    ( +-  8.68% )  (27.71%)
             1,890      LLC-load-misses                  #   26.69% of all LL-cache accesses  ( +- 46.99% )  (27.71%)

           1.76033 +- 0.00791 seconds time elapsed  ( +-  0.45% )

And, overall getpid() latency numbers (aggregated over 12 boots for each):
                Min     Mean    Median    Max         pstdev
                (ns)    (ns)      (ns)    (ns)

 baseline     201.30   216.14   216.22  228.46      (+- 1.45%)
 patch1       196.63   207.86   206.60  230.98      (+- 3.92%)
 patch1-2     173.11   182.51   179.65  202.09      (+- 4.34%)
 patch1-3     162.11   175.26   173.71  190.56      (+- 4.33%)

This gives a reasonable speedup. My testing was on Intel Skylake, but I
suspect this should translate to other archs as well (especially on less
wide architectures.)

Please review.

Thanks
Ankur

Ankur Arora (3):
  audit: cache ctx->major in audit_filter_syscall()
  audit: annotate branch direction for audit_in_mask()
  audit: unify audit_filter_{uring(),inode_name(),syscall()}

 kernel/auditsc.c | 86 +++++++++++++++++++++++++-----------------------
 1 file changed, 45 insertions(+), 41 deletions(-)

-- 
2.31.1