Message-ID: <20230821121049.GA19670@willie-the-truck>
Date:   Mon, 21 Aug 2023 13:10:50 +0100
From:   Will Deacon <will@...nel.org>
To:     Mark Brown <broonie@...nel.org>
Cc:     Catalin Marinas <catalin.marinas@....com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] arm64/fpsimd: Suppress SVE access traps when loading
 FPSIMD state

On Mon, Aug 07, 2023 at 11:20:38PM +0100, Mark Brown wrote:
> When we are in a syscall we take the opportunity to discard the SVE state,
> saving only the FPSIMD subset of the register state. When we reload the
> state from memory we re-enable SVE access traps, and stop tracking SVE until
> the task takes another SVE access trap. This means that for a task which is
> actively using SVE, many blocking system calls will have the additional
> overhead of an SVE access trap.
> 
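A minimal toy model of the flow described above, with made-up names rather
than the real arch/arm64/kernel/fpsimd.c helpers (purely illustrative):

#include <stdbool.h>
#include <stdio.h>

struct task_model {
	bool sve_traps_enabled;		/* EL0 SVE accesses trap */
	bool sve_state_live;		/* full SVE state being tracked */
	unsigned long traps_taken;
};

/* Syscall: discard the SVE state, keep only the FPSIMD subset. */
static void model_syscall(struct task_model *t)
{
	t->sve_state_live = false;
}

/* Reload FPSIMD-only state: current behaviour re-enables traps. */
static void model_fpsimd_reload(struct task_model *t)
{
	if (!t->sve_state_live)
		t->sve_traps_enabled = true;
}

/* Task executes an SVE instruction. */
static void model_sve_use(struct task_model *t)
{
	if (t->sve_traps_enabled) {
		t->traps_taken++;		/* SVE access trap */
		t->sve_traps_enabled = false;
		t->sve_state_live = true;
	}
}

int main(void)
{
	struct task_model t = { .sve_traps_enabled = true };

	/* An SVE-heavy task doing blocking syscalls traps every time. */
	for (int i = 0; i < 10; i++) {
		model_sve_use(&t);
		model_syscall(&t);
		model_fpsimd_reload(&t);
	}
	printf("traps taken: %lu\n", t.traps_taken);	/* prints 10 */
	return 0;
}
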
> As SVE deployment is progressing we are seeing much wider use of the SVE
> instruction set, including performance optimised implementations of
> operations like memset() and memcpy(), which mean that even tasks which are
> not obviously floating point based can end up with substantial SVE usage.
> 
> It does not, however, make sense to just use the full SVE register state
> unconditionally: it is larger than the FPSIMD register state, so saving and
> restoring it adds overhead on context switch, and our requirement to flush
> the register state not shared with FPSIMD on syscall also adds a noticeable
> overhead to each system call.
> 
> I did some instrumentation which counted the number of SVE access traps
> and the number of times we loaded FPSIMD-only register state for each task.
> Testing with Debian Bookworm, this showed that during boot the overwhelming
> majority of tasks triggered another SVE access trap more than 50% of the
> time after loading FPSIMD-only state, with a substantial number near 100%,
> though some programs had a very small number of SVE accesses, most likely
> from startup. Few tasks fell in the 5-45% range: most tasks either used SVE
> frequently or used it only a tiny proportion of the time. As expected,
> older distributions which do not have the SVE performance work available
> showed no SVE usage in general applications.
> 
> This indicates that there should be some useful benefit from reducing the
> number of SVE access traps for blocking system calls, like we did for
> non-blocking system calls in commit 8c845e273104 ("arm64/sve: Leave SVE
> enabled on syscall if we don't context switch"). Let's do this by counting
> the number of times we have loaded FPSIMD-only register state for SVE tasks
> and only re-enabling traps after some number of times, otherwise leaving
> traps disabled and flushing the non-shared register state like we would on
> trap.
> 
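On the same toy model, the proposed reload path might look roughly like
this; the names are still made up, the threshold matches the 64 chosen
below, and the point at which the counter gets reset (presumably when the
task next takes an SVE access trap) is an assumption, not taken from the
patch:

#define FPSIMD_ONLY_LOAD_LIMIT	64

static void model_fpsimd_reload_counted(struct task_model *t,
					unsigned int *fpsimd_only_loads)
{
	if (t->sve_state_live)
		return;

	if (++(*fpsimd_only_loads) <= FPSIMD_ONLY_LOAD_LIMIT) {
		/*
		 * Still within the budget: flush the registers not shared
		 * with FPSIMD and leave SVE access traps disabled, as we
		 * would after handling a trap.
		 */
		t->sve_traps_enabled = false;
	} else {
		/* Budget exhausted: fall back to re-enabling traps. */
		t->sve_traps_enabled = true;
	}
}

With this in place of model_fpsimd_reload() above, the SVE-heavy task in
the example stops taking a trap after every blocking syscall, while a task
that never touches SVE again sees traps re-enabled once the counter
expires.
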
> I pulled 64 out of thin air for the number of flushes to do; there is
> doubtless room for tuning here. Ideally we would be able to tell whether the
> task is actually using SVE, but without using performance counters (which
> would be substantial work) we currently can't. I picked the number because
> so many of the tasks using SVE used it so frequently.
> 
> This means that for a task which is actively using SVE the number of SVE
> access traps will be substantially reduced, while applications which use
> SVE only very infrequently will still avoid the overheads associated with
> tracking SVE state once the counter expires.
> 
> There should be no functional change resulting from this; it is purely a
> performance optimisation.

Do you have any performance numbers to motivate this change? It would be
interesting, for example, to see how changing the timeout value affects
the results for some real workloads.

Will
