linux-kernel - Re: selftests: cgroup: Failures – Timeouts & OOM Issues Analysis


Message-ID: <exfgzrx7u3s77gsoxqzm4zhb6mr7aysc2vzus5ob3zeadkm7ut@3dzkywk3jfqr>
Date: Tue, 4 Mar 2025 15:20:58 +0100
From: Michal Koutný <mkoutny@...e.com>
To: Naresh Kamboju <naresh.kamboju@...aro.org>
Cc: Cgroups <cgroups@...r.kernel.org>, linux-mm <linux-mm@...ck.org>, 
	"open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest@...r.kernel.org>, open list <linux-kernel@...r.kernel.org>, 
	Shuah Khan <shuah@...nel.org>, Dan Carpenter <dan.carpenter@...aro.org>, 
	Arnd Bergmann <arnd@...db.de>, Anders Roxell <anders.roxell@...aro.org>, 
	Tejun Heo <tj@...nel.org>, Johannes Weiner <hannes@...xchg.org>
Subject: Re: selftests: cgroup: Failures – Timeouts & OOM Issues Analysis

Hello Naresh.

On Tue, Mar 04, 2025 at 05:26:45PM +0530, Naresh Kamboju <naresh.kamboju@...aro.org> wrote:
> As part of LKFT’s re-validation of known issues, we have observed that
> the selftests: cgroup suite is consistently failing across almost all
> LKFT-supported devices due to:
>  - Test timeouts (45 seconds limit reached)
>  - OOM-killer invocation

Thanks for reporting the issues with the tests.

> ## Key Questions for Discussion:
>  - Would it be beneficial to increase the test timeout to ~180 seconds
>    to allow sufficient execution time?

That depends.

test_cpu has some lengthier checks and in sum they can surpass 45s;
it might be better to shorten them (within the precision margin) instead
of prolonging the limit.

test_kmem -- it shouldn't take so long; if anything, I'd suspect
/proc/kpagecgroup -- are your systems larger than 100GiB of memory
(that's my rough estimate for these reads to exceed the limit)?
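
To put the 100GiB figure in perspective, here is a back-of-the-envelope
sketch of how much data a full read of /proc/kpagecgroup involves. The
file holds one 64-bit value per page frame; the 4 KiB page size and the
helper name kpagecgroup_bytes are assumptions made for illustration
(some arm64 configs use larger pages), and the real read time depends on
the per-entry cost in the kernel, not only on the size.

```python
# Rough size of a full /proc/kpagecgroup read.
ENTRY_BYTES = 8      # one 64-bit entry per page frame (PFN-indexed)
PAGE_SIZE = 4096     # assumption: 4 KiB pages
GIB = 1 << 30
MIB = 1 << 20


def kpagecgroup_bytes(total_ram_bytes, page_size=PAGE_SIZE):
    """Return (page_frames, bytes_to_read) for a given amount of RAM.

    Hypothetical helper; it ignores holes in the physical address space.
    """
    frames = total_ram_bytes // page_size
    return frames, frames * ENTRY_BYTES


frames, size = kpagecgroup_bytes(100 * GIB)
print(frames, "page frames,", size // MIB, "MiB to read")
```

Under these assumptions a 100GiB machine has about 26 million page
frames, i.e. roughly 200 MiB of entries to read.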

(Are there any other timeouts?)
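
One way to answer that is to time each test binary separately against
the 45s limit from the report. The following is only a sketch: the
helper name time_test is made up for illustration, it is not part of
the kselftest runner, and it assumes the binaries are run directly from
the build directory.

```python
import subprocess
import time

KSELFTEST_TIMEOUT = 45  # seconds; the limit quoted in the report


def time_test(argv, limit=KSELFTEST_TIMEOUT):
    """Run one test binary and return (status, elapsed_seconds).

    status is "timeout" if the limit was hit, otherwise "exit N".
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, timeout=limit)
        status = "exit %d" % proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising.
        status = "timeout"
    return status, time.monotonic() - start


# Example (run from tools/testing/selftests/cgroup after building):
#   for name in ("test_cpu", "test_kmem"):
#       print(name, *time_test(["./" + name]))
```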

OOM -- some tests are supposed to trigger memcg OOM, so an OOM-killer
invocation in the log is not a failure by itself.

>  - Should we enhance logging to explicitly print failure reasons when a
>    test fails?

These tests are useful when run by developers them_selves_. In such a
case it's handy to obtain more info by running them under strace (since
they're so simple).
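
For illustration, an strace invocation for one test could be assembled
like this. The helper name strace_argv and the log file naming are
assumptions; -f (follow children) and -o (write the trace to a file)
are standard strace options, and -f is used here on the assumption that
the tests fork child processes.

```python
import os


def strace_argv(test_binary, log_dir="."):
    """Build an strace command line for one selftest binary."""
    name = os.path.basename(test_binary)
    log = os.path.join(log_dir, name + ".strace")
    # -f: also trace forked children; -o: write the trace to a file.
    return ["strace", "-f", "-o", log, test_binary]


print(" ".join(strace_argv("./test_memcontrol")))
```

The resulting list can be passed to subprocess.run() as is, or the
printed line can be pasted into a shell.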

>  - Are there any missing dependencies that could be causing these failures?
>      Note: The required selftests/cgroup/config options were included in
>      LKFT's build and test plans.

The deps are rather minimal, only some coreutils (cgroup selftests
should be covered by e.g. this list [1]).

> 
> ## Devices Affected:
> The following DUTs consistently experience these failures:
>   -  dragonboard-410c (arm64)
>   -  dragonboard-845c (arm64)
>   -  e850-96 (arm64)
>   -  juno-r2 (arm64)
>   -  qemu-arm64 (arm64)
>   -  qemu-armv7
>   -  qemu-x86_64
>   -  rk3399-rock-pi-4b (arm64)
>   -  x15 (arm)
>   -  x86_64
> 
> Regression Analysis:
>  - New regression? No (these failures have been observed for months/years).

Actually, I noticed a test_memcontrol failure yesterday (with ~mainline
kernel), but I remember it used to pass rather recently. I haven't had
time to look into that, but at least that one may be a regression (in
code or test).

>  - Reproducibility? Yes, the failures occur consistently.

+/- as that may depend on nr_cpus or totalram.
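
Recording both values next to each result would make such a dependency
visible across the DUTs. A minimal sketch, assuming a Linux
/proc/meminfo and using made-up helper names (mem_total_kib,
machine_summary); note that os.cpu_count() reports the CPUs of the
system, which is not necessarily what a given test uses:

```python
import os


def mem_total_kib(meminfo_text):
    """Extract the MemTotal value (in kiB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def machine_summary(meminfo_path="/proc/meminfo"):
    """Return the CPU count and total RAM to log next to test results."""
    with open(meminfo_path) as f:
        total = mem_total_kib(f.read())
    return {"nr_cpus": os.cpu_count(), "memtotal_kib": total}


# Example, on the device under test:
#   print(machine_summary())
```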

>  - Test suite affected? selftests: cgroup (timeouts and OOM-related failures).

Michal
