Message-ID: <bbc48f40-274c-44ec-9a98-7c18b64628c0@redhat.com>
Date: Sat, 12 Apr 2025 23:15:00 -0400
From: Waiman Long <llong@...hat.com>
To: Roman Gushchin <roman.gushchin@...ux.dev>
Cc: Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...nel.org>,
 Shakeel Butt <shakeel.butt@...ux.dev>, Muchun Song <muchun.song@...ux.dev>,
 Andrew Morton <akpm@...ux-foundation.org>, Tejun Heo <tj@...nel.org>,
 Michal Koutný <mkoutny@...e.com>,
 Shuah Khan <shuah@...nel.org>, linux-kernel@...r.kernel.org,
 cgroups@...r.kernel.org, linux-mm@...ck.org, linux-kselftest@...r.kernel.org
Subject: Re: [PATCH v3 2/2] selftests: memcg: Increase error tolerance of
 child memory.current check in test_memcg_protection()


On 4/8/25 6:22 PM, Roman Gushchin wrote:
> On Sat, Apr 05, 2025 at 10:40:10PM -0400, Waiman Long wrote:
>> The test_memcg_protection() function is used for the test_memcg_min and
>> test_memcg_low sub-tests. This function generates a set of parent/child
>> cgroups like:
>>
>>    parent:  memory.min/low = 50M
>>    child 0: memory.min/low = 75M,  memory.current = 50M
>>    child 1: memory.min/low = 25M,  memory.current = 50M
>>    child 2: memory.min/low = 0,    memory.current = 50M
>>
>> After applying memory pressure, the function expects the following
>> actual memory usages.
>>
>>    parent:  memory.current ~= 50M
>>    child 0: memory.current ~= 29M
>>    child 1: memory.current ~= 21M
>>    child 2: memory.current ~= 0
>>
>> In reality, the actual memory usages can differ quite a bit from the
>> expected values. The function allows an error tolerance of 10% via the
>> values_close() helper.
>>
>> Both the test_memcg_min and test_memcg_low sub-tests can fail
>> sporadically because the actual memory usage exceeds the 10% error
>> tolerance. Below is a sample of the usage data from test runs
>> that failed.
>>
>>    Child   Actual usage    Expected usage    %err
>>    -----   ------------    --------------    ----
>>      1       16990208         22020096      -12.9%
>>      1       17252352         22020096      -12.1%
>>      0       37699584         30408704      +10.7%
>>      1       14368768         22020096      -21.0%
>>      1       16871424         22020096      -13.2%
>>
>> The current 10% error tolerance might have been right at the time
>> test_memcontrol.c was first introduced in the v4.18 kernel, but memory
>> reclaim has certainly evolved quite a bit since then, which may result
>> in a bit more run-to-run variation than previously expected.
>>
>> Increase the error tolerance to 15% for child 0 and 20% for child 1 to
>> minimize the chance of this type of failure. The tolerance is bigger
>> for child 1 because an upswing in child 0 corresponds to a smaller
>> %err than a similar downswing in child 1 due to the way %err is used
>> in values_close().
>>
>> Before this patch, 100 test runs of test_memcontrol produced the
>> following results:
>>
>>       17 not ok 1 test_memcg_min
>>       22 not ok 2 test_memcg_low
>>
>> After applying this patch, there were no test failures for test_memcg_min
>> and test_memcg_low in 100 test runs.
> Ideally we want to calculate these values dynamically based on the machine
> size (number of cpus and total memory size).
>
> We can calculate the memcg error margin and scale memcg sizes if necessary.
> It's the only way to make it pass both on a 2-CPU VM and a 512-CPU physical
> server.
>
> Not a blocker for this patch, just an idea for the future.

Thanks for the suggestion.

As I said in a previous mail, the test works by waiting until the
memory.current of the parent is close to 50M, then checking the
memory.current of each child to see how much memory it is using. I am
not sure whether the number of CPUs or the total memory size is really a
factor here. We will probably need to run some experiments to find out.
Anyway, that will be a future patch if they do turn out to be a factor.
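
For reference, the %err column in the table quoted above can be
reproduced with a minimal sketch like the one below. It assumes that
values_close() treats two values as close when their absolute difference
is at most err percent of their sum, which is consistent with the quoted
%err figures. The names values_close_sketch() and pct_err() are
hypothetical and only for illustration; this is not a copy of the
selftest code.

```c
#include <stdlib.h>

/*
 * Assumed shape of the helper: a and b are "close" when their absolute
 * difference is at most err percent of their sum (integer arithmetic).
 */
static int values_close_sketch(long a, long b, int err)
{
	return labs(a - b) <= (a + b) / 100 * err;
}

/*
 * Signed error in percent, relative to the sum of both values:
 * (actual - expected) / (actual + expected) * 100.
 */
static double pct_err(long actual, long expected)
{
	return 100.0 * (double)(actual - expected) /
	       (double)(actual + expected);
}
```

With this definition, a 5M upswing from child 0's expected 30408704
gives roughly +7.9%, while a 5M downswing from child 1's expected
22020096 gives roughly -13.5%, because the sum in the denominator grows
in the first case and shrinks in the second.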

Cheers,
Longman

>
> Thanks!
>

