linux-kernel - Re: [PATCH v1 1/3] mm: zswap: fix global shrinker memcg iteration

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPpodddsnOF7nKX-ijsujAcnjvLHB2UTyyZknc6uTgb3UWXY2g@mail.gmail.com>
Date: Tue, 11 Jun 2024 23:50:08 +0900
From: Takero Funaki <flintglass@...il.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Nhat Pham <nphamcs@...il.com>, 
	Chengming Zhou <chengming.zhou@...ux.dev>, Jonathan Corbet <corbet@....net>, 
	Andrew Morton <akpm@...ux-foundation.org>, 
	Domenico Cerasuolo <cerasuolodomenico@...il.com>, linux-mm@...ck.org, linux-doc@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 1/3] mm: zswap: fix global shrinker memcg iteration

On 2024/06/11 4:16, Yosry Ahmed wrote:
>
> I am really finding it difficult to understand what the diff is trying
> to do. We are holding a lock that protects zswap_next_shrink. We
> always access it with the lock held. Why do we need all of this?
>
> Adding READ_ONCE() and WRITE_ONCE() where they are not needed is just
> confusing imo.

I initially thought that reading new values from external variables
inside a loop required protection from compiler optimization. I will
remove the access macros in v2.

> 'memcg' will always be NULL on the first iteration, so we will always
> start by shrinking 'zswap_next_shrink' for a second time before moving
> the iterator.
>
>> +               } else {
>> +                       /* advance cursor */
>> +                       memcg = mem_cgroup_iter(NULL, memcg, NULL);
>> +                       WRITE_ONCE(zswap_next_shrink, memcg);
> Again, I don't see what this is achieving. The first iteration will
> always set 'memcg' to 'zswap_next_shrink', and then we will always
> move the iterator forward. The only difference I see is that we shrink
> 'zswap_next_shrink' twice in a row now (last 'memcg' in prev call, and
> first 'memcg' in this call).

The reason for checking if `memcg != next_memcg` was to ensure that we
do not skip memcg that might be modified by the cleaner.  For example,
say we get memcg A and save it. When the cleaner advances the cursor
from A to B, we then advance from B to C, shrink C. We have to check
that A in the zswap_next_shrink is untouched before advancing the
cursor.

If this approach is overly complicated and ignoring B is acceptable,
the beginning of the loop can be simplified to:

        do {
+iternext:
                spin_lock(&zswap_shrink_lock);

                zswap_next_shrink = mem_cgroup_iter(NULL,
zswap_next_shrink, NULL);
                memcg = zswap_next_shrink;



>> @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
>>                 }
>>
>>                 if (!mem_cgroup_tryget_online(memcg)) {
>> -                       /* drop the reference from mem_cgroup_iter() */
>> -                       mem_cgroup_iter_break(NULL, memcg);
>> -                       zswap_next_shrink = NULL;
>> +                       /*
>> +                        * It is an offline memcg which we cannot shrink
>> +                        * until its pages are reparented.
>> +                        *
>> +                        * Since we cannot determine if the offline cleaner has
>> +                        * been already called or not, the offline memcg must be
>> +                        * put back unconditonally. We cannot abort the loop while
>> +                        * zswap_next_shrink has a reference of this offline memcg.
>> +                        */
> You actually deleted the code that actually puts the ref to the
> offline memcg above.
>
> Why don't you just replace mem_cgroup_iter_break(NULL, memcg) with
> mem_cgroup_iter(NULL, memcg, NULL) here? I don't understand what the
> patch is trying to do to be honest. This patch is a lot more confusing
> than it should be.


>>                         spin_unlock(&zswap_shrink_lock);
>> -
>> -                       if (++failures == MAX_RECLAIM_RETRIES)
>> -                               break;
>> -
>> -                       goto resched;
>> +                       goto iternext;
>>                 }

Removing the `break` on max failures from the if-offline branch is
required to not leave the reference of the next memcg.

If we just replace the mem_cgroup_iter_break with `memcg =
zswap_next_shrink = mem_cgroup_iter(NULL, memcg, NULL);` and break the
loop on failure, the next memcg will be left in zswap_next_shrink. If
zswap_next_shrink is also offline, the reference will be held
indefinitely.

When we get offline memcg, we cannot determine if the cleaner has
already been called or will be called later. We have to put back the
offline memcg reference before returning from the worker function.
This potential memcg leak is the reason why I think we cannot break
the loop here.
In this patch, the `goto iternext` ensures the offline memcg is
released in the next iteration (or by cleaner waiting for our unlock).

>
> Also, I would like Nhat to weigh in here. Perhaps the decision to
> reset the iterator instead of advancing it in this case was made for a
> reason that we should honor. Maybe cgroups are usually offlined
> together so we will keep running into offline cgroups here if we
> continue? I am not sure.

>From comment I removed,

>> -                * We need to retry if we have gone through a full round trip, or if we
>> -                * got an offline memcg (or else we risk undoing the effect of the
>> -                * zswap memcg offlining cleanup callback). This is not catastrophic
>> -                * per se, but it will keep the now offlined memcg hostage for a while.

I think this mentioned the potential memcg leak, which is now resolved
by this patch modifying the offline memcg case.