linux-kernel - Re: [RFC PATCH 4/7] memcg: sleep during flushing stats in safe contexts

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALvZod4z6F2Rr3prKdLqBuWUjippOBoLFw3QFFY7Bk=czm5iHg@mail.gmail.com>
Date:   Thu, 23 Mar 2023 12:35:20 -0700
From:   Shakeel Butt <shakeelb@...gle.com>
To:     Yosry Ahmed <yosryahmed@...gle.com>
Cc:     Johannes Weiner <hannes@...xchg.org>, Tejun Heo <tj@...nel.org>,
        Josef Bacik <josef@...icpanda.com>,
        Jens Axboe <axboe@...nel.dk>,
        Zefan Li <lizefan.x@...edance.com>,
        Michal Hocko <mhocko@...nel.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Muchun Song <muchun.song@...ux.dev>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vasily Averin <vasily.averin@...ux.dev>,
        cgroups@...r.kernel.org, linux-block@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        bpf@...r.kernel.org
Subject: Re: [RFC PATCH 4/7] memcg: sleep during flushing stats in safe contexts

On Thu, Mar 23, 2023 at 11:08 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>
> On Thu, Mar 23, 2023 at 10:27 AM Johannes Weiner <hannes@...xchg.org> wrote:
> >
> > On Thu, Mar 23, 2023 at 09:01:12AM -0700, Yosry Ahmed wrote:
> > > On Thu, Mar 23, 2023 at 8:56 AM Johannes Weiner <hannes@...xchg.org> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> > > > > @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
> > > > >               return;
> > > > >
> > > > >       flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > > > > -     cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> > > > > +     cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
> > > >
> > > > How is it safe to call this with may_sleep=true when it's holding the
> > > > stats_flush_lock?
> > >
> > > stats_flush_lock is always called with trylock, it is only used today
> > > so that we can skip flushing if another cpu is already doing a flush
> > > (which is not 100% correct as they may have not finished flushing yet,
> > > but that's orthogonal here). So I think it should be safe to sleep as
> > > no one can be blocked waiting for this spinlock.
> >
> > I see. It still cannot sleep while the lock is held, though, because
> > preemption is disabled. Make sure you have all lock debugging on while
> > testing this.
>
> Thanks for pointing this out, will do.
>
> >
> > > Perhaps it would be better semantically to replace the spinlock with
> > > an atomic test and set, instead of having a lock that can only be used
> > > with trylock?
> >
> > It could be helpful to clarify what stats_flush_lock is protecting
> > first. Keep in mind that locks should protect data, not code paths.
> >
> > Right now it's doing multiple things:
> >
> > 1. It protects updates to stats_flush_threshold
> > 2. It protects updates to flush_next_time
> > 3. It serializes calls to cgroup_rstat_flush() based on those ratelimits
> >
> > However,
> >
> > 1. stats_flush_threshold is already an atomic
> >
> > 2. flush_next_time is not atomic. The writer is locked, but the reader
> >    is lockless. If the reader races with a flush, you could see this:
> >
> >                                         if (time_after(jiffies, flush_next_time))
> >         spin_trylock()
> >         flush_next_time = now + delay
> >         flush()
> >         spin_unlock()
> >                                         spin_trylock()
> >                                         flush_next_time = now + delay
> >                                         flush()
> >                                         spin_unlock()
> >
> >    which means we already can get flushes at a higher frequency than
> >    FLUSH_TIME during races. But it isn't really a problem.
> >
> >    The reader could also see garbled partial updates, so it needs at
> >    least READ_ONCE and WRITE_ONCE protection.
> >
> > 3. Serializing cgroup_rstat_flush() calls against the ratelimit
> >    factors is currently broken because of the race in 2. But the race
> >    is actually harmless, all we might get is the occasional earlier
> >    flush. If there is no delta, the flush won't do much. And if there
> >    is, the flush is justified.
> >
> > In summary, it seems to me the lock can be ditched altogether. All the
> > code needs is READ_ONCE/WRITE_ONCE around flush_next_time.
>
> Thanks a lot for this analysis. I agree that the lock can be removed
> with proper READ_ONCE/WRITE_ONCE, but I think there is another purpose
> of the lock that we are missing here.
>
> I think one other purpose of the lock is avoiding a thundering herd
> problem on cgroup_rstat_lock, particularly from reclaim context, as
> mentioned by the log of  commit aa48e47e3906 ("memcg: infrastructure
> to flush memcg stats").
>
> While testing, I did notice that removing this lock indeed causes a
> thundering herd problem if we have a lot of concurrent reclaimers. The
> trylock makes sure we abort immediately if someone else is flushing --
> which is not ideal because that flusher might have just started, and
> we may end up reading stale data anyway.
>
> This is why I suggested replacing the lock by an atomic, and do
> something like this if we want to maintain the current behavior:
>
> static void __mem_cgroup_flush_stats(void)
> {
>     ...
>     if (atomic_xchg(&ongoing_flush, 1))
>         return;
>     ...
>     atomic_set(&ongoing_flush, 0)
> }
>
> Alternatively, if we want to change the behavior and wait for the
> concurrent flusher to finish flushing, we can maybe spin until
> ongoing_flush goes back to 0 and then return:
>
> static void __mem_cgroup_flush_stats(void)
> {
>     ...
>     if (atomic_xchg(&ongoing_flush, 1)) {
>         /* wait until the ongoing flusher finishes to get updated stats */
>         while (atomic_read(&ongoing_flush) {};
>         return;
>     }
>     /* flush the stats ourselves */
>     ...
>     atomic_set(&ongoing_flush, 0)
> }
>
> WDYT?

I would go with your first approach i.e. no spinning.