Date: Tue, 25 Jun 2024 02:28:49 -0700
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Jesper Dangaard Brouer <hawk@...nel.org>, tj@...nel.org, cgroups@...r.kernel.org, 
	hannes@...xchg.org, lizefan.x@...edance.com, longman@...hat.com, 
	kernel-team@...udflare.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd
 across NUMA nodes

On Mon, Jun 24, 2024 at 5:24 PM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
>
> On Mon, Jun 24, 2024 at 03:21:22PM GMT, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 3:17 PM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > >
> > > On Mon, Jun 24, 2024 at 02:43:02PM GMT, Yosry Ahmed wrote:
> > > [...]
> > > > >
> > > > > > There is also
> > > > > > > a heuristic in zswap that may writeback more (or less) pages than it
> > > > > > should to the swap device if the stats are significantly stale.
> > > > > >
> > > > >
> > > > > Is this the ratio of MEMCG_ZSWAP_B and MEMCG_ZSWAPPED in
> > > > > zswap_shrinker_count()? There is already a target memcg flush in that
> > > > > function and I don't expect root memcg flush from there.
> > > >
> > > > I was thinking of the generic approach I suggested, where we can avoid
> > > > contending on the lock if the cgroup is a descendant of the cgroup
> > > > being flushed, regardless of whether or not it's the root memcg. I
> > > > think this would be more beneficial than just focusing on root
> > > > flushes.
> > >
> > > Yes I agree with this but what about skipping the flush in this case?
> > > Are you ok with that?
> >
> > Sorry if I am confused, but IIUC this patch affects all root flushes,
> > even for userspace reads, right? In this case I think it's not okay to
> > skip the flush without waiting for the ongoing flush.
>
> So, we differentiate between userspace and in-kernel users. For
> userspace, we should not skip flush and for in-kernel users, we can skip
> if flushing memcg is the ancestor of the given memcg. Is that what you
> are saying?

Basically, I prefer that we don't skip flushing at all and keep
userspace and in-kernel users the same. We can use completions to make
other overlapping flushers sleep instead of spin on the lock.

A proof of concept is basically something like:

void cgroup_rstat_flush(struct cgroup *cgrp)
{
    /* An overlapping flush is already running: sleep until it finishes */
    if (cgroup_is_descendant(cgrp, READ_ONCE(cgroup_under_flush))) {
        wait_for_completion_interruptible(&cgroup_under_flush->completion);
        return;
    }

    __cgroup_rstat_lock(cgrp, -1);
    reinit_completion(&cgrp->completion);
    /* Any overlapping flush requests after this write will not spin on the lock */
    WRITE_ONCE(cgroup_under_flush, cgrp);

    cgroup_rstat_flush_locked(cgrp);
    complete_all(&cgrp->completion);
    __cgroup_rstat_unlock(cgrp, -1);
}

There may be missing barriers or chances to reduce the window between
__cgroup_rstat_lock and WRITE_ONCE(), but that's what I have in mind.
I think it's not too complicated, but we need to check if it fixes the
problem.
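
For illustration only, here is a userspace sketch of the same idea using
pthreads. It is not kernel code and not part of the patch: struct node,
flush_subtree(), state_lock, flush_done, under_flush and flush_gen are all
hypothetical names, and a mutex plus condition variable stands in for the
kernel completion. A caller whose node lies under the subtree currently being
flushed sleeps until that flush finishes instead of contending for the flush
slot.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup: only the parent link matters here. */
struct node {
    struct node *parent;    /* NULL for the root */
    unsigned long flushes;  /* number of real flushes run on this node */
};

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flush_done = PTHREAD_COND_INITIALIZER;
static struct node *under_flush;  /* node whose subtree is being flushed */
static unsigned long flush_gen;   /* bumped every time a flush completes */

/* True if n is ancestor itself or lies below it; false if ancestor is NULL. */
static bool is_descendant(const struct node *n, const struct node *ancestor)
{
    for (; n; n = n->parent)
        if (n == ancestor)
            return true;
    return false;
}

/*
 * Returns true if this caller ran the flush itself, false if it only
 * slept until an overlapping flush of an ancestor completed.
 */
static bool flush_subtree(struct node *n)
{
    assert(n != NULL);

    pthread_mutex_lock(&state_lock);
    while (under_flush) {
        if (is_descendant(n, under_flush)) {
            /* Overlapping flush in progress: sleep until it completes. */
            unsigned long gen = flush_gen;

            while (flush_gen == gen)
                pthread_cond_wait(&flush_done, &state_lock);
            pthread_mutex_unlock(&state_lock);
            return false;
        }
        /* Unrelated flush in progress: wait for the slot, then re-check. */
        pthread_cond_wait(&flush_done, &state_lock);
    }
    under_flush = n;
    pthread_mutex_unlock(&state_lock);

    n->flushes++;  /* stand-in for the expensive subtree flush */

    pthread_mutex_lock(&state_lock);
    under_flush = NULL;
    flush_gen++;
    pthread_cond_broadcast(&flush_done);  /* wake every sleeping waiter */
    pthread_mutex_unlock(&state_lock);
    return true;
}
```

Unlike the proof of concept above, this sketch clears under_flush once the
flush completes and uses a generation counter so that late waiters cannot miss
the wakeup. Whether the kernel version needs equivalent handling is an
assumption of the sketch, not something established in this thread.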

If this is not preferable, then yeah, let's at least keep the
userspace behavior intact. This makes sure we don't affect userspace
negatively, and we can change it later as we please.
