lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 22 Aug 2022 14:52:48 -0700
From:   Mina Almasry <almasrymina@...gle.com>
To:     Johannes Weiner <hannes@...xchg.org>,
        Yosry Ahmed <yosryahmed@...gle.com>
Cc:     Tejun Heo <tj@...nel.org>, Yafang Shao <laoar.shao@...il.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>, Martin Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        john fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>,
        Stanislav Fomichev <sdf@...gle.com>,
        Hao Luo <haoluo@...gle.com>, jolsa@...nel.org,
        Michal Hocko <mhocko@...nel.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Shakeel Butt <shakeelb@...gle.com>,
        Muchun Song <songmuchun@...edance.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Zefan Li <lizefan.x@...edance.com>,
        Cgroups <cgroups@...r.kernel.org>,
        netdev <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        Linux MM <linux-mm@...ck.org>,
        Dan Schatzberg <schatzberg.dan@...il.com>,
        Lennart Poettering <lennart@...ttering.net>
Subject: Re: [RFD RESEND] cgroup: Persistent memory usage tracking

On Mon, Aug 22, 2022 at 2:19 PM Johannes Weiner <hannes@...xchg.org> wrote:
>
> On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote:
> > On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@...nel.org> wrote:
> > > b. Let userspace specify which cgroup to charge for some of constructs like
> > >    tmpfs and bpf maps. The key problems with this approach are
> > >
> > >    1. How to grant/deny what can be charged where. We must ensure that a
> > >       descendant can't move charges up or across the tree without the
> > >       ancestors allowing it.
> > >
> > >    2. How to specify the cgroup to charge. While specifying the target
> > >       cgroup directly might seem like an obvious solution, it has a couple
> > >       rather serious problems. First, if the descendant is inside a cgroup
> > >       namespace, it might be able to see the target cgroup at all. Second,
> > >       it's an interface which is likely to cause misunderstandings on how it
> > >       can be used. It's too broad an interface.
> > >
> >
> > This is pretty much the solution I sent out for review about a year
> > ago and yes, it suffers from the issues you've brought up:
> > https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
> >
> >
> > >    One solution that I can think of is leveraging the resource domain
> > >    concept which is currently only used for threaded cgroups. All memory
> > >    usages of threaded cgroups are charged to their resource domain cgroup
> > >    which hosts the processes for those threads. The persistent usages have a
> > >    similar pattern, so maybe the service level cgroup can declare that it's
> > >    the encompassing resource domain and the instance cgroup can say whether
> > >    it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > >    resource domain.
> > >
> >
> > I think this sounds excellent and addresses our use cases. Basically
> > the tmpfs/bpf memory would get charged to the encompassing resource
> > domain cgroup rather than the instance cgroup, making the memory usage
> > of the first and second+ instances consistent and predictable.
> >
> > Would love to hear from other memcg folks what they would think of
> > such an approach. I would also love to hear what kind of interface you
> > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > charge the tmpfs/bpf instance to itself or to the encompassing
> > resource domain?
>
> I like this too. It makes shared charging predictable, with a coherent
> resource hierarchy (congruent OOM, CPU, IO domains), and without the
> need for cgroup paths in tmpfs mounts or similar.
>
> As far as who is declaring what goes, though: if the instance groups
> can declare arbitrary files/objects persistent or shared, they'd be
> able to abuse this and sneak private memory past local limits and
> burden the wider persistent/shared domain with it.
>
> I'm thinking it might make more sense for the service level to declare
> which objects are persistent and shared across instances.
>
> If that's the case, we may not need a two-component interface. Just
> the ability for an intermediate cgroup to say: "This object's future
> memory is to be charged to me, not the instantiating cgroup."
>
> Can we require a process in the intermediate cgroup to set up the file
> or object, and use madvise/fadvise to say "charge me", before any
> instances are launched?

I think doing this on a file granularity makes it logistically hard to
use, no? The service needs to create a file in the shared domain and
all its instances need to re-use this exact same file.

Our kubernetes use case from [1] shares a mount between subtasks
rather than specific files. This allows subtasks to create files at
will in the mount with the memory charged to the shared domain. I
imagine this is more convenient than a shared file.

Our other use case, which I hope to address here as well, is a
service-client relationship from [1] where the service would like to
charge per-client memory back to the client itself. In this case the
service or client can create a mount from the shared domain and pass
it to the service at which point the service is free to create/remove
files in this mount as it sees fit.

Would you be open to a per-mount interface rather than a per-file
fadvise interface?

Yosry, would a proposal like so be extensible to address the bpf
charging issues?

[1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ