lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 17 Nov 2020 17:07:03 -0800
From:   Roman Gushchin <guro@...com>
To:     Daniel Borkmann <daniel@...earbox.net>
CC:     <bpf@...r.kernel.org>, <ast@...nel.org>, <netdev@...r.kernel.org>,
        <andrii@...nel.org>, <akpm@...ux-foundation.org>,
        <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
        <kernel-team@...com>
Subject: Re: [PATCH bpf-next v6 06/34] bpf: prepare for memcg-based memory
 accounting for bpf maps

On Tue, Nov 17, 2020 at 04:46:34PM -0800, Roman Gushchin wrote:
> On Wed, Nov 18, 2020 at 01:06:17AM +0100, Daniel Borkmann wrote:
> > On 11/17/20 4:40 AM, Roman Gushchin wrote:
> > > In the absolute majority of cases if a process is making a kernel
> > > allocation, it's memory cgroup is getting charged.
> > > 
> > > Bpf maps can be updated from an interrupt context and in such
> > > case there is no process which can be charged. It makes the memory
> > > accounting of bpf maps non-trivial.
> > > 
> > > Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
> > > memcg accounting from interrupt contexts") and b87d8cefe43c
> > > ("mm, memcg: rework remote charging API to support nesting")
> > > it's finally possible.
> > > 
> > > To do it, a pointer to the memory cgroup of the process which created
> > > the map is saved, and this cgroup is getting charged for all
> > > allocations made from an interrupt context.
> > > 
> > > Allocations made from a process context will be accounted in a usual way.
> > > 
> > > Signed-off-by: Roman Gushchin <guro@...com>
> > > Acked-by: Song Liu <songliubraving@...com>
> > [...]
> > > +#ifdef CONFIG_MEMCG_KMEM
> > > +static __always_inline int __bpf_map_update_elem(struct bpf_map *map, void *key,
> > > +						 void *value, u64 flags)
> > > +{
> > > +	struct mem_cgroup *old_memcg;
> > > +	bool in_interrupt;
> > > +	int ret;
> > > +
> > > +	/*
> > > +	 * If update from an interrupt context results in a memory allocation,
> > > +	 * the memory cgroup to charge can't be determined from the context
> > > +	 * of the current task. Instead, we charge the memory cgroup, which
> > > +	 * contained a process created the map.
> > > +	 */
> > > +	in_interrupt = in_interrupt();
> > > +	if (in_interrupt)
> > > +		old_memcg = set_active_memcg(map->memcg);
> > > +
> > > +	ret = map->ops->map_update_elem(map, key, value, flags);
> > > +
> > > +	if (in_interrupt)
> > > +		set_active_memcg(old_memcg);
> > > +
> > > +	return ret;
> > 
> > Hmm, this approach here won't work, see also commit 09772d92cd5a ("bpf: avoid
> > retpoline for lookup/update/delete calls on maps") which removes the indirect
> > call, so the __bpf_map_update_elem() and therefore the set_active_memcg() is
> > not invoked for the vast majority of cases.
> 
> I see. Well, the first option is to move these calls into map-specific update
> functions, but the list is relatively long:
>   nsim_map_update_elem()
>   cgroup_storage_update_elem()
>   htab_map_update_elem()
>   htab_percpu_map_update_elem()
>   dev_map_update_elem()
>   dev_map_hash_update_elem()
>   trie_update_elem()
>   cpu_map_update_elem()
>   bpf_pid_task_storage_update_elem()
>   bpf_fd_inode_storage_update_elem()
>   bpf_fd_sk_storage_update_elem()
>   sock_map_update_elem()
>   xsk_map_update_elem()
> 
> Alternatively, we can set the active memcg for the whole duration of bpf
> execution. It's simpler, but will add some overhead. Maybe we can somehow
> mark programs calling into update helpers and skip all others?

Actually, this is problematic if a program updates several maps, because
in theory they can belong to different cgroups.
So it seems that the first option is the way to go. Do you agree?

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ