Date:   Fri, 22 Oct 2021 15:47:33 -0700
From:   Martin KaFai Lau <kafai@...com>
To:     Tejun Heo <tj@...nel.org>
CC:     Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>, <bpf@...r.kernel.org>,
        <kernel-team@...com>, <linux-kernel@...r.kernel.org>,
        KP Singh <kpsingh@...nel.org>
Subject: Re: [PATCH 3/3] bpf: Implement prealloc for task_local_storage

On Wed, Oct 20, 2021 at 10:18:13AM -1000, Tejun Heo wrote:
> From 5e3ad0d4a0b0732e7ebe035582d282ab752397ed Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@...nel.org>
> Date: Wed, 20 Oct 2021 08:56:53 -1000
> 
> task_local_storage currently does not support pre-allocation and the memory
> is allocated on demand using the GFP_ATOMIC mask. While atomic allocations
> succeed most of the time and the occasional failures aren't a problem for
> many use-cases, there are some which can benefit from reliable allocations -
> e.g. tracking acquisitions and releases of specific resources to root cause
> long-term reference leaks.
>
> Prealloc semantics for task_local_storage:
> 
> * When a prealloc map is created, the map's elements for all existing tasks
>   are allocated.
> 
> * Afterwards, whenever a new task is forked, elements for the existing
>   preallocated maps are automatically allocated for it.
> 
> To synchronize against concurrent forks, CONFIG_BPF_SYSCALL now enables
> CONFIG_THREADGROUP_RWSEM; the prealloc task_local_storage creation path
> write-locks threadgroup_rwsem, and the rest of the implementation is
> straightforward.
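A reader's sketch of what this could look like from the map-definition side
(the UAPI change is not shown in this excerpt, so treating the absence of
BPF_F_NO_PREALLOC as the prealloc request, and the value type, are
assumptions):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct resource_tracker {          /* hypothetical value type */
	__u64 acquired;
	__u64 released;
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	/* no BPF_F_NO_PREALLOC: presumably this is what requests prealloc */
	__type(key, int);
	__type(value, struct resource_tracker);
} prealloc_map SEC(".maps");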

[ ... ]

> +static int task_storage_map_populate(struct bpf_local_storage_map *smap)
> +{
> +	struct bpf_local_storage *storage = NULL;
> +	struct bpf_local_storage_elem *selem = NULL;
> +	struct task_struct *p, *g;
> +	int err = 0;
> +
> +	lockdep_assert_held(&threadgroup_rwsem);
> +retry:
> +	if (!storage)
> +		storage = bpf_map_kzalloc(&smap->map, sizeof(*storage),
> +					  GFP_USER);
> +	if (!selem)
> +		selem = bpf_map_kzalloc(&smap->map, smap->elem_size, GFP_USER);
> +	if (!storage || !selem) {
> +		err = -ENOMEM;
> +		goto out_free;
> +	}
> +
> +	rcu_read_lock();
> +	bpf_task_storage_lock();
> +
> +	for_each_process_thread(g, p) {
I am wondering whether this loop could be done with a bpf iter.

If the bpf_local_storage_map is sleepable-safe (not yet done, but there is
an earlier attempt [0]), bpf_local_storage_update() should be able to
allocate without GFP_ATOMIC when called from a sleepable bpf prog, which
could also be useful for other sleepable use cases in general.

For example, if a sleepable bpf iter prog can run in this loop (or is the
existing bpf task iter loop as good?), the iter bpf prog could call
bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE) on a sleepable
bpf_local_storage_map.

This pre-alloc could then be done similarly on the tcp/udp socket side by
running a bpf prog in the existing bpf tcp/udp iters.

[0]: https://lore.kernel.org/bpf/20210826235127.303505-1-kpsingh@kernel.org/
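
To make the idea concrete, a rough sketch of such an iter prog (BPF C). It
assumes sleepable iter progs ("iter.s") and sleepable task storage maps per
[0] become available; the map and value names are made up:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct resource_tracker {
	__u64 acquired;
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct resource_tracker);
} task_map SEC(".maps");

SEC("iter.s/task")	/* sleepable task iterator (assumed available) */
int prealloc_task_storage(struct bpf_iter__task *ctx)
{
	struct task_struct *task = ctx->task;

	/* ctx->task is NULL at the end of the iteration */
	if (!task)
		return 0;

	/* create the element now so later lookups cannot fail */
	bpf_task_storage_get(&task_map, task, NULL,
			     BPF_LOCAL_STORAGE_GET_F_CREATE);
	return 0;
}

char _license[] SEC("license") = "GPL";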

> +		struct bpf_local_storage_data *sdata;
> +
> +		/* Try inserting with atomic allocations. On failure, retry with
> +		 * the preallocated ones.
> +		 */
> +		sdata = bpf_local_storage_update(p, smap, NULL, BPF_NOEXIST);
> +
> +		if (PTR_ERR(sdata) == -ENOMEM && storage && selem) {
> +			sdata = __bpf_local_storage_update(p, smap, NULL,
> +							   BPF_NOEXIST,
> +							   &storage, &selem);
> +		}
> +
> +		/* Check -EEXIST before need_resched() to guarantee forward
> +		 * progress.
> +		 */
> +		if (PTR_ERR(sdata) == -EEXIST)
> +			continue;
> +
> +		/* If requested or alloc failed, take a breather and loop back
> +		 * to preallocate.
> +		 */
> +		if (need_resched() ||
> +		    PTR_ERR(sdata) == -EAGAIN || PTR_ERR(sdata) == -ENOMEM) {
> +			bpf_task_storage_unlock();
> +			rcu_read_unlock();
> +			cond_resched();
> +			goto retry;
> +		}
> +
> +		if (IS_ERR(sdata)) {
> +			err = PTR_ERR(sdata);
> +			goto out_unlock;
> +		}
> +	}
> +out_unlock:
> +	bpf_task_storage_unlock();
> +	rcu_read_unlock();
> +out_free:
> +	if (storage)
> +		kfree(storage);
> +	if (selem)
> +		kfree(selem);
> +	return err;
> +}

[ ... ]

> +int bpf_task_storage_fork(struct task_struct *task)
> +{
> +	struct bpf_local_storage_map *smap;
> +
> +	percpu_rwsem_assert_held(&threadgroup_rwsem);
> +
> +	list_for_each_entry(smap, &prealloc_smaps, prealloc_node) {
Mostly a comment here from the networking side: I suspect the common use
case is going to be more selective, based on the protocol (tcp or udp) and
even the port.  There are existing bpf hooks at inet_sock creation time,
bind time, etc., and a bpf prog can be selective about which bpf_sk_storage
it needs by inspecting different fields of the sk.

e.g. in inet_create(), there is BPF_CGROUP_RUN_PROG_INET_SOCK(sk).
Would a similar hook be useful on the fork side?
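
For illustration, a sketch of how that selectivity looks today on the sk
side with the inet_create() hook (map and value names are made up):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct sk_tracker {
	__u64 created_ns;
};

struct {
	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct sk_tracker);
} sk_map SEC(".maps");

/* runs from BPF_CGROUP_RUN_PROG_INET_SOCK() in inet_create() */
SEC("cgroup/sock_create")
int selective_sk_storage(struct bpf_sock *sk)
{
	struct sk_tracker *t;

	/* be selective: only track tcp sockets */
	if (sk->protocol != IPPROTO_TCP)
		return 1;	/* allow the socket, skip the storage */

	t = bpf_sk_storage_get(&sk_map, sk, NULL,
			       BPF_SK_STORAGE_GET_F_CREATE);
	if (t)
		t->created_ns = bpf_ktime_get_ns();
	return 1;	/* allow the socket */
}

char _license[] SEC("license") = "GPL";

If a similar hook existed at fork time, a task-side prog could be equally
selective, e.g. based on the task's cgroup or comm.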

> +		struct bpf_local_storage *storage;
> +		struct bpf_local_storage_elem *selem;
> +		struct bpf_local_storage_data *sdata;
> +
> +		storage = bpf_map_kzalloc(&smap->map, sizeof(*storage),
> +					  GFP_USER);
> +		selem = bpf_map_kzalloc(&smap->map, smap->elem_size, GFP_USER);
> +
> +		rcu_read_lock();
> +		bpf_task_storage_lock();
> +		sdata = __bpf_local_storage_update(task, smap, NULL, BPF_NOEXIST,
> +						   &storage, &selem);
> +		bpf_task_storage_unlock();
> +		rcu_read_unlock();
> +
> +		if (storage)
> +			kfree(storage);
> +		if (selem)
> +			kfree(selem);
> +
> +		if (IS_ERR(sdata)) {
> +			bpf_task_storage_free(task);
> +			return PTR_ERR(sdata);
> +		}
> +	}
> +
> +	return 0;
> +}
