Message-ID: <aCuOsXKCkwa8zkwR@slm.duckdns.org>
Date: Mon, 19 May 2025 10:04:01 -1000
From: Tejun Heo <tj@...nel.org>
To: Amery Hung <ameryhung@...il.com>
Cc: bpf@...r.kernel.org, netdev@...r.kernel.org,
	alexei.starovoitov@...il.com, andrii@...nel.org,
	daniel@...earbox.net, memxor@...il.com, martin.lau@...nel.org,
	kernel-team@...a.com
Subject: Re: [PATCH bpf-next v4 1/3] selftests/bpf: Introduce task local data

Hello,

On Thu, May 15, 2025 at 02:16:00PM -0700, Amery Hung wrote:
...
> +#define PAGE_SIZE 4096

This might conflict with other definitions. Also, non-4k page sizes look to be
a lot more common on arm. Would this be a problem?

> +static int __tld_init_metadata(int map_fd)
> +{
> +	struct u_tld_metadata *new_metadata;
> +	struct tld_map_value map_val;
> +	int task_fd = 0, err;
> +
> +	task_fd = syscall(SYS_pidfd_open, getpid(), 0);
> +	if (task_fd < 0) {
> +		err = -errno;
> +		goto out;
> +	}
> +
> +	new_metadata = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

Is 4k size limit from UPTR? Is it still 4k on machines with >4k pages? If
this isn't a hard limit from UPTR, would it make sense to encode the size in
the header part of the metadata?

> +static int __tld_init_data(int map_fd)
> +{
> +	struct u_tld_data *new_data = NULL;
> +	struct tld_map_value map_val;
> +	int err, task_fd = 0;
> +
> +	task_fd = syscall(SYS_pidfd_open, gettid(), PIDFD_THREAD);
> +	if (task_fd < 0) {
> +		err = -errno;
> +		goto out;
> +	}
> +
> +	new_data = aligned_alloc(PAGE_SIZE, TLD_DATA_SIZE);

Ditto.

Noob question: does this mean that each thread will map a 4k page no matter
how much data it actually uses?

> +__attribute__((unused))
> +static tld_key_t tld_create_key(int map_fd, const char *name, size_t size)
> +{
> +	int err, i, cnt, sz, off = 0;
> +
> +	if (!READ_ONCE(tld_metadata_p)) {
> +		err = __tld_init_metadata(map_fd);
> +		if (err)
> +			return (tld_key_t) {.off = err};
> +	}
> +
> +	if (!tld_data_p) {
> +		err = __tld_init_data(map_fd);
> +		if (err)
> +			return (tld_key_t) {.off = err};
> +	}
> +
> +	size = round_up(size, 8);
> +
> +	for (i = 0; i < TLD_DATA_CNT; i++) {
> +retry:
> +		cnt = __atomic_load_n(&tld_metadata_p->cnt, __ATOMIC_RELAXED);
> +		if (i < cnt) {
> +			/*
> +			 * Pending tld_create_key() uses size to signal if the metadata has
> +			 * been fully updated.
> +			 */
> +			while (!(sz = __atomic_load_n(&tld_metadata_p->metadata[i].size,
> +						      __ATOMIC_ACQUIRE)))
> +				sched_yield();
> +
> +			if (!strncmp(tld_metadata_p->metadata[i].name, name, TLD_NAME_LEN))
> +				return (tld_key_t) {.off = -EEXIST};
> +
> +			off += sz;
> +			continue;
> +		}
> +
> +		if (off + size > TLD_DATA_SIZE)
> +			return (tld_key_t) {.off = -E2BIG};
> +
> +		/*
> +		 * Only one tld_create_key() can increase the current cnt by one and
> +		 * takes the latest available slot. Other threads will check again if a new
> +		 * TLD can still be added, and then compete for the new slot after the
> +		 * succeeding thread update the size.
> +		 */
> +		if (!__atomic_compare_exchange_n(&tld_metadata_p->cnt, &cnt, cnt + 1, true,
> +						 __ATOMIC_RELAXED, __ATOMIC_RELAXED))
> +			goto retry;
> +
> +		strncpy(tld_metadata_p->metadata[i].name, name, TLD_NAME_LEN);
> +		__atomic_store_n(&tld_metadata_p->metadata[i].size, size, __ATOMIC_RELEASE);
> +		return (tld_key_t) {.off = off};
> +	}
> +
> +	return (tld_key_t) {.off = -ENOSPC};
> +}

This looks fine to me but I wonder whether storing the key strings out of
line (encoded as offset + length) would be more efficient and less
restrictive in terms of key length. e.g.:

struct key {
        u32 data_len;
        u16 key_off;
        u16 key_len;
};

struct metadata {
        struct key      keys[MAX_KEYS];
        char            key_strs[SOME_SIZE];
};

The logic can stay mostly the same. The only difference would be that the
key string is not inline. Determine the winner in the creation path by
cmpxchg'ing on data_len, but set key_off and key_len only after the key
string has been written. Losing the cmpxchg, or seeing an entry whose
key_len is zero, means one lost the race and should relax and retry. It can
still use the same 4k metadata page but will likely allow more keys while
also relaxing restrictions on key length.

Hmm... maybe making the key string variably sized makes things difficult for
the BPF code. If so (or for any other reasons), please feel free to ignore
the above.

> +#endif /* __TASK_LOCAL_DATA_H */
> diff --git a/tools/testing/selftests/bpf/progs/task_local_data.bpf.h b/tools/testing/selftests/bpf/progs/task_local_data.bpf.h
> new file mode 100644
> index 000000000000..5f48e408a5e5
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/task_local_data.bpf.h
...
> +/**
> + * tld_get_data() - Retrieves a pointer to the TLD associated with the key.
> + *
> + * @tld_obj: A pointer to a valid tld_object initialized by tld_object_init()
> + * @key: The key of a TLD saved in tld_maps
> + * @size: The size of the TLD. Must be a known constant value
> + *
> + * Returns a pointer to the TLD data associated with the key; NULL if the key
> + * is not valid or the size is too big
> + */
> +#define tld_get_data(tld_obj, key, size) \
> +	__tld_get_data(tld_obj, (tld_obj)->key_map->key.off - 1, size)
> +
> +__attribute__((unused))
> +__always_inline void *__tld_get_data(struct tld_object *tld_obj, u32 off, u32 size)
> +{
> +	return (tld_obj->data_map->data && off >= 0 && off < TLD_DATA_SIZE - size) ?
> +		(void *)tld_obj->data_map->data + off : NULL;
> +}

Neat.

Generally looks great to me. The only thing I wonder is whether the data
area sizing can be determined at init time rather than fixed to 4k.

Thanks.

-- 
tejun
