linux-kernel - Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <30625cdd-4d61-594b-8db9-6816b017dde3@amd.com>
Date:   Tue, 30 May 2023 13:36:13 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Tejun Heo <tj@...nel.org>, jiangshanlai@...il.com
Cc:     torvalds@...ux-foundation.org, peterz@...radead.org,
        linux-kernel@...r.kernel.org, kernel-team@...a.com,
        joshdon@...gle.com, brho@...gle.com, briannorris@...omium.org,
        nhuck@...gle.com, agk@...hat.com, snitzer@...nel.org,
        void@...ifault.com
Subject: Re: [PATCH 14/24] workqueue: Generalize unbound CPU pods

Hello Tejun,

I ran into a NULL pointer dereferencing issue when trying to test a build
of the "affinity-scopes-v1" branch from your workqueue tree
(https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/?h=affinity-scopes-v1)
Inlining the splat, some debug details, and workaround I used below.

On 5/19/2023 5:46 AM, Tejun Heo wrote:
> While renamed to pod, the code still assumes that the pods are defined by
> NUMA boundaries. Let's generalize it:
> 
> * workqueue_attrs->affn_scope is added. Each enum represents the type of
>   boundaries that define the pods. There are currently two scopes -
>   WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before
>   - one pod per NUMA node. The latter defines one global pod across the
>   whole system.
> 
> * struct wq_pod_type is added which describes how pods are configured for
>   each affnity scope. For each pod, it lists the member CPUs and the
>   preferred NUMA node for memory allocations. The reverse mapping from CPU
>   to pod is also available.
> 
> * wq_pod_enabled is dropped. Pod is now always enabled. The previously
>   disabled behavior is now implemented through WQ_AFFN_SYSTEM.
> 
> * get_unbound_pool() wants to determine the NUMA node to allocate memory
>   from for the new pool. The variables are renamed from node to pod but the
>   logic still assumes they're one and the same. Clearly distinguish them -
>   walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's
>   NUMA node.
> 
> * wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA
>   node. Take @cpu instead and determine the cpumask to use from the pod_type
>   matching @attrs.
> 
> * apply_wqattrs_prepare() is update to return ERR_PTR() on error instead of
>   NULL so that it can indicate -EINVAL on invalid affinity scopes.
> 
> This patch allows CPUs to be grouped into pods however desired per type.
> While this patch causes some internal behavior changes, nothing material
> should change for workqueue users.
> 
> Signed-off-by: Tejun Heo <tj@...nel.org>
> ---
>  include/linux/workqueue.h |  31 +++++++-
>  kernel/workqueue.c        | 154 ++++++++++++++++++++++++--------------
>  2 files changed, 125 insertions(+), 60 deletions(-)
> 
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index b8961c8ea5b3..a2f826b6ec9a 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -124,6 +124,15 @@ struct rcu_work {
>  	struct workqueue_struct *wq;
>  };
>  
> +enum wq_affn_scope {
> +	WQ_AFFN_NUMA,			/* one pod per NUMA node */
> +	WQ_AFFN_SYSTEM,			/* one pod across the whole system */
> +
> +	WQ_AFFN_NR_TYPES,
> +
> +	WQ_AFFN_DFL = WQ_AFFN_NUMA,
> +};
> +
>  /**
>   * struct workqueue_attrs - A struct for workqueue attributes.
>   *
> @@ -140,12 +149,26 @@ struct workqueue_attrs {
>  	 */
>  	cpumask_var_t cpumask;
>  
> +	/*
> +	 * Below fields aren't properties of a worker_pool. They only modify how
> +	 * :c:func:`apply_workqueue_attrs` select pools and thus don't
> +	 * participate in pool hash calculations or equality comparisons.
> +	 */
> +
>  	/**
> -	 * @ordered: work items must be executed one by one in queueing order
> +	 * @affn_scope: unbound CPU affinity scope
>  	 *
> -	 * Unlike other fields, ``ordered`` isn't a property of a worker_pool. It
> -	 * only modifies how :c:func:`apply_workqueue_attrs` select pools and thus
> -	 * doesn't participate in pool hash calculations or equality comparisons.
> +	 * CPU pods are used to improve execution locality of unbound work
> +	 * items. There are multiple pod types, one for each wq_affn_scope, and
> +	 * every CPU in the system belongs to one pod in every pod type. CPUs
> +	 * that belong to the same pod share the worker pool. For example,
> +	 * selecting %WQ_AFFN_NUMA makes the workqueue use a separate worker
> +	 * pool for each NUMA node.
> +	 */
> +	enum wq_affn_scope affn_scope;
> +
> +	/**
> +	 * @ordered: work items must be executed one by one in queueing order
>  	 */
>  	bool ordered;
>  };
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index add6f5fc799b..dae1787833cb 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -325,7 +325,18 @@ struct workqueue_struct {
>  
>  static struct kmem_cache *pwq_cache;
>  
> -static cpumask_var_t *wq_pod_cpus;	/* possible CPUs of each node */
> +/*
> + * Each pod type describes how CPUs should be grouped for unbound workqueues.
> + * See the comment above workqueue_attrs->affn_scope.
> + */
> +struct wq_pod_type {
> +	int			nr_pods;	/* number of pods */
> +	cpumask_var_t		*pod_cpus;	/* pod -> cpus */
> +	int			*pod_node;	/* pod -> node */
> +	int			*cpu_pod;	/* cpu -> pod */
> +};
> +
> +static struct wq_pod_type wq_pod_types[WQ_AFFN_NR_TYPES];
>  
>  /*
>   * Per-cpu work items which run for longer than the following threshold are
> @@ -341,8 +352,6 @@ module_param_named(power_efficient, wq_power_efficient, bool, 0444);
>  
>  static bool wq_online;			/* can kworkers be created yet? */
>  
> -static bool wq_pod_enabled;		/* unbound CPU pod affinity enabled */
> -
>  /* buf for wq_update_unbound_pod_attrs(), protected by CPU hotplug exclusion */
>  static struct workqueue_attrs *wq_update_pod_attrs_buf;
>  
> @@ -1753,10 +1762,6 @@ static int select_numa_node_cpu(int node)
>  {
>  	int cpu;
>  
> -	/* No point in doing this if NUMA isn't enabled for workqueues */
> -	if (!wq_pod_enabled)
> -		return WORK_CPU_UNBOUND;
> -
>  	/* Delay binding to CPU if node is not valid or online */
>  	if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
>  		return WORK_CPU_UNBOUND;
> @@ -3639,6 +3644,7 @@ struct workqueue_attrs *alloc_workqueue_attrs(void)
>  		goto fail;
>  
>  	cpumask_copy(attrs->cpumask, cpu_possible_mask);
> +	attrs->affn_scope = WQ_AFFN_DFL;
>  	return attrs;
>  fail:
>  	free_workqueue_attrs(attrs);
> @@ -3650,11 +3656,13 @@ static void copy_workqueue_attrs(struct workqueue_attrs *to,
>  {
>  	to->nice = from->nice;
>  	cpumask_copy(to->cpumask, from->cpumask);
> +
>  	/*
> -	 * Unlike hash and equality test, this function doesn't ignore
> -	 * ->ordered as it is used for both pool and wq attrs.  Instead,
> -	 * get_unbound_pool() explicitly clears ->ordered after copying.
> +	 * Unlike hash and equality test, copying shouldn't ignore wq-only
> +	 * fields as copying is used for both pool and wq attrs. Instead,
> +	 * get_unbound_pool() explicitly clears the fields.
>  	 */
> +	to->affn_scope = from->affn_scope;
>  	to->ordered = from->ordered;
>  }
>  
> @@ -3680,6 +3688,24 @@ static bool wqattrs_equal(const struct workqueue_attrs *a,
>  	return true;
>  }
>  
> +/* find wq_pod_type to use for @attrs */
> +static const struct wq_pod_type *
> +wqattrs_pod_type(const struct workqueue_attrs *attrs)
> +{
> +	struct wq_pod_type *pt = &wq_pod_types[attrs->affn_scope];
> +
> +	if (likely(pt->nr_pods))
> +		return pt;
> +
> +	/*
> +	 * Before workqueue_init_topology(), only SYSTEM is available which is
> +	 * initialized in workqueue_init_early().
> +	 */
> +	pt = &wq_pod_types[WQ_AFFN_SYSTEM];
> +	BUG_ON(!pt->nr_pods);
> +	return pt;
> +}
> +
>  /**
>   * init_worker_pool - initialize a newly zalloc'd worker_pool
>   * @pool: worker_pool to initialize
> @@ -3880,10 +3906,10 @@ static void put_unbound_pool(struct worker_pool *pool)
>   */
>  static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>  {
> +	struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_NUMA];
>  	u32 hash = wqattrs_hash(attrs);
>  	struct worker_pool *pool;
> -	int pod;
> -	int target_pod = NUMA_NO_NODE;
> +	int pod, node = NUMA_NO_NODE;
>  
>  	lockdep_assert_held(&wq_pool_mutex);
>  
> @@ -3895,28 +3921,24 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
>  		}
>  	}
>  
> -	/* if cpumask is contained inside a pod, we belong to that pod */
> -	if (wq_pod_enabled) {
> -		for_each_node(pod) {
> -			if (cpumask_subset(attrs->cpumask, wq_pod_cpus[pod])) {
> -				target_pod = pod;
> -				break;
> -			}
> +	/* If cpumask is contained inside a NUMA pod, that's our NUMA node */
> +	for (pod = 0; pod < pt->nr_pods; pod++) {
> +		if (cpumask_subset(attrs->cpumask, pt->pod_cpus[pod])) {
> +			node = pt->pod_node[pod];
> +			break;
>  		}
>  	}
>  
>  	/* nope, create a new one */
> -	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, target_pod);
> +	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, node);
>  	if (!pool || init_worker_pool(pool) < 0)
>  		goto fail;
>  
>  	copy_workqueue_attrs(pool->attrs, attrs);
> -	pool->node = target_pod;
> +	pool->node = node;
>  
> -	/*
> -	 * ordered isn't a worker_pool attribute, always clear it.  See
> -	 * 'struct workqueue_attrs' comments for detail.
> -	 */
> +	/* clear wq-only attr fields. See 'struct workqueue_attrs' comments */
> +	pool->attrs->affn_scope = WQ_AFFN_NR_TYPES;

When booting into the kernel, I got the following NULL pointer
dereferencing error:

    [    4.280321] BUG: kernel NULL pointer dereference, address: 0000000000000004
    [    4.284172] #PF: supervisor read access in kernel mode
    [    4.284172] #PF: error_code(0x0000) - not-present page
    [    4.284172] PGD 0 P4D 0
    [    4.284172] Oops: 0000 [#1] PREEMPT SMP NOPTI
    [    4.284172] CPU: 1 PID: 21 Comm: cpuhp/1 Not tainted 6.4.0-rc1-tj-wq+ #448
    [    4.284172] Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    [    4.284172] RIP: 0010:wq_calc_pod_cpumask+0x5a/0x180
    [    4.284172] Code: 24 e0 08 94 96 4d 8d bc 24 e0 08 94 96 85 d2 0f 84 ec 00 00 00 49 8b 47 18 49 63 f5 48 8b 53 08 48 8b 7b 10 8b 0d 56 4b f0 01 <4c> 63 2c b0 49 8b 47 08 4a 8b 34 e8 e8 25 4f 63 00 48 8b 7b 10 8b
    [    4.284172] RSP: 0018:ffff9b548d20fd88 EFLAGS: 00010286
    [    4.284172] RAX: 0000000000000000 RBX: ffff8baec0048380 RCX: 0000000000000100
    [    4.284172] RDX: ffff8baec0159440 RSI: 0000000000000001 RDI: ffff8baec0159dc0
    [    4.284172] RBP: ffff9b548d20fdb0 R08: ffffffffffffffff R09: ffffffffffffffff
    [    4.284172] R10: ffffffffffffffff R11: ffffffffffffffff R12: 00000000000000a0
    [    4.284172] R13: 0000000000000001 R14: ffffffffffffffff R15: ffffffff96940980
    [    4.284172] FS:  0000000000000000(0000) GS:ffff8bed3d240000(0000) knlGS:0000000000000000
    [    4.284172] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [    4.284172] CR2: 0000000000000004 CR3: 000000530ae3c001 CR4: 0000000000770ee0
    [    4.284172] PKRU: 55555554
    [    4.284172] Call Trace:
    [    4.284172]  <TASK>
    [    4.284172]  wq_update_pod+0x89/0x1e0
    [    4.284172]  workqueue_online_cpu+0x1fc/0x250
    [    4.284172]  ? watchdog_nmi_enable+0x12/0x20
    [    4.284172]  ? __pfx_workqueue_online_cpu+0x10/0x10
    [    4.284172]  cpuhp_invoke_callback+0x165/0x4b0
    [    4.284172]  ? try_to_wake_up+0x279/0x690
    [    4.284172]  cpuhp_thread_fun+0xc4/0x1b0
    [    4.284172]  ? __pfx_smpboot_thread_fn+0x10/0x10
    [    4.284172]  smpboot_thread_fn+0xe7/0x1e0
    [    4.284172]  kthread+0xfb/0x130
    [    4.284172]  ? __pfx_kthread+0x10/0x10
    [    4.284172]  ret_from_fork+0x2c/0x50
    [    4.284172]  </TASK>
    [    4.284172] Modules linked in:
    [    4.284172] CR2: 0000000000000004
    [    4.284172] ---[ end trace 0000000000000000 ]---

I was basically hitting the following, seemingly impossible scenario in
wqattrs_pod_type():

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3825,8 +3825,10 @@ wqattrs_pod_type(const struct workqueue_attrs *attrs)
 {
        struct wq_pod_type *pt = &wq_pod_types[attrs->affn_scope];

-       if (likely(pt->nr_pods))
+       if (likely(pt->nr_pods)) {
+               BUG_ON(!pt->cpu_pod); /* No pods despite thinking we have? */
                return pt;
+       }

        /*
         * Before workqueue_init_topology(), only SYSTEM is available which is
--

Logging the value of "attrs->affn_scope" when hitting the scenario gave
me "5" which corresponds to "WQ_AFFN_NR_TYPES". The kernel was reading a
value beyond the wq_pod_types[] bounds.

This value for "affn_scope" is only set in the above hunk and I got the
kernel to boot by making the following change:

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4069,7 +4071,7 @@ static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
        pool->node = node;

        /* clear wq-only attr fields. See 'struct workqueue_attrs' comments */
-       pool->attrs->affn_scope = WQ_AFFN_NR_TYPES;
+       pool->attrs->affn_scope = wq_affn_dfl;
        pool->attrs->localize = false;
        pool->attrs->ordered = false;
--

Let me know if the above approach is correct since I'm not sure whether
the case for "affn_scope" being set to "WQ_AFFN_NR_TYPES" has a special
significance that is being handled elsewhere. Thank you :)

>  	pool->attrs->ordered = false;
>  
>  	if (worker_pool_assign_id(pool) < 0)
> [..snip..]
--
Thanks and Regards,
Prateek