linux-kernel - Re: [PATCH v7 2/4] kernfs: Change kernfs_notify

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <270b640d-d5e8-b775-9a16-5d5d07f959ff@samsung.com>
Date:   Fri, 1 Jul 2022 13:22:08 +0200
From:   Marek Szyprowski <m.szyprowski@...sung.com>
To:     Imran Khan <imran.f.khan@...cle.com>, tj@...nel.org,
        gregkh@...uxfoundation.org, viro@...iv.linux.org.uk
Cc:     linux-kernel@...r.kernel.org
Subject: Re: [PATCH v7 2/4] kernfs: Change kernfs_notify_list to llist.

Hi All,

On 15.06.2022 04:10, Imran Khan wrote:
> At present kernfs_notify_list is implemented as a singly linked
> list of kernfs_node(s), where last element points to itself and
> value of ->attr.next tells if node is present on the list or not.
> Both addition and deletion to list happen under kernfs_notify_lock.
>
> Change kernfs_notify_list to llist so that addition to list can heppen
> locklessly.
>
> Suggested by: Al Viro <viro@...iv.linux.org.uk>
> Signed-off-by: Imran Khan <imran.f.khan@...cle.com>
> Acked-by: Tejun Heo <tj@...nel.org>

This patch landed in linux next-20220630 as commit b8f35fa1188b 
("kernfs: Change kernfs_notify_list to llist."). Unfortunately, it 
causes serious regression on my test systems. It can be easily noticed 
in the logs by the following warning:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 34 at fs/kernfs/dir.c:531 kernfs_put.part.0+0x1a4/0x1d8
kernfs_put: console/active: released with incorrect active_ref 0
Modules linked in:
CPU: 1 PID: 34 Comm: kworker/1:4 Not tainted 
5.19.0-rc4-05465-g5732b42edfd1 #12317
Hardware name: Samsung Exynos (Flattened Device Tree)
Workqueue: events kernfs_notify_workfn
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x40/0x4c
  dump_stack_lvl from __warn+0xc8/0x13c
  __warn from warn_slowpath_fmt+0x90/0xb4
  warn_slowpath_fmt from kernfs_put.part.0+0x1a4/0x1d8
  kernfs_put.part.0 from kernfs_notify_workfn+0x1a0/0x1d0
  kernfs_notify_workfn from process_one_work+0x1ec/0x4cc
  process_one_work from worker_thread+0x58/0x54c
  worker_thread from kthread+0xd0/0xec
  kthread from ret_from_fork+0x14/0x2c
Exception stack(0xf099dfb0 to 0xf099dff8)
...
---[ end trace 0000000000000000 ]---

then, after some standard kernel messages, booting stops with the 
following log:

rcu: INFO: rcu_sched self-detected stall on CPU
rcu:     1-...!: (2099 ticks this GP) idle=403c/1/0x40000002 
softirq=33/33 fqs=1
  (t=2100 jiffies g=-1135 q=4 ncpus=2)
rcu: rcu_sched kthread starved for 2098 jiffies! g-1135 f0x0 
RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
rcu:     Unless rcu_sched kthread gets sufficient CPU time, OOM is now 
expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_sched       state:R  running task     stack:    0 pid: 10 
ppid:     2 flags:0x00000000
  __schedule from schedule+0x48/0xb0
  schedule from schedule_timeout+0x84/0x138
  schedule_timeout from rcu_gp_fqs_loop+0x124/0x478
  rcu_gp_fqs_loop from rcu_gp_kthread+0x150/0x1c8
  rcu_gp_kthread from kthread+0xd0/0xec
  kthread from ret_from_fork+0x14/0x2c
Exception stack(0xf088dfb0 to 0xf088dff8)
...
rcu: Stack dump where RCU GP kthread last ran:
NMI backtrace for cpu 1
CPU: 1 PID: 34 Comm: kworker/1:4 Tainted: G        W 
5.19.0-rc4-05465-g5732b42edfd1 #12317
Hardware name: Samsung Exynos (Flattened Device Tree)
Workqueue: events kernfs_notify_workfn
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x40/0x4c
  dump_stack_lvl from nmi_cpu_backtrace+0xd0/0x104
  nmi_cpu_backtrace from nmi_trigger_cpumask_backtrace+0xd8/0x120
  nmi_trigger_cpumask_backtrace from 
rcu_check_gp_kthread_starvation+0x138/0x154
  rcu_check_gp_kthread_starvation from rcu_sched_clock_irq+0x664/0x9f4
  rcu_sched_clock_irq from update_process_times+0x6c/0x94
  update_process_times from tick_sched_timer+0x60/0xe8
  tick_sched_timer from __hrtimer_run_queues+0x1b4/0x328
  __hrtimer_run_queues from hrtimer_interrupt+0x124/0x2b0
  hrtimer_interrupt from exynos4_mct_tick_isr+0x44/0x7c
  exynos4_mct_tick_isr from __handle_irq_event_percpu+0x7c/0x1cc
  __handle_irq_event_percpu from handle_irq_event+0x44/0x8c
  handle_irq_event from handle_fasteoi_irq+0x98/0x18c
  handle_fasteoi_irq from generic_handle_domain_irq+0x24/0x34
  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
  generic_handle_arch_irq from call_with_stack+0x18/0x20
  call_with_stack from __irq_svc+0x98/0xb0
Exception stack(0xf099de78 to 0xf099dec0)
...
  __irq_svc from up_write+0x4/0x3c
  up_write from 0xef7cbb00


It was not easy to find this, because bisecting points to commit 
5732b42edfd1 ("Merge branch 'driver-core-next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git"). 
This means that there are some non-trivial dependencies between both 
merged branches. I've rebased 5732b42edfd1^2 onto 5732b42edfd1^1 and 
then I've did the bisecting of the rebased commits. Finally I've found 
this one. Then I've confirmed that reverting it on top of 5732b42edfd1 
and then next-20220630 really fixes the issue.

Let me know how I can help fixing this issue. I've observed this issue 
on ARM 32bit kernel compiled from multi_v7_defconfig and following 
boards: Raspberry Pi 3b and 4b as well as Samsung Exynos based Odroid U3 
and XU4. This issue doesn't happen on QEMU's ARM32bit 'virt' machine.


> ---
>   fs/kernfs/file.c       | 47 ++++++++++++++++++------------------------
>   include/linux/kernfs.h |  2 +-
>   2 files changed, 21 insertions(+), 28 deletions(-)
>
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 706aebcfb8f69..dce654ad2cea9 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -38,18 +38,16 @@ struct kernfs_open_node {
>   	struct list_head	files; /* goes through kernfs_open_file.list */
>   };
>   
> -/*
> - * kernfs_notify() may be called from any context and bounces notifications
> - * through a work item.  To minimize space overhead in kernfs_node, the
> - * pending queue is implemented as a singly linked list of kernfs_nodes.
> - * The list is terminated with the self pointer so that whether a
> - * kernfs_node is on the list or not can be determined by testing the next
> - * pointer for NULL.
> +/**
> + * attribute_to_node - get kernfs_node object corresponding to a kernfs attribute
> + * @ptr:	&struct kernfs_elem_attr
> + * @type:	struct kernfs_node
> + * @member:	name of member (i.e attr)
>    */
> -#define KERNFS_NOTIFY_EOL			((void *)&kernfs_notify_list)
> +#define attribute_to_node(ptr, type, member)	\
> +	container_of(ptr, type, member)
>   
> -static DEFINE_SPINLOCK(kernfs_notify_lock);
> -static struct kernfs_node *kernfs_notify_list = KERNFS_NOTIFY_EOL;
> +static LLIST_HEAD(kernfs_notify_list);
>   
>   /**
>    * kernfs_deref_open_node - Get kernfs_open_node corresponding to @kn.
> @@ -903,18 +901,16 @@ static void kernfs_notify_workfn(struct work_struct *work)
>   	struct kernfs_node *kn;
>   	struct kernfs_super_info *info;
>   	struct kernfs_root *root;
> +	struct llist_node *free;
> +	struct kernfs_elem_attr *attr;
>   repeat:
>   	/* pop one off the notify_list */
> -	spin_lock_irq(&kernfs_notify_lock);
> -	kn = kernfs_notify_list;
> -	if (kn == KERNFS_NOTIFY_EOL) {
> -		spin_unlock_irq(&kernfs_notify_lock);
> +	free = llist_del_first(&kernfs_notify_list);
> +	if (free == NULL)
>   		return;
> -	}
> -	kernfs_notify_list = kn->attr.notify_next;
> -	kn->attr.notify_next = NULL;
> -	spin_unlock_irq(&kernfs_notify_lock);
>   
> +	attr = llist_entry(free, struct kernfs_elem_attr, notify_next);
> +	kn = attribute_to_node(attr, struct kernfs_node, attr);
>   	root = kernfs_root(kn);
>   	/* kick fsnotify */
>   	down_write(&root->kernfs_rwsem);
> @@ -970,12 +966,14 @@ static void kernfs_notify_workfn(struct work_struct *work)
>   void kernfs_notify(struct kernfs_node *kn)
>   {
>   	static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);
> -	unsigned long flags;
>   	struct kernfs_open_node *on;
>   
>   	if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
>   		return;
>   
> +	/* Because we are using llist for kernfs_notify_list */
> +	WARN_ON_ONCE(in_nmi());
> +
>   	/* kick poll immediately */
>   	rcu_read_lock();
>   	on = rcu_dereference(kn->attr.open);
> @@ -986,14 +984,9 @@ void kernfs_notify(struct kernfs_node *kn)
>   	rcu_read_unlock();
>   
>   	/* schedule work to kick fsnotify */
> -	spin_lock_irqsave(&kernfs_notify_lock, flags);
> -	if (!kn->attr.notify_next) {
> -		kernfs_get(kn);
> -		kn->attr.notify_next = kernfs_notify_list;
> -		kernfs_notify_list = kn;
> -		schedule_work(&kernfs_notify_work);
> -	}
> -	spin_unlock_irqrestore(&kernfs_notify_lock, flags);
> +	kernfs_get(kn);
> +	llist_add(&kn->attr.notify_next, &kernfs_notify_list);
> +	schedule_work(&kernfs_notify_work);
>   }
>   EXPORT_SYMBOL_GPL(kernfs_notify);
>   
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 13f54f078a52a..2dd9c8df0f4f6 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -116,7 +116,7 @@ struct kernfs_elem_attr {
>   	const struct kernfs_ops	*ops;
>   	struct kernfs_open_node __rcu	*open;
>   	loff_t			size;
> -	struct kernfs_node	*notify_next;	/* for kernfs_notify() */
> +	struct llist_node	notify_next;	/* for kernfs_notify() */
>   };
>   
>   /*

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland