Message-ID: <CD6925E8781EFD4D8E11882D20FC406D52A198C7@SHSMSX104.ccr.corp.intel.com>
Date:   Fri, 14 Dec 2018 01:30:04 +0000
From:   "He, Bo" <bo.he@...el.com>
To:     "paulmck@...ux.ibm.com" <paulmck@...ux.ibm.com>
CC:     "Zhang, Jun" <jun.zhang@...el.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "josh@...htriplett.org" <josh@...htriplett.org>,
        "mathieu.desnoyers@...icios.com" <mathieu.desnoyers@...icios.com>,
        "jiangshanlai@...il.com" <jiangshanlai@...il.com>,
        "Xiao, Jin" <jin.xiao@...el.com>,
        "Zhang, Yanmin" <yanmin.zhang@...el.com>,
        "Bai, Jie A" <jie.a.bai@...el.com>,
        "Sun, Yi J" <yi.j.sun@...el.com>
Subject: RE: rcu_preempt caused oom

As you mentioned CONFIG_FAST_NO_HZ, do you mean CONFIG_RCU_FAST_NO_HZ? I double-checked: there is no FAST_NO_HZ option in .config.

Here is the grep from .config:
egrep "HZ|RCU" .config
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
# RCU Subsystem
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
# CONFIG_MACHZ_WDT is not set
# RCU Debugging
CONFIG_PROVE_RCU=y
CONFIG_RCU_PERF_TEST=m
CONFIG_RCU_TORTURE_TEST=m
CONFIG_RCU_CPU_STALL_TIMEOUT=7
CONFIG_RCU_TRACE=y
CONFIG_RCU_EQS_DEBUG=y

-----Original Message-----
From: Paul E. McKenney <paulmck@...ux.ibm.com> 
Sent: Friday, December 14, 2018 2:12 AM
To: He, Bo <bo.he@...el.com>
Cc: Zhang, Jun <jun.zhang@...el.com>; Steven Rostedt <rostedt@...dmis.org>; linux-kernel@...r.kernel.org; josh@...htriplett.org; mathieu.desnoyers@...icios.com; jiangshanlai@...il.com; Xiao, Jin <jin.xiao@...el.com>; Zhang, Yanmin <yanmin.zhang@...el.com>; Bai, Jie A <jie.a.bai@...el.com>; Sun, Yi J <yi.j.sun@...el.com>
Subject: Re: rcu_preempt caused oom

On Thu, Dec 13, 2018 at 03:26:08PM +0000, He, Bo wrote:
> one of the boards reproduced the issue with show_rcu_gp_kthreads(); I have also enclosed the logs as an attachment.
> 
> [17818.936032] rcu: rcu_preempt: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258 ->gp_wake_seq 21808189 ->gp_seq 21808192 ->gp_seq_needed 21808196 ->gp_flags 0x1

This is quite helpful, thank you!

The "RCU lockdep checking is enabled" says that CONFIG_PROVE_RCU=y, which is good.  The "RCU_GP_WAIT_GPS(1)" means that the rcu_preempt task is waiting for a new grace-period request.  The "->state: 0x402" means that it is sleeping, neither running nor in the process of waking up.
The "delta ->gp_activity 308257 ->gp_req_activity 308256 ->gp_wake_time 308258" means that it has been more than 300,000 jiffies since the rcu_preempt task did anything or was requested to do anything.

The "->gp_wake_seq 21808189 ->gp_seq 21808192" says that the last attempt to awaken the rcu_preempt task happened during the last grace period.
The "->gp_seq_needed 21808196 ->gp_flags 0x1" nevertheless says that someone requested a new grace period.  So if the rcu_preempt task were to wake up, it would process the new grace period.  Note again also the ->gp_req_activity 308256, which indicates that ->gp_flags was set more than 300,000 jiffies ago, just after the last recorded activity of the rcu_preempt task.
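For anyone decoding the counters above by hand: in v4.19, these ->gp_seq-style values carry the grace-period phase in their low two bits and the grace-period count in the remaining bits (RCU_SEQ_CTR_SHIFT and friends in kernel/rcu/rcu.h).  A small sketch, with helper names mirroring the kernel's rcu_seq_ctr()/rcu_seq_state() and the values taken from the log above:

```shell
# Split an RCU sequence number into its grace-period count and phase,
# mirroring rcu_seq_ctr()/rcu_seq_state() from kernel/rcu/rcu.h.
rcu_seq_ctr()   { echo $(( $1 >> 2 )); }  # grace-period count
rcu_seq_state() { echo $(( $1 & 3 )); }   # 0 = idle, nonzero = GP in progress

for s in 21808189 21808192 21808196; do   # gp_wake_seq, gp_seq, gp_seq_needed
	printf 's=%s  ctr=%s  state=%s\n' "$s" "$(rcu_seq_ctr "$s")" "$(rcu_seq_state "$s")"
done
```

This prints ctr 5452047/state 1 for ->gp_wake_seq, ctr 5452048/state 0 for ->gp_seq, and ctr 5452049/state 0 for ->gp_seq_needed: the last wakeup attempt was during the previous grace period, and the requested grace period is one beyond the current one, matching the reading above.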

But this is exactly the situation that rcu_check_gp_start_stall() is designed to warn about (and does warn about for me when I comment out the wakeup code).  So why is rcu_check_gp_start_stall() not being called?  Here are a couple of possibilities:

1.	Because rcu_check_gp_start_stall() is only ever invoked from
	RCU_SOFTIRQ, it is possible that softirqs are stalled for
	whatever reason.

2.	Because RCU_SOFTIRQ is invoked primarily from the scheduler-clock
	interrupt handler, it is possible that the scheduler tick has
	somehow been disabled.  Traces from earlier runs showed a great
	deal of RCU callbacks queued, which would have caused RCU to
	refuse to allow the scheduler tick to be disabled, even if the
	corresponding CPU was idle.

3.	You have CONFIG_FAST_NO_HZ=y (which you probably do, given
	that you are building for a battery-powered device) and all of the
	CPU's callbacks are lazy.  Except that your earlier traces showed
	lots of non-lazy callbacks.  Besides, even if all callbacks were
	lazy, there would still be a scheduling-clock interrupt every
	six seconds, and there are quite a few six-second intervals
	in a two-minute watchdog timeout.

	But if we cannot find the problem quickly, I will likely ask
	you to try reproducing with CONFIG_FAST_NO_HZ=n.  This could
	be thought of as bisecting the RCU code looking for the bug.

The first two of these seem unlikely given that the watchdog timer was still firing.  Still, I don't see how 300,000 jiffies elapsed with a grace period requested and not started otherwise.  Could you please check?
One way to do so would be to enable ftrace on rcu_check_callbacks(), __rcu_process_callbacks(), and rcu_check_gp_start_stall().  It might be necessary to no-inline rcu_check_gp_start_stall().  You might have better ways to collect this information.
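In case it helps, here is one hedged sketch of that ftrace setup (the tracefs path, the 10-second capture window, and the assumption that these functions appear in available_filter_functions are all mine; adjust to your setup):

```shell
# Function-trace the three RCU entry points named above via tracefs.
# Requires root on the target; inlined functions will not be traceable.
TRACEFS=/sys/kernel/debug/tracing

if [ -w "$TRACEFS/set_ftrace_filter" ]; then
	echo 'rcu_check_callbacks __rcu_process_callbacks rcu_check_gp_start_stall' \
		> "$TRACEFS/set_ftrace_filter"
	echo function > "$TRACEFS/current_tracer"
	echo 1 > "$TRACEFS/tracing_on"
	sleep 10                        # capture window while reproducing
	echo 0 > "$TRACEFS/tracing_on"
	head -n 50 "$TRACEFS/trace"
else
	echo "tracefs not writable here; run on the target as root" >&2
fi
```

If rcu_check_gp_start_stall() does not show up in the filter, that is a hint it was inlined, in which case building it noinline as suggested above should make it visible.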

Without this information, the only workaround patch I can give you will degrade battery lifetime, which might not be what you want.

You do have a lockdep complaint early at boot.  Although I don't immediately see how this self-deadlock would affect RCU, please do get it fixed.  Sometimes the consequences of this sort of deadlock can propagate to unexpected places.

Regardless of why rcu_check_gp_start_stall() failed to complain, it looks like this was set after the rcu_preempt task slept for the last time, and so there should have been a wakeup the last time that ->gp_flags was set.  Perhaps there is some code path that drops the wakeup.
I did check this in current -rcu, but you are instead running v4.19, so I should also check there.

The ->gp_flags has its RCU_GP_FLAG_INIT bit set in rcu_start_this_gp() and in rcu_gp_cleanup().  We can eliminate rcu_gp_cleanup() from consideration because only the rcu_preempt task will execute that code, and we know that this task was asleep at the last time this bit was set.
Now rcu_start_this_gp() returns a flag indicating whether or not a wakeup is needed, and the caller must do the wakeup once it is safe to do so, that is, after the various rcu_node locks have been released (doing a wakeup while holding any of those locks results in deadlock).

The following functions invoke rcu_start_this_gp(): rcu_accelerate_cbs() and rcu_nocb_wait_gp().  We can eliminate rcu_nocb_wait_gp() because you are building with CONFIG_RCU_NOCB_CPU=n.  Then rcu_accelerate_cbs() is invoked from:

o	rcu_accelerate_cbs_unlocked(), which does the following, thus
	properly awakening the rcu_preempt task when needed:

	needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
	raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
	if (needwake)
		rcu_gp_kthread_wake(rsp);

o	rcu_advance_cbs(), which returns the value returned by
	rcu_accelerate_cbs(), thus pushing the problem off to its
	callers, which are called out below.

o	__note_gp_changes(), which also returns the value returned by
	rcu_accelerate_cbs(), thus pushing the problem off to its callers,
	which are called out below.

o	rcu_gp_cleanup(), which is only ever invoked by RCU grace-period
	kthreads such as the rcu_preempt task.	Therefore, this function
	never needs to awaken the rcu_preempt task, because the fact
	that this function is executing means that this task is already
	awake.	(Also, as noted above, we can eliminate this code from
	consideration because this task is known to have been sleeping
	at the last time that the RCU_GP_FLAG_INIT bit was set.)

o	rcu_report_qs_rdp(), which does the following, thus properly
	awakening the rcu_preempt task when needed:

		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);

		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
		/* ^^^ Released rnp->lock */
		if (needwake)
			rcu_gp_kthread_wake(rsp);

o	rcu_prepare_for_idle(), which does the following, thus properly
	awakening the rcu_preempt task when needed:

		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
		raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
		if (needwake)
			rcu_gp_kthread_wake(rsp);

Now for rcu_advance_cbs():

o	__note_gp_changes(), which also returns the value returned
	by rcu_advance_cbs(), thus pushing the problem off to its callers,
	which are called out below.

o	rcu_migrate_callbacks(), which does the following, thus properly
	awakening the rcu_preempt task when needed:

	needwake = rcu_advance_cbs(rsp, rnp_root, rdp) ||
		   rcu_advance_cbs(rsp, rnp_root, my_rdp);
	rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist);
	WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) !=
		     !rcu_segcblist_n_cbs(&my_rdp->cblist));
	raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags);
	if (needwake)
		rcu_gp_kthread_wake(rsp);

Now for __note_gp_changes():

o	note_gp_changes(), which does the following, thus properly
	awakening the rcu_preempt task when needed:

	needwake = __note_gp_changes(rsp, rnp, rdp);
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	if (needwake)
		rcu_gp_kthread_wake(rsp);

o	rcu_gp_init() which is only ever invoked by RCU grace-period
	kthreads such as the rcu_preempt task, which makes wakeups
	unnecessary, just as for rcu_gp_cleanup() above.

o	rcu_gp_cleanup(), ditto.

So I am not seeing how I am losing a wakeup, but please do feel free to double-check my analysis.  One way to do that is using event tracing.

							Thanx, Paul

------------------------------------------------------------------------
lockdep complaint:
------------------------------------------------------------------------

[    2.895507] ======================================================
[    2.895511] WARNING: possible circular locking dependency detected
[    2.895517] 4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1 Tainted: G     U           
[    2.895521] ------------------------------------------------------
[    2.895525] earlyEvs/1839 is trying to acquire lock:
[    2.895530] 00000000ff344115 (&asd->mutex){+.+.}, at: ipu_isys_subdev_get_ffmt+0x32/0x90
[    2.895546] 
[    2.895546] but task is already holding lock:
[    2.895550] 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
[    2.895561] 
[    2.895561] which lock already depends on the new lock.
[    2.895561] 
[    2.895566] 
[    2.895566] the existing dependency chain (in reverse order) is:
[    2.895570] 
[    2.895570] -> #1 (&mdev->graph_mutex){+.+.}:
[    2.895583]        __mutex_lock+0x80/0x9a0
[    2.895588]        mutex_lock_nested+0x1b/0x20
[    2.895593]        media_device_register_entity+0x92/0x1e0
[    2.895598]        v4l2_device_register_subdev+0xc2/0x1b0
[    2.895604]        ipu_isys_csi2_init+0x22c/0x520
[    2.895608]        isys_probe+0x6cb/0xed0
[    2.895613]        ipu_bus_probe+0xfd/0x2e0
[    2.895620]        really_probe+0x268/0x3d0
[    2.895625]        driver_probe_device+0x11a/0x130
[    2.895630]        __device_attach_driver+0x86/0x100
[    2.895635]        bus_for_each_drv+0x6e/0xb0
[    2.895640]        __device_attach+0xdf/0x160
[    2.895645]        device_initial_probe+0x13/0x20
[    2.895650]        bus_probe_device+0xa6/0xc0
[    2.895655]        deferred_probe_work_func+0x88/0xe0
[    2.895661]        process_one_work+0x220/0x5c0
[    2.895665]        worker_thread+0x1da/0x3b0
[    2.895670]        kthread+0x12c/0x150
[    2.895675]        ret_from_fork+0x3a/0x50
[    2.895678] 
[    2.895678] -> #0 (&asd->mutex){+.+.}:
[    2.895688]        lock_acquire+0x95/0x1a0
[    2.895693]        __mutex_lock+0x80/0x9a0
[    2.895698]        mutex_lock_nested+0x1b/0x20
[    2.895703]        ipu_isys_subdev_get_ffmt+0x32/0x90
[    2.895708]        ipu_isys_csi2_get_fmt+0x14/0x30
[    2.895713]        v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
[    2.895718]        v4l2_subdev_link_validate_one+0x67/0x120
[    2.895723]        v4l2_subdev_link_validate+0x246/0x490
[    2.895728]        csi2_link_validate+0xc6/0x220
[    2.895733]        __media_pipeline_start+0x15b/0x2f0
[    2.895738]        media_pipeline_start+0x33/0x50
[    2.895743]        ipu_isys_video_prepare_streaming+0x1e0/0x610
[    2.895748]        start_streaming+0x186/0x3a0
[    2.895753]        vb2_start_streaming+0x6d/0x130
[    2.895758]        vb2_core_streamon+0x108/0x140
[    2.895762]        vb2_streamon+0x29/0x50
[    2.895767]        vb2_ioctl_streamon+0x42/0x50
[    2.895772]        v4l_streamon+0x20/0x30
[    2.895776]        __video_do_ioctl+0x1af/0x3c0
[    2.895781]        video_usercopy+0x27e/0x7e0
[    2.895785]        video_ioctl2+0x15/0x20
[    2.895789]        v4l2_ioctl+0x49/0x50
[    2.895794]        do_video_ioctl+0x93c/0x2360
[    2.895799]        v4l2_compat_ioctl32+0x93/0xe0
[    2.895806]        __ia32_compat_sys_ioctl+0x73a/0x1c90
[    2.895813]        do_fast_syscall_32+0x9a/0x2d6
[    2.895818]        entry_SYSENTER_compat+0x6d/0x7c
[    2.895821] 
[    2.895821] other info that might help us debug this:
[    2.895821] 
[    2.895826]  Possible unsafe locking scenario:
[    2.895826] 
[    2.895830]        CPU0                    CPU1
[    2.895833]        ----                    ----
[    2.895836]   lock(&mdev->graph_mutex);
[    2.895842]                                lock(&asd->mutex);
[    2.895847]                                lock(&mdev->graph_mutex);
[    2.895852]   lock(&asd->mutex);
[    2.895857] 
[    2.895857]  *** DEADLOCK ***
[    2.895857] 
[    2.895863] 3 locks held by earlyEvs/1839:
[    2.895866]  #0: 00000000ed860090 (&av->mutex){+.+.}, at: __video_do_ioctl+0xbf/0x3c0
[    2.895876]  #1: 000000000cb253e7 (&isys->stream_mutex){+.+.}, at: start_streaming+0x5c/0x3a0
[    2.895886]  #2: 0000000069562e72 (&mdev->graph_mutex){+.+.}, at: media_pipeline_start+0x28/0x50
[    2.895896] 
[    2.895896] stack backtrace:
[    2.895903] CPU: 0 PID: 1839 Comm: earlyEvs Tainted: G     U            4.19.5-quilt-2e5dc0ac-g4d59bbd0fd1a #1
[    2.895907] Call Trace:
[    2.895915]  dump_stack+0x70/0xa5
[    2.895921]  print_circular_bug.isra.35+0x1d8/0x1e6
[    2.895927]  __lock_acquire+0x1284/0x1340
[    2.895931]  ? __lock_acquire+0x2b5/0x1340
[    2.895940]  lock_acquire+0x95/0x1a0
[    2.895945]  ? lock_acquire+0x95/0x1a0
[    2.895950]  ? ipu_isys_subdev_get_ffmt+0x32/0x90
[    2.895956]  ? ipu_isys_subdev_get_ffmt+0x32/0x90
[    2.895961]  __mutex_lock+0x80/0x9a0
[    2.895966]  ? ipu_isys_subdev_get_ffmt+0x32/0x90
[    2.895971]  ? crlmodule_get_format+0x43/0x50
[    2.895979]  mutex_lock_nested+0x1b/0x20
[    2.895984]  ? mutex_lock_nested+0x1b/0x20
[    2.895989]  ipu_isys_subdev_get_ffmt+0x32/0x90
[    2.895995]  ipu_isys_csi2_get_fmt+0x14/0x30
[    2.896001]  v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
[    2.896006]  v4l2_subdev_link_validate_one+0x67/0x120
[    2.896011]  ? crlmodule_get_format+0x2a/0x50
[    2.896018]  ? find_held_lock+0x35/0xa0
[    2.896023]  ? crlmodule_get_format+0x43/0x50
[    2.896030]  v4l2_subdev_link_validate+0x246/0x490
[    2.896035]  ? __mutex_unlock_slowpath+0x58/0x2f0
[    2.896042]  ? mutex_unlock+0x12/0x20
[    2.896046]  ? crlmodule_get_format+0x43/0x50
[    2.896052]  ? v4l2_subdev_link_validate_get_format.isra.6+0x52/0x80
[    2.896057]  ? v4l2_subdev_link_validate_one+0x67/0x120
[    2.896065]  ? __is_insn_slot_addr+0xad/0x120
[    2.896070]  ? kernel_text_address+0xc4/0x100
[    2.896078]  ? v4l2_subdev_link_validate+0x246/0x490
[    2.896085]  ? kernel_text_address+0xc4/0x100
[    2.896092]  ? __lock_acquire+0x1106/0x1340
[    2.896096]  ? __lock_acquire+0x1169/0x1340
[    2.896103]  csi2_link_validate+0xc6/0x220
[    2.896110]  ? __lock_is_held+0x5a/0xa0
[    2.896115]  ? mark_held_locks+0x58/0x80
[    2.896122]  ? __kmalloc+0x207/0x2e0
[    2.896127]  ? __lock_is_held+0x5a/0xa0
[    2.896134]  ? rcu_read_lock_sched_held+0x81/0x90
[    2.896139]  ? __kmalloc+0x2a3/0x2e0
[    2.896144]  ? media_pipeline_start+0x28/0x50
[    2.896150]  ? __media_entity_enum_init+0x33/0x70
[    2.896155]  ? csi2_has_route+0x18/0x20
[    2.896160]  ? media_graph_walk_next.part.9+0xac/0x290
[    2.896166]  __media_pipeline_start+0x15b/0x2f0
[    2.896173]  ? rcu_read_lock_sched_held+0x81/0x90
[    2.896179]  media_pipeline_start+0x33/0x50
[    2.896186]  ipu_isys_video_prepare_streaming+0x1e0/0x610
[    2.896191]  ? __lock_acquire+0x132e/0x1340
[    2.896198]  ? __lock_acquire+0x2b5/0x1340
[    2.896204]  ? lock_acquire+0x95/0x1a0
[    2.896209]  ? start_streaming+0x5c/0x3a0
[    2.896215]  ? start_streaming+0x5c/0x3a0
[    2.896221]  ? __mutex_lock+0x391/0x9a0
[    2.896226]  ? v4l_enable_media_source+0x2d/0x70
[    2.896233]  ? find_held_lock+0x35/0xa0
[    2.896238]  ? v4l_enable_media_source+0x57/0x70
[    2.896245]  start_streaming+0x186/0x3a0
[    2.896250]  ? __mutex_unlock_slowpath+0x58/0x2f0
[    2.896257]  vb2_start_streaming+0x6d/0x130
[    2.896262]  ? vb2_start_streaming+0x6d/0x130
[    2.896267]  vb2_core_streamon+0x108/0x140
[    2.896273]  vb2_streamon+0x29/0x50
[    2.896278]  vb2_ioctl_streamon+0x42/0x50
[    2.896284]  v4l_streamon+0x20/0x30
[    2.896288]  __video_do_ioctl+0x1af/0x3c0
[    2.896296]  ? __might_fault+0x85/0x90
[    2.896302]  video_usercopy+0x27e/0x7e0
[    2.896307]  ? copy_overflow+0x20/0x20
[    2.896313]  ? find_held_lock+0x35/0xa0
[    2.896319]  ? __might_fault+0x3e/0x90
[    2.896325]  video_ioctl2+0x15/0x20
[    2.896330]  v4l2_ioctl+0x49/0x50
[    2.896335]  do_video_ioctl+0x93c/0x2360
[    2.896343]  v4l2_compat_ioctl32+0x93/0xe0
[    2.896349]  __ia32_compat_sys_ioctl+0x73a/0x1c90
[    2.896354]  ? lockdep_hardirqs_on+0xef/0x180
[    2.896359]  ? do_fast_syscall_32+0x3b/0x2d6
[    2.896364]  do_fast_syscall_32+0x9a/0x2d6
[    2.896370]  entry_SYSENTER_compat+0x6d/0x7c
[    2.896377] RIP: 0023:0xf7e79b79
[    2.896382] Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 0c 24 c3 8b 1c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
[    2.896387] RSP: 002b:00000000f76816bc EFLAGS: 00000292 ORIG_RAX: 0000000000000036
[    2.896393] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 0000000040045612
[    2.896396] RDX: 00000000f768172c RSI: 00000000f7d42d9c RDI: 00000000f768172c
[    2.896400] RBP: 00000000f7681708 R08: 0000000000000000 R09: 0000000000000000
[    2.896404] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    2.896408] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

------------------------------------------------------------------------

> [17818.936039] rcu:     rcu_node 0:3 ->gp_seq 21808192 ->gp_seq_needed 21808196
> [17818.936048] rcu: rcu_sched: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 101730 ->gp_req_activity 101732 ->gp_wake_time 101730 ->gp_wake_seq 1357 ->gp_seq 1360 ->gp_seq_needed 1360 ->gp_flags 0x0
> [17818.936056] rcu: rcu_bh: wait state: RCU_GP_WAIT_GPS(1) ->state: 0x402 delta ->gp_activity 4312486108 ->gp_req_activity 4312486108 ->gp_wake_time 4312486108 ->gp_wake_seq 0 ->gp_seq -1200 ->gp_seq_needed -1200 ->gp_flags 0x0
> 
> -----Original Message-----
> From: Paul E. McKenney <paulmck@...ux.ibm.com>
> Sent: Thursday, December 13, 2018 12:40 PM
> To: Zhang, Jun <jun.zhang@...el.com>
> Cc: He, Bo <bo.he@...el.com>; Steven Rostedt <rostedt@...dmis.org>; 
> linux-kernel@...r.kernel.org; josh@...htriplett.org; 
> mathieu.desnoyers@...icios.com; jiangshanlai@...il.com; Xiao, Jin 
> <jin.xiao@...el.com>; Zhang, Yanmin <yanmin.zhang@...el.com>; Bai, Jie 
> A <jie.a.bai@...el.com>; Sun, Yi J <yi.j.sun@...el.com>
> Subject: Re: rcu_preempt caused oom
> 
> On Thu, Dec 13, 2018 at 03:28:46AM +0000, Zhang, Jun wrote:
> > Ok, we will test it, thanks!
> 
> But please also try the sysrq-y with the earlier patch after a hang!
> 
> 							Thanx, Paul
> 
> > -----Original Message-----
> > From: Paul E. McKenney [mailto:paulmck@...ux.ibm.com]
> > Sent: Thursday, December 13, 2018 10:43
> > To: Zhang, Jun <jun.zhang@...el.com>
> > Cc: He, Bo <bo.he@...el.com>; Steven Rostedt <rostedt@...dmis.org>; 
> > linux-kernel@...r.kernel.org; josh@...htriplett.org; 
> > mathieu.desnoyers@...icios.com; jiangshanlai@...il.com; Xiao, Jin 
> > <jin.xiao@...el.com>; Zhang, Yanmin <yanmin.zhang@...el.com>; Bai, 
> > Jie A <jie.a.bai@...el.com>; Sun, Yi J <yi.j.sun@...el.com>
> > Subject: Re: rcu_preempt caused oom
> > 
> > On Thu, Dec 13, 2018 at 02:11:35AM +0000, Zhang, Jun wrote:
> > > Hello, Paul
> > > 
> > > I think the next patch is better.
> > > ULONG_CMP_GE could cause a double write, which risks writing back an old value.
> > > Please help review.
> > > I have not tested it.  If you agree, we will test it.
> > 
> > Just to make sure that I understand, you are worried about something like the following, correct?
> > 
> > o	__note_gp_changes() compares rnp->gp_seq_needed and rdp->gp_seq_needed
> > 	and finds them equal.
> > 
> > o	At just this time something like rcu_start_this_gp() assigns a new
> > 	(larger) value to rdp->gp_seq_needed.
> > 
> > o	Then __note_gp_changes() overwrites rdp->gp_seq_needed with the
> > 	old value.
> > 
> > This cannot happen because __note_gp_changes() runs with interrupts disabled on the CPU corresponding to the rcu_data structure referenced by the rdp pointer.  So there is no way for rcu_start_this_gp() to be invoked on the same CPU during this "if" statement.
> > 
> > Of course, there could be bugs.  For example:
> > 
> > o	__note_gp_changes() might be called on a different CPU than that
> > 	corresponding to rdp.  You can check this with something like:
> > 
> > 	WARN_ON_ONCE(rdp->cpu != smp_processor_id());
> > 
> > o	The same things could happen with rcu_start_this_gp(), and the
> > 	above WARN_ON_ONCE() would work there as well.
> > 
> > o	rcutree_prepare_cpu() is a special case, but is irrelevant unless
> > 	you are doing CPU-hotplug operations.  (It can run on a CPU other
> > 	than rdp->cpu, but only at times when rdp->cpu is offline.)
> > 
> > o	Interrupts might not really be disabled.
> > 
> > That said, your patch could reduce overhead slightly, given that the two values will be equal much of the time.  So it might be worth testing just for that reason.
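For concreteness, the difference between the two comparisons can be modeled in 64-bit shell arithmetic (a sketch: the kernel macros are ULONG_CMP_GE(a, b) := ULONG_MAX/2 >= (a) - (b) and ULONG_CMP_LT(a, b) := ULONG_MAX/2 < (a) - (b), which for unsigned longs is equivalent to testing the sign of the two's-complement difference):

```shell
# Model ULONG_CMP_GE()/ULONG_CMP_LT(): shell arithmetic is 64-bit
# two's complement, so the sign of (a - b) matches the unsigned
# "difference <= ULONG_MAX/2" test the kernel macros perform.
ulong_cmp_ge() { [ $(( $1 - $2 )) -ge 0 ] && echo 1 || echo 0; }
ulong_cmp_lt() { [ $(( $1 - $2 )) -lt 0 ] && echo 1 || echo 0; }

a=21808196
ulong_cmp_ge "$a" "$a"   # 1: _GE stores even when the two values are equal
ulong_cmp_lt "$a" "$a"   # 0: _LT skips the redundant store

old=-5   # bit pattern of ULONG_MAX - 4
new=15   # a counter that has wrapped past ULONG_MAX
ulong_cmp_lt "$old" "$new"   # 1: still ordered correctly across the wrap
```

So with _GE the assignment to rdp->gp_seq_needed also fires when the two values are equal, a redundant store that is harmless while interrupts are disabled; with _LT it fires only when rdp is strictly behind rnp, and both orderings survive counter wrap.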
> > 
> > So why not just test it anyway?  If it makes the bug go away, I will 
> > be surprised, but it would not be the first surprise for me.  ;-)
> > 
> > 							Thanx, Paul
> > 
> > > Thanks!
> > > 
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 0b760c1..c00f34e 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1849,7 +1849,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
> > >                 zero_cpu_stall_ticks(rdp);
> > >         }
> > >         rdp->gp_seq = rnp->gp_seq;  /* Remember new grace-period state. */
> > > -       if (ULONG_CMP_GE(rnp->gp_seq_needed, rdp->gp_seq_needed) || rdp->gpwrap)
> > > +       if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap)
> > >                 rdp->gp_seq_needed = rnp->gp_seq_needed;
> > >         WRITE_ONCE(rdp->gpwrap, false);
> > >         rcu_gpnum_ovf(rnp, rdp);
> > > 
> > > 
> > > -----Original Message-----
> > > From: Paul E. McKenney [mailto:paulmck@...ux.ibm.com]
> > > Sent: Thursday, December 13, 2018 08:12
> > > To: He, Bo <bo.he@...el.com>
> > > Cc: Steven Rostedt <rostedt@...dmis.org>; 
> > > linux-kernel@...r.kernel.org; josh@...htriplett.org; 
> > > mathieu.desnoyers@...icios.com; jiangshanlai@...il.com; Zhang, Jun 
> > > <jun.zhang@...el.com>; Xiao, Jin <jin.xiao@...el.com>; Zhang, 
> > > Yanmin <yanmin.zhang@...el.com>; Bai, Jie A <jie.a.bai@...el.com>; 
> > > Sun, Yi J <yi.j.sun@...el.com>
> > > Subject: Re: rcu_preempt caused oom
> > > 
> > > On Wed, Dec 12, 2018 at 11:13:22PM +0000, He, Bo wrote:
> > > > I don't see the rcutree.sysrq_rcu parameter in the v4.19 kernel; I also checked the latest kernel and the latest tag, v4.20-rc6, and do not see sysrq_rcu.
> > > > Please correct me if I have something wrong.
> > > 
> > > That would be because I sent you the wrong patch, apologies!  :-/
> > > 
> > > Please instead see the one below, which does add sysrq_rcu.
> > > 
> > > 							Thanx, Paul
> > > 
> > > > -----Original Message-----
> > > > From: Paul E. McKenney <paulmck@...ux.ibm.com>
> > > > Sent: Thursday, December 13, 2018 5:03 AM
> > > > To: He, Bo <bo.he@...el.com>
> > > > Cc: Steven Rostedt <rostedt@...dmis.org>; 
> > > > linux-kernel@...r.kernel.org; josh@...htriplett.org; 
> > > > mathieu.desnoyers@...icios.com; jiangshanlai@...il.com; Zhang, 
> > > > Jun <jun.zhang@...el.com>; Xiao, Jin <jin.xiao@...el.com>; 
> > > > Zhang, Yanmin <yanmin.zhang@...el.com>; Bai, Jie A 
> > > > <jie.a.bai@...el.com>
> > > > Subject: Re: rcu_preempt caused oom
> > > > 
> > > > On Wed, Dec 12, 2018 at 07:42:24AM -0800, Paul E. McKenney wrote:
> > > > > On Wed, Dec 12, 2018 at 01:21:33PM +0000, He, Bo wrote:
> > > > > > we reproduced the issue on two boards, but I still do not see the show_rcu_gp_kthreads() dump logs; it seems the patch can't catch the scenario.
> > > > > > I double-confirmed that CONFIG_PROVE_RCU=y is enabled in the config, as extracted from /proc/config.gz.
> > > > > 
> > > > > Strange.
> > > > > 
> > > > > Are the systems responsive to sysrq keys once failure occurs?  
> > > > > If so, I will provide you a sysrq-R or some such to dump out the RCU state.
> > > > 
> > > > Or, as it turns out, sysrq-y if booting with rcutree.sysrq_rcu=1 using the patch below.  Only lightly tested.
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > commit 04b6245c8458e8725f4169e62912c1fadfdf8141
> > > Author: Paul E. McKenney <paulmck@...ux.ibm.com>
> > > Date:   Wed Dec 12 16:10:09 2018 -0800
> > > 
> > >     rcu: Add sysrq rcu_node-dump capability
> > >     
> > >     Backported from v4.21/v5.0
> > >     
> > >     Life is hard if RCU manages to get stuck without triggering RCU CPU
> > >     stall warnings or triggering the rcu_check_gp_start_stall() checks
> > >     for failing to start a grace period.  This commit therefore adds a
> > >     boot-time-selectable sysrq key (commandeering "y") that allows manually
> > >     dumping Tree RCU state.  The new rcutree.sysrq_rcu kernel boot parameter
> > >     must be set for this sysrq to be available.
> > >     
> > >     Signed-off-by: Paul E. McKenney <paulmck@...ux.ibm.com>
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 0b760c1369f7..e9392a9d6291 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -61,6 +61,7 @@
> > >  #include <linux/trace_events.h>
> > >  #include <linux/suspend.h>
> > >  #include <linux/ftrace.h>
> > > +#include <linux/sysrq.h>
> > >  
> > >  #include "tree.h"
> > >  #include "rcu.h"
> > > @@ -128,6 +129,9 @@ int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> > >  int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> > >  /* panic() on RCU Stall sysctl. */
> > >  int sysctl_panic_on_rcu_stall __read_mostly;
> > > +/* Commandeer a sysrq key to dump RCU's tree. */
> > > +static bool sysrq_rcu;
> > > +module_param(sysrq_rcu, bool, 0444);
> > >  
> > >  /*
> > >   * The rcu_scheduler_active variable is initialized to the value
> > > @@ -662,6 +666,27 @@ void show_rcu_gp_kthreads(void)
> > >  }
> > >  EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
> > >  
> > > +/* Dump grace-period-request information due to commandeered sysrq. */
> > > +static void sysrq_show_rcu(int key)
> > > +{
> > > +	show_rcu_gp_kthreads();
> > > +}
> > > +
> > > +static struct sysrq_key_op sysrq_rcudump_op = {
> > > +	.handler = sysrq_show_rcu,
> > > +	.help_msg = "show-rcu(y)",
> > > +	.action_msg = "Show RCU tree",
> > > +	.enable_mask = SYSRQ_ENABLE_DUMP,
> > > +};
> > > +
> > > +static int __init rcu_sysrq_init(void)
> > > +{
> > > +	if (sysrq_rcu)
> > > +		return register_sysrq_key('y', &sysrq_rcudump_op);
> > > +	return 0;
> > > +}
> > > +early_initcall(rcu_sysrq_init);
> > > +
> > >  /*
> > >   * Send along grace-period-related data for rcutorture diagnostics.
> > >   */
> > > 
> > 
> 

