linux-kernel - Re: [RCU] kernel hangs in wait_rcu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKZGPAPNpci1UriB=vtT3OHcVQnMNtwi7qbT+d_Uqvp+A_xENw@mail.gmail.com>
Date:	Thu, 18 Dec 2014 21:52:28 +0530
From:	Arun KS <arunks.linux@...il.com>
To:	Paul McKenney <paulmck@...ux.vnet.ibm.com>
Cc:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	josh@...htriplett.org, rostedt@...dmis.org,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	laijs@...fujitsu.com, Arun KS <getarunks@...il.com>
Subject: Re: [RCU] kernel hangs in wait_rcu_gp during suspend path

Hi Paul,

On Thu, Dec 18, 2014 at 12:54 AM, Paul E. McKenney
<paulmck@...ux.vnet.ibm.com> wrote:
> On Tue, Dec 16, 2014 at 11:00:20PM +0530, Arun KS wrote:
>> Hello,
>>
>> Adding some more info.
>>
>> Below is the rcu_data data structure corresponding to cpu4.
>
> This shows that RCU is idle.  What was the state of the system at the
> time you collected this data?

System initiated a suspend sequence and currently at disable_nonboot_cpus().
It has hotplugged 0, 1 and 2 successfully.  And even successful in hot
plugging cpu3.
But while calling the CPU_POST_DEAD notifier for cpu3, another driver tried to
unregister an atomic notifier. Which eventually calls syncronize_rcu()
which hangs the suspend task.

bt as follows,
PID: 202    TASK: edcd2a00  CPU: 4   COMMAND: "kworker/u16:4"
 #0 [<c0a1f8c0>] (__schedule) from [<c0a1d054>]
 #1 [<c0a1d054>] (schedule_timeout) from [<c0a1f018>]
 #2 [<c0a1f018>] (wait_for_common) from [<c013c570>]
 #3 [<c013c570>] (wait_rcu_gp) from [<c014407c>]
 #4 [<c014407c>] (atomic_notifier_chain_unregister) from [<c06be62c>]
 #5 [<c06be62c>] (cpufreq_interactive_disable_sched_input) from [<c06bee1c>]
 #6 [<c06bee1c>] (cpufreq_governor_interactive) from [<c06b7724>]
 #7 [<c06b7724>] (__cpufreq_governor) from [<c06b9f74>]
 #8 [<c06b9f74>] (__cpufreq_remove_dev_finish) from [<c06ba3a4>]
 #9 [<c06ba3a4>] (cpufreq_cpu_callback) from [<c0a22674>]
#10 [<c0a22674>] (notifier_call_chain) from [<c012284c>]
#11 [<c012284c>] (__cpu_notify) from [<c01229dc>]
#12 [<c01229dc>] (cpu_notify_nofail) from [<c0a0dd1c>]
#13 [<c0a0dd1c>] (_cpu_down) from [<c0122b48>]
#14 [<c0122b48>] (disable_nonboot_cpus) from [<c0168cd8>]
#15 [<c0168cd8>] (suspend_devices_and_enter) from [<c0169018>]
#16 [<c0169018>] (pm_suspend) from [<c01691e0>]
#17 [<c01691e0>] (try_to_suspend) from [<c01396f0>]
#18 [<c01396f0>] (process_one_work) from [<c0139db0>]
#19 [<c0139db0>] (worker_thread) from [<c013efa4>]
#20 [<c013efa4>] (kthread) from [<c01061f8>]

But the other cores 4-7 are active. I can see them going to idle tasks
coming out from idle because of interrupts, scheduling kworkers etc.
So when I took the data, all the online cores(4-7) were in idle as
shown below from runq data structures.

------start--------------
crash> runq
CPU 0 [OFFLINE]

CPU 1 [OFFLINE]

CPU 2 [OFFLINE]

CPU 3 [OFFLINE]

CPU 4 RUNQUEUE: c5439040
  CURRENT: PID: 0      TASK: f0c9d400  COMMAND: "swapper/4"
  RT PRIO_ARRAY: c5439130
     [no tasks queued]
  CFS RB_ROOT: c54390c0
     [no tasks queued]

CPU 5 RUNQUEUE: c5447040
  CURRENT: PID: 0      TASK: f0c9aa00  COMMAND: "swapper/5"
  RT PRIO_ARRAY: c5447130
     [no tasks queued]
  CFS RB_ROOT: c54470c0
     [no tasks queued]

CPU 6 RUNQUEUE: c5455040
  CURRENT: PID: 0      TASK: f0c9ce00  COMMAND: "swapper/6"
  RT PRIO_ARRAY: c5455130
     [no tasks queued]
  CFS RB_ROOT: c54550c0
     [no tasks queued]

CPU 7 RUNQUEUE: c5463040
  CURRENT: PID: 0      TASK: f0c9b000  COMMAND: "swapper/7"
  RT PRIO_ARRAY: c5463130
     [no tasks queued]
  CFS RB_ROOT: c54630c0
     [no tasks queued]
------end--------------

but one strange thing i can see is that rcu_read_lock_nesting for idle
tasks running on cpu 5 and cpu 6 are set to 1.

PID: 0      TASK: f0c9d400  CPU: 4   COMMAND: "swapper/4"
  rcu_read_lock_nesting = 0,

PID: 0      TASK: f0c9aa00  CPU: 5   COMMAND: "swapper/5"
  rcu_read_lock_nesting = 1,

PID: 0      TASK: f0c9ce00  CPU: 6   COMMAND: "swapper/6"
  rcu_read_lock_nesting = 1,

PID: 0      TASK: f0c9b000  CPU: 7   COMMAND: "swapper/7"
  rcu_read_lock_nesting = 0,

Does this means that the current grace period(suspend thread is
waiting on) is getting extended infinitely?
Also attaching the per_cpu rcu_data for online and offline cores.

Thanks,
Arun

>
>                                                         Thanx, Paul
>
>> struct rcu_data {
>>   completed = 5877,
>>   gpnum = 5877,
>>   passed_quiesce = true,
>>   qs_pending = false,
>>   beenonline = true,
>>   preemptible = true,
>>   mynode = 0xc117f340 <rcu_preempt_state>,
>>   grpmask = 16,
>>   nxtlist = 0xedaaec00,
>>   nxttail = {0xc54366c4, 0xe84d350c, 0xe84d350c, 0xe84d350c},
>>   nxtcompleted = {4294967035, 5878, 5878, 5878},
>>   qlen_lazy = 105,
>>   qlen = 415,
>>   qlen_last_fqs_check = 0,
>>   n_cbs_invoked = 86323,
>>   n_nocbs_invoked = 0,
>>   n_cbs_orphaned = 0,
>>   n_cbs_adopted = 139,
>>   n_force_qs_snap = 0,
>>   blimit = 10,
>>   dynticks = 0xc5436758,
>>   dynticks_snap = 7582140,
>>   dynticks_fqs = 41,
>>   offline_fqs = 0,
>>   n_rcu_pending = 59404,
>>   n_rp_qs_pending = 5,
>>   n_rp_report_qs = 4633,
>>   n_rp_cb_ready = 32,
>>   n_rp_cpu_needs_gp = 41088,
>>   n_rp_gp_completed = 2844,
>>   n_rp_gp_started = 1150,
>>   n_rp_need_nothing = 9657,
>>   barrier_head = {
>>     next = 0x0,
>>     func = 0x0
>>   },
>>   oom_head = {
>>     next = 0x0,
>>     func = 0x0
>>   },
>>   cpu = 4,
>>   rsp = 0xc117f340 <rcu_preempt_state>
>> }
>>
>>
>>
>> Also pasting complete rcu_preempt_state.
>>
>>
>>
>> rcu_preempt_state = $9 = {
>>   node = {{
>>       lock = {
>>         raw_lock = {
>>           {
>>             slock = 3129850509,
>>             tickets = {
>>               owner = 47757,
>>               next = 47757
>>             }
>>           }
>>         },
>>         magic = 3735899821,
>>         owner_cpu = 4294967295,
>>         owner = 0xffffffff
>>       },
>>       gpnum = 5877,
>>       completed = 5877,
>>       qsmask = 0,
>>       expmask = 0,
>>       qsmaskinit = 240,
>>       grpmask = 0,
>>       grplo = 0,
>>       grphi = 7,
>>       grpnum = 0 '\000',
>>       level = 0 '\000',
>>       parent = 0x0,
>>       blkd_tasks = {
>>         next = 0xc117f378 <rcu_preempt_state+56>,
>>         prev = 0xc117f378 <rcu_preempt_state+56>
>>       },
>>       gp_tasks = 0x0,
>>       exp_tasks = 0x0,
>>       need_future_gp = {1, 0},
>>       fqslock = {
>>         raw_lock = {
>>           {
>>             slock = 0,
>>             tickets = {
>>               owner = 0,
>>               next = 0
>>             }
>>           }
>>         },
>>         magic = 3735899821,
>>         owner_cpu = 4294967295,
>>         owner = 0xffffffff
>>       }
>>     }},
>>   level = {0xc117f340 <rcu_preempt_state>},
>>   levelcnt = {1, 0, 0, 0, 0},
>>   levelspread = "\b",
>>   rda = 0xc115e6b0 <rcu_preempt_data>,
>>   call = 0xc01975ac <call_rcu>,
>>   fqs_state = 0 '\000',
>>   boost = 0 '\000',
>>   gpnum = 5877,
>>   completed = 5877,
>>   gp_kthread = 0xf0c9e600,
>>   gp_wq = {
>>     lock = {
>>       {
>>         rlock = {
>>           raw_lock = {
>>             {
>>               slock = 2160230594,
>>               tickets = {
>>                 owner = 32962,
>>                 next = 32962
>>               }
>>             }
>>           },
>>           magic = 3735899821,
>>           owner_cpu = 4294967295,
>>           owner = 0xffffffff
>>         }
>>       }
>>     },
>>     task_list = {
>>       next = 0xf0cd1f20,
>>       prev = 0xf0cd1f20
>>     }
>>   },
>>   gp_flags = 1,
>>   orphan_lock = {
>>     raw_lock = {
>>       {
>>         slock = 327685,
>>         tickets = {
>>           owner = 5,
>>           next = 5
>>         }
>>       }
>>     },
>>     magic = 3735899821,
>>     owner_cpu = 4294967295,
>>     owner = 0xffffffff
>>   },
>>   orphan_nxtlist = 0x0,
>>   orphan_nxttail = 0xc117f490 <rcu_preempt_state+336>,
>>   orphan_donelist = 0x0,
>>   orphan_donetail = 0xc117f498 <rcu_preempt_state+344>,
>>   qlen_lazy = 0,
>>   qlen = 0,
>>   onoff_mutex = {
>>     count = {
>>       counter = 1
>>     },
>>     wait_lock = {
>>       {
>>         rlock = {
>>           raw_lock = {
>>             {
>>               slock = 811479134,
>>               tickets = {
>>                 owner = 12382,
>>                 next = 12382
>>               }
>>             }
>>           },
>>           magic = 3735899821,
>>           owner_cpu = 4294967295,
>>           owner = 0xffffffff
>>         }
>>       }
>>     },
>>     wait_list = {
>>       next = 0xc117f4bc <rcu_preempt_state+380>,
>>       prev = 0xc117f4bc <rcu_preempt_state+380>
>>     },
>>     owner = 0x0,
>>     name = 0x0,
>>     magic = 0xc117f4a8 <rcu_preempt_state+360>
>>   },
>>   barrier_mutex = {
>>     count = {
>>       counter = 1
>>     },
>>     wait_lock = {
>>       {
>>         rlock = {
>>           raw_lock = {
>>             {
>>               slock = 0,
>>               tickets = {
>>                 owner = 0,
>>                 next = 0
>>               }
>>             }
>>           },
>>           magic = 3735899821,
>>           owner_cpu = 4294967295,
>>           owner = 0xffffffff
>>         }
>>       }
>>     },
>>     wait_list = {
>>       next = 0xc117f4e4 <rcu_preempt_state+420>,
>>       prev = 0xc117f4e4 <rcu_preempt_state+420>
>>     },
>>     owner = 0x0,
>>     name = 0x0,
>>     magic = 0xc117f4d0 <rcu_preempt_state+400>
>>   },
>>   barrier_cpu_count = {
>>     counter = 0
>>   },
>>   barrier_completion = {
>>     done = 0,
>>     wait = {
>>       lock = {
>>         {
>>           rlock = {
>>             raw_lock = {
>>               {
>>                 slock = 0,
>>                 tickets = {
>>                   owner = 0,
>>                   next = 0
>>                 }
>>               }
>>             },
>>             magic = 0,
>>             owner_cpu = 0,
>>             owner = 0x0
>>           }
>>         }
>>       },
>>       task_list = {
>>         next = 0x0,
>>         prev = 0x0
>>       }
>>     }
>>   },
>>   n_barrier_done = 0,
>>   expedited_start = {
>>     counter = 0
>>   },
>>   expedited_done = {
>>     counter = 0
>>   },
>>   expedited_wrap = {
>>     counter = 0
>>   },
>>   expedited_tryfail = {
>>     counter = 0
>>   },
>>   expedited_workdone1 = {
>>     counter = 0
>>   },
>>   expedited_workdone2 = {
>>     counter = 0
>>   },
>>   expedited_normal = {
>>     counter = 0
>>   },
>>   expedited_stoppedcpus = {
>>     counter = 0
>>   },
>>   expedited_done_tries = {
>>     counter = 0
>>   },
>>   expedited_done_lost = {
>>     counter = 0
>>   },
>>   expedited_done_exit = {
>>     counter = 0
>>   },
>>   jiffies_force_qs = 4294963917,
>>   n_force_qs = 4028,
>>   n_force_qs_lh = 0,
>>   n_force_qs_ngp = 0,
>>   gp_start = 4294963911,
>>   jiffies_stall = 4294966011,
>>   gp_max = 17,
>>   name = 0xc0d833ab "rcu_preempt",
>>   abbr = 112 'p',
>>   flavors = {
>>     next = 0xc117f2ec <rcu_bh_state+556>,
>>     prev = 0xc117f300 <rcu_struct_flavors>
>>   },
>>   wakeup_work = {
>>     flags = 3,
>>     llnode = {
>>       next = 0x0
>>     },
>>     func = 0xc0195aa8 <rsp_wakeup>
>>   }
>> }
>>
>> Hope this helps.
>>
>> Thanks,
>> Arun
>>
>>
>> On Tue, Dec 16, 2014 at 11:59 AM, Arun KS <arunks.linux@...il.com> wrote:
>> > Hello,
>> >
>> > I dig little deeper to understand the situation.
>> > All other cpus are in idle thread already.
>> > As per my understanding, for the grace period to end, at-least one of
>> > the following should happen on all online cpus,
>> >
>> > 1. a context switch.
>> > 2. user space switch.
>> > 3. switch to idle thread.
>> >
>> > In this situation, since all the other cores are already in idle,  non
>> > of the above are meet on all online cores.
>> > So grace period is getting extended and never finishes. Below is the
>> > state of runqueue when the hang happens.
>> > --------------start------------------------------------
>> > crash> runq
>> > CPU 0 [OFFLINE]
>> >
>> > CPU 1 [OFFLINE]
>> >
>> > CPU 2 [OFFLINE]
>> >
>> > CPU 3 [OFFLINE]
>> >
>> > CPU 4 RUNQUEUE: c3192e40
>> >   CURRENT: PID: 0      TASK: f0874440  COMMAND: "swapper/4"
>> >   RT PRIO_ARRAY: c3192f20
>> >      [no tasks queued]
>> >   CFS RB_ROOT: c3192eb0
>> >      [no tasks queued]
>> >
>> > CPU 5 RUNQUEUE: c31a0e40
>> >   CURRENT: PID: 0      TASK: f0874980  COMMAND: "swapper/5"
>> >   RT PRIO_ARRAY: c31a0f20
>> >      [no tasks queued]
>> >   CFS RB_ROOT: c31a0eb0
>> >      [no tasks queued]
>> >
>> > CPU 6 RUNQUEUE: c31aee40
>> >   CURRENT: PID: 0      TASK: f0874ec0  COMMAND: "swapper/6"
>> >   RT PRIO_ARRAY: c31aef20
>> >      [no tasks queued]
>> >   CFS RB_ROOT: c31aeeb0
>> >      [no tasks queued]
>> >
>> > CPU 7 RUNQUEUE: c31bce40
>> >   CURRENT: PID: 0      TASK: f0875400  COMMAND: "swapper/7"
>> >   RT PRIO_ARRAY: c31bcf20
>> >      [no tasks queued]
>> >   CFS RB_ROOT: c31bceb0
>> >      [no tasks queued]
>> > --------------end------------------------------------
>> >
>> > If my understanding is correct the below patch should help, because it
>> > will expedite grace periods during suspend,
>> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d1d74d14e98a6be740a6f12456c7d9ad47be9c9c
>> >
>> > But I wonder why it was not taken to stable trees. Can we take it?
>> > Appreciate your help.
>> >
>> > Thanks,
>> > Arun
>> >
>> > On Mon, Dec 15, 2014 at 10:34 PM, Arun KS <arunks.linux@...il.com> wrote:
>> >> Hi,
>> >>
>> >> Here is the backtrace of the process hanging in wait_rcu_gp,
>> >>
>> >> PID: 247    TASK: e16e7380  CPU: 4   COMMAND: "kworker/u16:5"
>> >>  #0 [<c09fead0>] (__schedule) from [<c09fcab0>]
>> >>  #1 [<c09fcab0>] (schedule_timeout) from [<c09fe050>]
>> >>  #2 [<c09fe050>] (wait_for_common) from [<c013b2b4>]
>> >>  #3 [<c013b2b4>] (wait_rcu_gp) from [<c0142f50>]
>> >>  #4 [<c0142f50>] (atomic_notifier_chain_unregister) from [<c06b2ab8>]
>> >>  #5 [<c06b2ab8>] (cpufreq_interactive_disable_sched_input) from [<c06b32a8>]
>> >>  #6 [<c06b32a8>] (cpufreq_governor_interactive) from [<c06abbf8>]
>> >>  #7 [<c06abbf8>] (__cpufreq_governor) from [<c06ae474>]
>> >>  #8 [<c06ae474>] (__cpufreq_remove_dev_finish) from [<c06ae8c0>]
>> >>  #9 [<c06ae8c0>] (cpufreq_cpu_callback) from [<c0a0185c>]
>> >> #10 [<c0a0185c>] (notifier_call_chain) from [<c0121888>]
>> >> #11 [<c0121888>] (__cpu_notify) from [<c0121a04>]
>> >> #12 [<c0121a04>] (cpu_notify_nofail) from [<c09ee7f0>]
>> >> #13 [<c09ee7f0>] (_cpu_down) from [<c0121b70>]
>> >> #14 [<c0121b70>] (disable_nonboot_cpus) from [<c016788c>]
>> >> #15 [<c016788c>] (suspend_devices_and_enter) from [<c0167bcc>]
>> >> #16 [<c0167bcc>] (pm_suspend) from [<c0167d94>]
>> >> #17 [<c0167d94>] (try_to_suspend) from [<c0138460>]
>> >> #18 [<c0138460>] (process_one_work) from [<c0138b18>]
>> >> #19 [<c0138b18>] (worker_thread) from [<c013dc58>]
>> >> #20 [<c013dc58>] (kthread) from [<c01061b8>]
>> >>
>> >> Will this patch helps here,
>> >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d1d74d14e98a6be740a6f12456c7d9ad47be9c9c
>> >>
>> >> I couldn't really understand why it got struck in  synchronize_rcu().
>> >> Please give some pointers to debug this further.
>> >>
>> >> Below are the configs enable related to RCU.
>> >>
>> >> CONFIG_TREE_PREEMPT_RCU=y
>> >> CONFIG_PREEMPT_RCU=y
>> >> CONFIG_RCU_STALL_COMMON=y
>> >> CONFIG_RCU_FANOUT=32
>> >> CONFIG_RCU_FANOUT_LEAF=16
>> >> CONFIG_RCU_FAST_NO_HZ=y
>> >> CONFIG_RCU_CPU_STALL_TIMEOUT=21
>> >> CONFIG_RCU_CPU_STALL_VERBOSE=y
>> >>
>> >> Kernel version is 3.10.28
>> >> Architecture is ARM
>> >>
>> >> Thanks,
>> >> Arun
>>
>

View attachment "rcu_data_offline_cpus_0_3.txt" of type "text/plain" (4166 bytes)

View attachment "rcu_data_online_cpus_4_7.txt" of type "text/plain" (4215 bytes)

View attachment "rcu_preept_state.txt" of type "text/plain" (5182 bytes)