lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 13 May 2015 13:15:08 -0000
From:	vigneshr@...eaurora.org
To:	linux-kernel@...r.kernel.org
Cc:	catalin.marinas@....com, bernd.schubert@...m.fraunhofer.de
Subject: Crash in crc32_le during kmemleak_scan()

We are seeing a panic in crc32_le after kmemleak_scan(). I have pasted the
snippet of the crash log below. This is seen on 3.10 Kernel.

This issue was earlier discussed over this thread as well -
https://lkml.org/lkml/2014/2/6/287.


<4>[   70.732606] kmemleak: Cannot allocate a kmemleak_object structure
<6>[   70.739803] kmemleak: Kernel memory leak detector disabled
<6>[   70.740831] lowmemorykiller: Killing 'id.gms.wearable' (7004), adj 647,
<6>[   70.740831]    to free 29144kB on behalf of 'kswapd0' (95) because
<6>[   70.740831]    cache 216560kB is below limit 225000kB for
oom_score_adj 529
<6>[   70.740831]    Free memory is 5788kB above reserved.
<6>[   70.740831]    Free CMA is 107248kB
<6>[   70.740831]    Total reserve is 17856kB
<6>[   70.740831]    Total free pages is 113036kB
<6>[   70.740831]    Total file cache is 217632kB
<6>[   70.740831]    Slab Reclaimable is 17104kB
<6>[   70.740831]    Slab UnReclaimable is 117508kB
<6>[   70.740831]    Total Slab is 134612kB
<6>[   70.740831]    GFP mask is 0xd0
<6>[   70.768877] lowmemorykiller: Killing 'le.android.talk' (7955), adj 529,
<6>[   70.768877]    to free 39124kB on behalf of 'kswapd0' (95) because
<6>[   70.768877]    cache 211088kB is below limit 225000kB for
oom_score_adj 529
<6>[   70.768877]    Free memory is 4716kB above reserved.
<6>[   70.768877]    Free CMA is 108048kB
<6>[   70.768877]    Total reserve is 17856kB
<6>[   70.768877]    Total free pages is 112764kB
<6>[   70.768877]    Total file cache is 212304kB
<6>[   70.768877]    Slab Reclaimable is 17104kB
<6>[   70.768877]    Slab UnReclaimable is 117408kB
<6>[   70.768877]    Total Slab is 134512kB
<6>[   70.768877]    GFP mask is 0xd0
<1>[   70.935886] Unable to handle kernel paging request at virtual
address ffffff8001119000
<1>[   70.942768] pgd = ffffffc00007d000
<1>[   70.946150] [ffffff8001119000] *pgd=000000008c307003,
*pmd=00000000b2b96003, *pte=0000000000000000
<0>[   70.955094] Internal error: Oops: 96000007 [#1] PREEMPT SMP
<6>[   70.960648] Modules linked in: wlan(O)
<6>[   70.964383] CPU: 2 PID: 198 Comm: kmemleak Tainted: G           O
3.10.49-gcb3457b-00006-g72e5422 #1
<6>[   70.973497] task: ffffffc030b24b40 ti: ffffffc03019c000 task.ti:
ffffffc03019c000
<6>[   70.980966] PC is at crc32_le+0x78/0x140
<6>[   70.984873] LR is at kmemleak_scan+0x2d0/0x3c8
<6>[   70.989296] pc : [<ffffffc00044ea30>] lr : [<ffffffc0002f088c>]
pstate: 800001c5
<6>[   70.996671] sp : ffffffc03019fd60
<6>[   70.999972] x29: ffffffc03019fd70 x28: 0000000000000000
<6>[   71.005267] x27: 0000000000000000 x26: 0000000000000000
<6>[   71.010557] x25: ffffffc00168d6e8 x24: 0000000000000000
<6>[   71.028876] x23: 0000000000000140 x22: ffffffc0014e23d8
<6>[   71.034168] x21: ffffffc0014e23a0 x20: ffffffc0014e2338
<6>[   71.039464] x19: ffffffc002a02640 x18: 000000000b220dc1
<6>[   71.044757] x17: 0000000000000400 x16: 0000000000000800
<6>[   71.050053] x15: 0000000000000c00 x14: 0000000000001400
<6>[   71.064462] x13: 0000000000001800 x12: 0000000000001c00
<6>[   71.069756] x11: 0000000000001000 x10: ffffffc0014c2000
<6>[   71.075052] x9 : 0000000000000000 x8 : 0000000000000000
<6>[   71.080347] x7 : ffffff8001118ffc x6 : 00000000000012aa
<6>[   71.085642] x5 : 0000000000000000 x4 : ffffffc0014e23d8
<6>[   71.100052] x3 : 0000000000000000 x2 : 00000000000012aa
<6>[   71.105348] x1 : ffffff8001118ffc x0 : 0000000000000000
<6>[   71.115590] e930  97fffff053107e60 a94153f333103e80 d65f03c0a8c27bfd
d280000453003c00 eb02009f90003f65 3864682354000120 91000484913ea0a6
12001c634a000063
<6>[   71.138421] e970  4a4020607863d8c3 d65f03c017fffff7 d280000453003c00
eb02009fb0003f65 3864682354000120 910004849106a0a6 12001c634a000063
4a4020607863d8c3
<snip> <snip>
<6>[   71.468288] fde0  ffffffc03019fe10 ffffffc00023bb1c ffffffc03302fd30
ffffffc001691738 ffffffc000f61dd0 ffffffc00023bb00 0000000000000000
ffffffc000203ba0
<6>[   71.482001] fe20  ffffffc00023ba6c ffffffc03302fd30 0000000000000000
0000000000000000 0000000000000000 ffffffc000247048 ffffffc03019fea0
0000000000000000
<snip> <snip>
<6>[   71.495716]
<0>[   71.497201] Process kmemleak (pid: 198, stack limit =
0xffffffc03019c058)
<6>[   71.503968] Call trace:
<6>[   71.506402] [<ffffffc00044ea30>] crc32_le+0x78/0x140
<6>[   71.511356] [<ffffffc0002f0cc0>] kmemleak_scan_thread+0x74/0xc0 <6>[
  71.517249] [<ffffffc00023bb18>] kthread+0xac/0xb8
<0>[   71.522027] Code: d281800f d2810010 d2808011 b4000466 (b94004e3)
<4>[   71.528180] ---[ end trace 0387f1db53e12801 ]---
<0>[   73.779697] Rebooting in 5 seconds..
<5>[   78.782326] Going down for restart now


Analysis so far :

1. We see that this is during low memory conditions, when kmemleak_object
structure allocation fails (happens at around 70.732606 seconds after
boot).

2. Immediately as per code, we go ahead and call kmemleak_disable() which
is an irreversible operation and we see the message at 70.739803 seconds.

Now we see that kmemleak_scan() has caused the panic here at around
70.935886 seconds, even though disable was called earlier (0.2 seconds
before).

So the question was why it was getting triggered/running even after we
disable it.
So we started seeing if kmemleak_disable() bailed out and if yes, at what
stage via post mortem analysis.

  1766 static void kmemleak_disable(void)
  1767 {
  1768     /* atomically check whether it was already invoked */
  1769     if (cmpxchg(&kmemleak_error, 0, 1))
  1770         return;
  1771
  1772     /* stop any memory operation tracing */
  1773     kmemleak_enabled = 0;
  1774
  1775     /* check whether it is too early for a kernel thread */ 1776   
 if (kmemleak_initialized)
  1777         schedule_work(&cleanup_work);
  1778
  1779     pr_info("Kernel memory leak detector disabled\n");
  1780 }


1. From Line 1769, it would have bailed out if the kmemleak_error were
already set to 1 in which case , we would have already disabled it,
therefore this would have been 0 and then changed to 1.

2. From the post mortem raw dumps, we see that kmemleak_enabled was indeed
set to 0. Therefore we had successfully come here.

3. After the above, we see that we have scheduled a work - cleanup_work,
which does the following :

  1747 static void kmemleak_do_cleanup(struct work_struct *work)
  1748 {
  1749     mutex_lock(&scan_mutex);
  1750     stop_scan_thread();
  1751
  1752     if (!kmemleak_found_leaks)
  1753         __kmemleak_do_cleanup();
  1754     else
  1755         pr_info("Kmemleak disabled without freeing internal data. "
1756             "Reclaim the memory with \"echo clear >
/sys/kernel/debug/kmemleak\"\n");
  1757     mutex_unlock(&scan_mutex);
  1758 }

If it had gone inside stop_scan_thread() from line 1750 routine, we should
have stopped the kmemleak kthread and the variable scan_thread should have
been null :

a. Postmortem analysis says otherwise.
b. We see that kmemleak thread did not stop since it executed after 0.2
seconds after we disabled.
c. Additionally we saw that scan_thread variable is perfectly intact and
not NULL. (snippet pasted below)

    scan_thread = 0xFFFFFFC030B24B40 -> (
    state = 0,
    stack = 0xFFFFFFC03019C000,
    usage = (counter = 2),
    flags = 2129984,
    ptrace = 0,
    wake_entry = (next = 0x0),
    on_cpu = 1,
    on_rq = 1,
    prio = 130,
    static_prio = 130,
    normal_prio = 130,
    rt_priority = 0,
    sched_class = 0xFFFFFFC000C03280,
    se = (load = (weight = 110, inv_weight = 39045157), run_node =
(__rb_parent_
    rt = (run_list = (next = 0xFFFFFFC030B24D08, prev =
0xFFFFFFC030B24D08), tim
    ravg = (mark_start = 71528155165, sum = 8467458, demand = 10000000,
sum_hist
    sched_task_group = 0xFFFFFFC00D53F200,
    fpu_counter = 0,
    policy = 0,
    nr_cpus_allowed = 8,


Therefore, we checked the weird possibilty of the work function getting
stuck in acquiring scan_mutex lock which would explain the fact that
stop_scan_thread() did not execute and it
indeed turned out be that way.

  Serving work : (struct work_struct *)0xFFFFFFC0014E23B0
  work_func : \\vmlinux\kmemleak\kmemleak_do_cleanup
  serving worker : kworker/3:1 PID : 32 on CPU : 3

Stack of kworker/3:1 :

-000|__switch_to(prev = 0xFFFFFFC032A60AC0, next = 0xFFFFFFC0014E2360)
-001|context_switch(inline)
-001|__schedule()
-002|schedule()
-003|schedule_preempt_disabled()
-004|__mutex_lock_common(inline)
-004|__mutex_lock_slowpath(lock_count = 0xFFFFFFC0014E2358)
-005|current_thread_info(inline)
-005|mutex_set_owner(inline)
-005|mutex_lock(lock = 0xFFFFFFC0014E2358)
-006|kmemleak_do_cleanup(?)
-007|static_key_false(inline)
-007|trace_workqueue_execute_end(inline)
-007|process_one_work(worker = 0xFFFFFFC032A0D000, work =
0xFFFFFFC0014E23B0)
-008|worker_thread(__worker = 0xFFFFFFC032A0D000)
-009|kthread(_create = 0xFFFFFFC032E03C80)
-010|ret_from_fork(asm)
-->|exception
-011|create_kthread(inline)
-011|kthreadd(?)
---|end of frame


We can see that it went into sleep trying to acquire the mutex lock and it
never came back and last arrival backs this claim.

    sched_info = (
      pcount = 1179,
      run_delay = 209410806,
      last_arrival = 70739794902, ----------------> Last it got a chance
to execute at 70.73 seconds when we called disable and its stuck
there.


Now when we check who is the owner of this mutex lock, we see that
kmemleak scan thread holds it

    |__mutex_lock_slowpath(
    |    lock_count = 0xFFFFFFC0014E2358)
    |  lock = 0xFFFFFFC0014E2358 -> (
    |    count = (counter = -1),
    |    wait_lock = (rlock = (raw_lock = (owner = 4, next = 4), magic =
3735899
    |    wait_list = (next = 0xFFFFFFC032A6BCD0, prev =
0xFFFFFFC032A6BCD0),
    |    owner = 0xFFFFFFC030B24B40, --> Corresponds to task_struct
of kmemleak
    |    name = 0x0,
    |    magic = 0xFFFFFFC0014E2358)


We tried the following as well to stop the scan thread first and then
schedule work but it did not help :


diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 0bd522f..6105dfe 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1460,7 +1460,6 @@ static void start_scan_thread(void)
static void stop_scan_thread(void)
{
        if (scan_thread) {
-               kthread_stop(scan_thread);
                scan_thread = NULL;
        }
}
@@ -1747,6 +1746,9 @@ static void kmemleak_disable(void)
        /* stop any memory operation tracing */
        atomic_set(&kmemleak_enabled, 0);

+       if(scan_thread)
+               kthread_stop(scan_thread);
+
        /* check whether it is too early for a kernel thread */



Therefore in summary, the cleanup_work scheduled after kmemleak_disable()
is stuck in acquiring scan_mutex and this mutex lock is held by kmemleak
scan thread.

In the above stated case we have seen the time difference between the
disable call and crash is close to 0.2 seconds, but we have other
instances where it took more time like close to a second or so :

===
    81.797655:   <6> kmemleak: Cannot allocate a kmemleak_object structure
81.830707:   <6> kmemleak: Kernel memory leak detector disabled
                                                <snip> <snip>
    83.015406:   <6> Unable to handle kernel paging request at virtual
address ffffff80013bd000
    83.022284:   <6> pgd = ffffffc00007d000
    83.025666:   <2> [ffffff80013bd000] *pgd=000000008c307003,
*pmd=00000000b24af003, *pte=0000000000000000
    83.034610:   <6> Internal error: Oops: 96000007 [#1] PREEMPT SMP
83.040163:   <2> Modules linked in: wlan(O)
    83.043899:   <6> CPU: 2 PID: 204 Comm: kmemleak Tainted: G           O
3.10.49-g7ed4272-00237-g8f9114d #1
    83.053011:   <6> task: ffffffc030b54080 ti: ffffffc030214000 task.ti:
ffffffc030214000
    83.060481:   <2> PC is at crc32_le+0x78/0x140
    83.064384:   <2> LR is at kmemleak_scan+0x2d0/0x3d4


====


    97.092890:   <6> kmemleak: Cannot allocate a kmemleak_object structure
97.098695:   <6> kmemleak: Kernel memory leak detector disabled
                                                <snip> <snip>
    98.149826:   <6> Unable to handle kernel paging request at virtual
address ffffff800038a000
    98.156704:   <6> pgd = ffffffc00007d000
    84.347644:   <2> msm_pm_qos_update_request: update request -1
    98.160088:   <4> [ffffff800038a000] *pgd=000000008c307003
    98.164862:   <2> , *pmd=00000000b2a36003, *pte=0000000000000000
98.164874:   <6> Internal error: Oops: 96000007 [#1] PREEMPT SMP
98.170418:   <2> Modules linked in:
    98.173462:   <6> CPU: 2 PID: 203 Comm: kmemleak Not tainted
3.10.49-g18d73a2 #1
    98.180317:   <6> task: ffffffc0304bd600 ti: ffffffc030204000 task.ti:
ffffffc030204000
    98.187784:   <2> PC is at crc32_le+0x78/0x140
    98.191688:   <2> LR is at kmemleak_scan+0x2d0/0x3d4

====

Any debug pointers/suggestions for the same would be really helpful.

Thanks and regards,
Vignesh Radhakrishnan

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ