linux-kernel - Re: hugepage related lockdep trace.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130724025711.GB14795@bbox>
Date:	Wed, 24 Jul 2013 11:57:11 +0900
From:	Minchan Kim <minchan@...nel.org>
To:	Hush Bensen <hush.bensen@...il.com>
Cc:	"Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>,
	Dave Jones <davej@...hat.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	linux-mm@...ck.org, Rik van Riel <riel@...hat.com>,
	Michal Hocko <mhocko@...e.cz>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Hillf Danton <dhillf@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: hugepage related lockdep trace.

On Tue, Jul 23, 2013 at 01:24:17AM -0600, Hush Bensen wrote:
> On 07/18/2013 06:13 PM, Minchan Kim wrote:
> >On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> >>Minchan Kim <minchan@...nel.org> writes:
> >>
> >>>Ccing people get_maintainer says.
> >>>
> >>>On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
> >>>>[128095.470960] =================================
> >>>>[128095.471315] [ INFO: inconsistent lock state ]
> >>>>[128095.471660] 3.11.0-rc1+ #9 Not tainted
> >>>>[128095.472156] ---------------------------------
> >>>>[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> >>>>[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
> >>>>[128095.474373]  (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
> >>>>[128095.475128] {RECLAIM_FS-ON-W} state was registered at:
> >>>>[128095.475866]   [<c10a6232>] mark_held_locks+0x81/0xe7
> >>>>[128095.476597]   [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
> >>>>[128095.477322]   [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
> >>>>[128095.478049]   [<c1123ab6>] __get_free_pages+0x20/0x31
> >>>>[128095.478769]   [<c1123ad9>] get_zeroed_page+0x12/0x14
> >>>>[128095.479477]   [<c113fe1e>] __pmd_alloc+0x1c/0x6b
> >>>>[128095.480138]   [<c1155ea7>] huge_pmd_share+0x265/0x283
> >>>>[128095.480138]   [<c1155f22>] huge_pte_alloc+0x5d/0x71
> >>>>[128095.480138]   [<c115612e>] hugetlb_fault+0x7c/0x64a
> >>>>[128095.480138]   [<c114087c>] handle_mm_fault+0x255/0x299
> >>>>[128095.480138]   [<c15bbab0>] __do_page_fault+0x142/0x55c
> >>>>[128095.480138]   [<c15bbed7>] do_page_fault+0xd/0x16
> >>>>[128095.480138]   [<c15b927c>] error_code+0x6c/0x74
> >>>>[128095.480138] irq event stamp: 3136917
> >>>>[128095.480138] hardirqs last  enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
> >>>>[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
> >>>>[128095.480138] softirqs last  enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
> >>>>[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
> >>>>[128095.480138]
> >>>>other info that might help us debug this:
> >>>>[128095.480138]  Possible unsafe locking scenario:
> >>>>
> >>>>[128095.480138]        CPU0
> >>>>[128095.480138]        ----
> >>>>[128095.480138]   lock(&mapping->i_mmap_mutex);
> >>>>[128095.480138]   <Interrupt>
> >>>>[128095.480138]     lock(&mapping->i_mmap_mutex);
> >>>>[128095.480138]
> >>>>  *** DEADLOCK ***
> >>>>
> >>>>[128095.480138] no locks held by kswapd0/49.
> >>>>[128095.480138]
> >>>>stack backtrace:
> >>>>[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
> >>>>[128095.480138] Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
> >>>>[128095.480138]  c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
> >>>>[128095.480138]  c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
> >>>>[128095.480138]  c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
> >>>>[128095.480138] Call Trace:
> >>>>[128095.480138]  [<c15b001e>] dump_stack+0x4b/0x79
> >>>>[128095.480138]  [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
> >>>>[128095.480138]  [<c10a6130>] mark_lock+0x1e0/0x261
> >>>>[128095.480138]  [<c10a5878>] ? check_usage_backwards+0x109/0x109
> >>>>[128095.480138]  [<c10a6cde>] __lock_acquire+0x623/0x17f2
> >>>>[128095.480138]  [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
> >>>>[128095.480138]  [<c107a7e8>] ? sched_clock_local+0x42/0x12e
> >>>>[128095.480138]  [<c10a84cf>] lock_acquire+0x7d/0x195
> >>>>[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >>>>[128095.480138]  [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
> >>>>[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >>>>[128095.480138]  [<c114971b>] ? page_referenced+0x87/0x5e3
> >>>>[128095.480138]  [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
> >>>>[128095.480138]  [<c114971b>] page_referenced+0x87/0x5e3
> >>>>[128095.480138]  [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
> >>>>[128095.480138]  [<c112b9c7>] shrink_page_list+0x3d9/0x947
> >>>>[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >>>>[128095.480138]  [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
> >>>>[128095.480138]  [<c112cd07>] shrink_lruvec+0x300/0x5ce
> >>>>[128095.480138]  [<c112d028>] shrink_zone+0x53/0x14e
> >>>>[128095.480138]  [<c112e531>] kswapd+0x517/0xa75
> >>>>[128095.480138]  [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
> >>>>[128095.480138]  [<c10661ff>] kthread+0xa8/0xaa
> >>>>[128095.480138]  [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >>>>[128095.480138]  [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
> >>>>[128095.480138]  [<c1066157>] ? insert_kthread_work+0x63/0x63
> >>>IMHO, it's a false positive because i_mmap_mutex was held by kswapd
> >>>while one in the middle of fault path could be never on kswapd context.
> >>>
> >>>It seems lockdep for reclaim-over-fs isn't enough smart to identify
> >>>between background and direct reclaim.
> >>>
> >>>Wait for other's opinion.
> >>Is that reasoning correct ?. We may not deadlock because hugetlb pages
> >>cannot be reclaimed. So the fault path in hugetlb won't end up
> >>reclaiming pages from same inode. But the report is correct right ?
> >>
> >>
> >>Looking at the hugetlb code we have in huge_pmd_share
> >>
> >>out:
> >>	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >>	mutex_unlock(&mapping->i_mmap_mutex);
> >>	return pte;
> >>
> >>I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
> >>that pmd_alloc can result in a reclaim which can call shrink_page_list ?
> >True. Sorry for that I didn't review the code carefully and I was very paranoid
> >in reclaim-over-fs due to internal works. :(
> 
> Could you explain more about reclaim-over-fs stuff?

It's lockdep stuff to catch reclaim deadlock, which could happen following as
fox example,

fs_write
        lock(fs_lock);
        alloc memory
                reclaim path
                        pageout
                                fs_write
                                        lock(fs_lock)

                

> 
> >
> >>Something like  ?
> >>
> >>diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>index 83aff0a..2cb1be3 100644
> >>--- a/mm/hugetlb.c
> >>+++ b/mm/hugetlb.c
> >>@@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >>  		put_page(virt_to_page(spte));
> >>  	spin_unlock(&mm->page_table_lock);
> >>  out:
> >>-	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >>  	mutex_unlock(&mapping->i_mmap_mutex);
> >>+	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >>  	return pte;
> >I am blind on hugetlb but not sure it doesn't break eb48c071.
> >Michal?
> >
> >
> >>  }
> >>-aneesh
> >>
> >>
> >>
> >>--
> >>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>the body of a message to majordomo@...r.kernel.org
> >>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>Please read the FAQ at  http://www.tux.org/lkml/
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@...ck.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@...ck.org"> email@...ck.org </a>

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/