Message-ID: <20080306175311.GA14567@us.ibm.com>
Date: Thu, 6 Mar 2008 09:53:11 -0800
From: Nishanth Aravamudan <nacc@...ibm.com>
To: Lee Schermerhorn <Lee.Schermerhorn@...com>
Cc: linux-kernel <linux-kernel@...r.kernel.org>,
linux-mm <linux-mm@...ck.org>, Adam Litke <agl@...ibm.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Mel Gorman <mel@....ul.ie>, Eric Whitney <eric.whitney@...com>
Subject: Re: [BUG] 2.6.25-rc4 hang/softlockups after freeing hugepages
On 06.03.2008 [12:23:03 -0500], Lee Schermerhorn wrote:
> Test platform: HP ProLiant DL585 server - 4-socket, dual-core AMD with
> 32GB memory.
>
> I first saw this on 25-rc2-mm1 with Mel's zonelist patches, while
> investigating the interaction of hugepages and cpusets. Thinking that
> it might be caused by the zonelist patches, I went back to 25-rc2-mm1
> w/o the patches and saw the same thing. It sometimes takes a while for
> the softlockups to start appearing, and I wanted to find a fairly
> minimal duplicator. Meanwhile 25-rc3 and rc4 have come out, so I tried
> the latest upstream kernel and see the same thing.
So, does 2.6.25-rc2 show the problem? Or was it something introduced in
that -mm which has since gone upstream?
> To duplicate the problem, I need only:
>
> + log into the platform as root in one window and:
>
> echo N >/proc/sys/vm/nr_hugepages
> echo 0 >/proc/sys/vm/nr_hugepages
>
> In my case, N=64. If I look, before echoing 0, I see 16 hugepages
> allocated on each of the 4 nodes, as expected.
>
> + then in another window, log in again.
>
> Sometimes it will hang during the 2nd login and I'll never see a shell
> prompt. Other times, I make it all the way to editing a file or
> starting a kernel build. The task in the 2nd login hangs and on the
> console I see, e.g.:
>
> BUG: soft lockup - CPU#1 stuck for 61s! [runkbuild:3320]
> CPU 1:
> Modules linked in: sunrpc ipv6 dm_mirror dm_mod parport_pc lp parport ide_cd_mod cdrom button tg3 hpwdt serio_raw amd_rng pata_acpi libata i2c_amd756 i2c_core pcspkr mptspi mptscsih sym53c8xx scsi_transport_spi sd_mod scsi_mod mptbase ext3 jbd ehci_hcd ohci_hcd uhci_hcd
> Pid: 3320, comm: runkbuild Not tainted 2.6.25-rc4 #1
> RIP: 0010:[<ffffffff803341f5>] [<ffffffff803341f5>] copy_page_c+0x5/0x10
> RSP: 0000:ffff8103fe56fe00 EFLAGS: 00010286
> RAX: ffff810000000000 RBX: ffff8103fe56fe68 RCX: 0000000000000200
> RDX: ffffffff805d6c00 RSI: ffff8103fdada000 RDI: ffff8103fe200000
> RBP: ffff8103fe56fe68 R08: ffffe20017fc3a68 R09: 00003ffffffff000
> R10: 0000000000000002 R11: 0000000000000246 R12: ffffe2000ff6b680
> R13: ffffe2000ff88000 R14: ffff8103fe08c160 R15: ffff8103fe08fb10
> FS: 00007f20b83996f0(0000) GS:ffff8103ff028000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff8103fe200000 CR3: 00000007fe0c7000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>
> Call Trace:
> [<ffffffff8027b693>] ? do_wp_page+0x103/0x570
> [<ffffffff8027e4cf>] handle_mm_fault+0x5cf/0x7f0
> [<ffffffff804a1cdf>] do_page_fault+0x26f/0x8d0
> [<ffffffff8049fbd9>] error_exit+0x0/0x51
>
> ---------------------------------------------------------------------------
>
> This one is from starting a shell script 'runkbuild' to run parallel
> kernel builds in a loop. It never got as far as starting any make.
> Don't know whether I can trust the RIP.
>
> I have also seen hangs in get_page_from_freelist() which make more sense
> to me. Perhaps failure to unlock a zone lru_lock?
Hrm, interesting. Barring an obvious thinko, can you bisect it at all?
If it was introduced in mainline between 2.6.25-rc2 and -rc3, the
bisection shouldn't take too long.
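
If it does bisect, something along these lines should narrow it down
quickly (a sketch, assuming a clone of mainline with the -rc tags
available, and assuming plain 2.6.25-rc2 turns out to be good):

    git bisect start
    git bisect bad v2.6.25-rc3     # first mainline kernel you saw hang
    git bisect good v2.6.25-rc2    # presumed good, per the above
    # build and boot each revision git proposes, rerun the grow/shrink
    # test, then mark it:  git bisect good  (or)  git bisect bad
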
> I've been looking through the hugepage allocation/freeing functions and
> haven't seen anything that jumps out at me.
I don't see anything obvious either. You don't get any softlockups
without first growing and shrinking the pool? How about only growing it?
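
Something like this grow-only variant of your test might tell us
whether the shrink is actually required (a sketch, assuming N=64 as in
your setup):

    # grow only -- never shrink the pool
    echo 64 >/proc/sys/vm/nr_hugepages

    # versus the full grow/shrink cycle from your reproducer
    echo 64 >/proc/sys/vm/nr_hugepages
    echo 0 >/proc/sys/vm/nr_hugepages
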
> I took a look at the recent hugetlb patches from Adam and Nish, but none
> seemed to address this symptom. I don't think I'm dealing with surplus
> pages here.
If /proc/sys/vm/nr_overcommit_hugepages = 0, then no, you're not.
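
Easy enough to confirm from /proc before and after the echoes (with the
overcommit knob at its default, the surplus count should stay 0):

    cat /proc/sys/vm/nr_overcommit_hugepages
    grep Huge /proc/meminfo
    # HugePages_Surp: 0  ==> no surplus pages involved
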
Thanks,
Nish
--
Nishanth Aravamudan <nacc@...ibm.com>
IBM Linux Technology Center