Message-ID: <20080624220335.GA8039@sgi.com>
Date:	Tue, 24 Jun 2008 17:03:35 -0500
From:	Jack Steiner <steiner@....com>
To:	Ingo Molnar <mingo@...e.hu>
Cc:	tglx@...utronix.de, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, Mike Travis <travis@....com>
Subject: Re: [bug] Re: [PATCH] - Fix stack overflow for large values of MAX_APICS


On Tue, Jun 24, 2008 at 12:24:01PM +0200, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@...e.hu> wrote:
> 
> > * Jack Steiner <steiner@....com> wrote:
> > 
> > > physid_mask_of_physid() causes a huge stack (12k) to be created if 
> > > the number of APICS is large. Replace physid_mask_of_physid() with a 
> > > new function that does not create large stacks. This is a problem 
> > > only on large x86_64 systems.
> > 
> > this indeed fixes the crash i reported here:
> > 
> >    http://lkml.org/lkml/2008/6/19/98
> > 
> > so i've added both this and the MAXAPICS patch to tip/x86/uv, and will 
> > test it some more. Lets hope it goes all well this time :-)
> 
> -tip auto-testing found a new boot failure on x86 which happens if 
> NR_CPUS is changed from 8 to 4096. The hang goes like this:
> 

Still looking but here is what I have found so far.

The most obvious change was to revert the patch that changed MAX_APICS to
32k. With this patch reverted, the system still hangs at the same spot.
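
For context on why the MAX_APICS value matters to the original stack
overflow, here is a minimal userspace sketch. It is not the kernel code:
the names MAX_APICS_SKETCH, physid_mask_sketch, mask_of_id_by_value() and
mask_set_id_in_place() are hypothetical, and the layout is only assumed to
resemble a fixed-size bitmap with one bit per APIC id. With 32k ids, one
such mask is 4096 bytes, so a helper that builds and returns one by value
needs a large temporary, while a helper that writes into caller-provided
storage does not.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the "32k" MAX_APICS value discussed above. */
#define MAX_APICS_SKETCH 32768
#define BITS_PER_WORD    (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS       ((MAX_APICS_SKETCH + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Assumed layout: one bit per possible APIC id (32768 bits = 4096 bytes). */
struct physid_mask_sketch {
    unsigned long bits[MASK_WORDS];
};

/*
 * By-value style: a full-size temporary is built locally and then copied
 * back to the caller, so each use costs at least one mask-sized object.
 */
static struct physid_mask_sketch mask_of_id_by_value(unsigned int id)
{
    struct physid_mask_sketch m;

    assert(id < MAX_APICS_SKETCH);
    memset(&m, 0, sizeof(m));
    m.bits[id / BITS_PER_WORD] |= 1UL << (id % BITS_PER_WORD);
    return m;
}

/*
 * In-place style: the caller supplies the storage (for example a static
 * object), so the helper itself needs no mask-sized temporary.
 */
static void mask_set_id_in_place(struct physid_mask_sketch *m, unsigned int id)
{
    assert(id < MAX_APICS_SKETCH);
    memset(m, 0, sizeof(*m));
    m->bits[id / BITS_PER_WORD] |= 1UL << (id % BITS_PER_WORD);
}
```

Both helpers produce the same mask. Only where the 4096-byte object lives
differs, which is the property a replacement for a by-value helper would
rely on. How many such temporaries the compiler actually keeps live in a
given function is not something this sketch shows.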

I noticed that the hang is random. It usually occurs at acpi_event_init()
but sometimes it hangs at a different place.

I also observed that the hang does not always occur. When it does not,
the system boots to the point of mounting /root, then panics because the
mount fails. I expect that this is a different failure due to missing
drivers. I'll chase that down later.


I added trace code and isolated the hang to a call to synchronize_rcu(),
usually made from netlink_change_ngroups().

If I boot with "maxcpus=1", it never hangs (obviously) but always fails
to mount /root.

Next I changed NR_CPUS to 128. I still see random hangs at the call
to acpi_event_init().


I'll chase this more tomorrow. Has anyone else seen any failures that
might be related?




>  Linux version 2.6.26-rc7-tip (mingo@...ne) (gcc version 4.2.3) #10233 SMP
>  Tue Jun 24 12:13:46 CEST 2008
>  [...]
>  initcall init_mnt_writers+0x0/0x8c returned 0 after 0 msecs
>  calling  eventpoll_init+0x0/0x9a
>  initcall eventpoll_init+0x0/0x9a returned 0 after 0 msecs
>  calling  anon_inode_init+0x0/0x11a
>  initcall anon_inode_init+0x0/0x11a returned 0 after 0 msecs
>  calling  pcie_aspm_init+0x0/0x27
>  initcall pcie_aspm_init+0x0/0x27 returned 0 after 0 msecs
>  calling  acpi_event_init+0x0/0x57
>  [... hard hang ...]
> 
> on a good bootup, it would continue like this:
> 
>  initcall acpi_event_init+0x0/0x57 returned 0 after 38 msecs
>  calling  pnp_system_init+0x0/0x17
>  [...]
> 
> the config, full bootlog and reproducer bzImage is at:
> 
>   http://redhat.com/~mingo/misc/config-Tue_Jun_24_07_44_17_CEST_2008.bad
>   http://redhat.com/~mingo/misc/log-Tue_Jun_24_07_44_17_CEST_2008.bad
>   http://redhat.com/~mingo/misc/bzImage-Tue_Jun_24_07_44_17_CEST_2008.bad
> 
> changing CONFIG_NR_CPUS from 4096 to 8 causes the system to boot up 
> fine.
> 
> 	Ingo