linux-kernel - Re: system gets stuck in a lock during boot

Message-ID: <4ACB5E6B.1010601@gmail.com>
Date:	Tue, 06 Oct 2009 08:12:43 -0700
From:	"Justin P. Mattock" <justinmattock@...il.com>
To:	Jason Baron <jbaron@...hat.com>
CC:	Ingo Molnar <mingo@...e.hu>, Peter Zijlstra <peterz@...radead.org>,
	Li Zefan <lizf@...fujitsu.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: system gets stuck in a lock during boot

Jason Baron wrote:
> On Mon, Oct 05, 2009 at 06:00:41PM -0700, Justin P. Mattock wrote:
>    
>> Justin Mattock wrote:
>>      
>>> On Sun, Oct 4, 2009 at 10:41 AM, Ingo Molnar<mingo@...e.hu>   wrote:
>>>
>>>        
>>>> * Jason Baron<jbaron@...hat.com>   wrote:
>>>>
>>>>
>>>>          
>>>>> On Mon, Sep 07, 2009 at 02:49:44PM -0700, Justin Mattock wrote:
>>>>>
>>>>>            
>>>>>>>> * Justin P. Mattock<justinmattock@...il.com>     wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>>>> Ingo Molnar wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>>>> * Justin Mattock<justinmattock@...il.com>      wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                      
>>>>>>>>>>> O.K. I feel better, deleted
>>>>>>>>>>> my system, and threw in a minimal built system
>>>>>>>>>>> with only the bare essentials to boot.
>>>>>>>>>>> (just to make sure things are correct).
>>>>>>>>>>>
>>>>>>>>>>> unfortunately after building rc6 I'm still hitting
>>>>>>>>>>> this. really am not sure why this is happening.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>                        
>>>>>>>>>> Could you please double-check the bisection result by doing this:
>>>>>>>>>>
>>>>>>>>>>     git revert af6af30c0f
>>>>>>>>>>
>>>>>>>>>> on the latest kernel and seeing whether that fixes the lockup?
>>>>>>>>>>
>>>>>>>>>> Bisections are very efficient and hence very sensitive as well to
>>>>>>>>>> minimal errors. Just one small mistake near the end of a bisection
>>>>>>>>>> can blame the wrong commit.
>>>>>>>>>>
>>>>>>>>>> So the best way to double-check such 100%-triggerable crashes is to
>>>>>>>>>> do the revert. I tried the revert and it can be done fine here.
>>>>>>>>>>
>>>>>>>>>> [ _If_ that does not fix the bug then to save time you can
>>>>>>>>>>       'backtrack' the bisection, instead of re-doing it completely.
>>>>>>>>>>       I.e. you have your bisection log, re-check the final steps going
>>>>>>>>>>       backwards. Once you find a discrepancy (i.e. a 'bad' point that
>>>>>>>>>>       is 'good' or the other way around), redo the bisection log
>>>>>>>>>>       commands up to that point and continue it up to the end. ]
>>>>>>>>>>
>>>>>>>>>>          Ingo
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                      
>>>>>>>>> shoot, I did not see your post here. when looking at my bisect
>>>>>>>>> log, I guess after a git bisect reset it clears?
>>>>>>>>>
>>>>>>>>> Anyways after git bisect had finished I looked manually at the
>>>>>>>>> commits that it had generated the one which I had sent in a post
>>>>>>>>> previously, and this one:
>>>>>>>>>
>>>>>>>>>    9424edc2da097c8589fcc24a72552d33e54be161
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>> (this commit has no effect on your kernel image, at all.)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>> yep. but it was worth a try.
>>>>>>>
>>>>>>>                
>>>>>>>>> at the time looking at the commit, I see this to be more of the
>>>>>>>>> cause because of it being related to elf as so forth, but as soon
>>>>>>>>> as I reverted this on rc6 made no difference.(the previous commit
>>>>>>>>> fixes this for me, on a regular tar.ball as well as in git.
>>>>>>>>>
>>>>>>>>> I think at this point since this system is a fresh from scratch
>>>>>>>>> build, I think something might be wrong that I'm doing (all the
>>>>>>>>> CFLAGS, and such are in a previous post).
>>>>>>>>>
>>>>>>>>> At the moment I don't have a problem applying a patch to the
>>>>>>>>> kernel for this. especially since I'm the only one that seems to
>>>>>>>>> be hitting this, then if more and more reports of this happen then
>>>>>>>>> we can go from there.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                    
>>>>>>>> What would be nice is to verify your bisection end result, i.e. do
>>>>>>>> what i suggested:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>> yeah I've done this on both kernels three to be exact, and all boot after
>>>>>>> reverting
>>>>>>> Fix perf-tracepoint OOPS.
>>>>>>>
>>>>>>> As for my system, I'm still convinced that I might be doing something wrong
>>>>>>> over here.
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>>>>>> Could you please double-check the bisection result by doing this:
>>>>>>>>>>
>>>>>>>>>>     git revert af6af30c0f
>>>>>>>>>>
>>>>>>>>>> on the latest kernel and seeing whether that fixes the lockup?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                      
>>>>>>>> if this doesnt fix it on latest -git then this commit is not the
>>>>>>>> cause of the lockup.
>>>>>>>>
>>>>>>>>          Ingo
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>                  
>>>>>>> This commit(Fix perf-tracepoint OOPS.)does fix my stuckage, but I'm left, as
>>>>>>> well as others asking
>>>>>>> the question of why.
>>>>>>> In any case I still think I'm setting something wrong with either gcc, or
>>>>>>> something
>>>>>>> that might be causing this from userland.
>>>>>>>
>>>>>>> Justin P. Mattock
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>> O.k. here something awkward about this issue I was
>>>>>> experiencing. at the moment I have two imac's
>>>>>> here the descriptions:
>>>>>>
>>>>>> imac A) the one with the problem
>>>>>>
>>>>>> OS: built from the clfs book
>>>>>> x86_64 multilib with only lib64
>>>>>>
>>>>>> built everything with these flags:
>>>>>> CFLAGS="-m64 -mtune=core2 -march=core2
>>>>>> -mfpmath=both -O2 -pipe -fomit-frame-pointer
>>>>>> -fstack-protection"
>>>>>> CXXFLAGS="${CFLAGS}" MAKEOPTS="{-j3}"
>>>>>> while compiling everything with
>>>>>> gcc version: 4.5.0 20090730
>>>>>>
>>>>>>
>>>>>> imac B) the one that works
>>>>>>
>>>>>> OS: clfs(just built a few days ago)
>>>>>> x86_64 pure64 bit build
>>>>>> (lib with a symlink to lib64)
>>>>>> CFLAGS="-m64 -mtune=core2 -march=core2
>>>>>>    -O2 -pipe -fomit-frame-pointer"
>>>>>> CXXFLAGS="${CFLAGS}" MAKEOPTS="{-j3}"
>>>>>> gcc version: 4.4.1 (GCC for Cross-LFS 4.4.1.20090722)
>>>>>>
>>>>>> The only things I can think of is either I hit something
>>>>>> because of gcc, something goes wrong with the libraries,
>>>>>> or there something happening with either the option
>>>>>> of mfpmath=both or stackprotection.
>>>>>>
>>>>>> At this point since the kernel seems to be running fine,
>>>>>> is to just trash the system that has this issue and just leave
>>>>>> it at, I was hitting some weird anomaly.
>>>>>>
>>>>>>
>>>>>>              
>>>>> hi Justin,
>>>>>
>>>>> I've been playing around with gcc '4.5' as well and hit a panic that
>>>>> looks very similar to what you've seen with stock 2.6.31 - I haven't
>>>>> seen it anywhere else. Anyways, it seems to be some sort of alignment
>>>>> issue with the 'struct ftrace_event_call'. I'm not sure yet if this is a
>>>>> compiler or kernel issue. But the following kernel patch fixes the issue
>>>>> for me. It would be interesting to verify if the patch also resolves the
>>>>> issue for you.
>>>>>
>>>>>            
>>>> Would be nice to know precisely what kind of problem is being hit here -
>>>> we'd like to fix either the kernel or GCC - depending on where the bug
>>>> lies.
>>>>
>>>>          Ingo
>>>>
>>>>
>>>>          
>>> So I wasn't going crazy....
>>> Anyways that system(clfs)
>>> I still have, I can go ahead and
>>> put it back on the machine and see if I hit this
>>> again(keep in mind, just got back from a 7hr drive,
>>> so it might be tomorrow).
>>>
>>>
>>>        
>> o.k. I put back on that system, and
>> hit the error. I add your patch to 2.6.31-rc6,
>>      
>
> ok. is that error, the same as the error below? The error below looks
> completely different from the posted previously. So, it almost looks
> like you the patch fixed one problem, only to reveal another one. Is
> that correct?
>
>    
It could be a different error. The problem I have is capturing it, e.g. I
tried ieee1394_dma=early, but that mechanism seems to error out (ssh is a
no-go either, because this happens so early).
I think this is the top part of the error, because before adding your patch
the system would boot a little farther down the line (too fast to read
anything), and I did see something in there about a kernel panic.

If you have any ideas on how I can capture this early, they would be
appreciated (getting anything to log this early is a bit tricky).
>> and the latest git(a few days old).
>> I still am hitting this, but with your patch
>> I'm able to see the beginning of this panic:
>> (Ill write it manually)
>>
>> [   2.523966] kernel panic - not syncing: No init found. try passing
>> init= option
>> to the kernel
>> [   2.524394] Pid: 1, comm: swapper Not tainted 2.6.31-rc6 #6
>> [   2.524633] Call Trace:
>> [   2.524875] [<ffffffff813a5b72>] panic+0x75/0x120
>> [   2.525119] [<ffffffff8100910f>] init_post+0xef/0xf5
>> [   2.525357] [<ffffffff815f6cf0>] kernel_init+0x198/0x1a3
>> [   2.525600] [<ffffffff8102410a>] child_rip+0xa/0x20
>> [   2.525842] [<ffffffff815f6b58>] ? kernel_init+0x0/0x1a3
>> [   2.526084] [<ffffffff81024100>] ? child_rip+0x0/0x20
>>
>> Seems I only hit this with using gcc 4.5.0 and compiling
>> sysvinit with SELinux support to load the policy at boot.
>> (here's the patch I used
>> http://readlist.com/lists/tycho.nsa.gov/selinux/3/15451.html).
>>
>> Sound's like gcc is doing something(correct me if I'm
>> wrong) because the other systems I have are using the same
>> packages except for and older version of gcc.
>> maybe  I should update sysvinit with a better patch to load the policy.
>>
>> Justin P. Mattock
>>      
>
>    
As a test I'll throw in a kernel that was compiled with gcc 4.4.0, just to
see whether this is a compiler or a kernel issue.

Justin P. Mattock
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/