linux-kernel - Re: exit_mmap BUG_ON in 2.6.23 (and Add qdisc __NET_XMIT

Message-ID: <1338489169.41890.YahooMailNeo@web121305.mail.ne1.yahoo.com>
Date:	Thu, 31 May 2012 11:32:49 -0700 (PDT)
From:	Sam Portolla <samportolla@...oo.com>
To:	Hugh Dickins <hughd@...gle.com>
Cc:	Eric Dumazet <eric.dumazet@...il.com>,
	"kaber@...sh.net" <kaber@...sh.net>,
	"davem@...emloft.net" <davem@...emloft.net>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"samPortolla@...oo.com" <samPortolla@...oo.com>
Subject: Re: exit_mmap BUG_ON in 2.6.23 (and Add qdisc __NET_XMIT_STOLEN)

[please cc samPortolla@...oo.com on the reply, as I am not a member of this list]

----- Original Message -----
From: Hugh Dickins <hughd@...gle.com>
To: Sam Portolla <samportolla@...oo.com>
Cc: Eric Dumazet <eric.dumazet@...il.com>; "kaber@...sh.net" <kaber@...sh.net>; "jarkao2@...il.com" <jarkao2@...il.com>; "davem@...emloft.net" <davem@...emloft.net>; "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Sent: Saturday, May 26, 2012 11:06 AM
Subject: Re: exit_mmap BUG_ON in 2.6.23 (and Add qdisc __NET_XMIT_STOLEN)

On Fri, 25 May 2012, Sam Portolla wrote:
> 
> commit 378a2f090f7a478704a372a4869b8a9ac206234e
> Date:   Mon Aug 4 22:31:03 2008 -0700
> net_sched: Add qdisc __NET_XMIT_STOLEN flag
...
> 
>  I wonder if the lack of above patch in our code base could explain the
>  exit_mmap() BUG_ON as well due to memory corruption causing MMU to not
>  be able to locate the page(s) it had to free. NR_PTES keeps track of
>  that? Could you explain that more?

I concur with Eric in thinking it unlikely - though (unlike Eric)
I know far too little about networking to comment with authority.

I'd guess that there have been literally hundreds of fixes gone into
the kernel since 2.6.23, each more likely to be the fix to such memory
corruption than this one.  And I could also be wrong in attributing
your BUG to memory corruption: perhaps I'm forgetting an mm fix.

You ask me to explain more: mm->nr_ptes keeps track of the number of
page tables that have been allocated; when we free the mm, we should
be freeing exactly the number of page tables we allocated earlier,
but a bug in the code maintaining the vmas or the page tables might
break that, hence the BUG_ON to test.  But equally, if there has been
memory corruption of vmas or of higher-level page tables, we may now
be unable to locate all the page tables we allocated earlier, and so
hit the BUG_ON for that reason.
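
The accounting described above can be sketched in userspace C. This is a toy model only, not kernel code: `toy_mm`, `toy_alloc_pte`, `toy_exit_mmap` and `toy_corrupt_slot` are hypothetical names invented for illustration, and the real check in exit_mmap() may differ in detail. It shows a counter that should return to zero at teardown, and how losing a pointer in the upper-level table leaves it non-zero.

```c
/* Userspace toy model, NOT kernel code. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_SLOTS 8              /* entries in the toy upper-level table */

struct toy_mm {
    void *slots[TOY_SLOTS];      /* upper level: pointers to page tables */
    long nr_ptes;                /* page tables currently accounted for */
};

/* Allocate a page table for slot i and count it. Returns false on failure. */
static bool toy_alloc_pte(struct toy_mm *mm, size_t i)
{
    assert(mm != NULL);
    if (i >= TOY_SLOTS || mm->slots[i] != NULL)
        return false;
    mm->slots[i] = calloc(1, 64);
    if (mm->slots[i] == NULL)
        return false;
    mm->nr_ptes++;
    return true;
}

/* Teardown: free every page table that can still be located and uncount it.
 * Returns the leftover count. In this toy, a non-zero result stands in for
 * the condition that the BUG_ON under discussion tests for. */
static long toy_exit_mmap(struct toy_mm *mm)
{
    assert(mm != NULL);
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        if (mm->slots[i] != NULL) {
            free(mm->slots[i]);
            mm->slots[i] = NULL;
            mm->nr_ptes--;
        }
    }
    return mm->nr_ptes;
}

/* Simulate corruption of the upper level: the pointer is overwritten, so the
 * page table can no longer be found at teardown (and is leaked on purpose). */
static void toy_corrupt_slot(struct toy_mm *mm, size_t i)
{
    assert(mm != NULL);
    if (i < TOY_SLOTS)
        mm->slots[i] = NULL;
}
```

With no corruption, toy_exit_mmap() returns 0. If toy_corrupt_slot() is called on an allocated slot first, teardown cannot find that table and the leftover count stays at 1, which is the second failure mode described above.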

Would I be unfair to characterize this as a problem seen once at a
customer site in the 4.5 years since 2.6.23 was released?

As I said before, please just change that BUG_ON to WARN_ON, and
wait to see if more such issues come up: if they do, then you can
start to look for a pattern.

Hi Hugh, the concern I have with changing BUG_ON to WARN_ON is one you mentioned earlier in the thread.
If there is memory corruption, BUG_ON causes a system reboot and a clean start. WARN_ON won't, and we might end up crashing again somewhere totally unrelated, possibly much later; i.e. the impact of this change is unknown. I know you know this area 100 times better than I do, but this is my concern, and I am ready to be corrected, by all means.
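
The trade-off above can be sketched with userspace stand-ins for the two macros. This is only an illustration under stated assumptions: `TOY_BUG_ON`, `TOY_WARN_ON`, `toy_warn_on`, `toy_warn_count` and `toy_teardown_check` are hypothetical names, and the kernel's real BUG_ON/WARN_ON do considerably more (stack dump, taint, optional panic).

```c
/* Userspace sketch, NOT the kernel macros. All names are hypothetical. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int toy_warn_count;       /* how many warnings have fired so far */

/* Fatal check: report and stop at once, so nothing runs on top of state
 * that may already be corrupted. */
#define TOY_BUG_ON(cond)                                              \
    do {                                                              \
        if (cond) {                                                   \
            fprintf(stderr, "BUG at %s:%d\n", __FILE__, __LINE__);    \
            abort();                                                  \
        }                                                             \
    } while (0)

/* Non-fatal check: report, count, and hand the condition back to the caller,
 * who keeps running and must decide what to do next. */
static int toy_warn_on(int cond, const char *file, int line)
{
    assert(file != NULL);
    if (cond) {
        toy_warn_count++;
        fprintf(stderr, "WARNING at %s:%d\n", file, line);
    }
    return cond;
}
#define TOY_WARN_ON(cond) toy_warn_on(!!(cond), __FILE__, __LINE__)

/* Teardown check written in the WARN_ON style: execution continues, and the
 * return value tells the caller whether the accounting was off. */
static int toy_teardown_check(long nr_ptes_left)
{
    if (TOY_WARN_ON(nr_ptes_left != 0))
        return -1;
    return 0;
}
```

Swapping TOY_WARN_ON for TOY_BUG_ON in toy_teardown_check() would end the process at the first mismatch instead. The warning variant keeps the system up and leaves a count of occurrences to look for a pattern in, at the cost of continuing to run with whatever corruption caused the mismatch.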

Hugh and Eric, 

Also, I cannot get it out of my head that there were 3 kernel crashes on the same system within 1 hour, all of them in different areas of the kernel and all of them just after the ethernet driver printed a transmit timeout message, which is printed when the network layer sees the transmit queues stopped. It seems really unlikely that 3 separate root causes exist in such a scenario. Therefore I keep thinking that whatever caused the transmit timeout also caused kernel memory corruption, which then manifested in different ways, namely: a crash in the corresponding ethernet driver due to a NULL pointer access, this BUG_ON, and another NULL pointer access in buffer.c for the 3rd crash. The way I can think of unifying all these is that possibly the QDISC bug caused memory corruption AND also triggered the transmit timeout, as it messed up the Tx queue to the device. We know that when that QDISC issue originally happened a couple of years ago, the same BNX2 driver had a NULL pointer access in its SKB area, because of what I explained in my reply to Eric yesterday. So we may have a pure coincidence here, but in the absence of reproducibility and of a core file to analyze, and based on the reasons above, I am really interested in this QDISC problem. Eric, could you kindly respond to the email I sent yesterday with the above background in mind? Regards to both of you.

Hugh 