linux-kernel - Re: [PATCH 2/2] x86/mtrr: Refactor PAT initialization code

Message-ID: <56E97F66.9080701@stratus.com>
Date:	Wed, 16 Mar 2016 11:44:38 -0400
From:	Joe Lawrence <joe.lawrence@...atus.com>
To:	Toshi Kani <toshi.kani@....com>,
	"Luis R. Rodriguez" <mcgrof@...e.com>
CC:	Ingo Molnar <mingo@...nel.org>, "bp@...e.de" <bp@...e.de>,
	"hpa@...or.com" <hpa@...or.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"jgross@...e.com" <jgross@...e.com>,
	"paul.gortmaker@...driver.com" <paul.gortmaker@...driver.com>,
	"x86@...nel.org" <x86@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Andy Lutomirski <luto@...capital.net>
Subject: Re: [PATCH 2/2] x86/mtrr: Refactor PAT initialization code

On 03/14/2016 08:37 PM, Toshi Kani wrote:
[... snip ...]
>> Joe at Stratus also hit this issue but on a system where MTRR is enabled.
>> He sent his report only to me as he thought it was caused by the
>> ioremap_wc() changes and his driver was one that got it. In his case
>> though he modified the driver significantly, and upon inspection of that
>> code saw how it used a secondary backup PCI device for failover for a
>> framebuffer device... The changes to the driver in place are rather
>> complex though and as such it made no sense to further review unless he
>> moved his changes upstream.  It is still worth noting this issue has been
>> seen elsewhere, but the root cause is still not known.  The error Joe
>> got is:
>>
>> x86/PAT: Xorg:37506 map pfn expected mapping type uncached-minus for [mem
>> 0x9f000000-0x9f7fffff], got write-combining
>>
>> Even though the driver is custom (and actually I even saw another
>> unrelated proprietary driver loaded) I figured it's worth noting others
>> have seen this error without MTRR being disabled.
> 
> The error message looks the same.  So, this could be the same issue if WC
> is redirected to UC without disabling PAT properly on his env.  I need a
> whole dmesg output to confirm if this is the case.  Another way to hit this
> error is that the driver called remap_pfn_range() with UC to a range where
> WC map was set by ioremap_wc() already.
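
As a rough illustration of the second scenario described above (a UC-
request landing on a range that is already tracked as WC), here is a toy
userspace model in C.  It is not the kernel's PAT code: every toy_* name
is hypothetical, the range list stands in for whatever bookkeeping the
kernel really keeps, and the real checks are more involved than this.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical cache types; only the two named in the report. */
enum toy_memtype { TOY_UC_MINUS, TOY_WC };

/* One tracked physical range; start and end are both inclusive. */
struct toy_range {
    uint64_t start, end;
    enum toy_memtype type;
};

static const char *toy_name(enum toy_memtype t)
{
    return t == TOY_WC ? "write-combining" : "uncached-minus";
}

/*
 * Compare a requested mapping type against the tracked ranges.
 * Returns 0 if nothing overlapping disagrees, -1 on the first conflict,
 * in which case msg holds text shaped like the reported error.
 */
static int toy_check_request(const struct toy_range *tracked, size_t n,
                             uint64_t start, uint64_t end,
                             enum toy_memtype want, char *msg, size_t len)
{
    if (len > 0)
        msg[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        int overlaps = start <= tracked[i].end && tracked[i].start <= end;

        if (overlaps && tracked[i].type != want) {
            snprintf(msg, len,
                     "expected mapping type %s for [mem %#010llx-%#010llx], got %s",
                     toy_name(want),
                     (unsigned long long)start, (unsigned long long)end,
                     toy_name(tracked[i].type));
            return -1;
        }
    }
    return 0;
}
```

With a single WC entry covering 0x9f000000-0x9f7fffff, a TOY_UC_MINUS
request for the same range returns -1, while a TOY_WC request returns 0.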

As mentioned in the other thread, our driver is very custom and performs
some page table parlor tricks to failover the frame buffer from one PCI
adapter to another.  I need some time to fully digest these
customizations before I can really be of much help here.  My current
theory is that we have a bug when we do:

  ioremap_wc( backup adapter fb )
  iounmap   ( backup adapter fb )      << I'm unclear why this occurs
  ioremap_wc( primary adapter fb )

To fail over, we stop_machine and remap the framebuffer.  Heavily
summarized: iterate through its pages, lookup_address to get a ptep, get
its __pgprot, and then rewrite the pte with the new physical address and
the old pgprot_val.  A TLB flush follows on the way out of stop_machine.
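
The per-page rewrite summarized above can be modelled in plain userspace
C.  This is only a sketch, not the driver code: the toy_* names are
hypothetical, the mask assumes an x86-64 4K PTE with the physical frame
in bits 12..51, and a bare uint64_t array stands in for the page table
(no stop_machine, lookup_address or TLB flush here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE  4096ULL
#define TOY_PFN_MASK   0x000ffffffffff000ULL  /* assumed: frame in bits 12..51 */
#define TOY_FLAGS_MASK (~TOY_PFN_MASK)        /* everything else: protection bits */

/* Keep the old protection bits and swap in a new, page-aligned address. */
static uint64_t toy_retarget_pte(uint64_t old_pte, uint64_t new_phys)
{
    assert((new_phys & ~TOY_PFN_MASK) == 0);
    return new_phys | (old_pte & TOY_FLAGS_MASK);
}

/* Walk the pages of the mapping and point each one at the new adapter. */
static void toy_retarget_range(uint64_t *ptes, size_t npages, uint64_t new_base)
{
    for (size_t i = 0; i < npages; i++)
        ptes[i] = toy_retarget_pte(ptes[i], new_base + i * TOY_PAGE_SIZE);
}
```

The point of the model is that the cache-attribute bits travel with the
old pte, so the new physical range picks up the old caching type without
any separate bookkeeping for that range.  Whether that is what provokes
the warning is still only a theory.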

Since this is all out-of-tree custom work, I'm not asking for any
support.  If we're too far off the reservation to be useful to the
conversation here, no worries.  I'm just trying to help if there is an
upstream bug.

>> The second thread you referred to seems to say that if you built-in the
>> code the error does not come up. What the hell. Joe, can you try building
>> your driver built-in to see if you also see this go away? Even though I
>> don't want to support your custom hacked up driver I do want to know if
>> your issue goes away with built-in as well.
> 
> I do not have sufficient info to support this case, and do not have a
> technical explanation for it, either.

Luis -- I tried last night to reproduce (built as a module) without any
luck.  Going back through my test logs, it looks like the warning is
relatively sporadic and may take a few days to hit.  Without having a
better repro case, I think my warning is caused by that dodgy page table
work mentioned above (or at least timing related).  Let me see if I can
try to provoke the warning into happening more frequently first.

Regards,

-- Joe