linux-kernel - XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC

Message-ID: <20160817203238.GA9408@char.us.oracle.com>
Date:	Wed, 17 Aug 2016 16:32:38 -0400
From:	Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>
To:	xen-devel@...ts.xensource.com, chuck.anderson@...cle.com,
	Boris Ostrovsky <boris.ostrovsky@...cle.com>,
	david.vrabel@...rix.com, jgross@...e.com, jbeulich@...e.com,
	stefan.bader@...onical.com
Cc:	linux-kernel@...r.kernel.org
Subject: XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC

Hey Jan, et al.,

One of the interesting things about the XSA 154 fix ("x86: enforce consistent
cachability of MMIO mappings") is that when certain applications (mcelog)
map /dev/mem and poke around in the ISA region, we get:

[   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()
[   49.399055] Modules linked in: bnx2fc fcoe libfcoe libfc 8021q mrp garp stp llc bonding dm_multipath vfat fat iTCO_wdt iTCO_vendor_support pcspkr ipmi_devintf ipmi_si ipmi_msghandler sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma sg ext4 jbd2 mbcache sr_mod cdrom sd_mod usb_storage ahci libahci megaraid_sas qla2xxx scsi_transport_fc crc32c_intel be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi ipv6 cxgb3 libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi ixgbe dca ptp pps_core vxlan udp_tunnel ip6_udp_tunnel mdio dm_mirror dm_region_hash dm_log dm_mod
[   49.399131] CPU: 0 PID: 2471 Comm: mcelog Not tainted 4.1
[   49.399134] Hardware name: Oracle Corporation SUN SERVER X4-2       /ASSY,MB,X4-2, 1U      , BIOS 25030100 04/15/2015
[   49.399138]  0000000000000000 ffff880074673c28 ffffffff816c66f0 0000000000000000
[   49.399143]  0000000000000391 ffff880074673c68 ffffffff81084745 ffff880074673c78
[   49.399148]  ffff88014b625db0 0000000000000000 ffff880074673d58 00007f39290ab000
[   49.399152] Call Trace:
[   49.399166]  [<ffffffff816c66f0>] dump_stack+0x63/0x83
[   49.399175]  [<ffffffff81084745>] warn_slowpath_common+0x95/0xe0
[   49.399180]  [<ffffffff810847aa>] warn_slowpath_null+0x1a/0x20
[   49.399183]  [<ffffffff810725f3>] untrack_pfn+0x93/0xc0
[   49.399190]  [<ffffffff811b90f9>] unmap_single_vma+0xa9/0x100
[   49.399194]  [<ffffffff811b9644>] unmap_vmas+0x54/0xa0
[   49.399199]  [<ffffffff811bf0da>] exit_mmap+0x9a/0x150
[   49.399204]  [<ffffffff810825d3>] mmput+0x73/0x110
[   49.399208]  [<ffffffff81082775>] dup_mm+0x105/0x110
[   49.399213]  [<ffffffff81083b1d>] copy_process+0x11ed/0x1240
[   49.399218]  [<ffffffff81084009>] do_fork+0x79/0x280
[   49.399226]  [<ffffffff810259d3>] ? syscall_trace_enter_phase1+0x153/0x180
[   49.399231]  [<ffffffff81084226>] SyS_clone+0x16/0x20
[   49.399235]  [<ffffffff816cb3ee>] system_call_fastpath+0x12/0x71
[   49.399239] ---[ end trace a61cd3d271a53a54 ]---

The reason is that the Linux kernel assumes that the range from 640KB -> 1MB can
be mapped as write-back (see is_new_memtype_allowed and x86_platform.is_untracked_pat_range).
But with the XSA 154 fix Xen enforces the uncached mode, and Linux complains.
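
To make that assumption concrete, here is a minimal user-space sketch of the
kind of range check that sits behind x86_platform.is_untracked_pat_range on
bare metal. The 640KB and 1MB boundaries are the ones discussed above; the
names ISA_START, ISA_END and range_is_untracked, and the exact form of the
comparison, are illustrative assumptions and not a copy of the kernel source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Boundaries of the legacy ISA hole: 640KB up to 1MB. */
#define ISA_START 0xa0000ULL   /* 640KB */
#define ISA_END   0x100000ULL  /* 1MB   */

/*
 * Hypothetical model of the bare-metal check: returns true when the
 * half-open range [start, end) lies entirely inside the ISA hole,
 * i.e. when the PAT code would skip memtype tracking for it.
 */
static bool range_is_untracked(uint64_t start, uint64_t end)
{
	return start >= ISA_START && end <= ISA_END;
}
```

A mapping that is untracked in this sense is the one Linux expects to be able
to use as WB, which is exactly where it collides with the enforced UC.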

With mmio-relax=1, Linux gets its way and is happy.

With the patch below:

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 70a38c1..e5ff5a5 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -288,6 +287,12 @@ static void __init xen_banner(void)
 	       version >> 16, version & 0xffff, extra.extraversion,
 	       xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
 }
+
+static bool xen_ignore(u64 s, u64 e)
+{
+	return false;
+}
+
 /* Check if running on Xen version (major, minor) or later */
 bool
 xen_running_on_version_or_later(unsigned int major, unsigned int minor)
@@ -1563,7 +1570,7 @@ asmlinkage __visible void __init xen_start_kernel(void)
 		x86_init.resources.memory_setup = xen_memory_setup;
 	x86_init.oem.arch_setup = xen_arch_setup;
 	x86_init.oem.banner = xen_banner;
-
+	x86_platform.is_untracked_pat_range = xen_ignore;
 	xen_init_time_ops();
 
 	/*

Things work much better - as we don't treat the 640KB->1MB region specially.
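
For anyone who wants to see the effect of swapping the hook without booting a
kernel, here is a rough user-space model. Everything in it apart from the
name xen_ignore (struct platform_ops, isa_range_untracked, must_track) is a
hypothetical stand-in for the real x86_platform machinery, so treat it as a
sketch of the idea rather than the kernel's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the x86_platform hook table. */
struct platform_ops {
	bool (*is_untracked_pat_range)(uint64_t start, uint64_t end);
};

/* Assumed bare-metal default: 640KB..1MB is exempt from tracking. */
static bool isa_range_untracked(uint64_t start, uint64_t end)
{
	return start >= 0xa0000ULL && end <= 0x100000ULL;
}

/* Xen PV override, as in the patch: no range is ever exempt. */
static bool xen_ignore(uint64_t start, uint64_t end)
{
	(void)start;
	(void)end;
	return false;
}

/* True when [start, end) has to go through memtype tracking. */
static bool must_track(const struct platform_ops *ops,
		       uint64_t start, uint64_t end)
{
	return !ops->is_untracked_pat_range(start, end);
}
```

With the default hook a mapping inside the ISA hole bypasses tracking; with
xen_ignore installed it is tracked like any other range, which is the
behaviour the patch is after.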

Anyhow what I am wondering:

 a) Should we add an edge case in the hypervisor to allow multiple mappings
    for this region? I am thinking no, but it sounds like mapping the ISA
    region as WB is safe even on bare metal?

 b) Or would it be better to let Linux do its thing and treat 640KB->1MB
   as uncached instead of writeback?

   Looking at the kernel it assumes that WB is ok for 640KB->1MB.
   The comment says:
   "/* Low ISA region is always mapped WB in page table. No need to track */"

   which is probably true on baremetal. But with Xen PV:

        /*
         * In domU, the ISA region is normal, usable memory, but we
         * reserve ISA memory anyway because too many things poke
         * about in there.
         */
        e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
                        E820_RESERVED);

   which would imply we don't have any page table mappings.

   And then the quick fix I provided above looks like the right solution?


CC-ing Boris, Daniel, Juergen, Steve, and Chuck.