[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <58C08535.3070000@iogearbox.net>
Date: Wed, 08 Mar 2017 23:27:01 +0100
From: Daniel Borkmann <daniel@...earbox.net>
To: Linus Torvalds <torvalds@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>, Peter Anvin <hpa@...or.com>
CC: Fengguang Wu <fengguang.wu@...el.com>,
Network Development <netdev@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>, LKP <lkp@...org>,
ast@...com, the arch/x86 maintainers <x86@...nel.org>,
keescook@...omium.org, labbott@...hat.com, davem@...emloft.net
Subject: Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request
at 0000a7cf
[ + Kees, Laura, and Dave ]
On 03/08/2017 08:25 PM, Linus Torvalds wrote:
> Adding x86 people too, since this seems to be something off about
> ARCH_HAS_SET_MEMORY for x86-32.
>
> The code seems to be shared between x86-32 and 64, I'm not seeing why
> set_memory_r[ow]() should fail on one but not the other.
>
> Considering that it seems to be flaky even on 32-bit, maybe it's
> timing-related, or possibly related to TLB sizes or whatever (ie more
> likely hidden by a larger TLB on more modern hardware?)
>
> Anyway, just looking at change_page_attr_set_clr(), I notice that the
> page alias checking treats NX specially:
>
> /* No alias checking for _NX bit modifications */
> checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != _PAGE_NX;
>
> which seems insane. Why would NX be different from other protection
> bits (like _PAGE_RW)?
>
> But that doesn't explain why the bpf code would have issues with this
> all only on x86-32.
>
> Maybe somebody else can see why ARCH_HAS_SET_MEMORY would depend on
> 64-bit only..
Last days I tried to narrow this down a bit, I have a dirty hack I
used just for the sake of experimenting when booting into Fengguang's
x86-32 test kernel via qemu:
lib/Makefile | 3 ++-
lib/test_setmem.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+), 1 deletion(-)
create mode 100644 lib/test_setmem.c
diff --git a/lib/Makefile b/lib/Makefile
index 320ac46a..702af98 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,7 +41,8 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
bsearch.o find_bit.o llist.o memweight.o kfifo.o \
percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
- once.o refcount.o
+ once.o refcount.o test_setmem.o
+
obj-y += string_helpers.o
obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
obj-y += hexdump.o
diff --git a/lib/test_setmem.c b/lib/test_setmem.c
new file mode 100644
index 0000000..82b08db
--- /dev/null
+++ b/lib/test_setmem.c
@@ -0,0 +1,50 @@
+#include <linux/init.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/delay.h>
+
+#include <asm/cacheflush.h>
+
+static int __init test_setmem(void)
+{
+ const gfp_t gfp_flags = GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO;
+ int ret, i, good = 0, max_runs = 1000, one = 1;
+ u32 pages, size;
+ u8 *buff;
+
+ for (i = 0; i < max_runs; i++) {
+ size = round_up(4096*8, PAGE_SIZE);
+ pages = size / PAGE_SIZE;
+
+ buff = __vmalloc(size, gfp_flags, PAGE_KERNEL);
+ if (buff == NULL) {
+ printk("%s: allocation failed!\n", __func__);
+ return 0;
+ }
+#if 1
+ buff[0] = 41;
+ buff[10] = 42;
+ buff[100] = 43;
+ buff[1000] = 44;
+ // can also be just read access, f.e. if (buff[0] + ...)
+#endif
+ WARN_ON_ONCE(set_memory_ro((unsigned long)buff, pages));
+// msleep(0);
+
+ ret = probe_kernel_write(&buff[1], &one, sizeof(one));
+ if (!ret)
+ printk("%s: write succeeded despite read-only!\n", __func__);
+ else
+ good++;
+
+ WARN_ON_ONCE(set_memory_rw((unsigned long)buff, pages));
+ vfree(buff);
+ }
+
+ printk("%s: write failed due to read-only: %d/%d\n", __func__, good, max_runs);
+ return 0;
+}
+
+late_initcall(test_setmem);
--
2.9.3
The issue seems to be accessing buff first (can be read or write access)
and then doing set_memory_ro() doesn't make it read-only immediately,
meaning the subsequent call into probe_kernel_write() will succeed without
error.
Then, if I don't touch buff first and only do the set_memory_ro() seems
to work and probe_kernel_write() will then fail as expected due to pages
being read-only now.
Now, if I access buff, do the set_memory_ro() and then a msleep(0), for
example, it "kind of" works most of the time (see last log extract below),
and probe_kernel_write() will fail.
None of this seems an issue with x86_64 and the test_setmem runs fine all
the time, same for the actual BPF stuff.
I've also added rodata_test from the log (perhaps related).
x86_64 (access data before set_memory_ro()):
[...]
[ 1.357834] Key type big_key registered
[ 1.418542] test_setmem: write failed due to read-only: 1000/1000
[ 1.419545] Magic number: 9:497:653
[ 1.419584] acpi LNXCPU:06: hash matches
[ 1.419750] rtc_cmos 00:02: setting system clock to 2017-03-08 22:37:44 UTC (1489012664)
[ 1.419777] PM: Hibernation image not present or could not be loaded.
[ 1.420643] Freeing unused kernel memory: 1720K
[ 1.420643] Write protecting the kernel read-only data: 14336k
[ 1.420997] Freeing unused kernel memory: 1644K
[ 1.421844] Freeing unused kernel memory: 652K
[ 1.423054] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[ 1.423057] rodata_test: all tests were successful
[...]
x86-32 (access data before set_memory_ro()):
[...]
[ 8.551454] Key type trusted registered
[ 8.551792] Key type encrypted registered
[ 8.552274] test_setmem: write succeeded despite read-only!
[ 8.552740] test_setmem: write succeeded despite read-only!
[ 8.553205] test_setmem: write succeeded despite read-only!
[ 8.553686] test_setmem: write succeeded despite read-only!
[ 8.554179] test_setmem: write succeeded despite read-only!
[ 8.554666] test_setmem: write succeeded despite read-only!
[ 8.555157] test_setmem: write succeeded despite read-only!
[...]
[ 9.042271] test_setmem: write succeeded despite read-only!
[ 9.042741] test_setmem: write failed due to read-only: 0/1000
[ 9.043343] hd: no drives specified - use hd=cyl,head,sectors on kernel command line
[ 9.044034] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 9.044597] EDD information not available.
[ 9.045233] Freeing unused kernel memory: 612K
[ 9.045732] Write protecting the kernel text: 12604k
[ 9.046234] Write protecting the kernel read-only data: 5124k
[ 9.046781] rodata_test: test data was not read only
[...]
x86-32 (no access data before set_memory_ro()):
[...]
[ 8.537275] Key type encrypted registered
[ 8.541771] test_setmem: write failed due to read-only: 1000/1000
[ 8.542481] hd: no drives specified - use hd=cyl,head,sectors on kernel command line
[ 8.543265] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 8.543824] EDD information not available.
[ 8.544467] Freeing unused kernel memory: 612K
[ 8.544909] Write protecting the kernel text: 12604k
[ 8.545399] Write protecting the kernel read-only data: 5128k
[ 8.545903] rodata_test: test data was not read only
[...]
x86-32 (access data + msleep(0) before set_memory_ro()):
* run 1:
[...]
[ 8.478233] Key type encrypted registered
[ 20.334254] test_setmem: write succeeded despite read-only!
[ 28.474980] test_setmem: write failed due to read-only: 999/1000
[ 28.478187] hd: no drives specified - use hd=cyl,head,sectors on kernel command line
[ 28.480476] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 28.481342] EDD information not available.
[ 28.482192] Freeing unused kernel memory: 612K
[ 28.482841] Write protecting the kernel text: 12604k
[ 28.483568] Write protecting the kernel read-only data: 5128k
[ 28.484364] rodata_test: test data was not read only
[...]
* run 2:
[...]
[ 8.472088] Key type encrypted registered
[ 28.466359] test_setmem: write failed due to read-only: 1000/1000
[ 28.469162] hd: no drives specified - use hd=cyl,head,sectors on kernel command line
[ 28.470838] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 28.471590] EDD information not available.
[ 28.472346] Freeing unused kernel memory: 612K
[ 28.472910] Write protecting the kernel text: 12604k
[ 28.473542] Write protecting the kernel read-only data: 5128k
[ 28.474232] rodata_test: test data was not read only
[...]
I've attached the .config and Fengguang's test script. Qemu version is
2.7.1 from Fedora. Can be run as:
./reproduce-quantal-ivb41-127-20170301175555-i386-randconfig-b0-02262304-4.10.0-04456-g3051bf3-1 arch/x86/boot/bzImage
Thanks,
Daniel
> Linus
>
> On Thu, Mar 2, 2017 at 12:40 PM, Daniel Borkmann <daniel@...earbox.net> wrote:
>> On 03/02/2017 09:23 PM, Fengguang Wu wrote:
>> [...]
>>>
>>> I confirm that the below patch provided by Daniel fixes the above
>>> issues on mainline kernel, too. Where should this patch be sent to?
>>
>>
>> If nobody objects, I could send it to -net tree via Dave due to being
>> BPF related, but I don't mind sending it elsewhere too (f.e. Linus
>> directly?) in order to stop your bot from continuing to send such mails.
>>
>> The issue seems only related to i386 and doesn't trigger each time with
>> Fengguang's kernel config and qemu image when I try to reproduce it.
>> set_memory_ro()/set_memory_rw() on i386 seems to work in general, but
>> when it's used/reproduced, from time to time (perhaps some corner-case?)
>> it looks like that memory area can have issues much later on after being
>> fed back to the allocator which then causes a GPF from random locations.
>> Gut feeling, it might be an issue in set_memory_*() that my commit
>> uncovered. Still looking into it, but mean-time I could just send the
>> below, sure.
>>
>> Thanks,
>> Daniel
>>
>>
>>> It'd be very noisy if all these Oops hit the upcoming RC1 kernel.
>>>
>>> Daniel thinks there may be deeper problem in i386 set_memory_rw().
>>> However that could take much longer time to debug.
>>>
>>> Thanks,
>>> Fengguang
>>> ---
>>>
>>> Re: [bpf] 9d876e79df: BUG: unable to handle kernel paging request at
>>> 653a8346
>>>
>>>> On Tue, Feb 28, 2017 at 04:39:36PM +0100, Daniel Borkmann wrote:
>>>
>>>
>>> I have a rough feeling what it is, but I didn't have cycles to work on
>>> it yet (due to travel, sorry about that). The issue is likely shut down
>>> by just doing:
>>>
>>> ---
>>> arch/x86/Kconfig | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> --- linux.orig/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
>>> +++ linux/arch/x86/Kconfig 2017-03-03 03:44:35.962022996 +0800
>>> @@ -54,7 +54,7 @@ config X86
>>> select ARCH_HAS_KCOV if X86_64
>>> select ARCH_HAS_MMIO_FLUSH
>>> select ARCH_HAS_PMEM_API if X86_64
>>> - select ARCH_HAS_SET_MEMORY
>>> + select ARCH_HAS_SET_MEMORY if X86_64
>>> select ARCH_HAS_SG_CHAIN
>>> select ARCH_HAS_STRICT_KERNEL_RWX
>>> select ARCH_HAS_STRICT_MODULE_RWX
>>
>>
View attachment "config.bak" of type "text/plain" (97576 bytes)
View attachment "reproduce-quantal-ivb41-127-20170301175555-i386-randconfig-b0-02262304-4.10.0-04456-g3051bf3-1" of type "text/plain" (888 bytes)
Powered by blists - more mailing lists