Message-ID: <20151005212639.35932b6c@redhat.com>
Date:	Mon, 5 Oct 2015 21:26:39 +0200
From:	Jesper Dangaard Brouer <brouer@...hat.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	linux-mm@...ck.org, Christoph Lameter <cl@...ux.com>,
	Alexander Duyck <alexander.duyck@...il.com>,
	Pekka Enberg <penberg@...nel.org>, netdev@...r.kernel.org,
	Joonsoo Kim <iamjoonsoo.kim@....com>,
	David Rientjes <rientjes@...gle.com>,
	Hannes Frederic Sowa <hannes@...hat.com>, brouer@...hat.com,
	Andi Kleen <ak@...ux.intel.com>,
	Arnaldo Carvalho de Melo <acme@...hat.com>
Subject: Re: [MM PATCH V4.1 5/6] slub: support for bulk free with SLUB
 freelists

On Fri, 2 Oct 2015 14:50:44 -0700
Andrew Morton <akpm@...ux-foundation.org> wrote:

> On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer <brouer@...hat.com> wrote:
> 
> > > Thus, I need to introduce new code like this patch and at the same time
> > > have to reduce the number of instruction-cache misses/usage.  In this
> > > case we solve the problem by not calling kmem_cache_free_bulk() too
> > > often. Thus, +17 bytes will hopefully not matter too much... but on
> > > the other hand we sort of know that calling kmem_cache_free_bulk() will
> > > cause icache misses.
> > 
> > I just tested this change on top of my net-use-case patchset... and for
> > some strange reason the code with this WARN_ON is faster and has far
> > fewer icache misses (1,278,276 vs 2,719,158 L1-icache-load-misses).
> > 
> > Thus, I think we should keep your fix.
> > 
> > I cannot explain why using WARN_ON() is better and causes fewer icache
> > misses.  And I hate it when I don't understand every detail.
> > 
> > My theory, after reading the assembler code, is that the UD2
> > instruction (from BUG_ON) causes some kind of icache decoder stall
> > (Intel experts???).  Now that should not be a problem, as UD2 is
> > obviously placed as an unlikely branch and left at the end of the asm
> > function.  But the call to __slab_free() is also placed at the end
> > of the asm function (it gets inlined from slab_free() as unlikely).  And
> > it is actually fairly likely that bulking calls __slab_free() (the slub
> > slowpath call).
> 
> Yes, I was looking at the asm code and the difference is pretty small:
> a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly.

I actually managed to find documentation stating that the UD2
instruction does stop the instruction prefetcher.

From the "Intel® 64 and IA-32 Architectures Optimization Reference Manual",
 Section 3.4.1.6 "Branch Type Selection":
 Cite:
  "[...] follow the indirect branch with a UD2 instruction, which will
   stop the processor from decoding down the fall-through path."

And from the LWN article: https://lwn.net/Articles/255364/
 Cite: "[...] like using the ud2 instruction [...]. This instruction,
 which cannot be executed itself, is used after an indirect jump
 instruction; it is used as a signal to the instruction fetcher that
 the processor should not waste efforts decoding the following memory
 since the execution will continue at a different location."

Thus, we were hitting a very unfortunate combination of code, where the
UD2 instruction was located before the call to __slab_free() in the
generated asm code, causing instruction prefetching to stop.
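
For reference, this is roughly how BUG() ends up emitting a UD2 on x86
(a simplified sketch of arch/x86/include/asm/bug.h; the real macro also
records a __bug_table entry, and unreachable() comes from
<linux/compiler.h>):

 /* Simplified sketch -- not the exact kernel definition */
 #define BUG()                                                   \
 do {                                                            \
         asm volatile("ud2");  /* #UD: decoder stops here      */\
         unreachable();        /* no fall-through for compiler  */\
 } while (0)

Because slab_free() also marks the __slab_free() call as unlikely, both
end up in the same cold tail of the function, with the UD2 bytes sitting
right in front of that call.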

Now I understand all the details again :-)

My only remaining problem is that I want a perf measurement that can
pinpoint these kinds of spots.  The difference in L1-icache-load-misses
was significant (1,278,276 vs 2,719,158).  I tried to perf record this
with different perf events without being able to pinpoint the location
(even though I know the spot now).  I even tried Andi's ocperf.py...
maybe he will know what event I should try?


> But I wouldn't assume that the microbenchmarking is meaningful.  I've
> seen shockingly large (and quite repeatable) microbenchmarking
> differences from small changes in code which isn't even executed (and
> this is one such case, actually).  You add or remove just one byte of
> text and half the kernel (or half the .o file?) gets a different
> alignment and this seems to change everything.

I do know that code alignment is important, but this performance impact
was just too large.  And I think I found documentation to back up my
theory.

 
> Deleting the BUG altogether sounds the best solution.  As long as the
> kernel crashes in some manner, we'll be able to work out what happened.
> And it's a can't-happen case anyway, isn't it?

To me WARN_ON() seems like good "documentation", as long as it does not
hurt performance.  I don't think removing the WARN_ON() will improve
performance, but I'm willing to actually test whether it matters.
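
For context, the check we are discussing is roughly the following (a
simplified sketch, not the exact hunk from the patch):

 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 {
 	if (WARN_ON(!size))	/* warn and bail out instead of BUG_ON() */
 		return;
 	/* ... detach the objects in p[] and free them in bulk ... */
 }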

Below is how it will crash; it is fairly obvious (if you know the
x86_64 calling convention).

API is: void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)

Below it clearly shows the second arg "size" in %RSI == 0.

[ 2986.977708] BUG: unable to handle kernel paging request at ffffffffa03c4d58
[ 2986.977760] IP: [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.977805] PGD 19a4067 PUD 19a5063 PMD 408a66067 PTE 3f6407161
[ 2986.977879] Oops: 0003 [#1] SMP 
[ 2986.977936] Modules linked in: slab_bulk_test03(O+) time_bench(O) netconsole tun coretemp kvm_intel kvm mxm_wmi microcode sg i2c_i801 i2c_core pcspkr wmi video shpchp acpi_pad nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc serio_raw sd_mod hid_generic ixgbe mdio vxlan ip6_udp_tunnel e1000e udp_tunnel ptp pps_core mlx5_core [last unloaded: slab_bulk_test03]
[ 2986.978190] CPU: 1 PID: 12495 Comm: modprobe Tainted: G           O    4.3.0-rc2-net-next-mm+ #442
[ 2986.978265] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
[ 2986.978341] task: ffff88040a89d200 ti: ffff8803e3c50000 task.ti: ffff8803e3c50000
[ 2986.978414] RIP: 0010:[<ffffffff8116d044>]  [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.978490] RSP: 0018:ffff8803e3c53aa8  EFLAGS: 00010282
[ 2986.978528] RAX: 0000000000000000 RBX: ffff8803e3c53c30 RCX: 0000000100200016
[ 2986.978571] RDX: ffff8803e3c53ad8 RSI: 0000000000000000 RDI: 000077ff80000000
[ 2986.978614] RBP: ffff8803e3c53ad0 R08: ffff8803e3db3600 R09: 00000000e3db3b01
[ 2986.978655] R10: ffffffffa03c4d58 R11: 0000000000000001 R12: ffffffffffffffff
[ 2986.978698] R13: ffff8803e3c53ae0 R14: 000077ff80000000 R15: ffff88040d774c00
[ 2986.978740] FS:  00007fbf31190740(0000) GS:ffff88041fa40000(0000) knlGS:0000000000000000
[ 2986.978814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2986.978853] CR2: ffffffffa03c4d58 CR3: 00000003f6406000 CR4: 00000000001406e0
[ 2986.978895] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2986.978938] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2986.978979] Stack:
[ 2986.979014]  ffff8803e3c53c30 0000000000000009 0000000000000000 ffffffffa002e000
[ 2986.979099]  ffff8803fc40db40 ffff8803e3c53cf0 ffffffffa03c4d58 0000000000000000
[ 2986.979184]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2986.979259] Call Trace:
[ 2986.979296]  [<ffffffffa002e000>] ? 0xffffffffa002e000
[ 2986.979335]  [<ffffffffa03c4d58>] run_try_crash_tests+0x378/0x3a6 [slab_bulk_test03]
[ 2986.979409]  [<ffffffffa002e12d>] slab_bulk_test03_module_init+0x12d/0x1000 [slab_bulk_test03]
[ 2986.979485]  [<ffffffff810002ed>] do_one_initcall+0xad/0x1d0
[ 2986.979526]  [<ffffffff8111c117>] do_init_module+0x60/0x1e8
[ 2986.979567]  [<ffffffff810c2778>] load_module+0x1bb8/0x2420
[ 2986.979608]  [<ffffffff810bf1b0>] ? __symbol_put+0x40/0x40
[ 2986.979647]  [<ffffffff810c31b0>] SyS_finit_module+0x80/0xb0
[ 2986.979690]  [<ffffffff8166b697>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 2986.979732] Code: f8 eb 05 4d 85 e4 74 13 4c 8b 10 49 83 ec 01 48 89 c2 48 83 e8 08 4d 85 d2 74 e8 4d 85 d2 0f 84 2e 01 00 00 49 63 47 20 4c 89 f7 <49> c7 04 02 00 00 00 00 b8 00 00 00 80 4c 01 d0 48 0f 42 3d b4 
[ 2986.979904] RIP  [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0
[ 2986.979945]  RSP <ffff8803e3c53aa8>
[ 2986.979982] CR2: ffffffffa03c4d58
[ 2986.980130] ---[ end trace 64f42b02f4347220 ]---


(gdb) list *(kmem_cache_free_bulk)+0x54
0xffffffff8116d044 is in kmem_cache_free_bulk (mm/slub.c:269).
264		return p;
265	}
266	
267	static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
268	{
269		*(void **)(object + s->offset) = fp;
270	}
271	
272	/* Loop over all objects in a slab */
273	#define for_each_object(__p, __s, __addr, __objects) \
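
Connecting the two: the faulting instruction in the Code: line of the
oops decodes as the store in set_freepointer() (my own annotation; that
s->offset is loaded from 0x20(%r15) is inferred from the instruction
bytes):

 /*
  * Code: ... 49 63 47 20   movslq 0x20(%r15),%rax   ; rax = s->offset (== 0)
  *           4c 89 f7      mov    %r14,%rdi
  *      <49> c7 04 02 00 00 00 00
  *                         movq   $0x0,(%r10,%rax)  ; *(object + offset) = fp
  *
  * %r10 == 0xffffffffa03c4d58 == CR2, i.e. the bogus "object" points into
  * the test module's text, and the write faults there (error code 0003:
  * a write to a present but write-protected page).
  */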


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
