Message-ID: <20150304160136.57a75e8e@tlielax.poochiereds.net>
Date: Wed, 4 Mar 2015 16:01:36 -0500
From: Jeff Layton <jlayton@...chiereds.net>
To: Daniel Wagner <daniel.wagner@...-carit.de>
Cc: Jeff Layton <jlayton@...marydata.com>,
<linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
"J. Bruce Fields" <bfields@...ldses.org>,
Alexander Viro <viro@...iv.linux.org.uk>
Subject: Re: [RFC v2 3/4] locks: Split insert/delete block functions into
flock/posix parts
On Wed, 4 Mar 2015 15:20:33 +0100
Daniel Wagner <daniel.wagner@...-carit.de> wrote:
> On 03/03/2015 01:55 AM, Jeff Layton wrote:
> > On Mon, 2 Mar 2015 15:25:12 +0100
> > Daniel Wagner <daniel.wagner@...-carit.de> wrote:
> >
> >> The locks_insert/delete_block() functions are used for flock, posix
> >> and leases types. blocked_lock_lock is used to serialize all access to
> >> fl_link, fl_block, fl_next and blocked_hash. Here, we prepare the
> >> stage for using blocked_lock_lock to protect blocked_hash.
> >>
> >> Signed-off-by: Daniel Wagner <daniel.wagner@...-carit.de>
> >> Cc: Jeff Layton <jlayton@...chiereds.net>
> >> Cc: "J. Bruce Fields" <bfields@...ldses.org>
> >> Cc: Alexander Viro <viro@...iv.linux.org.uk>
> >> ---
> >> fs/locks.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
> >> 1 file changed, 40 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/fs/locks.c b/fs/locks.c
> >> index 4498da0..02821dd 100644
> >> --- a/fs/locks.c
> >> +++ b/fs/locks.c
> >> @@ -611,11 +611,20 @@ static void locks_delete_global_blocked(struct file_lock *waiter)
> >> */
> >> static void __locks_delete_block(struct file_lock *waiter)
> >> {
> >> - locks_delete_global_blocked(waiter);
> >> list_del_init(&waiter->fl_block);
> >> waiter->fl_next = NULL;
> >> }
> >>
> >> +/* Posix block variant of __locks_delete_block.
> >> + *
> >> + * Must be called with blocked_lock_lock held.
> >> + */
> >> +static void __locks_delete_posix_block(struct file_lock *waiter)
> >> +{
> >> + locks_delete_global_blocked(waiter);
> >> + __locks_delete_block(waiter);
> >> +}
> >> +
> >> static void locks_delete_block(struct file_lock *waiter)
> >> {
> >> spin_lock(&blocked_lock_lock);
> >> @@ -623,6 +632,13 @@ static void locks_delete_block(struct file_lock *waiter)
> >> spin_unlock(&blocked_lock_lock);
> >> }
> >>
> >> +static void locks_delete_posix_block(struct file_lock *waiter)
> >> +{
> >> + spin_lock(&blocked_lock_lock);
> >> + __locks_delete_posix_block(waiter);
> >> + spin_unlock(&blocked_lock_lock);
> >> +}
> >> +
> >> /* Insert waiter into blocker's block list.
> >> * We use a circular list so that processes can be easily woken up in
> >> * the order they blocked. The documentation doesn't require this but
> >> @@ -639,7 +655,17 @@ static void __locks_insert_block(struct file_lock *blocker,
> >> BUG_ON(!list_empty(&waiter->fl_block));
> >> waiter->fl_next = blocker;
> >> list_add_tail(&waiter->fl_block, &blocker->fl_block);
> >> - if (IS_POSIX(blocker) && !IS_OFDLCK(blocker))
> >> +}
> >> +
> >> +/* Posix block variant of __locks_insert_block.
> >> + *
> >> + * Must be called with flc_lock and blocked_lock_lock held.
> >> + */
> >> +static void __locks_insert_posix_block(struct file_lock *blocker,
> >> + struct file_lock *waiter)
> >> +{
> >> + __locks_insert_block(blocker, waiter);
> >> + if (!IS_OFDLCK(blocker))
> >> locks_insert_global_blocked(waiter);
> >> }
> >>
> >
> > In many ways OFD locks act more like flock locks than POSIX ones. In
> > particular, there is no deadlock detection there, so once your
> > conversion is done to more widely use the percpu locks, then you should
> > be able to avoid taking the blocked_lock_lock for OFD locks as well.
> > The 4th patch in this series doesn't currently do that.
> >
> > You may want to revisit this patch such that the IS_OFDLCK checks are
> > done earlier, so that the blocked_lock_lock is only taken when the
> > blocker is IS_POSIX and !IS_OFDLCK.
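> >
> > Roughly something like this (a completely untested sketch, just to
> > illustrate the idea), assuming -- as your series aims for -- that the
> > fl_block list ends up protected by the flc_lock so that
> > __locks_insert_block() itself no longer needs the blocked_lock_lock:
> >
> > /* Hypothetical reshuffle, not the actual patch: only traditional POSIX
> >  * locks take part in deadlock detection, so only they need to go into
> >  * the global blocked_hash and hence take blocked_lock_lock at all.
> >  */
> > static void locks_insert_posix_block(struct file_lock *blocker,
> > 				     struct file_lock *waiter)
> > {
> > 	__locks_insert_block(blocker, waiter);
> >
> > 	if (IS_POSIX(blocker) && !IS_OFDLCK(blocker)) {
> > 		spin_lock(&blocked_lock_lock);
> > 		locks_insert_global_blocked(waiter);
> > 		spin_unlock(&blocked_lock_lock);
> > 	}
> > }
> >
> > ...with the delete side doing the same check before bothering with the
> > global hash.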
>
> Thanks for the explanation. I was not entirely sure what to do here
> and forgot to ask.
>
> I have fixed that stuff and now I am testing it. Though it seems
> that there is a memory leak which can be triggered with
>
> while true; do rm -rf /tmp/a; ./lease02 /tmp/a; done
>
> and this also happens without any of my patches. I am still trying to
> figure out what's happening. Hopefully I'm just chasing a ghost.
>
> slabtop tells me that ftrace_event_field is constantly growing:
>
> Active / Total Objects (% used) : 968819303 / 968828665 (100.0%)
> Active / Total Slabs (% used) : 11404623 / 11404623 (100.0%)
> Active / Total Caches (% used) : 72 / 99 (72.7%)
> Active / Total Size (% used) : 45616199.68K / 45619608.73K (100.0%)
> Minimum / Average / Maximum Object : 0.01K / 0.05K / 16.00K
>
> OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
> 967510630 967510630 2% 0.05K 11382478 85 45529912K ftrace_event_field
> 154368 154368 100% 0.03K 1206 128 4824K kmalloc-32
> 121856 121856 100% 0.01K 238 512 952K kmalloc-8
> 121227 121095 99% 0.08K 2377 51 9508K Acpi-State
>
> This is on real hardware. On a KVM guest, fasync_cache grows fast and the guest
> eventually runs out of memory. systemd tries hard to restart everything and fails constantly:
>
> [ 187.021758] systemd invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
> [ 187.022337] systemd cpuset=/ mems_allowed=0
> [ 187.022662] CPU: 3 PID: 1 Comm: systemd Not tainted 4.0.0-rc1+ #380
> [ 187.023117] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
> [ 187.023801] ffff88007c918000 ffff88007c9179c8 ffffffff81b4f9be ffffffff8116a9cc
> [ 187.024373] 0000000000000000 ffff88007c917a88 ffffffff8116a9d1 000000007c917a58
> [ 187.024940] ffffffff8224bc98 ffff88007c917a28 0000000000000092 ffffffff81c1b780
> [ 187.025515] Call Trace:
> [ 187.025698] [<ffffffff81b4f9be>] dump_stack+0x4c/0x65
> [ 187.026083] [<ffffffff8116a9cc>] ? dump_header.isra.13+0x7c/0x450
> [ 187.026525] [<ffffffff8116a9d1>] dump_header.isra.13+0x81/0x450
> [ 187.026958] [<ffffffff810a45c6>] ? trace_hardirqs_on_caller+0x16/0x240
> [ 187.027437] [<ffffffff810a47fd>] ? trace_hardirqs_on+0xd/0x10
> [ 187.027859] [<ffffffff814fe5c4>] ? ___ratelimit+0x84/0x110
> [ 187.028264] [<ffffffff8116b378>] oom_kill_process+0x1e8/0x4c0
> [ 187.028683] [<ffffffff8105fda5>] ? has_ns_capability_noaudit+0x5/0x170
> [ 187.029167] [<ffffffff8116baf4>] __out_of_memory+0x4a4/0x510
> [ 187.029579] [<ffffffff8116bd2b>] out_of_memory+0x5b/0x80
> [ 187.029970] [<ffffffff81170f2e>] __alloc_pages_nodemask+0xa0e/0xb60
> [ 187.030434] [<ffffffff811ad863>] read_swap_cache_async+0xe3/0x180
> [ 187.030881] [<ffffffff811ad9ed>] swapin_readahead+0xed/0x190
> [ 187.031300] [<ffffffff8119bcae>] handle_mm_fault+0xbbe/0x1180
> [ 187.031719] [<ffffffff81046bed>] __do_page_fault+0x1ed/0x4c0
> [ 187.032138] [<ffffffff81046ecc>] do_page_fault+0xc/0x10
> [ 187.032520] [<ffffffff81b5ddc2>] page_fault+0x22/0x30
> [ 187.032889] Mem-Info:
> [ 187.033066] DMA per-cpu:
> [ 187.033254] CPU 0: hi: 0, btch: 1 usd: 0
> [ 187.033596] CPU 1: hi: 0, btch: 1 usd: 0
> [ 187.033941] CPU 2: hi: 0, btch: 1 usd: 0
> [ 187.034292] CPU 3: hi: 0, btch: 1 usd: 0
> [ 187.034637] DMA32 per-cpu:
> [ 187.034837] CPU 0: hi: 186, btch: 31 usd: 51
> [ 187.035185] CPU 1: hi: 186, btch: 31 usd: 0
> [ 187.035529] CPU 2: hi: 186, btch: 31 usd: 0
> [ 187.035873] CPU 3: hi: 186, btch: 31 usd: 32
> [ 187.036221] active_anon:5 inactive_anon:0 isolated_anon:0
> [ 187.036221] active_file:238 inactive_file:194 isolated_file:0
> [ 187.036221] unevictable:0 dirty:0 writeback:8 unstable:0
> [ 187.036221] free:3361 slab_reclaimable:4651 slab_unreclaimable:493909
> [ 187.036221] mapped:347 shmem:0 pagetables:400 bounce:0
> [ 187.036221] free_cma:0
> [ 187.038385] DMA free:7848kB min:44kB low:52kB high:64kB active_anon:4kB inactive_anon:12kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:8kB mapped:4kB shmem:0kB slab_reclaimable:12kB slab_unreclaimable:7880kB kernel_stack:32kB pagetables:36kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:132 all_unreclaimable? yes
> [ 187.041138] lowmem_reserve[]: 0 1952 1952 1952
> [ 187.041510] DMA32 free:5596kB min:5628kB low:7032kB high:8440kB active_anon:16kB inactive_anon:0kB active_file:952kB inactive_file:772kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2004912kB mlocked:0kB dirty:0kB writeback:24kB mapped:1384kB shmem:0kB slab_reclaimable:18592kB slab_unreclaimable:1967756kB kernel_stack:1968kB pagetables:1564kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:12716 all_unreclaimable? yes
> [ 187.044442] lowmem_reserve[]: 0 0 0 0
> [ 187.044756] DMA: 4*4kB (UEM) 2*8kB (UM) 5*16kB (UEM) 2*32kB (UE) 2*64kB (EM) 3*128kB (UEM) 2*256kB (EM) 3*512kB (UEM) 3*1024kB (UEM) 1*2048kB (R) 0*4096kB = 7856kB
> [ 187.046022] DMA32: 190*4kB (UER) 6*8kB (R) 1*16kB (R) 1*32kB (R) 0*64kB 0*128kB 1*256kB (R) 1*512kB (R) 0*1024kB 0*2048kB 1*4096kB (R) = 5720kB
> [ 187.047128] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [ 187.047724] 554 total pagecache pages
> [ 187.047991] 60 pages in swap cache
> [ 187.048259] Swap cache stats: add 102769, delete 102709, find 75688/136761
> [ 187.048748] Free swap = 1041456kB
> [ 187.048995] Total swap = 1048572kB
> [ 187.049250] 524158 pages RAM
> [ 187.049463] 0 pages HighMem/MovableOnly
> [ 187.049739] 18953 pages reserved
> [ 187.049974] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
> [ 187.050587] [ 1293] 0 1293 10283 1 23 2 131 -1000 systemd-udevd
> [ 187.051253] [ 1660] 0 1660 12793 57 24 2 134 -1000 auditd
> [ 187.051872] [ 1681] 81 1681 6637 1 18 2 124 -900 dbus-daemon
> [ 187.052529] [ 1725] 0 1725 20707 0 42 3 216 -1000 sshd
> [ 187.053146] [ 2344] 0 2344 3257 0 11 2 49 0 systemd-cgroups
> [ 187.053820] [ 2345] 0 2345 3257 0 11 2 55 0 systemd-cgroups
> [ 187.054497] [ 2350] 0 2350 3257 0 11 2 35 0 systemd-cgroups
> [ 187.055175] [ 2352] 0 2352 3257 0 12 2 37 0 systemd-cgroups
> [ 187.055846] [ 2354] 0 2354 3257 0 11 2 43 0 systemd-cgroups
> [ 187.056530] [ 2355] 0 2355 3257 0 11 2 40 0 systemd-cgroups
> [ 187.057212] [ 2356] 0 2356 3257 0 11 2 44 0 systemd-cgroups
> [ 187.057886] [ 2362] 0 2362 3257 0 11 3 33 0 systemd-cgroups
> [ 187.058564] [ 2371] 0 2371 3257 0 11 2 33 0 systemd-cgroups
> [ 187.059244] [ 2372] 0 2372 3257 0 10 2 44 0 systemd-cgroups
> [ 187.059917] [ 2373] 0 2373 3257 0 11 2 39 0 systemd-cgroups
> [ 187.060600] [ 2376] 0 2376 3257 0 11 2 34 0 systemd-cgroups
> [ 187.061280] [ 2377] 0 2377 3257 0 10 2 43 0 systemd-cgroups
> [ 187.061942] [ 2378] 0 2378 3257 0 12 3 34 0 systemd-cgroups
> [ 187.062598] [ 2379] 0 2379 27502 0 10 3 33 0 agetty
> [ 187.063200] [ 2385] 0 2385 3257 0 12 2 44 0 systemd-cgroups
> [ 187.063859] [ 2390] 0 2390 3257 0 11 2 43 0 systemd-cgroups
> [ 187.064520] [ 2394] 0 2394 3257 0 11 2 41 0 systemd-cgroups
> [ 187.065182] [ 2397] 0 2397 3257 0 11 2 43 0 systemd-cgroups
> [ 187.065833] [ 2402] 0 2402 3257 0 11 2 42 0 systemd-cgroups
> [ 187.066490] [ 2403] 0 2403 3257 0 11 2 44 0 systemd-cgroups
> [ 187.067148] [ 2404] 0 2404 27502 0 13 3 30 0 agetty
> [ 187.067743] [ 2410] 0 2410 3257 0 11 2 43 0 systemd-cgroups
> [ 187.068407] [ 2413] 0 2413 3257 0 11 2 36 0 systemd-cgroups
> [ 187.069072] [ 2416] 0 2416 3257 0 11 2 49 0 systemd-cgroups
> [ 187.069720] [ 2417] 0 2417 11861 173 26 2 334 0 (journald)
> [ 187.070368] Out of memory: Kill process 2417 ((journald)) score 0 or sacrifice child
> [ 187.070943] Killed process 2417 ((journald)) total-vm:47444kB, anon-rss:0kB, file-rss:692kB
> [ 187.513857] systemd[1]: Unit systemd-logind.service entered failed state.
> [ 188.262477] systemd[1]: Unit systemd-journald.service entered failed state.
> [ 188.315222] systemd[1]: systemd-logind.service holdoff time over, scheduling restart.
> [ 188.334194] systemd[1]: Stopping Login Service...
> [ 188.341556] systemd[1]: Starting Login Service...
> [ 188.408787] systemd[1]: systemd-journald.service holdoff time over, scheduling restart.
> [ 189.284506] systemd[1]: Stopping Journal Service...
> [ 189.330806] systemd[1]: Starting Journal Service...
> [ 189.384800] systemd[1]: Started Journal Service.
>
>
> cheers,
> daniel
>
I pulled down the most recent Fedora rawhide kernel today:
4.0.0-0.rc2.git0.1.fc23.x86_64
...and with that, I can't reproduce this. The ftrace_event_field slab
(which is shared by the fasync_struct cache) seems to stay under
control. I see it hover around 3-4M in size while the test is running
but the box isn't falling over or anything.
Perhaps this was an MM or RCU bug that is now fixed? Can you confirm
whether you're still able to reproduce it with the most recent mainline
kernels?
--
Jeff Layton <jlayton@...chiereds.net>