Message-ID: <45730E36.10309@redhat.com>
Date:	Sun, 03 Dec 2006 12:49:42 -0500
From:	Wendy Cheng <wcheng@...hat.com>
To:	Andrew Morton <akpm@...l.org>
CC:	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH] prune_icache_sb

Andrew Morton wrote:

>On Thu, 30 Nov 2006 11:05:32 -0500
>Wendy Cheng <wcheng@...hat.com> wrote:
>
>  
>
>>
>>The idea is, instead of unconditionally dropping every buffer associated 
>>with the particular mount point (that defeats the purpose of page 
>>caching), base kernel exports the "drop_pagecache_sb()" call that allows 
>>page cache to be trimmed. More importantly, it is changed to offer the 
>>choice of not randomly purging any buffer but the ones that seem to be 
>>unused (i_state is NULL and i_count is zero). This will encourage 
>>filesystem(s) to proactively respond to vm memory shortage if they 
>>choose to.
>>    
>>
>
>argh.
>  
>
I read this as: "It is OK to give system admins a command to drop the page 
cache (which is what this drop_pagecache_sb() call is all about). It is, 
however, not OK to give filesystem developers this very same function 
to trim their own page cache if the filesystem chooses to do so"?

>In Linux a filesystem is a dumb layer which sits between the VFS and the
>I/O layer and provides dumb services such as reading/writing inodes,
>reading/writing directory entries, mapping pagecache offsets to disk
>blocks, etc.  (This model is to varying degrees incorrect for every
>post-ext2 filesystem, but that's the way it is).
>  
>
The Linux kernel, particularly the VFS layer, is starting to show signs of 
inadequacy as the software components built upon it keep growing. I have 
doubts that it can keep up and handle this complexity with a development 
policy like the one you just described (the filesystem is a dumb layer?). 
Aren't these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a perfect 
example of why trying to do too many things inside the vfs layer for so 
many filesystems is a bad idea? By the way, since we're on this subject, 
could we discuss the vfs rename call a little (or I can start a new 
discussion thread)?

Note that linux do_rename() starts with the usual lookup logic, followed 
by lock_rename(), then a final round of dentry lookup, and finally calls 
the filesystem's i_op->rename. Since lock_rename() only takes vfs layer 
locks that are local to this particular machine, for a cluster 
filesystem there is a huge window between the final lookup and the 
filesystem's i_op->rename call in which the file could get deleted 
from another node before the fs can do anything about it. Would it be 
possible to get a new function pointer (lock_rename) in the 
inode_operations structure so a cluster filesystem can do proper locking?
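
To make the request concrete, here is a rough userspace sketch of the idea, not kernel code. The lock_rename/unlock_rename members, the cut-down struct inode_ops, the cluster_locked field and all demo_* helpers are hypothetical names made up for illustration; no such hook exists in inode_operations today. The only point is the call ordering: the filesystem's own lock would be taken before the final lookup and held across i_op->rename.

```c
#include <assert.h>
#include <stddef.h>

struct inode;

/*
 * Hypothetical subset of inode_operations. lock_rename and
 * unlock_rename are the proposed additions, not existing kernel API.
 */
struct inode_ops {
	int  (*lock_rename)(struct inode *old_dir, struct inode *new_dir);
	void (*unlock_rename)(struct inode *old_dir, struct inode *new_dir);
	int  (*rename)(struct inode *old_dir, struct inode *new_dir);
};

struct inode {
	const struct inode_ops *ops;
	int cluster_locked;	/* stand-in for a held cluster-wide lock */
};

/* Set by demo_rename(): was the cluster lock held when rename ran? */
static int rename_saw_cluster_lock;

static int demo_lock_rename(struct inode *old_dir, struct inode *new_dir)
{
	old_dir->cluster_locked = 1;
	new_dir->cluster_locked = 1;
	return 0;
}

static void demo_unlock_rename(struct inode *old_dir, struct inode *new_dir)
{
	old_dir->cluster_locked = 0;
	new_dir->cluster_locked = 0;
}

static int demo_rename(struct inode *old_dir, struct inode *new_dir)
{
	rename_saw_cluster_lock =
		old_dir->cluster_locked && new_dir->cluster_locked;
	return 0;
}

/* Sketch of the rename path with the optional filesystem hook. */
static int sketch_do_rename(struct inode *old_dir, struct inode *new_dir)
{
	const struct inode_ops *ops = old_dir->ops;
	int err;

	/* The node-local lock_rename() would be taken here. */
	if (ops->lock_rename) {
		err = ops->lock_rename(old_dir, new_dir);
		if (err)
			return err;
	}

	/* The final dentry lookup would go here, now covered cluster-wide. */
	err = ops->rename(old_dir, new_dir);

	if (ops->unlock_rename)
		ops->unlock_rename(old_dir, new_dir);
	return err;
}

/* Returns 1 if rename ran under the cluster lock, 0 if not, -1 on error. */
static int rename_demo(int with_hook)
{
	struct inode_ops ops = { NULL, NULL, demo_rename };
	struct inode old_dir = { &ops, 0 };
	struct inode new_dir = { &ops, 0 };

	if (with_hook) {
		ops.lock_rename = demo_lock_rename;
		ops.unlock_rename = demo_unlock_rename;
	}
	rename_saw_cluster_lock = 0;
	if (sketch_do_rename(&old_dir, &new_dir) != 0)
		return -1;
	return rename_saw_cluster_lock;
}
```

A filesystem that leaves the hook NULL would behave exactly as it does now, so local filesystems would not need to change.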

>>From our end (cluster locks are expensive - that's why we cache them), 
>>one of our kernel daemons will invoke this newly exported call based on 
>>a set of pre-defined tunables. It is then followed by a lock reclaim 
>>logic to trim the locks by checking the page cache associated with the 
>>inode (that this cluster lock is created for). If nothing is attached to 
>>the inode (based on i_mapping->nrpages count), we know it is a good 
>>candidate for trimming and will subsequently drop this lock (instead of 
>>waiting until the end of vfs inode life cycle).
>>    
>>
>
>Again, I don't understand why you're tying the lifetime of these locks to
>the VFS inode reclaim mechanisms.  Seems odd.
>  
>
Cluster locks are expensive because:

1. Every node in the cluster has to agree before the request is granted 
(communication overhead).
2. It involves disk flushing if the lock bounces between nodes. Say one 
node requests a read lock after another node's write: before the read 
lock can be granted, the writing node needs to flush the data to disk 
(disk io overhead).

As an optimization, we want to avoid the disk flush after writes 
and hope (and encourage) the next requester of the lock to be on 
the very same node (to take advantage of the OS write-back logic). 
That's why the locks are cached on that node. They are not removed 
unless necessary.

What would be better than building the lock caching on top of the 
existing inode cache logic, since inodes are the objects that the 
cluster locks are created for in the first place?
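
The two-step reclaim described in the quoted text above (trim page cache for inodes that look unused, then drop the cluster lock of any inode with nothing attached) could be sketched roughly as follows. This is a userspace model under stated assumptions: struct demo_inode, its lock_cached field and the function names are invented for illustration, and nrpages stands in for i_mapping->nrpages. It is not the actual patch or daemon code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a VFS inode plus its cached cluster lock. */
struct demo_inode {
	unsigned long i_state;	/* 0 means no state bits set */
	int i_count;		/* reference count */
	unsigned long nrpages;	/* stands in for i_mapping->nrpages */
	int lock_cached;	/* 1 if a cluster lock is cached for it */
};

/* Step 1: trim page cache only for inodes that look unused. */
static void trim_unused_pagecache(struct demo_inode *inodes, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (inodes[i].i_state == 0 && inodes[i].i_count == 0)
			inodes[i].nrpages = 0;
}

/* Step 2: drop cached locks for inodes with nothing attached. */
static size_t reclaim_locks(struct demo_inode *inodes, size_t n)
{
	size_t i, dropped = 0;

	for (i = 0; i < n; i++) {
		if (inodes[i].lock_cached && inodes[i].nrpages == 0) {
			inodes[i].lock_cached = 0;
			dropped++;
		}
	}
	return dropped;
}

/* Runs both steps on a small table and returns the locks dropped. */
static size_t reclaim_demo(void)
{
	struct demo_inode inodes[] = {
		{ 0, 0, 12, 1 },	/* unused: pages trimmed, lock dropped */
		{ 0, 2,  8, 1 },	/* still referenced: kept */
		{ 1, 0,  4, 1 },	/* has state bits set: kept */
		{ 0, 0,  0, 0 },	/* unused, but no lock cached */
	};
	size_t n = sizeof(inodes) / sizeof(inodes[0]);

	trim_unused_pagecache(inodes, n);
	return reclaim_locks(inodes, n);
}
```

Only the first inode loses its lock here; the busy ones keep both their pages and their locks.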

>If you want to put an upper bound on the number of in-core locks, why not
>string them on a list and throw away the old ones when the upper bound is
>reached?
>  
>
Don't get me wrong. DLM *has* a tunable to set the maximum lock count. We 
do drop the locks, but to drop the right locks we need a little bit of 
help from the VFS layer. The latency requirement is difficult to manage.
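
For comparison, the bounded list suggested in the quoted text would look roughly like the sketch below. MAX_LOCKS, struct lock_list and the function names are hypothetical and this is not DLM code. Note that it evicts purely by age; it has no way of knowing whether the inode behind the oldest lock still has cached pages, which is the information we would like to get from the VFS.

```c
#include <assert.h>
#include <string.h>

#define MAX_LOCKS 3	/* hypothetical upper bound (tunable) */

struct lock_list {
	int ids[MAX_LOCKS];	/* ids[0] is the oldest entry */
	int count;
};

/* Add a lock; returns the id evicted to make room, or -1 if none. */
static int lock_list_add(struct lock_list *list, int id)
{
	int evicted = -1;

	if (list->count == MAX_LOCKS) {
		/* Bound reached: throw away the oldest entry. */
		evicted = list->ids[0];
		memmove(&list->ids[0], &list->ids[1],
			(MAX_LOCKS - 1) * sizeof(list->ids[0]));
		list->count--;
	}
	list->ids[list->count++] = id;
	return evicted;
}

/* Fills the list, then adds one more and returns what got evicted. */
static int lock_list_demo(void)
{
	struct lock_list list = { {0}, 0 };

	lock_list_add(&list, 10);
	lock_list_add(&list, 11);
	lock_list_add(&list, 12);
	return lock_list_add(&list, 13);
}
```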

>Did you look at improving that lock-lookup algorithm, btw?  Core kernel has
>no problem maintaining millions of cached VFS objects - is there any reason
>why your lock lookup cannot be similarly efficient?
>  
>
Don't be so confident. I did see some complaints from ext3-based mail 
servers in the past - when the storage size was large enough, people had 
to explicitly umount the filesystem from time to time to rescue their 
performance. I don't recall the details at this moment though.

For us, with this particular customer, it is 15TB of storage.

-- Wendy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
