Message-ID: <17949.36737.701327.104172@notabene.brown>
Date: Thu, 12 Apr 2007 11:46:41 +1000
From: Neil Brown <neilb@...e.de>
To: Theodore Tso <tytso@....edu>
Cc: Jörn Engel <joern@...ybastard.org>,
"H. Peter Anvin" <hpa@...or.com>,
Christoph Hellwig <hch@...radead.org>,
Ulrich Drepper <drepper@...il.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: If not readdir() then what?
On Wednesday April 11, tytso@....edu wrote:
>
> Actually, no, we can't keep the collision chain count stable across a
> create/delete even while the tree is cached. At least, not without
> storing a huge amount of state associated with each page. (It would
> be a lot more work than simply having nfsd keep a fd cache for
> directory streams ;-).
Well, there's the rub, isn't it :-)
You think it is easier to fix the problem in nfsd, and I think it is
easier to fix the problem in ext3. Both positions are quite
understandable and quite genuine.
And I am quite sure that all the issues that have been raised can be
solved with a bit of effort, providing the motivation is there.
I could argue that nfs came before ext3+dirindex, so ext3 should have
been designed to work properly with NFS. You could argue that fixing
it in nfsd fixes it for all filesystems. But I'm not sure either of
those arguments are likely to be at all convincing...
Maybe the best compromise is for both of us to fix the 'problem' :-?
Let me explore some designs a bit more...
NFS:
All we have to do is cache the open files. This should be only a
performance issue, not a correctness issue (once we get 64bit
cookies from ext3). So the important thing is to cache them for
a reasonable period of time.
We currently cache the read-ahead info from regular files (though
I was hoping that would go away when the page-cache-based readahead
became a reality). We can reasonably replace this with caching
the open files if we are going to do it for directories anyway.
So the primary key is "struct inode * + loff_t". This is suitable
both for file-readahead and for ext3-directory-caching. It might also
be useful for filesystems that store pre-allocation information in
the struct file.
We keep these in an LRU list and a hash table. We register a
callback with register_shrinker (or whatever it is called today) so
that VM pressure can shrink the cache, and also arrange a timer to
remove entries older than -- say -- 5 seconds.
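Something like the following is what I have in mind for a cache
entry -- just a sketch, with invented names rather than anything from
the current nfsd code (assumes <linux/fs.h> and <linux/list.h>):

    struct nfsd_stream_cache_entry {
            struct inode      *inode;     /* key, part 1 */
            loff_t             offset;    /* key, part 2 */
            struct file       *file;      /* the cached open file */
            struct hlist_node  hash;      /* hash-table linkage */
            struct list_head   lru;       /* LRU linkage */
            unsigned long      last_used; /* jiffies; the timer discards
                                           * entries older than ~5 seconds */
    };

Lookup hashes on (inode, offset); a hit moves the entry to the tail of
its LRU list, a miss opens the file and allocates a new entry.  The
shrinker callback and the timer both walk the LRU lists from the head,
closing and freeing the oldest entries.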
I think that placing a fixed size on the cache based on number of
active clients would be a mistake, as it is virtually impossible to
know how many active clients there are, and the number can change
very dynamically.
When a filesystem is un-exported (a rare event), we walk the whole
list and discard entries for that filesystem.
Look into the possibility of a callback on unlink to drop the
cached open when the link count hits zero.... I wonder if inotify
or leases can help with that.
To help with large NUMA machines, we probably don't want a single
hash LRU chain, but rather a number of LRU chains. That way the
action of moving an entry to the end of the chain is less likely to
conflict with another processor trying to do the same thing to a
different entry. This is the sort of consideration that is already
handled in the page cache, and having to handle it in every other
cache is troublesome.... because the next time a need like that
arises, the page cache will get fixed but other little caches won't
until someone like Greg Banks comes along with a big hammer...
EXT3:
Have an rbtree for storing directory entries. This is attached to a
page via the ->private page field.
Normally each page of a directory has its own rbtree, but when two
pages contain entries with the same hash, the one rbtree is shared
between the pages.
Thus when you load a block you must also load other blocks under
the same hash, but I think you do that already.
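Roughly, the data structures might look like this -- a sketch with
invented names, assuming <linux/rbtree.h> and <linux/types.h>
(minor_hash and the cursor count are explained below):

    /* hung off page->private; shared by every page in one hash group */
    struct ext3_dir_cache {
            struct rb_root  root;          /* entries sorted by hash */
            u32             current_seq;   /* next minor_hash to hand out
                                            * on a major-hash collision */
    };

    struct ext3_dir_cache_entry {
            struct rb_node  node;
            u64             hash;          /* major hash */
            u32             minor_hash;    /* sequence number within a
                                            * collision chain */
            struct page    *page;          /* page holding the dirent */
            int             cursors;       /* struct files pointing here;
                                            * while non-zero we hold a
                                            * reference on the page */
    };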
When you split a block (because it has become too big) the rbtree
attached to that block is dismantled and each entry is inserted
into the appropriate new rbtree, one for each of the two blocks.
The entries are unchanged - they just get placed in a different
tree - so cursors in the struct file will still be valid.
Each entry has a count of the number of cursors pointing to it, and
when this is non-zero, a refcount on the page is held, thus making
sure the page doesn't get freed and the rbtree lost. The entry
should possibly also contain a pointer to the page.. not sure if
that is needed.
Each entry in the rbtree contains (in minor_hash) a sequence number
that is used when multiple entries hash to the same value. We
store a 'current-seq-number' in the root of the rbtree and when an
attempt to insert an entry finds a collision, we increase
current-seq-number, set the minor_hash to that, and retry the
insert.
This minor_hash is combined with some bits of the major hash to
form the fpos/cookie.
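To make the collision handling and cookie generation concrete, the
insert path might look something like this -- again only a sketch with
invented names, using the 8-bit/56-bit split suggested at the end:

    static void dir_cache_insert(struct ext3_dir_cache *cache,
                                 struct ext3_dir_cache_entry *new)
    {
            struct rb_node **p = &cache->root.rb_node, *parent = NULL;

            while (*p) {
                    struct ext3_dir_cache_entry *e =
                            rb_entry(*p, struct ext3_dir_cache_entry, node);
                    parent = *p;
                    if (new->hash < e->hash ||
                        (new->hash == e->hash &&
                         new->minor_hash < e->minor_hash)) {
                            p = &(*p)->rb_left;
                    } else if (new->hash > e->hash ||
                               new->minor_hash > e->minor_hash) {
                            p = &(*p)->rb_right;
                    } else {
                            /* collision: give the new entry the next
                             * sequence number and restart the descent */
                            new->minor_hash = ++cache->current_seq;
                            p = &cache->root.rb_node;
                            parent = NULL;
                    }
            }
            rb_link_node(&new->node, parent, p);
            rb_insert_color(&new->node, &cache->root);
    }

    static u64 dir_cache_cookie(const struct ext3_dir_cache_entry *e)
    {
            /* 56 bits of major hash, 8 bits of sequence number */
            return ((e->hash & ((1ULL << 56) - 1)) << 8) |
                   (e->minor_hash & 0xff);
    }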
The releasepage address_space_operation will ensure that all pages
which share the same major hash are treated as a unit, all released
at the same time. So it will fail if any of the pages in the
group are in use. If they can all be freed, it will free the
rbtree for that group.
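In sketch form, that release path might be (the two helpers are
hypothetical, just to show the shape):

    static int ext3_dir_releasepage(struct page *page, gfp_t gfp)
    {
            struct ext3_dir_cache *cache =
                    (struct ext3_dir_cache *)page_private(page);

            /* hypothetical: true if any page in the group is still in
             * use or any entry in the rbtree has cursors != 0 */
            if (dir_cache_busy(cache))
                    return 0;       /* cannot release yet */

            /* hypothetical: free every entry and detach ->private from
             * every page in the group */
            dir_cache_free(cache);
            return 1;
    }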
This not only benefits nfsd, which opens and closes directories
often, but also benefits any local application that happens to
open/scan/close a directory very often. It might still be a
measurable benefit even if the nfsd fix is implemented.
Hmmm. I wonder. Which is more likely?
- That two 64bit hashes from some set are the same
- or that 65536 48bit hashes from a set of equal size are the same.
Taken to the extreme, if we used all the available bits for a sequence
number, we would never get a collision with 2^64 entries, while if all
the bits were for hash, we have a good chance of a collision at 2^32
entries. So using bits of the cookie for a sequence number reduces
the chance that two entries will need the same cookie, at the cost of
increasing the chance that multiple entries will hash to the same
value.
Probably 8 bits is plenty to spend on the sequence number, leaving 56
bits for the hash, meaning 200 million entries before the birthday
paradox starts making hash chains at all likely, and still billions
before there is a significant chance that the hash chains will even
overflow one block.
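(Rough arithmetic: with n entries and a 56-bit hash the probability
of any collision is about n^2 / 2^57, so at n = 2*10^8 it is roughly
0.3 -- which is about where chains start to become plausible.)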
NeilBrown