linux-kernel - Re: [PATCH V2 10/16] Squashfs: cache operations

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Fri, 31 Oct 2008 04:43:46 +0000
From:	Phillip Lougher <phillip@...gher.demon.co.uk>
To:	Jörn Engel <joern@...fs.org>
CC:	akpm@...ux-foundation.org, linux-embedded@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	tim.bird@...sony.com
Subject: Re: [PATCH V2 10/16] Squashfs: cache operations

Jörn Engel wrote:
> On Wed, 29 October 2008 01:49:56 +0000, Phillip Lougher wrote:
>> +/*
>> + * Blocks in Squashfs are compressed.  To avoid repeatedly decompressing
>> + * recently accessed data Squashfs uses two small metadata and fragment caches.
>> + *
>> + * This file implements a generic cache implementation used for both caches,
>> + * plus functions layered ontop of the generic cache implementation to
>> + * access the metadata and fragment caches.
>> + */
> 
> I tend to agree with Andrew that a lot of this should be done by the
> page cache instead.  One of the problems seems to be that your blocksize
> can exceed page size and there really isn't any infrastructure to deal
> with such cases yet.  Bufferheads deal with blocks smaller than a page,
> not the other way around.
> 

I thought about using the page cache, but, the fact that blocksizes 
exceed the page cache size is only one of a number of reasons why I 
prefer the current solution, there is also simplicity and speed to consider.

There are three types of compressed block in Squashfs: datablocks, 
fragments, and metadata blocks.  Of these datablocks (by far the largest 
number of blocks) are decompressed and pushed into the page cache, and 
are not otherwise cached by Squashfs.  This, obviously (?), is because 
they contain data for only one file, and so at time of access there is a 
readily available address space to push the data into.

Metadata and fragment blocks are different in that when accessed and 
decompressed (say for an inode or for a particular tail-end block) they 
will contain metadata or tail-ends for other files.  This data could be 
thrown away but due to locality of reference it makes sense to 
temporarily cache it in-case a near future file access references the 
data.  But it doesn't make much sense to cache it more than temporarily, 
  much of the data will probably not be reused, and it exists compressed 
in the buffer cache.

The squashfs cache is therefore designed to cache only the last couple 
of metadata and fragment blocks.  It is also designed to be simple and 
extremely fast.  The maximum size of the metadata cache is only 64 KiB.

Simplicity and speed is extremely important.  The 
squashfs_metadata_read() wrapper around the cache is designed to step 
through the metadata a structure at a time (20 or so bytes), updating 
the read position in the metadata each call, with more metadata cache 
blocks being read and decompressed as necessary.  The common case where 
the metadata is already in the cache (because we're stepping through it 
20 or so bytes at a time), is designed to be extremely fast - a spinlock 
and array search only.  I recently optimised the cache to use spinlocks 
rather than mutexes and reduced the lock holding time (necessary to move 
to spinlocks), and this resulted in a 20%+ speed improvement in reading 
squashfs filesystems.

Given the above using an address space in the page cache will result in 
greater complexity, more memory overhead, and much slower operation. 
There's a number of reasons for this.

1. The amount and life-span of the data stored in the page cache is 
outside of Squashfs' control.  As explained above it only makes sense to 
temporarily cache the last couple of metadata and fragment blocks. 
Using the page cache (if a 'large' address space is used) for these 
keeps more of them around for longer than necessary, and will 
potentially cause more worthy datablocks to be flushed on memory pressure.

2. The address space will be caching uncompressed data, the squashfs 
references to this data are the compressed locations within the 
filesystem.  There doesn't exist a one-to-one linear mapping from 
compressed location to decompressed location in the address space.  This 
means a lookup table still needs to be used to store the mapping from 
compressed location to decompressed location in the address space.  Now 
this lookup table (however implemented) is itself at least as complex as 
my current cache implementation.

3. Once the locations of the decompressed pages in the address space 
have been found, they'll need to be looked up in the page cache, and 
this has to be done for every 4K page.  With the default fragment size 
of 128 KiB this means 32 separate lookups.  Somewhat slower than one 
spinlock and array search per 128 KiB block in the squashfs cache 
implementation.

Comments, especially those of the form "you've got this completely 
wrong, and you can use the page cache like this, which will be simpler 
and faster than your current implementation" welcome :)  I'm not adverse 
  to using the page cache, but I can't see how it will be simpler or 
faster than the current implementation.

Phillip
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/