linux-kernel - Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

Message-ID: <4C24FC71.6020001@redhat.com>
Date:	Fri, 25 Jun 2010 14:58:57 -0400
From:	Ric Wheeler <rwheeler@...hat.com>
To:	Daniel Taylor <Daniel.Taylor@....com>
CC:	Mike Fedyk <mfedyk@...efedyk.com>,
	Daniel J Blueman <daniel.blueman@...il.com>,
	Mat <jackdachef@...il.com>, LKML <linux-kernel@...r.kernel.org>,
	linux-fsdevel@...r.kernel.org,
	Chris Mason <chris.mason@...cle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	The development of BTRFS <linux-btrfs@...r.kernel.org>
Subject: Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation
 in Btrfs)

On 06/24/2010 06:06 PM, Daniel Taylor wrote:
>
>
>    
>> -----Original Message-----
>> From: mikefedyk@...il.com [mailto:mikefedyk@...il.com] On
>> Behalf Of Mike Fedyk
>> Sent: Wednesday, June 23, 2010 9:51 PM
>> To: Daniel Taylor
>> Cc: Daniel J Blueman; Mat; LKML;
>> linux-fsdevel@...r.kernel.org; Chris Mason; Ric Wheeler;
>> Andrew Morton; Linus Torvalds; The development of BTRFS
>> Subject: Re: Btrfs: broken file system design (was Unbound(?)
>> internal fragmentation in Btrfs)
>>
>> On Wed, Jun 23, 2010 at 8:43 PM, Daniel Taylor
>> <Daniel.Taylor@....com>  wrote:
>>      
>>> Just an FYI reminder.  The original test (2K files) is utterly
>>> pathological for disk drives with 4K physical sectors, such as
>>> those now shipping from WD, Seagate, and others.  Some of the
>>> SSDs have larger (16K) or smaller blocks (2K).  There is also
>>> the issue of btrfs over RAID (which I know is not entirely
>>> sensible, but which will happen).
>>>
>>> The absolute minimum allocation size for data should be the same
>>> as, and aligned with, the underlying disk block size.  If that
>>> results in underutilization, I think that's a good thing for
>>> performance, compared to read-modify-write cycles to update
>>> partial disk blocks.
>>>        
>> Block size = 4k
>>
>> Btrfs packs smaller objects into the blocks in certain cases.
>>
>>      
> As long as no object smaller than the disk block size is ever
> flushed to media, and all flushed objects are aligned to the disk
> blocks, there should be no real performance hit from that.
>
> Otherwise we end up with the damage for the ext[234] family, where
> the file blocks can be aligned, but the 1K inode updates cause
> the read-modify-write (RMW) cycles and cost a >10% performance
> hit for creation/update of large numbers of files.
>
> An RMW cycle costs at least a full rotation (11 msec on a 5400 RPM
> drive), which is painful.
>    
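
The figures quoted above can be checked with a little arithmetic. What
follows is only a rough sketch, not anything taken from btrfs or ext[234]
code: the function names and the 4096-byte default sector size are
assumptions made for illustration. It computes the time for one full
rotation at a given spindle speed and tests whether a write would touch
a partial physical sector, which is the case that forces an RMW cycle.

```python
def rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def needs_rmw(offset, length, sector=4096):
    """True if a write of `length` bytes at byte `offset` covers only part
    of some physical sector, so the drive must read, modify, and rewrite.

    `sector` is an assumed physical sector size (4K here).
    """
    return offset % sector != 0 or length % sector != 0


def align_up(length, sector=4096):
    """Round an allocation size up to a whole number of sectors."""
    return -(-length // sector) * sector


if __name__ == "__main__":
    print(round(rotation_ms(5400), 1))   # about 11.1 ms per rotation
    print(needs_rmw(0, 2048))            # 2K file on a 4K-sector drive
    print(needs_rmw(1024, 1024))         # 1K inode update
    print(needs_rmw(4096, 8192))         # aligned, whole-sector write
    print(align_up(2048))                # minimum aligned allocation
```

Rounding every allocation up with something like align_up() trades some
space for never hitting the partial-sector case, which is the trade-off
argued for above.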

It is also interesting to note that you can get significant overhead
even with zero-length files. Path names, metadata, etc. can consume
(depending on the pathname length) quite a bit of space per file.
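
To give a feel for how this adds up, here is a back-of-the-envelope
model. Every constant in it is a hypothetical placeholder (a 256-byte
inode, an 8-byte directory entry header, names padded to 4 bytes); they
are not the real btrfs or ext[234] on-disk sizes, and the function name
is made up for this example. The only point is that the cost is non-zero
for an empty file and grows with the name length.

```python
def per_file_overhead(name_len, inode_bytes=256, dirent_header=8, pad=4):
    """Estimated metadata bytes for one file holding no data.

    All default sizes are illustrative assumptions, not values from any
    particular file system.
    """
    # Directory entries commonly store the name padded to a fixed boundary.
    name_bytes = -(-name_len // pad) * pad
    return inode_bytes + dirent_header + name_bytes


if __name__ == "__main__":
    short = per_file_overhead(10)     # 256 + 8 + 12 = 276 bytes
    long_ = per_file_overhead(255)    # 256 + 8 + 256 = 520 bytes
    print(short, long_)
    # One million empty files with 10-character names:
    print(1_000_000 * short)          # 276000000 bytes under these assumptions
```

On a real file system the number is better measured than modelled, for
example by comparing free space before and after creating a large batch
of empty files.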

Ric
