linux-ext4 - Re: [LSF/FS TOPIC] Ext4 snapshots status update

Message-ID: <AANLkTinV-WSj2c+dtNyaS-r8xf_c0R5UB-_f73XLm0Z8@mail.gmail.com>
Date:	Wed, 30 Mar 2011 08:05:38 +0200
From:	Amir Goldstein <amir73il@...il.com>
To:	Tao Ma <tm@....ma>
Cc:	Joel Becker <jlbec@...lplan.org>,
	linux-fsdevel <linux-fsdevel@...r.kernel.org>,
	Ext4 Developers List <linux-ext4@...r.kernel.org>,
	Theodore Tso <tytso@....edu>,
	Chris Mason <chris.mason@...cle.com>,
	Josef Bacik <josef@...hat.com>
Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update

On Wed, Mar 30, 2011 at 7:52 AM, Tao Ma <tm@....ma> wrote:
> Hi Amir,
> On 03/30/2011 12:16 PM, Amir Goldstein wrote:
>> On Wed, Mar 30, 2011 at 2:34 AM, Joel Becker <jlbec@...lplan.org> wrote:
>>> On Wed, Mar 23, 2011 at 10:19:38PM +0200, Amir Goldstein wrote:
>>>> On Fri, Feb 4, 2011 at 2:20 AM, Joel Becker <jlbec@...lplan.org> wrote:
>>>>> On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
>>>>>        I've already got a design for a front-end snapshot program that
>>>>> implements a policy on top of this generic behavior.  This design would
>>>>> cover both first-class and hidden style snapshots, because it assumes
>>>>> snapshots are in a distinct namespace.  I haven't gotten around to
>>>>> implementing it yet, but btrfs and other snapshottable filesystems were
>>>>> part of the design goal.
>>>>
>>>> Any chance of getting a copy of that design of yours, to get a head start
>>>> for LSF?
>>>
>>>        Yeah, I owe it to you.  It wasn't a written-down thing, it was a
>>> hammered-out-in-our-heads thing among some ocfs2 developers.  I'm going
>>> to braindump here to get us going.  First, I'll speak to your points.
>>>
>>>> Here are some other generic snapshot related topics we may want to discuss:
>>>>
>>>> 1. Coordinating the use of inode flags COW_FL, NOCOW_FL, suggested by Chris.
>>>
>>>        I'm unsure where these fit, perhaps because I missed the
>>> discussion between Chris and you.  ocfs2 has the inode flag
>>> OCFS2_REFCOUNTED_FL to signify a refcount tree is attached to the inode.
>>> This is ocfs2's structure for maintaining extent reference counts.  Is
>>> your COW_FL the same?  Or is it a permission flag?  NOCOW_FL sounds
>>> like: "Set this flag on the inode and it will prevent CoW."
>>
>> I don't have a use for COW_FL, since my snapshots are volume level snapshots.
>> I intend to use NOCOW_FL to mark an inode as an "island" of NOCOW
>> blocks in the volume.
>> Maybe Chris or Josef can elaborate on the flags' intended use in btrfs.
>>
>>>
>>>> 2. How to deal with mmap write to COW file, when you get ENOSPC.
>>>
>>>        We just fail the write with VM_FAULT_SIGBUS like mmap write to a
>>> hole.

OK, a "private" thread has been opened.
I just wanted to clarify two differences I notice between an mmap write
to a hole and an mmap write to a COWed file that hits ENOSPC:

1. A "good" application can avoid mmap writes to a hole.

2. When filling a hole, the mkwrite callback is used (in ext4) to
reserve disk space for delayed allocation when a page becomes writable.
With COW, a page may already be writable when the flush encounters COW
with ENOSPC. That flush can even happen after the application has exited,
so the data will be silently dropped on the floor (like in ext3).
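
A minimal sketch of point 1, assuming Linux and Python's os.posix_fallocate();
the helper name map_preallocated is made up for illustration. It allocates the
blocks before mapping, so a shortage shows up as an ordinary error from the
allocation call rather than as SIGBUS on a later store. It does nothing for
the COW case in point 2, where the blocks can still need reallocating at
flush time.

```python
import mmap
import os


def map_preallocated(path, size):
    """Create `path`, back all `size` bytes with allocated blocks, then map it.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # If space runs out, this raises OSError (for example ENOSPC) here,
        # while the caller can still handle it.
        os.posix_fallocate(fd, 0, size)
        # mmap duplicates the descriptor, so closing fd below is safe.
        return mmap.mmap(fd, size, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)
```

The caller then writes through the returned mapping and calls flush() on it
before closing.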


>>> It's what happens for most other CoW filesystems today.  If
>>> you're using CoW, you should be aware of what to expect.
>>>
>>
>> "you", meaning a CoW fs developer? a CoW fs administrator? or an application
>> developer, who has no idea what fs the application will be on?
>> I know it is easy for us to say "there is no solution", but I have actually
>> implemented a block reservation technique that may be useful in this case...
>> it's hammered-out-in-my-head, so let's save me the brain dump and I'll tell
>> you about it in person...
>>
>>
>>>> 3. Adding buffer_remap() flag for buffered I/O code, meaning, there is
>>>> an existing mapping to initialize a page on partial write, but still need
>>>> to call get_block() to get a (possibly) new mapping.
>>>
>>>        Since ocfs2 doesn't allocate in get_block(), this doesn't affect
>>> us.  We notice the refcounted extent in write_begin() and CoW it right
>>> there.  Same place we clean up unwritten extents.
>>>
>>
>> Yes, I was going to write a specialized block_write_begin() for CoW,
>> but I like to use existing generic code when possible and block_write_begin()
>> is only a few lines of code short of what I need, so maybe we can all use it?
>>
>>
>>> --snip--
>>>
>>>        Now, about my snapshot thoughts as promised.  My understanding
>>> of the snapshots you have implemented in ext4 is that they are like some
>>> SAN snapshots; they are hidden objects not visible unless you use
>>> special access.  They are particular to a given inode and are children
>>> of that inode.  What happens when you remove the visible inode?  Do the
>>> snapshots disappear?  Do you have limitations on how many snapshots a
>>> particular inode can have?  These questions plagued us when we originally
>>> set out to design inode snapshots for ocfs2.
>>
>> ext4 snapshots are volume level (readonly) snapshots.
>> the snapshot inodes are both the "place-holder" of private snapshot blocks
>> and the (loopdev) mount point to access the volume snapshot.
>> This is why I wondered if inode level snapshots and volume/subvolume
>> level snapshots can share the same API.
>> BTW, does btrfs have inode level snapshots as well?
>>
>>>        Once we settled on a mechanism for CoW among ocfs2 inodes, we
>>> quickly decided that a snapshot should be visible in the namespace.
>>> This gave rise to the reflink(2) call, though that name is deprecated in
>>> favor of fastcopy(2).  Currently our API is OCFS2_IOC_REFLINK (see,
>>> legacy!), but we eventually want to get the system call upstream.  In
>>> ocfs2-land, we decided to keep policy out of the kernel.
>>> OCFS2_IOC_REFLINK creates a new inode that shares all the extents of the
>>> source in CoW fashion, but once it returns, that new inode is a peer of
>>> the source.  There is no parent->child relationship.
>>>        Thus, for ocfs2 (and forgive the legacy names, the binary hasn't
>>> changed yet), a "snapshot" is just:
>>>
>>>    snapshot: reflink source target.snap && chmod 0444 target.snap
>>>
>>> You can add "chattr +i target.snap" in there if you like.
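
A minimal sketch of the quoted one-liner, assuming nothing about the real
reflink interface: the clone step is passed in as a callable, and the default
(shutil.copyfile) is a plain byte copy standing in for a CoW clone. The name
snapshot_file is hypothetical.

```python
import os
import shutil
import stat


def snapshot_file(source, target, clone=shutil.copyfile):
    """Create `target` from `source`, then make it read-only (mode 0444).

    `clone` is a stand-in: on ocfs2 it would be whatever invokes reflink.
    """
    clone(source, target)
    # Equivalent of "chmod 0444 target.snap".
    os.chmod(target, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return target
```

Because the clone step is a parameter, the policy (naming, permissions) stays
in userspace, in line with keeping policy out of the kernel.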
>>>        Since there is no "snapshot namespace" stuff for ocfs2 in the
>>> kernel, it was our intention to propose a snapshot(8) binary that works
>>> like mkfs/fsck; snapshot(8) just calls snapshot.<fstype>(8).  Our
>>> plan was to place snapshot policy in snapshot.ocfs2(8).  This
>>> implementation would handle managing the <mountpoint>/.snapshot/...
>>> namespace behind the user:
>>>
>>>    ? cd /mnt/ocfs2
>>>    ? snapshot file1  # Creates /mnt/ocfs2/.snapshot/file1.<timestamp>
>>>    <timestamp>
>>>    ? snapshot file1 test  # Creates /mnt/ocfs2/.snapshot/file1.test
>>>    test
>>>    ? snapshot list file1
>>>    Snapshots for file1:
>>>        <timestamp>
>>>        test
>>>
>>> Something like that.
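
A rough sketch of the naming described above, under stated assumptions: the
function names are hypothetical, and the timestamp format is a guess, since
the text only says <timestamp>. It shows the mkfs/fsck-style helper name and
the <mountpoint>/.snapshot/<file>.<name> path.

```python
import os
import time


def helper_name(fstype):
    """Name of the per-filesystem helper, following the mkfs/fsck convention."""
    return "snapshot." + fstype


def snapshot_path(mountpoint, filename, name=None, now=None):
    """Path of a snapshot under <mountpoint>/.snapshot/.

    With no explicit name, a UTC timestamp is used; the format is an assumption.
    `now` is seconds since the epoch (None means the current time).
    """
    if name is None:
        name = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    return os.path.join(mountpoint, ".snapshot", f"{filename}.{name}")
```

For example, snapshot_path("/mnt/ocfs2", "file1", "test") yields
/mnt/ocfs2/.snapshot/file1.test, matching the second command in the listing.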
>>>        A different snapshot model like ext4 could have snapshot.ext4(8)
>>> call the kernel or whatever mechanism was appropriate.  A filesystem
>>> from a NAS filer could use filer-specific calls.
>>>        Beyond that, I wanted snapshot(8) to handle scheduling of
>>> snapshots.  The usual daily/weekly stuff should be easy to schedule
>>> generically.
>>>        That's my brain dump.  I could enumerate proposed command
>>> syntaxes, but I don't think that's necessary.
>>>
>>
>> No need for that. snapshot(8) API sounds good.
>> Let's sit together in LSF with btrfs representatives and finalize this API.
>> For ext4, I just need for the 'file' arg to be optional.
>> I would like to include some API to attach a snapshot to a namespace
>> (mount it in my case) and to see how the inode level snapshots namespace
>> and volume level snapshots namespace will appear the same to the end-user.
>>
>> I suppose further discussion on the subject should exclude lsf ml,
>> which appears to be very hectic these days, so anyone who would like to join this
>> thread, please say so now.
> I implemented the reflink support in ocfs2, so please cc me when you
> open a private thread about this topic. Thanks.
>
> Regards,
> Tao
>