lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANQeFDCZJunbv-wmLxax9V4PvhN=u7nGc966Xzud2-P8EqBPCw@mail.gmail.com>
Date:   Wed, 23 Jan 2019 17:34:48 -0800
From:   Liu Bo <obuil.liubo@...il.com>
To:     "Theodore Y. Ts'o" <tytso@....edu>
Cc:     Xiaoguang Wang <xiaoguang.wang@...ux.alibaba.com>,
        Ext4 Developers List <linux-ext4@...r.kernel.org>,
        Liu Bo <bo.liu@...ux.alibaba.com>
Subject: Re: [PATCH v2 2/2] ext4: fix slow writeback under dioread_nolock and nodelalloc

On Tue, Jan 22, 2019 at 10:30 PM Theodore Y. Ts'o <tytso@....edu> wrote:
>
> On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote:
> > With "nodelalloc", blocks are allocated at the time of writing, and with
> > "dioread_nolock", these allocated blocks are marked as unwritten as well,
> > so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.
>
> I've been looking at your patches, and it seems that a simpler way,
> perhaps more maintainable approach in the long term is to change how
> we write to newly allocated blocks.  Today, we have two ways of doing
> this:
>
> 1) In the dioread_nolock case, we allocate blocks, insert an entry in
> the extent tree with the blocks marked uninitialized, write the
> blocks, and then mark the blocks initialized.
>
> 2) In the !dioread_nolock case, we allocate blocks, insert an entry to
> the extent tree, write the blocks --- and if we start a commit, we
> write out all dirty pages associated with that inode (in the default
> data=writeback case) to avoid stale writes.
>
> So what if we change the dioread_nolock case to do write the blocks
> first, and *then* insert the entry into the extent tree?  This avoids
> stale data getting exposed, either by a direct I/O read, or after a
> crash (which means we avoid needing to do the force write-out).
>
> So what we would need to do is to pass a flag to ext4_map_blocks()
> which causes it to *not* make any on-disk changes.  Instead, it would
> track the fact that blocks have be reserved in the buddy bitmap (this
> is how we prevent blocks from being preallocated after they are
> deleted, but before the transaction has been committed), and the
> location of the assigned blocks in the extent_status tree.  Since no
> on-disk changes are being made, we wouldn't need to hold the
> transaction open.
>
> Then in the callback after the blocks are written, using the starting
> logical block number stored in the io_end structure, we either convert
> the unwritten extents or actually insert the newly allocated blocks in
> the extent tree and update the on-disk bitmap allocation bitmaps.
>
> Once we get this working, it should be easy to make dioread_nolock for
> 1k block sizes; it keeps the time that the handle open very short; and
> it completely obviates the need for data=writeback.
>
> What do folks think?
>

So that (reserve, write, insert extent records) is basically what
btrfs is doing and I feel like it will work better than the current
way.

My only concern is performance since metadata reservation for delalloc
now becomes more and needs to be carried until endio, a perf. spike
would appear if the foreground writer needs to wait for flushing dirty
pages to reclaim metadata credits.

thanks,
liubo

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ