linux-kernel - Re: Latency writing to an mlocked ext4 mapping


Message-ID: <20111025122618.GA8072@quack.suse.cz>
Date:	Tue, 25 Oct 2011 14:26:18 +0200
From:	Jan Kara <jack@...e.cz>
To:	Andy Lutomirski <luto@...capital.net>
Cc:	Andreas Dilger <adilger@...ger.ca>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>
Subject: Re: Latency writing to an mlocked ext4 mapping

On Wed 19-10-11 22:59:55, Andy Lutomirski wrote:
> On Wed, Oct 19, 2011 at 7:17 PM, Andy Lutomirski <luto@...capital.net> wrote:
> > On Wed, Oct 19, 2011 at 6:15 PM, Andy Lutomirski <luto@...capital.net> wrote:
> >> On Wed, Oct 19, 2011 at 6:02 PM, Andreas Dilger <adilger@...ger.ca> wrote:
> >>> What kernel are you using?  A change to keep pages stable during writeout landed not too long ago (maybe in Linux 3.0) in order to allow checksumming of the data.
> >>
> >> 3.0.6, with no relevant patches.  (I have a one-liner added to the tcp
> >> code that I'll submit sometime soon.)  Would this explain the latency
> >> in file_update_time or is that a separate issue?  file_update_time
> >> seems like a good thing to make fully asynchronous (especially if the
> >> file in question is a fifo, but I've already moved my fifos to tmpfs).
> >
> > On 2.6.39.4, I got one instance of:
> >
> > call_rwsem_down_read_failed ext4_map_blocks ext4_da_get_block_prep
> > __block_write_begin ext4_da_write_begin ext4_page_mkwrite do_wp_page
> > handle_pte_fault handle_mm_fault do_page_fault page_fault
> >
> > but I'm not seeing the large numbers of the ext4_page_mkwrite trace
> > that I get on 3.0.6.  file_update_time is now by far the dominant
> > cause of latency.
> 
> The culprit seems to be do_wp_page -> file_update_time ->
> mark_inode_dirty_sync.  This surprises me for two reasons:
> 
>  - Why the _sync?  Are we worried that data will be written out before
> the metadata?  If so, surely there's a better way than adding latency
> here.
  _sync just means that the inode becomes dirty for fsync(2) purposes but
not for fdatasync(2) purposes, i.e. the change is just a timestamp update
(or something similar).
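
  To illustrate the fsync(2) / fdatasync(2) distinction from user space,
here is a minimal sketch. The helper name write_then_sync and its data_only
flag are made up for this example; os.fsync() and os.fdatasync() are the
standard Python wrappers (os.fdatasync() is not available on every
platform). It does not show kernel internals, only which call asks for
which guarantee:

```python
import os


def write_then_sync(path, payload, data_only):
    """Write payload to path and flush it to stable storage.

    data_only=True  -> fdatasync(2): file data plus only the metadata
                       needed to read it back (e.g. the file size).
    data_only=False -> fsync(2): file data plus all inode metadata,
                       including pure timestamp updates.
    Returns the file size seen after the sync call.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)  # small payload; short writes not handled here
        if data_only:
            os.fdatasync(fd)   # may skip a timestamp-only inode change
        else:
            os.fsync(fd)       # must also write timestamp-only changes
        return os.fstat(fd).st_size
    finally:
        os.close(fd)
```

An inode that is dirty only because of a timestamp change is the case where
the two calls differ: fsync has to write it, fdatasync is allowed to skip it.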

>  - Why are we calling file_update_time at all?  Presumably we also
> update the time when the page is written back (if not, that sounds
> like a bug, since the contents may be changed after something saw the
> mtime update), and, if so, why bother updating it on the first write?
> Anything that relies on this behavior is, I think, unreliable, because
> the page could be made writable arbitrarily early by another program
> that changes nothing.
  We don't update the timestamp when the page is written back. I believe
this is mostly because at writeback time we don't know whether the data was
changed by a write syscall, which has already updated the timestamp, or
through mmap. That is also the reason why we update the timestamp at page
fault time.
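
  If you want to observe this from user space, something like the sketch
below should do. The function name mtime_around_mmap_store is hypothetical;
it stores through a shared file mapping and returns the file's mtime before
and after. Whether and when the timestamp moves depends on the filesystem
and kernel version, so the sketch reports the values without assuming a
particular result:

```python
import mmap
import os
import time


def mtime_around_mmap_store(path, size=4096):
    """Return (mtime_before_ns, mtime_after_ns) around a store to a mapping."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(fd, size)            # give the mapping something to cover
        before = os.fstat(fd).st_mtime_ns
        time.sleep(0.05)                  # let the coarse kernel clock advance
        with mmap.mmap(fd, size) as mapping:   # shared, read/write by default
            mapping[0:5] = b"hello"       # first store takes the write fault
            mapping.flush()               # msync(2) the dirty page
        after = os.fstat(fd).st_mtime_ns
        return before, after
    finally:
        os.close(fd)
```

The store through the mapping involves no write syscall, so the fault is the
only point where the kernel learns that the file is about to be modified.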

  The reason why file_update_time() blocks for you is probably that it
needs access to the buffer holding the on-disk inode, and because a
transaction including this buffer is committing at that moment, your thread
has to wait until the transaction commit finishes. This is mostly a problem
specific to how ext4 works, so e.g. xfs shouldn't have it.

  Generally I believe that attempts to achieve any RT-like latencies when
writing to a filesystem are rather hopeless. How hopeless depends on the
load on the filesystem (e.g., in your case of a mostly idle filesystem I
can imagine some tweaks could reduce your latencies to an acceptable level,
but once the disk gets loaded you'll be screwed). So I'd suggest that
having the RT thread just store the log in memory (or write to a pipe) and
having another non-RT thread write the data to disk would be a much more
robust design.
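
  A rough sketch of that split is below. The class name BackgroundLogWriter
and its methods are invented for the example, and ordinary Python threads
are of course not real-time; the point is only that the latency-sensitive
side hands records to an in-memory queue and never touches the filesystem,
while a separate thread does all the file I/O:

```python
import os
import queue
import threading


class BackgroundLogWriter:
    """Producer enqueues records in memory; a writer thread does the I/O."""

    _STOP = object()  # sentinel telling the writer thread to finish

    def __init__(self, path):
        self._queue = queue.SimpleQueue()
        self._file = open(path, "ab")
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def log(self, record):
        # Called from the latency-sensitive thread: SimpleQueue.put() does
        # not block and performs no file I/O.
        self._queue.put(record)

    def _drain(self):
        # Runs in the non-critical thread; any stall on the filesystem
        # (journal commit, writeback) is absorbed here.
        while True:
            record = self._queue.get()
            if record is self._STOP:
                break
            self._file.write(record + b"\n")
        self._file.flush()
        os.fsync(self._file.fileno())
        self._file.close()

    def close(self):
        # Flush everything queued so far and wait for the writer to exit.
        self._queue.put(self._STOP)
        self._thread.join()
```

The queue here is unbounded, so a long stall on disk shows up as memory
growth rather than as latency in the producer; a real design would need to
decide how much backlog it can afford.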

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR