linux-kernel - Re: Excessive stall times on ext4 in 3.9-rc2

Date:	Fri, 12 Apr 2013 11:18:13 +0100
From:	Tvrtko Ursulin <tvrtko.ursulin@...lan.co.uk>
To:	Theodore Ts'o <tytso@....edu>
Cc:	Jan Kara <jack@...e.cz>, Mel Gorman <mgorman@...e.de>,
	linux-ext4@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
	Linux-MM <linux-mm@...ck.org>, Jiri Slaby <jslaby@...e.cz>
Subject: Re: Excessive stall times on ext4 in 3.9-rc2

Hi all,

On Thursday 11 April 2013 22:57:08 Theodore Ts'o wrote:
> That's an interesting theory.  If the workload is one which is very
> heavy on reads and writes, that could explain the high latency.  That
> would explain why those of us who are using primarily SSD's are seeing
> the problems, because would be reads are nice and fast.
> 
> If that is the case, one possible solution that comes to mind would be
> to mark buffer_heads that contain metadata with a flag, so that the
> flusher thread can write them back at the same priority as reads.
> 
> The only problem I can see with this hypothesis is that if this is the
> explanation for what Mel and Jiri are seeing, it's something that
> would have been around for a long time, and would affect ext3 as well
> as ext4.  That isn't quite consistent, however, with Mel's observation
> that this is a problem which has gotten worse relatively
> recently.

I am dropping in as a casual observer, having missed the start of the thread, 
so I risk just muddying the waters for you.

I had a similar problem with ext4 for quite a while. At least that was my 
conclusion, since migrating one filesystem to XFS fixed it for me. I observed 
this with kernels between 3.5 and 3.7.

The setup was an ext4 filesystem (on top of LVM, which was on top of MD 
RAID 1, which was on top of two mechanical hard drives) dedicated to holding 
a large SVN check-out. The other filesystems were also ext4, on different 
logical volumes but the same spindles.

The symptoms were long stalls of everything (including window management!) on 
a relatively heavily loaded KDE desktop. Stalls would last anywhere from five 
to maybe even 30 seconds. I am not sure exactly, but long enough that you 
would think the system had actually crashed. I couldn't even switch to a 
different virtual terminal during a stall.

Eventually I traced it down to kdesvn (a Subversion client) periodically 
refreshing (or something) its metadata and hence generating some IO on that 
dedicated filesystem. That, combined with some other desktop activity, had the 
effect of stalling everything else. I thought it was very weird, but I suppose 
KDE and all the rest nowadays do too much IO in everything they do.

Following a hunch I reformatted that filesystem as XFS which fixed the 
problem.

I can no longer reproduce this to run any tests, so I know this is not very 
helpful. But perhaps some of the info will be useful to someone.

Tvrtko
