Date:	Sat, 28 Apr 2007 09:05:44 -0700 (PDT)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Mikulas Patocka <mikulas@...ax.karlin.mff.cuni.cz>
cc:	Mike Galbraith <efault@....de>,
	LKML <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Jens Axboe <jens.axboe@...cle.com>
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS
 is under heavy write load (massive starvation)



On Sat, 28 Apr 2007, Mikulas Patocka wrote:
> > 
> > Especially with lots of memory, allowing 40% of that memory to be dirty is
> > just insane (even if we limit it to "just" 40% of the normal memory zone).
> > That can be gigabytes. And no amount of IO scheduling will make it
> > pleasant to try to handle the situation where that much memory is dirty.
> 
> What about using different dirtypage limits for different processes?

Not good. We inadvertently actually had a very strange case of that, in the 
sense that we had different dirtypage limits depending on the type of 
allocation: if somebody used GFP_HIGHUSER, he'd be looking at the 
percentage as a percentage of _all_ memory, but if somebody used 
GFP_KERNEL he'd look at it as a percentage of just the normal low memory. 
So effectively they had different limits (the percentage may have been the 
same, but the _meaning_ of the percentage changed ;)
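
To make that concrete, the effect was roughly this (illustrative
numbers and names only, not the actual mm code):

#include <stdio.h>

/*
 * Illustrative sketch only (made-up numbers, not the real mm
 * code): the same dirty percentage applied to different memory
 * totals depending on the allocation type.
 */
#define DIRTY_RATIO	40			/* percent */
#define PAGE_KB		4

int main(void)
{
	unsigned long total_pages  = 1024UL * 1024;	/* ~4GB with highmem */
	unsigned long lowmem_pages = 224UL * 1024;	/* ~896MB of low memory */

	unsigned long high_limit = total_pages  * DIRTY_RATIO / 100;
	unsigned long low_limit  = lowmem_pages * DIRTY_RATIO / 100;

	printf("GFP_HIGHUSER-style limit: ~%lu MB\n", high_limit * PAGE_KB / 1024);
	printf("GFP_KERNEL-style limit:   ~%lu MB\n", low_limit  * PAGE_KB / 1024);
	return 0;
}

So the same 40% meant roughly 1.6GB for one caller and about 360MB for
the other on the same box.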

And it's really problematic, because it means that the process that has a 
high tolerance for dirty memory will happily dirty a lot of RAM, and then 
when the process that has a _low_ tolerance comes along, it might write 
just a single byte, and go "oh, damn, I'm way over my dirty limits, I will 
now have to start doing writeouts like mad".

Your form is much better:

> --- i.e. every process has a dirtypage activity counter that is increased when
> it dirties a page and decreased over time.

..but is really hard to do, and in particular, it's really hard to make 
any kind of guarantee that when you have a hundred processes, they won't 
go over the total dirty limit together!
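
(For concreteness, the proposal amounts to something like this - a
rough, totally untested sketch with made-up names - and note that
nothing in it bounds the _sum_ over all processes:)

/*
 * Rough, untested sketch of a per-process dirty activity counter
 * (names are made up).  The score goes up when the task dirties a
 * page and decays over time; nothing here bounds the sum of all
 * the per-task scores, which is the hard part.
 */
struct task_dirty {
	unsigned long score;		/* recent dirtying activity */
	unsigned long limit;		/* per-task throttle point */
};

/* called when the task dirties a page */
static void task_dirtied_page(struct task_dirty *t)
{
	if (++t->score > t->limit) {
		/* over the per-task limit: make this task do writeback */
	}
}

/* called periodically: exponential decay of the score */
static void task_dirty_decay(struct task_dirty *t)
{
	t->score -= t->score / 8;
}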

And one of the reasons for the dirty limit is that the VM really wants to 
know that it always has enough clean memory it can throw away, so that 
even if it needs to do allocations while under IO, it's not totally 
screwed.  An example of this is using dirty mmap with a networked 
filesystem: with 2.6.20 and later, this should actually _work_ fairly 
reliably, exactly because we now also count the dirty mapped pages in the 
dirty limits, so we never get into the situation that we used to be able 
to get into, where some process had mapped all of RAM, and dirtied it 
without the kernel even realizing, and then when the kernel needed more 
memory (in order to write some of it back), it was totally screwed.

So we do need the "global limit", as just a VM safety issue. We could do 
some per-process counters in addition to that, but generally, the global 
limit actually ends up doing the right thing: heavy writers are more 
likely to _hit_ the limit, so statistically the people who write most are 
also the people who end up having to clean up - so it's all fair.
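
In other words, the check is conceptually just (illustrative names,
not the actual kernel code):

/*
 * Conceptual sketch of the global limit (illustrative names, not
 * the actual kernel code).  Every dirtier checks the same global
 * count, so the heavy dirtiers are the ones that statistically
 * end up over the limit and get to do the cleanup.
 */
extern unsigned long global_dirty_pages;	/* kept up to date by the VM */
extern unsigned long global_dirty_limit;	/* some percentage of RAM */

static void throttle_if_over_limit(void)
{
	if (global_dirty_pages > global_dirty_limit) {
		/*
		 * Whoever trips this starts writeback and waits,
		 * which keeps enough clean memory around for the
		 * VM to make progress even while under IO.
		 */
	}
}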

> The main problem is that if the user extracts a tar archive, tar eventually
> blocks on writeback I/O --- O.K. But if bash attempts to write one page to
> .bash_history file at the same time, it blocks too --- bad, the user is
> annoyed.

Right, but it's actually very unlikely. Think about it: the person who 
extracts the tar-archive is perhaps dirtying a thousand pages, while the 
.bash_history writeback is doing a single one. Which process do you think 
is going to hit the "oops, we went over the limit" case 99.9% of the time?
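
(It's just 1000 against 1: roughly 1000 times out of 1001 - i.e. about 
99.9% of the time - it's the tar extraction that happens to be the one 
dirtying a page at the moment the limit trips.)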

The _really_ annoying problem is when you just have absolutely tons of 
memory dirty, and you start doing the writeback: if you saturate the IO 
queues totally, it simply doesn't matter _who_ starts the writeback, 
because anybody who needs to do any IO at all (not necessarily writing) is 
going to be blocked.

This is why having gigabytes of dirty data (or even "just" hundreds of 
megs) can be so annoying.

Even with a good software IO scheduler, when you have disks that do tagged 
queueing, if you fill up the disk queue with a few dozen (depends on the 
disk what the queue limit is) huge write requests, it doesn't really 
matter if the _software_ queuing then gives a big advantage to reads 
coming in. They'll _still_ be waiting for a long time, especially since 
you don't know what the disk firmware is going to do.

It's possible that we could do things like refusing to use all tag entries 
on the disk for writing. That would probably help latency a _lot_. Right 
now, if we do writeback, and fill up all the slots on the disk, we cannot 
even feed the disk the read request immediately - we'll have to wait for 
some of the writes to finish before we can even queue the read to the 
disk.
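
Something along these lines, in other words (hand-waving sketch,
made-up names - the real logic would live in the block layer or the
driver):

/*
 * Hand-waving sketch (made-up names, not real block layer code):
 * never let writes take every tag the disk offers, so that an
 * incoming read can always be dispatched immediately.
 */
#define QUEUE_DEPTH	32			/* what the disk advertises */
#define MAX_WRITE_TAGS	(QUEUE_DEPTH - 4)	/* keep a few tags for reads */

struct tag_state {
	int in_flight;			/* all tags in use */
	int writes_in_flight;		/* tags used by writes */
};

static int can_dispatch(const struct tag_state *ts, int is_write)
{
	if (ts->in_flight >= QUEUE_DEPTH)
		return 0;		/* disk queue is full */
	if (is_write && ts->writes_in_flight >= MAX_WRITE_TAGS)
		return 0;		/* writes can't take the last tags */
	return 1;
}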

(Of course, if disks don't support tagged queueing, you'll never have this 
problem at all, but most disks do these days, and I strongly suspect it 
really can aggravate latency numbers a lot).

Jens? Comments? Or do you do that already?

			Linus