Message-ID: <48791393.1020107@gmail.com>
Date: Sat, 12 Jul 2008 23:26:59 +0300
From: Török Edwin <edwintorok@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
CC: Ingo Molnar <mingo@...e.hu>, Roland McGrath <roland@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Elias Oltmanns <eo@...ensachen.de>,
Arjan van de Ven <arjan@...radead.org>,
Oleg Nesterov <oleg@...sign.ru>
Subject: Re: [PATCH] x86_64: fix delayed signals
On 2008-07-12 20:29, Linus Torvalds wrote:
> On Sat, 12 Jul 2008, Török Edwin wrote:
>
>> On my 32-bit box (slow disks, SMP, XFS filesystem) 2.6.26-rc9 behaves
>> the same as 2.6.26-rc8: I can reliably reproduce a 2-3 second latency
>> [1] between pressing ^C the first time and the shell returning (on the
>> text console too).
>> Using ftrace available from tip/master, I see up to 3 seconds of delay
>> between kill_pgrp and detach_pid (and during that time I can press ^C
>> again, leading to 2-3 kill_pgrp calls)
>>
>
> The thing is, it's important to see what happens in between.
>
> In particular, 2-3 second latencies can be entirely _normal_ (although
> obviously very annoying) with most log-based filesystems when they decide
> they have to flush the log.
A bit off-topic, but something I noticed during the tests:
In my original test I rm-ed the files right after launching the dd
processes in the background, yet they continued writing to the disk.
I can understand that if a file is opened O_RDWR you might seek back
and read what you wrote, so Linux needs to actually do the write;
but why does it insist on writing to the disk a file that was opened
O_WRONLY, after the file itself has been unlinked?
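For reference, roughly what my test does (a sketch; the path and the
sizes are just what I happened to use):

# start a big sequential write in the background
dd if=/dev/zero of=/tmp/bigfile bs=1M count=2048 &
# unlink it almost immediately; dd still holds the fd open,
# so the inode stays alive and writeback continues
sleep 1
rm /tmp/bigfile
# the dirty pages still get flushed to disk:
grep -E 'Dirty|Writeback' /proc/meminfo
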
> A lot of filesystems are not designed for
> latency - every single filesystem test I have ever seen has always been
> either a throughput test, or a "average random-seek latency" kind of test.
>
> The exact behavior will depend on the filesystem, for example. It will
> also easily depend on things like whether you update 'atime' or not. Many
> ostensibly read-only loads end up writing some data, especially inode
> atimes, and that's when they can get caught up in having to wait for a log
> to flush (to make up for that atime thing).
>
I have my filesystems mounted as noatime already.
But yes, I am using different filesystems: the x86-64 box has reiserfs,
and the x86-32 box has xfs.
> You can try to limit the amount of dirty data in flight by tweaking
> /proc/sys/vm/dirty*ratio
I have these in my /etc/rc.local:
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 >/proc/sys/vm/dirty_ratio
> , but from a latency standpoint the thing that
> actually matters more is often not the amount of dirty data, but the size
> of the requests queues - because you often care about read latency, but if
> you have big requests and especially if you have a load that does lots of
> big _contiguous_ writes (like your 'dd's would do), then what can easily
> happen is that the read ends up being behind a really big write in the
> request queues.
>
I will try tweaking these (and /sys/block/sda/queue/nr_requests that
Arjan suggested).
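I assume something like this (sda is the disk on the x86-32 box, and
128 should be the default nr_requests, if I remember correctly):

# check the current request queue depth
cat /sys/block/sda/queue/nr_requests
# shrink it, so a synchronous read cannot queue up
# behind dozens of big contiguous writes
echo 32 > /sys/block/sda/queue/nr_requests
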
> And 2-3 second latencies by no means mean that each individual IO is 2-3
> seconds long. No - it just means that you ended up having to do multiple
> reads synchronously, and since the reads depended on each other (think a
> pathname lookup - reading each directory entry -> inode -> data), you can
> easily have a single system call causing 5-10 reads (bad cases are _much_
> more, but 5-10 are perfectly normal for even well-behaved things), and now
> if each of those reads ends up being behind a fairly big write...
>
I'll try blktrace tomorrow; it should tell me when I/Os are queued and
completed.
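Roughly this, if I read the blktrace docs right (it needs debugfs
mounted):

mount -t debugfs none /sys/kernel/debug 2>/dev/null
blktrace -d /dev/sda -o mytrace &
# ...reproduce the dd + find + ^C workload here...
kill %1
# Q events are queued requests, C events are completions;
# the timestamp delta between them is the per-request latency
blkparse -i mytrace | less
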
>
>> On my 64-bit box (2 disks in raid-0, UP, reiserfs filesystem) 2.6.25 and
>> 2.6.26-rc9 behave the same, and most of the time (10-20 times in a row)
>> find responds to ^C instantly.
>>
>> However in _some_ cases find doesn't respond to ^C for a very long time
>> (~30 seconds), and when this happens I can't do anything else but switch
>> consoles, starting another process (latencytop -d) hangs, and so does
>> any other external command.
>>
>
> Ok, that is definitely not related to signals at all. You're simply stuck
> waiting for IO - or perhaps some fundamental filesystem semaphore which is
> held while some IO needs to be flushed.
AFAICT reiserfs still uses the BKL; could that explain why one I/O
delays another?
> That's why _unrelated_ processes
> hang: they're all waiting for a global resource.
>
> And it may be worse on your other box for any number of reasons: raid
> means, for example, that you have two different levels of queueing, and
> thus effectively your queues are longer. And while raid-0 is better for
> throughput, it's not necessarily at all better for latency. The filesystem
> also makes a difference, as does the amount of dirty data under write-back
> (do you also have more memory in your x86-64 box, for example? That makes
> the kernel do bigger writeback buffers by default)
>
>
Yes, I have more memory in the x86-64 box (2G); the x86-32 box has 1G.
>> I haven't yet tried ftrace on this box, nor have I tried Roland's
>> patch yet. I will try that now, and hopefully come back with some
>> numbers shortly.
>>
>
> Trust me, Roland's patch will make no difference whatsoever. It's purely
> a per-thread thing, and your behaviour is clearly not per-thread.
>
Indeed.
[Roland's patch is included in tip/master, so I had actually tried it
without realizing it.]
> Signals are _always_ delayed until non-blocking system calls are done, and
> that means until the end of IO.
>
> This is also why your trace on just 'kill_pgrp' and 'detach_pid' is not
> interesting. It's _normal_ to have a delay between them. It can happen
> because the process blocks (or catches) signals, but it will also happen
> if some system call waits for disk.
>
Is there a way to trace what happens between those two functions?
Maybe I could drop the trace filter (and thus trace all functions) and
modify the kernel sources to start tracing in kill_pgrp and stop
tracing in detach_pid, along the lines of the sketch below.
Would that provide useful info?
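Something along these lines is what I had in mind, using the ftrace
interface from tip/master (file names from memory, so treat this as a
sketch; I have debugfs mounted at /debug):

echo ftrace > /debug/tracing/current_tracer
# leave set_ftrace_filter empty, i.e. trace all functions
echo > /debug/tracing/set_ftrace_filter
echo 1 > /debug/tracing/tracing_enabled
# ...press ^C in the hanging find...
echo 0 > /debug/tracing/tracing_enabled
cat /debug/tracing/trace > ~/trace.txt

Toggling the trace from inside kill_pgrp/detach_pid would need a small
kernel patch instead of the echo 1/0 from userspace, but the idea is
the same.
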
Best regards,
--Edwin