linux-kernel - Re: Why processes on linux loses signals?

Message-ID: <2c0942db0911221739m2e5a1bb3vea69bccbfb3306cf@mail.gmail.com>
Date:	Sun, 22 Nov 2009 17:39:06 -0800
From:	Ray Lee <ray-lk@...rabbit.org>
To:	Michael Tokarev <mjt@....msk.ru>, Oleg Nesterov <oleg@...hat.com>,
	roland@...hat.com
Cc:	Linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: Why processes on linux loses signals?

[ adding potentially interested parties to the CC:. Michael, please respond
with the latest kernel version you've tried that exhibits the problem, as well
as whether or not you've been able to create a test case that shows the
signal loss. ]

On Sun, Nov 22, 2009 at 1:14 PM, Michael Tokarev <mjt@....msk.ru> wrote:
> It's a very old issue, but I still don't know the answer.
>
> In short, processes on Linux lose signals.  It happens
> rarely, but it happens, and it happens often enough to be
> annoying.
>
> For example, I have a program that used alarm(2) to periodically
> check for something.  Nothing fancy, nothing interesting is done
> in the signal handler, no long operations or anything like that,
> just plain signal(2) with a handler that sets a global variable.
> Under heavy usage (it's a DNS nameserver), after about a week
> (sometimes a few hours, sometimes a month) it stops checking
> for updates, apparently because some SIGALRM got lost.
>
> For this program I had to replace alarm() with setitimer(), but
> only on Linux.  On all other operating systems (Solaris, FreeBSD,
> HP/UX, AIX) where it is used, everything works as expected.
>
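
For reference, a minimal sketch of the setup described above: a SIGALRM
handler that only sets a global flag, driven by setitimer() instead of
alarm().  This is an assumption about the shape of such code, not the
nameserver's actual source; the names check_due, start_periodic_check()
and take_check_due() are hypothetical.  It uses sigaction() rather than
signal(), whose reset semantics vary between systems, and a repeating
interval timer, so nothing has to re-arm the timer after each tick.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical flag: set by the handler, consumed by the main loop. */
static volatile sig_atomic_t check_due = 0;

/* The handler does nothing but record that the timer fired. */
static void on_alarm(int signo)
{
    (void)signo;
    check_due = 1;
}

/*
 * Install the SIGALRM handler and arm a repeating real-time timer.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
static int start_periodic_check(long interval_usec)
{
    struct sigaction sa;
    struct itimerval tv;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted syscalls where possible */
    if (sigaction(SIGALRM, &sa, NULL) == -1)
        return -1;

    tv.it_interval.tv_sec = interval_usec / 1000000;
    tv.it_interval.tv_usec = interval_usec % 1000000;
    tv.it_value = tv.it_interval;   /* first expiry after one full interval */
    return setitimer(ITIMER_REAL, &tv, NULL);
}

/* Returns 1 and clears the flag if a tick was recorded, otherwise 0. */
static int take_check_due(void)
{
    if (!check_due)
        return 0;
    check_due = 0;
    return 1;
}
```

The main loop would call take_check_due() once per iteration and run the
update check whenever it returns 1.
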
> Another common issue is a SIGIO-based event loop in its classical
> form, in a process that is not heavily loaded.  Quite often the
> server loses a SIGIO, so even though I/O is possible, the process
> does not know.  The pending (or stuck) I/O gets processed on receipt
> of the next SIGIO, which indicates readiness of another file
> descriptor: since the process does poll() after SIGIO, it notices both.
>
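
A minimal sketch of the classical SIGIO pattern described above, assuming
a handler that only sets a flag and a poll() over every descriptor
afterwards.  The names io_pending, install_sigio_handler(), enable_sigio()
and poll_ready() are hypothetical.  Standard signals do not queue, so
several readiness events can collapse into a single SIGIO; that is why the
loop polls all descriptors instead of assuming one signal per event.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for O_ASYNC */
#endif
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical flag: set by the SIGIO handler, cleared before each poll. */
static volatile sig_atomic_t io_pending = 0;

static void on_sigio(int signo)
{
    (void)signo;
    io_pending = 1;
}

/* Install the SIGIO handler.  Returns 0 on success, -1 on error. */
static int install_sigio_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGIO, &sa, NULL);
}

/* Ask the kernel to send SIGIO to this process when fd becomes ready. */
static int enable_sigio(int fd)
{
    int flags;

    if (fcntl(fd, F_SETOWN, getpid()) == -1)
        return -1;
    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK);
}

/*
 * Poll every registered descriptor.  The flag is cleared before polling:
 * a SIGIO that arrives afterwards sets it again, so the caller knows to
 * poll once more.  Returns the number of ready descriptors, or -1.
 */
static int poll_ready(struct pollfd *fds, nfds_t nfds, int timeout_ms)
{
    int n;

    io_pending = 0;
    do {
        n = poll(fds, nfds, timeout_ms);
    } while (n == -1 && errno == EINTR);   /* retry if a signal interrupted us */
    return n;
}
```

Passing a finite timeout_ms to poll_ready() from the main loop is one way
to bound the delay if a notification is ever missed; that is a suggestion
for illustration, not something the programs discussed here are known to do.
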
> A "classical" (for me) example of this is Oracle database
> version 8 (we still have many of these in production; in later
> versions they rewrote the event loop to use different techniques).
> There is a dispatcher process that does nothing but listen
> on the network, receive requests and send them to a set of
> worker processes.  Everything is non-blocking and the process
> is mostly idle.  It is very annoying when trivial actions
> in a user application cause long delays: an app sends
> a request to the Oracle DB and that request gets stuck in the
> event queue because the corresponding SIGIO was never delivered.
> Making another connection to the same DB immediately unsticks
> that request.  This happens transparently when many users are
> working with the database at the same time, each making
> requests: any stuck or lost I/O is unstuck immediately
> because new requests keep coming from other users.  But in the
> evenings or during periods of low activity it becomes a real problem.
>
> I looked at the server behaviour numerous times: the server (Oracle)
> behaves quite reasonably, and the strace output is sane enough.  That
> is to say, one can't blame "stupid closed-source programmers" for this.
>
> There are other examples like this, all involving lost signals.
> The two above are just the most "famous" for me.
>
> The problem becomes much, much worse when a system has multiple
> cores.  On a single-CPU system the situation is rare enough to
> be almost unnoticeable.  But with even a second core the issue
> emerges almost immediately, often enough for many users to start
> calling tech support because their apps are very slow.
>
> Last time I asked a similar question here, I was told that signals
> are unreliable and should not be used.  But what is the reason for
> the unreliability, and why should signals be unreliable on Linux
> only?
>
> Thanks!
>
> /mjt