Date:	Wed, 19 Jan 2011 14:40:38 +0100
From:	Peter Kruse <pk@...eap.com>
To:	linux-kernel@...r.kernel.org
Subject: page allocation failure leads to server unusability in igb_alloc_rx_buffers_adv

Hello,

one of our servers (Supermicro X8DTN) experiences random "crashes"
every two weeks or so.  The only real hint at a cause is
the page allocation failure, which I have attached.  There
are no messages in kern.log indicating broken hardware.
I put "crashes" in quotes because the server keeps running,
for example writing messages to syslog and running cron jobs,
but it seems that all network-related actions
fail.  For example:

----------------------------8<-----------------------------------------------------

nagios3: HOST ALERT: beo-06;DOWN;SOFT;8;CRITICAL - popen timeout received, but no child process
...
postfix/sendmail[6969]: fatal: no login name found for user ID 2403
# (although the ID is known)
...
CRON[11286]: Authentication service cannot retrieve authentication info.
...
sshd[14990]: fatal: login_init_entry: Cannot find user "..."
# (although that user exists!)
...
postfix/cleanup[18772]: warning: problem talking to service rewrite: Connection timed out

----------------------------8<-----------------------------------------------------

So the server is unusable insofar as login is no longer possible
and network-related services like NFS no longer respond.

The message I have attached occurred four days before the server
showed the other errors, so it is hard to believe that the two
are related, but since this is the only message that we have,
we think there must be some connection.  We would appreciate it
if you could help interpret the messages.  The server has
48GB of RAM and no swap is defined.  For now we have
increased the value in /proc/sys/vm/min_free_kbytes and hope
that the allocation failure will happen less frequently.
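
To make the min_free_kbytes setting mentioned above easier to check,
here is a minimal sketch (not something we run in production) that
compares the current reserve with MemFree from /proc/meminfo.  The
function names parse_meminfo and free_headroom_kb are made up for
this illustration; only the two /proc paths are real.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def free_headroom_kb(meminfo_text, min_free_kbytes):
    """Return MemFree minus the min_free_kbytes reserve, in kB (hypothetical helper)."""
    return parse_meminfo(meminfo_text)["MemFree"] - min_free_kbytes


if __name__ == "__main__":
    # Reads the live values; only meaningful on a Linux host.
    meminfo = Path("/proc/meminfo").read_text()
    reserve = int(Path("/proc/sys/vm/min_free_kbytes").read_text())
    print(f"min_free_kbytes = {reserve} kB")
    print(f"MemFree headroom = {free_headroom_kb(meminfo, reserve)} kB")
```

A small or negative headroom would only be a rough hint that free
memory is close to the reserve; it says nothing about fragmentation,
which may matter more for this kind of allocation failure.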

Thanks,

   Peter

PS: please CC me, as I'm not subscribed.

View attachment "dmesg" of type "text/plain" (7249 bytes)
