lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Fri, 17 Aug 2012 21:12:38 +0400
From:	Michael Tokarev <mjt@....msk.ru>
To:	"J. Bruce Fields" <bfields@...ldses.org>
CC:	"Myklebust, Trond" <Trond.Myklebust@...app.com>,
	"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
	Linux-kernel <linux-kernel@...r.kernel.org>,
	Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: 3.0+ NFS issues (bisected)

On 17.08.2012 20:00, J. Bruce Fields wrote:
[]> Uh, if I grepped my way through this right: it looks like it's the
> "memory" column of the "TCP" row of /proc/net/protocols; might be
> interesting to see how that's changing over time.

This file does not look interesting.  Memory usage does not jump,
there's no high increase either.

But there's something else which is interesting here.

I noticed that in perf top, the top consumer of CPU is svc_recv()
(I mentioned this in the start of this thread).  So I looked how
this routine is called from nfsd.  And here we go.

fs/nfsd/nfssvc.c:

/*
 * This is the NFS server kernel thread
 */
static int
nfsd(void *vrqstp)
{
...
        /*
         * The main request loop
         */
        for (;;) {
                /*
                 * Find a socket with data available and call its
                 * recvfrom routine.
                 */
int i = 0;
                while ((err = svc_recv(rqstp, 60*60*HZ)) == -EAGAIN)
                        ++i;
printk(KERN_ERR "calling svc_recv: %d times (err=%d)\n", i, err);
                if (err == -EINTR)
                        break;
...

(I added the "i" counter and the printk).  And here's the output:

[19626.401136] calling svc_recv: 0 times (err=212)
[19626.405059] calling svc_recv: 1478 times (err=212)
[19626.409512] calling svc_recv: 1106 times (err=212)
[19626.543020] calling svc_recv: 0 times (err=212)
[19626.543059] calling svc_recv: 0 times (err=212)
[19626.548074] calling svc_recv: 0 times (err=212)
[19626.549515] calling svc_recv: 0 times (err=212)
[19626.552320] calling svc_recv: 0 times (err=212)
[19626.553503] calling svc_recv: 0 times (err=212)
[19626.556007] calling svc_recv: 0 times (err=212)
[19626.557152] calling svc_recv: 0 times (err=212)
[19626.560109] calling svc_recv: 0 times (err=212)
[19626.560943] calling svc_recv: 0 times (err=212)
[19626.565315] calling svc_recv: 1067 times (err=212)
[19626.569735] calling svc_recv: 2571 times (err=212)
[19626.574150] calling svc_recv: 3842 times (err=212)
[19626.581914] calling svc_recv: 2891 times (err=212)
[19626.583072] calling svc_recv: 1247 times (err=212)
[19626.616885] calling svc_recv: 0 times (err=212)
[19626.616952] calling svc_recv: 0 times (err=212)
[19626.622889] calling svc_recv: 0 times (err=212)
[19626.624518] calling svc_recv: 0 times (err=212)
[19626.627118] calling svc_recv: 0 times (err=212)
[19626.629735] calling svc_recv: 0 times (err=212)
[19626.631777] calling svc_recv: 0 times (err=212)
[19626.633986] calling svc_recv: 0 times (err=212)
[19626.636746] calling svc_recv: 0 times (err=212)
[19626.637692] calling svc_recv: 0 times (err=212)
[19626.640769] calling svc_recv: 0 times (err=212)
[19626.657852] calling svc_recv: 0 times (err=212)
[19626.661602] calling svc_recv: 0 times (err=212)
[19626.670160] calling svc_recv: 0 times (err=212)
[19626.671917] calling svc_recv: 0 times (err=212)
[19626.684643] calling svc_recv: 0 times (err=212)
[19626.684680] calling svc_recv: 0 times (err=212)
[19626.812820] calling svc_recv: 0 times (err=212)
[19626.814697] calling svc_recv: 0 times (err=212)
[19626.817195] calling svc_recv: 0 times (err=212)
[19626.820324] calling svc_recv: 0 times (err=212)
[19626.822855] calling svc_recv: 0 times (err=212)
[19626.824823] calling svc_recv: 0 times (err=212)
[19626.828016] calling svc_recv: 0 times (err=212)
[19626.829021] calling svc_recv: 0 times (err=212)
[19626.831970] calling svc_recv: 0 times (err=212)

> the stall begin:
[19686.823135] calling svc_recv: 3670352 times (err=212)
[19686.823524] calling svc_recv: 3659205 times (err=212)

> transfer continues
[19686.854734] calling svc_recv: 0 times (err=212)
[19686.860023] calling svc_recv: 0 times (err=212)
[19686.887124] calling svc_recv: 0 times (err=212)
[19686.895532] calling svc_recv: 0 times (err=212)
[19686.903667] calling svc_recv: 0 times (err=212)
[19686.922780] calling svc_recv: 0 times (err=212)

So we're calling svc_recv in a tight loop, eating
all available CPU.  (The above is with just 2 nfsd
threads).

Something is definitely wrong here.  And it happens mure more
often after the mentioned commit (f03d78db65085).

Thanks,

/mjt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ