Date:	Fri, 16 Jan 2009 18:29:36 -0800
From:	Mike Waychison <mikew@...gle.com>
To:	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention

We've noticed that a system can fairly easily begin to livelock on
dcache_lock/inode_lock (specifically in atomic_dec_and_lock()) when a lot of
dentries are getting finalized at the same time (massive deletes and large
fdtable destructions are two paths I've seen cause problems).

This patchset is an attempt to reduce the locking overhead associated with
final dput() and final iput().  It does so by batching dentries and inodes
onto per-process queues and processing them in 'parallel' to consolidate some
of the locking.
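
To give a rough feel for the shape of the change without reading the patches,
here is an illustrative sketch of the dentry side.  This is not the actual
patch code: the per-task queue, the d_deferred list hook and the
dentry_free_locked() helper are made-up names, although a
process_postponed_dentries() does show up in the lock_stat output below.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/dcache.h>
#include <linux/sched.h>

/* Per-task queue of dentries whose final dput() has been postponed.
 * Assumes a list hook ("d_deferred") in struct dentry and an instance
 * of this struct hanging off task_struct -- both hypothetical here. */
struct postponed_dentries {
        struct list_head head;
        unsigned int nr;
};

/* Drop a whole batch under a single dcache_lock acquisition instead of
 * paying one atomic_dec_and_lock() per dentry.  Glosses over details
 * such as dentries that regain a reference while queued. */
static void process_postponed_dentries(void)
{
        struct postponed_dentries *pd = &current->postponed_dentries;
        struct dentry *dentry, *next;

        spin_lock(&dcache_lock);
        list_for_each_entry_safe(dentry, next, &pd->head, d_deferred) {
                list_del(&dentry->d_deferred);
                if (atomic_dec_and_test(&dentry->d_count))
                        dentry_free_locked(dentry); /* hypothetical teardown, lock held */
        }
        spin_unlock(&dcache_lock);
        pd->nr = 0;
}

/* Called where dput() would otherwise drop the last reference: the
 * caller's reference is transferred to the queue and dropped later,
 * in bulk, by process_postponed_dentries(). */
static void postpone_dput(struct dentry *dentry)
{
        struct postponed_dentries *pd = &current->postponed_dentries;

        list_add_tail(&dentry->d_deferred, &pd->head);
        if (++pd->nr >= 32)             /* arbitrary batch size */
                process_postponed_dentries();
}

The 'parallel' part of the series presumably goes further than this and does
the common teardown work for the whole batch together, but the batching above
is what shows up as fewer dcache_lock acquisitions in the numbers below.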

Besides various workload testing, I threw together a load (at the end of this
email) that causes massive fdtables (50K sockets by default) to get destroyed
on each CPU in the system.  It also populates the dcache for procfs on those
tasks for good measure.  Below is a comparison of lock_stat results; the
hardware is a Sun x4600 M2 populated with eight 4-core 2.3GHz packages
(32 CPUs) and 128GiB of RAM:

BEFORE PATCHSET:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                             dcache_lock:       5666485       10086172          11.16     2274036.47  3821690090.32       11042789       14926411           2.78       42138.28    65887138.46
                             -----------
                             dcache_lock        3016578          [<ffffffff810bd7e7>] d_alloc+0x190/0x1e1
                             dcache_lock        1844569          [<ffffffff810bcdfc>] d_instantiate+0x2c/0x51
                             dcache_lock        4223784          [<ffffffff8119bd88>] _atomic_dec_and_lock+0x3c/0x5c
                             dcache_lock         207544          [<ffffffff810bc36a>] d_rehash+0x1b/0x43
                             -----------
                             dcache_lock        2305449          [<ffffffff810bcdfc>] d_instantiate+0x2c/0x51
                             dcache_lock        1954160          [<ffffffff810bd7e7>] d_alloc+0x190/0x1e1
                             dcache_lock        4169163          [<ffffffff8119bd88>] _atomic_dec_and_lock+0x3c/0x5c
                             dcache_lock         866247          [<ffffffff810bc36a>] d_rehash+0x1b/0x43

...............................................................................................................................................................................................

                              inode_lock:        202813         203064          12.91      376655.85    37696597.26        5136095        8001156           3.37       15142.77     5033144.13
                              ----------
                              inode_lock          20192          [<ffffffff810bfdbc>] new_inode+0x27/0x63
                              inode_lock              1          [<ffffffff810c6f58>] sync_inode+0x19/0x39
                              inode_lock              5          [<ffffffff810c76b5>] __mark_inode_dirty+0xe2/0x165
                              inode_lock              2          [<ffffffff810c6e92>] __writeback_single_inode+0x1bf/0x26c
                              ----------
                              inode_lock          20198          [<ffffffff810bfdbc>] new_inode+0x27/0x63
                              inode_lock              1          [<ffffffff810c76b5>] __mark_inode_dirty+0xe2/0x165
                              inode_lock              5          [<ffffffff810c70fc>] generic_sync_sb_inodes+0x33/0x31a
                              inode_lock         165016          [<ffffffff8119bd88>] _atomic_dec_and_lock+0x3c/0x5c

...............................................................................................................................................................................................

                       &sbsec->isec_lock:         34428          34431          16.77       67593.54     3780131.15        1415432        3200261           4.06       13357.96     1180726.21






AFTER PATCHSET [Note that inode_lock doesn't even show up anymore]:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                             dcache_lock:       2732103        5254824          11.04     1314926.88  2187466928.10        5551173        9751332           2.43       78012.24    42830806.29
                             -----------
                             dcache_lock        3015590          [<ffffffff810bdb04>] d_alloc+0x190/0x1e1
                             dcache_lock        1966978          [<ffffffff810bd118>] d_instantiate+0x2c/0x51
                             dcache_lock         258657          [<ffffffff810bc3cd>] d_rehash+0x1b/0x43
                             dcache_lock             30          [<ffffffff810b7ad4>] __link_path_walk+0x42d/0x665
                             -----------
                             dcache_lock        2493589          [<ffffffff810bd118>] d_instantiate+0x2c/0x51
                             dcache_lock        1862921          [<ffffffff810bdb04>] d_alloc+0x190/0x1e1
                             dcache_lock         882054          [<ffffffff810bc3cd>] d_rehash+0x1b/0x43
                             dcache_lock          15504          [<ffffffff810bc5e1>] process_postponed_dentries+0x3e/0x29f

...............................................................................................................................................................................................
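
For reference, the tables above are plain /proc/lock_stat output (the kernel
needs CONFIG_LOCK_STAT), and writing a 0 to /proc/lock_stat resets the
counters between runs.  In case anyone wants to reproduce the comparison, a
minimal snapshot helper (the output path is just an example):

#include <stdio.h>

/* Copy the current contents of /proc/lock_stat to 'path' so that
 * before/after runs of the load program can be diffed. */
static int snapshot_lock_stat(const char *path) {
  FILE *in = fopen("/proc/lock_stat", "r");
  FILE *out = fopen(path, "w");
  int c, ret = -1;

  if (in && out) {
    while ((c = getc(in)) != EOF)
      putc(c, out);
    ret = 0;
  }
  if (in)
    fclose(in);
  if (out)
    fclose(out);
  return ret;
}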


ChangeLog:
  [v1] - Jan 16, 2009
    - Initial version

Patches in series:

  1)  Deferred batching of dput()
  2)  Parallel dput()
  3)  Deferred batching of iput()
  4)  Fixing iput called from put_super path
  5)  Parallelize iput()
  6)  hugetlbfs drop_inode update
  7)  Make drop_caches flush pending dput()s and iput()s
  8)  Make the sync path drain dentries and inodes

Caveats:

  * It's not clear to me how to make sync(2) do what it should.  Currently we
flush pending dputs and iputs before writing anything out, but there may be
hidden iputs in the do_sync path that really ought to be synchronous.
  * I don't like the changes in patch 4; they should probably be reworked so
that the ->put_super callback paths call a synchronous version of iput()
(perhaps sync_iput()?) instead of having to tag the inodes with S_SYNCIPUT.
A rough sketch of that alternative follows this list.
  * cramfs, gfs2, ocfs2 need to be updated for the new drop_inode semantics.
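
A minimal sketch of what such a sync_iput() could look like (illustrative
only; it would live in fs/inode.c and reuse iput_final(), the existing static
helper there that is called with inode_lock held and releases it itself):

#include <linux/fs.h>
#include <linux/writeback.h>
#include <linux/spinlock.h>

/* Drop the reference inline, bypassing the deferred queue, so callers
 * such as the ->put_super() paths see the final iput complete
 * synchronously.  (Any iputs already parked on some task's queue would
 * still need draining elsewhere.) */
static void sync_iput(struct inode *inode)
{
        if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
                iput_final(inode);      /* releases inode_lock itself */
}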
---

 fs/block_dev.c         |    2 
 fs/dcache.c            |  374 +++++++++++++++++++++++++++++++++++++----
 fs/drop_caches.c       |    1 
 fs/ext2/super.c        |    2 
 fs/gfs2/glock.c        |    2 
 fs/gfs2/ops_fstype.c   |    2 
 fs/hugetlbfs/inode.c   |   72 +++++---
 fs/inode.c             |  441 +++++++++++++++++++++++++++++++++++++++++-------
 fs/ntfs/super.c        |    2 
 fs/smbfs/inode.c       |    2 
 fs/super.c             |    6 -
 fs/sync.c              |    5 +
 include/linux/dcache.h |    1 
 include/linux/fs.h     |   18 ++
 14 files changed, 787 insertions(+), 143 deletions(-)

-- 
// Copyright 2008 Google Inc. All Rights Reserved.
// Author: mikew@...gle.com (Mike Waychison)
//
//    This program is free software: you can redistribute it and/or modify
//    it under the terms of the GNU General Public License as published by
//    the Free Software Foundation, either version 3 of the License, or
//    (at your option) any later version.
//
//    This program is distributed in the hope that it will be useful,
//    but WITHOUT ANY WARRANTY; without even the implied warranty of
//    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
//    GNU General Public License for more details.
//
//    You should have received a copy of the GNU General Public License
//    along with this program.  If not, see <http://www.gnu.org/licenses/>.
//
// Description: Meant to make a kernel go nuts by stressing the dcache with
// fd implosion.

#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <dirent.h>
#include <fcntl.h>
#include <inttypes.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include <time.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#include <iostream>
#include <string>

using std::cout;
using std::string;

static char *who_am_i;

static bool wait_on_kids = true;
static int files_per_cpu = 50000;
static int wait_before_slaughter = 0;

static const int kMagicSignal = SIGUSR2;

static const int kMaxCPUs = __NCPUBITS;

// Minimal x86-only atomic counter, shared between the parent and children
// via the MAP_SHARED page set up in InitSharedArea().
typedef struct { volatile int counter; } atomic_t;
#define LOCK_PREFIX "lock "
static inline void atomic_inc(atomic_t *v)
{
  __asm__ __volatile__(
    LOCK_PREFIX "incl %0"
    :"=m" (v->counter)
    :"m" (v->counter));
}
#define atomic_read(v)		((v)->counter)
#define atomic_set(v,i)		(((v)->counter) = (i))

static void *shared_area;

static atomic_t *children_done;

#define Abort() DoAbort(__LINE__)
static void DoAbort(int line) {
  cout << "Aborted at line " << line << " errno=" << errno << "\n";
  exit(1);
}

static void InitSharedArea() {
  shared_area = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0);
  if (shared_area == MAP_FAILED)
    Abort();
  children_done = static_cast<atomic_t *>(shared_area);
  atomic_set(children_done, 0);
}

static void get_cpu_bitmap(cpu_set_t *bitmap) {
  CPU_ZERO(bitmap);
  if (sched_getaffinity(0, sizeof(*bitmap), bitmap) < 0)
    Abort();
}

static void run_child(int cpu) {
  // Start by resetting SIGTERM to its default handler to avoid google3 fallout
  signal(SIGTERM, SIG_DFL);

  // Pin ourselves to the cpu
  cpu_set_t cpumask;
  CPU_ZERO(&cpumask);
  CPU_SET(cpu, &cpumask);
  if (sched_setaffinity(0, sizeof(cpumask), &cpumask) < 0)
    Abort();

  // Start off by creating tons of open sockets.  We'll just make dummy
  // sockets for now.
  printf("Making sockets\n");
  for (int i = 0; i < files_per_cpu; i++) {
    if (socket(PF_INET, SOCK_STREAM, 0) < 0) {
      printf("Ran out of file descriptors? errno == %d\n", errno);
      kill(getppid(), SIGABRT);
      exit(0);
    }
  }
  printf("Done making sockets\n");

  // Now populate the proc_inode_cache
  pid_t my_pid = getpid();
  char fd_path[100];
  sprintf(fd_path, "/proc/%d/fd", my_pid);
  struct stat stat_dat2;
  if (lstat(fd_path, &stat_dat2) < 0)
    Abort();
  cout << "Mode : " << stat_dat2.st_mode << "\n";
  DIR *dir = opendir(fd_path);
  if (dir == static_cast<DIR *>(NULL))
    Abort();

  printf("Done opening fd directory\n");
  struct dirent *entry;
  while ((entry = readdir(dir)) != NULL) {
    // Now stat it to make sure it gets populated in dcache
    char file_path[100];
    sprintf(file_path, "/proc/%d/fd/%s", my_pid, entry->d_name);
    struct stat stat_dat;
    if (lstat(file_path, &stat_dat) < 0)
      Abort();
  }
  closedir(dir);
  printf("Done looking at directory\n");

  // Now tell the parent that we are done
  atomic_inc(children_done);

  // And wait around to die
  for (;;) {
    timeval tv = {1, 0};
    // Poke the parent
    if (atomic_read(children_done) != 0)
      kill(getppid(), kMagicSignal);
    select(0, NULL, NULL, NULL, &tv);
  }

  Abort();
}

static void handle_sigusr1(int signum) {
}

static void make_child(int cpu, pid_t *child_pids) {
  // Set up signal so that we know when the child is done.
  if (signal(kMagicSignal, handle_sigusr1) == SIG_ERR)
    Abort();

  pid_t pid;
  if ((pid = fork()) < 0)
    Abort();
  if (pid == 0) {
    run_child(cpu);
    return;
  }

  // Parent
  child_pids[cpu] = pid;

  // Nothing left to do for this child
}

static void crank_up_filemax() {
  int fd;
  if ((fd = open("/proc/sys/fs/file-max", O_RDWR)) < 0)
    Abort();
  char msg[100];
  sprintf(msg, "%d\n", files_per_cpu * kMaxCPUs);
  write(fd, msg, strlen(msg));
  close(fd);
}

static void UsageAndDie() {
  cout << "Usage: Hurt a kernel with fd implosion\n";
  cout << who_am_i << " [options]\n";
  cout << "  --dont_wait_on_kids : Exit before kids are dead\n";
  cout << "  --wait_on_kids <0|1> : Wait for children to exit (default 1)\n";
  cout << "  --files_per_cpu <n> : How many files should be created per cpu"
          " (default 50000)\n";
  cout << "  --wait_before_slaughter <secs> : How long to sleep before killing"
          " children (default 0)\n";
  exit(1);
}

static int ParseInt(char *arg)
{
  char *endp;
  int ret;
  ret = (int)strtol(arg, &endp, 0);
  if (*endp) {
    cout << "Argument '" << arg << "' not an integer\n";
    UsageAndDie();
  }
  return ret;
}

static void ParseArgs(int argc, char **argv)
{
  who_am_i = argv[0];
  argc--;
  argv++;

  while (argc) {
    if (!strcmp(argv[0], "--wait_on_kids")) {
      if (argc < 2)
        UsageAndDie();
      wait_on_kids = ParseInt(argv[1]);
      argc -= 2;
      argv += 2;
      continue;
    }
    if (!strcmp(argv[0], "--files_per_cpu")) {
      if (argc < 2)
        UsageAndDie();
      files_per_cpu = ParseInt(argv[1]);
      argc -= 2;
      argv += 2;
      continue;
    }
    if (!strcmp(argv[0], "--dont_wait_on_kids")) {
      wait_on_kids = false;
      argc -= 1;
      argv += 1;
      continue;
    }
    cout << "Don't understand option " << argv[0] << "\n";
    UsageAndDie();
  }
}
int main(int argc, char **argv) {
  ParseArgs(argc, argv);
  if (getuid() != 0) {
    cout << "This program must be run as root\n";
    Abort();
  }

  InitSharedArea();

  crank_up_filemax();

  // Up our count of allowable open files
  struct rlimit new_limits = {files_per_cpu + 20,
                              files_per_cpu + 20};
  if (setrlimit(RLIMIT_NOFILE, &new_limits) < 0)
    Abort();

  cpu_set_t cpu_bitmap;
  get_cpu_bitmap(&cpu_bitmap);

  pid_t child_pids[kMaxCPUs];

  // Build all the children
  int childCount = 0;
  for (int i = 0; i < kMaxCPUs; i++) {
    if (CPU_ISSET(i, &cpu_bitmap)) {
      cout << "About to run child on cpu" << i << "\n";
      make_child(i, child_pids);
      childCount++;
    }
  }

  // Wait for all the children to finish
  for (;;) {
    if (atomic_read(children_done) == childCount) {
      atomic_set(children_done, 0);
      break;
    }
    timeval tv = {1, 0};
    select(0, NULL, NULL, NULL, &tv);
  }

  sleep(wait_before_slaughter);

  cout << "About to start killing children\n";

  timeval before_tv;
  gettimeofday(&before_tv, NULL);

  // Now kill all the children
  for (int i = 0; i < kMaxCPUs; i++) {
    if (CPU_ISSET(i, &cpu_bitmap)) {
      kill(child_pids[i], SIGTERM);
    }
  }

  // Wait on em
  if (wait_on_kids) {
    for (int i = 0; i < kMaxCPUs; i++) {
      if (CPU_ISSET(i, &cpu_bitmap)) {
        int status;
        waitpid(child_pids[i], &status, 0);
      }
    }
  }

  timeval after_tv;
  gettimeofday(&after_tv, NULL);

  int seconds = after_tv.tv_sec - before_tv.tv_sec;
  uint64_t millisecs = seconds * 1000;
  millisecs += (after_tv.tv_usec - before_tv.tv_usec) / 1000;
  seconds = millisecs / 1000;
  millisecs %= 1000;

  char buf[100];
  sprintf(buf, "%d.%03" PRIu64, seconds, millisecs);
  cout << "do_exit path for run took " << buf << " seconds\n";

  return 0;
}
