Recently my friend was complaining about his new OpenVZ main node server taking a lot of i/o resources even though no VPS were running on it. He was blaming it on the datacenter guys for giving him faulty hard disk and wanted me to check and confirm. With iostat and lsof commands, it didn’t took much time to find out that the issue was something else!

[root@server ~]# iostat -x
Linux 2.6.18-164.15.1.el5.028stab068.9 (server.example.net) 04/06/2010

avg-cpu: %user %nice %system %iowait %steal %idle
          0.00  0.00   0.50   49.50   0.00  50.00

Device: rrqm/s  wrqm/s  r/s   w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda       0.00  136.00 0.00  71.00 0.00   1552.00    21.86      8.51   61.86  13.25  94.10
sda1      0.00  127.00 0.00  50.00 0.00   1352.00    27.04      1.12   18.22  18.80  94.00
sda2      0.00    0.00 0.00   0.00 0.00      0.00     0.00      0.00    0.00   0.00   0.00
sda3      0.00    9.00 0.00  21.00 0.00    200.00     9.52      7.39  165.76  15.24  32.00
sdb       0.00  136.00 0.00  69.00 0.00   1528.00    22.14      8.72   58.64  11.99  82.70
sdb1      0.00  127.00 0.00  50.00 0.00   1352.00    27.04      1.01   16.00  16.38  81.90
sdb2      0.00    0.00 0.00   0.00 0.00      0.00     0.00      0.00    0.00   0.00   0.00
sdb3      0.00    9.00 0.00  19.00 0.00    176.00     9.26      7.71  170.84  16.63  31.60
md2       0.00    0.00 0.00 176.00 0.00   1408.00     8.00      0.00    0.00   0.00   0.00
md0       0.00    0.00 0.00  43.00 0.00    344.00     8.00      0.00    0.00   0.00   0.00
md1       0.00    0.00 0.00   0.00 0.00      0.00     0.00      0.00    0.00   0.00   0.00

Seems to suggest that /dev/md2 is causing the iowait. This is infact mounted on / :

[root@server ~]# df -h | grep md2
/dev/md2 20G 6.4G 13G 35% /

This server did not have separate partitions for /var, /usr etc. Seeing large I/O activity in /, I was certain that it should be some process writing in the logs (in /var/log/). lsof /var/log/* showed that only /var/log/maillog was accessed at that time by syslogd and lfd processes.

[root@server ~]# lsof /var/log/maillog
COMMAND   PID  USER FD TYPE DEVICE    SIZE    NODE      NAME
syslogd  6243  root 3w REG   9,2   441431340 586331 /var/log/maillog
lfd     25197  root 5r REG   9,2   441431340 586331 /var/log/maillog
[root@server ~]# lsof /var/log/maillog
COMMAND   PID  USER FD TYPE DEVICE    SIZE    NODE      NAME
syslogd  6243  root 3w REG   9,2   441438732 586331 /var/log/maillog
lfd     25197  root 5r REG   9,2   441438732 586331 /var/log/maillog
[root@server ~]# lsof /var/log/maillog
COMMAND   PID  USER FD TYPE DEVICE    SIZE    NODE      NAME
syslogd  6243  root 3w REG   9,2   441451948 586331 /var/log/maillog
lfd     25197  root 5r REG   9,2   441451948 586331 /var/log/maillog

If you note the FD column in the above output, 3w means that syslogd has opened /var/log/maillog for writing into it, while 5r tells us that lfd opened the same file for reading data from it. syslogd writing in /var/log/maillog seems to suggest sendmail was not configured correctly. Also notice the increase in file size. I stopped sendmail service to check for any difference in i/o wait:

[root@server ~]# service sendmail stop
Shutting down sm-client: [ OK ]
Shutting down sendmail: [ OK ]

[root@server ~]# iostat -x
Linux 2.6.18-164.15.1.el5.028stab068.9 (server.example.net) 04/06/2010

avg-cpu: %user %nice %system %iowait %steal %idle
          0.00  0.00   0.00    1.01   0.00  98.99

Device: rrqm/s  wrqm/s  r/s    w/s  rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda       0.00   10.89 0.00    3.96 0.00    118.81    30.00      0.09   17.75  23.75   9.41
sda1      0.00   10.89 0.00    3.96 0.00    118.81    30.00      0.07   17.75  17.75   7.03
sda2      0.00    0.00 0.00    0.00 0.00      0.00     0.00      0.00    0.00   0.00   0.00
sda3      0.00    0.00 0.00    0.00 0.00      0.00     0.00      0.02    0.00   0.00   2.38
sdb       0.00   10.89 0.00    3.96 0.00    118.81    30.00      0.10   20.00  26.00  10.30
sdb1      0.00   10.89 0.00    3.96 0.00    118.81    30.00      0.08   20.00  20.00   7.92
sdb2      0.00    0.00 0.00    0.00 0.00      0.00     0.00      0.00    0.00   0.00   0.00
sdb3      0.00    0.00 0.00    0.00 0.00      0.00     0.00      0.02    0.00   0.00   2.38
md2       0.00    0.00 0.00   12.87 0.00    102.97     8.00      0.00    0.00   0.00   0.00
md0       0.00    0.00 0.00    0.00 0.00      0.00     0.00      0.00    0.00   0.00   0.00
md1       0.00    0.00 0.00    0.00 0.00      0.00     0.00      0.00    0.00   0.00   0.00

Oh! Now that’s indeed a difference. The culprit here as we found out, was a misconfigured sendmail service due to which its logs were getting populated. Moral of the story is, don’t blame it on the hard drive for all I/O issues before checking which all files are open, and being written to at that point of time 🙂