Saturday, May 28, 2011

Which ESX host is locking my files?

I found myself asking this very annoying question again just last week: which one of the servers is holding a lock on a virtual machine log file that was last modified three months ago?

Last week VCB failed a job while trying to perform a full backup of one of the VMs, because one of the virtual machine’s log files was locked on the SAN. VCB was unable to copy the log file to the backup server and therefore failed the entire job.

Normally, a simple VMotion of the virtual machine to another host solves this issue, but I wasn’t as lucky this time. So I thought powering off the VM would do it... Didn’t work! No matter what I did, I just couldn’t get the lock released on that file. One of the ESX hosts in the cluster was holding on to the log file, but how do I go about finding out which one of the 20 ESX hosts it was? To me, this sounded like a job for vmkfstools, and indeed it was. Well, sort of. Using vmkfstools, I was able to retrieve the MAC address of the ESX host in the cluster that was holding on to the three-month-old log file.

The command is:

vmkfstools -D /filename

In my case this was:

vmkfstools -D /vmfs/volumes/iscsi-002-vmfs/WKSTN01/vmware.log

The output is then written to /var/log/vmkernel.

To get the output, simply do:

tail /var/log/vmkernel

This returned:

Jun 20 15:35:33 esx1 vmkernel: 23:02:22:35.020 cpu0:4174)FS3: 142:
Jun 20 15:35:33 esx1 vmkernel: 23:02:22:35.020 cpu0:4174)Lock [type 10c00001 offset 29190144 v 7, hb offset 4083712
Jun 20 15:35:33 esx1 vmkernel: gen 1881, mode 1, owner 4a2128d2-86a81c3a-ce30-000e0cc41e98 mtime 893]
Jun 20 15:35:33 esx1 vmkernel: 23:02:22:35.020 cpu0:4174)Addr , gen 6, links 1, type reg, flags 0x0, uid 0, gid 0, mode 644
Jun 20 15:35:33 esx1 vmkernel: 23:02:22:35.021 cpu0:4174)len 312433, nb 1 tbz 0, cow 0, zla 1, bs 1048576
Jun 20 15:35:33 esx1 vmkernel: 23:02:22:35.021 cpu0:4174)FS3: 144:

The MAC address of the host locking the file is the last segment of the owner field on the third line:

000e0cc41e98
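
Picking that value out of the log by eye gets tedious, so here is a minimal shell sketch that does it. The function name extract_lock_mac is my own invention, and it assumes the owner field always has the same layout as in the output above.

```shell
# Hypothetical helper: print the last segment of every "owner" UUID found on stdin.
# Assumes the "owner xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx" layout shown above.
extract_lock_mac() {
    sed -n 's/.*owner [0-9a-f]*-[0-9a-f]*-[0-9a-f]*-\([0-9a-f]\{12\}\).*/\1/p'
}

# Usage on the service console:
#   tail /var/log/vmkernel | extract_lock_mac
```

Lines without an owner field produce no output, so it should be safe to feed it the whole tail of the log.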

Now, this is the bit where I can’t make it any easier for you. Unless you write a script (and I don’t have that much time at the moment), the only way to find the host with that MAC is to log on to each host via SSH and run:

esxcfg-info | grep -i 'system uuid'

This returns the UUID of the host you are on. If its last segment matches the MAC retrieved using vmkfstools, then you know the process that’s keeping the lock is on that server.
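
For anyone who does have the time, a script for this check could look roughly like the sketch below. It is untested against a real cluster. The function name find_lock_owner and the host names esx1, esx2 and esx3 are made up, and it assumes you can SSH to every host without a password prompt.

```shell
# Hypothetical helper: print each host whose system UUID contains the given MAC.
# Assumes key-based SSH access to every host named on the command line.
# Usage: find_lock_owner 000e0cc41e98 esx1 esx2 esx3
find_lock_owner() {
    mac=$1
    shift
    for host in "$@"; do
        # -n stops ssh from swallowing stdin while we loop
        uuid_line=$(ssh -n "$host" "esxcfg-info | grep -i 'system uuid'" 2>/dev/null)
        # Case-insensitive match, in case the two tools print hex differently
        if printf '%s\n' "$uuid_line" | grep -qi "$mac"; then
            echo "$host"
        fi
    done
}
```

If it prints nothing, either no host matched or the SSH logins failed, so it is worth testing it first against a host whose UUID you already know.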

So what process is locking the file? That I can’t tell you. I can only give you some tips as to how to find it.
1. Power off the VM in vCenter;
2. Log on to the service console of the host that’s locking the file;
3. Try to move or delete the locked file from the service console of the locking host. This worked for me. If it works for you, then good. If not, go to step 4;
4. Try and see if there’s a process running with the filename that is locked;

ps -auxwww | grep

If it returns a line (other than the grep line), kill the process with “kill -9” followed by its PID.

5. If it doesn’t return any processes under that filename, then try and search for a PID with the VM name that has a locked file:

ps -auxwww | grep

If it returns a PID, kill it: your VM was already powered off in step 1 and should therefore not have a process on any host;

6. If it still doesn’t work, leave a comment and we'll have a look at it ;-)
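
The process search in steps 4 and 5 can be wrapped in a small helper so that only the PIDs come back. This is just a sketch: pids_matching is a name I made up, and it assumes the PID is the second column of the ps output, as it is with the ps options used above.

```shell
# Hypothetical helper: read ps output on stdin and print the PID (second column)
# of every line containing the given text. Lines mentioning grep or awk are
# skipped so the search does not report itself.
pids_matching() {
    awk -v needle="$1" 'index($0, needle) && $0 !~ /grep|awk/ { print $2 }'
}

# Usage on the service console (WKSTN01 is the VM from the example above):
#   ps -auxwww | pids_matching WKSTN01
```

Have a look at what each PID actually is before you reach for kill -9.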
