Monday, May 25, 2009

Doing a small bit of diagnosis with mdb on process hangs

If you've ever had a process hang in OpenSolaris and wanted to find out a bit more, here are a few quick steps you can take.  These are by no means exhaustive, but they're a start if you just want to learn a little bit more:
  1. Run pstack pid on the process (where pid is its process ID) to see what it is doing.  Try it a few times to see whether it's always in the same function.  This prints the user call stack, with the most recently called function at the top.  If it's sitting in one of the program's own functions, that probably suggests some sort of coding bug.
  2. If it looks like it's stuck inside a syscall, you probably want to get the call stack within the kernel.
  3. As root, run mdb -k in another window.
  4. In mdb, type '::ps -t'.  This will list all the processes (lines starting with an 'R').  Under each process will be at least one line that looks like this:
    T  0xffffff014ac00e00 
  5. For each of those 'T' lines under the process in question, take the second value (the thread address) and run 'address::findstack'.  For example, with the line above, you'd type '0xffffff014ac00e00::findstack'.
  6. Type ctrl-d to exit from mdb.
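If a process has a lot of threads, typing each findstack command by hand gets tedious. Here is a minimal Python sketch that turns saved '::ps -t' output into the list of commands from step 5. It assumes the layout described above: a process line whose second column is the PID, followed by indented 'T' lines whose second field is the thread address. The function name and the sample text are made up for illustration (only the first thread address comes from the example above), so check them against what your own mdb prints.

```python
def findstack_commands(ps_output, pid):
    """Return one 'ADDR::findstack' command per thread listed under `pid`.

    Assumes '::ps -t' style text: process lines have the PID in the
    second column, and thread lines start with 'T' then the address.
    """
    commands = []
    in_target = False
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] == "T":
            # Thread line: keep it only if it sits under the wanted process.
            if in_target:
                commands.append(fields[1] + "::findstack")
        else:
            # Header or process line: start tracking a new process.
            in_target = fields[1] == str(pid)
    return commands


# Hypothetical sample, shaped like the output described in step 4.
SAMPLE = """\
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R    101      1    101    101      0 0x00000000 ffffff0100000001 example_a
        T  0xffffff014ac00e00
R    202      1    202    202      0 0x00000000 ffffff0100000002 example_b
        T  0xffffff0100000a00
        T  0xffffff0100000b00
"""

if __name__ == "__main__":
    for cmd in findstack_commands(SAMPLE, 202):
        print(cmd)
```

The printed lines can then be pasted at the mdb prompt one at a time.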
From there, what you do really depends on the output.  From what I've seen, unkillable processes tend to be stuck waiting on a condition variable (cv_wait), or at least the odds of that are decent.  For a while, Solaris 10 had some issues with locking inside of /proc that caused this.  It's been a couple of years since I've seen it, though, so it looks like those issues have been resolved.
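To check a batch of saved stacks for that condition-variable pattern, something like the following Python sketch could help. It assumes each frame in the findstack output ends in a 'function+0xoffset(' token, and it treats any frame whose name starts with 'cv_' and contains 'wait' as a condition-variable wait. The helper names and the sample stack (addresses, offsets, and the example_wait_for_event function) are invented for illustration, not real output.

```python
import re

# Matches the 'name+0xNN(' part of a frame line and captures the name.
FRAME = re.compile(r"([A-Za-z_]\w*)\+0x[0-9a-fA-F]+\(")


def frame_names(stack_text):
    """Return the function names found in the stack text, in order."""
    return FRAME.findall(stack_text)


def waiting_on_cv(stack_text):
    """True if any frame looks like a cv_wait-family function."""
    return any(
        name.startswith("cv_") and "wait" in name
        for name in frame_names(stack_text)
    )


# Hypothetical stack, shaped like what '::findstack' might print.
SAMPLE_STACK = """\
stack pointer for thread ffffff014ac00e00: ffffff0007a4bb60
  ffffff0007a4bbd0 swtch+0x141()
  ffffff0007a4bc00 cv_wait+0x61()
  ffffff0007a4bc40 example_wait_for_event+0x2c()
"""

if __name__ == "__main__":
    print(frame_names(SAMPLE_STACK))
    print(waiting_on_cv(SAMPLE_STACK))
```

A match only says the thread is blocked on a condition variable; the frames below it are what hint at which subsystem it is waiting in.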

Most recently, I saw one that looks like it might be some sort of loop inside the ufs code on Nevada, though I will have to wait to see what those more experienced in this stuff are able to determine.
