Here’s the scenario: /usr/bin/firefox launches and does the usual resource allocation with file descriptors and mmap() calls and all the other fun stuff UNIX processes do. The CWD for the process is my home directory, /usr/people/imbrius, which is an automount map over NFS. If I don’t touch my browser for more than 15 minutes, the NFS client silently closes the file descriptors and flushes the write cache without telling firefox. The next time I go to use my browser, it hangs. I send it a KILL. Whoops. I just made a zombie out of firefox.
How did I turn firefox into a zombie process? When NFS time out and the kernel lost track of the open file descriptors, it sent a signal to the browser’s parent process (/bin/sh /usr/lib64/firefox-3.0.18/run-mozilla.sh /usr/lib64/firefox-3.0.18/firefox), who was the registered owner of the file descriptors. Since it’s a shell script, it doesn’t have a handler for whatever signal the kernel sent it and it performs the default action of terminating with a return of -1. But wait – what about its child, the browser process proper (/usr/lib64/firefox-3.0.18/firefox)? Under normal circumstances, it would exit as its shell is now dead. However, before it got a chance to exit correctly, it tried to access a dead file descriptor and hung. At this point, it is still a runnable process with program code, memory, data, and (dead) file descriptors. Since its parent has died, it gets adopted by init. But it’s hung – essentially busy-waiting for the kernel to find an NFS resource that doesn’t exist anymore. The only thing we can do is put it out of its misery and send it a KILL.
The KILL does its job and the program text and data areas are freed (including file descriptors), and it returns -1 to init. At this point, the process is a true zombie – there’s no program to run, no data to process, nothing there but an entry in the kernel’s process table. It’s a dried-out husk of a process. But wait! The kernel is still holding on to a bit of information that the program requested – the NFS FD call! The kernel doesn’t want to reap the process until it fulfills the request. Only the request won’t ever be fulfilled. Init never calls wait(1) on the now zombie browser process and the file descriptors never get closed, preventing us from restarting the browser until we reboot and clear the kernel’s process table.