Dealing With Runaway Processes

What Are They?

It is quite common for one process (the parent) to create a second process (the child). Usually, the child does its job and then exits gracefully, passing the result back to the parent. Sometimes, however, the child process does NOT exit (usually due to a bug in the program). When this happens, the parent may die or be killed but leave the child process running. At this point, the child process is a runaway: it is typically stuck in an infinite loop, eating up system resources with no parent process to tell it to stop.
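The situation described above can be reproduced harmlessly in a shell. The following is only a minimal sketch, assuming a POSIX shell: `sleep 30` stands in for a buggy child, and the temporary file and the `orphan_alive` variable are illustrative names used to carry information out of the subshell.

```shell
#!/bin/sh
# The subshell plays the parent: it starts a child in the background,
# records the child's PID, and then exits immediately.
pidfile=$(mktemp)
( sleep 30 >/dev/null 2>&1 & echo "$!" > "$pidfile" )
child=$(cat "$pidfile")
rm -f "$pidfile"

# The parent is gone, but the child is still running.
# kill -0 sends no signal; it only tests whether the PID exists.
if kill -0 "$child" 2>/dev/null; then
    orphan_alive=yes
    echo "child $child outlived its parent"
else
    orphan_alive=no
fi

# Clean up so the demonstration does not leave a runaway of its own.
kill "$child" 2>/dev/null
```

A real runaway differs only in that nothing performs the final `kill` for it.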

Runaway processes have become a large problem at CISE. Several classes have programming assignments which involve creating child processes. Any programming involving threads, client-server type apps, operating systems, distributed computing, etc., is susceptible to this. Since we have several such classes at CISE, we run into this problem on a daily basis. The current most frequent offenders are Minix processes and Java thread programs.

Runaway processes eat up computer resources needed by others, and they keep doing so until they are killed. If enough runaways (or even just one especially greedy one) are running on a machine, the machine quickly becomes unusable. Unfortunately, the machines used to run these processes (which tend to be homework assignments) are the same public machines relied on by hundreds of other CISE students to complete their homework and research assignments. As a result, the processes must be cleaned up either by the student or by the system staff.

Who Is Responsible for Removing Them?

In the past, the system staff has done much of the cleaning up of runaways by hand, because it often takes a good deal of personal attention to decide whether a process is a runaway. Some of it could be done with a cron job, but we have chosen not to rely on that for two reasons:

  • It is very difficult to automatically distinguish between processes which are runaways and those which are legitimate. Many can be identified automatically, but many cannot, and writing a program that makes the call reliably is harder still. If we are too aggressive, we end up killing legitimate processes. If we're not aggressive enough, runaways slip through and eat up resources.
  • We feel it is a disservice to the students writing the program. Part of learning how to write these types of software involves learning about common bugs and pitfalls and how to avoid them. To the student, a runaway MAY have appeared to behave normally, when in fact it had a serious bug in it. By forcing students to clean up their runaways, there is the opportunity to learn more about the program.

In late 1998, we started insisting that students in COP 4600 (one of the operating systems classes especially susceptible to runaways) clean up after themselves. It has been a tremendous success (at least in terms of freeing up resources). Since then, other classes have been added, and at this point the policy applies to everyone using CISE computers, resulting in far fewer unusable machines.

Cleanup Policy

The policy is fairly simple. In classes where this type of programming is done, students will be told to clean up their runaways and given simple instructions for doing so.

The system staff will also continue to clean up processes. If runaways are found, they will be killed and a warning sent to the user who owns them. If runaways are found a second time, they will be killed and a second warning sent to that user.

After two warnings, if we find runaway processes belonging to a user, their access to that particular machine will be cut off for the remainder of the semester.

Note: being restricted from one or two machines is inconvenient, but in no way stops you from doing your homework. A list of all public machines is available here. Please use one of the other machines to complete your homework. At the end of the current semester, all access restrictions will be relaxed automatically.

How to Clean Up Runaways

Finding runaway processes is fairly easy for the user. If I wanted to find out if I had runaway processes on a machine, I'd run the command:

   % /bin/ps -u USERNAME

Be sure to substitute your username in for "USERNAME". The output might look something like:

     PID TTY      TIME CMD
   29155 ?        0:01 xman
   29157 ?        0:01 twm
   29085 ?        0:00 system.x
   29153 ?        0:00 xclock
   29154 ?        0:00 xcal
   29156 ?        0:00 xautoloc
   29135 ?        0:00 zephyr.w
   29982 ?        0:03 minix
   29194 ?        0:00 minix
   29194 pts/1    0:00 minix
    2705 pts/1    5:32 emacs
    6223 pts/1    3:26 exmh
   29194 ?        0:00 minix
   29138 ?        0:00 zwgc

Most of the processes you get are completely normal (started up by the system when you log in). Pay special attention to processes with a "?" in the TTY column. Minix and Java processes here are almost always runaways. Processes with a TTY are not runaways, but in some cases, they may need to be cleaned up.

Note: without the -u option, the "ps" command normally lists only the processes attached to your current terminal, so runaway processes (which usually have no terminal) will not be shown. You must use the -u option.
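To narrow a long listing down to the processes worth a closer look, the output can be filtered on the TTY column. This is a minimal sketch, not an official tool: `list_detached` and `my_detached` are hypothetical helper names, and it assumes a `ps` that accepts the POSIX `-o` format option and prints "?" for processes with no terminal, as in the sample output above.

```shell
#!/bin/sh
# Read ps-style lines (PID TTY TIME CMD) on stdin and print the header line
# plus every process whose TTY column is "?", i.e. one with no terminal.
list_detached() {
    awk 'NR == 1 || $2 == "?"'
}

# List your own detached processes on the machine you are logged in to.
# id -un prints the name of the user running the command.
my_detached() {
    ps -u "$(id -un)" -o pid,tty,time,comm | list_detached
}
```

Running `my_detached` after a Minix or Java session shows only the candidates. Many of the remaining lines will still be normal login processes, so the listing has to be read before anything is killed.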

To kill the two minix processes, use the command:

   kill 29982 29194

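In some cases, the processes may not die even after you issue the kill command, because a plain kill only asks the process to exit and the process can ignore the request. In that case, use the command "kill -9", which cannot be ignored. So, you could type: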

   kill -9 29982 29194

Using the "-9" option should always kill a runaway process.
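The two steps above (a plain kill first, then "-9" only if needed) can be combined. The following is a sketch under stated assumptions: `kill_runaways` is a hypothetical helper name, and the grace period, read from a hypothetical `GRACE_SECONDS` variable with a default of 5 seconds, is an arbitrary choice.

```shell
#!/bin/sh
# Ask each PID to exit with the default TERM signal, wait briefly, and send
# KILL (-9) only to the ones that are still running.
kill_runaways() {
    grace=${GRACE_SECONDS:-5}
    for pid in "$@"; do
        kill "$pid" 2>/dev/null
    done
    sleep "$grace"
    for pid in "$@"; do
        # kill -0 sends no signal; it only tests whether the PID still exists.
        if kill -0 "$pid" 2>/dev/null; then
            echo "PID $pid did not exit, sending KILL"
            kill -9 "$pid" 2>/dev/null
        fi
    done
}
```

With the sample listing above, this would be called as `kill_runaways 29982 29194`. Trying the plain kill first gives a well-behaved program the chance to clean up after itself before it is forced to stop.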

NOTE: Some people use the skill command instead of the kill command. The skill command is not reliable, and should not be used. If you do use it, double check the results using the ps command again to make sure all runaway processes have been removed.

You should check for runaways every time you have run Minix, Java with threads, or any other type of process you have been warned about in class or by one of the system staff. Also, you MUST run the check on every single machine where a process could have started. This is especially important with any type of distributed programming project. Any time the children are created on other computers, it is essential to log into each of them and check for runaways.
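When a project spans several machines, the same check can be repeated on each one from a single shell. This is only a sketch, assuming you can reach the machines with `ssh`: `check_hosts` is a hypothetical helper name, `REMOTE_SHELL` is a hypothetical override that defaults to `ssh`, and the host names are whatever machines your program may have used.

```shell
#!/bin/sh
# Run the same ps check on each machine named on the command line.
check_hosts() {
    remote=${REMOTE_SHELL:-ssh}
    for host in "$@"; do
        echo "== $host =="
        # The single-quoted command runs on the remote machine, so id -un
        # resolves to your username there.
        "$remote" "$host" 'ps -u "$(id -un)"' || echo "could not check $host"
    done
}
```

A call such as `check_hosts machine1 machine2` (placeholder names) prints one listing per machine. Any runaways it reveals still have to be killed on the machine where they are running.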

Common Problems

One of the biggest problems that can occur is logging out from a machine before you check for runaways. You MUST be logged on to the same machine where you ran the programs when you check for runaways. The "ps" command above only checks for processes on whatever machine you are logged on to. If you log out, you must log back in to the same machine to check for runaways.

A problem that occurs with Java thread programs or any other type of distributed program is that the threads are started on different machines. You MUST log in to each of the machines and kill runaways there. It is not sufficient to kill them only on the machine where you are currently logged in.

Another problem is that the machine hangs (perhaps due to your processes) and you cannot kill the process. In this case, you MUST mail us to report the problem. If you don't clean up AND don't inform us of the problem, any runaways left behind will count toward a warning or toward closing your access.