Issues Where Farms Fail
| | Sending email from x,000 nodes at the same time with job output, e.g., LSF's default behavior is to send stdout and stderr via email. This often results in a crashed mail server. |
| NFS | Remote file read access over the network from multiple nodes. |
| NFSKiller | As for NFS, but with multiple nodes writing at the same time. |
| PreExec | Writing a pre-execution wrapper to test for a failure condition before the job runs, where the wrapper itself then fails to run or exits with a non-zero exit status. A coding error can cause this. |
| RubberJob | Jobs bouncing in the queues due to some failure state, e.g., a missing NFS intercept point or a coding error. Often induced by a PreExec failure (see above). |
| Typo | Often the biggest killer of farm servers, e.g., /daata/blast instead of /data/blast. Jobs can become RubberJobs, bouncing in and out of the queuing system onto nodes and then failing. |
| RawOut | Writing raw blast/exonerate/other output without any data reduction, e.g., MSPCrunch/grep, etc. |
| BigLog | Excessive logging that may generate more than 1 GB of stderr output, which is often written back to the NFS server. |
| JobSize | Jobs that run for less than 1 second, and jobs that run for 6 months. |
| SwapKiller | Jobs that end up allocating too much memory, or jobs whose memory use grows in ways that are difficult to predict, e.g., exonerate FSM generation. |
| MasterKiller | Job submission, dispatch rate, and queue size are sufficiently high that the dispatch code becomes CPU bound and fails to run new jobs. |
| NameService | NIS or DNS servers become overloaded by many gethostbyname, getuid, and getgroup requests. |
| NetFill | Backbone network becomes saturated with I/O requests, e.g., heavy NFS or DBI loading. |
| DeadDisk | Storage failure on remote node. When large numbers of spindles are considered, this becomes a significant factor. Jobs can arrive at the remote node and find that the storage has failed. |
| RDBKiller | Head database nodes become saturated with long-running requests or too many concurrent connections. The RDB is then no longer able to actively serve results for new SQL statements. |
There are a number of bottlenecks that can appear in farm environments; the table above lists some of the significant ones. If you are particularly unlucky, you will see all of them at the same time.
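
Several of the failures above (Typo, PreExec, RubberJob) can share a cause: a job is dispatched against a path that does not exist. One possible mitigation, sketched below in Python, is to check paths once on the submission host before queuing anything. This is a minimal illustration, not a feature of LSF or any other scheduler. The function names `missing_paths` and `check_before_submit` are hypothetical, and the sketch assumes the submission host mounts the same filesystems as the compute nodes.

```python
import os
import sys


def missing_paths(paths):
    """Return the paths that do not exist on this host, in input order."""
    return [p for p in paths if not os.path.exists(p)]


def check_before_submit(paths):
    """Return True if every path exists; otherwise report the missing ones.

    Hypothetical helper: intended to be called once on the submission
    host, so a typo is caught before thousands of jobs are queued.
    """
    missing = missing_paths(paths)
    for p in missing:
        # Report on stderr so the message is not mixed into job output.
        print(f"missing path, not submitting: {p}", file=sys.stderr)
    return not missing
```

A submission script could call `check_before_submit(["/data/blast"])` and refuse to queue any jobs when it returns `False`. Checking once at submission time, rather than in a per-job pre-execution wrapper, avoids adding another piece of code that can itself fail on every node.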