Table 1. Issues Where Farms Fail


Email Sending email from x,000 nodes at the same time with job output, e.g., LSF's default behavior is to send stdout and stderr via email. Often results in a crashed mail server.
NFS Remote file read access over the network from multiple nodes.
NFSKiller As for NFS, but with multiple nodes writing at the same time.
PreExec Writing a pre-execution wrapper to test for a failure condition before the job runs, where the wrapper itself then fails to run or exits with a non-zero exit status, e.g., because of a coding error.
RubberJob Jobs bouncing in the queues due to some failure state, e.g., a missing NFS intercept point or a coding error. Often induced by a PreExec failure, as above.
Typo Often the biggest killer of farm servers, e.g., /daata/blast instead of /data/blast. Jobs can become RubberJobs, bouncing in and out of the queuing system onto nodes and then failing.
RawOut Writing raw blast/exonerate/other output without any data reduction, e.g., MSPCrunch/grep, etc.
BigLog Excessive logging that may generate more than 1 GB of stderr output. This is often written back to the NFS server.
JobSize Jobs that run for less than 1 second, and jobs that run for 6 months.
SwapKiller Jobs that end up allocating too much memory, or jobs whose memory use grows in ways that are difficult to predict, e.g., exonerate FSM generation.
MasterKiller Job submission rate, dispatch rate, and queue size are high enough that the dispatch code becomes CPU bound and fails to run new jobs.
NameService NIS or DNS servers become overloaded due to many gethostbyname, getuid, and getgroup requests.
NetFill Backbone network becomes saturated with I/O requests, e.g., heavy NFS or DBI loading.
DeadDisk Storage failure on remote node. When large numbers of spindles are considered, this becomes a significant factor. Jobs can arrive at the remote node and find that the storage has failed.
RDBKiller Head database nodes become saturated with long-running requests or too many concurrent connections. The RDB is no longer able to actively serve results for new SQL statements.

[i] There are a number of bottlenecks that can appear in farm environments; the table above is a list of some of the significant ones. If you are particularly unlucky, you will see all of them at the same time.
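
Several rows in the table (Typo, PreExec, RubberJob, DeadDisk) come down to a job reaching a node where an expected path is missing. As a minimal sketch, not a description of any particular scheduler's mechanism, a path check along the following lines could be run before a job is submitted or started. The function names `missing_paths` and `pre_exec_check` are hypothetical, and the sketch assumes the required paths are visible from wherever the check runs.

```python
import os
import sys


def missing_paths(paths):
    """Return the subset of paths that do not exist where this check runs."""
    return [p for p in paths if not os.path.exists(p)]


def pre_exec_check(paths):
    """Return 0 if every required path exists, 1 otherwise.

    Hypothetical helper: an unexpected exception is also mapped to 1, so
    a coding error in the check is reported as an ordinary failure with a
    short message instead of an unhandled crash.
    """
    try:
        missing = missing_paths(paths)
    except Exception as exc:
        print(f"pre-exec check error: {exc}", file=sys.stderr)
        return 1
    # One short line per missing path keeps stderr small (see BigLog).
    for p in missing:
        print(f"pre-exec: missing path {p}", file=sys.stderr)
    return 1 if missing else 0
```

A wrapper script could finish with `sys.exit(pre_exec_check(sys.argv[1:]))`. Running the same check on the submitting host, where that is possible, would catch a mistyped path such as /daata/blast before any job enters the queue.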