Markup | Genome Research

Table 1.

Issues Where Farms Fail

Email	Sending email with job output from x,000 nodes at the same time, e.g., LSF's default behavior is to send stdout and stderr via email. Often results in a crashed mail server.
NFS	Remote file read access over the network from multiple nodes.
NFSKiller	As for NFS, but with multiple nodes writing at the same time.
PreExec	A pre-execution wrapper, written to test for a failure condition before the job runs, that itself fails to run or exits with a non-zero exit status, e.g., because of a coding error.
RubberJob	Jobs bouncing in the queues due to some failure state, e.g., a missing NFS intercept point or a coding error. Often induced by a PreExec failure, as above.
Typo	Often the biggest killer of farm servers, e.g., /daata/blast instead of /data/blast. Jobs can become Rubber, bouncing in and out of the queuing system onto nodes and then failing.
RawOut	Writing raw blast/exonerate/other output without any data reduction (e.g., with MSPCrunch, grep, etc.).
BigLog	Excessive logging that may generate more than 1 GB of stderr output, which is often written back to the NFS server.
JobSize	Jobs that run for less than 1 sec. Also jobs that run for 6 mo.
SwapKiller	Jobs that end up allocating too much memory, or jobs whose memory grows in patterns that are difficult to predict, e.g., exonerate FSM generation.
MasterKiller	Job submission, dispatch rate, and queue size are sufficiently high that the dispatch code becomes CPU bound and fails to run new jobs.
NameService	NIS or DNS servers become overloaded due to many gethostbyname, getuid, getgroup requests.
NetFill	Backbone network becomes saturated with I/O requests, e.g., heavy NFS or DBI loading.
DeadDisk	Storage failure on remote node. When large numbers of spindles are considered, this becomes a significant factor. Jobs can arrive at the remote node and find that the storage has failed.
RDBKiller	Head database nodes become saturated with long running requests, or too many concurrent connections. RDB is no longer able to actively serve results for new SQL statements.

[i] There are a number of bottlenecks that can appear in farm environments; the table above is a list of some of the significant ones. If you are particularly unlucky, you will see all of them at the same time.
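Several of the rows above (Typo, PreExec, RubberJob, DeadDisk) come down to a job reaching a node where a path it needs is missing. A minimal sketch of a pre-execution check in Python follows. The function names `missing_paths` and `pre_exec_check` and the exit codes 1 and 2 are illustrative assumptions, not part of LSF or of the original article. The check reports a fault in the check itself separately from a missing path, so a coding error in the wrapper is not mistaken for a job failure.

```python
import os
import sys


def missing_paths(paths):
    """Return the subset of paths that do not exist on this node."""
    return [p for p in paths if not os.path.exists(p)]


def pre_exec_check(paths):
    """Return 0 if every path exists, 1 if any is missing, 2 if the check itself failed.

    The exit codes are hypothetical; how a scheduler reacts to them
    depends on its configuration.
    """
    try:
        missing = missing_paths(paths)
    except Exception as exc:
        # A bug in the check (the PreExec row) is reported distinctly
        # rather than surfacing as an ordinary missing-path failure.
        print(f"pre-exec check error: {exc}", file=sys.stderr)
        return 2
    for p in missing:
        print(f"missing path: {p}", file=sys.stderr)
    return 1 if missing else 0
```

A wrapper script might end with `sys.exit(pre_exec_check(sys.argv[1:]))`. Whether a non-zero status should requeue the job or fail it outright is a site policy decision, since automatic requeueing on a persistent error is what produces a RubberJob.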