Please record failed jobs

Identifying nodes with problems is already proving difficult. To help us with this, if your job fails because of what appears to be a "system" problem (rather than, for example, a bug in your code or a bad input file), please record it with the badjob command, giving the job number. For example:

badjob 1234

This only adds an entry to a log file, but we hope to use these records to identify nodes with problems.

Why is my job still queued?

The command:

tracejob 1234

gives a history of what the queueing system has tried to do with your job.

Is my job actually running?

We have written a simple script to show the load average of the nodes running your job. Just type:

nodestat 1234

or:

nodestat r1i2n3

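if you wish to examine a particular node. The load averages shown are the average number of processes in the run queue (running or waiting to run) over the last 1, 5 and 15 minutes. A load average close to zero on a node suggests that your job is doing little or no work there.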

You may also log in to a node with ssh, but remember that you will be logged out if somebody else's job starts on that node.
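
Once logged in to a node, you can read the load averages directly and compare them with the number of cores. The following is a minimal sketch, not part of nodestat: the function names classify_load and node_load_status and the thresholds 0.5 and 1.2 are assumptions chosen for illustration, and it assumes a Linux node where /proc/loadavg and nproc are available.

```shell
# Hypothetical helper: classify a 1-minute load average against a core count.
# Usage: classify_load LOAD CORES
# The 0.5 and 1.2 thresholds are illustrative assumptions, not site policy.
classify_load() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) { print "unknown"; exit }
        ratio = load / cores
        if (ratio < 0.5)       print "idle"
        else if (ratio <= 1.2) print "busy"
        else                   print "overloaded"
    }'
}

# Hypothetical wrapper: classify the node you are logged in to.
# The first field of /proc/loadavg is the 1-minute load average,
# and nproc reports the number of available cores.
node_load_status() {
    if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
        read -r load1 _ < /proc/loadavg
        classify_load "$load1" "$(nproc)"
    else
        echo "unknown"
    fi
}
```

Running node_load_status on a node prints idle, busy, overloaded or unknown. A result of idle while your job is supposed to be running on that node is a hint that something may be wrong.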

-- JohnRowe - 28 Jan 2008

History: r2 - 28 Jan 2008 - 18:12:17 - JohnRowe