r2 - 28 Jan 2008 - 18:12:17 - JohnRowe

Please record failed jobs

One issue we already face is identifying nodes with problems. To help us with this, if your job fails because of what appears to be a "system" problem (and not, for example, a bug in your code or a bad input file), please record it with the badjob command, for example:

badjob 1234

This just adds an entry to a log file, but we hope to use these records to identify nodes with problems.

Why is my job still queued?

The command:

tracejob 1234

gives a history of what the queueing system has tried to do with your job.

Is my job actually running?

We have written a simple script to show the load average of the nodes running your job. Just type:

nodestat 1234

or:

nodestat r1i2n3

if you wish to examine a particular node. The load averages shown are the average number of processes (not batch jobs) in the run queue over the last 1, 5 and 15 minutes.
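A load average only means something relative to the number of cores on the node: a load close to the core count suggests the job is using the node fully, while a load near zero suggests it is not really running. The following is a rough sketch of that comparison, not part of nodestat. The function name classify_load and the thresholds (below 0.5 is "idle", more than one above the core count is "overloaded") are assumptions chosen for illustration, and the input is assumed to be in the format of Linux's /proc/loadavg.

```shell
# Hypothetical helper: classify a node from its 1-minute load average.
# Usage: classify_load "<line in /proc/loadavg format>" <number of cores>
classify_load() {
    # The first field of a /proc/loadavg-style line is the 1-minute average.
    load1=$(printf '%s\n' "$1" | awk '{print $1}')
    cores=$2
    # Thresholds are illustrative assumptions, not site policy.
    awk -v l="$load1" -v c="$cores" 'BEGIN {
        if (l < 0.5)        print "idle";
        else if (l > c + 1) print "overloaded";
        else                print "busy";
    }'
}

# Example (on a Linux node, after logging in):
#   classify_load "$(cat /proc/loadavg)" "$(nproc)"
```

An "idle" result on a node that is supposed to be running your job is a hint that the job has stalled or never started there; "overloaded" suggests something else is competing for the node.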

You may also log in to a node with ssh, but remember that you will be logged out if somebody else's job starts on that node.

-- JohnRowe - 28 Jan 2008
