Forum Discussion

davidcastells
New Contributor
5 years ago
Solved

Frozen Job in DevCloud

I have a frozen job in DevCloud. The time quota was 6 hours, but the job has been running for more than 62 hours.

I tried to kill it with

qdel <job id>

but I get

qdel: Server could not connect to MOM <job id>

Any idea what to do?

  • davidcastells
    4 years ago

    Let me add (for others having the same problem) that the DevCloud team finally cancelled my pending job.

    As a general rule, always include a deadline in your batch jobs, so that the queueing system is not left with a runaway job if something strange happens.
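One way to put a deadline on a batch job, sketched under some assumptions: the queue is a Torque/PBS system (which the `qdel` error in this thread suggests), so a `#PBS -l walltime=...` directive is honored, and GNU `timeout` is available on the compute node as a second guard inside the job. The job name, the `LIMIT_SECONDS` value, and the `sleep 1` workload are placeholders, not anything taken from the original job.

```shell
#!/bin/bash
# Scheduler-side deadline: ask the queue to stop the job after 6 hours.
#PBS -l walltime=06:00:00
# Hypothetical job name, for illustration only.
#PBS -N deadline_demo

# Job-side deadline: slightly under the walltime (5 h 50 min), so the
# job can stop itself even if the scheduler does not. Assumed value.
LIMIT_SECONDS=${LIMIT_SECONDS:-21000}

# Run a command under GNU timeout and report when the limit is hit.
# timeout exits with status 124 when it had to stop the command.
run_with_deadline() {
    timeout "$LIMIT_SECONDS" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "deadline of ${LIMIT_SECONDS}s exceeded: $*" >&2
    fi
    return "$rc"
}

# Placeholder workload; replace with the real command.
run_with_deadline sleep 1
```

Submitted with `qsub`, the directives set the scheduler limit; run directly with `bash`, they are plain comments and only the `timeout` guard applies.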

6 Replies

  • Hi,

    I have forwarded your issue to the owner of the DevCloud platform and am waiting to hear back. I have asked them to answer your post directly. Please give us a couple of days on this.


    -Hazlina


  • Lawrence_L_Intel
    Occasional Contributor

    Do you know which server you launched the job from? If so, log back into that server, find the job's process with ps -auxw, and kill -9 its process ID. Sometimes that kills the job. Make sure you use the walltime construct in batch mode so you don't time out in the future.

    Thanks,

    Larry
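A minimal sketch of the ps/kill approach, assuming you can still log in to the machine where the process runs. Note that `kill` operates on process IDs rather than scheduler job IDs. The background `sleep 300` is only a stand-in for a stuck process so the example can run on its own; on a real system you would read the PID from the `ps` listing instead.

```shell
#!/bin/bash
# Stand-in for a stuck job process (hypothetical; a real job would
# already be running and would be found in the ps listing below).
sleep 300 &
stuck_pid=$!

# List your own processes with PID, elapsed time, and command line.
ps -u "$(id -un)" -o pid,etime,args

# Ask politely first (SIGTERM), then force it (SIGKILL, i.e. kill -9)
# only if the process is still alive after a short grace period.
kill -TERM "$stuck_pid"
sleep 1
if kill -0 "$stuck_pid" 2>/dev/null; then
    kill -KILL "$stuck_pid"
fi

# Reap the child so no zombie entry is left behind.
wait "$stuck_pid" 2>/dev/null
echo "process $stuck_pid is gone"
```

Trying SIGTERM before SIGKILL gives the process a chance to clean up temporary files; `kill -9` cannot be caught and skips any cleanup.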

    • Lawrence_L_Intel
      Occasional Contributor

      Let me add: if you post here and don't see a response, try fpgauniversity@intel.com. We have a fairly small team moderating technical inquiries on the FPGA DevCloud, and we don't check the forum frequently.

      Thanks

      Larry

      • davidcastells
        New Contributor

        Thanks Lawrence,
        I already sent them two emails (last Saturday and yesterday), but I have had no response.

    • davidcastells
      New Contributor

      The problem is that the node running the job, s005-n005, went down (I don't know why), and the queue system has lost control of the job.

      I cannot log in to s005-n005 because it is not running.

      Apparently, someone with admin privileges could solve the problem simply by running

      qdel -p 18216.v-qsvr-fpga.aidevcloud
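The failure mode in this thread can be recognized from a script. The sketch below wraps `qdel` in a hypothetical helper, `cancel_job`, that tries a normal delete and, when the output contains the "could not connect to MOM" message quoted above, prints the purge command to pass on to an administrator. It assumes `qdel` exits with a non-zero status on this error; the helper name and return codes are illustrative choices.

```shell
#!/bin/bash
# cancel_job JOB_ID  (hypothetical helper, not part of Torque/PBS)
# Returns 0 if qdel succeeded, 2 if the job's node is unreachable,
# and 1 for any other qdel failure.
cancel_job() {
    job_id=$1
    # Capture stdout and stderr so the error text can be inspected.
    if output=$(qdel "$job_id" 2>&1); then
        echo "deleted $job_id"
        return 0
    fi
    case "$output" in
        *"could not connect to MOM"*)
            # The server cannot reach the node's MOM daemon; a purge
            # normally requires operator or manager privileges.
            echo "node for $job_id is unreachable; ask an administrator to run: qdel -p $job_id"
            return 2
            ;;
        *)
            echo "qdel failed: $output"
            return 1
            ;;
    esac
}
```

For example, `cancel_job 18216.v-qsvr-fpga.aidevcloud` would attempt the delete for the job from this thread and, if the node is still down, print the `qdel -p` request rather than retrying.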