
196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED

ritterm
Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 63461 - Posted: 24 Apr 2015, 19:35:31 UTC
Last modified: 24 Apr 2015, 19:41:05 UTC

I'm getting a few "Maximum disk usage exceeded" errors. I don't think I've ever seen this here.

For my tasks, the client_state file shows:

<rsc_disk_bound>15000000.000000</rsc_disk_bound>

An example task result shows a peak disk usage of 5,741.01 MB.
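(Aside: that bound is in bytes. A quick sketch, on the assumption that BOINC reports binary megabytes, of what the limit works out to per task:)

```python
# <rsc_disk_bound> from client_state.xml is in bytes; here I assume BOINC's
# messages report "MB" as binary megabytes (1 MB = 1024 * 1024 bytes).
rsc_disk_bound = 15_000_000.0

limit_mb = rsc_disk_bound / (1024 * 1024)
print(f"per-task disk limit: {limit_mb:.2f} MB")  # -> 14.31 MB
```

That 14.31 MB figure is the same number that shows up in the client's abort message later in this thread.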
ritterm
Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 63469 - Posted: 27 Apr 2015, 13:26:41 UTC

These things keep coming. Updating to the latest driver doesn't seem to have helped, and others running the same tasks don't seem to have any problems either. I'm thinking I have a hardware problem... :-(
ritterm
Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 63470 - Posted: 27 Apr 2015, 18:41:11 UTC

Additional info, in case it matters...

I have local preferences set that are pretty wide-open, I think. The host has a 1 TB HDD with only about 250 GB used. Local disk limits are set to:

Use at most -- 150GB (most restrictive)
Leave at least -- 0.1 GB (least restrictive)
Use at most -- 50% of total (less restrictive)

The BOINC manager says that 26 GB is used for BOINC with 124 GB available, and that MW@H is using less than 240 MB.
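As an aside, a hypothetical sketch of how those three limits might combine, assuming BOINC simply honours whichever preference is most restrictive (the helper function and the min-of-three rule are my assumptions; the figures are from this post):

```python
def allowed_boinc_usage_gb(total_gb, free_gb, boinc_used_gb,
                           use_at_most_gb, leave_at_least_gb, use_at_most_pct):
    """Hypothetical: the most restrictive of the three local disk limits wins."""
    by_absolute = use_at_most_gb                               # "Use at most 150 GB"
    by_min_free = boinc_used_gb + free_gb - leave_at_least_gb  # "Leave at least 0.1 GB"
    by_percent  = total_gb * use_at_most_pct / 100.0           # "Use at most 50% of total"
    return min(by_absolute, by_min_free, by_percent)

# 1 TB disk, ~250 GB used (so ~750 GB free), 26 GB of that is BOINC's:
print(allowed_boinc_usage_gb(1000, 750, 26, 150, 0.1, 50))  # -> 150
```

Which squares with the manager's numbers: 150 GB allowed minus 26 GB used leaves the reported 124 GB available.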
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,938,432
RAC: 22,812
Message 63472 - Posted: 28 Apr 2015, 10:39:53 UTC - in response to Message 63470.  

Additional info, in case it matters...

I have local preferences set that are pretty wide-open, I think. The host has a 1 TB HDD with only about 250 GB used. Local disk limits are set to:

Use at most -- 150GB (most restrictive)
Leave at least -- 0.1 GB (least restrictive)
Use at most -- 50% of total (less restrictive)

The BOINC manager says that 26 GB is used for BOINC with 124 GB available, and that MW@H is using less than 240 MB.


It probably won't help, but change the top "use at most" line to 500 GB; if that works, back it off to just above where you get the errors again. If it doesn't work (as I suspect it won't, but who knows), you can switch it back to where it is now.
Ananas
Joined: 19 Aug 08
Posts: 12
Credit: 2,500,263
RAC: 0
Message 63475 - Posted: 28 Apr 2015, 19:03:28 UTC
Last modified: 28 Apr 2015, 19:21:14 UTC

This is not a client configuration problem; it is a workunit problem.

The project can define a maximum disk usage per workunit; each workunit that fails to stay below this limit will be aborted by the BOINC core client.

The only thing you could do about it would be to patch your core client so that it ignores this limit - but then a workunit running wild would no longer be recognized.

p.s.: usually all workunits of the same batch have the same disk limit. Even though it is possible to calculate an individual limit for each single workunit, I doubt that many projects do that.


OOPS, sorry ... I thought your problem was related to the rsc_disk_bound value (as mentioned in the starting post), but this doesn't seem to be the case.

Several years ago I made this thing; maybe it helps. Note that BOINC will probably still not get along well with NTFS compression, i.e. BOINC uses the physical disk space for the calculation, ignoring the compression potential.
Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 63476 - Posted: 28 Apr 2015, 22:19:12 UTC - in response to Message 63475.  

I'm with Ananas' original thought - I think it's to do with <rsc_disk_bound>. There's a parallel thread at Einstein - Maximum disk usage exceeded - where ritterm cross-posted, and some related discussion in Results showing "Aborted by user".
ritterm
Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 63478 - Posted: 29 Apr 2015, 14:43:34 UTC - in response to Message 63476.  

...I think it's to do with <rsc_disk_bound>...

That's what I was thinking, too, of course. However, I now think I might have a GPU hardware problem -- all the tasks I've checked that errored out for me have been completed by another host without a problem. If the tasks I ran had bad parameters, would the same task work for another host?

When I upgraded the video driver, I went to the AMD website, downloaded and ran their auto-detect tool, and let it pick and install a new driver. Is there anything else I need to install?
Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 63481 - Posted: 29 Apr 2015, 18:25:53 UTC - in response to Message 63478.  

...I think it's to do with <rsc_disk_bound>...

That's what I was thinking, too, of course. However, I now think I might have a GPU hardware problem -- all the tasks I've checked that errored out for me have been completed by another host without a problem. If the tasks I ran had bad parameters, would the same task work for another host?

When I upgraded the video driver, I went to the AMD website, downloaded and ran their auto-detect tool, and let it pick and install a new driver. Is there anything else I need to install?

A bad driver by itself wouldn't cause a disk limit error - unless it's spewing out yards and yards of error messages. Look in the slot directory, as I said at Einstein.
ritterm
Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 63482 - Posted: 30 Apr 2015, 13:12:41 UTC - in response to Message 63481.  
Last modified: 30 Apr 2015, 13:13:12 UTC

A bad driver by itself wouldn't cause a disk limit error - unless it's spewing out yards and yards of error messages. Look in the slot directory...

I'm afraid I'm not sure I know what to look for. These tasks are failing right away, after only 1-2 seconds of run time. I don't see anything changing in the slot directory when this happens (\ProgramData\BOINC\slots, right?).
Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 63484 - Posted: 30 Apr 2015, 17:43:41 UTC - in response to Message 63482.  

A bad driver by itself wouldn't cause a disk limit error - unless it's spewing out yards and yards of error messages. Look in the slot directory...

I'm afraid I'm not sure I know what to look for. These tasks are failing right away, after only 1-2 seconds of run time. I don't see anything changing in the slot directory when this happens (\ProgramData\BOINC\slots, right?).

Well, each task gets allocated to one particular numbered folder in there as it starts - which one is visible via the 'properties' button while it's active, but a couple of seconds doesn't give you much time to investigate.

Each slot should be empty, unless there's a running task using it. Might be worth (re-)starting BOINC with GPU activity disabled, and emptying any slots which should be empty but aren't. Then, the next task you allow to run should occupy the lowest-numbered empty slot - watch that, and see if anything (big) appears in it.
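The watching step can be scripted. A minimal sketch, assuming Python and the default BOINC data directory on Windows (adjust the path for your install), that prints the current size of each slot:

```python
from pathlib import Path

# Assumption: default BOINC data directory on a Windows host.
SLOTS = Path(r"C:\ProgramData\BOINC\slots")

def dir_size(path: Path) -> int:
    """Total size in bytes of every file under path, recursively."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

if SLOTS.is_dir():
    for slot in sorted(SLOTS.iterdir()):
        if slot.is_dir():
            mb = dir_size(slot) / (1024 * 1024)
            flag = "  <-- not empty" if mb >= 1 else ""
            print(f"slot {slot.name}: {mb:10.2f} MB{flag}")
```

Run it in a loop (or under `watch` on Linux) while the task starts, and any slot that suddenly grows large stands out.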
ritterm
Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 63487 - Posted: 1 May 2015, 3:08:12 UTC

I suspended all tasks then resumed them one at a time and waited for one to crash. It didn't take too long, but all that's left in the directory is the stderr output file and it's only 4KB. If something "big" was generated before being deleted, I wasn't able to see anything.

The only message in the BOINC manager log related to the task is something similar to this:

Aborting task de_80_DR8_Rev_8_5_00004_1429700402_4384432_0: exceeded disk limit: 5115.01MB > 14.31MB

I just don't understand what's going on. Only about 5% of the tasks are failing for me and others don't seem to be having any problem with them.
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,938,432
RAC: 22,812
Message 63495 - Posted: 2 May 2015, 10:51:16 UTC - in response to Message 63487.  

I suspended all tasks then resumed them one at a time and waited for one to crash. It didn't take too long, but all that's left in the directory is the stderr output file and it's only 4KB.


THAT'S the file you want to look at; post it here if you can, please.
ritterm
Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 63498 - Posted: 2 May 2015, 12:22:35 UTC - in response to Message 63495.  

THAT'S the file you want to look at; post it here if you can, please.

Well, other than the initial remarks about the "Maximum disk usage exceeded" error and what appears to be the lack of results data at the end, the rest of the file looks virtually identical to the stderr output of a valid task.

However, I think my problem is solved. Following the suggestion of a forum post about a similar problem at another project, I checked my host's slots directories and found two "stray" VM image files left by one of the VM projects (probably CERN's CMS-dev), each of which was over 5GB. I deleted those files and slots and have been running trouble free for almost 12 hours. I'm not sure, though, that I understand why that was the problem.
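A sketch of why the stray images would trip the check (not BOINC's actual code, and the byte counts below are made up for illustration): the client appears to measure the whole slot directory, so leftover files from a previous task count against the new task's <rsc_disk_bound>.

```python
MB = 1024 * 1024

def disk_limit_exceeded(slot_bytes: int, rsc_disk_bound: float) -> bool:
    # The client compares total slot-directory usage against the task's bound.
    return slot_bytes > rsc_disk_bound

stray_vdi  = 5100 * MB       # hypothetical leftover VM image in the slot
task_files = 5 * MB          # hypothetical size of the task's own files
bound      = 15_000_000.0    # <rsc_disk_bound> from this thread, ~14.31 MB

print(disk_limit_exceeded(stray_vdi + task_files, bound))  # -> True
print(disk_limit_exceeded(task_files, bound))              # -> False
```

On that reading, the task's own few MB were fine; the 5 GB .vdi sitting in the same slot was what pushed the total past the bound.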
mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,938,432
RAC: 22,812
Message 63500 - Posted: 3 May 2015, 11:26:40 UTC - in response to Message 63498.  

THAT'S the file you want to look at; post it here if you can, please.

Well, other than the initial remarks about the "Maximum disk usage exceeded" error and what appears to be the lack of results data at the end, the rest of the file looks virtually identical to the stderr output of a valid task.

However, I think my problem is solved. Following the suggestion of a forum post about a similar problem at another project, I checked my host's slots directories and found two "stray" VM image files left by one of the VM projects (probably CERN's CMS-dev), each of which was over 5GB. I deleted those files and slots and have been running trouble free for almost 12 hours. I'm not sure, though, that I understand why that was the problem.


I'm glad you got it fixed; that is strange. Could they be trying to use the same file name?
ritterm
Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 63502 - Posted: 3 May 2015, 11:55:31 UTC - in response to Message 63500.  

I'm glad you got it fixed; that is strange. Could they be trying to use the same file name?

Me, too! :-) I'm really not sure what's going on with the other project, but you can read more in this thread over at CERN/CMS-dev.
Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 63511 - Posted: 4 May 2015, 8:50:43 UTC

I reported this problem to the BOINC developers, and got this reply from David Anderson:

I looked at this and couldn't immediately see the problem.
The BOINC client deletes everything in a slot directory before using it for a new job.
If a deletion fails (e.g. because a file is in use by another app) it doesn't use
that slot directory.
I verified this by opening some Word docs in slot directories.

Notes:

* There's a "slot_debug" log flag for messages related to slot directories.
Unfortunately it doesn't print messages about failed file deletions; I'll add this.
* The "disk limit exceeded" errors refer to the per-job disk limit, not the user's
disk usage preferences; I'll change the message to clarify this.
* Apps aren't responsible for cleaning out their slot dirs; BOINC does this. It
may be that BOINC is failing to delete VM images because they're still in use by
the VirtualBox executive.

Bottom line: I'll need some more info to debug this.
If anyone is seeing this reproducibly, let me know.
Otherwise we'll release a client with more debugging output to help us investigate.

-- David

So, help needed.

Under what circumstances does the CMS .vdi image get left behind? Is there a difference between successful task completions and abnormal (error) exits?

Can the .vdi be deleted manually? Immediately? Later? After BOINC restart? After reboot?

Does BOINC ever clean it up by itself, say after a client restart?

And anything else you can think of.

Could somebody pass David's message over to CERN/CMS-dev, please? I don't even have an invitation code to create a posting account.
ritterm
Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 63514 - Posted: 4 May 2015, 11:00:19 UTC - in response to Message 63511.  

Could somebody pass David's message over to CERN/CMS-dev, please? I don't even have an invitation code to create a posting account.

Done.


©2024 Astroinformatics Group