Computation errors



Message boards : Number crunching : Computation errors

Phil

Joined: 20 Nov 12
Posts: 3
Credit: 4,248,153
RAC: 0
Message 56752 - Posted: 5 Jan 2013, 19:17:51 UTC

I have been running MW@home for about six weeks with no errors or problems until now. Every task that the CPU completes now ends with "computation error", but the GPU tasks are OK. BOINC also no longer downloads new work unless I hit the Update button. My hardware hasn't changed and I haven't added or removed any programs lately. Any help would be appreciated, as it's frustrating to have the computer do all the work only to get wasted results and no credit.

Phil

TeeVeeEss

Joined: 5 Nov 08
Posts: 2
Credit: 13,898,752
RAC: 0
Message 56753 - Posted: 5 Jan 2013, 22:54:55 UTC

First set the project to No New Work (NNW).
When all Milkyway-work is finished: reset the Milkyway-project.
Stop BOINC.
Check if the project-directory (..\projects\milkyway.cs.rpi.edu_milkyway) is empty. If not: manually delete all files in that directory.
Close down your computer: shut down and power off.
Start your computer.
Start BOINC.
Allow new work for Milkyway.
If your N-Body CPU-tasks still end in error after 0 seconds: ask someone else for help in this thread :)
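
For anyone who prefers the command line, the steps above can be roughly sketched with boinccmd, the command-line tool that ships with BOINC. This is only a sketch under assumptions: the project URL is taken from the links in this thread and has to match the URL your client is attached with, boinccmd is assumed to be on the PATH, and the helper name reset_commands is made up for illustration. The script only prints the commands; checking the project directory and the power-off step still have to be done by hand.

```python
# Sketch only: builds the boinccmd calls for the manual steps above.
# Assumptions: boinccmd is on the PATH and the project URL below
# (taken from links in this thread) matches your attached project URL.
PROJECT_URL = "http://milkyway.cs.rpi.edu/milkyway/"


def reset_commands(project_url=PROJECT_URL):
    """Return the boinccmd invocations in the order they should be run."""
    return [
        # 1. No New Work: stop fetching tasks for this project.
        ["boinccmd", "--project", project_url, "nomorework"],
        # 2. Once the remaining tasks have finished, reset the project.
        ["boinccmd", "--project", project_url, "reset"],
        # 3. Stop the BOINC client. Afterwards check the project directory,
        #    power the machine off and on, and start BOINC again by hand.
        ["boinccmd", "--quit"],
        # 4. With BOINC running again, allow new work.
        ["boinccmd", "--project", project_url, "allowmorework"],
    ]


if __name__ == "__main__":
    for cmd in reset_commands():
        print(" ".join(cmd))
```

Run the printed commands one at a time, and wait for the remaining tasks to finish before the reset.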

Phil

Joined: 20 Nov 12
Posts: 3
Credit: 4,248,153
RAC: 0
Message 56754 - Posted: 5 Jan 2013, 23:02:54 UTC - in response to Message 56753.  

Thanks for the info. I tried it but I'm still having the same problem. Thank you for your help though.

Phil

mikey

Joined: 8 May 09
Posts: 2559
Credit: 462,762,741
RAC: 357
Message 56763 - Posted: 6 Jan 2013, 13:18:00 UTC - in response to Message 56754.  

Read some of the other threads; lots of people are having problems with the new 1.04 CPU units. This thread in particular:
http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3102&nowrap=true#56727

Ronald R Codney

Joined: 29 Nov 11
Posts: 18
Credit: 815,433
RAC: 0
Message 56769 - Posted: 6 Jan 2013, 17:37:52 UTC

On an N-Body error, this is the info sent back to me:

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
- exit code -1073740940 (0xc0000374)
</message>
<stderr_txt>
<search_application> milkyway_nbody 1.04 Windows x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 1 max threads on a system with 4 processors
Warning: not applying timestep correction for workunit with min version 0.80
Using OpenMP 1 max threads on a system with 4 processors
Using OpenMP 1 max threads on a system with 4 processors
<search_likelihood>-62200.827114903703000</search_likelihood>

</stderr_txt>
]]>
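
A note on the two numbers in the message line: the exit code is shown as a signed 32-bit integer, with the same value in hex in brackets. On Windows, 0xc0000374 is the status code for heap corruption (STATUS_HEAP_CORRUPTION). A minimal sketch of the conversion, with a function name made up for illustration:

```python
def exit_code_to_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


print(exit_code_to_hex(-1073740940))  # 0xc0000374
```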

mikey

Joined: 8 May 09
Posts: 2559
Credit: 462,762,741
RAC: 357
Message 56779 - Posted: 7 Jan 2013, 12:18:36 UTC - in response to Message 56769.  

Look in the N-Body thread under News; some admins are responding there.

Kanyo Anastasov

Joined: 26 Jun 09
Posts: 1
Credit: 797,255
RAC: 1,003
Message 56808 - Posted: 10 Jan 2013, 8:24:02 UTC

I have computation errors on 6 WUs from the N-Body Simulation v1.04. Is this an issue with the application or with my computer?

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
- exit code -1073740940 (0xc0000374)
</message>
<stderr_txt>
<search_application> milkyway_nbody 1.04 Windows x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 1 max threads on a system with 4 processors
Warning: not applying timestep correction for workunit with min version 0.80
Using OpenMP 1 max threads on a system with 4 processors
Number of particles in bins is very small compared to total. (334 << 100000). Skipping distance calculation
<search_likelihood>-194.117647058823540</search_likelihood>

</stderr_txt>
]]>

Richard Haselgrove

Joined: 4 Sep 12
Posts: 219
Credit: 449,588
RAC: 0
Message 56809 - Posted: 10 Jan 2013, 9:07:35 UTC - in response to Message 56808.  

I have a suspicion that the N-Body application fails with exit code -1073740940 if it was stopped and restarted (from the checkpoint file) during the course of a run.

Somebody help me test this, please?

Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 56819 - Posted: 11 Jan 2013, 11:07:05 UTC - in response to Message 56809.  
Last modified: 11 Jan 2013, 11:25:33 UTC

Hmmm...

OK, I have an nBody running on this host I'm posting from, so to test that I just forced the BOINC service to do a restart (to ensure that all tasks would have to be unloaded from memory) and the nBody task restarted from checkpoint without a problem.

So I don't think that restarting from checkpoint is the problem, at least not in and of itself that is.

Al

Richard Haselgrove

Joined: 4 Sep 12
Posts: 219
Credit: 449,588
RAC: 0
Message 56821 - Posted: 11 Jan 2013, 11:32:14 UTC - in response to Message 56819.  

Any error will reveal itself as heap corruption discovered during the memory de-allocation and cleanup phase at the end of the run.

Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 56823 - Posted: 11 Jan 2013, 12:05:20 UTC - in response to Message 56821.  
Last modified: 11 Jan 2013, 12:15:11 UTC

So IOW, I'm going to have to wait until the run completes, correct?

If so, here's a link to the task in case I'm not around to catch the results.

http://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=293818334

Estimated ToC is 25.5 hours at this point (but increasing).

Al

Richard Haselgrove

Joined: 4 Sep 12
Posts: 219
Credit: 449,588
RAC: 0
Message 56824 - Posted: 11 Jan 2013, 13:16:39 UTC

OK, I sacrificed a short one to the cause: task 379042021.

I stopped BOINC at about 75% - 80%, so the task was flushed from memory, and then restarted it. The task ran for another 20 minutes after the restart, then failed with the expected 0xc0000374 error at about the point where I was expecting it to reach 100%.

Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 56827 - Posted: 11 Jan 2013, 13:39:57 UTC - in response to Message 56824.  

LOL...

Oh great! :-)

The one I was messing with looks to be a long one.

However, this was the XPP-64 host and I'm pretty sure I have restarted BOINC, rebooted the machine, etc. while an nBody has been running during this latest test run. So hopefully XPP is immune to this restart from checkpoint heap corruption problem.

Alinator

Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 56839 - Posted: 11 Jan 2013, 21:55:17 UTC
Last modified: 11 Jan 2013, 21:56:30 UTC

OK, the task I was testing on and linked to in the earlier post completed and validated with no issues.

Overtonesinger

Joined: 15 Feb 10
Posts: 63
Credit: 1,836,010
RAC: 0
Message 57439 - Posted: 7 Mar 2013, 8:42:26 UTC

Suddenly, all separation tasks I get are erroring out immediately on this computer! 6 tasks, then another 6 tasks, etc.... :O

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=344187

Even after resetting the project! ...

Is there a "bad" series of WUs, please?
What's happening? :(

Len LE/GE

Joined: 8 Feb 08
Posts: 261
Credit: 104,050,322
RAC: 0
Message 57440 - Posted: 7 Mar 2013, 10:15:04 UTC - in response to Message 57439.  

Client state Compute error
Exit status -185 (0xffffffffffffff47) ERR_RESULT_START
<core_client_version>7.0.44</core_client_version>
<![CDATA[
<message>
couldn't start app: CreateProcess() failed - A required privilege is not held by the client. (0x522)
</message>
]]>

Your computer has a problem starting the MW app.
Do you have problems with other projects on this computer too?
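
The two hex values in that output can be decoded by hand. The exit status is -185 stored as a 64-bit two's-complement number, and 0x522 is Windows error 1314, ERROR_PRIVILEGE_NOT_HELD ("A required privilege is not held by the client"), which points at account rights rather than at the MW app itself. A minimal sketch, with a helper name made up for illustration:

```python
def signed_from_hex(hex_text: str, bits: int = 64) -> int:
    """Interpret a hex string as a two's-complement signed integer."""
    value = int(hex_text, 16)
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


print(signed_from_hex("0xffffffffffffff47"))  # -185
print(int("0x522", 16))                       # 1314
```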

Carsten Milkau

Joined: 10 Feb 13
Posts: 6
Credit: 1,994,863
RAC: 0
Message 57465 - Posted: 9 Mar 2013, 20:59:12 UTC
Last modified: 9 Mar 2013, 21:06:38 UTC

Got a different kind of error, which fails all WUs immediately:

<core_client_version>7.0.29</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
<search_application> milkyway_separation 1.02 Linux x86_64 double OpenCL </search_application>
Unrecognized XML in project preferences: max_gfx_cpu_pct
Skipping: 20
Skipping: /max_gfx_cpu_pct
Unrecognized XML in project preferences: allow_non_preferred_apps
Skipping: 1
Skipping: /allow_non_preferred_apps
Unrecognized XML in project preferences: nbody_graphics_poll_period
Skipping: 30
Skipping: /nbody_graphics_poll_period
Unrecognized XML in project preferences: nbody_graphics_float_speed
Skipping: 5
Skipping: /nbody_graphics_float_speed
Unrecognized XML in project preferences: nbody_graphics_textured_point_size
Skipping: 250
Skipping: /nbody_graphics_textured_point_size
Unrecognized XML in project preferences: nbody_graphics_point_point_size
Skipping: 40
Skipping: /nbody_graphics_point_point_size
BOINC GPU type suggests using OpenCL vendor 'NVIDIA Corporation'
Setting process priority to 0 (13): Permission denied
Opening Lua script 'astronomy_parameters.txt' (2): No such file or directory
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file
Opening astronomy parameters file 'astronomy_parameters.txt' (2): No such file or directory
Failed to read parameters file

20:40:09 (12990): called boinc_finish

</stderr_txt>
]]>

I completely removed and re-initialized the project several times without any luck. Apparently GPU tasks need this file but don't download it.

http://milkyway.cs.rpi.edu/milkyway/result.php?resultid=415923043
http://milkyway.cs.rpi.edu/milkyway/results.php?userid=765878&state=5
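
To pull the missing-file lines out of a long stderr dump like the one above, something along these lines may help. It is only a sketch: the regular expression is fitted to the messages in this post, and missing_files is a made-up helper name.

```python
import re

# Matches lines such as:
#   Opening Lua script 'astronomy_parameters.txt' (2): No such file or directory
MISSING = re.compile(r"'([^']+)' \(2\): No such file or directory")


def missing_files(stderr_text):
    """Return the distinct file names reported as missing, in order of first appearance."""
    seen = []
    for name in MISSING.findall(stderr_text):
        if name not in seen:
            seen.append(name)
    return seen


sample = """\
Setting process priority to 0 (13): Permission denied
Opening Lua script 'astronomy_parameters.txt' (2): No such file or directory
Opening astronomy parameters file 'astronomy_parameters.txt' (2): No such file or directory
"""
print(missing_files(sample))  # ['astronomy_parameters.txt']
```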

Overtonesinger

Joined: 15 Feb 10
Posts: 63
Credit: 1,836,010
RAC: 0
Message 57466 - Posted: 9 Mar 2013, 22:30:40 UTC - in response to Message 57440.  

Yep! The same error with all projects!

- Reinstalled the STABLE version: issue resolved! :)


(So the special users for running BOINC apps and for BOINC admin were probably re-created with the proper rights. It is still unknown how or why they were damaged: either some Windows update or *maybe* a mass change of user rights invoked by the company domain administrator.)

TJ

Joined: 12 Aug 09
Posts: 262
Credit: 92,631,041
RAC: 0
Message 57467 - Posted: 9 Mar 2013, 23:24:53 UTC
Last modified: 9 Mar 2013, 23:26:54 UTC

All tasks error out immediately with BOINC 7.0.28. See, for example, task 416025759 (I cannot make it clickable).
Is it something on my end?
Einstein@home is now running on the GPUs and Rosetta@home on 5 CPUs, so one CPU is left free.
This is the error message (for all of the last 33 tasks):
Stderr output

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
BOINC: parse gpu_opencl_dev_index 0
<search_application> milkyway_separation 1.02 Windows x86_64 double OpenCL </search_application>
Unrecognized XML in project preferences: nvidia_block_amount
Skipping: 128
Skipping: /nvidia_block_amount
BOINC GPU type suggests using OpenCL vendor 'Advanced Micro Devices, Inc.'
Error loading Lua script 'astronomy_parameters.txt': [string "number_parameters: 4..."]:1: '<name>' expected near '4'
Error reading astronomy parameters from file 'astronomy_parameters.txt'
Trying old parameters file
Using SSE4.1 path
Found 1 platform
Platform 0 information:
Name: AMD Accelerated Parallel Processing
Version: OpenCL 1.2 AMD-APP (1124.2)
Vendor: Advanced Micro Devices, Inc.
Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing cl_khr_d3d11_sharing
Profile: FULL_PROFILE
Using device 0 on platform 0
Found 2 CL devices
Device 'Cypress' (Advanced Micro Devices, Inc.:0x1002) (CL_DEVICE_TYPE_GPU)
Driver version: 1124.2 (VM)
Version: OpenCL 1.2 AMD-APP (1124.2)
Compute capability: 0.0
Max compute units: 20
Clock frequency: 850 Mhz
Global mem size: 1073741824
Local mem size: 32768
Max const buf size: 65536
Double extension: cl_khr_fp64
Build log:
--------------------------------------------------------------------------------
"C:\Users\TJ\AppData\Local\Temp\OCL4D8.tmp.cl", line 30: warning: OpenCL
extension is now part of core
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
^


--------------------------------------------------------------------------------
clBuildProgram: Build failure (-11): CL_BUILD_PROGRAM_FAILURE
Error building program from source (-11): CL_BUILD_PROGRAM_FAILURE
Error creating integral program from source
Failed to calculate likelihood
<background_integral> 1.#QNAN0000000000 </background_integral>
<stream_integral> 1.#QNAN0000000000 1.#QNAN0000000000 1.#QNAN0000000000 </stream_integral>
<background_likelihood> 1.#QNAN0000000000 </background_likelihood>
<stream_only_likelihood> 1.#QNAN0000000000 1.#QNAN0000000000 1.#QNAN0000000000 </stream_only_likelihood>
<search_likelihood> 1.#QNAN0000000000 </search_likelihood>
00:10:58 (3332): called boinc_finish

</stderr_txt>
]]>
Greetings from,
TJ
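
The 1.#QNAN values in the output above are how the Microsoft C runtime prints a NaN (not a number), so no likelihood was actually computed after the OpenCL build failure. A sketch for spotting this in saved stderr output; the tag name comes from the output above, and the helper name is made up for illustration:

```python
import math
import re
from typing import Optional

TAG = re.compile(r"<search_likelihood>\s*([^<\s]+)\s*</search_likelihood>")


def search_likelihood(stderr_text: str) -> Optional[float]:
    """Return the reported search likelihood, NaN for MSVC-style NaN text, or None if absent."""
    match = TAG.search(stderr_text)
    if match is None:
        return None
    raw = match.group(1)
    # MSVC-style NaN output looks like 1.#QNAN or -1.#IND.
    if "#QNAN" in raw or "#IND" in raw:
        return math.nan
    return float(raw)


bad = "<search_likelihood> 1.#QNAN0000000000 </search_likelihood>"
good = "<search_likelihood>-62200.827114903703000</search_likelihood>"
print(math.isnan(search_likelihood(bad)))  # True
print(search_likelihood(good) < 0)         # True
```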

mikey

Joined: 8 May 09
Posts: 2559
Credit: 462,762,741
RAC: 357
Message 57471 - Posted: 10 Mar 2013, 11:22:56 UTC - in response to Message 57467.  
Last modified: 10 Mar 2013, 11:23:30 UTC

Are you averse to upgrading to BOINC version 7.0.52? It fixed my problems at another project and seems to be stable for me. You can get it here:
http://boinc.berkeley.edu/dl/?C=M;O=D

I see 7.0.54 is now available; here are the changes:
http://boinc.berkeley.edu/dev/forum_thread.php?id=6698

©2021 Astroinformatics Group