Never Ending WU's + Invalid "Separation"

[AF>Amis des Lapins] Phil1966
Joined: 25 Dec 11
Posts: 20
Credit: 119,498,787
RAC: 0
Message 63680 - Posted: 6 Jun 2015, 20:30:27 UTC

Hello,

Is there a way to avoid running these never-ending WUs?
That is, why don't they automatically stop after X minutes?


(Columns: task ID, work unit ID, computer ID, sent, reported, status, run time in seconds, CPU time in seconds, credit, application)

1143729764 842451651 551092 6 Jun 2015, 13:20:28 UTC 6 Jun 2015, 17:30:54 UTC Cancelled by user 13,937.59 2.15 --- Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101)

1143108316 841998131 551092 6 Jun 2015, 0:00:59 UTC 6 Jun 2015, 7:32:21 UTC Cancelled by user 24,707.89 0.80 --- MilkyWay@Home v1.02 (opencl_amd_ati)

1142937681 841873815 551092 5 Jun 2015, 20:18:12 UTC 6 Jun 2015, 0:24:16 UTC Error while computing 13,704.75 13,607.81 --- Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101)


In addition, I have a lot of INVALID "Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101)" results. Is this "normal"?

Thank You

Best Regards,

Phil1966

[AF>Amis des Lapins] Phil1966
Message 63681 - Posted: 7 Jun 2015, 16:58:36 UTC
Last modified: 7 Jun 2015, 17:12:05 UTC

Hello,

Another "never-ending" WU today:

1144844168 843256198 7 Jun 2015, 13:52:23 UTC 7 Jun 2015, 16:45:48 UTC Cancelled by user 9,113.11 0.87 --- MilkyWay@Home v1.02 (opencl_amd_ati)

:/


NB: I temporarily stopped running my 4 x GTX970 and bought 3 x HD7950 only to be able to crunch for MW.

I am also running
Milkyway (Campaign #3) credit 4,720,110,000
on BU.

I really hope to get an answer to my 2 questions this time.

Not getting one would be really disappointing / frustrating.

Thank you.


cf

http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3772&postid=63648#63648
http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3772&postid=63674#63674

[AF>Amis des Lapins] Phil1966
Message 63684 - Posted: 8 Jun 2015, 4:44:08 UTC

I had to cancel another one this morning. Run time > 7 hours :/


1145173555 843500743 7 Jun 2015, 20:42:16 UTC 8 Jun 2015, 4:05:44 UTC Cancelled by user 26,320.96 1.03 --- MilkyWay@Home v1.02 (opencl_amd_ati)

Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist
Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 63688 - Posted: 9 Jun 2015, 13:13:38 UTC

Hi Phil1966,

This is the first I'm hearing about never ending work units on the separation application. Would you mind checking your BOINC version number and posting that here for me?

Jake W.

[AF>Amis des Lapins] Phil1966
Message 63691 - Posted: 10 Jun 2015, 4:23:41 UTC - in response to Message 63688.  
Last modified: 10 Jun 2015, 4:27:00 UTC

Dear Jake,

Thank you for your message.

I am running BOINC 7.4.42 / W7 Ultimate / Machine ID 551092.


GenuineIntel
Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz [Family 6 Model 60 Stepping 3]
(8 processors)
[2] AMD Radeon HD 7870/7950/7970/R9 280X series (Tahiti) (3072MB) driver: 1.4.1848 OpenCL: 1.2

Microsoft Windows 7
Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)



NB: Yesterday everything ran fine. Below are the WUs I am referring to:

1145173555 843500743 551092 7 Jun 2015, 20:42:16 UTC 8 Jun 2015, 4:05:44 UTC Cancelled by user 26,320.96 1.03 --- MilkyWay@Home v1.02 (opencl_amd_ati)

1143729764 842451651 551092 6 Jun 2015, 13:20:28 UTC 6 Jun 2015, 17:30:54 UTC Cancelled by user 13,937.59 2.15 --- Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101)

1143108316 841998131 551092 6 Jun 2015, 0:00:59 UTC 6 Jun 2015, 7:32:21 UTC Cancelled by user 24,707.89 0.80 --- MilkyWay@Home v1.02 (opencl_amd_ati)

1142937681 841873815 551092 5 Jun 2015, 20:18:12 UTC 6 Jun 2015, 0:24:16 UTC Error while computing 13,704.75 13,607.81 --- Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101)

Thank You

Kind Regards

Philippe

Jake Weiss
Message 63694 - Posted: 10 Jun 2015, 12:36:58 UTC - in response to Message 63691.  

I've noticed a few of your work units, the invalid ones, are returning blank stderr.txt files. This is very strange and the fact it only happens on a small percentage is interesting. Equally interesting is that you are the only one who seems to be getting stuck on the work units in question. I will do my best to try to reproduce the issue here. If you can give me any more info about the state of your system when this happens please let me know.

This is only happening on your dual-GPU system, correct?

Jake W.

[AF>EDLS]zOU
Joined: 31 Mar 08
Posts: 22
Credit: 84,159,673
RAC: 0
Message 63711 - Posted: 14 Jun 2015, 14:47:19 UTC - in response to Message 63694.  

I also get never-ending WUs.

Suspending/resuming them works, but I do get a bunch of invalids/errors:

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=620091&offset=0&show_names=0&state=0&appid=

All (3581) · In progress (120) · Validation pending (0) · Validation inconclusive (182) · Valid (2632) · Invalid (527) · Error (120)


3x HD7950 GPU, running 2 WU/GPU


No invalid on my Nvidia GTX980 (also running 2 WU/GPU)
State: All (329) · In progress (39) · Validation pending (0) · Validation inconclusive (30) · Valid (155) · Invalid (0) · Error (105)
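
To put those counts in proportion, here is a minimal Python sketch that turns the HD7950 host's result counts into rates. The numbers are copied from the summary line for that host; the `rate` helper and the choice of denominators are illustrative assumptions, not how the project computes anything.

```python
# Result counts copied from the HD7950 host's summary line in this post.
counts = {
    "valid": 2632,
    "invalid": 527,
    "error": 120,
    "inconclusive": 182,
    "in_progress": 120,
}


def rate(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


total = sum(counts.values())                     # 3581, matches "All (3581)"
decided = counts["valid"] + counts["invalid"]    # results the validator has judged

print(f"invalid / all results:       {rate(counts['invalid'], total):.1f}%")    # 14.7%
print(f"invalid / validated results: {rate(counts['invalid'], decided):.1f}%")  # 16.7%
```

Whichever denominator you prefer, the invalid share on this host is well above the 3% to 7% that other HD7950 owners mention elsewhere in this thread.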

Tom*
Joined: 4 Oct 11
Posts: 38
Credit: 309,729,457
RAC: 0
Message 63712 - Posted: 14 Jun 2015, 23:17:36 UTC
Last modified: 14 Jun 2015, 23:31:14 UTC

Re: invalids due to blank stderr.

A lot of us get 5% to 7% invalids using HD7950s. FWIW:

On my AMD FX-8350 the whole stderr is always blank.

But on my Haswell-E (also running an HD7950) the stderr is always cut off at the same place, right after the "Initial wait" line, just at a different spot than for [AF>EDLS]zOU. The FX-8350 has had many more errors per day than the Haswell-E, though.

Iteration area: 560000
Chunk estimate: 1
Num chunks: 2
Chunk size: 559104
Added area: 558208
Effective area: 1118208
Initial wait: 16 ms

</stderr_txt>
]]>

This is where my Haswell-E always truncates the STDERR

Never had a hang though

mikey
Joined: 8 May 09
Posts: 3315
Credit: 519,947,628
RAC: 22,118
Message 63714 - Posted: 15 Jun 2015, 10:44:54 UTC - in response to Message 63711.  

Have you tried running just one WU per GPU at a time to see if the errors still occur? Do you have an SLI cable connecting the GPUs? If so, take it off and see if the errors stop. I realize that SLI IS better for gaming, but it is NOT helpful for crunching; in fact it can cause errors. What else do you do while the PC is crunching on the GPUs? Do you leave any CPU cores free just for the GPUs to use?

[AF>EDLS]zOU
Message 63715 - Posted: 15 Jun 2015, 11:12:43 UTC - in response to Message 63714.  
Last modified: 15 Jun 2015, 11:39:30 UTC

Hello mikey (are you the same person who helped me with my BU issue?)

This computer is dedicated to BOINC.

There are 2 CrossFire cables.

There's a free CPU core.

I've noticed that in the past 2 hours my GPU temperature has dropped dramatically (it used to reach 85C and is now stable at 40C),

even though the ambient temperature hasn't dropped and each GPU is still doing 2 WUs at a time.


==================================================
I've stopped the computer, removed the XFire connectors and restarted it.

We'll see.

Thank you for the suggestion.

(FYI, I initially used the XFire connectors because I couldn't connect each GPU to a monitor to get them to process WUs.)

Note: when I remove the CrossFire connectors, my GPUs show as connected to a PCI-E v1.1 slot.

With the XFire connectors they show a PCI-E 2.0 bus... well, except now: they still show PCI-E 1.1 and WUs are taking a lot longer, which explains the temps.

Anyway, I'm going to troubleshoot all that and post back...

[AF>EDLS]zOU
Message 63716 - Posted: 15 Jun 2015, 13:10:31 UTC - in response to Message 63715.  

I finally managed to get them to work at the proper PCI-E speed and with Xfire connectors removed (don't ask me how...)

I will monitor for "never ending WU" now.

(they're still crunching 2/GPU)

[AF>EDLS]zOU
Message 63717 - Posted: 15 Jun 2015, 14:53:58 UTC

all my invalids are Milkyway@Home Separation (Modified Fit) v1.36 (opencl_ati_101)


I've removed that application from my preferences for now.

Jake Weiss
Message 63718 - Posted: 15 Jun 2015, 16:24:12 UTC

Hey everyone,

Thanks for the feedback.

I am still looking into what could be causing this. Can you all confirm that you do not have these issues with the regular separation application and only the "Modified Fit" application?

Jake W.

[AF>EDLS]zOU
Message 63719 - Posted: 15 Jun 2015, 16:28:14 UTC - in response to Message 63718.  

Yes, check here for details :)

http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=620091

This is my main Milkyway machine.

But you can check the other one too :)

Keith Myers
Joined: 24 Jan 11
Posts: 696
Credit: 540,045,272
RAC: 86,667
Message 63720 - Posted: 15 Jun 2015, 18:18:52 UTC - in response to Message 63718.  

Jake, please look at this thread:
http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3662

The problem is with the underlying BOINC code. I have just come to accept the 3% error rate here at MW. The MW applications are mainly to blame for exposing the BOINC coding problem. Most of my errors are with the 1.36 Modified Fit application, but I have seen the truncated stderr.txt problem with the 1.02 app too. The problem is worst with the FX-8350, though it occurs with other chip types as mentioned here in the thread. It doesn't matter whether you are running SLI, Crossfire, or single or multiple tasks per card; the problem occurs because of an issue with the BOINC code. I have brought the problem to the BOINC developers' attention and they have acknowledged it; there is just no timeline for when the issue will get resolved.

Cheers, Keith

Keith Myers
Message 63723 - Posted: 15 Jun 2015, 20:51:02 UTC

Haven't seen any response from Jake yet. I just wanted to comment that the MW app developers could help with the problem by increasing the work unit completion times, doing more work with larger computations as has been discussed. Put simply, the Modified Fit task completes too quickly for the underlying BOINC code to clean up the finished task in the slot. If anyone wants to see what is happening, just turn on slot_debug in the BOINC Manager and look at the logfiles. Match up the invalid task entries and you will see that the slot cleanup doesn't work correctly for the very short completion times of the 1.36 Modified Fit tasks. That is why it looks like only the 1.36 app has problems, but in fact it can occur with the 1.02 app too if the system is busy crunching work for other projects and system resources can't be applied in a timely manner to get ahead of the underlying problems with the BOINC code.

Cheers, Keith


P.S. So get busy on releasing a new Modified Fit app with larger work tasks.
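
For anyone who prefers a config file over the Manager dialog, here is a rough Python sketch that generates a cc_config.xml with the slot_debug log flag switched on. It assumes the flag belongs under `<log_flags>`, which is where the BOINC client configuration puts its logging switches; `build_cc_config` is a made-up helper name, and the output should be merged with any existing cc_config.xml rather than blindly overwriting it.

```python
import xml.etree.ElementTree as ET


def build_cc_config(slot_debug: bool = True) -> str:
    """Build cc_config.xml text with the slot_debug log flag set.

    Hypothetical helper: only the <log_flags> section is written here.
    """
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    ET.SubElement(log_flags, "slot_debug").text = "1" if slot_debug else "0"
    ET.indent(root)  # one tag per line; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


print(build_cc_config())
```

After saving the text as cc_config.xml in the BOINC data directory, restart the client (or have it re-read its config files) and watch the event log around the tasks that later come back invalid.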

mikey
Message 63724 - Posted: 16 Jun 2015, 11:47:58 UTC - in response to Message 63715.  

Yes, it's me... The drop in temps and the extra-long work unit run times could be because your GPUs crashed and were running at the default low level. Rebooting the PC should bring them back up to normal speed again.

As for the PCI-E bus speed: if you have cards that HAVE the Xfire connector on them, they will all think they are on the faster bus; once you remove it they will show what they are really on. Do you use a cc_config.xml file like this:

<cc_config>
<options>
<use_all_gpus>1</use_all_gpus>
</options>
</cc_config>

to tell BOINC to use all your GPUs? You MUST use a plain-text editor such as Notepad to copy and paste the info, and the saved file must be named cc_config.xml (Notepad may complain or try to add a .txt extension; say yes, but make sure the name stays cc_config.xml). Save it in the C:\ProgramData\BOINC directory. That way all your BOINC projects can use it. You should exit and restart BOINC after putting the file in there. Each GPU will then crunch its own unit and finish in its own time.
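
A quick way to check that a saved file really says what the snippet above says is to parse it. This is a minimal Python sketch; `uses_all_gpus` is a hypothetical helper, and it only checks that the text is well-formed XML with `<use_all_gpus>` set to 1 under `<options>`, not that BOINC has actually picked the file up.

```python
import xml.etree.ElementTree as ET


def uses_all_gpus(xml_text: str) -> bool:
    """True if the text is well-formed and sets <use_all_gpus> to 1."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # e.g. a missing closing tag after a bad copy/paste
    if root.tag != "cc_config":
        return False
    return (root.findtext("options/use_all_gpus") or "").strip() == "1"


# Same content as the cc_config.xml shown in this post.
sample = """<cc_config>
<options>
<use_all_gpus>1</use_all_gpus>
</options>
</cc_config>"""

print(uses_all_gpus(sample))  # True
```

To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the text in; the path depends on where your BOINC data directory lives.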

[AF>EDLS]zOU
Message 63725 - Posted: 16 Jun 2015, 11:58:45 UTC - in response to Message 63724.  
Last modified: 16 Jun 2015, 12:05:51 UTC

Thank you

My cc_config was correct.

I also had to connect each GPU to a monitor and extend the windows desktop.

I'm glad I have monitors with multiple inputs as I didn't want to put resistors on the dvi to VGA adapter.

Now I may have to do that if I want to use the APU engine... but I'm sure it's worth it.

Also, I had no errors or invalid WUs for 24 hours on this computer, although I had to suspend/resume some WUs.

A few inconclusive but that's usual.

Jake Weiss
Message 63726 - Posted: 16 Jun 2015, 12:47:52 UTC

Hey everyone,

Thank you for letting me know this is a BOINC issue! It seems strange that it would only be on the Modified Fit application though if it were a BOINC issue so I will still be looking to see if we are causing the issue somehow.

Jake W.

Keith Myers
Message 63738 - Posted: 17 Jun 2015, 20:43:05 UTC - in response to Message 63726.  

Hi Jake, as I stated in the previous post, the main issue is the very short run time of the Modified Fit app. On fast GPU crunchers it completes too quickly for the BOINC slot cleanup code to finish before another task occupies the same slot as the previous task and ends up overwriting the stderr.txt file. This is why you get the completely truncated stderr.txt output, and the result is then reported as invalid. The solution is either to wait for the BOINC developers to fix this and the >4GB file deletion problem in the slot cleanup code, or to make the Modified Fit tasks run a lot longer. If they ran for the 2:00 to 2:30 minutes that the 1.02 app takes, you wouldn't get invalids in most cases. If you want to see the problem happen in real time, just enable slot_debug in the BOINC Manager and look at the entries in the logfile for the invalidated tasks. You will see that a new task occupies the slot of a previous task ahead of the final slot cleanup for the previously completed task.

Cheers, Keith
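
The ordering problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not BOINC's actual code: the `Slot` class, `reported_stderr`, and the millisecond delays are all hypothetical, and the real client's cleanup logic is more involved.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """Toy stand-in for a slot directory holding one stderr.txt."""
    stderr: str = ""


def reported_stderr(collect_delay_ms: int, next_start_delay_ms: int) -> str:
    """Return the stderr text that ends up reported for task A.

    Hypothetical model: task A finishes at t=0 and leaves its output in
    the slot. The client collects the file at t=collect_delay_ms. Task B
    starts in the same slot at t=next_start_delay_ms and truncates the file.
    """
    slot = Slot(stderr="Initial wait: 16 ms\n(task A output)")
    # On a tie the new task wins, which is the pessimistic case.
    events = sorted([(next_start_delay_ms, 0, "start_b"),
                     (collect_delay_ms, 1, "collect_a")])
    reported = ""
    for _, _, name in events:
        if name == "start_b":
            slot.stderr = ""          # task B overwrites the file
        else:
            reported = slot.stderr    # client picks up A's stderr
    return reported


# Slow task turnover: cleanup finishes first and the output survives.
print(repr(reported_stderr(collect_delay_ms=50, next_start_delay_ms=200)))
# Very short tasks: the slot is reused first and the output is blank.
print(repr(reported_stderr(collect_delay_ms=200, next_start_delay_ms=50)))
```

In this model, longer-running tasks correspond to a larger `next_start_delay_ms`, which is why making the Modified Fit tasks run longer would be expected to hide the problem even without a client-side fix.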