Forums : Technical Support : CAMB 2.08/2.09 not suspending

Yeti
Joined: 26 Jul 07
Posts: 21
Credit: 3,448,022
RAC: 0
Message 5037 - Posted: 6 Mar 2008, 11:18:12 UTC

I was just testing the latest alpha client, 5.10.45, and noticed that CAMB 2.07 didn't suspend as it should.

At the moment it is not clear whether this is related to BOINC 5.10.45 or CAMB 2.07, but we should keep an eye on it.

Yeti


Supporting BOINC, a great concept !
ID: 5037 · Report as offensive
Yeti
Joined: 26 Jul 07
Posts: 21
Credit: 3,448,022
RAC: 0
Message 5040 - Posted: 6 Mar 2008, 11:52:03 UTC

It really does seem to be caused by CAMB 2.07. See this log:

06/03/2008 12:39:10||Suspending computation - user request
06/03/2008 12:39:10||[app_msg_send] sent <suspend/> to wu_030208_141149_1_1
06/03/2008 12:39:10|Cosmology@Home|[task_debug] task_state=SUSPENDED for wu_030208_141149_1_1 from suspend

06/03/2008 12:39:10||[app_msg_send] sent <suspend/> to 11oc06aa.18571.544210.14.10.52_2
06/03/2008 12:39:10|SETI@home Beta Test|[task_debug] task_state=SUSPENDED for 11oc06aa.18571.544210.14.10.52_2 from suspend
06/03/2008 12:39:10||[app_msg_receive] got msg from slot 1: <current_cpu_time>6.108843158700001e+003</current_cpu_time><checkpoint_cpu_time>5.593930658000000e+003</checkpoint_cpu_time><fraction_done>0.38206798</fraction_done>
06/03/2008 12:39:10||[app_msg_receive] got msg from slot 3: <current_cpu_time>6.664206e+002</current_cpu_time><checkpoint_cpu_time>5.978738e+002</checkpoint_cpu_time><fraction_done>2.349387e-001</fraction_done><fpops_cumulative>3.467713e+012</fpops_cumulative>
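For anyone who wants to check a longer log for the same symptom, here is a minimal sketch in Python. It assumes the `date time|project|message` line layout and the `<current_cpu_time>` tag seen in the excerpt above; the function names `parse_line` and `slots_still_running` are made up for illustration and are not part of BOINC. It flags slots whose reported CPU time keeps rising after the first `<suspend/>` was sent. The excerpt above contains only one report per slot, so on its own it is not enough for this check to flag anything.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the log excerpt: "dd/mm/yyyy hh:mm:ss|project|message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|([^|]*)\|(.*)$")
SLOT_RE = re.compile(r"got msg from slot (\d+):")
CPU_RE = re.compile(r"<current_cpu_time>([^<]+)</current_cpu_time>")


def parse_line(line):
    """Split one log line into (timestamp, project, message); None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%d/%m/%Y %H:%M:%S")
    return stamp, match.group(2), match.group(3)


def slots_still_running(lines):
    """Return slots whose CPU time increased across reports seen after the first suspend."""
    suspended = False
    reports = {}  # slot number -> CPU times reported after the suspend was sent
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip blank or unrecognised lines
        _stamp, _project, message = parsed
        if "sent <suspend/>" in message:
            suspended = True
        slot = SLOT_RE.search(message)
        cpu = CPU_RE.search(message)
        if suspended and slot and cpu:
            reports.setdefault(int(slot.group(1)), []).append(float(cpu.group(1)))
    # A slot is suspicious if it reported at least twice and the CPU time went up.
    return sorted(s for s, times in reports.items() if len(times) >= 2 and times[-1] > times[0])
```

Feeding it the client log lines collected for a minute or so after pressing suspend should be enough; the log does not map slot numbers to task names, so that mapping still has to be done by hand.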



Supporting BOINC, a great concept !
ID: 5040 · Report as offensive
Yeti
Joined: 26 Jul 07
Posts: 21
Credit: 3,448,022
RAC: 0
Message 5041 - Posted: 6 Mar 2008, 11:59:24 UTC

I don't know if this helps, but CAMB 2.07 stops normally when I stop the BOINC client, even after it has ignored the suspend command.

Yeti


Supporting BOINC, a great concept !
ID: 5041 · Report as offensive
Profile Ananas

Joined: 19 Jan 08
Posts: 180
Credit: 2,500,290
RAC: 0
Message 5042 - Posted: 6 Mar 2008, 13:17:01 UTC
Last modified: 6 Mar 2008, 13:31:01 UTC

I just came to the forum to report the same thing (CAMB 2.08).

It is a camb bug, not BOINC related.


p.s.: 2.08 seems to be slower than 2.06 too, but that might be caused by needing more CPU L2 cache. If that is the case, there is probably no slowdown on modern CPUs.


p.p.s.: 2.08 resets the CPU time after being (not really) suspended and then resumed! It does not reset fraction_done though. That means it must have restarted the application after suspend/unsuspend, even though my setting is to leave apps in memory while suspended.


All this together looks very much like a shared memory issue.
ID: 5042 · Report as offensive
Profile Scott
Volunteer moderator
Project administrator
Project developer
Joined: 1 Apr 07
Posts: 662
Credit: 13,742
RAC: 0
Message 5045 - Posted: 6 Mar 2008, 17:46:49 UTC

The new CAMB versions may very well be slower, but that's because we're trying to do some different calculations with them, which may tend to slow them down.
Scott Kruger
Project Administrator, Cosmology@Home
ID: 5045 · Report as offensive
Profile Jord
Volunteer moderator
Volunteer tester
Joined: 15 Jun 07
Posts: 345
Credit: 50,500
RAC: 0
Message 5046 - Posted: 6 Mar 2008, 18:00:16 UTC - in response to Message 5045.  

The new CAMB versions may very well be slower

Slower? It's been a while since one of my tasks took less than 3 hours. It looks like it'll run with 2.08 in 2h 45m.
ID: 5046 · Report as offensive
Profile Jayargh
Volunteer moderator
Volunteer tester
Joined: 25 Jun 07
Posts: 508
Credit: 2,282,158
RAC: 0
Message 5047 - Posted: 6 Mar 2008, 19:41:49 UTC - in response to Message 5046.  

The new CAMB versions may very well be slower

Slower? It's been a while since one of my tasks took less than 3 hours. It looks like it'll run with 2.08 in 2h 45m.



Just like with 2.05, there are variations in run times... one result is not enough to tell whether overall they are faster or slower. I haven't seen much difference yet with the small sample I have run... give it some time :)
ID: 5047 · Report as offensive
Profile Ananas

Joined: 19 Jan 08
Posts: 180
Credit: 2,500,290
RAC: 0
Message 5048 - Posted: 6 Mar 2008, 21:53:22 UTC

You're right, the second 2.08 was in the range of the older ones.

So ... what about the "suspend" thing, is there anything we should test in order to narrow down why it does that?
ID: 5048 · Report as offensive
Profile Scott
Volunteer moderator
Project administrator
Project developer
Joined: 1 Apr 07
Posts: 662
Credit: 13,742
RAC: 0
Message 5052 - Posted: 7 Mar 2008, 3:30:21 UTC - in response to Message 5048.  

You're right, the second 2.08 was in the range of the older ones.

So ... what about the "suspend" thing, is there anything we should test in order to narrow down why it does that?

If you keep applications in memory during a suspend, the app will never stop. My intuition is that it has something to do with the checkpointing system, but I'm not sure yet.
Scott Kruger
Project Administrator, Cosmology@Home
ID: 5052 · Report as offensive
Profile Jayargh
Volunteer moderator
Volunteer tester
Joined: 25 Jun 07
Posts: 508
Credit: 2,282,158
RAC: 0
Message 5058 - Posted: 7 Mar 2008, 23:22:24 UTC - in response to Message 5052.  
Last modified: 8 Mar 2008, 6:03:39 UTC

You're right, the second 2.08 was in the range of the older ones.

So ... what about the "suspend" thing, is there anything we should test in order to narrow down why it does that?

If you keep applications in memory during a suspend, the app will never stop. My intuition is that it has something to do with the checkpointing system, but I'm not sure yet.


My observations show that the leave-in-memory option has no effect, whether it is on or off. If BOINC suspends the task automatically, it stops and another project runs. If you suspend it manually and then unsuspend it once another project has started, it says "waiting to run", but its CPU time keeps going up and it runs on top of the other project. It is aggravating because the Cosmology task also disregards the "switch between applications every X minutes" setting. The only workaround for me at the moment is to reboot (also aggravating) when I know Cosmology has a huge negative debt; the other projects then start up and Cosmology stays asleep. So it takes a lot of manual intervention at this point.

Of course I don't have to intervene, because when the task finishes the debts take over. I do intervene when I can because it isn't running as I wish.

It also seems that on reboots the CPU time now resets to 0 but the progress % resumes where it left off... just the opposite of before.


I hope my observations help, but at this point I would rather have the old "functionality" back via a rollback than have the superficial progress indicator running better.
ID: 5058 · Report as offensive
Profile Ananas

Joined: 19 Jan 08
Posts: 180
Credit: 2,500,290
RAC: 0
Message 5059 - Posted: 8 Mar 2008, 7:18:22 UTC
Last modified: 8 Mar 2008, 7:46:57 UTC

There's one more thing related to the checkpoint:

CAMB checkpoints constantly; the setting "Write to disk at most every 180 seconds" is ignored.

This produces a basic "system" load of between 30% and 40% (Windows Task Manager: the red part of the CPU load graph), leaving only 60% - 70% for the CAMB application.

Those constant checkpoints could well be a cause of the failure to suspend properly, because a checkpoint is a critical operation that BOINC will not interrupt.

I think that this is a critical bug and should have a high priority. (HD killer application!)
ID: 5059 · Report as offensive
Klimax

Joined: 24 Oct 07
Posts: 22
Credit: 648,291
RAC: 0
Message 5061 - Posted: 8 Mar 2008, 8:05:31 UTC - in response to Message 5059.  


...
(HD killer application!)

If that frequency killed hard drives, I should have at least 10 dead HDs by now... :-)
ID: 5061 · Report as offensive
Profile Ananas

Joined: 19 Jan 08
Posts: 180
Credit: 2,500,290
RAC: 0
Message 5062 - Posted: 8 Mar 2008, 9:43:55 UTC
Last modified: 8 Mar 2008, 9:52:00 UTC

Well, it is no longer linear read/write as soon as you have 2 CAMB tasks running. As long as 1 CAMB and 1 SIMAP task were working together, I had a high write rate as well, but not such extreme head activity.

The ear is an ideal instrument to detect HD torture.

But anyway, there is a BOINC setting for the checkpoint frequency and the application should respect it.

p.s.: I wouldn't need checkpoints here at all, those short results are hardly ever affected by system crashes. A checkpoint when the application leaves memory would be enough.
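To illustrate what "respecting the checkpoint frequency" could look like, here is a minimal sketch in Python. It is not CAMB's code and not the BOINC API; the class `CheckpointThrottle`, the function `run`, and their parameters are hypothetical names for illustration, and the 180 second default simply mirrors the preference quoted above. The pattern loosely follows the ask-then-confirm style of BOINC's `boinc_time_to_checkpoint()` and `boinc_checkpoint_completed()` calls: the work loop asks whether enough time has passed and only then writes to disk.

```python
import time


class CheckpointThrottle:
    """Allow a checkpoint only if at least min_interval_s seconds passed since the last one."""

    def __init__(self, min_interval_s=180.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock          # injectable so the logic can be tested without sleeping
        self._last = clock()         # treat start-up as the time of the last checkpoint

    def time_to_checkpoint(self):
        return self._clock() - self._last >= self.min_interval_s

    def checkpoint_completed(self):
        self._last = self._clock()


def run(steps, throttle, write_checkpoint, do_step):
    """Run `steps` units of work, checkpointing only when the throttle allows it."""
    written = 0
    for i in range(steps):
        do_step(i)
        if throttle.time_to_checkpoint():
            write_checkpoint(i)              # hypothetical callback that saves state to disk
            throttle.checkpoint_completed()
            written += 1
    return written
```

With work steps that each take about a minute, a loop like this would write roughly one checkpoint every third step instead of one per step, which is the behaviour the disk-write preference is meant to enforce.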
ID: 5062 · Report as offensive
Profile Ananas

Joined: 19 Jan 08
Posts: 180
Credit: 2,500,290
RAC: 0
Message 5063 - Posted: 8 Mar 2008, 10:29:37 UTC
Last modified: 8 Mar 2008, 10:32:14 UTC

I have now stopped one of the two running tasks (I had to exit BOINC for that) and am running the remaining one together with Spinhenge. The difference is extreme.

Now both camb and spinhenge receive between 49% and 50% CPU time each, system load down to ~1%.

camb_scalarcls.chk of the stopped application went from about 50MB to 5kB btw., I wonder why it did that - if a 30MB-50MB checkpoint file is needed for the restart, how can it shrink to 5kB when BOINC exits?
ID: 5063 · Report as offensive
Nothing But Idle Time

Joined: 27 Aug 07
Posts: 84
Credit: 148,380
RAC: 0
Message 5064 - Posted: 8 Mar 2008, 14:26:41 UTC

Please fix the suspend problem. To suspend is an order to be obeyed and is not at the discretion of the application. It is causing havoc on my P4/HT machine because 3 or 4 tasks can get started and all are running simultaneously with 33% or 25% resources instead of 50/50. It also prevents me from having the freedom to control my own environment. This is the only project where I've encountered this problem, and this is the second time.
ID: 5064 · Report as offensive
Yeti
Joined: 26 Jul 07
Posts: 21
Credit: 3,448,022
RAC: 0
Message 5065 - Posted: 8 Mar 2008, 15:15:06 UTC

Just saw that 2.09 is out, but unfortunately it doesn't solve the suspend problem.


Supporting BOINC, a great concept !
ID: 5065 · Report as offensive
Profile Scott
Volunteer moderator
Project administrator
Project developer
Joined: 1 Apr 07
Posts: 662
Credit: 13,742
RAC: 0
Message 5071 - Posted: 8 Mar 2008, 18:33:43 UTC - in response to Message 5065.  

Just saw that 2.09 is out, but unfortunately it doesn't solve the suspend problem.

It works for me on my i686 Linux machine, so it looks like it's a Windows problem at this point.

I'll continue to look at it.
Scott Kruger
Project Administrator, Cosmology@Home
ID: 5071 · Report as offensive
Profile Jayargh
Volunteer moderator
Volunteer tester
Joined: 25 Jun 07
Posts: 508
Credit: 2,282,158
RAC: 0
Message 5072 - Posted: 8 Mar 2008, 18:48:55 UTC - in response to Message 5071.  

Just saw that 2.09 is out, but unfortunately it doesn't solve the suspend problem.

It works for me on my i686 Linux machine, so it looks like it's a Windows problem at this point.

I'll continue to look at it.


Scott, version 2.09 is not working on my 32-bit Linux machine.
ID: 5072 · Report as offensive
Profile Ananas

Joined: 19 Jan 08
Posts: 180
Credit: 2,500,290
RAC: 0
Message 5074 - Posted: 8 Mar 2008, 20:06:35 UTC
Last modified: 8 Mar 2008, 20:10:09 UTC

I aborted all 2.09 tasks, too much trouble :-/

The result I mentioned above (checkpoint file shrunk to 5 kB) restarted at 0%, and after 6 more hours (8 more hours of wallclock time!) it had only reached 59%.

That's 3 times as long as usual (2.05 and such) by wallclock, 2 times as much by CPU time.
ID: 5074 · Report as offensive
Yeti
Joined: 26 Jul 07
Posts: 21
Credit: 3,448,022
RAC: 0
Message 5081 - Posted: 8 Mar 2008, 21:49:10 UTC

Similar here; I received a lot of errors from my remote server watches.

I have to set 2.09 to "No new work" until, hopefully, 2.10 fixes the problem.


Supporting BOINC, a great concept !
ID: 5081 · Report as offensive