Data analysis - What can you really measure?



Message boards : Science : Data analysis - What can you really measure?

Author Message
phys
Joined: 28 Apr 12
Posts: 24
Credit: 0
RAC: 0
Message 1077 - Posted: 4 May 2012 | 17:26:31 UTC

Data analysis is all about statistics. You have your measurement, and you have to decide whether your apparatus produced a valid value or whether it is all noise and nonsense. I don't want to give you a full lecture on it (google your favourite university + geiger counter experiment + poisson process), just a feel for the limits.

So I took 5000 datapoints from Dagorath (cleaned up a few false points) that date back to March 31st. The samples (they are called ticks here) span 4 days. First I plotted a histogram.
You can see that there is a huge span in the detected clicks: there are a few samples with just 2 or over 24 counts, but the majority are around 11 counts. You can fit it with a Poisson distribution here, nothing special.

Now a counting experiment relies on (as the name says) counts. The standard deviation, the error of your expected value, is the square root of your counts. If you have a Poisson process you can't measure any better than that; you need more counts to get a smaller percentage error.
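The square-root rule can be sketched in a few lines of Python (a minimal illustration, not project code; the ~11 counts per tick figure is taken from the histogram above):

```python
import math

def relative_error(counts):
    """1-sigma relative error of a Poisson count: sqrt(N)/N = 1/sqrt(N)."""
    return math.sqrt(counts) / counts

# ~11 counts in one 40 s tick vs. 40 ticks summed together:
print(f"one tick: {relative_error(11):.1%}")
print(f"40 ticks: {relative_error(11 * 40):.1%}")

# Halving the error always costs 4x the counts:
assert math.isclose(relative_error(4 * 100), relative_error(100) / 2)
```

So a single tick carries roughly a 30% error, and only summing many ticks brings it down to a few percent.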

Look at the next plot.
I have summed up every 5 datapoints, so the counting time is over 3 minutes. The two sigma bands denote the measuring error of the mean value (actually every datapoint has an error bar of sigma). Do you see anything in this noise?

Ok, then take 10 samples

That's a measuring time of over 6 minutes. Do you see anything?

Next 20 samples

That's a measuring time of over 12 minutes. Well, there are not so many spikes any more. If you look at the sigma band, it is getting thinner too. But because of the square root you need 4 times more counts to halve the error.

Ok, next 40 samples together

That's a measuring time of over 25 minutes. It looks like a steady value with a little noise. Double it again?


Now I have summed up over 51 minutes. It looks like the reading is declining, and on the left there is a little hill. But just summing the counts up is a little boring. Instead we can use a moving average: we take a window of, let's say, 11 values and shift it over the summed values of e.g. 40 datapoints. That way we preserve a little bit of the dynamics and can see whether the values change in time.
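The block-summing and moving-average procedure can be sketched like this (plain Python; the function names are mine, not from any project tool):

```python
def block_sums(ticks, block=40):
    """Sum every `block` consecutive ticks, i.e. lengthen the counting time."""
    return [sum(ticks[i:i + block])
            for i in range(0, len(ticks) - block + 1, block)]

def moving_average(values, window=11):
    """Centered moving average: keeps some dynamics while smoothing the noise."""
    half = window // 2
    return [sum(values[i - half:i + half + 1]) / window
            for i in range(half, len(values) - half)]
```

With the ~40 s ticks used here, `moving_average(block_sums(ticks, 40), 11)` gives a window spanning roughly the 282-minute virtual measuring time per point mentioned below.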

Ok, now have a look at it


The virtual measuring time per point is 282 minutes. That is interesting! Notice that I have rescaled the axis. The hill is now outside the error band, and the measurements decline to the right. At the beginning of April you got a reading a little above background. But you don't know that it was radioactivity. Maybe the sun was shining on your detector? Did you move it to another place, from the cellar to the attic? Or were there rainy days, then it got hot, and now there is rainfall again (the reading is climbing again at the end of the graph)?

So looking at every click of the detector is like staring at one pixel of your TV set. It is flashing all the time, but you don't get the big picture and you miss the movie.

Profile krzyszp
Project administrator
Project developer
Project tester
Project scientist
Joined: 16 Apr 11
Posts: 368
Credit: 453,559
RAC: 140

Message 1079 - Posted: 4 May 2012 | 18:31:30 UTC - in response to Message 1077.

You are right :) And you are not :)

In another topic I showed how to create a chart with averages for the last 24h. It's very smooth and, from your point of view, accurate.

But, as TJM mentioned, we want to provide as much data as possible (as far as it is sensible), and data resolution plays a role in that.

Also, imagine a nuclear explosion somewhere near a detector - our hardware shows it immediately (well, unless it is destroyed, the network is down, etc.). In that case the gamma radiation jumps from e.g. 0.16uSv/h to 1000uSv/h (or even more) and differences between 0.09 and 0.30 don't matter - you get the right signal to run ASAP, and by comparing other detectors on the map you get enough data about where the problem happened.

What is interesting: by analysing the data from our project and comparing it with other factors (e.g. day/night or good/bad weather dependencies) you can find more interesting info! This shows that more data means more chances to get "something" interesting out of it.

Why not use some of the data as "white noise"? That is another possibility, as some people use geiger detectors as random number generators (well, I can imagine other, better methods, but this data is available right now).

I know that our detectors are not laboratory equipment, but they still collect correct and (I hope) useful data. As I can see - interesting for you as well :)

Anyway, thanks for your analysis, I found it very interesting.

____________
Regards,
Krzysztof 'krzyszp' Piszczek
Android Radioactive@Home Map
Android Radioactive@Home Map - donated
My Workplace

exsafs
Joined: 25 Jun 11
Posts: 14
Credit: 5,359
RAC: 0
Message 1081 - Posted: 4 May 2012 | 18:41:19 UTC - in response to Message 1077.

:)

Background statistics is a complicated topic, as you are never sure where the counts are coming from. Doing a correct background measurement is far more complicated than measuring a radioactive sample.
Just an example: leaning over the detector should automatically increase the background counts, as the K-40 decaying inside each of our bodies gives extra counts. The crucial question is: how much does each effect (radon emanation after rain, solar modulation, etc.) change the mean background value?

I guess that measuring real radioactive samples with our detector will decrease the uncertainty significantly. If desired, I can provide you with the data.

Dagorath
Joined: 4 Jul 11
Posts: 151
Credit: 42,738
RAC: 0

Message 1087 - Posted: 5 May 2012 | 0:46:19 UTC - in response to Message 1079.

You are right :) And you are not :)

In another topic I showed how to create a chart with averages for the last 24h. It's very smooth and, from your point of view, accurate.

But, as TJM mentioned, we want to provide as much data as possible (as far as it is sensible), and data resolution plays a role in that.


Data resolution in this case is an illusion that distorts your thinking on the subject.

Also, imagine a nuclear explosion somewhere near a detector - our hardware shows it immediately (well, unless it is destroyed, the network is down, etc.). In that case the gamma radiation jumps from e.g. 0.16uSv/h to 1000uSv/h (or even more) and differences between 0.09 and 0.30 don't matter - you get the right signal to run ASAP, and from the map you get


The example you provide assumes a nuclear blast and then describes the effect you would see on the detector. A nearby nuclear blast yields a quick, high detector reading, that is true. The mistake in logic you are making is reversing the statement to "a quick, high detector reading indicates a nuclear blast" and assuming that is true. It is not, and phys's explanation of a Poisson process and the statistical difficulty of measuring Poisson processes shows the reversed statement is absolutely false. A high reading could be the result of other events, not a nuclear blast, so running at the first high reading, when the pulse count and count period are statistically irrelevant, is naive.

You're trying to refute information from people who know far more about this than you do. You really need to discuss this with people who understand statistics, nuclear radiation and the problems pertaining to counting and measuring nuclear events. The R@H team obviously has considerable skill in electronics engineering and computer programming but it's obvious none of you understands statistics and measurement of nuclear decay sufficiently. As phys's analogy implies, you are focusing on a few pixels on the screen and missing the bigger picture, probably the entire movie.

What is interesting: by analysing the data from our project and comparing it with other factors (e.g. day/night or good/bad weather dependencies) you can find more interesting info! This shows that more data means more chances to get "something" interesting out of it.


You missed the point. phys's graphs clearly show that the "more interesting info" you refer to is invisible until the R@H data is reduced to fewer samples. His explanation and example graphs make the need to reduce the data perfectly clear. I don't understand how you can miss the point.

Why not use some of the data as "white noise"? That is another possibility, as some people use geiger detectors as random number generators (well, I can imagine other, better methods, but this data is available right now).


Well, I didn't buy a detector and join the project just to provide white noise or a random number generator for some statistics students. I joined to provide the world with accurate readings of background radiation. That's not happening at R@H, there is every indication it never will, and that's why I won't donate money. It makes far more sense to use my money and time to provide meaningful data.

I know that our detectors are not laboratory equipment, but they still collect correct and (I hope) useful data.


The hardware is very good (except for a USB problem that might just be the result of a "bad" USB port on the host computer) and I think it works well. Unfortunately, the way you guys collect and present the data sucks, and if you ask people who understand the math and physics behind what you are trying to do, you will realise it sucks too. Don't trust phys and me and that other poster who gave the same advice over a month ago; ask someone from a local university, someone you choose. I think machismo will prevent you from investigating that possibility and taking the 5 minutes required to correct the problem. Don't worry, I'll correct the problem for you :-)

____________

phys
Joined: 28 Apr 12
Posts: 24
Credit: 0
RAC: 0
Message 1088 - Posted: 5 May 2012 | 1:02:59 UTC - in response to Message 1079.

@Krzysztof

Can you point me to the thread with the script? The script is not on the radioactiveathome server and just accepts your hostid. It's good to know that at least you are working on a script; that is, in my opinion, more useful.

Please look at the histogram again. There is a small but finite probability that you detect e.g. 26 clicks per trickle. So how can you decide whether something is going on or it is just the Poisson tail of your background? Only by measuring more, or for a longer time! So it is always a tradeoff between accuracy and response to changing conditions; you can never get both. Two moving averages with different window sizes could do the trick, switching you into a dynamic mode. But you need to know what you want: for background monitoring you need data analysis; for dose measurement, another device; for gamma-ray bursts, a few hundred million for a satellite... As for the famous random number generator - the new Intel Ivy Bridge has one built in that passes all tests and is no longer pseudorandom. Please collect every sample you want, but please give the users the tools to do reasonable things.
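The two-moving-averages idea could look something like this (a sketch only; the window sizes and the 3-sigma threshold are my assumptions, not project settings):

```python
from collections import deque

class DualWindow:
    """Slow window = accurate background estimate; fast window = quick
    response. Flag when the fast mean leaves the slow mean's Poisson
    error band."""

    def __init__(self, fast=5, slow=60, n_sigma=3.0):
        self.fast = deque(maxlen=fast)
        self.slow = deque(maxlen=slow)
        self.n_sigma = n_sigma

    def add(self, counts):
        self.fast.append(counts)
        self.slow.append(counts)
        slow_mean = sum(self.slow) / len(self.slow)
        fast_mean = sum(self.fast) / len(self.fast)
        # Poisson: sigma of a mean of k samples ~ sqrt(mean / k)
        sigma = (slow_mean / len(self.fast)) ** 0.5
        return abs(fast_mean - slow_mean) > self.n_sigma * sigma
```

A steady ~11 counts per sample never trips the flag, while a sudden jump well outside the band does, without sacrificing the long-window accuracy.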

More important would be a local user database, so every client keeps a copy of its records (dating back months, not erased by the monthly server cleaning), so that third-party tools can produce nice plots on the user's PC and display things in real time.

Profile krzyszp
Project administrator
Project developer
Project tester
Project scientist
Joined: 16 Apr 11
Posts: 368
Credit: 453,559
RAC: 140

Message 1089 - Posted: 5 May 2012 | 1:15:38 UTC - in response to Message 1088.
Last modified: 5 May 2012 | 1:16:40 UTC

http://radioactiveathome.org/boinc/forum_thread.php?id=85

There is a manual and code for it.
____________
Regards,
Krzysztof 'krzyszp' Piszczek
Android Radioactive@Home Map
Android Radioactive@Home Map - donated
My Workplace

phys
Joined: 28 Apr 12
Posts: 24
Credit: 0
RAC: 0
Message 1090 - Posted: 5 May 2012 | 1:22:42 UTC

@Dagorath

I think you are a bit too harsh. I didn't want to blame anyone, I just wanted to push the team in a more scientific direction. But it is a hobbyist project, and I appreciate it. I only stumbled over it because I wanted to build a counter of my own, so I dug a little deeper than the average user. If there are already over 1000 detectors out there, I think the team did a good job. The community just has to make it better to suit their needs.

phys
Joined: 28 Apr 12
Posts: 24
Credit: 0
RAC: 0
Message 1093 - Posted: 5 May 2012 | 2:59:32 UTC - in response to Message 1081.

@ exsafs

Background statistics is indeed a complicated topic. I am not saying that the slight variations in the analyzed samples come from radiation at all; probably they are temperature or air pressure effects. Geiger counters are not that sensitive and, as you know, are not really used for background measurements in practice. However, you are the only guy here with real measuring equipment. So here are the problems in short form; maybe you can help:

The STS-5 tube is a predecessor of the SBM-20 and no real comparisons (besides TJM's 5%) are known. The SBM-20 datasheet gives a value of 1 cps for the inherent background of the tube alone; that is 3 times the rate of the "background" measurement in my figures. So could you put the device, without any radioactive source, into a lead castle for a longer time to measure the tube's own background? For later data analysis the BOINC client needs to run, so I don't know how to handle this - maybe put the laptop in the lead chamber too?

Second, just 3 more calibration points with Cs-137 (under 1 mSv/h ... I hope nobody will ever measure such activity) would be nice, so you can fit a least-squares line. For the dose, all the self-made geiger counters rely on obscure conversion factors. From the datasheet you get a number, but nobody knows whether it corresponds to reality. The current recipe is a factor of 171, but there are others, like counts in 56 seconds for a reading in microroentgen...
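For reference, the "recipe" conversion looks like this (the factor of 171 cpm per µSv/h is exactly the obscure number under discussion, so treat it as an assumption, not a calibration):

```python
SBM20_CPM_PER_USV_H = 171.0  # the "current recipe"; real accuracy unknown

def dose_rate_usv_h(counts, seconds, factor=SBM20_CPM_PER_USV_H):
    """Convert raw counts over a measuring interval into an estimated dose rate."""
    cpm = counts * 60.0 / seconds
    return cpm / factor
```

For example, 68 counts in a 4-minute sample is 17 cpm, i.e. about 0.10 µSv/h with this factor.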

If the overall accuracy of the detector could be increased by a factor of 2, I think everyone would be happy. Just to give an impression of what professional equipment is capable of (and is doing with your taxes):


You can see that it started to rain twice. The national sensor networks are well developed in central Europe, so you have a grid to calibrate against. (And if you find even similar readings you should be really happy.)

phys
Joined: 28 Apr 12
Posts: 24
Credit: 0
RAC: 0
Message 1097 - Posted: 6 May 2012 | 17:51:37 UTC

Links for the national radiation monitoring networks for background comparison

EURDEP, covering the whole of Europe but with a 6-hour delay


AT
BE
CH
DE
FR
NL
US

Profile jhelebrant
Joined: 30 Jul 12
Posts: 27
Credit: 1,521
RAC: 0
Message 1346 - Posted: 31 Aug 2012 | 7:20:32 UTC

Hi,
have to also add some info:

@phys: using Scidavis or Qtiplot, aren't you? :-)

We also discussed the measuring time. It is clear that a longer sampling time would be better: the tube collects more counts and you get less variable (= more "stable") background measurements. It would also mean a smaller load on the server, etc.

I did some analyses, am currently continuing with them, and we will see.

From the other point of view - this is not scientific measuring equipment for high-precision data acquisition, but rather an indicator. Ok, maybe using a 5-10 sample moving average would improve the values, although it would wipe out some background variations.

So for example: variations from 100 to 200 nSv/h are relatively big, but it is still background. If there were a real radiation event, the GM tube would get enough counts per selected sampling time, and that would be quite enough for measuring.

Finally, the decision on what to do in such a situation cannot be based on a few measurements (they could be false points) and on these sensors alone - they are just for reference. But, as we know from Japan, such a network can help first-response teams refine their overview of the situation for the next steps.

Profile TJM
Project administrator
Project developer
Project tester
Joined: 16 Apr 11
Posts: 291
Credit: 1,081,449
RAC: 472

Message 1355 - Posted: 9 Sep 2012 | 8:58:21 UTC
Last modified: 9 Sep 2012 | 8:59:04 UTC

I made a graph which shows (for now) averaged samples in 30-minute steps. I'm going to add an option to compare results from two sensors, and a switch for optionally even longer sampling periods, so it will eventually be easier to compare results with official stations.
The graph is not finished yet; I think the X scale (date & time) is off by 1 sample. The empty places are periods with not enough samples (the minimum is set at 20 minutes of samples for every 30 minutes) or without samples at all. However, they're not linear: the blank space always has the same length, no matter how long the period without samples was.

http://radioactiveathome.org/scripts/graph/draw24h.php?hostid=506
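The binning rule described above (30-minute bins, blank unless at least 20 minutes of samples are present) might be sketched like this; the sample-tuple layout is hypothetical, not the actual database schema:

```python
def bin_counts(samples, bin_s=1800, min_s=1200):
    """samples: (unix_time, counts, duration_s) tuples.
    Returns {bin_index: mean cps, or None where coverage < min_s}
    so under-filled bins can be drawn as blanks."""
    bins = {}
    for t, counts, dur in samples:
        c, d = bins.get(t // bin_s, (0, 0))
        bins[t // bin_s] = (c + counts, d + dur)
    return {k: (c / d if d >= min_s else None) for k, (c, d) in bins.items()}
```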

Dagorath
Joined: 4 Jul 11
Posts: 151
Credit: 42,738
RAC: 0

Message 1356 - Posted: 10 Sep 2012 | 9:47:13 UTC - in response to Message 1355.

Very nice! Good work!

____________

Profile TJM
Project administrator
Project developer
Project tester
Joined: 16 Apr 11
Posts: 291
Credit: 1,081,449
RAC: 472

Message 1360 - Posted: 14 Sep 2012 | 0:24:26 UTC - in response to Message 1356.
Last modified: 17 Sep 2012 | 10:19:10 UTC

I did a bit of analysis on the data returned by my sensor; there's an official station around 15 km from where I live, so I have a source of data for comparison.

This is the official station's chart for the last couple of days:


And this is the data from my outdoor sensor:


The values obviously do not match, for various reasons, but the sensor registered a slight increase at approximately the same time the official station did.

And some more work with graphs: this one shows the last 7 days, with 1-hour samples (triangles) and a 5-sample running average (blanked where not enough data is available) - the red line.
http://radioactiveathome.org/scripts/graph/drawweekdotted.php?hostid=506


A simple trick to scroll the data on the graph (max 30 days back, since the data is taken from 'trickles'):
http://radioactiveathome.org/scripts/graph/scrollgraph.php?hostid=978&days=0

jacek
Joined: 6 Nov 12
Posts: 10
Credit: 0
RAC: 0
Message 1459 - Posted: 7 Nov 2012 | 15:15:45 UTC

From what I have seen so far on this board, the detector somewhat overestimates low, background-level radiation (by under 50%?), while it underestimates high levels by a factor of 2.

This is not necessarily bad as I saw evidence that the esteemed Terra's MKS-05 can underestimate doses 5 fold!

However, since the project wants to monitor the low levels, is it possible to improve the results by subtracting the GM tube's own noise of 1 cps, or by some other constant adjustment?

Not sure how the adjustments would be deployed across the different hardware versions, though.
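A constant offset correction of the kind suggested here would be trivial to apply (a sketch; the offset would have to be measured per tube variant, e.g. with phys's lead-castle idea, since the 1 cps datasheet figure is about 3x the rate actually observed and clearly cannot be subtracted as-is):

```python
def corrected_cpm(measured_cpm, tube_offset_cpm):
    """Subtract a tube's measured inherent background, clamped at zero
    so statistical noise cannot produce a negative rate."""
    return max(measured_cpm - tube_offset_cpm, 0.0)
```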

Profile TJM
Project administrator
Project developer
Project tester
Joined: 16 Apr 11
Posts: 291
Credit: 1,081,449
RAC: 472

Message 1460 - Posted: 7 Nov 2012 | 15:44:27 UTC - in response to Message 1459.

I think the sensor does not overestimate the background levels; rather, the tubes are somewhat sensitive to weaker (beta) radiation, so eventually the averages are slightly higher.
That's clearly visible on sensors which run without any cover; a thick (3mm) plastic case drops the overall background level a bit. Combined with the shielded version of the tube, it brings the readings down to the values shown by an older Russian detector I have access to (I think it's a БЕЛЛА).

Also remember that some of the sensors run indoors, where in some cases the background radiation is higher.




jacek
Joined: 6 Nov 12
Posts: 10
Credit: 0
RAC: 0
Message 1461 - Posted: 7 Nov 2012 | 17:18:59 UTC - in response to Message 1460.

Really?
You just showed a compelling difference in your above post on Sept 17th.
You can also compare the readings on your map with government network readings also available online.

Profile TJM
Project administrator
Project developer
Project tester
Joined: 16 Apr 11
Posts: 291
Credit: 1,081,449
RAC: 472

Message 1462 - Posted: 7 Nov 2012 | 18:51:04 UTC - in response to Message 1461.
Last modified: 7 Nov 2012 | 19:12:01 UTC

The data above is from host 506 which uses standard SBM-20, no shield of any sort.

Take a look here http://radioactiveathome.org/boinc/test123a.php at host #6. It's placed less than 1m from #506, yet the average readings are lower, and the only difference between them is the tube: #6 uses a shielded SBM-20-1, I think the same version that can be found inside the "БЕЛЛА".
It would probably go even lower with more shielding, because it still reacts to beta sources placed nearby.
The official source states that their sensor counts gamma only, so obviously our sensor will show higher values.

It would be nice to have each sensor calibrated against some sort of more precise equipment, but then each tube would have to be tested and calibrated separately under the same conditions, because there are slight differences between them.

EDIT: Btw, do not compare the numbers on the map with anything, as those values are averages over very short periods of time.

jacek
Joined: 6 Nov 12
Posts: 10
Credit: 0
RAC: 0
Message 1463 - Posted: 7 Nov 2012 | 20:56:38 UTC - in response to Message 1462.
Last modified: 7 Nov 2012 | 20:57:41 UTC

I'm not talking about #506.
I'm talking about this:


vs that:


As for the BOINC map, I'm looking at 24-hour averages, not instant values. I would not call 24 hours "a very short period of time."

Profile TJM
Project administrator
Project developer
Project tester
Joined: 16 Apr 11
Posts: 291
Credit: 1,081,449
RAC: 472

Message 1464 - Posted: 7 Nov 2012 | 21:03:50 UTC - in response to Message 1463.
Last modified: 7 Nov 2012 | 21:09:09 UTC

I'm not talking about #506.
I'm talking about this:



This *is* data taken from host #506.


The map shows incorrect values for some reason, or they're not 24-hour averages; I'll take a look at it soon. The map code and the MySQL queries it uses are rather complicated, so perhaps there is a bug.

Here are more accurate values:
http://radioactiveathome.org/boinc/test123a.php

Profile TJM
Project administrator
Project developer
Project tester
Joined: 16 Apr 11
Posts: 291
Credit: 1,081,449
RAC: 472

Message 1707 - Posted: 24 Apr 2013 | 21:57:29 UTC - in response to Message 1464.

The next version of the app (coming soon) will change the default sample_time from 40 to 200 or 240 s.
What do you guys think about this?
I'm a bit tired of backing up the HUGE database (which could easily be 5-6 times smaller), not to mention that debugging is very problematic when I have to deal with tons of short samples.

Profile jhelebrant
Joined: 30 Jul 12
Posts: 27
Credit: 1,521
RAC: 0
Message 1708 - Posted: 25 Apr 2013 | 7:26:04 UTC - in response to Message 1707.

Hi TJM,
we also wanted to recommend increasing the sampling time - not only because of the database, but also to improve the measurement quality. In our sensor networks we mostly use a 10-minute sampling interval, which seems to be the best compromise for stationary measuring stations.

I played a little with the radioactive@home data from some selected stations, and a longer sampling interval reduces the value fluctuations very much.

Profile TJM
Project administrator
Project developer
Project tester
Joined: 16 Apr 11
Posts: 291
Credit: 1,081,449
RAC: 472

Message 1709 - Posted: 25 Apr 2013 | 19:55:15 UTC - in response to Message 1708.
Last modified: 25 Apr 2013 | 19:55:27 UTC

The current radioactive@home database/daemon design would have to be changed to work with sample times exceeding 5 minutes. I think 3-5 minutes is fine, at least for now. The output looks clean, and I think it will be far easier to look for bugs (and there are some, which I can't find).


Stderr output
<core_client_version>7.0.31</core_client_version>
<![CDATA[
<stderr_txt>
Radac $Rev: 558 $ starting...
sensors.xml: 6 nodes found
Found sensor v2.01
8730,4,2013-4-25 19:0:52,f
168390,56,2013-4-25 19:3:33,n
328140,96,2013-4-25 19:6:13,n
487870,152,2013-4-25 19:8:53,n
648140,201,2013-4-25 19:11:33,n
807990,256,2013-4-25 19:14:14,n
968350,328,2013-4-25 19:16:54,n
1128140,383,2013-4-25 19:19:34,n
1287120,433,2013-4-25 19:22:14,n
Trickle sent
1447350,483,2013-4-25 19:24:54,n
1606560,540,2013-4-25 19:27:34,n
Trickle sent
Done - calling boinc_finish()
21:30:14 (3476): called boinc_finish

</stderr_txt>
]]>

Profile TJM
Project administrator
Project developer
Project tester
Joined: 16 Apr 11
Posts: 291
Credit: 1,081,449
RAC: 472

Message 1716 - Posted: 29 Apr 2013 | 21:23:40 UTC - in response to Message 1709.
Last modified: 29 Apr 2013 | 21:24:50 UTC

The output from the "new" app looks like this:


<core_client_version>7.0.31</core_client_version>
<![CDATA[
<stderr_txt>
Radac $Rev: 558 $ starting...
sensors.xml: 6 nodes found
Found sensor v2.52
16959,5,2013-4-29 20:51:12,f 0.3 minutes,17.7 cpm,0.10 &#181;Sv/h
257443,74,2013-4-29 20:55:12,n 4.0 minutes,17.2 cpm,0.10 &#181;Sv/h
496950,134,2013-4-29 20:59:12,n 4.0 minutes,15.0 cpm,0.09 &#181;Sv/h
736376,209,2013-4-29 21:3:12,n 4.0 minutes,18.8 cpm,0.11 &#181;Sv/h
975629,289,2013-4-29 21:7:12,n 4.0 minutes,20.1 cpm,0.12 &#181;Sv/h
1216192,350,2013-4-29 21:11:13,n 4.0 minutes,15.2 cpm,0.09 &#181;Sv/h
Trickle sent
1456772,428,2013-4-29 21:15:13,n 4.0 minutes,19.5 cpm,0.11 &#181;Sv/h
Trickle sent
Done - calling boinc_finish()
23:19:13 (7444): called boinc_finish

</stderr_txt>


There is a minor glitch because the stderr does not accept "µ".

Profile TJM
Project administrator
Project developer
Project tester
Send message
Joined: 16 Apr 11
Posts: 291
Credit: 1,081,449
RAC: 472

Message 1720 - Posted: 15 May 2013 | 19:28:45 UTC - in response to Message 1716.

I'm updating the Windows app to revision 584 right now. I hope there will be no new bugs/issues; the app went through a 2-week testing phase and everything seems to work.
Since the app produces far less output, I'll try to hunt down some bugs that were very hard to catch - mainly the occasional glitch that caused "negative" readings on the graphs.

Profile ChertseyAl
Joined: 16 Jun 11
Posts: 152
Credit: 385,292
RAC: 162

Message 1723 - Posted: 17 May 2013 | 16:56:01 UTC - in response to Message 1720.
Last modified: 17 May 2013 | 16:56:37 UTC

I'm now getting a much 'smoother' graph. Note the last 500 samples as of the time of posting:



This is v1.76, which I guess is the revision you mention.

Edit: I see other people's graphs are different too!

Cheers,

Al.
____________

Profile TJM
Project administrator
Project developer
Project tester
Send message
Joined: 16 Apr 11
Posts: 291
Credit: 1,081,449
RAC: 472

Message 1724 - Posted: 17 May 2013 | 17:01:10 UTC - in response to Message 1723.

Yep, the sample time is set to 4 minutes; it also looks quite nice in stderr.

Jason Taylor
Joined: 11 Jul 12
Posts: 8
Credit: 2,874
RAC: 0
Message 1856 - Posted: 7 Jul 2013 | 10:32:21 UTC - in response to Message 1346.
Last modified: 7 Jul 2013 | 10:40:50 UTC

It is clear that a longer sampling time would be better: the tube collects more counts and you get less variable (= more "stable") background measurements. It would also mean a smaller load on the server, etc.


jhelebrant, I think you are totally wrong. In this post I propose a simple improvement that is the best of all worlds. TJM, there is no need to compromise. Please read this.

Another way to lower the error is to use a bigger effective cross-sectional area in the detector, something the project *should* be doing but does not, as far as I can tell. We have hundreds of Geiger counters. Using more or all of them is the answer to lowering the error, not decreasing the resolution of the x (time) axis. You lose some geographical resolution, but that is ok for looking at, e.g., gamma-ray bursts. My proposal: eliminate all averaging. Send a packet for each click!

It will not increase total operational costs once you include the users (me and you). The reason is that >90% of those costs is the electricity for keeping these computers on during the night/unused periods, which is about $7/month. The incremental cost of a single packet for me is zero until the total rate approaches 50% of 20 Mbps. At one packet per click/count and 10 counts per minute, the total bandwidth is still <1%, so it is essentially free, probably by more than a factor of 1000. Since there are fewer than 1000 users, the same logic even holds for the server/host.

In other words, the software should send a single packet per click, so that no raw data is averaged at all! Plotting software can always average at whatever scale any user wishes, and using the grouped data of hundreds of detectors allows extremely precise time resolution. But it is a one-way process: if the data is averaged away even more (as you suggest!), it cannot be un-averaged.

Only with a one-packet-per-click model can one do good and cool astrophysics with this project.

Jason
____________
Jason Taylor A new rant each day @ http://twitter.com/jasontaylor7.

Profile TJM
Project administrator
Project developer
Project tester
Send message
Joined: 16 Apr 11
Posts: 291
Credit: 1,081,449
RAC: 472

Message 1858 - Posted: 7 Jul 2013 | 14:10:01 UTC - in response to Message 1856.

That would require completely redesigning the client, server and hardware (sensor), which were not designed for any kind of real-time operation.

Also, there is just no way our server could handle data at the single-pulse level. The network is not a problem, but the database was already extremely large with 40 s samples. To register every single pulse we would probably need a datacenter, not a single server.

____________

Jason Taylor
Joined: 11 Jul 12
Posts: 8
Credit: 2,874
RAC: 0
Message 1859 - Posted: 7 Jul 2013 | 15:56:11 UTC - in response to Message 1858.
Last modified: 7 Jul 2013 | 16:55:37 UTC

That would require to completely redesign the client, server and hardware (sensor)


Yes, you can drastically simplify everything, since there is no averaging, no work-unit nonsense, no BOINC messing with our machines, etc. Each packet contains the UTC time and user number instead of the huge integers it presently sends.

which were not designed for any kind of real-time operation.


What is the present cpu usage % of your server?



Also, there is just no way our server could handle data at single pulse level. Network is not a problem, but the database was already extremely large with 40s samples. To register each single pulse we would probably need a datacenter, not a single server.

I am glad you agree the network is not a reasonable excuse. Already we are on the same page, because any required data smoothing can be done at the server instead of by the client, which may or may not easily upgrade should this debate alter the allegedly "ideal" 4-minute interval.

However, I STRONGLY disagree about this "data center" term. Firstly, there is no reason for speculation; the math is simple. What, exactly, is your present database size? Over what time span? All years? What are the fields? What is the primary key? There are several ways of designing the database so it is far smaller than anything a data center handles. It is not uncommon for someone using database software to end up with large files full of redundant fields because they have no clue how to design for a small footprint. If you want help shrinking it, for starters you should not be storing those huge integers representing your integrated counts. Please post the details of your database structure so we can help. The use of "data center" in this context alone indicates your prejudice and/or ignorance. A data center stores exabytes; an exabyte is 1 million 1 TB hard drives! In the worst case, if you continued to use your present non-optimized system, you would need 20*4=80x more disk space than before. Therefore, if you will need an exabyte, your present server must have 1E6/80=12,500 1 TB drives. Can you verify you presently have that many drives? My guess would be you cannot.

Secondly, the UTC time data from a few hundred detectors at 20 clicks/minute should be only around 20*200 = 4,000 entries per minute, which is about 5.8 million entries per day. Seems a lot. But what if we round to the nearest second, and use time as the primary key? Then the "integrated" database only stores the total counts per second. It now has only 86,400 entries per day. Each entry is one integer, which can be just 4 bytes. Now we have 4 bytes * 86,400 = ~346 KB/day. A 1 TB drive will fill up in 10^12/346,000/365 = ~7,900 years. But if you want to do good geographical analysis, or astrophysics using the Earth as an occultant, you want to divide the Earth into 24*5 = 120 sections (really you don't need that many, but I'm trying to give your argument the benefit of the doubt), yielding about 66 years per TB. With a 4 TB drive and compression (most of the 120 zones are empty because the detectors are concentrated in Europe) you are back to >100 years per hard drive. Sorry, no need for a data center. Laziness is the best reason for you not switching. A second best might be that you want to hold this data hostage from us. (I'm a cynic.) The total data required to be stored is actually less than what you are probably storing now, but I cannot verify this because I don't know what fields you are storing. If it were me I'd add an effective collective detector area, which can be a second integer that changes as people plug in and unplug their detectors. There is plenty of HDD space.
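
The arithmetic above can be checked in a few lines. This is a back-of-envelope sketch only; the detector count and click rate are the assumed figures from this post, not measured network values.

```python
# Storage estimate for a per-second aggregated counts table.
# Assumptions (from the post, not measured): 200 detectors, ~20 clicks/minute each,
# one 4-byte integer counter per second of wall time.

detectors = 200
clicks_per_min = 20

raw_entries_per_day = detectors * clicks_per_min * 60 * 24  # one row per pulse
agg_entries_per_day = 24 * 60 * 60                          # one row per second
bytes_per_day = agg_entries_per_day * 4                     # 4-byte counter

years_per_tb = 1e12 / bytes_per_day / 365

print(raw_entries_per_day)  # 5760000 pulse rows/day
print(bytes_per_day)        # 345600 bytes, ~346 KB/day
print(round(years_per_tb))  # roughly 7,900 years per 1 TB drive
```

Dividing the ~7,900 years by 120 geographic zones gives the ~66 years/TB figure quoted above.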

In summary, my proposal would seem to allow superior astrophysics, early detection of "North Korea" events, the ability to localize sources, etc., and definitely does NOT require any data centers. I feel so strongly about this that I hereby agree to let you use my machine and HDD to store the data.

Jason
http://twitter.com/jasontaylor7
____________
Jason Taylor A new rant each day @ http://twitter.com/jasontaylor7.

Profile TJM
Project administrator
Project developer
Project tester
Send message
Joined: 16 Apr 11
Posts: 291
Credit: 1,081,449
RAC: 472

Message 1860 - Posted: 7 Jul 2013 | 16:33:49 UTC - in response to Message 1859.

The 'realtime' database size is usually a few gigs for the last 30 days. It has to keep lots of data: each sample has its geolocation (if present), datestamp, sensor hardware + software IDs, and user and host info stored.

In the past we used a 40 s sample time, and it was increased mostly due to severe performance issues. At 40 s sample_time a simple script that draws dots on the map was running very slowly, and some other things started to fail due to timeouts and stressing the hardware too much.

Anyway, our sensors do not support reporting single pulses at all; what they return is a raw counter value + hardware timestamp. Also, the entire 'network' we use is not realtime and there is no way to synchronise the clients, so I doubt that even 1 s precision would be of any use.

Jason Taylor
Avatar
Send message
Joined: 11 Jul 12
Posts: 8
Credit: 2,874
RAC: 0
Message 1861 - Posted: 7 Jul 2013 | 17:11:49 UTC - in response to Message 1860.
Last modified: 7 Jul 2013 | 17:39:26 UTC

The 'realtime' database size is usually a few gigs for the last 30 days. It has to keep lots of data - each sample has its geolocation (if present), datestamp, sensor hardware + software id, and user and host info stored.

Thanks very much for supplying this info TJM. Now that I have something to work with I can explain how to fix your system a little. I will take each field in turn:

1. Geolocation should be stored in a separate table as it is a function of the user number.

2. datestamp should be in the header AND be the primary key AND be derived as the header time plus the sampling interval * the line number only. The existence of a carriage return thus determines the row's timestamp.

3. sensor hardware + software id should be stored in a separate table as it is a function of the user number.

4. user: this needs to be stored. It is a 4-byte integer.

5. host info should be stored in a separate table as it is a function of the user number.

Please correct any errors in my logic. Otherwise, as you can see, 4/5ths of your data is stored redundantly for no rational reason I can see.
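
To make the field-by-field argument concrete, here is a minimal sketch of the normalised layout described above, using SQLite. The table and column names are illustrative, not the project's actual schema; the sample row reuses jhelebrant's example values from later in this thread.

```python
import sqlite3

# Static per-user data lives in one table, stored once per detector;
# the per-sample table keeps only what actually changes each interval.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE detector (          -- one row per user/detector
    user_id      INTEGER PRIMARY KEY,
    latitude     REAL,
    longitude    REAL,
    hardware_id  INTEGER,
    software_id  INTEGER,
    host_info    TEXT
);
CREATE TABLE sample (            -- one row per (second, detector)
    utc_second   INTEGER,        -- time rounded to 1 s, part of the key
    user_id      INTEGER REFERENCES detector(user_id),
    pulse_count  INTEGER,
    PRIMARY KEY (utc_second, user_id)
);
""")
db.execute("INSERT INTO detector VALUES (1934, 50.037849, 14.482327, 1, 594, 'example-host')")
db.execute("INSERT INTO sample VALUES (1340979407, 1934, 9)")

# Geolocation etc. is recovered by a join instead of being repeated per row.
row = db.execute("""
    SELECT s.pulse_count, d.latitude
    FROM sample s JOIN detector d ON d.user_id = s.user_id
""").fetchone()
print(row)  # (9, 50.037849)
```

The point of the design is that the wide columns (coordinates, hardware/software IDs, host info) appear once per detector rather than once per sample row.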

In the past
With all due respect, the mistakes you made in the past are not pertinent; that's the point of improving. If you need to live in the past, please ignore this thread.

we used 40s sample time and it was increased mostly due to severe performance issues. At 40s sample_time a simple script that draws dots on the map was running very slow, some other things started to fail due to timeouts and stressing the hardware too much.


That's partly due to the inefficiency of your method of storing the data. If the largest data file is 1 TB it will take even longer to process. This is just an I/O issue.

But I'd guess your real problem is that you are letting map requests instigate computation. Run a cron job and cache the results, so that user-instigated map requests read the output of the cron job instead of doing any calculation. That way, if 10 people ask for a map at the same time, your CPU does not get bogged down.
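
A minimal sketch of that "render once, serve many" pattern follows. All names here (the cache path, the placeholder rendering step) are illustrative; the real map script is not shown in this thread.

```python
import json
import time

# Illustrative cache location; a real deployment would pick a proper path.
CACHE = "/tmp/map_cache.json"

def cron_job(samples):
    """Run periodically (e.g. from cron): do the expensive work once."""
    # Placeholder for the real map rendering; here we just collect dots.
    dots = [{"lat": lat, "lon": lon} for lat, lon in samples]
    with open(CACHE, "w") as f:
        json.dump({"generated": time.time(), "dots": dots}, f)

def handle_map_request():
    """Called per user request: no computation, just read the cached result."""
    with open(CACHE) as f:
        return json.load(f)

cron_job([(50.04, 14.48), (52.23, 21.01)])
print(len(handle_map_request()["dots"]))  # 2
```

With this split, ten simultaneous map requests cost ten file reads, not ten renderings.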

Anyway, our sensors do not support reporting single pulses at all, what they return is raw counter value + hardware timestamp.
What is the fastest time resolution you can get? Can't you flash the firmware?

Also the entire 'network' we use is not realtime, there is no way to synchronise the clients

Yes, exactly; that's a point of my post. The way you are doing it is not as good as it could be if you ditched BOINC, which does not even work for me anyway. But even if you don't ditch BOINC, my comments still argue for moving the time resolution to something finer, like 10 seconds or so. The part I should explain: if you altered the firmware to my simple realtime scheme, each packet carrying a tick would also contain the click time, so the server can assign late-arriving ticks to past counter buckets without synchronised clients. Just use TCP and you should be good.

so I doubt that even 1s precision would be of any use.
There are a lot of things you can do with that data. For the next nearby supernova, one-second resolution would be extremely useful. Also, some potential nuclear events happen very quickly. Trust me.

I'm sure there are some errors in my logic, but aside from a possible boycott request to other users, I think I'm done. Either you take some of my advice, or you are just making bogus excuses and this is a waste of our time, because it was pre-decided that you would not improve anything.

Profile jhelebrant
Avatar
Send message
Joined: 30 Jul 12
Posts: 27
Credit: 1,521
RAC: 0
Message 1868 - Posted: 9 Jul 2013 | 8:42:44 UTC - in response to Message 1861.

Hi Jason,
not sure if I fully understand what you mean. However, it is not averaging: you always have the pulse count for a certain time interval:

sample_ID,host_ID,pulse_count,date_time,Y_coord,x_coord,sample_time,type,experimental,version
45032963,1934,9,2012-06-29 14:16:47,50.037849,14.482327,0.721,n,0,594
45037617,1934,15,2012-06-29 14:17:29,50.037849,14.482327,0.702,n,0,594
45037618,1934,18,2012-06-29 14:18:11,50.037849,14.482327,0.697,n,0,594
45037619,1934,18,2012-06-29 14:18:52,50.037849,14.482327,0.694,n,0,594

Apart from the data-traffic reason already mentioned, there is also another one. As far as I know, this network was built to provide an independent, volunteer radiation monitoring and warning system (although in case of an alarm from such a network you should first contact the institutions responsible for radiation protection, or the police/fire fighters).

If you use a short time interval to calculate the dose rate (usually in microGrays or microSieverts per hour) with small, low-sensitivity Geiger tubes (the SBM-20 has low sensitivity at background levels), then your dose rate values vary a lot; you can see this in the older data. If you collect counts for too short a period, you have too little data to calculate the dose rate "reasonably accurately". And be assured that most radiation warning sensor networks use intervals of 10 minutes and longer, and not because of data traffic.
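
The counting statistics behind this are easy to sketch. For a Poisson process the relative error of a measurement with N counts is 1/sqrt(N), as discussed at the top of this thread. The ~15 counts/minute background rate below is an illustrative figure for a low-sensitivity tube, not a calibration of the SBM-20.

```python
import math

# Assumed background rate for a low-sensitivity tube (illustrative only).
rate_cpm = 15.0

for interval_s in (10, 40, 240, 600):
    counts = rate_cpm * interval_s / 60.0   # expected counts in the interval
    rel_err = 1.0 / math.sqrt(counts)       # Poisson: sigma/N = 1/sqrt(N)
    print(f"{interval_s:4d} s: ~{counts:5.1f} counts, +/-{rel_err:.0%}")
```

At 10 s the dose rate carries a ~60% statistical error; at the 4-minute (240 s) interval it drops to roughly 13%, which is why short intervals produce such noisy readings.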

Finally, I do not think there is a reason to record each individual count + time value. Using a 4-minute interval to collect counts and calculate the dose rate also reduces false trends in the data.

Profile PY2RPD
Send message
Joined: 19 May 13
Posts: 4
Credit: 18,255
RAC: 0

Message 1982 - Posted: 28 Aug 2013 | 6:57:21 UTC - in response to Message 1077.

Hi
How can I get the software to plot this data?
73 de PY2RPD
Wagner
Brazil



Copyright © 2019 BOINC@Poland | Open Science for the future