Message boards : Science : Data analysis - What can you really measure?
Author | Message |
---|---|
Data analysis is all about statistics. You have a measurement, and you have to decide whether your apparatus gives you a valid value or whether it is all noise and nonsense. I don't want to give a full lecture on it (google your favourite university + geiger counter experiment + poisson process), but just to give you a feel for the limits. | |
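To put rough numbers on that limit: counts in a fixed window follow a Poisson distribution, so a window with expected count N has a statistical spread of sqrt(N). A minimal sketch (Python; the 0.2 counts/s background rate is an assumed, merely plausible figure, not project data):

```python
import math

rate = 0.2  # assumed background rate in counts/s (illustrative, not project data)

for sample_time in (40, 240, 1800):      # window lengths in seconds
    n = rate * sample_time               # expected counts N in the window
    rel_err = math.sqrt(n) / n           # Poisson: sigma = sqrt(N), so rel. error = 1/sqrt(N)
    print(f"{sample_time:5d} s window: N ~ {n:6.1f}, relative error ~ {100 * rel_err:4.1f} %")
```

At 40 s that is only ~8 expected counts and a ~35 % statistical error on a single sample; lengthening the window to 30 minutes brings it down to roughly 5 %.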
ID: 1077 | Rating: 0 | rate: / Reply Quote | |
You are right :) And you are not :) | |
ID: 1079 | Rating: 0 | rate: / Reply Quote | |
:) | |
ID: 1081 | Rating: 0 | rate: / Reply Quote | |
> You are right :) And you are not :) Data resolution in this case is an illusion that distorts your thinking on the subject. Also you can imagine a situation of a nuclear explosion somewhere near a detector - our hardware shows it immediately (well, if it is not destroyed / the network is unavailable / etc.). In this case the gamma radiation jumps from e.g. 0.16 uSv/h to 1000 uSv/h (or even more) and differences between 0.09-0.30 make no difference - you get the right signal to run ASAP, and from the map you get

The example you provide assumes a nuclear blast and then describes the effect you will see on the detector. A nearby nuclear blast yields a quick, high detector reading, that is true. The mistake in logic you are making is reversing the statement to "a quick, high detector reading indicates a nuclear blast" and assuming that is also true. It is not, and phys's explanation of a Poisson process and the statistical difficulty of measuring Poisson processes proves the reversed statement is absolutely false. The high reading could be the result of other events, not a nuclear blast, so running at the first high reading, when the pulse count and count period are statistically irrelevant, is naive.

You're trying to refute information from people who know far more about this than you do. You really need to discuss this with people who understand statistics, nuclear radiation and the problems pertaining to counting and measuring nuclear events. The R@H team obviously has considerable skill in electronics engineering and computer programming, but it's obvious none of you understands statistics and the measurement of nuclear decay sufficiently. As phys's analogy implies, you are focusing on a few pixels on the screen and missing the bigger picture, probably the entire movie.

> What is interesting, by analysing the data from our project and comparing it with other factors (like e.g. day/night or good/bad weather dependency) you can find more interesting info! This shows that more data equals more possibility to get "something" interesting from it.

You missed the point. Phys's graphs clearly show that the "more interesting info" you refer to is invisible until the R@H data is reduced to fewer samples. Phys's explanation and example graphs make the need to reduce the data perfectly clear. I don't understand how you can miss it.

> Why not use some data as "white noise"? This is another possibility, as some people use Geiger detectors as random number generators (well, I can imagine other, better methods, but this data is available right now).

Well, I didn't buy a detector and join the project just to provide white noise or a random number generator for some statistics students. The reason I joined is to provide the world with accurate readings of background radiation. That's not happening at R@H, there is every indication it never will, and that's why I won't donate money. It makes far more sense for me to use my money and time to provide meaningful data.

> I know that our detectors are not laboratory equipment, but they still collect correct and (I hope) useful data.

The hardware is very good (except for a problem with USB that might only be the result of a "bad" USB port on the host computer) and I think it works well. Unfortunately the way you guys collect and present the data sucks, and if you ask people who understand the math and physics behind what you are trying to do, you will realise it sucks too.

Don't trust phys and me and that other poster who gave the same advice over a month ago; ask someone from a local university, someone you choose. I think machismo will prevent you from investigating that possibility and taking the 5 minutes required to correct the problem. Don't worry, I'll correct the problem for you :-) ____________ | |
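To illustrate why a single high short-window reading is weak evidence, here is a rough simulation (the background rate and the "alarming" threshold are assumptions for illustration, not project data) of how often ordinary Poisson fluctuations alone make a 40 s sample look dramatic:

```python
import math
import random

random.seed(1)
RATE, WINDOW = 0.2, 40          # assumed background: 0.2 counts/s, 40 s samples
MEAN = RATE * WINDOW            # expected counts per window (= 8)
THRESHOLD = 2 * MEAN            # call a window "alarming" at twice the mean

def poisson(lam):
    """Draw one Poisson variate (Knuth's multiplication method)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

windows = 100_000
alarms = sum(poisson(MEAN) >= THRESHOLD for _ in range(windows))
print(f"{alarms} of {windows} windows reach twice the background by chance alone")
```

With hundreds of detectors each reporting such windows around the clock, "twice background" readings occur constantly by pure chance, which is exactly the point about reversing the implication.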
ID: 1087 | Rating: 0 | rate: / Reply Quote | |
@Krzysztof | |
ID: 1088 | Rating: 0 | rate: / Reply Quote | |
There is a manual and code for it here: http://radioactiveathome.org/boinc/forum_thread.php?id=85 ____________ Regards, Krzysztof 'krzyszp' Piszczek Android Radioactive@Home Map Android Radioactive@Home Map - donated My Workplace | |
ID: 1089 | Rating: 0 | rate: / Reply Quote | |
@Dagorath | |
ID: 1090 | Rating: 0 | rate: / Reply Quote | |
@ exsafs | |
ID: 1093 | Rating: 0 | rate: / Reply Quote | |
Links for the national radiation monitoring networks for background comparison | |
ID: 1097 | Rating: 0 | rate: / Reply Quote | |
Hi, | |
ID: 1346 | Rating: 0 | rate: / Reply Quote | |
I made a graph which (for now) shows samples averaged in 30-minute steps. I'm going to add an option to compare results from two sensors and a switch for optional, even longer averaging periods, so it will eventually make it easier to compare results with official stations. | |
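The averaging step itself is simple; a minimal sketch with pandas (the file and column names samples.csv, timestamp, usv_h are hypothetical stand-ins, not the project's actual export format):

```python
import pandas as pd

# Hypothetical export: one row per raw sample with a timestamp and dose rate.
df = pd.read_csv("samples.csv", parse_dates=["timestamp"])
series = df.set_index("timestamp").sort_index()["usv_h"]

# Average into 30-minute bins; widening the bin trades time resolution for stability.
half_hourly = series.resample("30min").mean()

# The same call with a different rule gives the longer periods mentioned above.
daily = series.resample("1D").mean()
```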
ID: 1355 | Rating: 0 | rate: / Reply Quote | |
Very nice! Good work! | |
ID: 1356 | Rating: 0 | rate: / Reply Quote | |
I did a bit of analysis on the data returned by my sensor; there's an official station around 15 km from where I live, so I have a source of data for comparison. | |
ID: 1360 | Rating: 0 | rate: / Reply Quote | |
From what I see so far on this blog, the detector somewhat overestimates low background radiation (by under 50%?), while it underestimates high levels by a factor of 2. | |
ID: 1459 | Rating: 0 | rate: / Reply Quote | |
I think the sensor does not overestimate the background levels; rather, the tubes are somewhat sensitive to weaker (beta) radiation, so the averages end up slightly higher. | |
ID: 1460 | Rating: 0 | rate: / Reply Quote | |
Really? | |
ID: 1461 | Rating: 0 | rate: / Reply Quote | |
The data above is from host 506, which uses a standard SBM-20 tube with no shield of any sort. | |
ID: 1462 | Rating: 0 | rate: / Reply Quote | |
I'm not talking about #506. | |
ID: 1463 | Rating: 0 | rate: / Reply Quote | |
> I'm not talking about #506.

This *is* data taken from host #506. The map shows incorrect values for some reason, or they're not 24-hour averages; I'll take a look at it soon. The map code and the MySQL queries it uses are rather complicated, so perhaps there is a bug. Here are more accurate values: http://radioactiveathome.org/boinc/test123a.php | |
ID: 1464 | Rating: 0 | rate: / Reply Quote | |
The next version of the app (coming soon) will change the default sample_time from 40 to 200 or 240s. | |
ID: 1707 | Rating: 0 | rate: / Reply Quote | |
Hi TJM, | |
ID: 1708 | Rating: 0 | rate: / Reply Quote | |
The current radioactive@home database/daemon design would have to be changed to work with sample times exceeding 5 minutes. I think 3-5 minutes is fine, at least for now. The output looks clear, and I think it will make it far easier to look for bugs (and there are some, which I can't find). | |
ID: 1709 | Rating: 0 | rate: / Reply Quote | |
The output from the "new" app looks like this:
There is a minor glitch because stderr does not accept "µ". | |
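That glitch is a classic encoding mismatch: stderr is often not UTF-8-capable, especially on Windows. A minimal sketch of the usual workaround (illustrative only; the app's real output code is not shown in this thread):

```python
import sys

dose = 0.16
unit = "µSv/h"
# Fall back to plain ASCII "uSv/h" if the stream cannot encode "µ".
encoding = getattr(sys.stderr, "encoding", None) or "ascii"
try:
    unit.encode(encoding)
except UnicodeEncodeError:
    unit = "uSv/h"
print(f"dose rate: {dose:.2f} {unit}", file=sys.stderr)
```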
ID: 1716 | Rating: 0 | rate: / Reply Quote | |
I'm updating the Windows app to revision 584 right now. I hope there will be no new bugs/issues. The app went through a 2-week testing phase and everything seems to work. | |
ID: 1720 | Rating: 0 | rate: / Reply Quote | |
I'm now getting a much 'smoother' graph. Note the last 500 samples as of the time of posting: | |
ID: 1723 | Rating: 0 | rate: / Reply Quote | |
Yep, the sample time is set to 4 minutes; it also looks quite nice in stderr. | |
ID: 1724 | Rating: 0 | rate: / Reply Quote | |
> It is clear that a longer sampling time would be better, for the tube to collect more counts and to get less variable (= more "stable") background measurements. It would also be a smaller load on the server, etc.

jhelebrant, I think you are totally wrong. In this post I offer a simple improvement that is the best of all worlds. TJM, there is no need to compromise. Please read this.

Another way to lower the error is to use a bigger effective cross-sectional area for the detector, something the project *should* be doing but does not, as far as I can tell. We have hundreds of Geiger counters; using more or all of them is the answer to lowering the error, not decreasing the resolution of the x (time) axis. You lose some geographical resolution, but that is OK for looking at, e.g., gamma-ray bursts.

My proposal: eliminate all averaging. Send a packet for each click! It will not raise total operational costs in any meaningful way once you include the users (me and you), because >90% of the cost is the electricity of having these computers on at night/in unused periods, which is about $7/month. The incremental cost of a single packet for me is zero until the total rate approaches 50% of 20 Mbps. At one packet per click/count and 10 counts per minute, the total bandwidth is still <1%, so it is effectively free, probably by more than a factor of 1000. Since there are fewer than 1000 users, the same logic even holds for the server/host.

In other words, the software should send a single packet for each click, so that no raw data is averaged at all! Plotting software can always average at whatever scale any user wishes, and grouping the data of hundreds of detectors allows extremely precise time resolution. But averaging is a one-way process: if the data is destroyed further (as you suggest!), it cannot be un-averaged. Only with a one-packet-per-click model can one do good and cool astrophysics with this project.

Jason ____________ Jason Taylor A new rant each day @ http://twitter.com/jasontaylor7. | |
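The bandwidth claim is easy to sanity-check; a rough sketch (the packet size is an assumption for illustration, the other figures come from the post above):

```python
users = 1000               # upper bound on detectors, from the post
counts_per_min = 10        # assumed clicks per detector per minute
packet_bytes = 100         # assumed packet size including protocol overhead

total_bps = users * (counts_per_min / 60) * packet_bytes * 8   # bits per second
link_bps = 20e6                                                # the 20 Mbps link cited above
print(f"{total_bps / 1e3:.0f} kbit/s total = {100 * total_bps / link_bps:.2f} % of the link")
```

Even with generous per-packet overhead the aggregate stays well under 1 % of a 20 Mbps link, consistent with the claim above.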
ID: 1856 | Rating: 0 | rate: / Reply Quote | |
That would require completely redesigning the client, server and hardware (sensor), which were not designed for any kind of real-time operation. | |
ID: 1858 | Rating: 0 | rate: / Reply Quote | |
> That would require to completely redesign the client, server and hardware (sensor)

Yes, and you could drastically simplify everything, since there would be no averaging, no work-unit nonsense, no BOINC messing with our machines, etc. Each packet would contain the UTC time and user number instead of the huge integers it presently sends.

> which were not designed for any kind of real-time operation.

What is the present CPU usage of your server?

> Also, there is just no way our server could handle data at single pulse level. Network is not a problem, but the database was already extremely large with 40s samples. To register each single pulse we would probably need a datacenter, not a single server.

I am glad you agree the network is not a reasonable excuse. We are already on the same page, because any required data smoothing can be done at the server instead of by the client, which may or may not be easy to upgrade should this debate alter the allegedly "ideal" 4-minute interval.

However, I STRONGLY disagree about this "data center" term. Firstly, there is no reason to speculate; the math is simple. What, exactly, is your present database size? Over what time span? All years? What are the fields? What is the primary key? There are several ways to design the database files so they are far smaller than anything a data center handles. It is not uncommon for database users to end up with large files full of redundant fields because they have no clue how to design for a small footprint. If you want help shrinking it, for starters you should not be storing those huge integers representing your integrated counts. Please post the details of your database structure so we can help.

The use of "data center" alone in this context indicates prejudice and/or ignorance. A data center stores exabytes, and an exabyte is a million 1 TB hard drives! In the worst case, keeping your present non-optimized row format at roughly 20 clicks/minute and the 4-minute sample time, per-click storage means 20 × 4 = 80x more rows than before. Therefore, if you would need an exabyte, your present server must already hold 1E6 / 80 = 12,500 1 TB drives. Can you verify you presently have that many? My guess is you cannot.

Secondly, the UTC time data from a few hundred detectors at 20 clicks/minute is only around 20 × 200 = 4,000 entries per minute, or nearly 6 million entries per day. That seems like a lot. But what if we round to the nearest second and use time as the primary key? Then the "integrated" database only stores the total counts per second, i.e. 86,400 entries per day. At one 4-byte integer per entry that is 4 × 86,400 ≈ 346 kB/day, and a 1 TB drive fills up in 10^12 / 346,000 / 365 ≈ 8,000 years. If you want to do good geographical analysis, or astrophysics using the Earth as an occultant, you would divide the Earth into 24 × 5 = 120 sections (really you don't need that many, but I'm trying to give your argument the benefit of the doubt), yielding roughly 65 years per TB. With a 4 TB drive and compression (most of the 120 zones are empty, since the detectors are concentrated in Europe) you are back above 100 years per hard drive. Sorry, no need for a data center.

Laziness is the best reason for you not to switch. A second-best might be that you want to hold this data hostage from us. (I'm a cynic.) The total data required to be stored is actually less than what you are probably using now, but I cannot verify this because I don't know what fields you are storing.

If it were me, I'd also add an effective collective detector area, which can be a second integer that changes as people unplug and plug in their detectors. There is plenty of HDD space. In summary, my proposal would seem to allow superior astrophysics, early detection of "North Korea" events, the ability to localize, etc., and definitely does NOT require any data centers. I feel so strongly about this that I hereby agree to let you use my machine and HDD to store the data.

Jason http://twitter.com/jasontaylor7 ____________ Jason Taylor A new rant each day @ http://twitter.com/jasontaylor7. | |
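The disk-space arithmetic above is easy to check; a minimal sketch (the per-second binning and the 24 × 5 zone split are the post's own assumptions, not project figures):

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_ENTRY = 4                  # one 4-byte counter: total clicks in that second

bytes_per_day = SECONDS_PER_DAY * BYTES_PER_ENTRY        # ~346 kB/day for one global table
years_per_tb = 1e12 / bytes_per_day / 365
print(f"one global table: {bytes_per_day / 1e3:.0f} kB/day, {years_per_tb:,.0f} years per TB")

zones = 24 * 5                       # split the Earth into 120 geographic zones
print(f"{zones} zones: {years_per_tb / zones:,.0f} years per TB before compression")
```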
ID: 1859 | Rating: 0 | rate: / Reply Quote | |
The 'realtime' database size is usually a few gigs for the last 30 days. It has to keep lots of data - each sample has its geolocation (if present), datestamp, sensor hardware + software id, and user and host info stored. | |
ID: 1860 | Rating: 0 | rate: / Reply Quote | |
> The 'realtime' database size is usually a few gigs for the last 30 days. It has to keep lots of data - each sample has its geolocation (if present), datestamp, sensor hardware + software id, and user and host info stored.

Thanks very much for supplying this info, TJM. Now that I have something to work with, I can explain how to fix your system a little. Taking each field in turn:

1. Geolocation should be stored in a separate table, as it is a function of the user number.
2. The datestamp should be in the file header AND be the primary key AND be computable as the header time plus the sampling interval times the line number. The presence of a carriage return thus determines the datestamp.
3. The sensor hardware + software id should be stored in a separate table, as it is a function of the user number.
4. The user number needs to be stored. It is a 4-byte integer.
5. Host info should be stored in a separate table, as it is a function of the user number.

Please correct any errors in my logic. Otherwise, as you can see, 4/5ths of your data is stored redundantly for no rational reason I can see.

> In the past we used 40s sample time and it was increased mostly due to severe performance issues. At 40s sample_time a simple script that draws dots on the map was running very slow, some other things started to fail due to timeouts and stressing the hardware too much.

With all due respect, the mistakes you made in the past are not pertinent; that's the point of improving. If you need to live in the past, please ignore this thread. The slowness is partly due to the inefficiency of your method of storing the data: if the largest data file is 1 TB, it will take even longer to process. That is just an I/O issue. But I'd guess your real problem is that you are letting map requests trigger computation. Run a cron job and cache the results, so that user-instigated map requests read the cron job's output instead of doing any calculation. That way, if 10 people ask for a map at the same time, your CPU does not get bogged down.

> Anyway, our sensors do not support reporting single pulses at all, what they return is raw counter value + hardware timestamp.

What is the fastest time resolution you can get? Can't you flash the firmware?

> Also the entire 'network' we use is not realtime, there is no way to synchronise the clients

Yes, exactly; that's the point of my post. The way you are doing it is not as good as the way it could be done by ditching BOINC, which does not even work for me anyway. But even if you don't ditch BOINC, my comments still argue for moving the time resolution to a finer scale, say 10 seconds or so. In my simple realtime scheme, each packet carrying a tick would (with a modest firmware change) also contain the click time, so past counter buckets could be corrected after the fact. Just use TCP and you should be good.

> so I doubt that even 1s precision would be of any use.

There is a lot you can do with the data. For the next nearby supernova, one-second resolution would be extremely useful. Also, some potential nuclear events happen very quickly. Trust me.

I'm sure there are some errors in my logic, but aside from a possible boycott request to other users, I think I'm done. Either you take some of my advice, or you are just making bogus excuses and this is a waste of our time because it was pre-decided that you would not improve anything. | |
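For concreteness, here is a minimal sketch of the normalization argued for above (SQLite via Python; every table and column name is invented for illustration and is not the project's actual schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Facts that are a function of the host/user: stored once, not per sample.
CREATE TABLE hosts (
    host_id    INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    latitude   REAL,
    longitude  REAL,
    hw_id      INTEGER,   -- sensor hardware id
    sw_id      INTEGER    -- software/firmware id
);

-- One narrow row per sample: only the values that actually change.
CREATE TABLE samples (
    datestamp  INTEGER NOT NULL,   -- unix time of the sample window
    host_id    INTEGER NOT NULL REFERENCES hosts(host_id),
    counts     INTEGER NOT NULL,   -- clicks registered in the window
    PRIMARY KEY (datestamp, host_id)
);
""")
print("schema created")
```

Whether this resembles the real schema cannot be seen from the thread; the point is only that geolocation, hardware/software id and host info need not be repeated on every sample row.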
ID: 1861 | Rating: 0 | rate: / Reply Quote | |
Hi Jason, | |
ID: 1868 | Rating: 0 | rate: / Reply Quote | |
Hi | |
ID: 1982 | Rating: 0 | rate: / Reply Quote | |