Posts by Jason Taylor

1) Message boards : Science : Data analysis - What can you really measure? (Message 1861)
Posted 3945 days ago by Jason Taylor
The 'realtime' database size is usually a few gigs for the last 30 days. It has to keep lots of data - each sample has its geolocation (if present), datestamp, sensor hardware + software id, and user and host info stored.

Thanks very much for supplying this info, TJM. Now that I have something to work with, I can explain how to fix your system a little. I will take each field in turn:

1. Geolocation should be stored in a separate table as it is a function of the user number.

2. datestamp should live in the file header AND serve as the primary key: each row's time is just the header time plus the sampling interval times the line number. The existence of a carriage return thus determines the datestamp implicitly.

3. sensor hardware + software id should be stored in a separate table as it is a function of the user number.

4. user - this one does need to be stored per sample. It is a 4-byte integer.

5. host info should be stored in a separate table as it is a function of the user number.

Please correct any errors in my logic. Otherwise, as you can see, four of these five fields are stored redundantly for no rational reason I can see.
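
To make the normalization concrete, here is a minimal sketch in Python with sqlite3. Every table and column name here is my invention, not your actual schema:

import sqlite3

con = sqlite3.connect("samples.db")
cur = con.cursor()

# Per-user metadata is written once, not repeated on every sample row.
cur.executescript("""
CREATE TABLE IF NOT EXISTS users (
    user_id   INTEGER PRIMARY KEY,  -- the 4-byte integer from point 4
    latitude  REAL,                 -- point 1: geolocation
    longitude REAL,
    hw_id     INTEGER,              -- point 3: sensor hardware id
    sw_id     INTEGER,              -- point 3: sensor software id
    host_info TEXT                  -- point 5
);
CREATE TABLE IF NOT EXISTS samples (
    user_id INTEGER REFERENCES users(user_id),
    t       INTEGER,  -- UTC seconds; in a flat file this can even be
                      -- implicit: header time + interval * line number
    counts  INTEGER,
    PRIMARY KEY (user_id, t)
);
""")
con.commit()

Each sample row shrinks to (user_id, t, counts); everything else is looked up by user_id.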

In the past
With all due respect, whatever mistakes you made in the past are not pertinent. That's the point of improving. If you need to live in the past, please ignore this thread.

we used a 40s sample time and it was increased mostly due to severe performance issues. At 40s sample_time a simple script that draws dots on the map was running very slowly, and some other things started to fail due to timeouts and stressing the hardware too much.


That's partly due to the inefficiency of your method of storing the data. If the largest data file is 1 TB, it will take even longer to process. This is just an I/O issue.

But I'd guess your real problem is that you are letting map requests instigate computation. Run a cron job and cache the results, so that user-instigated map requests read in the output from the cron job instead of doing any calculation. That way, if 10 people ask for a map at the same time, your CPU does not get bogged down.
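
Something like this, as a sketch; render_map() and the paths here are placeholders for your actual drawing code, not anything the project ships:

# render_map.py - run from cron, e.g. */5 * * * *
import os
import tempfile

CACHE_PATH = "/var/www/cache/map.png"

def render_map() -> bytes:
    # Stand-in for the expensive DB query + dot drawing.
    return b"png-bytes-go-here"

def main():
    data = render_map()
    # Write to a temp file, then rename atomically so a map request
    # never sees a half-written image.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(CACHE_PATH))
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, CACHE_PATH)

if __name__ == "__main__":
    main()

The web server then serves map.png as a static file, so rendering happens at most once per cron interval no matter how many people ask.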

Anyway, our sensors do not support reporting single pulses at all; what they return is a raw counter value + hardware timestamp.

What is the fastest time resolution you can get? Can't you flash the firmware?

Also the entire 'network' we use is not realtime, there is no way to synchronise the clients

Yes, exactly, that's the point of my post. The way you are doing it is not as good as the way it can be done by ditching boinc, which does not even work for me anyway. But even if you don't ditch boinc, my comments still make sense for moving the time resolution to a finer region, like 10 seconds or so. The part I should explain: if you altered the firmware to my simple realtime scheme, each packet reporting a tick would also carry the click time, so the server can retroactively place late-arriving clicks into the correct past counter buckets. Just use TCP and you should be good.
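
To make the "alter past counter buckets" part concrete, here is a tiny sketch of the server side; the details are my assumptions, not the project's code:

# Per-second buckets that tolerate late packets. Because each packet
# carries the click's own UTC time, arrival order does not matter:
# a delayed packet still lands in the right bucket.
from collections import defaultdict

buckets = defaultdict(int)  # utc_second -> total clicks, all detectors

def on_packet(click_utc: int, user_id: int) -> None:
    # user_id would drive a per-zone split; ignored in this sketch.
    buckets[click_utc] += 1  # works even if click_utc is in the past

on_packet(1373072740, 769)
on_packet(1373072710, 769)  # 30 s late, still counted correctly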

so I doubt that even 1s precision would be of any use.

There are a lot of things you can do with that data. For the next nearby supernova, one-second resolution would be extremely useful. Also, some potential nuclear events happen very quickly. Trust me.

I'm sure there are some errors in my logic, but aside from a possible boycott request to other users, I think I'm done. Either you take some of my advice, or you are just making bogus excuses and wasting our time, because it was pre-decided that you would not improve anything.
2) Message boards : Science : Data analysis - What can you really measure? (Message 1859)
Posted 3945 days ago by Jason Taylor
That would require completely redesigning the client, server and hardware (sensor)


Yes, and you can drastically simplify everything, since there is no averaging, no work unit nonsense, no boinc screwing with our machines, etc. Each packet contains just the UTC time and the user number instead of the huge integers it is presently sending.
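
To be concrete, here is a sketch of the kind of 8-byte packet I mean; this layout is my proposal, not the existing protocol:

import struct
import time

# One click = 4-byte UTC seconds + 4-byte user number, network byte order.

def pack_click(user_id: int, utc: int) -> bytes:
    return struct.pack("!II", utc, user_id)

def unpack_click(payload: bytes) -> tuple:
    return struct.unpack("!II", payload)  # (utc, user_id)

pkt = pack_click(769, int(time.time()))
print(len(pkt), unpack_click(pkt))  # 8 bytes per click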

which were not designed for any kind of real-time operation.


What is the present cpu usage % of your server?

Also, there is just no way our server could handle data at single pulse level. Network is not a problem, but the database was already extremely large with 40s samples. To register each single pulse we would probably need a datacenter, not a single server.

I am glad you agree the network is not a reasonable excuse. Already we are on the same page, because any required data smoothing can be done at the server instead of by the client, which may or may not upgrade easily should this debate alter the allegedly "ideal" 4-minute interval.

However, I STRONGLY disagree about this "datacenter" term. Firstly, there is no reason to speculate; the math is simple. What, exactly, is your present database size? Over what time span - all years? What are the fields? What is the primary key? There are several ways of designing the database file so it is far smaller than anything a data center handles. It is not uncommon for someone using database software to get large files full of redundant fields because they have no clue how to design for a low footprint. If you want help shrinking it, for starters you should not be storing these huge integers representing your integrated counts. Please post details of your database structure so we can help.

The use of "data center" alone in this context indicates your prejudice and/or ignorance. A data center stores exabytes, and an exabyte is 1 million 1TB hard drives! In the worst case, if you continued to use your present non-optimized system, you would need 20*4 = 80x more disk space than before. Therefore, if you will really need an exabyte, your present server must already have 1E6/80 = 12,500 1TB drives. Can you verify you presently have that many drives? My guess would be you cannot.

Secondly, the UTC time data from a few hundred detectors at 20 clicks/minute should be only around 20*200 = 4,000 entries per minute, which is nearly 6 million entries per day! Seems a lot. But what if we round to the nearest 1 second and use time as the primary key? Then the "integrated" database only stores the total counts per second. It now has only 86,400 entries per day. Each entry holds 1 integer, which can be only 4 bytes, so we have 4 bytes * 86,400 = about 346 kB/day. A 1TB drive will fill up in 10^12/346,000/365 = roughly 8,000 years. But if you want to do good geographical analysis, or astrophysics using the earth as an occultant, you want to divide the earth into 24*5 sections (really you don't need that many, but I'm trying to give your argument the benefit of the doubt), yielding on the order of 80 years per TB. With a 4 TB drive and compression (most of the ~100 zones are empty due to the concentration of detectors in Europe), you are back to >100 years per hard drive.

Sorry, no need for a data center. Laziness is the best reason for you not switching. A second best might be that you want to hold this data hostage from us. (I'm a cynic.) The total data required to be stored is actually less than what you are probably using now, but I cannot verify this because I don't know what fields you are storing. If it were me, I'd also store an effective collective detector area - a second integer that changes as people unplug and plug in their detectors. There is plenty of hdd space.
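
If anyone wants to check my arithmetic, here is the whole back-of-envelope in a few lines of Python; the detector count and click rate are the same assumptions as above:

# Storage estimate for per-second aggregated counts.
DETECTORS = 200          # "a few hundred"
CLICKS_PER_MIN = 20      # per detector
SECONDS_PER_DAY = 86_400
BYTES_PER_ENTRY = 4      # one 4-byte counter per second

raw_entries_per_day = DETECTORS * CLICKS_PER_MIN * 60 * 24  # ~5.8 million
agg_bytes_per_day = SECONDS_PER_DAY * BYTES_PER_ENTRY       # ~346 kB
years_per_tb = 1e12 / agg_bytes_per_day / 365               # ~7,900 years
years_per_tb_in_100_zones = years_per_tb / 100              # ~79 years

print(raw_entries_per_day, agg_bytes_per_day, round(years_per_tb))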

In summary, my proposal would seem to allow superior astrophysics, early detection of "north korea" events, the ability to localize, etc., and it definitely does NOT require any data centers. I feel so strongly about this that I hereby agree to let you use my machine and hdd to store the data.

Jason
http://twitter.com/jasontaylor7
3) Message boards : Science : Data analysis - What can you really measure? (Message 1856)
Posted 3945 days ago by Jason Taylor
It is clear that a longer sampling time would be better for the tube to collect more counts and to get less variable (= more "stable") background measurements. It would also be a smaller load for the server etc.


jhelebrant, I think you are totally wrong. In this post I supply a simple improvement that is the best of all worlds. TJM, there is no need to compromise. Please read this.

Another way to lower error is to use a bigger effective cross-sectional area in the detector, something this project *should* be doing but does not, as far as I can tell. We have hundreds of Geiger counters; using more or all of them is the answer to lowering error, not decreasing the resolution of the x (time) axis. You lose some geographical resolution, but that is OK for looking at, e.g., gamma-ray bursts. My proposal: eliminate all averaging. Send a packet each click!
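
For anyone who wants the statistics spelled out: counting error is Poisson, so the fractional error on N counts is 1/sqrt(N), and pooling D detectors multiplies N by D with no loss of time resolution. A quick sketch, with the rates as assumptions:

from math import sqrt

rate_per_min = 20  # assumed background rate of one detector
window_s = 10      # desired time resolution

for detectors in (1, 200):
    n = rate_per_min / 60 * window_s * detectors  # expected counts
    print(f"{detectors:4d} detectors: {n:7.1f} counts, "
          f"fractional error {1 / sqrt(n):.1%}")

One detector in 10-second bins is ~55% noise; two hundred of them pooled together get you to ~4%.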

It will not lower total operational costs once you include the users (me and you), because >90% of the cost is the electricity of keeping these computers on during the night/unused periods, which is about $7/month. The incremental cost of a single packet is zero for me until the total rate approaches 50% of my 20 Mbps link. At one packet per click/count and 10 counts per minute, the total bandwidth is still <1%, so it is effectively free, probably by more than a factor of 1000. Since there are fewer than 1000 users, the same logic even holds true for the server/host.
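
Here is the bandwidth arithmetic, with the packet size rounded up very generously to cover TCP/IP overhead:

PACKET_BYTES = 100     # 8-byte payload, rounded way up for headers
CLICKS_PER_MIN = 10    # one detector

bps_per_client = PACKET_BYTES * 8 * CLICKS_PER_MIN / 60  # ~133 bps
link_bps = 20e6                                          # 20 Mbps
print(f"{bps_per_client:.0f} bps = "
      f"{bps_per_client / link_bps:.6%} of the link")

Even a thousand such clients add up to only about 133 kbps at the server, so the network really is a non-issue.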

In other words, the software should be sending a single packet each click, such that no raw data is averaged at all! Plotting software can always average at whatever scale any user wishes, and using the group data of hundreds of detectors allows for extremely precise time resolution. But averaging is a one-way process: if the data is destroyed further (as you suggest!), it cannot be un-averaged.

Only by going to the one-packet-per-click model can one do good and cool astrophysics with this project.

Jason
4) Message boards : Number crunching : What is the oldest version of boinc I can run? (Message 1854)
Posted 3946 days ago by Jason Taylor
Thanks. The middle column is time. What are the other columns? More specifically, which column is the radioactivity? Do you have to do a subtraction to get the derivative?

I got the email, but no, I disconnected it because I got a complaint about the noise and I was unsure that usb was actually working. I guess I'm a skeptic. I now think that if I just subtracted the #s it might make more sense.
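
For anyone else staring at data.bin, here is the subtraction I mean as a little Python sketch; my reading of the columns (column 2 = cumulative counts, column 3 = timestamp) is only a guess:

import csv

# Difference consecutive cumulative counts to get counts per sample.
prev = None
with open("data.bin", newline="") as f:
    for row in csv.reader(f):
        count = int(row[1])  # cumulative counts? (assumption)
        stamp = row[2]       # timestamp column
        if prev is not None:
            print(stamp, count - prev)  # counts in this ~4-minute window
        prev = count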
5) Message boards : Science : Some possible ways to improve this project (Message 1850)
Posted 3946 days ago by Jason Taylor
I'm very happy to own a unit that arrived a month ago. It isn't working perfectly, but aside from that, here are a few suggestions from me on how we can improve this project:

1. The forum needs a "Help" section (setting up and getting it to work). The existing 4 boards are not for that.
2. The data needs to be available to all. What is the point in giving the data to ?? (who owns this site?) if I cannot get the data from everyone as well? I want something back for donating. The map only shows a binary rendering of intensity that is essentially worthless for almost all purposes except seeing whether there is a nuclear war. What if I want to see a geographical location where plants are growing under high natural radiation? The graph map is worthless for that.
3. The layout of the website would be easier if the FAQ were renamed to "How to get going", because there are FAQs in the forums, such as #2.
4. I have no idea how anyone is plotting their past data. I can't get it. But it should be in UTC, not "last 1000" ??? time units.

Feedback welcomed. Change is always resisted, so I don't anticipate these suggestions going anywhere, but I think it would be unethical not to put them up for the powers that be to have exposure to them.

Jason
6) Message boards : Number crunching : What is the oldest version of boinc I can run? (Message 1849)
Posted 3946 days ago by Jason Taylor
Ok, I'm running 6.10.60. The data.bin file is as follows:

84040,20,2013-7-6 1:25:40,769,f,81327870
323510,95,2013-7-6 1:29:40,769,n,81327870
562980,164,2013-7-6 1:33:40,769,n,81327870
802450,223,2013-7-6 1:37:40,769,n,81327870
1041920,296,2013-7-6 1:41:40,769,n,81327870
1281390,357,2013-7-6 1:45:40,769,n,81327870
1520860,427,2013-7-6 1:49:40,769,n,81327870
1760330,494,2013-7-6 1:53:40,769,n,81327870
1999800,559,2013-7-6 1:57:40,769,n,81327870
2239270,638,2013-7-6 2:1:40,769,n,81327870
2478740,710,2013-7-6 2:5:40,769,n,81327870
2718210,769,2013-7-6 2:9:40,769,n,81327870
2957680,834,2013-7-6 2:13:40,769,n,81327870
3197150,904,2013-7-6 2:17:40,769,n,81327870
3436620,970,2013-7-6 2:21:40,769,n,81327870
3676090,1049,2013-7-6 2:25:40,769,n,81327870
3915560,1113,2013-7-6 2:29:40,769,n,81327870
4155030,1184,2013-7-6 2:33:40,769,n,81327870

Is that good? Because I do not understand how to get a history plot out of boinc. What is each column? Seems like bad data to me.

Also, what does the black button do?
7) Message boards : Science : Detector locations to avoid, to prevent a biased reading? (Message 1848)
Posted 3946 days ago by Jason Taylor
Fluorescent lights and smoke detectors (including one on the ceiling of the floor below, if you place the unit on your floor) may contain radioactive material and can increase count rates. A basement will get a lower count than a top floor.
8) Message boards : Number crunching : What is the oldest version of boinc I can run? (Message 1846)
Posted 3947 days ago by Jason Taylor
Hi all.

Is this the correct place to ask a question? I tried the skype page but nobody was online.

Anyway, if it is the correct place, here is my question: I want to know the oldest version of boinc I can run under Windows XP that is able to do this strange usb thing. I am using v6.10.60, but no stats are showing up and I don't see anywhere to manually set the usb port.

Jason Taylor

