Author |
Message |
|
On this WU:
http://radioactiveathome.org/boinc/result.php?resultid=300672
I see this error:
Error reading data: Device communication error
Is there an easy way to detect whether the sensor is communicating correctly with the BOINC client? The display continues to show data without any error message. I found this problem by looking at the progress bar of the next WU (which was still at 0% after 22 minutes elapsed). I had to stop BOINC, disconnect and reconnect the sensor, and restart BOINC to get the progress bar moving again.
Is there a watchdog procedure (on either side, BOINC client or firmware) that tries to reset the sensor when the connection is lost?
BTW, of the 4 sensors running on my PCs this is the first problem, and this sensor is the only one connected through 2 USB cables (2 m + 1.2 m).
What can I do to avoid or debug this problem? |
|
|
TJM (Project administrator, Project developer, Project tester)
Joined: 16 Apr 11 Posts: 291 Credit: 1,378,693 RAC: 74
|
Do not place the sensor near other devices, as the ATtiny is very prone to crashing on the slightest EM pulse.
Also, check the USB voltage at the end of the cable.
There is a built-in watchdog, but it does not reset the port and the operating system doesn't notice the connection is stalled.
|
|
|
|
I may have found the problem. Two days ago I moved the sensor to an unused port on the motherboard (an old PIII), and now I remember why that port was not used: sometimes even the optical mouse gets stuck when connected to it, so it may be an unreliable port.
I have now put the sensor back on the PCI card with 4 USB ports and will wait a few days to check whether this solves it.
|
|
|
|
You ideally should be able to see within a minute or two max if the thing is communicating or not. If there is no percent progress showing, it's not communicating. The detector may still work as long as it has power but not be communicating with the computer, hence no percent increase in BOINC.
Another thing to keep in mind, especially with older computers: USB is only 5 VDC. As the computer gets older, the USB plugs get worn out. People pulling plugs in and out, wiggling them up and down, back and forth, even normal movement, can over time 'spread' the contacts on the USB plug so they are barely making contact. Now add a bit of corrosion or tarnish and you have a poor connection at best, where the slightest movement can break the connection. With such low voltages, the smallest amount of crap on the connections or looseness can cause problems. Even just a fraction of a second of the connection being broken, even if it is reconnected right after, may lock your detector out so that it no longer communicates.
I have this issue with one of my laptops and it's at the point that out of 4 USB ports, only one really works worth a darned anymore.
One other point of interest: if you are going to plug in a USB hub to get more ports, a passive one might work, but you will be much better off with an externally powered one vs. a port-powered one.
Aaron
____________
|
|
|
|
After many days the problem on the USB port of the PIII has not come back, so the problem seems to be the port and not the sensor.
However, today I saw another stop on my almost-new quad-core PC (different sensor: 1782), which had no problems before today ...
Quote: "You ideally should be able to see within a minute or two max if the thing is communicating or not. If there is no percent progress showing, it's not communicating. The detector may still work as long as it has power but not communicating with the computer, hence no percent increase in Boinc."
OK, the question is: if the percentage doesn't increase during a comm fault, the BOINC client should be able to detect the problem. Is it possible to add a "cold restart" command that tries to restore communication with the sensor? For example (if it's possible) by powering the 5 V USB line off and on.
This would be very useful for PCs that are left unattended for many days.
Another idea could be a firmware patch that performs "something" if communication with the PC starts and then goes down after a while, a sort of "watchdog". The "something" could be to force the backlight on and put a "*" in the last character of the display. This would be useful for spotting the communication problem manually without logging on to the PC. For example, this morning my sensor had already stopped but I did not notice, because I had not opened the PC desktop and only took a brief look at the sensor display.
Let me know if you think that these are reasonable patches.
Quote: "I have this issue with one of my laptops and it's at the point that out of 4 USB ports, only one really works worth a darned anymore.
one other point of interest, if you are going to plug in a USB hub to get more ports, while a passive one might work, you will be much better off with an externally powered one .vs. a port powered one.
Aaron"
I have been running 3 sensors for the last few weeks and have seen the "comm problem" 3 times on 2 PCs; the only one that has worked without this problem so far is a laptop with the sensor connected via a 4-port passive USB hub 8-) |
|
|
|
It would be good if there was some easy way to recover from problems.
I normally only notice when my credit stops going up; my sensor isn't on my main computer, so I don't check the percentages that much.
____________
|
|
|
|
False alarm about the sensor on my quad-core PC. The PC restarted during a storm because I forgot to connect it to the UPS power point the last time I moved my wattmeter.
EDIT
However, I have received feedback from BOINC.Italy users who have also seen the "silent disconnection" problem. |
|
|
|
How do you reset the detector... reboot BOINC? reboot the computer? unplug and replug the detector?
I can offer you a decent workaround for now and possibly a better workaround later. To use the workaround you would need to install the Python interpreter to allow Python scripts to run on your computer. I can provide you with a script that will monitor the progress of the R@H task; if the % complete does not increase, you will hear a warning sound telling you that you need to reset the detector. Optionally, it could raise a pop-up window to warn you, but for that functionality you would also need to install wxPython (the Python binding for wxWidgets), which is not a big deal. In the future I may find a way to have the script automatically reset the detector for you instead of bothering you with a warning.
Python and wxPython (wxWidgets) are both free, open source and fairly cross-platform compatible.
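A minimal sketch of the kind of progress monitor described above, not the script offered in this post. It assumes that `boinccmd --get_tasks` prints each task as a block separated by lines like `1) -----------`, with a `project URL:` line and a `fraction done:` line. The function and class names, the 60 second poll interval and the 10 minute stall limit are all hypothetical choices made for illustration.

```python
import re
import subprocess
import time

FRACTION_RE = re.compile(r"fraction done:\s*([0-9.]+)")


def parse_fraction_done(tasks_text, project_hint="radioactiveathome"):
    """Return the fraction done of the first task block mentioning project_hint, or None."""
    # Assumption: boinccmd separates tasks with lines such as "1) -----------".
    for block in re.split(r"^\s*\d+\) -+\s*$", tasks_text, flags=re.MULTILINE):
        if project_hint in block:
            match = FRACTION_RE.search(block)
            if match:
                return float(match.group(1))
    return None


class StallDetector:
    """Flags a stall when the reported fraction stays unchanged for too long."""

    def __init__(self, max_stall_seconds):
        self.max_stall_seconds = max_stall_seconds
        self.last_fraction = None
        self.last_change = None

    def update(self, fraction, now):
        """Record one reading; return True if progress has been flat too long."""
        if fraction is None:
            # No matching task was reported, so there is nothing to judge yet.
            self.last_fraction = None
            self.last_change = None
            return False
        if self.last_change is None or fraction != self.last_fraction:
            self.last_fraction = fraction
            self.last_change = now
            return False
        return (now - self.last_change) >= self.max_stall_seconds


def watch(poll_seconds=60, max_stall_seconds=600):
    """Poll boinccmd forever and ring the terminal bell when progress stalls."""
    detector = StallDetector(max_stall_seconds)
    while True:
        result = subprocess.run(["boinccmd", "--get_tasks"],
                                capture_output=True, text=True)
        fraction = parse_fraction_done(result.stdout)
        if detector.update(fraction, time.monotonic()):
            print("\aR@H progress has stalled: the detector may need a reset")
        time.sleep(poll_seconds)
```

Calling `watch()` starts the loop. The parsing and the stall logic are kept separate from the `boinccmd` call so they can be checked against saved output without a running client.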
____________
|
|
|
|
Quote: "How do you reset the detector... reboot BOINC? reboot the computer? unplug and replug the detector?"
I needed to unplug the detector (or reboot the PC). I think anything that powers the sensor off and on would work. Afterwards I also stopped and restarted the Radioactive client, but only to force it to load the prefs from the project and quiet the buzzer; I don't think it's necessary.
My sensors are now on reliable USB ports, but if you can send me the script I can forward it to the BOINC.Italy forum for users who might need it.
BTW, I think a .bat file based on boinccmd.exe (for the normal Windows shell or PowerShell) could also work (I have written a couple for the old MilkyWay@home behavior, and a QCN client "priority raiser" for slow PCs).
I don't know the USB specs, so I don't know whether it's possible, but the best way to automate a watchdog script would be to use an action that powers the 5 V on the port off and on to give the sensor a cold restart. |
|
|
|
Windows isn't allowed on my property so you can be sure I won't be doing a .bat file. I found a USB module for Python and it has a reset method so I'll explore that. So far I haven't found a way to power off/on but maybe a reset will be sufficient.
____________
|
|
|
|
In Windows you can use DevCon to add/remove the device.
http://social.technet.microsoft.com/wiki/contents/articles/182.aspx
You could write a script to read the stderr file in the Radioactive task's slot directory; when the sensor loses communication it's logged in there, e.g.:
Radac $Rev: 407 $ starting...
error finding 'radioactiveathome.org GRS': Device communication error
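A rough sketch of the log-scanning idea, assuming the application's stderr ends up in a file named `stderr.txt` inside each numbered slot directory under the BOINC data directory. The function name is hypothetical, and the marker string is taken from the log excerpt above.

```python
from pathlib import Path

# Marker copied from the log excerpt in this post.
ERROR_MARKER = "Device communication error"


def find_comm_errors(slots_dir, marker=ERROR_MARKER):
    """Return {stderr file path: number of lines containing marker}."""
    hits = {}
    # Assumption: each running task has its own slots/<n>/stderr.txt file.
    for stderr_file in sorted(Path(slots_dir).glob("*/stderr.txt")):
        text = stderr_file.read_text(errors="replace")
        count = sum(1 for line in text.splitlines() if marker in line)
        if count:
            hits[str(stderr_file)] = count
    return hits
```

A scheduled job could call `find_comm_errors()` on the local `slots` directory (its location depends on the installation) and raise an alert or trigger a device reset when the result is not empty.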
____________
|
|
|
|
Thanks to a weird power outage/brownout/surge/UFO/whatever during the night, I found I had problems with my sensor.
The machine it's attached to rebooted, but apparently couldn't find the sensor.
So I shut down the machine properly, powered up, same problem.
In the end I had to unplug the sensor and plug it back in again, then it started working.
FWIW my QCN sensor on the same machine carried on working as expected.
Seems that the WU needs to be running and see the sensor 'appear' to work. If the sensor is already there, it never finds it.
Not a problem for me as that particular machine is accessible. Would be a nuisance if it was on a more remote machine though :(
Cheers,
Al.
____________
|
|
|
TJM (Project administrator, Project developer, Project tester)
|
Actually, it's a hardware problem: if the USB voltage doesn't rise fast enough at power-on, the ATtiny crashes. The issue was partially fixed in 2.51; however, it still happens on some hosts, and the fact that the sensor is quite power hungry during cold start (charging the high-voltage capacitor) doesn't help. |
|
|
|
It did seem to have powered up though. Display was on, as was the backlight, and it was beeping randomly as if it was detecting. But the display was stuck on 'please wait' or something similar. Didn't look crashed to me ;)
Anyway, not a problem for me, and I'm sure a hardware watchdog in later versions would solve the problem :)
Cheers,
Al.
____________
|
|
|
|
I'll try to put together a PowerShell script with DevCon. I've written the code to find the log file; I just need to find out what to tell DevCon to do once it errors out.
I need to wait for my sensor to disconnect itself; that normally takes about 4-5 days, and I just manually moved it to a different port yesterday.
Then I can just create a scheduled task to check the log file every so often.
____________
|
|
|
TJM (Project administrator, Project developer, Project tester)
|
Quote: "It did seem to have powered up though. Display was on, as was the backlight, and it was beeping randomly as if it was detecting."
But still, it's in a half-crashed mode, unable to do any USB transfer, and if you check the device manager it won't even be listed there, or will be marked as not working properly/unrecognised device.
|
|
|
|
Thanks for the info, TJM. I can think of a few different "solutions" and workarounds but before implementing anything it would be best to get your opinion.
1. I have a Python module that has a method() for "sending a reset down the wire to the designated device". I haven't actually tried it yet and if you think a reset won't actually reset the device the way we need it to then I won't bother pursuing the idea. Sorry, I don't know how the module's author or USB protocol defines "reset" so what is your opinion? Will a reset work? (I can't reproduce the problem here, if I could I would just try a reset)
2. Is a power off/on the only way to restore proper operation?
3. If the detector were powered by a battery, one of those huge 6V batteries used in lanterns would probably last for many months, would that solve the power-up issue? Would it create new issues? For example, would it prevent a software reset?
____________
|
|
|
TJM (Project administrator, Project developer, Project tester)
|
Port reset fixes most of the problems. Power cycle is rarely needed.
The problem is not the occasional crash (because the ATtiny restarts itself in most cases), but the fact that both sides think (at hardware level) they still have a working connection, while in fact the connection was reset or has been stalled from the beginning. |
|
|
|
You have to remember that even if you turn your computer off, many times there is still power applied to some places, this includes your USB. The only reliable way to really 'reset' the thing is to unplug it from the computer so there is NO power on it. Even a cold reboot of the computer, if the device is in a comm lockup, is not going to necessarily reset it.
Power surges can do this, wiggling the cable can do this, and Winblows will do this from time to time just for the hell of it; I am seeing it on two of my machines.
For future designs of the detector, perhaps have some sort of reset button on the front of it: I press it in and it breaks the 5 V from the USB (same as if I unplugged it). This thing takes little power; even with an SCR on the power bus, you could easily program an 'off bias' / 'on bias' to hit it at, say, the user preference set in settings plus 5 minutes, with that timer resetting after every new task assigned.
Actually, having the control logic turn the SCR off might work well: once the controller dies, that logic command goes away and the SCR automatically returns to the on state, or something like that.
My thinking on this is, people set in their preferences how long they want their unit to run per task. You set a 'software timer' or hardware or something to that amount of time (as read from preferences file) plus say 10 minutes. This way, if something goes wrong, the thing hits 100 percent and does not upload or stops communicating for some reason, after the ten minute timeout, it performs a 'hard reset' on the device to hopefully jumpstart it. With the power off, the data section should quit sending and your computer should see it the same way as a cable unplug and reset the port as well. If the device is working properly, then once it uploads an old task, gets a new task, that timer is automatically reset. (no need for reset if the thing is working).
You don't have to use an SCR, I just threw that out there as an example, you can probably accomplish the same thing through other means.
If this sounds confusing I can try to reword it. Basically this is hoping for a 'self check' on the device: if it's talking and getting new tasks, fine; if not, it 'resets' itself, preferably w/o my babysitting it.
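The timeout logic described above can be sketched in a few lines. This is only an illustration of the idea (task length from the preferences plus a grace period, re-armed on every new task), not anything the firmware or client actually does; the class and method names are hypothetical and the 10 minute grace period is the figure suggested in this post.

```python
class TaskWatchdog:
    """Fires when no new task has started within task length plus a grace period."""

    def __init__(self, task_seconds, grace_seconds=600):
        # task_seconds would come from the user's project preferences.
        self.timeout = task_seconds + grace_seconds
        self.deadline = None

    def task_started(self, now):
        """Re-arm the timer; call this whenever a new task is assigned."""
        self.deadline = now + self.timeout

    def should_reset(self, now):
        """True once the deadline has passed without a new task starting."""
        return self.deadline is not None and now >= self.deadline
```

With a one hour task, `TaskWatchdog(3600)` would ask for a hard reset 70 minutes after the last task started, unless `task_started()` was called again in the meantime.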
Maybe if you don't want to play with new stuff on the device, program in a 'reset' function and the server can do this check and send a POR command to it via its own coding?
I don't know how it is programmed or how hard this might be, just throwing odd ideas out there in hopes someone with a clue might be able to use one of them.
Aaron
____________
|
|
|
TJM (Project administrator, Project developer, Project tester)
|
It is not necessary to cut power to reset the device.
Disconnecting the pull-up resistor does the same, and the 2.51 sensor partially does it (it pulls down the line on init, for example when the ATtiny resets itself). However, it lacks the code to detect that it is plugged in but the connection has stalled for some reason, and AFAIK that can't be implemented due to lack of flash space.
|
|
|
|
Feedback here suggests reset() should do the trick so that's what I'm going to try. I'm going to use Python because that way everybody gets to see the source and verify what it does without having to compile it for themselves. I am confident it will run on Windows and OS X as well as Linux.
The Python USB module simply wraps libusb (for Linux) and libusb-win32 (for Windows). I have the following in my notes if anybody else wants to take a crack at this using libusb-win32 from Windows PowerShell, C, or whatever:
- download PyUSB from http://sourceforge.net/projects/pyusb/ and see the README file in the package for install instructions
- libusb is available from repositories for many Linux distros; for Windows, download the installer from http://sourceforge.net/projects/libusb-win32/?source=recommended
- get libusb 1.0 or better, Python 2.7 or better and PyUSB 1.0 or better (I think 1.0 is the latest as of April 2012)
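For anyone who wants to experiment before a finished script appears, here is a minimal sketch of the reset call using the PyUSB 1.0 API (`usb.core.find` and `Device.reset`). It is not the script discussed in this thread. The function name is hypothetical, and the vendor and product IDs are not given anywhere in this thread, so they are left as parameters to be read from `lsusb` or the device manager.

```python
def reset_sensor(vendor_id, product_id, find=None):
    """Send a USB reset to the first matching device.

    Returns True if a device was found and reset, False if none was found.
    The `find` argument exists only so the logic can be exercised without
    real hardware; by default PyUSB's usb.core.find is used.
    """
    if find is None:
        import usb.core  # PyUSB 1.0; imported lazily so the module loads without it
        find = usb.core.find
    device = find(idVendor=vendor_id, idProduct=product_id)
    if device is None:
        return False
    device.reset()  # asks the host to reset the port the device is attached to
    return True
```

On Linux this usually needs root or a udev rule granting access to the device. Whether a reset is enough to recover a stalled sensor is exactly the open question in this thread, so treat the result as an experiment.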
____________
|
|
|
|
a reset script to try
____________
|
|
|