Driver gets stuck when it receives bad frames #18

Open
brunoseivam opened this issue Apr 18, 2016 · 8 comments

@brunoseivam
Member

When in Single or Multiple mode, the driver sets framesRemaining to 1 or numImages, respectively. However, when it receives a bad frame, it neither decrements framesRemaining nor issues a new trigger, which leaves the driver stuck in the Acquire state.

Although ideally the driver shouldn't be receiving bad frames, I think it shouldn't get stuck when it does. However, I don't know how to properly address that. Should it fail and return an error? Should it reissue the acquisition for the frames that came in bad? What about hardware triggers?

@MarkRivers
Member

Is the frameCallback function actually being called for the bad frames? If so, we could change the behavior to at least stop acquisition once the correct number of frames, good or bad, has been received.
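
The idea of counting every frame, good or bad, toward the requested total can be sketched as follows. This is only an illustration: `FrameCounter`, `onFrame`, and `badFrames` are hypothetical names, not the driver's actual code, and only `framesRemaining` comes from the discussion. The sketch assumes the callback is invoked for every frame, so it would not help in cases where the callback is never called.

```cpp
#include <cassert>

// Hypothetical bookkeeping for Single/Multiple mode; not the driver's real code.
class FrameCounter {
public:
    explicit FrameCounter(int framesRequested)
        : framesRemaining_(framesRequested), badFrames_(0) {
        assert(framesRequested > 0);
    }

    // Call once per frame callback. Returns true when acquisition should stop.
    bool onFrame(bool frameIsGood) {
        if (framesRemaining_ <= 0) return true;  // already finished
        if (!frameIsGood) ++badFrames_;          // remember it for error reporting
        --framesRemaining_;                      // count good and bad frames alike
        return framesRemaining_ == 0;
    }

    int framesRemaining() const { return framesRemaining_; }
    int badFrames() const { return badFrames_; }

private:
    int framesRemaining_;
    int badFrames_;
};
```

With this kind of counter, the caller could stop acquisition when `onFrame` returns true and then decide, based on `badFrames()`, whether to report an error.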

@brunoseivam
Member Author

Yes, most of the time, although I found some instances where it is not called at all. The rate of bad frames correlates with the CPU usage of another IOC, so I guess the prosilica thread might be getting starved of CPU time and can't keep up with the data rate?

The machine has 12 cores, one IOC is consuming ~350% and the prosilica IOC is consuming ~100%, so I wouldn't expect it to be an issue.

I will try pinning the IOC to a set of CPUs and see if that helps.

I tried setting GvspResendPercent to 100%, but it didn't seem to help much.

@mp49

mp49 commented Apr 21, 2016

If one thread in the prosilica IOC is using all or most of that 100%, that might be the problem. Idle cores won't help if a single thread is maxing out one core.

@brunoseivam
Member Author

cam07 is the one driving the CPU usage high. cam03 is the one giving me grief, even though none of its threads is getting to 100%. CPU pinning didn't help. Does the PvAPI library use only one thread to handle all requests from different IOCs?

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                           
 4096 cam07     20   0 2553m 410m 5932 R  96.6  1.3   2926:56 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  
 4058 cam07     20   0 2553m 410m 5932 R  94.0  1.3   2970:24 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  
 4030 cam07     20   0 2553m 410m 5932 R  70.6  1.3   7769:35 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  
26614 cam03     20   0  529m  83m 5172 R  55.1  0.3  10:51.27 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam03/st.cmd  
 4463 cam07     20   0 2553m 410m 5932 R  28.9  1.3   2390:17 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  
26573 cam03     20   0  529m  83m 5172 S  20.5  0.3   3:37.64 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam03/st.cmd  
26606 cam03     20   0  529m  83m 5172 R  18.8  0.3   3:40.06 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam03/st.cmd  
22796 cam07     20   0 2553m 410m 5932 S  15.3  1.3  10:19.45 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  
26594 cam03     20   0  529m  83m 5172 S  12.2  0.3   2:35.21 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam03/st.cmd  
 4136 cam07     20   0 2553m 410m 5932 S  11.0  1.3   1595:23 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  
 4032 cam07     20   0 2553m 410m 5932 R  10.0  1.3   1279:10 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  
 4028 cam07     20   0 2553m 410m 5932 S   7.2  1.3 881:15.71 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  
26628 cam03     20   0  529m  83m 5172 S   7.2  0.3   1:32.90 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam03/st.cmd  
19505 cam07     20   0 2553m 410m 5932 S   4.8  1.3  18:21.02 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  
 4151 cam07     20   0 2553m 410m 5932 S   4.3  1.3 588:21.18 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  
 4155 cam07     20   0 2553m 410m 5932 S   4.3  1.3 649:44.59 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  
26659 cam03     20   0  529m  83m 5172 S   3.8  0.3   3:23.81 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam03/st.cmd  
 4034 cam07     20   0 2553m 410m 5932 S   1.7  1.3 212:36.21 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  
 4159 cam07     20   0 2553m 410m 5932 S   0.7  1.3  40:17.57 ../prosilica/bin/linux-x86_64/prosilica /epics/iocs/cam07/st.cmd  

@MarkRivers
Member

Have you applied the system changes discussed in this tech-talk thread?

http://www.aps.anl.gov/epics/tech-talk/2013/msg00787.php

It involves increasing net.core.rmem_default and net.core.rmem_max.
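
On Linux, the current values of these two settings are exposed as files under `/proc/sys/net/core/`. A minimal sketch for checking them from C++ is shown below; `readKernelSetting` and `settingAtLeast` are hypothetical helper names, and the required buffer size is deliberately left as a parameter because the recommended value comes from the linked thread rather than from this page.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Reads a single integer from a file such as /proc/sys/net/core/rmem_max.
// Returns false if the file cannot be opened or does not start with a number.
bool readKernelSetting(const std::string& path, long& value) {
    std::ifstream in(path.c_str());
    if (!in) return false;
    return static_cast<bool>(in >> value);
}

// True if the setting at `path` is readable and at least `requiredBytes`.
bool settingAtLeast(const std::string& path, long requiredBytes) {
    assert(requiredBytes > 0);
    long current = 0;
    return readKernelSetting(path, current) && current >= requiredBytes;
}
```

An IOC startup check could call `settingAtLeast("/proc/sys/net/core/rmem_max", n)` and the same for `rmem_default`, then warn if either returns false.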

@brunoseivam
Member Author

That did the trick! No bad frames anymore. Thanks!

Although if the driver does perchance receive one when in Single or Multiple mode it will still get stuck :)

@MarkRivers
Member

Note that the link to the Point Grey Knowledge Base article in my old tech-talk thread no longer works. However, this link does work:

http://www.ptgrey.com/KB/10016

@MarkRivers
Member

> That did the trick! No bad frames anymore. Thanks!
> Although if the driver does perchance receive one when in Single or Multiple mode it will still get stuck :)

We just re-discovered this issue with cameras at NSLS-II. It is not clear how to fix it, as Bruno said above. Questions:

  • How long to wait before timing out? In Single mode this can be AcquirePeriod*margin + minimum. Using AcquirePeriod rather than AcquireTime allows the user to avoid timeouts in the case of external triggers by setting AcquirePeriod to be larger than the time between Acquire=1 and the actual trigger.

  • What to do on timeout? Try again? Return error? Return dummy frame and error?

  • How to handle Multiple mode where more than 1 frame could be dropped?
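
The timeout formula from the first question can be sketched as below. The function names and the `margin` and `minimum` parameters are hypothetical tuning knobs, and the choice to restart the window after each frame in Multiple mode is an assumption made for illustration, not something decided in this thread.

```cpp
#include <cassert>

// Timeout in seconds: AcquirePeriod * margin + minimum.
// `margin` and `minimum` are hypothetical tuning parameters.
double acquireTimeout(double acquirePeriod, double margin, double minimum) {
    assert(acquirePeriod >= 0.0 && margin >= 1.0 && minimum >= 0.0);
    return acquirePeriod * margin + minimum;
}

// True if no frame has arrived within the timeout window. Restarting the
// window after every received frame would give each frame in Multiple mode
// its own deadline; that policy is an assumption, not a decision from the thread.
bool frameOverdue(double secondsSinceLastFrame, double timeout) {
    return secondsSinceLastFrame > timeout;
}
```

For example, with AcquirePeriod = 2.0 s, a margin of 1.5, and a minimum of 0.5 s, the window would be 3.5 s. What the driver should do once `frameOverdue` returns true (retry, return an error, or return a dummy frame) remains the open question above.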
