Bug #327
Freeswitch hangs upon startup
| Status: | Closed | Start date: | 03/16/2010 | |
|---|---|---|---|---|
| Priority: | Normal | Due date: | ||
| Assignee: | % Done: | 100% |
||
| Category: | Aports | |||
| Target version: | Alpine 2.0.0 |
Description
Freeswitch is started normally. For example, as follows:
/usr/bin/freeswitch -c
But the process appears to hang as follows:
2010-03-16 17:20:11.12759 [INFO] switch_event.c:565 Activate Eventing Engine. 2010-03-16 17:20:11.17938 [DEBUG] switch_event.c:553 Create event dispatch thread 0 2010-03-16 17:20:12.319284 [INFO] switch_nat.c:392 Scanning for NAT 2010-03-16 17:20:12.320055 [DEBUG] switch_nat.c:152 Checking for PMP 1/5 2010-03-16 17:20:12.321667 [ERR] switch_nat.c:183 Error checking for PMP [general error] 2010-03-16 17:20:12.321865 [DEBUG] switch_nat.c:397 Checking for UPnP 2010-03-16 17:20:25.576162 [DEBUG] switch_nat.c:106 No InternetGatewayDevice, using first entry as default. 2010-03-16 17:20:28.987593 [INFO] switch_nat.c:411 No PMP or UPnP NAT detected! 2010-03-16 17:20:28.990330 [INFO] switch_core_sqldb.c:538 Opening DB 2010-03-16 17:20:29.10009 [NOTICE] switch_scheduler.c:166 Starting task thread 2010-03-16 17:20:29.155268 [DEBUG] switch_scheduler.c:214 Added task 1 heartbeat (core) to run at 1268760029
10 freeswitch processes appear to have been started in the process list but there is no indication of real functionality (registrations, fs_cli connection, calls, etc)
It has been observed that sometimes the process starts completely and functions normally. If there is any need to restart the process then it risks to hang as above.
I have not been able to observe a pattern to the hang. It appears that sometimes the process can be started successfully after some unknown period of time or by going to a different console.
No message have been found in logs re: this problem.
Related issues
Associated revisions
main/freeswitch: use /dev/urandom instead of /dev/random
fixes #327
History
Updated by Natanael Copa almost 2 years ago
- Target version changed from Alpine 1.10.0 to Alpine 2.0.0
I have a feeling this is due to problems in linuxthreads. I think the real solution here is uclibc with NPTL.
Updated by Travis Colbert almost 2 years ago
Just wanted to add that since the latest re-packaging of FreeSwitch (1.0.4r2) the hanging still occurs.
The FreeSwitch IRC guys have indicated in the past something low-level with threading in uClibc platforms but, sadly, I did not understand the issue.
Updated by Natanael Copa over 1 year ago
Does this still happen with freeswitch 1.0.6?
Updated by Travis Colbert over 1 year ago
Yes. I experienced a similar hang using freeswitch -c (console):
Error: stacksize 4194303 is too large: run ulimit -s 240 or run freeswitch -waste. auto-adjusting stack size for optimal performance... 2010-05-19 18:48:24.097638 [INFO] switch_event.c:582 Activate Eventing Engine. 2010-05-19 18:48:24.102874 [DEBUG] switch_event.c:570 Create event dispatch thread 0 2010-05-19 18:48:25.264679 [INFO] switch_nat.c:409 Scanning for NAT 2010-05-19 18:48:25.265189 [DEBUG] switch_nat.c:166 Checking for PMP 1/5 2010-05-19 18:48:25.269688 [ERR] switch_nat.c:197 Error checking for PMP [general error] 2010-05-19 18:48:25.269813 [DEBUG] switch_nat.c:414 Checking for UPnP 2010-05-19 18:48:37.604754 [DEBUG] switch_nat.c:114 No InternetGatewayDevice, using first entry as default (http://10.1.16.66:1900/cameradesc.xml). 2010-05-19 18:48:41.030262 [INFO] switch_nat.c:429 No PMP or UPnP NAT devices detected! 2010-05-19 18:48:41.034921 [ERR] switch_odbc.c:313 STATE: IM002 CODE 0 ERROR: [unixODBC][Driver Manager]Data source name not found, and no default driver specified 2010-05-19 18:48:41.035128 [CRIT] switch_core_sqldb.c:333 Failure! 2010-05-19 18:48:41.035199 [ERR] switch_core_sqldb.c:1259 Error Opening DB! 2010-05-19 18:48:41.035258 [WARNING] switch_core_sqldb.c:1265 Falling back to core_db. 2010-05-19 18:48:41.036072 [INFO] switch_core_sqldb.c:1275 Opening DB 2010-05-19 18:48:41.097713 [NOTICE] switch_scheduler.c:166 Starting task thread 2010-05-19 18:48:41.239323 [DEBUG] switch_scheduler.c:214 Added task 1 heartbeat (core) to run at 1274294921
This is being run on Alpine 1.10.3 and Freeswitch 1.0.6 from Backports.
I had previously started FS successfully two times before the above event. Each time I brought FS down with its (...) command. I observed the DB errors above on the previous startups. I don't believe them to be related but I am addressing those issues.
Updated by Travis Colbert over 1 year ago
addendum: I ran:
ulimit -s 240
And then successfully started FS immediately afterward.
Updated by Nathan Angelacos over 1 year ago
Travis,
I've seen this as well, but it wasn't actually a "hang" - it just took a very long time to start up.
Next time it "hangs", could you leave it there for 2 minutes and see if it eventually starts? Then report back.. thanks!
Updated by Natanael Copa over 1 year ago
Travis Colbert wrote:
Just wanted to add that since the latest re-packaging of FreeSwitch (1.0.4r2) the hanging still occurs.
The FreeSwitch IRC guys have indicated in the past something low-level with threading in uClibc platforms but, sadly, I did not understand the issue.
I'm very interested if this happens with edge (not 1.10.x) since edge have a new threading implementation and as far I know, current edge is fixing every threading related problem we ever had.
Updated by Natanael Copa over 1 year ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
With NPTL uclibc this seems to be fixed. Closing this.
Updated by Travis Colbert over 1 year ago
Thanks for the NPTL work.
Here is the latest that we have observed:
On VIA hardware we have been able to reproduce the above problem. All the above descriptions appear to happen usually on the 3rd or 4th stop/start cycle.
The problem can be reproduced either using the init.d script or by invoking the freeswitch manually.
On Intel hardware we have observed a different problem manifestation. This problem seems to appear on shutdown of FS.
Here is an excerpt of the shutdown messages in the FS console (fs_cli):
2010-06-04 15:39:22.778392 [NOTICE] switch_loadable_module.c:490 Deleting Codec 'PCMA' (G.711 alaw) 8000hz 120ms 2010-06-04 15:39:22.778848 [CONSOLE] switch_loadable_module.c:1277 Stopping: CORE_PCM_MODULE 2010-06-04 15:39:22.778955 [NOTICE] switch_loadable_module.c:545 Deleting Application 'socket' 2010-06-04 15:39:22.779086 [DEBUG] switch_loadable_module.c:547 Write lock interface 'socket' to wait for existing references. 2010-06-04 15:39:22.780489 [NOTICE] switch_loadable_module.c:572 Deleting API Function 'event_sink' 2010-06-04 15:39:22.780605 [DEBUG] switch_loadable_module.c:574 Write lock interface 'event_sink' to wait for existing references.
How we have reproduced the problem:
1. Logged in.
2. Used 'ps' and verified freeswitch was NOT running.
3. Ran '/etc/init.d/freeswitch start'.
4. Used 'ps' and saw the freeswitch process running. Waited a few seconds.
5. Ran 'fs_cli'. Console successfully connected.
6. Exited the console (fs_cli).
7. Ran '/etc/init.d/freeswitch stop'
8. Waited less than a minute and used 'ps' to verify freeswitch was no longer running.
9. Repeat at step 3 (i.e. I used '/etc/init.d/freeswitch start' to start the program).
If you repeat this procedure perhaps 3 times this problem will either appear upon startup or upon shutting down the process
Here is an IRC conversation showing the rough times and steps of the process:
[13:45] <thcolber> ok so i have gotten it to lock up again... now I will try the "freeswitch -stop" twice [13:45] <thcolber> once: FS is still running [13:46] <thcolber> twice: still running [13:46] <thcolber> 3, 4, 5: still running according to ps -e [13:46] <thcolber> now... this problem looks like a startup halt [13:47] <thcolber> because netstat -npl never reported his ports opening up (so fs_cli doesn't work) [13:48] <thcolber> rebooting... [13:48] <thcolber> oh BTW this was fresh after a reboot a few minutes ago [13:49] <thcolber> FWIW this is Intel/Alpine/boot off CD/USB apkovl [13:49] <thcolber> no harddrive [13:50] <thcolber> machien restarted [13:52] <thcolber> FS successfully started... connecting w/ fs_cli... good ... exit cli... /etc/init.d/freeswitch stop [13:52] <thcolber> clean stop [13:52] <thcolber> starting again... /etc/init.d/freeswitch start [13:53] <thcolber> good... [13:54] <thcolber> stopped clean [13:55] <thcolber> no go... startup halt (because ports did not get bound) [13:55] <thcolber> fs_cli reports "Error connecting socket" [13:55] <thcolber> ps -e shows running FS process [13:55] <thcolber> tried /etc/init.d/freeswitch stop [13:56] <thcolber> nothing FS still running [13:56] <thcolber> freeswitch -stop doesn't work either
Perhaps some added information
I have been able to stop and start FS on this same machine at least 10 times in a row with success so I have no impression of a completely consistent pattern.
Updated by Travis Colbert over 1 year ago
We have confirmed that the above DEBUG messages on shutdown are normal.
They appear on our properly-functioning FS/Ubuntu/Intel server.
Updated by Timo Teräs over 1 year ago
I did a analysis about FS on both Alpine 1.10.x and Alpine Edge.
Alpine 1.10.x uses LinuxThreads uclibc for threading which has fundamental problems. It seems that FS often hangs in uclibc thread creation parts. Exact reason is unknown, but likely related to threading and signals not working properly with this implementation. This is fixed in Alpine Edge using NPTL threading of uclibc.
Alpine Edge version was not actually hanging. It was just extremely slow to start because FS tried to read random bytes via /dev/random which blocks until sufficient entropy is gathered. Since the test box has no keyboard, hard disk or other means of gaining entropy it might take up to 3-5 minutes for FS to start. This problem roots APR library which used by FS (it has it's own version of 1.2.x). APR 1.3.x includes a fix by using /dev/urandom (it never blocks) instead of /dev/random for exactly issues like this. Trivial fix is to recompile FS version of APR and make it use /dev/urandom.
Updated by Natanael Copa over 1 year ago
Applied in changeset b94537f180f8193e485a88f178857f3c8af6955c.
Updated by Nathan Angelacos over 1 year ago
- Status changed from Resolved to Closed