PDA

View Full Version : bug in mod_caucho


remkodev
04-13-2011, 01:22 PM
I'm currently investigating a long-standing issue we have with our Resin Pro license. We have a setup where we run several Resin instances behind several Apache servers inside a DMZ. These apps can be quite busy, 70-600 requests per second for each Apache server. We run the threaded MPM (ap2_worker) and mod_caucho plus a lot of rewrite and proxy rules; no PHP, CGI, etc.

Now, every once in a while we get a stuck process that consumes 1 cpu core or more. We have tried all sort of updates on apache, resin, the plugin, compiler settings etc.

This is a normal scoreboard (~0.66% CPU load):


Server uptime: 36 minutes 38 seconds
Total accesses: 279214 - Total Traffic: 2.5 GB
CPU Usage: u12.04 s2.52 cu0 cs0 - .662% CPU load
127 requests/sec - 1.2 MB/second - 9.5 kB/request
31 requests currently being processed, 69 idle workers

.............................R............R.....................
R_RRRC_R____CR_RR__R__CCC_CC__C_______W_____W_CRCW..............
.R................R....RR..RR................R..R...............
........RR.......R..........R.R..R..............................
............R........R..R......R.R......R.......................
......................R..R....R...W.............................
__________________________C_C____W__R_____W_CCC___..............
............R...................................................
................................................................
.......................R........................................
................................................................
................................................................
................................................................
................................................................
................................................................
................................................................



This is not (the load is not high right now, but it can easily reach 100-800%). Also note the row of WWWW statuses, which never goes away:

Server uptime: 1 hour 12 minutes 21 seconds
Total accesses: 540899 - Total Traffic: 5.2 GB
CPU Usage: u4.66 s3.13 cu0 cs0 - .179% CPU load
125 requests/sec - 1.2 MB/second - 10.0 kB/request
61 requests currently being processed, 39 idle workers

...................R....................R.......................
..........................................R.....................
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW..............
......R...........R.............................................
R.WW......C.......C..W.......R.......C.R.....R..R...............
_________________________________WCCCCRCC_C_C_R___..............



This is where we are right now:


Hanging processes:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
18136 8023 54M 10M cpu6 2 0 3:16:35 31% httpd/52
22783 8023 54M 10M cpu4 10 0 2:14:58 29% httpd/51
24225 8023 53M 9668K cpu2 2 0 1:57:43 29% httpd/52

A stack dump of one of the hanging threads:
fffffd7ffef3d668 apr_pool_cleanup_kill () + 18
fffffd7ffdbe454b cse_free_lock () + 19
fffffd7ffdbec6c1 cse_free_pool () + 8e
fffffd7ffdbea717 read_config () + 884
fffffd7ffdbeb1c5 read_all_config_impl () + 1ac
fffffd7ffdbeb838 cse_update_host () + 9b
fffffd7ffdbeb8ce cse_match_host_impl () + 3c
fffffd7ffdbeb93a cse_match_host () + 3f
fffffd7ffdbebe79 cse_is_match () + 100
fffffd7ffdbec35b cse_match_request () + 29a
fffffd7ffdbe6d6a cse_dispatch () + ed
00000000004444c9 ap_run_handler () + 5f
0000000000444d66 ap_invoke_handler () + 158
0000000000455038 ap_process_request () + 64
0000000000452028 ap_process_http_connection () + 74
000000000044d6fb ap_run_process_connection () + 5f
000000000044db38 ap_process_connection () + 5b
000000000045b450 process_socket () + 91
000000000045bcd9 worker_thread () + 221
fffffd7ffed3c39b _thr_setup () + 5b
fffffd7ffed3c5c0 _lwp_start ()

The last Resin function on the thread's stack is cse_free_lock:

cse_free_lock(config_t *config, void *vlock)
{
    /* destroys the mutex unconditionally, even if another worker
       thread is still holding or blocked on it */
    apr_thread_mutex_destroy(vlock);
}

After some research, it appears that apr_thread_mutex_destroy(vlock); is not safe here: destroying a mutex that another thread may still hold or be blocked on is undefined behavior, which CERT CON31-C explicitly forbids.

(https://www.securecoding.cert.org/confluence/display/seccode/CON31-C.+Do+not+unlock+or+destroy+another+thread%27s+mutex)

We figured that because each of our Apache instances/vhosts uses multiple Resin servers (around six), the concurrent config refreshes race each other over the apr_thread_mutex_destroy call.

Today I altered the configuration so that the busiest app (~90% of the hits) is proxied to a separate web server with only one Resin host configured. That setup did not produce a single hanging process in over four hours.
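The split looks roughly like this in the front-end Apache config (a sketch; the hostname and path are hypothetical placeholders, not our real config):

```apache
# Route the busiest app (~90% of hits) to a dedicated Apache instance
# whose mod_caucho config knows only ONE Resin host, so the multi-host
# config refresh (and its lock teardown) never runs there.
ProxyPass        /busyapp/ http://app-web1.example.internal/busyapp/
ProxyPassReverse /busyapp/ http://app-web1.example.internal/busyapp/
```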

Another parameter that affects the behaviour is MaxRequestsPerChild 0.
We see far more hangs when the value is non-zero; I assume the cleanup code is called much more often in that case.
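For reference, the setting as it appears in a worker-MPM httpd.conf (a sketch using the Apache 2.2 directive name):

```apache
<IfModule mpm_worker_module>
    # 0 = never recycle child processes; the per-child cleanup path
    # that destroys mod_caucho's locks then only runs at restart.
    MaxRequestsPerChild 0
</IfModule>
```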

Could you provide us with some thoughts and possibly a solution? I checked newer versions of the plugin, but I don't see any relevant changes in the code.

Versions we currently use:

OS: Solaris 10u9
Apache: 2.2.15, latest APR
Resin: 3.1.10, both for the plugin and the app server
License: Resin Pro

Best Regards,

remko