LAMP: Troubleshooting slow website with strace and lsof
Problem:#
I was recently debugging a slow website. When tested in the browser, it was not showing any content and just continuously loading. It was hosted on a big web farm behind a load balancer, with no external access to individual web nodes.
Solution:#
To strace a single Apache process, I had to access the site from the web server itself. I changed the User-Agent header to easily grep through the logs:
# curl -v -s -o /dev/null -A "DEBUG" -H "Host: example.com" http://10.10.0.10:80/
* About to connect() to 10.10.0.10 port 80 (#0)
* Trying 10.10.0.10... connected
* Connected to 10.10.0.10 (10.10.0.10) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: DEBUG
> Accept: */*
> Host: example.com
>
This was taking too long, and after a few minutes we finally got a reply:
< HTTP/1.1 200 OK
< Date: Fri, 17 Feb 2017 11:10:44 GMT
< Server: Apache
< X-Frame-Options: DENY
< Cache-Control: no-cache, no-store
< Expires: -1
< Pragma: no-cache
< Set-Cookie: cp_session=Xp5xCpbtRVRrZev22rcXkxMk4rIjtc8S6o8ndE8yXlIIVlMzD-nE1HxPmY9KVDBawpzgo%2CH-9o8wVy3%2CLPT5A1; path=/; httponly
< Content-Length: 68053
< Vary: User-Agent,Accept-Encoding
< Content-Type: text/html; charset=UTF-8
<
{ [data not shown]
* Connection #0 to host 10.70.0.64 left intact
* Closing connection #0
While curl is still running, we’ll check its source port, which will help us identify the PID for Apache:
# lsof -i 4 -a -c curl
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
curl 18421 root 3u IPv4 1556243712 0t0 TCP web01.local:59054->web01.local:http (ESTABLISHED)
Now we can find the Apache process communicating with curl on the port identified above:
# lsof -a -c httpd -i TCP:59054
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
httpd 4955 apache 8u IPv4 1556230695 0t0 TCP web01.local:http->web01.local:59054 (ESTABLISHED)
Time to see what is taking so long:
# strace -ttT -s 512 -f -p 4955
Process 4955 attached
04:58:29.245680 epoll_wait(10,
OK, so it’s waiting on some event using file descriptor 10. Let’s check that:
# lsof -a -d 10 -p 4955
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
httpd 4955 apache 10u 0000 0,9 0 5380 anon_inode
That’s not helpful. Let’s dig deeper and check the child processes:
# pstree -Ap 4955
httpd(4955)---php(18427)
Note: another way to list child processes is ps --ppid 4955. To also see threads, use ps -L --ppid 4955 which shows the LWP (Light Weight Process) column.
Let’s check what this child process is doing and exclude some system calls that add too much noise:
# strace -ttT -s 512 -f -e 'trace=!clock_gettime' -p 18427
Process 18427 attached
[ Process PID=18427 runs in 32 bit mode. ]
05:00:06.797195 restart_syscall(<... resuming interrupted call ...>) = 0 <0.825651>
05:00:07.622969 poll([{fd=13, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) <0.000019>
05:00:07.623196 poll([{fd=13, events=POLLIN|POLLPRI}], 1, 1000) = 0 (Timeout) <1.000856>
05:00:08.624109 poll([{fd=13, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) <0.000010>
05:00:08.624287 poll([{fd=13, events=POLLIN|POLLPRI}], 1, 1000) = 0 (Timeout) <1.001037>
05:00:09.625374 poll([{fd=13, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) <0.000011>
05:00:09.625544 poll([{fd=13, events=POLLIN|POLLPRI}], 1, 1000^CProcess 18427 detached
<detached ...>
The process is trying to read from FD 13, repeatedly calling poll(2)
waiting for data from the remote end. Let’s check what that file descriptor points to:
# lsof -a -d 13 -p 18427
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
php 18427 nonpriv 13u IPv4 1556386925 0t0 TCP web01.local:54981->xxx.xxx.xxx.xxx:https (ESTABLISHED)
The TCP connection to the external host is established, but the remote end is slow to respond, causing the request to block until the external endpoint eventually replies. In my case that was a PHP framework pulling content from a remote source with no read timeout configured.