heredago's blog

January 7, 2013

Linux server outage checklist (Reddit)

Filed under: Uncategorized, posted by heredago @ 18:02

 

Hi all, I just wanted to put together a list of checks you can perform on a Linux server to try to figure out why it keeps going down. Please help, and I’ll edit my list with your submissions. Thanks!

Disk Space:

df -h

(Make sure you have enough disk space)
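To flag only the filesystems that are nearly full, the df output can be filtered. This is a minimal sketch, assuming POSIX-format output from df -P and mount points without spaces; the function name disk_over and the 90% default threshold are made up for illustration.

```shell
# Print mount point and usage for filesystems at or above a threshold.
# Reads `df -P` output on stdin. The 90% default is an arbitrary assumption.
disk_over() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5            # Capacity column, e.g. "95%"
        sub(/%/, "", use)   # strip the percent sign
        if (use + 0 >= limit) print $6, $5
    }'
}

# Usage: df -P | disk_over 90
```

No output means nothing is over the threshold.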

Memory:

free -m

(Check you’re not out of memory)
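The "free" column in free -m excludes memory used for buffers and cache, so it can look alarmingly low on a healthy box. A rough sketch that reports available memory as a percentage instead, assuming a kernel that exposes MemAvailable in /proc/meminfo (older kernels do not); the function name mem_available_pct is hypothetical.

```shell
# Print available memory as a whole-number percentage of total memory.
# Reads /proc/meminfo-style text on stdin; assumes a MemAvailable line exists.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Usage: mem_available_pct < /proc/meminfo
```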

Processes / Load average

top
htop

(Check for processes that are taking up a lot of memory/CPU)

Apache errors

cat /var/log/apache2/error.log

(Look for 500 errors caused by erroneous code on the server)
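The HTTP status codes themselves are recorded in the access log, while the error log holds the underlying messages. A small sketch for listing which URLs return 5xx responses most often, assuming the default common/combined log format (status in field 9, request path in field 7); the function name server_errors is made up for illustration.

```shell
# Count 5xx responses per status code and URL, most frequent first.
# Reads an access log on stdin; assumes common/combined log format.
server_errors() {
    awk '$9 ~ /^5[0-9][0-9]$/ { print $9, $7 }' |
        sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage: server_errors 10 < /var/log/apache2/access.log
```

The URLs at the top of that list are good starting points when reading through error.log.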

High hit rate

grep MaxClients /var/log/apache2/error.log

(Check for MaxClients warnings in your Apache error logs)

tail -f /var/log/apache2/access.log

(Check for bots/spiders) [If Apache is running the server out of memory, you might need to lower your MaxClients setting]
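Watching the log scroll by only goes so far. To see which clients are sending the most requests, the access log can be summarised per address. A minimal sketch, assuming the client address is the first field of each line (true for the default common/combined formats); the function name top_clients is hypothetical.

```shell
# List the busiest client addresses in an access log, most requests first.
# Reads the log on stdin; the optional argument is how many to show.
top_clients() {
    awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage: top_clients 10 < /var/log/apache2/access.log
```

A single address far ahead of the rest is often a crawler or a misbehaving script.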

Check recent logs

ls -lrt /var/log/

(The -lrt flags give a long listing sorted by modification time in reverse order, so the most recently modified files appear at the end)

Check your cronjobs

ls -la /var/spool/cron/*
ls -la /etc/cron*

(You might find your server is going down at a certain time. This could be the result of a cronjob eating up too many resources)

Check Kernel Messages

dmesg
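One thing worth looking for in the kernel messages is the OOM killer, which terminates processes when the machine runs out of memory. A sketch of a filter for that, assuming the usual kernel wording ("invoked oom-killer", "Out of memory", "Killed process"), which can differ between kernel versions; the function name oom_events is made up for illustration.

```shell
# Keep only kernel log lines that look like OOM-killer activity.
# Reads kernel messages on stdin; the patterns are assumptions about wording.
oom_events() {
    grep -iE 'out of memory|oom-killer|killed process'
}

# Usage: dmesg | oom_events
```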

Check inodes

df -i

(Check inodes remaining when you have a disk that looks full but is reporting free space)
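If df -i does show a filesystem short on inodes, the next step is finding the directory holding all the small files. A rough sketch that counts regular files per directory, assuming file names without newlines; the function name inode_hogs is hypothetical, and it only approximates inode usage because directories and symlinks are not counted.

```shell
# List the directories containing the most regular files under a path.
# -xdev keeps find on a single filesystem. Arguments: path, how many to show.
inode_hogs() {
    find "${1:-.}" -xdev -type f 2>/dev/null |
        sed 's|/[^/]*$||' |      # drop the file name, keep its directory
        sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Usage: inode_hogs /var 10
```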

Install sysstat (sar) for collected system stats (CPU, I/O, memory, networking)

http://www.thegeekstuff.com/2011/03/sar-examples/

Determine how many Apache processes are running (if you’re not using mod_status)

ps -e | grep apache2 | wc -l

For DoS attacks: START

Number of active, and recently torn down TCP sessions

netstat -ant | egrep -i '(ESTABLISHED|WAIT|CLOSING)' | wc -l

Number of sessions waiting for ACK (SYN Flood)

netstat -ant | egrep -i '(SYN)' | wc -l

List listening TCP sockets

netstat -ant | egrep -i '(LISTEN)'
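The netstat checks above each look at one group of states. To get a count for every TCP state in one pass, the same output can be tallied. A minimal sketch, assuming Linux net-tools output where the state is the sixth column; the function name tcp_states is made up for illustration.

```shell
# Count TCP sockets per connection state, largest count first.
# Reads `netstat -ant` output on stdin; header lines are skipped.
tcp_states() {
    awk '$1 ~ /^tcp/ { count[$6]++ }
         END { for (s in count) print count[s], s }' | sort -rn
}

# Usage: netstat -ant | tcp_states
```

An unusually large SYN_RECV count compared with ESTABLISHED is the pattern to watch for with a SYN flood.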

List arguments passed to program

cat /proc/<PID>/cmdline
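The arguments in that file are separated by NUL bytes, so plain cat prints them run together. A small sketch that swaps the NULs for spaces; the function name cmdline_args is hypothetical.

```shell
# Print NUL-separated arguments as one space-separated line.
# Reads the contents of a /proc/<PID>/cmdline file on stdin.
cmdline_args() {
    tr '\0' ' '
    echo    # the file has no trailing newline, so add one
}

# Usage: cmdline_args < /proc/<PID>/cmdline
```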

For DoS attacks: END

 

 

aramsumair 38 points 10 hours ago (34|3)

I’ve always found sar to be very helpful.

http://www.thegeekstuff.com/2011/03/sar-examples/

 

gsxr Sr dev support and release 16 points 8 hours ago (15|0)

sar doesn’t get enough love nowadays. I’ve seen many systems with it disabled.

 

triflerifle 3 points 3 hours ago (3|0)

I get not using it as there are so many quicker ways to view your historic system stats. But disabling it altogether? No way. I always leave it enabled because when all other monitoring has broken, the sar data will still be there.

 
 
 
 
 

 

ibfreeekout 4 points 5 hours ago (4|0)

Yup, this is pretty much one of the most used methods of getting a basic understanding of how a server has been behaving for the past few days. All of the servers in our DC are deployed with this by default, which makes it very easy to find issues that occur at a specific time of day. Highly recommend this software.

 
 
 
 
 

 

fonzie588 Network Admin 30 points 10 hours ago (32|4)

htop > top

 

 

 

Source: http://www.reddit.com/r/sysadmin/comments/1646l8/linux_server_outage_checklist/
