Linux CPU usage and Process Execution History
Is there any way to see what process(es) caused the most CPU usage?
I have an Amazon EC2 Linux instance whose CPU utilization reaches 100 percent, which forces me to reboot the system. I cannot even log in through SSH (using PuTTY).
Is there any way to see what causes such high CPU usage and which process caused it?
I know about the top command, but I could not find process execution history anywhere. Here is the image from the Amazon EC2 monitoring tool, but I would like to know which process caused that:
I have also tried ps -eo pcpu,args | sort -k 1 -r | head -100 but had no luck finding such high CPU usage.
There are a couple of possible ways you can do this.
Note that it's entirely possible that many processes in a runaway scenario are causing this, not just one.
The first way is to set up pidstat to run in the background and produce data.
pidstat -u 600 >/var/log/pidstats.log & disown $!
This will give you a fairly detailed view of what the system is running, sampled at ten-minute intervals. I would suggest this be your first port of call since it produces the most valuable and reliable data to work with.
There is a problem with this: if the box goes into a runaway CPU loop and produces huge load, you're not guaranteed that pidstat itself will execute in a timely manner during the load (if at all), so you could actually miss the output!
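Once the log has collected some samples, it can be summarized with a small filter. The following is a minimal sketch, not a definitive tool: the function name top_cpu_from_pidstat is hypothetical, and it assumes each report block in the log starts with a header row containing "%CPU" and "Command" (the exact column set varies between sysstat versions, so the column positions are read from that header). The summed %CPU is only a ranking score across samples, not a percentage.

```shell
#!/bin/sh
# Hypothetical helper: rank commands in a pidstat -u log by summed %CPU.
# Assumption: every report block begins with a header row that names the
# "%CPU" and "Command" columns; data rows use the same field positions.
top_cpu_from_pidstat() {
    awk '
        # Skip the summary rows pidstat prints when it is interrupted.
        /^Average:/ { next }
        # Header row: remember where the %CPU and Command columns are.
        /%CPU/ && /Command/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "%CPU") cpu = i
                if ($i == "Command") cmd = i
            }
            next
        }
        # Data row: add this sample to the running total for the command.
        cpu && NF >= cmd && $cpu ~ /^[0-9]+(\.[0-9]+)?$/ {
            total[$cmd] += $cpu
        }
        END {
            for (c in total) printf "%.2f %s\n", total[c], c
        }
    ' "$@" | sort -rn | head -n 10
}
```

It could be called as top_cpu_from_pidstat /var/log/pidstats.log, which prints up to ten commands with the largest accumulated %CPU first.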
The second way to look for this is to enable process accounting. This is possibly more of a long-term option.
Turning it on (typically with accton on, from the acct/psacct package) enables process accounting if it is not already active. If it was not running before, it will need time to collect data.
Once it has been running for, say, 24 hours, you can run a command such as the following, which produces output like this:
# sa --percentages --separate-times
108 100.00%  7.84re 100.00%  0.00u 100.00%  0.00s 100.00%  0avio 19803k
  2   1.85%  0.00re   0.05%  0.00u  75.00%  0.00s   0.00%  0avio 29328k troff
  2   1.85%  0.37re   4.73%  0.00u  25.00%  0.00s  44.44%  0avio 29632k man
  7   6.48%  0.00re   0.01%  0.00u   0.00%  0.00s  44.44%  0avio 28400k ps
  4   3.70%  0.00re   0.02%  0.00u   0.00%  0.00s  11.11%  0avio  9753k ***other*
 26  24.07%  0.08re   1.01%  0.00u   0.00%  0.00s   0.00%  0avio  1130k sa
 14  12.96%  0.00re   0.01%  0.00u   0.00%  0.00s   0.00%  0avio 28544k ksmtuned*
 14  12.96%  0.00re   0.01%  0.00u   0.00%  0.00s   0.00%  0avio 28096k awk
 14  12.96%  0.00re   0.01%  0.00u   0.00%  0.00s   0.00%  0avio 29623k man*
  7   6.48%  7.00re  89.26%  0.00u   0.00%  0.00s
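The columns are ordered as follows (with --percentages, each time value is followed by its share of the total):
- Number of calls
- Percentage of calls
- Amount of real time spent on all the processes of this type (re)
- User CPU time (u)
- System CPU time (s)
- Average I/O operations per call (avio)
- Average memory usage (k)
- Command name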
What you'll be looking for are the process types that generate the most user and system CPU time.
This breaks down the data as the total amount of CPU time (the top row) and then how that CPU time has been split up. Process accounting only accounts properly when it's on when processes spawn, so it's probably best to restart the system after enabling it to ensure all services are being accounted for.
This by no means gives you a definite idea of which process is the cause of the problem, but it might give you a good feel. As it could be a 24 hour snapshot, there's a possibility of skewed results, so bear that in mind. It should also always log, since it's a kernel feature and, unlike pidstat, will always produce output even during heavy load.
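To pick out the process types with the most user and system CPU time, the sa output can be piped through a short filter. This is a rough sketch under stated assumptions: the function name rank_sa_by_cpu is hypothetical, and it assumes the row layout shown above, where the user and system times are the fields ending in "u" and "s" and the command name is the last field.

```shell
#!/bin/sh
# Hypothetical helper: rank rows of `sa --percentages --separate-times`
# output by user + system CPU time.
# Assumption: user/system times are the fields ending in "u" and "s",
# and the command name is the last field of each row.
rank_sa_by_cpu() {
    awk '
        {
            u = s = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9.]+u$/) u = $i + 0       # e.g. "0.37u" -> 0.37
                else if ($i ~ /^[0-9.]+s$/) s = $i + 0  # e.g. "0.05s" -> 0.05
            }
            # Skip rows without both times, and rows whose last field is
            # still a metric (the totals row, or a truncated row).
            if (u == "" || s == "") next
            if ($NF ~ /^[0-9.]+(k|s|u|re|avio|%)$/) next
            printf "%.2f %s\n", u + s, $NF
        }
    ' "$@" | sort -rn
}
```

A possible invocation is sa --percentages --separate-times | rank_sa_by_cpu, which lists the combined CPU time next to each command name, largest first.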
The last option available also uses process accounting so you can turn it on as above, but then use the program “lastcomm” to produce some statistics of processes executed around the time of the problem along with cpu statistics for each process.
lastcomm | grep "May 8 22:"
kworker/1:0  F  root  __     0.00 secs Tue May 8 22:20
sleep           root  __     0.00 secs Tue May 8 22:49
sa              root  pts/0  0.00 secs Tue May 8 22:49
sa              root  pts/0  0.00 secs Tue May 8 22:49
sa           X  root  pts/0  0.00 secs Tue May 8 22:49
ksmtuned     F  root  __     0.00 secs Tue May 8 22:49
awk             root  __     0.00 secs Tue May 8 22:49
This might give you some hints too as to what might be causing the problem.
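When the window contains many entries, it may help to total the CPU seconds per command. The sketch below is only an illustration: the function name top_cpu_from_lastcomm is hypothetical, and it assumes the lastcomm layout shown above, where the CPU time is the field immediately before the word "secs" and the command name is the first field.

```shell
#!/bin/sh
# Hypothetical helper: total CPU seconds per command from lastcomm output.
# Assumption: the CPU time is the field right before the literal "secs",
# and the command name is the first field of each line.
top_cpu_from_lastcomm() {
    awk '
        {
            for (i = 2; i <= NF; i++) {
                if ($i == "secs") {
                    total[$1] += $(i - 1)   # CPU seconds for this execution
                    count[$1]++             # number of executions seen
                    break
                }
            }
        }
        END {
            for (c in total) printf "%.2f %d %s\n", total[c], count[c], c
        }
    ' "$@" | sort -rn | head -n 10
}
```

Piping the filtered lastcomm output into top_cpu_from_lastcomm prints, for each command, its total CPU seconds, the number of executions, and the name.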
Atop is a particularly handy daemon for looking at drill-downs to the process level and by default archives this data for 28 days. Besides presenting an awesome real-time monitoring interface, you can specify those log files to open and step through them.
The article gives some idea of the capabilities, and you can find more in the manpage.
It’s truly a wonderful piece of software.
Programs such as psmon and monit may be helpful for you. They can monitor the processes running on your system, and if any threshold (CPU usage, memory usage…) is exceeded, you can set them to send you an e-mail report about what's going on.
It’s also possible to automatically restart the misbehaving processes.
One solution is to write a script that runs from a one-minute cron job or in a sleep loop and, the instant it finds the load average over a certain limit, sends you an email, an scp job, or a dump to an EBS volume with relevant output (dmesg, pstree -pa, ps aux, and probably vmstat).
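A minimal sketch of such a script might look like the following. Everything named here is an assumption for illustration: the function names, the limit, and the output directory are hypothetical, and the snapshot is simply written to a file rather than mailed or copied anywhere. It reads the 1-minute load average from /proc/loadavg.

```shell
#!/bin/sh
# Sketch only: function names, limit and directory are hypothetical.

# Succeeds (exit 0) when the 1-minute load average exceeds the limit.
# The optional second argument overrides the file to read.
load_exceeds() {
    limit=$1
    file=${2:-/proc/loadavg}
    awk -v limit="$limit" '
        NR == 1 { over = ($1 > limit) }
        END { exit !over }
    ' "$file"
}

# Writes a timestamped snapshot of system state into the given directory
# and prints the path of the file it created.
snapshot_system() {
    dir=$1
    mkdir -p "$dir" || return 1
    out="$dir/load-$(date +%Y%m%d-%H%M%S).txt"
    {
        echo "== loadavg ==";  cat /proc/loadavg
        echo "== ps aux ==";   ps aux
        echo "== pstree -pa =="; pstree -pa
        echo "== vmstat ==";   vmstat 1 3
        echo "== dmesg ==";    dmesg | tail -n 50
    } > "$out" 2>&1
    echo "$out"
}

# Takes a snapshot only when the load is over the limit.
check_and_snapshot() {
    if load_exceeds "$1"; then
        snapshot_system "$2"
    fi
}
```

From a one-minute cron job, the script could end with a call such as check_and_snapshot 4 /var/log/load-snapshots (both values are placeholders). Writing to local disk first keeps the check cheap; shipping the file elsewhere can happen in a separate step once the box is responsive again.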