Contents

Introduction
Access/User Management
Controlling Processes
Monitor Process
File System
Disk Management
Scripting and Shell
Bash Scripting
Python Scripting
Tools - Cron
Troubleshooting

 Essential Duties of Admin:

Account provisioning

Adding and removing hardware

Performing backups

Installing and upgrading software (Patching)

Monitoring system security

Troubleshooting

Learn: Python, Bash, expect

Access Management

User Management

Controlling Processes

File System

Disk Management
Password Management

Log Management

Job Scheduling

Performance Issues

Certificate Management

Package Management

Scripting and Shell

Introduction:
KeyNotes:

Linux distributions – RedHat, CentOS,


Access/User Management

Keywords:

Provisioning local users

Password vaults for managing credentials

Default group number (GID)

LDAP and Active Directory

Single sign-on systems

Identity Management

Super user account - The ROOT

Best Practices:

Restrict ROOT access (credentials, direct login)

su doesn’t record the commands executed as root, but it does create a log entry that states who
became root and when

Enable SUDO policy

/etc/sudoers.d/ - sudo policies

/etc/group - group identification numbers (GIDs) are mapped to group names

/etc/passwd - user identification numbers (UIDs for short) are mapped to usernames

/etc/shadow – passwords are stored

/etc/security

 Filesystem access control (ACL) - Owner, Group, Others

Process ownership

 SUDO

To modify /etc/sudoers, use the visudo command, which checks the syntax before saving

su: substitute user identity

 RBAC Model –

Maintain roles/groups to manage large numbers of users

SELinux: Security-Enhanced Linux

 PAM: Pluggable Authentication Modules

Enforcement of strong password (complexity, length, ..)

Password encryption

For example, allow only specific users to su to other users (/etc/pam.d/su)

 Password Vaults

stores passwords for your organization in a more secure fashion

sudo (Examples)

sudo -u operator /usr/sbin/dump 0u /dev/sda1

Commands:

useradd, userdel, usermod

passwd

usermod -L user # lock the user, disabling login

/etc/passwd – userid, home dir,

/etc/shadow

/etc/group
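
As a rough illustration of the /etc/passwd layout noted above, the sketch below splits one colon-separated record into its fields. The sample record and the helper name passwd_field are made up for this example; nothing is read from the real file.

```shell
# Hypothetical record in /etc/passwd format (not taken from a real system).
# Layout: name:password:UID:GID:comment:home:shell
sample='alice:x:1001:1001:Alice Example:/home/alice:/bin/bash'

# passwd_field RECORD N -> prints the Nth colon-separated field (1-7)
passwd_field() {
    printf '%s\n' "$1" | cut -d: -f"$2"
}

echo "user:  $(passwd_field "$sample" 1)"
echo "uid:   $(passwd_field "$sample" 3)"
echo "gid:   $(passwd_field "$sample" 4)"
echo "home:  $(passwd_field "$sample" 6)"
echo "shell: $(passwd_field "$sample" 7)"
```

On a live system, getent passwd followed by a username should return a record in the same format.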

Controlling Processes
Keywords:
Process ID

Process Priority (Niceness)

Life cycle of process – Runnable, Sleeping (IO or CPU Cycles), Zombie, Stopped

Signals – KILL, HUP, INT,

Multithreading

User and System Process

 Monitor Process

ps aux

ps -eaf

top

strace

kill -9 pid
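
A small sketch of how signals and process IDs fit together, using a throwaway sleep process as the target. It assumes a Linux system, where SIGTERM is signal 15; the choice to try SIGTERM before SIGKILL is a common convention, not something these notes prescribe.

```shell
# Start a harmless long-running process in the background and keep its PID.
sleep 300 &
pid=$!

# kill -0 sends no signal; it only checks that the process exists.
if kill -0 "$pid" 2>/dev/null; then
    echo "process $pid is running"
fi

# Ask politely first: SIGTERM lets a process clean up before exiting.
kill -TERM "$pid"

# wait reports 128 + signal number when the child died from a signal.
status=0
wait "$pid" || status=$?
echo "exit status: $status"

# kill -9 (SIGKILL) cannot be caught or ignored, so it is usually kept
# as a last resort for processes that do not react to SIGTERM.
```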

File System
Keywords:

Absolute and relative paths

Filesystem mounting and unmounting

Links – Symbolic / Hard links

File Permissions and Ownership

 Commands:

touch

mount, umount

chmod - change permissions

chown, chgrp - change ownership (user and group)
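
The following sketch tries out permissions and both link types on a temporary file. It assumes GNU stat (the -c option) as shipped with common Linux distributions; the file names are hypothetical.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
file="$workdir/report.txt"
echo "quarterly numbers" > "$file"

# 640 = read/write for the owner, read for the group, nothing for others.
chmod 640 "$file"
perms=$(stat -c '%a' "$file")
echo "mode: $perms"

# A hard link is a second name for the same inode.
ln "$file" "$workdir/hard.txt"
# A symbolic link is a separate small file that stores a path.
ln -s "$file" "$workdir/soft.txt"

inode_file=$(stat -c '%i' "$file")
inode_hard=$(stat -c '%i' "$workdir/hard.txt")
link_target=$(readlink "$workdir/soft.txt")
echo "inodes: $inode_file $inode_hard"
echo "symlink points to: $link_target"

rm -rf "$workdir"
```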

Disk Management

Expand disk / partition

Partitioning
LVM (PV, VG, LV) – Logical volume management

$ sudo pvcreate /dev/sdc1                        # Prepare for use w/LVM
$ sudo vgcreate vgname /dev/sdc1                 # Create volume group
$ sudo lvcreate -l 100%FREE -n volname vgname    # Create logical volume
$ sudo mkfs -t ext4 /dev/vgname/volname          # Create filesystem
$ sudo mkdir mountpoint                          # Create mount point
$ sudo vi /etc/fstab                             # Set mount opts, mntpoint

/etc/fstab - Setup for automatic mounting

fdisk -l

Scripting and Shell

Common filter commands – cut, sort, uniq, grep

Schedule Jobs – cron
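
To show how the filter commands listed above chain together, here is a minimal sketch that counts login shells in a small made-up account list. The data and the function name shell_counts are illustrative assumptions.

```shell
# Hypothetical "name:shell" records, one per line.
accounts='root:/bin/bash
daemon:/usr/sbin/nologin
alice:/bin/bash
bob:/bin/sh
carol:/bin/bash'

# grep drops non-login accounts, cut keeps the shell column,
# sort groups identical lines so uniq -c can count them,
# and the final sort -rn puts the most common shell first.
shell_counts() {
    printf '%s\n' "$accounts" |
        grep -v 'nologin' |
        cut -d: -f2 |
        sort |
        uniq -c |
        sort -rn |
        awk '{ print $1, $2 }'   # normalize the padding uniq -c adds
}

shell_counts
```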

 Environmental variables

Use export varname to promote a shell variable to an environment variable.

Variables that should be set up at login time belong in your ~/.profile or ~/.bash_profile file.
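
A short sketch of the difference export makes: a child process sees only exported variables. The variable names plain_var and SHARED_VAR are made up for the example.

```shell
# A plain shell variable stays inside the current shell.
plain_var="only in this shell"

# export promotes a variable into the environment passed to children.
export SHARED_VAR="visible to children"

# Each sh -c starts a child process; the single quotes make sure the
# child, not the parent, expands the variable.
child_plain=$(sh -c 'echo "${plain_var:-unset}"')
child_shared=$(sh -c 'echo "${SHARED_VAR:-unset}"')

echo "plain_var in child:  $child_plain"
echo "SHARED_VAR in child: $child_shared"
```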

Bash Scripting
#!/bin/bash
echo "Hello, world!"

$ chmod +x helloworld
$ ./helloworld

Python Scripting

Read – Dive into Python

Tools - Cron
standard tool for scheduling tasks
crontab -l # list the current user's crontab

crontab -e # edit the current user's crontab

minute hour dom month weekday command
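
To make the field order above concrete, this sketch splits one crontab entry into its parts. The entry itself, including the backup script path, is a hypothetical example.

```shell
# Hypothetical crontab entry: run a backup script at 02:30 every Sunday.
entry='30 2 * * 0 /usr/local/bin/backup.sh --full'

# read splits on whitespace; the last variable receives the rest of the
# line, so the command keeps its own arguments.
read -r minute hour dom month weekday command <<EOF
$entry
EOF

echo "minute=$minute hour=$hour dom=$dom month=$month weekday=$weekday"
echo "command=$command"
```

An asterisk in a time field means "every value", so this entry matches any day of the month and any month, restricted to weekday 0 (Sunday).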

 Common use cases:

Simple reminders

File system cleanup

Log file rotation

Backup

Troubleshooting
Level of troubleshooting

Divide the problem space – divide-and-conquer approach

Possible causes, random tests

Isolate the problem

Collaborate – conference calls, email, direct conversation, chat rooms (Jabber, Spark)

Best Practices

Favor Past solutions - most problems happen more than once.

Document your problems and solutions

Know What Changed - One of the largest sources of problems in a system is change. When everything
has been running smoothly for a long time and then a problem appears, one of the first things you
should ask is “What changed?”

Understand How Systems Work

Resist rebooting server/service


System Load

System load average is probably the fundamental metric you start from when troubleshooting a sluggish system.

The three numbers after "load average" in the uptime output (2.03, 20.17, and 15.09) represent the 1-, 5-, and 15-minute load averages on the machine, respectively.

$ uptime
13:35:03 up 103 days, 8 min, 5 users, load average: 2.03, 20.17, 15.09

Explanation:

A single-CPU system with a load average of 1 means the single CPU is under constant load. If that single-CPU system has a load average of 4, there is four times as much load as the system can handle, so three out of four processes are waiting for resources. The reported load average is not adjusted for the number of CPUs, so on a two-CPU system a load average of 1 means one of the two CPUs is loaded at all times; that is, the system is 50% loaded. A load of 1 on a single-CPU system is therefore the same as a load of 4 on a four-CPU system in terms of the share of available resources used.
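
The per-CPU arithmetic in the explanation can be sketched as follows. The helper name load_per_cpu is an assumption for this example; the live reading at the end relies on /proc/loadavg and nproc, which are available on typical Linux systems but are skipped if missing.

```shell
# load_per_cpu LOAD CPUS -> load average divided by the number of CPUs.
# A result near 1.00 means the CPUs are, on average, fully busy.
load_per_cpu() {
    LC_ALL=C awk -v load="$1" -v cpus="$2" \
        'BEGIN { printf "%.2f\n", load / cpus }'
}

# The cases from the explanation above.
echo "load 1 on 2 CPUs: $(load_per_cpu 1 2)"
echo "load 4 on 4 CPUs: $(load_per_cpu 4 4)"
echo "load 4 on 1 CPU:  $(load_per_cpu 4 1)"

# Optional live reading of the 1-minute load average on this machine.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one five fifteen rest < /proc/loadavg
    echo "current 1-minute load per CPU: $(load_per_cpu "$one" "$(nproc)")"
fi
```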

“It depends on what is causing it.”

What is important to determine is whether the load is CPU-bound (processes waiting on CPU resources), RAM-bound (specifically, high RAM usage that has moved into swap), or I/O-bound (processes fighting for disk or network I/O).

A system that runs out of RAM resources often appears to have I/O-bound load, since once the system
starts using swap storage on the disk, it can consume disk resources and cause a downward spiral as
processes slow to a halt

Scenarios

 Sluggish / Unresponsive system


- High resource consumption - CPU, RAM, disk I/O, and network
- System load is high – CPU-bound, I/O-bound, ..
- Slow queries to MySQL
- Network problems
- Diagnose load problems with top

$ top -b -n 1 | tee top_output

- Kill - once you spot a process that is consuming all of your CPU, you can kill it by its PID.

 Diagnose Out-Of-Memory Issues

Before diagnosing specific system problems, it’s important to be able to rule out memory issues.
Mem: 1024176k total, 997408k used, 26768k free, 85520k buffers
Swap: 1004052k total, 4360k used, 999692k free, 286040k cached
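
One way to read the figures above, sketched as shell arithmetic: Linux counts buffers and file cache as "used" even though the kernel can reclaim them, so subtracting them gives a better estimate of what applications hold. This interpretation is not spelled out in these notes and is stated here as an assumption about the usual meaning of that top output.

```shell
# Figures from the sample top output above, in kilobytes.
total_k=1024176
used_k=997408
buffers_k=85520
cached_k=286040

# Buffers and cache are reclaimable, so subtract them from "used".
app_used_k=$((used_k - buffers_k - cached_k))
app_used_pct=$((app_used_k * 100 / total_k))

echo "used by applications: ${app_used_k}k (${app_used_pct}% of RAM)"
```

Read this way, the machine is less starved for memory than the 26768k "free" figure suggests, and the low swap usage (4360k) points the same direction.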

Note: The Linux kernel also has an out-of-memory (OOM) killer that can kick in if the system runs
dangerously low on RAM. When a system is almost out of RAM, the OOM killer will start killing
processes

Loaded network vs loaded machine

 Diagnose High I/O Wait

When you see high I/O wait, one of the first things you should check is whether the machine is using a
lot of swap.

$ sudo iostat

$ sudo iotop

 Check CPU/RAM/Disk Statistics

Use the sar tool to view these statistics

$ sar

$ sar -r # View RAM stats

$ sar -b # View disk stats

 Booting Issues

Linux boot loaders

Single user mode

BIOS

GRUB

Root partition is corrupted or Failed

< Chapter 3 is pending >

DevOps Troubleshooting Book

 Storage or Disk Issues


When the Disk Is Full:
$ cp /var/log/syslog syslogbackup
cp: writing 'syslogbackup': No space left on device

Track down largest directories :


$ cd /
$ sudo du -ckx | sort -n > /tmp/duck-root

Verify log rotation:

If you routinely find you have disk space problems due to uncompressed logs, you can tweak your logrotate settings in /etc/logrotate.conf and /etc/logrotate.d/ and make sure it automatically compresses rotated logs.
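
The du and sort approach for tracking down large directories can be wrapped in a small helper, sketched below. The function name largest_dirs and the demo tree are assumptions; the demo runs on a temporary directory rather than on /.

```shell
# largest_dirs DIR N -> the N largest directories under DIR, sizes in KB.
# -x keeps du on a single filesystem; sort -n puts the biggest last.
largest_dirs() {
    du -kx "$1" 2>/dev/null | sort -n | tail -n "$2"
}

# Demo on a throwaway tree with hypothetical names.
demo=$(mktemp -d)
mkdir "$demo/logs" "$demo/cache"
head -c 1048576 /dev/urandom > "$demo/logs/app.log"   # about 1 MB
echo "tiny" > "$demo/cache/entry"

largest_dirs "$demo" 3

rm -rf "$demo"
```

On a real system the first argument would be the mount point that is full, and the command would typically need sudo to read every directory.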

File system is Read Only:

See whether you can, in fact, remount the file system read-write. For instance, if the /home partition were read-only, you would type:

$ sudo mount -o remount,rw /home
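
Before remounting, it can help to see which file systems are currently mounted read-only. The sketch below checks the options column of /proc/mounts; the helper name has_ro_option is made up for this example, and the listing is skipped on systems without /proc/mounts.

```shell
# has_ro_option OPTS -> succeeds if the comma-separated mount options
# contain "ro" as a whole option (so "errors=remount-ro" does not match).
has_ro_option() {
    case ",$1," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}

# /proc/mounts fields: device, mount point, fs type, options, ...
if [ -r /proc/mounts ]; then
    while read -r device mountpoint fstype options rest; do
        if has_ro_option "$options"; then
            echo "read-only: $mountpoint ($fstype on $device)"
        fi
    done < /proc/mounts
fi
```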

Repair Corrupted file systems:


- A file system can become corrupted in a number of scenarios, through either a hard reboot or some other error.
- fsck is usually enough to repair the file system.
- Unmount the faulty file system, run fsck, and mount it again once it is repaired.

 Server down (or) Reachability issues


< Chapter 5>

Ping
Is the interface up?
Network routes
Firewall rules - /sbin/iptables
DNS issues - /etc/resolv.conf
Packet capture tools – tcpdump, Wireshark

 Tracing Email Problems


< Chapter 7>

 Is Website Down? (or) Webserver problems


< Chapter 8>
Is the server running?
Is the port open?
Firewall rules
Test the web server with curl/telnet
HTTP error codes
Webserver logs - /var/log/nginx

Sluggish or Unavailable Web Server

 Why is the database slow? (or) Tracking down database problems


Identify slow queries

< Chapter 9>

 Hardware Fault – Diagnosing common hardware problems


Hard drive is dying
Test RAM for Errors
Network Card Failures
The Server Is Too Hot
Power Supply Failures

 Booting and Shutting Down

Bootstrapping

Init process

booting to single-user mode, recovery mode, or maintenance mode

Single-user mode does not allow network operation; you need physical access to the system console to use it.

You can type <Control-D> instead of a password to bypass single-user mode and continue with a normal boot.

The fsck command is run during a normal boot to check and repair filesystems.

shutdown: the genteel way to halt the system


Protocols
