Engineering and Operation

Code is shipped to production

  • coding
    • write the code
    • test
  • debugging
    • ship the code
    • issues emerge


Think about the production environment while developing

  • How to avoid, defend against, or recover from issues?
  • How to make troubleshooting easier?

Race Conditions and Edge Cases

Very likely to happen when any of the following are involved

  • Threads
  • Multi-Process (resource contention)
  • Across-the-network

Use “atomic” operations to avoid race conditions
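
One common instance of an "atomic" operation is replacing a file by rename, so that readers see either the old content or the new content, never a half-written file. The sketch below is a minimal illustration, assuming a POSIX-style file system; the helper name atomic_write is hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: str) -> None:
    """Replace `path` with `data` in one step (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the final rename
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # the rename is the atomic step
    except BaseException:
        # Do not leave a stray temp file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader that opens the file at any moment gets a complete version, which removes the "partially written file" race without any explicit lock.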

They may also happen in situations nobody thought about, the edge cases

  • consider all possible input
  • never assume something never happens
    • at least log the extremely rare cases

Research the implicit and explicit locks or semaphores available (for file systems, databases, etc.)

  • a DBMS usually locks automatically
  • but those locks may or may not be correct, sufficient, or efficient

UNIX file locking

  • lockf()
  • flock()
  • fcntl()
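
Python exposes these calls through its fcntl module. The sketch below uses flock() to hold an exclusive advisory lock for the duration of a block. It assumes a Unix system, and the helper name exclusive_lock is hypothetical. The lock is advisory: it only protects against other code that also takes the lock.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path: str):
    """Hold an exclusive advisory lock on lock_path (hypothetical helper)."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield fd
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Passing fcntl.LOCK_EX | fcntl.LOCK_NB instead makes the call fail immediately with BlockingIOError when someone else holds the lock, which is often easier to reason about than blocking forever.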


Think carefully when opening files. (Most developers never think beyond closing a file after opening it.)

Files may change after opening, may disappear, or may even be maliciously edited (or read).

One security measure: create a randomly named directory for the files under /tmp/, and restrict the permissions of that directory
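
In Python, tempfile.mkdtemp() covers both halves of this: it picks a random name and creates the directory readable, writable, and searchable only by the creating user. A minimal sketch, with a hypothetical helper name and prefix:

```python
import os
import stat
import tempfile


def make_private_workdir(prefix: str = "app-") -> str:
    """Create a randomly named, owner-only directory (hypothetical helper)."""
    # mkdtemp uses the system temp location (usually /tmp on Unix)
    # and creates the directory with mode 0o700.
    path = tempfile.mkdtemp(prefix=prefix)
    # Setting the mode again is redundant, but keeps the intent explicit.
    os.chmod(path, stat.S_IRWXU)
    return path
```

The caller is responsible for removing the directory afterwards, for example with shutil.rmtree().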

Race Condition Pitfall

Bad locks; randomness (e.g. a random back-off before retrying) can help

  • deadlock
  • livelock
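
As an illustration of how randomness helps, the sketch below takes two locks without ever blocking on the second one (which avoids deadlock) and sleeps for a random interval before retrying (which makes it unlikely that two threads keep colliding in lock-step, i.e. livelock). The function name, retry count, and sleep bounds are assumptions chosen for the example.

```python
import random
import threading
import time


def acquire_both(first: threading.Lock, second: threading.Lock,
                 max_tries: int = 50) -> bool:
    """Try to hold both locks; return False after max_tries failed attempts."""
    for _ in range(max_tries):
        first.acquire()
        # Never wait for the second lock while holding the first one.
        if second.acquire(blocking=False):
            return True
        first.release()
        # Random back-off so competing threads fall out of step.
        time.sleep(random.uniform(0.001, 0.01))
    return False
```

On success the caller holds both locks and must release them, for example second.release() followed by first.release().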

Failing to account for network latency

  • a possible solution is to average the data from multiple clients (e.g. local and remote)
    • e.g. display a player at a location between where that player thinks they are and where the others think they are
  • the aim is to avoid slowing other people down with atomic operations etc.
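
The averaging idea can be sketched as a linear interpolation between the two reported positions. This is only an illustration under simple assumptions (2D coordinates, one local and one remote report); the names blend_position and weight are hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]


def blend_position(local: Point, remote: Point, weight: float = 0.5) -> Point:
    """Return a point between local and remote.

    weight=0.0 trusts the local view completely, weight=1.0 the remote one.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return (
        local[0] + (remote[0] - local[0]) * weight,
        local[1] + (remote[1] - local[1]) * weight,
    )
```

With the default weight, blend_position((0.0, 0.0), (10.0, 20.0)) gives (5.0, 10.0), the midpoint of the two views.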

Efficiency and Scalability

There is usually a trade-off between “speed to ship the product” and “scalable or efficient code”

First know

  • over-optimizing for efficiency is bad
    • may prevent scaling
    • may harm later development
  • scalability that is never needed is wasted effort
    • the future can never be predicted exactly
    • scalability is always traded off against something else, usually efficiency
      • e.g. more servers -> slower to start the whole system

Think about your own product

  • what are the limits (disk, memory, CPU, networking)?
  • what happens when reaching the limit?
  • how to scale?


  • separate by layers
  • load balancing
  • DNS lookups spread evenly across servers (round-robin DNS)
  • active/passive mode

See the Scale page for more details

Don’t forget availability

  • CAP theorem



  • Logs
  • Console output


  • command line call
  • process signal (kill)
  • Config file
  • response of another service (time-out or different response code)
  • presence of a file on the system
  • webhook

Think carefully (again) about what to use. Some bugs may be caused by logging itself. Some switches, like reloading a config file, may clear the bug.
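
As one example of a switch, a process signal can flip the log level of a running program without a restart. The sketch below assumes a Unix system and a logger named "app" (a hypothetical name). The choice of SIGUSR1 is also just an assumption; any signal the program does not otherwise use would do.

```python
import logging
import signal

logger = logging.getLogger("app")  # hypothetical logger name


def toggle_debug(signum, frame):
    """Signal handler: switch between INFO and DEBUG."""
    if logger.level == logging.DEBUG:
        logger.setLevel(logging.INFO)
    else:
        logger.setLevel(logging.DEBUG)


def install_debug_switch() -> None:
    """Start at INFO and let SIGUSR1 toggle DEBUG (call from the main thread)."""
    logger.setLevel(logging.INFO)
    signal.signal(signal.SIGUSR1, toggle_debug)
```

An operator can then run kill -USR1 with the process ID to turn debug logging on, and send the same signal again to turn it off.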


Logging is one of the most important things.

Logging builds the bridge between developers and debuggers.

No logging means operations people have to call developers at 2 AM.

Logging, Monitoring, and Investment

Nothing can hide the importance of monitoring the production software.

Logging, hooks, and so on are common choices.

  • log the transactions within the system
  • log the performance of each function call (like queries)
  • log data whose value is greater than the cost of logging it

Big logging implies big investment, in either a specialized framework or specialized servers.

  • log all critical data
  • passive logging
  • sample huge traffic
    • be cautious about how the sample is picked, so that the randomness is “real” and the sample is representative
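
One way to sample huge traffic is to decide per request rather than per log line, using a hash of a request identifier. All entries of a sampled request are then kept together, and the decision is repeatable across servers. This is a sketch of that idea, not the only strategy; the function name and the use of SHA-256 are assumptions.

```python
import hashlib


def should_log(request_id: str, sample_rate: float) -> bool:
    """Return True for roughly sample_rate of all request IDs."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Because the decision depends only on the ID, should_log("req-42", 0.1) gives the same answer every time it is called, on every machine.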

Logging Level

Logs usually have different levels

  • debug
  • info
  • warning
  • fatal

There can be many other levels; use them based on your own needs

Since plenty of bugs only appear during specific time periods, chances are people do not want to restart or recompile the code. This is where switches come in handy (mentioned above)

Logging Pitfalls

  • over-logging wastes resources
    • and possibly hides the real issues
  • jumbled or interleaved logging makes logs useless
    • make sure log entries are uniquely identified
  • logging changes the behavior of the program
    • e.g. one more function call changes the stack frames
  • a bad log sampling strategy
    • not capturing enough data
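
To keep interleaved entries identifiable, each request can carry a unique ID that is stamped on every line it logs. The sketch below uses the standard logging.LoggerAdapter; the logger name "orders", the format string, and the helper names are assumptions made for the example.

```python
import logging
import uuid

LOG_FORMAT = "%(levelname)s [%(request_id)s] %(message)s"


def build_logger(stream) -> logging.Logger:
    """Configure a base logger whose format expects a request_id field."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    base = logging.getLogger("orders")  # hypothetical logger name
    base.addHandler(handler)
    base.setLevel(logging.INFO)
    base.propagate = False
    return base


def request_logger(base: logging.Logger) -> logging.LoggerAdapter:
    """Wrap base so every entry carries the same fresh request_id."""
    return logging.LoggerAdapter(base, {"request_id": uuid.uuid4().hex})
```

Each incoming request would call request_logger() once and log only through the returned adapter, so its lines can be grouped by ID even when several requests write at the same time. Logging directly through the base logger would not supply request_id, so this format assumes the adapter is always used.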


System Health


  • load average
  • processor utilization
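
Load average is easier to interpret relative to the number of CPUs. A minimal sketch, assuming a Unix system where os.getloadavg() is available; the helper name is hypothetical.

```python
import os


def load_per_cpu() -> float:
    """Return the 1-minute load average divided by the CPU count."""
    one_minute, _five, _fifteen = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)
```

A value that stays well above 1.0 suggests more runnable work than the processors can handle.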


  • resident set sizes
  • swap

About swap

  • When a large process forks, the fork can fail for lack of memory unless there is enough swap.
  • But we don’t want to actually use swap. Look for the problems that cause processes to use swap.
  • ps can be used to check swap
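
Besides ps, the per-process swap figure can be read from the proc file system on Linux. The sketch below is Linux-specific and assumes the kernel exposes a VmSwap line in /proc/<pid>/status; it returns None when that information is not available. The function name is hypothetical.

```python
import re
from typing import Optional


def swap_kib(pid: int) -> Optional[int]:
    """Return the swap used by a process in KiB, or None if unknown."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                match = re.match(r"VmSwap:\s+(\d+)\s+kB", line)
                if match:
                    return int(match.group(1))
    except OSError:
        # No /proc (not Linux), no such process, or no permission.
        return None
    return None
```

Calling swap_kib(os.getpid()) reports the figure for the current process.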

Linux also has the OOM Killer (out-of-memory killer). Excessive memory usage might trigger it to kill processes.

Disk I/O

  • IOPS
    • I/O operations Per Second
  • file system
    • local
    • fiber channel
    • SAN (Storage Area Network)
    • NFS
    • NAS (Network Attached Storage)


  • utilization
    • bps (bits per second)
    • pps (packets per second)
  • packet loss
    • local
    • remote
  • protocol
  • end users
    • slow
    • lossy
    • nonexistent
    • malicious

There are 65535 usable TCP port numbers in total (1 to 65535; port 0 is reserved), and some of those are reserved as well. Running out of ports is another common issue.

There are two IP versions: IPv4 and IPv6. One process might make 2 connection attempts, or 1, or 0.
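
Which addresses a client will try can be inspected with socket.getaddrinfo(), which returns one entry per candidate (IPv4, IPv6, or both, depending on the name and the system configuration). A minimal sketch with a hypothetical helper name:

```python
import socket
from typing import List, Tuple


def candidate_addresses(host: str, port: int) -> List[Tuple[str, str]]:
    """List (address family, IP address) pairs a TCP client could try."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [(family.name, sockaddr[0]) for family, _, _, _, sockaddr in infos]
```

A dual-stack host name typically yields both AF_INET6 and AF_INET entries, which is where the "2 connection attempts" come from; an empty or failing lookup corresponds to 0 attempts.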

Process Health

  • pegged CPU
  • weird memory usage
  • process state

See Unix and Linux Commands for commands to use; search that page for the keyword “troubleshoot”. (It’s a long page.)


  • where
  • useful
  • readable

Check how your program logs

Check system logs (system health)


Does it depend on other resources? Are those working?

  • blocking synchronous calls
  • slow asynchronous calls
  • dependency services go down
  • components in series go down
  • proxy servers
  • etc.
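
A quick first check for a dependency is whether its TCP port accepts a connection within a bounded time. The sketch below is deliberately minimal: it only proves that something is listening, not that the service is healthy. The function name and the default timeout are assumptions.

```python
import socket


def dependency_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, time-outs, and name resolution failures.
        return False
```

The explicit timeout matters: without it, a blocking synchronous call to a dead dependency can hang the caller as well.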

The real problem might be surprising, like a DNS record issue or an expired certificate.