Engineering and Operation¶
Code is shipped to production
- write the code
- ship the code
- issues emerge
Think about the production environment while developing
- How to avoid, defend against, or recover from issues?
- How to make troubleshooting easier?
Race Conditions and Edge Cases¶
Race conditions are very likely when the following is involved
- Multi-Process (resource contention)
Use “atomic” operations to avoid race conditions
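A minimal sketch of making a shared read-modify-write atomic, here with Python's `threading.Lock` (the counter and thread counts are illustrative):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:          # makes the read-modify-write atomic
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; often less without it
```

Without the `with lock:` line, `counter += 1` compiles to several interpreter steps, and concurrent threads can interleave between them and lose updates.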
Bugs may also happen in situations never thought about, the edge cases
- consider all possible input
- never assume something never happens
- at least log the extremely rare cases
Research the implicit and explicit locks or semaphores available (for file systems, databases, etc.)
- usually DBMS automatically locks
- but they may not be correct, sufficient, or efficient
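As a sketch of taking an explicit database lock up front instead of relying on implicit locking, using SQLite's `BEGIN IMMEDIATE` (the table and amounts are made up for illustration):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit; we manage transactions
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")

# BEGIN IMMEDIATE takes the write lock before reading, so a concurrent
# writer blocks up front instead of failing in the middle of our update.
conn.execute("BEGIN IMMEDIATE")
bal = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
conn.execute("UPDATE accounts SET balance = ? WHERE id = 1", (bal - 30,))
conn.execute("COMMIT")
print(conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0])  # 70
```

With a plain deferred `BEGIN`, two processes could both read 100 and both write back 70, losing one withdrawal.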
UNIX file locking
Think carefully when opening files. (Most developers never think beyond closing what they open.)
Files may change after being opened, may disappear, or may even be maliciously edited (or read).
One security measure: create a randomly named directory for the files under /tmp/, and restrict the permissions of that directory.
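The randomly named, restricted directory can be sketched with Python's standard library (`tempfile.mkdtemp` creates the directory with mode 0700; the `myapp-` prefix and file name are illustrative):

```python
import os
import tempfile

# mkdtemp creates a randomly named directory under /tmp with mode 0700,
# so other local users cannot read, replace, or pre-create files inside it.
workdir = tempfile.mkdtemp(prefix="myapp-")
print(oct(os.stat(workdir).st_mode & 0o777))  # 0o700

secret = os.path.join(workdir, "scratch.dat")
# O_CREAT | O_EXCL fails if the path already exists, defeating symlink tricks
fd = os.open(secret, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
os.write(fd, b"sensitive")
os.close(fd)
```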
Race Condition Pitfall¶
Bad locks; use randomness (e.g. jittered retries) to help
Failing to account for network latency
- one possible solution is averaging the data of multiple clients (e.g. local and remote)
- e.g. display the player at a position between where the player thinks it is and where others think it is
- atomic operations would keep everyone consistent, but we do not want to slow other players down
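The "use randomness to help" idea above can be sketched as randomized exponential backoff when retrying a contended lock (the function name and timings are illustrative, not from the original):

```python
import random
import time

def acquire_with_backoff(try_lock, attempts=8):
    """Retry a non-blocking lock with randomized exponential backoff.

    try_lock is any callable returning True on success. The random
    jitter spreads competing clients out so they stop retrying in
    lockstep and colliding on every attempt.
    """
    for i in range(attempts):
        if try_lock():
            return True
        # full jitter: sleep a random slice of a doubling window
        time.sleep(random.uniform(0, 0.01 * (2 ** i)))
    return False
```

Without the jitter, clients that collided once tend to wake up at the same instant and collide again.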
Efficiency and Scalability¶
There is usually a trade-off between “speed to ship the product” and “scalable or efficient code”
- over-optimizing for efficiency is bad
- may prevent scaling
- may harm later development
- unnecessary scalability is wasted effort
- you can never predict the future exactly
- scalability always trades off against something, usually efficiency
- e.g. more servers -> slower to start the whole system
Think about your own product
- what is the limit (disk, memory, cpu, networking)?
- what happens when reaching the limit?
- how to scale?
- separate by layers
- load balancing
- round-robin DNS lookups
- active/passive mode
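A toy sketch of the load-balancing idea above, as round-robin selection over a fixed pool (addresses are made up; a real balancer also needs health checks and weights):

```python
import itertools

class RoundRobinBalancer:
    """Hand out servers from a fixed pool in rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print([lb.pick() for _ in range(4)])
# ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.1']
```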
See the Scale page for more details
Don’t forget availability
- CAP theorem
Switches¶
A runtime switch can be flipped through many channels:
- Console output
- command line call
- process signal (e.g. SIGHUP or SIGUSR1)
- Config file
- response of another service (time-out or different response code)
- presence of a file on the system
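One of the channels above, a process signal, can be sketched as a signal handler that toggles log verbosity without a restart (the logger name `myapp` and the choice of SIGUSR1 are illustrative):

```python
import logging
import signal

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("myapp")  # illustrative name

def toggle_debug(signum, frame):
    """Flip this logger between INFO and DEBUG at runtime."""
    new_level = logging.INFO if logger.level == logging.DEBUG else logging.DEBUG
    logger.setLevel(new_level)
    logger.info("log level now %s", logging.getLevelName(new_level))

# From a shell: kill -USR1 <pid> toggles verbosity while the process runs
signal.signal(signal.SIGUSR1, toggle_debug)
```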
Think carefully (again) about what to use. Some bugs may be caused by logging itself, and some switches, like reloading a config file, may make the bug disappear.
One of the most important things.
Logging builds the bridge between developers and debuggers.
No logging means operations people have to call developers at 2 AM.
Logging, Monitoring, and Investment¶
The importance of monitoring production software cannot be overstated.
Logging, hooks, and so on are common choices.
- log the transactions within the system
- log the performance of each function call (like queries)
- log data whose value is greater than the cost of logging it
Heavy logging implies a big investment, in either a specialized framework or specialized servers.
- log all critical data
- passive logging
- sample huge traffic
- be cautious about how to pick truly representative, unbiased randomness
Usually different levels for logs
There can be many other levels; use them based on your own needs
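The standard levels can be sketched with Python's `logging` module, where the configured level acts as a severity threshold (logger name `demo` is illustrative):

```python
import logging

logging.basicConfig(level=logging.WARNING)  # threshold: WARNING and above
log = logging.getLogger("demo")

log.debug("per-request detail, usually off in production")
log.info("normal lifecycle events")
log.warning("unexpected but handled")
log.error("an operation failed")
log.critical("the service cannot continue")
# With level=WARNING, only the last three calls actually emit a line.
```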
Since plenty of bugs only appear during specific time periods, chances are people do not want to restart or recompile the code. This is where switches (mentioned above) come in handy.
- over-logging wastes resources
- and possibly hides the real issues
- jumbled or interleaved logging makes logs useless
- make sure log entries are uniquely identified
- logging changes behavior of the program
- one more function call changes the stack frame?
- a bad log-sampling strategy
- not capturing enough data
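One way to keep interleaved logs usable, per the "uniquely identified" point above, is to stamp every record with a request id; a sketch using a standard `logging.Filter` (the filter class and format are illustrative):

```python
import logging
import uuid

class RequestIdFilter(logging.Filter):
    """Attach a unique id to every log record, so lines from
    concurrent requests can be separated after the fact."""

    def __init__(self, request_id):
        super().__init__()
        self.request_id = request_id

    def filter(self, record):
        record.request_id = self.request_id
        return True  # keep the record

logging.basicConfig(format="%(request_id)s %(levelname)s %(message)s")
log = logging.getLogger("demo")
log.addFilter(RequestIdFilter(uuid.uuid4().hex[:8]))
log.warning("charge failed")  # e.g. "3fa4b2c1 WARNING charge failed"
```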
- load average
- processor utilization
- resident set sizes
- If forking a large process, the fork will fail for lack of memory unless there is enough swap.
- But we don’t want to use swap; check for problems that cause processes to use swap.
- Can use `ps` to check swap
Linux also has the OOM killer (out-of-memory killer); overloaded memory usage might trigger it to kill processes.
- I/O operations Per Second
- file system
- fiber channel
- SAN (Storage Area Network)
- NAS (Network Attached Storage)
- bps (bits per second)
- pps (packets per second)
- packet loss
- end users
- not exist
There are 65535 TCP/IP ports in total (some are reserved). Running out of ports is another common issue.
There are “2” IPs: IPv6 and IPv4. A hostname may resolve to either or both, so one process might make 2 connection attempts, or 1, or 0.
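You can see both address families for a single name with the standard `socket.getaddrinfo` call (using `localhost` here so no network is needed; results depend on the host's configuration):

```python
import socket

# One hostname, possibly two address families: each tuple is a
# candidate the process might try to connect to, in order.
for family, _, _, _, sockaddr in socket.getaddrinfo(
    "localhost", 80, proto=socket.IPPROTO_TCP
):
    name = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}.get(family, str(family))
    print(name, sockaddr[0])
```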
- pegged CPU
- weird memory usage
- process state
Check how your program logs
Check system logs (system health)
Does it depend on other resources? Are those working?
- blocking synchronous calls
- slow asynchronous calls
- dependency services go down
- components in series go down
- proxy servers
The real problem might be surprising, like a DNS record issue or an expired certificate.