How to Use Logs Effectively Instead of Guessing in the Dark
A user reported a bug at 2am. I pulled up the logs. Here's what I found:
Starting...
Processing...
Done.
That's it. No user ID. No request ID. No timestamps. No indication of what "processing" meant. I spent the next 4 hours adding logs, redeploying, and trying to reproduce the issue, just to get enough information to understand what happened.
In production, logs are your only witness. If they're useless, you're a detective with no evidence.
Why Print Statements Aren't Enough
print("here") works for quick debugging. It fails in production:
- No context: "Error: 500" tells you nothing. Which user? Which endpoint?
- No levels: A crash and a minor warning look identical
- No filtering: You can't search for just errors
- No persistence: Stdout vanishes unless captured
Use Log Levels
Proper logging frameworks have levels:
- DEBUG: Granular details for developers. Usually off in production.
- INFO: Normal operations. "Server started", "Job completed"
- WARNING: Something unexpected, but handled. "Disk 80% full", "Retrying connection"
- ERROR: Operation failed. "Database refused connection", "Payment declined"
- CRITICAL: App can't continue. "Config file missing", "Out of memory"
When investigating a crash, you filter to ERROR and above. You don't wade through a million "User logged in" messages.
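As a minimal sketch of that filtering, here is Python's standard logging module with a handler that keeps only ERROR and above. The logger name "payments" and the in-memory stream are assumptions for the demo; a real service would write to stderr or a file.

```python
import io
import logging

# An in-memory stream so the result can be inspected; illustrative only.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)

logger = logging.getLogger("payments")  # hypothetical logger name
logger.setLevel(logging.DEBUG)   # the logger itself lets every level through
logger.addHandler(handler)
logger.propagate = False         # keep demo output away from the root logger

handler.setLevel(logging.ERROR)  # this handler drops anything below ERROR

logger.debug("Cache miss for key user:123")
logger.info("User logged in")
logger.warning("Retrying connection")
logger.error("Database refused connection")
logger.critical("Config file missing")

output = stream.getvalue()
print(output)
```

Only the ERROR and CRITICAL lines reach the stream. Lowering the handler's level to logging.INFO or logging.DEBUG brings the rest back without touching any call site.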
Structured Logging
The biggest upgrade: switch from text to JSON.
Text log:
[2025-04-15 10:00:00] ERROR: Payment failed for user 123. Reason: Timeout.
Readable by humans. Terrible for machines. Want to count failures by reason? Write regex.
JSON log:
{
"timestamp": "2025-04-15T10:00:00Z",
"level": "ERROR",
"event": "payment_failed",
"user_id": 123,
"reason": "timeout"
}
Feed this into Datadog, Splunk, or ELK and query by field, for example level="ERROR" AND reason="timeout" (the exact query syntax varies by tool). Build dashboards. Set alerts.
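One way to produce logs like this with only the standard library is a custom formatter that emits one JSON object per line. This is a sketch, not a production setup: the JsonFormatter class and the "fields" key passed through extra are conventions invented for this example, and dedicated structured-logging libraries offer more.

```python
import io
import json
import logging
from collections import Counter
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Merge in anything passed as extra={"fields": {...}} (our own convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("structured_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error("payment_failed", extra={"fields": {"user_id": 123, "reason": "timeout"}})
logger.error("payment_failed", extra={"fields": {"user_id": 456, "reason": "card_declined"}})
logger.error("payment_failed", extra={"fields": {"user_id": 789, "reason": "timeout"}})

# Counting failures by reason is now a dictionary lookup, not a regex.
entries = [json.loads(line) for line in stream.getvalue().splitlines()]
failures_by_reason = Counter(
    e["reason"] for e in entries if e["level"] == "ERROR"
)
print(failures_by_reason)
```

The same field names work whether the consumer is a ten-line script like this one or a log platform, which is the point of choosing a machine-readable format.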
Correlation IDs
In microservices, one user action triggers logs in five services. How do you connect them?
Generate a unique ID when a request enters your system. Pass it to every downstream service, typically in a request header, and include it in every log:
{"request_id": "abc-123", "service": "auth", "event": "login_success"}
{"request_id": "abc-123", "service": "orders", "event": "fetching_history"}
{"request_id": "abc-123", "service": "orders", "event": "db_timeout", "level": "ERROR"}
Search for abc-123, see the entire journey across your system.
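Within a single Python service, one way to get the ID onto every log line without passing it to each function is a context variable plus a logging filter. The sketch below assumes a hypothetical handle_request entry point and uses a plain-text format for brevity; how the ID travels between services (for example in a request header) depends on your stack.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled; "-" means "no request".
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record passing through."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True  # never drop the record, only annotate it


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.addFilter(RequestIdFilter())
handler.setFormatter(
    logging.Formatter("%(request_id)s %(name)s %(levelname)s %(message)s")
)

logger = logging.getLogger("orders")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(incoming_id=None):
    """Hypothetical entry point: reuse the caller's ID or mint a new one."""
    request_id = incoming_id or str(uuid.uuid4())
    token = request_id_var.set(request_id)
    try:
        logger.info("fetching_history")
        logger.error("db_timeout")
    finally:
        request_id_var.reset(token)  # do not leak the ID into the next request
    return request_id


handle_request("abc-123")
lines = stream.getvalue().splitlines()
print("\n".join(lines))
```

Because the ID lives in a context variable, code deep in the call stack logs it without knowing it exists, and concurrent asyncio tasks each see their own value.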
What to Log
Do log:
- Entry and exit points (API requests/responses)
- External calls (database, third-party APIs)
- Business decisions ("User is premium, applying discount")
- Errors with full context
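For external calls in particular, a small wrapper can record the start, outcome, and duration in one place. This is an illustrative sketch: logged_call, the call names, and the context fields are all made up for the example.

```python
import io
import logging
import time
from contextlib import contextmanager

stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("external_calls_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


@contextmanager
def logged_call(name, **context):
    """Log the start, outcome, and duration of one external call."""
    start = time.perf_counter()
    logger.info("%s started %s", name, context)
    try:
        yield
    except Exception:
        elapsed_ms = (time.perf_counter() - start) * 1000
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("%s failed after %.1f ms %s", name, elapsed_ms, context)
        raise
    else:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("%s finished in %.1f ms %s", name, elapsed_ms, context)


with logged_call("db.fetch_orders", user_id=123):
    pass  # the real query would go here

try:
    with logged_call("payments.charge", user_id=123):
        raise TimeoutError("gateway did not respond")
except TimeoutError:
    pass  # the caller decides how to recover

output = stream.getvalue()
print(output)
```

The exception is re-raised after logging, so the wrapper adds evidence without changing how the caller handles the failure.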
Don't log:
- Passwords (even hashed)
- API keys
- Credit card numbers
- PII unless necessary and compliant
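A simple guard is to mask known-sensitive fields before they reach the logger. The key list below is hypothetical and deliberately short; key-based masking also cannot catch secrets embedded in free-text messages, so treat this as a safety net rather than a guarantee.

```python
# Hypothetical list of field names to mask; adapt it to your own data model.
SENSITIVE_KEYS = {"password", "api_key", "card_number"}


def redact(fields):
    """Return a copy of `fields` with sensitive values masked.

    Recurses into nested dicts and leaves the input untouched.
    """
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


event = {
    "user_id": 123,
    "password": "hunter2",
    "payment": {"card_number": "4111111111111111", "amount": 42},
}
safe_event = redact(event)
print(safe_event)
```

Calling redact at the point where fields are handed to the logger keeps the rule in one place instead of relying on every call site to remember it.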
Log the Stack Trace
When catching exceptions, don't just log the message:
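import logging

logger = logging.getLogger(__name__)

# ❌ Loses the location
try:
    risky_operation()  # placeholder for whatever might fail
except Exception as e:
    logger.error(f"Failed: {e}")

# ✅ Includes full stack trace
try:
    risky_operation()
except Exception:
    logger.exception("Operation failed")  # Logs at ERROR level and appends the traceback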
The message tells you what happened. The stack trace tells you exactly where.
The Payoff
Good logs turn debugging from hours to minutes. When the 2am alert fires, you search for the request ID, see exactly what happened, and fix it. No guessing. No reproducing. Just evidence.
Treat logs as a feature, not an afterthought.