PostgreSQL DBA Interview Questions

 PostgreSQL DBA Interview Questions


1) PostgreSQL Architecture & Internals (25)

  1. Explain PostgreSQL’s process model—postmaster, backend processes, background workers, and key daemons (walwriter, checkpointer, autovacuum launcher, stats collector).
  2. Walk through how a client connection is established from libpq to backend.
  3. Describe the memory architecture—shared buffers, work_mem, maintenance_work_mem, temp buffers.
  4. What are hint bits? How and when are they set?
  5. Explain MVCC: snapshots, visibility rules, xmin/xmax, and tuple states.
  6. What is CTID? Why must client applications avoid relying on CTID?
  7. How does PostgreSQL implement transaction isolation at the storage level?
  8. What’s in pg_control and how is it used at startup/recovery?
  9. Explain the role and format of CLOG/pg_xact and pg_subtrans.
  10. How does the buffer manager choose pages for eviction?
  11. What triggers background writer vs checkpointer?
  12. Explain full_page_writes and its impact on performance and durability.
  13. Detail the path of an UPDATE statement from parse → plan → execute → WAL → visibility.
  14. What are the catalog tables most frequently read by the planner?
  15. How and when does PostgreSQL rewrite a whole table (heap rewrite)?
  16. Explain DSM (dynamic shared memory) and parallel query infrastructure.
  17. What are visibility map (VM) and free space map (FSM)? How are they maintained?
  18. How does HOT (Heap-Only Tuple) update work and when does it not apply?
  19. How are stats collected and used (pg_statistic, MCVs, histograms, ndistinct)?
  20. How are extensions loaded and what are the risks/controls?
  21. What does track_io_timing do and when should you enable it?
  22. Explain TOAST: why, when, and how does it affect storage/performance?
  23. How does PostgreSQL enforce WAL-before-data on persistent media?
  24. Describe how write amplification occurs inside PostgreSQL.
  25. When do temp files get created and where are they stored?

🔁 2) WAL, Checkpoints & Crash Recovery (20)

  1. How are WAL records structured? What’s an LSN?
  2. What triggers a checkpoint? How can you observe them?
  3. What is checkpoint_completion_target and how does it change flushing behavior?
  4. What is the difference between redo and full page images?
  5. Under which conditions does WAL generation spike unexpectedly?
  6. How do you analyze WAL rates and sources for spikes?
  7. Explain crash recovery phases—from reading pg_control to consistent state.
  8. What controls WAL segment size and recycling?
  9. Why can frequent checkpoints degrade performance?
  10. How do you tune max_wal_size / min_wal_size on large systems?
  11. How to debug “requested WAL segment has been removed” on a standby?
  12. What is wal_compression and when does it help/hurt?
  13. How do you ensure WAL archiving is correct and complete?
  14. How does synchronous_commit affect latency and data durability?
  15. Explain wal_buffers sizing basics.
  16. How to diagnose “WAL file could not be archived” loops?
  17. What are the key walwriter metrics to watch?
  18. How can you estimate RTO based on WAL volume and I/O?
  19. What is wal_keep_size vs replication slots behavior?
  20. Describe the interaction between WAL archiving and replication slots.

🧼 3) Vacuum, Autovacuum, Bloat & Freeze (25)

  1. Differentiate VACUUM, VACUUM FULL, ANALYZE, and FREEZE.
  2. How does autovacuum decide when to vacuum a table?
  3. What are the key autovacuum parameters for heavy OLTP?
  4. How to detect table vs index bloat—tools and system views?
  5. What is xid wraparound, how to detect risk, and how to avoid outages?
  6. Why can autovacuum lag for very large tables and how to fix it?
  7. Explain vacuum_freeze_* parameters.
  8. How to safely vacuum extremely large tables with minimal disruption?
  9. Why does VACUUM FULL block and how to avoid it?
  10. How does HOT update reduce index churn and why may it fail?
  11. How to tune autovacuum_vacuum_cost_limit and delay?
  12. What leads to vacuum conflicts on standbys?
  13. Why is autovacuum not running on a table (diagnostic approach)?
  14. How to manage freeze on partitioned tables?
  15. How to detect tuples “recently dead” vs “dead” vs “live” efficiently?
  16. Impact of aggressive autovacuum settings on throughput.
  17. How to measure vacuum effectiveness with pg_stat_* views?
  18. What causes vacuum to run forever (never-ending cycles)?
  19. When does ANALYZE become essential post-bulk load?
  20. How do visibility map inaccuracies affect performance?
  21. How do relfrozenxid and relminmxid work?
  22. Strategies for anti-wraparound maintenance windows.
  23. Managing vacuum on hot standby conflicts.
  24. Preventing vacuum storms after large batch operations.
  25. Rebuilding bloated indexes online—safest approaches.

🔎 4) Query Planning & Optimization (30)

  1. Steps of query planning—parser, rewriter, planner, executor.
  2. How does PostgreSQL estimate row counts? What causes misestimates?
  3. Explain the choice among nested loop, hash join, merge join.
  4. When should you force join order or hints (via extensions) and why?
  5. Why would PostgreSQL choose a seq scan with an available index?
  6. How do statistics targets and histograms influence plans?
  7. Role of extended statistics (MCV, ndistinct, functional deps).
  8. How to diagnose “bad” plans caused by parameter sniffing/variables?
  9. How to detect sort/hash aggregate spills to disk?
  10. How does parallel query work and when is it not used?
  11. What does effective_cache_size influence?
  12. How to debug plans using EXPLAIN (ANALYZE, BUFFERS, TIMING)?
  13. When to refactor CTEs post PG12 due to inlining behavior?
  14. When is bitmap heap scan chosen and how to tune for it?
  15. How does correlation (pg_stats.correlation) affect index scans?
  16. How to detect implicit casts causing index misses?
  17. How to reduce temp file creation for heavy ORDER BY / DISTINCT?
  18. What is JIT execution, when to enable/disable?
  19. How to identify top queries via pg_stat_statements—key columns to watch.
  20. Why does performance sometimes drop after adding an index?
  21. How to stabilize plans for predictable performance?
  22. What is plan cache invalidation and what triggers it?
  23. How to analyze function inlining and volatility attributes impact?
  24. Strategies to reduce full scans on time-series workloads.
  25. Fixing poor cardinality for skewed data distributions.
  26. How to benchmark plan alternatives safely in prod-like staging?
  27. Why does LIMIT/OFFSET hurt performance and alternatives?
  28. What is enable_seqscan/enable_indexscan/enable_hashjoin used for?
  29. What is track_io_timing and how to interpret results?
  30. Techniques to detect and fix memory-bound vs I/O-bound queries.

🗂️ 5) Indexing (B‑tree, GIN, GiST, BRIN, Hash) (25)

  1. When to use B‑tree vs GIN vs GiST vs BRIN vs Hash?
  2. Use-cases for partial indexes—pitfalls and benefits.
  3. Expression indexes—when and how to use them.
  4. Multicolumn index ordering—common mistakes and best practices.
  5. Why index-only scans may still hit heap—visibility map relation.
  6. Detecting redundant/overlapping indexes.
  7. Detecting unused indexes in write-heavy systems.
  8. Fillfactor tuning—tradeoffs for read vs write.
  9. BRIN indexes—requirements, query shapes, maintenance.
  10. GIN vs btree for text search—when to choose which?
  11. “Prefix” queries and index support with collations.
  12. Why some LIKE queries can’t use indexes and how to fix.
  13. Functional dependency stats and index design.
  14. Descending indexes and sort optimization.
  15. Index bloat—root causes, measurements, remedies.
  16. REINDEX CONCURRENTLY—when and how to apply.
  17. PITR/restore effects on index visibility map.
  18. JSONB indexing strategies—path ops, expression index.
  19. SP‑GiST and specialized spatial/indexing use-cases.
  20. Hash indexes—current capabilities and caveats.
  21. Covering index benefits and limits in PostgreSQL context.
  22. Handling very wide indexes and heap tuple width.
  23. Strategies for anti-hotspot PK design.
  24. Detecting index correlation and clustering for range queries.
  25. Index maintenance windows on VLDBs.

🔂 6) Replication: Streaming, Logical, Cascading, Slots (25)

  1. How streaming replication works—sender/receiver, WAL apply.
  2. Explain timeline IDs and forked timelines after promotion.
  3. Synchronous vs asynchronous replication—latency and RPO tradeoffs.
  4. Cascading replication—topologies and caveats.
  5. Replication slots—how they prevent WAL removal; dangers of disk fill.
  6. Standby feedback—how it prevents vacuum conflicts; side effects.
  7. Hot standby conflicts—detection and handling.
  8. Delayed standby—use-cases and configuration.
  9. Repointing apps after promotion—client-side strategies.
  10. Rejoining an old primary as a replica—safe approach.
  11. PITR vs physical replica creation tradeoffs.
  12. Logical replication—architecture and when to prefer it.
  13. Schema drift handling in logical replication.
  14. Filtering tables/publications in logical replication—consistency pitfalls.
  15. How to backfill large tables with logical replication online.
  16. Monitoring replication lag accurately (WAL vs apply).
  17. Network vs I/O vs CPU bottlenecks in replication.
  18. “Requested WAL segment removed”—root causes and fixes.
  19. Managing long transactions that block WAL recycling.
  20. Handling very large transactions across replicas.
  21. Sync standby priorities (synchronous_standby_names) proper use.
  22. Failover and switchover runbook key steps.
  23. Handling replication with different collations/encodings.
  24. Using archive_command + restore_command with replicas.
  25. Role of wal_keep_size vs slots on retention.

🛡️ 7) HA & Failover (Patroni, repmgr, etc.) (15)

  1. Patroni architecture—DCS (etcd/Consul/zk) role in leadership.
  2. Fencing and split-brain prevention strategies.
  3. What triggers Patroni failover and how to tune it?
  4. Rebuilding a failed node without full resync.
  5. Avoiding failover loops—common misconfigurations.
  6. repmgr switchover vs failover—differences and caveats.
  7. How to achieve zero/near-zero RPO & <10s RTO with PostgreSQL?
  8. Designing dual-region HA vs DR—tradeoffs.
  9. Load balancing read-only traffic safely.
  10. HA testing/chaos engineering runbooks—what to simulate?
  11. Client failover—pgBouncer/HAProxy/Consul integration.
  12. Handling long transactions during failover.
  13. Version drift risks across nodes in clusters.
  14. Preventing catalog corruption during forced failover.
  15. Observability: health checks that truly reflect write readiness.

💾 8) Backup, Restore, PITR & DR (20)

  1. pg_dump vs pg_basebackup—use-cases and limitations.
  2. pgBackRest vs Barman—architecture differences and selection.
  3. PITR end-to-end steps—recovery.conf → recovery.signal (PG >=12).
  4. WAL archiving verification—techniques and common errors.
  5. Backup encryption strategies—key rotation and restore tests.
  6. Restoring a single table from physical backups—patterns.
  7. Backup consistency under high write—what to validate?
  8. Retention planning—RPO/RTO alignment with wal retention.
  9. Restoring to a new hardware class—what to retune post-restore?
  10. Handling backup jobs blocking due to locks.
  11. Snapshot-based backups (LVM/ZFS/Cloud)—quiesce strategies.
  12. Validating backups regularly—restore drills.
  13. Cross-region DR runbooks—DNS, cutover, and client impact.
  14. Recovering from missing WAL archives—workarounds.
  15. Backup failures due to slots or archive_command—triage steps.
  16. Backup-window design for VLDB—throttling and scheduling.
  17. PITR with time vs LSN vs transaction ID—when to use each.
  18. Backups with logical replication—consistency practices.
  19. Monitoring backup metrics and early-warning signals.
  20. DR tabletop exercises—what to measure and document.

🧱 9) Partitioning, Sharding & VLDB Design (20)

  1. Choosing range vs list vs hash partitions—workload traits.
  2. Constraint exclusion vs partition pruning—differences and tuning.
  3. Designing partition keys for time-series and high-ingest.
  4. Migrating existing big tables to partitions online—approaches.
  5. Local vs global index implications with partitions.
  6. Hot partition problems—identifying and mitigating hotspots.
  7. Partition maintenance—attach/detach/rollover schedules.
  8. Tablespaces with partitions—placement strategies.
  9. Handling extremely wide tables—row vs columnar strategies (FDWs).
  10. Sharding patterns: logical vs physical; middleware options.
  11. Ensuring global uniqueness across shards.
  12. Rebalancing shards with minimal downtime.
  13. Cross-shard queries and aggregation strategies.
  14. Statistics on partitions—global vs per-partition.
  15. Table inheritance vs declarative partitioning—legacy migrations.
  16. Impact of partitions on vacuum/analyze/autovacuum.
  17. Archival pipelines—cold storage and queryable archive.
  18. Hybrid OLTP/OLAP in PostgreSQL—do’s and don’ts.
  19. De-duplication at scale—index strategies and constraints.

☁️ 10) PostgreSQL on Cloud (RDS, Aurora, Cloud SQL, Azure PG) (20)

  1. RDS vs Aurora vs self-managed—tradeoffs for OLTP.
  2. RDS failover mechanics and typical downtime windows.
  3. Aurora storage layer differences vs community PostgreSQL.
  4. Parameter groups and dynamic vs static param changes.
  5. Cross-region read replicas—latency and consistency expectations.
  6. RDS storage full recovery steps and prevention.
  7. Cloud IOPS throttling—detecting and mitigating.
  8. Cloud snapshot restore gotchas—collations, extensions, versions.
  9. Cloud monitoring and alarms that matter for PostgreSQL.
  10. Dealing with noisy neighbors on shared cloud instances.
  11. Aurora Global Database—lag, failover, and cutover patterns.
  12. Cloud connection pooling—RDS Proxy / pgbouncer on EC2/K8s.
  13. Encryption at rest & in-transit—key management and rotations.
  14. Auto-scaling myths—limits for PostgreSQL on cloud.
  15. Running controlled chaos tests in managed services.
  16. Cost/perf tradeoffs—instance class vs storage tier vs caching.
  17. Cloud SQL/Azure PG limitations vs community features.
  18. Observability in managed clouds—what you lose and regain.
  19. Impact of cloud maintenance windows on throughput.
  20. Strategies to keep WAL costs under control in cloud.

🔒 11) Security, Roles, RLS & Auditing (15)

  1. SCRAM-SHA-256 vs md5—migration strategy and pitfalls.
  2. pg_hba.conf best practices—CIDR, hostssl, hostnossl.
  3. Role design—LOGIN/NOLOGIN, INHERIT, SUPERUSER risks.
  4. RLS policies design—performance and correctness.
  5. Secure search_path and function volatility to avoid privilege escalation.
  6. Auditing with pgaudit—what to log and how to store.
  7. Client cert auth—implement and maintain at scale.
  8. Least privilege for application roles—pattern examples.
  9. Detecting and removing unused roles safely.
  10. Preventing SQL injection at database layer—practical controls.
  11. Managing function SECURITY DEFINER safely.
  12. Auditing DDL and role changes for compliance.
  13. Managing secrets/creds in CI/CD pipelines.
  14. Rotating credentials and certs with no downtime.
  15. Tracing who dropped/altered critical objects.

🧪 12) Extensions, FDW, Timescale/PGVector/Etc. (10)

  1. Managing extensions lifecycle in prod—version pinning and drift.
  2. postgres_fdw basics—performance and consistency limits.
  3. Using TimescaleDB for time-series—benefits/tradeoffs.
  4. pg_partman—benefits over native partition management.
  5. pg_stat_statements—overhead and best practices.
  6. pg_repack vs REINDEX CONCURRENTLY—when to choose what.
  7. pg_cron vs OS schedulers—operational differences.
  8. pgvector basics—index choices and storage cost.
  9. HypoPG for hypothetical indexes—safe evaluation workflows.
  10. pgbouncer modes—session vs transaction pooling impact.
  11. Assessing extension risk during major upgrades.

🧰 13) DevOps, CI/CD, Migrations & Observability (10)

  1. Safe schema deployments—minimizing metadata locks.
  2. Migration tooling patterns—sqitch/Flyway/Liquibase with PG.
  3. Blue/green DB deployments and cutover plans.
  4. Detecting schema drift across environments automatically.
  5. Observability stack—pg_stat_statements, auto_explain, pgBadger, PGMonitor.
  6. Performance SLOs and SLIs for PostgreSQL—how to define and enforce.
  7. Rehearsing DR and failover in CI—data subsets and automation.
  8. Canary queries to detect plan regressions early.
  9. Capturing perf baselines before/after changes.
  10. Runbooks for “high CPU”, “high I/O wait”, “too many connections”.

🎯 14) Senior-Level Scenario Questions (Bonus, Mixed Topics) (10)

  1. Standby is 30 minutes behind; app SLA is 5 seconds—what’s your approach?
  2. CPU spikes at peak hours; no query shape changed—how to isolate?
  3. pg_wal grows unbounded; archiving seems fine—where do you look?
  4. After upgrade, some queries are 10x slower—triage steps?
  5. pg_stat_statements shows low mean time but high 99p—what next?
  6. VACUUM conflicts repeatedly on standby—solutions?
  7. Hot partition causing lock contention—practical fixes?
  8. Unused indexes increase by 30%—what do you do?
  9. Logical replication misses updates for a table—where to look?
  10. Failed failover left cluster unstable—stabilize and prevent repeat.

Comments