Skip to main content
PostgreSQL DBA Interview Questions
PostgreSQL DBA Interview Questions
1) PostgreSQL Architecture & Internals (25)
- Explain PostgreSQL’s process model—postmaster, backend processes, background workers, and key daemons (walwriter, checkpointer, autovacuum launcher, stats collector).
- Walk through how a client connection is established from libpq to backend.
- Describe the memory architecture—shared buffers, work_mem, maintenance_work_mem, temp buffers.
- What are hint bits? How and when are they set?
- Explain MVCC: snapshots, visibility rules, xmin/xmax, and tuple states.
- What is CTID? Why must client applications avoid relying on CTID?
- How does PostgreSQL implement transaction isolation at the storage level?
- What’s in pg_control and how is it used at startup/recovery?
- Explain the role and format of CLOG/pg_xact and pg_subtrans.
- How does the buffer manager choose pages for eviction?
- What triggers background writer vs checkpointer?
- Explain full_page_writes and its impact on performance and durability.
- Detail the path of an UPDATE statement from parse → plan → execute → WAL → visibility.
- What are the catalog tables most frequently read by the planner?
- How and when does PostgreSQL rewrite a whole table (heap rewrite)?
- Explain DSM (dynamic shared memory) and parallel query infrastructure.
- What are visibility map (VM) and free space map (FSM)? How are they maintained?
- How does HOT (Heap-Only Tuple) update work and when does it not apply?
- How are stats collected and used (pg_statistic, MCVs, histograms, ndistinct)?
- How are extensions loaded and what are the risks/controls?
- What does track_io_timing do and when should you enable it?
- Explain TOAST: why, when, and how does it affect storage/performance?
- How does PostgreSQL enforce WAL-before-data on persistent media?
- Describe how write amplification occurs inside PostgreSQL.
- When do temp files get created and where are they stored?
🔁 2) WAL, Checkpoints & Crash Recovery (20)
- How are WAL records structured? What’s an LSN?
- What triggers a checkpoint? How can you observe them?
- What is checkpoint_completion_target and how does it change flushing behavior?
- What is the difference between redo and full page images?
- Under which conditions does WAL generation spike unexpectedly?
- How do you analyze WAL rates and sources for spikes?
- Explain crash recovery phases—from reading pg_control to consistent state.
- What controls WAL segment size and recycling?
- Why can frequent checkpoints degrade performance?
- How do you tune max_wal_size / min_wal_size on large systems?
- How to debug “requested WAL segment has been removed” on a standby?
- What is wal_compression and when does it help/hurt?
- How do you ensure WAL archiving is correct and complete?
- How does synchronous_commit affect latency and data durability?
- Explain wal_buffers sizing basics.
- How to diagnose “WAL file could not be archived” loops?
- What are the key walwriter metrics to watch?
- How can you estimate RTO based on WAL volume and I/O?
- What is wal_keep_size vs replication slots behavior?
- Describe the interaction between WAL archiving and replication slots.
🧼 3) Vacuum, Autovacuum, Bloat & Freeze (25)
- Differentiate VACUUM, VACUUM FULL, ANALYZE, and FREEZE.
- How does autovacuum decide when to vacuum a table?
- What are the key autovacuum parameters for heavy OLTP?
- How to detect table vs index bloat—tools and system views?
- What is xid wraparound, how to detect risk, and how to avoid outages?
- Why can autovacuum lag for very large tables and how to fix it?
- Explain vacuum_freeze_* parameters.
- How to safely vacuum extremely large tables with minimal disruption?
- Why does VACUUM FULL block and how to avoid it?
- How does HOT update reduce index churn and why may it fail?
- How to tune autovacuum_vacuum_cost_limit and delay?
- What leads to vacuum conflicts on standbys?
- Why is autovacuum not running on a table (diagnostic approach)?
- How to manage freeze on partitioned tables?
- How to detect tuples “recently dead” vs “dead” vs “live” efficiently?
- Impact of aggressive autovacuum settings on throughput.
- How to measure vacuum effectiveness with pg_stat_* views?
- What causes vacuum to run forever (never-ending cycles)?
- When does ANALYZE become essential post-bulk load?
- How do visibility map inaccuracies affect performance?
- How do relfrozenxid and relminmxid work?
- Strategies for anti-wraparound maintenance windows.
- Managing vacuum on hot standby conflicts.
- Preventing vacuum storms after large batch operations.
- Rebuilding bloated indexes online—safest approaches.
🔎 4) Query Planning & Optimization (30)
- Steps of query planning—parser, rewriter, planner, executor.
- How does PostgreSQL estimate row counts? What causes misestimates?
- Explain the choice among nested loop, hash join, merge join.
- When should you force join order or hints (via extensions) and why?
- Why would PostgreSQL choose a seq scan with an available index?
- How do statistics targets and histograms influence plans?
- Role of extended statistics (MCV, ndistinct, functional deps).
- How to diagnose “bad” plans caused by parameter sniffing/variables?
- How to detect sort/hash aggregate spills to disk?
- How does parallel query work and when is it not used?
- What does effective_cache_size influence?
- How to debug plans using EXPLAIN (ANALYZE, BUFFERS, TIMING)?
- When to refactor CTEs post PG12 due to inlining behavior?
- When is bitmap heap scan chosen and how to tune for it?
- How does correlation (pg_stats.correlation) affect index scans?
- How to detect implicit casts causing index misses?
- How to reduce temp file creation for heavy ORDER BY / DISTINCT?
- What is JIT execution, when to enable/disable?
- How to identify top queries via pg_stat_statements—key columns to watch.
- Why does performance sometimes drop after adding an index?
- How to stabilize plans for predictable performance?
- What is plan cache invalidation and what triggers it?
- How to analyze function inlining and volatility attributes impact?
- Strategies to reduce full scans on time-series workloads.
- Fixing poor cardinality for skewed data distributions.
- How to benchmark plan alternatives safely in prod-like staging?
- Why does LIMIT/OFFSET hurt performance and alternatives?
- What is enable_seqscan/enable_indexscan/enable_hashjoin used for?
- What is track_io_timing and how to interpret results?
- Techniques to detect and fix memory-bound vs I/O-bound queries.
🗂️ 5) Indexing (B‑tree, GIN, GiST, BRIN, Hash) (25)
- When to use B‑tree vs GIN vs GiST vs BRIN vs Hash?
- Use-cases for partial indexes—pitfalls and benefits.
- Expression indexes—when and how to use them.
- Multicolumn index ordering—common mistakes and best practices.
- Why index-only scans may still hit heap—visibility map relation.
- Detecting redundant/overlapping indexes.
- Detecting unused indexes in write-heavy systems.
- Fillfactor tuning—tradeoffs for read vs write.
- BRIN indexes—requirements, query shapes, maintenance.
- GIN vs btree for text search—when to choose which?
- “Prefix” queries and index support with collations.
- Why some LIKE queries can’t use indexes and how to fix.
- Functional dependency stats and index design.
- Descending indexes and sort optimization.
- Index bloat—root causes, measurements, remedies.
- REINDEX CONCURRENTLY—when and how to apply.
- PITR/restore effects on index visibility map.
- JSONB indexing strategies—path ops, expression index.
- SP‑GiST and specialized spatial/indexing use-cases.
- Hash indexes—current capabilities and caveats.
- Covering index benefits and limits in PostgreSQL context.
- Handling very wide indexes and heap tuple width.
- Strategies for anti-hotspot PK design.
- Detecting index correlation and clustering for range queries.
- Index maintenance windows on VLDBs.
🔂 6) Replication: Streaming, Logical, Cascading, Slots (25)
- How streaming replication works—sender/receiver, WAL apply.
- Explain timeline IDs and forked timelines after promotion.
- Synchronous vs asynchronous replication—latency and RPO tradeoffs.
- Cascading replication—topologies and caveats.
- Replication slots—how they prevent WAL removal; dangers of disk fill.
- Standby feedback—how it prevents vacuum conflicts; side effects.
- Hot standby conflicts—detection and handling.
- Delayed standby—use-cases and configuration.
- Repointing apps after promotion—client-side strategies.
- Rejoining an old primary as a replica—safe approach.
- PITR vs physical replica creation tradeoffs.
- Logical replication—architecture and when to prefer it.
- Schema drift handling in logical replication.
- Filtering tables/publications in logical replication—consistency pitfalls.
- How to backfill large tables with logical replication online.
- Monitoring replication lag accurately (WAL vs apply).
- Network vs I/O vs CPU bottlenecks in replication.
- “Requested WAL segment removed”—root causes and fixes.
- Managing long transactions that block WAL recycling.
- Handling very large transactions across replicas.
- Sync standby priorities (synchronous_standby_names) proper use.
- Failover and switchover runbook key steps.
- Handling replication with different collations/encodings.
- Using archive_command + restore_command with replicas.
- Role of wal_keep_size vs slots on retention.
🛡️ 7) HA & Failover (Patroni, repmgr, etc.) (15)
- Patroni architecture—DCS (etcd/Consul/zk) role in leadership.
- Fencing and split-brain prevention strategies.
- What triggers Patroni failover and how to tune it?
- Rebuilding a failed node without full resync.
- Avoiding failover loops—common misconfigurations.
- repmgr switchover vs failover—differences and caveats.
- How to achieve zero/near-zero RPO & <10s RTO with PostgreSQL?
- Designing dual-region HA vs DR—tradeoffs.
- Load balancing read-only traffic safely.
- HA testing/chaos engineering runbooks—what to simulate?
- Client failover—pgBouncer/HAProxy/Consul integration.
- Handling long transactions during failover.
- Version drift risks across nodes in clusters.
- Preventing catalog corruption during forced failover.
- Observability: health checks that truly reflect write readiness.
💾 8) Backup, Restore, PITR & DR (20)
- pg_dump vs pg_basebackup—use-cases and limitations.
- pgBackRest vs Barman—architecture differences and selection.
- PITR end-to-end steps—recovery.conf → recovery.signal (PG >=12).
- WAL archiving verification—techniques and common errors.
- Backup encryption strategies—key rotation and restore tests.
- Restoring a single table from physical backups—patterns.
- Backup consistency under high write—what to validate?
- Retention planning—RPO/RTO alignment with wal retention.
- Restoring to a new hardware class—what to retune post-restore?
- Handling backup jobs blocking due to locks.
- Snapshot-based backups (LVM/ZFS/Cloud)—quiesce strategies.
- Validating backups regularly—restore drills.
- Cross-region DR runbooks—DNS, cutover, and client impact.
- Recovering from missing WAL archives—workarounds.
- Backup failures due to slots or archive_command—triage steps.
- Backup-window design for VLDB—throttling and scheduling.
- PITR with time vs LSN vs transaction ID—when to use each.
- Backups with logical replication—consistency practices.
- Monitoring backup metrics and early-warning signals.
- DR tabletop exercises—what to measure and document.
🧱 9) Partitioning, Sharding & VLDB Design (20)
- Choosing range vs list vs hash partitions—workload traits.
- Constraint exclusion vs partition pruning—differences and tuning.
- Designing partition keys for time-series and high-ingest.
- Migrating existing big tables to partitions online—approaches.
- Local vs global index implications with partitions.
- Hot partition problems—identifying and mitigating hotspots.
- Partition maintenance—attach/detach/rollover schedules.
- Tablespaces with partitions—placement strategies.
- Handling extremely wide tables—row vs columnar strategies (FDWs).
- Sharding patterns: logical vs physical; middleware options.
- Ensuring global uniqueness across shards.
- Rebalancing shards with minimal downtime.
- Cross-shard queries and aggregation strategies.
- Statistics on partitions—global vs per-partition.
- Table inheritance vs declarative partitioning—legacy migrations.
- Impact of partitions on vacuum/analyze/autovacuum.
- Archival pipelines—cold storage and queryable archive.
- Hybrid OLTP/OLAP in PostgreSQL—do’s and don’ts.
- De-duplication at scale—index strategies and constraints.
☁️ 10) PostgreSQL on Cloud (RDS, Aurora, Cloud SQL, Azure PG) (20)
- RDS vs Aurora vs self-managed—tradeoffs for OLTP.
- RDS failover mechanics and typical downtime windows.
- Aurora storage layer differences vs community PostgreSQL.
- Parameter groups and dynamic vs static param changes.
- Cross-region read replicas—latency and consistency expectations.
- RDS storage full recovery steps and prevention.
- Cloud IOPS throttling—detecting and mitigating.
- Cloud snapshot restore gotchas—collations, extensions, versions.
- Cloud monitoring and alarms that matter for PostgreSQL.
- Dealing with noisy neighbors on shared cloud instances.
- Aurora Global Database—lag, failover, and cutover patterns.
- Cloud connection pooling—RDS Proxy / pgbouncer on EC2/K8s.
- Encryption at rest & in-transit—key management and rotations.
- Auto-scaling myths—limits for PostgreSQL on cloud.
- Running controlled chaos tests in managed services.
- Cost/perf tradeoffs—instance class vs storage tier vs caching.
- Cloud SQL/Azure PG limitations vs community features.
- Observability in managed clouds—what you lose and regain.
- Impact of cloud maintenance windows on throughput.
- Strategies to keep WAL costs under control in cloud.
🔒 11) Security, Roles, RLS & Auditing (15)
- SCRAM-SHA-256 vs md5—migration strategy and pitfalls.
- pg_hba.conf best practices—CIDR, hostssl, hostnossl.
- Role design—LOGIN/NOLOGIN, INHERIT, SUPERUSER risks.
- RLS policies design—performance and correctness.
- Secure search_path and function volatility to avoid privilege escalation.
- Auditing with pgaudit—what to log and how to store.
- Client cert auth—implement and maintain at scale.
- Least privilege for application roles—pattern examples.
- Detecting and removing unused roles safely.
- Preventing SQL injection at database layer—practical controls.
- Managing function SECURITY DEFINER safely.
- Auditing DDL and role changes for compliance.
- Managing secrets/creds in CI/CD pipelines.
- Rotating credentials and certs with no downtime.
- Tracing who dropped/altered critical objects.
🧪 12) Extensions, FDW, Timescale/PGVector/Etc. (10)
- Managing extensions lifecycle in prod—version pinning and drift.
- postgres_fdw basics—performance and consistency limits.
- Using TimescaleDB for time-series—benefits/tradeoffs.
- pg_partman—benefits over native partition management.
- pg_stat_statements—overhead and best practices.
- pg_repack vs REINDEX CONCURRENTLY—when to choose what.
- pg_cron vs OS schedulers—operational differences.
- pgvector basics—index choices and storage cost.
- HypoPG for hypothetical indexes—safe evaluation workflows.
- pgbouncer modes—session vs transaction pooling impact.
- Assessing extension risk during major upgrades.
🧰 13) DevOps, CI/CD, Migrations & Observability (10)
- Safe schema deployments—minimizing metadata locks.
- Migration tooling patterns—sqitch/Flyway/Liquibase with PG.
- Blue/green DB deployments and cutover plans.
- Detecting schema drift across environments automatically.
- Observability stack—pg_stat_statements, auto_explain, pgBadger, PGMonitor.
- Performance SLOs and SLIs for PostgreSQL—how to define and enforce.
- Rehearsing DR and failover in CI—data subsets and automation.
- Canary queries to detect plan regressions early.
- Capturing perf baselines before/after changes.
- Runbooks for “high CPU”, “high I/O wait”, “too many connections”.
🎯 14) Senior-Level Scenario Questions (Bonus, Mixed Topics) (10)
- Standby is 30 minutes behind; app SLA is 5 seconds—what’s your approach?
- CPU spikes at peak hours; no query shape changed—how to isolate?
- pg_wal grows unbounded; archiving seems fine—where do you look?
- After upgrade, some queries are 10x slower—triage steps?
- pg_stat_statements shows low mean time but high 99p—what next?
- VACUUM conflicts repeatedly on standby—solutions?
- Hot partition causing lock contention—practical fixes?
- Unused indexes increase by 30%—what do you do?
- Logical replication misses updates for a table—where to look?
- Failed failover left cluster unstable—stabilize and prevent repeat.
Comments
Post a Comment