Open Data Infrastructure
Orphan Files and Storage Hygiene in the Lakehouse
Orphan files are the residue of failed writes, abandoned jobs, and imperfect operations. Leave them alone and they become waste. Delete them carelessly and they become data loss.
The most dangerous cleanup job in the lakehouse is the one that looks obvious.
Orphans are normal. Unsafe deletion is not.
Distributed systems leave residue. A task fails after writing a file. A job retries. A staging path gets abandoned. A migration writes data that never becomes part of a table snapshot. Eventually storage contains files that table metadata does not reference.
Those files cost money and create confusion. But deleting them without care can remove data that a write job still needs. Storage hygiene is not housekeeping. It is correctness work.
Core idea: orphan cleanup should be driven by table metadata, safety windows, and dry-run evidence, not by vibes in an object-store browser.
How orphan files appear
Orphan files usually come from predictable places:
- failed or aborted write jobs
- streaming jobs that restart during commits
- manual file copies during migrations
- changed staging paths or abandoned temporary directories
- maintenance jobs that were interrupted mid-run
The existence of orphan files does not mean the platform is broken. It means the platform needs a disciplined maintenance path.
The danger is false confidence
Object storage makes files look simple. A file has a path and a timestamp. That does not mean the file is safe to delete. Table formats use metadata to decide which files belong to which snapshot. Running a storage-level deletion job that ignores table metadata is how teams create silent corruption.
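To make the metadata-driven view concrete, here is a minimal sketch of orphan detection as a set difference: files listed in storage minus files the table metadata references. The function name, paths, and inputs are hypothetical. In a real table the referenced set has to cover every retained snapshot plus the metadata files themselves, and the result is only a candidate list, because files from in-progress writes are unreferenced too.

```python
from typing import Iterable, Set


def find_unreferenced(listed_paths: Iterable[str],
                      referenced_paths: Iterable[str]) -> Set[str]:
    """Return paths present in storage but absent from table metadata.

    The result is a list of *candidates*, not a deletion list: files from
    writes that have not committed yet are unreferenced as well.
    """
    return set(listed_paths) - set(referenced_paths)


# Hypothetical inputs: a storage listing and the files referenced by
# all retained snapshots of the table.
listed = [
    "s3://lake/db/events/data/a.parquet",
    "s3://lake/db/events/data/b.parquet",
    "s3://lake/db/events/data/c.parquet",
]
referenced = [
    "s3://lake/db/events/data/a.parquet",
    "s3://lake/db/events/data/b.parquet",
]

candidates = find_unreferenced(listed, referenced)
```

A path and a timestamp alone never appear in this comparison. The only thing that makes a file a candidate is its absence from the metadata.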
The Iceberg maintenance docs explicitly warn against retention intervals shorter than the time any write is expected to take, because files from in-progress writes can be mistaken for orphans and deleted. The documented default interval is three days. That warning should shape the runbook.
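The retention rule can be sketched as a simple age filter over unreferenced files. This is an illustration, not an engine API: the function name, the paths, and the window value are assumptions, and the window you choose should exceed your longest expected write.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def older_than_window(unreferenced: Dict[str, datetime],
                      now: datetime,
                      retention: timedelta) -> List[str]:
    """Keep only unreferenced files last modified before the cutoff."""
    cutoff = now - retention
    return sorted(path for path, modified in unreferenced.items()
                  if modified < cutoff)


# Hypothetical unreferenced files with their last-modified times (UTC).
now = datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc)
unreferenced = {
    "data/old.parquet": datetime(2024, 6, 1, 8, 0, tzinfo=timezone.utc),
    "data/in_flight.parquet": datetime(2024, 6, 10, 11, 30, tzinfo=timezone.utc),
}

# Assumed window of three days; a file written 30 minutes ago stays out.
safe_to_review = older_than_window(unreferenced, now, timedelta(days=3))
```

The recently written file is excluded even though metadata does not reference it, which is exactly the case the short-retention warning is about.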
A safe cleanup procedure
A practical procedure looks like this:
- use the table format or engine procedure designed for orphan detection
- start with dry runs and review candidate files
- use retention windows longer than any expected write duration
- exclude active staging and migration paths unless explicitly supported
- emit metrics for candidate count, deleted bytes, and failures
The cleanup job should produce evidence. If it cannot explain what it deleted and why, it is too dangerous to automate.
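One way to picture an evidence-producing job is a cleanup pass that defaults to a dry run and returns a report. The sketch below is a hypothetical helper, not part of any table format: `CleanupReport`, `run_cleanup`, the `_staging/` prefix, and the paths are all assumptions, and the candidates are assumed to have already passed the metadata and retention checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class CleanupReport:
    dry_run: bool
    candidate_count: int = 0
    candidate_bytes: int = 0
    deleted_bytes: int = 0
    skipped: List[str] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)


def run_cleanup(candidates: Dict[str, int],
                excluded_prefixes: Tuple[str, ...],
                delete: Optional[Callable[[str], None]] = None) -> CleanupReport:
    """Walk candidates (path -> size in bytes) and record what happened.

    With no `delete` callable the run is a dry run: nothing is removed,
    but the report still lists counts and bytes for review.
    """
    report = CleanupReport(dry_run=delete is None)
    for path, size in sorted(candidates.items()):
        # Active staging and migration paths are never touched.
        if path.startswith(excluded_prefixes):
            report.skipped.append(path)
            continue
        report.candidate_count += 1
        report.candidate_bytes += size
        if delete is None:
            continue
        try:
            delete(path)
            report.deleted_bytes += size
        except OSError as exc:
            report.failures.append(f"{path}: {exc}")
    return report


# Hypothetical candidates that already passed metadata and retention checks.
candidates = {
    "warehouse/events/data/x.parquet": 1024,
    "warehouse/events/data/y.parquet": 2048,
    "warehouse/_staging/job42/z.parquet": 4096,
}
report = run_cleanup(candidates, excluded_prefixes=("warehouse/_staging/",))
```

Making deletion opt-in through the `delete` argument is a design choice: the safe behavior is what happens when someone forgets a parameter, and the report fields map directly onto the metrics listed above.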
Storage hygiene is a platform habit
Orphan cleanup belongs with compaction, snapshot expiration, metadata cleanup, and observability. Treat it as recurring table maintenance with owners and runbooks.
Open lakehouses are not fragile because they expose files. They become fragile when teams treat open files as permission to bypass the table contract. The table metadata is the source of truth. Respect it.
Sources to start with
Use Iceberg maintenance and procedure docs as the safety baseline, then validate the cleanup path in your operating engine.
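As a starting point for that validation, the sketch below builds a dry-run call to Iceberg's `remove_orphan_files` Spark procedure. The `table`, `older_than`, and `dry_run` arguments follow the Iceberg Spark procedure docs as I understand them, so check them against the version you run. The helper name, catalog, and table names are hypothetical, and the function does no escaping, so it should only receive trusted values.

```python
from datetime import datetime


def build_orphan_dry_run_call(catalog: str, table: str,
                              older_than: datetime) -> str:
    """Build a Spark SQL statement that lists orphan candidates only.

    Assumes the Iceberg Spark procedure `remove_orphan_files` and its
    `table`, `older_than`, and `dry_run` arguments; verify against your
    engine and Iceberg version before relying on it.
    """
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{ts}', "
        f"dry_run => true)"
    )


# Hypothetical catalog and table names.
statement = build_orphan_dry_run_call(
    "my_catalog", "db.events", datetime(2024, 6, 7, 12, 0, 0)
)
```

With a Spark session configured for the catalog, the statement would be passed to `spark.sql(statement)` and the returned file locations reviewed before any run without `dry_run`.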