2026-05-21

Turning a loader that OOMed 6 times into a 52-second job

A Parquet-to-Postgres rewrite: from batch to streaming execution, from Python to Rust.

The problem

I recently worked on a data engineering problem: cleaned Parquet files in object storage needed to be loaded into Postgres on a schedule, for downstream MLE workflows.

The pain was loading large files. One table had about 5.17 million rows. The Python version ran in a 4GB serverless function and OOMed 6 times out of 6. Deployment was slow too: the Python data stack carried pyarrow, pandas, database drivers, and other dependencies. A local upload could take tens of minutes.

There were a few options:

Option	Fixes OOM?	Fixes slow deploys?	Trade-off
Raise memory from 4GB to 8GB	Maybe, for now	No	Fastest, but only moves the failure point
Rewrite Python as streaming	Yes	No	Smaller change, still ships heavy dependencies
Rewrite in Rust	Yes	Yes	More work, but fixes runtime and deployment together

I chose Rust not because Python cannot do this, but because this function was blocked by both memory and deployment. Fixing only one would leave the other slowing iteration down.

The rewrite changed the numbers like this:

Metric	Old Python version	New Rust version
Large table size	5,174,400 rows	5,174,400 rows
Run result	6 attempts, all 4GB OOM	One run, about 52 seconds
Deploy zip	About 77.8MB	About 2.5MB
Local upload	About 40 minutes	About 3.5 seconds
Runtime dependencies	Python data stack + packages	Single static binary

These numbers can look like “Rust is faster than Python.” I prefer to split the change into two parts:

Execution model: from “materialize the whole batch in memory” to “read and write as a stream.”
Deployment unit: from a heavy Python data stack to a small static binary.

The first fixed OOM. The second fixed slow deploys and heavy dependencies. They just happened in the same rewrite.

Why the old pipeline blew up

The old version looked roughly like this:

tables = [pq.read_table(file) for file in files]
df = pa.concat_tables(tables).to_pandas()
cursor.copy_from(StringIO(df.to_csv()), table_name)

The problem is that it does not read a little and write a little. It reads all Parquet files into Arrow tables, combines them into one large table, converts that to a pandas DataFrame, turns the whole thing into one big CSV string, and only then sends it to Postgres.

One way to picture it:

You need to move a truckload of sand from A to B. The old method is not a conveyor belt. It dumps the sand into the living room, moves it into another bucket, spreads it over a plastic sheet, then drags the whole sheet to B.

Each step creates another large copy:

Arrow data decoded from Parquet.
One combined Arrow table.
A pandas DataFrame.
A CSV string.

5.17 million rows is not necessarily huge. The issue is that several versions of the same data exist at the same time. 4GB is not a warehouse. It is a small box. Put several large tables in it and it will OOM.

Batch vs streaming execution

When people see “streaming”, they often assume the batch job became a stream-processing system.

It did not.

This is still a batch job. The input is a fixed set of Parquet files. The output is one Postgres table. It is not an endless Kafka-style stream, and it does not update downstream state for every event.

Keep the batch boundary. Use streaming execution to lower peak memory.

Batch processing and streaming execution are not opposites. They answer different questions.

Batch processing answers: what is the business commit unit?

For example: “the target table is updated only if all files in this batch succeed.” That is the batch boundary.

Streaming execution answers: how does the program move data internally?

For example: “hold a small chunk, convert it, write it, then move on.” That is the execution model.

The new pipeline is roughly:

Parquet files in object storage
  -> decode into Arrow RecordBatches
  -> turn each batch into CSV bytes
  -> pass bytes through a bounded queue
  -> stream into Postgres COPY FROM STDIN

A few terms:

Parquet is a columnar storage format. Values from the same column are stored together, which makes compression efficient and partial scans fast.

Arrow is a standard in-memory table format. Many data tools can read and write it, so it reduces type-conversion churn.

RecordBatch is a small chunk of Arrow data. The important word is “small”: the program holds one piece, not the whole table.

Streaming means data flows through the program. The program holds the current piece, processes it, then lets it go.

Backpressure means the upstream side slows down when the downstream side is slower. The queue in the middle has a capacity limit, so producers cannot pile up data forever.

Postgres COPY is Postgres’ bulk import path. It is much faster than row-by-row INSERT because it avoids a lot of SQL parsing, round trips, and per-row protocol overhead. For bulk table loads, COPY is usually the default path to consider.

The change is not one clever line of code. It is a different shape:

Old: turn everything into one giant object, then write it. New: keep data moving; feed downstream at the speed it can consume.

Does streaming make retries harder?

Yes. It forces you to design the boundary more carefully.

Batch has a clear boundary: prepare the full batch, then write it. If it fails, rerun the batch. The downside is obvious: if “prepare” means holding the full batch in memory, peak memory gets high.

Streaming keeps memory stable, but failures can happen halfway through: some data has been read, some has been written, then the network drops, the function times out, or the database connection dies. Without a consistency boundary, you can leave half a table behind, write duplicates, or let downstream readers see incomplete data.

So streaming is not just “read while writing.” It needs a few companion decisions:

Transaction boundary. This job uses BEGIN -> TRUNCATE -> COPY -> COMMIT. If COPY fails midway, the transaction rolls back and the old table is not replaced by a partial result.
Idempotent entry point. Running the same table over the same input should produce the same result. That makes retries safe.
Explicit failure points. Oversized files, schema mismatches, network failures, and COPY failures should fail loudly, not be skipped silently.
Staging table when needed. For more complex cases, COPY into a staging table first, validate row counts and quality, then atomically swap into the target table.
Do not turn a finite batch into an infinite stream system. This is bounded streaming: finite input, finite output, batch commit boundary.

That is why I prefer “streaming execution” over simply saying “stream processing.” It avoids confusing a bounded batch job with a real-time stream system.

Python vs Rust

Another common question: can’t Python do streaming?

Yes.

If the only problem is OOM, Python has a reasonable fix: use a pyarrow batch reader to read Parquet in chunks, then use psycopg2 or asyncpg COPY to write chunks into Postgres. That avoids to_pandas() and to_csv() holding the whole table in memory.

So the conclusion is not “Python is bad for data engineering.” Python is good at this work.

The point here is that this function had more than OOM. The deploy package was large, uploads were slow, and the serverless runtime carried heavy dependencies. Python streaming can lower peak memory, but it does not automatically remove the deployment cost of pyarrow + pandas.

Rust helped because it solved two things at once:

The execution model became streaming.
The deployment unit became a small binary.

In other words:

If the pain is only peak memory, Python streaming is probably the cheaper fix.
If the pain also includes dependency size, cold start, deploy speed, and runtime boundaries, a compiled language starts to make more sense.

Do not mix up “batch vs stream” with “Python vs Rust.” The first is a data execution model. The second is a runtime and delivery model.

What Rust contributed

Rust’s value in this case was specific.

First, it made the deployment unit small. The Python version had to carry heavy data dependencies. The Rust version compiles to a Linux binary. In this validation, the deployment zip went from about 77.8MB to about 2.5MB, local upload went from about 40 minutes to a few seconds, and the final bootstrap binary was about 6.5MB.

Second, it made memory boundaries clearer. Rust’s ownership model pushes you to ask who owns the data, when it is freed, and whether it is copied. That can feel like friction in some business code. In data movement code, it helps expose hidden large copies.

Third, its async ecosystem fits this pipeline shape. Reading object storage, converting data, and writing Postgres is a producer-consumer model with both IO and CPU work. tokio, channels, and streaming sinks express that shape directly.

The speed did not come from the word “Rust.” It came from streaming execution.

Write Rust as “read everything, concatenate everything, write everything” and it can OOM too. Write Python as streaming and it can solve a large part of this problem. Rust was the right choice here because it also fixed the deployment unit and runtime dependency problem.

A few traps

I will not go deep into the internals, but a few traps are worth keeping.

One: do not assume a cloud function invoke looks like a normal HTTP client.

The custom runtime request did not set Content-Type: application/json. Using the framework’s JSON extractor returned 415. Reading raw bytes and parsing JSON manually fixed it.

Serverless platforms often have their own runtime protocol. Framework defaults may not match it.

Two: self-signed cloud API requests need ground-truth tests.

I did not want to bet on an unofficial OSS Rust SDK, so I implemented OSS V4 signing directly. Code like this can look right and still return 403 in production. The safer path was to generate signing samples from a mature Python SDK, then make the Rust unit tests match them byte for byte.

If you implement a protocol yourself, anchor it to a trusted implementation.

Three: deployment experience is part of architecture.

Teams often discuss only runtime performance. But if every deploy takes tens of minutes, engineers deploy less, validate less, and make fewer small changes. The system gets more brittle.

Shrinking the deployment package does not just save time. It makes frequent validation feel cheap again.

Takeaways

The reusable lesson is not “rewrite Python in Rust.” It is a set of checks.

First, when you hit OOM, draw the shape of data in memory. Do not only look at input size. Look at how many intermediate copies exist.

Second, for data movement jobs, prefer streaming execution when the business semantics allow it. Use batch size and backpressure to control memory.

Third, keep the batch commit boundary. Streaming execution does not mean giving up transactions, consistency, or idempotency. For finite input, the comfortable shape is often a bounded streaming batch.

Fourth, for Postgres bulk loads, think COPY before row-by-row INSERT. INSERT is for transactional writes. COPY is for loading tables.

Fifth, choose the language together with the deployment environment. If peak memory is the only issue, Python streaming may be enough. If dependency size, cold start, deploy speed, and runtime boundaries also hurt, a compiled language becomes more attractive.

Sixth, the goal of optimization is not just “faster.” It is “more explainable.” A 52-second job is nice. More important: you know why it will not blow up, where it can slow down, where to apply backpressure, and where retries are safe.

That is what I want to keep from this fix.

Got thoughts on this? Argue with my agent, or send me a note.