# Services — Dewei Zhai

Project-based engagements. The list below describes what I do; rate, weekly
capacity, and current availability are maintained live on the site rather than
written down here.

## Cloud data platform design

**Problems solved**: Greenfield data platform without a clear architectural backbone, or an early platform that already shows scaling and governance debt.

**Typical client situation**: Startup or scale-up with growing data needs, no dedicated platform team yet, leadership wants a design that survives the next 18 months without a rewrite.

**Technologies**: AWS, Azure, Terraform, CDKTF, Databricks, Delta Lake, Python

**Expected outcome**: Documented architecture with IaC scaffold, dev/prod parity, IAM and cost-allocation model. Ready for a small team to build on without re-litigating fundamentals.

## Lakehouse architecture

**Problems solved**: Mixed warehouse + lake setups with unclear ownership of source-of-truth tables; teams duplicating ingestion because the lake is unreliable.

**Typical client situation**: Mid-size data team running Snowflake or Redshift alongside ad-hoc S3 / ADLS storage, wants a single Medallion-style platform with governed semantics.

**Technologies**: Databricks, Unity Catalog, Delta Lake, Iceberg, Spark, dbt, Terraform

**Expected outcome**: Bronze / Silver / Gold layers with explicit contracts. Single catalog, governed access, reproducible builds. Downstream teams stop maintaining shadow copies.
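As a small illustration of what "explicit contracts" can mean in practice, the sketch below checks a batch against a declared Silver-layer contract before promotion. It is plain Python with hypothetical names (`SILVER_ORDERS_CONTRACT`, `validate_contract`), not a specific catalog or Delta API:

```python
# Illustrative sketch: enforce a declared table contract before a batch
# is promoted from Bronze to Silver. Schema and names are hypothetical.

SILVER_ORDERS_CONTRACT = {
    "required_columns": {"order_id", "customer_id", "amount", "ingested_at"},
    "not_null": {"order_id", "customer_id"},
}

def validate_contract(rows: list[dict], contract: dict) -> list[str]:
    """Return a list of violations; an empty list means the batch conforms."""
    violations = []
    for i, row in enumerate(rows):
        missing = contract["required_columns"] - row.keys()
        if missing:
            violations.append(f"row {i}: missing columns {sorted(missing)}")
        for col in contract["not_null"]:
            if col in row and row[col] is None:
                violations.append(f"row {i}: null in non-nullable column {col!r}")
    return violations
```

In a real platform the same idea lives in the catalog and the build tooling (e.g. dbt tests or Delta constraints); the point is that the contract is declared once and enforced at the promotion boundary, not rediscovered by downstream teams.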

## Spark / Databricks migration

**Problems solved**: Open-source Spark on EMR / HDInsight / self-managed clusters costing more in ops than the workloads justify, or stuck on legacy runtime versions.

**Typical client situation**: Existing Spark codebase, often EMR-based, with brittle cluster lifecycle and a team that does not want to keep babysitting it.

**Technologies**: Spark, Databricks, PySpark, Delta Lake, AWS, Azure, Airflow

**Expected outcome**: Workloads migrated to Databricks (Unity-governed). Operational overhead drops, runtime upgrades become a one-click action, cost per job is measured instead of guessed.

## Legacy ETL modernisation

**Problems solved**: On-prem Informatica / SSIS / SAS-style ETL that nobody wants to touch; replays break things, late data corrupts marts, timezone handling is folklore.

**Typical client situation**: Enterprise data team with working but fragile pipelines, business stakeholders losing trust, modernisation already attempted once and stalled.

**Technologies**: Spark, dbt, Airflow, Python, Databricks, AWS Glue, DuckDB

**Expected outcome**: Idempotent, config-driven pipelines. Replays are safe, late data is handled explicitly, schema changes are versioned. The on-call rotation gets quieter.
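A minimal sketch of what "idempotent" means here, using a plain dict as a stand-in for a partitioned table (in a real pipeline this would be a Delta or Iceberg partition overwrite, not an append):

```python
# Illustrative sketch of replay-safe (idempotent) partition loading.
# The "mart" is a dict standing in for a partitioned table.

def load_partition(mart: dict, logical_date: str, rows: list[dict]) -> None:
    """Overwrite the partition for logical_date instead of appending,
    so replaying the same run leaves the table state unchanged."""
    mart[logical_date] = list(rows)  # full replace, never append

mart: dict[str, list[dict]] = {}
day = [{"id": 1, "v": 10}, {"id": 2, "v": 20}]
load_partition(mart, "2024-06-01", day)
load_partition(mart, "2024-06-01", day)  # replay: same state, no duplicates
```

Keying every write to a logical date (rather than "now") is what makes backfills and replays safe: the same inputs produce the same table, however many times the run fires.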

## AWS / Azure platform implementation

**Problems solved**: Cloud account set up by click-ops, no clear environment separation, IAM grew organically, drift between dev and prod.

**Typical client situation**: Team needs production-grade AWS or Azure foundations under a data or AI workload, but has no platform engineer.

**Technologies**: AWS, Azure, Terraform, CDK, CDKTF, GitHub Actions, OIDC, Key Vault, IAM

**Expected outcome**: 100% IaC environments with OIDC-based CI deploys, no long-lived credentials, dev/prod parity, account / subscription topology suitable for audit.

## AI-assisted engineering workflow setup

**Problems solved**: Team uses ChatGPT or Copilot ad-hoc but has no agent harness, no codebase-aware tooling, no measurable productivity lift.

**Typical client situation**: Engineering org curious about the "3× throughput" claims, wants concrete practice instead of slideware.

**Technologies**: Claude Code, MCP, custom skills and hooks, multi-agent workflows, Python, TypeScript

**Expected outcome**: Codebase-aware agents that handle boilerplate (handler scaffolding, Spark job templates, DDL, integration tests). Measured throughput uplift. Patterns the team can extend on their own.

## Recovery of stalled data / platform projects

**Problems solved**: Migration or platform build that started ambitious and has been "almost done" for months.

**Typical client situation**: Original architect left, vendor disengaged, or internal team got pulled into other priorities. Stakeholders want a finish line.

**Technologies**: Whatever the project is on

**Expected outcome**: Honest assessment of what is salvageable. Re-scoped plan with a real finish date. Either I take it across the line or hand back a project the existing team can finish.

## Fractional technical leadership

**Problems solved**: No senior data or platform voice in the room when key decisions get made; design choices keep being deferred or made by committee.

**Typical client situation**: Startup or small data org that does not need a full-time staff engineer but needs someone accountable for the technical direction a few days a week.

**Technologies**: Architecture review, hands-on prototyping, hiring support

**Expected outcome**: Decisions get made and documented. Hires get vetted. The team gets a senior pair to push back on bad asks. Bounded engagement, no scope creep.

## Project types I take on

- One-time advisor on data or AI projects — short engagements ending in a clear recommendation plus enough work to de-risk it
- Delivering a finished, processed dataset — packaged data products the client owns outright
- Stalled platforms or hard migrations needing a sole hands-on lead
- DApp / smart-contract work with a real product on the other end
- Hackathon collaboration — short, high-intensity team work on a concrete deliverable
- Fractional technical leadership on data / platform initiatives

## Project types I avoid

- Electricity or energy trading work (existing client conflict — firm, not negotiable)
- More than two onsite days a week as a regular cadence (kickoff trips and workshops are fine)
- Retainers / always-on availability arrangements

## Related

- About: /about.md
- Cases: /cases.md
- Stack: /stack.md
- Contact: /contact.md
- JSON: /api/services
