I've built enough data workflows at this point that I'm starting to see the same patterns show up over and over. Figured I'd write them down before I forget why I made certain choices.

**ETL vs. ELT.** For most of my FP&A work, I've landed on ELT: extract the raw data, load it somewhere stable, then transform it. The temptation is always to clean data on the way in, but I've been burned too many times by transformations that made sense in January but broke in March when the source format changed. Keeping raw data around means I can re-transform without re-extracting. It's slower upfront, but it saves me on weekends.

**Python vs. Power Query.** I use both, but for different things. Power Query is great when the consumer is an Excel or Power BI user who needs to refresh on their own. Python wins when I need scheduling, version control, or anything more complex than a few joins and filters. The mistake I used to make was trying to force everything into one tool. Now I pick based on who needs to maintain it after me.

**Idempotency.** This is the one that took me the longest to internalize. Every pipeline should produce the same result if you run it twice: no duplicates, no missing records, no side effects. I learned this the hard way after a scheduled job ran twice on a holiday and doubled an entire month of journal entries. Now every script checks for existing records before inserting. Every. Single. One.

**Logging and checkpoints.** Even a simple pipeline should tell you what it did: a timestamp, a row count, a success/fail flag. Future me always thanks past me for this.

I've been pulling a lot of these ideas together in my [[lab/Local Data Lake|Local data lake]] project, and the patterns hold up well for simpler, lighter-weight cases too. Still refining, though.

← [[Notebook|Back to /notebook]]
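A minimal sketch of the ELT shape described above: land the raw extract untouched, then transform from the landed copy, never from the source. The `extract_and_load`/`transform` names and the JSON-on-disk layout are just illustrative assumptions, not a real pipeline of mine.

```python
import datetime
import json
import pathlib


def extract_and_load(raw_dir, source_name: str, payload: list[dict]) -> pathlib.Path:
    """Land the raw extract exactly as received, stamped with source and date.

    Keeping this file around means transforms can be re-run later
    without hitting the source system again.
    """
    raw_dir = pathlib.Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.date.today().isoformat()
    path = raw_dir / f"{source_name}_{stamp}.json"
    path.write_text(json.dumps(payload))
    return path


def transform(raw_path) -> list[dict]:
    """Transform step reads only from the landed file.

    If the cleanup logic changes in March, re-run this against the
    January file; no re-extract needed.
    """
    rows = json.loads(pathlib.Path(raw_path).read_text())
    # Example cleanup: normalize column names to stripped lowercase.
    return [{k.strip().lower(): v for k, v in row.items()} for row in rows]
```

The point is the separation: `extract_and_load` has no opinions about the data, so a broken `transform` never costs you the raw extract.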
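The "check for existing records before inserting" habit from the idempotency section can be sketched like this, assuming SQLite and a hypothetical `journal_entries` table keyed on an entry ID. With a primary key doing the duplicate detection, running the job twice on a holiday inserts nothing the second time.

```python
import sqlite3


def load_entries(conn: sqlite3.Connection, entries: list[tuple]) -> int:
    """Insert journal entries idempotently; return how many were actually inserted.

    Hypothetical schema: entry_id uniquely identifies a journal entry,
    so a rerun of the same batch cannot create duplicates.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS journal_entries ("
        "entry_id TEXT PRIMARY KEY, posted_on TEXT, amount REAL)"
    )
    cur = conn.executemany(
        # INSERT OR IGNORE makes the load idempotent: rows whose
        # entry_id already exists are silently skipped.
        "INSERT OR IGNORE INTO journal_entries VALUES (?, ?, ?)",
        entries,
    )
    conn.commit()
    return cur.rowcount
```

Returning the inserted-row count also feeds the logging pattern: a rerun that reports `0 rows inserted` is a non-event instead of a doubled month.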
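And the checkpoint idea, timestamp plus row count plus success/fail flag, needs barely any code. A sketch using an append-only CSV (the `log_run` name and file layout are my illustrative assumptions):

```python
import csv
import datetime
import pathlib


def log_run(log_path, step: str, rows: int, ok: bool) -> None:
    """Append one checkpoint line: timestamp, step name, row count, status."""
    log_path = pathlib.Path(log_path)
    is_new = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["ts", "step", "rows", "status"])
        writer.writerow([
            datetime.datetime.now().isoformat(timespec="seconds"),
            step,
            rows,
            "ok" if ok else "fail",
        ])
```

One call at the end of each step is enough for future me to answer "did it run, and how much did it move?" without rereading the script.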