I got tired of logging into three different systems every time I needed to answer a cross-functional question. "What's our revenue per rep by region, adjusted for the latest forecast?" That question touches Salesforce for pipeline and rep data, Intacct for actuals, and Anaplan for the forecast. Three logins, three exports, and a lot of VLOOKUP gymnastics. So I built a local data lake to bring it all together.
## Why local and CSV-based
I know "data lake" conjures images of Snowflake or Databricks. Mine is a folder on my machine with dated CSV files. That's intentional. I don't need real-time streaming or petabyte scale. I need a reliable, queryable snapshot of key datasets that refreshes on a schedule I control. CSV keeps it simple, portable, and readable by every tool I use: Python, Excel, Claude, whatever.
## The structure
The folder layout looks like this:
```
data-lake/
├── intacct/
│   ├── gl_detail_2026_01.csv
│   ├── gl_detail_2026_02.csv
│   └── trial_balance_2026_02.csv
├── salesforce/
│   ├── opportunities_2026_02.csv
│   ├── rep_quota_2026_q1.csv
│   └── accounts_active.csv
├── anaplan/
│   ├── forecast_revenue_2026_02.csv
│   └── forecast_opex_2026_02.csv
└── combined/
    ├── revenue_by_rep_region.csv
    └── actuals_vs_forecast.csv
```
Each source folder holds raw exports. The `combined/` folder holds merged datasets I build with Python scripts.
## The pull process
I wrote a simple Python script for each source. Intacct has a Web Services API: I use the `requests` library to pull GL detail and trial balance data. Salesforce uses `simple-salesforce` to run SOQL queries and dump results to CSV. Anaplan's API is the most painful, but their bulk export endpoint works once you get the authentication right.
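The Salesforce pull can be sketched roughly like this. The SOQL query, helper names, and output path are illustrative, not my exact script; the one real library detail is that `simple-salesforce` attaches an `attributes` metadata dict to every record, which you want to strip before writing CSV:

```python
import csv

# Illustrative query; adjust fields and filters to taste.
SOQL = "SELECT Id, Name, StageName, Amount, CloseDate FROM Opportunity WHERE IsClosed = false"

def records_to_rows(records):
    # simple-salesforce's query_all() returns records with an 'attributes'
    # metadata dict on each one; drop it so the CSV has only real fields.
    return [{k: v for k, v in rec.items() if k != "attributes"} for rec in records]

def dump_csv(rows, out_path):
    # Write a list of dicts to CSV, header row first.
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

# Usage (assumes simple-salesforce is installed and credentials are set up):
#   from simple_salesforce import Salesforce
#   sf = Salesforce(username=..., password=..., security_token=...)
#   dump_csv(records_to_rows(sf.query_all(SOQL)["records"]),
#            "data-lake/salesforce/opportunities_2026_02.csv")
```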
The scripts run on a schedule via Windows Task Scheduler: each one pulls the latest data, timestamps the file, and drops it into the right folder. A full refresh takes about four minutes.
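The timestamp-and-drop step amounts to a tiny path helper. This is a hypothetical sketch that matches the `name_YYYY_MM.csv` naming scheme in the folder layout above, not code from my actual scripts:

```python
from datetime import date
from pathlib import Path

def dated_path(source, name, when=None, base="data-lake"):
    # Hypothetical helper: builds paths like
    # data-lake/salesforce/opportunities_2026_02.csv
    when = when or date.today()
    return Path(base) / source / f"{name}_{when:%Y_%m}.csv"
```

Each pull script can then write straight to `dated_path("salesforce", "opportunities")` and the folder stays self-organizing.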
## What it enables
Once the data is local, everything gets faster. I can run a Python script that joins Salesforce opportunity data with Intacct actuals and Anaplan forecasts to produce a unified revenue view in seconds. I can feed any of these CSVs directly to Claude for analysis. I can build quick Plotly dashboards without waiting on anyone's BI team backlog.
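The unified revenue view is a couple of pandas merges. A minimal sketch, assuming hypothetical column names (`region`, `amount`, `actuals`, `forecast`) rather than the real schemas:

```python
import pandas as pd

def build_revenue_view(opps_csv, actuals_csv, forecast_csv):
    # Column names here are assumptions for illustration, not the real schemas.
    opps = pd.read_csv(opps_csv)
    actuals = pd.read_csv(actuals_csv)
    forecast = pd.read_csv(forecast_csv)

    # Roll pipeline up to one row per region, then bolt on actuals and forecast.
    pipeline = (opps.groupby("region", as_index=False)["amount"].sum()
                    .rename(columns={"amount": "pipeline"}))
    return (pipeline.merge(actuals, on="region", how="outer")
                    .merge(forecast, on="region", how="outer"))
```

Outer joins keep regions that appear in only one source visible instead of silently dropping them.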
The most valuable output so far is `actuals_vs_forecast.csv`: a single file that maps every Intacct GL line to its corresponding Anaplan forecast line. That file powers my monthly variance analysis and saves me hours of manual reconciliation.
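The GL-to-forecast mapping can be sketched as a merge through a hand-maintained mapping file. Everything here is illustrative: the column names (`gl_account`, `amount`, `forecast_line`, `forecast`) and the existence of a separate mapping CSV are assumptions about how such a file might be built, not my actual schema:

```python
import pandas as pd

def actuals_vs_forecast(gl_csv, forecast_csv, mapping_csv):
    # mapping_csv is a hand-maintained file with columns gl_account, forecast_line
    # (an assumed schema for illustration).
    gl = pd.read_csv(gl_csv).merge(pd.read_csv(mapping_csv),
                                   on="gl_account", how="left")
    # Sum GL transaction lines up to the forecast-line grain.
    actuals = (gl.groupby("forecast_line", as_index=False)["amount"].sum()
                 .rename(columns={"amount": "actual"}))
    out = actuals.merge(pd.read_csv(forecast_csv), on="forecast_line", how="outer")
    out["variance"] = out["actual"] - out["forecast"]
    return out
```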
## What I'd do differently
I'd add data validation earlier. I had a silent issue for two weeks where the Salesforce pull was returning stale opportunity stages because of a filter bug. Adding row count checks and freshness timestamps at the pull step would have caught that immediately.
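The checks I wish I'd had are cheap to write. A minimal sketch of post-pull validation; the thresholds (`min_rows`, `max_age_hours`) are illustrative defaults, not values from my setup:

```python
import csv
import time
from pathlib import Path

def validate_pull(path, min_rows=1, max_age_hours=24):
    # Cheap sanity checks after each pull: the file is fresh and not
    # suspiciously empty. Thresholds are illustrative defaults.
    p = Path(path)
    age_hours = (time.time() - p.stat().st_mtime) / 3600
    if age_hours > max_age_hours:
        raise ValueError(f"{p.name} is {age_hours:.1f}h old -- stale pull?")
    with p.open(newline="") as f:
        n = sum(1 for _ in csv.reader(f)) - 1  # subtract the header row
    if n < min_rows:
        raise ValueError(f"{p.name} has only {n} data rows")
    return n
```

A stale-stage bug like the one above wouldn't trip the row count, but comparing row counts and key-field distributions against the previous pull is a natural next step on the same scaffold.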
---
Related: [[notebook/Data Pipeline Patterns|Data pipeline patterns]]
← [[Lab|Back to /lab]]