Deep Python curriculum

12-Month Research Engineer Plan

Each month tracks Work deliverables, Python/ETL skills, CPython internals, and Evidence.

Weekly cadence (repeatable, flexible)

☐ Mon/Tue: Deliverable work (APIs/scrapes/briefs)

☐ Wed: CPython hour + one micro-benchmark

☐ Thu: Tests, docs, data quality; push to staging

☐ Fri: Demo + 5-bullet learning log

“Swap-In” modules for urgent weeks

☐ Generator pipelines: streaming ETL using .send() to pass context into a running stage; .throw() for data-quality aborts; .close() to release resources

☐ Retry/backoff policy: jitter, max in-flight, 429 handling

☐ PDF table extraction clinic: heuristics, column detection, post-clean

☐ Schema governance: versioned JSON Schemas + validation gates
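
For the retry/backoff swap-in module listed above, a starting point might look like the following. This is a minimal sketch of exponential backoff with full jitter, not a finished policy. The names `RateLimited`, `backoff_delay`, and `call_with_retries` are hypothetical, and `RateLimited` only stands in for however your HTTP client surfaces a 429.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: rng() scaled by min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fetch, max_attempts=5, sleep=time.sleep):
    """Call fetch(); on RateLimited wait and retry, up to max_attempts calls."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Prefer the server's Retry-After hint; otherwise use jittered backoff.
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = backoff_delay(attempt)
            sleep(delay)
```

Passing `sleep` and `rng` as parameters keeps the policy testable without real waiting. A max in-flight limit is a separate concern and is not covered here.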

Definition of done (per feature)

☐ Idempotent run, logged, with tests

☐ Re-runnable from clean state

☐ Documented: purpose, inputs, outputs, failure modes

☐ Measured: at least one number improved (speed, reliability, clarity)

Months 1–3: Ship value fast, nail the core

Month 1 — API Intake & Clean Data

Work deliverables

☐ One municipal brief (Yabucoa or Humacao): CSV + 1-pager + simple chart

☐ CLI: researcher55 ingest --disasters 4339,4473,4671

Python/ETL skills

☐ httpx with retries/backoff; pydantic models

☐ Tables with Polars (or Pandas) → Parquet + DuckDB

☐ Unit tests (pytest), lint (ruff), typing (pyright or mypy)

CPython internals

☐ Python data model: objects, references, __slots__, __iter__, __next__

☐ dict/list/set internals: hashing, resizing, set op performance

☐ Tools: dis, sys.getsizeof, tracemalloc
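
One way to poke at the Month 1 internals topics with the tools listed above is sketched here. The classes `PlainPoint`, `SlottedPoint`, and `Countdown` and the helper `peak_bytes` are hypothetical names made up for this exercise. The printed numbers vary by Python version and platform, so none are shown.

```python
import dis
import sys
import tracemalloc


class PlainPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class SlottedPoint:
    __slots__ = ("x", "y")  # fixed attribute layout, no per-instance __dict__

    def __init__(self, x, y):
        self.x = x
        self.y = y


class Countdown:
    """Iterator protocol by hand: __iter__ returns self, __next__ produces values."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1


def peak_bytes(factory, n=10_000):
    """Peak traced memory in bytes while building n objects with factory."""
    tracemalloc.start()
    objs = [factory(i, i) for i in range(n)]
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del objs
    return peak


if __name__ == "__main__":
    dis.dis(Countdown.__next__)  # bytecode for one method
    print(sys.getsizeof([]), sys.getsizeof([0] * 100))
    print(peak_bytes(PlainPoint), peak_bytes(SlottedPoint))
```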

Evidence

☐ Repo with green CI

☐ /data/schema.md

☐ RUNBOOK.md (how to re-run)

Month 2 — Web Screening & PDF/Document Intake

Work deliverables

☐ Add web screening for project updates (Playwright)

☐ PDF parsing pipeline + dedupe and versioning (content hash)
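
The content-hash dedupe and versioning deliverable could begin from something like this. It is a minimal sketch that assumes the whole document fits in memory as bytes and that an in-memory dict is an acceptable index. `content_hash` and `register_version` are hypothetical helper names.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of the raw document bytes."""
    return hashlib.sha256(data).hexdigest()


def register_version(index: dict, name: str, data: bytes) -> bool:
    """Record data under name; return True only if this content is new.

    index maps document name to a list of content hashes, oldest first.
    """
    digest = content_hash(data)
    versions = index.setdefault(name, [])
    if digest in versions:
        return False  # exact duplicate of a version already seen
    versions.append(digest)
    return True
```

Hashing raw bytes only catches byte-identical files. A PDF that was re-exported with new metadata hashes differently, so hashing extracted text may be the better key for some sources.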

Python/ETL skills

☐ Playwright: auth, pagination, change detection

☐ pdfplumber for tables; OCR fallback (pytesseract)

☐ Idempotent runs; failure-safe temp dirs; structured logging

CPython internals

☐ Iterators/generators under the hood: frame objects, suspension points

☐ Advanced generators: .send(), .throw(), .close(); VM mapping; use as coroutines/resource scopes
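
The three advanced generator methods named above can be exercised with one small coroutine-style generator. This is a sketch for study only. `BadRecord` and `running_total` are hypothetical names, and the `log` list stands in for a real resource such as an open file.

```python
class BadRecord(Exception):
    """Hypothetical data-quality error used to abort the stream."""


def running_total(log):
    """Receives numbers via .send() and yields the running total."""
    total = 0
    try:
        while True:
            value = yield total  # suspension point; .send(x) resumes here with x
            total += value
    except BadRecord:
        log.append("aborted")    # .throw(BadRecord(...)) is raised at the yield
    finally:
        log.append("released")   # runs on .close(), .throw(), or normal exit


if __name__ == "__main__":
    log = []
    gen = running_total(log)
    next(gen)                    # prime: run to the first yield
    print(gen.send(10), gen.send(5))
    gen.close()                  # raises GeneratorExit at the yield
    print(log)
```

Because the `except BadRecord` branch falls off the end of the function, a caller that uses `.throw()` sees `StopIteration` and should be ready to catch it.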

Evidence

☐ docs/change_detection.md

☐ tests/test_dedupe.py

☐ Sample PDFs → CSV examples

Month 3 — Data Quality & Basic Orchestration

Work deliverables

☐ Scheduled daily/weekly refresh

☐ Email/Slack alerts on failures

☐ Data quality checks (row counts, required fields, enums)

Python/ETL skills

☐ Prefect (or Airflow lite): flows, retries, caching, parameters

☐ Great Expectations–style checks (or simple custom validators)
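
If you take the "simple custom validators" route, the three checks named in the deliverables (row counts, required fields, enums) fit in one function. The sketch below assumes rows arrive as a list of dicts. `check_rows` and the field names in the usage note are hypothetical.

```python
def check_rows(rows, required, enums, min_rows=1):
    """Return a list of human-readable problems; an empty list means the batch passes.

    rows: list of dicts
    required: field names that must be present and non-empty
    enums: mapping of field name to the set of allowed values
    """
    problems = []
    if len(rows) < min_rows:
        problems.append(f"row count {len(rows)} < {min_rows}")
    for i, row in enumerate(rows):
        for field in required:
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing {field}")
        for field, allowed in enums.items():
            if field in row and row[field] not in allowed:
                problems.append(f"row {i}: {field}={row[field]!r} not allowed")
    return problems
```

For example, `check_rows(rows, ["project_id"], {"status": {"open", "closed"}})` returns every violation rather than stopping at the first, which tends to make failure alerts more useful.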

CPython internals

☐ Exceptions: creation, unwinding, tracebacks; cost of try/except

☐ Function call overhead: kwargs, defaults, closures; micro-bench with perf_counter/cProfile
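
A micro-benchmark for the try/except cost question might look like this. It is a sketch under the assumption that a dict lookup is a fair proxy for your hot path. The function names are hypothetical, and timings are left unprinted here because they depend on the machine and Python version.

```python
import timeit


def lookup_eafp(d, key):
    """Ask forgiveness: cheap when the key exists, pays for the raise when it does not."""
    try:
        return d[key]
    except KeyError:
        return None


def lookup_lbyl(d, key):
    """Look before you leap: always pays for the membership test."""
    if key in d:
        return d[key]
    return None


def bench(fn, d, key, number=200_000):
    """Seconds for `number` calls; timeit uses time.perf_counter by default."""
    return timeit.timeit(lambda: fn(d, key), number=number)


if __name__ == "__main__":
    data = {"a": 1}
    for key in ("a", "missing"):
        print(key, bench(lookup_eafp, data, key), bench(lookup_lbyl, data, key))
```

Run both the hit and the miss case: the interesting result is usually how the ratio changes between them, not either number alone.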

Evidence

☐ flows/refresh.py

☐ quality/checks.yml

☐ Alert screenshot + uptime log

Months 4–6: Reliability, modeling, and async

Month 4 — Data Modeling & Docs

Work deliverables

☐ Dimensional model for projects (staging → curated)

☐ Simple Metabase/Superset dashboard (by municipality, category, status)

Python/ETL skills

☐ SQL modeling (CTEs, windows) in DuckDB

☐ Tidy Polars transforms; repro notebooks → parameterized scripts

CPython internals

☐ Attribute lookup & descriptors; method binding; @property trade-offs

☐ Strings: interning, concat strategies, f-strings vs join
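
For the descriptor and method-binding topic above, a small data descriptor makes the lookup order visible. This is a study sketch with hypothetical names (`Positive`, `Grant`). It assumes instances have a regular `__dict__`.

```python
class Positive:
    """Data descriptor: validates on set, stores the value in the instance dict."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # accessed on the class, not an instance
        return obj.__dict__[self.name]

    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError(f"{self.name} must be positive")
        obj.__dict__[self.name] = value


class Grant:
    amount = Positive()

    def __init__(self, amount):
        self.amount = amount  # routed through Positive.__set__

    def describe(self):
        return f"grant of {self.amount}"


if __name__ == "__main__":
    g = Grant(5)
    # Functions are descriptors too: __get__ is what produces a bound method.
    bound = Grant.__dict__["describe"].__get__(g, Grant)
    print(bound(), g.describe())
```

Because `Positive` defines `__set__`, it is a data descriptor and wins over the entry of the same name in the instance `__dict__`, which is why reads still go through `__get__`.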

Evidence

☐ /models/ SQL & Polars pipelines

☐ Dashboard link in README

Month 5 — Asynchrony for I/O-Bound Work

Work deliverables

☐ Faster web/API intake using asyncio (concurrent fetches)

☐ Backpressure + polite rate limiting (robots.txt, delays)

Python/ETL skills

☐ asyncio tasks, cancellations, semaphores; httpx.AsyncClient

☐ Retry budgets; circuit breakers
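
Bounded concurrency with a semaphore, as listed in the skills above, can be sketched with the standard library alone. `fetch_all` is a hypothetical helper, and `fetch` is any coroutine function you supply; in the real pipeline it would presumably wrap an `httpx.AsyncClient` call.

```python
import asyncio


async def fetch_all(urls, fetch, max_in_flight=3):
    """Run fetch(url) for every url with at most max_in_flight running at once."""
    sem = asyncio.Semaphore(max_in_flight)

    async def guarded(url):
        async with sem:  # waits here when max_in_flight fetches are active
            return await fetch(url)

    # gather returns results in the same order as the input urls
    return await asyncio.gather(*(guarded(u) for u in urls))


if __name__ == "__main__":
    async def fake_fetch(url):
        await asyncio.sleep(0.01)  # stands in for network latency
        return url.upper()

    print(asyncio.run(fetch_all(["a", "b", "c"], fake_fetch, max_in_flight=2)))
```

The semaphore caps in-flight requests but does not space them out in time, so polite rate limiting still needs a delay or token bucket on top.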

CPython internals

☐ Event loop mechanics, task state, awaitable protocol

☐ GIL basics and why asyncio helps I/O, not CPU

Evidence

☐ Bench notebook: sync vs async; throughput ↑, failures ↓

Month 6 — Packaging & Deployment

Work deliverables

☐ Dockerized CLI & flows

☐ One-command deploy to GCP (or your infra)

☐ Secrets management; environment promotion (dev → prod)

Python/ETL skills

☐ pyproject.toml, wheels, versioning

☐ Container healthchecks; structured logs to cloud logging

CPython internals

☐ Import system: module caching, sys.meta_path, package layout

☐ Startup cost minimization for fast CLIs
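
Module caching and deferred imports, the two ideas behind the startup-cost item above, can be seen in a few lines. This is a sketch with hypothetical helper names. `json` is used only because it is in the standard library; in a real CLI the deferred import would be a heavy dependency.

```python
import importlib
import sys


def is_cached(name):
    """True if the module is already in the sys.modules cache."""
    return name in sys.modules


def render_report(rows):
    # Deferred import: the import cost is paid on the first call,
    # not at CLI startup. Later calls hit the sys.modules cache.
    import json
    return json.dumps(rows)


if __name__ == "__main__":
    print(sys.meta_path)  # finders consulted, in order, on a cache miss
    mod = importlib.import_module("json")
    print(mod is sys.modules["json"])  # the cache holds the same module object
```

To find which imports dominate startup, running the interpreter with `-X importtime` prints a per-module timing breakdown.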

Evidence

☐ Release tag + Docker image

☐ DEPLOY.md

☐ Sample prod run log

Months 7–9: Performance, geospatial, and C underpinnings

Month 7 — Performance & Memory

Work deliverables

☐ Optimize one heavy transform (e.g., big join) with measured speedup

☐ Memory profile of end-to-end run; reduce peak usage

Python/ETL skills

☐ Polars lazy queries; columnar thinking; pyarrow

☐ Profilers: cProfile, line_profiler, tracemalloc

CPython internals

☐ Reference counting & GC (generations, cycles)

☐ Small object allocator (pymalloc); list growth factor; copy semantics
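
Two of the internals topics above, reference cycles and list over-allocation, can be observed directly. The sketch below is CPython-specific and uses hypothetical names (`Node`, `cycle_needs_gc`, `list_growth`). The exact sizes it reports differ across versions, so none are quoted.

```python
import gc
import sys
import weakref


class Node:
    def __init__(self):
        self.other = None


def cycle_needs_gc():
    """Build a two-node cycle, drop it, and report liveness before and after gc."""
    a, b = Node(), Node()
    a.other, b.other = b, a  # cycle: refcounts cannot reach zero on their own
    probe = weakref.ref(a)   # lets us observe a without keeping it alive
    was_enabled = gc.isenabled()
    gc.disable()             # keep automatic collection out of the demo
    try:
        del a, b
        alive_before = probe() is not None  # only the cycle holds the nodes now
        gc.collect()                        # the cycle detector frees them
        alive_after = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_before, alive_after


def list_growth(n=64):
    """Distinct sys.getsizeof values seen while appending n items to a list."""
    items, sizes = [], []
    for _ in range(n):
        items.append(None)
        size = sys.getsizeof(items)
        if not sizes or sizes[-1] != size:
            sizes.append(size)
    return sizes
```

`list_growth` returns far fewer sizes than appends because the list reserves spare capacity each time it resizes.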

Evidence

☐ Before/after benchmarks

☐ Memory graph

☐ PR describing changes

Month 8 — Tiny C Extension & CPython Tour

Work deliverables

☐ Minimal C extension (e.g., sumext) or Cython accel used in reporting

☐ Write-up: when native code is worth it vs Polars/SQL

Python/ETL skills

☐ Build system for extensions; wheels for your platform

☐ ABI awareness; error handling across C ↔ Python

CPython internals

☐ C-API essentials: PyObject*, ref-counts, arg parsing

☐ Bytecode vs native call overhead; why vectorized libs win

Evidence

☐ sumext module + unit tests

☐ Microbench notebook

☐ Narrative post

Month 9 — Geospatial Layer (optional but powerful)

Work deliverables

☐ Join projects to census tracts/municipal boundaries; publish map layer

☐ Simple equity/regional cut (e.g., spend per capita)

Python/ETL skills

☐ GeoPandas/Shapely; spatial joins/buffers; CRS correctness

☐ Export tiles/GeoJSON; light map in dashboard

CPython internals

☐ Buffer protocol & memoryview; zero-copy ideas

☐ Numpy/pyarrow bridges (high level)
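
The buffer protocol item above is easiest to grasp by writing through a `memoryview` and watching the original object change. This is a sketch using only the standard library. `zero_first_half` and `as_bytes_view` are hypothetical helpers, and the first one assumes a writable buffer of single bytes such as a `bytearray`.

```python
import array


def zero_first_half(buf):
    """Zero the first half of a writable byte buffer in place, without copying it."""
    view = memoryview(buf)
    half = len(view) // 2
    view[:half] = bytes(half)  # slice assignment writes into the original memory
    return view[half:]         # a view on the second half, still no copy


def as_bytes_view(arr):
    """View the memory of an array.array as raw bytes, without copying."""
    return memoryview(arr).cast("B")


if __name__ == "__main__":
    data = bytearray(b"abcdefgh")
    tail = zero_first_half(data)
    tail[0] = ord("X")  # writes through to data
    print(data)
    numbers = array.array("i", [1, 2, 3, 4])
    print(len(as_bytes_view(numbers)), numbers.itemsize)
```

NumPy arrays and Arrow buffers expose memory through the same protocol, which is what allows them to share data without copying.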

Evidence

☐ geo/ pipeline

☐ Validation screenshots

☐ Map in dashboard

Months 10–12: Capstone, polish, and interviews

Month 10 — Capstone Assembly

Work deliverables

☐ “PR Recovery Data Hub” repo: ingestion → quality → model → publish

☐ Public sample dataset + documentation site (mkdocs)

Python/ETL skills

☐ CLI polish (Typer help, examples)

☐ Error taxonomy; retries as policy

CPython internals

☐ Logging internals; cost of exceptions; fast paths for hot loops

Evidence

☐ Project website

☐ “Why these design choices” post

Month 11 — Resilience & Observability

Work deliverables

☐ SLOs for freshness; dashboards for job runtime/failures

☐ Synthetic canary job; chaos day (break and recover)

Python/ETL skills

☐ Metrics export (Prometheus or CSV→chart); alert tuning

☐ Backfills; schema evolution without downtime

CPython internals

☐ Threading vs multiprocessing; when native libs release the GIL

☐ Serialization costs: pickle vs orjson/Arrow IPC
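
A harness for the serialization-cost comparison could start like this. Note the substitution: it uses the standard-library `json` module as a stand-in, because orjson and Arrow IPC are third-party and not assumed to be installed. The helper name `compare` is hypothetical, and the payload must be JSON-safe (string keys, lists rather than tuples).

```python
import json
import pickle
import timeit


def compare(payload, number=2_000):
    """Encoded size in bytes and round-trip time in seconds for each codec."""
    codecs = {
        "pickle": (
            lambda o: pickle.dumps(o, protocol=pickle.HIGHEST_PROTOCOL),
            pickle.loads,
        ),
        "json": (lambda o: json.dumps(o).encode("utf-8"), json.loads),
    }
    results = {}
    for name, (dump, load) in codecs.items():
        blob = dump(payload)
        if load(blob) != payload:
            raise ValueError(f"{name} round trip changed the payload")
        seconds = timeit.timeit(lambda: load(dump(payload)), number=number)
        results[name] = {"bytes": len(blob), "seconds": seconds}
    return results


if __name__ == "__main__":
    sample = {"id": 1, "tags": ["a", "b"], "amount": 12.5}
    print(compare(sample))
```

Adding another codec means adding one entry to `codecs` with a dump and a load callable, so the comparison stays like for like.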

Evidence

☐ Runbook: incident response section

☐ Post-mortem example

Month 12 — Hiring Pack & Knowledge Share

Work deliverables

☐ Hiring packet: dashboard link, capstone repo, 2 short blog posts

☐ Brown-bag talk for your team; recorded demo

Python/ETL skills

☐ Final refactors; docstrings; type coverage; test coverage report

CPython internals

☐ Review & reinforce: GIL, bytecode, GC, and design rationale

Evidence

☐ Slides

☐ Recorded demo

☐ README with TL;DR for hiring managers