Deep Python curriculum
12-Month Research Engineer Plan
Each month tracks Work deliverables, Python/ETL skills, CPython internals, and Evidence.
Weekly cadence (repeatable, flexible)
☐ Mon/Tue: Deliverable work (APIs/scrapes/briefs)
☐ Wed: CPython hour + one micro-benchmark
☐ Thu: Tests, docs, data quality; push to staging
☐ Fri: Demo + 5-bullet learning log
“Swap-In” modules for urgent weeks
☐ Generator pipelines: streaming ETL using .send() to pass context; .throw() for data-quality aborts; .close() to release resources
☐ Retry/backoff policy: jitter, max in-flight, 429 handling
☐ PDF table extraction clinic: heuristics, column detection, post-clean
☐ Schema governance: versioned JSON Schemas + validation gates
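The retry/backoff module above could start from a sketch like this one. It covers only the jitter part (exponential backoff with "full jitter"); max in-flight limits and reading a 429's Retry-After header are left out. RetryableError, backoff_delays, and call_with_retries are hypothetical names chosen for illustration, not part of any library.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. HTTP 429 or 503)."""


def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 5,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Full-jitter delays: delay n is uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(attempts)]


def call_with_retries(fn: Callable[[], T], delays: List[float],
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, sleeping after each retryable failure until delays run out."""
    for delay in delays:
        try:
            return fn()
        except RetryableError:
            sleep(delay)
    return fn()  # final attempt: any error propagates to the caller
```

Passing sleep as a parameter keeps the policy testable without real waiting.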
Definition of done (per feature)
☐ Idempotent run, logged, with tests
☐ Re-runnable from clean state
☐ Documented: purpose, inputs, outputs, failure modes
☐ Measured: at least one number improved (speed, reliability, clarity)
Months 1–3: Ship value fast, nail the core
Month 1 — API Intake & Clean Data
Work deliverables
☐ One municipal brief (Yabucoa or Humacao): CSV + 1-pager + simple chart
☐ CLI: researcher55 ingest
Python/ETL skills
☐ httpx with retries/backoff; pydantic models
☐ Tables with Polars (or Pandas) → Parquet + DuckDB
☐ Unit tests (pytest), lint (ruff), typing (pyright or mypy)
CPython internals
☐ Python data model: objects, references, __slots__, __iter__, __next__
☐ dict/list/set internals: hashing, resizing, set op performance
☐ Tools: dis, sys.getsizeof, tracemalloc
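A first Wednesday session with the three tools above might look like the sketch below. It compares a regular class with a __slots__ class under tracemalloc and disassembles a tiny function. DictPoint, SlotPoint, and traced_bytes are illustrative names, and the byte counts and opcode names will differ between CPython versions.

```python
import dis
import sys
import tracemalloc


class DictPoint:
    def __init__(self, x, y):
        self.x, self.y = x, y


class SlotPoint:
    __slots__ = ("x", "y")  # instances get fixed slots instead of a __dict__

    def __init__(self, x, y):
        self.x, self.y = x, y


def traced_bytes(factory, n=10_000):
    """Bytes allocated while building n objects, as reported by tracemalloc."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    objs = [factory(i, i) for i in range(n)]  # kept alive until measured
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


def add(a, b):
    return a + b


if __name__ == "__main__":
    dis.dis(add)  # bytecode listing for add
    print(sys.getsizeof([]), sys.getsizeof([1, 2, 3]))  # shallow sizes only
    print(traced_bytes(DictPoint), traced_bytes(SlotPoint))
```

Note that sys.getsizeof reports only the container itself, not the objects it references, which is why tracemalloc is the better tool for whole-structure comparisons.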
Evidence
☐ Repo with green CI
☐ /data/schema.md
☐ RUNBOOK.md (how to re-run)
Month 2 — Web Screening & PDF/Document Intake
Work deliverables
☐ Add web screening for project updates (Playwright)
☐ PDF parsing pipeline + dedupe and versioning (content hash)
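For the dedupe-by-content-hash deliverable, a minimal sketch using only the standard library is shown here. The choice of SHA-256 and the function names content_hash and dedupe are assumptions for illustration; the plan itself does not fix a hash algorithm.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the file's bytes, read in chunks so a large PDF is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe(paths: List[Path]) -> Dict[str, List[Path]]:
    """Group paths by content hash; paths within a group are byte-identical files."""
    groups: Dict[str, List[Path]] = {}
    for path in sorted(paths):  # sorted so the first entry per group is stable
        groups.setdefault(content_hash(path), []).append(path)
    return groups
```

The same hash can double as a version key: a new hash for a known URL or filename means the document changed.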
Python/ETL skills
☐ Playwright: auth, pagination, change detection
☐ pdfplumber for tables; OCR fallback (pytesseract)
☐ Idempotent runs; failure-safe temp dirs; structured logging
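One way to combine idempotent runs with failure-safe temp files is the write-then-rename pattern sketched below. atomic_write_json is a hypothetical helper name; the rename is atomic only when the temp file and the target live on the same filesystem, which is why the temp file is created next to the target.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(target: Path, payload: dict) -> None:
    """Write to a temp file in the target's directory, then rename into place."""
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, sort_keys=True)  # sorted keys: same input, same bytes
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the partial file so a failed run leaves no debris behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Readers of the target file see either the old complete version or the new complete version, never a half-written one.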
CPython internals
☐ Iterators/generators under the hood: frame objects, suspension points
☐ Advanced generators: .send(), .throw(), .close(); VM mapping; use as coroutines/resource scopes
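The three generator methods in the list above can be exercised with a small coroutine-style sink. This is a sketch for study purposes; DataQualityError and row_sink are hypothetical names.

```python
class DataQualityError(Exception):
    """Hypothetical signal that upstream data is too dirty to continue."""


def row_sink(log: list):
    """Receives rows via .send() and yields a running count back to the caller."""
    count = 0
    try:
        while True:
            row = yield count          # suspension point; .send(row) resumes here
            log.append(row)
            count += 1
    except DataQualityError as exc:
        log.append(f"aborted: {exc}")  # .throw() raised the error at the yield
        raise
    finally:
        log.append("closed")           # runs on .close(), .throw(), or any other exit


if __name__ == "__main__":
    log = []
    sink = row_sink(log)
    next(sink)                         # prime: advance to the first yield
    sink.send({"id": 1})
    sink.send({"id": 2})
    sink.close()                       # raises GeneratorExit inside the generator
    print(log)
```

The priming call to next() is required because a value can only be sent to a generator that is already suspended at a yield.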
Evidence
☐ docs/change_detection.md
☐ tests/test_dedupe.py
☐ Sample PDFs → CSV examples
Month 3 — Data Quality & Basic Orchestration
Work deliverables
☐ Scheduled daily/weekly refresh
☐ Email/Slack alerts on failures
☐ Data quality checks (row counts, required fields, enums)
Python/ETL skills
☐ Prefect (or Airflow lite): flows, retries, caching, parameters
☐ Great Expectations–style checks (or simple custom validators)
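If the "simple custom validators" route is taken, the three checks named in the deliverables (row counts, required fields, enums) fit in one small function. The sketch below assumes rows arrive as a list of dicts; CheckResult and check_rows are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CheckResult:
    passed: bool
    errors: List[str] = field(default_factory=list)


def check_rows(rows: List[dict], required: Set[str], enums: Dict[str, set],
               min_rows: int = 1) -> CheckResult:
    """Run row-count, required-field, and enum checks; collect every failure."""
    errors = []
    if len(rows) < min_rows:
        errors.append(f"row count {len(rows)} < {min_rows}")
    for i, row in enumerate(rows):
        for name in sorted(required):
            if row.get(name) in (None, ""):
                errors.append(f"row {i}: missing {name}")
        for name, allowed in enums.items():
            if name in row and row[name] not in allowed:
                errors.append(f"row {i}: {name}={row[name]!r} not allowed")
    return CheckResult(passed=not errors, errors=errors)
```

Collecting all errors instead of stopping at the first one makes the failure alert more useful, since one run reports every problem in the batch.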
CPython internals
☐ Exceptions: creation, unwinding, tracebacks; cost of try/except
☐ Function call overhead: kwargs, defaults, closures; micro-bench with perf_counter/cProfile
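A micro-benchmark for the cost of try/except could be sketched as follows, comparing a dict lookup guarded by try/except with dict.get on both the hit path and the miss path. The helper names are illustrative, and the lambda wrapper adds its own call overhead to both sides, so only the relative difference is meaningful. Absolute numbers depend on the machine and CPython version.

```python
import timeit


def lookup_try(d, key):
    try:
        return d[key]
    except KeyError:
        return None


def lookup_get(d, key):
    return d.get(key)


def bench(fn, *args, number=200_000, repeat=5):
    """Best-of-repeat seconds per call; the minimum is the least noisy estimate."""
    timer = timeit.Timer(lambda: fn(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    data = {"a": 1}
    for label, key in (("hit", "a"), ("miss", "zzz")):
        print(label,
              f"try={bench(lookup_try, data, key):.2e}s",
              f"get={bench(lookup_get, data, key):.2e}s")
```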
Evidence
☐ flows/refresh.py
☐ quality/checks.yml
☐ Alert screenshot + uptime log
Months 4–6: Reliability, modeling, and async
Month 4 — Data Modeling & Docs
Work deliverables
☐ Dimensional model for projects (staging → curated)
☐ Simple Metabase/Superset dashboard (by municipality, category, status)
Python/ETL skills
☐ SQL modeling (CTEs, windows) in DuckDB
☐ Tidy Polars transforms; repro notebooks → parameterized scripts
CPython internals
☐ Attribute lookup & descriptors; method binding; @property trade-offs
☐ Strings: interning, concat strategies, f-strings vs join
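For the attribute lookup and descriptor item above, a small sketch can show both halves of the protocol: a data descriptor that validates on assignment, and the fact that plain functions are descriptors whose __get__ produces bound methods. Positive and Grant are hypothetical classes for illustration.

```python
class Positive:
    """Data descriptor: defines __get__ and __set__, so it takes priority over the instance dict."""

    def __set_name__(self, owner, name):
        self.name = "_" + name     # storage attribute on the instance

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self            # accessed on the class itself
        return getattr(obj, self.name)

    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError("must be positive")
        setattr(obj, self.name, value)


class Grant:
    amount = Positive()

    def __init__(self, amount):
        self.amount = amount       # routed through Positive.__set__

    def doubled(self):
        return self.amount * 2


g = Grant(10)
# Functions are non-data descriptors: g.doubled is equivalent to calling
# the function's __get__, which returns a bound method.
bound = Grant.__dict__["doubled"].__get__(g, Grant)
```

@property is built on the same protocol, so the trade-off to measure is an extra Python-level call on every attribute access.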
Evidence
☐ /models/ SQL & Polars pipelines
☐ Dashboard link in README
Month 5 — Asynchrony for I/O-Bound Work
Work deliverables
☐ Faster web/API intake using asyncio (concurrent fetches)
☐ Backpressure + polite rate limiting (robots.txt, delays)
Python/ETL skills
☐ asyncio tasks, cancellations, semaphores; httpx.AsyncClient
☐ Retry budgets; circuit breakers
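Bounded concurrency with a semaphore, as listed above, can be sketched with the standard library alone. FakeFetcher stands in for a real httpx.AsyncClient call so the example runs offline; it and fetch_all are hypothetical names.

```python
import asyncio


async def fetch_all(urls, fetch, max_in_flight=5):
    """Run fetch(url) for every URL with at most max_in_flight running at once."""
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # return_exceptions=True returns failures as values instead of raising the first one.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)


class FakeFetcher:
    """Stand-in for a real HTTP call; records the peak number of concurrent fetches."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def __call__(self, url):
        self.active += 1
        self.peak = max(self.peak, self.active)
        try:
            await asyncio.sleep(0.01)  # simulated network latency
            return f"body of {url}"
        finally:
            self.active -= 1


if __name__ == "__main__":
    fetcher = FakeFetcher()
    urls = [f"https://example.invalid/{i}" for i in range(20)]
    results = asyncio.run(fetch_all(urls, fetcher, max_in_flight=3))
    print(len(results), "responses; peak concurrency", fetcher.peak)
```

The semaphore caps in-flight requests, which is the simplest form of backpressure; a per-host delay for politeness would be layered on top.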
CPython internals
☐ Event loop mechanics, task state, awaitable protocol
☐ GIL basics and why asyncio helps I/O, not CPU
Evidence
☐ Bench notebook: sync vs async; throughput ↑, failures ↓
Month 6 — Packaging & Deployment
Work deliverables
☐ Dockerized CLI & flows
☐ One-command deploy to GCP (or your infra)
☐ Secrets management; environment promotion (dev → prod)
Python/ETL skills
☐ pyproject.toml, wheels, versioning
☐ Container healthchecks; structured logs to cloud logging
CPython internals
☐ Import system: module caching, sys.meta_path, package layout
☐ Startup cost minimization for fast CLIs
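Module caching and deferred imports, the two ideas behind the items above, can be observed with a short sketch. timed_import and lazy_dumps are illustrative names; whether the first import is actually "cold" depends on what the interpreter has already loaded.

```python
import importlib
import sys
import time


def timed_import(name: str) -> float:
    """Seconds spent importing name; near zero if it is already in sys.modules."""
    start = time.perf_counter()
    importlib.import_module(name)
    return time.perf_counter() - start


def lazy_dumps(obj) -> str:
    """Deferred import: json is loaded on first call, not at CLI startup."""
    import json  # after the first call this is just a sys.modules lookup
    return json.dumps(obj)


if __name__ == "__main__":
    cold = timed_import("decimal")  # may already be cached in some environments
    warm = timed_import("decimal")  # served from the sys.modules cache
    print(f"cold={cold:.6f}s warm={warm:.6f}s")
    print(sys.meta_path)            # finders consulted when the cache misses
```

Running a CLI with python -X importtime prints a per-module import timing tree, which helps decide which imports are worth deferring.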
Evidence
☐ Release tag + Docker image
☐ DEPLOY.md
☐ Sample prod run log
Months 7–9: Performance, geospatial, and C underpinnings
Month 7 — Performance & Memory
Work deliverables
☐ Optimize one heavy transform (e.g., big join) with measured speedup
☐ Memory profile of end-to-end run; reduce peak usage
Python/ETL skills
☐ Polars lazy queries; columnar thinking; pyarrow
☐ Profilers: cProfile, line_profiler, tracemalloc
CPython internals
☐ Reference counting & GC (generations, cycles)
☐ Small object allocator (pymalloc); list growth factor; copy semantics
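Two of the topics above can be observed directly from Python: reference cycles that only the cyclic GC can reclaim, and list over-allocation on append. The sketch assumes CPython; Node, make_cycle, and list_growth_steps are illustrative names, and the exact growth points vary by version.

```python
import gc
import sys
import weakref


class Node:
    def __init__(self):
        self.other = None


def make_cycle():
    """Build two objects that reference each other; return a weak reference to one."""
    a, b = Node(), Node()
    a.other, b.other = b, a
    return weakref.ref(a)


def list_growth_steps(n=64):
    """Lengths at which sys.getsizeof jumps, i.e. where append reallocated."""
    items, last, steps = [], sys.getsizeof([]), []
    for i in range(n):
        items.append(i)
        size = sys.getsizeof(items)
        if size != last:
            steps.append(len(items))
            last = size
    return steps


if __name__ == "__main__":
    gc.disable()                 # keep the automatic collector out of the way
    ref = make_cycle()
    print("alive after return:", ref() is not None)  # refcounts never reach zero
    gc.collect()
    print("alive after collect:", ref() is not None)
    gc.enable()
    print("list resized at lengths:", list_growth_steps())
```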
Evidence
☐ Before/after benchmarks
☐ Memory graph
☐ PR describing changes
Month 8 — Tiny C Extension & CPython Tour
Work deliverables
☐ Minimal C extension (e.g., sumext) or Cython acceleration used in reporting
☐ Write-up: when native code is worth it vs Polars/SQL
Python/ETL skills
☐ Build system for extensions; wheels for your platform
☐ ABI awareness; error handling across C ↔ Python
CPython internals
☐ C-API essentials: PyObject*, ref-counts, arg parsing
☐ Bytecode vs native call overhead; why vectorized libs win
Evidence
☐ sumext module + unit tests
☐ Microbench notebook
☐ Narrative post
Month 9 — Geospatial Layer (optional but powerful)
Work deliverables
☐ Join projects to census tracts/municipal boundaries; publish map layer
☐ Simple equity/regional cut (e.g., spend per capita)
Python/ETL skills
☐ GeoPandas/Shapely; spatial joins/buffers; CRS correctness
☐ Export tiles/GeoJSON; light map in dashboard
CPython internals
☐ Buffer protocol & memoryview; zero-copy ideas
☐ NumPy/pyarrow bridges (high level)
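The zero-copy idea behind the buffer protocol can be tried with the standard library before touching NumPy or pyarrow. In this sketch a memoryview slice shares memory with its source, so writes through the view change the original. scale_in_place is a hypothetical helper.

```python
import array


def scale_in_place(values: array.array, start: int, stop: int, factor: float) -> None:
    """Multiply values[start:stop] in place through a memoryview slice (no copy)."""
    view = memoryview(values)[start:stop]  # shares the array's buffer
    for i in range(len(view)):
        view[i] = view[i] * factor
    view.release()


buf = bytearray(b"header:payload")
payload = memoryview(buf)[7:]              # zero-copy window onto buf
payload[:3] = b"PAY"                       # writes through to buf

if __name__ == "__main__":
    print(bytes(buf))
    vals = array.array("d", [1.0, 2.0, 3.0, 4.0])
    scale_in_place(vals, 1, 3, 10.0)
    print(vals.tolist())
```

Slicing a bytes or bytearray object directly would copy the data; slicing the memoryview does not, which matters when the buffer is large.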
Evidence
☐ geo/ pipeline
☐ Validation screenshots
☐ Map in dashboard
Months 10–12: Capstone, polish, and interviews
Month 10 — Capstone Assembly
Work deliverables
☐ “PR Recovery Data Hub” repo: ingestion → quality → model → publish
☐ Public sample dataset + documentation site (mkdocs)
Python/ETL skills
☐ CLI polish (Typer help, examples)
☐ Error taxonomy; retries as policy
CPython internals
☐ Logging internals; cost of exceptions; fast paths for hot loops
Evidence
☐ Project website
☐ “Why these design choices” post
Month 11 — Resilience & Observability
Work deliverables
☐ SLOs for freshness; dashboards for job runtime/failures
☐ Synthetic canary job; chaos day (break and recover)
Python/ETL skills
☐ Metrics export (Prometheus or CSV→chart); alert tuning
☐ Backfills; schema evolution without downtime
CPython internals
☐ Threading vs multiprocessing; when native libs release the GIL
☐ Serialization costs: pickle vs orjson/Arrow IPC
Evidence
☐ Runbook: incident response section
☐ Post-mortem example
Month 12 — Hiring Pack & Knowledge Share
Work deliverables
☐ Hiring packet: dashboard link, capstone repo, 2 short blog posts
☐ Brown-bag talk for your team; recorded demo
Python/ETL skills
☐ Final refactors; docstrings; type coverage; test coverage report
CPython internals
☐ Review & reinforce: GIL, bytecode, GC, and design rationale
Evidence
☐ Slides
☐ Recorded demo
☐ README with TL;DR for hiring managers