TL;DR: I unlocked the power of pandas by returning to NumPy and learning to see DataFrames as collections of vectorized Series, rather than the row-based mental model I imported from working with Excel.

Vector /ˈvektər/:
an ordered collection of values that an operation applies to all at once rather than one value at a time

Key takeaway: pandas is designed to transform, annotate, and select data by operating on whole vectorized columns at once (Series backed by aligned NumPy arrays), not by looping through data row by row.
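To make that concrete, here is a minimal sketch (with made-up data, not from the post) of one vectorized statement operating on whole columns at once:

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# One statement multiplies the two aligned Series element-wise --
# no explicit loop over rows.
df["total"] = df["price"] * df["qty"]

print(df["total"].tolist())  # [10.0, 40.0, 90.0]
```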

How I apply this

Understanding vectorization has helped me see the geometry underneath pandas. I now have clarity on how to:

  • build and use masks
  • operate on columns and select rows using assign and chained vectorized statements instead of mutating intermediate DataFrames
  • get group-level information by pairing agg and transform with groupby
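A quick sketch of all three techniques on hypothetical sales data (column names and values are my own illustration, not from the post):

```python
import pandas as pd

# Hypothetical data to illustrate the three techniques.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "sales": [100, 200, 50, 150],
})

# 1. Build a boolean mask and use it to select rows.
mask = df["sales"] > 100
high = df[mask]

# 2. Derive columns and select rows in one chained statement with
#    assign, instead of mutating intermediate DataFrames.
result = (
    df.assign(sales_k=lambda d: d["sales"] / 1000)
      .loc[lambda d: d["sales_k"] > 0.05]
)

# 3. Pair groupby with agg (one row per group) and transform
#    (group result broadcast back to the original shape).
totals = df.groupby("region")["sales"].agg("sum")
df["region_share"] = df["sales"] / df.groupby("region")["sales"].transform("sum")
```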

Lab:

I wanted to compare first-hand how the different mental models for working with pandas perform: vectorized Series operations, row-by-row logic via apply, and explicit loops. I built a DataFrame with two integer columns (score1 and score2) and 300,000 rows. For each mental model, I wrote a function that accepted the DataFrame as input, compared the two columns, and returned a Series containing the higher number from each row. I then used timeit to time the execution of each function. (See the code on my GitHub.)
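The exact code lives on GitHub; a minimal sketch of the three approaches (with randomly generated scores standing in for the original data) might look like:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300_000  # row count from the lab
df = pd.DataFrame({
    "score1": rng.integers(0, 100, n),
    "score2": rng.integers(0, 100, n),
})

def vectorized(df):
    # Whole-column comparison in a single NumPy ufunc call.
    return np.maximum(df["score1"], df["score2"])

def row_apply(df):
    # apply(axis=1) invokes a Python function once per row.
    return df.apply(lambda row: max(row["score1"], row["score2"]), axis=1)

def explicit_loop(df):
    # Explicit loop over pre-extracted NumPy arrays.
    s1, s2 = df["score1"].to_numpy(), df["score2"].to_numpy()
    return pd.Series([max(a, b) for a, b in zip(s1, s2)])

# Timing, e.g.:
# import timeit
# timeit.timeit(lambda: vectorized(df), number=10)
```

All three return the same values; only the execution strategy differs, which is what the timings below measure.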

The results:

  • Vectorized: 0.021s
  • Apply: 1.487s (70x slower than vectorized)
  • Loop: 0.065s (3x slower than vectorized)

The takeaway: think vectorization first, loops if needed, and apply(axis=1) only as a last resort.

tags: #Python, #data-science, #performance-testing

From Rows to Vectors: Seeing Pandas as Geometry