Part 2 — Python for Data Science: Essential Tools and Idioms

This post gets you productive with the core Python stack for data work: pandas, NumPy, and Jupyter. You will create a clean workspace, learn idiomatic patterns, and run a small example on provided sample data. If Part 1 was about “what” and “why,” this part is about “how do I start doing it with the right habits.”


Why Python for Data Science?

  • Huge ecosystem: pandas for tabular data, NumPy for arrays, scikit-learn for ML, Matplotlib/Seaborn/Plotly for viz.
  • Fast path from prototype to production: the same language can power notebooks, APIs (FastAPI), and pipelines (Airflow/Prefect).
  • Community and packages: almost every data source (databases, cloud storage, APIs) has a Python client.
  • Ergonomics: readable syntax, rich notebooks, and plenty of learning resources.

Core Tools at a Glance

Tool               | Purpose                                        | Notes
pandas             | Tabular data (DataFrames), joins, groupby, IO  | Think “Excel + SQL + Python”
NumPy              | Fast n-dimensional arrays and math             | Underpins pandas; great for vectorized ops
JupyterLab         | Interactive notebooks for exploration          | Mix code, text, and charts
Matplotlib/Seaborn | Plotting basics and statistical visuals        | Seaborn builds on Matplotlib for nicer defaults
VS Code (optional) | IDE for refactoring and debugging              | Use with Python + Jupyter extensions

Learning Goals

  • Set up a reproducible Python environment and Jupyter workspace (so your notebook runs the same next week).
  • Load and inspect tabular data with pandas; use NumPy for fast array math.
  • Apply clean code patterns: tidy data, assign, pipe, query, and explicit dtypes.
  • Keep a project folder organized so future you (and teammates) can debug quickly.
  • Know what “good enough” exploration looks like before you jump to modeling.

Setup: Environment and Dependencies

  1. Install Python 3.11+ (or your team’s standard). Consistency avoids “works on my machine.”
  2. Create a virtual environment and install essentials (keeps dependencies isolated per project):
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install pandas numpy jupyterlab matplotlib seaborn

Shortcut: run make install (creates .venv and installs from requirements.txt), then make notebook to launch JupyterLab.

  3. Launch JupyterLab in this repo for notebooks:
jupyter lab
  4. Optional VS Code setup: install the Python and Jupyter extensions; set the interpreter to .venv/bin/python.
  5. Alternative: if you prefer Conda, create an environment with conda create -n ds python=3.11 pandas numpy jupyterlab seaborn.
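
Before moving on, a quick sanity check confirms that the environment you activated is the one actually in use. This is a minimal sketch; the versions printed will depend on what you installed:

# Quick environment sanity check: run inside the activated .venv (or conda env)
import sys
import numpy as np
import pandas as pd

print("Python     :", sys.version.split()[0])
print("Interpreter:", sys.executable)   # should point into .venv (or your conda env)
print("pandas     :", pd.__version__)
print("NumPy      :", np.__version__)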

Quick Data Check: pandas + NumPy in Action

Sample files: part2/orders.csv (orders with product, quantity, price) and part2/customers.csv (customer segment and country). This tiny example mirrors a common starter task: join transactions to customer attributes, compute revenue, and summarize by segment.

import pandas as pd
import numpy as np

orders = pd.read_csv("part2/orders.csv", parse_dates=["order_date"])
customers = pd.read_csv("part2/customers.csv")

# Tidy types for memory and consistency
orders["product"] = orders["product"].astype("category")
customers["segment"] = customers["segment"].astype("category")

# Basic quality checks (fast assertions catch bad inputs early)
assert orders["quantity"].ge(0).all()
assert orders["price"].ge(0).all()

# Revenue per order
orders = orders.assign(order_revenue=lambda d: d["quantity"] * d["price"])

# Join customer info (left join preserves all orders)
orders = orders.merge(customers, on="customer_id", how="left")

# Segment-level summary
summary = (
    orders.groupby("segment")
          .agg(
              orders=("order_id", "count"),
              revenue=("order_revenue", "sum"),
              avg_order_value=("order_revenue", "mean"),
          )
          .sort_values("revenue", ascending=False)
)
print(summary)

# NumPy example: standardize revenue for quick z-scores (population std, ddof=0)
rev = orders["order_revenue"]
orders["revenue_z"] = (rev - rev.mean()) / rev.std(ddof=0)

What you just did:

  • Loaded CSVs with explicit date parsing and dtypes (prevents surprises later).
  • Added a computed column via assign to keep the transformation readable.
  • Joined customer attributes to transactions with a left join (common pattern).
  • Summarized by segment to answer “which customer type drives revenue?”
  • Used NumPy to add a quick z-score, handy for outlier checks or bucketing (a short follow-up sketch appears after the expected output).

Expected summary output (with the sample data):

            orders  revenue  avg_order_value
segment
Enterprise        4   651.00          162.75
SMB               5   385.50           77.10
Consumer          1    66.00           66.00

Use this as a quick sense check: values are positive, orders count matches the CSV, and Enterprise drives the most revenue.
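
The revenue_z column from the snippet above doubles as a quick first outlier check. A minimal follow-up sketch, continuing from the code above (the ±2 threshold is an arbitrary choice for illustration, not a rule):

# Flag orders whose revenue is more than 2 standard deviations from the mean.
outliers = orders.loc[orders["revenue_z"].abs() > 2, ["order_id", "order_revenue", "revenue_z"]]
print(f"{len(outliers)} potential outlier(s)")
print(outliers)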


Companion Notebook

  • Path: part2/notebooks/part2-example.ipynb.
  • Run with the Makefile: make install (first time) then make notebook and open the notebook.
  • Without Makefile: jupyter lab part2/notebooks/part2-example.ipynb (use your activated environment).

Idiomatic pandas Patterns

  • assign: Add columns without breaking method chains.
  • pipe: Encapsulate reusable transformations and keep chains readable.
  • query: Express simple filters with readable expressions.
  • Explicit dtypes: use astype and to_datetime to avoid silent conversions.
  • Small helpers: prefer value_counts(normalize=True) for quick proportions (this and explicit dtypes appear in the second example below).

Example using pipe and query:

def add_order_revenue(df):
    """Add an order_revenue column (quantity * price) without mutating the input."""
    return df.assign(order_revenue=lambda d: d["quantity"] * d["price"])

# Chain: add revenue, keep orders above 200, then average revenue per product.
(orders
 .pipe(add_order_revenue)
 .query("order_revenue > 200")
 .groupby("product")["order_revenue"]
 .mean()
)
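
The other two patterns (explicit dtypes and value_counts(normalize=True)) in one short sketch, continuing with the sample orders frame; the exact dtypes are assumptions about the sample columns:

# Explicit dtypes: convert once, right after loading, so later code can rely on them.
orders = orders.astype({"product": "category", "quantity": "int64"})
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Quick proportions: share of orders per product, as fractions that sum to 1.
print(orders["product"].value_counts(normalize=True))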

Notebook Hygiene and Habits

  • Start every notebook with imports, configuration, and a short Markdown cell stating the question you are answering.
  • Pin a random seed for reproducibility when sampling or modeling (see the setup-cell sketch after this list).
  • Keep side effects contained: write outputs to a data/ or reports/ folder, not your repo root.
  • Restart-and-run-all before sharing; if it fails, fix it before committing.
  • When a notebook grows too large, move reusable code into src/ functions and re-import—treat notebooks as experiments, not long-term code storage.
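
A minimal sketch of a first notebook cell that follows these habits; the reports/ folder and the seed value are illustrative choices, not requirements:

# Notebook setup cell: imports, configuration, and a pinned seed in one place.
from pathlib import Path

import numpy as np
import pandas as pd

RNG = np.random.default_rng(42)         # pinned seed for any sampling below
pd.set_option("display.max_columns", 50)

REPORTS_DIR = Path("reports")           # keep outputs out of the repo root
REPORTS_DIR.mkdir(exist_ok=True)

# Question: which customer segment drives revenue? (state it in Markdown up top too)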

Common Pitfalls to Avoid Early

  • Silent type coercion: always inspect df.dtypes after loading; parse dates explicitly.
  • Chained indexing (df[df["x"] > 0]["y"] = ...) can silently write to a copy; use .loc or assign instead (see the sketch after this list).
  • Skipping data checks: use quick assertions for non-negativity, allowed categories, and unique keys.
  • Mixing raw and cleaned data: keep a clear path (raw → interim/clean → features) with filenames that show the stage.
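
A short sketch of the chained-indexing pitfall and the quick assertions, using the sample orders columns; the bulk flag and the threshold of 10 are made-up examples:

# Chained indexing may write to a temporary copy and be silently lost:
# orders[orders["quantity"] > 10]["bulk"] = True      # unreliable, avoid

# Single .loc call: one clear read-modify-write on the original frame.
orders["bulk"] = False
orders.loc[orders["quantity"] > 10, "bulk"] = True

# Cheap data checks that fail fast on bad inputs.
assert orders["order_id"].is_unique, "duplicate order ids"
assert orders["price"].ge(0).all(), "negative prices"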

Workspace Structure (simple starter)

project-root/
├── data/          # large raw data (keep out of git; use .gitignore)
├── notebooks/     # exploratory notebooks
├── src/           # reusable functions and pipelines
├── reports/       # exported charts/tables
└── env/           # environment files (requirements.txt, conda.yml)
  • Keep sample data small and versioned (like the CSVs here); keep production-scale data in object storage or warehouses.
  • Add a requirements.txt or poetry.lock to freeze dependencies; pin exact versions when collaborating.
  • Name notebooks with prefixes like 01-eda.ipynb, 02-model.ipynb to show flow; add a short one-line purpose at the top.
  • Drop a .gitignore entry for data/ (unless you are keeping only tiny samples) and for notebook checkpoints.
  • Consider a Makefile or simple shell scripts for repeatable tasks (make lint, make test, make notebook).

Practical Checklist

  • ✅ Version control your environment (requirements.txt or poetry.lock).
  • ✅ Enforce dtypes and date parsing on read; log shape and null counts immediately.
  • ✅ Start with asserts and simple profiling (nulls, ranges); fail fast beats silent corruption.
  • ✅ Prefer chains over scattered temporary variables for clarity; factor reusable steps into functions.
  • ✅ Cache interim results to disk (parquet) when they are reused; keep filenames stage-aware (e.g., orders_clean.parquet); a short sketch follows this checklist.
  • ✅ Document assumptions in Markdown cells next to the code; future you will thank present you.
  • ✅ Before modeling, have a crisp question and a success metric; code follows the question, not the other way around.
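
A short sketch of the “log shape and nulls on read, cache interim results” items; the paths are assumptions that follow the workspace layout above, and to_parquet needs pyarrow or fastparquet installed:

import pandas as pd

orders = pd.read_csv("part2/orders.csv", parse_dates=["order_date"])

# Log shape and null counts immediately so surprises show up at load time.
print(f"orders: {orders.shape[0]} rows x {orders.shape[1]} cols")
print(orders.isna().sum())

# Cache the cleaned result to a stage-aware parquet file (assumes data/ exists).
orders.to_parquet("data/orders_clean.parquet", index=False)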

What’s Next (Preview of Part 3)

  • Handling missing values and outliers systematically.
  • Encoding categoricals and first feature engineering patterns.
  • Practical pandas pipelines for cleaning messy, real-world data.