Part 2 — Python for Data Science: Essential Tools and Idioms

This post gets you productive with the core Python stack for data work: pandas, NumPy, and Jupyter. You will create a clean workspace, learn idiomatic patterns, and run a small example on provided sample data. If Part 1 was about “what” and “why,” this part is about “how do I start doing it with the right habits.”


Why Python for Data Science?

  • Huge ecosystem: pandas for tabular data, NumPy for arrays, scikit-learn for ML, Matplotlib/Seaborn/Plotly for viz.
  • Fast path from prototype to production: the same language can power notebooks, APIs (FastAPI), and pipelines (Airflow/Prefect).
  • Community and packages: almost every data source (databases, cloud storage, APIs) has a Python client.
  • Ergonomics: readable syntax, rich notebooks, and plenty of learning resources.

Core Tools at a Glance

Tool               | Purpose                                        | Notes
pandas             | Tabular data (DataFrames), joins, groupby, IO  | Think “Excel + SQL + Python”
NumPy              | Fast n-dimensional arrays and math             | Underpins pandas; great for vectorized ops
JupyterLab         | Interactive notebooks for exploration          | Mix code, text, and charts
Matplotlib/Seaborn | Plotting basics and statistical visuals        | Seaborn builds on Matplotlib for nicer defaults
VS Code (optional) | IDE for refactoring and debugging              | Use with Python + Jupyter extensions

Learning Goals

  • Set up a reproducible Python environment and Jupyter workspace (so your notebook runs the same next week).
  • Load and inspect tabular data with pandas; use NumPy for fast array math.
  • Apply clean code patterns: tidy data, assign, pipe, query, and explicit dtypes.
  • Keep a project folder organized so future you (and teammates) can debug quickly.
  • Know what “good enough” exploration looks like before you jump to modeling.

Setup: Environment and Dependencies

  1. Install Python 3.11+ (or your team’s standard). Consistency avoids “works on my machine.”
  2. Create a virtual environment and install essentials (keeps dependencies isolated per project):
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install pandas numpy jupyterlab matplotlib seaborn

Shortcut: run make install (creates .venv and installs from requirements.txt), then make notebook to launch JupyterLab.

  3. Launch JupyterLab in this repo for notebooks:
jupyter lab
  4. Optional VS Code setup: install the Python and Jupyter extensions; set the interpreter to .venv/bin/python.
  5. Alternative: if you prefer Conda, create an environment with conda create -n ds python=3.11 pandas numpy jupyterlab seaborn.
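
Before moving on, a quick sanity check confirms that the environment you activated is the one actually in use. This is a minimal sketch; the versions printed will depend on what you installed:

# Quick environment sanity check: run inside the activated .venv (or conda env)
import sys
import numpy as np
import pandas as pd

print("Python     :", sys.version.split()[0])
print("Interpreter:", sys.executable)   # should point into .venv (or your conda env)
print("pandas     :", pd.__version__)
print("NumPy      :", np.__version__)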

Quick Data Check: pandas + NumPy in Action

Sample files: part2/orders.csv (orders with product, quantity, price) and part2/customers.csv (customer segment and country). This tiny example mirrors a common starter task: join transactions to customer attributes, compute revenue, and summarize by segment.

import pandas as pd
import numpy as np

orders = pd.read_csv("part2/orders.csv", parse_dates=["order_date"])
customers = pd.read_csv("part2/customers.csv")

# Tidy types for memory and consistency
orders["product"] = orders["product"].astype("category")
customers["segment"] = customers["segment"].astype("category")

# Basic quality checks (fast assertions catch bad inputs early)
assert orders["quantity"].ge(0).all()
assert orders["price"].ge(0).all()

# Revenue per order
orders = orders.assign(order_revenue=lambda d: d["quantity"] * d["price"])

# Join customer info (left join preserves all orders)
orders = orders.merge(customers, on="customer_id", how="left")

# Segment-level summary
summary = (
    orders.groupby("segment")
          .agg(
              orders=("order_id", "count"),
              revenue=("order_revenue", "sum"),
              avg_order_value=("order_revenue", "mean"),
          )
          .sort_values("revenue", ascending=False)
)
print(summary)

# NumPy example: standardize revenue for quick z-scores (population std, ddof=0)
rev = orders["order_revenue"]
orders["revenue_z"] = (rev - rev.mean()) / rev.std(ddof=0)

What you just did:

  • Loaded CSVs with explicit date parsing and dtypes (prevents surprises later).
  • Added a computed column via assign to keep the transformation readable.
  • Joined customer attributes to transactions with a left join (common pattern).
  • Summarized by segment to answer “which customer type drives revenue?”
  • Used NumPy to add a quick z-score, handy for outlier checks or bucketing (a short follow-up sketch appears after the expected output).

Expected summary output (with the sample data):

            orders  revenue  avg_order_value
segment
Enterprise        4   651.00          162.75
SMB               5   385.50           77.10
Consumer          1    66.00           66.00

Use this as a quick sense check: values are positive, orders count matches the CSV, and Enterprise drives the most revenue.
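
The revenue_z column from the snippet above doubles as a quick first outlier check. A minimal follow-up sketch, continuing from the code above (the ±2 threshold is an arbitrary choice for illustration, not a rule):

# Flag orders whose revenue is more than 2 standard deviations from the mean.
outliers = orders.loc[orders["revenue_z"].abs() > 2, ["order_id", "order_revenue", "revenue_z"]]
print(f"{len(outliers)} potential outlier(s)")
print(outliers)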


Companion Notebook

  • Path: part2/notebooks/part2-example.ipynb.
  • Run with the Makefile: make install (first time) then make notebook and open the notebook.
  • Without Makefile: jupyter lab part2/notebooks/part2-example.ipynb (use your activated environment).

Idiomatic pandas Patterns

  • assign: Add columns without breaking method chains.
  • pipe: Encapsulate reusable transformations and keep chains readable.
  • query: Express simple filters with readable expressions.
  • Explicit dtypes: use astype and to_datetime to avoid silent conversions.
  • Small helpers: prefer value_counts(normalize=True) for quick proportions (this and explicit dtypes appear in the second example below).

Example using pipe and query:

def add_order_revenue(df):
    """Add an order_revenue column (quantity * price) without mutating the input."""
    return df.assign(order_revenue=lambda d: d["quantity"] * d["price"])

# Chain: add revenue, keep orders above 200, then average revenue per product.
(orders
 .pipe(add_order_revenue)
 .query("order_revenue > 200")
 .groupby("product")["order_revenue"]
 .mean()
)
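
The other two patterns (explicit dtypes and value_counts(normalize=True)) in one short sketch, continuing with the sample orders frame; the exact dtypes are assumptions about the sample columns:

# Explicit dtypes: convert once, right after loading, so later code can rely on them.
orders = orders.astype({"product": "category", "quantity": "int64"})
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Quick proportions: share of orders per product, as fractions that sum to 1.
print(orders["product"].value_counts(normalize=True))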

Notebook Hygiene and Habits

  • Start every notebook with imports, configuration, and a short Markdown cell stating the question you are answering.
  • Pin a random seed for reproducibility when sampling or modeling (see the setup-cell sketch after this list).
  • Keep side effects contained: write outputs to a data/ or reports/ folder, not your repo root.
  • Restart-and-run-all before sharing; if it fails, fix it before committing.
  • When a notebook grows too large, move reusable code into src/ functions and re-import—treat notebooks as experiments, not long-term code storage.
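
A minimal sketch of a first notebook cell that follows these habits; the reports/ folder and the seed value are illustrative choices, not requirements:

# Notebook setup cell: imports, configuration, and a pinned seed in one place.
from pathlib import Path

import numpy as np
import pandas as pd

RNG = np.random.default_rng(42)         # pinned seed for any sampling below
pd.set_option("display.max_columns", 50)

REPORTS_DIR = Path("reports")           # keep outputs out of the repo root
REPORTS_DIR.mkdir(exist_ok=True)

# Question: which customer segment drives revenue? (state it in Markdown up top too)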

Common Pitfalls to Avoid Early

  • Silent type coercion: always inspect df.dtypes after loading; parse dates explicitly.
  • Chained indexing (df[df["x"] > 0]["y"] = ...) can silently write to a copy; use .loc or assign instead (see the sketch after this list).
  • Skipping data checks: use quick assertions for non-negativity, allowed categories, and unique keys.
  • Mixing raw and cleaned data: keep a clear path (raw → interim/clean → features) with filenames that show the stage.
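
A short sketch of the chained-indexing pitfall and the quick assertions, using the sample orders columns; the bulk flag and the threshold of 10 are made-up examples:

# Chained indexing may write to a temporary copy and be silently lost:
# orders[orders["quantity"] > 10]["bulk"] = True      # unreliable, avoid

# Single .loc call: one clear read-modify-write on the original frame.
orders["bulk"] = False
orders.loc[orders["quantity"] > 10, "bulk"] = True

# Cheap data checks that fail fast on bad inputs.
assert orders["order_id"].is_unique, "duplicate order ids"
assert orders["price"].ge(0).all(), "negative prices"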

Workspace Structure (simple starter)

project-root/
├── data/          # large raw data (keep out of git; use .gitignore)
├── notebooks/     # exploratory notebooks
├── src/           # reusable functions and pipelines
├── reports/       # exported charts/tables
└── env/           # environment files (requirements.txt, conda.yml)
  • Keep sample data small and versioned (like the CSVs here); keep production-scale data in object storage or warehouses.
  • Add a requirements.txt or poetry.lock to freeze dependencies; pin exact versions when collaborating.
  • Name notebooks with prefixes like 01-eda.ipynb, 02-model.ipynb to show flow; add a short one-line purpose at the top.
  • Drop a .gitignore entry for data/ (unless you are keeping only tiny samples) and for notebook checkpoints.
  • Consider a Makefile or simple shell scripts for repeatable tasks (make lint, make test, make notebook).

Practical Checklist

  • ✅ Version control your environment (requirements.txt or poetry.lock).
  • ✅ Enforce dtypes and date parsing on read; log shape and null counts immediately.
  • ✅ Start with asserts and simple profiling (nulls, ranges); fail fast beats silent corruption.
  • ✅ Prefer chains over scattered temporary variables for clarity; factor reusable steps into functions.
  • ✅ Cache interim results to disk (parquet) when they are reused; keep filenames stage-aware (e.g., orders_clean.parquet); a short sketch follows this checklist.
  • ✅ Document assumptions in Markdown cells next to the code; future you will thank present you.
  • ✅ Before modeling, have a crisp question and a success metric; code follows the question, not the other way around.
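
A short sketch of the “log shape and nulls on read, cache interim results” items; the paths are assumptions that follow the workspace layout above, and to_parquet needs pyarrow or fastparquet installed:

import pandas as pd

orders = pd.read_csv("part2/orders.csv", parse_dates=["order_date"])

# Log shape and null counts immediately so surprises show up at load time.
print(f"orders: {orders.shape[0]} rows x {orders.shape[1]} cols")
print(orders.isna().sum())

# Cache the cleaned result to a stage-aware parquet file (assumes data/ exists).
orders.to_parquet("data/orders_clean.parquet", index=False)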

What’s Next (Preview of Part 3)

  • Handling missing values and outliers systematically.
  • Encoding categoricals and first feature engineering patterns.
  • Practical pandas pipelines for cleaning messy, real-world data.