Python Mastery: Complete Beginner to Professional

Intro to Data Science: Numpy & Pandas

The engine behind the AI revolution. Understanding how Python handles Big Data.

1. The Big Idea: Why Python?

The Problem: Python is Slow

Let's be honest. Python is slow. In the world of C++ or Rust, Python is a turtle. It processes data line-by-line, dynamically checking types ("Is this an integer? Is this a string?") at every single step. For a list of 1,000 items, this is fine. For a dataset of 100 million rows, this is catastrophic.

So why is Python the #1 language for Data Science and AI?

The Solution: The "Trojan Horse" Strategy

Python uses a clever trick. We write code in friendly, high-level Python. But underneath, libraries like Numpy and Pandas hand off the heavy lifting to incredibly optimized C and Fortran code.

A Numpy array is not just "a list of numbers". It is a contiguous block of memory, exactly like a C array. It bypasses Python's per-element overhead entirely, allowing you to crunch millions of numbers in milliseconds.

2. Numpy: The Foundation of Everything

Numpy (Numerical Python) is the bedrock. Pandas is built on Numpy. Scikit-Learn is built on Numpy. PyTorch mirrors Numpy's API and converts to and from Numpy arrays seamlessly. If you understand Numpy, you understand the entire ecosystem.

2.1 The N-Dimensional Array (`ndarray`)

The `ndarray` is the core object. Unlike a Python List, which is an array of pointers to objects scattered across memory, an `ndarray` is a single solid block of memory.

PYTHON
import numpy as np
import sys

# A standard Python List
py_list = [1, 2, 3, 4, 5]
# Each small integer object in Python is ~28 bytes; the list adds pointer overhead on top.
print(sys.getsizeof(py_list[0]))  # Output: 28

# A Numpy Array
np_arr = np.array([1, 2, 3, 4, 5], dtype='int32')
# Each integer is EXACTLY 4 bytes (32 bits).
# 5 * 4 = 20 bytes for the whole dataset.

print(f"Item size: {np_arr.itemsize} bytes")  # Output: 4
print(f"Total data: {np_arr.nbytes} bytes")   # Output: 20

2.2 The Power of Vectorization

This is the single most important concept in Data Science programming. Vectorization means applying an operation to an entire array at once, rather than looping through elements one by one.

When you loop in Python, the interpreter has to decode the instruction 1,000,000 times. When you vectorize, Numpy pushes the loop down into C, where the CPU can use SIMD (Single Instruction, Multiple Data) instructions to process 4, 8, or 16 numbers per clock cycle.

PYTHON
import time

# Create a massive dataset: 10 Million numbers
size = 10_000_000
data_list = list(range(size))
data_arr = np.arange(size)

# --- Python For Loop ---
start = time.time()
result_list = [x + 5 for x in data_list]
end = time.time()
print(f"Python Loop: {end - start:.4f} seconds")
# Result: ~1.2 seconds

# --- Numpy Vectorization ---
start = time.time()
result_arr = data_arr + 5  # <--- THE MAGIC
end = time.time()
print(f"Numpy Vector: {end - start:.4f} seconds")
# Result: ~0.015 seconds
# SPEEDUP: ~80x FASTER
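Vectorization goes beyond `+ 5`. Numpy's universal functions (ufuncs) like `np.sqrt` apply element-wise in C, and `np.where` acts as a vectorized if/else. A minimal sketch (the values here are invented for illustration):

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0, 16.0])

# A ufunc: square root of every element, no Python loop
roots = np.sqrt(values)

# np.where(condition, value_if_true, value_if_false), applied element-wise
labels = np.where(values > 5, "big", "small")

print(roots)   # [1. 2. 3. 4.]
print(labels)  # ['small' 'small' 'big' 'big']
```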

3. Advanced Numpy Mechanics

3.1 Broadcasting

Broadcasting describes how Numpy treats arrays with different shapes during arithmetic operations. It "stretches" the smaller array to match the larger one without actually copying data.

Rule: Two dimensions are compatible when:

  1. They are equal, or
  2. One of them is 1.

PYTHON
# Matrix (3x3)
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

# Vector, shape (3,)
B = np.array([10, 20, 30])

# Numpy behaves as if B were stacked three times vertically to match A,
# without actually copying it
C = A + B

# Row 1: [1, 2, 3] + [10, 20, 30] = [11, 22, 33]
# Row 2: [4, 5, 6] + [10, 20, 30] = [14, 25, 36]
# Row 3: [7, 8, 9] + [10, 20, 30] = [17, 28, 39]
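Rule 2 also works in the other direction: a column vector of shape (3, 1) stretches horizontally across the columns. A minimal sketch (values invented for illustration):

```python
import numpy as np

A = np.ones((3, 3))

# A column vector: shape (3, 1)
col = np.array([[10], [20], [30]])

# Shapes (3, 3) and (3, 1): the trailing dimensions (3 vs 1) are compatible,
# so the single column is stretched across all 3 columns of A
result = A + col

print(result)
# [[11. 11. 11.]
#  [21. 21. 21.]
#  [31. 31. 31.]]
```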

3.2 Boolean Masking

Instead of writing `if` statements inside loops, you index the array with a boolean condition. The result is a new array containing only the elements where the condition is True.

PYTHON
scores = np.array([55, 89, 76, 32, 99, 41])

# Who passed? (Score > 50)
print(scores > 50)
# Output: [ True  True  True False  True False]

# Get the actual scores
passing_scores = scores[scores > 50]
print(passing_scores)
# Output: [55 89 76 99]
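Masks can be combined with `&` (and) and `|` (or), and because True counts as 1, summing a mask counts the matches. A small sketch using the same scores:

```python
import numpy as np

scores = np.array([55, 89, 76, 32, 99, 41])

# Combine conditions with & / | -- the parentheses are REQUIRED,
# because & binds tighter than the comparison operators
mid_range = scores[(scores > 50) & (scores < 90)]
print(mid_range)  # [55 89 76]

# True == 1, so summing a mask counts how many elements match
print((scores > 50).sum())  # 4
```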

4. Pandas: Excel for Programmers

While Numpy handles raw numbers, Pandas handles data. Real-world data is messy: it has labels, missing values, timestamps, and mixed types (strings and numbers side by side).

The core structure is the DataFrame. Think of it as a programmable Excel spreadsheet.

  • Columns: Each column is a Pandas Series (built on Numpy).
  • Index: The labels for the rows (can be integers or even timestamps).
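You can see both pieces by building a tiny DataFrame by hand (the products and prices here are made up for illustration):

```python
import pandas as pd

# Dict keys become columns; each column is stored as a Series
df = pd.DataFrame({
    "Product": ["Apple", "Banana", "Carrot"],
    "Price": [1.20, 0.50, 0.30],
})

print(type(df["Price"]))   # <class 'pandas.core.series.Series'>
print(df["Price"].values)  # the Numpy array underneath: [1.2 0.5 0.3]
print(df.index)            # RangeIndex(start=0, stop=3, step=1)
```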

4.1 Loading and Inspecting

PYTHON
import pandas as pd

# Pandas can read almost ANY format:
# .read_csv(), .read_excel(), .read_json(), .read_sql(), .read_parquet()
df = pd.read_csv("sales_data.csv")

# Quick Inspection
print(df.head())      # First 5 rows
print(df.shape)       # (Rows, Columns) e.g., (1000, 5)
print(df.info())      # Data types and missing values
print(df.describe())  # Statistical summary (Mean, Min, Max, std deviation)

5. Essential Pandas Operations

5.1 Selecting Data: `loc` vs `iloc`

This is the #1 confusion for beginners.

  • `loc[row_label, col_label]`: Selection by Name. "Give me the row labeled '2023-01-01'".
  • `iloc[row_pos, col_pos]`: Selection by Integer Position. "Give me the 5th row".

PYTHON
# Get the 'Price' of the row with Index 5
price = df.loc[5, "Price"]  # 5 here is an index LABEL, not a position

# Get the first 10 rows and the first 3 columns
subset = df.iloc[0:10, 0:3]
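With the default index of 0, 1, 2, ... the two look identical. The difference bites as soon as the labels stop matching the positions. A minimal sketch (values invented):

```python
import pandas as pd

# An index whose labels are NOT 0, 1, 2 makes the difference obvious
df = pd.DataFrame(
    {"Price": [10, 20, 30]},
    index=[100, 101, 102],
)

print(df.loc[100, "Price"])  # 10 -> by LABEL 100
print(df.iloc[0, 0])         # 10 -> by POSITION 0
# df.loc[0, "Price"] would raise a KeyError: no row is labeled 0
```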

5.2 GroupBy: Split-Apply-Combine

This pattern allows you to segment your data.

  1. Split: Break the data into groups based on a key (e.g., "Category").
  2. Apply: Compute a function for each group (e.g., Sum, Mean, Max).
  3. Combine: Glue the results back into a new DataFrame.

PYTHON
# DATASET:
# Category | Sales
# Fruit    | 100
# Fruit    | 200
# Veg      | 50

# 1. Group by Category
grouped = df.groupby("Category")

# 2. Apply Sum
totals = grouped["Sales"].sum()

# RESULT:
# Fruit: 300
# Veg:   50
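You can also apply several functions per group in one pass with `.agg()`. A sketch using the same toy dataset built by hand:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["Fruit", "Fruit", "Veg"],
    "Sales": [100, 200, 50],
})

# Split by Category, then apply sum, mean, and count to each group
summary = df.groupby("Category")["Sales"].agg(["sum", "mean", "count"])
print(summary)
#           sum   mean  count
# Category
# Fruit     300  150.0      2
# Veg        50   50.0      1
```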

6. Cleaning and Merging

6.1 Handling Missing Data

Real data has holes. Pandas represents missing data as `NaN` (Not a Number).

PYTHON
# Check for missing values
print(df.isna().sum())

# Option 1: Drop rows with missing data
clean_df = df.dropna()

# Option 2: Fill gaps with a default value (Imputation)
# Fill missing ages with the average age
mean_age = df["Age"].mean()
df["Age"] = df["Age"].fillna(mean_age)

6.2 Merging (SQL Joins)

You can join two DataFrames just like SQL tables.

PYTHON
# Start with two tables: Users and Orders
users = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
orders = pd.DataFrame({'order_id': [101, 102], 'user_id': [1, 1]})

# INNER JOIN: Only return matches
merged = pd.merge(users, orders, left_on='id', right_on='user_id', how='inner')
# Result will show Alice's orders. Bob (who has no orders) is excluded.
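To keep Bob in the result anyway, switch to a LEFT JOIN: every user survives, and the missing order columns are filled with `NaN`. Continuing the same two tables:

```python
import pandas as pd

users = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
orders = pd.DataFrame({'order_id': [101, 102], 'user_id': [1, 1]})

# LEFT JOIN: keep every row from users, matched or not
left = pd.merge(users, orders, left_on='id', right_on='user_id', how='left')
print(left)
#    id   name  order_id  user_id
# 0   1  Alice     101.0      1.0
# 1   1  Alice     102.0      1.0
# 2   2    Bob       NaN      NaN
```

Note that `order_id` becomes a float column: integer columns cannot hold `NaN`, so Pandas upcasts them.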

Final Words

This course has taken you from "Hello World" to the bleeding edge of Data Science. You are now equipped with the tools used by Engineers at Google, NASA, and Netflix.

The Path Forward:

  • Machine Learning: Scikit-Learn (Classification, Regression).
  • Deep Learning: PyTorch or TensorFlow (Neural Networks).
  • Big Data: PySpark (Distributed Computing).

Keep coding, keep exploring. The possibilities are infinite.