Comprehensive Learning Guide

📚 Complete Learning Reference

🔧 OpenRefine Data Cleaning Complete Guide

Master data cleaning, transformation, and quality improvement with OpenRefine

Installation & Setup

Step 1: Download OpenRefine

Download OpenRefine 3.6.2 from: https://github.com/OpenRefine/OpenRefine/releases/tag/3.6.2

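Choose the package for your OS (Windows, Mac, Linux). The Mac package and the Windows "with embedded Java" package bundle Java; the other packages require a separate Java installation.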

Step 2: Launch OpenRefine

1. Extract the downloaded file

2. Run the launcher: openrefine.exe on Windows, ./refine on Linux, or the OpenRefine app on Mac

3. If a browser tab does not open automatically, go to: http://127.0.0.1:3333

Project Creation & Data Import

Creating a New Project

1. Click "Create Project"

2. Choose "This Computer" → Browse for your CSV file

3. Click "Next" to preview data

4. Verify column headers and data types

5. Click "Create Project" (top right)

💡 Data Import Best Practices

• Always preview your data before creating the project

• Check if headers are properly detected

• Verify column separation (comma, tab, semicolon)

• Note any encoding issues with special characters

Core Data Cleaning Operations

1. Removing Blank/Null Values

Using Text Facets to Remove Blanks

1. Click on column dropdown → Facet → Text facet

2. In the facet panel (left side), you'll see all unique values

3. Click on (blank) to select only blank rows

4. Open the "All" column dropdown (far left) → Edit rows → Remove all matching rows

5. Close the facet when done

GREL (General Refine Expression Language) Reference

Function	Purpose	Example
`contains(value, "text")`	Check if value contains text	`contains(value, "Airport")`
`value.match(/pattern/)`	Return the capture groups as an array; the regex must match the entire value	`value.match(/.*?([A-Za-z\s]+Airport).*/)`
`split(value, delimiter)`	Split text into array	`split(value, ",")`
`trim(value)`	Remove leading/trailing spaces	`trim(value)`
`if(condition, true, false)`	Conditional logic	`if(contains(value, "Airport"), "Yes", "No")`

Extract Airport Names

if(contains(value, "Airport"),
  value.match(/.*?([A-Za-z\s]+Airport).*/)[0].trim(),
  ""
)
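For comparison, the same extraction can be sketched in Python with the standard `re` module. This is an illustration only: the function name `extract_airport` and the sample strings are assumptions, not part of the OpenRefine workflow. Unlike GREL's `match`, `re.search` does not need the pattern to cover the whole string.

```python
import re

# Same character class as the GREL pattern: letters/whitespace ending in "Airport".
AIRPORT_RE = re.compile(r"[A-Za-z\s]+Airport")

def extract_airport(value: str) -> str:
    """Return the first airport name found in value, or "" if there is none."""
    match = AIRPORT_RE.search(value)
    return match.group(0).strip() if match else ""

# Hypothetical sample values
print(extract_airport("LHR - Heathrow Airport (London)"))
print(extract_airport("Central Station"))
```

Because the character class includes whitespace, any words directly before the name (with no punctuation in between) are captured too, so it is worth spot-checking results on real data.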

⚠️ Common Mistakes to Avoid

• Always backup original data before major operations

• Test GREL expressions on small datasets first

• Check for case sensitivity in filters and matching

• Verify row counts after filtering operations

🌐 Flask Web Development Complete Tutorial

Build data-driven web applications with Python Flask

Flask Installation & Setup

Terminal/Command Prompt

# Optional but recommended: create and activate a virtual environment first
python -m venv flask_env
# Windows: flask_env\Scripts\activate
# Mac/Linux: source flask_env/bin/activate

# Install Flask (into the active environment)
pip install Flask

Basic Flask Application Structure

project_folder/
├── run.py                 # Main application runner
├── flaskapp/
│   ├── __init__.py       # Flask app initialization
│   ├── routes.py         # URL routes and view functions
│   └── templates/
│       └── index.html    # HTML templates
└── data/
    └── dataset.csv       # Data files

run.py

""" run.py - Run the Flask app """
from flaskapp import app

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=3001, debug=True)

flaskapp/__init__.py

from flask import Flask

app = Flask(__name__)

# Imported last, after `app` exists, since routes.py is expected to import `app` from this package
from flaskapp import routes

💡 Essential Tips

• Use debug=True only during development (auto-reload and interactive debugger); turn it off in production

• Organize code into modules (routes, models, utilities)

• Use templates for all HTML (avoid HTML in Python code)

• Handle errors gracefully with try/except blocks
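The tips above can be sketched as a small route that reads the CSV from the project layout and handles a missing file. This is a minimal single-file sketch, not the guide's actual routes.py: the `/rows` URL, the `rows` function, and the `DATA_PATH` constant are hypothetical names chosen for illustration, and the response is JSON to keep the example self-contained.

```python
import csv
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical path, following the project layout shown earlier (data/dataset.csv).
DATA_PATH = "data/dataset.csv"

@app.route("/rows")
def rows():
    """Return the CSV rows as JSON, or a 404 response if the file is missing."""
    try:
        with open(DATA_PATH, newline="", encoding="utf-8") as f:
            records = list(csv.DictReader(f))
    except FileNotFoundError:
        # Fail with a clear message instead of an unhandled 500 error.
        return jsonify(error=f"{DATA_PATH} not found"), 404
    return jsonify(rows=records)
```

In the package layout, the route would live in flaskapp/routes.py and import `app` from `flaskapp` instead of creating it.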

🔍 Regular Expressions Complete Reference

Master pattern matching for text processing and data extraction

Basic Syntax Elements

Pattern	Description	Example	Matches
`.`	Any single character	`a.c`	abc, axc, a1c
`*`	Zero or more of preceding	`ab*c`	ac, abc, abbc
`+`	One or more of preceding	`ab+c`	abc, abbc (not ac)
`?`	Zero or one of preceding	`ab?c`	ac, abc
`^`	Start of string	`^Hello`	Hello world
`$`	End of string	`world$`	Hello world
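The table rows can be checked directly with Python's `re` module. The sketch below uses `re.search` and adds one non-matching string per pattern; those extra strings are assumptions added for illustration.

```python
import re

# (pattern, strings that should match, strings that should not match)
cases = [
    (r"a.c",    ["abc", "axc", "a1c"], ["ac"]),
    (r"ab*c",   ["ac", "abc", "abbc"], ["adc"]),
    (r"ab+c",   ["abc", "abbc"],       ["ac"]),
    (r"ab?c",   ["ac", "abc"],         ["abbc"]),
    (r"^Hello", ["Hello world"],       ["Say Hello"]),
    (r"world$", ["Hello world"],       ["world peace"]),
]

results = {}
for pattern, should_match, should_not in cases:
    # re.search looks for the pattern anywhere in the string.
    ok = all(re.search(pattern, s) for s in should_match) and not any(
        re.search(pattern, s) for s in should_not
    )
    results[pattern] = ok
    print(f"{pattern!r}: {'ok' if ok else 'MISMATCH'}")
```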

Email Address Extraction

import re

# Note: inside [...] a "|" is a literal pipe, so the TLD class is [A-Za-z], not [A-Z|a-z]
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

text = "Contact us at support@example.com or admin@site.org"
emails = re.findall(pattern, text)
print(emails)  # ['support@example.com', 'admin@site.org']

🐍 Python Data Processing Patterns

Essential Python techniques for data manipulation and analysis

Data Loading and Basic Operations

import pandas as pd
import numpy as np

# Load CSV data
df = pd.read_csv('data.csv')

# Basic data exploration
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")
print(f"Missing values:\n{df.isnull().sum()}")

# Display first few rows
print(df.head())

Data Cleaning Patterns

# Remove duplicate rows
df_clean = df.drop_duplicates()

# Then remove rows with missing values in a required column
# (chained on df_clean so the de-duplication above is kept)
df_clean = df_clean.dropna(subset=['important_column'])

# Fill missing values
df['column'] = df['column'].fillna('default_value')
df['numeric'] = df['numeric'].fillna(df['numeric'].mean())

🌲 Ensemble Methods (Boosting vs Bagging)

Understanding parallel vs sequential ensemble learning approaches

Core Concepts

Bagging (Bootstrap Aggregating)

Parallel training: Multiple models trained independently
Bootstrap sampling: Each model is trained on a sample drawn with replacement from the training data
Averaging: Final prediction is average/majority vote
Example: Random Forest

Boosting

Sequential training: Models trained one after another
Error focusing: Later models focus on previous mistakes
Weighted combination: Better models get higher weight
Example: AdaBoost, Gradient Boosting

Bagging: f(x) = (1/M) Σ f_m(x)
Boosting: f(x) = Σ α_m h_m(x)
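The two combination rules can be sketched in a few lines of plain Python. The base-model outputs and the weights α_m below are made-up numbers, used only to show the arithmetic.

```python
def bagging_predict(predictions):
    """Bagging: unweighted average of the M base-model predictions f_m(x)."""
    return sum(predictions) / len(predictions)

def boosting_predict(alphas, predictions):
    """Boosting: weighted sum of weak-learner outputs, sum of alpha_m * h_m(x)."""
    return sum(a * h for a, h in zip(alphas, predictions))

# Hypothetical outputs of three base models for one input x
bag = bagging_predict([2.0, 3.0, 4.0])
boost = boosting_predict([0.5, 1.0, 0.25], [+1, -1, +1])
print(bag, boost)
```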

🌳 CART Decision Trees

Classification and Regression Trees for interpretable machine learning

Gini Impurity: G = 1 - Σ p_i²
Entropy: H = -Σ p_i log(p_i)
Information Gain: IG = H(parent) - Σ (n_i/n) H(child_i)
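The three impurity formulas above translate directly into code. This sketch assumes base-2 logarithms for entropy (the formula does not fix a base) and uses a tiny made-up label set.

```python
from collections import Counter
from math import log2

def class_probs(labels):
    """Relative frequency p_i of each class in labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """G = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_probs(labels))

def entropy(labels):
    """H = -sum(p_i * log2(p_i)); base 2 is an assumption."""
    return -sum(p * log2(p) for p in class_probs(labels))

def information_gain(parent, children):
    """IG = H(parent) - sum((n_i / n) * H(child_i))."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with two classes, split perfectly into pure children
parent = ["a", "a", "b", "b"]
split = [["a", "a"], ["b", "b"]]
print(gini(parent), entropy(parent), information_gain(parent, split))
```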

Hyperparameter Guidelines

• max_depth: Start with 5-10, then tune on validation data

• min_samples_split: 2-20; use higher values for noisy data

• min_samples_leaf: 1-10; higher values reduce overfitting

📡 Compressed Sensing & Medical Imaging

Sparse signal recovery from underdetermined systems

Measurement Model: y = Ax + ε
Lasso: min ||y - Ax||₂² + λ||x||₁
Ridge: min ||y - Ax||₂² + λ||x||₂²

Lasso vs Ridge for Sparse Recovery

Lasso (L1): Can set coefficients exactly to zero, so it can recover sparse signals
Ridge (L2): Shrinks all coefficients toward zero but generally leaves them nonzero
Medical imaging: MR images are approximately sparse in a transform domain (e.g., wavelets)
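The difference is easiest to see in the special case A = I, an assumption made here only because both objectives then have per-coordinate closed forms: soft-thresholding at λ/2 for Lasso and division by (1 + λ) for Ridge. The measurement vector and λ below are made up.

```python
def lasso_identity(y, lam):
    """Minimize (y_i - x_i)^2 + lam * |x_i| per coordinate (assumes A = I)."""
    def soft_threshold(v):
        magnitude = max(abs(v) - lam / 2.0, 0.0)
        if magnitude == 0.0:
            return 0.0  # small entries become exactly zero
        return magnitude if v > 0 else -magnitude
    return [soft_threshold(v) for v in y]

def ridge_identity(y, lam):
    """Minimize (y_i - x_i)^2 + lam * x_i^2 per coordinate (assumes A = I)."""
    return [v / (1.0 + lam) for v in y]

# Hypothetical measurements: two large entries, two small ones
y = [3.0, 0.2, -0.1, -2.0]
lam = 1.0
lasso = lasso_identity(y, lam)
ridge = ridge_identity(y, lam)
print("lasso:", lasso)
print("ridge:", ridge)
```

For a general A there is no such closed form, and an iterative solver is needed.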

⚡ AdaBoost Algorithm

Adaptive boosting for sequential weak learner combination

AdaBoost Algorithm Steps

Initialize weights: D₁(i) = 1/n for all i
For t = 1 to T: Train weak learner hₜ on weights Dₜ, compute weighted error εₜ, compute αₜ, update weights to Dₜ₊₁
Final classifier: H(x) = sign(Σ αₜhₜ(x))

Weight Update: Dₜ₊₁(i) = (Dₜ(i) × exp(-αₜyᵢhₜ(xᵢ))) / Zₜ
Alpha: αₜ = ½ ln((1-εₜ)/εₜ)
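One round of the update can be traced by hand or in code. The sketch below assumes labels and predictions in {-1, +1} and a weighted error strictly between 0 and 1; the four-sample data set is made up.

```python
from math import exp, log

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: weighted error, alpha, and normalized new weights."""
    # eps_t: total weight of the misclassified samples (assumed 0 < eps < 1)
    eps = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    alpha = 0.5 * log((1.0 - eps) / eps)
    # y * h is +1 when correct (weight shrinks) and -1 when wrong (weight grows)
    unnormalized = [w * exp(-alpha * y * h)
                    for w, y, h in zip(weights, y_true, y_pred)]
    z = sum(unnormalized)  # Z_t, so the new weights sum to 1
    return eps, alpha, [u / z for u in unnormalized]

n = 4
weights = [1.0 / n] * n      # D_1(i) = 1/n
y_true = [+1, +1, -1, -1]
y_pred = [+1, +1, -1, +1]    # hypothetical weak learner, wrong on the last sample
eps, alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(eps, alpha, new_weights)
```

After the update, the misclassified sample carries half of the total weight, which is what forces the next weak learner to focus on it.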

🌲 Random Forest

Ensemble of decision trees with bootstrap aggregating

Random Forest Components

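Bootstrap sampling: Each tree is trained on a sample drawn with replacement from the training data
Random feature selection: At each split, consider only a random subset of the features
Majority voting: Final prediction by majority vote (classification) or averaging (regression)
Out-of-bag error: Error estimate computed on the samples each tree did not see during training, with no separate validation set needed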
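Bootstrap sampling and the out-of-bag set can be sketched with the standard library. The sample size and the seed are arbitrary choices for illustration; for large n, roughly 1/e (about 37%) of the samples are expected to be out-of-bag for any one tree.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 1000
in_bag, oob = bootstrap_indices(n_samples, rng)
oob_fraction = len(oob) / n_samples
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```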

Optimization Tips

• max_features: √p for classification, p/3 for regression

• n_estimators: Start with 100, increase until OOB error stabilizes

• Feature importance: Use for feature selection and interpretation

🎯 One-Class SVM

Novelty detection and anomaly identification

Novelty Detection Approach

Training: Learn boundary around "normal" data only
Prediction: +1 for normal, -1 for outliers/anomalies
Use cases: Fraud detection, spam detection, fault detection

Decision Function: f(x) = sign(Σ αᵢ K(xᵢ, x) - ρ)
RBF Kernel: K(x, y) = exp(-γ||x - y||²)
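The two formulas can be evaluated by hand on a toy example. In the sketch below the support vectors, the coefficients αᵢ, the offset ρ, and γ are all made-up values, not the result of fitting a model; in practice they come from training.

```python
from math import exp

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)

def decision(x, support_vectors, alphas, rho, gamma):
    """f(x) = sign(sum(alpha_i * K(x_i, x)) - rho): +1 normal, -1 anomaly."""
    score = sum(a * rbf_kernel(sv, x, gamma)
                for a, sv in zip(alphas, support_vectors)) - rho
    return 1 if score >= 0 else -1

# Hypothetical, hand-picked parameters (not fitted)
support_vectors = [(0.0, 0.0), (1.0, 0.0)]
alphas = [0.5, 0.5]
rho = 0.4
gamma = 1.0

near = decision((0.5, 0.0), support_vectors, alphas, rho, gamma)
far = decision((5.0, 5.0), support_vectors, alphas, rho, gamma)
print(near, far)
```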

📈 Locally Weighted Linear Regression

Non-parametric regression with local model fitting

Objective: min Σ (yᵢ - β₀ - (x - xᵢ)ᵀβ₁)² Kₕ(x - xᵢ)
Solution: β̂ = (XᵀWX)⁻¹XᵀWY, where row i of X is [1, (x - xᵢ)ᵀ], W = diag(Kₕ(x - xᵢ)), and the prediction at x is β̂₀
Gaussian Kernel: Kₕ(z) = exp(-||z||²/(2h²))
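For one-dimensional inputs the weighted normal equations reduce to a 2×2 system that can be solved by hand. The sketch below makes that assumption, follows the (x - xᵢ) sign convention of the objective, and adds a small ridge term to the diagonal for numerical stability; the function name, the ridge value, and the data are illustrative choices.

```python
from math import exp

def lwlr_predict(x0, xs, ys, h, ridge=1e-9):
    """Locally weighted linear prediction at x0 for 1-D inputs (returns beta_0)."""
    # Gaussian kernel weights K_h(x0 - x_i)
    w = [exp(-((x0 - xi) ** 2) / (2.0 * h * h)) for xi in xs]
    d = [x0 - xi for xi in xs]  # centered regressor (x - x_i)

    # Entries of X^T W X and X^T W Y for the design matrix with rows [1, d_i]
    s_w = sum(w) + ridge
    s_d = sum(wi * di for wi, di in zip(w, d))
    s_dd = sum(wi * di * di for wi, di in zip(w, d)) + ridge
    s_y = sum(wi * yi for wi, yi in zip(w, ys))
    s_dy = sum(wi * di * yi for wi, di, yi in zip(w, d, ys))

    # Solve the 2x2 system for beta_0 (Cramer's rule)
    det = s_w * s_dd - s_d * s_d
    return (s_y * s_dd - s_d * s_dy) / det

# Hypothetical data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
prediction = lwlr_predict(2.5, xs, ys, h=1.0)
print(prediction)
```

Because the data are exactly linear, the local fit reproduces the line, so the prediction at 2.5 is 6 up to the tiny ridge perturbation.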

LWLR Implementation Tips

• Bandwidth selection: Use cross-validation for optimal h

• Numerical stability: Add small regularization to XᵀWX

• Computational efficiency: Pre-compute distances when possible

⚖️ Bias-Variance Tradeoff

Understanding the fundamental tradeoff in machine learning

Total Error = Bias² + Variance + Irreducible Error

Model Complexity Effects

High Bias (Underfitting): Model too simple, poor on both train and test
High Variance (Overfitting): Model too complex, good train, poor test

Managing Bias-Variance Tradeoff

• Cross-validation: Use for model selection and hyperparameter tuning

• Regularization: Add penalty terms to control model complexity

• Ensemble methods: Combine multiple models to reduce variance

• Data size: More data generally reduces variance

✅ Cross-Validation Techniques

Robust model evaluation and selection methods

Types of Cross-Validation

K-Fold CV: Split data into k folds, train on k-1, test on 1
Stratified K-Fold: Maintains class distribution in each fold
Leave-One-Out (LOO): k = n, leave one sample out each time
Time Series CV: Respect temporal order in splits
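The K-fold split itself is simple index bookkeeping, sketched below without any library. This version uses contiguous folds without shuffling, an assumption made for simplicity; Leave-One-Out is the special case k = n.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over range(n)."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in exactly one test fold and in the training set of all other folds.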

Cross-Validation Implementation

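from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Example data and model (placeholders; substitute your own X, y, and estimator)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")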

🔧 Common Issues & Solutions

Solutions to frequent problems in data processing and machine learning

⚠️ Model Overfitting

Problem: High training accuracy, poor test performance

Solutions: Reduce model complexity, add regularization, use cross-validation, get more data

⚠️ SettingWithCopyWarning in Pandas

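Problem: pandas warns when a value is assigned through chained indexing (e.g., df[df['A'] > 5]['B'] = 'value'), because the assignment may modify a temporary copy instead of the original DataFrame

Solution: Select rows and column in a single .loc call: df.loc[df['A'] > 5, 'B'] = 'value'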

⚠️ Template Not Found (Flask)

Problem: Flask can't locate HTML templates

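Solutions: Ensure templates are in the templates/ folder inside the application package (flaskapp/templates/), and check that the file name passed to render_template matches the file exactly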

🚀 Future Learning Areas

Structured areas for expanding knowledge and skills

🧠 Deep Learning

Neural networks, TensorFlow/PyTorch, CNNs, RNNs, transformers

📈 Data Visualization

Matplotlib, seaborn, plotly, dashboard creation, interactive charts

☁️ Cloud Platforms

AWS, Azure, Google Cloud, serverless computing