Comprehensive Learning Reference

Data Science & ML Mastery Guide

Complete tutorials and reference notes for data processing, machine learning algorithms, interactive math tools, OpenRefine, Flask development, and advanced ML concepts.

🔧 OpenRefine Data Cleaning Complete Guide

Master data cleaning, transformation, and quality improvement with OpenRefine

Installation & Setup

Step 1: Download OpenRefine

Download OpenRefine 3.6.2 from: https://github.com/OpenRefine/OpenRefine/releases/tag/3.6.2

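Choose the download for your OS (Windows, Mac, or Linux). For 3.6.2, the Mac kit and the Windows kit labeled "with embedded Java" bundle Java; the plain Windows kit and the Linux kit need Java installed separately.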

Step 2: Launch OpenRefine

1. Extract the downloaded file

2. Run the launcher (openrefine.exe on Windows, ./refine on Linux, the OpenRefine app on Mac)

3. Open a browser and go to: http://127.0.0.1:3333

Project Creation & Data Import

Creating a New Project

1. Click "Create Project"

2. Choose "This Computer" → Browse for your CSV file

3. Click "Next" to preview data

4. Verify column headers and data types

5. Click "Create Project" (top right)

💡 Data Import Best Practices

• Always preview your data before creating the project

• Check if headers are properly detected

• Verify column separation (comma, tab, semicolon)

• Note any encoding issues with special characters

Core Data Cleaning Operations

1. Removing Blank/Null Values

Using Text Facets to Remove Blanks

1. Click on the column dropdown → Facet → Text facet

2. In the facet panel (left side), you'll see all unique values

3. Click on (blank) to select only blank rows

4. Open the "All" column dropdown (the leftmost column) → Edit rows → Remove all matching rows

5. Close the facet when done

GREL (General Refine Expression Language) Reference

Function | Purpose | Example
contains(value, "text") | Check whether value contains text | contains(value, "Airport")
value.match(/pattern/) | Return the regex capture groups as an array; the pattern must match the entire value | value.match(/.*?([A-Za-z\s]+Airport).*/)
split(value, delimiter) | Split text into an array | split(value, ",")
trim(value) | Remove leading/trailing whitespace | trim(value)
if(condition, true, false) | Conditional logic | if(contains(value, "Airport"), "Yes", "No")
Extract Airport Names
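if(contains(value, "Airport"),
  value.match(/.*?([A-Za-z\s]+Airport).*/)[0].trim(),
  ""
)
Note: match() succeeds only when the pattern covers the whole cell value, so the capture group is wrapped in .*? and .* to allow text before and after the airport name.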
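For comparison, the same extraction can be sketched in Python with the re module. This is only an illustration, not part of the OpenRefine workflow: the helper name extract_airport and the sample strings are assumptions. Unlike GREL's match(), re.search() looks for the pattern anywhere in the string, so no .* wrapping is needed.

```python
import re

# Same character class as the GREL expression: letters and whitespace, then "Airport".
AIRPORT_PATTERN = re.compile(r"[A-Za-z\s]+Airport")

def extract_airport(value: str) -> str:
    """Return the first airport-like phrase in value, or "" if none is found."""
    found = AIRPORT_PATTERN.search(value)
    return found.group(0).strip() if found else ""

# Hypothetical sample values
print(extract_airport("12: Heathrow Airport (LHR)"))
print(extract_airport("City centre"))
```

Because the character class accepts any letters and spaces, the match also picks up ordinary words that directly precede the airport name; tighten the pattern if your data needs it.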
āš ļø Common Mistakes to Avoid

• Always backup original data before major operations

• Test GREL expressions on small datasets first

• Check for case sensitivity in filters and matching

• Verify row counts after filtering operations

🌐 Flask Web Development Complete Tutorial

Build data-driven web applications with Python Flask

Flask Installation & Setup

Terminal/Command Prompt
# Recommended: create and activate a virtual environment first
python -m venv flask_env
# Windows: flask_env\Scripts\activate
# Mac/Linux: source flask_env/bin/activate

# Install Flask (inside the activated environment, if you created one)
pip install Flask

Basic Flask Application Structure

project_folder/
├── run.py                 # Main application runner
├── flaskapp/
│   ├── __init__.py       # Flask app initialization
│   ├── routes.py         # URL routes and view functions
│   └── templates/
│       └── index.html    # HTML templates
└── data/
    └── dataset.csv       # Data files
run.py
""" run.py - Run the Flask app """
from flaskapp import app

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=3001, debug=True)
flaskapp/__init__.py
from flask import Flask

app = Flask(__name__)

# Imported last, after app exists, because routes.py itself imports app
from flaskapp import routes
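The structure above lists routes.py without showing it. A minimal single-file sketch of what such routes could look like follows; the route paths, function names, and return values are illustrative assumptions, not the original file.

```python
from flask import Flask, jsonify

# In the package layout above, this object lives in flaskapp/__init__.py and
# routes.py would instead begin with: from flaskapp import app
app = Flask(__name__)

@app.route("/")
def index():
    # A real project would typically return render_template("index.html") here.
    return "Hello from Flask"

@app.route("/api/status")
def status():
    # Hypothetical JSON endpoint, included only to show a second route.
    return jsonify({"status": "ok"})
```

Flask's built-in test client (app.test_client()) can exercise these routes without starting a server.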
💡 Essential Tips

• Use debug=True for development only (auto-reload on changes); never enable it in production

• Organize code into modules (routes, models, utilities)

• Use templates for all HTML (avoid HTML in Python code)

• Handle errors gracefully with try/except blocks

šŸ” Regular Expressions Complete Reference

Master pattern matching for text processing and data extraction

Basic Syntax Elements

Pattern | Description | Example | Matches
. | Any single character | a.c | abc, axc, a1c
* | Zero or more of preceding | ab*c | ac, abc, abbc
+ | One or more of preceding | ab+c | abc, abbc (not ac)
? | Zero or one of preceding | ab?c | ac, abc
^ | Start of string | ^Hello | Hello world
$ | End of string | world$ | Hello world
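The table rows can be checked directly with Python's re module. The sketch below is illustrative: the candidate strings and the helper name full_matches are assumptions chosen to mirror the table.

```python
import re

# Pattern -> candidate strings, mirroring the table rows above
cases = {
    r"a.c": ["abc", "a1c", "ac"],
    r"ab*c": ["ac", "abbc", "adc"],
    r"ab+c": ["abc", "ac"],
    r"ab?c": ["ac", "abc", "abbc"],
}

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

results = {p: full_matches(p, c) for p, c in cases.items()}
for pattern, matched in results.items():
    print(pattern, "->", matched)

# Anchors: ^ ties the match to the start of the string, $ to the end
starts_with_hello = re.search(r"^Hello", "Hello world") is not None
ends_with_world = re.search(r"world$", "Hello world") is not None
starts_with_world = re.search(r"^world", "Hello world") is not None
```

re.fullmatch is used so that each candidate must match the pattern completely; re.search would also accept strings that merely contain a match.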
Email Address Extraction
import re

# Note: inside a character class "|" is a literal pipe, so the TLD part is [A-Za-z], not [A-Z|a-z]
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

text = "Contact us at support@example.com or admin@site.org"
emails = re.findall(pattern, text)
print(emails)  # ['support@example.com', 'admin@site.org']

šŸ Python Data Processing Patterns

Essential Python techniques for data manipulation and analysis

Data Loading and Basic Operations
import pandas as pd
import numpy as np

# Load CSV data
df = pd.read_csv('data.csv')

# Basic data exploration
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Data types:\n{df.dtypes}")
print(f"Missing values:\n{df.isnull().sum()}")

# Display first few rows
print(df.head())
Data Cleaning Patterns
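# Remove duplicate rows
df_clean = df.drop_duplicates()

# Remove rows with missing values in a required column
# (chained from df_clean so the de-duplication above is kept)
df_clean = df_clean.dropna(subset=['important_column'])

# Fill missing values ('column' and 'numeric' are placeholder column names)
df_clean['column'] = df_clean['column'].fillna('default_value')
df_clean['numeric'] = df_clean['numeric'].fillna(df_clean['numeric'].mean())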

🌲 Ensemble Methods (Boosting vs Bagging)

Understanding parallel vs sequential ensemble learning approaches

Core Concepts

Bagging (Bootstrap Aggregating)

Boosting

Bagging: f(x) = (1/M) Σ f_m(x)
Boosting: f(x) = Σ α_m h_m(x)
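The two combination rules above can be sketched in a few lines. The numbers are made-up model outputs, used only to show the arithmetic; in practice the f_m, h_m, and α_m come from trained models.

```python
def bagging_predict(predictions):
    """Bagging: f(x) = (1/M) * sum of f_m(x), an unweighted average."""
    return sum(predictions) / len(predictions)

def boosting_predict(alphas, predictions):
    """Boosting: f(x) = sum of alpha_m * h_m(x), a weighted sum."""
    return sum(a * h for a, h in zip(alphas, predictions))

# Hypothetical outputs of three base models for a single input x
bagged = bagging_predict([2.0, 3.0, 4.0])
boosted = boosting_predict([0.5, 0.3, 0.2], [1, -1, 1])
print(bagged, boosted)
```

In bagging every model gets the same vote, while in boosting the weights α_m let more accurate learners count for more.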

🌳 CART Decision Trees

Classification and Regression Trees for interpretable machine learning

Gini Impurity: G = 1 - Σ p_i²
Entropy: H = -Σ p_i log(p_i)
Information Gain: IG = H(parent) - Σ (n_i/n) H(child_i)
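A minimal sketch of the three formulas above, using only the standard library. The base-2 logarithm and the toy labels are assumptions; the formulas above do not fix a log base.

```python
from collections import Counter
from math import log2

def class_probabilities(labels):
    """Relative frequency p_i of each class in labels."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]

def gini(labels):
    """G = 1 - sum(p_i^2)"""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    """H = -sum(p_i * log2(p_i)); base 2 is an assumption."""
    return -sum(p * log2(p) for p in class_probabilities(labels))

def information_gain(parent, children):
    """IG = H(parent) - sum((n_i / n) * H(child_i))"""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy example: a perfectly balanced parent split into two pure children
parent = ["yes", "yes", "no", "no"]
children = [["yes", "yes"], ["no", "no"]]
print(gini(parent), entropy(parent), information_gain(parent, children))
```

A split that produces pure children recovers all of the parent's entropy, which is the largest gain possible for that node.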
Hyperparameter Guidelines

• max_depth: Start with 5-10, tune based on validation

• min_samples_split: 2-20, higher for noisy data

• min_samples_leaf: 1-10, higher prevents overfitting

📡 Compressed Sensing & Medical Imaging

Sparse signal recovery from underdetermined systems

Measurement Model: y = Ax + ε
Lasso: min ||y - Ax||₂² + λ||x||₁
Ridge: min ||y - Ax||₂² + λ||x||₂²

Lasso vs Ridge for Sparse Recovery
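One way to see the difference is the simplest possible case. The sketch below assumes A is the identity matrix, so each coordinate can be solved on its own; this is an illustrative simplification, not the general compressed sensing setting, and the observation values are made up.

```python
def lasso_1d(y, lam):
    """Minimiser of (y - x)^2 + lam * |x|: soft-thresholding at lam / 2."""
    shrunk = abs(y) - lam / 2.0
    if shrunk <= 0:
        return 0.0  # small coefficients are set exactly to zero
    return shrunk if y > 0 else -shrunk

def ridge_1d(y, lam):
    """Minimiser of (y - x)^2 + lam * x^2: proportional shrinkage."""
    return y / (1.0 + lam)

# Hypothetical observations: two large entries, two small ones
observations = [3.0, 0.4, -0.2, -2.5]
lam = 1.0
lasso_est = [lasso_1d(y, lam) for y in observations]
ridge_est = [ridge_1d(y, lam) for y in observations]
print(lasso_est)
print(ridge_est)
```

Under this assumption the L1 penalty zeroes out the small entries, while the L2 penalty only scales every entry down and leaves none exactly zero, which is why Lasso is the usual choice when the signal is expected to be sparse.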

⚡ AdaBoost Algorithm

Adaptive boosting for sequential weak learner combination

AdaBoost Algorithm Steps

  1. Initialize weights: D₁(i) = 1/n for all i
  2. For t = 1 to T: Train weak learner, calculate error, calculate alpha, update weights
  3. Final classifier: H(x) = sign(Σ αₜhₜ(x))
Weight Update: Dₜ₊₁(i) = (Dₜ(i) × exp(-αₜyᵢhₜ(xᵢ))) / Zₜ
Alpha: αₜ = ½ ln((1-εₜ)/εₜ)
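A single round of the update above can be sketched as follows. The function name adaboost_round and the four-point toy data are assumptions; labels and predictions are taken to be in {-1, +1}, and the weak learner's predictions are given rather than trained.

```python
from math import exp, log

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: return (alpha_t, D_{t+1}) from D_t and predictions."""
    # Weighted error: total weight of the misclassified points
    eps = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    # Assumes 0 < eps < 1; eps = 0 or 1 would need special handling
    alpha = 0.5 * log((1.0 - eps) / eps)
    # y * h is +1 for correct predictions and -1 for mistakes
    unnormalised = [w * exp(-alpha * y * h)
                    for w, y, h in zip(weights, y_true, y_pred)]
    z = sum(unnormalised)  # normaliser Z_t
    return alpha, [u / z for u in unnormalised]

n = 4
weights = [1.0 / n] * n          # D_1(i) = 1/n
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]           # the last point is misclassified
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update the misclassified point carries half of the total weight, so the next weak learner is pushed to get it right.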

🌲 Random Forest

Ensemble of decision trees with bootstrap aggregating

Random Forest Components

Optimization Tips

• max_features: √p for classification, p/3 for regression

• n_estimators: Start with 100, increase until OOB error stabilizes

• Feature importance: Use for feature selection and interpretation
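The max_features rule of thumb and the idea behind out-of-bag (OOB) error can be sketched without any ML library. The helper names, the rounding down, and the fixed seed are assumptions made for illustration.

```python
import math
import random

def default_max_features(p, task):
    """Rule of thumb from the tips: sqrt(p) for classification, p/3 for regression.
    Rounding down (with a minimum of 1) is an assumption of this sketch."""
    if task == "classification":
        return max(1, int(math.sqrt(p)))
    return max(1, p // 3)

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement, as done for each tree."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows = 1000
sample = bootstrap_indices(n_rows, rng)
# Rows never drawn are "out of bag" for this tree and can serve as its test set
oob_rows = set(range(n_rows)) - set(sample)
oob_fraction = len(oob_rows) / n_rows
print(default_max_features(16, "classification"), oob_fraction)
```

Each bootstrap sample leaves out roughly a third of the rows (about 1/e of them for large n), which is what makes OOB error a built-in validation estimate.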

🎯 One-Class SVM

Novelty detection and anomaly identification

Novelty Detection Approach

Decision Function: f(x) = sign(Σ αᵢ K(xᵢ, x) - ρ)
RBF Kernel: K(x, y) = exp(-γ||x - y||²)
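The decision function above can be evaluated by hand once the support vectors, α values, and ρ are known. In the sketch below those values are hand-picked for illustration, not the result of training.

```python
from math import exp

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)

def decision(x, support_vectors, alphas, rho, gamma):
    """f(x) = sign(sum_i alpha_i * K(x_i, x) - rho): +1 inlier, -1 novelty."""
    score = sum(a * rbf_kernel(sv, x, gamma)
                for a, sv in zip(alphas, support_vectors))
    return 1 if score - rho >= 0 else -1

# Hypothetical model parameters (not learned from data)
support_vectors = [(0.0, 0.0), (1.0, 0.0)]
alphas = [0.5, 0.5]
rho = 0.3
gamma = 1.0

near = decision((0.5, 0.0), support_vectors, alphas, rho, gamma)
far = decision((5.0, 5.0), support_vectors, alphas, rho, gamma)
print(near, far)
```

Points close to the support vectors get kernel values near 1 and land on the positive side; distant points get kernel values near 0 and are flagged as novelties.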

📈 Locally Weighted Linear Regression

Non-parametric regression with local model fitting

Objective: min Σ (yᵢ - β₀ - (x - xᵢ)ᵀβ₁)² Kₕ(x - xᵢ)
Solution: β̂ = (XᵀWX)⁻¹XᵀWY
Gaussian Kernel: Kₕ(z) = exp(-||z||²/(2h²))
LWLR Implementation Tips

• Bandwidth selection: Use cross-validation for optimal h

• Numerical stability: Add small regularization to XᵀWX

• Computational efficiency: Pre-compute distances when possible
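For one-dimensional inputs the weighted normal equations reduce to a 2×2 system, which makes a dependency-free sketch possible. The function name lwlr_predict, the default ridge value, and the toy data are assumptions; the sketch centers on the query point using xᵢ - x, which only flips the sign of the slope and leaves the prediction β₀ unchanged.

```python
from math import exp

def lwlr_predict(x0, xs, ys, h, ridge=1e-8):
    """Locally weighted linear prediction at x0 for 1-D inputs."""
    # Gaussian kernel weights K_h(x0 - x_i)
    w = [exp(-((x0 - x) ** 2) / (2.0 * h * h)) for x in xs]
    # Center on the query point so the intercept beta_0 is the prediction
    d = [x - x0 for x in xs]
    # Entries of X^T W X (plus a small ridge term for stability) and X^T W Y
    s0 = sum(w) + ridge
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d)) + ridge
    t0 = sum(wi * yi for wi, yi in zip(w, ys))
    t1 = sum(wi * di * yi for wi, di, yi in zip(w, d, ys))
    # Solve the 2x2 system for beta_0 by Cramer's rule
    det = s0 * s2 - s1 * s1
    return (s2 * t0 - s1 * t1) / det

# Toy data on the exact line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
prediction = lwlr_predict(2.5, xs, ys, h=1.0)
print(prediction)
```

On exactly linear data the local fit reproduces the line up to the tiny ridge term; on curved data, a smaller h follows the curve more closely at the cost of higher variance.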

āš–ļø Bias-Variance Tradeoff

Understanding the fundamental tradeoff in machine learning

Total Error = Bias² + Variance + Irreducible Error

Model Complexity Effects

Managing Bias-Variance Tradeoff

• Cross-validation: Use for model selection and hyperparameter tuning

• Regularization: Add penalty terms to control model complexity

• Ensemble methods: Combine multiple models to reduce variance

• Data size: More data generally reduces variance

✅ Cross-Validation Techniques

Robust model evaluation and selection methods

Types of Cross-Validation

Cross-Validation Implementation
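from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data and model; substitute your own X, y, and estimator
X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=5, random_state=0)
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")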
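To make concrete what cv=5 means, here is a dependency-free sketch of how k-fold splitting partitions row indices. The function name k_fold_indices is an assumption, and the sketch uses plain contiguous folds without shuffling; for classifiers, scikit-learn's integer cv option uses stratified folds instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row appears in exactly one test fold, so each observation is used for evaluation once and for training k - 1 times.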

🔧 Common Issues & Solutions

Solutions to frequent problems in data processing and machine learning

āš ļø Model Overfitting

Problem: High training accuracy, poor test performance

Solutions: Reduce model complexity, add regularization, use cross-validation, get more data

āš ļø SettingWithCopyWarning in Pandas

Problem: Pandas warning about chained assignments

Solution: Use .loc for assignment: df.loc[df['A'] > 5, 'B'] = 'value'

āš ļø Template Not Found (Flask)

Problem: Flask can't locate HTML templates

Solutions: Ensure templates are in templates/ folder, check file name spelling

🚀 Future Learning Areas

Structured areas for expanding knowledge and skills

🧠 Deep Learning

Neural networks, TensorFlow/PyTorch, CNNs, RNNs, transformers

📈 Data Visualization

Matplotlib, seaborn, plotly, dashboard creation, interactive charts

☁️ Cloud Platforms

AWS, Azure, Google Cloud, serverless computing