Week 1 / Day 3: Logistic Regression Implementation from Scratch

Learning Objectives

Today's focus was on implementing logistic regression from scratch, understanding the mathematical foundations of classification algorithms, and applying advanced machine learning concepts including regularization and multi-class classification.

Technical Implementation

Data Preparation and Loading

The session began with loading the preprocessed wine dataset from Day 2, with a fallback that reloads and standardizes the original dataset if the preprocessed files aren't available:

import numpy as np

try:
    # Load scaled features and target saved on Day 2
    X_scaled = np.load('week1_ml/data/X_scaled.npy')
    y = np.load('week1_ml/data/y.npy')
    print("Successfully loaded preprocessed data!")
except FileNotFoundError:
    # Fallback: reload and standardize the original wine dataset
    from sklearn.datasets import load_wine
    from sklearn.preprocessing import StandardScaler

    wine = load_wine()
    X = wine.data
    y = wine.target

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

The dataset was split into training (142 samples) and testing (36 samples) sets with stratification to maintain class balance across the three wine types.
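
A minimal sketch of how that stratified split can be reproduced with scikit-learn; the test_size of 0.2 and the random_state of 42 are assumptions inferred from the reported 142/36 sample counts and the seed used elsewhere in the notebook:

from sklearn.model_selection import train_test_split

# Stratified split keeps the three class proportions the same in train and test.
# test_size=0.2 on the 178-sample wine dataset yields the reported 142/36 split.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
print(X_train.shape, X_test.shape)  # (142, 13), (36, 13)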

Core Algorithm Implementation

We implemented a comprehensive LogisticRegression class from scratch with the following key components:

1. Multi-class Classification Support

  • Softmax Function: Implemented for handling multiple classes
  • One-hot Encoding: Converted target labels for loss computation
  • Cross-entropy Loss: Calculated with regularization support

2. Advanced Features

  • L1/L2 Regularization: Configurable regularization types and strengths
  • Gradient Descent Optimization: Customizable learning rate and iterations
  • Loss Tracking: Monitored training progress throughout iterations

3. Mathematical Implementation

def _softmax(self, z):
    """Compute softmax function for multi-class classification"""
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def _cross_entropy_loss(self, y_true, y_pred):
    """Compute cross-entropy loss with regularization"""
    y_one_hot = np.eye(self.n_classes)[y_true]
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    loss = -np.mean(np.sum(y_one_hot * np.log(y_pred), axis=1))
    
    # Add regularization term
    if self.reg_type == 'l1':
        reg_term = self.reg_strength * np.sum(np.abs(self.weights))
    elif self.reg_type == 'l2':
        reg_term = self.reg_strength * np.sum(self.weights ** 2)
    else:
        reg_term = 0
        
    return loss + reg_term
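
The notebook's full fit method isn't reproduced here, but a minimal sketch of the batch gradient descent loop it describes looks roughly like the following. The attribute names (max_iterations, learning_rate, loss_history) and the exact update rule are assumptions; the gradients shown match the softmax and cross-entropy loss defined above.

def fit(self, X, y):
    """Sketch of batch gradient descent for softmax regression."""
    n_samples, n_features = X.shape
    self.n_classes = len(np.unique(y))
    self.weights = np.zeros((n_features, self.n_classes))
    self.bias = np.zeros(self.n_classes)
    y_one_hot = np.eye(self.n_classes)[y]
    self.loss_history = []

    for i in range(self.max_iterations):
        # Forward pass: class probabilities via softmax
        probs = self._softmax(X @ self.weights + self.bias)

        # Gradient of the cross-entropy loss w.r.t. weights and bias
        error = probs - y_one_hot                 # shape (n_samples, n_classes)
        grad_w = X.T @ error / n_samples
        grad_b = error.mean(axis=0)

        # Regularization gradient, matching the penalty terms in the loss
        if self.reg_type == 'l1':
            grad_w += self.reg_strength * np.sign(self.weights)
        elif self.reg_type == 'l2':
            grad_w += 2 * self.reg_strength * self.weights

        # Parameter update
        self.weights -= self.learning_rate * grad_w
        self.bias -= self.learning_rate * grad_b

        self.loss_history.append(self._cross_entropy_loss(y, probs))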

Training Process and Results

The model was trained with the following parameters:

  • Learning Rate: 0.01
  • Maximum Iterations: 1000
  • Regularization: L2 with strength 0.01
  • Random State: 42 for reproducibility
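
Tying these hyperparameters to the class, the training call looks roughly like this; the constructor argument names and the predict method are assumptions based on the attributes used in the methods shown earlier:

# Hypothetical constructor signature mirroring the attributes used above
model = LogisticRegression(
    learning_rate=0.01,
    max_iterations=1000,
    reg_type='l2',
    reg_strength=0.01,
    random_state=42
)
model.fit(X_train, y_train)

train_acc = np.mean(model.predict(X_train) == y_train)
test_acc = np.mean(model.predict(X_test) == y_test)
print(f"Train accuracy: {train_acc:.2%}, Test accuracy: {test_acc:.2%}")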

Training Progress

The loss decreased consistently from 0.4128 at iteration 100 to 0.1474 at iteration 1000, demonstrating stable convergence.

Model Performance

  • Training Accuracy: 99.30%
  • Test Accuracy: 100.00%
  • Final Loss: 0.1474
  • Model Parameters: 13 features × 3 classes = 39 weights, plus 3 bias terms (42 parameters total)

Comprehensive Model Evaluation

Classification Metrics

The model achieved perfect performance on the test set:

  • Precision: 1.00 for all classes
  • Recall: 1.00 for all classes
  • F1-Score: 1.00 for all classes
  • Overall Accuracy: 100%

Confusion Matrix Analysis

The confusion matrix showed no misclassifications, indicating the model perfectly distinguished between all three wine types.
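
A minimal way to reproduce the confusion matrix and per-class metrics with scikit-learn, assuming y_pred holds the custom model's predictions on the test set:

from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix, classification_report

y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes; any off-diagonal
# entry would indicate a misclassification (all zeros here).
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=load_wine().target_names))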

Feature Importance Analysis

We analyzed feature importance based on the learned weights:

Top 5 Most Important Features

  1. color_intensity: 0.441 (highest importance)
  2. alcohol: 0.421
  3. proline: 0.406
  4. alcalinity_of_ash: 0.291
  5. hue: 0.275

This analysis revealed that wine color intensity and alcohol content are the most discriminative features for classification.
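
One common way to derive such a ranking from the learned weights, and roughly what the notebook's analysis does, is to aggregate each feature's weights across the three classes. The exact aggregation and the (n_features, n_classes) weight shape are assumptions; here importance is the mean absolute weight per feature:

from sklearn.datasets import load_wine

feature_names = load_wine().feature_names

# Collapse each feature's per-class weights into a single importance score
importance = np.mean(np.abs(model.weights), axis=1)   # shape (13,)

# Print the five features with the largest mean absolute weight
for name, score in sorted(zip(feature_names, importance),
                          key=lambda pair: pair[1], reverse=True)[:5]:
    print(f"{name}: {score:.3f}")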

Learning Outcomes

Technical Skills Developed

  • Algorithm Implementation: Built logistic regression from mathematical principles
  • Multi-class Classification: Handled three wine types with softmax activation
  • Regularization Techniques: Implemented L1/L2 regularization for model robustness
  • Gradient Descent: Optimized model parameters using iterative optimization
  • Model Evaluation: Applied comprehensive metrics and visualization techniques

Mathematical Understanding

  • Softmax Function: Multi-class probability distribution
  • Cross-entropy Loss: Classification loss function with regularization
  • Gradient Computation: Analytic gradients of the loss for parameter updates
  • Feature Importance: Weight-based feature significance analysis

Best Practices Learned

  • Fallback Mechanisms: Robust data loading with error handling
  • Reproducibility: Consistent random seeds and parameter tracking
  • Regularization: Preventing overfitting through L1/L2 penalties
  • Comprehensive Evaluation: Multiple metrics for thorough model assessment

Code Repository

All implementation code is saved in the ai-sprint project directory as 03_logistic_regression_implementation.ipynb, including:

  • Complete LogisticRegression class implementation
  • Training and evaluation pipeline
  • Feature importance analysis
  • Visualization and reporting functions

Next Steps

Tomorrow's focus will be on implementing additional machine learning algorithms (SVM, Random Forest) and comparing their performance with our custom logistic regression implementation.