Summary tables of models and methods

Introduction

Throughout the course, we will go over several supervised and unsupervised machine learning models. This page summarizes them: each table below lists a method's strengths, limitations, and example use cases, along with a Python implementation.

Imputation methods

| Method | Strengths | Limitations | Example Use Cases | Implementation |
| --- | --- | --- | --- | --- |
| Simple Fill | Simple and fast; works well with small datasets | May not handle complex data relationships; sensitive to outliers | Basic data analysis; quick data cleaning | Python |
| KNN Imputation | Can capture relationships between features; works well with moderately missing data | Computationally intensive for large datasets; sensitive to the choice of k | Medical data analysis; market research | Python |
| Soft Impute | Effective for matrix completion in large datasets; works well with low-rank data | Assumes a low-rank data structure; can be sensitive to hyperparameters | Recommender systems; large-scale data projects | Python |
| Iterative Imputer | Can model complex relationships; suitable for multiple imputation | Computationally expensive; depends on the choice of model | Complex datasets with multiple types of missing data | Python |
| Iterative SVD | Good for matrix completion under a low-rank assumption; handles larger datasets | Sensitive to rank selection; computationally demanding | Image and video data processing; large datasets with structure | Python |
| Matrix Factorization | Useful for recommendation systems; can handle large-scale problems | Requires careful tuning; not suitable for all types of data | Recommendation engines; user preference analysis | Python |
| Nuclear Norm Minimization | Theoretically strong for matrix completion; finds the lowest-rank solution | Very computationally intensive; impractical for very large datasets | Research in theoretical data completion; small to medium datasets | Python |
| BiScaler | Normalizes data effectively; often used as a preprocessing step | Not an imputation method itself; doesn't always converge | Preprocessing for other imputation methods; data normalization | Python |
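
As a minimal sketch of the first few methods above, assuming scikit-learn is installed (Soft Impute, Iterative SVD, Nuclear Norm Minimization, and BiScaler live in the third-party fancyimpute package instead):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Toy matrix with missing entries
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 5.0, 4.0]])

# Simple Fill: replace each missing value with its column mean
X_simple = SimpleImputer(strategy="mean").fit_transform(X)

# KNN Imputation: average the feature over the k most similar rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Iterative Imputer: model each incomplete feature from the others
# (experimental API; the enable_iterative_imputer import activates it)
X_iter = IterativeImputer(random_state=0).fit_transform(X)

print(X_simple, X_knn, X_iter, sep="\n\n")
```
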
Classification models

| Model Type | Strengths | Limitations | Example Use Cases | Implementation |
| --- | --- | --- | --- | --- |
| Logistic Regression | Simple and interpretable; fast to train | Assumes linear decision boundaries; not suitable for complex relationships | Credit approval; medical diagnosis | Python |
| Decision Trees | Intuitive; can model non-linear relationships | Prone to overfitting; sensitive to small changes in the data | Customer segmentation; loan default prediction | Python |
| Random Forest | Resists overfitting; can model complex relationships | Slower to train and predict; black-box model | Fraud detection; stock price movement prediction | Python |
| Support Vector Machines (SVM) | Effective in high-dimensional spaces; works well when there is a clear margin of separation | Sensitive to kernel choice; slow on large datasets | Image classification; handwriting recognition | Python |
| K-Nearest Neighbors (KNN) | Simple and intuitive; no training phase | Slow at query time; sensitive to irrelevant features and feature scale | Product recommendation; document classification | Python |
| Neural Networks | Capable of approximating complex functions; flexible architecture; trainable with backpropagation | Can require a large number of parameters; prone to overfitting on small data; training can be slow | Pattern recognition; basic image classification; function approximation | Python |
| Deep Learning | Can model highly complex relationships; excels with vast amounts of data; state-of-the-art results in many domains | Requires a lot of data; computationally intensive; interpretability challenges | Advanced image and speech recognition; machine translation; game playing (e.g., AlphaGo) | Python |
| Naive Bayes | Fast; works well with large feature sets | Assumes feature independence; continuous features require a distributional assumption (e.g., Gaussian) | Spam detection; sentiment analysis | Python |
| Gradient Boosting Machines (GBM) | High performance; handles non-linear relationships | Prone to overfitting if not tuned; slow to train | Web search ranking; ecological prediction | Python |
| Rule-Based Classification | Transparent and explainable; easily updated and modified | Manual rule creation can be tedious; may not capture complex relationships | Expert systems; business rule enforcement | Python |
| Bagging | Reduces variance; parallelizable | Does little to reduce bias | Random Forest is a popular example | Python |
| Boosting | Reduces bias; combines weak learners | Sensitive to noisy data and outliers | AdaBoost; Gradient Boosting | Python |
| XGBoost | Scalable and efficient; built-in regularization | Requires careful tuning; can overfit if not used correctly | Kaggle competitions; retail prediction | Python |
| Linear Discriminant Analysis (LDA) | Dimensionality reduction; simple and interpretable | Assumes Gaussian-distributed data and equal class covariances | Face recognition; marketing segmentation | Python |
| Regularized Models (Shrinkage) | Prevents overfitting; handles collinearity | Requires parameter tuning; may reduce interpretability | Ridge and Lasso regression | Python |
| Stacking | Combines multiple models; can improve accuracy | Increases model complexity; risk of overfitting if base models are correlated | Meta-modeling; Kaggle competitions | Python |
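
As a quick sketch of how three of the classifiers above are trained and compared, assuming scikit-learn and its bundled breast cancer dataset (the models and split are illustrative, not a tuned benchmark):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
}

# Fit each model on the same split and report held-out accuracy
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```
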
Regression models

| Model Type | Strengths | Limitations | Example Use Cases | Implementation |
| --- | --- | --- | --- | --- |
| Linear Regression | Simple and interpretable | Assumes a linear relationship; sensitive to outliers | Sales forecasting; risk assessment | Python |
| Polynomial Regression | Can model non-linear relationships | Can overfit with high degrees | Growth prediction; non-linear trend modeling | Python |
| Ridge Regression | Prevents overfitting; regularizes the model | Does not perform feature selection | High-dimensional data; preventing overfitting | Python |
| Lasso Regression | Performs feature selection; regularizes the model | May exclude useful variables | Feature selection; high-dimensional datasets | Python |
| Elastic Net Regression | Balances Ridge and Lasso | Requires tuning of the mixing parameter | High-dimensional datasets with correlated features | Python |
| Quantile Regression | Models the median or other quantiles | Less interpretable than ordinary regression | Median house price prediction; financial quantile modeling | Python |
| Support Vector Regression (SVR) | Flexible; can handle non-linear relationships | Sensitive to kernel choice and hyperparameters | Stock price prediction; non-linear trend modeling | Python |
| Decision Tree Regression | Handles non-linear data; interpretable | Can overfit on noisy data | Price prediction; quality assessment | Python |
| Random Forest Regression | Handles large datasets; reduces overfitting | Requires more computational resources | Large datasets; environmental modeling | Python |
| Gradient Boosting Regression | High performance; can handle non-linear relationships | Prone to overfitting if not tuned | Web search ranking; price prediction | Python |
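
To make the Ridge/Lasso contrast in the table concrete, here is a minimal sketch, assuming scikit-learn (the synthetic dataset and alpha values are illustrative): Lasso zeroes out coefficients, performing feature selection, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# 30 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("Linear", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Lasso (L1)", Lasso(alpha=1.0))]:
    model.fit(X_train, y_train)
    n_zero = int(np.sum(model.coef_ == 0))  # Lasso drives most to exactly 0
    print(f"{name}: R^2 = {model.score(X_test, y_test):.3f}, "
          f"zeroed coefficients = {n_zero}")
```
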
Clustering methods

| Model Type | Strengths | Limitations | Example Use Cases | Implementation |
| --- | --- | --- | --- | --- |
| K-Means Clustering | Simple and widely used; fast for large datasets | Sensitive to initial conditions; requires specifying the number of clusters | Market segmentation; image compression | Python |
| Hierarchical Clustering | Doesn't require specifying the number of clusters; produces a dendrogram | Can be computationally expensive for large datasets | Taxonomies; determining evolutionary relationships | Python |
| DBSCAN (Density-Based Clustering) | Can find arbitrarily shaped clusters; doesn't require specifying the number of clusters | Sensitive to feature scale; requires density parameters to be set | Noise detection and anomaly detection | Python |
| Agglomerative Clustering | Variety of linkage criteria; produces a hierarchy of clusters | Not scalable to very large datasets | Sociological hierarchies; taxonomies | Python |
| Mean Shift Clustering | No need to specify the number of clusters; can find arbitrarily shaped clusters | Computationally expensive; bandwidth selection is crucial | Image analysis; computer vision tasks | Python |
| Affinity Propagation | Automatically determines the number of clusters; good for data with many exemplars | High computational complexity; preference parameter can be difficult to choose | Image recognition; data with many similar exemplars | Python |
| Spectral Clustering | Can capture complex cluster structures; can be used with various affinity matrices | Choice of affinity matrix is crucial; can be computationally expensive | Image and speech processing; graph-based clustering | Python |
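
A minimal sketch contrasting two of the methods above, assuming scikit-learn (the two-moons data and parameter values are illustrative): K-Means forces two roughly spherical clusters, while DBSCAN recovers the arbitrarily shaped ones, provided the data is scaled first.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaved half-moons: a shape K-Means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)  # DBSCAN is sensitive to scale

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("K-Means cluster sizes:",
      [list(kmeans_labels).count(c) for c in set(kmeans_labels)])
print("DBSCAN clusters found (-1 = noise):", sorted(set(dbscan_labels)))
```
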
Dimensionality reduction and anomaly detection

| Method | Strengths | Limitations | Example Use Cases | Implementation |
| --- | --- | --- | --- | --- |
| PCA | Dimensionality reduction; preserves variance | Linear method; not suited to categorical data | Feature extraction; data compression | Python |
| t-SNE | Captures non-linear structure; good for visualization | Computationally expensive; usually preceded by PCA on very high-dimensional data | Data visualization; exploratory analysis | Python |
| Autoencoders | Dimensionality reduction; captures non-linear relationships | Requires neural network expertise; computationally intensive | Feature learning; noise reduction | Python |
| Isolation Forest | Effective for high-dimensional data; fast and scalable | Randomized, so results vary between runs; may miss some anomalies | Fraud detection; network security | Python |
| SVD | Matrix factorization; efficient for large datasets | Assumes linear relationships; sensitive to feature scaling | Recommender systems; latent semantic analysis | Python |
| ICA | Identifies statistically independent components; separates mixed signals | Requires non-Gaussian source components; sensitive to noise | Blind signal separation; feature extraction | Python |
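
As a sketch of the PCA-then-t-SNE pipeline suggested in the table, assuming scikit-learn and its bundled digits dataset (the component counts and perplexity are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)  # 1797 samples, 64 dimensions

# PCA first: reduce to 30 linear components, keeping most of the variance
X_pca = PCA(n_components=30, random_state=0).fit_transform(X)

# t-SNE on the PCA output to get a 2-D embedding for visualization
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)
print(X_2d.shape)  # (1797, 2)
```
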
Association rule and sequential pattern mining

| Method | Strengths | Limitations | Example Use Cases | Implementation |
| --- | --- | --- | --- | --- |
| Apriori Algorithm | Well known and widely used; easy to understand and implement | Can be slow on large datasets; generates a large number of candidate sets | Market basket analysis; cross-marketing strategies | Python |
| FP-Growth Algorithm | Faster than Apriori; efficient for large datasets | Memory-intensive; can be complex to implement | Frequent itemset mining in large databases; customer purchase patterns | Python |
| Eclat Algorithm | Faster than Apriori; scalable and easy to parallelize | Limited to binary attributes; generates many candidate itemsets | Market basket analysis; binary classification tasks | Python |
| GSP (Generalized Sequential Pattern) | Identifies sequential patterns; flexible across various datasets | Can be computationally expensive; not as efficient for very large databases | Customer purchase sequence analysis; event sequence analysis | Python |
| RuleGrowth Algorithm | Efficient for mining sequential rules; works well with sparse datasets | Requires careful parameter setting; less well known than Apriori or FP-Growth | Analyzing customer shopping sequences; detecting patterns in web browsing data | Python |
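
A minimal market-basket sketch, assuming the third-party mlxtend library (the toy transactions and thresholds are illustrative; mlxtend also ships an fpgrowth function with the same interface):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

transactions = [["bread", "milk"],
                ["bread", "diapers", "beer", "eggs"],
                ["milk", "diapers", "beer", "cola"],
                ["bread", "milk", "diapers", "beer"],
                ["bread", "milk", "diapers", "cola"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets with support >= 40%, then rules with confidence >= 70%
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```
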
Model evaluation techniques

| Technique | Strengths | Limitations | Example Use Cases | Implementation |
| --- | --- | --- | --- | --- |
| Accuracy | Simple and intuitive; effective for balanced datasets | Misleading for imbalanced datasets; doesn't distinguish false positives from false negatives | General classification problems; comparing baseline models | Python |
| AUC-ROC | Effective for binary classification; good for imbalanced datasets | Can be overly optimistic on heavily imbalanced data; not threshold-specific | Medical diagnosis classification; fraud detection models | Python |
| Precision | Focuses on the positive class; penalizes false positives | Ignores false negatives; not useful on its own for imbalanced datasets | Spam detection; content moderation systems | Python |
| Recall | Identifies actual positives well; minimizes false negatives | Ignores false positives; can be misleading if positives are rare | Disease outbreak detection; recall-focused tasks | Python |
| F1-Score | Balances precision and recall; useful for imbalanced datasets | May not reflect true model performance; depends on the balance of precision and recall | Customer churn prediction; sentiment analysis | Python |
| Cross-Validation | Reduces overfitting; provides robust model evaluation | Computationally expensive; may not be ideal for very large datasets | General model evaluation; comparing multiple models | Python |
| The Validation Set Approach | Simple and easy to implement; good for initial model assessment | Error estimate depends heavily on the particular split; repeated use can overfit the validation set | Quick model prototyping; small datasets | Python |
| Leave-One-Out Cross-Validation | Very thorough; each observation is used for validation exactly once | Computationally intensive; not suitable for large datasets | Small but rich datasets; highly sensitive models | Python |
| k-Fold Cross-Validation | Balances computational cost and validation accuracy; suitable for various data sizes | Results vary with how the data is divided; the choice of k can impact results | Medium-sized datasets; model selection | Python |
| The Bootstrap Method | Good for estimating model accuracy; effective for small datasets | Results can be sensitive to outliers; may overestimate accuracy for small datasets | Small or medium-sized datasets; uncertainty estimation | Python |
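
To tie several of these techniques together, here is a minimal k-fold cross-validation sketch, assuming scikit-learn (the pipeline and fold count are illustrative), scoring the same model with accuracy, AUC-ROC, and F1:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation: each fold serves as validation data once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for metric in ("accuracy", "roc_auc", "f1"):
    scores = cross_val_score(model, X, y, cv=cv, scoring=metric)
    print(f"{metric}: {scores.mean():.3f} +/- {scores.std():.3f}")
```
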