| Method | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| Simple Fill | - Simple and fast - Works well with small datasets | - May not handle complex data relationships - Sensitive to outliers | - Basic data analysis - Quick data cleaning | Python |
| KNN Imputation | - Can capture the relationships between features - Works well with moderately missing data | - Computationally intensive for large datasets - Sensitive to the choice of k | - Medical data analysis - Market research | Python |
| Soft Impute | - Effective for matrix completion in large datasets - Works well with low-rank data | - Assumes low-rank data structure - Can be sensitive to hyperparameters | - Recommender systems - Large-scale data projects | Python |
| Iterative Imputer | - Can model complex relationships - Suitable for multiple imputation | - Computationally expensive - Depends on the choice of model | - Complex datasets with multiple types of missing data | Python |
| Iterative SVD | - Good for matrix completion with a low-rank assumption - Handles larger datasets | - Sensitive to rank selection - Computationally demanding | - Image and video data processing - Large datasets with structure | Python |
| Matrix Factorization | - Useful for recommendation systems - Can handle large-scale problems | - Requires careful tuning - Not suitable for all types of data | - Recommendation engines - User preference analysis | Python |
| Nuclear Norm Minimization | - Theoretically strong for matrix completion - Finds the lowest-rank solution | - Very computationally intensive - Impractical for very large datasets | - Research in theoretical data completion - Small to medium datasets | Python |
| BiScaler | - Normalizes data effectively - Often used as a preprocessing step | - Not an imputation method itself - Doesn't always converge | - Preprocessing for other imputation methods - Data normalization | Python |
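
A minimal Python sketch of a few of the imputation methods in the table above, assuming scikit-learn; the toy matrix and parameter values are illustrative only. Soft Impute, Iterative SVD, Matrix Factorization, Nuclear Norm Minimization, and BiScaler are typically found in third-party packages such as fancyimpute rather than scikit-learn and are not shown here.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix with missing entries (illustrative only)
X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 5.0, 3.0]])

# Simple Fill: replace each missing value with the column mean
X_simple = SimpleImputer(strategy="mean").fit_transform(X)

# KNN Imputation: fill each missing value from the k most similar rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Iterative Imputer: model each feature with missing values as a
# function of the other features, cycling until convergence
X_iter = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```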
Summary table of models + methods
Introduction
Throughout the course, we will go over several supervised and unsupervised machine learning models. This page summarizes those models, along with related methods for data imputation, clustering, dimensionality reduction, association rule mining, and model evaluation.
| Model Type | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| Logistic Regression | - Simple and interpretable - Fast to train | - Assumes linear decision boundaries - Not suitable for complex relationships | - Credit approval - Medical diagnosis | Python |
| Decision Trees | - Intuitive - Can model non-linear relationships | - Prone to overfitting - Sensitive to small changes in the data | - Customer segmentation - Loan default prediction | Python |
| Random Forest | - Handles overfitting - Can model complex relationships | - Slower to train and predict - Black-box model | - Fraud detection - Stock price movement prediction | Python |
| Support Vector Machines (SVM) | - Effective in high-dimensional spaces - Works well with a clear margin of separation | - Sensitive to kernel choice - Slow on large datasets | - Image classification - Handwriting recognition | Python |
| K-Nearest Neighbors (KNN) | - Simple and intuitive - No training phase | - Slow at query time - Sensitive to irrelevant features and scale | - Product recommendation - Document classification | Python |
| Neural Networks | - Capable of approximating complex functions - Flexible architecture - Trainable with backpropagation | - Can require a large number of parameters - Prone to overfitting on small datasets - Training can be slow | - Pattern recognition - Basic image classification - Function approximation | Python |
| Deep Learning | - Can model highly complex relationships - Excels with vast amounts of data - State-of-the-art results in many domains | - Requires a lot of data - Computationally intensive - Interpretability challenges | - Advanced image and speech recognition - Machine translation - Game playing (e.g., AlphaGo) | Python |
| Naive Bayes | - Fast - Works well with large feature sets | - Assumes feature independence - Continuous numerical features need the Gaussian variant or discretization | - Spam detection - Sentiment analysis | Python |
| Gradient Boosting Machines (GBM) | - High performance - Handles non-linear relationships | - Prone to overfitting if not tuned - Slow to train | - Web search ranking - Ecology predictions | Python |
| Rule-Based Classification | - Transparent and explainable - Easily updated and modified | - Manual rule creation can be tedious - May not capture complex relationships | - Expert systems - Business rule enforcement | Python |
| Bagging | - Reduces variance - Parallelizable | - May not handle bias well | - Ensemble variance reduction (Random Forest is a popular example) | Python |
| Boosting | - Reduces bias - Combines weak learners | - Sensitive to noisy data and outliers | - AdaBoost - Gradient Boosting | Python |
| XGBoost | - Scalable and efficient - Built-in regularization | - Requires careful tuning - Can overfit if not used correctly | - Kaggle competitions - Retail prediction | Python |
| Linear Discriminant Analysis (LDA) | - Dimensionality reduction - Simple and interpretable | - Assumes Gaussian-distributed data and equal class covariances | - Face recognition - Marketing segmentation | Python |
| Regularized Models (Shrinkage) | - Prevent overfitting - Handle collinearity | - Require parameter tuning - May reduce interpretability | - Ridge and Lasso regression | Python |
| Stacking | - Combines multiple models - Can improve accuracy | - Increases model complexity - Risk of overfitting if base models are correlated | - Meta-modeling - Kaggle competitions | Python |
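
A minimal sketch, assuming scikit-learn, of two classifiers from the table above; the breast cancer dataset and the hyperparameter values are illustrative assumptions rather than course requirements.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Toy binary classification problem
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Logistic Regression: simple, interpretable linear baseline
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
print("Logistic Regression accuracy:", logreg.score(X_test, y_test))

# Random Forest: ensemble of decision trees, less prone to overfitting
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("Random Forest accuracy:", forest.score(X_test, y_test))
```

The same fit/score pattern applies to most of the other classifiers listed above (SVM, KNN, GBM, and so on), since scikit-learn estimators share a common interface.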
| Model Type | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| Linear Regression | - Simple and interpretable | - Assumes a linear relationship - Sensitive to outliers | - Sales forecasting - Risk assessment | Python |
| Polynomial Regression | - Can model non-linear relationships | - Can overfit with high degrees | - Growth prediction - Non-linear trend modeling | Python |
| Ridge Regression | - Prevents overfitting - Regularizes the model | - Does not perform feature selection | - High-dimensional data - Preventing overfitting | Python |
| Lasso Regression | - Performs feature selection - Regularizes the model | - May exclude useful variables | - Feature selection - High-dimensional datasets | Python |
| Elastic Net Regression | - Balance between Ridge and Lasso | - Requires tuning of the mixing parameter | - High-dimensional datasets with correlated features | Python |
| Quantile Regression | - Models the median or other quantiles | - Less interpretable than ordinary regression | - Median house price prediction - Financial quantile modeling | Python |
| Support Vector Regression (SVR) | - Flexible - Can handle non-linear relationships | - Sensitive to kernel and hyperparameters | - Stock price prediction - Non-linear trend modeling | Python |
| Decision Tree Regression | - Handles non-linear data - Interpretable | - Can overfit on noisy data | - Price prediction - Quality assessment | Python |
| Random Forest Regression | - Handles large datasets - Reduces overfitting | - Requires more computational resources | - Large datasets - Environmental modeling | Python |
| Gradient Boosting Regression | - High performance - Can handle non-linear relationships | - Prone to overfitting if not tuned | - Web search ranking - Price prediction | Python |
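
A minimal sketch, assuming scikit-learn, contrasting ordinary, Ridge, and Lasso regression from the table above; the synthetic data and alpha values are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic regression problem
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ordinary least squares: simple and interpretable
linear = LinearRegression().fit(X_train, y_train)

# Ridge: L2 shrinkage, keeps all features but shrinks coefficients
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

# Lasso: L1 shrinkage, drives some coefficients exactly to zero
lasso = Lasso(alpha=1.0).fit(X_train, y_train)

for name, model in [("linear", linear), ("ridge", ridge), ("lasso", lasso)]:
    print(name, "R^2:", model.score(X_test, y_test))
```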
| Model Type | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| K-Means Clustering | - Simple and widely used - Fast for large datasets | - Sensitive to initial conditions - Requires specifying the number of clusters | - Market segmentation - Image compression | Python |
| Hierarchical Clustering | - Doesn't require specifying the number of clusters - Produces a dendrogram | - May be computationally expensive for large datasets | - Taxonomies - Determining evolutionary relationships | Python |
| DBSCAN (Density-Based Clustering) | - Can find arbitrarily shaped clusters - Doesn't require specifying the number of clusters | - Sensitive to scale - Requires density parameters to be set | - Noise detection and anomaly detection | Python |
| Agglomerative Clustering | - Variety of linkage criteria - Produces a hierarchy of clusters | - Not scalable to very large datasets | - Sociological hierarchies - Taxonomies | Python |
| Mean Shift Clustering | - No need to specify the number of clusters - Can find arbitrarily shaped clusters | - Computationally expensive - Bandwidth parameter selection is crucial | - Image analysis - Computer vision tasks | Python |
| Affinity Propagation | - Automatically determines the number of clusters - Good for data with many exemplars | - High computational complexity - Preference parameter can be difficult to choose | - Image recognition - Data with many similar exemplars | Python |
| Spectral Clustering | - Can capture complex cluster structures - Can be used with various affinity matrices | - Choice of affinity matrix is crucial - Can be computationally expensive | - Image and speech processing - Graph-based clustering | Python |
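
A minimal sketch, assuming scikit-learn, of two clustering approaches from the table above; the synthetic blobs and the eps/min_samples values are illustrative assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN

# Synthetic, well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)  # DBSCAN is sensitive to feature scale

# K-Means: requires the number of clusters up front
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: density-based, no cluster count needed; noise points get label -1
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```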
| Method | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| PCA | - Dimensionality reduction - Preserves variance | - Linear method - Not suited to categorical data | - Feature extraction - Data compression | Python |
| t-SNE | - Captures non-linear structure - Good for visualization | - Computationally expensive - Mainly for visualization, not general-purpose dimensionality reduction | - Data visualization - Exploratory analysis | Python |
| Autoencoders | - Dimensionality reduction - Captures non-linear relationships | - Requires neural network expertise - Computationally intensive | - Feature learning - Noise reduction | Python |
| Isolation Forest | - Effective for high-dimensional data - Fast and scalable | - Results vary due to randomization - May miss some anomalies | - Fraud detection - Network security | Python |
| SVD | - Matrix factorization - Efficient for large datasets | - Assumes linear relationships - Sensitive to scaling | - Recommender systems - Latent semantic analysis | Python |
| ICA | - Identifies independent components - Signal separation | - Assumes sources are non-Gaussian - Sensitive to noise | - Blind signal separation - Feature extraction | Python |
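
A minimal sketch, assuming scikit-learn, of PCA, t-SNE, and Isolation Forest from the table above; the digits dataset and the mostly default parameters are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.ensemble import IsolationForest

X, _ = load_digits(return_X_y=True)

# PCA: linear projection that preserves as much variance as possible
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: non-linear embedding, mainly used for 2-D visualization
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

# Isolation Forest: anomaly detection; predicts -1 for outliers, 1 for inliers
outlier_flags = IsolationForest(random_state=0).fit_predict(X)
```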
| Method | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| Apriori Algorithm | - Well-known and widely used - Easy to understand and implement | - Can be slow on large datasets - Generates a large number of candidate sets | - Market basket analysis - Cross-marketing strategies | Python |
| FP-Growth Algorithm | - Faster than Apriori - Efficient for large datasets | - Memory intensive - Can be complex to implement | - Frequent itemset mining in large databases - Customer purchase patterns | Python |
| Eclat Algorithm | - Faster than Apriori - Scalable and easy to parallelize | - Limited to binary (presence/absence) attributes - Generates many candidate itemsets | - Market basket analysis - Frequent itemset mining in transactional data | Python |
| GSP (Generalized Sequential Pattern) | - Identifies sequential patterns - Flexible across various datasets | - Can be computationally expensive - Not as efficient for very large databases | - Customer purchase sequence analysis - Event sequence analysis | Python |
| RuleGrowth Algorithm | - Efficient for mining sequential rules - Works well with sparse datasets | - Requires careful parameter setting - Less widely known and used than Apriori or FP-Growth | - Analyzing customer shopping sequences - Detecting patterns in web browsing data | Python |
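
A minimal sketch, assuming the third-party mlxtend package, of Apriori and FP-Growth from the table above; the toy transactions and the support/confidence thresholds are illustrative assumptions. Eclat, GSP, and RuleGrowth are less commonly available in mainstream Python libraries and are omitted here.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

# Toy market-basket transactions
transactions = [["bread", "milk"],
                ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"],
                ["bread", "milk", "diapers"]]

# One-hot encode the transactions into a boolean DataFrame
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit_transform(transactions), columns=encoder.columns_)

# Apriori and FP-Growth both return the same frequent itemsets
itemsets_apriori = apriori(onehot, min_support=0.5, use_colnames=True)
itemsets_fpgrowth = fpgrowth(onehot, min_support=0.5, use_colnames=True)

# Derive association rules from the frequent itemsets
rules = association_rules(itemsets_apriori, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```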
| Technique | Strengths | Limitations | Example Use Cases | Implementation |
|---|---|---|---|---|
| Accuracy | - Simple and intuitive - Effective for balanced datasets | - Misleading for imbalanced datasets - Doesn't reflect the mix of true positives/negatives | - General classification problems - Comparing baseline models | Python |
| AUC-ROC | - Effective for binary classification - Good for imbalanced datasets | - Can be overly optimistic on highly imbalanced data - Not threshold-specific | - Medical diagnosis classification - Fraud detection models | Python |
| Precision | - Focuses on the positive class - Accounts for false positives | - Ignores false negatives - Not useful alone on imbalanced datasets | - Spam detection - Content moderation systems | Python |
| Recall | - Identifies actual positives well - Accounts for false negatives | - Ignores false positives - Can be misleading if positives are rare | - Disease outbreak detection - Recall-focused tasks | Python |
| F1-Score | - Balances precision and recall - Useful for imbalanced datasets | - May not reflect true model performance - Depends on the balance of precision and recall | - Customer churn prediction - Sentiment analysis | Python |
| Cross-Validation | - Reduces overfitting - Provides robust model evaluation | - Computationally expensive - May not be ideal for very large datasets | - General model evaluation - Comparing multiple models | Python |
| The Validation Set Approach | - Simple and easy to implement - Good for initial model assessment | - Can lead to overfitting - Results depend on the particular split | - Quick model prototyping - Small datasets | Python |
| Leave-One-Out Cross-Validation | - Very thorough - Each observation is used for validation exactly once | - Computationally intensive - Not suitable for large datasets | - Small but rich datasets - Highly sensitive models | Python |
| k-Fold Cross-Validation | - Balances computational cost and validation accuracy - Suitable for various data sizes | - Results vary with how the data is divided - Choice of k can impact results | - Medium-sized datasets - Model selection | Python |
| The Bootstrap Method | - Good for estimating model accuracy - Effective for small datasets | - Results can be sensitive to outliers - May overestimate accuracy for small datasets | - Small or medium-sized datasets - Uncertainty estimation | Python |
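
A minimal sketch, assuming scikit-learn, computing several of the metrics above plus a 5-fold cross-validation score; the dataset and model choice are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_score))

# k-fold cross-validation (k=5) on the full dataset
cv_scores = cross_val_score(model, X, y, cv=5)
print("5-fold CV accuracy:", cv_scores.mean())
```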