## Popular Algorithms in Data Science

In data science, various algorithms are employed for tasks such as regression and classification. Each algorithm has associated loss functions and performance metrics that help evaluate its effectiveness. Below is a detailed overview of popular algorithms, their loss functions, performance metrics, and caveats for their use.

### 1. **Linear Regression**
- **Loss Function:** Mean Squared Error (MSE)
- **Performance Metrics:** R-squared (R²), Adjusted R², Mean Absolute Error (MAE)
- **Caveats:** Sensitive to outliers; performs poorly when the relationship between features and target is non-linear.
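
As a quick illustration, a minimal scikit-learn sketch of fitting a linear model and computing MSE, MAE, and R² (the toy data and variable names are assumptions for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy data: y is a noisy linear function of a single feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

print("MSE:", mean_squared_error(y, y_pred))  # the loss being minimized
print("MAE:", mean_absolute_error(y, y_pred))
print("R² :", r2_score(y, y_pred))
```
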
### 2. **Logistic Regression**
- **Loss Function:** Binary Cross-Entropy Loss (Log Loss)
- **Performance Metrics:** Accuracy, Precision, Recall, F1 Score
- **Caveats:** Assumes linearity between the independent variables and the log odds of the dependent variable; not suitable for multi-class problems without modification.
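
The same pattern for binary classification, here scoring the log loss and F1 with scikit-learn (toy data assumed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, log_loss

# Toy data: the label is a linear rule over the two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

print("Log loss:", log_loss(y, clf.predict_proba(X)[:, 1]))  # the training objective
print("F1 score:", f1_score(y, clf.predict(X)))
```
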
### 3. **Decision Trees**
- **Loss Function:** Gini Impurity (for classification), Mean Squared Error (for regression)
- **Performance Metrics:** Accuracy, Mean Absolute Error (MAE), Root Mean Squared Error (RMSE)
- **Caveats:** Prone to overfitting; sensitive to small changes in the data, which can lead to very different tree structures.
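
A short sketch, assuming the Iris dataset; capping `max_depth` is the usual guard against the overfitting caveat above:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Gini impurity is the default split criterion; max_depth limits tree growth
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X_tr, y_tr)
print("Test accuracy:", accuracy_score(y_te, tree.predict(X_te)))
```
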
### 4. **Support Vector Machines (SVM)**
- **Loss Function:** Hinge Loss (for classification), Epsilon-insensitive Loss (for regression)
- **Performance Metrics:** Accuracy, Precision, Recall
- **Caveats:** Computationally expensive for large datasets; requires careful tuning of hyperparameters like the kernel choice.
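
A brief scikit-learn sketch; standardizing features first is an added assumption here (RBF kernels are scale-sensitive), and `kernel` and `C` are the hyperparameters the caveat refers to:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, hinge_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale features, then fit an RBF-kernel SVM
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X_tr, y_tr)
print("Hinge loss:", hinge_loss(y_te, svm.decision_function(X_te)))
print("Accuracy:  ", accuracy_score(y_te, svm.predict(X_te)))
```
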
### 5. **Random Forest**
- **Loss Function:** Mean Squared Error (for regression), Gini Impurity or Cross-Entropy Loss (for classification)
- **Performance Metrics:** Out-of-Bag Error, Accuracy
- **Caveats:** Can be less interpretable than simpler models; may require significant computational resources.
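
A minimal sketch of the out-of-bag estimate, which scores each tree on the bootstrap samples it never saw (dataset and `n_estimators` are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True evaluates each tree on the samples left out of its bootstrap
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("Out-of-bag accuracy:", rf.oob_score_)
```
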
### 6. **Gradient Boosting Machines (GBM)**
- **Loss Function:** Log Loss (for classification), Mean Squared Error (for regression)
- **Performance Metrics:** Log-Likelihood, RMSE
- **Caveats:** Sensitive to overfitting if not properly regularized; requires careful tuning of learning rate and tree depth.
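
A short sketch exposing the two knobs the caveat calls out, `learning_rate` and `max_depth` (the values here are illustrative assumptions, not recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A modest learning rate with shallow trees is a common starting point
gbm = GradientBoostingClassifier(learning_rate=0.1, max_depth=3, n_estimators=100)
gbm.fit(X_tr, y_tr)
print("Test log loss:", log_loss(y_te, gbm.predict_proba(X_te)))
```
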
### 7. **Neural Networks**
- **Loss Function:** Cross-Entropy Loss (for classification), Mean Squared Error (for regression)
- **Performance Metrics:** Accuracy, F1 Score, Area Under Curve (AUC)
- **Caveats:** Requires large amounts of data; can be prone to overfitting if not regularized properly; less interpretable compared to traditional models.
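
As one concrete stand-in for a neural network, a small scikit-learn `MLPClassifier` (which minimizes cross-entropy for classification); the layer size and `alpha` L2 penalty are illustrative assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# alpha adds an L2 penalty, one of the regularizers the caveat mentions
net = MLPClassifier(hidden_layer_sizes=(64,), alpha=1e-3, max_iter=500, random_state=0)
net.fit(X_tr, y_tr)

y_pred = net.predict(X_te)
print("Accuracy:", accuracy_score(y_te, y_pred))
print("Macro F1:", f1_score(y_te, y_pred, average="macro"))
```
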
### 8. **K-Means Clustering**
- **Loss Function:** Sum of Squared Errors (SSE)
- **Performance Metrics:** Silhouette Score, Davies-Bouldin Index
- **Caveats:** Assumes spherical clusters; sensitive to initial centroid placement; requires specifying the number of clusters in advance.
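
A minimal sketch showing the SSE that k-means minimizes (exposed as `inertia_` in scikit-learn) alongside the silhouette score; the blob data and `n_init` restarts, which soften the initialization caveat, are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data: three well-separated, roughly spherical blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init restarts from multiple centroid seeds; k must still be chosen up front
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("SSE (inertia):   ", km.inertia_)  # the quantity k-means minimizes
print("Silhouette score:", silhouette_score(X, km.labels_))
```
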
## Summary of Loss Functions and Performance Metrics

| Algorithm                | Loss Function         | Performance Metrics |
|--------------------------|-----------------------|---------------------|
| Linear Regression        | Mean Squared Error    | R², MAE             |
| Logistic Regression      | Binary Cross-Entropy  | Accuracy, F1 Score  |
| Decision Trees           | Gini Impurity / MSE   | Accuracy, MAE       |
| Support Vector Machines  | Hinge Loss            | Accuracy, Precision |
| Random Forest            | MSE / Gini Impurity   | Out-of-Bag Error    |
| Gradient Boosting        | Log Loss / MSE        | RMSE                |
| Neural Networks          | Cross-Entropy / MSE   | Accuracy, AUC       |
| K-Means Clustering       | Sum of Squared Errors | Silhouette Score    |

## Conclusion

The choice of algorithm depends on the specific characteristics of the dataset and the nature of the problem being solved. Understanding the strengths and weaknesses of each algorithm helps in selecting the most appropriate one for a given task. For instance, while linear regression is simple and interpretable, it may not capture complex relationships in the data. Conversely, neural networks can model intricate patterns but require more data and computational power.
