## Popular Algorithms in Data Science with Mathematical Formulations

Here is an expanded overview of popular algorithms in data science, including their mathematical formulations, loss functions, performance metrics, and caveats.

### 1. **Linear Regression**
- **Mathematical Formula:**
  $$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon $$
  where $$y$$ is the dependent variable, $$x_i$$ are independent variables, $$\beta_i$$ are coefficients, and $$\epsilon$$ is the error term.
- **Loss Function:** Mean Squared Error (MSE)
  $$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
- **Performance Metrics:** R-squared (R²), Adjusted R², Mean Absolute Error (MAE)
- **Caveats:** Sensitive to outliers; performs poorly with non-linear relationships. See the sketch below.

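A minimal scikit-learn sketch of fitting and evaluating ordinary least squares; the synthetic data from `make_regression` and the train/test split are illustrative assumptions, not part of the original write-up.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # estimates the beta coefficients
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))  # the loss above, on held-out data
print("R^2:", r2_score(y_test, y_pred))            # one of the listed metrics
```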
### 2. **Logistic Regression**
- **Mathematical Formula:**
  $$ P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_n x_n)}} $$
- **Loss Function:** Binary Cross-Entropy Loss (Log Loss)
  $$ L = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)] $$
- **Performance Metrics:** Accuracy, Precision, Recall, F1 Score
- **Caveats:** Assumes linearity in the log-odds; not suitable for multi-class problems without modification (e.g., one-vs-rest or softmax regression). A short example follows.

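A minimal logistic regression sketch in scikit-learn, again on assumed synthetic data; it reports the log loss and F1 score named above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]            # P(Y=1|X) from the sigmoid

print("Log loss:", log_loss(y_test, proba))        # binary cross-entropy
print("F1 score:", f1_score(y_test, clf.predict(X_test)))
```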
### 3. **Decision Trees**
- **Mathematical Formula:**
  - For classification using Gini Impurity:
    $$ Gini(D) = 1 - \sum_{j=1}^{C} p_j^2 $$
    where $$p_j$$ is the proportion of class $$j$$ in dataset $$D$$.
  - For regression:
    $$ MSE(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} (y_i - \bar{y})^2 $$
    where $$y_i$$ are the actual values and $$\bar{y}$$ is the mean of $$y$$.
- **Loss Function:** Gini Impurity or Mean Squared Error (used as the splitting criterion at each node)
- **Performance Metrics:** Accuracy (classification), MAE (regression)
- **Caveats:** Prone to overfitting; sensitive to small changes in the training data. An example sketch follows below.

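A short sketch of training a Gini-based decision tree in scikit-learn; the Iris dataset and `max_depth=3` are illustrative choices, not recommendations from the text.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris as a stand-in classification task (illustrative only)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" picks splits that minimize Gini impurity;
# max_depth caps tree growth to limit overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```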
### 4. **Support Vector Machines (SVM)**
- **Mathematical Formula:**
  $$ f(x) = w^T x + b $$
  where $$w$$ is the weight vector and $$b$$ is the bias; the predicted class is the sign of $$f(x)$$.
- **Loss Function:** Hinge Loss
  $$ L(y, f(x)) = \max(0, 1 - y f(x)) $$
- **Performance Metrics:** Accuracy, Precision, Recall
- **Caveats:** Computationally expensive for large datasets; requires careful tuning of hyperparameters such as the regularization parameter $$C$$ and the kernel. See the example below.

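A brief SVM sketch in scikit-learn; the RBF kernel and `C=1.0` are illustrative defaults rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C and the kernel are the main hyperparameters to tune
svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
y_pred = svm.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
```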
### 5. **Random Forest**
- **Mathematical Formula:**
  The prediction aggregates many decision trees: majority voting for classification, averaging for regression. For regression:
  $$ \hat{y} = \frac{1}{N} \sum_{i=1}^{N} T_i(x) $$
  where $$T_i$$ are the individual trees.
- **Loss Function:** Mean Squared Error or Gini Impurity (within each tree)
- **Performance Metrics:** Out-of-Bag (OOB) Error, Accuracy
- **Caveats:** Less interpretable than a single tree; requires significant computational resources. A short sketch follows.

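A brief random forest sketch in scikit-learn showing the out-of-bag score; the synthetic data and `n_estimators=200` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# oob_score=True estimates generalization error from samples left out
# of each tree's bootstrap sample
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

print("OOB score:", forest.oob_score_)
print("Test accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```

Increasing `n_estimators` generally stabilizes the OOB estimate at the cost of extra compute.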
### 6. **Gradient Boosting Machines (GBM)**
- **Mathematical Formula:**
  $$ F_m(x) = F_{m-1}(x) + \gamma_m h_m(x) $$
  where $$h_m(x)$$ is the new tree fitted at iteration $$m$$ (to the negative gradient of the loss) and $$\gamma_m$$ is its step size.
- **Loss Function:** Log Loss (classification) or Mean Squared Error (regression)
- **Performance Metrics:** RMSE (regression), Accuracy/AUC (classification)
- **Caveats:** Sensitive to overfitting; requires careful tuning of the learning rate and tree depth. A sketch follows below.

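A minimal gradient boosting sketch using scikit-learn's `GradientBoostingRegressor`; the learning rate, depth, and number of trees are illustrative, not tuned.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learning_rate shrinks the contribution of each new tree h_m(x);
# max_depth controls the size of each tree
gbm = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3,
                                random_state=0)
gbm.fit(X_train, y_train)

rmse = mean_squared_error(y_test, gbm.predict(X_test)) ** 0.5
print("RMSE:", rmse)
```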
### 7. **Neural Networks**
- **Mathematical Formula:**
  $$ y = f(WX + b) $$
  where $$W$$ is the weight matrix, $$X$$ is the input data, $$b$$ is the bias, and $$f$$ is a non-linear activation function; deep networks stack several such layers.
- **Loss Function:** Cross-Entropy Loss or Mean Squared Error
  - Cross-Entropy for classification (one-hot targets over $$C$$ classes):
    $$ L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}) $$
- **Performance Metrics:** Accuracy, F1 Score, AUC
- **Caveats:** Requires large amounts of data; less interpretable than traditional models. A minimal forward-pass sketch follows.

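A minimal NumPy sketch of the single-layer forward pass $$ y = f(WX + b) $$ with a softmax activation and the categorical cross-entropy above; the `softmax` and `cross_entropy` helpers and the dummy data are defined here purely for illustration (this is not a full training loop).

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max keeps the exponentials stable."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Mean categorical cross-entropy: -1/n * sum_i sum_c y_ic * log(p_ic)."""
    return -np.mean(np.sum(y_onehot * np.log(y_prob + eps), axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))            # 4 samples, 3 features (dummy data)
W = rng.normal(size=(3, 2))            # weights mapping 3 features -> 2 classes
b = np.zeros(2)                        # bias

y_prob = softmax(X @ W + b)            # y = f(WX + b) with f = softmax
y_onehot = np.eye(2)[[0, 1, 0, 1]]     # dummy one-hot labels

print("Cross-entropy loss:", cross_entropy(y_onehot, y_prob))
```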
### 8. **K-Means Clustering**
- **Mathematical Formula:**
  $$ J = \sum_{i=1}^{k} \sum_{x_j \in S_i} ||x_j - c_i||^2 $$
  where $$c_i$$ is the centroid of cluster $$S_i$$ and $$x_j$$ are the data points assigned to it.
- **Loss Function:** Sum of Squared Errors (SSE), i.e. the objective $$J$$ above
- **Performance Metrics:** Silhouette Score, Davies-Bouldin Index
- **Caveats:** Assumes roughly spherical clusters; sensitive to the initial centroid placement. A clustering sketch follows below.

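A short k-means sketch in scikit-learn; `KMeans.inertia_` is the objective $$J$$ (the SSE), and the silhouette score is one of the metrics listed above. The synthetic blobs and `n_clusters=4` are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated clusters (illustrative only)
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# n_init restarts k-means from several initial centroids and keeps the run
# with the lowest objective J, mitigating the initialization caveat above
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print("SSE (inertia):", km.inertia_)
print("Silhouette score:", silhouette_score(X, km.labels_))
```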
## Summary of Formulations

| Algorithm | Mathematical Formula | Loss Function | Performance Metrics |
|--------------------------|------------------------------------------------------------------------|----------------------|---------------------|
| Linear Regression | $$ y = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n + \epsilon $$ | MSE | R², MAE |
| Logistic Regression | $$ P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + ... + \beta_n x_n)}} $$ | Binary Cross-Entropy | Accuracy, F1 Score |
| Decision Trees | Gini: $$ Gini(D) = 1 - \sum p_j^2 $$ | Gini Impurity / MSE | Accuracy, MAE |
| Support Vector Machines | $$ f(x) = w^T x + b $$ | Hinge Loss | Accuracy, Precision |
| Random Forest | $$ \hat{y} = \frac{1}{N} \sum T_i(x) $$ | MSE / Gini Impurity | Out-of-Bag Error |
| Gradient Boosting | $$ F_m(x) = F_{m-1}(x) + \gamma_m h_m(x) $$ | Log Loss / MSE | RMSE |
| Neural Networks | $$ y = f(WX + b) $$ | Cross-Entropy / MSE | Accuracy, AUC |
| K-Means Clustering | $$ J = \sum \lVert x_j - c_i \rVert^2 $$ | SSE | Silhouette Score |

This overview summarizes each algorithm's mathematical foundation along with its practical uses and limitations. Understanding these aspects helps in selecting the right algorithm for a specific data science task.