## Popular Algorithms in Data Science with Mathematical Formulations

An overview of popular algorithms in data science, covering their mathematical formulations, loss functions, performance metrics, and caveats.

### 1. **Linear Regression**

- **Mathematical Formula:**
  $$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon $$
  where $y$ is the dependent variable, the $x_i$ are independent variables, the $\beta_i$ are coefficients, and $\epsilon$ is the error term.
- **Loss Function:** Mean Squared Error (MSE)
  $$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
- **Performance Metrics:** R-squared (R²), Adjusted R², Mean Absolute Error (MAE)
- **Caveats:** Sensitive to outliers; performs poorly with non-linear relationships.
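
As a quick illustration, here is a minimal NumPy sketch (the toy data is made up) that fits the coefficients by ordinary least squares and evaluates the MSE and R² defined above:

```python
import numpy as np

# Toy data, made up for illustration: y = 2 + 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2 + 3 * X[:, 0] + rng.normal(0, 1, size=50)

# Ordinary least squares with an intercept column: beta = argmin ||Xb @ beta - y||^2
Xb = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)

y_hat = Xb @ beta
mse = np.mean((y - y_hat) ** 2)                                  # the MSE loss above
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)  # R-squared
print(f"beta = {beta}, MSE = {mse:.3f}, R² = {r2:.3f}")
```
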
### 2. **Logistic Regression**

- **Mathematical Formula:**
  $$ P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}} $$
- **Loss Function:** Binary Cross-Entropy Loss (Log Loss)
  $$ L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] $$
- **Performance Metrics:** Accuracy, Precision, Recall, F1 Score
- **Caveats:** Assumes linearity in the log-odds; not suitable for multi-class problems without modification (e.g., the multinomial/softmax extension).
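
A minimal sketch of the sigmoid and the log loss above; the coefficients here are made up for illustration rather than fitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    # Binary cross-entropy; clip predictions to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up coefficients and a single feature, for illustration only
beta0, beta1 = -1.0, 2.0
x = np.array([0.2, 0.8, 1.5, 3.0])
y = np.array([0, 0, 1, 1])

p = sigmoid(beta0 + beta1 * x)          # P(Y=1|X) from the formula above
print(f"p = {p}, log loss = {log_loss(y, p):.3f}")
```
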
### 3. **Decision Trees**

- **Mathematical Formula:**
  - For classification, using Gini Impurity:
    $$ Gini(D) = 1 - \sum_{j=1}^{C} p_j^2 $$
    where $p_j$ is the proportion of class $j$ in dataset $D$.
  - For regression:
    $$ MSE(D) = \frac{1}{|D|} \sum_{i=1}^{|D|} (y_i - \bar{y})^2 $$
    where the $y_i$ are the actual values and $\bar{y}$ is the mean of the targets in $D$.
- **Loss Function:** Gini Impurity (classification) or Mean Squared Error (regression), minimized greedily at each split
- **Performance Metrics:** Accuracy, MAE
- **Caveats:** Prone to overfitting; small changes in the data can produce very different trees.
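
The Gini formula is straightforward to compute directly; a minimal sketch:

```python
import numpy as np

def gini(labels):
    # Gini(D) = 1 - sum_j p_j^2, where p_j is the class proportion in D
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 1, 1]))   # 0.5 : maximally mixed for two classes
print(gini([1, 1, 1, 1]))   # 0.0 : pure node
```

A tree learner evaluates candidate splits by the weighted Gini of the two child nodes and keeps the split that lowers it the most.
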
### 4. **Support Vector Machines (SVM)**

- **Mathematical Formula:**
  $$ f(x) = w^T x + b $$
  where $w$ is the weight vector and $b$ is the bias.
- **Loss Function:** Hinge Loss
  $$ L(y, f(x)) = \max(0, 1 - y f(x)) $$
  where the labels are encoded as $y \in \{-1, +1\}$.
- **Performance Metrics:** Accuracy, Precision, Recall
- **Caveats:** Computationally expensive for large datasets; requires careful tuning of hyperparameters (e.g., the regularization strength and kernel choice).
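
A minimal sketch of hinge loss for a linear decision function, using made-up weights (a real SVM would learn $w$ and $b$ by solving the margin-maximization problem):

```python
import numpy as np

def hinge_loss(y, scores):
    # L = max(0, 1 - y * f(x)); labels must be in {-1, +1}
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

# Made-up weight vector and bias for f(x) = w.x + b, illustration only
w, b = np.array([1.5, -0.5]), -0.2
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
y = np.array([+1, -1, +1])

scores = X @ w + b
print(f"hinge loss = {hinge_loss(y, scores):.3f}")
```

Note that a point correctly classified with margin $y f(x) \ge 1$ contributes zero loss, which is what produces the sparse set of support vectors.
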
### 5. **Random Forest**

- **Mathematical Formula:**
  For regression, the prediction is the average of the predictions of multiple decision trees (classification typically uses a majority vote instead):
  $$ \hat{y} = \frac{1}{N} \sum_{i=1}^{N} T_i(x) $$
  where the $T_i$ are individual trees, each trained on a bootstrap sample of the data.
- **Loss Function:** Mean Squared Error or Gini Impurity (within each tree)
- **Performance Metrics:** Out-of-Bag Error, Accuracy
- **Caveats:** Less interpretable than a single tree; requires significant computational resources.
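
A hand-rolled sketch of the averaging idea using scikit-learn trees on made-up data; a real random forest also subsamples features at each split, which this sketch omits:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)   # toy data, made up

# Each tree T_i is fit on a bootstrap resample of the training data
N = 25
trees = []
for _ in range(N):
    idx = rng.integers(0, len(X), size=len(X))       # sample with replacement
    trees.append(DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]))

# Forest prediction: (1/N) * sum_i T_i(x)
x_new = np.array([[1.0]])
y_hat = np.mean([t.predict(x_new) for t in trees])
print(y_hat)
```
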
### 6. **Gradient Boosting Machines (GBM)**

- **Mathematical Formula:**
  $$ F_m(x) = F_{m-1}(x) + \gamma_m h_m(x) $$
  where $h_m(x)$ is the new tree added at iteration $m$, fit to the negative gradient of the loss (the pseudo-residuals), and $\gamma_m$ is its step size.
- **Loss Function:** Log Loss or Mean Squared Error
- **Performance Metrics:** RMSE
- **Caveats:** Prone to overfitting; requires careful tuning of the learning rate and tree depth.
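
For squared loss the negative gradient is just the residual, so boosting reduces to repeatedly fitting trees to residuals. A minimal sketch on made-up data, with a fixed learning rate `lr` standing in for $\gamma_m$:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.2, size=200)   # toy data, made up

lr, M = 0.1, 100
F = np.full(len(y), y.mean())                     # F_0: constant prediction
trees = []
for _ in range(M):
    residuals = y - F                             # pseudo-residuals for squared loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F += lr * h.predict(X)                        # F_m = F_{m-1} + lr * h_m
    trees.append(h)

# To predict on new data: y.mean() + lr * sum(h.predict(X_new) for h in trees)
print(np.mean((y - F) ** 2))                      # training MSE shrinks as M grows
```
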
### 7. **Neural Networks**

- **Mathematical Formula:**
  $$ y = f(WX + b) $$
  where $W$ is the weight matrix, $X$ is the input data, $b$ is the bias, and $f$ is a non-linear activation function; deep networks stack several such layers.
- **Loss Function:** Cross-Entropy Loss or Mean Squared Error
  - Cross-Entropy for classification, with one-hot targets $y_{i,c}$ over $C$ classes:
    $$ L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}) $$
- **Performance Metrics:** Accuracy, F1 Score, AUC
- **Caveats:** Requires large amounts of data; less interpretable than traditional models.
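
A minimal forward pass for a two-layer network with randomly initialized (untrained) weights, plus the cross-entropy above:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

# Made-up shapes: 4 inputs -> 3 hidden units -> 2 classes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)

X = rng.normal(size=(5, 4))                        # batch of 5 examples
hidden = relu(X @ W1 + b1)                         # first layer: f(WX + b)
probs = softmax(hidden @ W2 + b2)                  # second layer: class probabilities

# Cross-entropy with integer class labels (equivalent to one-hot targets)
y = np.array([0, 1, 0, 1, 1])
loss = -np.mean(np.log(probs[np.arange(len(y)), y]))
print(probs.shape, loss)
```

Training would then backpropagate the gradient of this loss through both layers; the sketch shows only the forward computation.
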
### 8. **K-Means Clustering**

- **Mathematical Formula:**
  $$ J = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \|x_j^{(i)} - c_i\|^2 $$
  where the $c_i$ are the centroids, $x_j^{(i)}$ are the data points assigned to cluster $i$, and $n_i$ is the size of cluster $i$.
- **Loss Function:** Sum of Squared Errors (SSE), i.e., the objective $J$ itself
- **Performance Metrics:** Silhouette Score, Davies-Bouldin Index
- **Caveats:** Assumes roughly spherical, similarly sized clusters; sensitive to the initial centroid placement.
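
A minimal sketch of Lloyd's algorithm on made-up data, showing the objective $J$ decreasing across iterations (empty clusters are not handled here):

```python
import numpy as np

def kmeans_objective(X, centroids, assignment):
    # J = total squared distance of each point to its assigned centroid
    return np.sum((X - centroids[assignment]) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # toy data, made up
k = 3
centroids = X[rng.choice(len(X), size=k, replace=False)]   # init from data points

for _ in range(10):
    # Assignment step: each point goes to its nearest centroid
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assignment = d.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    centroids = np.array([X[assignment == i].mean(axis=0) for i in range(k)])
    print(kmeans_objective(X, centroids, assignment))
```

Because both steps can only decrease $J$, the loop converges, but only to a local minimum, which is why multiple random restarts (or k-means++ initialization) are common in practice.
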
## Summary of Formulations

| Algorithm | Mathematical Formula | Loss Function | Performance Metrics |
|---|---|---|---|
| Linear Regression | $y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \epsilon$ | MSE | R², MAE |
| Logistic Regression | $P(Y{=}1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \dots + \beta_n x_n)}}$ | Binary Cross-Entropy | Accuracy, F1 Score |
| Decision Trees | Gini: $Gini(D) = 1 - \sum p_j^2$ | Gini Impurity / MSE | Accuracy, MAE |
| Support Vector Machines | $f(x) = w^T x + b$ | Hinge Loss | Accuracy, Precision |
| Random Forest | $\hat{y} = \frac{1}{N} \sum T_i(x)$ | MSE / Gini Impurity | Out-of-Bag Error |
| Gradient Boosting | $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$ | Log Loss / MSE | RMSE |
| Neural Networks | $y = f(WX + b)$ | Cross-Entropy / MSE | Accuracy, AUC |
| K-Means Clustering | $J = \sum \|x_j^{(i)} - c_i\|^2$ | SSE | Silhouette Score |

This overview summarizes each algorithm's mathematical foundation alongside its practical applications and limitations; understanding these trade-offs helps in selecting the right algorithm for a given data science task.
