12th jul '25

this week's goals

today's checklist

tags

datacamp

here are my notes on the first 3 chapters of the datacamp courses i did today (a continuation of supervised learning with scikit-learn):

chapter 3: fine-tuning your model

how good is your model?
- confusion matrix for assessing classification performance
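  - rows = actual class, columns = predicted class
  - classes: legitimate (negative), fraudulent (positive)
  - true negative (top left - actual:legitimate & predicted:legitimate), false positive (top right - actual:legitimate & predicted:fraudulent), false negative (bottom left - actual:fraudulent & predicted:legitimate), true positive (bottom right - actual:fraudulent & predicted:fraudulent)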
- precision - number of true positives divided by the sum of all positive predictions (true positives + false positives)
- recall - number of true positives divided by the sum of true positives and false negatives (also called sensitivity); high recall means most fraudulent transactions are correctly predicted
- f1 score - harmonic mean of precision and recall
  - 2 * (precision * recall) / (precision + recall)
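
a quick worked example of the three metrics above. the counts are made up for illustration (not from the course), and the variable names are just my own:

```python
# hypothetical confusion-matrix counts for a fraud classifier (made up for illustration)
tn, fp = 90, 5   # actual legitimate: predicted legitimate, predicted fraudulent
fn, tp = 2, 3    # actual fraudulent: predicted legitimate, predicted fraudulent

precision = tp / (tp + fp)   # of everything flagged as fraud, how much really was fraud
recall = tp / (tp + fn)      # of all real fraud, how much was caught
f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.375 recall=0.600 f1=0.462
```

in scikit-learn itself, `confusion_matrix` and `classification_report` from `sklearn.metrics` should give the same numbers from the true and predicted labels.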

logistic regression and the roc curve
- logistic regression - used for classification problems; outputs probabilities and produces a linear decision boundary (a straight line in 2d)
- default threshold of logistic regression = 0.5
- roc curve (receiver operating characteristic curve) - helps visualise how different thresholds affect the true positive and false positive rates
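
a rough sketch of how the threshold moves the true positive and false positive rates. the labels and probabilities are hypothetical (pretend they came from a logistic regression model), and `rates_at` is a helper name i made up:

```python
# hypothetical true labels (1 = positive) and predicted probabilities, made up for illustration
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]

def rates_at(threshold):
    """return (false positive rate, true positive rate) when predicting 1 if prob > threshold"""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# each threshold gives one point on the roc curve
for threshold in (0.0, 0.25, 0.5, 0.75, 1.0):
    fpr, tpr = rates_at(threshold)
    print(f"threshold={threshold:.2f} fpr={fpr:.2f} tpr={tpr:.2f}")
```

a threshold of 0 predicts everything as positive (fpr = tpr = 1) and a threshold of 1 predicts nothing as positive (fpr = tpr = 0). with scikit-learn, the probabilities would typically come from `predict_proba(X_test)[:, 1]`, and `roc_curve` from `sklearn.metrics` returns the fpr, tpr and thresholds directly.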

chapter 4: preprocessing and pipelines

preprocessing data
- scikit-learn requires:
  - numeric data
  - no missing values
- categorical features may need to be converted into numeric values during preprocessing before scikit-learn can use the data
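
one common way to do this conversion is one 0/1 column per category (dummy variables / one-hot encoding). the notes above don't name a method, so treat this as an assumption; the `genre` and `popularity` columns and the `encode` helper are made up for illustration:

```python
# hypothetical rows with one categorical column ("genre"), made up for illustration
rows = [
    {"genre": "rock", "popularity": 41.0},
    {"genre": "jazz", "popularity": 62.0},
    {"genre": "rock", "popularity": 55.0},
]

categories = sorted({row["genre"] for row in rows})  # ['jazz', 'rock']

def encode(row):
    """replace the categorical column with one 0/1 column per category"""
    encoded = {key: value for key, value in row.items() if key != "genre"}
    for category in categories:
        encoded[f"genre_{category}"] = 1 if row["genre"] == category else 0
    return encoded

encoded_rows = [encode(row) for row in rows]
print(encoded_rows[0])
# {'popularity': 41.0, 'genre_jazz': 0, 'genre_rock': 1}
```

in practice pandas `get_dummies` or scikit-learn's `OneHotEncoder` would do this instead of hand-written code.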

handling missing data
- imputation - replace missing data with educated guesses
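
a minimal sketch of imputation, assuming the "educated guess" is the mean of the observed values (one common choice; the median or most frequent value are others). the data is made up:

```python
from statistics import mean

# hypothetical feature column with missing entries marked as None
values = [3.0, None, 5.0, 4.0, None]

observed = [v for v in values if v is not None]
fill_value = mean(observed)  # the "educated guess": mean of the observed values

imputed = [fill_value if v is None else v for v in values]
print(imputed)
# [3.0, 4.0, 5.0, 4.0, 4.0]
```

scikit-learn's `SimpleImputer` (in `sklearn.impute`) with `strategy="mean"` does the same kind of fill on whole feature arrays.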