july 12

today's checklist

  • maru hiragana

  • finish datacamp course 32, chapter 3

  • finish datacamp course 32, chapter 4

datacamp

here are my notes on the datacamp chapters i did today (chapters 3 and 4, a continuation of supervised learning with scikit-learn):

chapter 3: fine-tuning your model

  • how good is your model?
    • confusion matrix for assessing classification performance
      • actual:, predicted:
      • :legitimate, :fraudulent
      • true negative (top left - actual:legitimate & predicted:legitimate), false positive (actual:legitimate & predicted:fraudulent), false negative (actual:fraudulent & predicted:legitimate), true positive (bottom right - actual:fraudulent & predicted:fraudulent)
    • precision - number of true positives divided by the number of all positive predictions (true positives + false positives)
    • recall - number of true positives divided by the sum of true positives and false negatives (also called sensitivity; high recall means most fraudulent transactions are predicted correctly)
    • f1 score - harmonic mean of precision and recall
      • 2 * (precision * recall) / (precision + recall)
  • logistic regression and the roc curve
    • logistic regression - used for classification problems; it outputs probabilities and produces a linear decision boundary (a straight line in 2d)
    • default threshold of logistic regression = 0.5
    • roc curve (receiver operating characteristic curve) - helps visualise how different thresholds affect the true positive and false positive rates
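
to make the chapter 3 notes concrete, here is a small sketch in plain python (no scikit-learn needed). the labels and probabilities are toy values i made up, and the function names (`confusion_counts`, `scores`, `predict`, `roc_points`) are my own, not from any library. it counts the four confusion matrix cells, computes precision, recall and f1 at the 0.5 threshold, then sweeps a few thresholds to get the (false positive rate, true positive rate) pairs a roc curve would plot.

```python
# toy data, made up for illustration: 1 = fraudulent, 0 = legitimate
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
# hypothetical predicted probabilities of the fraudulent class
y_prob = [0.1, 0.4, 0.35, 0.8, 0.9, 0.65, 0.3, 0.2, 0.7, 0.55]


def predict(y_prob, threshold=0.5):
    """label a sample positive when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in y_prob]


def confusion_counts(y_true, y_pred):
    """return (tn, fp, fn, tp), treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def scores(tn, fp, fn, tp):
    """return (precision, recall, f1); a zero denominator gives 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * (precision * recall) / total if total else 0.0
    return precision, recall, f1


def roc_points(y_true, y_prob, thresholds):
    """return (threshold, false positive rate, true positive rate) triples."""
    points = []
    for threshold in thresholds:
        tn, fp, fn, tp = confusion_counts(y_true, predict(y_prob, threshold))
        fpr = fp / (fp + tn) if fp + tn else 0.0
        tpr = tp / (tp + fn) if tp + fn else 0.0
        points.append((threshold, fpr, tpr))
    return points


counts = confusion_counts(y_true, predict(y_prob))
precision, recall, f1 = scores(*counts)
print("tn, fp, fn, tp:", counts)
print("precision:", precision, "recall:", recall, "f1:", round(f1, 3))

points = roc_points(y_true, y_prob, [0.0, 0.25, 0.5, 0.75, 1.0])
for threshold, fpr, tpr in points:
    print("threshold", threshold, "fpr", fpr, "tpr", tpr)
```

a threshold of 0.0 labels everything fraudulent (both rates are 1) and a threshold of 1.0 labels nothing fraudulent here (both rates are 0); the roc curve traces what happens in between. in practice `sklearn.metrics` has ready-made versions (`confusion_matrix`, `precision_score`, `recall_score`, `f1_score`, `roc_curve`), so this is only for understanding the arithmetic.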

chapter 4: preprocessing and pipelines

  • preprocessing data
    • scikit-learn has requirements for the data it is given
      • numeric data
      • no missing values
    • we may have to preprocess our data by converting categorical values into numeric values before we can use it with scikit-learn
  • handling missing data
    • imputation - replace missing data with educated guesses
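
a small sketch of both preprocessing ideas from chapter 4 in plain python. the rows are toy data i made up, and `impute_mean` and `one_hot` are hypothetical helper names of my own, not scikit-learn functions. it assumes the educated guess for a missing number is the mean of the observed values (one common choice, not the only one), and it turns the categorical column into 0/1 columns, one per category.

```python
# toy rows, made up for illustration; None marks a missing value
rows = [
    {"genre": "rock", "popularity": 40.0},
    {"genre": "jazz", "popularity": None},
    {"genre": "rock", "popularity": 60.0},
    {"genre": "pop", "popularity": 50.0},
]


def impute_mean(values):
    """replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def one_hot(values):
    """turn category labels into 0/1 columns, one per sorted category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


popularity = impute_mean([row["popularity"] for row in rows])
categories, encoded = one_hot([row["genre"] for row in rows])

# every row is now fully numeric: the one-hot genre columns plus popularity
numeric_rows = [enc + [pop] for enc, pop in zip(encoded, popularity)]

print("columns:", categories + ["popularity"])
for row in numeric_rows:
    print(row)
```

after this every row has only numbers and no missing values, which are the two requirements noted above. with the real libraries this would more likely be `pandas.get_dummies` or `sklearn.preprocessing.OneHotEncoder` for the categories and `sklearn.impute.SimpleImputer` for the missing values.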