july 10

today's checklist

  • maru hiragana

  • finish datacamp course 30, chapter 4

  • finish datacamp course 32, chapter 1

  • finish datacamp course 32, chapter 2

more datacamp

here are my notes on the three chapters of datacamp courses i did today (the final chapter of experimental design in python, and the first two chapters of supervised learning with scikit-learn):

chapter 4: advanced insights from experimental complexity

  • addressing complexities in experimental data
    • heteroscedasticity - when the variability of one variable changes across the range of another variable (see the sketch at the end of this chapter's notes)
  • covariate adjustment in experimental design
    • covariates - variables that are related to the outcome variable and can influence its analysis
    • adjusting for covariates can help in reducing confounding
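
a minimal sketch (my own, not from the course) of how heteroscedasticity can be detected with statsmodels' breusch-pagan test, on simulated data where the noise widens as x grows:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(42)

# simulate data whose noise grows with x (heteroscedasticity)
x = rng.uniform(1, 10, size=500)
y = 2.0 * x + rng.normal(scale=0.5 * x)

# fit ordinary least squares, then test the residuals
X = sm.add_constant(x)
residuals = sm.OLS(y, X).fit().resid
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, X)

# a small p-value suggests the residual variance is not constant
print(f"breusch-pagan p-value: {lm_pvalue:.4g}")
```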

chapter 1: classification

  • machine learning with scikit-learn
    • machine learning - the process in which computers learn to make decisions from data without being explicitly programmed
    • unsupervised learning - the uncovering of hidden patterns from unlabelled data
    • supervised learning - a type of machine learning in which the true values are known during training, and the model is built to accurately predict the values of previously unseen data
  • the classification challenge
    • the 4 steps of classifying labels of unseen data:
      • build a model
      • have the model learn from labelled data we pass to it
      • pass unlabelled data to the model as input
      • the model predicts the labels of the unseen data
    • labelled data = training data
    • k-nearest neighbours algorithm - predicts the label of a data point by looking at the k closest labelled data points and taking a majority vote (see the sketch below)
    • measuring model performance
      • accuracy - correct predictions / total observations
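
a small sketch of the four classification steps with scikit-learn's KNeighborsClassifier, scored with accuracy at the end (the built-in iris data and k=6 are my choices, not the course's):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# labelled data: features X and known labels y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# build the model, then have it learn from the labelled training data
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

# pass unseen data as input; the model predicts its labels
y_pred = knn.predict(X_test)

# accuracy = correct predictions / total observations
print(accuracy_score(y_test, y_pred))  # same as knn.score(X_test, y_test)
```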

chapter 2: regression

  • introduction to regression
    • regression is another type of supervised learning, in which the target variable typically has continuous values
  • the basics of linear regression
    • y=ax+b
      • where y is the target
      • x is the single feature in simple linear regression
      • a and b are parameters, or the coefficients of the model (slope and intercept)
    • when adding more features: y = a1x1 + a2x2 + … + anxn + b (see the sketch below)
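
a minimal sketch of fitting y = a1x1 + a2x2 + b with scikit-learn's LinearRegression (the toy numbers below are made up so that a1=2, a2=3, b=0):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data generated from y = 2*x1 + 3*x2 (so a1=2, a2=3, b=0)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([8.0, 7.0, 18.0, 17.0, 25.0])

reg = LinearRegression()
reg.fit(X, y)

print(reg.coef_)                  # the learned slopes a1, a2
print(reg.intercept_)             # the learned intercept b
print(reg.predict([[6.0, 6.0]]))  # prediction for unseen features
```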
  • cross-validation
    • steps of k-fold cross-validation (see the sketch below):
      • split dataset into k groups / folds
      • set aside first fold as test set
      • fit model on remaining folds
      • predict on test set
      • compute metric of interest
      • repeat with the next folds
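
a sketch of k-fold cross-validation in scikit-learn; cross_val_score runs the whole split / fit / predict / score loop for you (5 folds and the built-in diabetes data are my choices for illustration):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5 folds: each fold takes a turn as the test set
kf = KFold(n_splits=5, shuffle=True, random_state=5)
reg = LinearRegression()

# one score per fold (default regression metric: r-squared)
scores = cross_val_score(reg, X, y, cv=kf)
print(scores)
print(scores.mean())  # average performance across folds
```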
  • regularised regression
    • regularisation is used to avoid overfitting in regression
    • penalises large coefficients (see the sketch below)
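
a sketch of two regularised regressions in scikit-learn: ridge (penalises the squared size of coefficients) and lasso (penalises absolute size, and can shrink some coefficients to exactly zero); alpha controls the penalty strength, and alpha=1.0 here is just a placeholder:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# larger alpha = stronger penalty on large coefficients
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=1.0).fit(X_train, y_train)

print(ridge.score(X_test, y_test))  # r-squared on held-out data
print(lasso.score(X_test, y_test))
print(lasso.coef_)  # lasso can zero out some coefficients entirely
```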