Commit

update 2401
mvanrongen committed Jan 24, 2024
1 parent b5ff362 commit 0721ccf
Showing 3 changed files with 3 additions and 3 deletions.
Binary file modified .DS_Store
Binary file not shown.
4 changes: 2 additions & 2 deletions _freeze/materials/glm-intro-lm/execute-results/html.json
@@ -1,8 +1,8 @@
{
-"hash": "3cbadacc229dc33b07156990b913cd2f",
+"hash": "274c011b10d7862f60a24c9005e359ff",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Linear models\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n## Data\n\nFor this example, we'll be using the several data sets about Darwin's finches. They are part of a long-term genetic and phenotypic study on the evolution of several species of finches. The exact details are less important for now, but there are data on multiple species where several phenotypic characteristics were measured (see @fig-finchphenotypes).\n\n![Finches phenotypes (courtesy of [HHMI BioInteractive](https://www.biointeractive.org))](images/finches-phenotypes.png){width=75% #fig-finchphenotypes}\n\n\n::: {.cell}\n\n:::\n\n\n## Exploring data\n\nIt's always a good idea to explore your data visually. Here we are focussing on the (potential) relationship between beak length (`blength`) and beak depth (`bdepth`).\n\nOur data contains measurements from two years (`year`) and two species (`species`). If we plot beak depth against beak length, colour our data by species and look across the two time points (1975 and 2012), we get the following graph:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Beak depth and length for _G. fortis_ and _G. scandens_](glm-intro-lm_files/figure-html/fig-finches_1975v2012-1.png){#fig-finches_1975v2012 width=672}\n:::\n:::\n\n\nIt seems that there is a potential linear relationship between beak depth and beak length. There are some differences between the two species and two time points with, what seems, more spread in the data in 2012. The data for both species also seem to be less separated than in 1975.\n\nFor the current purpose, we'll focus on one group of data: those of _Geospiza fortis_ in 1975.\n\n\n::: {.cell}\n\n:::\n\n\n## Linear model\n\nLet's look at the _G. fortis_ data more closely, assuming that the have a linear relationship. We can visualise that as follows:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Beak depth vs beak length _G. 
fortis_ (1975)](glm-intro-lm_files/figure-html/fig-lm_fortis_1975-1.png){#fig-lm_fortis_1975 width=672}\n:::\n:::\n\n\nIf you recall from the [Core statistics linear regression](https://cambiotraining.github.io/corestats/materials/cs3_practical_linear-regression.html) session, what we're doing here is assuming that there is a linear relationship between the response variable (in this case `bdepth`) and predictor variable (here, `blength`).\n\nWe can get more information on this linear relationship by defining a linear model, which has the form of:\n\n$$\nY = \\beta_0 + \\beta_1X\n$$\n\nwhere $Y$ is the response variable (the thing we're interested in), $X$ the predictor variable and $\\beta_0$ and $\\beta_1$ are model coefficients. \nMore explicitly for our data, we get:\n\n$$\nbeak\\ depth = \\beta_0 + \\beta_1 \\times beak\\ length\n$$\n\n\n::: {.cell}\n\n:::\n\n\nBut how do we find this model? The computer uses a method called **least-squares regression**. There are several steps involved in this.\n\n### Line of best fit\n\nThe computer tries to find the **line of best fit**. This is a linear line that best describes your data. We could draw a linear line through our cloud of data points in many ways, but the least-squares method converges to a single solution, where the **sum of squared residual deviations** is at its smallest.\n\nTo understand this a bit better, it's helpful to realise that each data point consists of a fitted value (the beak depth predicted by the model at a given beak length), combined with the error. The error is the difference between the fitted value and the data point.\n\nLet's look at this for one of the observations, for example finch 473:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Beak depth vs beak length (finch 473, 1975)](glm-intro-lm_files/figure-html/fig-finch473-1.png){#fig-finch473 width=672}\n:::\n:::\n\n\nObtaining the fitted value and error happens for each data point. 
All these residuals are then squared (to ensure that they are positive), and added together. This is the so-called sum-of-squares.\n\nYou can imagine that if you draw a line through the data that doesn't fit the data all that well, the error associated with each data point increases. The sum-of-squares then also increases. Equally, the closer the data are to the line, the smaller the error. This results in a smaller sum-of-squares.\n\nThe linear line where the sum-of-squares is at its smallest, is called the **line of best fit**. This line acts as a model for our data.\n\nFor finch 473 we have the following values:\n\n* the observed beak depth is 9.5 mm\n* the observed beak length is 10.5 mm\n* the fitted value is 9.11 mm\n* the error is 0.39 mm\n\n### Linear regression\n\nOnce we have the line of best fit, we can perform a **linear regression**. What we're doing with the regression, is asking:\n\n> Is the line of best fit a better predictor of our data than a horizontal line across the average value?\n\nVisually, that looks like this:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Regression: is the slope different from zero?](glm-intro-lm_files/figure-html/fig-lm_regression-1.png){#fig-lm_regression width=672}\n:::\n:::\n\n\nWhat we're actually testing is whether the _slope_ ($\\beta_1$) of the line of best fit is any different from zero.\n\nTo find the answer, we perform an ANOVA. This gives us a p-value of 1.68e-78.\n\nNeedless to say, this p-value is extremely small, and definitely smaller than any common significance threshold, such as $p < 0.05$. This suggests that beak length is a statistically significant predictor of beak depth.\n\nIn this case the model has an **intercept** ($\\beta_0$) of -0.34 and a **slope** ($\\beta_1$) of 0.9. We can use this to write a simple linear equation, describing our data. 
Remember that this takes the form of:\n\n$$\nY = \\beta_0 + \\beta_1X\n$$\n\nwhich in our case is\n\n$$\nbeak\\ depth = \\beta_0 + \\beta_1 \\times beak\\ length\n$$\n\nand gives us\n\n$$\nbeak\\ depth = -0.34 + 0.90 \\times beak\\ length\n$$\n\n### Assumptions\n\nIn example above we just got on with things once we suspected that there was a linear relationship between beak depth and beak length. However, for the linear regression to be valid, several assumptions need to be met. If any of those assumptions are violated, we can't trust the results. The following four assumptions need to be met, with a 5th point being a case of good scientific practice:\n\n1. Data should be linear\n2. Residuals are normally distributed\n3. Equality of variance\n4. The residuals are independent\n5. (no influential points)\n\nAs we did many times during the [Core statistics](https://cambiotraining.github.io/corestats/) sessions, we mainly rely on diagnostic plots to check these assumptions. For this particular model they look as follows:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Diagnostic plots for _G. fortis_ (1975) model](glm-intro-lm_files/figure-html/fig-fortis1975_lm_dgplots-1.png){#fig-fortis1975_lm_dgplots width=672}\n:::\n:::\n\n\nThese plots look very good to me. For a recap on how to interpret these plots, see [CS2: ANOVA](https://cambiotraining.github.io/corestats/materials/cs2_practical_anova.html).\n\nTaken together, we can see the relationship between beak depth and beak length as a linear one, described by a (linear) model that has a predicted value for each data point, and an associated error.\n",
"markdown": "---\ntitle: \"Linear models\"\n---\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n:::\n\n\n## Data\n\nFor this example, we'll be using the several data sets about Darwin's finches. They are part of a long-term genetic and phenotypic study on the evolution of several species of finches. The exact details are less important for now, but there are data on multiple species where several phenotypic characteristics were measured (see @fig-finchphenotypes).\n\n![Finches phenotypes (courtesy of [HHMI BioInteractive](https://www.biointeractive.org))](images/finches-phenotypes.png){width=75% #fig-finchphenotypes}\n\n\n::: {.cell}\n\n:::\n\n\n## Exploring data\n\nIt's always a good idea to explore your data visually. Here we are focussing on the (potential) relationship between beak length (`blength`) and beak depth (`bdepth`). \n\nOur data contains measurements from two years (`year`) and two species (`species`). If we plot beak depth against beak length, colour our data by species and look across the two time points (1975 and 2012), we get the following graph:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Beak depth and length for _G. fortis_ and _G. scandens_](glm-intro-lm_files/figure-html/fig-finches_1975v2012-1.png){#fig-finches_1975v2012 width=672}\n:::\n:::\n\n\nIt seems that there is a potential linear relationship between beak depth and beak length. There are some differences between the two species and two time points with, what seems, more spread in the data in 2012. The data for both species also seem to be less separated than in 1975.\n\nFor the current purpose, we'll focus on one group of data: those of _Geospiza fortis_ in 1975.\n\n\n::: {.cell}\n\n:::\n\n\n## Linear model\n\nLet's look at the _G. fortis_ data more closely, assuming that the have a linear relationship. We can visualise that as follows:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Beak depth vs beak length _G. 
fortis_ (1975)](glm-intro-lm_files/figure-html/fig-lm_fortis_1975-1.png){#fig-lm_fortis_1975 width=672}\n:::\n:::\n\n\nIf you recall from the [Core statistics linear regression](https://cambiotraining.github.io/corestats/materials/cs3_practical_linear-regression.html) session, what we're doing here is assuming that there is a linear relationship between the response variable (in this case `bdepth`) and predictor variable (here, `blength`).\n\nWe can get more information on this linear relationship by defining a linear model, which has the form of:\n\n$$\nY = \\beta_0 + \\beta_1X\n$$\n\nwhere $Y$ is the response variable (the thing we're interested in), $X$ the predictor variable and $\\beta_0$ and $\\beta_1$ are model coefficients. \nMore explicitly for our data, we get:\n\n$$\nbeak\\ depth = \\beta_0 + \\beta_1 \\times beak\\ length\n$$\n\n\n::: {.cell}\n\n:::\n\n\nBut how do we find this model? The computer uses a method called **least-squares regression**. There are several steps involved in this.\n\n### Line of best fit\n\nThe computer tries to find the **line of best fit**. This is a linear line that best describes your data. We could draw a linear line through our cloud of data points in many ways, but the least-squares method converges to a single solution, where the **sum of squared residual deviations** is at its smallest.\n\nTo understand this a bit better, it's helpful to realise that each data point consists of a fitted value (the beak depth predicted by the model at a given beak length), combined with the error. The error is the difference between the fitted value and the data point.\n\nLet's look at this for one of the observations, for example finch 473:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Beak depth vs beak length (finch 473, 1975)](glm-intro-lm_files/figure-html/fig-finch473-1.png){#fig-finch473 width=672}\n:::\n:::\n\n\nObtaining the fitted value and error happens for each data point. 
All these residuals are then squared (to ensure that they are positive), and added together. This is the so-called sum-of-squares.\n\nYou can imagine that if you draw a line through the data that doesn't fit the data all that well, the error associated with each data point increases. The sum-of-squares then also increases. Equally, the closer the data are to the line, the smaller the error. This results in a smaller sum-of-squares.\n\nThe linear line where the sum-of-squares is at its smallest, is called the **line of best fit**. This line acts as a model for our data.\n\nFor finch 473 we have the following values:\n\n* the observed beak depth is 9.5 mm\n* the observed beak length is 10.5 mm\n* the fitted value is 9.11 mm\n* the error is 0.39 mm\n\n### Linear regression\n\nOnce we have the line of best fit, we can perform a **linear regression**. What we're doing with the regression, is asking:\n\n> Is the line of best fit a better predictor of our data than a horizontal line across the average value?\n\nVisually, that looks like this:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Regression: is the slope different from zero?](glm-intro-lm_files/figure-html/fig-lm_regression-1.png){#fig-lm_regression width=672}\n:::\n:::\n\n\nWhat we're actually testing is whether the _slope_ ($\\beta_1$) of the line of best fit is any different from zero.\n\nTo find the answer, we perform an ANOVA. This gives us a p-value of 1.68e-78.\n\nNeedless to say, this p-value is extremely small, and definitely smaller than any common significance threshold, such as $p < 0.05$. This suggests that beak length is a statistically significant predictor of beak depth.\n\nIn this case the model has an **intercept** ($\\beta_0$) of -0.34 and a **slope** ($\\beta_1$) of 0.9. We can use this to write a simple linear equation, describing our data. 
Remember that this takes the form of:\n\n$$\nY = \\beta_0 + \\beta_1X\n$$\n\nwhich in our case is\n\n$$\nbeak\\ depth = \\beta_0 + \\beta_1 \\times beak\\ length\n$$\n\nand gives us\n\n$$\nbeak\\ depth = -0.34 + 0.90 \\times beak\\ length\n$$\n\n### Assumptions\n\nIn example above we just got on with things once we suspected that there was a linear relationship between beak depth and beak length. However, for the linear regression to be valid, several assumptions need to be met. If any of those assumptions are violated, we can't trust the results. The following four assumptions need to be met, with a 5th point being a case of good scientific practice:\n\n1. Data should be linear\n2. Residuals are normally distributed\n3. Equality of variance\n4. The residuals are independent\n5. (no influential points)\n\nAs we did many times during the [Core statistics](https://cambiotraining.github.io/corestats/) sessions, we mainly rely on diagnostic plots to check these assumptions. For this particular model they look as follows:\n\n\n::: {.cell}\n::: {.cell-output-display}\n![Diagnostic plots for _G. fortis_ (1975) model](glm-intro-lm_files/figure-html/fig-fortis1975_lm_dgplots-1.png){#fig-fortis1975_lm_dgplots width=672}\n:::\n:::\n\n\nThese plots look very good to me. For a recap on how to interpret these plots, see [CS2: ANOVA](https://cambiotraining.github.io/corestats/materials/cs2_practical_anova.html).\n\nTaken together, we can see the relationship between beak depth and beak length as a linear one, described by a (linear) model that has a predicted value for each data point, and an associated error.\n",
"supporting": [
"glm-intro-lm_files"
],
2 changes: 1 addition & 1 deletion materials/glm-intro-lm.qmd
@@ -29,7 +29,7 @@ finches <- read_csv("data/finches_beaks.csv")

## Exploring data

-It's always a good idea to explore your data visually. Here we are focussing on the (potential) relationship between beak length (`blength`) and beak depth (`bdepth`).
+It's always a good idea to explore your data visually. Here we are focussing on the (potential) relationship between beak length (`blength`) and beak depth (`bdepth`).

Our data contains measurements from two years (`year`) and two species (`species`). If we plot beak depth against beak length, colour our data by species and look across the two time points (1975 and 2012), we get the following graph:
