Interaction Between Categorical and Continuous Variables in R

Yes, it's possible. Suppose your categorical variable has $k$ levels. You'll need $(k-1)$ binary indicators to represent it, plus another $(k-1)$ terms that interact those indicators with the continuous variable in order to model the interaction correctly.
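
To see what those indicator and interaction columns look like, here is a tiny self-contained sketch (the toy x and group below are made up purely for illustration) of the design matrix R builds for a 3-level factor interacting with a continuous variable:

    # Toy data: any 3-level factor behaves the same way
    x     <- c(1, 2, 3)
    group <- factor(c(1, 2, 3))
    # Columns: (Intercept), x, group2, group3, x:group2, x:group3
    model.matrix(~ x + group + x:group)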

In essence, it's just a regression model that allows each level of the categorical variable to have its own slope and intercept (without the interaction, each level can have its own intercept, but the slopes are forced to be the same). Given the model:

$y = 50 + 100(Lv2) + 200(Lv3) + 2.5 x + 3.5(x\times Lv2) - 6.5(x \times Lv3)$

where $Lv2$ and $Lv3$ are binary dummy variables representing levels 2 and 3 of the categorical variable, respectively. Here $k = 3$ and $Lv1$ is kept as the reference group. The model is easy to visualize once we realize it is just a compact way to express three regression lines. Substituting 1 and 0 into the regression model accordingly, we find that the equations are:

for $Lv1$:

$y = 50 + 2.5 x$

for $Lv2$:

$y = (50 + 100) + (2.5+3.5) x$

for $Lv3$:

$y = (50 + 200) + (2.5-6.5) x$
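
As a quick check, this small R sketch plugs the dummy values into the model written above (using the illustrative coefficients 50, 100, 200, 2.5, 3.5, -6.5, not fitted estimates) and reproduces the three lines at, say, x = 10:

    # Hand-coded prediction from the illustrative model above
    pred <- function(x, lv2, lv3) {
      50 + 100 * lv2 + 200 * lv3 + 2.5 * x + 3.5 * (x * lv2) - 6.5 * (x * lv3)
    }
    pred(10, lv2 = 0, lv3 = 0)  # Lv1: 50  + 2.5 * 10         = 75
    pred(10, lv2 = 1, lv3 = 0)  # Lv2: 150 + (2.5 + 3.5) * 10 = 210
    pred(10, lv2 = 0, lv3 = 1)  # Lv3: 250 + (2.5 - 6.5) * 10 = 210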

If we plot the predicted y, $\hat{y}$, against the continuous variable and use a different color for each level of the categorical variable, we get:

[Figure: predicted y plotted against x, one regression line per group]

The red line is group 1, with slope 2.5 and intercept 50; the green line is group 2; the blue line is group 3.

There are more sophisticated ways to do this (for example, adding 95% CI shading; see the sketch after the code below), but this conveys the overall gist.

R code I used:

    set.seed(1520)

    x <- rep(0:199, 3)
    group <- as.factor(rep(1:3, rep(200, 3)))
    lv2   <- as.numeric(group == 2)
    lv3   <- as.numeric(group == 3)

    y <- 50 + 100 * lv2 + 200 * lv3 + 2.5 * x +
         3.5 * (x * lv2) - 6.5 * (x * lv3) +
         rnorm(600, 0, 15)

    # Without interactions, lines will have to be parallel:
    m01  <- lm(y ~ x + group)
    summary(m01)
    yhat <- m01$fit
    plot(x, yhat, col = as.numeric(group) + 1)

    # With interactions, lines can have their own slope:
    m02  <- lm(y ~ x + group + x:group)
    summary(m02)
    yhat <- m02$fit
    plot(x, yhat, col = as.numeric(group) + 1)
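
For the 95% CI shading mentioned above, one possible approach (a rough base-R sketch, assuming m02, x, y, and group from the code above are still in the workspace) uses predict() with interval = "confidence" and polygon():

    # Shade 95% confidence bands around each group's fitted line
    newdat <- expand.grid(x = 0:199, group = factor(1:3))
    ci <- predict(m02, newdata = newdat, interval = "confidence")
    plot(x, y, col = as.numeric(group) + 1, pch = 16, cex = 0.4)
    for (g in 1:3) {
      idx <- newdat$group == levels(newdat$group)[g]
      polygon(c(newdat$x[idx], rev(newdat$x[idx])),
              c(ci[idx, "lwr"], rev(ci[idx, "upr"])),
              col = adjustcolor("grey", alpha.f = 0.4), border = NA)
      lines(newdat$x[idx], ci[idx, "fit"], col = g + 1, lwd = 2)
    }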

Just to clarify, what do you mean by "Without interactions, lines will have to be parallel"? Is that parallel relative to the reference group?

Correct. We can simulate the data again, but this time without changing the slopes for $Lv2$ and $Lv3$ (i.e., we replace the slope adjustments 3.5 and -6.5 with 0):

    set.seed(1520)

    x <- rep(0:199, 3)
    group <- as.factor(rep(1:3, rep(200, 3)))
    lv2   <- as.numeric(group == 2)
    lv3   <- as.numeric(group == 3)

    y <- 50 + 100 * lv2 + 200 * lv3 + 2.5 * x +
         0 * (x * lv2) + 0 * (x * lv3) +
         rnorm(600, 0, 15)

    m03  <- lm(y ~ x + group + x:group)
    summary(m03)
    yhat <- m03$fit
    plot(x, yhat, col = as.numeric(group) + 1)

Here is the output:

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 4.980e+01  2.011e+00  24.761   <2e-16 ***
    x           2.487e+00  1.748e-02 142.279   <2e-16 ***
    group2      9.942e+01  2.844e+00  34.956   <2e-16 ***
    group3      1.986e+02  2.844e+00  69.833   <2e-16 ***
    x:group2    8.207e-03  2.472e-02   0.332    0.740    
    x:group3    1.398e-02  2.472e-02   0.566    0.572    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And here is the predicted y:

[Figure: predicted y plotted against x; the three group lines are nearly parallel]

As we can see, if we don't adjust the lines' slopes, the interaction terms above (x:group2 and x:group3) are estimated close to 0, and as a result the groups' predicted y lines are close to parallel.
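
To see this numerically rather than from the plot, a quick check (assuming m03 from the code above) is to look at the interaction estimates and their confidence intervals, which here include zero:

    # Interaction estimates and 95% CIs; both intervals cover zero
    coef(m03)[c("x:group2", "x:group3")]
    confint(m03, c("x:group2", "x:group3"))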

So in the graphical illustration you showed, no lines are parallel; therefore there is no interaction?

No, the other way around. Interaction means that the association between an independent variable and the dependent variable depends on the value of another independent variable. In the general case, a regression model like:

$$y = \beta_0 + \beta_1 x_1+ \beta_2 x_2$$

indicates that for a one-unit increase in $x_1$, mean $y$ differs by $\beta_1$ units, regardless of the value of $x_2$. Applied to your situation: when there is no interaction, each unit increase in the continuous independent variable is associated with the same change in mean $y$, regardless of which group we are talking about. That scenario means the lines have to be parallel.
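
A quick sanity check of that statement (assuming m01 and the first simulated data set above are still in the workspace): in the no-interaction model the gap between any two groups' fitted values is the same at every x, which is exactly what "parallel" means.

    # Reshape fitted values so each column is one group (x runs 0:199 within group)
    fit <- matrix(m01$fitted.values, ncol = 3)
    range(fit[, 2] - fit[, 1])  # constant gap, equal to coef(m01)["group2"]
    range(fit[, 3] - fit[, 1])  # constant gap, equal to coef(m01)["group3"]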

When an interaction exists, each unit increase in the continuous independent variable is associated with a different amount of change in mean $y$ depending on which level of the categorical variable we're talking about. That implies the lines are not parallel.

In the example I provided above, +3.5 adds an extra 3.5 to the reference slope of 2.5 for $Lv2$, and -6.5 takes 6.5 away from it for $Lv3$. If these two coefficients are significantly different from zero, we have a significant interaction and the lines are not parallel; if they are close to zero, we don't have evidence of an interaction, and the lines are (nearly) parallel.

Also, how do I interpret the coefficients and p-values of the interaction terms? Is it just the same as how coefficients and p-values of categorical variables are interpreted?

First, to safeguard against multiple testing, we test whether the whole set of interaction terms is significant using an extra-sum-of-squares F test:

    m01  <- lm(y ~ x + group)
    m02  <- lm(y ~ x + group + x:group)
    anova(m01, m02)

If this test is significant, then at least one of the interaction terms in the model is significant. We can then go on to look at each of their p-values and discuss where the difference might be coming from.
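
As a side note, drop1() gives an equivalent overall F test by dropping the x:group block as a whole, which can be a convenient cross-check:

    # Block test of the interaction; same F test as anova(m01, m02)
    drop1(m02, test = "F")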

The coefficients (e.g., the 3.5 and -6.5 in the model above) are really just differences in slopes. So, given that the reference group has a slope of 2.5, we can report that $Lv2$ has a significantly larger slope, by 3.5, resulting in a final slope of 6.0. By the same reasoning, the slope for $Lv3$ is (2.5 - 6.5) = -4.0.

To put this all into context, a unit increase in x is then associated with:

2.5 unit increase in mean $y$ in $Lv1$ of the categorical variable,

6.0 unit increase in mean $y$ in $Lv2$ of the categorical variable,

4.0 unit decrease in mean $y$ in $Lv3$ of the categorical variable.
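
Those level-specific slopes can be recovered directly from the fitted coefficients (assuming m02 was fit on the first simulated data set above), for example:

    # Reference slope plus each level's slope adjustment
    b <- coef(m02)
    c(Lv1 = unname(b["x"]),
      Lv2 = unname(b["x"] + b["x:group2"]),
      Lv3 = unname(b["x"] + b["x:group3"]))
    # Estimates should come out close to 2.5, 6.0, and -4.0 for the simulated data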

Source: https://stats.stackexchange.com/questions/274748/interaction-terms-categorical-continuous
