*Thomas Friesen*

*2019-01-24*

Public interest in election forecasting is growing. Since Nate Silver's successful forecast of the 2008 US presidential election at 538, election forecasting has shifted toward statistical and data-mining methods. And since the 2016 presidential election, criticism of those methods has been louder than ever.

This post takes a look at the German federal election. For the German federal election, the Wahlbezirke are the more relevant regions, not the federal states. A number of economic variables are available for these Wahlbezirke, and this post examines that economic data. The dataset can be downloaded here. It contains 49 variables covering the population, various unemployment rates, the number of employees, and other fields.

The data is also visualized. Visualizing all variables at once is not really possible, however. Instead, a PCA is applied to the dataset and the resulting components are plotted. PCA is a dimensionality-reduction method that condenses a dataset with many variables into a small number of linear combinations of those variables. Instead of trying to plot 49 variables, one can plot the components produced by the PCA. If the variables are highly correlated, the dataset can be visualized using the first few components while still retaining most of the information in the whole dataset.
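As a minimal illustration of the idea (on synthetic data, not the election dataset), a PCA on two highly correlated variables concentrates almost all of the variance in the first component:

```
# Synthetic example: two strongly correlated variables
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.1)          # y is almost a copy of x
pca <- princomp(scale(cbind(x, y)))
# Share of total variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 3)
```

Here the first component captures well over 90 % of the variance, so plotting it alone loses almost no information.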

```
library(spatial)
library(sf)
library(rgdal)
library(dplyr)
library(mapview)
library(reshape2)
library(ggplot2)
library(kableExtra)
```

```
# Drop the first six ID columns and keep only the numeric variables
data_pca <- daten_bezirk[, -(1:6)]
# Standardize each variable to mean 0 and standard deviation 1
data_pca <- scale(data_pca)
# Run the PCA
pca1 <- princomp(data_pca)
# Attach the scores of the first five components to the original data
daten_bezirk$PCA1 <- pca1$scores[, 1]
daten_bezirk$PCA2 <- pca1$scores[, 2]
daten_bezirk$PCA3 <- pca1$scores[, 3]
daten_bezirk$PCA4 <- pca1$scores[, 4]
daten_bezirk$PCA5 <- pca1$scores[, 5]
# Loadings of the first eight components, reshaped to long form for plotting
pca_plot_data <- pca1$loadings[, 1:8]
pca_plot_data <- data.frame(WKR_NAME = rownames(pca_plot_data), pca_plot_data)
pca_plot_data <- melt(pca_plot_data, id.vars = "WKR_NAME")
```

We scale the data and apply a PCA with *princomp*. We then extract the loadings, which give the weight of each variable in a principal component. A principal component with very high loadings on only a few variables allows for an easy interpretation. Finally, we use *melt* to reshape the loadings into long form, which is convenient for plotting.
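To see what *melt* does here, a toy example (with made-up values, not the actual loadings) shows the wide-to-long reshaping:

```
library(reshape2)

# Toy wide table in the same shape as the loadings data frame above
wide <- data.frame(WKR_NAME = c("var_a", "var_b"),
                   Comp.1   = c(0.5, -0.2),
                   Comp.2   = c(0.1,  0.7))
# One row per (row name, component) pair, with columns
# WKR_NAME, variable, value -- the layout ggplot2 expects
long <- melt(wide, id.vars = "WKR_NAME")
long
```

Each of the two rows and two component columns becomes its own row in the long table, so faceting by `variable` in ggplot2 gives one panel per component.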

| | First Component | Second Component | Third Component | Fourth Component | Fifth Component |
|---|---|---|---|---|---|
| Proportion of variance | 0.283 | 0.267 | 0.078 | 0.05 | 0.04 |
| Cumulative proportion | 0.283 | 0.551 | 0.630 | 0.69 | 0.73 |

The first row shows the proportion of the total variance explained by each component. The first component explains 28.3 % of the total variance, the second about 26.7 %. The second row shows the cumulative variance explained by the components combined: the first four components, for example, explain about two thirds of the total variance. The first two components alone explain more than half of the total variance and are therefore the most important ones.
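Both rows of the table can be computed directly from the `sdev` slot of a *princomp* fit. A small sketch on random toy data (the election fit `pca1` would work the same way):

```
set.seed(2)
# Toy data standing in for the scaled election variables
toy <- scale(matrix(rnorm(200), ncol = 5))
fit <- princomp(toy)

prop <- fit$sdev^2 / sum(fit$sdev^2)   # variance explained per component
cum  <- cumsum(prop)                   # cumulative variance explained
round(rbind(prop, cum), 3)
```

The proportions sum to one, and the cumulative row is what tells you how many components you need to retain a given share of the information.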

```
ggplot(pca_plot_data, aes(WKR_NAME, abs(value), fill = value)) +
  facet_wrap(~variable, nrow = 1) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_gradient2(name = "Loadings", high = "blue", mid = "white", low = "red") +
  # After coord_flip() the variable names appear on the vertical axis
  xlab("Variables") +
  ylab("Absolute value of the loadings")
```