Because the iris dataset does not contain any missing values or other irregularities. We can immediately jump to further preprocessing steps. For the full code see ....
We will first split our training set into test and train.
data(iris)
set.seed(123)
# Define the split ratio
split_ratio <- 0.75
# Create training and testing indices
train_indices <- sample(seq_len(nrow(iris)),
size = round(split_ratio * nrow(iris)))
# Split the data
train_data <- tibble(iris[train_indices, ])
test_data <- tibble(iris[-train_indices, ])
This results in 112
samples in the training set and 38
in the test set.
After cleaning and splittign the data we can now start with the start exploration.
Because the iris dataset is quite well know I will skip describing the features and what they mean and jump straight into graphical analysis.
We will start with a simple PCA plot
pca_result <- prcomp(train_data %>% dplyr::select(-Species), scale. = TRUE)
biplot <- ggbiplot(pca_result, obs.scale = 1, var.scale = 1,
groups = NULL, ellipse = TRUE, circle = TRUE) + theme_minimal()
This tells use already that most of the variance is explained by the first principal component or in other words by the a linear combination of the sepal length, petal length and petal width. This presents us with two choices, we could directly use the PC1 as a predictor or we can use the three featues as predictors separately.
We will use them separately. Another way of visualizing the data is with a series ob box plots.