Histograms are best suited to plotting continuous variables because, as the name suggests, the values are on a continuum. Histograms are very helpful for investigating the distribution of a continuous variable, which is important for determining whether the variable needs to be recoded.
Code
We can create histograms with either base R or the ggplot2 package.
- In base R, we use the `hist()` function to plot the distribution of the `expenditure` variable.
- In tidyverse, we use the `ggplot()` and `geom_histogram()` functions to create the same graph.
- Compared with base R, the `ggplot()` function makes it easy to customize our plots. For instance, we were able to change the number of bins, add a theme (the `theme_bw()` function), and change the labels of the x-axis and y-axis using the `labs()` function.
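The three plots described above can be sketched as follows. This is a minimal illustration, not the tutorial's original code: the data frame name `schools`, the simulated right-skewed values, the choice of 20 bins, and the label text are all assumptions made for the example.

```r
library(ggplot2)

# Hypothetical stand-in data: 200 simulated, right-skewed spending values.
set.seed(1)
schools <- data.frame(expenditure = 4000 + rgamma(200, shape = 4, scale = 350))

# Base R: hist() draws the plot and returns the bins and their counts.
h <- hist(schools$expenditure,
          main = "Histogram of expenditure", xlab = "Expenditure")

# ggplot2: the same histogram with default settings.
p_basic <- ggplot(schools, aes(x = expenditure)) +
  geom_histogram()

# Improved version: explicit bin count, black-and-white theme, axis labels.
# The value bins = 20 is an arbitrary choice for this sketch.
p_improved <- ggplot(schools, aes(x = expenditure)) +
  geom_histogram(bins = 20) +
  theme_bw() +
  labs(x = "Expenditure", y = "Number of observations")

print(p_basic)
print(p_improved)
```

Trying a few different values of `bins` is usually worthwhile, since the apparent shape of a distribution can change with the bin width.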
Output from Base R
Output from ggplot()
Output from ggplot() - improved version
The histogram shows us the range of `expenditure` values among the observations and how frequently they occur. We can also see that the distribution of `expenditure` does not follow a normal curve: it is close to normal, but it is skewed to the right. This may affect the results of our earlier statistical tests.
Boxplots, often called box-and-whisker plots, are used to represent the quartiles of continuous variables. Boxplots display the variation in the sample with boxes that represent the quartiles and 'whiskers' that cover observations outside the upper and lower quartiles. These plots can be drawn for a single variable or for multiple variables, as we will see below.
Code
We can create boxplots with either base R or the ggplot2 package.
- In base R, we use the `boxplot()` function to plot the distribution of the `expenditure` variable.
- In tidyverse, we use the `ggplot()` and `geom_boxplot()` functions to create the same graph.
- Compared with base R, the `ggplot()` function makes it easy to customize our plots. For instance, we were able to add a theme (the `theme_bw()` function) and change the labels of the x-axis and y-axis using the `labs()` function.
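A minimal sketch of these three boxplots is shown here. The data frame `schools`, its simulated values, and the label text are assumptions made for illustration; they are not the tutorial's data or original code.

```r
library(ggplot2)

# Hypothetical stand-in data: 200 simulated, right-skewed spending values.
set.seed(2)
schools <- data.frame(expenditure = 4000 + rgamma(200, shape = 4, scale = 350))

# Base R: boxplot() draws the plot and returns the statistics behind it.
# b$stats holds the lower whisker, Q1, median, Q3, and upper whisker;
# b$out holds the points drawn as outliers.
b <- boxplot(schools$expenditure, ylab = "Expenditure")

# ggplot2: the same boxplot with default settings.
p_basic <- ggplot(schools, aes(y = expenditure)) +
  geom_boxplot()

# Improved version: black-and-white theme and axis labels.
p_improved <- ggplot(schools, aes(y = expenditure)) +
  geom_boxplot() +
  theme_bw() +
  labs(x = NULL, y = "Expenditure")

print(p_basic)
print(p_improved)
```

Inspecting `b$stats` and `b$out` is a quick way to read off the exact values that the plot only shows visually.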
The boxplots below show the median of the `expenditure` variable (just above 5,000) as a horizontal line inside the gray box. The bottom and top edges of the gray box are the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile) of the distribution. The whiskers extend to the lowest and highest values of `expenditure` that are not flagged as outliers (by default, values within 1.5 times the interquartile range of the box). Dots are outliers.
Output from base R
Output from ggplot()
Output from ggplot() - improved version
Code
We can also create a boxplot of the `expenditure` variable grouped by another variable. For instance, we can graph `expenditure` for two counties from the `county` variable.
This code might look intimidating at first. However, each step configures a specific aspect of the plot:
- The `filter()` function restricts the `county` variable to only two options: Sonoma and Merced.
- The `geom_boxplot()` function creates a boxplot of `expenditure` by `county`.
- The `theme_bw()` function applies a black-and-white theme to the plot.
- The `labs()` function changes the x-axis and y-axis names.
- The `coord_flip()` function flips the x and y coordinates.
- The `scale_x_continuous()` function changes how the x-axis scale looks.
- Its `breaks` argument, combined with the `seq()` function, sets the x-axis ticks.
- Its `limits` argument sets the lower and upper limits of the x-axis.
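The steps listed above could be combined roughly as follows. This is a sketch under stated assumptions: the data frame `schools` and its values are simulated, the third county is included only so that `filter()` has something to remove, and mapping `expenditure` to x (so that `scale_x_continuous()` applies to it before `coord_flip()` moves it to the vertical axis) is a guess at how the original plot was built. Horizontal boxplots mapped this way need ggplot2 3.3.0 or later.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data: three counties with simulated spending values.
set.seed(3)
schools <- data.frame(
  county = rep(c("Sonoma", "Merced", "Kern"), each = 30),
  expenditure = c(rnorm(30, 5400, 500),
                  rnorm(30, 5000, 400),
                  rnorm(30, 5200, 450))
)

# Keep only the two counties we want to compare.
two_counties <- schools %>%
  filter(county %in% c("Sonoma", "Merced"))

p_county <- ggplot(two_counties, aes(x = expenditure, y = county)) +
  geom_boxplot() +                       # one box per county
  theme_bw() +                           # black-and-white theme
  labs(x = "Expenditure", y = "County") +
  coord_flip() +                         # expenditure ends up on the vertical axis
  scale_x_continuous(
    breaks = seq(3000, 8000, by = 1000), # tick marks every 1,000
    limits = c(3000, 8000)               # values outside this range are dropped
  )

print(p_county)
```

Note that `limits` removes observations outside the range before the boxplot statistics are computed, so the range should be wide enough to contain all the data.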
Output
This box plot is split by the two counties (Merced and Sonoma), and `expenditure` is represented on the y-axis. This helps us see the distribution of `expenditure` by `county`.
Bar plots are best used to represent ordinal-level variables and to show how the observations are distributed across the categories. We can graph a bar plot of a single variable or of multiple variables for a direct comparison.
Code
We can create bar plots with either base R or the ggplot2 package.
- In base R, we use the `barplot()` function to plot the distribution of the `grades` variable.
- In tidyverse, we use the `ggplot()` and `geom_bar()` functions to create the same graph.
- Compared with base R, the `ggplot()` function makes it easy to customize our plots. For instance, we were able to add a theme (the `theme_bw()` function) and change the labels of the x-axis and y-axis using the `labs()` function.
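A minimal sketch of the three bar plots follows. The data frame `schools`, the counts in it, and the label text are made up for illustration and are not the tutorial's data. One detail worth noting is that `barplot()` expects counts rather than raw values, so the sketch tabulates the variable with `table()` first, whereas `geom_bar()` does the counting itself.

```r
library(ggplot2)

# Hypothetical stand-in data: 60 observations in two grade-span categories.
schools <- data.frame(grades = c(rep("KK-06", 12), rep("KK-08", 48)))

# Base R: count the categories first, then plot the counts.
counts <- table(schools$grades)
barplot(counts, xlab = "Grades", ylab = "Count")

# ggplot2: geom_bar() counts the rows in each category for us.
p_basic <- ggplot(schools, aes(x = grades)) +
  geom_bar()

# Improved version: black-and-white theme and axis labels.
p_improved <- p_basic +
  theme_bw() +
  labs(x = "Grades", y = "Number of observations")

print(p_basic)
print(p_improved)
```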
Output from base R
Output from ggplot()
Output from ggplot() - improved version
The bar plots above show the raw count of observations in each category of the `grades` variable. We can clearly see that there are more KK-08 grades than KK-06 grades in the dataset.
Code
This code might look intimidating at first. However, each step configures a specific aspect of the plot:
- The `filter()` function restricts the `county` variable to only two options: Sonoma and Merced.
- The `geom_bar()` function creates a bar plot of `grades` by `county`.
- The `fill` and `color` arguments fill and color the bars by the `county` variable.
- The `theme_minimal()` function applies a minimal theme to the plot.
- The `labs()` function changes the x-axis and y-axis names.
- The `coord_flip()` function flips the x and y coordinates.
- The `scale_y_continuous()` function changes how the y-axis scale looks.
- Its `breaks` argument, combined with the `seq()` function, sets the y-axis ticks.
- Its `limits` argument sets the lower and upper limits of the y-axis.
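The steps listed above could be put together along these lines. This is a sketch, not the original code: the data frame `schools` and its counts are invented for the example, the third county exists only so that `filter()` has something to remove, and the bars are left at ggplot2's default stacked position because the text does not say how they were arranged.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data: grade spans for schools in three counties.
schools <- data.frame(
  county = c(rep("Sonoma", 29), rep("Merced", 11), rep("Kern", 10)),
  grades = c(rep("KK-06", 9), rep("KK-08", 20),  # Sonoma
             rep("KK-06", 3), rep("KK-08", 8),   # Merced
             rep("KK-08", 10))                   # Kern
)

# Keep only the two counties we want to compare.
two_counties <- schools %>%
  filter(county %in% c("Sonoma", "Merced"))

p_grades <- ggplot(two_counties,
                   aes(x = grades, fill = county, color = county)) +
  geom_bar() +                            # counts per grade span, stacked by county
  theme_minimal() +                       # minimal theme
  labs(x = "Grades", y = "Number of observations") +
  coord_flip() +                          # draw the bars horizontally
  scale_y_continuous(
    breaks = seq(0, 30, by = 5),          # tick marks every 5 observations
    limits = c(0, 30)                     # must cover the tallest stacked bar
  )

print(p_grades)
```

Adding `position = "dodge"` inside `geom_bar()` would place the county bars side by side instead of stacking them, which can make the comparison easier to read.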
Output
We have broken the observations down by grades (KK-06 and KK-08) and by county (Merced and Sonoma).
Scatter plots are best used to show graphically whether there is a relationship between two variables and what that relationship may look like.
Code
We can create scatter plots with either base R or the ggplot2 package.
- In base R, we use the `plot()` function to plot the `students` variable against the `teachers` variable.
- In tidyverse, we use the `ggplot()` and `geom_point()` functions to create the same graph.
- Compared with base R, the `ggplot()` function makes it easy to customize our plots. For instance, we were able to add a theme (the `theme_bw()` function), change the labels of the x-axis and y-axis using the `labs()` function, and even add a regression line using the `geom_smooth()` function.
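A minimal sketch of the three scatter plots is given here. The data frame `schools` and its values are simulated for illustration, and using `method = "lm"` in `geom_smooth()` is an assumption about how the regression line was drawn; the text only says that a regression line was added.

```r
library(ggplot2)

# Hypothetical stand-in data: 100 simulated districts in which the number of
# students rises roughly linearly with the number of teachers.
set.seed(4)
schools <- data.frame(teachers = round(runif(100, 5, 500)))
schools$students <- round(schools$teachers * 20 + rnorm(100, 0, 300))

# Base R: plot(x, y) draws a scatter plot.
plot(schools$teachers, schools$students,
     xlab = "Teachers", ylab = "Students")

# ggplot2: the same scatter plot with default settings.
p_basic <- ggplot(schools, aes(x = teachers, y = students)) +
  geom_point()

# Improved version: theme, axis labels, and a fitted straight line.
p_improved <- p_basic +
  geom_smooth(method = "lm", formula = y ~ x) +  # linear regression line
  theme_bw() +
  labs(x = "Number of teachers", y = "Number of students")

print(p_basic)
print(p_improved)
```

By default `geom_smooth()` also draws a shaded confidence band around the line; passing `se = FALSE` hides it.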
Output from base R
Output from ggplot()
Output from ggplot() - improved version
Above are scatter plots of the variables `students` by `teachers`. Scatter plots are very helpful for examining continuous variables and checking whether a graphical relationship exists. We can see in this scatter plot that there is a positive, linear relationship between the number of students and the number of teachers. After looking at this graph, we would next want to conduct statistical tests to see whether the relationship is statistically significant.