How to Tidy Messy Data Part 1)

Why Blog

Data scientist often spent about 80% of data analysis process on cleaning and preparing data1. Worst still, cleaning and preparing the data is an iterative process. Hadley Wickham refer to this process of cleaning and preparing data as data tidying: structuring datasets to facilitate analysis. Therefore, it is very important to get the right tool to efficiently and quickly tidy any messy data and spend more time working on your analysis.

Tidydata

I mostly use Python for machine learning. But R is an exceptional tool for data manipulation, data visualization, and data analysis due to the many available packages in R developed mainly for data analysis. R is becoming the “de facto best tool” for data analysis. Therefore, we will use R to explore different ways to tidy data. In this part 1 series, we will gently start by exploring the Janitor2 package and see how it makes tidying data easy. In subsequent series, we will explore dplyr3 and tidyr4 packages which are the most complete packages for data manipulaton and tidying data.

R Vs Python

Janitor Package

The janitor package has user-friendly functions for tidying messy data. It provides functions for formating column names, detecting duplicate records, provide quick tabulations, and many more.

For the purpose of this series, we are going to use a pinguin dataset that has been deliberately messed.

Loading the required packages

library(janitor)                                                     # CRAN v2.1.0
library(tidyverse)                                                   # CRAN v1.3.0

Loading the dataset from Github.

messy_penguins <- read_csv("https://raw.githubusercontent.com/BrunoGrandePhD/2020-11-14-rladies-workshop/rladies-tunis/learnr-tutorial/messy_penguins.csv")

Observing the data

glimpse(messy_penguins)
## Rows: 344
## Columns: 18
## $ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
## $ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", NA, NA, N…
## $ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
## $ Island                <chr> NA, NA, "Torgersen", NA, NA, NA, NA, NA, NA, NA,…
## $ Stage                 <chr> "Adult, 2 Egg Stage", "Adult, 1 Egg Stage", "Juv…
## $ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
## $ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
## $ `Date Egg`            <chr> "11/11/07", "11/11/07", "11/16/07", "11/16/07", …
## $ `Culmen Length (mm)`  <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.…
## $ Nickname              <chr> "Kamile", "Dixie", "Arian", "Alexander", "Dewi",…
## $ `Culmen Depth (mm)`   <chr> "18.7", "17.4", "18", "-", "19.3", "20.6", "17.8…
## $ `Flipper Length (mm)` <chr> "181", "186", "195", "--", "193", "190", "181", …
## $ `Body Mass`           <chr> "3750 g", "3800g", "3250", NA, "3450 grams", "36…
## $ Sex                   <chr> "M", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "…
## $ `Delta 15 N (o/oo)`   <chr> NA, "8.94956", "8.36821", NA, "8.76651", "8.6649…
## $ `Delta 13 C (o/oo)`   <chr> NA, "-24.69454", "-25.33302", NA, "-25.32426", "…
## $ Comments              <chr> "Not enough blood for isotopes.", NA, NA, "Adult…

clean_names() function

The clean_names() handles problematic variable names, returning only lowercase letters with underscore as separator, appends numbers to duplicated names, andles special characters and spaces, and converts “%” to “percent” to retain meaning. clean_names() can only be use on dataframe like objects, other objects such as named lists and vectors, make_clean_names() is used.

Let us observe column names for our messy data

colnames(messy_penguins)
##  [1] "studyName"           "Sample Number"       "Species"            
##  [4] "Region"              "Island"              "Stage"              
##  [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
## [10] "Culmen Length (mm)"  "Nickname"            "Culmen Depth (mm)"  
## [13] "Flipper Length (mm)" "Body Mass"           "Sex"                
## [16] "Delta 15 N (o/oo)"   "Delta 13 C (o/oo)"   "Comments"

We can see that, the column names structure is not uniform ( e.g “studyName”, “Sample Number”, “Delta 15 N (o/oo)”). So, to tidy these column names, we can use the clean_names() function to change them to uniform structure.

tidy_penguins <- clean_names(messy_penguins) 

colnames(tidy_penguins)
##  [1] "study_name"        "sample_number"     "species"          
##  [4] "region"            "island"            "stage"            
##  [7] "individual_id"     "clutch_completion" "date_egg"         
## [10] "culmen_length_mm"  "nickname"          "culmen_depth_mm"  
## [13] "flipper_length_mm" "body_mass"         "sex"              
## [16] "delta_15_n_o_oo"   "delta_13_c_o_oo"   "comments"

Now, all the column names are change to lower-case and have consistent structure. By default, clean_names() return snake-case like names, but you can specify other options such as the following cases:

  • snake_case: “snake”
  • lowerCamel: “lower_camel” or “small_camel”
  • UpperCamel: “upper_camel” or “big_camel”
  • ALL_CAPS: “all_caps” or “screaming_snake”
  • lowerUPPER: “lower_upper”
  • UPPERlower: “upper_lower”
  • Sentence case: “sentence”
  • Title Case: “title”

You can get details about any case here

messy_penguins %>%  
  clean_names(case ="lower_camel") %>% 
   colnames()
##  [1] "studyName"        "sampleNumber"     "species"          "region"          
##  [5] "island"           "stage"            "individualId"     "clutchCompletion"
##  [9] "dateEgg"          "culmenLengthMm"   "nickname"         "culmenDepthMm"   
## [13] "flipperLengthMm"  "bodyMass"         "sex"              "delta15NOOo"     
## [17] "delta13COOo"      "comments"

To clean names, but leave some abbreviations to appear the way you want.

messy_penguins %>% 
      clean_names(case = "lower_camel",  abbreviations = c("ID", "N", "mm")) %>% 
     colnames()
##  [1] "studyNAme"        "sampleNUmber"     "species"          "region"          
##  [5] "island"           "stage"            "individualID"     "clutchCompletion"
##  [9] "dateEgg"          "culmenLengthMm"   "nIckname"         "culmenDepthMm"   
## [13] "flipperLengthMm"  "bodyMass"         "sex"              "delta15NOOo"     
## [17] "delta13COOo"      "comments"

You can also restore column names to Title Case, e.g., for plotting

tidy_penguins %>%  
  clean_names( case = "title") %>% 
   colnames()
##  [1] "Study Name"        "Sample Number"     "Species"          
##  [4] "Region"            "Island"            "Stage"            
##  [7] "Individual Id"     "Clutch Completion" "Date Egg"         
## [10] "Culmen Length Mm"  "Nickname"          "Culmen Depth Mm"  
## [13] "Flipper Length Mm" "Body Mass"         "Sex"              
## [16] "Delta 15 n o Oo"   "Delta 13 c o Oo"   "Comments"

For vectors, we can use make_clean_names functions.

x <- structure(1:4, names = c("This is first", "this issecond", "3rd", "FinalChoice"))
x
## This is first this issecond           3rd   FinalChoice 
##             1             2             3             4
names(x) <- make_clean_names(names(x)) # `x` is added to names that start with number
x
## this_is_first this_issecond          x3rd  final_choice 
##             1             2             3             4

tabyl() function

This function is an alternative to the table() function from base-R. The function generate a frequency table from either a dataframe or vector.

studynames <- messy_penguins %>%  clean_names() %>% 
                tabyl(study_name) # %>% 
                #adorn_pct_formatting(digits = 0, affix_sign = TRUE) # creates a percentage column

studynames
##  study_name   n   percent
##     PAL0708 110 0.3197674
##     PAL0809 114 0.3313953
##     PAL0910 120 0.3488372

After tabyl() function, we can use the janitor’s family of adorn_functions to format the resultant dataframes.

studynames <- messy_penguins %>%  clean_names() %>% 
                tabyl(study_name) %>% 
                adorn_pct_formatting(digits = 0, affix_sign = TRUE) %>%  # format the percentage column
                adorn_totals(where = "row") #%>%  #Add totals 


studynames
##  study_name   n percent
##     PAL0708 110     32%
##     PAL0809 114     33%
##     PAL0910 120     35%
##       Total 344       -

Other adorn functions are :

  • adorn_pct_formatting Format a data.frame of decimals as percentages
  • adorn_rounding Round the numeric columns in a data.frame
  • adorn_totals Add a totals row and/or column to a data.frame.
  • adorn_ns Add underlying Ns to a tabyl displaying percentages
mtcars %>%
  tabyl(am, cyl) %>% 
  adorn_percentages("col") %>% 
 adorn_totals(where = c("row","col")) %>% 
  adorn_pct_formatting(digits = 0) %>% 
  adorn_rounding(digits = 3) %>% 
  adorn_ns(position = "front") # %>% 
##     am         4        6         8     Total
##      0  3  (27%) 4  (57%) 12  (86%) 19 (170%)
##      1  8  (73%) 3  (43%)  2  (14%) 13 (130%)
##  Total 11 (100%) 7 (100%) 14 (100%) 32 (300%)

Let use compare with Base-R table() function:

tidy_names <- messy_penguins %>%  
                clean_names() 
              
table(tidy_names$study_name)
## 
## PAL0708 PAL0809 PAL0910 
##     110     114     120

If you have a vector, the function also works the same

Age <- c(23, 24, 24, 25, 23,26, 25)
tabyl(Age)
##  Age n   percent
##   23 2 0.2857143
##   24 2 0.2857143
##   25 2 0.2857143
##   26 1 0.1428571

remove_empty()

This functions usage is straight foward. It simply removes any colum/rows from a data.frame or matrix that contain all “NA” as their entries.

empty <- messy_penguins %>% 
  remove_empty() # this will remove both empty rows and columns by default. 
## value for "which" not specified, defaulting to c("rows", "cols")
empty <- messy_penguins %>% 
  remove_empty(which = "rows", quiet = TRUE) # specify row or column and use quite argument to supress  messages be suppressed (TRUE) or printed (FALSE) indicating the summary of empty columns or rows removed?

remove-constant()

Sometimes, we may need to find columns that have the same values, we can use remove-constant() to to do that.

data.frame(A=1, B=1:3, c= c(3,3,3))  
##   A B c
## 1 1 1 3
## 2 1 2 3
## 3 1 3 3
data.frame(A=1, B=1:3, c= c(3,3,3)) %>%  
dplyr::select_at(setdiff(names(.), names(remove_constant(.)))) %>%unique()
##   A c
## 1 1 3

get_dupes() to Remove duplicates

Sometimes our data may contains duplicates that we may not like. So, we need to find and possibly remove them if any. get_dupes() returns a records (and inserts a count of duplicates) so that we can deal with identfied cases accordingly.

messy_penguins %>% 
  clean_names() %>% 
  get_dupes() # This dataset no any duplicate.
## No variable names specified - using all columns.
## No duplicate combinations found of: study_name, sample_number, species, region, island, stage, individual_id, clutch_completion, date_egg, ... and 9 other variables
## # A tibble: 0 x 19
## # … with 19 variables: study_name <chr>, sample_number <dbl>, species <chr>,
## #   region <chr>, island <chr>, stage <chr>, individual_id <chr>,
## #   clutch_completion <chr>, date_egg <chr>, culmen_length_mm <chr>,
## #   nickname <chr>, culmen_depth_mm <chr>, flipper_length_mm <chr>,
## #   body_mass <chr>, sex <chr>, delta_15_n_o_oo <chr>, delta_13_c_o_oo <chr>,
## #   comments <chr>, dupe_count <int>

Tip: call clean_names() every time you read in a new data set to automatically clean column names.

Shamsuddeen Hassan Muhammad
Shamsuddeen Hassan Muhammad
PhD Candidate in Computer Science

My research interests include Natural language processing, machine learning and deep learning.

comments powered by Disqus
Previous