Data Spotlight

Tutorial: Data Cleaning and Prep with ChatGPT
January 2024

Tutorial: Tidying some data from Measuring Distress for graphing using data cleaning and prep with ChatGPT

For this tutorial, we’re using some data from the Measuring Distress county tool on StatsAmerica. We selected a few contiguous counties in Illinois and Missouri and downloaded the data into Excel. See Figure 1 for a screenshot of how the data looks in its raw format.

Figure 1: Raw Measuring Distress data in Excel

First, let’s establish what kind of data we want. We want a “tidy” dataset, wherein each variable is a column, each observation is a row, each cell is a single value and each variable is uniquely named.² This data, as presented above, is close to tidy but not quite. There are three columns named “Threshold Calculation,” and the first column contains names for two different columns, “2022 Per Capita Money Income (ACS 5-year PCMI)” and “Geography.” In addition, the dollar signs, while suitable for analysis in Excel, could cause some programming languages to recognize these columns as text, which is incorrect.

Let’s ask ChatGPT to help us tidy this dataset. Navigate to chat.openai.com (sign in or create an account if it asks you to), then copy and paste the data, starting with row 4, into a new chat session. (see Figure 1).

Figure 2: Preparing ChatGPT with a query and data

We asked it to describe the data to get a sense for what it thinks it contains and its structure. Its response to us is below.

Figure 3: ChatGPT's description of our data

Not bad. However, it didn’t catch all the columns, and we don’t trust that it truly understands what the threshold calculation columns refer to. Let’s offer some guidance.

Figure 4: Asking ChatGPT to rename columns based on other columns

You can’t see in the screenshot, but it did correctly rename the threshold calculation columns to be associated with the columns to their left.

You commonly see data that contains both text and numbers with quotation marks around the text columns, so let’s ask it to do this. We also want to make sure numeric columns are classified as numeric.

Figure 5: Asking ChatGPT to format the columns

We forgot to ask it to remove the dollar signs, so let’s take care of that so these columns are correctly specified as numeric.

Figure 6: Refining the column and data formatting

Now we’re cooking. We want this to be a comma delimited file so we can import it into Excel or another statistical analysis tool.

Figure 7: Asking ChatGPT to make a comma delimited file

Wait... we just noticed an issue with the first column. It should be named “Geography”. Just to be sure, let’s specify how the columns should be ordered explicitly to ensure we get what we want on the first try. In addition, we want the column names to also be “tidy,” so spaces should be replaced with underscores and special characters like parentheses should be removed.

Figure 8: Asking for a few more formatting tweaks

Alright, it took us a little too literally. We only need column names to be tidy; the observations can have spaces or special characters, provided they are text columns. It also removed the quotation marks, for some reason.

Figure 9: Putting finishing touches on formatting

Looks pretty good! Now we have a clean and tidy dataset, suitable for further analysis. Let’s ask it to write an R script that replicates all these steps and exports the cleaned dataset to a .csv file.

Figure 10: Asking ChatGPT to replicate cleaning steps in R script

Note that some of the code is chopped off in the screenshot, but you get the gist. Finally, let’s ask it to use R to reshape the data so we can graph the two per capita income measures for comparison: per capita personal income from the BEA and per capita money income from the ACS. See txt file here.

Figure 11: Asking ChatGPT to reshape the data for graphing

What’s interesting about this is that it recognizes what we mean by “long” data and writes the code to create the graph without explicit instructions to do so. It even “pretties” up the graph for us with colors, labels and a theme. See txt file here. When you run this code, R will generate a graph that looks like this.

Figure 12: The graph produced in R using ChatGPT's code

Final Thoughts

It’s not perfect, but it’s pretty good. For example, ChatGPT didn’t consider how the geography names on the X-axis would overlap so it didn’t make the necessary position adjustments. This is what makes ChatGPT both powerful and frustrating: it won’t do all the work for you, but it will get you started. While there are more straightforward ways to clean data, ChatGPT is a flexible and free solution that requires no coding expertise or spreadsheet wrangling.

Notes

For serious data analysis using ChatGPT, you can upgrade to the paid ChatGPT Plus plan, which gives you access to the powerful Advanced Data Analysis plugin. One advantage of the plugin is the ability to upload data files directly into the AI, eschewing the need to copy and paste data into the chat, which is fickle and subject to character limits. We recommend checking out MIT’s excellent resource about the Advanced Data Analysis plugin.
Wickham, Hadley. R for Data Science (2nd Edition). Chapter 5. https://r4ds.hadley.nz/data-tidy.html. There is more to tidy data than this, so I recommend reading this chapter, regardless of whether you use R.