The Tedium of the Long-Distance Analyzer

Getting LLMs to do Some of Your Work

“Data scientists spend about 45% of their time on data preparation tasks, including loading and cleaning data,” according to a survey of data scientists conducted by Anaconda.

Large Language Models like ChatGPT, BARD, Bing and others are often touted as ushering in a new age of productivity for the knowledge worker. If you haven’t dived in very deeply you might be forgiven if you think that this AI stuff is just a “party trick” or another fad that will blow over.

I think not. There are just too many good use cases where you can demonstrate to yourself that these crazy computer programs are helpful in the real world.

For example, a large part of modern work is analyzing data. And a large part of analyzing data involves tedious tasks like sourcing, extracting, cleaning and loading the data – not one bit of these vital and necessary tasks is “value-add” in the sense of the final output. However they really eat up a lot of time.

Maybe our old buddy ChatGPT (or your other, favorite Large Language Model) can help us get some of that time back.

In order to get LLMs to do more of our work for us, we need to provide them with more tools. Most of our analytical assignments rely on some amount of up-to-date information. LLMs are typically trained on data that is months or years old, and of course, not every single possible item is in their training set – so a lot of factual data just isn’t there to pull out.

Thankfully, the newest version of ChatGPT has been enhanced with plug-ins and extra functionality to overcome this limitation. We’ll look at two of these that can really enhance our productivity: Web Scraper and Code Interpreter.

The Web Scraper plug-in allows us to guide ChatGPT to access the internet and pull current information from websites – we can even specify the site and what we want to extract.

One interesting thing about LLMs is that generally, they are bad at math. Ironic, isn’t it? Some plug-ins have been developed to overcome this limitation, and recently ChatGPT has even added the ability to execute Python code right in the Chat dialog. (This is in the paid ChatGPT-4 Plus version only).

Recently I ran across a discussion on Reddit pertaining to the rates of cancer, per 100,000 population by county in California. The data looked interesting – a quick scan of the website showed a huge variation in rates between counties, with a near 3x difference between the lowest and highest rates. However, data was also included for cancer rates after adjusting for age. Since we know that cancer is mostly a disease of aging, this makes sense, but from the website it was hard to get the big picture since it was set up like a “slide show.” Irritating for data nerds!

Disclaimer: I’m not a public health professional and you probably shouldn’t be basing any decisions on the basic analysis I’m doing here – this just one example of what you can do with these tools.

Here's an idea of how the data was laid out:

https://stacker.com/california/counties-highest-cancer-rates-california

The data is in there… but with 58 counties and lots of photos to scroll past, we don’t want to cut/paste or reenter the data if we don’t have to – and we may want an approach that would scale to doing an analysis of the whole country with thousands of counties. With the Web scraper tool now in ChatGPT4 Plus, we can get right to work and harvest the data!

First we’ll fire up ChatGPT and select the Web Scraper:

I got the Scraper plug-in from the ChatGPT Plug-in store. The last time I checked there were over 100 plug-ins in the store, most granting access to various data providers.

One issue is that it appears that we can only have a maximum of three plug-ins installed at any one time, and the Code interpreter only works without any other plug-ins installed. So we’ll split our work into two sessions - Gathering the data and then analyzing it.

Me:

Please scrape this website https://stacker.com/california/counties-highest-cancer-rates-california I want only the following information in a table: County name, Cancer rate per 100,000, 2016-2020, Age-adjusted rate per 100,000, 2016-2020 please put the data in a table using Markdown.

Note that I was specific on what data I wanted. Luckily our website is very consistent in labeling the data we want. County, Cancer Rate and Age-adjusted Cancer Rate… I just used the exact labeling the site used. the "using Markdown" is a special phrase that tells ChatGPT to format the output in a nice table.

ChatGPT:

Used Scraper

ChatGPT did our bidding and returned data scraped from the website:

Pretty slick, and much faster than re-typing. One issue, though: ChatGPT tells us that it sorted the data in ascending order based on the age-adjusted rate… but a quick glance shows us that isn’t true. I guess ChatGPT is being aspirational here! Luckily, we can easily fix that and do more in the next steps.

Now that we have our (unsorted!) data, it’s time to do a bit of analysis. Our hypothesis is that the variance of the raw data is much higher than that of the age-adjusted data – which could lead people to panic – if they look at only the raw data.

If we are cheeky and just directly ask ChatGPT to calculate the mean, variance, and standard deviation of the data, it tells us:

ChatGPT:

I'm sorry for any confusion, but as a text-based AI developed by OpenAI, I don't have the ability to execute code or perform calculations directly. I can generate code snippets and explain how they work, but I can't run them myself.

Oops.

But there’s a clue in there. ChatGPT tells us that it can generate code snippets. Now we happen to know (because we read the latest AI news) that ChatGPT has gone beyond this and now can generate the code and also execute the code for us! We need to start a new chat, and instead of activating the Scraper plug-in, we’ll activate the Code Interpreter module – currently in Beta.

Now we can proceed – first we’ll just copy and paste our table from the previous chat, and then ask for the mean and standard deviation.

With only a simple instruction we can do this:

Fantastic. Our hunch was correct. The age-adjusted numbers are probably telling a more interesting story here – much less variation once age is taken into account.

So why is the result so different?

ChatGPT has long been able to generate code in various languages, from Excel macros to Python. Once we enable the (beta) Code Interpreter, it can not only generate the code, but also execute it. ChatGPT is not always great at math and often struggles with even the simplest equations. The workaround has been to generate code, paste that into a code environment and execute it. The newest version of ChatGPT shortcuts that. It generates the code and executes it behind the scenes, showing us only the end result.

However, if you look at the top of the answer dialog, above, there’s a little drop-down called “Show work.” Clicking on that will allow us to peak behind the curtain and reveal the code that ChatGPT generated and ran behind the scenes to create our answer.

You can copy this code, paste it into a Python interpreter and run it.

So, we have our answer, but we can do some other things with our data set as well, now that we have it all ingested. The great thing is that we can use plain English to ask for what we want.

I really like the ability to ask follow-up questions! The retention of the conversational context is a huge help in carrying on a more fluid exploration.

Good answer, but we might want a bit more detail – again ChatGPT has memory so we don’t have to be overly specific in asking a follow up question, we can reference the previous answers implicitly:

We can even ask some “meta” questions about the data:

We can even ask tangential questions:

I hope you found this quick demo helpful. I’m off to pack up for a move to Mono County!

Thomas Niccum

The Tedium of the Long-Distance Analyzer

Getting LLMs to do Some of Your Work

MicroStrategy World 2024

AI as Devil’s Advocate