Intro
This is a blog post about an experiment that uses Large Language Models, Google Bard and OpenAI ChatGPT, to make the pyEntropy (pyEntrp) library a more powerful and versatile tool for computing entropy.
Problem
With the recent boom of Large Language Models and their application in various fields, I was curious to see if they could help with the maintenance of the pyEntropy (pyEntrp) library, which is built to calculate different types of entropy. The first entropy function in the library is Shannon entropy.
Shannon Entropy measures the complexity or randomness of information. It looks at how many different patterns or possibilities there are in a set of data. If the data is very predictable and has few patterns, the entropy will be low. But if the data is very unpredictable, with lots of different patterns, the entropy is high.
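Formally, for a set of symbols where symbol i appears with probability pᵢ, the Shannon entropy is H = -Σᵢ pᵢ log₂ pᵢ; with a base-2 logarithm it is measured in bits.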
And to test the results, I’ve prepared the following code, where the original version is copy-pasted from pyEntropy 0.8.2 (actually it’s also partly generated with the help of GitHub Copilot):
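For reference, here is a sketch that follows the shape of the pyEntrp 0.8.2 implementation (variable names are approximate, not a verbatim copy):

```python
import numpy as np

def shannon_entropy(time_series):
    """Return the Shannon entropy of the sample data (pyEntrp-style sketch)."""
    # Treat anything that is not a string as a plain Python list
    if not isinstance(time_series, str):
        time_series = list(time_series)

    # Collect the distinct symbols, then count each one with an inner loop:
    # this nested scan is what makes the function slow for long inputs
    data_set = list(set(time_series))
    freq_list = []
    for entry in data_set:
        counter = 0.0
        for i in time_series:
            if i == entry:
                counter += 1
        freq_list.append(counter / len(time_series))

    # Shannon entropy: -sum(p * log2(p)) over the observed frequencies
    ent = 0.0
    for freq in freq_list:
        ent += freq * np.log2(freq)
    return -ent
```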
To find out if there’s a better function in terms of complexity, I put the same question to Bard and ChatGPT:
Can you write a more efficient version of the Python function: import numpy as np
def shannon_entropy(time_series): … here goes the original function from pyEntrp 0.8.2 …
Results
Bard
Bard generated three versions of the code. Interestingly, in one of the code snippets, there was a reference to pyEntropy. I will not present all of them here, but only the one with the best performance.
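Bard’s output is not quoted verbatim here; a purely illustrative sketch of the set-and-frequency style it produced (function and variable names are assumptions) could look like this:

```python
import numpy as np

def shannon_entropy_bard(time_series):
    """Illustrative sketch of a set-and-frequency approach, not Bard's exact code."""
    if not isinstance(time_series, str):
        time_series = list(time_series)

    n = len(time_series)
    # One C-level pass per unique symbol via .count() instead of a hand-written inner loop
    frequencies = [time_series.count(value) / n for value in set(time_series)]

    # Shannon entropy over the observed frequencies
    return -sum(freq * np.log2(freq) for freq in frequencies)
```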
Let’s see what ChatGPT has to offer.
ChatGPT
ChatGPT responds with a single answer (unlike Bard’s three drafts).
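The exact snippet is not reproduced here; a `collections.Counter`-based version in the same spirit (variable names are assumptions) looks like this:

```python
import numpy as np
from collections import Counter

def shannon_entropy_counter(time_series):
    """Counter-based sketch: one pass to count symbols, then sum p * log2(p)."""
    n = len(time_series)
    counts = Counter(time_series)  # a single O(n) counting pass over the data

    # Turn counts into probabilities and accumulate the entropy
    return -sum((count / n) * np.log2(count / n) for count in counts.values())
```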
OK, here we see `Counter` from the standard library, which looks like a better choice than building a set and counting frequencies by hand, as in Bard’s version.
And to make sure it’s the best option, I asked another question:
Are there faster versions?
Of course, there are.
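The faster suggestion is built on `np.unique`; a sketch matching the improvements listed below (and the lines quoted later in the post) is essentially:

```python
import numpy as np

def shannon_entropy_numpy(time_series):
    """Vectorised sketch: np.unique does the counting, NumPy does the arithmetic."""
    time_series = np.asarray(time_series)

    # Unique symbols and how often each occurs, in a single call
    unique_vals, counts = np.unique(time_series, return_counts=True)

    # Element-wise probabilities, log2 and summation without explicit Python loops
    frequencies = counts / time_series.size
    return -np.sum(frequencies * np.log2(frequencies))
```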
In this version, the improvements include:
- Using `np.unique` with the `return_counts=True` parameter to directly obtain unique values and their corresponding counts from the time series array.
- Utilizing NumPy’s array operations to calculate frequencies and perform element-wise multiplication and summation without explicit loops.
And a sanity check:
Is this the best version?
The short answer is yes.
OK, time to test the four versions and compare the results.
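The benchmark itself is simple; a minimal sketch using `timeit` and the sketch functions above (sizes, input generation and repeat counts are assumptions, not necessarily the setup behind the numbers below):

```python
import timeit
import numpy as np

# Hypothetical harness: time each implementation on random integer sequences
sizes = [10, 100, 500, 1000]
implementations = {
    "original": shannon_entropy,
    "bard": shannon_entropy_bard,
    "chatgpt": shannon_entropy_counter,
    "chatgpt_v2": shannon_entropy_numpy,
}

for size in sizes:
    data = list(np.random.randint(0, 10, size))
    for name, func in implementations.items():
        elapsed = timeit.timeit(lambda: func(data), number=100)
        print(f"{size:>5}  {name:>10}  {elapsed:.6f}")
```

The timings came out as follows: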
| Size | Original Time | Bard Time | ChatGPT Time | ChatGPT v2 Time | Bard vs Original, % diff | ChatGPT vs Original, % diff | ChatGPT v2 vs Original, % diff |
|---|---|---|---|---|---|---|---|
| 10 | 0.000839 | 0.000635 | 0.000605 | 0.001009 | 24.31% | 27.89% | -20.26% |
| 100 | 0.033200 | 0.013663 | 0.004940 | 0.001059 | 58.85% | 85.12% | 96.81% |
| 500 | 0.486557 | 0.289214 | 0.022725 | 0.001692 | 40.56% | 95.33% | 99.65% |
| 1000 | 1.882986 | 1.151954 | 0.046219 | 0.002610 | 38.82% | 97.55% | 99.86% |
Wow, both Bard and ChatGPT have come up with improved versions. But ChatGPT’s latest version beats all the others by a considerable margin. There is a slight slowdown on the smallest input, but it is negligible given how much better it does on larger inputs.
The last interesting question is whether it’s possible to find this implementation on GitHub. I’ve selected parts of the code that might be common:
- First try: `unique_vals, counts = np.unique(time_series, return_counts=True)` – no results.
- Second try: `-np.sum(frequencies * np.log2(frequencies))` – two results. They look similar, but the function generated by ChatGPT is original and not a direct copy-paste.
As a result, ChatGPT was able to provide better versions of the function, with the best one coming only after a follow-up question. Should that follow-up be asked every time? Who knows. Unfortunately, Bard was not able to match ChatGPT’s results. Note that ChatGPT effectively suggested two functions: the first gives correct results for string input, while the second is much more optimised but only valid for numeric vector input.
But overall, the result is great: an optimised function that appears to be original rather than copied from somewhere else. And since pyEntrp is used in many libraries, the optimisation will hopefully save some energy and CO2. Now it’s time to prepare a PR and ship a new version of pyEntropy.