Intro
This is a blog post about an experiment in using Large Language Models, Google Bard and OpenAI ChatGPT, to make the pyEntropy (pyEntrp) library a more powerful and versatile tool for computing entropy.
Problem
With the recent boom of Large Language Models and their application in various fields, I was curious to see if they could help with the maintenance of the pyEntropy (pyEntrp) library, which calculates different types of entropy. The first entropy implemented there is the Shannon Entropy.
Shannon Entropy measures the complexity or randomness of information. It looks at how many different patterns or possibilities there are in a set of data. If the data is very predictable and has few patterns, the entropy will be low. But if the data is very unpredictable, with lots of different patterns, the entropy is high.
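In slightly more formal terms (this is the standard textbook definition, not something specific to pyEntrp): if the distinct values in the data occur with probabilities $p_i$, the Shannon entropy in bits is

$$H = -\sum_i p_i \log_2 p_i$$

and that is exactly what the code below computes from the observed frequencies.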
And to test the results, I’ve prepared the following code, where the original version is copy-pasted from pyEntropy 0.8.2 (actually it’s also partly generated with the help of GitHub Copilot):
```python
import numpy as np
import timeit


def shannon_entropy_original(time_series):
    """
    Return the Shannon Entropy of the sample data.

    Args:
        time_series: Vector or string of the sample data

    Returns:
        The Shannon Entropy as float value
    """
    # Check if string
    if not isinstance(time_series, str):
        time_series = list(time_series)

    # Create a frequency data
    data_set = list(set(time_series))
    freq_list = []
    for entry in data_set:
        counter = 0.0
        for i in time_series:
            if i == entry:
                counter += 1
        freq_list.append(float(counter) / len(time_series))

    # Shannon entropy
    ent = 0.0
    for freq in freq_list:
        ent += freq * np.log2(freq)
    ent = -ent
    return ent


def shannon_entropy_efficient(time_series):
    ...  # the candidate implementation under test is pasted here


if __name__ == "__main__":
    np.random.seed(42)

    # Define a list of input sizes to test
    sizes = [10, 100, 500, 1000]

    summary = []

    for size in sizes:
        time_series = np.random.rand(size)

        Y_original = shannon_entropy_original(time_series)
        Y_efficient = shannon_entropy_efficient(time_series)

        if np.allclose(Y_original, Y_efficient, atol=1e-5):
            print(f"The outputs of the two versions are equal: {size=}")
        else:
            print(f"The outputs of the two versions are not equal: {size=}")

        number = 100
        repeat = 5

        stmt = "shannon_entropy_original(time_series)"
        original_times = timeit.repeat(stmt, globals=globals(), number=number, repeat=repeat)
        original_avg_time = np.mean(original_times)

        stmt = "shannon_entropy_efficient(time_series)"
        efficient_times = timeit.repeat(stmt, globals=globals(), number=number, repeat=repeat)
        efficient_avg_time = np.mean(efficient_times)

        time_difference = original_avg_time - efficient_avg_time
        percentage_difference = (time_difference / original_avg_time) * 100

        summary.append((size, original_avg_time, efficient_avg_time, percentage_difference))

    # Print the summary
    print("Size\tOriginal Time\tEfficient Time\tPercentage Difference")
    for entry in summary:
        size, original_time, efficient_time, percentage_difference = entry
        print(f"{size}\t{original_time:.6f}\t{efficient_time:.6f}\t{percentage_difference:.2f}%")
```
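A quick note on how to read the timings later in this post (this is just standard `timeit` behaviour, nothing specific to the entropy code): `timeit.repeat` returns one total time per repeat, each covering `number` executions of the statement, so the averaged values are seconds per 100 calls rather than per single call.

```python
import timeit

# Toy illustration of timeit.repeat: five totals, each covering 100 executions
times = timeit.repeat("sum(range(1000))", number=100, repeat=5)
print(times)                    # e.g. five values, each around a millisecond or two
print(sum(times) / len(times))  # the kind of average that goes into the summary
```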
To find out if there’s a better function in terms of complexity, I put the same question to Bard and ChatGPT:
Can you write a more efficient version of the python function:

```python
import numpy as np

def shannon_entropy(time_series):
    # … here goes the original function from pyEntrp 0.8.2 …
```
Results
Bard
Bard generated three versions of the code. Interestingly, in one of the code snippets, there was a reference to pyEntropy. I will not present all of them here, but only the one with the best performance.
```python
def shannon_entropy_efficient(time_series):
    # Check if string
    if not isinstance(time_series, str):
        time_series = list(time_series)

    # Create a frequency data
    data_set = set(time_series)
    freq_list = np.array(
        [float(time_series.count(entry)) / len(time_series) for entry in data_set]
    )

    # Shannon entropy
    ent = -np.sum(freq_list * np.log2(freq_list))
    return ent
```
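As a quick spot check (my own toy data, not part of the benchmark script), both functions should agree:

```python
# Hypothetical spot check: 1 appears twice, 2 once, 3 three times
data = [1, 1, 2, 3, 3, 3]
print(shannon_entropy_original(data))   # ~1.459 bits
print(shannon_entropy_efficient(data))  # should print the same value
```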
Let’s see what ChatGPT has to offer.
ChatGPT
ChatGPT responds with a single answer (unlike Bard’s three drafts). Here is the code snippet:
```python
from collections import Counter

def shannon_entropy_efficient(time_series):
    # Check if string
    if not isinstance(time_series, str):
        time_series = list(time_series)

    # Calculate frequency counts
    counter = Counter(time_series)
    total_count = len(time_series)

    # Calculate frequencies and Shannon entropy
    ent = 0.0
    for count in counter.values():
        freq = count / total_count
        ent += freq * np.log2(freq)

    ent = -ent
    return ent
```
OK, here we see `Counter` from the standard library, which looks like a better choice than Bard’s approach of building a set and then re-counting every entry.
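For readers who haven’t used it, `Counter` builds the whole frequency table in a single pass over the data (a standard-library example on toy input, unrelated to the benchmark):

```python
from collections import Counter

counts = Counter("abracadabra")
print(counts)  # e.g. Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
```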
And to make sure it’s the best option, I asked another question:
Are there faster versions?
Of course, there are.
```python
def shannon_entropy_efficient(time_series):
    # Check if string
    if not isinstance(time_series, str):
        time_series = np.array(time_series)

    # Calculate frequency counts
    unique_vals, counts = np.unique(time_series, return_counts=True)
    total_count = len(time_series)

    # Calculate frequencies and Shannon entropy
    frequencies = counts / total_count
    ent = -np.sum(frequencies * np.log2(frequencies))

    return ent
```
In this version, the improvements include:
- Using `np.unique` with the `return_counts=True` parameter to directly obtain unique values and their corresponding counts from the time series array (see the short example after this list).
- Utilizing NumPy’s array operations to calculate frequencies and perform element-wise multiplication and summation without explicit loops.
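Here is the short example mentioned above of what `np.unique(..., return_counts=True)` returns (toy data of my own, not from the benchmark):

```python
import numpy as np

values = np.array([1, 1, 2, 3, 3, 3])
unique_vals, counts = np.unique(values, return_counts=True)
print(unique_vals)           # [1 2 3]
print(counts)                # [2 1 3]
print(counts / len(values))  # [0.33333333 0.16666667 0.5]
```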
And a sanity check:
Is this the best version?
The short answer is yes.
OK, time to test and see the results:
| Size | Original Time (s) | Bard Time (s) | ChatGPT Time (s) | ChatGPT v2 Time (s) | Bard vs Original, % diff | ChatGPT vs Original, % diff | ChatGPT v2 vs Original, % diff |
|---|---|---|---|---|---|---|---|
| 10 | 0.000839 | 0.000635 | 0.000605 | 0.001009 | 24.31% | 27.89% | -20.26% |
| 100 | 0.033200 | 0.013663 | 0.004940 | 0.001059 | 58.85% | 85.12% | 96.81% |
| 500 | 0.486557 | 0.289214 | 0.022725 | 0.001692 | 40.56% | 95.33% | 99.65% |
| 1000 | 1.882986 | 1.151954 | 0.046219 | 0.002610 | 38.82% | 97.55% | 99.86% |
Wow, both Bard and ChatGPT have come up with improved versions.
But ChatGPT’s second version beats all the others by a considerable margin. It is slightly slower on the smallest input, but that is negligible given how much faster it is on larger inputs. This also makes sense: the original implementation compares every unique value against every element of the series, the `Counter` version needs only a single pass over the data, and the `np.unique` version pushes the remaining work into vectorised NumPy operations.
The last interesting question is whether it’s possible to find this implementation on GitHub. I’ve selected parts of the code that might be common:
- First try: `unique_vals, counts = np.unique(time_series, return_counts=True)` returned no results.
- Second try: `-np.sum(frequencies * np.log2(frequencies))` returned two results. They look similar, but the function generated by ChatGPT is original and not a direct copy-paste.
As a result, ChatGPT was able to provide a better version of the function after a follow-up question. Should it always be asked a second time? Who knows. Unfortunately, Bard was not able to match ChatGPT’s results. Note that ChatGPT suggested two functions: the first one also gives correct results for string input, while the second is much more optimised but only valid for numeric vector input.
But overall, the result is great: an optimised function that still looks like original, not copy-pasted, code. And since pyEntrp is used by many libraries, the optimisation will hopefully save some energy and CO2. Now it’s time to prepare a PR and ship a new version of pyEntropy.