Nikolay Donets

Making Code Efficient with LLMs. Part 1

5 min read

Intro

This is a blog post about an experiment: using Large Language Models, Google Bard and OpenAI ChatGPT, to make the pyEntropy (pyEntrp) library a more powerful and versatile tool for computing entropy.

Problem

With the recent boom of Large Language Models and their application in various fields, I was curious to see whether they could help with the maintenance of the pyEntropy (pyEntrp) library, which is built to calculate different types of entropy. The first entropy implemented there is Shannon entropy.

Shannon Entropy measures the complexity or randomness of information. It looks at how many different patterns or possibilities there are in a set of data. If the data is very predictable and has few patterns, the entropy will be low. But if the data is very unpredictable, with lots of different patterns, the entropy is high.
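
For intuition, here is a tiny, self-contained example (my own illustration, not from pyEntrp; the helper name is mine) of the underlying formula, H = -sum(p * log2(p)), applied to a fair and a biased coin:

import numpy as np

def entropy_from_probabilities(probs):
    # Shannon entropy in bits: H = -sum(p * log2(p)) over the outcome probabilities
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log2(probs))

print(entropy_from_probabilities([0.5, 0.5]))  # fair coin: 1.0 bit, maximally unpredictable
print(entropy_from_probabilities([0.9, 0.1]))  # biased coin: ~0.47 bits, much more predictable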

To test the results, I prepared the following code, where the original version is copy-pasted from pyEntropy 0.8.2 (and was itself partly generated with the help of GitHub Copilot):

import numpy as np
import timeit


def shannon_entropy_original(time_series):
    """
    Return the Shannon Entropy of the sample data.

    Args:
        time_series: Vector or string of the sample data

    Returns:
        The Shannon Entropy as float value
    """
    # Check if string
    if not isinstance(time_series, str):
        time_series = list(time_series)

    # Create a frequency data
    data_set = list(set(time_series))
    freq_list = []
    for entry in data_set:
        counter = 0.0
        for i in time_series:
            if i == entry:
                counter += 1
        freq_list.append(float(counter) / len(time_series))

    # Shannon entropy
    ent = 0.0
    for freq in freq_list:
        ent += freq * np.log2(freq)
    ent = -ent
    return ent


def shannon_entropy_efficient(time_series):
    ...


if __name__ == "__main__":
    np.random.seed(42)

    # Define a list of input sizes to test
    sizes = [10, 100, 500, 1000]
    summary = []

    for size in sizes:
        time_series = np.random.rand(size)

        # Check that both versions produce the same result
        Y_original = shannon_entropy_original(time_series)
        Y_efficient = shannon_entropy_efficient(time_series)
        if np.allclose(Y_original, Y_efficient, atol=1e-5):
            print(f"The outputs of the two versions are equal: {size=}")
        else:
            print(f"The outputs of the two versions are not equal: {size=}")

        # Time both versions
        number = 100
        repeat = 5

        stmt = "shannon_entropy_original(time_series)"
        original_times = timeit.repeat(stmt, globals=globals(), number=number, repeat=repeat)
        original_avg_time = np.mean(original_times)

        stmt = "shannon_entropy_efficient(time_series)"
        efficient_times = timeit.repeat(stmt, globals=globals(), number=number, repeat=repeat)
        efficient_avg_time = np.mean(efficient_times)

        time_difference = original_avg_time - efficient_avg_time
        percentage_difference = (time_difference / original_avg_time) * 100
        summary.append((size, original_avg_time, efficient_avg_time, percentage_difference))

    # Print the summary
    print("Size\tOriginal Time\tEfficient Time\tPercentage Difference")
    for entry in summary:
        size, original_time, efficient_time, percentage_difference = entry
        print(f"{size}\t{original_time:.6f}\t{efficient_time:.6f}\t{percentage_difference:.2f}%")

To find out if there’s a better function in terms of complexity, I put the same question to Bard and ChatGPT:

Can you write more efficient version of the python function: import numpy as np

def shannon_entropy(time_series): … here goes the original function from pyEntrp 0.8.2 …

Results

Bard

Bard generated three versions of the code. Interestingly, one of the snippets contained a reference to pyEntropy. I will not present all of them here; only the one with the best performance:

def shannon_entropy_efficient(time_series):
    # Check if string
    if not isinstance(time_series, str):
        time_series = list(time_series)

    # Create a frequency data
    data_set = set(time_series)
    freq_list = np.array([float(time_series.count(entry)) / len(time_series) for entry in data_set])

    # Shannon entropy
    ent = -np.sum(freq_list * np.log2(freq_list))
    return ent

Let’s see what ChatGPT has to offer.

ChatGPT

ChatGPT responds with a single answer (unlike Bard with its three drafts). Here is the code snippet:

from collections import Counter

def shannon_entropy_efficient(time_series):
    # Check if string
    if not isinstance(time_series, str):
        time_series = list(time_series)

    # Calculate frequency counts
    counter = Counter(time_series)
    total_count = len(time_series)

    # Calculate frequencies and Shannon entropy
    ent = 0.0
    for count in counter.values():
        freq = count / total_count
        ent += freq * np.log2(freq)
    ent = -ent
    return ent

OK, here we see Counter from the standard library, which looks like a better choice than building a set and counting frequencies with list.count, as in Bard’s version.
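
As a quick aside (my own example, not part of the ChatGPT answer), Counter builds the whole frequency table in a single pass over the data:

from collections import Counter

counts = Counter("abracadabra")
print(counts)                            # Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
print(counts["a"] / len("abracadabra"))  # relative frequency of 'a': 0.4545...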

And to make sure it’s the best option, I asked another question:

Are there faster versions?

Of course, there are.

def shannon_entropy_efficient(time_series):
    # Check if string
    if not isinstance(time_series, str):
        time_series = np.array(time_series)

    # Calculate frequency counts
    unique_vals, counts = np.unique(time_series, return_counts=True)
    total_count = len(time_series)

    # Calculate frequencies and Shannon entropy
    frequencies = counts / total_count
    ent = -np.sum(frequencies * np.log2(frequencies))
    return ent

In this version, the improvements include:

  1. Using np.unique with the return_counts=True parameter to directly obtain unique values and their corresponding counts from the time series array.
  2. Utilizing NumPy’s array operations to calculate frequencies and perform element-wise multiplication and summation without explicit loops.
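
To make point 1 concrete, here is a minimal sketch (my own illustration) of what np.unique with return_counts=True returns:

import numpy as np

data = np.array([1, 1, 2, 3, 3, 3])
unique_vals, counts = np.unique(data, return_counts=True)
print(unique_vals)         # [1 2 3]
print(counts)              # [2 1 3]
print(counts / len(data))  # relative frequencies: [0.333..., 0.167..., 0.5]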

And a sanity check:

Is this the best version?

The short answer is yes.

OK, time to test and see the results:

Size | Original Time (s) | Bard Time (s) | ChatGPT Time (s) | ChatGPT v2 Time (s) | Bard vs Original, % diff | ChatGPT vs Original, % diff | ChatGPT v2 vs Original, % diff
10   | 0.000839 | 0.000635 | 0.000605 | 0.001009 | 24.31% | 27.89% | -20.26%
100  | 0.033200 | 0.013663 | 0.004940 | 0.001059 | 58.85% | 85.12% | 96.81%
500  | 0.486557 | 0.289214 | 0.022725 | 0.001692 | 40.56% | 95.33% | 99.65%
1000 | 1.882986 | 1.151954 | 0.046219 | 0.002610 | 38.82% | 97.55% | 99.86%

Wow, both Bard and ChatGPT came up with improved versions.

But ChatGPT’s second version beats all the others by a considerable margin. It is slightly slower for the smallest input, but that is negligible given how much faster it is on larger inputs. The numbers also make sense: the original and Bard’s versions rescan the whole series once per unique value, while ChatGPT’s versions count each element only once, and the np.unique-based version does the remaining arithmetic in vectorised NumPy code.

The last interesting question is whether it’s possible to find this implementation on GitHub. I’ve selected parts of the code that might be common:

  • First try: unique_vals, counts = np.unique(time_series, return_counts=True). No results.
  • Second try: -np.sum(frequencies * np.log2(frequencies)). Two results. They look similar, but the function generated by ChatGPT is original and not a direct copy-paste.

As a result, ChatGPT was able to provide a better version of the function only after a follow-up question. Should such a follow-up be asked every time? Who knows. Bard, unfortunately, could not match ChatGPT’s results. Note that ChatGPT suggested two functions: the first gives correct results for string input, while the second is much more optimised and suited to numeric vector input.
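
If one wanted a single entry point that keeps both behaviours, a small dispatch wrapper could look like the sketch below. This is my own illustration; the wrapper name and structure are assumptions, not part of the generated answers or of pyEntrp:

import numpy as np
from collections import Counter

def shannon_entropy_combined(time_series):
    # Hypothetical wrapper (illustration only): Counter-based path for strings,
    # vectorised np.unique path for numeric vectors.
    if isinstance(time_series, str):
        counts = np.array(list(Counter(time_series).values()))
    else:
        _, counts = np.unique(np.asarray(time_series), return_counts=True)
    frequencies = counts / len(time_series)
    return -np.sum(frequencies * np.log2(frequencies))

print(shannon_entropy_combined("abracadabra"))
print(shannon_entropy_combined([0.1, 0.2, 0.2, 0.3]))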

But overall, the result is great: an optimised function that still reads like the original one. And since pyEntrp is used in many libraries, the optimisation will hopefully save some energy and CO2. Now it’s time to prepare a PR and ship a new version of pyEntropy.

References