Intro
This is a blog post about an experiment that uses Large Language Models, Google Bard and OpenAI ChatGPT, to make the pyEntropy (pyEntrp) library a more powerful and versatile tool for computing entropy.
Problem
With the recent boom of Large Language Models and their application in various fields, I was curious to see if they could help with the maintenance of the pyEntropy (pyEntrp) library, which is built to calculate different types of entropy. The first entropy function in the library is Shannon entropy.
Shannon Entropy measures the complexity or randomness of information. It looks at how many different patterns or possibilities there are in a set of data. If the data is very predictable and has few patterns, the entropy will be low. But if the data is very unpredictable, with lots of different patterns, the entropy is high.
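Formally, for a set of symbols where symbol i appears with probability pᵢ, the Shannon entropy is H = -Σᵢ pᵢ log₂ pᵢ; with a base-2 logarithm it is measured in bits.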
And to test the results, I’ve prepared the following code, where the original version is copy-pasted from pyEntropy 0.8.2 (actually it’s also partly generated with the help of GitHub Copilot):
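For reference, here is a sketch that follows the shape of the pyEntrp 0.8.2 implementation (variable names are approximate, not a verbatim copy):

```python
import numpy as np

def shannon_entropy(time_series):
    """Return the Shannon entropy of the sample data (pyEntrp-style sketch)."""
    # Treat anything that is not a string as a plain Python list
    if not isinstance(time_series, str):
        time_series = list(time_series)

    # Collect the distinct symbols, then count each one with an inner loop:
    # this nested scan is what makes the function slow for long inputs
    data_set = list(set(time_series))
    freq_list = []
    for entry in data_set:
        counter = 0.0
        for i in time_series:
            if i == entry:
                counter += 1
        freq_list.append(counter / len(time_series))

    # Shannon entropy: -sum(p * log2(p)) over the observed frequencies
    ent = 0.0
    for freq in freq_list:
        ent += freq * np.log2(freq)
    return -ent
```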
To find out if there’s a better function in terms of complexity, I put the same question to Bard and ChatGPT:
Can you write a more efficient version of the Python function: import numpy as np
def shannon_entropy(time_series): … here goes the original function from pyEntrp 0.8.2 …
Results
Bard
Bard generated three versions of the code. Interestingly, in one of the code snippets, there was a reference to pyEntropy. I will not present all of them here, but only the one with the best performance.
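Bard’s output is not quoted verbatim here; a purely illustrative sketch of the set-and-frequency style it produced (function and variable names are assumptions) could look like this:

```python
import numpy as np

def shannon_entropy_bard(time_series):
    """Illustrative sketch of a set-and-frequency approach, not Bard's exact code."""
    if not isinstance(time_series, str):
        time_series = list(time_series)

    n = len(time_series)
    # One C-level pass per unique symbol via .count() instead of a hand-written inner loop
    frequencies = [time_series.count(value) / n for value in set(time_series)]

    # Shannon entropy over the observed frequencies
    return -sum(freq * np.log2(freq) for freq in frequencies)
```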
Let’s see what ChatGPT has to offer.
ChatGPT
ChatGPT responds with a single answer (unlike Bard’s three drafts).
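The exact snippet is not reproduced here; a `collections.Counter`-based version in the same spirit (variable names are assumptions) looks like this:

```python
import numpy as np
from collections import Counter

def shannon_entropy_counter(time_series):
    """Counter-based sketch: one pass to count symbols, then sum p * log2(p)."""
    n = len(time_series)
    counts = Counter(time_series)  # a single O(n) counting pass over the data

    # Turn counts into probabilities and accumulate the entropy
    return -sum((count / n) * np.log2(count / n) for count in counts.values())
```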
OK, here we see `Counter` from the standard library, which looks like a better choice than building a set and counting frequencies by hand, as in Bard’s version.
And to make sure it’s the best option, I asked another question:
Are there faster versions?
Of course, there are.
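The faster suggestion is built on `np.unique`; a sketch matching the improvements listed below (and the lines quoted later in the post) is essentially:

```python
import numpy as np

def shannon_entropy_numpy(time_series):
    """Vectorised sketch: np.unique does the counting, NumPy does the arithmetic."""
    time_series = np.asarray(time_series)

    # Unique symbols and how often each occurs, in a single call
    unique_vals, counts = np.unique(time_series, return_counts=True)

    # Element-wise probabilities, log2 and summation without explicit Python loops
    frequencies = counts / time_series.size
    return -np.sum(frequencies * np.log2(frequencies))
```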
In this version, the improvements include:
- Using `np.unique` with the `return_counts=True` parameter to directly obtain unique values and their corresponding counts from the time series array.
- Utilizing NumPy’s array operations to calculate frequencies and perform element-wise multiplication and summation without explicit loops.
And a sanity check:
Is this the best version?
The short answer is yes.
OK, time to test the four versions and compare the results.
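The benchmark itself is simple; a minimal sketch using `timeit` and the sketch functions above (sizes, input generation and repeat counts are assumptions, not necessarily the setup behind the numbers below):

```python
import timeit
import numpy as np

# Hypothetical harness: time each implementation on random integer sequences
sizes = [10, 100, 500, 1000]
implementations = {
    "original": shannon_entropy,
    "bard": shannon_entropy_bard,
    "chatgpt": shannon_entropy_counter,
    "chatgpt_v2": shannon_entropy_numpy,
}

for size in sizes:
    data = list(np.random.randint(0, 10, size))
    for name, func in implementations.items():
        elapsed = timeit.timeit(lambda: func(data), number=100)
        print(f"{size:>5}  {name:>10}  {elapsed:.6f}")
```

The timings came out as follows: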
| Size | Original Time | Bard Time | ChatGPT Time | ChatGPT v2 Time | Bard vs Original, % diff | ChatGPT vs Original, % diff | ChatGPT v2 vs Original, % diff |
|---|---|---|---|---|---|---|---|
| 10 | 0.000839 | 0.000635 | 0.000605 | 0.001009 | 24.31% | 27.89% | -20.26% |
| 100 | 0.033200 | 0.013663 | 0.004940 | 0.001059 | 58.85% | 85.12% | 96.81% |
| 500 | 0.486557 | 0.289214 | 0.022725 | 0.001692 | 40.56% | 95.33% | 99.65% |
| 1000 | 1.882986 | 1.151954 | 0.046219 | 0.002610 | 38.82% | 97.55% | 99.86% |
Wow, both Bard and ChatGPT have come up with improved versions. But ChatGPT’s latest version beats all the others by a considerable margin. There is a slight slowdown on the smallest input, but it is negligible given how much better it does on larger inputs.
The last interesting question is whether it’s possible to find this implementation on GitHub. I’ve selected parts of the code that might be common:
- First try: `unique_vals, counts = np.unique(time_series, return_counts=True)` – no results.
- Second try: `-np.sum(frequencies * np.log2(frequencies))` – two results. They look similar, but the function generated by ChatGPT is original and not a direct copy-paste.
As a result, ChatGPT was able to provide better versions of the function, with the best one coming only after a follow-up question. Should that follow-up be asked every time? Who knows. Unfortunately, Bard was not able to match ChatGPT’s results. Note that ChatGPT effectively suggested two functions: the first gives correct results for string input, while the second is much more optimised but only valid for numeric vector input.
But overall, the result is great: an optimised function that appears to be original rather than copied from somewhere else. And since pyEntrp is used in many libraries, the optimisation will hopefully save some energy and CO2. Now it’s time to prepare a PR and ship a new version of pyEntropy.