Based on a few months of use.

I've been evaluating GitHub Copilot since August 2022. Perhaps it's time to collect my thoughts so far.

In short, it's surprisingly good, but also gets a lot of things wrong. It does seem helpful to the experienced programmer, but I don't see it replacing all programmers yet.

Not only for boilerplate code #

I was initially doubtful. I'd seen some demos where Copilot created fifteen to twenty lines of code to, say, make a REST API request. These examples mostly struck me as auto-generation of something that ought to be a proper abstraction: A method in a reusable library.

In general I don't think that typing is a productivity bottleneck, and I'm sceptical of productivity tools, and particularly code generators. The more code a code base contains, the more code there is to read. Accelerating code production doesn't strike me as a goal in itself.

On the other hand, I'm past fifty and aware that at my age, one often tends to resist new ideas. Perhaps the above justifications are nothing but motivated reasoning. I decided to install the plug-ins and go about my business to see what would happen.

It has surprised and delighted me more than once. It's not only for boilerplate code. Here's a recent example.

I was doing an exercise in F# and wrote this type definition:

type Instruction = Enqueue of int | Dequeue

I wanted to write a property-based test with this type, and since I was using Hedgehog I needed a generator for the type. While FsCheck can automatically generate values of custom discriminated unions, Hedgehog has a more explicit-is-better-than-implicit design philosophy. In other words, if there's a way to make Hedgehog automatically generate values of a type like the above Instruction, it has so far escaped me.
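To illustrate the difference, here's a minimal sketch of FsCheck's reflection-based generation, assuming the FsCheck.Xunit glue package is available. The property is deliberately vacuous; the point is only that FsCheck derives Instruction values without an explicit generator:

open FsCheck.Xunit

[<Property>]
let ``FsCheck generates Instruction values via reflection`` (instruction : Instruction) =
    // FsCheck derives a generator for the discriminated union by
    // reflection, so the test receives ready-made Instruction values.
    match instruction with
    | Enqueue _ | Dequeue -> true

Hedgehog, on the other hand, expects an explicit generator.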

So I started typing let private genInstruction = after which Copilot suggested the following:

let private genInstruction =
    Gen.frequency
        [ 1, Gen.constant Dequeue
          1, Gen.int32 (Range.constant 0 10_000) |> Gen.map Enqueue ]

I was, frankly, stunned. While I do understand that Copilot doesn't actually understand the code it suggests, Hedgehog is sufficiently esoteric that I didn't expect Copilot to have enough training data to enable it to make a useful contribution in this niche. I was wrong. I've repeatedly seen Copilot make suggestions to my F# and Haskell code. It's not just for C#, JavaScript, or Python code.

The above suggestion was, to be clear, absolutely appropriate and compiled right away. The only detail I changed was the Range, which I changed to Range.linear. That's not, however, a significant change.
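For reference, with that adjustment the generator looks like this:

let private genInstruction =
    Gen.frequency
        [ 1, Gen.constant Dequeue
          1, Gen.int32 (Range.linear 0 10_000) |> Gen.map Enqueue ]

If I understand Hedgehog's Range module correctly, Range.constant ignores the size parameter that Hedgehog uses to scale its generators, while Range.linear grows the bounds with it; for this generator, either behaviour is fine.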

Perhaps you're not impressed by three lines of auto-generated code. How much of a productivity improvement is that? Quite a bit, in my case.

It wouldn't have taken me long to type those three lines of code, but as I already mentioned, typing isn't a bottleneck. On the other hand, looking up an unfamiliar API can take some time. The Programmer's Brain discusses this kind of problem and suggests exercises to address it. Does Copilot offer a shortcut?

While I couldn't remember the details of Hedgehog's API, once I saw the suggestion, I recognised Gen.frequency, so I understood it as an appropriate code suggestion. The productivity gain, if there is one, may come from saving you the effort of looking up unfamiliar APIs, rather than saving you some keystrokes.

In this example, I already knew of the Gen.frequency function - I just couldn't recall the exact name and type. This enabled me to evaluate Copilot's suggestion and deem it correct. If I hadn't known that API already, how could I have known whether to trust Copilot?

Detectably wrong suggestions #

As amazing as Copilot can be, it's hardly faultless. It makes many erroneous suggestions. Sometimes the suggestion is obviously wrong: if you accept it, it doesn't compile. Sometimes the suggestion is only a small edit away from being correct, but at least in such situations you'll be explicitly aware that it couldn't be used verbatim.

Other suggestions are wrong, but less conspicuously so. Here's an example.

I was recently subjecting the code base that accompanies Code That Fits in Your Head to the mutation testing tool Stryker. Since it did point out a few possible mutations, I decided to add a few tests. One was of a wrapper class called TimeOfDay. Because of static code analysis rules, it came with conversions to and from TimeSpan, but these methods weren't covered by any tests.

In order to remedy that situation, I started writing an FsCheck property and came as far as:

[Property]
public void ConversionsRoundTrip(TimeSpan timeSpan)

At that point Copilot suggested the following, which I accepted:

[Property]
public void ConversionsRoundTrip(TimeSpan timeSpan)
{
    var timeOfDay = new TimeOfDay(timeSpan);
    var actual = (TimeSpan)timeOfDay;
    Assert.Equal(timeSpan, actual);
}

Looks good, doesn't it? Again, I was impressed. It compiled, and it even looks as though Copilot had picked up one of my naming conventions: naming variables by role, in this case actual.

While I tend to be on guard, I immediately ran the test suite instead of thinking it through. It failed. Keep in mind that this is a characterisation test, so it was supposed to pass.

The TimeOfDay constructor reveals why:

public TimeOfDay(TimeSpan durationSinceMidnight)
{
    if (durationSinceMidnight < TimeSpan.Zero ||
        TimeSpan.FromHours(24) < durationSinceMidnight)
        throw new ArgumentOutOfRangeException(
            nameof(durationSinceMidnight),
            "Please supply a TimeSpan between 0 and 24 hours.");
 
    this.durationSinceMidnight = durationSinceMidnight;
}

While FsCheck knows how to generate TimeSpan values, it'll generate arbitrary durations, including negative values and spans much longer than 24 hours. That explains why the test fails.

Granted, this is hardly a searing indictment against Copilot. After all, I could have made this mistake myself.

Still, that prompted me to look for more issues with the code that Copilot had suggested. Another problem with the code is that it tests the wrong API. The suggested test tries to round-trip via the TimeOfDay class' explicit cast operators, which were already covered by tests. Well, I might eventually have discovered that, too. Keep in mind that I was adding this test to improve the code base's Stryker score. After running the tool again, I would probably eventually have discovered that the score didn't improve. It takes Stryker around 25 minutes to test this code base, though, so it wouldn't have been rapid feedback.

Since I examined the code with a critical eye, however, I noticed this myself. Fixing it would clearly require changing the test code as well.

In the end, I wrote this test:

[Property]
public void ConversionsRoundTrip(TimeSpan timeSpan)
{
    var expected = ScaleToTimeOfDay(timeSpan);
    var sut = TimeOfDay.ToTimeOfDay(expected);
 
    var actual = TimeOfDay.ToTimeSpan(sut);
 
    Assert.Equal(expected, actual);
}
 
private static TimeSpan ScaleToTimeOfDay(TimeSpan timeSpan)
{
    // Convert an arbitrary TimeSpan to a 24-hour TimeSpan.
    // The data structure that underlies TimeSpan is a 64-bit integer,
    // so first we need to identify the range of possible TimeSpan
    // values. It might be easier to understand to calculate
    // TimeSpan.MaxValue - TimeSpan.MinValue, but that overflows.
    // Instead, the number of possible 64-bit integer values is the same
    // as the number of possible unsigned 64-bit integer values.
    var range = ulong.MaxValue;
    var domain = TimeSpan.FromHours(24).Ticks;
    // Use floating-point division; integer division of domain by range
    // would truncate the scale to zero. Scaling alone maps the full
    // 64-bit range to [-12, 12] hours, so also shift the result by
    // 12 hours to land in the valid [0, 24] hour range.
    var scale = (double)domain / range;
    var expected = timeSpan * scale + TimeSpan.FromHours(12);
    return expected;
}

In this case, Copilot didn't improve my productivity. It may actually have slowed me down a bit.

This time, it wasn't too difficult to spot issues with the suggested code. What if the problems are more subtle?

Errors that are difficult to detect #

How do bugs appear? We write them, thinking that our code is going to do one thing, while the compiler decides otherwise. Even when we actively engage with the code, we may be looking at a bug and still fail to see it.

Why should we trust Copilot to produce bug-free code?

Here's another example. I was going through the Range kata in F# and made my Endpoint discriminated union a functor:

type Endpoint<'a> = Open of 'a | Closed of 'a
 
module Endpoint =
    let map f = function
        | Open   x -> Open   (f x)
        | Closed x -> Closed (f x)

Making a type a functor is such a well-understood process that Haskell can automate it. I knew how to write the above map function as well, so I wrote it without testing it.

After I'd written the map function, I decided to add tests of the two functor laws. Not because I doubted the implementation, but because I might need that as example code later.

Copilot suggested this test after I named it:

open Xunit
open Hedgehog
 
[<Fact>]
let ``First functor law`` () = Property.check <| property {
    let! x = Gen.int32 (Range.linearBounded ())
    let f = (fun x -> x + 1)
    let g = (fun x -> x * 2)
 
    let actual = Endpoint.map f (Endpoint.map g (Open x))
    let expected = Endpoint.map (f >> g) (Open x)
 
    Assert.Equal(expected, actual) }

It compiles. It also fails.

What's wrong with it?

If you know the functor laws by heart, you'll be able to spot that this is clearly not the first functor law. On the other hand, it looks convincingly like the second functor law. Should I just change the name and move on?

I can't, though, since the test fails. Could there be a bug in my map function, after all?

No, there's an error in the test. I invite you to spot it.
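For reference, here are the two functor laws, stated informally for the Endpoint.map function (where id is the built-in identity function):

// First functor law (identity):
//     Endpoint.map id = id
// Second functor law (composition):
//     Endpoint.map (f << g) = Endpoint.map f << Endpoint.map g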

In terms of keystrokes, it's easy to fix the problem:

open Xunit
open Hedgehog
 
[<Fact>]
let ``First functor law`` () = Property.check <| property {
    let! x = Gen.int32 (Range.linearBounded ())
    let f = (fun x -> x + 1)
    let g = (fun x -> x * 2)
 
    let actual = Endpoint.map f (Endpoint.map g (Open x))
    let expected = Endpoint.map (f << g) (Open x)
 
    Assert.Equal(expected, actual) }

Spot the edit. I bet it'll take you longer to find it than it took me to type it.

The test now passes, but for one who has spent less time worrying over functor laws than I have, troubleshooting this could have taken a long time.
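If you don't juggle F#'s composition operators every day, the difference is easy to overlook:

// f >> g composes left to right:  fun x -> g (f x)
// f << g composes right to left:  fun x -> f (g x)

The test's actual value applies g before f, so the expected value must compose the functions with <<, not >>.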

These almost-right suggestions from Copilot both worry me and give me hope.

Copilot for experienced programmers #

When a new technology like Copilot appears, it's natural to speculate on the consequences. Does this mean that programmers will lose their jobs?

This is just a preliminary evaluation after a few months, so I could be wrong, but I think we programmers are safe. If you're experienced, you'll be able to tell most of Copilot's hits from its misses. Perhaps you'll get a productivity improvement out of it, but it could also slow you down.

The tool is likely to improve over time, so I'm hopeful that this could become a net productivity gain. Still, with this high an error rate, I'm not too worried yet.

The Pragmatic Programmer describes a programming style named Programming by Coincidence. People who develop software this way have only a partial understanding of the code they write.

"Fred doesn't know why the code is failing because he didn't know why it worked in the first place."

I've encountered my fair share of these people. When editing code, they make small adjustments and do cursory manual testing until 'it looks like it works'. If they have to start a new feature or are otherwise faced with a metaphorical blank page, they'll copy some code from somewhere else and use that as a starting point.

You'd think that Copilot could enhance the productivity of such people, but I'm not sure. It might actually slow them down. These people don't fully understand the code they themselves 'write', so why should we expect them to understand the code that Copilot suggests?

If faced with a Copilot suggestion that 'almost works', will they be able to spot if it's a genuinely good suggestion, or whether it's off, like I've described above? If the Copilot code doesn't work, how much time will they waste thrashing?

Conclusion #

GitHub Copilot has the potential to be a revolutionary technology, but it's not, yet. So far, I'm not too worried. It's an assistant, like a pairing partner, but it's up to you to evaluate whether the code that Copilot suggests is useful, correct, and safe. How can you do that unless you already know what you're doing?

If you don't have the qualifications to evaluate the suggested code, I fail to see how it's going to help you. Granted, it does have the potential to help you move on in less time than you would otherwise have spent. In this article, I showed one example where I would have had to spend significant time looking up API documentation. Instead, Copilot suggested the correct code to use.

Pulling in the other direction are the many false positives. Copilot makes many suggestions, and many of them are poor. The ones that are recognisably bad are unlikely to slow you down. I'm more concerned with those that are subtly wrong. They have the potential to waste much time.

Which of these forces is stronger? The potential for wasting time is infinite, while the maximum productivity gain you can achieve is 100 percent. That's an asymmetric distribution. There's a long tail of time wasters, but there's no equivalent long tail of improvement.

I'm not, however, trying to be pessimistic. I expect to keep Copilot around for the time being. It could very well be here to stay. Used correctly, it seems useful.

Is it going to replace programmers? Hardly. Rather, it may enable poor developers to make such a mess of things that you need even more good programmers to subsequently fix things.




Published: Monday, 05 December 2022 08:37:00 UTC