The story of a bug, and how I addressed it.

Okay, I admit it: I could have given this article all sorts of alternative titles, each of which would have made as much sense as the one I chose. I didn't want to go with some of the other titles I had in mind, because they would give it all away up front. I didn't want to spoil the surprise.

I recently ran into this bug, it took me hours to troubleshoot it, and I was appalled when I realised what the problem was.

This is the story of that bug.

There are several insights from this story, and I admit that I picked the most click-baity one for the title.

Yak shaving #

I was working on the umpteenth variation of an online restaurant reservations system, and one of the features I'd added was a schedule only visible to the maƮtre d'. The schedule includes a list of all reservations for a day, including guests' email addresses and so on. For that reason, I'd protected that resource by requiring a valid JSON Web Token (JWT) with an appropriate role.

I'd deployed a new version of the API and went for an ad-hoc test. To my surprise, that particular resource didn't work. When I tried to request it, I got a 403 Forbidden response.

"That's odd," I though, "it worked the last time I tried this."

The system is set up with continuous deployment. I push master to a remote repository, and a build pipeline takes over from there. It only deploys the new version if all tests pass, so my first reaction was that I might have made a mistake with the JWT.

I wasted significant time decoding the JWT and comparing its contents to what it was supposed to be. I couldn't find any problems.

I also meticulously compared the encryption key I'd used to sign the JWT with the key on the server. They were identical.

Incredulous, and running out of ideas, I tried running all tests on my development machine. Indeed, all 170 tests passed.

Finally, I gave up and ran the API on my development machine. It takes all of a 30 seconds to configure the code to run in that environment, so you're excused if you wonder why I didn't already do that. What can I say? I've almost two decades of experience with automated test suites. Usually, if all tests pass, the problem is environmental: a network topology issue, a bad or missing connection string, a misconfigured encryption key, an invalid JWT, things like that.

To my surprise, the problem also manifested on my machine.

Not my code #

Okay, even with hundreds of tests, perhaps some edge case went unnoticed. The only problem with that hypothesis was that this was hardly an edge case. I was only making a GET request with a Bearer token. I wasn't going through some convoluted sequence of steps.

GET /restaurants/1/schedule/2020/9/30 HTTP/1.1
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5c[...]

I expected a successful response containing some JSON, but the result was 403 Forbidden. That was the same behaviour I saw in the production environment.

Now, to be clear, this is indeed a protected resource. If you present an invalid JWT, 403 Forbidden is the expected response. That's why I wasted a few hours looking for problems with the the JWT.

I finally, hesitatingly, concluded that the problem might be somewhere else. The JWT looked okay. So, hours into my troubleshooting I reluctantly set a breakpoint in my code and started the debugger. It isn't rational, but I tend to see it as a small personal defeat if I have to use the debugger. Even so, if used judiciously, it can be an effective way to identify problems.

The debugger never hit my breakpoint.

To be clear, the beginning of my Controller method looked like this:

[Authorize(Roles = "MaitreD")]
[HttpGet("restaurants/{restaurantId}/schedule/{year}/{month}/{day}")]
public async Task<ActionResult> Get(
    int restaurantId,
    int year,
    int month,
    int day)
{
    if (!AccessControlList.Authorize(restaurantId))
        return new ForbidResult();

My breakpoint was on the first line (the if conditional), but the debugger didn't break into it. When I issued my GET request, I immediately got the 403 Forbidden response. The breakpoint just sat there in the debugger, mocking me.

When that happens, it's natural to conclude that the problem occurs somewhere in the framework; in this case, ASP.NET. To test that hypothesis, I commented out the [Authorize] attribute and reissued the GET request. My hypothesis was that I'd get a 200 OK response, since the attribute is what tells ASP.NET to check authorisation.

The hypothesis held. The response was 200 OK.

Test interdependency #

I hate when that happens. It's up there with fixing other people's printers. The problem is in the framework, not in my code. I didn't have any authorisation callbacks registered, so I was fairly certain that the problem wasn't in my code.

I rarely jump to the conclusion that there's a bug in the framework. In my experience, select is rarely broken. My new hypothesis had to be that I'd somehow managed to misconfigure the framework.

But where? There were automated tests that verified that a client could request that resource with a valid JWT. There were other automated tests that verified what happened if you presented an invalid JWT, or none at all. And all tests were passing.

While I was fiddling with the tests, I eventually ran a parametrised test by itself, instead of the entire test suite:

[Theory]
[InlineData(         "Hipgnosta", 2024, 11,  2)]
[InlineData(              "Nono", 2018,  9,  9)]
[InlineData("The Vatican Cellar", 2021, 10, 10)]
public async Task GetRestaurantScheduleWhileAuthorized(
    string name,
    int year,
    int month,
    int day)

This parametrised test has three test cases. When I ran just that test method, two of the test cases passed, but one failed: the Nono case, for some reason I haven't yet figured out.

I didn't understand why that test case ought to fail while the others succeeded, but I had an inkling. I commented out the Nono test case and ran the test method again.

One passed and one failing test.

Now The Vatican Cellar test case was failing. I commented that out and ran the test method again. The remaining test case failed.

This reeks of some sort of test interdependency. Apparently, something happens during the first test run that makes succeeding tests pass, but it happens too late for the first one. But what?

Bisect #

Then something occurred to me that I should have thought of sooner. This feature used to work. Not only had the tests been passing, but I'd actually interacted with the deployed service, presenting a valid JWT and received a 200 OK response.

Once than dawned on me, I realised that it was just a manner of performing a binary search. Since, of course, I use version control, I had a version I knew worked, and a version that didn't work. The task, then, was to find the commit that introduced the defect.

As I've already implied, the system in question is an example code base. While I have a cloud-based production environment, none but I use it. It had been four or five days since I'd actually interacted with the real service, and I'd been busy making changes, trusting exclusively in my test suite. I tend to make frequent, small commits instead of big, infrequent commits, so I had accumulated about a hundred and fifty commits since the 'last known good' deployment.

Searching through hundreds of commits sounds overwhelming, but using binary search, it's actually not that bad. Pick the commit halfway between the 'last known good' commit and the most recent commit, and check it out. See if the defect is present there. If it is, you know that it was introduced somewhere between the commit you're looking at, and the 'last known good' commit. If it isn't present, it was introduced later. Regardless of the outcome, you know in which half to look. You now pick a new commit in the middle of that set and repeat the exercise. Even with, say, a hundred commits, the first bisection reduces the candidate set to 50, the next bisection to 25, then 13, then 7, 4, 2, and then you have it. If you do this systematically, you should find the exact commit in less than eight iterations.

This is, as far as I understand it, the algorithm used by Git bisect. You don't have to use the bisect command - the algorithm is easy enough to do by hand - but let's see how it works.

You start a bisect session with:

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((93c6c35...))
$ git bisect start

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((93c6c35...)|BISECTING)

This starts an interactive session, which you can tell from the Git integration in Git Bash (it says BISECTING). You now mark a commit as being bad:

$ git bisect bad

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((93c6c35...)|BISECTING)

If you don't provide a commit ID at that point, Git is going to assume that you meant that the current commit (in this case 93c6c35) is bad. That's what I had in mind, so that's fine.

You now tell it about a commit ID that you know is good:

$ git bisect good 7dfdab2
Bisecting: 75 revisions left to test after this (roughly 6 steps)
[1f78c9a90c2088423ab4fc145b7b2ec3859d6a9a] Use InMemoryRestaurantDatabase in a test

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((1f78c9a...)|BISECTING)

Notice that Git is already telling us how many iterations we should expect. You can also see that it checked out a new commit (1f78c9a) for you. That's the half-way commit.

At this point, I manually ran the test method with the three test cases. All three passed, so I marked that commit as good:

$ git bisect good
Bisecting: 37 revisions left to test after this (roughly 5 steps)
[5abf65a72628efabbf05fccd1b79340bac4490bc] Delete Either API

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((5abf65a...)|BISECTING)

Again, Git estimates how many more steps are left and checks out a new commit (5abf65a).

I repeated the process for each step, marking the commit as either good or bad, depending on whether or not the test passed:

$ git bisect bad
Bisecting: 18 revisions left to test after this (roughly 4 steps)
[fc48292b0d654f4f20522710c14d7726e6eefa70] Delete redundant Test Data Builders

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((fc48292...)|BISECTING)
$ git bisect good
Bisecting: 9 revisions left to test after this (roughly 3 steps)
[b0cb1f5c1e9e40b1dabe035c41bfb4babfbe4585] Extract WillAcceptUpdate helper method

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((b0cb1f5...)|BISECTING)
$ git bisect good
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[d160c57288455377f8b0ad05985b029146228445] Extract ConfigureClock helper method

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((d160c57...)|BISECTING)
$ git bisect good
Bisecting: 2 revisions left to test after this (roughly 1 step)
[4cb73c219565d8377aa67d79024d6836f9000935] Compact code

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((4cb73c2...)|BISECTING)
$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[34238c7d2606e9007b96b54b43e678589723520c] Extract CreateTokenValidationParameters method

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((34238c7...)|BISECTING)
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[7d6583a97ff45fbd85878cecb5af11d93213a25d] Move Configure method up

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((7d6583a...)|BISECTING)
$ git bisect good
34238c7d2606e9007b96b54b43e678589723520c is the first bad commit
commit 34238c7d2606e9007b96b54b43e678589723520c
Author: Mark Seemann <mark@example.com>
Date:   Wed Sep 16 07:15:12 2020 +0200

    Extract CreateTokenValidationParameters method

 Restaurant.RestApi/Startup.cs | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((7d6583a...)|BISECTING)

Notice that the last step finds the culprit. It tells you which commit is the bad one, but surprisingly doesn't check it out for you. You can do that using a label Git has created for you:

$ git checkout refs/bisect/bad
Previous HEAD position was 7d6583a Move Configure method up
HEAD is now at 34238c7 Extract CreateTokenValidationParameters method

mark@Vindemiatrix MINGW64 ~/Documents/Redacted/Restaurant ((34238c7...)|BISECTING)

You've now found and checked out the offending commit. Hopefully, the changes in that commit should give you a clue about the problem.

The culprit #

What's in that commit? Take a gander:

$ git show
commit 34238c7d2606e9007b96b54b43e678589723520c (HEAD, refs/bisect/bad)
Author: Mark Seemann <mark@example.com>
Date:   Wed Sep 16 07:15:12 2020 +0200

    Extract CreateTokenValidationParameters method

diff --git a/Restaurant.RestApi/Startup.cs b/Restaurant.RestApi/Startup.cs
index 6c161b5..bcde861 100644
--- a/Restaurant.RestApi/Startup.cs
+++ b/Restaurant.RestApi/Startup.cs
@@ -79,10 +79,6 @@ namespace Ploeh.Samples.Restaurants.RestApi

         private void ConfigureAuthorization(IServiceCollection services)
         {
-            JwtSecurityTokenHandler.DefaultMapInboundClaims = false;
-
-            var secret = Configuration["JwtIssuerSigningKey"];
-
             services.AddAuthentication(opts =>
             {
                 opts.DefaultAuthenticateScheme =
@@ -91,15 +87,8 @@ namespace Ploeh.Samples.Restaurants.RestApi
                     JwtBearerDefaults.AuthenticationScheme;
             }).AddJwtBearer(opts =>
             {
-                opts.TokenValidationParameters = new TokenValidationParameters
-                {
-                    ValidateIssuerSigningKey = true,
-                    IssuerSigningKey = new SymmetricSecurityKey(
-                        Encoding.ASCII.GetBytes(secret)),
-                    ValidateIssuer = false,
-                    ValidateAudience = false,
-                    RoleClaimType = "role"
-                };
+                opts.TokenValidationParameters =
+                    CreateTokenValidationParameters();
                 opts.RequireHttpsMetadata = false;
             });

@@ -108,6 +97,23 @@ namespace Ploeh.Samples.Restaurants.RestApi
                 sp.GetService<IHttpContextAccessor>().HttpContext.User));
         }

+        private TokenValidationParameters CreateTokenValidationParameters()
+        {
+            JwtSecurityTokenHandler.DefaultMapInboundClaims = false;
+
+            var secret = Configuration["JwtIssuerSigningKey"];
+
+            return new TokenValidationParameters
+            {
+                ValidateIssuerSigningKey = true,
+                IssuerSigningKey = new SymmetricSecurityKey(
+                    Encoding.ASCII.GetBytes(secret)),
+                ValidateIssuer = false,
+                ValidateAudience = false,
+                RoleClaimType = "role"
+            };
+        }
+
         private void ConfigureRepository(IServiceCollection services)
         {

Can you spot the problem?

As soon as I saw that diff, the problem almost jumped out at me. I'm so happy that I make small commits. What you see here is, I promise, the entire diff of that commit.

Clearly, you're not familiar with this code base, but even so, you might be able to intuit the problem from the above diff. You don't need domain-specific knowledge for it, or knowledge of the rest of the code base.

I'll give you a hint: I moved a line of code into a lambda expression.

Deferred execution #

In the above commit, I extracted a helper method that I called CreateTokenValidationParameters. As the name communicates, it creates JWT validation parameters. In other words, it supplies the configuration values that determine how ASP.NET figures out how to authorise an HTTP request.

Among other things, it sets the RoleClaimType property to "role". By default, the [Authorize] attribute is looking for a role claim with a rather complicated identifier. Setting the RoleClaimType enables you to change the identifier. By setting it to "role", I'm telling the ASP.NET framework that it should look for claims named role in the JWT, and these should be translated to role claims.

Except that it doesn't work.

Ostensibly, this is for backwards compatibility reasons. You have to set JwtSecurityTokenHandler.DefaultMapInboundClaims to false. Notice that this is a global variable.

I thought that the line of code that does that conceptually belongs together with the other code that configures the JWT validation parameters, so I moved it in there. I didn't think much about it, and all my tests were still passing.

What happened, though, was that I moved mutation of a global variable into a helper method that was called from a lambda expression. Keep in mind that lambda expressions represent code that may run later; that execution is deferred.

By moving that mutation statement into the helper method, I inadvertently deferred its execution. When the application's Startup code runs, it configures the service. All the code that runs inside of AddJwtBearer, however, doesn't run immediately; it runs when needed.

This explains why all tests were passing. My test suite has plenty of self-hosted integration tests in the style of outside-in test-driven development. When I ran all the tests, the deferred code block would run in some other test context, flipping that global bit as a side effect. When the test suite reaches the test that fails when run in isolation, the bit is already flipped, and then it works.

It took me hours to figure out what the problem was, and it turned out that the root cause was a global variable.

Global variables are evil. Who knew?

Can't reproduce #

Once you figure out what the problem is, you should reproduce it with an automated test.

Yes, and how do I do that here? I already had tests that verify that you can GET the desired resource if you present a valid JWT. And that test passes!

The only way I can think of to reproduce the issue with an automated test is to create a completely new, independent test library with only one test. I considered doing that, weighed the advantages against the disadvantages and decided, given the context, that it wasn't worth the effort.

That means, though, that I had to accept that within the confines of my existing testing strategy, I can't reproduce the defect. This doesn't happen to me often. In fact, I can't recall that it's ever happened to me before, like this.

Resolution #

It took me hours to find the bug, and ten seconds to fix it. I just moved the mutation of JwtSecurityTokenHandler.DefaultMapInboundClaims back to the ConfigureAuthorization method:

diff --git a/Restaurant.RestApi/Startup.cs b/Restaurant.RestApi/Startup.cs
index d4917f5..e40032f 100644
--- a/Restaurant.RestApi/Startup.cs
+++ b/Restaurant.RestApi/Startup.cs
@@ -79,6 +79,7 @@ namespace Ploeh.Samples.Restaurants.RestApi

         private void ConfigureAuthorization(IServiceCollection services)
         {
+            JwtSecurityTokenHandler.DefaultMapInboundClaims = false;
             services.AddAuthentication(opts =>
             {
                 opts.DefaultAuthenticateScheme =
@@ -99,8 +100,6 @@ namespace Ploeh.Samples.Restaurants.RestApi

         private TokenValidationParameters CreateTokenValidationParameters()
         {
-            JwtSecurityTokenHandler.DefaultMapInboundClaims = false;
-
             var secret = Configuration["JwtIssuerSigningKey"];

             return new TokenValidationParameters

An ad-hoc smoke test verified that this solved the problem.

Conclusion #

There are multiple insights to be had from this experience:

  • Global variables are evil
  • Binary search, or bisection, is useful when troubleshooting
  • Not all bugs can be reproduced by an automated test
  • Small commits make it easy to detect problems
I've always disliked Git's squash feature, and here's one of the many reasons to dislike it. Had this happened in a code base with a 'nice history' (as the squash proponents like to present it), that small commit would have been bundled with various other commits. The problem wouldn't have jumped at me if buried in dozens of other changes.



Wish to comment?

You can add a comment to this post by sending me a pull request. Alternatively, you can discuss this post on Twitter or somewhere else with a permalink. Ping me with the link, and I may respond.

Published

Monday, 05 October 2020 06:00:00 UTC

Tags



"Our team wholeheartedly endorses Mark. His expert service provides tremendous value."
Hire me!
Published: Monday, 05 October 2020 06:00:00 UTC