The IsNullOrWhiteSpace trap

The IsNullOrWhiteSpace method may seem like a useful utility method, but poisons your design perspective.

The string.IsNullOrWhiteSpace method, together with its older sibling string.IsNullOrEmpty, may seem like useful utility methods. In reality, they aren't. In fact, they trick your mind into thinking that null is equivalent to white space, which it isn't.

Null isn't equivalent to anything; it's the absence of a value.

Various empty and white space strings ("", " ", etc), on the other hand, are values, although, perhaps, not particularly interesting values.

Example: search canonicalization #

Imagine that you have to write a simple search canonicalization algorithm for a music search service. The problem you're trying to solve is that when users search for music, they may use variations of upper and lower case letters, as well as type the artist name before the song title, or vice versa. In order to make your system as efficient as possible, you may want to cache popular search results, but that means you'll need to transform each search term into a canonical form.

In order to keep things simple, let's assume that you only need to convert all letters to upper case, and order words alphabetically.

Here are five test cases, represented as a Parameterized Test:

[Theory]
[InlineData("Seven Lions Polarized"  , "LIONS POLARIZED SEVEN"  )]
[InlineData("seven lions polarized"  , "LIONS POLARIZED SEVEN"  )]
[InlineData("Polarized seven lions"  , "LIONS POLARIZED SEVEN"  )]
[InlineData("Au5 Crystal Mathematics", "AU5 CRYSTAL MATHEMATICS")]
[InlineData("crystal mathematics au5", "AU5 CRYSTAL MATHEMATICS")]
public void CanonicalizeReturnsCorrectResult(
    string searchTerm,
    string expected)
{
    string actual = SearchTerm.Canonicalize(searchTerm);
    Assert.Equal(expected, actual);
}

Here's one possible implementation that passes all five test cases:

public static string Canonicalize(string searchTerm)
{
    return searchTerm
        .Split(new[] { ' ' })
        .Select(x => x.ToUpper())
        .OrderBy(x => x)
        .Aggregate((x, y) => x + " " + y);
}

This implementation uses the space character to split the string into an array, then converts each sub-string to upper case letters, sorts the sub-strings in ascending order, and finally concatenates them all together to a single string, which is returned.

Continued example: making the implementation more robust #

The above implementation is quite naive, because it doesn't properly canonicalize if the user entered extra white space, such as in these extra test cases:

[InlineData("Seven  Lions   Polarized", "LIONS POLARIZED SEVEN")]
[InlineData(" Seven  Lions Polarized ", "LIONS POLARIZED SEVEN")]

Notice that these new test cases don't pass with the above implementation, because it doesn't properly remove all the extra white space. Here's a more robust implementation that passes all test cases:

public static string Canonicalize(string searchTerm)
{
    return searchTerm
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(x => x.ToUpper())
        .OrderBy(x => x)
        .Aggregate((x, y) => x + " " + y);
}

Notice the addition of StringSplitOptions.RemoveEmptyEntries.
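To see what that option changes, here's a minimal sketch of the two Split calls in isolation. The SplitDemo class and its method names are hypothetical helpers introduced for illustration; they aren't part of the SearchTerm class.

```csharp
using System;

public static class SplitDemo
{
    // Hypothetical helper: splits the way the first implementation does.
    // Every extra space produces an empty string in the result.
    public static string[] SplitKeepingEmpties(string s)
    {
        return s.Split(new[] { ' ' });
    }

    // Hypothetical helper: splits the way the more robust implementation does.
    // Empty entries are dropped, so only the actual words remain.
    public static string[] SplitRemovingEmpties(string s)
    {
        return s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For the input " Seven  Lions Polarized ", the first method returns six elements ("", "Seven", "", "Lions", "Polarized", ""), while the second returns only the three words. The empty strings are what ends up as stray spaces in the output of the naive implementation.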

Testing for null #

If you consider the above implementation, does it have any other problems?

One, fairly obvious, problem is that if searchTerm is null, the method is going to throw a NullReferenceException, because you can't invoke the Split method on null.

Therefore, in order to protect the invariants of the method, you must test for null:

[Fact]
public void CanonicalizeNullThrows()
{
    Assert.Throws<ArgumentNullException>(
        () => SearchTerm.Canonicalize(null));
}

In this case, you've decided that null is simply invalid input, and I agree. Searching for null (the absence of a value) isn't meaningful; it must be a defect in the calling code.

Often, I see programmers implement their null checks like this:

public static string Canonicalize(string searchTerm)
{
    if (string.IsNullOrWhiteSpace(searchTerm))
        throw new ArgumentNullException("searchTerm");
 
    return searchTerm
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(x => x.ToUpper())
        .OrderBy(x => x)
        .Aggregate((x, y) => x + " " + y);
}

Notice the use of IsNullOrWhiteSpace. While it passes all tests so far, it's wrong for a number of reasons.

Problems with IsNullOrWhiteSpace #

The first problem with this use of IsNullOrWhiteSpace is that it may give client programmers wrong messages. For example, if you pass the empty string ("") as searchTerm, you'll still get an ArgumentNullException. This is misleading, because it gives the wrong message: it states that searchTerm was null when it wasn't (it was "").
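As a small sketch of that mismatch, the following reproduces only the Guard Clause and reports which exception type a caller sees. GuardDemo and ExceptionFor are hypothetical names used for illustration.

```csharp
using System;

public static class GuardDemo
{
    // Reproduces only the guard from the implementation above; the rest of
    // the method body is left out on purpose.
    public static void GuardWithIsNullOrWhiteSpace(string searchTerm)
    {
        if (string.IsNullOrWhiteSpace(searchTerm))
            throw new ArgumentNullException("searchTerm");
    }

    // Hypothetical helper: returns the type name of the exception thrown
    // for the given input, or "none" if the guard lets the input through.
    public static string ExceptionFor(string searchTerm)
    {
        try
        {
            GuardWithIsNullOrWhiteSpace(searchTerm);
            return "none";
        }
        catch (ArgumentException e)
        {
            return e.GetType().Name;
        }
    }
}
```

Both ExceptionFor("") and ExceptionFor(" ") return "ArgumentNullException", just like ExceptionFor(null) does, even though neither argument was null.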

You may then argue that you could change the implementation to throw an ArgumentException.

if (string.IsNullOrWhiteSpace(searchTerm))
    throw new ArgumentException("Empty or null.", "searchTerm");

This isn't incorrect per se, but not as explicit as it could have been. In other words, it's not as helpful to the client developer as it could have been. While it may not seem like a big deal in a single method like this, it's sloppy code like this that eventually wears client developers down; it's death by a thousand paper cuts.

Moreover, this implementation doesn't follow the Robustness Principle. Is there any rational reason to reject white space strings?

Actually, with a minor tweak, we can make the implementation work with white space as well. Consider these new test cases:

[InlineData("", "")]
[InlineData(" ", "")]
[InlineData("  ", "")]

These currently fail because of the use of IsNullOrWhiteSpace, but they ought to succeed.

The correct implementation of the Canonicalize method is this:

public static string Canonicalize(string searchTerm)
{
    if (searchTerm == null)
        throw new ArgumentNullException("searchTerm");
 
    return searchTerm
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(x => x.ToUpper())
        .OrderBy(x => x)
        .Aggregate("", (x, y) => x + " " + y)
        .Trim();
}

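First of all, the correct Guard Clause is to test only for null; null is the only invalid value. Second, the method uses another overload of the Aggregate method where an initial seed (in this case "") is used to initialize the Fold operation. This matters because the overload without a seed throws an InvalidOperationException when the sequence is empty, which is exactly what Split with RemoveEmptyEntries produces for an empty or white space string. Third, because the seed makes the result start with a space, the final call to the Trim method ensures that there's no leading or trailing white space.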
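The difference between the two Aggregate overloads can be sketched in isolation like this. AggregateDemo and its method names are hypothetical. The string.Join variant isn't used in this article; it's included only as one alternative that also returns the empty string for an empty sequence.

```csharp
using System;
using System.Linq;

public static class AggregateDemo
{
    // Overload without a seed, as in the earlier implementations.
    // Throws InvalidOperationException when words is empty.
    public static string JoinWithoutSeed(string[] words)
    {
        return words.Aggregate((x, y) => x + " " + y);
    }

    // Overload with a seed, as in the final implementation.
    // The seed adds a leading space to non-empty results, hence the Trim.
    public static string JoinWithSeed(string[] words)
    {
        return words.Aggregate("", (x, y) => x + " " + y).Trim();
    }

    // Alternative (an assumption, not the article's approach):
    // string.Join puts the separator only between elements.
    public static string JoinWithStringJoin(string[] words)
    {
        return string.Join(" ", words);
    }
}
```

For an empty array, JoinWithSeed and JoinWithStringJoin both return "", while JoinWithoutSeed throws. For new[] { "A", "B" }, all three return "A B".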

The IsNullOrWhiteSpace mental model #

The overall problem with IsNullOrWhiteSpace and IsNullOrEmpty is that they give you the impression that null is equivalent to white space strings. This is the wrong mental model: white space strings are proper string values that you can very often manipulate just as well as any other string.

If you insist on the mental model that white space strings are equivalent to null, you'll tend to put them in the same bucket of 'invalid' data. However, if you take a hard look at the preconditions for your classes, methods, or functions, you'll find that often, a white space string is going to be perfectly acceptable input. Why reject input you can understand? That will only make your code more difficult to use.

In testing terms, it's my experience that null rarely falls in the same Equivalence Class as white space strings. Therefore, it's wrong to implicitly treat them as if they do.
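For the Canonicalize example, the two Equivalence Classes could be sketched as follows. The Canonicalize method is a copy of the final implementation above, repeated so that the sketch is self-contained; EquivalenceDemo and Classify are hypothetical names.

```csharp
using System;
using System.Linq;

public static class EquivalenceDemo
{
    // Copy of the final implementation from this article.
    public static string Canonicalize(string searchTerm)
    {
        if (searchTerm == null)
            throw new ArgumentNullException("searchTerm");

        return searchTerm
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(x => x.ToUpper())
            .OrderBy(x => x)
            .Aggregate("", (x, y) => x + " " + y)
            .Trim();
    }

    // Hypothetical helper: names the class an input falls into,
    // based on whether Canonicalize accepts it.
    public static string Classify(string searchTerm)
    {
        try
        {
            Canonicalize(searchTerm);
            return "valid";
        }
        catch (ArgumentNullException)
        {
            return "invalid";
        }
    }
}
```

Here, Classify(null) returns "invalid", whereas "", " ", and "  " are all classified as "valid", together with ordinary search terms.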

The IsNullOrWhiteSpace and IsNullOrEmpty methods imply that null and white space strings are equivalent, and this will often give you the wrong mental model of the boundary cases of your software. Be careful when using these methods.

Comments

Fendy #

I agree if this is used in a code library that will be used by other programmers. However, when used directly at the application layer, it is common to use them, at least the IsNullOrEmpty one, and they are quite powerful. I don't see any problem with using them there.

2014-11-28 06:47 UTC

StingyJack #

"Dog" means "DOG" means "dog"

I'm pretty sure you are hating ("hate-ing") on the wrong statement in this code, and have taken the idea past the point of practicality.

  if (string.IsNullOrWhiteSpace(searchTerm)) //this is not the part that is wrong (in this context*)
    throw new ArgumentNullException("searchTerm");  //this is

It's wrong for two (or three) reasons...

It is throwing an exception type that does not match the conditional statement that it comes from. The condition could be null or whitespace, but the exception is specific to null.
The message misinforms the caller about the problem.
* The lack of bracing is kind of wrong too - it's so easy to avoid potential bugs and unintended side effects by just adding braces. More on this later.

That code should look more like this, where the exception and message match the conditional, and align with both how humans (even programmers) understand written language and one of your versions of the function (but with braces =D)

  if (string.IsNullOrWhiteSpace(searchTerm))
  {
    throw new ArgumentException("A search term must be provided");
  }

For most (if not all) human written languages, there is no semantic difference between words based on casing of letters. "Dog" means "DOG" means "dog". Likewise, for a function that accepts user input in order to arrange it for use in a search, there is no need to differentiate between values like null, "", " ", "\r\n\t", <GS>, and other unprintable whitespace characters. None of those inputs are acceptable except for very rare and specific use cases (not when the user is expected to provide some search criteria), so we only need to inform the caller correctly and not mislead them.

It is definitely useful to have null and empty be different values so that we can represent "not known" or "not set" for datum, but that does not apply in this case as the user has set the value by providing the function's argument. And I get what you are trying to say about robustness but in most cases the end result is the same for the user - they get no results. Catching the bad inputs earlier would save the resources needed to execute a futile search and would avoid wasting the user's time waiting on search results for data entry that was most likely a mistake.

RE: Bracing

If you object to the bracing being needed and do want to shorten the above further, use an extension method like this..

  public static string NullIfWhiteSpace(this string value)
  {
      return string.IsNullOrWhiteSpace(value) ? null : value;
  }

... then your function could look like this...

                
  public static string Canonicalize(string searchTerm)
  {
    _ = searchTerm.NullIfWhiteSpace() ?? throw new ArgumentException("A search term must be provided");
    // etc...

This can be made into a Template (ReSharper) or snippet so you can have even less to type.

2023-04-23 17:04 UTC

Mark Seemann #

StingyJack, thank you for writing. You bring up more than one point. I'll try to address them in order.

You seem to infer from the example that the showcased Canonicalize function implements the search functionality. It doesn't. It canonicalises a search term. A question a library designer must consider is this: What is the contract of the Canonicalize function? What are the preconditions? What are the postconditions? Are there any invariants?

The point that I found most interesting about A Philosophy of Software Design was to favour generality over specialisation. As I also hint at in my review of that book, I think that I identified a connection with Postel's law that isn't explicitly mentioned there. The upshot, in any case, is that it's a good idea to make a function like Canonicalize as robust as possible.

When considering preconditions, then, what are the minimum requirements for it to be able to produce an output?

It turns out that it can, indeed, canonicalise whitespace strings. Since this is possible, it would artificially constrain the function to disallow that. This seems redundant.

When considering Postel's law, there's often a tension between allowing degenerate input and also allowing null input. It wouldn't be hard to handle a null searchTerm by also returning the empty string in that case. This would arguably be more robust, so why not do that?

That's never easy to decide, and someone might be able to convince me that this would, in fact, be more appropriate. On the other hand, one might ask: Is there a conceivable scenario where the search term is validly null?

In most code bases, I'd tend to consider null values as symptomatic of a bug in the calling code. If so, silently accepting null input might entail that defects go undetected. Thus, I tend to favour throwing exceptions on null input.

The bottom line is that there's a fundamental difference between null strings and whitespace strings, just like there's a fundamental difference between null integers and zero integers. It's often dangerous to conflate the two.

Regarding the bracing style, I'm well aware of the argument that omitting the braces can be dangerous. See e.g. the discussion on page 749 in Code Complete.

Impressed by such arguments, I used to insist on always including curly braces even for one-liners, until one day, someone (I believe it was Robert C. Martin) pointed out that these kinds of potential bugs are easily detected by unit tests. Since I routinely have test coverage of my code, I consider the extra braces redundant safety. If there was no cost to having them, I'd leave them, but in practice, they take up space.

2023-04-26 9:01 UTC

Published: Tuesday, 18 November 2014 19:10:00 UTC

The IsNullOrWhiteSpace trap by Mark Seemann