Picture archivist in Haskell

A comprehensive code example showing how to implement a functional architecture in Haskell.

This article shows how to implement the picture archivist architecture described in the previous article. In short, the task is to move some image files to directories based on their date-taken metadata. The architectural idea is to load a directory structure from disk into an in-memory tree, manipulate that tree, and use the resulting tree to perform the desired actions:

A functional program typically loads data, transforms it, and stores it again.

Much of the program will manipulate the tree data, which is immutable.

Tree #

You can start by defining a rose tree:

data Tree a b = Node a [Tree a b] | Leaf b deriving (Eq, Show, Read)

If you wanted to, you could put all the Tree code in a reusable library, because none of it is coupled to a particular application, such as moving pictures. You could also write a comprehensive test suite for the following functions, but in this article, I'll skip that.

Notice that this sort of tree explicitly distinguishes between internal and leaf nodes. This is necessary because you'll need to keep track of the directory names (the internal nodes), while at the same time you'll want to enrich the leaves with additional data - data that you can't meaningfully add to the internal nodes. You'll see this later in the article.

The rose tree catamorphism is this foldTree function:

foldTree :: (a -> [c] -> c) -> (b -> c) -> Tree a b -> c
foldTree  _ fl (Leaf x) = fl x
foldTree fn fl (Node x xs) = fn x $ foldTree fn fl <$> xs

Sometimes I name the catamorphism cata, sometimes something like tree, but using a library like Data.Tree as another source of inspiration, in this article I chose to name it foldTree.

In this article, tree functionality is (with one exception) directly or transitively implemented with foldTree.

Filtering trees #

It'll be useful to be able to filter the contents of a tree. For example, the picture archivist program will only move image files with valid metadata. This means that it'll need to filter out all files that aren't image files, as well as image files without valid metadata.

It turns out that it'll be useful to supply a function that throws away Nothing values from a tree of Maybe leaves. This is similar to the catMaybes function from Data.Maybe, so I call it catMaybeTree:

catMaybeTree :: Tree a (Maybe b) -> Maybe (Tree a b)
catMaybeTree = foldTree (\x -> Just . Node x . catMaybes) (fmap Leaf)

You may find the type of the function surprising. Why does it return a Maybe Tree, instead of simply a Tree? And if you accept the type as given, isn't this simply the sequence function?

While catMaybes simply returns a list, it can do this because lists can be empty. This Tree type, on the other hand, can't be empty. If the purpose of catMaybeTree is to throw away all Nothing values, then how do you return a tree from Leaf Nothing?

You can't return a Leaf because you have no value to put in the leaf. Similarly, you can't return a Node because, again, you have no value to put in the node.

In order to handle this edge case, then, you'll have to return Nothing:

Prelude Tree> catMaybeTree $ Leaf Nothing
Nothing

Isn't this the same as sequence, then? It's not, because sequence short-circuits all data, as this list example shows:

Prelude> sequence [Just 42, Nothing, Just 2112]
Nothing

Contrast this with the behaviour of catMaybes:

Prelude Data.Maybe> catMaybes [Just 42, Nothing, Just 2112]
[42,2112]

You've yet to see the Traversable instance for Tree, but it behaves in the same way:

Prelude Tree> sequence $ Node "Foo" [Leaf (Just 42), Leaf Nothing, Leaf (Just 2112)]
Nothing

The catMaybeTree function, on the other hand, returns a filtered tree:

Prelude Tree> catMaybeTree $ Node "Foo" [Leaf (Just 42), Leaf Nothing, Leaf (Just 2112)]
Just (Node "Foo" [Leaf 42,Leaf 2112])

While the resulting tree is wrapped in a Just case, the leaves contain unwrapped values.

Instances #

The article about the rose tree catamorphism already covered how to add instances of Bifunctor, Bifoldable, and Bitraversable, so I'll give this only cursory treatment. Refer to that article for a more detailed treatment. The code that accompanies that article also has QuickCheck properties that verify the various laws associated with those instances. Here, I'll just list the instances without further comment:

instance Bifunctor Tree where
  bimap f s = foldTree (Node . f) (Leaf . s)
 
instance Bifoldable Tree where
  bifoldMap f = foldTree (\x xs -> f x <> mconcat xs)
 
instance Bitraversable Tree where
  bitraverse f s =
    foldTree (\x xs -> Node <$> f x <*> sequenceA xs) (fmap Leaf . s)
 
instance Functor (Tree a) where
  fmap = second
 
instance Foldable (Tree a) where
  foldMap = bifoldMap mempty
 
instance Traversable (Tree a) where
  sequenceA = bisequenceA . first pure

The picture archivist program isn't going to explicitly need all of these, but transitively, it will.

Moving pictures #

So far, all the code shown here could be in a general-purpose reusable library, since it contains no functionality specifically related to image files. The rest of the code in this article, however, will be specific to the program. I'll put the domain model code in another module and import some functionality:

module Archive where
 
import Data.Time
import Text.Printf
import System.FilePath
import qualified Data.Map.Strict as Map
import Tree

Notice that Tree is one of the imported modules.

Later, we'll look at how to load a tree from the file system, but for now, we'll just pretend that we have such a tree.

The major logic of the program is to create a destination tree based on a source tree. The leaves of the tree will have to carry some extra information apart from a file path, so you can introduce a specific type to capture that information:

data PhotoFile =
  PhotoFile { photoFileName :: FilePath, takenOn :: LocalTime }
  deriving (Eq, Show, Read)

A PhotoFile not only contains the file path for an image file, but also the date the photo was taken. This date can be extracted from the file's metadata, but that's an impure operation, so we'll delegate that work to the start of the program. We'll return to that later.

Given a source tree of PhotoFile leaves, though, the program must produce a destination tree of files:

moveTo :: (Foldable t, Ord a, PrintfType a) => a -> t PhotoFile -> Tree a FilePath
moveTo destination =
  Node destination . Map.foldrWithKey addDir [] . foldr groupByDir Map.empty
  where
    dirNameOf (LocalTime d _) =
      let (y, m, _) = toGregorian d in printf "%d-%02d" y m
    groupByDir (PhotoFile fileName t) =
      Map.insertWith (++) (dirNameOf t) [fileName]
    addDir name files dirs = Node name (Leaf <$> files) : dirs

This moveTo function looks, perhaps, overwhelming, but it's composed of only three steps:

Create a map of destination folders (foldr groupByDir Map.empty).
Create a list of branches from the map (Map.foldrWithKey addDir []).
Create a tree from the list (Node destination).

Recall that when Haskell functions are composed with the . operator, you'll have to read the composition from right to left.

Notice that this function works with any Foldable data container, so it'd work with lists and other data structures besides trees.

The moveTo function starts by folding the input data into a map. The map is keyed by the directory name, which is formatted by the dirNameOf function. This function takes a LocalTime as input and formats it to a YYYY-MM format. For example, December 20, 2018 becomes "2018-12".

The entire mapping step groups the PhotoFile values into a map of the type Map a [FilePath]. All the image files taken in April 2014 are added to the list with the "2014-04" key, all the image files taken in July 2011 are added to the list with the "2011-07" key, and so on.

In the next step, the moveTo function converts the map to a list of trees. This will be the branches (or sub-directories) of the destination directory. Because of the desired structure of the destination tree, this is a list of shallow branches. Each node contains only leaves.

Shallow photo destination directories.

The only remaining step is to add that list of branches to a destination node.

Since this is a pure function, it's easy to unit test. Just create some input values and call the function:

"Move to destination" ~: do
  (source, destination, expected) <-
    [
      ( Leaf $ PhotoFile "1" $ lt 2018 11 9 11 47 17
      , "D"
      , Node "D" [Node "2018-11" [Leaf "1"]])
      ,
      ( Node "S" [
          Leaf $ PhotoFile "4" $ lt 1972 6 6 16 15 00]
      , "D"
      , Node "D" [Node "1972-06" [Leaf "4"]])
      ,
      ( Node "S" [
          Leaf $ PhotoFile "L" $ lt 2002 10 12 17 16 15,
          Leaf $ PhotoFile "J" $ lt 2007 4 21 17 18 19]
      , "D"
      , Node "D" [Node "2002-10" [Leaf "L"], Node "2007-04" [Leaf "J"]])
      ,
      ( Node "1" [
          Leaf $ PhotoFile "a" $ lt 2010 1 12 17 16 15,
          Leaf $ PhotoFile "b" $ lt 2010 3 12 17 16 15,
          Leaf $ PhotoFile "c" $ lt 2010 1 21 17 18 19]
      , "2"
      , Node "2" [
          Node "2010-01" [Leaf "a", Leaf "c"],
          Node "2010-03" [Leaf "b"]])
      ,
      ( Node "foo" [
          Node "bar" [
            Leaf $ PhotoFile "a" $ lt 2010 1 12 17 16 15,
            Leaf $ PhotoFile "b" $ lt 2010 3 12 17 16 15,
            Leaf $ PhotoFile "c" $ lt 2010 1 21 17 18 19],
          Node "baz" [
            Leaf $ PhotoFile "d" $ lt 2010 3 1 2 3 4,
            Leaf $ PhotoFile "e" $ lt 2011 3 4 3 2 1
          ]]
      , "qux"
      , Node "qux" [
          Node "2010-01" [Leaf "a", Leaf "c"],
          Node "2010-03" [Leaf "b", Leaf "d"],
          Node "2011-03" [Leaf "e"]])
    ]
  let actual = moveTo destination source
  return $ expected ~=? actual

This is an inlined parametrised HUnit test. While it looks like a big unit test, it still follows my test formatting heuristic. There's only three expressions, but the arrange expression is big because it creates a list of test cases.

Each test case is a triple of a source tree, a destination directory name, and an expected result. In order to make the test data code more compact, it utilises this test-specific helper function:

lt y mth d h m s = LocalTime (fromGregorian y mth d) (TimeOfDay h m s)

For each test case, the test calls the moveTo function with the destination directory name and the source tree. It then asserts that the expected value is equal to the actual value.

Calculating moves #

One pure step remains. The result of calling the moveTo function is a tree with the desired structure. In order to actually move the files, though, for each file you'll need to keep track of both the source path and the destination path. To make that explicit, you can define a type for that purpose:

data Move =
  Move { sourcePath :: FilePath, destinationPath :: FilePath }
  deriving (Eq, Show, Read)

A Move is simply a data structure. Contrast this with typical object-oriented design, where it would be a (possibly polymorphic) method on an object. In functional programming, you'll regularly model intent with a data structure. As long as intents remain data, you can easily manipulate them, and once you're done with that, you can run an interpreter over your data structure to perform the work you want accomplished.

The unit test cases for the moveTo function suggest that file names are local file names like "L", "J", "a", and so on. That was only to make the tests as compact as possible, since the function actually doesn't manipulate the specific FilePath values.

In reality, the file names will most likely be longer, and they could also contain the full path, instead of the local path: "C:\foo\bar\a.jpg".

If you call moveTo with a tree where each leaf has a fully qualified path, the output tree will have the desired structure of the destination tree, but the leaves will still contain the full path to each source file. That means that you can calculate a Move for each file:

calculateMoves :: Tree FilePath FilePath -> Tree FilePath Move
calculateMoves = imp ""
  where imp path    (Leaf x) = Leaf $ Move x $ replaceDirectory x path
        imp path (Node x xs) = Node (path </> x) $ imp (path </> x) <$> xs

This function takes as input a Tree FilePath FilePath, which is compatible with the output of moveTo. It returns a Tree FilePath Move, i.e. a tree where the leaves are Move values.

To be fair, returning a tree is overkill. A [Move] (list of moves) would have been just as useful, but in this article, I'm trying to describe how to write code with a functional architecture. In the overview article, I explained how you can model a file system using a rose tree, and in order to emphasise that point, I'll stick with that model a little while longer.

Earlier, I wrote that you can implement desired Tree functionality with the foldTree function, but that was a simplification. If you can implement the functionality of calculateMoves with foldTree, I don't know how. You can, however, implement it using explicit pattern matching and simple recursion.

The imp function builds up a file path (using the </> path combinator) as it recursively negotiates the tree. All Leaf nodes are converted to a Move value using the leaf node's current FilePath value as the sourcePath, and the path to figure out the desired destinationPath.

This code is still easy to unit test:

"Calculate moves" ~: do
  (tree, expected) <-
    [
      (Leaf "1", Leaf $ Move "1" "1"),
      (Node "a" [Leaf "1"], Node "a" [Leaf $ Move "1" $ "a" </> "1"]),
      (Node "a" [Leaf "1", Leaf "2"], Node "a" [
        Leaf $ Move "1" $ "a" </> "1",
        Leaf $ Move "2" $ "a" </> "2"]),
      (Node "a" [Node "b" [Leaf "1", Leaf "2"], Node "c" [Leaf "3"]],
       Node "a" [
         Node ("a" </> "b") [
           Leaf $ Move "1" $ "a" </> "b" </> "1",
           Leaf $ Move "2" $ "a" </> "b" </> "2"],
         Node ("a" </> "c") [
           Leaf $ Move "3" $ "a" </> "c" </> "3"]])
    ]
  let actual = calculateMoves tree
  return $ expected ~=? actual

The test cases in this parametrised test are tuples of an input tree and the expected tree. For each test case, the test calls the calculateMoves function with tree and asserts that the actual tree is equal to the expected tree.

That's all the pure code you need in order to implement the desired functionality. Now you only need to write some code that loads a tree from disk, and imprints a destination tree to disk, as well as the code that composes it all.

Loading a tree from disk #

The remaining code in this article is impure. You could put it in dedicated modules, but for this program, you're only going to need three functions and a bit of composition code, so you could also just put it all in the Main module. That's what I did.

To load a tree from disk, you'll need a root directory, under which you load the entire tree. Given a directory path, you read a tree using a recursive function like this:

readTree :: FilePath -> IO (Tree FilePath FilePath)
readTree path = do
  isFile <- doesFileExist path
  if isFile
    then return $ Leaf path
    else do
      dirsAndfiles <- listDirectory path
      let paths = fmap (path </>) dirsAndfiles
      branches <- traverse readTree paths
      return $ Node path branches

This recursive function starts by checking whether the path is a file or a directory. If it's a file, it creates a new Leaf with that FilePath.

If path isn't a file, it's a directory. In that case, use listDirectory to enumerate all the directories and files in that directory. These are only local names, so prefix them with path to create full paths, then traverse all those directory entries recursively. That produces all the branches for the current node. Finally, return a new Node with the path and the branches.

Loading metadata #

The readTree function only produces a tree with FilePath leaves, while the program requires a tree with PhotoFile leaves. You'll need to read the Exif metadata from each file and enrich the tree with the date-taken data.

In this code base, I've used the hsexif library for this. That enables you to write an impure operation like this:

readPhoto :: FilePath -> IO (Maybe PhotoFile)
readPhoto path = do
  exifData <- parseFileExif path
  let dateTaken = either (const Nothing) Just exifData >>= getDateTimeOriginal
  return $ PhotoFile path <$> dateTaken

This operation can fail for various reasons:

The file may not exist.
The file exists, but has no metadata.
The file has metadata, but no date-taken metadata.
The date-taken metadata string is malformed.

The program is just going to skip all files from which it can't extract date-taken metadata, so readPhoto converts the Either value returned by parseFileExif to Maybe and binds the result with getDateTimeOriginal.

When you traverse a Tree FilePath FilePath with readPhoto, you'll get a Tree FilePath (Maybe PhotoFile). That's when you'll need catMaybeTree. You'll see this soon.

Writing a tree to disk #

The above calculateMoves function creates a Tree FilePath Move. The final piece of impure code you'll need to write is an operation that traverses such a tree and executes each Move.

applyMoves :: Foldable t => t Move -> IO ()
applyMoves = traverse_ move
  where
    move m = copy m >> compareFiles m >>= deleteSource
    copy (Move s d) = do
      createDirectoryIfMissing True $ takeDirectory d
      copyFileWithMetadata s d
      putStrLn $ "Copied to " ++ show d
    compareFiles m@(Move s d) = do
      sourceBytes <- B.readFile s
      destinationBytes <- B.readFile d
      return $ if sourceBytes == destinationBytes then Just m else Nothing
    deleteSource           Nothing = return ()
    deleteSource (Just (Move s _)) = removeFile s

As I wrote above, a tree of Move values is, to be honest, overkill. Any Foldable container will do, as the applyMoves operation demonstrates. It traverses the data structure, and for each Move, it first copies the file, then it verifies that the copy was successful, and finally, if that's the case, it deletes the source file.

All of the operations invoked by these three steps are defined in various libraries part of the base GHC installation. You're welcome to peruse the source code repository if you're interested in the details.

Composition #

You can now compose an impure-pure-impure sandwich from all the Lego pieces:

movePhotos :: FilePath -> FilePath -> IO ()
movePhotos source destination = fmap fold $ runMaybeT $ do
  sourceTree <- lift $ readTree source
  photoTree <- MaybeT $ catMaybeTree <$> traverse readPhoto sourceTree
  let destinationTree = calculateMoves $ moveTo destination photoTree
  lift $ applyMoves destinationTree

First, you load the sourceTree using the readTree operation. This is a Tree FilePath FilePath value, because the code is written in do notation, and the context is MaybeT IO (). You then load the image metatadata by traversing sourceTree with readPhoto. This produces a Tree FilePath (Maybe PhotoFile) that you then filter with catMaybeTree. Again, because of do notation and monad transformer shenanigans, photoTree is a Tree FilePath PhotoFile value.

Those two lines of code is the initial impure step of the sandwich (yes: mixed metaphors, I know).

The pure part of the sandwich is the composition of the pure functions moveTo and calculateMoves. The result is a Tree FilePath Move value.

The final, impure step of the sandwich, then, is to applyMoves.

Execution #

The movePhotos operation takes source and destination arguments. You could hypothetically call it from a rich client or a background process, but here I'll just call if from a command-line program. The main operation will have to parse the input arguments and call movePhotos:

main :: IO ()
main = do
  args <- getArgs
  case args of
    [source, destination] -> movePhotos source destination
    _ -> putStrLn "Please provide source and destination directories as arguments."

You could write more sophisticated parsing of the program arguments, but that's not the topic of this article, so I only wrote the bare minimum required to get the program working.

You can now compile and run the program:

$ ./archpics "C:\Users\mark\Desktop\Test" "C:\Users\mark\Desktop\Test-Out"
Copied to "C:\\Users\\mark\\Desktop\\Test-Out\\2003-04\\2003-04-29 15.11.50.jpg"
Copied to "C:\\Users\\mark\\Desktop\\Test-Out\\2011-07\\2011-07-10 13.09.36.jpg"
Copied to "C:\\Users\\mark\\Desktop\\Test-Out\\2014-04\\2014-04-17 17.11.40.jpg"
Copied to "C:\\Users\\mark\\Desktop\\Test-Out\\2014-04\\2014-04-18 14.05.02.jpg"
Copied to "C:\\Users\\mark\\Desktop\\Test-Out\\2014-05\\2014-05-23 16.07.20.jpg"
Copied to "C:\\Users\\mark\\Desktop\\Test-Out\\2014-06\\2014-06-30 15.44.52.jpg"
Copied to "C:\\Users\\mark\\Desktop\\Test-Out\\2014-06\\2014-06-21 16.48.40.jpg"
Copied to "C:\\Users\\mark\\Desktop\\Test-Out\\2016-05\\2016-05-01 09.25.23.jpg"
Copied to "C:\\Users\\mark\\Desktop\\Test-Out\\2017-08\\2017-08-22 19.53.28.jpg"

This does indeed produce the expected destination directory structure.

Seven example directories with pictures.

It's always nice when something turns out to work in practice, as well as in theory.

Summary #

Functional software architecture involves separating pure from impure code so that no pure functions invoke impure operations. Often, you can achieve that with what I call the impure-pure-impure sandwich architecture. In this example, you saw how to model the file system as a tree. This enables you to separate the impure file interactions from the pure program logic.

The Haskell type system enforces the functional interaction law, which implies that the architecture is, indeed, properly functional. Other languages, like F#, don't enforce the law via the compiler, but that doesn't prevent you doing functional programming. Now that we've verified that the architecture is, indeed, functional, we can port it to F#.

Next: Picture archivist in F#.

Comments

Jiehong #

This seems a fair architecture.

However, at first glance it does not seem very memory efficient, because everything might be loaded in RAM, and that poses a strict limit.

But then, I remember that Haskell does lazy evaluation, so is it the case here? Are path and the tree lazily loaded and processed?

In "traditional" architectures, IO would be scattered inside the program, and as each file might be read one at a time, and handled. This sandwich of purity with impure buns forces not to do that.

2019-09-09 11:47 UTC

Mark Seemann #

Jiehong, thank you for writing. It's true that Haskell is lazily evaluated, but some strictness rules apply to IO, so it's not so simple.

Just running a quick experiment with the code base shown here, when I try to move thousands of files, the program sits and thinks for quite some time before it starts to output progress. This indicates to me that it does, indeed, load at least the structure of the tree into memory before it starts moving the files. Once it does that, though, it looks like it runs at constant memory.

There's an interplay of laziness and IO in Haskell that I still don't sufficiently master. When I publish the port to F#, however, it should be clear that you could replace all the nodes of the tree with explicitly lazy values. I'd be surprised if something like that isn't possible in Haskell as well, but here I'll solicit help from readers more well-versed in these matters than I am.

2019-09-09 19:16 UTC

André Cardoso #

I really like your posts and I'm really liking this series. But I struggle with Haskell syntax, specially the difference between the operators $, <$>, <>, <*>. Is there a cheat sheet explaining these operators?

2019-09-12 13:51 UTC

Mark Seemann #

André, thank you for writing. I've written about why I think that terse operators make the code overall more readable, but that's obviously not an explanation of any of those operators.

I'm not aware of any cheat sheets for Haskell, although a Google search seems to indicate that many exist. I'm not sure that a cheat sheet will help much if one doesn't know Haskell, and if one does know Haskell, one is likely to also know those operators.

$ is a sort of delimiter that often saves you from having to nest other function calls in brackets.

<$> is just an infix alias for fmap. In C#, that corresponds to the Select method.

<> is a generalised associative binary operation as defined by Data.Semigroup or Data.Monoid. You can read more about monoids and semigroups here on the blog.

<*> is part of the Applicative type class. It's hard to translate to other languages, but when I make the attempt, I usually call it Apply.

2019-09-12 15:45 UTC

Published: Monday, 09 September 2019 08:19:00 UTC

Picture archivist in Haskell by Mark Seemann