Song recommendations as a Haskell Impureim Sandwich by Mark Seemann
A pure function on potentially massive data.
This article is part of a larger article series called Alternative ways to design with functional programming. As the title suggests, these articles discuss various ways to apply functional-programming principles to a particular problem. All the articles engage with the same problem. In short, the task is to calculate song recommendations for a user, based on massive data sets. Earlier articles in this series give you detailed explanation of the problem.
In the previous article, you saw how to refactor the 'base' F# code base to a pure function. In this article, you'll see the same refactoring applied to the 'base' Haskell code base shown in Porting song recommendations to Haskell.
The Git branch discussed in this article is the pure-function branch in the Haskell Git repository.
Collecting all the data #
Like in the previous articles, we may start by adding two more methods to the SongService
type class, which will enable us to enumerate all songs and all users. The full type class, with all four methods, then looks like this:
class SongService a where getAllSongs :: a -> IO [Song] getAllUsers :: a -> IO [User] getTopListeners :: a -> Int -> IO [User] getTopScrobbles :: a -> String -> IO [Scrobble]
If you compare with the type class definition shown in the article Porting song recommendations to Haskell, you'll see that getAllSongs
and getAllUsers
are the new methods.
They enable you to collect all top listeners, and all top scrobbles, even though it may take some time. To gather all the top listeners, we may add this collectAllTopListeners
action:
collectAllTopListeners srvc = do songs <- getAllSongs srvc listeners <- newIORef Map.empty forM_ songs $ \song -> do topListeners <- getTopListeners srvc $ songId song modifyIORef listeners (Map.insert (songId song) topListeners) readIORef listeners
Likewise, you can amass all the top scrobbles with a similar action:
collectAllTopScrobbles srvc = do users <- getAllUsers srvc scrobbles <- newIORef Map.empty forM_ users $ \user -> do topScrobbles <- getTopScrobbles srvc $ userName user modifyIORef scrobbles (Map.insert (userName user) topScrobbles) readIORef scrobbles
As you may have noticed, they're so similar that, had there been more than two, we might consider extracting the similar parts to a reusable operation.
In both cases, we start with the action that enables us to enumerate all the resources (songs or scrobbles) that we're interested in. For each of these, we then invoke the action to get the 'top' resources for that song or scrobble. There's a massive n+1 problem here, but you could conceivably parallelize all these queries, as they're independent. Still, it's bound to take much time, possibly hours.
You may be wondering why I chose to use IORef
values for both actions, instead of more idiomatic combinator-based expressions. Indeed, that is what I would usually do, but in this case, these two actions are heavily IO-bound already, and it makes the Haskell code more similar to the F# code. Not that that is normally a goal, but here I thought it might help you, the reader, to compare the different code bases.
All the data is kept in a Map per action, so two massive maps in all. Once these two actions return, we're done with the read phase of the Recawr Sandwich.
Referentially transparent function with local mutation #
As a first step, we may wish to turn the getRecommendations
action into a referentially transparent function. If you look through the commits in the Git repository, you can see that I actually did this through a series of micro-commits, but here I only present a more coarse-grained version of the changes I made.
In this version, I've removed the srvc
(SongService
) parameter, and instead added the two maps topScrobbles
and topListeners
.
getRecommendations :: Map String [Scrobble] -> Map Int [User] -> String -> IO [Song] getRecommendations topScrobbles topListeners un = do -- 1. Get user's own top scrobbles -- 2. Get other users who listened to the same songs -- 3. Get top scrobbles of those users -- 4. Aggregate the songs into recommendations let scrobbles = Map.findWithDefault [] un topScrobbles let scrobblesSnapshot = take 100 $ sortOn (Down . scrobbleCount) scrobbles recommendationCandidates <- newIORef [] forM_ scrobblesSnapshot $ \scrobble -> do let otherListeners = Map.findWithDefault [] (songId $ scrobbledSong scrobble) topListeners let otherListenersSnapshot = take 20 $ sortOn (Down . userScrobbleCount) $ filter ((10_000 <=) . userScrobbleCount) otherListeners forM_ otherListenersSnapshot $ \otherListener -> do let otherScrobbles = Map.findWithDefault [] (userName otherListener) topScrobbles let otherScrobblesSnapshot = take 10 $ sortOn (Down . songRating . scrobbledSong) $ filter (songHasVerifiedArtist . scrobbledSong) otherScrobbles forM_ otherScrobblesSnapshot $ \otherScrobble -> do let song = scrobbledSong otherScrobble modifyIORef recommendationCandidates (song :) recommendations <- readIORef recommendationCandidates return $ take 200 $ sortOn (Down . songRating) recommendations
You've probably noticed that this action still looks impure, since it returns IO [Song]
. Even so, it's referentially transparent, since it's fully deterministic and without side effects.
The way I refactored the action, this order of changes was what made most sense to me. Getting rid of the SongService
parameter was more important to me than getting rid of the IO
wrapper.
In any case, this is only an interim state towards a more idiomatic pure Haskell function.
A single expression #
A curious property of expression-based languages is that you can conceivably write functions in 'one line of code'. Granted, it would often be a terribly wide line, not at all readable, a beast to maintain, and often with poor performance, so not something you'd want to alway do.
In this case, however, we can do that, although in order to stay within an 80x24 box, we break the expression over multiple lines.
getRecommendations :: Map String [Scrobble] -> Map Int [User] -> String -> [Song] getRecommendations topScrobbles topListeners un = -- 1. Get user's own top scrobbles -- 2. Get other users who listened to the same songs -- 3. Get top scrobbles of those users -- 4. Aggregate the songs into recommendations take 200 $ sortOn (Down . songRating) $ fmap scrobbledSong $ (\otherListener -> take 10 $ sortOn (Down . songRating . scrobbledSong) $ filter (songHasVerifiedArtist . scrobbledSong) $ Map.findWithDefault [] (userName otherListener) topScrobbles) =<< (\scrobble -> take 20 $ sortOn (Down . userScrobbleCount) $ filter ((10_000 <=) . userScrobbleCount) $ Map.findWithDefault [] (songId $ scrobbledSong scrobble) topListeners) =<< take 100 (sortOn (Down . scrobbleCount) $ Map.findWithDefault [] un topScrobbles)
This snapshot also got rid of the IORef
value, and IO
in general. The function is still referentially transparent, but now Haskell can also see that.
Even so, this looks dense and confusing. It doesn't help that Haskell should usually be read right-to-left, and bottom-to-top. Is it possible to improve the readability of this function?
Composition from smaller functions #
To improve readability and maintainability, we may now extract helper functions. The first one easily suggests itself.
getUsersOwnTopScrobbles :: Ord k => Map k [Scrobble] -> k -> [Scrobble] getUsersOwnTopScrobbles topScrobbles un = take 100 $ sortOn (Down . scrobbleCount) $ Map.findWithDefault [] un topScrobbles
Each of the subexpressions in the above code listing are candidates for the same kind of treatment, like this one:
getOtherUsersWhoListenedToTheSameSongs :: Map Int [User] -> Scrobble -> [User] getOtherUsersWhoListenedToTheSameSongs topListeners scrobble = take 20 $ sortOn (Down . userScrobbleCount) $ filter ((10_000 <=) . userScrobbleCount) $ Map.findWithDefault [] (songId $ scrobbledSong scrobble) topListeners
You can't see it from the code listings themselves, but the module doesn't export these functions. They remain implementation details.
With a few more helper functions, you can now implement the getRecommendations
function by composing the helpers.
getRecommendations :: Map String [Scrobble] -> Map Int [User] -> String -> [Song] getRecommendations topScrobbles topListeners un = -- 1. Get user's own top scrobbles -- 2. Get other users who listened to the same songs -- 3. Get top scrobbles of those users -- 4. Aggregate the songs into recommendations aggregateSongsIntoRecommendations $ getTopSongsOfOtherUser topScrobbles =<< getOtherUsersWhoListenedToTheSameSongs topListeners =<< getUsersOwnTopScrobbles topScrobbles un
Since Haskell by default composes from right to left, when you break such a composition over multiple lines (in order to stay within a 80x24 box), it should be read bottom-up.
You can remedy this situation by importing the &
operator from Data.Function:
getRecommendations :: Map String [Scrobble] -> Map Int [User] -> String -> [Song] getRecommendations topScrobbles topListeners un = getUsersOwnTopScrobbles topScrobbles un >>= getOtherUsersWhoListenedToTheSameSongs topListeners >>= getTopSongsOfOtherUser topScrobbles & aggregateSongsIntoRecommendations
Notice that I've named each of the helper functions after the code comments that accompanied the previous incarnations of this function. If we consider code comments apologies for not properly organizing the code, we've now managed to structure it in such a way that those apologies are no longer required.
Conclusion #
If you accept the (perhaps preposterous) assumption that it's possible to fit the required data in persistent data structures, refactoring the recommendation algorithm to a pure function isn't that difficult. That's the pure part of a Recawr Sandwich. While I haven't shown the actual sandwich here, it's quite straightforward. You can find it in the tests in the Haskell Git repository. Also, once you've moved all the data collection to the boundary of the application, you may no longer need the SongService
type class.
I find the final incarnation of the code shown here to be quite attractive. While I've kept the helper functions private to the module, it's always an option to export them if you find that warranted. This could improve testability of the overall code base, albeit at the risk of increasing the surface area of the API that you have to maintain and secure.
There are always trade-offs to be considered. Even if you, eventually, find that for this particular example, the input data size is just too big to make this alternative viable, there are, in my experience, many other situations when this kind of architecture is a good solution. Even if the input size is a decent amount of megabytes, the simplification offered by an Impureim Sandwich may trump the larger memory footprint. As always, if you're concerned about performance, measure it.
This article concludes the overview of using an Recawr Sandwich to address the problem. Since it's, admittedly, a bit of a stretch to imagine running a program that uses terabytes (or more) of memory, we now turn to alternative architectures.
Next: Song recommendations from combinators.