Extracting and analysing data in Haskell and Python.

I was recently following a course in mathematical analysis and probability for computer scientists. One assignment asked to analyze a small CSV file with data collected in a student survey. The course contained a mix of pure maths and practical application, and the official programming language to be used was Python. It was understood that one was to do the work in Python, but it wasn't an explicit requirement, and I was so tired that didn't have the energy for it.

I can get by in Python, but it's not a language I'm actually comfortable with. For small experiments, ad-hoc scripting, etc. I reach for Haskell, so that's what I did.

This was a few months ago, and I've since followed another course that required more intense use of Python. With a few more months of Python programming under my belt, I decided to revisit that old problem and do it in Python with the explicit purpose of comparing and contrasting the two.

Static or dynamic types for scripting #

I'd like to make one point with these articles, and that is that dynamically typed languages aren't inherently better suited for scripting than statically typed languages. From this, it does not, however, follow that statically typed languages are better, either. Rather, I increasingly believe that whether you find one or the other more productive is a question of personality, past experiences, programming background, etc. I've been over this ground before. Many of my heroes seem to favour dynamically typed languages, while I keep returning to statically typed languages.

For more than a decade I've preferred F# or Haskell for ad-hoc scripting. Note that while these languages are statically typed, they are low on ceremony. Types are inferred rather than declared. This means that for scripts, you can experiment with small code blocks, iteratively move closer to what you need, just as you would with a language like Python. Change a line of code, and the inferred type changes with it; there are no type declarations that you also need to fix.

When I talk about writing scripts in statically typed languages, I have such languages in mind. I wouldn't write a script in C#, C, or Java.

"Let me stop you right there: I don't think there is a real dynamic typing versus static typing debate.

"What such debates normally are is language X vs language Y debates (where X happens to be dynamic and Y happens to be static)."

The present articles compare Haskell and Python, so be careful that you don't extrapolate and draw any conclusions about, say, C++ versus Erlang.

When writing an ad-hoc script to extract data from a file, it's important to be able to experiment and iterate. Load the file, inspect the data, figure out how to extract subsets of it (particular columns, for example), calculate totals, averages, etc. A REPL is indispensable in such situations. The Haskell REPL (called Glasgow Haskell Compiler interactive, or just GHCi) is the best one I've encountered.

I imagine that a Python expert would start by reading the data to slice and dice it various ways. We may label this a data-first approach, but be careful not to read too much into this, as I don't really know what I'm talking about. That's not how my mind works. Instead, I tend to take a types-first approach. I'll look at the data and start with the types.

The assignment #

The actual task is the following. At the beginning of the course, the professors asked students to fill out a survey. Among the questions asked was which grade the student expected to receive, and how much experience with programming he or she already had.

Grades are given according to the Danish academic scale: -3, 00, 02, 4, 7, 10, and 12, and experience level on a simple numeric scale from 1 to 7, with 1 indicating no experience and 7 indicating expert-level experience.

Here's a small sample of the data:

No,3,2,6,6
No,4,2,3,7
No,1,12,6,2
No,4,10,4,3
No,3,4,4,6

The expected grade is in the third column (i.e. 2, 2, 12, 10, 4) and the experience level is in the fourth column (6,3,6,4,4). The other columns are answers to different survey questions. The full data set contains 38 rows.

The assignment poses the following questions: Two rows from the survey data are randomly selected. What is the probability mass function (PMF) of the sum of their expected grades, and what is the PMF of the absolute difference between their programming experience levels?

In both cases I was also asked to plot the PMFs.

Comparisons #

As outlined above, I originally wrote a Haskell script to answer the questions, and only months later returned to the problem to give it a go in Python. When reading my detailed walkthroughs, keep in mind that I have 8-9 years of Haskell experience, and that I tend to 'think in Haskell', while I have only about a year of experience with Python. I don't consider myself proficient with Python, so the competition is rigged from the outset.

For this small task, I don't think that there's a clear winner. I still like my Haskell code the best, but I'm sure someone better at Python could write a much cleaner script. I also have to admit that Matplotlib makes it a breeze to produce nice-looking plots with Python, whereas I don't even know where to start with that with Haskell.

Recently I've done some more advanced data analysis with Python, such as random forest classification, principal component analysis, KNN-classification, etc. While I understand that I'm only scratching the surface of data science and machine learning, it's obvious that there's a rich Python ecosystem for that kind of work.

Conclusion #

This lays the foundations for comparing a small Haskell script with an equivalent Python script. There's no scientific method to the comparison; it's just me doing the same exercise twice, a bit like I'd do katas with multiple variations in order to learn.

While I still like Haskell better than Python, that's only a personal preference. I'm deliberately not declaring a winner.

One point I'd like to make, however, is that there's nothing inherently better about a dynamically typed language when it comes to ad-hoc scripting. Languages with strong type inference work well, too.

Next: Extracting data from a small CSV file with Haskell.



Wish to comment?

You can add a comment to this post by sending me a pull request. Alternatively, you can discuss this post on Twitter or somewhere else with a permalink. Ping me with the link, and I may respond.

Published

Monday, 05 February 2024 07:53:00 UTC

Tags



"Our team wholeheartedly endorses Mark. His expert service provides tremendous value."
Hire me!
Published: Monday, 05 February 2024 07:53:00 UTC