Handling IO failure

Let’s talk a bit about IO programming. Filesystem, network, GUI… You cannot write useful code without doing IO these days. So why is it so damn hard to do in “safe” languages like Haskell?

Well, in Haskell, you isolate the unsafe parts so that you can reason safely about the rest of the code. What does “unsafe” mean in that context? Mostly, unsafe code is unpredictable and non-deterministic, and it will fail for a number of reasons that have nothing to do with your program’s code.

Surely, you might think “come on, it cannot be that hard, I’m doing HTTP requests everywhere in my code, and there is no problem”. Well, let’s see how a simple HTTPS request could fail:

  • you are disconnected from the network (you have no IP)
  • you are connected to the network, but it is not connected to anything
  • your network is connected to the Internet, but routers are dropping packets
  • your network is connected to the Internet, but very slow
  • your DNS server is unreachable
  • your DNS server drops your packets
  • your DNS server cannot parse your request
  • your DNS server cannot contact other servers to get your answer
  • your DNS server sends back an invalid response
  • your DNS server sends back an outdated response
  • you cannot reach the web server’s IP from your network
  • the web server drops your packets silently before connecting
  • the web server connects, then drops the connection silently
  • the web server rejects your connection
  • the web server cannot parse your packets, and so rejects them
  • the web server times out
  • the server’s certificate is expired
  • the server’s certificate is not for the right subject name
  • the server’s certificate chain has parts missing
  • the server’s certificate chain has an unknown root
  • the server’s certificate was revoked
  • the packet’s signatures are invalid
  • your user agent and the server do not support the same versions of TLS
  • your user agent and the server do not have common cipher suites
  • the web server closes the connection without warning
  • the web server times out
  • the web server crashes
  • the web server cannot parse your HTTP request and rejects it
  • your request is too large
  • the web server parses your HTTP request correctly, but your cookie or OAuth token is invalid
  • the data you requested does not exist
  • the data you requested is elsewhere
  • your user agent does not support the MIME type of the data
  • the data requested is too large for a simple response
  • the server only sends a part of the data, then drops the connection
  • your user agent cannot parse the response
  • your user agent can parse the data, but one way or another, it is invalid

If you have worked with networks for some time, all of those have probably happened to you at some point (and the list is nowhere near exhaustive). What did you do in your code? Did you handle all these exceptions? Did you catch all the exceptions (see what I did there)? Did you check all the error codes? Did you retry the requests where you needed to?

Let’s face it: most of the network handling code out there is made of big chunks of procedural code, without much error handling, in blocking mode. In most cases, it is ok. But that is sloppy programming.

Safe languages do not let you get away with sloppy code like that. So we are stuck between correct but overly complex code and simple but failing code. Choose your weapon.

Personally, I prefer isolating unsafe code in asynchronous systems like futures or actors. I know failure will happen, I know threads will crash, I know I will make errors in my code. That is ok, it happens. So, let’s write robust code to handle failure.
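As a rough illustration of that isolation, here is a minimal Python sketch using the standard `concurrent.futures` module. The `flaky_fetch` function and the example hosts are hypothetical and only simulate a failing request; the point is that the failure stays inside the future until the caller decides to look at it.

```python
from concurrent.futures import ThreadPoolExecutor


def flaky_fetch(url):
    """Hypothetical stand-in for a real request: fails for one made-up host."""
    if "down.example" in url:
        raise ConnectionError(f"cannot reach {url}")
    return f"response from {url}"


def fetch_all(urls):
    """Run every fetch in a worker thread and report the outcome per URL."""
    outcomes = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {url: pool.submit(flaky_fetch, url) for url in urls}
        for url, future in futures.items():
            # exception() waits for the task and returns the error, if any,
            # instead of raising it in the caller's thread.
            error = future.exception()
            if error is None:
                outcomes[url] = future.result()
            else:
                outcomes[url] = f"failed: {error}"
    return outcomes
```

A crash in one task does not take the others down with it: each URL ends up with either a response or a recorded failure, and the caller chooses what to do with both.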

For network errors, I just want to know if the server is unreachable. It is ok, I will try later. If my request’s authentication is rejected, I want to know, and must handle that failure. Some errors should be handled seriously, others must be put in the “ok, it failed, whatever” bin.
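One way to sketch that triage in Python: sort failures into “retry later” and “must be handled”, and only loop on the first kind. The exception classes, the `call_with_retry` helper and its parameters are all made up for this example; they are not part of any library.

```python
import time


class Unreachable(Exception):
    """Hypothetical: transient failure, worth retrying later."""


class AuthRejected(Exception):
    """Hypothetical: credentials refused, retrying will not help."""


def call_with_retry(request, attempts=3, delay=0.0):
    """Call request(); retry only on Unreachable, let everything else escape."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except Unreachable:
            if attempt == attempts:
                raise  # out of attempts: report the failure to the caller
            time.sleep(delay * attempt)  # wait a bit longer each time
```

`AuthRejected` is deliberately not caught here, so it reaches the caller on the first attempt, while an unreachable server only becomes the caller’s problem once the retries are used up.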

Even if languages like Haskell make it harder to perform IO, they are still good tools, because they let you isolate the unsafe parts and reason about the safe, deterministic parts of the program.

P.S.: ok, the network case was maybe a bit too much. Surely, filesystem usage will be easier? Just for fun, let’s list some possible failures when you want to open a file for reading and writing:

  • invalid path
  • correct path, but you do not have the permission
  • correct path, you have the permission, but the file does not exist
  • you do not have the permission to create the file
  • you check that the file does not exist, then you try to create it, but someone already created it in the meantime (fun security bug, that one)
  • the file exists, but someone is already writing to it, and concurrent access is not allowed
  • you have the handle you want on the file, but someone just deleted it
  • not enough file descriptors available (oh, please, no)
  • someone is writing to the file at the same time
  • there are so many page faults that your program is slowed down
  • the disk is slow, blocking on a large operation
  • the disk is full
  • you checked that you have enough room, but someone is filling the disk at the same time
  • the file is on a networked file system, and it is slow
  • the file is on a remote disk, and the network just failed
  • hardware failure in the disk
  • hardware failure in the RAID array (and for some reason, redundancy was not enough, you lost the data)
  • the file is on a USB drive that someone just unplugged
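
The check-then-create race in that list has a well-known way out: ask the operating system to create the file and to fail if it already exists, in a single step. A minimal Python sketch, using the built-in `open` in exclusive-creation mode (`"x"`); the `create_exclusively` helper name is made up for this example.

```python
def create_exclusively(path, content):
    """Create path and write content; return False if the file already exists.

    Mode "x" asks for exclusive creation: there is no gap between the
    existence check and the creation for someone else to sneak into.
    """
    try:
        with open(path, "x") as handle:
            handle.write(content)
        return True
    except FileExistsError:
        return False
```

This only closes that one race. Permission errors, a full disk and the rest of the list still surface as other `OSError` subclasses, which this sketch deliberately lets through.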

Basically, IO is a nightmare. Please wake me up now.
