Playing Poker with Elixir (pt. 4)

In our previous posts, we implemented a pair of processes to manage the state of a hand of poker. We did our best to handle expected errors, but our code isn't bullet-proof; for example, calling one of our GenServer's functions with a bad argument type could easily lead to a crash.

In any language, writing a significant program that handles every single edge case is close to impossible, but in Elixir, it's unnecessary. Software running on the Erlang VM is typically designed to embrace the possibility of failure, with the motto "Let it crash". Rather than handle every error case, we instead focus on recovering from the inevitable ones. Today, we'll update our application to handle failure in this way.

Common patterns in processes

Although Elixir gives us enough primitives to build this kind of fault tolerance from the ground up, we don't have to. Erlang comes bundled with a set of libraries and tools called OTP, and in Elixir we can take full advantage of them.

OTP defines a set of common patterns in processes. In fact, the generic servers we've been using come from it. As you'd expect, managing process failure is another common pattern. OTP implements the generic parts of these patterns - it's up to us to supply the application-specific logic by implementing a behaviour.

On our best behaviour

As I mentioned, we've already encountered behaviours when we implemented GenServers. Behaviours let a library module like GenServer call into our code by defining a set of functions that the library expects us to implement. These functions are called "callbacks". At compile time, Elixir checks that our module implements each callback specified by the behaviour and issues a warning if it doesn't.
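
To make this concrete, here's a minimal, hypothetical behaviour - Greeter is a made-up example, not part of our poker app - and a module that adopts it:

defmodule Greeter do
  # The "library" side declares the callback it expects.
  @callback greet(name :: String.t) :: String.t
end

defmodule FriendlyGreeter do
  @behaviour Greeter

  # If we left this function out, the compiler would warn us that the
  # greet/1 callback from the Greeter behaviour is missing.
  def greet(name), do: "Hello, #{name}!"
end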

If you look into the GenServer behaviour, however, you'll notice that it defines six callbacks, some of which we never explicitly implemented. So why didn't we get a compiler warning? It turns out Elixir's macros allow it to take behaviours a step further. When we added use GenServer to our code, we not only adopted the behaviour, but also gained default implementations for each of its callbacks. This lets us avoid writing a lot of boilerplate functions that we might not care about.
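
As a rough sketch of what use gives us, imagine our hypothetical Greeter behaviour also defined a __using__ macro that injects a default implementation:

defmodule Greeter do
  @callback greet(name :: String.t) :: String.t

  defmacro __using__(_opts) do
    quote do
      @behaviour Greeter

      # A default implementation that the adopting module can override.
      def greet(_name), do: "Hello, stranger!"

      defoverridable greet: 1
    end
  end
end

defmodule QuietGreeter do
  use Greeter
  # No greet/1 defined here - we rely on the default injected by
  # use Greeter, and get no compiler warning.
end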

The Supervisor behaviour

When a process crashes, any linked processes will terminate as well. Alternatively, a linked process can trap these "exit signals". In this case, the process will be sent a message instead. In OTP, the Supervisor behaviour is a process abstraction around these messages.
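
In raw Elixir, trapping exits looks something like this (a standalone sketch, not code from our application):

# With trap_exit set, a crash in a linked process arrives as a message
# instead of taking the current process down with it.
Process.flag(:trap_exit, true)

pid = spawn_link(fn -> exit(:boom) end)

receive do
  {:EXIT, ^pid, reason} ->
    IO.puts "linked process exited: #{inspect reason}"
end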

A supervisor starts linked processes and waits for these exit signals. When it receives one, it restarts the dead process. Using this behaviour will allow us to implement a supervisor process without worrying about the nitty-gritty details. We only need to specify what our child processes are and how we should react when they die.

Our first supervisor will be the top-level one. To start it, we'll use another OTP behaviour: Application. We won't go into too much detail on it here, but if you'd like to learn more, the Elixir getting started guide is a good place to look.

The Application behaviour expects us to define a start callback. In this callback, we can start the top-level supervisor for our entire program:

defmodule GenPoker do
  use Application

  def start(_type, _args) do
    import Supervisor.Spec

    children = [
      worker(Poker.Bank, [])
    ]

    opts = [strategy: :one_for_one, name: GenPoker.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

We also need to register our application module in mix.exs:

def application do
  [mod: {GenPoker, []}]
end

When mix starts up, it starts the application we registered in our Mix.Project. Our application's start callback then creates a list of specifications for its child processes. A child process can be a "worker" or a "supervisor". Note that a worker needs to be an OTP-compliant process, such as a GenServer. A vanilla process won't work - to be supervised, a process needs to handle some system-level messages sent by OTP.

We're using the Supervisor.Spec.worker/3 helper function to define the specification, passing the module and list of arguments. Our supervisor will call a start_link function on the specified module with the arguments given. After we've set up our specifications, we start our supervisor with Supervisor.start_link/2.

We configured our supervisor with a restart strategy. The strategy determines which processes the supervisor will restart when one dies. one_for_one will only restart the crashed process - it's ideal for situations where each supervised process is completely independent.

To test this all out, enter iex with iex -S mix:

iex(1)> Process.whereis(Poker.Bank)      
#PID<0.93.0>
iex(2)> Process.whereis(Poker.Bank) |> Process.exit(:kill)
true
iex(3)> Process.whereis(Poker.Bank)                       
#PID<0.97.0>

You can see that our bank process was already running after we started iex. Mix started our OTP application, which started the supervisor, which started our bank process. After we kill the process, our supervisor starts it back up with a new pid.

The table supervisor

Let's add in supervision for our table process. For simplicity, we'll only run one such process for our entire application. In the final program, we'll have many tables running concurrently.

If you recall from the previous post, the table and hand processes are related. If the table process dies while a hand is being played, the hand process can't continue. Because of this relationship, we'll group the hand and table processes together under one supervisor:

defmodule Poker.Table.Supervisor do
  use Supervisor

  def start_link(table_name, num_players) do
    Supervisor.start_link(__MODULE__, [table_name, num_players])
  end

  def init([table_name, num_players]) do
    children = [
      worker(Poker.Table, [table_name, num_players])
    ]

    supervise children, strategy: :one_for_one
  end
end

Here, we're pulling in the Supervisor behaviour with use Supervisor. Its init/1 callback should feel familiar from GenServer - it sets up the process. In this case, we set up a list of child specifications and supervision options rather than an initial state.

When we start the table, we pass the number of players and a table_name atom as arguments. The table will use the name to register the process locally. This is a necessary change: we can no longer rely on knowing the pid of our table process, since it will change each time it is restarted.
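
In Poker.Table, that change boils down to passing a name option to GenServer.start_link - something along these lines (the exact arguments may differ in the full module):

def start_link(table_name, num_players) do
  # Registering under table_name lets callers reach the process by name,
  # even after the supervisor restarts it under a new pid.
  GenServer.start_link(__MODULE__, num_players, name: table_name)
end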

We also need to add this new supervisor as a child of our previous one. We'll add a new child specification after the existing one in our top-level supervisor as follows:

def start(_type, _args) do
  import Supervisor.Spec

  children = [
    worker(Poker.Bank, []),
    supervisor(Poker.Table.Supervisor, [:table_one, 6])
  ]

  opts = [strategy: :one_for_one, name: GenPoker.Supervisor]
  Supervisor.start_link(children, opts)
end

Because this child process is a supervisor instead of a worker, we need to use the supervisor/3 function to create this specification.

Adding in the hand

We'd like the new table supervisor to manage our hand process as well, but there are some issues. We don't want our hand process to be started automatically by the supervisor - we need to wait for players to be ready. Also, when the hand finishes normally, we don't want the supervisor to restart it.

Luckily, the Supervisor module provides solutions for both problems. First, we'll add a new function, start_hand, to our table supervisor:

def start_hand(supervisor, table, players, config \\ []) do
  Supervisor.start_child(supervisor,
    supervisor(Poker.Hand.Supervisor, 
      [table, players, config], restart: :transient, id: :hand_sup
    )
  )
end

We're creating a new supervised process dynamically through Supervisor.start_child/2. This function takes the supervisor pid and a child specification as arguments and works pretty much as you'd expect. We give the child spec the restart: :transient option - this means the supervisor won't restart the process if it exits normally.

The child process is a simple supervisor for the hand:

defmodule Poker.Hand.Supervisor do
  use Supervisor

  def start_link(table, players, config) do
    Supervisor.start_link(__MODULE__, [table, players, config])
  end

  def init([table, players, config]) do
    hand_name = String.to_atom("#{table}_hand")
    children = [
      worker(Poker.Hand, [table, players, config, [name: hand_name]], restart: :transient)
    ]

    supervise children, strategy: :one_for_one
  end
end

We're registering the hand process for the same reason we registered the table; other than that, there's not much here. So why are we creating an extra level of supervision for the hand, rather than adding it directly to the table supervisor as a worker? We'll come back to this later.

Continuing on, we need to make a few other changes to our code for this to work. Here's the table supervisor's init:

def init([table_name, num_players]) do
  children = [
    worker(Poker.Table, [self, table_name, num_players])
  ]

  supervise children, strategy: :one_for_one
end

And the handle_call implementation for the table's deal message:

def handle_call(:deal, _from, state = %{hand_sup: nil}) do
  players = get_players(state) |> Enum.map(&(&1.id))

  case Poker.Table.Supervisor.start_hand(
    state.sup, state.table_name, players
  ) do
    {:ok, hand_sup} ->
      Process.monitor(hand_sup)
      {:reply, {:ok, hand_sup}, %{state | hand_sup: hand_sup}}
    error ->
      {:reply, error, state}
  end
end

In init, we're passing self (the supervisor's pid) to the table process. Later, when the table handles the deal message, we use it to start the hand.
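
On the table side, one way to make that pid available later is to stash it in the state during init - roughly like this, where the exact state fields are assumptions:

def start_link(sup, table_name, num_players) do
  GenServer.start_link(__MODULE__, [sup, table_name, num_players], name: table_name)
end

def init([sup, table_name, num_players]) do
  # Keep the supervisor's pid around so handle_call(:deal, ...) can use it
  # to start a hand later on. (The rest of the table's setup is omitted.)
  {:ok, %{sup: sup, table_name: table_name, num_players: num_players, hand_sup: nil}}
end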

We set up a monitor on the pid returned by start_hand. Remember that monitoring the hand process itself wouldn't work here; it can crash and be restarted before it finishes successfully. The extra level of supervision gives us a stable process id to monitor. When the supervisor finally exits, the hand has either finished successfully or crashed repeatedly.
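
When the hand supervisor does go down, the monitor delivers a :DOWN message, and the table can clear its stored hand_sup so a new hand can be dealt - a sketch of that handler (the real one may need to do more):

def handle_info({:DOWN, _ref, :process, hand_sup, _reason}, state = %{hand_sup: hand_sup}) do
  # The hand supervisor is gone - either the hand finished or it gave up.
  # Clearing hand_sup lets the :deal clause above match again.
  {:noreply, %{state | hand_sup: nil}}
end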

Saving state

By now, all our processes are supervised and will be restarted if they fail, but when they come back, they'll have a clean state. Although this is an improvement, it's still not enough. Currently, the table process owns an ETS table keeping track of the player balances. If the process crashes, the ETS table will be gone as well. Not the typical way you lose all your money playing poker!

Since one of the purposes of supervision is to handle errors when they occur, we can't rely on state that is kept solely inside the process. We'll make a small change to address this. Rather than creating the ETS table inside our table process, we'll create it in our supervisor's init callback. Because the supervisor now owns the ETS table, it will persist when the table process crashes.
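
In the table supervisor, the change might look something like this - the ETS table name and options here are assumptions, the key point being that the supervisor is the owner:

def init([table_name, num_players]) do
  # The supervisor owns the balances table, so it survives crashes of
  # Poker.Table. Making it named and public lets the table process use it.
  :ets.new(String.to_atom("#{table_name}_balances"), [:named_table, :public, :set])

  children = [
    worker(Poker.Table, [self, table_name, num_players])
  ]

  supervise children, strategy: :one_for_one
end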

We have to be a bit careful when preserving state in this way. If an invalid state in our ETS table is the cause of the crash, then the restarted process will probably just crash again.

The concept of "supervision trees" comes in handy here. If our table process crashes enough times, its supervisor will give up and terminate as well. A higher-level supervisor will have the opportunity to restart it. At that point, the ETS table would be recreated, clearing the invalid state that was causing the problem.
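
How many crashes count as "enough" is configurable per supervisor. With Supervisor.Spec, that means passing max_restarts and max_seconds to supervise - the values below are illustrative, not the ones used in our app:

# If Poker.Table crashes more than 3 times within 5 seconds, the table
# supervisor gives up and terminates, letting its own supervisor take over.
supervise children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5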

Wrapping up

Today, we added supervision to our fledgling poker application. We organized our processes into a supervision tree and preserved some of their state across crashes. As always, the code from this post is available on GitHub.

Next time, we'll hook up our existing application into Phoenix channels. See you then!