Media Storage and CDN with Phoenix and Elixir

Aaron D. Parks

May 13, 2022

In building lofi.limo, media storage and distribution naturally came up. I have songs, announcements, and background image loops which I want to store and distribute to listeners. Let's take a look at how I've been able to do both without getting too fancy or spending too much money.

Storing media

I store most of the data used for running lofi.limo — including metadata about songs, announcements, and backgrounds — in a PostgreSQL database. A relational database like PostgreSQL is handy on its own, but Ecto makes it really great to use with Elixir and Phoenix.

I could store the media data itself in PostgreSQL too, perhaps as a bytea field. In some applications this is very handy, but for mine I didn't expect to get much out of it. In fact, I wanted to support HTTP range requests which would be a little awkward to serve out of PostgreSQL.

Range requests are the same as regular HTTP requests except they ask that only a portion (typically a byte range or ranges) of the requested resource (a song's audio data, for instance) be included in the response. Web browsers may use range requests for audio media in particular so that they can get portions of the media data as it's played rather than all at once. A server that doesn't understand range requests will respond with the entire song or announcement, which can trip up some web browsers.
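As a sketch, a range request and its response look something like this (the file name, sizes, and byte ranges here are made up for illustration):

```http
GET /media/abcdefghijklmnopqrst.mp3 HTTP/1.1
Host: lofi.limo
Range: bytes=0-65535

HTTP/1.1 206 Partial Content
Content-Type: audio/mpeg
Content-Range: bytes 0-65535/4194304
Content-Length: 65536
```

The 206 Partial Content status and the Content-Range header tell the client it received just the slice it asked for, along with the total size of the resource.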

If I kept the media in separate files, I could set up Plug.Static to serve those files for me. Plug.Static already has good support for range requests, so I wouldn't have to write code to handle that myself.

My first thought was to use natural names for the media files, perhaps artist and title or something like that. But I didn't want to fuss around with keeping the file names synchronized with the metadata in the database or to deal with the confusion of having them usually-but-not-quite the same as what was in the database.

I decided instead to make the file names no more than a unique identifier — a key for the media data which I could link to the metadata in the database. A media record in the database keeps track of the media data file name and the media data MIME type.

The base file name is generated by randomly selecting twenty lower-case letters. I wanted the probability of collisions to be vanishingly remote, but didn't want to have unreasonably long file names. Twenty characters is a reasonable length in my opinion (shorter, anyway, than natural names would likely be) and gives just shy of twenty octillion possible file names. If I've done my math right, I can create one million media items and have little worse than a one in forty quadrillion chance of a collision. That's as close to “won't happen” as I need to get.
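If you'd like to check that arithmetic, the birthday-problem approximation P ≈ n(n − 1)/2N is enough. This little sketch (not part of the lofi.limo code, just a sanity check) runs the numbers:

```elixir
# Birthday-problem sanity check for twenty-letter random file names.
# N = 26^20 possible names; with n items, P(collision) ≈ n(n - 1) / (2N).
names = :math.pow(26, 20)        # just shy of twenty octillion
n = 1_000_000                    # one million media items
p = n * (n - 1) / (2 * names)

# p comes out around 2.5e-17, i.e. about one in forty quadrillion.
IO.puts("P(collision) ≈ #{p}")
```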

The full file name ends in an extension appropriate to the media type. Plug.Static (and maybe the browser) will use this to guess the type of data.

To create a new media item, I wrote a function which takes a path to a temporary file containing the media data, the MIME type of the media, and an appropriate file name extension for the media type. The function assembles a new file name as described above and copies the temporary file into a directory for storing and serving media. It then creates a database record for the media item with its new file name and MIME type.

Listing 1. Media creation.
(from the LofiLimo.Media module)
def create_media(path, type, extension) do
  # Build a twenty-character base name from random lower-case letters.
  base_name =
    fn -> Enum.random('abcdefghijklmnopqrstuvwxyz') end
    |> Stream.repeatedly()
    |> Enum.take(20)
    |> List.to_string()

  file_name = "#{base_name}.#{extension}"

  # The callback returns false so an existing file is never overwritten.
  with :ok <- File.cp(path, path(file_name), fn _, _ -> false end) do
    %Media{}
    |> Media.changeset()
    |> Changeset.put_change(:type, type)
    |> Changeset.put_change(:file_name, file_name)
    |> Repo.insert()
  end
end

defp path(name) do
  env = Application.get_env(:lofi_limo, __MODULE__)
  path = Keyword.fetch!(env, :storage)
  Path.join([path, name])
end

I get the path for the media storage directory from the application configuration. This makes it easy to set it up appropriately for development, staging, and production environments.
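The configuration itself might look something like this (the paths and values here are hypothetical, not lofi.limo's actual settings):

```elixir
# config/dev.exs (hypothetical values for illustration)
config :lofi_limo, LofiLimo.Media,
  storage: "/var/lib/lofi_limo/media",
  url: "http://localhost:4000/media"
```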

Backing up media

I use Tarsnap for online backup. I have a cron script which uses pg_dump to dump my important databases into a backup staging directory and then tarsnap to send that directory to the cloud. It also prunes old backups.

To back up the media data as well, I added a command to the script to copy the media files into the backup staging directory after the database dump. The script now looks something like this, but of course the real one is a bit more complex to deal with credentials, pruning, and such.

Listing 2. Backing up.
(from the backup script)
pg_dump -f ~/backup_staging/lofi_limo.sql lofi_limo
cp -Rp ~/lofi_limo/media ~/backup_staging
tarsnap -cf $(date +%s) backup_staging

Serving media

As I mentioned, Plug.Static has good support for range requests and it looked like it would be a nice way to serve my media files. I set it up in my endpoint module:

Listing 3. Plug.Static setup.
(from the LofiLimoWeb.Endpoint module)
plug Plug.Static,
  at: LofiLimo.Media.compile_url_path!(),
  from: LofiLimo.Media.compile_storage!(),
  cache_control_for_etags: "public, max-age=31536000"

I made a couple of convenience macros to get the base URL and base path for media from the compile-time configuration. The LofiLimo.Media module uses this same configuration to assemble the path for importing new media (as shown above) and to assemble URLs that are sent to the front-end:

Listing 4. Conveniences.
(from the LofiLimo.Media module)
defmacro compile_storage!() do
  quote do
    Application.compile_env!(:lofi_limo, [LofiLimo.Media, :storage])
  end
end

defmacro compile_url_path!() do
  quote do
    url = Application.compile_env!(:lofi_limo, [LofiLimo.Media, :url])
    URI.parse(url).path
  end
end

def file_url(media) do
  env = Application.get_env(:lofi_limo, __MODULE__)
  url = Keyword.fetch!(env, :url)
  "#{url}/#{media.file_name}"
end

Let's jump back to the plug definition in the endpoint. Plug.Static generates an ETag header for all of its responses. An entity tag is a value that changes when the resource changes; it's often implemented as a hash of the resource's content. The client can send this value in the If-None-Match header of a request for a resource it has in its cache. If the resource hasn't changed, the server can respond with a quick 304 Not Modified and the client can use its cached copy of the resource with confidence.
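A revalidation exchange looks something like this (the entity-tag value is made up for illustration):

```http
GET /media/abcdefghijklmnopqrst.mp3 HTTP/1.1
Host: lofi.limo
If-None-Match: "5f2b18c3a9"

HTTP/1.1 304 Not Modified
ETag: "5f2b18c3a9"
Cache-Control: public, max-age=31536000
```

The 304 response carries no body, so the only cost is the round trip itself.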

This is great for large media files, but we can do even better: we can help the client avoid the need to make even this quick request for cached media. By default, Plug.Static sets the Cache-Control header to "public". By adding a max age directive, we can let the client (and CDN — more on that soon) know that it can cache the media and use it without revalidation for a given amount of time. I set this amount of time to about one year because the media files are essentially immutable — that is, if I replace one it will get a new URL anyway.

I think any responses provided by Plug.Static for media will be cacheable even without the public directive, but I don't think it hurts to leave it there either. I see the must-revalidate directive shown in a lot of examples. I left it out in this case since it shouldn't come up much (with such a long max-age) and since in the unlikely event that it does, I'd actually rather have the client or CDN use a cached copy if they're temporarily unable to contact my server to revalidate the entity tag.

Content Distribution Network

I didn't use a CDN back when I was iterating on lofi.limo with just a small group of listeners and artists, but I figured if things went well I'd want to add one. With that in mind, I tried to arrange things so it would be easy to add later.

Having a content distribution network handle media delivery for me helps reduce bandwidth usage by my origin server and reduces latency for listeners, especially those who are overseas.

My origin server lives at a local data center and I'm not very close to a major Internet hub, so bandwidth here in my town is more costly than it would be in someplace like Chicago. If I can buy most of my bandwidth from a CDN whose servers are better-located, I can save some money.

Because a content delivery network tries to handle requests using servers that are close to the client, it also reduces the time required to serve those requests. This gives my listeners a better experience since they don't have to wait as long for the next song to start playing or for a new background to load.

When a client requests a resource through a content distribution network, the client must first resolve the resource's host name using the domain name system (DNS). The CDN will have servers around the world ready to handle requests for this host name and their DNS will select the address of one that is close to the client. This helps spread the load of a large number of requests across the CDN's fleet of servers and it also reduces the time it takes to serve client requests because they don't have to travel as far across the network.

When the client contacts the CDN server and makes its request, hopefully the CDN server will have recently served the requested resource and will still have a copy of it handy in its cache. If so, the CDN server can respond immediately, without involving the origin server. If the resource is not in the CDN server's cache, the CDN server will forward the request to the origin server and pass the response along to the client while storing a copy in its cache.

I ran a traceroute to the host name my CDN assigned me from a client in Oregon and from a client in Virginia so we can see that a different address is resolved by DNS for each client and that these addresses belong to servers near the clients.

Trace 1. From Oregon.
debian@d2-2-us-west-or-1-3:~$ traceroute -q 1 lofi-limo.b-cdn.net
traceroute to lofi-limo.b-cdn.net (143.244.49.187)
 1  15.204.28.1 (15.204.28.1)  4.362 ms
 2  192.168.250.254 (192.168.250.254)  4.339 ms
 3  10.142.86.126 (10.142.86.126)  4.349 ms
 4  10.142.86.0 (10.142.86.0)  4.346 ms
 5  10.142.64.4 (10.142.64.4)  4.318 ms
 6  10.244.17.88 (10.244.17.88)  5.560 ms
 7  10.244.72.12 (10.244.72.12)  4.284 ms
 8  *
 9  be101.pdx-pdx02-sbb1-nc5.oregon.us (142.44.208.226)  5.508 ms
10  be101.pdx-pdx02-sbb1-nc5.oregon.us (142.44.208.226)  5.483 ms
11  be100-1365.lax-la1-bb1-a9.ca.us (198.27.73.104)  48.567 ms
12  be100-1365.lax-la1-bb1-a9.ca.us (198.27.73.104)  38.934 ms
13  *
14  unn-143-244-49-187.datapacket.com (143.244.49.187)  30.280 ms
Trace 2. From Virginia.
debian@d2-2-us-east-va-1-1:~$ traceroute -q 1 lofi-limo.b-cdn.net
traceroute to lofi-limo.b-cdn.net (185.93.1.247)
 1  135.148.101.1 (135.148.101.1)  45.975 ms
 2  192.168.250.254 (192.168.250.254)  45.945 ms
 3  10.142.1.126 (10.142.1.126)  45.927 ms
 4  10.142.0.40 (10.142.0.40)  45.911 ms
 5  10.142.0.10 (10.142.0.10)  45.893 ms
 6  10.244.6.56 (10.244.6.56)  47.067 ms
 7  10.244.64.134 (10.244.64.134)  45.861 ms
 8  10.244.120.4 (10.244.120.4)  48.186 ms
 9  *
10  be100-1317.chi-5-a9.il.us (198.27.73.88)  65.665 ms
11  *
12  vl211.chi-cs1-core-1.cdn77.com (185.229.188.46)  65.630 ms
13  vl202.chi-cs1-dist-1.cdn77.com (138.199.0.235)  67.968 ms
14  unn-185-93-1-247.datapacket.com (185.93.1.247)  69.734 ms

We can see from Trace 1 and Trace 2 that the addresses resolved by DNS are different: 143.244.49.187 from Oregon and 185.93.1.247 from Virginia. We can also see that the address resolved for the client in Oregon belongs to a server in Los Angeles (notice the “lax” and “la” in lax-la1-bb1-a9.ca.us) and that the address resolved for the client in Virginia belongs to a server in Chicago (notice the “chi” in chi-cs1-dist-1.cdn77.com). Both of these cities are major Internet hubs where bandwidth is inexpensive and they're both close to their respective clients. Neat!

Setting up a CDN is usually pretty straightforward — or at least most providers try to make it easy. In my case, after signing up for an account I created a “pull zone.” From the CDN's point of view, a pull zone is a mapping between a host name that they will serve requests for and the host name of an origin server their servers should contact if they don't have a requested resource.

For example: I set up a pull zone called lofi-limo to which my CDN provider assigned the hostname lofi-limo.b-cdn.net. I configured the pull zone to use my origin server at lofi.limo. Next I configured my LofiLimo.Media module to assemble URLs using lofi-limo.b-cdn.net as the host name instead of lofi.limo. Now client requests for media go to the CDN's servers instead of mine.

Listing 5. Configuration.
(from the prod.exs script)
cdn = System.get_env("CDN") || raise "CDN not set"

config :lofi_limo, LofiLimo.Media, url: "//#{cdn}/media"
Listing 6. Startup.
(from the lofi_limo.sh start-up script)
export CDN=lofi-limo.b-cdn.net

In addition to mapping a CDN hostname to an origin server, the pull zone will usually have some other configuration options. A common option is to allow the Cache-Control directives supplied by the origin server to be overridden — handy if you're not able to tune them quite how you'd like.

Different providers offer different options. The one I'm currently using also provides automatic retries to the origin server, two-layer caching to further reduce origin server bandwidth, rate limiting, and IP blocking, among others. If you can imagine a handy feature, you can likely find a CDN that provides it.

Since the origin server won't be seeing every client request anymore, most CDNs provide logging and aggregate statistics so their customers can keep an eye on how things are going. I'm not a creep, so I don't have a lot of use for detailed logs, but I do find it handy to watch some of the aggregate statistics. In particular, I like to look at origin server response time and non-2xx responses from the origin server: an increase in either is likely a sign of trouble that I should dig into. I also like to look at bandwidth served, since this gives me an idea of my costs. Finally, I keep an eye on my cache hit rate, which helps me understand whether I've got everything set up well.

Conclusion

I hope that you've enjoyed reading this article as much as I've enjoyed writing it and that it has given you some helpful ideas for your own projects!

You might like to discuss this article at Hacker News or at Lobsters.

If you have any questions, comments, or corrections please don't hesitate to drop me a line.

Aaron D. Parks
Parks Digital LLC
(517) 816-3363
support@parksdigital.com