Aaron D. Parks
May 13, 2022
In building lofi.limo, media storage and distribution naturally came up. I have songs, announcements, and background image loops which I want to store and distribute to listeners. Let's take a look at how I've been able to do both without getting too fancy or spending too much money.
I store most of the data used for running lofi.limo — including metadata about songs, announcements, and backgrounds — in a PostgreSQL database. A relational database like PostgreSQL is handy on its own, but Ecto makes it really great to use with Elixir and Phoenix.
I could store the media data itself in PostgreSQL too, perhaps as a bytea field. In some applications this is very handy, but for mine I didn't expect to get much out of it. In fact, I wanted to support HTTP range requests, which would be a little awkward to serve out of PostgreSQL.
Range requests are the same as regular HTTP requests except they ask that only a portion (typically a byte range or ranges) of the requested resource (a song's audio data, for instance) be included in the response. Web browsers may use range requests for audio media in particular so that they can get portions of the media data as it's played rather than all at once. A server that doesn't understand range requests will respond with the entire song or announcement, which can trip up some web browsers.
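For illustration, the exchange for part of a song might look roughly like this (the file name and sizes here are made up):

    GET /media/qkzvnwhplmdytrbcafsg.mp3 HTTP/1.1
    Host: lofi.limo
    Range: bytes=0-524287

    HTTP/1.1 206 Partial Content
    Content-Type: audio/mpeg
    Content-Range: bytes 0-524287/4194304
    Content-Length: 524288

The 206 Partial Content response carries just the first 512 KiB of the file; the browser can ask for later ranges as playback progresses.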
If I kept the media in separate files, I could set up Plug.Static to serve those files for me. Plug.Static already has good support for range requests, so I wouldn't have to write code to handle that myself.
My first thought was to use natural names for the media files, perhaps artist and title or something like that. But I didn't want to fuss around with keeping the file names synchronized with the metadata in the database or to deal with the confusion of having them usually-but-not-quite the same as what was in the database.
I decided instead to make the file names no more than a unique identifier — a key for the media data which I could link to the metadata in the database. A media record in the database keeps track of the media data file name and the media data MIME type.
The base file name is generated by randomly selecting twenty lower-case letters. I wanted the probability of collisions to be vanishingly remote, but didn't want to have unreasonably long file names. Twenty characters is a reasonable length in my opinion (shorter, anyway, than natural names would likely be) and gives just shy of twenty octillion possible file names. If I've done my math right, I can create one million media items and have little worse than a one in forty quadrillion chance of a collision. That's as close to “won't happen” as I need to get.
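(To spell out the arithmetic: twenty lower-case letters give 26^20, or roughly 2 × 10^28, possible names. By the usual birthday-problem approximation, the chance of any collision among n random names is about n² / (2 × 26^20); with n = 1,000,000 that works out to around 2.5 × 10^-17, or one in forty quadrillion.)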
The full file name ends in an extension appropriate to the media type. Plug.Static (and maybe the browser) will use this to guess the type of data.
To create a new media item, I wrote a function which takes a path to a temporary file containing the media data, the MIME type of the media, and an appropriate file name extension for the media type. The function assembles a new file name as described above and copies the temporary file into a directory for storing and serving media. It then creates a database record for the media item with its new file name and MIME type.
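A minimal sketch of such a function, assuming a hypothetical MediaItem schema (the real code handles errors and a few other details):

    # Sketch only — schema and helper names are illustrative, not the real code.
    defmodule LofiLimo.Media do
      alias LofiLimo.{Repo, MediaItem}

      def create_media(tmp_path, mime_type, extension) do
        file_name = random_base_name() <> extension

        # Copy the uploaded temporary file into the media storage directory.
        File.cp!(tmp_path, Path.join(media_base_path(), file_name))

        # Record the file name and MIME type so the metadata can link to it.
        %MediaItem{}
        |> MediaItem.changeset(%{file_name: file_name, mime_type: mime_type})
        |> Repo.insert()
      end

      # Twenty random lower-case letters, e.g. "qkzvnwhplmdytrbcafsg".
      defp random_base_name do
        for _ <- 1..20, into: "", do: <<Enum.random(?a..?z)>>
      end

      # media_base_path/0 reads the storage directory from the application
      # configuration; it's shown later in this article.
    end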
I get the path for the media storage directory from the application configuration. This makes it easy to set it up appropriately for development, staging, and production environments.
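Something like this, with configuration keys I've invented for the sketch:

    # config/dev.exs — a scratch directory is fine in development
    config :lofi_limo, :media,
      base_path: Path.expand("tmp/media"),
      base_url: "http://localhost:4000/media"

    # config/prod.exs
    config :lofi_limo, :media,
      base_path: "/srv/lofi_limo/media",
      base_url: "https://lofi.limo/media"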
I use Tarsnap for online backup. I have a cron script which uses pg_dump to dump my important databases into a backup staging directory and then tarsnap to send that directory to the cloud. It also prunes old backups.
To back up the media data as well, I added a command to the script to copy the media files into the backup staging directory after the database dump. The script now looks something like this, but of course the real one is a bit more complex to deal with credentials, pruning, and such.
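A stripped-down sketch along those lines (directory, database, and archive names are placeholders):

    #!/bin/sh
    # Nightly backup: dump the databases, copy the media, ship it all to Tarsnap.
    STAGING=/var/backups/staging
    mkdir -p "$STAGING"

    # Dump the important databases into the staging directory.
    pg_dump lofi_limo > "$STAGING/lofi_limo.sql"

    # Copy the media files alongside the database dumps.
    rsync -a --delete /srv/lofi_limo/media/ "$STAGING/media/"

    # Send the staging directory to the cloud.
    tarsnap -c -f "backup-$(date +%Y%m%d)" "$STAGING"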
As I mentioned, Plug.Static has good support for range requests and it looked like it would be a nice way to serve my media files. I set it up in my endpoint module:
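Roughly like this — a sketch rather than the real module, with the options reconstructed from the discussion below:

    # In the endpoint, before the router.
    # media_base_path/0 is a compile-time convenience macro described next.
    import LofiLimo.Media, only: [media_base_path: 0]

    plug Plug.Static,
      at: "/media",
      from: media_base_path(),
      # Media files are effectively immutable, so clients (and the CDN) may
      # reuse cached copies for about a year without revalidating.
      cache_control_for_etags: "public, max-age=31536000"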
I made a couple of convenience macros to get the base URL and base path for media from the compile-time configuration. The LofiLimo.Media module uses this same configuration to assemble the path for importing new media (as shown above) and to assemble URLs that are sent to the front-end:
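A sketch of how those might look, sitting near the top of LofiLimo.Media so the functions that follow can use them (the configuration keys match the earlier sketch and are my guesses):

    # Compile-time configuration: storage directory and public base URL.
    @media_config Application.compile_env!(:lofi_limo, :media)

    defmacro media_base_path do
      Keyword.fetch!(@media_config, :base_path)
    end

    defmacro media_base_url do
      Keyword.fetch!(@media_config, :base_url)
    end

    # URL handed to the front-end for a media record, e.g.
    # "https://lofi.limo/media/qkzvnwhplmdytrbcafsg.mp3"
    def media_url(%{file_name: file_name}) do
      media_base_url() <> "/" <> file_name
    end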
Let's jump back to the plug definition in the endpoint.
Plug.Static generates an ETag header for all of its responses. An entity tag is a value that will change when the resource changes. It's often implemented as a hash of the resource's content. The client can give this value in the If-None-Match header of a request for a resource it has in its cache. If the resource hasn't changed, the server can respond with a quick 304 Not Modified and the client can use its cached copy of the resource with confidence.
This is great for large media files, but we can do even better: we can help the client avoid the need to make even this quick request for cached media.
By default, Plug.Static sets the Cache-Control header to "public". By adding a max-age directive, we can let the client (and CDN — more on that soon) know that it can cache the media and use it without revalidation for a given amount of time. I set this amount of time to about one year because the media files are essentially immutable — that is, if I replace one it will get a new URL anyway.
I think any responses provided by Plug.Static for media will be cacheable even without the public directive, but I don't think it hurts to leave it there either. I see the must-revalidate directive shown in a lot of examples. I left it out in this case since it shouldn't come up much (with such a long max-age) and since, in the unlikely event that it does, I'd actually rather have the client or CDN use a cached copy if they're temporarily unable to contact my server to revalidate the entity tag.
I didn't use a CDN back when I was iterating on lofi.limo with just a small group of listeners and artists, but I figured if things went well I'd want to add one. With that in mind, I tried to arrange things so it would be easy to add later.
Having a content distribution network handle media delivery for me helps reduce bandwidth usage by my origin server and reduces latency for listeners, especially those who are overseas.
My origin server lives at a local data center and I'm not very close to a major Internet hub, so bandwidth here in my town is more costly than it would be in someplace like Chicago. If I can buy most of my bandwidth from a CDN whose servers are better-located, I can save some money.
Because a content delivery network tries to handle requests using servers that are close to the client, it also reduces the time required to serve those requests. This gives my listeners a better experience since they don't have to wait as long for the next song to start playing or for a new background to load.
When a client requests a resource through a content distribution network, the client must first resolve the resource's host name using the domain name system (DNS). The CDN will have servers around the world ready to handle requests for this host name and their DNS will select the address of one that is close to the client. This helps spread the load of a large number of requests across the CDN's fleet of servers and it also reduces the time it takes to serve client requests because they don't have to travel as far across the network.
When the client contacts the CDN server and makes its request, hopefully the CDN server will have recently served the requested resource and will still have a copy of it handy in its cache. If so, the CDN server can respond immediately, without involving the origin server. If the resource is not in the CDN server's cache, the CDN server will forward the request to the origin server and pass the response along to the client while storing a copy in its cache.
I ran a traceroute to the host name my CDN assigned me from a client in Oregon and from a client in Virginia so we can see that a different address is resolved by DNS for each client and that these addresses belong to servers near the clients.
We can see from Trace 1 and Trace 2 that the addresses resolved by DNS are different: 143.244.49.187 from Oregon and 185.93.1.247 from Virginia. We can also see that the address resolved for the client in Oregon belongs to a server in Los Angeles (notice the “lax” and “la” in lax-la1-bb1-a9.ca.us) and that the address resolved for the client in Virginia belongs to a server in Chicago (notice the “chi” in chi-cs1-dist-1.cdn77.com). Both of these cities are major Internet hubs where bandwidth is inexpensive and they're both close to their respective clients. Neat!
Setting up a CDN is usually pretty straightforward — or at least most providers try to make it easy. In my case, after signing up for an account I created a “pull zone.” From the CDN's point of view, a pull zone is a mapping between a host name that they will serve requests for and the host name of an origin server their servers should contact if they don't have a requested resource.
For example: I set up a pull zone called lofi-limo to which my CDN provider assigned the hostname lofi-limo.b-cdn.net. I configured the pull zone to use my origin server at lofi.limo. Next I configured my LofiLimo.Media module to assemble URLs using lofi-limo.b-cdn.net as the host name instead of lofi.limo. Now client requests for media go to the CDN's servers instead of mine.
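With the base URL living in configuration as sketched earlier, that switch is a one-line change:

    # config/prod.exs — front-end media URLs now point at the pull zone
    config :lofi_limo, :media,
      base_path: "/srv/lofi_limo/media",
      base_url: "https://lofi-limo.b-cdn.net/media"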
In addition to mapping a CDN hostname to an origin server, the pull zone will usually have some other configuration options. A common option is to allow the Cache-Control directives supplied by the origin server to be overridden — handy if you're not able to tune them quite how you'd like.
Different providers offer different options. The one I'm currently using also provides options for automatic retries to the origin server, two-layer caching to further reduce origin server bandwidth, rate limiting, and IP blocking, among others. If you can imagine a handy feature, you can likely find a CDN that provides it.
Since the origin server won't be seeing every client request anymore, most CDNs provide logging and aggregate statistics so their customers can keep an eye on how things are going. I'm not a creep, so I don't have a lot of use for detailed logs. But I do find it handy to keep an eye on some of the aggregate statistics. In particular, I like to look at origin server response time and non-2xx responses from the origin server — an increase in either of these is likely a sign of trouble that I should dig into. I also like to look at bandwidth served since this gives me an idea of my costs. Finally, I keep an eye on my cache hit rate: this helps me understand whether I've got everything set up well.
I hope that you've enjoyed reading this article as much as I've enjoyed writing it and that it has given you some helpful ideas for your own projects!
You might like to discuss this article at Hacker News or at Lobsters.
If you have any questions, comments, or corrections please don't hesitate to drop me a line.
Aaron D. Parks