Fedicache

There’s a well-known design flaw in Mastodon where links included in posts are “unfurled”—that is, Mastodon will visit the link and fetch the page metadata (title, description, a preview image, etc.). That’s fine, but the flawed part is that despite posts being federated, the fetched link metadata is not, which means that every server that receives a copy of a post with a link is going to make its own individual request for that same metadata.

A post from someone with a lot of followers, or one that’s boosted quite a bit, can result in thousands of separate requests for the same data from different Mastodon servers. And these requests are all placed within seconds of one another (because federation is very efficient). The phenomenon is called the Mastodon Stampede, which is cute, but it doesn’t feel very cute when your site gets knocked down by a thousand servers that are banging away on it for no other reason than to grab a little optional information about a link.

The issue has been discussed at length, with some people claiming that the Mastodon Stampede is effectively a DDoS attack against small websites, while others wave away any concerns with a “just use a proper cache on your server, bro.”

Ideally, the originating server would fetch the page metadata and federate it along with the rest of the post. Problem solved. I’ve heard that the current design is intentional and is supposed to prevent any tampering with the page metadata, but that seems silly given that a bad actor could just post a malicious link that already comes with its own fake metadata anyway.

But, whatever. Mastodon behaves the way it behaves, so we have to deal with it—but we can deal with it far more efficiently, I think. The act of fetching OpenGraph or other page metadata requires fetching the entire page by design. In a perfect world we’d have some other slick way to request and serve page metadata requests without serving the entire page (kind of like an HTTP HEAD request, but for <head> content instead of HTTP headers), but nothing like that exists.

So, let’s make it!

Since we can identify Mastodon and other compatible ActivityPub implementations by their User-Agent headers, we can handle their requests uniquely. The omg.lol web server runs Caddy, so here’s how I do that in Caddy:

@fedi {
  header_regexp User-Agent (Mastodon|Pleroma|Akkoma|Misskey|Firefish|gotosocial)
}

handle @fedi {
  root * /path/to/fedicache
  php_fastcgi unix//run/php/php8.2-fpm.sock {
    index index.php
  }
}
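
To check that the matcher is doing its job, you can fake a fediverse User-Agent with curl (hypothetical domain and path here; any User-Agent string containing one of the matched names will do) and look for the Fedicache: Active response header that the script below adds:

curl -s -D - -o /dev/null -A "Mastodon/4.2.0 (test)" https://example.com/some/post/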

This takes any request from an ActivityPub server and sends it to my Fedicache script, which looks like this:

<?php

// Fedicache

// Where cached files are stored, and the User-Agent to send when fetching pages
$path = '/path/to/your/fedicache/cache/';
$user_agent = 'your-user-agent';

// Reconstruct the requested URL and hash it to get a cache key
$url = 'https://'.$_SERVER['SERVER_NAME'].$_SERVER['REQUEST_URI'];
$hash = md5($url);

// Cache miss: fetch the page ourselves, just once
if(!file_exists($path.$hash)) {
  $options = array(
    'http' => array(
      'method' => "GET",
      'header' => "User-Agent: $user_agent\r\n"
    )
  );
  $context = stream_context_create($options);
  $data = file_get_contents($url, false, $context);
  if($data === false) {
    $data = ''; // fetch failed; cache an empty shell rather than choking below
  }
  // Grab any rel="me" links, since Mastodon uses these for profile link verification
  $rel_me = "\n";
  preg_match_all('/<a\s+[^>]*rel\s*=\s*["\']?me["\']?[^>]*>(.*?)<\/a>/is', $data, $matches);
  if(!empty($matches[0])) {
    foreach ($matches[0] as $tag) {
      $rel_me .= $tag."\n";
    }
  }
  // Keep everything up to </head>, then append the rel="me" links in an otherwise empty body
  file_put_contents($path.$hash, substr($data, 0, strpos($data, '</head>'))."</head>\n<body>".$rel_me."</body>\n</html>\n");
}

// Log the request: timestamp, requesting IP, user agent, URL, and bytes served
$size = filesize($path.$hash);

$log = time()."\t".$_SERVER['REMOTE_ADDR']."\t".$_SERVER['HTTP_USER_AGENT']."\t".$url."\t".$size."\n";
file_put_contents('/var/www/html/fedicache/log/'.date("Y-m-d"), $log, FILE_APPEND);

// Serve the slimmed-down page from the cache
header("Fedicache: Active");
echo file_get_contents($path.$hash);

Here’s what’s happening in that script:

  1. We take the URL that’s being requested and hash it. This gives us a string which is unique to the requested URL.
  2. We then check the cache directory to see if there’s a file there named with that hash.
  3. If there isn’t, we fetch the page and do two things: find any links with a rel="me" attribute (since Mastodon uses these for profile link verification), and strip everything after the closing </head> tag (because we don’t need any other page content). We save this slimmed-down page to our cache directory under the hash name (there’s a sketch of what that file looks like just after this list).
  4. We log the request and serve the slimmed-down page data from the cache.
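
To give a sense of what ends up in the cache, here’s roughly what a cached file for a hypothetical page with one verified profile link would look like: the original <head> content, followed by a body that contains nothing but the rel="me" links.

<!DOCTYPE html>
<html>
<head>
  <title>An example page</title>
  <meta property="og:title" content="An example page">
  <meta property="og:description" content="A description for link previews">
  <meta property="og:image" content="https://example.com/preview.png">
</head>
<body>
<a rel="me" href="https://example.social/@someone">Mastodon</a>
</body>
</html>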

The last part of the setup is a cron job that purges the cache directory every hour.
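
That can be as simple as a crontab entry along these lines (assuming the cache path used in the script above):

0 * * * * find /path/to/your/fedicache/cache/ -type f -delete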

Overall, this approach is far more efficient than standard caching, since we’re storing (and serving) only those parts of the page that Mastodon needs. Since we’re going to be serving that content routinely to thousands of Mastodon servers (all day every day!), the efficiency here really adds up over time.

In the code above, the logging step is optional, but I’ve done it so I can keep an eye on just how much activity and bandwidth is devoted to Mastodon’s inefficient link fetching process. I set this up on the omg.lol server less than 24 hours before writing this blog post, and as of now there have been over 33,000 requests and 227 MB of data transferred. I’ve set up a public page to view these stats, in case you want to check in and see how things are going.