
Not quite a Yegge long.

OpenRA Reddit Postmortem

Friday 1 October 2010 - Filed under Uncategorized

Yesterday, we unexpectedly hit the front page of Reddit. We’ve more or less survived, and learned a fair amount – including the obvious “how not to write things if you want them to scale.” Here’s what we learned.

1. What went right:

  • We were able to patch a number of issues in the master server live
  • We got the main website up again fairly quickly thanks to RobotCaleb having a box available
  • The people who *could* play it generally gave very positive feedback
  • We’ve finally been able to test on a wide range of hardware and platforms
  • 13,000 downloads (9,000 of which were in that one day)
  • (>0 people playing most of the time)

2. What went wrong:

  • Webhost couldn’t stand up to the traffic. (Reddit is BIG)
  • Master server on hosting server meant people who did manage to download it couldn’t play
  • Far too many people didn’t read the instructions – e.g. forwarding ports, turning off compiz, downloading Cg – and/or our instructions were crap.
  • Download counter glitched out
  • Our lobby code was shown to be garbage.
  • When things broke there was no way for someone to fix it until NZ woke up – we needed trusted people in other timezones.
  • Mirrors went down and we had no way of dealing with that
  • No published "minimum requirements" list caused a lot of confusion
  • Minor PR disaster
  • We have no graceful upgrade path to get these 13K or so users onto the next version

For more details, see below the fold.

The remaining part of this is just raw IRC capture from #openra-dev.

Webhost couldn’t stand up to the traffic:
22:48:35 < beedee> Obviously we are going to get a new host to fix that
22:48:52 < beedee> The current one is never going to suffice now
22:48:59 < RobotCaleb> We have the data on what a bad (good) day looks like
22:49:11 < RobotCaleb> 190GiB served in a matter of hours
22:49:26 < beedee> We just don’t know how that’s going to carry over for the rest of the
                   month
22:49:27 < chrisf> +30GB from the old host before it fell over.
22:49:29 < RobotCaleb> That definitely won’t be the norm, but it helps point out
                       heavy-hitters
22:49:44 < beedee> So it’s still hard to judge our monthly bandwidth requirements
22:50:04 < RobotCaleb> One issue with the web-hosting is we used as much bandwidth serving
                       .zip files as we did .jpgs
22:50:31 < RobotCaleb> jpg = 68.71 GB
22:50:33 < beedee> Yet the jpgs aren’t big.
22:50:41 < RobotCaleb> zip = 80.21 GB

Master server on hosting server meant people who did manage to download it couldn’t play:
22:50:46 < pchote> i think the more important issue is that the master server is on the
                   same box as the website
22:50:56 < pchote> having the website fall over is bad, but not critical
22:51:14 < beedee> So maybe the best idea here is a compromise between GAE and linode
22:51:32 < pchote> the master server could live quite well on GAE i think
22:51:39 < beedee> GAE for the main site, linode for master server, packaging and bug
                   trackers
22:51:44 < chrisf> the master server needs to not be a hacked mess of PHP.
22:51:45 < beedee> *tracker
22:52:07 < chrisf> beedee: GAE isnt going to cut it if we’re serving 60GB or so of jpgs
22:52:08 < pchote> beedee: the master server doesn’t do much, why does it need to be on
                   linode?
22:52:10 < beedee> If someone were to write it in Python or Java, it’d run nicely on GAE I
                   think
22:52:30 < beedee> chrisf: You think so?
22:52:51 < beedee> I’m more thinking of paying only for what we use
22:53:37 < beedee> pchote: You’re right, though it still wouldn’t run on GAE at the moment.
                   I don’t think they have a PHP environment.
22:53:45 < pchote> they don’t
22:53:54 < pchote> but our master server wants rewriting anyway
22:53:59 < chrisf> i’d rather the master server wasnt PHP.
22:54:11 < beedee> Well the instructions would need to be made obsolete or made clearer
22:54:13 < chrisf> given a free hand, i’d write it in C ;)
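
For the curious: the core of a master server really is tiny, which is why a Python rewrite for GAE is plausible. Here’s a rough sketch of the ping/list registry being discussed – the class name, field layout and TTL are all made up for illustration, and a real GAE version would hang this off request handlers and a datastore rather than a dict:

```python
import time

class MasterServer:
    """In-memory registry of game servers; entries expire if not re-pinged."""

    TTL = 300  # seconds before a game is considered stale (assumed value)

    def __init__(self):
        self.games = {}  # ip -> (info dict, last ping time)

    def ping(self, ip, info):
        # a game server announces itself, or refreshes an existing entry
        self.games[ip] = (info, time.time())

    def list_games(self):
        # drop stale entries, return the rest with the ip folded in
        now = time.time()
        self.games = {ip: (info, t) for ip, (info, t) in self.games.items()
                      if now - t < self.TTL}
        return [dict(info, ip=ip) for ip, (info, _) in self.games.items()]
```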

Far too many people didn’t read instructions – e.g. forwarding ports, turning off compiz, downloading Cg:
22:54:11 < beedee> Well the instructions would need to be made obsolete or made clearer
22:54:46 < RobotCaleb> Clearly, messaging is an issue. 1. People didn’t necessarily realize
                       it was in an alpha state. 2. No nice messaging of known gotchas
22:55:26 < pchote> – add a notice to the installer
22:55:29 < beedee> Our marketing spiel kind of hides the important instructions on the
                   front page
22:55:44 < pchote> – add a gui crash catcher
22:58:49 < chrisf> we need to kill the wall of text.
22:59:25 < pchote> back to what i said at the start, we need to IMO have a separate
                   download page with the important instructions
22:59:28 < beedee> Replace it with a really short summary?
22:59:35 < beedee> And a news feed or something?
22:59:40 < chrisf> beedee: *really* short. a couple of sentences.
23:00:07 < chrisf> pchote: a sep. download page *maybe*; it does mean we’ll lose more
                   downloads by adding another click.
23:00:19 < beedee> pchote: Ok, we kind of have one already for linux packages.
23:00:34 < pchote> chrisf: the downloads that we lose are the ones that don’t bother to
                   read anything anyway
23:00:39 < beedee> But it doesn’t do anything other than list the distro packages.
23:00:42 < chrisf> but if it helps people have *working* downloads, that might be a
                   worthwhile tradeoff.
23:15:28 < chrisf> unrelated: we’ve got N people following twitter now, and we’re not using
                   it well.
23:15:50 < beedee> I keep asking about that
23:16:00 < beedee> No one seems to want to give me access
23:16:12 < RobotCaleb> pizzabot could announce releases
23:16:23 < chrisf> beedee: password is the usual hash
23:16:23 < RobotCaleb> (on twitter)
23:16:24 < beedee> Exactly
23:16:55 < chrisf> alzeih set it up
23:18:18 < chrisf> we need to decide what to do about gma950
23:18:26 < beedee> You missed the last batch of articles about twitter’s new broken API?
23:18:27 < pchote> can we build hardware checks into the installer?
23:18:44 < chrisf> and (to a lesser extent) dx8.1 cards (radeon 8500/9200)
23:18:57 < pchote> chrisf: i would like to support 950
23:19:04 < chrisf> into the installer, or should we check the stuff we need upfront in the
                   game init?
23:19:19 < pchote> here’s a big one: "no valid techniques"
23:19:33 < chrisf> we can run on cg-2.1 with a small patch
23:19:47 < pchote> orly? lets fix that
23:20:05 < chrisf> each shader needs a technique that targets arbfp1 / arbvp1 rather than
                   latest/latest
23:20:15 < chrisf> as a fallback, after the `latest` technique.
23:20:17 < beedee> pchote: Cross platform checking would be hard.
23:20:30 < chrisf> beedee: check by *trying the APIs*.
23:20:40 < beedee> I guess
23:20:51 < chrisf> for platform-specific things, i assume you can ask compiz if it’s
                   running.
23:21:01 < beedee> Fuck knows
23:21:14 < chrisf> or worst-case shell out and see if its process is running?
23:21:15 < pchote> `ps | grep Compiz` etc
23:21:16 < beedee> They sure don’t seem to give a fuck about anyone else
23:21:30 < pchote> chrisf: this can be done in the launcher shellscript
23:21:45 < chrisf> we also need to check for the GL extensions we use.
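
The `ps | grep Compiz` idea from the launcher shellscript, sketched here in Python for illustration – the function name is ours, and it assumes a system that has `pgrep` (most Linuxes and OS X do):

```python
import subprocess

def process_running(name):
    """Return True if a process with exactly this name is running."""
    # `pgrep -x` matches the process name exactly; exit code 0 means a match
    result = subprocess.run(["pgrep", "-x", name], capture_output=True)
    return result.returncode == 0

# e.g. in the launcher, warn before the game starts:
# if process_running("compiz"):
#     print("Warning: compiz is running and may break rendering")
```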

Download counter:
22:56:58 < pchote> that comes under "make the master server not fail"
22:56:59 < pchote> i think
22:57:12 < beedee> Probably, though it’s more a main site thing
22:57:24 < pchote> the download counter itself is a site issue, yes

our lobby code was shown to be garbage:
22:58:46 < beedee> I’m not sure what the exact issue is with the lobby but the whole
                   network stack needs work.

when things broke there was no way for someone to fix it until NZ woke up:
23:02:04 < beedee> Matter of trust as I said earlier
23:02:18 < RobotCaleb> Someone mentioned that the website stuff wasn’t up-to-date in the
                       repo
23:02:24 < chrisf> i think we can trust RobotCaleb :)
23:02:30 < beedee> Thanks to people hacking it live :/
23:02:39 < beedee> We’re all guilty of it
23:02:45 < RobotCaleb> That hinders a rescue attempt
23:02:46 < beedee> But it needs to stop
23:03:41 < chrisf> beedee: for it to stop, we have to have a sane way to deploy to the
                   website.
23:03:52 < beedee> git clone isn’t sane?
23:04:07 < chrisf> ideally without having to ssh in
23:04:07 < pchote> not for trivial changes IMO
23:04:24 < beedee> So how do you propose to do that exactly?
23:05:08 < pchote> whatever we do, it needs to be simple
23:05:10 < beedee> You want some magic cron job to rsync with the website or something?
23:05:28 < beedee> Because I’m not writing it this time.
23:06:06 < pchote> if someone is woken up at 5am to put out a fire, they shouldn’t have to
                   jump through a pile of fragile hoops
23:06:56 < pchote> "make the changes in your repo, then upload via ftp", please
23:06:57 < RobotCaleb> Clearly the solution we came up with worked. In the future that
                       solution hopefully won’t be needed, though.
23:07:14 < RobotCaleb> What are some potential showstoppers we might run into if we
                       disregard hosting falling on its face?
23:07:21 < beedee> pchote: Isn’t that what I’ve been saying?
23:07:23 < pchote> RobotCaleb: not really – it worked because i was able to txt beedee to
                   fix things
23:08:03 < RobotCaleb> pchote: Well, it _did_ work. Eventually
23:08:35 < pchote> malicious content injected to the master server
23:08:38 < RobotCaleb> If the master server dies that’s obviously bad.
23:09:26 < beedee> Was there anything intentionally malicious done?
23:09:27 < pchote> half of the apparent server problems were because the master server was
                   serving malformed game entries
23:09:34 < pchote> i don’t know if that was intentional, or a bug
23:09:49 < RobotCaleb> pre or post host transition?
23:09:53 < chrisf> just hitting ping.php by hand is enough to knock it over.
23:10:00 < pchote> there were games showing up with null fields for everything except ip
23:10:08 < pchote> which causes the client to barf
23:10:28 < chrisf> (1) the client shouldnt be so retarded
23:10:40 < pchote> (2) the server should apply some basic sanity checks
23:11:39 < Ytinasni> (N+1) anything the client could cause by kernel-panicing at the right
                     moment is… um…. silly?
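
The kind of server-side sanity check pchote means in (2) could be as simple as the sketch below – the field list is a guess at the schema, not our actual one, and a real check would also validate types and value ranges:

```python
# assumed schema: the fields a well-formed game entry must carry
REQUIRED_FIELDS = ("ip", "name", "map", "players", "version")

def valid_game_entry(entry):
    """Reject entries with null/empty fields before they reach clients."""
    return all(entry.get(f) not in (None, "") for f in REQUIRED_FIELDS)

def sane_game_list(entries):
    # the master server serves only entries that pass the check
    return [e for e in entries if valid_game_entry(e)]
```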

Mirrors went down and we had no way of dealing with that:
23:12:25 < chrisf> do we need automatic health monitoring of mirrors?
23:12:34 < beedee> Either a retry mechanism for all packages or health monitoring
23:12:54 < beedee> Health monitoring is tricky though, I mean how often do you do it?
23:13:17 < pchote> client side checks is simpler and more robust
23:13:25 < chrisf> communication with mirror admins was a bit lacking, too.
23:13:47 < RobotCaleb> Is the mirror list embedded in install packages?
23:13:50 < beedee> So distribute a mirror list with clients? That causes maintenance
                   issues.
23:13:54 < beedee> No
23:13:57 < beedee> Server side
23:14:03 < pchote> beedee: no
23:14:14 < beedee> get-dependencies.php checks a file on the server.
23:14:22 < chrisf> beedee: from an admin perspective, it’s important to have *some* idea of
                   how the mirror cloud is coping.
23:14:30 < pchote> and also have the client retry if the download fails
23:14:33 < pchote> and/or error out
23:14:47 < RobotCaleb> We had several people offer up mirrors when we needed some. Having
                       it on the server means we can add them in realtime. Good
23:14:51 < pchote> it *shouldn’t* print a warning, and continue
23:15:17 < beedee> When I added it, I was told not to error out
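
Client-side retry over a server-supplied mirror list – the approach pchote is arguing for – might look like this sketch; `fetch` and the function name are hypothetical stand-ins, and note that it errors out at the end rather than warning and continuing:

```python
def download_with_fallback(path, mirrors, fetch):
    """Try each mirror in turn; raise if every one fails.

    `fetch(url)` is any callable that returns the file bytes or raises on
    failure, e.g. a wrapper around urllib. The mirror list comes from the
    server, so new mirrors can be added in realtime without repackaging.
    """
    errors = []
    for mirror in mirrors:
        try:
            return fetch(mirror.rstrip("/") + "/" + path)
        except Exception as e:
            errors.append((mirror, str(e)))
    raise RuntimeError("all mirrors failed for %s: %s" % (path, errors))
```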

no graceful upgrade path:
23:22:18 < beedee> There has been a small amount of work done by me in that regard
23:22:25 < chrisf> RobotCaleb: i think that was more to do with ‘how are we going to get
                   these 12K people onto the next release?’
23:22:38 < chrisf> without causing a *massive* pile of version conflict fail in the process
23:23:17 < chrisf> beedee: need that handshake on the connection
23:23:32 < RobotCaleb> The next build will need to not communicate with prior build, then.
                       Master server can have a fake game running that says "new release".
                       This next release needs messaging of new builds
23:23:54 < pchote> RobotCaleb: that pretty much sums up my ideas from last night
23:24:09 < RobotCaleb> I have my MOTD scroller just about done
23:24:13 < pchote> we can make the server request for new versions include a version string
23:24:44 < pchote> checks that don’t include that flag get returned a fake server that they
                   can’t join that tells them to upgrade
23:24:48 < chrisf> the port check needs to tell the host that it worked, too.
23:25:08 < chrisf> pchote: *and* the legacy servers? or do we force-upgrade?
23:25:13 < Ytinasni> chrisf: presumably you do that _as part of_ the check :D
23:25:26 < chrisf> Ytinasni: presumably ;)
23:25:35 < pchote> chrisf: legacy servers won’t be shown to anyone
23:26:04 < pchote> – generalize the master server ping to let the master server inject chat
                   messages
23:26:11 < RobotCaleb> I think same network versions (builds) should still see each other
23:26:12 < pchote> that fixes a few current and future issues
23:26:19 < Ytinasni> we can change the master server urls to get around the legacy issue.
23:27:11 < RobotCaleb> At this point we can’t push a new build without solving these
                       issues. Especially that of messaging to current clients that there’s
                       a new build available
23:27:42 < chrisf> we also should include some gameplay carrots to give people a reason to
                   upgrade.
23:28:03 < RobotCaleb> Is this the transition point from multiple daily builds to minor
                       releases? :)
23:28:40 < beedee> Possibly…
23:28:43 < pchote> i think it may be
23:28:58 < beedee> Since we have a separate dev stream now
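
The version-string handshake described above, as a rough Python sketch – the version strings and field names are invented for illustration; the idea is just that requests without the new version flag get a single fake, unjoinable "please upgrade" server instead of the real list:

```python
# placeholder values for the fake entry legacy clients will see
UPGRADE_NOTICE = {
    "name": "A new OpenRA release is available - please upgrade!",
    "ip": "0.0.0.0",
    "joinable": False,
}

CURRENT_VERSION = "next-release"  # hypothetical version string

def games_for_client(all_games, client_version=None):
    """Return only same-version games; legacy clients get the notice."""
    if client_version != CURRENT_VERSION:
        # old clients (and clients that omit the flag entirely) never see
        # real servers, so version-mismatched games can't be joined
        return [UPGRADE_NOTICE]
    return [g for g in all_games if g.get("version") == client_version]
```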


