
Tim Riley

Money, Stress and The Cloud

Or: How to Save Your Relationship with Rails Streaming Responses on Heroku

This is as much a personal story as one about software development. It’s about dealing with problems in a production app, handling credit card payments on Heroku, working with customers, implementing custom HTTP streaming responses in Rails 3.0, all while living a happy life alongside.

I failed at the last part, which has actually made this article somewhat painful to write. Although the events occurred in November and December 2011, it’s taken this long to share, not only because it’s a long story, but also because I was happy to have this stuff behind me. However, I believe it is illustrative on both a technical and personal level, and contains some useful lessons, so here you go!

The Problem

I’m at a mountain resort, palm trees shading me from the warm tropical sun, a cold bottle of San Miguel Light in my hand. My wife and I have just arrived in the Philippines, where we’ll be spending the next eight months. I’m hanging out for a week during her volunteering orientation, before we head out to our new home town.

Towards the end of that week, a rather disturbing email came to me: a double payment in an e-commerce app that we maintain. It’s the first time we’ve been told about something like this happening. I fixed it for the customer right away and let them know. Handling the money properly is pretty much the most important thing for this kind of app, so I interrogated the database and, surprisingly, found several more instances of the problem, taking a few different forms. Some payments were even missing data. This was not good.

My first step was to walk through the codebase again and look for opportunities to make the payment process more robust. I made a few changes and deployed them. For a little while, things seemed better, but it wasn’t too much longer before we heard about some further occurrences of the problem. After some more debugging, I isolated the cause: some transactions with our credit card gateway (SecurePay) were very slow.

The credit card transactions were taking longer than 30 seconds, which is the default timeout period for web requests on Heroku. This resulted in the user seeing their request time out — but the actual payment might or might not still complete in the background, detached from the user’s browser request. In the meantime, the user might just pay again or leave altogether, creating these inconsistencies in the data.

The Search

By this time, we’ve moved from Manila to our new home city, Bacolod. But we haven’t actually moved moved yet. We’re staying in a pension house. It’s basic, but comfortable, and in a good location to start exploring the city. But I have this big technical problem that’s inconveniencing people. So I spend the first couple of days half-thinking about what needs to be done, and the first couple of nights perched on the bed (the only seating in our room) working on code to get a solution ready.

This first attempt at a solution was pushing the payment into a background job, then having the frontend check on the progress of the job before showing the “payment complete” message and order receipt. Pretty straightforward. I’d done this kind of thing plenty of times before. So I put my head down and got to work building a solution using delayed_job, which we already had in place for background jobs. After a couple of nights, I had things ready to test. I tried it and everything worked except the payment. What had happened? The credit card number wasn’t being passed on to the payment gateway. Then it struck me: the credit card number! Naturally, I had ensured my payment model never persisted the full credit card number. We didn’t want to store credit card numbers. This was a problem for delayed_job, because it relied on all of a record’s information being persisted in order for the asynchronous method to work properly. If only I had remembered this first!
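Since the app’s own code isn’t shown here, a contrived stand-in (the Payment class and its columns are entirely hypothetical) illustrates the trap: the job system persists only the record’s database columns and reloads the record when the job runs, so anything that was never a database column is gone.

```ruby
# A minimal simulation of why the background job lost the card number.
# This stands in for delayed_job's behaviour: it persists only the
# record's database columns, then reloads the record when the job runs.

class Payment
  DB = {}  # stand-in for the payments table

  attr_accessor :id, :amount
  attr_accessor :card_number  # deliberately NOT a database column

  def save
    DB[id] = { amount: amount }  # only real columns are persisted
  end

  def self.find(id)
    row = DB.fetch(id)
    payment = new
    payment.id = id
    payment.amount = row[:amount]
    payment
  end
end

payment = Payment.new
payment.id = 1
payment.amount = 4900
payment.card_number = "4111111111111111"
payment.save

# Later, the worker "reloads" the record to process the job...
reloaded = Payment.find(1)
reloaded.amount       # => 4900
reloaded.card_number  # => nil -- the card number never hit the database
```

The in-memory attribute exists only on the original object; once the worker reloads from storage, it’s simply not there.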

This led to a big discussion with my co-workers and Twitter network about ways to make this work. I really didn’t want to persist the credit card number in any way, but could it be done acceptably, somehow? Perhaps persisting it only momentarily? Maybe we could move to Resque for the background jobs and use an in-memory-only Redis instance for the job persistence? I wasn’t happy with any of these options. I really didn’t want to have to worry about storing credit card numbers. (In my desperation, I briefly looked into what PCI compliance might involve, but that quickly steeled my resolve to avoid it completely.) Back to the drawing board.

My wife has just started her first couple of days at her new workplace. There’s a lot of meeting people and other formalities. I join her for these. I’m still thinking about fixing this problem, but there’s not much time to translate the thought into action. Soon after, we’re taken around the town looking at places to rent. We pick a place and now it’s time to start a final move.

With the delayed_job option no longer possible, I thought about other simple ways to fix the problem. I found this <timeoutValue> option in the SecurePay API docs! Perhaps we could just set that to 30 and all would be good! Turns out it wasn’t so. After some largely unconstructive back and forth with SecurePay’s developer support team, I realised that the timeout value was only intended for batches of transactions, not the individual transactions that we were using.

Meanwhile, I’d written a script that took a day’s transaction data from SecurePay, compared it to the payment records in our app, and reported on any inconsistencies. I ran it each night with bated breath. Many nights, things were OK, and I would feel a massive relief. But my heart was in my mouth every time I pasted the code into the Heroku console and hit the return key. And when it did reveal problems, I knew I had to then look into things, rectify them, and sheepishly pass on the news to our customer, so they could make things right for their users. Fortunately, our customer was actually very gracious about this. They knew that new software undergoes teething issues, and they were grateful we were staying on top of things for them.
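The script itself was specific to the app, but its heart was a straightforward comparison of two record sets. Here’s a sketch of that kind of reconciliation — the data shapes and issue names are invented for illustration, not taken from the real script:

```ruby
# Sketch of a nightly reconciliation between gateway transactions and
# app payment records, keyed by a shared order reference.
def reconcile(gateway_txns, app_payments)
  gateway_by_ref = gateway_txns.group_by { |t| t[:reference] }
  app_by_ref     = app_payments.group_by { |p| p[:reference] }

  issues = []

  gateway_by_ref.each do |ref, txns|
    payments = app_by_ref.fetch(ref, [])
    issues << [:missing_in_app, ref] if payments.empty?
    issues << [:double_payment, ref] if txns.length > 1
    if payments.any? && txns.first[:amount] != payments.first[:amount]
      issues << [:amount_mismatch, ref]
    end
  end

  # Payments recorded in the app with no matching gateway transaction
  (app_by_ref.keys - gateway_by_ref.keys).each do |ref|
    issues << [:missing_at_gateway, ref]
  end

  issues
end

issues = reconcile(
  [{ reference: "A1", amount: 4900 },
   { reference: "A2", amount: 100 },
   { reference: "A2", amount: 100 }],
  [{ reference: "A1", amount: 4900 }]
)
# A2 was charged twice at the gateway but never recorded in the app
```

Run nightly, anything in `issues` becomes a case to investigate and rectify by hand.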

As an aside, one thing that was very helpful for ensuring we didn’t lose any critical payment information was Papertrail, a cloud-based application logging service. I threw a whole bunch of specially marked logger.info messages around the payment events in the app, which would safely store all the request parameters in Papertrail (after removing the full credit card details), where I could easily find and review them with a saved search. I highly recommend it. In this case, it ensured we could always recover from any instances of the problem.

But this was just a measure to help us cope while the problem still existed. We still needed a solution. It was clear that the only way forward was to accept that the credit card transactions could take more than 30 seconds, and then find a way to handle it gracefully.

I had an inkling that an asynchronous web server or app framework might help. I didn’t know much about this stuff, so I started madly researching them. Goliath. Rainbows. Cramp. EventMachine and its family of libraries. They’re a whole different world, and I didn’t really know where to start. I didn’t want to shift the whole app to different underlying tech. Maybe I could extract the payments into an asynchronous mini-app? But that would still have the same response timeout problems, just shifted around a little. Perhaps this approach wasn’t going to help at all.

It’s the weekend now, and we’re in our new place with a minimal set of furniture. Just some bamboo chairs. No bed or table yet. We’ll do some shopping later in the day for the rest of our essentials, but it’s the morning now and I need to keep pushing with this work. I’m perched on the edge of the bamboo chair, my laptop tethered to a delicately placed iPhone, tenuously receiving a 3G signal. The connection was fine when I tested it outside the house, but not so within its concrete walls. We don’t have air conditioning yet. It’s 35 degrees inside and the air is still. I’m sweating and the laptop is hot and I have tab upon tab of slowly-loaded web pages open about a topic I’m not sure will help me and I’m worried about just how on earth I will be able to work in this environment and I’m not getting anywhere and I’m more irritable and tightly wound than I can remember. This is not how things should be.

The Solutions

After some research, an async web server didn’t seem to offer a straightforward answer to my problems. A simpler approach was to leverage a feature of the Heroku Cedar platform: its handling of HTTP streaming responses. If Heroku detects that you’re streaming a response, then its timeout window is extended by 55 seconds every time a byte of data is sent down the line.

Streaming responses are a flagship feature of Rails 3.1, but they’re primarily intended for sending the start of your page layout down the wire while the rest of the page’s content is still being rendered. This means the browser can start downloading the site’s JS and CSS assets sooner. That’s useful for overall performance, but it wasn’t what I needed. My app was also still on Rails 3.0, and I didn’t want to upgrade to a new Rails version at the same time as fixing this problem.

Fortunately, some support for manually handling streaming exists in Rails 3.0, via assigning an object to the controller’s response_body. All the object needs to do is respond to each and accept a block, which it can call with the data to be streamed:

class StreamResponse
  def each(&block)
    # Send some data to the client
    block.call("This is some data")

    # And send a bit more. It's a stream, after all
    block.call("This is some more data")
  end
end

class OrdersController < ApplicationController
  def create
    # Do your normal controller things here

    # Then stream the response
    self.response_body = StreamResponse.new
  end
end

This object is sent the each message once the controller action has returned and the response is being served. Inside this method, whatever you output is streamed immediately to the client browser. Here, finally, I could create some custom streaming logic and support these slow payments.

What I did was split the response into four phases:

  • Immediately stream the content for a loading screen
  • Run the slow payment action in a separate Ruby thread
  • Regularly check the thread’s progress, and while it is still running, stream a bite-sized chunk of data to keep the connection alive
  • When the thread completes, stream a final success or failure message

Here’s how the controller looks:

class OrdersController < ApplicationController
  respond_to :html

  def create
    @order = Order.new(params[:order])

    if !@order.valid?
      # Don't bother streaming and talking to the payment gateway if we know there are already things missing or invalid
      render 'show' and return
    end

    stream_response = ThreadedOrderPurchaseResponse.new
    stream_response.start_content     = render_to_string('purchasing')
    stream_response.keepalive_content = ' '.html_safe
    stream_response.success_content   = render_to_string('purchasing_complete', :layout => false, :locals => {:success => true})
    stream_response.failure_content   = render_to_string('purchasing_complete', :layout => false, :locals => {:success => false})

    stream_response.slow_process = lambda do
      if @order.save
        if @order.purchaser_email.present?
          Resque.enqueue(DeliverOrderMailerOrderCompletedJob, @order.id)
        end

        Thread.current['success'] = true
      else
        Thread.current['success'] = false
      end
    end

    self.response_body = stream_response
  end
end

The logic for handling the threaded payment in the background is in the ThreadedOrderPurchaseResponse class, which is based in part on the template_streaming gem for Rails 2.3. For this particular case, I added a simple Ruby thread:

class ThreadedOrderPurchaseResponse
  attr_accessor :bytes_to_threshold,
                :slow_process,
                :start_content,
                :keepalive_content,
                :success_content,
                :failure_content

  def initialize
    @bytes_to_threshold = 2048
  end

  def each(&block)
    @response_stream = block

    push(@start_content)

    if @payment_thread.blank?
      @payment_thread = create_payment_thread
    end

    # We're going to let the transaction run for up to 180 seconds
    # (we only have 190 before unicorn kills us anyway). Every time we
    # send a bit of data in a streamed response, Heroku gives us another
    # 55 seconds, so strictly we'd only need to push a keepalive three
    # times -- but checking every 2 seconds lets us return the final
    # result almost as soon as the thread finishes.

    90.times do |i|
      if @payment_thread.status.nil?
        # An exception occured in the thread
        push(@failure_content)
        @payment_thread.join
        return
      elsif @payment_thread['complete']
        if @payment_thread['success']
          push(@success_content)
        else
          push(@failure_content)
        end
        @payment_thread.join
        return
      else
        push(@keepalive_content)
      end

      sleep 2
    end

    # We only get here if the thread hasn't completed within 180 seconds.
    @payment_thread.join
    push(@failure_content)
  end

  def push(data)
    if @bytes_to_threshold > 0
      @response_stream.call(data + padding(@bytes_to_threshold - data.length))
      @bytes_to_threshold = 0
    else
      @response_stream.call(data)
    end
  end

  private

  def create_payment_thread
    Thread.new do
      @slow_process.call
      Thread.current['complete'] = true
    end
  end

  def padding(length)
    return '' if length <= 0
    content_length = [length - 7, 0].max
    "<!--#{'+'*content_length}-->".html_safe
  end
end

Some things to note here. Firstly, with a streaming response, browsers won’t start rendering your content until you’ve returned a certain number of bytes. So the first time we send data back, we pad it with an HTML comment string to ensure we pass this threshold. 2048 bytes for the threshold is enough for all the browsers to render the initial content.
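The padding arithmetic is easy to check in isolation. This is a standalone copy of the padding helper (minus Rails’ html_safe), showing that the first chunk pushed comes out at exactly the 2048-byte threshold:

```ruby
# The padding helper wraps filler characters in an HTML comment so the
# extra bytes are invisible to the user. "<!--" and "-->" account for
# 7 characters, hence the `length - 7`.
def padding(length)
  return '' if length <= 0
  content_length = [length - 7, 0].max
  "<!--#{'+' * content_length}-->"
end

data = "<p>loading...</p>"
first_chunk = data + padding(2048 - data.length)
first_chunk.length  # => 2048, enough for browsers to start rendering
```

After that first padded push, `@bytes_to_threshold` is set to zero and all subsequent chunks go out unpadded.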

We also use two thread-local variables (complete and success booleans) to record the status of the operation we’re running inside the thread. Since these can be inspected from outside the thread, the supervising loop in the each method can know when the operation has completed and return the appropriate data to the client browser.
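Stripped of the streaming machinery, the supervision pattern looks like this — the sleeps here are stand-ins for the real gateway transaction and keepalive pushes:

```ruby
# The worker thread records its progress in thread-local variables;
# the supervising code reads them from outside via Thread#[].
worker = Thread.new do
  sleep 0.05  # stand-in for the slow gateway transaction
  Thread.current['success']  = true
  Thread.current['complete'] = true
end

# Supervise from outside the thread
until worker['complete']
  sleep 0.01  # the real loop also streams a keepalive chunk here
end

worker['success']  # => true
worker.join
```

Setting `success` before `complete` matters: the supervisor only reads `success` after seeing `complete`, so it never observes a half-finished result.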

Because we’re using threads here, it’s important to set config.threadsafe! in config/application.rb to ensure things run properly (though I keep it mostly turned off in development mode because having to restart the app for every change is quite slow).
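As a sketch, that conditional might look like this in config/application.rb — the exact condition is a matter of taste:

```ruby
# config/application.rb (Rails 3.0)
# Thread safety everywhere except development, where code reloading
# is worth more than thread safety.
config.threadsafe! unless Rails.env.development?
```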

Finally, here’s how the views look. First, we have a partial HTML page for the loading screen:

<% page_title 'Purchase in progress' -%>

<div class="public-page">
  <div class="content">
    <div class="wrapper">
      <div class="purchasing">
        <div class="purchasing-inner">
          <div id="purchasing-progress">
            <div><!-- Needed to ensure streaming DOM forms how we want it -->
            <div id="spinner"></div>
            <h2>Your payment is in progress</h2>
            <p>We&rsquo;re talking to the bank to confirm your payment. This can take up to a minute or two so hang tight!</p>
            <script src="/javascripts/vendor/spin.min.js"></script>
            <script>
              (function() {
                var opts = {
                  lines: 16,
                  length: 12,
                  width: 4,
                  radius: 19,
                  color: '#000',
                  speed: 1.2,
                  trail: 60,
                  shadow: false
                };
                var spinner = new Spinner(opts).spin(document.getElementById('spinner'));
              })();
            </script>
          </div>
          <!-- just inside .purchasing-progress -->

Even though this is not a complete HTML document, browsers still render it as expected, showing the user a nice spinner and loading message. Then, when the operation (hopefully) succeeds, we send the following along the response stream:

              <div id="purchasing-complete">
                <% if success -%>
                  <h2>Payment successful</h2>
                  <p>Thanks! Your payment was successful and your order is complete. <a href="/order">View your purchase receipt</a>.</p>
                <% else -%>
                  <h2>Payment error</h2>
                  <p>Sorry, something went wrong and we couldn&rsquo;t complete your purchase. <a href="/order?errors=true">Please review your details and try your purchase again</a>.</p>
                <% end -%>
                <script>
                  (function() {
                    document.getElementById('purchasing-progress').style.display = 'none';
                    var t = setTimeout(function() {
                      document.getElementById('purchasing-complete').style.visibility = 'visible';
                    }, 500);
                  })();
                </script>
              </div>
              <script type="text/javascript">
                (function() {
                  window.location.href = '/order<%= "#{'?errors=true' unless success}" %>';
                })();
              </script>
              <meta http-equiv="refresh" content="1; url=<%= "#{request.protocol}#{request.server_name}/order#{'?errors=true' unless success}" %>">
            </div><!-- .purchasing-inner -->
          </div>
        </div>
      </div>
    </div><!-- .public-page -->
  </body>
</html>

When the purchase is complete, we want to send the user back to an /order page, where they see either their purchase receipt, or a form for fixing any errors if the purchase failed. We use a number of methods to ensure this happens. Firstly, for browsers without JavaScript support, we show a descriptive paragraph with a simple link back to /order. Additionally, we throw a meta refresh tag in there that should send them along to the right place. Next, for browsers supporting JavaScript, we hide the existing loading message and send the user to /order by setting window.location. Lastly, we append ?errors=true to the URL if the purchase failed. This is necessary because the redirect results in a new GET request, which doesn’t have the existing order object around to inspect for its state.

I’m researching and building all of this, and I know it will no longer be something I can cram into a couple of big days alone. So I finally fall back into a regular, more sustainable work pattern, from our thankfully now furnished and air-conditioned home. I wake up, make breakfast, start working at 8am, and work solidly through to around 6pm, when I once again check for any erroneous payments that might have occurred during the day. But outside my working hours, I just want the next work day to hurry along so I can get closer to finishing this refactor. Especially now that I’ve found a workable approach, I really want this off my plate. I want to reclaim my life.

Fortunately, a forced respite comes in the form of an impromptu beach holiday in Sipalay, a 5-hour bus trip to the south of our island. There’s a weekend followed by a Philippine public holiday and also our wedding anniversary, so we get a nice break. The laptop only leaves my bag once, for a movie-viewing session. This is much-needed time away for both of us.

The Release

Finally, I thought I was ready to go. I’d built a working solution for gracefully handling payment processing of any duration. I’d also expanded the state machine around orders to include an additional “paying” state that is entered immediately before the transaction with the payment gateway, and left immediately after. This gave me some extra granularity in finding and catching any orders that don’t complete the transaction as expected.
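As a rough illustration of the idea (the app’s real state machine has more states and different names), entering a distinct state just before the gateway call means any order later found stuck in that state is a red flag worth investigating:

```ruby
# Hypothetical sketch of the "paying" state. If the gateway call raises
# or the process dies mid-transaction, the order is left in "paying",
# which makes incomplete transactions easy to find afterwards.
class Order
  attr_reader :state

  def initialize
    @state = 'pending'
  end

  def purchase(gateway)
    @state = 'paying'                    # entered just before the gateway call
    result = gateway.call(self)
    @state = result ? 'paid' : 'failed'  # left immediately after
    result
  end
end

order = Order.new
order.purchase(->(_order) { true })
order.state  # => "paid"
```

A query for orders sitting in "paying" for more than a few minutes then catches exactly the transactions that never completed as expected.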

At this point, it was time to do some thorough testing on the staging app on Heroku. I pushed up the code, eagerly looking for confirmation that the end of this problem was in sight. It didn’t come. My streaming responses, well, didn’t stream. Not in the slightest.

I was baffled and frustrated. Before embarking on this line of development, I’d already written an isolated testing app and verified that the streaming responses worked on Heroku. Now that I’d taken the time to integrate this technique into the app at large, and verified it all locally, all I had was the browser waiting until the request was complete before rendering all the content at once.

To further investigate, I introduced an isolated streaming test controller into my app. Something I could harmlessly access and test at /streams/new within the full app. First, I kept most of the logic that I was using for the purchases. Deployed and tested. Didn’t work. Then I started stripping a little out. Deployed and tested. Didn’t work. Finally, I made it as basic as possible. It was effectively the same as my test within the separate isolated app. Surely this would yield the behaviour I wanted. Deployed and tested. Still didn’t work!

Something strange was happening, and it must have been something outside of the app’s codebase. I compared the environment of my test app with the staging app. The staging app had a few Heroku add-ons enabled, so I started disabling them. It turned out that the NewRelic add-on was preventing the streaming responses from working. I dove deeply into its configuration to see if there was a way to keep it around. Perhaps something to make it thread-friendly? Nothing was immediately obvious, and removing it allowed streaming to work again. So I removed it (though later I worked out how to keep it around, and I’m glad to have it back).

By this stage, I was wary of how changes in the Heroku environment could affect the streaming. To ensure everything would work in production, I copied the remaining environmental difference into the staging app: HTTPS support. This broke everything all over again. So again, it was time for more frustration at trying to work around the problem, especially since there didn’t seem to be many options available. I was using hostname-based SSL, the only suitable option for our app, which uses a wildcard SSL certificate for secure connections across multiple subdomains. Luckily, my good colleague Hugh made me aware of Heroku’s SSL Endpoint beta add-on, which works seamlessly with Cedar’s streaming support and fixed the remaining problem with the app in staging. This improved SSL support has just now been released by Heroku.

So there it was. With the app environment problems smoothed out, the streaming refactor applied to both the public and admin areas, and a nice design put in place for both, everything was ready to go. On the 19th of December, we made the changes live. Everything went smoothly, and after six weeks, I could finally breathe a sigh of relief.

The Lessons

In those six weeks (which truly felt like a lot longer), not only did I have to solve one of the most difficult technical problems I’d encountered, but I also moved six thousand kilometers away, started living in a new culture amid a new language, found and set up a house, and did what I could to support my wife in doing all these same things as well as getting started in her new volunteering role. In trying to do all of these at once, I did all of them somewhat badly. And in placing priority on a quick fix to the technical issues, I did especially poorly at everything that happened away from the computer screen, including being the happy husband and travel partner that I wanted to be.

I won’t do this again. What would I do differently if I had the same things happen? First, I would relax. I’d spend more time thinking over the problems while away from the text editor and the Google searches. This would actually allow some room for insight and incision into the problem. There would be less flailing about. I’d be calmer and happier and would think more clearly. I’d also spend more time talking things over with my creative and resourceful teammates. In this case, the problem occurred immediately after moving overseas. While I’d previously spent quite some time successfully working remotely, the extra distance left me feeling unreasonably isolated, and I didn’t rely on my friends and colleagues as much as I should have.

What I wouldn’t have changed was how I communicated with the customer. I was quick to admit a problem, assured them that addressing it was important to me, and then kept them in the loop through the entire debugging and development process. I’d also still have devised interim measures (like the script to check for erroneous payments) to stay on top of any instances of the problem, ensuring my customer could manage their business as well as possible under the circumstances. Even if my final fix took a little longer to come, keeping the customer informed and equipped in this way would have kept them happy.

And happiness, after all, is what’s important. A happy developer will do better work and end the day satisfied. And then be happy in anything else outside of work. And all of this strengthens their ability to contribute positively to the world.

Despite the unhappiness I endured dealing with this problem, I’m happy that I’ve fixed it, moved on from it, and learnt some valuable lessons:

  • It’s important to recognise when a problem requires a serious fix. Building a serious fix requires a real plan, not an ad-hoc one.
  • Creating a real plan requires time, especially away from the computer screen.
  • Architecting serious changes is best done in collaboration with your teammates, even if you’re the holder of most of the domain knowledge.
  • Customers can remain satisfied and sympathetic even during times of app instability, as long as they know you’re serious about fixing the problems and are kept involved throughout the process.
  • You need to take care of yourself. You’re not a machine, and you can’t solve problems when you’re feeling like shit.

I hope you find these helpful too.