TL;DR: We saved tens of thousands of dollars in EC2 inter-zone transfer costs by smartly routing traffic with HAProxy.

At Helpshift, our infrastructure is hosted on EC2. Like any sensible EC2 setup, we use multiple Availability Zones for high availability of services. A caveat of hosting services this way is that inter-zone network transfers are charged, and those charges add up.

A couple of years back, we realized that inter-zone transfer costs were a significant chunk of our monthly AWS bill. Upon further digging, we found that most of this transfer originated from our Redis cache cluster.

The Redis cluster had a few slaves distributed across multiple Availability Zones. Clients accessed one of the healthy slaves, discovered via a Route53 health-checked DNS record. The clients were zone-unaware and would happily make calls to slaves in other zones. We needed a solution where clients call healthy slaves within the same zone to avoid inter-zone transfer costs, but can still fall back to slaves in another zone if no slave in their own zone is available.

Client accessing zone-aware Redis slaves

An idea that occurred to us was that clients could achieve this with HAProxy, a service we have come to rely on for nifty solutions. A quick look at the HAProxy documentation pointed us to the backup parameter for the server line. The HAProxy docs for this parameter read:

When “backup” is present on a server line, the server is only used in load balancing when all other non-backup servers are unavailable. Requests coming with a persistence cookie referencing the server will always be served though. By default, only the first operational backup server is used, unless the “allbackups” option is set in the backend.

Another important configuration to look out for when using the backup parameter is option allbackups on the backend. The HAProxy documentation explains it as follows:

By default, the first operational backup server gets all traffic when normal servers are all down. Sometimes, it may be preferred to use multiple backups at once, because one will not be enough. When “option allbackups” is enabled, the load balancing will be performed among all backup servers when all normal ones are unavailable.

All we had to do was set up HAProxy on the clients with a configuration that takes the Availability Zones of the client and the Redis slaves into account, marking slaves in other Availability Zones with the backup parameter. This was easily achieved with an Ansible template for the HAProxy configuration that used our EC2 inventory (a rough sketch of such a template follows the generated config below). For a client node hosted in AZ1, Ansible would generate the following configuration:

frontend redisslave
    bind 127.0.0.1:6379
    mode tcp
    option tcp-smart-accept
    option splice-request
    option splice-response
    timeout client-fin 5s
    default_backend redis-slave-servers

backend redis-slave-servers
    mode tcp
    option tcp-smart-connect
    option splice-request
    option splice-response
    option allbackups
    timeout tunnel 70s
    server S1 10.0.0.1:6379 check        # S1 is in AZ1
    server S2 10.0.0.2:6379 check backup # S2 is in AZ2
    server S3 10.0.0.3:6379 check        # S3 is in AZ1
    server S4 10.0.0.4:6379 check backup # S4 is in AZ2

With this config, the client connects to Redis slaves in its own Availability Zone via HAProxy, and switches to Redis slaves in the other zone only if all slaves in its own zone are down. This solution was tested and deployed in production within a couple of days, and we observed a significant reduction in the next month’s bill.
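For illustration, the Ansible (Jinja2) template that generates the backend section might look roughly like the sketch below. The group name (redis_slaves), the host variables (name, ip, az), and client_az are illustrative assumptions, not the exact inventory keys we used:

backend redis-slave-servers
    mode tcp
    option allbackups
    timeout tunnel 70s
{% for host in groups['redis_slaves'] %}
    {# slaves outside the client's own AZ get the backup flag #}
    server {{ hostvars[host]['name'] }} {{ hostvars[host]['ip'] }}:6379 check {{ 'backup' if hostvars[host]['az'] != client_az else '' }}
{% endfor %}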

This also solved the problem of adding or removing slaves in the Redis cluster without restarting the client service: each time a new slave was added to the cluster, all we had to do was update the HAProxy configuration on the client nodes via Ansible. This solution has been working flawlessly in production for the last couple of years.

Ayush Goyal
DevOps @ Helpshift
December 8, 2016

Business hours check for a global business: The problem statement

Large global organizations have customer support teams working in multiple geographies across multiple time zones. Each of these teams usually has its own set of business days and hours (in its respective time zone). Below is a Clojure map that represents an example business days and hours configuration for an org -

(def test-business-timings
  {:business-days [true true false true true false true] ;; Mon-Sun
   :business-hours [;; Monday
                    [{:from {:hours 16 :minutes 0}
                      :to {:hours 17 :minutes 0}
                      :timezone "Asia/Kolkata"}
                     {:from {:hours 13 :minutes 0}
                      :to {:hours 18 :minutes 0}
                      :timezone "America/Los_Angeles"}]
                    ;; Tuesday
                    [{:from {:hours 8 :minutes 0}
                      :to {:hours 18 :minutes 0}
                      :timezone "Pacific/Kiritimati"}]
                    ;; Wednesday
                    []
                    ;; Thursday
                    [{:from {:hours 1 :minutes 0}
                      :to {:hours 3 :minutes 0}
                      :timezone "UTC"}
                     {:from {:hours 10 :minutes 0}
                      :to {:hours 18 :minutes 0}
                      :timezone "Asia/Kolkata"}]
                    ;; Friday
                    [{:from {:hours 0 :minutes 0} ;;all day
                      :to {:hours 0 :minutes 0}
                      :timezone "US/Hawaii"}]
                    ;; Saturday
                    []
                    ;; Sunday
                    [{:from {:hours 9 :minutes 0}
                      :to {:hours 17 :minutes 0}
                      :timezone "US/Hawaii"}]]})

When a support ticket from a customer comes in, we need an automated way to check whether any of the global support teams are within business hours.

Approaching the problem

Which day of the week is it for the support teams?

Before making the check for business days, you need to determine what the current day of the week is (‘today’). You need to account for the possibility that it might still be ‘yesterday’ in some timezone, or may already be ‘tomorrow’ in another - and a team in a timezone behind or ahead of the server time may well be within its business hours.

On this day and at this time, do I have an actively working support team somewhere or not?

When making the check for a specific business day, building the time interval for that day’s business hours needs to take into account both the timezone and the notion of which day is ‘current’.

Solution

An approach that didn’t fly

Convert all the input business hours to UTC before making the check. There are two problems with that approach -

  • Shifting the business hours to UTC could mean crossing the day boundary (ahead or behind). This gets complicated by the fact that we need to factor in business days (not just hours), so we’d also have to build a transformed list of UTC business days.
  • We can’t store the business hours and days configuration in UTC because the next time the user edits them, they’ll need to see them in the time zones they originally used.

The one that did

The solution relies on the fact that no time zone in the world is more than a day apart from UTC. In other words, no matter what timezone you’re in, relative to the UTC day it could never be ‘day after tomorrow’ nor could it still be ‘day before yesterday’.

In order to solve this problem, we assume that whatever is the current day of the week per UTC is our current day of the week. We then need to make the following three checks -

  • whether ‘today’ is a business day, and whether ‘right now’ falls in ‘today’s’ business hours
  • whether ‘tomorrow’ is a business day, and whether ‘right now’ falls in ‘tomorrow’s’ business hours
  • whether ‘yesterday’ is a business day, and whether ‘right now’ falls in ‘yesterday’s’ business hours

Code

Before I dive into the actual code, I’d like to sidestep a bit and dwell on how using Clojure (and FP in general) makes the solution intuitive.

Functions are Fun. And natural.

I started out as a Java programmer and I clearly remember that in my first few months at my first job, my natural tendency was to write a lot of C-style functions. And the way to do that in Java is to use static methods and classes. And this immediately stands out as a code smell or bad practice in the object oriented Java world. You need to think about modeling things as ‘real world objects’, I was told. Whatever that meant. Of course I adapted, moved on and eventually became a ‘good’ Java programmer. Working with Clojure now has made me rediscover that long lost natural tendency. I now write code in the style that is most obvious from a logic perspective. Little functions that each do one thing. I’ve come full circle - from instinctively thinking that functions are good, to being told that functions are bad, to finally understanding that functions are good. Java 8 now has functions, you may say. But the truth is that just ‘supporting’ or ‘allowing’ functions as first class citizens is very different from being a functional language at heart. Which Java 8 certainly isn’t.

IMHO - an important characteristic of good code is that it makes the solution look simple or easy. It’s like how a good batsman (in cricket) makes batting look easy. And this is something that I find in abundance in the Clojure ecosystem.

Clojure’s awesome date-time lib: clj-time

Clojure has a feature-rich, intuitive, and expressive date-time library called clj-time. It’s an excellent example of what an easy-to-use API should look like, and using such a lib adds to the simplicity and elegance of your code.

Let’s write the code

We’ll walk through and build the solution from the bottom up. The intent is to highlight how functional programming is the most natural way of building a solution out of small, coherent, reusable, and composable functions.

We first create a Clojure NS for the solution. The only external lib we need is the above-mentioned date-time lib.

(ns business-timings.core
  (:require [clj-time.core :as t]))

First off, three simple functions that tell us ‘today’, ‘tomorrow’, and ‘yesterday’ for a specific UTC date-time instance. Each of these returns a number in the range 1-7 (Monday to Sunday), representing the day of the week.

(defn- today
  [dt]
  (t/day-of-week dt))

(defn- tomorrow
  [dt]
  (t/day-of-week (t/plus dt (t/days 1))))

(defn- yesterday
  [dt]
  (t/day-of-week (t/minus dt (t/days 1))))

Now, a couple of functions that allow us to read the business hours for a specified day of the week, and tell us whether a given day is a business day or not.

(defn- business-day?
  [day business-timings]
  ;; decrement day for zero-based index
  (get (:business-days business-timings) (dec day)))

(defn- timings-for-day
  [day business-timings]
  ;; decrement day for zero-based index
  (get (:business-hours business-timings) (dec day)))

Let’s now use all of the above to create functions that can tell us the corresponding timings for today, tomorrow, and yesterday, and whether those days are business days.

(defn- timings-today
  [business-timings dt]
  (timings-for-day (today dt) business-timings))

(defn- timings-tomorrow
  [business-timings dt]
  (timings-for-day (tomorrow dt) business-timings))

(defn- timings-yesterday
  [business-timings dt]
  (timings-for-day (yesterday dt) business-timings))

(defn- business-day-today?
  [business-timings dt]
  (business-day? (today dt) business-timings))

(defn- business-day-tomorrow?
  [business-timings dt]
  (business-day? (tomorrow dt) business-timings))

(defn- business-day-yesterday?
  [business-timings dt]
  (business-day? (yesterday dt) business-timings))

Next, once we’ve read the business hours for a day, we’ll need some functions to help us build the actual time intervals for those business hours. In other words, the date time instances corresponding to the start time and end time of the business hours (on a given day). Note - these start and end date-time instances will need to be created in the timezone used in the configuration.

(defn- business-hours-start-today
  [time-slot dt]
  (t/from-time-zone
    (t/date-time (t/year dt) (t/month dt) (t/day dt)
                 (get-in time-slot [:from :hours])
                 (get-in time-slot [:from :minutes]))
    (t/time-zone-for-id (:timezone time-slot))))


(defn- business-hours-end-today
  [time-slot dt]
  (let [end-hour (get-in time-slot [:to :hours])
        end-minute (get-in time-slot [:to :minutes])]
    (t/from-time-zone
      (if (every? zero? [end-hour end-minute])
        (t/date-time (t/year dt) (t/month dt) (t/day dt) 23 59 59 999)
        (t/date-time (t/year dt) (t/month dt) (t/day dt) end-hour end-minute))
      (t/time-zone-for-id (:timezone time-slot)))))


(defn- business-hours-start-yesterday
  [time-slot dt]
  (t/minus (business-hours-start-today time-slot dt) (t/days 1)))


(defn- business-hours-end-yesterday
  [time-slot dt]
  (t/minus (business-hours-end-today time-slot dt) (t/days 1)))


(defn- business-hours-start-tomorrow
  [time-slot dt]
  (t/plus (business-hours-start-today time-slot dt) (t/days 1)))


(defn- business-hours-end-tomorrow
  [time-slot dt]
  (t/plus (business-hours-end-today time-slot dt) (t/days 1)))

Let’s now use the above to create check functions for today, tomorrow, and yesterday. The check passes if the specified date-time instance falls in any of the time slots configured for that particular day. As you may remember, the problem statement requires allowing multiple different time slots for the same day (each with its own timezone). We use clj-time’s within? function (which is timezone aware) to make this check, so we don’t need to shift the specified date-time instance to the timezone used for the business hours.

(defn within-todays-business-timings?
  [business-timings dt]
  (when (business-day-today? business-timings dt)
    (let [time-slots (timings-today business-timings dt)]
      (some (fn [time-slot]
              (t/within? (business-hours-start-today time-slot dt)
                         (business-hours-end-today time-slot dt)
                         dt))
            time-slots))))


(defn within-tomorrows-business-timings?
  [business-timings dt]
  (when (business-day-tomorrow? business-timings dt)
    (let [time-slots (timings-tomorrow business-timings dt)]
      (some (fn [time-slot]
              (t/within? (business-hours-start-tomorrow time-slot dt)
                         (business-hours-end-tomorrow time-slot dt)
                         dt))
            time-slots))))


(defn within-yesterdays-business-timings?
  [business-timings dt]
  (when (business-day-yesterday? business-timings dt)
    (let [time-slots (timings-yesterday business-timings dt)]
      (some (fn [time-slot]
              (t/within? (business-hours-start-yesterday time-slot dt)
                         (business-hours-end-yesterday time-slot dt)
                         dt))
            time-slots))))

And now the final check function, which will essentially be the API for this NS, and which uses all the functions we have defined so far. The check passes if the specified date-time instance falls within today’s, tomorrow’s, or yesterday’s business hours.

(defn within-business-timings?
  [business-timings dt]
  (or (within-todays-business-timings? business-timings dt)
      (within-tomorrows-business-timings? business-timings dt)
      (within-yesterdays-business-timings? business-timings dt)))

(comment (within-business-timings? test-business-timings (t/now)))
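
As a worked illustration (my own example against the test-business-timings map above, not from the original post), a Monday 11:00 UTC instant falls inside Monday’s 16:00-17:00 Asia/Kolkata slot, while the corresponding Wednesday instant does not match any of the surrounding days:

(comment
  ;; Monday 2016-10-10 11:00 UTC is 16:30 IST, inside Monday's
  ;; 16:00-17:00 Asia/Kolkata slot => truthy
  (within-business-timings? test-business-timings
                            (t/date-time 2016 10 10 11 0))

  ;; Wednesday 2016-10-12 11:00 UTC: Wednesday is not a business day,
  ;; and neither Tuesday's nor Thursday's slots cover this instant => nil
  (within-business-timings? test-business-timings
                            (t/date-time 2016 10 12 11 0)))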

Summary

Functional programming is not about coolness or the bleeding edge. Rather it is - in my humble opinion - the most natural and easy way to solve a problem. With the side benefit that it’s also cool, and the bleeding edge. For the full code and tests, see https://github.com/moizsj/business-timings/.

Moiz Jinia
Backend Clojure Dev @ Helpshift
October 10, 2016

Today we are excited to open source Herald - a load feedback and check agent for HAProxy.

Herald is the agent service for the agent-check feature in HAProxy. This feature allows a backend server to issue commands and control its own state in HAProxy. It can be used for out-of-band health checks, load feedback (which we explain below), and many other use cases.

Background

At Helpshift we swear by HAProxy, which is an extremely powerful and versatile load balancer.

HAProxy balances requests from a frontend (the client-facing interface) to a pool of backend servers. For simple HTTP requests the roundrobin balancing algorithm works well, but for long-lived connections even balancing is not achieved. The leastconn balancing method helps to a certain extent - requests are sent to the backend with the fewest connections - but in a cluster of HAProxy nodes this breaks down, as no connection state is shared between the clustered HAProxy nodes.

To solve this, we need a mechanism to regulate the traffic sent by HAProxy based on the load condition of the respective backend server in real time. We built Herald to solve this load feedback use case.

Load feedback

As the name suggests, the backend servers can send feedback to HAProxy and regulate their incoming traffic. The application on the backends must expose its load status (a metric such as requests per second) over some interface, such as HTTP.

When agent-check is enabled, HAProxy periodically opens a TCP connection to Herald running on the backend servers. The agent on the server queries the application via the load status interface and replies with a weight percentage, say 75%. This directs HAProxy to scale down the traffic sent to this node accordingly (the server’s effective weight becomes 75% of its configured weight). The percentage can keep changing as per the current traffic conditions.
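As a rough illustration (not our production config; the backend name, addresses, and ports below are made up), enabling agent-check on a backend looks something like this:

backend app-servers
    mode http
    balance leastconn
    # HAProxy connects to the agent on port 5555 every 2 seconds; a reply
    # such as "75%" scales this server's effective weight to 75 out of the
    # configured 100.
    server app1 10.0.1.10:8080 check weight 100 agent-check agent-port 5555 agent-inter 2s
    server app2 10.0.1.11:8080 check weight 100 agent-check agent-port 5555 agent-inter 2s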

Besides regulating the load percentage, the response can also be an HAProxy action, such as MAINT, DOWN, READY, UP, etc. The HAProxy documentation here has more details on agent-check.

Herald

Herald is the HAProxy agent that runs on the backends. It has been designed to be generic, with a plugin architecture that, beyond load feedback, can be used for other use cases as well.

The agent has two responsibilities:

  1. Respond to HAProxy agent requests, and
  2. Query the backend application, and calculate the HAProxy agent response

The following illustrates this simple architecture:

Herald Architecture
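
To make the flow concrete, here is a minimal Python sketch of such an agent loop. This is not Herald’s actual code - the load endpoint, port, and weight formula are assumptions for illustration, and Herald’s plugin framework handles this far more generally:

# Not Herald's implementation - just a sketch of the agent-check reply pattern.
from gevent import monkey; monkey.patch_all()

import urllib.request
from gevent.server import StreamServer

APP_LOAD_URL = "http://127.0.0.1:8080/load"  # hypothetical load-status endpoint
AGENT_PORT = 5555                            # port HAProxy's agent-check connects to

def current_weight():
    # Query the app's load (assumed to be a 0-100 number) and map it to a
    # weight percentage for HAProxy.
    try:
        load = float(urllib.request.urlopen(APP_LOAD_URL, timeout=1).read())
        return max(1, min(100, int(100 - load)))
    except Exception:
        return 100  # fall back to full weight if the app can't be queried

def handle(sock, _addr):
    # HAProxy opens a short-lived TCP connection and expects a reply such as
    # "75%\n"; it then scales this server's effective weight accordingly.
    sock.sendall(("%d%%\n" % current_weight()).encode())
    sock.close()

if __name__ == "__main__":
    StreamServer(("0.0.0.0", AGENT_PORT), handle).serve_forever()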

Plugins

Herald plugins do the job of querying the application and interpreting the result. Plugins for file and HTTP are available; others can be easily added.

The following features are provided out of the box by the plugin framework:

  • Response result caching
  • JSON parsing and processing of the parsed data using Python dict expressions
  • Arithmetic expressions on the result
  • Regex pattern matching on the result
  • Fallback in case of failures

Performance

Since Herald is entirely network and IO driven, we chose gevent, a coroutine-based Python networking library. Gevent uses greenlets to spawn cooperatively scheduled tasks, which yield when blocked, such as while waiting for IO. The resulting code is very simple and highly performant.

We have been using Herald in production for almost a year with no issues, on a cluster of 100+ instances (varying with autoscaling), serving requests from a cluster of 10+ load balancers. We have had 100% uptime on our load feedback and agent-check system.

Code and Documentation

The source code is on the Helpshift GitHub page. Please follow the README for installation and configuration instructions. We hope Herald is useful to others. Feature requests and pull requests are most welcome.

Raghu Udiyar
DevOps @ Helpshift
November 20, 2015

Background

Testing business logic quickly becomes tricky as applications grow and scale. Example-based unit and integration tests, and exploratory tests, become poor choices for checking and verifying a large state space. Such methods are also not well suited to clearly describing the state space and transitions we want to test.

For quite some time I have been experimenting with DSLs to write our tests at Helpshift. I recently gave a talk on “Testing business logic using DSLs in Clojure” at Functional Conf this year, where I shared my learnings. I hope you enjoy it. Feel free to ask me any questions you have.

Talk

Slides

If this sounds interesting to you, we are hiring - join us!

Mayank Jain
Backend Engineering
April 10, 2015

Google is phasing out OpenID in favour of OAuth 2.0, with a deadline of 20th April, 2015 - just 10 days from today. A lot of projects depend on Google auth and can’t easily move to another OpenID provider. I recently had to fix this for Jenkins and Gerrit.

Jenkins has a great plugin available for this, which was a piece of cake to install and configure. But it wasn’t so easy with Gerrit. A lot of Gerrit users have been asking for OAuth support since May last year; we finally got it when David Ostrovsky wrote the gerrit-oauth-provider plugin.

I’ve listed the steps I followed below:

  1. OAuth2 credentials

    Get these from the Google Developers Console, and note down the client id and client secret. Ensure the redirect URL is set to /oauth, i.e. http://gerrit.yoursite.com/oauth.

  2. Get the custom Gerrit war file

    There are a few Gerrit changes the plugin needs that haven’t been merged yet. A custom war with the plugin is available here. Download this gerrit-2.10.2-18-gc6c5e0b.war file to the new Gerrit server.

  3. Back up the current Gerrit data

    Create tarballs of the data directories and dump the Postgres data (if Postgres is being used):

    old-gerrit~$ tar czpf gerrit.tar.gz /srv/gerrit/gerrit
    old-gerrit~$ tar czpf repositories.tar.gz /srv/gerrit/repositories
    old-gerrit~$ pg_dump -xO -Fc reviewdb > reviewdb-$(date +%d-%m-%Y).pdump

  4. Restore data to the new Gerrit server

    gerrit:/srv/gerrit$ tar xzpf repositories.tar.gz
    gerrit:/srv/gerrit$ tar xzpf gerrit.tar.gz

  5. Restore Postgres data

    psql: ALTER USER gerrit WITH SUPERUSER;
    $ dropdb reviewdb
    $ createdb reviewdb -O gerrit
    $ pg_restore -O -d reviewdb --role=gerrit reviewdb-20-03-2015.pdump
    psql: ALTER USER gerrit WITH NOSUPERUSER;

  6. Run migrations

    Gerrit requires cascading migrations to be run for every major version released. For example, to update from 2.5 to 2.10, we have to run the following:

    $ sudo su - gerrit -s /bin/bash
    $ java -jar gerrit-2.8.6.1.war init -d gerrit
    $ java -jar gerrit-2.9.4.war init -d gerrit
    $ java -jar gerrit-2.9.4.war reindex --recheck-mergeable -d gerrit

    For the custom war migration, be sure to configure the OAuth plugin:

    $ java -jar gerrit-2.10.1-4-a83387b.war init -d gerrit
    […]
    OAuth Authentication Provider

    Use Google OAuth provider for Gerrit login ? [Y/n]?
    Application client id :
    Application client secret :
    confirm password :
    Link to OpenID accounts? [true]:
    Use GitHub OAuth provider for Gerrit login ? [Y/n]? n

    $ java -jar gerrit-2.10.1-4-a83387b.war reindex -d gerrit

  7. Switch old gerrit domain name to the new server

    For automatic account linking to work, the domain name must match the old server. Otherwise, the OpenID accounts will not be linked with the new OAuth2 accounts.

  8. Start the Gerrit server and confirm everything works

    gerrit:/srv/gerrit$ ./bin/gerrit.sh start

Raghu Udiyar
DevOps @ Helpshift

Helpshift is an in-app mobile help desk designed to improve customer support efficiency by over 400% and reduce cost by over 70%. Our engineering team is committed to building a meticulously designed, solid SDK to improve customer retention for clients such as Flipboard, Supercell, and more. We are currently serving over 500 million app sessions weekly.

Learn more about us at Helpshift.com