Sidekiq Memory Leak

While deploying an app to two brand new Ubuntu 14.04 LTS servers, we discovered a nasty memory leak.

The offending processes were our sidekiq workers, which were being killed by the Linux kernel’s OOM killer within 24 hours. Oddly, the same processes had been running just fine on an older version of Ubuntu (11.10). Unfortunately, an email to Mike Perham, the main guy behind sidekiq, netted me a response I’d given so many times before: “did you try turning it off and on again?”

Of course, restarting the process reclaimed the leaked memory, but watching a process’s memory usage and restarting it every day did not seem practical.

Ruby Upgrade

So we devised a plan to upgrade Ruby from 2.1.1 to 2.2.0. One of our expert developers led the upgrade of Ruby on the hosts that house our sidekiq processes. Until a new version of sidekiq that addresses this issue is released, a Ruby upgrade seemed to be our only option in userspace. After the upgrade, testing was performed on our main app as per usual, and then I got my hands on it to test the memory leak issue.

At first it seemed that our app was performing normally, and the process memory use levelled off. However, once a production load was put onto the new code, the memory leak reared its ugly head once more.

Kernel up/downgrade

The kernel on the old boxes where sidekiq was working was 3.0.0-26-virtual, on Ubuntu 11.10.

On the newer boxes that were exhibiting the memory leak, the kernel in use was 3.13.0-29-generic on Ubuntu 14.04 LTS. The potential for a bad interaction between Ruby’s memory allocation (malloc()) and the newer kernel seemed high enough to warrant testing another kernel version.

Unfortunately, upgrading a kernel on an Amazon EC2 instance is far from a simple process, and after breaking several boxes and wasting several days on research and testing, I gave up on this path. If you are running sidekiq in a non-EC2 environment, I implore you to try a downgrade or upgrade of the kernel to see whether you can reproduce or eliminate this memory leak behaviour. If the next Ruby or sidekiq upgrades do not fix the issue, I will be building a third production box within my vSphere environment, where it’s a little bit easier to modify the host kernel.

What did work

Call it powdering a corpse, or call it duct tape; I ended up using monit to monitor the process with the memory leak, and to restart said process if its memory consumption went above a threshold. To do this, I created a file called /etc/monit/conf.d/api_code.conf and wrote the following poetry inside

check process api_code with pidfile /app_path/pids/api_code.pid
    start program = "/usr/sbin/service api_code start "
    stop program = "/usr/sbin/service api_code stop "
    if totalmem > 1024 MB then restart
    if 5 restarts within 5 cycles then timeout
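
After dropping that file in place, something like the following should validate and activate it (a quick sketch; monit’s HTTP interface needs to be enabled in monitrc for the reload and summary commands to talk to the running daemon):

sudo monit -t          # syntax-check the monit control files
sudo monit reload      # re-read the configuration and pick up the new check
sudo monit summary     # confirm the api_code process is now being monitored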

I’ve also configured an alert with M/Monit so that I can track every time the process is restarted. While this solution works and keeps our production API afloat, it does not a happy admin make. And while this particular piece of code is relatively harmless to restart, other pieces of our code could do real damage if interrupted. It’s not hard to imagine a process hitting that memory threshold in the middle of doing something important.

Let’s hope that “turning it off and on again” isn’t a long-term solution.

Octopress

Out with the Old

Growing tired of having to stay on my toes to update, secure, and monitor my WordPress installations, I figured it was time to move to a different CMS. After several false starts with some rather obnoxious blogging tools, I settled on Octopress.

The big selling feature was security: Octopress generates static content locally, which you can then deploy to web servers by several means (scp, rsync, GitHub Pages, etc.).
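
For example, an rsync deployment can be as simple as pushing the generated public/ directory to the web server (a rough sketch; the user, host, and path are placeholders, and Octopress also provides rake tasks such as rake generate and rake deploy that wrap these steps):

rake generate                                       # build the static site into public/
rsync -avz --delete public/ user@example.com:/var/www/blog/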

Other features that I found to be worthy of mention:

  • the Markdown language is simple, and simple is good
  • very small file footprint, and no moving parts
  • zero dependencies on the web server (other than the web server itself)
  • exitwp can be used to migrate existing content from WordPress to Octopress

Necessary Privacy

Fifteen years ago, I held the position that governments and technology providers were not only able to monitor data communications through several dragnet operations, but were actively doing so. Being dismissed as a conspiracy theorist left me with very few people to share GPG keys with for email encryption; not that this stopped me from storing my information encrypted, hiding my online tracks, and taking other actions that furthered the perception that I was a bit paranoid. A typical conversation on this topic usually led to the question “what do you have to hide?” I’ve never had anything incriminating to hide, and it was never about that.

It was about control over my own personal information, where it ended up, who had access, who was rewriting it, and who was selling it. It was also about being labeled something I wasn’t, just because of an acquaintance or affiliation.

Around the time that I started really taking privacy seriously, I pulled out my soapbox and started preaching to anyone within earshot, usually to their chagrin. I observed that most people considered their privacy to be a right, and placed their trust in the authorities to regulate and manage any information collected about them.

Fast forward to today. A simple, civilian-accessible search on someone’s name can get you emails, phone numbers, addresses, current and past employers, images, and purchases on eBay. Each piece of information can be used to reveal more and more, until an entire profile of that person has been built.

Or you can just add them on Facebook.

So what is the problem with all of this information being collected and processed?

  • This data is persistent; it will not be deleted

  • The data is only good if it’s properly related to other data from the same person; processes are in place to ensure validity, but there is no guarantee that your data isn’t linked erroneously

  • There are no regulations or laws saying that all of your data must be accessible to you (even through FOI/FIPPA)

  • There are no mechanisms in place to opt out of this collection

But what happens when the analytics engine fails? When we’re talking about big data, we’re talking about a staggering amount of information. No single person or group could manually sift through this data and draw correlations. A program or process handles that workload. History has taught us that programs and processes have bugs. Suddenly, you may find yourself on a list among sex offenders, political dissidents, or worse, just because you had a friend in common with a wanted criminal, and your grandmother’s computer had been infected with a botnet that was being used to distribute anti-government propaganda. There is no one you can call to get this association removed.

So what can you do to prevent this in the first place? You can become translucent, obfuscate, and protect. Starting with your web browser and email, encrypt everything that you can.

  • HTTPS for web connections, and GnuPG for email (see the example after this list)

  • Don’t bother trying to encrypt to keep the NSA out; just try to keep data-mining apps and scripts from hauling off your personal data

  • Try not to leave tracks: request that sites do not track you, turn off scripts with NoScript, block incoming ads, and don’t accept or store third-party cookies

  • Shred your bills and other hard copies with personal information

  • Don’t give your postal code or personal details to merchants
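
As a minimal sketch of the GnuPG suggestion above (the email addresses and filenames are placeholders):

gpg --gen-key                                          # generate your keypair; follow the prompts
gpg --armor --export you@example.com > pubkey.asc      # export your public key to share with others
gpg --import friend-pubkey.asc                         # import a correspondent's public key
gpg --encrypt --sign --recipient friend@example.com message.txt   # writes message.txt.gpg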

You cannot escape the data dragnet entirely without going completely off the grid. But you can minimize the chances of a disastrous theft of identity by corporate interests, governmental bodies, or other types of nefarious criminal organizations.

ASA Failover

Setting up failover between Cisco ASA units proved to be far simpler than I anticipated, and it’s a very useful technique if you value the availability and potential load balancing it has to offer. In the example below, I will demonstrate an Active/Standby configuration, which does not allow for load-balanced operation. For load balancing, you will need to run an Active/Active configuration, which requires the units to be in multiple context mode.

[Figure: failover topology diagram]

Our topology will include two ASA 5520 units, ASA4 and ASA5. After you’ve configured basic ASA operation, we’ll want to configure our interfaces. The difference between a failover setup and a normal setup is the standby address. The IP used for the standby address will end up being the address of the secondary unit while it’s in standby mode. If the secondary unit goes active, it will assume the primary address. So, the configuration on ASA4 is as follows:

interface Ethernet0/0
 nameif OUTSIDE
 security-level 0
 ip address 10.0.0.1 255.255.255.0 standby 10.0.0.2

interface Ethernet0/1
 nameif INSIDE
 security-level 100
 ip address 172.16.0.1 255.255.255.0 standby 172.16.0.2

Once set, enable the interfaces and test that they’re operational. Next, we’re going to configure the failover itself, and then configure interface ethernet0/3 as our dedicated failover link.
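
For reference, bringing the interfaces up and doing a quick check might look something like this (a rough sketch; the ping target is a placeholder host on the INSIDE network):

conf t
int e0/0
no shut
int e0/1
no shut
end
show interface ip brief
ping 172.16.0.10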

First, shut down ethernet0/3 on ASA4.

conf t
int e0/3
shut

Next, configure ASA4 with the following failover directives:

failover
failover lan unit primary
failover lan interface FAIL Ethernet0/3
failover interface ip FAIL 192.168.1.1 255.255.255.0 standby 192.168.1.2

This will set ASA4 as the primary unit, creating a failover interface named FAIL, attached to physical interface Ethernet0/3 with IP 192.168.1.1. We can leave Ethernet0/3 shut down for now.

Next, we want to configure ASA5 with the same interface configuration, only with the primary and standby addresses reversed:

interface Ethernet0/0
 nameif OUTSIDE
 security-level 0
 ip address 10.0.0.2 255.255.255.0 standby 10.0.0.1

interface Ethernet0/1
 nameif INSIDE
 security-level 100
 ip address 172.16.0.2 255.255.255.0 standby 172.16.0.1

Then we configure the failover directives on the secondary:

failover
failover lan unit secondary
failover lan interface FAIL Ethernet0/3
failover interface ip FAIL 192.168.1.1 255.255.255.0 standby 192.168.1.2

Then finally, we can enable e0/3 on both ASA4 and ASA5 with:

conf t 
int e0/3
no shut

At this point, you should see some console logging data showing synchronization between the ASA units, as they negotiate configuration data and enable the sync link on e0/3. Once complete, you will find the following output from running a ‘show failover’ on each host:

ASA4# show failover
Failover On
Failover unit Primary
Failover LAN Interface: FAIL Ethernet0/3 (up)
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 2 of 250 maximum
Version: Ours 8.0(2), Mate 8.0(2)
Last Failover at: 00:00:06 UTC Nov 30 1999
        This host: Primary - Active
                Active time: 3135 (sec)
                slot 0: empty
                  Interface OUTSIDE (10.0.0.1): Normal
                  Interface INSIDE (172.16.0.1): Normal
                slot 1: empty
        Other host: Secondary - Standby Ready
                Active time: 0 (sec)
                slot 0: empty
                  Interface OUTSIDE (10.0.0.2): Normal
                  Interface INSIDE (172.16.0.2): Normal
                slot 1: empty

ASA4# show failover
Failover On
Failover unit Secondary
Failover LAN Interface: FAIL Ethernet0/3 (up)
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 2 of 250 maximum
Version: Ours 8.0(2), Mate 8.0(2)
Last Failover at: 00:00:01 UTC Nov 30 1999
        This host: Secondary - Standby Ready
                Active time: 0 (sec)
                slot 0: empty
                  Interface OUTSIDE (10.0.0.2): Normal
                  Interface INSIDE (172.16.0.2): Normal
                slot 1: empty
        Other host: Primary - Active
                Active time: 3138 (sec)
                slot 0: empty
                  Interface OUTSIDE (10.0.0.1): Normal
                  Interface INSIDE (172.16.0.1): Normal
                slot 1: empty

There are a few key points to note in the failover output:

  1. You’ll notice that the hostname is the same on both ASA units. This is because they’re essentially a single logical entity now, with one unit acting as the primary, and syncing data to the secondary.

  2. You’ll also notice that the primary host has the x.x.x.1 addresses for its interfaces, while the secondary has the x.x.x.2 addresses. Watch what happens next…

To test our failover, log into ASA4, and type “no failover active”, which will tell the active failover host to drop to standby mode. DO NOT type “no failover”, as this will turn off failover altogether, and this would be bad on a production box. You’ll see in the output of “show failover” that the IP addresses of the INSIDE and OUTSIDE interfaces have swapped between the Primary and Secondary hosts.
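
In practice, the test is just these two commands on the currently active unit:

ASA4# no failover active
ASA4# show failover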

ASA4

This host: Primary - Standby Ready
                Active time: 3552 (sec)
                slot 0: empty
                  Interface OUTSIDE (10.0.0.2): Normal (Waiting)
                  Interface INSIDE (172.16.0.2): Normal (Waiting)
                slot 1: empty
        Other host: Secondary - Active
                Active time: 25 (sec)
                slot 0: empty
                  Interface OUTSIDE (10.0.0.1): Normal (Waiting)
                  Interface INSIDE (172.16.0.1): Normal (Waiting)
                slot 1: empty

ASA5

This host: Secondary - Active
                Active time: 25 (sec)
                slot 0: empty
                  Interface OUTSIDE (10.0.0.1): Normal (Waiting)
                  Interface INSIDE (172.16.0.1): Normal (Waiting)
                slot 1: empty
        Other host: Primary - Standby Ready
                Active time: 3552 (sec)
                slot 0: empty
                  Interface OUTSIDE (10.0.0.2): Normal (Waiting)
                  Interface INSIDE (172.16.0.2): Normal (Waiting)
                slot 1: empty

To summarize, we have created a configuration with two modes, active and standby. Then we created a link between our two devices, over which they synchronize their configuration and communicate their states. In the event that the Primary device fails, the Secondary device already has not only a working replicated configuration but also, if a stateful failover link has been configured, a copy of open connection states in memory. This allows for a nearly seamless cutover.