vgaltes (3)
#1
Hi,

I know there's a unit about canary deployments, but I'm trying this now and I have some questions.

I've seen that we can do canary deployments in AWS Lambda and that there is a Serverless Framework plugin to do it, but the configurations available for how long the traffic is split are quite limited. Is it possible to customise this? Is it possible to manually change the % of traffic that goes to a function version?

Apart from that, my biggest doubt is how you can control the downstream calls. Imagine that the new version puts a message on an SNS topic and I want the function that receives that message to be the matching version too. Is it possible to control this?

And the last question is, is it possible to do blue/green deployments in Lambda?

Thanks,
Yan Cui (53)
#2
Hi Vicenç,

1. Is it possible to manually change the % of traffic that goes to a function version?

Unfortunately, we're constrained by the predefined deployment preference types available in CodeDeploy for Lambda. Lambda's traffic shifting feature does let you specify what percentage of traffic goes to which version, but to take advantage of that we'll need CodeDeploy to catch up, or you'll have to implement everything CodeDeploy offers yourself, i.e. monitoring metrics and adjusting the percentages on a schedule.
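
If you do go down the do-it-yourself route, the traffic shifting itself is a single call to update the alias's routing config. Here's a minimal sketch in Python, assuming a boto3 Lambda client is passed in; the function name, alias and version numbers are made up for illustration, and the monitoring and scheduling around it are left out entirely.

```python
def shift_traffic(client, function_name, alias, stable_version,
                  canary_version, canary_weight):
    """Send `canary_weight` (0.0 to 1.0) of the alias's invocations to
    `canary_version`; the rest keep going to `stable_version`.

    `client` is assumed to be a boto3 Lambda client, i.e. boto3.client("lambda").
    """
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0.0 and 1.0")

    # An empty AdditionalVersionWeights map clears the weighted routing,
    # so all traffic goes back to the stable version.
    weights = {canary_version: canary_weight} if canary_weight > 0 else {}

    return client.update_alias(
        FunctionName=function_name,
        Name=alias,
        FunctionVersion=stable_version,
        RoutingConfig={"AdditionalVersionWeights": weights},
    )


# Hypothetical usage: send 2% of the "live" alias's traffic to version 2.
# import boto3
# shift_traffic(boto3.client("lambda"), "my-function", "live", "1", "2", 0.02)
```

Calling it again with a weight of 0 rolls everything back to the stable version. The hard part is everything else CodeDeploy does for you: deciding when to call it and when to roll back.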

The serverless framework plugin that lets you do canary deployment is modelled after SAM - which does this by creating CodeDeploy resources as part of its CloudFormation stack. So if you use that plugin then you're stuck with 10%.

2. Can you control the downstream calls so producers and consumers have matching versions?

I don't believe that's possible, nor is it likely to be solved at the platform level. Solving this kind of problem at the platform level is really hard, given the diverse set of event sources we have, and the fact that whether there's a Lambda at the other end of an SNS message or an API call is an implementation detail.

You can do this at the application layer, but that usually comes with a heavy price in terms of complexity. In my experience, it's almost never worth the price, and instead you should design around it and avoid breaking changes at the contract level.
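
To illustrate what I mean by designing around it: if the new producer only adds fields, a consumer that tolerates their absence works with messages from both versions, so it doesn't matter which version published the message. A rough sketch in Python, where the message shape and field names are hypothetical:

```python
import json


def parse_order_placed(sns_message_body):
    """Parse an 'order placed' message from either the old or the new producer.

    Hypothetical contract: the old producer sends only `order_id`;
    the new producer adds an optional `currency` field.
    """
    payload = json.loads(sns_message_body)
    return {
        # Present in both versions of the message.
        "order_id": payload["order_id"],
        # Added by the new version; the default keeps old messages valid.
        "currency": payload.get("currency", "USD"),
    }


old_message = json.dumps({"order_id": "42"})
new_message = json.dumps({"order_id": "43", "currency": "EUR"})
print(parse_order_placed(old_message))
print(parse_order_placed(new_message))
```

As long as changes stay additive like this, producers and consumers can be deployed and canaried independently.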

3. Is it possible to do blue/green deployments in Lambda?

There's no need to do it yourself, as Lambda does that for you already.

Remember, blue/green deployment is intended to remove downtime during deployment, and consists of 1) bringing up a new cluster, 2) flipping a switch to route all traffic to the new cluster when it's ready, and 3) terminating the old cluster.

That's what you get out of the box with Lambda. The problem then is that any bugs that made it through testing are exposed to 100% of your users, which is why people move away from blue/green deployments and into the realms of canary deployments or dark launches, where you limit the blast radius of any uncaught bugs. The difference is that one is done at the infrastructure level (canary) and the other at the application level (dark launch/feature flags).


Hope these answer your questions!
vgaltes (3)
#3
Thanks Yan!

When you say "So if you use that plugin then you're stuck with 10%", what do you mean?

Thanks!
Yan Cui (53)
#4
By that, I meant that if you use this plugin, https://github.com/davidgf/serverless-plugin-canary-deployments, with the serverless framework, you're relying on CodeDeploy, which doesn't let you configure the % yourself, as all the supported deployment types use 10%.

The Canary10PercentXMinutes deployment types go from 10% -> 100% after X minutes.
The Linear10PercentEveryXMinute deployment types incrementally add 10% every X minutes.

In both cases, you can't change the %, and you have to use the predefined intervals: 5, 10, 15 or 30 minutes for the canary types, and 1, 2, 3 or 10 minutes for the linear types.
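
To make the two shapes concrete, here's a small Python sketch of the percentage of traffic on the new version over time. It's only a model of the behaviour described above, not CodeDeploy code, and it assumes the linear type starts at 10% as soon as the deployment begins.

```python
def canary_weight(minutes_elapsed, interval_minutes):
    """Canary10PercentXMinutes: 10% until X minutes have passed, then 100%."""
    return 10 if minutes_elapsed < interval_minutes else 100


def linear_weight(minutes_elapsed, interval_minutes):
    """Linear10PercentEveryXMinute: 10% at the start (assumption), plus
    another 10% every X minutes, capped at 100%."""
    completed_intervals = minutes_elapsed // interval_minutes
    return min(100, 10 * (completed_intervals + 1))


# Compare both shapes using a 10-minute interval.
for t in (0, 10, 45, 90):
    print(t, canary_weight(t, 10), linear_weight(t, 10))
```

With a 10-minute interval, 45 minutes in the canary type is already at 100% while the linear type is at 50%, and under this model the linear type only reaches 100% at the 90-minute mark.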
vgaltes (3)
#5
Oh yes, I've seen that and it doesn't look very good...

Thanks!
Yan Cui (53)
#6
I think the biggest problem with it is not the lack of control over the % or frequency. Those are limitations for sure, and at scale 10% is way too high for a canary, but most people wouldn't be too worried about it, and those who would be are likely to have the engineering resources to build their own pipeline for monitoring and adjusting the percentage.

The biggest concern for me is that it's shifting traffic, not users. It means that even if you shift only 10% of the traffic for 5 minutes, you can actually impact 100% of the users that accessed the system during those 5 minutes!
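
A quick back-of-the-envelope sketch in Python shows why. It assumes every request is routed independently with a 10% chance of hitting the new version, and contrasts that with a sticky, per-user split where the user ID is hashed into a bucket. The function names and the hashing scheme are my own illustration, not something Lambda or CodeDeploy gives you.

```python
import hashlib


def chance_user_hits_canary(requests_per_user, canary_fraction=0.10):
    """Probability that a user gets at least one response from the canary
    when each request is routed independently."""
    return 1 - (1 - canary_fraction) ** requests_per_user


def user_in_canary(user_id, canary_percent):
    """Sticky alternative: hash the user ID into one of 100 buckets, so the
    same user always lands on the same side of the split."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < canary_percent


# Per-request routing: the more requests a user makes, the more likely
# they are to be exposed to the canary at some point.
for n in (1, 5, 20, 50):
    print(n, round(chance_user_hits_canary(n), 3))

# Per-user routing: the answer for a given user never changes between calls.
print(user_in_canary("user-123", 10))
```

Under that assumption, a user who makes 20 requests during the window has roughly an 88% chance of hitting the new version at least once, whereas hashing on the user ID keeps the exposure to about 10% of users regardless of how many requests each one makes.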

In the canary deployment unit, I'm going to cover client-side routing (which is probably not gonna look all that polished given the time constraint, but should be enough to illustrate the idea so you can reach a more polished solution yourself).