How We Migrated Over 35k E-commerce Stores to Google Cloud (Cloud Next ’18)


[MUSIC PLAYING] BARDIA DEJBAN: Thank you
for joining this session where I get to be the messenger
to the team’s great work in accomplishing an
amazing migration. Actually, there’s
been a single thread that’s been common among all the
leaders and the decision-makers that I’ve been talking
with in the past week. And that is that there’s
always one thing– one thing that leads up to the
decision and one thing that leads up to the catalyst– to be able to actually make
the decision to move to cloud. And for us, this was it. It was a moment of clarity
when our head of IT delivered this message to me. And, funny enough, my
five-year-old son at home, he came to me and
he saw the slide. He said, oh my gosh,
Daddy, what happened? And I said, don’t
worry, don’t worry, Daddy is not going
to let that happen. So what are we
going to talk about? We’re going to talk
about all the decisions that led up to us
migrating to the cloud and why we decided
on Google Cloud. We’re going to talk
about our initial plans. And you know what? Things changed, so
we’re also going to talk about what
actually really happened. Just a reminder for the folks
that don’t know Volusion. We are an all-in-one
commerce platform– we’re a SaaS platform. And we are the only
ones in the industry that are focused on SMBs. And the Small Business
Administration defines SMBs in
many different ways. And the one fact, or the
metric, that stands out to me, is the fact that
SMBs make, generally, less than $50 million annually–
those are our customers. At the heart of everything
that we do is our culture code. And, in fact, we reinvented
it a couple of years ago. And one of the most important
slides in our culture code that drives every
decision we make is that question, what’s
in it for the customer? And if you saw PH’s
session yesterday about being the customer, that’s
what we’re talking about here. And if you wanted to check
out more on our culture code, please go to
culture.volusion.com. Hopefully, it
influences you like we were influenced to create
it in the first place. Just a point of clarity– we actually have two
offerings for Volusion. We have what we refer to
internally as V1 and V2. So V2 is modern in the cloud. It’s our new offering. But for the purposes
of this session, we’re going to focus on V1. What is V1? It’s software that’s been pretty
much in development since 1999. It’s got a lot of code. It’s predominantly Windows,
IIS, Microsoft SQL Server. And the team, over
the years, has figured out how to
orchestrate and how to put customers on
different segments and different groupings
of IIS and database depending on the
volume that they get. So some merchants get
more traffic than others. And the team
figured out that isolation over the years. As an example, we actually
have pretty huge traffic spikes for certain customers. In fact, over a two-
or three-day period, we experience very heavy
loads, and we call that period Black Friday where our merchants
get a lot of shopper traffic. So let’s talk about why
cloud and why Google Cloud? So here’s a picture– actually, this is misleading. We had Data Center 0.1, which
is the founder’s bedroom closet. This is Volusion
Data Center 1.0. And one thing that
you’re going to notice is single points of failure– single CPU, single
network, single disk. And this is actually a
closet in one of the offices. But we evolved, we
learned, we got better. And we went to a colocation
in Simi Valley, California. But you’ll still see
the same problems– single points of failure,
single deadbolt security to get in, single
network, single power. So the team invested
in a tier 4 colocation, and that’s based in Austin,
Texas, and it’s phenomenal. The security is amazing. The redundancy on power
and cooling is amazing. And, in fact, the initial
proposal from the team was, let’s move everything
from Simi Valley, California to this tier 4 colocation. And what that would
mean is, sometimes you have to buy a
bunch of hardware so that you can move over the
customers, and most of the apps can come over,
but you might also have to acquire some
software along the way. And then I had to
ask myself, which is a question that
some of you here might be asking yourselves,
which is should we be in the data center business? Or what value
would our customers get from us being in the
data center business? So we dug into the facts. And I found out something really
interesting– that over 50% of our failures were
the result of hardware or the interconnectivity between
that hardware within the data center– we call that network. So we started to go to
all the calculators on all the major cloud platforms,
and we started to scope out, what does it look like? How much is it going to cost for
us to move over to the cloud? And initial savings could
have been somewhere around 20% to 21%, so that sounded good. That’s money that we can use
for R&D for our customers. We could repurpose that. But how do you know which
cloud to use, right? There are so many choices. There’s the big three,
and there are tons more. So it’s really hard
to make that decision. How do you choose which
one’s right for you? Everyone has a
selection process. Every company has their own
mechanisms and frameworks. For us, it was really simple. We would get a
bunch of proposals from all the vendors and
all the cloud providers. We would spend some
time evaluating. We would do a proof
of concept, perhaps, during that evaluation. And then we would
make a decision. And what happened
was phenomenal. The best conversations
that we had were with the Google Cloud team. In fact, I’ll give
you an example. We had a major
networking constraint that we had to discuss. And in 24 hours,
we, as a team, got on a “Hangout” with the
team at Google Cloud that built the infrastructure– the networking infrastructure
on Google Cloud. And they were able to answer
our questions directly. And as another reference, one
of the other cloud providers, not to be named, sent
us to a vendor who got back to us in two weeks. It took two weeks
just to get back to us to have a phone call. So now we’re kind of
settling on Google Cloud. What’s our initial
plan going to be? How do we actually do that? It’s starting to sound
like it could fit. Once again, that all-important
question that we ask, what’s in it for the customer? And uptime is, perhaps, one
of the most important things. Our customers are SMBs. These are small
and medium businesses, these are moms-and-pops,
and if we go down, they can’t make money
to pay their mortgage, their employees. Sometimes
their only business is an online e-commerce store. So if we can eliminate
hardware failures in that graph I showed you, we’d get closer
to even more nines than 99.9. And we can scale
on-demand, so we can deal with that traffic that
spikes during Black Friday. Having everybody on a
homogeneous development environment– IT, engineering, the data
teams, the ops teams– would be powerful. And we’d be able to speed
up development times, having access to the same tools. And this would
benefit the customer because we’d get products
shipped to them a lot faster. Speaking of faster, some of
the case studies on Google Cloud actually
showed us that people are benefiting from the
globally available network that Google has. And we felt we could be faster. The network could be
faster just by moving over. There are so many other benefits.
about them, heard about them, in this Next and
last year’s Next, and all the collateral. The one thing that
stood out to us was the access that we would
get to the Google Cloud Platform teams. And, by the way, when
we started diving a little bit deeper with the
Google Cloud Platform team, we started to see a little bit
better cost-to-serve reduction. So the estimates
were about 33% now that we would save
on monthly cost to serve that we
can reinvest in R&D. So which migration
approach is right for us? There’s so many different ones. These are the three
core ones that we were kind of dabbling in
and making a decision on. And on the far left, there’s
the cloud-native approach, which is a complete rewrite
to match Google Cloud Platform tools and technology. All the way on the right,
it’s a little bit less work, but similar. You leverage the
cloud-native aspects. Both of those that
I just mentioned are pretty high-effort
and high-cost upfront, but the savings are huge. And in the middle, there’s
this thing called Lift & Shift. And Lift & Shift
just means move over as much as you can with
minimal modification. And that’s what we
decided to choose. We went with Lift & Shift. And what that meant is, if
we had an app server sitting in our colocation in our
data center somewhere, we would spin up an exact
instance like that in GCP and move it over. And we would only write code and
make changes if it didn’t work, if there were impediments. So that’s cool. We’re going to do Lift
& Shift, but how do you actually do that? So other people say it,
usually, better than I can and, in this case, Benjamin
Franklin said it best, “Tell me and I forget. Teach me and I remember. And involve me, and I learn.” And that last sentence is
what’s really important to us as a company. When you involve the
teams, when you get buy-in, when you get that
learning, then everybody is actually going to help move
this thing forward quicker. And, in fact, the way that
we decided to learn upfront was to engage the experts. And at the time that we decided
to do this, we engaged Google. They had a program called
Google Cloud Start. And they brought
in their experts to do infrastructure and
architecture planning and show us, this
is how you do it. And we got to learn from them. And speaking of learning,
there were a lot of ways that we could decide to
learn and choose to learn, and we took advantage of every
single one of these ways. So we engaged
professional services– once again, the experts
that have done this before. We had workshops on-site
and virtual for the teams. We had on-site training,
and sometimes we had classes of 25 or more
people, Volusioneers, in those training
sessions learning about different technologies
that GCP has to offer. And at the time
that we did this, there were no case studies. There were no other examples
of any other companies that have done this before. So we were pioneering a way
for a SaaS platform, SaaS e-commerce company, to move
over all these customers with minimal downtime. And we had a pretty accelerated
timeline to do that. We spent about one quarter
making the decision and planning. We spent another
quarter actually doing the migration itself. We spent a couple of months
flipping over the bits to get the customers live. And then, now
we’re in this phase that we call
modernization, where we’re trying to take advantage
of cloud-native technologies. And I find it remarkable
that last year, we were attendees to
this conference, and this year we’re talking
about how we actually did it. That’s a fantastic
achievement by the team. So that was the plan. So what actually
really happened? Problem number one– because we
host thousands of websites that need certificates,
there was no technology that Google had,
natively, cloud native, to be able to host
and do SSL termination on those certificates. So this was the first problem. And we have over 50,000 of them. And the proposal that we had
approached during the Cloud Start phase was a hybrid
traffic layer using NGINX. And if you don’t
know what NGINX is, it’s a reverse proxy
layer, an HTTP layer, that you can add in. So we would still leverage
some of the cloud Load Balancer aspects of Google, but
we had built this layer in between called NGINX. And that would handle the SSL
hosting and the termination across all the other layers
all the way to the store resolution. I think we had to
rebuild this a few times, but eventually, this
is what we settled on. And we internally refer to
this as the three-layer cake. It’s a combination of
Google Cloud at the top, with Google Cloud
Load Balancers, and NGINX spread
through the middle. What’s fantastic about this
is that different teams, across the company, can focus
on their respective layers. For example, InfoSec can
focus on the security layer. And because it’s
abstracted out this way, they usually don’t
impact the other layers. The app engineering team can
focus on the bottommost layer. Problem number two. At the time that we were
going to do this migration, we had two DNS servers. And we knew that that was
going to be a bottleneck. And the solution here
was Google Cloud DNS. And, in fact, to our
surprise, the day that we cut over our
DNS to Google Cloud DNS, we saw a 15% improvement in
first-byte response time, on average, across all
of our customer sites. So that was a fantastic win. Problem number
three– once again, because we are hosting
all these websites, there are thousands
of small files– CSS, JavaScript,
images, flat files, HTML files– plus the database. How do you move all those
over with minimal downtime? We actually leveraged a
combination of CloudEndure and our internal scripts. And CloudEndure has a product
called Live Migration, where it’ll constantly do continuous
replication from our data center over to Google Cloud. And only when we’re ready, would
it actually do the cutover. So this was a fantastic tool. A little bit deeper
dive on that, this is how we did it,
every single night, with minimal maintenance
window and minimal downtime for our customers. So we would update the
DNS record to point to the new Google Cloud DNS. We’d move all of the files,
we’d update the NGINX layer that we talked about,
we’d move any last pieces of the database, and then
the customer would be live. We did this all with
a 5- to 15-minute maintenance window per customer
or per cluster of customers. So that was how we did
it every single night. And I believe it took us
about two and a half to three weeks to get through these
nightly migrations. So I’d love to also share
some of the successes and the learnings we
had along the way. Sometimes code
just doesn’t work. And, in fact, we had
over 100 Jira tickets that we had filed
before the migration, and there were some issues that
we didn’t catch along the way. But I’ll give you an example
of what you’d look for. Any hard-coded IPs, any
hard-coded server names, any hard-coded
file shares– those are things that
we would have to change. We thought that, by
using multiregion, that we would be
closer to the shoppers, and we’d be closer
to the merchants. But that didn’t necessarily
work out the right way. Because of the way that
Windows likes to route traffic, and NGINX likes
to route traffic, and Linux likes to route
traffic, that kind of diversity caused saturation
at the WAN layer, and we had unpredictable
traffic patterns. And by going to
a single region– and we settled on Central– we were able to get
consistent performance and predictable performance. And then later, the team
can decide on how to tweak and tune that to make it faster. And in fact, here’s one of
the tweaks that they did. So NGINX, by default, proxies to the back end over
HTTP 1.0, which opens a new connection for every request. By turning on HTTP 1.1
persistent connections, once again, with no code
changes, we were able to deliver another
50% first-byte improvement on top of that 15% I
had mentioned earlier. It’s a 50% improvement
on first byte– no code change. We had a challenge on
migration latency, basically moving all those
thousands of files over. Over the years, we’d
built all these scripts and tools to be able to
shuffle customers around in our internal data
center infrastructure. And we just thought
that they would work. We thought all of these scripts would
work if we pointed the destination to the cloud,
and then they didn’t. And that’s where
CloudEndure came in. Some other things that
were pretty remarkable happened along the way too. Because we had
training and learning and workshops at
Volusion, other teams caught a glimpse of what
the power of GCP is. And in this case,
the marketing team, they learned all
about NGINX, and they decided to use NGINX for A/B
testing on our www.volusion.com site. And there are a lot of A/B
testing tools out there, but what we’re trying
to do is pretty unique. We had a rebrand effort. And sometime during
the migration, these tools came about. And we were able
to split traffic from our classic nonbranded
site to the new branded site without redirecting anyone
to a new context path. That means everybody
stayed on www.volusion.com. They didn’t go to /B or /A,
and it was pretty smart. This layer in between
would check the URI string, it would check query string,
it would check cookies. And it would persist that user
to that particular A/B test. So this was a great win. And the marketing
team also decided to use buckets and
Ghost to host the blog. So our previous version of the
blog was actually pretty slow. And we didn’t realize it until
we saw how fast it was once we moved it over to GCP. But once we did move it
over to Google Cloud, and we took advantage
of Cloud CDN, this thing was blazing fast. And it’s gone through a
few more iterations now, but it’s definitely a
win to mention here. So as I mentioned,
first-byte response times have improved for our customers
by 65% with no code change. And we have been able
to save even more money than we originally forecasted. That’s money that we can spend
on product development and R&D. So right now, we
are saving 37% month over month on our cost-to-serve. And we feel like we’re
going to keep going down that path of saving more money. And what’s in it
for the customer? Here’s a quote from
one of our customers. Derek operates a
sporting goods store. It’s a paintball
store called ANS Gear. The quote is actually
really long here. I’m going to call attention to
the parts that you can’t see, which is the fact that,
for the first time ever, he was able to run
a marketing promotion 100% on our platform without seeing
any degradation in performance of his response
time on his site. And historically,
when site traffic goes up, even a half second
of latency causes sales to not go up,
or maybe even dip. That didn’t happen
on GCP at all. So this is a fantastic
win for the customer. We’ve got to take a
step back and celebrate sometimes, celebrate
your wins as a company. And we got a chance to do that
both publicly and internally. We got to celebrate being the
first to migrate to cloud, and we had a big
party internally to celebrate the achievement. So if you only have
to remember one thing from this presentation, it’s
that, whatever decision you make, whether you
want to move to cloud, whether you’re
already on the cloud, you should ask yourself,
what’s in it for the customer? And if you can remember
a couple of things, it’s that learning and
persistence and perseverance is how you get there. Because there were so many
times that we could give up, so many times it’s
easy to say, no, we’re not going to do that right now. We’re just going to go to
a colo and call it a day. But it’s that
perseverance attitude that’s going to get you there. So thank you so much. We have created an email
address for any questions. [MUSIC PLAYING]
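
For readers who want something concrete, the audit for hard-coded IPs, server names, and file shares mentioned in the talk might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the two regular expressions, the `find_hardcoded` name, and the sample lines are hypothetical and are not Volusion's actual tooling.

```python
import re

# Hypothetical patterns: IPv4 literals and Windows UNC file shares (\\server\share).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
UNC_SHARE = re.compile(r"\\\\[A-Za-z0-9._-]+\\[A-Za-z0-9$._-]+")


def find_hardcoded(lines):
    """Return (line_number, kind, match) for each suspicious literal."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for match in IPV4.findall(line):
            # Keep only dotted quads whose octets are all in 0..255.
            if all(int(octet) <= 255 for octet in match.split(".")):
                findings.append((number, "ip", match))
        for match in UNC_SHARE.findall(line):
            findings.append((number, "share", match))
    return findings


# Made-up sample lines standing in for application code or config.
sample = [
    'conn = "Server=10.0.4.17;Database=store"',
    r'image_root = "\\fileserver01\images"',
    'version = "1.2.3"',
]

for finding in find_hardcoded(sample):
    print(finding)
```

A scan like this only finds candidates. Each hit still needs a person to decide whether it is a real dependency on the old data center.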

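The A/B split described in the talk checked the URI, the query string, and cookies, then kept each visitor on the same variant without changing the URL. The Python sketch below shows one way that decision logic could work. It is not the NGINX configuration the marketing team used; the `variant` query parameter, the `ab_variant` cookie name, the `choose_variant` function, and the hash-based bucketing are all assumptions.

```python
import hashlib


def choose_variant(query, cookies, client_id, share_b=0.5):
    """Pick "A" or "B" for a request and report whether a cookie should be set."""
    # 1. An explicit query-string override wins (hypothetical ?variant=B).
    forced = query.get("variant")
    if forced in ("A", "B"):
        return forced, True
    # 2. A returning visitor keeps the variant stored in the cookie.
    sticky = cookies.get("ab_variant")
    if sticky in ("A", "B"):
        return sticky, False
    # 3. Otherwise hash a stable client identifier into [0, 1) and bucket it.
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return ("B" if bucket < share_b else "A"), True


# A new visitor gets a deterministic variant; the proxy would then set the cookie.
variant, set_cookie = choose_variant({}, {}, "visitor-123")
print(variant, set_cookie)

# On the next request the cookie decides, so the visitor stays on the same variant.
print(choose_variant({}, {"ab_variant": variant}, "visitor-123"))
```

Hashing a stable identifier keeps the assignment repeatable even before the cookie is set, and the cookie keeps the visitor on the same variant if `share_b` changes later.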