The devopsdays conference took place in Barcelona on the 10th and 11th of October. At Wuaki.tv we are trying to implement the continuous delivery philosophy, so we attended in a very special way: with a speaker. We told our story, the problems we face on a daily basis, and how we solve them.
One of the projects we recently had at Wuaki was developing a service to highlight content and promote campaigns in our applications. Because of some technical decisions, we decided to build a micro service that could act as an ad server. This service had to be reliable, lightweight and able to serve ads on multiple devices (web, smart TVs, iOS, Android, Xbox…). It also had to offer filtering options, since different ads would be displayed depending on the user state (for marketing targeting purposes). After trying out several options, we decided to use Riak as the storage system. This was a classic use case for a NoSQL database: we had a key to ask for and a value to retrieve, so having Riak behind our application suited us very well.
Two of the reasons why we chose Riak over other NoSQL solutions are high availability and key filters.
Riak distributes the data across multiple nodes through replication and partitioning. Replication keeps the data available for reading and writing even if a cluster node is down. Riak combines both techniques to get as much data availability as possible without wasting resources. The official documentation suggests using at least 5 nodes for a Riak cluster and replicating the data to at least 3 nodes (3 is the default replication value).
Riak uses a very clever hashing technique to map partitions onto a ring. The ring represents an integer space that runs from 0 to 2^160-1 and is divided into 64 partitions by default. The last value of the space is considered adjacent to the first value of the ring (0).
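The ring arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `bucket/key` string encoding, the function names and the rule for picking the first partition are simplifications made up for this example, and Riak's own term encoding and ownership rules differ in detail.

```python
import hashlib

RING_SIZE = 2 ** 160            # SHA-1 yields 160-bit values: 0 .. 2^160 - 1
NUM_PARTITIONS = 64             # default number of partitions mentioned above
PARTITION_SIZE = RING_SIZE // NUM_PARTITIONS
N_VAL = 3                       # default replication value mentioned above


def ring_position(bucket: str, key: str) -> int:
    """Hash a bucket/key pair to a point on the ring (simplified encoding)."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def preference_list(bucket: str, key: str, n_val: int = N_VAL) -> list[int]:
    """Return the indexes of n_val consecutive partitions for this key."""
    first = ring_position(bucket, key) // PARTITION_SIZE
    # The modulo makes the ring wrap: partition 63 is followed by partition 0.
    return [(first + i) % NUM_PARTITIONS for i in range(n_val)]
```

Calling `preference_list("ads", "some-key")` returns three consecutive partition indexes between 0 and 63. Because consecutive partitions are normally owned by different nodes, losing one node still leaves other copies reachable.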
Key filters and MapReduce
As I mentioned above, one of the requirements of our project was to support filtering options. These filters had to be flags representing user states, for example whether the user is logged in or whether the user is a subscriber. Riak offers a fast way of examining keys before loading the object. Combined with MapReduce inputs, this feature gives us a very powerful method to pre-process and retrieve objects just by looking at the key.
For example, if we have the following sample keys:
We can retrieve all the active ads in an easy way by using key filters:
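As a rough sketch of what such a query can look like, assume hypothetical keys of the form `<state>-<audience>-<id>`, such as `active-subscriber-42`. The key layout, the bucket name `ads` and the helper names are invented for this example and are not our real schema. The job below follows the JSON shape of Riak's key filter MapReduce inputs, and the second function emulates the same two filter steps locally so the idea can be tried without a cluster.

```python
import json


def build_active_ads_job(bucket: str = "ads") -> dict:
    """Build a MapReduce job whose inputs are narrowed by key filters."""
    return {
        "inputs": {
            "bucket": bucket,
            "key_filters": [
                ["tokenize", "-", 1],   # split the key on "-" and keep token 1
                ["eq", "active"],       # keep keys whose first token is "active"
            ],
        },
        # Map phase: return the stored JSON value of every matching object.
        "query": [
            {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}}
        ],
    }


def matches_locally(key: str) -> bool:
    """Emulate the tokenize + eq steps above in plain Python."""
    return key.split("-")[0] == "active"


if __name__ == "__main__":
    print(json.dumps(build_active_ads_job(), indent=2))
```

The resulting JSON document is what would be submitted to Riak's MapReduce interface. Only the objects whose keys pass the filters are loaded from disk.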
If you are eager to know more about Riak, I suggest having a look at the official documentation. There is also an ebook called A Little Riak Book, written by Eric Redmond, which you can get for free in multiple formats.
Almost two years ago, we decided to move our technical infrastructure from LeaseWeb, a dedicated hosting company in Holland, to Amazon AWS. The journey wasn’t as easy as we thought: we faced lots of challenges to reach the point where we are today.
One of the very first things you need to understand when you arrive at AWS, Rackspace or any other IaaS platform is that you must forget about server-level thinking and start thinking at the application level. If you stop for a minute and think about it, one of the key features of IaaS is the idempotence it provides, and this is really valuable when you need to scale a service for certain periods of time that you cannot predict. At the moment Wuaki.tv is deployed in the AWS EU Region (Ireland) and every single piece of our production stack runs on at least 3 servers (one for each AZ) in a completely puppetized environment.
Wuaki.tv system architecture
One of the things you notice when you start working with Amazon is that you don’t get a static IP address by default, so you have to rely on other tricks to be able to code your infrastructure. At the moment, at Wuaki.tv we rely on Amazon’s security groups for almost everything. This is the only way to ensure that your cluster is perfectly configured and that your firewall rules won’t be a mess.
As you can see, we think in clusters. For us everything is a cluster, from the smallest group of servers up to the biggest one. We don’t care if we only have 1 request per minute, but we do care that when users want to use the application they can do so, no matter what.
Right now, in June 2013, Wuaki.tv has many different clusters that receive our user requests: Nginx, WWW, API, internal services, CDN balancing, MySQL, caching, queueing, Twemproxy, RabbitMQ and ElasticSearch. Almost all of them are involved in each request made by any user of our service. If we had to describe the path of a request, it could be something like this:
A user types the URL Wuaki.tv into their browser and goes through our Nginx cluster, which decides, depending on the URL, whether the request should be sent to the WWW or the API cluster. After making that decision, Nginx evaluates the IP address of the user making the request and, depending on the user’s location, decides which country and which cluster is going to process that request.
Once the user is identified, all requests are sent to the proper app server. As you may know by now, Wuaki.tv runs on Ruby on Rails and Unicorn, which helps us serve requests really fast: according to New Relic, it takes about 100ms to serve a request from our API and 200ms from our website on the app server. Our Apdex is, and must be, between 0.98 and 1 no matter what.
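The two decisions in that path can be modelled with a small Python sketch. It is not our Nginx configuration: the `/api` path rule, the country codes and the IP prefix table (which uses documentation-only address ranges in place of a real GeoIP database) are all placeholders invented for illustration.

```python
from dataclasses import dataclass

# Placeholder table standing in for a GeoIP database. The prefixes are
# reserved documentation ranges, and the country codes are made up.
COUNTRY_BY_IP_PREFIX = {"192.0.2.": "ES", "198.51.100.": "GB"}
DEFAULT_COUNTRY = "ES"


@dataclass(frozen=True)
class Route:
    cluster: str   # "www" or "api"
    country: str   # country whose cluster will process the request


def pick_cluster(path: str) -> str:
    """First decision: choose WWW or API from the URL (hypothetical rule)."""
    return "api" if path.startswith("/api") else "www"


def pick_country(ip: str) -> str:
    """Second decision: derive the country from the client IP address."""
    for prefix, country in COUNTRY_BY_IP_PREFIX.items():
        if ip.startswith(prefix):
            return country
    return DEFAULT_COUNTRY


def route(path: str, ip: str) -> Route:
    """Combine both decisions in the order described above."""
    return Route(cluster=pick_cluster(path), country=pick_country(ip))
```

For example, `route("/api/movies", "198.51.100.7")` yields the API cluster for the second placeholder country, while an unknown address falls back to the default.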
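For readers unfamiliar with Apdex, the score can be computed from raw response times with the standard formula: satisfied requests count fully, tolerating requests count half, and frustrated requests count zero. The sketch below is a minimal illustration. The 500 ms threshold in the usage note is an assumed value, not necessarily the one configured in our New Relic account.

```python
def apdex(response_times_ms: list[float], threshold_ms: float) -> float:
    """Return the Apdex score for a list of response times.

    satisfied:  t <= T
    tolerating: T < t <= 4T
    frustrated: t > 4T (contributes nothing to the score)
    """
    if not response_times_ms:
        raise ValueError("at least one response time is required")
    satisfied = sum(1 for t in response_times_ms if t <= threshold_ms)
    tolerating = sum(
        1 for t in response_times_ms if threshold_ms < t <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(response_times_ms)
```

With an assumed threshold of 500 ms, `apdex([100, 200, 600, 2500], 500)` gives 0.625: two satisfied requests, one tolerating and one frustrated. Staying at 0.98 or above means that almost every request has to finish within the threshold.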
Our config management tool (Puppet)
There is no way you can scale your application fast if you do it manually; it’s simply not possible. In March 2012 we started to puppetize Wuaki.tv’s infrastructure. I had previously worked with Puppet, and at that moment I was the only person doing Operations, so it was an easy decision. At the moment we have 4 environments and almost 60 modules, and we have implemented MCollective for orchestration with RabbitMQ and The Foreman, working with a single Puppet Master.
We had to do some tuning to our Puppet Master so it could handle all of our server requests:
Basically, this setup helps us avoid too many nodes connecting to our Puppet Master at the same time. PuppetDB helps us with things like Nagios, and Hiera lets us prepare some modules in real time depending on our scaling needs, and more.
Wuaki.tv central logging system
Last year Ignasi Blanco joined the operations team at Wuaki.tv, and one of his first tasks was to deploy a central logging system. We happened to be using Splunk at that moment, but we reached the limits of the free license really fast. When we were looking for alternatives we found Logstash and Kibana.
Logstash is an Open Source project developed by Jordan Sissel from Dreamhost, which helps us process any kind of logs.
It’s been almost a year since we started that project. I am sure we will publish a post with more details soon, but to give you an idea, at the moment we are processing around 83 million log lines per day (roughly 960 lines per second on average). One of the key aspects of our infrastructure is that almost everything is asynchronous, not only at the application level but also at the infrastructure level.
In December, we decided it was time to implement caching in our application, and we wrote a post explaining how we implemented it. Since that post we have made a couple of changes. Because of our global expansion, we decided that parts of our stack were going to handle global traffic while others were going to handle local traffic.
At the beginning this seemed a little crazy, but after trying and failing in our implementations, we saw that this approach has a positive impact on server administration and helps us detect issues more easily than having a local stack. This solution is possible thanks to Redis namespaces, where we can create keys with country, environment, and application names in order to identify any request at a given time.
And this is basically what makes Wuaki.tv run at the moment, mainly on top of m1.large AWS instances, using tools such as Amazon S3, CloudFront, Auto Scaling and more… much more.
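The namespacing idea can be illustrated with a short Python sketch. The `country:environment:application:key` layout, the colon separator and the function names are assumptions chosen for this example. The actual key format in our Redis instances may differ.

```python
def namespaced_key(country: str, environment: str, application: str,
                   key: str, sep: str = ":") -> str:
    """Prefix a cache key with country, environment and application names."""
    for part in (country, environment, application):
        # Namespace parts must be non-empty and must not contain the
        # separator, otherwise the key could not be split back reliably.
        if not part or sep in part:
            raise ValueError(f"invalid namespace part: {part!r}")
    return sep.join([country, environment, application, key])


def parse_namespaced_key(full_key: str, sep: str = ":") -> dict:
    """Recover the namespace parts, e.g. when inspecting keys on a server."""
    country, environment, application, key = full_key.split(sep, 3)
    return {
        "country": country,
        "environment": environment,
        "application": application,
        "key": key,
    }
```

For instance, `namespaced_key("es", "production", "www", "movies/42")` produces `es:production:www:movies/42`, so one shared Redis instance can hold entries for several countries and applications without collisions, and any key can be traced back to where it came from.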
GeoIP configuration on Nginx
During our implementation at Wuaki.tv we found a couple of issues related to Amazon ELBs and autoscaling policies on Nginx, which doesn’t refresh our upstream IPs in real time. To solve this issue we implemented the resolver directive with Amazon’s DNS servers in our nginx.conf file.
As you may know by now, Wuaki.tv is a VOD service which depends on the user’s location. We have some business rules that we need to follow and, depending on our users’ location, we have to send traffic to many different clusters, each with completely different business rules. To accomplish this we use the nginx-full package on Debian, because it gives us GeoIP support in a simple way through the APT package manager.
Particular Vhost configuration for wuaki.tv
For our API we use a different method, because we need transparency in order to complete each request from our clients. We need to give the appropriate response to each user depending on their location, without any weird redirect.
As the number one Video on Demand service in Spain, Wuaki.tv has faced lots of interesting challenges. One of them is making our application as asynchronous as possible. To prevent our application from being blocked for long periods of time, we needed to find a way to avoid any third-party related issue.
As you can imagine from our previous posts, at Wuaki.tv we are heavy users of Redis, and this was the very first project where we introduced Redis as our backend storage, not only to our stack but also to part of our team.
The first step we took in order to become asynchronous was to create a queue stack built on top of Resque and Redis. At that moment we thought that Redis was more than capable of handling the number of requests we were planning to send to it, and given its speed in conjunction with Resque, our work was practically done in an easy and fashionable way.
During 2012, Wuaki.tv was growing exponentially and our stack became more complex almost every two or three weeks (we still change very rapidly). We began to develop new services as we started a race to convert our service into a micro-service oriented application. At the beginning we thought that the best choice was to create a queue stack for each micro-service, with independent Redis instances, but after a couple of months we realised that Redis was powerful enough to handle all this work and that our stack was becoming really complex, so we decided to centralise some of our services and reduce complexity. After we learned about our company’s business plans for the next 2 years, we came to the conclusion that maintaining this stack was going to be really painful and that the way we were implementing it wasn’t going to scale, so we went back to the whiteboard. After talking to Juan Hernández and Toni Reina for a couple of hours, we concluded that creating just one stack for async tasks was more than enough, and that if we combined it with Amazon’s autoscaling features we could build a really reliable stack.
Once we had agreed on the work that had to be done, it was time to design the infrastructure and software that was going to help us solve this problem. To accomplish all this, we built an async task stack on top of Nginx + Unicorn + Resque + Resque web + Puppet and AWS Autoscaling. Toni Reina was kind enough to provide a custom config.ru inside Resque-web, so we can load and connect to different Redis instances from a single application, depending on the URL. We developed a Resque web module for Puppet which deploys Resque web on our servers as we scale up our infrastructure, managing:
Deploy Resque web
Manage Config.ru template file
Manage Unicorn configuration
At the moment, Wuaki.tv is built on top of 7 different applications and APIs, so we decided that the easiest way to identify which job is going to be processed by which queue was to add a namespace to our Redis database. This is helping us reduce the pain of scaling and managing this stack.
If you consider this method as your solution to concurrency and resilience problems, please take into account that it involves a change of paradigm: when you develop a queuing system you must get rid of long, CPU-intensive jobs, because the idea is to scale and process tasks more efficiently. Enough of the bulk jobs.
The idea behind our use of Async Tasks is to deliver the best user experience, taking away any task that could cause performance issues.
Wuaki.tv is more than 2 years old now, and during this time it has changed incredibly fast. One of the things that amuses us about software development is the different ways you can solve different issues: you can make it as complex or as simple as you want at any given time.
During the first year of our company, Wuaki.tv was a monolithic application running on RoR and MySQL. We were not using any caching other than Rails’ native caching, and our Apdex index was around 0.7 - 0.8 depending on the day and time. During 2012 we focused all of our efforts on performance: we implemented lots of new features and made lots of changes, and caching was one of the main improvements. This helped us move from a 0.7 Apdex index to 0.98 - 1. In order to achieve this we explored the most used and recommended software to implement in our application:
During this period of research and evaluation, we agreed to use Redis as our caching datastore, mainly because we had already used Redis in our queue system, but also because it promised an easy implementation. Both Dev and Ops agreed that this was the best decision in terms of performance and implementation.
After deciding which data store we were going to use, we had to design the stack that was going to handle all the traffic generated from our application layers (5 at the moment).
The very first problem we faced in Operations was related to scaling Redis. We had to make sure that our caching stack is available 24/7, even if a whole Amazon availability zone disappears overnight, so we needed our stack to be redundant and scalable. We found that Redis could be configured in four different ways:
Redis Sentinel was probably the obvious choice for us, but both Sentinel and Redis Cluster are currently works in progress and NOT recommended for production environments. Besides this issue, we noticed that Redis performance degrades as the number of connections to your nodes increases, so we had to find another way to solve this.
In December 2012 @antirez, the creator of Redis, published a post on his blog about Twemproxy. Twemproxy is an open source project maintained by @manju from Twitter; it works as a proxy with failover not only for Redis but for memcached as well. After sending a couple of tweets to @antirez and @manju asking for information about failover and high availability, we came to the conclusion that we should give Twemproxy a try. At the moment we are using a simple configuration built on top of Debian Squeeze on Amazon m1.large instances.
Building the Stack
At the moment our caching stack looks like the image below, and it is capable of handling more than 10 million requests per day without needing to trigger any autoscaling instance for either Redis or Twemproxy.
The idea behind this setup is to ensure that our Redis instances only have one incoming connection from each of our Twemproxy servers, so the performance of each Redis instance won’t be affected by the number of Unicorn worker connections, which is at least 230 workers.
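The effect on connection counts is easy to see with a little arithmetic. The sketch below assumes that, without a proxy, every Unicorn worker would hold one connection to every Redis instance. The number of Twemproxy servers used in the example (3) is a hypothetical value, since only the worker count is stated above.

```python
def connections_per_redis(unicorn_workers: int, twemproxy_servers: int) -> dict:
    """Compare incoming connections on ONE Redis instance in both setups."""
    return {
        # Assumption: each worker connects directly to the Redis instance.
        "direct": unicorn_workers,
        # With the proxy layer, each Twemproxy server opens one connection.
        "proxied": twemproxy_servers,
    }


if __name__ == "__main__":
    counts = connections_per_redis(unicorn_workers=230, twemproxy_servers=3)
    print(counts)
```

Under these assumptions each Redis instance sees 3 connections instead of 230, and adding more Unicorn workers changes the load on the proxies but not the number of connections Redis has to manage.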