If you don’t live in Latin America and buy or sell goods online, there is a strong chance you have never heard of MercadoLibre, literally “free market” in Spanish, but they are one of the largest e-commerce sites on Earth. They use Pivotal’s open source software (OSS) solutions to deliver, run, and scale their applications and data.
In this article, Matias Waisgold, Technical Lead for MercadoLibre’s internal APIs and developer APIs, provides a Q&A session and explains the goals and reasons for using Pivotal open source software, “Tremendous growth has pushed our technology teams to do two main things—improve development cycles and scale more effectively. To achieve these goals, we designed and built an enterprise service bus, or ESB, that uses RabbitMQ, Groovy on Grails, and Redis. Together, these three technologies have allowed us to decouple applications, distribute workloads, and helped to scale our data store. We have also been able to decentralize our development efforts and speed our deployment cycles. Without risking quality, faster dev cycles mean our users and applications benefit from a better user experience, and MercadoLibre maintains its leading position in online retail for Latin America.”
For those that are not familiar with their history, MercadoLibre has been named the 8th largest e-retailer in the world, and Fortune awarded them “Fastest-Growing Company” for 2011, 2012, and 2013. Last year, MercadoLibre reported user growth of 22% to 99.5 million. Both new revenues and gross merchandise volume were up by 50%, and total payment volume was up 66%. With this trajectory and all revenue coming from online channels, the website and supporting applications have to run without putting revenue at risk.
Could you tell us a bit about the functionality on the MercadoLibre site?
Sure, if you are familiar with eBay, then you are familiar with the type of functionality we have. We are eBay’s exclusive Latin American partner. Our site provides listings and auctions for buyers and sellers in Argentina, Brazil, Chile, Colombia, Mexico, Costa Rica, Dominican Republic, Ecuador, Peru, Panama, Uruguay, and Venezuela. People buy and sell everything—real estate, cars, antiques, mobile phones, computers, electronics, furniture, clothing, and more. When you decompose the functionality, the system basically manages users, categories, items, prices, auctions, orders, questions, answers, payments, and similar pieces of information. We also provide integrated payment solutions, digital advertising, and web stores.
Could you give us a bit about your background and role at MercadoLibre?
Yes, I’ve been at the company for more than 8 years. Most of my experience is with Java, Oracle, SQL, Grails, Groovy, Ruby on Rails, MySQL, Spring, MongoDB, and Linux. Early on, I was involved with developing our first catalog, and I am now responsible as technical lead for the core APIs which are both public and private. We are very proud to say our internal and external APIs are exactly the same—all of our front-ends run on the same APIs that any developer can access. The core APIs have become the backend of the entire platform—the APIs support the user interface, dozens of website applications, back office reporting, and third party integrations over public APIs. The APIs are RESTful, test driven, highly available, high-throughput, distributed, and cloud-based. From a user or functional perspective, the core APIs allows applications to interact with users, categories, items, searches, orders, shipping, questions, and orders.
What led MercadoLibre to an Enterprise Service Bus (ESB)?
Well, the first generation architecture was getting old. As you might imagine or expect, the code base was limiting developer productivity and scale.
Over time, our services and applications became entangled. This caused problems. Often, one team’s release impacted other team’s code. Our release frequency slowed, and we couldn’t improve user experience fast enough. So, we set a vision—an enterprise service bus with decentralized, decoupled applications that support an agile process. We wanted to have small apps made for a specific, single purpose. This approach made development much easier. We could iteratively evolve the solution for each specific application without major concerns for breaking the entire system as long as we maintained our interfaces. When the approach showed success, we expanded it. The entire business was on board because development productivity impacts speed to market, user experience, and revenue.
While improved development processes were crucial, there was another key factor. We were dealing with 20 million requests per minute and 4GB of bandwidth per second. As our user-base grew, we needed a multi-datacenter architecture to avoid failures and help scale. We knew our Oracle database wouldn’t support the volume of reads and writes, and both users and usage were growing across geographies. We designed a hybrid cloud with reads/writes in one location and reads from other data centers. We also needed a cache layer to take load off the database. Lastly, we wanted to support a distributed computing model that allowed us to scale specific services or components.
Would you describe the overall solution from a technical viewpoint?
We are largely an open source shop, and our multi-datacenter architecture was designed to split up visitors into a read/write group and a read only group. If someone is making a POST/PUT/DELETE type of transaction, they go to Datacenter A and basically stick there for the session to maintain full data consistency. If you are making read-only requests, then you will go to Datacenter B (or C, D, etc.) where the information is slightly delayed and inconsistent because of the engine that replicates in the background. In either case, users and webservers read data from a Memcached cluster to help reduce load on the Oracle database. If the data is not in Memcached, we read it from the database.
To extract operations or transactions and replicate the database across geographies, we use the Oracle replica engine, Golden Gate. In addition to replication, the database changes are sent to a Java process that populates the RabbitMQ cluster. RabbitMQ then feeds a refetching service written in Groovy on Grails, and the refetch components grab data from the database to ensure ordering and consistency. During this process, it’s worth pointing out one of the more helpful capabilities of RabbitMQ—it takes messages from Java and passes them in Groovy on Grails. We gain so much flexibility knowing that we can write code in virtually any language and connect to RabbitMQ. For example, we could acquire a company with a Node.js, Ruby on Rails, or Go stack (or choose to use those technologies) and integrate fairly easily.
After the refetch step, the refetchers do two other things. One, they update data in Memcached. Two, they send the data to our BigQueue message system which provides an HTTP API and uses Redis as its data store. BigQueue maintains all notification-worthy updates and events that happen in other applications. We chose Redis for double consistency like a redo log in a database, and it’s highly available. As an example, our questions search engine, based on Elasticsearch, is updated from this BigQueue system. As well, writes to our external items API will hit RabbitMQ directly, going through the same process to BigQueue.
Over time, we see RabbitMQ’s role growing. It will help us scale apps across data centers even more by allowing us to push messages across data centers.
What were the benefits of taking this approach with RabbitMQ, Redis, and Groovy on Grails?
As mentioned, we had two key goals—scale better and improve development cycle times. The ESB model allows us to untangle code and separate services, achieving these two goals. Let me explain.
From the perspective of scaling data stores and applications, the architecture offers separation or distribution without dependencies and with asynchronous processing. With the RabbitMQ, refetching, and Redis services, the data is fanning out to other apps. Across 18 queues that are fed by Oracle or external APIs, RabbitMQ processes an average of 1000 messages per second with 2000 acknowledgements per second on a 4 machine cluster with 16 GB of RAM and 8 cores each (see diagram below). Even with external developer APIs for items, we write to RabbitMQ directly from our webservers without degrading performance. RabbitMQ gives us a simple, reliable way to scale on pools of shared infrastructure or components and makes it easy for any app to talk to Java, Groovy on Grails, Ruby, or virtually any other language we might use, and we use all of these. Similarly, the Redis data store works with all modern development languages, and this gives us tons of flexibility for downstream apps. If we need more RabbitMQ processing power to receive or push messages, we can add it independently. If we need more refetchers to update the cache or refetch against the database, we can add processes or add cache nodes independently. BigQueue can have more Redis slaves added independently and serve more data or more applications. The whole thing can scale horizontally except for the Oracle database, but that is another story.
Here is an example of the rates we see, but we also see spikes of 1000 publishes with 2000 deliveries.
In addition, you can see that a transaction in RabbitMQ takes 0.188ms, which is super-fast.
From the perspective of improved development timelines, RabbitMQ is certainly very easy to understand, use, and implement. Asynchronous and independent components also allow our teams to develop and deploy independently with less governance and coordination. For example, people who are developing enhancements to the user service don’t have to worry as much that their code impacts the people working on items, orders, Elasticsearch, or other elements. The ESB lets each app update the others via an API, and the other teams don’t care what changed behind the scenes as long as the API and notifications still work. For the most part, development teams can now operate independently. This independence helps tremendously when you have a lot of teams who want to deploy code on a regular basis. As long as integration tests pass, there is freedom to deploy without a big committee.
In addition, Groovy helps us accelerate the time we spend coding and testing. It is true that you can do something in one line of Groovy instead of say ten lines of Java, and this adds up when you are writing thousands of lines of code. As well, Grails was designed to integrate with Spring and Java. Our development teams get things done a lot faster with Groovy on Grails. Lastly, Redis really makes BigQueue run and solves some very common large-scale web development problems. Since it is designed as a single-threaded server and supports Lua scripts, we have implemented transaction-like operations with fantastic performance on sorted sets. In our case, these sets are both cached and isolated. Together, these three tools help us get to market faster, improve our competitiveness, provide a better user experience, and ultimately impact revenue.
OK, enough business! What do you like to spend time doing outside of work?
Beside software, I enjoy playing soccer, and I play on a team every week. So, every time I can, I go to the stadium to watch the games. I also like cooking asados (barbecue) with Argentinian beef, it is the best!
About the Author