Scaling with RabbitMQ @ SoundCloud

July 18, 2013 Stacey Schneider

If you aren’t familiar with SoundCloud, it is one of the fastest growing sites in the US, with 10.2 million unique visitors in March and traffic growth of 26% over the prior month. It reaches over 200M people worldwide.

SoundCloud is also one of the coolest social networks for sharing music and sound. Many of the most popular modern-day musicians, producers, and DJs release music here to a global audience. That audience collectively uploads 12 hours of music every minute, covering electronic, classical, jazz, blues, comedy, storytelling, and more.

This past June, Sebastian Ohm, Technical Lead at SoundCloud, gave a talk on their use of RabbitMQ at the Erlang User Conference in Stockholm. His talk and this article cover the functionality, messaging architecture, and lessons learned.

SoundCloud’s Functionality

SoundCloud is a social platform. Instead of uploading and commenting on pictures, SoundCloud users upload and comment on audio via a waveform image, as shown below. The content on SoundCloud is driven entirely by end users and 3rd parties. It can be accessed via the SoundCloud website, embedded in other websites like Facebook or blogs, and heard via mobile apps on Android or iOS.

[Image: Major Lazer Workout Mix, May 2013, on SoundCloud]

One place where RabbitMQ comes into play is when a user uploads an audio file to SoundCloud. Upon upload, RabbitMQ is used to asynchronously process the audio file, build the waveform image, and notify followers of the new sound.

The Messaging Architecture—Transcoding and Activity Updates

SoundCloud stores media in Amazon S3, and the worker pool is in EC2. A message-based architecture was chosen a few years back to coordinate these separate storage and processing clouds. After reviewing STOMP and other protocols, the engineering team settled on AMQP with RabbitMQ. The team wanted producers and consumers to be entirely decoupled so that pools of resources could be scaled independently.

[Slide: Scaling AMQP at SoundCloud, page 17]

The application was developed with Ruby on Rails. When a new media file is uploaded, the Ruby code creates a record in MySQL and publishes a message to the media exchange with a unique ID for the media. Both the Ruby app and RabbitMQ run in the SoundCloud data center in Amsterdam, while the consumption end of the queue, a transcoding service, runs in EC2. The consumers receive the unique ID, transcode the media, and publish another message to the media exchange with some metadata and a unique ID for the files on S3, available via URI. A Rails app receives these messages and pushes some of the information into the database.
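The upload flow above can be sketched roughly as follows. This is a minimal Python illustration, not SoundCloud's Ruby code. Only the "media" exchange name comes from the article; the routing keys, the message fields, the `transcode` callable, and the `RecordingChannel` stand-in are all assumptions made for the example. The `channel` argument is any object with a `basic_publish(exchange, routing_key, body, properties)` method, which is the shape of a pika `BlockingChannel`.

```python
import json

MEDIA_EXCHANGE = "media"              # exchange name mentioned in the article
UPLOADED_KEY = "media.uploaded"       # hypothetical routing key
TRANSCODED_KEY = "media.transcoded"   # hypothetical routing key


class RecordingChannel:
    """Stand-in channel that records messages so the sketch runs without a broker."""

    def __init__(self):
        self.published = []

    def basic_publish(self, exchange, routing_key, body, properties=None):
        self.published.append((exchange, routing_key, body))


def publish_upload(channel, media_id, properties=None):
    """Web app side: announce a new upload by its ID only, not the file itself."""
    body = json.dumps({"media_id": media_id})
    channel.basic_publish(
        exchange=MEDIA_EXCHANGE,
        routing_key=UPLOADED_KEY,
        body=body,
        properties=properties,
    )
    return body


def handle_upload(channel, body, transcode):
    """Transcoder side: read the ID, transcode, publish metadata back to the exchange."""
    media_id = json.loads(body)["media_id"]
    metadata = transcode(media_id)  # hypothetical callable returning a dict, e.g. an S3 URI
    reply = json.dumps({"media_id": media_id, **metadata})
    channel.basic_publish(
        exchange=MEDIA_EXCHANGE,
        routing_key=TRANSCODED_KEY,
        body=reply,
        properties=None,
    )
    return reply
```

Because the message carries only an ID, the producer needs no knowledge of how many transcoders exist or where they run. With a real pika channel, persistent delivery would typically be requested by passing `pika.BasicProperties(delivery_mode=2)` as `properties`.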

This approach addressed one of their first scale challenges: scaling uploads. Now they can add resources to the pool of transcoders quickly and automatically with any spike in traffic. RabbitMQ distributes the workload across all transcoders so it is processed in parallel, and the system can work through a backlog of tens of thousands of uploads within a few hours.
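The scaling behaviour described here is the competing-consumers pattern: many workers pull from one shared queue, and each message goes to exactly one of them. The sketch below imitates that in a single process, using Python's standard `queue.Queue` and threads as a stand-in for a RabbitMQ queue and EC2 workers. The function and parameter names are illustrative assumptions, not part of SoundCloud's system.

```python
import queue
import threading


def run_worker_pool(media_ids, transcode, num_workers):
    """Drain a shared work queue with num_workers competing consumers."""
    work = queue.Queue()
    for media_id in media_ids:
        work.put(media_id)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                media_id = work.get_nowait()  # each item is handed to one worker only
            except queue.Empty:
                return  # queue drained, this worker exits
            outcome = transcode(media_id)
            with lock:
                results[media_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Raising `num_workers` changes nothing on the producing side, which is the property that lets the transcoder pool grow with a traffic spike.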

[Slide: Scaling AMQP at SoundCloud, page 25]

They also use a separate RabbitMQ broker to update the dashboard. The dashboard shows users the most recent activities or updates from the musicians and other users that they follow. Scale is not a problem until a user like Skrillex, who has about one million followers, uploads 10 tracks at once. In that case, 10 tracks times one million followers means the system would have to synchronously publish a write to Cassandra 10 million times. Instead, the engineering team added a broadcast within their application’s domain and used RabbitMQ for staged, asynchronous processing in three steps:

  • Fan-out determines where activities should propagate
  • Personalization captures the relationship between users and filters an index entry
  • Serialization persists the information in Cassandra for end user display or API access
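The three stages above can be sketched as a small in-memory pipeline. This is a simplified illustration under stated assumptions: the stages are plain function calls rather than separate RabbitMQ consumers, a dict stands in for Cassandra, and the `blocked` filter is a hypothetical example of what personalization might filter. The article does not describe the actual rules or data model.

```python
from collections import defaultdict


def fan_out(activity, followers_of):
    """Stage 1: decide where an activity propagates (one pair per follower)."""
    for follower in followers_of.get(activity["actor"], ()):
        yield follower, activity


def personalize(follower, activity, blocked):
    """Stage 2: build an index entry for this follower, or None if filtered out."""
    if activity["actor"] in blocked.get(follower, set()):  # hypothetical filter rule
        return None
    return {"owner": follower, "actor": activity["actor"], "track": activity["track"]}


def serialize(entry, store):
    """Stage 3: persist the entry (the dict is a stand-in for Cassandra)."""
    store[entry["owner"]].append((entry["actor"], entry["track"]))


def process(activities, followers_of, blocked):
    """Run every activity through the three stages and return the store."""
    store = defaultdict(list)
    for activity in activities:
        for follower, act in fan_out(activity, followers_of):
            entry = personalize(follower, act, blocked)
            if entry is not None:
                serialize(entry, store)
    return store
```

In a staged, asynchronous design each function would sit behind its own queue, so the uploader's request returns after one publish and the per-follower writes happen later at whatever rate the workers can sustain.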

Key Lessons Learned

With their current approach, the team has been running about 20,000 to 30,000 persistent messages per second (as shown in the graph below). During his talk, Sebastian was kind enough to share the challenges they faced and some key lessons learned:

  • Things have not gone perfectly, but Sebastian believes Erlang and RabbitMQ have delivered great performance with no operational issues, even though the team had no prior Erlang knowledge
  • Separating production, test, and dev environments is important and reduces headaches and errors
  • Don’t put every type of processing on one queue or one broker; separate workloads with different usage profiles so they can scale independently
  • Use clustering: with a load balancer in front, producers publish once and workers can subscribe to all brokers in the cluster
  • AMQP heartbeats worked more smoothly than one TCP connection per broker

[Slide: Scaling AMQP at SoundCloud, page 37]
