Reliable RT processing @ Spotify Pablo Barrera <pablo@spotify.com> February 5, 2014
Spotify
Spotify, the right music for every moment: over 6 million paying customers; over 24 million active users each month; over 20 million songs; over 1.5 billion playlists created so far; available in 55 markets
i/o tribe: responsible for building the awesome infrastructure that supports the Spotify experience
Our goal: this looks easy
That was easy [figure missing]
but we have a problem...
Naïve approach (tm): SYSLOG -> FILE -> SCP -> LOG ARCHIVER -> CURL -> HDFS PROXY -> HADOOP
Scalability: for(;;) { SCP(file) } for(;;) { CURL(file) }
We have a problem... thousands of servers, several data centres, millions of users, 10 TB each day
Our Needs: reliable delivery; fast data transfer; per-service subscription; low cpu overhead
Other options: ActiveMQ/RabbitMQ; Flume/Flume NG; others: Scribe, Chukwa, BookKeeper
Apache Kafka distributed pub/sub system
Kafka coolness: at least once; read O(1); network bounded
Kafka architecture: KAFKA PRODUCER -> KAFKA BROKER (TOPIC A, TOPIC B, TOPIC C, TOPIC D, TOPIC E) -> KAFKA CONSUMER
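The producer/broker/consumer split and the "at least once" and "read O(1)" points can be illustrated with a small in-memory toy. This is only a sketch under stated assumptions: `ToyBroker`, `ToyConsumer` and `demo` are hypothetical names invented for illustration and are not Kafka's API. The assumption is that each topic is an append-only log, the consumer tracks its own offset, and the offset is committed only after a batch has been processed.

```python
from collections import defaultdict


class ToyBroker:
    """Hypothetical in-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def publish(self, topic, message):
        """Append a message and return its offset within the topic."""
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1

    def fetch(self, topic, offset, max_messages=10):
        """Return messages starting at `offset`: an index lookup, not a scan."""
        return self._logs[topic][offset:offset + max_messages]


class ToyConsumer:
    """Hypothetical consumer that keeps its own committed offset per topic."""

    def __init__(self, broker, topics):
        self._broker = broker
        self._committed = {topic: 0 for topic in topics}

    def poll(self, topic, handler):
        """Process pending messages; commit the offset only if all succeed."""
        offset = self._committed[topic]
        batch = self._broker.fetch(topic, offset)
        for message in batch:
            handler(message)  # if this raises, nothing is committed
        self._committed[topic] = offset + len(batch)
        return len(batch)


def demo():
    """Crash once mid-batch and show that messages are redelivered."""
    broker = ToyBroker()
    for i in range(3):
        broker.publish("topic-a", f"event-{i}")

    seen = []
    state = {"crash_armed": True}

    def flaky_handler(message):
        seen.append(message)
        if message == "event-1" and state["crash_armed"]:
            state["crash_armed"] = False
            raise RuntimeError("simulated consumer crash")

    consumer = ToyConsumer(broker, ["topic-a"])
    try:
        consumer.poll("topic-a", flaky_handler)
    except RuntimeError:
        pass  # the offset was not committed, so the batch is read again
    consumer.poll("topic-a", flaky_handler)
    return seen
```

In this toy, `demo()` sees `event-0` and `event-1` twice: nothing is lost after the crash, but duplicates are possible, which is what "at least once" means here.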
Cons: no reliability; no replication; manual tuning
Spotify <3 Kafka running in production
Kafka at Spotify: key component of our log delivery system; kafka 0.7.1; java 7
Custom extensions: end-to-end reliable delivery; compression/encryption service
End-to-end reliable delivery
[diagram] production server: Service, KAFKA SYSLOG PRODUCER (Checkpoint) -> KAFKA BROKER -> KAFKA SYSLOG CONSUMER -> HADOOP; ACK returned to the producer
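One way to read the ACK and Checkpoint labels in the diagram is sketched below. It is a guess at the general idea, not the actual Spotify implementation: `ToySink`, `ToyConsumer`, `ToyProducer` and `demo` are hypothetical names, the sink merely stands in for Hadoop, and the broker hop is left out. The assumption is that the producer advances its checkpoint only when the consumer acknowledges a line after writing it, and resends from the checkpoint otherwise.

```python
class ToySink:
    """Hypothetical final store (stands in for Hadoop in the slides)."""

    def __init__(self):
        self.records = []

    def write(self, seq, line):
        self.records.append((seq, line))


class ToyConsumer:
    """Writes each line to the sink, then acknowledges it."""

    def __init__(self, sink, lose_ack_for=None):
        self._sink = sink
        self._lose_ack_for = lose_ack_for  # simulate one lost ACK

    def deliver(self, seq, line):
        """Return True when the ACK reaches the producer."""
        self._sink.write(seq, line)
        if seq == self._lose_ack_for:
            self._lose_ack_for = None  # only the first ACK is lost
            return False
        return True


class ToyProducer:
    """Sends lines in order; the checkpoint counts acknowledged lines."""

    def __init__(self, lines):
        self._lines = lines
        self.checkpoint = 0

    def send_pending(self, consumer):
        """Send from the checkpoint onward and stop at the first missing ACK."""
        acked = 0
        while self.checkpoint < len(self._lines):
            seq = self.checkpoint
            if not consumer.deliver(seq, self._lines[seq]):
                break  # no ACK: keep the checkpoint, retry on the next call
            self.checkpoint = seq + 1
            acked += 1
        return acked


def demo():
    sink = ToySink()
    consumer = ToyConsumer(sink, lose_ack_for=1)
    producer = ToyProducer(["a", "b", "c"])
    first = producer.send_pending(consumer)   # stops after the lost ACK
    second = producer.send_pending(consumer)  # resumes from the checkpoint
    return first, second, producer.checkpoint, sink.records
```

In this sketch a lost ACK causes line `b` to be written twice. No line is dropped, but the sink has to tolerate duplicates; deduplication is deliberately left out.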
is that all?
Piece of cake right?
Cross-site problems: Zookeeper; Kafka Producer; Kafka Broker; Kafka Consumer; Hadoop
TCP window: TCP parameters for big latency; Linux TCP scaling algorithm
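Why the TCP window matters on high-latency links can be shown with a little arithmetic. The sketch below computes the bandwidth-delay product and the throughput ceiling that a fixed window imposes. The link speed and round-trip time are hypothetical values chosen for illustration, not measurements from the talk; 65,535 bytes is the largest window TCP can advertise without window scaling.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bytes that must be in flight to keep a link of this speed full."""
    return bandwidth_bits_per_s * rtt_ms / 1000 / 8


def max_throughput_bits_per_s(window_bytes, rtt_ms):
    """Upper bound on throughput when at most one window is sent per round trip."""
    return window_bytes * 8 * 1000 / rtt_ms


# Hypothetical cross-site link: 1 Gbit/s with a 100 ms round-trip time.
LINK_BITS_PER_S = 1_000_000_000
RTT_MS = 100

needed_window = bandwidth_delay_product_bytes(LINK_BITS_PER_S, RTT_MS)
unscaled_limit = max_throughput_bits_per_s(65_535, RTT_MS)

print(f"window needed to fill the link: {needed_window:,.0f} bytes")
print(f"ceiling with a 65,535-byte window: {unscaled_limit / 1e6:.2f} Mbit/s")
```

With these assumed numbers the link needs about 12.5 MB in flight, while an unscaled window caps a single connection at roughly 5 Mbit/s, which is why the kernel's TCP buffer and window scaling settings become relevant between distant sites.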
IPSEC: Linux IPSEC + firewall is slow; major drop in throughput; cannot tweak it at app level
[diagram] production server: Service, KAFKA SYSLOG PRODUCER (Checkpoint) -> KAFKA SYSLOG ENCRYPTION -> KAFKA BROKER -> KAFKA SYSLOG CONSUMER -> HADOOP; traffic Compressed; ACK returned to the producer
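The compression step in the diagram can be sketched as batching log lines and compressing the batch before it leaves the server. This is an illustrative guess, not the actual service: `pack_batch` and `unpack_batch` are hypothetical names, and JSON plus zlib are assumptions made for the example. Encryption is omitted because the Python standard library ships no cipher; in a real pipeline it would be applied to the compressed bytes.

```python
import json
import zlib


def pack_batch(lines, level=6):
    """Serialise a list of log lines and compress the result."""
    payload = json.dumps(lines).encode("utf-8")
    return zlib.compress(payload, level)


def unpack_batch(blob):
    """Reverse pack_batch: decompress, then parse back into a list of lines."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical, highly repetitive access-log lines.
sample = [f"GET /track/{i % 10} 200" for i in range(1000)]
blob = pack_batch(sample)
raw_size = len(json.dumps(sample).encode("utf-8"))

print(f"raw: {raw_size} bytes, compressed: {len(blob)} bytes")
```

Log lines tend to be repetitive, so compressing before the cross-site hop trades some CPU for less data on the slow link.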
Garbage collector: 50% performance drop; 25% of cpu time; young generation tuning
[Figure: % of time spent doing Full GC before tuning. Y axis: time spent on Full GC (%), 0 to 100. X axis: time (minutes), 0 to 14.]
[Figure: % of time spent doing Full GC after tuning. Y axis: time spent on Full GC (%), 0 to 100. X axis: time (minutes), 0 to 1000.]
Hadoop: replication factor; stochastic failure mode; no real ack from Hadoop; files open for a long time
Apache Storm distributed computation framework
Storm: abstractions: topology, bolt, stream, tuple, grouping; great community; ack + retries, but not for reliable apps; use Hadoop instead
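The abstractions listed above can be made concrete with a toy model of a stream of tuples routed to bolt tasks by a fields grouping. This is plain Python, not Storm's API: `fields_grouping`, `CountBolt` and `run_topology` are hypothetical names, and the `track` field is invented for the example. The point it illustrates is that a fields grouping sends every tuple with the same field value to the same task, so per-key state such as counters can stay local to one task.

```python
import zlib
from collections import Counter


def fields_grouping(tuple_, field, num_tasks):
    """Pick a task index from a stable hash of one field of the tuple."""
    key = str(tuple_[field]).encode("utf-8")
    return zlib.crc32(key) % num_tasks


class CountBolt:
    """Toy bolt task that counts plays per track."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tuple_):
        self.counts[tuple_["track"]] += 1


def run_topology(stream, num_tasks=2):
    """Feed a stream of tuples into `num_tasks` parallel CountBolt tasks."""
    tasks = [CountBolt() for _ in range(num_tasks)]
    for tuple_ in stream:
        tasks[fields_grouping(tuple_, "track", num_tasks)].execute(tuple_)
    return tasks


# Hypothetical stream of play events.
stream = [{"track": t} for t in ["a", "b", "a", "c", "a"]]
tasks = run_topology(stream)
total = sum((task.counts for task in tasks), Counter())
print(dict(total))
```

`zlib.crc32` is used instead of the built-in `hash` because Python randomises string hashes per process, which would make the routing differ between runs.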
Kafka integration: reliable data for reporting; low latency data for RT
[diagram] production server: Service, KAFKA SYSLOG PRODUCER (Checkpoint) -> KAFKA BROKER -> KAFKA SYSLOG CONSUMER -> STORM; ACK returned to the producer, with Retries
RT apps
Storm
Thanks Pablo Barrera <pablo@spotify.com> Want to join the band? spotify.com/jobs February 5, 2014