Difference between stream processing and message processing

2019/2/25 posted in  MQ

https://stackoverflow.com/questions/41744506/difference-between-stream-processing-and-message-processing

In traditional message processing, you apply simple computations on the messages -- in most cases individually per message.

In stream processing, you apply complex operations (like aggregations and joins) on multiple input streams and multiple records (i.e., messages) at the same time.
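To make the contrast concrete, here is a minimal sketch in plain Python (no messaging library involved). The event tuples, `process_message`, and `windowed_totals` are hypothetical names invented for illustration; the point is only that the first function looks at one record at a time, while the second keeps state across many records.

```python
from collections import defaultdict

# Hypothetical events for illustration: (timestamp_seconds, user, amount)
events = [
    (1, "alice", 10),
    (2, "bob", 5),
    (61, "alice", 7),
    (62, "alice", 3),
]

def process_message(event):
    """Message processing: each record is handled on its own, with no shared state."""
    ts, user, amount = event
    return {"user": user, "amount_cents": amount * 100}

def windowed_totals(stream, window_seconds=60):
    """Stream processing: aggregate across many records, grouped by time window and key."""
    totals = defaultdict(int)
    for ts, user, amount in stream:
        # Assign the record to a fixed (tumbling) window based on its timestamp.
        window_start = (ts // window_seconds) * window_seconds
        totals[(window_start, user)] += amount
    return dict(totals)

per_message = [process_message(e) for e in events]
per_window = windowed_totals(events)
```

The per-message results could be computed by independent workers in any order, whereas the windowed totals only make sense if the processor sees the whole stream for each window.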

Furthermore, traditional messaging systems cannot go "back in time" -- i.e., they automatically delete messages after they have been delivered to all subscribed consumers. In contrast, Kafka uses a pull-based model (i.e., consumers pull data out of Kafka) and keeps the messages for a configurable amount of time. This allows consumers to "rewind" and consume messages multiple times -- or, if you add a new consumer, it can read the complete retained history. This makes stream processing possible, because it allows for more complex applications. Furthermore, stream processing is not necessarily about real-time processing -- it's about processing infinite input streams (in contrast to batch processing, which is applied to finite inputs).
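The retention-plus-pull idea can be sketched with a toy in-memory log. This is not Kafka's API; `RetainedLog`, `poll`, and `expire` are hypothetical names, and time is passed in explicitly so the example is deterministic. It only illustrates that records are dropped by age rather than by delivery, so any reader can start again from an earlier offset.

```python
import time

class RetainedLog:
    """Toy append-only log that keeps records for a retention period, not until delivery."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.records = []      # list of (offset, timestamp, value)
        self.next_offset = 0

    def append(self, value, now=None):
        now = time.time() if now is None else now
        self.records.append((self.next_offset, now, value))
        self.next_offset += 1

    def expire(self, now=None):
        """Drop records older than the retention period, regardless of who has read them."""
        now = time.time() if now is None else now
        self.records = [r for r in self.records if now - r[1] <= self.retention_seconds]

    def poll(self, offset, max_records=10):
        """Pull model: the consumer asks for records starting at an offset it chooses."""
        return [(o, v) for o, _, v in self.records if o >= offset][:max_records]

log = RetainedLog(retention_seconds=3600)
for i, value in enumerate(["a", "b", "c"]):
    log.append(value, now=i)          # timestamps 0, 1, 2

first_read = log.poll(offset=0)
replay = log.poll(offset=0)           # "rewind": reading again returns the same records

log.expire(now=3601)                  # only "a" (timestamp 0) is past retention
after_expiry = log.poll(offset=0)
```

A consumer added after the first read would see the same three records, which is what lets a new application process the history it missed.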

And Kafka offers Kafka Connect and the Streams API -- so it is a stream processing platform and not just a messaging/pub-sub system (even if it uses this at its core).

What are the differences between Apache Kafka and RabbitMQ?

2017/5/30

https://www.quora.com/What-are-the-differences-between-Apache-Kafka-and-RabbitMQ

Kafka is a general purpose message broker, like RabbitMQ, with similar distributed deployment goals, but with very different assumptions on message model semantics. I would be skeptical of the "AMQP is more mature" argument and look at the facts of how either solution solves your problem.

TL;DR,

a) Use Kafka if you have a fire hose of events (20k+/sec per producer) you need delivered in partitioned order 'at least once' with a mix of online and batch consumers, but most importantly _you’re OK with your consumers managing the state of your “cursor” on the Kafka topic._

Kafka’s main superpower is that it is less like a _queue system_ and more like a _circular buffer_ that scales as far as the disk on your cluster, and thus allows you to re-read messages.

b) Use Rabbit if you have messages (20k+/sec per queue) that need to be routed in complex ways to consumers, you want per-message delivery guarantees, you need one or more features of protocols like AMQP 0.9.1, 1.0, MQTT, or STOMP, and _you want the broker to manage the state of which consumer has been delivered which message_.

RabbitMQ’s main superpowers are that it’s a _scalable, high performance queue system_ with well-defined consistency rules, and the ability to create interesting exchange topologies.

Neither offers "filter/processing" capabilities - if you need that, consider using a data flow or stream processing framework - there are many: Apache Beam (which is an abstraction on top of Google Dataflow, Flink, Spark, or Apex), Storm, NiFi, direct use of Apex, Flink, or Spark, or Spring Cloud Data Flow on top of one of these solutions, to add computation, filtering, and querying on your streams. You may also want to use something like Apache Cassandra or Geode or Ignite as your queryable stream cache.

Kafka traditionally hasn’t offered transactional semantics in its writes, though this is changing in 0.11.

Pivotal has recently published a reasonably fair post on when to use RabbitMQ or Kafka, which I provided some input into. Pivotal is the owner of RabbitMQ but is also a fan of using the right tool for the job, and encouraging open source innovation … and thus is a fan of Kafka!

Details:

Firstly, on RabbitMQ vs. Kafka. They are both excellent solutions, RabbitMQ being more mature, but both have very different design philosophies. Fundamentally, I'd say RabbitMQ is broker-centric, focused around delivery guarantees between producers and consumers, with transient preferred over durable messages. Whereas Kafka is producer-centric, based around partitioning a fire hose of event data into durable message brokers with cursors, supporting batch consumers that may be offline, or online consumers that want messages at low latency.

RabbitMQ uses the broker itself to maintain state of what's consumed (via message acknowledgements) - it uses Erlang's Mnesia to maintain delivery state around the broker cluster. Kafka doesn't have message acknowledgements; it assumes the consumer keeps track of what's been consumed so far. Kafka brokers use Zookeeper to reliably maintain their state across a cluster.
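The two ways of tracking "what has been consumed" can be sketched side by side. Neither class below is a real client API; `AckQueue` and `OffsetConsumer` are hypothetical toy models, assuming a single consumer, meant only to show where the state lives in each design.

```python
from collections import deque

class AckQueue:
    """Broker-managed state: the broker remembers which deliveries are unacknowledged."""

    def __init__(self):
        self.ready = deque()
        self.unacked = {}          # delivery tag -> message
        self.next_tag = 1

    def publish(self, msg):
        self.ready.append(msg)

    def deliver(self):
        msg = self.ready.popleft()
        tag = self.next_tag
        self.next_tag += 1
        self.unacked[tag] = msg
        return tag, msg

    def ack(self, tag):
        del self.unacked[tag]      # once acknowledged, the broker forgets the message

    def requeue(self, tag):
        self.ready.appendleft(self.unacked.pop(tag))


class OffsetConsumer:
    """Consumer-managed state: the log is left untouched; the consumer keeps a cursor."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log[self.offset:]
        self.offset = len(self.log)
        return batch

    def seek(self, offset):
        self.offset = offset


queue = AckQueue()
queue.publish("m1")
queue.publish("m2")
tag, msg = queue.deliver()
queue.ack(tag)                     # "m1" can no longer be re-read from the broker
remaining = list(queue.ready)

log = ["m1", "m2"]
consumer = OffsetConsumer(log)
first = consumer.poll()
consumer.seek(0)                   # move the cursor back and read again
again = consumer.poll()
```

In the first model a lost acknowledgement is the broker's problem to resolve (for example by requeueing); in the second, a consumer that loses its cursor simply has to decide where to resume.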

RabbitMQ presumes that consumers are mostly online, and any messages "in wait" (persistent or not) are held opaquely (i.e. no cursor). Kafka was designed from the beginning around both online and batch consumers, and also has producer message batching - it's designed for holding and distributing large volumes of messages.

RabbitMQ provides rich routing capabilities with AMQP 0.9.1's exchange, binding and queuing model. Kafka has a very simple routing approach - in AMQP parlance it uses topic exchanges only.
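As an illustration of what topic-exchange routing involves, here is a small matcher, assuming the usual AMQP 0.9.1 wildcard rules: routing keys are dot-separated words, `*` matches exactly one word, and `#` matches zero or more words. The function names and the `bindings` table are hypothetical; a real broker does this matching internally.

```python
def topic_matches(pattern, routing_key):
    """Return True if a dot-separated routing key matches a binding pattern.

    '*' stands for exactly one word, '#' for zero or more words.
    """
    def match(p, k):
        if not p:
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb zero or more words of the key.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


# Hypothetical bindings: queue name -> binding pattern
bindings = {
    "all_orders": "orders.#",
    "eu_created": "orders.eu.*",
    "everything": "#",
}

def route(routing_key):
    """Return the queues that would receive a message with this routing key."""
    return sorted(q for q, pattern in bindings.items() if topic_matches(pattern, routing_key))
```

For example, `route("orders.eu.created")` selects all three queues, while `route("payments.eu")` selects only `everything`.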

Both solutions run as distributed clusters, but RabbitMQ's philosophy is to make the cluster transparent, as if it were a virtual broker. Kafka makes partitions explicit, by forcing the producer to know it is partitioning a topic's messages across several nodes. This has the benefit of preserving ordered delivery within a partition.
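A rough sketch of key-based partitioning shows why order is preserved per partition but not across the whole topic. The hash here is CRC32 purely for illustration (real Kafka clients may use a different hash function), and `partition_for`, `produce`, and the sample records are hypothetical.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def produce(records, num_partitions):
    """Append each (key, value) record to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions

# Hypothetical records: two users whose events are interleaved at the producer
records = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
    ("user-2", "logout"),
]
partitions = produce(records, num_partitions=3)

def values_for(key):
    """Collect one key's values in the order they sit inside their partition."""
    return [v for part in partitions for k, v in part if k == key]
```

Because every record with the same key lands in the same partition, `values_for("user-1")` comes back in the order the events were produced, even though events for different users may be read in a different relative order.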

RabbitMQ ensures queued messages are stored in published order even in the face of requeues or channel closure. One can set up a topology and ordered delivery similar to Kafka's using the consistent hash exchange or sharding plugin, or even more interesting topologies.

Put another way, Kafka presumes that producers generate a massive stream of events on their own timetable - there's no room for throttling producers because consumers are slow, since the data is too massive. The whole job of Kafka is to provide the "shock absorber" between the flood of events and those who want to consume them in their own way -- some online, others offline - only batch consuming on an hourly or even daily basis.

Performance-wise, both are excellent performers, but have major architectural differences. RabbitMQ has demonstrated setups of over a million messages/sec; Kafka has demonstrated setups of several million messages/sec. The primary architectural difference is that RabbitMQ handles its messages largely in-memory and thus uses a large cluster in these benchmarks (30+ nodes), whereas Kafka proudly leverages the powers of sequential disk I/O and requires less hardware (this benchmark uses 3x 6 core / 32 GB RAM nodes).

This older paper indicates Kafka handled 500,000 messages published per second and 22,000 messages consumed per second on a 2-node cluster with 6-disk RAID 10.
http://research.microsoft.com/en...

Now, a word on AMQP. Frankly, it seems the standard was a mess but has stabilized. Officially there is a 1.0 specification standardized by OASIS. In practice it is a forked standard, with 0.9.1 being broadly deployed in production, and a smaller number of users of 1.0.

AMQP has lost some of its sheen and momentum, but it has already succeeded in its goal of helping to break the hold TIBCO had on high performance, low latency messaging through 2007 or so. Now there are many options.