Distributed stream processing with Akka actors.

With my pull request accepted into Spark Streaming, it will support actors as receivers for processing streams.

What exactly is Spark Streaming?

Before we talk about Spark Streaming, let me quote “What is Spark?”

Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

Spark Streaming is a distributed stream processing framework built on top of Spark, inheriting its benefits such as fault tolerance through replication and RDD lineage graphs. What makes Spark Streaming special is that it uses all the goodness of Spark: in-memory processing, fault tolerance, and distributed computing come out of the box. With all of this, it promises near real-time processing, with batch durations as low as one second.

The problems this merge will address:

    1. An easy abstraction for bringing in a user-defined custom receiver.
    2. The worker is a supervisor [1], which means it provides some fault tolerance for plugged-in receiver actors.
    3. A clear interface for filtering incoming data at the receiver itself.
    4. The ability to utilize an existing library or abstraction for receivers built for various message queues and socket streams.
    5. Multiple receiver streams, each with its own custom filtering. A possible use case: one stream pulls from ZeroMQ and processes XML, filtering to collect only the value of a particular tag, while another stream is a socket receiver parsing JSON messages — together giving near real-time analytics.

Akka comes loaded with an actor-based programming model and a nice set of abstractions for message passing, remoting, and fault tolerance. With this API, actors provide an elegant and simple abstraction that reduces the work needed to implement a receiver.
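To make this concrete, here is a minimal sketch of what a custom actor receiver could look like. The trait and method names (`Receiver`, `pushBlock`, `actorStream`) are assumptions based on the direction of the pull request, and the exact names in the merged API may differ; treat this as an illustration of the shape, not a reference.

```scala
import akka.actor.{Actor, Props}

// Sketch only: `Receiver` and `pushBlock` are assumed names for the
// receiver abstraction this merge introduces.
class LineReceiver extends Actor with Receiver {
  def receive = {
    // Filtering happens right here at the receiver (point 3 above):
    // only non-empty lines are pushed into Spark Streaming.
    case line: String if line.trim.nonEmpty => pushBlock(line)
    case _                                  => // drop everything else
  }
}

// Plugging the actor into a StreamingContext as an input stream
// (`actorStream` is likewise an assumed name):
// val lines = ssc.actorStream[String](Props[LineReceiver], "line-receiver")
```

Because the actor runs under the worker's supervisor, a crash in `LineReceiver` can be handled by the supervision strategy rather than taking down the stream.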

References:

1. http://doc.akka.io/docs/akka/snapshot/scala/fault-tolerance.html