I wanted to run a job that runs 24×7 and reports whenever certain keywords occur more than N times in the stream. Spark Streaming looked like an ideal candidate for this task. Spark has a reduceByKeyAndWindow function, which was exactly what I was looking for.
I decided to use a window length of 30 minutes and a slide interval of 1 minute. I had assumed that it would discard all the keys every 30 minutes. While running the job I noticed that the memory usage and the processing time kept increasing, to the point that the application could never be stable, i.e. each job was taking more time than the slide interval, so new jobs kept getting added to the processing queue.
The documentation on this function is so sparse that I had to ask on the mailing list. After a few days someone responded that I needed to add a filter function that discards the unwanted keys.
reduceByKeyAndWindow(reduceRows, invReduceRows, 1800, 60, filterFunc=filterOldKeys)
So the actual function call looks like the line above. filterOldKeys is a simple function where I check whether the count for a key is greater than 0, so keys whose windowed count has dropped to 0 are discarded from the state. This solved my problem.
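To make the pieces concrete, here is a minimal sketch of what the three helpers might look like. This assumes a PySpark DStream of (keyword, count) pairs; the names reduceRows, invReduceRows, and filterOldKeys mirror the call above, but their bodies here are my illustration, not necessarily the original code.

```python
def reduceRows(a, b):
    # Combine counts as new batches enter the window.
    return a + b

def invReduceRows(a, b):
    # Inverse reduce: subtract counts as old batches leave the window.
    return a - b

def filterOldKeys(pair):
    # filterFunc receives a (key, value) pair; only pairs for which it
    # returns True are kept in the window state, so keys whose count
    # has fallen to 0 are dropped instead of accumulating forever.
    key, count = pair
    return count > 0

# Hypothetical usage, assuming `counts` is a DStream of (keyword, count):
# windowed = counts.reduceByKeyAndWindow(
#     reduceRows, invReduceRows, 1800, 60, filterFunc=filterOldKeys)
```

Without invReduceRows and filterFunc, Spark has to recompute the full window and keep every key ever seen; with them, each slide only adds the newest batch, subtracts the oldest, and prunes dead keys.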