Spark Streaming historical state

I am building a real-time process for detecting fraudulent ATM card transactions. To detect fraud efficiently, the logic needs the last transaction date per card and the sum of transaction amounts per day (or over the last 24 hours).

One use case is: if a card is used outside its native country more than 30 days after the last transaction in that country, then send an alert as possible fraud.
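Concretely, the check I want to run per incoming transaction looks roughly like this (plain Java; the Transaction type and its fields are hypothetical placeholders for my real schema):

// Hypothetical sketch of the rule; Transaction and its fields are
// placeholders for my real transaction schema.
class Transaction {
    String cardNumber;
    String country;       // country where this transaction happened
    long   timestampMs;   // transaction time in epoch milliseconds
}

static final long THIRTY_DAYS_MS = 30L * 24 * 60 * 60 * 1000;

// Alert when the card is used outside its native country more than 30 days
// after the last transaction seen in that country.
boolean isPossibleFraud(Transaction tx, String nativeCountry, long lastNativeTxMs) {
    return !tx.country.equals(nativeCountry)
        && (tx.timestampMs - lastNativeTxMs) > THIRTY_DAYS_MS;
}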

So I looked at Spark Streaming as a solution. To achieve this (I am probably missing some idea about functional programming), below is my pseudo code:

stream = ssc.receiverStream(...)   // input receiver
s1 = stream.mapToPair(...)         // creates (card number, transaction date) pairs
s2 = s1.reduceByKey(...)           // reduces to the last transaction date per card
s2.checkpoint(new Duration(1000));
s2.persist();

I am facing a few problems here:

1) How can I use this last transaction date for future comparisons on the same card?
2) How can I persist the data so that even if the driver program restarts, the old values of s2 are restored?
3) Can updateStateByKey be used to maintain historical state?

I think I am missing a key point of Spark Streaming / functional programming: how to implement this kind of logic.


If you are using Spark Streaming you shouldn't really save your state to a file, especially if you plan to run your application 24/7. If that is not your intention, you will probably be fine with a plain Spark application, since you would only be facing a big-data computation and not a computation over batches arriving in real time.
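For your second question (surviving a driver restart), the mechanism Spark Streaming provides is checkpointing to a reliable filesystem plus recreating the context with getOrCreate; the state maintained by updateStateByKey is then restored from the checkpoint. A minimal Java sketch, where the checkpoint path and app name are placeholders:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

String checkpointDir = "hdfs://namenode:8020/checkpoints/fraud-app"; // placeholder

JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
    SparkConf conf = new SparkConf().setAppName("FraudDetection");
    JavaStreamingContext fresh = new JavaStreamingContext(conf, new Duration(1000));
    fresh.checkpoint(checkpointDir); // enables metadata + state checkpointing
    // ...build the whole DStream pipeline here (receiver, mapToPair, updateStateByKey)...
    return fresh;
});

ssc.start();
ssc.awaitTermination();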

Yes, updateStateByKey can be used to maintain state across batches, but it has a particular signature that you can see in the docs: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions
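As a minimal sketch in Java (assuming you have already built a (cardNumber, timestamp) pair stream called pairs; the types are my assumption about your data):

import java.util.List;
import org.apache.spark.api.java.Optional; // Guava's Optional in Spark 1.x
import org.apache.spark.streaming.api.java.JavaPairDStream;

// pairs: JavaPairDStream<String, Long> of (cardNumber, transactionTimestampMs)
JavaPairDStream<String, Long> lastTxDateByCard = pairs.updateStateByKey(
    (List<Long> newTimestamps, Optional<Long> state) -> {
        long latest = state.isPresent() ? state.get() : 0L;
        for (Long ts : newTimestamps) {
            latest = Math.max(latest, ts); // keep the most recent timestamp per card
        }
        return Optional.of(latest);
    });

// Every batch now carries the latest timestamp per card, which you can join
// against the incoming transactions to run the 30-day comparison.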

Also, persist() is just a form of caching: it keeps computed RDDs around for reuse, but it does not durably save your data (the cache is lost if the application dies). Checkpointing is what writes durably to a reliable filesystem.
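For instance, both of these lines exist in the API but do very different things (the HDFS path is a placeholder):

import org.apache.spark.storage.StorageLevel;

s2.persist(StorageLevel.MEMORY_AND_DISK());             // caching only: may spill to local disk, lost when the app dies
ssc.checkpoint("hdfs://namenode:8020/checkpoints/app"); // durable, survives driver restarts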

Hope this clarifies some of your doubts.
