As Fast as Squirrels
Apache Flink is an open-source, distributed, streaming-first processing engine; it
provides high availability and exactly-once consistency, as well as real-time
complex event processing at massive scale. Flink also provides batch computation
as a sub-case of streaming. Radicalbit uses Flink at its core, and its
efficiency, robustness, and scalability keep amazing us, making it a perfect
fit for the core of a Kappa architecture.
PMML stands for Predictive Model Markup Language, and it is a well-established
standard for persisting Machine Learning models across different systems. PMML is
based on a compact XML schema which allows defining trained unsupervised/supervised,
probabilistic, and deep learning models, so that a trained model can be persisted
independently of its source and imported/exported by any system. We employed the
JPMML-Evaluator library to adopt the standard within Flink-JPMML. At this point,
we're ready to get our hands dirty.
<dependencies>
  <dependency>
    <groupId>io.radicalbit</groupId>
    <artifactId>flink-jpmml-scala</artifactId>
    <version>0.6.3</version>
  </dependency>
</dependencies>
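If you build with sbt rather than Maven, the equivalent dependency would look roughly like the following (assuming the artifact is cross-published for your Scala version, which is an assumption on our part):

```scala
// build.sbt — hypothetical sbt equivalent of the Maven coordinates above
libraryDependencies += "io.radicalbit" %% "flink-jpmml-scala" % "0.6.3"
```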
You'll probably also need to publish the library locally; to do that,
follow these steps:
This is the only thing you need to worry about: Flink-JPMML automatically
detects the distributed backend configured for Flink by implementing a dedicated
ModelReader.
import io.radicalbit.flink.pmml.scala.api.reader.ModelReader
val modelReader = ModelReader(sourcePath)
import org.apache.flink.streaming.api.scala._
import org.apache.flink.ml.math.{DenseVector, Vector}

case class IrisInput(pLength: Double, pWidth: Double, sLength: Double, sWidth: Double, timestamp: Long, color: Int, prediction: Option[String]) {
  def toVector: Vector = DenseVector(pLength, pWidth, sLength, sWidth)
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
val events: DataStream[IrisInput] = yourIrisSource(env)
import io.radicalbit.flink.pmml.scala._
Importing this package extends Flink's DataStream with the evaluate method;
strictly speaking, it provides the tool that lets you achieve streaming
predictions in real time.
import io.radicalbit.flink.pmml.scala._
import org.apache.flink.ml.math.Vector

val out = events.evaluate(modelReader) { (event, model) =>
  // flink-jpmml models must be evaluated against Flink Vectors
  val vectorEvent: Vector = event.toVector
  // now we can call the predict method of model: PmmlModel
  val prediction = model.predict(vectorEvent)
  // the Prediction container holds the result as an ADT called Score
  prediction match {
    case Prediction(Score(value)) =>
      // return the event with the prediction filled in
      event.copy(prediction = Some(computeKind(value)))
    case Prediction(EmptyScore) =>
      // return just the event, unchanged
      logger.info("It was not possible to predict event {}", event)
      event
  }
}

out.print()
env.execute("Flink JPMML simple execution.")
private def computeKind(value: Double): String =
  value match {
    case 1.0 => "Iris-setosa"
    case 2.0 => "Iris-versicolor"
    case 3.0 => "Iris-virginica"
    case _   => "other"
  }
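Outside Flink, the cluster-id-to-species mapping can be exercised on its own; a minimal self-contained sketch:

```scala
// Standalone version of the mapping used in the job above.
def computeKind(value: Double): String = value match {
  case 1.0 => "Iris-setosa"
  case 2.0 => "Iris-versicolor"
  case 3.0 => "Iris-virginica"
  case _   => "other"   // unknown cluster ids fall through here
}

println(computeKind(2.0)) // prints "Iris-versicolor"
```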
Now you can take the sample PMML clustering model available here; the only change
needed is to add class as an output parameter, so let's simply add:
<Output>
  <OutputField name="PCluster" optype="class" dataType="integer" targetField="class" feature="entityId"/>
</Output>
At this point, we're ready to execute our job. Flink-JPMML will emit a log
message about the loading state.
This comes in extremely handy if you need to apply concrete mathematical
preprocessing before the evaluation, or when only the prediction result is
required (e.g. for model quality assessment).
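The preprocess-then-predict pattern can be sketched without Flink at all; in the snippet below, `MockModel`, `Prediction`, and `Score` are stand-ins for the library's types, not its actual API, and `minMaxScale` is a hypothetical preprocessing step:

```scala
// Minimal sketch (library types mocked): preprocess features, then keep only the score.
final case class Score(value: Double)
final case class Prediction(score: Score)

// Stand-in for PmmlModel: scores the mean of the (already preprocessed) features.
final class MockModel {
  def predict(features: Vector[Double]): Prediction =
    Prediction(Score(features.sum / features.size))
}

// Hypothetical preprocessing: min-max scaling of the raw feature vector.
def minMaxScale(raw: Vector[Double]): Vector[Double] = {
  val (lo, hi) = (raw.min, raw.max)
  if (hi == lo) raw.map(_ => 0.0) else raw.map(x => (x - lo) / (hi - lo))
}

val model = new MockModel
val raw   = Vector(5.1, 3.5, 1.4, 0.2)
// Preprocess first, then evaluate; only the numeric score flows downstream.
val score: Double = model.predict(minMaxScale(raw)).score.value
```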
The Reader
The ModelReader object is in charge of retrieving the PMML model from every
Flink-supported distributed system; namely, it is able to load from any supported
distributed file system (e.g. HDFS, Alluxio). The model reader instance is
shipped to the Task Managers, which invoke its API only at operator
materialization time: that means the model is read lazily.
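The lazy-read behaviour can be illustrated with a small sketch; `LazyModelReader` below is a mock, not the library's actual ModelReader, and the HDFS path is a made-up example:

```scala
// Sketch of the lazy-read idea: the reader only carries the source path;
// the model is fetched on first access, not when the reader is constructed.
final class LazyModelReader(sourcePath: String, fetch: String => String) {
  var reads = 0                                  // instrumentation for the sketch
  lazy val model: String = { reads += 1; fetch(sourcePath) }
}

val reader = new LazyModelReader("hdfs:///models/iris.pmml", path => s"<PMML from $path>")
// Nothing has been read yet — the model materializes only on first use.
val readsBeforeAccess = reader.reads
reader.model
reader.model   // second access hits the cached value, not the backing store
```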
The Model
The library lets Flink load the model through a singleton loader per Task
Manager, so the model is read once regardless of the number of sub-tasks
running on each TM. This optimization lets Flink scale model evaluation in a
thread-safe way, considering that even fairly basic PMML files can grow to
several hundreds of MBs.
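A per-JVM singleton cache of this kind could be sketched as follows; this is our own illustration of the pattern, not the library's internals, and the path string is hypothetical:

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Sketch of a per-JVM (i.e. per-Task-Manager) singleton loader: all sub-tasks
// running in the same JVM share a single loaded copy of each model.
object ModelCache {
  private val loads = new AtomicInteger(0)
  private val cache = new ConcurrentHashMap[String, String]()

  def get(path: String): String =
    // computeIfAbsent guarantees the loading function runs at most once per key,
    // even under concurrent access from multiple sub-task threads.
    cache.computeIfAbsent(path, p => { loads.incrementAndGet(); s"model-bytes-of-$p" })

  def loadCount: Int = loads.get()
}

// Four "sub-tasks" asking concurrently still trigger a single load.
val tasks = (1 to 4).map(_ => new Thread(() => ModelCache.get("hdfs:///models/iris.pmml")))
tasks.foreach(_.start())
tasks.foreach(_.join())
```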
Evaluation as UDF
The evaluate method is backed by a FlatMap operator, enriched with the
above-described user-defined function, which the user provides as a partial
function. Initially, the idea was to create something à la FlinkML, i.e. a
core object shaped by strategy patterns, in order to compute predictions just
as you would with typical ML libraries.
But, at the end of the day, we're performing a streaming task, so the user has
the unbounded input events and the model as an instance of PmmlModel. Flink-JPMML
only requires the user to compute the prediction; beyond that, the UDF allows
applying any kind of custom operation, and any serializable output type is allowed.
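The shape of evaluate can be mimicked with plain collections standing in for DataStream; `Event`, `MockModel`, and this `evaluate` are all our own stand-ins, not the library's API:

```scala
// Sketch of the evaluate-as-flatMap idea: a user function over (event, model)
// is lifted into a flatMap on the stream (here, a List plays the stream).
final case class Event(features: List[Double])
final class MockModel { def predict(fs: List[Double]): Double = fs.sum }

def evaluate[T](events: List[Event], model: MockModel)(udf: (Event, MockModel) => T): List[T] =
  events.flatMap(e => List(udf(e, model)))

val out = evaluate(List(Event(List(1.0, 2.0)), Event(List(3.0))), new MockModel) {
  (event, model) => model.predict(event.features)   // any serializable output type works
}
// out == List(3.0, 3.0)
```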
Closing
We introduced Flink-JPMML, a scalable, lightweight library that exploits Apache
Flink's capabilities as a real-time processing engine and offers a brand-new way
to serve any of your Machine Learning models exported with the PMML standard. In
the next post, we will discuss how Flink-JPMML lets the user manage NaN values,
describe how the library handles failures, explain the reasoning behind the
Flink Vector choice, and point out the steps we plan to follow to keep improving
this library.
We'd be really pleased to welcome new contributors to Flink-JPMML; just check
out the repository and the open issues.