Architecture and integration with applications
The architecture is well covered in the official documentation at http://predictionio.incubator.apache.org/system/. This section expands on its most important aspects so that the flexibility of the platform and what it offers are clear.
The following diagram is from the official documentation of PredictionIO:
The key things to understand from the preceding diagram are as follows:
- Event Server provides a RESTful endpoint to which applications send events in real time. For an application such as a product recommender, events may include a buyer viewing a product, a buyer adding a product to a cart, readings from IoT devices, and so on. In the current version of PredictionIO, Event Server can use PostgreSQL 9.1, MySQL 5.1, or Apache HBase together with Elasticsearch as the event data store. PredictionIO allows different engines to be used in training, but many algorithms come from Spark's MLlib. For scalable applications with large data volumes, it is better to consider Apache HBase, an open source, distributed, versioned, non-relational database capable of handling the billions of records that may make up the training data.
- Training: PredictionIO uses Apache Spark to train models on the dataset. Spark offers developers an extensive API for working with its data structures, and most templates use libraries such as Spark MLlib to call machine learning functions developed by data scientists directly.
- Prediction Server provides a RESTful endpoint to which applications submit queries in real time and from which they receive predicted results. The output of training has two parts: a model and its metadata. The model is then stored in the Hadoop Distributed File System (HDFS), a local file system, or Elasticsearch.
HDFS is a distributed filesystem from Hadoop; it allows storage to be shared among clustered machines. It is used to stage data for batch import into PredictionIO (PIO), for the export of Event Server datasets, and for the storage of some models. Elasticsearch is a distributed, RESTful search and analytics engine; it is at the core of the Elastic Stack and stores data centrally so that it can be searched and analyzed.
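To make the two RESTful endpoints more concrete, the following is a minimal Python sketch of how an application might talk to Event Server and Prediction Server. It rests on several assumptions that are not stated in this section: the servers run locally on the default ports 7070 and 8000, the access key is a placeholder, and the query shape (`user` and `num`) follows the Recommendation template, so other templates will expect different fields. The helper names `build_event_request`, `build_query_request`, and `send` are hypothetical and exist only for illustration.

```python
import json
from urllib import request
from urllib.parse import urlencode

# Assumed defaults for a local installation; adjust for your deployment.
EVENT_SERVER = "http://localhost:7070"
PREDICTION_SERVER = "http://localhost:8000"


def _json_post(url, payload):
    """Build (but do not send) a JSON POST request."""
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_event_request(access_key, user_id, item_id, event="view"):
    """Request that records one user-to-item event in Event Server."""
    payload = {
        "event": event,              # for example "view" or "buy"
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
    }
    query = urlencode({"accessKey": access_key})
    return _json_post(f"{EVENT_SERVER}/events.json?{query}", payload)


def build_query_request(user_id, num=4):
    """Request that asks a deployed engine for `num` results for a user.

    The field names are an assumption based on the Recommendation template.
    """
    return _json_post(
        f"{PREDICTION_SERVER}/queries.json",
        {"user": user_id, "num": num},
    )


def send(req, timeout=5.0):
    """Send a prepared request and decode the JSON response body."""
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With both servers running and an engine trained and deployed, an application would call `send(build_event_request("YOUR_ACCESS_KEY", "u1", "i1"))` each time a buyer views a product, and `send(build_query_request("u1"))` when it needs predictions. Keeping request construction separate from sending makes it possible to inspect or test the payloads without a live server.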