Applicability and re-use of the architecture and components in different scenarios

I. Data collection

The Message Ingestion and Storage service addresses the need to ingest (often extremely large volumes of) data created by distributed devices and store it centrally for visibility and reporting. The devices can range from extremely small (low-power microcontrollers that collect data from sensors embedded in larger equipment) to large (PCs, servers, or personal devices such as phones). Microsoft offers services such as IoT Hub and Event Hubs to ingest data, but in some narrow circumstances customers can use existing technology such as Apache Kafka for message ingestion, if a robust enterprise service already exists.

The technology used to ingest messages into the system and persist them in the cloud depends on various factors. It starts with the business objective(s) and the business process changes the system must support to achieve them. More specifically, the greatest determining factors are the speed and volume of the incoming messages, and how quickly the organization wants to act on the telemetry collected.

There are various options for ingesting messages from distributed equipment and devices with varying degrees of velocity. When architecting the service, it is important to consider whether the technology:

  1. Is appropriate for the variety, volume, and velocity of messages? (If the service can’t handle the message volume and velocity, nothing else matters)

  2. Can scale up/out on demand? (how variable is the message flow?)

  3. Supports failure (local) and disaster (regional) scenarios?

  4. Needs to handle outgoing (cloud to device) messaging for control and configuration of devices?

  5. Needs to support specific messaging protocols (such as MQTT)?

  6. Is available at the level of management desired (does the customer prefer a higher-level managed service, or do they want to leverage existing skills and assets and manage it themselves)?

Specifically, Azure Event Hubs (AEH) is a highly scalable, reliable, fully managed service for ingesting messages from (potentially many) origin systems and delivering them to (potentially many) target applications and systems. If there is no need to ‘reach back’ to the device for configuration or control, it provides a simple and elegant way to accept those messages at massive scale. https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-what-is-event-hubs

In most scenarios today, there needs to be two-way message flow between the cloud and distributed devices. A message service based on IoT Hub provides a logical superset of the capabilities of AEH. IoT Hub not only supports device-to-cloud messaging over HTTPS and AMQP, but also supports MQTT, along with services to provision devices, manage device identity, and send messages from the cloud back to the devices to manage and configure them. There are also ‘Basic’ and ‘Standard’ tiers available, so the service can be matched to your device requirements. Intro to IoT Hub: https://docs.microsoft.com/en-us/azure/iot-hub/iot-hub-what-is-iot-hub Comparison of AEH to IoT Hub: https://docs.microsoft.com/en-us/azure/iot-hub/iot-hub-compare-event-hubs

IoT Hub natively supports routing messages to Azure Storage as blobs; within the setup for an IoT Hub, you can add Azure Storage containers as custom endpoints.
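
To make the device-to-cloud path concrete, the sketch below sends simulated telemetry to IoT Hub using the azure-iot-device Python SDK. The connection string, device ID, and payload fields are placeholders, not values from the solution template.

```python
# Minimal sketch: sending simulated device-to-cloud telemetry to IoT Hub with the
# azure-iot-device Python SDK. Connection string and payload fields are placeholders.
import json
import random
import time

from azure.iot.device import IoTHubDeviceClient, Message

CONNECTION_STRING = "HostName=<your-hub>.azure-devices.net;DeviceId=<device-id>;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
client.connect()
try:
    for _ in range(10):
        payload = {
            "deviceId": "turbine-001",
            "temperature": round(random.uniform(60.0, 95.0), 2),
            "vibration": round(random.uniform(0.1, 2.5), 3),
            "timestamp": time.time(),
        }
        message = Message(json.dumps(payload))
        message.content_type = "application/json"
        message.content_encoding = "utf-8"
        client.send_message(message)  # device-to-cloud message, routable to custom endpoints
        time.sleep(5)                 # send a reading every five seconds
finally:
    client.disconnect()
```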

Beyond Microsoft technology, however, it is possible that your customer already has a mature messaging infrastructure that is used to ingest messages into the cloud. Apache Kafka, for example, has become a popular open-source option. At its core, Kafka is an unmanaged technology: most customers run it in cloud VMs, so they must manage and scale the service themselves, as well as handle storage, failover, disaster recovery, and similar challenges. There are now options in Azure to use a managed Kafka service, based on an HDInsight offering. Even though the level at which Kafka is normally managed is quite different, its capabilities are closer to those of AEH, in that it is meant to be used as a one-way message movement service. For that reason, per above, IoT Hub is going to be the most appropriate option for nearly all predictive maintenance business scenarios. Excellent (audio) summary of AEH vs. Kafka: https://softwareengineeringdaily.com/2016/04/25/azure-event-hubs-dan-rosanova/

Background article on Kafka vs. AEH: https://blogs.msdn.microsoft.com/opensourcemsft/2015/08/08/choosing-between-azure-event-hub-and-kafka-what-you-need-to-know/

Considerations for Predictive Maintenance:

Most predictive maintenance challenges utilize AI models that are based on a combination of operational and non-operational data. While operational data normally streams in from devices over a time window consistent with their use, non-operational data comes in many forms, and can be ingested in near-real-time – just like the operational data – or in batch. For example, device metadata, or historic maintenance records stored in an ERP system, are frequently very valuable in a PdM model and will normally be ingested (or refreshed) in batch.

Weather data from the internet, on the other hand, may be relevant to the scenario, and could be flowing in constantly, potentially 24 hours a day. No matter where it comes from, or how often, non-operational data is normally combined with operational data periodically to train the predictive AI model. Therefore, in nearly all cases, it will make sense to have two paths for incoming data.

In the PdM architecture outlined here, data comes in from devices through IoT Hub – providing the ability not just to ingest messages but to manage the devices over time – and the non-operational data is periodically ingested into Azure Blob storage via Azure Event Hubs. This is a common approach that leverages both Microsoft technologies, consistent with their strengths.
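
As a sketch of the non-operational path, the snippet below pushes a small batch of hypothetical maintenance records into Event Hubs using the azure-eventhub Python SDK; in the architecture above, those events would then land in Blob storage. The namespace, hub name, and record fields are placeholders.

```python
# Minimal sketch: sending a batch of non-operational records (e.g., maintenance
# history exported from an ERP system) to Azure Event Hubs with the azure-eventhub SDK.
# Connection string, hub name, and record fields are placeholders.
import json

from azure.eventhub import EventData, EventHubProducerClient

CONNECTION_STRING = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>"

maintenance_records = [
    {"assetId": "turbine-001", "workOrder": "WO-1842", "failureCode": "F7", "serviceDate": "2018-03-14"},
    {"assetId": "turbine-002", "workOrder": "WO-1843", "failureCode": "F2", "serviceDate": "2018-03-15"},
]

producer = EventHubProducerClient.from_connection_string(
    CONNECTION_STRING, eventhub_name="maintenance-records"
)
with producer:
    batch = producer.create_batch()
    for record in maintenance_records:
        batch.add(EventData(json.dumps(record)))  # raises ValueError once the batch is full
    producer.send_batch(batch)
```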

In the AI Solution Template for Predictive Maintenance, a component is provided to generate simulated sensor data to facilitate pilot projects. However, the architecture represented is consistent with best practices in PdM scenarios that span devices, use cases, and industries.

Work with the customer to define:

  • Existing and needed services to support message ingestion and storage

  • Architecture defining message ingestion technology and storage path

  • POC to validate the proposed architecture using representative types and volumes of data

Possible Outcomes

  • Message Ingestion architecture and service established

  • Storage service established

II. Modeling

After all of the relevant data has been ingested, accumulated, and combined, the process of creating an AI model can start. The AI training system is used to create and iteratively improve that model until it is ready for use.

At the highest level, the process of using AI is separated into two phases: training a model, and using the model in production. The cadence for these steps is highly dependent upon the business problem to be solved, but the training process is done iteratively (over days, weeks, or possibly even months, depending upon the algorithm and the sophistication of the model), while the using phase is ‘constant.’ In this context, ‘constant’ simply means that after the model is created, it will be used regularly to evaluate (‘score’) incoming telemetry, whether that incoming data arrives on a sub-second, second, minute, hourly, daily, or longer time horizon, and whether that telemetry comes in message by message or in batch.

Further, the model will change over time, and will do so as the result of a batch process itself. In other words, the model – depending upon the data and the domain – will need to be updated on a regular basis, such as nightly, weekly, or monthly, as the stored data set changes over time. Conceptually, model training will have to be repeated on a cadence related to how quickly the operational data changes: rapidly changing data often results in models that must be updated far more frequently – such as hourly or nightly – while slowly changing data may only require the models to be re-trained on a weekly or monthly basis.

There are obviously many, far more granular, steps required to train a machine learning model. For example, when data is initially ingested and combined, there are prescribed steps and approaches for filling in missing values, dealing with outliers, and otherwise ‘cleaning’ or ‘wrangling’ the data before a model is built. There are industry frameworks that provide proposed steps and a process for getting value out of business data, which has implications for how modern data science can be performed. One well-known example is CRISP-DM, the Cross-Industry Standard Process for Data Mining: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining The standard defines an iterative process to clarify the business objective, understand the data and identify all that is required, and then prepare and model the data in order to derive insight. Microsoft has a proposed process for performing data science called the Team Data Science Process, and there are various videos and training materials online that can be used to become more familiar with it: Team Data Science Process for DevOps: https://github.com/Azure/LearnAnalytics-Team-Data-Science-Process-for-DevOps
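
As an illustration of the ‘cleaning’ and ‘wrangling’ steps mentioned above, the pandas sketch below fills missing values and tames outliers. The file and column names are hypothetical, not the template’s actual schema.

```python
# Illustrative sketch of typical data-wrangling steps before model training:
# filling missing values and taming outliers with pandas. Column names are hypothetical.
import pandas as pd

df = pd.read_csv("telemetry_with_maintenance.csv", parse_dates=["timestamp"])

# Fill gaps in slowly changing sensor readings by carrying the last value forward
df["temperature"] = df["temperature"].ffill()

# Replace missing categorical metadata with an explicit 'unknown' label
df["model"] = df["model"].fillna("unknown")

# Clip extreme vibration readings to the 1st/99th percentiles to limit outlier impact
low, high = df["vibration"].quantile([0.01, 0.99])
df["vibration"] = df["vibration"].clip(lower=low, upper=high)

# Drop rows that still lack the label needed for supervised training
df = df.dropna(subset=["failure_within_7_days"])
```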

Two important variables that affect the target environment for the AI training system are the algorithm being used to solve the business problem and the size of the training data set. If the algorithm is simple (linear regression, for example) and the data set small, then the entire training process could be done very quickly using off-the-shelf routines in a common library (such as the Python library scikit-learn). In this case, the entire AI training ‘environment’ could consist of a single VM based on the Microsoft Data Science Virtual Machine (DSVM) running in Azure (which already has scikit-learn installed). Since the DSVM can be initialized when needed, and scaled up and down to meet the model training requirement, it represents a simple and appealing option for training many ML models. Conversely, if the data set and the compute resources required to train the model are larger, then Spark will often be a very appealing training platform, as it provides a distributed processing environment that runs in any of the service layers in Azure. This allows customers to use the Spark service that targets their desired management burden and balances capability against cost.
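
For the ‘small data, simple algorithm’ case, the sketch below trains a scikit-learn model on a single DSVM and persists it for later use. The classifier, feature names, and file paths are illustrative assumptions, in the same spirit as the linear regression example above.

```python
# Minimal sketch of the 'small data, simple algorithm' case: training a
# scikit-learn classifier on a single DSVM. Feature/label names are placeholders.
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_data.csv")
features = ["temperature", "vibration", "pressure", "age_in_days"]
X, y = df[features], df["failure_within_7_days"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate against a simple quality bar before the model goes any further
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Persist the trained model so it can be operationalized (or batch-scored) later
joblib.dump(model, "pdm_model.pkl")
```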

SaaS: Our fully hosted Spark offering is Azure Databricks. Databricks was founded by the original creators of Spark, and Azure Databricks represents the most powerful and ‘curated’ environment available in Azure for Spark. https://microsoft.sharepoint.com/sites/infopedia/pages/Docset-Viewer.aspx?did=G01KC-1-33626

PaaS: Spark is supported as part of the HDInsight service offering. Spark HDI clusters can be provisioned manually (through the Azure portal), launched via Azure Data Factory, or created programmatically through the CLI: https://microsoft.sharepoint.com/sites/infopedia/search/pages/results.aspx?term=HDInsight%20Spark&scope=All#everything

IaaS: Spark can be used with the Cloudera, Hortonworks, or MapR distributions in VMs, through easily accessible Marketplace options.

Not surprisingly, the cost of the Spark service increases as you move from the IaaS up to the SaaS offering. These alternatives allow customers to target the appropriate Spark environment based on their existing Spark skills, historic investments, scale requirements, and desired level of service, which will ultimately determine the cost. If the customer wishes to use – or is already using – Databricks (our hero offering), the predictive maintenance solution can leverage that. However, the scenario is modular, allowing Spark at various service levels in Azure to be used for training. Whether the logical AI training system is small or large, self-managed or fully hosted, based on custom code or on standard, mature, readily available AI algorithms, the objective is for service use to be modular. In so doing, the AI environment can solve predictive maintenance business problems today and, with incremental service additions, many other business challenges in other domains moving forward.

Lastly, it is important to consider the business ‘cadence’ with which the models must be created and then evaluated at runtime. In some situations, both the training and scoring of the models will be done in batch. For example, if an organization wants to predict, on a monthly basis (for whatever reason, only monthly), which major assets will fail in the upcoming month (for crew scheduling, for example), then both the training of the model and the prediction may occur in a predictable monthly cycle, and the ‘production’ estimates can be generated within the same ‘training’ environment, making a separate production system for scoring unnecessary.

Considerations for Predictive Maintenance

There are a wide variety of business scenarios where data science for predictive maintenance can be valuable. (See the data science guide for Predictive Maintenance, referred to on the PdM overview slide.) In nearly all cases, operational data and non-operational data are combined to create a training data set, which is then used – with one or more algorithms per model – to iteratively train the model(s) over time. A model can take seconds, minutes, or hours to train, depending upon the amount of data and the complexity of the algorithm. However, after a model is trained, analysis is normally required to review the results, compare them against a pre-defined quality bar, and then strategize on how to improve the model moving forward. This makes initial training of the model necessarily a batch activity: even if substantial resources are required to train it, those resources will need to either be available on demand or pooled so they can be accessed in a shared environment.

As we have described, there are many different business questions that can be answered using predictive methods. In most of them, the amount of data required to train the model is manageable enough that the model can be trained in a single DSVM – a virtual machine that can be provisioned on demand. If the job is demanding enough that it needs a much larger environment, then a Spark cluster could be used to train the model, in which case any of the Spark options summarized above could be considered.

Independent of the environment required to complete the training, the ‘use’ of the model eventually forks: it will be used in production (applied to new data to predict what is relevant) and re-trained on an iterative basis (to reflect the data set that is slowly changing over time). If the customer’s production process requires near-real-time evaluation of incoming data to predict failures, a production infrastructure must be architected and deployed to handle that at the appropriate level of scale. (That is covered as a separate logical service, in the ‘AI Scoring System’ section.) The ‘training service’ and ‘production service’ then exist separately, and on a regular basis the (updated) trained model must be pushed to the production environment (‘operationalized’ is the term that is often used) to optimize results.

However, in most cases the desired business predictions can be obtained through a batch process as well, in which case the (batch) training environment can also be used for scoring the models and therefore obtaining the predictions. In this case, the same infrastructure and environment is used both to manage and improve the model(s) over time and to regularly generate the business predictions needed.
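
When predictions are only needed on a batch cadence, the training environment itself can do the scoring. A minimal sketch, reusing the hypothetical persisted model from the earlier training sketch; paths, column names, and the risk threshold are placeholders.

```python
# Minimal sketch of batch scoring inside the training environment: load the
# previously persisted model and score the latest snapshot of combined data
# on a weekly/monthly schedule. Paths and column names are placeholders.
import joblib
import pandas as pd

model = joblib.load("pdm_model.pkl")

latest = pd.read_csv("latest_combined_snapshot.csv")
features = ["temperature", "vibration", "pressure", "age_in_days"]

latest["failure_probability"] = model.predict_proba(latest[features])[:, 1]

# Keep only the assets that exceed the agreed risk threshold for this planning cycle
at_risk = latest.loc[latest["failure_probability"] >= 0.7, ["assetId", "failure_probability"]]
at_risk.to_csv("at_risk_assets.csv", index=False)
```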

In the AI Solution Template specifically, a Spark service was used to train the model. Per above, there were certainly many Spark options available to do so. One (of the many) benefits of providing Jupyter notebooks as the delivery vehicle for this template is the many possible execution environments for them. To provide the simplest and most cost-effective option for creating a Spark cluster on demand for training, the AZTK toolkit was used. AZTK allows Azure Batch to be used on demand for executing the training job, which allows the rapid provisioning of low-priority VMs – which are often up to 80% cheaper than standard VMs – to execute a standard Spark container in Azure. Technically, AZTK simply provisions the nodes, then runs PySpark on those VMs, executing the Python code in a Jupyter notebook environment. In short, AZTK provides a simple Python layer on top of Azure Batch that makes the training effort in the solution template simple and very inexpensive, and therefore an appealing option for pilots and proofs of concept. More detail on AZTK can be found here: https://github.com/Azure/aztk
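
The notebook that AZTK submits runs ordinary PySpark. Below is a minimal sketch of the kind of Spark ML training code such a notebook might contain; the paths, columns, and parameters are placeholders, not the actual solution-template code.

```python
# Minimal sketch of PySpark training code that a notebook submitted via
# AZTK/Azure Batch could run. Paths, columns, and parameters are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("pdm-training").getOrCreate()

df = spark.read.parquet("wasbs://training@<storageaccount>.blob.core.windows.net/combined/")

assembler = VectorAssembler(
    inputCols=["temperature", "vibration", "pressure", "age_in_days"],
    outputCol="features",
)
rf = RandomForestClassifier(labelCol="failure_within_7_days", featuresCol="features", numTrees=100)

train, test = df.randomSplit([0.7, 0.3], seed=42)
model = Pipeline(stages=[assembler, rf]).fit(train)

# Evaluate on the held-out split before the model is considered for deployment
auc = BinaryClassificationEvaluator(labelCol="failure_within_7_days").evaluate(model.transform(test))
print("AUC:", auc)

model.write().overwrite().save("wasbs://models@<storageaccount>.blob.core.windows.net/pdm-model")
```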

To emphasize the point: if the most important business question is ‘Which of our wind turbines do we expect to fail in the next fiscal quarter?’ or ‘What is our best estimate for the remaining useful life of the engines in each one of our cruise ships?’, then the analysis can be done in batch – on a monthly basis or similar – in order to perform business planning. However, if the business question is ‘Which of our elevators are exhibiting behavior that suggests they will fail today (so that we can proactively send someone to take them out of rotation and address the issue)?’, then a separate production environment that can process the telemetry in near-real-time would be necessary.

III. Model operationalization

Work in progress. Refer to AI Development Using Data Science VMs; in particular, the section on model deployment and management.

IV. Productionalization

Ingress and feature engineering

The online scoring system represents the production side of the AI system: it is the subset of the architecture that processes incoming data – by applying the deployed model – to generate a value that will potentially result in a desired business event. For example, in the simplest of terms, a trained model may be deployed into production that evaluates several variables and computes a value with a logical ‘range’ that indicates normal or safe behavior. If the computed value falls outside of that range, an event may be generated that results in the appropriate business action, which could be performed by a machine or a human being. As stated on the previous slide, this distinct environment for scoring is only required if the operational cadence for evaluation is indeed far more real-time than the (normally batch) training cadence.

Like the training system, the size and form of this production environment is highly dependent upon the business problem being solved, as well as the number and geographic distribution of the devices being monitored. Unlike the training system, however, there isn’t a consistent correlation between the size and scope of the training challenge and the size and scope of the production challenge. More specifically, while it may be possible to train the model in a single, scaled DSVM, if the device telemetry volume and velocity are high, the production AI system may need to be scaled out across many machines, and potentially multiple geographies. Conversely, it is possible – though less common – to have an extremely complex training model that requires a demanding (scaled) system to train (on Spark, for example), yet can be deployed into production on a small system that evaluates the results in near-real-time with very few resources.

When the model must be deployed into production for online scoring, the source of the algorithm used will have a profound effect on how that model must be operationalized to scale the deployment. For example, if a model created in Python utilizes an algorithm in Spark MLLib, as in the previous example, then operationalizing the model may be dependent upon scaling out Spark cluster(s) in one or more regions. Various Spark options would then be considered for doing so. However, if the model was written from scratch in Python – or created using various libraries that can be readily installed in a data science environment such as Azure Machine Learning Workbench – then operationalizing the model may be performed by ‘packaging’ all of the relevant code and/or libraries and the model file (created in the training step), creating a Docker container, and then deploying that Docker container to a cloud service, such as Azure Container Service (AKS, which is based on Kubernetes). This packaging and deployment to a hosted container service would allow a far simpler operationalization mechanism than the most commonly used approach today, which is re-development of the ML model into the technologies used in the operational environment.
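
For intuition, the sketch below shows the kind of scoring web service that ends up inside such a Docker image. The Azure ML operationalization services generate an equivalent wrapper for you, so this hand-rolled Flask version is only a generic illustration; the model file name, feature order, and route are assumptions.

```python
# Generic sketch of a scoring web service packaged into a Docker image.
# The actual AML operationalization services generate an equivalent wrapper;
# model file, feature order, and route here are placeholders.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("pdm_model.pkl")  # model file baked into the image at build time
FEATURES = ["temperature", "vibration", "pressure", "age_in_days"]

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()                      # e.g. {"temperature": 81.2, ...}
    row = [[payload[name] for name in FEATURES]]
    probability = float(model.predict_proba(row)[0][1])
    return jsonify({"failure_probability": probability})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5001)
```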

This last concept deserves repeating: historically, the specialized technologies used to create a machine learning model were not the same technologies used in production in common line-of-business systems. For example, various R packages from CRAN could be used to create a high-quality AI model, but if the operational environment of the customer is not based on R, the resulting model would have to be re-developed to run in Java or .NET, which are far more commonly used in production enterprise environments.

This is still a huge challenge in our industry, and the cost and complexity of operationalizing AI models at scale remains a major impediment to capitalizing on the opportunities of AI today. It is also one of the tough problems that the new AML Workbench is attempting to solve, by providing an automated way to package and deploy models at scale. AML Workbench main site: https://azure.microsoft.com/en-us/services/machine-learning-services/ AI developer training on AML Workbench (2 days): https://azure.github.io/LearnAI-Bootcamp/proaidev_bootcamp

Considerations for Predictive Maintenance

Per previous comments, most predictive analytics questions can be answered by scoring the incoming data in batch, in which case the same infrastructure and tools used to train the models can be used to make predictions. However, if a separate (normally much larger, or more distributed, or both) environment is required to evaluate production data frequently, there are many options for doing so. In the AI Solution Template for Predictive Maintenance, the option used is the Azure model management and operationalization services, directly accessible via the Azure CLI, plus (one or more) Docker containers that host and score the model at runtime. This allows an effective and simple ‘compartmentalization’ of the deployable model, by creating a specific image with the libraries and deployed model required to score incoming data at runtime. The image can then be deployed to one or many containers, depending on the level of scale and distribution required. For example, the model could be deployed to a single container – running locally, remotely, or on the other side of the world – and run on that single host. Alternatively, it could be deployed to a single, massively scaled Kubernetes cluster running in Azure AKS, or completely distributed, running in hundreds of locations on IoT Edge devices so that runtime data can be scored close to where it originates. No matter where it runs – in one or many locations, at small or massive scale – packaging the deployable model into a portable Docker container provides deployment flexibility that significantly simplifies the operationalization challenge facing companies doing AI today.

In the AI solution template, let’s consider how all the pieces fit together. After the model is trained – whether in a single DSVM or a massive Spark cluster – that model is stored in an Azure Machine Learning Model Management account, and it is deployed (by packaging it and creating a Docker container) using the Azure Machine Learning operationalization services (also referred to as O16N). (If you have been working with the pre-release version of AML Workbench, these model management and operationalization services are the same ones used within the tool, and they are completely accessible outside of it using the Azure CLI.) This represents a simple and effective way to deploy the resulting model to containers wherever you need them, where it will be exposed at runtime as a web service. An explanation of Azure Machine Learning Model Management is here: https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/model-management-overview, and there are a number of great links regarding O16N on the blog here: https://blogs.msdn.microsoft.com/mlserver/tag/o16n/. In the PdM AI template, calls to those services are made through the CLI (instead of going through the evolving AML Workbench UI).

Now that the model has been trained and deployed, it must be called at runtime. In the diagram, telemetry data comes into the Blob store, which is the destination for all the data, both operational and non-operational. At runtime, the data needs to be pushed to Azure Service Bus, which provides a reliable messaging queue for calling the scoring model in FIFO order. For simplicity, in the PdM template the IoT Hub therefore forwards the data not only to the Blob store (for the cold path), but also to Service Bus (for hot-path processing). The messages placed on Service Bus then call the web service exposed on the Docker container, and the resulting scores are written to Azure Tables, which provides the store for visualizing the results through an appropriate tool such as Power BI. In this architecture Service Bus is a practical choice for a few reasons: it is a simple and elegant way to reliably call the web service in order, and it provides an effective buffering mechanism if the incoming telemetry fluctuates considerably, or if the production container environment requires increased scaling to process increasing traffic. Whether the fluctuation is expected or not, Service Bus provides the buffer to ensure the messages get delivered and eventually scored.
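
A rough sketch of that hot-path glue – receive a telemetry message from Service Bus, call the scoring web service, and write the score to Azure Tables – is shown below. It uses current Python SDKs (azure-servicebus, azure-data-tables) rather than the template’s exact wiring; all names are placeholders and the target table is assumed to already exist.

```python
# Rough sketch of the hot path: pull telemetry from a Service Bus queue, call the
# scoring web service, and write the score to Azure Table storage. Uses current
# Python SDKs rather than the template's exact wiring; names are placeholders.
import json
import uuid

import requests
from azure.data.tables import TableClient
from azure.servicebus import ServiceBusClient

SERVICEBUS_CONN = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy>;SharedAccessKey=<key>"
TABLE_CONN = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
SCORING_URL = "http://<scoring-host>:5001/score"

table = TableClient.from_connection_string(TABLE_CONN, table_name="pdmscores")  # table assumed to exist

with ServiceBusClient.from_connection_string(SERVICEBUS_CONN) as sb_client:
    with sb_client.get_queue_receiver(queue_name="telemetry") as receiver:
        for msg in receiver:
            telemetry = json.loads(str(msg))                  # message body is a JSON telemetry record
            score = requests.post(SCORING_URL, json=telemetry).json()
            table.create_entity({
                "PartitionKey": telemetry["deviceId"],
                "RowKey": str(uuid.uuid4()),
                "failure_probability": score["failure_probability"],
            })
            receiver.complete_message(msg)                    # remove the message from the queue
```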

It deserves mention that in production this architecture (rather simplistically) pushes all incoming messages to both the cold and hot paths at runtime. In practice, there would normally be a streaming service that reduces the number of messages that call the web service on the production containers. In other words, the data coming in through IoT Hub would certainly all be sent to the Blob store, to ensure cold-path persistence. However, rather than being forwarded directly to Service Bus, the same data would be forwarded to a streaming service such as Azure Stream Analytics. This would allow simple SQL queries to be run against the messages in real-time, to aggregate message data using a specific windowing technique (such as tumbling, sliding, or hopping windows). As a result of executing those ASA queries against the incoming data, fewer calls would be issued against the production web service, allowing the volume of incoming data to be effectively ‘filtered’ in a way that reduces traffic and therefore ‘noise’ in the output. If true near-real-time results are needed in production, including the streaming service in the hot path would make sense, but for the purposes of a pilot project, the Service Bus technique was used to simplify setup and execution of the (smaller volume of) test data.

A great summary of how different ASA windowing strategies would support business scenarios is here: https://msdn.microsoft.com/en-us/azure/stream-analytics/reference/windowing-azure-stream-analytics
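
The ASA query itself would be written in Stream Analytics’ SQL dialect; to keep the examples here in Python, the sketch below expresses the same tumbling-window idea with PySpark Structured Streaming, averaging each device’s readings over five-minute windows so that far fewer records reach the scoring service. It is purely illustrative of the windowing concept, not part of the template.

```python
# Illustrative only: a tumbling-window aggregation expressed in PySpark Structured
# Streaming, analogous to what an ASA query with a TumblingWindow would do.
# Paths, schema, and window sizes are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, window

spark = SparkSession.builder.appName("hot-path-windowing").getOrCreate()

telemetry = (
    spark.readStream
    .format("json")
    .schema("deviceId STRING, temperature DOUBLE, vibration DOUBLE, timestamp TIMESTAMP")
    .load("/streaming/telemetry/")
)

aggregated = (
    telemetry
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"), col("deviceId"))
    .agg(avg("temperature").alias("avg_temperature"),
         avg("vibration").alias("avg_vibration"))
)

query = (
    aggregated.writeStream
    .outputMode("append")
    .format("json")
    .option("path", "/streaming/aggregated/")
    .option("checkpointLocation", "/streaming/checkpoints/")
    .start()
)
```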

Scoring and visualization

Data visualization enables both reactive and proactive activities. Great visualizations allow decision makers to see data presented graphically, so they can grasp difficult concepts or identify new patterns. Visualizing data in a consistent, well-understood way can increase company-wide understanding of business issues and help create agreed action plans for dealing with common situations.

While visualizing the output of almost any business process has many benefits, the tool that is used is equally important, so this choice should be closely linked to the customer challenge. There is a logical continuum from real-time (sub-second to second), to near-real-time (seconds to minutes), all the way through to truly batch reporting and analysis done on an hourly, daily, or weekly cadence. The audience should also be taken into consideration: are you visualizing real-time data for a group of individuals who need to act upon triggers, or weekly batch data influencing marketing and sales plans?

The visualization dashboard will normally be built on one of two platforms, depending mainly upon how quickly the customer must react to the incoming data: Power BI or Time Series Insights.

Power BI allows out-of-the-box or custom visualizations to be used to present data in an incredible variety of ways. However, its architecture was not designed to update that information in real-time, and it is therefore not usually the best output mechanism for scenarios that involve monitoring.

Power BI offers a set of capabilities that are uniquely enabled by its global and cloud nature:

  • Harness data from Excel spreadsheets, on-premises data sources (through the data gateway), big data, streaming data, and cloud services – no matter what type of data you want or where it lives, Power BI allows you to connect to hundreds of data sources.

  • Out-of-the-box SaaS content packs deliver a curated experience with pre-built dashboards.

  • Power BI has unique ways for users to experience their data with speed and agility: 1) live dashboards, 2) natural language query, and 3) custom visuals. https://powerbi.microsoft.com/en-us/

Conversely, Time Series Insights is a special-purpose tool designed specifically for monitoring scenarios: it ingests data at massive scale and reflects the incoming telemetry in a variety of ways in near-real-time (normally seconds). Further, queries can be executed to identify events that match logical rules, allowing an organization to explore and analyze potentially billions of events as they stream in. It is also a fully managed service, allowing it to scale in conjunction with other services such as IoT Hub and Azure Event Hubs, both of which can feed telemetry data into the system. Time Series Insights is the right tool in nearly all IoT scenarios, where IoT Hub can act as the ingestion hub for the telemetry data from distributed devices, Time Series Insights can be used to visualize that telemetry, and control and configuration messages can flow back to devices, also through IoT Hub. Time Series Insights product site: https://azure.microsoft.com/en-us/services/time-series-insights/ Time Series Insights product announcement: https://azure.microsoft.com/en-us/blog/announcing-azure-time-series-insights/ Time Series Insights simple explanation: https://docs.microsoft.com/en-us/azure/time-series-insights/time-series-insights-overview

Ultimately, depending upon the business scenario, the customer may wish to use both Power BI and Time Series Insights in a complementary fashion, in order to have the best of both worlds: real-time visualization of device telemetry and shared dashboards that summarize and communicate data over a longer time horizon.

Considerations for Predictive Maintenance

In most predictive maintenance business scenarios, Power BI is going to be the most appropriate and effective data visualization technology for the information that you wish to present. Per the ‘AI Training System’ comments, most predictions are going to be generated in batch, in which case the relevance of an environment like Time Series Insights is minimal. However, even if device data is evaluated in near-real-time, and separate infrastructure is employed to do scoring nearly constantly in production, in most business scenarios Power BI can provide the level of responsiveness required. TSI would appropriately be used – normally in conjunction with Power BI rather than entirely in place of it – if the organization needed to see second-by-second status of devices where time to detect and respond is critical.

In the AI Solution Template for Predictive Maintenance, the time horizon for action is consistent with the use case for Power BI, allowing the predictions generated by the models and written to Azure Tables to be visualized there.

Work with the customer to define:

  • Device and business data that is appropriate for display and aggregation

  • Dashboard options that best showcase events and actions

  • Opportunities for business process improvements

  • Business rule definitions for role-based behaviors