Efficient Storage of High Frequency Data in OSIsoft PI

High Frequency Data

Processes in industrial operations often occur at different time scales: some are fast (sub-seconds to hours), others slow (hours, days, weeks, or months). In a biotechnology facility, for example, there are slow-moving batch processes, fast purification steps, and very fast filling lines. Capturing and analyzing events at these different time scales requires a data strategy for acquisition, storage, and analysis.

To optimize storage space and network bandwidth, the OSIsoft PI system differentiates between high frequency data, also known as snapshot values, and compressed or archived data. Data are archived from the snapshot table by applying a swinging door compression algorithm. This strategy has proven to be a great balance between displaying real-time data in high resolution and storing sufficient data for historical analysis.

The drawback of this approach is that the snapshot queue contains only a single value for each process variable, so analysis based on snapshot or event driven data is limited to single points. There are still valuable use cases such as statistical process control, alarm management, or event triggers. However, Machine Learning (ML) and multivariate (MVA) models are usually based on time series vectors.

To accommodate advanced modeling of high frequency data, the OSIsoft PI system requires an expansion of the snapshot table into a low latency time series storage:

High Frequency Data

The requirements for the Snapshot Db are primarily driven by read speed as well as write speed. Open-source time series databases such as QuestDB, which allows a million writes per second, are now available. The read speeds are even more impressive: we measured ~800K reads/sec for a standard OSIsoft PI system, whereas a low latency TSDB is faster by a factor of 800 - 1,000 (see demo: QuestDB · Console).
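As a minimal sketch of how snapshot values could be streamed into such a TSDB (the table name, tag, and values below are hypothetical), QuestDB accepts the InfluxDB line protocol over TCP, which can be driven from Python with nothing more than a socket:

import socket
import time

# hypothetical sketch: stream high frequency snapshot values to QuestDB using the
# InfluxDB line protocol (ILP) over TCP; 9009 is QuestDB's default ILP port
HOST, PORT = "localhost", 9009

def to_ilp(table, tag, value, ts_ns):
    # ILP format: <table>,<tag_key>=<tag_value> <field>=<value> <timestamp in ns>
    return f"{table},point={tag} value={value} {ts_ns}\n"

with socket.create_connection((HOST, PORT)) as sock:
    for i in range(1000):                                   # e.g. 1,000 snapshot events
        line = to_ilp("hf_snapshots", "Bio_Reactor_1.Temperature", 37.0 + 0.001 * i, time.time_ns())
        sock.sendall(line.encode("utf-8"))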

An additional benefit of using an open-source TSDB is that it allows us to add open-source ML and MVA libraries and to take advantage of the very rich open-source visualization ecosystem. For example, the following shows a Grafana dashboard of the Snapshot Db:

Summary

The OSIsoft PI system has been designed to capture real-time events in a snapshot table and store compressed data in the PI Data Archive. This data architecture is optimized for short-term event data and long-term data storage. Missing in this scenario are capabilities to store and analyze high frequency data, which modern low latency time series databases can provide. By adding a dedicated high frequency data store, fast processes can be monitored and analyzed in parallel to an already existing data infrastructure. This opens a large range of new use cases that are difficult or impossible to realize with existing systems.

For information, please contact us.

How to improve your Transition Analysis using TQS Pandas PiFrames in Python

The TQS Pandas PiFrames for OSIsoft® PI System® library has been designed to accelerate multivariate analytics (MVA) and machine learning (ML) for the OSIsoft PI system. The difference from the existing PI Analysis calculation engine is that TQS Pandas PiFrames is designed for vector or matrix operations instead of single-value operations.

The TQS Pandas PiFrames for OSIsoft® PI System® library makes it very easy to work with structured and contextualized data in Python. Time segments can be defined as Event Frames (OSIsoft EF) and retrieved together with sensor data as structured Pandas data frames. This allows both simple and very complex analytics of one-dimensional or multi-dimensional data.

One use case in biotechnology is transition analysis (TA) on chromatography columns. Chromatography is used to purify the product, and the performance of the chromatography column is key to achieving good product quality. There are several metrics that can be calculated to monitor the column's performance; the following lists a few:

The calculations are based on the transition peak, which mathematically is a probability density function (pdf). The peak is calculated from the raw sensor data – the transition or cumulative distribution function (cdf) – by numerical differentiation. Often the curves are normalized by the flow rate to account for differences in total volume. The following shows an example of the transition (cdf) and its derivative (pdf):

Transition Analysis

The transition peak of the pdf is used to calculate, for example, the peak asymmetry using the following formula:

Asymmetry = b/a

where a and b are the distances from the peak maximum (black line) to the points at 10% of the peak height (blue line) on either side of the maximum. Though the calculation is simple, the major problem is the numerical differentiation of noisy sensor data. This step introduces so much additional noise that the peak shape is hard to analyze. Therefore, the analysis includes data smoothing steps such as the LOWESS filter to reduce the noise level in the raw data, and upsampling to increase the resolution.
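The following is a minimal sketch of this derivative-based approach (the signal, filter settings, and thresholds are illustrative, not the exact TQS implementation):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# t: time (or volume) axis, cdf: noisy, flow normalized transition signal (simulated here)
t = np.linspace(0, 10, 200)
cdf = 1 / (1 + np.exp(-(t - 5))) + np.random.normal(0, 0.01, t.size)

# 1) smooth the raw transition with a LOWESS filter to suppress sensor noise
smoothed = lowess(cdf, t, frac=0.1, return_sorted=False)

# 2) upsample to increase the resolution before differentiation
t_fine = np.linspace(t.min(), t.max(), 2000)
cdf_fine = np.interp(t_fine, t, smoothed)

# 3) numerical derivative of the transition (cdf) gives the peak (pdf)
pdf = np.gradient(cdf_fine, t_fine)

# peak asymmetry b/a from the 10% peak height crossings around the maximum
peak = pdf.argmax()
level = 0.1 * pdf[peak]
a = t_fine[peak] - t_fine[:peak][pdf[:peak] >= level][0]
b = t_fine[peak:][pdf[peak:] >= level][-1] - t_fine[peak]
asymmetry = b / a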

To evaluate how accurate and precise this analysis is, it was performed on simulated data with noise levels from 0 to 2.5%.

The results show that this calculation has significant variation even at low noise levels. There are also differences in accuracy, which are introduced by the filtering step. Depending on the sensor data quality, this approach might not be sensitive enough to pick up small changes in the column's performance.

To improve the results, the same test was performed by fitting an exponentially modified Gaussian directly to the transition curve.

The fitting routine led to much better accuracy and precision. This is mainly because the transition curve does not have to be modified, so no additional noise or peak distortion is introduced.
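A minimal sketch of such a direct fit, assuming SciPy is available (scipy.stats.exponnorm implements the exponentially modified Gaussian; the data and starting values are simulated for illustration):

import numpy as np
from scipy.stats import exponnorm
from scipy.optimize import curve_fit

def emg_cdf(t, K, loc, scale, amplitude):
    # cumulative distribution of an exponentially modified Gaussian,
    # scaled by an amplitude to match the flow normalized transition curve
    return amplitude * exponnorm.cdf(t, K, loc=loc, scale=scale)

# t, transition: measured time axis and raw (unsmoothed) transition signal
t = np.linspace(0, 10, 200)
transition = emg_cdf(t, 1.5, 4.0, 0.5, 1.0) + np.random.normal(0, 0.01, t.size)

# fit the EMG cdf directly to the raw transition curve - no differentiation needed
p0 = [1.0, float(t[transition > 0.5 * transition.max()][0]), 1.0, float(transition.max())]
params, _ = curve_fit(emg_cdf, t, transition, p0=p0)

Peak metrics such as the asymmetry can then be derived from the fitted parameters rather than from a noisy numerical derivative.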

Summary:

Transition analysis in biotech production is a great approach to monitor column performance during chromatography steps. There are many simple metrics available as key performance indicators (KPIs), but they mostly operate on the derived signal, which introduces noise and distortion into the calculation.

Using the raw transition signal and fitting a distribution function is a much better approach. Though this makes the analysis more complex and increases the latency, it achieves much higher precision and accuracy in the results.

For information, please contact us.

How to build Machine Learning with OSI Pi and Python


Python-based machine learning (ML) libraries have evolved at an unbelievable pace. It is most impressive that time-consuming steps such as data encoding, feature selection, model comparison, and even model optimization have been fully automated. For example, the relatively new Python library PyCaret calculates the metrics of over 21 different regression models and selects the best one with just a few lines of code. Machine learning with OSI PI has come a long way.
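For illustration, a minimal PyCaret regression workflow looks roughly like the following (the data file and target column are placeholders):

import pandas as pd
from pycaret.regression import setup, compare_models

# historical process data as a Pandas data frame; 'quality' is the target column (placeholder names)
data = pd.read_csv("batch_history.csv")

# setup automates encoding, train/test split and cross validation
setup(data=data, target="quality", session_id=42)

# train and rank the available regression models, return the best one by cross validated metrics
best_model = compare_models()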

There are plenty of industrial applications, where these algorithms could be successfully applied. But there are two major bottlenecks for successful projects:

  1. Historical data collection for the model development
  2. Real-time data collection for the model integration

Model development data can be downloaded as Excel or text\csv files and analyzed offline. The drawback is that this approach cannot be productized and is limited to off-line applications.

To accelerate the model development and model integration (MD\MI pipelines) for the OSIsoft PI System, TQS has developed a Python library called TQS Pandas PiFrames for OSIsoft® PI System® that connects to the PI System and provides PI data as Pandas data frames. The Pandas data frame is the preferred data structure in Python for data scientists and is supported by many ML libraries. Therefore, the TQS Pandas PiFrames for OSIsoft® PI System® can be easily integrated into ML projects in both model development and model integration.

The following shows some code examples in Python.

  1. Connecting to the PI Asset Framework (AF) and the PI Data Archive:


cdf = ConnectToDefaultAF()   # connect to the default PI Asset Framework (AF) server
cdf = ConnectToDefaultPI()   # connect to the default PI Data Archive


# retrieve three attributes of element "Bio Reactor 1" for the last two hours ('t-2h' to 't')
df = GetMultipleAttributeValuesByVariable("Bio Reactor 1",["Temperature","Concentration","Level"],'t-2h','t',60,0,None)

The resulting data frame is a time series:


The data frame can also be arranged by variable columns:

# the same attributes retrieved per Event Frame ("Batch_0_*") over the last seven days
df = GetMultipleAttributeValuesByFrame("Batch_0_*","Bio Reactor 1",["Temperature","Concentration","Level"],'t-7d','t',60,0,None)

During the last couple of months, we have developed use cases around the OSIsoft PI system that are based on the TQS Pandas PiFrames for OSIsoft® PI System® library:

The library has been shown to significantly reduce model development and model integration time.

SUMMARY

Machine Learning and AI projects are often slow to develop and difficult to integrate. The main reason is that most Python libraries expect Pandas data frames (or Numpy arrays), and these data structures are not readily available in industrial automation. TQS Integration has developed the TQS Pandas PiFrames for OSIsoft® PI System® libraries to accelerate both model development and model integration. The library is user friendly, fast, and scales well for all common machine learning (ML) applications.

For information, please contact us.

How To Measure Data Latency in OSIsoft PI Using PowerShell

Data Latency

The topic of system latency has come up a couple of times in recent projects. If you really think about it, this is not surprising. As more of manufacturing gets integrated, data must be synchronized and/or orchestrated between different applications. Here are just some examples:

  1. MES: Manufacturing execution systems typically connect to a variety of data sources, so the workflow developer needs to know the timeout settings of the different applications. Connections to the automation system will have a very low latency, but what is the expected data latency of the historian?
  2. Analysis: More and more companies move towards real-time analytics. But just how fast can you really expect calculations to be updated? This is especially true for enterprise level systems, which are typically cloned from source OSIsoft PI servers by way of PI-to-PI. So you are looking at a data flow such as:

    Source -> PI Data Archive (local) -> PI-to-PI -> PI Data Archive (region) -> PI-to-PI -> PI Data Archive (enterprise), with latency added at each step.
  3. Reports: One example is product release reports. How long do you need to wait to make sure that all data have been collected?

The OSIsoft PI time series object provides a time stamp which is typically provided by the source system. This time stamp will bubble up through interfaces and data archives unchanged. This makes sense when you compare historical data, but it will mask the latency in your data.

To detect when a data point gets queued and recorded at the data server, PI offers two event queues that can be monitored:

AFDataPipeType.Snapshot ... to monitor the snapshot queue

AFDataPipeType.Archive ... to monitor the archive queue

You can use PowerShell scripts, which have the advantage of being a lightweight approach that can be combined with the existing OSIsoft PowerShell library. PowerShell is also available on most servers, so you don't need a separate development environment for code changes.

The first step is to connect to the OSIsoft PI Server using the AFSDK:

function Connect-PIServer{
[OutputType('OSIsoft.AF.PI.PIServer')]
param ([string] [Parameter(Mandatory=$true, Position=0, ValueFromPipeline=$true,
ValueFromPipelineByPropertyName=$true)] $PIServerName)
$Library=$env:PIHOME+"\AF\PublicAssemblies\OSIsoft.AFSDK.dll"
Add-Type -Path $Library
$PIServer=[OSIsoft.AF.PI.PIServer]::FindPIServer($PIServerName)
$PIServer.Connect()
Write-Output($PIServer)
}

The function opens a connection to the server and returns the .NET object.

The functions that monitor both queues and write out the values look like the following:

function Get-PointReference{
param ([PSTypeName('OSIsoft.AF.PI.PIServer')] [Parameter(Mandatory=$true,
Position=0, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)] $PIServer,
[string] [Parameter(Mandatory=$true, Position=1, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
$PIPointName)
$PIPoint=[OSIsoft.AF.PI.PIPoint]::FindPIPoint($PIServer,$PIPointName)
Write-Output($PIPoint)
}

function Get-QueueValues{
param ( [PSTypeName('OSIsoft.AF.PI.PIPoint')] [Parameter(Mandatory=$true,
Position=0, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)] $PIPoint,
[double] [Parameter(Mandatory=$true, Position=1, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)] $DurationInSeconds )
# get the pi point and create a .NET list
$PIPointList = New-Object System.Collections.Generic.List[OSIsoft.AF.PI.PIPoint]
$PIPointList.Add($PIPoint)
# create the pipeline
$ArchivePipeline=[OSIsoft.AF.PI.PIDataPipe]::new( [OSIsoft.AF.Data.AFDataPipeType]::Archive)
$SnapShotPipeline=[OSIsoft.AF.PI.PIDataPipe]::new( [OSIsoft.AF.Data.AFDataPipeType]::Snapshot)
# add signups
$ArchivePipeline.AddSignups($PIPointList)
$SnapShotPipeline.AddSignups($PIPointList)
# now the polling
$EndTime=(Get-Date).AddSeconds($DurationInSeconds)
While((Get-Date) -lt $EndTime){
$ArchiveEvents = $ArchivePipeline.GetUpdateEvents(1000);
$SnapShotEvents = $SnapShotPipeline.GetUpdateEvents(1000);
$RecordedTime=(Get-Date)
# format output:
foreach($ArchiveEvent in $ArchiveEvents){
$AFEvent = New-Object PSObject -Property @{
Name = $ArchiveEvent.Value.PIPoint.Name
Type = "ArchiveEvent"
Action = $ArchiveEvent.Action
TimeStamp = $ArchiveEvent.Value.Timestamp.LocalTime.ToString("yyyy-MM-dd HH:mm:ss.fff")
QueueTime = $RecordedTime.ToString("yyyy-MM-dd HH:mm:ss.fff")
Value = $ArchiveEvent.Value.Value.ToString()
}
$AFEvent.pstypenames.Add('My.DataQueueItem')
Write-Output($AFEvent)
}
foreach($SnapShotEvent in $SnapShotEvents){
$AFEvent = New-Object PSObject -Property @{
Name = $SnapShotEvent.Value.PIPoint.Name
Type = "SnapShotEvent"
Action = $SnapShotEvent.Action
TimeStamp = $SnapShotEvent.Value.Timestamp.LocalTime.ToString("yyyy-MM-dd HH:mm:ss.fff")
QueueTime = $RecordedTime.ToString("yyyy-MM-dd HH:mm:ss.fff")
Value = $SnapShotEvent.Value.Value.ToString()
}
$AFEvent.pstypenames.Add('My.DataQueueItem')
Write-Output($AFEvent)
}
# 150 ms delay
Start-Sleep -m 150
}
$ArchivePipeline.Dispose()
$SnapShotPipeline.Dispose()
}

These two scripts are all you need to monitor events coming into a single server. The data latency is simply the difference between the value's time stamp and the recorded time.

Measuring the data latency between two servers - for example a local and an enterprise server - can be done the same way. You just need two server objects and then monitor the snapshot (or archive) events.

function Get-Server2ServerLatency{
param ( [PSTypeName('OSIsoft.AF.PI.PIPoint')] [Parameter(Mandatory=$true, Position=0,
ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)] $SourcePoint,
[PSTypeName('OSIsoft.AF.PI.PIPoint')] [Parameter(Mandatory=$true, Position=1,
ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)] $TargetPoint,
[double] [Parameter(Mandatory=$true, Position=2, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)] $DurationInSeconds )
$SourceList = New-Object System.Collections.Generic.List[OSIsoft.AF.PI.PIPoint]
$SourceList.Add($SourcePoint)
$TargetList = New-Object System.Collections.Generic.List[OSIsoft.AF.PI.PIPoint]
$TargetList.Add($TargetPoint)
# create the pipeline
$SourcePipeline=[OSIsoft.AF.PI.PIDataPipe]::new( [OSIsoft.AF.Data.AFDataPipeType]::Snapshot)
$TargetPipeline=[OSIsoft.AF.PI.PIDataPipe]::new( [OSIsoft.AF.Data.AFDataPipeType]::Snapshot)
# add signups
$SourcePipeline.AddSignups($SourceList)
$TargetPipeline.AddSignups($TargetList)
# now the polling
$EndTime=(Get-Date).AddSeconds($DurationInSeconds)
While((Get-Date) -lt $EndTime){
$SourceEvents = $SourcePipeline.GetUpdateEvents(1000);
$TargetEvents = $TargetPipeline.GetUpdateEvents(1000);
$RecordedTime=(Get-Date)
# format output:
foreach($SourceEvent in $SourceEvents){
$AFEvent = New-Object PSObject -Property @{
Name = $SourceEvent.Value.PIPoint.Name
Type = "SourceEvent"
Action = $SourceEvent.Action
TimeStamp = $SourceEvent.Value.Timestamp.LocalTime.ToString("yyyy-MM-dd HH:mm:ss.fff")
QueueTime = $RecordedTime.ToString("yyyy-MM-dd HH:mm:ss.fff")
Value = $SourceEvent.Value.Value.ToString()
}
$AFEvent.pstypenames.Add('My.DataQueueItem')
Write-Output($AFEvent)
}
foreach($TargetEvent in $TargetEvents){
$AFEvent = New-Object PSObject -Property @{
Name = $TargetEvent.Value.PIPoint.Name
Type = "TargetEvent"
Action = $TargetEvent.Action
TimeStamp = $TargetEvent.Value.Timestamp.LocalTime.ToString("yyyy-MM-dd HH:mm:ss.fff")
QueueTime = $RecordedTime.ToString("yyyy-MM-dd HH:mm:ss.fff")
Value = $TargetEvent.Value.Value.ToString()
}
$AFEvent.pstypenames.Add('My.DataQueueItem')
Write-Output($AFEvent)
}
# 150 ms delay
Start-Sleep -m 150
}
$SourcePipeline.Dispose()
$TargetPipeline.Dispose()
}

Here is a quick test of a PI-to-PI interface reading from and writing to the same server:

$SourcePoint = Get-PointReference $srv sinusoid
$TargetPoint = Get-PointReference $srv sinusclone
Get-Server2ServerLatency $SourcePoint $TargetPoint 30

As you can see, the difference between target and source is a bit over 1 second, which is to be expected since the scan rate is 1 second.

SUMMARY

Data latency is a key metric for every system that captures, stores, analyses, or processes data. Every sequential operation adds to the overall system latency and must be accounted for. Data transport over networks is not the only major contributor; the data queues that package data into messages also add significant delays. This topic is especially important for cloud-based systems that rely on on-premises sensor data.

As shown in this blog, data latency can and should be measured and be part of the architectural planning process. As a rule of thumb, sub-second data latencies are challenging, especially when the number of data sources increases.

Please contact us for more information.

Does MQTT Unified Namespace solve all your data integration issues?

It seems many companies think that Message Queuing Telemetry Transport (MQTT for short) can solve all their data integration issues, and there has been a lot of industry chat about this topic. But can it really do that?

MQTT has been successfully used to communicate data for over 20 years. It is lightweight by design and has fared well when benchmarked against other competing protocols (e.g. OPC-UA). The central component of the MQTT architecture is the message broker, which allows devices to subscribe to or publish data to a central repository. This architecture makes MQTT very attractive as a central data exchange in manufacturing to integrate different components such as the automation layer, historians, MES, ERP, and others.

The MQTT message has two components:

  1. Topic
  2. Payload

The MQTT topic is used to route messages and allows subscribers to filter them. Routing by topic requires a design that specifies the location of the data source. If the MQTT broker is used on the enterprise level, it is recommended to use the ISA-95 standard to define the MQTT topic, which is often referred to as the Unified Namespace (UNS). The following shows the topic as a unified namespace:

Enterprise A/Site A/Area A/Process Cell A/Bio Reactor 0

MQTT by itself does not specify a message structure, and in many IoT applications the payload is simply a JSON string. The JSON is deserialized by the MQTT subscriber (here Ignition) into a tree structure:

OSIsoft AF provides an extensive class or type system for assets (equipment), frames (batch, alarms, OEE) and transfers (traceability and genealogy), where enterprise level equipment structures are either manually created or autogenerated by interface specific connectors.

Publishing OSIsoft AF data into the MQTT unified namespace simply means serializing the AF structure into the MQTT topic and attaching the attribute values as JSON payloads. Like any other MQTT component, OSIsoft AF can be a subscriber, publisher, or both. As a subscriber, OSIsoft AF can deserialize the MQTT message, and the resulting AF structure can be contextualized by templating (inheritance), categorizing, and referencing.
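As a minimal sketch of the publishing side (the broker address, topic, and values are hypothetical; the paho-mqtt 1.x client constructor is shown), this is all it takes to push a JSON payload into the unified namespace:

import json
import paho.mqtt.client as mqtt

# ISA-95 style topic acting as the unified namespace (hypothetical hierarchy)
topic = "Enterprise A/Site A/Area A/Process Cell A/Bio Reactor 0"

# JSON payload carrying a few attribute values (hypothetical)
payload = json.dumps({"Temperature": 37.1, "Concentration": 12.4, "Level": 80.2,
                      "timestamp": "2023-01-01T00:00:00Z"})

client = mqtt.Client()
client.connect("broker.example.com", 1883)   # default MQTT port
client.publish(topic, payload, qos=1)
client.disconnect()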

MQTT with the unified namespace is a very elegant way of routing structured data, but the JSON payload is not an ideal solution for several reasons:

(1) There is no clear definition of how to structure industrial sensor\time series data.
(2) JSON structures are bulky.
(3) There is no standard way of compressing the payloads.

Is there an alternative?


The SparkplugB standard was developed to address both the data structure and throughput requirements for sending industrial sensor data. There are a couple of key mechanisms to accomplish this:

It has exactly five components:

Structuring the unified namespace requires splitting the asset definition between the topic and the payload, and can lead to a structure identical to the JSON example described above:
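For illustration (the group, node, and metric names below are hypothetical), the SparkplugB topic namespace carries the routing information, while the lower levels of the asset path move into the metric names of the payload:

spBv1.0/Enterprise A/DDATA/Site A/Area A
    metric "Process Cell A/Bio Reactor 0/Temperature" = 37.1
    metric "Process Cell A/Bio Reactor 0/Level" = 80.2

Here spBv1.0 is the fixed SparkplugB namespace, followed by the group id, the message type (DDATA), the edge node id, and the device id.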

The SparkplugB payload can also be used to serialize (publish) or deserialize (subscribe) an OSIsoft AF structure. To subscribe to MQTT SparkplugB, OSIsoft offers a compliant MQTT connector.

Summary

The MQTT SparkplugB standard together with the unified namespace concept is an efficient way to exchange sensor or other time series data. There are, however, a few limitations that need to be considered:

Despite the above limitations, an edge node can readily publish an ISA-95 compliant OSIsoft AF asset structure into the unified namespace. The time-consuming task of mapping OPC or PLC tags into human readable asset paths was already completed when the OSIsoft AF system was set up and configured. An additional benefit is that equipment centric calculations can be streamed to the MQTT broker and consumed by MES or ERP systems.

When MQTT SparkplugB data are consumed by the OSIsoft AF system, the equipment centric data stream can be deserialized into an asset structure.

This can lead to significant savings in the integration task, which normally requires the tedious step of mapping the automation layer into the IT layer.

There is an added effort to contextualize the data and add additional abstraction layers such as base classes\inheritance. Frames and transfers must also be configured to allow for time-based modeling (MVA or ML). These steps are essential to roll out MVA\ML models at scale.

MQTT SparkplugB is a real step forward in Level 1, 2, and 3 data integration. Level 3+ systems as well as MVA\ML models require an extensive type system that can't readily be flattened into the MQTT SparkplugB standard.

For now, Level 3+ and MVA\ML systems still require a fair amount of integration and configuration.

For information, please contact us.

Which is better? On-Premise or Cloud Based Industrial Internet of Things Data Flow?


Applications around the Industrial Internet of Things (IIOT) have mushroomed and each one comes with a different set of capabilities and features. So how do you compare different applications or services? And how does the new solution fit into your existing data architecture?

In general, Industrial Internet of Things architectures fall into three categories: (1) on-premises, (2) cloud based, or (3) a hybrid of the two. In the on-premises solution, data never leave the manufacturing network, whereas in the cloud solution all data are sent directly to the cloud. In the hybrid solution, a subset of the data is replicated to the cloud and used for analysis.

Industrial Internet of Things data flow.

Today, many Industrial Internet of Things applications fall into the hybrid category, leading to a scenario where some applications execute on premises and others in the cloud. To choose the right blend of on-premises and cloud functionality, let's consider the following key metrics:

For regulated industries, there is often a requirement that the compressed timeseries is identical between two components.

For a sequential system, the calculation is as follows:

R = R1 × R2 × R3 × ... × Rn = Π Ri

As an example, if a system has four components with a reliability of 95% each, the overall reliability drops to 81.4%.
Making the same system redundant increases the overall system reliability to:

R = 1 - (1 - R1) × (1 - R2) × (1 - R3) × ... × (1 - Rn) = 1 - Π(1 - Ri), or 96.6% using Ri = 81.4%
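The numbers above can be reproduced with a few lines of Python:

import numpy as np

components = [0.95, 0.95, 0.95, 0.95]        # four sequential components at 95% each

# series (sequential) reliability: product of the component reliabilities
r_series = np.prod(components)                # ~0.814, the 81.4% from the text

# redundant pair of the same system
r_redundant = 1 - (1 - r_series) ** 2         # ~0.966

print(r_series, r_redundant)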


HighByte provides in-flight data contextualization at the edge. This opens the door for very flexible and dynamic solutions.

Most of the protocols are equipment centric, missing relational information (one-to-many and many-to-one) and time segmentation. Microsoft's Digital Twins Definition Language (DTDL) is a relatively new approach that has the potential to bridge this gap.

Summary

Industrial Internet of Things applications range from purely on-premises to fully cloud-based solutions. On-premises architectures typically provide higher system reliability and lower latency, while cloud-based solutions offer scalability, flexibility, and a wide range of readily accessible data analytics. As a result, manufacturing IT will most likely use a blend of both, where process level analysis runs on premises and enterprise level analytics in the cloud.

Current connectors do not provide the complete manufacturing process model, industrial strength data compression, and redundancy necessary to seamlessly integrate into existing on-premises data architectures. But this is changing, and new in-flight contextualization approaches are quickly closing the gap, with the goal of better understanding and utilizing Industrial Internet of Things data.

Please contact us for more information.

Industrial Machine Learning: Why you have to do it.

Machine Learning (ML) has seen exponential growth during the last five years, and many analytical platforms have adopted ML technologies to provide packaged solutions to their users. So, why has Machine Learning become mainstream?

Let's take a look at Multivariate Analysis (MVA). While many of its algorithms have been widely available for a long time, MVA is technically still considered a subset of ML algorithms. MVA typically refers to two algorithms:

As such, MVA has become a de facto standard in manufacturing batch processing and other areas. Some typical use cases are:

In principle, industrial datasets are not different from other supervised or unsupervised learning problems, and they can be evaluated using a wide range of algorithms. Multivariate Analysis was preferred because it offers global and local explainability. MVA models are multivariate extensions of the well understood linear regression and provide weights (slopes) for each variable. This enables critical understanding and optimization of the underlying process dynamics, which is a very important aspect in manufacturing.

NEW CHANGES IN INDUSTRIAL MACHINE LEARNING

In the past, many ML algorithms were considered black box models, because the inner mechanics of the model were not transparent to the user. These model types had limited utility in manufacturing since they could not answer the WHY and therefore lacked credibility.

This has very much changed. Today, model explainers in ML are a very active field of research and excellent libraries have become available to analyze the underlying model mechanics of highly complex architectures.

The following shows an example of applying ML technologies to a typical MVA project. In the original publication (https://journals.sagepub.com/doi/10.1366/0003702021955358), several preprocessing steps were studied together with PLS to build a predictive model. All steps were performed manually using commercial off-the-shelf software.

Using ML pipelines, the same study can be structured as follows:

import numpy as np
import xgboost as xgb
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
# SNV, MSC and SavitzkyGolay are custom spectral preprocessing transformers (not shown here)

score = 'explained_variance'                     # assumed: cross validated explained variance (per text)
kf_10 = KFold(n_splits=10, shuffle=True, random_state=42)

pipeline = Pipeline(steps=[('preprocess', None), ('regression', None)])
preprocessing_options = [{'preprocess': (SNV(),)},
                         {'preprocess': (MSC(),)},
                         {'preprocess': (SavitzkyGolay(9, 2, 1),)},
                         {'preprocess': (make_pipeline(SNV(), SavitzkyGolay(9, 2, 1)),)}]
regression_options = [{'regression': (PLSRegression(),), 'regression__n_components': np.arange(1, 10)},
                      {'regression': (LinearRegression(),)},
                      {'regression': (xgb.XGBRegressor(objective="reg:squarederror", random_state=42),)}]

# every combination of a preprocessing and a regression step becomes one grid entry
param_grid = []
for preprocess in preprocessing_options:
    for regression in regression_options:
        param_grid.append({**preprocess, **regression})
search = GridSearchCV(pipeline, param_grid=param_grid, scoring=score, n_jobs=2, cv=kf_10, refit=False)

This small code example tests every combination of preprocessing and regression steps, then automatically selects the best model. [A combination of SNV (Standard Normal Variate), 1st derivative, and XGBoost showed the highest cross validated explained variance of 0.958.]

The transformed spectra and the model weights can be overlaid to provide insights into the model mechanics:

Conclusion

Multivariate Analysis (MVA) has been successfully applied in manufacturing and is here to stay. But there is no doubt that Machine Learning (ML) data engineering concepts will be widely applied to this domain as well. Pipelines and autotuning libraries will ultimately replace the manual work of data transformation selection, model selection, and hyperparameter tuning. New ML algorithms and Deep Learning models, in combination with local and global explainers, will expand Manufacturing Intelligence and provide key insights into process dynamics.

Special Thanks

Thanks to Dr. Salvador Garcia-Munoz for providing code examples and data sets.

For more information, please contact us.

How to use Transfer Models to achieve End-to-end Product Traceability

Detailed equipment and batch data models set up by pharmaceutical and biotech companies have enabled the creation of equipment centric machine learning (ML) models, for example batch evolution monitoring. The next step is to extend the existing equipment centric models and create process or end-to-end models.

The challenge is that the current data models do not fully support the extension:

·        Equipment models are based on the ISA-95 structure and reflect only the physical layout of the manufacturing facilities.

·        Batch Execution Systems (BES) are integrated using ISA-88 and entail only equipment that is controlled by the batch execution system. Often BES systems are set up to execute single unit procedures and subsequent processing steps are executed separately.

·        Manufacturing Execution Systems (MES) typically map the entire process and material flow, but as Level 3+ systems they are difficult to integrate into a data modelling pipeline.

·        There are also facilities that use paper-based process tracking instead of MES\BES, which makes traceability even more challenging.

Batch-to-Batch traceability can quickly become very complex especially when many different assets are involved. The following shows an example of a reactor train in a biotech facility:

It shows all the different product pathways from reactor '01' to the final processing step; one example is highlighted in red: 01, 11, 22, 33, 44. At any moment in time, the other reactors are either being cleaned or used for a parallel process.

Such a process is difficult to model in a BES or MES system and real time visibility or historical analysis is very challenging. This is especially true if subsequent processing steps are to be included (Chromatography, Fill and Finish, ....)

The missing link for modelling the different pathways is to integrate each transfer between reactors or pieces of equipment. OSIsoft AF offers the AF Transfer model, which is fully integrated into the AF system. An AF Transfer event can be defined with the following out-of-the-box properties:

·        Source Equipment

·        Destination Equipment

·        Start Time

·        End Time

The AF Transfer model has many of the same features that AF Event Frames offer. Transfers can be templated and, through the in- and outflow ports, defined at different granularities.

Once the transfer between equipment has been defined, batches can be traced back in real time with or without using the batch id. This is possible through the equipment and time context of the transfer model:

In this case, starting from the end reactor '44', all previous steps can be retraced by going backwards in time and using the source-destination equipment relationships.

The implementation requires a data reference to configure each transfer. The configuration user interface requires the following attributes:

·        Destination Element: Attribute of the destination Element

·        Name: Name of the transfer

·        Optional: Description, Batch Id and Total

The result is that transfer logs can be matched to the corresponding unit procedures by time and equipment context, as shown below:

As shown in this example, the end time of transfer log 'Transfer Id S7MZUDGK' matches the start time of unit procedure 'Batch Id WNJ6H99R'. The entire pathway can now be reconstructed in one query.
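A minimal sketch of such a backward trace over hypothetical transfer logs (the column names and data are illustrative, not the actual AF query) could look like this in Pandas:

import pandas as pd

# hypothetical transfer logs: one row per transfer between two pieces of equipment
transfers = pd.DataFrame({
    "source":      ["01", "11", "22", "33"],
    "destination": ["11", "22", "33", "44"],
    "start": pd.to_datetime(["2023-01-01 00:00", "2023-01-02 06:00", "2023-01-03 12:00", "2023-01-04 18:00"]),
    "end":   pd.to_datetime(["2023-01-01 04:00", "2023-01-02 10:00", "2023-01-03 16:00", "2023-01-04 22:00"]),
})

def trace_back(transfers, equipment, before):
    # walk backwards in time from 'equipment' using the source-destination relationship
    path = [equipment]
    while True:
        candidates = transfers[(transfers.destination == equipment) & (transfers.end <= before)]
        if candidates.empty:
            return list(reversed(path))
        last = candidates.sort_values("end").iloc[-1]   # most recent transfer into this equipment
        equipment, before = last.source, last.start
        path.append(equipment)

print(trace_back(transfers, "44", pd.Timestamp("2023-01-05")))   # -> ['01', '11', '22', '33', '44']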

Conclusion

The sequence of discrete processing events such as unit procedures can be modelled using the OSIsoft AF Transfer class. The resulting transfer logs allow retracing the process backwards in time by using the source-destination relationship of the transfer model. Modelling the process flow is key to expanding equipment centric ML models.

Please contact us for more information.

Predict Process Conditions with Digital Twin in Manufacturing


Have you ever wondered if it were possible to predict process conditions in manufacturing? Know what is likely to happen before it actually happens in your business processes? Digital Twin might just be your answer.

Benefits:

There are several different definitions of Digital Twins or Clones, and many use them interchangeably with terms such as Industry 4.0 or the Industrial Internet of Things (IIOT). Fundamentally, Digital Twins are digital representations of a physical asset, process, or product, and they behave similarly to the object they represent. The concept of Digital Clones has been around for some time. Earlier models were based on engineering principles and approximations; however, they required very deep domain expertise, were time consuming, and were limited to a few use cases.

Today, Digital Clones are virtual models that are built entirely from massive historical datasets, using Machine Learning (ML) to extract the underlying dynamics. The data driven approach makes Digital Clones accessible for a wide range of applications. Therefore, the potential for Digital Twins is enormous and includes process enhancement\optimization, equipment life cycle management, energy reduction, and safety improvements, just to name a few.

Building digital clones requires:

1.      A large historical data set or data historian

2.      High data quality and sufficient data granularity

3.      Very fast data access

4.      A large GPU for the model development and real time predictions

5.      A supporting data structure to manage the development, deployment, and maintenance of ML models

The following shows the application of a Digital Twin to a batch process example. The model is built with 30 second interpolated data using a window of past data to predict future (5 min) data points:
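A minimal sketch of this windowing idea is shown below (the data, window length, and model are illustrative; a production Digital Twin would typically use a deep learning model):

import numpy as np
from sklearn.linear_model import Ridge

# a process variable interpolated onto a 30 second grid (simulated here)
y = np.sin(np.linspace(0, 20, 2000)) + 0.05 * np.random.randn(2000)

window  = 60    # 60 samples x 30 s = 30 minutes of past data
horizon = 10    # 10 samples x 30 s = 5 minutes into the future

# build (past window) -> (value 5 minutes ahead) training pairs
X = np.array([y[i - window:i] for i in range(window, len(y) - horizon)])
t = np.array([y[i + horizon] for i in range(window, len(y) - horizon)])

model = Ridge().fit(X, t)
prediction = model.predict(y[-window:].reshape(1, -1))   # predict 5 minutes ahead of the latest window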

So, what's all the hype about Digital Clones? Well, not only are they able to predict process conditions, they also provide explanatory power about what drives the process - the underlying dynamics. The following dashboard shows a replay of this analysis, including the estimate of the model weights:

Conclusion

In summary, the availability of enterprise level data historians and deep learning libraries allows Digital Clones to be implemented on the equipment and process level throughout manufacturing. The technology enables a wide range of applications and offers insight into process dynamics that was not previously available, improving data integrity and data access while building trust and data transparency with your partners. This helps to digitalize data management and processes, lowering risk and improving efficient data sharing.

Please contact us for more information.

Improve your Process Monitoring with SEEQ and OSIsoft PI

Multivariate Analysis (MVA) is a well-established technique for analysing highly correlated process variables. It is best known in batch processing but is also successfully applied in discrete or continuous processing. In comparison to single variable applications, for example statistical process control, MVA has been shown to be superior in the detection of process drifts and upsets. In practice, the implementation of MVA requires two different data structures or models:

Event Frames are usually autogenerated from the batch execution system (BES) and reflect the logical\automation sequences of recipe execution. Both AF Elements and Event Frames are used to create MVA models and calculate statistics. Below is an example of a multivariate model that combines the autogenerated Event Frame "Unit Procedure" and the process variables in the Element "Bio Reactor 0":

This type of analysis is typically used for batch-to-batch comparison (T2 and speX statistics) and batch evolution monitoring in the pharmaceutical, biotech, and chemical industries.

Challenge

One of the shortcomings of using automation phases is that they seldom line up with the time frames that are critical for the underlying process evolution (process phases). Often there is a mismatch in granularity: process phases are either longer or shorter in duration than the automation phases. Also, the start and end might be based on specific process conditions, for example temperature, batch maturity, online measurements, and others. The mismatch between automation and process phases causes misalignment in the MVA model and a broadening of the process control envelopes. The resulting models are often not optimal.

Solution

SEEQ has developed a platform that excels at creating time series segments as well as time series data cleansing and conditioning. The platform provides several different approaches to define very precise start and end conditions. The following shows the definition of a new capsule based on a profile search that solely focuses on the process peak temperature:

These capsules can be utilized in other applications through an API and blended with other PI data models to create very precise multivariate models:
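As a minimal sketch (the condition name and time range are hypothetical), capsules defined in SEEQ can be pulled into Python with the Seeq SPy module and then used to slice PI sensor data into precise MVA segments:

import pandas as pd
from seeq import spy

# find the condition that holds the peak temperature capsules (name is hypothetical)
conditions = spy.search({"Name": "Peak Temperature Phase", "Type": "Condition"})

# pull the capsules (start/end of each time segment) for the last 7 days
capsules = spy.pull(conditions, start=pd.Timestamp.now() - pd.Timedelta("7d"),
                    end=pd.Timestamp.now(), shape="capsules")

# each row describes one capsule; its start/end can be used to slice PI data
# (e.g. via TQS Pandas PiFrames) into precise segments for the MVA model
print(capsules.head())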

Benefits

Multivariate Analysis is a powerful method to analyse highly correlated process data. It depends on equipment\process models and time series segments, and OSIsoft PI provides data models for both; typically, time segments are automatically populated from BES or MES systems. SEEQ provides new capabilities to create highly precise time segments called capsules, which refine the MVA analysis and create meaningful process envelopes. The integration is seamless since both systems provide powerful APIs to their time series data and models. The resulting MVA models target specific process phases and can be used to create improved process control limits or regression analyses.

Please contact us for more information.