
Building a Production-Ready Diagram Parsing Pipeline with C3 AI


by Josh Zhang and Amir H. Delgoshaie


For safety-critical systems, predicting when components will fail is a million-dollar question. A key piece of information in predicting failures is the topological structure of the components and the relations of different components with each other. While the information on these connections is available on engineering diagrams, parsing the engineering diagrams and extracting the metadata from the diagrams has long been tedious work that is difficult to automate. In this blog post, we will go through the steps for building a simple diagram parsing model and demonstrate how this model can be made production-ready using the C3 AI Application Platform.

Along the way we will highlight some of the key features and benefits of the platform, namely:

  • A unified data model with simple APIs to access and manage all data.
  • Simple APIs for converting pre-existing machine learning/deep learning models into the C3 AI MLPipe.
  • Seamlessly persisting models in a database and keeping track of all modeling iterations.
  • Training models on the resources of the C3 AI cluster instead of being limited by the size of the Jupyter container.
  • Using the model as an individual step of complex ML pipelines without worrying about managing runtime for each step and passing data between these steps.

Overview of the Problem

Below is an example of a piping & instrumentation diagram (P&ID). P&ID engineering diagrams contain valuable information about the sensor and equipment locations as well as relations among these sensors and equipment. Manual extraction of sensor and equipment locations and relationships from P&ID diagrams is a time-consuming task that relies on domain experts. The goal of diagram parsing is to create an automated pipeline that can identify each component on an engineering diagram, recognize the “id” and “text” related to each symbol, and also identify connections between different components. Finally, using the identified component ids, we can link each component with other associated data sources (like time series data) for additional modeling tasks.


Figure 1: Raw P&ID engineering diagram (left) vs parsed diagram (right)


Figure 2: Example of using a parsed diagram for finding sensor time series related to an asset

In this blog post, we will build a simple diagram parsing pipeline from scratch in a data science notebook, using native Python and the C3 AI Python SDK. Specifically, we will:

  1. Explore the training data set using the C3 AI Python SDK
  2. Build and train an object detection model using Keras
  3. Convert the object detection model to a C3 AI MLPipe and persist the model
  4. Build a C3 AI MLPipeLine that combines the object detection model with an OCR pipe for text detection

Each diagram in our data set contains at most one symbol. We rely on a data model from an existing C3 AI Application, C3 AI Reliability, and assume raw data required for this demo are loaded into the application. The data contains a set of diagrams with annotated symbol bounding boxes and another set of diagrams without any annotations.

Prototype an Object Detection Model in Python

Data Exploration

To visualize the diagrams and explore the available data, we first import matplotlib and a helper function that converts instances of C3 AI Types into a pandas DataFrame.

import matplotlib.pyplot as plt
c3_grid = c3.DiagramParsingTypeUtils.fetchGrid

To build an object detection model, we will use a training set of labeled diagrams along with the coordinates of the bounding box for each diagram. As a first step, we load our training images and the bounding box labels. Our diagrams are stored in the c3.PNGDiagram type.

The C3 AI Type System provides simple APIs to fetch data from its distributed data stores, which are backed by various database technologies like Cassandra and Postgres. The implementation and query details for managing data are abstracted away behind simple APIs like fetch or remove and are optimized by the platform. This lets data scientists and application developers spend less time building and debugging queries and more time on the application at hand.

First let’s count all the diagrams that are available in our environment:

print(f'There are {c3.PNGDiagram.fetchCount()} diagrams persisted')
There are 4999 diagrams persisted

Next, let’s get the ids and the creation timestamp of 5 sample diagrams:

c3_grid(c3.PNGDiagram, ['id', 'EXT', 'meta.created'], limit=5)
                                     id   EXT               meta.created
0  001a7cf3-8db8-4d2e-9468-288a25638171  .png  2022-01-18 18:25:30+00:00
1  00205a8e-4a71-403d-9893-d2850e04369c  .png  2022-01-18 18:26:32+00:00
2  002d0225-1074-4605-88b5-dc8f1f5db9f9  .png  2022-01-18 18:24:11+00:00
3  00301408-b2a0-47b2-911f-2988a69c7146  .png  2022-01-18 18:24:55+00:00
4  004197ab-a070-4cf1-99a2-43554a687e1f  .png  2022-01-18 18:23:13+00:00

C3 AI’s Python SDK allows accessing the data in a pythonic way without writing queries for a specific type of database. As an example, we can fetch the diagrams matching a specific filter, in this case the ones that have a value for the bounding_box field.

diagrams = c3.PNGDiagram.fetch({
    'filter': 'exists(bounding_box)',
    'limit' : 5
}).objs

We can also get a diagram with a specific id:

specific_id = diagrams[0].id
specific_diagram = c3.PNGDiagram.get(specific_id)

Or remove a persisted diagram, or all the persisted diagrams, from the backend data store:

# you can also remove it by 
# or just remove all of them 
# c3.PNGDiagram.removeAll()

The fetched data are converted directly into Python objects by C3 AI’s Python SDK, and they can be used in our notebook just like any other object in Python.

print('Bounding Box', diagrams[0].bounding_box)
Bounding Box c3.Arry<int>([26, 43, 94, 111])


Let’s visualize some additional training samples:

n_examples = 8
sample_diagrams = c3.PNGDiagram.fetch({'limit': n_examples}).objs
plt.figure(figsize=(20, 20*n_examples))
for i in range(n_examples):
    d = sample_diagrams[i]


Model Architecture

In this implementation we will use an anchor box regression approach, a simplified version of the Region Proposal Network used in the Faster R-CNN object detection architecture.

At a high level, this method assumes there is a box with predefined height and width located in the center of an image, which we refer to as the anchor box. The model then tries to answer the following 3 questions:

  1. Is there any symbol significantly overlapping with the anchor box?
  2. How should the anchor box be moved so that the anchor box center aligns with the target symbol center?
  3. How should the anchor box be rescaled so that its height and width match the dimension of the symbol?

For simplicity, we choose the anchor box size to be the same as the image size (128, 128).
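Question 1 needs some measure of "significant overlap". A common choice is intersection over union (IoU). The post does not state which measure or threshold was used, so the sketch below is only an illustration: the function name `iou` is an assumption, and the symbol box is the sample bounding box printed earlier.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as [xmin, ymin, xmax, ymax]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


anchor = [0, 0, 128, 128]     # anchor box covering the whole 128x128 image
symbol = [26, 43, 94, 111]    # sample bounding box from the data exploration step
print(round(iou(anchor, symbol), 3))   # 4624 / 16384 -> 0.282
```

Because the anchor covers the whole image, the IoU here reduces to the fraction of the image occupied by the symbol.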


Figure 3: Three Questions Answered by the Object Detection Model

These questions can be formulated into a classification problem with 1 output (probability, p) and a regression problem with 4 outputs (translations dx, dy, and scaling factors rx, ry).


Figure 4: Object Detection Model Architecture

We will use an architecture with a few stacked convolution layers to implement the anchor box regression model. Binary cross-entropy loss will be used to optimize the classification objective, and mean squared error will be used to optimize the regression objective. We will implement the architecture with Keras.
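To make the two objectives concrete, here is a minimal plain-Python sketch of the joint loss for a single sample. The helper names and all numeric values are hypothetical; Keras computes the same quantities averaged over a batch.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Cross-entropy for a single binary label; eps keeps log() finite."""
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def mean_squared_error(targets, preds):
    """Average squared difference over the regression outputs."""
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)


# Hypothetical sample: a symbol is present (label 1) and the model predicts p = 0.9.
cls_loss = binary_cross_entropy(1, 0.9)
# Hypothetical targets and predictions for (dx, dy, rx, ry).
reg_loss = mean_squared_error([0.03, -0.10, -0.63, -0.63],
                              [0.05, -0.08, -0.60, -0.70])
total_loss = cls_loss + reg_loss   # unweighted sum of the two objectives
print(cls_loss, reg_loss, total_loss)
```

The unweighted sum treats both objectives equally; weighting one term more heavily would be a separate design choice that this sketch does not make.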

import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Lambda
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Flatten

from tensorflow.keras.optimizers import Adam

from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.losses import MeanSquaredError

First, a few stacked convolutional layers are used to extract high-level features from the diagrams:

size = 128
images = Input(shape=(size, size,3))
cls_target = Input(shape=(1,))  # shape must be a tuple, hence the trailing comma
reg_target = Input(shape=(4,))

# just a few convolution layers
lvl_0 = Conv2D(
    filters=128, kernel_size=(2,2), strides=(2,2), 
    activation='relu', use_bias=True)(images)
lvl_0 = BatchNormalization()(lvl_0)

lvl_1 = Conv2D(
    filters=64, kernel_size=(2,2), strides=(2,2), 
    activation='relu', use_bias=True)(lvl_0)
lvl_1 = BatchNormalization()(lvl_1)

lvl_2 = Conv2D(
    filters=32, kernel_size=(2,2), strides=(2,2), 
    activation='relu', use_bias=True)(lvl_1)
lvl_2 = BatchNormalization()(lvl_2)

lvl_3 = Conv2D(
    filters=8, kernel_size=(2,2), strides=(2,2), 
    activation='relu', use_bias=True)(lvl_2)
lvl_3 = BatchNormalization()(lvl_3)

Then we flatten the features from 3D tensors into 1D arrays and use stacked dense layers to reduce the dimensions. Finally, a dense layer with an output size of 4 predicts the horizontal translation, the vertical translation, and the two scaling factors of the bounding box. A second dense layer with an output size of 1 and a sigmoid activation produces the probability that the image actually contains a target symbol.

# flatten for prediction
flat = Flatten()(lvl_3)
reduced = Dense(units=64, use_bias=True, activation='relu')(flat)
reduced = Dense(units=32, use_bias=True, activation='relu')(reduced)

# Is there a symbol inside the image? Output a probability
cls_output = Dense(units=1, use_bias=True, activation='sigmoid')(reduced)
# What are the translation values and scaling factors?
reg_output = Dense(units=4, use_bias=True, activation= 'linear')(reduced)

# jointly optimize the regression and classification losses
cls_loss = Lambda(lambda x: tf.keras.losses.BinaryCrossentropy()(*x))([cls_target, cls_output])
reg_loss = Lambda(lambda x: tf.keras.losses.MeanSquaredError()(*x))([reg_target, reg_output])
all_loss = reg_loss + cls_loss

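mdl = Model(inputs=[images, cls_target, reg_target], outputs=[cls_output, reg_output])
# register the joint loss; this is the AddLoss layer that appears in the model summary
mdl.add_loss(all_loss)
mdl.add_metric(cls_loss, aggregation='mean', name='cls loss')
mdl.add_metric(reg_loss, aggregation='mean', name='reg loss')
# the optimizer settings are not shown in the original post; Adam defaults are assumed here
mdl.compile(optimizer=Adam())
mdl.summary()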

WARNING:tensorflow:Output dense_6 missing from loss dictionary. We assume this was done on purpose. The fit and evaluate APIs will not be expecting any data to be passed to dense_6.
WARNING:tensorflow:Output dense_7 missing from loss dictionary. We assume this was done on purpose. The fit and evaluate APIs will not be expecting any data to be passed to dense_7.
Model: "model_1"
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            [(None, 128, 128, 3) 0                                            
conv2d_4 (Conv2D)               (None, 64, 64, 128)  1664        input_4[0][0]                    
batch_normalization_4 (BatchNor (None, 64, 64, 128)  512         conv2d_4[0][0]                   
conv2d_5 (Conv2D)               (None, 32, 32, 64)   32832       batch_normalization_4[0][0]      
batch_normalization_5 (BatchNor (None, 32, 32, 64)   256         conv2d_5[0][0]                   
conv2d_6 (Conv2D)               (None, 16, 16, 32)   8224        batch_normalization_5[0][0]      
batch_normalization_6 (BatchNor (None, 16, 16, 32)   128         conv2d_6[0][0]                   
conv2d_7 (Conv2D)               (None, 8, 8, 8)      1032        batch_normalization_6[0][0]      
batch_normalization_7 (BatchNor (None, 8, 8, 8)      32          conv2d_7[0][0]                   
flatten_1 (Flatten)             (None, 512)          0           batch_normalization_7[0][0]      
dense_4 (Dense)                 (None, 64)           32832       flatten_1[0][0]                  
dense_5 (Dense)                 (None, 32)           2080        dense_4[0][0]                    
input_5 (InputLayer)            [(None, 1)]          0                                            
input_6 (InputLayer)            [(None, 4)]          0                                            
dense_6 (Dense)                 (None, 1)            33          dense_5[0][0]                    
dense_7 (Dense)                 (None, 4)            132         dense_5[0][0]                    
lambda_3 (Lambda)               ()                   0           input_6[0][0]                    
lambda_2 (Lambda)               ()                   0           input_5[0][0]                    
tf_op_layer_add_1 (TensorFlowOp [()]                 0           lambda_3[0][0]                   
add_loss_1 (AddLoss)            ()                   0           tf_op_layer_add_1[0][0]          
add_metric_2 (AddMetric)        ()                   0           lambda_2[0][0]                   
add_metric_3 (AddMetric)        ()                   0           lambda_3[0][0]                   
Total params: 79,757
Trainable params: 79,293
Non-trainable params: 464
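The parameter counts in the summary above can be reproduced by hand. The sketch below is plain arithmetic, not part of the original notebook, and the helper names are illustrative. It assumes a 2x2 kernel with stride 2 and 'valid' padding (which halves each spatial side) and counts four parameters per batch-normalization channel.

```python
def conv2d_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    kh, kw = kernel
    return (kh * kw * in_channels + 1) * filters


def dense_params(in_units, out_units):
    """Weight matrix plus one bias per output unit."""
    return (in_units + 1) * out_units


def batchnorm_params(channels):
    """gamma and beta (trainable) plus moving mean and variance (non-trainable)."""
    return 4 * channels


size, channels = 128, 3
total = non_trainable = 0
for filters in (128, 64, 32, 8):
    total += conv2d_params((2, 2), channels, filters) + batchnorm_params(filters)
    non_trainable += 2 * filters   # moving mean and moving variance
    size //= 2                     # each strided convolution halves height and width
    channels = filters

flat = size * size * channels      # 8 * 8 * 8 features after Flatten
total += dense_params(flat, 64) + dense_params(64, 32)
total += dense_params(32, 1) + dense_params(32, 4)   # classification and regression heads

print(flat, total, total - non_trainable, non_trainable)   # 512 79757 79293 464
```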

Model Training

We can now create a data generator to convert the bounding box coordinates into the regression targets that are normalized with the anchor sizes for easier model convergence.

# relative translations
def generate_translation_label(anchor_box_shape, box):
    if not box: return [0, 0]
    h, w = anchor_box_shape
    x1, y1, x2, y2 = box
    center_x = (x1 + x2)/2
    center_y = (y1 + y2)/2
    dx = (w/2 - center_x)/w
    dy = (h/2 - center_y)/h
    return dx, dy
# relative scaling factors
def generate_scaling_label(anchor_box_shape, box):
    if not box: return [0, 0]
    h, w = anchor_box_shape
    x1, y1, x2, y2 = box
    box_h = y2 - y1
    box_w = x2 - x1
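    rx = np.log(box_w/w)  # width ratio, decoded later as box_w = exp(rx) * w
    ry = np.log(box_h/h)  # height ratio, decoded later as box_h = exp(ry) * h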
    return rx, ry
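One reason to regress the logarithm of the size ratio rather than the ratio itself is that shrinking and growing by the same factor give targets of equal magnitude, and a box that matches the anchor maps to zero. A small sketch of this property, using a hypothetical helper `log_scale`:

```python
import math


def log_scale(box_extent, anchor_extent):
    """Log of the size ratio between a box side and the matching anchor side."""
    return math.log(box_extent / anchor_extent)


anchor_side = 128
# A box side equal to the anchor side maps to exactly zero.
print(log_scale(128, anchor_side))
# Halving and doubling map to values of equal magnitude and opposite sign.
print(log_scale(64, anchor_side), log_scale(256, anchor_side))
# Sample box [26, 43, 94, 111]: both sides are 68 pixels long.
print(log_scale(94 - 26, anchor_side))
```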

Then we can use the above two functions for generating labels and use the generator for training the model.

import random

anchor_box_shape = (128, 128)
def sample_generator(diagrams, size, batch_size):
    # size is unused here; images are assumed to already be size x size
    while True:
        images, cls_labels, reg_labels = [], [], []
        for _ in range(batch_size):
            d = random.choice(diagrams)
            img = d.toImage(cache=True)
            box = d.bounding_box
            has_symbol = bool(box)
            dx, dy = generate_translation_label(anchor_box_shape, box)
            rx, ry = generate_scaling_label(anchor_box_shape, box)
            images.append(img)
            cls_labels.append(float(has_symbol))
            reg_labels.append((dx, dy, rx, ry))
        images = np.array(images)
        cls_labels = np.array(cls_labels)
        reg_labels = np.array(reg_labels)
        # the targets are fed in as model inputs, so no separate y is yielded
        yield (images, cls_labels, reg_labels), None

Using the generator defined above, we can train our model using the labeled diagrams. We use the last 100 diagrams for validation.

all_diagrams = c3.PNGDiagram.fetch({'filter': 'exists(bounding_box)'}).objs
train_diagrams = all_diagrams[:-100]
valid_diagrams = all_diagrams[-100:]

train_g = sample_generator(train_diagrams, size, 32)
valid_g = sample_generator(valid_diagrams, size, 64)

valid_data = next(valid_g)

# you can also directly load the model here from the h5
# from tensorflow.keras.models import load_model
# mdl = load_model('rpn.h5')
mdl.fit(train_g, epochs=64, steps_per_epoch=32, validation_data=valid_data, verbose=0)

Model Inference

To show that the model we just trained works as expected, we will test it on a diagram from a holdout test set that does not have its bounding box or text attributes populated. We will then use our trained model to generate the bounding box and the text within the symbol. As shown in the visualization below, both the bounding box and the text field are empty in the beginning.

unlabeled = c3.PNGDiagram.fetch({'filter': '!exists(bounding_box)', 'limit': 5}).objs


Symbol Detection

# run the model
imgs = np.array([d.toImage() for d in unlabeled])
mdl_input = [imgs, np.empty(len(imgs)), np.empty((len(imgs),4))]
cls_outputs, reg_outputs = mdl.predict(mdl_input)

Since we used relative translations and relative scaling factors as the regression target of our model, we need to transform the model output to recover the coordinates of the bounding box.

def decode_result(size, dx, dy, rx, ry):
    h, w = size, size
    center_x = size/2
    center_y = size/2
    center_x -= dx * w
    center_y -= dy * h
    box_w = np.exp(rx) * w
    box_h = np.exp(ry) * h
    xmin = int(center_x - box_w/2)
    xmax = int(center_x + box_w/2)
    ymin = int(center_y - box_h/2)
    ymax = int(center_y + box_h/2)
    return [xmin, ymin, xmax, ymax]
# populate the bounding_box attribute of the diagram
for diagram, img, reg_output in zip(unlabeled, imgs, reg_outputs):
    box = decode_result(len(img), *reg_output)
    diagram.bounding_box = box
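A quick way to sanity-check the decoding logic is a round trip: encode a known box into regression targets, decode it, and confirm the original coordinates come back. The sketch below is a plain-Python illustration with hypothetical helper names `encode` and `decode`. Following decode_result, rx scales the width and ry scales the height. Unlike decode_result, it uses round() instead of int() so that floating-point error does not shift a coordinate by one pixel.

```python
import math


def encode(box, size):
    """Turn [xmin, ymin, xmax, ymax] into (dx, dy, rx, ry) for a size x size anchor."""
    x1, y1, x2, y2 = box
    dx = (size / 2 - (x1 + x2) / 2) / size
    dy = (size / 2 - (y1 + y2) / 2) / size
    rx = math.log((x2 - x1) / size)   # width ratio
    ry = math.log((y2 - y1) / size)   # height ratio
    return dx, dy, rx, ry


def decode(size, dx, dy, rx, ry):
    """Inverse of encode, rounding each coordinate to the nearest pixel."""
    center_x = size / 2 - dx * size
    center_y = size / 2 - dy * size
    box_w = math.exp(rx) * size
    box_h = math.exp(ry) * size
    return [round(center_x - box_w / 2), round(center_y - box_h / 2),
            round(center_x + box_w / 2), round(center_y + box_h / 2)]


box = [20, 40, 100, 90]   # hypothetical non-square box: 80 wide, 50 tall
recovered = decode(128, *encode(box, 128))
print(recovered)
```

A non-square box is used on purpose: with a square box, swapping the width and height factors would go unnoticed.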

As we can see, our model generates a bounding box that accurately captures the target symbol and achieves the desired outcome.



Text Recognition OCR

Now that we have demonstrated how to build an object detection model that locates the target symbol, we will next demonstrate how to use a pre-trained OCR pipe that is readily available in the platform. Using c3.OcrPipe, we will extract the text inside the bounding box and populate the text attribute of the symbol.

diagram = unlabeled[0]
ocr_pipe = c3.OcrPipe()

labeled_diagram = ocr_pipe.process(diagram)

As shown in the text field below, the OCR pipe correctly recognizes the id of the target symbol.


Building a Production-Ready Pipeline with the platform

Building a production-ready pipeline using the symbol detection and OCR models is very simple. C3 AI Application Platform provides many out-of-the-box Types to convert TensorFlow, Keras, or PyTorch models created in Python to instances of MLPipe.

Keras Pipe

As the first step, we encapsulate our trained Keras model as an instance of c3.KerasPipe. In one line, the native Python model is converted to an instance of a C3 AI Type and persisted to the platform.

# you can directly save a trained model 
keras_pipe = c3.KerasPipe().upsertNativeModel(mdl)

Using a KerasPipe, the trained model and its hyperparameters can easily be persisted. This simplifies keeping track of the details of all of your modeling iterations. As with any other C3 AI Type, we can fetch, update, or remove these pipes with convenient APIs.

keras_pipe.get('id, meta.created, typeVersion')
        created=datetime.datetime(2022, 1, 18, 19, 12, 44, tzinfo=datetime.timezone.utc),
print('Part of the Keras Model Parameters:\n', keras_pipe.technique.modelDef[:500], '...')
Part of the Keras Model Parameters:
 {"class_name": "Model", "config": {"name": "model_1", "layers": [{"class_name": "InputLayer", "config": {"batch_input_shape": [null, 128, 128, 3], "dtype": "float32", "sparse": false, "ragged": false, "name": "input_4"}, "name": "input_4", "inbound_nodes": []}, {"class_name": "Conv2D", "config": {"name": "conv2d_4", "trainable": true, "dtype": "float32", "filters": 128, "kernel_size": [2, 2], "strides": [2, 2], "padding": "valid", "data_format": "channels_last", "dilation_rate": [1, 1], "activat ...

C3 AI Application Platform also provides utility types and functionalities to simplify the development of a specific application. To simplify the development of a diagram parsing model, here we use the SymbolDetectionPipe from the C3 AI Reliability Application. This type provides utility functions to apply the logic for decoding the outputs from a symbol detection model and populate the bounding box attribute of an input diagram.

We will use our KerasPipe as the core model for a SymbolDetectionPipe.

syb_pipe = c3.SymbolDetectionPipe(core=keras_pipe).upsert()

This pipe can be used for populating the bounding_box field for the diagram.

syb_pipe.get('id, meta.created')
        created=datetime.datetime(2022, 1, 18, 19, 12, 47, tzinfo=datetime.timezone.utc),
unlabeled = c3.PNGDiagram.fetch({'filter': '!exists(bounding_box)', 'limit': 3}).objs
to_parse = unlabeled[0]

# the target diagram is empty in the beginning
to_parse = unlabeled[0].get()


With symbol detection, the diagram now has the bounding box of the target symbol:

box_detected = syb_pipe.process(to_parse)
WARNING:tensorflow:Output dense_6 missing from loss dictionary. We assume this was done on purpose. The fit and evaluate APIs will not be expecting any data to be passed to dense_6.
WARNING:tensorflow:Output dense_7 missing from loss dictionary. We assume this was done on purpose. The fit and evaluate APIs will not be expecting any data to be passed to dense_7.


With the text recognition pipe, the diagram also has the text attribute populated:

text_recognized = ocr_pipe.process(to_parse)


And we can sync the current state of the diagram and save everything into the database.


Creating a Multi-Step Machine Learning Pipeline

Finally, we can very easily build an end-to-end symbol detection and text recognition pipeline that can process our diagrams and populate their bounding box and text fields.


Figure 5: Diagram Parsing Pipeline

syb_pipe = c3.SymbolDetectionPipe(core=keras_pipe).upsert()
step_1 = c3.MLStep(

step_2 = c3.MLStep(

pipeline = c3.MLSerialPipeline(steps=[step_1, step_2]).upsert()
pipeline.get('id, meta.created')
        created=datetime.datetime(2022, 1, 18, 19, 22, 39, tzinfo=datetime.timezone.utc),
to_parse = unlabeled[2]
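Conceptually, a serial pipeline feeds the output of each step into the next one. The plain-Python stand-in below illustrates that idea only. It is not the C3 AI API: `Diagram`, `detect_symbol`, `recognize_text`, and `run_serial` are hypothetical names, and both steps return placeholder values instead of running a model.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Diagram:
    """Hypothetical stand-in for a persisted diagram record."""
    id: str
    bounding_box: Optional[List[int]] = None
    text: Optional[str] = None


def detect_symbol(diagram: Diagram) -> Diagram:
    # Placeholder for the symbol detection step: returns a fixed box.
    return replace(diagram, bounding_box=[26, 43, 94, 111])


def recognize_text(diagram: Diagram) -> Diagram:
    # Placeholder for the OCR step: it needs a bounding box to read from.
    if diagram.bounding_box is None:
        raise ValueError("text recognition needs a bounding box first")
    return replace(diagram, text="PLACEHOLDER-ID")


def run_serial(steps: Sequence[Callable[[Diagram], Diagram]], diagram: Diagram) -> Diagram:
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        diagram = step(diagram)
    return diagram


parsed = run_serial([detect_symbol, recognize_text], Diagram(id="demo"))
print(parsed)
```

The order of the steps matters: running the text step first fails in this sketch because no bounding box exists yet, which mirrors why symbol detection precedes OCR in the pipeline.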


Now that our end-to-end pipeline is persisted, it can be used in an application to process new diagrams and save hundreds of hours of manual work for domain experts!

About The Authors

  • Josh Zhang is a Senior Data Scientist at C3 AI, where he developed algorithms for multiple large-scale AI applications. He holds an M.S. in Mechanical Engineering from Duke University and a B.S. in Mechanical Engineering from Lafayette College. Before C3 AI, he worked on the development of a large-scale graph deep learning framework as a software engineer.
  • Amir H. Delgoshaie is a Data Science Manager at C3 AI, where he has worked on the development and deployment of multiple large-scale AI applications for the utility, energy, and manufacturing sectors. He holds a Ph.D. in Energy Resources Engineering from Stanford University and master’s and bachelor’s degrees in Mechanical Engineering from ETH Zurich and Sharif UT. Prior to C3 AI, he developed algorithms and software at various research and industrial institutions.