The C3 AI Platform offers our customers extensive and powerful capabilities to build and operate enterprise AI applications, but there can be a steep learning curve for new users during the onboarding process. C3 AI Datasets, a new capability available with Version 8 of the C3 AI Platform, was developed to remove that ramp-up time and get data scientists productionizing new AI applications on the platform quickly.
Now, with C3 AI Datasets, data scientists can go from data exploration workflows authored in Pandas to building and productionizing enterprise-grade AI applications with merely one function call.
Why we built C3 AI Datasets
At the core of our platform is the C3 AI Type System, a sophisticated abstraction layer that models all components needed to build an AI application. The C3 AI Type System allows developers to quickly and easily set up infrastructure, ingest data, train and productionize ML models, and in turn build enterprise-grade software applications powered by these models.
For example, a data scientist working in a cloud JupyterHub instance can use the C3 AI FileSystem APIs to read data from any cloud filesystem such as AWS S3, Azure Blob Storage, or Google Cloud Storage without worrying about the low-level details of how to connect to the backing data store. Building an end-to-end model involves creating different kinds of C3 AI Types, including: Source types to represent the schema and location of data living in external systems; Transform types to describe Data Integration pipelines that ingest data at scale; and Target types that store data in internal databases. After your C3 AI Source, Transform, and Target types are integrated, you can build and persist Feature types and use them as input to MlPipelines types.
C3 AI also delivers a comprehensive set of development tools and services for a wide range of users who build and operate enterprise AI applications. These include no-code tools for citizen data scientists and developers, as well as low-code and deep-code tools for more sophisticated users such as application developers, UI developers, data engineers, data scientists, and DevSecOps teams. Our customers love the power and flexibility that come with the C3 AI Type System and the overall C3 AI ecosystem, but it is so extensive that they asked us for a way to reduce the ramp-up time a new user needs to get acquainted with the platform. Our customers also wanted us to address the massive challenge of productionization, so that their AI/ML projects could move seamlessly from data exploration to large-scale AI application deployments.
What C3 AI Datasets offers
C3 AI Datasets:
Enables data scientists to get started on a project and on C3 AI from day one.
They can onboard and perform raw data loading, data exploration, feature engineering, and ML experimentation using Pandas APIs on the C3 AI Platform. Therefore, no prior knowledge of C3 AI is required to get started.
Lets data scientists reuse and deploy work done during ML prototyping on the C3 AI Platform to create the C3 AI artifacts (C3 AI Types) required for productionization.
The C3 AI Platform auto-generates production-ready artifacts for applications from Pandas code.
Provides AI/ML prototyping and productionization on a distributed platform without any lock-in to backend execution engines.
Depending on the scale and performance requirements of your project, you can switch between multiple execution engines.
How C3 AI Datasets works
C3 AI Datasets spans several software components. Data scientists can host their Jupyter notebooks in a custom C3 AI JupyterHub server that interacts with backend C3 AI servers and infrastructure that serves their requests and workloads. Working in a Jupyter notebook through a Python SDK developed in-house, a data scientist interacts with the C3 AI Data interface using APIs that syntactically match Pandas code.
Each function call sends a remote request to the server, which handles it by lazily logging a record of the operation. The accumulated records form a DAG (directed acyclic graph). This DAG, the log of operations, is executed when a result must be returned, and it can also be used to generate C3 AI artifacts that correspond to the set of transformations the data scientist has authored. In the first case, the operations are materialized into data via one of the supported execution engines that handle the actual computation of the Pandas code. In the second case, the package inference engine uses the record of operations to generate the relevant C3 AI artifacts based on the Pandas code written by the end user. These generated artifacts provide a good template or starting point, speeding up the bootstrapping of application development on the C3 AI Platform. These components all work together to deliver the C3 AI Datasets product while requiring no additional knowledge from the end user beyond basic Pandas.
A Pandas lazy interface and dynamic execution engine
C3 AI Datasets provides data scientists with a seamless transition from Pandas to the C3 AI ecosystem. Adding a single line to the top of a Jupyter notebook authored in Pandas (pd = c3.Data) will cause the notebook to run using the C3 AI Data engine. With that in mind, it is essential for us to ensure that the interface for Data matches Pandas exactly.
Achieving one-to-one API parity between Datasets and Pandas is a complicated problem. There are numerous Pandas classes to support, including DataFrame, Series, Index, and DataFrameGroupBy, each with its own attributes and methods. As requests travel across the C3 AI ecosystem, native Python values need to be serialized back and forth across different language runtimes (such as Java and Python). Python and Pandas versions need to be compatible across the whole software path. C3 AI enterprise standards need to be met for data ingestion and security.
We automated a big part of this process by building a framework that inspects a Python class and translates its methods into C3 AI's in-house DSL for API declaration. Generating these method signatures not only saves us hours of manual work but also allows us to update our interface to remain consistent with Pandas, as newer versions of Pandas are released.
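To make the idea concrete, here is a minimal sketch of signature generation using Python's standard inspect module. It is not our framework: ToyFrame is a stand-in class rather than a real Pandas type, and the output format is a made-up placeholder for the in-house DSL, which is not shown in this post.

```python
import inspect


class ToyFrame:
    """Stand-in for a Pandas-like class; not a real Pandas type."""

    def head(self, n: int = 5) -> "ToyFrame":
        return self

    def rename(self, columns: dict, inplace: bool = False) -> "ToyFrame":
        return self

    def _private_helper(self):
        pass


def declare_methods(cls) -> list:
    """Emit one declaration string (hypothetical syntax) per public method."""
    declarations = []
    for name, func in inspect.getmembers(cls, predicate=inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private and dunder methods
        params = []
        for param in inspect.signature(func).parameters.values():
            if param.name == "self":
                continue
            # A parameter with a default value is marked optional with "?".
            optional = "" if param.default is inspect.Parameter.empty else "?"
            if param.annotation is inspect.Parameter.empty:
                type_name = "any"
            else:
                type_name = getattr(param.annotation, "__name__", str(param.annotation))
            params.append(f"{param.name}{optional}: {type_name}")
        declarations.append(f"{name}: function({', '.join(params)})")
    return declarations


declarations = declare_methods(ToyFrame)
for line in declarations:
    print(line)
```

Because the declarations are derived from the class itself, rerunning a generator like this against a newer library version picks up added or changed methods without manual edits.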
Now that we have something that looks like Pandas, we make it act like Pandas by running Pandas code over the C3 AI Data engine using lazy evaluation backed by a dynamic execution engine. Laziness means that a value is computed only when it is actually needed. In our case, when the user calls a Pandas API, we do not execute that code immediately. Instead, we build an in-memory DAG on the server side and track the history of operations applied to each DataFrame. This approach has clear performance benefits, especially when operating on large datasets, and it also allows optimizations to be carried out on the DAG prior to execution.
For example, to begin exploration, a data scientist may read data from a CSV (comma-separated values) file using the read_csv API. However, the data is not read until the user inspects the actual values.
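The deferred read described above can be sketched in a few lines. This is an illustration only, not the C3 AI Data engine: LazyTable, collect, and the EXECUTED log are hypothetical names, and the standard csv module stands in for Pandas so the sketch has no dependencies.

```python
import csv
import io

EXECUTED = []  # names of operations that have actually run, in order


class LazyTable:
    """Records operations instead of running them; computes only on demand."""

    def __init__(self, op, args, parent=None):
        self.op = op          # operation name, e.g. "read_csv"
        self.args = args      # arguments captured at call time
        self.parent = parent  # upstream node in the operations graph
        self._cache = None    # materialized rows, filled on first use

    def filter(self, column, value):
        # No data is touched here; only a new graph node is recorded.
        return LazyTable("filter", (column, value), parent=self)

    def history(self):
        """Return operation names from the source node to this node."""
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))

    def collect(self):
        """Materialize rows by executing the recorded operations once."""
        if self._cache is None:
            if self.op == "read_csv":
                (opener,) = self.args
                with opener() as handle:
                    self._cache = list(csv.DictReader(handle))
            elif self.op == "filter":
                column, value = self.args
                rows = self.parent.collect()
                self._cache = [row for row in rows if row[column] == value]
            else:
                raise ValueError(f"unknown operation: {self.op}")
            EXECUTED.append(self.op)
        return self._cache


def read_csv(opener):
    """Record a read; the file is not opened until collect() is called."""
    return LazyTable("read_csv", (opener,))


CSV_TEXT = "city,temp\nOslo,3\nCairo,31\nOslo,5\n"
table = read_csv(lambda: io.StringIO(CSV_TEXT))
oslo = table.filter("city", "Oslo")
executed_before = list(EXECUTED)  # still empty: nothing has run yet
rows = oslo.collect()             # inspecting values triggers execution
```

Calling collect() a second time returns the cached rows without rerunning anything, which mirrors the "only caching and executing when required" behavior.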
As the data scientist proceeds with data exploration, the operations graph builds incrementally, only caching and executing when required. The following diagram displays a sample data exploration workflow and its corresponding operations graph.
Data exploration workflow and corresponding operations graph
To execute an operations graph, we first translate it into Python code, a relatively simple problem given that the graph tracks all the information that we need. This code can then be executed on a number of different backend engines, each with its own strengths and weaknesses.
For example, if you’re working with small amounts of data, you can compile a graph of operations and execute it in native Pandas; on the other hand, if the data has millions of rows, it may make more sense for you to execute the code in Modin on Ray. In the future, we plan to support additional execution engines such as Pandas on Spark, and even an in-house distributed engine authored in Java. The engine running the computations is completely abstracted from the user, so all you need to do is write Pandas code and leave the onus of running it in different execution engines to the C3 AI Platform.
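The two steps above, translating an operations graph into Python code and then picking an engine for it, might look roughly like the following sketch. OpNode, emit_python, choose_engine, and the one-million-row cutoff are all assumptions made for illustration; the platform's actual selection logic is not described in this post. The one grounded detail is that Modin is used as a drop-in replacement by importing modin.pandas in place of pandas.

```python
from dataclasses import dataclass, field


@dataclass
class OpNode:
    name: str                     # variable that will hold this node's result
    method: str                   # Pandas function or method to call
    args: tuple = ()              # positional arguments captured at call time
    parents: list = field(default_factory=list)  # upstream nodes


# Import line per engine label; the generated body is identical for both.
IMPORTS = {
    "pandas": "import pandas as pd",
    "modin_on_ray": "import modin.pandas as pd",
}

# Hypothetical cutoff: the post only contrasts "small" data with millions of rows.
ROW_THRESHOLD = 1_000_000


def choose_engine(estimated_rows: int) -> str:
    """Pick an engine label based on the estimated data size."""
    return "pandas" if estimated_rows < ROW_THRESHOLD else "modin_on_ray"


def emit_python(target: OpNode, engine: str = "pandas") -> str:
    """Walk the graph parents-first so each variable is defined before use."""
    lines, seen = [], set()

    def visit(node: OpNode) -> None:
        if node.name in seen:
            return
        seen.add(node.name)
        for parent in node.parents:
            visit(parent)
        call_args = [p.name for p in node.parents[1:]] + [repr(a) for a in node.args]
        # Source nodes call a module-level function; other nodes call a
        # method on the result of their first parent.
        receiver = node.parents[0].name if node.parents else "pd"
        lines.append(f"{node.name} = {receiver}.{node.method}({', '.join(call_args)})")

    visit(target)
    return "\n".join([IMPORTS[engine], *lines])


orders = OpNode("df0", "read_csv", ("orders.csv",))
customers = OpNode("df1", "read_csv", ("customers.csv",))
joined = OpNode("df2", "merge", (), [orders, customers])
preview = OpNode("df3", "head", (10,), [joined])

small_code = emit_python(preview, choose_engine(50_000))
large_code = emit_python(preview, choose_engine(5_000_000))
print(small_code)
```

This sketch handles positional arguments only; a fuller version would also need to carry keyword arguments and non-literal values through the graph.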
How C3 AI Datasets accelerates productionization
What value does C3 AI Datasets bring to the table? Does it bring more value than native Pandas, Modin, or Pandas on Spark? To answer this question, let us explore a key component of this project: Package Inference. As previously discussed, the lazy data engine builds a graph as operations are performed on data, and only executes these operations when the data is required to be materialized, such as when you are inspecting values or plotting a histogram. But laziness is only one reason we track this graph of operations.
The true value of C3 AI Datasets is the ability to transform Pandas DataFrame operations into production-ready, petabyte-scale data integration pipeline artifacts with a single API call.
When you are done loading, exploring, cleaning, and transforming your data through the familiar Pandas interface, you can simply call this API and let the engine work its magic. Behind the scenes, the graph (or graphs) of all user operations is analyzed, pruned, and compiled into the aforementioned artifacts, reflecting the same transformations applied to the data with Pandas code.
The next step is producing these artifacts by compiling the operations graph. As discussed at the beginning of this blog post, there are three primary components to a data integration pipeline in the C3 AI Platform: Sources, Targets, and Transforms. A Source is the schema of raw data coming into the pipeline from a source system, such as a cloud blob storage or an external database holding customer data. A Target is the schema used to persist the integrated data using the C3 AI Database Engine, backed by Postgres. A Transform is a description of how the data is transformed from its raw form (Source) to its integrated form (Target) in a C3 AI database. So, how do we create these artifacts by analyzing the operations graph?
The Source Type is relatively simple. We inspect source nodes in the graph, looking for Pandas data ingestion APIs such as read_csv or read_sql. Once identified, we can infer the schema by inspecting the data, thus creating a Source Type; additionally, we create and configure a Source System, a part of the Data Integration engine that points to a data source and pulls data into the C3 AI Database Engine when Data Integration is triggered. After finding a Source, we traverse down the graph, iteratively building a corresponding Transform.
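As a rough illustration of inferring a schema by inspecting the data, the sketch below samples rows from a CSV and picks the narrowest type that fits each column. The type names (int, double, string) and the function names are assumptions for this example, not the platform's actual type vocabulary or API.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type that fits every non-empty value."""
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return "string"  # nothing to go on, so fall back to the widest type
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for value in non_empty:
                caster(value)
            return type_name
        except ValueError:
            continue  # at least one value did not parse; try the next type
    return "string"


def infer_source_schema(csv_text, sample_rows=100):
    """Infer a column-name-to-type mapping from the first rows of a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sample = [row for _, row in zip(range(sample_rows), reader)]
    return {
        column: infer_type([row[column] for row in sample])
        for column in reader.fieldnames
    }


SAMPLE = "id,city,temp\n1,Oslo,3.5\n2,Cairo,31\n3,Lima,\n"
schema = infer_source_schema(SAMPLE)
print(schema)
```

Sampling keeps inference cheap on large files, at the cost of possibly missing a value later in the file that does not fit the inferred type.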
The notable aspects of a Transform are its projection (which maps Source columns to Target columns) and its condition (a row-wise filter applied to raw data). These fields are defined with the C3 AI Expression Engine, which is used across the platform to compile and evaluate expressions and comes with a standard library of pre-defined Expression Engine functions.
As we travel down the graph, the Package Inference engine inspects each vertex, translates it to its corresponding Expression Engine function, and updates the Transform accordingly (see the diagram below for a visual explanation). Once we have built a Source and a Transform, we can create the Target schema by applying the inferred transformations to the Source schema.
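The traversal described above can be sketched as a loop over recorded operations that updates a projection and a list of conditions, then derives the Target schema from the Source schema. Everything here is hypothetical: the Transform class, the operation names, and the plain-string conditions are simplifications, and real Expression Engine functions are not shown. Operations with no known translation are collected rather than failing the whole analysis.

```python
from dataclasses import dataclass, field


@dataclass
class Transform:
    projection: dict                                 # target column -> source column
    conditions: list = field(default_factory=list)   # row filters on source columns
    unsupported: list = field(default_factory=list)  # operations we could not translate


def infer_transform(source_schema, operations):
    """Fold a linear chain of (operation, argument) pairs into a Transform."""
    transform = Transform(projection={col: col for col in source_schema})
    for op, arg in operations:
        if op == "rename":
            # arg maps current target names to new ones; source columns are kept.
            transform.projection = {
                arg.get(target, target): source
                for target, source in transform.projection.items()
            }
        elif op == "drop":
            transform.projection = {
                target: source
                for target, source in transform.projection.items()
                if target not in arg
            }
        elif op == "filter":
            column, operator, literal = arg
            # The condition applies to raw data, so express it with the
            # source column name rather than the renamed one.
            source_column = transform.projection[column]
            transform.conditions.append(f"{source_column} {operator} {literal!r}")
        else:
            transform.unsupported.append(op)
    return transform


def infer_target_schema(source_schema, transform):
    """Apply the projection to the Source schema to obtain the Target schema."""
    return {
        target: source_schema[source]
        for target, source in transform.projection.items()
    }


source_schema = {"id": "int", "city": "string", "temp": "double"}
operations = [
    ("rename", {"temp": "temperature"}),
    ("filter", ("temperature", ">", 0)),
    ("drop", ["id"]),
    ("pivot_table", None),  # no translation in this sketch
]
transform = infer_transform(source_schema, operations)
target_schema = infer_target_schema(source_schema, transform)
print(transform)
print(target_schema)
```

Note how the filter on the renamed column is rewritten against the original Source column, since the condition is evaluated on raw data before the projection is applied.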
Sometimes in the process of analyzing the graph to produce these artifacts, we may encounter a Pandas method that does not have an equivalent in the C3 AI Expression Engine or cannot be reflected as part of the Data Integration pipeline. In this case, we produce a best-effort analysis of the incompatible operation. This approach does not reflect the entirety of the authored Pandas transformations, but we still attempt to bootstrap the application by generating artifacts that capture as much information as we can. Of course, these artifacts, published to C3 AI’s metadata storage, can instantly be inspected and modified by data scientists or application developers, using a tool of their choice (such as Visual Studio Code).
Looking ahead for C3 AI Datasets
C3 AI Datasets gives data scientists the power to author production-ready C3 AI Data Integration artifacts by writing Pandas code. Adding more execution engines, whether open source or custom in-house distributed computation engines, and supporting intelligent transitions between them will better serve the needs of data scientists and data engineers. The goal is to replace “configuration” steps with inference by the software, based on metrics such as the type of data operations, size, velocity, and execution environment.
We also plan to refine a more collaborative aspect of the product, enabling data scientists connected to the same application to work together and share data between different notebooks. We are working on expanding the reach of the data interface within C3 AI so it can become the backbone of data transformation at various parts of the end-to-end machine learning pipeline.
We are looking for highly motivated software engineers to join us in tackling all these exciting challenges—we encourage you to apply here.
The accomplishments so far for this project are the culmination of a highly collaborative culture focused on engineering execution. We would like to specifically thank Manas Talukdar and Shiva Somasundaram for providing engineering and product leadership. We have had the good fortune to work with and learn from not just each other in the immediate engineering team and the broader Data org, but also from the leadership in the Platform engineering department, specifically David Tchankotadze and Rohit Sureka. We would also like to thank all the data scientists for consistently providing end-user feedback and making it a fun experience to work on this incredibly exciting and impactful project.
About the Authors
Cherif Jazra is a Platform Software Engineer at C3 AI working on the C3 AI Datasets product. Before C3 AI, Cherif worked on fraud detection systems at Postmates. Before working at Postmates, he was a Wireless Cellular Software Engineer at Apple and Palm. He holds an MEng in EE from Cornell University.
Eddie Chayes is a Platform Software Engineer who has worked primarily on the Datasets project over his 1.5 years at C3 AI. He holds a bachelor’s degree in Computer Science from UC Berkeley, and joined C3 AI as a full-time engineer after working as an intern on the Platform - Data team over the summer of 2020.
Andrew Fitch is a Platform Software Engineer at C3 AI. Since joining C3 AI in the summer of 2021, Andrew has worked on expanding the breadth of APIs and execution engines supported by Datasets. Andrew graduated from Carleton College with a B.A. in Computer Science and is currently pursuing his master's in Computer Science at the University of Illinois.
Qiang Yao is a Platform Software Engineer at C3 AI, where he has focused on the observability and maintenance of the Datasets project. He holds an M.S. in Computer Science from the University of Texas at Dallas and an M.S. in Physics from Emory University. Before C3 AI, he worked on the migration of legacy applications to modern technology stacks.