Mediapipe: What is it, or How We Implemented Real-time Inference on Android and iOS

Olga Tatarinova

Director of Artificial Intelligence

Our company specializes in projects related to machine learning and data analysis. In this article, we discuss how we used the Mediapipe framework for iOS and Android, launched it on desktop, wrote custom calculators, and supported the community.

Google announced Mediapipe at CVPR in 2019. The original paper presents a simple idea: treat the data processing workflow as a graph whose vertices are data processing modules. Beyond offering a convenient way to describe the sequence of data operations, Mediapipe promises to build your solution as libraries for iOS and Android. And this is very relevant!

Running ML models and processing video on end devices is challenging. If you implement real-time video processing and directly launch neural networks, you can easily find yourself in a situation with uncontrolled workload increase: the application might experience memory leaks, glitches, or a backlog of frames.

That's why mobile app developers turn to Mediapipe. They rarely have the resources to build video processing from scratch. And why should they, when they can take a ready-made component from Mediapipe and "attach" their own model or another required feature to it? The framework implements video and image transformations, lets you combine ML components with conventional (non-ML) processing, and ultimately produces a cross-platform solution for both mobile and desktop devices.

Mediapipe compensates for the limited functionality of mobile devices and delivers video of acceptable quality. Perfect, isn't it?

Tasks that Mediapipe solves out of the box

The easiest way to get acquainted with Mediapipe? Launch one of the examples they've already developed and then try to modify it slightly.

What Mediapipe can do out of the box:

recognize faces and create a Face Mesh;
detect the irises of the eye: pupils and eye contours;
detect hands and legs, and estimate poses;
segment hair and selfies;
launch a detection model and track its predictions;
track motion in real time (Instant Motion Tracking);
Objectron: detect 3D objects from 2D images;
KNIFT: match features based on templates;
AutoFlip: an automatic video cropping pipeline.

We'd like to note that the standard two-stage pipeline "detection + classification" is not implemented in it. However, it's not hard to assemble from ready-made data processing modules. Overall, you can even write your own custom module.

Components of Mediapipe

Mediapipe consists of three main structural components: calculators (graph vertices), input/output packets, and computation graphs.

Calculators

A calculator is a graph vertex, code that performs a transformation on input data and produces output.

To create a calculator, you need to inherit from CalculatorBase and implement four methods:

1. GetContract(). This checks the types of incoming and outgoing data for consistency with predefined ones. Experience has shown that if you don't implement this method, nothing will break.
2. Open(). Initialize the calculator upon launch.
3. Process(). Execute the node computation, using the concept of context: an entity in which current variables are recorded at the current point of the pipeline's execution.
4. Close(). Release the calculator's resources.

For instance, if there's a task to run a certain classification model on an input image, you need to:

Initialize that model in Open().
In Process(), extract the input image from the context, apply the model on it, get a prediction, and pass it to the output.
Release resources in Close().

(Calculator: Example code for a classification model on an input image).

#include "absl/memory/memory.h"
#include "mediapipe/framework/api2/node.h"
#include "mediapipe/framework/formats/classification.pb.h"
#include "mediapipe/framework/formats/image.h"
#include "mediapipe/framework/formats/image_frame.h"
#include "mediapipe/framework/port/status_macros.h"
#include "mediapipe/gpu/gpu_buffer.h"

namespace mediapipe {
namespace api2 {

class ClassificationCalculator : public Node {
 public:
  // Expect either an 'Image' or an 'ImageFrame' on IMAGE,
  // and optionally a 'GpuBuffer' on IMAGE_GPU.
  // Provide a list of 'Classification' objects on the output.
  static constexpr Input<OneOf<mediapipe::Image, mediapipe::ImageFrame>>
      kInImage{"IMAGE"};
  static constexpr Input<GpuBuffer>::Optional kInImageGpu{"IMAGE_GPU"};
  static constexpr Output<ClassificationList> kOutClassificationList{
      "CLASSIFICATIONS"};
  // Generates the contract (the type checks otherwise done in GetContract()).
  MEDIAPIPE_NODE_CONTRACT(kInImage, kInImageGpu, kOutClassificationList);

  absl::Status Open(CalculatorContext* cc) override;
  absl::Status Process(CalculatorContext* cc) override;
  absl::Status Close(CalculatorContext* cc) override;
};
MEDIAPIPE_REGISTER_NODE(ClassificationCalculator);

absl::Status ClassificationCalculator::Open(CalculatorContext* cc) {
  // Here we load the model into a private variable
  return absl::OkStatus();
}

// Copied some code from image_to_tensor_calculator.cc
// (the GetInputImage() helper used below has to be copied along with it).
absl::Status ClassificationCalculator::Process(CalculatorContext* cc) {
  // We check that there is an input in this context
  if ((kInImage(cc).IsConnected() && kInImage(cc).IsEmpty()) ||
      (kInImageGpu(cc).IsConnected() && kInImageGpu(cc).IsEmpty())) {
    // Timestamp bound update happens automatically.
    return absl::OkStatus();
  }

  ASSIGN_OR_RETURN(auto image_ptr, GetInputImage(cc));

  // We run the model on the image image_ptr
  // Write the results to the output data (hard-coded here for illustration)
  auto out_classification_list = absl::make_unique<ClassificationList>();
  Classification* classification =
      out_classification_list->add_classification();
  classification->set_index(0);
  classification->set_score(0.8);
  classification->set_label("class0");
  classification->set_display_name("Dog");
  kOutClassificationList(cc).Send(std::move(out_classification_list));

  return absl::OkStatus();
}

absl::Status ClassificationCalculator::Close(CalculatorContext* cc) {
  // Release resources (remove the model from memory)
  return absl::OkStatus();
}

}  // namespace api2
}  // namespace mediapipe
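The lifecycle described above (Open() once, Process() for every packet, Close() once) can be illustrated without Mediapipe at all. The sketch below is a toy model in plain C++: ToyCalculator, ThresholdCalculator, and RunToyGraph are hypothetical names invented for this illustration, not Mediapipe API, and the real framework schedules nodes by timestamp rather than in a simple loop.

```cpp
#include <string>
#include <vector>

// Toy stand-in for a calculator interface (hypothetical, not CalculatorBase).
class ToyCalculator {
 public:
  virtual ~ToyCalculator() = default;
  virtual bool Open() = 0;                                     // one-time setup
  virtual bool Process(float input, std::string* output) = 0;  // per packet
  virtual bool Close() = 0;                                    // teardown
};

// Pretends to be a classifier: the "model" is just a score threshold.
class ThresholdCalculator : public ToyCalculator {
 public:
  bool Open() override {
    threshold_ = 0.5f;  // stands in for loading a model
    loaded_ = true;
    return true;
  }
  bool Process(float score, std::string* label) override {
    if (!loaded_) return false;  // Process() before Open() is an error
    *label = score >= threshold_ ? "Dog" : "NotDog";
    return true;
  }
  bool Close() override {
    loaded_ = false;  // stands in for releasing the model
    return true;
  }

 private:
  float threshold_ = 0.0f;
  bool loaded_ = false;
};

// Drives one calculator over a stream of input packets in lifecycle order.
std::vector<std::string> RunToyGraph(ToyCalculator& calc,
                                     const std::vector<float>& packets) {
  std::vector<std::string> outputs;
  if (!calc.Open()) return outputs;
  for (float packet : packets) {
    std::string out;
    if (!calc.Process(packet, &out)) break;
    outputs.push_back(out);
  }
  calc.Close();
  return outputs;
}
```

Calling RunToyGraph with a ThresholdCalculator and the scores 0.8 and 0.2 yields one label per input packet, which mirrors how the real calculator emits one ClassificationList per incoming image.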

Input/Output Packets and Computation Graphs

A graph is a combination of input data and vertices (calculators) that implements a specific computation pipeline. The graph is defined in Protobuf text format, much like a TensorFlow graph, and can be described in a .pbtxt file.

A calculator can also expose options, declared in a Protobuf message, that are set per node in the graph. This is how you specify, for example, the image size for resizing.

For instance, the image above could be described by the following graph:

input_stream: "input1"
input_stream: "input2"
input_stream: "input3"
output_stream: "output1"
output_stream: "output2"
node {
  calculator: "CalculatorCalculator1"
  input_stream: "INPUT_TAG1:input1"
  input_stream: "INPUT_TAG2:input2"
  input_stream: "INPUT_TAG3:input3"
  output_stream: "OUTPUT_TAG:output"
}
node {
  calculator: "CalculatorCalculator2"
  input_stream: "INPUT_TAG:output"
  output_stream: "OUTPUT_TAG1:output1"
  output_stream: "OUTPUT_TAG2:output2"
}

Mediapipe adopts the notation TAG:variable_name for streams. Tags let a calculator find the inputs it needs in the context. For instance, the calculator code above declares two input tags: "IMAGE" for an image stored on the CPU and "IMAGE_GPU" for an image stored on the GPU. Process() then handles whichever of these inputs was connected in the graph.
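To make the TAG:variable_name notation concrete, here is a minimal sketch of splitting such a string into its two parts. SplitStreamSpec is a hypothetical helper written for this article, not a Mediapipe function, and it deliberately ignores the optional index form (TAG:0:name).

```cpp
#include <string>
#include <utility>

// Splits a stream spec such as "IMAGE:image" into {tag, name}.
// A spec without a colon (e.g. "input1") yields an empty tag.
// Simplification: the indexed form "TAG:0:name" is not handled.
std::pair<std::string, std::string> SplitStreamSpec(const std::string& spec) {
  const std::string::size_type colon = spec.find(':');
  if (colon == std::string::npos) return {"", spec};
  return {spec.substr(0, colon), spec.substr(colon + 1)};
}
```

With this helper, "INPUT_TAG1:input1" splits into the tag the calculator looks up and the stream name that connects nodes in the graph, while a top-level "input1" has a name but no tag.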

Mediapipe even has its own service where you can upload a graph and view a beautiful visualization: https://viz.mediapipe.dev/. Netron (netron.app) can also visualize graphs, though it's less detailed and more schematic.

Standard Calculators in Mediapipe

Mediapipe defines many standard types and calculators that implement various operations on them. The basic types include video, images, audio, text, and time series. On top of these come tensors, the standard type in machine learning, and the types that hold model prediction results: Detections, Classifications, and Landmarks (keypoints). Examples can be viewed here.

Calculators allow for various transformations on these types, converting data from one type to another.

All calculators (and subgraphs) are defined here. Most of them can be found in the calculators/ or gpu/ folders. There's no external documentation for them, but almost all the code is documented: comments describing the calculator are provided alongside the calculator definition. In addition to this, alongside the calculator (recorded in .cc and .h files), there may be a .proto file. These define the so-called calculator options and are written in Protobuf format.

As an example, let's consider the task of detecting objects in an input image using a specific ML model. What is needed for this:

Receive the input image and resize it to 640×640 (assuming our model expects an input tensor of this size — these numbers need to be specified in the option of the respective calculator).
Transform the resized image into a tensor.
Pass this tensor through the model and obtain raw predictions.
Convert the raw predictions into a type recognizable by Mediapipe, called Detections.
Output this type from the graph.

In Mediapipe, all these actions translate into an elegant graph — a sequence of vertices with specific types of input/output data and vertex parameters:

# Define the incoming and outgoing streams (for example, we can catch the
# outgoing one with a callback in the application)
input_stream: "image"
output_stream: "detections"
# The input image is transformed into an image of size 640x640 while
# maintaining the aspect ratio. The rest of the picture is filled with zero
# pixels (a transformation known as letterboxing).
# Then the image is transformed into a tensor required by the TensorFlow Lite model.
node {
  calculator: "ImageToTensorCalculator"
  input_stream: "IMAGE:image"
  output_stream: "TENSORS:input_tensor"
  output_stream: "LETTERBOX_PADDING:letterbox_padding"
  options: {
    [mediapipe.ImageToTensorCalculatorOptions.ext] {
      output_tensor_width: 640
      output_tensor_height: 640
      keep_aspect_ratio: true
      output_tensor_float_range {
        min: 0.0
        max: 255.0
      }
      border_mode: BORDER_ZERO
    }
  }
}
# Here we simply perform inference using the tflite model
# There is support for GPU, tflite, nnapi delegates, etc.
node {
  calculator: "InferenceCalculator"
  input_stream: "TENSORS:input_tensor"
  output_stream: "TENSORS:output_tensor"
  options: {
    [mediapipe.InferenceCalculatorOptions.ext] {
      model_path: "model.tflite"
      delegate { gpu {} }
    }
  }
}
# Decodes the tensors obtained by the TensorFlow Lite model into an output type of Detections
# Each Detection consists of at least a Label, Label_id, Score, and a Bounding Box
# describing the detected object in the format (xmin, ymin, width, height), where
# the coordinates are normalized in the range [0,1]
node {
  calculator: "TensorsToDetectionsCalculator"
  input_stream: "TENSORS:output_tensor"
  output_stream: "DETECTIONS:raw_detections"
  options: {
    [mediapipe.TensorsToDetectionsCalculatorOptions.ext] {
      num_classes: 1
      num_boxes: 2
      num_coords: 4
      min_score_thresh: 0.5
    }
  }
}
# Converts the detection coordinates obtained by the model relative to the resized image with letterboxing into detection coordinates relative to the original image
node {
  calculator: "DetectionLetterboxRemovalCalculator"
  input_stream: "DETECTIONS:raw_detections"
  input_stream: "LETTERBOX_PADDING:letterbox_padding"
  output_stream: "DETECTIONS:detections"
}

The received Detections can be returned to the context and further processed (for example, saved into files with annotations or drawn on the initial image).

The Graph We Created

Now, let's transition from generalities to specifics.

We needed to port a two-stage pipeline (detection + classification) so that it ran in real time on mobile phones. Moreover, in some cases the classification model had to be followed by C++ code that processed the image using a specific algorithm.

For the detection model, we used Yolov5S, and for classification, we used MobileNetV2.
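In a two-stage pipeline, each box produced by the detector has to be turned into a pixel region that is cropped and passed to the classifier. The following is a minimal sketch of that conversion in plain C++, not the code from our graph: NormBox, PixelRect, and ToPixelRect are hypothetical names, and the box is assumed to be normalized to [0, 1] in (xmin, ymin, width, height) form.

```cpp
#include <algorithm>
#include <cmath>

struct NormBox {
  float xmin, ymin, width, height;  // normalized to [0, 1]
};

struct PixelRect {
  int x, y, width, height;  // in pixels, inside the image
};

// Converts a normalized detection box into a pixel rectangle, rounding
// outward and clamping to the image so the crop never reads out of bounds.
PixelRect ToPixelRect(const NormBox& box, int image_w, int image_h) {
  int x0 = static_cast<int>(std::floor(box.xmin * image_w));
  int y0 = static_cast<int>(std::floor(box.ymin * image_h));
  int x1 = static_cast<int>(std::ceil((box.xmin + box.width) * image_w));
  int y1 = static_cast<int>(std::ceil((box.ymin + box.height) * image_h));
  x0 = std::max(0, std::min(x0, image_w));
  y0 = std::max(0, std::min(y0, image_h));
  x1 = std::max(x0, std::min(x1, image_w));
  y1 = std::max(y0, std::min(y1, image_h));
  return {x0, y0, x1 - x0, y1 - y0};
}
```

Clamping matters in practice because detectors routinely predict boxes that extend slightly past the frame, and an unclamped crop is an easy source of crashes on device.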

We quickly found out the following things:

for some reason, custom graphs don't work in Python;
general community discussion around MediaPipe is more dead than alive;
when you seek support in GitHub Issues, the typical response is a copy-pasted "could you please provide the error logs" (after you've already provided them, of course), after which your question is ignored for months.

On the other hand, their examples worked well in C++, and the libraries built successfully for iOS and Android, so we decided not to give up.

In the end, we implemented the following graph:

What other useful information can we share with the world?

MediaPipe itself implements calculators that perform inference only with TensorFlow models. We personally used only tflite, but there's supposedly support for saved_model and frozen_graph. If you are, for instance, a PyTorch enthusiast, be prepared to either convert your model to tflite or write your own inference calculator.
Developers unhesitatingly push to master, so if you want to be sure of a successful build, stick to a particular tag.
Be prepared for battles with various delegates, OpenGL, and OpenCL if your model deviates from some "standard set". Seriously, it's not so simple to find and convert a detection model that launches in MediaPipe with GPU support.
Writing custom calculators is relatively straightforward; fortunately, there are plenty of examples in the repository to use as a foundation.
Don't ignore setup_{}.sh scripts — they correctly configure OpenCV and Android SDK/NDK for MediaPipe.
Well, in general, the graph described above can be realistically implemented in MediaPipe and built into libraries for Android and iOS, operating at a decent FPS.

Creating MediaPipe graph libraries for iOS and Android

MediaPipe uses Bazel for the build process. As with graph construction, we decided to go from simple to complex. So, what needs to be done to build a solution with a custom graph for Linux:

Write this graph and, for instance, place it in a new folder as mediapipe/graphs/custom_graph/custom_graph.pbtxt (analogous to this solution). By the way, starting from version 0.8.11, it's not mandatory to store everything inside //mediapipe.
In the folder with this graph, manually enter all the required dependencies for it (mostly calculators) in cc_library in the BUILD (for example, from the previous point).
Put in the folder mediapipe/examples/desktop/custom_example/ a BUILD file in which you define a cc_binary dependent on the calculators from the previous point and CPP code that launches this graph (example).
Build with the command:

bazel build -c opt --define MEDIAPIPE_DISABLE_GPU=1 mediapipe/examples/desktop/custom_example:{cc_binary name}

After this, a cc_binary will be built (you will see the path to it in the logs), and all that remains is to run this graph with the command:

GLOG_logtostderr=1 {cc_binary_path} \
  --calculator_graph_config_file=mediapipe/graphs/custom_graph/custom_graph.pbtxt \
  --input_side_packets=input_video_path=<input video path>,output_video_path=<output video path>

It seems not too complicated, and, in essence, the original version of these instructions can also be found in the documentation.

It's important to understand that the graph and its execution are two different things. In the example above, we are running a graph that expects a path to a video as input (and we specify it in the argument --input_side_packets=input_video_path), but the actual graph is executed within C++ code.
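To show what "executed within C++ code" involves on the input side, here is a minimal sketch of how a runner might turn the --input_side_packets value into named entries before handing them to the graph. ParseSidePackets is a hypothetical helper written in plain C++ for this article. As far as we can tell, the desktop example binary splits the flag in a similar way, but treat that as an assumption rather than a description of its source.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parses "key1=value1,key2=value2" into a map of side packet names to
// string values. Entries without '=' or with an empty key are skipped.
std::map<std::string, std::string> ParseSidePackets(const std::string& flag) {
  std::map<std::string, std::string> packets;
  std::stringstream stream(flag);
  std::string item;
  while (std::getline(stream, item, ',')) {
    const std::string::size_type eq = item.find('=');
    if (eq == std::string::npos || eq == 0) continue;  // malformed entry
    packets[item.substr(0, eq)] = item.substr(eq + 1);
  }
  return packets;
}
```

The keys produced here (input_video_path, output_video_path) must match the side packet names declared in the .pbtxt file; a mismatch is reported only when the graph starts, not at build time.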

The last thing to note during Linux compilation is external dependencies. In one of our calculators, we used an external library written in C++. We added it similarly to how OpenCV was added: pre-compiled it, created a new_local_repository in the WORKSPACE file (for OpenCV, it's http_archive) and wrote the corresponding BUILD file in the third_party/ directory. We then declared this dependency in the BUILD file of the calculator.

Assuming we didn't encounter any errors at this stage, we can move on!

First and foremost, you should decide on the application code where you intend to run this graph. For Android, there are already two examples recommended by the official documentation (one and two). For iOS, there is a guide and a template, so at this stage, an iOS developer is indeed required.

Don't forget to determine the exact type of image entering the graph — whether it's an Image or Image_GPU (this depends on the application code). Set delegates based on which accelerators you can/want to use on the device. Most importantly, ensure that the input_stream and output_stream in the graph and the application align.

Next come the build specifics, and, naturally, they vary for Android and iOS.

Android

For Android, at a minimum, there's no need for Xcode or a Mac.

Overall, the existing instruction describes in detail the process of building an AAR for Android. After creating the AAR, it's still necessary to build the graph into a .binarypb. To do this, you need to add the graph in the BUILD file and run the command: bazel build -c opt mediapipe/graphs/custom_graph:custom_graph.

Then, place all models, txt files with classes, and the graph itself into app/src/main/assets in the application (we took the first example). In app/src/main/java/com/example/myfacedetectionapp/MainActivity.java, properly set the names for the input_stream and output_stream, as well as the built graph.

If there's no need to connect additional libraries to MediaPipe, congratulations are in order! However, as I mentioned earlier, there was an external dependency in our build. Just like with Linux, we added it similarly to OpenCV. We pre-built the C++ code for arm64-v8a and armeabi-v7a. We updated the WORKSPACE, added it to third_party/ and… nothing happened!

Despite the dependency being set in the BUILD file of the calculator, the Android AAR didn't recognize it. Everything turned out to be slightly more complicated: AAR is built using a special Bazel rule written by the developers, which you can find here.

When we added the native.cc_library with our dependency there (remembering everything was analogous to OpenCV), things seemed better. However, for some reason, Bazel refused to find the libraries built for Android. Adding --linkopt=-L{path} flags pointing to them solved this problem, and the application began to work correctly.

iOS

And now we come to iOS. If you've found a Mac and successfully installed Xcode on it, you're almost there! In our specific case, the goal was not to create an application but a library to be used in a client's application. We again created a BUILD file for our framework. As a reference, you can take this one. We replaced the ios_application rule with the ios_framework rule. For this, you need to create a header file declaring the library's methods and add it to the hdrs argument. This guide will help set up the bundle_id. Next, you only have to correctly link the graph, models, and txt files with classes for the models, and run the build command bazel build --config=ios_arm64 :FrameworkName --verbose_failures from the folder with the BUILD file.

There's no need to separately build the graph or anything else — Bazel logs will show the path to a ZIP archive containing all the necessary files.

If you have been following our story closely, you might wonder: how did we add an external dependency here? As it happens, we were quite fortunate this time around: our external dependency had an existing wrapper for iOS, which we simply linked correctly in the deps argument of the ios_framework rule (and placed in the application's folder).

Conclusion

Hope you found this interesting! We didn't aim to cover everything about Mediapipe nor claim to provide a comprehensive guide on Mediapipe. Having introduced you to the context in the first half of the article, in the second half, we attempted to share our journey towards real-time inference, enriching it with useful tips, links, and tricks. At the very least, you won't have to face the same hurdles we encountered.

If you're not particularly eager or don't have the resources to delve into the complex world of real-time ML model inference on iOS and Android, we recommend considering Mediapipe. Undoubtedly, it's not a cure-all, and you'll still invest time working with it. However, due to a large number of pre-made and optimized modules for handling data or models, it will be decidedly simpler and quicker than building everything from scratch. And given this article's existence, the process should be almost painless ;)
