Improving Security, Costs and Efficiency of Data Ingestion With Diffgram Connections

Pablo Estrada · Published in Diffgram · Aug 26, 2022


Building machine learning models involves moving large amounts of data through a set of processes that add or remove information from the original data before it is eventually transmitted to a training process that builds an AI model.

Diffgram is a tool that helps manage all of that, from ingestion to data annotation, all the way to the training process. We’ve been pretty successful helping a lot of data scientists and ML engineers with this process. But when it comes to big enterprise projects, there are new issues that need to be tackled.

Main Concerns when Building Big Training Datasets

  • We need to keep data secured.
  • We have huge amounts of data. Even a single copy of the blob storage can represent thousands of dollars in data transfer costs.
  • We want to make sure only authorized members can access the data.
  • We want to save costs by reducing data transfer and data storage as much as possible.

The Standard Approach

Diffgram comes out of the box with a default Blob Storage. This is where all the ingested data gets stored. This approach has its own set of benefits: it isolates the data from other applications and provides a single source for all the training data.

The Need for Referencing Data Instead of Moving It

However, in more complex contexts, you might have multiple blob storages, with different permission hierarchies and folder structures. Those folder structures might be a critical part of your business organization or security policies, so they cannot be changed that easily. In addition, transferring all of this into a new blob storage for Diffgram can mean high data transfer costs and duplicated data storage costs. At really large scales, this becomes a significant problem.

The solution we now offer is the new “Pass by Reference” upload mechanism. You can use the existing Connection System, along with the blob path of your media object, to upload new data into Diffgram.

Standard Upload

  1. User selects a file from their computer or blob storage.
  2. User goes through the upload wizard or performs SDK calls to upload the data (see the sketch after this list).
  3. Diffgram receives the raw blob bytes and re-uploads them to the default blob storage.
  4. You can now manage your new file in Diffgram.
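If you go the SDK route, a standard upload looks roughly like the sketch below. This is a minimal illustration rather than official documentation: the host, credentials, and file path are placeholders, and the exact method names may vary between SDK versions, so check the Diffgram docs for your install.

```python
# Illustrative sketch of the standard upload path (all values are placeholders).
# The SDK sends the raw file bytes, and Diffgram re-uploads them to its default blob storage.
from diffgram import Project

project = Project(
    host="https://diffgram.example.com",   # your Diffgram install (placeholder)
    project_string_id="my_project",        # placeholder project ID
    client_id="LIVE__client_id",           # placeholder API credentials
    client_secret="client_secret",
)

# Uploads the file bytes through Diffgram, which stores a copy in the default blob storage.
file = project.file.from_local(path="./images/sample_frame.jpg")
```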

Pass By Reference Approach:

  1. You create a Connection in Diffgram to your cloud provider.
  2. You reference the blob path and bucket of the desired file.
  3. You add the Connection ID you created in step 1 to handle the blob storage provider authorization.
  4. You upload the data using the new “from blob path” endpoint (see the sketch after this list).
  5. Diffgram handles the file by reference, downloading it only when needed in the annotation tool.
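Assuming the SDK exposes the pass-by-reference call described above, the flow could look like the sketch below. The method name and parameters follow the steps in this post, but the connection ID, bucket name, and blob path are placeholders, so verify the exact signature against the Diffgram documentation for your version.

```python
# Illustrative sketch of a pass-by-reference upload (all values are placeholders).
# Diffgram stores only a reference; the bytes stay in your existing bucket.
from diffgram import Project

project = Project(
    host="https://diffgram.example.com",   # placeholder
    project_string_id="my_project",
    client_id="LIVE__client_id",
    client_secret="client_secret",
)

# Register the file by reference: no bytes are copied into Diffgram's default storage.
file = project.file.from_blob_path(
    blob_path="training_data/images/frame_0001.jpg",  # path inside your bucket (placeholder)
    bucket_name="my-company-training-data",           # your existing bucket (placeholder)
    connection_id=7,                                   # Connection created in step 1 (placeholder)
)
```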

A high-level overview can be seen in the following diagram.

Standard Upload Vs Blob Reference in Diffgram

Benefits of the Pass By Reference Upload Process:

  • Reduced storage costs. Now there is just a single copy of your blob data.
  • Increased data ingestion rate. When uploading new data, Diffgram no longer needs to handle the blob bytes, which significantly increases the speed of a data annotation pipeline.
  • Reduced Data Transfer Costs. No data needs to be re-uploaded, so we also save some $$ there.
  • Customized Permissions for Data Access: You can keep your bucket access policies and make sure only authorized people are able to see the data. Data is not scattered across multiple blob storages.
  • Easier Management: Your IT department now focuses just on managing your core blob storages. Diffgram remains agnostic to the inner structure of your blob storage.

This addition to the Diffgram core services allows Diffgram to serve use cases with higher security and scaling complexity. The standard upload still has its place, but now that you can use Connections to reference file uploads, you can turn Diffgram into a multi blob storage, multi cloud provider training data management tool.

What are your thoughts on this? Do you have any ideas to improve this new area? Feel free to join us on GitHub or Slack.
