It’s a common enough operation: a client needs to upload or download something (an image, a video, an audio file) to or from your service. But the way this is handled on the server side affects your application, your cloud costs, the availability of your system, and more.
The most common (naive) approach is to define an endpoint in the API that acts as a proxy between the client and cloud storage. This works, but it has some issues worth addressing.
By acting as a proxy, the service burns compute time, and it has to spin up another goroutine/thread/process for the transfer while still handling new requests.
That means real money is being spent shuttling bytes for every upload and download, and more resources are needed to serve other requests (made worse if the server isn’t multi-threaded), so the system will need to scale horizontally sooner.
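To make the cost concrete, here is a minimal sketch of that proxy approach, assuming the AWS SDK for Go v2 and a hypothetical bucket name and query-parameter key scheme; the handler ties up a goroutine and streams every uploaded byte through the API process on its way to storage.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// Naive proxy: every uploaded byte flows through this process
	// before it reaches cloud storage.
	http.HandleFunc("/upload", func(w http.ResponseWriter, r *http.Request) {
		_, err := client.PutObject(r.Context(), &s3.PutObjectInput{
			Bucket:        aws.String("example-uploads"),         // hypothetical bucket
			Key:           aws.String(r.URL.Query().Get("name")), // hypothetical key scheme
			Body:          r.Body,
			ContentLength: aws.Int64(r.ContentLength),
		})
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		fmt.Fprintln(w, "stored")
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```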
There’s another way, one that is cheaper in terms of workload, and cheaper in terms of cash.
Using (pre)signed URLs.
If, instead of proxying the data itself, the system tells the client “GET/PUT your data ‘here’” and leaves the rest to the client and the cloud storage, then your API service can move on to the next request.
Below are links to the documentation for generating signed URLs with the common cloud providers:
- AWS Presigned URLs
- Azure Shared Access Signatures
- GCP Signed URLs
- Minio PresignedURL Get Object example
The idea is that your API generates a signed URL for an object in the bucket and passes it back to the client. When I do this I return HTTP status 302, telling the client it is being redirected to the new URL to perform its PUT/POST/GET operation.
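As a rough sketch of that flow, here is a handler that presigns a GET for an S3 object and redirects the client to it with a 302. It assumes the AWS SDK for Go v2, plus a hypothetical bucket name and query-parameter key scheme.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	presigner := s3.NewPresignClient(s3.NewFromConfig(cfg))

	http.HandleFunc("/download", func(w http.ResponseWriter, r *http.Request) {
		// Generate a short-lived URL the client can fetch directly from storage.
		req, err := presigner.PresignGetObject(r.Context(), &s3.GetObjectInput{
			Bucket: aws.String("example-downloads"),        // hypothetical bucket
			Key:    aws.String(r.URL.Query().Get("name")),  // hypothetical key scheme
		}, s3.WithPresignExpires(15*time.Minute))
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// Hand the transfer off to the cloud: redirect the client to the signed URL.
		http.Redirect(w, r, req.URL, http.StatusFound)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The API’s involvement ends as soon as the redirect is written; the bytes never pass through this process.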
Note that the signed URL carries a time limit that determines how long a client can use it before needing a fresh one. Set it to an appropriate duration, taking slow networks into account.
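For the upload side, here is a sketch using the MinIO Go client (one of the SDKs linked above); the endpoint, credentials, bucket, and object name are placeholders, and the final argument is the expiry described above.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Placeholder endpoint and credentials; substitute your own.
	client, err := minio.New("play.min.io", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	// The client has 10 minutes to use this URL before it expires.
	url, err := client.PresignedPutObject(context.Background(),
		"example-uploads", "photo.jpg", 10*time.Minute)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("PUT your file to:", url)
}
```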
Each cloud uses its own terminology, and the buckets used for uploads and downloads need to be configured appropriately: the identity signing the URLs must be allowed to write to the uploads bucket and read from the downloads bucket, and the signed URL then delegates that access to the client for its lifetime.
The advantage of this strategy is that the API is only involved long enough to generate the URL for the client; handling the upload/download becomes the cloud’s responsibility. The API uses less compute time, with all the savings that come with that.
Summary
In my opinion, bulk data transfer with the cloud is a lot easier, cheaper, and faster when the cloud is used to its full capabilities. Taking the API/endpoints out of the equation as much as possible makes everyone’s life a lot more enjoyable.