A Data Scientist's Guide to Docker

Charlie | June 1st 2018

‘What’s the point of Docker?!’ - At Miminal we hear this from lots of data scientists we know. Docker is a tool that performs something called containerisation, and it has revolutionised cloud computing in the last five years. In this article I am going to explain why it is important and useful for data scientists. If you’re not a data scientist, learn about what we do here.

Docker is based on the use of containers, but what’s a container? You can think of a container as a lightweight virtual machine; unlike a full virtual machine, though, it shares the host operating system’s kernel, which is what keeps it so light. A container packages everything an application needs to run: the code, the runtime and the exact versions of every software package it depends on. With Docker you can create a shareable snapshot, called an image, of everything that is needed to run an application.
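To give a flavour of how this works, here is a minimal sketch of a Dockerfile, the recipe Docker uses to build an image. It assumes a hypothetical project with a train.py script and a requirements.txt file listing its Python dependencies:

    # Start from an official Python image with a pinned version
    FROM python:3.6

    # Copy the dependency list in and install exact versions
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install -r requirements.txt

    # Copy the analysis code into the image
    COPY train.py .

    # Run the script when the container starts
    CMD ["python", "train.py"]

Anyone who builds this image gets the same Python version and the same packages, no matter what machine they are on.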

This ability to share a contained snapshot of your application and all of its dependencies has several powerful implications. The first is reproducibility. After spending weeks obtaining, cleaning and analysing data to build a machine learning model, the last thing you want is for your demonstration to fail when presenting to a CEO, or for an issue to come up when handing your model over to software engineers to use in their development. Docker makes these issues a problem of the past. Because your model and all of its dependencies are housed in the Docker container, anyone with the container can run it and get the same outcome as if they were using your computer. Gone are the days of shipping a Python virtual environment with a pip freeze of the requirements!
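Building and running the image takes just two commands. This sketch assumes the Dockerfile above and a hypothetical image name, my-model:

    # Build an image from the Dockerfile in the current directory
    docker build -t my-model .

    # Run it - the same environment and results on any machine with Docker
    docker run my-model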

The second advantage of using Docker for data science is portability. Very often when exploring data and training models you are limited by the hardware you are working on, and laptops are rarely powerful enough for serious analysis. Being able to send your working environment to the cloud, where you can rent as much power as you need, speeds up your workflow immeasurably. In the past data scientists have been put off by the time wasted setting up their preferred environment in the cloud; Docker completely removes this fear.
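Moving your environment to a cloud machine is just a matter of pushing the image to a registry and pulling it on the other side. A sketch, assuming the hypothetical my-model image from above and a hypothetical Docker Hub account called yourname:

    # On your laptop: tag the image and push it to Docker Hub
    docker tag my-model yourname/my-model
    docker push yourname/my-model

    # On the rented cloud machine: pull and run the identical environment
    docker pull yourname/my-model
    docker run yourname/my-model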

A third advantage of Docker follows from reproducibility and portability: it takes the complexity out of setting up machine learning environments. Setting up a deep learning environment such as TensorFlow with a local GPU is typically very difficult, but with Docker you can use pre-built images that have already been set up for you! You can pull images from Docker Hub (an open registry of Docker images) and with a few commands your environment is ready and raring to go.
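For example, TensorFlow publishes ready-made GPU images on Docker Hub. The sketch below assumes you have NVIDIA’s nvidia-docker runtime installed so the container can see your GPU:

    # Pull a pre-built GPU-enabled TensorFlow image from Docker Hub
    docker pull tensorflow/tensorflow:latest-gpu

    # Start an interactive Python session with GPU access via nvidia-docker
    nvidia-docker run -it tensorflow/tensorflow:latest-gpu python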

Docker has revolutionised how scalable applications are built, from development right through to production. As highlighted above, the properties that make it so useful for software development apply just as well to data science. So to all you data scientists out there, why aren’t you using Docker?

If you enjoyed this check out our other blogs here.