Deploy your Data Science Stack with Jupyter Notebooks on the Cloud

Ananya Mukherjee
4 min read · Jul 14, 2022

From the most accessible strategy to the most complex!

Photo by Kelly Sikkema on Unsplash

As a data scientist and full-stack engineer, I have often needed to set up a big-data biomedical/biotech stack for the various companies I have worked for. Depending on the project's budget, the team setup, and staff expertise, there are various ways to build a data-science stack.

I highlight a few here, along with how to get started with each.

The following options range from the simplest to the most complex stack in terms of implementation, monitoring, and maintenance.

Option 1 — Use an Implemented Strategy

Use the AWS-managed SageMaker Jupyter notebook servers.

These are easy to deploy and are managed by AWS, so regular resource management and monitoring are not required. Machine-learning packages come pre-installed.

There are some downsides associated with this approach:

  • Costs are tied to the AWS resources you consume.
  • Customization options are limited.
  • RStudio Server is not integrated with this offering, but if AWS customizes it on their end, it could be extended for use with R as well.

To deploy:

  • Go to the AWS Management Console.
  • Search for “SageMaker Studio”.
  • Attach IAM roles, specifically for the S3 buckets that house the data to be loaded into notebooks for analysis.

SageMaker Studio takes a few minutes to come up, and users can then work in it as they would on any notebook server.
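The console flow above is the simplest path. As a rough programmatic equivalent, here is a minimal boto3 sketch that spins up a classic SageMaker notebook instance (a close cousin of the Studio setup described above); the instance name, instance type, and role ARN are placeholders for your own account.

```python
# Minimal sketch: creating a SageMaker notebook instance with boto3.
# The instance name, type, and role ARN are placeholders -- adjust to your account.
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

sagemaker.create_notebook_instance(
    NotebookInstanceName="analysis-notebook",   # any unique name
    InstanceType="ml.t3.medium",                # pick per workload
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # role with S3 access
)

# Wait until the instance is in service, then fetch a login URL for the notebook UI.
waiter = sagemaker.get_waiter("notebook_instance_in_service")
waiter.wait(NotebookInstanceName="analysis-notebook")

url = sagemaker.create_presigned_notebook_instance_url(
    NotebookInstanceName="analysis-notebook"
)["AuthorizedUrl"]
print(url)
```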

Option 2 — A More Hands-On Solution

This approach hosts an AWS EC2 instance with Jupyter Notebook/JupyterLab installed. A snapshot of this instance can be used to redeploy the server whenever it is needed, so you do not have to maintain a long-running instance.

The EC2 instance needs to be created with specific IAM roles and security groups, and the notebook server is deployed from a terminal you SSH into. Just like the first option, machine-learning packages can be preloaded into the instance via the terminal, and the instance can be relaunched from its snapshot whenever it is needed.
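For illustration, here is a minimal boto3 sketch of relaunching the server from a pre-built image; the AMI ID, key pair, security group, and instance profile are placeholder names.

```python
# Minimal sketch: relaunching the notebook server from a pre-built AMI
# (the "snapshot" mentioned above) with boto3. All IDs/names are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",             # AMI baked with Jupyter + ML packages
    InstanceType="t3.large",
    MinCount=1,
    MaxCount=1,
    KeyName="notebook-ssh-key",                  # key pair for SSH access
    SecurityGroupIds=["sg-0123456789abcdef0"],   # allows SSH (22) and Jupyter (8888)
    IamInstanceProfile={"Name": "notebook-s3-access"},  # role granting S3 read access
)

instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched {instance_id}; SSH in and start Jupyter, or tunnel port 8888.")
```

Once the instance is running, you would typically SSH in (or tunnel port 8888 to your laptop) to reach the notebook; the default SSH user depends on the AMI you built the snapshot from.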

Although this option allows for more customization and cost control, it is not very collaborative: it is a stand-alone setup, it is not easily shareable, and user sessions may not be reproducible without adding a layer of middleware or customization. This is explored further in the next option.

Option 3 — Make It Easier for Others to Use Your Infrastructure

This approach adds a basic web page with a login flow that lets users launch notebook resources themselves. It extends the previous option by integrating Jupyter notebooks with a one-click launcher rendered by a Flask- or Django-based web page hosted on AWS.

The user simply clicks a button to launch a notebook server with the size and resources they need. This requires a snapshot with Jupyter and the required packages pre-installed, and a “launch” button that triggers a new EC2 instance from that snapshot for each single-user session. A minimal sketch of such a launch endpoint is shown below.
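This is only a sketch of the idea, assuming a Flask backend (the article mentions Flask/Django; Flask is used here for brevity). The AMI ID, size-to-instance-type mapping, and tag names are all placeholders.

```python
# Minimal sketch of the "launch" button's backend: a Flask route that starts
# a new EC2 instance from the pre-loaded snapshot. All IDs/names are placeholders.
import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
ec2 = boto3.client("ec2", region_name="us-east-1")

# Map the sizes shown on the web page to EC2 instance types.
SIZES = {"small": "t3.medium", "medium": "t3.xlarge", "large": "m5.2xlarge"}

@app.route("/launch", methods=["POST"])
def launch_notebook():
    size = request.form.get("size", "small")
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",          # snapshot with Jupyter pre-installed
        InstanceType=SIZES.get(size, "t3.medium"),
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "owner", "Value": request.form.get("user", "unknown")}],
        }],
    )
    return jsonify(instance_id=resp["Instances"][0]["InstanceId"])

if __name__ == "__main__":
    app.run(port=5000)
```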

Usage has to be monitored for resource management. Amazon SNS (together with CloudWatch alarms) can be integrated to report on idle instances and trigger automatic shutdown.
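One way to implement this, sketched below, is a CloudWatch alarm that stops an instance when CPU stays low for a while and notifies an SNS topic. The instance ID, topic ARN, and thresholds are placeholders; tune them to your usage patterns.

```python
# Minimal sketch of idle-instance handling: a CloudWatch alarm that stops the
# instance after ~30 minutes of low CPU and notifies an SNS topic.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="notebook-idle-shutdown",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,                 # 5-minute samples
    EvaluationPeriods=6,        # ~30 minutes of low CPU counts as idle
    Threshold=5.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[
        "arn:aws:automate:us-east-1:ec2:stop",                 # stop the instance
        "arn:aws:sns:us-east-1:123456789012:notebook-alerts",  # notify the team
    ],
)
```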

Option 4 — Build Everything Yourself (or with Your Team)

This option integrates core cloud-computing resources with a managed, collaborative notebook-server system, building on the fundamentals of options 2 and 3.

We can leverage the managed Kubernetes services on AWS (EKS) or GCP (GKE) to launch and manage collaborative infrastructure with Jupyter notebooks/JupyterHub and integrated RStudio servers.

Use pre-existing open-source (or internally developed) Docker images that bundle Jupyter Notebook, RStudio kernels, and Python ML packages, and leverage Kubernetes container orchestration (already supported and offered as part of the major cloud platforms) to:

  • deploy and manage resources and maintain a basic web page (for example, Django-rendered),
  • expose an external web endpoint for users to access, and
  • let users work collaboratively and save their work sessions through persistent disk storage.

Apart from linking the various pieces through APIs, this would also need some middleware to weave them together so that they work as a seamless unit.
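To make the JupyterHub-on-Kubernetes piece concrete, here is a minimal sketch of a jupyterhub_config.py assuming the KubeSpawner spawner (the approach used by the Zero to JupyterHub setup). The image name, storage sizes, and resource limits are placeholders.

```python
# Minimal sketch of jupyterhub_config.py for JupyterHub on Kubernetes with
# KubeSpawner. Image names, storage sizes, and limits are placeholders.
c = get_config()  # noqa: F821 -- provided by JupyterHub at load time

# Spawn each user's notebook server as a Kubernetes pod.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Docker image with Jupyter, RStudio kernels, and Python ML packages baked in.
c.KubeSpawner.image = "my-registry/datasci-notebook:latest"

# Persistent volume per user so work survives pod restarts.
c.KubeSpawner.storage_pvc_ensure = True
c.KubeSpawner.storage_capacity = "10Gi"
c.KubeSpawner.volumes = [
    {"name": "home", "persistentVolumeClaim": {"claimName": "claim-{username}"}}
]
c.KubeSpawner.volume_mounts = [{"name": "home", "mountPath": "/home/jovyan"}]

# Resource limits per user session.
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.mem_limit = "8G"
```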

Furthermore, this design can be extended to implement the entire platform described below, complete with security set-ups such as OIDC (enterprise email-based login) and AuthNZ.

AuthNZ (authentication and authorization) can be handled by OIDC-based external identity providers such as Google or GitHub.
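As a sketch of how this plugs into JupyterHub, the oauthenticator package provides a GenericOAuthenticator for OIDC providers; the client ID/secret and endpoint URLs below are placeholders, and trait names can differ slightly across oauthenticator versions.

```python
# Minimal sketch: OIDC login for JupyterHub via oauthenticator's
# GenericOAuthenticator. All IDs, secrets, and URLs are placeholders.
c = get_config()  # noqa: F821

c.JupyterHub.authenticator_class = "oauthenticator.generic.GenericOAuthenticator"

c.GenericOAuthenticator.client_id = "jupyterhub-client-id"
c.GenericOAuthenticator.client_secret = "jupyterhub-client-secret"
c.GenericOAuthenticator.oauth_callback_url = "https://hub.example.com/hub/oauth_callback"

# OIDC endpoints exposed by the identity provider.
c.GenericOAuthenticator.authorize_url = "https://idp.example.com/oauth2/authorize"
c.GenericOAuthenticator.token_url = "https://idp.example.com/oauth2/token"
c.GenericOAuthenticator.userdata_url = "https://idp.example.com/oauth2/userinfo"
c.GenericOAuthenticator.username_claim = "email"   # log users in by enterprise email

# Simple authorization layer: only allow these users (or manage via groups).
c.Authenticator.allowed_users = {"alice@example.com", "bob@example.com"}
```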

Launching workflows written in CWL or WDL, interacting with them from a CLI, and using Jupyter notebooks can all be done through the web-based interface (built on the Django Python framework).

This interface itself sits on top of Kubernetes through abstraction layers and APIs. These layers have role-based access controls associated with them for added security.
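To illustrate the workflow-launching piece, here is a rough sketch of a Django view that hands a CWL workflow to cwltool in the background. The paths and the fire-and-forget subprocess call are illustrative only; a production setup would submit to a proper workflow engine running on the cluster and track job state.

```python
# Minimal sketch: a Django view that launches a CWL workflow with cwltool.
# Paths are hypothetical; real deployments would track the job and stream logs.
import subprocess
from django.http import JsonResponse
from django.views.decorators.http import require_POST

@require_POST
def run_workflow(request):
    workflow = request.POST.get("workflow", "workflows/align.cwl")    # hypothetical path
    inputs = request.POST.get("inputs", "inputs/sample-001.yml")      # hypothetical path

    # Launch cwltool as a background process.
    proc = subprocess.Popen(
        ["cwltool", "--outdir", "/data/results", workflow, inputs],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    return JsonResponse({"pid": proc.pid, "status": "submitted"})
```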

The data is placed in an object-storage bucket, and users access it in a non-privileged way (for example, read-only) to protect it from corruption or loss.
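One simple way to enforce this, sketched below, is a read-only IAM policy on the role that user sessions assume; the bucket and role names are placeholders.

```python
# Minimal sketch: grant notebook users read-only access to the data bucket,
# so analyses can read inputs but cannot overwrite or delete them.
import json
import boto3

iam = boto3.client("iam")

read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],   # read and list only
            "Resource": [
                "arn:aws:s3:::biomed-data-bucket",
                "arn:aws:s3:::biomed-data-bucket/*",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="notebook-user-role",          # role assumed by user pods/instances
    PolicyName="data-bucket-read-only",
    PolicyDocument=json.dumps(read_only_policy),
)
```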

Image provided by the author.

Hope you enjoyed the post! Have you implemented similar strategies? Thank you for reading till the end.


Ananya Mukherjee

Data Scientist, Full-Stack Engineer, and business student. Working towards building my biomedical data-related tech company.