
The Pangeo ecosystem




Last modification: Nov 25, 2022

About this presentation

.left[ This presentation is a summary of: ]


Pangeo in a nutshell

A Community platform for Big Data geoscience


Logos: NSF, EarthCube, NASA, Gordon and Betty Moore Foundation (Moore logo by Gordon and Betty Moore Foundation, own work, public domain)



.left[ There are several building crises facing the geoscience community: ]

.left[- Big Data: datasets are growing too rapidly and legacy software tools for scientific analysis can’t handle them. This is a major obstacle to scientific progress.]

.left[- Technology Gap: a growing gap between the technological sophistication of industry solutions (high) and scientific software (low).]

.left[- Reproducibility: a fragmentation of software tools and environments renders most geoscience research effectively unreproducible and prone to failure.]



Pangeo aims to address these challenges through a unified, collaborative effort.

The mission of Pangeo is to cultivate an ecosystem in which the next generation of open-source analysis tools for ocean, atmosphere and climate science can be developed, distributed, and sustained. These tools must be scalable in order to meet the current and future challenges of big data, and these solutions should leverage the existing expertise outside of the geoscience community.


The Pangeo Software Ecosystem

Pangeo approach

Source: Pangeo Tutorial - Ocean Sciences 2020 by Ryan Abernathey, February 17, 2020.



Xarray is an open source project and Python package that makes working with labeled multi-dimensional arrays simple, efficient, and fun!

Xarray logo


What is Xarray?

.left[Xarray expands on NumPy arrays and pandas. Xarray has two core data structures:]

.left[- DataArray is our implementation of a labeled, N-dimensional array. It is a generalization of a pandas.Series.]

.left[- Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.]

.left[Source: Xarray documentation]
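The two data structures described above can be illustrated with a small sketch. The toy temperature values, the coordinate values, and the variable names (`temperature`, `anomaly`) are assumptions made up for this example; only the `xarray`, `numpy`, and `pandas` calls themselves are real.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 3 days of values on a 2 x 2 latitude/longitude grid.
times = pd.date_range("2022-01-01", periods=3)
temperature = xr.DataArray(
    np.arange(12, dtype=float).reshape(3, 2, 2),
    dims=("time", "lat", "lon"),
    coords={"time": times, "lat": [10.0, 20.0], "lon": [0.0, 5.0]},
    name="temperature",
)

# A Dataset is a dict-like container of DataArrays that share dimensions.
ds = xr.Dataset(
    {
        "temperature": temperature,
        "anomaly": temperature - temperature.mean("time"),
    }
)

# Operations refer to dimensions and labels by name rather than by axis number.
time_mean = ds["temperature"].mean("time")
point_series = ds["temperature"].sel(lat=20.0, lon=5.0)

print(ds)
print(time_mean.values)
print(point_series.values)
```

Selecting by `lat=20.0, lon=5.0` instead of by integer position is what "labeled" means in practice: the code stays readable even if the order of the dimensions changes.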



Xarray concept

Xarray dataset



A powerful, format-agnostic, community-driven Python package for analysing and visualising Earth science data.



.image-40[ IRIS logo ]


.left[Source: Scitools Iris documentation]




Enabling performance at scale for the tools you love

Dask accelerates the existing Python ecosystem (NumPy, pandas, scikit-learn)


.image-40[ DASK logo ]


.left[Source: Dask documentation]


How does Dask accelerate Numpy?

.image-40[ Dask and Numpy ]


import numpy as np

x = np.ones((1000, 1000))
x + x.T - x.mean(axis=0)



import dask.array as da

x = da.ones((1000, 1000))
x + x.T - x.mean(axis=0)
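The two snippets above look the same, but they behave differently: the Dask version is lazy and split into chunks. The sketch below makes that visible. The chunk size of 250 x 250 is an assumption chosen for illustration; Dask can also pick chunk sizes automatically.

```python
import dask.array as da
import numpy as np

# Chunk size is an illustrative assumption, not a recommendation.
x = da.ones((1000, 1000), chunks=(250, 250))

# This line only builds a task graph; no arithmetic has been done yet.
y = x + x.T - x.mean(axis=0)

# compute() runs the graph, chunk by chunk, and returns a plain NumPy array.
result = y.compute()

print(type(y).__name__, y.chunks[0])
print(type(result).__name__, result.shape)
```

Because each chunk is an ordinary NumPy array, the same expression can run on arrays that do not fit in memory, as long as individual chunks do.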



How does Dask accelerate Pandas?

.image-25[ Dask and Pandas ]


import pandas as pd

df = pd.read_csv("file.csv")



import dask.dataframe as dd

df = dd.read_csv("s3://*.csv")



How does Dask accelerate Scikit-Learn?

.image-40[ Dask and Scikit-Learn ]


from sklearn.linear_model import LogisticRegression

# data and labels are placeholders for a feature matrix and a target vector
lr = LogisticRegression()
lr.fit(data, labels)



from dask_ml.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(data, labels)





Free software, open standards, and web services for interactive computing across all programming languages


.image-40[ Jupyter logo ]

.left[Source: Jupyter documentation]


Jupyter and Galaxy


Analysis Ready, Cloud Optimized Data (ARCO)


Example of ARCO Data

Arco data


Pangeo Forge

Pangeo Forge Logo

Pangeo Forge is an open source platform for data Extraction, Transformation, and Loading (ETL). The goal of Pangeo Forge is to make it easy to extract data from traditional repositories and deposit this data in cloud object storage in an analysis-ready, cloud optimized (ARCO) format.

Pangeo Forge is inspired directly by Conda Forge, a community-led collection of recipes for building conda packages.


How does Pangeo Forge work?

.image-40[ pangeo forge explained ]

.image-40[ pangeo forge recipe ]



.center[STAC stands for SpatioTemporal Asset Catalog.]



Each provider has its own catalog and interface.


Just finding the data relevant to your project can be hard work…


.pull-right[ Why STAC ]



Each provider has its own Application Programming Interface (API).


If you are a programmer, the situation is exactly the same…

You have to design a new data connector each time…


.pull-right[ Why STAC ]



Let’s work together.


The main purpose of STAC is:


.pull-right[ STAC ]



Let’s work together.


It’s extremely simple. STAC catalogs are composed of three layers:



It’s already used for Sentinel-2 on AWS

.image-90[ Sentinel 2 ]

It’s already used for Landsat 8 at Microsoft

.image-90[ Landsat 8 ]
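In the STAC specification, the three layers are Catalogs, Collections, and Items. An Item is a GeoJSON Feature with a few extra fields, which is why any tool that reads JSON can consume it. The sketch below builds a minimal Item as a plain Python dictionary and checks the top-level fields the Item specification requires. The identifier, footprint, timestamp, asset key, and URL are made-up placeholders, and `missing_item_fields` is a hypothetical helper, not part of any STAC library.

```python
import json

# Top-level fields required by the STAC Item specification (version 1.0.0).
REQUIRED_ITEM_FIELDS = (
    "type", "stac_version", "id", "geometry", "bbox",
    "properties", "links", "assets",
)

# All values below are placeholders for illustration.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-20190115",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [2.0, 48.0], [3.0, 48.0], [3.0, 49.0], [2.0, 49.0], [2.0, 48.0],
        ]],
    },
    "bbox": [2.0, 48.0, 3.0, 49.0],
    "properties": {"datetime": "2019-01-15T10:30:00Z"},
    "links": [],
    "assets": {
        "B04": {"href": "https://example.com/example-scene/B04.tif"},
    },
}


def missing_item_fields(candidate):
    """Return the required top-level Item fields absent from candidate."""
    return [name for name in REQUIRED_ITEM_FIELDS if name not in candidate]


print(missing_item_fields(item))
print(json.dumps(item["properties"]))
```

A real validator would also check field types and extensions; this sketch only shows how little structure an Item needs.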



How to use STAC

Depending on your needs.

.pull-left[ Storing your data

.image-40[ Storing data ] ]


Searching data

.image-40[ Searching data ]



Searching data

Let’s search for data over the main region of interest (France) between 1 January 2019 and 4 June 2019.

.image-100[ Search data over main and specific dates ]
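A search like the one above boils down to a small set of parameters defined by the STAC API item search: a bounding box, a time interval, and the collections to look in. The sketch below builds such a request body with the standard library only. The bounding box is a rough approximation of metropolitan France, the collection id `sentinel-2-l2a` is an assumption that depends on the catalog you query, and `build_search_body` is a hypothetical helper. Client libraries such as pystac-client accept equivalent `bbox`, `datetime`, and `collections` arguments.

```python
import json
from datetime import date

# Rough bounding box around metropolitan France as [west, south, east, north],
# in degrees. Approximate values chosen for illustration only.
FRANCE_BBOX = [-5.2, 41.3, 9.6, 51.1]


def build_search_body(bbox, start, end, collections, limit=10):
    """Build a JSON-serialisable body for a STAC API item search."""
    if start > end:
        raise ValueError("start date must not be after end date")
    # The API expects a closed RFC 3339 interval written as "start/end".
    interval = f"{start.isoformat()}T00:00:00Z/{end.isoformat()}T23:59:59Z"
    return {
        "collections": list(collections),
        "bbox": list(bbox),
        "datetime": interval,
        "limit": limit,
    }


body = build_search_body(
    FRANCE_BBOX,
    start=date(2019, 1, 1),
    end=date(2019, 6, 4),
    collections=["sentinel-2-l2a"],  # assumed id; check your catalog
)
print(json.dumps(body, indent=2))
```

Because every STAC API understands the same body, the request does not have to change when you switch providers; only the endpoint URL and the collection id do.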


Searching and processing

.image-100[ Search and process ]


STAC ecosystem

A lot of projects are now built around STAC.


A lot of contributors!

Join and contribute to STAC:

.image-100[ STAC contributors ]


STAC and Pangeo Forge


Using and/or contributing to Pangeo

.left[The Pangeo project is completely open to involvement from anyone with interest.

There are many ways to get involved:]

For more information, consult the Frequently Asked Questions.

Everyone is welcome to the Pangeo Weekly Community Meeting.


Learn more


Key Points

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!

This material is licensed under the Creative Commons Attribution 4.0 International License.