Strange Loop

2009 - 2023 / St. Louis, MO

Jagged, ragged, awkward arrays

Data processing languages and libraries, such as SQL, R, MATLAB, and NumPy/pandas, implicitly loop over identically typed objects ("rows") of a dataset ("table"). This makes for succinct syntax in an interactive environment, but what do you do if your table doesn't have a regular shape?

Particle physicists have this problem: each collision of high-energy protons can produce a different number of electrons, photons, quarks, and other particle species. As a table with one collision per row, this dataset has a jagged edge of unequal-sized rows, sometimes referred to as a ragged array. Traditionally, physicists have used general-purpose programming languages like Fortran and C++ to deal with big, irregularly shaped datasets, but at the cost of interactivity and abstraction.
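To make the "jagged edge" concrete, here is a minimal sketch in plain Python of one common way to store unequal-sized rows: a single flat list of values plus a list of offsets marking where each row starts and stops. The data values and the helper names (`flatten_with_offsets`, `get_row`) are hypothetical, chosen only for illustration; this is not presented as the library's actual internal layout.

```python
# Hypothetical toy data: one row per collision, one value per electron.
# Rows have different lengths, and one collision produced no electrons.
events = [[50.1, 32.7], [], [41.3], [27.9, 22.4, 18.6]]


def flatten_with_offsets(rows):
    """Flatten ragged rows into one content list plus row-boundary offsets."""
    content, offsets = [], [0]
    for row in rows:
        content.extend(row)
        offsets.append(len(content))  # where this row ends and the next begins
    return content, offsets


def get_row(content, offsets, i):
    """Recover row i from the flat representation."""
    return content[offsets[i]:offsets[i + 1]]


content, offsets = flatten_with_offsets(events)

# Number of particles per collision, read off the offsets alone.
counts = [stop - start for start, stop in zip(offsets, offsets[1:])]
```

The flat list is regular even though the rows are not, which is what allows array-style tools to operate on it.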

To bring high-level data expressivity to particle physics, my group has been developing awkward-array, a layer over NumPy that generalizes its array programming paradigm to jagged and other awkward data structures. Any JSON-like data, even with nested, heterogeneous content, can be sliced, broadcast, and reduced with implicit loops as though it were a NumPy array.
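As a rough illustration of reducing and broadcasting over jagged rows without an explicit Python loop, the sketch below uses only standard NumPy calls on a flat content array with row offsets. The toy data and the function names (`jagged_sum`, `jagged_broadcast`) are assumptions made for this example; they are not awkward-array's API, and the library may implement these operations differently.

```python
import numpy as np

# Hypothetical toy data in flat form: four rows of lengths 2, 0, 1, and 3.
content = np.array([5, 3, 4, 2, 2, 1])
offsets = np.array([0, 2, 2, 3, 6])


def jagged_sum(content, offsets):
    """Sum each row using a running total, so empty rows yield 0."""
    running = np.concatenate(([0], np.cumsum(content)))
    return running[offsets[1:]] - running[offsets[:-1]]


def jagged_broadcast(per_row, offsets):
    """Repeat one value per row so it lines up with the flat content."""
    return np.repeat(per_row, np.diff(offsets))


row_sums = jagged_sum(content, offsets)

# Fraction of its row's total carried by each element. The empty row
# contributes no elements, so its zero sum is never used as a divisor.
fractions = content / jagged_broadcast(row_sums, offsets)
```

Both steps are single array expressions, so the per-row loop stays implicit in the same way it does for rectangular NumPy arrays.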

This generalization of array programming has implications beyond physics: it simplifies combinatorics and likelihood maximizations in genomics and may also make it easier to analyze structured log files. Most awkward-array operations can be vectorized to run efficiently on GPUs, and we are integrating the library with Apache Arrow, Parquet, Numba, and pandas.

Jim Pivarski

Princeton University

Jim was trained as a particle physicist with a Ph.D. from Cornell and helped commission the CMS experiment at the Large Hadron Collider (LHC). He then worked as a data scientist at Open Data Group for five years. In 2016, he joined Princeton as a computational physicist, where he developed a popular software package linking particle physics data formats with the scientific Python ecosystem. He is now seeking new ways to foster communication and code reuse between particle physics and other fields of data analytics.