Python on Tyler Collins

ViewClust: Early Days

Tue, 22 Mar 2022 13:23:54 -0400

In the early days of working for SHARCNET, my colleague and I decided to standardize how cluster metrics were computed across our internal data frames. As mentioned in a previous post, part of the solution was pandas.

The second part was to figure out how to deploy the package so that others could contribute to it, as well as install it on their own HPC clusters. Some quick searching of course revealed that PyPI and pip were the way to go.

To make a long story short, here are some references that made it super easy and approachable:

The package is still in use today inside of SHARCNET and has also received development support from WestGrid, Calcul Quebec, as well as MILA.

ViewClust can be found on GitHub, here. Its cousin package ViewClust-Vis can be found here, which implements several summary figures.

Pandas Recipes for New Python Users

Mon, 21 Mar 2022 19:16:48 -0400

Eventually I got to the point in data analytics where keeping things in lists, or lists of lists, was no longer quite cutting it. My processing was slowly starting to grind to a halt, and things were getting way too abstract.

I decided to call up a friend who had worked in the business longer than me, and they suggested “pandas”. I was vaguely familiar with it, as users/clients had used it in the past. After casually reading the documentation, a “DataFrame” did sound like it would take care of a lot of my problems…

Fast forward a year and pandas is now core to everything I do in Python. Couldn’t live without it anymore, and as such my second talk at SHARCNET was about pandas.

Below is my abstract for the talk as well as the recording:

“Often programmers find themselves in need of an effective way of working with ‘labeled’ data. In the case of Python, Pandas is the most mature and reliable package that interacts effectively with other well-known packages such as NumPy and TensorFlow. As a package, Pandas is said to provide ‘fast, flexible, and expressive data structures’ for what is known as ‘labeled data’. Features include: easy handling of missing data points, grouping functionality, simple indexing, time series support, numerous conversion functions, and more. This webinar will provide a basic introduction on how to install Pandas, a discussion of its strengths and various use cases, and lastly a demonstration of various common operations (recipes) that occur with labeled data. Experience with beginner Python concepts will be expected, while familiarity with Jupyter notebooks will be helpful. Webinar material and code will be made available on GitHub for future reference.”
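To give a flavour of the features the abstract lists, here is a minimal sketch of a few of them in one place. It is not taken from the webinar material: the `jobs` table, its column names, and its values are all made up for illustration, and it assumes pandas and NumPy are installed.

```python
import numpy as np
import pandas as pd

# Hypothetical job records; column names and values are invented for this sketch.
jobs = pd.DataFrame(
    {
        "account": ["alpha", "alpha", "beta", "beta"],
        "cores": [4, 8, 2, 16],
        "mem_gb": [16.0, np.nan, 8.0, 64.0],  # one missing measurement
    },
    index=pd.to_datetime(
        ["2022-03-01", "2022-03-01", "2022-03-02", "2022-03-02"]
    ),
)

# Missing data: fill the gap with the column median instead of dropping the row.
jobs["mem_gb"] = jobs["mem_gb"].fillna(jobs["mem_gb"].median())

# Grouping: total cores requested per account.
cores_per_account = jobs.groupby("account")["cores"].sum()

# Label-based indexing: select every row for one day using its date string.
first_day = jobs.loc["2022-03-01"]

# Time series support: resample the datetime index to daily totals.
daily_cores = jobs["cores"].resample("D").sum()

# Conversion: hand the grouped result back as a plain dictionary.
as_dict = cores_per_account.to_dict()
```

The same steps with lists of lists would need hand-written loops for the grouping and the date handling, which is roughly the pain described at the top of this post.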

Cython: A First Look

Sun, 20 Mar 2022 14:40:38 -0400

Back when I first got hired at SHARCNET, I used a lot of Python. I mean a lot. What this meant is that I quickly became the lightning rod for all Python-related questions (and commentary).

During a fun Friday chat, a colleague remarked that Python was on average 40x slower than C++. I defended my current language of choice saying it was better than that, surely. To make a long story short, I was wrong. It really is about 40x slower depending on the problem. Determined to prove myself capable, and my language of choice a bit more defensible, I decided to look into ways to make Python faster.

I eventually landed on Cython. Turns out the best way to make Python faster was to use as much C++ as possible.

Below is my abstract for the talk as well as the recording:

“Often we write programs in Python for convenience, not for speed. When work becomes elevated to High Performance Computing (HPC) environments, speed once again becomes a concern. Cython is an extension of Python which allows functions to be compiled as C (or C++) and recover much of the performance that Python trades away for convenience. Cython achieves this by supporting calls to C functions and the declaration of type information, as well as providing access to C++ STL functionality. Popular packages and libraries that take advantage of Cython include: TensorFlow, OpenCV, NumPy, Pandas, and more. This webinar will cover a basic introduction to Cython, a demo translating vanilla Python into Cython, followed by a short demo of how to run Cython in our own Compute Canada HPC environments. Experience with Python will be expected, while familiarity with C/C++ and Jupyter notebooks will be helpful. Webinar material and code will be made available on GitHub for reference.”