Possibility and Probability

A Python programmer with a personality thinking about space exploration

11 April 2016

Does Python packaging have a left-pad problem?

by nickadmin

[Image: Is Python packaging a tower of Babel?]

Recently something interesting happened. A small but critical piece of code was removed from the internet, and its removal brought down many major JavaScript projects. The problem was quickly fixed, and along the way many jokes were made at the expense of the JavaScript ecosystem. But to me it raised an important question: could this happen with Python packaging?

How did this happen, and what is code reuse?

In most programming courses you will hear instructors encouraging their students not to “copy and paste” code, but to “reuse” existing code. Most of the time this is because they are trying to teach budding programmers to extract common code into a function, or to use a common library. And this is a very good thing. One of the many benefits is that if there’s a bug that has to be fixed, you go to the one spot where the code lives, fix the bug, and you’re done! The first time you have to hunt down the 10 different spots where the same 10 lines of code have been pasted, you will appreciate the utility of having 1 function where that code lives. This is great code reuse.

What happened in JavaScript land was an extreme version of this. A developer had created a package called “left-pad” that pads a string of characters out to a requested length, inserting a “pad” character on the left as needed. String padding is a fairly common activity in all programming languages. It is also not at all unusual to see students implementing it on a programming test while they are in school. It is typically just a few lines of code and is typically considered to be one of those easy, but pain-in-the-butt tasks.

In the npm (Node Package Manager) repository, one programmer created a left-pad package that provided this functionality. Over time many bigger projects decided to use this package so that they didn’t have to write their own version of the code. After all, this is what code reuse is all about: use a tested library, don’t write 100 different versions of it. The problem was that the author of the package took all of his packages off npm after a dispute. This was certainly within his rights; after all, he was the author. The projects that used this library suddenly had what’s known as a “missing dependency,” and as a result they could not build or release new versions of their code. This is a critical Achilles’ heel.

Thankfully most of the projects were able to recover quickly, but the incident pointed out some serious flaws in the JavaScript ecosystem. Or is this a more systemic problem with modern software development?
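To give a sense of how small the code at the center of this was, here is a minimal Python sketch of left padding. It is not the npm package's code; the function name `left_pad` and its parameters are made up for illustration, and Python already ships the same behavior as `str.rjust`.

```python
def left_pad(text, width, fill=" "):
    """Pad ``text`` on the left with ``fill`` until it is ``width`` characters long.

    Hypothetical helper for illustration only; not the code of the npm package.
    """
    text = str(text)
    if len(fill) != 1:
        raise ValueError("fill must be exactly one character")
    # Strings that are already long enough come back unchanged.
    return fill * max(0, width - len(text)) + text


print(left_pad("42", 5, "0"))  # 00042
print(left_pad("abc", 2))      # abc
print("42".rjust(5, "0"))      # 00042 (the built-in equivalent)
```

In Python the built-in makes a separate package unnecessary, which is part of why a left-pad-sized dependency is rare on pypi.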

Could this happen with Python packaging?

In short, yes. Imagine for a moment that the requests library, which is one of the most used libraries in the Python world, was suddenly taken offline. What would the consequences be? Immediately many Continuous Integration (CI) tasks would fail. Modern software development loves to use CI as a way to validate changes made to software. In the Python community CI is a way of life for many projects, and many rely on requests (and other packages) to help abstract away the low-level details of the problem they are working on.

In the case of left-pad, there were several options to choose from. The code was small enough that each project could have implemented its own version easily. In the case of requests, this would not be such an easy proposal. Requests wraps the functionality of the urllib family (it is built on top of urllib3) in a beautiful way. Recreating that per-project would not be quick or bug free. If groups tried to recreate a common version of it, you would most likely witness several splinter versions of the code coming into existence. This is not exactly uncommon in the open source world. The end result… well, for this thought experiment I feel we can assume the greater good would win out, but the important thing to realize is that it would take time to happen.

So what is the solution to this problem?

The world needs things like npm and pypi. They are essential to the growth of their respective communities. At the same time, the authors of the programs there should be free to take their creations out of circulation (if they so choose).

I had a co-worker once who was militant about not relying on 3rd party sites like pypi. He took it to the extreme of checking each dependency into our version control. For some things (like Python packages) this didn’t seem too terrible; it was a good way to track the versions of the libraries we were using. For other things (compiled binaries of major open source projects)… this seemed really extreme.

Over the years I’ve seen some approaches that seemed like a nice middle ground. Specifically, pip freeze > requirements.txt will produce a text file that lists all of your installed dependencies and their exact versions. I’ve also seen approaches that used specific commit hashes to ensure that the code itself was in a known state (i.e. the code could not have been tampered with, or the chain of commit hashes would have been invalidated). But those approaches don’t help when the source of the library just disappears. In fact, the commit-hash approach would prevent someone from substituting in a new copy of the library, because the commit hashes would not match up. (That is usually considered a good thing, but in this case it would be a pain because everyone in the world would suddenly have to update their dependencies to reflect the new hash.)

A possible solution to this is what the Go Language (go-lang) is moving to: a vendor directory. While attending a talk by Kelsey Hightower at the Great Wide Open conference, I heard him give a brief overview of how the vendor directory was born out of a desire to keep malicious code out of go-lang projects at Google.
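To make the pinning and hash ideas from this section a bit more concrete, here is a rough Python sketch. It assumes a requirements file in the simple one-requirement-per-line form that pip freeze produces. The helper names (`unpinned_requirements`, `matches_sha256`) and the package versions in the example are made up for illustration, and the hash check uses a plain SHA-256 of a file's bytes rather than git commit hashes.

```python
import hashlib


def unpinned_requirements(lines):
    """Return requirement lines that are not pinned to an exact version with '=='.

    Hypothetical helper: it only understands simple 'name==version' lines.
    """
    loose = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments, and pip options such as "-r other.txt".
        if not line or line.startswith(("#", "-")):
            continue
        requirement = line.split(" #", 1)[0].strip()  # drop trailing comments
        if "==" not in requirement:
            loose.append(requirement)
    return loose


def matches_sha256(data, expected_hex):
    """Check that ``data`` (bytes) still hashes to a digest recorded earlier."""
    return hashlib.sha256(data).hexdigest() == expected_hex.lower()


# Illustrative file contents; the versions are examples, not recommendations.
example = [
    "# produced by: pip freeze > requirements.txt",
    "requests==2.9.1",
    "six",
    "Django>=1.9",
]
print(unpinned_requirements(example))  # ['six', 'Django>=1.9']

payload = b"pretend these bytes are a downloaded package archive"
recorded = hashlib.sha256(payload).hexdigest()
print(matches_sha256(payload, recorded))                # True
print(matches_sha256(payload + b"tampered", recorded))  # False
```

For real packages, pip has its own hash-checking mode (the --require-hashes option), so a hand-rolled check like this is only meant to show the idea. Note that neither pinning nor hashing helps once the package itself is gone.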

Side note: by default, go-lang projects compile into a single binary that contains all of the libraries needed for the code to execute, so Go libraries are not dynamically linked at run time.

As a result, any libraries that the program needs can be checked into the vendor directory, and then they become a part of the project (under its version control).

[Image: golang gopher. Maybe this little fella can help out.]

This hearkens back to my coworker’s idea of putting everything into version control. I don’t know the full details of how this works in go-lang, and it still makes me a little uneasy to have a ton of other people’s code living in my repository. But at the same time it will prevent the left-pad problem of disappearing code.
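Python has no standard vendor directory, but something in the same spirit can be approximated with pip itself. The sketch below only builds the two command lines rather than running them; the function name `vendor_commands` and the directory name `vendor` are assumptions for illustration. The pip options used (`download --dest`, `install --no-index --find-links`, `-r`) are real pip options.

```python
import shlex


def vendor_commands(requirements="requirements.txt", vendor_dir="vendor"):
    """Build pip command lines for a Go-style vendor directory (a sketch).

    The first command saves the package archives into ``vendor_dir`` so they
    can be checked into version control. The second installs from that
    directory only and never contacts pypi.
    """
    download = ["pip", "download", "--dest", vendor_dir, "-r", requirements]
    install = [
        "pip", "install",
        "--no-index",                # do not look at pypi at all
        "--find-links", vendor_dir,  # look in the local directory instead
        "-r", requirements,
    ]
    return download, install


for command in vendor_commands():
    print(" ".join(shlex.quote(part) for part in command))
```

If the archives in that directory are committed alongside the project, a package disappearing from pypi would no longer break the build, at the cost of carrying other people’s code in the repository.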

Wrapping up

Python packaging is not immune to the problems experienced in npm recently. While code reuse is a great thing, care must be taken to ensure that builds of our software are truly repeatable. Perhaps if Python follows the example of go-lang and creates a similar “vendor” standard, then Python packaging problems like this one will become a thing of the past.
