Sep 16, 2021
How Open Source Increases Access to Computational Tools for Every Scientist
Meet some of the people behind the essential open source software that CZI supports and who are working to bring computational tools to scientists across the globe.
Open source is crucial to scientific research and discovery. When software is open source, it’s free and accessible to the public — anyone can inspect, adapt or enhance its source code. Unlike proprietary or closed software, open source software allows researchers and developers to learn from its code base, reuse it, improve it, and eventually, contribute to it. Biomedical research increasingly depends on computational methods, and open source software has become critical to making these methods broadly available.
Despite its significance, open source software for research often lacks dedicated funding for maintenance. As a result, the creators and maintainers of these tools are rarely professionally rewarded or offered job prospects that honor their critical contributions to science. To address this gap, the Chan Zuckerberg Initiative has supported open source maintainers since 2019 through the Essential Open Source Software for Science (EOSS) program. The program provides grants that support maintenance, growth, development and community engagement for more than 120 widely used open source tools in biomedical imaging, genomics, cell biology, bioinformatics and other fields.
We spoke with a few EOSS grantees to learn more about their work to bring powerful computational methods to everyone through their projects.
Note: The following quotes have been edited and condensed for clarity.
Melissa Mendonça & NumPy
NumPy, an open source project aiming to enable numerical computing with Python, received a CZI EOSS grant to help manage and sustain the project’s documentation team, make improvements and create new educational content. Melissa Mendonça is a lead software engineer on the project and is also connecting with many contributors beyond the code.
How Is Your Team Working To Support Users of the Program?
Many of the users of NumPy don’t actually know that they are users of NumPy because it is a fundamental dependency for other projects.
One of the things that we’ve been trying to do in terms of community is to be more connected to other communities in the scientific Python ecosystem. For example, we’re trying to meet and participate in the communities of other projects so that the maintainers of these projects know each other in order to best support their users.
Sometimes, you don’t know a user from AstroPy. They might need a feature that we don’t know about. And so, if we have a connection to the community of maintainers of that project, we can support them better.
What Keeps You Doing This Work?
One of the best things about working with NumPy and with open source in general is the impact that you can have. This is very important personally, but I think it is also very rewarding to work in an open source project such as NumPy because it has such an important user base. You do feel like you are contributing to science. So many research projects, both in academia and industry, use NumPy as a dependency. Just by doing this work, you can feel like you’re having a huge impact in the world, and that’s one of the things that keeps me going.
How Are You Working To Improve Diversity in Open Source?
There are so many aspects and so many axes of diversity that we can look into for these projects, but they are mostly white and male. So, all of these actions that we’re doing are trying to bring other perspectives into the projects and increase the diversity of contributors. I think that is extremely important, not only from an open source point of view but also from an open science point of view, to increase access for people who are not yet at the center of these projects.
For example, one of the things that we’re also doing at NumPy is working on translations for our website so that not everything is in English. How can we do this in a way that is more welcoming, in a way that we can assure people that they are safe in our communities? It is a process that has been going on for a while and I don’t think that we’re there yet. There’s a lot of work to do, but I’m happy with the steps that we’re taking.
Hannah Aizenman & Matplotlib
Matplotlib is a comprehensive library for creating static, animated and interactive data visualizations in Python, a general-purpose programming language. The project was designed to be welcoming and inclusive, and to make “easy things easy and hard things possible.” Hannah Aizenman is a core developer on Matplotlib.
Tell Us About Your Work
Life science datasets are really complicated. Users simply want to feed datasets into a tool and generate a visualization. They don’t want to have to start thinking about how the library is doing it. So my research asks: Can we model visualization in a way where we can think of this data? How can we formalize data visualization in a way that’s general enough to handle all the different types of datasets that life scientists need to analyze?
We’re not Excel. We’re not trying to be Excel. We’re trying to be the code underneath. Unlike Excel, we can’t just support spreadsheets; we need to support all data types. So we’re asking ourselves, “How do you write that code ‘under the hood’ in a clean way so that the data is going to become a picture that is an accurate, faithful representation? What are the rules we need to follow when writing this code so we can promise that this code is not giving users an inaccurate representation?”
How Do You See Open Science as a Driver of Equity?
I think about this with Matplotlib a lot. We have so many tutorials online. We have people who tweet threads explaining and breaking things down, and people who do informative TikToks and Instagram posts.
We see this kind of information sharing because the library is open source. Anyone can make Instagram Stories and not be worried that someone’s going to sue them. Anyone can share it.
If your tools are free and open source and they’re well-documented, people can start building their own learning and teaching resources around it.
Mackenzie Mathis & DeepLabCut
Mackenzie Mathis runs the Mathis Laboratory at École Polytechnique Fédérale de Lausanne, using machine learning, computer vision and experimental work in rodents to understand the neural basis of adaptive motor control. Her lab is home to the CZI-funded DeepLabCut project, a software package for animal pose estimation. The tool tracks various body parts in multiple animal species across a broad range of behaviors.
What Have Been Some of the Most Surprising Uses for DeepLabCut?
There have been some really cool applications. A paper was published about a year or so ago looking at social bats interacting, which I think is a really elegant and beautiful application.
A lot of wildlife conservation efforts are now picking up these types of tools, which is fantastic. So as a laboratory, that’s another area we’re getting more invested in — to help give back to the animals that are giving us so much in the sciences. To see conservation efforts leverage this tool has been quite incredible.
Why Is It Important To Ensure Tools Like DeepLabCut Are Open Source and Free?
I really think that open source software is not only our past, but our future. We have these components in science about reproducibility and sharing data and sharing code, and open source is such a large component of this. And I also feel like it’s a better and safer way to do science. If someone sees a bug in my code, I literally have the whole world to code review for me. That is something that’s really important.
In general, open source software is incredibly important for the future of science, as datasets get bigger and analysis gets more complicated. The sheer volume of videos that we can collect and analyze is unbelievable compared to the number that you could have done even 10 or 15 years ago. It’s going to be really impactful and game-changing to continue to make sure we have open source software available to everyone.
What Would You Say to Someone Hesitant or Skeptical About the Value of Open Science?
I would try to understand why they were skeptical about open source science. I don’t see a lot of downsides to it. But in general, if someone was completely new to this, and I was trying to explain to them why this was so important, I would say it’s really about impact. You are going to make a much larger impact in the world by sharing your code and sharing your data than you could ever imagine doing in your own little silo.
As a neuroscientist, I’m not going to solve the brain alone. And as an open source code developer, I’m not going to be able to write all the code alone. It’s a way to give back to science too.
Wes McKinney & Apache Arrow
Apache Arrow is a software development platform for building high-performance applications that process and transport large data sets. Its founders designed the tool to improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another. Wes McKinney, an open source software developer and co-creator of Apache Arrow, hopes it will be used as the foundation for building next-generation data science libraries.
How Do We Address Diversity in Open Science?
We’ve been asking ourselves, “How do we increase the number of maintainers that are out there? How do we recruit and bring more people into the open source ecosystem to create, to have a development community that’s more resilient, that’s larger, that’s less dependent on very small numbers of heroic individuals?”
To increase the diversity of the ecosystem, we have to create an environment where open source development can be a career path that people choose. Otherwise, the problems that have accumulated — such as not having enough maintainers or having only maintainers from privileged backgrounds — will persist. We have to break that cycle.
What Role Does Apache Arrow Play in Increasing Diversity?
Our goal is not only to build great software but also to create an open, collaborative, inclusive and positive development community that can bring more people into the project. CZI’s funding supports one-year apprenticeships targeted towards groups that are traditionally underrepresented in open source development. So individuals who have had some software engineering experience but maybe haven’t been deeply involved in, or a maintainer of, an open source project can receive mentorship in how to be an effective open source software maintainer.
Our goal is to continue operating these programs as long as possible, bring more people into the project, and teach them how to be effective open source maintainers.
Why Is Open Science Integral to the Field?
If researchers don’t make it possible to reproduce their research with open tools, that creates a barrier to innovation that stymies progress. It can also prevent opportunities for peers to find problems in the research and the science that’s been done. People make mistakes. That’s why we need people looking over your shoulder, checking your work, taking work you’ve done to reproduce the results, and seeing if they can improve them or find problems with your methodology.
Rather than having scientists competing for who can get the best paper in the best journal, there’s a cultural expectation that the data and the code, and the process of arriving at results, are made freely available so that everyone can have trust in the results that are produced. And that work can serve towards accelerating the next stage of innovation in each particular research domain.
Learn more about CZI’s work in open science.