“I’d love to hear your thoughts on what makes good scientific software. I strive to write correct software that’s intuitive to use, but would love examples of useful tools you’ve come across.”
RFon, I must admit I was not entirely sure what type of scientific software you had in mind — the code that a student/postdoc would write, or community-supported and maintained open-source codes, or commercial codes? So I will give a general answer based on my experiences, and hopefully not be too vague.
I am a theorist/computational scientist and my work falls under applied physics; the problems I look at lie at the interface of physics, chemistry, and several branches of engineering. In the context of code development, my work falls under “scientific computing,” but our style is such that the emphasis is much more on “scientific” than on “computing.” In short, pure CS folks would probably scoff at much of our work as not pretty or clean enough (that is a common issue; academic science codes are often not pretty by CS standards). However, the main goal of our work is properly describing the relevant physical phenomena, so physics comes first, and numerics is a means to that end. Our work involves steps like: develop theory (write complicated partial differential or integro-differential equations or coupled systems thereof) that relates to interesting properties of a class of systems –> develop and/or implement algorithms to solve these systems of equations –> have a code that captures the underlying physics well enough that we can perform numerical experiments and understand a class of systems quite well.
In my group, we write our own code — FORTRAN FTW! We also use interpreters like Python and Matlab for some smaller-scale calculations, but FORTRAN is extremely fast for the type of work we do (lots of matrix/array manipulation), we have a lot of legacy code, and modern compilers (such as the free gfortran) are great. [Please, no proselytizing here how everything should be written in C++ or whatever, I have no patience with the “one true programming language” silliness, especially because much of scientific computing admits the procedural (rather than object-oriented) programming paradigm]. However, I will say that sending students to take certain undergrad computer science courses (e.g., data structures) really helps with adopting good programming habits and staying organized.
In the work we do, I cannot say we really pay much attention to a potential user experience (beyond commenting and documenting code), because we assume people will work with the source code. The way science in my field is funded is that there is really no money for developing a user interface (unless you are part of these big centers with permanent staff), let alone for providing user support. Again, our focus is on solving certain physics problems.
There is some commercial software that experimentalists in my field use (I can’t really go into detail without revealing what I do) and those packages are well done and fairly intuitive, but the operative word is “commercial.” I would not say that someone who uses commercial code does theoretical/computational work; it’s fine to use it if you want to test something or are an experimentalist comparing with a measurement, but this “computational” work by itself is not publication worthy. If it seems like I am bitter, that’s because I am: I cannot tell you how many times I have encountered someone who thinks that modeling and simulation are trivial because they equate modeling and simulation with using “canned” software and have no idea what it actually takes to simulate a complex physical system on the computer with enough detail that you can perform numerical experiments on it.
In my field, there are several teams who have tried packaging and selling their code and they are generally puzzled that other computational people don’t want to buy it. Why would I buy it? I can write the same thing based on publications and have my own source code. If that sounds like needless duplication of work, that’s because it is. But most computational scientists have no use for a fancy GUI; give me the source code and some documentation, that will be useful, and we’ll probably still rewrite most of it. We are working on cleaning up some of our larger code and then putting it on GitHub.
I have some experimental colleagues for whom we have written small amounts of specialized code, and really all you need is to work with some representative users closely for a little while, because often what they want or need is really not what you’d think.
RFon, does this somewhat address your question? Readers, what say you?
My main professional work is in scientific software, so I have something I want to add. You have to differentiate between software you write to use yourself as a research tool to answer some specific question and software you write to share with others to use as a tool in their own research, possibly to address questions you’ve never considered. The former is often quite “hack-y” and I would argue that is OK, because you’re still working things out. The science is the driver, and the science is messy so your code might be, too. The latter needs to be more solid. It needs documentation and test cases. It needs to have been tested against some edge cases. It needs error messages that make sense.
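To make the difference concrete, here is a minimal Python sketch of what "more solid" can look like for even a tiny routine: a docstring, input checks with error messages that say what went wrong, and a couple of edge-case tests. The function name `trapezoid_integral` and the test values are made up for illustration and do not come from any particular code base.

```python
def trapezoid_integral(x, y):
    """Integrate y(x) with the trapezoid rule.

    x must be strictly increasing and have the same length as y.
    Returns the integral as a float.
    """
    if len(x) != len(y):
        raise ValueError(
            f"x and y must have the same length, got {len(x)} and {len(y)}"
        )
    if len(x) < 2:
        raise ValueError(f"need at least 2 points to integrate, got {len(x)}")

    total = 0.0
    for i in range(1, len(x)):
        dx = x[i] - x[i - 1]
        if dx <= 0:
            # Say where the problem is, not just that there is one.
            raise ValueError(
                f"x must be strictly increasing, but x[{i - 1}]={x[i - 1]} "
                f"and x[{i}]={x[i]}"
            )
        total += 0.5 * dx * (y[i] + y[i - 1])
    return total


def check_edge_cases():
    """Hypothetical edge-case tests; a shared code would keep these in a test suite."""
    # A straight line is integrated exactly by the trapezoid rule.
    assert trapezoid_integral([0, 1, 2], [0, 1, 2]) == 2.0
    # A single point cannot be integrated and should fail loudly.
    try:
        trapezoid_integral([0], [1])
    except ValueError as err:
        assert "at least 2 points" in str(err)
    else:
        raise AssertionError("expected a ValueError for a single point")


check_edge_cases()
```

In a hack-y research script the checks and tests would usually be missing, and that is often fine; they start to pay off once someone other than the author is the one reading the error message.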
As your code moves from something you’re using to answer a specific question to something you’re using to answer a lot of different questions and then perhaps to something you share with others to answer questions beyond what you foresaw when you wrote the code, it should ideally get more rigorous with respect to good software engineering practices.
As a practical aside for anyone hoping to go write scientific software in industry: understanding this is a huge thing we check for in interviews. You don’t necessarily need to know how to do all of the professional software engineering things, but you have to show you understand what they are in a broad sense and why we need them.
LW wanted examples of excellent software that is intuitive and easy to use. I nominate ImageJ, available through the NIH.
I’m a cell biologist. I’m smart, but have nearly 0 coding skills (e.g., I barely coded my way through the rock paper scissors game in an online Python course before I gave up). ImageJ provides an excellent platform for other people to write plugins that I can download and use to analyze my data.
The record macro function allows me to easily automate steps I do routinely. (This is beyond the ability of some people in my lab so it’s probably not actually ‘easy’.)
The main thing I like is that it is not buggy and almost never crashes. Most of the tools I use have good documentation so I know what I am doing when I do a background correction or peak detection.
There is a lot of proprietary software that can do the same things, possibly better, but it tends to crash. Importantly, I have no clue how it works, so I worry I might be skewing my data in a disastrous manner if I use this software for analysis.
It was a broadly phrased question, so your answer addresses it just fine. While I come from CS, my academic sphere is more biology-oriented, and I interact with a lot of people who depend on ‘canned’ software, so sometimes I forget that chemistry and physics are on par with (and frequently surpass) us in coding skills. I’m happy to hear your students benefit from CS courses. I’ve taught various undergrad programming and algorithm courses and typically the top performers are math and physics students.
I am also interested in what makes a library API (even an open-source one) successful, though. I imagine you use LAPACK and CUDA, which I find a little cumbersome to work with. Maybe it’s the Python language and philosophy, but numpy and scikit-learn both manage to be efficient and feel ‘light’ to work with, which I appreciate. I think a challenge for library developers in the future will be to encapsulate and automate parallelism, which does not feel ‘natural’ to an old-school programmer.
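As a small illustration of what ‘light’ means here, this is a sketch of solving a linear system with numpy. The matrix and right-hand side are made-up values chosen only for the example; `numpy.linalg.solve` hands the actual work to a LAPACK solver under the hood, so the caller never deals with workspace arrays, leading dimensions, or pivot vectors.

```python
import numpy as np

# Small, made-up linear system A x = b (values chosen only for illustration).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# One call; numpy passes the problem to a LAPACK routine internally.
x = np.linalg.solve(A, b)

# Check the residual rather than trusting the answer blindly.
residual = np.linalg.norm(A @ x - b)
print(x, residual)
```

For this system the exact solution is x = (2, 3), so the residual should be at the level of floating-point round-off. The same numerical machinery sits behind a direct LAPACK call from Fortran; the difference is mostly in how much bookkeeping the API asks of the user.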
I’m not up to speed on state-of-the-art computational physics, but I’d be curious about your thoughts on how the field (or your group) is embracing parallelism, machine learning, and modern visualization tools. Full disclaimer: I’m always horrified when I see the pixelated, non-interactive, primary-colored 3D visualizations that come out of the high-energy physics labs at our university.
Hi RFon, I believe you will find a lot of nonuniformity among fields in the big tent of computational physics. For instance, I don’t know what the practices are of the people who model the weather or model fluid dynamics, and both are certainly good-sized populations within computational physics. I can say that for the fields I am familiar with, depending on the actual problems people address, they may or may not think about parallelization very much. Some of my work is at the high-throughput end, and I am definitely looking into machine-learning algorithms to lessen the overall load (e.g., I have a related proposal coming up early in the spring). In terms of visualization, field expectations for the visual appeal have become fairly high; all my students know they have to be able to use Adobe Illustrator/Inkscape proficiently, we talk about figure layout and color schemes, and we routinely create movies from data in order to make more interesting presentations.
You can look at Nano Letters, Physical Review Letters, Applied Physics Letters, or Nature Materials/Physics/Photonics/Nanotechnology for visual standards on data presentation in journals. You will find that the figures tend to be really pretty.
Here’s also Kaleidoscope in Phys Rev journals for some of the most appealing visuals. For instance, https://journals.aps.org/prb/kaleidoscope/October2016
You might want to check out the White House backed Materials Genome Initiative, which is fairly distributed. One specific arm is https://www.mgi.gov/content/materials-project
and if you go to the project site https://www.materialsproject.org/ you will see that people use interactive tools. I am also pretty sure that they incorporate data mining algorithms (as they are searching for materials with certain structure-property relationships). This is a large project, though, and I guess individual groups differ in what they do.
As for handling the need for parallelization, depending on the class of problems, parallelization may or may not be necessary or even feasible, and you are right that this idea is not one that is naturally embraced. There is definitely a push in some subareas to rewrite legacy FORTRAN codes in python+numpy so they would be less onerous to parallelize; right now, Open MPI seems to be the way for most people. (I am personally excited about the prospect of a wider adoption of GPUs, and I think it’s coming.)
Old libraries: lapack/arpack/blas aren’t so bad once you’ve gotten used to them; I am perhaps old-fashioned and I still occasionally thumb through my Numerical Recipes in Fortran book, as it’s nice to understand the algorithms behind your various routines. Also, FORTRAN is much less forgiving of sloppy coding when compared to many modern languages, which honestly I think teaches meticulous coding practices, so as a teacher I am not in a rush to abolish it.
The geosciences also do a lot of computing and coding across a range of spatial and time scales. On my side of things (surface/subsurface), the bigger, more capable codes used to model fluid movement and geochemical reactions through the subsurface have often been funded by the Department of Energy, in large part to figure out where their pollution is going or could go (e.g., my understanding is that some of the codes were originally developed to assess Yucca Mountain as a nuclear waste repository).
The big community codes are parallelized: PFLOTRAN and PARFLOW. The user community is pretty small and NSF has sponsored some efforts to try to figure out how to expand use of these codes/models.
Geophysics does a lot of coding and computing and of course ocean/atmosphere people do a lot of fluid dynamic modeling, etc.