In Event of Mars Disaster

The operations of the ISS demand a quota of “lifeboat spacecraft” to bring the crew back to Earth in the event of an emergency. The Apollo program had multiple abort states, which saved the crew of Apollo 13 despite massive failure. What will happen when such an emergency occurs on a deep space mission, or on the surface of Mars?

To start, let’s clear the table of some of our preconceptions about mission failure on Earth and consider how they might apply to long-duration, deep-space missions. Some of them will come out positive, some will be negative, but they’ll all help build up a coherent picture.

Abort back to Earth is not an option

If something goes wrong in deep space, you can’t run home with your tail between your legs. If you’re much further from Earth than a lunar transfer orbit (speaking in terms of delta-v and time to return, rather than absolute distance) then returning will be a time-consuming affair, and likely a risky one. It will take the carefully timed and oriented firing of main engines, potentially a liftoff from the surface of a celestial body, and a whole host of other precise events. These are not the kinds of things you want to be doing in a hurry, without the appropriate buildup and without having properly checked the state of the ship. If you’re depending on planetary alignment for a return with your existing propellant margins, returning may not even be an option for weeks, months or years. That all being said…

Free-return trajectories are your friend

There are many cases where a form of abort back to Earth is an option, if your mission is planned correctly. Among the many choices of trajectory to many destinations is what’s known as a free-return trajectory. This is one where, if you don’t do any major engine firings, you will return to the Earth after some predetermined period of time. The early Apollo missions flew free-return trajectories to the Moon (Apollo 13 had already left its free return when the accident happened, but a single burn of the lunar module engine put it back on one and let it loop around the Moon safely), and many proposals for missions to Mars and asteroids use them. These are, without exaggeration, potentially lifesaving in the event of disaster en route. If the main engines, attitude control or many other essential systems fail midway through a mission on a free return, you simply have to wait. That’s not to say you’re totally safe – a critical system could fail in the first week of a Mars mission, leaving an 18-month wait until a return to Earth – but there are many failure modes that can be ridden out with far less difficulty.

Remember the difficulty of a “cold start”

As many residents of Texas found out this February, having a system that is perfectly functional but switched off doesn’t mean you’re in the clear. The apparently simple process of switching on a large power system can cause major damage if not carried out correctly. This applies not just to electrical systems (the familiar home of such issues) but also to pressurised structures. The failure of Starship prototype SN3 is generally attributed to an error in filling (and thus pressurising) a large propellant tank, which caused a catastrophic structural failure. Don’t let the same thing happen to a habitat with a pressure diaphragm that’s strong in one direction only.

The surface of a planet is a safer place to be

This is a specific example of a good general rule for accident response. If you’re in a bad situation but things aren’t getting rapidly and actively worse, it’s almost certainly safer to stay put than to make a rush for somewhere nearby. While the planetary surfaces in the solar system are far from friendly, there are at least a few things in your favour that aren’t in space. Debris falls to the ground, there are raw materials aplenty (although their utility can be limited at best) and the radiation exposure is drastically reduced. That isn’t to say that it’s always best to stay on a surface – there are many conditions where it could very well be more dangerous than launching to orbit. If surface conditions are worsening drastically (a weeks-long lunar night with the power system blown out) or risk preventing an exit in future, running may be wise. But it should certainly not be the default.

The service module of Apollo 13, damaged by the explosion of an oxygen tank. The loss of oxygen shut down the fuel cells and forced the crew to shelter in the lunar module, whose carbon dioxide scrubbers could not keep up with three people, leading to the famous improvised repairs. Image credit NASA History Division (Apollo 13 Flight Journal, “Service Module Damage: A Photographic Analysis”).

The odds of a “clean failure” are low

A clean failure would be something like the events of The Martian, where the only compromised mission elements are one crew member and a slight shortening of the mission. In a real scenario, even if a catastrophic failure is avoided, a complex failure (where many systems fail together, or the failure of one system leads to a cascade) is very possible, especially if the failure mode is unexpected. This emphasises the need for a wealth of fallback failsafe states, and for the crew and system to be able to rebuild from a very small core of undamaged equipment. The more things that can be repaired with parts from a single desktop 3D printer, the better.

With proper vehicle design, catastrophic failure modes can be largely limited

By catastrophic, I mean “kill you before you have a chance to react”. There are surprisingly few ways this can happen on a deep space mission: realistic depressurisation events caused by MMOD (micrometeoroids and orbital debris) tend to take minutes or hours unless you’re hit by a large object at high speed, the atmosphere of a habitat provides a buffer against overheating and suffocation in the event of power loss, and structural failures rarely level a small structure completely. The most important thing (obviously) is making sure that a design has relatively few failure modes. The second most important is making sure that failures are slow, with options for recovery and repair along the way. This is sometimes called “graceful degradation”: as a system fails it shouldn’t cause a catastrophic collapse at a certain threshold, but rather drop off slowly in performance so that it doesn’t cause further issues. The grace period between a system failing and the situation becoming unsurvivable is the difference between life and death.
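To put a rough number on “minutes or hours”, here is a minimal sketch of how long a punctured habitat takes to lose pressure. It assumes choked (sonic) flow of air through a clean round hole into vacuum and a cabin that stays at constant temperature; the function name, the 100 m³ volume and the 1 cm hole are illustrative assumptions, not figures from any real vehicle.

```python
import math

GAMMA_AIR = 1.4          # ratio of specific heats for air
R_AIR = 287.05           # specific gas constant for air, J/(kg K)


def blowdown_time_s(volume_m3, hole_diameter_m, p_start_pa, p_end_pa,
                    temp_k=293.15, discharge_coeff=1.0):
    """Seconds for cabin pressure to fall from p_start_pa to p_end_pa.

    Assumptions (all simplifications): choked flow to vacuum, isothermal
    cabin air, a single round hole, and an ideal discharge coefficient
    unless one is supplied.
    """
    area = math.pi * (hole_diameter_m / 2.0) ** 2
    # Effective outflow speed for choked flow of an ideal gas.
    choke_factor = (2.0 / (GAMMA_AIR + 1.0)) ** (
        (GAMMA_AIR + 1.0) / (2.0 * (GAMMA_AIR - 1.0)))
    outflow_speed = math.sqrt(GAMMA_AIR * R_AIR * temp_k) * choke_factor
    # Pressure decays exponentially with this time constant.
    time_constant = volume_m3 / (discharge_coeff * area * outflow_speed)
    return time_constant * math.log(p_start_pa / p_end_pa)


if __name__ == "__main__":
    # Hypothetical 100 m^3 habitat, 1 cm hole, sea level down to 50 kPa.
    t = blowdown_time_s(100.0, 0.01, 101_325.0, 50_000.0)
    print(f"{t / 60.0:.0f} minutes to reach 50 kPa")
```

Under these assumptions the hypothetical habitat takes about 75 minutes to fall to half an atmosphere. The time scales with cabin volume and inversely with hole area, so a small capsule with a stuck-open valve has far less margin than a large habitat with a pinhole.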

The crew of Soyuz 11: cosmonauts Dobrovolsky, Volkov and Patsayev. As the descent module separated ahead of re-entry, a poorly designed pressure equalisation valve was jolted open and vented the cabin atmosphere to vacuum faster than the crew could respond. All three were killed. Image credit Wikipedia user Ras67.

Help is a long way away

Not just in distance, but in time. When something goes wrong on the space station, mission control will know in a matter of minutes, and can potentially send help in the form of spare parts within a few weeks or months. On a deep space mission, mission control won’t even hear about a problem for tens of minutes or hours. Spare parts are years away. This pushes the design of almost every system on a deep space mission or settlement towards maximum repairability, replaceability and cross-compatibility between components.
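The communication delay is just distance divided by the speed of light, so it is easy to sketch. The distances below are rounded, illustrative figures (the Earth to Mars range varies between roughly 0.4 and 2.7 AU, and Saturn sits around 8 to 11 AU away); the function names are my own.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458   # defined value
AU_KM = 149_597_870.7               # one astronomical unit in km


def one_way_delay_min(distance_au):
    """One-way light travel time in minutes for a distance in AU."""
    return distance_au * AU_KM / SPEED_OF_LIGHT_KM_S / 60.0


def round_trip_delay_min(distance_au):
    """Time to report a problem and receive the first reply, ignoring
    any thinking time on the ground."""
    return 2.0 * one_way_delay_min(distance_au)


if __name__ == "__main__":
    # Rounded example distances, assumed for illustration only.
    examples = [
        ("Mars, near closest approach", 0.38),
        ("Mars, far side of the Sun", 2.67),
        ("Saturn, typical", 9.5),
    ]
    for label, distance_au in examples:
        print(f"{label}: {one_way_delay_min(distance_au):.1f} min one way, "
              f"{round_trip_delay_min(distance_au):.1f} min round trip")
```

For Mars this gives a one-way delay of roughly 3 to 22 minutes depending on where the planets are, which is why a crew there has to handle the first response to any emergency on its own.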

Choose a crew that can get around problems

In the plans for Mars Direct, Zubrin describes the most critical member of the crew as the mechanic: not necessarily an ultra-genius engineer, but someone with the knowhow and training to fix things that go wrong on the mission. The value of this role, and of the ability to solve a potentially life-threatening problem with extremely limited resources, is orders of magnitude greater for deep space missions and settlements. Historically it’s usually been engineers on the ground who do the meat of this problem solving (think of that scene from Apollo 13), although ISS astronauts are increasingly proactive in this regard. On a deep space mission there’s far more pressure to have a nearly independent crew who can survive, diagnose and repair any issue that might arise.

What can we learn from these? One thing is clear: the ISS doctrine of available evacuation is totally unsuitable for deep space. In many cases, the best course of action is to do nothing at all – sit tight and wait for the free return trajectory to carry you back to Earth, or for a cooperative transfer window to allow for supplies to be sent.

But this way of thinking also has implications for both architecture and hardware design. Over the course of a multi-year mission or permanent outpost, Murphy’s Law should be assumed to apply. Anything that can go wrong will go wrong, and it will probably do so at the worst possible time. If the mission can’t be abandoned when something critical goes wrong, there needs to be a plan for what to do instead. Those plans fall into two rough camps: redundancy and repairability.

Redundancy is about having backups, fallback states and safe refuges from which a well-trained crew can recover from a failure. Obviously this includes literal redundant systems and spare parts to swap in when something fails. This shouldn’t be a surprising thing to include on a deep space mission! All the usual advice about having similar and dissimilar redundancy in your critical systems applies, and really deserves a post all by itself. But it could also mean having a survivable state of reduced function, which is ideally recoverable but must be long-term stable. The ISS needs about half its power consumption for experiments, and a space settlement might well have a similar fraction dedicated to ISRU (in-situ resource utilisation) systems. Dropping those systems and living off stored buffers instead buys a lot of power and complexity margin, which could make the difference between a staged repair and restart of the entire settlement and bare survival until the next transfer window. Downgrading the entire mission from active science to a minimal state isn’t a nice decision to make, but if it’s that or termination the choice is easy.
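The difference between similar and dissimilar redundancy can be sketched with the beta-factor model, a common simplification in reliability engineering. It treats a fraction of each unit’s failure probability as a shared “common cause” that takes out every identical copy at once. All numbers below are made up for illustration, and the function name is my own.

```python
def loss_probability(unit_failure_prob, n_units, common_cause_fraction):
    """Probability of losing every unit in a redundant set.

    Beta-factor sketch: a fraction `common_cause_fraction` of each unit's
    failure probability is a shared cause that kills all units together;
    the rest is independent from unit to unit.
    """
    independent_part = (1.0 - common_cause_fraction) * unit_failure_prob
    all_fail_independently = independent_part ** n_units
    common_cause = common_cause_fraction * unit_failure_prob
    # The set is lost if either route happens (treated as independent).
    return 1.0 - (1.0 - all_fail_independently) * (1.0 - common_cause)


if __name__ == "__main__":
    p = 0.05  # assumed chance that one unit fails during the mission
    # Three identical units sharing 10% common-cause failures.
    print(f"similar:    {loss_probability(p, 3, 0.10):.5f}")
    # Three fully dissimilar units, idealised as no common cause at all.
    print(f"dissimilar: {loss_probability(p, 3, 0.00):.5f}")
```

In this toy case, adding more identical copies barely helps once the shared failure cause dominates, which is the usual argument for mixing in a backup that works on a different principle.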

Astronaut Barry Wilmore and a 3D-printed ratchet wrench on the ISS. Image credit NASA

Repairability is the other side of the redundancy coin. Even if you can survive the loss of some critical systems, if you can’t repair them then you’re stuck waiting for something to happen. Best case scenario, it’s a Mars base that has to wait 9 months for a full replacement launched on a perfectly timed transfer window. Worst case, you’re on a vehicle halfway to Saturn with 25 years until the next safe port. An awful lot more can go wrong in that time. A far better state of affairs is a failure that can be stabilised (through proper redundancy) and then repaired to a state of partial or full operation, whether that’s through having spare parts, interchangeable components (looking at you, Apollo air filters) or the option to manufacture spares in situ. The ISS already has an FDM 3D printer that’s used for making unusual tools and small fixes. It’s only a matter of time until the first space mission flies with a micro CNC bed or soldering station that can make whole new valves or control boards in situ. During the first experiments in metal and plastic ISRU, the products will not go to waste but will instead be used for low-importance spare parts and replacements.

It’s noticeable that this discussion of mission failure is quite different from what we’re used to. The lack of emphasis on specific abort options is obviously due to the wildly varying requirements of different missions (the actual abort modes for a Jupiter flyby are nothing like those for a Phobos surface settlement). But it also represents the required architectural shift away from “bug out” and towards recovery in the event of disaster.

The next few decades of human spaceflight will inevitably involve technical, possibly life-threatening failures of some kind. If we want to build the spacefaring capabilities that so many of us want, the number of crew-hours in space will need to increase dramatically, as will the system complexity required. We can hold onto the high safety standards and rigorous engineering that are currently the norm, but eventually that method will start to wear thin. The decision making that happens up to that point will determine whether the next space catastrophe unfolds more like the wreck of the Endurance than like Soyuz 11.
