I stopped using GNOME more than a decade ago, yet my desktop environment of choice–Xfce–keeps using GNOME libraries, and every time I notice a sudden problem, every time it’s GNOME’s fault.

But because I’m a developer, I can track down the problems and fix them myself, but why is it that GNOME developers keep making simple mistakes and are unable to fix them themselves? Because they don’t follow good coding practices.

In this post I will explain the last issue I had with GNOME, how I fixed it, and why it’s still there more than three years later (their attitude is the real problem).

The issue

All of a sudden I started to notice an annoying delay while closing terminals (I use xfce4-terminal). Originally I thought it was something related to gvim (which starts as a fork), so I tried different ways to start it, but then I noticed it happened with gitk too (which I also start as a fork). Then I investigated different zsh options, such as no_hup, until I realized the same happened in bash. Soon I found out it was the terminal itself: any fork created the issue.

The terminal takes several seconds to exit

Initially the terminal was permanently hanged, but later on it was only delayed by a couple of seconds.

Since I open and close terminals all the time, this delay was still extremely annoying, so I decided to track down the problem.

First I thought xfce4-terminal was doing something wrong, but I noticed the same problem with gnome-terminal too, while other terminals did not have this problem. That pointed to vte (GNOME’s virtual terminal library), a library that is used by multiple terminals (e.g. guake, tilda, mate-terminal, terminator, etc.)

I downloaded the source code of vte, managed to compile the latest version, and ran xfce4-terminal with this new library (using LD_LIBRARY_PATH). The issue was still there, so they hadn’t fixed it yet.

I fired up git bisect and soon enough I found the culprit: lib: Rework child exit and EOF handling. If I revert that patch the terminal exits immediately (as intended). I’m simplifying the effort since I had to do multiple hacks to be able to compile older versions of the code.

This is where the shit starts:

When the child process exits, we used to immediately unset the PTY, which causes us to miss data written by the child but not yet read by vte.

Instead, only store the child exit status, and defer emitting the ‘child-exited’ signal until after all the pending data has been read and processed.

Similarly, rework how EOF is processed. Instead of immediately queuing the emission of the ‘eof’ signal, only take note of the EOF, and process it after all pending data has processed. There also was a bug in that we took the first occurence of G_IO_HUP in Terminal::pty_io_read() to stop reading more data. Instead, only take a pure G_IO_HUP without G_IO_IN as EOF, or if reading data from the PTY returns the EIO error.

This also fixes the bug where a(ny) partial character(s) not yet fully decoded by the UTF-8 and ICU decoder would not show in the output; this now correctly flushes the decoder, which inserts either a replacement character (for the UTF-8 decoder) or the character(s) in the ICU decoder internal state (most likely also a replacement character).

Christian Persch

Is that clear? This is a perfect example of how not to do a commit. A good coding practice is to keep commits simple and logically atomic, that is: only do one thing. This commit is doing precisely the opposite: it’s trying to do many things at once, and that’s why the code is virtually impossible to understand.

Here’s a good example of a patch I sent to the git project: pull: cleanup autostash check. Notice how the code is extremely simple, and in fact that explanation is bigger than the code changes, just in case they weren’t clear. I made other changes to this code, but following good practices: in separate commits (1, 2). Developing this way takes more effort, but it’s worth it for many reasons: the code is simpler to understand, review, and later on analyze possible bugs.

Since GNOME developers didn’t do a good job of splitting the patch, we’ll have to do it ourselves to understand what it’s supposed to do.

Doing it correct

The first step of what the patch is trying to do (according to me) is deferring the signal CHILD_EXITED, by simply doing that (and avoiding the rest of the mess) I’m able to solve the original problem the patch attempted to solve: End of cmd output discarded with “Hold the terminal open”. The difference is that my patch is much simpler: 2 files changed, 49 insertions(+), 22 deletions(-), while Christian’s mess of a patch is: 5 files changed, 188 insertions(+), 123 deletions(-).

The next step is trying to change the way the output is processed so that EOS is handled properly and that way all the output from the child is handled. I split this step into 6 substeps. This is the biggest change that although it’s probably desirable, in my tests it made no difference.

In another step there are some style changes, and in another a bug is fixed (G_IO_ERR wasn’t added as a flag to process output).

After all these steps one final change is easily visible:

- if (m_child_exited_after_eos_pending) {
+ if (m_child_exited_after_eos_pending && eos) {

This is what is causing our issue: when m_child_exited_after_eos_pending is on, eos is off, and therefore the CHILD_EXITED signal is never sent.

This issue is really hard to see on a patch that is almost 600 lines of code and tries to do multiple things at once (including code style changes and bug fixes).

Report

It’s all right, we all make mistakes, surely after realizing there was a problem GNOME developers did all they could to fix the issue. Nope, the problem was initially reported in Nov 2019: Doesn’t close in presence of grandchildren processes, and developers did nothing.

One year later in Dec 2020 I complained (as it’s always a GNOME bug), my comment was immediately removed and the issue was locked so nobody could comment again. That didn’t stop me from fixing the issue in a new report, Fix obvious regression caused by 7888602c, in which I explained an easy way to reproduce the issue:

sleep infinity &
exit

Christian’s response is still visible:

Your behaviour is totally unacceptable, that’s why I locked the other issue. It’s also not condusive to fixing any actual issues.

Works just fine here in the vte test application: the terminal closes after a short delay.

Christian Persch

According to Christian a delay of several seconds is a “short delay”, and there’s no issue, except initially the code hanged forever, that’s why Christian decided to add a 5 second timeout. Of course, a timeout of 5 seconds is not a nice solution, the correct solution of course is to reduce the timeout to 2 seconds. 🤦

If there’s no issue, why did he add two hacks to avoid it? And why did he initially say “I’m not really satisfied with the current solution which consists of a simple timeout”.

And how is my patch that actually fixes the issue not conducive to fixing the issue?

Did any other developer step up to solve the problem? No, Zander Brown and Carlos Soriano both complained about my tone and did nothing to attempt to solve the obvious issue.

In May 2022–one year and a half later–the issue was finally unlocked and two people complained about the problem (Chris Angelico and Egmont Koblinger). Did Christian accept there’s an issue now? No.

I think that the current solution is good enough to not require more work…

Christian Persch

It literally doesn’t matter that users consider this an issue, if Christian says it’s not an issue, then it’s not an issue. Period.

Shit attitude vs. good attitude

I’ve written before about the drastic difference between how GNOME developers and Linux developers approach coding: The Linux way; never ever break user experience. But here I want to focus on two examples.

Linux way

In 2013 I was testing a Linux release candidate (Linux 3.11-rc4) and found a crash running StarCraft II through wine, I bisected the commit that caused the problem and reported it to the Linux mailing list:

I found a regression while running all v3.11-rcX kernels; Starcract II through wine crashes.

Felipe Contreras

The commit that caused the problem was a single line changed: ptrace: PTRACE_DETACH should do flush_ptrace_hw_breakpoint(child). Precisely because it was supposed to be a simple patch, neither Oleg Nesterov nor Linus Torvalds could think of any problem.

Felipe, can you double-check that it’s not timing-related in some subtle way, and test multiple times with just that commit reverted (and not reverted) to make sure that it’s 100% that one single line by that particular commit? Because it does seem very benign..

Linus Torvalds

Nesterov pulled the source code of wine, and found out that their usage of PTRACE_DETACH was deliberately what they thought no one would do, and as a result their change broke the ABI:

Ok, so I guess it’s effectively the ABI, and we should just make the rule be that “if you don’t want stale breakpoints, then remove the breakpoints when you detach”.

And thus reverting it the right thing to do. Agreed?

Linus Torvalds

So, because their change broke wine, they simply reverted the change, which only stayed for about a month, and never made it into any release. This was easy to do because the patch did only one thing, and in fact the commit message was much bigger than the single-line patch (6x bigger).

GNOME developers would have blamed wine instead and claimed they are using the API wrong, not Linux developers, Linux developers do care about their users, and never ever break user experience.

Torvalds understands the whole purpose of software: to be useful to the users, that’s why if in the process of adding a new feature they realize they broke existing behavior, they immediately revert the changes, no excuses. The famous rants of Torvalds have happened when a maintainer who is supposed to know better breaks user experience, and doesn’t own up to their transgression (for example here and here).

The biggest thing any program can do is not the technical details of the program itself; it’s how useful the program is to users.

So any time a program–like the kernel or any other project–breaks the user experience, to me that is the absolute worst failure that a software project can make. It’s the complete no-no to ever break userspace–or for other projects–to ever break features that your users depend on.

Because no project is more important than the users of the project.

Linus Torvalds

Shitty way

How does GNOME deal with breaking user experience? For that we can take a look at the issue: Background configuration is missing in terminal profile editor.

User Eduard Valiauka explained the background configuration was present in 3.6, but not in 3.8, and asked to please bring it back.

No.

Christian Persch

It’s Christian again with his usual emphatic consideration towards the needs of users. Just says “no”, no explanation, no nothing, just “no”.

Because Linux is a proper project, I can be relatively certain than Linux 3.11 will work just as Linux 3.10, but GNOME 3.8 won’t work as GNOME 3.6. Why? Because GNOME does not care about its users.

Update

What has happened since Christian graced us with unlocking the issue?

The old behaviour shut down immediately; the new behaviour is to sleep for some time, waiting for… what, exactly?

The current behaviour is annoying when typing rapidly (eg Ctrl-Z, bg, Ctrl-D, and then go on to do other things), because the terminal lingers for several seconds. Would be nice to be able to resolve this properly, or failing that, to be able to configure the timeout.

Chris Angelico

I’ll definitely at least review patches that actually try to do this. However, any motivation to do further work on this myself was thoroughly extinguished by the disrespectful and disruptive behaviour of a few individuals (not you!) that led me to lock this bug and not engage further.

Christian Persch

Except you are not reviewing my patch.

And yeah, unfortunately the behaviour in question also made me extremely cautious about even speaking up on this issue, for fear that I’d be seen to be whining just like that person was 😦 It’s a mess all around. Once someone’s been toxic on a topic, nobody wants to touch it.

Chris Angelico

No Chris, fixing the issue for you is not being “toxic”, it’s the desired behavior.

Stop being a snowflake, review my patch, and apply it.

For the last few years, this has been the only issue annoying me in gnome-terminal — and it annoys me a little bit on a regular basis.

The bug was that the command’s output wasn’t always fully read and displayed — something that is irrelevant for the default “Exit the terminal” behavior. Since I’m only using the default option here, I wasn’t personally affected by that issue.

The new behavior, i.e. the annoying 2 second delay is something that apparently affects a broader range of users, since it also happens with the terminal’s default behavior, all you have to do is launch some graphical_app & or some app that puts itself in the background.

Egmont Koblinger

Egmont understands the situation: Christian tried to fix a very marginal issue, and while trying to do that he broke the vastly most common use case: using a shell. A proper maintainer like Linus Torvalds would simply revert the new broken “feature” until it’s implemented properly, but Christian is not a proper maintainer, he would rather keep his buggy code rather than revert it.

There’s a third option though: don’t revert everything, simply bypass the new behavior (thus going back to the old behavior).

And of course there’s the fourth option: just apply my patch: Fix exit regression. I tested my patch with the marginal issue nobody cares about and it works fine, and also works fine in the normal situation. So it’s the best of both worlds.

But wait

If that wasn’t enough, Fedora maintainers were pretty much fed up and decided to carry their own patches:

The way this is implemented was chosen simply because it is basically impossible to get any improvements past your and chpe’s [Christian] veto.

We are pretty much tired of this, and will carry these patches downstream in Fedora.

Matthias Clasen

See the issue report that is nine years old: Notifications for long-running commands. For this very same reason Arch Linux has a separate vte3-notification package with these extra patches from Fedora.

It’s not just me.

Conclusion

All GNOME developers gain by blocking me is hurt their own users. The issue is real, it does affect users, and it has been present for more than three years.

As you can see from my other blog post: My tone doesn’t make me wrong, or how I convinced the Ruby project to fix an inconsistency, focusing on a developer’s tone achieves nothing, all that matters is if I’m right. I was right on the Ruby issue, and I’m right on this issue too.

Not content with merging code that is buggy, developed with bad coding practices, not properly tested, and not wanted by the vast majority of people, GNOME developers do not listen to their users, nor are they amenable to criticism.

Even though I’m not using GNOME, I’m not immune to their bad coding practices.

Anyone affected by this issue under Arch Linux can simply use this replacement package in AUR: vte3-nohang.

Unfortunately that’s all we can do to free us from GNOME bugs: workaround their bullshit.

All software has bugs, but no software is supposed to be developed this way: they know they have a regression, and they just don’t care.

This is just one example of one issue, I’ve had multiple ones like this, and I bet the next time my system suddenly starts acting weird, it will be the fault of GNOME, yet again.

Read More