Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240708162106.GA4920@openwall.com>
Date: Mon, 8 Jul 2024 18:21:06 +0200
From: Solar Designer <solar@...nwall.com>
To: oss-security@...ts.openwall.com
Cc: Qualys Security Advisory <qsa@...lys.com>
Subject: Re: CVE-2024-6387: RCE in OpenSSH's server, on glibc-based Linux systems

Hi,

Today is the coordinated release date to publicly disclose a related
issue I found during review of Qualys' findings, with further analysis
by Qualys.  My summary is:

CVE-2024-6409: OpenSSH: Possible remote code execution in privsep child
due to a race condition in signal handling

OpenSSH versions 8.7 and 8.8 and the corresponding portable releases
call cleanup_exit() from grace_alarm_handler() when running in the
privsep child process.  cleanup_exit() was not meant to be called from a
signal handler and may call other async-signal-unsafe functions.  The
current understanding is that in those upstream versions cleanup_exit()
would not actually call async-signal-unsafe functions under those
conditions, but with downstream distribution patches it sometimes does.
Specifically, openssh-7.6p1-audit.patch found in Red Hat's package of
OpenSSH adds code to cleanup_exit() that exposes the issue.  Relevantly,
this patch is found in RHEL 9 (and its rebuild/downstream
distributions), where the package is based on OpenSSH 8.7p1.

The audit patch is also found in Fedora, so the package versions that
were based on 8.7p1 and 8.8p1 are affected.  Per change log, it appears
that out of Fedora releases only 36 and 37 were affected, as well as
some updates maybe starting with those for 35 and until those for 37.
These versions are now end-of-life, and Fedora 38+ has moved to newer
upstream OpenSSH that doesn't make the problematic cleanup_exit() call.

The main difference from CVE-2024-6387 is that the race condition and
RCE potential are triggered in the privsep child process, which runs
with reduced privileges compared to the parent server process.  So
immediate impact is lower.  However, there may be differences in
exploitability of these vulnerabilities in a particular scenario, which
could make either one of these a more attractive choice for an attacker,
and if only one of these is fixed or mitigated then the other becomes
more relevant.  In particular, the "LoginGraceTime 0" mitigation works
against both issues, whereas the "-e" mitigation only works against
CVE-2024-6387 and not (fully) against CVE-2024-6409.  It may also be
possible to construct an exploit that would work against either
vulnerability probabilistically, which could decrease attack duration or
increase success rate.  That said, actual exploitation of CVE-2024-6409
has not yet been attempted and thus has not been proven.

I brought this extra issue to distros on June 26 and Qualys confirmed
and completed most of the analysis later same day.  Qualys later found
another issue with the audit patch, which is currently believed to be
mostly non-security.  I include it closer to the end of this message.

I'm sorry for not disclosing CVE-2024-6409 on the same day as
CVE-2024-6387, which would have saved many of us time (me included).
I agreed for the separate coordinated release date because apparently
Red Hat already had CVE-2024-6387 in the pipeline and wasn't ready to
add a fix for CVE-2024-6409 at the same time.

I quote relevant excerpts from the distros list discussion below:

On Wed, Jun 26, 2024 at 04:32:49AM +0200, Solar Designer wrote:
> On Thu, Jun 20, 2024 at 01:26:52PM +0000, Qualys Security Advisory wrote:
> > A few days after we started our work on amd64, we noticed the following
> > bug report (in OpenSSH's public Bugzilla), about a deadlock in sshd's
> > SIGALRM handler:
> > 
> >   https://bugzilla.mindrot.org/show_bug.cgi?id=3690
> 
> Here's another curious one:
> 
> https://bugzilla.mindrot.org/show_bug.cgi?id=3286#c7
> 
> In 2021, djm@ was wondering "why this is triggering when it wasn't
> before" and "I still don't understand why this condition did not exist
> previously."  I guess the reason was this same regression in the signal
> handler.
> 
> The patch for the above issue conditionally replaced the call to
> sigdie() with cleanup_exit(255), but I think that other call was also
> async-signal-unsafe!
> 
> commit e3c032333be5fdbbaf2751f6f478e044922b4ec4
> Author: djm@...nbsd.org <djm@...nbsd.org>
> Date:   Fri May 7 03:09:38 2021 +0000
> 
>     upstream: don't sigdie() in signal handler in privsep child process;
>     
>     this can end up causing sandbox violations per bz3286; ok dtucker@
>     
>     OpenBSD-Commit-ID: a7f40b2141dca4287920da68ede812bff7ccfdda

> -/* $OpenBSD: sshd.c,v 1.572 2021/04/03 06:18:41 djm Exp $ */
> +/* $OpenBSD: sshd.c,v 1.573 2021/05/07 03:09:38 djm Exp $ */
> 
> So in OpenSSH 8.7p1 as used e.g. in RHEL 9, we have this:
> 
>         /* Log error and exit. */
>         if (use_privsep && pmonitor != NULL && pmonitor->m_pid <= 0)
>                 cleanup_exit(255); /* don't log in privsep child */
>         else {
>                 sigdie("Timeout before authentication for %s port %d",
>                     ssh_remote_ipaddr(the_active_state),
>                     ssh_remote_port(the_active_state));
>         }
> 
> So if patching that version, I think not only the body of sshsigdie()
> should be #if 0'ed out, but also the above call to cleanup_exit(255);
> replaced with _exit(1).
> 
> Indeed, that's also the difference between sshlogdie(), which uses
> cleanup_exit(255), and sshsigdie(), which uses _exit(1).
> 
> This extra problematic logic only existed in upstream OpenSSH(-portable)
> for ~9 months, having been removed with:
> 
> commit 4e62c13ab419b4b224c8bc6a761e91fcf048012d
> Author: dtucker@...nbsd.org <dtucker@...nbsd.org>
> Date:   Tue Feb 1 07:57:32 2022 +0000
> 
>     upstream: Remove explicit kill of privsep preauth child's PID in
>     
>     SIGALRM handler. It's no longer needed since the child will get terminated by
>     the SIGTERM to the process group that cleans up any auth helpers, it
>     simplifies the signal handler and removes the risk of a race when updating
>     the PID. Based on analysis by HerrSpace in github PR#289, ok djm@
>     
>     OpenBSD-Commit-ID: 2be1ffa28b4051ad9e33bb4371e2ec8a31d6d663

> -/* $OpenBSD: sshd.c,v 1.582 2021/11/18 03:07:59 djm Exp $ */
> +/* $OpenBSD: sshd.c,v 1.583 2022/02/01 07:57:32 dtucker Exp $ */
> 
> which obviously left the (even worse) sshdie() intact.
> 
> In other words, depending on what exact cleanups are done, it may (or
> may not, I haven't checked) have briefly been possible to exploit a race
> condition of this sort not only in the parent (worse), but also in the
> privsep child (relatively mild).  An extra vulnerability nevertheless,
> maybe with different exploitability conditions - maybe exploitable even
> on *BSD where sshdie() is not?

As a side note, I wrote *BSD above because I had found that syslog_r()
is also available on NetBSD, as well as on AIX and even Tru64, so
_maybe_ some builds of OpenSSH for some of those systems were not
affected by CVE-2024-6387.  We did not research this further.

> Now, cleanup_exit() _might_ have actually be safe if called from this
> place.  Upstream 8.7p1's is:
> 
> /* server specific fatal cleanup */
> void
> cleanup_exit(int i)
> {
>         if (the_active_state != NULL && the_authctxt != NULL) {
>                 do_cleanup(the_active_state, the_authctxt);
>                 if (use_privsep && privsep_is_preauth &&
>                     pmonitor != NULL && pmonitor->m_pid > 1) {
>                         debug("Killing privsep child %d", pmonitor->m_pid);
>                         if (kill(pmonitor->m_pid, SIGKILL) != 0 &&
>                             errno != ESRCH) {
>                                 error_f("kill(%d): %s", pmonitor->m_pid,
>                                     strerror(errno));
>                         }
>                 }
>         }
> #ifdef SSH_AUDIT_EVENTS
>         /* done after do_cleanup so it can cancel the PAM auth 'thread' */
>         if (the_active_state != NULL && (!use_privsep || mm_is_monitor()))
>                 audit_event(the_active_state, SSH_CONNECTION_ABANDON);
> #endif
>         _exit(i);
> }

> Further, RHEL 9 patched this to:
> 
> /* server specific fatal cleanup */
> void
> cleanup_exit(int i)
> {
>         static int in_cleanup = 0;
>         int is_privsep_child;
> 
>         /* cleanup_exit can be called at the very least from the privsep
>            wrappers used for auditing.  Make sure we don't recurse
>            indefinitely. */
>         if (in_cleanup)
>                 _exit(i);
>         in_cleanup = 1;
>         if (the_active_state != NULL && the_authctxt != NULL) {
>                 do_cleanup(the_active_state, the_authctxt);
>                 if (use_privsep && privsep_is_preauth &&
>                     pmonitor != NULL && pmonitor->m_pid > 1) {
>                         debug("Killing privsep child %d", pmonitor->m_pid);
>                         if (kill(pmonitor->m_pid, SIGKILL) != 0 &&
>                             errno != ESRCH) {
>                                 error_f("kill(%d): %s", pmonitor->m_pid,
>                                     strerror(errno));
>                         }
>                 }
>         }
>         is_privsep_child = use_privsep && pmonitor != NULL && pmonitor->m_pid == 0;
>         if (sensitive_data.host_keys != NULL && the_active_state != NULL)
>                 destroy_sensitive_data(the_active_state, is_privsep_child);
>         if (the_active_state != NULL)
>                 packet_destroy_all(the_active_state, 1, is_privsep_child);
> #ifdef SSH_AUDIT_EVENTS
>         /* done after do_cleanup so it can cancel the PAM auth 'thread' */
>         if (the_active_state != NULL &&
>             (the_authctxt == NULL || !the_authctxt->authenticated) &&
>             (!use_privsep || mm_is_monitor()))
>                 audit_event(the_active_state, SSH_CONNECTION_ABANDON);
> #endif
>         _exit(i);
> }
> 
> from where it looks like other cleanup calls can be made even when
> called from privsep child prior to authentication

> In summary:
> 
> 1. I think the call to
>                 cleanup_exit(255); /* don't log in privsep child */
> should also be patched to _exit(1) in distro packages that have that
> (perhaps those based on 8.7p1 and 8.8p1).
> 
> 2. The logic of upstream cleanup_exit() (but not necessarily distros')
> is such that the call _may_ have been safe (unless the other race?)
> 
> 3. My analysis is incomplete.  I welcome further analysis by others -
> Qualys, distros, and we could want to contact upstream (via Qualys?)
> for comment even though the issue is already inadvertently fixed.
> 
> 4. If this is in fact an extra vulnerability (even if only exposed in a
> distro?), should it perhaps get its own CVE?  It's different affected
> versions range, which I think qualifies it for a different CVE.  (If
> confirmed that this is in fact a vulnerability.)

On Wed, Jun 26, 2024 at 11:06:43AM +0000, Qualys Security Advisory wrote:
> On Wed, Jun 26, 2024 at 04:32:49AM +0200, Solar Designer wrote:
> > Here's another curious one:
> 
> Well spotted! Yes, the version matches (8.5p1), this is definitely the
> same root cause (sshsigdie() was called and it is doing something,
> instead of just calling _exit(1)).
> 
> > Now, cleanup_exit() _might_ have actually be safe if called from this
> > place.  Upstream 8.7p1's is:
> 
> Interesting, thanks! We have spent some time on this:
> 
> ------------------------------------------------------------------------
> 2429 cleanup_exit(int i)
> 2430 {
> 2431         if (the_active_state != NULL && the_authctxt != NULL) {
> 2432                 do_cleanup(the_active_state, the_authctxt);
> 2433                 if (use_privsep && privsep_is_preauth &&
> 2434                     pmonitor != NULL && pmonitor->m_pid > 1) {
> 2435                         debug("Killing privsep child %d", pmonitor->m_pid);
> 2436                         if (kill(pmonitor->m_pid, SIGKILL) != 0 &&
> 2437                             errno != ESRCH) {
> 2438                                 error_f("kill(%d): %s", pmonitor->m_pid,
> 2439                                     strerror(errno));
> 2440                         }
> 2441                 }
> 2442         }
> 2443 #ifdef SSH_AUDIT_EVENTS
> 2444         /* done after do_cleanup so it can cancel the PAM auth 'thread' */
> 2445         if (the_active_state != NULL && (!use_privsep || mm_is_monitor()))
> 2446                 audit_event(the_active_state, SSH_CONNECTION_ABANDON);
> 2447 #endif
> 2448         _exit(i);
> 2449 }
> ------------------------------------------------------------------------
> 
> - The block at lines 2431-2442 is actually entered in the unpriv child,
>   pre-auth, because both the_active_state and the_authctxt are set. The
>   block at lines 2433-2441 is not entered in the unpriv child, so we can
>   forget about this one.
> 
> - The lines 2445-2446 is not executed in the unpriv child, so we can
>   forget about this one too.
> 
> - This leaves us with the call to do_cleanup() at line 2432:
> 
> ------------------------------------------------------------------------
> 2643 do_cleanup(struct ssh *ssh, Authctxt *authctxt)
> 2644 {
> ....
> 2647         debug("do_cleanup");
> ....
> 2662         if (options.use_pam) {
> 2663                 sshpam_cleanup();
> 2664                 sshpam_thread_cleanup();
> 2665         }
> ....
> 2668         if (!authctxt->authenticated)
> 2669                 return;
> ------------------------------------------------------------------------
> 
> - The debug() call at line 2647 is not ideal, because it calls
>   snprintf(), but we do not consider this to be a practical problem,
>   because we do not know of an snprintf() implementation that calls
>   malloc() (unless the format string uses positional arguments or
>   floating points, which is not the case here).
> 
> - The calls to PAM functions at lines 2662-2665 are no-ops in the unpriv
>   child, so not a problem either.
> 
> - And then do_cleanup() returns at lines 2668-2669, so OK.
> 
> Summary: we consider this upstream version to be safe (from a practical
> point of view, i.e. despite the few calls to snprintf()).
> 
> > Further, RHEL 9 patched this to:
> 
> Ouch, we had not seen this one; it is definitely a problem:
> 
> ------------------------------------------------------------------------
> 2657 cleanup_exit(int i)
> 2658 {
> ....
> 2681         if (sensitive_data.host_keys != NULL && the_active_state != NULL)
> 2682                 destroy_sensitive_data(the_active_state, is_privsep_child);
> 2683         if (the_active_state != NULL)
> 2684                 packet_destroy_all(the_active_state, 1, is_privsep_child);
> ------------------------------------------------------------------------
> 
> - The call to destroy_sensitive_data() may or may not be a problem,
>   because the "sensitive data" (private keys) were already destroyed in
>   the unpriv child at that point.
> 
> - But the call to packet_destroy_all() is definitely a problem, because
>   it calls packet_destroy_state(), which calls free() everywhere.
> 
> Summary: this patched version of cleanup_exit() makes the unpriv child
> vulnerable too.
> 
> > 1. I think the call to
> >                 cleanup_exit(255); /* don't log in privsep child */
> > should also be patched to _exit(1) in distro packages that have that
> > (perhaps those based on 8.7p1 and 8.8p1).
> 
> Agreed (by "patched" you mean "replace the call to cleanup_exit(255)
> with a call to _exit(1)", right?).
> 
> > 2. The logic of upstream cleanup_exit() (but not necessarily distros')
> > is such that the call _may_ have been safe
> 
> Agreed as well.

The below excepts are about an extra issue (beyond CVE-2024-6409) - the
audit patch's logging of SSH host key fingerprints apparently being
broken at least on RHEL 9.  This issue is only indirectly related to
CVE-2024-6409 because it explains why more async-signal-unsafe calls are
made than would have been otherwise.

On Thu, Jul 04, 2024 at 01:40:34AM +0000, Qualys Security Advisory wrote:
> we found more information on the following:
> 
> On Wed, Jun 26, 2024 at 11:06:43AM +0000, Qualys Security Advisory wrote:
> > - The call to destroy_sensitive_data() may or may not be a problem,
> >   because the "sensitive data" (private keys) were already destroyed in
> >   the unpriv child at that point.
> 
> It is a problem: destroy_sensitive_data() calls malloc functions many
> times (especially free()) because even if only the public keys remain in
> sensitive_data.host_keys, they are still passed to sshkey_free().
> 
> In fact, destroy_sensitive_data() even ends up calling syslog() itself:
> PRIVSEP(audit_destroy_sensitive_data()) calls mm_request_send(), which
> can fail if the priv/parent process already _exit()ed, so fatal_f() is
> called, which calls mm_log_handler(), which also fails because the priv/
> parent process _exit()ed, so fatal_f() is called again, but this time it
> calls syslog().
> 
> If you are wondering why destroy_sensitive_data() calls
> PRIVSEP(audit_destroy_sensitive_data()) in the first place, we were
> wondering too, because this should happen only if one of the keys is
> private, and no private key should remain in the unpriv/child process!
> This is actually a bug in sshkey_is_private(): for ed25519 keys, this
> function checks ed25519_pk, which is the public key! It should check
> ed25519_sk instead, which is the secret/private key.

On Sun, Jul 07, 2024 at 11:21:37PM +0200, Solar Designer wrote:
> Thank you for reporting this to distros.  I don't see this function in
> any upstream version at all - do you?  I see it being added via
> openssh-7.6p1-audit.patch in Red Hat's package.  I don't know if an
> equivalent patch is maybe also added in some other distros?
> 
> As to security consequences, at first my concern was this could leave
> secret host keys around in privsep child.  However, upon a closer look
> the only thing audit_destroy_sensitive_data() does is log a message, so
> the only impact is erroneous audit logging.

> Actual processing of keys is untouched by this patch, so probably
> remains correct.

On Mon, Jul 08, 2024 at 05:23:22PM +0200, Solar Designer wrote:
> On Mon, Jul 08, 2024 at 03:00:37PM +0000, Qualys Security Advisory wrote:
> > On Mon, Jul 08, 2024 at 03:35:29PM +0200, Solar Designer wrote:
> > > [...] with RH's original code
> > > I saw the same key's destruction logged 3 times.  I think some of those
> > > were not from the unprivileged child, and actually made sense.  So I was
> > > hoping that fixing the check from pk to sk would suppress only the
> > > erroneous logging from the unprivileged child, but in my testing it
> > > suppressed all of it.
> > 
> > [...] I think that I have
> > found the solution to this mystery, but I have not yet investigated the
> > consequences (if any): the private host keys are "shielded" as soon as
> > they are loaded (by sshkey_shield_private()), which means that, to the
> > (non-buggy) tests in sshkey_is_private(), they look like public keys,
> > but the private part is actually there, in another part of the key
> > structure, in encrypted form.

> The shielding was added upstream in:
> 
> commit 4f7a56d5e02e3d04ab69eac1213817a7536d0562
> Author: djm@...nbsd.org <djm@...nbsd.org>
> Date:   Fri Jun 21 04:21:04 2019 +0000
> 
>     upstream: Add protection for private keys at rest in RAM against
>     
>     speculation and memory sidechannel attacks like Spectre, Meltdown, Rowhammer
>     and Rambleed. This change encrypts private keys when they are not in use with
>     a symmetic key that is derived from a relatively large "prekey" consisting of
>     random data (currently 16KB).
> 
> which is newer than 7.6p1 that openssh-7.6p1-audit.patch was presumably
> based on.  So it is possible that this audit logging was more functional
> initially (sans this bug for ed25519 you found), but mostly broke when
> it was rebased on newer versions without consideration for the change
> above.  However, that's just a guess, which I did not confirm.  A way to
> confirm it could be to test a Red Hat package that already has the audit
> patch, but does not yet have upstream's shielding.  Since the above
> upstream commit isn't in 8.0 (it's about 1 week newer than V_8_0 tag),
> maybe RHEL 8 is suitable for this test (its package is based on 8.0p1).
> 
> ... and indeed, on a RHEL 8 rebuild system I see "msg='op=destroy
> kind=server fp=SHA256:" /var/log/audit/audit.log lines with 3 different
> fingerprint values, presumably corresponding to the different key types.
> This is different from the RHEL 9 rebuild system I checked before, where
> only one fingerprint value would be seen despite of the system also
> having 3 different host key types.

In the public disclosure of CVE-2024-6387, Qualys wrote:

On Mon, Jul 01, 2024 at 08:40:06AM +0000, Qualys Security Advisory wrote:
> We decided to target Rocky Linux 9 (a Red Hat Enterprise Linux 9
> derivative), from "Rocky-9.4-x86_64-minimal.iso", for two reasons:
> 
> - its OpenSSH version (8.7p1) is vulnerable to this signal handler race
>   condition and its glibc is always mapped at a multiple of 2MB (because
>   of the ASLR weakness discussed in the previous "Theory" subsection),
>   which makes partial pointer overwrites much more powerful;
> 
> - the syslog() function (which is async-signal-unsafe but is called by
>   sshd's SIGALRM handler) of this glibc version (2.34) internally calls
>   __open_memstream(), which malloc()ates a FILE structure in the heap,
>   and also calls calloc(), realloc(), and free() (which gives us some
>   much-needed freedom).
> 
> With a heap corruption as a primitive, two FILE structures malloc()ated
> in the heap, and 21 fixed bits in the glibc's addresses, we believe that
> this signal handler race condition is exploitable on amd64 (probably not
> in ~6-8 hours, but hopefully in less than a week). Only time will tell.

FWIW, we patched CVE-2024-6387 in Rocky Linux SIG/Security and CIQ 9.2
LTS on July 1, and here's the patch we're getting into both today:

$ cat openssh-8.7p1-rocky-CVE-2024-6409.patch
diff -urp openssh-8.7p1-38.el9_4.1-tree.orig/sshd.c openssh-8.7p1-38.el9_4.1-tree/sshd.c
--- openssh-8.7p1-38.el9_4.1-tree.orig/sshd.c	2024-07-08 03:42:51.431994307 +0200
+++ openssh-8.7p1-38.el9_4.1-tree/sshd.c	2024-07-08 03:48:13.860316451 +0200
@@ -384,7 +384,7 @@ grace_alarm_handler(int sig)
 
 	/* Log error and exit. */
 	if (use_privsep && pmonitor != NULL && pmonitor->m_pid <= 0)
-		cleanup_exit(255); /* don't log in privsep child */
+		_exit(1); /* don't log in privsep child */
 	else {
 		sigdie("Timeout before authentication for %s port %d",
 		    ssh_remote_ipaddr(the_active_state),

We'll update the Rocky Linux SIG/Security wiki shortly (didn't do this
in advance in order to push this public disclosure out to all first):

https://sig-security.rocky.page/packages/openssh/

The weakening of ASLR bothers me, but is to be discussed separately.

Alexander

Powered by blists - more mailing lists

Please check out the Open Source Software Security Wiki, which is counterpart to this mailing list.

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.