
SSH authentication fails if constraints/compute.requireOsLogin is enforced #852

Open
tstaerk opened this issue May 19, 2022 · 36 comments

@tstaerk commented May 19, 2022

Following your guide, here is what I get when I run terraform apply:

module.hana_node.null_resource.hana_node_provisioner[1]: Still creating... [5m0s elapsed]

│ Error: file provisioner error

│ with module.hana_node.null_resource.hana_node_provisioner[1],
│ on modules/hana_node/salt_provisioner.tf line 23, in resource "null_resource" "hana_node_provisioner":
│ 23: provisioner "file" {

│ timeout - last error: SSH authentication failed (root@34.140.41.24:22): ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

@petersatsuse

Hi Thorsten, could you please provide us with a bit more information?

a) which guide: the GitHub one or the SUSE getting-started documentation
b) which CSP (I guess GCE)
c) your settings of the various SSH flags in tfvars:

  • did you provide the values for public_key/private_key via files, or directly in tfvars?
  • what is your setting of "pre_deployment"? If it is false, did you provide the "cluster_ssh_xxx" settings?

d) which version of the project
@ab-mohamed

@petersatsuse, Thorsten uses the SBP guide I created for GCP.

@tstaerk, I would advise the following:

  1. If you are using an older version of the project, please make sure you switch to the most recent one; the current version is 8.1.0. If in doubt, you may run git pull before creating the environment.
  2. Could you please share your terraform.tfvars file with us? It contains all the configuration settings you used for your environment.

Some other required information:

  1. Used SLES4SAP version: specify the SLES4SAP version you used (SLES12SP4, SLES15SP2, etc.).
  2. Used client machine OS: specify the OS of the machine used to execute the project (Windows, any Linux distro, macOS).
    Even though Terraform is multi-platform, some of the local actions assume a Linux distribution, so some operations might fail for that reason.
  3. Expected behaviour vs. observed behaviour: did your deployment fail, or did it complete with an error message?
  4. The provisioning_log_level = "info" option in the terraform.tfvars file provides more information during the execution of the terraform commands, so it is recommended to run the deployment with this option and see what happens before opening any ticket.
  5. Logs: upload the deployment logs to make root-cause analysis easier.
    The logs might expose sensitive secrets; remove them before uploading anything here, or contact me to send the logs privately to the SUSE teams.

Here is the list of the required logs (each of the deployed machines will have all of them); a sketch for collecting them follows the list:

  • /var/log/salt-os-setup.log
  • /var/log/salt-predeployment.log
  • /var/log/salt-deployment.log
  • /var/log/salt-result.log
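
Once SSH access to the nodes works, a minimal sketch for collecting those logs from one node; the key path and IP address are placeholders, adjust them to your environment:

mkdir -p logs/node-0
scp -i <SSH PRIVATE KEY> \
    root@<NODE_PUBLIC_IP>:/var/log/salt-{os-setup,predeployment,deployment,result}.log \
    logs/node-0/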
@tstaerk commented May 19, 2022

just realized I did not define a VPC... if there is only one, can't it use this?

@tstaerk commented May 19, 2022

OK, I am using GCP and the following tfvars file:

project = "thorstenstaerk-suse-terraforms"
gcp_credentials_file = "sa.json"
region = "europe-west1"
os_image = "suse-sap-cloud/sles-15-sp2-sap"
public_key = "/home/admin_/.ssh/id_rsa.pub"
private_key = "/home/admin_/.ssh/id_rsa"
cluster_ssh_pub = "salt://sshkeys/cluster.id_rsa.pub"
cluster_ssh_key = "salt://sshkeys/cluster.id_rsa"
ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:/ha-clustering:/sap-deployments:/v8/"
provisioning_log_level = "info"
pre_deployment = true
bastion_enabled = false
machine_type = "n1-highmem-16"
hana_inst_master="thorstenstaerk-sap-media-extracted/"
hana_master_password = "SAP_Pass123"

@ab-mohamed

@tstaerk:

I have just completed a successful deployment using the most recent version, 8.1.0, using the following terraform.tfvars file:

project = "<PROJECT ID>"
gcp_credentials_file = "sa-key.json"
region = "us-west1"
os_image = "suse-sap-cloud/sles-15-sp2-sap"
public_key  = "<PATH TO  THE SSH KEY>/gcp_key.pub"
private_key = "<PATH TO  THE SSH KEY>/gcp_key"
cluster_ssh_pub = "salt://sshkeys/cluster.id_rsa.pub"
cluster_ssh_key = "salt://sshkeys/cluster.id_rsa"
ha_sap_deployment_repo = "https://download.opensuse.org/repositories/network:ha-clustering:sap-deployments:v8/"
provisioning_log_level = "info"
pre_deployment = true
bastion_enabled = false
hana_inst_master = "<GCP BUCKET>/HANA/2.0/SPS05/51054623"
hana_master_password = "YourSAPPassword1234"
hana_primary_site = "NUE"
hana_secondary_site = "FRA"

I see that we use almost the same configuration. Could you please ensure that you are using the most recent version, 8.1.0, from the master branch?
I would suggest using a fresh clone to ensure there are no configuration conflicts, or at least executing git pull before starting your deployment.
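
For reference, a minimal sketch of a fresh clone, assuming the deployment is run from the gcp/ directory of the repository (the tag check is optional):

git clone https://github.com/SUSE/ha-sap-terraform-deployments.git
cd ha-sap-terraform-deployments/gcp
git describe --tags    # should report 8.1.0 or newer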

@tstaerk commented May 19, 2022

git pull tells me "already up to date"

@ab-mohamed

Can you please try a fresh clone before digging into the issue?

@tstaerk commented May 19, 2022

deleted and re-checked out

@tstaerk commented May 20, 2022

OK, your terraform.tfvars and mine are identical except for passwords, names, and your two extra lines

hana_primary_site = "NUE"
hana_secondary_site = "FRA"

@tstaerk commented May 20, 2022

I repeated with my old terraform.tfvars and I get:

module.hana_node.module.hana-load-balancer[0].google_compute_health_check.health-check: Creating...

│ Error: Error creating Network: googleapi: Error 409: The resource 'projects/thorstenstaerk-suse-terraforms/global/networks/demo-network' already exists, alreadyExists

│ with google_compute_network.ha_network[0],
│ on infrastructure.tf line 27, in resource "google_compute_network" "ha_network":
│ 27: resource "google_compute_network" "ha_network" {



│ Error: Error creating Disk: googleapi: Error 409: The resource 'projects/thorstenstaerk-suse-terraforms/zones/europe-west1-c/disks/demo-hana-data-1' already exists, alreadyExists

│ with module.hana_node.google_compute_disk.data[1],
│ on modules/hana_node/main.tf line 12, in resource "google_compute_disk" "data":
│ 12: resource "google_compute_disk" "data" {



│ Error: Error creating Disk: googleapi: Error 409: The resource 'projects/thorstenstaerk-suse-terraforms/zones/europe-west1-b/disks/demo-hana-data-0' already exists, alreadyExists

│ with module.hana_node.google_compute_disk.data[0],
│ on modules/hana_node/main.tf line 12, in resource "google_compute_disk" "data":
│ 12: resource "google_compute_disk" "data" {



│ Error: Error creating Disk: googleapi: Error 409: The resource 'projects/thorstenstaerk-suse-terraforms/zones/europe-west1-b/disks/demo-hana-backup-0' already exists, alreadyExists

│ with module.hana_node.google_compute_disk.backup[0],
│ on modules/hana_node/main.tf line 20, in resource "google_compute_disk" "backup":
│ 20: resource "google_compute_disk" "backup" {



│ Error: Error creating Disk: googleapi: Error 409: The resource 'projects/thorstenstaerk-suse-terraforms/zones/europe-west1-c/disks/demo-hana-backup-1' already exists, alreadyExists

│ with module.hana_node.google_compute_disk.backup[1],
│ on modules/hana_node/main.tf line 20, in resource "google_compute_disk" "backup":
│ 20: resource "google_compute_disk" "backup" {



│ Error: Error creating Disk: googleapi: Error 409: The resource 'projects/thorstenstaerk-suse-terraforms/zones/europe-west1-c/disks/demo-hana-software-1' already exists, alreadyExists

│ with module.hana_node.google_compute_disk.hana-software[1],
│ on modules/hana_node/main.tf line 28, in resource "google_compute_disk" "hana-software":
│ 28: resource "google_compute_disk" "hana-software" {



│ Error: Error creating Disk: googleapi: Error 409: The resource 'projects/thorstenstaerk-suse-terraforms/zones/europe-west1-b/disks/demo-hana-software-0' already exists, alreadyExists

│ with module.hana_node.google_compute_disk.hana-software[0],
│ on modules/hana_node/main.tf line 28, in resource "google_compute_disk" "hana-software":
│ 28: resource "google_compute_disk" "hana-software" {



│ Error: Error creating HealthCheck: googleapi: Error 409: The resource 'projects/thorstenstaerk-suse-terraforms/global/healthChecks/demo-hana-health-check' already exists, alreadyExists

│ with module.hana_node.module.hana-load-balancer[0].google_compute_health_check.health-check,
│ on modules/load_balancer/main.tf line 5, in resource "google_compute_health_check" "health-check":
│ 5: resource "google_compute_health_check" "health-check" {

@tstaerk commented May 20, 2022

After deleting all the resources listed above and re-running terraform apply, I now get:

│ Error: Error creating InstanceGroup: googleapi: Error 409: The resource 'projects/thorstenstaerk-suse-terraforms/zones/europe-west1-b/instanceGroups/demo-hana-primary-group' already exists, alreadyExists

│ with module.hana_node.google_compute_instance_group.hana-primary-group,
│ on modules/hana_node/main.tf line 60, in resource "google_compute_instance_group" "hana-primary-group":
│ 60: resource "google_compute_instance_group" "hana-primary-group" {



│ Error: Error creating InstanceGroup: googleapi: Error 409: The resource 'projects/thorstenstaerk-suse-terraforms/zones/europe-west1-c/instanceGroups/demo-hana-secondary-group' already exists, alreadyExists

│ with module.hana_node.google_compute_instance_group.hana-secondary-group,
│ on modules/hana_node/main.tf line 66, in resource "google_compute_instance_group" "hana-secondary-group":
│ 66: resource "google_compute_instance_group" "hana-secondary-group" {



│ Error: file provisioner error

│ with module.hana_node.null_resource.hana_node_provisioner[1],
│ on modules/hana_node/salt_provisioner.tf line 23, in resource "null_resource" "hana_node_provisioner":
│ 23: provisioner "file" {

│ timeout - last error: SSH authentication failed (root@35.187.176.254:22): ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no
│ supported methods remain


│ Error: file provisioner error

│ with module.hana_node.null_resource.hana_node_provisioner[0],
│ on modules/hana_node/salt_provisioner.tf line 23, in resource "null_resource" "hana_node_provisioner":
│ 23: provisioner "file" {

│ timeout - last error: SSH authentication failed (root@130.211.104.240:22): ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no
│ supported methods remain

@tstaerk commented May 20, 2022

SSH cannot work, as Cloud Shell does not have a network connection to a host inside a GCP project

@tstaerk commented May 20, 2022

130.211.104.240 is demo-vmhana01

@ab-mohamed

@tstaerk, please execute the terraform destroy command to destroy your environment before any new attempt to create a new environment with the terraform apply command.

When you SSH to the HANA node using its public IP address, you need to use the SSH key configured in the terraform.tfvars file. Here is the command format:

ssh -i <SSH PRIVATE KEY> root@<HANA_NODE_PUBLIC_IP_ADDRESS>
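
A minimal sketch of the clean re-deployment cycle and connection test described above (placeholders as in the command format):

terraform destroy
terraform apply
ssh -i <SSH PRIVATE KEY> root@<HANA_NODE_PUBLIC_IP_ADDRESS>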
@tstaerk commented May 20, 2022

Hi, I do not call ssh myself. I get an error that SSH is not possible, and I think this is because of the isolation between Cloud Shell and the VMs.

@tstaerk commented May 20, 2022

ok, makes sense - you use the public IP address. Here is what I get:

admin_@cloudshell:~$ ssh -i .ssh/id_rsa root@130.211.104.240
The authenticity of host '130.211.104.240 (130.211.104.240)' can't be established.
ECDSA key fingerprint is SHA256:YgYUATM68uQX/KEEXAqXUm18U+BMR9/1M1iDic7PfVI.
Are you sure you want to continue connecting (yes/no/[fingerprint])?
Host key verification failed.

@ab-mohamed

Three possible troubleshooting steps:

  1. Ensure that the public SSH key is attached to the two HANA nodes. If not, attach it manually and try again.
  2. Ensure that the SSH key pair has the proper permissions (see the sketch after this list):
    Public key -> 600
    Private key -> 400
  3. Try using the -v option with the SSH command to gather more information.
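
For example, a hedged sketch using the key paths from the tfvars above and one of the node IPs from the error output (the permissions match the suggestion above):

chmod 600 /home/admin_/.ssh/id_rsa.pub
chmod 400 /home/admin_/.ssh/id_rsa
ssh -v -i /home/admin_/.ssh/id_rsa root@130.211.104.240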
@tstaerk commented May 26, 2022

Two questions come to mind:

  • what is salt://sshkeys/cluster.id_rsa.pub? Where does it come from? Can I check if mine is right?
  • you said it worked for you, and it uses ssh. So you must have a firewall rule, right?
@yeoldegrove (Collaborator)

> ok, makes sense - you use the public IP address. Here is what I get:
>
> admin_@cloudshell:~$ ssh -i .ssh/id_rsa root@130.211.104.240
> The authenticity of host '130.211.104.240 (130.211.104.240)' can't be established.
> ECDSA key fingerprint is SHA256:YgYUATM68uQX/KEEXAqXUm18U+BMR9/1M1iDic7PfVI.
> Are you sure you want to continue connecting (yes/no/[fingerprint])?
> Host key verification failed.

It is perfectly fine that this fails. Just make sure you delete the old host key from your known_hosts.
A bit more context: https://linuxhint.com/host-key-verification-failed-mean/
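
A quick way to drop the stale host key (IP address as above):

ssh-keygen -R 130.211.104.240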

@yeoldegrove (Collaborator)

> Two questions come to mind:
>
> • what is salt://sshkeys/cluster.id_rsa.pub? Where does it come from? Can I check if mine is right?

This is the cluster's SSH key. Normally you don't have to tamper with it.

> • you said it worked for you, and it uses ssh. So you must have a firewall rule, right?

You CAN connect via ssh/port 22, so this will not be a firewall issue.

@tstaerk The ssh keys that are used by terraform to connect via ssh and run salt are these:

public_key = "/home/admin_/.ssh/id_rsa.pub"
private_key = "/home/admin_/.ssh/id_rsa"

Did you create these, and are you also using them in your test?
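
One hedged sanity check, assuming the paths above: regenerate the public key from the private key and compare it, ignoring the trailing comment, with the configured .pub file. Any output means the pair does not match.

diff <(ssh-keygen -y -f /home/admin_/.ssh/id_rsa | cut -d' ' -f1,2) \
     <(cut -d' ' -f1,2 /home/admin_/.ssh/id_rsa.pub)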

@ab-mohamed

@tstaerk In addition to @yeoldegrove's notes and questions, you may manually attach the SSH public key to your nodes as a troubleshooting step, for example as sketched below.
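
A hedged sketch of one way to do that: on each HANA node (for example via the Cloud Console browser SSH), append the public key used in terraform.tfvars to root's authorized_keys. The key below is a placeholder.

cat >> /root/.ssh/authorized_keys <<'EOF'
ssh-rsa AAAA...your-public-key... admin_@cloudshell
EOF
chmod 600 /root/.ssh/authorized_keys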

@tstaerk commented May 31, 2022

Added the authorized_keys file manually to both nodes; now the install looks like it's doing something!

@tstaerk commented May 31, 2022

The install finished, and hdbsql answers my SQL queries. Please make sure the authorized_keys entries get created automatically!

@yeoldegrove (Collaborator)

@tstaerk There is of course already code that handles this: https://github.com/SUSE/ha-sap-terraform-deployments/blob/main/gcp/modules/hana_node/main.tf#L155
Are you sure you created the key files and set the correct variables in terraform.tfvars?
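
To verify what actually landed on a node, one hedged check is to inspect the instance metadata (instance name from this thread; the zone is assumed from the disk errors above):

gcloud compute instances describe demo-vmhana01 --zone europe-west1-b --format="yaml(metadata)"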

@tstaerk commented Jun 3, 2022

reproducing it now

@tstaerk commented Jun 6, 2022

@yeoldegrove: looking at https://github.com/SUSE/ha-sap-terraform-deployments/blob/main/gcp/modules/hana_node/main.tf#L155, you only add the SSH key to the instance's metadata, so passwordless SSH login would only work if the project is set to os_login=false, right? Have you ever tested it with os_login=true?

@yeoldegrove (Collaborator)

@tstaerk I still do not get which exact problem you are having and trying to solve. Could you elaborate on that?

SSH keys are added to the instance's metadata in the usual way, as you pointed out.
Are you using the "Cloud Console"? AFAIK most users use their workstations to deploy this.
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance#metadata points out that keys added by the Cloud Console will be removed. Maybe this is your issue?

Also, I am not sure what you mean by os_login=true/false. Where would I set this?

@tstaerk commented Jun 8, 2022

You would go to the Cloud Console, search for "Metadata", select it, and there set the key os_login to the value false. Then the SSH key set via https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance#metadata will be respected.

@yeoldegrove (Collaborator)

@tstaerk are you talking about https://console.cloud.google.com/compute/metadata where I could set e.g. https://cloud.google.com/compute/docs/oslogin/set-up-oslogin ?

Just so that I do not miss anything... Could you please sum up what exactly is not working for you (your use case) and how exactly you solve it?

Would just setting enable-oslogin=FALSE in https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance#metadata fix it for you?

@tstaerk commented Jun 13, 2022

We found the error: we had an organization policy (constraints/compute.requireOsLogin) active that enforced enable-oslogin=true for every project.

This led to the SSH error: [screenshot: the SSH error]
Also, host key verification was not the problem:

admin_@cloudshell:~/ha-sap-terraform-deployments/gcp (tstaerk-tf-demo)$ ssh -o StrictHostKeyChecking=no root@34.79.69.80
Warning: Permanently added '34.79.69.80' (ECDSA) to the list of known hosts.
root@34.79.69.80: Permission denied (publickey).

The issue was that the public SSH key was not automatically added to the HANA nodes' authorized_keys. To change this, we set enable-oslogin=false in the project metadata: [screenshot: project metadata with enable-oslogin=false]

Then SSH worked and the key could be found in authorized_keys:

admin_@cloudshell:~/ha-sap-terraform-deployments/gcp (tstaerk-tf-demo)$ ssh -o StrictHostKeyChecking=no root@34.79.69.80
SUSE Linux Enterprise Server 15 SP2 for SAP Applications x86_64 (64-bit)

As "root" (sudo or sudo -i) use the:
  - zypper command for package management
  - yast command for configuration management

Management and Config: https://www.suse.com/suse-in-the-cloud-basics
Documentation: https://www.suse.com/documentation/sles-15/
Community: https://community.suse.com/

Have a lot of fun...
demo-hana02:~ # cat .ssh/authorized_keys
# Added by Google
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDfWjWgE1NkXnmv0UgAkm+zHnJ2UJgTVpMEAlc3Fo+tH6U1BsPL++ceiE+mAAjcT41j7Ew5N4qyranPSTQOvrLSGvCITP4edAJlbrh4JOzy5/aNP/EfWZiprtytrkdBEzd0gbhg+Bh98FlEUoxLtZSFsP2090zI7hTuT9DEB3eQknMkR9g+JsgGcDd0t4kdERaLZp+spkPCJF3LQ2h+9ZbmHqwBjzYLsJLRMma3y+aU80IHONBOEaX+ab+1vR1CuxMBwRjSlDkfRVBuxMWnj+ipQaLjiMLFaGbANFxPFj4AaeDnYO/jnKUaIRQOEAvpgjN9r5hVsRT0I+cpBvTpqcrx admin_@cs-485070161371-default-boost-wds4w

So one solution would be to manually copy the public SSH key into the OS's authorized_keys file. Another option could be to check whether constraints/compute.requireOsLogin is enforced and, if so, tell the user that they have to manually copy the SSH key to all nodes.
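
A hedged sketch of how this can be checked and changed from the command line (project ID as in this thread; the second command only helps if the org policy does not force OS Login back on):

gcloud resource-manager org-policies describe compute.requireOsLogin \
    --project thorstenstaerk-suse-terraforms --effective
gcloud compute project-info add-metadata --metadata enable-oslogin=FALSE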

tstaerk changed the title from "SSH authentication failed" to "SSH authentication fails if constraints/compute.requireOsLogin is enforced" on Jun 13, 2022
@tstaerk commented Jun 24, 2022

Hi @yeoldegrove,

thanks for all your contributions here. @ab-mohamed and I really invested a lot of work debugging an "it all boils down to: it doesn't work" issue, and arrived at a conclusion: if you have an org policy requiring OS Login, you get an error message like the one in the description. Solution: remove this org policy and set enable-oslogin=false. If you cannot do this, manually go to the HANA nodes and add the public key to authorized_keys. Would it be possible to document this or implement a respective error message/policy check?

@yeoldegrove (Collaborator)

@tstaerk Ok, so this is a global setting which is not directly related to this project but gets in the way ...

Could you check if it would be sufficient to set metadata = { enable-oslogin = false, sshKeys = "..." } here for every compute instance deployed by this project?

It would have to be added to every module that builds up compute instances... like here:

If this does not work, we should definitely write something into the README.md. A contribution/PR from your side would be appreciated here, as you're way more into the topic right now ;)
2-3 sentences with a bit of context in the https://github.com/SUSE/ha-sap-terraform-deployments/tree/main/gcp#troubleshooting section should be enough.

yeoldegrove self-assigned this on Jun 27, 2022
@tstaerk commented Jun 27, 2022

If you have an organization policy that forbids it, you cannot set metadata = { enable-oslogin = false, sshKeys = "..." }

@tstaerk commented Jul 15, 2022

OK, I propose that we add the error message to the documentation and explain how to check whether the issue is caused by the organization policy, and how to resolve it if you have the Org Policy Admin role.
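
For the documentation, a hedged sketch of how someone with the Org Policy Admin role could lift the enforcement for a single project (project ID from this thread; the check itself is shown a few comments above):

gcloud resource-manager org-policies disable-enforce compute.requireOsLogin \
    --project thorstenstaerk-suse-terraforms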

@yeoldegrove (Collaborator)

@tstaerk Do you want to make a PR (which would be preferred by me, as you're more into the topic), or shall I write something up (and let you review it)?

@tstaerk commented Jul 18, 2022

I work closely with @ab-mohamed; I think we can come up with something.
