
unable to pull secrets or registry auth: pull command failed #290

davorceman opened this issue Jan 18, 2022 · 4 comments

davorceman commented Jan 18, 2022

Hello,

I had successfully deployed and tested this solution and it worked (version 0.42).
Now, with the same stack, the status has been "Running" for a whole day,
and I found out that the ECS DelStack tasks are failing with this error:

STOPPED (ResourceInitializationError: unable to pull secrets or registry auth: pull command failed: : signal: killed)

I have no idea how to troubleshoot this.
I have all the VPC endpoints in place, with the private subnets attached, the same ones passed as a parameter to the CloudFormation stack.
The security group is also the same "default" one, attached to the ECS service and to all of these endpoints.

Here are the parameters I'm using, almost all defaults.
This is Terraform code, but you can see the parameters:

resource "aws_cloudformation_stack" "s3_find_and_forget_ohio" {
  name         = format("%s-s3f2-ohio", terraform.workspace)
  template_url = format("https://solution-builders-%s.s3.%s.amazonaws.com/amazon-s3-find-and-forget/%s/template.yaml", data.aws_region.ohio.name, data.aws_region.ohio.name, local.s3f2_version)

  parameters = {
    AdminEmail                       = local.s3f2_admin
    DeployVpc                        = false
    VpcSecurityGroups                = module.vpc_ohio.default_security_group_id
    VpcSubnets                       = join(",", module.vpc_ohio.private_subnets)
    CreateCloudFrontDistribution     = true
    AccessControlAllowOriginOverride = false
    AthenaConcurrencyLimit           = 20
    DeletionTasksMaxNumber           = 3
    DeletionTaskCPU                  = 4096
    DeletionTaskMemory               = 30720
    QueryExecutionWaitSeconds        = 3
    QueryQueueWaitSeconds            = 3
    ForgetQueueWaitSeconds           = 30
    CognitoAdvancedSecurity          = "OFF"
    EnableAPIAccessLogging           = false
    EnableContainerInsights          = false
    JobDetailsRetentionDays          = 0
    EnableDynamoDBBackups            = false
    RetainDynamoDBTables             = true
    AthenaWorkGroup                  = "primary" #module.athena_s3f2_tool_ohio.athena_workgroups.name
    PreBuiltArtefactsBucketOverride  = false
  }

  capabilities = [
    "CAPABILITY_AUTO_EXPAND",
    "CAPABILITY_IAM",
    "CAPABILITY_NAMED_IAM",
  ]

  tags = local.tags
}

Also, a different minor issue: I wanted to use my own AthenaWorkGroup, but I was not able to set the bucket permissions.
I tried with both roles, the Athena role and the other one deployed by CloudFormation.

And yes, one important thing:
I don't see how to stop a deletion job. It has been running for 24 hours and I see this error, so it will surely fail; it would be better to have an option to cancel the whole job.

davorceman (Author) commented

I found the issue.

I deployed a parallel s3f2 stack, but with its own VPC, and compared the endpoints and security groups.

It seems the security group must allow HTTPS ingress from the VPC CIDR range.
Also, my endpoints did not have private DNS enabled.
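
Roughly what I ended up adding on my side (sketch only; the vpc_id and vpc_cidr_block outputs are from my own VPC module, and you'd repeat the endpoint block for each interface endpoint the tasks need, e.g. ECR API, ECR DKR, Secrets Manager, CloudWatch Logs):

# Allow HTTPS from inside the VPC so the Fargate tasks can reach the interface endpoints.
resource "aws_security_group" "s3f2_endpoints" {
  name   = "s3f2-vpc-endpoints"
  vpc_id = module.vpc_ohio.vpc_id # assumed output of my VPC module

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [module.vpc_ohio.vpc_cidr_block] # assumed output of my VPC module
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Interface endpoints need private DNS enabled so the default service hostnames
# resolve to the endpoint ENIs. ECR API shown as an example.
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = module.vpc_ohio.vpc_id
  service_name        = format("com.amazonaws.%s.ecr.api", data.aws_region.ohio.name)
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc_ohio.private_subnets
  security_group_ids  = [aws_security_group.s3f2_endpoints.id]
  private_dns_enabled = true
}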

matteofigus (Member) commented Jan 18, 2022

Hi, thanks for opening an issue.

  1. Sounds like you managed to solve the networking issue. Would you be keen on opening a Pull Request suggesting an improvement to the docs here? https://github.com/awslabs/amazon-s3-find-and-forget/blob/master/docs/USER_GUIDE.md#using-an-existing-vpc
  2. At the moment there isn't an API for stopping a job. To stop a job manually, you could stop the Step Functions execution and then edit the ECS service to set its desired task count to 0. We could add this to the backlog.
  3. I did some tests with different workgroups on my deployment and they succeeded. Can you describe the issue a bit better? What kind of error are you seeing, and is your workgroup set to enabled? In theory, if you set the parameter via CloudFormation/Terraform as you are doing, the policy should be created automatically and you should find it attached to the role arn:aws:iam::<account>:role/<stackname>-StateMachineStack-<...>-ExecuteQueryRole-<...>. Note that if you use a different workgroup, it needs to already exist in the same account and region where the solution is running (the deploy won't create it for you); see the rough sketch below.
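
For reference, a rough Terraform sketch of a pre-existing, enabled workgroup with its own results location (the names and results bucket are just examples, not something the solution creates for you):

resource "aws_athena_workgroup" "s3f2" {
  name  = "s3f2-analytics" # example name; pass this as the AthenaWorkGroup parameter
  state = "ENABLED"        # the workgroup must be enabled

  configuration {
    result_configuration {
      # Example results location; the bucket must already exist and be
      # readable/writable by the solution's query execution role.
      output_location = "s3://my-athena-results-bucket/s3f2/"

      encryption_configuration {
        encryption_option = "SSE_S3"
      }
    }
  }
}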

davorceman (Author) commented Jan 19, 2022

Hi,

  1. I will, these days. But I would rather suggest deploying the SG with CloudFormation together with the rest of the stack. I don't see a reason why this couldn't be automated: if the SG parameter is not passed, deploy a new SG with an ingress HTTPS rule for the VPC CIDR. I'm just not that familiar with CloudFormation and not sure whether CF can look up the VPC CIDR range, like Terraform can with a data resource.

  2. I think it would be a good idea to add this to the backlog and offer it as an option soon. Since this tool will mainly be used by data engineers, who are usually not that familiar with all the resources this task uses, it would be nice to have a one-click option to cancel the job.

  3. I deployed an S3 bucket and added a bucket policy with full s3:* permissions for these 2 roles. Something like this:

data "aws_iam_policy_document" "s3_find_and_forget_ohio" {
  statement {
    effect = "Allow"
    actions = [
      "s3:*"
    ]
    principals {
      type = "AWS"
      identifiers = [
        aws_cloudformation_stack.s3_find_and_forget_role.outputs["RoleArn"],
        aws_cloudformation_stack.s3_find_and_forget_ohio.outputs["AthenaExecutionRoleArn"]
      ]
    }
    resources = [
      module.my_athena_bucket.bucket_arn,
      format("%s/*", module.my_athena_bucket.bucket_arn)
    ]
  }
}
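
(The policy document above is attached to the bucket roughly like this; the bucket_id output name comes from my own bucket module, so adjust it to yours.)

resource "aws_s3_bucket_policy" "s3_find_and_forget_ohio" {
  bucket = module.my_athena_bucket.bucket_id # assumed output: the bucket name
  policy = data.aws_iam_policy_document.s3_find_and_forget_ohio.json
}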

Encryption on that bucket is not with customer-managed KMS keys; I had also disabled SSE-S3 just for troubleshooting.
And in the end I set the principal to root, and still no success.

In the Athena dashboard I noticed that the Athena query executed successfully, and I think the files were saved to the bucket.
This was the error:

"ErrorDetails": {
     "Error": "Query Failed",
     "Cause": "Access denied when writing output to url: s3://<redacted>/analytics-workgroup-ohio/<redacted>.csv . Please ensure you are allowed to access the S3 bucket. If you are encrypting query results with KMS key, please ensure you are allowed to access your KMS key"
}

The S3 endpoint was in place.
I'll try again in the next few days, now with all endpoints in place.

davorceman (Author) commented

I just tested again.
Here you can see that the Athena query was executed again, with my Athena workgroup (screenshot).

I also see files in the bucket: CSV files with paths to the parquet files (screenshot).

But the job has failed again (screenshots).

And now the error is:

"ErrorDetails": {
    "Error": "InvalidRequestException",
    "Cause": "{\"errorMessage\": \"An error occurred (InvalidRequestException) when calling the GetQueryResults operation: You do not seem to have access to the S3 location of your query results. Please confirm your account has access to the S3 location where your query results are saved and try again. If specifying an expected bucket owner, confirm the bucket is owned by the expected account. If you are using KMS to encrypt query results, please ensure you have permission to access your KMS key. If you continue to see this issue, please contact customer support.\", \"errorType\": \"InvalidRequestException\", \"stackTrace\": [\"  File \\\"/opt/python/decorators.py\\\", line 34, in wrapper\\n    return handler(event, *args, **kwargs)\\n\", \"  File \\\"/var/task/submit_query_results.py\\\", line 22, in handler\\n    rows = [result for result in results]\\n\", \"  File \\\"/var/task/submit_query_results.py\\\", line 22, in <listcomp>\\n    rows = [result for result in results]\\n\", \"  File \\\"/opt/python/boto_utils.py\\\", line 49, in paginate\\n    for page in page_iterator:\\n\", \"  File \\\"/opt/python/botocore/paginate.py\\\", line 255, in __iter__\\n    response = self._make_request(current_kwargs)\\n\", \"  File \\\"/opt/python/botocore/paginate.py\\\", line 332, in _make_request\\n    return self._method(**current_kwargs)\\n\", \"  File \\\"/opt/python/botocore/client.py\\\", line 386, in _api_call\\n    return self._make_api_call(operation_name, kwargs)\\n\", \"  File \\\"/opt/python/botocore/client.py\\\", line 705, in _make_api_call\\n    raise error_class(parsed_response, operation_name)\\n\"]}"
}

So it seems the stack has privileges to write, but this next step can't read the results?
The bucket is not encrypted with a customer-managed KMS key. The role is the same as the one above, with s3:*.
