Azure Cloud Provider Writing Empty File On System #454

Closed
rossedman opened this issue Mar 29, 2018 · 15 comments
@rossedman

RKE version: v0.1.5-rc1

Docker version: (docker version,docker info preferred) 17.09.1-ce

Operating system and kernel: (cat /etc/os-release, uname -r preferred) 4.14.19-coreos

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Azure

cluster.yml file:

cloud_provider:
  name: azure
  cloud_config:
    aadClientId: <REDACTED>
    aadClientSecret: <REDACTED>
    location: centralus
    resourceGroup: am663-core
    subnetName: core
    subscriptionId: <REDACTED>
    vnetName: am663-core
    tenantId: <REDACTED>
    securityGroupName: am663-core-ssh

Steps to Reproduce:

Run RKE against an Azure cluster on CoreOS. The file /etc/kubernetes/cloud-config.json is created on the system, but nothing is written to it, which causes the API server to fail to start.
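A quick way to confirm the symptom on a node is to check whether the file has any content at all. A minimal sketch (the path is the one RKE writes; the helper name is mine):

```shell
# Hypothetical helper: report whether a cloud-config file has any content.
check_cloud_config() {
    if [ -s "$1" ]; then
        echo "non-empty"
    else
        echo "empty or missing"
    fi
}

# On an affected node this prints "empty or missing" even though the file exists.
check_cloud_config "${1:-/etc/kubernetes/cloud-config.json}"
```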

Results:
API server fails to start causing deployment to fail.

@rossedman
Author

Referencing PR: #449
Tagging @galal-hussein

@HighwayofLife
Contributor

Does this work now? It worked when I tried it today.

@rossedman
Author

@HighwayofLife it didn't work for me, but I can try again.

@galal-hussein
Contributor

@rossedman Can you post the kubelet logs so we can see what went wrong? Thanks.

@rossedman
Author

@galal-hussein @HighwayofLife Still getting multiple failures. The only log entries without flag dumps are these:

I0403 15:14:46.681465       1 server.go:121] Version: v1.9.0-rancher2
I0403 15:14:46.681781       1 interface.go:360] Looking for default routes with IPv4 addresses
I0403 15:14:46.681824       1 interface.go:365] Default route transits interface "eth0"
I0403 15:14:46.682074       1 interface.go:174] Interface eth0 is up
I0403 15:14:46.682204       1 interface.go:222] Interface "eth0" has 2 addresses :[172.28.2.10/24 fe80::20d:3aff:fe97:4037/64].
I0403 15:14:46.682272       1 interface.go:189] Checking addr  172.28.2.10/24.
I0403 15:14:46.682308       1 interface.go:196] IP found 172.28.2.10
I0403 15:14:46.682348       1 interface.go:228] Found valid IPv4 address 172.28.2.10 for interface "eth0".
I0403 15:14:46.682402       1 interface.go:371] Found active IP 172.28.2.10
I0403 15:14:46.682443       1 services.go:51] Setting service IP to "10.233.0.1" (read-write).
I0403 15:14:46.682486       1 cloudprovider.go:59] --external-hostname was not specified. Trying to get it from the cloud provider.
error setting the external host value: "azure" cloud provider could not be initialized: could not init cloud provider "azure": No credentials provided for AAD application

This makes sense given that /etc/kubernetes/cloud-config.json is empty, which it is.
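Once the file does get populated, it's also worth checking that it contains valid JSON, since an empty or truncated file fails the same way. A sketch, assuming python3 is available on the node (the helper name is mine):

```shell
# Hypothetical helper: succeed only if the file parses as JSON.
# An empty file fails this check, matching the "No credentials provided" error above.
validate_cloud_config() {
    python3 -m json.tool "$1" > /dev/null 2>&1
}

if validate_cloud_config /etc/kubernetes/cloud-config.json; then
    echo "valid JSON"
else
    echo "empty, missing, or invalid JSON"
fi
```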

@rossedman
Author

Here is the full rke config; I redacted a few things, like labels and secrets:

#
# this file was autogenerated, do not edit manually
# template generated from: am663.yaml.j2
#
nodes:
  # control plane
  - address: 172.28.2.10
    user: rancher
    role: ["controlplane", "etcd"]
    hostname_override: am663-kube-ctl0
  - address: 172.28.2.16
    user: rancher
    role: ["controlplane", "etcd"]
    hostname_override: am663-kube-ctl1
  - address: 172.28.2.13
    user: rancher
    role: ["controlplane", "etcd"]
    hostname_override: am663-kube-ctl2
  
  # general compute
  - address: 172.28.2.6
    user: rancher
    role: ["worker"]
    hostname_override: am663-kube-gpc0
  - address: 172.28.2.5
    user: rancher
    role: ["worker"]
    hostname_override: am663-kube-gpc1
  - address: 172.28.2.15
    user: rancher
    role: ["worker"]
    hostname_override: am663-kube-gpc2
  - address: 172.28.2.9
    user: rancher
    role: ["worker"]
    hostname_override: am663-kube-gpc3
  - address: 172.28.2.7
    user: rancher
    role: ["worker"]
    hostname_override: am663-kube-gpc4
  - address: 172.28.2.12
    user: rancher
    role: ["worker"]
    hostname_override: am663-kube-gpc5
  
  # load balancers
  - address: 172.28.2.11
    user: rancher
    role: ["worker"]
    hostname_override: am663-kube-lbe0
  - address: 172.28.2.8
    user: rancher
    role: ["worker"]
    hostname_override: am663-kube-lbe1
  - address: 172.28.2.14
    user: rancher
    role: ["worker"]
    hostname_override: am663-kube-lbe2
  

authentication:
  sans:
  - <REDACTED>
  - <REDACTED>

services:
  kube-api:
    service_cluster_ip_range: 10.233.0.0/18
    pod_security_policy: false
    extra_args:
      v: 4
  kube-controller:
    cluster_cidr: 10.233.64.0/18
    service_cluster_ip_range: 10.233.0.0/18
  scheduler:
  kubelet:
    cluster_domain: cluster.local
    cluster_dns_server: 10.233.0.3
    infra_container_image: gcr.io/google_containers/pause-amd64:3.0

kubernetes_version: v1.9.1-rancher1-1

ssh_key_path: ~/.ssh/id_rsa

ignore_docker_version: true

network:
  plugin: canal

ingress:
  provider: none

addons_include:
  - ./rke/addons/helm-crd.yaml
  - ./rke/addons/cert-manager.yaml
  - ./rke/addons/filebeat.yaml
  - ./rke/addons/ingress.yaml
  - ./rke/addons/tsp-base.yaml

cloud_provider:
  name: azure
  cloud_config:
    aadClientId: <REDACTED>
    aadClientSecret: <REDACTED>
    location: centralus
    resourceGroup: am663-core
    subnetName: tsp-core
    subscriptionId: <REDACTED>
    vnetName: am663-core
    tenantId: <REDACTED>
    securityGroupName: am663-core-ssh

@rossedman
Author

Pretty confused. This is getting written to the file system now, and I'm not sure what changed. I'm seeing different errors now. I'll close this once I determine what happened. I ran rke remove before posting this issue and before every attempt.

@rossedman
Author

rossedman commented Apr 3, 2018

Alright, as strange as this seems, I'm getting very inconsistent results. Sometimes the file gets written to some nodes, sometimes it doesn't get written at all, and sometimes it gets written with line breaks between every line of cloud-config.json.

@deniseschannon deniseschannon added this to the v0.1.6 milestone Apr 3, 2018
@rossedman
Author

One more addition: I'm now seeing even more granular misses. Some of the fields will be empty. I just had a run where subnetName, vnetName, and vnetResourceGroup weren't filled in.

@rossedman
Author

rossedman commented Apr 9, 2018

Still trying to track this down. Every now and then the stars align and the file gets written to the servers, and when it does I get this error: kubernetes/kubeadm#484

I tried to use Kubernetes 1.9, but was never able to get the files to write again.

@HighwayofLife
Contributor

HighwayofLife commented Apr 20, 2018

I have confirmed that this is an issue in the latest master. The log indicates that the cloud config is being deployed to the node, but the file is empty. I'm unclear why and am working on debugging it.

I’ve confirmed that RKE_CLOUD_CONFIG={<cloud-config-json>} is populated in doDeployConfigFile, and the container claims that it’s successfully started and then removed…

INFO[0149] [cloud] Successfully started [cloud-config-deployer] container on host [10.18.160.18] 
INFO[0150] [remove/cloud-config-deployer] Successfully removed container on host [10.18.160.18]

However, the path doesn’t even show it was written to…

$ ls -l /etc/kubernetes/cloud-config.json 
-rw-r--r--. 1 root root 0 Apr 19 20:53 /etc/kubernetes/cloud-config.json

$ date
Fri Apr 20 17:37:38 UTC 2018

@galal-hussein
Contributor

@HighwayofLife Can you post your cluster.yml file? It may be a problem with the cloud config. Also, can you get the logs of the kubelet container?

@HighwayofLife
Contributor

This was fixed in #533 for me.

@galal-hussein
Contributor

This was also fixed here: https://github.com/rancher/rke-tools/blob/master/cloud-provider.sh#L32-L33
@rossedman please reopen the issue if you can still reproduce it; I can't reproduce it with the latest rc.
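For anyone following along, the linked fix boils down to the deployer writing the RKE_CLOUD_CONFIG environment variable out to the config path rather than leaving the file empty. A rough sketch of the idea, not the literal script (see cloud-provider.sh for the real code; the function name and empty-value guard are mine):

```shell
# Sketch: write the cloud config passed in via RKE_CLOUD_CONFIG to disk,
# refusing to clobber the target with an empty file.
write_cloud_config() {
    target="$1"
    if [ -z "$RKE_CLOUD_CONFIG" ]; then
        echo "RKE_CLOUD_CONFIG is empty; not writing $target" >&2
        return 1
    fi
    printf '%s' "$RKE_CLOUD_CONFIG" > "$target"
}
```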

@rossedman
Author

@HighwayofLife @galal-hussein Thanks! I just read the PR too. Super cool. Getting more familiar with the codebase, I'm guessing I could've fixed this 🥁
