
New cluster using Talos is not progressing beyond Machines in Provisioning stage. #37

@dhaugli

Description


What happened:

The cluster is not coming up: the Harvester load balancer is not created, and the machines never leave the Provisioning state.
The machines are provisioned in Harvester and get IPs from my network. I can attach a console to them, though since it's Talos there is not much output to see.

Screenshot of console of one of the talos cp vms:

Screenshot 2024-06-06 232557

caph-provider logs:

 ERROR   failed to patch HarvesterMachine        {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "7ec120a6-8a1e-40b1-98dd-3597ce44ca1c", "machine": "cluster-capi-mgmt-p-01/capi-mgmt-p-01-7shhp", "cluster": "cluster-capi-mgmt-p-01/capi-mgmt-p-01", "error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value", "errorCauses": [{"error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value"}]}
github.com/rancher-sandbox/cluster-api-provider-harvester/controllers.(*HarvesterMachineReconciler).Reconcile.func1
        /workspace/controllers/harvestermachine_controller.go:121
github.com/rancher-sandbox/cluster-api-provider-harvester/controllers.(*HarvesterMachineReconciler).Reconcile
        /workspace/controllers/harvestermachine_controller.go:198
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226
2024-06-06T19:58:10Z    ERROR   Reconciler error        {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "7ec120a6-8a1e-40b1-98dd-3597ce44ca1c", "error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value", "errorCauses": [{"error": "HarvesterMachine.infrastructure.cluster.x-k8s.io \"capi-mgmt-p-01-zzmph\" is invalid: ready: Required value"}]}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/internal/controller/controller.go:226
These two log entries keep repeating:
 2024-06-06T19:58:10Z    INFO    Reconciling HarvesterMachine ...        {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "dc815768-5306-42cc-91c0-be802d85bc82"}
2024-06-06T19:58:10Z    INFO    Waiting for ProviderID to be set on Node resource in Workload Cluster ...       {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-zzmph","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-zzmph", "reconcileID": "dc815768-5306-42cc-91c0-be802d85bc82", "machine": "cluster-capi-mgmt-p-01/capi-mgmt-p-01-7shhp", "cluster": "cluster-capi-mgmt-p-01/capi-mgmt-p-01"}

capt-controller-manager logs:

I0606 19:58:08.737945       1 taloscontrolplane_controller.go:176] "controllers/TalosControlPlane: successfully updated control plane status" namespace="cluster-capi-mgmt-p-01" talosControlPlane="capi-mgmt-p-01" cluster="capi-mgmt-p-01"
I0606 19:58:08.739615       1 controller.go:327] "Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler" controller="taloscontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="TalosControlPlane" TalosControlPlane="cluster-capi-mgmt-p-01/capi-mgmt-p-01" namespace="cluster-capi-mgmt-p-01" name="capi-mgmt-p-01" reconcileID="b0b79408-8a41-43df-91ef-07fe7d36fa7c"
E0606 19:58:08.739746       1 controller.go:329] "Reconciler error" err="at least one machine should be provided" controller="taloscontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="TalosControlPlane" TalosControlPlane="cluster-capi-mgmt-p-01/capi-mgmt-p-01" namespace="cluster-capi-mgmt-p-01" name="capi-mgmt-p-01" reconcileID="b0b79408-8a41-43df-91ef-07fe7d36fa7c"
I0606 19:58:08.749008       1 taloscontrolplane_controller.go:189] "reconcile TalosControlPlane" controller="taloscontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="TalosControlPlane" TalosControlPlane="cluster-capi-mgmt-p-01/capi-mgmt-p-01" namespace="cluster-capi-mgmt-p-01" name="capi-mgmt-p-01" reconcileID="c37dc309-f8fb-42c7-a375-5faceb9019b9" cluster="capi-mgmt-p-01"
I0606 19:58:09.190175       1 scale.go:33] "controllers/TalosControlPlane: scaling up control plane" Desired=3 Existing=1
I0606 19:58:09.213294       1 taloscontrolplane_controller.go:152] "controllers/TalosControlPlane: attempting to set control plane status"
I0606 19:58:09.220900       1 taloscontrolplane_controller.go:564] "controllers/TalosControlPlane: failed to get kubeconfig for the cluster" error="failed to create cluster accessor: error creating client for remote cluster \"cluster-capi-mgmt-p-01/capi-mgmt-p-01\": error getting rest mapping: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://10.0.0.113:6443/api/v1?timeout=10s\": tls: failed to verify certificate: x509: certificate is valid for 10.0.0.3, 127.0.0.1, ::1, 10.0.0.5, 10.53.0.1, not 10.0.0.113"

cabpt-talos-bootstrap logs (I don't know if this is relevant):

I0606 19:58:09.206570       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.224117       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.243118       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.280372       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.341804       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.352557       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.439369       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.480714       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.539945       1 talosconfig_controller.go:186] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-df9f2: Waiting for OwnerRef on the talosconfig"
I0606 19:58:09.548156       1 secrets.go:174] "controllers/TalosConfig: handling bootstrap data for " owner="capi-mgmt-p-01-n48cx"
I0606 19:58:09.717884       1 secrets.go:174] "controllers/TalosConfig: handling bootstrap data for " owner="capi-mgmt-p-01-n48cx"
I0606 19:58:09.720944       1 secrets.go:174] "controllers/TalosConfig: handling bootstrap data for " owner="capi-mgmt-p-01-7shhp"
I0606 19:58:09.756344       1 talosconfig_controller.go:223] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4/owner-name=capi-mgmt-p-01-n48cx: ignoring an already ready config"
I0606 19:58:09.765995       1 secrets.go:243] "controllers/TalosConfig/cabpt-controller/namespace=cluster-capi-mgmt-p-01/talosconfig=capi-mgmt-p-01-npzm4/owner-name=capi-mgmt-p-01-n48cx: updating talosconfig" endpoints=null secret="capi-mgmt-p-01-talosconfig"

What did you expect to happen:
I expected the CAPH provider to create the load balancer and then proceed with creating the cluster.

How to reproduce it:

I added the Talos providers (bootstrap and control plane) and, of course, the Harvester provider.

I then applied four manifests plus the Harvester identity secret, with the following configuration:

cluster.yaml:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 172.16.0.0/20
    services:
      cidrBlocks:
        - 172.16.16.0/20
    serviceDomain: cluster.local
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: TalosControlPlane
    name: capi-mgmt-p-01
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: HarvesterCluster
    name: capi-mgmt-p-01

harvester-cluster.yaml:

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HarvesterCluster
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  targetNamespace: cluster-capi-mgmt-p-01
  loadBalancerConfig:
    ipamType: pool
    ipPoolRef: k8s-api
  server: https://10.0.0.3
  identitySecret: 
    name: trollit-harvester-secret
    namespace: cluster-capi-mgmt-p-01
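For reference, the DHCP variant I also tried (mentioned below under "Anything else") differed only in the loadBalancerConfig block. Assuming `ipamType: dhcp` is the accepted value for that mode, it looked roughly like:

```yaml
# Hedged sketch of the alternate config I tried; assumes "dhcp" is a
# valid ipamType value for HarvesterCluster's loadBalancerConfig.
loadBalancerConfig:
  ipamType: dhcp
```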

harvester-machinetemplate.yaml:

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: HarvesterMachineTemplate
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  template: 
    spec:
      cpu: 2
      memory: 8Gi
      sshUser: ubuntu
      sshKeyPair: default/david
      networks:
      -  cluster-capi-mgmt-p-01/capi-mgmt-network
      volumes:
      - volumeType: image 
        imageName: harvester-public/talos-1.7.4-metalqemu
        volumeSize: 50Gi
        bootOrder: 0

controlplane.yaml:

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: TalosControlPlane
metadata:
  name: capi-mgmt-p-01
  namespace: cluster-capi-mgmt-p-01
spec:
  version: "v1.30.0"
  replicas: 3
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: HarvesterMachineTemplate
    name: capi-mgmt-p-01
  controlPlaneConfig:
    controlplane:
      generateType: controlplane
      talosVersion: v1.7.4
      configPatches:
        - op: add
          path: /cluster/network
          value:
            cni:
              name: none

        - op: add
          path: /cluster/proxy
          value:
            disabled: true

        - op: add
          path: /cluster/network/podSubnets
          value:
            - 172.16.0.0/20

        - op: add
          path: /cluster/network/serviceSubnets
          value:
            - 172.16.16.0/20

        - op: add
          path: /machine/kubelet/extraArgs
          value:
            cloud-provider: external

        - op: add
          path: /machine/kubelet/nodeIP
          value:
            validSubnets:
              - 10.0.0.0/24

        - op: add
          path: /cluster/discovery
          value:
            enabled: false

        - op: add
          path: /machine/features/kubePrism
          value:
            enabled: true

        - op: add
          path: /cluster/apiServer/certSANs
          value:
            - 127.0.0.1

        - op: add
          path: /cluster/apiServer/extraArgs
          value:
            anonymous-auth: true
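Possibly relevant to the x509 error in the capt-controller-manager logs above: the serving certificate is valid for 10.0.0.3, 127.0.0.1, ::1, 10.0.0.5 and 10.53.0.1, but not for 10.0.0.113, the address the controller dials. If 10.0.0.113 is the load balancer address, a certSANs patch covering it might be needed. This is only a guess based on that log line, not something I have verified:

```yaml
# Hypothetical variant of the certSANs patch above, extended with the
# address from the x509 error (10.0.0.113 is assumed to be the LB VIP).
- op: add
  path: /cluster/apiServer/certSANs
  value:
    - 127.0.0.1
    - 10.0.0.113
```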

Anything else you would like to add:

I have tried switching the load balancer config from dhcp to ipPoolRef with a pre-configured IP pool; that also did not work. I think the underlying problem is that the LB is never provisioned in the first place.


Environment:

  • talos controlplane provider version: 0.5.5
  • talos bootstrap provider version: 0.6.4
  • harvester cluster api provider: 0.1.2
  • harvester version installed on my HP server: 1.3.0
  • OS (e.g. from /etc/os-release):
