Thursday, September 21, 2017

Study notes for a K8S cloud deployment case study


  • Architecture:
    • Three K8s Masters:  Master1, Master2, Master3;  each Master has:
      • docker -> kubelet -> kube-apiserver <- kube-scheduler, kube-controller-manager
    • each K8s Node in the cluster has:
      • flannel <- docker -> kube-proxy, kubelet
    • etcd cluster with 3 nodes for key/value store
    • load balancer:   docker -> haproxy + keepalived  <- kubectl, clients, etc.
    • How are the above components connected together? (see the sketch after this list)
      • 1. all apiservers rely on the etcd servers,  --etcd_servers=<etcd-server-url>
      • 2. controller-manager and scheduler rely on the apiserver,  --master=<api-server-url>
      • 3. the kubelet on each node relies on the apiserver, --api-servers=<api-server-url>
      • 4. kube-proxy on each node relies on the apiserver,  --master=<api-server-url>
      • 5. all apiservers sit behind the haproxy + keepalived load balancer pod
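      • A minimal sketch of that flag wiring, keeping the <...> placeholders used in these notes; auth/cert flags and other options are omitted:
        • # on the masters: the apiserver points at the etcd cluster (IPs from the inventory below, client port 2379)
        • kube-apiserver --etcd_servers=http://10.10.100.191:2379,http://10.10.100.192:2379,http://10.10.100.193:2379
        • kube-controller-manager --master=<api-server-url>
        • kube-scheduler --master=<api-server-url>
        • # on every node
        • kubelet --api-servers=<api-server-url>
        • kube-proxy --master=<api-server-url>
        • # <api-server-url> is the haproxy + keepalived <VIP>:<Port> described below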
  • As shown above, all K8s components run as containers, which makes them easy to deploy and upgrade;  use:  docker run --restart=always --name xxx_component xxx_component_image:xxx_version;  --restart=always ensures self-healing;
  • to avoid delays during a docker restart (caused by docker pull), pre-download the images before the upgrade, as sketched below
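    • A minimal sketch of the pre-pull + self-healing restart pattern, reusing the xxx_component placeholders from above:
      • docker pull xxx_component_image:xxx_new_version    # pre-download before the upgrade window, so the restart does not block on the pull
      • docker rm -f xxx_component                          # replace the running container during the upgrade
      • docker run --restart=always -d --name xxx_component xxx_component_image:xxx_new_version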
  • Ansible is used for configuration management; for example:
    • ansible-playbook --limit=etcds -i hosts/cd-prod k8s.yaml --tags etcd;  (to roll out a new version, just update the version info in the inventory)
    • An example inventory (template) file:
      • [etcds]
      • 10.10.100.191
      • 10.10.100.192
      • 10.10.100.193
      • [etcds:vars]
      • etcd_version=3.0.3
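    • A minimal sketch of what the k8s.yaml play behind "--tags etcd" might look like; the play structure and task name are assumptions, only the host group, tag, image path and {{ etcd_version }} variable come from these notes:
      • cat > k8s.yaml <<'EOF'
      • - hosts: etcds
      •   tags: [etcd]
      •   tasks:
      •     - name: run etcd at the version pinned in the inventory
      •       # full docker run arguments as in the playbook example further below
      •       shell: docker run --restart=always -d --name etcd 10.10.190.190:10500/root/etcd:{{ etcd_version }}
      • EOF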
  • etcd3 is used; the etcd cluster provides redundancy; the data storage needs to be redundant as well;
  • k8s version: 1.6.6; the control-plane components on the three master servers run as Static Pods; the kubelet is responsible for monitoring and auto-restarting them; all run in docker containers;
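    • A minimal sketch of a static pod manifest for kube-apiserver; the manifest directory, image path and flags are assumptions (the kubelet must be started with a matching --pod-manifest-path):
      • cat > /etc/kubernetes/manifests/kube-apiserver.yaml <<'EOF'
      • apiVersion: v1
      • kind: Pod
      • metadata:
      •   name: kube-apiserver
      •   namespace: kube-system
      • spec:
      •   hostNetwork: true
      •   containers:
      •   - name: kube-apiserver
      •     image: 10.10.190.190:10500/root/kube-apiserver:v1.6.6
      •     command:
      •     - kube-apiserver
      •     - --etcd_servers=http://10.10.100.191:2379,http://10.10.100.192:2379,http://10.10.100.193:2379
      • EOF
      • # the kubelet watches this directory and restarts the pod automatically if it exits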
  • haproxy and keepalived provide a VIP for kube-apiserver to ensure HA;  haproxy does the load balancing, keepalived monitors haproxy and ensures its HA;  clients access the apiserver using <VIP>:<Port>
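    • A minimal sketch of the haproxy and keepalived configuration; the VIP 10.10.100.200, the frontend port 8443, the backend master IPs and port 6443 are assumptions for illustration:
      • cat > haproxy.cfg <<'EOF'
      • frontend kube-apiserver
      •     bind *:8443
      •     mode tcp
      •     default_backend apiservers
      • backend apiservers
      •     mode tcp
      •     balance roundrobin
      •     server master1 10.10.100.181:6443 check
      •     server master2 10.10.100.182:6443 check
      •     server master3 10.10.100.183:6443 check
      • EOF
      • cat > keepalived.conf <<'EOF'
      • vrrp_script chk_haproxy {
      •     script "pidof haproxy"
      •     interval 2
      • }
      • vrrp_instance VI_1 {
      •     state MASTER
      •     interface eth0
      •     virtual_router_id 51
      •     priority 100
      •     track_script {
      •         chk_haproxy
      •     }
      •     virtual_ipaddress {
      •         10.10.100.200
      •     }
      • }
      • EOF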
  • for kube-controller-manager and kube-scheduler HA, leader election is important; in the start parameters set:  --leader-elect=true
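    • For example, on each of the three masters (a sketch; other flags omitted, placeholders as above):
      • kube-controller-manager --master=<api-server-url> --leader-elect=true
      • kube-scheduler --master=<api-server-url> --leader-elect=true
      • # all three instances run; only the current leader actively reconciles, the others stand by and take over on failure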
  • Ansible playbooks are also used to run kubectl commands, such as drain, cordon, uncordon;
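    • The underlying kubectl commands wrapped by those playbooks look like this (node name is a placeholder; the drain flags are typical, not taken from this case study):
      • kubectl cordon <node-name>                               # stop scheduling new pods onto the node
      • kubectl drain <node-name> --ignore-daemonsets --force    # evict pods so they are rescheduled elsewhere
      • kubectl uncordon <node-name>                             # re-enable scheduling after maintenance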
  • An Ansible playbook code example (running etcd as a container):
    • docker run --restart=always -d \
    • -v /var/etcd/data:/var/etcd/data \
    • -v /etc/localtime:/etc/localtime \
    • -p 4001:4001 -p 2380:2380 -p 2379:2379 \
    • --name etcd 10.10.190.190:10500/root/etcd:{{ etcd_version }} \
    • /usr/local/bin/etcd \
    • -name {{ ansible_eth0.ipv4.address }} \
    • -advertise-client-urls http://{{ ansible_eth0.ipv4.address }}:2379,http://{{ ansible_eth0.ipv4.address }}:4001 \
    • -listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
    • -initial-advertise-peer-urls http://{{ ansible_eth0.ipv4.address }}:2380 \
    • -listen-peer-urls http://0.0.0.0:2380 \
    • -initial-cluster {% for item in groups['etcds'] %}{{ item }}=http://{{ item }}:2380{% if not loop.last %},{% endif %}{% endfor %} \
    • -initial-cluster-state new
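    • Once the cluster is up, membership can be verified with etcdctl; the endpoint list below is an assumption based on the inventory above:
      • etcdctl --endpoints http://10.10.100.191:2379,http://10.10.100.192:2379,http://10.10.100.193:2379 cluster-health
      • etcdctl --endpoints http://10.10.100.191:2379,http://10.10.100.192:2379,http://10.10.100.193:2379 member list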
  • The storage backend uses Ceph RBD, to provide stateful services and the docker-registry
    • When a pod goes down or moves to another node, its storage needs to be persistent to provide a stateful service;
    • K8s has many options for providing Persistent Volumes:  PD in GCE, EBS in AWS, GlusterFS etc.
    • the ceph-common, kubelet and kube-controller-manager containers all have the following 3 volumes:
      • --volume=/sbin/modprobe:/sbin/modprobe:ro \
      • --volume=/lib/modules:/lib/modules:ro \
      • --volume=/dev:/dev:ro 
    • RBD supports dynamic provisioning, single-writer and multi-reader, but not multi-writer; GlusterFS can support multi-writer but is not in use in this case yet (see the StorageClass sketch after this list);
    • the docker-registry used Swift as its backend; to improve push/pull efficiency, Redis was used to cache metadata; all are provided as containers using official docker images;  for example:
      • sudo docker run -d --restart=always -p 10500:10500 --volume=/etc/docker-distribution/registry/config.yml:/etc/docker-distribution/registry/config.yml --name registry registry:2.6.1 /etc/docker-distribution/registry/config.yml
      • config.yml is based on https://github.com/docker/docker.github.io/blob/master/registry/deploying.md
      • Harbor is used to provide HA for the docker-registry, running in another pod on the K8S cluster; the image data and harbor-db are both mounted through Ceph PVs, so if a Harbor node or Pod goes down, Harbor stays highly available; Swift is then no longer needed;
      • PV and StorageClass are limited to a single Namespace, so per-namespace dynamic provisioning in a multi-tenant environment is not supported yet;
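    • A minimal sketch of RBD dynamic provisioning with a StorageClass and PVC, as referenced above; the Ceph monitor addresses, pool and secret names are assumptions:
      • cat <<'EOF' | kubectl create -f -
      • apiVersion: storage.k8s.io/v1
      • kind: StorageClass
      • metadata:
      •   name: ceph-rbd
      • provisioner: kubernetes.io/rbd
      • parameters:
      •   monitors: 10.10.100.201:6789,10.10.100.202:6789,10.10.100.203:6789
      •   adminId: admin
      •   adminSecretName: ceph-admin-secret
      •   adminSecretNamespace: kube-system
      •   pool: kube
      •   userId: kube
      •   userSecretName: ceph-user-secret
      • ---
      • apiVersion: v1
      • kind: PersistentVolumeClaim
      • metadata:
      •   name: data-claim
      • spec:
      •   storageClassName: ceph-rbd
      •   # RBD supports single-writer, hence ReadWriteOnce
      •   accessModes: [ "ReadWriteOnce" ]
      •   resources:
      •     requests:
      •       storage: 10Gi
      • EOF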
  • The network backend uses Flannel; some clusters have switched to OVS;
    • Flannel supports Pod-to-Pod communication across nodes, but it can't separate multiple tenants, and it is not good at per-Pod network rate-limiting;
    • thus a custom K8S-OVS component was built to implement these features; it uses Open vSwitch to provide SDN for K8S; it follows the principles of OpenShift SDN, but since OpenShift's SDN is tightly integrated with OpenShift and can't be used as a separate plug-in the way Flannel or Calico can for K8S, the K8S-OVS plug-in was custom built; it offers similar functions to OpenShift SDN and serves K8S as a plug-in;
    • K8S-OVS supports single-tenant and multi-tenant modes and implements the following features:
      • Single Tenant:  uses Open vSwitch + VxLAN to make the Pods on K8S one big L2 network, enabling Pod-to-Pod communication;
      • Multi-Tenant:  also uses Open vSwitch + VxLAN to form the L2 network for Pods, and additionally uses Namespaces to separate tenant networks; a pod in one namespace cannot access the pods/services in another namespace;
      • Multi-Tenant:  a Namespace can be configured so that the pods in that namespace can communicate with pods/services in any other Namespace;
      • Multi-Tenant:  the joined namespaces mentioned above can also be separated again;
      • Multi-Tenant and Single Tenant:  both support flow control, so pods on the same node share network bandwidth fairly and avoid the "noisy-neighbour" issue;
      • Multi-Tenant and Single Tenant: both support load balancing for the external network;
    • Join means allowing pod/service communication between two tenant networks;  separation means the operation can be reversed so the two joined tenant networks are split back into two separate networks;  public network means allowing a tenant network to communicate with all other tenant networks;
    • Different tenant networks have different VNIs (from VxLAN);  K8S-OVS stores the VNI relations in etcd, for example:
      • etcdctl ls /k8s.ovs.com/ovs/network/k8ssdn/netnamespaces  (helloworld1, helloworld2)
      • etcdctl get /k8s.ovs.com/ovs/network/k8ssdn/netnamespaces/helloworld1
        • {"NetName":"helloworld1","NetID":300900,"Action":"","Namespace":""}
      • etcdctl get /k8s.ovs.com/ovs/network/k8ssdn/netnamespaces/helloworld2
        • {"NetName":"helloworld2","NetID":4000852,"Action":"","Namespace":""}
      • etcdctl update /k8s.ovs.com/ovs/network/k8ssdn/netnamespaces/helloworld1 '{"NetName":"helloworld1","NetID":300900,"Action":"join","Namespace":"helloworld2"}'
      • etcdctl get /k8s.ovs.com/ovs/network/k8ssdn/netnamespaces/helloworld1
        • will return the same VNI as helloworld2 (4000852) after the above join action
    • At the application layer, LVS is used for L4 load balancing and Nginx + Ingress Controller for L7 load balancing on Ingress;  kube-proxy/Services were abandoned in this case (see the Ingress sketch below);
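    • A minimal sketch of an Ingress object served by the Nginx Ingress Controller on the L7 side; the host, service name and port are assumptions (the namespace reuses the tenant name from the example above):
      • cat <<'EOF' | kubectl create -f -
      • apiVersion: extensions/v1beta1
      • kind: Ingress
      • metadata:
      •   name: helloworld1-web
      •   namespace: helloworld1
      • spec:
      •   rules:
      •   - host: helloworld1.example.com
      •     http:
      •       paths:
      •       - path: /
      •         backend:
      •           serviceName: helloworld1-svc
      •           servicePort: 80
      • EOF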
  • CI/CD solution architecture:
    • Opads (developed in PHP) is used for the front-end and Pluto (provides a REST API, interfaces with the K8S apiserver) for the back-end, supporting 400 different applications and about 10,000 containers
    • CI/CD pipeline is like following:
      • developer login ->  OPADS { code server (gitlab/gerrit) -> sonar, autotest, benchmark, Jenkins/CI Runner -- push --> Docker Registry } --pull --> PLUTO { deploy (call api) } --> K8S { sit-cluster, uat-cluster, prod-cluster }
      • base app images (such as Tomcat, PHP, Java, NodeJS etc.) carry a version number;  either mount the source code into a container directory, or use a Dockerfile ONBUILD instruction to load the app code
        • mount:   easy to use, no rebuild needed, fast; but the dependency on the base image is high: if the app code changes while the base image stays unchanged, the build can fail;
        • ONBUILD:  handles dependencies better and version rollback is easy, but the image must be rebuilt every time, which is slow;   choose based on the use case (see the Dockerfile sketch after this list);
      • If a suitable base image can't be found, the developer submits a JIRA ticket asking the DevOps team to build one;
      • Select the code branch and version to deploy for different environments (sit, uat, prod); the number of Pod replicas, uptime, name, creation time, node details on K8S, node selector etc. are visible, or use the Gotty web console to see container status;
      • Elasticity / HPA, load balancing, blue/green deployment, plus Sonar for code quality, test modules, benchmark modules etc.; all are components of the CI/CD PaaS;
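      • A minimal sketch of the ONBUILD approach mentioned above; the base image, registry path and destination directory are assumptions:
        • cat > Dockerfile.base <<'EOF'
        • FROM tomcat:8.5
        • # ONBUILD runs at child-image build time, copying the application code into the image
        • ONBUILD COPY . /usr/local/tomcat/webapps/ROOT/
        • EOF
        • docker build -t 10.10.190.190:10500/root/tomcat-onbuild:8.5 -f Dockerfile.base .
        • # an application repo then only needs a one-line Dockerfile:  FROM 10.10.190.190:10500/root/tomcat-onbuild:8.5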
  • Monitoring and Alert solutions used
    • Status of K8S components, such as Docker, Etcd, Flannel/OVS etc;
    • System performance, such as cpu, memory, disk, network, filesystem, processes etc;
    • application status:  rc/rs/deployment, Pod, Service etc;
      • Custom shell scripts, started via crond, are used to monitor these components;
      • For containers, Heapster + InfluxDB + Grafana is used
      • Docker -> Heapster -- sink --> influxdb (on top of docker as well) --> grafana -- alert;  
      • Each K8S node:  flannel <-- docker <-- cAdvisor <--  kubelet -> heapster
      • Heapster -- get node list --> K8s Master kube-apiserver (which sits in the docker -> kubelet -> kube-apiserver <-- kube-scheduler, kube-controller-manager chain)
      • On each node, the kubelet calls the cAdvisor API to collect container information (resources, performance etc.), so both node and container information are collected;
      • the information can be tagged, then aggregated and sent to the InfluxDB sink; Grafana is used to visualize the data;
      • Heapster needs --source to point to the Master URL, --sink to point to InfluxDB, and --metric_resolution for the collection interval, e.g. 30s (seconds); see the sketch at the end of this section;
      • Backend storage for Heapster has two types: metricSink, and influxdbSink.  
      • MetricSink keeps metrics data in local memory; it is created by default and may consume a large amount of memory; the Heapster API gets its data from here;
      • InfluxDB is where the data are persisted;  newer versions of Heapster can point to multiple InfluxDB instances;
      • Grafana can use regular expressions to check application status, cpu, mem, disk, network, FS status, or to sort and search;
      • When a defined threshold is crossed, Grafana can do simple alerting through email; in this case monitoring points and warning/alert policies were created in Grafana and integrated with Zabbix, and Zabbix sends out those warnings/alerts;
      • Container logs include:
        • K8S component logs;
        • system resource usage logs,
        • container running logs
      • Another solution, which was abandoned:  use Flume to collect logs, with Flume running in a pod;  configure the source (app log folder), channel and sink to Hippo/Kafka,
        • When logs need to be checked, one has to log in to Hippo, which is cumbersome;
        • Each application needs a separate Flume pod, which is a waste of resources;
        • Hippo is not containerized in this case and is shared with other non-container infrastructure, so it is slow during peak usage;
      • Now switched to Fluentd + Kafka + ES + a customized GUI
        • Kafka cluster (topic) -- fluentd-es --> Elasticsearch cluster --> Log API Engine --> Log GUI (history/current/download/search/other logs)
        • Each K8S node:  flannel --> docker --> kubelet, kube-proxy --> /var/log, /var/lib/docker/containers --> Fluentd -- fluentd-kafka --> Kafka cluster (topic)
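      • Going back to the Heapster flags mentioned above, a minimal sketch of the wiring; the URL placeholders follow the style of these notes, and how Heapster is packaged/run is deployment-specific:
        • heapster \
        •   --source=kubernetes:<master-url> \
        •   --sink=influxdb:http://<influxdb-host>:8086 \
        •   --metric_resolution=30s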
