Still On the Way to Cloud

It is the last quarter of 2019. Many organizations have already deployed Docker-containerized applications to production, and the services are usually orchestrated with Kubernetes or OpenShift. As the well-known saying goes, just moving applications to the cloud is not the same as embracing the cloud; we are still only partway there. This post is also a retrospective on the issues I discovered this year while migrating traditional technical stacks to the cloud.

(WIP)

1 Typical issues on sensitivity of container environment

1.1 Dotnet core 2.1 pod restarted by the OOM Killer

This was the first issue this year that cost me a big effort, and it made me realize that popular technical stacks were still not ready to adapt themselves to container environments. Typically, if a managed runtime reads only /proc/self/mountinfo, as it would on a regular Linux platform, and not /proc/self/cgroup, the container memory limits are not visible to its memory management.

The GitHub issue is https://github.com/dotnet/coreclr/issues/13489. The fix includes https://github.com/dotnet/coreclr/pull/13488 and https://github.com/dotnet/coreclr/pull/15297, which check cgroup resource limits and expose Docker processor counts to the CLR environment.
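To make the idea of "checking cgroup resource limits" concrete, here is a minimal Go sketch of how a process might discover its container memory limit. It is not the CoreCLR implementation. The two file paths are assumptions about where the container runtime mounts the cgroup v2 and cgroup v1 memory controllers, and the helper names parseMemoryLimit and containerMemoryLimit are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Assumed cgroup file locations; the real paths depend on how the
// container runtime mounts the cgroup filesystem.
const (
	cgroupV2MemoryMax   = "/sys/fs/cgroup/memory.max"
	cgroupV1MemoryLimit = "/sys/fs/cgroup/memory/memory.limit_in_bytes"
)

// parseMemoryLimit converts the content of a cgroup memory limit file
// into bytes. It returns ok=false when the content means "no limit"
// or cannot be parsed.
func parseMemoryLimit(content string) (limit uint64, ok bool) {
	s := strings.TrimSpace(content)
	if s == "" || s == "max" { // cgroup v2 writes "max" when unlimited
		return 0, false
	}
	n, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, false
	}
	// cgroup v1 reports a huge value (close to the int64 maximum)
	// when no limit is set; treat that as unlimited.
	if n >= 1<<62 {
		return 0, false
	}
	return n, true
}

// containerMemoryLimit tries the assumed cgroup v2 path first, then v1.
func containerMemoryLimit() (uint64, bool) {
	for _, path := range []string{cgroupV2MemoryMax, cgroupV1MemoryLimit} {
		data, err := os.ReadFile(path)
		if err != nil {
			continue // file not present on this host or cgroup version
		}
		if limit, ok := parseMemoryLimit(string(data)); ok {
			return limit, true
		}
	}
	return 0, false
}

func main() {
	if limit, ok := containerMemoryLimit(); ok {
		fmt.Printf("container memory limit: %d bytes\n", limit)
	} else {
		fmt.Println("no container memory limit detected")
	}
}
```

A runtime that sizes its heap from host memory alone never sees this number, which is why the kernel OOM Killer ends up enforcing the limit instead.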

The symptom was that the dotnet core pod restarted more than 200 times per day, and the OpenShift monitoring portal showed OOM Killer in the event description. Luckily, the production environment was deployed with 4 replicas, so the fintech service was not interrupted. To debug this issue, I created an image with LLDB on dotnet core to inspect the threading model and large memory blocks (https://maxwu.me/2019/04/15/Debug-dotnet-core-with-LLDB-on-RHEL-Image/). Per my observation, the top consumers were Newtonsoft JSON entities, because a lot of memory was consumed by dotnet string buffers.

1.2 Jenkins Pipeline ran out of memory

This issue is actually a JVM configuration problem. It was discovered when the Jenkins pod ran slowly one day, and the pod was then observed to restart within 72 hours every time. Our Pipeline was a typical Jenkins Groovy Pipeline and it communicated with two kinds of slaves: (1) dynamic Jenkins slaves created on demand, based on different slave images with the required technical stack; (2) Windows slaves for specific tasks that could only be completed on Windows nodes for the time being.

(TBC)

1.3 Cypress failed to launch XVFB in docker container

Cypress is the in-browser JavaScript UI test framework I picked for the team last year (2018) when we migrated from host-based Selenium to the Pipeline.

(TBC)

1.4 Golang Routine GOMAXPROCS Issue in Container Environment

Go developers can call runtime.GOMAXPROCS(n) to set the limit on threads that run Go code simultaneously (the number of P in the MPG model), or pass 0 to read the current value without changing it. Since Go 1.5 the default value is the number of CPU cores. However, when running in a container, the Go runtime still reads the core count from the host, not from the container resource limit.
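As a quick illustration of the read and set behaviour described above, the following sketch prints what the runtime sees. The helper name currentProcs is hypothetical; runtime.NumCPU and runtime.GOMAXPROCS are standard library calls. Running it inside a CPU-limited container on an affected Go version is one way to check whether the reported value follows the host or the container.

```go
package main

import (
	"fmt"
	"runtime"
)

// currentProcs returns the current GOMAXPROCS setting without changing
// it: an argument below 1 only queries the value.
func currentProcs() int {
	return runtime.GOMAXPROCS(0)
}

func main() {
	fmt.Println("logical CPUs visible to the process:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", currentProcs())

	// Setting a new limit returns the previous one.
	prev := runtime.GOMAXPROCS(2)
	fmt.Println("previous:", prev, "current:", currentProcs())
}
```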

There is a workaround: Uber's automaxprocs library. With import _ "go.uber.org/automaxprocs", the library's initializer reads the CPU limit from the container cgroup and sets GOMAXPROCS automatically.
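The general idea behind this kind of workaround can be sketched with the standard library alone. The code below is not the automaxprocs implementation. It assumes cgroup v1 CFS quota files at the paths shown, and the helper names procsFromQuota and readInt are hypothetical. It derives a processor count from quota divided by period and applies it with runtime.GOMAXPROCS.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// Assumed cgroup v1 locations of the CFS quota and period files.
const (
	cfsQuotaPath  = "/sys/fs/cgroup/cpu/cpu.cfs_quota_us"
	cfsPeriodPath = "/sys/fs/cgroup/cpu/cpu.cfs_period_us"
)

// procsFromQuota turns a CFS quota and period (both in microseconds)
// into a processor count, rounded down with a minimum of 1.
// It returns ok=false when no quota is set (cgroup v1 uses -1) or the
// period is not positive.
func procsFromQuota(quotaUs, periodUs int64) (procs int, ok bool) {
	if quotaUs <= 0 || periodUs <= 0 {
		return 0, false
	}
	procs = int(quotaUs / periodUs)
	if procs < 1 {
		procs = 1
	}
	return procs, true
}

// readInt reads a file that contains a single integer.
func readInt(path string) (int64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	quota, errQ := readInt(cfsQuotaPath)
	period, errP := readInt(cfsPeriodPath)
	if errQ == nil && errP == nil {
		if procs, ok := procsFromQuota(quota, period); ok {
			runtime.GOMAXPROCS(procs) // apply the container CPU limit
		}
	}
	// When the files are missing or unlimited, the default is kept.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```

In a real service the blank import of the library is the simpler choice; the sketch only shows why the cgroup quota, not the host core count, is the number that matters.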

(TBC)

2 Typical issues on container orchestration

2.1 Orchestrating Cypress tests in Map-Reduce model on kube cloud

2.2 Rolling out Spring Boot pods generated an alert flood on Splunk

3 Retrospective

Change Log

May 01, 2021: Add the Golang routine burst issue.

Nov 12, 2019: Initial post with the intro part and the outline.