It is the last quarter of 2019. A large number of organizations have already deployed Docker containerized applications in production, usually orchestrated with Kubernetes or OpenShift. As the well-known saying goes, simply moving applications to the cloud does not make them cloud-native; we are still midway on the journey to cloud. This post is a retrospective on the issues discovered this year while migrating traditional technical stacks to the cloud.
(This post is still in progress)
This was the first issue that cost me significant effort this year, and it made me realize that popular technical stacks were still not ready to adapt themselves to container environments. Typically, if a managed runtime reads mount points from /proc/self/mountinfo as it would on a regular Linux host, but does not also consult /proc/self/cgroup, the container's memory limits are invisible to its memory management.
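To make the mechanism concrete, here is a minimal sketch (not the CoreCLR implementation) of how a container-aware runtime resolves its cgroup v1 memory limit: it reads its own cgroup path from /proc/self/cgroup, finds where the memory hierarchy is mounted via /proc/self/mountinfo, and only then knows which memory.limit_in_bytes file to read. The file contents below are hard-coded samples mimicking a container, so the sketch runs anywhere; on a real host you would read the actual /proc files.

```python
# Sample contents mimicking /proc/self/cgroup and /proc/self/mountinfo
# inside a Docker container (hypothetical cgroup id "abc123").
SAMPLE_CGROUP = """\
11:memory:/docker/abc123
10:cpu,cpuacct:/docker/abc123
"""

SAMPLE_MOUNTINFO = """\
30 25 0:26 /docker/abc123 /sys/fs/cgroup/memory ro - cgroup cgroup rw,memory
31 25 0:27 /docker/abc123 /sys/fs/cgroup/cpu,cpuacct ro - cgroup cgroup rw,cpu,cpuacct
"""

def memory_cgroup_path(cgroup_text):
    """Return the relative path of the 'memory' controller from /proc/self/cgroup."""
    for line in cgroup_text.splitlines():
        _, controllers, path = line.split(":", 2)
        if "memory" in controllers.split(","):
            return path
    return None

def memory_mount_point(mountinfo_text):
    """Return the mount point of the memory cgroup hierarchy from /proc/self/mountinfo."""
    for line in mountinfo_text.splitlines():
        fields = line.split()
        sep = fields.index("-")  # separator between mandatory and fs-specific fields
        fstype, super_opts = fields[sep + 1], fields[sep + 3]
        if fstype == "cgroup" and "memory" in super_opts.split(","):
            return fields[4]  # field 5 is the mount point
    return None

rel_path = memory_cgroup_path(SAMPLE_CGROUP)          # "/docker/abc123"
mount = memory_mount_point(SAMPLE_MOUNTINFO)          # "/sys/fs/cgroup/memory"
# Inside the container the mountinfo root already matches the process's
# cgroup path, so the limit file sits directly under the mount point:
limit_file = mount + "/memory.limit_in_bytes"
print(rel_path, mount, limit_file)
```

A runtime that skips the /proc/self/cgroup step never locates this limit file, so it sizes its heap from the host's total RAM instead of the container quota.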
The GitHub issue is https://github.com/dotnet/coreclr/issues/13489. The fix includes https://github.com/dotnet/coreclr/pull/13488 and https://github.com/dotnet/coreclr/pull/15297, which check cgroup resource limits and expose the container's processor count to the CLR environment.
The symptom was that the dotnet core pod restarted more than 200 times per day, and the OpenShift monitoring portal showed OOM Killer in the event description. Luckily, the production environment was deployed with a replica count of 4, so the fintech service was not interrupted. To debug this issue, an LLDB-on-dotnet-core image was created to inspect the threading model and large memory blocks (https://maxwu.me/2019/04/15/Debug-dotnet-core-with-LLDB-on-RHEL-Image/). From my observation, the top consumers were Newtonsoft JSON entities, because much of the memory was held by dotnet string buffers.
This issue was actually a JVM configuration problem. It was discovered one day when the Jenkins pod ran slowly, and the pod was observed to restart within every 72 hours. Our pipeline was a typical Jenkins Groovy pipeline, and it communicated with two kinds of slaves: (1) dynamic Jenkins slaves created on demand, based on different slave images with the required technical stacks; (2) Windows slaves for specific tasks which, for the time being, could only be completed on Windows nodes.
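The underlying cause is the same container-awareness gap as on the dotnet core side: an older JVM sizes its default max heap from the host's physical RAM (roughly a quarter of it), not from the pod's cgroup limit, so on a large node the heap can grow past the container quota and the pod gets OOM-killed. A hedged sketch of the configuration fix, assuming a Jenkins image on JDK 8u191 or later (the exact flags depend on the JDK version in your base image):

```shell
# JDK 8u191+ / JDK 10+: let the JVM read the cgroup memory limit directly
# and cap the heap at a fraction of the container quota.
JAVA_OPTS="-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0"

# Older 8u131..8u181 images used the experimental predecessor flags instead
# (superseded by UseContainerSupport):
#JAVA_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap"
export JAVA_OPTS
```

With an explicit percentage of the container limit, the heap no longer tracks the host's RAM, and the periodic OOM restarts stop; pinning -Xmx to a fixed value below the pod limit is an equally valid, more conservative choice.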
Nov 12, 2019: Initial post with intro part and the outline.