百度360必应搜狗淘宝本站头条
当前位置:网站首页 > 技术资源 > 正文

Java服务总在半夜挂,到底怎么回事?

lipiwang 2025-03-03 19:18 19 浏览 0 评论

写在前面

最近有用户反馈测试环境Java服务总在凌晨00:00左右挂掉,用户反馈Java服务没有定时任务,也没有流量突增的情况,Jvm配置也合理,莫名其妙就挂了??

问题排查

问题复现

为了复现该问题,写了个springboot的demo部署在测试环境,其中demo里只做了hello world功能,应用类型为web_tomcat (war包部署),基础镜像是
base_tomcat/java-centos6-jdk18-60-tom8050-ngx197,镜像使用的Java版本是1.8.0_60,有了上次MySQL被kill的经验,盲猜是linux limit惹的祸,因此将打好的镜像分别部署了两批不同的机器,果不其然,新机器当晚挂掉了,老机器服务正常

??看一下挂掉的limit设置

?排查过程

Java进程会受到limits影响?

按理说Java进程是不会受到系统limit open files(系统最大句柄数)影响的,但是为了验证这个问题,我们将他修改为正常机器的值,由于demo是web_tomcat应用,没法修改启动脚本,因此我们通过prlimit修改java进程的limit

prlimit -p 32672 --nofile=1048576 

??结果当晚00:00左右还是挂了,看来open files和java进程挂掉没关系,看dmesg也没发现什么问题

??Java版本过低导致内存分配不合理?

通过寻求jdos研发组的帮助,jdos研发组的同学认为是java版本的问题,低版本可能没有限制住申请的内存大小,具体原因如下

https://blog.softwaremill.com/docker-support-in-new-java-8-finally-fd595df0ca54?gi=a0cc6736ed14

异常机器java内存情况

??正常机器java内存情况

??按照这个文档描述,使用docker cgroups限制内存可能会导致JVM进程被终止,原因是Java读取的还是宿主机的CPU,而不是docker cgroups限制的CPU,高版本的Java解决了这个问题,文档解决方案截图如下:

??对此我们表示怀疑,因为我们的程序里设置了JVM参数

??保持着试一试的心态,我们增加了一个实验组,实验组使用的Java版本是11.0.8

结果当晚实验组的Java进程还是死了,看来和Java版本也没关系

??容器上存在定时任务导致的?

由于基础镜像是jdos官方提供的镜像,所以之前从来没有怀疑过是定时任务的问题,但是现在别无他法了,检查下容器的定时任务

虽然有定时任务,但是这个执行的时间点和Java挂掉的时间对不上,为此我们决定删除定时任务试试

结果当晚Java进程还是挂了,并且这次有dmesg的日志,发现Java被kill的同时crond也被kill了,被kill的原因是crond内存过高导致oom

??难道还有系统级cron任务?于是查了一下/etc/crontab,发现果然还有cron任务(这是谁打的镜像!!!)

??这个时间点和Java进程挂掉的时间点吻合,但是问题来了,执行的任务并没有logrotate.sh这个脚本,应该不会出现问题才对

??到底是不是定时任务的问题,我们修改下cron的时间验证下,调整时间为中午11:00,验证下Java进程是否会挂,同时使用strace打印进程trace log

??果然Java进程在中午11.00挂了,看来真的是cron任务导致的,让我们一起看一下strace

19:59:01 close(3)                        = 0
19:59:01 stat("/etc/pam.d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19:59:01 open("/etc/pam.d/crond", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=293, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "#\n# The PAM configuration file f"..., 4096) = 293
19:59:01 open("/lib64/security/pam_access.so", O_RDONLY) = 5
19:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\17\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=18552, ...}) = 0
19:59:01 mmap(NULL, 2113800, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd769322000
19:59:01 mprotect(0x7fd769325000, 2097152, PROT_NONE) = 0
19:59:01 mmap(0x7fd769525000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x3000) = 0x7fd769525000
19:59:01 close(5) = 0
19:59:01 open("/etc/ld.so.cache", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=16203, ...}) = 0
19:59:01 mmap(NULL, 16203, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fd7707f8000
19:59:01 close(5) = 0
19:59:01 open("/lib64/libnsl.so.1", O_RDONLY) = 5
19:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p@\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=113432, ...}) = 0
19:59:01 mmap(NULL, 2198192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd769109000
19:59:01 mprotect(0x7fd76911f000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd76931e000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x15000) = 0x7fd76931e000
19:59:01 mmap(0x7fd769320000, 6832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd769320000
19:59:01 close(5) = 0
19:59:01 mprotect(0x7fd76931e000, 4096, PROT_READ) = 0
19:59:01 mprotect(0x7fd769525000, 4096, PROT_READ) = 0
19:59:01 munmap(0x7fd7707f8000, 16203) = 0
19:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                     = 0x7fd770803000
19:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 692
19:59:01 open("/lib64/security/pam_unix.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240&\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=51960, ...}) = 0
19:59:01 mmap(NULL, 2196352, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd768ef0000
19:59:01 mprotect(0x7fd768efc000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7690fb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0xb000) = 0x7fd7690fb000
19:59:01 mmap(0x7fd7690fd000, 45952, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd7690fd000
19:59:01 close(6)                       = 0
19:59:01 mprotect(0x7fd7690fb000, 4096, PROT_READ) = 0
19:59:01 read(5, "", 4096)              = 0
19:59:01 close(5) = 0
19:59:01 munmap(0x7fd770803000, 4096) = 0
19:59:01 open("/lib64/security/pam_loginuid.so", O_RDONLY) = 5
19:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\t\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=10240, ...}) = 0
19:59:01 mmap(NULL, 2105480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd768ced000
19:59:01 mprotect(0x7fd768cef000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd768eee000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x1000) = 0x7fd768eee000
19:59:01 close(5) = 0
19:59:01 mprotect(0x7fd768eee000, 4096, PROT_READ) = 0
19:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770803000
19:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 692
19:59:01 open("/lib64/security/pam_keyinit.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\10\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=10224, ...}) = 0
19:59:01 mmap(NULL, 2105488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0)                      = 0x7fd768aea000
19:59:01 mprotect(0x7fd768aec000, 2093056, PROT_NONE)                     = 0
19:59:01 mmap(0x7fd768ceb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x1000) = 0x7fd768ceb000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd768ceb000, 4096, PROT_READ) = 0
19:59:01 open("/lib64/security/pam_limits.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\20\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=18600, ...}) = 0
19:59:01 mmap(NULL, 2113848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7688e5000
19:59:01 mprotect(0x7fd7688e9000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd768ae8000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x3000) = 0x7fd768ae8000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd768ae8000, 4096, PROT_READ) = 0
19:59:01 open("/lib64/security/pam_succeed_if.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\v\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=14384, ...}) = 0
19:59:01 mmap(NULL, 2109624, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7686e1000
19:59:01 mprotect(0x7fd7686e4000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7688e3000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x2000) = 0x7fd7688e3000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7688e3000, 4096, PROT_READ)                       = 0
19:59:01 read(5, "", 4096) = 0
19:59:01 close(5)                     = 0
19:59:01 munmap(0x7fd770803000, 4096) = 0
19:59:01 open("/etc/pam.d/password-auth", O_RDONLY)                      = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                      = 0x7fd770803000
19:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 692
19:59:01 open("/lib64/security/pam_env.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\r\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=18592, ...}) = 0
19:59:01 mmap(NULL, 2113776, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0)                       = 0x7fd7684dc000
19:59:01 mprotect(0x7fd7684e0000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7686df000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x3000) = 0x7fd7686df000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7686df000, 4096, PROT_READ)                     = 0
19:59:01 open("/lib64/security/pam_deny.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\5\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=5952, ...}) = 0
19:59:01 mmap(NULL, 2101272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0)                       = 0x7fd7682da000
19:59:01 mprotect(0x7fd7682db000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7684da000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0)                      = 0x7fd7684da000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7684da000, 4096, PROT_READ) = 0
19:59:01 read(5, "", 4096) = 0
19:59:01 close(5) = 0
19:59:01 munmap(0x7fd770803000, 4096) = 0
19:59:01 read(3, "", 4096)             = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096)                      = 0
19:59:01 open("/etc/pam.d/other", O_RDONLY)                      = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=154, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)   = 0x7fd770804000
19:59:01 read(3, "#%PAM-1.0\nauth     required     "..., 4096) = 154
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC)   = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 uname({sys="Linux", node="host-11-159-73-176", ...}) = 0
19:59:01 open("/etc/security/access.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=4620, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "# Login access control table.\n#\n"..., 4096) = 4096
19:59:01 read(3, " should get access from ipv4 net"..., 4096) = 524
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 getuid() = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)  = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3)                       = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 geteuid() = 0
19:59:01 open("/etc/shadow", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG, st_size=901, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:$6$4.53VPrJ$1wxMpbsWYp4VKea"..., 4096) = 901
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096)                      = 0
19:59:01 socket(PF_NETLINK, SOCK_RAW, 9)                       = 3
19:59:01 fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
19:59:01 readlink("/proc/self/exe", "/usr/sbin/crond", 4096) = 15
19:59:01 sendto(3, "p\0\0\0M\4\5\0\1\0\0\0\0\0\0\0op=PAM:accountin"..., 112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12)                      = 112
19:59:01 poll([{fd=3, events=POLLIN}], 1, 500)   = 1 ([{fd=3, revents=POLLIN}])
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\1\0\0\0\227\7\0\0\0\0\0\0p\0\0\0M\4\5\0\1\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\1\0\0\0\227\7\0\0\0\0\0\0p\0\0\0M\4\5\0\1\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 close(3) = 0
19:59:01 open("/etc/security/pam_env.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=2980, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "#\n# This is the configuration fi"..., 4096) = 2980
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3)                      = 0
19:59:01 munmap(0x7fd770804000, 4096)                       = 0
19:59:01 open("/etc/environment", O_RDONLY)   = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)               = 0x7fd770804000
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 socket(PF_NETLINK, SOCK_RAW, 9) = 3
19:59:01 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
19:59:01 sendto(3, "p\0\0\0O\4\5\0\2\0\0\0\0\0\0\0op=PAM:setcred a"..., 112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12)                       = 112
19:59:01 poll([{fd=3, events=POLLIN}], 1, 500)   = 1 ([{fd=3, revents=POLLIN}])
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\2\0\0\0\227\7\0\0\0\0\0\0p\0\0\0O\4\5\0\2\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\2\0\0\0\227\7\0\0\0\0\0\0p\0\0\0O\4\5\0\2\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 close(3) = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/proc/self/loginuid", O_WRONLY|O_TRUNC|O_NOFOLLOW)        = 3
19:59:01 write(3, "0", 1) = 1
19:59:01 close(3) = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 getuid() = 0
19:59:01 getgid() = 0
19:59:01 keyctl(0, 0xfffffffd, 0, 0, 0) = 496466385
19:59:01 keyctl(0, 0xfffffffb, 0, 0, 0x30) = 785702132
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 getrlimit(RLIMIT_CPU, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_FSIZE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_DATA, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_CORE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_RSS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_NPROC, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 0
19:59:01 getrlimit(RLIMIT_MEMLOCK, {rlim_cur=64*1024, rlim_max=64*1024}) = 0
19:59:01 getrlimit(RLIMIT_AS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_LOCKS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_SIGPENDING, {rlim_cur=883632, rlim_max=883632}) = 0
19:59:01 getrlimit(RLIMIT_MSGQUEUE, {rlim_cur=800*1024, rlim_max=800*1024}) = 0
19:59:01 getrlimit(RLIMIT_NICE, {rlim_cur=0, rlim_max=0}) = 0
19:59:01 getrlimit(RLIMIT_RTPRIO, {rlim_cur=0, rlim_max=0}) = 0
19:59:01 getpriority(PRIO_PROCESS, 0) = 20
19:59:01 open("/etc/security/limits.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1835, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "# /etc/security/limits.conf\n#\n#E"..., 4096) = 1835
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/etc/security/limits.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
19:59:01 getdents(3, /* 3 entries */, 32768) = 88
19:59:01 open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY)                       = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=26060, ...}) = 0
19:59:01 mmap(NULL, 26060, PROT_READ, MAP_SHARED, 5, 0) = 0x7fd7707f5000
19:59:01 close(5)  = 0
19:59:01 getdents(3, /* 0 entries */, 32768) = 0
19:59:01 close(3) = 0
19:59:01 open("/etc/security/limits.d/90-nproc.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=193, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "# Default limit for number of us"..., 4096) = 193
19:59:01 read(3, "", 4096)              = 0
19:59:01 close(3)                       = 0
19:59:01 munmap(0x7fd770804000, 4096)   = 0
19:59:01 setrlimit(RLIMIT_NPROC, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 setpriority(PRIO_PROCESS, 0, 0) = 0
19:59:01 getuid() = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096)                     = 0
19:59:01 socket(PF_NETLINK, SOCK_RAW, 9)                      = 3
19:59:01 fcntl(3, F_SETFD, FD_CLOEXEC)                      = 0
19:59:01 sendto(3, "t\0\0\0Q\4\5\0\3\0\0\0\0\0\0\0op=PAM:session_o"..., 116, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 116
19:59:01 poll([{fd=3, events=POLLIN}], 1, 500) = 1 ([{fd=3, revents=POLLIN}])
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\3\0\0\0\227\7\0\0\0\0\0\0t\0\0\0Q\4\5\0\3\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\3\0\0\0\227\7\0\0\0\0\0\0t\0\0\0Q\4\5\0\3\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 close(3) = 0
19:59:01 setgid(0) = 0
19:59:01 open("/proc/sys/kernel/ngroups_max", O_RDONLY) = 3
19:59:01 read(3, "65536\n", 31)         = 6
19:59:01 close(3)                       = 0
19:59:01 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
19:59:01 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
19:59:01 close(3) = 0
19:59:01 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
19:59:01 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110)                       = -1 ENOENT (No such file or directory)
19:59:01 close(3) = 0
19:59:01 open("/etc/group", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=497, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 lseek(3, 0, SEEK_CUR) = 0
19:59:01 read(3, "root:x:0:\nbin:x:1:bin,daemon\ndae"..., 4096) = 497
19:59:01 read(3, "", 4096)              = 0
19:59:01 close(3)                       = 0
19:59:01 munmap(0x7fd770804000, 4096)                     = 0
19:59:01 setgroups(1, [0]) = 0
19:59:01 setreuid(0, 4294967295) = 0
19:59:01 rt_sigaction(SIGCHLD, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7fd76fa316a0}, {0x558826e03b80, [], SA_RESTORER|SA_RESTART, 0x7fd76fa316a0}, 8) = 0
19:59:01 pipe([3, 5])                   = 0
19:59:01 pipe([6, 7])                   = 0
19:59:01 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd7707fca70) = 1946
19:59:01 gettid()                     = 1943
19:59:01 open("/proc/self/task/1943/attr/exec", O_RDWR) = 8
19:59:01 write(8, NULL, 0) = -1 EINVAL (Invalid argument)
19:59:01 close(8) = 0
19:59:01 close(3) = 0
19:59:01 close(7) = 0
19:59:01 close(5) = 0
19:59:01 fcntl(6, F_GETFL)                       = 0 (flags O_RDONLY)
19:59:01 fstat(6, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                     = 0x7fd770804000
19:59:01 lseek(6, 0, SEEK_CUR)                     = -1 ESPIPE (Illegal seek)
19:59:01 read(6, "/bin/bash: ./logrotate.sh: \346\262\241\346\234"..., 4096) = 55
19:59:01 uname({sys="Linux", node="host-11-159-73-176", ...}) = 0
19:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 0
19:59:01 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd6682da000
19:59:01 --- SIGCHLD (Child exited) @ 0 (0) ---
19:59:06 +++ killed by SIGKILL +++

可以看到最后用 mmap 一次分配了 4G 内存,然后就被kill了。

mmap前调用了getrlimit,和上次MySQL的问题一样,都是根据系统资源限制来分配内存

为了确定就是cron导致java挂掉的元凶,我们把cron进程手动kill掉,这样就不会执行定时任务了,这次我们在验证下Java进程是否会挂掉

果不其然,Java进程并没有挂掉,看来真的是cron任务导致的

高版本CentOS是否也会出现类似问题?

按理说oom killer应该只kill掉占用内存最高的才对,Java进程占用内存又不是最高的,高版本的CentOS系统oom killer策略会不会有升级?

??让我们来一起验证下高版本的CentOS系统是否有这个问题

当前镜像的CentOS版本是CentOS release 6.6 (Final),为了验证高版本的CentOS是否也有类似的问题,我们将增加两个实验组,分别升级基础镜像至CentOS release 6.10 (Final)和CentOS Linux release 7.9.2009 (Core),也添加相同的cron任务

结果发现CentOS release 6.10 (Final)和CentOS Linux release 7.9.2009 (Core)都没有kill掉Java进程,只kill掉了cron的子进程

?结论

??由于容器limit open files(系统最大句柄数)设置不合理导致cron执行任务时使容器内存飙升,存在内存溢出的风险,linux由于保护机制会kill掉占用内存高的进程,导致cron子任务进程和Java进程一起被kill(但是问题来了,这个jdos基础镜像为什么会执行一个完全不存在的shell脚本,而且还是执行两次???),高版本的CentOS系统不会kill java进程,猜测不同版本的CentOS的kill选择策略略有不同

问题分析

Cron任务执行逻辑

在Linux中,crontab工具是由croine软件包提供的,让我们一起看下cron的执行过程

??其中child_process()执行了cron子进程,cron执行子进程时会有发送mail的动作

??cron_popen在执行时会按照open files(系统最大句柄数)清除内存


??综上,cron oom的原因找到了,是由于open files设置过大且cron任务没有标准输出,导致执行了发送mail逻辑,而清除的内存大小超出了容器本身内存的大小,导致oom。

croine 1.5.4 版本之后修复了该问题,如果想查看当前容器croine版本可执行如下命令:

rpm -q cronie


Linux内核OOM killer机制

Linux 内核有个机制叫OOM killer(Out Of Memory killer),该机制会监控那些占用内存过大,尤其是瞬间占用内存很快的进程,然后防止内存耗尽而自动把该进程杀掉。内核检测到系统内存不足、挑选并杀掉某个进程的过程可以参考内核源代码linux/mm/oom_kill.c,当系统内存不足的时候,out_of_memory()被触发,然后调用select_bad_process()选择一个”bad”进程杀掉。

以下是一些主要的进程选择策略:

1.内存使用情况:OOM Killer首先倾向于选择占用内存最多的进程,因为终止这些进程可以释放最多的内存。

2.OOM分数:每个进程都有一个OOM分数,该分数是基于其内存使用情况和其他因素计算出来的。OOM Killer倾向于终止OOM分数最高的进程。

3.进程优先级:在选择要终止的进程时,OOM Killer通常会避免终止对系统至关重要的系统进程。这些进程通常具有较高的优先级,因此它们更不容易成为终止目标。

4.进程资源需求:OOM Killer还会考虑进程的资源需求。它倾向于终止那些请求较少资源的进程,以最小化影响其他进程的运行。

5.进程属性:某些进程可能被标记为不可终止,例如通过设置/proc/[PID]/oom_score_adj的值来调整OOM分数。这些进程通常不容易被OOM Killer终止。

注:不同版本的Linux oom killer机制可能会存在一些差异

解决方案

使用高版本稳定的CentOS系统,如果业务无法升级CentOS,则需要设置合理的limit open files数量,application_worker类型应用可以在启动脚本中手动修改limit,web_tomcat类型应用没法修改启动脚本,可以选择kill掉cron进程或删除系统cron任务,也可以手动升级cronie的版本至1.5.7-5

写在后面

open files这个坑很大,栽这个坑两次了,大家一定要检查自己服务对应容器的CentOS版本和limit设置是否合理,本次案例发生在测试环境,尚不会引起事故,如果在生产出现类似情况,后果不堪设想

由于测试环境新增的这批机器都存在这个问题,我们团队已经联系机器提供方上报了该问题,后续这批机器会由提供方统一修改系统最大句柄数,如果当前问题影响到了业务的正常使用,可以临时删除容器中/etc/crontab中的任务

参考文献

?https://cloud.tencent.com/developer/article/1183262

?https://github.com/cronie-crond/cronie

作者:京东零售 杨云龙

来源:京东云开发者社区 转载请注明来源

相关推荐

Nat. Synthesis: 重大突破,电化学形成C-S键

第一作者:JunnanLi,HasanAl-Mahayni通讯作者:AliSeifitokaldani,NikolayKornienko通讯单位:蒙特利尔大学,麦吉尔大学【研究亮点】形成C-...

网络安全与应用(二)(网络安全理论与应用)

1、应用层安全协议SHTTP和HTTPS:SHTTP:SecHTTP,安全超文本传输协议,是HTTP扩展,使用TCP的80端口。HTTPS:HTTP+SSL,使用TCP的443端口。大部分web应用...

TN-C、TN-S、TT、IT供电系统详解及对比

TN-C、TN-S、TT、IT供电系统是低压配电系统中常见的四种接地方式,它们各自有不同的特点和适用场景。一、系统介绍TN-C供电系统①定义:整个系统中,工作零线(N线)与保护零线(PE线)是合一的,...

网络应用服务器(三)(网络应用程序服务器)

#头条创作挑战赛#1、DNS协议:域名解析协议,用于把主机域名解析为对应的IP地址。是一个分布式数据库,C/S工作方式。主要基于UDP协议,少数使用TCP,端口号都是53。常用域名如下2、DNS协议...

腾讯发布混元Turbo S:业界首次无损应用Mamba架构

21世纪经济报道记者白杨北京报道2月27日,腾讯正式发布新一代基座模型——混元TurboS。据腾讯混元团队介绍,混元TurboS在架构方面创新性地采用了Hybrid-Mamba-Transfor...

【收藏】低压配电系统中TT IT TN-S/TN-C/TN-C-S 的区别?

低压配电系统的接地型式选择是电气安全设计的核心环节,TT、IT、TN-S、TN-C、TN-C-S这五种主要接地型式因其结构、保护原理和故障特性的显著差异,在工程应用中有不同的适用范围和限制条件。如若发...

金万维公有云平台如何实现C/S架构软件快速SaaS化

金万维作为国内领先的企业信息化垂直B2B平台运营商,拥有超过5000家管理软件合作伙伴,掌握管理软件一线的发展动态,因此深知传统管理软件近年来面对的困境和问题。而SaaS却在软件行业内发展迅猛势如燎原...

随时随地做翻译:B/S架构的传奇时代到来

随着新兴技术的发展和大数据时代的到来,翻译作为连接各国语言和文化的工具,更是具有前所未有的拓展空间。传统的在计算机辅助翻译软件(CAT)上进行翻译的模式,受到时间和空间的限制,导致翻译过程中面临层层障...

BS和CS 架构的介绍(一篇就够了)(cs和bs架构的含义)

简介C/S又称Client/Server或客户/服务器模式。服务器通常采用高性能的PC、工作站或小型机,并采用大型数据库系统,如Oracle、Sybase、Informix或SQLServer。...

物管王(包租婆)软件架构与B/S和C/S架构的优点和缺点比较

一、B/S系统架构的优点和缺点优点:1)客户端无需安装,有Web浏览器即可。2)BS架构可以直接放在广域网上,通过一定的权限控制实现多客户访问的目的,交互性较强。3)BS架构无需升级多个客户端,升级服...

监听器入门看这篇就够了(怎么检查车上有没有被别人安装监听器)

什么是监听器监听器就是一个实现特定接口的普通java程序,这个程序专门用于监听另一个java对象的方法调用或属性改变,当被监听对象发生上述事件后,监听器某个方法将立即被执行。。为什么我们要使用监听器?...

购物车【JavaWeb项目、简单版】(java购物车的实现原理)

①构建开发环境免费学习资料获取方式导入需要用到的开发包建立程序开发包②设计实体书籍实体publicclassBook{privateStringid;privat...

基础篇-SpringBoot监听器Listener的使用

1.监听器Listener简介1.1监听器Listener介绍Listener是JavaWeb的三大组件(Servlet、Filter、Listener)之一,JavaWeb中的监听器主要用...

你在 Spring Boot3 整合 JWT 实现 RESTful 接口鉴权时是否遇到难题?

各位后端开发小伙伴们!在日常使用SpringBoot3搭建项目时,RESTful接口的鉴权至关重要。而JWT技术,作为一种简洁且高效的鉴权方式,被广泛应用。但大家是不是在整合过程中遇到过各...

javaWeb RSA加密使用(rsa加密java代码)

加密算法在各个网站运用很平常,今天整理代码的时候看到了我们项目中运用了RSA加密,就了解了一下。先简单说一下RSA加密算法原理,RSA算法基于一个十分简单的数论事实:将两个大质数相乘十分容易,但是想要...

取消回复欢迎 发表评论: