• 1 Post
  • 5 Comments
Joined 2 years ago
Cake day: June 25th, 2023


  • Here’s an update. I set up atop on my VPS and waited until the issue occurred again. Here’s the atop log from the event.

    ATOP - ip-172-31-7-27   2023/07/22  18:40:02   -----------------   10m0s elapsed
    PRC | sys    9m49s | user  12.66s | #proc    134 | #zombie    0 | #exit      3 |
    CPU | sys      99% | user      0% | irq       0% | idle      0% | wait      0% |
    MEM | tot   957.1M | free   49.8M | buff    0.1M | slab   95.1M | numnode    1 |
    SWP | tot     0.0M | free    0.0M | swcac   0.0M | vmcom   2.4G | vmlim 478.6M |
    PAG | numamig    0 | migrate    0 | swin       0 | swout      0 | oomkill    0 |
    PSI | cpusome  63% | memsome  99% | memfull  88% | iosome   99% | iofull    0% |
    DSK |         xvda | busy    100% | read  461505 | write    171 | avio 1.30 ms |
    DSK |        xvda1 | busy    100% | read  461505 | write    171 | avio 1.30 ms |
    NET | transport    | tcpi    2004 | tcpo    1477 | udpi       9 | udpo      11 |
    NET | network      | ipi     2035 | ipo     1521 | ipfrw     20 | deliv   2015 |
    NET | eth0    ---- | pcki    2028 | pcko    1500 | si    4 Kbps | so    1 Kbps |
    
        PID SYSCPU USRCPU  VGROW  RGROW  RDDSK  WRDSK  CPU CMD            
         41  5m17s  0.00s     0B     0B     0B     0B  53% kswapd0        
          1 21.87s  0.00s     0B -80.0K   1.2G     0B   4% systemd        
      21681 20.28s  0.00s     0B   4.0K   4.2G     0B   3% lemmy          
        435 18.00s  0.00s     0B 392.0K 163.1M     0B   3% snapd          
      21576 17.20s  0.00s     0B     0B   4.2G     0B   3% pict-rs        
    

    The culprit seems to be kswapd0 trying to move memory to swap space, although there is no swap space.

    I set memory swappiness to 0 on the system for now; I’ll check if that makes a difference.
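
    For anyone who wants to check for the same condition before it locks the box up, here is a rough Python sketch. It parses `/proc/meminfo`-style text and flags the case of no swap plus very little available memory. The sample values are made up to roughly match the atop header above (only MemTotal and MemFree come from it), and the 100 MiB threshold and the function names are my own assumptions, not anything atop or the kernel defines.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Map each /proc/meminfo field name to its numeric value (KiB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


def under_memory_pressure(info: dict[str, int], min_available_kib: int = 100 * 1024) -> bool:
    """True when there is no swap and available memory is below the (assumed) threshold."""
    return info["SwapTotal"] == 0 and info["MemAvailable"] < min_available_kib


# Illustrative sample; MemAvailable is a hypothetical value, not taken from the log.
SAMPLE = """\
MemTotal:         980070 kB
MemFree:           50995 kB
MemAvailable:      40000 kB
SwapTotal:             0 kB
SwapFree:              0 kB
"""

info = parse_meminfo(SAMPLE)
print("pressure:", under_memory_pressure(info))
```

    On a real system you would pass in the contents of `/proc/meminfo` instead of `SAMPLE`, for example from a cron job that sends an alert.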


  • It just happened again. I couldn’t ssh in despite the limit on docker resources, which leads me to believe it may not be related to docker or Lemmy.

    This time it lasted only 20 minutes or so. Once it was over I could log back in and investigate a little. There isn’t much to see. lemmy-ui was killed sometime during the event:

    IMAGE                        COMMAND                  CREATED      STATUS         PORTS                                              
    nginx:1-alpine               "/docker-entrypoint.…"   9 days ago   Up 25 hours    80/tcp, 0.0.0.0:14252->8536/tcp, :::14252->8536/tcp
    dessalines/lemmy-ui:0.18.0   "docker-entrypoint.s…"   9 days ago   Up 3 minutes   1234/tcp                                              
    dessalines/lemmy:0.18.0      "/app/lemmy"             9 days ago   Up 25 hours                                                         
    asonix/pictrs:0.4.0-rc.7     "/sbin/tini -- /usr/…"   9 days ago   Up 25 hours    6669/tcp, 8080/tcp                                    
    mwader/postfix-relay         "/root/run"              9 days ago   Up 25 hours    25/tcp                                                
    postgres:15-alpine           "docker-entrypoint.s…"   9 days ago   Up 25 hours
    

    I still have no idea what’s going on.


  • I had the same thing happen. Max CPU usage, couldn’t even ssh in to fix it and had to reboot from the AWS console. Logs don’t show anything unusual apart from postgres restarting 30 minutes into the spike, possibly from being killed by the system.

    You say yours solved itself in 10 minutes. Mine didn’t seem to stop after 2 hours, so I rebooted. It could be that my VPS has just 1 CPU and 1 GB RAM, so it took longer doing whatever it was doing.

    Now I set up RAM and CPU limits following this question, and an alert so I can hopefully ssh in and figure out what’s happening when it’s happening.

    Any suggestions on what I should be looking at if I manage to get into the system?


  • I figured it out after a good night’s sleep.

    The UI doesn’t let you send a request with empty allowed instances. It either sends an array with your picks or no array at all if you selected none. To reset allowed instances you need to send an empty array to the API.

    The easiest way to do it is

    • Open browser devtools network tab or equivalent
    • Send a request with at least one allowed instance
    • Edit the request JSON body to empty the array
    • Resend the request

    I think Chrome lets you copy the request with authentication to send with something like curl. I used Firefox, which lets you edit and resend directly.
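
    If you would rather script it than use devtools, the same idea can be sketched in Python. This only builds the request and prints it, it does not send anything. The endpoint path `/api/v3/site`, the `PUT` method, and the `allowed_instances` and `auth` field names are my assumptions about the API; check them against the request you captured in the network tab. `lemmy.example` and `YOUR_JWT` are placeholders.

```python
import json
import urllib.request


def build_reset_request(base_url: str, jwt: str) -> urllib.request.Request:
    """Build (but do not send) a request whose allowed-instances array is empty."""
    # Assumed field names; verify them against the request captured in devtools.
    body = {"allowed_instances": [], "auth": jwt}
    return urllib.request.Request(
        url=f"{base_url}/api/v3/site",  # assumed endpoint path
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder values; nothing is sent over the network here.
req = build_reset_request("https://lemmy.example", "YOUR_JWT")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

    Once the printed request matches what the UI sends (apart from the empty array), passing it to `urllib.request.urlopen` would submit it.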

    Peace ✌