Docker: Checkpoint and Restore

Before diving deeper into Firecracker, I wanted to run a quick experiment and check whether it’s possible to take a snapshot of a running Docker container and restore it later. I’m far from the first to think about this, and there’s a fantastic team working on CRIU: https://criu.org/. CRIU is a project that implements checkpoint/restore functionality for Linux. As the description says, it doesn’t work on Windows or Mac, so here I go again, dusting off my ThinkPad X220.

It’s been a while since I did anything on Linux and Ubuntu, so I discovered a package manager that was new to me, called “Snap”. I tried installing Docker with it, and something messed up the permissions on the docker.sock file: whenever I started the Docker service, the socket file was created as root:root. Perhaps I’ll try snap again next time, but for now I’m back to apt.

Install Docker

Install Docker by following the instructions from https://docs.docker.com/engine/install/ubuntu/:

  1. Add Docker’s GPG key:

    $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    
  2. Verify that you now have the key with the fingerprint 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88, by searching for the last 8 characters of the fingerprint:

    $ sudo apt-key fingerprint 0EBFCD88
    
    pub   rsa4096 2017-02-22 [SCEA]
          9DC8 5822 9FC7 DD38 854A  E2D8 8D81 803C 0EBF CD88
    uid           [ unknown] Docker Release (CE deb) <docker@docker.com>
    sub   rsa4096 2017-02-22 [S]
    
  3. Set up the stable repository:

    $ sudo add-apt-repository \
       "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
       $(lsb_release -cs) \
       stable"
    
  4. Update the package index and install Docker:

    $ sudo apt-get update
    $ sudo apt-get install docker-ce docker-ce-cli containerd.io
    
  5. Add your user to the docker group:

    $ sudo usermod -aG docker ${USER}
    
  6. Log out and log back in so that the new group membership takes effect.
  7. Validate that Docker runs:

    $ docker ps
    CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
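
If `docker ps` fails with a permission error at this point, it helps to tell apart "not in the docker group at all" from "in the group, but this session predates the change". Below is a minimal sketch, not part of the official instructions. `has_word` and `docker_group_status` are hypothetical helper names. It relies on `id -nG` reporting the current session’s groups when called without a user name, and the groups recorded for the account when given one.

```shell
#!/bin/sh
# has_word LIST WORD: succeeds if WORD is one of the space-separated items in LIST.
has_word() {
    for item in $1; do
        if [ "$item" = "$2" ]; then
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: prints active, needs-relogin, or missing.
docker_group_status() {
    # Groups of the current session (fixed at login time).
    session_groups=$(id -nG 2>/dev/null || true)
    # Groups recorded for the account, which usermod updates immediately.
    account_groups=$(id -nG "$(id -un)" 2>/dev/null || true)

    if has_word "$session_groups" docker; then
        echo "active"
    elif has_word "$account_groups" docker; then
        echo "needs-relogin"
    else
        echo "missing"
    fi
}

docker_group_status
```

A result of `needs-relogin` means step 5 worked and only step 6 is outstanding. `missing` means the `usermod` command has to be run again.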
    

Install CRIU

Follow the instructions from https://criu.org/Docker:

  1. Install CRIU:

    $ sudo add-apt-repository ppa:criu/ppa
    $ sudo apt install criu
    
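  2. Enable experimental features in Docker. The redirect has to run as root, so pipe through `sudo tee` (note that this overwrites any existing daemon.json):

    $ echo '{"experimental": true}' | sudo tee /etc/docker/daemon.json
    $ sudo systemctl restart docker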
    
  3. Run a test container that increments a counter every second:

    $ docker run -d --name looper --security-opt seccomp:unconfined busybox \
         /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
    
  4. Verify that logs are showing up:

    $ docker logs looper
    0
    1
    2
    3
    
  5. Create a checkpoint, which will stop the container:

    $ docker checkpoint create looper checkpoint1
    checkpoint1
    
  6. Restore the container:

    $ docker start --checkpoint checkpoint1 looper
    
  7. Check logs one more time:

    $ docker logs looper
    0
    1
    2
    3
    0
    1
    2
    
  8. As you may notice, something went wrong on the last step: the test container restarts the counter from 0 rather than resuming from the checkpoint. I’m not sure why this is happening. I’m running Docker 20.10.3 and CRIU 3.15 on Linux 5.8.15, which satisfies the CRIU requirements.

    At the same time, if I try something different, such as running an alpine container, creating a folder, and taking a checkpoint, it works as expected:

    $ docker run -it alpine sh
    / # mkdir ~/hello
    
    $ docker ps
    CONTAINER ID   IMAGE     COMMAND   CREATED         STATUS        PORTS     NAMES
    a82ccad8fdfa   alpine    "sh"      2 minutes ago   Up 1 second             clever_kapitsa
    $ docker checkpoint create a82ccad8fdfa checkpoint1
    checkpoint1
    $ docker ps
    $ docker start --checkpoint checkpoint1 a82ccad8fdfa
    $ docker ps
    CONTAINER ID   IMAGE     COMMAND   CREATED         STATUS        PORTS     NAMES
    a82ccad8fdfa   alpine    "sh"      2 minutes ago   Up 1 second             clever_kapitsa
    $ docker exec -it a82ccad8fdfa sh
    / # cd
    ~ # ls
    hello
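
One caveat about step 2: writing daemon.json blindly can clobber or corrupt settings that are already there. As a hedged sketch, the hypothetical helper `experimental_config` below only emits the config when the target file is missing or empty, and fails otherwise. It assumes the default path /etc/docker/daemon.json and does not attempt to merge JSON.

```shell
#!/bin/sh
# experimental_config FILE: prints a daemon.json body that enables experimental
# mode, but only if FILE is missing or empty; otherwise it fails without output.
experimental_config() {
    if [ -s "$1" ]; then
        echo "refusing to overwrite non-empty $1; add the key by hand" >&2
        return 1
    fi
    printf '{\n  "experimental": true\n}\n'
}

# Usage (commented out because it needs root and a Docker host):
#   cfg=$(experimental_config /etc/docker/daemon.json) &&
#       printf '%s\n' "$cfg" | sudo tee /etc/docker/daemon.json
#   sudo systemctl restart docker
#   docker version --format '{{.Server.Experimental}}'
```

The config is captured into a variable first rather than piped straight into `tee`, because `tee` would truncate the file even when the helper refuses to print anything.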
    

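The reset seen in step 7 can also be detected automatically instead of by eyeballing the logs. The sketch below assumes the container prints one integer per line, as the looper container does. `counter_never_resets` is a hypothetical helper name, and it only checks that the logged sequence never goes backwards.

```shell
#!/bin/sh
# counter_never_resets: reads one integer per line on stdin and succeeds only
# if no value is smaller than the one before it.
counter_never_resets() {
    awk '
        NR > 1 && ($1 + 0) < prev { reset = 1 }
        { prev = $1 + 0 }
        END { exit reset + 0 }
    '
}

# Usage (commented out because it needs the looper container):
#   docker checkpoint ls looper
#   if docker logs looper | counter_never_resets; then
#       echo "counter resumed from the checkpoint"
#   else
#       echo "counter was reset"
#   fi
```

For the log output shown in step 7, the check fails, because the value drops from 3 back to 0.
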
I think CRIU will be useful at some point with Firecracker as well. Resource-wise, in some cases it doesn’t make sense to keep user containers running at all times if no one is using them.