GitLab artifacts purge!
Due to the rising cost of cloud storage and the ever-increasing volume of data used in pipelines, everyone who hosts GitLab will eventually have to evaluate the size of artifacts. Of course, GitLab allows you to manage artifacts and their eventual removal. Problems arise when your artifact expiration policy is too lenient.
In this blog post, I will answer these questions: How do I find the largest projects in my GitLab instance? Is there a way to remove artifacts from the projects based on date? Answering these questions helped me save a few dollars without endangering developers’ productivity.
Getting project sizes
Of course, we have good statistics regarding overall usage of the self-hosted GitLab, but those generally do not boil down to single projects. There might be UI options I missed, so don’t stop googling just because you read something here. What I used was this simple Python script:
import os

import requests

# GitLab API base URL (including /api/v4) and personal access token
API_URL = os.getenv('API_URL')
PERSONAL_ACCESS_TOKEN = os.getenv('PERSONAL_ACCESS_TOKEN')


def list_gitlab_projects():
    """Fetch all projects with statistics, following pagination links."""
    headers = {"Authorization": f"Bearer {PERSONAL_ACCESS_TOKEN}"}
    params = {"per_page": 100, "statistics": "true"}
    response = requests.get(f"{API_URL}/projects", headers=headers, params=params, timeout=60)
    response.raise_for_status()
    projects = response.json()
    # GitLab advertises the next page in the Link header
    while "next" in response.links:
        next_url = response.links["next"]["url"]
        response = requests.get(next_url, headers=headers, timeout=60)
        response.raise_for_status()
        projects.extend(response.json())
    return projects


def sort_projects_by_size(projects):
    """Sort projects by total storage size, largest first."""
    return sorted(projects, key=lambda p: p.get("statistics", {}).get("storage_size", 0), reverse=True)


# Fetch and sort GitLab projects
projects = list_gitlab_projects()
sorted_projects = sort_projects_by_size(projects)

# Print the sorted projects
for project in sorted_projects:
    project_name = project["name_with_namespace"]
    storage_size = project.get("statistics", {}).get("storage_size", 0)
    job_artifacts_size = project.get("statistics", {}).get("job_artifacts_size", 0)
    print(f"Project: {project_name} | Size: {storage_size/1024/1024/1024} GB | Artifacts: {job_artifacts_size/1024/1024/1024} GB")
The output looks like this:
Project: iOS / an app | Size: 24.820458421483636 GB | Artifacts: 22.45583942811936 GB
Project: iOS / another app | Size: 24.286553697660565 GB | Artifacts: 14.027025677263737 GB
Project: iOS / some other app | Size: 8.51979910954833 GB | Artifacts: 7.566993983462453 GB
Project: Infra / tf-tools / vault | Size: 5.982724603265524 GB | Artifacts: 5.982559899799526 GB
…
Based on that, we discovered that our iOS team keeps their built apps as artifacts. Furthermore, the default artifact expiration had been changed to keep a backup of those apps. The team didn’t even know about the hidden cost and acted on it immediately after getting the information from us.
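A finding like this is easier to spot when projects are ranked by artifact size and the share of storage that artifacts take up is shown next to it. The snippet below is only a sketch of that idea: the helper names and the sample data are made up for illustration, and it assumes the same `statistics` fields the script above reads.

```python
def gib(num_bytes):
    """Convert a byte count to GiB, the same division the script above uses."""
    return num_bytes / 1024 ** 3


def rank_by_artifacts(projects, top=10):
    """Return (name, artifacts_gib, share_of_storage) for the heaviest projects."""
    rows = []
    for project in projects:
        stats = project.get("statistics") or {}
        storage = stats.get("storage_size", 0)
        artifacts = stats.get("job_artifacts_size", 0)
        # Guard against projects that report no storage at all
        share = artifacts / storage if storage else 0.0
        rows.append((project["name_with_namespace"], gib(artifacts), share))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top]


# Hypothetical sample data shaped like the /projects?statistics=true response
sample = [
    {"name_with_namespace": "demo / small",
     "statistics": {"storage_size": 4 * 1024 ** 3, "job_artifacts_size": 1 * 1024 ** 3}},
    {"name_with_namespace": "demo / big",
     "statistics": {"storage_size": 10 * 1024 ** 3, "job_artifacts_size": 9 * 1024 ** 3}},
]

for name, artifacts_gib, share in rank_by_artifacts(sample):
    print(f"{name}: {artifacts_gib:.2f} GB of artifacts ({share:.0%} of storage)")
```

With real data, you would pass the `projects` list fetched from the API instead of `sample`.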
But what about artifacts that were kept long ago for reasons nobody remembers? What about artifacts marked to never expire by people who left the company years ago? Each of them might be so small that nobody cares, but the sheer number of those files can be the problem.
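One way to hunt for such leftovers is to look at the job list of a project (`GET /projects/:id/jobs`) and pick out jobs that finished long ago, still hold artifacts, and have no expiry date. The following is a minimal sketch, not something I ran against production: it assumes the job objects carry `finished_at`, `artifacts_expire_at`, and an `artifacts` list with a `file_type` per entry, and the function names and sample jobs are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_gitlab_time(value):
    """Parse a timestamp such as '2020-01-01T10:00:00.000Z' into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def never_expiring_jobs(jobs, older_than_days=365, now=None):
    """Return jobs that hold non-log artifacts, have no expiry, and finished long ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=older_than_days)
    stale = []
    for job in jobs:
        # Assumption: the job log is listed as an artifact with file_type "trace"; skip it
        files = [a for a in job.get("artifacts", []) if a.get("file_type") != "trace"]
        finished_at = job.get("finished_at")
        if not files or not finished_at:
            continue
        if job.get("artifacts_expire_at") is None and parse_gitlab_time(finished_at) < cutoff:
            stale.append(job)
    return stale


# Hypothetical jobs shaped like the Jobs API response
sample_jobs = [
    {"id": 1, "finished_at": "2020-01-01T10:00:00.000Z", "artifacts_expire_at": None,
     "artifacts": [{"file_type": "archive", "size": 100}]},
    {"id": 2, "finished_at": "2020-01-01T10:00:00.000Z",
     "artifacts_expire_at": "2020-02-01T10:00:00.000Z",
     "artifacts": [{"file_type": "archive", "size": 100}]},
    {"id": 3, "finished_at": "2020-01-01T10:00:00.000Z", "artifacts_expire_at": None,
     "artifacts": [{"file_type": "trace", "size": 50}]},
    {"id": 4, "finished_at": "2023-12-01T10:00:00.000Z", "artifacts_expire_at": None,
     "artifacts": [{"file_type": "archive", "size": 100}]},
]

fixed_now = datetime(2024, 1, 1, tzinfo=timezone.utc)
for job in never_expiring_jobs(sample_jobs, now=fixed_now):
    print(job["id"], job["finished_at"])
```

The `now` parameter exists only to make the example reproducible; in real use you would leave it out and feed in the paginated job list of each project.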
How to remove old artifacts
Let’s try to remove artifacts nobody will ever use. I first tried to do this via the API. One of the problems is that job logs were not removed by the API call, at least not when I tried. AFAIK, job logs are also stored as a kind of artifact. GitLab has published documentation on how to deal with the logs, and it also has a Rails console that can be used for the cleanup. The following script was tested, and it works:
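# The erase service needs a user to act as; 'root' is a placeholder for an admin username
admin_user = User.find_by(username: 'root')

# Only archived projects are cleaned up
projects = Project.where(archived: true)
projects.each do |project|
  puts "Project ID: #{project.id}"
  puts "Project name: #{project.name}"
  puts "Repository path: #{project.repository.full_path}"
  # Move the cutoff from 24 months ago to 2 months ago in two-month steps,
  # so each pass only loads a slice of the old jobs
  24.step(2, -2) do |months|
    builds_to_clear = project.builds.with_existing_job_artifacts(Ci::JobArtifact.trace).where("finished_at < ?", months.months.ago)
    builds_to_clear.find_each do |build|
      print "Ci::Build ID #{build.id}... "
      if build.erasable?
        Ci::BuildEraseService.new(build, admin_user).execute
        puts "Erased"
      else
        puts "Skipped (Nothing to erase or not erasable)"
      end
    end
  end
end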
It goes through the archived projects and erases their old jobs, including artifacts and logs. It only touches jobs that finished more than two months ago, starting with those older than two years and moving the cutoff forward in two-month steps. I picked two months because, with larger windows, the details of all the matching jobs didn’t fit into the GitLab instance’s memory, and the script ended with an OOM kill.
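To see why the moving cutoff keeps each pass small, here is the same windowing idea sketched in Python. It is an illustration only: the function names are made up, and job ages are reduced to whole months.

```python
def month_cutoffs(oldest=24, newest=2, step=2):
    """Cutoffs in months, from the oldest window to the newest one."""
    return list(range(oldest, newest - 1, -step))


def plan_passes(job_ages_in_months, cutoffs):
    """Map each cutoff to the job ages that pass would erase."""
    remaining = list(job_ages_in_months)
    plan = {}
    for cutoff in cutoffs:
        # A pass erases everything older than its cutoff that is still around
        plan[cutoff] = [age for age in remaining if age > cutoff]
        remaining = [age for age in remaining if age <= cutoff]
    return plan


cutoffs = month_cutoffs()
print(cutoffs)

# Hypothetical job ages in months
plan = plan_passes([30, 23, 5, 1], cutoffs)
for cutoff, ages in plan.items():
    if ages:
        print(f"cutoff {cutoff} months: erase jobs aged {ages}")
```

Each pass only picks up the jobs that fall between its cutoff and the previous one, apart from the first pass, which takes everything older than two years. Jobs younger than the last cutoff are never touched.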