Running external Ruby code from Vagrant

The Story

Like a lot of Chef users, I’m using Vagrant for testing my cookbooks. I’m also using Berkshelf for providing the Vagrant box with the cookbooks it needs.
Until recently, I was happy using the ChefDK-provided Berkshelf (v4.0.1). I stopped being happy when berks started consuming CPU for ~5 minutes and then failing whenever my Berksfile contained multiple sources (the Chef Supermarket and my private Chef server).
While troubleshooting, I learned that there’s an issue with the native dependency graph solver, and that I wouldn’t be able to fix it in less than a week.
I also noticed that the latest version of the Berkshelf gem (v4.1.1) had no such issues (unless I’m mistaken, it’s because it switched to the native Ruby graph solver).

The next logical step was migrating to the new version of Berkshelf.

Attempting to upgrade Berkshelf in the ChefDK

I first tried staying inside the ChefDK by upgrading its bundled version of Berkshelf.
This taught me several interesting things:

The /usr/bin/berks file (actually /opt/chefdk/bin/berks) loads specific versions of gems.
This means that even if I installed the new version of Berkshelf correctly, I’d have to modify this entry point, which wouldn’t be trivial.
The ChefDK Ruby environment is configured to install new gems into the user’s home directory (using GEM_HOME).
I’m not sure why (something related to developing gems?).
The only way I could run the new Berkshelf gem “properly” inside the ChefDK was with a Gemfile and something like chef exec bundle exec berks, which was really annoying.

Eventually I decided that the comfort of working inside the ChefDK isn’t worth the effort, since setting up a clean Ruby 2 environment (e.g. using RVM or Bundler) and installing the Berkshelf gem into it was effortless.
This worked well outside Vagrant (e.g. when calling berks from Jenkins), but there was still quite a lot of work to do.

Running Ruby in Vagrant

My second issue was running any Ruby code from inside Vagrant.
As any Vagrant-Berkshelf veteran knows, the workflow goes something like this:

User runs some command requiring provisioning, like vagrant up
Vagrant calls the vagrant-berkshelf methods pretty early in the Vagrant workflow (after Vagrant::Action::Builtin::ConfigValidate)
vagrant-berkshelf runs berks install to locate all relevant cookbooks and generate the Berksfile.lock
vagrant-berkshelf calls berks vendor to create a directory containing all the cookbooks the VM needs, which the Chef client on the VM will then access
And so forth

This workflow heavily depends on Vagrant executing Berkshelf, which works with ChefDK’s Berkshelf because its entry point is “environment-variable proof”:

```ruby
#!/opt/chefdk/embedded/bin/ruby
#--APP_BUNDLER_BINSTUB_FORMAT_VERSION=1--
ENV["GEM_HOME"] = ENV["GEM_PATH"] = nil unless ENV["APPBUNDLER_ALLOW_RVM"] == "true"
#...
```

Compare this to the “normal” entry point generated by RubyGems:

```ruby
#!/usr/bin/ruby2.0
#
# This file was generated by RubyGems.
#
# The application 'berkshelf' is installed as part of a gem, and
# this file is here to facilitate running it.
#

require 'rubygems'

version = ">= 0"

if ARGV.first
  str = ARGV.first
  str = str.dup.force_encoding("BINARY") if str.respond_to? :force_encoding
  if str =~ /\A_(.*)_\z/
    version = $1
    ARGV.shift
  end
end

gem 'berkshelf', version
load Gem.bin_path('berkshelf', 'berks', version)
```

Clearing GEM_HOME and GEM_PATH is (IMO) there for the Vagrant use case.
The fact is, Vagrant pollutes the environment of its subprocesses with Vagrant-specific, Ruby-related variables.
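A quick way to see this pollution is to compare what a child process sees with and without an explicit reset. The sketch below is illustrative Ruby, not code from Vagrant or the ChefDK; the /tmp/polluted-gems path is made up, and passing nil for GEM_HOME only approximates what the ChefDK binstub does from inside its own process.

```ruby
require "rbconfig"

# Runs a fresh Ruby (the same interpreter as this script) that prints its own
# view of GEM_HOME. The optional env hash is applied on top of our ENV;
# a nil value removes that variable for the child only.
def child_gem_home(env = {})
  cmd = [RbConfig.ruby, "-e", 'print ENV["GEM_HOME"].inspect']
  IO.popen(env, cmd, &:read)
end

# Hypothetical value, standing in for what Vagrant/Bundler set in the parent.
ENV["GEM_HOME"] = "/tmp/polluted-gems"

puts child_gem_home                       # prints "/tmp/polluted-gems": inherited
puts child_gem_home({ "GEM_HOME" => nil }) # prints nil: cleared for the child
```

Any variable the parent sets is inherited unless something explicitly removes it, which is exactly what an “environment-variable proof” entry point guards against.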

Vagrant, Bundler and external processes

Vagrant uses Bundler to manage its Ruby dependencies (both internal ones and plugins), so it suffers from the same issue Bundler has: it assumes that subprocesses should run inside its own Ruby environment. To achieve that, it modifies its Ruby-related environment variables, such as GEM_PATH (where to look for gems) and GEM_HOME (where gems should be installed).
For cases where that assumption doesn’t hold, Bundler offers a method called Bundler.with_clean_env. It yields (executes a given code block) with the “original” environment (the one Bundler saw when it started), so any processes spawned from that block should be free of Bundler’s modifications.
Vagrant tries to utilize this method, but it doesn’t work as expected.
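The idea with_clean_env relies on can be sketched in a few lines. EnvSnapshot and DEMO_GEM_PATH are hypothetical names, and this is a simplification of the idea rather than Bundler’s implementation:

```ruby
# EnvSnapshot is a hypothetical, stripped-down version of the idea behind
# Bundler.with_clean_env: remember ENV once, restore it on demand.
module EnvSnapshot
  # Taken once, when this file is loaded (Bundler does this at require time).
  ORIGINAL = ENV.to_hash.freeze

  # Swap ENV back to the snapshot for the duration of the block, then put back
  # whatever ENV held before the call, even if the block raises.
  def self.with_original_env
    current = ENV.to_hash
    ENV.replace(ORIGINAL)
    yield
  ensure
    ENV.replace(current) if current
  end
end

# A change made after the snapshot, like Bundler's own GEM_PATH tweaks.
# DEMO_GEM_PATH is a made-up name so the demo can't clash with a real setting.
ENV["DEMO_GEM_PATH"] = "/tmp/bundled-gems"

EnvSnapshot.with_original_env do
  p ENV["DEMO_GEM_PATH"] # prints nil: the later change is hidden here
end
p ENV["DEMO_GEM_PATH"]   # prints "/tmp/bundled-gems": restored afterwards
```

The whole approach stands or falls on *when* the snapshot is taken: it only helps if nothing has modified the environment before the module is loaded.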

`with_clean_env` internals

Let’s drill down a bit:

```ruby
# https://github.com/bundler/bundler/blob/5131fcd/lib/bundler.rb#L211

def with_clean_env
  with_original_env do
    ENV['MANPATH'] = ENV['BUNDLE_ORIG_MANPATH']
    ENV.delete_if { |k,_| k[0,7] == 'BUNDLE_' }
    if ENV.has_key? 'RUBYOPT'
      ENV['RUBYOPT'] = ENV['RUBYOPT'].sub '-rbundler/setup', ''
      ENV['RUBYOPT'] = ENV['RUBYOPT'].sub "-I#{File.expand_path('..', __FILE__)}", ''
    end
    yield
  end
end

# https://github.com/bundler/bundler/blob/5131fcd/lib/bundler.rb#L203

def with_original_env
  bundled_env = ENV.to_hash
  ENV.replace(ORIGINAL_ENV)
  yield
ensure
  ENV.replace(bundled_env.to_hash)
end

# https://github.com/bundler/bundler/blob/5131fcd/lib/bundler.rb#L16
module Bundler
  ORIGINAL_ENV = environment_preserver.restore
  ENV.replace(environment_preserver.backup)
#...
```
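Besides restoring the snapshot, with_clean_env applies a few clean-up steps. Below they are rewritten as a pure function over a Hash so they can be inspected without touching ENV; clean_env is a hypothetical name, bundler_lib_dir stands in for File.expand_path('..', __FILE__) in Bundler’s code, and the sample values are made up.

```ruby
# clean_env is a hypothetical rewrite of the clean-up steps in
# with_clean_env, operating on a plain Hash instead of ENV.
def clean_env(env, bundler_lib_dir)
  env = env.dup
  # Restore MANPATH from Bundler's saved copy. Assigning nil to ENV deletes the
  # key, so a missing BUNDLE_ORIG_MANPATH means "no MANPATH" here too.
  if env["BUNDLE_ORIG_MANPATH"]
    env["MANPATH"] = env["BUNDLE_ORIG_MANPATH"]
  else
    env.delete("MANPATH")
  end
  # Drop every BUNDLE_* variable.
  env.delete_if { |key, _| key.start_with?("BUNDLE_") }
  # Strip the flags that would make a child Ruby load Bundler again.
  if env.key?("RUBYOPT")
    env["RUBYOPT"] = env["RUBYOPT"]
      .sub("-rbundler/setup", "")
      .sub("-I#{bundler_lib_dir}", "")
  end
  env
end

# Hypothetical values, shaped like an environment Bundler has set up.
polluted = {
  "BUNDLE_GEMFILE" => "/opt/vagrant/Gemfile",
  "BUNDLE_ORIG_MANPATH" => "/usr/share/man",
  "RUBYOPT" => "-rbundler/setup -W0",
  "PATH" => "/usr/bin"
}
p clean_env(polluted, "/opt/vagrant/bundler/lib")
```

Note that none of these steps touch GEM_PATH or GEM_HOME; those are only undone by restoring the snapshot itself.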

So, when the Bundler module is loaded, it creates a backup of the current environment variables. This backup (plus some modifications) is used whenever with_clean_env is called. How can it break?

By adding debug prints inside the Bundler gem, I deduced the following facts:

Bundler is invoked twice
First, the entry point is pre-rubygems.rb, as is evident from the Vagrant launcher:
```go
// Line 187

cmd.Args[0] = "ruby"
cmd.Args[1] = filepath.Join(gemPath, "lib", "vagrant", "pre-rubygems.rb")
//...
if err := cmd.Start(); err != nil {
// ...
```
Note these bits in lib/vagrant/pre-rubygems.rb:
```ruby
# Line 19
require_relative "bundler"
```
```ruby
# Line 30

if ENV["VAGRANT_EXECUTABLE"]
  Kernel.exec("ruby", ENV["VAGRANT_EXECUTABLE"], *ARGV)
else
  Kernel.exec("vagrant", *ARGV)
end
```
And finally, this in bin/vagrant:
```ruby
# Line 69

require "bundler"
```
As you can see, pre-rubygems.rb runs first, loads Bundler, and then execs the Vagrant entry point, which loads Bundler again. So the Bundler gem is loaded twice, and the second instance “saves” an environment that the first instance has already modified, which makes with_clean_env useless.
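The failure can be reproduced without Vagrant at all. In this simulation plain Hashes stand in for ENV, and load_bundler is a hypothetical stand-in for requiring Bundler: it snapshots whatever environment it finds and then modifies it. The GEM_PATH value is made up; only the order of events matters.

```ruby
# Hypothetical stand-in for "require 'bundler'" plus its set-up: snapshot the
# environment as found, then modify it.
def load_bundler(env)
  snapshot = env.dup.freeze                      # saved as ORIGINAL_ENV
  env["GEM_PATH"] = "/opt/vagrant/embedded/gems" # set-up modification
  snapshot
end

def simulate_launcher(user_env)
  env = user_env.dup
  first = load_bundler(env)      # pre-rubygems.rb: require_relative "bundler"
  # Kernel.exec replaces the process: in-memory snapshots are gone, but the
  # modified environment is passed on to bin/vagrant.
  second = load_bundler(env.dup) # bin/vagrant: require "bundler"
  [first, second]
end

first, second = simulate_launcher({ "PATH" => "/usr/bin" })
p first.key?("GEM_PATH")  # prints false: clean, but lost on exec
p second.key?("GEM_PATH") # prints true: the snapshot Vagrant ends up with
```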

Vagrant works around this
The Vagrant devs tried to solve this issue by backing up the environment variables before any modification, like so:

```go
// https://github.com/mitchellh/vagrant-installers/blob/c5eb9bb/substrate/launcher/main.go
// Line 18
const envPrefix = "VAGRANT_OLD_ENV"
```

```go
// https://github.com/mitchellh/vagrant-installers/blob/c5eb9bb/substrate/launcher/main.go
// Line 150
for _, value := range os.Environ() {
  idx := strings.IndexRune(value, '=')
  key := fmt.Sprintf("%s_%s", envPrefix, value[:idx])
  newEnv[key] = value[idx+1:]
}
```

And then allow restoring from it:

```ruby
# https://github.com/mitchellh/vagrant/blob/27157b5/lib/vagrant.rb
# Line 236

def self.original_env
  {}.tap do |h|
    ENV.each do |k,v|
      if k.start_with?("VAGRANT_OLD_ENV")
        key = k.sub(/^VAGRANT_OLD_ENV_/, "")
        h[key] = v
      end
    end
  end
end
```

This method works (sort of).
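Here is a Ruby sketch of that round trip. encode_env is a hypothetical port of the Go loop, decode_env mirrors original_env, and the paths are made up. Because the backup is carried in ordinary environment variables, it survives the Kernel.exec between pre-rubygems.rb and bin/vagrant, which an in-memory snapshot cannot.

```ruby
PREFIX = "VAGRANT_OLD_ENV"

# Hypothetical port of the launcher loop: copy every variable under a
# prefixed name (VAGRANT_OLD_ENV_PATH, ...).
def encode_env(env)
  env.to_h { |key, value| ["#{PREFIX}_#{key}", value] }
end

# Mirrors Vagrant.original_env: collect the prefixed copies, strip the prefix.
def decode_env(env)
  env.each_with_object({}) do |(key, value), out|
    out[key.delete_prefix("#{PREFIX}_")] = value if key.start_with?("#{PREFIX}_")
  end
end

user_env = { "PATH" => "/usr/bin", "HOME" => "/home/user" } # hypothetical
# The backup travels next to the live variables, which get modified later.
live_env = user_env.merge(encode_env(user_env))
live_env["GEM_PATH"] = "/opt/vagrant/embedded/gems"

p decode_env(live_env) == user_env # prints true
```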

`with_original_env` is done wrong

Both the Bundler backup environment and the Vagrant backup environment are handled in Vagrant::Util::Env.with_original_env:

```ruby
def self.with_original_env
  original_env = ENV.to_hash
  ENV.replace(::Bundler::ORIGINAL_ENV) if defined?(::Bundler::ORIGINAL_ENV)
  ENV.update(Vagrant.original_env)
  yield
ensure
  ENV.replace(original_env.to_hash)
end
```

Now, notice the two issues here:

In the normal Vagrant flow (working via the Vagrant launcher), the Bundler::ORIGINAL_ENV hash is useless because of the double invocation of Bundler.
Because the “proper” environment backup is only applied with update, values are never deleted, only overwritten:
```ruby
good={'a'=>1}
bad={'a'=>2,'b'=>3}
bad.update(good)
bad
# => {"a"=>1, "b"=>3}
```
So values that didn’t exist in the backup and do exist in the current environment (e.g. GEM_PATH) will stay.
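For comparison, here is the same situation with replace next to update; ENV exposes both methods with the same meaning. The values are illustrative:

```ruby
# Hash#update (an alias of merge!) only adds or overwrites keys;
# Hash#replace makes the receiver match its argument exactly.
backup  = { "PATH" => "/usr/bin" }                      # clean environment
current = { "PATH" => "/opt/vagrant/embedded/bin",       # hypothetical
            "GEM_PATH" => "/opt/vagrant/embedded/gems" } # polluted state

updated  = current.dup.update(backup)
replaced = current.dup.replace(backup)

p updated  # GEM_PATH survives, PATH is overwritten
p replaced # exactly the backup
```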

The Solution

This is the relevant PR.

First, I modified Vagrant::Util::Env.with_original_env.
I assumed that if we’re going through the Vagrant launcher, we only need to restore its environment.
If not, we’ll restore the Bundler environment, if one exists.
The result looks like this:

```ruby
proxy_env = Vagrant.original_env
if Vagrant.original_env.any?
  ENV.replace(proxy_env)
elsif defined?(::Bundler::ORIGINAL_ENV)
  ENV.replace(::Bundler::ORIGINAL_ENV)
end
```
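The selection rule can be expressed as a side-effect-free function, which makes the three cases easy to check. pick_original_env is a hypothetical name; proxy_env stands in for Vagrant.original_env and bundler_env for Bundler::ORIGINAL_ENV (nil when Bundler isn’t loaded):

```ruby
# Hypothetical, side-effect-free version of the selection above.
# A nil result means "leave ENV as it is".
def pick_original_env(proxy_env, bundler_env)
  if proxy_env.any?
    proxy_env   # started by the launcher: its backup predates every change
  elsif bundler_env
    bundler_env # no launcher backup: fall back to Bundler's snapshot
  end
end

launcher_backup  = { "PATH" => "/usr/bin" }
bundler_snapshot = { "PATH" => "/usr/bin", "GEM_PATH" => "/opt/vagrant/embedded/gems" }

p pick_original_env(launcher_backup, bundler_snapshot) # the launcher's backup
p pick_original_env({}, bundler_snapshot)              # Bundler's snapshot
p pick_original_env({}, nil)                           # prints nil
```

Since the chosen hash is then applied with ENV.replace rather than ENV.update, variables missing from it, like GEM_PATH, no longer survive.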

After that, I had to locate the code in charge of spawning new processes and make sure that it’s using the right logic.
The interesting method is Vagrant::Util::Subprocess#execute in lib/vagrant/util/subprocess.rb.
It’s very long, but you can skip reading it if you take my word that the only thing it does to shield the subprocess from Bundler’s modifications is call jailbreak, which is defined in the same file.
The comment introducing this method is best quoted directly from the file:

This is, quite possibly, the saddest function in all of Vagrant.

The method itself does plenty with the environment, mostly handling environment-related corner cases. The part we care about is this:

```ruby
env.replace(::Bundler::ORIGINAL_ENV) if defined?(::Bundler::ORIGINAL_ENV)
env.merge!(Vagrant.original_env)
```

Instead of repeating the logic from with_original_env, I removed it from jailbreak, and instead took process.start from execute and wrapped it in with_original_env, like so:

```ruby
Vagrant::Util::Env.with_original_env do
  process.start
end
```

I might have misunderstood jailbreak a bit, but hopefully it’ll work OK.
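Wrapping only process.start should be enough, because a child process receives a copy of the environment at the moment it is created; restoring the parent’s ENV right afterwards doesn’t reach into the child. A minimal sketch of that property, using Process.spawn rather than the process object Vagrant builds, with a hypothetical with_env helper and a made-up DEMO_GEM_PATH variable:

```ruby
require "rbconfig"

# Hypothetical helper: run the block with ENV replaced by env, then restore.
def with_env(env)
  saved = ENV.to_hash
  ENV.replace(env)
  yield
ensure
  ENV.replace(saved) if saved
end

# Made-up variable standing in for GEM_PATH in the polluted parent.
ENV["DEMO_GEM_PATH"] = "/opt/vagrant/embedded/gems"
clean = ENV.to_hash.reject { |key, _| key == "DEMO_GEM_PATH" }

reader, writer = IO.pipe
# Only the start is wrapped, mirroring the change around process.start.
pid = with_env(clean) do
  Process.spawn(RbConfig.ruby, "-e", 'print ENV["DEMO_GEM_PATH"].inspect',
                out: writer)
end
writer.close
Process.wait(pid)

puts reader.read          # prints nil: the child kept the clean copy
puts ENV["DEMO_GEM_PATH"] # prints /opt/vagrant/embedded/gems: parent restored
```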

And there you have it.

Nitzan