The Story

Like a lot of Chef users, I'm using Vagrant for testing my cookbooks. I'm also using Berkshelf for providing the Vagrant box with the cookbooks it needs.
Until recently, I was happy using the ChefDK-provided Berkshelf (v4.0.1). That ended when running berks started consuming CPU for ~5 minutes and then failing whenever my Berksfile contained multiple sources (the Chef Supermarket and my private Chef server).
While troubleshooting, I learned that the problem lies in the native dependency graph solver, and that I wouldn't be able to fix it in less than a week.
I also noticed that the latest version of the Berkshelf gem (v4.1.1) had no such issue (unless I'm mistaken, that's because it switched to a pure-Ruby graph solver).

The next logical step was migrating to the new version of Berkshelf.

Attempting to upgrade Berkshelf in the ChefDK

I first tried staying inside the ChefDK and upgrading its bundled version of Berkshelf.
This made me learn several interesting things:

  1. The /usr/bin/berks file (actually /opt/chefdk/bin/berks) loads specific versions of Gems.
    This means that even if I install the new version of Berkshelf correctly, I'd have to modify this entry point, and it won't be trivial.
  2. The ChefDK Ruby environment is configured to install new Gems into the User's home directory (using GEM_HOME).
    I'm not sure why (something related to developing gems?)
  3. The only way I could execute the new Berkshelf gem "properly" inside the ChefDK was using a Gemfile and something like chef exec bundle exec berks, which was really annoying

Eventually I decided that the comfort of working inside the ChefDK wasn't worth the effort, since taking a clean Ruby 2 environment (e.g. using RVM or Bundler) and installing the Berkshelf gem inside it was effortless.
This worked well for non-Vagrant usage (e.g. calling it from Jenkins), but I still had quite a lot of work ahead of me.

Running Ruby in Vagrant

My second issue was with running any Ruby code from inside Vagrant.
As any Vagrant-Berkshelf veteran knows, the workflow goes something like this:

  1. User runs some command requiring provisioning, like vagrant up
  2. Vagrant calls the vagrant-berkshelf methods pretty early in the Vagrant workflow (after Vagrant::Action::Builtin::ConfigValidate)
  3. vagrant-berkshelf runs berks install to locate all relevant cookbooks and generate the Berksfile.lock
  4. vagrant-berkshelf calls berks vendor to make a directory containing all cookbooks that the VM needs, which will be accessed by the Chef client on the VM
  5. And so forth

This workflow heavily depends on Vagrant executing Berkshelf, which works with ChefDK's Berkshelf because its entry point is "environment-variable proof":

#!/opt/chefdk/embedded/bin/ruby
#--APP_BUNDLER_BINSTUB_FORMAT_VERSION=1--
ENV["GEM_HOME"] = ENV["GEM_PATH"] = nil unless ENV["APPBUNDLER_ALLOW_RVM"] == "true"
#...

Compare this to the "normal" entry point generated by Gems:

#!/usr/bin/ruby2.0
#
# This file was generated by RubyGems.
#
# The application 'berkshelf' is installed as part of a gem, and
# this file is here to facilitate running it.
#

require 'rubygems'

version = ">= 0"

if ARGV.first
  str = ARGV.first
  str = str.dup.force_encoding("BINARY") if str.respond_to? :force_encoding
  if str =~ /\A_(.*)_\z/
    version = $1
    ARGV.shift
  end
end

gem 'berkshelf', version
load Gem.bin_path('berkshelf', 'berks', version)

The environment negation (deleting GEM_HOME and GEM_PATH) is (IMO) related to the Vagrant use-case.
The fact is that Vagrant pollutes the environment of its subprocesses with Vagrant-specific, Ruby-related variables.
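The effect is easy to reproduce outside Vagrant. The following is a minimal sketch, not Vagrant's or ChefDK's code: it sets a made-up GEM_HOME in the parent process, then starts a child Ruby twice, once inheriting the environment and once with the two variables explicitly unset. The path /tmp/polluted-gems and the helper name child_gem_home are invented for illustration.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper: ask a child Ruby process which GEM_HOME it sees.
# `overrides` is handed to Open3.capture2 as the env hash; a nil value
# unsets that variable in the child.
def child_gem_home(overrides = {})
  out, _status = Open3.capture2(overrides, RbConfig.ruby, "-e", 'print ENV["GEM_HOME"].inspect')
  out
end

saved = ENV["GEM_HOME"]
begin
  # Simulate the parent (Vagrant, in the story) polluting its own environment.
  ENV["GEM_HOME"] = "/tmp/polluted-gems"
  inherited = child_gem_home
  cleaned = child_gem_home("GEM_HOME" => nil, "GEM_PATH" => nil)
ensure
  # Assigning nil removes the variable again if it was not set before.
  ENV["GEM_HOME"] = saved
end

puts inherited  # "/tmp/polluted-gems"
puts cleaned    # nil
```

The second call is roughly what the ChefDK binstub does for itself: whatever the parent put in GEM_HOME and GEM_PATH never reaches the code that loads gems.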

Vagrant, Bundler and external processes

Vagrant uses Bundler to manage its Ruby dependencies (both internal and plugins), so Vagrant suffers from the same issue that Bundler has: it assumes that subprocesses are supposed to run inside its own Ruby environment. To that end, it modifies its own Ruby-related environment variables, such as GEM_PATH (where to look for gems) and GEM_HOME (where gems should be installed).
For cases where that assumption doesn't hold, Bundler offers a method called Bundler.with_clean_env. It should yield (execute a given code block) with the "original" environment (the one Bundler had when it started), so any process spawned from that block should be free of the Bundler contamination.
Vagrant tries to utilize this method, but it doesn't work as expected.

with_clean_env internals

Let's drill down a bit:

# https://github.com/bundler/bundler/blob/5131fcd/lib/bundler.rb#L211

def with_clean_env
  with_original_env do
    ENV['MANPATH'] = ENV['BUNDLE_ORIG_MANPATH']
    ENV.delete_if { |k,_| k[0,7] == 'BUNDLE_' }
    if ENV.has_key? 'RUBYOPT'
      ENV['RUBYOPT'] = ENV['RUBYOPT'].sub '-rbundler/setup', ''
      ENV['RUBYOPT'] = ENV['RUBYOPT'].sub "-I#{File.expand_path('..', __FILE__)}", ''
    end
    yield
  end
end

# https://github.com/bundler/bundler/blob/5131fcd/lib/bundler.rb#L203

def with_original_env
  bundled_env = ENV.to_hash
  ENV.replace(ORIGINAL_ENV)
  yield
ensure
  ENV.replace(bundled_env.to_hash)
end

# https://github.com/bundler/bundler/blob/5131fcd/lib/bundler.rb#L16

module Bundler
  ORIGINAL_ENV = environment_preserver.restore
  ENV.replace(environment_preserver.backup)
#...

So, when the Bundler module is loaded, it creates a backup of the current environment variables. This backup (plus some modifications) is used whenever with_clean_env is called. How can it break?

By adding debug prints inside the Bundler gem, I deduced the following facts:

  1. Bundler is invoked twice
    First, the entry point is pre-rubygems.rb, as is evident from the Vagrant launcher:

    // Line 187
    
    cmd.Args[0] = "ruby"
    cmd.Args[1] = filepath.Join(gemPath, "lib", "vagrant", "pre-rubygems.rb")
    //...
    if err := cmd.Start(); err != nil {
    // ...
    

    Note these bits at lib/vagrant/pre-rubygems.rb:

    # Line 19
    require_relative "bundler"
    
    # Line 30
    
    if ENV["VAGRANT_EXECUTABLE"]
      Kernel.exec("ruby", ENV["VAGRANT_EXECUTABLE"], *ARGV)
    else
      Kernel.exec("vagrant", *ARGV)
    end
    

    And finally, this in bin/vagrant:

    # Line 69
    
    require "bundler"
    

    As you can see, the pre-rubygems.rb file is invoked first, loads Bundler, and then execs the Vagrant entry point, which loads its own Bundler. So the Bundler gem is loaded twice, and the second instance "saves" an environment already modified by the first instance, meaning with_clean_env is useless.

  2. Vagrant works around this
    The Vagrant devs tried to solve this issue by backing up the environment variables before any modification, like so:

    // https://github.com/mitchellh/vagrant-installers/blob/c5eb9bb/substrate/launcher/main.go
    // Line 18
    const envPrefix = "VAGRANT_OLD_ENV"
    
    // https://github.com/mitchellh/vagrant-installers/blob/c5eb9bb/substrate/launcher/main.go
    // Line 150
    for _, value := range os.Environ() {
      idx := strings.IndexRune(value, '=')
      key := fmt.Sprintf("%s_%s", envPrefix, value[:idx])
      newEnv[key] = value[idx+1:]
    }
    

    And then allow restoring from it:

    # https://github.com/mitchellh/vagrant/blob/27157b5/lib/vagrant.rb
    # Line 236
    
    def self.original_env
      {}.tap do |h|
        ENV.each do |k,v|
          if k.start_with?("VAGRANT_OLD_ENV")
            key = k.sub(/^VAGRANT_OLD_ENV_/, "")
            h[key] = v
          end
        end
      end
    end
    

    This method works (sort of).
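Both findings can be modelled with plain hashes. The sketch below is a simulation, not the real Bundler or launcher code: load_bundler is a hypothetical stand-in for what happens when Bundler is required (snapshot the environment, then modify it), while launcher_backup and restore_backup mimic the VAGRANT_OLD_ENV_ prefix scheme. All paths are made up.

```ruby
PREFIX = "VAGRANT_OLD_ENV_"

# Hypothetical stand-in for requiring Bundler: remember the environment as
# it is right now, then return a modified copy for the rest of the process.
def load_bundler(env)
  snapshot = env.dup
  modified = env.merge("GEM_PATH" => "/opt/vagrant/embedded/gems")
  [snapshot, modified]
end

# Mimics the launcher: copy every variable under a prefixed name.
def launcher_backup(env)
  env.merge(env.map { |k, v| ["#{PREFIX}#{k}", v] }.to_h)
end

# Mimics Vagrant.original_env: collect the prefixed copies back.
def restore_backup(env)
  env.select { |k, _| k.start_with?(PREFIX) }
     .map { |k, v| [k.delete_prefix(PREFIX), v] }
     .to_h
end

user_env = { "PATH" => "/usr/bin" }

# The launcher takes its backup before any Ruby code runs.
env = launcher_backup(user_env)

# First load (pre-rubygems.rb), then the exec, then the second load
# (bin/vagrant). In real life the first snapshot dies with the exec, so the
# running process only ever holds the second one.
first_snapshot, env = load_bundler(env)
second_snapshot, env = load_bundler(env)

puts first_snapshot.key?("GEM_PATH")   # false: taken before any modification
puts second_snapshot.key?("GEM_PATH")  # true: the "original" is already polluted
puts restore_backup(env) == user_env   # true: the prefixed backup survived both loads
```

The prefixed copies travel through the exec as ordinary environment variables, which is why they outlive the in-memory snapshot.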

with_original_env is done wrong

Both the Bundler backup environment and the Vagrant backup environment are being handled in Vagrant::Util::Env.with_original_env:

def self.with_original_env
  original_env = ENV.to_hash
  ENV.replace(::Bundler::ORIGINAL_ENV) if defined?(::Bundler::ORIGINAL_ENV)
  ENV.update(Vagrant.original_env)
  yield
ensure
  ENV.replace(original_env.to_hash)
end

Now, notice the two issues here:

  1. In the normal Vagrant flow (working via the Vagrant launcher), the Bundler::ORIGINAL_ENV hash is useless because of the double invocation of Bundler.
  2. Because we're only using update with the "proper" environment backup, values won't be deleted, only replaced:

    good={'a'=>1}
    bad={'a'=>2,'b'=>3}
    bad.update(good)
    bad
    # => {"a"=>1, "b"=>3}
    

    So values that didn't exist in the backup and do exist in the current environment (e.g. GEM_PATH) will stay.
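The same difference shows up when working on ENV itself. This is a small sketch with made-up DEMO_ variables (so it can't clash with anything real in your shell); with_saved_env is a hypothetical helper that exists only to put the real environment back after each experiment.

```ruby
# Hypothetical helper: run a block, then restore the environment we had.
def with_saved_env
  saved = ENV.to_hash
  yield
ensure
  ENV.replace(saved)
end

# The "proper" backup knows nothing about DEMO_GEM_PATH.
backup = { "DEMO_PATH" => "/usr/bin" }

leftover_after_update = with_saved_env do
  ENV["DEMO_PATH"] = "/opt/vagrant/bin"
  ENV["DEMO_GEM_PATH"] = "/opt/vagrant/gems"
  ENV.update(backup)    # overwrites DEMO_PATH, leaves DEMO_GEM_PATH alone
  [ENV["DEMO_PATH"], ENV["DEMO_GEM_PATH"]]
end

leftover_after_replace = with_saved_env do
  ENV["DEMO_PATH"] = "/opt/vagrant/bin"
  ENV["DEMO_GEM_PATH"] = "/opt/vagrant/gems"
  ENV.replace(backup)   # everything not in the backup is gone
  [ENV["DEMO_PATH"], ENV["DEMO_GEM_PATH"]]
end

p leftover_after_update   # ["/usr/bin", "/opt/vagrant/gems"]
p leftover_after_replace  # ["/usr/bin", nil]
```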

The Solution

This is the relevant PR

First, I modified Vagrant::Util::Env.with_original_env.
I made the assumption that if we're going through the Vagrant launcher, we only need to restore its environment.
If not, we'll restore the Bundler environment, if one exists.
The result looks like this:

proxy_env = Vagrant.original_env
if Vagrant.original_env.any?
  ENV.replace(proxy_env)
elsif defined?(::Bundler::ORIGINAL_ENV)
  ENV.replace(::Bundler::ORIGINAL_ENV)
end
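To see the two branches in isolation, here is a self-contained sketch. It is not the code from the PR: DemoEnv, pick_original_env and the sample hashes are invented for illustration, and the launcher backup and the Bundler snapshot are passed in as plain hashes instead of being read from Vagrant.original_env and ::Bundler::ORIGINAL_ENV.

```ruby
module DemoEnv
  # Choose which environment to restore: the launcher backup wins when it
  # exists, otherwise fall back to the Bundler snapshot (which may be nil).
  def self.pick_original_env(launcher_backup, bundler_original)
    return launcher_backup if launcher_backup.any?
    bundler_original
  end

  # Run the block with the chosen environment in place, then put the
  # current environment back, even if the block raises.
  def self.with_original_env(launcher_backup, bundler_original = nil)
    saved = ENV.to_hash
    chosen = pick_original_env(launcher_backup, bundler_original)
    ENV.replace(chosen) if chosen
    yield
  ensure
    ENV.replace(saved)
  end
end

ENV["DEMO_GEM_PATH"] = "/opt/vagrant/gems"   # simulated pollution

via_launcher = DemoEnv.with_original_env({ "PATH" => "/usr/bin" }, { "PATH" => "/bin" }) do
  [ENV["PATH"], ENV["DEMO_GEM_PATH"]]
end

via_bundler = DemoEnv.with_original_env({}, { "PATH" => "/bin" }) do
  [ENV["PATH"], ENV["DEMO_GEM_PATH"]]
end

after_blocks = ENV["DEMO_GEM_PATH"]
ENV.delete("DEMO_GEM_PATH")                  # tidy up the demo variable

p via_launcher  # ["/usr/bin", nil]
p via_bundler   # ["/bin", nil]
p after_blocks  # "/opt/vagrant/gems" (the polluted environment comes back afterwards)
```

Because both branches use replace, the stray variable disappears inside the block in either case.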

After that, I had to locate the code in charge of spawning new processes and make sure that it's using the right logic.
The interesting method is Vagrant::Util::Subprocess#execute in lib/vagrant/util/subprocess.rb.
It's very long, but take my word for it: the only thing it does to protect the subprocess from the Bundler modifications is call jailbreak, which is defined in the same file.
The introduction to this method is best quoted directly from the file:

This is, quite possibly, the saddest function in all of Vagrant.

The method itself does plenty with the environment, mainly dealing with environment-related corner cases. The part that interests us is this:

env.replace(::Bundler::ORIGINAL_ENV) if defined?(::Bundler::ORIGINAL_ENV)
env.merge!(Vagrant.original_env)

Rather than repeat the logic from with_original_env, I removed these lines from jailbreak and wrapped the process.start call in execute with with_original_env, like so:

Vagrant::Util::Env.with_original_env do
  process.start
end
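Wrapping only the start of the process is enough because a child receives its copy of the environment at the moment it is spawned. Here is a rough sketch of that idea using Process.spawn directly; it is not Vagrant's Subprocess code, and with_env and the DEMO_GEM_PATH variable are made up for the demonstration.

```ruby
require "rbconfig"

# Hypothetical helper (not Vagrant's): swap ENV for the duration of a block.
def with_env(replacement)
  saved = ENV.to_hash
  ENV.replace(replacement)
  yield
ensure
  ENV.replace(saved)
end

ENV["DEMO_GEM_PATH"] = "/opt/vagrant/gems"   # simulated pollution
clean = ENV.to_hash.reject { |k, _| k == "DEMO_GEM_PATH" }

reader, writer = IO.pipe
pid = with_env(clean) do
  # Only the start of the child happens inside the block.
  Process.spawn(RbConfig.ruby, "-e", 'print ENV["DEMO_GEM_PATH"].inspect', out: writer)
end
writer.close

# Back outside the block the parent is polluted again, but the child
# already holds its own copy of the clean environment.
parent_value = ENV["DEMO_GEM_PATH"]
child_value = reader.read
Process.wait(pid)
ENV.delete("DEMO_GEM_PATH")                  # tidy up the demo variable

puts parent_value  # /opt/vagrant/gems
puts child_value   # nil
```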

I might have misunderstood jailbreak a bit, but hopefully it'll work OK.

And there you have it.