Migrating a SVN repo to Git: a tale of hacking my way through

  1. ➤ Migrating a SVN repo to Git: a tale of hacking my way through
  2. Migrating a SVN repo to Git, part deux: SubGit to the rescue

If you’re just looking for an easy way to do an SVN-to-Git migration, skip this post and go directly to part two.


We become what we behold. We shape our tools, and thereafter our tools shape us.
― Marshall McLuhan

Lately I’ve orchestrated an SVN to Visual Studio Online migration for one of our projects. Our developers opted to use Git as the version control solution instead of Team Foundation Version Control (TFVC). We also have a pure Windows environment running VisualSVN Server, so I’ll provide Windows-specific tips along the way.

Git and SVN are quite different beasts, especially when it comes to access control and branching strategies. Because of that, simply using git svn, Git’s bidirectional bridge to Subversion, will produce suboptimal results. You will end up with all branches and tags as remote svn branches, whereas what you really want is git-native local branches and git tag objects.

To alleviate this issue, a number of solutions are available:

reposurgeon
A tool for editing version-control repository history. reposurgeon enables risky operations that version-control systems don’t want to let you do, such as editing past comments and metadata and removing commits. It works with any version control system that can export and import git fast-import streams, including git, hg, fossil, bzr, CVS, and RCS. It can also read Subversion dump files directly and can thus be used to script production of very high-quality conversions from Subversion to any supported DVCS.
agito
Agito is (yet another) Subversion to Git conversion script. It is designed to do a better job of translating history than git-svn, which has some subtleties in the way it works that cause it to construct branch histories that are suboptimal in certain corner case scenarios.
svn2git
svn2git is a tiny utility for migrating projects from Subversion to Git while keeping the trunk, branches and tags where they should be. It uses git-svn to clone an svn repository and does some clean-up to make sure branches and tags are imported in a meaningful way, and that the code checked into master ends up being what’s currently in your svn trunk rather than whichever svn branch your last commit was in.

We are all wonderful, beautiful wrecks. That’s what connects us ― that we’re all broken, all beautifully imperfect.
― Emilio Estevez

Initially I planned to use reposurgeon, because by its own account it clearly wins over the other solutions:

There are many tools for converting repositories between version-control systems out there. This file explains why reposurgeon is the best of breed by comparing it to the competition.

The problems other repository-translation tools have come from ontological mismatches between their source and target systems – models of changesets, branching and tagging can differ in complicated ways. While these gaps can often be bridged by careful analysis, the techniques for doing so are algorithmically complex, difficult to test, and have ugly edge cases.

Furthermore, doing a really high-quality translation often requires human judgment about how to move artifacts – and what to discard. But most lifting tools are, unlike reposurgeon, designed as run-it-once batch processors that can only implement simple and mechanical rules.

Consequently, most repository-translation tools evade the harder problems. They produce a sort of pidgin rendering that crudely and partially copies the history from the source system to the target without fully translating it into native idioms, leaving behind metadata that would take more effort to move over or leaving it in the native format for the source system.

But pidgin repository translations are a kind of friction drag on future development, and are just plain unpleasant to use. So instead of evading the hard problems, reposurgeon tackles them head-on.

reposurgeon is written in Python, and its author recommends running it under PyPy, which provides a substantial speed increase (for Windows, get the latest Python 2.7 compatible PyPy binary). Unfortunately, I wasn’t able to do much with it, because reposurgeon failed to read the Subversion dump of my repo:

reposurgeon% read repo.svn
reposurgeon: from repo.svn......(0.03 sec) aborted by error.
reposurgeon: EOL not seen where expected, Content-Length incorrect at line 187

This was a bit unexpected, so I decided to put reposurgeon aside for the time being and try something else. Choosing between agito and svn2git, I chose the latter, mostly because it seemed to be actively maintained, whereas agito’s last update was about a year ago. Also, svn2git usage is more straightforward (no config file needed).

To set up svn2git on Windows, follow these steps:

  • Install your favorite Git flavour (Git for Windows or plain Git)
  • Get Ruby v1.9.x via RubyInstaller
  • Start command prompt with Ruby
  • cd c:\path\to\svn2git
  • gem install jeweler
  • gem install svn2git

My repo has a standard layout with branches and trunk (no tags), but it’s nested. According to the documentation, converting it with svn2git should’ve been as easy as this:

svn2git http://server/svn/my/nested/repo --notags --authors authors.txt --no-minimize-url --verbose

But after some processing, svn2git just gave up:

error: pathspec 'master' did not match any file(s) known to git.

Browsing the issues on GitHub led me to this one: error: pathspec ‘master’ did not match any file(s) known to git. The common solutions are to delete the .git folder and start the conversion anew, and to explicitly specify --trunk, --branches and --tags (or --notags in my case). Needless to say, none of that worked for me. After some meddling with svn2git options, I concluded that problems with nested repos are common and that I’d better do something about it. Digging further led me to the svndumpfilter command and a way to move the repo contents to the root folder:

If you want your trunk, tags, and branches directories to live in the root of your repository, you might wish to edit your dump files, tweaking the Node-path and Node-copyfrom-path headers so that they no longer have that first calc/ path component. Also, you’ll want to remove the section of dump data that creates the calc directory. It will look something like the following:

Node-path: calc
Node-action: add
Node-kind: dir
Content-length: 0

So, the first step would be to filter my nested repo from the dump:

svndumpfilter include "/nested/project" --drop-empty-revs < repo.svn > repo_filtered.svn
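To make it clearer what an include filter keeps, here is a minimal Python sketch of prefix matching on path-component boundaries. It only illustrates the idea and is not svndumpfilter's actual code; the function names normalize and is_included are made up for this example.

```python
from typing import Iterable


def normalize(path: str) -> str:
    """Return the path with exactly one leading slash and no trailing slash."""
    return "/" + path.strip("/")


def is_included(node_path: str, prefixes: Iterable[str]) -> bool:
    """Return True if node_path equals one of the prefixes or lies underneath it.

    Matching happens on whole path components, so the prefix
    "/nested/project" does not match "/nested/project2".
    """
    path = normalize(node_path)
    for prefix in map(normalize, prefixes):
        if prefix == "/" or path == prefix or path.startswith(prefix + "/"):
            return True
    return False
```

With the prefix "/nested/project", a node such as nested/project/trunk/a.txt is kept, while nested/project2/trunk is dropped.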

If svndumpfilter fails to process your dump (and that happens a lot), you might try the svndumpfilterIN Python script. Beware that on Windows this script produces broken dumps due to CR+LF issues. To fix this, you have to tell Python to open files in binary mode. Replacing these two lines in the script:

with open(input_dump) as input_file:
with open(output_dump, 'a+') as output_file:

with

with open(input_dump, 'rb') as input_file:
with open(output_dump, 'ab+') as output_file:

will take care of this.
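Why binary mode matters: a dump file records exact byte counts in its Content-length headers, so any newline translation silently invalidates them. The following sketch demonstrates the effect on Python 3, where text mode translates line endings on every platform; the payload and the helper names copy_file and roundtrip are invented for the demonstration and have nothing to do with svndumpfilterIN itself.

```python
import os
import tempfile

# A tiny dump-like fragment: 12 bytes of data that contain a CR+LF pair.
PAYLOAD = b"Content-length: 12\n\nline1\r\nline2"


def copy_file(src: str, dst: str, binary: bool) -> None:
    """Copy src to dst, either byte-for-byte or through text mode."""
    if binary:
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            fout.write(fin.read())
    else:
        # Text mode: newlines are translated on read and on write.
        with open(src, "r", encoding="latin-1") as fin, \
                open(dst, "w", encoding="latin-1") as fout:
            fout.write(fin.read())


def roundtrip(binary: bool) -> bytes:
    """Write PAYLOAD to a temp file, copy it, and return the copied bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.svn")
        dst = os.path.join(tmp, "out.svn")
        with open(src, "wb") as f:
            f.write(PAYLOAD)
        copy_file(src, dst, binary)
        with open(dst, "rb") as f:
            return f.read()
```

Here roundtrip(binary=True) returns the payload unchanged, while roundtrip(binary=False) returns different bytes, so the data no longer matches the declared length.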

Update (02.01.2015): the issue above is fixed in the latest version of svndumpfilterIN (see this pull request). But I’ve faced another one: when trying to filter heavily tangled repos, svndumpfilterIN will crash while pulling a large amount of tangled files from the source repo. I was able to conjure a temporary workaround, see my issue on GitHub: Crash when untangling large amount of files. Or just use my fork of svndumpfilterIN, which has this and some other issues fixed and some features added.

Example:

svndumpfilter.py repo.svn --repo=x:\svnpath\repo --output-dump=repo_filtered.svn include "nested/project" --stop-renumber-revs

Next, I have to search for all occurrences of /nested/project and replace them with /. There are a lot of sed one-liners available, but I opted for the SVN::DumpReloc Perl script, which I ran on Windows using Strawberry Perl.

svn-dump-reloc "nested/project" "/" < repo_filtered.svn > repo_filtered_relocated.svn
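For illustration, the header rewrite described in the quoted SVN book passage can be sketched in Python as below. This is a naive, line-based approach in the spirit of the sed one-liners, not the way SVN::DumpReloc is implemented: it assumes header paths have no leading slash, and it would also rewrite file content that happens to start with one of the header names. The name relocate_line is hypothetical.

```python
# Headers whose value is a repository path.
PATH_HEADERS = (b"Node-path: ", b"Node-copyfrom-path: ")


def relocate_line(line: bytes, old_prefix: bytes) -> bytes:
    """Strip old_prefix from the path carried by a path header line.

    Lines that are not path headers, and paths outside old_prefix,
    are returned unchanged.
    """
    for header in PATH_HEADERS:
        if line.startswith(header):
            value = line[len(header):]
            path = value.rstrip(b"\n")
            ending = value[len(path):]  # preserve the original line ending
            if path == old_prefix:
                path = b""  # the prefix directory itself becomes the root
            elif path.startswith(old_prefix + b"/"):
                path = path[len(old_prefix) + 1:]
            return header + path + ending
    return line
```

Note that the record for the prefix directory itself ends up with an empty path, which is exactly the kind of entry that needs special handling before the dump can be loaded.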

But I can’t import this dump into SVN directly: because of the relocation, the first commit will try to create the root directory (an empty Node-path: entry), which is not allowed.

Revision-number: 123456
Prop-content-length: 111
Content-length: 111

K 7
svn:log
V 13
Start project
K 10
svn:author
V 3
John Doe
K 8
svn:date
V 27
2000-01-01T00:00:00.000000Z
PROPS-END

Node-path: 
Node-kind: dir
Node-action: add
Prop-content-length: 10
Content-length: 10

PROPS-END


Node-path: /subfolder
Node-kind: dir
Node-action: add
Prop-content-length: 10
Content-length: 10

PROPS-END

The marked section (the node record with the empty Node-path: entry) should be removed. Make sure to use an editor that can handle big files and won’t change anything else (like line endings). If the revision contains only this one entry, the whole revision should be removed. This can be done either by editing the dump manually or by using svndumpfilter’s --revision parameter to skip this commit altogether. In my case, I had to remove only one section in the revision.

Revision-number: 123456
Prop-content-length: 111
Content-length: 111

K 7
svn:log
V 13
Start project
K 10
svn:author
V 3
John Doe
K 8
svn:date
V 27
2000-01-01T00:00:00.000000Z
PROPS-END

Node-path: /subfolder
Node-kind: dir
Node-action: add
Prop-content-length: 10
Content-length: 10

PROPS-END
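If you’d rather not edit a huge file by hand, the removal can be scripted. Below is a minimal Python sketch that cuts out the first node record with an empty Node-path from a dump held in memory. It rests on several assumptions: the dump fits in RAM, the record looks like the sample above (headers, a blank line, then Content-length bytes of body), and the marker does not occur inside file content. The function name drop_root_node is my own invention.

```python
import re


def drop_root_node(dump: bytes) -> bytes:
    """Remove the first node record whose Node-path is empty."""
    marker = b"\nNode-path: \n"
    pos = dump.find(marker)
    if pos == -1:
        return dump  # nothing to remove

    start = pos + 1  # keep the newline that ends the preceding line
    # Headers run up to and including the first blank line.
    header_end = dump.index(b"\n\n", start) + 2
    headers = dump[start:header_end]

    # The body (properties and/or text) is exactly Content-length bytes long.
    match = re.search(rb"^Content-length: (\d+)$", headers, re.MULTILINE)
    body_len = int(match.group(1)) if match else 0
    end = header_end + body_len

    # Swallow the blank lines that separate this record from the next one.
    while dump[end:end + 1] == b"\n":
        end += 1

    return dump[:start] + dump[end:]
```

Reading and writing the dump with open(..., 'rb') and open(..., 'wb') keeps the line endings intact.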

Then I need to create a new SVN repo and load the filtered and relocated dump into it:

svnadmin create x:\svnpath\newrepo
svnadmin load x:\svnpath\newrepo < repo_filtered_relocated.svn

Finally, let’s see if I can successfully run svn2git against the new repo:

svn2git http://server/svn/newrepo --notags --authors authors.txt --verbose

And this time it works right and proper, so I can push my shiny new Git repo to Visual Studio Online (don’t forget to set up alternate credentials):

git remote add origin https://project.visualstudio.com/DefaultCollection/_git/Project
git push -u origin --all

You can get much farther with a kind word and a PowerShell than you can with a kind word alone.

But that’s not all, folks! This story wouldn’t be complete without a PowerShell lifesaver, and I wouldn’t dream of disappointing you. Some of you may have noticed that svn2git requires an authors file to map SVN committers to Git authors. There are plenty of *nix solutions out there, but I needed a PowerShell one. Since we use VisualSVN Server, the SVN committers’ names are actually Windows domain accounts, so it would also be great to completely automate the authors file creation using the authors’ full names and emails from Active Directory.

First, I need to get the list of SVN committers for my repo. To do this, I’ve wrapped the svn.exe log command in the PowerShell function Get-SvnAuthor. It returns the list of unique commit authors in one or more SVN repositories. I’m listing it here for your convenience, but if you intend to use it, grab the latest version from my GitHub repo instead.

<#
.Synopsis
	Get list of unique commit authors in SVN repository.

.Description
	Get list of unique commit authors in one or more SVN repositories. Requires Subversion binaries.

.Parameter Url
	This parameter is required.
	An array of strings representing URLs to the SVN repositories.

.Parameter User
	This parameter is optional.
	A string specifying username for SVN repository.

.Parameter Password
	This parameter is optional.
	A string specifying password for SVN repository.

.Parameter SvnPath
	This parameter is optional.
	A string specifying path to the svn.exe. Use it if Subversion binaries are not in your path variable, or you wish to use specific version.

.Example
	Get-SvnAuthor -Url 'http://svnserver/svn/project'

	Description
	-----------
	Get list of unique commit authors for SVN repository http://svnserver/svn/project

.Example
	Get-SvnAuthor -Url 'http://svnserver/svn/project' -User john -Password doe

	Description
	-----------
	Get list of unique commit authors for SVN repository http://svnserver/svn/project using username and password.

.Example
	Get-SvnAuthor -Url 'http://svnserver/svn/project' -SvnPath 'C:\Program Files (x86)\VisualSVN Server\bin\svn.exe'

	Description
	-----------
	Get list of unique commit authors for SVN repository http://svnserver/svn/project using custom svn.exe binary.

.Example
	Get-SvnAuthor -Url 'http://svnserver/svn/project_1', 'http://svnserver/svn/project_2'

	Description
	-----------
	Get list of unique commit authors for two SVN repositories: http://svnserver/svn/project_1 and http://svnserver/svn/project_2.

.Example
	'http://svnserver/svn/project_1', 'http://svnserver/svn/project_2' | Get-SvnAuthor

	Description
	-----------
	Get list of unique commit authors for two SVN repositories: http://svnserver/svn/project_1 and http://svnserver/svn/project_2.
#>
function Get-SvnAuthor
{
	[CmdletBinding()]
	Param
	(
		[Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true)]
		[ValidateNotNullOrEmpty()]
		[string[]]$Url,

		[Parameter(ValueFromPipelineByPropertyName = $true)]
		[ValidateNotNullOrEmpty()]
		[string]$User,

		[Parameter(ValueFromPipelineByPropertyName = $true)]
		[ValidateNotNullOrEmpty()]
		[string]$Password,

		[ValidateScript({
			if(Test-Path -LiteralPath $_ -PathType Leaf)
			{
				$true
			}
			else
			{
				throw "$_ not found!"
			}
		})]
		[ValidateNotNullOrEmpty()]
		[string]$SvnPath = 'svn.exe'
	)

	Begin
	{
		if(!(Get-Command -Name $SvnPath -CommandType Application -ErrorAction SilentlyContinue))
		{
			throw "$SvnPath not found!"
		}
		$ret = @()
	}

	Process
	{
		$Url | ForEach-Object {
			$SvnCmd = @('log', $_, '--xml', '--quiet', '--non-interactive') + $(if($User){@('--username', $User)}) + $(if($Password){@('--password', $Password)})
			$SvnLog = &$SvnPath $SvnCmd *>&1

			if($LastExitCode)
			{
				Write-Error ($SvnLog | Out-String)
			}
			else
			{
				$ret += [xml]$SvnLog | ForEach-Object {$_.log.logentry.author}
			}
		}
	}

	End
	{
		$ret | Sort-Object -Unique
	}
}
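The core of the function above is small: parse the XML that svn log --xml --quiet prints and collect the distinct author elements. For readers who don’t speak PowerShell, here is a rough Python sketch of just that step. The sample log and the name unique_authors are made up for illustration, and the sort is plain code-point order rather than whatever Sort-Object would produce.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of `svn log --xml --quiet` output.
SAMPLE_LOG = """<log>
<logentry revision="4"><author>john</author><date>2000-01-04T00:00:00.000000Z</date></logentry>
<logentry revision="3"><author>domain\\jane</author><date>2000-01-03T00:00:00.000000Z</date></logentry>
<logentry revision="2"><author>john</author><date>2000-01-02T00:00:00.000000Z</date></logentry>
<logentry revision="1"><date>2000-01-01T00:00:00.000000Z</date></logentry>
</log>"""


def unique_authors(svn_log_xml: str) -> list:
    """Return the sorted list of distinct authors found in an SVN XML log."""
    root = ET.fromstring(svn_log_xml)
    authors = {entry.findtext("author") for entry in root.iter("logentry")}
    authors.discard(None)  # revisions without an author element
    return sorted(authors)
```

Calling unique_authors(SAMPLE_LOG) yields each committer once, no matter how many revisions they made.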

Second, I need to actually grab the authors’ info from Active Directory and save the resulting file. This is a job for another script of mine, New-GitSvnAuthorsFile. It uses the Get-SvnAuthor function, so place Get-SvnAuthor.ps1 in the same folder.

<#
.Synopsis
	Generate authors file for SVN to Git migration. Can map SVN authors to domain accounts and get full names and emails from Active Directory.

.Description
	Generate authors file for one or more SVN repositories. Can map SVN authors to domain accounts and get full names and emails from Active Directory.
	Requires Subversion binaries and Get-SvnAuthor function:
	https://github.com/beatcracker/Powershell-Misc/blob/master/Get-SvnAuthor.ps1

.Notes
	Author: beatcracker (https://beatcracker.wordpress.com, https://github.com/beatcracker)
	License: Microsoft Public License (http://opensource.org/licenses/MS-PL)

.Component
	Requires Subversion binaries and Get-SvnAuthor function:
	https://github.com/beatcracker/Powershell-Misc/blob/master/Get-SvnAuthor.ps1

.Parameter Url
	This parameter is required.
	An array of strings representing URLs to the SVN repositories.

.Parameter Path
	This parameter is optional.
	A string representing path, where to create authors file.
	If not specified, new authors file will be created in the script directory.

.Parameter ShowOnly
	This parameter is optional.
	If this switch is specified, no file will be created and script will output collection of author names and emails.

.Parameter QueryActiveDirectory
	This parameter is optional.
	A switch indicating whether or not to query Active Directory for author full name and email.
	Supports the following formats for SVN author name: john, domain\john, john@domain

.Parameter User
	This parameter is optional.
	A string specifying username for SVN repository.

.Parameter Password
	This parameter is optional.
	A string specifying password for SVN repository.

.Parameter SvnPath
	This parameter is optional.
	A string specifying path to the svn.exe. Use it if Subversion binaries are not in your path variable, or you wish to use specific version.

.Example
	New-GitSvnAuthorsFile -Url 'http://svnserver/svn/project'

	Description
	-----------
	Create authors file for SVN repository http://svnserver/svn/project.
	New authors file will be created in the script directory.

.Example
	New-GitSvnAuthorsFile -Url 'http://svnserver/svn/project' -QueryActiveDirectory

	Description
	-----------
	Create authors file for SVN repository http://svnserver/svn/project.
	Map SVN authors to domain accounts and get full names and emails from Active Directory.
	New authors file will be created in the script directory.

.Example
	New-GitSvnAuthorsFile -Url 'http://svnserver/svn/project' -ShowOnly

	Description
	-----------
	Create authors list for SVN repository http://svnserver/svn/project.
	Map SVN authors to domain accounts and get full names and emails from Active Directory.
	No authors file will be created, instead script will return collection of objects.

.Example
	New-GitSvnAuthorsFile -Url 'http://svnserver/svn/project' -Path c:\authors.txt

	Description
	-----------
	Create authors file for SVN repository http://svnserver/svn/project.
	New authors file will be created as c:\authors.txt

.Example
	New-GitSvnAuthorsFile -Url 'http://svnserver/svn/project' -User john -Password doe

	Description
	-----------
	Create authors file for SVN repository http://svnserver/svn/project using username and password.
	New authors file will be created in the script directory.

.Example
	New-GitSvnAuthorsFile -Url 'http://svnserver/svn/project' -SvnPath 'C:\Program Files (x86)\VisualSVN Server\bin\svn.exe'

	Description
	-----------
	Create authors file for SVN repository http://svnserver/svn/project using custom svn.exe binary.
	New authors file will be created in the script directory.

.Example
	New-GitSvnAuthorsFile -Url 'http://svnserver/svn/project_1', 'http://svnserver/svn/project_2'

	Description
	-----------
	Create authors file for two SVN repositories: http://svnserver/svn/project_1 and http://svnserver/svn/project_2.
	New authors file will be created in the script directory.

.Example
	'http://svnserver/svn/project_1', 'http://svnserver/svn/project_2' | New-GitSvnAuthorsFile

	Description
	-----------
	Create authors file for two SVN repositories: http://svnserver/svn/project_1 and http://svnserver/svn/project_2.
	New authors file will be created in the script directory.
#>
[CmdletBinding()]
Param
(
	[Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true, ParameterSetName = 'Save')]
	[Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true, ParameterSetName = 'Show')]
	[string[]]$Url,

	[Parameter(ValueFromPipelineByPropertyName = $true, ParameterSetName = 'Save')]
	[ValidateScript({
		$ParentFolder = Split-Path -LiteralPath $_
		if(!(Test-Path -LiteralPath $ParentFolder  -PathType Container))
		{
			throw "Folder doesn't exist: $ParentFolder"
		}
		else
		{
			$true
		}
	})]
	[ValidateNotNullOrEmpty()]
	[string]$Path = (Join-Path -Path (Split-Path -Path $script:MyInvocation.MyCommand.Path) -ChildPath 'authors'),

	[Parameter(ValueFromPipelineByPropertyName = $true, ParameterSetName = 'Show')]
	[switch]$ShowOnly,

	[Parameter(ValueFromPipelineByPropertyName = $true, ParameterSetName = 'Save')]
	[Parameter(ValueFromPipelineByPropertyName = $true, ParameterSetName = 'Show')]
	[switch]$QueryActiveDirectory,

	[Parameter(ValueFromPipelineByPropertyName = $true, ParameterSetName = 'Save')]
	[Parameter(ValueFromPipelineByPropertyName = $true, ParameterSetName = 'Show')]
	[string]$User,

	[Parameter(ValueFromPipelineByPropertyName = $true, ParameterSetName = 'Save')]
	[Parameter(ValueFromPipelineByPropertyName = $true, ParameterSetName = 'Show')]
	[string]$Password,

	[Parameter(ValueFromPipelineByPropertyName = $true, ParameterSetName = 'Save')]
	[Parameter(ValueFromPipelineByPropertyName = $true, ParameterSetName = 'Show')]
	[string]$SvnPath
)

# Dotsource 'Get-SvnAuthor' function:
# https://github.com/beatcracker/Powershell-Misc/blob/master/Get-SvnAuthor.ps1
$ScriptDir = Split-Path $script:MyInvocation.MyCommand.Path
. (Join-Path -Path $ScriptDir -ChildPath 'Get-SvnAuthor.ps1')

# Strip extra parameters or splatting will fail
$Param = @{} + $PSBoundParameters
'ShowOnly', 'QueryActiveDirectory', 'Path' | ForEach-Object {$Param.Remove($_)}

# Get authors in SVN repo
$Names = Get-SvnAuthor @Param
[System.Collections.SortedList]$ret = @{}

# Exit, if no authors found
if(!$Names)
{
	Exit
}

# Find full name and email for every author
foreach($name in $Names)
{
	$Email = ''

	if($QueryActiveDirectory)
	{
		# Get account name from commit author name in any of the following formats:
		# john, domain\john, john@domain
		$Local:tmp = $name -split '(@|\\)'
		switch ($Local:tmp.Count)
		{
			1 { $SamAccountName = $Local:tmp[0] ; break }
			3 {
				if($Local:tmp[1] -eq '\')
				{
					[array]::Reverse($Local:tmp)
				}

				$SamAccountName = $Local:tmp[0]
				break
			}
			default {$SamAccountName = $null}
		}

		# Lookup account details
		if($SamAccountName)
		{
			$UserProps = ([adsisearcher]"(samaccountname=$SamAccountName)").FindOne().Properties

			if($UserProps)
			{
				Try
				{
					$Email = '{0} <{1}>' -f $UserProps.displayname[0], $UserProps.mail[0]
				}
				Catch{}
			}
		}
	}

	$ret += @{$name = $Email}
}

if($ShowOnly)
{
	$ret
}
else
{
	# Use System.IO.StreamWriter to write a file with Unix newlines.
	# It's also significantly faster than the Add\Set-Content cmdlets.
	Try
	{
		#StreamWriter Constructor (String, Boolean, Encoding): http://msdn.microsoft.com/en-us/library/f5f5x7kt.aspx
		$StreamWriter = New-Object -TypeName System.IO.StreamWriter -ArgumentList $Path, $false,  ([System.Text.Encoding]::ASCII)
	}
	Catch
	{
		throw "Can't create file: $Path"
	}
	$StreamWriter.NewLine = "`n"

	foreach($item in $ret.GetEnumerator())
	{
		$Local:tmp = '{0} = {1}' -f $item.Key, $item.Value
		$StreamWriter.WriteLine($Local:tmp)
	}

	$StreamWriter.Flush()
	$StreamWriter.Close()
}
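The only slightly tricky part of the script is turning a commit author name into an account name to look up. The script splits on @ or \ while keeping the separator, and then takes the part on the "user" side. A Python rendition of that logic, as a sketch (the function name sam_account_name is hypothetical):

```python
import re
from typing import Optional


def sam_account_name(author: str) -> Optional[str]:
    """Extract the account name from john, domain\\john or john@domain.

    Returns None for anything with more than one separator.
    """
    # The capture group keeps the separator in the result,
    # e.g. "domain\\john" -> ["domain", "\\", "john"].
    parts = re.split(r"(@|\\)", author)
    if len(parts) == 1:
        return parts[0]
    if len(parts) == 3:
        # domain\user puts the user last, user@domain puts it first.
        return parts[2] if parts[1] == "\\" else parts[0]
    return None
```

All three supported formats resolve to the same account name, and ambiguous names such as a\b@c are skipped instead of guessed at.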

And that’s all I need to create a fully functional authors file for my SVN repository:

.\New-GitSvnAuthorsFile.ps1 -Url 'http://server/svn/newrepo' -Path 'c:\svn2git\authors.txt' -QueryActiveDirectory

Here is the sample authors file, created by the command above:

john@domain = John Doe <john.doe@mycompany.com>
domain\jane = Jane Doe <jane.doe@mycompany.com>
doe = Doe <doe@mycompany.com>
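The file format itself is trivial: one "svn name = Full Name <email>" pair per line, terminated with a Unix newline. If you ever need to produce such a file outside of PowerShell, a minimal Python sketch could look like this. The function names are made up, and entries are sorted by plain string comparison, which may differ from the order the script above produces.

```python
def format_authors(authors: dict) -> str:
    """Render an SVN-name -> Git-identity mapping as authors file text."""
    return "".join(
        f"{svn_name} = {git_identity}\n"
        for svn_name, git_identity in sorted(authors.items())
    )


def write_authors_file(path: str, authors: dict) -> None:
    """Write the authors file as ASCII with LF line endings on any platform."""
    # newline="\n" disables the translation of "\n" to the platform default.
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write(format_authors(authors))
```

Passing newline="\n" plays the same role here as setting StreamWriter.NewLine in the PowerShell script.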

And now that’s all for today, enjoy your winter holidays and stay tuned for more!
