Difference between revisions of "Not Forking"

From Dan Shearer CV
 
[https://lumosql.org/src/not-forking Not-forking] avoids duplicating the source code of one project within another project, where the projects are external to each other.
   
 
Not-forking '''avoids project-level forking''' by largely automating change management in ways that [https://en.wikipedia.org/wiki/Distributed_version_control version control systems] such as [https://fossil-scm.org Fossil], [https://git-scm.org Git], or [https://github.com GitHub] cannot. The [https://lumosql.org/src/not-forking/doc/trunk/doc/not-forking.md full documentation] goes into much more detail than this overview.
 
 
Some questions immediately arise:
 
   
* Should you import ''Upstream'' into your source code management system? All source code should be under version management, but having a checkout of an external repository within your local repository feels wrong... and do we want to lose upstream project history?
* If ''Upstream'' makes modifications, how can you pull those modifications into ''Combined Project'' safely?
* If ''Combined Project'' has changed files in ''Upstream'', how can you then merge the changes and any new changes made in ''Upstream''?
   
The developer now has good reasons to separate ''Upstream'' project code from its repository and maintain it within the ''Combined Project'' tree, because in the short term it is just simpler. But that brings the very big problem of the Reluctant Project Fork. A Reluctant Project Fork, or "vendoring" as the [https://debian.org Debian Project] calls it, is where ''Combined Project's'' version of ''Upstream'' starts to drift from the original ''Upstream''. Nobody wants to maintain code that is currently being maintained by its original authors, but it can become complicated to avoid that. Not-Forking makes this a much easier problem to solve.

Not-forking also addresses more complicated scenarios, such as when two unrelated projects are upstream of ''Combined Project'':
   
 
[[File:Not-Forking-Scenario-2.png]]
 
 
* build with upstream1.c version 2, and upstream3.c version 3, both of which are ported to upstream 0’s main.c version 5
* track changes in all upstreams, which may use arbitrary release mechanisms (Git, tarball, Fossil, other)
* cache all versions of all upstreams, so that a build system can step through a large matrix of versions of code quickly, perhaps for test/benchmark purposes.
 
= Disambiguation of “Fork” =
 
 
The term “fork” has several meanings. Not-Forking is addressing only one meaning: when source code maintained ''by other people elsewhere'' is modified ''by you locally''. This creates the problem of how to maintain your modifications without also maintaining the entire original codebase.
 
 
Not-Forking is not intended for permanent whole-project forks. These tend to be large and rare events, such as when [https://libreoffice.org LibreOffice] split off from [https://openoffice.org OpenOffice.org], or [https://mariadb.org MariaDB] from [https://mysql.org MySQL]. These were expected, planned and managed project forks.
 
 
Not-Forking is not intended for extreme vendoring either, as in the case decided by Debian in January 2021, where the upstream is [https://lwn.net/ml/debian-ctte/handler.971515.D971515.16111708995535.ackdone@bugs.debian.org/ giant and well-funded] and guarantees it will maintain all of its own upstreams.
 
 
Not-forking is strictly about unintentional/reluctant whole-project forks, or ordinary-scale vendoring.
 
 
Here are some other meanings of the word “fork” that have nothing to do with Not-Forking:
 
 
* In Fossil, a “fork” can be a point where a linear branch of development splits into two linear branches which have the same name. [https://fossil-scm.org/home/doc/trunk/www/branching.wiki Fossil has a discussion on forking/branching].
 
* In Git, a “fork” is just another clone of the repository.
 
* GitHub uses the same definition as Git. As well as providing tools to identify and re-import changes made in the new clone, GitHub promotes forking repositories. As a result it is common for a project on GitHub to have dozens of forks/clones, and for a popular project there can be hundreds.
 
   
 
[[Category:Software Development]]
 

Revision as of 04:06, 16 November 2021

Not-forking avoids duplicating the source code of one project within another project, where the projects are external to each other.

Not-forking avoids project-level forking by largely automating change management in ways that version control systems such as Fossil, Git, or GitHub cannot. The full documentation goes into much more detail than this overview.

Not-forking was a prerequisite for LumoSQL but, unlike LumoSQL, is fully production-ready. I designed and tested Not-forking, and Claudio Calvelli did most of the coding, as can be seen in the commit logs. Whatever becomes of this particular implementation, the design of Not-forking is provoking thought. Maintainers in several software distributions are looking at it, because after decades of amalgamating software in fragile ways, this appears to be a way of respecting both upstream and downstream.

The following diagram shows the simplest case of the problem Not-Forking solves. An external piece of software, here called Upstream, forms a part of a new project called Combined Project. Upstream is not a library provided on your system, because then you could simply link to libupstream. Instead, Upstream is source code that you copy into the Combined Project directory tree like this:

Not-Forking-Scenario-1.png

Some questions immediately arise:

  • Should you import Upstream into your source code management system? All source code should be under version management, but having a checkout of an external repository within your local repository feels wrong... and do we want to lose upstream project history?
  • If Upstream makes modifications, how can you pull those modifications into Combined Project safely?
  • If Combined Project has changed files in Upstream, how can you then merge the changes and any new changes made in Upstream?

The developer now has good reasons to separate Upstream project code from its repository and maintain it within the Combined Project tree, because in the short term it is just simpler. But that brings the very big problem of the Reluctant Project Fork. A Reluctant Project Fork, or "vendoring" as the Debian Project calls it, is where Combined Project's version of Upstream starts to drift from the original Upstream. Nobody wants to maintain code that is currently being maintained by its original authors, but it can become complicated to avoid that. Not-Forking makes this a much easier problem to solve.

Not-forking also addresses more complicated scenarios, such as when two unrelated projects are upstream of Combined Project:

Not-Forking-Scenario-2.png

In more detail, the problem of project forking includes these cases:

  • Tracking multiple upstreams, each with a different release schedule and version control system. Manual merging is difficult, but failing to merge or only occasionally merging will often result in a hard fork. LumoSQL tracks three upstreams that differ in all these ways
  • Tracking an upstream to which you wish to make changes that are not mergeable. Without Not-Forking, a manual merge is the only option even if there is only one upstream and even if the patch set is not complicated. An obvious case of this is replacing, deleting or creating whole files
  • Vendoring, where a package copies a library or module into its own tree, avoiding the versioning problems that arise when using system-provided libraries. This then becomes a standalone fork until the next copy is done, which often involves a manual porting task. Not-Forking can stop this problem arising at all
  • Vendoring with version control. For example, some of the 132 forks of LibVNC on GitHub are for maintained, shipping products which are up to hundreds of commits behind the original. They seem to be manually synced with the original every year or two, and Not-Forking could potentially remove most of this manual work

The following diagram indicates how even more complex scenarios are managed with Not-Forking. Any of the version control systems could be swapped with any other, and production use of Not-Forking today handles up to 50 versions of three upstreams with ease.

Not-Forking-Scenario-3.png
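
As a rough illustration of what stepping through many combinations of upstream versions involves, the sketch below enumerates every combination from a small table. The upstream names and version strings are invented for the example, and this is not a description of how Not-Forking itself stores or selects versions.

```python
from itertools import product

# Hypothetical upstreams and the versions available for each one.
upstream_versions = {
    "upstream0": ["5"],
    "upstream1": ["1", "2"],
    "upstream3": ["2", "3"],
}


def build_matrix(versions):
    """Return one dict per combination, mapping upstream name to version."""
    names = sorted(versions)
    combos = product(*(versions[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


matrix = build_matrix(upstream_versions)
for combination in matrix:
    print(combination)
```

The number of combinations is the product of the per-upstream version counts, which is why caching every upstream version matters once the lists grow.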

Why Not Just Use Git/Fossil/Other VCS?

Git rebase cannot solve the Not-Forking problem space. Neither can Git submodules, Fossil’s merge, or the quilt approach to combining patches.

A VCS cannot address the Not-Forking class of problems because the decisions required are typically made by humans doing a port or reimplementation where multiple upstreams need to be combined. A patch stream can’t describe what needs to be done, so automating this requires a tangle of fragile one-off code. Not-Forking makes it possible to write a build system without these code tangles.
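
One way to picture the difference between a patch stream and a description of a port is to treat each modification as data with its own conditions. The sketch below is only an illustration under that assumption: the `Rule` class, its field names, and the file contents are hypothetical and are not Not-Forking's actual configuration format or code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One hypothetical sed-style modification to a file from an upstream."""
    target: str                        # file the rule edits
    pattern: str                       # regular expression to find
    replacement: str                   # text to substitute
    requires: frozenset = frozenset()  # files that must also be selected
    excludes: frozenset = frozenset()  # files that must not be selected


def applicable(rule, selected):
    """A rule applies when every required file is selected and no excluded one is."""
    return rule.requires <= selected and not (rule.excludes & selected)


def apply_rules(rules, files, selected):
    """Return a new mapping of file name to text with the applicable rules applied."""
    result = dict(files)
    for rule in rules:
        if rule.target in result and applicable(rule, selected):
            result[rule.target] = re.sub(
                rule.pattern, rule.replacement, result[rule.target]
            )
    return result


files = {"main.c": "int backend = DEFAULT_BACKEND;\n"}
rules = [
    Rule(
        "main.c", r"DEFAULT_BACKEND", "UPSTREAM1_BACKEND",
        requires=frozenset({"upstream1.c"}),
        excludes=frozenset({"upstream2.c"}),
    ),
]

with_upstream1 = apply_rules(rules, files, frozenset({"upstream1.c"}))
with_both = apply_rules(rules, files, frozenset({"upstream1.c", "upstream2.c"}))
```

Because the substitution matches a pattern rather than line context, it can still apply after the surrounding lines of main.c have changed, which is where a conventional patch would typically fail.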

Examples of the sorts of actions Not-Forking can take:

  • check for new versions of all upstreams by comparing the human-readable release numbers/letters rather than repo checkins or tags, even though human-readable version numbers vary widely in their construction
  • replace foo.c with bar.c in all cases (perhaps because we want to replace a library that has an identical API with a safer implementation)
  • apply this patch to main.c of Upstream 0, but only in the case where we are also pulling in upstream1.c, but not if we are also using upstream2.c
  • apply these non-patch changes to Upstream 0 main.c in the style of sed rather than patch, making it possible to merge trees that a VCS says are unmergeable
  • build with upstream1.c version 2, and upstream3.c version 3, both of which are ported to upstream 0’s main.c version 5
  • track changes in all upstreams, which may use arbitrary release mechanisms (Git, tarball, Fossil, other)
  • cache all versions of all upstreams, so that a build system can step through a large matrix of versions of code quickly, perhaps for test/benchmark purposes.
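
The first action in the list, comparing human-readable version numbers, can be sketched as follows. This is a minimal example under simple assumptions, not Not-Forking's own comparison logic: it splits a version string into runs of digits and runs of letters, compares digits numerically, and treats a trailing letter as newer than no letter. Schemes where a suffix such as "rc" means a pre-release would need extra handling.

```python
import re


def version_key(version):
    """Turn a version string such as '1.0.2k' into a sortable tuple.

    Digit runs become (0, number, '') and letter runs become (1, 0, letters),
    so numbers and letters are never compared with each other directly.
    """
    parts = re.findall(r"\d+|[A-Za-z]+", version)
    return tuple(
        (0, int(part), "") if part.isdigit() else (1, 0, part.lower())
        for part in parts
    )


def newest(versions):
    """Return the newest version string according to version_key."""
    return max(versions, key=version_key)


print(newest(["3.9.2", "3.35.0", "3.35"]))
print(newest(["1.0.2", "1.0.2k", "1.0.2b"]))
```

A plain string comparison would rank "3.9.2" above "3.35.0" because the character "9" sorts after "3", which is why the numeric parts are converted to integers before comparing.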