Currently, scan-copyrights (which uses licensecheck under the hood) is used in Apertis to scan copyright/license notices. This tool has some downsides, so we are evaluating scancode-toolkit as a replacement.
A comparison of licensecheck vs scancode is available on the ScanCode website. TL;DR: scancode is more accurate but slower.
scancode-toolkit has an option to export results in the DEP5 format (see GH#472), which is the format currently used by the Apertis license tooling. That means scancode-toolkit is potentially compatible with the rest of the Apertis licensing tooling.
Scancode installation
scancode is not available as a Debian package (GH#1580 and GH#3253) nor as a Docker image (GH#3026), but a Dockerfile is provided by upstream.
That means we can create our own Docker image, or we can reuse the OSS Review Toolkit (ORT) Docker image, which integrates scancode. Since the ORT Docker image used in our pipeline is outdated, it is easier for now to decouple scancode from the ORT Docker image and avoid having to use an outdated scancode (the scancode used in the ORT image is one year old).
Here are the steps to build a docker image:
Scancode output format
Scancode is able to write its output in different formats:
docker run scancode-toolkit --help
...
  output formats:
    --json FILE             Write scan output as compact JSON to FILE.
    --json-pp FILE          Write scan output as pretty-printed JSON to FILE.
    --json-lines FILE       Write scan output as JSON Lines to FILE.
    --yaml FILE             Write scan output as YAML to FILE.
    --csv FILE              [DEPRECATED] Write scan output as CSV to FILE. The
                            --csv option is deprecated and will be replaced by
                            new CSV and tabular output formats in the next
                            ScanCode release. Visit
                            https://github.com/nexB/scancode-toolkit/issues/3043
                            to provide inputs and feedback.
    --html FILE             Write scan output as HTML to FILE.
    --custom-output FILE    Write scan output to FILE formatted with the custom
                            Jinja template file.
    --debian FILE           Write scan output in machine-readable Debian
                            copyright format to FILE.
    --custom-template FILE  Use this Jinja template FILE as a custom template.
    --cyclonedx FILE        Write scan output in CycloneDX JSON format to FILE.
    --cyclonedx-xml FILE    Write scan output in CycloneDX XML format to FILE.
    --spdx-rdf FILE         Write scan output as SPDX RDF to FILE.
    --spdx-tv FILE          Write scan output as SPDX Tag/Value to FILE.
...
These formats include:
- Debian DEP5 format, the one already in use with Apertis licensing tooling.
- YAML, widely used in Apertis and used to teach scan-copyrights about license detection issues (i.e. debian/apertis/copyright.yml).
- SPDX, an open standard for communicating SBOM information.
- CycloneDX, another SBOM standard.
Initially, it should be simpler to continue using the Debian DEP5 format, since the whole Apertis licensing tooling uses it. In the long term, however, we may want to switch to a more widely used format like SPDX or CycloneDX, which are also compatible with ORT and other tools. This should make the Apertis license/SBOM processes more flexible.
Select Apertis packages to evaluate scancode
Let’s use packages that are wrongly detected by scan-copyrights (i.e. packages that use override-license in debian/apertis/copyright.yml).
Here is a small random list of packages based on a local grep of override-license: debianutils, libarchive, libgdata, libunistring, libusb, nss, nss-pem, openjpeg2, openssl, xorg-server.
Run scan-copyrights as gold standard
First, scan-copyrights is run on the package in a v2025dev2 VM:
Run scancode
Now, scancode is run while excluding debian/copyright and debian/apertis/copyright, because they can easily confuse scancode (see GH#2885).
Analysis time
This analysis was performed on an XPS13-9310 laptop with an i7-1185G7 CPU (@3.00GHz×8), 16 GB of RAM and an SSD.
Package | Time scan-copyrights | Time scancode | Diff |
---|---|---|---|
debianutils | 1.3 s | 38 s | ~29 times slower |
libarchive | 6.6 s | 7 m 25 s | ~67 times slower |
libgdata | 4.7 s | 3 m 55 s | ~50 times slower |
libunistring | 8.1 s | 10 m 48 s | ~80 times slower |
libusb | 1.2 s | 54 s | ~45 times slower |
nss | 25.8 s | 28 m 47 s | ~67 times slower |
nss-pem | 28.6 s | 26 m 59 s | ~56 times slower |
openjpeg2 | 3.7 s | 3 m 8 s | ~50 times slower |
openssl | 23.6 s | 18 m 3 s | ~46 times slower |
pipewire | 6.1 s | 4 m 57 s | ~48 times slower |
rust-coreutils | 4.4 s | 1 m 54 s | ~26 times slower |
xorg-server | 9.1 s | 9 m 24 s | ~62 times slower |
firefox-esr* | XX s | OOM killed after ~ 1 d [1] | ~XX times slower |
- (*) firefox-esr is one of the biggest packages in Apertis, but it is not in the target repository. Thus, we wouldn’t have to analyze it with scancode, but it is used here to evaluate scancode in the worst case.
- [1] scancode ran on firefox-esr for ~ 23 hours and 30 mins before being OOM killed. It seems the scan was over and scancode was processing data to generate its output file when it was killed: its parallel scanning processes had stopped, only the main process was running, and the used RAM was at ~ 3 GB (of 16 GB available) about 10 mins before the OOM kill.
- While doing the analysis with 8 parallel processes, all of them were at 100% CPU during the entire analysis time, so at least the CPU is a bottleneck.
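The "Diff" column of the table above can be recomputed from the two timing columns. The following is a minimal sketch of that arithmetic; the helper names (to_seconds, slowdown) are hypothetical and only three rows of the table are copied in for illustration.

```python
import re


def to_seconds(text: str) -> float:
    """Convert a duration such as '7 m 25 s', '18 m' or '1.3 s' to seconds."""
    total = 0.0
    for value, unit in re.findall(r"([\d.]+)\s*([ms])", text):
        total += float(value) * (60 if unit == "m" else 1)
    return total


def slowdown(baseline: str, candidate: str) -> float:
    """Return how many times slower `candidate` is compared to `baseline`."""
    return to_seconds(candidate) / to_seconds(baseline)


# (scan-copyrights time, scancode time), copied from the table above.
timings = {
    "debianutils": ("1.3 s", "38 s"),
    "libarchive": ("6.6 s", "7 m 25 s"),
    "nss": ("25.8 s", "28 m 47 s"),
}

for package, (scan_copyrights, scancode) in timings.items():
    ratio = slowdown(scan_copyrights, scancode)
    print(f"{package}: ~{ratio:.0f} times slower")
```

The same helpers can be reused for the other timing tables in this document, since they all use the same "N m N s" notation.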
Analysis time with --processes from 1 to 8
From the scancode options:
-n, --processes INTEGER
Scan <input> using n parallel processes. [Default: 1]
This option allows several processes to be used for scanning files.
N processes | Time scancode — debianutils |
---|---|
1 | 1 m 34.5 s |
2 | 58.2 s |
3 | 45.4 s |
4 | 39.0 s |
5 | 38.1 s |
6 | 37.2 s |
7 | 36.7 s |
8 | 36.0 s |
Adding more parallel processes improves the scanning time, but it seems we are reaching a threshold at ~ 4 parallel processes, beyond which adding more processes only slightly improves the scanning time. This may be due to the fact that the tested package is small. For a bigger package like firefox-esr, this threshold may be higher and it could be beneficial to have more parallel processes.
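The threshold can be made more visible by converting the timings of the table above into a speedup (time with 1 process divided by time with n processes) and a parallel efficiency (speedup divided by n). This is only a sketch of that arithmetic on the debianutils measurements; the function names are hypothetical.

```python
# Scan time in seconds per number of processes, from the table above
# (1 m 34.5 s is written as 94.5 s).
timings = {1: 94.5, 2: 58.2, 3: 45.4, 4: 39.0, 5: 38.1, 6: 37.2, 7: 36.7, 8: 36.0}


def speedup(timings: dict, n: int) -> float:
    """Speedup of a run with n processes relative to a single process."""
    return timings[1] / timings[n]


def efficiency(timings: dict, n: int) -> float:
    """Fraction of the ideal linear speedup that is actually reached."""
    return speedup(timings, n) / n


for n in sorted(timings):
    print(
        f"{n} processes: speedup {speedup(timings, n):.2f}x, "
        f"efficiency {efficiency(timings, n):.0%}"
    )
```

With these numbers the efficiency drops from about 61% with 4 processes to about 33% with 8, which is consistent with the plateau described above.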
Analysis time with --timeout X
From the scancode options:
--timeout FLOAT
Stop scanning a file if scanning takes longer than a timeout in seconds. [Default: 120]
This option avoids getting stuck on a single file for too long.
Timeout | Time scancode -n 8 — debianutils |
---|---|
120 (default) | 38.5 s |
110 | 43.6 s |
100 | 44.1 s |
90 | 39.9 s |
80 | 40.3 s |
70 | 38.3 s |
60 | 37.8 s |
30 | 37.9 s |
10 | 29.6 s |
Decreasing the per-file timeout can reduce the scanning time, although in the table above the gain is only clear at the lowest value tested (10 s). Since some files are then no longer fully scanned, a more comprehensive comparison of the detected licenses should be done to ensure we are not losing too much data.
Interestingly, the --timeout 10 option on the firefox-esr scan avoids triggering the OOM kill issue, probably because more files are no longer fully scanned, so scancode has to manage a smaller dataset and therefore has a smaller memory footprint. Unfortunately, this option is not enough to make the scanning time acceptable, since it took 35 hours to complete the full scan of firefox-esr.
Analysis time with --max-in-memory 0
From the scancode options:
--max-in-memory INTEGER
Maximum number of files and directories scan details kept in memory during a
scan. Additional files and directories scan details above this number are
cached on-disk rather than in memory. Use 0 to use unlimited memory and
disable on-disk caching. Use -1 to use only on-disk caching. [Default: 10000]
Based on an upstream issue (see GH#1014), the disk cache seems to be really slow.
Package | Time scan-copyrights | Time scancode | Diff |
---|---|---|---|
debianutils | 1.3 s | 34 s | ~26 times slower |
libarchive | 6.6 s | 6 m 41 s | ~60 times slower |
libgdata | 4.7 s | 3 m 35 s | ~46 times slower |
libunistring | 8.1 s | 11 m 12 s | ~83 times slower |
libusb | 1.2 s | 54 s | ~45 times slower |
nss | 25.8 s | 27 m 47 s | ~65 times slower |
nss-pem | 28.6 s | 26 m 49 s | ~56 times slower |
openjpeg2 | 3.7 s | 3 m 14 s | ~52 times slower |
openssl | 23.6 s | 18 m | ~46 times slower |
pipewire | 6.1 s | 5 m 29 s | ~54 times slower |
rust-coreutils | 4.4 s | 2 m 3 s | ~28 times slower |
xorg-server | 9.1 s | 10 m 9 s | ~67 times slower |
firefox-esr* | XXX s | OOM killed after ~ 1 d [1] | ~XX times slower |
Passing --max-in-memory 0 to scancode doesn’t improve the scanning time: these results are of the same order of magnitude (+/- random fluctuation) as the ones obtained without this option.
Analysis time with ONLY --license
i.e. without --copyright --license-text
Package | Time scan-copyrights | Time scancode with copyright | Time scancode without copyright | Improvement % |
---|---|---|---|---|
nss | 25.8 s | 27 m 47 s | 21 m 42 s | 22% |
Skipping the copyright scan and only checking licenses slightly decreases the scanning time, but it remains in the same order of magnitude and does not significantly improve the comparison with scan-copyrights.
Reliability of detected license
Some of the debian/apertis/copyright.yaml files used are no longer required since the Bookworm rebase, so not all of the analyzed packages have a problematic file that can be used to compare scan-copyrights and scancode.
Package | File | Actual license | Detected license (scancode) | Detected license (scan-copyrights) |
---|---|---|---|---|
libarchive | shar.1 | BSD-4-Clause-UC | BSD-4-Clause-UC | BSD-4-Clause-UC [0] |
libgdata | README | LGPL-2.1-or-later | LGPL-2.0-or-later | LGPL |
libunistring | version.c | LGPL-3.0-or-later OR GPL-2.0-or-later | LGPL-3.0-or-later OR GPL-2.0-or-later | LGPL |
libusb | 06_bsd.diff | BSD-2-Clause [1] | BSD-4-Clause | [1.1] |
nss | derdump.1 | MPL-2.0 [2] | MPL-2.0 | MPL-2.0 |
nss-pem | doc/rst/legacy/* | MPL-1.1 OR GPL-2.0-only OR LGPL-2.1-only | MPL-1.1 OR GPL-2.0-only OR LGPL-2.1-only | MPL-2.0 |
openjpeg2 | opj_getopt.c | BSD-3-Clause | (BSD-2-Clause AND LicenseRef-scancode-proprietary-license) AND BSD-3-Clause | BSD-3-clause |
openssl | cmll-x86*.pl | Apache-2.0 OR GPL-2.0-or-later OR LGPL-2.1-or-later OR MPL-1.1 OR BSD-2-Clause | OpenSSL AND (GPL-2.0-or-later OR LGPL-2.1-or-later OR MPL-1.1 OR BSD-3-Clause) | Apache-2.0 and/or GPL-2+ |
xorg-server | hw/xwin/winprefsyacc.* | GPL-3.0-or-later WITH Bison-exception-2.2 AND LicenseRef-scancode-xfree86-1.0 | GPL-3.0-or-later WITH Bison-exception-2.2 AND LicenseRef-scancode-xfree86-1.0 [3] | GPL-3+ with Bison-2.2 exception |
- [0] scan-copyrights has improved in Bookworm.
- [1] Retrospective change: https://www.netbsd.org/about/redistribution.html#why2clause
- [1.1] BSD-2-Clause-NetBSD and/or BSD-2-clause and/or BSD-3-clause and/or FSFUL and/or FSFULLR and/or GPL-2 and/or LGPL-2 and/or X11
- [2] Simplified upstream with bookworm, debian/apertis/copyright.yaml outdated.
- [3] A second license is later in the code
scancode is better at dealing with complex license combinations, especially because it scans the whole file and not only the first lines. Moreover, it reports all detected licenses with a matching score (available in the YAML output).
No license deduction for project and folder
While scan-copyrights is able to perform some deduction of license/copyright for a project and/or folder, scancode only performs scanning at file level.
For instance, scan-copyrights gives the following result for openssl:
Files: *
Copyright: 1998-2023, The OpenSSL Project
           1995-1998, Eric A. Young, Tim J. Hudson
License: Apache-2.0
...
Files: crypto/ec/asm/*
Copyright: 1998-2023, The OpenSSL Project Authors.
License: Apache-2.0 and/or OpenSSL
This result tells us that the project is under the Apache-2.0 license and that files in crypto/ec/asm/* are under the Apache-2.0 and/or OpenSSL licenses.
This behavior makes it possible to assign a license to files by inheriting it from the license of the project (or from the license of the higher-level folder).
Some projects don’t add license/copyright information to all of their files, which is a problem for scancode as it won’t be able to detect the right license for those files.
We would need to add this logic in scancode or in another Apertis script (like ci-license-scan?).
Some related upstream issues:
- Proposal: Scan deduction and summarization
- Primary license detections not shown properly in debian_copyright
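To illustrate what such a deduction step could look like, here is a minimal sketch. It is not how scan-copyrights works internally: the heuristic (a file without a license inherits the most common license found below its nearest ancestor directory) and the function name deduce_licenses are assumptions made for this example, and the input is a plain mapping of file path to detected license rather than a real scancode report.

```python
from collections import Counter
from pathlib import PurePosixPath


def deduce_licenses(detected: dict) -> dict:
    """Return a copy of `detected` where files without a license inherit the
    most common license found below their nearest ancestor directory."""
    # Count the licenses found below every directory, including the root ".".
    per_dir = {}
    for path, license_id in detected.items():
        if license_id is None:
            continue
        for parent in PurePosixPath(path).parents:
            per_dir.setdefault(str(parent), Counter())[license_id] += 1

    result = {}
    for path, license_id in detected.items():
        if license_id is None:
            # Walk up from the closest directory to the project root.
            for parent in PurePosixPath(path).parents:
                counts = per_dir.get(str(parent))
                if counts:
                    license_id = counts.most_common(1)[0][0]
                    break
        result[path] = license_id if license_id is not None else "UNKNOWN"
    return result


# Hypothetical per-file results, loosely modelled on the openssl example.
detected = {
    "LICENSE.txt": "Apache-2.0",
    "apps/a.c": "Apache-2.0",
    "apps/b.c": None,
    "crypto/ec/asm/x.pl": "Apache-2.0 OR OpenSSL",
    "crypto/ec/asm/y.pl": None,
    "doc/readme": None,
}
for path, license_id in deduce_licenses(detected).items():
    print(f"{path}: {license_id}")
```

A majority vote is only one possible policy; a real implementation would probably also need to give priority to top-level license files and to report that a license was deduced rather than detected.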
Statistics about files without a detected license
The YAML file generated by scancode is sometimes malformed due to --license-text (see GH#3219, which does not seem to be enough).
Package | Files number | Files number with license | Detected license % |
---|---|---|---|
debianutils | 134 | 44 | 32.8 % |
libarchive | 1420 | 716 | 51.9 % |
libgdata | 705 | 326 | 46.2 % |
libunistring | 2118 | 1961 | 92.5 % |
libusb | 96 | 30 | 31.2 % |
nss | 4531 | 2393 | 52.8 % |
nss-pem | 4574 | 2423 | 52.9 % |
openjpeg2 | 478 | 334 | 69.8 % |
openssl | 4655 | 3349 | 72 % |
pipewire | 1251 | 952 | 76 % |
rust-coreutils | 1311 | 326 | 24.8 % |
xorg-server | 1791 | 1227 | 68.5 % |
firefox-esr* | XXX | XXX | XXX |
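The percentages in the table above are the number of files with a detected license divided by the total number of files. The sketch below shows one way to compute these three values from a scancode result that has already been loaded into Python (for instance from the JSON or YAML output). The key names "files", "type" and "detected_license_expression" are assumptions about the report layout and may differ between scancode versions, which is why the license key is a parameter; the function name is hypothetical.

```python
def license_coverage(scan: dict, license_key: str = "detected_license_expression"):
    """Return (number of files, number of files with a license, percentage)."""
    # Directories are listed in the report too; only regular files are counted.
    files = [entry for entry in scan.get("files", []) if entry.get("type") == "file"]
    with_license = [entry for entry in files if entry.get(license_key)]
    percent = 100.0 * len(with_license) / len(files) if files else 0.0
    return len(files), len(with_license), percent


# Tiny hand-written report used only to exercise the function.
scan = {
    "files": [
        {"path": "src", "type": "directory"},
        {"path": "src/a.c", "type": "file", "detected_license_expression": "mit"},
        {"path": "src/b.c", "type": "file", "detected_license_expression": None},
        {"path": "README", "type": "file"},
    ]
}
total, licensed, percent = license_coverage(scan)
print(f"{total} files, {licensed} with license, {percent:.1f} %")
```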
DEP5 invalid format
scancode generates malformed Files stanzas. As defined in the debian/copyright specification, each Files stanza is composed of mandatory fields (i.e. Files, Copyright and License) and one optional field (i.e. Comment). For instance:
Files: Xext/sleepuntil.h
Copyright: 1993-2003, The XFree86 Project, Inc.
License: Expat
When scancode is not able to determine a copyright or a license for a file, it creates a stanza with only the Files field, whereas scan-copyrights fills the missing field with UNKNOWN.
Here is an example of a malformed stanza generated by scancode:
Files: CODE_OF_CONDUCT.md
Here is another example from scan-copyrights where a missing field is filled:
Files: xkb/Makefile.in
Copyright: 1994-2021, Free Software Foundation, Inc.
License: UNKNOWN
This issue should be easy to fix in scancode, by filling missing fields with an UNKNOWN value.
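Until this is fixed upstream, the same result could be obtained by post-processing the generated file. The following is a minimal sketch of such a post-processing step, not scancode code: it assumes stanzas are separated by single empty lines, only touches stanzas that contain a Files field, and the function name fill_missing_fields is hypothetical.

```python
MANDATORY_FIELDS = ("Files", "Copyright", "License")


def fill_missing_fields(dep5_text: str, placeholder: str = "UNKNOWN") -> str:
    """Add placeholder values for mandatory fields missing from Files stanzas."""
    fixed = []
    for stanza in dep5_text.strip().split("\n\n"):
        lines = stanza.splitlines()
        # Field names are the text before ':' on lines that are not
        # continuation lines (continuation lines start with whitespace).
        present = {
            line.split(":", 1)[0]
            for line in lines
            if line and not line[0].isspace() and ":" in line
        }
        if "Files" in present:
            for field in MANDATORY_FIELDS:
                if field not in present:
                    lines.append(f"{field}: {placeholder}")
        fixed.append("\n".join(lines))
    return "\n\n".join(fixed) + "\n"


report = """\
Files: CODE_OF_CONDUCT.md

Files: Xext/sleepuntil.h
Copyright: 1993-2003, The XFree86 Project, Inc.
License: Expat
"""
print(fill_missing_fields(report))
```

The header stanza and standalone License stanzas are left untouched because they have no Files field.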
scancode output format
Support for the DEP5 format is incomplete in scancode, but the biggest issues can probably be fixed easily.
The YAML format gives far more information, such as the lines of the detected licenses, the matched pattern, a matching score, several identifiers of the detected licenses, etc. Having all of this information may be useful for future enhancements of the Apertis license tooling.
Summary
The different approaches to file analysis explain why scancode is slower, but able to detect far more licenses than scan-copyrights (see upstream comparison):
- licensecheck is “a Perl script using hand-crafted regex patterns to find typical copyright statements and about 50 common licenses”;
- scancode’s detection “is based on a (large) number of license full texts (~2100) and license notices, mentions and variants (~32,000) and is data-driven as opposed to regex-driven. It detects and reports exactly where license text is found in a file. Just throw in more license texts to improve the detection.”
Required resources for analysis
- scancode is ~50 times slower than scan-copyrights.
- scancode requires much more RAM than scan-copyrights.
License scan accuracy
- scan-copyrights has improved between Bullseye and Bookworm.
- scancode has better detection for complex cases.
Output format
- DEP5: incomplete support, but already used by the Apertis license tooling.
- YAML: seems a sensible alternative since it provides much more information, but would require adapting the Apertis license tooling to this new format.
Outdated debian/apertis/copyright.yaml
This file needs to be refreshed in Apertis packages since the Bookworm rebase: scan-copyrights has become smarter and some packages have fixed their licensing issues.
Proposed plan
Some general guidelines:
- We need to come up with a progressive approach; this is not something that will happen overnight
- Most of the packages are small and should take a reasonable amount of time to scan
- We should be able to selectively disable scancode when necessary
- We can add additional logic to only scan the files that have changed since last scan
Proposed plan to use scancode instead of scan-copyrights:
- Update the docker image used to generate ORT reports in order to reuse it for scancode. Apertis uses a handcrafted docker image to generate ORT reports; since this image already contains scancode, it’s possible to reuse it to run scancode. The first step is to switch to an up-to-date image provided by ORT. This step will require adjusting some scripts used by Apertis, including the template used to generate ORT reports.
- Fix the DEP5 format created by scancode by adding missing mandatory fields. scancode generates reports in the DEP5 format; unfortunately, mandatory fields are missing when the copyright/license is not detected (see GH#3714). Instead, scancode should fill missing fields with UNKNOWN or no-info-found, as done by scan-copyrights.
- Add support for “license deduction for project and folder” to scancode. scancode is unable to deduce a license for a whole project/folder based on the license of other files. Without this feature, ~ 50% of files will have a missing license, which is a regression compared to scan-copyrights. This task consists of adding logic to scancode to deduce a license for a folder and/or project.
- Add a new job running scancode to the ci-package-builder pipeline, in parallel to the current scan-licenses job using scan-copyrights.
- Generate a new scancode report for all packages in target using the job added in the previous step.
- In the SBOM logic, add a preference to use the scancode report if available, otherwise use the one from scan-copyrights.
Some other tasks can be done in parallel:
- Investigate how to use caching to avoid scanning files already scanned in a previous run.
- Investigate how to improve performance of scancode (speed and RAM usage).
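As a starting point for the caching investigation, one simple approach is to store a content hash per file after each scan and only rescan files whose hash has changed. The sketch below illustrates this idea with the Python standard library only; the cache layout (a JSON file mapping relative paths to SHA-256 digests) and all function names are assumptions for this example, and it does not call scancode itself.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path) -> str:
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def files_to_scan(root, cache_file):
    """Return (changed, digests): the files under `root` that are new or whose
    content differs from the cached digest, and the up-to-date digest map."""
    cache_path = Path(cache_file)
    old = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    digests = {}
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        digests[rel] = file_digest(path)
        if old.get(rel) != digests[rel]:
            changed.append(rel)
    return changed, digests


def save_cache(cache_file, digests: dict) -> None:
    """Persist the digest map once the scan of the changed files succeeded."""
    Path(cache_file).write_text(json.dumps(digests, indent=2, sort_keys=True))
```

The list returned by files_to_scan could then be passed to the scanner, and the previous results for unchanged files merged back into the new report. Hashing still requires reading every file, but that is much cheaper than the license detection itself.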