
OCFL Community Extension 0011: Direct Clean Path Layout

Overview

This extension is intended to be used to map logical paths to safe content paths. This is done by replacing or removing “dangerous characters” from names.

This functionality is based on David A. Wheeler’s essay “Fixing Unix/Linux/POSIX Filenames”.

Usage Scenario

This extension is intended to be used when you want to ensure file and directory names do not include problematic characters.

An example could be complete file systems (disks) that come from estates or research data and are not under the control of archivists.

Caveat

Projection

This extension provides injective (one-to-one) projections only if encodeUTF == true. If you do not want to use the verbose UTF code replacements (encodeUTF == false), first check your source to make sure that different names cannot be mapped to the same target. Such collisions can arise from whitespace replacement, from different characters being replaced by the same replacementString, and from the removal of a leading -, ~ or blank or of a trailing blank. Software that generates the OCFL structure should raise an error in this case.
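The collision check described above could be sketched as follows. This is a minimal illustration in Go and not part of the extension: `trimPart` is a hypothetical, simplified stand-in for the encodeUTF == false cleaning that only strips a leading -, ~ or blank and a trailing blank, and `findCollisions` is an assumed helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// trimPart is a simplified, hypothetical stand-in for the encodeUTF == false
// cleaning of one path segment: it only strips leading '-', '~' and ' ' and
// trailing ' '.
func trimPart(part string) string {
	part = strings.TrimLeft(part, "-~ ")
	return strings.TrimRight(part, " ")
}

// findCollisions maps every name through clean and returns only those targets
// that more than one source name was mapped to.
func findCollisions(names []string, clean func(string) string) map[string][]string {
	targets := map[string][]string{}
	for _, n := range names {
		t := clean(n)
		targets[t] = append(targets[t], n)
	}
	for t, sources := range targets {
		if len(sources) < 2 {
			delete(targets, t) // deleting during range is allowed in Go
		}
	}
	return targets
}

func main() {
	names := []string{"report", "~report", " report ", "-report", "notes"}
	for target, sources := range findCollisions(names, trimPart) {
		fmt.Printf("collision on %q: %q\n", target, sources)
	}
}
```

A generator could run such a check over all logical paths of an object before writing anything and abort with an error if the returned map is not empty.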

Example names that would map to the same target

Parameters

Summary

Definition of terms

Procedure

The following is an outline of the steps for mapping a filepath.

The UTF Replacement Character List in the appendix describes the Unicode control and whitespace characters referenced in the steps below.

When encodeUTF is true

  1. Replace all non-UTF8 characters with replacementString
  2. Split the string at path separator /
  3. For each part do the following
    1. Replace = with =u003D if it is followed by u and four hex digits
    2. Replace any character from this list with its UTF code in the form =uXXXX where XXXX is the code: U+0000-U+001F U+007F U+0020 U+0085 U+00A0 U+1680 U+2000-U+200F U+2028 U+2029 U+202F U+205F U+3000 \n \t * ? : [ ] " < > | ( ) { } & ' ! ; # @
    3. If part only contains periods, replace the first period (.) with its UTF code (=u002E)
    4. Remove part completely, if its length is 0
    5. When length of part is larger than maxPathSegmentLen use fallback function and return
  4. Join the parts with path separator /
  5. When length of result is larger than maxPathnameLen use fallback function and return
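As an illustration of steps 3.1 to 3.3 for encodeUTF == true, the following Go sketch encodes a single path segment. It is a simplified example under stated assumptions: `encodePart` is a hypothetical name, the character class covers only the ASCII part of the list in step 3.2, and the length checks and the fallback are omitted. The leading ~ encoding that appears in the example mappings is not part of the numbered steps and is left out here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// escapeEquals matches "=" only when it is followed by "u" and four hex digits (step 3.1).
var escapeEquals = regexp.MustCompile(`=(u[0-9a-fA-F]{4})`)

// dangerous is an abbreviated version of the character list in step 3.2
// (control characters, space and the listed ASCII punctuation only).
var dangerous = regexp.MustCompile(`[\x00-\x1f\x7f *?:\[\]"<>|(){}&'!;#@]`)

// encodePart applies steps 3.1 to 3.3 to one path segment.
func encodePart(part string) string {
	// 3.1: protect literal "=uXXXX" sequences so the mapping stays reversible
	part = escapeEquals.ReplaceAllString(part, "=u003D$1")
	// 3.2: replace every listed character with =uXXXX
	part = dangerous.ReplaceAllStringFunc(part, func(m string) string {
		return fmt.Sprintf("=u%04X", []rune(m)[0])
	})
	// 3.3: a segment made only of periods gets its first period encoded
	if part != "" && strings.Trim(part, ".") == "" {
		part = "=u002E" + part[1:]
	}
	return part
}

func main() {
	for _, p := range []string{"object=u123a-01", "object=u13a-01", "info:fedora", "..."} {
		fmt.Printf("%q -> %q\n", p, encodePart(p))
	}
}
```

Escaping `=` before the other replacements matters: without it, a name that already contains a sequence such as `=u003A` could not be told apart from an encoded `:`.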

When encodeUTF is false

  1. Replace all non-UTF8 characters with replacementString
  2. Split the string at path separator /
  3. For each part do the following
    1. Replace any whitespace character from this list with whitespaceReplacementString: U+0009 U+000A-U+000D U+0020 U+0085 U+00A0 U+1680 U+2000-U+200F U+2028 U+2029 U+202F U+205F U+3000
    2. Replace any character from this list with replacementString: U+0000-U+001F U+007f * ? : [ ] " < > | ( ) { } & ' ! ; # @
    3. Remove leading spaces, - and ~; remove trailing spaces
    4. If part only contains periods, replace first period (.) with replacementString
    5. Remove part completely, if its length is 0
    6. When length of part is larger than maxPathSegmentLen use fallback function and return
  4. Join the parts with path separator /
  5. When length of result is larger than maxPathnameLen use fallback function and return
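Steps 3.1 to 3.4 for encodeUTF == false can be sketched for a single path segment as shown below. This is an illustrative example rather than the reference implementation: `cleanPart` and its parameter names are assumptions, and the length checks and the fallback are left out.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// whitespace characters from step 3.1
	whitespace = regexp.MustCompile(`[\t\n-\r \x{0085}\x{00A0}\x{1680}\x{2000}-\x{200F}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]`)
	// control characters and punctuation from step 3.2
	dangerous = regexp.MustCompile(`[\x00-\x1f\x7f*?:\[\]"<>|(){}&'!;#@]`)
)

// cleanPart applies steps 3.1 to 3.4 to one path segment. An empty result
// means the segment is dropped (step 3.5).
func cleanPart(part, replacement, wsReplacement string) string {
	// literal replacement, so a "$" in the configured strings is not expanded
	part = whitespace.ReplaceAllLiteralString(part, wsReplacement) // 3.1
	part = dangerous.ReplaceAllLiteralString(part, replacement)    // 3.2
	part = strings.TrimLeft(part, " -~")                           // 3.3, leading
	part = strings.TrimRight(part, " ")                            // 3.3, trailing
	if part != "" && strings.Trim(part, ".") == "" {               // 3.4
		part = replacement + part[1:]
	}
	return part
}

func main() {
	for _, p := range []string{"..hor_rib:lé-$id", "~ info:fedora", "-obj#ec@t-\"01 ", "...", " ~"} {
		fmt.Printf("%q -> %q\n", p, cleanPart(p, "_", " "))
	}
}
```

Because different inputs such as `a:b` and `a#b` both become `a_b`, this variant is lossy, which is why the caveat on injectivity applies to it.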

Fallback function

  1. Create digest with fallbackDigestAlgorithm from initial identifier/filepath as lower case hex string
  2. When the digest is longer than maxPathSegmentLen, add the folder separator / after every maxPathSegmentLen characters (character or byte based) until no part is longer than maxPathSegmentLen
  3. Take the first numberOfFallbackTuples * fallbackTupleSize characters from digest
  4. Split the characters in numberOfFallbackTuples parts of fallbackTupleSize characters
  5. Prepend parts as prefix folders with separator /
  6. Prepend fallbackFolder with separator /
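The fallback steps can be illustrated with a short, self-contained Go sketch. It assumes MD5 as fallbackDigestAlgorithm; the function name `fallbackPath` and its parameter names are hypothetical, and the final check against maxPathnameLen is omitted.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
)

// fallbackPath sketches the fallback steps for an MD5 digest.
func fallbackPath(id, folder string, maxSegLen, tuples, tupleSize int) string {
	// Step 1: lower-case hex digest of the initial identifier
	sum := md5.Sum([]byte(id))
	digest := hex.EncodeToString(sum[:])

	// Step 2: cut the digest into parts of at most maxSegLen characters
	if maxSegLen <= 0 {
		maxSegLen = len(digest) // guard for the sketch: treat as "no limit"
	}
	var parts []string
	for start := 0; start < len(digest); start += maxSegLen {
		end := start + maxSegLen
		if end > len(digest) {
			end = len(digest)
		}
		parts = append(parts, digest[start:end])
	}

	// Steps 3 to 5: the first tuples*tupleSize characters become prefix folders
	var prefix []string
	for i := 0; i < tuples && (i+1)*tupleSize <= len(digest); i++ {
		prefix = append(prefix, digest[i*tupleSize:(i+1)*tupleSize])
	}

	// Step 6: prepend the fallback folder
	all := append([]string{folder}, prefix...)
	all = append(all, parts...)
	return strings.Join(all, "/")
}

func main() {
	// MD5 of the empty string is the well-known d41d8cd98f00b204e9800998ecf8427e
	fmt.Println(fallbackPath("", "fallback", 127, 2, 1))
	// prints fallback/d/4/d41d8cd98f00b204e9800998ecf8427e
	fmt.Println(fallbackPath("a/very/long/identifier", "fallback", 16, 2, 1))
}
```

With two tuples of size one, the result has the same shape as the fallback rows in the mapping examples: the fallback folder, two single-character folders, then the digest itself.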

Examples

Parameters

It is not necessary to specify any parameters to use the default configuration. However, if you were to do so, it would look like the following:

{
    "extensionName": "0011-direct-clean-path-layout",
    "maxPathSegmentLen": 127,
    "maxPathnameLen": 32000,
    "encodeUTF": false,
    "replacementString": "_",
    "whitespaceReplacementString": " ",
    "fallbackDigestAlgorithm": "md5",
    "fallbackFolder": "fallback",
    "numberOfFallbackTuples": 0
}

Mappings

#1 encodeUTF == false

{
    "extensionName": "0011-direct-clean-path-layout",
    "maxPathSegmentLen": 127,
    "maxPathnameLen": 32000,
    "encodeUTF": false,
    "replacementString": "_",
    "whitespaceReplacementString": " ",
    "fallbackDigestAlgorithm": "md5",
    "fallbackFolder": "fallback",
    "numberOfFallbackTuples": 2
}
ID or Path Result
..hor_rib:lé-$id ..hor_rib_lé-$id
info:fedora/object-01 info_fedora/object-01
~ info:fedora/-obj#ec@t-"01 info_fedora/obj_ec_t-_01
/test/ ~/.../blah test/_../blah
https://hdl.handle.net/XXXXX/test/bl ah https_/hdl.handle.net/XXXXX/test/bl ah
abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij fallback/0/e/0eafabb38fa7f1583d1461afe980ebdc

#2 encodeUTF == true

{
    "extensionName": "0011-direct-clean-path-layout",
    "maxPathSegmentLen": 127,
    "maxPathnameLen": 32000,
    "encodeUTF": true,
    "replacementString": "_",
    "whitespaceReplacementString": " ",
    "fallbackDigestAlgorithm": "sha512",
    "fallbackFolder": "fallback",
    "numberOfFallbackTuples": 2
}
ID or Path Result
..hor_rib:lé-$id ..hor_rib=u003Alé-$id
object=u123a-01 object=u003Du123a-01
object=u13a-01 object=u13a-01
info:fedora/object-01 info=u003Afedora/object-01
~ info:fedora/-obj#ec@t-"01 =u007E=u0020info=u003Afedora/-obj=u0023ec=u0040t-=u002201=u0020
/test/ ~/.../blah test/=u0020~/=u002E../blah
https://hdl.handle.net/XXXXX/test/bl ah https=u003A/hdl.handle.net/XXXXX/test/bl=u0020ah
abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij abcdefghijabcdefghij fallback/b/8/b8acda4abac53237afa03d6bbb078e1bf46b40438bb256df79b8d9ff0e57b32a688156ad21755363ea19953c160c4dd6d4db175b71e9aa87d68937181a9f69d/9

Implementation

Go

package cleanpath

import (
	"encoding/hex"
	"fmt"
	"hash"
	"path/filepath"
	"regexp"
	"strings"
	"sync"

	"emperror.dev/errors"
	[...]
)


var directCleanRuleAll = regexp.MustCompile("[\u0000-\u001f\u007f\u0020\u0085\u00a0\u1680\u2000-\u200f\u2028\u2029\u202f\u205f\u3000\n\t*?:\\[\\]\"<>|(){}&'!\\;#@]")
var directCleanRuleWhitespace = regexp.MustCompile("[\u0009\u000a-\u000d\u0020\u0085\u00a0\u1680\u2000-\u200f\u2028\u2029\u202f\u205f\u3000]")
var directCleanRuleEqual = regexp.MustCompile("=(u[0-9a-fA-F]{4})")
var directCleanRule_1_5 = regexp.MustCompile("[\u0000-\u001F\u007F\n\r\t*?:\\[\\]\"<>|(){}&'!\\;#@]")
var directCleanRule_2_4_6 = regexp.MustCompile("^[\\-~\u0009\u000a-\u000d\u0020\u0085\u00a0\u1680\u2000-\u200f\u2028\u2029\u202f\u205f\u3000]*(.*?)[\u0009\u000a-\u000d\u0020\u0085\u00a0\u1680\u2000-\u200f\u2028\u2029\u202f\u205f\u3000]*$")
var directCleanRulePeriods = regexp.MustCompile("^\\.+$")

var directCleanErrFilenameTooLong = errors.New("filename too long")
var directCleanErrPathnameTooLong = errors.New("pathname too long")

type DirectCleanConfig struct {
    [...]
	MaxPathnameLen              int                      `json:"maxPathnameLen"`
	MaxPathSegmentLen           int                      `json:"maxPathSegmentLen"`
	ReplacementString           string                   `json:"replacementString"`
	WhitespaceReplacementString string                   `json:"whitespaceReplacementString"`
	UTFEncode                   bool                     `json:"encodeUTF"`
	FallbackDigestAlgorithm     checksum.DigestAlgorithm `json:"fallbackDigestAlgorithm"`
	FallbackFolder              string                   `json:"fallbackFolder"`
	NumberOfFallbackTuples      int                      `json:"numberOfFallbackTuples"`
	FallbackTupleSize           int                      `json:"fallbackTupleSize"`
	hash                        hash.Hash                `json:"-"`
	hashMutex                   sync.Mutex               `json:"-"`
}

[...]


func encodeUTFCode(s string) string {
	return "=u" + strings.Trim(fmt.Sprintf("%U", []rune(s)), "U+[]")
}

func (sl *DirectClean) fallback(fname string) (string, error) {
	// internal mutex for reusing the hash object; it is held until the
	// digest has been read, so concurrent calls cannot mix their input
	sl.hashMutex.Lock()
	defer sl.hashMutex.Unlock()

	// reset hash function
	sl.hash.Reset()
	// add path
	if _, err := sl.hash.Write([]byte(fname)); err != nil {
		return "", errors.Wrapf(err, "cannot hash path '%s'", fname)
	}

	// get digest and encode it as lower case hex string
	digestString := hex.EncodeToString(sl.hash.Sum(nil))

	// check whether digest fits in filename length
	parts := len(digestString) / sl.MaxPathSegmentLen
	rest := len(digestString) % sl.MaxPathSegmentLen
	if rest > 0 {
		parts++
	}
	// cut the digest if it's too long for filename length
	result := ""
	for i := 0; i < parts; i++ {
		result = filepath.Join(result, digestString[i*sl.MaxPathSegmentLen:min((i+1)*sl.MaxPathSegmentLen, len(digestString))])
	}

	// add all necessary subfolders: prepend the last tuple first, so that
	// the first tuple of the digest ends up as the outermost folder
	for i := sl.NumberOfFallbackTuples - 1; i >= 0; i-- {
		start := i * sl.FallbackTupleSize
		result = filepath.Join(digestString[start:start+sl.FallbackTupleSize], result)
	}
	result = strings.TrimLeft(filepath.ToSlash(filepath.Clean(filepath.Join(sl.FallbackFolder, result))), "/")
	if len(result) > sl.MaxPathnameLen {
		return result, errors.Errorf("result has length of %d which is more than max allowed length of %d", len(result), sl.MaxPathnameLen)
	}
	return result, nil
}

func (sl *DirectClean) BuildPath(fname string) (string, error) {

	// 1. Replace all non-UTF8 characters with replacementString
	fname = strings.ToValidUTF8(fname, sl.ReplacementString)

	// 2. Split the string at path separator /
	names := strings.Split(fname, "/")
	result := []string{}

	for _, n := range names {
		if len(n) == 0 {
			continue
		}
		if sl.UTFEncode {
			// Replace `=` with `=u003D` if it is followed by `u` and four hex digits
			n = directCleanRuleEqual.ReplaceAllString(n, "=u003D$1")
			n = directCleanRuleAll.ReplaceAllStringFunc(n, encodeUTFCode)
			if n[0] == '~' || directCleanRulePeriods.MatchString(n) {
				n = encodeUTFCode(string(n[0])) + n[1:]
			}
		} else {
			// Replace any whitespace character from this list with whitespaceReplacementString: U+0009 U+000A-U+000D U+0020 U+0085 U+00A0 U+1680 U+2000-U+200F U+2028 U+2029 U+202F U+205F U+3000
			n = directCleanRuleWhitespace.ReplaceAllString(n, sl.WhitespaceReplacementString)
			n = directCleanRule_1_5.ReplaceAllString(n, sl.ReplacementString)
			n = directCleanRule_2_4_6.ReplaceAllString(n, "$1")
			if directCleanRulePeriods.MatchString(n) {
				n = sl.ReplacementString + n[1:]
			}
		}

		lenN := len(n)
		if lenN > sl.MaxPathSegmentLen {
			return sl.fallback(fname)
			//return "", errors.Wrapf(directCleanErrFilenameTooLong, "filename: %s", n)
		}

		if lenN > 0 {
			result = append(result, n)
		}
	}

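	// 4. Join the parts with path separator /
	cleaned := strings.Join(result, "/")

	if len(cleaned) > sl.MaxPathnameLen {
		// the fallback digest is built from the input path, not from the cleaned one
		return sl.fallback(fname)
		//return "", errors.Wrapf(directCleanErrPathnameTooLong, "pathname: %s", fname)
	}

	return cleaned, nil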
}

[...]

Appendix

UTF Replacement Character List

Code    Decimal  Octal  Description                     Abbreviation / Key
U+0000  0        0      Null character                  NUL
U+0001  1        1      Start of Heading                SOH / Ctrl-A
U+0002  2        2      Start of Text                   STX / Ctrl-B
U+0003  3        3      End-of-text character           ETX / Ctrl-C
U+0004  4        4      End-of-transmission character   EOT / Ctrl-D
U+0005  5        5      Enquiry character               ENQ / Ctrl-E
U+0006  6        6      Acknowledge character           ACK / Ctrl-F
U+0007  7        7      Bell character                  BEL / Ctrl-G
U+0008  8        10     Backspace                       BS / Ctrl-H
U+0009  9        11     Horizontal tab                  HT / Ctrl-I
U+000A  10       12     Line feed                       LF / Ctrl-J
U+000B  11       13     Vertical tab                    VT / Ctrl-K
U+000C  12       14     Form feed                       FF / Ctrl-L
U+000D  13       15     Carriage return                 CR / Ctrl-M
U+000E  14       16     Shift Out                       SO / Ctrl-N
U+000F  15       17     Shift In                        SI / Ctrl-O
U+0010  16       20     Data Link Escape                DLE / Ctrl-P
U+0011  17       21     Device Control 1                DC1 / Ctrl-Q
U+0012  18       22     Device Control 2                DC2 / Ctrl-R
U+0013  19       23     Device Control 3                DC3 / Ctrl-S
U+0014  20       24     Device Control 4                DC4 / Ctrl-T
U+0015  21       25     Negative-acknowledge character  NAK / Ctrl-U
U+0016  22       26     Synchronous Idle                SYN / Ctrl-V
U+0017  23       27     End of Transmission Block       ETB / Ctrl-W
U+0018  24       30     Cancel character                CAN / Ctrl-X
U+0019  25       31     End of Medium                   EM / Ctrl-Y
U+001A  26       32     Substitute character            SUB / Ctrl-Z
U+001B  27       33     Escape character                ESC
U+001C  28       34     File Separator                  FS
U+001D  29       35     Group Separator                 GS
U+001E  30       36     Record Separator                RS
U+001F  31       37     Unit Separator                  US
U+0020  32       40     Space
U+007F  127      177    Delete                          DEL
U+0085  133      205    Next Line                       NEL
U+00A0  160      240    Non-breaking space
U+1680  5760     13200  OGHAM SPACE MARK
U+2000  8192     20000  EN QUAD
U+2001  8193     20001  EM QUAD
U+2002  8194     20002  EN SPACE
U+2003  8195     20003  EM SPACE
U+2004  8196     20004  THREE-PER-EM SPACE
U+2005  8197     20005  FOUR-PER-EM SPACE
U+2006  8198     20006  SIX-PER-EM SPACE
U+2007  8199     20007  FIGURE SPACE
U+2008  8200     20010  PUNCTUATION SPACE
U+2009  8201     20011  THIN SPACE
U+200A  8202     20012  HAIR SPACE
U+200B  8203     20013  ZERO WIDTH SPACE
U+200C  8204     20014  ZERO WIDTH NON-JOINER
U+200D  8205     20015  ZERO WIDTH JOINER
U+200E  8206     20016  LEFT-TO-RIGHT MARK
U+200F  8207     20017  RIGHT-TO-LEFT MARK
U+2028  8232     20050  LINE SEPARATOR
U+2029  8233     20051  PARAGRAPH SEPARATOR
U+202F  8239     20057  NARROW NO-BREAK SPACE
U+205F  8287     20137  MEDIUM MATHEMATICAL SPACE
U+3000  12288    30000  IDEOGRAPHIC SPACE