refactor: update for build, config, CLI and tests: (#423)
* chore: update for build, config, CLI and tests:
Build System
- tsconfig.json: Removed dead decorator options; updated moduleResolution/module to node16; disabled
unused importHelpers
- package.json: Fixed exports map with proper types entries for both ESM and CJS TypeScript consumers;
removed unused tslib dependency; added test:coverage script
- rollup.config.js: Enabled tree-shaking for CLI bundle; removed dead code; documented build order
dependency
- .prettierrc.cjs: Inlined config to remove missing prettier-config-standard dependency
CLI (src/cli/)
- 7 bug fixes: Promise constructor anti-pattern, callback fs.writeFile/fs.readdir replaced with
fs.promises, addResultCount type mismatch, dead warningCount removed, TOCTOU race condition in
validateParams removed, broken --singleton flag fixed (now shares parser instance from PDFCLI level), -si
short flag parsing bug fixed
- New --json flag: Structured JSON summary to stdout (version, output paths, stats, errors, elapsed time)
for programmatic consumption and Claude Code Skill integration
- New --quiet flag: Suppresses all non-error output including timer and status messages
- Granular exit codes: 0=success, 1=parse failure, 2=argument error, 3=I/O error (previously only 0 or 1)
- Improved directory filter: Only skips dotfiles now (previously silently skipped files starting with -,
_, spaces, etc.)
- Enhanced help output: Accurate flag descriptions, usage examples, and exit code documentation
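The directory-filter change and the move from callback `fs.readdir` to `fs.promises` can be illustrated together. This is a minimal sketch, not the pdf2json source: the helper names `shouldProcess` and `listPdfFiles` are hypothetical, and the case-insensitive `.pdf` match is an assumption.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper, not the actual pdf2json implementation.
// Keeps every ".pdf" entry except hidden files (names starting with ".").
// Assumption: the extension check is case-insensitive.
function shouldProcess(fileName) {
  if (fileName.startsWith(".")) return false; // dotfile: skip
  return path.extname(fileName).toLowerCase() === ".pdf";
}

// Hypothetical helper: lists PDF paths in a directory with fs.promises,
// so no callback API has to be wrapped in `new Promise(...)`.
async function listPdfFiles(dir) {
  const names = await fs.promises.readdir(dir);
  return names.filter(shouldProcess).map((name) => path.join(dir, name));
}

// Names starting with "-", "_" or a space are now kept; only dotfiles are dropped.
const entries = [".hidden.pdf", "-draft.pdf", "_backup.pdf", " spaced.pdf", "notes.txt", "form.PDF"];
const selected = entries.filter(shouldProcess);
console.log(selected);
```

A caller would use `await listPdfFiles(inputDir)` inside an `async` function; a rejected promise from `readdir` then carries the underlying error instead of being swallowed by a hand-written wrapper.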
Tests
- 3 new test suites (22 new tests): CLI integration (_test_cli.cjs), Stream API (_test_stream.cjs), Error
paths (_test_errors.cjs) -- all previously had zero coverage
- Fixed existing tests: Listener leak in multi-parse test (once() + removeAllListeners()); standardized on
Jest expect() over Node assert(); restored VLines/HLines/Fills/Texts count assertions that were commented
out
- Renamed misleading _test_getRawTextContent.cjs to _test_sortBidiTexts.cjs
- Regenerated 37 baseline files to reflect current parser output (baselines were stale since v0.6.8)
- Test count: 52 tests / 4 suites -> 74 tests / 7 suites
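The listener-leak fix listed above follows a common Node.js pattern. The sketch below uses a plain `EventEmitter` as a stand-in for a reused parser; the helper `parseOnce` is hypothetical, and the event names are assumed for illustration rather than copied from the test suite.

```javascript
const { EventEmitter } = require("events");

// Stand-in for a parser instance that is reused across several parse calls.
const parser = new EventEmitter();

// Hypothetical helper: settles on the first result and leaves no listeners behind.
function parseOnce(emitter, trigger) {
  return new Promise((resolve, reject) => {
    // once() removes the handler that fires; removeAllListeners() clears
    // the handler for the outcome that did not happen.
    emitter.once("pdfParser_dataReady", (data) => {
      emitter.removeAllListeners("pdfParser_dataError");
      resolve(data);
    });
    emitter.once("pdfParser_dataError", (err) => {
      emitter.removeAllListeners("pdfParser_dataReady");
      reject(err);
    });
    trigger();
  });
}

// Three parses on the same emitter; each emits its result synchronously.
for (let i = 0; i < 3; i++) {
  parseOnce(parser, () => parser.emit("pdfParser_dataReady", { run: i }));
}

console.log(
  parser.listenerCount("pdfParser_dataReady"),
  parser.listenerCount("pdfParser_dataError")
); // prints: 0 0
```

With `on()` and no cleanup, the same loop would leave three handlers registered per event, which is the kind of accumulation that triggers Node's `MaxListenersExceededWarning` in longer test runs.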
CI & Polish
- CI workflow: Upgraded to actions v4; added tsc --noEmit type-check step; added coverage step
- ESLint: Removed deprecated no-extra-semi and no-var-requires rules
- CLAUDE.md: Added CLI flags table, exit codes, and test suite listing
* fix: correct gitignore and CI build and test
* fix: fix the CI build
* fix: try to fix github CI build issue - 2
* fix: bump CI Node.js from 20.18.0 to 22.x to fix npm ci crash
npm ci was crashing with "Exit handler never called!" because the
package-lock.json was generated with npm 11.x (Node 22) but CI ran
npm 10.8.2 (Node 20.18.0). This left node_modules empty, causing
all subsequent build steps to fail. Node 22 LTS also satisfies the
^20.19.0 engine requirements of eslint devDependencies.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use npm install instead of npm ci to avoid lockfile version mismatch
npm ci crashes with "Exit handler never called!" because the
package-lock.json was generated with npm 11.x but CI runs npm 10.9.7
(bundled with Node 22). npm ci is strict about lockfile compatibility.
npm install resolves dependencies fresh, matching the previous CI
behavior (rm -rf node_modules && npm i).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: try to fix github workflow CI build error -3
* fix: github CI: move type check after build
* fix: improve CLI error handling, validation guards, and test reliability
- Fix mkdir error handling in validateParams: replace try/finally with
try/catch so mkdir failures set the error string and return exit code 3
(EXIT_IO_ERROR) instead of propagating as an unhandled rejection
- Reorder validation guards in initialize(): check Array.isArray before
typeof string so users passing -f multiple times get the specific
"can only be specified once" error instead of a generic message
- Use structured .code property instead of string-matching on
error.message for I/O error classification, preventing misclassification
if a PDF engine error message happens to contain errno strings
- Restore diagnostic console.error in _test_.cjs catch path so CI logs
show which PDF caused a failure
- Fix stream test to use os.tmpdir() instead of test/target/ to prevent
ENOENT on clean checkout
- Re-enable baseline diff validation in p2j.one.sh now that all 37
baselines have been regenerated
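The switch from string-matching on `error.message` to the structured `.code` property can be sketched as follows. Only `EXIT_IO_ERROR` and the numeric exit codes come from the commit message; the other constant names, the set of errno codes, and the `exitCodeFor` helper are assumptions for illustration.

```javascript
// Exit codes as listed in the commit message. Only EXIT_IO_ERROR is a name
// from the source; the other constant names are assumptions.
const EXIT_SUCCESS = 0;
const EXIT_PARSE_FAILURE = 1;
const EXIT_ARGUMENT_ERROR = 2;
const EXIT_IO_ERROR = 3;

// Assumed set of errno codes treated as I/O errors.
const IO_ERROR_CODES = new Set(["ENOENT", "EACCES", "EPERM", "EISDIR", "ENOTDIR"]);

// Hypothetical classifier: inspects the structured `code` property only,
// never the message text.
function exitCodeFor(error) {
  if (error && typeof error.code === "string" && IO_ERROR_CODES.has(error.code)) {
    return EXIT_IO_ERROR;
  }
  return EXIT_PARSE_FAILURE;
}

// A real I/O failure carries an errno code on the error object.
const ioError = Object.assign(new Error("open failed"), { code: "ENOENT" });
// A parser error whose message merely mentions an errno string.
const parseError = new Error("Invalid PDF: stream mentions ENOENT in its text");

console.log(exitCodeFor(ioError), exitCodeFor(parseError)); // prints: 3 1
```

A message-based check such as `error.message.includes("ENOENT")` would classify both errors as I/O failures, which is the misclassification the commit describes.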
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: update readme.md and out of date dependency
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
readme.md: 55 additions & 5 deletions
@@ -43,7 +43,7 @@ After install, run command line:

 > npm test

-`pretest` step builds bundles and source maps for both ES Module and CommonJS, output to `./dist` directory. The Jest test suit is defined in `./test/_test_.cjs` with commonJS, test run will also cover `parse-r` and `parse-fd` with ES Modules via command line.
+`pretest` step builds bundles and source maps for both ES Module and CommonJS, output to `./dist` directory. The Jest test suites (7 suites, 74+ tests) are defined in `./test/_test_*.cjs` with CommonJS, test run will also cover `parse-r` and `parse-fd` with ES Modules via command line.

 The default Jest test suits are essential tests for all PRs. But it only covers a portion of all testing PDFs, for more broader coverage, run:
@@ -920,15 +920,15 @@ To use the command line utility to transcode a folder or a file:
 node pdf2json.js -f [input directory or pdf file]
 ```

-When -f is a PDF file, it'll be converted to json file with the same name and saved in the same directory. If -f is a directory, it'll scan all ".pdf" files within the specified directory to transcode them one by one.
+When -f is a PDF file, it'll be converted to json file with the same name and saved in the same directory. If -f is a directory, it'll scan all ".pdf" files within the specified directory to transcode them one by one (dotfiles are skipped).

 Optionally, you can specify the output directory: -o:

 ```javascript
 node pdf2json.js -f [input directory or pdf file] -o [output directory]
 ```

-The output directory must exist, otherwise, it'll exit with an error.
+The output directory will be created automatically if it does not exist.

 Additionally, you can also use -v or --version to show version number or to display more help info with -h.
@@ -952,7 +952,57 @@ or
 pdf2json -f [input directory or pdf file] -o [output directory]
 ```

-v0.5.4 added "-s" or "--silent" command line argument to suppress informative logging output. When using pdf2json as a command line tool, the default verbosity is 5 (INFOS). While when running as a web service, default verbosity is 9 (ERRORS).
+### CLI Flags
+
+| Flag | Long | Description |
+|------|------|-------------|
+| `-f` | `--file` | (required) Path to a PDF file or a directory of PDF files to parse |
+| `-o` | `--output` | Output directory (created automatically if it doesn't exist; defaults to input directory) |
+| `-s` | `--silent` | Suppress informational output; only errors are printed |
+| `-t` | `--fieldTypes` | Generate a `.fields.json` file with form field ids and types |
+| `-c` | `--content` | Generate a `.content.txt` file with extracted text content |
+| `-m` | `--merge` | Generate a `.merged.json` file with auto-merged broken text blocks |
+| `-r` | `--stream` | Use stream-based parsing instead of loading the entire file into memory |
+| `-si` | `--singleton` | Reuse a single PDFParser instance across all files in a directory (reduces memory for batch processing) |
+| `-j` | `--json` | Output a structured JSON summary to stdout (version, file paths, stats, errors). Implies `-s` |
+| `-q` | `--quiet` | Suppress all non-error output including timer and status messages. Stricter than `-s` |
+| `-v` | `--version` | Print the version number and exit |
+| `-h` | `--help` | Print help message and exit |
+
+### Exit Codes
+
+| Code | Meaning |
+|------|---------|
+| `0` | All files parsed successfully |
+| `1` | One or more files failed to parse |
+| `2` | Invalid arguments or usage error (e.g. missing `-f`) |
+| `3` | I/O error (file not found, permission denied) |
+
+### `--json` Output Schema
+
+When using the `--json` flag, pdf2json outputs a structured JSON summary to stdout:
+Note: The PDF engine may print warnings to stdout (e.g. `Warning: Setting up fake worker.`). Pipe through `grep '^{'` to isolate the JSON line.
+
+### Verbosity
+
+When using pdf2json as a command line tool, the default verbosity is 5 (INFOS). While when running as a web service, default verbosity is 9 (ERRORS).
 Examples to suppress logging info from command line:

 ```javascript
@@ -973,7 +1023,7 @@ var pdfParser = new PFParser();
 pdfParser.loadPDF(pdfFilePath, 5);
 ```

-v0.5.7 added the capability to skip input PDF files if filename begins with any one of "!@#$%^&\*()+=[]\\\';,/{}|\":<>?~`.-\_ ", usually these files are created by PDF authoring tools as backup files.
+When scanning a directory, pdf2json only skips hidden files (dotfiles). All other `.pdf` files are processed.

 v0.6.2 added "-t" command line argument to generate fields json file in addition to parsed json. The fields json file will contain one Array which contains fieldInfo object for each field, and each fieldInfo object will have 4 fields: