Commit eed63fb

modesty and claude authored
refactor: update for build, config, CLI and tests: (#423)
* chore: update for build, config, CLI and tests:

  Build System
  - tsconfig.json: Removed dead decorator options; updated moduleResolution/module to node16; disabled unused importHelpers
  - package.json: Fixed exports map with proper types entries for both ESM and CJS TypeScript consumers; removed unused tslib dependency; added test:coverage script
  - rollup.config.js: Enabled tree-shaking for CLI bundle; removed dead code; documented build order dependency
  - .prettierrc.cjs: Inlined config to remove missing prettier-config-standard dependency

  CLI (src/cli/)
  - 7 bug fixes: Promise constructor anti-pattern, callback fs.writeFile/fs.readdir replaced with fs.promises, addResultCount type mismatch, dead warningCount removed, TOCTOU race condition in validateParams removed, broken --singleton flag fixed (now shares parser instance from PDFCLI level), -si short flag parsing bug fixed
  - New --json flag: Structured JSON summary to stdout (version, output paths, stats, errors, elapsed time) for programmatic consumption and Claude Code Skill integration
  - New --quiet flag: Suppresses all non-error output including timer and status messages
  - Granular exit codes: 0=success, 1=parse failure, 2=argument error, 3=I/O error (previously only 0 or 1)
  - Improved directory filter: Only skips dotfiles now (previously silently skipped files starting with -, _, spaces, etc.)
  - Enhanced help output: Accurate flag descriptions, usage examples, and exit code documentation

  Tests
  - 3 new test suites (22 new tests): CLI integration (_test_cli.cjs), Stream API (_test_stream.cjs), Error paths (_test_errors.cjs) -- all previously had zero coverage
  - Fixed existing tests: Listener leak in multi-parse test (once() + removeAllListeners()); standardized on Jest expect() over Node assert(); restored VLines/HLines/Fills/Texts count assertions that were commented out
  - Renamed misleading _test_getRawTextContent.cjs to _test_sortBidiTexts.cjs
  - Regenerated 37 baseline files to reflect current parser output (baselines were stale since v0.6.8)
  - Test count: 52 tests / 4 suites -> 74 tests / 7 suites

  CI & Polish
  - CI workflow: Upgraded to actions v4; added tsc --noEmit type-check step; added coverage step
  - ESLint: Removed deprecated no-extra-semi and no-var-requires rules
  - CLAUDE.md: Added CLI flags table, exit codes, and test suite listing

* fix: correct gitignore and CI build and test

* fix: fix the CI build

* fix: try to fix github CI build issue - 2

* fix: bump CI Node.js from 20.18.0 to 22.x to fix npm ci crash

  npm ci was crashing with "Exit handler never called!" because the package-lock.json was generated with npm 11.x (Node 22) but CI ran npm 10.8.2 (Node 20.18.0). This left node_modules empty, causing all subsequent build steps to fail. Node 22 LTS also satisfies the ^20.19.0 engine requirements of eslint devDependencies.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use npm install instead of npm ci to avoid lockfile version mismatch

  npm ci crashes with "Exit handler never called!" because the package-lock.json was generated with npm 11.x but CI runs npm 10.9.7 (bundled with Node 22). npm ci is strict about lockfile compatibility. npm install resolves dependencies fresh, matching the previous CI behavior (rm -rf node_modules && npm i).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: try to fix github workflow CI build error -3

* fix: github CI: move type check after build

* fix: improve CLI error handling, validation guards, and test reliability

  - Fix mkdir error handling in validateParams: replace try/finally with try/catch so mkdir failures set the error string and return exit code 3 (EXIT_IO_ERROR) instead of propagating as an unhandled rejection
  - Reorder validation guards in initialize(): check Array.isArray before typeof string so users passing -f multiple times get the specific "can only be specified once" error instead of a generic message
  - Use structured .code property instead of string-matching on error.message for I/O error classification, preventing misclassification if a PDF engine error message happens to contain errno strings
  - Restore diagnostic console.error in _test_.cjs catch path so CI logs show which PDF caused a failure
  - Fix stream test to use os.tmpdir() instead of test/target/ to prevent ENOENT on clean checkout
  - Re-enable baseline diff validation in p2j.one.sh now that all 37 baselines have been regenerated

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update readme.md and out of date dependency

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b0067d7 commit eed63fb
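
The commit message describes classifying I/O errors by the structured `.code` property instead of matching substrings of `error.message`. The sketch below illustrates that idea only. The function name `exitCodeFor`, the `EXIT` object, and the particular set of errno codes are assumptions made for illustration and are not taken from the project's source; only the numeric exit codes come from the commit message.

```javascript
// Exit codes as listed in the commit message. The property names are hypothetical.
const EXIT = Object.freeze({
  SUCCESS: 0,
  PARSE_FAILURE: 1,
  ARGUMENT_ERROR: 2,
  IO_ERROR: 3,
});

// Assumed set of Node.js errno codes to treat as I/O failures.
const IO_ERROR_CODES = new Set(["ENOENT", "EACCES", "EPERM", "EISDIR", "ENOTDIR"]);

// Map an error (or no error) to an exit code.
function exitCodeFor(error) {
  if (!error) return EXIT.SUCCESS;
  // Inspect the structured code only; the message text is never consulted,
  // so a parser message that happens to contain "ENOENT" is not misclassified.
  if (typeof error.code === "string" && IO_ERROR_CODES.has(error.code)) {
    return EXIT.IO_ERROR;
  }
  return EXIT.PARSE_FAILURE;
}
```

With this shape, an error whose message merely mentions an errno string but carries no `.code` still falls through to the parse-failure exit code.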

56 files changed

Lines changed: 1876 additions & 1637 deletions

.github/workflows/ci.yml

Lines changed: 11 additions & 11 deletions
@@ -2,7 +2,7 @@ name: Node.js CI/CD

 on:
   pull_request:
-    branches: [ "*" ]
+    branches: ["*"]
   workflow_dispatch: # Allow manual triggering

 jobs:
@@ -11,31 +11,31 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout repository
-        uses: actions/checkout@v3
+        uses: actions/checkout@v4

       - name: Setup Deno 2.4.3
         uses: denoland/setup-deno@v1
         with:
           deno-version: 2.4.3
-
+
       - name: Setup Bun 1.2.19
         uses: oven-sh/setup-bun@v1
         with:
           bun-version: 1.2.19

       - name: Setup Node.js 20.18.0
-        uses: actions/setup-node@v3
+        uses: actions/setup-node@v4
         with:
           node-version: 20.18.0
-
-      - name: Clean install dependencies
-        run: rm -rf node_modules package-lock.json && npm i

-      # - name: Install dependencies
-      # run: npm ci
+      - name: Clean install dependencies
+        run: rm -rf node_modules && npm i

       - name: Build project
         run: npm run build

-      - name: Run tests
-        run: npm run test
+      - name: Type check
+        run: npx --no-install tsc --noEmit
+
+      - name: Run tests (with coverage)
+        run: npm run test

.gitignore

Lines changed: 0 additions & 1 deletion
@@ -1,6 +1,5 @@
 ## generic files to ignore
 *~
-*.lock
 *.DS_Store
 *.swp
 *.out

.prettierrc.cjs

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,5 @@
 module.exports = {
-  ...require('prettier-config-standard'),
+  semi: false,
+  singleQuote: true,
   trailingComma: 'es5',
 }

eslint.config.js

Lines changed: 0 additions & 2 deletions
@@ -207,8 +207,6 @@ export default [
           destructuredArrayIgnorePattern: "^_?"
         }
       ],
-      "@typescript-eslint/no-extra-semi": "off",
-      "@typescript-eslint/no-var-requires": "off",
       "arrow-body-style": ["error", "as-needed"],
       "dot-notation": ["error"],
       "eqeqeq": ["error", "always"],

jest.config.json

Lines changed: 5 additions & 2 deletions
@@ -4,9 +4,12 @@
   "bail": false,
   "testFailureExitCode": 1,
   "transform": {},
-  "testTimeout": 6000,
+  "testTimeout": 15000,
   "moduleFileExtensions": ["js", "cjs", "mjs", "json"],
   "verbose": true,
   "forceExit": true,
-  "maxWorkers": "50%"
+  "maxWorkers": "50%",
+  "coverageDirectory": "coverage",
+  "coverageReporters": ["text", "lcov"],
+  "coveragePathIgnorePatterns": ["/node_modules/", "/test/", "/base/", "/lib/pdfjs-code.js"]
 }

package-lock.json

Lines changed: 940 additions & 1039 deletions
Some generated files are not rendered by default.

package.json

Lines changed: 12 additions & 6 deletions
@@ -31,9 +31,10 @@
   },
   "main": "./dist/pdfparser.cjs",
   "module": "./dist/pdfparser.js",
-  "typings": "./dist/pdfparser.d.ts",
+  "types": "./dist/pdfparser.d.ts",
   "scripts": {
     "test:jest": "jest --config ./jest.config.json --detectOpenHandles",
+    "test:coverage": "jest --config ./jest.config.json --coverage --detectOpenHandles",
     "test": "npm run test:jest && npm run parse-r && npm run parse-fd && npm run test:deno && npm run test:bun",
     "test:forms": "cd ./test && sh p2j.forms.sh",
     "test:misc": "cd ./test && sh p2j.one.sh misc . \"Expected: 16 success, 6 fail exception with stack trace\" ",
@@ -50,7 +51,7 @@
     "parse-e": "./bin/pdf2json.js -f ./test/pdf/misc/i418_precompilato_fake.pdf -o ./test/target/misc -c",
     "build:rollup": "npx rollup -c ./rollup.config.js",
     "build:bundle-pdfjs-base": "node rollup/bundle-pdfjs-base.js",
-    "build": "npm run build:bundle-pdfjs-base && npm run build:rollup",
+    "build": "npm run build:bundle-pdfjs-base && npm run build:rollup && cp dist/pdfparser.d.ts dist/pdfparser.d.cts",
     "build:clean": "rm -rf node_modules && rm -f package-lock.json && npm i && npm run build",
     "test:deno": "deno --allow-read --allow-write --allow-net --allow-env --no-check ./bin/pdf2json.js -f ./test/pdf/fd/form/ -o ./test/target/fd/form -t -c -m -r",
     "test:bun": "bun ./bin/pdf2json.js -f ./test/pdf/fd/form/ -o ./test/target/fd/form -t -c -m -r"
@@ -76,7 +77,7 @@
     "@rollup/plugin-eslint": "^9.1.0",
     "@rollup/plugin-json": "^6.1.0",
     "@rollup/plugin-node-resolve": "^16.0.2",
-    "@rollup/plugin-terser": "^0.4.4",
+    "@rollup/plugin-terser": "^1.0.0",
     "@rollup/plugin-typescript": "^12.1.4",
     "@types/node": "^25.3.3",
     "@typescript-eslint/eslint-plugin": "^8.46.0",
@@ -105,9 +106,14 @@
   "readme": "https://github.com/modesty/pdf2json/blob/master/readme.md",
   "exports": {
     ".": {
-      "types": "./dist/pdfparser.d.ts",
-      "import": "./dist/pdfparser.js",
-      "require": "./dist/pdfparser.cjs"
+      "import": {
+        "types": "./dist/pdfparser.d.ts",
+        "default": "./dist/pdfparser.js"
+      },
+      "require": {
+        "types": "./dist/pdfparser.d.cts",
+        "default": "./dist/pdfparser.cjs"
+      }
     }
   },
   "publishConfig": {
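
The nested `exports` map above gives each module system its own `types` entry. As a rough illustration of why the nesting matters, the sketch below is a simplified model of conditional-exports resolution: keys are walked in order and the first matching condition wins, with `default` always matching. The function `resolveExport` is hypothetical and covers only this basic case; real resolvers in Node.js and TypeScript handle many more (subpath patterns, arrays, `null` targets).

```javascript
// Simplified model of conditional exports resolution (illustration only).
// `conditions` is the set of active condition names, e.g. a TypeScript
// consumer loading via require() is modelled as {"types", "require"}.
function resolveExport(target, conditions) {
  if (typeof target === "string") return target;
  if (target === null || typeof target !== "object") return undefined;
  // Keys are checked in insertion order; the first match is used.
  for (const [key, value] of Object.entries(target)) {
    if (key === "default" || conditions.has(key)) {
      const resolved = resolveExport(value, conditions);
      if (resolved !== undefined) return resolved;
    }
  }
  return undefined;
}

// The "." entry from the updated package.json.
const exportsMap = {
  ".": {
    import: { types: "./dist/pdfparser.d.ts", default: "./dist/pdfparser.js" },
    require: { types: "./dist/pdfparser.d.cts", default: "./dist/pdfparser.cjs" },
  },
};

const cjsTypes = resolveExport(exportsMap["."], new Set(["types", "require"]));
const esmRuntime = resolveExport(exportsMap["."], new Set(["import"]));
console.log(cjsTypes, esmRuntime);
```

Under this model a CommonJS TypeScript consumer lands on the `.d.cts` declaration file, which is the file the updated `build` script creates by copying `pdfparser.d.ts`.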

readme.md

Lines changed: 55 additions & 5 deletions
@@ -43,7 +43,7 @@ After install, run command line:

 > npm test

-`pretest` step builds bundles and source maps for both ES Module and CommonJS, output to `./dist` directory. The Jest test suit is defined in `./test/_test_.cjs` with commonJS, test run will also cover `parse-r` and `parse-fd` with ES Modules via command line.
+`pretest` step builds bundles and source maps for both ES Module and CommonJS, output to `./dist` directory. The Jest test suites (7 suites, 74+ tests) are defined in `./test/_test_*.cjs` with CommonJS, test run will also cover `parse-r` and `parse-fd` with ES Modules via command line.

 The default Jest test suits are essential tests for all PRs. But it only covers a portion of all testing PDFs, for more broader coverage, run:

@@ -920,15 +920,15 @@ To use the command line utility to transcode a folder or a file:
 node pdf2json.js -f [input directory or pdf file]
 ```

-When -f is a PDF file, it'll be converted to json file with the same name and saved in the same directory. If -f is a directory, it'll scan all ".pdf" files within the specified directory to transcode them one by one.
+When -f is a PDF file, it'll be converted to json file with the same name and saved in the same directory. If -f is a directory, it'll scan all ".pdf" files within the specified directory to transcode them one by one (dotfiles are skipped).

 Optionally, you can specify the output directory: -o:

 ```javascript
 node pdf2json.js -f [input directory or pdf file] -o [output directory]
 ```

-The output directory must exist, otherwise, it'll exit with an error.
+The output directory will be created automatically if it does not exist.

 Additionally, you can also use -v or --version to show version number or to display more help info with -h.

@@ -952,7 +952,57 @@ or
 pdf2json -f [input directory or pdf file] -o [output directory]
 ```

-v0.5.4 added "-s" or "--silent" command line argument to suppress informative logging output. When using pdf2json as a command line tool, the default verbosity is 5 (INFOS). While when running as a web service, default verbosity is 9 (ERRORS).
+### CLI Flags
+
+| Flag | Long | Description |
+|------|------|-------------|
+| `-f` | `--file` | (required) Path to a PDF file or a directory of PDF files to parse |
+| `-o` | `--output` | Output directory (created automatically if it doesn't exist; defaults to input directory) |
+| `-s` | `--silent` | Suppress informational output; only errors are printed |
+| `-t` | `--fieldTypes` | Generate a `.fields.json` file with form field ids and types |
+| `-c` | `--content` | Generate a `.content.txt` file with extracted text content |
+| `-m` | `--merge` | Generate a `.merged.json` file with auto-merged broken text blocks |
+| `-r` | `--stream` | Use stream-based parsing instead of loading the entire file into memory |
+| `-si` | `--singleton` | Reuse a single PDFParser instance across all files in a directory (reduces memory for batch processing) |
+| `-j` | `--json` | Output a structured JSON summary to stdout (version, file paths, stats, errors). Implies `-s` |
+| `-q` | `--quiet` | Suppress all non-error output including timer and status messages. Stricter than `-s` |
+| `-v` | `--version` | Print the version number and exit |
+| `-h` | `--help` | Print help message and exit |
+
+### Exit Codes
+
+| Code | Meaning |
+|------|---------|
+| `0` | All files parsed successfully |
+| `1` | One or more files failed to parse |
+| `2` | Invalid arguments or usage error (e.g. missing `-f`) |
+| `3` | I/O error (file not found, permission denied) |
+
+### `--json` Output Schema
+
+When using the `--json` flag, pdf2json outputs a structured JSON summary to stdout:
+
+```json
+{
+  "version": "4.0.3",
+  "input": "/path/to/input.pdf",
+  "outputs": [
+    { "type": "json", "path": "/path/to/output.json" },
+    { "type": "fields", "path": "/path/to/output.fields.json" },
+    { "type": "content", "path": "/path/to/output.content.txt" },
+    { "type": "merged", "path": "/path/to/output.merged.json" }
+  ],
+  "stats": { "input": 1, "success": 1, "failed": 0 },
+  "errors": [],
+  "elapsedMs": 234
+}
+```
+
+Note: The PDF engine may print warnings to stdout (e.g. `Warning: Setting up fake worker.`). Pipe through `grep '^{'` to isolate the JSON line.
+
+### Verbosity
+
+When using pdf2json as a command line tool, the default verbosity is 5 (INFOS). While when running as a web service, default verbosity is 9 (ERRORS).
 Examples to suppress logging info from command line:

 ```javascript
@@ -973,7 +1023,7 @@ var pdfParser = new PFParser();
 pdfParser.loadPDF(pdfFilePath, 5);
 ```

-v0.5.7 added the capability to skip input PDF files if filename begins with any one of "!@#$%^&\*()+=[]\\\';,/{}|\":<>?~`.-\_ ", usually these files are created by PDF authoring tools as backup files.
+When scanning a directory, pdf2json only skips hidden files (dotfiles). All other `.pdf` files are processed.

 v0.6.2 added "-t" command line argument to generate fields json file in addition to parsed json. The fields json file will contain one Array which contains fieldInfo object for each field, and each fieldInfo object will have 4 fields:
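
The readme change above notes that engine warnings may share stdout with the `--json` summary and suggests `grep '^{'` to isolate it. For callers that capture stdout in Node.js rather than in a shell, the same filtering might look like the sketch below. The function name `extractJsonSummary` and the sample text are made up for illustration, and the sketch assumes the summary is emitted on a single line, as the `grep` suggestion implies.

```javascript
// Return the first stdout line that starts with "{" and parses as JSON,
// or null if no such line exists. Hypothetical helper, not part of pdf2json.
function extractJsonSummary(stdout) {
  for (const line of stdout.split(/\r?\n/)) {
    if (!line.startsWith("{")) continue;
    try {
      return JSON.parse(line);
    } catch {
      // Starts with "{" but is not valid JSON; keep scanning.
    }
  }
  return null;
}

// Sample captured output: one warning line followed by the summary line.
const sampleStdout = [
  "Warning: Setting up fake worker.",
  '{"version":"4.0.3","stats":{"input":1,"success":1,"failed":0},"errors":[],"elapsedMs":234}',
].join("\n");

const summary = extractJsonSummary(sampleStdout);
console.log(summary.stats.failed === 0 ? "all files parsed" : "some files failed");
```

Parsing each candidate line inside a `try` block keeps a stray warning that happens to begin with `{` from aborting the scan.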

rollup.config.js

Lines changed: 20 additions & 21 deletions
@@ -19,6 +19,8 @@ const external = [
 ];

 export default [
+  // Build 1: Main library bundle (pdfparser.js -> dist/)
+  // Must complete before Build 2, as the CLI imports from dist/pdfparser.js
   {
     input: "./pdfparser.js",
     external,
@@ -34,46 +36,43 @@ export default [
         sourcemap: true,
       },
     ],
-    treeshake: false,
+    treeshake: false, // Required: PDF.js base has global side effects that tree-shaking would break
     plugins: [
       json(),
       eslint({
-        throwOnError: true
+        throwOnError: true,
       }),
       nodeResolve({
-        preferBuiltins: true, // Prefer Node.js built-in modules
-        browser: false // Set to true only if targeting browsers
-      }),
-      terser()
-    ]
+        preferBuiltins: true,
+        browser: false,
+      }),
+      terser(),
+    ],
   },
+  // Build 2: CLI bundle (src/cli/ -> bin/cli/)
+  // Depends on Build 1: imports dist/pdfparser.js as external at runtime
   {
     input: "./src/cli/p2jcli.ts",
     external: [...external, "../../dist/pdfparser.js"],
     output: [
-      // {
-      // file: "dist/pdfparser_cli.cjs",
-      // format: "cjs",
-      // sourcemap: true,
-      // },
       {
         file: "bin/cli/pdfparser_cli.js",
         format: "es",
         sourcemap: true,
       },
     ],
-    treeshake: false,
+    treeshake: true,
     plugins: [
-      typescript({ tsconfig: './tsconfig.json' }),
+      typescript({ tsconfig: "./tsconfig.json" }),
       json(),
       eslint({
-        throwOnError: true
+        throwOnError: true,
       }),
       nodeResolve({
-        preferBuiltins: true, // Prefer Node.js built-in modules
-        browser: false // Set to true only if targeting browsers
-      }),
-      terser()
-    ]
-  }
+        preferBuiltins: true,
+        browser: false,
+      }),
+      terser(),
+    ],
+  },
 ];
