JavaScript Development Space

Extract PDF images with JavaScript

To extract images from a PDF using pdf.js in JavaScript, follow these steps:

  1. Load the PDF: Initialize pdf.js and load the PDF document.
  2. Access Each Page: For each page, use page.getOperatorList() to access its operators.
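  3. Extract Images: Scan the operator list for paintImageXObject operations. Each one references an embedded image by name, and the decoded image data is looked up under that name in page.objs.
  4. Render Image: Convert the image's pixel data to RGBA and draw it on a canvas for display or extraction.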
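
Steps 2 and 3 rely on the operator list being two parallel arrays: fnArray holds the operator codes and argsArray holds the arguments for the operator at the same index. The sketch below shows that lookup in isolation. The helper name collectImageNames and the sample data are made up for illustration, and 85 is only a placeholder for whatever value pdfjsLib.OPS.paintImageXObject has in your pdf.js build.

```javascript
// Hypothetical helper (not part of pdf.js): returns the names of all images
// painted on a page. `operatorList` has the shape returned by
// page.getOperatorList(); `paintImageOp` is the operator code to look for,
// e.g. pdfjsLib.OPS.paintImageXObject.
function collectImageNames(operatorList, paintImageOp) {
  const names = new Set(); // the same image can be painted more than once
  operatorList.fnArray.forEach((fn, i) => {
    if (fn === paintImageOp) {
      // The first argument of the operator is the image's object name.
      names.add(operatorList.argsArray[i][0]);
    }
  });
  return [...names];
}

// Stand-in operator list; the codes and names here are illustrative only.
const sampleList = {
  fnArray: [10, 85, 12, 85, 85],
  argsArray: [
    null,
    ['img_p0_1', 64, 64],
    [1, 0, 0, 1, 0, 0],
    ['img_p0_2', 32, 32],
    ['img_p0_1', 64, 64],
  ],
};

console.log(collectImageNames(sampleList, 85)); // [ 'img_p0_1', 'img_p0_2' ]
```

Collecting the names first and removing duplicates avoids decoding the same image twice when a page paints it in several places.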

Here’s an example setup using pdf.js:

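html
<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.10.377/pdf.min.js"></script>
<div id="imageContainer"></div>
<script>
  // The worker must come from the same pdf.js version as the library above.
  pdfjsLib.GlobalWorkerOptions.workerSrc =
    'https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.10.377/pdf.worker.min.js';

  const url = 'path/to/your.pdf';

  pdfjsLib.getDocument(url).promise.then(async (pdf) => {
    for (let pageNum = 1; pageNum <= pdf.numPages; pageNum++) {
      const page = await pdf.getPage(pageNum);
      const operatorList = await page.getOperatorList();

      operatorList.fnArray.forEach((fn, i) => {
        if (fn === pdfjsLib.OPS.paintImageXObject) {
          const imgName = operatorList.argsArray[i][0];
          // Images shared between pages have names starting with "g_" and live in commonObjs.
          const store = imgName.startsWith('g_') ? page.commonObjs : page.objs;
          // The callback form waits until the worker has finished decoding the image.
          store.get(imgName, renderImageToCanvas);
        }
      });
    }
  });

  function renderImageToCanvas(img) {
    // pdf.js 2.x returns { width, height, data } with raw pixel bytes, not an ImageData.
    if (!img || !img.data) return;
    const { width, height, data } = img;
    const pixelCount = width * height;
    const rgba = new Uint8ClampedArray(pixelCount * 4);

    if (data.length === pixelCount * 4) {
      rgba.set(data); // already RGBA
    } else if (data.length === pixelCount * 3) {
      for (let p = 0; p < pixelCount; p++) { // expand RGB to RGBA
        rgba[p * 4] = data[p * 3];
        rgba[p * 4 + 1] = data[p * 3 + 1];
        rgba[p * 4 + 2] = data[p * 3 + 2];
        rgba[p * 4 + 3] = 255;
      }
    } else {
      return; // 1-bit grayscale and other layouts are not handled here
    }

    // Draw each image on its own canvas so later images do not overwrite earlier ones.
    const canvas = document.createElement('canvas');
    canvas.width = width;
    canvas.height = height;
    canvas.getContext('2d').putImageData(new ImageData(rgba, width, height), 0, 0);
    document.getElementById('imageContainer').appendChild(canvas);
  }
</script>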

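This script walks through every page of the PDF, looks up each embedded image, and draws it on a canvas. To save an image, export the canvas with canvas.toDataURL() or canvas.toBlob().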
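
One detail worth spelling out: putImageData() accepts only an ImageData object, while pdf.js 2.x typically hands back decoded images as plain objects with width, height, and a raw data byte array. The sketch below is one way to bridge that gap. The helper name toRgba is hypothetical, and it guesses the pixel layout from the length of data rather than from any pdf.js constant. The 1-bit branch assumes rows padded to whole bytes, most significant bit first, with a set bit meaning white; verify that against your pdf.js version before relying on it.

```javascript
// Hypothetical helper (not part of pdf.js): converts an image object of the
// form { width, height, data } into an RGBA byte buffer.
function toRgba(img) {
  const { width, height, data } = img;
  const pixelCount = width * height;
  const rowBytes = Math.ceil(width / 8); // assumed row stride for 1-bit images
  const rgba = new Uint8ClampedArray(pixelCount * 4);

  if (data.length === pixelCount * 4) {
    rgba.set(data); // already RGBA, copy as-is
  } else if (data.length === pixelCount * 3) {
    // RGB: copy three channels per pixel and add an opaque alpha channel.
    for (let p = 0; p < pixelCount; p++) {
      rgba[p * 4] = data[p * 3];
      rgba[p * 4 + 1] = data[p * 3 + 1];
      rgba[p * 4 + 2] = data[p * 3 + 2];
      rgba[p * 4 + 3] = 255;
    }
  } else if (data.length === rowBytes * height) {
    // 1-bit grayscale (assumed layout): one bit per pixel, MSB first.
    for (let y = 0; y < height; y++) {
      for (let x = 0; x < width; x++) {
        const byte = data[y * rowBytes + (x >> 3)];
        const value = byte & (0x80 >> (x & 7)) ? 255 : 0;
        const o = (y * width + x) * 4;
        rgba[o] = rgba[o + 1] = rgba[o + 2] = value;
        rgba[o + 3] = 255;
      }
    }
  } else {
    throw new Error('Unrecognized pixel layout');
  }
  return rgba;
}

// A 2x1 RGB image: one red pixel followed by one blue pixel.
const rgb = { width: 2, height: 1, data: new Uint8ClampedArray([255, 0, 0, 0, 0, 255]) };
console.log(Array.from(toRgba(rgb))); // [255, 0, 0, 255, 0, 0, 255, 255]
```

In the browser, the result can be drawn with ctx.putImageData(new ImageData(toRgba(img), img.width, img.height), 0, 0).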

© 2024 JavaScript Development Space - Master JS and NodeJS. All rights reserved.