OCRmyPDF Tutorial: Convert Scanned Documents into Searchable PDF/A Files with Sidecar Text Extraction and Batch Processing

June 28, 2026
in AI & Technology

import logging
import os
import random
import re
import subprocess
import sys
import textwrap
import time
from pathlib import Path


def sh(cmd):
    """Run a shell command and raise on failure.

    Assumption: the original defines `sh` in a setup cell that is not shown
    here; this minimal stand-in only keeps the listing runnable.
    """
    subprocess.run(cmd, shell=True, check=True)


def _purge(*prefixes):
    """Drop cached modules so the next import picks up a reinstalled package."""
    for name in [m for m in list(sys.modules)
                 if any(m == p or m.startswith(p + ".") for p in prefixes)]:
        del sys.modules[name]


def _load_ocrmypdf():
    _purge("PIL", "ocrmypdf")
    import ocrmypdf
    return ocrmypdf


# Import OCRmyPDF; if Pillow is incompatible, reinstall it once and retry.
try:
    ocrmypdf = _load_ocrmypdf()
except ImportError as e:
    if "_Ink" in str(e) or "PIL" in str(e):
        print("Repairing an incompatible Pillow (reinstalling pillow<12)...")
        sh(f'"{sys.executable}" -m pip install -q --force-reinstall "pillow<12"')
        try:
            ocrmypdf = _load_ocrmypdf()
            print("Pillow repaired; continuing without a restart.")
        except Exception as err:
            raise RuntimeError(
                "Pillow is still incompatible in this session. Use the Colab menu: "
                "Runtime > Restart session, then run this cell again."
            ) from err
    else:
        raise

from ocrmypdf.exceptions import (
    ExitCode,
    PriorOcrFoundError,
    EncryptedPdfError,
    MissingDependencyError,
    TaggedPDFError,
    DigitalSignatureError,
    DpiError,
    InputFileError,
    UnsupportedImageFormatError,
)
from ocrmypdf.helpers import check_pdf
from ocrmypdf.pdfa import file_claims_pdfa
import img2pdf
from PIL import Image, ImageDraw, ImageFont, ImageFilter

# Keep library logging quiet so the tutorial output stays readable.
logging.basicConfig(level=logging.WARNING, format="%(levelname)s: %(message)s")
logging.getLogger("ocrmypdf").setLevel(logging.WARNING)
logging.getLogger("pdfminer").setLevel(logging.ERROR)
logging.getLogger("PIL").setLevel(logging.WARNING)

# Ground-truth text for the three synthetic "scanned" pages.
SAMPLE_TEXT_PAGES = [
    "Optical Character Recognition, commonly abbreviated as OCR, is the "
    "process of converting images of typed or printed text into machine "
    "encoded text. This page was generated as a synthetic scan so that the "
    "OCRmyPDF pipeline has something realistic to recognize and search.",
    "On 14 March 2026 the archive contained 1,482 pages across 37 folders. "
    "Roughly 92 percent of those pages were scanned at 200 to 300 dots per "
    "inch. The remaining 8 percent were skewed and required deskewing before "
    "any reliable recognition was possible.",
    "After OCRmyPDF finishes, the output is a searchable PDF/A file. You can "
    "select text, copy it, and run full text search across thousands of "
    "documents. The original image resolution is preserved while a hidden "
    "text layer is placed accurately underneath the page image.",
]


def _find_font():
    """Return the path of the first available TrueType font, or None."""
    for cand in (
        "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
        "/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf",
    ):
        if os.path.exists(cand):
            return cand
    return None


_FONT_PATH = _find_font()
FONT = ImageFont.truetype(_FONT_PATH, 40) if _FONT_PATH else ImageFont.load_default()


def _add_speckle(img, n=6000, dark=60):
    """Scatter n dark specks to imitate scanner noise (motivates --clean)."""
    px = img.load()
    w, h = img.size
    for _ in range(n):
        px[random.randint(0, w - 1), random.randint(0, h - 1)] = random.randint(0, dark)
    return img


def render_page(text, skew=False):
    """Render one A4 page (1654x2339 px ≈ 200 DPI) of dark text on white."""
    W, H = 1654, 2339
    img = Image.new("L", (W, H), 255)
    draw = ImageDraw.Draw(img)
    draw.multiline_text((150, 180), textwrap.fill(text, width=58),
                        fill=25, font=FONT, spacing=18)
    if skew:
        # Rotate by 6 degrees, blur slightly, and add noise to mimic a bad scan.
        img = img.rotate(6, resample=Image.BICUBIC, expand=False, fillcolor=255)
        img = img.filter(ImageFilter.GaussianBlur(0.6))
        img = _add_speckle(img)
    return img


def build_scanned_pdf(pdf_path: Path, pages_text, skew_index=1):
    """Render pages to PNGs and wrap them losslessly into an image-only PDF."""
    pngs = []
    for i, text in enumerate(pages_text):
        img = render_page(text, skew=(i == skew_index))
        p = pdf_path.parent / f"_pg_{pdf_path.stem}_{i}.png"
        img.save(p, format="PNG", dpi=(200, 200))
        pngs.append(str(p))
    with open(pdf_path, "wb") as f:
        f.write(img2pdf.convert(pngs))
    # The intermediate PNGs are no longer needed once the PDF is written.
    for p in pngs:
        os.remove(p)
    return pdf_path


def do_ocr(input_file, output_file, **kw):
    """Wrapper around ocrmypdf.ocr() that disables the progress bar and times it."""
    kw.setdefault("progress_bar", False)
    t0 = time.perf_counter()
    rc = ocrmypdf.ocr(input_file, output_file, **kw)
    return rc, time.perf_counter() - t0


def tokens(s: str):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", s.lower())


def kb(path) -> str:
    """Format a file's size in kilobytes."""
    return f"{Path(path).stat().st_size / 1024:,.1f} KB"


def banner(title: str):
    """Print a section title between two horizontal rules."""
    line = "─" * 74
    print(f"\n{line}\n  {title}\n{line}")
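
The `tokens` helper above lowercases text and keeps only alphanumeric runs, which makes it a natural basis for comparing recognized text against the known `SAMPLE_TEXT_PAGES`. The listing does not show how it is used, so the following is only a minimal sketch of one plausible use: a token-level recall score. The name `token_recall` and the sample OCR string are hypothetical and not part of the original tutorial.

```python
import re
from collections import Counter


def tokens(s: str):
    # Same normalisation as the tutorial's helper: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", s.lower())


def token_recall(expected: str, recognized: str) -> float:
    """Fraction of expected tokens (counted with multiplicity) found in the OCR text.

    Hypothetical helper for illustration; not part of OCRmyPDF.
    """
    want = Counter(tokens(expected))
    got = Counter(tokens(recognized))
    if not want:
        return 1.0  # nothing was expected, so nothing was missed
    matched = sum(min(count, got[tok]) for tok, count in want.items())
    return matched / sum(want.values())


# Ground truth taken from the second sample page; the OCR string is made up
# and contains a typical confusion ("2OO" with letter O instead of "200").
truth = "Roughly 92 percent of those pages were scanned at 200 to 300 dots per inch."
ocr_text = "Roughly 92 percent of those pages were scanned at 2OO to 300 dots per inch"

print(f"token recall: {token_recall(truth, ocr_text):.2%}")  # 14 of 15 tokens match
```

Counting with multiplicity means a word that appears twice in the source must be recognized twice to score fully, and ignoring punctuation and case keeps the score focused on recognition errors rather than formatting differences.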
