PDF questions and Archives/Data center connections

  • Creator
    Discussion
  • #78170

    RDA Admin
    Member

    Hello Archives and Records Professionals IG!
    I am hoping you can help me answer a request I just received from a data center friend about archival practices around PDF migration. My apologies if this is a distraction to the group or would be better addressed elsewhere; an offline reply would be lovely if that is more appropriate.
    The broader connection for the IG might be that data center folks, not necessarily trained in archival practices or up to speed in the institutional records environments, also archive more traditional records with their data holdings. Making (additional) connections across these communities would be an amazing thing, imo.
    Sincere thanks!
    Lynn
    ————————————————————————-
    My basic questions are: (1) how are you creating PDF/A files, (2) are you validating them, and (3) how are you addressing validation errors? (4) Lastly, any thoughts on where I can find answers to similar questions online? I seem to be having zero luck, but I can’t be the only one having these issues.
    Background:
    We’re in the process of working through the backlog conversion of our repository’s materials into archival versions. We’re a discipline-specific repo, and I’ve worked to ensure that we only accession materials that have a clear migration path, but as you can imagine, it’d be nice if it were only that easy. My current challenge is working with the 19,000 PDF files in our system (many are quite large).
    We’ve been analyzing tools that can create PDF/A files, comparing Adobe Acrobat, Ghostscript, and ABBYY FineReader on two points: how well they converted a small test set of PDFs, and how well they’d work for larger batches in an automated manner. We then validated the resulting PDF/A files using veraPDF, Adobe’s Preflight, and PDFBox’s Preflight tools.
    We’ve also looked at tools like Archivematica, but shied away as they don’t support all of the file formats that we support, and don’t address some of these issues.
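For anyone wanting to reproduce the convert-then-validate loop described above, here is a minimal sketch in Python, assuming Ghostscript (`gs`) and the veraPDF command-line tool (`verapdf`) are installed and on the PATH. The function names, the PDF/A-2b target, and the directory layout are illustrative assumptions, not the workflow actually used here. A real Ghostscript setup normally also needs an edited `PDFA_def.ps` with an ICC profile, which is left out, and treating a non-zero veraPDF exit code as "not confirmed valid" is an assumption to check against the tool's documentation.

```python
import subprocess
from pathlib import Path


def ghostscript_pdfa_command(src: Path, dst: Path, gs_binary: str = "gs") -> list[str]:
    """Build a Ghostscript command line that requests PDF/A-2 output."""
    return [
        gs_binary,
        "-dPDFA=2",                       # request PDF/A-2 (assumed target level)
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=RGB",
        "-dPDFACompatibilityPolicy=1",    # drop non-conforming features instead of aborting
        f"-sOutputFile={dst}",
        str(src),
    ]


def verapdf_command(pdf: Path, flavour: str = "2b", verapdf_binary: str = "verapdf") -> list[str]:
    """Build a veraPDF command line that validates one file and emits an XML report."""
    return [verapdf_binary, "--flavour", flavour, "--format", "xml", str(pdf)]


def convert_and_validate(src_dir: Path, out_dir: Path) -> dict[str, bool]:
    """Convert every PDF in src_dir, validate the result, and keep the reports.

    Returns a mapping of file name to True/False. False means either the
    conversion failed or the validator did not exit cleanly (an assumption
    about exit codes; the saved XML report is the authoritative record).
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    results: dict[str, bool] = {}
    for src in sorted(src_dir.glob("*.pdf")):
        dst = out_dir / src.name
        converted = subprocess.run(ghostscript_pdfa_command(src, dst), capture_output=True)
        if converted.returncode != 0:
            results[src.name] = False
            continue
        report = subprocess.run(verapdf_command(dst), capture_output=True, text=True)
        # Save the full report next to the converted file for later review.
        (out_dir / f"{src.stem}.verapdf.xml").write_text(report.stdout)
        results[src.name] = report.returncode == 0
    return results
```

Keeping the command builders separate from the runner makes it easy to swap in a different converter or validator while leaving the batch loop unchanged.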
    Challenges:
    In our initial tests a few years ago, none of the tools created valid PDF/A files in all cases. Currently Adobe Acrobat is doing better, but it lacks useful tools for automating hundreds or thousands of conversions. Archivematica uses Ghostscript to convert files, but it seems to be the worst of the bunch in terms of validation errors. We use ABBYY FineReader because of its superior OCR and batch tools, but it also produces errors at the moment. I’ve yet to find a good tool beyond Acrobat, when it works.
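One way to make a comparison like this repeatable is to record every validation outcome and tally it per converter. The sketch below is a hypothetical bookkeeping helper, not something from our actual tests: the `ValidationResult` record, its field names, and the idea of storing failed rule identifiers as strings are all assumptions about how the validator reports would be parsed.

```python
from collections import Counter, defaultdict
from typing import Iterable, NamedTuple


class ValidationResult(NamedTuple):
    """One validator's verdict on one converted file (hypothetical record shape)."""
    converter: str                  # tool that produced the PDF/A file
    validator: str                  # tool that checked it
    filename: str
    failed_rules: tuple[str, ...]   # empty tuple means the file passed


def pass_rates(results: Iterable[ValidationResult]) -> dict[tuple[str, str], float]:
    """Fraction of files that passed, per (converter, validator) pair."""
    totals: Counter = Counter()
    passes: Counter = Counter()
    for r in results:
        key = (r.converter, r.validator)
        totals[key] += 1
        if not r.failed_rules:
            passes[key] += 1
    return {key: passes[key] / totals[key] for key in totals}


def most_common_failures(results: Iterable[ValidationResult], top: int = 5) -> dict[str, list[tuple[str, int]]]:
    """Most frequent failed rules per converter.

    Each rule is counted once per record, so a file checked by two
    validators contributes two records.
    """
    by_converter: dict[str, Counter] = defaultdict(Counter)
    for r in results:
        by_converter[r.converter].update(set(r.failed_rules))
    return {conv: counts.most_common(top) for conv, counts in by_converter.items()}
```

Grouping failures by rule rather than by file shows whether a converter fails in a handful of systematic ways, which might be fixable in bulk, or in many unrelated ones.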
