Documents pose growing challenges to large language models (LLMs) as they become more disorganised or complex. Based on my experience automating document testing for business-to-business (B2B) clients, I can attest that LLMs often fail to extract accurately from messy electronic documents, such as those with complex table structures or varied presentations and scan formats. LLMs can pull information from these documents, but the accuracy is too poor to yield data reliable enough for financial operations or regulatory compliance. They make many silent mistakes, frequently mislabelling fields or dropping rare values. This puts companies at serious risk of losing money, damaging their industry reputation, or redoing large volumes of work. In our internal tests comparing LLM pipelines against manual data extraction, LLMs produced error rates above 15% on documents without a standard structure. For example, one SaaS client of ours, processing roughly 4,000 invoices annually, launched a project to automate invoice intake. With LLMs, the client cut about 40% of the manual workload associated with standard vendor invoices. But when processing invoices from older vendors with non-standard formats and scanned PDFs, the LLMs could not reliably extract invoice totals or dates. The fix was to back the LLMs with rule-based checks and optical character recognition (OCR) verification. The critical lesson: LLMs are best suited to assist with document automation, not to lead the decision-making.
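The kind of rule-based check plus OCR cross-verification described above can be sketched roughly as follows. This is an illustrative outline, not the client's actual pipeline; the field names, the amount regex, and the date format are assumptions.

```python
import re
from datetime import datetime

def validate_invoice_extraction(fields: dict, ocr_text: str) -> list:
    """Cross-check LLM-extracted invoice fields against raw OCR text.

    Returns a list of issues; an empty list means the record can be
    auto-approved, otherwise it is routed to human review.
    """
    issues = []

    # Rule 1: the total must parse as a positive, well-formed amount.
    total = fields.get("total", "")
    if not re.fullmatch(r"\d{1,3}(,\d{3})*\.\d{2}", total):
        issues.append(f"total '{total}' is not a valid amount")

    # Rule 2: the invoice date must be a real ISO date.
    try:
        datetime.strptime(fields.get("date", ""), "%Y-%m-%d")
    except ValueError:
        issues.append(f"date '{fields.get('date')}' is not YYYY-MM-DD")

    # Rule 3: the claimed total must literally appear in the OCR text,
    # which catches silently hallucinated numbers.
    if total and total not in ocr_text:
        issues.append(f"total '{total}' not found in OCR text")

    return issues
```

The key design choice is that the checks are deterministic: the same extraction against the same OCR text always passes or fails the same way, which is what makes the pipeline auditable.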
From my point of view, one of the most significant limitations of large language models in document processing is their inability to deliver precise results when accuracy really counts. LLMs do well at summarizing and understanding text; they fail at structured data extraction, where a misread score or a misread date can derail the outcome of a process. They also tend to sound very confident in their answers even when the underlying data is missing or unclear. In education-related use cases, I have observed that LLMs perform well when summarizing class materials or explaining concepts, yet they consistently fail to accurately extract standardized information from academic records or test results. Traditional rule-based systems with human oversight continue to outperform pure LLM-based automation in these applications.
Based on my experience, LLMs are least effective when transforming unstructured documents into clean, usable datasets at scale. While they are great at producing and rewriting content, they tend to be unreliable when precision and formatting consistency are required, which is why I believe LLMs should never be used as a standalone solution for data extraction. Within the educational content space, I have seen LLMs excel at creating worksheets and adapting existing content for different learning levels. But when I tried using LLMs to extract structured data from uploaded documents, the results were inconsistent. Combining predefined templates with LLMs yielded significantly better results than using generative models alone.
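The template-plus-LLM combination can be sketched as a strict gate on the model's output. This is a minimal illustration, not any real product's code; the template fields are hypothetical.

```python
import json
from typing import Optional

# Predefined template: every extraction must contain exactly these
# fields with these types. The fields here are illustrative.
TEMPLATE = {
    "student_name": str,
    "grade_level": int,
    "score": float,
}

def extract_with_template(raw_llm_output: str) -> Optional[dict]:
    """Parse LLM output and enforce the template; return None on any
    mismatch so the record falls back to manual handling instead of
    silently passing bad data downstream."""
    try:
        data = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return None
    if set(data) != set(TEMPLATE):
        return None
    for field, expected_type in TEMPLATE.items():
        if not isinstance(data[field], expected_type):
            return None
    return data
```

Anything the model returns that drifts from the template, even plausibly, is rejected rather than trusted, which is what restores the consistency the generative model alone lacks.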
The largest limitation I see with LLMs in enterprise automation is their lack of determinism. Business processes need to branch on predictable output, which LLMs cannot always guarantee: depending on context or wording, the same model may interpret the same document differently from one run to the next. LLMs also struggle with complicated document layouts, scanned documents, and compliance-sensitive data. In real-world automation projects, I have found LLMs add real value as an augmentation layer, classifying documents, determining document intent, and handling unusual cases, but they are generally inadequate replacements for traditional OCR, rules engines, or workflow logic. The way to implement LLMs successfully is as part of a hybrid system, not as the primary engine.
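A rough sketch of that hybrid split, assuming a hypothetical `classify_with_llm` stand-in for a real model call: the LLM only classifies, a deterministic rules engine extracts, and unknown types go to a human queue.

```python
import re

def classify_with_llm(text: str) -> str:
    """Stub for an LLM classifier; assume the real call returns a label."""
    return "invoice" if "invoice" in text.lower() else "unknown"

RULE_EXTRACTORS = {
    # Deterministic extraction: same input, same output, every run.
    "invoice": lambda text: {"total": re.search(r"Total:\s*([\d.,]+)", text).group(1)},
}

def process(text: str) -> dict:
    doc_type = classify_with_llm(text)
    extractor = RULE_EXTRACTORS.get(doc_type)
    if extractor is None:
        # No deterministic rule exists: route to a person, never guess.
        return {"status": "human_review", "type": doc_type}
    return {"status": "ok", "type": doc_type, **extractor(text)}
```

The point of the split is that the non-deterministic component only ever chooses a route; every value that reaches downstream systems comes from a rule that behaves identically on every run.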
In our testing of LLMs on roughly 120 real production documents, including location releases, talent agreements, and licensing contracts, we found an 18% error rate in data extraction. The largest issues appeared where the language was nonstandard, such as handwritten alterations, riders and addenda, or older contracts scanned as PDFs. That was a real wake-up call for us, because one missed licensing detail can kill a distribution deal. What we landed on is using AI as triage, not as a decision-maker. We now employ it to summarize contracts and flag sections for review, which saved us about 40% of our internal prep time, but every rights-related document is still seen by human eyes. That alone made our workflow quicker without adding risk.
The biggest limitation I've seen when working with large data sets is hallucination. AI is pretty decent at analyzing data and extracting insights when it has that data. If the data is missing (e.g., we have a set of businesses and we're missing their industry or location), LLMs tend to make it up. The problem is that on the face of it everything makes perfect sense, and you don't discover the error until weeks or months later.
Reliability breaks down when the documents carry legal or financial implications. LLMs are good with clean, repetitive formats but struggle with edge cases such as scanned PDFs, handwritten annotations, versioned grant forms, or conflicting data across attachments. Small extraction errors then cascade into compliance issues that are difficult to detect without manual review. Another limitation is false confidence: LLMs will often produce complete-sounding answers even when the source material is missing or ambiguous. In grant work, that behavior is a risk. Automation is only workable with validation rules, human checkpoints, and clear data ownership. LLMs lower the effort, but not the accountability. The technology assists with recognizing patterns, not with judgment, so business workflows have to be designed with that boundary in mind.
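One simple way to encode that boundary is a gate that refuses to auto-approve a record unless every required field is present and traceable back to the source text. This is a hypothetical sketch (the field names and substring-tracing check are illustrative), not any organization's actual workflow.

```python
def gate_for_review(extracted: dict, source_text: str, required: set) -> dict:
    """Route a record to human review when required fields are missing or
    a value cannot be traced back to the source document. This targets
    false confidence: an answer with no support in the source never
    passes, no matter how complete it sounds."""
    problems = []
    for field in sorted(required):
        value = extracted.get(field)
        if value is None:
            problems.append(f"missing: {field}")
        elif str(value) not in source_text:
            problems.append(f"untraceable: {field}")
    return {"approved": not problems, "problems": problems}
```

Substring tracing is crude (it misses reformatted values), but it is cheap, deterministic, and errs toward human review, which is the side you want to err on in grant work.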
LLMs are great at writing captions, but they're not so great at pulling data from files that mix text with pictures and tables. We've found they miss context or get numbers wrong in those cases. If you need perfect accuracy, the old methods are still more reliable. Your best bet is probably to pair an LLM with a more specialized tool for the extraction part.
The main problem I've run into with LLMs for SEO is they don't always get industry-specific language, which messes up keyword and metadata extraction. We've found the best approach is having a person review the AI's work. For example, when we processed thousands of client pages, the LLM caught basic SEO errors but tripped up on local slang and niche terminology. We needed humans to fix those mistakes to ensure quality.
Running my SaaS company CLDY taught me something about LLMs. They trip over documents with strange formatting and just miss fields. The issue? They don't know our business rules. So we started training them on our own stuff. Our automated onboarding was great with standard templates, but legal docs with custom clauses broke it. You just have to keep teaching it.
Here's what I learned from running cashback programs with engineers and data scientists. LLMs choke on messy PDFs and janky promotional emails. We'd either miss deals completely or list them wrong. After several campaigns, it became clear you need a person checking the AI's work. LLMs aren't a magic solution. Those human backup checks are the only thing that keeps the listings accurate.
I work in dental IT, and LLMs are terrible with handwritten or scanned intake forms, especially when HIPAA is non-negotiable. We tried using them to pull patient data from forms, but the formatting was all over the place and the model would mess up redacted sections. Security is another big issue. Even with safeguards, they can leak patient information. Honestly, the best approach is combining specialized OCR and human review with the LLM.
In my 20 years operating in transportation, the most significant limitations of LLMs for document processing and automation have consistently come down to consistency and accountability. LLMs are proficient at creating summaries and classifications, but they are not reliable for repeatable, audit-ready extraction. The largest gaps have been with "messy" document types: scanned PDFs, handwritten notes, and vendor-specific forms whose layout or orientation changes. Missing data also triggers hallucinations, and without strong rules validating the output plus a required level of human review, instances where LLMs participated in business automation have produced error rates above 5-10% in financial and operational workflows. We tested LLMs for automatically processing vendor contracts and invoices for multi-city shuttle programs and observed high efficiency in tagging, flagging missing fields, and summarizing contract terms. The time employees spend manually reviewing contracts has decreased by approximately 30% as a result.
When you think about document processing, the fact that paper-based documents still make up a large majority of the workload is often overlooked. Even when documents are shared digitally, there is no guarantee they are machine readable rather than just an image. It is exactly at this critical first step, image to text, that LLMs fall short: their probabilistic nature infers an output that is often riddled with hallucinations. Downstream, the data extraction and any automation built on that data will produce costly inaccuracies and wrong decisions. The ideal solution combines machine-learning-based image-to-text conversion with Vision Language Models and, depending on the use case, leverages either rules (good old-fashioned regular expressions), machine learning, or LLMs with context limiting and just-in-time token replacement to ensure privacy when working with sensitive data such as PII. A great example of the right technologies working together: a large global food retail enterprise had to deal with new compliance regulations enforced by the FASB in the US. Large, complex contracts had to be analyzed and 300 individual data points extracted. Initially, a team of 25 FTEs performed the work but achieved only 67% accuracy, causing rework and fines down the line. The organization wanted a better solution and turned to LLMs. As often happens with this technology, initial demos and experiments looked extremely promising; on closer inspection, however, accuracy using LLMs alone was a meager 42%. Advising them to use a combination of technologies (OCR, machine learning, LLMs, and human-in-the-loop review for continuous improvement) led to a successful project after many months of failures, achieving 89% accuracy across all data points at go-live.
Continuous improvement has since raised that number into the high 90s. This approach also allowed a reduction to 5 FTEs and, most importantly, eliminated regulatory noncompliance.
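The "just-in-time token replacement" idea mentioned above can be sketched as a redact-then-restore step around the model call. This is a rough illustration under my own assumptions (two toy PII patterns, placeholder tokens of my own invention), not the pipeline actually deployed.

```python
import re

# PII is swapped for numbered placeholders before text reaches the LLM,
# and mapped back afterwards, so the model never sees the raw values.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str):
    """Replace each PII match with a placeholder; return the redacted
    text plus the mapping needed to restore it later."""
    mapping = {}
    for label, pattern in PII_PATTERNS.items():
        for i, match in enumerate(pattern.findall(text)):
            token = f"<{label}_{i}>"
            mapping[token] = match
            text = text.replace(match, token)
    return text, mapping

def restore(text: str, mapping: dict) -> str:
    """Swap the placeholders back after the model has responded."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

In practice the pattern list would be far broader (names, addresses, account numbers), but the shape stays the same: the mapping lives only on your side of the API boundary.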
What are the most significant limitations you've observed with large language models (LLMs) when it comes to document processing, data extraction, and business automation?

LLMs have difficulty dealing with sophisticated layouts, e.g., multi-page tables. They frequently "hallucinate," producing mistakes that look real but are phony, which makes them unsafe for extracting sensitive information. They are also slow and pricey for large jobs, and without robust logic they struggle with complex problem solving or multi-step business automation.

Can you share a real-world example or case study where LLMs either excelled or fell short in a document automation/data extraction scenario?

Insurance companies now use LLMs to pull up claim information in seconds, a task that previously took days. But in legal filings, some models have "hallucinated" fake court cases. LLMs are good at speed and scale, yet they can still fall short of perfect accuracy; the work must always be reviewed by a human who can catch potentially costly errors.
What are the most significant limitations you've observed with large language models (LLMs) when it comes to document processing, data extraction, and business automation?

LLMs break on structural integrity and have difficulty understanding complex layouts like nested tables. Unlike OCR, they generate "plausible hallucinations": small bits of information that are fake but believable, which humans must check. Privacy concerns, latency, and token-limit issues further challenge scalable automation of sensitive, high-throughput financial or legal workflows.

Can you share a real-world example or case study where LLMs either excelled or fell short in a document automation/data extraction scenario?

In a healthcare case study, Claude achieved 91% accuracy on systematic-review data extraction and saved 41 minutes per study. In contrast, an LLM parsing résumés extracted the right data but attached job descriptions to the wrong positions, showing that it can lose the structural relationships a conventional OCR pipeline preserves.
Although LLMs have powerful capabilities, they are far from infallible. An LLM's ability to derive understanding and meaning from unstructured data depends primarily on its document-processing front end. In high-risk applications that require precise results, the potential for error grows when a document contains minor formatting variations, poorly scanned images that do not readily convey the data, or fields left blank. While LLMs handle many tasks accurately (for example, document sorting and identifying and extracting key information from invoices), they commonly struggle with unusual layouts and with exceptions and edge cases where the expected behavior is undefined. The ideal approach to these limitations is to employ LLMs alongside other system components: validation rules, structured data extraction, and human oversight of high-risk applications.
To me, what's most notable about LLMs is that they don't produce consistent results on operational documents such as invoices, shipping records, and supplier specifications. These document types often vary in format, terminology, and quality, and LLMs may misinterpret key elements unless specific guardrails are in place. Consistency and repeatability remain major obstacles. I've seen LLMs help speed up internal documentation and product description generation, but I haven't seen them reliably automate extraction from vendor paperwork. In a supply chain, a single incorrect quantity or SKU can cause downstream problems, so we continue to rely on structured systems rather than fully autonomous LLM-based automation.