I usually work backwards. I figure out the exact decision the model needs to support and the smallest amount of information it has to extract to make that decision reliably. That becomes the spine of the schema. From there, I run a tiny annotation pass myself--maybe a few dozen examples. It's the quickest way to spot the places where the labels get fuzzy or where two reasonable people might disagree. If the rules aren't clear enough for humans, a model is going to struggle even more. On one B2B project, we originally planned to label "intent," which looked straightforward on paper but turned out to be a judgment call every other time. We shelved that idea and focused on concrete actions instead--things like whether someone booked a demo or checked the pricing page. The model learned to infer intent later on. That switch saved us from having to redo the entire dataset.
I usually begin by talking with the people who will actually rely on the model. I want to understand what they expect it to predict, how those predictions get used downstream, and where the threshold is for "this is good enough." Once that's clear, I sketch out a schema that covers the real decision space, including the oddball cases and any operational limits. Most of the time it ends up being a focused core set of labels with a flexible metadata layer that the model can safely ignore if it doesn't need it. After that, we label a small test batch, train a lightweight model, and dive into the mistakes right away. That early pass usually exposes labels that are too fuzzy, redundant, or just not pulling their weight. I make one solid revision to the schema at this stage, and only then do we move to larger-scale annotation. We keep careful notes as we go so future tweaks don't break what's already been labeled. It takes a bit more patience at the start, but it saves us from having to redo everything after the first model run.
If you want a strong schema, first mark up a small "golden set" yourself. That manual pass surfaces ambiguities and corner cases you won't see in theoretical planning. Avoid monolithic, one-dimensional categories in favor of a hierarchy or attribute-based labels (e.g. tag color and shape separately rather than "red-square"). That modularity means you can later merge classes together, or split them apart again, without redoing work. Spell out borderline cases explicitly with examples. This flexible structure does require that the way you store your data stay consistent, but it comes with many benefits.
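A minimal sketch of the attribute-based idea above (the field names and classes are illustrative, not from any specific project): store independent attributes, and derive flat classes on demand so regrouping never means relabeling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    # Attributes are stored separately instead of fused classes like "red-square".
    item_id: str
    color: str   # e.g. "red", "blue"
    shape: str   # e.g. "square", "circle"

def to_class(a: Annotation) -> str:
    """Derive a flat class on demand; change this mapping, not the stored data."""
    return f"{a.color}-{a.shape}"

def regroup_by_shape(annotations):
    """A later decision: collapse colors and keep only shape classes."""
    return [a.shape for a in annotations]

anns = [Annotation("1", "red", "square"), Annotation("2", "blue", "square")]
print(to_class(anns[0]))        # red-square
print(regroup_by_shape(anns))   # ['square', 'square']
```

Because the raw annotations never encode the fused class, merging or splitting categories is a one-line change to the derivation function.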
I treat annotation schemas as versioned artifacts from day one. My approach is to document explicit decision rules and edge cases alongside the labels, then lock the core schema early. Instead of changing labels after the first model run, I introduce mapping layers that translate old labels into new task-specific representations. This approach avoids full relabeling while still allowing the modeling strategy to evolve quickly.
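One way the mapping-layer idea could look in practice (label names here are hypothetical): raw annotations stay frozen in the original schema, and each model run consumes a task-specific view derived at load time.

```python
# Hypothetical v1 -> task mapping; the stored labels never change.
V1_TO_TASK = {
    "booked_demo": "high_intent",
    "viewed_pricing": "high_intent",
    "opened_email": "low_intent",
}

def remap(labels, mapping, default="other"):
    """Translate stored labels into a task-specific representation."""
    return [mapping.get(lbl, default) for lbl in labels]

raw = ["booked_demo", "opened_email", "unsubscribed"]
print(remap(raw, V1_TO_TASK))  # ['high_intent', 'low_intent', 'other']
```

Evolving the modeling strategy then means shipping a new mapping dict, not a relabeling campaign.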
Our whole stakeholder process is about preventing schema ambiguity before it can scale. We start with a small pilot set--often only a few hundred examples--chosen to contain the hardest schema decisions we expect to see. We bring the data scientists, the subject-matter experts, and the actual annotators together to label this data collaboratively, surfacing disagreements and edge cases in a single room. From that pilot we build what we call a "living guide": the annotation guide isn't a static document, it's a playbook of edge cases, with visual examples of the items we were unsure about and how we resolved them. "Best practices for labeling data" reminds us to document edge cases so we stay consistent even as projects evolve. We also give each annotator an easy "flag for review" category; if a lot of items with a given label get flagged, it's generally the schema, not the annotators, that needs adjustment.
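The flag-rate signal described above could be computed like this (the 20% threshold and record shape are assumptions for illustration): if a large share of items carrying a given label were flagged for review, suspect the schema rather than the annotators.

```python
from collections import Counter

def flag_rates(records):
    """records: list of (label, was_flagged) pairs -> per-label flag rate."""
    totals, flagged = Counter(), Counter()
    for label, was_flagged in records:
        totals[label] += 1
        if was_flagged:
            flagged[label] += 1
    return {lbl: flagged[lbl] / totals[lbl] for lbl in totals}

def labels_to_revisit(records, threshold=0.2):
    """Labels whose flag rate exceeds the (illustrative) threshold."""
    return sorted(lbl for lbl, r in flag_rates(records).items() if r > threshold)

data = [("intent", True), ("intent", True), ("intent", False),
        ("action", False), ("action", False)]
print(labels_to_revisit(data))  # ['intent']
```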
Designing schemas in light of change: when public datasets fall short, I begin by mapping out the operational goals of the model and identifying the minimal core set of object classes that should stay invariant across every iteration. I then set up a hierarchical labeling scheme in which the base categories are anchored and sub-classes can be added or revised under their parent class, so the taxonomy can evolve without disturbing the top level. Each label carries useful semantic metadata--contextual details, visibility issues, and spatial relations--that preserves meaning even as the schema evolves. Before committing to the resource-draining work of full labeling, I test the schema on a pilot subset and confirm it can absorb refinements without complete relabeling. That ensures consistency across rounds of training and saves a lot of annotation time as the dataset grows. The real secret is treating your schema as a living system: stable at the core, ready to grow wherever machine learning requires.
Designing an annotation schema is an iterative process, not a one-time task. When I begin, I typically create a fundamental, versatile labeling schema with room for additional or optional attribute information, rather than creating too many detailed classes. After my initial model run, I review which labels were actually helpful for making predictions and make adjustments as needed, without drastically changing everything. This iterative design lets me keep using my original dataset for future model development while continually improving it.
I typically begin by identifying the client's unique requirements for their project. For example, an antique dealer handling high-priced antiques will need a different annotation schema from an art gallery that sells contemporary art. In my experience it is important to create flexible schemas that can be easily adapted to client goals, yet are also as specific as possible so that less effort is required when applying them.
Schema design begins with error tolerance, not ideal labels. Classes get defined around the decisions the model must support, not visual purity. Early schemas stay coarse, with attributes layered separately so meaning can expand without disrupting structure. Every label carries a short written rule and one counterexample to prevent drift. A small pilot set goes through training before full annotation, which exposes ambiguity quickly. If the model fails, attributes change before classes split. That way there is no rework: labels change through extension rather than replacement. Flexibility at the beginning ensures velocity in the future.
Start with the decision the model has to support in the actual workflow, then design the labels around what staff would do next. In a clinic environment, that usually means triage, follow-up urgency, choice of billing code, or care plan reminders. The schema stays stable if it separates facts from interpretation. Text spans are tagged for concrete entities like symptom, duration, medication, dose, side effect, and insurance status, plus a smaller set of downstream flags for interpretation like urgent, routine, or needs clarification. Raw spans seldom change; interpretation does. That structure avoids relabeling when thresholds change. A tight loop keeps drift under control: guidelines include ten to fifteen edge-case examples per label, updated weekly for the first month only. Five percent of items are double-labeled to quantify disagreement, and labels that cause repeated disagreement get split or reworded early on. A catch-all label remains available for rare cases, and these get reviewed in a weekly bucket so the schema grows intentionally rather than breaking.
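The facts-versus-interpretation split above can be sketched as follows (field names and the seven-day threshold are illustrative assumptions): concrete spans are annotated once, while interpretation flags are derived from thresholds, so changing a threshold never touches the labeled spans.

```python
# A labeled span carries only facts.
span = {"entity": "symptom", "text": "chest pain", "duration_days": 10}

def urgency(span, urgent_after_days=7):
    """Interpretation layer: derived at query time, never annotated."""
    return "urgent" if span["duration_days"] >= urgent_after_days else "routine"

print(urgency(span))                        # urgent
print(urgency(span, urgent_after_days=14))  # routine
```

Moving the urgency cutoff is a config change, not a relabeling pass.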
When the public datasets I have don't match the conditions in the real world that I'm looking for, I start designing the annotation schema similar to how I would design an API with a stable core set of primitives, optional extensions, and versioning from the outset. I don't start with "What labels do we want?" I start with "What decisions will our model need to make in production and what kind of ambiguity will we face in the field?" The latter approach helps to frame the right structure early on, rather than finding it out after the first round of models fail due to edge cases. On a practical level, I define the minimum base layer for the annotation schema and make it future-proof: geometry + identity + visibility / occlusion + confidence / ignore. I create one-level attributes for anything that is likely to change over time. I also include schema versioning, clear fallbacks for ambiguous cases, and a bucket for "unknown / other" so that annotators don't feel like they are forced into the wrong buckets. Then I do a small-scale pilot, measure the disagreement among annotators, and only scale the schema once it's been verified as producing consistently accurate labels under pressure. This method allows me to iterate on the model without throwing away the entire dataset; changes to the annotation schema are incremental, adding new attributes, refining the guidelines, and selectively reworking cases rather than requiring a complete relabeling of the dataset.
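A minimal sketch of that future-proof base layer (the class, field names, and version tag are hypothetical, not any particular tool's format): geometry, identity, visibility/occlusion, and confidence/ignore in the core, one-level attributes for anything likely to change, and a schema version on every record.

```python
from dataclasses import dataclass, field
from typing import Optional

SCHEMA_VERSION = "1.1"  # illustrative version tag stored with every record

@dataclass
class BoxAnnotation:
    # Stable base layer: geometry + identity + visibility + confidence.
    bbox: tuple                      # (x, y, w, h)
    label: str                       # "unknown" rather than a wrong bucket
    occluded: bool = False
    ignore: bool = False
    confidence: Optional[float] = None
    # One-level attributes for anything likely to change over time.
    attributes: dict = field(default_factory=dict)
    schema_version: str = SCHEMA_VERSION

a = BoxAnnotation(bbox=(10, 20, 50, 80), label="unknown",
                  attributes={"weather": "fog"})
print(a.schema_version)  # 1.1
```

New attributes slot into the `attributes` dict under a bumped version, leaving existing records valid.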
A stable annotation schema begins with stable business objects. Product category, device type, diagnosis-based need, payer document type, repair request, delivery constraint, and access requirement are likely to outlast model changes, whereas a model's initial outputs are not. A schema built from those enduring things stands on firmer ground than one built from whatever the first model happens to notice. Label names stay simple and pegged to decisions already being made, like whether a case goes to respiratory, complex rehab, orthotics, repairs, or home access. Versioning is in place from day one. Labels stay hierarchical, with a small top-level group and optional attributes below, such as laterality, size, urgency, and prior authorization required. An "unknown" label stays active and is revisited weekly, and items are promoted into the taxonomy only after many repeated appearances. As a stability test, a small gold set of 200-300 examples is re-annotated after each revision. The practice avoids drift, reduces relabeling, and makes improvements incremental rather than disruptive.
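The gold-set stability test above could be as simple as measuring agreement between the stored gold labels and a fresh re-annotation after each schema revision (labels here are illustrative; a drop in the score signals drift worth investigating):

```python
def agreement(gold, new):
    """Fraction of gold-set items whose new label matches the stored one."""
    assert len(gold) == len(new)
    return sum(g == n for g, n in zip(gold, new)) / len(gold)

gold = ["respiratory", "repairs", "orthotics", "repairs"]
new  = ["respiratory", "repairs", "complex_rehab", "repairs"]
print(round(agreement(gold, new), 2))  # 0.75
```

In practice a chance-corrected measure such as Cohen's kappa is often used instead of raw agreement, but the workflow is the same.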
President & CEO at Performance One Data Solutions (Division of Ross Group Inc)
I always talk to two types of people, the ones who use our software constantly and the ones who manage it. This helps me catch the weird edge cases early. At Performance One Data Solutions, we started with a really basic setup instead of perfect categories from day one. Our system would flag anything that looked messy for a person to check. When teams started working differently, we adapted quickly without fixing thousands of entries by hand. Don't over-engineer your first labeling plan. Let automation show you what's uncertain.
Working on creative media at Meta and now Magic Hour taught me not to over-design schemas. We looked at a few annotation frameworks, but starting with broad categories and evolving them through pilots worked much better. It took a couple rounds of feedback to nail down which labels actually improved our video edits. Here's my takeaway: document every label change and review the schema after each model run. It saves you from a huge relabeling job later.
I usually work directly with engineers to figure out what user actions actually matter for attribution or cashback. At CashbackHQ, we got the event tagging right first, so we could update labels as campaigns changed without relabeling everything. Tracking campaign feedback in our sprint reviews helped us adjust the structure as new data came in. Invest time up front in a modular setup. Your future self will thank you.
At Roy Digital, we got burned using rigid public taxonomies. They just didn't fit our actual work. What worked was starting small. We'd map out a few key user scenarios, then test a tiny set of labels with quick annotation runs and a basic model. Figuring out the kinks on a small scale first made later changes a simple fix instead of a huge project. I'd recommend getting team consensus after each model pass before you label more data.