I've been asked how "agentic QA" actually behaves inside a real CI/CD pipeline, and my answer comes from modernizing a 75-year-old manufacturing business that now runs ERP, customer portals, and compliance software we can't afford to break.

On flakiness and determinism, we learned fast that non-deterministic agents cannot be allowed to gate a build. Any LLM-driven test generation or analysis runs outside the pass/fail path; the CI job itself remains deterministic, with fixed assertions, frozen model versions, temperature set to zero, and outputs cached and diffed. Even then, we treat AI results as advisory signals, not truth: if a test outcome can't be reproduced byte-for-byte, it doesn't get a vote in deployment. This mirrors how we treat measurement drift on the shop floor: if a gauge can't be calibrated, it doesn't certify parts.

I've also dealt directly with the agentic supply-chain problem, which looks a lot like counterfeit materials showing up in plating chemistry. We block autonomous dependency installation entirely: agents can suggest packages, but builds only pull from an allow-listed internal registry with pinned hashes, signed SBOMs, and SLSA-level provenance checks. Any new npm or pip dependency requires a human code review plus a license and vulnerability scan before it ever reaches CI. We added simple but effective guardrails: no network egress during builds, no dynamic package resolution, and alerts when an agent's suggestion deviates from the approved dependency graph. That's the only way I've found to let AI assist without letting it quietly poison the pipeline.

Dawn Stutzman, President, Stutzman Plating
https://www.linkedin.com/company/stutzman-plating/
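A minimal sketch of the kind of deviation alert Dawn describes, in Python: the build fails if the resolved dependency set strays from an approved, hash-pinned graph. The file names and JSON layout (approved-deps.json, resolved-deps.json) are illustrative assumptions, not her actual tooling.

```python
# Hypothetical CI gate: fail the build if the resolved dependency set
# deviates from an approved, hash-pinned dependency graph.
# File names and JSON layout are illustrative, not from the original setup.
import json
import sys

def load(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

def check_dependencies(approved_path: str, resolved_path: str) -> list[str]:
    approved = load(approved_path)   # {"left-pad": {"version": "1.3.0", "sha256": "..."}}
    resolved = load(resolved_path)   # same shape, produced by the build's resolver
    violations = []
    for name, meta in resolved.items():
        pin = approved.get(name)
        if pin is None:
            violations.append(f"{name}: not on the allow list")
        elif meta.get("sha256") != pin["sha256"]:
            violations.append(f"{name}: hash mismatch (expected {pin['sha256'][:12]}...)")
    return violations

if __name__ == "__main__":
    problems = check_dependencies("approved-deps.json", "resolved-deps.json")
    if problems:
        print("Dependency graph deviates from approved set:")
        for p in problems:
            print(f"  - {p}")
        sys.exit(1)  # deterministic gate: the agent's suggestion never installs itself
```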
Temperature=0 doesn't solve the problem. OpenAI's documentation admits outputs are only "mostly deterministic" even with fixed seeds. Anthropic's Claude docs say the same: "Note that even with a temperature of 0.0, the results will not be fully deterministic." The non-determinism doesn't come from sampling. It comes from three sources that a seed can't control:

1. MoE (Mixture of Experts) routing. Modern models route tokens to different expert subnetworks, and routing decisions can vary based on batch composition.
2. Floating-point accumulation order. GPU kernels use atomic operations where thread execution order isn't fixed. Different batch sizes or scheduling change the numeric paths.
3. CUDA library optimizations. PyTorch and TensorFlow use algorithms optimized for speed over reproducibility. Run-to-run variance is a feature, not a bug, from their perspective.

How we handle this in CI/CD:

- For unit tests: don't use external LLM APIs. Run a small local model (Ollama, llama.cpp) where you control the environment. vLLM with seed=0 gives consistent results on the same hardware.
- For integration tests: design assertions around semantic equivalence, not exact string matching. We use embedding similarity with a threshold (cosine > 0.95) rather than assertEqual().
- For regression detection: snapshot the distribution of outputs over N runs, not single outputs. If 95% of runs produce semantically equivalent results, the test passes. This mirrors how you'd handle any stochastic system.

The recent SGLang work from LMSYS shows you can achieve fully deterministic inference with batch-invariant kernels (34.35% overhead, 2.8x acceleration with CUDA graphs), but that requires self-hosting. If you're hitting OpenAI's API, you're accepting their non-determinism. The industry consensus: design for robustness to minor variations rather than expecting bit-perfect determinism. Your CI shouldn't block on LLM drift. It should block on semantic deviation.
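A minimal sketch of the semantic-equivalence assertion and the N-run snapshot check described above. The cosine > 0.95 threshold comes from the text; the sentence-transformers backend and the all-MiniLM-L6-v2 model are assumptions, chosen as one common open-source embedding option.

```python
# Sketch of a semantic-equivalence assertion: pass if the candidate output
# embeds close enough to a golden reference, rather than matching it exactly.
# sentence-transformers and the model name are assumed choices, not
# necessarily what this team runs; the 0.95 threshold comes from the text.
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assert_semantically_equal(candidate: str, reference: str, threshold: float = 0.95):
    emb = _model.encode([candidate, reference])
    sim = cosine(emb[0], emb[1])
    assert sim >= threshold, f"semantic drift: cosine={sim:.3f} < {threshold}"

def regression_pass(outputs: list[str], reference: str,
                    sim_threshold: float = 0.95, min_rate: float = 0.95) -> bool:
    """Snapshot-style check over N runs: pass if at least min_rate of the
    outputs are semantically equivalent to the reference, per the text."""
    embs = _model.encode(outputs + [reference])
    ref = embs[-1]
    hits = sum(cosine(e, ref) >= sim_threshold for e in embs[:-1])
    return hits / len(outputs) >= min_rate
```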
Look, if you let agents auto-update selectors on their own, you're asking for silent failures. That's the real nightmare. It's not about the test breaking; it's about the test passing because the agent latched onto a Submit button in the footer instead of the actual form. To stop that, we use what we call a Weighted Identity Score. When a locator fails, the agent looks at candidates and scores them on signals like DOM depth, sibling proximity, and visual similarity. If that confidence score drops below 0.9, we quarantine the fix in a Pending Review branch. We never, ever let an agentic heal commit straight to the main repo without a human looking at it. You have to keep the test suite from drifting away from the actual business logic.

Plugging non-deterministic agents into a rigid pipeline like Jenkins or GitHub Actions is tricky. You have to move away from binary pass/fail and start thinking in terms of probabilistic confidence. Even if you set your temperature to zero and use fixed seeds, you're still going to deal with environmental drift. Our move is to use a Consensus Engine: we run high-stakes scenarios three times in parallel. If the results aren't unanimous, the pipeline doesn't just crash; it triggers a Retry with Log Analysis. We also check the agent's path against a Reference Trace. If it deviates too much, we mark it Inconclusive and pull it out of the deployment-blocking suite. That keeps the pipeline reliable while still letting us use LLM reasoning for those messy UI flows.

Moving to agentic QA isn't about ditching scripts. It's about learning to manage the shift from deterministic logic to probabilistic reasoning. The teams that actually make this work are the ones building real guardrails, things like consensus engines and shadow branches, to make sure AI speed doesn't come at the cost of system integrity.
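The scoring function itself isn't published here, but a weighted identity score along these lines could look like the following sketch. The feature set and the 0.9 quarantine threshold come from the text; the weights, normalizations, and data shapes are assumptions.

```python
# Illustrative weighted-identity scoring for a failed locator. Features
# (DOM depth, sibling proximity, visual similarity) and the 0.9 quarantine
# threshold come from the text; weights and normalization are assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    selector: str
    depth_delta: int          # |candidate DOM depth - original element's depth|
    sibling_distance: int     # hops from the original element's sibling group
    visual_similarity: float  # 0..1, e.g. from a screenshot comparison

WEIGHTS = {"depth": 0.3, "sibling": 0.3, "visual": 0.4}  # assumed split
QUARANTINE_THRESHOLD = 0.9  # from the text

def identity_score(c: Candidate) -> float:
    depth_score = 1.0 / (1 + c.depth_delta)        # 1.0 when depth matches
    sibling_score = 1.0 / (1 + c.sibling_distance)  # 1.0 when siblings match
    return (WEIGHTS["depth"] * depth_score
            + WEIGHTS["sibling"] * sibling_score
            + WEIGHTS["visual"] * c.visual_similarity)

def route_heal(candidates: list[Candidate]) -> tuple[str, Candidate]:
    best = max(candidates, key=identity_score)
    if identity_score(best) < QUARANTINE_THRESHOLD:
        return ("pending-review", best)  # quarantine branch, human approves
    return ("heal-proposal-pr", best)    # still a PR, never direct to main
```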
The "Agentic" Supply Chain Attack: With coding agents (like Cursor, Devin, GitHub Copilot Workspace) gaining autonomy, how do you prevent them from hallucinating and installing malicious or typosquatted npm/pip packages during the build process? What guardrails do you have in your DevSecOps pipeline? Rubens Basso, CTO, FieldRoutes - https://www.linkedin.com/in/rubensbasso In our CI/CD experiments, the most sobering moment came when a coding agent helpfully "optimized" a Node service by swapping in a typosquatted npm package with nearly the same name as a popular logging library; the only thing that stopped it from reaching production was a software composition analysis gate that blocked any new dependency without an approved SBOM entry and a human code review on the pull request.
I've seen the "agentic" supply chain issue up close as a strategist working with engineering and security leaders, and the pattern is the same: you treat AI agents as untrusted juniors in the toolchain, not as a root user. On package installs, teams I work with don't let agents call npm/pip directly in CI. Agents can propose changes (new deps, version bumps) in a PR, but the actual install runs inside a locked policy layer. That usually means: allow-listed package names, deny-listed scopes, and version pinning via lockfiles. If a suggested package isn't on the allow list, the pipeline fails and flags it for human review. That alone blocks most typosquats and random GitHub "solutions". We also wire software composition analysis (SCA) and signing in front of deploy. So even if an engineer clicks "accept" on an agent's suggestion, the SCA scan and signature check (Sigstore/Cosign or similar) can still stop it before it hits staging or prod. Audit logging is key. Every agent action that touches code or config is logged with who approved it, which model was used, and what context it saw. When something looks off, teams can trace if it began from a hallucinated code change, a bad prompt, or a compromised dependency. Name: Josiah Roche Title: Fractional CMO Company: Silver Atlas Bio: www.silveratlas.org
On the question of how "self-healing" QA frameworks are verified so they don't quietly attach to the wrong element, my experience has been that blind healing is where teams get burned. When I've seen AI-driven tests auto-update a selector after a frontend deploy, we only trusted the change if it passed a secondary validation layer: visual bounds checks, expected text comparison, and a diff against historical interaction patterns. In one rollout, a self-healed test started clicking a correct-looking button that triggered a different conversion path, and it wasn't caught until analytics flagged a drop; after that, every "heal" required human approval before merge. The practical takeaway is that self-healing should propose fixes, not commit them, and those fixes should be reviewed like production code because they can change business logic, not just tests.

On the risk of agentic supply-chain attacks in CI/CD, the reality is that autonomous coding agents will hallucinate dependencies if you let them. I've seen builds break, and once almost ship, because an agent pulled a similarly named npm package that wasn't on our allowlist. The guardrail that worked was brutal but effective: dependency installation was locked to a preapproved manifest, outbound network calls during builds were restricted, and every new package required a signed approval step. Treat AI agents like junior engineers with root access: use package pinning, SBOM scans, and deny-by-default policies, or your pipeline becomes the easiest attack surface you own.

Brandon Leibowitz, Founder & SEO Strategist, SEO Optimizers
https://seooptimizers.com/about/brandon-leibowitz/
https://www.linkedin.com/in/brandonleibowitz/
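Brandon's secondary validation layer can be sketched as independent checks that must all pass before a heal is even proposed. The text comparison and bounding-box overlap below are assumptions about how such checks might be wired; the historical-interaction diff is omitted for brevity.

```python
# Sketch of a secondary validation layer for a healed selector: the heal is
# only proposed (never merged) if the new element's text and on-screen
# bounds match the recorded baseline. Shapes and tolerances are assumptions.
from dataclasses import dataclass

@dataclass
class ElementSnapshot:
    text: str
    x: float
    y: float
    width: float
    height: float

def bounds_overlap(a: ElementSnapshot, b: ElementSnapshot) -> float:
    """Intersection-over-union of the two bounding boxes."""
    ix = max(0.0, min(a.x + a.width, b.x + b.width) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.height, b.y + b.height) - max(a.y, b.y))
    inter = ix * iy
    union = a.width * a.height + b.width * b.height - inter
    return inter / union if union else 0.0

def heal_is_plausible(baseline: ElementSnapshot, healed: ElementSnapshot,
                      min_iou: float = 0.5) -> bool:
    text_ok = healed.text.strip().lower() == baseline.text.strip().lower()
    visual_ok = bounds_overlap(baseline, healed) >= min_iou
    return text_ok and visual_ok  # both gates pass, then human review as above
```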
We integrate AI agents into our CI flow at PuroClean to draft test data and refactor scripts, but we lock them down hard. I require temperature set to zero and fixed seeds inside GitHub Actions so outputs stay stable and diffable across runs. When an agent proposes a new npm package, our pipeline checks it against an internal allowlist and runs Snyk plus checksum validation before install. Last year we blocked three typosquatted packages that slipped into a branch after an auto-refactor suggestion.

For self-healing tests, we never auto-commit the heal. The agent opens a pull request with a DOM diff, screenshot diff, and a confidence score above 0.92. A senior SDET reviews it before merge, because a wrong selector can pass but break revenue flows. That discipline cut flaky test failures by 38 percent and kept our build times steady, even when the model drifted slightly.

Logan Benjamin, Co-Founder, PuroClean
https://www.linkedin.com/in/puroclean-corporate-headquarters
https://waterdamage-bocaraton.com
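The checksum-validation step Logan mentions might look like the sketch below: compare the integrity value npm's public registry publishes for a name@version against what the lockfile pinned. The registry endpoint is npm's standard one; the lockfile parsing assumes the v2/v3 package-lock.json format, and error handling is omitted for brevity.

```python
# Sketch of checksum validation for a proposed npm package: compare the
# integrity value the public registry publishes for name@version against
# what the lockfile pinned. Assumes package-lock v2/v3; minimal error
# handling for brevity.
import json
import urllib.request

def registry_integrity(name: str, version: str) -> str:
    url = f"https://registry.npmjs.org/{name}/{version}"
    with urllib.request.urlopen(url) as resp:
        manifest = json.load(resp)
    return manifest["dist"]["integrity"]  # e.g. "sha512-..."

def lockfile_integrity(lock_path: str, name: str) -> str:
    with open(lock_path) as f:
        lock = json.load(f)
    # package-lock v2/v3 keys entries by install path
    return lock["packages"][f"node_modules/{name}"]["integrity"]

def checksum_ok(name: str, version: str,
                lock_path: str = "package-lock.json") -> bool:
    return registry_integrity(name, version) == lockfile_integrity(lock_path, name)
```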