Case Study · Mar 27, 2026 · 6 min read

How We Cut an E-Commerce Return Rate by 28% With AI

A case study on the AI-powered sizing and recommendation system we built for a DTC fashion brand that slashed returns and boosted revenue.


Returns are the silent killer of e-commerce margins. For our client, a direct-to-consumer fashion brand doing $12M in annual revenue, returns were running at 34%. That's roughly $4M in gross merchandise going back through the warehouse every year. Shipping both ways. Restocking labor. Lost customers who never reorder after a bad fit. When they came to us, they didn't need a new website. They needed a system that stopped the bleeding.

The Problem

The brand sold across 14 product categories with sizing that varied between manufacturers. A medium in one line ran small; a large in another ran big. Customers had no reliable way to predict fit without ordering multiple sizes and returning what didn't work.

Their existing size guide was a static chart. It hadn't been updated in two years. Customer service was fielding 200+ sizing questions per week, and the team estimated that 68% of returns cited "didn't fit" as the reason.

The math was painful:

  • Average order value: $85
  • Return rate: 34%
  • Cost per return (shipping + processing): $14
  • Annual return cost: ~$570,000 in direct logistics alone
  • Lost revenue from churned customers: estimated $800,000/year

They'd tried a third-party sizing widget. It improved things by about 5 percentage points but felt generic and didn't account for the brand's specific fit inconsistencies across suppliers.

Our Approach

We proposed a three-part system: an AI sizing engine, a smart recommendation layer, and a feedback loop that got better with every purchase.

Phase 1: Data Foundation (Weeks 1-2)

Before writing any models, we needed clean data. We pulled two years of order and return history, roughly 180,000 transactions. We cross-referenced:

  • Items purchased vs. items returned
  • Customer height, weight, and fit preference data (collected optionally at checkout)
  • Product reviews mentioning fit ("runs small," "true to size," "baggy")
  • Which sizes customers reordered after returning the first attempt

This gave us a fit-accuracy score for every SKU-size combination. Some products had a 90% keep rate in their recommended size. Others were below 50%. That variance was the problem and the opportunity.
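That fit-accuracy score is just the keep rate per SKU-size pair. A minimal sketch of the computation, assuming a flat list of (sku, size, returned) transaction records (the field names and sample SKUs are illustrative, not the client's schema):

```python
from collections import defaultdict

def fit_accuracy_scores(transactions):
    """Compute the keep rate (1 - return rate) for every SKU-size pair.

    `transactions` is an iterable of (sku, size, returned) tuples,
    where `returned` is True if the item came back.
    """
    kept = defaultdict(int)
    total = defaultdict(int)
    for sku, size, returned in transactions:
        key = (sku, size)
        total[key] += 1
        if not returned:
            kept[key] += 1
    return {key: kept[key] / total[key] for key in total}

orders = [
    ("TEE-01", "M", False), ("TEE-01", "M", False),
    ("TEE-01", "M", True),  ("TEE-01", "S", True),
]
scores = fit_accuracy_scores(orders)
# TEE-01 in size M was kept 2 of 3 times; size S was always returned
```

The real pipeline ran this over 180,000 transactions and joined in the review-derived fit signals, but the core aggregation is this simple.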

Phase 2: The Sizing Engine (Weeks 3-5)

We built a recommendation model that took three inputs: the customer's body measurements (or past purchase history), the specific product they were viewing, and the real-world fit data from previous buyers.

The engine didn't just say "you're a medium." It said: "Based on 2,400 purchases of this item, customers with your measurements kept the size M 89% of the time. Size S was kept 23% of the time."

Confidence scores changed everything. Customers trusted a data-backed recommendation more than a generic size chart. And when the model's confidence was low (below 70%), it said so explicitly and suggested the customer contact support before ordering.
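The 70% threshold and the keep-rate wording above can be sketched as a small decision function (the function shape and payload fields are our illustration, not the production code):

```python
def size_recommendation(keep_rates, threshold=0.70):
    """Pick the size with the highest keep rate among similar buyers.

    `keep_rates` maps size -> fraction of similar customers who kept it.
    Below the confidence threshold, defer to support instead of guessing.
    """
    best_size = max(keep_rates, key=keep_rates.get)
    confidence = keep_rates[best_size]
    if confidence < threshold:
        return {
            "size": None,
            "confidence": confidence,
            "message": "Low confidence -- contact support before ordering.",
        }
    return {
        "size": best_size,
        "confidence": confidence,
        "message": (
            f"Customers with your measurements kept size "
            f"{best_size} {confidence:.0%} of the time."
        ),
    }

rec = size_recommendation({"S": 0.23, "M": 0.89})
# Recommends M with 89% confidence, matching the example above
```

Surfacing the confidence number itself, rather than hiding it behind a single size label, is what made the recommendation feel data-backed rather than generic.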

We integrated this directly into the product detail page. No separate app. No pop-up quiz. Just a persistent sizing panel that updated in real time as the customer entered their info.

Phase 3: The Feedback Loop (Weeks 5-7)

Static models decay. Fashion brands change suppliers, adjust cuts, and introduce new lines. We built an automated feedback loop:

  • Every return tagged with "sizing issue" fed back into the model
  • Post-purchase surveys collected fit feedback (3-question, 15-second format)
  • The model retrained weekly on fresh data
  • Products with degrading fit scores got flagged for the merchandising team

This meant the system got more accurate over time, not less. By month three, the model's fit prediction accuracy hit 91% across all categories.
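The merchandising flag in that loop can be sketched as a weekly check over per-SKU fit scores. The floor and drop thresholds here are illustrative assumptions (the article's alerting threshold was 80% per category):

```python
def flag_degrading_products(weekly_scores, floor=0.80, drop=0.05):
    """Flag SKUs whose fit score fell below the floor or dropped sharply.

    `weekly_scores` maps sku -> list of weekly fit-accuracy scores,
    oldest first. Flagged SKUs go to the merchandising team for review.
    """
    flagged = []
    for sku, scores in weekly_scores.items():
        if len(scores) < 2:
            continue  # not enough history to judge a trend
        latest, previous = scores[-1], scores[-2]
        if latest < floor or (previous - latest) >= drop:
            flagged.append(sku)
    return flagged

flags = flag_degrading_products({
    "TEE-01": [0.91, 0.90],  # stable, no action
    "JKT-07": [0.88, 0.76],  # sharp week-over-week drop -> flag
})
```

A check like this is cheap to run after each weekly retrain and catches supplier or cut changes before they show up in the return rate.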

The Technical Stack

For teams interested in the implementation details:

  • Model: Gradient-boosted classifier (LightGBM) for size recommendations, with an LLM layer for parsing unstructured review text into fit signals
  • Infrastructure: Hosted on AWS Lambda for on-demand scaling. Average inference time: 120ms
  • Integration: REST API called from the Shopify storefront via a custom theme extension
  • Data pipeline: Automated ETL pulling from Shopify, returns platform, and review aggregator into a unified fit database
  • Monitoring: Weekly accuracy reports with automated alerts if any category drops below 80% prediction accuracy
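The Lambda-hosted REST endpoint in that stack looks roughly like the following handler. The payload shape, field names, and stubbed inference are assumptions for illustration, not the client's actual API:

```python
import json

def lambda_handler(event, context):
    """Illustrative AWS Lambda handler for the sizing endpoint.

    Expects an API Gateway proxy event whose JSON body carries the
    product id (and, in production, customer measurements); returns
    a size recommendation. Model inference is stubbed out here.
    """
    body = json.loads(event["body"])
    sku = body["sku"]
    # Production would load the LightGBM model and run inference;
    # this stub returns a fixed recommendation for the sketch.
    recommendation = {"sku": sku, "size": "M", "confidence": 0.89}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(recommendation),
    }

# Simulate an API Gateway call from the storefront theme extension
resp = lambda_handler({"body": json.dumps({"sku": "TEE-01"})}, None)
```

Keeping the handler stateless is what lets Lambda scale it on demand and hold the average inference time around the 120ms figure above.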

Total build time: 7 weeks from kickoff to production. Total cost: a fraction of what one year of excess returns was costing them.

Results

We tracked performance over the first 90 days post-launch.

Return rate: 34% down to 24.5%. A 28% relative reduction. On a $12M revenue base, that translated to roughly $160,000 in saved logistics costs annually, before accounting for the revenue retained from customers who would have churned.

Average order value increased 12%. When customers trust the sizing recommendation, they buy more confidently. Multi-item orders went up because the "order two sizes and return one" behavior dropped.

Customer support sizing tickets dropped 54%. The team reallocated 15 hours per week from answering "what size should I get?" to higher-value work.

Net Promoter Score improved by 8 points in the quarter following launch. Customers specifically cited "sizing accuracy" in open-ended feedback.

Model accuracy at 90 days: 91.3%. The feedback loop was working. Early predictions were 84% accurate. Three months of real-world data pushed that above 90%.

Key Takeaways

This project reinforced a few things we believe about AI in e-commerce:

1. The best AI projects fix revenue leaks, not just add features. This wasn't about flashy technology. It was about plugging a $1.4M annual drain. The ROI case wrote itself.

2. Your historical data is the competitive advantage. The third-party widget couldn't access two years of brand-specific fit data. Our model could. That's the difference between 5% improvement and 28%.

3. Feedback loops are non-negotiable. A model that doesn't learn from its mistakes is a liability. The weekly retraining cycle is what kept accuracy climbing instead of decaying.

4. Integration matters more than intelligence. The smartest model in the world is useless if it lives in a separate app that customers don't open. Embedding the recommendation directly into the PDP, with zero friction, drove adoption above 60% of sessions.

If your e-commerce business is losing margin to returns, misfit recommendations, or generic product experiences, the data to fix it is probably already sitting in your systems. You just need the right model on top of it. We can help you find it.
