Amazon

Chiller Correction of Errors

See Amazon's Correction of Errors in action with a full fictitious example. The company noticed a rise in air temperature levels in their X°C and Y°C chillers, which led to a full product recall of certain fresh meat and seafood products. Amazon's prompt response to this issue demonstrates their commitment to customer safety and satisfaction.

The root cause of the problem was identified as a buildup of ice and dust on the chiller blower, which reduced its output and affected the optimal temperature levels for storing fresh products. Amazon has since implemented various corrective measures, such as increasing the nominal capacity of their cold rooms, reducing the heat load, and improving temperature monitoring and escalation procedures.

This incident serves as a valuable lesson for Amazon and other companies in understanding the importance of having a thorough knowledge of their systems, even when outsourcing to third parties. By taking ownership of critical functions and establishing clear escalation paths and SOPs, businesses can ensure that they are better prepared to handle any potential issues that may arise in the future.


1. Description of problem and its impact

On [date], we noticed the air temperature levels in both the X°C and Y°C chillers began to rise. Some air temperature variation is normal, due to many factors including scheduled defrost cycles, where the chillers shut down with the goal of eliminating ice buildup on the unit. Over the next several days, we realized that the chiller in the y chamber would not be able to maintain optimal temperature levels for storing fresh meat and seafood. On [date], we determined that some products had been stored outside the recommended ranges from [date] through [date], and initiated a full product recall of the [product] and [product] products sold and delivered to customers between those dates.

Upon initiating the recall, we immediately took the following actions:

  1. Sent an email to all customers that had purchased affected products instructing them to dispose of the items and to contact us if they had any issues;
  2. Identified all customers who had already placed future orders for those items and informed them that we would not be able to ship those particular items;
  3. Posted a recall notice on the [company] homepage and its social media platforms;
  4. Disposed of all affected items in the warehouse;
  5. Notified the Agri-Food & Veterinary Authority of the incident and welcomed an inspection;
  6. Refunded directly affected customers the full amount of the item(s) and issued goodwill credits to the customers who received the impacted fresh [product] and [product] products;
  7. Provided forthcoming statements to multiple press requests and responded to customers’ comments online and via phone; and
  8. As an extra precaution, temporarily suspended the sales of [product], [product], certain [product], and [product] products until our warehouse cold rooms are operating normally.

It took x days to get the y chamber back into a normal operating mode. On [date], we started stocking up [products] so we could resume selling [product] products. The y chiller failed again during this first stock-up attempt. We had to discard additional product, though none of that product reached customers, so a second recall was not necessary. Our chilled rooms are now operating normally again. We have put in short-term fixes and are working on longer-term solutions to ensure this problem does not happen again.

Customer and Financial Impact

The recall directly affected x customers. The direct cost of the recall was approximately $x: $x in refunds, $x in goodwill credits towards future purchases ($x credits to x customers), and $x in inventory that was disposed of. Additionally, we were unable to sell much of our chilled range from [date] until the full range was back in place at the end of [date]. We lost potential sales but, more importantly, we lost customer trust during that timeframe.

2. Root causes

Background

For background, the fresh and frozen operation on level was started on [date], operating across 3 temperature regimes: x, y, and z. The cold rooms were designed and built by [mfr] based on assumptions given to [company] on [date] to calculate the heat load. The chambers are designed to work at x-y, z-q, and r. The insulated envelope was constructed by a contractor using standard sandwich panels and an insulated floor.

The operation and maintenance of the system is the responsibility of [company]. The temperature-controlled operation is under the umbrella of the [organization] license for the building. The chilled chambers are cooled by the [mfr] central ammonia system with a single direct expansion blower in each chamber. Whilst there is a standby compressor, there is no backup for the blowers. The central system is operating at capacity, and the freezer is supplied from a standalone Freon system. Overall, there had been no major issues with the temperature control until [date].

The food quality team has manually recorded the temperature for each of the 3 chambers on an hourly basis since the operation started.

Triggering event

The triggering event that led to the problem was a buildup of ice and dust on the blower that reduced the output of the chiller in the y chamber to the point where it could no longer maintain the desired temperature given the heat load in the room. There were several root causes, most of which stem from an insufficient understanding of, and operational control over, the effective chilling capacity, our heat load, and how those two factors interacted with each other. The chiller’s effective output gradually declined due to ice and dust buildup. Concurrently, we had also been gradually introducing a higher heat load into the y chamber because we needed more material, people, and activity to handle increasing order volumes. Once the heat load passed the effective capacity, we were unable to recover without impacting normal operations.
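The interaction between a declining effective capacity and a rising heat load can be illustrated with a minimal sketch. Every number, unit, and name below is a hypothetical assumption chosen for illustration; none of them comes from the incident data, and the linear decay and growth rates are a simplification.

```python
def first_day_over_capacity(nominal_kw, fouling_loss_per_day,
                            initial_load_kw, load_growth_per_day,
                            horizon_days=365):
    """Return the first day on which heat load exceeds effective capacity.

    Returns None if the load never exceeds capacity within horizon_days.
    All inputs are hypothetical; the model is deliberately linear.
    """
    for day in range(horizon_days + 1):
        # Effective capacity shrinks as ice and dust build up on the blower.
        effective_kw = nominal_kw * max(0.0, 1.0 - fouling_loss_per_day * day)
        # Heat load grows with more material, people, and activity in the room.
        load_kw = initial_load_kw + load_growth_per_day * day
        if load_kw > effective_kw:
            return day
    return None


# Hypothetical scenario: the blower loses 1% of nominal output per day.
with_fouling = first_day_over_capacity(100.0, 0.01, 72.0, 0.5)
# Same load growth with a perfectly clean blower.
without_fouling = first_day_over_capacity(100.0, 0.0, 72.0, 0.5)
print(with_fouling, without_fouling)  # 19 57
```

In this made-up scenario the fouling brings the crossover forward from day 57 to day 19, but the crossover still happens with a clean blower because the load keeps growing against a fixed nominal capacity.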

Root Causes

Why did the chilling system fail?

  1. The chilling system was not designed for our type of operation. When our vendor originally designed the chiller system, they did not anticipate the heat load our operations would place on these rooms. During the outage, the vendor was surprised to see the amount of activity we had to perform to receive, put away, pick, pack, and ship chilled product. They also were not aware that we would have blast freezers to chill our eutectic plates. As we worked with the vendor to understand their assumptions and chilling output specifications, we realized the y chamber is not sufficient for the growth of our business. So even if there were no reduction in effective chilling output due to dust and ice, we eventually would have hit the nominal capacity.
  2. We did not have effective temperature monitoring systems in place. The temperature in the chilled rooms was monitored, just not in a way that was as effective as we needed it to be. Security recorded the temperature regularly but there was no process to escalate changes. We also had a QA process that measured the temperature of outbound totes. However, we did not have the appropriate feedback loops in place to make sure this data made it back to the appropriate people. Moreover, during the chiller event, the QA person involved in testing and monitoring moved to another role. The role was not immediately filled. Also, since food temperature is such a key element in product safety, it should be the responsibility of the operations managers, not the security staff.
  3. We did not have clear guidelines and escalation procedures on how to handle temperature variance. With proper escalation procedures, we should have recognized and reacted to the issue sooner. There were no clear guidelines on what to do when a temperature variance occurs. As mentioned above, some air temperature variance is normal. A short-term air temperature variance could have little or no effect on the temperature of the product. However, a sustained air temperature variance may require action. There was no way for the person recording the temperature to know if the variance required follow-up action. Examples of follow-up actions include measuring specific food items, escalating to the appropriate internal people or groups, or requesting our vendor to immediately check and/or service the chilling unit.
  4. We relied on a vendor to provide and maintain a mission-critical piece of our infrastructure and did not have sufficient understanding, expertise, equipment, and controls to deal with the range of problems that could occur. During our discovery process we learned that our vendor did not have a regular maintenance schedule for the chilling units. They monitored and recorded temperatures, but did not inform us when the temperature was outside the design limit of 2 - 5, as it is our internal operation. We have also learned that, with the increased heat load, the chiller starts to ice up every two weeks. We also did not have a scissors lift to inspect the chilling unit ourselves.
  5. We had a single point of failure and could have taken more risk mitigation steps. The cooling system has a number of single points of failure. For instance, there is only one blower in each room. So when it goes down, whether for planned maintenance such as a ~1 hour defrost cycle or for an unplanned outage, there is no other chilling source for that room.
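The distinction in root cause 3 between a short-term and a sustained air temperature variance can be expressed as a simple rule that the person recording the temperature could apply. The sketch below is only an illustration: the function name, the limit, and the number of hours that counts as "sustained" are hypothetical assumptions, not values set by any food safety guideline.

```python
def classify_readings(readings_c, limit_c, max_hours_over):
    """Classify a list of hourly air temperatures, oldest first.

    Returns "ok" if the latest reading is within the limit, "watch" if the
    current run above the limit is still short (for example a defrost cycle),
    and "escalate" if the run has lasted max_hours_over readings or more.
    limit_c and max_hours_over are hypothetical thresholds.
    """
    hours_over = 0
    # Count consecutive over-limit readings, starting from the most recent.
    for temp in reversed(readings_c):
        if temp > limit_c:
            hours_over += 1
        else:
            break
    if hours_over == 0:
        return "ok"
    if hours_over >= max_hours_over:
        return "escalate"
    return "watch"


# Hypothetical hourly log with an assumed limit of 5.0 and a 3-hour rule.
print(classify_readings([4.0, 4.5, 6.0, 4.2], 5.0, 3))  # ok
print(classify_readings([4.0, 6.0, 6.5], 5.0, 3))       # watch
print(classify_readings([4.0, 6.0, 6.5, 7.0], 5.0, 3))  # escalate
```

A rule like this only tells the recorder when follow-up is needed. The follow-up itself (measuring product, notifying an owner, calling the vendor) still has to be defined in the escalation procedure.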

3. Corrective actions taken

We have made or are in the process of making the following changes to address the nominal and effective capacity of our cold rooms:

  1. To increase nominal capacity, we need an additional chilling system. We are working with our vendor for an additional cooling system that takes into account our current and projected heat load and operating characteristics. We also want to build in redundancy to eliminate current single points of failure.
  2. For [location], we are looking at solutions such as a chiller with 3 units, each providing 50% of the required capacity, so that any two units can carry the full load.
  3. We purchased a scissors lift so we can do twice-weekly physical inspections of the chilling unit. We want to detect any ice and dust buildup before it becomes an issue.
  4. We have established clearer communications and rules of engagement for the current chilling system.
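The redundancy goal behind a configuration such as three units at 50% each can be checked with a short calculation. This is a sketch under the assumption that capacities are expressed as fractions of the required load and that units fail one at a time; the function name is hypothetical.

```python
def survives_single_failure(unit_capacities, required_load):
    """Return True if the remaining units cover required_load
    after any one unit is taken out of service."""
    total = sum(unit_capacities)
    return all(total - capacity >= required_load
               for capacity in unit_capacities)


# Three units at 50% each: any two still deliver 100% of the load.
print(survives_single_failure([0.5, 0.5, 0.5], 1.0))  # True
# One blower at 100%: a defrost cycle or outage leaves nothing.
print(survives_single_failure([1.0], 1.0))            # False
```

The same check also shows that two units at 50% each would not be enough, because losing either one leaves only half of the required capacity.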

We took the following actions to reduce the heat load:

  1. The large door from the ambient room was reserved for vehicle traffic. Foot traffic was rerouted to the smaller door.
  2. We painted lines on the floor to outline the boundary where movement would trigger an automatic door opening.
  3. We also moved some manual work away from the door that occasionally triggered unnecessary door openings.
  4. We installed strip curtains to reduce airflow when the door was open.
  5. We moved the blast freezers out of the chiller 2 chamber.

We have made or are in the process of making the following changes to address the monitoring and escalation of temperature in our chiller rooms:

  1. We have a team of certified food safety experts who established guidelines of temperature and acceptable time/temperature variance.
  2. The temperature readings are now shared via a Google Sheet with the necessary people in operations, commercial, and facilities. 
  3. We are working on a more robust system using ARIMA, a time series forecasting model, that can alert us when a process has gone out of control and, even better, alert us when it is about to go out of control so we can take corrective action before it does.
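The idea of alerting before the temperature goes out of control can be sketched without a full forecasting model. The example below is not ARIMA: it substitutes a plain least-squares straight-line fit over recent hourly readings and extrapolates it to a limit. The function name, the readings, and the limit are hypothetical assumptions, and a real system would need to handle defrost cycles and noise far more carefully.

```python
def hours_until_limit(readings_c, limit_c):
    """Estimate hours until a linear trend through hourly readings hits limit_c.

    Returns 0.0 if the fitted value for the latest hour is already at or over
    the limit, None if the trend is flat or falling, else a positive estimate.
    """
    n = len(readings_c)
    if n < 2:
        raise ValueError("need at least two readings to fit a trend")
    # Ordinary least squares with x = 0, 1, ..., n-1 (hours, oldest first).
    mean_x = (n - 1) / 2
    mean_y = sum(readings_c) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings_c))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    latest_fit = intercept + slope * (n - 1)
    if latest_fit >= limit_c:
        return 0.0
    if slope <= 0:
        return None
    return (limit_c - latest_fit) / slope


# Hypothetical readings rising 0.5 degrees per hour toward an assumed limit of 5.0.
print(hours_until_limit([3.0, 3.5, 4.0, 4.5], 5.0))  # 1.0
```

An alert could then fire whenever the estimate drops below the time needed to respond, for example the time it takes to get the vendor on site.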

4. Lessons learned…Good and bad

Most of the lessons below can be applied to many areas in [company] outside of the chiller and operations area.

  1. In a complex system, even if part of it is outsourced to another company or another team, you need to understand, be able to predict, and be able to react to potential failure scenarios. We relied too much on a third party to maintain and monitor the chiller. We assumed they were doing things without verifying.
  2. It’s best to control your own destiny whenever possible. When the chiller problem escalated to our vendor, we had neither the information nor the expertise to solve the problem. It took too long for us to get them to take corrective action.
  3. Every critical function at [company] needs an owner, an escalation path, and published SOPs on what to do when a failure occurs. An owner cannot be a single person. There must be secondary people or groups identified beforehand if the primary is unavailable due to factors such as vacation, off-hours, or leaving the company.
  4. Even if you think the problem is solved, look at recovering in a gradual manner, versus all at once. We stocked up products too quickly and introduced too much heat load after we thought the chiller issue was solved. As a result, we negatively impacted customers and had to waste product a second time, albeit at a much smaller level.

Here are some things we did well during the event.

  1. We had excellent and immediate support across all levels of the company once the problem was appropriately escalated. People from operations, BI, commercial, product management, facilities, customer service, software, marketing, and legal, among others, all jumped in and did what it took to make things right for the customer.
