Fessing Up To Our Mistakes
Ross | May 7th, 2009
We started working on crowdSPRING in the summer of 2006. We incorporated the company in May 2007 and launched the crowdSPRING marketplace in May 2008. We’ve learned many important lessons along the way. In some ways, our experience is typical of other start-ups. In other ways, it is not. I want to share some of our adventures (and mis-adventures) in the hope that it’ll help others looking to start a company or those who’ve already launched a start-up.
Yesterday marked the one-year anniversary of the public launch of crowdspring.com. We shared in this post why we skipped the celebration.
Some who received our email have written asking us to share more details about the problems we’ve had over the past 10 days with the site and what we’ve learned in the process.
A Little Background
We built our application (starting in mid-2007) using PHP and eZ Publish. Before launch, we spent 9 weeks in a closed beta, thoroughly testing our hardware and software architecture. When we entered closed beta on March 6, 2008, our dev team concluded that our application would have difficulty fully scaling as traffic and registrations to our site increased. We had suspected before the closed beta that this could be an issue, and were extremely disappointed when our suspicions proved true. Our dev team then did something that continues to amaze us to this day. Before we launched publicly a mere 8 weeks later, they rewrote nearly all of the code for the site and made numerous hardware architecture changes. It was a brutal period for our entire team: many 20-hour days (especially for Chad), numerous all-nighters, and tons of problems.
When we officially launched in May of 2008, we were much better prepared to handle the traffic, but we remained concerned about our ability to fully scale. Within weeks after launch, we made significant hardware upgrades to our database server, anticipating much higher traffic as a result of our successful appearance at the Under The Radar Conference in California. We made additional hardware upgrades throughout the summer as we continued to gain visibility from some outstanding news coverage and continued attention. However, we realized mid-summer that hardware improvements alone would not allow us to fully and flexibly scale. The underlying problems were the result of the way our application was written and the structure of the content management system (CMS) that we were using. It’s an outstanding CMS, but for our application, it did not scale particularly well. Throughout the summer, we worked with one of the core developers of that CMS to find ways to address our concerns about scaling, but ultimately, we concluded that long-term, we would need to re-architect our application and start from scratch.
The Refactoring Process
We spent several months evaluating and researching various options, including .NET, Ruby on Rails, PHP, Python, and Java. All had various advantages and disadvantages. After much thought and debate, we elected to refactor our entire code base using Python and Django. We recognized that with a small dev team (3 people), we would face some significant challenges because we would need to concurrently work on the refactoring project and support our production site. As our former president George W. Bush used to say, we misunderestimated those challenges.
Refactoring work began in earnest very late Fall 2008 after we sat down and thoroughly planned how we would both support our production site and refactor. We wanted to spend at least 80% of our time focused on the refactoring effort. We had hoped to complete all refactoring by end of March 2009, test fully, and roll out the new code in early to mid April.
The Production Site Problems
During the refactoring process, we continued to see numerous problems on the production site. Some of the problems were typical of a growing site and we dealt with them quickly. Others were a bit more complicated and required more work. Still others underscored the scaling issues we had anticipated and required us to either make significant architectural changes to our software or push forward with the refactoring process. Believing that we would complete refactoring by the end of March (we were well on track to do so), we pushed forward in January and February 2009.
The production site problems didn’t impact all users – but they impacted many. The site was slow, notifications would sometimes fail to work (we send hundreds of thousands of emails every day), we had issues with image uploads, and registration sometimes didn’t work properly.
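One of the symptoms above was that notifications sometimes failed outright, even though the site sends hundreds of thousands of emails every day. A common way to make delivery more resilient is to retry failed sends with exponential backoff rather than giving up on the first transient error. The sketch below is purely illustrative: `send_with_retry` and `flaky_send` are hypothetical names, not crowdSPRING's actual mailer, and a real system would also queue messages durably.

```python
import time

def send_with_retry(send_fn, message, max_attempts=3, base_delay=0.01):
    """Attempt a send, retrying with exponential backoff on failure.

    `send_fn` is any callable that raises an exception when delivery
    fails (hypothetical interface; the real mailer is not public).
    """
    for attempt in range(max_attempts):
        try:
            return send_fn(message)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # back off: 0.01s, 0.02s, ...

# Simulated flaky transport: fails twice, then succeeds.
calls = {"n": 0}
def flaky_send(msg):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("SMTP timeout")
    return "sent: " + msg

result = send_with_retry(flaky_send, "project awarded")
```

With three attempts allowed, the two simulated timeouts are absorbed and the message still goes out on the third try.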
Our users are incredible. We think of them as our friends. Their patience with us during that period was absolutely stunning. We had always talked about the community at crowdSPRING, but we underestimated how close we had grown together – everyone was pulling to help improve the site. We’ve had many discussions in our forums about our refactoring process, and many of our users wrote to us privately to offer help and suggestions.
We continued to work on applying temporary “band-aids” to our application and making various smaller-scale hardware changes throughout the first quarter of 2009. In mid-March, when we realized that our refactoring process was going to take a bit longer (as a result of necessary changes we decided to make to our software architecture), we started working with our hosting provider, Rackspace, to plan additional hardware upgrades should we need them due to the increasing problems we were seeing from our application and its failure to properly scale with increasing traffic and use. Prior to that time, we were very fortunate to have received some outstanding recognition, including being a finalist for the 2008 Chicago Innovation Awards, winning a Business.com competition, and winning WIRED magazine's Small Biz 2008 award. We were very honored to have received this attention and, of course, very worried about the impact it would have on our problems with scaling our application.
The need to properly scale impacts small and large online services. For example, some of you might recall that in late 2008/early 2009, Twitter ran into major scaling issues. As Twitter’s popularity grew, it was unable to properly scale. Here’s a nice post with useful links to the actions Twitter took to fix their problems.
Our success and recognition was a mixed blessing. It was a testament to the outstanding effort of every single person on our team during the past year. But it also brought more traffic and accelerated our problems with scaling. These difficulties increased in April 2009, when we were honored with more awards and nominations (including a Webby nomination for Community).
Although we did not anticipate needing them once we rolled out our refactored code, we could no longer delay those hardware upgrades; in late April, we added additional servers to handle the increasing web traffic. We added the servers in part because we anticipated even more traffic: we were very fortunate to partner with LG and Autodesk to launch a major design competition in April for the design of LG’s next mobile phone, with more than $80,000 in awards.
The Last 10 Days
Over the last ten days, we continued to see further degradation of our site performance, despite our efforts to make significant changes to reduce the load. We temporarily removed certain features (such as the thumbnails displaying awarded projects in the Profiles of creatives) and made numerous software tweaks that worked well but simply could not keep up with the increasing traffic.
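Temporarily removing features under load, as we did with the awarded-project thumbnails, is essentially feature flagging: gating expensive parts of a page behind a switch that can be flipped without a deploy. A minimal sketch of the idea (all names here are hypothetical, not our actual code):

```python
# Hypothetical feature flags; expensive features can be disabled under load.
FLAGS = {"profile_thumbnails": False}  # turned off during the crisis

def render_profile(username, flags=FLAGS):
    """Return the sections to render on a creative's profile page.

    The thumbnail gallery is only included when its flag is on, so the
    costly image queries behind it can be skipped site-wide in seconds.
    """
    sections = ["bio", "portfolio"]
    if flags.get("profile_thumbnails"):
        sections.append("awarded_thumbnails")
    return sections
```

Flipping `profile_thumbnails` back to `True` restores the gallery with no code change, which is what makes this kind of switch useful in an incident.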
Yesterday (May 6, 2009 – our one year anniversary) – was a bad day for us. Many of our users had problems and the problems were widespread throughout the site. Our community was frustrated with us and our site and rightfully so. We failed them.
Our “band-aids” were failing, traffic was increasing, and we had to take prompt action. We assembled our team and refocused every single person on dealing with this crisis. Some focused on customer service while others focused on technical fixes.
We also decided that in lieu of the newsletter we typically send every few weeks to our users (highlighting projects on the site, interviews, and other news) we would send a personal note from us owning up to the problems, apologizing, and reassuring our community that we were working non-stop to fix every problem.
The response from our community to our note left us speechless. Mike is in Chicago and I am in Hamburg, Germany at the next09 conference. This morning we spoke and I, half-embarrassed, confided in Mike that the notes we received from hundreds of our users made me tear up. After a very short pause, he said the notes affected him the same way.
We cannot possibly convey the heartfelt, passionate, and humbling responses from our community to these problems. They made us proud to have worked so hard and to have invested so much effort in our business and in our community. The humanity of the responses demonstrated the very best in compassion, from people across the entire globe.
What We’ve Done In Response To The Problems
Over the past week, we focused a much greater effort on shoring up our current site. During the past two days, we focused 100% of our effort on correcting some of the scaling issues so that we can stabilize our site and eliminate the many problems our users have been seeing. During this period, we have put our refactoring project on hold and will not resume that effort until we have fully stabilized our entire site, which we hope to do today or tomorrow. Isolated problems continue and we are committed to resolving all of them.
Yesterday, working well into the night, our dev team completely rewrote several small but important areas of our code to better architect how we were dealing with traffic and access to our database. We made numerous architectural changes and improved our hardware configurations. We double- and triple-checked every configuration file, identified errors, and corrected them all. We thoroughly researched our server logs, identified problems, debated amongst ourselves, and implemented numerous fixes. Over the next few days, we’ll implement some more sophisticated caching techniques that will allow us to handle an even greater amount of traffic.
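The caching techniques mentioned above generally follow one pattern: serve a stored result while it is fresh, and only hit the database when the entry has expired. The sketch below is a minimal, self-contained illustration of that get-or-compute pattern; it is not our actual configuration (frameworks such as memcached or Django's cache API provide a production-grade version of the same idea, and the TTL and key names here are made up).

```python
import time

class TTLCache:
    """Minimal time-based cache illustrating the get-or-compute pattern."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get_or_set(self, key, compute):
        now = time.time()
        hit = self._store.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]          # fresh entry: skip the expensive call
        value = compute()          # e.g. a heavy database query
        self._store[key] = (value, now + self.ttl)
        return value

# Usage: cache a (simulated) expensive query for 60 seconds.
queries = {"count": 0}
def expensive_query():
    queries["count"] += 1
    return ["project-1", "project-2"]

cache = TTLCache(ttl_seconds=60)
first = cache.get_or_set("open_projects", expensive_query)
second = cache.get_or_set("open_projects", expensive_query)  # served from cache
```

The second lookup never touches the (simulated) database, which is exactly the load reduction caching buys on read-heavy pages.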
We believe we have corrected many of the problems that our users have experienced over the past week. We continue to work non-stop on two continents to make sure that we’ve addressed every single problem and every single user who has written to us. We are exploring a few additional improvements that will further help with the site performance until we roll out the new code.
Once we have fully stabilized our site, we will again refocus most of our effort on completing the refactoring process (we are 95% done and then will fully test the new code before we roll it out). We anticipate this effort will take another 4 weeks. We’re excited by the refactored code – our internal tests show outstanding performance improvements, and we’ve made lots of other improvements in many areas of our site.
What We Learned From Our Mistakes
Nobody wants to fail or to make a mistake. But a fear of failure can paralyze. We gambled on our ability to complete refactoring before the scaling issues we were seeing on our site in mid 2008 caught up to us. It was an honest and reasoned gamble. We could not have anticipated in our wildest dreams the amount of attention we would receive as a young, tiny startup based in Chicago.
But the gamble didn’t pay off. So, what have we learned from our experience and what can we share with you that might help you avoid making this mistake with your business? Here are five important lessons we’ve learned:
1. Full Transparency Builds Real Trust And True Respect. We fully believe that companies today must be as transparent as they can be. While we lost a great deal of credibility with our community during the past week, our commitment to transparency has reinforced our belief that the only way to build real trust and respect in a community is to remain fully transparent, especially during times of adversity. We’ve learned that talking WITH people is much better than talking TO them. The outpouring of support from our community has given us renewed energy to work around the clock to address the problems we’ve been having, and an even deeper respect for the tens of thousands of people who are part of our family on crowdspring.com. [For those who want to read more, there is a very interesting post that contains a fascinating look at Microsoft's policy shift in 2006, written by Steven Sinofsky: "Translucency vs. Transparency".]
2. Failure Educates and Motivates. Nobody wants to fail. But failure provides a certain clarity to our decisions and mistakes. This clarity helps us understand WHY we made wrong choices, to assess WHAT we could have done differently if we had the opportunity to make those choices again, and to evaluate HOW we could have approached those decisions with a different mindset. Most importantly, failure is a great motivator. People aren’t perfect and neither are companies. While we fully appreciate that many companies would have pointed a finger at their hosting provider, the computers, and pretty much anything other than themselves, we feel it important to admit privately and publicly that this one was entirely our own doing. We’ve made many great decisions over the past year. And we’ve made many bad decisions. The decisions that created the problems on our site over the past 10 days were bad decisions. [Here is an outstanding short video that looks at some of the most successful people in history and gives you some insight that you might not have known about.]
3. Question Your Own Assumptions. Always. We put ourselves in this situation because in late 2008, we started to believe that we could manage a complex refactoring project and maintain a failing production site, with a talented but small team. We planned for the worst, but completely underestimated what that meant. Importantly, while we anticipated problems, we didn’t fully question our own assumptions. We tried to shore up portions of the site that caused us the most problems but underestimated the severity of those problems. Had we approached each decision with a healthier degree of skepticism, we perhaps would have done more to shore up the site and might have been able to finish our refactoring project without exposing our community to the problems of the past 10 days. [We recommend you take 8 minutes and watch this video by Technology forecaster Paul Saffo, who discusses the importance of questioning one's own assumptions when attempting to predict future events].
4. Do The Right Thing, Even If It Hurts You. We knew that we could have sent our email only to those users who posted projects over the past week, or the users who’ve sent us complaints. But ultimately, we sensed a deeper frustration within our community. And we feel it’s very important to listen to our users. For every user who wrote to us, we assumed there were 50 who did not. And so we decided that as painful as it would be to own up to this mistake, we had to reach out to everyone. This was the right thing to do. It was the honest thing to do. It was the responsible thing to do. As individuals and as a company, we deeply feel that as long as our core values are sound and we follow our hearts, we’ll be successful. It was by far the easiest decision we made yesterday.
5. Focus. We’ve been able to get through the past two days of non-stop work primarily because as a team, we focused really hard on the tasks at hand. We had to find the cause of many different problems occurring all at once on our site, answer hundreds upon hundreds of customers who wrote to us by email, private message, and via our customer service ticket system, figure out the necessary software and hardware changes that we would need to make, and communicate all of this information in a timely and transparent way to our community. Each person on our team has done a phenomenal job helping us to deal with the site issues, to deal with frustrated users, and to support everyone else on the team. It’s been a true team effort and underscores for us yet again the importance of building a great team. We kept everything in perspective at all times and didn’t allow the problems to overwhelm us.
In working to fix these problems, we realized that the fixes themselves were, for the most part, not overly complicated. The strategies we adopted in developing fixes weren’t grand. The time involved in applying the fixes didn’t stretch for months or even weeks. We had neglected to make those changes earlier because we assumed (incorrectly) that they would be difficult, would take too much time, and would not be necessary had we completed refactoring on time. Had we focused on the problems and implemented the changes earlier, we could have reinforced the problem areas and avoided the trouble.
We learn from failures. We want people to benefit from this post so that they don’t make the mistake we made. And we would love to learn from you – what lessons have you learned from your own or others’ failures?