Performance and Scalability Forensics
I consider myself to have pretty deep and broad technical knowledge, but listening to Steve Feldman always leaves me wondering which key on my keyboard is the “any” key. I leave his sessions feeling like I was full after the first 15 minutes and somehow missed all the really important stuff that happened after that because I was in some sort of geek coma.
Probably the most important thing I can share with you is the link to his blog: http://sevenseconds.wordpress.com. That way you can read his wisdom directly. He has posted the slides from his presentations there as well.
Here are some of the highlights that struck me from this session:
When thinking about performance, it’s useful to think in terms of transactions. A transaction is any piece of work that the system does for the user. For instance, when a user clicks a button in the interface to, say, display a list of discussion posts, that button click has to:
- Be turned into an HTTP request by the user’s browser.
- Be encapsulated into TCP/IP traffic by the user’s network card.
- Be sent over the user’s ISP’s network and across the Internet until it gets to your system.
- At that point, your load balancer has to process that packet and send it to the appropriate server.
- That server has to figure out what the user is requesting and prepare a response.
- Part of the requested response requires data from the database, so a database query is created.
- The database server has to process the query. If all of the data for the query is not available in memory, it has to request it from disk.
- Once the database has all of the data, it sends it back to the application server.
- The application server then combines the data with the HTML elements (the rest of the page) and sends the whole thing back to the end-user.
- Now comes the return trip across the Internet, into the ISP network, into the network card, delivered to the browser, which then has to parse through it to figure out how to run any necessary JavaScript and finally render the page so the user can see it.
System performance (response time) is essentially what you get when you add up the latency at each step of the user’s transaction. Any one of the points listed above could be the cause of poor performance. According to Steve Souders (author of Even Faster Web Sites), 80-90% of end-user response time for a web site is typically spent on the client end of the application. This may be less the case with CE/Vista, since it’s so heavily database driven, but it’s clear that a significant amount of processing time is spent completely outside the realm of the LMS.
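To make the "add up the latency at each step" idea concrete, here is a minimal sketch. It is not from Steve's session, and the step names and millisecond values are made-up assumptions, not measurements. It totals the per-step latencies of one transaction and shows which step accounts for the largest share.

```python
# Hypothetical per-step latencies (milliseconds) for a single transaction.
# These numbers are invented for illustration only.
steps = [
    ("browser builds request", 5),
    ("network to server", 60),
    ("load balancer", 2),
    ("application server", 120),
    ("database query", 250),
    ("network to client", 60),
    ("browser parse and render", 400),
]


def response_time(steps):
    """Total response time is the sum of the latency at every step."""
    return sum(ms for _, ms in steps)


def slowest_step(steps):
    """Return the (name, ms) pair that contributes the most latency."""
    return max(steps, key=lambda step: step[1])


def share_by_step(steps):
    """Return each step's fraction of the total response time."""
    total = response_time(steps)
    return {name: ms / total for name, ms in steps}


total = response_time(steps)
# Print the steps from slowest to fastest with their share of the total.
for name, ms in sorted(steps, key=lambda step: step[1], reverse=True):
    print(f"{name:28s} {ms:5d} ms  {ms / total:6.1%}")
print(f"{'total':28s} {total:5d} ms")
```

With real measurements in place of the invented ones, the sorted listing is a quick way to see where tuning effort would pay off most.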
When analyzing the end-user’s part of the transaction, Steve recommends a few tools for peeking inside the HTTP traffic to see what’s going on: YSlow (Yahoo), Page Speed (Google), and Fiddler2 (Microsoft). All three produce graphs and statistics about the HTTP traffic as seen from the client’s perspective, so they can show you where the latencies are and help you understand how to reduce them. Many times when “Vista is slow”, the root cause could very well be the types of content included on the page, a malfunctioning network device, or a saturated Internet connection, rather than the system serving the page. Also, not all browsers are the same: Internet Explorer, Firefox, and Safari all render content and process JavaScript in different ways. Try loading an active CE/Vista discussion board in different browsers… you’ll notice a large difference in performance that has nothing to do with the application servers or the database. (By the way, Chrome seems to load the discussions fastest, even though it’s not a supported browser.)
Of course, not all performance issues are on the client side, so we touched briefly on the application server side. Another potentially large source of latency is the Java virtual machine (JVM), particularly its memory management (garbage collection). Steve recommends Java VisualVM, a powerful tool for watching what is happening inside the JVM’s heap. I’m actually very excited about this tool, since the JVM is such a black box. I’m definitely interested in peeking under the hood and watching it in action. Steve told me that it’s very low overhead and perfectly fine to run on your production app nodes (which is good, because looking at the JVM on your test system wouldn’t really tell you anything useful about your production load).
Another thing Steve recommends is to turn on compression (gzipping) and caching on the load balancers. I actually have no idea if our load balancers already do this, or even if they’re capable of it. I’ll have to be sure to check with our networking gurus when I get a chance.
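As a rough illustration of why gzipping helps, the sketch below compresses a chunk of repetitive HTML with Python's standard `gzip` module and compares the sizes. The sample page is a made-up assumption standing in for a list of discussion posts. Real pages will compress less dramatically than this artificially repetitive one, and already-compressed content such as JPEG images gains very little. This only shows the effect on the payload; it says nothing about how any particular load balancer is configured.

```python
import gzip

# Hypothetical, highly repetitive HTML standing in for a page of posts.
row = '<tr class="post"><td>Author</td><td>Subject line</td><td>2009-07-18</td></tr>\n'
page = ("<html><body><table>\n" + row * 500 + "</table></body></html>").encode("utf-8")

# Compress the page the same way a gzip-enabled server or proxy would
# compress a response body before sending it to the browser.
compressed = gzip.compress(page)
ratio = len(compressed) / len(page)

print(f"original:   {len(page)} bytes")
print(f"compressed: {len(compressed)} bytes ({ratio:.1%} of original)")

# Decompressing gives back exactly the same bytes the server started with.
restored = gzip.decompress(compressed)
print("round trip ok:", restored == page)
```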
Finally, the database is another obvious source of latency. Steve recommends the book Optimizing Oracle Performance by Cary Millsap and Jeffrey Holt as an excellent overview of how to improve Oracle performance. Some of the advice in the book is related to query optimization, which we obviously can’t do anything about for CE/Vista, but there’s also information about optimizing indexes.
I was surprised to hear Steve say that it was perfectly acceptable for clients to modify indexes on their databases; Blackboard considers that “tuning”, not “unauthorized modification”. Of course, you’ll want to proceed carefully with this, making sure to do as much testing as possible and to put processes in place to measure performance to make sure you’re making things better and not worse. Someone asked why Blackboard didn’t optimize the indexes themselves and Steve responded that they did, but since the content is so customizable and varies so significantly from one institution to another, the initial “one-size-fits-all” indexing might not be optimized for a specific installation.
Some final points from my notes that didn’t really fit anywhere above:
- Outliers should never be ignored; they are the data points you want because they tell you how bad things can be. There is a tendency, when graphing performance statistics, to discard results that seem too far out of the norm. This is a mistake; you should focus attention on those outliers, because they can lead you to discover issues you would otherwise miss.
- Keep in mind that the system can affect transactions and that transactions can affect the system. Even though it’s helpful to focus on one or the other, you should never forget that they are interdependent.
- Identifying where the bulk of a transaction’s time is spent is key to figuring out how to improve overall latency. Spending time trying to streamline something that doesn’t take that long anyway isn’t going to have a big impact on user experience.
- Avoid diagnosis bias. It’s very easy to make assumptions about where the problem is or isn’t. Steve actually used me as an example, since we had been discussing some performance issues on our database that involved high I/O waits; I said that we had a fast SAN, so that probably wasn’t the problem. He pointed out that this type of bias might keep us from finding out that there’s something wrong with the SAN drivers (for instance) because we just wouldn’t look there. Just goes to show that sometimes your purpose in life is to serve as an example for others.
- When adding content to a section, the tendency is to load everything to Vista and then add it to the class. If you are really working to optimize a particular page, there is a technique called domain sharding, which means linking to content hosted on other servers. Most browsers will only open a certain number of connections at a time to a particular domain, but they can open parallel connections to other domains to bring in page elements hosted there at the same time. In some cases, this can make a noticeable difference in page loading times. You can check whether a specific page would benefit by using YSlow, Page Speed, or Fiddler2 to look at the connections made during the page load.
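To illustrate the point about outliers from the notes above, here is a minimal sketch with made-up response times. The sample values and the nearest-rank percentile helper are my own assumptions, not something shown in the session. The median looks perfectly healthy, while the 95th percentile and the maximum expose the one request that took many times longer than the rest.

```python
import math
import statistics

# Hypothetical response times in seconds; one request is a clear outlier.
samples = [0.8, 0.9, 1.0, 1.1, 0.9, 1.0, 1.2, 0.9, 1.0, 14.5]


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


median = statistics.median(samples)
p95 = percentile(samples, 95)
worst = max(samples)

print(f"median: {median} s")
print(f"p95:    {p95} s")
print(f"max:    {worst} s")

# Rather than discarding slow requests, list the ones far above the median
# so they can be investigated. The 3x threshold is an arbitrary choice.
outliers = [s for s in samples if s > 3 * median]
print("worth investigating:", outliers)
```

Reporting a high percentile alongside the median keeps the slow requests visible instead of letting them disappear into a summary statistic.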

ez Said,
July 18, 2009 @ 11:28 am
I’ve seen cases in my web designer days where some computers stop trying to render a page. I worked on cutting the size of the page down as much as I reasonably could. What ultimately made the difference was removing elements (radio buttons, check boxes). The browsers were trying to render too many objects.
So I believe that 80-90% of the time to show a page is spent on the client end.