
New Windows Performance Toolkit

With Windows 8.1 a new version of the Windows Performance Toolkit has been released. There are many improvements in the WPA GUI, most of which were shown during the Build Conference 2013. It is interesting to check what has changed in xperf as well. I simply called xperf -help for every command line option, wrote the output to a text file, and then compared the results between the Windows 8 and Windows 8.1 versions.
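If you want to repeat this yourself, a minimal sketch looks like this (run the first command with the Windows 8 toolkit, the second with the Windows 8.1 toolkit, then compare the files):

xperf -help start > xperf_start_w8.txt
xperf -help start > xperf_start_w81.txt
fc xperf_start_w8.txt xperf_start_w81.txt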

xperf -help start

or -SetProfInt [<n>] [cached]

It seems that the profile interval can no longer be persisted for all following stack sampling ETW sessions; the cached option from the Windows 8 help output is gone.
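As a reminder how the switch is used: the interval is specified in 100 ns units, where the default of 10000 means one sample per millisecond and 1221 is, if I recall the documentation correctly, the smallest allowed value (about 8 kHz). On Windows 8 you could additionally append cached to persist the setting for subsequent sessions, which is exactly the part that seems to be gone now:

xperf -SetProfInt 1221
xperf -SetProfInt 1221 cached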

xperf -help stackwalk

For Windows 8.1 some new stackwalking capabilities were added (see the example after the list):

  • PmcInterrupt
  • PagefaultCopyOnWrite
  • PageAccessEx
  • PageRemovedfromWorkingSet
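A sketch of how one of them could be enabled during trace recording; the stackwalk flag is from the list above, but the kernel flag needed to actually generate the page fault events (ALL_FAULTS) is my assumption, so verify it against the help output:

xperf -on PROC_THREAD+LOADER+ALL_FAULTS -stackwalk PagefaultCopyOnWrite
xperf -d pagefaults.etl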

xperf -help processing

  • drips            Show Connected Standby statistics
  • regions          Show regions statistics
  • screenshots      Show the screenshots recorded in the trace
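These new actions are invoked like the existing processing actions, e.g. to dump the Connected Standby statistics of a recorded trace to a text file:

xperf -i trace.etl -o drips.txt -a drips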

The last one, screenshots, is interesting. I do wonder which ETW event I must create to store a screenshot in my ETW trace file. It is a cool idea to store screenshots in your etl. Now you can find "obvious" reasons why a performance regression test went bad, e.g. because a popup from the graphics card driver was blocking your use case. Before that you needed to figure out from the call stack that some message box which never went away was blocking your call.

One big improvement is that you can extract the table contents of WPA not only via cut & paste but via a dedicated exporter tool, which takes your .etl file and a .wpaProfile view that defines the tables you want to export into .csv files. Previously I had to wait minutes until the cut and then again minutes until the paste operation was completed for bigger tables with call stacks. This even triggered funny bugs inside Windows itself which caused it to stop responding to all following cut and paste operations. You can also select a specific time region or the time between two ETL marker events, which can help tremendously during automated performance regression testing (an example follows below).

The exporter tool is called wpaexporter.exe. It uses the same views you see in WPA, which can be exported as .wpaProfile files. I have saved some tables (via the Export Profile menu) into disc.wpaProfile. Then I can generate csv files from my etl file with:

wpaexporter -profile disc.wpaProfile -prefix disc_ -i heapusage.etl

This gives me one extra csv file per exported table, like

disc_CPU_Usage_(Attributed)_Utilization_by_Process,_Thread,_Activity.csv
disc_CPU_Usage_(Precise)_Context_Switch_Count_by_Process,_Thread.csv
disc_CPU_Usage_(Precise)_Context_Switch_Rate_by_CPU.csv
disc_CPU_Usage_(Precise)_Timeline_by_CPU.csv
disc_CPU_Usage_(Precise)_Timeline_by_Process,_Thread.csv
disc_CPU_Usage_(Precise)_Usage_by_Priority_at_Context_Switch_Begin.csv
disc_CPU_Usage_(Precise)_Utilization_by_CPU.csv
disc_CPU_Usage_(Precise)_Utilization_by_Process,_Thread.csv
disc_CPU_Usage_(Sampled)_DPC_and_ISR_Usage_by_Module,_Stack.csv
disc_CPU_Usage_(Sampled)_Utilization_By_CPU.csv
disc_CPU_Usage_(Sampled)_Utilization_By_Priority.csv
disc_CPU_Usage_(Sampled)_Utilization_by_Process,_Stack.csv
disc_CPU_Usage_(Sampled)_Utilization_By_Process.csv
disc_CPU_Usage_(Sampled)_Utilization_By_Process_(1).csv
disc_CPU_Usage_(Sampled)_Utilization_By_Process_and_Thread.csv
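By the way: wpaexporter also seems to support restricting the export to a specific time range or to the interval between two marks. The exact switch names below are an assumption on my part, so verify them against the tool's own help output:

wpaexporter -profile disc.wpaProfile -range 10s 30s -prefix disc_ -i heapusage.etl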

Now I can generate graphs and filters, or feed my own analysis tools, from the generated csv files.

Here, for example, I have plotted the IO size and IO time for normal and low IO priority. Although svchost has very high IO times, it reads less than 30% of the data of Visual Studio, which has much smaller IO times. The reason is that this svchost hosts Windows Defender, which does all reads at low IO priority to stay out of the way of "normal" applications, and that seems to work out quite well. This would have been difficult to see in WPA.
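Here is a minimal sketch of such an analysis in Python, assuming a hypothetical export file disc_File_IO.csv with Process, IO Priority, Size and IO Time (us) columns (the real file and column names depend on the table and view you actually exported):

import csv
from collections import defaultdict

# Sum up IO size and IO time per (process, IO priority) from an exported WPA table.
# The file name and column names are assumptions - adapt them to your exported view.
totals = defaultdict(lambda: [0, 0.0])  # (process, priority) -> [bytes, microseconds]

with open("disc_File_IO.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        key = (row["Process"], row["IO Priority"])
        totals[key][0] += int(row["Size"])
        totals[key][1] += float(row["IO Time (us)"])

for (process, priority), (size, io_time) in sorted(totals.items()):
    print(f"{process:30} {priority:>8} {size / 1e6:10.1f} MB {io_time / 1e6:8.2f} s")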

The WPA tool has undergone a small facelift to give you even more graphs than before and to allow you to compare two tables of the same or different etl files. This is great for regression analysis to see where the changes might hide. There have been some new graphs added as well. Here are the ones you will usually see:

The Disk Offset graph has left its hidden place in the context menu. It allows you to see why you have crappy disc IO because of many disc seeks: the graph gives you a direct view of how many seeks the hard disc needed to write/read your file(s). It does not make much sense for solid state discs, of course. A really useful graph is Throughput by Process, which gives you direct numbers on how much data your app was able to process.

Recently I have played with ways to clone ADAM instances to speed up integration testing. The idea was to fill in the configuration data already at build time and later deploy the prefilled ADAM instances to the test machines. The only problem was that there is no MS supported quick way to transfer ADAM instances between different computers by simply copying the adamntds.dit file.

A few days later I installed configuration data as usual but noticed a severe performance loss on my machine: the data import no longer took 4 minutes but 12! At first I thought it had something to do with my platform (x86 Debug), but it was equally slow in x64 Release. I had no idea what was going on. In WPA I saw that my disc was maxed out during the data import, which is not unusual since DBs want to make sure that their stuff is written to disc. Unfortunately I had no ETL file of a fast import to compare against. The event logs showed no errors and everything seemed to be working as expected. Something quite fundamental must have changed, but what? Perhaps ADAM remembered that I had tried to copy around the database files on my machine and was now operating in some safe mode. In the call stacks I saw some strange methods, but I could not tell whether ADAM was really operating in a different mode.

As a last resort I moved the ADAM instance to a different drive and started the import from there. The import speed was back to normal! So my C drive must have a problem. I found no suspicious SMART or other hardware events in the System or Application event logs which would tell me that my drive is dying. But WPA can also show you the complete system configuration of your machine:

Now I can see the difference: someone must have disabled write caching for my C drive! I can only guess how this happened. I do remember that dsamain.exe, which hosts one ADAM instance, crashed once. Perhaps it disabled write caching for my drive while dying, to remind me not to mess around with databases. Now we can look at a 20s time interval during a slow import

and now again for a fast import

Both disc utilization graphs show a busy hard disc, but when we look at the new Throughput by Process graph it is clear that the fast run imports much more data in the same time. A nice touch of the new WPA tool is that you can scale the y-axis to your needs to get comparable scales for different graphs.

Now I only needed to go back to the device manager and enable write caching for my C drive again. But I have also learned the hard way that the DB import is IO bound. I had always wondered why the dsamain process never got above 3-4% CPU utilization while I was hammering it with data. I suspected some internal locking in ADAM, but MS support told me that this was not the case. So I tried to use something faster. What can be faster than a RAM disc? I took the top Google hit and installed it on my machine. Since my ADAM instance needs ca. 200MB, there is plenty of free space on a 500MB RAM disc.

When trying the configuration data import on the RAM disc, I saw for the first time that the dsamain process fully utilized one core at 100%. The import time went down from 4 minutes to 1 minute, which is a major achievement. After seeing these numbers I decided that it is much better to use a RAM drive on the test machines than to try to copy around prefilled database instances, which would need to be created, filled, copied and checked whether they run on the test machine at all. I am pretty sure that between Windows XP and Windows 8 the binary format of the database has changed several times, which would make db cloning of limited use anyway.

For tests I do not need a reliable data store; the RAM is good enough. For developers setting up their machines it is a little different. They reboot from time to time, but they certainly do not want to reinstall configuration data after every reboot. Luckily some RAM discs sync their data to disc during shutdown or at configurable time intervals. The RAM disc is therefore a good solution for everyone except the customers, who do not want their configuration values to live only in memory and potentially never be written to disc.

It is great to have such tooling from MS available. With the WPA performance microscope we can understand much better where the performance is hiding, and we can make informed decisions about which approach is better and why. Previously it was all guessing: "We can get faster by cloning the database." After learning what actually slowed the database import down, I was able to make it faster by a factor of 3 without venturing into unsupported territory by copying database files around.

This article is part of the GWB Archives. Original Author: Alois Kraus
