MIT 6.035

HKN Underground Guide evaluations

The Underground Guide student evaluations are now open. Visit them here for 6.035.

Please complete your evaluation before 11:59 PM on December 22.

Quiz 3 Graded

Quiz 3 has been graded. You will get your quiz back tomorrow after class. The average (and median) was 68. The standard deviation was 14.

Small bug in data parallelization library (libderby.a)

I fixed a small bug in the data-parallelization library that prevented one group's code from working. Since no other groups emailed me, I'm assuming the problem was confined to this one group. The fix is installed in the same location as the old file, so you don't have to do anything. Let me know if you experience problems.

Parallelization differences across machines

First off, with the new benchmarking library, you can benchmark on chocura.

I have noticed big differences between the parallelization speedups of silver, chocura, and tyner. For a small 4-thread test program, silver gets about a 2.9x speedup over sequential code (executed on silver). Tyner gets a 1.8x speedup over sequential code (executed on tyner). Chocura gets a 3.7x speedup for the same code over sequential execution on chocura.
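To put these numbers on a common scale, here is a minimal sketch that turns each speedup into a parallel efficiency (speedup divided by thread count). The function and variable names are hypothetical; only the three speedups and the 4-thread count come from the measurements above.

```python
# Speedups quoted above for the 4-thread test program, each measured against
# sequential code running on the same machine.
SPEEDUPS = {"silver": 2.9, "tyner": 1.8, "chocura": 3.7}
THREADS = 4


def efficiency(speedup, threads):
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    return speedup / threads


def report(speedups, threads):
    """Return one summary line per machine, best speedup first."""
    ranked = sorted(speedups.items(), key=lambda item: item[1], reverse=True)
    return [
        f"{host}: {speedup:.1f}x speedup, {efficiency(speedup, threads):.1%} efficiency"
        for host, speedup in ranked
    ]


if __name__ == "__main__":
    print("\n".join(report(SPEEDUPS, THREADS)))
```

On these numbers chocura comes out closest to the ideal 4x, and tyner gets less than half of it.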

Based on my experiences over the last couple of days, I'm going to use chocura for the derby. It seems to be the most stable.

New benchmarking library for the derby!

After noticing that the current benchmarking infrastructure produces some inconsistencies when executing multi-threaded benchmarks, I have implemented a new benchmarking library for the derby. You can use this library now for benchmarking.

Don't worry, the implementation decisions that you arrived at using the current (old) benchmarking infrastructure are still valid. I have only noticed inconsistencies when running multiple threads in the current system.

You can continue to use the current (old) library (lib6035.a) but note that it has problems with multi-threaded benchmarks.

The new library is named libderby.a and is located in /u/mgordon/6035/lib64. The new library interface is similar to the old interface. There are two calls: start_caliper() and end_caliper(). Wrap these around the code you would like to benchmark. The only difference from the old library is that the code will not be benchmarked without a caliper defined. As before, you can place them in a callout in your decaf source code.

Using the new library, the assemble command is simpler, for example:

gcc4 emboss.s -pthread -lderby -L/u/mgordon/6035/lib64 -o emboss

No need for that papi library from before. When you execute your code (assuming that you have added a call to start_caliper() and end_caliper()), a brief message will print out, for example:

$ emboss
Timer: 276864 usecs
$

This tells you how many usecs (microseconds) it took for the code wrapped in the timer to execute. We will use this library during the derby to determine the ranking. The calls to start_caliper and end_caliper will be in the derby program when it is distributed on Monday.
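If you want to collect the timer readings from a script, something like the following sketch would work. It assumes the message always has the exact form shown above ("Timer: N usecs" on its own line); the helper names are hypothetical and are not part of libderby.a.

```python
import re

# Matches the one-line report shown above, e.g. "Timer: 276864 usecs".
# Assumption: the library always prints the message in exactly this form.
TIMER_RE = re.compile(r"^Timer:[ \t]+(\d+)[ \t]+usecs[ \t]*$", re.MULTILINE)


def parse_timer_usecs(output):
    """Return every timer reading (in microseconds) found in the output."""
    return [int(value) for value in TIMER_RE.findall(output)]


def usecs_to_msecs(usecs):
    """Convert microseconds to milliseconds."""
    return usecs / 1000.0


if __name__ == "__main__":
    # The sample session from above, captured as text.
    sample = "$ emboss\nTimer: 276864 usecs\n$\n"
    for reading in parse_timer_usecs(sample):
        print(f"{reading} usecs = {usecs_to_msecs(reading):.3f} ms")
```

Parsing the numbers this way makes it easy to compare several runs of the same program instead of reading them off the terminal.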

Sorry about this change so late in the game, but it should not be too much of a hassle to switch to the new library if you are harnessing data parallelism. I just want to make sure that the derby results are as accurate as possible.

Advice for Optimizer final write-up

Here are some points to think about:

Use the provided programs to substantiate your implementation decisions. Benchmark the provided programs on the target architecture. Hand-implement the transformation first. The target architecture is complex to say the least. Don't waste time with ineffectual transformations.
Cover all of the transformations discussed in class, at the very least qualitatively, given the benchmark programs and the target architecture.
I would like to see an analysis of each implemented optimization (you can group optimizations if you feel they are symbiotic).
Discuss the reasons for your benchmarking results given your knowledge of the target architecture (look at my last recitation).
Describe your full optimizations option and the ideas/experiments behind the implementation.
Analyze your generated assembly. Look for non-traditional peephole optimizations.

Appendix to Optimizer Project Handout

The documentation for the data-parallelization library is posted on the website.

The evil enter instruction

From your classmate Zev:

After some experimentation, we found that the enter instruction seems to be broken (using the push, mov, sub equivalent works fine). Some googling turned up the following discussion group thread:

http://groups.google.co.nz/group/comp.os.linux.development.system/browse_thread/thread/a057249198598933/a4f5251c9ef1e7a2?#a4f5251c9ef1e7a2.

Everything is pretty much summed up with Linus' reply at the end. Bottom line: enter can cause segfaults on Linux and is a lot slower than its equivalent.

72 + 16 = 96 (Explanation for the curious)

Yeah, um, that was me trying to hide some stuff from you and forgetting what I was glossing over. If you are interested:

Some x86-64 instructions are very complex. They are translated into simpler operations in hardware. In recitation we saw that a mov that references memory is translated into multiple simple operations in the underlying hardware, the mov and the memory reference. The mov is dependent upon the memory reference. These smaller instructions are called micro-operations.

There are 16 architectural registers and 72 re-order buffer entries. That is 88 registers. But there are also 8 hidden registers that are used for shuffling values between micro-operations, which brings the total to 96. These hidden registers are not part of architectural state (the state the asm programmer sees), but they are needed to shuffle values between micro-ops when a dependent micro-op is evicted from the re-order buffer.

Running without benchmarking

The new lib6035.a and the new assemble command require that you use the 'benchmark' script to run your code. If you don't want to benchmark (maybe you are just debugging), use the original lib6035.a in /mit/6.035/provided/optimizer/lib and compile with the old command:

gcc4 example.s -L. -l6035 -pthread -o example
cc example.s -L. -l6035 -pthread -o example

New benchmarking infrastructure!

With help from our sys-admin (thanks Mike!), I have completed the new, much much more accurate benchmarking infrastructure. You have to make some changes to use it.

First, set your LD_LIBRARY_PATH environment variable to include /u/mgordon/6035/lib64

For bash:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/u/mgordon/6035/lib64

For csh:
setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH:/u/mgordon/6035/lib64

I have placed a new version of the 6035 library in /u/mgordon/6035/lib64; you don't need to copy it from there.

Now you have to compile things a bit differently. The new compile command is:

gcc4 program.s -pthread -lpapiex -l6035 -L/u/mgordon/6035/lib64

Just in case you are curious, papiex is the library on which I built our benchmarking infrastructure. It allows us to get access to performance counters on the CPU! Very accurate.

----

You can use /u/mgordon/6035/bin/benchmark to get the cycle count information for your program after you have compiled it.

Use 'benchmark program' to get the cycle count for program.

Here is some sample output:

loop
libmonitor debug: (P20071,T0x0) monitor_fini_process()
PapiEx Version: 0.99rc2
Executable: /u/mgordon/6.035/example/loop
Processor: AMD K8 Revision C
Clockrate: 1993.465942
Parent Process ID: 20064
Process ID: 20071
Hostname: tyner
Options: PAPI_TOT_CYC,NO_WRITE
Domain: User
Real usecs: 125
Real cycles: 241852
Proc usecs: 122
Proc cycles: 239896
PAPI_TOT_CYC: 135670

Event descriptions:
Event: PAPI_TOT_CYC
Derived: No
Short Description: Total cycles
Long Description: Total cycles
Developer's Notes:

Start: Wed Nov 15 23:44:20 2006
Finish: Wed Nov 15 23:44:20 2006
libmonitor debug: (P20071,T0x0) monitor_fini_library()

------

What to notice:
*The first line gives the program you ran (in this case "loop").
*The line you are interested in is PAPI_TOT_CYC: 135670. This is the number of user cycles that your program took to run.
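As a rough sketch of how you might pull that number out of the output in a script, the following reads the PAPI_TOT_CYC line and converts it to microseconds. The helper names are hypothetical, the sample is an excerpt of the output above, and the conversion assumes that the Clockrate field is in MHz (cycles per microsecond), which the output itself does not state.

```python
import re

# Excerpt of the sample benchmark output shown above.
SAMPLE = """\
Clockrate: 1993.465942
Hostname: tyner
Options: PAPI_TOT_CYC,NO_WRITE
Real usecs: 125
Real cycles: 241852
PAPI_TOT_CYC: 135670
"""


def field(output, name):
    """Return the first token after 'name:' on the first line starting with it."""
    match = re.search(rf"^{re.escape(name)}:[ \t]*(\S+)", output, re.MULTILINE)
    if match is None:
        raise ValueError(f"no '{name}:' line in benchmark output")
    return match.group(1)


def user_cycles(output):
    """Whole-program user cycle count (the PAPI_TOT_CYC line)."""
    return int(field(output, "PAPI_TOT_CYC"))


def cycles_to_usecs(cycles, clockrate_mhz):
    # Assumption: Clockrate is reported in MHz, i.e. cycles per microsecond.
    return cycles / clockrate_mhz


if __name__ == "__main__":
    cycles = user_cycles(SAMPLE)
    clock = float(field(SAMPLE, "Clockrate"))
    print(f"{cycles} user cycles is about {cycles_to_usecs(cycles, clock):.1f} usecs")
```

The pattern is anchored to the start of a line, so the "Options: PAPI_TOT_CYC,NO_WRITE" line is not mistaken for the counter value.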

----

You can also define a single "caliper" in the code. This is a section in your code that you would like detailed information about. Use start_caliper() to define the beginning of a section and end_caliper() to define the end of a section. These functions are defined in lib6035.a. So you can use a callout for each in decaf code or just place it in your assembly code (adhering to calling convention of course).

With a caliper defined, the output would look like:
loop
libmonitor debug: (P20167,T0x0) monitor_fini_process()
PapiEx Version: 0.99rc2
Executable: /u/mgordon/6.035/example/loop
Processor: AMD K8 Revision C
Clockrate: 1993.465942
Parent Process ID: 20156
Process ID: 20167
Hostname: tyner
Options: PAPI_TOT_CYC,NO_WRITE
Domain: User
Real usecs: 435
Real cycles: 860899
Proc usecs: 127
Proc cycles: 249792
PAPI_TOT_CYC: 150002

Caliper 1:
Executions: 1
Real usecs: 16
Real cycles: 32333
Proc usecs: 16
Proc cycles: 32328
PAPI_TOT_CYC: 32226 ***This is the cycle count for your caliper

Event descriptions:
Event: PAPI_TOT_CYC
Derived: No
Short Description: Total cycles
Long Description: Total cycles
Developer's Notes:

Start: Wed Nov 15 23:51:48 2006
Finish: Wed Nov 15 23:51:48 2006
libmonitor debug: (P20167,T0x0) monitor_fini_library()
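One thing you can do with the two cycle counts is estimate how much of the run was spent inside the caliper. The sketch below works on an excerpt of the output above; the function names are hypothetical, and it assumes that the whole-program PAPI_TOT_CYC count includes the cycles spent in the caliper section.

```python
import re

# Excerpt of the caliper output shown above.
SAMPLE = """\
Proc cycles: 249792
PAPI_TOT_CYC: 150002

Caliper 1:
Executions: 1
Proc cycles: 32328
PAPI_TOT_CYC: 32226
"""

CYCLES_RE = re.compile(r"^PAPI_TOT_CYC:[ \t]*(\d+)", re.MULTILINE)


def cycle_counts(output):
    """Return (whole_program_cycles, caliper_cycles); the latter may be None."""
    # Everything before "Caliper 1:" describes the whole program,
    # everything after it describes the caliper section.
    head, marker, tail = output.partition("Caliper 1:")
    total_match = CYCLES_RE.search(head)
    if total_match is None:
        raise ValueError("no whole-program PAPI_TOT_CYC line found")
    caliper_match = CYCLES_RE.search(tail) if marker else None
    caliper = int(caliper_match.group(1)) if caliper_match else None
    return int(total_match.group(1)), caliper


def caliper_share(output):
    """Fraction of user cycles spent inside the caliper (assumes it is a subset)."""
    total, caliper = cycle_counts(output)
    if caliper is None:
        raise ValueError("no caliper section in benchmark output")
    return caliper / total


if __name__ == "__main__":
    total, caliper = cycle_counts(SAMPLE)
    print(f"caliper: {caliper} of {total} cycles ({caliper_share(SAMPLE):.1%})")
```

For the sample run this works out to roughly a fifth of the user cycles.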

Let me know if there are any problems. Actually, let me know if it works for you! It is somewhat untested for anyone but me.

Mike

Tuesday, December 12, 2006

HKN Underground Guide evaluations

Quiz 3 Graded

Small bug in data parallelization library (libderby.a)

Monday, December 11, 2006

Parallelization differences across machines

Friday, December 08, 2006

New benchmarking library for the derby!

Tuesday, December 05, 2006

Advice for Optimizer final write-up

Monday, November 27, 2006

Appendix to Optimizer Project Handout

The evil enter instruction

Thursday, November 16, 2006

72 + 16 = 96 (Explanation for the curious)

Running without benchmarking

Wednesday, November 15, 2006

New benchmarking infrastructure!
