home
publications
teaching
short CV
personal
the pub

How to Write Fast Numerical Code
2632300 (ETH, CS)
Basic Information
 Course number: 2632300, 6 credits
 Spring 2011, lectures: M 10:0012:00, IFW C33; W 13:0014:00 RZ F21; occasional substitute lectures: F 14:0016:00 RZ F21
 Instructor: Markus Püschel (RZ H18, pueschel at inf, 27303)
TAs: Georg Ofenbeck (RZ H22, ofgeorg at inf, 28414) and Marcela Zuluaga
Admin: Franziska Mäder (RZ H14, maeder at inf, 27311)
 Office hours:
 Markus Püschel: Tues 14:0015:00
 Georg Ofenbeck: Wed 14:0015:00
Course Description
The fast evolution and increasing complexity of computing platforms pose a major challenge for developers of high performance software for engineering, science, and consumer applications: it becomes increasingly harder to harness the available computing power. Straightforward implementations may lose as much as one or two orders of magnitude in performance. On the other hand, creating optimal implementations requires the developer to have an understanding of algorithms, capabilities and limitations of compilers, and the target platform's architecture and microarchitecture.
This interdisciplinary course introduces the student to the foundations and stateoftheart techniques in high performance software development using important functionality such as linear algebra functionality, transforms, filters, and others as examples. The course will focus on optimizing for the memory hierarchy and special instruction sets, thus complementing courses on parallel programming. Much of the material is based on recent research.
Further, a general strategy for performance analysis and optimization is introduced that the students will apply in group projects that accompany the course. Finally, the course will introduce the students to the recent field of automatic performance tuning.
The course will build upon the version taught in Spring 2008.
Prerequisites: solid C programming skills, matrix algebra, Master student or above
Topics Covered
 Algorithm analysis: Problem versus algorithm, complexity and cost (asymptotic, exact, measured), cost analysis
 Computer architecture (a software point of view): architecture and microarchitecture, memory hierarchy, special instruction sets
 Compilers: strengths, limitations, how to use
 Performance optimization: guide to benchmarking, finding hotspots, code analysis, performance optimization techniques (for memory hierarchy and vector instruction extensions); these techniques are studied using the examples in the next bullet
 Numerical functionality studied in detail (complexity, algorithms, how to write highest performance code): linear algebra kernels, transforms, filters, sparse linear algebra, others, your research project
 Stateoftheart research in Automatic Performance Tuning: ATLAS, LAPACK, BeBOP, FFTW, SPIRAL, others
Goals of this Course
 Obtain an understanding of runtime performance
 Learn a guideline how to write fast numerical code and apply it in homeworks and your research project
 Understand the connection between algorithms, implementations, and computer architecture
 Learn some fundamental numerical algorithms
 Learn how to analyze numerical algorithms
Background Material
Academic Integrity
All homeworks in this course are singlestudent homeworks. The work must be all your own. Do not copy any parts of any of the homeworks from anyone including the web. Do not look at other students' code, papers, or exams. Do not make any parts of your homework available to anyone, and make sure noone can read your files. The university policies on academic integrity will be applied rigorously.
We will be using the Moss system to detect software plagiarism. This system is amazingly good, because it understands the programming language in question (C, in our case).
It is not considered cheating to clarify vague points in the assignments or textbook, or to give help or receive help in using the computer systems, compilers, debuggers, profilers, or other facilities.
Grading
 40% research project
 Topic: Very fast, ideally adaptive implementation of a numerical problem
 Team up in pairs
 March 9: Suggest to me a problem or I give you one
 Show "milestones" during semester
 Write 4 page standard conference paper (template will be provided)
 Give short presentation end of semester
 20% midterm
 Algorithm/implementation analysis
 Memory hierarchy
 other
 40% homework
 Exercises on algorithms analysis
 Implementation exercises. Purpose: study the effect of program optimizations, compilers, special instructions, etc. Tasks: writing and submitting C code & creating runtime/performance plots
 Some templates will be provided
 All homeworks are singlestudent homeworks
There is no final Exam.
Research Project
Deadline for code and paper: June 10th, 17:00
 How it works:
 Weeks without homeworks should be used to work on the project
 You select a numerical problem and create a correct (verified) implementation in C
 You determine the arithmetic cost, measure the runtime, and compute the performance
 You profile the implementation to find the parts in which most the runtime spent
 Focussing on these you apply various optimization techniques from this class
 You repeat the previous steps to create various versions with (hopefully) better runtime
 You write a paper about your work and give a presentation
 Paper:
 Presentation
 Last week of classes (May 30th and June 1st)
 10 minutes (at most 7 slides)
 Template (ppt is 2007 and later): presentationtemplate.pptx , presentationtemplate.pdf
 By Sunday (29th) evening, send me your slides; name them with your project number (e.g., 11.pptx or11.pdf)
 The order will be determined randomly right before class
 Who talks will be determined randomly right before class
 Projects:
 Alexander and Aldo
(Model predictive control)
 Maksim, Filip and Daniel (Eigenvalues)
 Rokas and Ugur
(Optimal binary search organization)
 Federico and Ralph
(Image processing library)
 Aymila and Ömer (Enclosing ball of points)
 Arkadiy (Metropolis algorithm, Monte Carlo)
 Benjamin and Sebastian
(Seam carving)
 Florian and Thomas (SURF feature detection)
 Zaheer, Andrei, and Giovanni
(Submodular function optimization)
 Francois (Graph cuts, EdmondKarps Algorithm)
 Ruedi (2D Gaussian filter)
 Mauro, Bruno, and Qunzhi (Black Scholes, option pricing)
 Giaocchino, Thabo, and Manuel (Disparity Map Refinement)
Midterm
Friday, April 15th (solution, without solution)
Homework
Lectures (including pdfs)
Lecture 
Date 
Content 
Slides 
Notes 
Other 
1 
21.02. 
Course motivation, overview, organization 
link 


2 
23.02. 
Problem, algorithm, asymptotic runtime, cost, cost analysis 
link 
link 

3 
28.02 
Cost analysis, architecture, microarchitecture, memory hierarchy 
link 
link, arch 
Core 2 info 
4 
07.03. 
Instruction level parallelism, optimizing compilers and limitations 
link 


5 
09.03. 
Compiler limitations, benchmarking 
link 


6 
16.03. 
Locality 
link 


7 
18.03. 
Caches, linear algebra software, BLAS, blocking 
link 
link 
blas 
8 
21.03. 
MMM, Atlas MMM generator 
link 
link 
paper 
9 
23.03. 
Atlas MMM generator continued 



10 
28.03. 
Modelbased Atlas 
link 
link 

11 
30.03. 
Modelbased Atlas: remaining details, register renaming 
link 


12 
04.04. 
Optimization related to virtual memory and TLB 
link 
link 

13 
06.04. 
Dense linear algebra problems, LU factorization 
link 
link 

14 
13.04. 
Blocked LU factorization 

link 


15.04. 
Midterm exam 



15 
18.04. 
Memorybound computation, sparse linear algebra, Sparsity/Bebop 
link 

paper1, paper2 
16 
20.04. 
Sparse matrixvector multiplication, continued 
link 
link 

17 
02.05. 
SIMD vector instructions, SSE family, SSE intrinsics 
link 

icc manual, VS manual 
18 
04.05. 
SSE intrinsics continued, vectorization efficiency 

link 

19 
09.05. 
Compiler vectorization, linear transforms, fast algorithms, discrete Fourier transform (DFT) 
link 
link 

20 
11.05. 
Structured matrices, CooleyTukey fast Fourier transform (FFT) 
link 
link 

21 
16.05. 
DFT complexity, recursive/iterative FFT, fast implementation (FFTW) 
link 
link 

22 
18.05. 
FFT: fast implementation (FFTW), codelet generator 
link 

FFTW 
23 
23.05. 
Spiral: Library generator for linear transforms 
link 

Spiral 

30.05. 
Project presentations 




02.06. 
Project presentations 



