How to Write Fast Numerical Code

263-2300 (ETH, CS)

Basic Information

Course number: 263-2300, 6 credits
Spring 2011, lectures: M 10:00-12:00, IFW C33; W 13:00-14:00 RZ F21; occasional substitute lectures: F 14:00-16:00 RZ F21
Instructor: Markus Püschel (RZ H18, pueschel at inf, 2-7303)
TAs: Georg Ofenbeck (RZ H22, ofgeorg at inf, 2-8414) and Marcela Zuluaga
Admin: Franziska Mäder (RZ H14, maeder at inf, 2-7311)
Office hours:
- Markus Püschel: Tues 14:00-15:00
- Georg Ofenbeck: Wed 14:00-15:00

Course Description

The fast evolution and increasing complexity of computing platforms pose a major challenge for developers of high performance software for engineering, science, and consumer applications: it becomes increasingly harder to harness the available computing power. Straightforward implementations may lose as much as one or two orders of magnitude in performance. On the other hand, creating optimal implementations requires the developer to have an understanding of algorithms, capabilities and limitations of compilers, and the target platform's architecture and microarchitecture.

This interdisciplinary course introduces the student to the foundations and state-of-the-art techniques in high performance software development using important functionality such as linear algebra functionality, transforms, filters, and others as examples. The course will focus on optimizing for the memory hierarchy and special instruction sets, thus complementing courses on parallel programming. Much of the material is based on recent research.

Further, a general strategy for performance analysis and optimization is introduced that the students will apply in group projects that accompany the course. Finally, the course will introduce the students to the recent field of automatic performance tuning.

The course will build upon the version taught in Spring 2008.

Prerequisites: solid C programming skills, matrix algebra, Master student or above

Topics Covered

Algorithm analysis: Problem versus algorithm, complexity and cost (asymptotic, exact, measured), cost analysis
Computer architecture (a software point of view): architecture and microarchitecture, memory hierarchy, special instruction sets
Compilers: strengths, limitations, how to use
Performance optimization: guide to benchmarking, finding hotspots, code analysis, performance optimization techniques (for memory hierarchy and vector instruction extensions); these techniques are studied using the examples in the next bullet
Numerical functionality studied in detail (complexity, algorithms, how to write highest performance code): linear algebra kernels, transforms, filters, sparse linear algebra, others, your research project
State-of-the-art research in Automatic Performance Tuning: ATLAS, LAPACK, BeBOP, FFTW, SPIRAL, others

Goals of this Course

Obtain an understanding of runtime performance
Learn a guideline how to write fast numerical code and apply it in homeworks and your research project
Understand the connection between algorithms, implementations, and computer architecture
Learn some fundamental numerical algorithms
Learn how to analyze numerical algorithms

Background Material

Prior versions of this course: spring 2008, spring 2005
Chapters 5 and 6 in Computer Systems: A Programmer's Perspective, 2nd edition (available in library)
This small tutorial
Research papers, manuals, and other material mentioned in the course

Academic Integrity

All homeworks in this course are single-student homeworks. The work must be all your own. Do not copy any parts of any of the homeworks from anyone including the web. Do not look at other students' code, papers, or exams. Do not make any parts of your homework available to anyone, and make sure noone can read your files. The university policies on academic integrity will be applied rigorously.

We will be using the Moss system to detect software plagiarism. This system is amazingly good, because it understands the programming language in question (C, in our case).

It is not considered cheating to clarify vague points in the assignments or textbook, or to give help or receive help in using the computer systems, compilers, debuggers, profilers, or other facilities.

Grading

40% research project
- Topic: Very fast, ideally adaptive implementation of a numerical problem
- Team up in pairs
- March 9: Suggest to me a problem or I give you one
- Show "milestones" during semester
- Write 4 page standard conference paper (template will be provided)
- Give short presentation end of semester
20% midterm
- Algorithm/implementation analysis
- Memory hierarchy
- other
40% homework
- Exercises on algorithms analysis
- Implementation exercises. Purpose: study the effect of program optimizations, compilers, special instructions, etc. Tasks: writing and submitting C code & creating runtime/performance plots
- Some templates will be provided
- All homeworks are single-student homeworks

There is no final Exam.

Research Project

Deadline for code and paper: June 10th, 17:00

How it works:
- Weeks without homeworks should be used to work on the project
- You select a numerical problem and create a correct (verified) implementation in C
- You determine the arithmetic cost, measure the runtime, and compute the performance
- You profile the implementation to find the parts in which most the runtime spent
- Focussing on these you apply various optimization techniques from this class
- You repeat the previous steps to create various versions with (hopefully) better runtime
- You write a paper about your work and give a presentation
Paper:
- Maximal 6 pages (hard limit), conference style, template and instructions below
- Everybody reads this: report.pdf
- For latex use: report.zip (start with reading the README file)
- For Word (discouraged) use this: report-word.doc
- Example project papers on this course website
Presentation
- Last week of classes (May 30th and June 1st)
- 10 minutes (at most 7 slides)
- Template (ppt is 2007 and later): presentation-template.pptx , presentation-template.pdf
- By Sunday (29th) evening, send me your slides; name them with your project number (e.g., 11.pptx or11.pdf)
- The order will be determined randomly right before class
- Who talks will be determined randomly right before class
Projects:
1. Alexander and Aldo (Model predictive control)
2. Maksim, Filip and Daniel (Eigenvalues)
3. Rokas and Ugur (Optimal binary search organization)
4. Federico and Ralph (Image processing library)
5. Aymila and Ömer (Enclosing ball of points)
6. Arkadiy (Metropolis algorithm, Monte Carlo)
7. Benjamin and Sebastian (Seam carving)
8. Florian and Thomas (SURF feature detection)
9. Zaheer, Andrei, and Giovanni (Submodular function optimization)
10. Francois (Graph cuts, Edmond-Karps Algorithm)
11. Ruedi (2D Gaussian filter)
12. Mauro, Bruno, and Qunzhi (Black Scholes, option pricing)
13. Giaocchino, Thabo, and Manuel (Disparity Map Refinement)

Midterm

Friday, April 15th (solution, without solution)

Homework

Homework 1: due March 10, 17:00; solution
Homework 2: due March 17, 17:00
Homework 3, code templates: due March 31, 17:00; solution
Homework 4 (start project): due April 14, 17:00
Homework 5 (continue project): due May 5, 17:00
Homework 6, code templates: due May 12, 17:00
From now on: Work on project

Lectures (including pdfs)

Lecture	Date	Content	Slides	Notes	Other
1	21.02.	Course motivation, overview, organization	link
2	23.02.	Problem, algorithm, asymptotic runtime, cost, cost analysis	link	link
3	28.02	Cost analysis, architecture, microarchitecture, memory hierarchy	link	link, arch	Core 2 info
4	07.03.	Instruction level parallelism, optimizing compilers and limitations	link
5	09.03.	Compiler limitations, benchmarking	link
6	16.03.	Locality	link
7	18.03.	Caches, linear algebra software, BLAS, blocking	link	link	blas
8	21.03.	MMM, Atlas MMM generator	link	link	paper
9	23.03.	Atlas MMM generator continued
10	28.03.	Model-based Atlas	link	link
11	30.03.	Model-based Atlas: remaining details, register renaming	link
12	04.04.	Optimization related to virtual memory and TLB	link	link
13	06.04.	Dense linear algebra problems, LU factorization	link	link
14	13.04.	Blocked LU factorization		link
	15.04.	*Midterm exam*
15	18.04.	Memory-bound computation, sparse linear algebra, Sparsity/Bebop	link		paper1, paper2
16	20.04.	Sparse matrix-vector multiplication, continued	link	link
17	02.05.	SIMD vector instructions, SSE family, SSE intrinsics	link		icc manual, VS manual
18	04.05.	SSE intrinsics continued, vectorization efficiency		link
19	09.05.	Compiler vectorization, linear transforms, fast algorithms, discrete Fourier transform (DFT)	link	link
20	11.05.	Structured matrices, Cooley-Tukey fast Fourier transform (FFT)	link	link
21	16.05.	DFT complexity, recursive/iterative FFT, fast implementation (FFTW)	link	link
22	18.05.	FFT: fast implementation (FFTW), codelet generator	link		FFTW
23	23.05.	Spiral: Library generator for linear transforms	link		Spiral
	30.05.	Project presentations
	02.06.	Project presentations