Markus Püschel
Computer Science
ETH Zürich




short CV


the pub

How to Write Fast Numerical Code 263-2300 (ETH, CS)

Basic Information

  • Course number: 263-2300, 6 credits
  • Spring 2015, lectures: M 10:15-12:00, CHN C14; Th 9:15-10:00 CAB G51; occasional substitute lectures: W 13:15-15:00 HG D3.2
  • Instructor: Markus Püschel (CAB H69.3, pueschel at inf, 2-7303)
    • Alen Stojanov (CAB 81.2, astojanov at inf)
    • Daniele Spampinato (CAB H65, daniele.spampinato at inf)
    • Gagandeep Singh (CAB H66, gsingh at inf)
    • Only for project supervision: Georg Ofenbeck (CAB H65, ofgeorg at inf)
  • Office hours:
    • Markus Püschel: Tues 11-12pm
    • Daniele Spampinato: Mon 3-4pm
    • Alen Stojanov: Wed 3-4pm
    • Gagandeep Singh: Fri 4-5pm
  • Mailing lists:
    • For technical questions: (emails to this address go to the lecturer and all TAs)
    • Forum to find project partner: (emails go to all students who have no partner yet and to Alen & Daniele)

Course Description

The fast evolution and increasing complexity of computing platforms pose a major challenge for developers of high performance software for engineering, science, and consumer applications: it becomes increasingly harder to harness the available computing power. Straightforward implementations may lose as much as one or two orders of magnitude in performance. On the other hand, creating optimal implementations requires the developer to have an understanding of algorithms, capabilities and limitations of compilers, and the target platform's architecture and microarchitecture.

This interdisciplinary course aims to give the student an understanding of performance and introduces foundations and state-of-the-art techniques in high performance software development using important functionality such as linear algebra algorithms, transforms, filters, and others as examples. The course will focus on optimizing for the memory hierarchy and special instruction sets, thus complementing courses on parallel programming. Much of the material is based on recent research.

Further, a general strategy for performance analysis and optimization is introduced that the students will apply in group projects that accompany the course. Finally, the course will introduce the students to the recent field of automatic performance tuning.

Prerequisites: solid C programming skills, matrix algebra, Master student or above

Topics Covered

  • Algorithm analysis: Problem versus algorithm, complexity and cost (asymptotic, exact, measured), cost analysis
  • Computer architecture (a software point of view): architecture and microarchitecture, memory hierarchy, special instruction sets
  • Compilers: strengths, limitations, how to use
  • Performance optimization: guide to benchmarking, finding hotspots, code analysis, performance optimization techniques (for memory hierarchy and vector instruction extensions); these techniques are studied using the examples in the next bullet
  • Numerical functionality studied in detail (complexity, algorithms, how to write highest performance code): linear algebra kernels, transforms, filters, sparse linear algebra, others, your research project
  • Automatic Performance Tuning: ATLAS, LAPACK, BeBOP, FFTW, SPIRAL, others

Goals of this Course

  • Obtain an understanding of runtime performance and how to reason about it
  • Learn a guideline how to write fast numerical code and apply it in homeworks and your research project
  • Understand the connection between algorithms, implementations, and computer architecture

Background Material

Academic Integrity

All homeworks in this course are single-student homeworks. The work must be all your own. Do not copy any parts of any of the homeworks from anyone including the web. Do not look at other students' code, papers, or exams. Do not make any parts of your homework available to anyone, and make sure noone can read your files. The university policies on academic integrity will be applied rigorously.

We will be using the Moss system to detect software plagiarism. This system is amazingly good, because it understands the programming language in question (C, in our case).

It is not considered cheating to clarify vague points in the assignments or textbook, or to give help or receive help in using the computer systems, compilers, debuggers, profilers, or other facilities.


  • 40% research project
    • Topic: Very fast, ideally adaptive implementation of a numerical problem
    • Team up in pairs
    • March 6: find a partner, find a problem or I give you one (tip: look at the prior courses linked above for examples)
    • Complete "milestones" during semester and enter them into the online check list
    • Write 6 page standard conference paper (template will be provided)
    • Give short presentation end of semester
  • 25% midterm
  • 35% homework
    • Exercises on algorithms analysis
    • Implementation exercises
      • study the effect of program optimizations, compilers, special instructions, etc.
      • write and submit C code & create runtime/performance plots
    • Some templates will be provided
    • All homeworks are single-student homeworks
  • There is no final Exam

Research Project

  • All projects have to be registered at This site is also used later for updates.
  • How it works:
    • Weeks without homeworks should be used to work on the project
    • You select a numerical problem and create a correct (verified) implementation in C
    • You determine the arithmetic cost, measure the runtime and performance
    • You profile the implementation to find the parts in which most the runtime spent
    • Focussing on these you apply various optimization techniques from this class
    • You repeat the previous steps to create various versions with (hopefully) continuously better runtime
    • You write a paper about your work and give a presentation
  • Paper:
    • Maximal 6 pages (hard limit), conference style, template and instructions below
    • Everybody reads this: report.pdf
    • For latex use: (start with reading the README file)
    • For Word (discouraged) use this: report-word.doc
    • Due date: Friday, June 12 (as final-report.pdf in your svn)
  • Presentation
    • Last week of classes
    • Template (the use is totally optional) and some guidelines (ppt is 2007 and later): presentation-template.pptx , presentation-template.pdf
    • The order will be determined randomly right before class
    • Who talks will be determined randomly right before class
  • Projects (each one has a supervisor shown in brackets):
    1. Marc M., Seonwook P., Kevin W., Roger W.: Smoothed Particle Hydrodynamics (AS, DS)
    2. Bruno C., Thomas D., Lionel M., Lazar T.: Particle Filtering (MP)
    3. Christian Z., Lukas M., Thomas H., Roman C.: Simulating Dynamical Features of Escape Panic (AS, DS)
    4. Andrin J., Renzo R., Felix T., Lucas W.: Social Force Model (AS, DS)
    5. Benjamin F., Christoph M., Sandro S., Daniel W.: Numerical treatment of SODEs (AS, DS)
    6. Norman J., Flurin R., Frederik R., Elias S.: Bayesian Belief Propagation (MP)
    7. Sandro A., Byungsoo K., Felix M., Alina V.:Corner Detection using the Harris Algorithm (MP)
    8. Gabriele A., Thomas B, Alessio Z., Marc Z.: Text recognition in natural images (GO, GS)
    9. Hantian Z., Hantian Z.: Simulation of Semiconductor Nanostructure (MP)
    10. Saiwen W., Yifan S., Yijun P., Xinyuan Y.: Deep Matching and DeepFlow (MP)
    11. Alexandra T., Pascal R., Tijana Z.: AdaBoost (MP)
    12. Marcel S., Theodoros T., Lukas S., Olivier S.: SGM Stereo Matching (GO, GS)
    13. Harsh S., Michael R., Vilhjalmur V., Fuchs R.: Splitting methods in Control (GO, GS)
    14. Viktor W., Nico G., Svetoslav K., Dominik K.: Stochastic Combinatorial Optimization (AS, DS)

Tips & Tricks (From Students)


15. April, 13:15 - 15:00, HG E3 (solution, without solution).


Late policy: No deadline extensions, but you have 3 late days. You can use at most 2 on one homework. For example, submitting 7 hours late costs one late day.

We will be using Moodle for the homeworks. More soon.

Lectures (including pdfs)

1 16.02. Course motivation, overview, organization link    
2 19.02. Reminder basic concepts, cost/performance analysis link    
3 23.02. Architecture/Microarchitecture, operational intensity, Core 2/Core i7 link   Intel Core2/Core i7, Intel software optimization manual
4 26.02. Optimization for instruction level parallelism (ILP), Benchmarking link    
5 02.03. Small benchmarking guide, compiler limitations link    
6 05.03 Locality and caches, blocking MMM link link  
7 09.03. Roofline model link link roofline paper
8 12.03. Linear algebra, LAPACK, BLAS, fast MMM link link  
9 16.03. Fast MMM continued      
10 19.03. Fast MMM continued    
11 23.03. Fast MMM: register renaming and virtual memory link  
12 26.03. Memory bound computations, sparse MVM link    
13 30.03. Sparse MVM continued, SIMD vectorization      
14 02.04. SIMD vectorization link   Intel intrinsics guide
15 06.04. Spring break      
16 08.04. Spring break      
17 13.04. (Sechseläuten) class cancelled      
  15.04 Midterm exam, HG E3      
18 16.04. SIMD vectorization  
19 20.04. Compiler vectorization, linear transforms, fast Fourier transform (FFT)   link icc compiler vectorization
20 23.04. FFT link link  
21 27.04. Fast FFT implementation, FFTW link link  
22 30.04. cancelled (due to one-on-ones a week later)      
23 04.05. Spiral: DSL-based program generator for linear transforms link    
24 07.05. cancelled (one-on-ones)      
25 11.05. Machine learning in autotuning link    
26 14.05. (Ascension day) no class      
27 18.05 Language environments for building program generators link    
28 21.05. no class (one-on-ones)      
29 25.05. (Whitsuntide) no class      
30 27.05. Project presentations (HG D3.2)      
31 28.05. Project presentations (CAB G11), 9:15-11:00 (one hour longer)