| 
								   
								home 
								publications 
								teaching 
								short CV 
								personal 
								the pub 
						     | 
							How to Write Fast Numerical Code
									263-2300 (ETH, CS)
									
									Basic Information
									 
									
										- Course number: 263-2300, 6 credits
										
 - Spring 2011, lectures: M 10:00-12:00, IFW C33; W 13:00-14:00 RZ F21; occasional substitute lectures: F 14:00-16:00 RZ F21
										
 - Instructor: Markus Püschel (RZ H18, pueschel at inf, 2-7303)
 
											TAs: Georg Ofenbeck (RZ H22, ofgeorg at inf, 2-8414) and Marcela Zuluaga 
											Admin: Franziska Mäder (RZ H14, maeder at inf, 2-7311)
											
											
											
											
											
											
											
											
										 - Office hours:                                        
										  
										    - Markus Püschel: Tues 14:00-15:00
 
										    - Georg Ofenbeck: Wed 14:00-15:00
 
										     
									     
									  
							  Course Description
									 
									The fast evolution and increasing complexity of computing platforms pose   a major challenge for developers of high performance software for   engineering, science, and consumer applications: it becomes increasingly   harder to harness the available computing power. Straightforward   implementations may lose as much as one or two orders of magnitude in   performance. On the other hand, creating optimal implementations   requires the developer to have an understanding of algorithms,   capabilities and limitations of compilers, and the target platform's   architecture and microarchitecture. 
									   
									  This interdisciplinary   course introduces the student to the foundations and state-of-the-art   techniques in high performance software development using important   functionality such as linear algebra functionality, transforms, filters,   and others as examples. The course will focus on optimizing for the   memory hierarchy and special instruction sets, thus complementing courses on parallel programming. Much of the material is based on recent research.  
									   
							        Further,   a general strategy for performance analysis and optimization is   introduced that the students will apply in group projects that accompany   the course. Finally, the course will introduce the students to the   recent field of automatic performance tuning. 
									The course will build upon  the version taught in Spring 2008. 
									Prerequisites: solid C programming skills, matrix algebra, Master student or above 
									Topics Covered
									
										- Algorithm analysis: Problem versus algorithm, complexity and cost (asymptotic, exact, measured), cost analysis
										
 - Computer architecture (a software point of view): architecture and microarchitecture, memory hierarchy, special instruction sets
 - Compilers: strengths, limitations, how to use
											
											
											
											
											
											
											
										
 - Performance optimization: guide to benchmarking, finding hotspots, code analysis, performance optimization techniques (for memory hierarchy and vector instruction extensions); these techniques are studied using the examples in the next bullet
											
											
											
											
											
											
											
										
 - Numerical functionality studied in detail (complexity, algorithms, how to write highest performance code): linear algebra kernels, transforms, filters, sparse linear algebra, others, your research project
											
											
											
											
											
											
										
 - State-of-the-art research in Automatic Performance Tuning: ATLAS, LAPACK, BeBOP, FFTW, SPIRAL, others
										
										
										
										
										
										
										
								    
  
									Goals of this Course
									
										- Obtain an understanding of runtime performance                                                                                        
									    
 - Learn a guideline how to write fast numerical code and apply it in homeworks and your research project
									  
 - Understand the connection between algorithms, implementations, and computer architecture 
											
											
											
											
											
											
											
									  
 - Learn some fundamental numerical algorithms
											
											
											
											
											
											
											
									  
 - Learn how to analyze numerical algorithms
										
										
										
										
										
										
										
										
								    
  
									Background Material
									
									Academic Integrity
									All homeworks in this course are single-student homeworks. The   work must be all your own. Do not copy any parts of any of the   homeworks from anyone including the web. Do not look at other   students' code, papers, or exams. Do not make any parts of your   homework available to anyone, and make sure noone can read your   files. The university policies on academic integrity will be applied   rigorously. 
                                     We will be using the Moss system to detect software plagiarism. This system is amazingly good, because it understands   the programming language in question (C, in our case). 
                                     It is not considered cheating to clarify vague points in the  assignments or textbook, or to give help or receive help in using the  computer systems, compilers, debuggers, profilers, or other facilities. 
Grading
									 
									
										- 40% research project
											
												- Topic: Very fast, ideally adaptive implementation of a numerical problem
													
													
													
													
													
													
												
 - Team up in pairs
													
													
													
													
													
													
												
 - March 9: Suggest to me a problem or I give you one
													
													
													
													
													
													
												
 - Show "milestones" during semester
													
													
													
													
													
													
													
													
												
 - Write 4 page standard conference paper (template will be provided)
													
													
													
													
													
													
													
													
												
 - Give short presentation end of semester
												
												
												
												
												
												
										    
  
									     - 20% midterm
											
												- Algorithm/implementation analysis
													
													
													
													
													
													
													
													
												
 - Memory hierarchy
												
 - other                                                                                                        
										    
  
									     - 40% homework
											
												- Exercises on algorithms analysis
													
													
													
													
													
													
													
													
												
 - Implementation exercises. Purpose: study the effect of program optimizations, compilers, special instructions, etc. Tasks: writing and submitting C code & creating runtime/performance plots
													
													
													
													
													
													
													
													
												
 - Some templates will be provided
												
 - All homeworks are single-student homeworks                                                
											
                                              
							    
									There is no final Exam.
									  
Research Project
							   
							  Deadline for code and paper: June 10th, 17:00 
							  
										- How it works:
											
												- Weeks without homeworks should be used to work on the project                                                                                                                                                
												
 - You select a numerical problem and create a correct (verified) implementation in C
												
 - You determine the arithmetic cost, measure the runtime, and compute the performance
												
 - You profile the implementation to find the parts in which most the runtime spent
												
 - Focussing on these you apply various optimization techniques from this class
												
 - You repeat the previous steps to create various versions with (hopefully) better runtime
													
													
													
													
													
													
													
												
 - You write a paper about your work and give a presentation
  
									     - Paper:
											
									    
 - Presentation
											
												- Last week of classes (May 30th and June 1st)
 - 10 minutes (at most 7 slides)
												
 - Template (ppt is 2007 and later): presentation-template.pptx , presentation-template.pdf                                                 
												
 - By Sunday (29th) evening, send me your slides; name them with your project number (e.g., 11.pptx or11.pdf)                                                                                                
												
 - The order will be determined randomly	right before class								
											    
 - Who talks will be determined randomly right before class                                                
										  
                                          
						         - Projects: 
								    
									  - Alexander and Aldo
									    (Model predictive control)
									  
 - Maksim, Filip and Daniel (Eigenvalues)
									  
 - Rokas and Ugur
									    (Optimal binary search organization)
									    
									  
 - Federico                                                                                                                  and Ralph
									    (Image processing library)
									  
 - Aymila and Ömer (Enclosing ball of points)                                      
									  
 - Arkadiy (Metropolis algorithm, Monte Carlo)                                       
									  
 - Benjamin and Sebastian
								      (Seam carving)
 - Florian and Thomas (SURF feature detection)                                      
									  
 - Zaheer, Andrei, and Giovanni
									    (Submodular function optimization)                                      
									  
 - Francois (Graph cuts, Edmond-Karps Algorithm)
									  
 - Ruedi (2D Gaussian filter)
 - Mauro, Bruno, and  Qunzhi (Black Scholes,                                         option pricing)                                            
					            
 - Giaocchino, Thabo, and Manuel (Disparity Map Refinement)                                
  
							    
					    Midterm
                                     
                                    Friday, April 15th (solution, without solution) 
Homework
                                     
                                    
Lectures (including pdfs)
									 
							  
							            
							              Lecture  | 
							              Date  | 
							              Content  | 
							              Slides  | 
							              Notes  | 
							              Other  | 
				                 
							             
							              | 1 | 
							              21.02. | 
							              Course motivation, overview, organization | 
							              link | 
							                | 
							                | 
				                 
							            
							              | 2 | 
							              23.02. | 
							              Problem, algorithm, asymptotic runtime, cost, cost analysis | 
							              link | 
							              link | 
							                | 
				                 
							            
							              | 3 | 
							              28.02 | 
							              Cost analysis, architecture, microarchitecture, memory hierarchy | 
							              link | 
							              link, arch | 
							              Core 2 info | 
				                 
							            
							              | 4 | 
							              07.03. | 
							              Instruction level parallelism, optimizing compilers and limitations | 
							              link | 
							                | 
							                | 
				                 
							            
							              | 5 | 
							              09.03. | 
							              Compiler limitations, benchmarking | 
							              link | 
							                | 
							                | 
				                 
							            
							              | 6 | 
							              16.03. | 
							              Locality | 
							              link | 
							                | 
							                | 
				                 
							            
							              | 7 | 
							              18.03. | 
							              Caches, linear algebra software, BLAS, blocking | 
							              link | 
							              link | 
							              blas | 
				                 
							            
							              | 8 | 
							              21.03. | 
							              MMM, Atlas MMM generator | 
							              link | 
							              link | 
							              paper | 
				                 
							            
							              | 9 | 
							              23.03. | 
							              Atlas MMM generator continued | 
							                | 
							                | 
							                | 
				                 
							            
							              | 10 | 
							              28.03. | 
							              Model-based Atlas | 
							              link | 
							              link | 
							                | 
				                 
							            
							              | 11 | 
							              30.03. | 
							              Model-based Atlas: remaining details, register renaming | 
							              link | 
							                | 
							                | 
				                 
							            
							              | 12 | 
							              04.04. | 
							              Optimization related to virtual memory and TLB | 
							              link | 
							              link | 
							                | 
				                 
							            
							              | 13 | 
							              06.04. | 
							              Dense linear algebra problems, LU factorization | 
							              link | 
							              link | 
							                | 
				                 
							            
							              | 14 | 
							              13.04. | 
							              Blocked LU factorization | 
							                | 
							              link | 
							                | 
				                 
							            
							              |   | 
							              15.04. | 
							              Midterm exam | 
							                | 
							                | 
							                | 
				                 
							            
							              | 15 | 
							              18.04. | 
							              Memory-bound computation, sparse linear algebra, Sparsity/Bebop | 
							              link | 
							                | 
							              paper1, paper2 | 
				                 
							            
							              | 16 | 
							              20.04. | 
							              Sparse matrix-vector multiplication, continued | 
							              link | 
							              link | 
							                | 
				                 
							            
							              | 17 | 
							              02.05. | 
							              SIMD vector instructions, SSE family, SSE intrinsics | 
							              link | 
							                | 
							              icc manual, VS manual | 
				                 
							            
							              | 18 | 
							              04.05. | 
							              SSE intrinsics continued, vectorization efficiency | 
							                | 
							              link | 
							                | 
				                 
							            
							              | 19 | 
							              09.05. | 
							              Compiler vectorization, linear transforms, fast algorithms, discrete Fourier transform (DFT) | 
							              link | 
							              link | 
							                | 
				                 
							            
							              | 20 | 
							              11.05. | 
							              Structured matrices, Cooley-Tukey fast Fourier transform (FFT) | 
							              link | 
							              link | 
							                | 
				                 
							            
							              | 21 | 
							              16.05. | 
							              DFT complexity, recursive/iterative FFT, fast implementation (FFTW) | 
							              link | 
							              link | 
							                | 
				                 
							            
							              | 22 | 
							              18.05. | 
							              FFT: fast implementation (FFTW), codelet generator | 
							              link | 
							                | 
							              FFTW | 
				                 
							            
							              | 23 | 
							              23.05. | 
							              Spiral: Library generator for linear transforms | 
							              link | 
							                | 
							              Spiral | 
				                 
							            
							              |   | 
							              30.05. | 
							              Project presentations | 
							                | 
							                | 
							                | 
				                 
							            
							              |   | 
							              02.06. | 
							              Project presentations | 
							                | 
							                | 
							                | 
				                 
                           							   |