
How Can I Make My Python Code Run Faster

I am working on code that loops over multiple large (~28 GB) netCDF files. The netCDF files have multiple 4D variables [time, east-west, south-north, height] throughout a domain. …

Solution 1:

This is a modest first pass at tightening up your for-loops. Since you only use the file's shape once per file, you can move that handling outside the loops, which should cut down on repeated data loading. I still don't get what counter and inc do, as they don't seem to be updated in the loop. As further starting points, you definitely want to look into the performance of repeated string concatenation, and of all that appending to predictors_wrf and names_wrf.

k_space = np.arange(0, 37)
j_space = np.arange(80, 170)
i_space = np.arange(200, 307)

predictors_wrf = []
names_wrf = []

counter = 0
cdate = start_date
while cdate <= end_date:
    if cdate.month not in month_keep:
        cdate += inc
        continue
    yy = cdate.strftime('%Y')
    mm = cdate.strftime('%m')
    dd = cdate.strftime('%d')
    filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
    file_exists = os.path.isfile(filename)
    if file_exists:
        # Open each file once, outside the triple loop below.
        f = nc.Dataset(filename, 'r')
        times = f.variables['Times'][1:]
        num_lines = times.shape[0]
    for i in i_space:
        for j in j_space:
            for k in k_space:
                if file_exists:
                    if num_lines == 144:
                        u = f.variables['U'][1:, k, j, i]
                        v = f.variables['V'][1:, k, j, i]
                        wspd = np.sqrt(u**2. + v**2.)
                        w = f.variables['W'][1:, k, j, i]
                        p = f.variables['P'][1:, k, j, i]
                        t = f.variables['T'][1:, k, j, i]
                    elif num_lines < 144:
                        print("partial files for WRF: " + filename)
                        u = np.ones((144,))*99.99
                        v = np.ones((144,))*99.99
                        wspd = np.ones((144,))*99.99
                        w = np.ones((144,))*99.99
                        p = np.ones((144,))*99.99
                        t = np.ones((144,))*99.99
                else:
                    # Missing file: fill every variable with the sentinel value.
                    u = np.ones((144,))*99.99
                    v = np.ones((144,))*99.99
                    wspd = np.ones((144,))*99.99
                    w = np.ones((144,))*99.99
                    p = np.ones((144,))*99.99
                    t = np.ones((144,))*99.99
                    counter = counter + 1
                predictors_wrf.append(u)
                predictors_wrf.append(v)
                predictors_wrf.append(wspd)
                predictors_wrf.append(w)
                predictors_wrf.append(p)
                predictors_wrf.append(t)
                u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i)
                v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i)
                wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i)
                w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i)
                p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i)
                t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i)
                names_wrf.append(u_names)
                names_wrf.append(v_names)
                names_wrf.append(wspd_names)
                names_wrf.append(w_names)
                names_wrf.append(p_names)
                names_wrf.append(t_names)
    cdate += inc
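Beyond the loop cleanup, the biggest win is usually to stop pulling one time series per grid point. The idea can be sketched with a random NumPy array standing in for one 4-D netCDF variable (the indexing arithmetic, not the netCDF API, is the point here; the shapes match the k/j/i ranges above):

```python
import numpy as np

# Stand-in for one 4-D variable [time, height, south-north, east-west].
rng = np.random.default_rng(0)
u = rng.random((144, 37, 90, 107))

# Per-point read, as in the triple loop: one (144,) time series.
loop_series = u[:, 2, 5, 7]

# Whole-slab read plus reshape: axis 0 stays time, axis 1 enumerates
# every (k, j, i) grid point in C order.
slab = u.reshape(144, -1)            # shape (144, 37*90*107)
col = 2 * (90 * 107) + 5 * 107 + 7   # flat index of point (k=2, j=5, i=7)
vec_series = slab[:, col]

print(np.allclose(loop_series, vec_series))  # → True
```

Reading the slab once and reshaping means one large I/O operation per variable per file instead of ~356,000 tiny ones.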

Solution 2:

For your question, I think multiprocessing will help a lot. I went through your code and have some pieces of advice.

  1. Use the filenames, not the start time, as the iterator in your code.

    Wrap the filename discovery in a function that takes the time range and returns a list of all filenames.

    def fileNames(start_date, end_date):
        # Find all filenames.
        cdate = start_date
        fileNameList = [] 
        while cdate <= end_date:
            if cdate.month not in month_keep:
                cdate+=inc
                continue
            yy = cdate.strftime('%Y')        
            mm = cdate.strftime('%m')
            dd = cdate.strftime('%d')
            filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
            fileNameList.append(filename)
            cdate+=inc
    
        return fileNameList
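    As a self-contained illustration of this step — here `wrf_path`, `hour_str`, `inc`, and `month_keep` are made-up assumptions, since the question doesn't show their values:

```python
import datetime

wrf_path = '/data/wrf'            # assumed base directory
hour_str = '00'                   # assumed hour string
inc = datetime.timedelta(days=1)  # assumed daily increment
month_keep = {1, 2, 12}           # assumed months to keep

def file_names(start_date, end_date):
    # Collect one expected filename per kept day in the range.
    cdate = start_date
    names = []
    while cdate <= end_date:
        if cdate.month in month_keep:
            names.append(cdate.strftime(
                wrf_path + '/wrfoutRED_d01_%Y-%m-%d_' + hour_str + '_00_00'))
        cdate += inc
    return names

print(file_names(datetime.date(2017, 1, 1), datetime.date(2017, 1, 3)))
# → ['/data/wrf/wrfoutRED_d01_2017-01-01_00_00_00',
#    '/data/wrf/wrfoutRED_d01_2017-01-02_00_00_00',
#    '/data/wrf/wrfoutRED_d01_2017-01-03_00_00_00']
```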
    
  2. Wrap the code that pulls your data (and fills missing data with 99.99) in a function whose input is the filename.

    def dataExtraction(filename):
        # Local result lists, so each worker process builds and returns its own data.
        predictors_wrf = []
        names_wrf = []
        counter = 0
        file_exists = os.path.isfile(filename)
        if file_exists:
            f = nc.Dataset(filename, 'r')
            times = f.variables['Times'][1:]
            num_lines = times.shape[0]
        for i in i_space:
            for j in j_space:
                for k in k_space:
                    if file_exists:
                        if num_lines == 144:
                            u = f.variables['U'][1:,k,j,i]
                            v = f.variables['V'][1:,k,j,i]
                            wspd = np.sqrt(u**2.+v**2.)
                            w = f.variables['W'][1:,k,j,i]
                            p = f.variables['P'][1:,k,j,i]
                            t = f.variables['T'][1:,k,j,i]
                        elif num_lines < 144:
                            print("partial files for WRF: " + filename)
                            u = np.ones((144,))*99.99
                            v = np.ones((144,))*99.99
                            wspd = np.ones((144,))*99.99
                            w = np.ones((144,))*99.99
                            p = np.ones((144,))*99.99
                            t = np.ones((144,))*99.99
                    else:
                        u = np.ones((144,))*99.99
                        v = np.ones((144,))*99.99
                        wspd = np.ones((144,))*99.99
                        w = np.ones((144,))*99.99
                        p = np.ones((144,))*99.99
                        t = np.ones((144,))*99.99
                        counter = counter + 1
                    predictors_wrf.append(u)
                    predictors_wrf.append(v)
                    predictors_wrf.append(wspd)
                    predictors_wrf.append(w)
                    predictors_wrf.append(p)
                    predictors_wrf.append(t)
                    u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i)
                    v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i)
                    wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i)
                    w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i)
                    p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i)
                    t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i)
                    names_wrf.append(u_names)
                    names_wrf.append(v_names)
                    names_wrf.append(wspd_names)
                    names_wrf.append(w_names)
                    names_wrf.append(p_names)
                    names_wrf.append(t_names)

        # list(...) so the result is picklable when returned from a worker.
        return list(zip(predictors_wrf, names_wrf))
    
  3. Use multiprocessing to do the work. Generally, all computers have more than one CPU core, and multiprocessing speeds things up when there are heavy CPU-bound calculations. In my experience, it has cut the time spent on huge datasets by up to two-thirds.

    Update: after testing my code and files again on Feb. 25, 2017, I found that using 8 cores on a huge dataset saved me 90% of the elapsed time.

    if __name__ == '__main__':
        from multiprocessing import Pool  # better placed with the imports at the top
        start_date = datetime.date(2017, 1, 1)
        end_date = datetime.date(2017, 1, 15)
        file_list = fileNames(start_date, end_date)  # don't shadow the function name
        p = Pool(4)  # the number of cores you want to use
        results = p.map(dataExtraction, file_list)
        p.close()
        p.join()
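    The shape of the `Pool.map` call can be sanity-checked on a toy function first; here `square` is just a stand-in for `dataExtraction` (any picklable, module-level function works):

```python
from multiprocessing import Pool

def square(x):
    # Stand-in for dataExtraction: takes one item, returns one result.
    return x * x

if __name__ == '__main__':
    with Pool(2) as p:                      # 2 worker processes
        results = p.map(square, [1, 2, 3])  # one call per input, order preserved
    print(results)  # → [1, 4, 9]
```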
    
  4. Finally, be careful with the data structures here, as they get pretty complicated. Hope this helps. Please leave comments if you have any further questions.

Solution 3:

I don't have very many suggestions, but a couple of things to note.

Don't open that file so many times

First, you define this filename variable and then inside this loop (deep inside: three for-loops deep), you are checking if the file exists and presumably opening it there (I don't know what nc.Dataset does, but I'm guessing it must open the file and read it):

filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00'
for i in i_space:
    for j in j_space:
        for k in k_space:
            if os.path.isfile(filename):
                f = nc.Dataset(filename,'r')

This is going to be pretty inefficient. Since the file doesn't change while you read it, you can open it once, before all of your loops.

Try to Use Fewer for-loops

All of these nested for-loops are compounding the number of operations you need to perform. General suggestion: try to use numpy operations instead.
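For example, the wind-speed computation from the question needs no loop at all; assuming `u` and `v` are already 2-D NumPy arrays (toy shapes here), one vectorized expression replaces the per-element version:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.random((144, 37))  # toy stand-ins for the U and V slices
v = rng.random((144, 37))

# Element by element, as nested loops would do it.
wspd_loop = np.empty_like(u)
for t in range(u.shape[0]):
    for k in range(u.shape[1]):
        wspd_loop[t, k] = np.sqrt(u[t, k]**2 + v[t, k]**2)

# One whole-array expression: same numbers, far fewer Python-level steps.
wspd_vec = np.sqrt(u**2 + v**2)

print(np.allclose(wspd_loop, wspd_vec))  # → True
```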

Use cProfile

If you want to know why your program is taking a long time, one of the best ways to find out is to profile it.
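A minimal way to do that with the standard library's cProfile module — profiling a toy function here, since the original script isn't runnable standalone:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately loop-heavy so it shows up in the profile.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Print the five most expensive calls, sorted by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(5)
print(buf.getvalue())
```

In the printed table, look for your own functions near the top of the cumulative-time column; that tells you which loops are actually worth optimizing.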
