I’d like to say up front that I don’t have much experience with NumPy, so deeper explanations would be appreciated (even of obvious things).
Here’s my issue:
    converted_X = X
    for col in X:
        curr_data = X[col]
        i = 0
        for pix in curr_data:
            inv_pix = 255.0 - pix
            curr_data[i] = inv_pix
            i+=1
        converted_X[col] = curr_data.values
Context: X is a DataFrame with images of handwritten digits (70k images, 784 pixels/image).
The entire point of doing this is to change the black background to white and white numbers to black.
The only problem is that it takes a ridiculously long time. I tried using rich.Progress() to track its execution, and it shows an astonishing 4-hour ETA.
Also, I’m running this code block in the Jupyter notebook extension of VS Code (in case that helps).
I know it probably comes down to a ton of inefficiencies and under-use of NumPy functionality, but I need guidance.
Thanks in advance.
Never write a Python for loop over NumPy data. Avoiding such loops is how you make the code faster.
Most of the time, there is a way to have NumPy do the loop for you (meaning it processes the data in batch; obviously there is still a for loop, just not one you wrote in Python).
Here, it seems you are trying to compute an inverted image, in which each pixel is 255 minus the original pixel value.
inverted_image = 255-image
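The same one-liner should carry over to the DataFrame in the question, since pandas applies arithmetic element-wise to the whole frame. The sketch below is only an illustration under assumptions: the small random `X` is a stand-in for the real 70k x 784 data, and the pixel values are assumed to be floats in the range 0 to 255.

```python
import numpy as np
import pandas as pd

# Stand-in for the real data (assumption): 5 "images" of 784 pixels each.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 256, size=(5, 784)).astype(np.float64))

# One vectorized expression inverts every pixel of every image.
# It builds a new DataFrame, so X itself is left unchanged.
converted_X = 255.0 - X

# Equivalent on the underlying NumPy array, if a plain ndarray is preferred.
converted_array = 255.0 - X.to_numpy()
```

Note that `converted_X = X` in the question does not copy anything; both names refer to the same DataFrame. The subtraction above returns a new object, so no explicit copy is needed.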
Addition: note that NumPy arrays are quite inefficient when used like plain Python arrays. If you treat them just as 2D arrays that you read and write with low-level operations (setting values individually), then, most of the time, even good ol’ Python lists are faster. For example, in your case (I’ve just tried), on my machine your code is 9 times slower with ndarrays than the exact same code using a plain Python list of lists of values.
The whole point of ndarrays is that they are faster because you can use them with NumPy functions that process the whole data in batch for you, which would not be as easy with Python lists.
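To see the difference on your own machine, something like the following rough sketch can be timed. The function names and the 28 x 28 random image are made up for illustration, and the numbers depend heavily on the machine and library versions, so no particular ratio is claimed here.

```python
import timeit
import numpy as np

# Hypothetical test image: 28 x 28 float pixels in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28)).astype(np.float64)
image_list = image.tolist()  # same data as a plain list of lists


def invert_elementwise(src, dst):
    """Invert pixel by pixel; works on ndarrays and on lists of lists."""
    for r in range(len(src)):
        for c in range(len(src[r])):
            dst[r][c] = 255.0 - src[r][c]
    return dst


def invert_vectorized(img):
    """Let NumPy run the loop internally over the whole array."""
    return 255.0 - img


def empty_list_image():
    """Fresh 28 x 28 list of lists to write results into."""
    return [[0.0] * 28 for _ in range(28)]


# Results of the three approaches, kept for comparison.
loop_nd = invert_elementwise(image, np.empty_like(image))
loop_list = invert_elementwise(image_list, empty_list_image())
vec = invert_vectorized(image)

# Time each approach over the same number of repetitions.
n = 200
t_nd = timeit.timeit(lambda: invert_elementwise(image, np.empty_like(image)), number=n)
t_list = timeit.timeit(lambda: invert_elementwise(image_list, empty_list_image()), number=n)
t_vec = timeit.timeit(lambda: invert_vectorized(image), number=n)

print(f"element-wise on ndarray: {t_nd:.4f} s")
print(f"element-wise on lists:   {t_list:.4f} s")
print(f"vectorized NumPy:        {t_vec:.4f} s")
```

All three approaches produce the same pixel values; only the place where the loop runs differs.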