Class Activity/Notes: Image Processing

I used to have a WordPress site for an undergraduate image processing class (AP186) where we were required to write a blog post for each topic discussed in class. As with blog posts, students were free to write in the style they prefer, whether as personal as a reflection of what they learned to something as stiff as writing technical notes. I naively opted with the latter, thinking that it’s not that different than writing weekly notes. I ended up only having a full discussion for 7 out of 14 topics, and my grade for the class reflects that.

Fast forward today (Nov 2021), where I’m integrating all of my online stuff into this GitHub site, I get to reflect on several things like my quality of writing or understanding of the topics. But most notable is the question I asked myself: since the blogs are supposed to help guide future students taking the class, would I recommend these posts to them? Quality-wise, I definitely wouldn’t, but I thought it would be a good opportunity (and challenge) to get these revised at the level that I’ll be confident to share with students.

So here are the posts, initially copied as-is from my WordPress. I hope you’d still look forward to when I have them revised. Enjoy!

Note: I’m marking this post with an older date to represent the span of time the notes were written (June 11 - October 5, 2013). I’ll change the date when this get updated to a level that is up to snuff, unless I end up moving them on another page.

Nonlinear Classification using Feed-forward Neural Networks

From the previous activity, we have shown that we can devise methods for extracting features in the image using image processing, especially the color and the shape of an object within the image. We also used the information gathered from the measurement of features in order to perform classification, where in our case using Linear Discriminant Analysis which builds a linear decision boundary that separates the observations against classes on the feature space. Although the method is effective to our data set, we have shown that there are cases that a linear decision boundary is not effective, such as observations which seem to merge and cluster together even when the observations are of different classes. In order to perform clasification on a data set which cannot be separated linearly, we introduce a nonlinear moethod of classification, known as neural networks.

Neural networks is a nonlinear classifier which tries to emulate the architecture of neuronal systems of the brain. To do this, a set nodes connect to each other through edges. Information is taken, or ‘sensed’ using input nodes, and processed information comes out to the output nodes. What makes the artificial neural network more close to its biological analog is that it has hidden nodes between the input and output nodes. The hidden nodes, which make up the hidden layer, provides abstraction and additional degrees of freedom that the neural network can process the information from the input to the output. This structure, as we will show, gives the neural network the ability to ‘learn’ from the observations given for training.

To describe how neural networks, we first illustrate the simplest neural network architecture, the perceptron. The perceptron is a simple mapping of nodes from the input nodes to an output node, with no hidden nodes present. The hidden nodes simply take in values from measurements, and the output node produces an output value based on the weighted sum of the inputs, where the weights are determined by the edges. We need to adjust the weights, however, in order to provide a correct answer for each input. We do this by providing a training mechanism for the perceptron which, basically, adjusts the weights of the edges depending on whether or not the given input values and weights yield the correct answer. A correct answer means the weights are correct, and does not require correction. A wrong answer means the weights are incorrect, and requires adjustment on the weights based on the inputs. The magnitude of corrections applied to the weights is adjusted by a learning parameter. Similar to other classifiers, a large number of training objects are fed into the network to cover all variations in the feature space where, after many training cycles, it can perform classification with good prediction rate.

In order to improve this, we introduce a hidden layer into the network. Each node of the network behaves similarly to the perceptron, where each node performs weighted summation of the inputs. We populate the network of such nodes, until we form input, hidden, and output layers. We need to impose however that all edges go forward through the network. Generally, this architecture is called a feed-forward neural network, as the information input goes forward through the network until it reaches the output, with no recurrence or backward mapping. After summing the inputs, we pass it through an activation function, which decides what value would the node pass through to the other nodes. Usually, the activation function used in neural-networks is the sigmoid function. The way a feed-forward neural network is trained differs a bit with the perceptron, with the method of training called backpropagation. What it does is it introduces an mean-square error function comparing the expected output and the output given the weights. By doing this, we may minimize the error by posing the error function as an optimization problem, which determines the proper value of weights (variables) which minimizes the error (objective).

Given this, we can now use a neural network to perform the classification. Scilab has a third-party toolbox, called the Artificial Neural Network (ANN) toolbox, which builds different neural network architectures. Using the toolbox, we simply provide the type of neural network used, number of nodes in the network, and other network parameter, and the toolbox will perform the training and classification. For this activity, we use the same data set as the previous activity, given the measurements of color, aspect ratio and area ratio, and the corresponding classes of the objects. We describe the target, which are the correct classification for its corresponding observations, and then specify the learning rate and number of training cycles. We use the following code below:

N  = [4,6,1];
  
//     ColorFt1(u)  ColorFt2(v)    BlobMass   BBoxAspRatio
x = [   
        1.0785009    0.9921922    0.8608245    0.5082873  
//          ...         ...         ...         ...
        0.9460444    0.5695869    0.7161719    1.      ]';
 
// targets, 1 for banana and 0 for orange
t = [1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0];
// learning rate is 2.5 and 0 is the threshold for the error tolerated by the network
lp = [2.5,0];
W = ann_FF_init(N); // Initialize a feed-forward neural network
T = 800; // Training cycles
 
W = ann_FF_Std_online(x,t,N,W,lp,T); 
//x is the training t is the output W is the initialized weights, 
//N is the NN architecture, lp is the learning rate and T is the number of iterations

After training, we can validate the training of the neural network by passing through the measurements into the network, and obtaining the output of the network, given by the code below.

encoder = ann_FF_run(x,N,W,[2,2])
decoder = ann_FF_run(encoder,N,W,[3,3])

Running the measurements to the neural network yield different values, mostly close to either 0 or 1. If we impose a threshold, say 0.5, to determine if the output is 0 or 1, we obtain the targets which we initially introduced.

Pattern Recognition using Linear Discriminant Analysis

From the previous activities, we have developed different methods in order to segment objects from an image. We have shown that we can use the color distribution of a certain sample object in order to segment instances of the object from a larger image. We have also used blob analysis and morphological analysis in order to extract features which may vary in shape, and isolate them from the image as blobs. All of these methods rely that we provide a certain feature or pattern which we could use as reference for segmenting. In this activity, instead of using a single feature, we aggregate together different features across multiple observations in order to classify objects.

Object classification requires that a set of observations form the training set, in which we would take observations that would form our reference for classification. Using a training set forms the assumption that, using the elements of the set, it can already describe other objects of similar kind with good accuracy. In order to do this, we must obtain a large number of observation to cover all possible variations in measurements of object of the same class.

For this activity, we perform the classification between stock photos of bananas and oranges. We obtain 11 images of bananas and 14 images of oranges from Google images. All of the stock photos used were taken from a white background, so as to simplify the images and reduce feature noise.

The first feature we use is the color of the image. It is easily observed from the image that the two classes differ in color, where bananas are usually bright yellow, and oranges are, well, orange. Using this, we may use the average color of the image in. We convert the values into UV space so as to reduce the dimensionality of the observation. We use the following code, shown below, and applied it across all images. We save the values on arrays which represent their corresponding values.

Iu = double(I(:,:,1)) ./ (double(I(:,:,1) + I(:,:,2) + I(:,:,3)) + 1);
Iv = double(I(:,:,2)) ./ (double(I(:,:,1) + I(:,:,2) + I(:,:,3)) + 1);
renorm = length(find(rgb2gray(~I) > 30))/length(I(:,:,1));
uv(i, : ) = [mean(Iu .* (rgb2gray(~I) > 30)), mean(Iv .* (rgb2gray(~I) > 30))];

Next, we use the thresholded mass of the image. The thresholded mass determines the number of pixels which satisfy a certain threshold. Given that our background is white, the threshold determines the region of which does not belong to the background, and thus the thresholded pixels represent those which belong to the object. We simply take the sum of the pixel, and rescale the value based on the total number of pixels present, using the code below.

mass(i) = double(sum(~rgb2gray(I)))/length(I(:,:,1));

Finally, we use the aspect ratio in order to describe the shape of the objects. Bananas are usually elongated. If we apply binary blob analysis over the object, apply a bounding box over the blob, and determine the length of the sides of the box, we could see that the values are more skewed. Oranges, which are rounded, yields a square bounding box with an aspect ratio equal to one. We use the code below to determine how elongated the object is using the aspect ratio of its bounding box.

binary = SegmentByThreshold(rgb2gray(~I), 30)    
BlobImage = FilterBySize(SearchBlobs(binary), 300)
IsCalculated = CreateFeatureStruct(%f);
IsCalculated.BoundingBox = %t;
BlobStatistics = AnalyzeBlobs(BlobImage, IsCalculated);

aspectRatio = BlobStatistics(1).BoundingBox(3) ./ BlobStatistics(1).BoundingBox(4);
if aspectRatio > 1 then
    aspectRatio = 1 / aspectRatio;
end
aspr(i) = aspectRatio

After applying the code, we now have a complete set of data composed of 25 observation, across 4 measurements, and between 2 classes. We now have to develop a robust way in order to use the training set for classification. There are different methods for performing classification based on a training set, each with their own advantages. For example, the k-nearest neighbors uses a voter model, which determines the classification of an object by taking k training objects which properties are closest to the tested object through a distance metric, and taking the class which forms the majority of the set. This method is known as a supervised learning method, which performs machine learning with information regarding the training set provided beforehand of the learning process.

For the activity, however, we would use a more interesting way of performing classification and machine learning across data, known as Linear Discriminant Analysis (LDA). LDA is a more statistical approach of performing classification, in that it uses covariance of objects across classes. To discuss this, we first illustrate the concept of decision boundaries.

Suppose a scattered set of data, with observations from the data belonging to one of at least two classes, as shown by the figure below. By looking at the data, the classes appear as two clusters in space, seemingly separated by a long line. Using the k-NN classification method would seem to be a good method to use for the data set, until we start classifying at objects which appear between the clusters, where the classification may depend on the density of observations, and not on the measurements itself. Instead, we consider building a decision boundary, which would split the data into their respective classes. A linear decision boundary can be a line, a plane, or a hyperplane, depending on the dimensionality of the data. LDA determines the decision boundary by finding the projection which maximizes the separability between the classes, determined by the Fisher Discriminant. The solution to maximixing the Fisher Discriminant is a linear decision boundary. The method is similar to Principal Component Analysis that they both exploit the variance across the observations. Instead of using separability, PCA uses clustering by minimizing the variance.

We apply Linear Discriminant Analysis to the data by converting the observations between oranges and bananas to a large matrix composed of observation vectors, we use the following code, based on (…), for performing the LDA.

globalMean = cumsum([A; B],1)/(length([A;B])/class)
globalMean = globalMean($,:)

//Mean-corrected Data
mcdA = []
mcdB = []
for i = 1:feat
    mcdA = [mcdA, A(:,i) - globalMean(i)]
    mcdB = [mcdB, B(:,i) - globalMean(i)]
end


covarA = ((mcdA') * mcdA) / (length(A)/class)
covarB = ((mcdB') * mcdB) / (length(B)/class)
pooledCovar =  (((length(A)/class)*covarA) + ((length(B)/class)*covarB))/(length([A;B])/class)

prior = [(length(A)/class)/(length([A;B])/class); (length(B)/class)/(length([A;B])/class)]


funcA = (meanA * inv(pooledCovar) * ktot) - 0.5*(meanA * inv(pooledCovar) *  meanA') + log(prior(1))
funcB = (meanB * inv(pooledCovar) * ktot) - 0.5*(meanB * inv(pooledCovar) *  meanB') + log(prior(2))

We can now perform the classification between the images of orange and banana by comparing the values returned from the funcA and funcB. If funcA is larger than funcB, the observation lies on the side of the decision boundary that classfies the objects as orange. If, instead, funcB is larger than funcA, the observation falls to being classified as a banana. After performing LDA, it was determined that the decision boundary given by the analysis fully separates the classes against each other.

Due to a lack of data that can be gathered from the internet, the method was tested by performing cross-validation. Cross-validation involves splitting the set of data into training and testing sets, and performing the analysis. This is repeated until it covers the entire observation set, and is done with different sizes for the testing and training set. For our case, the analysis was tested by taking between one to ten objects, and ran through 20000 testing cycles. The performance was then measured using its prediction rate, which is given by the number of correct predictions. The result is given by the plot below.

performance-plot

Observing the plot, we could see that the performance of LDA decreases as the number of objects within the training set decreases. However, the decrease is not drastic, after the first few removals from the training set. At ten objects removed from the training set, the prediction rate of the classifier reaches 85.3%, which is good given that 10 objects is already a 40% decrease of objects in the training set. This implies that the classifier, given the observations, is robust, and may still perform well even with a smaller training set.

Image Downscaling by Principal Component Analysis

Suppose we load a particular image file into a computer, where no compression method was performed on the file. If the image was in full high-definition (full HD), the image should be at least 1920px by 1080px large. Usually, color images are shown in 24-bit color gamut, which require 3 bytes of information per pixel (8-bit per color channel). In total, it should require 6.22 MB worth of pixel information within the file. Although this scale of file size is relatively small to modern standards, where 1TB storage disks and fast internet bandwidth are not uncommon, it is still inefficient to store all information about the image without compression. Also, there is a limit on the amount of detail present on an image that can be discerned by the eye, where a sufficient amount of detail can produce the details necessary to recognize an object within the image. For this problem, we introduce a concept for image compression through Principal Component Analysis (PCA).

In essence, principal component analysis tries to deconstruct a set of data into its principal components. The principal components form the orthonormal basis functions, or its eigenvectors, which can be linearly superimposed with corresponding weighted coefficients, or its eigenvalues, to span the data set for reconstruction. For example, if we have 20 vectors in 3D space, we may decompose the vectors into 20 orthonormal eigenvectors which form basis of the original set of vectors. An interesting property of how the principal components are decomposed is that the reconstruction can be performed with less eigenvectors but yielding relatively small variance between the original vector and its reconstruction. The method, then, reduces the amount of information required to reconstruct the image to a good extent.

For this activity, consider the color image of a twig, shown below. The size of the image is a 800x600 pixels, which totals to 431KB in JPEG format. By loading the image into Scilab, it is expected to occupy 1.44MB in RAM. Converting the image into grayscale reduces the theoretical image size into 480KB, which is still larger compared to the compressed image size.

Fig1

We can break down the image into small, 10px by 10px blocks which forms the observation set. For our image, this method yields us 4800 observations which are of the form of 100-dimensional vector. This approach can also be reciprocated, such that we obtain 100 observations of 4800-dimensional vector. For this article, we shall try to use the latter.

By using the pca function in Scilab, we obtain the eigenvectors for all of the 100 observations within the image. We can display each of the eigenvectors as blocks, which describe the components needed to reconstruct the image, as shown in the figure below. Interestingly, the blocks look like two-dimensional corrugated roof varying both in frequency and orientation, describing a 2D sinusoidal function. In fact, these eigenvectors form the basis functions of the 2D discrete cosine transform (DCT), which is used on the JPEG format.

Fig2

We can also plot the performance of the first ten eigenvectors on reconstructing the observation set by using the show_pca function. From the plot, we could see that we could perform 99% reconstruction of all observations using only the first four eigenvectors yielded by PCA. At 30 eigenvectors, the reconstruction quality increases to 99.99%.

Fig3

We then reconstruct each image by using the eigenvectors and its corresponding eigenvalues, as shown below. The animated image (click to play, might take a while to load) shows reconstruction performed from 3 to 45 eigenvectors, at intervals of 3. The animation shows the rapid increase in the quality of reconstruction. Of course, the goal of the image compression is to yield the best image with least amount of space needed. Thus, we may truncate the eigenvectors needed for the format to a less number, say, 30 eigenvectors.

Fig4

To compute the space required in storing the information, we total the sizes of each array used in computing the principal components. The eigenvector array (comprinc) contains 4800 elements over the number of eigenvectors used, e.g. 30. As the values are double float (64-bit), this occupies 1.152MB of space. The eigenvalues are much more smaller, which contains a 1D array that is as long as the number of eigenvectors used, thus occupying 240B, which is almost negligible. At this rate, we have already yielded 20% size decrease from raw image array data. We can decrease this further by using single precision float (32-bit), which should contain sufficient precision to reconstruct the image, thus theoretically yielding a 576KB image data with 60% decrease in image size. This amount of space saving can be significant when dealing with large number of image files.

The method can also be performed on color images, by simply splitting the three color channels, and performing the principal component analysis and image reconstruction on each of the color channel array, with the result shown below, at the similar interval used above. The changing in color balance is an artifact of how the array data are converted to uint8 image data using mat2gray. By scaling, similar percent of data space should be saved for both the color image and the grayscale image.

Fig5

Note that this only involves compression of the information needed to display the image, and does not yet involve file compression techniques to raw binary data (e.g. LZW algorithm), which could possibly decrease the file size further. Although there is a significant decrease in the file size, it should be noted that the method requires computation time in order to perform the reconstruction. Although PCA may not be needed to display the image, it still requires matrix multiplication, which can be CPU intensive. GPU technologies, however, which perform fast matrix multiplication routines, may solve this caveat.

In summary, we have shown that PCA is an effective tool in decomposing an image such that it could be reconstructed with less information size needed. The method involves decomposition of blocks of images into its corresponding eigenvectors and eigenvalues, which can be used to reconstruct the image. The method is mostly performing tradeoff between the quality of the image reconstructed and the amount of size it can occupy, where the context of quality can be subjective.

Binary Image Analysis of Sheet Music for Playback

It was discussed from early articles in this blog, from digitising plots to determining area of a geographic region, that image processing methods are very helpful in performing measurements and converting image data into useful numerical data. For the following article, we develop a way of reading and digitising sheet music for digital playback through image processing.

Sheet music is a way of written communication of music between people. It contains a set of five parallel lines, called the staff, where each of the lines and the spacing between them denote a musical tone. Each of these tones belong to the 7 tone scale, building up a single octave. These tones can be played alike on different musical instruments, although they may vary on the manner of which the tone is played. Along the staff are different notes, which denote the duration of which the tone is played.

The goal of this activity is to be able to utilize image processing, such that the computer is able to read off each of the notes from the staff from an image of sheet music, determine the proper tone and duration, and to playback the notes through the speakers. For this activity, we use a sheet music of the very famous “Happy Birthday To You”, shown below.

Note that the piece is relatively simple and very readable. Only a single note is played at a time, and there is not much variation in the duration of playing back the notes. The pitch range, as defined by the clef, is also consistent all throughout the song.

To prepare the sheet music for processing, we first crop the region of which the staff lies, as shown below. It was then converted into an binary image by first inverting the image, and then thresholding any region that contains a nonzero grey level.

After converting the sheet music into proper binary image, we now perform morphological operations on the image. Knowing that the shape and placement of the notes are given by the large dots placed on the staff, we may process the binary image using the Close operation in IPD toolbox, using a circular structuring element of radius 4px. The operation would then retain structures that are relatively circular in shape, and remove anything smaller or other structure that differs. The resulting binary image, after applying the morphological operation, is shown below.

Observing the binary image, we see that it did retain the circular blobs where the notes were located. The grid lines of the staff were also removed from the image. The only remaining artefact within the image are the bar lines, the beams with connect multiple notes together, and the time signature. Despite the presence of these artefacts, we could see that the area of the blobs representing the notes are quite consistent all throughout the music sheet. Thus, it only requires blob area analysis in order to remove the artifacts. To do this, we use the SearchBlobs function, which searches for all contiguous regions of pixels present in the image, and the FilterBySize function which retains blobs of specific pixel mass by imposing a threshold for minimum and maximum blob size. It was determined that the effective threshold for the pixel mass of the notes was between 50px and 115px.

To make the notes more readable, the upper and lower segments were concatenated together, and restricting the size of the image to that of the staff, and a proportional space as gutter. The resulting image is shown by the long strip image with white dots below.

Simple Pattern Recognition by Blob Area Analysis

Previously, we have shown the effect different morphological operations on binary images, mainly the Erosion and Dilation operation. Given a structure element, we can perform the morphological operation pixel-wise with respect to an origin, depending on the set-theoretic definition of each operation. In essence, the dilation operation tends to enlarge the binary structure within the image, while the erosion shrinks the structure instead. In this article, we show a very good application on how such morphological operations can be utilized on analyzing and recognizing patterns within the image.

Suppose that we have an image that contains objects that follow a certain pattern. We may perform different analysis on such kind of image, such as identifying what pattern exists within the image, where and how this pattern repeats, and what other structures in the image exist that differ with the known pattern. As an example, we consider the image below, composed of different paper disks (probably from a puncher) scattered throughout a piece of paper.

At certain regions of the image, clustering of the paper circular disks occur, while other disks are separate from the clusters and can be distinguished easily from each other. In order to use morphological operations, we must first convert the image into binary form by thresholding. We can determine the proper threshold values to be used by looking at its histogram, as shown below.

Observing the histogram, we could see that there is a defined peak occurring at the grey level of around 190. Because the disk appears as bright white and the background appears grey within the image, we can say that this defined peak between 150 and 200 determines the regions where the paper disks are not present.

By using the threshold values above, we obtain a fairly-defined image of the paper disks in the image.

Although there is a good outline of where the paper disks are placed within the image, the image is still riddled with artifacts, such as small clusters of pixels mostly coming from the varying gray levels of the background (sheet of paper). In order to enhance this, we perform the Opening operation.

The Opening operation is essentially the Erosion morphological operation, followed by the Dilation operation. The Erosion operation cleans up the image by removing artifacts that do not match the structure element, and the Dilation increases the area covered by regions which are similar to the structure element. As the structure of interest present within the image are in the form of circles, we may use a circular structure element of certain radius.

StructureElement = CreateStructureElement('circle', 10);
binaryArg = OpenImage(binaryArg, StructureElement);

By applying the Opening operation, we clean up the image significantly, but clustered disks are still present within the image. In order to determine the true structure of the paper disks within the image, we must ignore the clustered regions by determining the area of each pixel cluster, or blob, that exists. This can be performed in Scilab using the SearchBlobs function of the IPD toolbox. The function provides an index for each identified blob which is not connected with another blob. Afterwards, we determine the area of each blob in the image. The code used for the blob analysis is shown below.

clusterTag = SearchBlobs(binaryArg);
for index = 1:max(clusterTag)
    massTag(find(clusterTag == index)) = sum(clusterTag == index);
    massDistribution(histindex) = sum(clusterTag == index)
    histindex = histindex + 1;
end

In order to get good measurement out of the image provided above, we perform bootstrap sampling of the image. Instead of performing all of the morphological operations and blob analysis previously described, we can instead obtain multiple subsamples of the image, and take the result as an aggregate of the measurement on each subsample. To do this, we cut up the image into 256px by 256px subsample on random locations of the image. We ignore the non-unique regions of the subsample images due to overlap on the sampling, on the assumption that each subsample is a good and uniform description of the entire image.

for i = 1:bootSamp
    posx = int(floor(rand()*(dimx - 256))) + 129;
    posy = int(floor(rand()*(dimy - 256))) + 129;

    bootImage = refSrc((posx - 128):(posx + 128 - 1), (posy - 128):(posy + 128 - 1));
    imwrite(bootImage, 'samples/'+msprintf('%.2i', i)+'-sample.jpg')

    binaryArg =~( ( bootImage &amp;lt; histUpperBound ) &amp;amp; (bootImage &amp;gt; histLowerBound) );

After taking the blob “mass” for all subsampled image, we can plot all of the masses of each blob on a histogram, shown below:

We can observe that a huge peak is present at around the blob mass of 500px, with other measurement present at higher or lower blob masses. As smallest physical structure present of the image is the paper disk, we may say that most of the disks have blob mass of around 500px. Interestingly, some counts of blobs with masses of multiples of 500 are present in the histogram, indicating the masses of the clustered blobs.

In order to determine the actual variation in the blob mass of the paper disk, we take the histogram of blobs that only have masses between 450px and 650px, shown by the histogram and the resulting isolated thresholded blobs below.

The histogram now indicates a more normal distribution of blob masses with some spreading. In fact, the peak can now be pointed out more exactly to be at around 520px. We use this as the average blob mass representing the paper disks, with a variation of 20px from the mean. We use these as threshold values for recognizing structures and patterns for the second part of the activity.

Suppose we have another image containing similar paper disks from the previous image, but with some ‘defective’ disks which vary in size from majority of the disk, as shown by the image below.

Assuming that the nature of the both images are the same, besides the defects present above, we can apply the same parameters for thresholding to arrive at a similar binary image. The image, together with the blob mass distribution, are both shown below.

Interestingly, compared to the original, there is an unusual bump present at around 900px, instead of peaks appearing mostly at multiples of 500px. Although we cannot point out directly that the unusual peak represents the anomalous disks present in the image, we can assume that paper disks that are relatively larger than the usual are present. In order to make the larger disks more distinguishable, we use a structure element with a radius larger than that previously used. We use the code below, with the only modification from the previous part being the size of the structure element.

StructureElement = CreateStructureElement('circle', 16);
cancerAnalyze = rgb2gray(imread('Circles with cancer.jpg'));
binaryCancer = ~( ( cancerAnalyze < histUpperBound ) && (cancerAnalyze > histLowerBound) );
binaryCancer = OpenImage(binaryCancer, StructureElement);

clusterCancer = SearchBlobs(binaryCancer);
for index = 1:max(clusterCancer)
    massTag(find(clusterCancer == index)) = sum(clusterCancer == index);
    cancerDistribution(index) = sum(clusterCancer == index);
end

After using the code above, we can obtain immediately the blobs which are defective and larger, shown by the image below. In fact, by simple visual inspection, we could see that the operations performed well that it detected all of the enlarged disks within the set.

Image Enhancement on the Frequency Domain

Building upon the principles of image formation that was discussed from the previous entry, we now utilise this for performing image manipulation and enhancement in the frequency domain.

The main point of the previous entry was that an image was formed from the convolution of different elements that affect it. As a specific sample, we have shown that the appearance of sidelobes from the image is formed by the convolution of the object and the aperture. Extending this principle, we might say that in order to remove a feature from the image (such as the diffraction pattern), we simply need to deconvolve the effect from the original image. This is not easy, however, and should not work from simple element-wise division of complex numbers. Most of the time, when we are introduced to the image that we are to manipulate, we don’t have the exact information on the stray effects that influence the image, and thus we don’t know exactly what the divisor is. Although this operation looks really technical, it can actually be performed easily using a principle commonly used in processing 1D temporal signals.

An alternative is to perform convolution to the image with another image which, as an estimate, would remove the unnecessary signal from the image. In effect, we are simply estimating the divisor to be used for the convolution in order to remove the convolved effects.

In electrical signals, we commonly use the RLC-circuits in order to remove signals of certain frequency that interferes with the information of the signal. The circuits, in principle, operates on the frequency (Fourier) space of the signal, where it applies bias on certain regions on the frequency space. As an example, a low-pass RC-circuit biases the frequency space such that it nearly zeroes out every high frequency of the signal, thus only permitting signals of low frequency to pass through. This is an effective method on removing high frequency signal interference.

Similarly, we can use this on the Fourier space of the image, where we apply bias on certain regions of the frequency space, and remove such regions that contribute to unnecessary patterns appearing on the image. In order to apply this, we must first develop a sense of how certain patterns appear on frequency space.

The simplest example is would be two dots on an axis. We note that, in 1D, the Fourier transform of a sinusoidal signal is two peaks (Dirac delta) equidistant to the axis. Similarly, by taking the (inverse) Fourier transform of two equidistant axis dots in an image, we would obtain a 2D sinusoidal function, similar to a corrugated roof, as shown below.

Next, we apply the principle of convolution on the two dots. Suppose we add a certain radius on both of the dots. Taking the Fourier transform, we obtain the pattern of the corrugated roof, but combined with a diffraction pattern due to an aperture. We can deduce from the image that it is a result of the convolution of two dots and a circular aperture, which make sense, as the image we used is indeed a convolution of a circular aperture and two dots. By changing the radius of the two dots, we can observe that the size and spread of the diffraction due to the aperture changes, where the diffraction pattern appears more spread out as the aperture size is decreased, but the corrugated pattern is still similar. Observations from these can be seen from the figure below.

We can change the shape of the circular aperture, using a square box instead. By taking the Fourier transform, we obtain a square form of the diffraction pattern, as shown below. It also behaves similarly to the circular aperture, where the pattern spreads out more when the circular aperture is decreased, but the corrugated roof pattern is still the same.

Finally, we show a non-conventional aperture for the image, we use two dots that spread out as a Gaussian distribution, with the spread controlled by $\sigma$. We note that the Fourier transform of the Gaussian function is also a Gaussian function, as shown below. Decreasing the spread on the image increases the spread in the frequency domain.

This actually presents the difficulty on determining the image to be used for deconvolving patterns on the image: small changes on the patterns tend to appear very altered when in frequency space. If the unnecessary pattern within the image would appear somewhat between a circle and a square (squircle), using either the square or circle pattern to deconvolve the image would not be enough, and might even alter the image completely. The Gaussian function appears similar to that of a circular aperture. Also, we don’t have information on how large the aperture used on the pattern is.

Finally, we show the effects of convolution on a repeating structure. Suppose we have randomly-placed dots within the image, and that we have a small pattern. We can take the Fourier transform of each and element-wise multiply both together (thus taking its convolution), and we could have the resulting pattern as shown on the next two figures.

Note that, by convolution, the pattern is simply tiled on the positions where the dots were present. This actually tells a lot on how we would perform image enhancement on the frequency domain: certain patters tend to appear repetitively within the image, but may appear altered on orientation. In order to remove the pattern that appears repetitively on the image, we simply need to deconvolve (filter out) the region on the frequency domain which contributes to this. For example, if a fishnet pattern appears on the image repetitively, we only need to determine the which region on the frequency domain contributes to this, deconvolve this pattern, and thus we can obtain an image without the fishnet pattern.

As an example, we would use an image with grid pattern, as shown by the moon lander image in the figure below. Notice that gridlines tend to appear throughout the image, as artefact by the imaging device, where the image is transmitted by strips, thus requiring tiling in order to reconstruct the image.

In order to remove this, we show one particularly interesting pattern and its Fourier transform. Suppose we have equally-spaced dots placed along both the center x- and y-axis of the image (not the image axis). By taking its Fourier transform, we obtain a grid pattern. By increasing the spacing between the dots, the grid tends to get smaller in size.

Using this, we can claim that the grid pattern would appear on the center axis of the image, as confirmed by the Fourier transform of the lander image below. Thus, we develop a 2D pass filter to convolve with the image, which zeros out the region which produces the grid pattern. Instead of determining the size of spacing to identify which to zero-out, we remove the axis altogether. Note that we leave some space on the center where the low-frequency region exists. We may claim that most of the information occurs as extremely low frequency occurrence, and thus lay on the center region.

The code used for the filtering is shown below:

landerImage = double(rgb2gray(imread('moon.jpg')));
landerImage_FT = fftshift(fft2(landerImage));
landerImage_FT((240-2):(240+2), 1:(320-5)) = zeros(5, 320 -5);
landerImage_FT((240-2):(240+2), (320+6):$) = zeros(5, 320 -5);
landerImage_FT(1:(240-5), (320-2):(320+2)) = zeros(5, 240 -5)';
landerImage_FT((240+6):$, (320-2):(320+2)) = zeros(5, 240 -5)';

landerImage_reconstruct = fft2(landerImage_FT);

By taking the Fourier transform of the convoluted image, we obtain a smooth, grid-free image shown below. The enhancement, however, doesn’t come out without artifacts with it. For one, the image appears inverted, which is simply due to the behavior of the Fourier transform as discussed from the previous entry. Also, if we were to look on the legend box (1km scale marker), some lines are appearing. Despite that, the deconvolution of the pattern appears successful as it was applied on the image.

The next example would involve an image with fishnet pattern, as shown in the image below, by a painting of Dr. Daria. This time, we would take a different approach on producing the filter for deconvolution. Note that, for illustration, we would convert the image first in grayscale.

Taking the Fourier transform, we would obtain the following pattern as shown below. Interestingly, certain peaks appear throughout the frequency space. As we have previously shown, repeating structures appear as peaks within the image. In order to remove these, we impose a threshold on the frequency domain, where anything above five times the median is zeroed out. This threshold value is taken arbitrarily, which produces the best resulting image. Note that a circular region within the frequency space is left unaltered, consistent to our description about information lying at low frequency region.

The resulting image would then appear much smoother compared to the original, but with some artefacts from the fishnet pattern present within the image. As usual, the image appears inverted.

The filtering can then be applied on all of the color channels of the image, using the code below:

paintingRGB = double(imread('painting.JPG'));
paintingR = paintingRGB(:,:,1);
paintingG = paintingRGB(:,:,2);
paintingB = paintingRGB(:,:,3);

painting_FTR = fftshift(fft2(paintingR));
painting_FTG = fftshift(fft2(paintingG));
painting_FTB = fftshift(fft2(paintingB));

outsideRegionR = painting_FTR .* (1 - sample_aperture);
filteredR = (sample_aperture.*painting_FTR) + (((abs(outsideRegion) &lt; (5*medianOutsideDot))).*outsideRegion)
outsideRegionG = painting_FTG .* (1 - sample_aperture);
filteredG = (sample_aperture.*painting_FTG) + (((abs(outsideRegion) &lt; (5*medianOutsideDot))).*outsideRegion)
outsideRegionB = painting_FTB .* (1 - sample_aperture);
filteredB = (sample_aperture.*painting_FTB) + (((abs(outsideRegion) &lt; (5*medianOutsideDot))).*outsideRegion)

painting_reconstructRGB = zeros(paintingRGB);
painting_reconstructRGB(:,:,1) = mat2gray(abs(fft2(filteredR)));
painting_reconstructRGB(:,:,2) = mat2gray(abs(fft2(filteredG)));
painting_reconstructRGB(:,:,3) = mat2gray(abs(fft2(filteredB)));

The result of this is similar to the one obtained previously, but rotated for comparison. The process removed the fishnet pattern, but the process smoothened out the image that make it lack sharpness. Nevertheless, the image appears more pleasing and less annoying to the eyes of the observer, showing that the image enhancement we performed was effective.

Using all the concepts above, we have developed a method on how to perform image enhancement on the frequency domain, more specifically a method of deconvolution by applying certain pass filters on the image. Also, by blocking out peaks that appear on the image, we can remove certain repeating structures that may appear annoying to the observer, thus enhancing the image.

Fourier Transform Model of Image Formation

The focus of this activity shall be to extend the image processing and manipulating techniques we have developed, by instead manipulating the frequency space of the image using the Fourier transform in 2D. First, we must develop some preliminaries about how the Fourier transform in images, and the role it plays in image formation.

The basic form of the Fourier transform is when it is performed on a one-dimensional signal, such as voltage measurements that vary through time. The Fourier transform of a 1D function can be computed using the following equation:

$$F(\omega) = \int^{\infty}_{-\infty}f(t)e^{-j\omega t}dt$$

where $F(\omega)$ is the Fourier transform of the signal $f(t)$. The transform maps the signal from being temporal into the frequency space. In fact, if one were to take the Fourier transform of a pure, single-wavelength sinusoidal signal, the result shall appear as peaks that are located exactly on the frequency of the signal.

A fundamental theorem that stems from the Fourier transform is the convolution theorem. Suppose you have two signals $f(t)$ and $g(t)$. The convolution $f\star g$ of the two signals is given by the following integral:

$$f\star g (t) = \int^{\infty}_{-\infty}f(t)g(t - \tau)d\tau$$

The implication of performing convolution on two signals is that, as shown on the equation above, a signal is translated and projected over the entire region of the other signal. Interestingly, taking the Fourier transform of the convolution function gives the product of the two Fourier-transformed functions, or $F(f\star g) = F(f)F(g)$. This means that the convolution operation, which required tedious translation over the entire region of the signal, can be simplified by instead taking the Fourier transform of the signals, performing element-wise multiplication from the result, and then taking the inverse Fourier transform of the result.

Now, we may apply the following concepts regarding the Fourier transform into images and its significance in the model of image formation. Knowing that images can be represented by a 2D $N\times M$ array (thrice for RGB) containing gray levels per pixel, the Fourier transform needs to be performed discretely and in both the horizontal and vertical regions of the image. Extending the first equation into two dimensions, and performing it discretely, we obtain the following equation for 2D Fourier transform.

$$\frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} I(x,y) \exp \left( -2\pi j \left( \frac{hx}{N} + \frac{ky}{M}\right) \right)$$

where $I(x, y)$ represents the image array. Similar techniques can be used from the Fast Fourier Transform method in one-dimensional signals for use in 2D images.

To put the discussion of the 2D Fourier transform into the context of image formation, suppose we have a circular aperture of certain radius, and uniformly distributed monochromatic light passes through the aperture. We could describe the phenomenon by taking an image with a circle at the center, where the entire region outside the circle is blacked out. Using the fft2 function in Scilab, we take the Fourier transform of this image.

Note that the result of the Fourier transform is an array of complex values. In order to visualize the transformed image, we must first take the absolute value of each element. For compliance with the imshow function, we must scale the resulting values and convert each element to uint8, using the mat2gray function. The first thing we could observe upon applying the Fourier transform on the image is that, despite rescaling, nothing seems to

The repositioning of elements of the image is an artifact upon using the Fourier transform, and can be resolved by using the fftshift function. Using this, we obtain a slightly visible dot with radial fringing surrounding the center.

The structure of the resulting point with fringing should remind us of the physical phenomenon occurring when light passes through a circular aperture: a radial pattern forms through diffraction from the aperture, the profile which is determined by the Airy function. Varying the aperture size yields different size of the pattern, where smaller apertures yield more spreading on the resulting pattern.

In fact, this phenomenon is fully described by Fresnel theory of diffraction, which models the diffraction of light based on Huygens Principle. Based on Fresnel diffraction, an image I(x,y) diffracts when projected to a distance d as dictated by the following transformation:

Something something

Observing the equation, we can see that the result of Fresnel diffraction (to be exact, Fraunhofer diffraction, which occurs when the distance between the object and image is far enough that its effects are negligible) can be obtained by simply taking the Fourier transform, together with other parameters, of the object. Comparing to the generated image above, the diffraction pattern produced by light passing through a circular pattern projected to a screen would produce the Airy pattern.

By using the fft2 function over the produced image again, we obtain back the original image, which can be described as taking the inverse Fourier transform of the image. Another illustration of the effect of taking the Fourier and inverse Fourier transform is shown below, with the letter “A”. Similar to a lens, the image appears inverted as compared to the original image.

Finally, we use the principle of convolution and the convolution theory on Fourier transforms, as we have developed previously, on the theory of image formation. Light, as it travels to forms an image, is strongly influenced by the path it takes, including the material it passes through and obstacles that exists through its way. This influence can be thought of as a convolution of the light from the object and the influence of the path. As an example, consider the working principle of a pinhole camera. Light from an object passes through a circular aperture, and then gets projected on the region where the image is formed. Most of the time, when the distance between the aperture and image plane is not adjusted for focus, the image appears blurred, where the edges appear diffracted. This phenomenon can be observed by convolving the object to the circular aperture, as shown below. Note that the convolution is performed by simply performing element-wise multiplication on the Fourier transform of both the object and the aperture.

From the generated image above, we can see that it behaves similarly as a pinhole camera: the image appears inverted, and the image has a diffraction pattern occurring around it. We can attribute all of these phenomenon as a result of the convolution, where the inversion is due to the FFT, and the diffraction pattern due to the sidelobes contributed by the circular aperture to the image.

As a diversion from the discussion of theory of image formation, we discuss other principles and properties of the Fourier transform as applied to signal and image processing. As previously discussed, the convolution is performed such that the result is the projection of one signal (image) to the other. Suppose, this time, that instead of blending together two functions, we would like to determine how the image would match to the other. Consider two complex functions $f$ and $g$ in time. If the two signals match well when $g$ is lagged down by a certain time $\tau$, we would like to obtain high measure. For complex functions, we could obtain this by multiplying one function and the conjugate of the other, $f(t)g^*(t - \tau)$. To perform this throughout the signal, we can modify the convolution integral, given by

$$ f \star g(t) = \int^{\infty}_{-\infty} f^*(\tau) g(t - \tau) d\tau$$

which is the definition of the (cross-) correlation operator. Being another form of the convolution operation, we can also invoke the convolution theory for cross-correlation, $f \star g = F^{-1}[F * G^{*}]$, where each element of the Fourier transform of the signal are multiplied with each other, with an addition that one of the signals must be conjugated. This operation can also be performed on images, as illustrated by the next example.

Suppose we have an image that contains different elements, such as letters contained to an image of a sentence. A specific example of such image is shown below.

We can notice from the sentence that it contains many instances of the letter ‘A’. In order to determine the location of occurrence of the letter ‘A’, we can use cross-correlation between the image above, and a template image which only contains the letter ‘A’. The procedure involves taking the Fourier transform of both image, with the template complex-conjugated, performing element-wise multiplication, and then taking the inverse Fourier transform of the resulting matrix, as illustrated by the following code used.

spain = double(rgb2gray(imread('spain.png')));
aref = double(rgb2gray(imread('aRef.png')));
correlation = fft2(aref) .* conj(fft2(spain));
invcorr = abs(fft2(correlation));

Using the following operations, we obtain the image below. Notice that certain peaks appear on the region where the letter ‘A’ appeared. By cross-correlation, maximum values where obtained when the template matched the original image well. This operation can be useful when performing simple pattern recognition. In fact, there is an important method where the correlation of a signal with itself is taken, called auto-correlation, as this determines if reoccurring elements are present within the signal.

Nonlinear Classification using Feed-forward Neural Networks#

Pattern Recognition using Linear Discriminant Analysis#

Image Downscaling by Principal Component Analysis#

Binary Image Analysis of Sheet Music for Playback#

Simple Pattern Recognition by Blob Area Analysis#

Image Enhancement on the Frequency Domain#

Fourier Transform Model of Image Formation#