Delete all files / folders in subdirectories with a certain pattern on terminal

Today I had a problem on terminal. I have such a folder structure:

./<level_A>/<level_B>_good

or

./<level_A>/<level_B>_bad

Thus I have some folders(level_A) and in each folder there are another level of folders(level_B). On level_B, there are “good” folders and “bad” folders. “Good” folders end with “_good” and “bad” folders end with “_bad”. Also, there are files in all level_B folders.

For example, a good folder can be

./folder_A_1/folder_a_good

and a bad folder can be in the same directory on level A

./folder_A_1/folder_b_bad

Normally, when we want to delete a group of folders that ends with a particular name, we just use

rm -r *_bad

However, this approach does not apply here because I have too many folders on level_A. If I use the rm command, I have to enter and exit all the folders such as

cd folder_A_1

rm -r *_bad

cd ../folder_A_2

rm -r *_bad

...

I have too many folders on level A: folder_A_1 up to folder_A_1000. Hence, I can not use this common approach.

Then I thought of another command I have used: find. That command lists everything in the directory recursively so it is able to target the folders in the subdirectories. That is good. However, how to find, filter and delete? That means, find all the folders recursively, select those paths ending with “_bad”, and delete them. Fortunately, by looking at the manual of find, it turns out that we have a very nice option -delete. Thus we can find all the directories ending with “_bad” using this command

find . -type d -name "*_bad"

-type d requires the target to be a directory. -name “*_bad” specifies the pattern of the name. By this, we find and filtered to target all the directories and subdirectories that ends with “_bad”.

Next, we will make use of the command -delete. 

find . -type d -name "*_bad" -delete

That deleted all the “bad” subdirectories for me and solved my problem. Yay!

I think this can be generalized as a way to simplify processing the system files on terminal. Usually, when it comes to complicated group processing, we have to write bash scripts. However, if we can make good use of the built-in commands, that can save our time.

Copy a folder of files excluding certain files on Mac terminal

Say I want to copy a folder with 1000 files and lots of subfolders, and I need to do this on Mac terminal because it is a part of streamlined process programmed on bash script. Thus I can not just open Finder, select all and copy from there. I can do this very easily with

cp -r source_folder target_path

However, what if I want to exclude some files/subdirectories in this code?

So far, I have put the following requirements:

  • There are lots of files to copy, like 1000.
  • There are still lots of files to exclude, like 50.
  • It must be programmed in bash so no Finder / manual work. One command done

Unfortunately, cp does not offer the function to exclude certain files. Here we need to use rsync.

According to the manual of rsync, it has the following features

  1. It basically copies, like cp, or rcp (remote copy)
  2. It has a lot more options than the basic copy.
  3. It is much faster.

Thus it is good to use rsync. Now let’s go back to the topic: excluding some files or subdirectories during copy. We need to use –exclude option.

We can either put a specific file name or something more general with the asterisk. For example, this one below exclude all hidden files or directories from the copying.

rsync -r source_folder target_path --exclude='.*'

Also, we should notice that, just like cp, copying folders using rsync should put -r. If we only want to exclude a folder called bin, we then should

rsync -r source_folder target_path --exclude='bin'

Excluding multiple elements

If we have more than one element to exclude, we just add more –exclude.

rsync -r source_folder target_path --exclude='bin' --exclude='scripts'

Excluding only hidden folders?

Now that we successfully excluded all hidden files or certain files. What if I still want the files but not the folders? In this case, we use a slash ‘/’ to indicate folders. Like the command below, .*/ means all hidden folders.

rsync -r source_folder target_path --exclude='.*/'

Include back certain files/folders

Say I want to exclude all hidden files but not the hidden folders. However, .*/ means only folders; .* means both files and folders. I did not find anyway to mean only files. Thus, we should use the –include option to include the folders first and then exclude everything, like the command below. To better understand this, according to the manual of rsync, –include tells rsync “not to exclude a pattern”.

rsync -r source_folder target_path --include='.*/' --exclude='.*'

Counting could be recursive

There are lots of algorithm problems that asks for the count of something. They are mostly combinatorics problems. Those problems usually deal with really big numbers and some common strategies apply to a lot of them.

Talking about combinatorics problems, usually what comes up first to my mind is that “I do not know combinatorics; I have not take the course at Carleton.” There is a math course offered once every two years at Carleton called combinatorics. I did not get a chance to take the course. Thus, when the problem asks me to count, I usually start with a slight panic.

However, after I tried a couple of the problems. I realized that the counting problems, although they deal with really large numbers, are not as mathematical as I thought they were. There are some methods that could be used to a lot of the algorithm problems without being an expert on combinatorics.

The most important one I am going to talk about is recursion. The relation of the number of counting with respect to the input could be recursive. It does not mean that the algorithm needs to be implemented with recursion; it means that there might be some useful recursive relation between the output for this problem and the output for a smaller problem. We can take advantage of the recursive relation to implement a dynamic programming algorithm or even a greedy algorithm, which runs much faster than enumerating all possible solutions.

I should start with a problem called k-inverse-pairs array on Leetcode. The problem asks how many arrangements of 1 to n having k inverse pairs. An inverse pair is an occasion that a number comes later is greater than a number comes before it. For example, if we have n = 4, the numbers are 1,2,3,4 and k = 2. We are looking for the number of arrangements of 1,2,3,4 that have 2 inverse pairs. They are

3,1,2,4

2,3,1,4

1,3,4,2

2,1,4,3

1,4,2,3

Thus the answer when n = 4, k = 2 is 5. There are five inverse pairs. Knowing that the point of this post is that counting could be recursive, how to find the recursive relation between the solution for n, k and a set of smaller problems?

First of all, we notice that the max inverse pair is generated when all the numbers are reverted, like when 1,2,3,4 are arranged as 4,3,2,1. There will be n * (n – 1) / 2 inverse pairs generated. Thus, when k is greater than n * (n – 1) / 2, the answer must be zero. When k is equal to n * (n – 1) / 2, the answer must be 1.

Second, we will observe some cases with small n. When n is 1, there is no inverse pair so the answer is zero no matter what. When n is 2, we have 1,2. Thus, when n is 0, we have 1, when n is 1, we have another 1. Those considerations could serve as the base case for our recursive relations.

Third, based on the considerations that we are looking for a recursive relation, we are attacking the main point: the relation between the solution of a general n, k that does not belong to either of the two points above and a smaller problem. We will use ans(n ,k) to denote the generic answer when the input is n and k, where n is greater than 1 and k is less than or equal to n * (n – 1) / 2. We should consider the recursive relation like this: say that we have all the solutions to the all the subproblems with n – 1 elements, What can we do with the last element to make it a solution to the n, k solution?

If we take a solution of n – 1, k. We can just put n at the end of the arrangement to make it a solution of n, k since there are already k inverse pairs in the original array and we append the last element to the end so that no more inverse pair is created. This should allow us to create ans(n – 1, k) solutions. Next, if we take a solution of n – 1, k – 1, we can insert n to the second to the last element so that one more inverse pair is created. That gives us another ans(n – 1, k – 1) solutions. We can generate all the solutions in this manner, following the cases when the last element generates 0, 1, 2… inverse pairs.

However, when we are going down the cases, there are two different restrictions that could end it. First, n can not create more than n – 1 inverse pairs. In other words, it can not be put at a place before the first element. When it is put to the beginning, n – 1 inverse pairs are created. It can not go over that. Second, n can not generate more than k inverse pairs. If n alone generates more inverse pairs than needed for our problem, that is not going to be a solution. We stop whenever either of the two restrictions ends the sum.

By this, we have the recursive solution as such

ans(n , k) = sum(ans(n – 1, i), i, max(0, k – n + 1), k).

It sums ans(n – 1, i) with i going from max(0, k – n + 1) to k. The max of 0 and k – n + 1 reflects the two restrictions we mentioned in the previous paragraph. i is the number of inverse pair of the n – 1 elements. Zero indicates that it can not create less than 0 inverse pairs; in other words, the last element can not create more than k inverse pairs. The k – n  + 1 reflects the restriction that the last element can not create more than n – 1 inverse pairs, in which case the rest of the n – 1 elements will create k – n + 1 inverse pairs.

At this point, we are at a good position in terms of theoretical analysis of the problem. We have a polynomial solution based on a recursive relation. Obviously, we can implement the algorithm with dynamic programming. However, there are two details to note.

  • When implemented in DP, the algorithm could make use of calculate sum, such as ans(n, k – 1), instead of summing from 0 to the max all the time. That reduces the running time from n cube to n square.
  • The problem requires modulo 10^9 + 7, which is a, though reasonable, but tricky requirement. If we do the modulo each time after we calculate the sum, we may risk going over the limit of integers. That would give us negative numbers. Thus, we have to use a long to temporarily store the sum and then do the modulo and convert it back to int.

From this problem, the most important thing we learned is that counting can be done recursively such as with DP, not just with brute-force enumeration. Another similar problem to look at is find the derangement of an array, which reflects exactly the same idea.