How can I select random files from a directory in bash?


I have a directory with about 2000 files. How can I select a random sample of N files using either a bash script or a list of piped commands?

11 Solutions

#1


Here's a script that uses GNU sort's random option:

ls | sort -R | tail -n "$N" | while IFS= read -r file; do
    # Something involving "$file", or you can leave
    # off the while loop to just get the filenames
    printf '%s\n' "$file"
done

#2


You can use shuf (from the GNU coreutils package) for that. Just feed it a list of file names and ask it to return the first line from a random permutation:

ls dirname | shuf -n 1
# probably faster and more flexible:
find dirname -type f | shuf -n 1
# etc..

Adjust the -n (--head-count=COUNT) value to return the number of lines you want. For example, to return 5 random filenames:

find dirname -type f | shuf -n 5
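If file names may contain newlines, a NUL-delimited variant of the same pipeline is safer. The following is a minimal sketch, assuming GNU find (for -print0) and a GNU shuf that supports -z; the function name sample_files is made up for illustration.

```shell
#!/usr/bin/env bash
# sample_files DIR N: print up to N random regular files found under DIR,
# each terminated by a NUL byte instead of a newline.
# Assumes GNU find (-print0) and GNU shuf (-z); the name is illustrative.
sample_files() {
    local dir=$1 n=$2
    find "$dir" -type f -print0 | shuf -z -n "$n"
}

# Usage: collect the NUL-delimited names into an array.
# picked=()
# while IFS= read -r -d '' f; do picked+=( "$f" ); done < <(sample_files dirname 5)
```

The result can also be piped straight into xargs -0 to run a command on each selected file.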

#3


Here are a few possibilities that don't parse the output of ls and that are 100% safe regarding files with spaces and funny symbols in their name. All of them will populate an array randf with a list of random files. This array is easily printed with printf '%s\n' "${randf[@]}" if needed.

  • This one will possibly output the same file several times, and N needs to be known in advance. Here I chose N=42.

    a=( * )
    randf=( "${a[RANDOM%${#a[@]}]"{1..42}"}" )
    

    This feature is not very well documented.

  • If N is not known in advance, but you really liked the previous possibility, you can use eval. But it's evil, and you must really make sure that N doesn't come directly from user input without being thoroughly checked!

    N=42
    a=( * )
    eval randf=( \"\${a[RANDOM%\${#a[@]}]\"\{1..$N\}\"}\" )
    

    I personally dislike eval and hence this answer!

  • The same using a more straightforward method (a loop):

    N=42
    a=( * )
    randf=()
    for((i=0;i<N;++i)); do
        randf+=( "${a[RANDOM%${#a[@]}]}" )
    done
    
  • If you don't want the same file to appear more than once:

    N=42
    a=( * )
    randf=()
    for((i=0;i<N && ${#a[@]};++i)); do
        ((j=RANDOM%${#a[@]}))
        randf+=( "${a[j]}" )
        a=( "${a[@]:0:j}" "${a[@]:j+1}" )
    done
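The no-duplicates loop above rebuilds the array on every iteration. As an alternative, here is a sketch that uses a partial Fisher-Yates shuffle (a swapped-in technique, not the one from the answer) and validates N before using it. The function name pick_random and its interface are hypothetical. It still relies on $RANDOM, so it inherits the 0..32767 range and a slight modulo bias.

```shell
#!/usr/bin/env bash
# pick_random N [DIR]: fill the global array randf with up to N distinct
# entries of DIR (default: current directory).
# The function name and interface are illustrative, not from the answer.
pick_random() {
    local n=$1 dir=${2:-.} i j tmp len
    local -a a
    # Reject anything that is not a plain non-negative integer.
    if ! [[ $n =~ ^[0-9]+$ ]]; then
        echo "pick_random: N must be a non-negative integer" >&2
        return 1
    fi
    n=$((10#$n))                      # force base 10 so "08" is not read as octal
    a=( "$dir"/* )
    # Without nullglob an unmatched pattern stays literal; treat that as empty.
    [[ -e ${a[0]} || -L ${a[0]} ]] || a=()
    len=${#a[@]}
    if (( n > len )); then n=$len; fi
    randf=()
    for (( i = 0; i < n; i++ )); do
        # Pick j from i..len-1, swap a[i] and a[j], then keep a[i].
        j=$(( i + RANDOM % (len - i) ))
        tmp=${a[i]}; a[i]=${a[j]}; a[j]=$tmp
        randf+=( "${a[i]}" )
    done
    return 0
}
```

After calling pick_random 42, the selection can be printed with printf '%s\n' "${randf[@]}" as before.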
    

Note. This is a late answer to an old post, but the accepted answer links to an external page that shows terrible practice, and the other answer is not much better as it also parses the output of ls. A comment to the accepted answer points to an excellent answer by Lhunath which obviously shows good practice, but doesn't exactly answer the OP.

#4


ls | shuf -n 10 # ten random files

#5


If you have Python installed (works with either Python 2 or Python 3):

To select one file (or line from an arbitrary command), use

ls -1 | python -c "import sys; import random; print(random.choice(sys.stdin.readlines()).rstrip())"

To select N files/lines, use the following (N is the last argument of the command; replace it with a number):

ls -1 | python -c "import sys; import random; print(''.join(random.sample(sys.stdin.readlines(), int(sys.argv[1]))).rstrip())" N

#6


This is an even later response to @gniourf_gniourf's late answer, which I just upvoted because it's by far the best answer, twice over. (Once for avoiding eval and once for safe filename handling.)

But it took me a few minutes to untangle the "not very well documented" feature(s) this answer uses. If your Bash skills are solid enough that you saw immediately how it works, then skip this comment. But I didn't, and having untangled it I think it's worth explaining.

Feature #1 is the shell's own file globbing. a=(*) creates an array, $a, whose members are the files in the current directory. Bash understands all the weirdnesses of filenames, so that list is guaranteed correct, guaranteed escaped, etc. No need to worry about properly parsing textual file names returned by ls.

Feature #2 is Bash parameter expansions for arrays, one nested within another. This starts with ${#ARRAY[@]}, which expands to the length of $ARRAY.

That expansion is then used to subscript the array. The standard way to get a random number between 0 and N-1 is to take a random number modulo N. We want a random index between 0 and the length of our array minus one. Here's the approach, broken into two lines for clarity's sake:

LENGTH=${#a[@]}
RANDOM_FILE=${a[RANDOM%LENGTH]}

But this solution does it in a single line, removing the unnecessary variable assignment.

Feature #3 is Bash brace expansion, although I have to confess I don't entirely understand it. Brace expansion is used, for instance, to generate a list of 25 files named filename1.txt, filename2.txt, etc: echo "filename"{1..25}".txt".

The expression inside the array assignment above, "${a[RANDOM%${#a[@]}]"{1..42}"}", uses that trick to produce 42 separate expansions. The brace expansion places a number between the ] and the }, which at first I thought was subscripting the array, but if so it would be preceded by a colon. (It would also have returned 42 consecutive items from a random spot in the array, which is not at all the same thing as returning 42 random items from the array.) I think it's just making the shell run the expansion 42 times, thereby returning 42 random items from the array. (But if someone can explain it more fully, I'd love to hear it.)

The reason N has to be hardcoded (to 42) is that brace expansion happens before variable expansion.
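The ordering is easy to see in a small experiment. This sketch only uses standard bash behaviour; the variable names literal and expanded are chosen for illustration.

```shell
#!/usr/bin/env bash
N=3
# Brace expansion runs first and sees {1..$N}, which is not a valid range,
# so it is left alone; only afterwards is $N replaced by 3.
literal=$(echo {1..$N})
# With a literal range, brace expansion works as expected.
expanded=$(echo {1..3})
echo "$literal"     # prints {1..3}
echo "$expanded"    # prints 1 2 3
# When N is a variable, a C-style loop avoids both the hardcoding and eval.
for (( i = 1; i <= N; i++ )); do printf '%s ' "$i"; done; echo
```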

Finally, here's Feature #4, if you want to do this recursively for a directory hierarchy:

shopt -s globstar
a=( ** )

This turns on a shell option that causes ** to match recursively. Now your $a array contains every file in the entire hierarchy.
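Note that ** matches directories as well as files, and skips hidden entries unless dotglob is set. If only regular files are wanted, the array can be filtered. A minimal sketch, assuming bash 4.0 or newer (globstar does not exist in older versions); the function name list_files_recursive and the array name all_files are made up for illustration.

```shell
#!/usr/bin/env bash
# list_files_recursive DIR: fill the global array all_files with every
# regular file below DIR. Names are illustrative, not from the answer.
list_files_recursive() {
    local dir=$1 f
    all_files=()
    # Note: this changes both options for the rest of the current shell.
    shopt -s globstar nullglob
    for f in "$dir"/**; do
        # Keep regular files only; directories matched by ** are skipped.
        if [[ -f $f ]]; then
            all_files+=( "$f" )
        fi
    done
    return 0
}
```

The resulting all_files array can then take the place of a in any of the selection snippets.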

#7


A simple solution for selecting 5 random files without parsing ls. It also works with files containing spaces, newlines and other special characters:

shuf -ezn 5 * | xargs -0 -n1 echo

Replace echo with the command you want to execute for your files.

#8


This is the only script I could get to play nicely with bash on macOS. I combined and edited snippets from the following two links:

ls command: how can I get a recursive full-path listing, one line per file?

http://www.linuxquestions.org/questions/linux-general-1/is-there-a-bash-command-for-picking-a-random-file-678687/

#!/bin/bash

# Reads a given directory and picks a random file.

# The directory you want to use. You could use "$1" instead if you
# wanted to parametrize it.
DIR="/path/to/"
# DIR="$1"

# Internal Field Separator set to newline, so file names with
# spaces do not break our script.
IFS='
'

if [[ -d "${DIR}" ]]
then
  # Runs ls on the given dir, and dumps the output into a matrix,
  # it uses the new lines character as a field delimiter, as explained above.
  #  file_matrix=($(ls -LR "${DIR}"))

  file_matrix=($(ls -R "$DIR" | awk '; /:$/&&f{s=$0;f=0}; /:$/&&!f{sub(/:$/,"");s=$0;f=1;next}; NF&&f{ print s"/"$0 }'))
  num_files=${#file_matrix[*]}

  # This is the command you want to run on a random file.
  # Replace "ls -l" with anything you want; it's just an example.
  ls -l "${file_matrix[$((RANDOM%num_files))]}"
fi

exit 0

#9


macOS does not have the sort -R and shuf commands, so I needed a bash-only solution that randomizes all files without duplicates, and I did not find one here. This solution is similar to gniourf_gniourf's solution #4, but hopefully adds better comments.

The script should be easy to modify to stop after N samples, using either a counter with if or gniourf_gniourf's for loop with N. $RANDOM only yields values from 0 to 32767, so this handles at most 32768 files, but that should do for most cases.

#!/bin/bash

array=(*)  # this is the array of files to shuffle
# echo ${array[@]}
for dummy in "${array[@]}"; do  # do loop length(array) times; once for each file
    length=${#array[@]}
    randomi=$(( $RANDOM % $length ))  # select a random index

    filename=${array[$randomi]}
    echo "Processing: '$filename'"  # do something with the file

    unset -v "array[$randomi]"  # remove the element at index $randomi, leaving a gap
    array=("${array[@]}")  # copy the array to reindex it and close the gap
done
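If the $RANDOM limit mentioned above ever matters, two 15-bit draws can be combined into one 30-bit value. This is a sketch under that assumption, not part of the original script; the function name rand_index and the variable rand_idx are hypothetical. A slight modulo bias remains.

```shell
#!/usr/bin/env bash
# rand_index LEN: store a pseudo-random index in 0..LEN-1 in the global
# variable rand_idx. Two 15-bit $RANDOM draws are joined into a 30-bit
# number (0..1073741823) before the modulo is taken.
# Names are illustrative. The result is returned in a variable rather than
# through $(...) so that $RANDOM is always evaluated in the current shell.
rand_index() {
    local len=$1
    rand_idx=$(( ((RANDOM << 15) | RANDOM) % len ))
}

# Usage inside the loop above, in place of the $RANDOM % $length line:
# rand_index "${#array[@]}"; randomi=$rand_idx
```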

#10


I use this: it uses a temporary file and descends into the directory tree until it finds a regular file, which it returns.

# Find a quasi-random file in a directory tree:

# directory to start search from:
ROOT="/";  

tmp=/tmp/mytempfile    
TARGET="$ROOT"
FILE=""; 
n=
r=
while [ -e "$TARGET" ]; do 
    TARGET="$(readlink -f "${TARGET}/$FILE")" ; 
    if [ -d "$TARGET" ]; then
      ls -1 "$TARGET" 2> /dev/null > $tmp || break;
      n=$(cat $tmp | wc -l); 
      if [ $n != 0 ]; then
        FILE=$(shuf -n 1 $tmp)
# or if you don't have/want to use shuf:
#       r=$(($RANDOM % $n)) ; 
#       FILE=$(tail -n +$(( $r + 1 ))  $tmp | head -n 1); 
      fi ; 
    else
      if [ -f "$TARGET"  ] ; then
        rm -f $tmp
        echo "$TARGET"
        break;
      else 
        # is not a regular file, restart:
        TARGET="$ROOT"
        FILE=""
      fi
    fi
done;
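The same random descent can be sketched without a temporary file, shuf, or ls by globbing into an array at each level. This is a simplified variant, not the author's script: the function name random_descend is made up, and it returns a failure status on an empty directory or a non-regular file instead of restarting from the root.

```shell
#!/usr/bin/env bash
# random_descend ROOT: starting at ROOT, repeatedly step into a random
# entry until something that is not a directory is reached; print it if
# it is a regular file. The name and behaviour are illustrative only.
random_descend() {
    local target=$1
    local -a entries
    while [[ -d $target ]]; do
        entries=( "$target"/* )
        # An unmatched glob stays literal, which means the directory is empty.
        [[ -e ${entries[0]} || -L ${entries[0]} ]] || return 1
        target=${entries[RANDOM % ${#entries[@]}]}
    done
    if [[ -f $target ]]; then
        printf '%s\n' "$target"
    else
        return 1
    fi
}
```

Like the original, this is only quasi-random: files in shallow, sparsely populated directories are more likely to be picked than files deep in crowded ones.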

#11


How about a Perl solution slightly doctored from Mr. Kang over here:
How can I shuffle the lines of a text file on the Unix command line or in a shell script?

$ ls | perl -MList::Util=shuffle -e '@lines = shuffle(<>); print @lines[0..4]'

