How to grep a text file that contains some binary data?


grep returns

Binary file test.log matches

For example:

echo    "line1 re \x00\r\nline2\r\nline3 re\r\n" > test.log  # in zsh
echo -e "line1 re \x00\r\nline2\r\nline3 re\r\n" > test.log  # in bash
grep re test.log

I want the result to show line1 and line3 (two lines in total).

Is it possible to use tr to convert the unprintable data into readable data, so that grep works again?

10 answers

#1


53  

You could run the data file through cat -v, e.g.:

$ cat -v tmp/test.log | grep re
line1 re ^@^M
line3 re^M

The output can then be post-processed to remove the junk; this is the approach closest to your idea of using tr for the task.
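
As a rough sketch of that post-processing step, the caret notation that cat -v produces can be stripped with sed. The sample file is recreated with printf for portability; the sed expressions are an assumption about what counts as "junk" here (only ^@ and a trailing ^M), and they would also remove a literal "^@" that happened to appear in the text.

```shell
# Recreate the sample file from the question (NUL byte plus CRLF line endings).
printf 'line1 re \000\r\nline2\r\nline3 re\r\n' > test.log

# cat -v shows NUL as ^@ and CR as ^M; sed then removes those two markers.
cleaned=$(cat -v test.log | grep re | sed -e 's/\^@//g' -e 's/\^M$//')
printf '%s\n' "$cleaned"
```

Note that line1 keeps its trailing space, because only the markers themselves are removed.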

#2


80  

One way is to simply treat binary files as text anyway, with grep --text, but this may well result in binary information being sent to your terminal. That's not a good idea if you're running a terminal that interprets the output stream (such as VT/DEC or many others).

Alternatively, you can send your file through tr with the following command:

tr '\000-\011\013-\037\177-\377' '.' <test.log | grep whatever

This changes every byte below the space character (except newline) and every byte above 126 into a . character, leaving only the printable characters.
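
Applied to the sample file from the question, the idea might look like the sketch below. The set is written without surrounding brackets, since GNU and BSD tr would treat them as literal characters, and LC_ALL=C is an added assumption to keep tr working byte by byte regardless of locale.

```shell
# Recreate the sample file from the question.
printf 'line1 re \000\r\nline2\r\nline3 re\r\n' > test.log

# Replace control bytes (except newline, octal 012) and bytes above 126 with '.'.
masked=$(LC_ALL=C tr '\000-\011\013-\037\177-\377' '.' < test.log | grep re)
printf '%s\n' "$masked"
```

The NUL and carriage-return bytes show up as dots at the end of the matching lines.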


If you want every "illegal" character replaced by a different one, you can use something like the following C program, a classic standard input filter:

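#include <stdio.h>

/* Copy stdin to stdout, replacing non-printable bytes with {{NN}} (hex). */
int main (void) {
    int ch;
    while ((ch = getchar()) != EOF) {
        /* Pass newlines and printable ASCII (space through tilde) unchanged. */
        if ((ch == '\n') || ((ch >= ' ') && (ch <= '~'))) {
            putchar (ch);
        } else {
            printf ("{{%02x}}", ch);
        }
    }
    return 0;
}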

This will give you {{NN}}, where NN is the hex code for the character. You can simply adjust the printf for whatever style of output you want.

You can see that program in action here:

pax$ printf 'Hello,\tBob\nGoodbye, Bob\n' | ./filterProg
Hello,{{09}}Bob
Goodbye, Bob

#3


70  

grep -a

It can't get simpler than that.
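
For the sample file from the question, that could look like the following sketch. The trailing tr -d is an optional extra (not part of the answer) that drops the NUL and carriage-return bytes before they reach the terminal.

```shell
# Recreate the sample file from the question.
printf 'line1 re \000\r\nline2\r\nline3 re\r\n' > test.log

# -a (same as --text) makes grep print matching lines instead of "Binary file ... matches".
matches=$(grep -a re test.log | tr -d '\000\r')
printf '%s\n' "$matches"

# Count the matching lines; line1 and line3 are expected to match.
count=$(grep -a -c re test.log)
printf '%s\n' "$count"
```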

#4


32  

You can use "strings" to extract strings from a binary file, for example:

strings binary.file | grep foo
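
A small sketch using the sample file from the question, assuming a strings implementation with the usual default of reporting runs of at least four printable characters (as in GNU binutils):

```shell
# Recreate the sample file from the question.
printf 'line1 re \000\r\nline2\r\nline3 re\r\n' > test.log

# strings emits each printable run on its own line, so grep sees plain text.
found=$(strings test.log | grep re)
printf '%s\n' "$found"
```

Because strings splits at every non-printable byte, its output lines do not necessarily correspond to the original lines of the file.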

#5


19  

You can force grep to look at binary files with:

grep --binary-files=text

You might also want to add -o (--only-matching) so that only the matched text is printed, rather than whole lines of binary gibberish that will bork your terminal.
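
A minimal illustration with the sample file from the question; the pattern re is taken from the question, and the rest is a sketch rather than part of the answer:

```shell
# Recreate the sample file from the question.
printf 'line1 re \000\r\nline2\r\nline3 re\r\n' > test.log

# -o prints only the matched text, one match per line, so no raw binary bytes are emitted.
only=$(grep --binary-files=text -o 're' test.log)
printf '%s\n' "$only"
```

With a more specific pattern (for example one that captures a whole field), -o gives you just that field from each matching line.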

#6


11  

Starting with Grep 2.21, binary files are treated differently:

When searching binary data, grep now may treat non-text bytes as line terminators. This can boost performance significantly.

So what happens now is that with binary data, all non-text bytes (including newlines) are treated as line terminators. If you want to change this behavior, you can:

  • Use --text, which ensures that only newlines are line terminators.

  • Use --null-data, which ensures that only NUL bytes are line terminators.
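
To illustrate the second option, here is a sketch assuming GNU grep, where -z is the short form of --null-data. The file name records.bin and its contents are made up for the example.

```shell
# Three NUL-terminated records; the second one contains an embedded newline.
printf 'alpha\000beta\ngamma\000delta\000' > records.bin

# With -z, grep splits input on NUL and also ends each output record with NUL,
# so tr turns those terminators into '|' to make them visible.
hit=$(grep -z 'gamma' records.bin | tr '\000' '|')
printf '%s\n' "$hit"
```

The whole record "beta", newline, "gamma" is printed, because the newline no longer ends a line.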

#7


3  

As James Selvakumar already said, grep -a does the trick. -a or --text forces grep to handle the input stream as text. See the man page: http://unixhelp.ed.ac.uk/CGI/man-cgi?grep

Try:

cat test.log | grep -a somestring

#8


2  

You can do:

strings test.log | grep -i re

This extracts the readable strings from the file and passes them to grep.

#9


0  

You can also try the Word Extractor tool. Word Extractor can be used with any file on your computer to separate the strings that contain human-readable text / words from binary code (exe applications, DLLs).

#10


0  

grep -a forces grep to search and print matches from a file that it considers binary:

grep -a re test.log

