Question

Looking for help with the following:

I have a directory full of text files that are named with a numerical ID. Each text file contains the body of a news article. Some news articles are split into several parts, so they are spread across several text files.

The names look like this:

1001_1.txt, 1001_2.txt   (these files contain two different parts of the same article)
1002_1.txt
1003_1.txt
1004_1.txt, 1004_2.txt, 1004_3.txt, 1004_4.txt   (these files contain four different parts of the same article; the parts go up to a maximum of 4)

and so on.

Basically, I need a script (PHP, Perl, Ruby or otherwise) that would simply put the name of the text file (the part before the underscore) in one column, the number after the underscore (if there is one) in a second column, and the content of the text file in a third column.

So it would have a table structure that looks like this:

    1001 | 1 | content of the text file
    1001 | 2 | content of the text file
    1002 | 1 | content of the text file
    1003 | 1 | content of the text file
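
As a rough illustration of the first step, splitting a name such as 1001_2.txt into the ID and the part number could be sketched like this in Python. The helper name `parse_name` and the exact pattern are assumptions made for this example; the part suffix is treated as optional.

```python
import re

# Assumed naming scheme: digits, an optional "_<part>", then ".txt".
NAME_RE = re.compile(r"^(\d+)(?:_(\d+))?\.txt$")


def parse_name(filename):
    """Return (article_id, part) for a matching name, or None otherwise.

    article_id stays a string; part is an int, or None when the name
    has no underscore suffix.
    """
    match = NAME_RE.match(filename)
    if match is None:
        return None
    article_id, part = match.groups()
    return (article_id, int(part) if part is not None else None)
```

Names that do not fit the pattern return None, so stray files in the directory can simply be skipped.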

Any help on how I can accomplish this would be appreciated.

There are about 7000 text files that need to be read and imported into a table for future use in a database.
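
Since the goal is a database import, one possible route is to write the rows as CSV rather than as a pipe-separated listing, because most databases can load CSV and the quoting copes with newlines or pipes inside an article. The following is only a sketch under a few assumptions: the function names `collect_rows` and `write_csv` and the column headers are made up for this example, the files are assumed to be UTF-8, and every file is assumed to have a `_N` suffix.

```python
import csv
import re
from pathlib import Path

# Assumed naming scheme: <numeric id>_<part>.txt
NAME_RE = re.compile(r"^(\d+)_(\d+)\.txt$")


def collect_rows(directory):
    """Return sorted (article_id, part, content) tuples for matching files."""
    rows = []
    for path in Path(directory).iterdir():
        match = NAME_RE.match(path.name)
        if match and path.is_file():
            # UTF-8 is an assumption; adjust to the real encoding.
            content = path.read_text(encoding="utf-8")
            rows.append((int(match.group(1)), int(match.group(2)), content))
    return sorted(rows)


def write_csv(rows, handle):
    """Write a header line and one CSV row per text file to an open handle."""
    writer = csv.writer(handle)
    writer.writerow(["article_id", "part", "content"])
    writer.writerows(rows)
```

When writing to a real file, open it with `newline=""`, as the `csv` module documentation recommends, for example `open("articles.csv", "w", newline="", encoding="utf-8")`.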

It would be even better if the content of the _1, _2, etc. files could be separated into different columns, e.g.:

    1001 | 1 | content | 2 | content | 3 | content | 4 | content
    1002 | 1 | content
    1003 | 1 | content

(Like I said, the file names go up to a maximum of _4, so you could have 1001_1.txt, 1001_2.txt, 1001_3.txt and 1001_4.txt, or only 1002_1.txt and 1003_1.txt.)
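
The one-row-per-article layout described above amounts to grouping the parts by ID and emitting them in order. A minimal Python sketch, assuming the data is already available as (article_id, part, content) tuples; the function name `to_wide` and the `max_parts` parameter are hypothetical names chosen for this example:

```python
from collections import defaultdict


def to_wide(rows, max_parts=4):
    """Group (article_id, part, content) tuples into one line per article."""
    parts_by_id = defaultdict(dict)
    for article_id, part, content in rows:
        parts_by_id[article_id][part] = content

    lines = []
    for article_id in sorted(parts_by_id):
        parts = parts_by_id[article_id]
        cells = [str(article_id)]
        # Only parts that actually exist are emitted, in ascending order.
        for part in range(1, max_parts + 1):
            if part in parts:
                cells += [str(part), parts[part]]
        lines.append(" | ".join(cells))
    return lines
```

Note that a missing part shifts the later columns to the left; if fixed column positions matter for the import, empty cells would have to be emitted instead.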

Answer 1

This is fairly simple with File::Find and File::Slurp:

#!/usr/bin/perl

use strict;
use warnings;

use File::Find;
use File::Slurp;

die "Need somewhere to start\n" unless @ARGV;

my %files;
find(\&wanted, @ARGV);

for my $name (sort keys %files) {
    my $file = $files{$name};
    print join( ' | ', $name,
        map { exists $file->{$_} ? ($_, $file->{$_}) : () } 1 .. 4
    ), "\n";
}

sub wanted {
    my $file = $File::Find::name;
    return unless -f $file;
    return unless $file =~ /([0-9]{4})_([1-4])\.txt$/;
    # I do not know what you want to do with newlines, so each one is
    # replaced by a literal backslash-n (single quotes do not interpolate)
    $files{$1}->{$2} = join('\n', map { chomp; $_ } read_file $file);
    return;
}

Output:

1001 | 1 | lsdkjv\nsdfljk\nsdklfjlksjadf\nlsdjflkjdsf | 3 | sadlfkjldskfj
1002 | 1 | ldskfjsdlfjkl
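
The \n sequences in the output above are literal backslash-n characters, because the join separator in the script is single-quoted. For readers following along in another language, a roughly equivalent step in Python might look like the sketch below. The function name `flatten_newlines` is made up for this example, and `splitlines` recognises a few more line terminators than Perl's `chomp` does by default.

```python
def flatten_newlines(text):
    """Join the lines of text with a literal backslash-n, keeping one row per file."""
    # "\\n" is two characters: a backslash followed by the letter n.
    return "\\n".join(text.splitlines())
```

Keeping the marker makes the step reversible later with `flattened.replace("\\n", "\n")`, as long as the articles themselves never contain a literal backslash-n.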

Answer 2

Probably not the best, but it could be a starting point (commented on purpose):

#!/usr/bin/perl

use strict;
use warnings;

# results hash
my %res = ();

# for each .txt file in the current directory
for (glob '*.txt') {
    s/\.txt$//; # replace suffix .txt by nothing
    my $t = ''; # buffer for the file contents
    my($f, $n) = split '_'; # cut the file name ex. 1001_1 => 1001 and 1

    # read the file contents
    {
        local $/; # slurp mode
        open(my $F, '<', "$_.txt") or die "Cannot open $_.txt: $!"; # open the txt file read-only
        $t = <$F>; # get contents
        close($F); # close the text file
    }

    # replace each \r, \n and \t with a single space
    $t =~ s/[\r\n\t]/ /g;
    # appends for example 1001 | 2 | contents of 1001_2.txt to the results hash
    $res{$f} .= "$f | $n | $t | ";
}

# print the results
for (sort { $a <=> $b } keys %res) {
    # remove the trailing ' | '
    $res{$_} =~ s/\s\|\s$//;
    # print
    print $res{$_} . "\n";
}

# happy ending
exit 0;
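
One thing to be aware of with the substitution above is that it replaces each character individually, so a Windows line ending (\r\n) turns into two spaces. If single spaces are preferred, a slightly different normalisation could be used. This Python sketch is not what the script above does: it also collapses runs of ordinary spaces and trims both ends, and the name `collapse_whitespace` is made up for this example.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    # str.split() with no argument splits on any whitespace run.
    return " ".join(text.split())
```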

Reading the Content of Multiple Text Files
