Question

I have a client who puts a separate vCard on each page of their site. These are pasted into a WordPress text field. (Not the most efficient way to maintain a list of people, but I won't editorialize after the fact.) My mission is to write something that parses all the addresses in the vCards and dumps the information into a central database. That would let all the disparate pages become addresses, replete with lat and lng coordinates from Google, and display a lovely front page with pins galore.
This page would show all the vCards from the rest of the pages of the site.

Oh, here is a sanitized example of a vCard on the site, which in reality will be surrounded by a lot of suspect HTML code:

<div class="vcard">
<span class="fn org">XYZ Org Name</span><br />
<span class="url">http://www.someurl.com/</span>
<div class="adr"><span class="street-address">1234 Main Ave</span><br />
<span class="locality">Chicago</span><br />
<span class="region">IL</span><br /><span class="postal-code">60647</span></div>
</div>

Now, each page has one of these, and spidering through the entire site to collect them into an array is a bit out of my league. I can handle dumping them into a database using PHP and MySQL.
Any and all advice would be welcome!
EDIT: Not sure how important this is, but I am pulling the data from a different server.

Answer 1

I believe you are looking for HTML parsers. There is an HTML parsing module for Python.

You need to parse the relevant data out of all the HTML files and then decide what to do with it.
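
As a rough illustration of that parsing step, here is a minimal Python sketch using the standard library's `html.parser` module. The names `VCardParser`, `parse_vcards`, and `FIELDS` are made up for this example; the class names listed in `FIELDS` come from the sample vCard in the question. It assumes each field sits in its own `span`, as in that sample, and it has not been tried against the real pages.

```python
from html.parser import HTMLParser

# Class names taken from the sample vCard in the question.
FIELDS = {"fn", "url", "street-address", "locality", "region", "postal-code"}


class VCardParser(HTMLParser):
    """Hypothetical helper: collects one dict per <div class="vcard"> block."""

    def __init__(self):
        super().__init__()
        self.cards = []      # finished vCards
        self._card = None    # vCard currently being read, if any
        self._depth = 0      # <div> nesting depth inside the current vCard
        self._field = None   # field whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self._card is not None:
                self._depth += 1          # nested div such as class="adr"
            elif "vcard" in classes:
                self._card, self._depth = {}, 1
        elif tag == "span" and self._card is not None:
            # "fn org" carries two classes; keep the first one we recognise.
            self._field = next((c for c in classes if c in FIELDS), None)

    def handle_data(self, data):
        if self._card is not None and self._field:
            self._card[self._field] = self._card.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            if self._card is not None and self._field:
                value = self._card.get(self._field, "")
                self._card[self._field] = value.strip()
            self._field = None
        elif tag == "div" and self._card is not None:
            self._depth -= 1
            if self._depth == 0:          # closing tag of the vcard div itself
                self.cards.append(self._card)
                self._card = None


def parse_vcards(html):
    """Return a list of dicts, one per vCard found in the HTML string."""
    parser = VCardParser()
    parser.feed(html)
    parser.close()
    return parser.cards
```

Fetching each page from the other server and writing the rows into MySQL are left out here; `parse_vcards` only turns one page's HTML into a list of dicts.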

I haven't tried any PHP HTML parsers, so I can't recommend one, but since you are running on a web server I hope you have Perl? Take a look at the Perl HTML parsers.

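# This snippet collects the content of the organization name

 my (@org_names, $in_org);

 sub start {
     my ($self, $tag, $attr, $attrseq, $origtext) = @_;
     # $origtext holds only the tag itself, so just note that we are inside <span class="fn org">
     $in_org = 1 if $tag =~ /^span$/i && ($attr->{'class'} || '') =~ /^fn org$/i;
 }

 sub text {    # called for the text between tags
     my ($self, $text) = @_;
     push(@org_names, $text) if $in_org;
 }

 sub end {
     my ($self, $tag) = @_;
     $in_org = 0 if $tag =~ /^span$/i;
 }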

Now you have the @org_names array containing all the organization names.

Answer 2

Try the DOMDocument class's loadHTML method. Then you can use the DOMDocument methods to select the nodes, attributes, and values you want. Or, if you are familiar with XPath, you can also instantiate a DOMXPath object and query the loaded DOMDocument with it to select the desired data.
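
To show what XPath-style selection looks like on the sample vCard, here is a small sketch in Python rather than PHP, using the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The names `SAMPLE`, `QUERIES`, and `select_vcard_fields` are made up for this example. One loud assumption: ElementTree only accepts well-formed XML, so this works on the sanitized fragment from the question but would likely fail on the messy surrounding HTML, which is where a lenient HTML loader such as loadHTML is the better fit.

```python
import xml.etree.ElementTree as ET

# The sanitized vCard from the question; it happens to be well-formed XML.
SAMPLE = """<div class="vcard">
<span class="fn org">XYZ Org Name</span><br />
<span class="url">http://www.someurl.com/</span>
<div class="adr"><span class="street-address">1234 Main Ave</span><br />
<span class="locality">Chicago</span><br />
<span class="region">IL</span><br /><span class="postal-code">60647</span></div>
</div>"""

# One XPath-style query per field (exact match on the class attribute).
QUERIES = {
    "name": ".//span[@class='fn org']",
    "url": ".//span[@class='url']",
    "street": ".//span[@class='street-address']",
    "city": ".//span[@class='locality']",
    "state": ".//span[@class='region']",
    "zip": ".//span[@class='postal-code']",
}


def select_vcard_fields(fragment):
    """Run each query against the fragment and return the matched text."""
    root = ET.fromstring(fragment)
    result = {}
    for name, path in QUERIES.items():
        node = root.find(path)          # first matching element, or None
        if node is not None and node.text:
            result[name] = node.text.strip()
    return result


if __name__ == "__main__":
    print(select_vcard_fields(SAMPLE))
```

The resulting dict maps directly onto the columns of a database row, one row per vCard.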

Parsing vCards on web pages into a MySQL DB
