Sample twitter data

{
  "_id": "516799596153307136",
  "lang": "en",
  "plt": -5.799,
  "uid": "67763278",
  "tlt": -5.822,
  "cc": "BR",
  "f": "tw201492918305",
  "p": "a4ddc3856053f7e1",
  "flrs": 1014,
  "acr": {
    "$date": 1250900341000
  },
  "t": "@barrosmirella questão de ideias e conceitos. Você se definiu homofóbica nessa frase. Ngm precisa aceitar e/ou apoiar a homossexualidade*",
  "cr": {
    "$date": 1412049600000
  },
  "pln": -35.221,
  "tln": -35.229,
  "flng": 273
}

  • mongoimport --db test --collection tweets_collection --file tweets_collection.json

  • db.all_tweets.ensureIndex({ t: "text" })

  • Append "," to the end of every line: sed 's/$/,/' all_tweets.json > all_tweets1.json

  • Convert the Mongo JSON export to CSV using json_to_csv.py; a few fields still need to be edited out manually (rough sketch below).
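
The exact contents of json_to_csv.py are not reproduced here; a minimal sketch of the idea, assuming one document per line (as bsondump emits) and the field names from the sample document above, might look like this (the real script apparently leaves the raw {u'$date': ...} wrapper in place, hence the sed cleanup in the next step):

# rough sketch, not the actual json_to_csv.py
import csv
import json

with open('tweets_collection.json') as src, open('data.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(['_id', 'cr', 't', 'p', 'f'])
    for line in src:
        doc = json.loads(line)
        writer.writerow([
            doc.get('_id', ''),
            doc.get('cr', {}).get('$date', ''),  # creation time, epoch milliseconds in the dump
            doc.get('t', ''),                    # tweet text
            doc.get('p', ''),                    # place id
            doc.get('f', ''),                    # location flag
        ])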

  • Removing the leftover wrapper around the dates in data.csv: sed "s/^{u'\$date': u'//" data.csv strips the leading {u'$date': u', and sed "s/'}\$//" data.csv strips the trailing '}

  • Make sure you drop the collection first.

  • Use upload.sh to upload data.

cd data/Tagged/
sh ../upload.sh
cd ../Untagged/
sh ../upload.sh
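
upload.sh itself is not shown above; assuming it simply runs mongoimport over every .json file in the current folder (the database and collection names below match the index step that follows, but are still assumptions), a rough Python equivalent would be:

# hypothetical stand-in for upload.sh: import every .json file in the current directory
import glob
import subprocess

for path in sorted(glob.glob('*.json')):
    subprocess.run(['mongoimport', '--db', 'twitter',
                    '--collection', 'tweets_collection', '--file', path], check=True)
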
  • Ensure index
use twitter
db.tweets_collection.ensureIndex({ t: "text" })
exit

290,726 tweets unlocated in total.

  • Dumping data from mongo: mongodump --db test --collection tweets_collection -o /home/kaustubh/

Data stored in .bson format in ~/test/

  • Convert bson to json: bsondump tweets_collection.bson > tweets_collection.json

Stats:

40,524 tweets; 14,299 untagged.

Tweets from 2017-09-12 04:05:05.000Z to 2017-10-13 07:20:43.000Z

Min date query:

db.getCollection('tweets_collection').aggregate(
   [
     {
       $group:
       {
         _id: {},
         minDate: { $min: "$cr" }
       }
     }
   ]
  );

Analysis to do:

  • Day-wise analysis: flood count, dengue count, min, max, average. Run count_tweets.py for untagged and tagged separately: python ../count_tweets.py from inside each folder (rough sketch below).
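
count_tweets.py is not reproduced here; a sketch of the day-wise counting, assuming the folder holds a JSON-lines dump (all_tweets.json is a placeholder name) with cr.$date in epoch milliseconds:

# rough sketch, not the actual count_tweets.py
import json
from collections import Counter
from datetime import datetime, timezone

flood, dengue, per_day = Counter(), Counter(), Counter()
with open('all_tweets.json') as src:             # placeholder input file name
    for line in src:
        doc = json.loads(line)
        ms = doc['cr']['$date']                  # creation time, epoch milliseconds
        day = datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date().isoformat()
        text = doc.get('t', '').lower()
        per_day[day] += 1
        if 'flood' in text:
            flood[day] += 1
        if 'dengue' in text:
            dengue[day] += 1

for day in sorted(per_day):
    print(day, per_day[day], flood[day], dengue[day])
counts = list(per_day.values())
print('min', min(counts), 'max', max(counts), 'avg', sum(counts) / len(counts))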

  • Containing both words: db.getCollection('all_tweets').find({$text: {$search: "\"flood\" \"dengue\""} }).count()

  • Containing flood, not dengue: db.getCollection('all_tweets').find({$text: {$search: "flood -dengue"}}).count()

  • Containing flood or dengue: db.getCollection('all_tweets').find({$text: {$search: "flood,dengue"} }).count() => 265688

  • Locations mentioned and their frequency, for floods and dengue separately [done together for now]

  • Locations: db.getCollection('all_tweets').distinct('loc') => 962

# Redundant
db.all_tweets.aggregate([
    {
        $match: {
            loc: { $not: {$size: 0} }
        }
    },
    { $unwind: "$loc" },
    {
        $group: {
            _id: {$toLower: '$loc'},
            count: { $sum: 1 }
        }
    },
    {
        $match: {
            count: { $gte: 1 }
        }
    },
    { $sort : { count : -1} },
    { $limit : 100 }
]);

(May need to copy-paste the output from the mongo shell rather than Robo 3T.) Saved in location_counts.json; then run location_tabs.py.

db.all_tweets.aggregate(
   {$group : { _id : '$loc', count : {$sum : 1}}}
)

Run distinct_locations.py, then location_tabs.py (rough sketch of the tabulation step below).
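
location_tabs.py is not reproduced here; assuming location_counts.json holds the aggregation output cleaned up into a JSON array of {"_id": location, "count": n} documents, the tabulation step could be as simple as:

# rough sketch, not the actual location_tabs.py
import json

with open('location_counts.json') as src:        # aggregation output, cleaned into valid JSON
    rows = json.load(src)

for row in sorted(rows, key=lambda r: r['count'], reverse=True):
    print(f"{row['_id']}\t{row['count']}")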

  • Location frequency 'wordcloud', urban vs rural, state-wise breakdown, heatmap.

  • All tagged = tags generated by us + actually geotagged. Actually geotagged: 1099

  • db.getCollection('all_tweets').find({"f": {$eq: "1e222211"} }).count() => assigned location

  • db.getCollection('all_tweets').find({"f": {$nin: [ "1e222211", ""] } }).count() => Actual geotagged

  • % of tweets identified; manually check a test case of 1000 tweets => how many were mistakenly tagged?

  • Why are some tweets untagged that should have been tagged?

  • Improve? Try stemming.

  • word2vec for locations?

  • Improve coverage? Precision, recall, F1, ROC, AUC? (rough sketch of scoring the manual check below)
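
For the manual test case, precision/recall/F1 over a hand-labeled sample could be computed along these lines (the labels file name and its column layout are assumptions):

# rough sketch: score the location tagger against a manually checked sample
# sample_labels.csv (hypothetical): tweet_id, tagger_said_location (0/1), truly_has_location (0/1)
import csv

tp = fp = fn = tn = 0
with open('sample_labels.csv') as src:
    for _tweet_id, predicted, actual in csv.reader(src):
        p, a = int(predicted), int(actual)
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1
        elif a:
            fn += 1
        else:
            tn += 1

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print('precision', precision, 'recall', recall, 'F1', f1)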

  • Random sample

db.getCollection('all_tweets_untagged').aggregate(
   [ { $sample: { size: 200 } } ]
)

** Retweets?

** Remove non-English tweets - Spanish, etc.

  • Get wordcloud data, and the frequency of locations that appear at least 3 times: distinct_locations.py (rough sketch below)
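
distinct_locations.py is not shown here; a sketch of counting location mentions and keeping only those seen at least 3 times, assuming the same JSON-lines dump and a loc array field on each tweet:

# rough sketch, not the actual distinct_locations.py
import json
from collections import Counter

counts = Counter()
with open('all_tweets.json') as src:             # placeholder input file name
    for line in src:
        doc = json.loads(line)
        for loc in doc.get('loc', []):           # loc assumed to be a list of location strings
            counts[loc.lower()] += 1

frequent = {loc: n for loc, n in counts.items() if n >= 3}
for loc, n in sorted(frequent.items(), key=lambda kv: kv[1], reverse=True):
    print(loc, n)                                # feed this into the wordcloud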

  • Print only tweet text from mongo: db.Jan22_tweets.find({}, {t: 1, _id:0})

  • Export only tweets: mongoexport -d test -c Jan22_tweets -f t -o tweets_Jan22.txt

  • Sort hashtags by usage: hashtag_counter.py (rough sketch below)
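
hashtag_counter.py is not reproduced here; a sketch that counts hashtags in the exported tweet text, assuming tweets_Jan22.txt is the mongoexport output above (one JSON document with a t field per line):

# rough sketch, not the actual hashtag_counter.py
import json
import re
from collections import Counter

tags = Counter()
with open('tweets_Jan22.txt') as src:
    for line in src:
        doc = json.loads(line)
        tags.update(tag.lower() for tag in re.findall(r'#\w+', doc.get('t', '')))

for tag, n in tags.most_common(50):
    print(tag, n)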

  • Does a location exist? db.getCollection('Jan22_tweets').find({"p": {$exists: true, "$ne": ""}, $text: {$search: "dengue"}})

  • No location: db.getCollection('Jan22_tweets').find({"p": {$eq :""} ,$text: {$search: "dengue"}}).count()

Randomly sampling some tweets into a file

mongo | tee out.txt
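> // raise the shell's default print batch size (20) so the whole 40000-document sample is echoed into out.txt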
> DBQuery.shellBatchSize = 40000
> db.Jan30_tweets.aggregate(
   { $sample: { size: 40000 } }
)